December 16, 2016

The Hidden Costs of Managing Git

Git at Scale

The Price of Free

The best thing about open source software is arguably freedom. Its proponents often break this down into two senses: (1) free as in beer, and (2) free as in speech. Git is a classic example because it’s free in both senses: it costs nothing for you to download and use as you wish, and you have the right to reuse and rework the source code as well.[1]

But here’s the catch: freedom isn’t free. We hear that phrase most often in the political arena, recognizing the (sometimes ultimate) price a few have paid throughout history to guarantee liberty for many. It may come as a surprise, but the phrase applies equally well to much open source software, and to Git in particular.

The purpose of this article is to bring some of Git’s hidden costs into the light, so they can be counted before adoption rather than turning up as one nasty surprise after another over time. At the end of the day, Git remains a good tool for a variety of jobs, but it’s not the right tool for every job. It’s all too easy to miss that distinction and fail to count Git’s hidden costs until they’ve grown so painful that switching is as essential as it is difficult.

Costs of Scale

There are three senses in which computing typically needs to scale: vertically, horizontally, and geographically. A detailed discussion is well beyond our scope, but we may summarize briefly. To wit, vertical scaling concerns wringing the most from existing assets, horizontal scaling focuses on spreading work across more assets, and geographical scaling refers to that special set of headaches involved in uniting resources and personnel scattered around the globe.

Unfortunately, Git wasn’t designed with any of these in mind. Native Git does little to maximize use of computing hardware, so throwing additional processor cores or memory into the mix isn’t going to help you scale vertically. Most Git hosting solutions attack this by layering a web-based front end on top of native Git, letting the asynchronous, multi-request nature of web applications address the problem. This works only as well as the underlying file system can deliver, given that Git’s design relies entirely upon it.

Moreover, Git’s unique, inventive storage model (filing assets by their SHA-1 hash values), while great for deduplicating storage, leaves much to be desired as repositories grow. A variety of Git commands require computing (and sometimes re-computing many) hash values, a process that only gets slower as content accumulates. The problem is compounded, pathologically in some cases, when working with large binary files.
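You can see this content-addressed model firsthand from any shell. A minimal illustration (the sample content is arbitrary):

    $ echo 'hello' | git hash-object --stdin
    ce013625030ba8dba906f756967f9e9ca394464a

    $ git count-objects -vH    # reports object counts and pack sizes for the current repo

The first command shows Git deriving an object’s identity purely from its content, which is what makes deduplication automatic; the second is a quick way to watch a repository’s object store grow over time.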

The Git community has responded to this largely through a series of extensions and workflow changes aimed at keeping big files outside the version control repositories themselves, integrating them into local working folders as needed. Tools like git-annex and Git LFS are designed to simplify this process, but they add another tool to the mix and divide your content across systems, at odds with the Agile ideal of a “single source of truth”.
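For the curious, a minimal Git LFS setup looks something like the following; the file pattern and names are illustrative only:

    $ git lfs install              # one-time setup per user account
    $ git lfs track "*.psd"        # route matching files through LFS
    $ git add .gitattributes       # the tracking rules are themselves versioned
    $ git add scene.psd
    $ git commit -m "Add artwork via LFS"

Only a small pointer file lands in the repository itself; the binary content lives in a separate LFS store, which is precisely the divided-content trade-off described above.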

Further, Git hosting services typically charge additional fees for the storage and bandwidth these extensions consume. You can always try to shoulder the burden of hosting such extensions yourself, but either way it’s another hidden cost of Git adoption for anyone working with large files.

Worse, the fundamental problem remains unaddressed. Even when large binaries are kept outside your repositories, those repositories still tend to grow until they must be split into many more. The practice is so common it has birthed the term “Git sprawl”. In effect, Git forces you to scale repositories out horizontally just to maintain acceptable performance.

This is another source of significant hidden costs in terms of efficiency and productivity, particularly when working with continuous integration systems. It places burdens on DevOps teams to unify multiple repositories at build time and can entail the cost of additional servers sufficient to handle all the concurrent pull requests. We’ve seen customers with terabytes of content avoid Git for exactly this reason: when a single terabyte of content might require as many as one thousand Git repositories to maintain acceptable performance, who wants to manage all that?!
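To make that burden concrete, one common (if imperfect) way teams stitch sprawling repositories back together is with Git submodules, pinning each child repository to a specific commit. The URLs and paths below are hypothetical:

    $ git submodule add https://example.com/engine.git libs/engine
    $ git submodule add https://example.com/assets.git content/assets
    $ git commit -m "Wire child repos into the superproject"

    # Every consumer must then remember the extra step:
    $ git clone --recursive https://example.com/superproject.git

Each added submodule is one more moving part for CI to fetch, pin, and keep in sync.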

In terms of scaling geographically, the good news is that Git’s approach is rather efficient, both in terms of calculating what to transfer and its actual network protocol for shuttling bytes around. The bad news, however, is that it doesn’t offer any good way of unifying multiple teams all working concurrently on multiple repositories. It’s a dilemma: hosting them in a single location punishes anyone elsewhere, both through latency and possibly low-bandwidth or unreliable WAN links, while hosting them in multiple locations greatly complicates the process of putting it all back together and running unified builds.

I’ve written on this topic before; the short summary is that you’re going to be hit with three more hidden costs: (1) the non-trivial burden of replicating/mirroring content, (2) having to adopt and strictly maintain a clear workflow and set of branching practices, and (3) developing a plan, often largely on your own, to address HA/DR concerns. It shouldn’t be hard to imagine how keeping large files outside the repository only complicates matters, layering yet more hidden costs onto each of those three concerns.
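To give cost (1) some shape, the basic mechanics of a read-only mirror are simple enough; keeping it current, monitored, and consistent across sites is the part you own. The hostnames here are hypothetical:

    $ git clone --mirror https://primary.example.com/project.git    # full copy, all refs
    $ cd project.git
    $ git remote update --prune     # re-run on a schedule to stay in sync

    $ git push --mirror https://dr-site.example.com/project.git     # seed a DR copy

Multiply that by every repository in your sprawl, and by every site, and the replication burden becomes clear.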

Costs of Use

Let’s turn our attention away from the mechanical costs of scaling and managing content and consider instead the human costs. Another oft-overlooked hidden cost of adopting and using Git is its learning curve. One might take Git’s broad adoption to indicate that it’s trivially simple to pick up, but in fact the opposite is the case. A recent survey by GitLab found that developers generally consider Git’s learning curve difficult, with no fewer than 40% naming it their greatest challenge in adoption.[2]

That survey’s audience should not be overlooked either; it’s particularly telling. After all, if developers, people who are by necessity highly technical and accustomed to mastering complex new tools, find Git challenging, how well can you expect your less technically inclined personnel to fare? Uniting contributors from multiple disciplines, such as artists, animators, and writers, is a common challenge in the DevOps era, and Git’s learning curve doesn’t make it easy.

Quite the contrary: Git is not a novice-friendly versioning tool. It offers advanced, sharp-edged features and often demands deep knowledge of its data model to avoid problems. A rebase accident, for example, can destroy work on a branch. And while the major Git hosting services can now block forced pushes from disrupting teams on shared branches, that only protects other team members from inconvenience; the user who made the mistake must still walk an ugly path to repair the local repository.[3]
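To illustrate that ugly path: recovering from a botched rebase typically means consulting the reflog and resetting the branch by hand. The reflog index below is illustrative:

    $ git reflog                     # find where the branch pointed before the rebase
    $ git reset --hard 'HEAD@{2}'    # rewind the branch to that commit

    # Or, immediately after the bad rebase:
    $ git reset --hard ORIG_HEAD     # Git saves the pre-rebase tip here

Simple once you know it, but as note [3] below observes, knowing it is precisely the problem.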

Thankfully, there are a variety of ways to mitigate Git’s learning curve, not least of which is by using one of the available graphical user interfaces (GUIs). Git includes a couple of graphical tools by default, but many users prefer third-party offerings.[4] It’s important to assess your non-technical contributors’ skills and match them with appropriate tools, or ensure that simple plugins are available for the applications they use every day.

This process of selecting the right interfaces for all your contributors, tools that enable their preferred workflows without overwhelming their technical chops, is another commonly overlooked cost of Git adoption. The best and simplest tools are rarely free, and those costs add up quickly on a per-user, per-year basis.

And none of this addresses the most basic security considerations. Git was designed to deliver repos as an all-in-one proposition: everyone gets every file and folder and can do anything the file system allows. But do you really want everyone having access to every file and folder of your intellectual property (IP)? The reality for many organizations and projects is that Git’s native approach is simply too naïve.

Git hosting services often make it possible to “secure” entire repos via roles, and even to protect specific branches by allowing pushes only from certain groups. But few offer the kind of down-to-the-file, granular permissions that centralized versioning systems have traditionally provided. Anyone with access to a repo typically has access to everything in it, with the result that restricting IP means partitioning it into yet more repos, contributing further to Git sprawl.
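Under the hood, branch protection typically reduces to a server-side hook of roughly the following shape. This is an illustrative sketch only: the hardcoded user list belongs nowhere near production, and how the pushing user is identified varies by server ($USER here is an assumption):

    #!/bin/sh
    # pre-receive: reject pushes to the release branch from anyone not in ALLOWED
    ALLOWED="alice bob"
    while read oldrev newrev refname; do
        if [ "$refname" = "refs/heads/release" ]; then
            case " $ALLOWED " in
                *" $USER "*) : ;;    # permitted pusher, do nothing
                *) echo "push to refs/heads/release denied" >&2; exit 1 ;;
            esac
        fi
    done

Note what even this can’t do: it restricts writes only. Nothing of the sort can hide files from a reader, because once a repo is cloned, everything in it is visible.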

This is another hidden cost, and an increasingly important one in an age when outsourcing has become the norm. Juggling the need to restrict proprietary content while still allowing outsourced teams and consultants to contribute can significantly impact efficiency, and it figures prominently in the prior discussion of geographic scaling.

Conclusion

Git is free, both as in beer and as in speech, but adopting it and using it long term clearly is not. Git’s costs tend only to grow over time, and in multiple directions. To be fair, Git is a great tool for small teams working on small projects. But when those teams are successful, growing suddenly into big teams working on big projects, they’re often blindsided by the hidden costs we’ve enumerated, and by then it’s often both difficult and still more expensive to switch to another, more capable versioning system.

In contrast, Perforce Helix is the one system you won’t outgrow. It doesn’t choke on large files or many files, offers a wide variety of clients and plugins to unite all your contributors, provides far simpler data and workflow models, secures your content right down to the file via a flexible set of permissions, and won’t ever punish your success with a pile of hidden costs. Helix also offers more advanced distributed (i.e., DVCS) features than Git and integrates with a broad range of industry-standard tools right out of the box. Even better, it’s free (as in beer) for small teams and offers an easy trial, including support for those developers who still prefer Git.

Whatever versioning system you choose is going to cost you in one currency or another; there’s no getting around that. The only question is whether you pay simply, up front, or pay later in frustration, lost productivity, monetary costs you didn’t anticipate, possibly the time and effort of switching systems altogether, and perhaps even the complications and losses incurred when your content walks out the door without your permission. Why not give Helix a trial today and avoid all of that?

NOTES

[1] For more precise details, see the GNU General Public License version 2.

[2] For details and the survey document, see GitLab’s 2016 Global Developer Report.

[3] Those who have already shot themselves in the foot a few times will likely think of resetting the branch “pointer” to the head commit prior to the bad rebase (found by checking the reflog), but the level of understanding required to fix such a simple mistake only highlights the aforementioned learning curve.

[4] An incomplete but useful list may be found here.