February 11, 2014

How Git Could Grow into an Enterprise SCM System

Git at Scale

Git is awesome for the type of problem it's good for: relatively small groups of people working on human-scale projects made up of text, code, and other small, easily mergeable files, with no fine-grained read access control. If you want to start an open-source project, or a company-internal "two-pizza team" that works like an open-source project, then Git is a good choice.

However, it's sometimes fun to think about what it would take to extend Git to the point where you could use it as the version management system for a large organization, full of people who don't know or trust each other and need to work on large numbers of files, including some files that are huge, can never be merged, or both. Could it be done? Maybe. And you could learn a lot about Computer Science and Software Engineering by doing it. There's an optimistic page from 2012 that lists some of this work as ideas for Google Summer of Code, but it looks like much more than a summer's worth of work.

  1. Large file support: Git doesn't handle large blobs well, but fortunately this isn't a deal-breaker if you can add a "multi-blob file" object type, so that one file does not necessarily equal one blob. You have to pull off an elegant piece of programming, though: a consistent, reasonably high-performance way to turn a single huge file into multiple blobs of Git-friendly size. The closest thing I know of is the hashsplitting in bup. You read through the file, keeping a rolling hash as you go and splitting off a blob whenever a certain number of bits of the hash are all 1s. Because each split point depends only on nearby content, a small edit changes only the blobs around it, and the rest are stored once. For best results, you'd want to try your implementation of hashsplitting on a bunch of real-world file formats. For some file formats, you'll probably get substantial storage savings compared to keeping the entire file in one blob.

    The related problem is indexing all of those blobs. Git packfile indexes (.idx files) are elegant and fast for the repository sizes in common use, but you'd need to extend them to support repositories with extreme numbers of blobs. You should be able to find the right index structure somewhere in the CS literature or borrow it from some other open-source project that has to index things. Although this would be a substantial amount of work, as long as you make a sincere effort and get something basically working, there are plenty of experts in the Git scene who would be likely to point out where you're going wrong.

  2. Network object storage: You can break up large blobs any way you want, but there's still only so much space on the hard drive. Right now, Git can obtain objects from three places: loose objects, packfiles, and "alternate" repositories stored locally. To make a repository bigger than your least-well-equipped contributor's free disk space, you would need to add the ability to get and put objects over the network. Because of how Git names its objects, this could be done with a robust, scalable distributed hash table if you aren't concerned about read access control.

    Even though DHTs are conceptually simple, making one work at scale is an interesting problem. It's worth it for the simplicity and resilience, but there's still quite a bit of challenging and rewarding programming here. You should also be on good terms with whoever takes on the problem of next-generation packfile indexes, since these two features will have to work together.

  3. Read access control: This is impossible for pure Git, but you could do it in an in-house deployment that enforces use of an access-controlled network object store. You would have to replace the simple network object storage from step 2 with something that can do lots of permission checks on objects, quickly.

    Although access control on network objects should meet the requirement, for extra security you can extend the system to check for local copies of objects that the user obtained somehow but is now forbidden to read. An extra win from both a security and a usability point of view is an antivirus-like tool that polices the local hard drive for stray copies of permission-denied or already-checked-in work, and deletes them, reports them, or both.

    This item is more Software Engineering than Computer Science; you probably won't get a publication by grinding this out.

  4. Partial checkouts: Git doesn't really have the concept of client workspaces that are subsets of the full repository. A repository is either "bare," with no working directory at all, or non-bare, with a working directory that by default holds everything. (Sparse checkout can keep paths out of the working directory, but the clone still has to hold every object.) For now, this puts a practical limit on repository size. You could deal with partial checkout of a huge repository in several ways.

    • Actually introduce Perforce-like path mapping. When users commit, Git would use the tree objects at HEAD for everything that's not mapped into the working copy. Only the tree(s) that are actually mapped to the working directory, their children, and their parents up to the repository root would have to be written.
    • Mount the repository with FUSE; this would mean that users would have to run a local daemon, but they would not have to walk the filesystem for changed files, which gets slow.

    Again, more Software Engineering than Computer Science.

  5. (Not an issue: high availability (HA) and scaling to large numbers of users and sites. Git's design means you can do web-scale deployments with a few hundred lines of code to talk to an HA key-value store (HA-KVS) such as ZooKeeper or etcd. I have implemented this as a set of hooks, and some companies are making an actual product out of Git HA, with a management interface and everything. But the HA-KVSs already solve the hard parts of the problem. Git is extraordinarily friendly to large-scale replication. Other server-side tools, such as bug trackers that use hard-to-replicate relational databases, will be problematic to scale to large numbers of users and sites long before Git does.)

  6. File locking: I'm putting this after number 5 because you could use the same HA-KVS that you use for replication to also keep track of locks on files. Of course you want to work in file formats that are mergeable where you can, but sometimes a designer just needs to lock everyone out of an Adobe Photoshop file, and that's that.
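To make item 6 a little more concrete, here is a minimal sketch of file locks kept in a key-value store. Everything in it is an assumption for illustration: `InMemoryKVS` is a stand-in that models only the two atomic operations the idea needs (create-if-absent and compare-and-delete), not the real ZooKeeper or etcd client APIs, and the `locks/` key layout, `acquire`, `release`, and `rejected_paths` are hypothetical names.

```python
import threading


class InMemoryKVS:
    """Stand-in for an HA key-value store (hypothetical, single process only)."""

    def __init__(self):
        self._data = {}
        self._mutex = threading.Lock()

    def put_if_absent(self, key, value):
        """Atomically create key; return False if it already exists."""
        with self._mutex:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def delete_if_equals(self, key, value):
        """Atomically delete key only if it currently holds value."""
        with self._mutex:
            if self._data.get(key) != value:
                return False
            del self._data[key]
            return True

    def get(self, key):
        with self._mutex:
            return self._data.get(key)


def lock_key(path):
    # Assumed key layout: one key per locked path.
    return "locks/" + path


def acquire(kvs, path, user):
    """Take the lock on path for user; fails if anyone already holds it."""
    return kvs.put_if_absent(lock_key(path), user)


def release(kvs, path, user):
    """Drop the lock, but only if user is the current holder."""
    return kvs.delete_if_equals(lock_key(path), user)


def rejected_paths(kvs, changed_paths, user):
    """Paths in a push that someone else has locked.

    A server-side hook could refuse the push when this list is non-empty.
    """
    return [p for p in changed_paths
            if kvs.get(lock_key(p)) not in (None, user)]
```

With a real store, the important property is the same one the stand-in fakes with a mutex: two designers racing for the same file must not both see `acquire` succeed.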

Facebook has solved some scaling problems in Mercurial by offloading some historical revisions to a network server and by using inotify to speed up file status checks. That seems to solve Facebook's immediate requirements, since their current codebase size is apparently still below the available storage on their smallest client system, and they don't have extremely large files in revision control or requirements for access control or locking. It doesn't make Mercurial into a general-purpose enterprise SCM system, though.

There are interesting projects to be done in the Git scaling area: multi-blob files, scaling up the index, and efficiently storing objects out on the network. There's existing work to build on, but there's still challenging and fun design and coding to be done. There are great projects here if you're into open source, want to get your PhD in Computer Science, and don't mind having your code mocked on the Internet.
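As a rough illustration of the multi-blob idea from item 1, here is a minimal sketch of hashsplitting in the spirit of what bup does. It is not bup's code: the window size, the number of bits, the maximum chunk size, and the `split_file` "manifest" are all assumptions made for the example, and Git has no multi-blob object type today. The one part taken directly from Git is how a blob is named: the SHA-1 of a `blob <length>` header, a NUL byte, and the content.

```python
import hashlib

WINDOW = 64           # bytes covered by the rolling checksum (assumed value)
SPLIT_BITS = 13       # split when the low 13 bits are all 1s (assumed value)
MASK = (1 << SPLIT_BITS) - 1
MAX_CHUNK = 1 << 16   # hard cap so unlucky input still gets split (assumed)


def hashsplit(data: bytes):
    """Yield content-defined chunks of data."""
    s1 = s2 = 0       # rsync-style rolling sums over the last WINDOW bytes
    start = 0         # offset where the current chunk begins
    for i, byte in enumerate(data):
        # Byte sliding out of the window; 0 until the window has filled
        # with bytes from the current chunk.
        old = data[i - WINDOW] if i - start >= WINDOW else 0
        s1 = (s1 + byte - old) & 0xFFFF
        s2 = (s2 + s1 - WINDOW * old) & 0xFFFF
        digest = (s1 << 16) | s2
        length = i + 1 - start
        at_boundary = length >= WINDOW and (digest & MASK) == MASK
        if at_boundary or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start = i + 1
            s1 = s2 = 0
    if start < len(data):
        yield data[start:]


def git_blob_id(data: bytes) -> str:
    """Object name Git gives a blob with this content."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


def split_file(data: bytes):
    """Hypothetical multi-blob file: ordered blob ids plus a store keyed by id."""
    store = {}
    manifest = []
    for chunk in hashsplit(data):
        oid = git_blob_id(chunk)
        store[oid] = chunk      # identical chunks collapse into one entry
        manifest.append(oid)
    return manifest, store


def join_file(manifest, store) -> bytes:
    """Rebuild the original file from its manifest."""
    return b"".join(store[oid] for oid in manifest)
```

Because the chunks are named by their content, the same dictionary-style store could just as well live out on the network, which is where this connects to the object storage work in item 2. Whether the split points fall in useful places for a given file format is exactly the kind of thing you would have to measure on real files.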

Existing Git users, however, don't really need any of this. It's easy to forget that out of the 10,000-contributor Linux project, only a small core group actually uses Git to pull from member to member. And the Linux codebase is large by Git repository standards, but tiny compared to corporate codebases and versioned collections of large media files.

Git's current scaling capability is fine for the people who depend on it the most today. So who's going to bring the pieces of large-scale Git together? The obvious way to pull this off is for someone to implement everything in open source except for (3), the access-controlled object store and out-of-place object scanning tool, and offer that on a license or subscription basis. It's a natural freemium split because the open-source crowd doesn't care about micromanaging read access anyway.

One problem, though, is that the road from "Git" to "Enterprise SCM" leads right through "censorship-resistant publishing system." This is not an issue for enterprise deployments, but as William Gibson once wrote, "the street finds its own uses for things." An enterprise version management system based on Git would also be attractive to non-enterprise users interested in avoiding surveillance. For example, an underground site could run the no-access-control version of the network object store as a Tor hidden service, and publish content by "spending" a small amount of Bitcoin or a similar distributed currency on a ref. The current sponsors of Git, the employers of leading developers, make their livings from information centralization and are likely to experience more harm than good from that. So better for them to keep Git in its safe "text and code for small groups" niche. A Git descendant that's suitable for enterprise SCM will probably have to come from a new player.

It will be fun to watch this happen over the next five to ten years. If you need a full-scale enterprise SCM system to keep Git and non-Git users happy while you hit refresh on Hacker News and eat popcorn, there's always Perforce.

Questions?  Visit the Perforce Forums or follow me on Twitter.