October 25, 2012

Managing Projects Across Git Repositories

Git at Scale

In most projects using Git, a typical developer needs to work with much more code than can be comfortably managed in a single Git repository. Although it's possible to keep multiple repositories up to date manually, many Git developers have come up with tools to track and work with changes that span more than one repository.

Repo: The Android Git Wrapper

Android is an extreme case of a project that has outgrown a single Git repository: it uses hundreds of them. Keeping up to date with all of the Android repositories is impractical without some kind of tool on top of Git. The repo tool automates the "git clone" and "git pull" operations needed to catch up with Android; running "repo sync" gets everything up to date. There are also "repo upload" and "repo download" commands, which automate the client side of interacting with the Gerrit code review system.
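
As a rough illustration of the bookkeeping a wrapper like repo takes off your hands, the sketch below plans a clone-or-pull pass over a list of repositories. It is not how repo itself is implemented: the plan_sync function, the manifest dictionary, and the example.com URLs are all hypothetical, and the function only builds the commands rather than running them.

```python
from pathlib import Path


def plan_sync(manifest, root):
    """Return the git commands needed to bring every repository up to date.

    manifest maps a local directory name to a clone URL (hypothetical data,
    not repo's real manifest format).
    """
    commands = []
    for name, url in sorted(manifest.items()):
        target = Path(root) / name
        if (target / ".git").exists():
            # Already cloned: fetch and merge upstream changes.
            commands.append(["git", "-C", str(target), "pull"])
        else:
            # Not present yet: clone it into place.
            commands.append(["git", "clone", url, str(target)])
    return commands


if __name__ == "__main__":
    example = {
        "platform/build": "https://example.com/platform/build.git",
        "platform/bionic": "https://example.com/platform/bionic.git",
    }
    for argv in plan_sync(example, "android"):
        print(" ".join(argv))
```

With two repositories this is easy to do by hand; with hundreds, the loop is the whole point.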

Why choose repo?

Well-documented on Android site.

Connects well to Gerrit.

Why not choose repo?

Designed to implement one workflow, the Android one. May not work well for alternate workflow designs.

Gitslave and mr

Gitslave runs the same git commands in multiple Git repositories, and combines the results. For example, when creating a task branch, you can run a single command to create a branch with the same name in each of the repositories making up a project. It's a useful tool, but using multiple repositories linked by Gitslave means that you give up the atomicity of Git operations. It's possible for the git command started by Gitslave to succeed in some repositories and fail in others. According to the project web site, "There is a very loose relationship between commits in different repositories. You cannot easily and precisely determine what commit/SHA any other repository was at when a particular commit was made (though you can approximate and assume pretty easily)."
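
The loss of atomicity is easier to see in code. The following is a minimal sketch of the "same command in every repository" idea, not Gitslave's actual implementation; run_everywhere and summarize are names made up for this example.

```python
import subprocess


def run_everywhere(repo_dirs, argv):
    """Run the same command in each directory and collect the exit codes."""
    results = {}
    for repo in repo_dirs:
        # Each run is independent: a failure here does not undo
        # whatever already succeeded in an earlier repository.
        completed = subprocess.run(argv, cwd=repo, capture_output=True, text=True)
        results[repo] = completed.returncode
    return results


def summarize(results):
    """Split repositories into those where the command succeeded or failed."""
    succeeded = [repo for repo, code in results.items() if code == 0]
    failed = [repo for repo, code in results.items() if code != 0]
    return succeeded, failed
```

Calling run_everywhere(["app", "lib"], ["git", "checkout", "-b", "task-123"]) would create the branch in both directories if all goes well. If the branch already exists in lib, though, the command fails there while app keeps its new branch, and it is up to you to notice and reconcile the two.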

mr works much like Gitslave, but doesn't just support Git. You can use a mix of SCM systems, including Subversion, Git, CVS, Mercurial, bzr, and darcs.

Why choose Gitslave or mr?

Minimal interference with existing Git repo structure.

Why not choose Gitslave or mr?

Can leave project in an inconsistent state if an operation partly succeeds.

Difficult to visualize changes that affect multiple repositories.

Git Submodules and Subtrees

Git itself has two built-in ways to work with subprojects: submodules, and the newer git subtree tool. A submodule is a way to embed the content of a "foreign" git repository in another one. A submodule is locked to a given version of the other project — if you need to track a newer version, you'll need to update the submodule, then commit the change in the outer repository.
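
The two-step nature of a submodule update (move the submodule, then commit in the outer repository) can be sketched as a short list of commands. This is one possible sequence among several, and the helper name, path, revision, and commit message are placeholders invented for the example. The function only returns the commands; it does not run them.

```python
def plan_submodule_bump(path, revision, message):
    """Return the commands that move a submodule to a new revision.

    path is the submodule directory inside the outer repository and
    revision is the commit, tag, or branch to pin; both are placeholders.
    """
    return [
        # Step 1: update the submodule's own checkout.
        ["git", "-C", path, "fetch"],
        ["git", "-C", path, "checkout", revision],
        # Step 2: record the newly pinned commit in the outer repository.
        ["git", "add", path],
        ["git", "commit", "-m", message],
    ]


if __name__ == "__main__":
    for argv in plan_submodule_bump("vendor/lib", "v1.2.0", "Update vendor/lib"):
        print(" ".join(argv))
```

Until the last two commands run, the outer repository still points at the old version of the subproject.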

Git subtrees (not to be confused with the subtree merge strategy) work a little differently. You can work in a subdirectory of an existing repository, and split it out to treat it as a separate git repository. But you don't need to do a separate update and commit. Gunnar Wrobel has written a good blog post on working with subtrees.
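
For comparison, here is a sketch of the main git subtree invocations for one subdirectory. The subtree_commands helper and all of its arguments are hypothetical; check the subtree documentation for your Git version before relying on any option.

```python
def subtree_commands(prefix, repository, ref, split_branch):
    """Return example git subtree invocations for one subdirectory.

    All four arguments are placeholders chosen for this sketch.
    """
    return {
        # Import another project's history under a subdirectory.
        "add": ["git", "subtree", "add", f"--prefix={prefix}", repository, ref],
        # Merge newer upstream changes into that subdirectory.
        "pull": ["git", "subtree", "pull", f"--prefix={prefix}", repository, ref],
        # Extract the subdirectory's history onto its own branch.
        "split": ["git", "subtree", "split", f"--prefix={prefix}", "-b", split_branch],
    }
```

Each of these is a single command run from the outer repository, so there is no separate pointer to update and commit afterwards.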

Why choose submodules or subtrees?

Atomic commits; can't have a partial change to subproject state.

Tools are bundled with the default Git install (although subtree is only in new versions).

Why not choose submodules or subtrees?

Requires additional setup.

Requires developers to learn new commands to handle changes in subprojects.

Git Fusion

Perforce Git Fusion creates "remapped" Git repositories that can combine the contents of multiple subprojects. The remapping happens within the upstream Perforce server, and a Perforce client workspace defines which files live where. You can even aggregate multiple upstream Git repositories, then pick and choose which parts to map into the repository you work in.

Each developer can work in a combined Git repository that includes code and history from multiple subprojects. Unlike wrapper tools, commits are atomic, since there are no separate repositories to deal with. And unlike submodules and subtrees, there are no extra git commands.

If you're familiar with how Git uses content hashing for security, this might sound too good to be true. After all, each and every commit is dependent on the entire history of the project before it, through a chain of SHA-1 hashes. However, Git Fusion's remapped Git repositories have completely valid hashes, so they'll work fine with any Git implementation. Because of how content hashing is central to Git's design, though, commits in a remapped repository can't be directly applied to the original, un-remapped one. If I clone a Git repository from an open-source project on the outside, I'll need to have an integration repository to clean up any changes I want to push back. I'm in the middle of figuring out a reasonable repository layout for a web project (watch this space), and I'll be trying out a mix of approaches, to see what works best.
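
To see why commits in a remapped repository can't be applied directly to the original, it helps to look at the hash chain itself. The sketch below uses a deliberately simplified stand-in for Git's commit format (a SHA-1 over the parent ID plus a line of text); commit_id and chain_ids are illustrative names, not Git internals.

```python
import hashlib


def commit_id(parent_id, content):
    """Hash a commit's content together with its parent's ID.

    A simplified stand-in for Git's real commit object format.
    """
    data = (parent_id or "").encode() + b"\n" + content.encode()
    return hashlib.sha1(data).hexdigest()


def chain_ids(contents):
    """Return the ID of every commit in a linear history."""
    ids = []
    parent = None
    for content in contents:
        parent = commit_id(parent, content)
        ids.append(parent)
    return ids


original = chain_ids(["add README", "add src/main.c", "fix typo"])
# Remapping changes the content of the second commit (a path moves).
remapped = chain_ids(["add README", "add lib/src/main.c", "fix typo"])

print(original[0] == remapped[0])  # True: the first commit is untouched
print(original[2] == remapped[2])  # False: same content, different ancestry
```

Both histories are internally valid, but from the second commit onward they share no IDs, which is why moving changes between them takes an integration step.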

Why choose Git Fusion?

Combined repository works like pure Git.

No extra client-side tools or skills required.

Why not choose Git Fusion?

You may need an extra repository for integrating changes from your remapped repository to an original one.