Your Git Repository in a Database: Pluggable Backends in libgit2
September 12, 2017

Git has a well-known, well-defined structure for how it stores data. In the .git directory of every Git repository you can expect to find certain things: objects for the data, refs for the branch and tag pointers, and so on. Additionally, everything gets stored in flat files, though some formats are a bit more involved than others.

However, it turns out this is not the only way you can store data in a Git repository. You can actually use a relational or NoSQL database, an in-memory data structure, or something like Amazon S3. The pluggable backends provided by the libgit2 library make all of this possible.

What This Means

Using alternative Git storage solutions is probably most interesting for services or products that provide Git hosting, like we do at Perforce. Use cases for hosting providers include:

  • Caching Git data for lightning-fast access by using either an in-memory backend or a Memcached or Redis backend with fallbacks to traditional file storage.
  • Building a fault-tolerant storage solution, or even a multi-site replication solution, by storing data in a modern database system designed for this purpose, such as Voldemort, Riak, or Cassandra.

Outside of hosting, there are several possible use cases for pluggable storage when incorporating Git access into tools and libraries.

The Two Datastores of a Git Repository

Git repositories aren't that complicated, though you would never know it by looking at Git's UI. Git repos are composed of just two structures, upon which everything is based: object databases and ref databases.

The Object Database

The object database is where all the data is stored:

  • The contents of all files
  • The structures of directories
  • Commits
  • Everything 

However, what's remarkable about the object database is that it's essentially nothing but a key-value store.

Git stores data in the object database using hash-based retrieval, meaning that the keys of the store are the (SHA1) hashes of the values. That has some rather interesting implications: because a key is derived from its value, changing a value would also change its key, so the values in the object database are essentially immutable and you don't need an update operation.

What's left is a basic data structure with essentially four operations:

add(key, value)

It's easy to see you don't necessarily need flat file storage to implement something like this! Git's default, file-based object database is just one implementation of the abstract concept.
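
To make the idea concrete, here is a minimal in-memory sketch of a hash-keyed store in C. Everything in it is hypothetical: the toy_ names are invented for illustration, the operation set (add, read, exists) is an assumption, and FNV-1a stands in for SHA1 only to keep the example short. It is not how Git or libgit2 implement their object databases.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MAX_OBJECTS 64

/* One stored object. The key is derived from the value, never chosen by the caller. */
typedef struct {
    uint64_t key;
    void    *data;
    size_t   len;
} toy_object;

typedef struct {
    toy_object objects[TOY_MAX_OBJECTS];
    size_t     count;
} toy_odb;

/* FNV-1a, used here as a stand-in for SHA1 (assumption: illustration only). */
uint64_t toy_hash(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Returns the stored object for key, or NULL if it is absent. */
const toy_object *toy_odb_read(const toy_odb *db, uint64_t key)
{
    for (size_t i = 0; i < db->count; i++)
        if (db->objects[i].key == key)
            return &db->objects[i];
    return NULL;
}

/* Returns 1 if an object with this key is stored, 0 otherwise. */
int toy_odb_exists(const toy_odb *db, uint64_t key)
{
    return toy_odb_read(db, key) != NULL;
}

/* Stores a copy of data and reports its key through key_out.
 * Returns 0 on success, -1 when the store is full or memory runs out.
 * Adding the same content twice stores nothing new, which is why no
 * update operation is needed. Hash collisions are ignored in this toy. */
int toy_odb_add(toy_odb *db, const void *data, size_t len, uint64_t *key_out)
{
    assert(db != NULL && data != NULL && key_out != NULL);
    uint64_t key = toy_hash(data, len);
    *key_out = key;
    if (toy_odb_exists(db, key))
        return 0;
    if (db->count == TOY_MAX_OBJECTS)
        return -1;
    void *copy = malloc(len ? len : 1);
    if (copy == NULL)
        return -1;
    memcpy(copy, data, len);
    db->objects[db->count++] = (toy_object){ key, copy, len };
    return 0;
}
```

Note that the caller never supplies a key: toy_odb_add computes it from the content and hands it back, so two callers storing identical bytes always end up with the same key.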

The Ref Database

The ref database stores a Git repository's references — the branches, tags, and HEAD.

Just like the object database, the ref database is also essentially a key-value store. The keys are the identifiers of the references, and the values are SHA1 hashes, which in turn correspond to commit objects in the object database.

The values of a ref database are mutable, which is a key difference when compared to the object database. The commit that master points to may change over time. That means there's a slight difference in the operations that a ref database must provide:

write(key, value)
rename(old_key, new_key)
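
As a rough illustration of how this differs from the object database, here is a small in-memory ref store in C. The toy_ names, the fixed-size table, and the lookup operation are assumptions made for the sketch. Only write and rename come from the list above, and none of this is libgit2's actual ref database code.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_REFS 32
#define TOY_NAME_LEN 128
#define TOY_HEX_LEN  41   /* 40 hex digits of a SHA1 plus the terminator */

typedef struct {
    char name[TOY_NAME_LEN];    /* e.g. "refs/heads/master" */
    char target[TOY_HEX_LEN];   /* hex hash of the commit the ref points to */
} toy_ref;

typedef struct {
    toy_ref refs[TOY_MAX_REFS];
    size_t  count;
} toy_refdb;

/* Returns the entry for name, or NULL if there is none. */
toy_ref *toy_refdb_find(toy_refdb *db, const char *name)
{
    for (size_t i = 0; i < db->count; i++)
        if (strcmp(db->refs[i].name, name) == 0)
            return &db->refs[i];
    return NULL;
}

/* Returns the hash a ref points to, or NULL (lookup is an assumed operation). */
const char *toy_refdb_lookup(toy_refdb *db, const char *name)
{
    toy_ref *ref = toy_refdb_find(db, name);
    return ref ? ref->target : NULL;
}

/* Creates the ref, or overwrites its target in place if it already exists.
 * Returns 0 on success, -1 if the table is full or an argument is too long. */
int toy_refdb_write(toy_refdb *db, const char *name, const char *target)
{
    assert(db != NULL && name != NULL && target != NULL);
    if (strlen(name) >= TOY_NAME_LEN || strlen(target) >= TOY_HEX_LEN)
        return -1;
    toy_ref *ref = toy_refdb_find(db, name);
    if (ref == NULL) {
        if (db->count == TOY_MAX_REFS)
            return -1;
        ref = &db->refs[db->count++];
        strcpy(ref->name, name);
    }
    strcpy(ref->target, target);   /* the mutable part: values can change */
    return 0;
}

/* Moves a ref to a new name and keeps its target.
 * Returns -1 if the old name is missing or the new name is taken or too long. */
int toy_refdb_rename(toy_refdb *db, const char *old_name, const char *new_name)
{
    assert(db != NULL && old_name != NULL && new_name != NULL);
    toy_ref *ref = toy_refdb_find(db, old_name);
    if (ref == NULL || strlen(new_name) >= TOY_NAME_LEN ||
        toy_refdb_find(db, new_name) != NULL)
        return -1;
    strcpy(ref->name, new_name);
    return 0;
}
```

Writing to an existing name overwrites the stored hash rather than adding a second entry, which is exactly the kind of update the object database never needs.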


Libgit2 is an implementation of Git written in pure C. It's designed to be an alternative to the Git reference implementation, providing easy linkage to other libraries and applications. It is actually the basis of the Git language bindings in many programming languages.

One of the less advertised features of libgit2 is that it has pluggable backends, which means that instead of storing the object database and the ref database in the way Git usually does it – in flat files – you can provide your own backend implementation and do whatever you want. Let's see how that works.

The Libgit2 Object Database Backend

The libgit2 object database code accesses data through functions in a C struct git_odb_backend, defined in git2/sys/odb_backend.h. It basically has the functions described above, with some additional functions for convenience (reading object headers only, streaming access, writing a packfile).

There are two built-in implementations for this struct that ship with libgit2. They implement the two object storage formats that Git traditionally supports:

  • odb_loose implements the loose file format backend. It accesses each object in a separate file within the objects directory, with the name of each file corresponding to the SHA1 hash of its contents.
  • odb_pack implements the packfile backend. It accesses the objects in Git packfiles, which is a file format used for both space-efficient storage of objects and for transferring the objects when pushing or pulling.

As you create a Git object database, you can provide any instance of the git_odb_backend struct, including a custom-built one. This lets you plug in your own implementations, as we'll see later in this article.
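
The general technique, a struct of function pointers that a concrete backend embeds as its first member, can be sketched without libgit2 at all. The toy_backend and mem_backend types below are hypothetical stand-ins invented for this example. They are much smaller than the real git_odb_backend and do not match its signatures, so treat this as an illustration of the pattern rather than of the library's API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical backend interface: a struct of function pointers. */
typedef struct toy_backend toy_backend;
struct toy_backend {
    int  (*read)(toy_backend *self, const char *key, const void **data, size_t *len);
    int  (*write)(toy_backend *self, const char *key, const void *data, size_t len);
    int  (*exists)(toy_backend *self, const char *key);
    void (*free)(toy_backend *self);
};

#define MEM_SLOTS 16

/* A concrete backend embeds the interface as its first member, so a pointer
 * to the interface can be cast back to the full struct. */
typedef struct {
    toy_backend parent;
    char   keys[MEM_SLOTS][41];
    void  *values[MEM_SLOTS];
    size_t lens[MEM_SLOTS];
    size_t count;
} mem_backend;

/* Returns the slot index holding key, or -1 if it is not stored. */
static int mem_find(mem_backend *b, const char *key)
{
    for (size_t i = 0; i < b->count; i++)
        if (strcmp(b->keys[i], key) == 0)
            return (int)i;
    return -1;
}

static int mem_read(toy_backend *self, const char *key, const void **data, size_t *len)
{
    mem_backend *b = (mem_backend *)self;
    int i = mem_find(b, key);
    if (i < 0)
        return -1;
    *data = b->values[i];
    *len = b->lens[i];
    return 0;
}

static int mem_write(toy_backend *self, const char *key, const void *data, size_t len)
{
    mem_backend *b = (mem_backend *)self;
    if (mem_find(b, key) >= 0)
        return 0;                 /* objects are immutable: already stored */
    if (b->count == MEM_SLOTS || strlen(key) > 40)
        return -1;
    void *copy = malloc(len ? len : 1);
    if (copy == NULL)
        return -1;
    memcpy(copy, data, len);
    strcpy(b->keys[b->count], key);
    b->values[b->count] = copy;
    b->lens[b->count] = len;
    b->count++;
    return 0;
}

static int mem_exists(toy_backend *self, const char *key)
{
    return mem_find((mem_backend *)self, key) >= 0;
}

static void mem_free(toy_backend *self)
{
    mem_backend *b = (mem_backend *)self;
    for (size_t i = 0; i < b->count; i++)
        free(b->values[i]);
    free(b);
}

/* Constructor: callers only ever see the generic toy_backend pointer. */
int mem_backend_new(toy_backend **out)
{
    assert(out != NULL);
    mem_backend *b = calloc(1, sizeof(*b));
    if (b == NULL)
        return -1;
    b->parent.read = mem_read;
    b->parent.write = mem_write;
    b->parent.exists = mem_exists;
    b->parent.free = mem_free;
    *out = &b->parent;
    return 0;
}
```

Code that holds a toy_backend pointer calls be->write(be, key, data, len) without knowing which implementation sits behind it, so swapping in a database-backed version only means writing another constructor that fills in the same four pointers.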

The Libgit2 Ref Database Backend

You can also provide a custom backend for the ref database, resulting in a potentially flat file-free Git repository. The technique libgit2 uses for this is essentially the same as with the object database. There is a struct git_refdb_backend, defined in git2/sys/refdb_backend.h, with functions for the different access operations.

There is just one implementation of the ref database backend that ships with libgit2: the file system backend refdb_fs, which accesses the refs in the refs directory of a repository.

Existing Alternative Backends

In addition to the built-in backends already mentioned, the libgit2-backends repository maintained by the libgit2 team provides a few custom object database backends.

These are not only useful by themselves, but they also provide a nice starting point for writing a custom backend of your own.

Setting It Up

Let's look at how to actually use these alternative backends.

With the built-in backends, you would usually invoke git_repository_open with the file system path containing the usual .git directory contents, such as the loose object database, the packfiles, and the refs.

What we need to do instead when using custom backends is to invoke git_repository_wrap_odb, providing our own object database with a custom backend.

Let's say we have custom backends written for the Voldemort database, with the following constructor functions:

int git_odb_backend_voldemort(git_odb_backend **backend_out, git_repository *repo, const char *repo_id, const char *bootstrap_url, const char *store_name);  
int git_refdb_backend_voldemort(git_refdb_backend **backend_out, git_refdb *refdb, const char *repo_id, const char *bootstrap_url, const char *store_name);

Here's how we can set up a Git repository backed by those backends:

git_repository    *repo;  
git_odb           *odb;  
git_odb_backend   *voldemort_odb_backend;  
git_refdb         *refdb;  
git_refdb_backend *voldemort_refdb_backend;  
int               error = 0;

error = git_odb_new(&odb);  
if (!error)  
  error = git_repository_wrap_odb(&repo, odb);
if (!error)  
  error = git_odb_backend_voldemort(&voldemort_odb_backend, repo, "my_repo", "tcp://localhost:6666", "git_odb");
if (!error)  
  error = git_odb_add_backend(odb, voldemort_odb_backend, 1);
if (!error)  
  error = git_refdb_new(&refdb, repo);
if (!error)  
  error = git_refdb_backend_voldemort(&voldemort_refdb_backend, refdb, "my_repo", "tcp://localhost:6666", "git_refdb");
if (!error)  
  error = git_refdb_set_backend(refdb, voldemort_refdb_backend);
if (!error)  
  git_repository_set_refdb(repo, refdb);

  • On line 8 we construct an object database without any backends.
  • On line 10 we construct a Git repository backed by this object database.
  • On line 12 we construct the Voldemort object database backend.
  • On line 14 we plug the Voldemort object database backend into the object database. Object databases support multiple backends, and the order in which lookups are done is based on a priority number. We give the Voldemort backend priority 1.
  • On line 16 we construct a ref database without any backends.
  • On line 18 we construct the Voldemort ref database backend, just like we did with the object database.
  • On line 20 we plug the Voldemort ref database backend into the ref database.
  • On line 22 we finally plug the ref database into our repository, and we have a functioning repository we can read from and write to.

In place of the Voldemort backends, you could also use one of your own implementations or one of the existing custom implementations from libgit2-backends. You could even provide multiple custom object database backends by adding them with different priorities. This can come in very handy when implementing caching, for example.
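
To show why priorities are handy for caching, here is a small self-contained model of priority-ordered lookup. It does not use libgit2: the toy_ types and single_entry_read are hypothetical, each backend holds a single entry to keep the sketch short, and it assumes that backends with higher priority numbers are consulted first. The shape of toy_odb_add_backend only loosely mirrors the git_odb_add_backend call used above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical backend: holds one key/value pair and counts how often it is asked. */
typedef struct toy_backend toy_backend;
struct toy_backend {
    const char *name;
    int         priority;   /* assumption: higher numbers are consulted first */
    const char *(*read)(toy_backend *self, const char *key);
    const char *key;
    const char *value;
    int         reads;
};

#define TOY_MAX_BACKENDS 4

typedef struct {
    toy_backend *backends[TOY_MAX_BACKENDS];
    size_t       count;
} toy_odb;

/* Read callback shared by both example backends: a hit returns the value,
 * a miss returns NULL so the next backend gets a chance. */
const char *single_entry_read(toy_backend *self, const char *key)
{
    self->reads++;
    return (self->key != NULL && strcmp(self->key, key) == 0) ? self->value : NULL;
}

/* Inserts the backend so the array stays sorted by descending priority.
 * Returns 0 on success, -1 if there is no room left. */
int toy_odb_add_backend(toy_odb *odb, toy_backend *backend, int priority)
{
    assert(odb != NULL && backend != NULL);
    if (odb->count == TOY_MAX_BACKENDS)
        return -1;
    backend->priority = priority;
    size_t i = odb->count;
    while (i > 0 && odb->backends[i - 1]->priority < priority) {
        odb->backends[i] = odb->backends[i - 1];
        i--;
    }
    odb->backends[i] = backend;
    odb->count++;
    return 0;
}

/* Asks each backend in priority order and returns the first hit, or NULL. */
const char *toy_odb_read(toy_odb *odb, const char *key)
{
    for (size_t i = 0; i < odb->count; i++) {
        const char *value = odb->backends[i]->read(odb->backends[i], key);
        if (value != NULL)
            return value;
    }
    return NULL;
}
```

With a fast cache registered at a higher priority than a slower persistent backend, a cache hit never touches the slower store, and a miss simply falls through to it. A real caching setup would also need to populate the cache on a miss, which this sketch leaves out.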

If you're not working in raw C, you can take a look at all the language bindings based on libgit2 to see how you might be able to achieve this in your programming language.