libgit2 in 2024: the past

October 24, 2024

This is the first post in a three-post series: libgit2 past, present, and future.

I was recently asked to speak at Git Merge 2024 about libgit2, and I thought it would be useful to talk about where libgit2 came from, where it is, and where it's going. If you weren't there, you can watch the video, but I also wanted to take the time to expand on that a bit.

I think of libgit2's history as being split into four distinct phases:

v0: The beginning

You might know that I'm the maintainer of libgit2, but I wasn't the creator of libgit2. libgit2 was quietly introduced almost 16 years ago by Shawn Pearce. (Happy birthday to us 🎂)

commit c15648cbd059b92c177586ab1701a167222c7681
Author: Shawn O. Pearce <spearce@spearce.org>
Date:   Fri Oct 31 09:57:29 2008 -0700

    Initial draft of libgit2

    Signed-off-by: Shawn O. Pearce <spearce@spearce.org>

If you don't know of Shawn, he was a prolific developer who had a tremendous impact on git, and the git ecosystem. He built a number of critical tools, including git-gui and git fast-import. He created both the libgit2 and JGit projects, and built tooling on top of them like the Gerrit code review tool.

Note: In my talk, I mistakenly stated that Shawn started libgit2 and JGit while he was employed at Google. I was corrected after the talk; I understand that he created them before starting at Google.

I don't actually know why Shawn started the libgit2 project, but I can make some educated guesses - and answer a question at the same time.

Why libgit2 is called libgit2

One question that I get asked a lot is "what happened to libgit1?" Well, there was no libgit1, per se, but there was -- and is -- a libgit.a. That's the static library that git produces as a support library for some of its commonly-used functions. This is especially important since git is composed of several independent applications, mostly. Kind of. But more on that in a minute.

The fact that Shawn called this project libgit2 suggests that he may have wanted it to be the common library that git itself was built on.

A platform for extensions

But even if that's not what he had in mind, it was clear that Shawn wanted to build more than just a platform; he wanted to build applications. He introduced Gerrit, but before that he introduced pg, an improved git user experience.

Ultimately, I actually don't know why Shawn started the libgit2 project, but I'm grateful for the overall good taste that he put in to the early, foundational decisions that formed libgit2.

v0.1: The GitHub years

Around the same time that Shawn was leaving the libgit2 project, GitHub was starting to pick up steam as a platform. And part of GitHub's platform was simple ways to create and commit files in their web UI. For this, they used a project called grit. Grit was a pure Ruby implementation of Git, which was very interesting, but had the disappointing property that it would corrupt those commits from time to time.

As a result, GitHub sponsored a Google Summer of Code project to "complete" libgit2. Despite the fact that a decade later, libgit2 remains incomplete, this was a tremendous success. Vicent Martí came on board to hack on the libgit2 project, alongside a GitHub employee named Russell Belfer. Between them, they quickly turned it into a project that was both modular and high-performance. Their goal was to build something capable of driving github.com, and both of their fingerprints remain all over the project. The factoring of the object database and reference database into pluggable backends was their work, as was the careful focus on efficiency and performance in every area that they touched.

This success led GitHub to sponsor a Google Summer of Code project again the following year, which led to Carlos Martín Nieto joining the project. His presence was similarly instrumental in turning libgit2 into what it is, with major refactorings to core components like configuration and references. He also led the work on Rugged, the Ruby bindings to libgit2, and git2go, the Go bindings. He ultimately became the co-maintainer of the project, and thankfully, still sends a pull request to the project from time to time.
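
To make the idea of a pluggable object database a little more concrete, here is a minimal sketch in Python. It is not libgit2's API (libgit2 is written in C, and its backend interface looks quite different); the names OdbBackend, MemoryBackend, and ObjectDatabase are hypothetical and exist only for this illustration. The one detail taken from Git itself is how a blob's ID is computed: the SHA-1 of the header "blob <size>" followed by a NUL byte and then the content.

```python
import hashlib
from typing import Dict, List, Protocol


class OdbBackend(Protocol):
    """Hypothetical interface that every storage backend must satisfy."""

    def exists(self, oid: str) -> bool: ...
    def read(self, oid: str) -> bytes: ...
    def write(self, oid: str, data: bytes) -> None: ...


class MemoryBackend:
    """Toy backend that keeps objects in a dict instead of on disk."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def exists(self, oid: str) -> bool:
        return oid in self._objects

    def read(self, oid: str) -> bytes:
        return self._objects[oid]

    def write(self, oid: str, data: bytes) -> None:
        self._objects[oid] = data


class ObjectDatabase:
    """Front end that only talks to backends through the interface."""

    def __init__(self, backends: List[OdbBackend]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = backends

    def write_blob(self, content: bytes) -> str:
        # Git hashes the header "blob <size>\0" together with the content.
        data = b"blob %d\0" % len(content) + content
        oid = hashlib.sha1(data).hexdigest()
        self._backends[0].write(oid, data)  # assumption: first backend takes writes
        return oid

    def read(self, oid: str) -> bytes:
        # Ask each backend in turn; the caller never learns which one answered.
        for backend in self._backends:
            if backend.exists(oid):
                return backend.read(oid)
        raise KeyError(oid)
```

The point of the design is that the front end never knows where objects live. Swapping MemoryBackend for something backed by a database or a network service would not change any calling code, which is the property that matters when a library has to run inside someone else's infrastructure.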

v0.2: The Microsoft invasion

Not long after this, I started contributing to libgit2. I came into the libgit2 project twelve years ago (another anniversary 🎂) when Microsoft decided to adopt Git as its version control system -- both internally and in its public products. I've also told this story before, and also at a Git Merge.

But if you don't want to watch another video, I'll give you the tl;dr. My buddy Martin had taken note of the adoption of git itself, but also the adoption of GitHub, and crucially, git as a deployment mechanism to cloud services like Heroku. It was clear that git wasn't going to just become a version control system, but a developer platform. And that Microsoft couldn't afford to miss out on this platform; we needed to adopt Git.

But making the decision to adopt Git was only the first decision. We also had to decide how. If we rewind back a decade, it was clear that we needed to adopt Git as a technology, but weren't going to be able to adopt Git itself.

Git's architecture

To understand why, we should look a bit at git's architecture, and remember back to that discussion of libgit.a. Git -- the project -- is composed of many individual applications. When you run git pull, the git command turns around and invokes another command called git-pull. Which itself needs to invoke two other commands: git-fetch, then git-merge (or depending on your settings, maybe git-rebase). And both of these invoke a bunch of other git applications, the "plumbing commands". Things like git-merge-tree, git-ls-tree, etc.

One might argue that this is the logical conclusion of the Unix philosophy of making each program do one thing well. And that may be true, but putting philosophy aside, the practical result is that this architecture works pretty well on Unix systems, but not so much on Windows.

  1. Most Unixes exec(3) new processes quickly; Linux does this especially well. Windows, on the other hand, creates new threads quickly within an existing process, and doesn't CreateProcess quickly. Having a lot of little "plumbing" commands to do one little thing and exit quickly is an incredible performance suck.

  2. Even ignoring the overhead of process startup, invoking these little plumbing commands over and over again means that you have to re-read the same files over and over again. On systems that have slower I/O (like, yes, Windows), even these small performance differences add up since they're repeated.

  3. Probably most importantly, a decade ago, git wasn't just a bunch of individual executables that call each other. It was a bunch of executables, shell scripts, and perl scripts that call each other. (That's still true to an extent, but less so.)

    This is important because you're pretty much guaranteed to have a Bourne shell on Unix, and almost certainly Perl as well. But Windows? In 2010? Not so much. Windows Subsystem for Linux didn't exist yet. Git for Windows didn't even exist yet. Users were expected to install Cygwin or MinGW so that they had bash.exe and perl.exe, just so that they could use Git.

    If Microsoft wanted to make it so that existing Git users could use Visual Studio, maybe we could expect users to have done this work by themselves. But if Microsoft wanted to make Git a real experience for Visual Studio users, then we needed to figure out how to bring Git to them.
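
The process-startup cost described in the list above is easy to measure for yourself. The following is a rough sketch, not a benchmark of Git: it uses the Python interpreter as a stand-in for a small helper command, and the function names are made up for this example. It compares launching a short-lived child process several times against making the same number of ordinary in-process function calls, which is roughly the trade a library makes.

```python
import subprocess
import sys
import time


def time_subprocess_calls(count: int) -> float:
    """Seconds spent launching `count` short-lived child processes."""
    start = time.perf_counter()
    for _ in range(count):
        # Each iteration pays the full process startup cost, much like a
        # script that shells out to a small helper command and waits for it.
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return time.perf_counter() - start


def time_in_process_calls(count: int) -> float:
    """Seconds spent making `count` ordinary function calls instead."""

    def helper() -> None:
        pass

    start = time.perf_counter()
    for _ in range(count):
        helper()
    return time.perf_counter() - start
```

Calling both functions with the same count on different operating systems should show how much the gap depends on how cheaply the platform creates processes. The absolute numbers will vary by machine, so treat them as a relative comparison only.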

Note: Remember that this is about libgit2 in the past. I would be remiss if I didn't give a shout-out to the hard work of the Microsoft engineers to improve the Git for Windows experience; Matthew, Jeff, and Stolee made huge improvements to the DX and performance for Git everywhere, not just Windows. But Johannes Schindelin has the unenviable task of maintaining Git for Windows, which is both a fork of Git and a distribution of Unix userspace that runs on Windows.

And then there's the server

Another problem was that we didn't just want to bring Git to developers' desktops, or inside Visual Studio. We also wanted to make sure that our developer platform -- Azure DevOps -- was able to host Git repositories. And this had all the same problems that the developer desktop has -- performance and installation difficulties -- but had a unique problem as well.

The Git server components (git-upload-pack and git-receive-pack) are built with the same architectural ideas as the rest of Git. That means that they run great on a Linux server (which is where GitHub, GitLab, and Bitbucket execute them). But trying to run them on a Windows server -- in an IIS context -- is an incredibly complicated exercise.

Trying to run them somewhere truly bespoke, like a constrained execution environment where you need a single, small binary, is so challenging that it's approaching impossible. Consider a Cloudflare Worker, for example. Our friends at Gitlip turned to libgit2 to run within Workers for exactly this reason.

So it was clear that Git -- even if its architecture was born of the Unix philosophy -- would not have been a good fit for us in 2010. Hence our interest and investment in libgit2.

These were -- in my opinion -- some of the finest days of the libgit2 project. We had two companies with senior developers working on libgit2 (and its ecosystem of bindings), building core products on top of it, and contributing to its success. And the project was always a place for collaboration, despite the fact that these two companies competed with each other.

v0.3: Maturation

By the time that Microsoft added libgit2 to Azure DevOps, the project was in a position that every major hosting provider, or "git forge", was using libgit2 somewhere. Usually to merge pull requests (or, on some platforms, "merge requests"), since libgit2 first introduced the ability to do a merge without requiring a working directory. But most people were using a little bit of libgit2, a little bit of git, and a little bit of custom code, all rolled into one service.
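
To show why a merge does not need a working directory at all, here is a toy three-way merge that operates purely on in-memory "trees". This is a simplified sketch and not libgit2's merge algorithm: it assumes a tree is just a flat mapping from path to blob ID, it ignores renames and directories, and it reports a conflict instead of attempting a content-level merge. The function name merge_trees is chosen for this example only.

```python
from typing import Dict, List, Tuple

Tree = Dict[str, str]  # assumption: path -> blob ID, flattened


def merge_trees(base: Tree, ours: Tree, theirs: Tree) -> Tuple[Tree, List[str]]:
    """Three-way merge of flat trees; returns (merged tree, conflicted paths)."""
    merged: Tree = {}
    conflicts: List[str] = []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o          # both sides agree (including both deleting)
        elif o == b:
            result = t          # only their side changed this path
        elif t == b:
            result = o          # only our side changed this path
        else:
            conflicts.append(path)  # both sides changed it differently
            continue
        if result is not None:  # None means the path was deleted
            merged[path] = result
    return merged, conflicts
```

Nothing here touches the filesystem: the inputs are three snapshots and the output is a fourth. That is what makes this style of merge a natural fit for a server that only has a bare repository to work with.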

But we then started to see more adoption in client tooling: Visual Studio, of course, was an early libgit2 user for a client. Tortoise Git uses libgit2. GitKraken started using NodeGit (the Node.js bindings for libgit2). And then other developer tooling, like dependency managers, adopted libgit2.

At some point, we looked up from our screens and realized that basically every developer used libgit2, even if they didn't know it. This fact hit like a ton of bricks; and we suddenly understood that we needed to enter a new phase of maturity.

API Stability

In "ye olde days" of libgit2, we were pretty sloppy about API stability. We reasoned that we were "pre-v1.0", and that we didn't have that many users, so we just changed things without much thought to backward compatibility. But we didn't do a great job of tracking what "that many users" meant, or how to adapt when we got there.

When I was at Open Source Summit Japan in 2018, Linus was on stage, and complained about "user space libraries keep breaking the API". Now... I don't know what he was talking about with certainty. But one thing to realize is that Linus is a libgit2 user, for his dive app.

So this one hit close to home: when the guy who invented git calls your git library out, on stage, in a keynote -- even if not by name -- it's time to get your shit together. So we did (mostly) and API stability became a priority.

Application stability and security

People entrust their source code to libgit2, and often indirectly, so they don't necessarily even know that they're doing it. libgit2 just shows up in their IDE, or their package manager. This means that we need to remain good stewards of their source code and their development environment.

Practically speaking, that means we need to run well everywhere, and not crash anywhere. Over the last several years, we've put a huge investment into some key areas to help ensure that we're stable and secure.

  1. A deep CI build matrix. On any pull request to libgit2, we kick off 19 builds across Linux, macOS, and Windows, with a variety of different dependencies like HTTPS libraries, SSH providers, crypto libraries, etc. We expect warning-free builds on all of those platforms, in all of those configurations. And we expect a 100% test pass rate on all of them. (With the possible exception of some integration tests that hit external servers; networks aren't 100% reliable!)

    Why 19 builds? Well, we wanted more, but this is the best compromise between safety and speed when validating pull requests. Our nightly build matrix is actually a lot bigger.

  2. SAST and DAST. One of the amazing things about libgit2 is that it's a reasonably popular library and used in a lot of places. One of the terrifying things about libgit2 is that it's a reasonably popular library and used in a lot of places and it's written in C.

    Realistically, throwing out 150,000 lines of code and starting over isn't a particularly practical way forward. Trying to make those 150,000 lines of code as safe as possible is, though.

    So we use a combination of tools to make libgit2 as safe as possible. We use memory debuggers like valgrind on every pull request build, and we have separate builds for LLVM sanitizers that look for memory safety, thread safety, and undefined behavior. Every CI build also runs a fuzzing pass, and we have support from the OSS Fuzz project.

    Finally, we use a number of third-party code analysis tools, like Coverity and CodeQL. Both products have been very generous to open source projects like ours and have a no-cost plan for OSS. Obviously there's no magic bullet for memory safety in C, but with these tools, we try to mitigate the risk as much as possible.
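
For readers who have not seen fuzzing up close, the idea can be sketched in a few lines. The example below is a toy and is not one of libgit2's fuzzers (those are written in C). It assumes a tiny parser for the loose-object header format that Git uses ("<type> <size>", a NUL byte, then the body), throws random bytes at it, and checks one property: bad input must be rejected with a ValueError, never with some other unexpected failure. The names parse_object_header and fuzz are invented for this illustration.

```python
import random
from typing import Tuple

VALID_TYPES = {b"blob", b"tree", b"commit", b"tag"}


def parse_object_header(data: bytes) -> Tuple[str, int]:
    """Parse '<type> <size>\\0<body>' and return (type, size)."""
    nul = data.find(b"\0")
    if nul < 0:
        raise ValueError("missing NUL terminator")
    kind, sep, size_text = data[:nul].partition(b" ")
    if not sep or kind not in VALID_TYPES:
        raise ValueError("unknown object type")
    if not size_text.isdigit():
        raise ValueError("size is not a decimal number")
    size = int(size_text)
    if size != len(data) - nul - 1:
        raise ValueError("declared size does not match body length")
    return kind.decode("ascii"), size


def fuzz(iterations: int, seed: int = 0) -> int:
    """Feed random inputs to the parser; return how many were rejected."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(iterations):
        length = rng.randrange(0, 32)
        data = bytes(rng.randrange(256) for _ in range(length))
        try:
            parse_object_header(data)
        except ValueError:
            rejected += 1  # a clean rejection is the expected outcome
        # Any other exception propagates and fails the fuzzing run.
    return rejected
```

Real fuzzers are much smarter about generating inputs, typically using coverage feedback rather than pure randomness, and in C they are paired with sanitizers so that memory errors turn into immediate, visible failures. The shape of the harness is the same, though: untrusted bytes in, and either a valid result or a clean rejection out.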

v1.0: The present

I've been so grateful to be a part of the libgit2 project; it's been the source of tremendous enjoyment, both professionally and personally. It's been the source of some of the hardest-to-debug problems that I've ever encountered, and it's also been the opportunity for me to stand up on stage in foreign countries.

I can't help but look back with fondness on my decade of working on libgit2. But more importantly, I can't help but be excited about where libgit2 is -- today -- as a project.

That will be the next post; coming soon.