Edward Thomson

Advent Day 24: The Reflog

December 24, 2018  •  1:04 PM

This is day 24 of my Git Tips and Tricks Advent Calendar. If you want to see the whole list of tips as they're published, see the index.

There are precious few things in Git that save my bacon more often than the reflog. And knowing it's there encourages me to craft my commits, often rebasing them, into a form that's easy for my collaborators to read.

I've mentioned before that on the libgit2 project, some of my collaborators like to review each individual commit in a pull request. This lets them focus on a single idea or expression, before moving on to the next idea, in the next commit. It's hard to code review a huge pull request, and breaking it down into smaller pieces helps with that significantly.

But it's not always easy to craft an easy-to-read pull request. It often means taking their feedback and going back to fix up prior commits, rewriting and rebasing.

But what happens if you get this wrong? If you've rewritten a bunch of changes, and you made some errors, how do you get back to the ones that you rewrote?

The answer is the reflog. Running git reflog lists every commit that HEAD has pointed to, in reverse chronological order, along with how you got to each one.

For example, I've made some fixup commits for a pull request and I've decided to rebase them. Once my rebase finishes, my reflog looks like:

cab2a8240 (HEAD -> newfeature) HEAD@{0}: rebase -i (finish): returning to refs/heads/newfeature
cab2a8240 (HEAD -> newfeature) HEAD@{1}: rebase -i (continue): new feature for widget
3649929fa HEAD@{2}: rebase -i (continue): some new code
ff7d3d7c3 HEAD@{3}: rebase -i (pick): some new code
8a3a999bd HEAD@{4}: rebase -i (pick): refactor before new code
90345d266 HEAD@{5}: rebase -i (pick): update widget
77f1460fc (origin/master, origin/HEAD, master) HEAD@{6}: rebase -i (start): checkout origin/master
e43efed03 HEAD@{7}: commit: fixup! some new code
...

Looking at this, you can see where the rebase started, on the 7th line. It checked out origin/master to start rebasing the commits on top of that. Then it replayed (cherry-picked) my commits on top of that.

But the line before that, the 8th line, is where I was before I started the rebase. So if I got myself into trouble, and want to get back, I can just:

git reset --hard e43efed03

And now I can start my rebase over again, or make more changes, etc.
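If you'd like to try this without risking real work, here is a minimal sketch in a throwaway repository. The directory, file name, identity and commit messages are all made up for the demonstration; it "loses" a commit with a hard reset (standing in for a botched rebase) and then gets it back through the reflog.

```shell
#!/bin/sh
set -e

# Throwaway repository; nothing here touches your real work.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity for the demo
git config user.email "user@example.com"

echo one > file.txt && git add file.txt && git commit -q -m "first commit"
echo two >> file.txt && git commit -q -a -m "second commit"
before=$(git rev-parse HEAD)

# "Lose" the second commit, as a rewrite gone wrong might.
git reset -q --hard HEAD~1

# The reflog still remembers where HEAD was before the reset.
git --no-pager reflog

# HEAD@{1} is where HEAD pointed one move ago, that is, before the reset.
git reset -q --hard 'HEAD@{1}'
after=$(git rev-parse HEAD)
echo "recovered: $after"
```

Using HEAD@{1} here is just shorthand for the commit ID on the second line of the reflog; copying the ID, as in the example above, works equally well.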

One of the great things about git is that it's very hard to get into a state where you've lost data. Even when you rebase changes away, those commits still exist, and the reflog will keep showing where you've been until its entries expire (by default after 90 days, or 30 days for entries that are no longer reachable from the current tip).

Advent Day 23: Removing Large Binaries with LFS

December 23, 2018  •  1:04 PM

This is day 23 of my Git Tips and Tricks Advent Calendar. If you want to see the whole list of tips as they're published, see the index.

Earlier this month I talked about Git Large File Storage (LFS) and how you can use it to avoid checking in large files. It will keep those files in a separate area, parallel to (but not actually part of) your git repository, where they have a different storage contract that's efficient for large files.

But Git LFS needs to be set up before you add the binaries to your repository - once they've been checked in, they're in history.

This is especially important since some Git hosting providers limit the size of what you can push. GitHub, for example, rejects any push that contains a file larger than 100 MB. Once a file like that is in your history, starting to use Git LFS after the fact doesn't help: the push is still rejected, because the existing large file is still there.

GitHub rejecting push

So what do you do if you've already added binaries to your repository? How do you get them out?

You're going to have to rewrite history to remove them, and add them with Git LFS retroactively.

Rewriting history, eh? If warning bells are going off in your head: good. They should. This is a bit of a tricky proposition. (But a regrettably unavoidable one in this case.) It's not exactly dangerous, but you will want to be careful, and to coordinate this with your team.

There are a few tools that you could use to remove large blobs from your repository. You could use filter-branch, which will walk through history and run a command to rewrite each commit. This is powerful but difficult and complex, especially if you have to deal with multiple branches. You could use BFG Repo-Cleaner, which is a huge usability improvement over filter-branch but requires a separate install. Or you could use Git LFS itself, which recently added migration support built into the tool.
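Whichever tool you choose, it helps to know which files are actually bloating the repository. The following is a minimal sketch using plain git plumbing commands (it is not a feature of filter-branch, BFG or Git LFS); the throwaway repository and its file names are invented for the demonstration.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity for the demo
git config user.email "user@example.com"

# One small text file and one larger data file.
echo "hello" > readme.txt
head -c 200000 /dev/zero > big.dat
git add . && git commit -q -m "add files"

# List every object reachable from any ref, keep only the blobs,
# and sort them by (uncompressed) size, largest first.
# Note: paths containing spaces would need more careful handling than $3.
largest=$(git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" { print $2, $3 }' |
    sort -rn |
    head -n 5)
echo "$largest"
```

Run against your real repository, the last pipeline alone gives you a short list of candidates to move into LFS.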

Here's how to migrate a repository:

  1. Tell your team to stop working for a while. Yes, really; you're not going to want to do this near the end of a sprint. Ideally, you'll do this on a weekend, when nobody's doing any work, and you can shut down the repository temporarily. (How long? You might want to do a dry run of step 5 to find out - you can do that while people are working to get an idea of how long it will take.)

    Your users should push any work-in-progress branches to the server, even if they're not ready to be merged into master; having them on the server ensures that they'll be included in your rewrite.

    I'd encourage you to set up protected branches or branch policies to keep the branch closed temporarily, just to make sure that you don't have any conflicts.

  2. Clone a new copy of the repository somewhere locally. You can do a mirror clone, which will clone the remote into a bare repository locally so that you don't check out all your large files unnecessarily. It will also get all the remote branches and tags, and set up remote tracking branches so that you can safely perform a force push.

    git clone --mirror git@github.com:ethomson/repo.git
    
  3. Optional, but encouraged: back up your repository. Maybe physical media to an off-site location, or pushing a zip up to OneDrive or Dropbox. If you're confident about your backup policies, maybe you could skip this, but when you're about to undertake a repository rewrite, you might also want to think about a backup.

    You could also push a backup branch to quickly recover from if things go horribly wrong. They shouldn't, of course, but this is your source code, right? This should do the trick:

    git push origin master:master_before_lfs

    Repeat for any important integration branches.

  4. Delete any local pull request test merge branches. Hosting providers like GitHub and Azure Repos store their pull request information as hidden branches. You'll normally never see these, since they're in a special namespace, but a mirror clone will pull them down.

    You wouldn't want to rewrite these and try to push them (in fact, you wouldn't be able to), so you should delete them to avoid unnecessary work during rewriting and to simplify your push.

    A quick bash command will take care of that, iterating over each of the pull request references in the refs/pull namespace, and then deleting them:

    git for-each-ref --format='%(refname)' refs/pull/ | xargs -n 1 git update-ref -d
    

    (In case you were wondering from yesterday's tip, this is exactly the sort of thing that I'd rather do in a proper language instead of piping things into xargs. Ugh.)

  5. Rewrite your git history to add the files into the LFS area for all your existing commits. The git lfs migrate tool can take care of this for you, and you can either specify the names of the files that you want to import, or let it detect the large files automatically.

    To move the *.dat files out of history, converting them to LFS files in every branch:

    git lfs migrate import --everything --include='*.dat'
    

    Or to let the tool find and migrate all your large files, you can omit the --include option:

    git lfs migrate import --everything
    

    This will configure the .gitattributes correctly at every commit, and place all the specified (or detected) files into LFS. The lfs migrate tool will have rewritten your branches with a series of new commits that should be identical to your previous branches except the large files will now be added to your large file storage.

  6. Now you can force push the branches that you rewrote. I strongly encourage the safe force push using:

    git push --mirror --force-with-lease
    

    This will push all the rewritten upstream branches, including any branches that were opened for pull requests, and it will update all the branches at once. This is important, because if you only pushed the master branch, or your other integration branches, any pull requests opened against them will suddenly make no sense.

    Fundamentally, you've rewritten all of history, so your open pull requests would now no longer share a merge base with the master branch. This simply can't be merged, so GitHub will just close your pull requests:

    Closed from force push

    This doesn't happen, though, if you also push the rewritten pull request branches. In that case, GitHub notices that they were both force-pushed and recalculates the merge properly. So any open pull requests stay open.

    Force pushed PR

  7. Run continuous integration builds, and validate that they're working properly. Your CI system should include Git LFS, though it may require you to explicitly enable it as an option. There's a simple checkbox in the "get sources" configuration of Azure Pipelines:

    Enable LFS in Azure Pipelines

    Once you've enabled that, queue a build on your master branch (and any other important integration branches) and make sure that things are working correctly.

    GitHub and Azure Pipelines

  8. Clone the repository yourself and make sure that things are working correctly; you should see the message Filtering content as the last line of the clone output. This indicates that LFS is operating (it runs as a git "filter").

    Force pushed PR

    You should also work with your repository to make sure that it builds locally. Although you should have a high degree of confidence from the continuous integration builds, you'll want to ensure that things are working locally as well. Check out other integration branches, if you have them, or pull request branches, to ensure that things worked well during the migration across all branches.

  9. Get your users to reclone their repositories. I am generally loath to encourage people to ever do this - a lot of people follow the delete and re-clone strategy when something goes wrong in their work and it's quite an anti-pattern. There are really no problems in Git that are so hairy that you need to get rid of the repository and start over; solving these problems without doing that will really level-up your Git knowledge.

    This is the exception - we've rewritten every branch and there's no simple way to update. Instead, cloning a new copy of the repository is the right way to go. (Bonus: it's now much quicker to clone the repository since you don't have all those messy binaries.)
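Step 4 above pipes reference names into xargs, which runs git update-ref once per reference. As a possible alternative, git update-ref can read its instructions from standard input and apply them in a single invocation. This is only a sketch: the refs/pull entries below are fabricated in a throwaway repository so that there is something to delete.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity for the demo
git config user.email "user@example.com"
git commit -q --allow-empty -m "initial commit"

# Fake two pull request refs like the ones a mirror clone pulls down.
git update-ref refs/pull/1/head HEAD
git update-ref refs/pull/1/merge HEAD

# Emit one "delete <ref>" instruction per reference and hand them all
# to a single git update-ref process.
git for-each-ref --format='delete %(refname)' refs/pull/ |
    git update-ref --stdin

remaining=$(git for-each-ref refs/pull/ | wc -l | tr -d ' ')
echo "pull refs remaining: $remaining"
```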

Note that if you pushed any backup branches, those binaries still exist on the remote, so you'll need to delete those backup branches before that space is reclaimed on your hosting provider.
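As a rough illustration of that cleanup, the sketch below pushes a backup branch and later deletes it again. A local bare repository stands in for the hosting provider, and the paths and identity are assumptions made for the demo; against a real remote only the two git push lines in the middle matter.

```shell
#!/bin/sh
set -e

work=$(mktemp -d)

# A bare repository standing in for the hosting provider.
git init -q --bare "$work/remote.git"
git clone -q "$work/remote.git" "$work/clone" 2>/dev/null
cd "$work/clone"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"        # hypothetical identity for the demo
git config user.email "user@example.com"
git commit -q --allow-empty -m "initial commit"
git push -q origin master

# Before the rewrite: keep a copy of master under another name.
git push -q origin master:master_before_lfs

# ... the migration and force push would happen here ...

# After everything checks out: delete the backup branch on the remote.
git push -q origin --delete master_before_lfs

heads=$(git ls-remote --heads origin | awk '{ print $2 }')
echo "$heads"
```

When the space is actually reclaimed after the branch is deleted is up to the hosting provider.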

This is actually surprisingly straightforward. There's one gotcha though, and that's if you use a fork-based workflow.

🚨🚨 Forks 🚨🚨

If you're working on an open source or inner source project where users fork the main repository, make changes in a branch in their fork, and then open a pull request from the fork back to the main repository, you'll have some difficulty with this workflow. That's because when you did the rewrite, you weren't able to update the forked repositories.

In that case, when you do the force push, those pull requests will be closed, since there's no common history between the target branch and the pull request.

If you do have pull requests from forks, you should create branches for them in the main repository to ensure that they get rewritten. Users can then update their fork after the rewrite.

But ultimately, there are surprisingly few "gotcha"s in this process, and it will help you manage the large files in your repository.

Advent Day 22: libgit2 and Friends

December 22, 2018  •  1:04 PM

This is day 22 of my Git Tips and Tricks Advent Calendar. If you want to see the whole list of tips as they're published, see the index.

I love scripting against my git repository. Often I'll have an idea for a quick and dirty tool that will help me get my work done - for example, git recover, which helps you locate files that were accidentally deleted from your working tree that never got checked in.

But scripting against git generally means writing Bourne shell scripts and parsing text output. That's fine for simple jobs, but it gets frustrating quickly if you're trying to do something more complicated.

That's when I reach for libgit2.

libgit2 is a pure C library that helps you manage your Git repository. We've taken some of the guts of git itself, and rewritten other parts, to provide you with a library that has a sensible API, is re-entrant, and has a proper object model. This lets you code against your Git repository very efficiently, without trying to parse text output.

For example, to list the paths in the index:

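#include <stdio.h>
#include <git2.h>

int main(void)
{
    git_repository *repo = NULL;
    git_index *index = NULL;
    const git_index_entry *entry;
    size_t i;

    /* libgit2 must be initialized before any other call */
    git_libgit2_init();

    /* open the repository and load its index, bailing out on failure */
    if (git_repository_open(&repo, "/tmp/myrepo") < 0 ||
        git_repository_index(&index, repo) < 0)
        return 1;

    /* print the path of every entry in the index */
    for (i = 0; i < git_index_entrycount(index); i++) {
        entry = git_index_get_byindex(index, i);
        puts(entry->path);
    }

    git_index_free(index);
    git_repository_free(repo);
    git_libgit2_shutdown();
    return 0;
}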

Of course, it only lets you code efficiently if you happen to like C. And I do - at least most of the time - but you may not. Thankfully, there are a number of "language binding" projects that wrap libgit2 in other languages. There's LibGit2Sharp for .NET, Rugged for Ruby, and NodeGit for Node.js. Plus a bunch of others.

libgit2 powers parts of all the major Git hosting providers, including GitHub, GitLab, Bitbucket and Azure Repos. Plus several GUI clients, including GitKraken and gmaster.

I'm happy with some of the contributions that were made to libgit2 this year, from bug fixes and security updates to work to stabilize the API to big new features like patch application support. And I'm excited about what we'll do in 2019.

But I'm more excited to know what you build with it!

Advent Day 21: Renormalizing Line Endings

December 21, 2018  •  1:04 PM

This is day 21 of my Git Tips and Tricks Advent Calendar. If you want to see the whole list of tips as they're published, see the index.

Update: Alex Tercete provided a helpful correction about the merge -Xrenormalize command; I've updated this with a correction.

This month I've given a few tips and tricks that I think are important - for example, getting your line ending configuration right. But a lot of these settings require you to get them set up in the initial commit, ideally before you start adding content to your repository.

But development - like most of life - isn't always so simple. What if you've already committed changes? And you've been using core.autocrlf inconsistently, instead of using a .gitattributes? So you've got a mix of line endings in your repository?

Thankfully, that's an easy fix.

  1. Add your .gitattributes: First things first, get your .gitattributes file set up correctly, so that line endings are "normalized", that is, converted to Unix-style (LF) line endings when you add files to your repository.
  2. Run git add --renormalize . && git commit: It's that simple. This will go through every file in your repository and "renormalize" it: that is, re-add it to the repository with the line ending configuration that you specified in your .gitattributes (in step 1).

So now any files that were, erroneously, checked in with Windows-style line endings will now be "renormalized" to have Unix-style line endings.
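To see the two steps in action, here is a minimal sketch in a throwaway repository. The file name, the identity and the one-line .gitattributes (* text=auto) are assumptions chosen for the demo; your real .gitattributes may well be more detailed.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity for the demo
git config user.email "user@example.com"
git config core.autocrlf false             # store the CRLF file untouched

# Commit a file with Windows-style (CRLF) line endings, as if by mistake.
printf 'line one\r\nline two\r\n' > notes.txt
git add notes.txt && git commit -q -m "add notes with CRLF"

# Step 1: declare that text files are normalized to LF in the repository.
printf '* text=auto\n' > .gitattributes
git add .gitattributes

# Step 2: re-add every tracked file through the new configuration, then commit.
git add --renormalize .
git commit -q -m "renormalize line endings"

# Count the carriage returns left in the committed copy of the file.
cr_count=$(git cat-file -p HEAD:notes.txt | tr -cd '\r' | wc -c | tr -d ' ')
echo "carriage returns in HEAD:notes.txt: $cr_count"
```

The .gitattributes file is added separately because git add --renormalize only looks at files that are already tracked.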

A word of warning, though: this might cause big ol' merge conflicts! Since you've changed every line of some files, to have new line endings, any change that somebody else has made in another branch is almost certain to conflict.

Renormalized

Unfortunately, this will require a bit of manual intervention. You have two choices, as the author of the pull request: merging with or rebasing onto the master branch, with the renormalize option:

Merge

git merge -Xrenormalize --no-commit master
[fix any conflicts as necessary]
git add --renormalize .
git commit

I previously indicated that git merge -Xrenormalize was enough; it's not, as that will only affect the way automerged or conflicting files are dealt with. It won't renormalize any new changes that you've made in your branch. So the best way to do this is to do the merge with a no-commit option, fix up any conflicts that might exist, and then do a git add --renormalize before committing the merge.

(Thanks to Alex Tercete for the correction.)

Rebase

git rebase -Xrenormalize master

Whether you prefer a merge or a rebase strategy, once you've integrated the changes, you can push them back up in the pull request branch. The pull request can now be automerged into the master branch, and will have the correct line endings.

Advent Day 20: Force with Lease

December 20, 2018  •  1:04 PM

This is day 20 of my Git Tips and Tricks Advent Calendar. If you want to see the whole list of tips as they're published, see the index.

On one of my projects, we usually use a workflow where we create short-lived topic branches and try to keep a reasonably clean history. This means that we often rebase our topic branches onto the master branch so that we'll have a nice history when we merge the pull requests.

Merge History Example

This all sounds nice, but there's a bit of a "gotcha" when it comes to this workflow: if you push a branch to the server and you later rebase that branch, the server will think that you're not up-to-date, since you don't have the latest commits on that branch:

To github.com:ethomson/repo.git
 ! [rejected]        master -> master (non-fast-forward)
error: failed to push some refs to 'git@github.com:ethomson/repo.git'
hint: Updates were rejected because the tip of your current branch is behind
hint: its remote counterpart. Integrate the remote changes (e.g.
hint: 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

You'll now need to actually force push to the remote, basically removing the existing commits and pushing up your new, rewritten branch in their place. And this is a bit scary, since there's really no checking. Unless, of course, you use the --force-with-lease option.

The "force with lease" option basically does an atomic compare-and-swap on the branch that you're pushing to, based on the last information you fetched. In other words, it ensures that you're force pushing to remove what you think that you're force pushing to remove.

Assuming that you're up-to-date (that is, you've just done a git fetch from the remote), when you run git push --force-with-lease, you'll overwrite the remote:

To github.com:ethomson/repo.git
 + d008110...3653e05 master -> master (forced update)

However, if you're not up-to-date… maybe somebody snuck in and pushed to your branch while you weren't paying attention. If you were only to git push --force then you would overwrite their commits, losing them. But if you git push --force-with-lease then you will not overwrite the branch and lose their changes.

To github.com:ethomson/repo.git
 ! [rejected]        master -> master (stale info)
error: failed to push some refs to 'git@github.com:ethomson/repo.git'

This makes git push --force-with-lease a good default, and I suggest you build the muscle memory of using it (or set an alias if you prefer). You can always do a "true" force if you really need to, but this helps avoid accidents.
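The rejected case is easy to reproduce locally. In this sketch a bare repository stands in for the server and two clones play the two people involved; the names alice and bob, the branch name topic and the alias name pushfl are all invented for the demonstration.

```shell
#!/bin/sh
set -e

work=$(mktemp -d)
git init -q --bare "$work/server.git"

# Alice creates the topic branch and pushes it.
git clone -q "$work/server.git" "$work/alice" 2>/dev/null
cd "$work/alice"
git symbolic-ref HEAD refs/heads/topic
git config user.name alice
git config user.email alice@example.com
git commit -q --allow-empty -m "alice: first version"
git push -q origin topic

# Bob clones, then pushes a commit to the same branch behind Alice's back.
git clone -q -b topic "$work/server.git" "$work/bob"
git -C "$work/bob" -c user.name=bob -c user.email=bob@example.com \
    commit -q --allow-empty -m "bob: sneaky commit"
git -C "$work/bob" push -q origin topic
bob_commit=$(git -C "$work/bob" rev-parse HEAD)

# Alice rewrites her commit and tries the safe force push through an alias.
git config alias.pushfl "push --force-with-lease"
git commit -q --amend --allow-empty -m "alice: rewritten version"
if git pushfl -q origin topic 2>/dev/null; then
    result="overwritten"
else
    result="rejected"
fi

server_tip=$(git --git-dir="$work/server.git" rev-parse refs/heads/topic)
echo "push was $result"
```

The alias is set only in Alice's clone here; adding --global to the git config command would make it available in every repository. Keep in mind that the check is made against what you last fetched, so fetching without looking at what came in weakens the protection.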