GitHub Actions Day 6: Fail-Fast Matrix Workflows

December 6, 2019

This is day 6 of my GitHub Actions Advent Calendar. If you want to see the whole list of tips as they're published, see the index.

So it hasn't even been a week of these posts about GitHub Actions, and I've written a lot about matrix workflows already. If you hadn't guessed, I'm a big fan. 😍

But if you're getting started setting up your first matrix workflow, then there's a caveat that you need to be aware of: by default, matrix workflows fail fast. That is to say: if one of the jobs in the matrix expansion fails, the rest of the jobs will be cancelled.

This behavior is often very beneficial: if you're running pull request validation builds, and one of the builds in the matrix fails, you may simply not care whether the rest succeed or not. Any failure is enough to indicate a problem that would keep you from merging the PR.

But when you're creating a workflow from scratch, you may need to iterate a bit to get it working right the first time. And when jobs fail because there's a problem in the workflow setup -- and not in the code itself -- it can be helpful to turn off the fail-fast behavior as a debugging tool.

Let's say that you have a workflow that works great on Linux, and you want to use a matrix to expand that out to run on macOS and Windows as well. For a simple workflow, this might Just Work. But for something more complex, you might need to set up some dependencies or install some tools before it works. So it's very possible that your workflow that runs on Linux won't work on macOS or Windows without some modification.
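Here's a rough sketch of what that expanded workflow might look like. The workflow name, the job id and the echo steps are placeholders I made up for illustration, not part of any real project. The Windows-only step just shows one way to add platform-specific setup.

```yaml
# Hypothetical workflow: the name, job id and echo steps are placeholders.
name: CI
on: [push, pull_request]

jobs:
  build:
    # One job is created for each entry in matrix.os.
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
    steps:
      - uses: actions/checkout@v4
      # Platform-specific setup runs only on the runner that needs it.
      - name: Install Windows-only tools
        if: runner.os == 'Windows'
        run: echo "install Windows dependencies here"
      - name: Build
        run: echo "building on ${{ matrix.os }}"
```

Since there's no fail-fast key here, this workflow gets the default behavior: the first failing job cancels the others.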

So what happens when you run this new matrix workflow the first time? Your Linux, macOS and Windows jobs will all be started. As soon as one of them fails (most likely the macOS job or the Windows job), the jobs that are still running will be cancelled.

Imagine that it's the Windows job that fails first. You'll see:

[Image: Windows Fail]

OK, so you decide that you need to fix up the Windows job. So you take a look at what went wrong, update your workflow and then push up the changes to queue a new build. But, since queueing and scheduling aren't very deterministic, maybe this time the macOS job finishes -- and fails -- first. Now your Windows job gets cancelled before you could even find out if your fix worked:

[Image: Mac Fail]

Ugh. When you're working on debugging your workflow, you can turn off this behavior by setting fail-fast: false:

strategy:
  matrix:
    os: [ubuntu-latest, macos-latest, windows-latest]
  fail-fast: false

Now the workflow won't be cancelled at the first failed job. It will allow both the Windows and macOS jobs to run to completion.

[Image: Mac Fail]

Turning off fail-fast will help you more easily iterate on your workflow. Just be sure to turn it back on when you're ready to run in production! That will help you save minutes (and money) on your CI runs.
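Since fail-fast is on by default, turning it back on can be as simple as deleting the line. If you'd rather be explicit about it, a minimal sketch (reusing the same matrix as the example earlier in this post) might look like this:

```yaml
strategy:
  matrix:
    os: [ubuntu-latest, macos-latest, windows-latest]
  # true is the default, so this is equivalent to omitting the key.
  fail-fast: true
```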