92 Days to Fix a Bug?

January 12, 2015

Microsoft (disclaimer: my employer) is apparently complaining about Google publishing the details of a security vulnerability in keeping with Google's stated policies on the matter, but two days before Microsoft would have been able to issue a general fix on Patch Tuesday.

In some of the discussion on Hacker News, somebody suggested that Microsoft had not been diligent about fixing this bug in a timely manner:

it seems unlikely it realistically took MSFT 92 days to put together a patch. It seems far more likely they sat on the disclosure and did not prioritize it appropriately, after google continued to push for an estimated release date (which they never got).

I wanted to repost my response to that here, (hopefully) without weighing in on whether Google or Microsoft is right or wrong.

I very much doubt that this was simply "sat on". Not knowing the full breadth of this vulnerability, I can only speculate, but I'm making an educated guess based on several years of working at Microsoft and recently managing a fix for a security vulnerability in several versions of two products.

It can sound crazy to you (certainly, it sounded crazy to me at first), but taking 90 days to turn around a fix to a piece of software like Windows is not implausible.

First you have to understand the vulnerability: under what circumstances does it occur? Can we build some reliable tests so that we are convinced that we will know when we have fixed it?

Looking at the report of the vulnerability, are there other instances of problems that will lead to similar bugs? Because if you release a patch and then a week later somebody finds a similar vulnerability, you're going to have to do all of this over again, and not with the luxury of 90 days before the announcement.

Then you have to identify what versions of the software are affected. This is where you start sweating, because were you working on the product seven years ago? If you were, do you remember anything about it? Well, you're going to have to learn fast and hope that the architecture hasn't changed too much.

Then you fix the bug, probably only in one of the affected versions at first, which is probably the version of the software that you have on your dev box. As you're fixing it, poke around in the code to think about similar vulnerabilities that the report didn't find. Recall that Windows is not a small piece of software and building it on your dev box and being able to test your fix may take several hours.

Now you port that bug fix over to all the other versions of this software that are affected and that you still support. For a piece of software like Windows, this is a long list. Hope that the architecture and the code you're fixing haven't changed much or else you're digging in to remember how Windows 2008 worked and how to fix this problem there. Recall again that building each of these versions takes time.

Now you hand off your fixes to some other people who will independently verify your fixes on some of the supported versions. I say some, not all, because there's going to be another round of testing on the actual deliverable, which is the patch to the operating system.

Which another group is going to make. Most products at Microsoft have a build lab that will take a branch and create either the actual installer disk image or a patch to a previous version. In this case, they're going to create a patch to the latest supported version, for each supported version.

Fortunately, this can probably be done in parallel with the first round of testing if you're pretty sure that you nailed it. If you think that the testing (above) is likely to reveal some problem, then you should hold off handing it over to the build lab because these folks are some of the least appreciated parts of the development team. They're the ones who integrate all the various development teams' feature branches into master, resolve the easy conflicts and find the people who need to resolve the hard ones. They have a full-time job (and not a trivial one) before you bring your high-priority build to them, and when you ask them to dust off the build machines for a seven-year-old version of the product, they're going to graciously accept. But telling them you didn't get it right and asking them to start over on a new version is when you start bringing six-packs with your request.

Once the patch is created, it goes through the real testing. Because somebody's going to install this patch on all the supported versions, and the SKUs within those versions, to make sure that it works. When I say "it works", I don't just mean that the patch fixes the bug in question (though of course it has to do that), I mean that it also has to not regress any functionality. And that the patch can be installed and uninstalled cleanly. This is some annoying work and the longer this bug has been around, the more annoying it is. Remember "Windows Essential Business Server 2008"? Me neither, but somebody's going to be installing it on a VM and ensuring that your patch works there.

Let's assume that everything has gone well up to this point and you're ready to release it. Most product updates want to get included in Windows Update, of course, because you want your security fixes to just show up for the customer without their having to learn about them, download them and install them, because so few people do.

If you're going to miss a Patch Tuesday, then you need to start asking yourself whether you want to a) try to buy yourself some more time, b) put up several KBs with the patch and ask people to install them manually or c) wait it out until the next Patch Tuesday. What you decide will probably be some combination of those depending on the severity of the bug. There's a possible fourth option, which is to convince somebody that your patch needs to go out before Patch Tuesday and while I'm sure that happens, I'm not sure how it happens. I suspect when you're dealing with a bug so critical as to warrant that level of pain for the organization, it will happen.
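For a sense of how long option c) means waiting, here is a small sketch of the date arithmetic. It assumes the usual convention, which I haven't spelled out above, that Patch Tuesday is the second Tuesday of each month; the function names are mine and purely illustrative.

```python
import datetime as dt


def patch_tuesday(year, month):
    """Second Tuesday of the given month (assumed Patch Tuesday convention)."""
    first = dt.date(year, month, 1)
    # date.weekday(): Monday is 0, so Tuesday is 1.
    days_to_first_tuesday = (1 - first.weekday()) % 7
    return first + dt.timedelta(days=days_to_first_tuesday + 7)


def next_patch_tuesday(today):
    """First Patch Tuesday falling on or after `today`."""
    candidate = patch_tuesday(today.year, today.month)
    if candidate >= today:
        return candidate
    # This month's release has already passed, so roll over to next month.
    if today.month == 12:
        return patch_tuesday(today.year + 1, 1)
    return patch_tuesday(today.year, today.month + 1)


missed = dt.date(2015, 1, 14)  # the day after January 2015's Patch Tuesday
wait = next_patch_tuesday(missed) - missed
print(patch_tuesday(2015, 1), next_patch_tuesday(missed), wait.days)
```

Under that assumption, a fix that is ready one day after the January 2015 release waits 27 days for the next one, which is why missing the train is such an unpleasant decision.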

All of these steps take time. And a big organization like Microsoft moves slowly sometimes - it takes time to find the right people for all of these steps. This is especially difficult over the holidays, when many of the "right people" are out of the office.

And of course there's inevitably the clever person who says "hey guys? I just figured out another way to trigger this bug." And then you start over from the beginning.

Now you could argue that it's ridiculous that it took Microsoft 92 days to get a fix out the door, and maybe you'd be right. But that's another thing entirely. What's clear to me is that Microsoft did not simply wait until day 89 to start working on this.

(In relating this story, I would be remiss if I didn't thank Junio Hamano of Google, the maintainer of Git, who was kind enough to provide Microsoft with additional time to research and prepare patches for the range of our products that were affected by CVE-2014-9390.)