
Inside Crowdstrike's Deployment Process

Last week Crowdstrike released their Preliminary Post Incident Review of the July 19th outage. It’s great to see this kind of transparency for any kind of incident, though I do have to subtract some marks given this appears to be the first time the company have ever published one (according to the first 14 pages of Google Advanced Search results, at least). I am hopeful we’ll see more transparency here in the future.

There’s a lot of jargon to wade through, and much of the really interesting stuff comes from what’s not in there, so let’s begin.

One process for code, another for config

The piece of software at the heart of this outage was the “Falcon Sensor”, which from now on I’m going to refer to as the “sensor”. This is the software installed on all of the affected Windows machines, and its job is to protect them from bad actors. In the document, Crowdstrike explains that there are two different ways the sensor gets updates:

Sensor Content: This is code; it’s basically new versions of the sensor itself. These might improve performance, fix bugs, or allow it to monitor the machine in new ways when looking for malicious activity.

Rapid Response Content: This is “not code or a kernel driver” and contains “configuration data”. This configuration data tells the sensor what to look for using the tools it already has. This means Crowdstrike can quickly react to new threats without changing the whole sensor.

Deploying Code

The Incident Review explains the deployment process for Sensor Content (code) in detail, and we’ll see that it’s pretty reasonable:

  • There is both automated and manual testing
  • It’s rolled out first to Crowdstrike themselves (dogfood)
  • It’s then rolled out in stages to early adopters
  • Even once it is generally available, users have some control over which versions get deployed where, meaning that they can limit the blast radius of a bad “Sensor Content” update.
The deployment process for “Sensor Content”
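To make that concrete, here’s a minimal sketch of what a staged code rollout of this kind might look like. The stage names, percentages, `Build` class and fleet size below are my own illustration of the process described in the review, not anything Crowdstrike has published:

```python
# A minimal, illustrative sketch only: none of these names or numbers come
# from Crowdstrike, they just show the shape of a staged "Sensor Content"
# (code) rollout.
from dataclasses import dataclass


@dataclass
class Build:
    version: str
    passed_automated_tests: bool = False
    passed_manual_qa: bool = False


# Progressively larger rings: dogfood -> early adopters -> everyone.
ROLLOUT_STAGES = [
    ("internal-dogfood", 0.01),
    ("early-adopters", 0.05),
    ("general-availability", 1.0),
]


def roll_out(build: Build, fleet_size: int) -> None:
    """Walk a code build through progressively larger rings."""
    if not (build.passed_automated_tests and build.passed_manual_qa):
        raise RuntimeError(f"{build.version}: testing gates not passed")

    for stage, fraction in ROLLOUT_STAGES:
        hosts = int(fleet_size * fraction)
        print(f"{build.version}: deploying to {stage} (~{hosts} hosts)")
        # In a real rollout you'd watch crash rates and health metrics here,
        # and halt if anything regresses. Even at general availability,
        # customers can keep fleets pinned to older sensor versions, which
        # limits the blast radius of a bad code update.


roll_out(Build("7.11", passed_automated_tests=True, passed_manual_qa=True),
         fleet_size=8_500_000)
```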

It’s good to see an in-depth, responsible deployment process here. But to be clear: this had almost nothing to do with the July 19th outage. The outage was caused by a config update, not a code update. And as you can imagine, an outage of this magnitude would be very unlikely to make it through the amount of testing in the above process.

(Reading between the lines: While this provides helpful context, it also sets the scene and shows that Crowdstrike do actually know how to deploy something safely. Which is probably the main purpose of this information)

Deploying Config

Given it was a configuration change that caused the outage, it’s no surprise that we get a lot of information about “Rapid Response Content”. This section is very heavy on jargon and goes into a large amount of detail about the internal architecture of the sensor itself, so I’ll try my best to explain here:

Template Type: A type of configuration block that configures a certain feature within the sensor. For example, the one relevant to this incident was the “Inter-Process Communication (IPC)” template type. Think of this as a kind of thing the sensor is capable of looking for.

Template Instance: A piece of config that tells a Template Type how to look for a given thing. Relevant to this incident was a configuration that told the IPC template type how to “detect novel attack techniques that abuse Named Pipes”

Channel File: One or many Template Instances bundled together into a proprietary binary file (a channel file), which is shipped to the sensor and written to disk as part of an update.

Content Interpreter: Reads the channel files from disk and parses them, before sending the configuration to the sensor. This is an important component in protecting the sensor as “The Content Interpreter is designed to gracefully handle exceptions from potentially problematic content”

The mechanism by which the sensor is configured. Channel files distributed by the Falcon Sensor Cloud Platform are then loaded by the local Falcon Sensor.

While there is a lot of proprietary terminology here, the configuration mechanism they are describing is not terribly complex or surprising. The sensor appears to be a modular service whose behaviour can be changed by one or many config files (channel files) that contain one or many sets of instructions (template instances). This is similar to the pattern used by web servers like Nginx or Apache, where many config files, which in turn contain many “directives”, are combined to configure how the service should behave.

The key difference between this situation and that of Nginx or Apache is that you don’t control the config, Crowdstrike do.
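As a rough mental model, here’s how those pieces seem to fit together. The class names below mirror the review’s terminology, but the fields and logic are my own guesses rather than Crowdstrike’s actual implementation:

```python
# A rough mental model only: the field names and logic are illustrative
# guesses based on the terminology in the review, not Crowdstrike's real
# data structures.
from dataclasses import dataclass, field


@dataclass
class TemplateInstance:
    """A piece of config telling a Template Type what to look for."""
    template_type: str  # e.g. "IPC" (Inter-Process Communication)
    parameters: dict    # e.g. which Named Pipe behaviour should be flagged


@dataclass
class ChannelFile:
    """One or many Template Instances bundled together and shipped to the sensor."""
    name: str
    instances: list[TemplateInstance] = field(default_factory=list)


class ContentInterpreter:
    """Reads channel files from disk and hands the resulting config to the
    sensor, combining many files full of instances much like a web server
    combines config files full of directives."""

    def load(self, channel_files: list[ChannelFile]) -> list[TemplateInstance]:
        active = []
        for cf in channel_files:
            for instance in cf.instances:
                try:
                    # "designed to gracefully handle exceptions from
                    # potentially problematic content"
                    self._check(instance)
                    active.append(instance)
                except Exception as exc:
                    # The graceful path: skip the bad instance rather than
                    # crash a kernel-level driver.
                    print(f"skipping bad instance in {cf.name}: {exc}")
        return active

    def _check(self, instance: TemplateInstance) -> None:
        if not instance.parameters:
            raise ValueError("empty parameters")


interpreter = ContentInterpreter()
active_config = interpreter.load([
    ChannelFile("C-00000291", [
        TemplateInstance("IPC", {"named_pipe_pattern": r"\\.\pipe\example"}),
    ]),
])
```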

No, I said “deploying” config

You might have noticed that while the above section tells us a lot about how the config works, it doesn’t tell us as much about how the config is deployed, as in: which pieces of config should be deployed to which customers, and when?

We do get a couple of hints in the review, such as: “Template Instances are created and configured through the use of the Content Configuration System, which includes the Content Validator that performs validation checks on the content before it is published.” But detailed information about testing, internal dogfooding, and deployment to early adopters is absent.
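Taking that sentence at face value, the publishing pipeline for Rapid Response Content reads as something like the sketch below. Every function name here is my own placeholder rather than a real Crowdstrike system; the interesting part is what isn’t in it:

```python
# Inferred from the single sentence quoted above; the function names are
# placeholders, not real Crowdstrike systems.
def content_validator_check(template_instance: dict) -> None:
    # "performs validation checks on the content before it is published"
    if "template_type" not in template_instance:
        raise ValueError("invalid content")


def publish_to_all_customers(template_instance: dict) -> None:
    print(f"publishing {template_instance} to the entire fleet")


def publish_rapid_response_content(template_instance: dict) -> None:
    content_validator_check(template_instance)
    publish_to_all_customers(template_instance)
    # Conspicuously absent from the review: any config equivalent of the code
    # path's dogfooding, early-adopter ring, or staged rollout, e.g.
    #   deploy_to_internal_fleet(...)
    #   deploy_to_early_adopters(...)
    #   staged_rollout_with_health_checks(...)


publish_rapid_response_content({"template_type": "IPC", "parameters": {}})
```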

(Reading between the lines: This would have been a good opportunity to call out testing processes if they existed. I’ll leave it to the readers to decide if this was an oversight or not… )

What Went Wrong

So if the config is not that special, how did it go so wrong? For that we need to look at the timeline:

  • Feb 28th: Sensor 7.11 is released. This includes the new Inter-Process Communication (IPC) template type. This follows the well-tested code deployment process, with many stages of testing and rollout. However, it appears that at the time of deployment the sensor isn’t actually using the new IPC template type yet.
  • Mar 5th: Stress testing on the new template type is completed. You’ll note that this is being stress tested after it’s been deployed to production in version 7.11, but since it isn’t being used I don’t see this as an issue
  • Mar 5th: The first Template Instance (config) is deployed that uses the newly stress-tested IPC template type, everything is fine!
  • April 8th → 24th: Three more IPC template instances are deployed during this period, which “performed as expected in production”.

    This is the first surprise of the whole document so far. The previous deployment (on the 5th) was completed after substantial stress testing in the staging environment, since this was a brand new type of template (IPC). However, it would appear that the second, third and fourth deployments using that template type weren’t subject to the same scrutiny, as no testing is mentioned at all.
  • July 19th: The big day. This update contained two new changes, one of which was the one that broke everything. “Based on the testing performed before the initial deployment of the Template Type (on March 05, 2024), trust in the checks performed in the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production.”. We all know what happened next.
The timeline of the Crowdstrike incident

Let’s dive into the three factors that led to the decision to deploy this change:

  • Testing performed on March 5th: This testing covered the new IPC Template Type, but not the specific configuration that was being deployed on July the 19th. There seems to have been a belief within Crowdstrike that as long as the code had been tested, each individual piece of configuration didn’t need to be tested.
  • Previous successful deployments: The notion that deploying three different pieces of config over a three-month period means the fourth different piece of config (for what was still a pretty new feature) won’t cause issues doesn’t make sense to me. This seems to be an example of a faulty generalisation fallacy; however, there is potentially more information that we are missing. For example, maybe they did 1,000 deployments in this period, and the fact that this was only the fifth deployment of this new type was missed. It would still be a faulty generalisation fallacy, but it would be easier to understand.
  • Trust in the Content Validator: The Content Validator “performs validation checks on the content before it is published”, with the Content Interpreter behind it “designed to gracefully handle exceptions from potentially problematic content”. That graceful handling is critically important when operating a kernel-level driver, as unhandled exceptions will crash the entire machine (as we learned the hard way). In the end, it was a bug in the Content Validator that let the problematic content through and caused the outage. But this confidence was misplaced even if there hadn’t been a bug. Read on to find out why.
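To see why a passing Content Validator check was never a sufficient safety net on its own, independent of the specific bug, consider this deliberately simplified sketch. The validator and interpreter here are hypothetical stand-ins, not Crowdstrike’s code; the point is the category of failure, not the exact mechanism:

```python
# Hypothetical stand-ins: a validator that checks the *shape* of the config,
# and an interpreter whose expectations differ from what was shipped.
def content_validator(config: dict) -> bool:
    """Checks that the config is structurally valid, not how the sensor will
    behave when it actually consumes it."""
    return isinstance(config.get("fields"), list) and len(config["fields"]) > 0


def content_interpreter(config: dict, expected_field: int) -> str:
    """The sensor-side code that actually consumes the config."""
    # If the interpreter's expectations and the shipped config disagree,
    # a passing validation doesn't save you.
    return config["fields"][expected_field]


config = {"fields": ["pipe-pattern-1", "pipe-pattern-2"]}
assert content_validator(config)               # "valid" -> ship it
content_interpreter(config, expected_field=5)  # raises IndexError; in kernel
                                               # mode that's a crashed machine
```

The only way to catch that class of problem is to actually run the new config on real sensors before it reaches every customer, which is exactly the kind of staged rollout that was applied to code but, as far as the review tells us, not to config.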

Misplaced Trust

While the team tasked with deploying the Jul 19th changes clearly trusted that the Content Validator would prevent a bad configuration from taking down a customer, as stated in the Incident Review, other parts of the company must have had evidence to the contrary:

  • On June 27th, users on Reddit reported that a Crowdstrike config update had caused the sensor service to use 90% of the CPU, which in turn caused outages. This update had clearly passed the Content Validation process, but still caused customer outages.
  • On June 4th, Red Hat released a KB article relating to kernel panics caused by the Crowdstrike sensor process. This was a bug in the Linux kernel itself that the sensor was triggering, and wasn’t Crowdstrike’s fault. However, it does prove that config that has passed the Content Validator can cause kernel panics.

These incidents show that someone within Crowdstrike was aware that an outage like that of July 19th (a kernel panic as a result of an otherwise “valid” configuration) was possible, and yet I see no evidence in the post incident review of mitigations against this such as internal testing or staged rollout.

Config changes are dangerous too

As almost every company has found out the hard way, just because you’re changing config and not code doesn’t make the change safe. In fact, many of the largest outages in recent years have been caused by configuration changes, including those at Datadog, Reddit and Loom. Unfortunately Crowdstrike have learned this lesson the hard way, and we were taken along for the ride with them.

If you want to understand the blast radius and risks of your config changes, that’s what Overmind does. Try it now.
