Dylan Ratcliffe

CrowdStrike Root Cause Analysis: Have they done enough?

CrowdStrike recently released their Root Cause Analysis (RCA) for the massive system outage on July 19th that impacted millions. This is a follow-up to their Preliminary Post Incident Review, which I analysed in a previous blog post. This RCA is more detailed, and appears to confirm some of my suspicions from the previous post, namely:

  • The Template Instance that caused the outage was not directly tested (there are “new test procedures to ensure that every new Template Instance is tested”)
  • Template Instances were not deployed to any kind of staging environment before being released to production (a finding is that “Each Template Instance should be deployed in a staged rollout”)

What Happened

This section starts with a paragraph and a half of marketing for CrowdStrike, assuring us that they use “powerful on-sensor AI” and that “each sensor correlates context from its local graph store”. Impressive as this sounds, it has been received poorly by the community, with some users on Reddit referring to it as “doublespeak” and “word salad”. We’ll let readers decide for themselves how relevant this is as the introduction to the root cause analysis for the biggest IT outage in history:

The CrowdStrike Falcon sensor delivers powerful on-sensor AI and machine learning models to protect customer systems by identifying and remediating the latest advanced threats. These models are kept up-to-date and strengthened with learnings from the latest threat telemetry from the sensor and human intelligence from Falcon Adversary OverWatch, Falcon Complete and CrowdStrike threat detection engineers. This rich set of security telemetry begins as data filtered and aggregated on each sensor into a local graph store.
Each sensor correlates context from its local graph store with live system activity into behaviors and indicators of attack (IOAs) in an ongoing process of refinement. This refinement process includes a Sensor Detection Engine combining built-in Sensor Content with Rapid Response Content delivered from the cloud. Rapid Response Content is used to gather telemetry, identify indicators of adversary behavior, and augment novel detections and preventions on the sensor without requiring sensor code changes.

We already knew that the outage was caused by a new Template Instance, deployed as part of Channel File 291 on the 19th of July (see my previous blog for an explanation of what this means). However, we now know the specifics of how this update caused systems to crash: the new Template Type that analysed Inter-Process Communication (IPC) defined 21 input parameter fields, but the “integration code that invoked the Content Interpreter with Channel File 291’s Template Instances supplied only 20 input values to match against”. This meant that when the system tried to read the 21st value, it was absent, resulting in an out-of-bounds memory read and a system crash.
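
To make the failure mode concrete, here is a minimal sketch in C of a matcher that trusts the field count declared by the Template Type rather than the number of values it was actually given. The names and structure are entirely hypothetical (CrowdStrike hasn’t published its sensor code); the point is just the mismatch between 21 expected fields and 20 supplied values.

```c
#include <stddef.h>
#include <stdio.h>

#define TEMPLATE_FIELD_COUNT 21   /* fields defined by the IPC Template Type        */
#define SUPPLIED_VALUE_COUNT 20   /* values actually passed by the integration code */

/* The matcher loops over the fields the Template Type declares, not the
 * values it was given, so the 21st access reads past the end of the array. */
static void match_template(const char *values[], size_t value_count)
{
    (void)value_count;  /* never consulted: the root of the bug */
    for (size_t i = 0; i < TEMPLATE_FIELD_COUNT; i++) {
        const char *v = values[i];  /* i == 20 is an out-of-bounds read */
        printf("field %zu -> %s\n", i, v ? v : "(missing)");
    }
}

int main(void)
{
    const char *supplied[SUPPLIED_VALUE_COUNT] = { "param-0" /* ... */ };
    match_template(supplied, SUPPLIED_VALUE_COUNT);
    return 0;
}
```

In a user-space program this is merely undefined behaviour; in a kernel-mode driver the same invalid read takes the whole host down with it, which is what turned one bad content file into a global outage.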

Was I right about the deployment process?

I think so. In my last post I suggested that the deployments on July 19th were not tested directly, but instead were deployed based on confidence that previous deployments hadn’t failed. The Findings and Mitigations section gives more detail about this; however, it is written in a way that could confuse readers into believing that the July 19th changes were in fact fully tested. I’ve added some additional context here in bold that might help readers understand:

Newly released Template Types are stress tested across many aspects, such as resource utilization, system performance impact and detection volume. For many Template Types, including the IPC Template Type, a specific Template Instance is used to stress test the Template Type by matching against any possible value of the associated data fields to identify adverse system interactions. In the case of the IPC Template Type, this was performed once, when the Template Type was new on March 5th 2024.
A stress test of the IPC Template Type with a test Template Instance was executed when it was first released in March, in our test environment, which consists of a variety of operating systems and workloads. The IPC Template Type passed the stress test and was validated for use, and a Template Instance was released to production as part of a Rapid Response Content update. This test did not, however, exercise the 21st input value which precipitated the outage on July 19th.
However, the Content Validator-tested Template Instances, including not only the one released in Channel File 291 but also the previous three released since March, did not observe that the mismatched number of inputs would cause a system crash when provided to the Content Interpreter by the IPC Template Type, because “Content validator testing” and “stress testing” are different forms of testing, and the former simply validates (hence the name) rather than actually evaluating the Template Instance in a real environment.

Mitigations

The mitigations that are being applied to prevent this happening again are as follows, and can be separated into direct responses to the issue, and more general process improvements:

  • Direct responses
    • Validating the number of input fields for Template Types
    • Adding runtime array bounds checking for Channel File 291 (this and the field-count check above are sketched after this list)
    • Fixing the number of inputs for the IPC Template Type (21 not 20)
    • Expanding testing to actually cover the 21st field
  • General Process Improvements
    • Adding even more checks to ensure that config has the expected number of fields
    • Testing each new Template Instance, rather than only once and then never again
    • Canary deployments, and customer control over Rapid Response updates
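
Since the RCA describes these checks but not the code behind them, here is a rough, hypothetical illustration of what the first two direct responses could look like: field-count validation when content is loaded, plus a runtime bounds check when fields are read.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Load-time check: reject content whose declared field count does not match
 * the number of input values the integration code will supply. */
static bool validate_field_count(size_t declared_fields, size_t supplied_values)
{
    if (declared_fields != supplied_values) {
        fprintf(stderr, "content rejected: %zu fields declared, %zu values supplied\n",
                declared_fields, supplied_values);
        return false;
    }
    return true;
}

/* Runtime check: never index past the end of the array, even if the
 * load-time validation was somehow bypassed. */
static const char *get_field(const char *values[], size_t value_count, size_t idx)
{
    return (idx < value_count) ? values[idx] : NULL;
}

int main(void)
{
    const char *values[20] = { "param-0" /* ... */ };

    /* The July 19th mismatch (21 declared, 20 supplied) is caught up front... */
    if (!validate_field_count(21, 20))
        fprintf(stderr, "mismatch caught before the content is ever evaluated\n");

    /* ...and even if it slipped through, reading the 21st field now degrades
     * to "no match" rather than an invalid memory read. */
    if (get_field(values, 20, 20) == NULL)
        fprintf(stderr, "field 21 is absent: treated as a non-match, not a crash\n");

    return 0;
}
```

Either check on its own would have been enough to stop the July 19th crash; having both is the kind of belt-and-braces approach you want in a kernel driver.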

Opinion - Have they done enough?

On the deployment side it’s very good to see both internal testing and canary deployments being implemented, especially since customers will have control over how Rapid Response Content updates are deployed. This should bring the deployment process for Rapid Response Content in line with that of Sensor Content (as described in the previous blog) and give customers some degree of control over the blast radius of these changes.

On the engineering side though, I would have expected to see a more thorough review of the coding and organisational practices that allowed a mismatch in the number of fields in a config file to be released, and allowed that release to completely crash the sensor and therefore the host. The RCA notes that “Bounds checking was added to the Content Interpreter function that retrieves input strings” but does not mention whether a review was conducted into:

  • What other locations in the codebase are using array accesses without bounds checking and could therefore be susceptible to the same issue
  • How much CrowdStrike is investing in code review and automated checks like linting, and how a relatively basic mistake like an out-of-bounds array access could be prevented during code review in the future

Given the consequences of an oversight like this when operating a kernel driver, I would have expected to see defensive programming techniques and input validation at more than one level within the sensor. Hopefully the two independent third-party software review vendors that have been contracted to conduct an additional review will focus on this at a code and process level, to prevent similar problems in the future.
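
To make that suggestion a little more concrete, here is a sketch of what validation at more than one level might look like, with a fail-safe path that quarantines a bad Template Instance instead of faulting. This is my own illustration of the principle, not anything described in the RCA, and every name in it is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical representation of one piece of Rapid Response Content. */
struct template_instance {
    size_t       field_count;   /* fields this instance expects to match against */
    const char **values;        /* values supplied by the caller                 */
    size_t       value_count;   /* how many values were actually supplied        */
    bool         quarantined;   /* set when any validation layer fails           */
};

/* Layer 1: structural validation when the content is loaded. */
static bool validate_on_load(const struct template_instance *t)
{
    return t->values != NULL && t->field_count == t->value_count;
}

/* Layer 2: bounds-checked access at evaluation time. */
static const char *field_at(const struct template_instance *t, size_t idx)
{
    return (idx < t->value_count) ? t->values[idx] : NULL;
}

/* Layer 3: if anything is still inconsistent, disable this piece of content
 * and keep the sensor (and the host) running instead of crashing. */
static void evaluate(struct template_instance *t)
{
    if (!validate_on_load(t)) {
        t->quarantined = true;
        fprintf(stderr, "template instance quarantined: inconsistent field count\n");
        return;
    }
    for (size_t i = 0; i < t->field_count; i++) {
        const char *v = field_at(t, i);
        if (v == NULL)
            continue;  /* a missing value means "no match", never a fault */
        /* ... match v against live telemetry here ... */
    }
}

int main(void)
{
    const char *vals[20] = { "param-0" /* ... */ };
    struct template_instance bad = { .field_count = 21, .values = vals, .value_count = 20 };

    evaluate(&bad);  /* rejected at layer 1; the host stays up */
    return 0;
}
```

The design point is the last layer: when a piece of content turns out to be inconsistent, the sensor should disable that content and carry on, rather than letting an invalid read fault in kernel mode and take the machine with it.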
