It turns out that when things go wrong, it’s the things you don’t know you don’t know that burn you. The STELLA report by David D. Woods of Ohio State University studied large IT outages, explaining why they were so impactful and how “unknown unknowns” force users to rebuild their mental models from scratch.
When things go wrong
In all the examples that the team studied as part of the report, the outage exposed that operators were fundamentally wrong about how the system worked. Solving these outages required operators to rebuild their mental model of the system multiple times to explain the behaviour and devise a solution. This is partly because the outages tended to occur within areas of a system of which no one person had a detailed mental model, and partly due to the sheer complexity of the outages. Here are some details that were common to all cases:
- Each anomaly arose from unanticipated, unappreciated interactions between system components.
- There was no 'root' cause. Instead, the anomalies arose from multiple latent factors that combined to generate a vulnerability.
- The vulnerabilities themselves were present for weeks or months before they played a part in the evolution of an anomaly.
- The events involved both external software/hardware (e.g. a server or application supplied by a vendor) and locally-developed, maintained, and configured software (e.g. programs developed 'in-house', automation scripts, configuration files).
- The vulnerabilities were activated by specific events, conditions, or situations.
- The activators were minor events, near-nominal operating conditions, or only slightly off-normal situations.
It’s no coincidence that outages happen in areas where we don’t have a good mental model. They happen there because we don’t have a good mental model.
The legacy of outages
The way these outages often manifest at a company is: a DevOps engineer makes a configuration change, which causes an outage in a component that unexpectedly relied on the old value (a hypothetical sketch of such a hidden dependency follows the list below). In the incident review, the team discovers that the process was followed, yet things still broke; therefore, we must need more process. This eventually leads to processes that are more focused on diluting blame than actually building confidence. We call this Risk Management Theatre, and it affects more companies than you would think:
- A company experiences an outage and adds more process to try to avoid this problem in the future. This increases the lead time for changes
- Compared to high-performing companies, low-performing companies that engage heavily in risk management theatre have a 440x longer lead time. This leads to a much lower deployment frequency
- A lower deployment frequency means each change must be 46x bigger if low performers are to keep up with their high-performing counterparts
- Larger changes, and less practice in making them, mean changes are 5x more likely to fail
- All of the above means that when things do go wrong, they go wrong in a big way, taking 96x longer to recover from
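To make the hidden-dependency scenario above concrete, here is a minimal, entirely hypothetical sketch (the service names, timeout values, and coupling are invented for illustration): a configuration change that is perfectly valid on its own, but that silently breaks an assumption baked into a neighbouring service.

```python
# Hypothetical sketch of a latent coupling between two services' configuration.
# Team A raises service-a's request timeout to cope with slow queries, not
# knowing that service-b's upstream deadline was tuned around the old value.
# Nothing fails at the moment of the change; the vulnerability only activates
# when service-a actually runs slowly.

SERVICE_A_TIMEOUT_S = 30            # was 5 before the change
SERVICE_B_UPSTREAM_DEADLINE_S = 10  # tuned long ago, documented nowhere


def service_b_is_safe() -> bool:
    # service-b waits on service-a, so its worst-case latency is bounded by
    # service-a's timeout. After the change, that bound exceeds service-b's
    # own deadline, so service-b's callers start timing out.
    return SERVICE_A_TIMEOUT_S <= SERVICE_B_UPSTREAM_DEADLINE_S


if __name__ == "__main__":
    print(service_b_is_safe())  # False -- but no one knew to check this
```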
Doesn’t observability solve this?
Observability has helped engineering and DevOps teams better understand how their systems perform by implementing millions of metrics, ingesting terabytes of logs, and tracing a single request through tens (hopefully not hundreds) of microservices. But this approach isn’t well suited to the kind of outage that occurs outside a user’s mental model.
Observability tools let you see the outputs of a system in great detail (metrics, traces, logs). This means if you have a good mental model of how the system works, you can then infer what the inputs and internal state of the system must look like for it to produce those outputs. But that’s a pretty big if.
“Implicit in every observability solution is the idea that the user already knows how the system is supposed to work, how it was set up, what the components are, and what ‘good’, or even just ‘working’, means. If you built the system or have excellent eng onboarding, this might be true. But it turns out that at scale and over time, this is never true.” - Aneel Lakhani, an Overmind angel investor and early employee at SignalFx and Honeycomb.
This doesn’t mean you don’t need observability; you probably do. Just that it’s unfair to expect it to solve problems caused by unknown unknowns.
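To make that concrete, here is what the telemetry from the hypothetical outage sketched earlier might look like (the metric names and values are invented). All of it is output: it describes the symptom in great detail, but only a mental model of the timeout coupling connects that symptom back to the configuration change that caused it.

```python
# Hypothetical telemetry from the earlier timeout example. These are outputs
# of the system: they pinpoint what is failing, not why it is failing.

metrics = {
    "service_b.request.error_rate": 0.37,     # spiked from ~0.01
    "service_b.request.p99_latency_s": 10.0,  # pinned at its deadline
    "service_a.request.p99_latency_s": 24.3,  # slower, but now "allowed" to be
}

logs = [
    "service-b: upstream deadline exceeded calling service-a",
    "service-b: upstream deadline exceeded calling service-a",
]

# With a good mental model you can infer "service-a is now allowed to be slower
# than service-b can tolerate". Without one, this is just a very detailed
# description of the pain.
for name, value in metrics.items():
    print(f"{name} = {value}")
for line in logs:
    print(line)
```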
So how do we solve it?
I believe that the answer is enabling engineers to produce new mental models on demand, either as part of planning a change or in response to an outage. Our current tools don’t support this kind of work, though. When building a mental model, we use “primal” low-level interactions with the system, usually via the command line, which requires a great deal of expertise and time. If we are to solve this, we must make building mental models much faster, meaning:
- Being able to discover the same kind of detailed configuration you’d get using the command line, but faster and with less required domain knowledge
- Discovering relationships between pieces of your infrastructure without needing someone to document them manually, since we know our documentation is perpetually out-of-date
- Building confidence by using these relationships to determine a realistic blast radius and planning your interaction accordingly (see the sketch after this list)
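This isn’t a description of how Overmind implements it; just a minimal sketch under the assumption that items and the relationships between them can be discovered automatically. The item names and edges below are invented. Once you have that graph, a blast radius is simply the set of items reachable from the thing you intend to change:

```python
# Minimal sketch: if infrastructure is discovered as a graph of items and
# relationships, the blast radius of a change is the set of items reachable
# from the item you intend to modify. Names and edges are hypothetical.
from collections import deque

# item -> items that depend on it (i.e. things a change could affect)
dependents = {
    "security-group/sg-0a1b": ["instance/i-web-1", "instance/i-web-2"],
    "instance/i-web-1": ["load-balancer/prod-alb"],
    "instance/i-web-2": ["load-balancer/prod-alb"],
    "load-balancer/prod-alb": ["dns/app.example.com"],
    "dns/app.example.com": [],
}


def blast_radius(start: str) -> set[str]:
    """Breadth-first traversal over the dependency graph from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        item = queue.popleft()
        for dep in dependents.get(item, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen - {start}


if __name__ == "__main__":
    # Planning a change to the security group: what could it affect?
    for item in sorted(blast_radius("security-group/sg-0a1b")):
        print(item)
```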
If we can do all of the above automatically, then we can:
- Actively monitor the blast radius for unexpected consequences of a change
- Audit the change to ensure it was implemented correctly
Or do this in reverse: start by looking at the item that is broken, use our knowledge of the blast radius to determine everything that could have caused it to break, then compare the state of those things with their state before the outage to see which changes could have caused the issue.
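Again as a rough sketch with invented names and state, the reverse direction amounts to snapshotting the configuration of everything in the broken item’s blast radius, diffing it against a snapshot from before the outage, and surfacing whatever changed as candidate causes:

```python
# Hypothetical sketch of the reverse workflow: given snapshots of item state
# from before and after the outage, diff every item in the broken item's
# blast radius and report what changed as candidate causes.

before = {
    "service-a/config": {"timeout_s": 5, "replicas": 3},
    "service-b/config": {"upstream_deadline_s": 10},
}
after = {
    "service-a/config": {"timeout_s": 30, "replicas": 3},
    "service-b/config": {"upstream_deadline_s": 10},
}


def candidate_causes(before: dict, after: dict) -> list[str]:
    """List a human-readable diff for every attribute in the 'before'
    snapshot whose value differs in the 'after' snapshot."""
    changes = []
    for item in before:
        for key, old in before[item].items():
            new = after.get(item, {}).get(key)
            if new != old:
                changes.append(f"{item}: {key} changed {old!r} -> {new!r}")
    return changes


if __name__ == "__main__":
    for change in candidate_causes(before, after):
        print(change)  # service-a/config: timeout_s changed 5 -> 30
```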
This is what we’re building at Overmind. We’ll map your entire infrastructure with no user input and allow you to use this map to build confidence faster than ever before.
Overmind is now available to try for free. Get started by signing up and creating an account here.
Thanks!
Addendum: Some rare people contradict what I’ve said here by having an amazingly detailed mental model. Often they’ve been around the longest and are the heroes who solve the most complex problems. The problem with these heroes is that they are hard to scale. Their mental models are challenging to share, and if you have an organisation of non-trivial size, you end up with lots of people who need that model but don’t have it.