It turns out that when things go wrong, it’s the things that you don’t know you don’t know that burn you. The STELLA report by David D Woods of Ohio State University studied large IT outages and explains why they were so impactful, and how “unknown unknowns” force users to rebuild their mental models from scratch.
In all the examples that the team studied as part of the report, the outage exposed that operators were fundamentally wrong about how the system worked [1]. Solving these outages required the user to rebuild their mental model of the system multiple times to explain the behaviour and devise a solution. This is partly because the outages tended to occur within areas of a system of which no one person had a detailed mental model and partly due to the sheer complexity of the outages. Here are some details that were common to all cases:
It’s no coincidence that outages happen in areas where we don’t have a good mental model. They happen there because we don’t have a good mental model.
The way these outages often manifest at a company is: A DevOps Engineer makes a configuration change, which causes an outage in a component that unexpectedly relied on the old value. In the incident review, you discover that the process was followed, yet things still broke; the conclusion is that we must need more process. This eventually leads to processes that are more focused on diluting blame than on actually building confidence. We call this Risk Management Theatre, and it affects more than you would think:
Observability has helped engineering and DevOps teams better understand how their systems perform by implementing millions of metrics, ingesting terabytes of logs, and tracing a single request through tens (hopefully not hundreds) of microservices. But this approach isn’t well suited to these outages that occur outside a user’s mental model.
Observability tools let you see the outputs of a system in great detail (metrics, traces, logs). This means if you have a good mental model of how the system works, you can then infer what the inputs and internal state of the system must look like for it to produce those outputs. But that’s a pretty big if.
“Implicit in every observability solution is the idea that the user already knows how the system is supposed to work, how it was set up, what the components are, and what ‘good’, or even just ‘working’ means. If you built the system or have excellent eng onboarding, this might be true. But it turns out that at scale and over time, this is never true,” - Aneel Lakhani, an Overmind angel and early employee at SignalFx and Honeycomb.
This doesn’t mean you don’t need observability; you probably do. Just that it’s unfair to expect it to solve problems caused by unknown unknowns.
I believe that the answer is enabling engineers to produce new mental models on-demand, either as part of planning a change or in response to an outage. Our current tools don’t support this kind of work, though. When building a mental model, we use “primal” low-level interactions with the system, usually via the command line, which requires a great deal of expertise and time. If we are to solve this, we must make building mental models much faster, meaning:
If we can do all of the above automatically, then we can:
Or do this in reverse: Start by looking at the item that is broken, and use our blast radius knowledge to determine everything that could have caused that to break, then compare the state of these things to before the outage and see what changes could have caused the issue.
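To make the reverse direction concrete, here is a minimal sketch of the idea, not how Overmind actually implements it. It assumes we already have a dependency map and two snapshots of each item's state; the item names, the `DEPENDS_ON` map, and both functions are hypothetical and exist only for illustration.

```python
from collections import deque

# Hypothetical dependency map: each key depends on the items in its list.
DEPENDS_ON = {
    "checkout-service": ["payments-api", "session-cache"],
    "payments-api": ["postgres-primary", "tls-cert"],
    "session-cache": ["redis-config"],
    "postgres-primary": [],
    "tls-cert": [],
    "redis-config": [],
}


def possible_causes(broken, depends_on):
    """Breadth-first walk from the broken item through everything it depends on."""
    seen = {broken}
    order = [broken]
    queue = deque([broken])
    while queue:
        item = queue.popleft()
        for dep in depends_on.get(item, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


def changed_items(candidates, before, after):
    """Keep only the candidates whose state differs between the two snapshots."""
    return {
        name: (before.get(name), after.get(name))
        for name in candidates
        if before.get(name) != after.get(name)
    }


# Hypothetical state snapshots taken before and after the outage.
before = {
    "redis-config": {"maxmemory": "2gb"},
    "tls-cert": {"expires": "2030-01-01"},
    "marketing-site": {"version": "1.4"},
}
after = {
    "redis-config": {"maxmemory": "256mb"},
    "tls-cert": {"expires": "2030-01-01"},
    "marketing-site": {"version": "1.5"},
}

candidates = possible_causes("checkout-service", DEPENDS_ON)
suspects = changed_items(candidates, before, after)
print(suspects)
```

In this made-up example the only suspect is `redis-config`. The `marketing-site` change is ignored because nothing the broken service depends on leads to it, which is the point of starting from the blast radius rather than from a list of every recent change.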
This is what we’re building at Overmind. We’ll map your entire infrastructure with no user input and allow you to use this map to build confidence faster than ever before.
Overmind is now available to try for free. Get started by signing up and creating an account here.
Addendum: Some rare people contradict what I’ve said here by having an amazingly detailed mental model. Often they’ve been around the longest and are the heroes who solve the most complex problems. The problem with these heroes is that they are hard to scale. Their mental models are challenging to share, and if you have an organisation of non-trivial size, you end up with lots of people who need that model but don’t have it.