It was a Friday afternoon and we had planned to roll out a big change that we’d been working on and testing all week. We knew this was a bad idea, but we were confident! The change was related to the way the backend UNIX fleet authenticated user logins, so it should have been fairly innocuous, and we had done all the testing we possibly could, but there was still some risk.
So we pressed the button and rolled out the change. The results came back all green, we could log into the servers, and all that we needed to do was wait. As we waited for all the results to come back, the phone rang.
“Hey, nobody in the department can save PDFs anymore.”
The whole department was at a standstill because they couldn't save PDFs. We hadn’t touched any laptops though, so how could we possibly have broken the ability to save PDFs? We started frantically looking into it and it turned out that they weren’t clicking Print -> Save as PDF as you’d expect; they had an actual printer called “PDF Printer” that they printed to instead, which we’d managed to break somehow.
We then tried the easiest things first:
It turned out that about 10 years ago somebody had put a physical server in a data center, and the job of that server was to pretend to be a printer. When somebody printed to it, it saved the document as a PDF, then ran a script that picked up that PDF and moved it to a mount point. It didn’t make sense to me at the time, and it still doesn’t, but that’s what we had.
In the end, we managed to get the “printer” working again, but not before everyone in the affected department had already gone home for the weekend without being able to finish their work for the week.
No matter how reliable your systems are or how thoroughly you monitor them, outages can and will occur. Monitoring tools are only as effective as the data points they can access. They can provide valuable insights into system performance, but they may not capture everything needed when making a change or finding the root cause of a problem. A lack of data can make it difficult to pinpoint the cause of an outage, especially when the issue is complex and involves multiple systems that sit outside our mental model. These unknown unknowns can be particularly challenging to diagnose and resolve, leading to lengthy downtimes.
When an outage occurs, a common response is to implement more risk management processes in an attempt to stop the outage from happening again. However, this increased focus on risk management processes results in a substantial increase in lead time. Puppet’s State of DevOps report found that low-performing companies that engaged heavily in risk management theatre had 440x longer lead-times than high-performing organisations.
Companies with these long lead times make 46x fewer changes, meaning that each change needs to be much larger in order to keep up. Less practice and larger changes mean that they are five times more likely to experience failures. When failures do occur, the consequences are much more severe.
The combination of larger changes, decreased frequency, and limited experience in handling such situations leads to a mean-time-to-recovery almost 100x longer than that of high-performing organisations. And remember that it was large outages that caused this in the first place, so the process feeds back on itself, making the company slower and slower.
Observability tools that measure outputs such as metrics, logs & traces require a good mental model and a deep understanding of the application in order to interpret them. But as we’ve already seen, outages are often caused by unexpected issues outside of our own mental model. When this happens, the system’s behaviour contradicts our understanding of how it should work. This leads to confusion and requires individuals to rebuild their mental model of the system on the fly, as mentioned in the brilliant STELLA report.
To address this challenge, we should shift our focus toward measuring inputs. This enables engineers to create new mental models as needed, whether during the planning stage of a change or in response to an outage. Current tools do not adequately support this type of work. When constructing a mental model, we typically rely on "primal" low-level interactions with the system, often accomplished through the command line, which demands a great deal of expertise and time. To resolve this issue, we must find a way to expedite the process of building mental models by measuring input or configuration changes instead.
If we are to solve this, we must make building mental models much faster, meaning:
By measuring config changes (inputs) instead we can understand context on demand and have the confidence that our changes won’t have any unintended negative impacts.
Overmind understands all of the dependencies within your AWS infrastructure, so we can calculate the blast radius of a change, even for those resources outside of Terraform. This shows you the consequences of your changes, not just the changes themselves.
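To make the idea of a blast radius concrete, here is a minimal sketch of one way it could be computed: a breadth-first walk over a dependency graph, starting from the items being changed. This is an illustration only, not Overmind's actual implementation. The function name, the `max_depth` parameter, and the resource names in the example graph are all hypothetical.

```python
from collections import deque


def blast_radius(dependents, changed, max_depth=None):
    """Return the changed items plus everything that depends on them.

    `dependents` maps each item to the items that depend on it directly.
    `max_depth` (hypothetical knob) limits how many hops are followed;
    None means follow every hop.
    """
    seen = set(changed)
    queue = deque((item, 0) for item in changed)
    while queue:
        item, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand beyond the requested depth
        for neighbour in dependents.get(item, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return seen


# Hypothetical dependency graph: key -> items that depend on the key.
dependents = {
    "aws_security_group.web": ["aws_instance.web"],
    "aws_instance.web": ["aws_lb_target_group.web"],
    "aws_lb_target_group.web": ["aws_lb.public"],
    "aws_s3_bucket.logs": [],
}

affected = blast_radius(dependents, ["aws_security_group.web"])
nearby = blast_radius(dependents, ["aws_security_group.web"], max_depth=1)
```

In this example a change to the security group reaches the instance, the target group and the load balancer, while the unrelated bucket stays outside the blast radius. The hard part in practice is discovering the edges in the first place, which is exactly where the "PDF Printer" in our story slipped through.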
Start by creating an account for free here. You can sign up using your email, or with your Google or GitHub account.
Next, open a Terraform pull request using the Overmind GitHub action.
The action will automatically populate a new change with the resources & items from your Terraform plan output.
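For a sense of what "resources & items from your Terraform plan output" look like, the sketch below pulls the changed resources out of the JSON produced by `terraform show -json <planfile>`. It relies only on the `resource_changes`, `address` and `change.actions` fields of Terraform's JSON plan representation. The function name and sample data are hypothetical, and this is not a description of how the Overmind action works internally.

```python
import json


def changed_resources(plan_json):
    """List (address, actions) for every resource the plan would modify."""
    plan = json.loads(plan_json)
    changes = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # Terraform marks untouched resources with the single action "no-op".
        if actions and actions != ["no-op"]:
            changes.append((rc["address"], actions))
    return changes


# Hypothetical, heavily trimmed plan output for illustration.
sample_plan = json.dumps({
    "resource_changes": [
        {"address": "aws_security_group.web", "change": {"actions": ["update"]}},
        {"address": "aws_instance.web", "change": {"actions": ["no-op"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["delete", "create"]}},
    ]
})

result = changed_resources(sample_plan)
```

Here the security group and the bucket are reported as changing, while the untouched instance is skipped.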
Based on what you're changing, Overmind will calculate the blast radius of the affected items. Use the graph to explore relationships and dependencies between these items.
The blast radius contains:
After a successful early access program where we discovered over 600k AWS resources and mapped 1.7 million dependencies, we are now looking for innovators to join our design partner program to help test impact analysis (only for AWS infrastructure at the moment).
Get started today for free by signing up and creating an account here.
Or, if you're interested in influencing the direction of what we're building, register for our design partner program here.