Preventing Outages: Limitations of Even the Best Observability and Monitoring Tools

It was a Friday afternoon and we had planned to roll out a big change that we’d been working on and testing all week. We knew this was a bad idea, but we were confident! The change was related to the way the backend UNIX fleet authenticated user logins, so it should have been fairly innocuous. We had done all the testing we possibly could, but there was still some risk.

So we pressed the button and rolled out the change. The results came back all green, we could log into the servers, and all that we needed to do was wait. As we waited for all the results to come back, the phone rang.

“Hey, nobody in the department can save PDFs anymore.”

The whole department was at a standstill because they couldn’t save PDFs. We hadn’t touched any laptops, though, so how could we possibly have broken the ability to save PDFs? We started frantically looking into it, and it turned out that they weren’t clicking Print -> Save as PDF as you’d expect. They had an actual printer called “PDF Printer” that they printed to instead, which we had somehow managed to break.

We then tried the easiest things first:

Ask if anyone knows what it is: nobody does
Check if it exists in the CMDB: it doesn’t
Check the wiki: no mention of it

It turned out that about 10 years ago somebody had put a physical server in a data center, and the job of that server was to pretend to be a printer. When somebody printed to it, it saved the document as a PDF and then ran a script that picked up that PDF and moved it to a mount point. It didn’t make sense to me at the time, and it still doesn’t, but that’s what we had.

In the end, we managed to get the “printer” working again, but not before everyone in the affected department had already gone home for the weekend without being able to finish their work for the week.

What does this story tell us about the limitations of observability and monitoring tools?

No matter how reliable your systems are or how thoroughly you monitor them, outages can and will occur. Monitoring tools are only as effective as the data points they can access. They can provide valuable insights into system performance, but they may not capture everything you need when planning a change or finding the root cause of a failure. A lack of data can make it difficult to pinpoint the cause of an outage, especially when the issue is complex and involves multiple systems that sit outside our mental model. These unknown unknowns can be particularly challenging to diagnose and resolve, leading to lengthy downtime.

The typical (wrong) response: Risk Management Theatre

When an outage occurs, a common response is to add more risk management processes in an attempt to stop it from happening again. However, this increased focus on process substantially increases lead time. Puppet’s State of DevOps report found that low-performing companies that engaged heavily in risk management theatre had 440x longer lead times than high-performing organisations.

Companies with these long lead times make 46x fewer changes, meaning that each change needs to be much larger in order to keep up. Less practice and larger changes mean that they are five times more likely to experience failures. When failures do occur, the consequences are much more severe.

The combination of larger changes, decreased frequency, and limited experience in handling such situations leads to a mean-time-to-recovery almost 100x longer than that of high-performing organisations. And remember that it was large outages that caused this in the first place, so the process feeds back on itself, making the company slower and slower.

Answer = Inputs

Observability tools that measure outputs such as metrics, logs and traces require a good mental model and a deep understanding of the application in order to interpret them. But as we’ve already seen, outages are often caused by unexpected issues outside of our own mental model. When this happens, the system’s behaviour contradicts our understanding of how it should work. This leads to confusion and requires individuals to rebuild their mental model of the system on the fly, as described in the brilliant STELLA report.

To address this challenge, we should shift our focus toward measuring inputs, meaning configuration changes. This enables engineers to create new mental models as needed, whether during the planning stage of a change or in response to an outage. Current tools do not adequately support this type of work. When constructing a mental model, we typically rely on "primal" low-level interactions with the system, often through the command line, which demands a great deal of expertise and time.

To solve this, we must make building mental models much faster, which means:

Ensuring that the configuration and current state of a system are readily accessible.
Enabling users to easily discover the potential impact of their intended changes and what areas might be affected.
Providing users with the means to validate that their modifications have not caused any issues downstream.

By measuring config changes (inputs) instead, we can understand context on demand and gain confidence that our changes won’t have unintended negative impacts.
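To make "measuring inputs" a little more concrete, here is a minimal sketch that compares two snapshots of a system's configuration and reports exactly what changed. The function name diff_config and the example keys are hypothetical and chosen purely for illustration; real configuration is usually nested and would need a recursive comparison.

```python
from typing import Any


def diff_config(before: dict[str, Any], after: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {key: (old, new)} for every key whose value differs.

    A key that is missing from one snapshot is reported with None on that side.
    """
    changed = {}
    for key in sorted(before.keys() | after.keys()):
        old, new = before.get(key), after.get(key)
        if old != new:
            changed[key] = (old, new)
    return changed


# Hypothetical snapshots taken before and after an authentication change.
before = {"auth.method": "local", "auth.timeout_s": 30}
after = {
    "auth.method": "ldap",
    "auth.timeout_s": 30,
    "auth.ldap_host": "ldap.example.internal",
}

changes = diff_config(before, after)
for key, (old, new) in changes.items():
    print(f"{key}: {old!r} -> {new!r}")
```

Even a flat diff like this answers the first question anyone asks during an outage ("what changed?") without having to reconstruct it from logs and metrics after the fact.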

Overmind blast radius

Overmind understands all of the dependencies within your AWS infrastructure, so we can calculate the blast radius of a change, even for those resources outside of Terraform. This shows you the consequences of your changes, not just the changes themselves. Start by creating an account for free here. You can sign up using your email, or with a Google or GitHub account. Then create a change by:

Opening a Terraform pull request using the Overmind GitHub Action.
Manually within the Overmind app.
Coming soon: via our CLI as part of your Terraform plan.

In this example, the GitHub Action will trigger a new change. First, it will parse the contents of your Terraform plan, stripping out any sensitive information. Next, it will query the AWS API in real time to discover any dependencies and create a blast radius.
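The parsing step can be sketched roughly as follows. This is not Overmind's implementation. It assumes the plan has been exported to JSON with terraform show -json, and it relies on the resource_changes, after and after_sensitive fields of that JSON representation. The helper names redact and summarise_plan, the placeholder string, and the sample plan are all made up for this example.

```python
import json
from typing import Any

REDACTED = "(sensitive)"  # placeholder text chosen for this sketch


def redact(value: Any, sensitive: Any) -> Any:
    """Replace every part of `value` that the `sensitive` mask marks as True."""
    if sensitive is True:
        return REDACTED
    if isinstance(value, dict) and isinstance(sensitive, dict):
        return {k: redact(v, sensitive.get(k, False)) for k, v in value.items()}
    if isinstance(value, list) and isinstance(sensitive, list):
        return [
            redact(v, sensitive[i] if i < len(sensitive) else False)
            for i, v in enumerate(value)
        ]
    return value


def summarise_plan(plan_json: str) -> list[dict[str, Any]]:
    """List the resources a plan would change, with sensitive values masked."""
    plan = json.loads(plan_json)
    summary = []
    for rc in plan.get("resource_changes", []):
        change = rc["change"]
        if change["actions"] == ["no-op"]:
            continue  # nothing will change for this resource
        summary.append({
            "address": rc["address"],
            "actions": change["actions"],
            "after": redact(change.get("after"), change.get("after_sensitive", False)),
        })
    return summary


# A tiny hand-written plan fragment, used only to exercise the sketch.
sample_plan = json.dumps({
    "resource_changes": [
        {
            "address": "aws_db_instance.main",
            "change": {
                "actions": ["update"],
                "after": {"instance_class": "db.t3.medium", "password": "hunter2"},
                "after_sensitive": {"password": True},
            },
        },
        {
            "address": "aws_s3_bucket.logs",
            "change": {
                "actions": ["no-op"],
                "after": {"bucket": "logs"},
                "after_sensitive": {},
            },
        },
    ]
})

summary = summarise_plan(sample_plan)
print(json.dumps(summary, indent=2))
```

Masking values before anything leaves the CI runner means the rest of the pipeline only ever sees which resources change and how, never the secrets themselves.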

The blast radius contains:

What infrastructure will be affected.
What applications rely on that infrastructure.
What health checks those applications have.
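One way to picture how a blast radius like this could be calculated is as a walk over a dependency graph, starting at the resource being changed and following "is depended on by" edges outward. The sketch below is a plain breadth-first search under that assumption. The function name blast_radius, the graph shape, and every resource name are hypothetical, and this is not a description of how Overmind works internally.

```python
from collections import deque


def blast_radius(start: str, dependents: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk from `start` over 'is depended on by' edges.

    Returns everything reachable from `start`, excluding `start` itself,
    in the order it was discovered.
    """
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in dependents.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


# Hypothetical dependency map: each key is depended on by the items in its list.
dependents = {
    "aws_security_group.web": ["aws_instance.web-1", "aws_lb.public"],
    "aws_instance.web-1": ["app:checkout"],
    "aws_lb.public": ["app:checkout"],
    "app:checkout": ["healthcheck:checkout-http"],
}

affected = blast_radius("aws_security_group.web", dependents)
print(affected)
```

In this toy graph, changing the security group reaches two pieces of infrastructure, the application that relies on them, and that application's health check, which mirrors the three kinds of item listed above.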

Next, the Action will comment back with a list of human-readable risks, from which you can decide whether to apply the changes or make amendments. If you want to dive deeper, the app hosts an interactive graph where you can explore the relationships and AWS metadata in real time.

Get started today!

Get started today for free by signing up and creating an account here. In the free tier you can create 150 changes a month and get full access to our GitHub Action or CLI.

Or, if you're interested in influencing the direction of what we're building, register for our design partner program here.

Or join our Discord to take part in the discussion of the next wave of DevOps tools.