It was around 7pm, and the engineer had just deployed a major upgrade to the legacy stack they had been working on. The upgrade had been tested against a dedicated set of clusters before finally being given the green light to be released into production. Everything seemed fine, for about 2 minutes. Then? Chaos. As the team hurriedly tried to troubleshoot the issue, they were probably left wondering, “Why does it always go sideways in production?”
Sound familiar? If you've ever faced a prod deployment disaster, you're not alone. This example comes from Reddit’s Pi-Day outage on March 14th. It is just one of many that can be found in the Verica incident database. But why do these issues arise, especially when things seem smooth in the testing phase? Let's dive in.
If deploying to development is a closed-door rehearsal, then deploying to production is a live performance on opening night, with all the unpredictability that goes with it: anything that can go wrong will go wrong. But what does that look like in practice?
Imagine every layer of your deployment process (code review, testing, monitoring etc.) as a slice of Swiss cheese. Each of these slices has random holes of different sizes – vulnerabilities or potential points of failure. A stack of these slices represents a company or team’s defence against a given risk. When the holes all line up, the result is often failure – or in this case, a deployment to production gone wrong. This is the Swiss Cheese model of managing risk.
It's tempting to focus extensively on perfecting a single layer, but is that the most effective approach? The 80/20 rule (the Pareto Principle) suggests that 80% of your output's value derives from just 20% of your time, resources, and investment. Pursuing that elusive final 20% of coverage in a single layer therefore yields diminishing returns, with each percentage point harder to obtain than the last, and 100% effectiveness being impossible.
Let’s imagine we are using Datadog for our observability layer, and to reduce our risk we want to increase our coverage from 80% to 95%. We can put that quantitatively (though the figures that follow are illustrative).
In simple terms, you could expect your next Datadog bill to be roughly four times higher. Or to put that into dollars: assuming 25% of your cloud spend goes to observability (similar to companies such as Netflix [link]), and given that 54% of small and medium-sized businesses spend over $1.2 million on cloud [link], you’d see your annual Datadog renewal increase from $300k to $1.2m. That’s equivalent to your entire cloud spend in the previous year.
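To make the arithmetic behind those numbers concrete, here is a minimal sketch in Python. The cloud-spend and observability-share figures are the ones quoted above; the idea that cost scales with how much of the remaining coverage gap you close is purely an illustrative assumption, not Datadog pricing.

```python
# Illustrative arithmetic only – these are the article's example figures,
# not real Datadog pricing.
cloud_spend = 1_200_000         # annual cloud spend ($) for a small/medium business
observability_share = 0.25      # assumed fraction of cloud spend on observability
current_bill = cloud_spend * observability_share   # ~$300k at 80% coverage

# Assumption: going from 80% to 95% coverage shrinks the uncovered fraction
# from 20% to 5% (a 4x reduction), which we mirror as a 4x cost increase.
current_gap, target_gap = 1 - 0.80, 1 - 0.95
cost_multiplier = current_gap / target_gap          # ~4.0
projected_bill = current_bill * cost_multiplier     # ~$1.2m

print(f"current: ${current_bill:,.0f}, projected: ${projected_bill:,.0f}")
```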
If we use testing as an example layer and look at a calculator such as QA Wolf’s, we can estimate the cost of increasing test coverage to 95%. Assume a single contractor at a base rate of $65/hour ($87,750 per year) gets you to 80% test coverage. You’d need to add three more contractors to reach the target, at an additional cost of $263,250 per year.
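The same back-of-the-envelope maths for the testing layer, using the contractor figures above (the assumption that three additional contractors are what it takes to close the gap comes from the example, not from a general rule):

```python
# Figures from the testing example above; the contractors-to-coverage scaling
# is an assumption for illustration.
annual_cost_per_contractor = 87_750   # one contractor at $65/hour
extra_contractors_for_95_pct = 3      # added on top of the one giving 80% coverage

additional_cost = extra_contractors_for_95_pct * annual_cost_per_contractor
print(f"Additional annual cost to reach 95% coverage: ${additional_cost:,}")
# -> Additional annual cost to reach 95% coverage: $263,250
```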
So what difference would it actually make to improve our testing or observability layer? Assuming three layers that each catch issues independently, improving one of them from 80% to 95% effectiveness decreases the chance of an outage getting through from 0.8% to 0.2%.
Instead of improving a single layer, what if we added another one? With four layers that were each 80% effective, the chance of an outage getting through drops to 0.16% – a lower risk than in the example above, where we invested heavily in making one existing layer almost perfect.
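Here is a quick sketch of the Swiss Cheese arithmetic behind those percentages. It assumes each layer catches issues independently of the others, which is a simplification, but it reproduces the figures quoted above:

```python
# Chance an issue slips through every layer = product of each layer's miss rate,
# assuming the layers fail independently (a simplification).
def escape_probability(layer_effectiveness):
    p = 1.0
    for effectiveness in layer_effectiveness:
        p *= (1 - effectiveness)
    return p

scenarios = {
    "3 layers at 80%":            [0.80, 0.80, 0.80],        # 0.80%
    "3 layers, one raised to 95%": [0.95, 0.80, 0.80],        # 0.20%
    "4 layers at 80%":            [0.80, 0.80, 0.80, 0.80],  # 0.16%
}

for name, layers in scenarios.items():
    print(f"{name}: {escape_probability(layers):.2%} of issues reach production")
```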
Which leads us to our next question: what additional layer can you add that is more efficient and cost-effective than increasing coverage on what you already use today?
Configuration as a layer is often overlooked in the fight against production outages. With config changes you are measuring inputs instead of outputs: inputs can tell us what’s going to be impacted before we make a change, rather than what’s been impacted after.
Overmind is a tool that does exactly that. It takes the blast radius of your proposed change and calculates what the impact would be across your environment, using only read-only AWS credentials. You can create changes manually or integrate it as part of your existing CI/CD pipeline.
As far as cost goes, anything under 200 changes a month falls under the free tier, while making 300 or more changes will cost you $250 a month ($3,000 a year). Compared to the two examples above, adding Overmind as an additional layer – a safety net that works with your existing processes – can be an effective, low-cost way of ensuring that when you deploy to production, it doesn’t turn into everyone’s nightmare.
Make your next change with Overmind by signing up for free here.
Or join our Discord to take part in the discussion about the next wave of DevOps tools.