James Lane
Why is it always deploys to prod that go wrong?

Everyone's Nightmare

It was around 7pm, and the engineer had just deployed a major upgrade to the legacy stack they had been working on. This upgrade had been tested against a dedicated set of clusters before finally being given the green light to be released into production. Everything seemed fine, for about 2 minutes. Then? Chaos. As the team hurriedly tried to troubleshoot the issue, they were probably left wondering, “Why does it always go sideways in production?”

Sound familiar? If you've ever faced a prod deployment disaster, you're not alone. This example is taken from Reddit’s Pi-Day Outage on March 14th. We've also looked at the Loom outage, and these are just two of many that can be found in the Verica incident database. But why do these issues arise, especially when things seem smooth in the testing phase? Let's dive in.

Differences Between Dev / Test and Production

If deploying to development is a closed-door rehearsal, then deploying to production is a live performance on opening night, with all the unpredictability that goes with it: anything that can go wrong will go wrong. But what are some examples of this?

  • User Traffic: Dev environments often simulate user traffic, but the real world is unpredictable. Users might interact with your application in ways you never envisioned, uncovering bugs that were never seen during testing.
  • Configuration: Subtle differences between dev and prod configurations can lead to major malfunctions. A missing environment variable or a misconfigured server can wreak havoc (see the sketch after this list).
  • External Dependencies: Third-party integrations may work in dev but pose challenges under the strain of real-world conditions. Similarly, they may be missing entirely from dev, posing many of the same challenges.
  • Data Discrepancies: Production data can be messier than test data. Real-world data can throw curveballs that test datasets don't account for.
  • Testing: Not every edge case can be anticipated; some slip through even the most rigorous testing.
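To make the configuration point concrete, here's a minimal sketch of a fail-fast startup check, with hypothetical variable names, that surfaces a missing environment variable the moment the service boots rather than on the first real request in production:

```python
import os
import sys

# Hypothetical settings the app needs at runtime. In dev these often come
# from a .env file or docker-compose defaults, so a gap only shows up in prod.
REQUIRED_VARS = ["DATABASE_URL", "PAYMENTS_API_KEY", "CACHE_HOST"]

def check_config() -> None:
    """Exit immediately if any required variable is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
    if missing:
        sys.exit(f"Refusing to start, missing config: {', '.join(missing)}")

if __name__ == "__main__":
    check_config()
    print("All required config present, starting app...")
```
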
The Swiss Cheese Model Approach

Imagine every layer of your deployment process (code review, testing, monitoring, etc.) as a slice of Swiss cheese. Each slice has random holes of different sizes – vulnerabilities or potential points of failure. A stack of these slices represents a company or team’s defence against a risk. When the holes all line up, the result is often failure: in this case, a deployment to production gone wrong. This is the Swiss cheese model approach to managing risk.
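One rough way to reason about the model, assuming each layer catches failures independently of the others (a simplification, since real failure modes often correlate), is to multiply the sizes of the holes together:

```python
from math import prod

def residual_risk(layer_effectiveness: list[float]) -> float:
    """Chance that a failure slips through every layer, assuming the
    layers' holes are independent of one another."""
    return prod(1 - e for e in layer_effectiveness)

# Three layers (e.g. code review, testing, monitoring), each catching 80%:
print(f"{residual_risk([0.8, 0.8, 0.8]):.1%}")  # 0.8% of failures reach prod
```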

Swiss cheese model
The true cost of increasing coverage

It's tempting to focus extensively on perfecting a single layer, but is it the most effective approach? Based on the 80/20 rule (Pareto Principle), 80% of your output's value derives from just 20% of your time, resources, and investment. Therefore, pursuing that elusive final 20% of coverage in a layer results in diminishing returns, with each percentage improvement harder than the last to obtain, and 100% effectiveness being impossible.

80 / 20 rule

Let’s imagine we are using Datadog for our observability layer. To reduce our risk, we might want to increase our coverage from 80% to 95%. We can put this quantitatively (though these figures are illustrative):

  • Achieving the next 15% (to reach 95% coverage) might require 80% of the total possible resources.
  • Let's assume, for simplicity, that achieving 80% coverage requires 20 units of data.
  • Reaching 95% coverage would then require around 80 units of data in total.
  • In other words, you'd need to be sending 4 times the data to reach 95% coverage.

In simple terms, you could expect your next Datadog bill to be around 4 times higher. Or, to put that into dollars: assuming 25% of your cloud spending goes on observability (similar to companies such as Netflix), and given that 54% of small and medium-sized businesses are spending over $1.2 million on cloud, you’d see your annual Datadog renewal increase from $300k to $1.2m. That’s equivalent to your entire cloud spend in the previous year.
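Putting the illustrative numbers above into a quick calculation (the cloud spend and observability share are the stated assumptions, not real billing data):

```python
units_at_80 = 20   # 'units' of data needed for 80% coverage (illustrative)
units_at_95 = 80   # total units needed for 95% coverage (illustrative)
multiplier = units_at_95 / units_at_80  # 4.0x the data

annual_cloud_spend = 1_200_000   # assumed annual cloud spend ($)
observability_share = 0.25       # assumed share of cloud spend on observability
current_bill = annual_cloud_spend * observability_share

print(f"${current_bill:,.0f} -> ${current_bill * multiplier:,.0f}")  # $300,000 -> $1,200,000
```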

If we were to use testing as an example layer and look at a calculator such as QA Wolf’s, we can estimate the cost of increasing test coverage to 95%. Assuming a single contractor at a base rate of $65/hour ($87,750 per year) gets you to 80% test coverage, you’d need to add 3 more contractors to reach our target, at an additional cost of $263,250 per year.
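A similarly rough back-of-the-envelope check, using the QA Wolf-style figures above:

```python
contractor_cost = 87_750  # one contractor at $65/hour, per year (from the calculator above)
extra_contractors = 3     # assumed extra headcount to get from 80% to 95% coverage

print(f"${extra_contractors * contractor_cost:,}")  # $263,250 additional per year
```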

So what difference would it make to improve our testing or observability layers? If we were able to improve one of the three layers from 80% to 95% effectiveness, it’d decrease the chance of an outage getting through from 0.8% to 0.2%.
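Under the same independence assumption as before, the arithmetic behind those figures looks like this:

```python
# Three layers, each 80% effective: an outage has to slip through all three
print(f"{0.2 * 0.2 * 0.2:.1%}")   # 0.8%

# Harden just one of the three layers from 80% to 95% effective
print(f"{0.2 * 0.2 * 0.05:.1%}")  # 0.2%
```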

3 layers of swiss cheese
Adding additional layers

Instead of improving a single layer, what if we chose to add an additional one? For instance, with 4 layers that were each 80% effective, you’d actually see a decrease in risk, even compared to the example above where we invested in making an existing layer almost perfect.
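And the same calculation with four "good enough" layers:

```python
# Four layers at 80% each, none of them anywhere near perfect
print(f"{0.2 ** 4:.2%}")  # 0.16% - lower than the 0.2% achieved by hardening one layer
```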

4 layers of swiss cheese

Which leads us to our next question: what additional layer can you add that will be more efficient and cost-effective than increasing coverage on what you already use today?

Config as a layer

Configuration as a layer is often overlooked in the fight against production outages. With config changes you are measuring inputs instead of outputs. Inputs can tell us what’s going to be impacted before making a change, rather than what’s been impacted after.

Config changes with Overmind

With Overmind's risks you can surface incident-causing config changes as part of your pull request, using your current application config instead of tribal knowledge, outdated docs, or waiting for that alert to go off. It acts as a second pair of eyes, analysing your Terraform plan (new supported changes coming soon) along with the current state of your infrastructure to calculate any dependencies and determine the potential impact, or blast radius, of a change. From the blast radius it can provide a list of human-readable risks that can be reviewed prior to running Terraform apply. These risks can either be commented back as part of your CI/CD pipeline or viewed in the app.

In Loom's case, we replicated a similar CloudFront configuration to see what Overmind could discover. We were able to identify the distributions the header policy would affect:

Overmind found one high, medium and low risk from this pull request

As far as cost goes, anything under 200 changes a month falls under the free tier, while 300 or more changes will cost you $250 a month ($3,000 a year). Compared to our two examples above, adding Overmind as an additional layer, or safety net, that works with your existing processes can be an effective, low-cost way of ensuring that when you deploy to production, it doesn’t turn into everyone’s nightmare.

Make your next change with Overmind by signing up for free here.

Or join our Discord to take part in the discussion on the next wave of DevOps tools.