Dylan Ratcliffe
The changelog that could have saved Reddit 314 min of downtime on Pi Day

Sometimes you have an outage because of an obscure, undocumented situation that nobody was ever likely to consider. And sometimes the root cause is sitting in the CHANGELOG under the heading Urgent Upgrade Notes and the subheading (No, really, you MUST read this before you upgrade).

It doesn’t get much clearer than that. So how is it that a change in this section of the CHANGELOG caused Reddit to be down for 314 minutes on Pi Day? (And no, it’s not because they just didn’t read it.)

As usual, there’s a bit more to it than meets the eye, so let’s dive in!

What happened?

Just after 19:00 UTC, an engineer at Reddit kicked off an upgrade on a Kubernetes cluster that (ironically) had just been the subject of an internal postmortem for a previous Kubernetes upgrade that had gone poorly. This cluster is also one of a handful of “pets” that were hand-reared using the kubeadm command, rather than part of the herd of “cattle” that have since been built with standard templates.

Given the extremely recent upgrade failure, the fact that it was a hand-built “pet”, and the sheer size of the cluster (>1,000 nodes), you can imagine that this was considered a fairly high-risk upgrade.

Almost immediately after the engineer started the upgrade, the entire Reddit site came to a screeching halt, and within three minutes a call was set up to debug the problem.

The debugging process

How exactly the team got to the bottom of the problem and fixed it is an excellent read, covered in great detail in the original blog post, so I won’t re-cover it here. Rest assured it contains a good deal of:

  • Turning things off and on again
  • Out-of-date documentation
  • Red herrings

All the things we expect from a good outage. If you’d like to read it, do so now, because the rest will contain spoilers.

The cause

So what happened and why?

  1. The application stopped working, because…
  2. The nodes running its pods had no network routes, because…
  3. There were no Calico route reflectors running (Reddit use Calico for networking), because…
  4. Calico was configured to run route reflectors on nodes with the `master` label, which no longer existed, because…
  5. Upgrading to Kubernetes 1.24 removes the `master` label in favour of `control-plane`.

Oof.
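Reddit’s write-up doesn’t share the exact Calico configuration, but to make step 4 concrete, a route reflector setup keyed on the doomed label might look something like the sketch below. This is purely illustrative: the resource name and selectors are my assumptions, not Reddit’s actual config.

```bash
# Hypothetical sketch only, not Reddit's real config: a Calico BGPPeer
# that tells every node to peer with route reflectors selected by the
# node-role.kubernetes.io/master label. Once the 1.24 upgrade strips
# that label, the peerSelector matches nothing, no node peers with a
# reflector, and routes to the pods disappear.
calicoctl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: peer-with-route-reflectors
spec:
  nodeSelector: all()
  peerSelector: has(node-role.kubernetes.io/master)
EOF
```

Nothing in a resource like that mentions Kubernetes versions or kubeadm, which is exactly why a CHANGELOG warning about a label doesn’t obviously map onto it unless you already know the resource exists.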

Please tell me they read the CHANGELOG

At this point I was thinking: “Tell you what, if I was upgrading a multi-thousand-node nightmare cluster that had been hand-built by god-knows-who and was single-handedly responsible for the vast majority of the site’s functionality, I’d have read the CHANGELOG and, presumably, noticed something like that”. But maybe it wasn’t in the CHANGELOG. Well, let’s check:

Under the heading Urgent Upgrade Notes

Under the subheading (No, really, you MUST read this before you upgrade)

We have the following:

Kubeadm: apply `second stage` of the plan to migrate kubeadm away from the usage of the word `master` in labels and taints. For new clusters, the label `node-role.kubernetes.io/master` will no longer be added to control plane nodes, only the label `node-role.kubernetes.io/control-plane` will be added. For clusters that are being upgraded to 1.24 with `kubeadm upgrade apply`, the command will remove the label `node-role.kubernetes.io/master` from existing control plane nodes. For new clusters, both the old taint `node-role.kubernetes.io/master:NoSchedule` and new taint `node-role.kubernetes.io/control-plane:NoSchedule` will be added to control plane nodes. In release 1.20 (`first stage`), a release note instructed to preemptively tolerate the new taint. For clusters that are being upgraded to 1.24 with `kubeadm upgrade apply`, the command will add the new taint `node-role.kubernetes.io/control-plane:NoSchedule` to existing control plane nodes. Please adapt your infrastructure to these changes. In 1.25 the old taint `node-role.kubernetes.io/master:NoSchedule` will be removed. ([#107533](https://github.com/kubernetes/kubernetes/pull/107533), [@neolit123](https://github.com/neolit123))

The critical part of which is: “the label `node-role.kubernetes.io/master` will no longer be added to control plane nodes, only the label `node-role.kubernetes.io/control-plane` will be added”.

Oof.
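For reference, the change itself is easy to see on a live cluster, something along these lines (a generic check of my own, not a description of Reddit’s process):

```bash
# List control plane nodes by the old and new role labels. Before the
# upgrade both selectors return the same nodes; after running
# `kubeadm upgrade apply` to 1.24 the first returns nothing, and
# anything that selects on the old label silently stops matching.
kubectl get nodes -l node-role.kubernetes.io/master
kubectl get nodes -l node-role.kubernetes.io/control-plane
```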

Surely Reddit haven’t been so busy building cool giant k8s clusters that they’ve forgotten the most basic rule of upgrades: RTFM? Well no, thankfully:

We actually did know that the label was going away. In fact, our upgrade process accounted for that in several other places - grumpimusprime

So how did it happen then? Once again grumpimusprime has the answers:

The gap wasn't that we didn't know about the label change, rather that the lack of documentation and codifying around the route reflector configuration led to us not knowing the label was being used to provide that functionality

The smoking gun, the actual piece of config that relied on the old label, had not been documented, and had been implemented manually by someone who no longer worked for the company. Sounding familiar now? Sir_dancealot sums it up beautifully:

it’s funny how we’ve invented so many things to make all this infrastructure repeatable, automatable, and reliable, and it still all comes down to PITA networking management tools and some network config some guy did by hand 4 years ago that no one knew about… exactly like it did before we had all this

So does this mean we’re doomed to repeat these outages forever, no matter how new our tools and processes are? I don’t think so, but it requires a change to the way we build our mental models.

Trusting your own mental model dooms you to deal with these outages forever

The fundamental reason Reddit weren’t able to prevent this outage is that they were evaluating the warnings in the CHANGELOG against a model of the system, not against the actual system. That model was pretty detailed, consisting of the tribal knowledge of the team, the Terraform code, and the documentation. However, all models are subject to Woods’ Theorem:

As the complexity of a system increases, the accuracy of any single agent's own model of that system decreases rapidly.

As Woods found in the STELLA Report, outages are always going to happen at the edges of your mental model, with the things you don’t know you don’t know.

In this case the solution would have been pretty simple. Rather than making deployment decisions based on a potentially outdated and incomplete model of the system, the Reddit team could have gone to the cluster directly. Building an accurate model by traversing the actual config would have unearthed the smoking gun: the manually configured resource that triggered the outage.
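In concrete terms, even a crude audit of the live cluster, run before the upgrade, could have surfaced it. A rough sketch, assuming calicoctl access and that the route reflector config lives in Calico’s BGP resources:

```bash
# Rough pre-upgrade audit sketch (assumes calicoctl is installed; not a
# description of Overmind or of Reddit's tooling): dump the live BGP
# configuration and search for anything still keyed on the label that
# Kubernetes 1.24 removes.
calicoctl get bgpPeer -o yaml | grep -n 'node-role.kubernetes.io/master'
calicoctl get nodes -o yaml | grep -n 'node-role.kubernetes.io/master'

# And confirm which nodes currently carry the old label at all.
kubectl get nodes -l node-role.kubernetes.io/master
```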

If you learn nothing else from this blog, learn this: outages will always find the cracks in your mental model, so don’t trust it. As much as you can, go directly to the source, especially for important changes.

“Oh gee, thanks Dylan, I guess I’ll just manually inspect every resource on a thousand-node cluster every time I’m going to touch something. What a genius idea, why didn’t I think of that?”

^ I thought you might say that. So here’s something I prepared earlier: Overmind

  • Always gets its data directly from the cluster (or AWS) in real time
  • Automatically traverses relationships to discover things you didn’t know you had and that were never captured in Terraform
  • Lets you view, filter & search in a GPU-accelerated graph, so that you can easily understand how things are related
  • Calculates a change’s blast radius and gives you a set of human-readable risks before deployment
  • Is totally free (the discovery part at least)

Overmind acts as a second pair of eyes on each PR.

Receive a set of risks when raising a pull request.

Dive deeper by exploring the relationships and metadata in the app using the interactive graph.

Special thanks:

Thanks to Reddit for publishing this great analysis. It’s helped a lot of people avoid these same pitfalls, but more importantly I think it helps a lot of us avoid the imposter syndrome that is all too common in our industry. The fact that a company with cutting-edge infrastructure like Reddit can be taken down by the same thing that has bitten us all in the past (tripping over a mess someone left behind) makes us all feel a bit more relevant, I think. Which we all need sometimes.
