Sometimes you have an outage because of an obscure, undocumented situation that nobody was ever likely to consider. And sometimes the root cause is in this section of the CHANGELOG:
It doesn’t get much clearer than that. So how is it that a change in this section of the CHANGELOG caused Reddit to be down for 314 minutes on Pi Day? (And no, it’s not because they just didn’t read it.)
As usual there’s a bit more than meets the eye, so let’s dive in!
Just after 19:00 UTC, an engineer at Reddit kicked off an upgrade on a Kubernetes cluster that (ironically) had just been the subject of an internal postmortem for a previous Kubernetes upgrade that had gone poorly. This cluster is also one of a handful of “pets” that were hand-reared using the kubeadm command, rather than the herd of “cattle” which have since been built with standard templates.
Given the extremely recent upgrade failure, the fact that it was a hand-built “pet”, and the sheer size of the cluster (>1,000 nodes), you can imagine that this was considered a fairly high-risk upgrade.
Almost immediately after the engineer started the upgrade, the entire Reddit site came to a screeching halt, and within three minutes a call was set up to debug the problem.
How exactly the team got to the bottom of the problem and fixed it is an excellent read and is covered in great detail in the original blog post, so I won’t re-cover it here, but rest assured it contains a good deal of:
All the things we expect from a good outage. If you’d like to read it, do so now, because the rest will contain spoilers.
So what happened and why?
At this point I was thinking “Tell you what, if I was upgrading a multi-thousand-node nightmare cluster that had been hand-built by god-knows-who and was single-handedly responsible for the vast majority of the site’s functionality, I’d have read the CHANGELOG and, presumably, noticed something like that”. But maybe it wasn’t in the CHANGELOG. Well, let’s check:
Under the heading Urgent Upgrade Notes
Under the subheading (No, really, you MUST read this before you upgrade)
We have the following:
The critical part of which is: “the label node-role.kubernetes.io/master will no longer be added to control plane nodes, only the label node-role.kubernetes.io/control-plane will be added”
Surely Reddit haven’t been so busy building cool giant k8s clusters that they forgot the most basic rule of upgrades: RTFM? Thankfully, no:
We actually did know that the label was going away. In fact, our upgrade process accounted for that in several other places - grumpimusprime
So how did it happen then? Once again grumpimusprime has the answers:
The gap wasn't that we didn't know about the label change, rather that the lack of documentation and codifying around the route reflector configuration led to us not knowing the label was being used to provide that functionality
The smoking gun: the actual piece of config that relied on the old label had not been documented, and had been implemented manually by someone who no longer worked for the company. Sounding familiar now? Sir_dancealot sums it up beautifully:
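Reddit hasn’t published the exact resource, but to make this concrete, here’s a hypothetical sketch of what a Calico route reflector peering that depends on that label can look like. Everything here (the name, the selectors) is illustrative, not Reddit’s actual config:

```yaml
# Hypothetical sketch, not Reddit's actual config: a Calico BGPPeer that
# peers every node with the route reflectors, selecting the reflectors by
# the old control-plane label. Once an upgrade stops applying
# node-role.kubernetes.io/master, the peerSelector silently matches
# nothing and the peering disappears.
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: peer-to-route-reflectors
spec:
  nodeSelector: all()
  peerSelector: has(node-role.kubernetes.io/master)
```

The nasty property of a selector like this is that it doesn’t error out when the label goes away; it just quietly selects an empty set.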
it’s funny how we’ve invented so many things to make all this infrastructure repeatable, automatable, and reliable, and it still all comes down to PITA networking management tools and some network config some guy did by hand 4 years ago that no one knew about… exactly like it did before we had all this
So does this mean we’re doomed to repeat these outages forever, no matter how new our tools and processes are? I don’t think so, but it requires a change to the way we build our mental models.
The fundamental reason why Reddit weren’t able to prevent this outage is that they were evaluating the warnings in the CHANGELOG against a model of the system, not against the actual system. This model was pretty detailed, as it consisted of the tribal knowledge of the team, the Terraform code, and the documentation. However, all models are subject to Woods’ Theorem:
As the complexity of a system increases, the accuracy of any single agent's own model of that system decreases rapidly.
As Woods found in the STELLA Report, outages are always going to happen at the edges of your mental model, with the things you don’t know you don’t know.
In this case the solution would have been pretty simple. Rather than making deployment decisions based on a potentially outdated and incomplete model of the system, the Reddit team could have gone to the cluster directly; building an accurate model by traversing the actual config would have unearthed the smoking gun, the manually configured resource that triggered the outage:
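As a minimal sketch of what “going to the source” can mean in practice: export the live manifests (e.g. with `kubectl get ... -o yaml`) and mechanically search them for the label the upgrade is about to remove. The function and sample manifest below are my own illustration, not anything from Reddit’s tooling:

```python
# Hypothetical sketch: scan exported cluster manifests for references to a
# label that an upcoming upgrade will stop applying. In practice the input
# would come from something like `kubectl get <resource> -A -o yaml`.
OLD_LABEL = "node-role.kubernetes.io/master"

def find_label_references(manifest_text: str) -> list[int]:
    """Return the 1-based line numbers that mention the deprecated label."""
    return [
        lineno
        for lineno, line in enumerate(manifest_text.splitlines(), start=1)
        if OLD_LABEL in line
    ]

# Illustrative manifest with a selector that depends on the old label.
sample = """\
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: peer-to-route-reflectors
spec:
  peerSelector: has(node-role.kubernetes.io/master)
"""
print(find_label_references(sample))  # → [6]
```

A crude text search like this obviously isn’t a substitute for understanding the config, but it’s exactly the kind of check that interrogates the real system rather than your model of it.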
If you learn nothing else from this blog, learn this: outages will always find the cracks in your mental model, so don’t trust it. As much as you can, go directly to the source, especially for important changes.
“Oh gee thanks Dylan, I guess I’ll just manually inspect every resource on a thousand-node cluster every time I’m going to touch something, what a genius idea, why didn’t I think of that?”
^ I thought you might say that. So here’s something I prepared earlier: Overmind
Thanks to Reddit for publishing this great analysis. It’s helped a lot of people avoid these same pitfalls, but more importantly I think it helps a lot of us avoid the imposter syndrome that is all-too-common in our industry. The fact that a company with cutting-edge infrastructure like Reddit can be taken down by the same thing that has bitten us all in the past (tripping over a mess someone left behind) makes us all feel a bit more relevant, I think. Which we all need sometimes.