As I'm sure many of you have seen, yesterday at 01:31 EST Datadog experienced (and still is experiencing at time of writing) an outage that affected, well, everything. The scale and duration of this outage make it pretty interesting and while we wait for the official report from Datadog, I thought I'd eliminate some of the more obvious causes.
It's not DNS. There's no way it's DNS.
Overmind doesn't yet have the ability to track DNS changes over time (watch this space) so I don't know, but I'm going to assume given the behaviour and people's reports on Reddit that it's not DNS. I look forward to being proven wrong.
We can tell quite easily if there were any known cloud outages by looking at the status of each of the cloud providers:
Turns out there weren't any incidents, but just because they weren't acknowledged doesn't mean they didn't exist, so if there was a cloud outage how would it affect Datadog? To work this out we need to have a look at Datadog's architecture. I've added all of their endpoints into Overmind and overlaid the regions and functions in Datadog terminology:
We can see that Datadog has a bunch of regions, but how do they relate to cloud providers? Much of this can be worked out using Overmind's built-in reverse-dns queries e.g.
So the mappings from Datadog terminology to real terminology are:
As we can see Datadog's infrastructure is spread across all three major cloud providers, and in the case of AWS and GCP also different regions and likely AZs. From this we can see that at least in theory, there is no cloud outage other than some kind of mass extinction event that could explain the outage we've seen today.
There's an old proverb that says "using multi-cloud for reliability is like riding two horses at once in case one of them dies" and this is a perfect example of that. The reality is that you're probably more likely to make a catastrophic mistake than the cloud providers are, and if straddling all three adds complexity and confusion then it'll only increase your MTTR (mean time to recovery).
Almost certainly. Rumours on Reddit are saying:
Datadog has reported to some customers that it was an OS update that rolled out everywhere that broke networking (see some other comments). This took out important infrastructure like their k8s clusters which run all their workloads.
And I'm inclined to agree that this looks like the most likely culprit. As we've seen time and time again, config changes and the unexpected dependencies between them are almost always at the root of these large outages.
At Overmind we automatically discover all of this for you (with no sidecar containers, agents or libraries) so you can assess the potential blast radius before making a change that brings down production, even if you've never seen that part of the infrastructure before.
Overmind is now available to try for free. Get started by signing up and creating an account here.
Final Note: There are a lot of smart, tired people working hard to get Datadog back up and running (likely without any Observability tooling...), so please be kind. If you need to vent, take it out on your account manager 😜