
Loom’s nightmare AWS outage and how it might have been prevented

Dylan Ratcliffe
September 22, 2023

At 11:03 am PST, on what seemed like any other day, an employee at Loom fired up the tool to watch a video recording. The unexpected twist? They found themselves logged into an entirely different user’s account. As more employees logged in and encountered the same problem, it didn’t take long for the chilling realisation to set in: this was not a one-off anomaly; it affected everyone.

Within seven minutes of noticing the issue, Loom applied a mitigation. As the minutes ticked on, management were faced with a difficult choice: trust that the mitigation had worked, but risk exposing even more customer data if it hadn’t, or pull the plug. At 11:30 am they made the decision to pull the plug on the entire platform. This gave them the time they needed to find the cause and validate the fix, and the platform was back by 2:45 pm that afternoon.

Timeline of the Loom outage

What went wrong?

To understand how this happened, we need to rewind a little to February 24th. An apparently innocuous Terraform change was applied to the dev environment and, over the next ten days, gradually made its way to production via the test and staging environments.

This change caused AWS CloudFront (Loom’s CDN) to cache Set-Cookie headers for one second, meaning that if a user made a request within a second of another user’s, they would be logged in as that user.

Loom’s AWS Terraform CDN change explained
Video: Hussein Nasser, https://www.youtube.com/watch?v=iPXLk5Fk1-U

So, why didn’t alarms go off earlier? The tricky condition for this bug to manifest was that requests from different users had to land within one second of each other - a scenario that rarely unfolds in non-production environments. Once the change reached production, Loom’s CDN started caching the session cookies sent to the JS and CSS static endpoints for a second, so responses could carry a Set-Cookie header from a different user, effectively logging people in as someone else.
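Loom haven’t published the exact Terraform diff, so the following is only a sketch of the kind of cache policy that produces this behaviour. The resource and attribute names are real Terraform AWS provider syntax, but the values and naming are assumptions, not Loom’s actual configuration:

```hcl
# Hypothetical cache policy resembling the change described above.
# A non-zero TTL combined with cookies excluded from the cache key means
# CloudFront can cache a response that carries another user's Set-Cookie
# header and serve it to everyone else for that second.
resource "aws_cloudfront_cache_policy" "static_assets" {
  name    = "static-assets-short-ttl"
  comment = "Cache JS/CSS responses for one second"

  min_ttl     = 1
  default_ttl = 1
  max_ttl     = 1

  parameters_in_cache_key_and_forwarded_to_origin {
    # Cookies are not part of the cache key, so two users' requests within
    # the same second resolve to the same cached object.
    cookies_config {
      cookie_behavior = "none"
    }
    headers_config {
      header_behavior = "none"
    }
    query_strings_config {
      query_string_behavior = "none"
    }
  }
}
```

Individually each of these settings looks reasonable; it’s the combination, applied to endpoints whose responses can include a Set-Cookie header, that turns a one-second cache into a session-sharing bug.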


The Insidious Complexity of CloudFront

One of the key takeaways from the Loom incident is that services like CloudFront are complex to reason about. A CloudFront “cache policy” can be shared across many “distributions”, and each distribution can in turn reference several other cache policies; together they control what gets cached, so any one of them could be the culprit behind an outage. Couple that with intermediary components such as load balancers dispatching traffic to multiple applications, and it’s easy to lose track of the labyrinth that is your infrastructure. And this is before we even consider socio-technical factors like stress and timing.
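To make that fan-out concrete, here is a sketch of how a shared cache policy attaches to a distribution in Terraform. The aws_lb.app origin and all of the names are hypothetical; the point is that any number of distributions (and cache behaviours within them) can reference the same cache_policy_id, so editing the single policy resource above quietly changes caching for all of them:

```hcl
# Hypothetical distribution referencing the shared cache policy. Other
# distributions can point their behaviours at the same policy ID.
resource "aws_cloudfront_distribution" "app" {
  enabled = true

  origin {
    # Assumes an aws_lb "app" resource defined elsewhere (hypothetical).
    domain_name = aws_lb.app.dns_name
    origin_id   = "app-alb"

    custom_origin_config {
      http_port              = 80
      https_port             = 443
      origin_protocol_policy = "https-only"
      origin_ssl_protocols   = ["TLSv1.2"]
    }
  }

  default_cache_behavior {
    target_origin_id       = "app-alb"
    viewer_protocol_policy = "redirect-to-https"
    allowed_methods        = ["GET", "HEAD"]
    cached_methods         = ["GET", "HEAD"]

    # The shared policy: a change to it ripples out to every behaviour and
    # distribution that references it.
    cache_policy_id = aws_cloudfront_cache_policy.static_assets.id
  }

  restrictions {
    geo_restriction {
      restriction_type = "none"
    }
  }

  viewer_certificate {
    cloudfront_default_certificate = true
  }
}
```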

Could this have been prevented?

Loom have stated that they:

“Will be looking into enhancing our monitoring and alerting to help us catch abnormal session usage across accounts and services.”

This will certainly help them to respond more quickly to future production issues. However, unless non-production environments see production-like traffic (in this case: users frequently logging in within a second of each other), monitoring changes like this won’t actually prevent the issue from happening again.

The change still reaches prod: monitoring will catch issues sooner, but likely not prevent them
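As a rough illustration of what “enhancing monitoring and alerting” could look like in Terraform, here is a hypothetical CloudWatch alarm. The CrossAccountSessionMismatch metric and the Loom/Auth namespace don’t exist out of the box; the application itself would have to emit them:

```hcl
# Hypothetical alarm on an application-emitted metric counting requests where
# the authenticated session does not match the expected account.
resource "aws_cloudwatch_metric_alarm" "session_mismatch" {
  alarm_name          = "cross-account-session-mismatch"
  alarm_description   = "A session cookie appears to have been served to the wrong account"
  namespace           = "Loom/Auth"
  metric_name         = "CrossAccountSessionMismatch"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 1
  threshold           = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  treat_missing_data  = "notBreaching"

  # alarm_actions would point at an SNS topic that pages the on-call engineer.
}
```

Even with an alarm like this, the first alert only fires after a real user has already been served someone else’s session, which is exactly why catching the change at plan time matters.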

Crucial to the prevention of such incidents is understanding the potential impact of changes before pushing them to production. With large-scale applications like Loom, it’s almost impossible for engineers to hold a perfect mental model of the system and its dependencies (Woods’ Theorem), so they need tools to help them understand the impact of each change.

This is where Overmind steps in. Overmind analyses a Terraform plan to determine its potential impact, or ‘blast radius’, making sure nothing slips through the cracks. In Loom’s case, we replicated a similar CloudFront configuration to see what Overmind could discover. We were able to identify the distributions the header policy would affect:

The CloudFront change replicated in Overmind

The load balancer that served as the origin for the distribution:

Overmind Blast Radius

And the services behind the load balancer:

Services behind the load balancer in Overmind

By understanding which services would potentially be affected by the planned CloudFront change, Loom engineers would have had a full picture and could have fixed the change before it affected production.

If you’re from Loom and reading this: thank you for the excellent incident update that this post is based on! Your transparency lets the whole industry learn and it’s greatly appreciated. I had to make some assumptions in this post; if I’ve got anything wrong, please contact me (dylan@overmind.tech) and I’ll get it fixed. If you want to use Overmind, I’ll give you your first year free as a thank you for helping push the industry forward!

And if you’d like to give this a try yourself, you can sign up for a free Overmind account here.

Make sure to check out our example repository. It demonstrates how to use Terraform with GitHub Actions to send each pull request to Overmind automatically.

Or join our Discord to take part in the discussion about the next wave of DevOps tools.
