It was 11:03 am PST, just like any other day, when an employee at Loom fired up the tool to watch a video recording. The unexpected twist? They found themselves logged into an entirely different user’s account. As more employees logged in and encountered the same problem, it didn’t take long for the chilling realisation to set in: this was not a one-off anomaly; it affected everyone.
Within seven minutes of noticing the issue, Loom applied a mitigation. As the minutes ticked on, management faced a difficult choice: trust that the mitigation had worked, and risk exposing even more customer data if it hadn’t, or pull the plug. At 11:30 am they made the decision to pull the plug on the entire platform. This bought them the time they needed to find the cause and validate the fix, and the platform was back by 2:45 pm that afternoon.
What went wrong?
To understand how this happened, we need to rewind a little to February 24th. An apparently innocuous Terraform change was applied to the dev environment and, over the next ten days, gradually made its way towards production via the test and staging environments.
This change caused AWS CloudFront (Loom’s CDN) to cache set-cookie headers for one second, meaning that if a user made a request within a second of another user’s, they could receive that user’s session cookie and be logged in as them.
So, why didn’t alarms go off earlier? For the bug to manifest, requests from different users had to arrive within one second of each other, a scenario that rarely unfolds in non-production environments. Once the change reached production, Loom’s CDN started caching the session cookies sent to JS and CSS static endpoints for a second, so responses could carry a set-cookie header belonging to a different user, effectively logging the recipient in as someone else.
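To make the failure mode concrete, here is a minimal sketch of the mechanism. This is a toy in-memory model with made-up names, not Loom’s or AWS’s actual code: an edge cache that wrongly caches whole responses, set-cookie header included, for a one-second TTL.

```python
class NaiveEdgeCache:
    """Toy model of a CDN edge cache that (incorrectly) caches
    responses, including their set-cookie headers, for `ttl` seconds.
    Illustrative only: not how CloudFront is implemented."""

    def __init__(self, ttl=1.0):
        self.ttl = ttl
        self.store = {}  # path -> (expires_at, response)

    def fetch(self, path, origin, now):
        cached = self.store.get(path)
        if cached and now < cached[0]:
            return cached[1]  # cache hit: serves the previous user's headers
        response = origin(path)
        self.store[path] = (now + self.ttl, response)
        return response


def origin_for(user):
    # The origin sets a fresh session cookie for whoever is asking.
    return lambda path: {"body": "...", "set-cookie": f"session={user}"}


cache = NaiveEdgeCache(ttl=1.0)
# Alice requests a static asset at t=0.0; her session cookie gets cached.
r1 = cache.fetch("/static/app.js", origin_for("alice"), now=0.0)
# Bob requests the same asset 0.5s later and receives Alice's cookie.
r2 = cache.fetch("/static/app.js", origin_for("bob"), now=0.5)
# After the 1s TTL expires, the next user gets a fresh response again.
r3 = cache.fetch("/static/app.js", origin_for("carol"), now=1.5)
```

Within the TTL window, Bob’s browser stores `session=alice` and he is now, from the application’s point of view, Alice. Outside the window everything looks normal, which is exactly why the bug was so hard to catch in quiet non-production environments.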
The Insidious Complexity of CloudFront
One of the key takeaways from the Loom incident is that services like CloudFront are genuinely complex. A single CloudFront “cache policy” can be shared across many “distributions”, and each distribution can in turn reference numerous other cache policies; together they control what gets cached, and any one of them could be the culprit behind an outage. Add intermediary components such as load balancers dispatching traffic to multiple applications, and it’s easy to lose track of the labyrinth that is your infrastructure. And that’s before we even consider socio-technical factors like stress and timing.
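To see why this fan-out is hard to track by hand, here is a sketch with an entirely hypothetical inventory (all resource names are invented): one shared cache policy reaches every distribution that references it, and every path pattern within those distributions.

```python
# Hypothetical inventory: which distributions reference each cache
# policy, and which path patterns each distribution serves.
policy_usage = {
    "short-ttl-policy": ["marketing-cdn", "app-cdn"],
    "no-cache-policy": ["api-cdn"],
}
distribution_behaviours = {
    "marketing-cdn": ["/static/*"],
    "app-cdn": ["/static/*", "/assets/*"],
    "api-cdn": ["/*"],
}


def blast_radius(changed_policy):
    """Return every (distribution, path pattern) that a change to the
    given cache policy can reach."""
    affected = []
    for dist in policy_usage.get(changed_policy, []):
        for pattern in distribution_behaviours.get(dist, []):
            affected.append((dist, pattern))
    return affected
```

Even in this three-distribution toy, editing `short-ttl-policy` touches paths in two separate distributions. In a real estate with dozens of distributions, behaviours, and intermediaries, nobody holds this graph in their head.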
Could this have been prevented?
Loom have stated that they:
“Will be looking into enhancing our monitoring and alerting to help us catch abnormal session usage across accounts and services.”
This will certainly help them respond more quickly to future production issues. However, unless non-production environments see production-like traffic (in this case, users frequently logging in within a second of each other), monitoring changes like this won’t actually prevent the issue from happening again.
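For illustration, the kind of check Loom describes could look something like the sketch below. It assumes you can extract (session_id, user_id) pairs from your auth or access logs; the event format and names here are assumptions, not Loom’s actual telemetry.

```python
from collections import defaultdict


def sessions_used_by_multiple_accounts(auth_events):
    """auth_events: iterable of (session_id, user_id) pairs, e.g. parsed
    from access logs. Returns session ids seen with more than one user,
    which is precisely the anomaly this incident produced."""
    users_per_session = defaultdict(set)
    for session_id, user_id in auth_events:
        users_per_session[session_id].add(user_id)
    return {s for s, users in users_per_session.items() if len(users) > 1}


events = [
    ("sess-1", "alice"),
    ("sess-1", "alice"),  # same user reusing a session: normal
    ("sess-2", "bob"),
    ("sess-2", "carol"),  # two accounts on one session: alarm
]
```

An alert on a non-empty result would have fired within seconds of the bad config reaching production; but, as argued above, it still only detects the incident rather than preventing it.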
Crucial to the prevention of such incidents is understanding the potential impact of changes before pushing them to production. With large-scale applications like Loom, it’s almost impossible for engineers to hold a perfect mental model of the system and its dependencies (Woods’ Theorem), so they need tools to help them understand the impact of each change.
Once you’ve run a Terraform plan, it is typically reviewed manually by the person making the change or someone on their team. In some cases the reviewer even sits outside the team making the change. In those cases context is vital, and while there are various tools that help you untangle the complexity, they often fall short in larger or more complex environments. Which means critical config changes that could cause an outage go undetected.
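One lightweight way to add context to that review is to scan Terraform’s machine-readable plan output (produced by `terraform show -json plan.out`) for resource types your team has flagged as high-risk. A minimal sketch, with an invented risk list, is shown below.

```python
import json

# Example "handle with care" list; which types belong here is a
# judgement call each team makes for itself.
RISKY_TYPES = {
    "aws_cloudfront_cache_policy",
    "aws_cloudfront_distribution",
}


def risky_changes(plan):
    """Return addresses of planned changes touching CDN caching resources."""
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if rc["type"] in RISKY_TYPES and rc["change"]["actions"] != ["no-op"]
    ]


# A trimmed-down stand-in for real `terraform show -json` output.
plan = json.loads("""{
  "resource_changes": [
    {"address": "aws_cloudfront_cache_policy.static",
     "type": "aws_cloudfront_cache_policy",
     "change": {"actions": ["update"]}},
    {"address": "aws_s3_bucket.assets",
     "type": "aws_s3_bucket",
     "change": {"actions": ["no-op"]}}
  ]
}""")
```

A check like this is crude: it knows a cache policy changed, but not which distributions depend on it or who their users are. That dependency mapping is the gap the next section addresses.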
But what if you could stop these changes from hitting production? With Overmind’s risks, you can surface incident-causing config changes as part of your pull request, using current application config rather than tribal knowledge or outdated docs. Overmind acts as a second pair of eyes, analysing your Terraform plan along with the current state of your infrastructure to calculate dependencies and determine the potential impact, or blast radius, of a change. From the blast radius it produces a list of human-readable risks that can be reviewed before running Terraform apply. These risks can either be commented back as part of your CI/CD pipeline or viewed in the app.
In Loom’s case, we replicated a similar CloudFront configuration to see what Overmind could discover. We were able to identify the distributions the header policy would affect:
Inside the app we can see the full blast radius in an interactive graph, along with any metadata Overmind was able to pull from AWS.
And when you're ready to start the change, Overmind will take a snapshot before and after to validate that the change went through as intended.
By understanding which services would potentially be affected by the planned CloudFront change, Loom’s engineers would have had the full picture and could have fixed the change before it affected production.
If you’re from Loom and reading this: thank you for the excellent incident update that this post is based on! Your transparency lets the whole industry learn, and it’s greatly appreciated. I had to make some assumptions in this post; if I’ve got anything wrong, please contact me (firstname.lastname@example.org) and I’ll get it fixed. If you want to use Overmind, I’ll give you your first year free as a thank you for helping push the industry forward!
And if you’d like to give this a try yourself you can sign up for a free Overmind account here.
Make sure to check out our example repository. It demonstrates how to use Terraform with GitHub Actions to send each pull request to Overmind automatically.
Or join our Discord to take part in the discussion about the next wave of DevOps tools.