On Friday 19th July 2024, an update from CrowdStrike led to substantial disruptions across various sectors, resulting in widespread 'Blue Screen of Death' (BSOD) errors. These errors grounded planes, cancelled public transport and medical appointments, and disrupted banking, stock exchanges, and media operations.
The Complexity of Modern IT Systems
The CrowdStrike incident highlights the intricate dependencies within modern IT infrastructures. These insights are echoed in the STELLA report, which examines how engineers cope with the complexity of modern software systems and offers a framework for understanding and mitigating outages. Here are some critical insights from the report:
1. Interdependencies and Cascading Failures
Today’s IT systems are a complex web of interdependencies. An issue in one area can quickly lead to cascading failures across others. For instance, the CrowdStrike update impacted not just individual organisations but entire sectors, showcasing the ripple effect of a single point of failure.
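As a purely illustrative sketch (not a model of the CrowdStrike incident itself, and using hypothetical service names), the snippet below shows how the failure of a single component can propagate to everything that transitively depends on it:

```python
# Minimal illustration of cascading failure: a service is considered down
# if any of its dependencies is down. The service graph is hypothetical.

DEPENDENCIES = {
    "endpoint-agent": [],                  # the component that fails first
    "check-in-api": ["endpoint-agent"],
    "booking-system": ["check-in-api"],
    "departure-boards": ["booking-system"],
}

def affected_services(failed: str, deps: dict[str, list[str]]) -> set[str]:
    """Return every service that transitively depends on the failed one."""
    impacted = {failed}
    changed = True
    while changed:
        changed = False
        for service, requires in deps.items():
            if service not in impacted and any(r in impacted for r in requires):
                impacted.add(service)
                changed = True
    return impacted

if __name__ == "__main__":
    # A single failing component marks the whole chain as impacted:
    # ['booking-system', 'check-in-api', 'departure-boards', 'endpoint-agent']
    print(sorted(affected_services("endpoint-agent", DEPENDENCIES)))
```

The point is not the code itself but the shape of the graph: the further downstream a consumer sits, the less visibility it has into the component that actually broke.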
2. External Factors Beyond Control
Even with thorough internal controls, external factors can introduce risks. In this case, an externally developed update caused widespread chaos, emphasising that risk management must account for elements beyond an organisation's immediate control.
3. The Limits of Testing
No matter how rigorous the testing, real-world scenarios often reveal unforeseen challenges. Differences between development/test environments and live production settings can expose vulnerabilities that never surfaced during earlier phases.
This is not new: learning from past outages
Reddit's Pi-Day Outage
On 14th March 2023, Reddit experienced an unplanned outage lasting exactly 314 minutes. Despite extensive pre-deployment testing on the legacy stack, the production environment exposed unexpected issues, underscoring how unpredictable user interactions and system responses can be in live scenarios.
Loom’s Misfired Caching
Loom’s critical privacy issue, triggered by a misconfigured CloudFront header policy, further illustrates how real-world conditions can undermine safeguards that work well in test environments. The incident prompted a shutdown of the service to protect user data from unintended exposure.
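The exact CloudFront configuration is Loom's to describe; the snippet below is a deliberately simplified, hypothetical sketch of the general failure mode only: a cache whose key omits the requesting user will happily serve one user's response to another.

```python
# Simplified illustration of a cache key that ignores user identity.
# This is NOT Loom's actual setup; it only shows the failure mode.

cache: dict[str, str] = {}

def fetch_profile(user: str, path: str, *, include_user_in_key: bool) -> str:
    """Return a (fake) per-user response, cached under a key."""
    key = f"{user}:{path}" if include_user_in_key else path
    if key not in cache:
        cache[key] = f"private data for {user}"  # expensive origin call in reality
    return cache[key]

# Misconfigured: the key is just the path, so Bob receives Alice's cached response.
print(fetch_profile("alice", "/api/profile", include_user_in_key=False))
print(fetch_profile("bob", "/api/profile", include_user_in_key=False))   # leaks Alice's data

cache.clear()

# Correct: the user (e.g. a session header) is part of the cache key.
print(fetch_profile("alice", "/api/profile", include_user_in_key=True))
print(fetch_profile("bob", "/api/profile", include_user_in_key=True))    # Bob's own data
```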
Why Do Production Deployments Often Go Wrong?
User Traffic: Real-world user interactions can be far less predictable than simulated traffic, uncovering bugs invisible in controlled test environments.
Configuration Differences: Minor discrepancies between development and production configurations can lead to major malfunctions, from missing environment variables to misconfigured servers (see the sketch after this list).
External Dependencies: Third-party services might perform well in testing but fail under production loads or unanticipated conditions.
Data Discrepancies: Production data is invariably messier than curated test data and can introduce unexpected values and formats that controlled tests do not account for.
Unanticipated Edge Cases: Even the most exhaustive testing can’t anticipate every single scenario that might occur in the real world.
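To make the configuration point concrete, here is a minimal sketch (the setting names are hypothetical) of a fail-fast startup check that reports missing production configuration immediately, rather than as a confusing error mid-request:

```python
# Hypothetical fail-fast configuration check at service startup.
# Missing settings are reported up front instead of surfacing later
# as opaque runtime errors in production.
import os
import sys

REQUIRED_SETTINGS = ["DATABASE_URL", "PAYMENTS_API_KEY", "FEATURE_FLAG_SOURCE"]

def validate_config() -> None:
    missing = [name for name in REQUIRED_SETTINGS if not os.environ.get(name)]
    if missing:
        # A clear startup error beats a mysterious production failure.
        sys.exit(f"Refusing to start, missing configuration: {', '.join(missing)}")

if __name__ == "__main__":
    validate_config()
    print("Configuration looks complete, starting service...")
```

A check like this does not close the gap between environments, but it turns one class of silent drift into a loud, immediate signal.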
The CrowdStrike outage once again exposes the complex dependencies and vulnerabilities within modern IT systems. It underscores the need to understand these interconnected risks and to leverage the right tools, so that we are better prepared to mitigate future challenges and to maintain system resilience and continuity when something like this inevitably happens again.