There is a good chance you have seen this before. A pull request triggers a build, and it fails. A quick look at the logs reveals the culprit: a sub-sub-dependency in your `package.json` has a critical vulnerability. Annoying? Yes. A showstopper? Not really. You run `npm audit fix`, merge the PR Dependabot has already opened, or follow the fix your Snyk scan suggests. The problem, while important, is contained, understood, and managed by a fairly mature ecosystem of tools.
Now, consider a different scenario. An SRE or engineer makes what they consider a "safe" change to a security group in a seemingly unrelated AWS account. Two hours later, half of your customer-facing services are down. Your monitoring dashboards are a sea of red, and nobody knows why. The root cause? A hidden, implicit dependency between that security group and a load balancer in another region that no one even knew existed.
Walk into most engineering organisations and you'll find sophisticated tooling for code deployments: blue-green deployments, feature flags, automated rollbacks, comprehensive test suites. But ask about infrastructure changes and you'll hear "we use Terraform" or "we have monitoring." That's like saying you handle code deployments with "we use Git" and "we look at logs when things break."

This isn't a criticism of engineering teams. It's a reflection of how the tooling has evolved. Code dependency management has 20+ years of mature tooling behind it. Infrastructure dependency management? We're still figuring it out.
The disparity between how we manage these two types of dependencies is stark. On the code side, we have a mature, sophisticated discipline. We use manifest files like `package.json` and `requirements.txt` to explicitly declare what we need. We have automated scanners, vulnerability databases, and bots that handle updates for us. In fact, research shows that while 84% of codebases contain at least one known vulnerability, we have the tooling to find and fix them.
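For contrast, this is all it takes to declare a code dependency explicitly; the packages and versions below are purely illustrative:

```json
{
  "dependencies": {
    "express": "^4.18.2",
    "pg": "^8.11.0"
  },
  "devDependencies": {
    "jest": "^29.7.0"
  }
}
```

Every direct dependency is named and versioned in one place, which is exactly the foundation that tools like `npm audit` and Dependabot build on.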
On the infra side, it's a different world. There is no `package.json` for your AWS environment. The "source of truth" is often a collection of Terraform files, CloudFormation templates, and a whole lot of "tribal knowledge" stored in the heads of your senior engineers. Critical relationships between resources are not declared; they are emergent. They exist because of how services happen to be configured to talk to each other at runtime. This gap means we are missing the forest for the trees.
The consequences extend beyond the immediate outage. When infrastructure fails unexpectedly, teams lose confidence in making necessary changes. Technical debt accumulates. Systems become increasingly fragile as everyone becomes afraid to touch anything. The cure becomes worse than the disease.
And the consequences are not just technical; they are financial, and a little terrifying. When a code dependency issue arises, the cost is typically measured in developer hours. It might take a few hours, or even a day, to resolve a difficult conflict. When an infrastructure dependency fails, the cost is measured in thousands of dollars per minute. Research from Gartner and other industry analysts consistently places the average cost of IT downtime at over $5,600 per minute. For critical applications in large enterprises, that number can easily exceed $1 million per hour.
This is not a build failure. This is a boardroom level crisis. It's lost revenue, SLA penalties, and a direct hit to customer trust. The blast radius isn't contained to a single application; it can take down your entire platform.
So why are these dependencies more dangerous? Firstly, they are unbounded: a code dependency is scoped to an application, but an infrastructure dependency can span services, teams, AWS accounts, and even entire regions. That security group change can impact a Lambda function you didn't know existed, a Kubernetes pod connecting to an RDS instance in another VPC, or a load balancer managed by a different team.
Secondly, existing tools are blind to these relationships. Tools like Terraform are for provisioning; they are not discovery tools. Terraform only knows about the resources it explicitly manages. It has no idea what another team provisioned, what someone changed in the AWS console to fix an urgent issue last month, or what dynamic relationships form at runtime. It's working from a blueprint, not a live map; the sketch after these points shows what reading the live map can look like.
Finally, relying on your most experienced engineers to remember how everything is connected is not a scalable or resilient strategy. What happens when they are on vacation? Or when they leave the company?
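To make "discovery" concrete, here is a rough sketch using boto3 against the EC2 API (the region and security group ID are hypothetical). It asks the live environment what is actually attached to a security group, and which other groups reference it, relationships that no state file or architecture diagram will reliably tell you about:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")
sg_id = "sg-0123456789abcdef0"  # hypothetical security group you are about to change

# Which network interfaces (load balancers, Lambdas, RDS instances, pods...)
# are using this security group right now?
enis = ec2.describe_network_interfaces(
    Filters=[{"Name": "group-id", "Values": [sg_id]}]
)["NetworkInterfaces"]
for eni in enis:
    print(eni["NetworkInterfaceId"], eni.get("InterfaceType"), eni.get("Description", ""))

# Which other security groups reference this one in their inbound rules?
referencing = ec2.describe_security_groups(
    Filters=[{"Name": "ip-permission.group-id", "Values": [sg_id]}]
)["SecurityGroups"]
for sg in referencing:
    print(sg["GroupId"], sg.get("GroupName", ""))
```

This is a single-account, single-region snapshot and only a starting point, but it illustrates the shift: the relationships are discoverable from the running environment, not from the provisioning code.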
We've accepted that complex systems will lead to unpredictable outages. We've become experts at firefighting: assembling a war room, digging through logs, metrics, and traces to eventually find the one change buried in last week's commits that caused the cascade of failures.
Solving the infrastructure dependency problem requires more than process improvements. It requires the same kind of tooling sophistication that we've built for code dependencies, adapted to the unique challenges of dynamic, distributed infrastructure.
This means continuous discovery of your actual resources, real-time mapping of how they connect, and impact analysis that lets you understand the consequences of changes before you make them. It means treating your infrastructure with the same automated rigor you apply to your code.
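Here is what impact analysis might look like in its simplest form. Once you have a discovered dependency graph (the resource names below are invented for illustration), you can walk it from the resource you intend to change and see the blast radius before you touch anything:

```python
from collections import deque

# Toy dependency graph discovered from a live environment:
# an edge A -> B means "B depends on A", so changing A may impact B.
graph = {
    "sg-payments": ["alb-public", "lambda-webhooks"],
    "alb-public": ["service-checkout", "service-accounts"],
    "lambda-webhooks": [],
    "service-checkout": [],
    "service-accounts": [],
}

def blast_radius(graph, changed_resource):
    """Return every resource transitively affected by changing `changed_resource`."""
    impacted, queue = set(), deque([changed_resource])
    while queue:
        node = queue.popleft()
        for dependent in graph.get(node, []):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted

print(sorted(blast_radius(graph, "sg-payments")))
# ['alb-public', 'lambda-webhooks', 'service-accounts', 'service-checkout']
```

A real system needs continuous re-discovery and far richer edge types, but even this toy version answers the question the war room spends hours reconstructing: what does this change actually touch?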
The cost of not solving this problem isn't just failed builds. It's million-dollar outages, eroded customer trust, and engineering teams that are afraid to evolve their systems. The investment in proper infrastructure dependency management pays for itself with the first major incident it prevents.