James Lane
AI ops agents are trying to fix problems that shouldn't have happened


It's safe to say the AI SRE / Ops space has exploded in recent months. On a recent trip to KubeCon in London we came across a number of startups looking to change the way we respond to and deal with alerts.

Let's imagine it's 06:12 a.m. Your phone vibrates. CloudWatch has fired a "5xx rate spike" alert for the payments API. Two minutes later Slack fills with messages from your team, along with updates from that new "autonomous SRE agent" you just deployed...

"Investigating... collecting logs... running kubectl top pods..."

But here's the problem: this isn't how experienced responders actually investigate incidents.

When human SREs get that same alert, they don't immediately dive into logs and metrics. As the STELLA report puts it, alerts serve as triggers but are "usually not, in themselves, diagnostic". Instead, they kick off a "complex process of exploration and investigation". Rather than diving straight into logs and metrics in a strictly data-driven way, experts often start with a broader approach built around questions about the system's recent history and state: What changed recently? What deployments went out today? What maintenance happened? Who else is online?
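To make that concrete, here is a minimal sketch of what those first questions look like as terminal commands, assuming a Kubernetes-based payments service; the deployment name, namespace, and repository path are illustrative, not taken from any real environment:

```bash
# The "what changed?" questions, asked from a terminal first.
# All names and paths below are illustrative.
kubectl -n prod rollout history deployment/payments-api           # what deployed recently?
kubectl -n prod get events --sort-by=.lastTimestamp | tail -n 20  # what just changed in the cluster?
git log --since="12 hours ago" --oneline -- infra/                # which infrastructure changes merged today?
```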

This approach exists because the systems we build keep growing in complexity and change continuously. Anomalies can be fundamental surprises, because it is hard to maintain adequate mental models of how the system, or in many cases systems, connect and change. Responders need to review that history, including "What was happening during the time just before the anomaly appeared?".

The AI agent, meanwhile, appears to be optimising for quickly executing pre-programmed diagnostic steps, essentially playing an expensive game of log grep. While AI systems are being developed to process and correlate heterogeneous data like logs, metrics, and traces and to assist with root cause analysis, significant challenges remain, such as the sheer "Data Volume and Velocity" [1]. Correlating cascading alerts triggered by a single root cause is "an NP-hard problem, making automated root cause analysis extremely challenging" [2], and many AI models are black boxes, making their decisions hard to interpret.

Incident response is fundamentally about sense-making in the face of surprise and uncertainty. Experts use their "incomplete, fragmented models of the system as starting points for exploration and to quickly revise and expand their models during the anomaly response" [3]. They then devise, weigh the implications of, and test one or more countermeasures, which are effectively experiments that test their mental models of the anomaly sources and of the surrounding system, drawing on tribal knowledge along the way.

Human responders commonly drop back to basic tools for assessment and modification during incidents. Command-line tools entered at the terminal prompt are heavily used because they provide tight interaction with the operating system, a more primal interaction with the platform compared to the indirect nature of automation and monitoring applications.
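As a rough illustration of that kind of direct, terminal-level assessment (the hostnames and resource names here are made up for the example):

```bash
# Quick, direct checks against the platform itself, with no dashboard in between.
# Hostnames and resource names are illustrative.
kubectl -n prod get pods -o wide | grep -v Running                             # anything unhealthy?
kubectl -n prod logs deploy/payments-api --tail=50                             # the last few log lines
curl -s -o /dev/null -w '%{http_code}\n' https://payments.example.com/healthz  # is it actually responding?
dig +short payments.example.com                                                # where is traffic really going?
```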

For this new generation of AI SRE tools to be successful, they can't skip this step entirely. They can't jump straight to symptom analysis without the context-building phase, without asking: what changed recently? What deployments went out today? In simpler terms, you wouldn't go to a doctor who started ordering blood tests before asking "when did this pain start?" or "what were you doing when it first began?"

This is why alert-driven AI can produce brilliant post-mortem analysis while missing obvious root causes that any human would spot in the first few minutes: "Oh, we deployed the billing service an hour ago and hardcoded the old load balancer IP."

The AI was busy analysing pod CPU metrics while the real problem was sitting right there in the change log.
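For illustration, here's a hedged sketch of how a responder might spot that kind of hardcoded dependency by hand; the service name, IP address, and ALB hostname are invented for the example:

```bash
# Compare what the Service is hardcoded to point at with where the new load balancer actually lives.
# All names and addresses here are illustrative.
kubectl -n prod get endpoints billing-api -o yaml          # reveals a manually managed IP, e.g. 10.0.3.27
dig +short new-alb-1234567890.eu-west-1.elb.amazonaws.com  # the replacement ALB resolves somewhere else
git log -S "10.0.3.27" --oneline                           # when (and in which change) was that IP hardcoded?
```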

What's broken is everything that happens before the alerts fire

You've got great tools for when things go wrong. Dashboards, alerts, incident response playbooks. But you're still playing that game of infrastructure roulette every time you deploy. Crossing your fingers and hoping this change won't be the one that takes down billing at 6 AM.

The real problem isn't better incident response. It's preventing incidents in the first place.

That's what we've built at Overmind, a system grounded in real time infrastructure context that catches problems before they become outages. Our customers have made it mandatory in production because it actually works. Here's what that looks like:

Yesterday: Developer opens a PR to replace the Application Load Balancer and move it to a new subnet.

Overmind (mapping your actual running infrastructure): Discovers the blast radius in real time, uncovering dependencies that shouldn't exist but do, like that billing-api Kubernetes Service with the old ALB's IP hardcoded from when someone "temporarily" fixed something six months ago.

Result: Overmind comments back on the PR: "High Risk: Replacing aws_lb.main will break kube_service/billing-api (namespace prod). Suggested fix → update Service or add DNS record."

Outcome: Developer fixes the issue before merging. Zero alerts, zero downtime.

Example PR: https://github.com/overmindtech/terraform-example/pull/235

This isn't about better alerting, it's about not needing the alerts in the first place. Instead of waiting for your payments API to start throwing 5xx errors, you catch the dependency that would cause those errors before the change ever hits production.

Get started with Overmind today – Discover the blast radius of every change before it hits production. See why our customers have made this mandatory for their deployments.
