Does it make sense to apply observability practices to Terraform? And I don’t mean using Terraform to configure your alerts in Datadog, I mean actually observing what Terraform is doing. Unless we’re willing to expand the definition of observability, I’d say no, it doesn’t. But if we think outside observability’s pillars, it becomes clear that we’re missing some obvious opportunities to prevent outages.
Thinking outside the pillars
I think we’re all in agreement now that the three pillars of observability are metrics, traces and logs, and that the practice of observability involves working out what’s going on inside a system by looking at what’s coming out of it (metrics, traces and logs). This idea doesn’t make much sense when applied to Terraform, though, for a few reasons:
Metrics:
- Terraform only runs when you want to make a change, meaning that there is no continuous stream of metrics to look for trends in
- Traditional metrics like response time don’t really make sense when applied to Terraform, since:
  - The APIs you’re calling are often outside your control (e.g. AWS)
  - You rarely care how long things take in Terraform, as long as they work
Traces:
- The Terraform Cloud agent does actually support OpenTelemetry
- However, this still isn’t likely to be very useful, since a change where “Terraform changes a security group and breaks your application” is likely to look exactly the same as “Terraform changes a security group and fixes your application” from the perspective of a trace
Logs:
- Terraform logs what it’s doing, and these logs are a helpful audit trail of what has changed and when
- However, since problems caused by configuration changes are often downstream of the change itself, it’s unlikely that the logs will help you understand an outage in the moment
Continuously collecting and aggregating metrics, traces and logs simply doesn’t help us to see what Terraform is doing and what its effects are.
So what does?
The missing pillars: config & state
Terraform is primarily concerned with managing config: the size of a volume, the value of an environment variable, or the attributes of a security group. Terraform also affects “state” information that is often overlooked, such as:
- Is a Kubernetes pod in CrashLoopBackOff state?
- Is a load balancer target healthy or not?
These two new pillars are currently not collected in any meaningful way by any observability tools (that I know of), and are what we are focusing on at Overmind.
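To make the config pillar a little more concrete, here is a minimal Python sketch that pulls attribute-level changes out of a plan. It assumes the plan has been exported with `terraform show -json`, which includes a `resource_changes` list. The sample data and the `config_changes` helper are made up for illustration, and this is not a description of how Overmind itself works.

```python
import json

# Hypothetical, heavily trimmed plan in the shape that
# `terraform show -json <planfile>` produces (illustrative data only).
SAMPLE_PLAN = json.loads("""
{
  "resource_changes": [
    {
      "address": "aws_ebs_volume.data",
      "change": {
        "actions": ["update"],
        "before": {"size": 100, "type": "gp3"},
        "after": {"size": 200, "type": "gp3"}
      }
    },
    {
      "address": "aws_security_group.web",
      "change": {
        "actions": ["no-op"],
        "before": {"name": "web"},
        "after": {"name": "web"}
      }
    }
  ]
}
""")


def config_changes(plan):
    """Return {address: {attribute: (before, after)}} for resources that change."""
    changes = {}
    for rc in plan.get("resource_changes", []):
        change = rc["change"]
        if change["actions"] == ["no-op"]:
            continue  # nothing planned for this resource
        # "before" is null on create and "after" is null on delete
        before = change.get("before") or {}
        after = change.get("after") or {}
        changes[rc["address"]] = {
            key: (before.get(key), after.get(key))
            for key in sorted(set(before) | set(after))
            if before.get(key) != after.get(key)
        }
    return changes


if __name__ == "__main__":
    for address, diff in config_changes(SAMPLE_PLAN).items():
        print(address, diff)
```

This only covers config. State, such as whether a load balancer target is healthy, never appears in a plan and has to be collected from the running infrastructure.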
What does Overmind do?
Overmind tracks your Terraform changes at every point in your workflow, allowing you to move faster with more confidence:
When running terraform plan:
- Based on the planned changes and the relationships that we have discovered, Overmind calculates the blast radius of what might be affected by this change, even if those things are not in Terraform.
- From this blast radius it can then create a set of risks that show you any potential config issues before you cause an outage.
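One way to picture a blast radius is as a graph search outward from the changed resources. The sketch below is a plain breadth-first search in Python over a hand-written relationship map. The `RELATIONSHIPS` data, the `blast_radius` function and the `max_depth` cut-off are all assumptions for illustration, not Overmind's actual algorithm or data model.

```python
from collections import deque

# Hypothetical relationship map: each item points at the items related to it.
# Some of these (like the DNS record) might not be managed by Terraform at all.
RELATIONSHIPS = {
    "aws_security_group.web": ["aws_instance.app", "aws_lb.public"],
    "aws_instance.app": ["aws_lb_target_group.app"],
    "aws_lb.public": ["aws_lb_target_group.app"],
    "aws_lb_target_group.app": ["route53:app.example.com"],
    "aws_s3_bucket.logs": [],
}


def blast_radius(changed, relationships, max_depth=3):
    """Breadth-first search outward from the changed items, up to max_depth hops.

    Returns the set of related items, excluding the changed items themselves.
    """
    seen = set(changed)
    queue = deque((item, 0) for item in changed)
    while queue:
        item, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in relationships.get(item, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return seen - set(changed)


if __name__ == "__main__":
    print(sorted(blast_radius(["aws_security_group.web"], RELATIONSHIPS)))
```

In this toy graph, a change to the security group reaches the instance, the load balancer, the target group and the DNS record, while the unrelated logs bucket stays outside the blast radius.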
When running terraform apply:
- Overmind will snapshot your infrastructure before and after the Terraform change, giving you a diff that allows you to immediately identify unexpected impacts of your changes and revert them before they can cause a knock-on effect.
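As a rough illustration of the before-and-after idea, the Python sketch below compares two snapshots that mix config (a security group's ports) with state (a target's health). The snapshot shape, the sample values and the `diff_snapshots` helper are hypothetical. Real snapshots would be far richer.

```python
# Hypothetical snapshots: {item: {attribute: value}}, taken around an apply.
BEFORE = {
    "aws_security_group.web": {"ingress_ports": [443, 8080]},
    "aws_lb_target_group.app/target-1": {"health": "healthy"},
}
AFTER = {
    "aws_security_group.web": {"ingress_ports": [443]},
    "aws_lb_target_group.app/target-1": {"health": "unhealthy"},
}


def diff_snapshots(before, after):
    """Report items that were added, removed, or changed between two snapshots."""
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    changed = {}
    for item in sorted(set(before) & set(after)):
        attrs = {
            key: (before[item].get(key), after[item].get(key))
            for key in sorted(set(before[item]) | set(after[item]))
            if before[item].get(key) != after[item].get(key)
        }
        if attrs:
            changed[item] = attrs
    return {"added": added, "removed": removed, "changed": changed}


if __name__ == "__main__":
    for item, attrs in diff_snapshots(BEFORE, AFTER)["changed"].items():
        print(item, attrs)
```

Here the plan only touched the security group, but the diff also surfaces the target flipping to unhealthy, which is exactly the kind of downstream effect that metrics, traces and logs from Terraform itself would not show.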
Sign up here to get started for free, or join our Discord to discuss the next wave of DevOps tooling.