I’d have loved to call this “The State of Terraform Report”, but we don’t have the resources to give you the flashy PDF you all deserve. Even so, we have done quite a lot of research into how people use Terraform as part of building Overmind, so here’s the first instalment of what we’ve learned. Enjoy!
First, let's talk about our methodology. We conducted a survey that involved a variety of questions—some multiple-choice, some requiring longer responses—and supplemented our survey with interviews from a number of participating companies. Each company was categorised per the State of DevOps Report methodology into Elite, High, Medium, and Low performance layers. We then aggregated the results to identify which Terraform practices were employed by each performance level.
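As a rough illustration of the aggregation step described above, here is a minimal Python sketch. It is not the analysis code behind this post: the `responses` records, the `reduce_lead_time` question name, and every score are invented placeholders, used only to show how scores might be grouped by performance level.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical survey records: (performance level, question, score out of 5).
# All values here are invented placeholders, not real survey data.
responses = [
    ("Elite", "reduce_lead_time", 3),
    ("Elite", "reduce_lead_time", 4),
    ("High", "reduce_lead_time", 4),
    ("Low", "reduce_lead_time", 5),
    ("Low", "reduce_lead_time", 4),
]


def average_by_level(records):
    """Group scores by (level, question) and return the mean of each group."""
    grouped = defaultdict(list)
    for level, question, score in records:
        grouped[(level, question)].append(score)
    return {key: mean(scores) for key, scores in grouped.items()}


averages = average_by_level(responses)
```

Comparing the resulting averages across levels for a single question is what lets you say whether a concern rises, falls, or stays flat as performance improves.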
What wasn't surprising
Higher-performing teams care less about lead time
According to the State of DevOps Report, the emphasis on metrics like deployment frequency often inherently leads to reduced lead times. Our findings echoed this, showing a clear linear relationship: the higher a team's DORA performance level, the less it prioritised reducing lead time for its Terraform deployments.
What was surprising
“Less than a day” isn’t a good enough lead time…?
Despite having lead times of less than a day according to the State of DevOps Report, elite-performing teams still rated reducing lead time as a priority, averaging 3.25 out of 5. This might imply either ingrained impatience, or that many teams that were "Elite" in deployment frequency (the metric we used to categorise) don't achieve "Elite" status in terms of lead time. Unfortunately, we don’t get enough granularity from the State of DevOps Report to investigate this further.
Interviews with Terraform users revealed that more advanced users were often handing over to other tools such as Helm or ArgoCD to do application deployments on top of infrastructure that was managed by Terraform, rather than managing both the infrastructure and applications within Terraform. This was driven in some cases by a need to separate infrastructure and application layers to better suit the existing team dynamic in the org, and in other cases by a desire to avoid long plan times and the change review process that often accompanies them.
You don’t need “Elite”. “High” might be good enough
An interesting trend was that, for several of the concerns we asked Terraform users about, users in the “High” category were the least concerned. For example:
Observability: High-performing teams were less concerned with reducing observability costs, reflecting an understanding of observability's value but suggesting they have not reached the point of diminishing returns.
Availability: Additionally, the interest in increasing availability didn't show a straightforward correlation with performance.
- Low performers showed little interest in improving availability, possibly due to lower expectations and minimal changes.
- Medium performers, who are navigating their transition into more frequent deployments, were most concerned, likely because they are in the “breaking a few eggs” phase of making an omelette. From a Terraform perspective this is the phase where running a local terraform apply is usually replaced with running in CI, and the workflow needs to become much more standardised as a result.
- High performers were the most satisfied with their availability levels of all the groups.
- Elite performers prioritised availability much more than high performers, possibly because of the critical nature of their platforms, or because of an increased focus on availability. We're not sure.
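To make the move from a local terraform apply to a standardised CI workflow more concrete, here is a minimal Python sketch of the kind of wrapper a pipeline might call. It is an assumption about what such a workflow could look like, not something any surveyed team described: the `tfplan` file name and the function names are hypothetical. The Terraform flags used (`-input=false`, `-out=`) and applying a saved plan file are standard CLI behaviour.

```python
import subprocess

PLAN_FILE = "tfplan"  # hypothetical file name for the saved plan


def plan_command(plan_file=PLAN_FILE):
    # -input=false makes Terraform fail instead of prompting, which suits CI.
    # -out saves the plan so the later apply runs exactly what was planned.
    return ["terraform", "plan", "-input=false", f"-out={plan_file}"]


def apply_command(plan_file=PLAN_FILE):
    # Applying a saved plan file does not ask for interactive approval.
    return ["terraform", "apply", "-input=false", plan_file]


def run_pipeline(run=subprocess.run):
    """Run plan, then apply the saved plan; stop on the first failure."""
    for cmd in (plan_command(), apply_command()):
        run(cmd, check=True)
```

In a real pipeline the plan and apply steps would usually be separate jobs with a review or approval gate between them. The point of the sketch is only that every change goes through the same two commands, rather than whatever each engineer happens to run locally.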
Even Elite can’t solve these two things…
Interestingly, a couple of questions showed little variance across performance levels. These were:
- Improve confidence when deploying
- Reduce change failures caused by misconfigurations
This was something that I focused on heavily in interviews, as I think it’s the most interesting result. From teams that only deployed locally, had no CI and ran terraform apply against production less than once a week, to teams that had fully automated workflows including tfsec, custom OPA policies, automated deployment etc., all teams seemed to be dissatisfied with their ability to gain confidence when deploying and to avoid outages when making changes with Terraform.
This is because none of the widely used tools in the Terraform space actually answer the question “Is this change a good idea?”. Answering that relies on your experience and the tribal knowledge of your team, which is fine for smaller teams and simpler environments but doesn’t scale. Overmind is solving this problem by giving you a full risk analysis of any Terraform change, one that takes into account not only what you’re changing, but also all the dependencies and the blast radius that might be affected by your change. We don’t rely on a static list of checks like tflint or tfsec, and instead provide contextual risks that are unique to your environment, allowing you to deploy with more confidence and fewer outages.
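To give a feel for what “blast radius” means in general terms, here is a minimal Python sketch that walks a dependency map to find everything downstream of a changed resource. This is a generic graph traversal for illustration only and makes no claim about how Overmind works internally: the `depends_on` map and the resource names in it are made up.

```python
from collections import deque

# Hypothetical dependency map: each resource lists the resources it depends on.
depends_on = {
    "aws_instance.web": ["aws_security_group.web", "aws_subnet.public"],
    "aws_lb.front": ["aws_subnet.public", "aws_security_group.web"],
    "aws_route53_record.www": ["aws_lb.front"],
    "aws_subnet.public": ["aws_vpc.main"],
    "aws_security_group.web": ["aws_vpc.main"],
}


def blast_radius(changed, depends_on):
    """Return every resource that directly or transitively depends on `changed`."""
    # Invert the map so we can look up "who depends on this resource?".
    dependents = {}
    for resource, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(resource)

    # Breadth-first walk outwards from the changed resource.
    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


affected = blast_radius("aws_security_group.web", depends_on)
```

Here a change to the security group reaches the instance, the load balancer, and the DNS record behind it, even though a plan would list only the security group itself as changing.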