Why We Needed It
In December 2024 I received a message from one of our customers:
What’s going on with Overmind? Some of our risk analysis jobs are taking >10min and it’s slowing down our ability to deploy to production.
This was pretty bad. Overmind’s risk analysis process isn’t quick, but we aim to get changes analysed within 3min so that we’re speeding up the review process for our customers, not slowing it down.
Since this process makes heavy use of LLMs, however, performance troubleshooting is not as straightforward as it is with normal applications, for a number of reasons:
- A change that modifies 100 things will take longer to analyse than one that modifies one thing
- UNLESS that one thing is really important, and has a big blast radius, in which case it’ll take longer
- UNLESS the specific change you’re making is really simple, like changing a description, in which case it’ll be faster
- UNLESS that description is at odds with what the resource actually does, in which case it’ll need to work out whether the description is misleading, and whether that constitutes a risk given the architecture and importance of the infrastructure you’re modifying, in which case it’ll take longer.
To conduct this sort of analysis we make a lot of OpenAI calls, and those calls are highly dynamic. This means that for a given analysis job:
- We don’t know how many calls we will make to OpenAI
- We don’t know how many of these will be in parallel
- We don’t know how much data we will send in, or get back
This means that any absolute metric is going to be basically meaningless. We need to answer the question:
Is it slow because it’s got a lot of work to do? Or is it slow because OpenAI is slow?
The easiest metric to use here is Tokens Per Second, calculated as:
(input_tokens + output_tokens) / time
However, you might also want to measure Output Tokens Per Second. While input tokens do have an effect on the overall request duration, it’s much smaller than the effect of output tokens, so this gives you a more stable number, and it’s what we chose to use:
output_tokens / time
This gives us the underlying performance of the model itself, without being skewed by differences in the size of the requests.
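To make the calculation concrete, here’s a minimal sketch assuming the official openai Python SDK (v1+); the model and prompt are just placeholders:

```python
import time

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

start = time.monotonic()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarise the risks in this change..."}],
)
elapsed = time.monotonic() - start

usage = response.usage
total_tps = (usage.prompt_tokens + usage.completion_tokens) / elapsed
output_tps = usage.completion_tokens / elapsed  # the more stable of the two

print(f"Total TPS:  {total_tps:.1f}")
print(f"Output TPS: {output_tps:.1f}")
```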
200% Performance Gain
By creating a custom calculation in Honeycomb, we were able to show that the underlying performance of gpt-4o varies dramatically from call to call. Using the Assistants API and gpt-4o, we found that Output Tokens Per Second (TPS) followed a normal distribution with a 50th percentile of 25 and a 90th percentile of 38.
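That custom calculation relies on each OpenAI call being wrapped in an OpenTelemetry span that carries the token counts as attributes. The sketch below shows roughly what that looks like, assuming the opentelemetry-sdk with an exporter already configured to send to Honeycomb; the span and attribute names here are illustrative, not necessarily the ones we use:

```python
import time

from openai import OpenAI
from opentelemetry import trace

tracer = trace.get_tracer("llm.analysis")  # assumes an OTLP exporter -> Honeycomb is already set up
client = OpenAI()


def traced_completion(messages):
    """Run one chat completion and attach token/TPS attributes to its span."""
    with tracer.start_as_current_span("openai.chat.completion") as span:
        start = time.monotonic()
        response = client.chat.completions.create(model="gpt-4o", messages=messages)
        elapsed = time.monotonic() - start

        usage = response.usage
        span.set_attribute("llm.input_tokens", usage.prompt_tokens)
        span.set_attribute("llm.output_tokens", usage.completion_tokens)
        # Pre-computed here; a Honeycomb derived column over llm.output_tokens
        # and the span duration would work just as well
        span.set_attribute("llm.output_tokens_per_second", usage.completion_tokens / elapsed)
        return response
```

With an attribute like llm.output_tokens_per_second on every span, charting its P50 and P90 in Honeycomb gives the distribution described above.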
As an experiment we tried moving back to the Chat Completions API, which now has many of the features (like tool calling) that were previously exclusive to the Assistants API, and found an incredible jump in performance, even with the same prompts and the same model.

Performance doubled when switching to the Chat Completions API
Our 50th percentile results with the Chat Completions API were now faster than our 90th percentile had been with the Assistants API, even though nothing else had changed.
I can’t even begin to speculate as to why this might be. I would definitely have assumed that both APIs would use the same compute pool for inference and would therefore produce basically the same results, but this wasn’t even close to being true.
If you’re still using the Assistants API, now is the time to move away: it’s being deprecated soon and replaced by the Responses API, which in early testing shows similar performance to the Chat Completions API, but with more features.
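For reference, the same output-TPS measurement against the Responses API looks something like this (again a sketch, assuming a recent version of the official openai Python SDK; note that usage is reported as input_tokens/output_tokens rather than prompt_tokens/completion_tokens):

```python
import time

from openai import OpenAI

client = OpenAI()

start = time.monotonic()
response = client.responses.create(
    model="gpt-4o",
    input="Summarise the risks in this change...",
)
elapsed = time.monotonic() - start

# The Responses API names its usage fields input_tokens / output_tokens
output_tps = response.usage.output_tokens / elapsed
print(f"Output TPS: {output_tps:.1f}")
print(response.output_text)
```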