The Agentic Inference Tax: Why Agents Need 10x Faster Models
A chat application is a single LLM call followed by a stream of tokens to a user who is reading them. An agent is something else. It is a loop: think, call a tool, read the result, think again, maybe call another tool, eventually stop. Each pass through that loop is a separate forward pass through the model. The user sees one task. The system sees ten or twenty inference calls.
This is the agentic inference tax. The model that felt fast enough for chat suddenly feels broken when you put it inside an agent loop, because every weakness in latency gets multiplied by the number of steps. A 2-second response time is fine when a person is reading the answer. It is a 30-second wait when the agent has to do 15 steps to finish a task. This post is about where that multiplier comes from, why the standard chat benchmarks miss it, and what changes when the underlying model gets meaningfully faster.
A chat call versus an agent task
In a chat, the cost structure is simple. You send a prompt, the model generates some output, you stream it. The user perceives two numbers: time to first token (TTFT) and tokens per second after that. If both are good, the experience is good. If TTFT is 400ms and the model puts out 80 tokens per second, a 200-token answer arrives in about three seconds, and the user starts reading well before generation finishes.
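As a quick sanity check, the perceived completion time for a chat answer is just TTFT plus output length divided by decode speed. A minimal sketch with the numbers above:

```python
def chat_completion_time(ttft_s: float, output_tokens: int, decode_tps: float) -> float:
    """Perceived time for one chat answer: time to first token plus the
    remaining tokens at the steady-state decode rate."""
    return ttft_s + output_tokens / decode_tps

# 400ms TTFT, 80 tokens/sec, 200-token answer -> about 2.9 seconds
print(chat_completion_time(0.4, 200, 80.0))
```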
An agent task does not look like that. The agent receives a goal, plans a step, generates a tool call, sends it to a tool, waits for the tool result, and feeds that result back into the next forward pass. In a typical ReAct loop the model often emits a short reasoning trace and then a structured call. None of those individual generations are long. Most are a hundred tokens or fewer. But each one pays the full cost of TTFT plus a small decode tail. And because the next step depends on the previous one, none of it parallelizes.
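To make the shape of the loop concrete, here is a minimal sketch of a ReAct-style loop. The `call_model` and `run_tool` arguments stand in for whatever serving client and tool dispatcher you actually use, and the sketch assumes the model returns either a final answer or a tool name with arguments; the structural point is that every iteration is a separate, sequential model call:

```python
def run_agent(goal, call_model, run_tool, max_steps=10):
    """Minimal ReAct-style loop. Each iteration is one model call (full TTFT
    plus a short decode), usually followed by one tool call, and no iteration
    can start before the previous one has finished."""
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        step = call_model(history)               # pays TTFT on every iteration
        if step.get("final_answer"):
            return step["final_answer"]
        result = run_tool(step["tool"], step["arguments"])
        history.append({"role": "assistant", "content": str(step)})
        history.append({"role": "tool", "content": result})
    return "step budget exhausted"
```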
If you have a model with 500ms TTFT and you do 10 sequential steps, you have just spent at least 5 seconds on TTFT alone, before counting decode time, tool execution, or any retries. In practice, a real agent task pays a lot more than that, because steps are not uniform. Some steps generate longer plans. Some require the model to read a large tool result and respond. Some get retried because the structured output failed validation.
How the multiplier shows up in real workloads
The cleanest way to see the tax is to instrument an agent and look at where the wall clock time goes. The general shape, across the agents I have worked with, looks like this:
- 60% to 80% of total time is sequential LLM inference.
- 10% to 30% is tool execution (HTTP calls, database queries, code execution).
- The rest is overhead: serialization, retries, scheduler waits.
In other words, the dominant cost is the LLM, not the tools, even when the tools themselves are not trivial. People often expect the opposite, because they think about the agent in terms of what it is doing in the world. But the agent spends most of its time generating the next sentence about what to do, not actually doing it.
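Getting that breakdown does not require anything sophisticated. A sketch of the instrumentation, with the agent's own `call_model` and `run_tool` calls left as comments:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)          # category -> accumulated seconds

@contextmanager
def timed(category):
    """Accumulate wall clock time per phase of the agent loop."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[category] += time.perf_counter() - start

# Inside your loop (call_model / run_tool are your own functions):
#   with timed("llm"):
#       step = call_model(history)
#   with timed("tool"):
#       result = run_tool(step)
# After the task, `timings` shows how the wall clock splits across categories.
```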
A 10-step coding agent that uses a model with 60 tokens per second decode and 600ms TTFT might have this profile:
- 10 calls of TTFT: 6 seconds.
- 10 calls of decode at roughly 80 tokens per call: about 13 seconds.
- Tool execution averaged across calls: 4 seconds.
- Retry overhead and structured output reparsing: 2 seconds.
Total: around 25 seconds. The model itself accounts for 19 of those. If you swap in a model with 200ms TTFT and 200 tokens per second decode, the same 10 steps cost 2 seconds of TTFT and roughly 4 seconds of decode. Now total task time is closer to 12 seconds. Same agent, same prompts, same tools. Half the wall clock.
That is the multiplier in action. A 3x improvement in the model's per-call latency turns into a 2x improvement in end-to-end task time, which is the number that actually matters to whoever is waiting.
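The numbers above fall out of simple arithmetic, and it is worth keeping the formula handy when you evaluate a model swap. A small sketch that reproduces both profiles:

```python
def task_time(steps, ttft_s, decode_tps, tokens_per_step, tool_s, retry_s):
    """Wall clock for a sequential agent task: (model seconds, total seconds)."""
    llm = steps * (ttft_s + tokens_per_step / decode_tps)
    return round(llm, 1), round(llm + tool_s + retry_s, 1)

# Slow model: 600ms TTFT, 60 tok/s  -> ~19s of model time, ~25s total
print(task_time(10, 0.6, 60, 80, tool_s=4, retry_s=2))
# Fast model: 200ms TTFT, 200 tok/s -> ~6s of model time, ~12s total
print(task_time(10, 0.2, 200, 80, tool_s=4, retry_s=2))
```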
Why the existing inference benchmarks miss this
Most published benchmarks measure throughput on long generations. A common setup is to send a 1k-token prompt, ask the model to produce 256 or 512 output tokens, and report tokens per second across batch sizes. This is fine for measuring batch serving economics. It is not fine for measuring agent feasibility.
Agent calls are short. A tool call is often 30 to 80 output tokens. A planning step is usually under 200. The model spends a much larger fraction of its time inside the prefill and the first few decoded tokens, where most engines are underutilized and where TTFT dominates. A model that does 300 tokens per second in steady state but takes 800ms to start producing the first token will look great on long-generation benchmarks and feel terrible inside an agent loop.
This is also where prefix caching matters more than people realize. If your agent reuses a long system prompt across every step, and your serving stack rebuilds the KV cache from scratch each time, you are paying the prefill cost on every loop iteration. The right number to measure is "cached TTFT," the time to first token when the system prompt is already in cache. For agentic workloads, the gap between cached and uncached TTFT can be the difference between a 10 second task and a 60 second task.
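A quick way to check whether you are actually getting cached TTFT is to send the same long system prompt twice in a row and compare the time to the first streamed chunk. The sketch below assumes an OpenAI-compatible streaming endpoint; the URL, model name, and key are placeholders, the measurement is approximate, and whether the second call really hits the prefix cache depends on your serving stack:

```python
import time
import requests

URL = "https://your-endpoint.example.com/v1/chat/completions"  # placeholder
HEADERS = {"Authorization": "Bearer YOUR_KEY"}                  # placeholder
SYSTEM = "You are a careful coding agent. " * 300               # long, reused prefix

def ttft(user_msg: str) -> float:
    """Approximate time to the first streamed chunk for one short completion."""
    body = {"model": "your-model", "stream": True, "max_tokens": 16,
            "messages": [{"role": "system", "content": SYSTEM},
                         {"role": "user", "content": user_msg}]}
    start = time.perf_counter()
    with requests.post(URL, json=body, headers=HEADERS, stream=True) as r:
        for line in r.iter_lines():
            if line:                          # first chunk back = first token
                return time.perf_counter() - start
    return float("nan")

print("uncached TTFT:", ttft("step 1"))  # pays full prefill for SYSTEM
print("cached TTFT:  ", ttft("step 2"))  # same prefix, ideally served from cache
```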
The retry problem
Agents retry. This is not a bug; it is a property of how they work. The model sometimes generates malformed JSON. It sometimes calls a tool with the wrong arguments. It sometimes proposes a plan that fails its own self-check. The agent framework catches these and asks the model to try again.
In a slow inference setting, retries are catastrophic. If your base case is 25 seconds and you have to retry one step, you are now at 30 seconds. Retry two steps and you are at 35. The agent that worked in evals starts feeling unusable in production, because production has a wider distribution of inputs and the tail of retries shows up.
Faster inference does not eliminate retries. It changes the cost of retrying. With a fast enough model, the agent can afford to be more aggressive: generate two candidate plans and pick the better one, validate every tool call before executing it, run a self-critique step. Each of those is another LLM call, which means each one adds latency. If a single call costs 200ms instead of 2 seconds, those extra calls become affordable.
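Here is a sketch of the cheap-to-retry pattern. `call_model` is again a stand-in for your own client, and the validation is deliberately minimal; the point is that every retry is another full model call, so its cost is set entirely by per-call latency:

```python
import json

def tool_call_with_retry(history, call_model, known_tools, max_retries=2):
    """Ask for a tool call, validate it, and retry on failure. The check here
    is minimal: the output parses as JSON and names a known tool."""
    for _ in range(max_retries + 1):
        raw = call_model(history)                 # one more full model call per retry
        try:
            call = json.loads(raw)
            if isinstance(call, dict) and call.get("tool") in known_tools:
                return call
        except json.JSONDecodeError:
            pass
        history = history + [{"role": "user",
                              "content": "That was not a valid tool call. Try again."}]
    raise RuntimeError("no valid tool call after retries")
```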
There is a useful reframing here. Slow inference forces you to design agents that are minimal: as few steps as possible, no double-checking, no parallel exploration. Fast inference lets you design agents that are robust: more steps, more validation, more retries when something looks off. The set of feasible architectures changes with latency.
Voice agents and the 500ms ceiling
Voice agents are the clearest case where the inference tax becomes a hard constraint. Conversational turn-taking expects a response within roughly 500ms to feel natural. That budget has to cover everything: ASR finalization, the LLM call, possibly a tool call, TTS synthesis, and audio playback startup.
If your LLM TTFT is 600ms, you have already missed the budget before the model has produced anything. The voice agent will feel laggy no matter how good the rest of the stack is. This is why voice deployments often resort to small models, aggressive prompt caching, and parallel speculative paths: the latency budget cannot be met any other way.
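The budget arithmetic is unforgiving. A back-of-the-envelope version, where the non-LLM numbers are illustrative assumptions rather than measurements of any particular stack:

```python
BUDGET_MS = 500                                   # rough natural turn-taking target
ASR_MS, TTS_START_MS, PLAYBACK_MS = 120, 80, 50   # assumed fixed costs per turn

llm_budget_ms = BUDGET_MS - (ASR_MS + TTS_START_MS + PLAYBACK_MS)
print(llm_budget_ms)           # 250ms left for the LLM call(s)
print(600 <= llm_budget_ms)    # 600ms TTFT: False, budget blown before decode starts
print(200 <= llm_budget_ms)    # 200ms TTFT: True, with 50ms to spare
```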
For multi-turn voice agents that do tool calls, the tax compounds again. A user asks for the weather, the agent has to plan, call the weather API, and respond. Even a simple two-step agent has to fit two LLM calls plus a tool call inside the user's perceived response time, or you start hearing dead air. With 200ms TTFT this is achievable. With 1 second TTFT it is not.
Browser and code agents
Browser agents and code agents have a different latency profile but the same structure. A browser agent loads a page, observes the DOM, decides what to click, clicks, waits for the page, observes again. A code agent reads files, decides what to edit, applies the edit, runs tests, reads output, decides the next step.
In both cases, the user is willing to wait longer than they would for chat. A 30-second task is fine. A two-minute task starts feeling slow. A five-minute task often gets abandoned.
The reason fast inference matters here is not that any single step has to be sub-second. It is that the number of steps the agent can afford grows with how fast each step is. A code agent that runs at 2 seconds per step is capped at maybe 30 steps before users give up, which limits the size of the task it can handle. A code agent that runs at 400ms per step can handle 100 steps in the same wall time, which is the difference between fixing a typo and refactoring a module.
This is the deeper version of the inference tax: it does not just make agents slower. It makes some agent designs impossible. The product surface that you can build is constrained by the latency of the underlying model, not by the model's quality.
What to measure
If you are building agents, the model benchmarks worth tracking are not the same as the chat benchmarks. The ones that matter:
- Cached TTFT. Time to first token when the system prompt is already in the KV cache.
- Short generation latency. Total time to produce 50, 100, and 200 tokens. This is what each agent step actually looks like.
- Structured output latency. Time to produce a valid JSON tool call, including any decoding constraints. Some serving stacks pay a real cost here.
- Concurrent step latency. What happens to TTFT when N agent loops are running against the same endpoint. Per-agent latency under load matters as much as aggregate tokens per second.
The standard "tokens per second on a 512 token completion" number tells you almost nothing about whether a model will work inside an agent.
What changes at 10x
A 10x speedup in agent step latency does not mean agents become 10x faster end to end. Tool execution and other fixed overhead set a floor. But it changes which patterns are practical:
- Self-verification on every step becomes cheap.
- Parallel candidate generation, where the agent produces two or three plans and picks the best, fits inside the same wall clock budget as a single plan today (sketched after this list).
- Long-horizon agents that take 50 to 100 steps stop being a research curiosity and start being shippable.
- Voice agents stop having to choose between fast and capable.
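Of those, parallel candidate generation is the easiest to sketch. `call_model_async` and `score_plan` are stand-ins for your own async client and ranking heuristic; the point is that the candidates run concurrently, so the wall clock cost of n plans is roughly the cost of one step:

```python
import asyncio

async def best_of_n(history, call_model_async, score_plan, n=2):
    """Generate n candidate plans concurrently and keep the best one.
    The calls overlap rather than run back to back, so wall clock cost
    is roughly one step's latency, not n."""
    candidates = await asyncio.gather(*(call_model_async(history) for _ in range(n)))
    return max(candidates, key=score_plan)
```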
Most of the agent architectures in the literature were designed under the assumption that LLM calls are slow and expensive. As that assumption changes, the design space opens up. The agents that ship in two years will not look like ReAct loops with three retries. They will be wider, deeper, and more redundant, because the cost of being wrong is no longer measured in seconds of dead air.
If you are running agentic workloads and the inference latency is what is bottlenecking your design, General Compute's API is built for short, sequential calls with aggressive prefix caching and low TTFT. It is the workload we optimized for. Pointing your agent at our endpoint is usually a few lines of config, and the wall clock difference shows up immediately.