Tool Calling Latency: The Bottleneck No One Talks About
Tool calling is supposed to be the easy part of building with LLMs. The model emits a JSON object, you route it to a function, the function returns a result, you feed that back. Every major serving stack supports it. Every model card claims it. And yet, when you actually wire up a tool-using agent and put it in front of a user, the experience is often noticeably worse than the plain chat experience with the same model. The agent feels heavy. Each turn takes longer than it should. The dead air between "user asked something" and "the agent did something" is uncomfortable.
Most of that comes from a part of the system nobody benchmarks: the latency of producing a tool call. Not the latency of the tool itself, but the inference time it takes the model to decide what tool to call and emit the structured arguments. This is short generation in a regime where most inference engines are slow, and where most published numbers do not apply. This post is about why tool calling is harder on the serving stack than it looks, where the time actually goes, and how that shapes which agent designs are practical.
Why tool calls are short generations
A typical tool call looks like this: the model is given a system prompt with tool definitions, the user message, and any prior tool results. It produces a small structured output, usually well under 100 tokens. Something like:
{ "tool": "search_orders", "arguments": { "customer_id": "C-2398", "status": "pending" } }
That is maybe 30 to 50 tokens depending on how the schema is encoded. With a model that produces 100 tokens per second in steady state, the decode time for that call is around 300ms to 500ms. Add a TTFT of 400ms to 800ms and you are at roughly one second to produce a single tool call. Then the tool runs, the result comes back, and the model has to read it and produce the next thing. Maybe another tool call, maybe a final response.
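As a rough model, per-call latency is just TTFT plus output length divided by decode rate. A quick sketch with the illustrative numbers above, not measurements from any particular stack:

```python
def tool_call_latency_ms(ttft_ms: float, output_tokens: int, decode_tok_per_s: float) -> float:
    # Time to first token, plus the time to decode the structured arguments.
    return ttft_ms + output_tokens / decode_tok_per_s * 1000.0

# A ~40-token call at 100 tok/s with a 600ms TTFT: about one second before the tool even runs.
print(tool_call_latency_ms(ttft_ms=600, output_tokens=40, decode_tok_per_s=100))  # 1000.0
```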
If you have ever wondered why agents that use tools feel sluggish on the same model that streams chat at a comfortable pace, this is the reason. Chat hides latency by streaming output the user reads while it generates. Tool calls cannot hide latency, because nothing is shown to the user until the call has completed and the result is back. The user is staring at a spinner during exactly the part of inference that serving stacks are worst at.
The prefill is doing more work than you think
The prompt for a tool-using model is not small. Tool definitions are verbose. A schema for a single moderately complex tool, with a description, parameter types, descriptions of each parameter, and example usage, is often 300 to 600 tokens. Real agents have multiple tools. A coding agent might have 8 to 15 tools defined, which puts the tool block alone at 3,000 to 8,000 tokens.
On top of that, you have the system prompt that explains how the agent should behave, how to format calls, when to stop, what error states mean. Add the conversation history, prior tool calls, and prior tool results, and the prompt sent to the model on each step is regularly 5,000 to 20,000 tokens long. The output is 50 tokens. The ratio is wildly skewed toward prefill.
This matters because prefill is where TTFT comes from. If your serving stack does not aggressively cache the prefix, every tool call pays the full prefill cost from scratch. On a 70B-class model with a 10k token prompt, that can be 600ms to 2 seconds just to start producing the first output token. The actual generation of the JSON object is the cheap part.
Prefix caching is the obvious fix and it works, but it has to be set up correctly. The cache key is the exact token sequence of the prompt prefix. If your agent framework reorders messages, rewrites tool definitions on every call, or injects timestamps into the system prompt, your cache hit rate drops to zero without any obvious symptom. The model still works. It is just slow in a way that looks like the model is slow, when really the cache is being defeated by formatting churn upstream.
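A minimal sketch of cache-friendly prompt construction, assuming an OpenAI-compatible chat endpoint with server-side prefix caching; the model name and the single tool are placeholders:

```python
# Built once and reused verbatim on every call, so the templated prompt prefix
# is token-identical across requests and the server's prefix cache can hit.
SYSTEM_PROMPT = "You are an order-support agent. Use the tools to answer questions."

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_orders",
            "description": "Search a customer's orders.",
            "parameters": {
                "type": "object",
                "properties": {
                    "customer_id": {"type": "string"},
                    "status": {"type": "string", "enum": ["pending", "shipped", "delivered"]},
                },
                "required": ["customer_id"],
            },
        },
    },
]

def build_request(history: list[dict]) -> dict:
    # Stable prefix (system prompt, tool definitions) first; the volatile suffix
    # (conversation and tool results) grows at the end. No timestamps, request
    # IDs, or reordered tool lists anywhere in the prefix.
    return {
        "model": "main-agent-model",  # placeholder
        "messages": [{"role": "system", "content": SYSTEM_PROMPT}, *history],
        "tools": TOOLS,
    }
```

The failure mode this avoids is an agent framework that rebuilds the tool list from live objects on every call and serializes it in a slightly different order or format each time.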
Constrained decoding has a real cost
Most production tool-calling setups use some form of structured output enforcement: JSON schema validation, grammar-constrained decoding, regex masks. These exist because models do not reliably produce well-formed JSON without them, especially smaller models, and especially when sampling at nonzero temperature.
Constrained decoding is not free. The standard approach is to compute a mask over the vocabulary at each step that allows only tokens consistent with the grammar, and then sample within that mask. Computing the mask requires walking the grammar state machine for every candidate token. Naive implementations do this on the CPU after each forward pass, which adds tens of milliseconds per token. On a 50 token output, that is one to three seconds of overhead on top of the model's actual generation time.
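A sketch of that naive approach, assuming a hypothetical grammar object with an accepts(state, token) check; real libraries differ, but the shape of the cost is the same:

```python
import torch

def mask_logits_naive(logits: torch.Tensor, grammar, state) -> torch.Tensor:
    """Mask one decode step's logits (shape [vocab_size]) to grammar-legal tokens.

    The inner loop walks the grammar once per vocabulary entry, on the CPU,
    between every forward pass. With 100k+ token vocabularies this is where
    the tens of milliseconds per output token come from.
    """
    mask = torch.full_like(logits, float("-inf"))
    for token_id in range(logits.shape[-1]):
        if grammar.accepts(state, token_id):  # hypothetical grammar API
            mask[token_id] = 0.0
    return logits + mask
```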
Better implementations precompute mask tables, batch the grammar computation, and run it on the GPU alongside sampling. The state of the art adds maybe 1 to 5ms per token, which is acceptable. But many open source serving stacks are still in the naive regime, especially when used with custom schemas. If you are using JSON mode and feeling like the output rate is lower than the model's documented decode speed, this is probably why.
There is a related problem with how schemas are translated into grammars. A loose schema that says "object with at least these fields" is faster to constrain than a tight schema that pins every field type and requires specific enums. People often write tighter schemas to push correctness onto the decoder, which is the right instinct, but it also makes decoding slower. The performance tax is usually worth paying; only loosen a schema to claw back speed if measurement shows constrained decoding is actually where the time goes.
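For a concrete sense of loose versus tight, here are two JSON schemas for the same arguments, written as Python dicts; the field names echo the example above and are otherwise illustrative:

```python
# Loose: constrains the object shape but leaves values mostly open.
loose_schema = {
    "type": "object",
    "properties": {
        "customer_id": {"type": "string"},
        "status": {"type": "string"},
    },
}

# Tight: required fields, a value pattern, an enum, no extra keys. Better
# correctness guarantees, but more grammar states to track per decoded token.
tight_schema = {
    "type": "object",
    "properties": {
        "customer_id": {"type": "string", "pattern": "^C-[0-9]{4}$"},
        "status": {"type": "string", "enum": ["pending", "shipped", "delivered"]},
    },
    "required": ["customer_id", "status"],
    "additionalProperties": False,
}
```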
Why the model latency floor matters more for tools
Chat applications have a graceful failure mode for slow inference. The user reads as the model writes. If TTFT is 800ms, the user might not even notice if the rest of the response streams smoothly. Tool calling does not have this property. The user sees nothing until the structured call is fully decoded, the tool runs, and either another call or a user-visible response is produced.
This means the floor on perceived latency is much higher for tool-using interactions. If a chat response feels good at 800ms TTFT, the same model doing a single tool turn pays that 800ms of TTFT, plus roughly 500ms of decode, plus tool execution, plus a second inference call to produce the final response, all before the user sees anything new. You are at three to five seconds of perceived wait time on a single tool turn, on a model that feels fast in chat.
The product implications are concrete. UI patterns that work for chat do not work for tools without modification. You cannot stream a JSON tool call to the user, because partial JSON is not meaningful. The common workarounds are:
- Show a "thinking" or "calling tool X" indicator the moment the model decides on a tool, even before arguments are complete. This requires the serving stack to surface the partial decode, which most do not by default.
- Pre-decide which tool the model will call by using a smaller, faster model as a router, and only invoke the main model to fill in arguments (see the sketch after this list). This adds complexity but cuts the perceived TTFT roughly in half for the common case.
- Cache previous tool results aggressively so that re-asking is fast. This works for read-heavy workloads, less so for agents that do new things on every turn.
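A minimal sketch of that router pattern, assuming the openai Python client pointed at an OpenAI-compatible endpoint; the model names, endpoint URL, and the show_status hook are placeholders, not anything a particular stack provides:

```python
from openai import OpenAI

client = OpenAI(base_url="https://example.invalid/v1", api_key="...")  # placeholder endpoint

def show_status(message: str) -> None:
    # Stand-in for whatever pushes a "calling tool X..." indicator to the UI.
    print(message)

def run_tool_turn(messages: list[dict], tools: list[dict]):
    # Step 1: a small, fast router model only names the tool, so the UI can
    # show feedback before the main model has decoded any arguments.
    route = client.chat.completions.create(
        model="small-router-model",  # placeholder
        messages=messages + [{
            "role": "user",
            "content": "Which tool should be called next? Reply with the tool name only.",
        }],
        max_tokens=8,
    )
    tool_name = route.choices[0].message.content.strip()
    show_status(f"Calling {tool_name}...")

    # Step 2: the main model fills in the arguments for that single tool.
    call = client.chat.completions.create(
        model="main-agent-model",  # placeholder
        messages=messages,
        tools=[t for t in tools if t["function"]["name"] == tool_name],
        tool_choice={"type": "function", "function": {"name": tool_name}},
    )
    return call.choices[0].message.tool_calls[0]
```

The win comes from the router call being cheap enough that the status indicator appears while the main model is still prefilling.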
None of these are exotic ideas, but they only become necessary when you have actually felt the latency. Teams that have only built chat tend to be surprised by how much of the agent UX problem is actually a serving problem.
Concurrency and the long tail
Tool-using workloads have a different concurrency profile than chat. A single user interacting with an agent generates a burst of inference calls within a few seconds, then a quiet period while the user reads the result and types a new message. Multiple users hitting the same endpoint produce overlapping bursts.
If your serving stack is optimized for steady state throughput on long generations, it tends to handle these bursts poorly. Continuous batching helps, but only if the requests fit cleanly into the batch shape. Short generations have a lot of variance in how many decode steps they need, which causes head-of-line blocking when one request in a batch needs 200 tokens and the others only need 30. The fast requests sit idle waiting for the slow one to finish a step before the batch advances.
This shows up as a long tail in the latency distribution. The p50 of tool call latency might be 800ms while the p99 is 4 seconds. For a single tool call this is annoying. For an agent that does 10 sequential tool calls, hitting a tail event stops being rare: roughly one task in ten includes a p99 step, and the odds climb quickly as agents run longer sequences. Because the slow steps dominate total task time, the tail latency of individual calls leaks into the typical latency of whole tasks.
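The arithmetic is worth doing once. Assuming independent steps, the chance that an n-step task sees at least one p99 event is 1 - 0.99^n:

```python
def p_at_least_one_tail(n_calls: int, quantile: float = 0.99) -> float:
    # Assumes steps are independent, which is optimistic when bursts are correlated.
    return 1 - quantile ** n_calls

print(p_at_least_one_tail(10))  # ~0.10: about one 10-step task in ten hits a p99 step
print(p_at_least_one_tail(50))  # ~0.39 for a longer-horizon agent
```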
The right metric to track for tool calling is not p50 of single calls, it is p99 of single calls, or even better, p50 of full task completion across a representative agent workload. Most serving teams do not measure this because it requires running an actual agent, not a synthetic load generator.
The benchmark gap
Public LLM serving benchmarks rarely measure any of this. The standard format is: 1k input tokens, 256 output tokens, single request, report tokens per second. This is a reasonable measurement for batch inference economics. It tells you almost nothing about how a model will perform inside a tool-using agent, where input is 8k tokens, output is 60 tokens, requests come in correlated bursts, and the prompt is mostly cacheable but only if you are careful.
A more honest benchmark for tool calling would specify (a sketch of such a workload spec follows the list):
- Input length distribution that matches real agent prompts (large system prompt with tool definitions, growing conversation history).
- Output length distribution skewed toward short structured outputs.
- A cache hit rate target, since that drastically changes the numbers.
- Concurrent request bursts rather than steady throughput.
- Constrained decoding overhead measured separately so it can be attributed.
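One way to pin those choices down is a workload spec along these lines; the field names and numbers are illustrative, not a standard format:

```python
# Illustrative workload spec for a tool-calling benchmark; it encodes the shape
# of the measurement, not any particular tool's configuration schema.
tool_calling_workload = {
    # Prompts look like real agent prompts: big tool block plus growing history.
    "input_tokens": {"distribution": "lognormal", "p50": 8_000, "p95": 20_000},
    # Mostly short structured calls, occasionally a longer final answer.
    "output_tokens": {
        "tool_call": {"p50": 50, "weight": 0.8},
        "final_answer": {"p50": 300, "weight": 0.2},
    },
    # How much of each prompt is a stable, cacheable prefix.
    "prefix_cache_hit_rate_target": 0.9,
    # Correlated bursts per user instead of a steady open-loop arrival rate.
    "arrival": {"pattern": "bursts", "calls_per_burst": 4, "think_time_s": 20, "concurrent_users": 50},
    # Run with and without structured output so its overhead can be attributed.
    "constrained_decoding": [True, False],
    # Report the numbers that predict agent behavior, not just median throughput.
    "report": ["cached_ttft_p50", "ttft_p99", "call_latency_p99", "task_latency_p50"],
}
```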
This is more work than the standard benchmark, but it produces numbers that actually predict whether a serving stack will work for agents. Without it, you end up choosing inference providers based on long generation throughput and being surprised when your tool-using agent feels slow on a model that benchmarks well.
What this means for building agents
The practical takeaway is that tool calling is a different inference workload from chat, and serving stacks that are good at one are not automatically good at the other. If your agent feels slow:
- Measure cached TTFT, not steady state throughput. That is the number that controls per-step latency (a quick measurement sketch follows this list).
- Verify that prefix caching is actually hitting. Stable serialization of the prompt across calls is the single highest leverage thing you can do.
- Profile constrained decoding overhead separately. If you see a gap between documented decode speed and observed speed during structured output, this is probably it.
- Look at p99 of single calls, not p50. Tail latency is what dominates multi-step task time.
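A quick way to check the first two items, assuming the openai Python client and an OpenAI-compatible streaming endpoint (model name and URL are placeholders): send the same prompt twice and compare time to first token. If the repeat call is not meaningfully faster, the prefix cache is probably not hitting.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="https://example.invalid/v1", api_key="...")  # placeholder endpoint

def measure_ttft(messages: list[dict]) -> float:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="main-agent-model",  # placeholder
        messages=messages,
        max_tokens=64,
        stream=True,
    )
    for _ in stream:
        # The first streamed chunk is a reasonable proxy for time to first token.
        return time.perf_counter() - start
    return float("nan")

messages = [
    {"role": "system", "content": "...long stable system prompt and tool definitions..."},
    {"role": "user", "content": "List my pending orders."},
]
first = measure_ttft(messages)
second = measure_ttft(messages)  # identical prompt, so the prefix should be cached
print(f"first call TTFT {first * 1000:.0f}ms, repeat call TTFT {second * 1000:.0f}ms")
```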
None of this is glamorous work. It is plumbing. But it is what separates an agent that feels responsive from one that feels stuck, on the same model with the same prompts.
If you are building tool-using agents and the inference latency is what is making the experience worse than your chat product, General Compute's API is set up for the workload: short structured generations, high prefix cache hit rates, low TTFT, and predictable tail latency under bursty load. It is OpenAI compatible, so pointing an existing agent framework at our endpoint is usually a config change. The numbers that move when you do it are the ones the user actually feels.