Agent Readout

Your AI Agent Is Only as Good as Its Inference Speed

Agent latency compounds across every sequential step. This post covers the multiplier effect, how model routing can cut costs without sacrificing quality, and why parallelizing calls is one of the highest-leverage improvements you can make to an agentic system.

Author: General Compute
Published: 2026-06-28
Tags: agents, inference, latency, coding-agents, model-routing

When you build an LLM chat interface, you are writing a wrapper around one API call. When you build an agent, you are writing a wrapper around twenty. The difference matters because every shortcoming in your inference layer shows up once in a chat app and twenty times in an agent.

The way people discover this is usually not in benchmarks. It is in production, when the thing that felt fast in evals suddenly feels unusable because a real user is waiting for it to do a real task. The model did not get slower. The workload did.

## The multiplier in concrete terms

A ReAct agent doing a moderately complex task -- finding a bug, writing a unit test, refactoring a function -- makes somewhere between 10 and 40 LLM calls, depending on how many tool results it has to read and how many retries it encounters. Each call has a time to first token (TTFT) and a decode tail.

If TTFT is 600ms and decode runs at 60 tokens per second, a 100-token tool call takes about 2.3 seconds. Do that 20 times and you have 46 seconds of pure inference, before tool execution, before tests run, before the user sees a single useful result.

If you bring TTFT down to 150ms and decode up to 200 tokens per second, the same call takes 0.65 seconds. 20 calls is 13 seconds. Same agent, same prompts, same model quality. The user experience goes from "this feels broken" to "this is fast enough to use."
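The arithmetic in the two scenarios above follows a simple model: per-call latency is TTFT plus output tokens divided by decode rate. Here is a minimal sketch of that model. The function names are illustrative, and it deliberately ignores tool execution, network overhead, and prefill growth as context accumulates.

```python
def call_latency(ttft_s: float, decode_tps: float, output_tokens: int) -> float:
    """Seconds for one LLM call: time to first token plus decode time."""
    return ttft_s + output_tokens / decode_tps


def loop_latency(ttft_s: float, decode_tps: float, output_tokens: int, calls: int) -> float:
    """Seconds of pure inference for a strictly sequential loop of identical calls."""
    return calls * call_latency(ttft_s, decode_tps, output_tokens)


# The two backends described in the text, 20 calls of 100 output tokens each
slow = loop_latency(ttft_s=0.600, decode_tps=60, output_tokens=100, calls=20)
fast = loop_latency(ttft_s=0.150, decode_tps=200, output_tokens=100, calls=20)
print(f"slow backend: {slow:.1f}s, fast backend: {fast:.1f}s")
```

The unrounded slow-backend figure comes out slightly under 46 seconds, because the text rounds each call up to 2.3 seconds before multiplying.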

The ratio is not 1:1 because tool execution and other fixed overhead set a floor. In most agentic workloads, inference accounts for 60-75% of wall clock, so a 3-4x improvement in per-call speed produces roughly a 1.7-2.3x improvement in end-to-end task time.
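The relationship between per-call speedup and end-to-end speedup is Amdahl's law: only the inference share of wall clock gets faster. A small sketch, assuming the non-inference share (tools, tests, glue code) stays constant; the function name is illustrative.

```python
def end_to_end_speedup(inference_share: float, per_call_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the inference share gets faster."""
    if not 0.0 <= inference_share <= 1.0:
        raise ValueError("inference_share must be between 0 and 1")
    if per_call_speedup <= 0:
        raise ValueError("per_call_speedup must be positive")
    fixed_share = 1.0 - inference_share  # tool execution and other overhead
    return 1.0 / (fixed_share + inference_share / per_call_speedup)


for share in (0.60, 0.75):
    for speedup in (3.0, 4.0):
        overall = end_to_end_speedup(share, speedup)
        print(f"inference share {share:.0%}, per-call {speedup:.0f}x -> end-to-end {overall:.2f}x")
```

With an inference share of 0.75 and a 4x per-call speedup, this gives roughly 2.3x end to end. The fixed share caps the gain no matter how fast inference gets.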

## Why model size is not the only variable

The instinct when an agent feels slow is to reach for a smaller model. If you are using a 70B parameter model and it is slow, try a 13B. This sometimes works. It often does not, because the latency problem is not always about model size.

The variables that actually matter:

- **Inference backend and hardware.** A 70B model on hardware tuned for low-latency short generations will often respond faster than a 13B model on a batch-throughput-optimized stack, in the regimes that agents actually hit. The model's parameter count sets a floor on latency, but the serving infrastructure determines how close to that floor you actually get.
- **TTFT vs decode balance.** Agents make many short generations. For a call that produces 80 tokens of JSON, TTFT is a large share of the total, and with long accumulated context it can exceed decode time. If you optimize only decode throughput (as most batch benchmarks do), you miss a major cost for agent calls.
- **Structured output overhead.** Many inference stacks are slower at constrained generation (tool calls with specific JSON schemas) than at freeform text. The degradation ranges from subtle to a factor of two. If your agent makes 15 tool calls per task, this overhead compounds significantly.

## Model routing: right-sizing inference per step

Not every step in an agent loop has the same quality requirements. A planning step that sets the overall structure of the agent's approach needs the best reasoning you can get. A file lookup step that produces a one-line tool call does not.

Model routing assigns different steps to different model sizes based on what each step actually demands.

A routing layer for a code agent might look like:

```python
def select_model(step_type: str) -> str:
    routes = {
        "plan": "llama4-maverick",         # complex reasoning, needs the big model
        "tool_call": "llama4-scout",       # structured output, smaller is fine
        "code_generation": "qwen3-coder",  # specialized, fast, strong at code
        "verify": "llama4-scout",          # quick correctness check
    }
    return routes.get(step_type, "llama4-scout")
```

The savings add up quickly. If 60% of your agent's calls are short tool selections and verifications, and you can handle those with a model that runs at 3x the speed and a fraction of the cost, you have materially changed the economics and the wall clock without touching the quality of the steps that actually need the bigger model.
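To put a rough number on that, here is a back-of-the-envelope sketch. It assumes every call takes the same time on the big model, which real workloads will not match exactly (routed calls tend to be the short ones), so treat the result as an upper bound on the saving. The function name is illustrative.

```python
def blended_relative_latency(routed_share: float, routed_speedup: float) -> float:
    """Inference time after routing, relative to sending every call to the big model.

    Assumes all calls cost the same time on the big model (a simplification).
    """
    if not 0.0 <= routed_share <= 1.0:
        raise ValueError("routed_share must be between 0 and 1")
    if routed_speedup <= 0:
        raise ValueError("routed_speedup must be positive")
    # Routed calls shrink by the speedup; the rest stay on the big model unchanged
    return routed_share / routed_speedup + (1.0 - routed_share)


relative = blended_relative_latency(routed_share=0.60, routed_speedup=3.0)
print(f"inference time after routing: {relative:.0%} of baseline")
```

Under those assumptions, routing 60% of calls to a model that is 3x faster brings total inference time down to about 60% of the baseline.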

The risk with routing is getting the categorization wrong. A verification step that should be quick but ends up needing to understand subtle code semantics can fail silently if you have routed it to a small model without the depth to catch the issue. In practice, you learn the boundaries of each step type in your specific workload through testing, not by assuming the routing in someone else's architecture will work for yours.

## Parallel calls: the obvious optimization most agents skip

Sequential agent loops assume that step N+1 always depends on the result of step N. Often this is true. Often it is not.

A code agent might need to:
- Read the file that contains the bug
- Read the test file for that module
- Check if there is an existing utility function that handles similar logic

These three reads are independent. They do not depend on each other's results. A naive sequential loop runs them one at a time, paying 3x TTFT and 3x decode for what could be 1x of each.

Dispatching parallel calls changes the profile:

```python
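import asyncio

async def gather_context(file_paths: list[str]) -> list[str]:
    # read_file_with_agent: the agent's async file-read step, defined elsewhere
    tasks = [read_file_with_agent(path) for path in file_paths]
    # gather runs the reads concurrently and preserves input order
    return await asyncio.gather(*tasks)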
```

The wall clock time goes from sum to max. If each read takes 0.8 seconds and you have three of them, sequential is 2.4 seconds and parallel is 0.8 seconds.
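The sum-versus-max behavior is easy to see with simulated calls. This sketch replaces the model call with `asyncio.sleep`, so `simulated_read` and the file paths are stand-ins, not part of any real agent.

```python
import asyncio
import time


async def simulated_read(path: str, delay_s: float) -> str:
    """Stand-in for an LLM-backed file read; sleeps instead of calling a model."""
    await asyncio.sleep(delay_s)
    return f"contents of {path}"


async def read_sequential(paths: list[str], delay_s: float) -> list[str]:
    # Each read waits for the previous one: total time is the sum of delays
    return [await simulated_read(p, delay_s) for p in paths]


async def read_parallel(paths: list[str], delay_s: float) -> list[str]:
    # All reads are in flight at once: total time is the longest single delay
    return list(await asyncio.gather(*(simulated_read(p, delay_s) for p in paths)))


def timed(coro) -> float:
    """Run a coroutine to completion and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    asyncio.run(coro)
    return time.perf_counter() - start


if __name__ == "__main__":
    paths = ["src/module.py", "tests/test_module.py", "src/utils.py"]
    print(f"sequential: {timed(read_sequential(paths, 0.8)):.2f}s")
    print(f"parallel:   {timed(read_parallel(paths, 0.8)):.2f}s")
```

In a real system the parallel figure also depends on whether the inference backend serves concurrent requests without queuing them, which is worth measuring rather than assuming.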

The tricky part is that most agent frameworks are built around linear loops. They dispatch one call, wait, process the result, dispatch the next. Adding parallelism means either using a framework that supports it natively (LangGraph has some support for parallel branches, and CrewAI can run workers concurrently) or building a custom dispatch layer. Neither is trivial, but the payoff is real.

Some practical rules for parallel calls:

1. File reads and search queries are almost always parallelizable.
2. Any step that gathers information from independent sources can run in parallel.
3. Candidate generation (produce two or three plans and pick the best) is parallelizable, but costs proportionally more inference.
4. Steps that depend on each other's outputs cannot run in parallel. Forcing parallelism there breaks correctness.
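
One way to apply rule 4 mechanically is to declare each step's dependencies and let a topological sort group the steps into waves, where everything in a wave can be dispatched together. Here is a minimal sketch using Python's standard-library `graphlib`. The `parallel_waves` helper and the step names are hypothetical.

```python
from graphlib import TopologicalSorter


def parallel_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group steps into waves; each wave depends only on earlier waves.

    `deps` maps a step name to the set of steps whose output it needs.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps with all dependencies done
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical step graph for the code agent described above
steps = {
    "read_source": set(),
    "read_test": set(),
    "find_utility": set(),
    "plan_fix": {"read_source", "read_test", "find_utility"},
    "generate_fix": {"plan_fix"},
}
for i, wave in enumerate(parallel_waves(steps), start=1):
    print(f"wave {i}: {wave}")
```

The three reads land in the first wave, while planning and fix generation each get a wave of their own because they consume earlier outputs.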

## Coding agent case study

To make this concrete, consider a specific task: an agent that takes a failing test as input, locates the bug, and fixes the code so the test passes.

The task involves:
1. Parse the failing test output and extract the error.
2. Locate the relevant source file.
3. Read the source file and the test file.
4. Understand what the test expects and what the code does instead.
5. Generate a fix.
6. Apply the fix.
7. Run the test.
8. Verify the output.

On a fast inference backend (150ms TTFT, 200 tokens/sec decode):

| Step | Output tokens | TTFT | Decode | Total |
|------|--------------|------|--------|-------|
| Parse error | 60 | 150ms | 300ms | 450ms |
| Locate file | 40 | 150ms | 200ms | 350ms |
| Read source | 30 | 150ms | 150ms | 300ms |
| Read test | 30 | 150ms | 150ms | 300ms |
| Understand + plan | 200 | 200ms | 1000ms | 1200ms |
| Generate fix | 400 | 200ms | 2000ms | 2200ms |
| Apply patch | 60 | 150ms | 300ms | 450ms |
| Run tests | (subprocess) | -- | -- | 3000ms |
| Verify | 100 | 150ms | 500ms | 650ms |

Total for a clean one-shot success: roughly 9 seconds.
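
The total can be checked by summing the table directly. This sketch just transcribes the TTFT and decode columns. It also reports how much of the wall clock is inference rather than the test subprocess.

```python
# (step, TTFT ms, decode ms) for each LLM call in the table above
LLM_STEPS = [
    ("Parse error", 150, 300),
    ("Locate file", 150, 200),
    ("Read source", 150, 150),
    ("Read test", 150, 150),
    ("Understand + plan", 200, 1000),
    ("Generate fix", 200, 2000),
    ("Apply patch", 150, 300),
    ("Verify", 150, 500),
]
TEST_RUN_MS = 3000  # subprocess time, no inference involved

inference_ms = sum(ttft + decode for _, ttft, decode in LLM_STEPS)
total_ms = inference_ms + TEST_RUN_MS
inference_share = inference_ms / total_ms

print(f"inference: {inference_ms} ms, total: {total_ms} ms, share: {inference_share:.0%}")
```

Inference accounts for about two thirds of the one-shot wall clock in this trace, with the test run as the fixed floor.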

Three optimizations apply directly here:

- The two reads in step 3 (source file and test file) can run in parallel. Wall clock for that pair drops from 600ms to 300ms.
- The file location step could use a smaller routed model, since it is producing a file path from search results, not reasoning about code.
- Prefix caching from step 4 onward reduces TTFT on subsequent steps, because the source and test content is already in the KV cache.

With those changes, the fast backend gets to about 7 seconds for a one-shot success.

Now consider a two-iteration case, which is more common than one-shot success in real workloads. The fix has a subtle error and the test fails the first time. Without the parallelism and routing, two iterations cost roughly 20 seconds. With the optimizations, about 13 seconds: 7 for the first pass and 6 for the second (context is warmer), with each figure including its test run.

That is acceptable for a developer at a keyboard. The equivalent on a slow inference backend (2.5x the latency per call) pushes past 40 seconds, which is past the attention threshold for most interactive use.

## What to evaluate in an inference provider

If you are putting agents into production and latency is the binding constraint, the things to evaluate are not the same as for a batch processing workload.

**TTFT at short context.** Most providers publish throughput numbers. Ask for TTFT at 2k, 4k, and 8k input tokens. This is what your tool calls will actually look like.

**Latency at low batch sizes.** A provider that is fast at batch size 32 may have poor single-request latency. Agents often run at low concurrency per user session.

**Prefix cache effectiveness.** Ask whether the KV cache is preserved across requests in the same conversation. If not, every step pays full prefill cost, including for the system prompt and accumulated context.

**Structured output latency.** Generate 50 tool calls with realistic JSON schemas and measure the p50 and p95. The p95 tells you what your retry cases will feel like, and retries are not rare in production.
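
A measurement harness for this does not need to be elaborate. Here is a minimal sketch. `fake_tool_call` is a hypothetical stand-in that you would replace with a real request to the provider using your own JSON schema, and `measure_latency` is an illustrative helper name.

```python
import statistics
import time
from typing import Callable


def measure_latency(call: Callable[[], object], runs: int = 50) -> dict[str, float]:
    """Time `runs` sequential invocations of `call`; report p50 and p95 in seconds."""
    if runs < 2:
        raise ValueError("need at least two runs to compute percentiles")
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    percentiles = statistics.quantiles(samples, n=100)  # 99 cut points
    return {"p50": percentiles[49], "p95": percentiles[94]}


def fake_tool_call() -> None:
    """Hypothetical stand-in: replace with a real structured-output request."""
    time.sleep(0.01)


if __name__ == "__main__":
    stats = measure_latency(fake_tool_call, runs=50)
    print(f"p50={stats['p50'] * 1000:.1f}ms  p95={stats['p95'] * 1000:.1f}ms")
```

Running the calls one at a time, as this does, matches the low-concurrency pattern of a single agent session. With only 50 samples the p95 is a coarse estimate, so rerun it at different times of day before drawing conclusions.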

## The compounding effect

The argument for fast inference in agents is ultimately about compounding. When each step is slow, every inefficiency multiplies across the loop. When each step is fast, that multiplication works in your favor: more steps fit in the same wall clock, more validation is affordable, and more retry tolerance is available before the user gives up.

A coding agent that can run 10 steps in 20 seconds is qualitatively different from one that can run 30 steps in the same time. The second one handles meaningfully harder tasks. The difference between them is not model quality. It is the infrastructure serving the model.

If you are building agents and hitting latency walls, [General Compute's API](https://generalcompute.com) is optimized for exactly the access pattern agents need: short generations, aggressive prefix caching, and low TTFT at the context lengths agents actually hit. It is OpenAI-compatible, so the migration is a one-line config change, and the latency difference shows up immediately in the metrics that matter for your loop.