
Building a Code Agent: Why Each Step Needs Sub-Second Inference

General Compute

A code agent is a loop. You give it a goal, it reads files, runs commands, edits code, runs tests, and reads the output of those tests. Each cycle has at least one model call in it, often several. The user perceives the agent as fast or slow based on the total wall clock between the moment they hit enter and the moment the agent stops emitting tokens. That number is the sum of every step inside the loop, and the loop usually runs many times.
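
To make the shape of that loop concrete, here is a minimal sketch in Python. The Action type and the injected helpers (call_model, run_tool, apply_edit, run_tests) are illustrative names, not any particular framework's API; the point is that every pass through the loop contains at least one model call.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    kind: str            # "tool_call", "edit", or "done"
    name: str = ""       # tool name, when kind == "tool_call"
    args: dict | None = None
    diff: str = ""       # proposed edit, when kind == "edit"
    message: str = ""    # final answer, when kind == "done"

def agent_loop(
    goal: str,
    call_model: Callable[[list[dict]], Action],  # one inference per iteration
    run_tool: Callable[[str, dict], str],        # file read, search, shell
    apply_edit: Callable[[str], None],
    run_tests: Callable[[], str],
    max_steps: int = 40,
) -> str:
    context: list[dict] = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        action = call_model(context)             # prefill + decode, every time
        if action.kind == "tool_call":
            result = run_tool(action.name, action.args or {})
            context.append({"role": "tool", "content": result})
        elif action.kind == "edit":
            apply_edit(action.diff)
            context.append({"role": "tool", "content": run_tests()})
        else:  # "done"
            return action.message
    return "step budget exhausted"
```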

This is the part of agent design that people gloss over when they sketch the architecture on a whiteboard. The boxes look small. The arrows look short. You can fit a whole agent on one slide. What the slide does not show is that an average task touches the model fifteen to forty times, that each model touch has a prefill and a decode and a structured output pass, and that the user experiences the cumulative result rather than any single call.

This post walks through the actual steps of a code agent, attaches a latency budget to each, and explains why anything slower than roughly a second per step pushes the total task time into a range that breaks the interactive feel.

The latency budget you actually have

A developer using a code agent inside their editor expects the agent to keep pace with their attention. Twenty seconds of wall time is the upper edge of acceptable for a non-trivial task. Forty seconds and the developer alt-tabs. Two minutes and they go look at Slack and probably never come back to that tab in the same flow state.

If you accept a twenty second budget for a task that involves twenty model calls in some combination of planning, tool selection, code generation, and review, you have one second per call on average. That is the headline number. It does not say every call must finish in a second, because some calls are cheap and some are expensive, but it says the average has to land there, and the slowest call cannot eat the whole budget.
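
The arithmetic is worth writing down, because it is the constraint everything else in this post pushes against. The call mix below is an illustrative breakdown of the twenty calls, not a measurement.

```python
# Back-of-the-envelope budget: 20 s of wall clock spread over an assumed
# mix of twenty model calls. Counts are illustrative, not measured.
wall_clock_budget_s = 20.0
calls = {"plan": 2, "tool_selection": 12, "generate": 4, "review": 2}
total_calls = sum(calls.values())
print(f"{total_calls} calls -> {wall_clock_budget_s / total_calls:.2f} s per call on average")
# 20 calls -> 1.00 s per call on average
```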

A second per call sounds generous. It is not. Most of that second is spoken for before you write any code, because of how prefill and decode interact with structured output. Let us go through the steps.

Step 1: Reading the user request and planning

The first step usually involves a model call where the agent reads the user's instruction along with whatever context it has about the project, and either produces a plan or decides which tool to call first. This call has a moderately large input (system prompt, file tree summary, conversation history) and a small to moderate output (a plan or a tool call).

The latency profile is dominated by time to first token. With an 8,000 token input and a fast prefill backend, you can get to first token in 200 to 400 milliseconds. Decoding 150 tokens of plan at 80 tokens per second adds another 1.9 seconds. So a "fast" planning step is already 2.1 to 2.3 seconds. If your inference backend has a slower prefill or a slower decode (50 tokens per second is common on contended endpoints), the same step takes 3.5 to 4 seconds.
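
The same arithmetic in code, so you can plug in your own backend's numbers. The TTFT and decode-speed figures are the assumed values from the paragraph above, not benchmarks.

```python
# Per-step latency model: time to first token plus decode time.
def step_latency(ttft_s: float, output_tokens: int, decode_tok_per_s: float) -> float:
    return ttft_s + output_tokens / decode_tok_per_s

fast = step_latency(ttft_s=0.3, output_tokens=150, decode_tok_per_s=80)  # ~2.2 s
slow = step_latency(ttft_s=0.8, output_tokens=150, decode_tok_per_s=50)  # ~3.8 s
print(f"fast backend: {fast:.1f} s, slow backend: {slow:.1f} s")
```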

You are now one step in and either at 10 to 12 percent of your budget or at 20 percent of it. The rest of the steps still need to fit.

Step 2: Tool selection and tool calls

Code agents call tools constantly: read a file, search the codebase, run a shell command, list a directory. Each of these tool calls is preceded by a model inference that decides which tool to use and produces the structured arguments for it. Then the tool runs (typically fast, under 100 milliseconds for file IO). Then the model reads the tool result and decides what to do next.

A tool selection call has a small output: usually under 100 tokens of structured JSON for the tool name and arguments. The trap is that structured output generation is slower than freeform decoding on most inference stacks because of constrained decoding, schema validation, and the lower batchability of small structured outputs.

A realistic profile for a tool call inference is 300 milliseconds of time to first token plus 700 milliseconds to produce the JSON. That is one second per tool call decision, on a fast backend. The agent's loop typically performs three to eight tool calls before it has enough context to produce a code edit. That is three to eight seconds spent just on tool decisions, before any code has been written.
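
A single tool-selection call looks roughly like the sketch below, using the standard OpenAI-compatible chat completions interface. The base_url, model name, and tool schema are placeholders; what matters is that the output is a small, schema-constrained JSON blob, so TTFT plus constrained decoding dominates its latency.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="https://example.invalid/v1", api_key="...")  # placeholder endpoint

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the repository",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

start = time.perf_counter()
resp = client.chat.completions.create(
    model="placeholder-model",
    messages=[{"role": "user", "content": "Find where the config loader validates paths."}],
    tools=tools,
    tool_choice="auto",
)
print(f"tool decision in {time.perf_counter() - start:.2f} s")
print(resp.choices[0].message.tool_calls)
```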

Step 3: Reading tool results and reasoning

Once tools return data, the agent has to read it. If the tool was a file read, the model now has a 2,000 to 10,000 token addition to its context. The prefill for the next call has to process those new tokens. Without prefix caching, you pay full prefill cost for the entire context every time. With prefix caching, you pay only for the new portion, but the savings depend on whether the cache is warm and whether the serving stack actually streams the cached prefix at the speed it claims.
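
A rough cost model makes the prefix-caching difference visible. The prefill throughput figure below is an assumption for illustration; real numbers depend on hardware, batch state, and whether the serving stack actually reuses the cached prefix.

```python
# Prefill cost with and without a warm prefix cache, assuming an
# illustrative prefill throughput of 20k tokens/s.
def prefill_time_s(context_tokens: int, new_tokens: int,
                   prefill_tok_per_s: float = 20_000,
                   prefix_cache_hit: bool = False) -> float:
    tokens_to_process = new_tokens if prefix_cache_hit else context_tokens
    return tokens_to_process / prefill_tok_per_s

# 30k tokens of accumulated context, of which 5k is fresh tool output:
print(prefill_time_s(30_000, 5_000, prefix_cache_hit=False))  # 1.5 s, full prefill
print(prefill_time_s(30_000, 5_000, prefix_cache_hit=True))   # 0.25 s, new portion only
```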

The reasoning step itself may or may not be visible to the user. Some agent frameworks separate "think" turns from "act" turns. Others fold the reasoning into the same call as the next tool selection. Either way, the model is generating tokens. A thinking step that produces 300 tokens of internal reasoning takes 3.75 seconds at 80 tokens per second, or six seconds at 50 tokens per second.

This is where the latency starts to feel oppressive. The user watches an agent that is "thinking" with nothing appearing on screen, and the longer the think turn, the more the system looks stuck. Streaming the thinking helps, but only if the front end is built to display it as it arrives.
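
Streaming the think turn is the standard streaming interface of an OpenAI-compatible API; the endpoint and model name below are placeholders. The front end's job is to render each delta as it arrives rather than waiting for the full turn.

```python
from openai import OpenAI

client = OpenAI(base_url="https://example.invalid/v1", api_key="...")  # placeholder endpoint

stream = client.chat.completions.create(
    model="placeholder-model",
    messages=[{"role": "user", "content": "Outline a fix for the failing path validation test."}],
    stream=True,
)
for chunk in stream:
    # Print each delta as it arrives; a real front end would render this incrementally.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```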

Step 4: Generating code

The code generation step is the one users tolerate the longest, because they can see the code appearing and they understand intuitively that more code takes more time. A 500 token diff at 80 tokens per second is 6.25 seconds. A 1,500 token rewrite of a file is 18 seconds. These are real numbers from real workloads.

The interesting thing about code generation is that it is the only step where the decoder throughput dominates the latency, because the input is already in context from earlier steps and the output is the bulk of the work. This is the step where speculative decoding pays off the most, because correct guesses can multiply effective throughput by two or three times. If your inference stack supports speculative decoding for the size of model you are running, this step gets cheaper without changing the model.
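
A simplified model of the speculative-decoding gain, under assumed draft length and acceptance rate. Real gains depend on the draft model and the text being generated; this is a rough approximation, not a guarantee.

```python
# Rough model: each target-model forward pass emits about
# 1 + draft_len * accept_rate tokens instead of 1.
def effective_decode_speed(base_tok_per_s: float, draft_len: int, accept_rate: float) -> float:
    return base_tok_per_s * (1 + draft_len * accept_rate)

base = 80.0
boosted = effective_decode_speed(base, draft_len=4, accept_rate=0.5)   # ~240 tok/s
print(f"500-token diff: {500 / base:.1f} s -> {500 / boosted:.1f} s")  # ~6.2 s -> ~2.1 s
```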

It is also the step where chunked prefill and continuous batching matter, because long generations interact with other requests in the same batch in ways that can starve them. We have written about both elsewhere on this site.

Step 5: Running tests and reading the results

The agent runs tests as a subprocess. The test run is not inference time, but it still counts against the wall clock. A fast test suite returns in three to ten seconds. A slow one takes minutes. The agent then has to read the test output, which is usually a few hundred to a few thousand tokens of failure output, and decide whether to iterate.
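
A sketch of that subprocess step, with the output capped before it goes back into the context. The test command and truncation limit are illustrative; the useful detail is keeping the tail of the output, since test runners usually print the failure summary last.

```python
import subprocess

def run_tests(max_output_chars: int = 8_000) -> str:
    proc = subprocess.run(
        ["python", "-m", "pytest", "-x", "-q"],  # assumed test command
        capture_output=True, text=True, timeout=300,
    )
    output = proc.stdout + proc.stderr
    if len(output) > max_output_chars:
        # Keep the tail: the failure summary is printed last.
        output = "...(truncated)...\n" + output[-max_output_chars:]
    return f"exit code {proc.returncode}\n{output}"
```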

The decision step here is another tool selection or planning call, with the same latency profile as step 2: roughly one second on a fast backend. If the tests passed, the agent emits a completion message. If they failed, the loop restarts from step 3 or step 4 with the new error context.

What this adds up to

A simple task that runs the loop once:

| Step | Time on fast backend | Time on slow backend |
|------|----------------------|----------------------|
| Plan | 2.2s | 4.0s |
| Tool calls (5) | 5.0s | 9.0s |
| Read results and reason | 2.0s | 4.0s |
| Generate code | 6.0s | 12.0s |
| Run tests | 5.0s | 5.0s |
| Verify | 1.5s | 3.0s |
| Total | 21.7s | 37.0s |

A task that needs two iterations of the loop (which is more common than one-shot success in real workloads) doubles most of those rows. The fast backend lands at around 35 seconds, the slow one at 65 seconds or more. The fast backend feels usable. The slow one does not.

The "sub-second per step" framing is a useful target because most of the steps above were budgeted at one to two seconds each on the fast column. The total breaks when individual steps slip toward two or three seconds, because the slips compound across the loop.

What gets you to sub-second steps

A few things move the needle, in roughly the order of how much they matter for an agent workload.

First, raw decode throughput. A model that decodes at 200 tokens per second instead of 80 makes the code generation step two and a half times faster, and most of the other steps proportionally faster too. The model size is not the whole story here. A 70B model on hardware tuned for low-latency decoding can outperform a 13B model on a stack tuned for batch throughput, in the regimes that matter for agents.

Second, time to first token. Many agent steps have short outputs but moderate to large inputs. TTFT dominates the latency for these. Backends that parallelize prefill aggressively, or that use chunked prefill to keep decode running during prefill, win here.

Third, structured output performance. Tool calls dominate the inference count in most agent workloads. If your stack is twice as slow at structured output as at freeform decoding, you have effectively doubled the latency of half your calls. Some inference providers handle this well, some do not. Test it explicitly.
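
Testing it explicitly can be as simple as timing the same prompt with and without a tool schema against whatever endpoint you are evaluating. Endpoint, model, and schema below are placeholders; the ratio is what matters, not the absolute numbers.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="https://example.invalid/v1", api_key="...")  # placeholder endpoint

schema = {
    "type": "function",
    "function": {
        "name": "search_code",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}
prompt = [{"role": "user", "content": "Where is retry logic implemented?"}]

def timed(**kwargs) -> float:
    start = time.perf_counter()
    client.chat.completions.create(model="placeholder-model", messages=prompt, **kwargs)
    return time.perf_counter() - start

freeform = timed(max_tokens=100)
structured = timed(tools=[schema],
                   tool_choice={"type": "function", "function": {"name": "search_code"}})
print(f"freeform {freeform:.2f} s, structured {structured:.2f} s, "
      f"ratio {structured / freeform:.2f}x")
```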

Fourth, prefix caching that actually works. Code agents accumulate context as the loop progresses, and the prefix grows monotonically until something compacts it. If the serving stack reuses the KV cache from prior calls in the same conversation, every step after the first one gets a faster prefill. This is one of the few places where infrastructure can give you a five to ten times speedup on prefill without changing anything in the model.

Fifth, parallelism where it is safe. Some agent steps can run in parallel: multiple file reads, multiple search queries, multiple lints. If the agent framework can dispatch these concurrently and the inference backend can serve them without queueing, the loop tightens without losing correctness. This requires both an agent design that supports it and an inference endpoint with the headroom to actually run requests in parallel.
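
The safe cases are the read-only ones. A minimal sketch of concurrent file reads, with illustrative paths; the same pattern applies to search queries and lints, and it only pays off if the downstream inference calls are not queued behind each other.

```python
import asyncio
from pathlib import Path

async def read_file(path: str) -> str:
    # File IO off the event loop; each read is independent of the others.
    return await asyncio.to_thread(Path(path).read_text)

async def gather_context(paths: list[str]) -> list[str]:
    # All reads run concurrently; wall time is roughly the slowest single read.
    return list(await asyncio.gather(*(read_file(p) for p in paths)))

# Example usage (paths are illustrative):
# contents = asyncio.run(gather_context(["src/config.py", "src/loader.py"]))
```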

Where General Compute fits

We built our inference stack on custom silicon because the standard GPU serving path is not optimized for the access pattern that agents have. Short structured outputs, frequent prefill on growing contexts, and long tail latency on a small percentage of calls all hurt agents disproportionately. Our hardware path is shorter on the small calls and steadier on the large ones, which is the shape that matters when you are summing twenty calls into a single user-perceived wall clock.

If you are building a code agent and the per-step latency budget is what is breaking your product, our API is OpenAI-compatible and tuned for exactly this workload. Try it on your hardest test cases, the ones where current providers feel too slow, and see what the loop time looks like. The math in this post is the math we optimize for every day.
