Agent Readout
How Coding Agents Depend on Inference Speed
Coding agents make dozens of sequential LLM calls per task. Every millisecond of inference latency compounds across each step, making speed the single biggest infrastructure bottleneck for AI-powered developer tools.
- **Author:** General Compute
- **Published:** 2026-03-19
- **Tags:** coding-agents, inference, developer-tools
OpenAI just signed a $10 billion, 750-megawatt deal with Cerebras to run Codex, their coding agent, on Cerebras' inference chips. The largest AI company in the world looked at their coding agent product and decided that general-purpose GPU infrastructure wasn't fast enough. They needed specialized hardware built for inference speed. That decision tells you everything about what matters for AI coding tools.

Coding agents aren't chatbots. They don't make a single API call and return a response. They run multi-step loops: read code, reason about it, write a fix, run tests, check the results, and iterate. A typical task might involve 8 to 15 sequential LLM calls, and each one blocks the next. When every call in that chain is slow, the delays compound. And when they compound enough, the tool goes from feeling like a collaborator to feeling like something you're babysitting.

## What Happens Inside a Coding Agent

To understand why speed matters so much, you need to understand what a coding agent actually does when you give it a task like "fix this failing test" or "add pagination to this endpoint." The loop looks something like this:

1. The agent reads the relevant files and reasons about the codebase.
2. It plans an approach (sometimes explicitly, sometimes implicitly through chain-of-thought).
3. It calls tools: reads files, searches for references, inspects error output. Each tool call requires an LLM inference to decide what to do next.
4. It generates code changes.
5. It validates by running tests, linters, or type checkers.
6. It reads the results and decides whether to iterate or finish.

Steps 2 through 6 repeat multiple times per task. Some of these LLM calls are short (tool selection, classification) and some are long (code generation, planning), but they're all on the critical path. Nothing can happen in parallel because each step depends on the output of the previous one. A typical coding agent task involves 8 to 15 of these sequential calls.
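Structurally, that loop is just a blocking inference call inside a `for` loop. Here is a minimal sketch in Python; `call_llm` and `execute_tool` are hypothetical stand-ins for a real provider client and tool runner, with a deterministic fake so the sequential structure is visible:

```python
llm_calls = 0

def call_llm(prompt: str) -> str:
    """Stand-in for a real inference call (replace with your provider's client).
    This fake 'finishes' once three tool observations appear in the context."""
    global llm_calls
    llm_calls += 1
    return "FINISH" if prompt.count("Observation:") >= 3 else "run_tests"

def execute_tool(action: str) -> str:
    """Stand-in tool runner (file reads, test runs, linters)."""
    return f"output of {action}"

def run_agent(task: str, max_steps: int = 15) -> str:
    context = [f"Task: {task}"]
    for _ in range(max_steps):
        # One blocking inference call per step: the agent decides the next
        # action from everything it has seen so far.
        action = call_llm("\n".join(context))
        if action == "FINISH":
            return "done"
        # Tool output feeds the next call, so steps cannot run in parallel.
        context.append(f"Action: {action}\nObservation: {execute_tool(action)}")
    return "max steps reached"

print(run_agent("fix the failing test"), llm_calls)  # prints: done 4
```

The point of the sketch is the shape, not the stubs: every iteration waits on `call_llm`, so per-call latency multiplies directly by step count.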
More complex tools like Devin or SWE-Agent can run 50 to 100+ steps for a single task.

## The Compounding Latency Problem

Here's where the math gets uncomfortable. If each LLM call takes 2 seconds and a task requires 12 calls, that's 24 seconds of pure inference time, not counting tool execution. At 500ms per call, the same task takes 6 seconds. At 200ms per call, it's 2.4 seconds.

| Latency per call | 10 steps | 15 steps | 25 steps |
|---|---|---|---|
| 2,000ms | 20s | 30s | 50s |
| 500ms | 5s | 7.5s | 12.5s |
| 200ms | 2s | 3s | 5s |

The difference between the top and bottom row of that table is the difference between a tool that developers actually use and one they disable after a week.

This is fundamentally different from a chatbot, where you make one call and wait. With agents, latency doesn't just add up linearly. It determines whether the entire workflow is practical. A 25-step agent running at 2 seconds per call takes nearly a minute of inference time alone. Most developers won't wait that long. They'll just do it manually.

## Both TTFT and TPS Matter (For Different Reasons)

Coding agents make two kinds of LLM calls, and each one cares about a different speed metric.

**Short calls (tool selection, classification, small edits):** These are latency-sensitive. The model needs to quickly decide which file to read or which tool to call. Time-to-first-token (TTFT) dominates here because the total output is small. A high TTFT means the agent sits idle for hundreds of milliseconds before it even starts generating a one-line response.

**Long calls (code generation, planning, large refactors):** These care more about tokens-per-second (TPS). The model is generating 50 to 500 tokens of code, and TPS determines how long that takes. Slow TPS means watching code appear character by character in your editor.

Coding agents need both metrics to be fast. An inference provider that has great TPS but slow TTFT (or vice versa) will still feel sluggish for agentic workloads.
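A crude latency model makes both effects concrete: treat each call as TTFT plus output tokens divided by TPS, then multiply by step count for the sequential loop. The numbers below are illustrative, chosen to match the figures above:

```python
def call_seconds(ttft_ms: float, output_tokens: int, tps: float) -> float:
    """Approximate wall-clock time for one LLM call: time-to-first-token
    plus decode time at a given tokens-per-second rate."""
    return ttft_ms / 1000 + output_tokens / tps

def task_seconds(steps: int, ttft_ms: float, output_tokens: int, tps: float) -> float:
    """A sequential agent loop is just per-call time multiplied by step count."""
    return steps * call_seconds(ttft_ms, output_tokens, tps)

# A short tool-selection call is dominated by TTFT...
print(call_seconds(ttft_ms=200, output_tokens=10, tps=100))    # ≈ 0.3 s
# ...while a long code-generation call is dominated by TPS.
print(call_seconds(ttft_ms=200, output_tokens=500, tps=100))   # ≈ 5.2 s

# Flat 2s vs 200ms per call across a 12-step task, as in the prose above
# (output_tokens=0 models the per-call latency as a flat cost):
print(task_seconds(12, ttft_ms=2000, output_tokens=0, tps=100))  # ≈ 24 s
print(task_seconds(12, ttft_ms=200, output_tokens=0, tps=100))   # ≈ 2.4 s
```

The model ignores queuing, network overhead, and tool execution, all of which only add to these totals.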
## Why OpenAI Moved Codex to Specialized Inference Hardware

The OpenAI-Cerebras deal is worth paying attention to because of what it signals about infrastructure requirements for coding agents. OpenAI has access to more GPU compute than almost any other company on the planet. They have massive clusters of NVIDIA hardware. And yet, when it came to running Codex at the speed and scale their coding agent needed, they went outside their existing infrastructure to a company that builds specialized inference chips.

The deal is $10 billion and 750 megawatts of power capacity. This is not a small experiment or a pilot program. This is OpenAI making a serious long-term bet that coding agents specifically need inference infrastructure that's faster than what standard GPU setups can deliver.

The reasoning is straightforward when you understand the agentic loop. Codex doesn't just generate code. It reads files, plans approaches, calls tools, writes code, validates results, and iterates. Each step is a sequential inference call. The total user-facing latency is the sum of all those calls plus tool execution time. When your product's core experience depends on a loop of 10 to 20 LLM calls completing fast enough to feel interactive, the speed of each individual call becomes your most important infrastructure constraint.

This is the same dynamic playing out across the coding agent space. Cursor chose Fireworks specifically for low latency. Every serious coding tool company treats inference speed as a first-class infrastructure requirement, not an afterthought.

## Why Standard Infrastructure Falls Short

Most cloud GPU providers and inference APIs are optimized for throughput (serving many requests efficiently) rather than latency (serving individual requests fast). These are different optimization targets, and they often conflict. The specific problems coding agents hit:

**Queuing delays.** Shared inference services process requests in batches.
When the system is under load, your request sits in a queue before it starts executing. This adds unpredictable latency that compounds across agent steps.

**Cold starts.** Serverless GPU providers sometimes need to load models into memory when a request arrives. This can add seconds of latency to the first call, which is exactly when the user is watching.

**Batching tradeoffs.** High-throughput providers batch multiple requests together for GPU efficiency. This is great for aggregate throughput but increases latency for individual requests, which is what matters for interactive agents.

**Inconsistent tail latency.** P50 (median) latency might look fine, but agents make many sequential calls per task. If your p99 latency is 3x your p50, the agent will regularly hit a slow call somewhere in its loop, and the user will notice.

What coding agents actually need from their inference provider:

- Consistently low TTFT (under 200ms)
- High tokens-per-second for code generation
- Low p99 latency, not just low median
- Always-warm models with no cold starts
- Support for long context windows (codebases are large)

## Speed Determines Developer Experience

There's a well-documented relationship between tool latency and developer productivity. Research on developer flow states shows that interruptions longer than about 10 seconds break concentration. A coding agent that takes 30 seconds per task doesn't just feel slow. It actively disrupts the developer's workflow.

Speed also determines trust. Developers adopt tools that feel responsive and abandon tools that feel laggy. GitHub Copilot's initial success was partly about model quality, but it was also about the fact that inline completions appeared almost instantly. The speed was part of what made it feel like the tool understood what you were writing.

There's also a cost argument that's easy to miss. Faster agents often produce better results because the developer can course-correct sooner.
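The tail-latency problem described above is also a matter of simple probability: if some fraction of calls lands in the tail, the chance that a task hits at least one slow call grows quickly with step count. A back-of-envelope calculation, assuming independent calls and an illustrative 1% tail rate:

```python
def p_tail_hit(steps: int, tail_rate: float = 0.01) -> float:
    """Probability that at least one call in a sequential agent loop
    lands in the latency tail, assuming independent calls."""
    return 1 - (1 - tail_rate) ** steps

# Even a 1% tail rate bites often once calls are chained sequentially:
print(f"{p_tail_hit(15):.0%}")   # 14% of 15-step tasks hit a slow call
print(f"{p_tail_hit(100):.0%}")  # 63% of 100-step tasks
```

This is why low p99 matters more than low median for agents: at 100 steps, hitting the tail somewhere in the loop is more likely than not.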
If the agent takes 5 seconds per loop, the developer can spot a wrong direction after one or two iterations and redirect. If it takes 30 seconds per loop, they've already wasted a minute before they realize the agent is going down the wrong path.

## This Problem Gets Worse, Not Better

The trend in AI-powered development tools is toward more autonomy, not less. Agents are taking on larger, more complex tasks that require more steps. Multi-agent architectures where a planner, coder, reviewer, and tester collaborate on a task multiply the number of inference calls further. Background agents that run tasks asynchronously (like Cursor's background agent or Devin) still need to finish in minutes, not hours, to be useful. A 100-step agent at 2 seconds per step takes over 3 minutes of inference time. At 200ms per step, it takes 20 seconds.

Longer context windows are also becoming standard. As models handle 128K to 1M+ token contexts to ingest entire codebases, maintaining speed at those context lengths becomes a harder engineering problem. Providers that can't serve long-context requests fast enough will be unusable for the next generation of coding tools.

## What Fast Inference Makes Possible

When inference is fast enough, coding agents can do things that aren't practical on slower infrastructure. They can run more iterations per task, trying multiple approaches and picking the best one. They can include validation steps (run tests, check types, lint) inside the loop without making the total time unacceptable. They can use reasoning models that think through complex problems step by step, where the thinking overhead would normally make the agent too slow.

At General Compute, our infrastructure is built specifically for these workloads. Low TTFT, high TPS, consistent tail latency, and always-warm models.
The difference shows up directly in how coding agents perform: more steps per second, faster task completion, and a developer experience that feels responsive rather than something you wait on.

---

If you're building or deploying a coding agent, the inference provider you choose determines the ceiling of your tool's performance. [Try General Compute](https://generalcompute.com) and benchmark it against what you're currently using. The compound effect across a multi-step agent loop is where the difference really shows up.