
Multi-Agent Architectures and the Inference Cost Explosion

Orchestrator-worker patterns make multi-agent systems easy to design and expensive to run. Here is where the inference cost actually goes, and what it means for the infrastructure underneath.

Author
General Compute
Published
2026-05-09
Tags
agents, multi-agent, inference, latency, cost

Multi-agent architectures look elegant on a whiteboard. You have an orchestrator that breaks a task into subtasks, a pool of workers that handle the subtasks, and a critic or aggregator that reviews and merges the results. Each agent has a clean role. Each prompt is small and focused. The mental model maps onto how a team of humans would split up the same job.

Then you put it into production and the inference bill is four to twenty times what a single-agent system would cost on the same workload. Latency on the user-visible path is worse, not better, because the orchestrator is now serially gated on its workers. The system is harder to debug because failures show up two or three hops away from where they originated. The promise of decomposition is real, but the cost structure is not what most teams expect when they start.

This post is about why multi-agent systems are so much more inference-hungry than they look, where the cost actually accumulates, and which architectural choices change the math.

## What "multi-agent" actually means at the inference layer

When practitioners say multi-agent, they usually mean one of three patterns.

The first is the orchestrator-worker pattern. A planner LLM reads the task, produces a plan, and delegates pieces of it to worker LLMs. The workers run in parallel or in sequence, return results, and the orchestrator decides what to do next. Most agent frameworks ship some version of this as their default abstraction.

The second is the role-playing pattern. A handful of agents each carry a persona: a researcher, a coder, a reviewer, a summarizer. They take turns producing output, often with a shared scratchpad or message bus between them. CrewAI and AutoGen popularized this style.

The third is the debate or critique pattern. Multiple agents independently produce candidate answers, then one or more critic agents compare them, and a final agent picks or merges. This shows up in research papers more than production systems, but the inference profile is similar to the others.

All three patterns share one property: they replace a single inference call with many inference calls, each of which carries its own prompt overhead, its own time-to-first-token, and its own decode budget. The interesting question is not whether this costs more (it does), but where the multiplier actually comes from.

## The prompt overhead multiplies, not the useful work

Consider a single-agent system that handles a customer support task. The model gets a system prompt explaining the company, the tools available, and the policy guardrails. Maybe 4,000 tokens. It gets the conversation history, maybe 2,000 more tokens. It produces a 200-token response. Total: 6,000 input tokens, 200 output tokens, one model call.

Now decompose the same task into a three-agent system: a triage agent classifies the request, a specialist agent handles the resolution, and a quality reviewer agent checks the response before sending. Each of these agents needs its own system prompt, because the triage agent should not have the specialist's tools and the reviewer should not have either of theirs. Each agent also needs context from the prior steps, because none of them have access to the original conversation by default.

The triage agent reads 4,000 tokens of its own system prompt plus 2,000 tokens of conversation. The specialist reads 4,000 tokens of system prompt, 2,000 tokens of conversation, and the triage output, call it 500 tokens. The reviewer reads its own 3,000-token system prompt, the conversation, the triage output, and the specialist's 200-token draft. You have gone from 6,000 input tokens to roughly 18,000, and the actual user-visible output is the same 200 tokens.
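Here is the same accounting as a back-of-the-envelope script. The token counts are the illustrative figures from the example above, not measurements:

```python
# Input-token accounting for the single-agent vs three-agent example.
# All token counts are illustrative assumptions from the text.

SINGLE_AGENT = [
    {"system": 4000, "context": 2000},  # one call sees everything once
]

MULTI_AGENT = [
    {"system": 4000, "context": 2000},              # triage: system prompt + conversation
    {"system": 4000, "context": 2000 + 500},        # specialist: + ~500-token triage output
    {"system": 3000, "context": 2000 + 500 + 200},  # reviewer: + triage output + 200-token draft
]

def total_input_tokens(calls):
    return sum(c["system"] + c["context"] for c in calls)

print(total_input_tokens(SINGLE_AGENT))  # 6000
print(total_input_tokens(MULTI_AGENT))   # 18200
```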

This is the part that surprises teams when they see the bill. Output tokens are usually a small fraction of cost. Input tokens are where multi-agent systems balloon, because every agent in the chain needs context, and that context overlaps significantly between agents. You are paying to re-feed the same conversation history through three different prompts, each with its own framing.

Prefix caching helps if your serving stack supports it well, and if your agent framework happens to construct prompts in a way that produces stable prefixes. But the typical orchestrator-worker setup actively defeats caching, because the worker prompts include the orchestrator's task description, which changes on every call.
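As an illustration, here is a hypothetical worker-prompt builder in both orientations. The function names and layout are invented for the sketch; the only real rule is that prefix caches match on exact leading tokens:

```python
# Prefix caches match on exact leading tokens, so anything that changes
# per call (the task description here) should come last, after the
# stable system prompt and conversation history.

def build_worker_prompt(system_prompt: str, history: str, task: str) -> str:
    # Cache-friendly: stable prefix (system + history), volatile suffix (task).
    return f"{system_prompt}\n\n{history}\n\n## Current task\n{task}"

def build_worker_prompt_cache_hostile(system_prompt: str, history: str, task: str) -> str:
    # Cache-hostile: the per-call task description appears before the stable
    # content, so no two calls share a common prefix.
    return f"## Current task\n{task}\n\n{system_prompt}\n\n{history}"
```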

## Latency stacks badly in the orchestrator pattern

The naive expectation is that multi-agent systems can run faster than single-agent systems, because workers can execute in parallel. In practice, parallelism only helps when the workers are genuinely independent and the orchestrator can dispatch them all at once. Most orchestrator-worker setups are not like this.

A common pattern is sequential delegation: the orchestrator decides what to do, dispatches a worker, reads the result, decides what to do next, dispatches another worker. Each step has its own time-to-first-token and decode time. If your model has a 600ms TTFT and produces 100 tokens per second, a single orchestrator step that emits a 50-token plan takes about 1.1 seconds. A worker step that produces a 200-token result takes about 2.6 seconds. A four-step plan with one worker per step is 4 * (1.1 + 2.6) = 14.8 seconds before the orchestrator even produces its final answer.
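The arithmetic, as a small model you can adjust for your own numbers. Both inputs here are the assumed figures from above:

```python
# Latency model for sequential delegation, using the assumed numbers
# from the text: 600 ms time-to-first-token, 100 tokens/second decode.

TTFT_S = 0.6
TOKENS_PER_S = 100.0

def step_latency(output_tokens: int) -> float:
    return TTFT_S + output_tokens / TOKENS_PER_S

plan = step_latency(50)    # 1.1 s per orchestrator step
work = step_latency(200)   # 2.6 s per worker step

total = 4 * (plan + work)  # four sequential plan-then-work rounds
print(f"{total:.1f} s")    # 14.8 s before the final answer starts
```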

A single-agent system handling the same task would have one prefill, one decode, and would stream tokens to the user as they were produced. The user would see output starting at around 600ms and finishing around 8 to 10 seconds in, with the appearance of progress the entire time. The multi-agent version may burn a comparable amount of total compute, but the user sees nothing for almost 15 seconds and then a sudden burst at the end.

When workers can run in parallel, the math improves, but only if the serving stack can actually execute them in parallel without queueing. On a shared inference endpoint, parallel worker calls compete for the same batch slots as everyone else's traffic. If your provider does not have headroom, your "parallel" workers serialize behind queue contention, and you are back to sequential latency with extra steps.
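A minimal sketch of the dispatch shape that actually buys parallelism, assuming an async client. Here `complete` is a stand-in for whatever inference call your stack exposes, with a sleep standing in for one worker step:

```python
import asyncio

# Hypothetical async inference call. The sleep stands in for one worker
# step (TTFT + decode); the point is the dispatch shape, not the API.

async def complete(prompt: str) -> str:
    await asyncio.sleep(2.6)
    return f"result for: {prompt}"

async def fan_out(subtasks: list[str]) -> list[str]:
    # All workers are dispatched at once. Wall-clock time is roughly one
    # worker step, not the sum of all steps -- but only if the endpoint
    # has headroom and the calls do not serialize behind queue contention.
    return await asyncio.gather(*(complete(t) for t in subtasks))

results = asyncio.run(fan_out(["summarize A", "summarize B", "summarize C"]))
print(results)
```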

## Tool calls compound the multiplier

Most multi-agent systems are also tool-using. The orchestrator calls workers, the workers call tools, the tools return data, the workers process it, the orchestrator aggregates. Each tool call is itself an inference step that produces a structured output, and structured output generation is one of the slowest regimes for most serving stacks (we covered this in the post on tool calling latency).

If a worker needs three tool calls to complete its subtask, and the orchestrator delegates to four workers, you have twelve tool calls plus four worker completions plus one orchestrator decision step. Seventeen inference calls to handle a single user request, each with its own prefill, its own decode, and its own structured output overhead. That is the cost explosion.
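Spelled out as arithmetic, with the per-request figures from the example above:

```python
# Counting inference calls for one user request. Each call pays its own
# prefill, decode, and structured-output overhead.

workers = 4
tool_calls_per_worker = 3

tool_calls = workers * tool_calls_per_worker  # 12
worker_completions = workers                  # 4
orchestrator_steps = 1                        # final decision/aggregation

total_calls = tool_calls + worker_completions + orchestrator_steps
print(total_calls)  # 17 inference calls for a single user request
```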

The same task done by a single agent with the same tools would also produce three tool calls per logical subtask, but the agent would amortize the prefill across them through KV cache reuse, and the orchestrator overhead would not exist at all. The decomposition into multiple agents does not reduce the tool calls. It adds inference steps on top of them.

## The infrastructure assumptions break down

Most inference infrastructure is built around a request-response model with one prompt in, one stream out. Multi-agent systems violate this assumption in ways that interact poorly with how serving stacks are tuned.

Continuous batching, which is how modern inference servers extract throughput from GPUs, works best when individual requests are long enough to fill a batch slot for a meaningful number of decode steps. Multi-agent systems produce many short generations: a 30-token plan, a 50-token tool call, a 100-token critique. Each of these is a request that joins the batch, decodes briefly, and leaves. The throughput hit from request churn is real and shows up as lower tokens-per-second per GPU than the same hardware achieves on chat workloads.

Prefix caching, as mentioned, is degraded by the way orchestrator-worker prompts are constructed. KV cache reuse across agent boundaries is essentially impossible unless the framework is explicitly designed for it, because each agent has its own system prompt and the cache keys do not match.

Speculative decoding still helps within a single agent step, but the gains do not compound across agents the way they would within one long generation. A 2x speculative decoding speedup on one agent step still leaves you paying full latency for the gap between agents: the network round trip from worker back to orchestrator, the orchestrator's own prefill, the dispatch to the next worker. Speeding up generation does not help when generation is not the bottleneck.
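Extending the toy latency model from earlier makes this concrete. The inter-agent handoff cost here is an assumed value, not a measurement:

```python
# Extending the sequential-delegation model: a 2x speculative-decoding
# speedup halves decode time within each step but does nothing to the
# gaps between agents. GAP_S is an assumed handoff cost (network round
# trip + orchestrator prefill), not a measured figure.

TTFT_S = 0.6
TOKENS_PER_S = 100.0
GAP_S = 0.3  # assumed cost per handoff; two handoffs per round

def step(output_tokens: int, speedup: float = 1.0) -> float:
    return TTFT_S + output_tokens / (TOKENS_PER_S * speedup)

baseline  = 4 * (step(50) + step(200) + 2 * GAP_S)
with_spec = 4 * (step(50, 2.0) + step(200, 2.0) + 2 * GAP_S)

# 17.2 s -> 12.2 s: a 2x decode speedup buys only ~1.4x end to end,
# because TTFT and the inter-agent gaps are untouched.
print(f"{baseline:.1f} s -> {with_spec:.1f} s")
```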

## What this means for system design

The right response to all of this is not to abandon multi-agent architectures. There are problems they handle well, especially ones with genuine subtask independence (large document processing, multi-source research, parallel code review across files). The right response is to design with the inference profile in mind from the start.

A few practical implications.

Decompose only when the subtasks are genuinely parallelizable or when the role separation provides real safety guarantees. If you are decomposing for organizational clarity but the agents end up running sequentially anyway, you are paying multi-agent cost for single-agent behavior.

Share context aggressively between agents instead of re-feeding it through fresh prompts. Some frameworks have started supporting shared message buses or pinned context windows that survive across agent calls. These reduce the prompt overhead multiplier substantially when used well.
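A minimal sketch of the idea, with invented names rather than any particular framework's API: the conversation and prior agent outputs live in one shared object, and each agent's prompt is rendered from it instead of being rebuilt from scratch:

```python
# Hypothetical shared-context object. The conversation is stored once and
# referenced by every agent; each agent contributes its output back rather
# than having the orchestrator re-feed everything through a fresh prompt.

class SharedContext:
    def __init__(self, conversation: str):
        self.conversation = conversation
        self.artifacts: list[tuple[str, str]] = []  # (agent name, output)

    def add(self, agent: str, output: str) -> None:
        self.artifacts.append((agent, output))

    def render_for(self, role_prompt: str) -> str:
        # Each agent keeps its own role prompt, but the conversation and
        # prior outputs are shared verbatim instead of re-summarized or
        # re-framed per agent.
        prior = "\n".join(f"[{agent}] {output}" for agent, output in self.artifacts)
        return f"{role_prompt}\n\n{self.conversation}\n\n{prior}"
```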

Run on infrastructure that holds up under the request profile multi-agent systems produce: many short generations with structured output, tight latency tails, and bursty parallelism. The throughput numbers that providers publish on long chat workloads are not the numbers you will see on this traffic, and the gap between providers widens as request size shrinks.

Measure end-to-end latency from the user's perspective, not per-agent latency. A multi-agent system that looks fast on a per-step dashboard can still feel slow because the user sees the time from request to first visible output, not the time of the fastest individual step.
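One way to instrument this, as a sketch. Here `run_agents` stands in for your pipeline and is assumed to invoke the callback when the first user-visible token streams out:

```python
import time

# Measure what the user experiences: time to first visible output and
# time to completion for the whole request, not per-agent step timings.

def timed_request(run_agents, request):
    t0 = time.monotonic()
    timings = {}

    def on_first_visible_token():
        # Record only the first time the callback fires.
        timings.setdefault("first_visible_s", time.monotonic() - t0)

    result = run_agents(request, on_first_visible_token)
    timings["total_s"] = time.monotonic() - t0
    return result, timings
```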

Multi-agent designs trade compute for clarity. That trade is sometimes worth it. But the compute side of the trade is larger than most teams budget for when they pick up an agent framework, and the latency side is what kills the user experience when the underlying serving stack is not built for the request shape.

If you are running multi-agent workloads and finding that latency or cost is the limit, the inference layer is usually where the leverage is. Faster generation, lower TTFT, and serving infrastructure that handles short structured outputs well are what make these patterns viable in production. That is the shape of the problem we are working on at General Compute. If you want to see how it changes the math on your own workload, the API is OpenAI-compatible and the docs are at [generalcompute.com](https://generalcompute.com).