
Chunked Prefill: Overlapping Compute and Communication

General Compute

Run any modern serving engine at steady state with mixed traffic and you can watch the GPU utilization graph flicker. A long prompt arrives, the engine spends two or three hundred milliseconds doing prefill, and during that window every in-flight decode step is either paused or executed at higher latency than normal. Users see the symptom as a brief freeze in their token stream. Operators see it as TPOT (time per output token) spikes in their dashboards. The interference is not a bug. It is what happens when two workloads with opposite resource profiles share one device.

Chunked prefill is the main technique production stacks use to reduce that interference without paying for full disaggregation. Sarathi-Serve (Agrawal et al., 2024) formalized the idea, but the intuition is simple. You break each prefill into smaller chunks and run those chunks in the same batch as active decode steps. The prefill chunk does the heavy matmul work and saturates the tensor cores. The decode steps piggyback on the same batch and use the memory bandwidth that would otherwise sit idle. Neither phase waits for the other. The GPU ends up running closer to its true ceiling, because the two phases fill different parts of the roofline.

This post walks through the mechanics of chunked prefill, the tradeoffs Sarathi-Serve explores, and how the technique interacts with the rest of the serving stack. It is a close cousin of disaggregated prefill and decode, and the two approaches are often compared. They solve the same underlying problem with different tools.

Why Colocated Prefill and Decode Interfere

Prefill and decode have well known asymmetries. Prefill takes an input of length N and runs the model once against the whole sequence. The matmuls have a large inner dimension. They saturate the tensor cores and the bottleneck is FLOPs. Decode takes a single new token, attends to the cached prefix, and produces one logit distribution per step. The matmul shapes are small. The bottleneck is the bandwidth required to load the KV cache and the model weights.
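The asymmetry is easy to see in back-of-envelope roofline terms. The sketch below uses illustrative numbers (an ~8B-parameter model in fp16) and deliberately ignores KV-cache traffic and attention FLOPs; it only counts weight traffic, which dominates the decode side.

```python
# Back-of-envelope arithmetic intensity for prefill vs. decode.
# Illustrative constants: ~8B parameters, fp16 weights. KV-cache
# traffic and attention FLOPs are ignored for simplicity.

PARAMS = 8e9           # model parameters (assumed)
BYTES_PER_PARAM = 2    # fp16 weights

def arithmetic_intensity(tokens_in_pass: int) -> float:
    """FLOPs per byte of weight traffic for one forward pass.

    Each token costs ~2 FLOPs per parameter (one multiply-add);
    the weights are read once per pass regardless of token count.
    """
    flops = 2 * PARAMS * tokens_in_pass
    bytes_read = PARAMS * BYTES_PER_PARAM
    return flops / bytes_read

prefill_ai = arithmetic_intensity(2048)  # 2K-token prefill
decode_ai = arithmetic_intensity(1)      # single decode step

print(f"prefill: {prefill_ai:.0f} FLOPs/byte, decode: {decode_ai:.0f} FLOPs/byte")
# → prefill: 2048 FLOPs/byte, decode: 1 FLOPs/byte
```

An H100's compute-to-bandwidth ratio sits in the low hundreds of FLOPs per byte, so the 2K-token pass lands firmly on the compute-bound side of the roofline and the single-token pass is deeply bandwidth-bound.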

On an H100 running Llama 3 8B, a typical 2K-token prefill takes somewhere around 150 milliseconds. A single decode step on the same model runs in roughly 15 to 25 milliseconds at reasonable batch sizes. If you run them back to back on the same GPU, decode steps that were queued behind a prefill sit waiting for the prefill to finish. A decode that should have emitted a token every 20 milliseconds instead emits nothing for 150 milliseconds and then resumes. That jitter is exactly what voice agents, coding assistants, and interactive chat cannot tolerate.

Continuous batching (Orca) partially addresses this by allowing the serving engine to add new requests at iteration boundaries rather than waiting for the current batch to drain. It helps with throughput. It does not directly help with the prefill-versus-decode conflict, because a long prefill still occupies a full iteration once it begins. The engine cannot preempt a prefill mid-kernel.

Adding priority rules on top of continuous batching is one workaround. You can defer new prefills when decode SLOs are under threat. But you cannot defer prefills indefinitely, or TTFT balloons for everyone waiting to start. The two SLOs pull in opposite directions, and any scheduling policy is making a tradeoff between them.

The Chunked Prefill Idea

Sarathi-Serve's observation is that prefill is only indivisible if you let it be. You can compute prefill in slices along the sequence dimension. A 2K-token prefill can be done as four chunks of 512 tokens, or eight chunks of 256. Each chunk is a valid forward pass over a contiguous window of the prompt, and the output is the same KV cache you would have produced in one shot, just assembled piece by piece.
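The slicing itself is trivial. Here is a minimal sketch (the function name is illustrative, not from any engine) of carving a prompt into chunk ranges along the sequence dimension:

```python
# Minimal sketch of slicing a prompt into prefill chunks along the
# sequence dimension. Names are illustrative, not from any engine.

def prefill_chunks(prompt_len: int, chunk_size: int):
    """Yield (start, end) token ranges covering the prompt in order.

    Each range is a valid forward pass over a contiguous window; the
    KV cache for [0, end) is complete once the chunk finishes.
    """
    for start in range(0, prompt_len, chunk_size):
        yield (start, min(start + chunk_size, prompt_len))

# A 2K-token prompt as four 512-token chunks:
print(list(prefill_chunks(2048, 512)))
# → [(0, 512), (512, 1024), (1024, 1536), (1536, 2048)]
```

The last chunk is simply shorter when the prompt length is not a multiple of the chunk size.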

Once prefill is divisible, you can place each chunk in a batch alongside active decode requests. The batch has two kinds of workloads inside it: one prefill chunk operating on K tokens, and N decodes operating on one token each. The total number of tokens in the batch is K + N. The attention kernel runs over this mixed input, using masking tricks to handle the different sequence contexts.
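Concretely, a mixed batch is usually flattened the way variable-length attention kernels expect: per-sequence token counts plus cumulative offsets. This sketch (illustrative names; the layout loosely mirrors the cu_seqlens convention used by flash-attention-style varlen kernels) shows one prefill chunk of K tokens alongside N one-token decodes:

```python
# Sketch of the token layout for one mixed batch: a K-token prefill
# chunk plus N single-token decodes, flattened with cumulative
# sequence offsets. Illustrative only, not any engine's actual API.

def mixed_batch_layout(prefill_chunk_len: int, num_decodes: int):
    """Return per-sequence token counts and cumulative offsets."""
    seq_lens = [prefill_chunk_len] + [1] * num_decodes
    cu_seqlens = [0]
    for n in seq_lens:
        cu_seqlens.append(cu_seqlens[-1] + n)
    return seq_lens, cu_seqlens

seq_lens, cu = mixed_batch_layout(512, 8)
print(sum(seq_lens))   # → 520 total tokens in the batch (K + N)
print(cu)              # → [0, 512, 513, 514, 515, 516, 517, 518, 519, 520]
```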

Here is where the name of this post comes in. In a mixed batch, the prefill chunk does the compute-bound work. It provides enough dense arithmetic to keep the tensor cores busy. The decode operations provide memory-bandwidth demand, because their attention passes have to load the KV cache for each active request. On current GPUs, the tensor cores and the memory subsystem can run in parallel. A single mixed batch can keep both busy at once. You are overlapping compute (prefill chunk) and communication (decode KV cache loads) inside a single forward pass.

Sarathi-Serve calls this "stall-free batching." The decode path never has to stop and wait for a prefill to complete. Every iteration includes some amount of decode progress, either as a pure decode batch or as a decode plus prefill-chunk mixed batch. TPOT stays stable regardless of what prefills are flowing through the system.

Chunk Size Tradeoffs

The chunk size is the dial you tune, and it directly controls the tradeoff.

Small chunks make the mixed batch decode-dominated. TPOT stays very low because prefill barely perturbs the decode path. But you have split a 2K prefill into many small kernel calls, each of which carries fixed overhead (kernel launch, pipeline setup, attention mask construction). The total wall time of prefill increases. TTFT gets worse.

Large chunks reverse the balance. The prefill completes in fewer steps, so TTFT is close to the unchunked baseline. But each step with a large chunk pushes more work into the tensor cores, and decode latency in that step goes up. TPOT regresses.

Sarathi-Serve's paper reports that chunk sizes around 512 or 1024 tokens work well on typical hardware for mid-sized models. The optimal number depends on the model shape, the GPU, and the traffic mix. Longer-context workloads tend to want larger chunks because prefill is expensive relative to decode. Short-prompt workloads can run with small chunks without paying much of a TTFT tax.
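The TTFT side of the tradeoff can be captured in a toy cost model: each chunk pays a fixed per-kernel overhead on top of the per-token compute, so small chunks stretch total prefill wall time. Both constants below are made up for illustration; real values depend on the model, kernel, and GPU.

```python
# Toy model of the TTFT side of the chunk-size tradeoff. Each chunk
# pays a fixed overhead (launch, setup, mask construction), so small
# chunks inflate prefill wall time. Constants are assumed, not measured.

import math

FIXED_OVERHEAD_MS = 2.0   # per-chunk fixed cost (assumed)
MS_PER_TOKEN = 0.07       # per-token prefill compute (assumed)

def prefill_wall_time_ms(prompt_len: int, chunk_size: int) -> float:
    n_chunks = math.ceil(prompt_len / chunk_size)
    return n_chunks * FIXED_OVERHEAD_MS + prompt_len * MS_PER_TOKEN

for chunk in (128, 512, 2048):
    print(chunk, round(prefill_wall_time_ms(2048, chunk), 1))
```

Under these assumed constants, a 2K prompt at 128-token chunks spends roughly 30 extra milliseconds on fixed overhead versus one-shot prefill, while 512-token chunks cost only a few. The TPOT effect runs the other way, which is why the dial has a sweet spot rather than a best extreme.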

One non-obvious source of chunk-size cost is attention itself. When you prefill chunk K after chunk K-1, the attention pass for chunk K has to attend to all prior chunks (the already-populated KV cache plus any decode history). That extra work is not free. The arithmetic is essentially unchanged, because under causal masking each token attends to the same prefix whether you prefill in one shot or in chunks, but chunking replaces one kernel over the full sequence with a series of kernels, each of which re-reads the KV cache of all earlier chunks from HBM and pays its own launch and setup cost. On flash-attention-style kernels this overhead is usually a few percent, but it grows with sequence length. For very long prompts, chunking costs real memory traffic on top of the base prefill.
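The growing context is easy to make concrete. This small sketch (an illustrative helper, not from Sarathi-Serve) lists, for each chunk, how many query tokens it carries versus how long a KV context its attention pass must cover:

```python
# Sketch: the attention context each prefill chunk covers. Chunk c
# attends to everything already cached plus itself, so later chunks
# read strictly more KV than earlier ones. Illustrative only.

def chunk_attention_contexts(prompt_len: int, chunk_size: int):
    """Return (queries_in_chunk, kv_context_len) per chunk."""
    out = []
    start = 0
    while start < prompt_len:
        end = min(start + chunk_size, prompt_len)
        out.append((end - start, end))  # this chunk's queries, total KV
        start = end
    return out

print(chunk_attention_contexts(2048, 512))
# → [(512, 512), (512, 1024), (512, 1536), (512, 2048)]
```

The last chunk of a 2K prompt attends over the full 2048-token context even though it only contributes 512 queries, which is where the per-chunk KV re-read traffic comes from.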

The Scheduling Policy

Chunking is only half of the story. The other half is deciding which requests go into each batch and how to fill the token budget.

Sarathi-Serve operates with a per-iteration token budget. Each iteration processes at most B tokens across all requests in the batch (B might be 2048 or 4096, depending on memory and latency targets). Inside each iteration the scheduler looks at the in-flight requests and allocates token slots. Decodes each cost one token. Any remaining budget is filled with a chunk from a pending prefill. If the prefill would exceed the remaining budget, it is sliced so that the chunk fits exactly.
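The allocation step can be sketched in a few lines. This is a simplified model of the policy described above, with hypothetical names; a real scheduler also handles multiple pending prefills, priorities, and memory pressure.

```python
# Minimal sketch of Sarathi-style token-budget allocation for one
# iteration: every active decode gets one token slot, and whatever
# budget remains is filled with a slice of the pending prefill.
# Names are illustrative, not from any particular engine.

def plan_iteration(decode_count: int,
                   pending_prefill_remaining: int,
                   token_budget: int):
    """Return (decode_tokens, prefill_chunk_tokens) for one iteration."""
    decode_tokens = min(decode_count, token_budget)
    remaining = token_budget - decode_tokens
    chunk = min(pending_prefill_remaining, remaining)
    return decode_tokens, chunk

# 8 active decodes, a 2048-token prompt not yet prefilled at all,
# and a 512-token per-iteration budget:
print(plan_iteration(8, 2048, 512))   # → (8, 504)
```

Note how the prefill chunk is sliced to exactly fill the leftover budget, which is what keeps per-iteration cost roughly constant.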

This makes the iteration cost predictable. Every iteration runs on roughly B tokens of work, so decode latency is roughly constant. TTFT for any given request is the number of iterations it takes to fully chunk through its prompt, plus queueing delays before its first chunk gets scheduled.

The scheduler can also prioritize. If decode SLOs are at risk, it can temporarily reduce the chunk size on new prefills to keep TPOT down. If TTFT is lagging, it can bump chunk size to push prefills through faster. These are simple policies sitting on top of the chunk mechanism, but they give operators a way to bias the system toward whichever SLO is currently bleeding.

Against the Alternatives

Chunked prefill and disaggregated prefill-decode are often framed as competing approaches. In practice they are complementary.

Disaggregation (Splitwise, DistServe) moves prefill and decode onto separate GPU pools. It avoids interference entirely, at the cost of a KV cache transfer between pools and the operational complexity of running two fleets. It shines when you have tens of GPUs or more and strict SLOs on both phases. It is overkill for a single-node deployment.

Chunked prefill keeps both phases on the same GPU but schedules them carefully so they do not collide. There is no cross-node cache transfer. Implementation is mostly a change to the scheduler and the attention kernel. The engineering complexity is much lower than disaggregation. The downside is that you are still fitting two different workloads on one resource, which caps how far you can push each SLO.

A reasonable mental model: chunked prefill gets you 60 to 80 percent of the benefit of disaggregation for 10 percent of the engineering cost. For smaller deployments or for teams that do not yet need maximum throughput, it is the right first move. For large-scale latency-sensitive production, you often end up doing both: chunked prefill within each pool, and disaggregation across pools.

Implementation Notes

Adopting chunked prefill in a serving engine touches a few places:

The scheduler has to track per-request prefill progress. Each request carries how many prompt tokens have been processed so far, so the scheduler can slice the next chunk off correctly.
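The state involved is small. A sketch of what each request might carry (hypothetical field names, not from any particular engine):

```python
# Sketch of per-request prefill-progress state for chunked prefill.
# Field and method names are hypothetical, for illustration only.

from dataclasses import dataclass

@dataclass
class RequestState:
    prompt_len: int
    prefilled: int = 0          # prompt tokens already through prefill

    def next_chunk(self, max_chunk: int) -> tuple:
        """Token range [start, end) of the next prefill chunk."""
        start = self.prefilled
        end = min(start + max_chunk, self.prompt_len)
        return start, end

    def advance(self, end: int) -> None:
        self.prefilled = end

    @property
    def prefill_done(self) -> bool:
        return self.prefilled >= self.prompt_len

req = RequestState(prompt_len=2048)
while not req.prefill_done:
    start, end = req.next_chunk(512)
    req.advance(end)            # schedule the chunk, then record progress
print(req.prefilled)            # → 2048
```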

The attention kernel needs to handle mixed batches with different context lengths per sequence. This is the "variable length attention" or "flash-attention with variable seq_len" path. Most production kernels already support it because continuous batching needs the same feature.

The KV cache layout has to allow incremental writes. If you are using PagedAttention (vLLM) or a similar block-based cache, each prefill chunk writes its slice of the cache into the right blocks. There is no special allocation logic beyond what pure prefill already needs.

Observability becomes more important, not less. Chunked prefill makes behavior smoother but also harder to reason about from traces. A request's TTFT now depends on how many chunks it took, how busy the scheduler was when each chunk was eligible to run, and how many decodes were co-batched with those chunks. Good tracing that records per-chunk scheduling decisions pays for itself the first time you debug a TTFT regression.

How Production Stacks Use It

By early 2026, chunked prefill is the default or widely enabled in most serious serving stacks. vLLM has had chunked prefill in master since 2024 and it is on by default in most recent releases. TensorRT-LLM supports it as an opt-in policy. SGLang runs a variant of it as part of its scheduler. Dynamo combines chunked prefill within each pool with disaggregation across pools, which is where the top-of-market latency numbers come from.

The technique has also shaped the thinking about how to benchmark serving systems. Older benchmarks reported peak prefill throughput and peak decode throughput as separate numbers. That is no longer informative, because in production the two phases share GPUs and the only number that matters is the sustained mixed-workload behavior under realistic traffic. Sarathi-Serve's paper was part of a shift toward mixed-workload benchmarks that report both TTFT and TPOT distributions across realistic traffic mixes.

When Chunked Prefill Does Not Help Much

If your workload is all prefill (large batch offline inference over short outputs) or all decode (very long generations on short prompts), chunked prefill adds no benefit and may slightly regress throughput because of the chunking overhead. Pure prefill workloads should run one-shot prefill at the largest batch size memory allows. Pure decode workloads are fine with standard continuous batching.

If your fleet has only one GPU and very tight SLOs on both TTFT and TPOT, chunked prefill is the right tool, but there are physical limits. A single H100 has a finite amount of memory bandwidth and tensor-core throughput. At high enough load, both phases slow down no matter how cleverly you schedule. At that point the next move is adding GPUs, either by replicating the same setup or by splitting the phases into disaggregated pools.

Closing

Chunked prefill is one of those techniques where the idea is obvious in hindsight and the engineering is mostly unglamorous. You divide a big kernel into smaller ones, schedule them together with other small kernels, and let the hardware do what it was already capable of doing. The payoff is a steadier latency profile for interactive workloads without the operational weight of running separate pools.

At General Compute we lean on both chunked prefill inside each node and disaggregation across nodes for the latency-sensitive workloads our customers run. Voice agents and real-time coding assistants are the kinds of workloads where a single 300ms prefill stall can break the user experience, and chunked prefill is one of the simpler levers that keeps those stalls from showing up. If you are shipping something where TPOT consistency matters as much as the average, our API and the serving stack behind it were built around keeping both numbers tight. The docs have the latency and throughput numbers if you want to see how they map to your workload.
