Tags: inference · batching · serving · scheduling · throughput

Dynamic Batching Strategies: From Naive to Continuous to Iteration-Level

General Compute

The arithmetic of GPU inference is unforgiving. A single decode step on a 70B model uses a tiny fraction of the device's tensor cores, because the bottleneck is loading weights from HBM, not multiplying matrices. If you only serve one request at a time, the GPU spends most of its life waiting on memory. Batching is how you amortize that memory cost over many requests, and the way you batch determines whether your serving stack pushes 100 tokens per second or 5,000.

LLM serving has been through several batching regimes over the last five years. Each one fixed a specific failure of the previous one, and each one left a different residue of inefficiency for the next paper to clean up. This post walks through that evolution, from the simplest case (no batching at all) through static batching, dynamic request-level batching, and finally iteration-level continuous batching. Along the way I will note where memory management, prefill scheduling, and prefix sharing intersect with the batching question, because in practice you cannot reason about one without the others.

Why Batching Matters for LLMs Specifically

Most ML inference systems before LLMs were compute-bound. A vision model running on a single image saturates the tensor cores within a few milliseconds. Batching gives you better throughput, but the marginal gain per added request is bounded because compute eventually saturates.

Decode in an autoregressive LLM is different. Each forward pass loads every parameter (or every active expert in an MoE) from HBM to do a matrix-vector multiply. The arithmetic intensity is something like one floating point operation per byte loaded, far below the roofline crossover for any modern accelerator. This means that for a long stretch of batch sizes, adding requests is nearly free in latency terms while doubling throughput. On an H100 serving Llama 3 70B, going from batch size 1 to batch size 32 typically increases per-token latency by less than 30 percent while delivering roughly 25x more tokens per second.
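A toy roofline model makes the claim concrete. The hardware numbers below are assumptions for an H100-class device (fp16 weights, ~3.35 TB/s HBM bandwidth, ~990 TFLOP/s dense bf16), not measurements:

```python
# Toy roofline model for a decode step: per-step time is the larger of the
# time to stream all weights from HBM and the time to do the matmul FLOPs.
# All numbers are illustrative assumptions, not measurements.

PARAMS = 70e9          # model parameters (Llama-3-70B scale)
BYTES_PER_PARAM = 2    # fp16/bf16 weights
HBM_BW = 3.35e12       # bytes/s (assumed H100 SXM HBM3 bandwidth)
PEAK_FLOPS = 990e12    # dense bf16 FLOP/s (assumed)

def decode_step_time(batch: int) -> float:
    mem_time = PARAMS * BYTES_PER_PARAM / HBM_BW    # every weight loaded once per step
    compute_time = 2 * PARAMS * batch / PEAK_FLOPS  # ~2 FLOPs per parameter per token
    return max(mem_time, compute_time)

def tokens_per_second(batch: int) -> float:
    return batch / decode_step_time(batch)

# Step time is flat while memory-bound, so throughput scales with batch size
# until compute time overtakes the weight-streaming time (around batch ~300 here).
for b in (1, 8, 32, 128, 512):
    print(f"batch {b:4d}: {decode_step_time(b) * 1e3:6.1f} ms/step, "
          f"{tokens_per_second(b):8.0f} tok/s")
```

In this model, batch 1 and batch 128 take the same wall clock per step, which is exactly the "nearly free" region the paragraph describes.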

Prefill is closer to compute-bound, since the attention and projection passes operate on the entire prompt at once. But even there, batching still helps until you saturate the device, and scheduling compute-bound prefill alongside memory-bound decode on the same accelerator is one of the hardest scheduling problems in the system.

So batching is not optional for LLM serving. The only questions are how you form the batches, when you mutate them, and how you handle the variable-length nature of generation.

Stage 0: No Batching

The simplest possible inference server processes requests one at a time. A request comes in, the server runs prefill on the prompt, runs decode until the model emits an end-of-sequence token or hits the max length, and returns the completion. The next request starts only when the previous one finishes.

This is the right design for some uses (single-tenant local inference, latency-sensitive demos with no concurrency), and it is the wrong design for almost any production serving workload. On a 70B model, a single-request server will hit somewhere around 50 to 80 tokens per second on an H100, and the device utilization sits in the low single digits. You are paying for the whole GPU and using a sliver of its capacity.
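The whole Stage 0 server is one loop. A minimal sketch, with a toy stand-in for the model (nothing here is a real API):

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: list
    max_tokens: int
    output: list = field(default_factory=list)

class ToyModel:
    """Stand-in for a real model: 'generates' a few tokens, then EOS."""
    eos = -1
    def prefill(self, prompt):
        # Fake per-request decode state derived from the prompt length.
        return {"remaining": len(prompt) % 5 + 1}
    def decode_step(self, kv):
        kv["remaining"] -= 1
        return 7 if kv["remaining"] > 0 else self.eos

def serve_sequentially(model, requests):
    # Stage 0: the next request starts only when the previous one finishes.
    for req in requests:
        kv = model.prefill(req.prompt)          # one pass over the full prompt
        while len(req.output) < req.max_tokens:
            tok = model.decode_step(kv)          # one token per forward pass
            if tok == model.eos:
                break
            req.output.append(tok)
    return requests
```

Every stage after this one is a different answer to the question of what to do while that inner loop is running.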

The natural response is to start grouping requests together.

Stage 1: Static Batching

The first improvement is to wait until you have a batch of requests, run them through the model together, and return them all at once. This is how most pre-LLM serving systems handled batching. TensorFlow Serving, TorchServe, and Triton all support a server-side batcher that collects N requests within a time window and processes them as one tensor.

For fixed-length classification or ranking, this works well. The batch shape is regular: every request has the same input size (or you pad to the longest), every request takes the same number of forward passes (one), and every request returns at the same time. Throughput scales close to linearly until you hit the compute roofline.

For LLM generation, it falls apart fast. The two big problems are:

Variable output length. Static batching forces every request in the batch to run for the same number of decode steps. If one request finishes after 10 tokens and another needs 500, the 10-token request is "done" but cannot leave the batch. The server keeps generating padding tokens for it (or just keeps it pinned in GPU memory and ignores its outputs) until the longest request completes. That is hundreds of wasted forward passes for the short request, and the user sees their reply latency stretched to match the slowest sibling in the batch.

Variable prompt length. Static batching also wants uniform input shapes. The standard fix is right-padding to the longest prompt and masking out the pad tokens during attention. The compute is not actually saved, since the model still runs on the padded sequence. For mixed workloads where some prompts are 50 tokens and some are 5,000, padding wastes a large fraction of the prefill budget.
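The two losses above are easy to quantify with a back-of-envelope calculation. The lengths below are illustrative:

```python
def static_batch_waste(prompt_lens, output_lens):
    """Fraction of prefill and decode work wasted by static batching:
    every request is padded to the longest prompt and runs for as many
    decode steps as the longest output."""
    pad_prompt = max(prompt_lens)
    pad_decode = max(output_lens)
    prefill_useful = sum(prompt_lens)
    prefill_total = pad_prompt * len(prompt_lens)
    decode_useful = sum(output_lens)
    decode_total = pad_decode * len(output_lens)
    return {
        "prefill_waste_frac": 1 - prefill_useful / prefill_total,
        "decode_waste_frac": 1 - decode_useful / decode_total,
    }

# A mixed batch: short chat turns alongside one long-context request.
w = static_batch_waste(prompt_lens=[50, 200, 5000], output_lens=[10, 80, 500])
```

With this mix, well over half of both the prefill and decode budget goes to padding and finished-but-pinned requests.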

Static batching is dead for LLM serving. No production system uses it as the primary batching strategy today. The reason is not that batching is a bad idea, but that the granularity of "the whole batch finishes together" is wrong for autoregressive generation.

Stage 2: Dynamic Batching at the Request Level

Dynamic batching, in the classic Triton or TF Serving sense, refers to a server-side batcher that forms batches opportunistically. Requests arrive at irregular intervals, and the batcher waits up to some max delay (often 1 to 10 ms) to collect a batch of up to some max size. When either the size or the time threshold is hit, it dispatches the batch.

This solves the request-arrival pattern problem. You no longer need a queue depth of N before the server does any work; you just wait briefly for additional requests to show up. For workloads with steady traffic, dynamic batching keeps the GPU near its target batch size most of the time. For bursty workloads, it bounds the latency cost of forming a batch.
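A classic size-or-deadline batcher is small enough to show in full. A sketch with illustrative thresholds (real batchers like Triton's are driven by configuration, not hard-coded values):

```python
import queue
import time

def form_batch(q, max_size=8, max_delay_s=0.005):
    """Collect up to max_size requests, waiting at most max_delay_s after
    the first arrival. Dispatches when either threshold is hit."""
    batch = [q.get()]                       # block until at least one request
    deadline = time.monotonic() + max_delay_s
    while len(batch) < max_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                           # deadline hit: dispatch what we have
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break                           # no more arrivals in the window
    return batch
```

Under steady traffic the size threshold fires; under bursty traffic the deadline bounds how long the first request waits.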

The catch, for LLMs, is that dynamic batching as classically defined still suffers from the static-batching tail problem once the batch is dispatched. The decision of which requests are in the batch is made once, at dispatch time, and then those requests run together until the slowest one finishes. So you can think of dynamic batching as "static batching with smarter batch formation." It improves the average batch size, but it does nothing about head-of-line blocking inside the batch.

Some systems extended dynamic batching to handle variable output length by terminating the batch after K decode steps and re-batching the survivors with new arrivals. This helps, but the resync points are expensive: every K steps, the server pauses, runs scheduling logic, and rebuilds the batch tensors. If K is small you pay scheduling overhead constantly, and if K is large you reintroduce most of the original blocking problem.

Stage 3: Continuous Batching (Iteration-Level Scheduling)

The Orca paper, published at OSDI 2022, proposed iteration-level scheduling. The scheduler operates on the granularity of a single decode step, not a whole request. At every iteration, the server:

  1. Runs one forward pass on whatever set of requests is currently active.
  2. Removes any request that emitted EOS or hit max length.
  3. Adds any waiting request from the queue, provided the KV cache has room.
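The three-step loop above can be sketched directly. The model is a stub and KV accounting is simplified to one slot per active request:

```python
class StubModel:
    """Fake model: each request emits (len(prompt) % 4) tokens, then EOS."""
    eos = -1
    def prefill(self, prompt):
        return {"left": len(prompt) % 4 + 1}
    def decode_step_batch(self, kvs):
        out = []
        for kv in kvs:
            kv["left"] -= 1
            out.append(7 if kv["left"] > 0 else self.eos)
        return out

def continuous_batching_loop(model, waiting, kv_slots=4):
    """Iteration-level scheduling: batch membership is revisited every step.
    Requests are dicts; KV accounting is one slot per active request."""
    active, done = [], []
    while waiting or active:
        # Step 3: admit waiting requests while the KV cache has room.
        while waiting and len(active) < kv_slots:
            req = waiting.pop(0)
            req["kv"] = model.prefill(req["prompt"])
            req["output"] = []
            active.append(req)
        # Step 1: one forward pass over the whole active set.
        toks = model.decode_step_batch([r["kv"] for r in active])
        # Step 2: retire finished requests; their slots free immediately.
        survivors = []
        for req, tok in zip(active, toks):
            if tok == model.eos:
                done.append(req)
            else:
                req["output"].append(tok)
                survivors.append(req)
        active = survivors
    return done
```

The important structural point is that admission and retirement happen inside the step loop, not around it.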

The batch composition can change every single step. A request that arrives mid-generation does not have to wait for the current batch to complete; it joins on the next iteration. A short request leaves the moment it finishes and frees its slot for the next waiter. There is no head-of-line blocking, because there is no fixed batch to block in.

This was a large practical improvement. The Orca paper reported 36.9x throughput over NVIDIA FasterTransformer on GPT-3 175B at the same latency target. The improvement is not because the model is doing anything different per step; it is because the GPU is doing useful work on a full batch every step instead of a shrinking batch.

A few details matter when you actually build this:

Selective batching. Most operations in a transformer batch trivially across requests, but attention does not, because each request has its own KV cache with its own length. Orca handles this by batching the linear projections (Q, K, V, the output projection, the FFN) across all requests, then unrolling the attention computation per request. PagedAttention and FlashAttention later replaced the unrolled attention with kernels that handle ragged sequences directly, which is faster but follows the same logical split.

Prefill versus decode. Prefill on a 4,000-token prompt is very different from decode on a single token. Naively mixing them in the same forward pass either underuses the GPU during decode-only iterations or stalls all the decode requests during a prefill iteration. The standard fix today is chunked prefill: break the prefill into chunks of K tokens and interleave them with decode steps. The Sarathi-Serve paper showed that this approach keeps both phases productive without dedicated prefill workers.
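A toy version of that scheduling decision, in the spirit of Sarathi-Serve's token-budget approach (the budget size and the decode-first policy are illustrative choices, not the paper's exact algorithm):

```python
def plan_iteration(decode_reqs, prefill_reqs, token_budget=512):
    """Plan one forward pass under chunked prefill: every decode request
    contributes one token, and the leftover budget is filled with a chunk
    of the oldest pending prefill. Returns (kind, request, n_tokens) tuples."""
    plan = [("decode", r, 1) for r in decode_reqs[:token_budget]]
    budget = token_budget - len(plan)
    if prefill_reqs and budget > 0:
        req = prefill_reqs[0]
        chunk = min(budget, req["prompt_left"])  # never exceed what's left to prefill
        plan.append(("prefill", req, chunk))
    return plan
```

Because every iteration carries roughly `token_budget` tokens regardless of the prefill/decode mix, decode latency stays steady while long prompts make progress a chunk at a time.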

KV cache memory. The number of requests you can hold in the active batch is bounded by KV cache memory, not by compute. PagedAttention, introduced in vLLM, allocates KV cache in fixed-size blocks rather than per-request contiguous regions, which lets the scheduler hold more concurrent requests at the cost of an indirection on every attention read. Without paged memory, fragmentation alone caps your effective batch size well below what compute could support.
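A block-granular allocator in the spirit of PagedAttention fits in a few lines; the block size here is an illustrative parameter and eviction policy is left to the caller:

```python
class PagedKVAllocator:
    """Toy paged KV allocator: each sequence holds a list of fixed-size
    block ids instead of one contiguous region, so free memory never
    fragments below block granularity."""
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}   # seq_id -> list of block ids (the "block table")
        self.lens = {}     # seq_id -> tokens stored so far

    def append_token(self, seq_id) -> bool:
        n = self.lens.get(seq_id, 0)
        if n % self.block_size == 0:          # current block full (or first token)
            if not self.free:
                return False                  # caller must preempt, evict, or wait
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lens[seq_id] = n + 1
        return True

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lens.pop(seq_id, None)
```

The attention kernel then reads KV through the block table, which is the indirection cost mentioned above.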

What Iteration-Level Batching Still Misses

Continuous batching is not the end of the story. Several follow-on techniques target the inefficiencies it leaves behind.

Prefill and decode have different bottlenecks. Even with chunked prefill, running both phases on the same accelerator is a compromise. Disaggregated serving (Splitwise, DistServe) uses separate machines for prefill and decode, sized differently and connected by a fast KV transfer link. The decode machines run a continuous-batched scheduler, and the prefill machines run their own batcher tuned for compute throughput. Throughput is higher because each pool runs at its preferred operating point, at the cost of more complex deployment and KV transfer overhead.

Shared prefixes are wasted. When many requests share a system prompt or a long retrieved context, continuous batching still runs prefill independently for each one. Prefix caching (RadixAttention in SGLang, vLLM's prefix cache) deduplicates the KV cache for shared prefixes. This is orthogonal to batching but compounds with it: prefix caching reduces the effective prefill cost, which lets the scheduler accept more concurrent requests for the same memory budget.
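A toy prefix cache keyed by hashes of block-aligned token prefixes, loosely modeled on vLLM's scheme (the real one stores actual KV blocks and handles eviction; this only tracks hits):

```python
import hashlib

class PrefixCache:
    """Block-aligned prefix cache: a prompt prefix is cached at every
    block boundary, keyed by the hash of all tokens up to that boundary,
    so identical prefixes map to the same entries."""
    def __init__(self, block_size=16):
        self.block_size = block_size
        self.store = {}

    def _key(self, tokens):
        return hashlib.sha256(str(tokens).encode()).hexdigest()

    def insert(self, prompt):
        n = 0
        while n + self.block_size <= len(prompt):
            self.store.setdefault(self._key(prompt[: n + self.block_size]), True)
            n += self.block_size

    def cached_prefix_len(self, prompt):
        """Longest block-aligned prefix whose KV is already cached;
        prefill only needs to run on the tokens after this point."""
        n = 0
        while n + self.block_size <= len(prompt):
            if self._key(prompt[: n + self.block_size]) not in self.store:
                break
            n += self.block_size
        return n
```

Keying on the full prefix (not the individual block) is what makes the match position-dependent, which is required for correctness of cached KV.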

Long-tail decode requests dominate occupancy. A request generating 4,000 tokens occupies a batch slot for 4,000 iterations. If most requests are short and a few are very long, the long ones take up more and more of the active batch over time, and your effective concurrency drops. Speculative decoding (drafting multiple tokens per step and verifying in parallel) attacks this by reducing the number of iterations a long request needs. Multi-token prediction heads (Medusa, EAGLE) extend the same idea inside the model.
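For greedy decoding, the speculative verification rule is simple: accept draft tokens up to the first position where they disagree with the target model's own choice. A sketch, where `target_next_token` is a hypothetical stand-in for one position of what is really a single batched verification pass:

```python
def verify_draft(target_next_token, context, draft):
    """Greedy speculative-decoding acceptance: accept draft tokens until
    the first mismatch with the target model's greedy token; on a mismatch,
    the target's own token is emitted instead (so every step still makes
    at least one token of progress)."""
    accepted = []
    for tok in draft:
        expected = target_next_token(context + accepted)
        if expected != tok:
            accepted.append(expected)   # target token replaces the bad draft
            break
        accepted.append(tok)
    return accepted
```

When the draft model agrees for k positions, one verification pass advances the request by k+1 tokens, which is exactly how the iteration count for long generations comes down.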

Heterogeneous request priorities. Pure FIFO continuous batching does not handle priority well. If a latency-critical request arrives while the batch is full of background batch jobs, it has to wait for slots to open up. Modern serving systems layer admission control, priority queues, and preemption on top of iteration-level scheduling. Preemption is feasible because the smallest unit of work is a single decode step; you can pause a low-priority request, evict its KV cache, and restore it later when load drops.
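A toy admission routine with preemption (the priority scheme and the evict-back-to-queue policy are illustrative, not any specific engine's; a real system would swap or recompute the victim's KV):

```python
def admit_with_preemption(active, waiting, kv_slots):
    """Admit waiting requests into the active batch, preempting lower-priority
    active requests when the KV budget is full. Requests are dicts with a
    'prio' field (lower number = more urgent)."""
    waiting.sort(key=lambda r: r["prio"])     # most urgent waiter first
    while waiting:
        if len(active) < kv_slots:
            active.append(waiting.pop(0))     # free slot: admit directly
            continue
        victim = max(active, key=lambda r: r["prio"])
        if victim["prio"] <= waiting[0]["prio"]:
            break                             # nothing active is lower priority
        active.remove(victim)                 # preempt: KV evicted here
        waiting.append(victim)                # victim re-queued for later
        active.append(waiting.pop(0))
```

Because the unit of work is one decode step, the preempted request loses at most one iteration of progress plus the cost of restoring its KV.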

Practical Choices

For new serving deployments, the question is not whether to use continuous batching (you should) but which implementation and how to tune it. The relevant axes:

  • Max batch size. Bounded by KV cache memory. Larger is better for throughput up to the compute roofline, then it hurts per-token latency.
  • Max waiting requests in queue. Affects burst tolerance. Too high and you queue endlessly; too low and you reject load you could have served.
  • Chunked prefill chunk size. Trades prefill latency for decode steadiness. Smaller chunks keep decode TPS stable but stretch time-to-first-token.
  • Prefix cache size. Memory you give up from the active KV cache budget in exchange for cheaper prefill on repeat traffic.

vLLM, SGLang, TensorRT-LLM, and TGI all implement continuous batching with their own variations on these knobs. The defaults are usually sensible, but production workloads almost always benefit from tuning to the actual prompt-length and output-length distribution you serve.
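For example, in vLLM these knobs surface as server flags. The names below match recent vLLM releases but should be checked against your installed version, and the values are illustrative, not recommendations:

```shell
# Max batch size -> --max-num-seqs (bounded in practice by KV cache memory)
# Chunk size     -> --max-num-batched-tokens (per-iteration token budget)
# Prefix cache   -> --enable-prefix-caching
# KV budget      -> --gpu-memory-utilization (fraction reserved for weights + KV)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.92
```

Tune these against your measured prompt-length and output-length distribution rather than against benchmarks run on someone else's traffic.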

Where General Compute Sits

We run iteration-level continuous batching with chunked prefill, prefix caching, and disaggregated prefill/decode, on top of a custom inference stack tuned for our hardware. The reason these batching choices matter is that they directly determine the price-performance curve we can offer for voice agents, coding assistants, and other latency-sensitive workloads. If you want to try inference at a faster operating point than mainstream APIs, the General Compute API is OpenAI-compatible and you can swap your endpoint with a one-line change. The docs at generalcompute.com cover the model list and per-token pricing.

If you are building your own serving stack, the practical advice is short. Start from a continuous-batching engine (vLLM is the most common open-source choice). Profile your prompt and output length distribution before tuning anything. Add chunked prefill and prefix caching once you have a baseline. Disaggregate prefill and decode only when you have measured a real bottleneck that motivates the operational complexity. Each layer of the batching stack solves a specific problem; piling them on without diagnosing yours first is how serving deployments end up complicated and slow.
