Agent Readout

Continuous Batching: The Orca Paper That Changed LLM Serving

Before continuous batching, LLM servers wasted GPU cycles waiting for the slowest request in each batch. Orca's iteration-level scheduling fixed this with a 36.9x throughput improvement.

Author
General Compute
Published
2026-03-24
Tags
inference, papers, deep-dive

Before the Orca paper, LLM serving used static batching. You'd collect a group of requests, process them together, and wait until every request in the batch was done before starting the next batch. If one request generated 500 tokens and another generated 10, the short request would sit idle in GPU memory while the long one finished.

This is massively wasteful. The Orca paper introduced continuous batching (also called iteration-level scheduling), where the server makes scheduling decisions at every single token generation step instead of at the batch level. Finished requests leave immediately and new requests join in their place, keeping the GPU busy at all times.

The result was a 36.9x throughput improvement over NVIDIA's FasterTransformer on GPT-3 175B at the same latency target.

## The Problem With Static Batching

In static batching, a batch of requests is treated as a single unit. All requests start together and the batch completes when the last request finishes. This creates two problems:

**Head-of-line blocking.** Short requests are held hostage by long ones. A request that needs 10 tokens waits for a request that needs 500 tokens, occupying GPU memory the entire time. The short request's latency is determined by the longest request in its batch, not by its own workload.

**Low GPU utilization.** As requests in a batch finish at different times, the batch gets progressively emptier. The GPU is doing work for fewer and fewer requests but still can't accept new ones until the batch completes. Utilization drops steadily over the life of each batch.

For interactive applications where response length varies widely (which is basically all LLM use cases), static batching wastes the majority of available compute.

## How Continuous Batching Works

Orca's key innovation is iteration-level scheduling. Instead of scheduling at the batch level, the scheduler operates at the granularity of individual token generation steps (iterations).

At each iteration:

1. Generate one token for every active request in the current batch.
2. Check if any requests have finished (hit their stop token or max length).
3. Remove finished requests from the batch.
4. If there's room (memory available for KV cache), add waiting requests from the queue.
5. Repeat.

This means the batch composition changes at every single step. A request might join the batch at step 47 and leave at step 82, while other requests continue around it.
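The loop above can be sketched in a few lines of Python. This is a minimal illustration of iteration-level scheduling, not Orca's actual implementation: the `Request` class, `run_scheduler` function, and the `max_batch_size` admission check are all illustrative, and the real system admits requests based on available KV-cache memory rather than a fixed slot count.

```python
from collections import deque

# Minimal sketch of Orca-style iteration-level scheduling.
# All names here are illustrative, not Orca's actual API.

class Request:
    def __init__(self, rid, target_len):
        self.rid = rid
        self.target_len = target_len   # tokens this request will generate
        self.generated = 0

    def step(self):
        self.generated += 1            # stand-in for one decode iteration

    def finished(self):
        return self.generated >= self.target_len

def run_scheduler(requests, max_batch_size):
    waiting = deque(requests)
    active, finished_order = [], []
    while active or waiting:
        # Admit waiting requests while there is room
        # (the real KV-cache memory check is elided here).
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())
        # One iteration: generate one token for every active request.
        for r in active:
            r.step()
        # Evict finished requests immediately; their slots free up this step.
        still_active = []
        for r in active:
            (finished_order if r.finished() else still_active).append(r)
        active = still_active
    return [r.rid for r in finished_order]

# A short request admitted mid-flight still finishes long before
# the 500-token request it shares the batch with.
order = run_scheduler(
    [Request("long", 500), Request("a", 10), Request("b", 10)],
    max_batch_size=2,
)
print(order)  # ['a', 'b', 'long']
```

Note that request `b` joins the batch only after `a` finishes and frees a slot, exactly the mid-stream join-and-leave behavior described above.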

The paper also introduced "selective batching," which recognizes that not all operations in a transformer benefit equally from batching. Attention, for instance, has per-request KV caches that can't easily be batched across requests, while the feed-forward layers (the dense matrix multiplications) batch well. Orca applies batching selectively to the operations where it helps.
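The distinction selective batching draws can be made concrete with a toy example. In this sketch (shapes, names, and the single-head attention are all illustrative, not Orca's implementation), attention is computed per request because each request's KV cache has a different length, while the feed-forward projection is stacked into one batched matrix multiply:

```python
import numpy as np

# Toy illustration of selective batching. Shapes and names are
# illustrative; real attention has separate K/V tensors and many heads.

d = 8
rng = np.random.default_rng(0)
W_ff = rng.standard_normal((d, d))   # stand-in feed-forward weight

# Each request is at a different decode step, so KV caches differ in length.
kv_caches = [rng.standard_normal((n, d)) for n in (3, 7, 12)]
queries = [rng.standard_normal(d) for _ in kv_caches]

# Attention: ragged KV lengths can't be stacked into one rectangular
# batch, so it runs per request.
attn_out = []
for q, kv in zip(queries, kv_caches):
    scores = kv @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    attn_out.append(weights @ kv)    # using kv as both keys and values

# Feed-forward: every token has identical shape, so all requests
# collapse into a single dense GEMM.
batched = np.stack(attn_out)         # (num_requests, d)
ff_out = batched @ W_ff
print(ff_out.shape)  # (3, 8)
```

The payoff is that the expensive dense layers still see a full batch even when the attention inputs are ragged.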

## Why the Improvement Is So Large

The 36.9x throughput number sounds extreme, but it makes sense when you consider what static batching leaves on the table.

With static batching, the effective batch size (number of requests actually doing useful work) starts high and declines as requests finish. On average, the GPU is underutilized for most of the batch's lifetime.

With continuous batching, the effective batch size stays near the maximum at all times. As soon as one request finishes, another takes its place. The GPU is always working at full capacity.

This is especially impactful for LLM workloads where output lengths vary dramatically. A chatbot might generate anywhere from 5 to 500 tokens per response. Static batching plans for the worst case. Continuous batching adapts continuously.
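A back-of-the-envelope calculation makes the utilization gap concrete. Assume a batch of four requests with the varied output lengths below (the numbers are illustrative, and the idealized continuous case assumes a full queue of waiting requests):

```python
# Effective utilization of batch slots under static vs. continuous
# batching, for one illustrative batch of requests.

lengths = [5, 40, 120, 500]      # tokens each request will generate
slots = len(lengths)

# Static batching: the batch runs until the longest request finishes;
# a slot does useful work only while its own request is still generating.
static_steps = max(lengths)
static_useful = sum(lengths)
static_util = static_useful / (static_steps * slots)

# Continuous batching with a full queue: a finished request is replaced
# immediately, so every slot does useful work at every step.
continuous_util = 1.0

print(f"static utilization: {static_util:.0%}")      # 33%
print(f"continuous utilization: {continuous_util:.0%}")  # 100%
```

Two-thirds of the slot-steps are wasted in the static case, and the waste grows as output lengths become more skewed.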

## The Broader Impact

Continuous batching is now a standard feature in every modern LLM serving system. vLLM, TensorRT-LLM, and SGLang all implement some version of it, as does every major inference provider. It's considered table stakes for production serving.

The Orca paper also established the paradigm of thinking about LLM serving as a scheduling problem rather than just a compute problem. This opened the door for subsequent work on:

- **Preemptive scheduling** (pausing low-priority requests to serve high-priority ones)
- **Prefill-decode disaggregation** (running the prompt-processing phase and token-generation phase on separate hardware, since they have different scheduling characteristics)
- **Priority queues and SLO-aware scheduling** (guaranteeing latency targets for different request classes)

## How This Applies to ASIC-Based Inference

Continuous batching was designed to maximize GPU utilization by eliminating idle cycles. On inference-optimized ASICs, the scheduling problem looks different because the hardware is already designed to minimize idle time for inference workloads.

General Compute runs entirely on inference-optimized ASICs instead of NVIDIA GPUs. We implement our own scheduling optimizations, including disaggregated inference (separating prefill and decode onto dedicated hardware), on top of ASICs that are architecturally suited for high-utilization serving. The combination of hardware that wastes fewer cycles by design and software that keeps that hardware maximally busy is a big part of why we deliver lower latency and higher throughput than GPU-based providers.

[Sign up at generalcompute.com](https://generalcompute.com) and get $5 in free credit to try it out.

## Papers and References

- [Orca: A Distributed Serving System for Transformer-Based Generative Models](https://www.usenix.org/conference/osdi22/presentation/yu) (Yu et al., OSDI 2022)