Disaggregated Prefill and Decode (Splitwise / DistServe)
Most production LLM stacks still run prefill and decode on the same GPU. A request arrives, the serving engine processes the prompt, and then it streams tokens back, all from the same set of devices. Continuous batching stitches many requests together so the GPU stays busy. This works, and it is what vLLM, SGLang, and TensorRT-LLM do by default.
The problem is that prefill and decode are not the same kind of workload. Prefill is heavy, bursty, and compute-bound. Decode is lightweight per step, memory-bandwidth-bound, and lasts for hundreds or thousands of iterations. When you batch them together on one GPU, each phase interferes with the other. Prefill steals compute that decode needs for low per-token latency. Decode holds KV cache memory that prefill wants for concurrency. You can tune the balance with chunked prefill and priority scheduling, but you are still fitting two different workloads onto one resource.
Splitwise (Microsoft, 2023) and DistServe (Peking University and UCSD, 2024) take a different approach. They split the two phases onto separate GPU pools and transfer the KV cache between them. Each pool runs the workload it is tuned for. The complication is the cache transfer, which has to be fast enough that the handoff does not add visible latency. This post walks through why disaggregation helps, how the two systems implement it, what the tradeoffs look like in practice, and when it is worth the extra plumbing.
Two Very Different Phases
Prefill runs once per request. It takes the input prompt, runs it through the model in one big forward pass, and populates the KV cache. The compute is dense matrix multiplication against the whole prompt length N. With modern GPUs and reasonable prompt sizes, prefill saturates the tensor cores. It is compute-bound, and the bottleneck is FLOPs.
Decode runs once per output token. Each step takes a single new token, computes its Q, K, and V against the cached prefix, and produces one logit distribution. The matmul shapes are tiny: batch size by hidden dim. There is no large inner dimension to keep the tensor cores fed. The bottleneck is memory bandwidth, specifically the bandwidth to load the KV cache from HBM into the attention kernel.
A rough comparison makes the asymmetry concrete. On an H100 serving a 70B-class model, prefill over a 2K-token prompt processes on the order of thousands of tokens per second of compute, because the work is dense. Decode on the same model manages only tens of tokens per second at batch size 1, because it is bandwidth-limited. Increase the batch size and decode throughput grows nearly linearly until you run out of KV cache memory, while prefill throughput barely moves because each prefill already saturates the device.
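The asymmetry can be made concrete with a back-of-envelope arithmetic-intensity calculation: how many FLOPs each phase performs per byte of weights it reads. The hidden dimension and single-projection model below are illustrative assumptions, not measurements of any particular engine.

```python
# Arithmetic intensity (FLOPs per byte of weight traffic) for one square
# projection layer, prefill vs. decode. Illustrative, not measured.

def matmul_flops(m: int, k: int, n: int) -> int:
    # A (m x k) @ B (k x n) costs 2 FLOPs per multiply-accumulate.
    return 2 * m * k * n

def intensity(tokens: int, hidden: int = 8192) -> float:
    # `tokens` rows pushed through one (hidden x hidden) FP16 weight matrix;
    # weight traffic dominates when the activation batch is small.
    flops = matmul_flops(tokens, hidden, hidden)
    weight_bytes = hidden * hidden * 2  # FP16: 2 bytes per weight
    return flops / weight_bytes

print(intensity(tokens=2048))  # prefill over a 2K prompt: 2048.0 FLOPs/byte
print(intensity(tokens=1))     # decode, one token per step: 1.0 FLOPs/byte
```

Prefill reuses each loaded weight across every prompt token, so it sits far on the compute-bound side of the roofline; decode's intensity of about one FLOP per byte leaves it pinned against memory bandwidth.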
When you colocate them, you get a scheduling problem. A burst of prefills will push compute contention into any decode requests that happen to be in flight, spiking their time-per-output-token. A large decode batch holds KV cache capacity that a fresh prefill needs. Continuous batching tries to interleave them at the iteration level, and chunked prefill (Sarathi-Serve) tries to split prefills into smaller pieces that can slot in between decode steps. Both help, but both are fundamentally working around the fact that one GPU is trying to do two different jobs.
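Chunked prefill can be sketched as a per-iteration token budget: in-flight decode tokens are admitted first, and whatever budget remains is filled with a chunk of a pending prefill. The budget value and function below are an illustrative sketch in the spirit of Sarathi-Serve, not its actual scheduler.

```python
# Token-budget iteration planner in the spirit of chunked prefill.
# The 512-token budget is an illustrative knob, not a recommended value.

def plan_iteration(decode_batch: int, pending_prefill_tokens: int,
                   token_budget: int = 512) -> tuple[int, int]:
    decode_tokens = min(decode_batch, token_budget)      # decode goes first
    prefill_chunk = min(pending_prefill_tokens,          # prefill fills the rest
                        token_budget - decode_tokens)
    return decode_tokens, prefill_chunk

print(plan_iteration(decode_batch=64, pending_prefill_tokens=4096))   # (64, 448)
print(plan_iteration(decode_batch=600, pending_prefill_tokens=4096))  # (512, 0)
```

Because each iteration carries at most `token_budget` prefill tokens, a burst of long prompts degrades in-flight decode latency gradually instead of stalling it outright.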
The Disaggregation Idea
Splitwise and DistServe both propose the same structural fix: run prefill on one set of GPUs and decode on another. A request hits a prefill node, the node computes the KV cache for the prompt, the cache is shipped over the interconnect to a decode node, and the decode node streams tokens until the request completes.
The immediate benefit is that each pool can be sized and tuned for its own workload. Prefill nodes want high compute throughput and can live with moderate memory. Decode nodes want high memory bandwidth and lots of HBM for KV cache capacity. If you have a mix of GPUs available, say H100s and older A100s, you can assign them by phase instead of by request. Even if all your GPUs are identical, you can still tune batching policies, KV cache block sizes, and scheduling knobs independently for each pool.
The second benefit is SLO separation. Latency targets for prefill (time to first token, TTFT) and decode (time per output token, TPOT) are distinct, and they tug in opposite directions. With disaggregation you can meet each one separately. Prefill nodes can run small batches to keep TTFT low. Decode nodes can run large batches to maximize throughput, because within a single decode step the per-token latency is not very sensitive to batch size until you hit the memory-bandwidth ceiling.
The cost is the KV cache transfer. For a 32K-token prompt on Llama 3 70B with GQA, the cache is around 10 GB in FP16. You do not want to move that over a slow network. Both Splitwise and DistServe assume fast GPU interconnects (NVLink within a node, InfiniBand between nodes) and pipeline the transfer so that later layers of the cache are moving while earlier layers are already being consumed by decode.
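The roughly 10 GB figure follows directly from the model shape. A sketch of the arithmetic, using the published Llama 3 70B GQA configuration (80 layers, 8 KV heads, head dim 128):

```python
# Estimate KV cache size for a GQA model. Defaults match the published
# Llama 3 70B architecture: 80 layers, 8 KV heads, head dim 128, FP16.

def kv_cache_bytes(tokens: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    # Two tensors per layer (K and V), one kv_heads * head_dim vector per token.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

gb = kv_cache_bytes(32 * 1024) / 1e9
print(f"{gb:.1f} GB")  # 10.7 GB for a 32K-token prompt
```

That works out to about 320 KB per cached token, which is also the number a decode node needs when checking whether an incoming handoff will fit.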
Splitwise
Splitwise was the first system to propose this split publicly. Patel et al. from Microsoft and the University of Washington observed in 2023 that production Azure workloads had extremely bimodal resource usage. Prefill dominated GPU compute time but accounted for only a tiny fraction of wall-clock time; decode dominated wall-clock time but used a small fraction of peak compute. Running both on the same hardware meant either overprovisioning for prefill (wasting compute during the decode phase) or underprovisioning (hurting TTFT).
Their design assigns request phases to two distinct machine pools. A prefill machine handles input processing for any request, writes the resulting KV cache into a buffer, and hands the request off. A decode machine picks up the request, ingests the cache, and generates output tokens. The handoff uses RDMA over InfiniBand to transfer the cache with minimal CPU involvement.
A key Splitwise finding is that the optimal ratio of prefill to decode GPUs depends on workload characteristics, specifically the mean prompt length and output length. Workloads with long prompts and short outputs (summarization, extraction) want more prefill capacity. Workloads with short prompts and long outputs (code generation, reasoning chains) want more decode capacity. With colocated serving, you cannot adjust the ratio. With disaggregation, you just change the GPU counts in each pool.
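Splitwise's ratio observation can be turned into a crude sizing heuristic: balance the GPU-seconds each phase consumes per request. The throughput constants below are placeholders for illustration, not figures from the paper.

```python
# Rough pool-sizing heuristic: balance GPU-time demand per request across
# the two phases. Throughput numbers are illustrative placeholders.

def pool_ratio(prompt_len: int, output_len: int,
               prefill_tok_per_s: float = 8000.0,   # assumed prefill rate per GPU
               decode_tok_per_s: float = 1500.0) -> float:  # assumed batched decode rate
    prefill_gpu_s = prompt_len / prefill_tok_per_s  # GPU-seconds of prefill per request
    decode_gpu_s = output_len / decode_tok_per_s    # GPU-seconds of decode per request
    return prefill_gpu_s / decode_gpu_s             # prefill GPUs per decode GPU

# Summarization-like: long prompt, short output -> more prefill capacity.
print(pool_ratio(prompt_len=8000, output_len=200))   # ~7.5
# Code-gen-like: short prompt, long output -> more decode capacity.
print(pool_ratio(prompt_len=500, output_len=2000))   # ~0.05
```

With colocated serving this ratio is fixed by the hardware; with disaggregation it is just the relative size of the two pools, adjustable as the traffic mix drifts.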
Splitwise also shows a cost-efficiency angle. You can use different GPU SKUs for the two phases. Decode nodes benefit from high HBM bandwidth and capacity but do not need the absolute highest FLOPs. If older GPUs have enough bandwidth for decode, you can keep them in service as decode-only nodes while newer GPUs handle prefill. This extends the useful life of a heterogeneous fleet.
DistServe
DistServe, from Zhong et al. in 2024, pushes the idea further and makes the analysis crisper. They formulate serving as a joint optimization over four variables: parallelism strategy (tensor/pipeline/replica counts) for prefill, same for decode, and batching policies for each phase. With colocated serving, you have to pick one configuration that works reasonably for both phases. With disaggregation, each phase is a separate optimization.
Their experiments show that disaggregation can hit tighter TTFT and TPOT SLOs at the same GPU count, or meet the same SLOs with fewer GPUs. The gains are largest when workload latency targets are strict. For workloads where SLOs are loose (offline batch inference, low-priority traffic), the overhead of transfer and the loss of cross-phase batching flexibility often outweigh the benefits.
DistServe also runs a careful analysis of the KV cache transfer overhead. On NVLink, the transfer for a single request can happen in parallel with the first few decode steps, effectively hiding the cost. Across nodes on InfiniBand, there is a few hundred microseconds of unavoidable latency, but for prompts where prefill itself took tens of milliseconds, this is a small addition to TTFT. The place where transfer cost starts to hurt is very short prompts with strict TTFT SLOs, where the overhead is comparable to the prefill itself. For that regime, colocated serving is probably still the right answer.
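For the long-prompt worst case, a quick division over nominal peak link bandwidths shows why the transfer is pipelined rather than sent as one blocking copy. The bandwidth figures below are nominal per-direction peaks; real links deliver less.

```python
# Naive (unpipelined) transfer time for a large KV cache over nominal peak
# link bandwidths. Real links deliver less than peak; treat as a lower bound.

def transfer_ms(cache_gb: float, link_gb_per_s: float) -> float:
    return cache_gb / link_gb_per_s * 1000

for name, bw in [("NVLink, ~450 GB/s per direction", 450),
                 ("InfiniBand NDR, ~50 GB/s", 50)]:
    print(f"{name}: {transfer_ms(10, bw):.0f} ms for a 10 GB cache")
# ~22 ms over NVLink, ~200 ms over InfiniBand as one blocking copy --
# hence per-layer pipelining that overlaps transfer with early decode steps.
```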
What Actually Changes in the Stack
Implementing disaggregation requires a few pieces that do not exist in a typical serving engine.
A shared request queue sits in front of both pools. It tags each request with its current phase and routes it accordingly. When prefill finishes, the queue re-enqueues the request for the decode pool, along with metadata about where its KV cache lives.
A cache transport layer moves the KV tensors. In practice this is built on something like NCCL, UCX, or a custom RDMA path. The transfer is typically pipelined per transformer layer, so decode can start on early layers while later layers are still moving.
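The per-layer pipelining can be sketched with plain Python queues standing in for the transport. The function names are illustrative; a real implementation would issue NCCL/UCX/RDMA transfers rather than in-process queue puts.

```python
# Sketch of per-layer KV cache pipelining: the decode side can start
# consuming early layers while later layers are still in flight.
import queue
import threading

def send_cache(layers, link: queue.Queue):
    # Prefill side: ship one layer's KV blocks at a time, then a sentinel.
    for i, kv in enumerate(layers):
        link.put((i, kv))
    link.put(None)

def receive_cache(num_layers: int, link: queue.Queue):
    # Decode side: attention over layer i may begin as soon as layer i lands.
    received = {}
    while (item := link.get()) is not None:
        i, kv = item
        received[i] = kv
    assert len(received) == num_layers
    return received

link = queue.Queue()
sender = threading.Thread(target=send_cache,
                          args=([f"kv_layer_{i}" for i in range(4)], link))
sender.start()
cache = receive_cache(4, link)
sender.join()
print(sorted(cache))  # all four layer indices arrive: [0, 1, 2, 3]
```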
The decode engine has to accept a "resume from cache" request rather than always starting from scratch. This is a small API change but it cascades through scheduling, since the decode node has to validate that the cache fits in its memory before accepting the handoff.
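The queue and handoff pieces described above can be sketched as a phase-tagged request record plus a decode-side admission check. All field and function names here are hypothetical, not any real engine's API.

```python
# Illustrative plumbing for disaggregated handoff: a phase-tagged request,
# the re-enqueue step after prefill, and the decode-side admission check.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Phase(Enum):
    PREFILL = "prefill"
    DECODE = "decode"

@dataclass
class Request:
    request_id: str
    prompt: str
    phase: Phase = Phase.PREFILL
    kv_cache_node: Optional[str] = None  # set at handoff: which node holds the cache
    kv_cache_bytes: int = 0              # checked against free memory before accepting

def complete_prefill(req: Request, node: str, cache_bytes: int) -> Request:
    # Prefill done: re-enqueue for the decode pool with cache-location metadata.
    req.phase = Phase.DECODE
    req.kv_cache_node = node
    req.kv_cache_bytes = cache_bytes
    return req

def can_accept(req: Request, free_kv_bytes: int) -> bool:
    # The decode node refuses the handoff if the cache will not fit in its pool.
    return req.phase is Phase.DECODE and req.kv_cache_bytes <= free_kv_bytes
```

The admission check is the scheduling cascade mentioned above in miniature: a resume-from-cache request is only schedulable where its cache fits.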
There are failure modes. If a decode node dies mid-generation, the request is stranded unless the cache can be re-transferred from somewhere or the prompt re-prefilled on another node. If the prefill pool is saturated but the decode pool has capacity (or vice versa), requests queue on one side while resources sit idle on the other. Good routing and autoscaling help, but heterogeneous pools are harder to operate than homogeneous ones.
When Disaggregation Pays Off
Based on the published numbers and what we see in practice:
Disaggregation helps most when prefill and decode workloads are large enough to justify separate pools, latency SLOs are strict on one or both phases, and your fleet has fast interconnect between nodes. The classic wins are latency-sensitive chat, voice agents, and coding assistants where TTFT and TPOT both matter and users notice interference when they collide.
It helps less when prompts are short and outputs are short (because prefill and decode are both cheap and the handoff overhead dominates), when you only have a handful of GPUs (because you cannot meaningfully split them), or when your workload is highly bursty and benefits from cross-phase batching flexibility.
Chunked prefill with priority scheduling, done well, closes some of the gap for colocated setups. Sarathi-Serve's approach of splitting prefill into small chunks and interleaving them with decode steps is cheaper to implement than full disaggregation and captures a meaningful fraction of the benefit. If you are not already running at scale, chunked prefill is the first thing to try.
How Serving Stacks Are Adopting This
By early 2026, disaggregation has moved from research papers to production systems. NVIDIA's Dynamo and TensorRT-LLM both ship disaggregated serving as a supported mode. vLLM has prototype support. SGLang has published disaggregation benchmarks. Most cloud inference providers operating at scale run some form of split deployment internally, even if they do not expose the split to users.
The remaining engineering complexity is real. You need good autoscalers for each pool, you need observability that tracks where requests are spending time, and you need to handle cache transfer failures gracefully. For teams serving at single-node scale, these costs still outweigh the benefits. For teams serving across dozens or hundreds of GPUs with strict latency targets, the arithmetic usually flips.
At General Compute, we care about disaggregation because it is one of the levers that makes strict latency SLOs achievable at scale. Voice agents and real-time coding assistants are the workloads where a 50ms blip in TPOT is the difference between feeling instant and feeling sluggish. The more we can isolate phases and run each on hardware tuned for its bottleneck, the tighter those SLOs get. If you are building something where inference latency is the user-visible constraint, our API is designed around this kind of serving architecture. Take a look at the docs to see how the throughput and latency numbers translate to your workload.