S3: Scheduling for Straggler Mitigation in LLM Serving
If you watch the per-request latency distribution of an LLM serving system over a long enough window, you start to see a pattern that does not look like a normal distribution. The median is fine. The p90 is roughly what you would expect from your model and hardware. The p99 is several times the median, and the p99.9 is sometimes ten times worse. Most of those tail samples are not the unlucky requests that hit a cold cache or a noisy neighbor. They are the requests that sat behind a much longer one, sharing a batch slot, waiting their turn to make progress.
Stragglers are the requests that run substantially longer than the typical request in the batch. In LLM serving, the most common reason a request becomes a straggler is that its output turns out to be much longer than the outputs of the other requests sharing the batch. The model decides to keep generating, the engine cannot evict it without losing work, and every request admitted against the same memory budget starts paying for that decision. Output length is the variable that drives most of the tail in production traces.
S3 (Jin et al., 2023) was one of the first papers to formalize this and to build a serving system around output length prediction. The idea is straightforward: train a small classifier that predicts the eventual output length of a request based on the prompt, use that prediction to pack the batch more tightly, and reschedule when predictions turn out to be wrong. The full title is "S^3: Increasing GPU Utilization during Generative Inference for Higher Throughput," and while the headline metric is throughput, the underlying mechanism is straggler mitigation. By predicting which requests will run long, the scheduler can avoid the worst kinds of co-location decisions.
This post walks through what straggling looks like in practice, how S3 reduces it, and how the broader serving ecosystem has built on the same observation since 2023.
Where Stragglers Come From
A modern serving engine processes many requests concurrently. With continuous batching (Orca), requests join and leave the batch at iteration boundaries, so a finished request can release its slot to a waiting one without draining the whole batch. That works well when requests have similar runtimes. It works less well when one request keeps generating for ten thousand tokens while its batchmates finish in two hundred.
The problem is not the long request itself. Long generations are valid, often important, and unavoidable. The problem is what happens to the other requests that were admitted into the batch with the implicit assumption that everyone would finish in roughly the same time.
The KV cache is the immediate constraint. When a request is admitted to the batch, the engine reserves space for its KV cache. If the engine sized that reservation to the typical request, a request that runs much longer overflows the reservation and forces an eviction or a swap. If the engine sized the reservation conservatively to the worst case, it can admit far fewer requests and throughput collapses. Either way, the long request distorts the choices made for everyone else.
Memory pressure is one symptom. The other is iteration time. With padded batching, attention tensors are sized to the longest active context, so a batch of N decodes runs at the pace of its largest request: one request at 8K context and four at 1K run almost as slowly as five requests at 8K. Paged attention removes the padding, but the long request's KV reads still dominate the batch's memory traffic. Either way, the short requests pay a memory bandwidth tax for a context they do not need.
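To make the memory side concrete, here is the back-of-envelope KV cache arithmetic for a Llama-7B-scale model. The layer count, head dimensions, and fp16 assumption are illustrative choices, not numbers from the paper:

```python
# Back-of-envelope KV cache sizing. The config below is an assumed
# Llama-7B-like shape, purely for illustration.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   dtype_bytes=2):
    """Bytes of KV cache for one request: K and V tensors per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# One request at 8K context holds twice the KV memory of four
# 1K-context batchmates combined (~4.3 GB vs ~2.1 GB at fp16).
long_req = kv_cache_bytes(8192)
short_reqs = 4 * kv_cache_bytes(1024)
```

At this scale, a single hidden long-runner can quietly consume the reservation the scheduler thought would cover several short requests.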
Stragglers also distort fairness. If your scheduling policy is FIFO, a long request that arrived earlier will hold its slot for a long time and delay all later arrivals. If your policy is shortest-job-first, you need to know the job length, which is exactly what is unknown for autoregressive generation. Most production systems sit somewhere between these two and accept that some long requests will block some short ones.
What S3 Predicts and Why
S3's central move is to predict output length up front. The model used for prediction is a small classifier (a few-layer transformer, in the original paper) that takes the prompt as input and outputs a length bucket. The buckets are coarse: short, medium, long, and so on, rather than exact token counts. Coarse predictions are easier to learn, easier to calibrate, and good enough for the scheduling decisions that follow.
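The paper's predictor is a trained model; as a stand-in, here is what the coarse-bucket interface looks like with a toy heuristic in place of the classifier. The bucket boundaries and prompt cues are assumptions for illustration, not S3's values:

```python
# Coarse length buckets plus a stand-in predictor. S3's actual predictor
# is a small trained transformer; the heuristic here is illustrative only.

BUCKETS = [            # (name, max predicted output tokens) -- assumed
    ("short", 128),
    ("medium", 512),
    ("long", 4096),
]

def bucket_of(n_tokens):
    """Map an output length to its coarse bucket name."""
    for name, upper in BUCKETS:
        if n_tokens <= upper:
            return name
    return BUCKETS[-1][0]

def predict_bucket(prompt):
    """Toy stand-in: guess from surface prompt cues. A real system
    trains a classifier on (prompt, observed output length) pairs."""
    if "yes or no" in prompt or "one word" in prompt:
        return "short"
    if "step by step" in prompt or "essay" in prompt:
        return "long"
    return "medium"
```

The point of the coarse interface is that everything downstream (admission, grouping, misprediction handling) only needs a bucket name, so the predictor can be swapped without touching the scheduler.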
The prediction is then used in two places.
First, batch admission. When a new request arrives, the scheduler uses the predicted length to estimate the KV cache footprint of the request over its lifetime. If admitting it would push the projected memory usage past safe limits during the predicted generation window, the request is delayed or routed to a different batch whose existing requests it is more compatible with.
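A minimal sketch of that admission check, assuming per-bucket output-length estimates and a flat per-token KV cost of 512 KB (roughly a 7B model at fp16); all numbers are illustrative:

```python
# Length-aware admission sketch. Estimates and capacities are assumed
# values, not the paper's; the shape of the check is what matters.

BUCKET_EST_TOKENS = {"short": 128, "medium": 512, "long": 4096}

def estimate_kv_bytes(prompt_len, bucket, bytes_per_token=512 * 1024):
    """Projected peak KV footprint: prompt plus predicted output length."""
    return (prompt_len + BUCKET_EST_TOKENS[bucket]) * bytes_per_token

def can_admit(request, batch, capacity_bytes):
    """Admit only if the batch's projected peak memory stays in budget."""
    projected = estimate_kv_bytes(request["prompt_len"], request["bucket"])
    in_flight = sum(estimate_kv_bytes(r["prompt_len"], r["bucket"])
                    for r in batch)
    return in_flight + projected <= capacity_bytes
```

The key difference from prediction-free admission is that the budget is checked against each request's projected peak, not its current footprint, so a "long" prediction reserves room the request has not consumed yet.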
Second, batch composition. Rather than mixing long and short requests indiscriminately, S3 groups requests with similar predicted lengths into the same batch when possible. This reduces the variance within a batch. Memory reservations are tighter because the engine can size them to the predicted length instead of a worst-case envelope. Iteration cost is more predictable because the active context lengths are similar.
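One minimal way to sketch that grouping is a queue per predicted bucket, with each batch drawn from a single queue. S3's real packing logic is more involved; this only illustrates the bucket-homogeneous idea:

```python
from collections import defaultdict, deque

# Bucket-homogeneous batch composition sketch. Queue-per-bucket with
# drain-the-deepest is one simple policy, assumed here for illustration.

class BucketScheduler:
    def __init__(self, max_batch_size=8):
        self.queues = defaultdict(deque)
        self.max_batch_size = max_batch_size

    def enqueue(self, request_id, bucket):
        self.queues[bucket].append(request_id)

    def next_batch(self):
        """Drain the deepest queue so batchmates share a predicted length."""
        if not any(self.queues.values()):
            return []
        bucket = max(self.queues, key=lambda b: len(self.queues[b]))
        q = self.queues[bucket]
        return [q.popleft() for _ in range(min(self.max_batch_size, len(q)))]
```

Because every batch comes from one bucket, the engine can size memory reservations to that bucket's upper bound instead of the global worst case.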
The result is higher utilization without the tail-latency cost of mismatched batches. The paper reports several-fold throughput improvements on workloads where output length variance is high, which matches what you see in real traffic from chat and code completion.
Handling Mispredictions
A length predictor is wrong sometimes. The interesting question is what the system does when that happens.
S3 treats a misprediction as an event the scheduler reacts to. If a request was bucketed as short but is still generating well past the short threshold, the engine has options. It can keep the request in its current batch and accept the disruption. It can evict the request and put it back in the queue, paying the cost of recomputing or transferring its KV cache. Or it can move it to a "long" batch with other long-runners, where its continued generation does not harm shorter neighbors.
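The decision logic for those three options might look something like this sketch; the thresholds, pressure metric, and action names are assumptions for illustration, not the paper's policy:

```python
# Misprediction reaction sketch: keep, evict-and-requeue, or move to a
# "long" batch. Bucket limits and the 0.8 / 2x thresholds are assumed.

BUCKET_LIMIT = {"short": 128, "medium": 512, "long": 4096}

def on_overrun(request, batch_pressure):
    """Decide what to do with a request that outran its predicted bucket.
    batch_pressure: fraction of KV capacity currently in use (0.0-1.0)."""
    if request["generated"] <= BUCKET_LIMIT[request["bucket"]]:
        return "keep"                # no overrun yet
    if batch_pressure < 0.8:
        return "keep"                # memory is fine; tolerate the straggler
    if request["generated"] > 2 * BUCKET_LIMIT[request["bucket"]]:
        return "move_to_long_batch"  # rebucket with other long-runners
    return "evict_and_requeue"       # reclaim memory, pay recompute on resume
```

The ordering matters: eviction is only worth its recompute cost when memory is actually tight, so the cheap "keep" path handles the common case of a mild overrun.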
The paper's evaluation shows that the cost of mispredictions is real but bounded. The classifier is correct often enough that the throughput gains from accurate predictions far outweigh the eviction costs of the misses. The exact crossover depends on workload mix, but roughly, you want the classifier accuracy on coarse buckets to be above 70 percent for the system to be worth running.
Calibration matters more than raw accuracy. A classifier that is confidently wrong does more damage than one that hedges. S3 uses class probability outputs rather than hard predictions where possible, so the scheduler can make decisions like "place in medium batch unless confidence in the long bucket is above some threshold."
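A confidence-aware placement rule over the classifier's class probabilities could look like this; the 0.6 and 0.5 thresholds and the fall-back-to-medium policy are made-up values for illustration:

```python
# Confidence-thresholded placement sketch. Thresholds are assumptions;
# the point is that the scheduler consumes probabilities, not argmax.

def place(probs, long_threshold=0.6):
    """probs: dict mapping bucket name -> classifier probability."""
    if probs.get("long", 0.0) >= long_threshold:
        return "long"      # confident long: isolate with other long-runners
    top = max(probs, key=probs.get)
    if probs[top] < 0.5:
        return "medium"    # low confidence: hedge toward the middle bucket
    return top
```

Asymmetric thresholds encode the asymmetric cost: wrongly placing a short request in a long batch wastes a little reservation, while a hidden long-runner in a short batch forces an eviction.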
Other Forms of Straggler Mitigation
S3 is one approach. The space of straggler mitigation in LLM serving has grown since 2023, and S3 belongs to a broader family of scheduling techniques worth knowing.
Preemption-based scheduling lets the engine pause a request mid-generation, reclaim its KV cache, and resume it later. vLLM's swap mechanism is an instance: the KV cache for a paused request is swapped to CPU memory and brought back when the request is rescheduled. This makes long requests less disruptive because the engine can preempt them when they start to crowd out shorter ones. The cost is the swap bandwidth and the latency hit on the preempted request.
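A toy version of the preemption decision: once memory crosses a high-water mark, pick the request holding the most KV cache as the swap victim. The threshold and victim-selection rule are assumptions; vLLM's actual policy differs in detail:

```python
# Preemption victim sketch. The 0.9 high-water mark and largest-holder
# selection are illustrative assumptions, not vLLM's implementation.

def pick_victim(batch, used_bytes, capacity_bytes, high_water=0.9):
    """Return the request to swap out, or None if pressure is low."""
    if used_bytes < high_water * capacity_bytes:
        return None
    return max(batch, key=lambda r: r["kv_bytes"])
```

Picking the largest holder frees the most memory per preemption, at the cost of penalizing exactly the requests that have already run the longest.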
Priority queues with admission control are a simpler approach. Requests are tagged with a priority class on arrival (often based on user tier or use case), and the scheduler admits high-priority requests preferentially. This does not predict anything; it just lets operators express which workloads should win when there is contention. It is widely deployed and complements length prediction rather than competing with it.
Output length budgets, set by the client, are another lever. If the client commits to a max_tokens of 200 instead of leaving it at the model maximum, the engine has a hard upper bound it can use for memory planning. Many production deployments enforce or strongly encourage clients to set these budgets, and the effect on tail latency is usually large. This is sometimes called explicit length prediction, in contrast to S3's implicit prediction.
Speculative scheduling and rollback is a related technique seen in more recent systems. The engine batches optimistically (assuming requests will not exceed certain thresholds) and rolls back the work for any request that violates the assumption. This is useful when the cost of being wrong is low but the gain from being right is high.
Sarathi-Serve's chunked prefill addresses a different straggler: the long prefill that blocks decode iterations. The mechanism is unrelated to output length prediction, but the goal (keep one long thing from punishing many short things) is the same. In production, chunked prefill and output-length-aware scheduling are usually deployed together. They handle different sources of variance.
Tail Latency Under Output Length Prediction
The metric that improves the most under S3-style scheduling is p99 latency for short requests. Without prediction, a short request that lands in a batch with several hidden long-runners pays for their generation time by getting fewer effective batch iterations per second. With prediction, the short request is steered to a batch of similar requests, and the long-runners are grouped elsewhere where their cost is paid by other long-runners.
The p99 for long requests does not improve much, because by definition they have to do more work. What does change is that long requests stop interfering with short ones, so the system's overall fairness improves even though the distribution still has heavy tails.
A practical effect: SLO planning becomes much easier. If you can promise different SLOs to different request classes (sub-second TTFT and sub-50ms TPOT for chat, looser numbers for batch summarization), you need a scheduler that can keep those classes from contaminating each other. Length prediction is one of the ways to enforce that separation.
Where Length Prediction Falls Short
Output length is not the only source of straggling. A request that starts a tool-call loop, a request that gets stuck in a degenerate repetition, or a request whose stop conditions are never satisfied can all run unexpectedly long for reasons no classifier can predict from the prompt alone. For these cases, the engine needs runtime safety nets: hard token limits, repetition detection, and timeout policies. S3 addresses the predictable variance, not the pathological cases.
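Those safety nets are simple to sketch; the token limit, timeout, and repetition window below are illustrative defaults, not recommendations:

```python
import time

# Runtime safety-net sketch: hard token limit, wall-clock timeout, and
# exact-repeat detection. All limits here are assumed example values.

def should_stop(tokens, started_at, max_tokens=8192, timeout_s=120.0,
                repeat_window=32):
    """Return a stop reason string, or None to keep generating."""
    if len(tokens) >= max_tokens:
        return "token_limit"
    if time.monotonic() - started_at > timeout_s:
        return "timeout"
    w = repeat_window
    if len(tokens) >= 2 * w and tokens[-w:] == tokens[-2 * w:-w]:
        return "repetition"   # last window exactly repeats the one before
    return None
```

Exact-window matching only catches perfect loops; production repetition detectors usually work on n-gram statistics instead, but the check above shows where the hook sits in the decode loop.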
Length prediction also has worse coverage on instruction-tuned models that were trained to produce long structured outputs (reports, code blocks, multi-step explanations) where the prompt carries little signal about the eventual length. On those workloads, the classifier often defaults to "medium" with low confidence, and the scheduler falls back to its default policy. This is fine; it just means the prediction does not buy as much.
Finally, prediction adds latency. The classifier itself runs on every incoming request. The S3 paper's classifier was small enough that this cost was negligible compared to the prefill it precedes. Larger or more accurate predictors might shift the tradeoff. Most production deployments use very small predictors so the prediction cost stays in the low milliseconds.
How Production Systems Have Adopted the Idea
By 2026, output-length-aware scheduling is common in serving stacks that handle high-variance traffic. Most implementations are not direct ports of S3 but follow the same recipe: a lightweight predictor, coarse buckets, batch grouping, and a misprediction policy. vLLM has experimental length-prediction support in some forks and contrib modules. SGLang's scheduler accepts external length hints when the application can supply them. TensorRT-LLM exposes per-request priority and budget hints that can be driven by an upstream predictor.
Many production deployments do not run a learned predictor at all and instead rely on user-supplied max_tokens to do the same job. This works well when the client population is well behaved. It works poorly when many clients leave max_tokens at the default, which is one reason serving providers often add a server-side classifier as a safety net even when the API exposes the budget knob.
The broader takeaway from S3 is that scheduling decisions in LLM serving are not one-shot. They depend on quantities (output length, KV cache footprint, time-to-completion) that you do not know until generation finishes. Any system that improves the estimate of those quantities up front gives the scheduler more room to make good decisions, and the gains compound across all the other techniques in the stack.
Closing
Stragglers in LLM serving are mostly a symptom of one thing: not knowing how long a request will run. Continuous batching, paged attention, and chunked prefill all assume the engine has reasonable estimates of the work each request will do. When those estimates are wrong, the techniques still work but their throughput and tail-latency benefits narrow. S3's contribution was to take the prediction problem seriously and show that even a small, coarse-grained classifier moves the numbers a lot.
At General Compute, fair scheduling under high-variance workloads is one of the things we tune carefully, because voice agents, coding assistants, and multi-tenant inference all exhibit exactly the kind of length distribution where stragglers become expensive. If you are serving a workload where some requests run an order of magnitude longer than others, the techniques in this post (length prediction, priority classes, preemption, chunked prefill) all stack and all help. The docs have the latency and throughput numbers if you want to see how the scheduling choices map to your traffic profile.