Agent Readout

SGLang and RadixAttention: Smarter KV Cache Reuse

SGLang's RadixAttention stores KV cache in a radix tree, enabling automatic prefix sharing across requests. The result is up to 5x higher throughput for multi-turn and structured workloads.

Author
General Compute
Published
2026-03-24
Tags
inference, papers, deep-dive

Markdown body


When you send a request to an LLM API, the server computes the KV cache (the model's working memory) for your entire prompt from scratch. If your next request shares the same system prompt, the server computes that part again. If ten users have the same system prompt, it gets computed ten times.

This is a huge amount of redundant work. In multi-turn conversations, RAG pipelines (where you retrieve documents and include them in the prompt), and few-shot prompting (where you include examples in every request), the majority of the prompt is identical across requests. Recomputing the KV cache for shared prefixes wastes both time and GPU compute.

SGLang's RadixAttention solves this by storing KV cache in a radix tree data structure that automatically detects and reuses shared prefixes across requests.

## The Prefix Sharing Opportunity

Consider a few common patterns:

**Multi-turn chat.** Each message in a conversation shares the entire history of previous messages. Turn 5 of a conversation has the same prefix (turns 1-4) as any other request continuing that conversation.

**System prompts.** Most API deployments use the same system prompt for every request. If your system prompt is 500 tokens, that's 500 tokens of redundant KV cache computation for every single request.

**Few-shot prompting.** If you include 5 examples in every request, those examples are identical across all requests and could share the same KV cache.

**RAG with common documents.** When multiple users ask questions about the same retrieved document, the document's KV cache could be computed once and shared.

In all these cases, you're paying the full prefill cost (the compute-intensive phase of inference) for work that has already been done.
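Some back-of-the-envelope arithmetic (with made-up but representative numbers, not measurements) shows how much prefill is redundant when only the tail of each prompt differs:

```python
# Hypothetical workload: every request shares a 500-token prefix
# (system prompt + few-shot examples) and adds 100 unique tokens.
shared_prefix_tokens = 500
unique_tokens = 100
requests = 1_000

without_sharing = requests * (shared_prefix_tokens + unique_tokens)
with_sharing = shared_prefix_tokens + requests * unique_tokens

saved = 1 - with_sharing / without_sharing
print(f"prefill tokens, no sharing:   {without_sharing:,}")   # 600,000
print(f"prefill tokens, with sharing: {with_sharing:,}")      # 100,500
print(f"fraction of prefill saved:    {saved:.0%}")           # 83%
```

Under these assumptions, more than four-fifths of all prefill work is recomputation of identical prefixes.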

## How RadixAttention Works

A radix tree (also called a Patricia trie) is a data structure that stores strings by their shared prefixes. If you insert "hello world" and "hello there", the tree stores "hello " once and branches at the point where the strings diverge.

SGLang applies this to KV cache management. Each node in the radix tree stores a segment of KV cache corresponding to a sequence of tokens. When a new request arrives:

1. The server tokenizes the prompt and walks the radix tree, following matching token sequences.
2. At the point where the tree and the new request diverge, all the KV cache up to that point is reused. No recomputation needed.
3. Only the new, unmatched portion of the prompt goes through prefill.
4. After the request completes, the new KV cache segments are inserted into the tree for future reuse.

The tree uses LRU eviction (least recently used entries get dropped first) when GPU memory is full, so popular prefixes stay cached while rare ones are cleaned up automatically.
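The match/insert/evict loop above can be sketched with a per-token trie. This is a toy, not SGLang's implementation: the real radix tree merges runs of tokens into multi-token edges and manages actual GPU memory blocks, and the `kv_handle` strings here merely stand in for those blocks.

```python
import time

class TrieNode:
    """One node per token. A real radix tree merges single-child chains
    into multi-token edges; a plain trie keeps this sketch short."""
    def __init__(self):
        self.children = {}     # token id -> TrieNode
        self.kv_handle = None  # stand-in for a block of cached KV data
        self.last_used = 0.0

class PrefixCache:
    def __init__(self):
        self.root = TrieNode()

    def match(self, tokens):
        """Walk the tree; return how many leading tokens hit the cache."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None or child.kv_handle is None:
                break
            child.last_used = time.monotonic()
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens, kv_handles):
        """After prefill, store the new KV segments for future reuse."""
        node = self.root
        for tok, kv in zip(tokens, kv_handles):
            child = node.children.setdefault(tok, TrieNode())
            child.kv_handle = kv
            child.last_used = time.monotonic()
            node = child

    def evict_lru_leaf(self):
        """Drop the least recently used leaf when memory fills up; interior
        nodes stay because live prefixes still pass through them."""
        leaves = []
        def walk(node):
            for tok, child in node.children.items():
                if child.children:
                    walk(child)
                else:
                    leaves.append((child.last_used, node, tok))
        walk(self.root)
        if leaves:
            _, parent, tok = min(leaves, key=lambda leaf: leaf[0])
            del parent.children[tok]

cache = PrefixCache()
system_prompt = [101, 102, 103]                # toy token ids
cache.insert(system_prompt + [7], ["kv"] * 4)  # first request fills the cache
print(cache.match(system_prompt + [8, 9]))     # -> 3: the shared prefix is reused
```

Evicting only leaves mirrors the real constraint: a cached prefix can't be dropped while longer cached sequences still depend on it.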

## Cache-Aware Scheduling

SGLang also introduces cache-aware scheduling, which reorders requests in the queue to maximize cache hit rates. If the server has a batch of waiting requests and some of them share prefixes with currently cached KV data, those requests get prioritized.

This sounds like a small optimization, but it matters a lot in practice. Without cache-aware scheduling, the server might process requests in FIFO order (first in, first out), evicting cached prefixes before other requests that could have used them arrive. With it, the server batches related requests together and keeps useful cache entries warm.
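The core idea can be sketched in a few lines (hypothetical helper names; SGLang's real scheduler also has to balance batch size, fairness, and memory pressure):

```python
def cached_prefix_len(tokens, cached_prefixes):
    """Longest leading match between a request and any cached prefix."""
    best = 0
    for prefix in cached_prefixes:
        n = 0
        for a, b in zip(tokens, prefix):
            if a != b:
                break
            n += 1
        best = max(best, n)
    return best

def cache_aware_order(queue, cached_prefixes):
    """Serve requests with the most reusable cache first, not FIFO order.
    Python's sort is stable, so ties keep their arrival order."""
    return sorted(queue,
                  key=lambda toks: cached_prefix_len(toks, cached_prefixes),
                  reverse=True)

cached = [[1, 2, 3, 4]]                    # one prefix already in cache
queue = [[9, 9], [1, 2, 3, 5], [1, 2, 8]]  # arrival (FIFO) order
print(cache_aware_order(queue, cached))
# -> [[1, 2, 3, 5], [1, 2, 8], [9, 9]]
```

Running the two prefix-sharing requests before the unrelated one means the cached entry is still resident when they execute, instead of being evicted in between.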

## Structured Language Model Programs

Beyond caching, SGLang also provides a programming model for structured LLM interactions. Instead of making individual API calls, you write programs that describe multi-step LLM workflows:

```python
# Assumes SGLang's frontend DSL; these names come from the sglang package.
from sglang import function, system, user, assistant, gen

@function
def multi_step_qa(s, question):
    s += system("You are a helpful assistant.")
    s += user(question)
    s += assistant(gen("answer", max_tokens=256))
    # The second turn extends the same state `s`, so it can reuse
    # the KV cache built for the first turn.
    s += user("Can you elaborate on that?")
    s += assistant(gen("elaboration", max_tokens=512))
```

The serving system sees the entire program structure and can optimize accordingly: pre-allocating cache for the expected conversation flow, scheduling both generation steps together, and reusing the cache from the first turn for the second.

## Results

SGLang achieves up to 5x higher throughput than baseline serving systems on workloads with prefix sharing opportunities. The improvement is highest for:

- Multi-turn conversations: 3-5x improvement (long shared prefixes)
- Few-shot prompting: 2-4x improvement (identical example prefixes)
- Tree-structured generation (like beam search): 2-3x improvement (shared prefix branches)

Even for single-turn workloads without obvious prefix sharing, SGLang performs comparably to vLLM because the radix tree adds minimal overhead when there's nothing to cache.

## How This Fits in Our Stack

Prefix caching is one of those optimizations that becomes more valuable as inference gets faster. When prefill is slow, saving a few hundred milliseconds of redundant computation is nice but not transformative. When prefill is already fast (as it is on inference-optimized ASICs), the savings from prefix caching represent a larger fraction of the total request time, and you can serve proportionally more requests with the freed-up compute.

General Compute is the only neocloud built entirely on inference-optimized ASICs. We implement our own KV cache management and prefix sharing on top of hardware that's already fast at prefill. The combination means that requests with shared prefixes, which includes most production workloads, see compounding speed benefits.

[Sign up at generalcompute.com](https://generalcompute.com) and get $5 in free credit to try it out.

## Papers and References

- [SGLang: Efficient Execution of Structured Language Model Programs](https://arxiv.org/abs/2312.07104) (Zheng et al., NeurIPS 2024)
- [SGLang Blog Post](https://lmsys.org/blog/2024-01-17-sglang/) (LMSYS, 2024)