Prefix Caching: Why Repeated Prompts Shouldn't Cost You Twice
Look at the prompts your application actually sends to an LLM and you will notice something embarrassing. Most of them share long stretches of identical text. The system prompt is the same on every call. The few-shot examples are the same. The retrieved documents repeat across users. The agent's tool definitions are the same. The conversation history is the same as it was a turn ago, plus a little bit at the end.
Without prefix caching, the model recomputes the KV cache for all of that shared text on every single request. You pay for the prefill, the GPU sits there crunching attention over tokens it has already processed a thousand times today, and your users wait. Prefix caching is the fix, and on workloads that have any meaningful prompt overlap, it is one of the largest practical speedups available to a serving stack.
This post walks through how prefix caching actually works, what the different implementations do differently, and the operational details that decide whether you get the full speedup or a fraction of it.
Why prefill dominates so often
To see why prefix caching matters, it helps to remember what prefill is doing. When a request comes in with N input tokens, the model runs a forward pass over all N tokens to produce the KV cache that decoding will use. The compute cost of that forward pass is roughly proportional to N for the feedforward layers and to N squared for attention.
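To put rough numbers on that, here is a back-of-the-envelope sketch using the standard approximation of about 2 FLOPs per parameter per processed token, plus an attention term. The model shape and token counts are illustrative, not taken from any particular deployment.

```python
def rough_forward_flops(n_new_tokens, n_context, n_params, n_layers, d_model):
    """Order-of-magnitude FLOPs to push n_new_tokens through a dense transformer
    that already has n_context tokens of KV cache."""
    # Matrix multiplies: ~2 FLOPs per parameter per processed token.
    matmul = 2 * n_params * n_new_tokens
    # Attention: ~4 * d_model FLOPs per (query, key) pair per layer.
    avg_keys_per_query = n_context + n_new_tokens / 2  # causal mask
    attention = 4 * n_layers * d_model * n_new_tokens * avg_keys_per_query
    return matmul + attention

# Hypothetical 8B-class model (8e9 params, 32 layers, d_model 4096),
# 8,000 input tokens and 200 output tokens -- numbers chosen for illustration.
prefill = rough_forward_flops(8_000, 0, 8e9, 32, 4096)
decode = sum(rough_forward_flops(1, 8_000 + i, 8e9, 32, 4096) for i in range(200))
print(f"prefill ~ {prefill:.2e} FLOPs, decode ~ {decode:.2e} FLOPs")
```

On those made-up numbers, prefill comes out more than an order of magnitude larger than decode, which is the shape of most of the workloads below.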
Most production prompts are heavy on input and light on output. A coding agent might send 8,000 tokens of context and expect 200 tokens back. A RAG application might send 4,000 tokens of retrieved chunks and ask for a one-paragraph answer. A voice agent sends a system prompt plus a few turns of history and gets a sentence in response. In all of these, prefill is doing more work than decode, often by an order of magnitude.
If 90% of those input tokens are identical to a previous request, recomputing them is pure waste. The KV cache that the prefill pass produces is a deterministic function of the input tokens (and the model weights, and the position embeddings, but more on that later). If you computed it once already and you still have it sitting in GPU memory, you can reuse it.
The basic idea
Prefix caching is straightforward in principle. When a request arrives, hash its prompt prefix in token-aligned blocks. For each block, check whether you already have its KV cache stored. If you do, skip the prefill for that block and link the existing cache into the request. Run prefill only on the suffix that you have not seen before.
The unit of caching is usually a block of fixed size, the same blocks that PagedAttention uses to manage KV memory. A block is typically 16 or 32 tokens. You hash the contents of the prefix up to and including each block boundary, and use that hash as a key into a cache of physical KV blocks.
Two requests share a cached block if and only if they have the same tokens up to that block boundary. Once they diverge by even a single token, the rest of their KV is different, and the cache lookup stops.
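A minimal sketch of that lookup, with a plain in-memory dictionary standing in for the real block cache (names are invented for illustration; there is no eviction or memory management here):

```python
import hashlib

BLOCK_SIZE = 16  # tokens per block

def prefix_block_hashes(token_ids: list[int]) -> list[str]:
    """One hash per full block, each covering the entire prefix up to that boundary."""
    hashes = []
    for end in range(BLOCK_SIZE, len(token_ids) + 1, BLOCK_SIZE):
        prefix = token_ids[:end]
        hashes.append(hashlib.sha256(str(prefix).encode()).hexdigest())
    return hashes

def split_cached_and_new(token_ids, kv_block_cache):
    """Return (num_cached_tokens, block_hashes); prefill starts after the last hit."""
    cached_tokens = 0
    hashes = prefix_block_hashes(token_ids)
    for i, h in enumerate(hashes):
        if h not in kv_block_cache:
            break
        cached_tokens = (i + 1) * BLOCK_SIZE
    return cached_tokens, hashes

cache = {}  # block hash -> physical KV block (an opaque handle in this toy)
tokens = list(range(100))
cached, hashes = split_cached_and_new(tokens, cache)   # first request: cached == 0
for h in hashes:
    cache[h] = object()                                # prefill runs, blocks get stored
cached, _ = split_cached_and_new(tokens, cache)        # repeat request: cached == 96
```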
The savings depend entirely on how much of your traffic shares prefixes. A naive chatbot with no system prompt and no shared history will see almost no benefit. A coding agent where every request starts with 5,000 tokens of identical instructions, tool definitions, and project context will see prefill work drop by roughly the share of the prompt those 5,000 tokens represent.
Block-level caching in vLLM
vLLM's prefix caching is the most widely deployed version of this. It builds on top of PagedAttention, which already chops the KV cache into fixed-size blocks for memory management. Each block has a hash computed from the tokens it contains plus the hash of the previous block. This makes the hash a function of the entire prefix up to that block, not just the local token contents.
When a new request comes in, vLLM walks its prompt block by block, computing the chained hash at each block boundary. For each block hash, it checks the cache. Hits get linked into the new request's block table; the first miss triggers prefill from that block onward.
Eviction is LRU on physical blocks that are not currently referenced by any active request. A block can be referenced by multiple requests at once if their prefixes match, so reference counts matter. When memory gets tight, vLLM frees the least recently used unreferenced blocks first.
The implementation cost is small. The hash computation is cheap, the lookup is a hash map probe, and the existing PagedAttention machinery already handles non-contiguous KV blocks during attention. The end result is that you get prefix sharing without changing the attention kernel.
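To make the chained hashing and refcounted eviction concrete, here is a small sketch in the same spirit. The class and method names are made up for illustration and do not correspond to vLLM's internals.

```python
import hashlib
from collections import OrderedDict

class BlockCache:
    """Toy refcounted block cache with LRU eviction of unreferenced blocks."""

    def __init__(self, capacity_blocks: int):
        self.capacity = capacity_blocks
        self.blocks = {}          # hash -> (kv_block, refcount)
        self.lru = OrderedDict()  # hashes with refcount == 0, oldest first

    @staticmethod
    def chain_hash(prev_hash: str, block_tokens: list[int]) -> str:
        # This block's tokens chained with the previous block's hash, so the key
        # depends on the entire prefix, not just the local 16 tokens.
        data = prev_hash + "|" + ",".join(map(str, block_tokens))
        return hashlib.sha256(data.encode()).hexdigest()

    def acquire(self, h: str):
        """Return the cached KV block and pin it, or None on a miss."""
        entry = self.blocks.get(h)
        if entry is None:
            return None
        kv, refs = entry
        self.blocks[h] = (kv, refs + 1)
        self.lru.pop(h, None)  # pinned blocks are not eviction candidates
        return kv

    def release(self, h: str):
        kv, refs = self.blocks[h]
        self.blocks[h] = (kv, refs - 1)
        if refs - 1 == 0:
            self.lru[h] = None  # eligible for eviction again

    def insert(self, h: str, kv_block):
        # Evict least recently used unreferenced blocks to make room; if every
        # block is pinned, a real server would stop admitting work instead.
        while len(self.blocks) >= self.capacity and self.lru:
            victim, _ = self.lru.popitem(last=False)
            del self.blocks[victim]
        self.blocks[h] = (kv_block, 1)  # pinned by the request that prefilled it
```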
RadixAttention and tree-structured caching
SGLang's RadixAttention takes the idea further. Instead of a flat hash table of blocks, it stores cached prefixes in a radix tree. The tree's edges are token sequences and its nodes hold KV cache references. Inserting a prompt walks down the tree, extending paths and splitting nodes when prompts diverge.
The tree structure makes some things cleaner. Looking up the longest matching prefix is a tree walk, which naturally finds the longest shared subpath. Eviction can be done at the granularity of tree nodes, with LRU on subtrees that are no longer referenced. And because the tree explicitly represents the relationships between prompts, you can reason about cache structure more easily when debugging.
In practice the difference between block-hashed caches and radix-tree caches is mostly about implementation taste rather than raw performance. Both achieve the same asymptotic behavior on the same workload. The tree version tends to be a slightly better fit for workloads where prompts naturally form a tree, like agent traces that fan out from a common root. The block-hashed version is simpler and integrates more directly with paged KV memory.
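For a feel of the tree-structured version, here is a stripped-down sketch. It uses a token-level trie standing in for the radix tree (a real radix tree compresses runs of single-child nodes into edges, splits nodes on divergence, and evicts whole subtrees); the names are invented for illustration.

```python
class TrieNode:
    __slots__ = ("children", "kv_block")

    def __init__(self):
        self.children = {}    # token id -> TrieNode
        self.kv_block = None  # handle to cached KV for the prefix ending here

def match_prefix(root: TrieNode, tokens: list[int]) -> int:
    """Length of the longest cached prefix of `tokens`."""
    node, matched = root, 0
    for t in tokens:
        if t not in node.children or node.children[t].kv_block is None:
            break
        node = node.children[t]
        matched += 1
    return matched

def insert(root: TrieNode, tokens: list[int], kv_blocks: list) -> None:
    """Record that KV for every prefix of `tokens` is now cached."""
    node = root
    for t, kv in zip(tokens, kv_blocks):
        node = node.children.setdefault(t, TrieNode())
        node.kv_block = kv
```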
Position embeddings and why they matter
Here is the detail that catches people. The KV cache is not just a function of the tokens. It is a function of the tokens at specific positions. If your model uses RoPE or any other position-dependent encoding, the key vectors computed for a token at position 100 are not the same as the key vectors for that same token at position 200 (RoPE rotates queries and keys; encodings that add position embeddings to the input shift the values too).
This is fine when prefixes are reused at the start of the prompt, because the positions are identical. The first 1,000 tokens of the system prompt are always at positions 0 through 999. Cache hit, position match, you are done.
It breaks when you try to reuse a cached chunk in the middle of a prompt. If you cache the KV for a paragraph at positions 500 through 999 in one request, you cannot just splice it into another request where the same paragraph appears at positions 1500 through 1999. The Q vectors at the new positions will not produce correct attention scores against K vectors computed at the old positions.
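A tiny numerical illustration of why the splice fails, using a single toy RoPE dimension pair (the frequency and vectors are made up; real models rotate many pairs per head):

```python
import math

def rope_rotate(x, y, position, freq=0.01):
    """Apply the RoPE rotation for one dimension pair at a given position."""
    angle = position * freq
    c, s = math.cos(angle), math.sin(angle)
    return x * c - y * s, x * s + y * c

key = (0.7, -0.2)                      # same token, same key projection
k_at_100 = rope_rotate(*key, position=100)
k_at_200 = rope_rotate(*key, position=200)

query = rope_rotate(0.3, 0.9, position=250)  # a later query token

# RoPE makes the score depend on the relative distance between query and key,
# so a key cached at position 100 gives the wrong score when the same text
# now actually sits at position 200.
score_cached  = query[0] * k_at_100[0] + query[1] * k_at_100[1]
score_correct = query[0] * k_at_200[0] + query[1] * k_at_200[1]
print(score_cached, score_correct)   # two different numbers
```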
The clean solutions are limited. Either you only cache prefixes (which is what prefix caching does, and why it is called that), or you use a model architecture that is robust to position shifts, or you do extra math to "shift" the cached KV to its new positions. CacheBlend and a few other research papers have explored the third option. In production, the first option is dominant because it is simple and correct.
The practical consequence is that if you want the most caching, structure your prompts so that the variable content goes at the end. System prompt first, then few-shot examples, then retrieved documents (if they are stable across requests), then the user query last. This pushes the cache boundary as far to the right as possible.
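In code, that discipline is just a matter of where you concatenate the variable parts. A hypothetical template, not tied to any framework:

```python
def build_prompt(system_prompt, few_shot_examples, documents, user_query):
    # Stable content first: identical across requests, so its KV blocks stay hot.
    stable = system_prompt + "\n\n" + "\n\n".join(few_shot_examples)
    if documents:  # only include here if the document set is stable across requests
        stable += "\n\n" + "\n\n".join(documents)
    # Variable content last: this is the only part that forces fresh prefill.
    return stable + "\n\nUser: " + user_query
```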
Where the wins actually come from
In real deployments, the workloads where prefix caching pays off the most tend to follow a pattern.
Agents with long system prompts. A coding agent might have 4,000 tokens of system prompt: tool definitions, formatting rules, examples of good and bad responses, a description of the codebase conventions. Every request the agent makes starts with that prompt. Prefix caching means the agent's per-request prefill cost is dominated by the user's actual query and the recent context, not the boilerplate.
Multi-turn chat. Each turn's prompt is the previous turn's prompt plus a new user message and a new assistant response. The KV cache for everything except the new tokens already exists. Without prefix caching, every turn does a full prefill of the entire conversation. With it, each turn prefills only the deltas.
RAG with stable document sets. If your system prompt and your retrieved documents are stable across many requests (think: a customer support bot grounded in a fixed knowledge base, or an agent operating on a single project), prefix caching keeps the KV for those documents warm.
Batch-style evaluation. Running the same prompt template over many inputs is the canonical case. The shared template is cached once, every example pays only for its variable suffix.
The wins are smaller or zero on workloads where every prompt is genuinely different. Open-ended chat with no system prompt, semantic search over user queries, or one-shot tasks with unique inputs do not benefit much.
Eviction and capacity planning
The size of the prefix cache is whatever GPU memory is left after the working set of active requests. On a busy server, this is often less than you would hope. Each active request has its own KV cache that is pinned and cannot be evicted, and the prefix cache competes with that for the same memory pool.
Tuning here is mostly about throttling. If you let your batch size grow without bound, the prefix cache shrinks until it cannot hold even the system prompt, and your hit rate collapses. A reasonable default is to leave a meaningful fraction of KV memory, say 20 to 40 percent, available for prefix caching after subtracting expected concurrent request memory.
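The arithmetic for sizing that fraction fits on a napkin. A sketch, with the model shape, pool size, and concurrency chosen purely for illustration:

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # K and V, per layer, per KV head, per head dimension, in fp16/bf16.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# Hypothetical 8B-class model with grouped-query attention.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

kv_pool_gb = 40            # KV memory left after weights and activations
concurrent_requests = 32
avg_context_tokens = 6_000

pinned = concurrent_requests * avg_context_tokens * per_token
leftover = kv_pool_gb * 1024**3 - pinned
print(f"pinned by active requests: {pinned / 1024**3:.1f} GiB")
print(f"left for the prefix cache: {leftover / 1024**3:.1f} GiB "
      f"(~{leftover / per_token:,.0f} cached tokens)")
```

If that leftover number is smaller than your shared system prompt times the number of distinct prompts you expect to keep warm, the hit rate will tell you.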
Eviction policy matters less than people expect. LRU on unreferenced blocks does fine in almost all cases. The pathological workload would be one where you cycle through more distinct prefixes than fit in cache, but that is rare in practice. The much more common failure is undersizing the cache, not picking the wrong eviction policy.
What to instrument
If you are running an LLM serving stack with prefix caching enabled, the metrics worth watching are the hit rate and the prefill token reduction (the raw count of input tokens served from cache instead of recomputed). Measure the hit rate in tokens rather than requests: a request-level hit rate can be misleading, because cache hits on short prefixes do not save much, while a single hit on a 4,000-token prefix can cut that request's prefill time in half.
If your hit rate is lower than you expected, the usual suspects are: prompts not actually being identical (whitespace differences, ordering of fields in serialized JSON, stable but not literally equal templates), variable content being placed too early in the prompt, or the cache being too small. All three are fixable once you can see them.
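The first suspect is worth attacking directly: canonicalize the parts of the prompt you serialize yourself, so that logically identical requests are also token-identical. A sketch; your serialization layer will differ.

```python
import json

def canonical_json(obj) -> str:
    # Stable key order and stable whitespace, so the same tool definitions
    # always tokenize to the same prefix.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)

def build_system_block(instructions: str, tools: list[dict]) -> str:
    # Strip trailing whitespace that templating tends to introduce; a single
    # stray space breaks the block-aligned hash for everything after it.
    return instructions.rstrip() + "\n\nTools:\n" + canonical_json(tools)
```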
Closing
Prefix caching is one of the few inference optimizations that is purely free. It does not change the model output, it does not require retraining, and it does not introduce new failure modes. On workloads with shared prefixes, which is most production workloads, it is the single highest-leverage optimization a serving stack can add.
If you are running open-source inference servers, prefix caching is on by default in recent versions of vLLM and SGLang. If you are running on a managed inference provider, it is worth confirming whether they have it enabled and whether their cache survives across your requests. Speed work tends to involve heavy lifting; this one mostly just involves checking the box.
General Compute serves models on infrastructure designed around exactly these inefficiencies. If you are tired of paying for the same tokens twice, try our API and see what your prompts look like when the prefix is already cached.