Agent Readout

PagedAttention and vLLM: Virtual Memory for LLM Serving

The PagedAttention paper solved the biggest memory waste problem in LLM serving by borrowing an idea from operating systems. Here's how it works and why vLLM became the default serving framework.

Author
General Compute
Published
2026-03-22
Tags
inference, papers, deep-dive

Markdown body


Before PagedAttention, LLM serving systems wasted 60-80% of the memory they set aside for the KV cache (the per-request memory that stores the model's "working memory" of the conversation so far). The cache had to be allocated as a single contiguous block when a request came in, and since you don't know in advance how long a response will be, systems allocated for the maximum possible length, leaving most of that memory unused.

The vLLM team at UC Berkeley looked at this problem and recognized it was the same problem that operating systems solved decades ago with virtual memory and paging.

## The KV Cache Problem

During autoregressive generation (where the model produces one token at a time), each new token needs to attend to (look back at) all previous tokens. The key and value tensors for those previous tokens are cached in GPU memory so they don't need to be recomputed every time. This stored state is called the KV cache, and it grows linearly with the length of the conversation.
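The loop above can be sketched in a few lines. This is a toy single-head attention step with made-up dimensions, not any real model's code; the point is only that the cache is appended to once per step and every step attends over everything cached so far:

```python
import numpy as np

def attend(q, K, V):
    # Scaled dot-product attention of one query against all cached keys/values.
    scores = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
d = 8                      # head dimension (toy size)
K_cache, V_cache = [], []  # the KV cache: one entry per token seen so far

for step in range(5):      # autoregressive loop: one token per step
    k, v, q = rng.normal(size=(3, d))
    K_cache.append(k)      # the cache grows linearly with sequence length
    V_cache.append(v)
    out = attend(q, np.stack(K_cache), np.stack(V_cache))

assert len(K_cache) == 5   # after 5 steps, 5 cached key vectors
```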

For a model like Llama 2 13B with a maximum context of 4096 tokens, the KV cache for a single request can require over 3GB of GPU memory (roughly 800KB per token). On a 40GB A100 GPU, the FP16 model weights alone occupy about 26GB, leaving around 14GB for KV cache, enough to fully reserve space for only a handful of concurrent requests if each one claims its maximum allocation.
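The arithmetic follows the formula in the vLLM paper: 2 (keys and values) × layers × hidden size × bytes per value, per token. The config values below are the commonly cited ones for a 13B-class model (40 layers, hidden size 5120, FP16) and are an assumption, not taken from the post:

```python
# Per-token KV cache size = 2 (K and V) * layers * hidden_size * bytes_per_value.
# Assumed 13B-class config: 40 layers, hidden size 5120, FP16 (2 bytes).
layers, hidden, dtype_bytes = 40, 5120, 2
max_context = 4096

per_token = 2 * layers * hidden * dtype_bytes   # bytes per cached token
per_request = per_token * max_context           # full max-context reservation

print(per_token)             # 819200 bytes, ~0.8 MB per token
print(per_request / 2**30)   # ~3.1 GiB per fully reserved request
```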

The waste comes from pre-allocation. If a request only generates 100 tokens, the remaining 3996 tokens' worth of allocated KV cache memory sits empty. Across many concurrent requests, this internal fragmentation (allocated but unused memory) eats up the majority of available GPU memory.

Before PagedAttention, the two options were: allocate conservatively and waste memory (limiting how many requests you can serve at once), or allocate tightly and risk running out of space mid-generation (causing requests to fail).

## How PagedAttention Works

PagedAttention borrows directly from how operating systems manage virtual memory. If you've taken an OS class, the concept will feel familiar. Instead of allocating one big contiguous block per request, the KV cache is divided into fixed-size blocks called pages, typically holding the KV data for 16 tokens each.

The key ideas:

**Non-contiguous storage.** A request's KV cache doesn't need to be in a single contiguous chunk of memory. It's stored across pages that can be scattered anywhere in GPU memory, linked together by a page table (a lookup structure that maps logical positions to physical locations, just like how your operating system manages RAM).

**Allocate on demand.** Pages are only allocated as new tokens are generated. A request that produces 100 tokens uses pages for those 100 tokens, not the maximum context length. No more over-allocation.

**Memory sharing.** When multiple requests share the same prompt prefix (this is common when many users have the same system prompt), they can share the same physical KV cache pages. Only pages that diverge between requests need separate storage. This is similar to copy-on-write in operating systems.
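The three ideas above can be sketched as a toy block manager. Everything here (class names, sizes, the reference-counting scheme) is illustrative, not vLLM's actual implementation, but it shows page tables, on-demand allocation, and copy-on-write sharing working together:

```python
BLOCK_TOKENS = 16                      # tokens per page, as in vLLM's default

class BlockManager:
    """Toy paged KV-cache allocator: per-request page tables, on-demand
    allocation, and copy-on-write sharing via reference counts."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # free physical block ids
        self.refcount = {}                   # physical block id -> ref count
        self.tables = {}                     # request id -> list of block ids

    def append_token(self, req, token_index):
        # Allocate a physical block only when a page boundary is crossed.
        if token_index % BLOCK_TOKENS == 0:
            block = self.free.pop()
            self.refcount[block] = 1
            self.tables.setdefault(req, []).append(block)

    def fork(self, parent, child):
        # Share the parent's pages instead of copying them.
        self.tables[child] = list(self.tables[parent])
        for block in self.tables[child]:
            self.refcount[block] += 1

    def write_block(self, req, logical_idx):
        # Copy-on-write: before modifying a shared page, take a private copy.
        block = self.tables[req][logical_idx]
        if self.refcount[block] > 1:
            self.refcount[block] -= 1
            new = self.free.pop()
            self.refcount[new] = 1
            self.tables[req][logical_idx] = new

mgr = BlockManager(num_blocks=8)
for i in range(40):                    # request "a" generates 40 tokens
    mgr.append_token("a", i)
assert len(mgr.tables["a"]) == 3       # ceil(40 / 16) = 3 pages, not a max reservation

mgr.fork("a", "b")                     # "b" shares all 3 pages with "a"
mgr.write_block("b", 2)                # "b" diverges on its last page only
assert mgr.tables["a"][:2] == mgr.tables["b"][:2]
assert mgr.tables["a"][2] != mgr.tables["b"][2]
```

Note how a 40-token request holds three 16-token pages rather than a full max-context reservation, and how the forked request pays for new memory only where it diverges.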

The results from the paper: KV cache memory waste dropped from 60-80% to under 4%. This translated directly into 2-4x higher serving throughput, because many more concurrent requests fit in the same GPU memory.

## vLLM: The Serving System Built on PagedAttention

The authors didn't just publish a paper. They built vLLM, an open-source serving engine with PagedAttention at its core. It quickly became the most widely used LLM serving framework in the industry.

Beyond PagedAttention, vLLM includes:

- **Continuous batching.** New requests can join an in-progress batch at any iteration, so the GPU never sits idle waiting for a slow request to finish (this technique originated in the Orca paper, which we cover in a separate post).
- **Prefix caching.** Automatic detection and reuse of shared prompt prefixes across requests, so the model doesn't redo work it's already done.
- **Speculative decoding.** Built-in support for using a smaller "draft" model to speed up generation from a larger model.
- **Tensor parallelism.** Splitting a model across multiple GPUs so you can serve models that don't fit on a single card.
- **Quantization support.** GPTQ, AWQ, FP8, and other formats that shrink model weights to use less memory and run faster.
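As a rough sketch of how several of these features surface in practice, recent vLLM versions expose them as flags on the OpenAI-compatible server (the model name is illustrative, and flag names can change between releases, so check the docs for your version):

```shell
# Serve a model with tensor parallelism across 2 GPUs, automatic prefix
# caching, and AWQ-quantized weights (all values illustrative).
vllm serve meta-llama/Llama-2-13b-chat-hf \
    --tensor-parallel-size 2 \
    --enable-prefix-caching \
    --quantization awq
```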

vLLM's adoption was rapid because it solved a practical problem that every LLM deployment was hitting. Before vLLM, teams were writing custom serving code or using NVIDIA's FasterTransformer (which predated many of these optimizations). vLLM made it possible to serve models at 2-4x higher throughput with the same hardware, just by being smarter about memory.

## The Broader Impact

PagedAttention changed how people think about LLM serving infrastructure. The realization that memory management, not just compute, was the primary bottleneck opened up a wave of follow-on work.

SGLang's RadixAttention took the prefix-sharing idea further with a radix tree data structure for more granular cache reuse. Disaggregated inference (running the prompt-processing phase and the token-generation phase on separate hardware) became practical partly because PagedAttention made memory management flexible enough to support it. And the core question of "how many requests can I serve at once" shifted from being a GPU compute question to a GPU memory management question.
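The prefix-reuse idea behind RadixAttention can be sketched with a per-token trie (a simplification: RadixAttention uses a true radix tree over token sequences, and the actual system evicts and manages GPU pages; this toy only counts how much of a new prompt's prefix is already cached):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # token id -> TrieNode
        self.hits = 0        # how many requests reused this cached token

def insert(root, tokens):
    """Walk the tree along `tokens`, counting tokens already cached."""
    node, reused = root, 0
    for t in tokens:
        if t in node.children:
            reused += 1
            node.children[t].hits += 1
        else:
            node.children[t] = TrieNode()
        node = node.children[t]
    return reused

root = TrieNode()
insert(root, [1, 2, 3, 4])            # first request: nothing cached yet
assert insert(root, [1, 2, 9]) == 2   # second request reuses the [1, 2] prefix
```

A shared system prompt would appear as a long common path near the root, reused by every request that starts with it.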

The paper also showed something important about inference optimization: sometimes the biggest wins come not from making the math faster, but from eliminating waste in how memory and resources are managed around the math.

## Why Custom Hardware Goes Further

PagedAttention is a clever software solution to a real hardware limitation. GPUs allocate memory in a general-purpose way, and LLM serving workloads have unusual memory access patterns that don't map well to how GPUs were designed to work. The paging system adds overhead (page table lookups, non-contiguous memory access patterns) that wouldn't be necessary if the hardware understood the workload natively.

At General Compute, we run entirely on inference-optimized ASICs instead of NVIDIA GPUs. These chips handle memory allocation and KV cache management at the hardware level. The memory fragmentation problem that PagedAttention solves in software is addressed architecturally. There's no page table overhead, no fragmentation, and no gap between allocated and used memory. This is one of the reasons we can serve more concurrent requests at lower latency than GPU-based systems running vLLM.

If you want to see what LLM serving looks like without GPU memory constraints, [sign up at generalcompute.com](https://generalcompute.com) and get $5 in free credit to try it out.

## Papers and References

- [Efficient Memory Management for Large Language Model Serving with PagedAttention](https://arxiv.org/abs/2309.06180) (Kwon et al., SOSP 2023)
- [vLLM Blog Post](https://blog.vllm.ai/2023/06/20/vllm.html) (UC Berkeley, 2023)