KV Cache Compression: MLA and Beyond
The KV cache is the single largest variable cost in transformer inference. Model weights are fixed. The KV cache grows with every token of context, for every request currently in flight. On a serving node, it is usually the KV cache, not the weights, that decides how many concurrent users you can handle and how long a context you can support.
DeepSeek's Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2 and refined in V3, attacks this problem at the architecture level. Instead of caching the key and value tensors directly, MLA caches a low-rank projection of them and reconstructs the per-head keys and values on the fly during attention. The cache footprint drops by roughly an order of magnitude compared to standard multi-head attention, and benchmark quality stays essentially unchanged. MLA is the reason DeepSeek can serve very long contexts at competitive cost.
MLA is not the only way to compress the KV cache, and it is worth understanding how it fits with the other techniques the field has developed. This post walks through MLA in detail, compares it to MQA and GQA, and then covers the main alternatives: quantized KV caches, eviction policies like H2O and StreamingLLM, and runtime factorization.
Why the KV Cache Is the Bottleneck
During autoregressive decoding, every new token needs to attend to every previous token. The naive approach would recompute K and V for the whole prefix at each step, which is O(N^2) in compute for a context of length N. The standard optimization is to cache the keys and values after the first forward pass and reuse them. New tokens only compute their own K and V and append them to the cache.
The per-request size of this cache, for a model with L layers, h heads, head dimension d_h, and sequence length N, is:
kv_cache_bytes = 2 * L * h * d_h * N * sizeof(dtype)
For Llama 3 70B with 80 layers, 8 KV heads (it uses GQA), head dim 128, FP16, and a 32K context, that is 2 * 80 * 8 * 128 * 32000 * 2 = 10.5 GB per request. For a vanilla multi-head model with 64 heads instead of 8, the same calculation gives 84 GB per request. That is why nobody ships vanilla MHA at large scale anymore.
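The arithmetic is easy to get wrong by a factor of 2 (K and V) or by forgetting that GQA counts KV heads, not query heads. A minimal calculator for the formula above, using the Llama 3 numbers:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # Factor of 2 covers both K and V; dtype_bytes=2 assumes FP16/BF16.
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# Llama 3 70B with GQA: 80 layers, 8 KV heads, head dim 128, 32K context
print(kv_cache_bytes(80, 8, 128, 32_000) / 1e9)    # ~10.5 GB per request
# The same geometry with 64 vanilla MHA KV heads
print(kv_cache_bytes(80, 64, 128, 32_000) / 1e9)   # ~83.9 GB per request
```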
Everything that improves inference throughput (continuous batching, PagedAttention, prefix caching) has to reckon with the fact that concurrent users share a limited HBM budget, and the KV cache is how that budget gets spent. Shrinking it directly increases how many requests fit on a GPU.
MQA and GQA: the First Round of Compression
Multi-Query Attention (MQA), from Noam Shazeer in 2019, noticed that the Q side of attention needs to be per-head to preserve expressivity, but K and V do not. MQA ties all query heads to a single shared K and V, dropping the KV cache by a factor of h. The tradeoff is a quality regression that shows up clearly on harder benchmarks.
Grouped-Query Attention (GQA), from Ainslie et al. in 2023, is the compromise most modern models use. It groups query heads into g groups and uses g KV heads, one per group. Llama 3 uses 64 query heads and 8 KV heads, so g=8, which means an 8x cache reduction versus MHA. GQA preserves most of MHA's quality and has become the default.
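The head sharing is easy to see in code. A toy sketch of GQA's KV expansion at attention time (shapes and score logits only, no softmax), with Llama-3-like head counts and random stand-in tensors:

```python
import numpy as np

rng = np.random.default_rng(0)
h_q, h_kv, d_h, t = 64, 8, 128, 16         # 64 query heads, 8 KV heads, 16 cached tokens

q = rng.standard_normal((h_q, d_h))        # one query per head at the current step
k = rng.standard_normal((h_kv, t, d_h))    # cached keys, one set per KV head

# Each group of h_q // h_kv = 8 query heads shares one KV head.
k_expanded = np.repeat(k, h_q // h_kv, axis=0)     # (h_q, t, d_h)
scores = np.einsum("hd,htd->ht", q, k_expanded)    # per-head attention logits

# Only the h_kv heads are cached, not the expanded h_q: an 8x reduction here.
print(scores.shape, k_expanded.nbytes // k.nbytes)
```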
The limitation of both MQA and GQA is that they directly reduce the number of KV heads. You can only push this so far before quality collapses. You also still cache full-precision K and V tensors for every retained head, so the cache still scales linearly with context length and with the number of KV heads you keep.
Multi-Head Latent Attention
MLA takes a different path. Instead of reducing the number of heads, it keeps all heads but caches a compressed representation of K and V, then reconstructs the full per-head tensors inside the attention computation using a stored projection. The high-level structure looks like this:
- Compute a shared low-rank latent vector c_kv from the hidden state. This latent has dimension d_c, much smaller than h * d_h.
- Cache only c_kv for each token.
- At attention time, project c_kv up to the full per-head K and V using two matrices W^UK and W^UV that are learned and shared across positions.
Concretely, for each token:
c_kv = x @ W^DKV # shape (d_c,)
K = c_kv @ W^UK # shape (h, d_h)
V = c_kv @ W^UV # shape (h, d_h)
The per-token cache is just c_kv, which is d_c * sizeof(dtype) bytes per layer. For DeepSeek-V2, d_c is 512, while comparable MHA would cache h * d_h = 16384 scalars each for K and V, so the cache shrinks by roughly 64x at equal precision.
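Here is the same computation as a runnable sketch with DeepSeek-V2-like dimensions (d_model = 5120 is assumed for illustration; the weights are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, h, d_h, d_c = 5120, 128, 128, 512           # DeepSeek-V2-like sizes

W_dkv = rng.standard_normal((d_model, d_c)) * 0.01   # down-projection
W_uk  = rng.standard_normal((d_c, h * d_h)) * 0.01   # up-projection for K
W_uv  = rng.standard_normal((d_c, h * d_h)) * 0.01   # up-projection for V

x = rng.standard_normal(d_model)      # hidden state for one token
c_kv = x @ W_dkv                      # (d_c,) -- this is ALL that gets cached
K = (c_kv @ W_uk).reshape(h, d_h)     # reconstructed per-head keys
V = (c_kv @ W_uv).reshape(h, d_h)     # reconstructed per-head values

# Cached scalars per token per layer: d_c, vs 2*h*d_h for MHA's K and V
print(c_kv.size, 2 * h * d_h, (2 * h * d_h) // c_kv.size)  # 512 32768 64
```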
There is a clever reformulation that makes this efficient. You do not actually need to materialize the full K matrix at each step. The attention score for query head i at position t is:
score_i = Q_i^t @ (K_i^{1..t})^T
        = (Q_i^t @ (W^UK_i)^T) @ (c_kv^{1..t})^T
Because W^UK is fixed, you can absorb it into the query projection. The attention computation turns into a product between a modified query and the cached c_kv, with no need to expand to per-head K. The same trick works for V: you absorb W^UV into the output projection. The result is that MLA's attention kernel operates directly on the compressed latent dimension, which also reduces memory bandwidth during decode, where the attention step is memory-bound.
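The absorption identity is easy to verify numerically. A sketch with small made-up dimensions, checking that attending over the cached latents with an absorbed query gives the same scores as expanding K first:

```python
import numpy as np

rng = np.random.default_rng(1)
t, d_c, d_h = 10, 64, 32

W_uk = rng.standard_normal((d_c, d_h))   # up-projection for one head
C = rng.standard_normal((t, d_c))        # cached latents c_kv for positions 1..t
q = rng.standard_normal(d_h)             # query for head i at the current step

# Naive: expand the latents to per-position keys, then score.
K = C @ W_uk                             # (t, d_h)
scores_naive = q @ K.T                   # (t,)

# Absorbed: fold W_uk into the query once, attend over the latents directly.
q_absorbed = q @ W_uk.T                  # (d_c,)
scores_absorbed = q_absorbed @ C.T       # (t,)

assert np.allclose(scores_naive, scores_absorbed)
```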
The RoPE Problem and the Decoupled Fix
There is a subtlety that the simple version of MLA does not handle. Rotary Position Embedding (RoPE) applies a position-dependent rotation to Q and K before the attention score. If K is reconstructed from a compressed latent using a fixed matrix, the rotation has to be applied after reconstruction, and it is different for every position. The absorption trick above stops working, because W^UK can no longer be pulled into the query projection cleanly once it is interleaved with a position-dependent rotation.
DeepSeek's fix is what they call decoupled RoPE. They split each head into two parts: a non-positional part of dimension d_h^nope, reconstructed from the latent as above, and a positional part of dimension d_h^rope that is cached separately after applying RoPE. The positional part is shared across heads, much like MQA, so the extra cache cost is small. The attention score is the sum of a score on the latent part (where the matrix absorption trick works) and a score on the RoPE part (which uses a small shared cache).
This is not conceptually elegant, but it is the engineering compromise that makes MLA work with RoPE, and RoPE is non-negotiable for long-context quality. DeepSeek-V2 uses d_h^nope = 128 and d_h^rope = 64. The total cache per token is d_c + d_h^rope per layer, which for their setup is 512 + 64 = 576 scalars, compared to 2 * h * d_h = 2 * 128 * 128 = 32768 for vanilla MHA's K and V together. That is roughly a 57x reduction.
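Putting the two parts together, the per-head score during decode is the sum of two dot products over two separately cached tensors. A sketch with DeepSeek-V2's dimensions (the queries here are random stand-ins for the already-absorbed content query and the already-rotated positional query):

```python
import numpy as np

rng = np.random.default_rng(2)
t, d_c, d_rope = 10, 512, 64

C      = rng.standard_normal((t, d_c))      # cached latents (content part)
k_rope = rng.standard_normal((t, d_rope))   # cached RoPE keys, shared across heads

# For one head at the current step:
q_c = rng.standard_normal(d_c)              # content query, W^UK already absorbed
q_r = rng.standard_normal(d_rope)           # positional query, RoPE already applied

scores = q_c @ C.T + q_r @ k_rope.T         # (t,) -- sum of the two score terms

# Cache per token per layer: d_c + d_rope scalars
print(d_c + d_rope)  # 576
```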
How MLA Compares in Practice
DeepSeek-V2 reports that MLA matches MHA on standard benchmarks while cutting the KV cache substantially. The key numbers from the paper:
- KV cache per token: reduced by 93.3% relative to DeepSeek 67B at equivalent context length, i.e. about 6.7% of that model's cache.
- Generation throughput: roughly 5.76x that of DeepSeek 67B (a dense model with comparable parameter count).
- Training cost: competitive with dense baselines, since the extra projections are modest.
For DeepSeek-V3, which is a 671B parameter MoE with 37B active parameters, MLA is what makes long-context serving economically viable. Without it, the cache would dominate memory even on 8xH100 nodes.
Retrofitting MLA into an existing model is not trivial. The latent dimension and the RoPE decoupling have to be baked into the architecture and trained from scratch, or at least with heavy finetuning. You cannot drop MLA into a Llama 3 checkpoint. This is a real barrier to adoption for teams that already have trained models they want to keep serving, and it is why most of the open ecosystem is still on GQA.
Quantized KV Caches
A complementary line of work compresses the KV cache at runtime, without changing the architecture. The simplest version stores K and V in INT8 or INT4 instead of FP16. KIVI (Liu et al., 2024) showed that you can quantize keys per-channel and values per-token down to 2 bits with minimal quality loss, using asymmetric quantization with per-group scales. That is an 8x reduction on top of whatever architectural compression you already have.
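A toy version of per-channel asymmetric quantization for the key cache (real KIVI packs 2-bit codes and uses per-group scales within channels; this sketch stores one scale and zero point per channel and keeps codes in uint8 for clarity):

```python
import numpy as np

def quantize_per_channel(K, bits=2):
    # Asymmetric quantization over the token axis; one (scale, zero) per channel.
    lo, hi = K.min(axis=0), K.max(axis=0)
    scale = (hi - lo) / (2**bits - 1)
    scale = np.where(scale == 0, 1.0, scale)            # guard constant channels
    q = np.round((K - lo) / scale).astype(np.uint8)     # 2-bit codes, stored in uint8
    return q, scale, lo

def dequantize(q, scale, lo):
    return q * scale + lo

rng = np.random.default_rng(0)
K = rng.standard_normal((256, 64)).astype(np.float32)   # (tokens, channels)
q, scale, lo = quantize_per_channel(K, bits=2)
K_hat = dequantize(q, scale, lo)
print("max abs error:", float(np.abs(K - K_hat).max()))
```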
Quantized caches are attractive because they are orthogonal to MLA, MQA, or GQA. You can quantize an MLA cache just as easily as a GQA cache. The cost is extra compute at attention time, since you have to dequantize on the fly, and the implementation has to be careful about kernel performance. vLLM and SGLang both ship INT8 KV cache options, and the quality regression is small enough to be acceptable for most workloads.
FP8 KV caches, which Hopper GPUs support natively, are becoming common for production serving. They give 2x compression versus FP16 with essentially no quality impact and no dequantization overhead, since the attention kernels can operate directly on FP8.
Eviction: H2O and StreamingLLM
A different approach asks whether you need to cache every token at all. H2O (Heavy Hitter Oracle) from Zhang et al. (2023) observes that attention is highly skewed in practice. A small subset of tokens, the "heavy hitters," attract most of the attention mass across layers, and the rest can be evicted with little effect on output quality. H2O keeps a fixed-size cache that evicts based on accumulated attention scores.
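A toy version of the eviction policy: keep a recent window, then fill the remaining budget with the positions that have accumulated the most attention mass (real H2O tracks this per head and per layer):

```python
import numpy as np

def h2o_evict(attn_mass, window, budget):
    # Keep the most recent `window` positions, plus enough "heavy hitters"
    # (ranked by accumulated attention received) to fill `budget` slots.
    t = len(attn_mass)
    recent = set(range(max(0, t - window), t))
    ranked = np.argsort(-np.asarray(attn_mass))          # heaviest first
    heavy = [int(i) for i in ranked if i not in recent][: budget - len(recent)]
    return sorted(recent | set(heavy))

acc = np.array([9.0, 0.1, 0.2, 5.0, 0.1, 0.3, 0.1, 0.2])  # attention mass per position
keep = h2o_evict(acc, window=2, budget=4)
print(keep)  # [0, 3, 6, 7]: 0 and 3 are heavy hitters, 6 and 7 are recent
```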
StreamingLLM (Xiao et al., 2023) goes further and identifies an "attention sink" effect: the first few tokens of a sequence receive disproportionate attention, regardless of content, because of how softmax normalizes. Keeping those initial tokens plus a sliding window of recent tokens allows models to generate indefinitely without quality collapse, even when the true context far exceeds their training-time window.
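The retention rule itself is trivial, which is part of StreamingLLM's appeal. A sketch, with sink and window sizes chosen for illustration:

```python
def streaming_keep(seq_len, n_sink=4, window=1020):
    # Indices retained by a StreamingLLM-style cache: the first n_sink tokens
    # (attention sinks) plus a sliding window of the most recent tokens.
    if seq_len <= n_sink + window:
        return list(range(seq_len))
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))

keep = streaming_keep(100_000, n_sink=4, window=1020)
print(len(keep), keep[:4], keep[-1])  # 1024 [0, 1, 2, 3] 99999
```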
These methods are useful when you cannot afford to keep the full cache and are willing to accept some quality degradation on long-range dependencies. They compose with everything else: you can run MLA plus INT4 quantization plus H2O eviction, and multiply the compression ratios.
Runtime Low-Rank Factorization
There is a middle ground between architectural changes (MLA) and runtime compression (quantization, eviction): compress the cached K and V with a learned or SVD-based low-rank factorization applied after training. Methods like LESS and EVA fit small projection matrices that map cached K and V to a lower-rank subspace, and store the factors instead of the full tensors. Quality is not as good as MLA, which is trained end-to-end with the compression in place, but these methods can be applied to existing checkpoints.
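A plain SVD sketch of the storage layout these methods use (LESS and similar methods fit learned projections rather than a per-tensor SVD; the rank and shapes here are arbitrary, and the random K makes no claim about reconstruction quality):

```python
import numpy as np

rng = np.random.default_rng(0)
t, d, r = 512, 128, 32                     # tokens, head dim, target rank

K = rng.standard_normal((t, d)) @ rng.standard_normal((d, d)) * 0.1

# Post-hoc factorization: store (t, r) coefficients plus an (r, d) basis.
U, S, Vt = np.linalg.svd(K, full_matrices=False)
coeffs = U[:, :r] * S[:r]                  # (t, r)  -- grows with context
basis = Vt[:r]                             # (r, d)  -- fixed per layer/head
K_approx = coeffs @ basis

stored = coeffs.size + basis.size
print(stored, K.size, K.size / stored)     # ~3.2x compression at this rank
```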
What Actually Matters for Serving
If you are choosing a model to serve at scale, the practical picture looks like this:
- Training a new model: MLA is the strongest option if you care about long-context serving economics. It is harder to implement than GQA and requires careful attention to the RoPE decoupling, but the cache savings are decisive.
- Serving an existing GQA model: FP8 or INT8 KV cache quantization is the first lever. It is well-supported in modern serving stacks and the quality hit is negligible.
- Serving under extreme memory pressure: Stack quantization with eviction policies like StreamingLLM if your workload tolerates some accuracy loss on very long dependencies.
- Prefill-heavy workloads: Prefix caching and sharing across requests (what SGLang's RadixAttention does) matter more than compression, because the cache is populated once and reused many times.
The KV cache is a shared resource across requests. Every byte you cut out of a single request's cache is a byte you can spend on another concurrent user. MLA is remarkable because it attacks the problem at the source, in the architecture, rather than layering on compression after the fact. For inference providers, the arithmetic compounds quickly: an 8x smaller cache means 8x more concurrent users per GPU, or 8x longer context at the same concurrency, or some mix of both.
At General Compute, we care about this because inference speed and concurrency are what let our customers build real-time applications on top of our API. Architectures like MLA, combined with the right runtime stack, are what make million-token contexts a product feature rather than a benchmark stunt. If you are building a voice agent or a coding agent and bumping into KV cache limits, come talk to us about the tradeoffs. The landscape has moved fast over the last year, and what was infeasible in early 2024 is now routine.