Agent Readout
The Attention Sink Phenomenon: Why the First Token Matters
How attention concentrates on the first few tokens of every sequence, why naive sliding-window caching breaks long-context generation, and how StreamingLLM uses sink tokens to serve effectively unbounded streams.
- Author: General Compute
- Published: 2026-05-04
- Tags: attention sinks, streamingllm, long context, kv cache, inference, transformers
If you visualize the attention weights of a decoder-only transformer halfway through a long generation, you see a strange pattern. Most of the probability mass goes where you would expect, onto the recent tokens and a handful of semantically relevant earlier tokens. But a surprisingly large share, often 30 to 50 percent, lands on the very first tokens of the sequence. The model is paying intense attention to the BOS token and the opening words of the prompt, even when those tokens have nothing to do with what is being generated right now.

This is the attention sink phenomenon. It was named and characterized by Xiao et al. in the StreamingLLM paper (2023), but anyone who has poked at attention maps on a long-running model has probably seen it. The pattern is consistent across models, across layers, and across input distributions, which makes it more than a curiosity. It is a structural property of softmax attention, and it has direct consequences for how you serve LLMs in long-context and streaming settings.

This post walks through what attention sinks are, why they exist, what breaks when you ignore them in a serving system, and how StreamingLLM uses them to enable effectively unbounded generation without retraining the model.

## The pattern in the attention maps

Take a Llama-style model and feed it a long passage. Then, during decoding, look at the attention weights from any given layer to all previous tokens. You will see three rough bands:

1. A spike on the first one to four tokens of the sequence, regardless of what those tokens contain.
2. A more diffuse band of moderate weights on the most recent tokens, the ones in the local context window.
3. Lower, scattered weights on tokens in between, with a few peaks corresponding to semantically related words.

The first band is the surprising one. The model is spending a real fraction of its attention budget on tokens that are not semantically related to the current generation step. If the prompt starts with "The following is a transcript of a customer support call," and the model is now 30,000 tokens deep into the call, those opening words still get heavy attention weight. The model is not retrieving information from them. It is using them as a sink.

The behavior is most pronounced in middle layers. Early layers attend more locally. The deepest layers also attend somewhat locally. But somewhere in the middle of the stack, you see this strong pull toward the first tokens, layer after layer.

## Why this happens

The mechanical reason is softmax. Self-attention computes attention weights as `softmax(Q K^T / sqrt(d))`, and softmax forces the weights to sum to one over the keys. The model cannot choose to attend to nothing. If there is no semantically relevant content elsewhere in the sequence, the attention head still has to put its weight somewhere. Tokens at the very start of the sequence end up serving as the default destination for "I do not need to attend anywhere specific."

The first token is visible to every position in the sequence because of causal masking, so every query can see it. During training, the model learns that putting excess attention there is harmless, since those tokens already encode generic information about the start of the input. Over time, this becomes a stable equilibrium. Heads that do not need to retrieve information at a given step learn to dump their attention onto the initial tokens. You can think of it as a pressure-relief valve.
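You can confirm the pattern on any model you can run locally. The sketch below is a rough probe, not anything from the StreamingLLM codebase: it assumes the Hugging Face `transformers` library, uses `gpt2` as a small stand-in for whatever checkpoint you actually care about, and reports how much of the final position's attention lands on the first four tokens in each layer.

```python
# Rough probe: how much attention does the last query position put on the
# first few keys, layer by layer? Assumes the Hugging Face `transformers`
# library; `gpt2` is just a small stand-in for whatever checkpoint you use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
# Eager attention so the per-head weights are actually returned.
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")
model.eval()

text = "The following is a transcript of a customer support call. " * 40
inputs = tok(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one tensor per layer, shaped [batch, heads, query, key].
n_sink = 4
for layer_idx, attn in enumerate(out.attentions):
    last_query = attn[0, :, -1, :]                  # [heads, keys]
    sink_mass = last_query[:, :n_sink].sum(dim=-1)  # per-head mass on the first keys
    print(f"layer {layer_idx:2d}: mean mass on first {n_sink} tokens = {sink_mass.mean().item():.2f}")
```

If the checkpoint behaves like the models studied in the StreamingLLM paper, the middle layers will report a noticeably large share of their mass on those first positions, even though the text here is repetitive filler.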
The softmax must sum to one, but heads do not always have meaningful work to do. The first tokens absorb the leftover probability mass.

This is why removing the first tokens is so destructive. The model's attention budget is calibrated around the assumption that those tokens are present and absorbing extra mass. If you remove them, the softmax has to redistribute that mass onto other tokens. Now the heads that were silently sinking attention into the BOS token are loudly attending to whatever else is in the window, and that injects noise into the residual stream. Quality collapses fast.

## Why sliding-window caching does not just work

The motivation for caring about attention sinks is practical. KV caches grow linearly with sequence length, and at long contexts, the cache eats most of your GPU memory. A single Llama 3 70B request at 128K tokens uses tens of gigabytes of KV cache. If you want to serve indefinitely long streaming sessions (voice agents, persistent assistants, very long documents), you eventually have to evict tokens from the cache.

The simplest eviction policy is a sliding window. Keep the last N tokens, drop everything older. This is what classical RNNs and many older transformer variants approximate. For a transformer, it would seem natural: keep a window of size 4096, and as new tokens arrive, drop the oldest one to maintain the window.

If you actually do this on a pretrained transformer at inference time, the model breaks. Perplexity climbs from a healthy single-digit number into the dozens or hundreds as soon as the window starts evicting the initial tokens. Generation degrades into incoherent text within a few hundred steps after the first eviction.

This is the attention sink at work. The moment you drop those first few tokens, every middle-layer head that was sinking attention into them has nowhere to put its excess mass. The redistribution corrupts the hidden states, and the model loses coherence.

## What StreamingLLM actually does

The StreamingLLM fix is small and almost embarrassingly simple. Keep the first few tokens, always. Then maintain a sliding window of recent tokens after that. The KV cache contains:

```
[sink_tokens (e.g., 4 tokens)] + [recent_window (e.g., 4092 tokens)]
```

The sink tokens are never evicted. The recent window slides as generation continues, dropping the oldest non-sink tokens to make room for new ones. Total cache size stays bounded. That is the entire algorithm.

The reported results are striking: with as few as four sink tokens preserved, models like Llama 2 and Pythia maintain stable perplexity over generations of more than four million tokens. Without the sink tokens, the same models collapse within thousands of steps.

A few details matter for getting this right in a real system:

**Position encoding.** The model was trained with absolute or relative positions that grow linearly with sequence length. If you naively keep the original positions, the recent window's positions can exceed what the model saw during training, and rotary embeddings (RoPE) start producing out-of-distribution values. StreamingLLM re-encodes positions within the cache: with k sink tokens and a window of size W, the sink tokens stay at positions 0..k-1, and the recent window is mapped to positions k..k+W-1, regardless of how far into the stream you are. The model only ever sees positions inside the trained range.
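Concretely, the eviction rule and the position re-mapping together amount to a few lines of bookkeeping. The sketch below tracks only which token indices live in the cache and which positions they are assigned; the class name and the tiny sizes are illustrative, not StreamingLLM's reference implementation.

```python
# Minimal sketch of StreamingLLM-style cache bookkeeping. It tracks only the
# token indices kept in the cache and the positions they are assigned; the
# class name and default sizes are illustrative, not a reference implementation.

class SinkSlidingCache:
    def __init__(self, n_sink: int = 4, window: int = 4092):
        self.n_sink = n_sink
        self.window = window
        self.kept: list[int] = []   # absolute indices of tokens still cached

    def append(self, token_index: int) -> None:
        self.kept.append(token_index)
        # Once full, evict the oldest entry *after* the pinned sink tokens.
        if len(self.kept) > self.n_sink + self.window:
            del self.kept[self.n_sink]

    def cache_positions(self) -> list[int]:
        # Positions are assigned by slot in the cache, not by absolute index,
        # so rotary embeddings never see values beyond n_sink + window - 1.
        return list(range(len(self.kept)))


cache = SinkSlidingCache(n_sink=4, window=8)   # tiny sizes to make it visible
for t in range(20):
    cache.append(t)
print(cache.kept)               # [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
print(cache.cache_positions())  # [0, 1, 2, ..., 11]
```

In a real cache, the same bookkeeping decides which key/value rows to drop and which positions to use when applying rotary embeddings; the index logic does not change.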
**Number of sink tokens.** Four is a common choice and works well in practice. One sink token works on some models but not all. The marginal benefit drops off quickly past four, and the exact right number depends on how concentrated the attention sink behavior is in the model you are using.

**What to use as sink tokens.** The original tokens of the prompt work. A small set of dummy tokens prepended at training time works better but requires retraining. For most deployments, just keeping the literal first few tokens of whatever the model saw is fine.

## Implications for serving systems

Attention sinks change a few things about how you architect a long-context inference stack.

For batched serving with paged KV caches (the vLLM / SGLang style of system), you can implement StreamingLLM as an eviction policy on top of the page table. Instead of evicting the least-recently-used pages, you mark the first few pages as pinned and evict only from the rest. This composes naturally with continuous batching.

For streaming voice and chat, the practical effect is huge. You no longer need to truncate or summarize the conversation history to keep the cache bounded. You keep the first few tokens of the system prompt, slide a window over recent turns, and let the conversation run for hours without re-ingesting context or paying for an unboundedly large KV cache. Latency stays flat instead of growing with conversation length.

For document processing, the calculus shifts a bit. If the document is a single coherent piece and you need to attend to its middle, sliding-window approaches throw away information that may matter. Sinks help with stability, not with global recall. For tasks where the model legitimately needs to retrieve information from the middle of a 200K-token document, you still want full attention over the whole context, with techniques like Ring Attention or chunked prefill carrying the load.

The clean use case for streaming-with-sinks is sequential dialogue. The model only needs the recent context plus the framing tokens at the start. That is exactly what a long voice conversation or persistent agent session looks like.

## How this interacts with other long-context techniques

StreamingLLM is not a replacement for long-context training. Models trained with longer contexts (RoPE scaling, YaRN, position interpolation) handle genuinely long single-shot inputs better than a sliding-window model can. What StreamingLLM offers is a way to keep generation stable beyond the trained context length, by ensuring the active attention pattern stays inside the distribution the model was trained on.

It also pairs naturally with prefix caching. The sink tokens are usually inside the system prompt, which is shared across requests. If you are already caching the system prompt's KV across users, you are already keeping the sink tokens warm. The streaming policy just says "and never evict that prefix from the per-request cache during long sessions."

Speculative decoding interacts cleanly too. The draft model and target model can both use sliding windows with sinks; the speculation logic does not care about cache management.

The piece this does not solve is multi-turn retrieval over very long histories. If you need to recall a fact from 100K tokens ago in a streaming session, sliding-window attention has lost that information. The usual fix is external memory: store older turns in a vector database, retrieve relevant chunks as needed, and inject them into the recent window. The KV cache stays bounded, the relevant history stays retrievable, and the attention sink keeps the model coherent.
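To make that last pattern concrete, here is a toy sketch of the archive-and-retrieve loop. The `TurnArchive` class and its keyword-overlap scoring are hypothetical placeholders for a real embedding model and vector database; only the shape of the flow is the point.

```python
# Toy sketch of external memory for a streaming session: turns that have
# fallen out of the sliding window stay searchable and can be re-injected.
# A real system would use embeddings and a vector database; this keyword
# overlap score is only a placeholder.

class TurnArchive:
    def __init__(self):
        self.turns: list[str] = []

    def add(self, text: str) -> None:
        self.turns.append(text)

    def search(self, query: str, k: int = 3) -> list[str]:
        q = set(query.lower().split())
        ranked = sorted(self.turns, key=lambda t: -len(q & set(t.lower().split())))
        return ranked[:k]


archive = TurnArchive()
archive.add("user: my order number is 48213")
archive.add("user: please ship it to the Berlin office instead")

# Later in the stream, long after these turns were evicted from the KV cache:
query = "what was my order number again?"
recovered = archive.search(query, k=1)
print(recovered)  # ['user: my order number is 48213'] -> inject into the recent window
```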
## Why this is worth understanding

The attention sink is a good example of behavior that emerges from architectural details (softmax summing to one, causal masking exposing the first tokens to everyone) rather than from anything explicit in the training objective. Understanding why it exists is what lets you design serving systems that work at long context lengths instead of collapsing.

If you have ever wondered why your long-running chat session went off the rails after a certain point, or why a "just keep the last N tokens" cache eviction strategy ruined generation quality, the attention sink is a likely culprit. The fix is mechanical, costs almost nothing in compute, and makes streaming inference behave the way you would naively expect it to.

At General Compute, fast inference is not just about FLOPs per token. It is about keeping the system stable across the kinds of long-running, high-throughput workloads that real applications produce: voice agents that stay alive through hour-long calls, coding assistants that hold a session open across many interactions, and customer-facing chat that does not get worse the longer it runs. Sink-aware cache management is one of the small architectural choices that lets that happen.

If you are building long-running agents or streaming applications and want low-latency, sink-aware inference out of the box, take a look at the [General Compute API](https://generalcompute.com). The same OpenAI-compatible interface, with the cache policies that keep your sessions stable.