What attention sinks are
Large language models pour a surprising share of every token's attention onto the first few tokens of the sequence — typically the BOS token and a handful after it — no matter whether those tokens carry any relevant meaning. These positions act as attention sinks: somewhere to park attention that has nowhere better to go. Spotting and preserving them is what lets a model stream indefinitely without its quality collapsing.
This page assumes you know self-attention, the KV cache, and sliding-window attention.
Why attention sinks form
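Attention weights pass through a softmax, so for every query position $i$ the weights $\alpha_{ij}$ over the visible keys $j \le i$ must sum to 1:

$$\sum_{j \le i} \alpha_{ij} = 1$$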
That constraint is the whole story. When a token has nothing it especially needs to look at, it still has to spend a full unit of attention somewhere. The model learns to dump that excess onto tokens that are always available and positionally fixed — and the earliest tokens fit perfectly:
- They are visible to every later token (causal masking never hides them).
- Their position never shifts, so they make a stable anchor.
- Every training sequence had them, so the model comes to lean on them.
The result is a token whose content is irrelevant but whose role — absorbing leftover attention mass — is essential.
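A toy calculation makes the sum-to-one constraint concrete. The sketch below is plain Python with made-up logits: the value 4.0 for the sink and 0.0 for every other key are illustrative assumptions, not numbers taken from any model. It only shows that one key with a higher score can soak up most of the unit of attention, leaving the other keys nearly untouched.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention logits for one query over 6 visible keys.
# Index 0 plays the role of the sink; the other keys are equally
# uninteresting to this query.
scores = [4.0, 0.0, 0.0, 0.0, 0.0, 0.0]
weights = softmax(scores)

total_mass = sum(weights)      # always 1, whatever the scores are
sink_share = weights[0]        # fraction absorbed by the sink
other_share = weights[1]       # fraction landing on each other key

print(f"total={total_mass:.6f} sink={sink_share:.3f} other={other_share:.3f}")
```

With these assumed scores the sink takes more than 90% of the mass, which is the qualitative pattern the section describes.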
Why naive streaming breaks
To run on an unbounded stream with bounded memory, the obvious move is a sliding context window: keep the most recent N tokens and evict the oldest. But the oldest tokens are the sinks. Evict them and the attention mass that used to land there has nowhere to go; it is forced onto tokens that were never meant to receive it, the learned distribution breaks, and generation degenerates.
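The redistribution step can be illustrated with the same kind of toy softmax. In this sketch the logits are invented for illustration, and dropping index 0 stands in for evicting the sink from the cache. It shows only the renormalization effect within a single attention row; the actual degradation in a real model also involves every layer and head receiving inputs it was not trained on.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits; index 0 is the sink, the rest are content tokens.
logits = [4.0, 0.1, -0.2, 0.0, 0.3, -0.1]

before = softmax(logits)        # sink still in the cache
after = softmax(logits[1:])     # sink evicted: softmax renormalizes

# How much each surviving token's weight grows once the sink is gone.
inflation = [a / b for a, b in zip(after, before[1:])]

# Every surviving token is inflated by the same factor, 1 / (1 - sink weight).
expected_factor = 1.0 / (1.0 - before[0])
print([round(x, 2) for x in inflation], round(expected_factor, 2))
```

Because the weights must still sum to 1, the mass the sink used to hold is spread over the remaining tokens, each of which now receives several times the attention it got before.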
The damage is dramatic and, for that run, permanent. These figures are representative of the StreamingLLM results, not measured here:
| Method | PPL@2K | PPL@8K | PPL@16K | Memory |
|---|---|---|---|---|
| Full cache | 10.2 | 10.3 | 10.4 | O(n) |
| Window, no sinks | 10.5 | 63.5 | 450+ | O(w) |
| Window + 4 sinks | 10.3 | 10.9 | 11.2 | O(w) |
Same memory budget; the only difference is whether the first four tokens survive eviction.
The fix: keep a few sink tokens
StreamingLLM's fix is almost embarrassingly small: keep the first ~4 tokens permanently, and slide the window over everything after them.
[ sink tokens 0-3 ] + [ sliding window: most recent N-4 tokens ]
The sinks go on absorbing the excess attention, so the learned distribution stays intact; the window supplies recent content. Memory stays fixed at the cache size no matter how long the stream runs — a 100K-token stream with a 1K cache costs the same as the first 1K tokens, roughly a 99% saving versus caching everything.
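The eviction policy above can be sketched in a few lines. This is a minimal illustration rather than the StreamingLLM implementation: the class name `SinkWindowCache`, its parameters, and the use of integers in place of real key/value tensors are all assumptions made for the example.

```python
from collections import deque

class SinkWindowCache:
    """Toy cache: keeps the first `num_sinks` entries forever, plus a
    sliding window of the most recent entries that follow them."""

    def __init__(self, cache_size, num_sinks=4):
        if cache_size <= num_sinks:
            raise ValueError("cache_size must exceed num_sinks")
        self.num_sinks = num_sinks
        self.sinks = []
        # A bounded deque silently drops its oldest item when full.
        self.window = deque(maxlen=cache_size - num_sinks)

    def append(self, entry):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(entry)   # the first tokens become permanent sinks
        else:
            self.window.append(entry)  # everything later slides

    def entries(self):
        return self.sinks + list(self.window)

    def __len__(self):
        return len(self.sinks) + len(self.window)

# Stream 20 tokens through an 8-slot cache; each entry is the token's
# original position, standing in for its cached keys and values.
cache = SinkWindowCache(cache_size=8, num_sinks=4)
for original_position in range(20):
    cache.append(original_position)
kept = cache.entries()
print(kept)

# Memory arithmetic from the text, using round numbers for 100K and 1K.
stream_length, budget = 100_000, 1_000
saving = 1 - budget / stream_length
```

The cache never grows past `cache_size`, so its memory cost is independent of how many tokens have streamed through it.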
Position handling
Evicting middle tokens leaves gaps in the original sequence — but StreamingLLM does not carry those gapped positions into the cache. It assigns positions within the cache, contiguously, so the model never sees a position index beyond the range it was trained on.
For example, if the cache holds tokens whose original positions are [0, 1, 2, 3, 6, 7, 8], the attention layer is fed positions [0, 1, 2, 3, 4, 5, 6]. The four sink tokens keep positions 0–3; the three window tokens are renumbered 4–6 to follow immediately after, with no gap.
This is why rotary embeddings (RoPE) and ALiBi keep working: they encode the relative distance between cached tokens. Because the cache positions are contiguous, those distances stay small and in-distribution — exactly the regime the model trained in. Reusing the original gapped positions would do the opposite, forcing huge cross-gap distances the model never saw.
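The renumbering can be sketched as follows. The helper names `cache_positions` and `max_relative_distance` are hypothetical, and the original positions are invented to mimic a stream of about 10,000 tokens held in an 8-slot cache; a real implementation would apply the resulting indices when computing RoPE rotations or ALiBi biases.

```python
def cache_positions(original_positions):
    """Assign contiguous positions by cache slot, ignoring original indices."""
    return list(range(len(original_positions)))

def max_relative_distance(positions):
    """Largest query-key distance any cached pair can produce."""
    return max(positions) - min(positions)

# Hypothetical cache contents: 4 sinks plus the 4 most recent tokens
# of a stream that has reached position 9,999.
original = [0, 1, 2, 3, 9_996, 9_997, 9_998, 9_999]
assigned = cache_positions(original)

gapped_distance = max_relative_distance(original)      # if original indices were reused
contiguous_distance = max_relative_distance(assigned)  # with in-cache positions

print(assigned, gapped_distance, contiguous_distance)
```

With in-cache positions the largest distance is bounded by the cache size, whereas reusing the original indices lets it grow with the length of the stream.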
Further reading
- Efficient Streaming Language Models with Attention Sinks — Xiao et al., 2023 (StreamingLLM)
