
Attention Sinks: Stable Streaming LLMs

Summary
Learn about attention sinks, where LLMs concentrate attention on initial tokens, and how preserving them enables streaming inference.

What attention sinks are

Large language models pour a surprising share of every token's attention onto the first few tokens of the sequence — typically the BOS token and a handful after it — no matter whether those tokens carry any relevant meaning. These positions act as attention sinks: somewhere to park attention that has nowhere better to go. Spotting and preserving them is what lets a model stream indefinitely without its quality collapsing.

This page assumes you know self-attention, the KV cache, and sliding-window attention.


Why attention sinks form

Attention weights pass through a softmax, so for every query they must sum to 1:

$$\sum_{j=1}^{n} \text{softmax}(s_i)_j = 1$$

That constraint is the whole story. When a token has nothing it especially needs to look at, it still has to spend a full unit of attention somewhere. The model learns to dump that excess onto tokens that are always available and positionally fixed — and the earliest tokens fit perfectly:

  • They are visible to every later token (causal masking never hides them).
  • Their position never shifts, so they make a stable anchor.
  • Every training sequence had them, so the model comes to lean on them.

The result is a token whose content is irrelevant but whose role — absorbing leftover attention mass — is essential.
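A minimal sketch of the sum-to-one constraint, using only the Python standard library. The score values are made up for illustration and do not come from any real model. It shows that a query with no preference still spends a full unit of attention, and that one high-scoring first token soaks up almost all of it.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention scores for one query (illustrative values only).
# Index 0 stands in for the first token; the rest are content tokens
# that the query has no particular reason to attend to.
scores_no_preference = [-3.0, -3.0, -3.0, -3.0, -3.0]
scores_with_sink = [4.0, -3.0, -3.0, -3.0, -3.0]

flat = softmax(scores_no_preference)
sink = softmax(scores_with_sink)

# Low scores everywhere still produce weights that sum to 1,
# spread evenly over tokens the query does not care about.
print("no preference:", [round(w, 3) for w in flat], "sum =", round(sum(flat), 6))

# With a sink, nearly all of the mass lands on index 0 and the
# content tokens are left almost untouched.
print("with sink:    ", [round(w, 3) for w in sink], "sum =", round(sum(sink), 6))
```

In the second case the content tokens each receive well under 1% of the attention, which is the "parking" behaviour described above.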

Why naive streaming breaks

To run on an unbounded stream with bounded memory, the obvious move is a sliding context window: keep the most recent N tokens and evict the oldest. But the oldest tokens are the sinks. Evict them and the attention mass that used to land there has nowhere to go; it is forced onto tokens that were never meant to receive it, the learned distribution breaks, and generation degenerates.

The damage is dramatic and, for that run, permanent. These figures are representative of the StreamingLLM results, not measured here:

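| Method           | PPL@2K | PPL@8K | PPL@16K | Memory |
|------------------|--------|--------|---------|--------|
| Full cache       | 10.2   | 10.3   | 10.4    | O(n)   |
| Window, no sinks | 10.5   | 63.5   | 450+    | O(w)   |
| Window + 4 sinks | 10.3   | 10.9   | 11.2    | O(w)   |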

Same memory budget; the only difference is whether the first four tokens survive eviction.
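The mechanism behind the collapse, attention mass forced onto tokens that were never meant to receive it, can be illustrated with a single softmax. This is a toy sketch with invented scores, not a reproduction of the perplexity figures: it only shows how removing the sink's score inflates every remaining weight.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for one query: index 0 is the sink token,
# indices 1..4 are recent content tokens (illustrative values only).
scores = [4.0, 0.5, 0.2, 0.1, 0.3]

before = softmax(scores)      # sink still in the cache
after = softmax(scores[1:])   # sink evicted by a plain sliding window

content_mass_before = sum(before[1:])
inflation = after[0] / before[1]  # same token, with and without the sink

print("content mass with sink:   ", round(content_mass_before, 3))
print("content mass without sink:", round(sum(after), 3))
print("weight inflation factor:  ", round(inflation, 1))
```

With the sink present, the content tokens share less than a tenth of the attention. Once it is evicted they must share all of it, so each weight grows by roughly an order of magnitude even though the scores themselves did not change.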

The fix: keep a few sink tokens

StreamingLLM's fix is almost embarrassingly small: keep the first ~4 tokens permanently, and slide the window over everything after them.

[ sink tokens 0-3 ] + [ sliding window: most recent N-4 tokens ]

The sinks go on absorbing the excess attention, so the learned distribution stays intact; the window supplies recent content. Memory stays fixed at the cache size no matter how long the stream runs — a 100K-token stream with a 1K cache costs the same as the first 1K tokens, roughly a 99% saving versus caching everything.
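The eviction policy can be sketched in a few lines. This is a toy cache over token indices, assuming a hypothetical `SinkCache` class invented for illustration. It is not the StreamingLLM implementation, which manages per-layer key and value tensors rather than integers.

```python
from collections import deque

class SinkCache:
    """Toy cache: a few permanent sink slots plus a sliding window.

    Hypothetical helper for illustration; stores arbitrary items
    (here token indices) instead of real key/value tensors.
    """

    def __init__(self, cache_size, num_sinks=4):
        if cache_size <= num_sinks:
            raise ValueError("cache_size must exceed num_sinks")
        self.num_sinks = num_sinks
        self.sinks = []
        # A bounded deque drops its oldest element when a new one arrives.
        self.window = deque(maxlen=cache_size - num_sinks)

    def append(self, item):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(item)   # the first tokens are never evicted
        else:
            self.window.append(item)  # everything else slides

    def contents(self):
        return self.sinks + list(self.window)

cache = SinkCache(cache_size=8, num_sinks=4)
for token_index in range(100):
    cache.append(token_index)

kept = cache.contents()
print(kept)  # [0, 1, 2, 3, 96, 97, 98, 99]

# Memory arithmetic from the text: 1K cache versus a 100K-token stream.
saving = 1 - 1_000 / 100_000
print(f"saving vs. full cache: {saving:.0%}")
```

The cache never grows past `cache_size`, regardless of how many tokens are appended.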

Position handling

Evicting middle tokens leaves gaps in the original sequence — but StreamingLLM does not carry those gapped positions into the cache. It assigns positions within the cache, contiguously, so the model never sees a position index beyond the range it was trained on.

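For example, if the cache holds tokens whose original positions are [0, 1, 2, 3, 6, 7, 8] and the model is decoding the next token (original position 9), the attention layer is fed positions [0, 1, 2, 3, 4, 5, 6, 7] rather than [0, 1, 2, 3, 6, 7, 8, 9]. The four sink tokens keep positions 0–3, the three window tokens are renumbered 4–6 to follow immediately after with no gap, and the token being decoded takes position 7.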

This is why rotary embeddings (RoPE) and ALiBi keep working: they encode the relative distance between cached tokens. Because the cache positions are contiguous, those distances stay small and in-distribution — exactly the regime the model trained in. Reusing the original gapped positions would do the opposite, forcing huge cross-gap distances the model never saw.
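The renumbering can be sketched as follows. The helper names (`cache_positions`, `next_query_position`, `max_distance`) are hypothetical, and the sketch only tracks integer positions. It does not apply RoPE or ALiBi itself.

```python
def cache_positions(original_positions):
    """Positions assigned by slot in the cache: 0, 1, ..., len - 1."""
    return list(range(len(original_positions)))

def next_query_position(original_positions):
    """Position given to the token being decoded: right after the cache."""
    return len(original_positions)

def max_distance(positions):
    """Largest relative distance between any two positions."""
    return max(positions) - min(positions)

# Small example: four sinks plus a three-token window.
cached = [0, 1, 2, 3, 6, 7, 8]
print(cache_positions(cached))      # [0, 1, 2, 3, 4, 5, 6]
print(next_query_position(cached))  # 7

# Long stream (hypothetical sizes): 100,000 tokens seen, 1,000-slot cache,
# i.e. 4 sinks plus the 996 most recent tokens.
long_cached = [0, 1, 2, 3] + list(range(99_004, 100_000))

print(max_distance(long_cached))                   # 99999 with original positions
print(max_distance(cache_positions(long_cached)))  # 999 with in-cache positions
```

With in-cache positions, the largest distance is bounded by the cache size rather than by the length of the stream.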
