Skip to main content

Linear Attention Approximations

Summary
Explore linear complexity attention mechanisms including Performer, Linformer, and other efficient transformers that scale to very long sequences.

The O(n²) bottleneck

Standard self-attention compares every token with every other token, so it builds an explicit n × n score matrix:

\text{Attention}(Q,K,V) = \text{softmax}\!(QK^\top√(d))V

The softmax sits between Q and K, coupling every query–key pair, so that matrix has to exist before you can multiply by V. Its size grows with the square of the sequence length — that quadratic cost in time and memory is the wall linear attention tries to get around.

This page assumes you know self-attention and the softmax it computes.

Interactive explorer

Compare the methods and their trade-offs as you change the sequence length:

The core trick: kernelize and reassociate

Every linear-attention method comes from one observation. The only reason you must form the n × n matrix is that softmax does not factorize. Replace the softmax similarity with a kernel feature map φ so that the similarity between a query and a key splits into a dot product of independent features:

\text{sim}(qi, kj) = φ(qi)·φ(kj)

Now attention is just φ(Q)\,φ(K)^\top V, and matrix multiplication is associative, so you can change the order of evaluation:

\bigl(φ(Q)\,φ(K)^\top\bigr)V \;=\; φ(Q)\,\bigl(φ(K)^\top V\bigr)

Compute the right pair first. φ(K)^\top V is a d × d matrix whose size does not depend on n. Multiplying it by φ(Q) then costs O(n·d²) time and O(d²) memory — linear in the sequence length, and the n × n matrix is never formed. (You normalize each row by φ(qi)·Σj φ(kj), the linear-attention stand-in for softmax's denominator.)

Everything below is a different answer to one question: what is φ, or how else do you shrink n?

The families

  • Linear Transformer picks the simplest possible feature map, φ(x) = \text{elu}(x) + 1, and leans entirely on the associativity trick above. It also decodes causally in constant time per token using a running sum.
  • Performer (FAVOR+) chooses φ so that φ(q)·φ(k) is an unbiased estimate of the softmax kernel, using positive orthogonal random features (the "positive" matters — trigonometric/Fourier features are numerically unstable here).
  • Linformer keeps softmax but attacks n directly: it projects K and V along the sequence axis from n down to a small k, exploiting the fact that the attention matrix is approximately low-rank.
  • cosFormer applies a cosine-based reweighting that stays linear while reintroducing a sense of locality and relative position.
MethodHow it linearizesComplexityTrade-off
Linear Transformerfeature map φ = elu(x)+1O(n·d²)simplest; weakest approximation
Performer (FAVOR+)positive orthogonal random features approximating softmaxO(n·k·d)unbiased softmax estimate; k≈256 features tune accuracy
Linformerlow-rank projection of K,V along the sequence (n→k)O(n·k)needs a fixed max length; very memory-light
cosFormercosine reweighting kernelO(n·d²) (linear)adds locality / position sensitivity
FlashAttentionexact softmax (IO-aware tiling)O(n²·d) compute, O(n) memorynot linear compute — linear memory

What you give up

Linearity is an approximation. How close it lands to exact softmax is task-dependent, and linear methods often lag noticeably on tasks that need precise retrieval or long-range recall — the very thing long contexts are usually for. A common compromise is hybrid attention: give some heads exact local (sliding-window) attention and others linear global attention, as in Longformer and BigBird.

It is also worth knowing when not to reach for these. When you can afford O(n²) compute but are bottlenecked on memory, exact FlashAttention is usually the better default — it computes the true softmax with linear memory by tiling, no approximation involved. Sparse attention and sliding-window attention are other exact-but-cheaper routes worth weighing first.

Further reading

If you found this explanation helpful, consider sharing it with others.

Mastodon