The O(n²) bottleneck
Standard self-attention compares every token with every other token, so it builds an explicit n × n score matrix:
The softmax sits between Q and K, coupling every query–key pair, so that matrix has to exist before you can multiply by V. Its size grows with the square of the sequence length — that quadratic cost in time and memory is the wall linear attention tries to get around.
This page assumes you know self-attention and the softmax it computes.
Interactive explorer
Compare the methods and their trade-offs as you change the sequence length:
The core trick: kernelize and reassociate
Every linear-attention method comes from one observation. The only reason you must form the n × n matrix is that softmax does not factorize. Replace the softmax similarity with a kernel feature map φ so that the similarity between a query and a key splits into a dot product of independent features:
Now attention is just φ(Q)\,φ(K)^\top V, and matrix multiplication is associative, so you can change the order of evaluation:
Compute the right pair first. φ(K)^\top V is a d × d matrix whose size does not depend on n. Multiplying it by φ(Q) then costs O(n·d²) time and O(d²) memory — linear in the sequence length, and the n × n matrix is never formed. (You normalize each row by φ(qi)·Σj φ(kj), the linear-attention stand-in for softmax's denominator.)
Everything below is a different answer to one question: what is φ, or how else do you shrink n?
The families
- Linear Transformer picks the simplest possible feature map, φ(x) = \text{elu}(x) + 1, and leans entirely on the associativity trick above. It also decodes causally in constant time per token using a running sum.
- Performer (FAVOR+) chooses φ so that φ(q)·φ(k) is an unbiased estimate of the softmax kernel, using positive orthogonal random features (the "positive" matters — trigonometric/Fourier features are numerically unstable here).
- Linformer keeps softmax but attacks n directly: it projects K and V along the sequence axis from n down to a small k, exploiting the fact that the attention matrix is approximately low-rank.
- cosFormer applies a cosine-based reweighting that stays linear while reintroducing a sense of locality and relative position.
| Method | How it linearizes | Complexity | Trade-off |
|---|---|---|---|
| Linear Transformer | feature map φ = elu(x)+1 | O(n·d²) | simplest; weakest approximation |
| Performer (FAVOR+) | positive orthogonal random features approximating softmax | O(n·k·d) | unbiased softmax estimate; k≈256 features tune accuracy |
| Linformer | low-rank projection of K,V along the sequence (n→k) | O(n·k) | needs a fixed max length; very memory-light |
| cosFormer | cosine reweighting kernel | O(n·d²) (linear) | adds locality / position sensitivity |
| FlashAttention | exact softmax (IO-aware tiling) | O(n²·d) compute, O(n) memory | not linear compute — linear memory |
What you give up
Linearity is an approximation. How close it lands to exact softmax is task-dependent, and linear methods often lag noticeably on tasks that need precise retrieval or long-range recall — the very thing long contexts are usually for. A common compromise is hybrid attention: give some heads exact local (sliding-window) attention and others linear global attention, as in Longformer and BigBird.
It is also worth knowing when not to reach for these. When you can afford O(n²) compute but are bottlenecked on memory, exact FlashAttention is usually the better default — it computes the true softmax with linear memory by tiling, no approximation involved. Sparse attention and sliding-window attention are other exact-but-cheaper routes worth weighing first.
Further reading
- Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention — Katharopoulos et al., 2020 (Linear Transformer)
- Rethinking Attention with Performers — Choromanski et al., 2020 (FAVOR+)
- Linformer: Self-Attention with Linear Complexity — Wang et al., 2020
- cosFormer: Rethinking Softmax in Attention — Qin et al., 2022
Related concepts
How Flash Attention, Multi-Head Attention (MHA), Grouped-Query Attention (GQA), and Multi-Query Attention (MQA) compare — algorithm vs architecture, KV-cache memory, quality trade-offs, and how to choose for production transformer inference.
Learn how Grouped-Query Attention (GQA) balances Multi-Head quality with Multi-Query efficiency for faster LLM inference.
Learn Multi-Query Attention (MQA), the optimization that shares keys and values across attention heads for massive memory savings.
Sliding Window Attention for long sequences: local context windows enable O(n) complexity, used in Mistral and Longformer models.
Explore sparse attention mechanisms that reduce quadratic complexity to linear or sub-quadratic, enabling efficient processing of long sequences.
Learn ALiBi, the position encoding method that adds linear biases to attention scores for exceptional length extrapolation in transformers.
