Scaled Dot-Product Attention
Scaled dot-product attention is the operation at the core of every transformer: it scores how well each query matches each key, turns those scores into weights with softmax, and uses the weights to blend the values. The query/key/value mechanism itself is covered in self-attention — this page focuses on the scoring step and, above all, why the scores are scaled by √dₖ.
The core formula
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
- Q — query (what we're looking for)
- K — key (what we compare against)
- V — value (what we actually retrieve)
- dₖ — dimension of the key vectors
- √dₖ — the scaling factor that makes this "scaled" dot-product attention
The dot product as similarity
A query attends to a key in proportion to their dot product:
q · k = Σᵢ qᵢkᵢ = ‖q‖ ‖k‖ cos θ
A large positive dot product means the vectors point the same way (similar); near zero means unrelated; negative means opposed. Computing QKᵀ does this for every query–key pair at once.
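A minimal sketch of the three cases, using plain Python lists in place of tensors. The vectors and the helper names `dot` and `score_matrix` are illustrative assumptions, not part of any library:

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def score_matrix(Q, K):
    """All query-key dot products: entry [i][j] is Q[i] . K[j], i.e. QK^T."""
    return [[dot(q, k) for k in K] for q in Q]


q = [1.0, 2.0]
keys = [
    [2.0, 4.0],    # same direction as q
    [-2.0, 1.0],   # orthogonal to q
    [-1.0, -2.0],  # opposite direction to q
]

scores = score_matrix([q], keys)[0]
print(scores)  # [10.0, 0.0, -5.0]
```

The aligned key gets the largest score, the orthogonal key scores zero, and the opposed key scores negative.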
Why scale by √dₖ
This is the step the name is built around. Assume the components of each query and key are independent, with zero mean and unit variance. The dot product q·k is then a sum of dₖ terms qᵢkᵢ, each with zero mean and unit variance, so its variance is dₖ and its standard deviation is √dₖ. At dₖ = 512 the standard deviation is about 22.6, so raw scores in the ±20s are routine.
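The √dₖ growth can be checked empirically. The sketch below assumes Gaussian components (any zero-mean, unit-variance distribution should behave similarly); the function name `dot_product_std` and the trial count are arbitrary choices for illustration:

```python
import math
import random


def dot_product_std(d_k, trials=1000, seed=0):
    """Empirical standard deviation of q.k for random unit-variance q and k."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        samples.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(samples) / trials
    variance = sum((s - mean) ** 2 for s in samples) / trials
    return math.sqrt(variance)


for d_k in (16, 64, 512):
    raw = dot_product_std(d_k)
    # Columns: d_k, measured std of q.k, sqrt(d_k), std after dividing by sqrt(d_k)
    print(d_k, round(raw, 2), round(math.sqrt(d_k), 2), round(raw / math.sqrt(d_k), 2))
```

The measured standard deviation should track √dₖ closely, and the last column should stay near 1 for every dₖ, which is the effect of the scaling.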
Feed scores that large into softmax and it saturates: one weight rounds to 1, the rest to 0. Where softmax is flat, its gradient is ~0, so almost no learning signal flows back — training stalls. Dividing every score by √dₖ pulls the variance back to ~1 regardless of dₖ, keeping softmax in its responsive range where gradients stay healthy. (Softmax is used in the first place because it turns arbitrary scores into a valid, differentiable probability distribution.)
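The saturation can be seen on a handful of numbers. This is a sketch under stated assumptions: the four scores are made up, and the gradient shown is only the diagonal of the softmax Jacobian, ∂wᵢ/∂sᵢ = wᵢ(1 − wᵢ), which is enough to show the signal vanishing:

```python
import math


def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_self_gradient(weights):
    """Diagonal of the softmax Jacobian: d w_i / d s_i = w_i * (1 - w_i)."""
    return [w * (1.0 - w) for w in weights]


d_k = 512
# Hypothetical scores after scaling, with roughly unit spread.
scaled = [1.5, 0.5, -0.5, -1.5]
# The same scores before dividing by sqrt(d_k).
raw = [s * math.sqrt(d_k) for s in scaled]

w_raw = softmax(raw)
w_scaled = softmax(scaled)

print("unscaled weights:", w_raw)
print("unscaled gradient:", softmax_self_gradient(w_raw))
print("scaled weights:", w_scaled)
print("scaled gradient:", softmax_self_gradient(w_scaled))
```

Without scaling, the largest weight is indistinguishable from 1 and every gradient entry is vanishingly small. With scaling, the weights stay spread out and the gradients are of ordinary size.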
The computation, step by step
1. Score every query against every key. Output shape: [seq_len, seq_len].
2. Scale to control variance, as above. An optional mask (e.g. causal, see masked attention) sets forbidden positions to −∞ here, before softmax.
3. Softmax over the key axis turns each query's scores into weights that sum to 1.
4. Each output is the attention-weighted sum of the value vectors.
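Put together, the steps above can be sketched in dependency-free Python. This is an illustrative sketch, not a production implementation: it uses nested lists instead of tensors, handles a single head with no batching, and the function name and the `causal` flag are assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V, causal=False):
    """Q: [n_q][d_k], K: [n_k][d_k], V: [n_k][d_v], all nested lists.

    Returns (outputs [n_q][d_v], weights [n_q][n_k]).
    """
    d_k = len(K[0])
    d_v = len(V[0])
    scale = math.sqrt(d_k)
    outputs, all_weights = [], []
    for i, q in enumerate(Q):
        # Score the query against every key, then divide by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        if causal:
            # Forbid attending to later positions: set them to -inf before softmax.
            scores = [s if j <= i else -math.inf for j, s in enumerate(scores)]
        # Softmax over the key axis: weights for this query sum to 1.
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out = [sum(w * v[c] for w, v in zip(weights, V)) for c in range(d_v)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

out, weights = scaled_dot_product_attention(Q, K, V)
out_causal, weights_causal = scaled_dot_product_attention(Q, K, V, causal=True)
print(weights)
print(weights_causal[0])  # [1.0, 0.0]
```

With the causal mask, the first query can only see the first key, so its weight is exactly 1 and its output is the first value vector unchanged.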
The whole computation is O(n²) in sequence length, in both time and memory. That is the cost addressed by Flash Attention (memory linear in n, same compute), sparse attention, and linear attention (linear compute). Variants that change how heads share keys and values, like multi-query attention, reuse this same scaled-softmax core.
Further reading
- Attention Is All You Need — Vaswani et al., 2017 (§3.2.1 introduces scaled dot-product attention and the √dₖ factor)
