Skip to main content

Scaled Dot-Product Attention

Summary
Master scaled dot-product attention, the fundamental transformer building block. Learn why scaling is crucial for stable training.

Scaled Dot-Product Attention

Scaled dot-product attention is the operation at the core of every transformer: it scores how well each query matches each key, turns those scores into weights with softmax, and uses the weights to blend the values. The query/key/value mechanism itself is covered in self-attention — this page focuses on the scoring step and, above all, why the scores are scaled by √dₖ.

Interactive visualization

Walk through the computation one query at a time — scores, scaling, softmax, and the weighted output:

The core formula

Attention(Q, K, V) = softmax(QKT√(dk))V
  • Q — query (what we're looking for)
  • K — key (what we compare against)
  • V — value (what we actually retrieve)
  • dₖ — dimension of the key vectors
  • √dₖ — the scaling factor that makes this "scaled" dot-product attention

The dot product as similarity

A query attends to a key in proportion to their dot product:

q · k = Σi=1dk qi ki

A large positive dot product means the vectors point the same way (similar); near zero means unrelated; negative means opposed. Computing QKᵀ does this for every query–key pair at once.

Why scale by √dₖ

This is the step the name is built around. For queries and keys whose components have unit variance, the dot product q·k is a sum of dₖ such products, so its variance grows with dₖ — its magnitude scales like √dₖ. At dₖ = 512 that means raw scores routinely reach ±22.6.

Feed scores that large into softmax and it saturates: one weight rounds to 1, the rest to 0. Where softmax is flat, its gradient is ~0, so almost no learning signal flows back — training stalls. Dividing every score by √dₖ pulls the variance back to ~1 regardless of dₖ, keeping softmax in its responsive range where gradients stay healthy. (Softmax is used in the first place because it turns arbitrary scores into a valid, differentiable probability distribution.)

The computation, step by step

S = QKT

Score every query against every key — output shape [seq_len, seq_len].

Sscaled = S√(dk)

Scale to control variance, as above. An optional mask (e.g. causal — see masked attention) sets forbidden positions to −∞ here, before softmax.

A = softmax(Sscaled)

Softmax over the key axis turns each query's scores into weights that sum to 1.

Output = AV

Each output is the attention-weighted sum of the value vectors.

This is O(n²) in sequence length — the cost addressed by flash-attention (linear memory), sparse attention, and linear attention (linear compute). Variants that change the scoring — like multi-query attention — reuse this same scaled-softmax core.

Further reading

  • Attention Is All You Need — Vaswani et al., 2017 (§3.2.1 introduces scaled dot-product attention and the √dₖ factor)

If you found this explanation helpful, consider sharing it with others.

Mastodon