Scaled Dot-Product Attention

Summary: Master scaled dot-product attention, the fundamental transformer building block. Learn why scaling is crucial for stable training.

Scaled dot-product attention is the operation at the core of every transformer: it scores how well each query matches each key, turns those scores into weights with softmax, and uses the weights to blend the values. The query/key/value mechanism itself is covered in self-attention — this page focuses on the scoring step and, above all, why the scores are scaled by √dₖ.

The core formula

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V

Q — query (what we're looking for)
K — key (what we compare against)
V — value (what we actually retrieve)
dₖ — dimension of the key vectors
√dₖ — the scaling factor that makes this "scaled" dot-product attention

The dot product as similarity

A query attends to a key in proportion to their dot product:

q \cdot k = \sum_{i=1}^{d_k} q_i k_i

A large positive dot product means the vectors point the same way (similar); near zero means unrelated; negative means opposed. Computing QKᵀ does this for every query–key pair at once.
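The three cases can be illustrated with a small NumPy sketch. The vectors here are hypothetical values picked only so that each case appears once; nothing about them comes from a trained model.

```python
import numpy as np

# Hypothetical 2-D vectors, chosen only to show the three cases.
q = np.array([1.0, 2.0])
keys = np.array([
    [2.0, 4.0],    # points the same way as q
    [2.0, -1.0],   # orthogonal to q
    [-1.0, -2.0],  # points the opposite way
])

# One query against every key: a vector of dot products.
scores = keys @ q
print(scores.tolist())  # [10.0, 0.0, -5.0]

# Several queries at once: Q @ K.T holds every query-key pair.
Q = np.stack([q, -q])   # second query is the negation of the first
S = Q @ keys.T
print(S.shape)          # (2, 3): one row per query, one column per key
```

Row i, column j of S is the dot product of query i with key j, which is why a single matrix product is enough to score all pairs.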

Why scale by √dₖ

This is the step the name is built around. Assume the components of q and k are independent, with zero mean and unit variance. Then q·k is a sum of dₖ terms qᵢkᵢ, each with zero mean and variance 1, so its variance is dₖ and its typical magnitude (standard deviation) is √dₖ. At dₖ = 512 that is a standard deviation of about 22.6, so raw scores in the tens are routine.
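The growth of the score spread with dₖ can be checked empirically. The sketch below assumes queries and keys drawn from a standard normal distribution, which is an idealization of real activations; the helper name score_std, the sample count, and the seed are arbitrary choices for illustration.

```python
import numpy as np

def score_std(d_k, n_samples=5_000, seed=0):
    """Return (std of raw q.k, std of q.k / sqrt(d_k)) over random pairs.

    Assumes i.i.d. standard-normal components, an idealized model.
    """
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    raw = np.einsum("nd,nd->n", q, k)   # one dot product per sampled pair
    return raw.std(), (raw / np.sqrt(d_k)).std()

for d_k in (16, 64, 512):
    raw_std, scaled_std = score_std(d_k)
    print(f"d_k={d_k:4d}  sqrt(d_k)={np.sqrt(d_k):6.2f}  "
          f"raw std={raw_std:6.2f}  scaled std={scaled_std:5.2f}")
```

Under these assumptions the raw standard deviation should track √dₖ closely, while the scaled one stays near 1 for every dₖ.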

Feed scores that large into softmax and it saturates: one weight rounds to 1, the rest to 0. Where softmax is flat, its gradient is ~0, so almost no learning signal flows back — training stalls. Dividing every score by √dₖ pulls the variance back to ~1 regardless of dₖ, keeping softmax in its responsive range where gradients stay healthy. (Softmax is used in the first place because it turns arbitrary scores into a valid, differentiable probability distribution.)
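The saturation effect can be seen directly by comparing softmax on one set of scores before and after scaling. This is a minimal sketch: the five score values are hypothetical, and the unscaled scores are built by multiplying them by √512 to mimic the spread expected at dₖ = 512. The Jacobian of softmax, diag(p) − ppᵀ, stands in for "how much gradient can flow back".

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

def softmax_jacobian(x):
    """Jacobian of softmax at x: diag(p) - p p^T."""
    p = softmax(x)
    return np.diag(p) - np.outer(p, p)

d_k = 512
scaled = np.array([1.0, 0.5, 0.0, -0.5, -1.0])  # hypothetical well-scaled scores
raw = scaled * np.sqrt(d_k)                      # the same scores without the 1/sqrt(d_k)

for name, s in (("raw", raw), ("scaled", scaled)):
    p = softmax(s)
    largest = np.abs(softmax_jacobian(s)).max()
    print(f"{name:>6}: max weight = {p.max():.5f}, "
          f"largest Jacobian entry = {largest:.2e}")
```

With the raw scores almost all the weight lands on one key and every Jacobian entry is tiny; with the scaled scores the weights are spread out and the derivatives are orders of magnitude larger.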

The computation, step by step

S = QK^T

Score every query against every key — output shape [seq_len, seq_len].

S_{\text{scaled}} = \frac{S}{\sqrt{d_k}}

Scale to control variance, as above. An optional mask (e.g. causal — see masked attention) sets forbidden positions to −∞ here, before softmax.

A = softmax(S_scaled)

Softmax over the key axis turns each query's scores into weights that sum to 1.

Output = AV

Each output is the attention-weighted sum of the value vectors.
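The four steps above can be sketched in NumPy for a single, unbatched sequence. This is an illustrative version rather than a reference implementation: the function name, the boolean mask convention (True means "may attend"), and the random inputs are assumptions made for the example.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: [n_q, d_k], K: [n_k, d_k], V: [n_k, d_v].

    mask (assumed convention): bool [n_q, n_k], True where attention is allowed.
    Every query must have at least one allowed key, or its row becomes NaN.
    """
    d_k = Q.shape[-1]
    S = Q @ K.T                              # 1. score every query against every key
    S_scaled = S / np.sqrt(d_k)              # 2. scale to control variance
    if mask is not None:
        S_scaled = np.where(mask, S_scaled, -np.inf)   # forbidden positions
    # 3. softmax over the key axis (max subtracted for numerical stability)
    shifted = S_scaled - S_scaled.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    A = weights / weights.sum(axis=-1, keepdims=True)
    return A @ V, A                          # 4. weighted sum of values

# Usage with random inputs and a causal mask (lower triangle allowed).
rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 5
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))
causal = np.tril(np.ones((n, n), dtype=bool))

out, A = scaled_dot_product_attention(Q, K, V, mask=causal)
print(out.shape, A.shape)   # (4, 5) (4, 4)
```

With the causal mask the first query can only see the first key, so its output row equals V[0] exactly; each later row blends one more value vector.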

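Because the score matrix has one entry per query-key pair, this computation is O(n²) in sequence length n, in both time and memory. That cost is what flash-attention (linear memory, same compute), sparse attention, and linear attention (linear compute) address. Variants that change how keys and values are shared across heads, like multi-query attention, reuse this same scaled-softmax core.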
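To make the quadratic growth concrete, the sketch below counts the bytes needed to hold one full score matrix. It assumes 4-byte (float32) entries and a single attention map; the helper name score_matrix_bytes is made up for this example, and real implementations may never materialize the whole matrix.

```python
def score_matrix_bytes(n, bytes_per_element=4):
    """Bytes for one n x n score matrix (float32 assumed by default)."""
    return n * n * bytes_per_element

for n in (1024, 4096, 16384):
    mib = score_matrix_bytes(n) / 2**20
    print(f"n={n:6d}: {mib:.0f} MiB")
# n=  1024: 4 MiB
# n=  4096: 64 MiB
# n= 16384: 1024 MiB
```

Each 4x increase in sequence length multiplies the footprint by 16.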