Multi-Query Attention (MQA)

Summary: Learn Multi-Query Attention (MQA), the optimization that shares keys and values across attention heads for massive memory savings.

Multi-Query Attention (MQA) is a variant of multi-head attention in which all query heads share a single set of keys and values. With h heads, this shrinks the KV cache by a factor of h at a small cost in model quality.

The Core Insight

Traditional multi-head attention (MHA) maintains separate K, V projections for each head. For a sequence of n tokens, h heads, and head dimension d:

KV memory per layer: O(n · h · d)
Redundancy: Heads often learn similar key/value patterns

MQA's key idea: One K, V pair serves all heads

KV memory per layer: O(n · d)
Efficiency: KV cache shrinks by a factor of h (32× for a 32-head model)

How MQA Works

The Architecture

MQA(X) = Concat(head₁, ..., head_h)W^O

Where each head computes:

head_i = Attention(Q_i, K_shared, V_shared)

Key differences from MHA:

Queries: Still head-specific (Q_i = X W_i^Q for i = 1, ..., h)
Keys/Values: Shared across all heads (K_shared = X W^K, V_shared = X W^V)
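
The formulas above can be sketched in plain Python. This is a minimal, unoptimized illustration rather than a reference implementation: the helper names (matmul, softmax, attention, mqa_forward, rand_matrix), the tiny dimensions, and the random weights are all assumptions made for the example, and masking and batching are left out.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for one head (no masking)."""
    d_head = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
    return matmul(weights, v)

def mqa_forward(x, w_q_heads, w_k, w_v, w_o):
    """Multi-query attention: one query projection per head, one shared K and V."""
    k_shared = matmul(x, w_k)  # computed once, reused by every head
    v_shared = matmul(x, w_v)
    heads = [attention(matmul(x, w_q), k_shared, v_shared) for w_q in w_q_heads]
    # Concatenate head outputs along the feature axis, then project with W^O.
    concat = [sum((head[t] for head in heads), []) for t in range(len(x))]
    return matmul(concat, w_o)

def rand_matrix(rows, cols, rng):
    """Hypothetical helper: random weights for the demo only."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_tokens, d_model, n_heads, d_head = 4, 8, 2, 4  # toy sizes, chosen arbitrarily
x = rand_matrix(n_tokens, d_model, rng)
w_q_heads = [rand_matrix(d_model, d_head, rng) for _ in range(n_heads)]
w_k = rand_matrix(d_model, d_head, rng)  # a single K projection for all heads
w_v = rand_matrix(d_model, d_head, rng)  # a single V projection for all heads
w_o = rand_matrix(n_heads * d_head, d_model, rng)

out = mqa_forward(x, w_q_heads, w_k, w_v, w_o)
print(len(out), len(out[0]))  # output has one d_model-sized row per token
```

The only structural difference from MHA is that w_k and w_v are single matrices instead of one per head, so K_shared and V_shared are computed (and cached) once.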

Memory Savings Analysis

KV Cache Comparison

The savings land in the KV cache, the per-token Key/Value state kept during generation. For a model with 32 heads, 40 layers, sequence length 2048, head dimension 128, storing each value in 16-bit precision (2 bytes):

Method	Cache Values per Token	Total for 2K Context (fp16)	Reduction
MHA	2 × 40 × 32 × 128 = 327,680	1.25 GiB	0%
MQA	2 × 40 × 1 × 128 = 10,240	40 MiB	96.9%

Because the cache is 32× smaller, the same GPU holds far more concurrent sequences, which directly raises serving throughput (e.g. a 50 GiB cache budget fits 40 MHA sequences but 1,280 MQA ones).
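
The cache arithmetic can be reproduced with a few lines of Python. This sketch assumes 2 bytes per cached value (fp16 or bf16) and a 50 GiB cache budget; the function names are illustrative, and a different storage precision scales the byte totals proportionally.

```python
def kv_cache_values_per_token(n_layers, n_kv_heads, d_head):
    """Cached values per token: keys and values (the factor 2) in every layer."""
    return 2 * n_layers * n_kv_heads * d_head

def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value=2):
    """Total cache size for one sequence; bytes_per_value=2 assumes fp16/bf16."""
    per_token = kv_cache_values_per_token(n_layers, n_kv_heads, d_head)
    return per_token * seq_len * bytes_per_value

n_layers, n_heads, d_head, seq_len = 40, 32, 128, 2048

mha_values = kv_cache_values_per_token(n_layers, n_heads, d_head)
mqa_values = kv_cache_values_per_token(n_layers, 1, d_head)  # one shared K/V head

mha_bytes = kv_cache_bytes(n_layers, n_heads, d_head, seq_len)
mqa_bytes = kv_cache_bytes(n_layers, 1, d_head, seq_len)

budget = 50 * 2**30  # assumed cache budget of 50 GiB
mha_sequences = budget // mha_bytes
mqa_sequences = budget // mqa_bytes

print(f"MHA: {mha_values:,} values/token, {mha_bytes / 2**30:.2f} GiB per sequence")
print(f"MQA: {mqa_values:,} values/token, {mqa_bytes / 2**20:.0f} MiB per sequence")
print(f"Reduction: {1 - mqa_bytes / mha_bytes:.1%}")
print(f"Sequences in budget: MHA {mha_sequences}, MQA {mqa_sequences}")
```

The ratio between the two caches depends only on the number of heads, so the 32× factor holds for any sequence length or precision.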

Quality Considerations

The Trade-off

MQA trades expressiveness for efficiency:

Parameter Count Comparison:

MHA: n_heads × d_model × d_head × 3 (separate Q, K, V per head)
MQA: n_heads × d_model × d_head + 2 × d_model × d_head (Q per head, shared K,V)
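Reduction: (2 × n_heads − 2) / (3 × n_heads) of the Q/K/V projection parameters, about 65% for 32 heads and approaching 2/3 as the head count grows. The output projection W^O is unchanged, so the reduction across the whole attention layer is closer to 48%.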
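
As a quick check of these counts, the sketch below evaluates both formulas for a concrete configuration. The sizes are assumptions chosen for illustration (32 heads, d_head = 128, d_model = 4096), bias terms are ignored, and the function names are hypothetical.

```python
def qkv_params(d_model, d_head, n_q_heads, n_kv_heads):
    """Weights in the Q, K, V projections (biases ignored)."""
    q = n_q_heads * d_model * d_head
    kv = 2 * n_kv_heads * d_model * d_head
    return q + kv

def output_params(d_model, d_head, n_q_heads):
    """Weights in the output projection W^O, identical for MHA and MQA."""
    return n_q_heads * d_head * d_model

d_model, d_head, n_heads = 4096, 128, 32  # assumed sizes for the example

mha_qkv = qkv_params(d_model, d_head, n_heads, n_kv_heads=n_heads)
mqa_qkv = qkv_params(d_model, d_head, n_heads, n_kv_heads=1)
w_o = output_params(d_model, d_head, n_heads)

qkv_reduction = 1 - mqa_qkv / mha_qkv
layer_reduction = 1 - (mqa_qkv + w_o) / (mha_qkv + w_o)

print(f"Q/K/V projections: {mha_qkv:,} -> {mqa_qkv:,} ({qkv_reduction:.1%} fewer)")
print(f"Including W^O:     {mha_qkv + w_o:,} -> {mqa_qkv + w_o:,} ({layer_reduction:.1%} fewer)")
```

The Q/K/V saving approaches two thirds only as the head count grows, and it is smaller once the unchanged output projection is counted.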

Empirical Results

Indicative figures, in line with the trend reported in the original paper (Shazeer, 2019):

Model	Attention Type	Perplexity	Speed
Base	MHA	10.2	1.0×
Base	MQA	10.4	1.8×
Large	MHA	8.1	1.0×
Large	MQA	8.3	2.4×

Key findings:

Small quality loss (perplexity rises by about 2-2.5%)
Significant speed gains (1.8-2.4×)
Speed gains grow with model size (1.8× for Base, 2.4× for Large)

Comparison with Alternatives

Grouped-query attention (GQA) sits between MHA and MQA: query heads are split into groups, and each group shares one K/V head. The table assumes 32 query heads, so GQA-8 keeps 8 K/V heads:

Feature	MHA	GQA-8	MQA
KV Parameters	100%	25%	3.1%
Cache Size	100%	25%	3.1%
Quality	Best	Near-best	Good
Inference Speed	1×	1.5×	2×
Implementation	Complex	Moderate	Simple
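
One way to see MHA, GQA, and MQA as a single family is through the mapping from query heads to K/V heads. The sketch below assumes 32 query heads, as in the table, and uses hypothetical function names; it is an illustration of the grouping idea, not any particular library's implementation.

```python
def kv_head_for_query_head(query_head, n_q_heads, n_kv_heads):
    """Index of the K/V head that a given query head reads from."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be divisible by n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return query_head // group_size

def kv_cache_fraction(n_q_heads, n_kv_heads):
    """KV cache (and K/V parameter) size relative to MHA."""
    return n_kv_heads / n_q_heads

n_q_heads = 32
variants = {"MHA": 32, "GQA-8": 8, "MQA": 1}  # name -> number of K/V heads

for name, n_kv_heads in variants.items():
    mapping = [kv_head_for_query_head(h, n_q_heads, n_kv_heads) for h in range(n_q_heads)]
    fraction = kv_cache_fraction(n_q_heads, n_kv_heads)
    print(f"{name}: cache {fraction:.1%} of MHA, first 8 query heads -> K/V heads {mapping[:8]}")
```

With 32 K/V heads every query head has its own keys and values (MHA), with 8 each group of four query heads shares one, and with 1 every query head reads the same K/V head (MQA).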
