Multi-Query Attention (MQA)

Summary
Learn Multi-Query Attention (MQA), the optimization that shares keys and values across attention heads for massive memory savings.

Multi-Query Attention: Maximum Efficiency Through Sharing

Multi-Query Attention (MQA) simplifies multi-head attention by sharing a single set of keys and values across all query heads. This shrinks the KV cache by a factor equal to the number of heads, at the cost of a small loss in model quality.

The Core Insight

Traditional multi-head attention (MHA) maintains separate K, V projections for each head:

  • Memory: O(n · h · d) for the cached keys and values in each layer (n tokens, h heads, head dimension d)
  • Redundancy: Similar patterns learned across heads

MQA's breakthrough: One K, V pair serves all heads

  • Memory: O(n · d)
  • Efficiency: The KV cache shrinks by a factor of h, the number of heads (32× for a 32-head model)

How MQA Works

The Architecture

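$$\mathrm{MQA}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$

Where each head computes:

$$\mathrm{head}_i = \mathrm{Attention}(Q_i, K_{\mathrm{shared}}, V_{\mathrm{shared}})$$

Key differences from MHA:

  • Queries: Still head-specific ($Q_1, Q_2, \ldots, Q_h$)
  • Keys/Values: Shared across all heads ($K_{\mathrm{shared}}, V_{\mathrm{shared}}$)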
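The formulas above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: it assumes a single sequence, no causal mask and no batching, and the function and weight names (`mqa`, `W_q`, `W_k`, `W_v`, `W_o`) are chosen here for illustration. The point to notice is that the keys and values are computed once and broadcast across every query head.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mqa(X, W_q, W_k, W_v, W_o):
    """Multi-query attention over one sequence (no masking, no batching).

    X:   (n, d_model)           input tokens
    W_q: (h, d_model, d_head)   one query projection per head
    W_k: (d_model, d_head)      single shared key projection
    W_v: (d_model, d_head)      single shared value projection
    W_o: (h * d_head, d_model)  output projection
    """
    h, _, d_head = W_q.shape
    Q = np.einsum("nd,hdk->hnk", X, W_q)    # (h, n, d_head): head-specific queries
    K = X @ W_k                              # (n, d_head): shared by every head
    V = X @ W_v                              # (n, d_head): shared by every head
    scores = Q @ K.T / np.sqrt(d_head)       # (h, n, n): K is broadcast across heads
    weights = softmax(scores, axis=-1)
    heads = weights @ V                      # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(X.shape[0], h * d_head)
    return concat @ W_o                      # (n, d_model)

# Tiny random example: 5 tokens, d_model = 16, 4 heads of size 4.
rng = np.random.default_rng(0)
n, d_model, h, d_head = 5, 16, 4, 4
X = rng.normal(size=(n, d_model))
W_q = rng.normal(size=(h, d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))
W_o = rng.normal(size=(h * d_head, d_model))
out = mqa(X, W_q, W_k, W_v, W_o)
print(out.shape)  # (5, 16)
```

During generation only `K` and `V` need to be cached, and here each has shape `(n, d_head)` rather than `(h, n, d_head)`, which is where the memory saving comes from.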

Memory Savings Analysis

KV Cache Comparison

The savings land in the KV cache — the per-token Key/Value state kept during generation. For a model with 32 heads, 40 layers, sequence length 2048, head dimension 128:

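| Method | Cache Size per Token | Total for 2K Context | Reduction |
| --- | --- | --- | --- |
| MHA | 2 × 40 × 32 × 128 = 327,680 values | 640 MB | 0% |
| MQA | 2 × 40 × 1 × 128 = 10,240 values | 20 MB | 96.9% |

The totals assume 1 byte per cached value. At fp16 (2 bytes per value) both double, to 1.25 GB and 40 MB, and the 32× ratio is unchanged.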

Because the cache is 32× smaller, the same GPU can hold far more concurrent sequences, which raises serving throughput whenever cache memory is the limiting factor. For example, a 50 GB cache budget fits about 78 MHA sequences at 640 MB each, but about 2,500 MQA sequences at 20 MB each.
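The arithmetic behind these figures can be reproduced with a short sketch. The helper names are illustrative, and `bytes_per_value=1` is an assumption: it is the storage width at which the 640 MB and 20 MB totals come out.

```python
def kv_cache_values_per_token(n_layers, n_kv_heads, d_head):
    # Factor of 2: one key vector and one value vector per KV head per layer.
    return 2 * n_layers * n_kv_heads * d_head

def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value):
    per_token = kv_cache_values_per_token(n_layers, n_kv_heads, d_head)
    return per_token * seq_len * bytes_per_value

MIB = 1024 ** 2

# 40 layers, head dimension 128, 2048 tokens; MHA caches 32 KV heads, MQA caches 1.
mha_bytes = kv_cache_bytes(40, 32, 128, 2048, bytes_per_value=1)
mqa_bytes = kv_cache_bytes(40, 1, 128, 2048, bytes_per_value=1)

print(kv_cache_values_per_token(40, 32, 128))  # 327680
print(kv_cache_values_per_token(40, 1, 128))   # 10240
print(mha_bytes / MIB, mqa_bytes / MIB)        # 640.0 20.0
print(1 - mqa_bytes / mha_bytes)               # 0.96875
```

Passing `bytes_per_value=2` gives the fp16 footprint, which doubles both totals and leaves the ratio the same.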

Quality Considerations

The Trade-off

MQA trades expressiveness for efficiency:

Parameter Count Comparison:

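  • MHA: 3 × n_heads × d_model × d_head (separate Q, K, V projections per head)
  • MQA: n_heads × d_model × d_head + 2 × d_model × d_head (Q per head, shared K, V)
  • Reduction: Approaches two thirds of the Q, K, V projection parameters as n_heads grows (about 65% for 32 heads); once the output projection is counted too, the attention layer shrinks by closer to half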
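A quick way to check these counts is to evaluate the formulas directly. The sketch below is illustrative only: the sizes (32 heads, head dimension 128, `d_model = 4096`) are assumed for the example, biases are ignored, and the `include_output` flag adds the output projection, which the formulas above leave out.

```python
def attention_params(n_heads, n_kv_heads, d_model, d_head, include_output=False):
    """Weight count for one attention layer (biases ignored)."""
    q = n_heads * d_model * d_head            # one query projection per head
    kv = 2 * n_kv_heads * d_model * d_head    # key and value projections
    out = n_heads * d_head * d_model if include_output else 0
    return q + kv + out

# Assumed sizes: 32 heads of dimension 128, d_model = 32 * 128 = 4096.
H, D_MODEL, D_HEAD = 32, 4096, 128

for include_output in (False, True):
    mha = attention_params(H, H, D_MODEL, D_HEAD, include_output)
    mqa = attention_params(H, 1, D_MODEL, D_HEAD, include_output)
    print(include_output, mha, mqa, round(1 - mqa / mha, 3))
# False 50331648 17825792 0.646
# True 67108864 34603008 0.484
```

The saving on the Q, K, V projections alone tends toward two thirds as the head count grows, while the whole-layer saving tends toward one half, because the query and output projections are untouched by MQA.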

Empirical Results

From the original paper (Shazeer, 2019):

| Model | Attention Type | Perplexity | Speed |
| --- | --- | --- | --- |
| Base | MHA | 10.2 | 1.0× |
| Base | MQA | 10.4 | 1.8× |
| Large | MHA | 8.1 | 1.0× |
| Large | MQA | 8.3 | 2.4× |

Key findings:

  • Small quality loss (perplexity rises by roughly 2%: 10.2 to 10.4 for Base, 8.1 to 8.3 for Large)
  • Significant speed gains (1.8× to 2.4×)
  • The speedup grows with model size (1.8× for Base, 2.4× for Large)

Comparison with Alternatives

Grouped-query attention (GQA) sits between MHA and MQA — a few K/V heads instead of one or all:

| Feature | MHA | GQA-8 | MQA |
| --- | --- | --- | --- |
| KV Parameters | 100% | 25% | 3.1% |
| Cache Size | 100% | 25% | 3.1% |
| Quality | Best | Near-best | Good |
| Implementation | Complex | Moderate | Simple |

Inference Speed: 1.5×
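One way to see MHA, GQA and MQA as points on the same scale is to treat the number of K/V heads as a parameter. The sketch below assumes 32 query heads (the count at which 8 K/V heads give the 25% in the table) and assumes consecutive query heads are grouped onto the same K/V head; the function names are illustrative.

```python
def kv_head_for_query_head(q_head, n_heads, n_kv_heads):
    """Index of the K/V head that a given query head reads from."""
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be divisible by n_kv_heads")
    group_size = n_heads // n_kv_heads   # query heads sharing one K/V head
    return q_head // group_size

def relative_kv_size(n_heads, n_kv_heads):
    """KV parameters and cache size as a fraction of the MHA baseline."""
    return n_kv_heads / n_heads

N_HEADS = 32  # assumed query-head count
for name, n_kv_heads in [("MHA", 32), ("GQA-8", 8), ("MQA", 1)]:
    mapping = [kv_head_for_query_head(q, N_HEADS, n_kv_heads) for q in range(N_HEADS)]
    print(name, relative_kv_size(N_HEADS, n_kv_heads), mapping[:6])
```

With `n_kv_heads` equal to the number of query heads every query head has its own K/V head (MHA); with `n_kv_heads = 1` every query head maps to index 0 (MQA); values in between give GQA.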
