Multi-Query Attention: Maximum Efficiency Through Sharing
Multi-Query Attention (MQA) is a variant of multi-head attention in which all query heads share a single key head and a single value head. This shrinks the KV cache by a factor equal to the number of heads, at the cost of a small loss in quality.
The Core Insight
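Traditional multi-head attention (MHA) maintains separate K, V projections for each of its h heads. With sequence length n and head dimension d:
- Memory: O(n · h · d) for the cached keys and values in each layer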
- Redundancy: Similar patterns learned across heads
MQA's breakthrough: One K, V pair serves all heads
- Memory: O(n · d)
- Efficiency: KV cache shrinks by a factor of h (32× for a 32-head model)
How MQA Works
The Architecture
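MQA keeps the multi-head output structure:

$$\mathrm{MQA}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$

Where each head computes:

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_{\text{shared}}^{\top}}{\sqrt{d}}\right) V_{\text{shared}}, \qquad Q_i = X W_i^Q,\quad K_{\text{shared}} = X W^K,\quad V_{\text{shared}} = X W^V$$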
Key differences from MHA:
- Queries: Still head-specific (Q1, Q2, ..., Qh)
- Keys/Values: Shared across all heads (Kshared, Vshared)
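The computation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: it assumes a single unbatched sequence, no causal mask, and no biases, and the function and weight names (`multi_query_attention`, `w_q`, `w_k`, `w_v`, `w_o`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """Unmasked MQA over one sequence (illustrative sketch).

    x:   (n, d_model)
    w_q: (d_model, n_heads * d_head)   one query projection per head
    w_k: (d_model, d_head)             single shared key projection
    w_v: (d_model, d_head)             single shared value projection
    w_o: (n_heads * d_head, d_model)   output projection
    """
    n, _ = x.shape
    d_head = w_k.shape[1]
    # Head-specific queries Q_1..Q_h: (n_heads, n, d_head)
    q = (x @ w_q).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    # Shared keys and values: (n, d_head). Only these need to be cached.
    k_shared = x @ w_k
    v_shared = x @ w_v
    # Broadcasting reuses k_shared for every head: (n_heads, n, n)
    scores = q @ k_shared.T / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v_shared  # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, n_heads * d_head)
    return concat @ w_o, (k_shared, v_shared)

rng = np.random.default_rng(0)
n, d_model, n_heads, d_head = 5, 16, 4, 8
x = rng.standard_normal((n, d_model))
w_q = rng.standard_normal((d_model, n_heads * d_head))
w_k = rng.standard_normal((d_model, d_head))
w_v = rng.standard_normal((d_model, d_head))
w_o = rng.standard_normal((n_heads * d_head, d_model))

out, (k_cache, v_cache) = multi_query_attention(x, w_q, w_k, w_v, w_o, n_heads)
print(out.shape, k_cache.shape, v_cache.shape)  # (5, 16) (5, 8) (5, 8)
```

The cached key and value arrays have shape (n, d_head) regardless of the number of query heads, which is where the memory saving comes from.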
Memory Savings Analysis
KV Cache Comparison
The savings land in the KV cache, the per-token Key/Value state kept during generation. For a model with 32 heads, 40 layers, sequence length 2048, and head dimension 128, storing each cached value in 1 byte (an 8-bit cache; FP16 doubles every total):
| Method | Cached Values per Token | Total for 2K Context (1 byte/value) | Reduction |
|---|---|---|---|
| MHA | 2 × 40 × 32 × 128 = 327,680 | 640 MiB | baseline |
| MQA | 2 × 40 × 1 × 128 = 10,240 | 20 MiB | 96.9% |
Because the cache is 32× smaller, the same GPU holds far more concurrent sequences, which raises the achievable batch size and with it serving throughput (e.g. a 50 GiB cache budget fits 80 such MHA sequences but 2,560 MQA ones).
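The arithmetic behind the table can be reproduced with a short calculation. The sketch below uses hypothetical helper names and makes the storage width an explicit parameter, since the totals in the table correspond to one byte per cached value and scale linearly with wider types such as FP16.

```python
def kv_cache_values_per_token(n_layers, n_kv_heads, d_head):
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * n_layers * n_kv_heads * d_head

def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value):
    per_token = kv_cache_values_per_token(n_layers, n_kv_heads, d_head)
    return per_token * seq_len * bytes_per_value

MIB = 1024 ** 2

# Configuration from the text: 40 layers, head dimension 128, 2048 tokens.
for name, n_kv_heads in [("MHA", 32), ("MQA", 1)]:
    per_token = kv_cache_values_per_token(40, n_kv_heads, 128)
    for dtype, width in [("8-bit", 1), ("FP16", 2)]:
        total_mib = kv_cache_bytes(40, n_kv_heads, 128, 2048, width) / MIB
        print(f"{name} {dtype}: {per_token:,} values/token, {total_mib:,.0f} MiB")
# MHA 8-bit: 327,680 values/token, 640 MiB
# MHA FP16: 327,680 values/token, 1,280 MiB
# MQA 8-bit: 10,240 values/token, 20 MiB
# MQA FP16: 10,240 values/token, 40 MiB
```

The ratio between MHA and MQA stays at 32× whatever the storage width, because only the number of KV heads differs.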
Quality Considerations
The Trade-off
MQA trades expressiveness for efficiency:
Parameter Count Comparison:
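- MHA: n_heads × d_model × d_head × 3 (separate Q, K, V per head)
- MQA: n_heads × d_model × d_head + 2 × d_model × d_head (Q per head, shared K, V)
- Reduction: (2 × n_heads − 2) / (3 × n_heads) of the Q/K/V projection parameters, about 65% for 32 heads and approaching 2/3 as the head count grows. The output projection (n_heads × d_head × d_model) is unchanged, so the attention layer as a whole shrinks by about 48% for 32 heads.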
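To see how the formulas above play out for a concrete configuration, the sketch below evaluates them for 32 heads of dimension 128. The model width `d_model = 4096` is an assumption (heads × head dimension), biases are ignored, and the helper name is hypothetical. The optional flag adds the output projection, which the formulas above leave out.

```python
def attention_params(n_heads, d_model, d_head, n_kv_heads, include_output=False):
    q = n_heads * d_model * d_head            # one query projection per head
    kv = 2 * n_kv_heads * d_model * d_head    # key and value projections
    o = n_heads * d_head * d_model if include_output else 0
    return q + kv + o

n_heads, d_model, d_head = 32, 4096, 128  # d_model is an assumed value

for include_output in (False, True):
    mha = attention_params(n_heads, d_model, d_head, n_kv_heads=n_heads,
                           include_output=include_output)
    mqa = attention_params(n_heads, d_model, d_head, n_kv_heads=1,
                           include_output=include_output)
    label = "Q/K/V + output" if include_output else "Q/K/V only"
    print(f"{label}: MHA {mha:,}, MQA {mqa:,}, reduction {1 - mqa / mha:.1%}")
# Q/K/V only: MHA 50,331,648, MQA 17,825,792, reduction 64.6%
# Q/K/V + output: MHA 67,108,864, MQA 34,603,008, reduction 48.4%
```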
Empirical Results
From the original paper (Shazeer, 2019):
| Model | Attention Type | Perplexity | Speed |
|---|---|---|---|
| Base | MHA | 10.2 | 1.0× |
| Base | MQA | 10.4 | 1.8× |
| Large | MHA | 8.1 | 1.0× |
| Large | MQA | 8.3 | 2.4× |
Key findings:
- Small quality loss (perplexity rises by about 2-2.5%)
- Significant speed gains (1.8-2.4×)
- Speedup grows with model size (1.8× for Base, 2.4× for Large)
Comparison with Alternatives
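Grouped-query attention (GQA) sits between MHA and MQA: it keeps a few K/V heads, each shared by a group of query heads, instead of one for all heads (MQA) or one per query head (MHA). The percentages below assume 32 query heads, so GQA-8 keeps 8 K/V heads: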
| Feature | MHA | GQA-8 | MQA |
|---|---|---|---|
| KV Parameters | 100% | 25% | 3.1% |
| Cache Size | 100% | 25% | 3.1% |
| Quality | Best | Near-best | Good |
| Inference Speed | 1× | 1.5× | 2× |
| Implementation | Complex | Moderate | Simple |
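One way to view the three variants is as a single scheme parameterised by the number of K/V heads. The sketch below assumes 32 query heads and assigns contiguous blocks of query heads to each K/V head, which is a common convention but an assumption here; the helper names are hypothetical.

```python
def kv_head_for(query_head, n_heads, n_kv_heads):
    """Index of the K/V head that a given query head reads from."""
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be divisible by n_kv_heads")
    group_size = n_heads // n_kv_heads
    return query_head // group_size

def kv_fraction(n_heads, n_kv_heads):
    """K/V parameters and cache size relative to MHA."""
    return n_kv_heads / n_heads

N_HEADS = 32  # assumed number of query heads
for name, n_kv_heads in [("MHA", 32), ("GQA-8", 8), ("MQA", 1)]:
    mapping = [kv_head_for(h, N_HEADS, n_kv_heads) for h in range(8)]
    print(f"{name}: {kv_fraction(N_HEADS, n_kv_heads):.1%} of MHA, "
          f"query heads 0-7 -> K/V heads {mapping}")
# MHA: 100.0% of MHA, query heads 0-7 -> K/V heads [0, 1, 2, 3, 4, 5, 6, 7]
# GQA-8: 25.0% of MHA, query heads 0-7 -> K/V heads [0, 0, 0, 0, 1, 1, 1, 1]
# MQA: 3.1% of MHA, query heads 0-7 -> K/V heads [0, 0, 0, 0, 0, 0, 0, 0]
```

Setting the number of K/V heads to 1 recovers MQA and setting it to the number of query heads recovers MHA.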
