Flash Attention, MHA, GQA, and MQA are often compared in the same breath, but they solve different problems. Flash Attention is an IO-aware algorithm for computing exact attention: it tiles the computation and fuses the softmax into it, so the N×N attention matrix is never materialized in HBM. MHA, GQA, and MQA are architectural variants that differ in how key and value projections are shared across query heads. The two are orthogonal: production transformers like Llama 3 ship Flash Attention 2 and GQA together. This article compares the four mechanisms, when each matters, and how they combine.
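The tiling idea can be illustrated with a running ("online") softmax. The sketch below is a minimal pure-Python illustration for a single query vector, not the actual Flash Attention kernel; the function names and the `block_size` parameter are hypothetical and chosen only for this example. It processes keys and values block by block and keeps only a running maximum, a running normalizer, and an output accumulator, so the full score vector is never stored at once.

```python
import math

def naive_attention(q, keys, values):
    """Reference: compute every score first, then softmax, then the weighted sum."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dv = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dv)]

def tiled_attention(q, keys, values, block_size=2):
    """Same result, but K/V are consumed in blocks with a running softmax."""
    d = len(q)
    dv = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0          # softmax normalizer relative to running_max
    acc = [0.0] * dv           # unnormalized output accumulator
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results so they are relative to the new max.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dv):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions should agree up to floating-point rounding for any block size. The real algorithm applies the same rescaling to tiles of Q, K, and V held in on-chip SRAM.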
Compare the Variants
MHA and MQA are the two endpoints of a single axis, the number of K/V heads, with GQA in between. With GQA, several query heads share each K/V head (for example, 4 query heads per K/V head). Real models use more heads (Llama-2-70B: 64 query heads → 8 K/V heads), but the pattern is the same.
Quick Decision Matrix
| Use Case | Recommended | Why |
|---|---|---|
| Research/Training | MHA | Maximum quality; every head keeps its own K/V parameters |
| Cloud Serving (>30B) | GQA-8 | Balance of quality and efficiency |
| Edge Deployment | MQA | Minimum memory footprint |
| Long Context (>8K) | GQA-4 or MQA | Memory becomes critical |
| Batch Inference | GQA-8 | Good balance for multiple requests |
| Real-time Systems | MQA | Lowest latency |
Detailed Comparison
Architecture Differences
| Feature | MHA | GQA | MQA |
|---|---|---|---|
| Q Projections | H separate | H separate | H separate |
| K Projections | H separate | G groups | 1 shared |
| V Projections | H separate | G groups | 1 shared |
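| Parameters (Q, K, V projections) | 3 × H × D × d_model | (H + 2G) × D × d_model | (H + 2) × D × d_model |
| KV Heads | H | G | 1 |
Where H = number of query heads, G = number of K/V groups, D = per-head dimension, and d_model = H × D is the model dimension. Parameter counts are per layer and exclude the output projection and biases.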
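To make the parameter comparison concrete, here is a small sketch that counts Q, K, and V projection weights for one layer. It assumes D is the per-head dimension, that the model dimension equals H × D, and that there are no biases; the function name is hypothetical and the output projection is deliberately left out.

```python
def qkv_projection_params(num_q_heads, num_kv_heads, head_dim):
    """Count Q, K, V projection weights for one attention layer.

    Assumptions (not from any specific model): d_model = num_q_heads * head_dim,
    no bias terms, output projection excluded.
    """
    d_model = num_q_heads * head_dim
    q_params = d_model * num_q_heads * head_dim
    kv_params = 2 * d_model * num_kv_heads * head_dim
    return q_params + kv_params

if __name__ == "__main__":
    H, D = 32, 128
    for label, kv_heads in [("MHA", H), ("GQA-8", 8), ("MQA", 1)]:
        print(label, qkv_projection_params(H, kv_heads, D))
```

With H = 32 and D = 128, sharing K/V heads shrinks only the K and V terms, which is why the parameter savings are smaller than the KV-cache savings.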
Memory Footprint
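For a typical configuration (H = 32 query heads, sequence length L = 2048, per-head dimension D = 128), the KV cache size below counts elements per layer; multiply by the number of layers and the bytes per element to get a total. The Example column applies the same ratios to a 70B-scale baseline: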
| Method | KV Cache Size | Relative | Example (Llama 70B) |
|---|---|---|---|
| MHA | 2 × L × H × D | 100% | 8.4 GB/sequence |
| GQA-8 | 2 × L × 8 × D | 25% | 2.1 GB/sequence |
| GQA-4 | 2 × L × 4 × D | 12.5% | 1.0 GB/sequence |
| MQA | 2 × L × 1 × D | 3.1% | 0.26 GB/sequence |
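The table's formula can be turned into a quick calculator. This is a sketch under stated assumptions: `num_layers` and `bytes_per_elem` are parameters added here for illustration (the table's formula counts elements for a single layer), and the function name is hypothetical.

```python
def kv_cache_bytes(seq_len, num_kv_heads, head_dim, num_layers=1, bytes_per_elem=2):
    """KV cache size in bytes: one K and one V vector per token, per KV head, per layer.

    bytes_per_elem=2 assumes 16-bit storage; num_layers=1 matches the per-layer formula.
    """
    return 2 * seq_len * num_kv_heads * head_dim * num_layers * bytes_per_elem

if __name__ == "__main__":
    L, H, D = 2048, 32, 128
    baseline = kv_cache_bytes(L, H, D)
    for label, kv_heads in [("MHA", H), ("GQA-8", 8), ("GQA-4", 4), ("MQA", 1)]:
        size = kv_cache_bytes(L, kv_heads, D)
        print(f"{label}: {size} bytes per layer, {100 * size / baseline:.1f}% of MHA")
```

The relative column depends only on the ratio of K/V heads to query heads, so it is unchanged by the layer count or the storage precision.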
Mathematical Formulations
MHA: Full Expressiveness
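Each head has its own query, key, and value projections:

$$\text{head}_i = \text{Attention}(X W_i^Q,\; X W_i^K,\; X W_i^V), \quad i = 0, \dots, H-1$$

- Total attention parameters (Q, K, V projections, per layer): $3H \cdot D \cdot d_{model}$, with $D$ the per-head dimension and $d_{model} = H \cdot D$
- KV cache per token, per layer: $2HD$ elements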
GQA: Balanced Approach
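Each query head keeps its own query projection but uses the key and value projections of its group:

$$\text{head}_i = \text{Attention}(X W_i^Q,\; X W_{g(i)}^K,\; X W_{g(i)}^V)$$

where $g(i) = \lfloor i \cdot G / H \rfloor$ maps query head $i$ (0-indexed) to its K/V group:
- Total attention parameters (Q, K, V projections, per layer): $(H + 2G) \cdot D \cdot d_{model}$, with $D$ the per-head dimension and $d_{model} = H \cdot D$
- KV cache per token, per layer: $2GD$ elements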
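The head-to-group mapping g(i) is simple enough to check by hand. The sketch below assumes 0-indexed heads and that H is divisible by G; the function name is hypothetical.

```python
def kv_group_for_head(head_index, num_q_heads, num_kv_groups):
    """Return g(i) = floor(i * G / H) for a 0-indexed query head i."""
    if num_q_heads % num_kv_groups != 0:
        raise ValueError("number of query heads must be divisible by the number of groups")
    if not 0 <= head_index < num_q_heads:
        raise ValueError("head index out of range")
    return (head_index * num_kv_groups) // num_q_heads

if __name__ == "__main__":
    H = 8
    for G in (8, 2, 1):  # MHA, GQA, MQA
        print(G, [kv_group_for_head(i, H, G) for i in range(H)])
```

Setting G = H gives the identity mapping (MHA), and G = 1 sends every head to group 0 (MQA), which matches the view of the three variants as points on one axis.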
MQA: Maximum Sharing
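All heads share a single key projection and a single value projection:

$$\text{head}_i = \text{Attention}(X W_i^Q,\; X W^K,\; X W^V)$$

- Total attention parameters (Q, K, V projections, per layer): $(H + 2) \cdot D \cdot d_{model}$, with $D$ the per-head dimension and $d_{model} = H \cdot D$
- KV cache per token, per layer: $2D$ elements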
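The three formulations differ only in which cached K/V a query head reads. The following pure-Python sketch of a single decode step makes that explicit; the function names and the nested-list data layout are assumptions made for this illustration, not any library's API. The number of K/V caches passed in decides whether the call behaves as MHA, GQA, or MQA.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def grouped_attention_step(queries, k_cache, v_cache):
    """One decode step.

    queries: H query vectors (one per head), each of length D.
    k_cache, v_cache: G caches, each a list of T cached vectors of length D.
    G == H behaves as MHA, 1 < G < H as GQA, G == 1 as MQA.
    """
    num_heads, num_groups = len(queries), len(k_cache)
    outputs = []
    for i, q in enumerate(queries):
        g = (i * num_groups) // num_heads  # head-to-group mapping
        scale = 1.0 / math.sqrt(len(q))
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_cache[g]]
        weights = softmax(scores)
        dim = len(v_cache[g][0])
        outputs.append([sum(w * v[j] for w, v in zip(weights, v_cache[g]))
                        for j in range(dim)])
    return outputs
```

Heads in the same group attend over identical keys and values, so they differ only through their own query projection.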
Production Model Configurations
| Model | Size | Attention Type | Config | Rationale |
|---|---|---|---|---|
| GPT-3 | 175B | MHA | 96 heads | Quality priority |
| Llama 2 | 70B | GQA | 64Q, 8KV | Balanced approach |
| Llama 2 | 7B | MHA | 32Q, 32KV | Small model, no K/V reduction needed |
| Mistral | 7B | GQA + SWA | 32Q, 8KV | Combined optimizations |
| Falcon | 7B | MQA | 71Q, 1KV | Maximum efficiency |
| PaLM | 540B | MQA | 48Q, 1KV | Extreme scale favors MQA |
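A variant can be read directly off the head counts. The sketch below uses a hypothetical `AttnConfig` helper and a handful of head counts taken from the table above to classify each configuration and compute how many query heads share one K/V head.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AttnConfig:
    name: str
    q_heads: int
    kv_heads: int

    @property
    def kind(self):
        """Classify by K/V head count: equal to Q heads is MHA, one is MQA, else GQA."""
        if self.kv_heads == self.q_heads:
            return "MHA"
        if self.kv_heads == 1:
            return "MQA"
        return "GQA"

    @property
    def queries_per_kv(self):
        return self.q_heads // self.kv_heads

# Head counts as listed in the table above.
CONFIGS = [
    AttnConfig("GPT-3 175B", 96, 96),
    AttnConfig("Llama 2 70B", 64, 8),
    AttnConfig("Mistral 7B", 32, 8),
    AttnConfig("PaLM 540B", 48, 1),
]

if __name__ == "__main__":
    for cfg in CONFIGS:
        print(cfg.name, cfg.kind, f"{cfg.queries_per_kv} query heads per K/V head")
```

The `queries_per_kv` ratio is also the factor by which the KV cache shrinks relative to an MHA model with the same number of query heads.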
Optimization Combinations
| Base Attention | + Optimization | Result | Example |
|---|---|---|---|
| GQA | + Flash Attention | Fast + memory efficient | Llama 2 |
| GQA | + Sliding Window | Local + global efficiency | Mistral |
| MQA | + Flash Attention | Maximum efficiency | Optimized Falcon |
| GQA | + RoPE | Efficient + better positions | Most modern LLMs |
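Sliding-window attention restricts which key positions a query may see, independently of how K/V heads are shared. As a rough illustration, the sketch below builds a causal sliding-window mask in pure Python; the function names and the `window` parameter are hypothetical, and real implementations apply the restriction inside the attention kernel rather than building a dense mask.

```python
def sliding_window_causal_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Position i sees itself and the previous window - 1 positions.
    """
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]

def attended_pairs(mask):
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)

if __name__ == "__main__":
    for row in sliding_window_causal_mask(5, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

With a fixed window the number of attended pairs grows linearly with sequence length, while GQA separately reduces how many K/V heads are stored for each of those positions.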
Best Practices
Selection Guidelines
- Start with GQA-8 as the default:
  - Good balance for most use cases
  - Minimal quality loss
  - 4× smaller KV cache than MHA with 32 query heads
- Consider MQA when:
  - Serving at scale (more than 1000 QPS)
  - Memory is constrained (less than 40 GB)
  - Context is long (more than 16K tokens)
  - Batch size is critical
- Stick with MHA when:
  - Doing research or experimentation
  - Quality is paramount
  - The model is small (fewer than 1B parameters)
  - Resources are abundant
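The guidelines above can be written down as a small rule-of-thumb function. This is only a sketch: the function name, the argument names, the order in which the rules are checked, and the reading of "16K" as 16 × 1024 tokens are all assumptions made for illustration, and the thresholds are the ones quoted in the list rather than measured values.

```python
def recommend_attention(*, qps=0.0, memory_gb=float("inf"), context_len=0,
                        params_billion=float("inf"), research=False):
    """Return "MHA", "MQA", or "GQA-8" following the selection guidelines.

    Rule order is an assumption: the MHA conditions are checked first,
    then the MQA conditions, and GQA-8 is the default.
    """
    if research or params_billion < 1:
        return "MHA"
    if qps > 1000 or memory_gb < 40 or context_len > 16 * 1024:
        return "MQA"
    return "GQA-8"

if __name__ == "__main__":
    print(recommend_attention())                                   # default case
    print(recommend_attention(qps=5000))                           # high-throughput serving
    print(recommend_attention(research=True, params_billion=0.3))  # small research model
```

Treat the result as a starting point for benchmarking rather than a decision, since quality and latency trade-offs depend on the model and workload.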
