Flash Attention vs MHA vs GQA vs MQA: Comparing Attention Mechanisms

Summary: How Flash Attention, Multi-Head Attention (MHA), Grouped-Query Attention (GQA), and Multi-Query Attention (MQA) compare — algorithm vs architecture, KV-cache memory, quality trade-offs, and how to choose for production transformer inference.

Flash Attention, MHA, GQA, and MQA are often compared in the same breath, but they solve different problems. Flash Attention is an IO-aware algorithm for computing exact attention: it tiles Q, K, and V into blocks and fuses the softmax into the same kernel, so the N×N attention matrix is never materialized in HBM. MHA, GQA, and MQA are architectural variants that differ in how key and value projections are shared across query heads. The two axes are orthogonal: production models such as Llama 3 use GQA and are commonly run with Flash Attention 2 kernels. This article compares the four mechanisms, when each matters, and how they combine.
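The tiling idea can be illustrated for a single query vector. The following is a minimal pure-Python sketch of the running-max ("online") softmax that block-wise attention relies on; it is not the Flash Attention kernel itself, and the function names and the `block` parameter are hypothetical, chosen only for this example.

```python
import math


def naive_attention_row(q, keys, values):
    """Reference: compute every score for one query, then softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def tiled_attention_row(q, keys, values, block=2):
    """Visit K/V block by block, keeping only a running max, sum, and output."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalized output accumulator
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same result up to floating-point error; the tiled version only ever holds one block of scores at a time, which is the property that lets a real kernel keep its working set in on-chip memory.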

Compare the Variants

MHA and MQA are the endpoints of a single axis, the number of K/V heads, with GQA in between. In an illustrative model with 8 query heads, 8 K/V heads is MHA, 4 or 2 K/V heads is GQA, and 1 K/V head is MQA.

Example: GQA with 2 K/V heads. 4 query heads share each K/V head, and the KV cache is 25% of the MHA size (4× smaller). Quality: minor drop (illustrative, not measured). Best for: serving and long context.

Real models use more heads (Llama-2-70B: 64 queries → 8 K/V), but the pattern is the same.

Quick Decision Matrix

Use Case	Recommended	Why
Research/Training	MHA	Maximum quality, parameter count
Cloud Serving (>30B)	GQA-8	Balance of quality and efficiency
Edge Deployment	MQA	Minimum memory footprint
Long Context (>8K)	GQA-4 or MQA	Memory becomes critical
Batch Inference	GQA-8	Good balance for multiple requests
Real-time Systems	MQA	Lowest latency

Detailed Comparison

Architecture Differences

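Feature	MHA	GQA	MQA
Q Projections	H separate	H separate	H separate
K Projections	H separate	G groups	1 shared
V Projections	H separate	G groups	1 shared
Q/K/V Parameters	3H × d_model × D	(H + 2G) × d_model × D	(H + 2) × d_model × D
KV Heads	H	G	1

Where H = number of query heads, G = number of K/V groups, D = per-head dimension, and d_model = model (hidden) dimension. Parameter counts cover the Q, K, and V projection weights only; the output projection and biases are excluded.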
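To make the parameter comparison concrete, here is a small sketch that counts Q, K, and V projection weights. It assumes bias-free projections, excludes the output projection, and treats each projection as a d_model × (heads × head_dim) matrix; the function name and the example sizes (d_model = 4096, head_dim = 128) are illustrative assumptions, not values taken from a specific model.

```python
def qkv_projection_params(num_q_heads, num_kv_heads, head_dim, d_model):
    """Count Q, K, V projection weights (no biases, no output projection)."""
    q_params = d_model * num_q_heads * head_dim
    kv_params = 2 * d_model * num_kv_heads * head_dim  # K and V
    return q_params + kv_params


# Illustrative sizes: 32 query heads of dimension 128, d_model = 4096.
mha = qkv_projection_params(32, 32, 128, 4096)  # every head has its own K/V
gqa8 = qkv_projection_params(32, 8, 128, 4096)  # 8 K/V groups
mqa = qkv_projection_params(32, 1, 128, 4096)   # one shared K/V head
print(mha, gqa8, mqa)
```

The saving in weights is modest compared with the saving in KV cache, because the query projection is unchanged in all three variants.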

Memory Footprint

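For a typical configuration (H=32 query heads, sequence length L=2048, head dimension D=128), the KV cache of one layer holds the number of elements shown here. Multiply by the number of layers and the bytes per element to get the total size.

Method	KV Cache Size (elements per layer)	Relative	Per layer at fp16 (2 bytes)
MHA	2 × L × H × D	100%	32 MiB/sequence
GQA-8	2 × L × 8 × D	25%	8 MiB/sequence
GQA-4	2 × L × 4 × D	12.5%	4 MiB/sequence
MQA	2 × L × 1 × D	3.1%	1 MiB/sequence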
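The cache formula is easy to evaluate directly. This sketch assumes the leading factor of 2 stands for keys plus values, and it adds two parameters the table leaves implicit, the number of layers and the bytes per element; both defaults (one layer, 2 bytes for fp16) are assumptions made for illustration.

```python
def kv_cache_bytes(seq_len, num_kv_heads, head_dim,
                   num_layers=1, bytes_per_elem=2):
    """KV cache size in bytes for one sequence (keys + values)."""
    return 2 * seq_len * num_kv_heads * head_dim * num_layers * bytes_per_elem


# Configuration from the text: H=32, L=2048, D=128, one layer, fp16.
sizes = {
    "MHA": kv_cache_bytes(2048, 32, 128),
    "GQA-8": kv_cache_bytes(2048, 8, 128),
    "GQA-4": kv_cache_bytes(2048, 4, 128),
    "MQA": kv_cache_bytes(2048, 1, 128),
}
for name, size in sizes.items():
    print(f"{name}: {size / 2**20:.0f} MiB per layer, "
          f"{100 * size / sizes['MHA']:.1f}% of MHA")
```

Passing a model's real layer count as `num_layers` turns the per-layer figure into a per-sequence total.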

Mathematical Formulations

MHA: Full Expressiveness

head_i = Attention(Q_i, K_i, V_i) ∀ i ∈ [1, H]

Each head has independent parameters:

Q/K/V projection parameters: 3H · d_model · D (d_model = model dimension, D = head dimension)
KV cache per token, per layer: 2HD elements

GQA: Balanced Approach

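head_i = Attention(Q_i, K_g(i), V_g(i)) ∀ i ∈ [1, H]

Where g(i) = ⌈i · G / H⌉ maps each of the H query heads to one of the G K/V groups, so every group serves H/G consecutive heads:

Q/K/V projection parameters: (H + 2G) · d_model · D (d_model = model dimension, D = head dimension)
KV cache per token, per layer: 2GD elements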
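The head-to-group mapping can be sketched in a few lines. This version assumes 0-indexed heads and groups (the usual convention in code) and that the number of query heads is divisible by the number of K/V groups; the function name is hypothetical.

```python
def kv_group_for_head(head, num_q_heads, num_kv_heads):
    """Return the 0-indexed K/V group used by a 0-indexed query head."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("query heads must be divisible by K/V groups")
    if not 0 <= head < num_q_heads:
        raise ValueError("head index out of range")
    # Consecutive blocks of H/G query heads share one K/V group.
    return head * num_kv_heads // num_q_heads


# 8 query heads over 2 K/V groups: heads 0-3 use group 0, heads 4-7 use group 1.
mapping = [kv_group_for_head(i, 8, 2) for i in range(8)]
print(mapping)
```

Setting the group count equal to the head count gives the identity mapping (MHA), and a single group sends every head to group 0 (MQA).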

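MQA: Maximum Efficiency

head_i = Attention(Q_i, K_shared, V_shared) ∀ i ∈ [1, H]

All heads share the same K and V:

Q/K/V projection parameters: (H + 2) · d_model · D (d_model = model dimension, D = head dimension)
KV cache per token, per layer: 2D elements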
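The three formulations differ only in how many K/V sets exist, which a single function can show. The sketch below is a plain-Python illustration for one query position, with no masking, batching, or learned projections; the names `attend` and `grouped_attention` are hypothetical helpers written for this example.

```python
import math


def attend(q, keys, values):
    """Scaled dot-product attention for one query vector over a sequence."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[j] for e, v in zip(exps, values)) / total
            for j in range(dim)]


def grouped_attention(queries, kv_keys, kv_values):
    """queries: H per-head query vectors.
    kv_keys / kv_values: G sequences of per-token vectors.
    G == H is MHA, 1 < G < H is GQA, G == 1 is MQA."""
    num_heads, num_groups = len(queries), len(kv_keys)
    if num_heads % num_groups != 0:
        raise ValueError("H must be divisible by G")
    heads_per_group = num_heads // num_groups
    return [attend(q, kv_keys[i // heads_per_group],
                   kv_values[i // heads_per_group])
            for i, q in enumerate(queries)]
```

Calling `grouped_attention` with one K/V set gives the same output as calling it with H identical copies of that set, which is a useful sanity check that sharing changes storage rather than the attention computation itself.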

Production Model Configurations

Model	Size	Attention Type	Config	Rationale
GPT-3	175B	MHA	96 heads	Quality priority
Llama 2	70B	GQA	64Q, 8KV	Balanced approach
Llama 2	7B	MHA	32Q, 32KV	Small model, less reduction needed
Mistral	7B	GQA + SWA	32Q, 8KV	Combined optimizations
Falcon	7B	MQA	71Q, 1KV	Maximum efficiency
PaLM	540B	MQA	48Q, 1KV	Decode efficiency at extreme scale
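The Config column is enough to classify a model and estimate its KV-cache saving relative to an MHA layout with the same number of query heads. This sketch uses four (query heads, K/V heads) pairs from the table; the helper names are hypothetical, and the fraction ignores layer count and head dimension, which cancel out in the ratio.

```python
def attention_type(num_q_heads, num_kv_heads):
    """Classify a head configuration as MHA, GQA, or MQA."""
    if num_kv_heads == num_q_heads:
        return "MHA"
    if num_kv_heads == 1:
        return "MQA"
    return "GQA"


def kv_cache_fraction(num_q_heads, num_kv_heads):
    """KV cache size relative to MHA with the same number of query heads."""
    return num_kv_heads / num_q_heads


# (query heads, K/V heads) pairs taken from the table.
configs = {
    "GPT-3 175B": (96, 96),
    "Llama 2 70B": (64, 8),
    "Mistral 7B": (32, 8),
    "PaLM 540B": (48, 1),
}
for name, (q_heads, kv_heads) in configs.items():
    print(f"{name}: {attention_type(q_heads, kv_heads)}, "
          f"{100 * kv_cache_fraction(q_heads, kv_heads):.1f}% of MHA cache")
```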

Optimization Combinations

Base Attention	+ Optimization	Result	Example
GQA	+ Flash Attention	Fast + memory efficient	Llama 2 70B
GQA	+ Sliding Window	Local + global efficiency	Mistral
MQA	+ Flash Attention	Maximum efficiency	Optimized Falcon
GQA	+ RoPE	Efficient + better positions	Most modern LLMs

Best Practices

Selection Guidelines

Start with GQA-8 as the default:
- Good balance for most use cases
- Minimal quality loss
- 4× smaller KV cache than MHA at 32 query heads (8× at 64)
Consider MQA when:
- Serving at scale (more than 1000 QPS)
- Memory constrained (less than 40GB)
- Long context (more than 16K)
- Batch size critical
Stick with MHA when:
- Research/experimentation
- Quality is paramount
- Small models (less than 1B params)
- Abundant resources
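The guidelines above can be encoded as a small rule-based helper. This is only a sketch of the heuristics as listed: the function name, parameter names, and the order in which the rules are checked are assumptions, and "16K" is read as 16 × 1024 tokens.

```python
def recommend_attention(params_billion, qps=0.0, memory_gb=float("inf"),
                        context_len=0, quality_first=False):
    """Return 'MHA', 'GQA-8', or 'MQA' following the selection guidelines."""
    # MHA: research or quality-first work, and small models.
    if quality_first or params_billion < 1:
        return "MHA"
    # MQA: high request rates, tight memory, or very long context.
    if qps > 1000 or memory_gb < 40 or context_len > 16 * 1024:
        return "MQA"
    # Otherwise GQA-8 is the default.
    return "GQA-8"


print(recommend_attention(70))                      # default case
print(recommend_attention(7, memory_gb=24))         # memory constrained
print(recommend_attention(0.5))                     # small model
print(recommend_attention(13, context_len=32_768))  # long context
```

In practice these thresholds are starting points to be checked against measured quality and latency for the model in question.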
