
Flash Attention vs MHA vs GQA vs MQA: Comparing Attention Mechanisms

Summary
How Flash Attention, Multi-Head Attention (MHA), Grouped-Query Attention (GQA), and Multi-Query Attention (MQA) compare — algorithm vs architecture, KV-cache memory, quality trade-offs, and how to choose for production transformer inference.

Flash Attention, MHA, GQA, and MQA are often compared in the same breath, but they solve different problems. Flash Attention is an IO-aware algorithm for computing exact attention: it tiles the computation and fuses the softmax into the same kernel, so the N×N attention matrix is never materialized in HBM. MHA, GQA, and MQA are architectural variants that differ in how the key and value projections are shared across query heads. The two axes are orthogonal: production transformers such as Llama 3 combine GQA with Flash Attention 2. This article compares the four mechanisms, when each matters, and how they combine.
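To make the "never materialize the N×N matrix" idea concrete, here is a minimal NumPy sketch of tiled attention with a running (online) softmax. It is an illustration under simplifying assumptions, not the Flash Attention kernel itself: it tiles only over keys and values, runs on the CPU, and the names `naive_attention`, `tiled_attention`, and `block` are invented for this example. The real algorithm also tiles queries and keeps each tile in on-chip SRAM.

```python
import numpy as np

def naive_attention(q, k, v):
    """Reference version: builds the full N x N score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, block=16):
    """Visits K/V one block at a time, keeping only running softmax statistics."""
    n, d = q.shape
    out = np.zeros((n, v.shape[-1]))
    row_max = np.full((n, 1), -np.inf)  # running max score per query row
    row_sum = np.zeros((n, 1))          # running softmax denominator per query row
    for start in range(0, k.shape[0], block):
        k_blk = k[start:start + block]
        v_blk = v[start:start + block]
        s = q @ k_blk.T / np.sqrt(d)            # only an N x block slice of scores
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
        scale = np.exp(row_max - new_max)       # rescales what was accumulated so far
        p = np.exp(s - new_max)
        row_sum = row_sum * scale + p.sum(axis=-1, keepdims=True)
        out = out * scale + p @ v_blk
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 32)) for _ in range(3))
print(np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v)))
```

Both functions compute the same result; the tiled version simply never holds more than one `N × block` slice of scores at a time, which is the property that lets a fused GPU kernel avoid round trips to HBM.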

Compare the Variants

MHA and MQA are the endpoints of one axis, the number of K/V heads, with GQA in between.

[Interactive figure: a slider sets the number of K/V heads for 8 query heads. 8 K/V heads = MHA, 4 or 2 = GQA, 1 = MQA. At the setting shown (GQA with 2 K/V heads), 4 query heads share each K/V head; KV cache: 25% of MHA (4× smaller); quality: minor drop (illustrative, not measured); best for: serving and long context.]

Illustrative: real models use more heads (Llama-2-70B: 64 queries → 8 K/V). Same pattern.

Quick Decision Matrix

| Use Case | Recommended | Why |
|---|---|---|
| Research/Training | MHA | Maximum quality, parameter count |
| Cloud Serving (>30B) | GQA-8 | Balance of quality and efficiency |
| Edge Deployment | MQA | Minimum memory footprint |
| Long Context (>8K) | GQA-4 or MQA | Memory becomes critical |
| Batch Inference | GQA-8 | Good balance for multiple requests |
| Real-time Systems | MQA | Lowest latency |

Detailed Comparison

Architecture Differences

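| Feature | MHA | GQA | MQA |
|---|---|---|---|
| Q Projections | H separate | H separate | H separate |
| K Projections | H separate | G (one per group) | 1 shared |
| V Projections | H separate | G (one per group) | 1 shared |
| Q/K/V Parameters | 3 × H × D × d_model | (H + 2G) × D × d_model | (H + 2) × D × d_model |
| KV Heads | H | G | 1 |

Where H = number of query heads, G = number of K/V groups, D = per-head dimension, and d_model = H × D is the model width. Parameter counts cover the Q, K, and V projections of one layer and exclude the output projection.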
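As a rough check on how much the K/V sharing changes the projection weights, the sketch below counts Q, K, and V projection parameters for one layer. It assumes each projection maps the model width to (number of heads × per-head dimension), that the model width equals query heads × per-head dimension, and that biases and the output projection are ignored. The function name `qkv_projection_params` is hypothetical.

```python
def qkv_projection_params(num_q_heads, num_kv_heads, head_dim):
    """Q/K/V projection weights for one layer (no biases, no output projection)."""
    d_model = num_q_heads * head_dim               # assumed model width
    q_params = d_model * num_q_heads * head_dim    # one Q projection per query head
    kv_params = 2 * d_model * num_kv_heads * head_dim  # K and V, one per K/V head
    return q_params + kv_params

mha = qkv_projection_params(32, 32, 128)
for name, kv_heads in [("MHA", 32), ("GQA-8", 8), ("GQA-4", 4), ("MQA", 1)]:
    count = qkv_projection_params(32, kv_heads, 128)
    print(f"{name}: {count:,} parameters ({count / mha:.0%} of MHA)")
```

The parameter saving is real but modest next to the KV-cache saving, because the query projection is untouched by the sharing.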

Memory Footprint

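For a typical configuration (H=32 query heads, sequence length L=2048, per-head dimension D=128). The KV Cache Size column counts elements for one layer; multiply by the number of layers and the bytes per element to get bytes:

| Method | KV Cache Size | Relative | Example (Llama 70B) |
|---|---|---|---|
| MHA | 2 × L × H × D | 100% | 8.4 GB/sequence |
| GQA-8 | 2 × L × 8 × D | 25% | 2.1 GB/sequence |
| GQA-4 | 2 × L × 4 × D | 12.5% | 1.0 GB/sequence |
| MQA | 2 × L × 1 × D | 3.1% | 0.26 GB/sequence |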
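The cache-size formula can be turned into a small calculator. This is a sketch under stated assumptions: `kv_cache_bytes` is a hypothetical helper, the default of 2 bytes per element corresponds to fp16/bf16 storage, and `num_layers=1` reports a single layer, so pass the model's real layer count for an end-to-end figure.

```python
def kv_cache_bytes(seq_len, num_kv_heads, head_dim, num_layers=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The leading 2 accounts for storing both K and V.
    return 2 * seq_len * num_kv_heads * head_dim * num_layers * bytes_per_elem

mha = kv_cache_bytes(seq_len=2048, num_kv_heads=32, head_dim=128)
for name, kv_heads in [("MHA", 32), ("GQA-8", 8), ("GQA-4", 4), ("MQA", 1)]:
    size = kv_cache_bytes(seq_len=2048, num_kv_heads=kv_heads, head_dim=128)
    print(f"{name}: {size / 2**20:.0f} MiB per layer ({size / mha:.1%} of MHA)")
```

For the configuration above, MHA comes to 32 MiB per layer in fp16, and the other variants scale down linearly with the number of K/V heads.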

Mathematical Formulations

MHA: Full Expressiveness

$\text{head}_i = \text{Attention}(Q_i, K_i, V_i) \quad \forall\, i \in [1, H]$

Each head has independent parameters:

  • Total Q/K/V projection parameters: 3 · H · D · d_model, with d_model = H · D
  • KV cache per token: 2HD (elements per layer)

GQA: Balanced Approach

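$\text{head}_i = \text{Attention}(Q_i, K_{g(i)}, V_{g(i)})$

Where $g(i) = \lceil i \cdot G / H \rceil$ maps query head $i \in [1, H]$ to K/V group $g(i) \in [1, G]$ (equivalently $\lfloor i \cdot G / H \rfloor$ when heads and groups are 0-indexed):

  • Total Q/K/V projection parameters: (H + 2G) · D · d_model, with d_model = H · D
  • KV cache per token: 2GD (elements per layer)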

MQA: Maximum Sharing

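$\text{head}_i = \text{Attention}(Q_i, K_{\text{shared}}, V_{\text{shared}}) \quad \forall\, i$

All query heads share a single K/V head:

  • Total Q/K/V projection parameters: (H + 2) · D · d_model, with d_model = H · D
  • KV cache per token: 2D (elements per layer)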
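The three formulations can be expressed as one function, since MHA (G = H) and MQA (G = 1) are the endpoints of GQA. The following NumPy sketch assumes 0-indexed heads, already-projected inputs, no masking, and H divisible by G. The names `softmax` and `grouped_query_attention` are illustrative, not any library's API.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def grouped_query_attention(q, k, v):
    """q: (H, N, D); k and v: (G, N, D). Returns one output per query head."""
    num_q_heads, _, head_dim = q.shape
    num_kv_heads = k.shape[0]
    assert num_q_heads % num_kv_heads == 0, "H must be divisible by G"
    heads_per_group = num_q_heads // num_kv_heads
    # Query head i (0-indexed) reads K/V head i // heads_per_group.
    k_full = np.repeat(k, heads_per_group, axis=0)
    v_full = np.repeat(v, heads_per_group, axis=0)
    scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
    return softmax(scores) @ v_full

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 5, 4))               # H = 8 query heads
for name, g in [("MHA", 8), ("GQA", 2), ("MQA", 1)]:
    k = rng.standard_normal((g, 5, 4))
    v = rng.standard_normal((g, 5, 4))
    print(name, grouped_query_attention(q, k, v).shape)
```

The output shape is the same in all three cases; only the number of distinct K/V tensors that must be computed and cached changes. Expanding K/V with `np.repeat` keeps the sketch short, whereas an optimized kernel would index the shared heads instead of copying them.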

Production Model Configurations

| Model | Size | Attention Type | Config | Rationale |
|---|---|---|---|---|
| GPT-3 | 175B | MHA | 96 heads | Quality priority |
| Llama 2 | 70B | GQA | 64Q, 8KV | Balanced approach |
| Llama 2 | 7B | MHA | 32Q, 32KV | Small model, no K/V reduction needed |
| Mistral | 7B | GQA + SWA | 32Q, 8KV | Combined optimizations |
| Falcon | 7B | MQA | 71Q, 1KV | Maximum efficiency |
| PaLM | 540B | MQA | 48Q, 1KV | Inference cost at extreme scale |

Optimization Combinations

| Base Attention | + Optimization | Result | Example |
|---|---|---|---|
| GQA | + Flash Attention | Fast + memory efficient | Llama 2 |
| GQA | + Sliding Window | Local + global efficiency | Mistral |
| MQA | + Flash Attention | Maximum efficiency | Optimized Falcon |
| GQA | + RoPE | Efficient + better positions | Most modern LLMs |
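To illustrate why sliding-window attention stacks with GQA, the sketch below builds a causal sliding-window mask. It assumes the simplest form of the technique, where each token attends to itself and the previous `window - 1` tokens. The function name and the tiny sizes are chosen for this example and do not reflect any model's actual settings.

```python
import numpy as np

def sliding_window_causal_mask(seq_len, window):
    """True where query position i may attend to key position j."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    # Causal (j <= i) and within the last `window` positions (j > i - window).
    return (j <= i) & (j > i - window)

mask = sliding_window_causal_mask(seq_len=6, window=3)
print(mask.astype(int))
print("max keys per query:", mask.sum(axis=1).max())
```

Because no query looks back further than `window` positions, only that many K/V entries per layer need to be kept, and GQA then shrinks each of those entries by sharing K/V heads. The two savings multiply.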

Best Practices

Selection Guidelines

  1. Start with GQA-8 as default

    • Good balance for most use cases
    • Minimal quality loss
    • 4× smaller KV cache than MHA (at 32 query heads)
  2. Consider MQA when:

    • Serving at scale (more than 1000 QPS)
    • Memory constrained (less than 40GB)
    • Long context (more than 16K)
    • Batch size critical
  3. Stick with MHA when:

    • Research/experimentation
    • Quality is paramount
    • Small models (less than 1B params)
    • Abundant resources
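The guidelines above can be written down as a small rule-of-thumb helper. This is only a sketch: `suggest_attention` and its parameters are hypothetical, the thresholds are the ones listed above (with 16K read as 16 × 1024 tokens), and the order in which the rules are checked is an assumption, since the list does not say which rule wins when several apply.

```python
def suggest_attention(params_billions, qps=0, memory_gb=None,
                      context_len=0, quality_first=False):
    """Returns 'MHA', 'GQA-8', or 'MQA' following the guidelines in this section."""
    # MHA: research settings, quality-critical work, or models under 1B parameters.
    if quality_first or params_billions < 1:
        return "MHA"
    # MQA: high request rates, tight memory, or long contexts.
    memory_constrained = memory_gb is not None and memory_gb < 40
    if qps > 1000 or memory_constrained or context_len > 16 * 1024:
        return "MQA"
    # Otherwise fall back to the default recommendation.
    return "GQA-8"

print(suggest_attention(params_billions=70))
print(suggest_attention(params_billions=7, context_len=32_768))
print(suggest_attention(params_billions=0.5))
```

Treat the result as a starting point for benchmarking rather than a decision; the quality impact of fewer K/V heads depends on the model and the task.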
