Sparse Attention Patterns

Summary: Explore sparse attention mechanisms that reduce quadratic complexity to linear or sub-quadratic, enabling efficient processing of long sequences.

Sparse Attention Patterns: Efficient Long-Range Modeling

Sparse attention patterns reduce the quadratic complexity of self-attention by limiting which positions can attend to each other, enabling efficient processing of sequences with thousands or millions of tokens.

Interactive Sparse Pattern Explorer

[Interactive widget: toggles for the Local, Global, Strided, Random, and Masked primitives compose a mask and report its sparsity, keys per token, complexity, and the model it matches. Example state: window 2, sequence length n = 20, 1 global token, stride 4, about 2 random keys per row, giving 62% sparsity, 7.7 keys per token, and O(n·k) with k≈8, recognized as BigBird. The small n is illustrative; real models use thousands of tokens.]

The Sparsity Principle

Instead of computing attention between all pairs:

$$\mathrm{Attention}_{\text{full}}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) V, \qquad O(n^2)$$

Sparse attention uses patterns:

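$$\mathrm{Attention}_{\text{sparse}}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top \odot M}{\sqrt{d}}\right) V, \qquad O(n \times k)$$

Where M is a sparse binary mask (M_ij = 1 if position i may attend to position j; masked-out pairs are excluded from the softmax, in practice by setting their scores to −∞) and k, the number of keys each query attends to, is much smaller than n.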
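As a minimal sketch of the masked formula above, the following pure-Python function computes attention only over the pairs a mask allows. The function name `masked_attention` and the list-of-lists representation are assumptions made for illustration. A dense boolean mask like this shows the semantics, not the speedup: efficient implementations never materialize the n × n mask.

```python
import math

def masked_attention(Q, K, V, mask):
    """Scaled dot-product attention restricted to pairs where mask[i][j] is True.

    Q, K, V are lists of row vectors; mask is an n x n list of booleans.
    Assumes every query row has at least one allowed key.
    """
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        # Score only the allowed keys; skipping a pair is equivalent to
        # setting its score to -inf before the softmax.
        allowed = [j for j in range(len(K)) if mask[i][j]]
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d) for j in allowed]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        row = [0.0] * len(V[0])
        for w, j in zip(weights, allowed):
            for c, v in enumerate(V[j]):
                row[c] += (w / total) * v
        out.append(row)
    return out
```

With a diagonal mask each token attends only to itself, so the output equals V; with an all-True mask the function reduces to full attention.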

Complexity Analysis

Pattern	Time Complexity	Space Complexity	Effective Range
Full Attention	O(n²)	O(n²)	Global
Fixed Sparse	O(n × k)	O(n × k)	k positions
Strided	O(n × n/s)	O(n × n/s)	Every s-th position
Block-Local	O(n × b)	O(n × b)	Block size b
Global+Local	O(n × (g+w))	O(n × (g+w))	Global + window w
Axial	O(n × √n)	O(n × √n)	Row + column
Random	O(n × r)	O(n × r)	r random positions
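To make the table concrete, the sketch below counts how many query-key scores each pattern computes for a given sequence length. The helper names and the parameter values in the loop (s = 128, b = 64, g = 2, w = 512, r = 3) are illustrative assumptions, and the axial case assumes the sequence is laid out as a square grid.

```python
import math

def keys_per_query(n, pattern, **params):
    """Approximate number of keys each query attends to, per the table above."""
    if pattern == "full":
        return n
    if pattern == "fixed":
        return params["k"]
    if pattern == "strided":
        return math.ceil(n / params["s"])
    if pattern == "block_local":
        return params["b"]
    if pattern == "global_local":
        return params["g"] + params["w"]
    if pattern == "axial":
        side = math.isqrt(n)   # assumes n is a perfect square (side x side grid)
        return 2 * side - 1    # own row plus own column, self counted once
    if pattern == "random":
        return params["r"]
    raise ValueError(f"unknown pattern: {pattern}")

def score_count(n, pattern, **params):
    """Total query-key scores computed: n queries times keys per query."""
    return n * keys_per_query(n, pattern, **params)

n = 4096
for name, params in [
    ("full", {}),
    ("strided", {"s": 128}),
    ("block_local", {"b": 64}),
    ("global_local", {"g": 2, "w": 512}),
    ("axial", {}),
    ("random", {"r": 3}),
]:
    print(f"{name:13s} {score_count(n, name, **params):>12,d}")
```

At n = 4,096, full attention computes about 16.8 million scores per head, while every sparse row in this loop stays well below that because its keys-per-query count does not grow linearly with n (or, for strided and axial, grows more slowly).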

Production Models

Longformer (Global + Local)

Combines a sliding-window local pattern with a few global tokens.

Architecture:

768 hidden size, 12 attention heads
Window size: 512 tokens per layer
CLS token has global attention
Max sequence: 4,096 tokens

Key features: Efficient document processing, QA tasks
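A Longformer-style mask can be sketched as a band around the diagonal plus full rows and columns for the global tokens. This is an illustrative dense mask, not the library's implementation; `longformer_mask` and its parameters are hypothetical names, and `window` is taken to be the total window width, split evenly on both sides of each token.

```python
def longformer_mask(n, window, global_idx):
    """Boolean n x n mask: sliding window plus symmetric global attention.

    `window` is the total window width, so each token sees window // 2
    neighbours on each side. `global_idx` lists the global positions.
    """
    half = window // 2
    mask = [[abs(i - j) <= half for j in range(n)] for i in range(n)]
    for g in set(global_idx):
        for i in range(n):
            mask[i][g] = True  # every token attends to the global token
            mask[g][i] = True  # the global token attends to every token
    return mask

# Small example: 16 tokens, window of 4, position 0 (e.g. CLS) is global.
mask = longformer_mask(16, window=4, global_idx=[0])
print(sum(mask[0]), sum(mask[8]))  # keys seen by the global token and by token 8
```

The global row is dense while ordinary rows hold only the window plus the global positions, which is where the O(n × (g+w)) cost comes from.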

BigBird (Random + Window + Global)

Architecture:

768 hidden size, 12 attention heads
Block size: 64 tokens
3 random blocks + 2 global blocks + 3 sliding window blocks
Max sequence: 4,096 tokens

Key features: Proven to be a universal approximator of sequence functions and Turing complete, properties it shares with full attention
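The block structure listed above can be sketched as a block-level mask, where each entry says whether one query block sees one key block. This is a simplified illustration under stated assumptions: the first `num_global` blocks are treated as the global ones, random blocks are drawn with Python's `random` module, and the function name is hypothetical.

```python
import random

def bigbird_block_mask(num_blocks, num_global=2, window_blocks=3, num_random=3, seed=0):
    """Block-level boolean mask: entry [a][b] means query block a sees key block b."""
    rng = random.Random(seed)
    half = window_blocks // 2
    mask = [[False] * num_blocks for _ in range(num_blocks)]
    for a in range(num_blocks):
        for b in range(num_blocks):
            if abs(a - b) <= half:                # sliding window of blocks
                mask[a][b] = True
            if a < num_global or b < num_global:  # global blocks, both directions
                mask[a][b] = True
        # Random blocks are drawn from those not already covered in this row.
        free = [b for b in range(num_blocks) if not mask[a][b]]
        for b in rng.sample(free, min(num_random, len(free))):
            mask[a][b] = True
    return mask

# 4,096 tokens with 64-token blocks gives 64 blocks.
mask = bigbird_block_mask(4096 // 64)
print(sum(mask[10]))  # key blocks seen by an ordinary (non-global) query block
```

An interior query block sees 2 global + 3 window + 3 random = 8 key blocks, or 512 keys per token at block size 64, regardless of sequence length.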

Sparse Transformer (Fixed Pattern)

Architecture:

1,024 hidden size, 16 attention heads
Stride: 128 (attend every 128th position)
Local context: 128 tokens
Max sequence: 8,192 tokens

Key features: Early sparse attention work, generative modeling
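The strided fixed pattern can be sketched as a causal mask that combines a local window with a lookback to every stride-th earlier position. As a simplification, this sketch merges both components into one mask rather than assigning them to separate heads; the function name and the small example sizes are assumptions for illustration.

```python
def strided_causal_mask(n, stride, local):
    """Causal mask combining a local window with strided lookback."""
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):                 # causal: keys at or before i
            in_local = i - j < local           # the most recent `local` positions
            on_stride = (i - j) % stride == 0  # every stride-th earlier position
            mask[i][j] = in_local or on_stride
    return mask

# Small example with stride 4 and local context 4 instead of 128.
mask = strided_causal_mask(16, stride=4, local=4)
print([j for j in range(16) if mask[15][j]])  # keys visible to the last position
```

When the stride is close to √n, each row holds on the order of √n keys, which is how fixed patterns of this kind reach sub-quadratic cost.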

Best Practices

1. Pattern Selection Guidelines

Choose based on task type:

Task Type	Recommended Pattern	Reason
Classification	Global + Local	Global tokens (CLS) need full context
Generation	Sliding Window	Local context most important
Image/Video	Axial	Natural 2D structure
Long Documents	BigBird / Longformer	Balanced coverage
Structured Data	Fixed / Strided	Exploit known patterns
Very Long (>4K)	BigBird	Proven theoretical guarantees

2. Dynamic Sparsity

Concept: Adapt the sparsity pattern to the content of each input instead of using a fixed structure.

How it works:

Score each position's importance
Select the top-k positions to attend to
The sparsity ratio can vary with the content
More compute goes to inputs scored as important

Trade-offs:

More flexible than fixed patterns
Adds scoring overhead
May be less predictable for optimization
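The score-then-select steps above can be sketched as top-k attention. This sketch assumes the raw attention score itself serves as the importance score, which is only one possible choice, and `topk_attention` is a hypothetical name. Note that scoring every key here still costs O(n²); that is the scoring overhead listed in the trade-offs, and practical methods rely on cheaper scorers.

```python
import math

def topk_attention(Q, K, V, k):
    """For each query, attend only to the k keys with the highest scores."""
    d = len(Q[0])
    out = []
    for q in Q:
        # Step 1: score every position's importance for this query.
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d) for key in K]
        # Step 2: keep the indices of the k highest-scoring keys.
        keep = sorted(range(len(K)), key=lambda j: scores[j], reverse=True)[:k]
        # Step 3: softmax over the kept keys only.
        top = scores[keep[0]]
        weights = [math.exp(scores[j] - top) for j in keep]
        total = sum(weights)
        row = [0.0] * len(V[0])
        for w, j in zip(weights, keep):
            for c, v in enumerate(V[j]):
                row[c] += (w / total) * v
        out.append(row)
    return out
```

Setting k equal to the number of keys recovers full attention, while k = 1 makes each query copy the value of its single best-matching key.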
