Sparse Attention Patterns: Efficient Long-Range Modeling
Sparse attention patterns reduce the quadratic complexity of self-attention by limiting which positions can attend to each other, enabling efficient processing of sequences with thousands or millions of tokens.
The Sparsity Principle
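Instead of computing attention between all pairs of positions:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

sparse attention restricts each query to a pattern of allowed keys:

$$\text{SparseAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V, \qquad M_{ij} = \begin{cases} 0 & \text{if position } i \text{ may attend to position } j \\ -\infty & \text{otherwise} \end{cases}$$

Here M is a sparse mask that allows roughly k keys per query, where k is much less than the sequence length n.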
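The masking idea can be sketched in a few lines of NumPy. This is a minimal illustration rather than an efficient implementation: it still builds the full n × n score matrix and only shows what the mask does to the attention weights. The function names `masked_attention` and `sliding_window_mask` are hypothetical helpers introduced for this example.

```python
import numpy as np

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to pairs where mask is True.

    q, k, v: arrays of shape (n, d). mask: boolean array of shape (n, n).
    Assumes every row of mask allows at least one key.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)       # disallowed pairs get zero weight
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

def sliding_window_mask(n, w):
    """Each position may attend to neighbors at distance <= w (illustrative pattern)."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

rng = np.random.default_rng(0)
n, d = 8, 4
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
mask = sliding_window_mask(n, w=1)
out, weights = masked_attention(q, k, v, mask)
```

A production kernel would skip the masked-out pairs entirely instead of computing and discarding them, which is where the savings in time and memory come from.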
Complexity Analysis
| Pattern | Time Complexity | Space Complexity | Effective Range |
|---|---|---|---|
| Full Attention | O(n²) | O(n²) | Global |
| Fixed Sparse | O(n × k) | O(n × k) | k positions |
| Strided | O(n × n/s) | O(n × n/s) | Every s-th position |
| Block-Local | O(n × b) | O(n × b) | Block size b |
| Global+Local | O(n × (g+w)) | O(n × (g+w)) | Global + window w |
| Axial | O(n × √n) | O(n × √n) | Row + column |
| Random | O(n × r) | O(n × r) | r random positions |
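To make the table concrete, the sketch below counts how many query-key scores each pattern computes for one sequence. The parameter values (k, s, b, g, w, r) are illustrative assumptions chosen for this example, not settings taken from any particular model, and the axial count uses 2√n as a rough stand-in for one row plus one column.

```python
import math

def keys_per_query(n, pattern, **params):
    """Approximate number of keys each query attends to under a given pattern."""
    table = {
        "full": lambda: n,
        "fixed": lambda: params["k"],
        "strided": lambda: n // params["s"],
        "block_local": lambda: params["b"],
        "global_local": lambda: params["g"] + params["w"],
        "axial": lambda: 2 * math.isqrt(n),   # one row plus one column of a sqrt(n) grid
        "random": lambda: params["r"],
    }
    return table[pattern]()

n = 4096
settings = {                      # hypothetical parameter choices
    "full": {},
    "fixed": {"k": 256},
    "strided": {"s": 64},
    "block_local": {"b": 64},
    "global_local": {"g": 2, "w": 512},
    "axial": {},
    "random": {"r": 192},
}
costs = {name: n * keys_per_query(n, name, **p) for name, p in settings.items()}
for name, cost in costs.items():
    print(f"{name:13s} {cost:>12,d} scores  ({cost / costs['full']:.1%} of full)")
```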
Production Models
Longformer (Global + Local)
Combines a sliding-window local pattern with a few global tokens.
Architecture:
- 768 hidden size, 12 attention heads
- Window size: 512 tokens per layer
- Global attention on task-specific tokens (for example, the CLS token for classification)
- Max sequence: 4,096 tokens
Key features: Efficient document processing, QA tasks
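A global-plus-local mask of this kind can be sketched as follows. The helper `global_local_mask` is a hypothetical name, and the sketch assumes the window of 512 is split evenly, 256 positions on each side of the query; it builds a dense boolean mask for inspection only and is not how an optimized implementation stores the pattern.

```python
import numpy as np

def global_local_mask(n, window, global_idx):
    """Sliding window of total width `window`, plus fully connected global tokens."""
    half = window // 2
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= half   # local band
    mask[global_idx, :] = True   # global tokens attend to every position
    mask[:, global_idx] = True   # every position attends to global tokens
    return mask

mask = global_local_mask(n=4096, window=512, global_idx=[0])
density = mask.mean()            # fraction of query-key pairs that are active
print(f"active pairs: {density:.1%}")
```

Because the band width and the number of global tokens are fixed, the number of active pairs grows linearly with n rather than quadratically.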
BigBird (Random + Window + Global)
Architecture:
- 768 hidden size, 12 attention heads
- Block size: 64 tokens
- 3 random blocks + 2 global blocks + 3 sliding window blocks
- Max sequence: 4,096 tokens
Key features: Shown theoretically to retain key properties of full attention (universal approximation of sequence functions and Turing completeness) at linear cost
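The block structure listed above can be sketched by building the mask at block granularity and then expanding it to tokens. This is a simplified illustration under stated assumptions: the first two blocks are treated as the global blocks, and random blocks are sampled uniformly, so they may coincide with window or global blocks. The function names are hypothetical.

```python
import numpy as np

def bigbird_block_mask(n_blocks, n_random=3, n_global=2, window=3, seed=0):
    """Block-level mask combining window, global, and random blocks."""
    rng = np.random.default_rng(seed)
    idx = np.arange(n_blocks)
    blk = np.abs(idx[:, None] - idx[None, :]) <= window // 2   # sliding window blocks
    blk[:n_global, :] = True      # global blocks attend to every block
    blk[:, :n_global] = True      # every block attends to the global blocks
    for i in range(n_blocks):     # a few random key blocks per query block
        blk[i, rng.choice(n_blocks, size=n_random, replace=False)] = True
    return blk

def expand_blocks(blk, block_size):
    """Turn a block-level mask into a token-level mask."""
    return np.repeat(np.repeat(blk, block_size, axis=0), block_size, axis=1)

blk = bigbird_block_mask(n_blocks=64)        # 64 blocks of 64 tokens = 4,096 tokens
mask = expand_blocks(blk, block_size=64)
print(f"active pairs: {mask.mean():.1%}")
```

Working at block granularity is what makes the pattern practical on accelerators, since each active block becomes a small dense matrix multiply.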
Sparse Transformer (Fixed Pattern)
Architecture:
- 1,024 hidden size, 16 attention heads
- Stride: 128 (attend every 128th position)
- Local context: 128 tokens
- Max sequence: 8,192 tokens
Key features: Early sparse attention work, generative modeling
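A strided pattern with a local context can be sketched as a causal mask. This example merges the local and strided components into a single mask for readability, which is a simplification for illustration; the helper name `strided_causal_mask` and the sequence length of 1,024 are assumptions made to keep the example small.

```python
import numpy as np

def strided_causal_mask(n, stride, local):
    """Causal mask: the previous `local` positions plus every `stride`-th earlier one."""
    i = np.arange(n)[:, None]     # query positions
    j = np.arange(n)[None, :]     # key positions
    causal = j <= i               # never attend to the future
    local_part = (i - j) < local
    strided_part = (i - j) % stride == 0
    return causal & (local_part | strided_part)

mask = strided_causal_mask(n=1024, stride=128, local=128)
print("keys visible to the last position:", int(mask[-1].sum()))
```

The local part captures recent context, while the strided part gives each position a sparse view of the distant past.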
Best Practices
1. Pattern Selection Guidelines
Choose based on task type:
| Task Type | Recommended Pattern | Reason |
|---|---|---|
| Classification | Global + Local | Global tokens (CLS) need full context |
| Generation | Sliding Window | Local context most important |
| Image/Video | Axial | Natural 2D structure |
| Long Documents | BigBird / Longformer | Balanced coverage |
| Structured Data | Fixed / Strided | Exploit known patterns |
| Very Long (>4K) | BigBird | Proven theoretical guarantees |
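The axial pattern recommended for image and video data can be sketched as a mask over a flattened 2D grid, where each position attends only to its own row and column. The helper `axial_mask` and the 32 × 32 grid are assumptions for illustration.

```python
import numpy as np

def axial_mask(height, width):
    """Mask for a row-major flattened grid: attend within the same row or column."""
    rows, cols = np.divmod(np.arange(height * width), width)
    same_row = rows[:, None] == rows[None, :]
    same_col = cols[:, None] == cols[None, :]
    return same_row | same_col

mask = axial_mask(32, 32)         # 1,024 positions
keys_per_query = mask.sum(axis=1)
```

On a square grid of n positions, each query sees 2√n − 1 keys, which matches the O(n × √n) entry in the complexity table.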
2. Dynamic Sparsity
Concept: Adapt sparsity pattern based on content importance rather than fixed structure.
How it works:
- Score each position's importance
- Select top-k positions to attend to
- Sparsity ratio adjusts based on content
- More compute for important sequences
Trade-offs:
- More flexible than fixed patterns
- Adds scoring overhead
- Produces irregular, data-dependent masks that are harder to map onto optimized kernels than fixed patterns
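The steps above can be sketched with a simple top-k selection per query. This is one possible scoring rule chosen for illustration, not a specific published method: it uses the raw attention scores as the importance measure, which means the full score matrix is still computed and the sketch demonstrates the selection logic rather than any speedup. The function name `topk_attention` is hypothetical.

```python
import numpy as np

def topk_attention(q, k, v, top_k):
    """Attention where each query keeps only its top_k highest-scoring keys."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                                # importance scores
    keep = np.argpartition(scores, -top_k, axis=-1)[:, -top_k:]  # top_k indices per row
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    scores = np.where(mask, scores, -np.inf)                     # drop everything else
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, mask

rng = np.random.default_rng(0)
n, d = 16, 8
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out, mask = topk_attention(q, k, v, top_k=4)
```

A practical system would need a cheaper importance estimate than the full score matrix, which is the scoring overhead listed among the trade-offs.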
