
Sparse Attention Patterns

Summary
Explore sparse attention mechanisms that reduce quadratic complexity to linear or sub-quadratic, enabling efficient processing of long sequences.

Sparse Attention Patterns: Efficient Long-Range Modeling

Sparse attention patterns reduce the quadratic complexity of self-attention by limiting which positions can attend to each other, enabling efficient processing of sequences with thousands or millions of tokens.

Interactive Sparse Pattern Explorer

Interactive explorer (not shown here): toggle the Local, Global, Strided, Random, and Masked primitives to compose a pattern and see its mask, its sparsity, and the model it matches. Example state: 62% sparsity, 7.7 keys per token, complexity O(n·k) with k≈8, matching BigBird. The explorer uses an illustrative n (real models use thousands of tokens), with 1 global token, stride 4, and about 2 random keys per row.

The Sparsity Principle

Instead of computing attention between all pairs:

$$\mathrm{Attention}_{\text{full}}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) V \qquad O(n^2)$$

Sparse attention uses patterns:

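$$\mathrm{Attention}_{\text{sparse}}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top \odot M}{\sqrt{d}}\right) V \qquad O(n \times k)$$

Where M is a sparse binary mask with at most k nonzero entries per row, and k is much less than n. In practice, masked-out scores are set to −∞ before the softmax (or never computed at all), so those positions receive exactly zero weight.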
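To make the O(n × k) cost concrete, here is a minimal pure-Python sketch in which each query scores only the keys its mask row allows. The function name `sparse_attention` and the toy inputs are illustrative assumptions, and the sketch assumes every mask row allows at least one key. Skipping a key entirely is equivalent to setting its score to −∞ before the softmax.

```python
import math

def sparse_attention(Q, K, V, mask):
    """Attention where query i only scores the keys j with mask[i][j] True.

    Assumes every row of `mask` allows at least one key.
    """
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        allowed = [j for j, keep in enumerate(mask[i]) if keep]
        # Scaled dot products for the allowed keys only: O(k) work per query.
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d) for j in allowed]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * V[j][c] for w, j in zip(weights, allowed)) / total
            for c in range(len(V[0]))
        ])
    return out

# Toy example: 4 tokens, each attending to itself and its left neighbour.
Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
V = [[1.0], [2.0], [3.0], [4.0]]
mask = [[j <= i and i - j <= 1 for j in range(4)] for i in range(4)]
result = sparse_attention(Q, K, V, mask)
print(result)
```

The first token can only see itself, so its output is exactly its own value vector; every other output is a weighted mix of two value vectors.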

Complexity Analysis

| Pattern | Time Complexity | Space Complexity | Effective Range |
| --- | --- | --- | --- |
| Full Attention | O(n²) | O(n²) | Global |
| Fixed Sparse | O(n × k) | O(n × k) | k positions |
| Strided | O(n × n/s) | O(n × n/s) | Every s-th position |
| Block-Local | O(n × b) | O(n × b) | Block size b |
| Global+Local | O(n × (g+w)) | O(n × (g+w)) | Global + window w |
| Axial | O(n × √n) | O(n × √n) | Row + column |
| Random | O(n × r) | O(n × r) | r random positions |
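As a rough worked example of the table, the sketch below counts how many query-key scores each pattern computes at n = 4,096. All parameter values (s = 64, b = 64, g = 2, w = 512, r = 3) are assumptions chosen for illustration, not settings taken from any particular model.

```python
import math

n = 4096  # sequence length (illustrative)

# Keys each query attends to under each pattern (parameter values are assumptions).
keys_per_query = {
    "full": n,
    "strided (s=64)": n // 64,
    "block-local (b=64)": 64,
    "global+local (g=2, w=512)": 2 + 512,
    # One row plus one column of a sqrt(n) x sqrt(n) grid, counting the query once.
    "axial": 2 * math.isqrt(n) - 1,
    "random (r=3)": 3,
}

for name, k in keys_per_query.items():
    total = n * k  # number of query-key scores computed
    print(f"{name:28s} {k:5d} keys/query {total:>12,d} scores {total / (n * n):.2%} of full")
```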

Production Models

Longformer (Global + Local)

Combines a sliding-window local pattern with a few global tokens.

Architecture:

  • 768 hidden size, 12 attention heads
  • Window size: 512 tokens per layer
  • Global attention on task-specific tokens (e.g., the CLS token for classification)
  • Max sequence: 4,096 tokens

Key features: Efficient document processing, QA tasks
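A Longformer-style mask can be sketched as the union of a sliding window and a few global positions. This is a minimal illustration, not the library implementation: the function name `longformer_mask` is hypothetical, and the sketch assumes a window of size w covers w/2 tokens on each side and that global tokens both attend everywhere and are attended to by every token.

```python
def longformer_mask(n, window, global_idx):
    """Boolean n x n mask: sliding window plus symmetric global attention."""
    half = window // 2  # assumption: window/2 tokens on each side
    g = set(global_idx)
    return [
        [abs(i - j) <= half or i in g or j in g for j in range(n)]
        for i in range(n)
    ]

# Tiny example: 16 tokens, window of 4, token 0 (e.g. CLS) is global.
mask = longformer_mask(16, window=4, global_idx=[0])
for row in mask:
    print("".join("#" if keep else "." for keep in row))
```

With the settings listed above (window 512, one global token), an ordinary token would attend to roughly 514 of 4,096 positions.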

BigBird (Random + Window + Global)

Architecture:

  • 768 hidden size, 12 attention heads
  • Block size: 64 tokens
  • 3 random blocks + 2 global blocks + 3 sliding window blocks
  • Max sequence: 4,096 tokens

Key features: Proven to be a universal approximator of sequence-to-sequence functions and Turing complete, the same theoretical properties as full attention
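The block structure above can be sketched at block granularity, where each mask cell stands for a 64 × 64 tile of token pairs. This is an illustrative sketch under stated assumptions: the function name `bigbird_block_mask` is hypothetical, the first two blocks are treated as global, and random blocks are drawn only from blocks not already covered by the global or window components.

```python
import random

def bigbird_block_mask(num_blocks, num_global=2, num_window=3, num_random=3, seed=0):
    """Block-level mask combining global, sliding-window, and random blocks."""
    rng = random.Random(seed)
    half = num_window // 2
    mask = [[False] * num_blocks for _ in range(num_blocks)]
    for i in range(num_blocks):
        for j in range(num_blocks):
            is_global = i < num_global or j < num_global  # assumption: first blocks are global
            is_window = abs(i - j) <= half
            mask[i][j] = is_global or is_window
        # Random blocks are sampled from the blocks not yet covered.
        free = [j for j in range(num_blocks) if not mask[i][j]]
        for j in rng.sample(free, min(num_random, len(free))):
            mask[i][j] = True
    return mask

num_blocks = 4096 // 64  # 64 blocks of 64 tokens each
mask = bigbird_block_mask(num_blocks)
print(sum(mask[30]), "of", num_blocks, "blocks attended by query block 30")
```

A query block away from the sequence edges attends to 2 + 3 + 3 = 8 blocks, so each token sees 8 × 64 = 512 keys regardless of sequence length.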

Sparse Transformer (Fixed Pattern)

Architecture:

  • 1,024 hidden size, 16 attention heads
  • Stride: 128 (attend every 128th position)
  • Local context: 128 tokens
  • Max sequence: 8,192 tokens

Key features: Early sparse attention work, generative modeling
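The strided pattern can be sketched by listing, for a given query, the key positions it may attend to: a run of recent tokens plus every stride-th earlier token. The helper name `strided_keys` is hypothetical, and merging the local and strided components into one set is a simplification for illustration.

```python
def strided_keys(i, stride, local):
    """Causal key positions for query i: recent tokens plus every stride-th one."""
    recent = set(range(max(0, i - local + 1), i + 1))  # the last `local` positions
    strided = set(range(i, -1, -stride))               # i, i - stride, i - 2*stride, ...
    return sorted(recent | strided)

# Small example: stride 4, local context 4.
print(strided_keys(15, stride=4, local=4))

# With the settings listed above, the last token of an 8,192-token sequence
# attends to far fewer than 8,192 keys.
print(len(strided_keys(8191, stride=128, local=128)))
```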

Best Practices

1. Pattern Selection Guidelines

Choose based on task type:

| Task Type | Recommended Pattern | Reason |
| --- | --- | --- |
| Classification | Global + Local | Global tokens (CLS) need full context |
| Generation | Sliding Window | Local context most important |
| Image/Video | Axial | Natural 2D structure |
| Long Documents | BigBird / Longformer | Balanced coverage |
| Structured Data | Fixed / Strided | Exploit known patterns |
| Very Long (>4K) | BigBird | Proven theoretical guarantees |
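To illustrate why axial attention suits 2D data, the sketch below builds a mask for a flattened height × width grid in which each position attends only to its own row and its own column. The function name `axial_mask` and the row-major flattening are assumptions made for this example.

```python
def axial_mask(height, width):
    """Mask over a flattened grid: attend within the same row or the same column."""
    n = height * width
    return [
        [
            (i // width == j // width) or (i % width == j % width)
            for j in range(n)
        ]
        for i in range(n)
    ]

# 4 x 4 grid: n = 16 positions, each attending to 4 + 4 - 1 = 7 keys.
mask = axial_mask(4, 4)
print(sum(mask[5]), "keys for position 5 out of", len(mask))
```

For a square grid the number of keys per position grows as 2√n − 1, which is where the O(n × √n) total cost comes from.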

2. Dynamic Sparsity

Concept: Adapt sparsity pattern based on content importance rather than fixed structure.

How it works:

  • Score each position's importance
  • Select top-k positions to attend to
  • Sparsity ratio adjusts based on content
  • More compute for important sequences

Trade-offs:

  • More flexible than fixed patterns
  • Adds scoring overhead
  • Data-dependent patterns are harder to optimize (e.g., for batching and hardware kernels) than fixed ones
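The score-then-select steps above can be sketched for a single query as follows. This is a minimal illustration, assuming the importance score is simply the scaled dot product; the function name `topk_attention_weights` is hypothetical. Note that scoring every key still costs O(n) per query, which is the scoring overhead listed under the trade-offs.

```python
import math

def topk_attention_weights(q, K, k):
    """Score all keys for one query, keep the k highest, and softmax over them."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d) for key in K]
    # Indices of the k highest-scoring keys (the content-dependent sparse pattern).
    keep = sorted(range(len(K)), key=lambda j: scores[j], reverse=True)[:k]
    top = max(scores[j] for j in keep)  # subtract the max for numerical stability
    exps = {j: math.exp(scores[j] - top) for j in keep}
    total = sum(exps.values())
    return {j: e / total for j, e in exps.items()}

q = [1.0, 0.0]
K = [[2.0, 0.0], [0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
weights = topk_attention_weights(q, K, k=2)
print(weights)  # maps each selected key index to its attention weight
```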
