Hierarchical Attention: Efficient Multi-Scale Processing
Hierarchical attention mechanisms let transformers process data at multiple scales efficiently, which matters for vision tasks where both local detail and global context count. The approach, popularized by models such as Swin Transformer, makes attention practical for high-resolution images.
Why Hierarchical Attention?
The Challenge with Standard Attention
- Quadratic complexity: O(N²) for N tokens
- Memory explosion: Infeasible for high-resolution images
- Single scale: Misses multi-scale nature of visual data
The Hierarchical Solution
- Local windows: Compute attention within small regions
- Progressive merging: Combine windows at higher levels
- Multi-scale features: Capture both fine details and global context
- Linear complexity: O(N) in the number of tokens, for a fixed window size
How Hierarchical Attention Works
1. Window Partitioning
Divide the input into a grid of non-overlapping windows — the same local-window idea as sliding-window attention, but tiled across a 2D image.
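A minimal sketch of the partitioning step, in plain Python with nested lists standing in for a feature map. The helper name `window_partition` and the toy 4×4 grid are illustrative assumptions, not part of any particular library.

```python
def window_partition(grid, window):
    """Split an H x W grid (a list of rows) into non-overlapping tiles.

    Returns the windows in row-major order; each window is a flat list
    of window * window tokens.
    """
    height, width = len(grid), len(grid[0])
    if height % window or width % window:
        raise ValueError("grid size must be divisible by the window size")
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            tile = [grid[top + r][left + c]
                    for r in range(window) for c in range(window)]
            windows.append(tile)
    return windows


# Toy 4x4 grid whose "tokens" are just their flat indices 0..15.
grid = [[row * 4 + col for col in range(4)] for row in range(4)]
windows = window_partition(grid, window=2)
for index, tile in enumerate(windows):
    print(index, tile)
```

Each token lands in exactly one window, so the number of windows is (H × W) / M² for an M × M window.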
2. Local Window Attention
Apply ordinary self-attention within each window independently — cost grows with the (small, fixed) window size, not the whole image.
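The following sketch illustrates per-window attention under simplifying assumptions: a single head, and queries, keys, and values all equal to the input tokens (no learned projections). The function names are hypothetical; a real model would use batched tensor operations instead of Python loops.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def window_self_attention(tokens):
    """Scaled dot-product self-attention over the tokens of ONE window.

    tokens: list of d-dimensional vectors (lists of floats).
    Assumption for brevity: Q = K = V = tokens.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in tokens]
        weights = softmax(scores)
        mixed = [sum(w * value[i] for w, value in zip(weights, tokens))
                 for i in range(dim)]
        outputs.append(mixed)
    return outputs


def local_window_attention(windows):
    """Run attention on each window independently; windows never interact."""
    return [window_self_attention(tokens) for tokens in windows]


# Two toy windows of two 2-dimensional tokens each.
windows = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
outputs = local_window_attention(windows)
```

The score matrix for each window has only M² × M² entries, which is why the per-window cost depends on the window size rather than on the image size.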
3. Shifted Windows (Swin Transformer)
Create connections between windows by shifting the window grid by half a window (⌊M/2⌋ tokens) in every other layer, so tokens that were split across a boundary now share a window. This is Swin's key trick: it restores cross-window information flow without paying for global attention.
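One way to picture the shift is as a cyclic roll of the grid followed by the usual partitioning. The sketch below uses a toy 4×4 grid with 2×2 windows and a shift of one token; the helper names are illustrative. It omits the attention mask that Swin applies so that tokens brought together only by the wrap-around do not attend to each other.

```python
def cyclic_shift(grid, shift):
    """Roll the grid up and to the left by `shift` positions, wrapping around."""
    height, width = len(grid), len(grid[0])
    return [[grid[(r + shift) % height][(c + shift) % width]
             for c in range(width)] for r in range(height)]


def window_partition(grid, window):
    """Split a grid into non-overlapping window x window tiles (row-major)."""
    windows = []
    for top in range(0, len(grid), window):
        for left in range(0, len(grid[0]), window):
            windows.append([grid[top + r][left + c]
                            for r in range(window) for c in range(window)])
    return windows


def share_window(windows, a, b):
    """True if tokens a and b fall inside the same window."""
    return any(a in tile and b in tile for tile in windows)


# Tokens are labelled by their flat index in the unshifted grid.
grid = [[row * 4 + col for col in range(4)] for row in range(4)]
regular = window_partition(grid, window=2)
shifted = window_partition(cyclic_shift(grid, shift=1), window=2)

# Tokens 5 and 6 are horizontal neighbours separated by a window boundary.
print(share_window(regular, 5, 6))
print(share_window(shifted, 5, 6))
```

Alternating the regular and shifted layouts lets information cross every window boundary within two consecutive layers.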
4. Hierarchical Merging
Between stages, merge each 2×2 block of neighboring patches into one: concatenate their features (C → 4C channels) and linearly project the result down to 2C. Resolution halves along each side while the receptive field grows, building a fine-to-coarse hierarchy like a CNN's feature pyramid.
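A rough sketch of Swin-style patch merging, assuming the four patches of each 2×2 block are concatenated along the channel axis (C → 4C) and then linearly projected to 2C. The weight matrix here is a placeholder rather than learned parameters, and the normalization layer used in practice is omitted.

```python
def patch_merge(features, weight):
    """Merge each 2x2 block of patches into a single patch.

    features: H x W grid of C-dimensional vectors (H and W even).
    weight:   (4C) x (2C) projection matrix, given as a list of rows.
    Returns an (H/2) x (W/2) grid of 2C-dimensional vectors.
    """
    height, width = len(features), len(features[0])
    out_dim = len(weight[0])
    merged = []
    for top in range(0, height, 2):
        row = []
        for left in range(0, width, 2):
            # Concatenate the four neighbours' channels: C -> 4C.
            stacked = (features[top][left] + features[top + 1][left]
                       + features[top][left + 1] + features[top + 1][left + 1])
            # Linear projection: 4C -> 2C.
            row.append([sum(x * weight[i][j] for i, x in enumerate(stacked))
                        for j in range(out_dim)])
        merged.append(row)
    return merged


channels = 2
# Toy 4x4 feature map: each patch stores its (row, column) as two channels.
features = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
# Placeholder weights: every output channel is the mean of the 4C stacked values.
weight = [[1.0 / (4 * channels)] * (2 * channels) for _ in range(4 * channels)]
merged = patch_merge(features, weight)
print(len(merged), len(merged[0]), len(merged[0][0]))
```

The output grid has a quarter as many patches and twice as many channels, which matches the stage-to-stage shape change described in this section.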
Hierarchical Attention Architectures
Swin Transformer
- Window-based attention: 7×7 windows by default (12×12 in the 384×384-input variants)
- Shifted windows: Alternate between regular and shifted
- 4 stages: Progressively downsample like CNNs
- Patch merging: 2×2 patches → 1 patch between stages
Pyramid Vision Transformer (PVT)
- Progressive shrinking: Reduce spatial resolution gradually
- Spatial reduction attention: Downsample K, V for efficiency
- Multi-scale features: Different resolutions at each stage
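To see why downsampling K and V helps, the sketch below counts attention-score entries when queries stay at full resolution and keys/values are reduced by a factor `reduction` per side. The function name, the 56×56 grid, and the reduction ratio of 8 are assumptions chosen for illustration.

```python
def spatial_reduction_entries(height, width, reduction):
    """Number of query-key score entries with spatially reduced keys/values.

    Queries keep the full height x width resolution; keys and values are
    downsampled by `reduction` along each side.
    """
    num_queries = height * width
    num_keys = (height // reduction) * (width // reduction)
    return num_queries * num_keys


height = width = 56
full = spatial_reduction_entries(height, width, reduction=1)
reduced = spatial_reduction_entries(height, width, reduction=8)
print(f"full attention:    {full:,} score entries")
print(f"reduced keys (R=8): {reduced:,} score entries")
print(f"savings: {full // reduced}x")
```

Unlike window attention, every query still sees the whole image here, just at a coarser key/value resolution; the score matrix shrinks by a factor of R².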
Focal Transformer
- Focal attention: Both fine-grained and coarse-grained
- Multi-level aggregation: Combine multiple window sizes
- Adaptive granularity: Adjust based on content
Mathematical Formulation
Complexity Analysis
Standard Attention:
- Complexity: O(N² × d) where N = H × W
- Memory: O(N²)
Hierarchical Attention (with M × M windows, i.e. M² tokens per window):
- Complexity: O(N × M² × d)
- Memory: O(N × M²)
- Reduction factor: N/M² (for example, 64× for a 56×56 token grid with M = 7, since N = 3136 and M² = 49)
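The two cost expressions above can be compared numerically. This is a back-of-the-envelope sketch that ignores constant factors and projection costs; the channel width d = 96 and the grid sizes are assumed values for illustration.

```python
def global_attention_cost(num_tokens, dim):
    """Cost of full self-attention, O(N^2 * d), up to constant factors."""
    return num_tokens ** 2 * dim


def window_attention_cost(num_tokens, window, dim):
    """Cost of attention inside M x M windows, O(N * M^2 * d)."""
    return num_tokens * window ** 2 * dim


dim, window = 96, 7
ratios = []
for side in (56, 112, 224):
    num_tokens = side * side
    ratio = (global_attention_cost(num_tokens, dim)
             // window_attention_cost(num_tokens, window, dim))
    ratios.append(ratio)
    print(f"{side}x{side} tokens: window attention is about {ratio}x cheaper")
```

The ratio equals N/M², so the advantage of windowing grows linearly with the number of tokens: doubling the grid side quadruples the savings.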
Multi-Scale Feature Maps
At stage s (s = 0, 1, 2, ...), with H × W the token grid after patch embedding and downsampling factor 2^s:
- Resolution: H/2^s × W/2^s
- Channels: C × 2^s
- Window size: M (constant)
- Number of windows: (H × W) / (M² × 4^s)
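The per-stage formulas can be tabulated directly. The sketch below assumes a 56×56 starting token grid, C = 96 base channels, M = 7, and four stages; these are example values, not requirements of the method.

```python
def stage_shapes(height, width, channels, window, num_stages=4):
    """Resolution, channel count, and window count at each stage s = 0, 1, ..."""
    shapes = []
    for stage in range(num_stages):
        h = height // 2 ** stage
        w = width // 2 ** stage
        shapes.append({
            "stage": stage,
            "resolution": (h, w),
            "channels": channels * 2 ** stage,
            "windows": (h * w) // window ** 2,
        })
    return shapes


shapes = stage_shapes(height=56, width=56, channels=96, window=7)
for shape in shapes:
    print(shape)
```

With these values the last stage is a single 7×7 window, so window attention there already covers the whole (downsampled) image.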
Comparison with Other Approaches
| Approach | Complexity | Global context | Multi-scale | Memory |
|---|---|---|---|---|
| Standard attention | O(N²) | Yes, from the start | No (single scale) | High |
| Hierarchical | O(N) | At higher levels | Built-in | Low |
| Sparse attention | O(N√N) | Limited | No (single scale) | Medium |
| Axial attention | O(N√N) | Along axes | No (single scale) | Medium |
