Hierarchical Attention in Vision Transformers

Summary: How hierarchical (windowed, multi-scale) attention — pioneered by Swin Transformer — cuts the quadratic cost of self-attention to near-linear for high-resolution vision.

Hierarchical Attention: Efficient Multi-Scale Processing

Hierarchical attention lets transformers process data at multiple scales efficiently, which matters for vision tasks where both local detail and global context count. The approach, pioneered by models like Swin Transformer, makes high-resolution images tractable by keeping attention cost close to linear in the number of tokens.

Why Hierarchical Attention?

The Challenge with Standard Attention

Quadratic complexity: O(N²) for N tokens
Memory explosion: Infeasible for high-resolution images
Single scale: Misses multi-scale nature of visual data

The Hierarchical Solution

Local windows: Compute attention within small regions
Progressive merging: Combine windows at higher levels
Multi-scale features: Capture both fine details and global context
Linear complexity: O(N) in the number of tokens N (and so in image size) for a fixed window size

How Hierarchical Attention Works

1. Window Partitioning

Divide the input into a grid of non-overlapping windows — the same local-window idea as sliding-window attention, but tiled across a 2D image.
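
As a minimal sketch of the partitioning step, the NumPy code below splits an (H, W, C) feature map into non-overlapping M×M windows and puts it back together. The function names `window_partition` and `window_reverse` are illustrative choices, and the code assumes H and W are divisible by the window size (real implementations usually pad otherwise).

```python
import numpy as np

def window_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping windows.

    Returns an array of shape (num_windows, window_size * window_size, C).
    Assumes H and W are divisible by window_size.
    """
    H, W, C = x.shape
    M = window_size
    assert H % M == 0 and W % M == 0, "pad the input first"
    x = x.reshape(H // M, M, W // M, M, C)
    x = x.transpose(0, 2, 1, 3, 4)       # (H/M, W/M, M, M, C)
    return x.reshape(-1, M * M, C)       # one row of tokens per window

def window_reverse(windows, window_size, H, W):
    """Inverse of window_partition: rebuild the (H, W, C) feature map."""
    M = window_size
    C = windows.shape[-1]
    x = windows.reshape(H // M, W // M, M, M, C)
    x = x.transpose(0, 2, 1, 3, 4)       # (H/M, M, W/M, M, C)
    return x.reshape(H, W, C)

# Toy 8x8 feature map with 3 channels, split into four 4x4 windows.
feature_map = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
windows = window_partition(feature_map, window_size=4)
restored = window_reverse(windows, 4, 8, 8)
```

Windows are numbered row by row, so the second window starts at column 4 of the first row.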

2. Local Window Attention

Apply ordinary self-attention within each window independently — cost grows with the (small, fixed) window size, not the whole image.
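
The sketch below applies single-head scaled dot-product attention to each window separately, so each score matrix is only (tokens per window) × (tokens per window). It is an illustration under simplifying assumptions: one head, random projection weights, and no relative position bias, which Swin adds in practice. The name `window_attention` is hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(windows, Wq, Wk, Wv):
    """Single-head self-attention applied to every window independently.

    windows: (num_windows, tokens_per_window, C)
    Returns the attended features and the per-window attention weights.
    """
    q = windows @ Wq
    k = windows @ Wk
    v = windows @ Wv
    d = q.shape[-1]
    # Batched matmul: one (T, T) score matrix per window, never (N, N).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    attn = softmax(scores, axis=-1)
    return attn @ v, attn

rng = np.random.default_rng(0)
num_windows, tokens, C = 4, 16, 8
windows = rng.standard_normal((num_windows, tokens, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
out, attn = window_attention(windows, Wq, Wk, Wv)
```

Because the windows form the batch dimension, changing the tokens of one window leaves the outputs of every other window untouched. That isolation is what the shifted-window step has to compensate for.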

3. Shifted Windows (Swin Transformer)

Create connections between windows by shifting the window grid every other layer, so tokens that were split across a boundary now share a window. This is Swin's key trick — it restores cross-window information flow without paying for global attention.
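
To see what the shift does, the sketch below only tracks which window each token falls into, with and without a cyclic shift of half a window. The helper `window_ids` is hypothetical. It models the layout obtained by rolling the feature map by `-shift` along both axes before partitioning, and it leaves out the attention mask that Swin uses to stop wrapped-around tokens from attending to each other.

```python
import numpy as np

def window_ids(H, W, window_size, shift=0):
    """Return an (H, W) array holding the window index of every token.

    With shift > 0 the window grid is cyclically displaced, as if the
    feature map had been rolled by -shift before partitioning.
    """
    rows = ((np.arange(H) - shift) % H) // window_size
    cols = ((np.arange(W) - shift) % W) // window_size
    return rows[:, None] * (W // window_size) + cols[None, :]

H = W = 8
M = 4
regular = window_ids(H, W, M, shift=0)
shifted = window_ids(H, W, M, shift=M // 2)

# Two horizontally adjacent tokens on either side of a window boundary.
a, b = (3, 3), (3, 4)
same_window_regular = bool(regular[a] == regular[b])
same_window_shifted = bool(shifted[a] == shifted[b])
```

In the regular layout the two tokens sit in different windows and cannot attend to each other. After the shift they share a window, so alternating the two layouts lets information cross every boundary within two layers.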

4. Hierarchical Merging

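Between stages, merge each 2×2 block of neighboring patches into one token: concatenate their features (4C channels) and linearly project them to 2C. Spatial resolution halves in each dimension and the channel count doubles while the receptive field grows, building a fine-to-coarse hierarchy like a CNN's feature pyramid.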
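
A minimal sketch of Swin-style patch merging follows: the four patches of each 2×2 block are concatenated along the channel axis (C to 4C) and a linear layer reduces the result to 2C. The weight matrix here is random and the LayerNorm that Swin applies before the projection is omitted, so this is an illustration of the shapes rather than a trained layer.

```python
import numpy as np

def patch_merging(x, W_reduce):
    """Merge each 2x2 block of patches into a single token.

    x:        (H, W, C) feature map with even H and W
    W_reduce: (4C, 2C) projection matrix (random here, learned in practice)
    Returns an (H/2, W/2, 2C) feature map.
    """
    H, W, C = x.shape
    assert H % 2 == 0 and W % 2 == 0
    x0 = x[0::2, 0::2]   # top-left patch of every block
    x1 = x[1::2, 0::2]   # bottom-left
    x2 = x[0::2, 1::2]   # top-right
    x3 = x[1::2, 1::2]   # bottom-right
    merged = np.concatenate([x0, x1, x2, x3], axis=-1)   # (H/2, W/2, 4C)
    return merged @ W_reduce                             # (H/2, W/2, 2C)

rng = np.random.default_rng(0)
C = 16
x = rng.standard_normal((8, 8, C))
W_reduce = rng.standard_normal((4 * C, 2 * C)) * 0.1
y = patch_merging(x, W_reduce)
```

An 8×8 map with 16 channels becomes a 4×4 map with 32 channels: a quarter of the tokens, each carrying twice the channels.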

Hierarchical Attention Architectures

Swin Transformer

Window-based attention: 7×7 windows by default (12×12 for 384×384 inputs)
Shifted windows: Alternate between regular and shifted
4 stages: Progressively downsample like CNNs
Patch merging: 2×2 patches → 1 patch between stages

Pyramid Vision Transformer (PVT)

Progressive shrinking: Reduce spatial resolution gradually
Spatial reduction attention: Downsample K, V for efficiency
Multi-scale features: Different resolutions at each stage
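
The idea behind spatial reduction attention can be sketched as follows: queries stay at full resolution, while keys and values are computed from a spatially downsampled copy of the feature map, so the score matrix shrinks from N × N to N × N/R². PVT learns its reduction; the R×R average pooling used here is a stand-in assumption to keep the example short, and the function name and reduction ratio `R` are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_reduction_attention(x, R, Wq, Wk, Wv):
    """Single-head attention with keys/values from a downsampled map.

    x: (H, W, C) feature map, H and W divisible by R.
    Average pooling over R x R blocks is an assumed stand-in for the
    learned reduction used in PVT.
    """
    H, W, C = x.shape
    q = x.reshape(H * W, C) @ Wq                        # (N, d)
    reduced = x.reshape(H // R, R, W // R, R, C).mean(axis=(1, 3))
    reduced = reduced.reshape(-1, C)                    # (N / R^2, C)
    k = reduced @ Wk
    v = reduced @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])             # (N, N / R^2)
    attn = softmax(scores, axis=-1)
    return (attn @ v).reshape(H, W, -1), attn

rng = np.random.default_rng(0)
C = 8
x = rng.standard_normal((8, 8, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
out, attn = spatial_reduction_attention(x, R=4, Wq=Wq, Wk=Wk, Wv=Wv)
```

With an 8×8 map and R = 4, each of the 64 queries attends to only 4 pooled key/value tokens instead of 64. Unlike window attention, every query still sees the whole image, only at a coarser resolution.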

Focal Transformer

Focal attention: Both fine-grained and coarse-grained
Multi-level aggregation: Combine multiple window sizes
Adaptive granularity: Adjust based on content

Mathematical Formulation

Complexity Analysis

Standard Attention:

Complexity: O(N² × d) where N = H × W
Memory: O(N²)

Hierarchical Attention (with windows of M × M tokens):

Complexity: O(N × M² × d)
Memory: O(N × M²)
Reduction factor: N/M² (for example, 64× for a 56×56 token grid with M = 7, since 3136/49 = 64)
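
The arithmetic can be checked with a few lines of Python. The first helper counts query-key pairs (the size of the attention score matrices). The other two use the FLOP estimate of the form given in the Swin Transformer paper, which also includes the linear projection term that windowing does not reduce. The 56×56 grid, C = 96 and M = 7 are example values chosen for illustration, and the function names are hypothetical.

```python
def attention_pairs(h, w, window_size=None):
    """Number of query-key pairs scored for an h x w token grid."""
    n = h * w
    if window_size is None:
        return n * n                      # global attention: N^2
    return n * window_size ** 2           # windowed attention: N * M^2

def msa_flops(h, w, C):
    """Global multi-head self-attention: 4hwC^2 + 2(hw)^2 C."""
    return 4 * h * w * C ** 2 + 2 * (h * w) ** 2 * C

def wmsa_flops(h, w, C, M):
    """Window-based self-attention: 4hwC^2 + 2 M^2 hw C."""
    return 4 * h * w * C ** 2 + 2 * M ** 2 * h * w * C

h = w = 56
C, M = 96, 7
pair_ratio = attention_pairs(h, w) // attention_pairs(h, w, M)   # N / M^2
flop_ratio = msa_flops(h, w, C) / wmsa_flops(h, w, C, M)
```

The pair count shrinks by exactly N/M², which depends on the image size and not only on M. The total FLOP ratio is smaller because the projection cost 4hwC² is shared by both variants.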

Multi-Scale Feature Maps

At stage s (s = 0, 1, 2, ...) with downsampling factor 2^s, where H × W is the token grid at stage 0:

Resolution: H/2^s × W/2^s
Channels: C × 2^s
Window size: M (constant)
Number of windows: (H × W) / (M² × 4^s)
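
These formulas can be tabulated directly. The sketch below assumes H and W describe the token grid at stage 0 and uses illustrative values (a 56×56 grid, C = 96, M = 7, four stages); `stage_shapes` is a hypothetical helper, not part of any library.

```python
def stage_shapes(H, W, C, M, num_stages=4):
    """Resolution, channels and window count at each stage s."""
    rows = []
    for s in range(num_stages):
        h, w = H // 2 ** s, W // 2 ** s
        rows.append({
            "stage": s,
            "resolution": (h, w),
            "channels": C * 2 ** s,
            # Equals (H * W) / (M^2 * 4^s) when h and w are multiples of M.
            "num_windows": (h // M) * (w // M),
        })
    return rows

shapes = stage_shapes(H=56, W=56, C=96, M=7)
for row in shapes:
    print(row)
```

The window size stays fixed while the grid shrinks, so by the last stage a single 7×7 window covers the whole feature map and window attention becomes global.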

Comparison with other approaches

Approach	Complexity (N tokens)	Global context	Multi-scale	Memory
Standard attention	O(N²)	Yes, from the start	No (single scale)	High
Hierarchical	O(N) for fixed window size	At higher levels	Built-in	Low
Sparse attention	O(N^1.5)	Limited	No (single scale)	Medium
Axial attention	O(N^1.5)	Along axes	No (single scale)	Medium
