Hierarchical Attention in Vision Transformers

Summary
How hierarchical (windowed, multi-scale) attention — pioneered by Swin Transformer — cuts the quadratic cost of self-attention to near-linear for high-resolution vision.

Hierarchical Attention: Efficient Multi-Scale Processing

Hierarchical attention mechanisms enable transformers to efficiently process data at multiple scales, which is crucial for vision tasks where both local details and global context matter. This approach, pioneered by models like Swin Transformer, makes high-resolution images tractable by restricting attention to local windows and building up global context across stages.

Why Hierarchical Attention?

The Challenge with Standard Attention

  • Quadratic complexity: O(N²) for N tokens
  • Memory explosion: The N × N attention matrix becomes infeasible for high-resolution images
  • Single scale: Misses multi-scale nature of visual data

The Hierarchical Solution

  • Local windows: Compute attention within small regions
  • Progressive merging: Combine windows at higher levels
  • Multi-scale features: Capture both fine details and global context
  • Linear complexity: O(N) in the number of tokens for a fixed window size, so cost scales linearly with image area

How Hierarchical Attention Works

1. Window Partitioning

Divide the input into a grid of non-overlapping windows — the same local-window idea as sliding-window attention, but tiled across a 2D image.
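
The partitioning step can be sketched with a reshape and a transpose. This is a minimal NumPy illustration, not any library's implementation: the names `window_partition` and `window_reverse` are chosen here for illustration, and the sketch assumes the height and width are exact multiples of the window size.

```python
import numpy as np

def window_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping windows.

    Returns an array of shape (num_windows, window_size**2, C).
    Assumes H and W are divisible by window_size.
    """
    H, W, C = x.shape
    M = window_size
    assert H % M == 0 and W % M == 0, "H and W must be multiples of the window size"
    x = x.reshape(H // M, M, W // M, M, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/M, W/M, M, M, C)
    return x.reshape(-1, M * M, C)

def window_reverse(windows, window_size, H, W):
    """Inverse of window_partition: stitch the windows back into (H, W, C)."""
    M = window_size
    C = windows.shape[-1]
    x = windows.reshape(H // M, W // M, M, M, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/M, M, W/M, M, C)
    return x.reshape(H, W, C)

# An 8x8 map with one channel, split into four 4x4 windows.
feature_map = np.arange(8 * 8, dtype=np.float32).reshape(8, 8, 1)
windows = window_partition(feature_map, window_size=4)
print(windows.shape)  # (4, 16, 1)
```

Each row of `windows` is one window flattened into a short token sequence, which is the shape the per-window attention step expects.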

2. Local Window Attention

Apply ordinary self-attention within each window independently — cost grows with the (small, fixed) window size, not the whole image.
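
As a rough sketch of this step, the snippet below runs single-head scaled dot-product attention on a batch of windows at once. It assumes the input is already partitioned into shape (num_windows, tokens_per_window, C); the function name and the random projection weights are illustrative, and details that real models add (multiple heads, relative position bias) are left out.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(windows, w_q, w_k, w_v):
    """Self-attention applied independently inside each window.

    windows: (num_windows, tokens_per_window, C)
    Returns the attended features and the per-window attention weights.
    """
    q = windows @ w_q
    k = windows @ w_k
    v = windows @ w_v
    d = q.shape[-1]
    # One small (T x T) score matrix per window instead of one (N x N) matrix.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
num_windows, tokens, C = 64, 49, 32  # e.g. a 56x56 token grid with 7x7 windows
windows = rng.standard_normal((num_windows, tokens, C))
w_q, w_k, w_v = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))

out, weights = window_attention(windows, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

The attention weights have shape (64, 49, 49): 64 small matrices rather than a single 3136 × 3136 one, and no token can see outside its own window.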

3. Shifted Windows (Swin Transformer)

Create connections between windows by shifting the window grid every other layer, so tokens that were split across a boundary now share a window. This is Swin's key trick — it restores cross-window information flow without paying for global attention.
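
To make the effect of the shift concrete, the sketch below labels every token with the window it falls in, first on the regular grid and then on a grid shifted by half a window. The helper `window_ids` is hypothetical, and the shift is modelled as a cyclic roll, which is how Swin implements it. The attention mask Swin applies to tokens that wrap around from the opposite edge is omitted here.

```python
import numpy as np

def window_ids(H, W, M, shift=0):
    """Return an (H, W) array with the window index of every token.

    A cyclic shift by `shift` tokens is equivalent to
    np.roll(x, shift=(-shift, -shift), axis=(0, 1)) followed by the
    usual non-overlapping partition.
    """
    rows = (np.arange(H) - shift) % H
    cols = (np.arange(W) - shift) % W
    return (rows[:, None] // M) * (W // M) + (cols[None, :] // M)

H = W = 8
M = 4
regular = window_ids(H, W, M)
shifted = window_ids(H, W, M, shift=M // 2)

# Two diagonal neighbours that sit on opposite sides of a window corner.
a, b = (3, 3), (4, 4)
print(regular[a] == regular[b])  # False: split across a boundary
print(shifted[a] == shifted[b])  # True: they now share a window
```

Alternating the two layouts means any pair of neighbouring tokens shares a window in at least one of every two consecutive layers.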

4. Hierarchical Merging

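Between stages, merge each 2×2 block of neighboring patches into one: concatenate their features (C → 4C channels) and linearly project down to 2C. Spatial resolution halves in each dimension, channel width doubles, and the receptive field grows, building a fine-to-coarse hierarchy like a CNN's feature pyramid.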
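
A minimal sketch of the merging step, following Swin's patch merging: the four patches in each 2×2 block are concatenated along the channel axis (4C channels) and a linear layer projects them to 2C. The function name, the random weight matrix, and the ordering of the four patches within the concatenation are illustrative choices; the layer normalization that Swin applies before the projection is omitted.

```python
import numpy as np

def patch_merging(x, w_reduce):
    """Merge each 2x2 block of patches into one.

    x:        (H, W, C) feature map, H and W even
    w_reduce: (4C, 2C) projection matrix
    Returns   (H/2, W/2, 2C)
    """
    H, W, C = x.shape
    assert H % 2 == 0 and W % 2 == 0, "H and W must be even"
    # Group every 2x2 neighbourhood, then flatten it into the channel axis.
    x = x.reshape(H // 2, 2, W // 2, 2, C).transpose(0, 2, 1, 3, 4)
    x = x.reshape(H // 2, W // 2, 4 * C)
    return x @ w_reduce

rng = np.random.default_rng(0)
C = 96
x = rng.standard_normal((56, 56, C))
w_reduce = rng.standard_normal((4 * C, 2 * C)) / np.sqrt(4 * C)

merged = patch_merging(x, w_reduce)
print(x.shape, "->", merged.shape)  # (56, 56, 96) -> (28, 28, 192)
```

The token count drops by 4× while the channel width only doubles, so each stage is cheaper in tokens but richer per token.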

Hierarchical Attention Architectures

Swin Transformer

  • Window-based attention: 7×7 windows by default (12×12 in the 384×384-input variants)
  • Shifted windows: Alternate between regular and shifted
  • 4 stages: Progressively downsample like CNNs
  • Patch merging: 2×2 patches → 1 patch between stages

Pyramid Vision Transformer (PVT)

  • Progressive shrinking: Reduce spatial resolution gradually
  • Spatial reduction attention: Downsample K, V for efficiency
  • Multi-scale features: Different resolutions at each stage
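
The idea behind spatial reduction attention can be sketched as follows: queries stay at full resolution while keys and values come from a downsampled copy of the feature map, so the score matrix shrinks from N × N to N × N/R². This sketch swaps in average pooling for the learned reduction that PVT uses and drops the query, key, and value projections entirely, so treat it as an illustration of the shapes rather than PVT's actual layer. The function name and the reduction ratio `R` are assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_reduction_attention(x, reduction):
    """Attention with full-resolution queries and downsampled keys/values.

    x: (H, W, C), with H and W divisible by `reduction`.
    Average pooling stands in for PVT's learned spatial reduction.
    """
    H, W, C = x.shape
    R = reduction
    q = x.reshape(H * W, C)                                  # N queries
    pooled = x.reshape(H // R, R, W // R, R, C).mean(axis=(1, 3))
    kv = pooled.reshape(-1, C)                               # N / R^2 keys and values
    scores = q @ kv.T / np.sqrt(C)                           # (N, N / R^2)
    weights = softmax(scores, axis=-1)
    return (weights @ kv).reshape(H, W, C), weights

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16, 8))
out, weights = spatial_reduction_attention(x, reduction=4)
print(out.shape, weights.shape)  # (16, 16, 8) (256, 16)
```

Unlike windowed attention, every query still sees the whole image, just at a coarser resolution.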

Focal Transformer

  • Focal attention: Both fine-grained and coarse-grained
  • Multi-level aggregation: Combine multiple window sizes
  • Distance-dependent granularity: Fine-grained attention near the query, coarser attention farther away

Mathematical Formulation

Complexity Analysis

Standard Attention:

  • Complexity: O(N² × d) where N = H × W
  • Memory: O(N²)

Hierarchical Attention (with windows of M × M tokens):

  • Complexity: O(N × M² × d)
  • Memory: O(N × M²)
  • Reduction factor: N/M² (for example, 64× for a 56 × 56 token grid with M = 7)
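
A quick arithmetic check of these formulas, counting query-key pairs and ignoring the shared factor d. The 56 × 56 token grid is an assumption for illustration (it corresponds to a 224 × 224 image cut into 4 × 4 patches); the function names are likewise made up for this sketch.

```python
def global_attention_pairs(n_tokens):
    """Query-key pairs scored by standard attention: N^2."""
    return n_tokens ** 2

def window_attention_pairs(n_tokens, window_size):
    """Query-key pairs scored by windowed attention: N * M^2."""
    return n_tokens * window_size ** 2

M = 7
for side in (56, 112):
    n = side * side
    g = global_attention_pairs(n)
    w = window_attention_pairs(n, M)
    print(f"{side}x{side} tokens: global={g:,} windowed={w:,} reduction={g // w}x")

# 56x56 tokens: global=9,834,496 windowed=153,664 reduction=64x
# 112x112 tokens: global=157,351,936 windowed=614,656 reduction=256x
```

The reduction factor N/M² grows with the image, which is the practical meaning of going from quadratic to linear: doubling the side length quadruples the windowed cost but multiplies the global cost by sixteen.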

Multi-Scale Feature Maps

At stage s (s = 0, 1, 2, 3) with downsampling factor 2^s, where H × W is the token grid at stage 0 and C its channel width:

  • Resolution: H/2^s × W/2^s
  • Channels: C × 2^s
  • Window size: M (constant)
  • Number of windows: (H × W) / (M² × 4^s)
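
The formulas above can be tabulated for a concrete configuration. The values used here (a 56 × 56 stage-0 token grid, C = 96, M = 7, four stages) are illustrative assumptions, and `stage_shapes` is a hypothetical helper.

```python
def stage_shapes(H, W, C, M, num_stages=4):
    """Tabulate resolution, channels, and window count per stage.

    H, W: token grid at stage 0; C: channels at stage 0; M: window side.
    """
    rows = []
    for s in range(num_stages):
        h, w = H // 2 ** s, W // 2 ** s
        rows.append({
            "stage": s,
            "resolution": (h, w),
            "channels": C * 2 ** s,
            "num_windows": (H * W) // (M * M * 4 ** s),
        })
    return rows

shapes = stage_shapes(H=56, W=56, C=96, M=7)
for row in shapes:
    print(row)

# {'stage': 0, 'resolution': (56, 56), 'channels': 96, 'num_windows': 64}
# {'stage': 1, 'resolution': (28, 28), 'channels': 192, 'num_windows': 16}
# {'stage': 2, 'resolution': (14, 14), 'channels': 384, 'num_windows': 4}
# {'stage': 3, 'resolution': (7, 7), 'channels': 768, 'num_windows': 1}
```

By the last stage a single window covers the whole feature map, so window attention there is effectively global.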

Comparison with other approaches

| Approach | Complexity | Global context | Multi-scale | Memory |
| --- | --- | --- | --- | --- |
| Standard attention | O(N²) | Yes, from the start | No (single scale) | High |
| Hierarchical | O(N) | At higher levels | Built-in | Low |
| Sparse attention | O(N√N) | Limited | No (single scale) | Medium |
| Axial attention | O(N^1.5) | Along axes | No (single scale) | Medium |
