TL;DR
Swin Transformer solves the fundamental scalability problem of Vision Transformers by replacing global self-attention with window-based local attention and introducing a shifted window scheme that enables cross-window information flow. The result is a hierarchical vision backbone with linear computational complexity in image size, producing multi-scale feature maps that plug directly into existing dense prediction frameworks. It achieved state-of-the-art results on ImageNet classification (87.3% top-1), COCO object detection (58.7 box AP), and ADE20K segmentation (53.5 mIoU), establishing itself as the go-to general-purpose vision backbone.
The Problem: ViT Does Not Scale to Dense Vision Tasks
The original Vision Transformer (ViT) computes self-attention globally across all image patches. For an image tokenized into n patches, the attention computation is O(n²) in both time and memory. This is manageable for classification at 224×224 (196 patches with a 16×16 patch size), but dense prediction tasks like object detection and semantic segmentation require high-resolution inputs (e.g., 1024×1024, yielding 4096 patches at the same patch size). At that scale, global self-attention becomes prohibitively expensive.
Beyond computational cost, ViT has a structural limitation: it produces single-scale feature maps. CNNs naturally produce hierarchical, multi-scale features through pooling and strided convolutions — a property that feature pyramid networks (FPN), anchor-based detectors, and segmentation decoders all depend on. ViT's flat sequence of same-resolution tokens cannot directly serve these downstream architectures.
Swin Transformer addresses both problems simultaneously.
Window-Based Self-Attention
The core mechanism is simple: instead of computing attention across all n tokens, partition the feature map into non-overlapping local windows of fixed size M × M (default M = 7) and compute self-attention independently within each window.
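In NumPy terms, the partition is just a reshape and a transpose. A minimal sketch (assuming H and W are divisible by M, and ignoring the batch and head dimensions a real implementation carries):

```python
import numpy as np

def window_partition(x, M):
    """Split an (H, W, C) feature map into non-overlapping M x M windows.

    Returns (num_windows, M*M, C); assumes H and W are divisible by M.
    """
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C)
    # Bring the two window-grid axes together, then flatten each window.
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, M * M, C)

# Stage-1 map for a 224x224 input: 56x56 tokens, C = 96, M = 7 -> 64 windows
x = np.random.rand(56, 56, 96)
print(window_partition(x, 7).shape)  # (64, 49, 96)
```

Attention is then computed batched over the window axis, so all windows are processed in parallel.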
For a feature map of h × w tokens, the computational complexity changes from:

Ω(MSA) = 4hwC² + 2(hw)²C

to:

Ω(W-MSA) = 4hwC² + 2M²hwC

where C is the embedding dimension. The critical difference is in the second term: (hw)² becomes M²hw. Since M is fixed, the attention cost scales linearly with image size rather than quadratically. The attention term shrinks by a factor of hw/M²: with M = 7, that is roughly 84× for the 4096-token example above, and the saving itself grows linearly with resolution.
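To make the scaling concrete, here is a small sketch plugging numbers into the two attention terms, 2(hw)²C for global attention and 2M²hwC for windowed attention (the shared 4hwC² projection term is omitted):

```python
# Attention-term FLOPs for global vs. window-based self-attention.
def global_attn(hw, C):
    return 2 * hw**2 * C       # 2 (hw)^2 C

def window_attn(hw, C, M=7):
    return 2 * M**2 * hw * C   # 2 M^2 hw C

C = 96
# Token counts for a 1024x1024 input at 16x16 and 4x4 patch sizes.
for hw in [64 * 64, 256 * 256]:
    print(hw, global_attn(hw, C) / window_attn(hw, C))  # ratio = hw / M**2
```

The ratio reduces to hw/M², so halving the patch size (4× the tokens) makes windowed attention 4× cheaper relative to global attention.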
The trade-off is that each window attends only to its local M2 tokens, losing global receptive field. This is where the shifted window mechanism becomes essential.
Shifted Window Attention: Cross-Window Communication
Window-based attention in isolation creates hard boundaries between windows — tokens at the edge of one window cannot attend to adjacent tokens in the neighboring window. Swin Transformer solves this by alternating between two windowing configurations across consecutive transformer blocks.
In layer ℓ, the feature map is partitioned with standard non-overlapping windows. In layer ℓ + 1, the window grid is shifted by (⌊M/2⌋, ⌊M/2⌋) tokens, so that each new window straddles the boundaries of up to four windows from the previous layer. This creates cross-window connections without any additional attention computation.
Formally, consecutive Swin Transformer blocks compute:

ẑˡ = W-MSA(LN(zˡ⁻¹)) + zˡ⁻¹
zˡ = MLP(LN(ẑˡ)) + ẑˡ
ẑˡ⁺¹ = SW-MSA(LN(zˡ)) + zˡ
zˡ⁺¹ = MLP(LN(ẑˡ⁺¹)) + ẑˡ⁺¹

where W-MSA is standard window multi-head self-attention, SW-MSA is shifted window multi-head self-attention, LN is LayerNorm, and zˡ is the output of block ℓ.
A naive implementation of shifted windows would increase the number of windows (some partial at the borders), creating an irregular computation pattern. The paper introduces an efficient cyclic shift approach: shift the feature map, apply standard windowing, then mask out attention between tokens that are not actually adjacent in the original layout. This keeps the number of windows constant and enables batched computation.
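A sketch of how such a mask can be built, mirroring the region-labeling trick in the paper's released code (NumPy here; the real implementation rolls the feature map with torch.roll and adds this mask to the attention logits):

```python
import numpy as np

def shifted_window_mask(H, W, M, shift):
    """Attention mask for cyclic-shifted windows.

    After rolling the feature map by (-shift, -shift), tokens wrapped
    across the border share a window with tokens they are not actually
    adjacent to; those pairs get -inf so softmax zeroes them out.
    Returns (num_windows, M*M, M*M).
    """
    # Label each token by the region it occupies after the cyclic shift.
    img = np.zeros((H, W), dtype=int)
    cnt = 0
    for hs in (slice(0, -M), slice(-M, -shift), slice(-shift, None)):
        for ws in (slice(0, -M), slice(-M, -shift), slice(-shift, None)):
            img[hs, ws] = cnt
            cnt += 1
    # Partition the label map into windows and compare labels pairwise.
    win = img.reshape(H // M, M, W // M, M).transpose(0, 2, 1, 3).reshape(-1, M * M)
    same = win[:, None, :] == win[:, :, None]
    return np.where(same, 0.0, -np.inf)

mask = shifted_window_mask(56, 56, M=7, shift=3)  # shift = floor(7 / 2)
print(mask.shape)  # (64, 49, 49)
```

Interior windows get an all-zero mask (no restriction); only windows along the right and bottom borders mix wrapped-around regions, and only their cross-region pairs are blocked.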
Relative Position Bias
Unlike ViT, which uses absolute positional embeddings, Swin Transformer injects positional information through a relative position bias added to each attention head:

Attention(Q, K, V) = SoftMax(QKᵀ/√d + B)V

where B ∈ ℝ^(M²×M²) is the relative position bias matrix and d is the query/key dimension. Since relative positions along each axis range from −M+1 to M−1, the bias is parameterized as a smaller matrix B̂ ∈ ℝ^((2M−1)×(2M−1)) that is indexed for each query-key pair. This adds negligible parameters but provides consistent improvements — the ablation study shows +1.2% top-1 accuracy on ImageNet compared to no positional encoding, and +0.5% compared to absolute position embeddings.
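A sketch of how the (2M−1)² table can be indexed to build the full M²×M² bias (NumPy; `bias_table` stands in for the learned B̂):

```python
import numpy as np

M = 7
# Coordinates of every token in an M x M window.
coords = np.stack(np.meshgrid(np.arange(M), np.arange(M), indexing="ij"))
flat = coords.reshape(2, -1)                       # (2, M*M)
rel = flat[:, :, None] - flat[:, None, :]          # pairwise offsets in [-(M-1), M-1]
rel = rel.transpose(1, 2, 0) + (M - 1)             # shift to [0, 2M-2]
index = rel[:, :, 0] * (2 * M - 1) + rel[:, :, 1]  # flatten 2D offset to one id

bias_table = np.random.randn((2 * M - 1) ** 2)     # stands in for the learned B-hat
B = bias_table[index]                              # (M*M, M*M) bias matrix
print(B.shape, index.max() + 1)  # (49, 49) 169
```

Because every query-key pair with the same spatial offset shares one table entry, the bias is shared across positions within the window rather than learned per pair.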
Hierarchical Feature Maps via Patch Merging
Swin Transformer produces multi-scale feature maps by progressively reducing spatial resolution through patch merging layers, analogous to pooling in CNNs.
The architecture has four stages. Starting from an input image of H × W × 3:
- Stage 1: A patch partition layer splits the image into non-overlapping 4 × 4 patches (each flattened to a 48-dim vector), followed by a linear embedding to C dimensions. Resolution: H/4 × W/4.
- Stage 2: A patch merging layer concatenates features of each 2 × 2 group of neighboring patches (yielding 4C-dim vectors), then projects to 2C via a linear layer. Resolution: H/8 × W/8.
- Stage 3: Same merging operation, resolution H/16 × W/16, channel dimension 4C.
- Stage 4: Resolution H/32 × W/32, channel dimension 8C.
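The merging step in Stages 2–4 is again just a reshape plus a linear projection. A minimal sketch (LayerNorm omitted; random weights stand in for the learned projection):

```python
import numpy as np

def patch_merging(x, weight):
    """Concatenate each 2x2 group of tokens and project 4C -> 2C.

    x: (H, W, C) feature map; weight: (4C, 2C) projection.
    The real layer also applies LayerNorm before the projection.
    """
    H, W, C = x.shape
    x = x.reshape(H // 2, 2, W // 2, 2, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape(H // 2, W // 2, 4 * C)
    return x @ weight

C = 96
x = np.random.rand(56, 56, C)        # stage-1 output for a 224x224 input
w = np.random.rand(4 * C, 2 * C)     # stands in for the learned weight
print(patch_merging(x, w).shape)  # (28, 28, 192)
```

Each stage halves the spatial resolution and doubles the channels, exactly like the transition between CNN stages.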
This produces a feature pyramid at 4x, 8x, 16x, and 32x downsampling — the same set of scales that FPN and other multi-scale architectures expect. This structural compatibility is what makes Swin Transformer a drop-in replacement for CNN backbones in detection and segmentation frameworks.
Architecture Variants
The paper defines four model sizes, all sharing the same architecture but varying in channel dimension and block depth:
- Swin-T (Tiny): C = 96, blocks = [2, 2, 6, 2], 29M params, 4.5 GFLOPs
- Swin-S (Small): C = 96, blocks = [2, 2, 18, 2], 50M params, 8.7 GFLOPs
- Swin-B (Base): C = 128, blocks = [2, 2, 18, 2], 88M params, 15.4 GFLOPs
- Swin-L (Large): C = 192, blocks = [2, 2, 18, 2], 197M params, 34.5 GFLOPs
The depth is concentrated in Stage 3 (18 blocks for S/B/L), which operates at 16x downsampling — a design choice that mirrors ResNet's heavy middle stages and balances computational cost against representational capacity.
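The four variants and their per-stage channel widths can be summarized in a few lines (a sketch; the dict layout is illustrative, not the paper's notation):

```python
# Swin variants: base channel dim C and number of blocks per stage.
VARIANTS = {
    "Swin-T": dict(C=96,  depths=[2, 2, 6, 2]),
    "Swin-S": dict(C=96,  depths=[2, 2, 18, 2]),
    "Swin-B": dict(C=128, depths=[2, 2, 18, 2]),
    "Swin-L": dict(C=192, depths=[2, 2, 18, 2]),
}
for name, cfg in VARIANTS.items():
    # Channels double at each patch-merging step: [C, 2C, 4C, 8C].
    widths = [cfg["C"] * 2**i for i in range(4)]
    print(name, cfg["depths"], widths)
```

Scaling from S to B to L keeps the depth pattern fixed and grows only the width, which keeps the stage-3-heavy compute profile intact.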
Key Results
ImageNet-1K Classification: Swin-B achieves 83.5% top-1 accuracy at 224×224, outperforming DeiT-B's 83.1% at 384×384 while using far fewer FLOPs. With ImageNet-22K pretraining, Swin-L reaches 87.3% top-1 accuracy on 384×384 inputs, surpassing the previous best ViT-Large result (87.1%) while using fewer FLOPs.
COCO Object Detection: Using an improved HTC framework (HTC++), Swin-L achieves 58.7 box AP and 51.1 mask AP on COCO test-dev. This represents a +3.6 box AP improvement over the previous best result using ResNeXt-101-64x4d as backbone, demonstrating that the hierarchical transformer features transfer effectively to detection.
ADE20K Semantic Segmentation: With UperNet, Swin-L achieves 53.5 mIoU on the ADE20K validation set, a +3.2 mIoU gain over the previous best using the same framework. The multi-scale features from the four stages are directly consumed by the decoder without any adaptation.
Critical Analysis
Strengths:
- Linear complexity enables processing high-resolution inputs that are infeasible for global attention models. This is not just a theoretical advantage — it enabled practical deployment on tasks like detection and segmentation where 800×1333 inputs are standard.
- Multi-scale feature maps make Swin Transformer structurally compatible with the entire ecosystem of dense prediction heads (FPN, UperNet, Cascade R-CNN), eliminating the need for task-specific architectural redesign.
- General-purpose backbone: the same pretrained model serves classification, detection, segmentation, and later video understanding, reducing engineering overhead compared to task-specific architectures.
Limitations:
- Window boundary artifacts: despite shifted windows, the effective receptive field grows slowly across layers. Tasks requiring long-range spatial reasoning (e.g., counting distant objects, understanding scene layout) may suffer compared to global attention.
- Implementation complexity: the cyclic shift with attention masking is considerably more involved than standard transformer attention. This complicates custom kernel development and makes the model harder to port to new hardware.
- Fixed window size: the M = 7 window size is a hyperparameter that does not adapt to content. Regions with large homogeneous textures waste capacity on uninformative local attention, while regions with fine detail may need more than 7×7 context.
- Training recipe sensitivity: achieving the reported numbers requires careful use of augmentation (RandAugment, Mixup, CutMix), regularization (stochastic depth, label smoothing), and the AdamW optimizer with cosine learning rate decay. The architecture alone, without this recipe, underperforms.
Impact and Legacy
Swin Transformer became the dominant vision backbone for the 2021–2023 period, displacing both CNN-based (ResNet, EfficientNet) and earlier ViT-based approaches in detection and segmentation benchmarks. Its influence extended well beyond classification:
Swin V2 (2022) extended the architecture to 3 billion parameters and 1536×1536 resolution by replacing the learned bias matrix with a log-spaced continuous relative position bias, addressing the resolution transfer problem. Video Swin Transformer applied the shifted window mechanism to 3D spatiotemporal volumes for video understanding. The architecture also became the default backbone for medical imaging (Swin UNETR for 3D segmentation) and remote sensing.
The broader legacy is architectural: Swin Transformer demonstrated that the key to making transformers work for vision was not better pretraining or larger datasets, but structural inductive biases borrowed from CNNs — locality, hierarchy, and translation equivariance through sliding windows. This insight influenced subsequent architectures like ConvNeXt, which showed CNNs could match Swin when given the same training recipe, and MaxViT, which combined local and global attention.
Related Reading
- Attention Is All You Need — the original transformer architecture that Swin adapts for vision
- Vision Transformer (ViT) — the baseline global-attention ViT that Swin improves upon
- DETR — another approach to transformers in detection, using global attention with learned object queries
- EfficientNet — the CNN scaling methodology that Swin displaced as the top ImageNet backbone
- BEiT — masked image modeling pretraining that further improved Swin Transformer performance
- Deep Residual Learning (ResNet) — the hierarchical CNN backbone design that Swin Transformer's feature pyramid mirrors
