
Deep Residual Learning for Image Recognition

ResNet analysis: how skip connections and residual learning solved the degradation problem, enabling training of 100+ layer neural networks.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun | 15 min read | Original Paper | Computer Vision · CNN · ResNet

TL;DR

Deeper neural networks should be at least as accurate as their shallower counterparts — a deeper model can always copy the shallow layers and set the extra layers to identity. In practice, this does not happen: deeper plain networks exhibit higher training error, a phenomenon the authors call the degradation problem. He et al. fix this by reformulating layers to learn residual functions $\mathcal{F}(\mathbf{x}) = \mathcal{H}(\mathbf{x}) - \mathbf{x}$ via skip connections, making it trivially easy for extra layers to default to identity. The resulting ResNets train stably at 152 layers, win ILSVRC 2015 with 3.57% top-5 error on ImageNet (via an ensemble), and become the default backbone architecture across computer vision.

The Degradation Problem

Before ResNet, the common belief was that stacking more layers should improve accuracy — more parameters means more capacity. Techniques like batch normalization and ReLU activations had already addressed the vanishing/exploding gradient problem, allowing networks of 20-30 layers to converge. But beyond that depth, something unexpected happened.

The authors trained 20-layer and 56-layer plain networks on CIFAR-10 and observed that the 56-layer network had higher training error than the 20-layer network. This is not overfitting (which would show lower training error but higher test error). It is an optimization failure: SGD cannot find a good solution in the deeper network's loss landscape. The same pattern appeared on ImageNet — a 34-layer plain network had higher training error than an 18-layer one.

The theoretical argument for why this should not happen is straightforward. Given a shallow network that achieves some accuracy, a deeper network can always match it: copy the learned layers and set all additional layers to identity mappings. The deeper network's solution space is a strict superset. Yet optimizers fail to find this construction, meaning the loss surface of deep plain networks contains pathological regions that trap gradient-based methods.
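
The copy-plus-identity construction can be checked numerically. Here is a minimal sketch (a toy numpy model with made-up layer sizes, not anything from the paper) showing that a deeper network whose extra layers are exact identity mappings reproduces the shallow network's output:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Toy 2-layer "shallow" network (hypothetical sizes).
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(3, 8))
shallow = lambda x: W2 @ relu(W1 @ x)

# 4-layer "deep" network: the same learned layers, plus two extra layers
# whose weights are identity matrices. Because ReLU output is already
# non-negative, applying ReLU again changes nothing, so the identity
# layers pass activations through intact.
I = np.eye(8)
deep = lambda x: W2 @ relu(I @ relu(I @ relu(W1 @ x)))

x = rng.normal(size=4)
assert np.allclose(shallow(x), deep(x))  # deeper net matches exactly
```

The construction exists in the deeper network's parameter space; the paper's point is that SGD fails to find it in plain networks.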

This observation is the paper's key motivation. The degradation problem is not a capacity issue — it is a trainability issue. Batch normalization ensures gradients neither vanish nor explode, yet deeper networks still degrade. The question becomes: can we restructure the network so that identity mappings are easy to learn?

Skip Connections: The Core Idea

The answer is residual learning. Instead of asking a stack of layers to learn a desired mapping $\mathcal{H}(\mathbf{x})$ directly, restructure them to learn the residual via skip connections:

$$\mathcal{F}(\mathbf{x}) := \mathcal{H}(\mathbf{x}) - \mathbf{x}$$

The output of the block then becomes:

$$\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}$$

This is implemented by adding a shortcut connection (skip connection) that bypasses one or more layers and performs identity mapping. The element-wise addition of $\mathbf{x}$ to the layer output requires no extra parameters and adds negligible computation.

The insight is about optimization, not representational power. If the optimal function is close to identity, pushing $\mathcal{F}(\mathbf{x})$ toward zero is easier than pushing $\mathcal{H}(\mathbf{x})$ toward $\mathbf{x}$ — the weights are initialized near zero, so the residual formulation starts closer to a good solution. In the worst case, the network can always set $\mathcal{F}(\mathbf{x}) = \mathbf{0}$ and pass the input through unchanged, recovering at least the performance of the shallower network.
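
A residual block is a one-line change to the forward pass. This sketch (a two-layer numpy MLP standing in for the two conv layers of a basic block; shapes are illustrative, not the paper's code) shows that zero residual weights make the block an exact identity:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """y = F(x) + x: a two-layer residual branch plus an identity shortcut."""
    f = W2 @ relu(W1 @ x)   # residual branch F(x)
    return f + x            # identity shortcut: element-wise addition

x = np.array([1.0, -2.0, 3.0])

# With the residual weights at zero, F(x) = 0 and the block passes the
# input through unchanged — the "worst case" described above.
zeros = np.zeros((3, 3))
assert np.allclose(residual_block(x, zeros, zeros), x)
```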

Gradient flow provides another perspective. During backpropagation, the gradient through a residual block is:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \left(1 + \frac{\partial \mathcal{F}}{\partial \mathbf{x}}\right)$$

The additive 1 term means the gradient always has a direct path back through the skip connection, mitigating the vanishing gradient problem even in very deep networks. This is distinct from approaches like batch normalization or careful initialization, which help but do not fully solve degradation at extreme depths.
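
The additive 1 term can be verified numerically in the scalar case. Below, a toy residual block $y = f(x) + x$ uses $f(x) = \tanh(wx)$ as a stand-in for the residual branch (the weight and test points are arbitrary), and a finite-difference gradient is compared against $1 + f'(x)$:

```python
import numpy as np

# Toy scalar residual block y = f(x) + x, with f(x) = tanh(w * x)
# standing in for the residual branch (w is an arbitrary weight).
w = 0.5
f = lambda x: np.tanh(w * x)
y = lambda x: f(x) + x

x0, eps = 1.3, 1e-6
numeric = (y(x0 + eps) - y(x0 - eps)) / (2 * eps)   # finite difference
analytic = 1.0 + w * (1.0 - np.tanh(w * x0) ** 2)   # 1 + dF/dx
assert abs(numeric - analytic) < 1e-6

# Even where the residual branch saturates (dF/dx -> 0), the total
# gradient stays near 1 instead of vanishing, thanks to the shortcut.
x_sat = 50.0
numeric_sat = (y(x_sat + eps) - y(x_sat - eps)) / (2 * eps)
assert abs(numeric_sat - 1.0) < 1e-3
```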

Shortcut Connection Variants

The paper evaluates three options for handling dimension mismatches at stage boundaries where spatial resolution halves and channel count doubles:

  • Option A (zero-padding): Use identity shortcuts everywhere. When dimensions increase, pad the extra channels with zeros and use stride-2 shortcuts. No extra parameters.
  • Option B (projection shortcuts at boundaries): Use 1×1 convolutions only when dimensions change, identity shortcuts elsewhere. Adds a small number of parameters at stage transitions.
  • Option C (all projection shortcuts): Replace all skip connections with 1×1 convolutions, even when dimensions match.

Results on ImageNet: Option C is marginally better than B, which is marginally better than A. But the differences are small (fractions of a percent), and the authors attribute C's advantage to the extra parameters rather than to a better optimization path. The paper adopts Option B as the default — it balances parameter efficiency with accurate dimension matching.
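
Options A and B can be sketched in a few lines of numpy. The feature map is laid out channels-first and the sizes are illustrative (a hypothetical 64-to-128-channel stage boundary); a 1×1 convolution is written as a per-pixel matrix multiply over the channel axis:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8, 8))   # (channels, height, width), toy sizes

# Stage boundary: spatial resolution halves, channels double (64 -> 128).
x_strided = x[:, ::2, ::2]        # stride-2 subsampling on the shortcut

# Option A: identity shortcut with zero-padded extra channels (no params).
shortcut_a = np.concatenate([x_strided, np.zeros_like(x_strided)], axis=0)

# Option B: projection shortcut — a 1x1 convolution, i.e. a per-pixel
# matrix multiply over channels (adds 128 * 64 weights at this boundary).
W = rng.normal(size=(128, 64))
shortcut_b = np.einsum('oc,chw->ohw', W, x_strided)

assert shortcut_a.shape == (128, 4, 4)
assert shortcut_b.shape == (128, 4, 4)
```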

Architecture Details

The paper presents five ResNet variants of increasing depth: ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152. All follow the same high-level structure: an initial 7×7 convolution with stride 2, max pooling, four stages of residual blocks with increasing channel counts (64, 128, 256, 512), global average pooling, and a fully connected classification layer.

Basic Block (ResNet-18/34): Each residual block consists of two 3×3 convolutional layers with batch normalization and ReLU. The skip connection adds the input directly to the output. When spatial dimensions change between stages (stride 2 downsampling), a 1×1 convolution with stride 2 is used on the shortcut to match dimensions.

Bottleneck Block (ResNet-50/101/152): For deeper networks, the authors introduce a bottleneck design to manage computational cost. Each block uses three layers: a 1×1 convolution that reduces the channel dimension (e.g., 256 to 64), a 3×3 convolution that operates in this reduced space, and a 1×1 convolution that restores the original dimension. This reduces the parameter count per block while maintaining the same output dimensionality. A 3-layer bottleneck block with 64 channels in the middle has similar computation to a 2-layer basic block with 64 channels, but the bottleneck variant has a 256-dimensional output, enabling much deeper stacking.
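
The cost comparison is simple arithmetic. Counting conv kernel weights only (biases and batch-norm parameters ignored), using the channel sizes described above:

```python
# Per-block weight counts for the channel sizes in the text.
basic = 2 * (3 * 3 * 64 * 64)        # two 3x3 convs, 64 -> 64 each
bottleneck = (1 * 1 * 256 * 64       # 1x1 reduce: 256 -> 64 channels
              + 3 * 3 * 64 * 64      # 3x3 conv in the reduced 64-d space
              + 1 * 1 * 64 * 256)    # 1x1 restore: 64 -> 256 channels

assert basic == 73728
assert bottleneck == 69632  # slightly fewer weights, yet a 256-d output
```

The bottleneck block costs about the same as the basic block while operating on a 4× wider representation, which is why the 50/101/152-layer variants stay parameter-efficient.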

Model       Layers  Parameters  FLOPs  Top-5 Error (ImageNet)
ResNet-18       18       11.7M   1.8G  10.92%
ResNet-34       34       21.8M   3.6G   7.76%
ResNet-50       50       25.6M   3.8G   6.71%
ResNet-101     101       44.5M   7.6G   6.07%
ResNet-152     152       60.2M  11.5G   5.71%

A critical detail: ResNet-50 has only marginally more parameters than ResNet-34 despite being 16 layers deeper, thanks to the bottleneck design. The efficiency of the 1×1 projection layers means depth scales better than width in this architecture.

Key Results

ImageNet (ILSVRC 2015): An ensemble of residual networks, with a 152-layer model as its deepest member, achieved 3.57% top-5 error on the ImageNet test set, winning 1st place at ILSVRC 2015. For context, here is the progression of ILSVRC winners:

Year  Model      Depth  Top-5 Error
2012  AlexNet        8  16.4%
2013  ZFNet          8  11.7%
2014  GoogLeNet     22   6.7%
2014  VGGNet        19   7.3%
2015  ResNet       152   3.57%

The 34-layer ResNet reduced top-1 error by 3.5% compared with the 34-layer plain network, directly demonstrating the benefit of residual connections. The gap between plain and residual networks widened with depth, confirming that skip connections specifically address the degradation problem rather than providing a general accuracy boost.

CIFAR-10: The authors pushed depth further, training networks with 110 and even 1202 layers. The 110-layer ResNet achieved 6.43% error, competitive with the state of the art at the time. The 1202-layer network trained successfully (no optimization failure), though it achieved slightly worse results (7.93%) than the 110-layer variant — likely due to overfitting on the small CIFAR-10 dataset (50k training images for a model of that size), not degradation.

Transfer learning: ResNet features transferred strongly to detection and segmentation. On COCO object detection, replacing a VGG-16 backbone with ResNet-101 in Faster R-CNN improved mAP@[.5, .95] by 6.0 points (from 21.2% to 27.2%, a 28% relative improvement). This result demonstrated that deeper representations learn more transferable features, not just better classifiers. ResNet won 1st place in ILSVRC and COCO 2015 competitions across all four tracks: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

Training details: The models were trained with SGD (momentum 0.9, weight decay 1e-4) using a mini-batch size of 256. Learning rate started at 0.1 and was divided by 10 when the error plateaued. Data augmentation included scale augmentation (short side randomly sampled from [256, 480]) and random horizontal flipping. Batch normalization was applied after each convolution and before ReLU activation. No dropout was used — batch normalization provided sufficient regularization.
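
The recipe above can be summarized as a small configuration sketch. The hyperparameter values come from the text; the schedule function is an illustrative reconstruction, since the paper decides plateaus by inspecting the error curve rather than by a formula:

```python
# Hyperparameters stated in the text.
MOMENTUM = 0.9
WEIGHT_DECAY = 1e-4
BATCH_SIZE = 256
BASE_LR = 0.1

def lr_after_plateaus(n_plateaus, base_lr=BASE_LR):
    """Divide the learning rate by 10 each time validation error plateaus."""
    return base_lr / (10 ** n_plateaus)

assert lr_after_plateaus(0) == 0.1
assert lr_after_plateaus(2) == 0.1 / 100   # after two plateaus: 0.001
```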

Key Takeaways

  1. Degradation is not overfitting — deeper plain networks have higher training error, not just test error. This is a fundamental optimization issue, not a generalization issue.

  2. Residual learning reframes the problem — learning $\mathcal{F}(\mathbf{x}) = \mathcal{H}(\mathbf{x}) - \mathbf{x}$ instead of $\mathcal{H}(\mathbf{x})$ biases layers toward identity, making depth safe to add.

  3. Skip connections provide gradient highways — the additive identity path guarantees gradient flow regardless of how deep the network is, complementing batch normalization.

  4. Bottleneck design decouples depth from parameters — the 1×1 projection pattern allows ResNet-50 to have similar parameters to ResNet-34 while being substantially deeper and more accurate.

  5. Depth transfers — the improvements from deeper ResNets are not limited to classification; they transfer directly to detection and segmentation, indicating that deeper features are genuinely more informative.

Critical Analysis

Strengths:

  • Elegant simplicity. The core idea — add the input to the output — requires zero additional parameters for identity shortcuts and negligible extra computation. This makes it immediately adoptable without specialized hardware or training procedures.
  • Strong empirical evidence for the degradation problem. The paper carefully isolates degradation from overfitting by showing higher training error in deeper plain nets, making a compelling case that the problem is optimization-theoretic.
  • Scalable depth. The bottleneck design allows practical training of 152-layer networks with reasonable parameter counts, and the 1202-layer CIFAR experiment demonstrates that gradient flow remains healthy at extreme depths.
  • Broad applicability. The same residual connection pattern works across classification, detection, and segmentation with no modification, suggesting it addresses a fundamental optimization issue rather than a task-specific one.

Limitations:

  • No theoretical guarantee. The paper provides strong intuition (identity mappings are easier to learn as residuals) but no formal proof of why residual networks converge better. Later work by Li et al. (2018) showed that skip connections smooth the loss landscape, providing geometric insight but still no convergence guarantee.
  • Diminishing returns at extreme depth. The 1202-layer CIFAR result is worse than 110 layers, suggesting that residual connections are necessary but not sufficient for arbitrarily deep networks. Regularization and capacity-data ratio still matter.
  • Identity mapping assumption. The skip connection works best when input and output dimensions match. When they do not (at stage boundaries), the paper uses projection shortcuts (1×1 convolutions), which introduce parameters and break the pure identity property. The follow-up paper "Identity Mappings in Deep Residual Networks" (He et al. 2016) showed that pre-activation residual blocks further improve gradient flow.
  • Width vs. depth trade-off unexplored. The paper focuses exclusively on depth. Wide Residual Networks (Zagoruyko & Komodakis, 2016) later showed that wider, shallower ResNets can outperform deeper, narrower ones with better computational efficiency, suggesting the paper's emphasis on depth as the key axis was incomplete.

Impact and Legacy

ResNet's influence on computer vision and deep learning is difficult to overstate, and it shows up in concrete terms. The paper has accumulated over 145,000 citations, making it one of the most cited papers in all of computer science. Skip connections became a default architectural primitive — present in virtually every modern architecture from DenseNet to transformers.

The core mechanism maps directly to the transformer architecture: the residual stream in transformers, where attention and MLP outputs are added to the input, is exactly a skip connection. Without this pattern, training 96-layer GPT-3 or 32-layer ViT would face the same degradation problem. In this sense, ResNet did not just solve a CNN problem — it established a design principle that made the entire large-model era possible.

Specific architectural descendants include: DenseNet (Huang et al. 2017), which generalizes skip connections to connect every layer to every other layer; ResNeXt (Xie et al. 2017), which adds grouped convolutions to the residual block; EfficientNet (Tan & Le 2019), which uses compound scaling of depth, width, and resolution with residual blocks; and ConvNeXt (Liu et al. 2022), which modernizes the ResNet design with transformer-inspired components and achieves competitive performance with ViT.

In the pretraining era, ResNet-50 remains one of the most commonly used backbones for self-supervised learning methods (MoCo, BYOL, SimCLR, DINO) and serves as the standard benchmark architecture for comparing representation learning approaches.

The paper also had a methodological impact beyond skip connections. The combination of batch normalization, skip connections, and the absence of dropout established a training recipe that dominated CNN training for years. The idea that architectural changes can substitute for explicit regularization influenced how practitioners thought about network design.

Related Reading

  • Attention Is All You Need — transformers use the same residual connection pattern that ResNet introduced
  • Vision Transformer (ViT) — applying transformers (with residual streams) directly to image patches, challenging CNN dominance
  • EfficientNet — compound scaling of ResNet-style architectures for better accuracy-efficiency trade-offs
  • DINO — self-supervised learning that commonly uses ResNet-50 as its CNN backbone
  • Swin Transformer — hierarchical vision transformer that incorporates residual connections within each block

If you found this paper review helpful, consider sharing it with others.
