Feature Pyramid Networks

Learn how Feature Pyramid Networks build multi-scale feature representations through top-down pathways and lateral connections for robust object detection.


Feature Pyramid Networks: Multi-Scale Feature Fusion

Feature Pyramid Networks (FPN) solved one of computer vision's oldest headaches: detecting objects at wildly different scales without paying the computational cost of image pyramids. Before FPN, you either ran a detector on multiple resized copies of the image (slow) or used only the final CNN features (poor for small objects). FPN showed that a lightweight top-down pathway could recycle features the backbone already computed, producing a rich multi-scale representation at marginal extra cost.

The key innovation is combining semantically strong but spatially coarse features from deep layers with spatially precise but semantically weak features from shallow layers — giving every pyramid level the best of both worlds.

The Map Zoom Analogy

Think of a mapping application. When you zoom all the way out, you see country borders and major highways — high-level structure, no detail. When you zoom in, you see individual buildings and street names — rich detail, no context. Now imagine a system that overlays the zoomed-out labels onto the zoomed-in view, giving you both detail and context simultaneously. That is exactly what FPN does with CNN features: it takes the "zoomed-out" semantic understanding from deep layers and fuses it back into the "zoomed-in" spatial detail of shallow layers.

Map Zoom Demo

FPN works like Google Maps: different zoom levels reveal different details. High resolution catches small objects, low resolution captures the big picture.

(Interactive: P2 acts as the satellite view, with 256 x 256 resolution and a 16 x 16 px receptive field, detecting edges, textures, and small objects such as pedestrians, signs, and bikes; P3-P4 correspond to the city view and P5 to the country view.)

Key insight: Just like switching zoom levels in a map reveals different features, FPN creates a pyramid of feature maps at multiple resolutions. Each level is responsible for detecting objects at a specific scale, ensuring nothing is missed regardless of size.

FPN Architecture

Bottom-Up Pathway

The bottom-up pathway is simply the backbone network (ResNet, EfficientNet, etc.) running its normal forward pass. As features flow through successive stages, spatial resolution halves while channel depth and semantic richness increase. FPN taps into the output of each stage, producing feature maps C2, C3, C4, C5 at strides of 4, 8, 16, and 32 pixels respectively.
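To make the tap points concrete, here is a minimal sketch assuming a torchvision ResNet-50 backbone; the helper function and variable names are illustrative, not a library API:

```python
import torch
from torchvision.models import resnet50

# Hedged sketch: tap the stage outputs of a ResNet-50 as C2-C5.
backbone = resnet50(weights=None).eval()

def bottom_up_features(x):
    # Stem: stride 4 after conv1 (stride 2) and maxpool (stride 2)
    x = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))
    c2 = backbone.layer1(x)   # stride 4,  256 channels
    c3 = backbone.layer2(c2)  # stride 8,  512 channels
    c4 = backbone.layer3(c3)  # stride 16, 1024 channels
    c5 = backbone.layer4(c4)  # stride 32, 2048 channels
    return c2, c3, c4, c5

with torch.no_grad():
    for name, f in zip("C2 C3 C4 C5".split(),
                       bottom_up_features(torch.randn(1, 3, 224, 224))):
        print(name, tuple(f.shape))  # C2 (1, 256, 56, 56) ... C5 (1, 2048, 7, 7)
```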

Top-Down Pathway

Starting from the coarsest level C5, FPN upsamples by 2x using nearest-neighbor interpolation and merges with the next finer level. This propagates strong semantic information downward to higher-resolution maps. For each pyramid level i:

P_i = \text{Conv}_{3 \times 3}\!\bigl(C_i^{\text{lateral}} + \text{Upsample}(P_{i+1})\bigr)

Where C_i^{\text{lateral}} is the bottom-up feature C_i passed through a 1x1 convolution to match channel dimensions, and P_{i+1} is the coarser pyramid level. The final 3x3 convolution reduces aliasing artifacts from upsampling.
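The merge step is only a few lines in practice. Below is a hedged PyTorch sketch of the top-down pathway, assuming a 256-channel pyramid and the C2-C5 inputs from the previous sketch; module names are illustrative, not the reference implementation:

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Minimal sketch of the FPN top-down pathway (illustrative names)."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convs project each C_i to a common channel width
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels])
        # 3x3 convs smooth each merged map to reduce upsampling aliasing
        self.smooth = nn.ModuleList(
            [nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
             for _ in in_channels])

    def forward(self, c2, c3, c4, c5):
        laterals = [conv(c) for conv, c in zip(self.lateral, (c2, c3, c4, c5))]
        merged = [laterals[-1]]  # start at the coarsest level
        for lat in reversed(laterals[:-1]):
            # P_i = lateral(C_i) + Upsample(P_{i+1}), nearest-neighbor
            top = F.interpolate(merged[0], size=lat.shape[-2:], mode="nearest")
            merged.insert(0, lat + top)
        return [conv(m) for conv, m in zip(self.smooth, merged)]  # P2..P5
```

Feeding in the C2-C5 maps from the previous sketch yields four 256-channel pyramid maps at strides 4 through 32.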

ROI Level Assignment

Objects of different sizes are assigned to the pyramid level best suited for their scale:

k = \bigl\lfloor k_0 + \log_2\!\bigl(\sqrt{wh}\,/\,224\bigr) \bigr\rfloor

Where k_0 = 4 is the canonical level, and w, h are the ROI dimensions. Small objects land on fine-grained levels; large objects land on coarse levels.
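As a sketch, the assignment rule translates directly to Python; clamping k to the available P2-P5 range is an assumption borrowed from common implementations:

```python
import math

def roi_to_level(w, h, k0=4, k_min=2, k_max=5):
    # 224 is the canonical ImageNet pretraining size used in the FPN paper
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return max(k_min, min(k_max, k))  # clamp to the available levels P2-P5

print(roi_to_level(100, 100))  # 2 -> small object, fine-grained level
print(roi_to_level(448, 448))  # 5 -> large object, coarse level
```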

Exploring the Architecture

The interactive diagram below lets you trace how features flow through FPN's three pathways — bottom-up, lateral, and top-down. Select any pyramid level to see its spatial resolution, channel depth, and which backbone stage feeds into it.

FPN Architecture Explorer

Visualize the bottom-up backbone, top-down pathway, and lateral connections. Click any level to inspect it, or animate the data flow.

(Interactive: selecting a level C2-C5 shows its channel depth, from 256 up to 2048, and its semantic strength; the legend distinguishes the backbone bottom-up pathway, the FPN top-down pathway, and the 1x1 lateral connections.)

How Lateral Connections Work

Lateral connections are the bridge between the bottom-up and top-down pathways. A 1x1 convolution projects each backbone stage's feature map to a uniform channel dimension (typically 256), then element-wise addition merges it with the upsampled top-down feature. This fusion is what gives each pyramid level both spatial precision and semantic richness — the shallow feature contributes where objects are, while the deep feature contributes what objects are.
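To make the channel arithmetic concrete, here is a hedged single-merge example at the C4 level, with tensor sizes assumed for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

c4 = torch.randn(1, 1024, 32, 32)  # backbone feature at stride 16 (assumed sizes)
p5 = torch.randn(1, 256, 16, 16)   # coarser pyramid level at stride 32

lateral = nn.Conv2d(1024, 256, kernel_size=1)  # project "where" to 256 channels
p4 = lateral(c4) + F.interpolate(p5, scale_factor=2.0, mode="nearest")  # add "what"
print(p4.shape)  # torch.Size([1, 256, 32, 32])
```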

Lateral Connection Demo

See how lateral connections merge high-level semantic features with spatially precise backbone features at each FPN level.

(Interactive: at the C4 level, a 1024-channel backbone feature at H/16 resolution is projected down to 256 channels and merged with the upsampled top-down semantic feature.)

Multi-Scale Object Detection

FPN's true power shows when detecting objects across a wide range of sizes in a single image. A pedestrian 30 pixels tall activates strongly at level P2, while a bus spanning 400 pixels activates at P5. Without FPN, a single-scale detector struggles with one or the other. With FPN, each object is matched to the pyramid level where its size aligns with the detector's receptive field, dramatically improving recall across scales.

Multi-Scale Detection Demo

Compare single-scale detection vs FPN multi-scale. Each object is assigned to the optimal FPN level based on its area using k = floor(k0 + log2(sqrt(wh)/224)).

(Interactive: object sizes range from 16 px (tiny) to 512 px (huge). A 100 x 100 px object gives k = floor(4 + log2(sqrt(10000) / 224)) = 2, so it is assigned to P2. Example detection rates: 88% for small objects, 93% for medium, 94% for large.)

FPN assigns each object to its optimal pyramid level. Small objects use high-resolution P2, large objects use semantically rich P5. This yields uniformly high detection rates across all sizes.

FPN Variants

Since the original FPN, researchers have proposed many variations that improve feature fusion quality, speed, or both. PANet adds a second bottom-up pass for better localization. BiFPN introduces learnable fusion weights and skip connections. NAS-FPN uses neural architecture search to discover optimal connection patterns. The table below compares these variants across accuracy, speed, and architectural complexity.

FPN Variant Comparison

How the original FPN evolved into more powerful multi-scale feature fusion architectures.

| Variant | Year | Feature Fusion | Direction | Learnable Weights | Compute Cost | mAP Gain | Used In |
|---|---|---|---|---|---|---|---|
| Original FPN | 2017 | Element-wise addition | Top-down only | None | Minimal overhead | +2.0 AP | Faster R-CNN, Mask R-CNN, RetinaNet |
| PANet | 2018 | Addition + bottom-up path | Top-down + bottom-up | None | ~15% more FLOPs | +1.2 AP over FPN | YOLOv4, instance segmentation |
| NAS-FPN | 2019 | NAS-discovered topology | Learned connections | Via architecture search | High search cost | +2.5 AP over FPN | Research, AutoML pipelines |
| BiFPN | 2020 | Weighted bi-directional | Bi-directional (repeated) | Fast normalized fusion | Efficient with scaling | +4.0 AP over FPN | EfficientDet family |
| Recursive FPN | 2021 | Recursive refinement | Top-down (iterated) | Shared weights per iteration | Scales with iterations | +1.5 AP over FPN | DetectoRS, HTC++ |
Use Original FPN when...
  • You need a simple, proven baseline
  • Compute budget is tight
  • Training a two-stage detector (Faster R-CNN)
  • Prototyping a new detection architecture

Use BiFPN when...
  • Maximum accuracy matters (competitions)
  • You want learnable fusion weights
  • Using EfficientDet-style compound scaling
  • Small objects are critical to detect
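Of the variants above, BiFPN's fusion rule is compact enough to sketch. The module below illustrates the fast normalized fusion described in the EfficientDet paper; it is a minimal sketch, not EfficientDet's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastNormalizedFusion(nn.Module):
    """Sketch of BiFPN's fast normalized fusion:
    out = sum_i(w_i * f_i) / (sum_j(w_j) + eps), with w_i = ReLU(learned scalar).
    """

    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.raw_weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, features):
        # features: list of same-shape tensors from different pathways
        w = F.relu(self.raw_weights)      # keep fusion weights non-negative
        w = w / (w.sum() + self.eps)      # cheap normalization, no softmax
        return sum(wi * f for wi, f in zip(w, features))

fuse = FastNormalizedFusion(num_inputs=2)
out = fuse([torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32)])
```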

Common Pitfalls

1. Feature Misalignment After Upsampling

Nearest-neighbor upsampling can introduce spatial misalignment between the top-down and lateral features, especially at boundaries. This manifests as blurry detections or shifted bounding boxes. Deformable convolutions in the lateral path or aligned bilinear upsampling can mitigate this.
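As an illustration, swapping the interpolation mode in the merge step is a small change (tensor shapes assumed):

```python
import torch
import torch.nn.functional as F

lateral = torch.randn(1, 256, 32, 32)   # shapes assumed for illustration
p_coarse = torch.randn(1, 256, 16, 16)

# Bilinear upsampling aligns top-down features with the lateral grid more
# smoothly than nearest-neighbor, at a small extra cost.
top_down = F.interpolate(p_coarse, size=lateral.shape[-2:],
                         mode="bilinear", align_corners=False)
merged = lateral + top_down
```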

2. Channel Dimension Mismatch

All pyramid levels must share the same channel dimension for downstream heads to work. If the backbone stages have very different channel counts, the 1x1 lateral convolutions must be initialized carefully — poor initialization can cause some levels to dominate while others contribute noise.

3. Ignoring Extra Pyramid Levels

The original FPN stops at P5, but many detection frameworks add P6 and P7 via strided convolutions for detecting very large objects. Omitting these extra levels when your dataset contains large objects (vehicles in aerial imagery, buildings in satellite photos) leaves performance on the table.
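A hedged sketch of the common recipe, following RetinaNet's approach of deriving P6 from a stride-2 3x3 convolution and P7 from another stride-2 convolution after a ReLU (whether the input is C5 or P5 varies by implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv_p6 = nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1)
conv_p7 = nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1)

p5 = torch.randn(1, 256, 16, 16)  # assumed P5 shape
p6 = conv_p6(p5)                  # stride 64 relative to the input image
p7 = conv_p7(F.relu(p6))          # stride 128, for the very largest objects
print(p6.shape, p7.shape)         # (1, 256, 8, 8) (1, 256, 4, 4)
```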

Key Takeaways

  1. FPN builds a multi-scale feature pyramid at minimal extra cost — reusing the backbone's existing computation rather than processing multiple image scales.

  2. The top-down pathway propagates semantics to fine-grained levels — so even the highest-resolution features understand what they are looking at, not just where.

  3. Lateral connections merge spatial precision with semantic depth — 1x1 projections followed by element-wise addition fuse the complementary strengths of shallow and deep features.

  4. ROI-to-level assignment matches object scale to feature resolution — ensuring each object is detected at the pyramid level where the receptive field best fits its size.

  5. Modern variants like PANet and BiFPN improve fusion quality — through bidirectional pathways, learnable weights, and additional skip connections, each addressing limitations of the original design.

  • Receptive Field — FPN provides multi-scale receptive fields, directly addressing the RF-to-object-size matching problem
  • Dilated Convolutions — An alternative approach to expanding receptive fields without downsampling
  • Convolution Operation — The fundamental building block underlying every FPN pathway
