NAS-FPN: Learning to Design Feature Pyramid Networks

Understanding how neural architecture search discovers optimal feature pyramid architectures that outperform hand-designed alternatives

Overview

NAS-FPN asks a provocative question: what if we let an algorithm design the feature pyramid network instead of relying on human intuition? Using reinforcement learning to search over a vast space of possible architectures, NAS-FPN discovers irregular, asymmetric connection patterns that consistently outperform hand-designed alternatives like FPN and PANet—proving that the "obvious" top-down pathway wasn't optimal after all.

The key insight is that human designers favor symmetric, adjacent-scale connections because they're intuitive, but the optimal architecture often includes surprising long-range skip connections and asymmetric patterns that humans would never consider.

Key Concepts

Neural Architecture Search (NAS)

Automated discovery of optimal neural network architectures using search algorithms, typically reinforcement learning or evolutionary methods.

Merging Cell

The building block of NAS-FPN: takes two feature maps, resizes them to a target resolution, and combines them with sum or global-pooling attention.

Search Space

The set of all possible architectures that NAS can explore. NAS-FPN's search space includes ~10¹⁴ possible configurations.

Proxy Task

A simplified training task (subset of data, fewer epochs) used to quickly evaluate architectures during search. Rankings correlate with full training.

Policy Gradient (REINFORCE)

The RL algorithm used to update the controller. High-AP architectures reinforce the decisions that generated them.

Stacking / Scalability

The discovered architecture can be repeated multiple times to trade compute for accuracy, enabling flexible deployment.

The Problem: Human Design Bias

Since Lin et al. introduced Feature Pyramid Networks in 2017, researchers have proposed numerous variants: PANet added bottom-up paths, and BiFPN introduced weighted fusion. But all these designs share a limitation: they're constrained by human intuition about what connections "should" exist.

Hand-Designed vs. NAS-Discovered Architectures

[Figure: Hand-designed FPN uses a symmetric top-down pathway with adjacent-scale connections only (P7→P6→P5→P4→P3). With 10¹⁴+ possible architectures, humans can explore only a tiny fraction; algorithmic search discovers irregular designs with long-range skip connections and a +3.4 AP improvement.]

Human designers favor:

  • Symmetric patterns — if there's a top-down path, add a bottom-up path
  • Adjacent-scale connections — only connect P3↔P4, P4↔P5, etc.
  • Regular structures — same pattern repeated at each level

But with 5 feature levels and multiple operations, there are over 10¹⁴ possible architectures. Humans can only explore a tiny fraction of this space.

The NAS-FPN Approach

NAS-FPN formulates feature pyramid design as a reinforcement learning problem. An RNN controller generates architecture specifications, child networks are trained and evaluated, and the controller is updated to favor high-performing designs.

Neural Architecture Search Loop

[Figure: The search loop. An RNN (LSTM) controller samples an architecture, a child network is trained on a proxy task (~10 epochs per iteration), its detection AP becomes the reward, and a REINFORCE update improves the controller. The search runs for ~8,000 iterations and ~500 TPU-hours in total.]

Why RL? The reward (detection AP) is non-differentiable with respect to architecture choices. We can't backpropagate through discrete decisions like "connect P3 to P5" — so we use policy gradients instead.
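
Below is a toy, self-contained sketch of the REINFORCE idea on a single discrete choice (which input level to use). It is illustrative only: the real controller is an LSTM that emits many decisions per architecture, and the reward is detection AP on the proxy task rather than the dummy function used here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the RNN controller: logits over 5 candidate input levels (P3..P7).
logits = nn.Parameter(torch.zeros(5))
optimizer = torch.optim.Adam([logits], lr=0.1)

def proxy_reward(choice: int) -> float:
    """Dummy stand-in for 'train a child network and measure AP'; pretends level 2 (P5) is best."""
    return 1.0 if choice == 2 else 0.1

baseline = 0.0  # moving-average baseline for variance reduction
for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    choice = dist.sample()                # discrete decision: cannot backpropagate through it
    reward = proxy_reward(choice.item())  # non-differentiable reward (detection AP in NAS-FPN)
    baseline = 0.9 * baseline + 0.1 * reward
    # REINFORCE: raise the log-probability of decisions that earned above-baseline reward.
    loss = -(reward - baseline) * dist.log_prob(choice)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=0))  # probability mass should concentrate on index 2
```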

How It Works

  1. RNN Controller Generates Architecture: An LSTM-based controller outputs a sequence of decisions that specify the complete FPN architecture: which inputs to use, what operation to apply, where to output.
  2. Build Child Network: The sampled architecture specification is instantiated as an actual neural network with the specified connections and operations.
  3. Train on Proxy Task: Each child network is trained for only ~10 epochs on a subset of COCO. This is enough to get a reliable ranking of architectures.
  4. Compute Reward (AP): The trained network's Average Precision on validation becomes the reward signal. Higher AP means better architecture.
  5. Update Controller via REINFORCE: Policy gradient updates reinforce the decisions that led to high-AP architectures. The controller learns which patterns work.
  6. Repeat 8,000 Times: After ~500 TPU-hours of search, the best discovered architecture is selected for final training and evaluation.

The Merging Cell: Building Block

The key abstraction in NAS-FPN is the merging cell: a unit that takes two input feature maps and combines them to produce one output. The controller decides the inputs, output resolution, and operation for each cell.

[Figure: The merging cell, the building block of NAS-FPN. Two inputs (e.g. P6 and P4) are resized to the target resolution, combined by a sum or global-pooling operation, passed through a 3×3 convolution, and emitted as a single output feature map.]

Sum Operation

output = h₁ + h₂

Simple element-wise addition after resizing both inputs to the same resolution.

Global Pooling (GP)

output = h₁ + σ(GAP(h₂)) · h₂

Attention-weighted addition: global average pooling creates channel attention weights.

For each merging cell, the controller makes four decisions:

  1. Input 1: Which feature level (P3, P4, P5, P6, or P7)
  2. Input 2: Which feature level or previous cell output
  3. Output resolution: What scale the merged output should be
  4. Binary operation: Sum (element-wise addition) or GP (global-pooling attention)

This simple abstraction enables a vast search space while keeping individual cells easy to implement and efficient to run.
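
To make the abstraction concrete, here is a minimal PyTorch-style sketch of a merging cell. It is not code from the paper: the class name, the resize strategy (nearest-neighbor upsampling, adaptive max pooling for downsampling), and the post-merge 3×3 convolution block are assumptions consistent with the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MergingCell(nn.Module):
    """Combines two feature maps into one at a target resolution, via sum or global-pooling attention."""

    def __init__(self, channels: int, op: str = "sum"):
        super().__init__()
        assert op in ("sum", "gp")
        self.op = op
        # 3x3 conv applied after the binary operation, as in the cell diagram.
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    @staticmethod
    def _resize(x: torch.Tensor, out_size) -> torch.Tensor:
        # Upsample with nearest-neighbor interpolation, downsample with adaptive max pooling.
        if tuple(x.shape[-2:]) == tuple(out_size):
            return x
        if x.shape[-2] < out_size[0]:
            return F.interpolate(x, size=out_size, mode="nearest")
        return F.adaptive_max_pool2d(x, output_size=out_size)

    def forward(self, h1: torch.Tensor, h2: torch.Tensor, out_size) -> torch.Tensor:
        h1 = self._resize(h1, out_size)
        h2 = self._resize(h2, out_size)
        if self.op == "sum":
            merged = h1 + h2                                    # output = h1 + h2
        else:
            attn = torch.sigmoid(F.adaptive_avg_pool2d(h2, 1))  # sigma(GAP(h2)): per-channel weights
            merged = h1 + attn * h2                             # output = h1 + sigma(GAP(h2)) * h2
        return self.conv(merged)

# Example: merge a P6-sized map (8x8) with a P4-sized map (32x32) into a P6-sized output.
p6, p4 = torch.randn(1, 256, 8, 8), torch.randn(1, 256, 32, 32)
cell = MergingCell(channels=256, op="gp")
print(cell(p6, p4, out_size=(8, 8)).shape)  # torch.Size([1, 256, 8, 8])
```

The controller's four decisions map directly onto the arguments here: the two inputs, the output resolution, and the operation flag.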

The Discovered Architecture

After searching, NAS-FPN discovers architectures that look nothing like human designs. The connections are irregular, asymmetric, and include surprising long-range skip connections.

NAS-FPN Discovered Architecture

[Figure: The discovered pyramid: 7 merging cells with irregular, asymmetric connections. Global-pooling attention is used selectively (cells 1, 5, and 6), and non-adjacent levels are connected directly, e.g. P3 to P5 without passing through P4.]

Key observations from the discovered architecture:

  • Non-adjacent connections: Direct P3→P5 connections that skip P4 entirely
  • Asymmetric patterns: Different scales get different treatment
  • Strategic GP operations: Global-pooling attention used selectively, not uniformly
  • Long-range dependencies: P7 information flows directly to lower levels

These patterns emerge because they improve detection accuracy, not because they're intuitive to humans.

Scalability: Stacking for Performance

One elegant property of NAS-FPN: the discovered architecture can be stacked multiple times to trade compute for accuracy. Each repetition refines the feature pyramid further.

Scalable NAS-FPN Stacking

[Figure: The backbone (ResNet) feeds a chain of up to 7 repeated NAS-FPN blocks before the detection head. At ×7 the pyramid adds 28.0M parameters and 81B FLOPs and reaches 39.9 AP on COCO; accuracy rises from 37.9 AP at ×1 to 39.9 AP at ×7.]

Stacking   AP     FPN Params   Use Case
×1         37.9   4.0M         Baseline / Real-time
×3         39.1   12.0M        Balanced
×5         39.6   20.0M        High accuracy
×7         39.9   28.0M        Maximum accuracy

Diminishing returns after ×5 suggest the architecture finds most improvements in early iterations.
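
The stacking pattern itself is simple to express. The sketch below assumes a NASFPNLayer stand-in for one repetition of the discovered 7-cell pyramid (reduced here to a per-level convolution so the example stays short and runnable); the class names are illustrative, not taken from the paper or any particular library.

```python
import torch
import torch.nn as nn

class NASFPNLayer(nn.Module):
    """Stand-in for one repetition of the discovered 7-cell pyramid.
    A real implementation would wire the seven merging cells; a per-level conv keeps the sketch short."""
    def __init__(self, channels: int = 256, num_levels: int = 5):
        super().__init__()
        self.refine = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1) for _ in range(num_levels)]
        )

    def forward(self, feats):
        # feats: list of five maps [P3..P7]; returns a refined pyramid with the same shapes.
        return [conv(f) for conv, f in zip(self.refine, feats)]

class StackedNASFPN(nn.Module):
    """Repeats the pyramid `stack_times` times (x1 ... x7) to trade compute for accuracy."""
    def __init__(self, stack_times: int = 7, channels: int = 256):
        super().__init__()
        self.stages = nn.ModuleList([NASFPNLayer(channels) for _ in range(stack_times)])

    def forward(self, feats):
        for stage in self.stages:
            feats = stage(feats)  # each repetition refines the pyramid further
        return feats

# Example: a 5-level pyramid P3..P7 from a 640x640 input (strides 8 to 128).
feats = [torch.randn(1, 256, 640 // s, 640 // s) for s in (8, 16, 32, 64, 128)]
outs = StackedNASFPN(stack_times=3)(feats)
print([tuple(o.shape[-2:]) for o in outs])  # per-level resolutions are preserved
```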

Real-World Applications

  • Maximum Detection Accuracy: when accuracy matters more than inference speed, use NAS-FPN ×7 with a strong backbone for state-of-the-art results.
  • Scalable Deployment: when different devices need different accuracy-speed trade-offs, deploy ×1 on edge devices and ×5 on servers.
  • Drop-in FPN Replacement: for an existing detector with a standard FPN neck, replace FPN with NAS-FPN and keep everything else (see the config sketch after this list).
  • Research Baseline: when studying what makes FPN architectures effective, analyze the discovered patterns to inform future designs.
  • AutoML Pipelines: when building automated detection systems, use NAS-FPN as the neck in NAS-based detectors.
  • Transfer to New Domains: when applying to medical imaging, satellite imagery, etc., the architecture generalizes but may need domain-specific search.
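
For the drop-in replacement scenario, the sketch below shows a hypothetical mmdetection-style configuration; the exact module names, keys, and channel lists depend on the library and version you use, so treat it as illustrative rather than copy-paste.

```python
# Hypothetical mmdetection-style config: swap the FPN neck for NAS-FPN while keeping the
# backbone and detection head unchanged. Keys and values are illustrative; check the
# documentation of the library you actually use.
model = dict(
    backbone=dict(type="ResNet", depth=50),                 # unchanged
    neck=dict(
        type="NASFPN",                                      # was: type="FPN"
        in_channels=[512, 1024, 2048],                      # C3-C5 outputs of ResNet-50
        out_channels=256,
        num_outs=5,                                         # P3-P7
        stack_times=7,                                      # x1 for real-time, x7 for maximum accuracy
    ),
    bbox_head=dict(type="RetinaHead", num_classes=80),      # unchanged
)
```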

Advantages & Limitations

Advantages

  • Discovers architectures humans wouldn't consider
  • +3.4 AP improvement over standard FPN (same backbone)
  • Scalable via stacking for accuracy-compute trade-offs
  • Drop-in replacement for existing FPN implementations
  • Reveals insights about optimal feature pyramid design
  • Architecture generalizes across different backbones

Limitations

  • High search cost (~500 TPU-hours, one-time)
  • Discovered architecture is less interpretable than hand-designed
  • May not be optimal for domains very different from COCO
  • More parameters and compute than simple FPN
  • Search space design requires expertise
  • Results depend on proxy task correlation

Best Practices

  • Start with Published Architecture: Use the paper's discovered architecture directly—no need to re-run search. The 7-cell design works well across domains.
  • Choose Stacking Based on Budget: ×1 for real-time, ×3-5 for balanced, ×7 for maximum accuracy. Profile on your target hardware.
  • Keep Strong Backbone: NAS-FPN amplifies backbone quality. ResNet-50 gives 39.9 AP; AmoebaNet pushes to 48.3 AP.
  • Use with RetinaNet: The paper validates on RetinaNet. Results should transfer to other one-stage detectors.
  • Consider BiFPN for Efficiency: If you need maximum efficiency over accuracy, EfficientDet's BiFPN may be better suited.
  • Analyze for Insights: Study the discovered connections to understand what patterns matter—may inspire better hand-designed alternatives.

NAS-FPN vs Other FPN Variants

Aspect               FPN            PANet                  BiFPN                  NAS-FPN
Design Method        Hand-crafted   Hand-crafted           Hand-crafted           NAS (RL)
Connection Pattern   Top-down       Top-down + Bottom-up   Bidirectional + Skip   Discovered (irregular)
Fusion Method        Add            Add                    Weighted Add           Sum or GP (learned)
Scalability          Fixed          Fixed                  Stackable              Stackable (×1-7)
Search Cost          N/A            N/A                    N/A                    ~500 TPU-hours
AP Gain              baseline       +0.8                   +1.5                   +3.4

Lessons from NAS-FPN

Beyond the specific architecture, NAS-FPN teaches broader lessons:

Human intuitions are biased: We favor symmetric, "clean" architectures even when irregular ones work better. The search reveals that our design preferences don't align with optimal performance.

Skip connections matter more than adjacency: Direct P3→P5 connections can be more valuable than going through P4. Information doesn't need to flow through every intermediate level.

Search space design is crucial: The merging cell abstraction enables tractable search while being expressive enough to discover novel architectures.

Proxy tasks work: Quick training on small data correlates with full training, enabling fast architecture evaluation. This insight enables efficient search.

Scalability should be designed in: The best architectures can be stacked for accuracy-compute trade-offs, providing deployment flexibility.

Performance Summary

Method                 Backbone    AP     AP₅₀   AP₇₅   APₛ    APₘ    APₗ
RetinaNet + FPN        ResNet-50   36.5   55.4   39.1   20.4   40.3   48.1
RetinaNet + NAS-FPN    ResNet-50   39.9   59.6   43.4   24.2   44.3   52.4
RetinaNet + NAS-FPN    AmoebaNet   48.3   –      –      –      –      –

+3.4 AP improvement with the same backbone, demonstrating that FPN architecture matters significantly for detection performance.
