EfficientNet: Compound Scaling for CNNs

EfficientNet achieves state-of-the-art image classification accuracy with improved efficiency through a novel compound scaling method for CNNs.

Mingxing Tan, Quoc V. Le

TL;DR

EfficientNet demonstrates that scaling CNNs along depth, width, and resolution simultaneously with a fixed ratio produces better accuracy-efficiency trade-offs than scaling any single dimension. The authors use neural architecture search to find a strong baseline (B0), then apply a compound scaling coefficient to uniformly scale all three dimensions, producing a family of models (B0–B7). EfficientNet-B7 reaches 84.3% ImageNet top-1 accuracy while being 8.4x smaller and 6.1x faster on inference than the previous best model (GPipe).

The Scaling Problem

Before EfficientNet, practitioners scaled CNNs by independently increasing one of three dimensions:

  • Depth (number of layers): deeper networks capture more complex features, but suffer from vanishing gradients and diminishing returns. The paper notes that ResNet-1000 achieves accuracy similar to ResNet-101 despite having far more layers.
  • Width (channels per layer): wider networks capture finer-grained features, but wide shallow networks struggle to learn high-level abstractions.
  • Resolution (input image size): higher resolution provides more fine-grained detail, but accuracy gains saturate quickly. Going from 224 to 560 pixels improves accuracy by less than 1% for a fixed architecture.

The key empirical observation in the paper is that these three dimensions are not independent. Scaling resolution without also scaling depth and width means the network lacks the capacity to process the additional spatial detail. Conversely, scaling depth without increasing resolution gives the network more capacity than it can use on a low-resolution input.

The authors validate this with a controlled experiment: for any given depth or width, accuracy gain from increasing resolution diminishes faster than when depth and width are scaled in tandem. This motivates a principled approach to joint scaling.

Compound Scaling Method

The paper formalizes the scaling problem as a constrained optimization. Given a baseline network with depth d, width w, and resolution r, scaling is parameterized by a single compound coefficient φ:

d = α^φ, w = β^φ, r = γ^φ

subject to the constraint:

α · β² · γ² ≈ 2

The rationale for this constraint is computational: FLOPs scale linearly with depth (α) but quadratically with width (β²) and resolution (γ²). The constraint ensures that for each unit increase in φ, total FLOPs roughly double, giving a predictable compute budget.

The authors perform a grid search with φ = 1 to find the optimal ratio, arriving at:

α = 1.2, β = 1.1, γ = 1.15

which satisfies 1.2 · 1.1² · 1.15² ≈ 2.0. Once these base coefficients are fixed, scaling to any target compute budget is a matter of increasing φ. EfficientNet-B0 uses φ = 0, B1 uses φ = 1, and so on up to B7 with φ = 7.
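The scaling arithmetic above fits in a few lines of Python. This is a sketch of the formula only; real EfficientNet variants round the resulting depths, widths, and resolutions to concrete layer and channel counts, which this does not attempt:

```python
# Compound scaling sketch using the paper's base coefficients.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # found by grid search at phi = 1

def compound_scale(phi: float) -> tuple[float, float, float]:
    """Return (depth, width, resolution) multipliers for a given phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

def flops_multiplier(phi: float) -> float:
    """FLOPs grow linearly with depth, quadratically with width and resolution."""
    d, w, r = compound_scale(phi)
    return d * w ** 2 * r ** 2

# The constraint alpha * beta^2 * gamma^2 ~= 2 means each unit of phi
# roughly doubles total FLOPs.
assert abs(ALPHA * BETA ** 2 * GAMMA ** 2 - 2.0) < 0.1

for phi in range(8):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"res x{r:.2f}, FLOPs x{flops_multiplier(phi):.1f}")
```

Because the product is 1.92 rather than exactly 2, the compute budget drifts slightly below a strict doubling as φ grows.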

The Baseline: EfficientNet-B0

The compound scaling method is architecture-agnostic in principle, but the choice of baseline matters significantly. The authors use multi-objective neural architecture search (NAS) to find a baseline that jointly optimizes accuracy and FLOPs, similar to the approach in MnasNet. The search space is built on mobile inverted bottleneck convolution (MBConv) blocks.

EfficientNet-B0 has 5.3M parameters and requires 0.39B FLOPs — comparable to MobileNetV2 but with higher accuracy (77.1% vs 72.0% ImageNet top-1). The architecture consists of 7 stages of MBConv blocks with varying kernel sizes (3x3 and 5x5), expansion ratios (1 and 6), and channel counts.
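The stage layout can be captured as a simple data structure. The numbers below are a transcription of the commonly reproduced B0 configuration (Table 1 of the paper, stem and head omitted); treat the exact strides and channel counts as an illustrative sketch rather than a verified implementation:

```python
# EfficientNet-B0 MBConv stages, transcribed from the paper's Table 1
# (stem conv and classification head omitted). Each entry is:
# (block type, kernel size, stride, expansion ratio, output channels, layers).
B0_STAGES = [
    ("MBConv1", 3, 1, 1, 16, 1),
    ("MBConv6", 3, 2, 6, 24, 2),
    ("MBConv6", 5, 2, 6, 40, 2),
    ("MBConv6", 3, 2, 6, 80, 3),
    ("MBConv6", 5, 1, 6, 112, 3),
    ("MBConv6", 5, 2, 6, 192, 4),
    ("MBConv6", 3, 1, 6, 320, 1),
]

# 16 MBConv blocks spread across the 7 stages mentioned above.
total_blocks = sum(layers for *_, layers in B0_STAGES)
print(f"{len(B0_STAGES)} stages, {total_blocks} MBConv blocks")
```

Compound scaling then multiplies the per-stage layer counts by α^φ and the channel counts by β^φ (with rounding) to produce B1 through B7.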

MBConv Blocks and Squeeze-and-Excitation

Each stage of EfficientNet uses mobile inverted bottleneck convolution (MBConv) blocks, originally introduced in MobileNetV2. The structure of each block is:

  1. Expansion: a 1x1 convolution expands channels by a factor of 6 (MBConv6) or keeps them unchanged (MBConv1).
  2. Depthwise convolution: a spatial convolution with a 3x3 or 5x5 kernel operates independently on each channel, reducing compute from O(k² · C_in · C_out) to O(k² · C_in) per spatial location.
  3. Squeeze-and-Excitation (SE): global average pooling compresses spatial dimensions to a channel descriptor, which passes through two FC layers with a reduction ratio to produce per-channel attention weights. This recalibrates channel responses based on global context.
  4. Projection: a 1x1 convolution projects back to the target output channels with a linear activation.

Residual connections are added when the input and output dimensions match, following the inverted residual pattern. The "inverted" aspect refers to the expansion-then-projection structure, which is the reverse of traditional residual blocks that narrow first.
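A minimal NumPy sketch may make the squeeze-and-excitation step (step 3 above) concrete. This simplifies the real block: biases are dropped, ReLU stands in for the swish activation EfficientNet actually uses, and the weights here are random placeholders:

```python
import numpy as np

def squeeze_excite(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Recalibrate the channels of a (C, H, W) feature map via global context.

    w1: (C//r, C) reduction weights; w2: (C, C//r) expansion weights,
    where r is the reduction ratio. Simplified sketch: no biases,
    ReLU instead of swish.
    """
    # Squeeze: global average pooling -> one scalar descriptor per channel.
    z = x.mean(axis=(1, 2))                   # shape (C,)
    # Excite: bottleneck MLP producing per-channel attention in (0, 1).
    s = np.maximum(w1 @ z, 0.0)               # ReLU, shape (C//r,)
    a = 1.0 / (1.0 + np.exp(-(w2 @ s)))       # sigmoid, shape (C,)
    # Scale: reweight each channel map by its attention value.
    return x * a[:, None, None]

rng = np.random.default_rng(0)
C, r = 16, 4
x = rng.standard_normal((C, 8, 8))
out = squeeze_excite(x,
                     rng.standard_normal((C // r, C)) * 0.1,
                     rng.standard_normal((C, C // r)) * 0.1)
assert out.shape == x.shape
```

The key point is that the attention weights depend on the whole spatial extent of the input (via the global pool), so each channel is rescaled using context a plain convolution cannot see.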

Scaling from B0 to B7

With the base coefficients fixed, the family scales predictably:

Model   φ   Resolution   Parameters   FLOPs   Top-1 Acc.
B0      0   224          5.3M         0.39B   77.1%
B1      1   240          7.8M         0.70B   79.1%
B2      2   260          9.2M         1.0B    80.1%
B3      3   300          12M          1.8B    81.6%
B4      4   380          19M          4.2B    82.9%
B5      5   456          30M          9.9B    83.6%
B6      6   528          43M          19B     84.0%
B7      7   600          66M          37B     84.3%
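Consistent with the α · β² · γ² ≈ 2 constraint, the FLOP column roughly doubles with each unit of φ. A quick sanity check over the numbers quoted above (the drift away from exactly 2x comes from rounding resolutions and channel counts to practical values):

```python
# FLOPs in billions for B0..B7, transcribed from the table above.
flops = [0.39, 0.70, 1.0, 1.8, 4.2, 9.9, 19, 37]

# Ratio of each model's FLOPs to its predecessor's.
ratios = [b / a for a, b in zip(flops, flops[1:])]

# Each step lands near the predicted 2x, within rounding slack.
assert all(1.4 <= ratio <= 2.4 for ratio in ratios)
print([round(ratio, 2) for ratio in ratios])
```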

For context, GPipe (the previous ImageNet SOTA) achieved 84.3% with 557M parameters and 238B FLOPs. EfficientNet-B7 matches this accuracy with 8.4x fewer parameters while running 6.1x faster on inference. Even B4 at 82.9% accuracy edges past SENet (82.7%) while using roughly 10x fewer FLOPs.

The models also transfer well. On 8 widely-used datasets (CIFAR-10, CIFAR-100, Flowers, Cars, etc.), EfficientNets achieve state-of-the-art accuracy on 5 of 8 while using an average of 9.6x fewer parameters than previous best results.

Key Results

The paper's central empirical claim is that compound scaling consistently outperforms single-dimension scaling across architectures. Using the same FLOP budget, compound-scaled models achieve 2.5% higher accuracy than width-only or depth-only scaling on MobileNet and ResNet baselines.

On ImageNet, EfficientNet-B0 surpasses ResNet-50 accuracy (77.1% vs 76.0%) with roughly 4.9x fewer parameters (5.3M vs 26M). At the high end, B7 matches GPipe accuracy at a fraction of the compute. The accuracy-vs-FLOPs curve of the EfficientNet family forms a clear Pareto frontier above prior architectures.

The paper also reports latency measurements: EfficientNet-B1 runs 5.7x faster than ResNet-152 in single-threaded CPU inference while delivering higher accuracy. However, the authors note that FLOPs do not perfectly correlate with wall-clock time due to memory access patterns and hardware-specific optimizations.

Critical Analysis

Strengths:

  • Principled scaling framework. The compound coefficient provides a simple, reproducible recipe for scaling any architecture. Before this work, scaling decisions were ad hoc and architecture-specific.
  • Strong empirical efficiency. The parameter and FLOP reductions are substantial and consistent across model sizes, not cherry-picked at a single operating point.
  • Transferability. The scaling method improves accuracy on architectures beyond the NAS-derived baseline (ResNet, MobileNet), suggesting the insight generalizes.

Limitations:

  • NAS cost is hidden. The paper presents compound scaling as the contribution, but the strong B0 baseline is critical to the results. The NAS search itself requires substantial compute (comparable to training thousands of models), and the paper does not fully account for this cost when comparing efficiency.
  • SE block overhead. Squeeze-and-Excitation adds parameters and latency that are not captured well by FLOP counts. On hardware without efficient global pooling (e.g., some edge TPUs), SE blocks cause disproportionate slowdowns.
  • MobileNet-derived constraints. The MBConv search space inherits MobileNetV2 design choices (depthwise separable convolutions, inverted residuals, ReLU6). These are optimized for mobile inference and may not be the best building blocks for datacenter GPUs, where different operations have different relative costs.
  • Fixed scaling ratios. The α, β, γ coefficients are found at φ = 1 and applied unchanged to φ = 7. There is no guarantee that the optimal scaling ratio at 37B FLOPs is the same as at 0.7B FLOPs. Evidence from EfficientNetV2 later confirmed that high-resolution scaling becomes less beneficial at larger scales.
  • Training cost at large scale. B7 requires large input resolutions (600x600), which increases memory consumption quadratically. Training B7 with standard batch sizes requires significant memory optimization or model parallelism.

Impact and Legacy

EfficientNet shifted the community's approach to model design from architecture engineering toward principled scaling. Its influence appears in several directions:

EfficientNetV2 (Tan & Le, 2021) addressed the training speed limitation by combining NAS with training-aware optimization. It replaced some MBConv blocks with Fused-MBConv (standard convolutions instead of depthwise separable) in early stages, where the latter is slower on modern accelerators. It also introduced progressive training that gradually increases image resolution during training.

Scaling laws. The compound scaling idea parallels the broader scaling-laws research in NLP (Kaplan et al., 2020), where later work such as Chinchilla (Hoffmann et al., 2022) showed that compute-optimal training requires jointly scaling model size and data. EfficientNet demonstrated this principle for architecture dimensions in vision.

Architecture search baselines. EfficientNet-B0 became a standard NAS baseline and benchmark. Many subsequent NAS papers compare against it to demonstrate search efficiency improvements.

Replaced by ViTs at scale. At the largest compute budgets, Vision Transformers (ViT, DeiT, Swin) have largely superseded EfficientNet. ViTs scale more gracefully to very large datasets and benefit more from self-supervised pretraining. However, EfficientNet variants remain competitive at small and medium scales, particularly on edge devices.

Continued practical use. As of 2025, EfficientNet and its V2 variant remain among the most deployed classification backbones in production, particularly in mobile and edge applications where the parameter-accuracy trade-off matters directly.

  • Deep Residual Learning — ResNets introduced skip connections and demonstrated the value of depth scaling
  • Vision Transformer — ViT challenged the CNN scaling paradigm with patch-based transformers
  • Swin Transformer — hierarchical vision transformer that competes with EfficientNet at multiple scales
  • YOLO — real-time detection where backbone efficiency directly impacts latency
  • DINO — self-supervised pretraining that benefits from scalable backbone architectures
