NAdam: Nesterov-Accelerated Adam Optimizer

Understanding the NAdam optimizer that combines Adam's adaptive learning rates with Nesterov's look-ahead momentum for faster convergence


Overview

NAdam (Nesterov-Accelerated Adaptive Moment Estimation) is an optimization algorithm that combines the best of two worlds: Adam's adaptive per-parameter learning rates and Nesterov momentum's "look-ahead" gradient computation. While Adam revolutionized deep learning by making training more robust to hyperparameter choices, NAdam takes it further by incorporating the theoretically superior Nesterov momentum, resulting in faster convergence without sacrificing Adam's ease of use.

The key insight is simple but powerful: instead of computing the gradient at the current position, NAdam computes it at where momentum will take you. This "look-ahead" allows the optimizer to correct its course before overshooting, leading to smoother and faster convergence especially in loss landscapes with narrow valleys or saddle points.

Key Concepts

Gradient Descent

The foundational optimization technique: iteratively move parameters in the direction that reduces loss. All modern optimizers build on this.

Momentum

Accumulates velocity from past gradients, smoothing updates and accelerating through consistent gradient directions like a ball rolling downhill.

Nesterov Momentum

Computes gradient at the 'look-ahead' position where momentum will take you, not where you currently are. Allows course correction before overshooting.

Adaptive Learning Rates

Adam's innovation: each parameter gets its own learning rate based on historical gradient magnitudes. Frequent gradients get smaller steps, rare gradients get larger steps.

First & Second Moments

First moment (m) is the running average of gradients (momentum). Second moment (v) is the running average of squared gradients (controls per-parameter learning rate).

Bias Correction

Early training steps need adjustment because running averages are initialized to zero. Without correction, early updates would be too small.
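The effect of bias correction is easy to verify numerically: with a constant gradient, the raw running average starts far below its target, while the corrected estimate hits the target immediately. A minimal sketch (the values and variable names are illustrative):

```python
# With every gradient fixed at g = 1.0, the running average m should
# estimate 1.0, but it is biased toward its zero initialization early on.
beta1 = 0.9
g = 1.0
m = 0.0
for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)   # bias-corrected estimate
    print(t, round(m, 4), round(m_hat, 4))  # m_hat is 1.0 from step one
```

After three steps the raw average m has only reached 0.271, while the corrected m̂ is exactly 1.0 throughout.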

The Optimization Challenge

Training neural networks means navigating a loss landscape—a high-dimensional surface where we seek the lowest point (minimum loss). This landscape is rarely a smooth bowl; it's filled with narrow valleys, saddle points, and noisy gradients from mini-batch sampling.

Loss Landscape Navigation

[Interactive visualization: navigating from high loss at the start θ₀ to the global minimum — the core optimization challenge]

The three fundamental challenges every optimizer must handle:

  1. Learning Rate Sensitivity: Too high and you diverge; too low and training takes forever
  2. Noisy Gradients: Mini-batch gradients are estimates, not the true gradient
  3. Parameter Scales: Different parameters often need different update magnitudes
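Challenge 1 is easy to reproduce on a toy 1-D quadratic loss L(θ) = θ² (gradient 2θ); this illustrative sketch is mine, not from the article:

```python
# Plain gradient descent on L(theta) = theta**2; each step multiplies
# theta by (1 - 2*lr), so any lr with |1 - 2*lr| > 1 diverges.
def run_gd(lr, steps=20, theta0=1.0):
    theta = theta0
    for _ in range(steps):
        theta -= lr * 2 * theta
    return abs(theta)

slow = run_gd(lr=0.01)     # too low: still far from the minimum
good = run_gd(lr=0.4)      # well chosen: essentially converged
diverged = run_gd(lr=1.1)  # too high: |1 - 2*lr| > 1, blows up
print(slow, good, diverged)
```

Twenty steps at lr=0.01 barely move, lr=0.4 is effectively at the minimum, and lr=1.1 has grown far past its starting point.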

From SGD to Momentum

Standard SGD has a notorious problem: it oscillates wildly across narrow valleys while making slow progress along the valley floor. This happens because the gradient perpendicular to the valley is large, but the gradient along the valley is small.

SGD vs Momentum in a Narrow Valley

[Interactive visualization: the narrow valley problem, common in deep learning — SGD oscillates across the valley while momentum converges smoothly]
SGD Update
θₜ₊₁ = θₜ − α · gₜ

Direct gradient step — sensitive to noise

Momentum Update
vₜ = β · vₜ₋₁ + gₜ
θₜ₊₁ = θₜ − α · vₜ

Accumulates velocity — smooths updates


Momentum solves this by accumulating velocity:

  • Oscillating gradients cancel out — reducing zigzag across the valley
  • Consistent gradients accumulate — accelerating progress along the valley

Think of it like a ball rolling downhill: it builds speed in consistent directions and its inertia dampens oscillations. The typical momentum coefficient β = 0.9 means 90% of the previous velocity is retained.
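The two update equations above translate directly into a few lines of Python; a minimal 1-D sketch (the names are my own):

```python
# Momentum update from the formulas above:
#   v_t = beta * v_{t-1} + g_t;  theta_t = theta_{t-1} - alpha * v_t
def momentum_step(theta, v, grad, alpha=0.01, beta=0.9):
    v = beta * v + grad        # accumulate velocity
    theta = theta - alpha * v  # step along the velocity
    return theta, v

# Under a constant gradient the velocity grows toward the geometric
# limit g / (1 - beta) = 10, illustrating the acceleration effect.
theta, v = 1.0, 0.0
for _ in range(50):
    theta, v = momentum_step(theta, v, grad=1.0)
print(round(v, 3))
```

With β = 0.9 a consistent gradient direction is effectively amplified up to tenfold, while alternating gradients largely cancel in v.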

The Nesterov Look-Ahead Trick

Nesterov momentum makes one crucial change: instead of computing the gradient at your current position, compute it at where momentum is about to take you. This "look-ahead" provides invaluable foresight.

Standard vs Nesterov Momentum

[Interactive visualization: starting at θₜ with accumulated velocity vₜ₋₁, standard momentum computes the gradient at the current position, while Nesterov momentum computes it at the look-ahead position]
Standard Momentum
vₜ = β·vₜ₋₁ + ∇L(θₜ)
Nesterov Momentum
vₜ = β·vₜ₋₁ + ∇L(θₜ − α·β·vₜ₋₁)


Why does this matter?

  • Standard momentum: Blindly applies velocity, then corrects based on where you end up
  • Nesterov momentum: Peeks ahead, sees the terrain, then adjusts before moving

This look-ahead is provably optimal for convex functions and empirically faster for deep learning. Nesterov saw the future of optimization—literally.
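In code, the only change from standard momentum is where the gradient is evaluated. In the convention used above (v accumulates raw gradients and θ moves by −α·v), the position the decayed velocity is about to produce is θ − α·β·v, so that is where we look ahead. A sketch on a toy quadratic (grad_fn and the other names are my own):

```python
# Nesterov momentum: evaluate the gradient at the look-ahead point
# rather than the current position. grad_fn returns dL/dtheta.
def nesterov_step(theta, v, grad_fn, alpha=0.1, beta=0.9):
    lookahead = theta - alpha * beta * v  # where decayed velocity carries us
    v = beta * v + grad_fn(lookahead)     # gradient at the look-ahead point
    theta = theta - alpha * v
    return theta, v

# Minimise L(theta) = theta**2 (gradient 2*theta) from theta = 1.
theta, v = 1.0, 0.0
for _ in range(100):
    theta, v = nesterov_step(theta, v, lambda x: 2 * x)
print(abs(theta))  # close to the minimum at 0
```

Replacing `lookahead` with `theta` gives standard momentum; the single changed line is the whole trick.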

The NAdam Algorithm

NAdam elegantly incorporates Nesterov's look-ahead into Adam's framework. The key is in Step 6: instead of using the bias-corrected first moment directly, NAdam computes a "Nesterov-enhanced" momentum estimate.

NAdam Algorithm Step-by-Step

[Interactive visualization: step through the algorithm — Adam + Nesterov = NAdam, the best of both worlds]

Full NAdam Algorithm

Initialize: m₀ = 0, v₀ = 0
For each step t:
gₜ = ∇L(θₜ₋₁)
mₜ = β₁·mₜ₋₁ + (1-β₁)·gₜ
vₜ = β₂·vₜ₋₁ + (1-β₂)·gₜ²
m̂ₜ = mₜ/(1-β₁ᵗ), v̂ₜ = vₜ/(1-β₂ᵗ)
m̂_nesterov = β₁·m̂ₜ + (1-β₁)·gₜ/(1-β₁ᵗ)
θₜ = θₜ₋₁ − α·m̂_nesterov/(√v̂ₜ + ε)

Default Hyperparameters

  • α = 0.001 — learning rate
  • β₁ = 0.9 — momentum decay
  • β₂ = 0.999 — variance decay
  • ε = 10⁻⁸ — stability term

How It Works

  1. Initialize State Variables: Set the first moment (m) and second moment (v) to zero. The timestep t starts at 0.
  2. Compute Gradient: Backpropagate through the network to get the gradient of the loss with respect to the parameters.
  3. Update First Moment (Momentum): Exponential moving average of gradients: m = β₁·m + (1-β₁)·g. This accumulates momentum.
  4. Update Second Moment (Variance): Exponential moving average of squared gradients: v = β₂·v + (1-β₂)·g². This tracks per-parameter gradient magnitudes.
  5. Apply Bias Correction: Correct for zero initialization: m̂ = m/(1-β₁ᵗ), v̂ = v/(1-β₂ᵗ). Critical for early training steps.
  6. Compute Nesterov Momentum: THE KEY STEP: m̂_nesterov = β₁·m̂ + (1-β₁)·g/(1-β₁ᵗ). This is the look-ahead momentum estimate.
  7. Update Parameters: Apply the adaptive, look-ahead update: θ = θ − α·m̂_nesterov/(√v̂ + ε). Parameters with large historical gradients get smaller updates.
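The seven steps above fit in a few lines. Here is a scalar-parameter sketch that mirrors the pseudocode (a hedged illustration, not a library implementation):

```python
import math

# One NAdam step for a single scalar parameter, mirroring steps 2-7.
def nadam_step(theta, m, v, t, grad, alpha=0.001,
               beta1=0.9, beta2=0.999, eps=1e-8):
    t += 1
    m = beta1 * m + (1 - beta1) * grad       # step 3: first moment
    v = beta2 * v + (1 - beta2) * grad ** 2  # step 4: second moment
    m_hat = m / (1 - beta1 ** t)             # step 5: bias correction
    v_hat = v / (1 - beta2 ** t)
    # Step 6 (the key step): Nesterov-enhanced momentum estimate.
    m_nesterov = beta1 * m_hat + (1 - beta1) * grad / (1 - beta1 ** t)
    # Step 7: adaptive, look-ahead parameter update.
    theta = theta - alpha * m_nesterov / (math.sqrt(v_hat) + eps)
    return theta, m, v, t

# Minimise L(theta) = theta**2 (gradient 2*theta) from theta = 1.
theta, m, v, t = 1.0, 0.0, 0.0, 0
for _ in range(2000):
    theta, m, v, t = nadam_step(theta, m, v, t, grad=2 * theta)
print(abs(theta))  # settles near the minimum at 0
```

In a real framework m, v, and θ are tensors and the same arithmetic is applied elementwise, which is why each parameter gets its own effective learning rate.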

Optimizer Convergence Comparison

How do these optimizers compare in practice? The following visualization shows typical convergence behavior on a training task:

[Interactive visualization: training loss over epochs for SGD+Momentum, Adam, and NAdam — NAdam often converges fastest]

Practical Advice: Start with Adam for baseline. Switch to NAdam for faster convergence. Use AdamW for transformers with weight decay. Use SGD for CNNs when you can tune the schedule.

Key observations:

  • SGD+Momentum: Slowest convergence but often better generalization. Needs careful learning rate scheduling.
  • Adam: Fast initial progress, works out-of-the-box with default parameters. May have issues with weight decay.
  • AdamW: Adam with proper weight decay decoupling. Preferred for transformers and when regularization matters.
  • NAdam: Fastest convergence by combining Adam's adaptivity with Nesterov's look-ahead. Same ease of use as Adam.

Real-World Applications

Language Models & Transformers

Large-scale pre-training where convergence speed directly impacts compute costs

Use NAdam or AdamW with learning rate warmup and decay

Computer Vision CNNs

Image classification, object detection, segmentation tasks

NAdam works well; SGD+Momentum with tuned schedule can match or exceed

Rapid Prototyping

Quick experiments where you don't want to tune learning rates

NAdam or Adam with defaults (α=0.001, β₁=0.9, β₂=0.999)

Fine-tuning Pretrained Models

Transfer learning with smaller datasets

Use NAdam with lower learning rate (1e-5 to 1e-4)

Generative Models

GANs, VAEs, diffusion models with complex loss landscapes

Adam or NAdam helps navigate adversarial dynamics

Reinforcement Learning

Policy optimization with noisy reward signals

Adam family optimizers handle non-stationary objectives well

Advantages & Limitations

Advantages

  • Often the fastest convergence among common optimizers
  • Combines benefits of both Adam and Nesterov momentum
  • Works well with default hyperparameters
  • Minimal additional computation over Adam
  • Same memory footprint as Adam
  • Robust to learning rate choice

Limitations

  • Slightly more computation per step than Adam
  • May converge to sharper minima (potential generalization issues)
  • Less studied than Adam in the literature
  • May overfit faster—monitor validation carefully
  • Same weight decay issues as Adam (use NAdamW if available)
  • Not always better than tuned SGD+Momentum for CNNs

Best Practices

  • Start with NAdam Defaults: Use α=0.001, β₁=0.9, β₂=0.999, ε=1e-8. These work well for most tasks without tuning.
  • Use Learning Rate Warmup: Gradually increase learning rate over first 1-5% of training for large models. Helps stabilize early training.
  • Apply Learning Rate Decay: Cosine decay or step decay in later training. NAdam converges fast but may benefit from fine-grained later optimization.
  • Consider Weight Decay Carefully: NAdam has the same L2/weight decay coupling issue as Adam. Use decoupled weight decay (NAdamW) for regularization.
  • Monitor Training Closely: NAdam converges faster, which means it can also overfit faster. Watch validation metrics carefully.
  • Compare Against Baselines: Always benchmark against Adam and tuned SGD. NAdam usually wins on speed but not always on final performance.
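The warmup and decay advice above can be combined into one schedule function; a sketch with illustrative choices (the 5% warmup fraction and the peak rate are assumptions, not prescriptions):

```python
import math

# Linear warmup to peak_lr, then cosine decay toward zero.
def lr_schedule(step, total_steps, peak_lr=0.001, warmup_frac=0.05):
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

total = 1000
print(lr_schedule(0, total), lr_schedule(49, total), lr_schedule(999, total))
```

The returned value would be used to rescale α each step; how that hook looks depends on the framework you use.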

Default Hyperparameters

Parameter | Value | Purpose
α (learning rate) | 0.001 | Step size for parameter updates
β₁ | 0.9 | First moment decay (momentum)
β₂ | 0.999 | Second moment decay (adaptive learning rate)
ε | 10⁻⁸ | Numerical stability in denominator

These defaults work remarkably well across tasks. Only tune the learning rate α if needed—the β values rarely need adjustment.

Key Formulas

Momentum Update:

vₜ = β·vₜ₋₁ + gₜ
θₜ = θₜ₋₁ − α·vₜ

Nesterov Momentum:

vₜ = β·vₜ₋₁ + ∇L(θₜ₋₁ − α·β·vₜ₋₁)
θₜ = θₜ₋₁ − α·vₜ

Adam Update:

mₜ = β₁·mₜ₋₁ + (1-β₁)·gₜ
vₜ = β₂·vₜ₋₁ + (1-β₂)·gₜ²
m̂ₜ = mₜ/(1-β₁ᵗ), v̂ₜ = vₜ/(1-β₂ᵗ)
θₜ = θₜ₋₁ − α·m̂ₜ/(√v̂ₜ + ε)

NAdam Update:

m̂_nesterov = β₁·m̂ₜ + (1-β₁)·gₜ/(1-β₁ᵗ)
θₜ = θₜ₋₁ − α·m̂_nesterov/(√v̂ₜ + ε)

When to Choose NAdam

Scenario | Recommendation
Need fast convergence | NAdam is your best bet
Training large transformers | Use NAdam or AdamW
Limited compute budget | NAdam reduces training time
Rapid prototyping | NAdam with defaults
Need best generalization | Consider tuned SGD+Momentum
Transfer learning | NAdam with a lower learning rate
