He/Kaiming Initialization

Learn He (Kaiming) initialization for ReLU neural networks: understand why ReLU needs special weight initialization, visualize variance flow, and see dead neurons in action.


He/Kaiming Initialization: Optimizing for ReLU Networks

Before a neural network learns anything, its weights must be set to some initial values. This choice matters enormously — the wrong initialization can kill a network before training begins. He initialization (also called Kaiming initialization) solves a specific problem: how to set weights when your network uses ReLU activations, which behave fundamentally differently from older activations like tanh.

The core insight is simple. ReLU zeroes out all negative inputs, cutting the signal's variance in half at every layer. If you don't compensate for this halving, signals vanish exponentially as they travel deeper. He initialization doubles the weight variance to cancel out ReLU's halving — keeping signals alive through hundreds of layers.

The Signal Amplifier Analogy

Think of a deep network as a chain of amplifiers connected in series. Each amplifier boosts the signal, then passes it through a filter (ReLU) that removes the bottom half of the waveform. If you set each amplifier's gain for a normal full-waveform signal (Xavier), the chain bleeds power at every stage. He initialization sets the gain to compensate for the filter — doubling the power to offset the 50% that ReLU removes.

The Signal Amplifier Chain

With every amplifier's gain set for a symmetric, full-waveform signal, the ReLU filter cuts half the power at every stage: a signal that enters at 1.00 drops to 0.50, 0.25, 0.13, 0.06, 0.03, and finally 0.02 after six stages.

The ReLU Problem

Why ReLU Breaks Standard Initialization

Xavier initialization was designed for symmetric activations like tanh, where roughly equal amounts of positive and negative signal pass through. It sets weight variance as:

\text{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}

But ReLU is not symmetric. It passes all positive values unchanged and zeroes all negative values:

f(x) = max(0, x)

This has a precise mathematical consequence — ReLU cuts the output variance in half:

\text{Var}(\text{ReLU}(x)) = \frac{1}{2} \cdot \text{Var}(x)

With Xavier initialization, each layer loses half its signal variance. After 10 layers, only 0.5^10 ≈ 0.001 of the original variance remains; after 20 layers, 0.5^20 ≈ 10^-6. The signal has effectively vanished.
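You can check the halving empirically. A minimal NumPy sketch, measuring the signal's mean square (the quantity the He derivation tracks for a zero-mean input):

```python
import numpy as np

# ReLU zeroes the negative half of a zero-mean signal, so the surviving
# mean-square power is exactly half of the input's.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1_000_000)   # zero-mean signal, variance 1.0
relu_x = np.maximum(x, 0.0)

print(np.mean(x ** 2))        # ~1.00
print(np.mean(relu_x ** 2))   # ~0.50, half the power survives the filter
```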

He's Solution: Double the Variance

He initialization compensates by using only n_in (not the average of fan-in and fan-out) and includes a factor of 2:

\text{Var}(W) = \frac{2}{n_{\text{in}}}

The factor of 2 exactly cancels ReLU's halving. Each layer's effective gain becomes 2 × 0.5 = 1.0, preserving variance through arbitrary depth.
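To make the gain arithmetic concrete, here is a minimal single-layer sketch in NumPy (the layer sizes are arbitrary choices for illustration):

```python
import numpy as np

# One linear layer + ReLU under He and Xavier scaling. He's factor of 2
# cancels ReLU's halving, so the mean-square signal passes through unchanged.
rng = np.random.default_rng(0)
n_in, n_out, batch = 512, 512, 4096

x = rng.normal(size=(batch, n_in))                               # unit-power input
W_he = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
W_xavier = rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_in, n_out))

h_he = np.maximum(x @ W_he, 0.0)
h_xavier = np.maximum(x @ W_xavier, 0.0)
print(np.mean(h_he ** 2))       # ~1.0: gain of 2 x 0.5 = 1.0
print(np.mean(h_xavier ** 2))   # ~0.5: Xavier loses half per layer
```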

Variance Flow Explorer

Track how activation variance evolves layer by layer. With He initialization and ReLU, variance stays near 1.0 across all layers. Switch to Xavier or Random to watch it collapse.


He initialization with ReLU maintains variance near 1.0 across all layers. The factor of 2 in Var(W) = 2/n_in perfectly compensates for ReLU halving the variance at each layer.

He Initialization Variants

He Normal

Draw weights from a normal distribution centered at zero:

W \sim \mathcal{N}\left(0, \sqrt{\frac{2}{n_{\text{in}}}}\right)

This is the most common variant. In PyTorch it is exposed as torch.nn.init.kaiming_normal_; the built-in defaults for nn.Linear and nn.Conv2d also use Kaiming-style scaling, albeit in its uniform form.
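In PyTorch this looks like the following (a minimal sketch; the layer sizes are arbitrary):

```python
import math
import torch.nn as nn

# He normal on a fully connected layer. nn.Linear's own default is a
# Kaiming-uniform scheme, so for plain ReLU it is common to re-initialize explicitly.
layer = nn.Linear(1024, 512)
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
nn.init.zeros_(layer.bias)

# The empirical weight std should sit near sqrt(2 / fan_in).
print(layer.weight.std().item())   # ~0.0442
print(math.sqrt(2.0 / 1024))       #  0.0442
```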

He Uniform

Draw weights from a uniform distribution with matched variance:

W \sim \mathcal{U}\left[-\sqrt{\frac{6}{n_{\text{in}}}},\ \sqrt{\frac{6}{n_{\text{in}}}}\right]

The bounds follow from matching variances: a uniform distribution on [-a, a] has variance a²/3, and setting a²/3 = 2/n_in gives a = √(6/n_in). Some practitioners prefer the uniform variant for its bounded range.
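And the uniform counterpart, again as a sketch with arbitrary sizes:

```python
import math
import torch.nn as nn

# He uniform: same target variance 2/n_in, drawn from a bounded interval.
layer = nn.Linear(1024, 512)
nn.init.kaiming_uniform_(layer.weight, mode='fan_in', nonlinearity='relu')

bound = math.sqrt(6.0 / 1024)                       # ±sqrt(6 / n_in)
print(layer.weight.abs().max().item() <= bound)     # True: weights stay inside the bound
print(layer.weight.var().item())                    # ~0.00195, i.e. 2/1024
```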

Generalized Gain

Different ReLU variants preserve different fractions of variance. He initialization uses a gain factor:

\text{Var}(W) = \frac{\text{gain}^2}{n_{\text{in}}}
Activation       | Gain           | Reasoning
ReLU             | √2             | Zeroes exactly half the distribution
Leaky ReLU (α)   | √(2/(1 + α²))  | Small negative slope preserves slightly more variance
ELU              | 1.0            | Smooth negative region preserves more signal
SELU             | 0.75           | Self-normalizing property reduces needed gain
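PyTorch exposes these gains through torch.nn.init.calculate_gain; a quick check for the activations its table covers (it includes ReLU, Leaky ReLU, and SELU, but not ELU):

```python
import torch.nn as nn

print(nn.init.calculate_gain('relu'))              # 1.414... = sqrt(2)
print(nn.init.calculate_gain('leaky_relu', 0.1))   # sqrt(2 / (1 + 0.1**2)) ~ 1.407
print(nn.init.calculate_gain('selu'))              # 0.75
```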

The Dead Neuron Problem

When weights are initialized poorly, some neurons receive only negative pre-activations. ReLU maps these to exactly zero — and since the gradient of ReLU at zero is also zero, these neurons can never update. They are permanently dead, wasting model capacity.

He initialization minimizes dead neurons by ensuring pre-activations are spread symmetrically around zero with enough variance that most neurons see at least some positive inputs.
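The mechanism is easy to reproduce in PyTorch. In this contrived sketch, a large negative bias stands in for a badly initialized unit whose pre-activations are all negative:

```python
import torch

# Every pre-activation is pushed below zero, so the unit outputs exactly 0
# and the gradient on all of its parameters is 0: gradient descent can never revive it.
torch.manual_seed(0)
x = torch.randn(128, 16)                           # a batch of inputs
w = torch.randn(16, requires_grad=True)
b = torch.tensor(-100.0, requires_grad=True)       # deliberately dead: bias dominates

out = torch.relu(x @ w + b)                        # all zeros
out.sum().backward()

print(out.abs().max().item())                      # 0.0
print(w.grad.abs().max().item(), b.grad.item())    # 0.0 0.0, nothing to learn from
```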

Dead Neuron Detector

Each cell represents a neuron after signals propagate through the network. Dead neurons (red) output zero and can never recover — they waste parameters and capacity.

Neurons are classified as dead (output exactly 0), weak (near zero), or active (healthy). With the right variance, initial pre-activations are centered around zero with enough spread that most neurons see at least some positive inputs and stay active; as depth grows under a poorly scaled initialization, the dead fraction climbs sharply.

He vs Xavier: Head-to-Head

The difference between He and Xavier becomes dramatic in deep ReLU networks. Watch both methods propagate a signal through 20 layers — He maintains stable variance while Xavier decays exponentially.
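The same experiment as a NumPy sketch (width, depth, and batch size are arbitrary choices): He holds the signal's power near 1.0 at every layer, while under Xavier it shrinks by roughly half per layer.

```python
import numpy as np

# Propagate a batch through a 20-layer ReLU MLP and track the mean-square
# activation after every layer, under He vs Xavier weight scaling.
rng = np.random.default_rng(0)
width, depth, batch = 512, 20, 1024

def propagate(weight_std):
    h = rng.normal(size=(batch, width))             # unit-power input
    power = []
    for _ in range(depth):
        W = rng.normal(0.0, weight_std, size=(width, width))
        h = np.maximum(h @ W, 0.0)                  # linear layer + ReLU
        power.append(np.mean(h ** 2))
    return power

he = propagate(np.sqrt(2.0 / width))                # Var(W) = 2/n_in
xavier = propagate(np.sqrt(2.0 / (width + width)))  # Var(W) = 2/(n_in + n_out)

for i in (0, 4, 9, 19):
    print(f"layer {i + 1:2d}   He: {he[i]:.3f}   Xavier: {xavier[i]:.2e}")
```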


Choosing Your Initialization

Initialization Methods at a Glance

Each method is designed for specific activation functions and network architectures.

He Normal: N(0, √(2/n_in))
  Best for: ReLU, Leaky ReLU · Depth: excellent
  Default for modern CNNs and MLPs with ReLU activations.

He Uniform: U[-√(6/n_in), √(6/n_in)]
  Best for: ReLU, ELU · Depth: excellent
  Alternative to He Normal with bounded weights.

Xavier Normal: N(0, √(2/(n_in+n_out)))
  Best for: Tanh, Sigmoid · Depth: good
  Networks with symmetric activations like tanh.

Xavier Uniform: U[-√(6/(n_in+n_out)), √(6/(n_in+n_out))]
  Best for: Tanh, Sigmoid · Depth: good
  Bounded variant of Xavier for symmetric activations.

LeCun Normal: N(0, √(1/n_in))
  Best for: SELU · Depth: moderate
  Self-normalizing networks with SELU activation.

Orthogonal: Q from QR decomposition
  Best for: RNNs, LSTMs · Depth: excellent
  Recurrent networks where gradient preservation is critical.
Use He initialization when:
  • Your network uses ReLU, Leaky ReLU, ELU, or PReLU
  • You are building deep CNNs or MLPs
  • You need stable gradients through many layers
Use Xavier instead when:
  • Your network uses tanh or sigmoid activations
  • You are using transformer architectures with GELU
  • The activation is symmetric around zero

Fan-in vs Fan-out

He initialization has two modes that control different aspects of signal flow:

Fan-in mode preserves the variance of activations during the forward pass. This is the default and the right choice for most networks — it ensures signals maintain their strength as they flow from input to output.

Fan-out mode preserves the variance of gradients during the backward pass. This matters for convolutional layers with batch normalization, where gradient flow during backpropagation is the bottleneck. ResNets and similar architectures use mode='fan_out' for this reason.

In practice, the difference between fan-in and fan-out is small for layers where n_in ≈ n_out. It only matters significantly when the layer changes dimensionality dramatically.
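A sketch of how fan-out mode is typically applied to convolution and batch-norm stacks (the pattern mirrors what torchvision's ResNet does; the helper name here is my own):

```python
import torch.nn as nn

def init_conv_bn(module):
    # Fan-out mode preserves gradient variance on the backward pass, which is the
    # bottleneck for conv layers followed by batch normalization.
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, mode='fan_out', nonlinearity='relu')
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.BatchNorm2d):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)

model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)
model.apply(init_conv_bn)   # applies the initializer to every submodule
```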

Common Pitfalls

1. Wrong Activation Pairing

He initialization is designed for ReLU-family activations. Using it with tanh or sigmoid overestimates the needed variance, causing initial activations to saturate. Xavier is the correct choice for symmetric activations.

2. Ignoring Batch Normalization

Batch normalization re-normalizes activations at every layer, reducing sensitivity to initialization. However, initialization still affects the first few training steps and gradient magnitudes. For networks with BN, use He with mode='fan_out' to prioritize gradient flow.

3. Residual Network Scaling

In residual networks, the skip connection adds the identity to the transformed signal. Without adjustment, variance doubles at every residual block. The fix is to zero-initialize the last batch normalization layer in each residual branch, so the block initially acts as an identity function.
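A sketch of that fix on a simplified residual block (the block layout is a generic illustration, not any particular library's implementation):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

        for conv in (self.conv1, self.conv2):
            nn.init.kaiming_normal_(conv.weight, mode='fan_out', nonlinearity='relu')
        # Zero the last BN's scale: at initialization the branch outputs 0,
        # so the block acts as an identity (plus ReLU) and variance does not double.
        nn.init.zeros_(self.bn2.weight)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(x + out)
```

With the branch silenced at initialization, the signal and its gradients initially flow entirely through the skip path, which is what lets very deep stacks of these blocks train from the first step.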

4. Very Deep Networks Without Normalization

For networks deeper than 50 layers without batch normalization, even He initialization may not suffice. Techniques like Fixup initialization scale weights by d^(-1/4) (where d is depth) to account for the accumulation effect across many layers.

Key Takeaways

  1. ReLU halves variance at every layer. He initialization compensates with Var(W) = 2/n_in, keeping signals alive through arbitrary depth.

  2. Xavier is wrong for ReLU. It assumes symmetric activations and causes exponential variance decay — signals vanish by layer 10 in deep networks.

  3. Dead neurons are permanent. Poor initialization pushes pre-activations negative, where ReLU kills them and zeroes their gradients. He initialization minimizes this by spreading pre-activations around zero.

  4. The gain factor adapts to different activations. ReLU uses √(2), Leaky ReLU adjusts for the slope, and SELU uses 0.75 for its self-normalizing property.

  5. Fan-in for forward pass, fan-out for backward pass. Most networks use fan-in (default). ConvNets with batch normalization benefit from fan-out.

  • Xavier Initialization — The symmetric-activation counterpart designed for tanh and sigmoid
  • Batch Normalization — Reduces initialization sensitivity by re-normalizing activations
  • Dropout — Regularization that can interact with initialization choices
  • Skip Connections — Architectural solution that complements proper initialization
