He/Kaiming Initialization

Learn He (Kaiming) initialization for ReLU neural networks: understand why ReLU needs special weight initialization, visualize variance flow, and see dead neurons in action.


He/Kaiming Initialization: Optimizing for ReLU Networks

Before a neural network learns anything, its weights must be set to some initial values. This choice matters enormously — the wrong initialization can kill a network before training begins. He initialization (also called Kaiming initialization) solves a specific problem: how to set weights when your network uses ReLU activations, which behave fundamentally differently from older activations like tanh.

The core insight is simple. ReLU zeroes out all negative inputs, cutting the signal's variance in half at every layer. If you don't compensate for this halving, signals vanish exponentially as they travel deeper. He initialization doubles the weight variance to cancel out ReLU's halving — keeping signals alive through hundreds of layers.

The Signal Amplifier Analogy

Think of a deep network as a chain of amplifiers connected in series. Each amplifier boosts the signal, then passes it through a filter (ReLU) that removes the bottom half of the waveform. If you set each amplifier's gain for a normal full-waveform signal (Xavier), the chain bleeds power at every stage. He initialization sets the gain to compensate for the filter — doubling the power to offset the 50% that ReLU removes.

The Signal Amplifier Chain

With every amplifier's gain set for a symmetric, full-waveform signal, the ReLU filter cuts half the power at every stage: a signal that enters at 1.00 drops to 0.50, 0.25, 0.13, 0.06, 0.03, and finally 0.02 after six stages.

The ReLU Problem

Why ReLU Breaks Standard Initialization

Xavier initialization was designed for symmetric activations like tanh, where roughly equal amounts of positive and negative signal pass through. It sets weight variance as:

\text{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}

But ReLU is not symmetric. It passes all positive values unchanged and zeroes all negative values:

f(x) = max(0, x)

This has a precise mathematical consequence — ReLU cuts the output variance in half:

\text{Var}(\text{ReLU}(x)) = \frac{1}{2} \cdot \text{Var}(x)

With Xavier initialization, each layer loses half its signal variance. After 10 layers, only 0.5^10 ≈ 0.001 of the original variance remains; after 20 layers, 0.5^20 ≈ 10^-6. The signal has effectively vanished.
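You can check the halving empirically. A minimal NumPy sketch, measuring the signal's mean square (the quantity the He derivation tracks for a zero-mean input):

```python
import numpy as np

# ReLU zeroes the negative half of a zero-mean signal, so the surviving
# mean-square power is exactly half of the input's.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1_000_000)   # zero-mean signal, variance 1.0
relu_x = np.maximum(x, 0.0)

print(np.mean(x ** 2))        # ~1.00
print(np.mean(relu_x ** 2))   # ~0.50, half the power survives the filter
```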

He's Solution: Double the Variance

He initialization compensates by using only n_in (not the average of fan-in and fan-out) and includes a factor of 2:

\text{Var}(W) = \frac{2}{n_{\text{in}}}

The factor of 2 exactly cancels ReLU's halving. Each layer's effective gain becomes 2 × 0.5 = 1.0, preserving variance through arbitrary depth.
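To make the gain arithmetic concrete, here is a minimal single-layer sketch in NumPy (the layer sizes are arbitrary choices for illustration):

```python
import numpy as np

# One linear layer + ReLU under He and Xavier scaling. He's factor of 2
# cancels ReLU's halving, so the mean-square signal passes through unchanged.
rng = np.random.default_rng(0)
n_in, n_out, batch = 512, 512, 4096

x = rng.normal(size=(batch, n_in))                               # unit-power input
W_he = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
W_xavier = rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_in, n_out))

h_he = np.maximum(x @ W_he, 0.0)
h_xavier = np.maximum(x @ W_xavier, 0.0)
print(np.mean(h_he ** 2))       # ~1.0: gain of 2 x 0.5 = 1.0
print(np.mean(h_xavier ** 2))   # ~0.5: Xavier loses half per layer
```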

Variance Flow Explorer

Track how activation variance evolves layer by layer. With He initialization and ReLU, variance stays near 1.0 across all layers. Switch to Xavier or Random to watch it collapse.


He initialization with ReLU maintains variance near 1.0 across all layers. The factor of 2 in Var(W) = 2/n_in perfectly compensates for ReLU halving the variance at each layer.

He Initialization Variants

He Normal

Draw weights from a normal distribution centered at zero:

W \sim \mathcal{N}\left(0, \sqrt{\frac{2}{n_{\text{in}}}}\right)

This is the most common variant. In PyTorch it is exposed as torch.nn.init.kaiming_normal_; the built-in defaults for nn.Linear and nn.Conv2d also use Kaiming-style scaling, albeit in its uniform form.
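In PyTorch this looks like the following (a minimal sketch; the layer sizes are arbitrary):

```python
import math
import torch.nn as nn

# He normal on a fully connected layer. nn.Linear's own default is a
# Kaiming-uniform scheme, so for plain ReLU it is common to re-initialize explicitly.
layer = nn.Linear(1024, 512)
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
nn.init.zeros_(layer.bias)

# The empirical weight std should sit near sqrt(2 / fan_in).
print(layer.weight.std().item())   # ~0.0442
print(math.sqrt(2.0 / 1024))       #  0.0442
```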

He Uniform

Draw weights from a uniform distribution with matched variance:

W \sim \mathcal{U}\left[-\sqrt{\frac{6}{n_{\text{in}}}},\ \sqrt{\frac{6}{n_{\text{in}}}}\right]

The bounds follow from matching variances: a uniform distribution on [-a, a] has variance a²/3, and setting a²/3 = 2/n_in gives a = √(6/n_in). Some practitioners prefer the uniform variant for its bounded range.
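And the uniform counterpart, again as a sketch with arbitrary sizes:

```python
import math
import torch.nn as nn

# He uniform: same target variance 2/n_in, drawn from a bounded interval.
layer = nn.Linear(1024, 512)
nn.init.kaiming_uniform_(layer.weight, mode='fan_in', nonlinearity='relu')

bound = math.sqrt(6.0 / 1024)                       # ±sqrt(6 / n_in)
print(layer.weight.abs().max().item() <= bound)     # True: weights stay inside the bound
print(layer.weight.var().item())                    # ~0.00195, i.e. 2/1024
```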

Generalized Gain

Different ReLU variants preserve different fractions of variance. He initialization uses a gain factor:

\text{Var}(W) = \frac{\text{gain}^2}{n_{\text{in}}}
Activation       | Gain           | Reasoning
ReLU             | √2             | Zeroes exactly half the distribution
Leaky ReLU (α)   | √(2/(1 + α²))  | Small negative slope preserves slightly more variance
ELU              | 1.0            | Smooth negative region preserves more signal
SELU             | 0.75           | Self-normalizing property reduces needed gain
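PyTorch exposes these gains through torch.nn.init.calculate_gain; a quick check for the activations its table covers (it includes ReLU, Leaky ReLU, and SELU, but not ELU):

```python
import torch.nn as nn

print(nn.init.calculate_gain('relu'))              # 1.414... = sqrt(2)
print(nn.init.calculate_gain('leaky_relu', 0.1))   # sqrt(2 / (1 + 0.1**2)) ~ 1.407
print(nn.init.calculate_gain('selu'))              # 0.75
```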

The Dead Neuron Problem

When weights are initialized poorly, some neurons receive only negative pre-activations. ReLU maps these to exactly zero — and since the gradient of ReLU at zero is also zero, these neurons can never update. They are permanently dead, wasting model capacity.

He initialization minimizes dead neurons by ensuring pre-activations are spread symmetrically around zero with enough variance that most neurons see at least some positive inputs.
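The mechanism is easy to reproduce in PyTorch. In this contrived sketch, a large negative bias stands in for a badly initialized unit whose pre-activations are all negative:

```python
import torch

# Every pre-activation is pushed below zero, so the unit outputs exactly 0
# and the gradient on all of its parameters is 0: gradient descent can never revive it.
torch.manual_seed(0)
x = torch.randn(128, 16)                           # a batch of inputs
w = torch.randn(16, requires_grad=True)
b = torch.tensor(-100.0, requires_grad=True)       # deliberately dead: bias dominates

out = torch.relu(x @ w + b)                        # all zeros
out.sum().backward()

print(out.abs().max().item())                      # 0.0
print(w.grad.abs().max().item(), b.grad.item())    # 0.0 0.0, nothing to learn from
```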

Dead Neuron Detector

Each cell represents a neuron after signals propagate through the network. Dead neurons (red) output zero and can never recover — they waste parameters and capacity.

Neurons are classified as dead (output exactly 0), weak (near zero), or active (healthy). With the right variance, initial pre-activations are centered around zero with enough spread that most neurons see at least some positive inputs and stay active; as depth grows under a poorly scaled initialization, the dead fraction climbs sharply.

He vs Xavier: Head-to-Head

The difference between He and Xavier becomes dramatic in deep ReLU networks. Watch both methods propagate a signal through 20 layers — He maintains stable variance while Xavier decays exponentially.
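The same experiment as a NumPy sketch (width, depth, and batch size are arbitrary choices): He holds the signal's power near 1.0 at every layer, while under Xavier it shrinks by roughly half per layer.

```python
import numpy as np

# Propagate a batch through a 20-layer ReLU MLP and track the mean-square
# activation after every layer, under He vs Xavier weight scaling.
rng = np.random.default_rng(0)
width, depth, batch = 512, 20, 1024

def propagate(weight_std):
    h = rng.normal(size=(batch, width))             # unit-power input
    power = []
    for _ in range(depth):
        W = rng.normal(0.0, weight_std, size=(width, width))
        h = np.maximum(h @ W, 0.0)                  # linear layer + ReLU
        power.append(np.mean(h ** 2))
    return power

he = propagate(np.sqrt(2.0 / width))                # Var(W) = 2/n_in
xavier = propagate(np.sqrt(2.0 / (width + width)))  # Var(W) = 2/(n_in + n_out)

for i in (0, 4, 9, 19):
    print(f"layer {i + 1:2d}   He: {he[i]:.3f}   Xavier: {xavier[i]:.2e}")
```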


Choosing Your Initialization

Initialization Methods at a Glance

Each method is designed for specific activation functions and network architectures.

He Normal: N(0, √(2/n_in))
  Best for: ReLU, Leaky ReLU · Depth: excellent
  Default for modern CNNs and MLPs with ReLU activations.

He Uniform: U[-√(6/n_in), √(6/n_in)]
  Best for: ReLU, ELU · Depth: excellent
  Alternative to He Normal with bounded weights.

Xavier Normal: N(0, √(2/(n_in+n_out)))
  Best for: Tanh, Sigmoid · Depth: good
  Networks with symmetric activations like tanh.

Xavier Uniform: U[-√(6/(n_in+n_out)), √(6/(n_in+n_out))]
  Best for: Tanh, Sigmoid · Depth: good
  Bounded variant of Xavier for symmetric activations.

LeCun Normal: N(0, √(1/n_in))
  Best for: SELU · Depth: moderate
  Self-normalizing networks with SELU activation.

Orthogonal: Q from QR decomposition
  Best for: RNNs, LSTMs · Depth: excellent
  Recurrent networks where gradient preservation is critical.
Use He initialization when:
  • Your network uses ReLU, Leaky ReLU, ELU, or PReLU
  • You are building deep CNNs or MLPs
  • You need stable gradients through many layers
Use Xavier instead when:
  • Your network uses tanh or sigmoid activations
  • You are using transformer architectures with GELU
  • The activation is symmetric around zero

Fan-in vs Fan-out

He initialization has two modes that control different aspects of signal flow:

Fan-in mode preserves the variance of activations during the forward pass. This is the default and the right choice for most networks — it ensures signals maintain their strength as they flow from input to output.

Fan-out mode preserves the variance of gradients during the backward pass. This matters for convolutional layers with batch normalization, where gradient flow during backpropagation is the bottleneck. ResNets and similar architectures use mode='fan_out' for this reason.

In practice, the difference between fan-in and fan-out is small for layers where n_in ≈ n_out. It only matters significantly when the layer changes dimensionality dramatically.
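A sketch of how fan-out mode is typically applied to convolution and batch-norm stacks (the pattern mirrors what torchvision's ResNet does; the helper name here is my own):

```python
import torch.nn as nn

def init_conv_bn(module):
    # Fan-out mode preserves gradient variance on the backward pass, which is the
    # bottleneck for conv layers followed by batch normalization.
    if isinstance(module, nn.Conv2d):
        nn.init.kaiming_normal_(module.weight, mode='fan_out', nonlinearity='relu')
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.BatchNorm2d):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)

model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)
model.apply(init_conv_bn)   # applies the initializer to every submodule
```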

Common Pitfalls

1. Wrong Activation Pairing

He initialization is designed for ReLU-family activations. Using it with tanh or sigmoid overestimates the needed variance, causing initial activations to saturate. Xavier is the correct choice for symmetric activations.

2. Ignoring Batch Normalization

Batch normalization re-normalizes activations at every layer, reducing sensitivity to initialization. However, initialization still affects the first few training steps and gradient magnitudes. For networks with BN, use He with mode='fan_out' to prioritize gradient flow.

3. Residual Network Scaling

In residual networks, the skip connection adds the identity to the transformed signal. Without adjustment, variance doubles at every residual block. The fix is to zero-initialize the last batch normalization layer in each residual branch, so the block initially acts as an identity function.
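A sketch of that fix on a simplified residual block (the block layout is a generic illustration, not any particular library's implementation):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

        for conv in (self.conv1, self.conv2):
            nn.init.kaiming_normal_(conv.weight, mode='fan_out', nonlinearity='relu')
        # Zero the last BN's scale: at initialization the branch outputs 0,
        # so the block acts as an identity (plus ReLU) and variance does not double.
        nn.init.zeros_(self.bn2.weight)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(x + out)
```

With the branch silenced at initialization, the signal and its gradients initially flow entirely through the skip path, which is what lets very deep stacks of these blocks train from the first step.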

4. Very Deep Networks Without Normalization

For networks deeper than 50 layers without batch normalization, even He initialization may not suffice. Techniques like Fixup initialization scale weights by d^(-1/4) (where d is depth) to account for the accumulation effect across many layers.

Key Takeaways

  1. ReLU halves variance at every layer. He initialization compensates with Var(W) = 2/n_in, keeping signals alive through arbitrary depth.

  2. Xavier is wrong for ReLU. It assumes symmetric activations and causes exponential variance decay — signals vanish by layer 10 in deep networks.

  3. Dead neurons are permanent. Poor initialization pushes pre-activations negative, where ReLU kills them and zeroes their gradients. He initialization minimizes this by spreading pre-activations around zero.

  4. The gain factor adapts to different activations. ReLU uses √(2), Leaky ReLU adjusts for the slope, and SELU uses 0.75 for its self-normalizing property.

  5. Fan-in for forward pass, fan-out for backward pass. Most networks use fan-in (default). ConvNets with batch normalization benefit from fan-out.

  • Xavier Initialization — The symmetric-activation counterpart designed for tanh and sigmoid
  • Batch Normalization — Reduces initialization sensitivity by re-normalizing activations
  • Dropout — Regularization that can interact with initialization choices
  • Skip Connections — Architectural solution that complements proper initialization
