TL;DR. BatchNorm normalizes each channel using statistics pooled across the batch and spatial dimensions, so it depends on batch composition. LayerNorm normalizes each sample using statistics pooled across the channel and spatial dimensions, so it depends only on that one sample. CNNs typically use BatchNorm; transformers, and any small-batch or batch-of-one inference path, use LayerNorm. The rest of the choice falls out of those two facts.
What each one actually does
For an activation tensor of shape (N, C, H, W) — N samples, C channels, H×W spatial:
- BatchNorm computes one mean and variance per channel. To normalize the value at position (n, c, h, w), BN uses the mean of every value in the same channel c across all n, h, w. There are C means and C variances per layer.
- LayerNorm computes one mean and variance per sample. To normalize the value at (n, c, h, w), LN uses the mean of every value in the same sample n across all c, h, w. There are N means and N variances per batch.
Both then apply the same transformation: subtract the mean, divide by sqrt(variance + ε), then scale and shift by learned parameters γ and β (per-channel for BN; per normalized feature for LN).
One way to picture it: pick any cell, and the pooled region is "every cell that gets averaged together with this one." For BN that region is the whole channel slice across the batch; for LN it is the whole sample.
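To make the pooling axes concrete, here is a minimal PyTorch sketch (shapes chosen arbitrarily for illustration) that computes both sets of statistics by hand and checks them against the built-in modules:

```python
import torch
import torch.nn as nn

N, C, H, W = 8, 16, 4, 4
x = torch.randn(N, C, H, W)
eps = 1e-5

# BatchNorm: one mean/var per channel, pooled over (N, H, W)
mu_bn = x.mean(dim=(0, 2, 3), keepdim=True)                  # shape (1, C, 1, 1)
var_bn = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
bn_manual = (x - mu_bn) / torch.sqrt(var_bn + eps)

bn = nn.BatchNorm2d(C, eps=eps, affine=False).train()        # train mode -> batch stats
print(torch.allclose(bn(x), bn_manual, atol=1e-5))           # True

# LayerNorm over (C, H, W): one mean/var per sample, pooled over everything else
mu_ln = x.mean(dim=(1, 2, 3), keepdim=True)                  # shape (N, 1, 1, 1)
var_ln = x.var(dim=(1, 2, 3), unbiased=False, keepdim=True)
ln_manual = (x - mu_ln) / torch.sqrt(var_ln + eps)

ln = nn.LayerNorm([C, H, W], eps=eps, elementwise_affine=False)
print(torch.allclose(ln(x), ln_manual, atol=1e-5))           # True
```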
Side-by-side
| Aspect | Batch Norm | Layer Norm |
|---|---|---|
| Normalization axis | per-channel; pools over (N, H, W) | per-sample; pools over (C, H, W) |
| Number of mean/var pairs | C (one per channel) | N (one per sample) |
| Depends on batch composition? | Yes — mean/var change as batch changes | No — each sample is normalized independently |
| Works at batch size = 1? | Only at eval (running stats); train-time stats from a single sample are degenerate | Yes — same math regardless of batch |
| Train vs eval behavior | Different. Train uses batch stats; eval uses running stats | Identical. Same math both modes |
| Sensitive to outlier batches? | Yes — bad batches produce bad stats | No |
| Implicit regularization | Yes — noise from changing batches helps generalization | None from normalization itself |
| Memory cost (per layer) | 4 × C floats (γ, β, running mean, running var) | 2 × normalized-shape floats (γ, β only — no running stats) |
| Compute cost | Slightly higher (updates running stats during training) | Lower; pure per-sample reduction |
| Distributed training | Needs SyncBatchNorm to pool stats across GPUs | Works with no extra sync — stats are local |
| Sequence models | Awkward — different sequence lengths per sample, batch stats become unstable | Native — designed for variable-length sequences |
| CNNs | Default choice for ResNet-style classifiers | Works but loses regularization benefit |
| Transformers | Doesn't work well — small batches, variable lengths | Default choice everywhere |
| Inference latency | Negligible: a fixed per-channel affine, often folded into the preceding conv/linear | One reduction per sample per layer |
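To see the "depends on batch composition" rows in action, a quick check on a plain (batch, features) tensor; the first sample is identical in both batches, only its batchmates change:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 8)                                   # one fixed batch
x_other = torch.cat([x[:1], 5.0 * torch.randn(3, 8)])   # same first sample, different batchmates

bn = nn.BatchNorm1d(8, affine=False).train()
ln = nn.LayerNorm(8, elementwise_affine=False)

# BN: the first sample's output changes when its batchmates change
print(torch.allclose(bn(x)[0], bn(x_other)[0]))   # False
# LN: the first sample's output is identical either way
print(torch.allclose(ln(x)[0], ln(x_other)[0]))   # True
```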
When each one wins
A simple decision flow:
- Sequences with variable length (transformers, RNNs, time series) → LayerNorm. Padding tokens would corrupt batch statistics; per-sample stats sidestep that.
- Inference at batch size 1 (real-time serving, mobile, edge) → LayerNorm. BN's running stats can drift from the inference distribution; LN doesn't have the problem.
- Distributed training across many GPUs → LayerNorm by default; BN only with SyncBatchNorm and the synchronization tax it brings. Most modern recipes prefer LN to avoid the sync.
- Convolutional classifier with reasonable batch size (≥32) → BatchNorm. The stochastic noise from batch sampling is a useful regularizer; this is why ResNets and most classical CNNs use it.
- Style transfer or generative models that need to remove per-image statistics → InstanceNorm (LN's per-sample idea applied within each channel: stats pooled over the spatial axes only).
- Training under a hard batch-size constraint (e.g., big segmentation models that fit only 2–4 samples per GPU) → GroupNorm. It's the in-between: per-sample like LN, but channel-aware; see the sketch below.
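A sketch of the GroupNorm middle ground (the channel and group counts here are illustrative, not a recommendation):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 64, 32, 32)   # tiny batch, as in a large segmentation model

gn = nn.GroupNorm(num_groups=8, num_channels=64)         # per-sample stats over groups of 8 channels
print(gn(x).shape)                                       # torch.Size([2, 64, 32, 32]), no batch dependence

# The two extremes of the same module:
ln_like = nn.GroupNorm(num_groups=1, num_channels=64)    # pools over all of (C, H, W), LN-style
in_like = nn.GroupNorm(num_groups=64, num_channels=64)   # pools per channel, InstanceNorm-style
```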
The internal covariate shift framing isn't quite right
The original BatchNorm paper motivated the technique as fixing "internal covariate shift" — the claim that distribution drift in intermediate activations slows training. Later work (Santurkar et al., 2018) showed the actual mechanism is different: BN smooths the loss landscape, making gradients more predictive across step sizes. That's why aggressive learning rates work after adding BN.
LayerNorm gets the same loss-smoothing benefit without the batch-statistics machinery. For transformers — where the attention mechanism is already sensitive to input scale — the predictability of LN's behavior across batch sizes turned out to matter more than the implicit regularization from BN.
For the deeper dives see BatchNorm, LayerNorm, and internal covariate shift.
Common pitfalls
A short list of things that ship to production broken:
- Inference batch size mismatch with training: a model trained with BN at batch=64 and served at batch=1 uses different stats (running vs batch). The output distribution shifts and AUC drops 1–3 points. Switch to LN at training time, or call `model.eval()` (which uses running stats) — never the train-mode forward at inference.
- Padding tokens in BN: applying BN to a transformer with variable-length sequences pulls padded positions into the per-channel mean. The padded positions are effectively random noise corrupting your statistics. Use LN.
- Frozen BN in transfer learning: many pretrained CNNs ship with `track_running_stats=True`. If you fine-tune at a different batch size (or on a small dataset), the running stats slowly drift toward the new distribution and the head doesn't see what the trunk produces. Either freeze BN entirely (`bn.eval()` even during training, as in the sketch below) or use LN/GN for the fine-tuning.
- GroupNorm with G=1 thinking it's LN: not quite. GN with G=1 normalizes across (C, H, W) within each sample — same as LN over the channel-and-spatial axes. But standard transformer LN normalizes across `hidden_dim` only (the last axis), not spatial. They're equivalent only for 2D feature maps without extra spatial structure to preserve.
- SyncBatchNorm with mismatched grad accumulation: if you accumulate gradients over 4 micro-batches before stepping the optimizer, but SyncBN syncs after every micro-batch, the synced stats come from a single micro-batch — not the effective batch. PyTorch handles this if you use `torch.distributed` correctly; bare DDP doesn't.
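For the frozen-BN pitfall, a minimal sketch of the usual workaround; `freeze_batchnorm` is a name made up for this example, everything else is plain PyTorch:

```python
import torch.nn as nn

def freeze_batchnorm(model: nn.Module) -> None:
    """Keep every BN layer in eval mode so fine-tuning never updates its running stats."""
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.eval()                             # normalize with running stats even while training
            if m.affine:                         # optionally freeze the per-channel scale/shift too
                m.weight.requires_grad_(False)
                m.bias.requires_grad_(False)

# Call order matters: model.train() flips BN back to train mode, so freeze afterwards.
#   model.train()
#   freeze_batchnorm(model)
# At inference, always model.eval(): BN then uses running stats; LN behaves the same in both modes.
```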
Performance and memory
Rough numbers for a 1024-dim transformer block at batch size 32, sequence 512:
- LayerNorm: a mean/variance pair for each of the 32×512 token positions, each reduced over the 1024-dim hidden axis → ~16M values touched per pass, and 2×1024 ≈ 2K learnable parameters (γ, β) per LN layer.
- BatchNorm (if you tried using it on a transformer): one mean/variance pair per channel, each reduced over 32×512 positions, × 1024 channels → effectively the same op count, but it adds running-mean/running-var state (another 2×1024 floats as buffers) and SyncBatchNorm communication cost across GPUs.
The compute is comparable. The operational cost is what differs — LN is "no machinery", BN comes with running stats, SyncBN comms, and a train/eval mode switch.
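The state difference is easy to see by counting parameters and buffers for the 1024-dim example:

```python
import torch.nn as nn

ln = nn.LayerNorm(1024)
bn = nn.BatchNorm1d(1024)

print(sum(p.numel() for p in ln.parameters()))   # 2048: gamma + beta, and no buffers at all
print(sum(p.numel() for p in bn.parameters()))   # 2048: gamma + beta
print(sum(b.numel() for b in bn.buffers()))      # 2049: running_mean, running_var, num_batches_tracked
```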
When the comparison really matters
Three places this decision shows up most often:
- Adapting a CNN backbone for a sequence task: you'll want to swap BN for GN or LN. The naive port keeps BN, hits the variable-batch problem, and trains poorly.
- Productionizing a transformer for batch-1 inference: it already uses LN, so this just works. If the codebase uses RMSNorm (a LayerNorm variant that drops the mean centering; sketched after this list), the same conclusion holds.
- Multi-GPU vs single-GPU training: BN's sync tax compounds. For 8-GPU setups you'll want SyncBN; for 32+ GPU setups it's often faster to just switch to LN.
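For reference, a minimal RMSNorm sketch that makes the "drops the mean centering" point concrete (exact details vary a little between codebases):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """LayerNorm without the mean subtraction: rescale by the root-mean-square only."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))   # scale only; no shift, no running stats

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # LayerNorm would compute (x - mean) / sqrt(var + eps); RMSNorm skips the centering
        return self.weight * x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
```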
When in doubt: if your batch size is small or variable, use LayerNorm. If your batch size is large and stable and you're training a CNN, BatchNorm gives you free regularization. Anything else is GroupNorm territory.
