Cross-Entropy Loss

Understand cross-entropy loss for classification: interactive demos of binary and multi-class CE, the -log(p) curve, softmax gradients, and focal loss.


Cross-Entropy Loss: The Language of Classification

Cross-entropy loss is the default objective for training classification models in deep learning. It measures the gap between a model's predicted probability distribution and the true labels, rooted in a simple idea from information theory: how surprised are you when the true answer is revealed? A confident and correct model is barely surprised. A confident and wrong model is maximally surprised. Cross-entropy quantifies this surprise and turns it into a smooth, differentiable signal that drives learning.

What makes cross-entropy special among loss functions is the elegance of its gradients. When combined with softmax, the gradient simplifies to pᵢ - yᵢ: just the difference between prediction and truth. No complicated derivative chains, no vanishing signals. This clean gradient is why cross-entropy trains faster and more reliably than alternatives like mean squared error for classification tasks.

The Surprise Analogy

Think of classification as a guessing game. Before seeing the answer, a model assigns probabilities to each possible class. When the true class is revealed, the model experiences "surprise" proportional to how unlikely it considered that outcome. If the model assigned 95% probability to the correct class, the surprise is tiny. If it assigned 5%, the surprise is enormous. Cross-entropy loss is exactly this surprise, measured in nats (using the natural logarithm) or bits (using log base 2). Training a classifier means teaching it to be less surprised by the training data.

The Surprise Guessing Game

Imagine a photo appears and you must guess the animal. If you guessed "Cat" with only 5% confidence, seeing a cat would be a huge surprise. Cross-entropy measures this surprise: -log(predicted probability).

The model assigns high probability to the correct class: low surprise, low loss.

[Interactive demo: the true answer is Cat, so the one-hot target is 1.0 for Cat and 0.0 for Dog, Bird, and Fish. With predicted probabilities of 85% Cat, 10% Dog, 3% Bird, and 2% Fish, the surprise is 0.23 bits and the CE loss is 0.1625 nats.]
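
The demo's readout is easy to reproduce: the only input is the 85% probability the model assigned to the true class.

```python
import numpy as np

# Surprise at the true class, using the demo's prediction of 85% for "Cat".
p_true = 0.85
surprise_nats = -np.log(p_true)    # natural log -> nats
surprise_bits = -np.log2(p_true)   # log base 2  -> bits

print(f"{surprise_nats:.4f} nats, {surprise_bits:.2f} bits")
# ~0.1625 nats, ~0.23 bits, matching the demo readout
```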

Mathematical Definition

Binary Cross-Entropy

For a single sample with true label y ∈ {0, 1} and predicted probability ŷ:

ℒ = -y log(ŷ) - (1-y) log(1-ŷ)

When y = 1, only the first term is active: -log(ŷ). When y = 0, only the second term matters: -log(1-ŷ). In both cases, the loss decreases as the predicted probability moves closer to the true label.
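
A minimal NumPy sketch of this formula (with the probability clipping discussed in the pitfalls section, so the log stays finite):

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-7):
    """BCE for true labels y in {0, 1} and predicted probabilities y_hat, averaged over the batch."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # keep log() finite
    return np.mean(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))

print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.2])))  # small loss: both mostly correct
```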

Categorical Cross-Entropy

For multi-class classification with one-hot encoded labels over C classes:

ℒ = -Σᵢ yᵢ log(ŷᵢ)   (sum over i = 1, …, C)

Since y is one-hot, only one term survives: the negative log-probability of the true class. This is equivalent to the negative log-likelihood under a categorical distribution, which connects cross-entropy directly to maximum likelihood estimation.
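
A minimal NumPy sketch of the categorical form, using the same Cat/Dog/Bird/Fish probabilities as the demo above:

```python
import numpy as np

def categorical_cross_entropy(y_onehot, y_hat, eps=1e-7):
    """CE for one-hot labels and predicted class probabilities, averaged over the batch."""
    y_hat = np.clip(y_hat, eps, 1.0)
    return np.mean(-np.sum(y_onehot * np.log(y_hat), axis=-1))

y = np.array([[1, 0, 0, 0]])               # true class: Cat
p = np.array([[0.85, 0.10, 0.03, 0.02]])
print(categorical_cross_entropy(y, p))      # ~0.1625: only the true-class term survives
```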

Connection to Information Theory

Cross-entropy between distributions P (truth) and Q (model) measures the average number of bits needed to encode events from P using Q's coding scheme:

H(P, Q) = -Σₓ P(x) log Q(x)

This always satisfies H(P, Q) ≥ H(P), with equality only when P = Q. The excess H(P, Q) - H(P) is the KL divergence: the wasted bits from using the wrong code. Since H(P) is constant during training, minimizing cross-entropy is equivalent to minimizing the KL divergence between the model's predictions and the true distribution.
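
The decomposition can be checked numerically; the two distributions below are arbitrary values chosen for illustration:

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])   # "true" distribution (illustrative values)
Q = np.array([0.5, 0.3, 0.2])   # model's distribution (illustrative values)

H_P  = -np.sum(P * np.log(P))        # entropy of P
H_PQ = -np.sum(P * np.log(Q))        # cross-entropy H(P, Q)
KL   = np.sum(P * np.log(P / Q))     # D_KL(P || Q)

print(np.isclose(H_PQ, H_P + KL))    # True: H(P, Q) = H(P) + KL, so H(P, Q) >= H(P)
```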

Interactive Cross-Entropy Explorer

Adjust the predicted probabilities and watch how cross-entropy loss responds on the -log(p) curve. The red dot shows your current operating point.

[Interactive demo: with predicted probabilities of 70% for Cat (the true class), 20% for Dog, and 10% for Bird, the CE loss is 0.3567 nats, or 0.51 bits of surprise. At this moderate confidence the loss gradient is gentle, so learning is steady but not urgent.]

Binary Cross-Entropy

Binary cross-entropy (also called log loss) handles two-class problems. The model outputs a single probability p representing the likelihood of the positive class. The loss has a characteristic asymmetric shape: it rises gently when the model is slightly wrong but explodes toward infinity when the model is confidently wrong.

This asymmetry is a feature, not a bug. When the model assigns near-zero probability to the true class, the gradient -1/p becomes extremely large, creating an urgent signal to correct the mistake. When the model is already correct and confident, the gradient is small, avoiding unnecessary parameter updates. This "self-regulating" property is one reason cross-entropy outperforms MSE for classification.

In practice, binary cross-entropy is always computed from raw logits (pre-sigmoid values) rather than probabilities. Combining the sigmoid and log operations into a single numerically stable function avoids the catastrophic cancellation that occurs when computing log(σ(z)) for large negative z.
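
One standard way to write the fused, numerically stable form (a sketch in plain NumPy, not any particular framework's implementation) is max(z, 0) - z·y + log(1 + e^(-|z|)), which never exponentiates a large positive number:

```python
import numpy as np

def bce_with_logits(y, z):
    """Stable BCE from raw logits z, equal to -y*log(sigmoid(z)) - (1-y)*log(1-sigmoid(z))."""
    # max(z, 0) - z*y + log(1 + exp(-|z|)): exp() only ever sees non-positive arguments
    return np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))

z = np.array([-50.0, 50.0])   # extreme logits that break log(sigmoid(z)) computed naively
y = np.array([1.0, 1.0])
print(bce_with_logits(y, z))  # finite (~25.0) instead of inf/NaN
```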

Binary Cross-Entropy Demo

Toggle the true label and drag the predicted probability. Watch both loss curves simultaneously: the blue curve penalizes wrong predictions when y = 1, the red curve penalizes when y = 0. The orange arrow shows gradient direction and magnitude.

[Interactive demo: with true label y = 1 and predicted p = 0.70, the BCE loss is 0.3567 and the gradient is -1.43. The prediction sits on the correct side of the 0.5 decision boundary but is not fully confident, so the gradient gives a moderate push toward higher confidence.]

Multi-Class Cross-Entropy

For problems with more than two classes, the model produces a vector of logits that passes through softmax to create a probability distribution. The softmax function converts raw scores into probabilities that sum to one:

softmax(zᵢ) = e^(zᵢ) / Σⱼ e^(zⱼ)   (sum over j = 1, …, C)

The combined softmax + cross-entropy loss yields the most elegant gradient in deep learning:

∂ℒ/∂zᵢ = pᵢ - yᵢ

For the true class, the gradient is p_true - 1, always negative, pushing the logit up. For every other class, the gradient equals the softmax probability itself, always positive, pushing those logits down. The magnitude is proportional to how wrong the prediction is: larger errors produce stronger corrections.

This gradient has three important properties. First, it never vanishes: even when the model is very confident, there is always a nonzero gradient. Second, it is bounded between -1 and 1, preventing gradient explosion. Third, it is computationally trivial: just subtract the one-hot label from the softmax output.
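
The gradient identity is easy to verify numerically against a finite-difference estimate; the logits below are the ones used in the demo that follows:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)               # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

z = np.array([3.5, 0.8, 0.2, -0.5])  # Cat, Dog, Bird, Fish logits (as in the demo)
y = np.array([1.0, 0.0, 0.0, 0.0])   # true class: Cat

loss = -np.sum(y * np.log(softmax(z)))
grad = softmax(z) - y                # analytic gradient d(loss)/dz

# Finite-difference check of the true-class component
eps = 1e-5
z_pert = z.copy()
z_pert[0] += eps
numeric = (-np.sum(y * np.log(softmax(z_pert))) - loss) / eps

print(loss, grad[0], numeric)        # loss ~0.115, analytic ~-0.109, numeric matches
```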

Multi-Class Cross-Entropy

Adjust raw logits for each class and watch the softmax probabilities, per-class loss contributions, and gradients update in real time. Only the true class contributes to the loss.

[Interactive demo: with logits 3.5 (Cat, the true class), 0.8 (Dog), 0.2 (Bird), and -0.5 (Fish), softmax gives 89.1%, 6.0%, 3.3%, and 1.6%; the total CE loss is 0.1155 nats and the true-class gradient is -0.109.]

Gradient Insight

The gradient for softmax + cross-entropy has an elegant form: gradient = softmax(z) - y. For the true class (y = 1), the gradient here is -0.109, negative, pushing the logit up. For all other classes (y = 0), the gradient equals the softmax probability itself, always positive, pushing those logits down. The prediction is correct but not fully confident, so cross-entropy keeps pushing the true class logit higher relative to the others.

Comparing Classification Losses

Cross-entropy is the default, but it is not the only option. Different classification losses make different tradeoffs between gradient quality, class imbalance handling, and calibration. Understanding these tradeoffs helps you choose the right tool for your specific problem.

Classification Loss Functions Compared

How cross-entropy stacks up against other classification losses across key dimensions.

Cross-Entropy: -y log(p)
  • Imbalance: moderate (weight classes manually)
  • Calibration: excellent (penalizes underconfidence for the true class)
  • Stability: excellent (stable with the log-sum-exp trick)
  • Gradients: excellent (gradient never vanishes)
  • Best for: general classification; the default choice for most tasks

MSE (L2): (y - p)²
  • Imbalance: poor (no built-in mechanism)
  • Calibration: poor (slow gradient when p is far from y)
  • Stability: excellent (no log operations)
  • Gradients: poor (vanishes near 0 and 1 with sigmoid)
  • Best for: regression tasks; not recommended for classification

Focal Loss: -(1-p)^γ log(p)
  • Imbalance: excellent (down-weights easy examples)
  • Calibration: excellent (focuses on hard examples)
  • Stability: moderate (extra terms add complexity)
  • Gradients: excellent (strong signal for hard cases)
  • Best for: object detection and severe class imbalance (RetinaNet)

Hinge Loss: max(0, 1 - y·f(x))
  • Imbalance: moderate (margin-based, not probability-based)
  • Calibration: moderate (no gradient once the margin is met)
  • Stability: excellent (simple max operation)
  • Gradients: moderate (zero gradient for correct predictions)
  • Best for: SVMs and max-margin classifiers

Label Smoothing CE: -(1-ε) log(p) - ε/C
  • Imbalance: moderate (same as CE)
  • Calibration: excellent (prevents overconfidence)
  • Stability: excellent (same as CE)
  • Gradients: excellent (never fully satisfied)
  • Best for: large-scale classification, knowledge distillation, calibration
Use cross-entropy when...
  • Training any standard classification model
  • You want well-calibrated probability outputs
  • Class imbalance is mild to moderate
  • You need reliable gradients throughout training
Consider alternatives when...
  • Severe class imbalance exists (use focal loss; see the sketch after this list)
  • You need max-margin separation (use hinge loss)
  • Overconfidence is a problem (use label smoothing)
  • Doing regression, not classification (use MSE/MAE)
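
Focal loss comes up repeatedly above, so here is a minimal NumPy sketch of the binary form -(1-p_t)^γ log(p_t). γ = 2 is the value commonly used with RetinaNet, the α class-weighting term is omitted for brevity, and the example probabilities are made up for illustration:

```python
import numpy as np

def focal_loss(y, p, gamma=2.0, eps=1e-7):
    """Binary focal loss -(1-p_t)^gamma * log(p_t), where p_t is the probability of the true class."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)                  # probability assigned to the true class
    return np.mean(-((1 - p_t) ** gamma) * np.log(p_t))

y = np.array([1, 1])
p = np.array([0.95, 0.30])   # one easy example, one hard example
print(focal_loss(y, p))      # the easy example contributes almost nothing; the hard one dominates
```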

Connection to KL Divergence

Minimizing cross-entropy loss is mathematically equivalent to minimizing the KL divergence between the true distribution and the model's predicted distribution. Since the entropy of the true labels H(P) is fixed (zero for hard labels, positive for soft labels), the only way to reduce H(P, Q) is to reduce the KL divergence term D_KL(P ∥ Q) = H(P, Q) - H(P).

This equivalence has a profound implication: cross-entropy training is maximum likelihood estimation in disguise. Maximizing the likelihood of the data under the model is the same as minimizing the cross-entropy between the empirical data distribution and the model's predictions. This is why cross-entropy is not just a heuristic; it is the principled, information-theoretically optimal loss for classification.

When using label smoothing, the true distribution is no longer one-hot but a mixture: y'ᵢ = (1-ε)·yᵢ + ε/C. This makes H(P) positive, and the model can never drive the loss to zero. The smoothed targets act as a regularizer, preventing the model from becoming overconfident and improving generalization, especially in large-scale classification with many classes.
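
A minimal NumPy sketch of cross-entropy against smoothed targets (ε = 0.1 is just an example value):

```python
import numpy as np

def smoothed_cross_entropy(y_onehot, p, epsilon=0.1, clip_eps=1e-12):
    """Cross-entropy against label-smoothed targets y' = (1 - epsilon) * y + epsilon / C."""
    C = y_onehot.shape[-1]
    y_smooth = (1 - epsilon) * y_onehot + epsilon / C
    return np.mean(-np.sum(y_smooth * np.log(np.clip(p, clip_eps, 1.0)), axis=-1))

y = np.array([[1.0, 0.0, 0.0, 0.0]])
p = np.array([[0.97, 0.01, 0.01, 0.01]])   # very confident prediction
print(smoothed_cross_entropy(y, p))         # stays well above zero: the loss can never reach 0
```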

Common Pitfalls

Numerical Instability

Computing log(p) directly from softmax probabilities is dangerous. When a logit is very large and negative, the softmax output underflows to zero, and log(0) produces negative infinity. The solution is the log-sum-exp trick: compute cross-entropy directly from logits as -z_true + log Σⱼ e^(zⱼ), subtracting the maximum logit for numerical stability. Every major deep learning framework implements this combined operation internally.
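
A minimal NumPy sketch of that computation; the example logits are chosen to be extreme enough that the naive softmax-then-log path would fail:

```python
import numpy as np

def cross_entropy_from_logits(z, true_idx):
    """CE computed directly from logits: -z_true + log(sum_j exp(z_j)), stabilized by the max logit."""
    z_shift = z - np.max(z)                 # shifting by a constant does not change the loss
    log_sum_exp = np.log(np.sum(np.exp(z_shift)))
    return -z_shift[true_idx] + log_sum_exp

z = np.array([1000.0, 0.0, -1000.0])             # logits that would overflow exp() naively
print(cross_entropy_from_logits(z, true_idx=0))  # ~0.0, no inf or NaN
```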

Wrong Loss for the Problem

Using mean squared error for classification seems reasonable but fails in practice. The gradient of MSE through sigmoid saturates when the model is confidently wrong โ€” exactly when you need the strongest learning signal. Cross-entropy's gradient never saturates, which is why it converges faster and more reliably.
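
The saturation is easy to see numerically. The sketch below (plain NumPy, with an illustrative logit of -8) compares the gradient of MSE and of binary cross-entropy with respect to the logit when the model is confidently wrong:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, y = -8.0, 1.0                 # confidently wrong: true class 1, strongly negative logit
p = sigmoid(z)

grad_mse = 2 * (p - y) * p * (1 - p)   # d/dz of (p - y)^2 through the sigmoid
grad_ce  = p - y                       # d/dz of binary cross-entropy through the sigmoid

print(grad_mse, grad_ce)  # ~ -0.0007 vs ~ -0.9997: MSE barely moves, cross-entropy pushes hard
```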

Ignoring Class Imbalance

Standard cross-entropy treats all classes equally. When one class dominates the dataset, the model learns to predict that class for everything and achieves low loss. Solutions include class-weighted cross-entropy (scaling each class's loss by its inverse frequency), focal loss (down-weighting easy examples), or oversampling the minority class.
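
As a sketch of the class-weighting idea (the positive-class weight of 10 below is purely illustrative; in practice it is often set from inverse class frequencies):

```python
import numpy as np

def weighted_bce(y, p, pos_weight, eps=1e-7):
    """Binary cross-entropy with a heavier weight on the (rare) positive class."""
    p = np.clip(p, eps, 1 - eps)
    per_sample = -pos_weight * y * np.log(p) - (1 - y) * np.log(1 - p)
    return np.mean(per_sample)

# Illustrative imbalance: positives are roughly 10x rarer, so weight them roughly 10x more
y = np.array([1, 0, 0, 0])
p = np.array([0.3, 0.1, 0.1, 0.1])
print(weighted_bce(y, p, pos_weight=10.0))  # missing the lone positive now costs far more
```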

Probability Clipping

When implementing cross-entropy manually, always clip predicted probabilities away from zero and one. A prediction of exactly zero for the true class produces infinite loss and NaN gradients. A small epsilon (typically 1e-7) prevents this while having negligible effect on the loss value.

Confusing Loss Variants

Binary cross-entropy expects a single sigmoid output per sample. Categorical cross-entropy expects a softmax distribution. Using the wrong variant (for example, applying binary cross-entropy to softmax outputs) produces incorrect gradients and poor training. Multi-label classification (where samples can belong to multiple classes) requires binary cross-entropy applied independently to each class, not categorical cross-entropy.
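
A small NumPy sketch of the multi-label case, applying binary cross-entropy independently to each class; the labels and probabilities are made up for illustration:

```python
import numpy as np

# Multi-label: each sample can belong to several classes, so each class gets its own sigmoid + BCE.
y = np.array([[1, 0, 1],          # this sample is both class 0 and class 2
              [0, 1, 0]])
p = np.array([[0.8, 0.2, 0.6],    # independent per-class probabilities (post-sigmoid)
              [0.1, 0.7, 0.3]])

eps = 1e-7
p = np.clip(p, eps, 1 - eps)
bce_per_class = -y * np.log(p) - (1 - y) * np.log(1 - p)
print(bce_per_class.mean())       # averaged over every (sample, class) pair
```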

Key Takeaways

  1. Cross-entropy measures surprise. It quantifies how many extra bits the model wastes by not perfectly predicting the true labels. Minimizing cross-entropy means minimizing surprise.

  2. The gradient is elegant. Combined with softmax, the gradient simplifies to p - y, the difference between prediction and truth. This never vanishes, never explodes, and is trivial to compute.

  3. It equals maximum likelihood. Minimizing cross-entropy is equivalent to maximizing the likelihood of the data, providing a principled information-theoretic foundation.

  4. Numerical stability matters. Always compute cross-entropy from logits, not probabilities. Use the log-sum-exp trick to prevent overflow and underflow.

  5. Know when to extend it. Focal loss for imbalanced data, label smoothing for calibration, and class weighting for uneven datasets are all modifications of the core cross-entropy objective.

  • KL Divergence: minimizing cross-entropy is equivalent to minimizing forward KL divergence
  • Focal Loss: modified cross-entropy that down-weights easy examples for imbalanced classification
  • Contrastive Loss: distribution matching through contrastive learning objectives
  • MSE and MAE: regression losses and why they fail for classification
  • Gradient Flow: how cross-entropy's clean gradients propagate through deep networks
