MSE and MAE: Fundamental Regression Losses
Mean Squared Error (MSE) and Mean Absolute Error (MAE) are the two most important loss functions for regression. They look deceptively simple, but their differences in how they penalize errors lead to fundamentally different model behaviors. Choosing the wrong one can make a model chase outliers instead of learning the true pattern, or oscillate endlessly instead of converging.
The core question is: when your prediction is wrong, how harshly should the penalty scale with the size of the mistake?
The Target Practice Analogy
Think of each prediction as an arrow shot at a target. The distance from the bullseye is your prediction error. MSE squares each distance: a shot 1 ring off costs 1 penalty point, but a shot 4 rings off costs 16 penalty points rather than 4. MAE simply adds the distances: 1 ring off costs 1, 4 rings off costs 4. This single difference shapes everything downstream, from outlier sensitivity to gradient behavior to what statistical quantity the model converges toward.
Mathematical Definitions
Mean Squared Error (L2 Loss)
MSE computes the average of the squared differences between predictions and targets:

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²

where yᵢ is the actual value, ŷᵢ is the predicted value, and n is the number of samples. The squaring means that an error of 10 contributes 100 to the loss, while an error of 1 contributes only 1. This quadratic scaling is what makes MSE sensitive to large errors.
Mean Absolute Error (L1 Loss)
MAE computes the average of the absolute differences:

MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|
Every unit of error contributes equally to the total loss. An error of 10 contributes exactly 10 times as much as an error of 1, not 100 times as much. This linear scaling is what gives MAE its robustness to outliers.
Root Mean Squared Error
RMSE is simply the square root of MSE, which brings the loss back into the same units as the target variable:

RMSE = √MSE = √((1/n) Σᵢ (yᵢ − ŷᵢ)²)
RMSE is useful for reporting because its units match the prediction (dollars, meters, degrees), making it more interpretable than raw MSE. However, minimizing RMSE is mathematically identical to minimizing MSE since the square root is monotonic.
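All three definitions are a few lines of NumPy. A minimal sketch with illustrative data (the arrays are made up; the last prediction is deliberately off by 10 to show the quadratic penalty at work):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: average of squared residuals."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean Absolute Error: average of absolute residuals."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root Mean Squared Error: same units as the target."""
    return np.sqrt(mse(y_true, y_pred))

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 19.0])  # last prediction is off by 10

print(mse(y_true, y_pred))   # 25.3125 -- the error of 10 contributes 100 to the sum
print(mae(y_true, y_pred))   # 2.875   -- the same error contributes only 10
print(rmse(y_true, y_pred))  # ~5.03   -- back in the target's units
```

The single large error dominates MSE (100 of the 101.25 total squared error) but accounts for a proportionate share of MAE.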
The Loss Surface
The loss surface determines how optimization behaves. For any candidate prediction line (a slope and an intercept), each point's MSE contribution can be pictured as a square drawn on its residual, while its MAE contribution is a bar of the residual's length. As a point moves further from the line, the squares grow dramatically faster than the bars.
Outlier Sensitivity
The most practically important difference between MSE and MAE is how they respond to outliers. A single extreme data point can completely dominate MSE, dragging the optimal fit line away from the majority of points. MAE is far more resistant because it treats an outlier's error linearly rather than quadratically.
Consider a dataset where most points follow a clean linear trend, but one point lies far from the pattern. Refit with and without that point included: the MSE-optimal line pivots dramatically toward the outlier, while an MAE-optimal line barely moves.
This is why MSE is said to optimize the conditional mean while MAE optimizes the conditional median. The mean is pulled by outliers; the median is not. If your data contains measurement errors, sensor glitches, or any source of occasional extreme values, MAE or Huber loss is usually the better choice.
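A quick way to see the mean-versus-median behavior is to fit a single constant prediction to data containing one outlier: brute-force minimizing MSE over candidate constants recovers the mean, while minimizing MAE recovers the median. A sketch with made-up values:

```python
import numpy as np

# Mostly clean data with one extreme outlier.
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# For a constant prediction c, compute MSE and MAE over a dense grid of c values.
candidates = np.linspace(0.0, 100.0, 100001)
mse_vals = ((y[:, None] - candidates[None, :]) ** 2).mean(axis=0)
mae_vals = np.abs(y[:, None] - candidates[None, :]).mean(axis=0)

best_mse = candidates[mse_vals.argmin()]  # ~22.0, the mean (pulled by the outlier)
best_mae = candidates[mae_vals.argmin()]  # ~3.0, the median (unaffected)
print(best_mse, best_mae)
```

Replacing the outlier 100.0 with 5.0 moves the MSE minimizer all the way from 22 to 3, while the MAE minimizer stays at the median.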
Gradient Behavior
MSE Gradients Scale with Error
The gradient of MSE with respect to a single prediction is:

∂MSE/∂ŷᵢ = (2/n)(ŷᵢ − yᵢ)
This gradient is proportional to the error. When the prediction is far off, the gradient is large, pushing the model hard to correct. As the prediction approaches the target, the gradient shrinks toward zero, enabling precise fine-tuning near the optimum. This adaptive behavior is MSE's greatest strength for clean data.
MAE Gradients Are Constant
The gradient (technically, the subgradient) of MAE is:

∂MAE/∂ŷᵢ = (1/n) · sign(ŷᵢ − yᵢ)
This gradient has constant magnitude: it is always plus one or minus one (divided by n), regardless of how large or small the error is. For large errors, this is fine. But near the optimum, the model still takes full-size steps, causing it to overshoot and oscillate around the minimum instead of settling smoothly. At exactly zero error, the gradient is undefined, which can cause numerical issues.
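The contrast is easy to verify numerically. A sketch of both gradients, following the definitions above (the error values are illustrative):

```python
import numpy as np

def mse_grad(y_true, y_pred):
    """Gradient of MSE w.r.t. predictions: proportional to the error."""
    n = len(y_true)
    return 2.0 * (y_pred - y_true) / n

def mae_grad(y_true, y_pred):
    """Subgradient of MAE: constant magnitude 1/n, only the sign varies."""
    n = len(y_true)
    return np.sign(y_pred - y_true) / n

y_true = np.array([0.0, 0.0, 0.0])
y_pred = np.array([0.1, 1.0, 10.0])  # small, medium, large errors

print(mse_grad(y_true, y_pred))  # scales with the error: ~[0.067, 0.667, 6.667]
print(mae_grad(y_true, y_pred))  # constant magnitude:    ~[0.333, 0.333, 0.333]
```

The 100x spread in errors produces a 100x spread in MSE gradients but no spread at all in MAE gradients, which is exactly why MAE keeps taking full-size steps near the optimum.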
Regression Loss Functions Compared
MSE and MAE are not the only options. Several loss functions attempt to combine the best properties of both, or add capabilities for specialized use cases.
Each loss function makes different tradeoffs between robustness, smoothness, and convergence behavior.
| Loss | Outlier Robust? | Smoothness | Convergence | Near Zero | Optimizes |
|---|---|---|---|---|---|
| MSE (L2): (y − ŷ)² | Poor: squares amplify outliers | Excellent: smooth and differentiable everywhere | Excellent: fast near optimum, adaptive step | Excellent: gradient vanishes smoothly | Mean of the conditional distribution |
| MAE (L1): \|y − ŷ\| | Excellent: linear penalty ignores outlier magnitude | Poor: non-differentiable at zero | Moderate: constant step, oscillates near optimum | Poor: gradient jumps at zero | Median of the conditional distribution |
| Huber (δ): ½(y − ŷ)² if \|e\| ≤ δ, else δ\|e\| − ½δ² | Excellent: MSE for small errors, MAE for large | Excellent: differentiable everywhere | Excellent: best of both worlds | Excellent: smooth quadratic near zero | Trimmed mean (between mean and median) |
| Log-Cosh: log(cosh(ŷ − y)) | Excellent: similar to Huber but smoother | Excellent: infinitely differentiable | Excellent: smooth gradients everywhere | Excellent: behaves like MSE near zero | Approximate mean with outlier resistance |
| Quantile (τ): τ·e if e ≥ 0, else (τ − 1)·e | Moderate: depends on the quantile chosen | Poor: non-differentiable at zero | Moderate: constant gradient like MAE | Poor: same jump as MAE | τ-th quantile of the distribution |
Start with Huber Loss when...
- You are unsure about outlier presence
- You want smooth gradients everywhere
- You need a safe, general-purpose default
Use Quantile Loss when...
- You need prediction intervals
- Over-prediction costs differ from under-prediction
- You care about specific percentiles
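As a sketch of the quantile (pinball) loss from the table above, with a made-up dataset: minimizing it over a constant prediction lands at the τ-th quantile rather than the mean or median:

```python
import numpy as np

def quantile_loss(y_true, y_pred, tau):
    """Pinball loss: charges tau per unit of under-prediction
    and (1 - tau) per unit of over-prediction."""
    e = y_true - y_pred
    return np.mean(np.where(e >= 0, tau * e, (tau - 1) * e))

# Minimizing over a constant prediction recovers the tau-th quantile.
y = np.arange(1.0, 11.0)  # 1, 2, ..., 10
candidates = np.linspace(0.0, 11.0, 1101)
losses = np.array([quantile_loss(y, c, tau=0.9) for c in candidates])
best = candidates[losses.argmin()]  # lands in [9, 10], near the 90th percentile
print(best)
```

Setting tau=0.5 charges both directions equally and reduces to half of MAE, which is why the median is the 0.5 quantile.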
Huber Loss: The Best of Both Worlds
Huber loss is arguably the most important variant to know. It behaves like MSE when the error is small (below a threshold δ) and like MAE when the error is large:

L_δ(e) = ½e² if |e| ≤ δ, else δ|e| − ½δ², where e = y − ŷ

For small errors, Huber's quadratic region provides smooth, vanishing gradients that allow precise convergence. For large errors, the linear region prevents outliers from dominating the loss. The threshold δ controls the transition point: a large δ makes Huber behave more like MSE, while a small δ makes it behave more like MAE.
In deep learning frameworks, Huber loss is often called Smooth L1 loss (with δ = 1). It is the standard regression loss in object detection architectures like Faster R-CNN and SSD, where bounding box coordinates frequently contain outlier-scale errors during early training.
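A minimal NumPy version of the piecewise definition (framework implementations such as PyTorch's `HuberLoss` compute the same quantity, up to reduction details):

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic for |error| <= delta, linear beyond it."""
    e = y_true - y_pred
    abs_e = np.abs(e)
    quad = 0.5 * e ** 2
    lin = delta * abs_e - 0.5 * delta ** 2
    return np.mean(np.where(abs_e <= delta, quad, lin))

# Small error: quadratic branch, like (half) MSE.
print(huber(np.array([0.5]), np.array([0.0])))   # 0.5 * 0.25 = 0.125
# Large error: linear branch, like MAE with an offset.
print(huber(np.array([10.0]), np.array([0.0])))  # 10 - 0.5 = 9.5, versus 50 for the quadratic branch
```

The −½δ² offset in the linear branch makes the two pieces meet smoothly at |e| = δ, so the loss is differentiable everywhere.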
Common Pitfalls
Scale Sensitivity with MSE
When features or targets have very different scales, MSE can be dominated by the largest-scale component. A target with large raw values (like house prices in dollars) produces squared errors in the millions or billions, and the correspondingly large gradients can destabilize training. Normalize your targets before using MSE, or use gradient clipping as a safety net.
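One simple safeguard is to standardize the targets before training and map predictions back afterward. A sketch with made-up prices (the placeholder prediction stands in for real model output):

```python
import numpy as np

# Raw house prices: squared errors would land in the billions.
y = np.array([250_000.0, 310_000.0, 480_000.0, 1_200_000.0])

# Standardize targets to zero mean and unit variance before training.
mu, sigma = y.mean(), y.std()
y_scaled = (y - mu) / sigma

# ... train the model against y_scaled, then invert the transform for reporting:
pred_scaled = y_scaled.copy()            # placeholder for model output
pred_dollars = pred_scaled * sigma + mu  # back in the target's original units
```

The statistics mu and sigma must come from the training set only and be reused unchanged at inference time, or the inverse transform will be inconsistent.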
MAE Oscillation Near Convergence
Because MAE's gradient never decreases in magnitude, training with MAE often shows loss plateaus where the model oscillates around the optimum without settling. Reducing the learning rate over time helps, but Huber loss is usually a better solution since it naturally switches to smooth gradients near zero error.
Using MSE for Heavy-Tailed Data
Many real-world datasets have heavy-tailed error distributions: most predictions are close, but a few are very far off. Using MSE on such data means the model spends most of its capacity trying to reduce the few large errors, potentially sacrificing accuracy on the majority of samples. If your residuals have a long tail, switch to Huber loss or MAE.
Forgetting That MSE and MAE Optimize Different Statistics
MSE converges to the conditional mean, while MAE converges to the conditional median. For symmetric distributions, these are the same. For skewed distributions, they are not. If you need the expected value, use MSE. If you need a robust central estimate, use MAE. If you need a specific percentile, use quantile loss.
Key Takeaways
- MSE squares errors, amplifying large mistakes quadratically. This makes it sensitive to outliers but provides smooth, adaptive gradients that enable precise convergence on clean data.
- MAE treats all errors linearly, making it robust to outliers. But its constant gradient magnitude causes oscillation near the optimum and non-differentiability at zero.
- MSE optimizes the mean, MAE optimizes the median. Choose based on which statistical quantity matters for your application and whether your data is symmetric or skewed.
- Huber loss combines the best of both. It uses MSE for small errors (smooth gradients) and MAE for large errors (outlier robustness). It should be your default choice when you are unsure.
- Always check your residual distribution. If residuals are heavy-tailed, MSE will chase outliers. If they are clean and Gaussian, MSE will converge faster than MAE.
Related Concepts
- Cross-Entropy Loss — The standard classification loss, playing the role for predicted probabilities that MSE and MAE play for regression targets
- KL Divergence — Measures distribution differences, closely related to cross-entropy
- Focal Loss — Modified cross-entropy that down-weights easy examples for imbalanced classification
- Contrastive Loss — Loss function for learning representations through similarity and dissimilarity
- Dropout — Regularization technique that can be combined with any regression loss
