MSE and MAE: Fundamental Regression Losses
Mean Squared Error (MSE) and Mean Absolute Error (MAE) are the two most important loss functions for regression. They look deceptively simple, but their differences in how they penalize errors lead to fundamentally different model behaviors. Choosing the wrong one can make a model chase outliers instead of learning the true pattern, or oscillate endlessly instead of converging.
The core question is: when your prediction is wrong, how harshly should the penalty scale with the size of the mistake?
The Target Practice Analogy
Think of each prediction as an arrow shot at a target. The distance from the bullseye is your prediction error. MSE squares each distance: a shot 1 ring off costs 1 penalty point, but a shot 4 rings off costs 16 penalty points rather than 4. MAE simply adds the distances: 1 ring off costs 1, 4 rings off costs 4. This single difference shapes everything downstream, from outlier sensitivity to gradient behavior to what statistical quantity the model converges toward.
Mathematical Definitions
Mean Squared Error (L2 Loss)
MSE computes the average of the squared differences between predictions and targets:

MSE = (1/n) Σᵢ (yᵢ - ŷᵢ)²

Where yᵢ is the actual value, ŷᵢ is the predicted value, and n is the number of samples. The squaring means that an error of 10 contributes 100 to the loss, while an error of 1 contributes only 1. This quadratic scaling is what makes MSE sensitive to large errors.
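A minimal NumPy sketch of the definition (the arrays are made-up example values, not data from this article):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # actual values (illustrative)
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # model predictions (illustrative)

# Mean Squared Error: average of the squared residuals
mse = np.mean((y_true - y_pred) ** 2)
print(mse)   # 0.875
```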
Mean Absolute Error (L1 Loss)
MAE computes the average of the absolute differences:

MAE = (1/n) Σᵢ |yᵢ - ŷᵢ|

Every unit of error contributes equally to the total loss. An error of 10 contributes exactly 10 times as much as an error of 1, not 100 times as much. This linear scaling is what gives MAE its robustness to outliers.
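The same illustrative arrays show the contrast: adding one large miss raises MAE linearly but inflates MSE quadratically.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# Mean Absolute Error: average of the absolute residuals
mae = np.mean(np.abs(y_true - y_pred))
print(mae)   # 0.75

# One prediction off by 11: MAE grows linearly, MSE quadratically
y_pred_outlier = np.array([2.5, 5.0, 4.0, 18.0])
print(np.mean(np.abs(y_true - y_pred_outlier)))   # 3.25
print(np.mean((y_true - y_pred_outlier) ** 2))    # 30.875
```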
Root Mean Squared Error
RMSE is simply the square root of MSE, which brings the loss back into the same units as the target variable:

RMSE = √MSE = √( (1/n) Σᵢ (yᵢ - ŷᵢ)² )
RMSE is useful for reporting because its units match the prediction (dollars, meters, degrees), making it more interpretable than raw MSE. However, minimizing RMSE is mathematically identical to minimizing MSE since the square root is monotonic.
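In code it is just a square root on top of MSE (same illustrative arrays as above):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)    # back in the units of y (dollars, meters, degrees, ...)
print(mse, rmse)       # 0.875 0.9354...
```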
Loss Surface Explorer
The loss surface determines how optimization behaves. Below, you can manually adjust a prediction line (slope and intercept) and watch the MSE or MAE loss change in real-time. When you toggle to MSE residuals, the blue squares show the squared error for each point. For MAE residuals, the orange bars show the absolute error. Notice how the squares grow dramatically faster than the bars as points move further from the line.
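To reproduce the explorer offline, here is a minimal sketch; the synthetic data and the crude grid search over slope and intercept are assumptions of this example, not part of the demo itself.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=x.shape)   # synthetic linear data with noise

def mse_loss(slope, intercept):
    return np.mean((y - (slope * x + intercept)) ** 2)

def mae_loss(slope, intercept):
    return np.mean(np.abs(y - (slope * x + intercept)))

# Crude grid search standing in for the interactive sliders
slopes = np.linspace(0.0, 4.0, 201)
intercepts = np.linspace(-3.0, 5.0, 201)
best_mse = min((mse_loss(a, b), a, b) for a in slopes for b in intercepts)
best_mae = min((mae_loss(a, b), a, b) for a in slopes for b in intercepts)
print("MSE-optimal (loss, slope, intercept):", best_mse)
print("MAE-optimal (loss, slope, intercept):", best_mae)
```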
Outlier Sensitivity
The most practically important difference between MSE and MAE is how they respond to outliers. A single extreme data point can completely dominate MSE, dragging the optimal fit line away from the majority of points. MAE is far more resistant because it treats an outlier's error linearly rather than quadratically.
Consider a dataset where most points follow a clean linear trend, but one point is far from the pattern. Toggle the outlier below and watch the best-fit line shift: MSE spikes while MAE stays relatively stable, and the MSE-optimal line pivots dramatically toward the outlier while an MAE-optimal line would barely move.
This is why MSE is said to optimize the conditional mean while MAE optimizes the conditional median. The mean is pulled by outliers; the median is not. If your data contains measurement errors, sensor glitches, or any source of occasional extreme values, MAE or Huber loss is usually the better choice.
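A quick way to see the mean-versus-median point is to minimize each loss over a single constant prediction c; the tiny dataset below is illustrative, not the demo's data.

```python
import numpy as np

data = np.array([1.0, 2.0, 2.0, 3.0, 100.0])   # one extreme outlier

candidates = np.linspace(0.0, 100.0, 10001)    # candidate constant predictions
mse_per_c = np.mean((data[None, :] - candidates[:, None]) ** 2, axis=1)
mae_per_c = np.mean(np.abs(data[None, :] - candidates[:, None]), axis=1)

print("MSE minimizer:", candidates[mse_per_c.argmin()])   # ~21.6, the mean (pulled by the outlier)
print("MAE minimizer:", candidates[mae_per_c.argmin()])   # ~2.0, the median (unmoved)
print("mean:", data.mean(), "median:", np.median(data))
```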
Gradient Behavior
MSE Gradients Scale with Error
The gradient of MSE with respect to a single prediction is:

∂MSE/∂ŷᵢ = (2/n)(ŷᵢ - yᵢ)

This gradient is proportional to the error. When the prediction is far off, the gradient is large, pushing the model hard to correct. As the prediction approaches the target, the gradient shrinks toward zero, enabling precise fine-tuning near the optimum. This adaptive behavior is MSE's greatest strength for clean data.
MAE Gradients Are Constant
The gradient (technically, subgradient) of MAE is:

∂MAE/∂ŷᵢ = (1/n)·sign(ŷᵢ - yᵢ)

This gradient has constant magnitude: it is always plus one or minus one (divided by n), regardless of how large or small the error is. For large errors, this is fine. But near the optimum, the model still takes full-size steps, causing it to overshoot and oscillate around the minimum instead of settling smoothly. At exactly zero error, the gradient is undefined, which can cause numerical issues.
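A few concrete values make the contrast obvious; the per-sample gradients below take n = 1 for clarity.

```python
import numpy as np

errors = np.array([0.1, 1.0, 4.0, 10.0])   # prediction error e = ŷ - y

mse_grad = 2 * errors           # MSE gradient scales with the error
mae_grad = np.sign(errors)      # MAE gradient is always ±1, whatever the error size

for e, g_mse, g_mae in zip(errors, mse_grad, mae_grad):
    print(f"error={e:5.1f}   MSE grad={g_mse:5.1f}   MAE grad={g_mae:4.1f}")
```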
Regression Loss Functions Compared
MSE and MAE are not the only options. Several loss functions attempt to combine the best properties of both, or add capabilities for specialized use cases.
Each loss function makes different tradeoffs between robustness, smoothness, and convergence behavior.
| Loss | Formula | Outlier Robust? | Smoothness | Convergence | Near Zero | Optimizes |
|---|---|---|---|---|---|---|
| MSE (L2) | (y - ŷ)² | Poor: squares amplify outliers | Excellent: smooth and differentiable everywhere | Excellent: fast near optimum, adaptive step | Excellent: gradient vanishes smoothly | Mean of the conditional distribution |
| MAE (L1) | \|y - ŷ\| | Excellent: linear penalty, ignores outlier magnitude | Poor: non-differentiable at zero | Moderate: constant step, oscillates near optimum | Poor: gradient jumps at zero | Median of the conditional distribution |
| Huber (δ) | ½(y-ŷ)² if \|e\| ≤ δ, else δ\|e\| - ½δ² | Excellent: MSE for small errors, MAE for large | Excellent: differentiable everywhere | Excellent: best of both worlds | Excellent: smooth quadratic near zero | Trimmed mean (between mean and median) |
| Log-Cosh | log(cosh(ŷ - y)) | Excellent: similar to Huber but smoother | Excellent: infinitely differentiable | Excellent: smooth gradients everywhere | Excellent: behaves like MSE near zero | Approximate mean with outlier resistance |
| Quantile (τ) | τ·e if e ≥ 0, else (τ-1)·e | Moderate: depends on quantile chosen | Poor: non-differentiable at zero | Moderate: constant gradient like MAE | Poor: same jump as MAE | τ-th quantile of the distribution |
Start with Huber Loss when...
- You are unsure about outlier presence
- You want smooth gradients everywhere
- You need a safe, general-purpose default

Use Quantile Loss when...
- You need prediction intervals
- Over-prediction costs differ from under-prediction
- You care about specific percentiles (see the sketch below)
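Quantile (pinball) loss is short enough to sketch directly; a minimal NumPy version, with τ = 0.9 and the small arrays chosen only for illustration:

```python
import numpy as np

def quantile_loss(y_true, y_pred, tau=0.9):
    """Pinball loss: under-prediction costs tau per unit, over-prediction costs (1 - tau)."""
    e = y_true - y_pred
    return np.mean(np.maximum(tau * e, (tau - 1) * e))

y_true = np.array([10.0, 12.0, 9.0, 15.0])
y_pred = np.array([11.0, 10.0, 9.5, 13.0])
print(quantile_loss(y_true, y_pred, tau=0.9))   # under-predictions cost 9x more than over-predictions
print(quantile_loss(y_true, y_pred, tau=0.5))   # tau = 0.5 is just half the MAE
```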
Huber Loss: The Best of Both Worlds
Huber loss is arguably the most important variant to know. It behaves like MSE when the error is small (below a threshold δ) and like MAE when the error is large:

L_δ(e) = ½e² if |e| ≤ δ, otherwise δ|e| - ½δ², where e = y - ŷ
For small errors, Huber's quadratic region provides smooth, vanishing gradients that allow precise convergence. For large errors, the linear region prevents outliers from dominating the loss. The threshold δ controls the transition point: a large δ makes Huber behave more like MSE, while a small δ makes it behave more like MAE.
In deep learning frameworks, Huber loss is often called Smooth L1 loss (with δ = 1). It is the standard regression loss in object detection architectures like Faster R-CNN and SSD, where bounding box coordinates frequently contain outlier-scale errors during early training.
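A direct NumPy transcription of the piecewise definition (δ = 1.0 here mirrors the Smooth L1 convention; the data reuses the earlier illustrative example):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    e = y_true - y_pred
    quadratic = 0.5 * e ** 2
    linear = delta * np.abs(e) - 0.5 * delta ** 2
    return np.mean(np.where(np.abs(e) <= delta, quadratic, linear))

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 18.0])        # last prediction is an outlier-sized miss
print(huber_loss(y_true, y_pred, delta=1.0))    # ~2.91: the outlier contributes linearly
print(np.mean((y_true - y_pred) ** 2))          # 30.875: MSE on the same data
```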
Common Pitfalls
Scale Sensitivity with MSE
When features or targets have very different scales, MSE can be dominated by the largest-scale component. A target measured in large raw units (like house prices in dollars) produces squared errors in the millions or more, which can lead to exploding gradients and unstable training. Always normalize your targets before using MSE, or use gradient clipping as a safety net.
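A minimal sketch of the usual fix, standardizing targets and undoing the transform at prediction time (the values and variable names are illustrative):

```python
import numpy as np

y = np.array([250_000.0, 410_000.0, 330_000.0, 975_000.0])   # raw targets, e.g. house prices

# Standardize targets so MSE operates on roughly unit-scale values
y_mean, y_std = y.mean(), y.std()
y_scaled = (y - y_mean) / y_std

# ... train the model against y_scaled with MSE ...

# Undo the transform to report predictions in the original units
preds_scaled = np.array([-0.6, 0.1, -0.2, 1.5])   # stand-in model outputs
preds = preds_scaled * y_std + y_mean
print(preds)
```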
MAE Oscillation Near Convergence
Because MAE's gradient never decreases in magnitude, training with MAE often shows loss plateaus where the model oscillates around the optimum without settling. Reducing the learning rate over time helps, but Huber loss is usually a better solution since it naturally switches to smooth gradients near zero error.
Using MSE for Heavy-Tailed Data
Many real-world datasets have heavy-tailed error distributions: most predictions are close, but a few are very far off. Using MSE on such data means the model spends most of its capacity trying to reduce the few large errors, potentially sacrificing accuracy on the majority of samples. If your residuals have a long tail, switch to Huber loss or MAE.
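One rough check, offered here as a heuristic rather than anything prescribed above: for Gaussian residuals the RMSE/MAE ratio is about 1.25 (√(π/2)), so a much larger ratio on your validation residuals suggests heavy tails.

```python
import numpy as np

def rmse_mae_ratio(residuals):
    """~1.25 for Gaussian residuals; noticeably larger when the tails are heavy."""
    rmse = np.sqrt(np.mean(residuals ** 2))
    mae = np.mean(np.abs(residuals))
    return rmse / mae

rng = np.random.default_rng(1)
gaussian = rng.normal(0.0, 1.0, 10_000)
heavy_tailed = rng.standard_t(df=2, size=10_000)   # Student-t, heavy tails

print(rmse_mae_ratio(gaussian))       # ~1.25
print(rmse_mae_ratio(heavy_tailed))   # clearly larger
```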
Forgetting That MSE and MAE Optimize Different Statistics
MSE converges to the conditional mean, while MAE converges to the conditional median. For symmetric distributions, these are the same. For skewed distributions, they are not. If you need the expected value, use MSE. If you need a robust central estimate, use MAE. If you need a specific percentile, use quantile loss.
Key Takeaways
- MSE squares errors, amplifying large mistakes quadratically. This makes it sensitive to outliers but provides smooth, adaptive gradients that enable precise convergence on clean data.
- MAE treats all errors linearly, making it robust to outliers. But its constant gradient magnitude causes oscillation near the optimum and non-differentiability at zero.
- MSE optimizes the mean, MAE optimizes the median. Choose based on which statistical quantity matters for your application and whether your data is symmetric or skewed.
- Huber loss combines the best of both. It uses MSE for small errors (smooth gradients) and MAE for large errors (outlier robustness). It should be your default choice when you are unsure.
- Always check your residual distribution. If residuals are heavy-tailed, MSE will chase outliers. If they are clean and Gaussian, MSE will converge faster than MAE.
Related Concepts
- Cross-Entropy Loss — The classification equivalent of MSE, measuring error between predicted probabilities and true labels
- KL Divergence — Measures distribution differences, closely related to cross-entropy
- Focal Loss — Modified cross-entropy that down-weights easy examples for imbalanced classification
- Contrastive Loss — Loss function for learning representations through similarity and dissimilarity
- Dropout — Regularization technique that can be combined with any regression loss
