MSE and MAE: Fundamental Regression Losses
Mean Squared Error (MSE) and Mean Absolute Error (MAE) are the two most important loss functions for regression. They look deceptively simple, but their differences in how they penalize errors lead to fundamentally different model behaviors. Choosing the wrong one can make a model chase outliers instead of learning the true pattern, or oscillate endlessly instead of converging.
The core question is: when your prediction is wrong, how harshly should the penalty scale with the size of the mistake?
The Target Practice Analogy
Think of each prediction as an arrow shot at a target. The distance from the bullseye is your prediction error. MSE squares each distance: a shot 1 ring off costs 1 penalty point, but a shot 4 rings off costs 16 penalty points rather than 4. MAE simply adds the distances: 1 ring off costs 1, 4 rings off costs 4. This single difference shapes everything downstream, from outlier sensitivity to gradient behavior to what statistical quantity the model converges toward.
Mathematical Definitions
Mean Squared Error (L2 Loss)
MSE computes the average of the squared differences between predictions and targets:

MSE = (1/n) Σᵢ (yᵢ - ŷᵢ)²

Where yᵢ is the actual value, ŷᵢ is the predicted value, and n is the number of samples. The squaring means that an error of 10 contributes 100 to the loss, while an error of 1 contributes only 1. This quadratic scaling is what makes MSE sensitive to large errors.
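A minimal NumPy sketch of the definition (the arrays are made-up example values, not data from this article):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # actual values (illustrative)
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # model predictions (illustrative)

# Mean Squared Error: average of the squared residuals
mse = np.mean((y_true - y_pred) ** 2)
print(mse)   # 0.875
```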
Mean Absolute Error (L1 Loss)
MAE computes the average of the absolute differences:

MAE = (1/n) Σᵢ |yᵢ - ŷᵢ|

Every unit of error contributes equally to the total loss. An error of 10 contributes exactly 10 times as much as an error of 1, not 100 times as much. This linear scaling is what gives MAE its robustness to outliers.
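The same illustrative arrays show the contrast: adding one large miss raises MAE linearly but inflates MSE quadratically.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# Mean Absolute Error: average of the absolute residuals
mae = np.mean(np.abs(y_true - y_pred))
print(mae)   # 0.75

# One prediction off by 11: MAE grows linearly, MSE quadratically
y_pred_outlier = np.array([2.5, 5.0, 4.0, 18.0])
print(np.mean(np.abs(y_true - y_pred_outlier)))   # 3.25
print(np.mean((y_true - y_pred_outlier) ** 2))    # 30.875
```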
Root Mean Squared Error
RMSE is simply the square root of MSE, which brings the loss back into the same units as the target variable:

RMSE = √MSE = √( (1/n) Σᵢ (yᵢ - ŷᵢ)² )
RMSE is useful for reporting because its units match the prediction (dollars, meters, degrees), making it more interpretable than raw MSE. However, minimizing RMSE is mathematically identical to minimizing MSE since the square root is monotonic.
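In code it is just a square root on top of MSE (same illustrative arrays as above):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)    # back in the units of y (dollars, meters, degrees, ...)
print(mse, rmse)       # 0.875 0.9354...
```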
Loss Surface Explorer
The loss surface determines how optimization behaves. Below, you can manually adjust a prediction line (slope and intercept) and watch the MSE or MAE loss change in real-time. When you toggle to MSE residuals, the blue squares show the squared error for each point. For MAE residuals, the orange bars show the absolute error. Notice how the squares grow dramatically faster than the bars as points move further from the line.
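To reproduce the explorer offline, here is a minimal sketch; the synthetic data and the crude grid search over slope and intercept are assumptions of this example, not part of the demo itself.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=x.shape)   # synthetic linear data with noise

def mse_loss(slope, intercept):
    return np.mean((y - (slope * x + intercept)) ** 2)

def mae_loss(slope, intercept):
    return np.mean(np.abs(y - (slope * x + intercept)))

# Crude grid search standing in for the interactive sliders
slopes = np.linspace(0.0, 4.0, 201)
intercepts = np.linspace(-3.0, 5.0, 201)
best_mse = min((mse_loss(a, b), a, b) for a in slopes for b in intercepts)
best_mae = min((mae_loss(a, b), a, b) for a in slopes for b in intercepts)
print("MSE-optimal (loss, slope, intercept):", best_mse)
print("MAE-optimal (loss, slope, intercept):", best_mae)
```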
Outlier Sensitivity
The most practically important difference between MSE and MAE is how they respond to outliers. A single extreme data point can completely dominate MSE, dragging the optimal fit line away from the majority of points. MAE is far more resistant because it treats an outlier's error linearly rather than quadratically.
Consider a dataset where most points follow a clean linear trend, but one point is far from the pattern. Toggle the outlier below and watch the best-fit line shift: MSE spikes while MAE stays relatively stable, and the MSE-optimal line pivots dramatically toward the outlier while an MAE-optimal line would barely move.
This is why MSE is said to optimize the conditional mean while MAE optimizes the conditional median. The mean is pulled by outliers; the median is not. If your data contains measurement errors, sensor glitches, or any source of occasional extreme values, MAE or Huber loss is usually the better choice.
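A quick way to see the mean-versus-median point is to minimize each loss over a single constant prediction c; the tiny dataset below is illustrative, not the demo's data.

```python
import numpy as np

data = np.array([1.0, 2.0, 2.0, 3.0, 100.0])   # one extreme outlier

candidates = np.linspace(0.0, 100.0, 10001)    # candidate constant predictions
mse_per_c = np.mean((data[None, :] - candidates[:, None]) ** 2, axis=1)
mae_per_c = np.mean(np.abs(data[None, :] - candidates[:, None]), axis=1)

print("MSE minimizer:", candidates[mse_per_c.argmin()])   # ~21.6, the mean (pulled by the outlier)
print("MAE minimizer:", candidates[mae_per_c.argmin()])   # ~2.0, the median (unmoved)
print("mean:", data.mean(), "median:", np.median(data))
```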
Gradient Behavior
MSE Gradients Scale with Error
The gradient of MSE with respect to a single prediction is:

∂MSE/∂ŷᵢ = (2/n)(ŷᵢ - yᵢ)

This gradient is proportional to the error. When the prediction is far off, the gradient is large, pushing the model hard to correct. As the prediction approaches the target, the gradient shrinks toward zero, enabling precise fine-tuning near the optimum. This adaptive behavior is MSE's greatest strength for clean data.
MAE Gradients Are Constant
The gradient (technically, subgradient) of MAE is:

∂MAE/∂ŷᵢ = (1/n)·sign(ŷᵢ - yᵢ)

This gradient has constant magnitude: it is always plus one or minus one (divided by n), regardless of how large or small the error is. For large errors, this is fine. But near the optimum, the model still takes full-size steps, causing it to overshoot and oscillate around the minimum instead of settling smoothly. At exactly zero error, the gradient is undefined, which can cause numerical issues.
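A few concrete values make the contrast obvious; the per-sample gradients below take n = 1 for clarity.

```python
import numpy as np

errors = np.array([0.1, 1.0, 4.0, 10.0])   # prediction error e = ŷ - y

mse_grad = 2 * errors           # MSE gradient scales with the error
mae_grad = np.sign(errors)      # MAE gradient is always ±1, whatever the error size

for e, g_mse, g_mae in zip(errors, mse_grad, mae_grad):
    print(f"error={e:5.1f}   MSE grad={g_mse:5.1f}   MAE grad={g_mae:4.1f}")
```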
Regression Loss Functions Compared
MSE and MAE are not the only options. Several loss functions attempt to combine the best properties of both, or add capabilities for specialized use cases.
Each loss function makes different tradeoffs between robustness, smoothness, and convergence behavior.
| Loss | Formula | Outlier Robust? | Smoothness | Convergence | Near Zero | Optimizes |
|---|---|---|---|---|---|---|
| MSE (L2) | (y - ŷ)² | Poor: squares amplify outliers | Excellent: smooth and differentiable everywhere | Excellent: fast near optimum, adaptive step | Excellent: gradient vanishes smoothly | Mean of the conditional distribution |
| MAE (L1) | \|y - ŷ\| | Excellent: linear penalty, ignores outlier magnitude | Poor: non-differentiable at zero | Moderate: constant step, oscillates near optimum | Poor: gradient jumps at zero | Median of the conditional distribution |
| Huber (δ) | ½(y-ŷ)² if \|e\| ≤ δ, else δ\|e\| - ½δ² | Excellent: MSE for small errors, MAE for large | Excellent: differentiable everywhere | Excellent: best of both worlds | Excellent: smooth quadratic near zero | Trimmed mean (between mean and median) |
| Log-Cosh | log(cosh(ŷ - y)) | Excellent: similar to Huber but smoother | Excellent: infinitely differentiable | Excellent: smooth gradients everywhere | Excellent: behaves like MSE near zero | Approximate mean with outlier resistance |
| Quantile (τ) | τ·e if e ≥ 0, else (τ-1)·e | Moderate: depends on quantile chosen | Poor: non-differentiable at zero | Moderate: constant gradient like MAE | Poor: same jump as MAE | τ-th quantile of the distribution |
Start with Huber Loss when...
- You are unsure about outlier presence
- You want smooth gradients everywhere
- You need a safe, general-purpose default

Use Quantile Loss when...
- You need prediction intervals
- Over-prediction costs differ from under-prediction
- You care about specific percentiles (see the sketch below)
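Quantile (pinball) loss is short enough to sketch directly; a minimal NumPy version, with τ = 0.9 and the small arrays chosen only for illustration:

```python
import numpy as np

def quantile_loss(y_true, y_pred, tau=0.9):
    """Pinball loss: under-prediction costs tau per unit, over-prediction costs (1 - tau)."""
    e = y_true - y_pred
    return np.mean(np.maximum(tau * e, (tau - 1) * e))

y_true = np.array([10.0, 12.0, 9.0, 15.0])
y_pred = np.array([11.0, 10.0, 9.5, 13.0])
print(quantile_loss(y_true, y_pred, tau=0.9))   # under-predictions cost 9x more than over-predictions
print(quantile_loss(y_true, y_pred, tau=0.5))   # tau = 0.5 is just half the MAE
```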
Huber Loss: The Best of Both Worlds
Huber loss is arguably the most important variant to know. It behaves like MSE when the error is small (below a threshold δ) and like MAE when the error is large:

L_δ(e) = ½e² if |e| ≤ δ, otherwise δ|e| - ½δ², where e = y - ŷ
For small errors, Huber's quadratic region provides smooth, vanishing gradients that allow precise convergence. For large errors, the linear region prevents outliers from dominating the loss. The threshold δ controls the transition point: a large δ makes Huber behave more like MSE, while a small δ makes it behave more like MAE.
In deep learning frameworks, Huber loss is often called Smooth L1 loss (with δ = 1). It is the standard regression loss in object detection architectures like Faster R-CNN and SSD, where bounding box coordinates frequently contain outlier-scale errors during early training.
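A direct NumPy transcription of the piecewise definition (δ = 1.0 here mirrors the Smooth L1 convention; the data reuses the earlier illustrative example):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    e = y_true - y_pred
    quadratic = 0.5 * e ** 2
    linear = delta * np.abs(e) - 0.5 * delta ** 2
    return np.mean(np.where(np.abs(e) <= delta, quadratic, linear))

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 18.0])        # last prediction is an outlier-sized miss
print(huber_loss(y_true, y_pred, delta=1.0))    # ~2.91: the outlier contributes linearly
print(np.mean((y_true - y_pred) ** 2))          # 30.875: MSE on the same data
```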
Common Pitfalls
Scale Sensitivity with MSE
When features or targets have very different scales, MSE can be dominated by the largest-scale component. A target measured in large raw units (like house prices in dollars) produces squared errors in the millions or more, which can lead to exploding gradients and unstable training. Always normalize your targets before using MSE, or use gradient clipping as a safety net.
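A minimal sketch of the usual fix, standardizing targets and undoing the transform at prediction time (the values and variable names are illustrative):

```python
import numpy as np

y = np.array([250_000.0, 410_000.0, 330_000.0, 975_000.0])   # raw targets, e.g. house prices

# Standardize targets so MSE operates on roughly unit-scale values
y_mean, y_std = y.mean(), y.std()
y_scaled = (y - y_mean) / y_std

# ... train the model against y_scaled with MSE ...

# Undo the transform to report predictions in the original units
preds_scaled = np.array([-0.6, 0.1, -0.2, 1.5])   # stand-in model outputs
preds = preds_scaled * y_std + y_mean
print(preds)
```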
MAE Oscillation Near Convergence
Because MAE's gradient never decreases in magnitude, training with MAE often shows loss plateaus where the model oscillates around the optimum without settling. Reducing the learning rate over time helps, but Huber loss is usually a better solution since it naturally switches to smooth gradients near zero error.
Using MSE for Heavy-Tailed Data
Many real-world datasets have heavy-tailed error distributions: most predictions are close, but a few are very far off. Using MSE on such data means the model spends most of its capacity trying to reduce the few large errors, potentially sacrificing accuracy on the majority of samples. If your residuals have a long tail, switch to Huber loss or MAE.
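One rough check, offered here as a heuristic rather than anything prescribed above: for Gaussian residuals the RMSE/MAE ratio is about 1.25 (√(π/2)), so a much larger ratio on your validation residuals suggests heavy tails.

```python
import numpy as np

def rmse_mae_ratio(residuals):
    """~1.25 for Gaussian residuals; noticeably larger when the tails are heavy."""
    rmse = np.sqrt(np.mean(residuals ** 2))
    mae = np.mean(np.abs(residuals))
    return rmse / mae

rng = np.random.default_rng(1)
gaussian = rng.normal(0.0, 1.0, 10_000)
heavy_tailed = rng.standard_t(df=2, size=10_000)   # Student-t, heavy tails

print(rmse_mae_ratio(gaussian))       # ~1.25
print(rmse_mae_ratio(heavy_tailed))   # clearly larger
```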
Forgetting That MSE and MAE Optimize Different Statistics
MSE converges to the conditional mean, while MAE converges to the conditional median. For symmetric distributions, these are the same. For skewed distributions, they are not. If you need the expected value, use MSE. If you need a robust central estimate, use MAE. If you need a specific percentile, use quantile loss.
Key Takeaways
- MSE squares errors, amplifying large mistakes quadratically. This makes it sensitive to outliers but provides smooth, adaptive gradients that enable precise convergence on clean data.
- MAE treats all errors linearly, making it robust to outliers. But its constant gradient magnitude causes oscillation near the optimum and non-differentiability at zero.
- MSE optimizes the mean, MAE optimizes the median. Choose based on which statistical quantity matters for your application and whether your data is symmetric or skewed.
- Huber loss combines the best of both. It uses MSE for small errors (smooth gradients) and MAE for large errors (outlier robustness). It should be your default choice when you are unsure.
- Always check your residual distribution. If residuals are heavy-tailed, MSE will chase outliers. If they are clean and Gaussian, MSE will converge faster than MAE.
Related Concepts
- Cross-Entropy Loss — The classification equivalent of MSE, measuring error between predicted probabilities and true labels
- KL Divergence — Measures distribution differences, closely related to cross-entropy
- Focal Loss — Modified cross-entropy that down-weights easy examples for imbalanced classification
- Contrastive Loss — Loss function for learning representations through similarity and dissimilarity
- Dropout — Regularization technique that can be combined with any regression loss
