Neural Scaling Laws Explained

Explore neural scaling laws in deep learning: power law relationships between model size, data, and compute that predict AI performance.

Neural Scaling Laws: The Mathematics of Model Performance

Neural scaling laws are empirical power law relationships that describe how model performance improves with increased scale — whether in parameters, data, or compute. These laws have become fundamental to understanding and predicting AI progress, guiding multi-million dollar training decisions and revealing the path toward more capable systems.

The discovery that simple mathematical relationships govern complex emergent behaviors has transformed model development from trial-and-error into principled engineering. If you know the exponents, you can predict the loss before training a single step.

The Recipe Scaling Analogy

Imagine scaling up a recipe from a home kitchen to a restaurant. Doubling the flour does not double the quality of the bread — you also need to scale water, yeast, and oven time in the right proportions. Neural scaling works the same way: parameters, data, and compute must grow together in specific ratios, and getting these ratios wrong wastes resources without improving results.

Power Law Relationships

At the heart of scaling laws lies a remarkably simple mathematical structure. Despite the complexity of neural networks — billions of parameters, trillions of floating-point operations, terabytes of training data — the relationship between scale and performance follows a clean power law in each scaling dimension. For a single variable, the relationship takes this form:

L(x) = a \cdot x^{-\alpha} + L_{\text{irreducible}}

Where x is the scaling variable (parameters, data tokens, or FLOPs), α is the scaling exponent that governs the rate of improvement, a is a constant, and L\text{irreducible} is the theoretical minimum loss — the inherent randomness in the data that no model can eliminate.
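Under these definitions, the power law is easy to evaluate directly. A minimal sketch, using Kaplan's reported parameter exponent (α ≈ 0.076) and an irreducible-loss constant of about 1.69 nats as illustrative placeholders (neither is a fit from this article):

```python
def power_law_loss(x, a=1.0, alpha=0.076, l_irreducible=1.69):
    """Single-variable scaling law: L(x) = a * x^(-alpha) + L_irreducible.

    The constants are placeholders: alpha=0.076 echoes the Kaplan
    parameter exponent, and 1.69 nats is a commonly cited estimate of
    the irreducible loss for web text.
    """
    return a * x ** (-alpha) + l_irreducible

# Scaling x by 10 shrinks the reducible term by 10^(-0.076) ≈ 0.84,
# i.e. roughly a 16% reduction in the reducible loss.
reduction = 1 - 10 ** (-0.076)
```

The same function applies to any of the three scaling dimensions; only the exponent and constants change.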

Three Scaling Dimensions

Parameter scaling describes how loss decreases with model size. Kaplan et al. found αN ≈ 0.076, meaning 10x more parameters yields roughly 16% lower loss (since 10^−0.076 ≈ 0.84). Larger models are also more sample-efficient, learning more per token seen.

Data scaling captures how loss improves with dataset size. The exponent αD ≈ 0.095 is larger than the parameter exponent, which means that adding more high-quality data has a stronger impact than adding more parameters — a finding that reshaped the field.

Compute scaling ties the other two together through the relationship C ≈ 6ND, where C is FLOPs, N is parameters, and D is tokens. Given a fixed compute budget, the question becomes how to split it between model size and training data.
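The C ≈ 6ND approximation can be checked with a one-line helper. As a sanity check, GPT-3's reported figures (175B parameters, 300B tokens) give roughly 3.15 × 10²³ FLOPs:

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute: C ≈ 6 * N * D FLOPs
    (~2ND for the forward pass, ~4ND for the backward pass)."""
    return 6 * n_params * n_tokens

# GPT-3: 175B parameters trained on 300B tokens
c_gpt3 = training_flops(175e9, 300e9)  # ≈ 3.15e23 FLOPs
```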

The joint scaling law combines all three dimensions into a single expression:

L(N, D) = \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_{\text{irreducible}}

Where Nc and Dc are critical constants estimated from experimental data. This equation reveals that parameters and data contribute independently to loss reduction — a model with infinite data but few parameters still hits a wall, and vice versa.
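A direct translation of this joint form. The constants below echo Kaplan-style fits but should be treated as illustrative placeholders, not authoritative values:

```python
def joint_loss(n_params, n_tokens,
               n_c=8.8e13, alpha_n=0.076,
               d_c=5.4e13, alpha_d=0.095,
               l_irr=1.69):
    """Joint scaling law: L(N, D) = (Nc/N)^aN + (Dc/D)^aD + L_irr.

    The parameter term and the data term are independent: driving
    either N or D to infinity still leaves the other term behind,
    plus the irreducible floor.
    """
    return (n_c / n_params) ** alpha_n + (d_c / n_tokens) ** alpha_d + l_irr
```

Evaluating the function at extreme allocations makes the "hits a wall" claim concrete: a tiny model on infinite data asymptotes to the parameter term plus the floor, never below it.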

Interactive Scaling Curves

Adjust the scaling exponents and see how loss curves change across parameters, data, and compute. Notice how small changes in the exponent lead to large differences at scale — a 0.01 shift in alpha can mean billions of dollars at frontier model budgets.

Chinchilla vs Kaplan: The Compute-Optimal Debate

The biggest controversy in scaling laws centers on how to allocate a fixed compute budget between model size and training data. The answer to this question directly shaped the trajectory of frontier model development.

Kaplan et al. (2020) argued that models should be as large as possible given the budget, even if that means training on relatively little data. This philosophy produced GPT-3: 175B parameters trained on only 300B tokens, roughly 1.7 tokens per parameter. The Kaplan prescription suggested N_opt ∝ C^0.73, allocating most of the budget to parameters.

Hoffmann et al. (2022) — the Chinchilla paper — challenged this directly. By training over 400 models ranging from 70M to 16B parameters on different amounts of data, they found that parameters and data should scale equally: N_opt ∝ C^0.5 and D_opt ∝ C^0.5. This yields the "20:1 rule" — the optimal token count is roughly 20 times the parameter count. Their 70B parameter Chinchilla model, trained on 1.4T tokens, matched the 175B parameter GPT-3 with 4x less compute.
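Combining the 20:1 rule with C ≈ 6ND pins down the allocation: substituting D = 20N gives C = 120N², so N = √(C/120). A small sketch (the 20:1 ratio is a rule of thumb, so the helper takes it as a parameter):

```python
import math

def chinchilla_allocation(c_flops, tokens_per_param=20.0):
    """Split a FLOP budget compute-optimally under C = 6*N*D with
    D = r*N. Substituting gives C = 6*r*N^2, so N = sqrt(C / (6r))."""
    n_params = math.sqrt(c_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A Chinchilla-scale budget (~5.88e23 FLOPs) recovers roughly
# 70B parameters and 1.4T tokens.
n, d = chinchilla_allocation(5.88e23)
```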

The impact was immediate: LLaMA, Mistral, and subsequent model families all adopted data-heavy training strategies. The era of under-trained, parameter-heavy models was over.

Compute Budget Planning

In practice, compute budgets are fixed by hardware availability and economic constraints. A research lab with 1,000 GPUs for 3 months has a specific FLOP budget, and the question is how to spend it wisely.

Given a fixed compute budget in FLOPs, how should you design your training run? This planner lets you set a total budget and see how different allocation strategies — Kaplan-style (parameter-heavy), Chinchilla-optimal (balanced), or inference-aware (over-train a smaller model) — affect expected loss and downstream cost. The tradeoff between training efficiency and serving cost is one of the most consequential decisions in modern AI development.
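A rough version of such a planner can be sketched by sweeping the tokens-per-parameter ratio under a fixed budget and scoring each allocation with a joint power law. All constants below are illustrative placeholders, not fitted values:

```python
import math

def allocate(c_flops, tokens_per_param):
    """Given C = 6*N*D with D = r*N, solve N = sqrt(C / (6r))."""
    n = math.sqrt(c_flops / (6.0 * tokens_per_param))
    return n, tokens_per_param * n

def loss(n, d):
    # Illustrative joint power law; constants are placeholders.
    return (8.8e13 / n) ** 0.076 + (5.4e13 / d) ** 0.095 + 1.69

budget = 1e24  # hypothetical FLOP budget
for name, ratio in [("Kaplan-style (~5:1)", 5),
                    ("Chinchilla (20:1)", 20),
                    ("Over-trained (143:1)", 143)]:
    n, d = allocate(budget, ratio)
    print(f"{name:22s} N={n:.2e}  D={d:.2e}  loss={loss(n, d):.4f}")
```

A real planner would also price inference: the loss column only captures training efficiency, not the serving cost that motivates the over-trained allocation.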

Comparing Scaling Strategies

Different scaling strategies suit different objectives, and there is no single "best" approach. Parameter-heavy training reaches the lowest loss per training dollar but produces expensive-to-serve models that may be impractical for production. Balanced Chinchilla scaling optimizes the training-inference tradeoff for moderate deployment volumes. Inference-aware over-training — training a smaller model on far more data than Chinchilla recommends — sacrifices training efficiency to produce models that are cheaper to deploy at scale, which is why LLaMA trained a 7B model on 1T tokens (a 143:1 ratio, far beyond the Chinchilla-optimal 20:1).

Scaling Approaches Compared

Different eras of LLM development have adopted different scaling strategies. The "optimal" approach depends on whether you are optimizing for training cost, inference cost, or final performance.

| Approach | Year | Philosophy | Data/Param | Compute Eff. | Training Cost | Inference Cost | Models |
|---|---|---|---|---|---|---|---|
| Kaplan Scaling | 2020 | Scale model size faster than data; bigger models are more sample-efficient | ~5 tokens/param | Moderate | High (large model, less data) | Very high (oversized model) | GPT-3, Gopher, Megatron-Turing |
| Chinchilla Scaling | 2022 | Scale model and data equally; train on ~20 tokens per parameter | ~20 tokens/param | Excellent | Optimal for given compute | Moderate (smaller model) | Chinchilla, LLaMA 1 |
| LLaMA Approach | 2023 | Overtrain smaller models on more data for cheaper inference at deployment | ~140+ tokens/param | Moderate | Higher than optimal | Low (small, capable model) | LLaMA 2, LLaMA 3, Mistral |
| MoE Scaling | 2022 | Scale total parameters but activate only a subset per token via expert routing | Varies (~20 tokens/active param) | Excellent | Moderate (sparse compute) | Moderate (memory-heavy) | Switch Transformer, Mixtral, GPT-4 |
| Inference-Optimized | 2024 | Heavily overtrain for minimal deployment cost; optimize tokens-per-dollar at inference | ~200+ tokens/param | Moderate | Very high | Very low | Gemma, Phi-3, SmolLM |
Minimize Training Cost
  • Follow Chinchilla scaling (~20 tokens/param)
  • Optimal compute allocation between model and data
  • Best loss per FLOP spent
Minimize Inference Cost
  • Overtrain a smaller model (LLaMA approach)
  • Trade training compute for smaller deployment size
  • 100-200+ tokens per parameter
Maximize Capability
  • Use MoE for more parameters at fixed compute
  • Scale total knowledge while keeping inference fast
  • Combine with Chinchilla data ratios

Common Pitfalls

1. Ignoring Data Quality

Scaling laws assume clean, deduplicated training data. Repeating data or training on low-quality text changes the effective exponents and can make more compute actively harmful. The power law only holds when each additional token provides genuine new information. In practice, data curation — filtering, deduplication, and quality scoring — often matters more than raw dataset size.

2. Extrapolating Beyond Observed Ranges

Power laws are empirical fits to observed data. Extrapolating three orders of magnitude beyond your largest experiment is risky — the exponents may shift, new phenomena like emergent abilities may appear, or hardware bottlenecks may change the effective scaling. Always validate scaling predictions with intermediate checkpoints before committing to a full frontier training run.

3. Forgetting Inference Costs

Chinchilla-optimal training produces the best model for the training budget, but not necessarily the cheapest model to deploy. If the model will serve billions of queries, over-training a smaller model (like LLaMA) can be more economical overall, even if training itself is suboptimal. The total cost of ownership includes both training and the cumulative cost of every inference request the model will ever serve.
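A back-of-the-envelope total-cost comparison, using C ≈ 6ND for training and the common ~2N FLOPs-per-token approximation for inference. The lifetime token counts below are hypothetical, chosen only to illustrate the tradeoff:

```python
def total_cost_flops(n_params, train_tokens, inference_tokens):
    """Lifetime compute: ~6*N*D FLOPs to train, then ~2*N FLOPs
    per token generated over the deployment lifetime."""
    return 6 * n_params * train_tokens + 2 * n_params * inference_tokens

# Hypothetical comparison: a 70B Chinchilla-style model vs an
# over-trained 7B model, each serving 10T tokens over its lifetime.
big = total_cost_flops(70e9, 1.4e12, 10e12)
small = total_cost_flops(7e9, 2.0e12, 10e12)
```

At high serving volumes the inference term dominates, which is exactly why the smaller, over-trained model can win on total cost despite a less efficient training run.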

Key Takeaways

  1. Test loss follows power laws in parameters, data, and compute — with different exponents for each dimension.

  2. Data scaling has a larger exponent than parameter scaling — adding high-quality data is often more impactful than adding more parameters.

  3. The Chinchilla rule says to balance parameters and data equally — roughly 20 tokens per parameter for compute-optimal training.

  4. Optimal training strategy depends on deployment — training-optimal and inference-optimal allocations are different when serving costs matter.

  5. Scaling laws are predictive but not prescriptive — they guide resource allocation but cannot predict emergent abilities, data quality effects, or architectural breakthroughs.
