Neural Scaling Laws

Explore neural scaling laws in deep learning: power law relationships between model size, data, and compute that predict AI performance, with interactive visualizations.


Neural Scaling Laws: The Mathematics of Model Performance

Neural scaling laws are empirical power law relationships that describe how model performance improves with increased scale — whether in parameters, data, or compute. These laws have become fundamental to understanding and predicting AI progress, guiding multi-million dollar training decisions and revealing the path toward more capable systems.

The discovery that simple mathematical relationships govern complex emergent behaviors has transformed model development from trial-and-error into principled engineering. If you know the exponents, you can predict the loss before training a single step.

The Recipe Scaling Analogy

Imagine scaling up a recipe from a home kitchen to a restaurant. Doubling the flour does not double the quality of the bread — you also need to scale water, yeast, and oven time in the right proportions. Neural scaling works the same way: parameters, data, and compute must grow together in specific ratios, and getting these ratios wrong wastes resources without improving results.

The Scaling Recipe

Scaling a neural network is like perfecting a recipe. You need a skilled chef (model size), quality ingredients (training data), and enough cooking time (compute). Scaling just one dimension hits diminishing returns fast -- the secret is balanced scaling.

[Interactive widget: sliders scale chef skill (model size), ingredients (data), and cook time (compute) from 1x to 100x, with readouts for output quality, efficiency, and the current bottleneck. When all three are balanced, the configuration sits on the compute-optimal frontier.]

Key insight: Neural scaling laws follow power laws with diminishing returns. Doubling model size alone gives a small improvement, but scaling model size, data, and compute together gives compounding gains. The Chinchilla paper showed that for a fixed compute budget, the optimal strategy scales parameters and training tokens at roughly equal rates.

Power Law Relationships

At the heart of scaling laws lies a remarkably simple mathematical structure. Despite the complexity of neural networks — billions of parameters, trillions of floating-point operations, terabytes of training data — the relationship between scale and performance follows a clean power law in each scaling dimension. For a single variable, the relationship takes this form:

L(x) = a · x^(-α) + L_irreducible

Where x is the scaling variable (parameters, data tokens, or FLOPs), α is the scaling exponent that governs the rate of improvement, a is a constant, and L_irreducible is the theoretical minimum loss — the inherent randomness in the data that no model can eliminate.
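
As a quick illustration, here is a minimal Python sketch of the single-variable law; the constant a and the irreducible loss are placeholder values chosen only to make the numbers concrete, not fitted ones.

```python
# Minimal sketch of L(x) = a * x^(-alpha) + L_irreducible.
# The constants a and L_irreducible are illustrative placeholders, not fitted values.
def loss(x, a=11.5, alpha=0.076, l_irreducible=1.69):
    return a * x ** (-alpha) + l_irreducible

# Each 10x of scale shrinks only the reducible part of the loss by a fixed fraction:
reduction_per_10x = 1 - 10 ** (-0.076)          # ~16% of the reducible loss
for n in (1e9, 1e10, 1e11):
    print(f"x = {n:.0e} -> predicted loss {loss(n):.3f}")
```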

Three Scaling Dimensions

Parameter scaling describes how loss decreases with model size. Kaplan et al. found αN ≈ 0.076, meaning 10x more parameters yields roughly 17% lower loss. Larger models are also more sample-efficient, learning more per token seen.

Data scaling captures how loss improves with dataset size. The exponent αD ≈ 0.095 is larger than the parameter exponent, which means that adding more high-quality data has a stronger impact than adding more parameters — a finding that reshaped the field.

Compute scaling ties the other two together through the relationship C ≈ 6ND, where C is FLOPs, N is parameters, and D is tokens. Given a fixed compute budget, the question becomes how to split it between model size and training data.
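
As a worked example of the C ≈ 6ND approximation, here is a rough estimate using GPT-3's published scale (the 175B parameters and 300B training tokens cited below):

```python
# Back-of-the-envelope training compute via C ~= 6 * N * D,
# using GPT-3's scale (175B parameters, 300B tokens) discussed later in this article.
N = 175e9                # parameters
D = 300e9                # training tokens
C = 6 * N * D            # total training FLOPs
print(f"C ~ {C:.2e} FLOPs")   # roughly 3.2e23 FLOPs
```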

The joint scaling law combines all three dimensions into a single expression:

L(N, D) = (Nc / N)^(αN) + (Dc / D)^(αD) + L_irreducible

Where Nc and Dc are critical constants estimated from experimental data. This equation reveals that parameters and data contribute independently to loss reduction — a model with infinite data but few parameters still hits a wall, and vice versa.
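
A small sketch of this joint form is below; the constants Nc, Dc, and the irreducible term are illustrative placeholders rather than published fits, but the qualitative behavior (a floor set by whichever resource is scarcer) carries over.

```python
# Sketch of the joint scaling law L(N, D) = (Nc/N)^aN + (Dc/D)^aD + L_irreducible.
# All constants are illustrative placeholders, not fitted values.
def joint_loss(N, D, Nc=8.8e13, Dc=5.4e13, aN=0.076, aD=0.095, l_irr=1.69):
    return (Nc / N) ** aN + (Dc / D) ** aD + l_irr

# Infinite data cannot rescue a tiny model, and vice versa:
print(joint_loss(N=1e8,  D=1e15))   # parameter-limited: loss stays high
print(joint_loss(N=1e12, D=1e9))    # data-limited: loss stays high
```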

Interactive Scaling Curves

Adjust the scaling exponents and see how loss curves change across parameters, data, and compute. Notice how small changes in the exponent lead to large differences at scale — a 0.01 shift in alpha can mean billions of dollars at frontier model budgets.

Scaling Curve Explorer

Loss follows a power law: L(x) = a * x^(-alpha) + L_irreducible. On a log-log plot these curves are straight lines that flatten as they approach the irreducible loss floor.

[Interactive widget: a slider sweeps alpha from 0.02 (slow scaling) through 0.076 (Kaplan) to 0.5 (fast scaling), with readouts for the current exponent, the loss improvement per 10x of scale, and the irreducible loss floor of 1.69.]

Why power laws matter: The exponent alpha determines how quickly performance improves with scale. Kaplan et al. found alpha ~ 0.076 for parameters and ~0.095 for data. A larger exponent means faster improvement per order of magnitude, but all curves eventually flatten toward the irreducible loss -- the fundamental limit set by the entropy of natural language.

Chinchilla vs Kaplan: The Compute-Optimal Debate

The biggest controversy in scaling laws centers on how to allocate a fixed compute budget between model size and training data. The answer to this question directly shaped the trajectory of frontier model development.

Kaplan et al. (2020) argued that models should be as large as possible given the budget, even if that means training on relatively little data. This philosophy produced GPT-3: 175B parameters trained on only 300B tokens, a ratio of roughly 1.7:1. The Kaplan prescription suggested Nopt ∝ C^0.73, allocating most of the budget to parameters.

Hoffmann et al. (2022) — the Chinchilla paper — challenged this directly. By training over 400 models ranging from 70M to 16B parameters on different amounts of data, they found that parameters and data should scale equally: Nopt ∝ C^0.5 and Dopt ∝ C^0.5. This yields the "20:1 rule" — the optimal token count is roughly 20 times the parameter count. Their 70B parameter Chinchilla model, trained on 1.4T tokens, outperformed both the 175B parameter GPT-3 and the 280B parameter Gopher while using the same compute budget as Gopher.

The impact was immediate: LLaMA, Mistral, and subsequent model families all adopted data-heavy training strategies. The era of under-trained, parameter-heavy models was over.
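
A back-of-the-envelope version of the Chinchilla allocation, under the C = 6ND approximation and the ~20-tokens-per-parameter rule described above, looks like this (a rough planning sketch, not the paper's exact fitted allocation):

```python
import math

# Split a fixed compute budget C (FLOPs) the Chinchilla way, assuming C = 6*N*D
# and roughly 20 training tokens per parameter.
def chinchilla_split(C, tokens_per_param=20):
    N = math.sqrt(C / (6 * tokens_per_param))   # optimal parameter count
    D = tokens_per_param * N                    # optimal token count
    return N, D

N, D = chinchilla_split(1e24)
print(f"~{N:.2e} parameters, ~{D:.2e} tokens")  # ~9e10 params, ~1.8e12 tokens
```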

Chinchilla vs Kaplan: Optimal Allocation

Given a fixed compute budget, how should you split it between model size and training data? Kaplan (2020) said "go bigger"; Chinchilla (2022) said "scale both equally." The difference in recommendations is massive.

[Interactive widget: a slider sets the compute budget from 10^18 (small) to 10^25 (frontier) FLOPs and shows, side by side, the Kaplan and Chinchilla recommendations for model size, training tokens, and the resulting tokens-per-parameter ratio.]

The Chinchilla lesson: DeepMind showed that most large models were significantly undertrained. For the same compute budget used to train the 280B-parameter Gopher, a 70B model trained on 4x more data (Chinchilla) achieved better performance. The rule of thumb: train on roughly 20 tokens per parameter for compute-optimality.

Compute Budget Planning

In practice, compute budgets are fixed by hardware availability and economic constraints. A research lab with 1,000 GPUs for 3 months has a specific FLOP budget, and the question is how to spend it wisely.

Given a fixed compute budget in FLOPs, how should you design your training run? This planner lets you set a total budget and see how different allocation strategies — Kaplan-style (parameter-heavy), Chinchilla-optimal (balanced), or inference-aware (over-train a smaller model) — affect expected loss and downstream cost. The tradeoff between training efficiency and serving cost is one of the most consequential decisions in modern AI development.

Compute Budget Calculator

Estimate training cost from first principles. Compute = 6 x Parameters x Tokens. Then divide by your hardware throughput to get wall-clock time and cost.

[Interactive widget: sliders set parameter count (1M to 1T), training tokens (1B to 10T), and GPU count (1 to 10K), with readouts for total compute, wall-clock training time, GPU-hours, estimated cost, a cost breakdown across compute, memory, and interconnect, and a check of the token-per-parameter ratio against the Chinchilla-optimal ~20.]

Assumptions: C = 6ND (forward + backward pass), A100 at 312 TFLOPS FP16 with 40% MFU (model FLOPS utilization), $1/GPU-hour cloud pricing. Real costs vary significantly with hardware generation, cluster efficiency, and cloud provider. The factor of 6 comes from roughly 2 FLOPs per parameter per token in the forward pass and 4 in the backward pass.
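
The same arithmetic in code, using the assumptions above plus an assumed cluster size (the GPU count is an illustrative parameter, not something fixed by the calculator):

```python
# Training-cost estimate under the stated assumptions: C = 6*N*D, A100 at 312 TFLOPS
# FP16, 40% MFU, $1 per GPU-hour. The GPU count is an assumed example value.
def training_cost(params, tokens, n_gpus=256, peak_tflops=312, mfu=0.40, usd_per_gpu_hour=1.0):
    flops = 6 * params * tokens
    cluster_flops_per_sec = n_gpus * peak_tflops * 1e12 * mfu
    wall_clock_hours = flops / cluster_flops_per_sec / 3600
    gpu_hours = wall_clock_hours * n_gpus
    return wall_clock_hours, gpu_hours, gpu_hours * usd_per_gpu_hour

hours, gpu_hours, cost = training_cost(params=7e9, tokens=1e12)   # a LLaMA-style 7B on 1T tokens
print(f"~{hours:.0f} h wall-clock, ~{gpu_hours:,.0f} GPU-hours, ~${cost:,.0f}")
```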

Comparing Scaling Strategies

Different scaling strategies suit different objectives, and there is no single "best" approach. Parameter-heavy training reaches the lowest loss per training dollar but produces expensive-to-serve models that may be impractical for production. Balanced Chinchilla scaling optimizes the training-inference tradeoff for moderate deployment volumes. Inference-aware over-training — training a smaller model on far more data than Chinchilla recommends — sacrifices training efficiency to produce models that are cheaper to deploy at scale, which is why LLaMA trained a 7B model on 1T tokens (a 143:1 ratio, far beyond the Chinchilla-optimal 20:1).

Scaling Approaches Compared

Different eras of LLM development have adopted different scaling strategies. The "optimal" approach depends on whether you are optimizing for training cost, inference cost, or final performance.

Kaplan Scaling (2020): scale model size faster than data; bigger models are more sample-efficient.
  • Data/param: ~5 tokens/param
  • Compute efficiency: moderate
  • Training cost: high (large model, less data)
  • Inference cost: very high (oversized model)
  • Models: GPT-3, Gopher, Megatron-Turing

Chinchilla Scaling (2022): scale model and data equally; train on ~20 tokens per parameter.
  • Data/param: ~20 tokens/param
  • Compute efficiency: excellent
  • Training cost: optimal for a given compute budget
  • Inference cost: moderate (smaller model)
  • Models: Chinchilla, LLaMA 1

LLaMA Approach (2023): overtrain smaller models on more data for cheaper inference at deployment.
  • Data/param: ~140+ tokens/param
  • Compute efficiency: moderate
  • Training cost: higher than optimal
  • Inference cost: low (small, capable model)
  • Models: LLaMA 2, LLaMA 3, Mistral

MoE Scaling (2022): scale total parameters but activate only a subset per token via expert routing.
  • Data/param: varies (~20 tokens per active param)
  • Compute efficiency: excellent
  • Training cost: moderate (sparse compute)
  • Inference cost: moderate (memory-heavy)
  • Models: Switch Transformer, Mixtral, GPT-4

Inference-Optimized (2024): heavily overtrain for minimal deployment cost; optimize tokens-per-dollar at inference.
  • Data/param: ~200+ tokens/param
  • Compute efficiency: moderate
  • Training cost: very high
  • Inference cost: very low
  • Models: Gemma, Phi-3, SmolLM
Minimize Training Cost
  • Follow Chinchilla scaling (~20 tokens/param)
  • Optimal compute allocation between model and data
  • Best loss per FLOP spent
Minimize Inference Cost
  • Overtrain a smaller model (LLaMA approach)
  • Trade training compute for smaller deployment size
  • 100-200+ tokens per parameter
Maximize Capability
  • Use MoE for more parameters at fixed compute
  • Scale total knowledge while keeping inference fast
  • Combine with Chinchilla data ratios

Common Pitfalls

1. Ignoring Data Quality

Scaling laws assume clean, deduplicated training data. Repeating data or training on low-quality text changes the effective exponents and can make more compute actively harmful. The power law only holds when each additional token provides genuine new information. In practice, data curation — filtering, deduplication, and quality scoring — often matters more than raw dataset size.

2. Extrapolating Beyond Observed Ranges

Power laws are empirical fits to observed data. Extrapolating three orders of magnitude beyond your largest experiment is risky — the exponents may shift, new phenomena like emergent abilities may appear, or hardware bottlenecks may change the effective scaling. Always validate scaling predictions with intermediate checkpoints before committing to a full frontier training run.
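
In practice this validation usually means fitting the exponent on small runs and checking the prediction at the next scale before trusting it further out. A sketch with made-up data points:

```python
import numpy as np

# Fit a power-law exponent in log-log space from small-scale runs, then extrapolate.
# The data points and the assumed irreducible loss are made up for illustration.
params = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
losses = np.array([4.20, 3.95, 3.72, 3.51, 3.32])
l_irr = 1.69

slope, intercept = np.polyfit(np.log10(params), np.log10(losses - l_irr), 1)
print(f"fitted alpha ~ {-slope:.3f}")

# Extrapolating three orders of magnitude assumes the exponent stays fixed --
# check it against an intermediate checkpoint before betting a training run on it.
predicted = 10 ** (intercept + slope * np.log10(1e12)) + l_irr
print(f"extrapolated loss at 1e12 params ~ {predicted:.2f}")
```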

3. Forgetting Inference Costs

Chinchilla-optimal training produces the best model for the training budget, but not necessarily the cheapest model to deploy. If the model will serve billions of queries, over-training a smaller model (like LLaMA) can be more economical overall, even if training itself is suboptimal. The total cost of ownership includes both training and the cumulative cost of every inference request the model will ever serve.
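
A rough total-cost-of-ownership comparison makes the point. All prices and volumes below are illustrative assumptions (the $/FLOP figure is derived from the $1/GPU-hour A100 numbers used earlier), not measured costs:

```python
# Compare lifetime cost of two models trained with the SAME compute budget:
# a roughly Chinchilla-optimal ~19B model vs an over-trained 7B (LLaMA-style).
# Serving is approximated as ~2*N FLOPs per generated token; all figures are illustrative.
USD_PER_FLOP = 2.2e-18          # ~ $1 / (312 TFLOPS * 40% MFU * 3600 s)

def lifetime_cost(params, train_tokens, served_tokens):
    train_flops = 6 * params * train_tokens
    serve_flops = 2 * params * served_tokens
    return (train_flops + serve_flops) * USD_PER_FLOP

served = 1e14                   # assumed lifetime inference volume in tokens
print(f"~19B Chinchilla-optimal: ${lifetime_cost(18.7e9, 374e9, served):,.0f}")
print(f"  7B over-trained (1T):  ${lifetime_cost(7e9, 1e12, served):,.0f}")
```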

Key Takeaways

  1. Test loss follows power laws in parameters, data, and compute — with different exponents for each dimension.

  2. Data scaling has a larger exponent than parameter scaling — adding high-quality data is often more impactful than adding more parameters.

  3. The Chinchilla rule says to balance parameters and data equally — roughly 20 tokens per parameter for compute-optimal training.

  4. Optimal training strategy depends on deployment — training-optimal and inference-optimal allocations are different when serving costs matter.

  5. Scaling laws are predictive but not prescriptive — they guide resource allocation but cannot predict emergent abilities, data quality effects, or architectural breakthroughs.
