Neural Scaling Laws: The Mathematics of Model Performance
Neural scaling laws are empirical power law relationships that describe how model performance improves with increased scale — whether in parameters, data, or compute. These laws have become fundamental to understanding and predicting AI progress, guiding multi-million dollar training decisions and revealing the path toward more capable systems.
The discovery that simple mathematical relationships govern complex emergent behaviors has transformed model development from trial-and-error into principled engineering. If you know the exponents, you can predict the loss before training a single step.
The Recipe Scaling Analogy
Imagine scaling up a recipe from a home kitchen to a restaurant. Doubling the flour does not double the quality of the bread — you also need to scale water, yeast, and oven time in the right proportions. Neural scaling works the same way: parameters, data, and compute must grow together in specific ratios, and getting these ratios wrong wastes resources without improving results.
The Scaling Recipe
Scaling a neural network is like perfecting a recipe. You need a skilled chef (model size), quality ingredients (training data), and enough cooking time (compute). Scaling just one dimension hits diminishing returns fast -- the secret is balanced scaling.
Key insight: Neural scaling laws follow power laws with diminishing returns. Doubling model size alone gives a small improvement, but scaling model size, data, and compute together gives compounding gains. The Chinchilla paper showed that for a fixed compute budget, the optimal strategy scales parameters and training tokens at roughly equal rates.
Power Law Relationships
At the heart of scaling laws lies a remarkably simple mathematical structure. Despite the complexity of neural networks — billions of parameters, trillions of floating-point operations, terabytes of training data — the relationship between scale and performance follows a clean power law in each scaling dimension. For a single variable, the relationship takes this form:
L(x) = a · x^(-α) + L_irreducible

Where x is the scaling variable (parameters, data tokens, or FLOPs), α is the scaling exponent that governs the rate of improvement, a is a constant, and L_irreducible is the theoretical minimum loss — the inherent randomness in the data that no model can eliminate.
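As a quick illustration, here is the single-variable form as a small Python function; the constants a, α, and the irreducible floor below are illustrative values, not fitted ones.

```python
# A minimal sketch of the single-variable power law; a, alpha, and the
# irreducible floor below are illustrative values, not fitted constants.
def power_law_loss(x: float, a: float, alpha: float, l_irreducible: float) -> float:
    """L(x) = a * x^(-alpha) + L_irreducible."""
    return a * x ** (-alpha) + l_irreducible

# Doubling scale only shrinks the reducible part of the loss.
print(power_law_loss(1e9, a=10.0, alpha=0.076, l_irreducible=1.7))  # ~3.77
print(power_law_loss(2e9, a=10.0, alpha=0.076, l_irreducible=1.7))  # ~3.66
```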
Three Scaling Dimensions
Parameter scaling describes how loss decreases with model size. Kaplan et al. found α_N ≈ 0.076, meaning 10x more parameters yields roughly 16% lower reducible loss. Larger models are also more sample-efficient, learning more per token seen.
Data scaling captures how loss improves with dataset size. The exponent α_D ≈ 0.095 is larger than the parameter exponent, which means that adding more high-quality data has a stronger impact than adding more parameters — a finding that reshaped the field.
Compute scaling ties the other two together through the relationship C ≈ 6ND, where C is FLOPs, N is parameters, and D is tokens. Given a fixed compute budget, the question becomes how to split it between model size and training data.
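To make these numbers concrete, the sketch below works through the exponent arithmetic and the C ≈ 6ND approximation; the 70B-parameter, 1.4T-token example at the end is an illustrative, Chinchilla-sized run.

```python
# Back-of-the-envelope arithmetic behind the exponents quoted above; the
# 70B / 1.4T example at the end is illustrative (a Chinchilla-sized run).
ALPHA_N, ALPHA_D = 0.076, 0.095   # Kaplan et al. parameter and data exponents

# Reducible loss shrinks by x^(-alpha) when scale grows by a factor of x.
print(f"10x parameters -> reducible loss x {10 ** -ALPHA_N:.3f}")  # ~0.84 (~16% lower)
print(f"10x data       -> reducible loss x {10 ** -ALPHA_D:.3f}")  # ~0.80 (~20% lower)
print(f"2x parameters  -> reducible loss x {2 ** -ALPHA_N:.3f}")   # ~0.95 (~5% lower)

# Compute ties the two together: C ~= 6 * N * D (FLOPs).
N, D = 70e9, 1.4e12
print(f"Training compute ~= {6 * N * D:.2e} FLOPs")                # ~5.9e23
```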
The joint scaling law combines all three dimensions into a single expression:
L(N, D) = [(N_c / N)^(α_N / α_D) + D_c / D]^(α_D)

Where N_c and D_c are critical constants estimated from experimental data. This equation reveals that parameters and data each impose their own bottleneck on loss reduction — a model with infinite data but few parameters still hits a wall, and vice versa.
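A minimal sketch of the joint law follows, assuming the Kaplan-style functional form; the constants N_c and D_c are placeholders of roughly the right magnitude, not the paper's fitted values.

```python
# A sketch of the joint scaling law L(N, D); n_c and d_c are illustrative
# placeholders (order of magnitude only), not the published fitted constants.
def joint_loss(n_params: float, n_tokens: float,
               n_c: float = 8.8e13, d_c: float = 5.4e13,
               alpha_n: float = 0.076, alpha_d: float = 0.095) -> float:
    """L(N, D) = [(N_c / N)^(alpha_N / alpha_D) + D_c / D]^alpha_D."""
    return ((n_c / n_params) ** (alpha_n / alpha_d) + d_c / n_tokens) ** alpha_d

# Either bottleneck alone caps the achievable loss.
print(joint_loss(1e9, 1e15))   # data-rich but parameter-limited
print(joint_loss(1e15, 1e9))   # parameter-rich but data-limited
```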
Interactive Scaling Curves
Adjust the scaling exponents and see how loss curves change across parameters, data, and compute. Notice how small changes in the exponent lead to large differences at scale — a 0.01 shift in alpha can mean billions of dollars at frontier model budgets.
Scaling Curve Explorer
Loss follows a power law: L(x) = a * x^(-alpha) + L_irreducible. On a log-log plot these curves are straight lines that flatten as they approach the irreducible loss floor.
Why power laws matter: The exponent alpha determines how quickly performance improves with scale. Kaplan et al. found alpha ~ 0.076 for parameters and ~0.095 for data. A larger exponent means faster improvement per order of magnitude, but all curves eventually flatten toward the irreducible loss -- the fundamental limit set by the entropy of natural language.
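For readers without the interactive widget, a rough offline equivalent is sketched below; all constants are illustrative rather than fitted values.

```python
# A rough offline equivalent of the explorer described above; all constants
# are illustrative, not fitted values.
import numpy as np
import matplotlib.pyplot as plt

x = np.logspace(6, 12, 200)      # scale from 1M to 1T
l_irr = 1.7                      # assumed irreducible loss floor

for alpha, label in [(0.076, "parameters (alpha = 0.076)"),
                     (0.095, "data (alpha = 0.095)")]:
    plt.loglog(x, 10.0 * x ** (-alpha) + l_irr, label=label)

plt.axhline(l_irr, linestyle="--", color="gray", label="irreducible loss")
plt.xlabel("scale (parameters or tokens)")
plt.ylabel("test loss")
plt.legend()
plt.show()
```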
Chinchilla vs Kaplan: The Compute-Optimal Debate
The biggest controversy in scaling laws centers on how to allocate a fixed compute budget between model size and training data. The answer to this question directly shaped the trajectory of frontier model development.
Kaplan et al. (2020) argued that models should be as large as possible given the budget, even if that means training on relatively little data. This philosophy produced GPT-3: 175B parameters trained on only 300B tokens, roughly 1.7 tokens per parameter. The Kaplan prescription suggested N_opt ∝ C^0.73, allocating most of the budget to parameters.
Hoffmann et al. (2022) — the Chinchilla paper — challenged this directly. By training over 400 models ranging from 70M to 16B parameters on different amounts of data, they found that parameters and data should scale equally: N_opt ∝ C^0.5 and D_opt ∝ C^0.5. This yields the "20:1 rule" — the optimal token count is roughly 20 times the parameter count. Their 70B-parameter Chinchilla model, trained on 1.4T tokens, outperformed both the 175B-parameter GPT-3 and the 280B-parameter Gopher while using the same compute budget as Gopher.
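A small sketch of the Chinchilla allocation rule, derived by combining C ≈ 6ND with the ~20 tokens-per-parameter heuristic; the compute budget used is an illustrative, roughly Chinchilla-scale number, and the Kaplan rule is omitted because N_opt ∝ C^0.73 needs a fitted coefficient not given here.

```python
# Chinchilla-style allocation for a fixed FLOP budget: with C = 6 * N * D and
# D = 20 * N, the optimal model size is N = sqrt(C / 120). The budget below
# is an illustrative, roughly Chinchilla-scale number.
def chinchilla_allocation(c_flops: float, tokens_per_param: float = 20.0):
    n_opt = (c_flops / (6.0 * tokens_per_param)) ** 0.5
    return n_opt, tokens_per_param * n_opt

n_opt, d_opt = chinchilla_allocation(5.9e23)
print(f"N_opt ~ {n_opt:.2e} params, D_opt ~ {d_opt:.2e} tokens")  # ~7e10, ~1.4e12
```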
The impact was immediate: LLaMA, Mistral, and subsequent model families all adopted data-heavy training strategies. The era of under-trained, parameter-heavy models was over.
Chinchilla vs Kaplan: Optimal Allocation
Given a fixed compute budget, how should you split it between model size and training data? Kaplan (2020) said "go bigger"; Chinchilla (2022) said "scale both equally." The difference in recommendations is massive.
The Chinchilla lesson: DeepMind showed that most large models were significantly undertrained. For the same compute budget used to train the 280B-parameter Gopher, a 70B model trained on 4x more data (Chinchilla) achieved better performance. The rule of thumb: train on roughly 20 tokens per parameter for compute-optimality.
Compute Budget Planning
In practice, compute budgets are fixed by hardware availability and economic constraints. A research lab with 1,000 GPUs for 3 months has a specific FLOP budget, and the question is how to spend it wisely.
Given a fixed compute budget in FLOPs, how should you design your training run? This planner lets you set a total budget and see how different allocation strategies — Kaplan-style (parameter-heavy), Chinchilla-optimal (balanced), or inference-aware (over-train a smaller model) — affect expected loss and downstream cost. The tradeoff between training efficiency and serving cost is one of the most consequential decisions in modern AI development.
Compute Budget Calculator
Estimate training cost from first principles. Compute = 6 x Parameters x Tokens. Then divide by your hardware throughput to get wall-clock time and cost.
Assumptions: C = 6ND (forward + backward pass), A100 at 312 TFLOPS FP16 with 40% MFU (model FLOPS utilization), $1/GPU-hour cloud pricing. Real costs vary significantly with hardware generation, cluster efficiency, and cloud provider. The factor of 6 comes from roughly 2 FLOPs per parameter per token in the forward pass and 4 in the backward pass.
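A rough Python version of the calculator under exactly these assumptions; the 1,024-GPU cluster size is an added assumption for illustration.

```python
# A rough version of the calculator under the stated assumptions: C = 6*N*D,
# A100 peak 312 TFLOPS, 40% MFU, $1/GPU-hour. The 1,024-GPU cluster size is
# an added assumption for illustration.
def training_cost(n_params: float, n_tokens: float, n_gpus: int = 1024,
                  peak_flops: float = 312e12, mfu: float = 0.40,
                  dollars_per_gpu_hour: float = 1.0):
    total_flops = 6.0 * n_params * n_tokens
    cluster_flops_per_sec = peak_flops * mfu * n_gpus
    hours = total_flops / cluster_flops_per_sec / 3600.0
    cost = hours * n_gpus * dollars_per_gpu_hour
    return total_flops, hours, cost

flops, hours, cost = training_cost(70e9, 1.4e12)
print(f"{flops:.2e} FLOPs, ~{hours:.0f} h wall-clock on 1024 GPUs, ~${cost:,.0f}")
```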
Comparing Scaling Strategies
Different scaling strategies suit different objectives, and there is no single "best" approach. Parameter-heavy training reaches the lowest loss per training dollar but produces expensive-to-serve models that may be impractical for production. Balanced Chinchilla scaling optimizes the training-inference tradeoff for moderate deployment volumes. Inference-aware over-training — training a smaller model on far more data than Chinchilla recommends — sacrifices training efficiency to produce models that are cheaper to deploy at scale, which is why LLaMA trained a 7B model on 1T tokens (a 143:1 ratio, far beyond the Chinchilla-optimal 20:1).
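The sketch below compares lifetime compute for a Chinchilla-style 70B model and a LLaMA-style over-trained 7B model; the ~2N FLOPs-per-generated-token inference estimate and the lifetime serving volume are assumptions rather than figures from this article, and the two models are of course not equally capable.

```python
# Lifetime compute = training + serving. Assumes ~2*N FLOPs per generated
# token at inference and an illustrative serving volume of 1e13 tokens;
# neither number comes from the text, and the models are not equally capable.
def lifetime_flops(n_params: float, train_tokens: float,
                   inference_tokens: float = 1e13) -> float:
    return 6.0 * n_params * train_tokens + 2.0 * n_params * inference_tokens

print(f"70B, 1.4T tokens (Chinchilla-style): {lifetime_flops(70e9, 1.4e12):.2e} FLOPs")
print(f" 7B, 1.0T tokens (over-trained):     {lifetime_flops(7e9, 1.0e12):.2e} FLOPs")
```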
Scaling Approaches Compared
Different eras of LLM development have adopted different scaling strategies. The "optimal" approach depends on whether you are optimizing for training cost, inference cost, or final performance.
| Approach | Principle | Data/Param Ratio | Compute Eff. | Training Cost | Inference Cost | Notable Models |
|---|---|---|---|---|---|---|
| Kaplan Scaling (2020) | Scale model size faster than data; bigger models are more sample-efficient | ~5 tokens/param | Moderate | High (large model, less data) | Very high (oversized model) | GPT-3, Gopher, Megatron-Turing |
| Chinchilla Scaling (2022) | Scale model and data equally; train on ~20 tokens per parameter | ~20 tokens/param | Excellent | Optimal for given compute | Moderate (smaller model) | Chinchilla, LLaMA 1 |
| LLaMA Approach (2023) | Overtrain smaller models on more data for cheaper inference at deployment | ~140+ tokens/param | Moderate | Higher than optimal | Low (small, capable model) | LLaMA 2, LLaMA 3, Mistral |
| MoE Scaling (2022) | Scale total parameters but activate only a subset per token via expert routing | Varies (~20 tokens/active param) | Excellent | Moderate (sparse compute) | Moderate (memory-heavy) | Switch Transformer, Mixtral, GPT-4 |
| Inference-Optimized (2024) | Heavily overtrain for minimal deployment cost; optimize tokens-per-dollar at inference | ~200+ tokens/param | Moderate | Very high | Very low | Gemma, Phi-3, SmolLM |
- Follow Chinchilla scaling (~20 tokens/param): optimal compute allocation between model and data, and the best loss per FLOP spent.
- Overtrain a smaller model (the LLaMA approach): trade training compute for smaller deployment size, using 100-200+ tokens per parameter.
- Use MoE for more parameters at fixed compute: scale total knowledge while keeping inference fast, and combine with Chinchilla data ratios.
Common Pitfalls
1. Ignoring Data Quality
Scaling laws assume clean, deduplicated training data. Repeating data or training on low-quality text changes the effective exponents and can make more compute actively harmful. The power law only holds when each additional token provides genuine new information. In practice, data curation — filtering, deduplication, and quality scoring — often matters more than raw dataset size.
2. Extrapolating Beyond Observed Ranges
Power laws are empirical fits to observed data. Extrapolating three orders of magnitude beyond your largest experiment is risky — the exponents may shift, new phenomena like emergent abilities may appear, or hardware bottlenecks may change the effective scaling. Always validate scaling predictions with intermediate checkpoints before committing to a full frontier training run.
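One hedged way to do that validation is to fit the power law to pilot-run losses and check how the fit extrapolates; in the sketch below the "observed" losses are synthetic, generated from known constants purely to show the mechanics.

```python
# Fit the power law to pilot-run losses, then extrapolate one order of
# magnitude. The "observed" losses are synthetic, generated from known
# constants purely to show the mechanics of the fit.
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, alpha, l_irr):
    return a * x ** (-alpha) + l_irr

rng = np.random.default_rng(0)
sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9])                  # small pilot runs
losses = power_law(sizes, 8.0, 0.08, 1.7) + rng.normal(0, 0.01, sizes.size)

(a, alpha, l_irr), _ = curve_fit(power_law, sizes, losses, p0=[10.0, 0.1, 1.5])
print(f"fitted alpha = {alpha:.3f}")
print(f"predicted loss at 1e10 params: {power_law(1e10, a, alpha, l_irr):.3f}")
```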
3. Forgetting Inference Costs
Chinchilla-optimal training produces the best model for the training budget, but not necessarily the cheapest model to deploy. If the model will serve billions of queries, over-training a smaller model (like LLaMA) can be more economical overall, even if training itself is suboptimal. The total cost of ownership includes both training and the cumulative cost of every inference request the model will ever serve.
Key Takeaways
- Test loss follows power laws in parameters, data, and compute — with different exponents for each dimension.
- Data scaling has a larger exponent than parameter scaling — adding high-quality data is often more impactful than adding more parameters.
- The Chinchilla rule says to balance parameters and data equally — roughly 20 tokens per parameter for compute-optimal training.
- Optimal training strategy depends on deployment — training-optimal and inference-optimal allocations are different when serving costs matter.
- Scaling laws are predictive but not prescriptive — they guide resource allocation but cannot predict emergent abilities, data quality effects, or architectural breakthroughs.
Related Concepts
- Emergent Abilities — Sudden capabilities that appear at scale, where power laws break down
- Prompt Engineering — How prompting techniques interact with model scale
- Cross-Entropy Loss — The loss function that scaling laws typically measure
- Xavier Initialization — Weight initialization that enables stable training at scale
- He Initialization — Initialization for ReLU networks at scale
