
Emergent Abilities in Large Language Models

Explore emergent abilities in large language models: sudden capabilities at scale thresholds, phase transitions, and the mirage debate.


Emergent Abilities: When AI Suddenly "Gets It"

Emergent abilities are capabilities that appear suddenly and unpredictably in large language models as they cross certain scale thresholds. Below these thresholds, performance is essentially random — the model shows no sign of understanding the task. Above them, the model exhibits qualitatively new behavior. This is not a gradual improvement but an abrupt jump, often appearing within a narrow range of model sizes.

This phenomenon fundamentally challenges how we predict AI progress. You cannot forecast when a model will learn to do multi-step arithmetic or chain-of-thought reasoning just by watching smaller models fail at it. The inability to predict emergence from smaller-scale experiments makes it one of the most consequential — and controversial — phenomena in modern AI research.

The Phase Transition Analogy

Think about heating a block of ice. From -20°C to -1°C, the ice gets warmer but remains solid — nothing visibly changes. Then at 0°C, the ice suddenly becomes water. The underlying physics was always continuous (molecular kinetic energy increases smoothly), but the macroscopic property — solid versus liquid — changes abruptly at a critical threshold.

Emergent abilities in language models work the same way. Internal representations and token-level predictions improve smoothly with scale, but task-level performance — whether the model can actually solve a multi-step problem correctly — can jump discontinuously when some internal capacity threshold is crossed.

Mathematical Framework

To reason precisely about emergence, we need a mathematical model that captures the sharp transition from "cannot do this at all" to "does this reliably." The probability of a model exhibiting an emergent ability can be modeled as a sigmoid function of the log of model size:

P_\text{emerge}(N) = \frac{1}{1 + e^{-k(\log N - \log N_c)}}

Where N is the number of parameters, N_c is the critical threshold where emergence occurs, and k controls the sharpness of the transition. When k is large, the transition is nearly discontinuous — the model goes from zero capability to full capability within a narrow parameter range.
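This sigmoid can be sketched in a few lines. The threshold N_c = 100B and sharpness k = 10 below are illustrative choices, not fitted values:

```python
import math

def p_emerge(n_params: float, n_critical: float, k: float) -> float:
    """Sigmoid emergence probability in log-parameter space.

    n_params:   model size N (parameters)
    n_critical: critical threshold N_c, where P = 0.5
    k:          transition sharpness (larger = more step-like)
    """
    return 1.0 / (1.0 + math.exp(-k * (math.log10(n_params) - math.log10(n_critical))))

# Illustrative values only: N_c = 1e11 (100B), k = 10
for n in (1e9, 1e10, 1e11, 1e12):
    print(f"N = {n:.0e}: P = {p_emerge(n, 1e11, 10):.3f}")
```

With k = 10, the probability moves from under 1% at 1B parameters to over 99% at 1T, with the whole transition packed into roughly one order of magnitude around N_c.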

Information-Theoretic View

Emergence can also be understood through an information lens. A model exhibits an ability when its internal capacity exceeds the minimum information required by the task:

C_\text{model}(N) > H(\text{task}) + \varepsilon

Where C_\text{model} grows with parameters, H(\text{task}) is the inherent complexity of the task, and \varepsilon is a margin needed for robust generalization. Simple tasks have low H and emerge early; complex multi-step reasoning has high H and requires much larger models. This framework explains why abilities emerge in a consistent order — they are ranked by their information-theoretic complexity.
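A minimal sketch of this capacity test. The capacity function, the constant alpha, and the task entropies are all hypothetical placeholders chosen only to reproduce the ordering, not measured quantities:

```python
import math

def has_ability(n_params: float, task_entropy: float,
                margin: float = 2.0, alpha: float = 3.0) -> bool:
    """True when C_model(N) > H(task) + eps.

    Hypothetical capacity model: effective capacity grows with the
    log of parameter count (alpha * log10 N, in the same arbitrary
    units as the task entropies below).
    """
    capacity = alpha * math.log10(n_params)
    return capacity > task_entropy + margin

tasks = {                       # illustrative H(task) values
    "pattern completion": 20.0,
    "few-shot learning": 26.0,
    "chain-of-thought": 32.0,
}

for n in (1e9, 1e12):
    unlocked = [t for t, h in tasks.items() if has_ability(n, h)]
    print(f"N = {n:.0e}: {unlocked}")
```

Because capacity is monotone in N and the tasks are ranked by H, abilities can only unlock in entropy order as the model grows, which is the ordering claim above.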

Emergence Thresholds

The critical parameter count N_c varies enormously across tasks. Simple pattern completion emerges around 1B parameters. Few-shot learning requires roughly 10B. Chain-of-thought reasoning appears around 50-100B. Theory of mind and self-correction may require 500B or more. The transition sharpness k also varies — some abilities emerge over an order of magnitude in model size, while others switch on within a factor of 2-3.
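Treating these thresholds as hard cutoffs (a simplification; real transitions have finite width, and "50-100B" is represented here as 7e10), a lookup might read:

```python
# Approximate thresholds quoted above (illustrative figures, not measurements)
THRESHOLDS = {
    "simple pattern completion": 1e9,
    "few-shot learning": 1e10,
    "chain-of-thought reasoning": 7e10,
    "theory of mind / self-correction": 5e11,
}

def active_abilities(n_params: float) -> list:
    """Abilities whose threshold N_c has been crossed, as a hard cutoff."""
    return [name for name, n_c in THRESHOLDS.items() if n_params >= n_c]

# A 100B model clears the first three thresholds but not theory of mind
print(active_abilities(1e11))
```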

Adjust the model scale below to see which abilities are active and which remain dormant. Notice how some abilities have sharp thresholds while others transition more gradually.

Unlocking Abilities at Scale

This visualization lets you watch abilities unlock one by one as you scale a model from millions to trillions of parameters. Simple pattern matching appears first, followed by few-shot learning, then chain-of-thought reasoning, and finally abstract capabilities like theory of mind and self-correction. Each ability appears at a specific scale, and the order is remarkably consistent across different model families — GPT, PaLM, LLaMA, and Chinchilla all unlock the same abilities in roughly the same sequence, even though their architectures and training data differ.

The Mirage Debate: Are Emergent Abilities Real?

This is perhaps the most heated debate in modern AI research. Not everyone agrees that emergent abilities represent genuine phase transitions. Schaeffer et al. (2023) proposed the "mirage hypothesis" — arguing that apparent emergence is an artifact of the evaluation metric, not a property of the model.

The argument hinges on nonlinear metrics. Consider exact-match accuracy for arithmetic: a model that gets 49 out of 50 digits correct scores 0 (wrong answer), while one that gets all 50 correct scores 1 (right answer). The underlying capability — digit-level accuracy — may improve smoothly with scale, but the binary metric creates an illusion of sudden emergence. When researchers replaced exact-match with token-level accuracy, many "emergent" abilities showed smooth, predictable improvement curves instead.
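The arithmetic example can be checked directly: if each digit is independently correct with probability p, the exact-match score on a 50-digit answer is p**50, so a smooth rise in p produces a cliff in the aggregate metric:

```python
# Smooth per-digit accuracy p vs. the binary exact-match metric p**d:
# every one of the d digits must be correct to score at all.
d = 50  # digits that must all be correct

for p in (0.90, 0.95, 0.99, 0.999):
    print(f"per-digit accuracy {p:.3f} -> exact-match {p**d:.4f}")
```

A per-digit accuracy of 0.90 yields an exact-match score under 1%, while 0.999 yields over 95% — the "emergence" sits entirely in the last few points of a smooth per-digit curve.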

The counterargument is equally compelling. Some abilities — particularly chain-of-thought reasoning and in-context learning — appear genuinely discontinuous even under continuous metrics. A model either spontaneously generates intermediate reasoning steps or it does not; there is no "partial" chain-of-thought. The debate remains open, and the answer likely varies by task: some emergent abilities are real phase transitions while others are metric artifacts.

Documented Emergent Abilities

Researchers have cataloged over 130 tasks that show emergent behavior across different model families. The most well-documented include three-digit arithmetic in GPT-3, multi-step reasoning in PaLM, joke explanation in Chinchilla, and theory of mind in GPT-4.

Different model families exhibit emergence at different absolute scales, but the relative ordering of abilities is surprisingly consistent. This table summarizes documented emergent abilities across GPT, PaLM, and other model families, including the approximate parameter threshold and whether the emergence survives metric correction.

Catalog of Emergent Abilities

A comparison of documented emergent abilities, their approximate emergence scales, and the level of scientific consensus around each claim.

  • Arithmetic -- ~10B -- GPT-3 (Brown et al., 2020)
    Task: formal reasoning. Controversy: high. Evidence: moderate.
    Multi-digit arithmetic appears suddenly with scale. Debated: may be a metric artifact.

  • Chain-of-Thought -- ~100B -- PaLM (Wei et al., 2022)
    Task: reasoning strategy. Controversy: medium. Evidence: strong.
    Step-by-step prompting unlocks at scale. Widely reproduced.

  • Code Generation -- ~50B -- Codex (Chen et al., 2021)
    Task: program synthesis. Controversy: low. Evidence: strong.
    Functional code from natural language. Clear scaling trend.

  • Translation -- ~1B -- GPT-2 (Radford et al., 2019)
    Task: cross-lingual transfer. Controversy: low. Evidence: strong.
    Zero-shot translation emerges early. Not controversial.

  • Theory of Mind -- ~500B -- GPT-4 (Kosinski, 2023)
    Task: social reasoning. Controversy: very high. Evidence: weak.
    Highly debated. May reflect pattern matching rather than true understanding.

  • Tool Use -- ~100B -- Toolformer (Schick et al., 2023)
    Task: API interaction. Controversy: medium. Evidence: moderate.
    Learning to use external tools (calculators, search). Requires sufficient context length.

Reliably documented:
  • Chain-of-Thought -- ~100B, strong evidence across multiple studies
  • Code Generation -- ~50B, strong evidence across multiple studies
  • Translation -- ~1B, strong evidence across multiple studies
Debated / contested:
  • Arithmetic -- high controversy, moderate evidence
  • Theory of Mind -- very high controversy, weak evidence
  • Tool Use -- medium controversy, moderate evidence

Common Pitfalls

1. Confusing Metric Artifacts with True Emergence

Using discontinuous evaluation metrics (exact match, pass/fail) can make smooth improvements look like sudden jumps. Always check whether the apparent emergence persists under continuous metrics like token-level accuracy or partial credit scoring before concluding that a true phase transition occurred.

2. Assuming Emergence Is Predictable

The fact that past abilities emerged at specific scales does not mean we can predict future ones. The threshold for a new ability depends on task complexity, training data distribution, and architecture — factors that interact in ways we do not yet fully understand. Planning for emergence requires preparing for surprises, not just extrapolating from history.

3. Overweighting Parameter Count

Parameters alone do not determine emergence. Training data quality, compute budget, and architectural choices all shift the thresholds. A 70B model trained on 1.4T high-quality tokens (Chinchilla) can outperform a 175B model trained on 300B tokens (GPT-3) on many emergent benchmarks. Effective scale is a function of all three factors, not just parameter count.
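Using the common C ≈ 6ND FLOPs heuristic for training compute (a rough estimate; the heuristic itself is an assumption, not from the text), the Chinchilla/GPT-3 comparison works out as:

```python
# Rough training-compute heuristic: C ~= 6 * N * D FLOPs,
# where N is parameter count and D is training tokens.
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

chinchilla = train_flops(70e9, 1.4e12)  # 70B params, 1.4T tokens
gpt3 = train_flops(175e9, 300e9)        # 175B params, 300B tokens

print(f"Chinchilla: {chinchilla:.2e} FLOPs")
print(f"GPT-3:      {gpt3:.2e} FLOPs")
```

Despite having 2.5x fewer parameters, Chinchilla ends up with nearly twice the training compute under this estimate, which is one concrete sense in which "effective scale" exceeds parameter count.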

Key Takeaways

  1. Emergent abilities are capabilities that appear abruptly at specific scale thresholds, not through gradual improvement.

  2. The transition resembles a phase change — internal representations improve smoothly, but task performance jumps discontinuously at a critical point.

  3. Some emergence may be a metric artifact — the mirage hypothesis shows that discontinuous evaluation metrics can create the illusion of sudden capability jumps.

  4. The order of emergence is consistent across model families — simple abilities appear first, complex reasoning appears last, regardless of architecture.

  5. Emergence makes AI progress hard to predict — we cannot reliably forecast what capabilities will appear at the next order of magnitude in scale.

Related Concepts

  • Neural Scaling Laws — The power law relationships that govern performance between emergent jumps
  • Prompt Engineering — Techniques that can elicit latent abilities below the emergence threshold
  • Cross-Entropy Loss — The training objective whose smooth decrease belies discontinuous downstream performance
  • Dropout — Regularization that affects effective model capacity and may shift emergence thresholds
  • Gradient Flow — Training dynamics that determine whether large models learn efficiently enough to exhibit emergence
