BERT: Pre-training of Deep Bidirectional Transformers

Jacob Devlin; Ming-Wei Chang; Kenton Lee; Kristina Toutanova

TL;DR

BERT pre-trains a 12- or 24-layer Transformer encoder on two self-supervised objectives: masked language modeling (predict randomly masked tokens) and next-sentence prediction (predict whether the second of two segments actually follows the first). Pre-training uses only unlabeled text from BooksCorpus and English Wikipedia.
The key insight is bidirectional context: unlike GPT, every token attends to every other token in both directions, producing richer contextualized representations.
After pre-training, a small task-specific head is added and the whole model is fine-tuned on labeled data for minutes to hours, yielding state-of-the-art results at publication on eleven NLP tasks, including the GLUE benchmark and SQuAD.
BERT established the pretrain-then-finetune paradigm as the dominant approach in NLP, and the encoder-only architecture it pioneered remains foundational to this day.

Masked language modeling

Traditional language models predict the next token given all previous tokens, which imposes a left-to-right constraint. BERT breaks this constraint with masked language modeling (MLM): approximately 15% of WordPiece tokens are randomly selected and, in most cases, replaced with a special [MASK] token, and the model is trained to reconstruct the originals from the surrounding context. Because the encoder sees the entire sequence simultaneously, both left and right context inform every prediction. This is the mechanism that makes BERT genuinely bidirectional.

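In practice, the 15% of selected tokens are handled as follows: 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged. The model must predict the original token in all three cases. The mixed strategy reduces the mismatch between pre-training and fine-tuning: [MASK] never appears in fine-tuning inputs, so a model that only ever made predictions at [MASK] positions would rely on a signal that is absent downstream.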
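The 80/10/10 rule can be sketched in a few lines of plain Python. This is a minimal illustration, not the original implementation: the function name `mask_tokens`, the `SPECIAL` set, and the choice to select each token independently with probability 0.15 are assumptions made for brevity.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}  # assumed: special tokens are never selected


def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted, labels).

    labels[i] holds the original token at selected positions and None elsewhere,
    so the loss would be computed only where labels[i] is not None.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or rng.random() >= select_prob:
            continue  # position not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels


tokens = ["[CLS]", "my", "dog", "is", "hairy", "[SEP]"]
vocab = ["my", "dog", "is", "hairy", "apple", "runs"]
corrupted, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
```

Note that the label is recorded for all selected positions, including the ones left unchanged; that is what forces the model to keep a useful representation for every input token rather than only for [MASK].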

Bidirectional self-attention

The Transformer is built around self-attention, but its decoder applies a causal mask: each position can attend only to itself and earlier positions. This is necessary for autoregressive generation, but it means the model never sees right-side context when building a token’s representation, which limits how much each representation can encode.

BERT uses a Transformer encoder with no causal mask. Every token attends to every other token in every layer. The full N×N attention matrix is active. This is what “deep bidirectional” means: not just a shallow bidirectional RNN pass, but full cross-position attention at every layer of a deep stack.
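The difference between the two masking regimes is easiest to see on a toy attention matrix. The sketch below is a minimal illustration in plain Python; `attention_weights` is a hypothetical helper, and the uniform scores stand in for the scaled query-key dot products a real model would compute.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores, causal=False):
    """Row i is the attention distribution of position i over all positions.

    With causal=True, positions j > i are set to -inf before the softmax,
    which is the decoder-style mask; causal=False is the encoder-style case.
    """
    n = len(scores)
    out = []
    for i in range(n):
        row = [
            scores[i][j] if (not causal or j <= i) else float("-inf")
            for j in range(n)
        ]
        out.append(softmax(row))
    return out


scores = [[0.0] * 4 for _ in range(4)]  # uniform toy scores for 4 tokens
bidirectional = attention_weights(scores)
causal = attention_weights(scores, causal=True)
print(bidirectional[0])  # [0.25, 0.25, 0.25, 0.25]
print(causal[0])         # [1.0, 0.0, 0.0, 0.0]
```

With the causal mask the first token can only attend to itself, whereas in the encoder-style case it already spreads attention over the whole sequence.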

The multi-head attention mechanism runs this in parallel across H heads, each learning a different relational pattern. BERT-Base uses 12 heads across 12 layers; BERT-Large uses 16 heads across 24 layers.
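To make the two configurations concrete, the sketch below works out the per-head width and the total number of attention heads. The hidden sizes (768 for BERT-Base, 1024 for BERT-Large) come from the BERT paper rather than from the text above, and `EncoderShape` is a hypothetical name used only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderShape:
    layers: int
    hidden: int  # model width, taken from the BERT paper
    heads: int   # attention heads per layer

    @property
    def head_dim(self):
        # Each head works on an equal slice of the hidden dimension.
        return self.hidden // self.heads

    @property
    def total_heads(self):
        return self.layers * self.heads


BERT_BASE = EncoderShape(layers=12, hidden=768, heads=12)
BERT_LARGE = EncoderShape(layers=24, hidden=1024, heads=16)

for name, shape in [("base", BERT_BASE), ("large", BERT_LARGE)]:
    print(name, shape.head_dim, shape.total_heads)
```

Both models end up with 64-dimensional heads; BERT-Large gains capacity from more heads per layer and more layers, not from wider heads.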

Pre-train once, fine-tune everywhere

BERT separates the expensive representation learning from the cheap task adaptation. Pre-training uses a massive unlabeled corpus (BooksCorpus + English Wikipedia, around 3.3 billion words) and runs for days on TPUs. The result is a set of encoder weights encoding deep knowledge of English syntax, semantics, and world facts.

Fine-tuning then adds a small head on top of the pre-trained encoder and trains for a few epochs on a labeled dataset that may be orders of magnitude smaller. Different tasks require different heads:

Sentence classification (e.g., sentiment, entailment): the final-layer embedding of the special [CLS] token feeds a linear classifier.
Span extraction (e.g., SQuAD question answering): two linear layers output per-token start and end logits; the span with the highest combined score is extracted.
Sequence labeling (e.g., named entity recognition): each token’s final-layer embedding feeds a per-token linear classifier independently.

The [CLS] token is a special token with its own learned embedding, prepended to every input sequence; it is not a positional embedding. Because it corresponds to no word in the input, its final-layer vector is free to aggregate sequence-level information. During pre-training that vector is the input to the next-sentence prediction classifier, which makes it a natural hook for classification heads.
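How the special tokens frame an input can be sketched as follows. The segment ids (0 for the first segment, 1 for the second) follow the scheme described in the BERT paper; the helper name `pack_pair` is hypothetical, and truncation and padding are left out.

```python
def pack_pair(tokens_a, tokens_b=None):
    """Build [CLS] A [SEP] (B [SEP]) together with per-token segment ids."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = pack_pair(["my", "dog"], ["he", "barks"])
print(tokens)       # ['[CLS]', 'my', 'dog', '[SEP]', 'he', 'barks', '[SEP]']
print(segment_ids)  # [0, 0, 0, 0, 1, 1, 1]
```

Single-sentence tasks use the same function with `tokens_b` omitted, so [CLS] always sits at position 0 regardless of the task.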

Why it mattered

BERT’s release in October 2018 was a watershed moment for NLP. It improved the state of the art on the GLUE benchmark by 7.7 points absolute, on SQuAD 1.1 by 1.5 F1, and on SQuAD 2.0 by 5.1 F1. Its SQuAD 1.1 F1 exceeded the reported human score, while on SQuAD 2.0 it still trailed human performance. Eleven NLP tasks saw new records in a single paper.

The deeper impact was architectural and methodological. Before BERT, NLP systems were largely task-specific pipelines with task-specific training. After BERT, the default approach became: take a pre-trained encoder, add a small head, fine-tune. This shift mirrors what ImageNet pre-training did for computer vision years earlier, and it directly enabled the scaling era that followed — RoBERTa, ALBERT, DeBERTa, and eventually the large language models that descended from the encoder–decoder and decoder-only branches of the same Transformer family.

BERT also popularized WordPiece tokenization, the [CLS]/[SEP] sentence boundary convention, and the use of the [CLS] vector as a sentence representation — conventions that persist across many modern models.

Related Reading

Attention Is All You Need: the Transformer architecture that BERT’s encoder is built on; reading it first clarifies exactly what BERT inherits and what it changes
CLIP: extends the pretrain-then-finetune idea to vision–language with contrastive learning, showing the same recipe generalizes across modalities
Vision Transformer: applies BERT’s encoder-only Transformer directly to image patches, demonstrating that the architecture transfers to vision with minimal modification
