Paper Overview
BEiT — BERT Pre-Training of Image Transformers — is the first method to successfully adapt BERT’s masked language modeling paradigm to Vision Transformers. Published at ICLR 2022 by Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei at Microsoft Research, BEiT introduces a two-stage framework: first, a discrete variational autoencoder (dVAE) learns to tokenize image patches into a finite visual vocabulary of 8192 entries; then, a Vision Transformer is pre-trained to predict the visual tokens of masked patches from the remaining visible context. This approach draws a direct parallel to BERT, where the model predicts masked word tokens from surrounding text — except here, the “words” are discrete visual codes that capture the semantic content of each 16×16 image patch.
The central insight behind BEiT is that predicting discrete tokens rather than raw pixels forces the model to learn higher-level visual abstractions. When reconstructing raw pixels, a model can succeed by learning low-level statistics — local textures, color gradients, and edge patterns. Discrete visual tokens, by contrast, compress each patch into a categorical label that captures its semantic essence, stripping away pixel-level noise. This makes the prediction task inherently more semantic: the model must understand what an image region represents, not merely what color values it contains. The dVAE tokenizer acts as a bottleneck that discards low-level details, leaving only the information that matters for high-level understanding.
BEiT achieves 83.2% top-1 accuracy on ImageNet-1K with ViT-B/16 after fine-tuning, surpassing supervised ViT-B (82.3%) and demonstrating that self-supervised pre-training with masked image modeling produces stronger representations than training with labels alone. On downstream tasks, BEiT pre-trained features achieve 49.8 mIoU on ADE20K semantic segmentation with ViT-L, confirming that the discrete token prediction objective learns spatially rich representations. BEiT’s linear probing accuracy of 56.7% with ViT-B is modest compared to contrastive methods, but the fine-tuning results reveal that BEiT’s representations are particularly amenable to adaptation, suggesting the model learns a flexible feature space that can be efficiently tuned for diverse visual tasks.
The Visual Tokenizer
BEiT’s visual tokenizer is a discrete variational autoencoder (dVAE) borrowed from the image generation literature — specifically, the same architecture used in DALL-E. The dVAE is trained separately on ImageNet-1K before BEiT pre-training begins. It learns to map each 16×16 image patch to one of 8192 discrete visual tokens through a codebook lookup. The encoder projects each patch into a continuous embedding, and the nearest codebook vector determines the token assignment. Formally, for a patch xᵢ, the assigned visual token is:

zᵢ = argminⱼ ‖e(xᵢ) − vⱼ‖₂,  vⱼ ∈ 𝒞v
where e(·) is the dVAE encoder and 𝒞v is the codebook of 8192 learned visual embeddings. The dVAE decoder can reconstruct the original patch from its token, but BEiT only uses the encoder during pre-training — the decoder is discarded. Each visual token represents a cluster of visually similar patches: tokens might correspond to “blue sky texture,” “fur-like pattern,” or “sharp horizontal edge.” This discretization compresses the information in each patch from 768 continuous pixel values (16×16×3) into a single categorical label from a vocabulary of 8192, dramatically reducing the complexity of the prediction target.
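The nearest-codebook assignment can be sketched in a few lines. This is a minimal illustration, not the real dVAE: the codebook is random, the code width of 32 is arbitrary, and a fixed random projection stands in for the learned encoder e(·).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 8192 codebook entries as in BEiT; the 32-dim code width
# and the random "encoder" projection below are stand-ins, not real dVAE weights.
VOCAB, DIM = 8192, 32
codebook = rng.normal(size=(VOCAB, DIM))      # C_v: the learned visual embeddings
proj = rng.normal(size=(16 * 16 * 3, DIM))    # stand-in for the dVAE encoder e(.)

def tokenize(patch):
    """Assign a flattened 16x16x3 patch the index of its nearest codebook vector."""
    z = patch @ proj                              # continuous embedding e(x_i)
    dists = np.linalg.norm(codebook - z, axis=1)  # ||e(x_i) - v_j|| for every j
    return int(np.argmin(dists))                  # z_i = argmin_j of the distance

patch = rng.normal(size=16 * 16 * 3)              # one flattened image patch
token = tokenize(patch)                           # a single categorical label
```

The output is exactly the compression described above: 768 continuous values reduced to one integer in [0, 8192).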
The quality of the visual tokenizer directly impacts BEiT’s pre-training effectiveness. A tokenizer that preserves too much low-level detail (large codebook, high-fidelity reconstruction) would push the prediction task back toward pixel reconstruction. A tokenizer that is too lossy (small codebook, poor reconstruction) would discard semantic information the model needs to learn. The 8192-entry codebook strikes a balance: it preserves enough semantic content to make token prediction meaningful while abstracting away pixel-level noise. Later work — BEiT-2 in particular — demonstrated that better tokenizers (such as VQ-KD, which distills knowledge from a pre-trained CLIP model into the codebook) can substantially improve downstream performance, confirming that the tokenizer is not merely a preprocessing step but a critical component of the entire framework.
Blockwise Masking Strategy
BEiT employs a blockwise masking strategy rather than the uniform random masking used by BERT or the random patch masking later adopted by MAE. The masking procedure works by repeatedly sampling rectangular blocks: each block has a random aspect ratio between 0.3 and 3.3, and its area is drawn from a truncated normal distribution. These blocks are placed at random positions until approximately 40% of the 14×14 = 196 patches are masked. Because blocks are contiguous, the masked regions tend to cover coherent spatial areas — an entire object part, a texture region, or a meaningful scene element — rather than scattered individual patches.
The motivation for blockwise masking is that random per-patch masking is too easy when the prediction target is a discrete token. With random masking, each masked patch is surrounded by visible neighbors on most sides, making the prediction a local interpolation problem. Blockwise masking removes entire spatial neighborhoods, forcing the model to reason about longer-range dependencies and higher-level scene structure. The 40% masking ratio is notably lower than MAE’s 75%, reflecting a fundamental difference in prediction targets: predicting discrete tokens from a categorical vocabulary of 8192 is a harder per-patch problem than reconstructing continuous pixel values, so the model needs more visible context (60% vs. 25%) to learn effectively. This design choice also means BEiT’s encoder processes all 196 patches — both masked and visible — during pre-training, unlike MAE which removes masked patches from the encoder entirely.
The BEiT Pipeline
The full BEiT pre-training pipeline operates in two parallel streams on each input image. In the first stream, the image is divided into a 14×14 grid of 16×16 patches, and the pre-trained dVAE tokenizer maps each patch to its discrete visual token — producing 196 token labels that serve as prediction targets. In the second stream, the same 196 patches are fed to the Vision Transformer backbone, but approximately 40% of them are replaced with a learnable [MASK] embedding before entering the encoder. Crucially, unlike MAE, the masked positions are not removed from the input sequence. The encoder processes all 196 tokens — visible patches retain their patch embeddings while masked positions carry the shared [MASK] token — meaning the encoder’s self-attention operates over the full sequence length at every layer.
This architectural choice has significant implications for both computation and representation learning. Because all positions are present in the encoder, masked tokens can attend to visible tokens and to each other, allowing the model to propagate contextual information through the masked regions during encoding. The trade-off is computational: the encoder’s self-attention cost remains O(196²) per layer regardless of masking ratio, unlike MAE which achieves a 16× reduction by encoding only visible patches. After the encoder, a linear classification head maps each masked position’s output embedding to a probability distribution over the 8192-entry visual vocabulary, and the model is trained with cross-entropy loss to predict the correct visual token at each masked position. Only the masked positions contribute to the loss — the model is not penalized for its representations of visible patches.
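The full-sequence encoding and masked-position loss can be sketched end to end. Everything here is an illustrative stand-in: random embeddings replace patch projections, a single tanh layer replaces the ViT encoder, and random integers replace the dVAE targets; only the shapes and the masking/loss logic follow the pipeline described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 196 patches, hidden width 64 (ViT-B uses 768), and the
# 8192-entry visual vocabulary; a one-layer stub stands in for the ViT encoder.
N, D, VOCAB = 196, 64, 8192

patch_emb = rng.normal(size=(N, D))              # embeddings of all 196 patches
mask_emb = rng.normal(size=D)                    # shared learnable [MASK] embedding
is_masked = rng.random(N) < 0.4                  # ~40% of positions masked

# Masked slots carry [MASK]; the sequence keeps its full length of 196.
x = np.where(is_masked[:, None], mask_emb, patch_emb)

W_enc = rng.normal(size=(D, D)) / np.sqrt(D)     # stub "encoder" weights
hidden = np.tanh(x @ W_enc)                      # processes all 196 positions

W_head = rng.normal(size=(D, VOCAB)) / np.sqrt(D)
logits = hidden @ W_head                         # (196, 8192) visual-token logits

targets = rng.integers(0, VOCAB, size=N)         # dVAE tokens (random stand-ins)

# Cross-entropy is computed only at the masked positions.
m = logits.max(axis=1, keepdims=True)
log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
loss = -log_probs[is_masked, targets[is_masked]].mean()
```

Note that `logits` has full sequence length even though only the masked rows are indexed for the loss, mirroring BEiT's design of encoding everything but supervising only masked positions.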
Token Prediction vs Pixel Reconstruction
The choice between predicting discrete tokens and reconstructing raw pixels represents a fundamental design axis in masked image modeling. BEiT predicts discrete visual tokens using a cross-entropy classification loss over 8192 categories, while MAE reconstructs per-patch normalized pixel values using mean squared error. BEiT’s loss function for a set of masked positions 𝒞m is:

ℒ = −∑_{i ∈ 𝒞m} log p(zᵢ | x∖𝒞m)

where zᵢ is the ground-truth visual token for masked position i and x∖𝒞m denotes the visible (unmasked) patches. This classification formulation means the model must commit to a discrete semantic category for each masked patch, rather than hedging with a blurry average of possible pixel values. When a model reconstructs pixels and faces ambiguity — say, a masked region could plausibly be grass or water — MSE loss encourages a blurry average of both. Cross-entropy over discrete tokens forces a categorical choice, which better captures the multi-modal nature of visual prediction.
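The "blurry average" effect can be checked with a toy calculation. The three-value "patches" and the grass/water token names are invented for illustration; the point is only that the MSE-optimal single prediction for an ambiguous region is the mean of the plausible completions, while the cross-entropy-optimal prediction keeps the modes distinct.

```python
import numpy as np

# Two equally plausible completions for an ambiguous masked region
# (toy 3-value "patches" standing in for grass vs. water pixels).
grass = np.array([0.1, 0.8, 0.1])
water = np.array([0.1, 0.1, 0.8])

def expected_mse(pred):
    """Expected MSE of one fixed prediction over the 50/50 mixture of completions."""
    return 0.5 * np.sum((pred - grass) ** 2) + 0.5 * np.sum((pred - water) ** 2)

# Under MSE the best single prediction is the mean of the completions:
# a blur that matches neither grass nor water, yet scores lowest.
blur = (grass + water) / 2

# Under cross-entropy over discrete tokens there is no "average token";
# the optimal prediction keeps both modes: P(grass) = P(water) = 0.5.
ce_optimal = {"grass_token": 0.5, "water_token": 0.5}
```

Here `expected_mse(blur)` is strictly lower than committing to either sharp completion, which is exactly why pixel regression rewards blur while token classification cannot.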
The empirical trade-offs are nuanced. BEiT ViT-B achieves 83.2% fine-tuning accuracy but only 56.7% linear probe accuracy, while MAE ViT-B achieves 83.6% fine-tuning with 68.0% linear probe. The lower linear probe score suggests BEiT’s representations require non-linear adaptation to become fully useful — the features are powerful but not immediately linearly separable. MAE’s pixel target, surprisingly, produces more linearly separable features, possibly because continuous pixel reconstruction preserves more low-level spatial information that a linear classifier can exploit. However, BEiT’s approach has a deeper advantage: the discrete token target decouples representation learning from pixel-level details, making the framework naturally extensible to multimodal settings where text and image tokens share the same vocabulary space — an insight that BEiT-3 later exploits to great effect.
From BEiT to BEiT-3
The BEiT lineage demonstrates how the core idea of predicting discrete visual tokens evolved through three generations of increasingly powerful models. BEiT-2 (2022) replaced the dVAE tokenizer with VQ-KD — a vector-quantized knowledge distillation approach that trains the tokenizer to produce codes aligned with a pre-trained CLIP teacher model’s representations. This seemingly simple change proved transformative: by grounding the visual vocabulary in CLIP’s semantically rich feature space, VQ-KD tokens carry substantially more semantic information than dVAE tokens, which are trained purely on pixel reconstruction. BEiT-2 with ViT-L/16 reaches 85.5% top-1 on ImageNet-1K, a 0.3 point improvement over the original BEiT-L, and the gains are even more pronounced on downstream tasks like semantic segmentation, where better tokenization directly translates to richer spatial representations.
BEiT-3 (2023) takes the multimodal leap that BEiT’s discrete token framework naturally enables. By treating both image patches and text words as discrete tokens from unified vocabularies, BEiT-3 applies masked data modeling across vision, language, and vision-language tasks using a single Multiway Transformer architecture. Each modality has its own expert feed-forward layers, but the self-attention parameters are shared, allowing the model to develop cross-modal understanding through unified pre-training. BEiT-3 ViT-g achieves 87.6% on ImageNet-1K classification, 64.2 mIoU on ADE20K semantic segmentation, and state-of-the-art results on vision-language benchmarks including VQAv2, NLVR2, and image-text retrieval. The progression from BEiT to BEiT-3 validates the original paper’s core thesis: discrete visual tokens provide the right abstraction for bridging vision and language, enabling a unified framework that scales from single-modality pre-training to multimodal foundation models.
How BEiT Compares
Self-Supervised Method Comparison
How BEiT compares to other self-supervised and supervised approaches on ImageNet fine-tuning accuracy.
| Method | Approach | Masking | Target | ViT-B (%) | ViT-L (%) | Key Advantage |
|---|---|---|---|---|---|---|
| BEiT | Token prediction | 40% blockwise | Discrete tokens | 83.2 | 85.2 | Semantic token targets from dVAE tokenizer |
| MAE | Pixel reconstruction | 75% random | Raw pixels | 83.6 | 84.9 | Simple pixel target, 3.5× faster training |
| SimMIM | Pixel reconstruction | 60% random | Raw pixels | 83.8 | — | Simple design, works with Swin Transformer |
| DINO | Self-distillation | None (multi-crop) | Teacher CLS token | 82.8 | — | Emergent segmentation in attention maps |
| iBOT | Distillation + MIM | Patch masking | Teacher tokens | 83.8 | 84.8 | Combines DINO image-level + BEiT patch-level |
| Supervised | Cross-entropy | N/A | Class labels | 82.3 | 82.6 | Requires labeled data, saturates at scale |
BEiT's key insight
- Discrete token prediction forces semantic understanding over pixel details
- dVAE tokenizer maps patches to a learned visual vocabulary
- Blockwise masking encourages reasoning about spatial context
Trade-offs
- Requires separate dVAE tokenizer — extra training stage
- No MAE-style encoder speedup — all patches processed
- Tokenizer quality limits representation quality ceiling
Key Results
| Model | Fine-tuning | Linear Probe | Notes |
|---|---|---|---|
| BEiT ViT-B | 83.2% | 56.7% | 300ep, dVAE tokens |
| BEiT ViT-L | 85.2% | — | 800ep pre-train |
| BEiT-2 ViT-L | 85.5% | — | VQ-KD tokenizer |
| BEiT-3 ViT-g | 87.6% | — | Multimodal |
| MAE ViT-B | 83.6% | 68.0% | Pixel target |
| Supervised ViT-B | 82.3% | — | Labels required |
Why BEiT Matters
BEiT was the first work to demonstrate that BERT-style masked prediction could work for Vision Transformers, establishing masked image modeling as a viable and powerful self-supervised paradigm for computer vision. Before BEiT, self-supervised learning in vision was dominated by contrastive methods — SimCLR, MoCo, BYOL, DINO — which learn representations by comparing augmented views of the same image. These methods require careful augmentation design, momentum encoders, or large batch sizes. BEiT showed that a conceptually simpler approach — masking and predicting — could match or exceed contrastive methods on fine-tuning benchmarks, opening an entirely new research direction. The paper’s 83.2% result with ViT-B, surpassing supervised training by 0.9 points, was the first concrete evidence that masked image modeling could be practically superior to learning from labels.
Beyond the immediate results, BEiT’s lasting contribution is the framework itself: the idea that images can be treated as sequences of discrete tokens, analogous to words in language, and that masked prediction over these tokens produces strong visual representations. This framework directly influenced MAE (which simplified the target to raw pixels), SimMIM (which explored broader masking strategies), PeCo (which refined the codebook with perceptual quality), and ultimately BEiT-3 (which unified vision and language under a single masked modeling objective). The discrete tokenization approach also proved essential for vision-language models, where having images and text in compatible token spaces enables seamless cross-modal learning. BEiT did not just introduce a new pre-training method — it established the conceptual bridge between NLP’s masked modeling success and vision, catalyzing a paradigm shift that reshaped how the field thinks about self-supervised visual representation learning.
Key Takeaways
- Discrete visual tokens make masked prediction semantic — by compressing each 16×16 patch into one of 8192 categorical labels via a dVAE tokenizer, BEiT forces the model to predict semantic content rather than low-level pixel statistics, producing representations that excel after fine-tuning.
- Blockwise masking creates a harder pretext task — masking contiguous rectangular blocks at 40% ratio removes entire spatial neighborhoods, preventing trivial local interpolation and forcing the model to reason about object parts, textures, and scene layout from distant context.
- The tokenizer quality is a critical bottleneck — BEiT-2’s switch from dVAE to VQ-KD (CLIP-aligned) tokens improved ViT-L accuracy from 85.2% to 85.5%, demonstrating that the expressiveness of the visual vocabulary directly limits the quality of learned representations.
- Token prediction enables natural multimodal extension — because discrete visual tokens are structurally analogous to text tokens, BEiT’s framework extends seamlessly to vision-language pre-training, as demonstrated by BEiT-3’s 87.6% ImageNet accuracy and state-of-the-art vision-language performance with a unified Multiway Transformer.
- Fine-tuning reveals what linear probing misses — BEiT’s 56.7% linear probe (vs. MAE’s 68.0%) suggests less linearly separable features, but its 83.2% fine-tuning accuracy (close to MAE’s 83.6%) reveals that BEiT learns a flexible, high-dimensional feature space that becomes highly discriminative with end-to-end adaptation.
Related Reading
- MAE — Masked autoencoders that reconstruct raw pixels instead of discrete tokens
- DINO — Self-distillation with Vision Transformers via momentum teacher
- SimCLR — Contrastive learning framework that established SSL baselines for vision
- MoCo — Momentum contrast for building large, consistent negative dictionaries
- I-JEPA — Joint-embedding predictive architecture that avoids pixel and token reconstruction
- DINOv2 — Scaling self-supervised ViTs with curated data and distillation
- BYOL — Bootstrap Your Own Latent — self-supervised learning without negative pairs
- VICReg — Variance-invariance-covariance regularization for non-contrastive learning
