TL;DR
ViT demonstrates that a standard Transformer encoder, applied directly to sequences of image patches, can match or exceed the best convolutional networks on image classification — provided it is pre-trained on sufficient data. The architecture is deliberately minimal: split an image into 16x16 patches, linearly embed each patch, prepend a learnable classification token, add positional embeddings, and feed the sequence through a vanilla Transformer. When pre-trained on JFT-300M (300 million images), ViT-Huge/14 reaches 88.55% top-1 accuracy on ImageNet, surpassing the best CNNs while requiring substantially less compute to pre-train.
The Core Idea: Image Patches as Tokens
The central insight is a reframing: treat an image not as a pixel grid but as a sequence of patch tokens, then apply the same Transformer architecture that works for language. This is a deliberate bet on scale over inductive bias. CNNs bake in locality (convolution kernels) and translation equivariance (weight sharing) — priors that help with limited data. ViT discards both, relying instead on the Transformer's capacity to learn these relationships from data when given enough of it.
An input image $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$ is reshaped into a sequence of $N$ flattened patches $\mathbf{x}_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $P$ is the patch size and $N = HW / P^2$. For a 224x224 image with 16x16 patches, this yields $N = 196$ tokens — a sequence length easily handled by a Transformer.
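As a concrete illustration, the reshaping takes only a few lines of PyTorch (a sketch; the tensor manipulation below is one of several equivalent ways to do it):

```python
import torch

# A batch of images: (batch, channels, height, width)
x = torch.randn(8, 3, 224, 224)
P = 16                                # patch size
N = (224 // P) * (224 // P)           # 14 * 14 = 196 patches

# Carve the image into non-overlapping P x P patches, then flatten each one.
patches = x.unfold(2, P, P).unfold(3, P, P)   # (8, 3, 14, 14, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5)   # (8, 14, 14, 3, P, P)
patches = patches.reshape(8, N, 3 * P * P)    # (8, 196, 768)

print(patches.shape)  # torch.Size([8, 196, 768])
```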
Patch Embedding
Each flattened patch is projected into the model's hidden dimension $D$ through a learnable linear projection $\mathbf{E} \in \mathbb{R}^{(P^2 \cdot C) \times D}$, so patch $i$ becomes the token $\mathbf{x}_p^i \mathbf{E}$.
In practice this is implemented as a single convolution with kernel size and stride both equal to P, which is mathematically equivalent to flattening plus linear projection but more efficient on GPU hardware.
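A minimal PyTorch sketch of this equivalence (layer and variable names are ours, not the paper's):

```python
import torch
import torch.nn as nn

P, C, D = 16, 3, 768

# Flatten-then-project and a strided convolution compute the same linear map;
# the conv simply applies it to all patches in one fused operation.
patch_embed = nn.Conv2d(C, D, kernel_size=P, stride=P)

x = torch.randn(8, C, 224, 224)
tokens = patch_embed(x)                     # (8, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)  # (8, 196, 768): one token per patch
```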
A learnable [CLS] token $\mathbf{z}_0^0 = \mathbf{x}_{\text{class}}$ is prepended to the patch sequence, following the BERT convention. Its output representation at the final layer serves as the aggregate image representation for classification. Learnable 1D positional embeddings $\mathbf{E}_{\text{pos}} \in \mathbb{R}^{(N+1) \times D}$ are added to the full sequence to encode spatial information:

$$\mathbf{z}_0 = [\mathbf{x}_{\text{class}};\, \mathbf{x}_p^1 \mathbf{E};\, \mathbf{x}_p^2 \mathbf{E};\, \ldots;\, \mathbf{x}_p^N \mathbf{E}] + \mathbf{E}_{\text{pos}}$$
The paper found that learned 1D positional embeddings perform comparably to more sophisticated 2D-aware alternatives, suggesting the model learns to infer spatial structure from the data.
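Assembling the input sequence $\mathbf{z}_0$ might look like this (a sketch with our own variable names; real implementations typically use truncated-normal initialization rather than zeros):

```python
import torch
import torch.nn as nn

B, N, D = 8, 196, 768

cls_token = nn.Parameter(torch.zeros(1, 1, D))      # learnable [CLS] token
pos_embed = nn.Parameter(torch.zeros(1, N + 1, D))  # learnable 1D positions

patch_tokens = torch.randn(B, N, D)                 # output of the patch embedding
cls = cls_token.expand(B, -1, -1)                   # one [CLS] copy per image
z0 = torch.cat([cls, patch_tokens], dim=1) + pos_embed  # (B, 197, D)
```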
Transformer Encoder
The patch embeddings are processed by a standard Transformer encoder — the same architecture from Vaswani et al. (2017), with no vision-specific modifications. Each of the $L$ layers applies multihead self-attention (MSA) followed by an MLP block, both with layer normalization applied before each block (pre-norm) and residual connections:

$$\mathbf{z}'_\ell = \text{MSA}(\text{LN}(\mathbf{z}_{\ell-1})) + \mathbf{z}_{\ell-1}, \qquad \ell = 1, \ldots, L$$

$$\mathbf{z}_\ell = \text{MLP}(\text{LN}(\mathbf{z}'_\ell)) + \mathbf{z}'_\ell, \qquad \ell = 1, \ldots, L$$
The MLP contains two linear layers with a GELU activation. The final-layer [CLS] output, after layer normalization, is the image representation:

$$\mathbf{y} = \text{LN}(\mathbf{z}_L^0)$$

A classification head (a single linear layer during fine-tuning) maps $\mathbf{y}$ to the class prediction.
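A pre-norm encoder block following these equations might look like this in PyTorch (a condensed sketch; class and parameter names are ours):

```python
import torch.nn as nn

class ViTBlock(nn.Module):
    """One pre-norm encoder block: LN -> MSA -> residual, then LN -> MLP -> residual."""

    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, z):
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # MSA + residual
        z = z + self.mlp(self.norm2(z))                    # MLP + residual
        return z
```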
Self-attention allows every patch to attend to every other patch at every layer, giving the model a global receptive field from the first layer. This contrasts with CNNs, where the effective receptive field grows linearly with depth.
Pre-training at Scale
ViT's central experimental finding is that dataset scale determines whether the architecture succeeds or fails. When trained from scratch on ImageNet-1k (1.3M images), ViT-Base performs several points below a comparably sized ResNet. The Transformer lacks the inductive biases that let CNNs generalize from limited data. But this deficit inverts with scale:
- ImageNet-21k (14M images): ViT becomes competitive with CNNs.
- JFT-300M (300M images): ViT surpasses the best CNNs at lower total training compute.
The paper pre-trains on these large datasets using standard supervised classification, then fine-tunes on downstream tasks. Pre-training uses Adam with a linear learning rate warmup and cosine decay. The large-scale pre-training essentially substitutes for the missing inductive biases — what locality and translation equivariance give CNNs for free, ViT learns from data.
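A minimal sketch of that learning rate schedule (the warmup length and peak rate below are illustrative values, not the paper's exact settings):

```python
import math

def lr_at_step(step, total_steps, warmup_steps=10_000, peak_lr=1e-3):
    """Linear warmup to peak_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
```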
Fine-tuning Strategy
For fine-tuning on downstream tasks, the pre-trained classification head is replaced with a zero-initialized linear layer sized for the target number of classes. A key practical detail: fine-tuning at higher resolution than pre-training improves performance. When the input resolution increases, the number of patches grows (e.g., from 196 at 224px to 576 at 384px), so the pre-trained positional embeddings are 2D-interpolated to the new grid size. This works well in practice despite the positional embeddings being learned as 1D vectors.
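A sketch of this 2D interpolation in PyTorch, assuming a 14x14 grid (196 patch tokens) resized to 24x24 (576 tokens); the helper name and bicubic mode are our choices:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid=14, new_grid=24):
    """2D-interpolate patch position embeddings; the [CLS] slot is kept as-is."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]  # (1,1,D), (1,196,D)
    D = patch_pos.shape[-1]
    # Reshape the 1D token sequence back onto its 2D grid, resize, flatten again.
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, D)
    return torch.cat([cls_pos, patch_pos], dim=1)            # (1, 577, D)
```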
Fine-tuning uses SGD with momentum, lower learning rates than pre-training, and typically runs for only a few thousand steps — far less compute than pre-training.
ViT Variants
The paper defines three model configurations following the naming convention Model/Patch-size:
| Variant | Layers | Hidden dim D | MLP dim | Heads | Params |
|---|---|---|---|---|---|
| ViT-Base/16 | 12 | 768 | 3072 | 12 | 86M |
| ViT-Large/16 | 24 | 1024 | 4096 | 16 | 307M |
| ViT-Huge/14 | 32 | 1280 | 5120 | 16 | 632M |
The "/16" or "/14" suffix denotes patch size. Smaller patches mean more tokens (longer sequences) and proportionally more compute due to the quadratic cost of self-attention, but they also capture finer spatial detail. ViT-H/14 uses 14x14 patches, yielding 256 tokens for a 224px image compared to 196 tokens for 16x16 patches.
Key Results
When pre-trained on JFT-300M, ViT achieves state-of-the-art results across multiple benchmarks:
- ImageNet: 88.55% top-1 (ViT-H/14), edging out the previous best of 88.4% (EfficientNet-L2 with Noisy Student) while requiring substantially less pre-training compute (2.5k TPUv3-core-days versus 12.3k for Noisy Student).
- CIFAR-100: 94.55% top-1.
- VTAB (19-task suite): Highest average across all three task groups (Natural, Specialized, Structured).
The compute efficiency result is notable: ViT reaches a given accuracy level with fewer training FLOPs than CNN baselines like BiT (Big Transfer). The scaling curves show that ViT performance has not saturated at JFT-300M scale, suggesting further gains with larger datasets.
The Data Efficiency Problem
ViT's reliance on large-scale pre-training is its most significant practical limitation. Without JFT-300M (a proprietary Google dataset), reproducing the top results is infeasible. On ImageNet-1k alone, ViT-Base underperforms a ResNet-50 trained with modern regularization. The reason is structural: a Transformer must learn from data what convolutions encode by design — that nearby pixels are more relevant than distant ones, and that patterns are translation-invariant.
This gap motivated substantial follow-up work. DeiT (Touvron et al., 2021) showed that with aggressive data augmentation (RandAugment, Mixup, CutMix) and regularization (stochastic depth, repeated augmentation), ViT-Base can reach 81.8% on ImageNet-1k using only that dataset, and 83.4% when a distillation token trained against a CNN teacher is added. This demonstrated that ViT's data hunger is not fundamental but can be addressed through training recipes that partially compensate for the missing inductive biases.
Critical Analysis
Strengths:
- Architectural simplicity. ViT applies a standard NLP Transformer to vision with essentially zero domain-specific modifications. This simplicity enables direct transfer of advances in Transformer scaling, training techniques, and hardware optimization.
- Favorable scaling behavior. Performance improves log-linearly with compute and data, and the scaling curves show no signs of saturation. This is a strong argument for the architecture's long-term viability.
- Transfer learning. A single pre-trained ViT transfers well across diverse downstream tasks with minimal fine-tuning, making it a practical foundation model.
Limitations:
- No locality inductive bias. The global self-attention mechanism treats all patches equally regardless of spatial proximity. This hurts data efficiency and makes the model rely on large datasets to learn spatial relationships that CNNs encode structurally.
- Quadratic attention cost. Self-attention scales as $O(N^2)$ with sequence length, making high-resolution inputs expensive. A 224px image with 16x16 patches produces 196 tokens; fine-tuning at 384px yields 576 tokens, roughly a 9x increase in attention computation.
- Positional embedding limitations. The learned 1D positional embeddings require interpolation for resolution changes and do not generalize to drastically different aspect ratios or image sizes without fine-tuning.
- Pre-training data requirements. The flagship results depend on JFT-300M, a dataset that is not publicly available, limiting reproducibility.
Impact and Legacy
ViT's impact on computer vision has been substantial, not because it was the first to apply attention to images, but because it demonstrated that a pure Transformer — with no convolutional layers — could work at scale. This shifted the field's default architecture away from CNNs.
Direct architectural descendants include the Swin Transformer (Liu et al., 2021), which reintroduces locality through shifted windows and hierarchical feature maps, achieving strong performance on dense prediction tasks where ViT struggles. DeiT showed that careful training recipes could close the data gap on ImageNet-1k.
Self-supervised learning adopted ViT as the standard backbone. DINO (Caron et al., 2021) demonstrated that ViT trained with self-distillation learns features that explicitly encode semantic segmentation without supervision. MAE (He et al., 2022) showed that masking 75% of patches and reconstructing them produces strong representations, achieving 87.8% on ImageNet with ViT-Huge. BEiT (Bao et al., 2022) adapted BERT-style pre-training to vision patches.
Multimodal models built directly on ViT. CLIP (Radford et al., 2021) trains a ViT image encoder alongside a text encoder via contrastive learning, producing zero-shot classifiers that generalize across distributions. The ViT encoder is now a standard component in models like LLaVA, Flamingo, and GPT-4V.
Related Reading
- Attention Is All You Need — the original Transformer architecture that ViT applies directly to vision
- Deep Residual Learning — ResNet, the CNN baseline that ViT aims to surpass at scale
- DINO — self-supervised training of ViT that reveals emergent segmentation in attention maps
- MAE — masked autoencoding pre-training for ViT, reducing dependence on labeled data
- BEiT — BERT-style pre-training adapted for vision transformers
- CLIP — contrastive language-image pre-training using ViT as the visual encoder
- Swin Transformer — hierarchical vision transformer with shifted windows, addressing ViT's limitations on dense tasks
