TL;DR
- ALIGN trains a dual-encoder model on 1.8 billion raw image–alt-text pairs, using only minimal frequency-based filtering — no human curation, no caption quality checks.
- The key finding: at sufficient scale, dataset size compensates for the noise in uncurated web data; the 1.8 billion noisy pairs produce stronger vision–language representations than smaller, carefully cleaned datasets.
- The architecture is deliberately simple: EfficientNet for images, BERT for text, L2-normalized embeddings projected into a shared space, trained with an in-batch InfoNCE contrastive loss.
- The resulting embedding space supports zero-shot image–text retrieval and classification, with performance competitive with CLIP on standard benchmarks, despite the dramatically different data strategy.
Scale over cleaning
Training good vision–language models has traditionally required painstaking data curation: human-written captions (COCO), filtered crawls with heuristics (WIT for CLIP), or expert annotations. Curated datasets top out around 400 million pairs before the effort becomes prohibitive.
ALIGN takes the opposite bet. Google's image search index contains billions of images, each accompanied by an alt-text attribute written by web authors. Most of these alt-texts are useless for visual grounding: filenames like IMG_2043.jpg, marketing copy, or single generic words. But the dataset is enormous. ALIGN applies only lightweight, frequency-based filtering: it drops alt-texts that are shared by too many images or that contain rare tokens, and images that are attached to an excessive number of alt-texts. That leaves 1.8 billion pairs, with no human review and no learned quality scoring.
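As a rough sketch of what a frequency filter of this kind might look like: the function name `frequency_filter`, both thresholds, and the whitespace tokenization are illustrative assumptions, not the paper's actual pipeline or settings.

```python
from collections import Counter


def frequency_filter(pairs, max_images_per_text=10, min_token_count=2):
    """Keep (image_id, alt_text) pairs that pass two frequency checks.

    Both thresholds are hypothetical values chosen for illustration.
    """
    # How many pairs share each exact alt-text (catches boilerplate captions).
    text_counts = Counter(text for _, text in pairs)
    # How often each token occurs in the corpus (catches filenames, typos).
    token_counts = Counter(
        tok for _, text in pairs for tok in text.lower().split()
    )

    kept = []
    for image_id, text in pairs:
        tokens = text.lower().split()
        if not tokens:
            continue  # empty alt-text
        if text_counts[text] > max_images_per_text:
            continue  # the same caption is attached to too many images
        if any(token_counts[tok] < min_token_count for tok in tokens):
            continue  # contains a token that is too rare in the corpus
        kept.append((image_id, text))
    return kept
```

Nothing in this sketch looks at the image content or judges whether the caption actually describes the picture, which is the point: the filter is cheap enough to run over billions of pairs.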
The thesis is that, past a threshold, the signal-to-noise ratio stops mattering: a model seeing a billion noisy examples eventually learns the visual-semantic associations that a smaller clean dataset makes explicit, simply by encountering enough natural co-occurrences.
The curve above is illustrative, but it captures the paper's finding: the noisy ALIGN dataset starts behind a well-curated 3M-pair dataset in representation quality, but overtakes it somewhere around the billion-pair mark. At 1.8B, the noisy data produces better representations than any curated competitor available at the time.
A simple dual encoder
ALIGN's architecture is a textbook dual encoder. There is nothing structurally novel — the contribution is entirely in the data strategy and scale.
Image encoder. EfficientNet-L2, trained from scratch on the image–alt-text pairs rather than initialized from a classification checkpoint. The globally pooled feature serves as the image embedding and is L2-normalized into the shared space.
Text encoder. BERT-Large, likewise trained from scratch, jointly with the image encoder. The [CLS] token representation is passed through a linear layer to match the image embedding dimension, then L2-normalized into the same shared space.
Training objective. For a batch of N image–text pairs, the model computes cosine similarities between all N² combinations. The in-batch InfoNCE loss maximizes the N diagonal (matched) similarities and minimizes the N² − N off-diagonal (mismatched) ones, symmetrically in both image→text and text→image directions. A learned temperature scales the logits.
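The objective described above can be sketched in a few lines of NumPy. This is a minimal forward-pass illustration, not the paper's training code: the function names are hypothetical, the temperature is a fixed placeholder value here (in training it is a learned parameter), and the two directional losses are simply summed.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """In-batch contrastive loss over N matched image-text pairs.

    Row i of image_emb is assumed to match row i of text_emb.
    The temperature value is an arbitrary placeholder.
    """
    img = l2_normalize(np.asarray(image_emb, dtype=np.float64))
    txt = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    logits = img @ txt.T / temperature  # (N, N) scaled cosine similarities
    diag = np.arange(logits.shape[0])
    # image -> text: each row is a softmax over all texts in the batch
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # text -> image: each column is a softmax over all images in the batch
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return loss_i2t + loss_t2i
```

Every other pair in the batch acts as a negative, so a larger batch gives each example more negatives at no extra labeling cost.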
The absence of cross-modal fusion is deliberate. Cross-attention layers (as in later VLMs) produce richer joint representations but require every image–text pair to be encoded together at inference — prohibitively expensive for retrieval at scale. With independent encoders, image and text embeddings can be precomputed and cached separately; retrieval reduces to a dot-product lookup.
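To make the "precompute and look up" point concrete, here is a small brute-force sketch. The names `build_index` and `retrieve` are hypothetical, and a production system would likely swap the exhaustive matrix product for an approximate nearest-neighbor index.

```python
import numpy as np


def build_index(embeddings):
    """Normalize candidate embeddings once so they can be cached and reused."""
    emb = np.asarray(embeddings, dtype=np.float32)
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)


def retrieve(query_emb, index, k=5):
    """Return indices and cosine scores of the k best-matching candidates."""
    q = np.asarray(query_emb, dtype=np.float32)
    q = q / np.linalg.norm(q)
    scores = index @ q  # one dot product per cached candidate
    k = min(k, len(scores))
    top = np.argsort(-scores)[:k]  # highest similarity first
    return top, scores[top]
```

The candidate side is encoded once, offline; at query time only the query passes through an encoder. A cross-attention model would instead need one forward pass per query-candidate pair.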
A shared embedding space
The learned embedding space is what makes ALIGN practically useful beyond training. Because both modalities land in the same L2-normalized space, any image embedding can be directly compared with any text embedding via cosine similarity.
Image-to-text retrieval: given a query image, rank all text candidates by cosine similarity and return the top matches.
Text-to-image retrieval: given a text query, rank all image candidates by cosine similarity.
Zero-shot classification: treat each class label (or a templated prompt like "a photo of a {class}") as a text query, encode all class texts once, then assign the image to the nearest class text embedding.
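The zero-shot recipe can be sketched as follows. `encode_text` is a hypothetical stand-in for the trained text encoder (any callable that maps a string to a vector), and the default template is just the example prompt from the text.

```python
import numpy as np


def zero_shot_classify(image_emb, class_names, encode_text,
                       template="a photo of a {}"):
    """Assign an image embedding to the class with the closest prompt embedding.

    encode_text is an assumed callable: str -> 1-D array in the shared space.
    """
    prompts = [template.format(name) for name in class_names]
    text_emb = np.stack([np.asarray(encode_text(p), dtype=float) for p in prompts])
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    img = np.asarray(image_emb, dtype=float)
    img = img / np.linalg.norm(img)

    scores = text_emb @ img  # cosine similarity to each class prompt
    return class_names[int(np.argmax(scores))], scores
```

For brevity this sketch re-encodes the prompts on every call. In practice the class text embeddings would be computed once and reused for every image, so classifying a new image costs one image-encoder pass plus a small matrix product.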
On the MS-COCO retrieval benchmark, ALIGN achieves state-of-the-art image-to-text and text-to-image recall at the time of publication, outperforming models trained on orders of magnitude less data. On ImageNet zero-shot classification, ALIGN matches CLIP's numbers despite having a different training set and encoder choice.
Why it mattered
ALIGN arrived concurrently with CLIP (both appeared in early 2021, within weeks of each other, and both were presented at ICML 2021). Together they established the now-standard paradigm for vision–language pre-training: dual encoders, web-scale supervision, contrastive loss. ALIGN's specific contribution to this consensus was demonstrating the data side: you do not need curated data if you have enough of it.
The practical consequence was significant. Curation pipelines are bottlenecks. If the frequency filter alone is sufficient, any organization with access to a large web crawl can train a competitive vision–language model without investing in annotation infrastructure. This lowered the barrier to building large-scale multimodal systems and influenced how subsequent models (Florence, BASIC, SigLIP at scale) approached data collection.
ALIGN also demonstrated that EfficientNet, a purely convolutional architecture, could compete with ViT-based models for vision–language representation at the time, though subsequent work has shifted the field toward transformer image encoders.
Related Reading
- CLIP — the concurrent dual-encoder model from OpenAI, trained on a curated 400M-pair dataset; together ALIGN and CLIP defined the paradigm
- SigLIP — replaces InfoNCE with a pairwise sigmoid loss that removes the all-gather requirement, enabling contrastive training at smaller batch sizes
- CoCa — extends the dual-encoder framework with a generative captioning objective, combining contrastive and generative supervision in one model
