TL;DR
- BLIP introduces a Multimodal mixture of Encoder-Decoder (MED) that shares weights across three modes — unimodal encoder (ITC), image-grounded text encoder (ITM), and image-grounded text decoder (LM) — enabling a single pre-trained model to both score and generate.
- Unlike CLIP (understanding only) or captioning models (generation only), BLIP does both from one set of weights.
- CapFilt bootstraps noisy web data: a captioner synthesizes fresh captions for web images; a filter removes noisy pairs from both the originals and the synthetic set; the cleaned, augmented corpus retrains the model.
- BLIP was the direct predecessor of BLIP-2, which replaced BLIP's end-to-end training with a lightweight Q-Former trained between a frozen image encoder and a frozen LLM.
Bootstrapping noisy data: CapFilt
Web-crawled image-text pairs are cheap and plentiful but noisy — captions are often unrelated to the image, too generic, or entirely wrong. BLIP addresses this with CapFilt, a self-bootstrapping data pipeline built from two components of the pretrained MED model itself.
The Captioner is the MED decoder fine-tuned on COCO Captions to generate synthetic captions for images. It runs over web images that already have (noisy) associated text, producing a second, model-generated caption for each image.
The Filter is the MED image-grounded text encoder, also fine-tuned on COCO, whose image-text matching (ITM) head classifies a given image-text pair as matched or not. It removes pairs judged noisy from both the original web corpus and the synthetic captions produced by the captioner.
The result is a cleaned and augmented dataset in which every retained caption, original or synthetic, was judged by the filter to match its image. This dataset, combined with the human-annotated pairs, is used to pre-train a new BLIP model from scratch, yielding gains on downstream tasks such as image-text retrieval and image captioning. No new human annotation is needed beyond the COCO captions already used to fine-tune the captioner and filter, and no separate annotation pipeline is involved.
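The CapFilt loop described above can be sketched in a few lines of Python. This is a minimal illustration, not BLIP's actual code: `capfilt`, `caption_fn`, and `match_fn` are hypothetical names, with `caption_fn` standing in for the fine-tuned captioner and `match_fn` for the fine-tuned filter. The toy data exists only to make the example runnable.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (image_id, caption)


def capfilt(
    web_pairs: Iterable[Pair],
    caption_fn: Callable[[str], str],       # hypothetical stand-in for the captioner
    match_fn: Callable[[str, str], bool],   # hypothetical stand-in for the filter
) -> List[Pair]:
    """Keep every web or synthetic caption that the filter judges as matched."""
    kept: List[Pair] = []
    for image_id, web_text in web_pairs:
        synthetic_text = caption_fn(image_id)  # one synthetic caption per web image
        for text in (web_text, synthetic_text):
            if match_fn(image_id, text):       # drop pairs judged noisy
                kept.append((image_id, text))
    return kept


# Toy stand-ins: the "captioner" looks up a reference caption, and the
# "filter" accepts a caption only if it equals that reference.
reference = {"img1": "a dog on a beach", "img2": "a red bicycle"}
web = [("img1", "best vacation deals 2021"), ("img2", "a red bicycle")]

dataset = capfilt(
    web,
    caption_fn=lambda image_id: reference[image_id],
    match_fn=lambda image_id, text: text == reference[image_id],
)
```

In this toy run the unrelated web caption for `img1` is dropped and replaced by the synthetic one, while `img2` keeps both its web caption and its (identical) synthetic caption.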
One model, three modes: MED
BLIP's core architecture is the Multimodal mixture of Encoder-Decoder (MED). A single transformer text stack is conditioned on a ViT image encoder, which is trained jointly with the text stack rather than frozen, and operates in one of three modes depending on which attention layers are active:
Unimodal encoder (ITC loss). Self-attention only; no cross-attention to the image. A [CLS] token aggregates the text. Its output embedding is compared to the image's [CLS] embedding via a contrastive loss, pulling matched image-text pairs together and pushing unmatched pairs apart in the embedding space. This mode aligns the image and text feature spaces so that similarity can be computed without any cross-modal attention.
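As a rough illustration of the contrastive objective, the sketch below computes a symmetric image-to-text and text-to-image cross-entropy over a batch of embeddings. It is a simplified stand-in rather than BLIP's implementation: the function names are hypothetical, the temperature of 0.07 is an illustrative assumption, and the momentum encoder and soft labels that the BLIP paper adopts from ALBEF are omitted.

```python
import math
from typing import List, Sequence


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Numerically stable softmax cross-entropy for one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def itc_loss(image_embs, text_embs, temperature: float = 0.07) -> float:
    """Symmetric contrastive loss; pair k of images and texts is the positive."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j] = cosine similarity of image i and text j, scaled by temperature
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    image_to_text = sum(cross_entropy(sim[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)


aligned = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each image embedding is closest to its own caption (`aligned`) and large when the pairing is swapped (`shuffled`).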
Image-grounded text encoder (ITM loss). Bidirectional self-attention plus cross-attention to the image encoder's patch features. An [Encode] token produces a multimodal embedding that is fed to a binary classifier predicting whether the image and text are matched. Hard negative mining selects the most confusing unmatched pairs as negatives.
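One way to pick hard negatives is to reuse the contrastive similarities: for each image, sample a non-matching text with probability that grows with its similarity to the image, so the most confusable texts are chosen most often. The sketch below assumes that sampling scheme; `sample_hard_negatives` and the toy similarity matrix are hypothetical, and the batch is assumed to hold at least two pairs.

```python
import math
import random
from typing import List, Optional, Sequence


def sample_hard_negatives(
    sim: Sequence[Sequence[float]],
    rng: Optional[random.Random] = None,
) -> List[int]:
    """For each image i (row of sim), return the index j != i of a negative text.

    sim[i][j] is the image-text similarity; the diagonal holds the positives.
    Negatives are drawn with probability proportional to exp(sim[i][j]).
    """
    rng = rng or random.Random()
    negatives = []
    for i, row in enumerate(sim):
        # Subtract the largest off-diagonal value so at least one weight is 1.0.
        m = max(s for j, s in enumerate(row) if j != i)
        weights = [0.0 if j == i else math.exp(s - m) for j, s in enumerate(row)]
        negatives.append(rng.choices(range(len(row)), weights=weights, k=1)[0])
    return negatives


# Toy batch of 3 pairs: each image has one near-miss text and one easy negative.
toy_sim = [
    [5.0, 4.0, -50.0],
    [-50.0, 5.0, 4.0],
    [4.0, -50.0, 5.0],
]
hard = sample_hard_negatives(toy_sim, random.Random(0))
```

Each sampled (image, negative text) pair would then be fed to the ITM classifier with the label "not matched", alongside the true pairs labeled "matched".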
Image-grounded text decoder (LM loss). Causal (masked) self-attention plus cross-attention to the image. The model autoregressively predicts caption tokens conditioned on the image, supervised by a language modeling loss. This is the mode used for caption generation at inference.
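The language modeling loss is, at its core, the average cross-entropy between the decoder's next-token distribution and the ground-truth caption token at each position. The sketch below shows that computation on plain lists of logits. `lm_loss` is a hypothetical helper, and refinements such as label smoothing are left out.

```python
import math
from typing import Sequence


def lm_loss(step_logits: Sequence[Sequence[float]], target_ids: Sequence[int]) -> float:
    """Mean next-token cross-entropy.

    step_logits[t] holds the vocabulary logits the decoder produced at position t
    (conditioned on the image and on the ground-truth tokens before t), and
    target_ids[t] is the index of the correct token at that position.
    """
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[target]  # negative log-probability of the target
    return total / len(target_ids)


# With uniform logits over a 4-token vocabulary, every token has probability 1/4.
uniform = lm_loss([[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]], [1, 2])
# A decoder that strongly prefers the correct tokens gets a much lower loss.
confident = lm_loss([[0.0, 9.0, 0.0, 0.0], [0.0, 0.0, 9.0, 0.0]], [1, 2])
```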
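The two encoder modes share one set of self-attention weights, and modes 2 and 3 share their cross-attention and feed-forward weights. The decoder is the exception: in the BLIP paper it keeps its own self-attention layers, on the reasoning that the difference between bidirectional encoding and causal decoding is best captured there. The modes therefore differ in whether cross-attention is active (off in mode 1, on in modes 2 and 3) and in the self-attention pattern (bidirectional in modes 1 and 2, causal in mode 3).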
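The bidirectional and causal self-attention patterns can be written as boolean masks, where entry `[q][k]` says whether query position `q` may attend to key position `k`. The sketch below is illustrative only; the helper names and the `MODES` table are hypothetical and simply restate the three configurations described above.

```python
from typing import List


def bidirectional_mask(n: int) -> List[List[bool]]:
    """Every token may attend to every token (encoder modes)."""
    return [[True] * n for _ in range(n)]


def causal_mask(n: int) -> List[List[bool]]:
    """Token q may attend only to itself and earlier tokens (decoder mode)."""
    return [[k <= q for k in range(n)] for q in range(n)]


# Which self-attention pattern and which extra layers each mode uses.
MODES = {
    "unimodal_encoder":       {"self_attn": "bidirectional", "cross_attn": False, "loss": "ITC"},
    "image_grounded_encoder": {"self_attn": "bidirectional", "cross_attn": True,  "loss": "ITM"},
    "image_grounded_decoder": {"self_attn": "causal",        "cross_attn": True,  "loss": "LM"},
}

for row in causal_mask(4):
    print(" ".join("x" if allowed else "." for allowed in row))
```

The printed causal mask is lower-triangular: the first row allows only position 0, the last row allows all four positions.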
Understanding and generation, unified
The same pretrained BLIP model can be deployed for two qualitatively different tasks without any modification to the weights:
Understanding — Image-Text Matching (ITM). Given an image and a candidate caption, the model scores how well they match. Internally it activates the image-grounded encoder (mode 2) and reads out the ITM probability from the [Encode] token. This is used for retrieval: re-rank a shortlist of candidates and return the best-scoring pair.
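The re-ranking step can be sketched as a two-stage search. The example assumes the shortlist comes from a cheap embedding similarity (the ITC score) and that the more expensive ITM score is computed only for the shortlisted candidates. `retrieve_captions`, `itc_score`, `itm_score`, and the toy scores are hypothetical stand-ins for real model calls.

```python
from typing import Callable, List, Sequence


def retrieve_captions(
    image: str,
    captions: Sequence[str],
    itc_score: Callable[[str, str], float],  # cheap: embedding similarity
    itm_score: Callable[[str, str], float],  # costly: cross-attention match probability
    k: int = 2,
) -> List[str]:
    """Shortlist the top-k captions by ITC score, then re-rank them by ITM score."""
    shortlist = sorted(captions, key=lambda c: itc_score(image, c), reverse=True)[:k]
    return sorted(shortlist, key=lambda c: itm_score(image, c), reverse=True)


# Toy scores: ITC slightly prefers the wrong "park" caption; ITM corrects it.
toy_itc = {"a dog on a beach": 0.80, "a dog in a park": 0.90, "a red bicycle": 0.10}
toy_itm = {"a dog on a beach": 0.95, "a dog in a park": 0.40, "a red bicycle": 0.01}

ranked = retrieve_captions(
    "img1",
    list(toy_itc),
    itc_score=lambda image, caption: toy_itc[caption],
    itm_score=lambda image, caption: toy_itm[caption],
    k=2,
)
```

The design rationale is cost: the ITC score needs only one encoder pass per image and per caption, while the ITM score requires a joint pass for every image-caption pair, so it is reserved for the shortlist.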
Generation — Image Captioning (LM). Given an image with no text, the model runs the image-grounded decoder (mode 3) autoregressively, emitting caption tokens one at a time until an end-of-sequence token is produced. The caption is grounded in the image features through cross-attention at every decoding step.
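The decoding loop itself is simple. The sketch below uses greedy decoding for clarity, which is a simplification (captioning systems often use beam search or sampling instead). `greedy_caption` and `next_token_fn` are hypothetical: `next_token_fn` stands in for one forward pass of the image-grounded decoder, and the `[Decode]` and `[EOS]` strings are placeholder start and end markers.

```python
from typing import Callable, List


def greedy_caption(
    next_token_fn: Callable[[List[str]], str],  # hypothetical: one decoder step
    start_token: str = "[Decode]",
    end_token: str = "[EOS]",
    max_len: int = 20,
) -> List[str]:
    """Emit one token at a time until the end token or the length limit."""
    tokens = [start_token]
    for _ in range(max_len):
        token = next_token_fn(tokens)  # conditioned on the image and tokens so far
        if token == end_token:
            break
        tokens.append(token)
    return tokens[1:]  # drop the start token


# Toy "decoder" that replays a fixed caption regardless of the image.
script = ["a", "dog", "on", "a", "beach", "[EOS]"]
caption = greedy_caption(lambda tokens: script[len(tokens) - 1])
print(" ".join(caption))  # prints "a dog on a beach"
```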
No separate model is needed for each task. The pre-training objective is designed so that modes 2 and 3 share cross-attention weights: the same cross-attention that learns to attend to image patches for matching also learns to attend for generation. This weight sharing is the reason a single pre-training run produces a model capable of both.
Why it mattered
BLIP was significant for two reasons. First, it was an early and influential demonstration that understanding and generation can be unified in a single pre-training framework with shared weights. CLIP excels at understanding but cannot generate; pure captioning models generate but are not trained to score image-text matches. BLIP showed that a single MED architecture can do both, and do both well.
Second, CapFilt showed that the model's own outputs can be used to improve the quality of its training data. The captioner and filter are derived from the same model being trained, creating a bootstrapping loop that does not require external annotation. This principle — using model outputs to curate training data — has become a standard technique in modern multimodal training pipelines.
BLIP set the stage for BLIP-2, which replaced the end-to-end MED pre-training with a frozen ViT backbone bridged to a frozen LLM via a lightweight Q-Former, dramatically reducing training cost while inheriting BLIP's multi-objective training strategy.
Related Reading
- BLIP-2: Efficient Vision-Language Pre-training — the direct successor, which keeps a pretrained image encoder and a large language model frozen and trains only a lightweight Q-Former to bridge them, sharply reducing the number of trainable parameters
- CLIP: Learning Transferable Visual Models From Natural Language Supervision — the contrastive pre-training framework BLIP extends with a generation objective and a data-cleaning pipeline
- CoCa — a contemporaneous model that also adds a captioning loss on top of contrastive learning, arriving at a similar unification from a different architecture
