TL;DR
BLIP-2 demonstrates that you do not need to train a vision-language model end-to-end from scratch. Instead, freeze a pre-trained image encoder (ViT) and a pre-trained large language model (OPT or FlanT5), then train only a lightweight Querying Transformer (Q-Former) to bridge them. The Q-Former uses 32 learnable query tokens and cross-attention to compress visual information into a fixed-length representation the LLM can consume. The result: BLIP-2 matches or exceeds models with 54x more trainable parameters on VQA, image captioning, and image-text retrieval — the bridging module itself has fewer than 190M trainable parameters.
The Compute Problem
Training large vision-language models end-to-end is expensive. Flamingo (80B parameters), CoCa, and PaLI all require training both a vision encoder and a language model jointly on hundreds of millions of image-text pairs. This means billions of parameters are updated at every gradient step, demanding thousands of GPU hours and proprietary datasets. Flamingo's training on 2.3B image-text pairs with a model of that scale is simply out of reach for most research labs.
The cost compounds because both the vision and language components are large. A ViT-g image encoder has ~1B parameters. An LLM like OPT-6.7B has 6.7B. Training both together means back-propagating through ~8B parameters at every step, with the memory and compute costs that implies. And if a better LLM comes along next month, the entire training process must be repeated.
BLIP-2 asks a different question: can we reuse the capabilities already learned by frozen unimodal models and just learn to connect them? The image encoder already understands visual features. The LLM already understands language. The missing piece is a translation layer that solves the vision-language alignment problem between the two modalities — and that layer can be small.
This modular approach reduces trainable parameters by more than two orders of magnitude. Where Flamingo trains ~80B parameters, BLIP-2 trains ~188M (the Q-Former) while keeping the image encoder (~1B ViT-g) and LLM (~3–11B) frozen. Pre-training the largest configuration takes a single 16-A100 machine roughly 6 days for the first stage and under 3 days for the second — a fraction of what end-to-end training demands.
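The parameter arithmetic behind this claim can be made concrete. The sketch below uses the approximate component sizes quoted above (illustrative round numbers, not exact counts from released checkpoints) to show what fraction of the full model actually receives gradient updates:

```python
# Rough trainable-parameter budget for BLIP-2-style training.
# Component sizes are illustrative approximations from the text above.
frozen = {"vit_g_image_encoder": 1.0e9, "opt_6_7b_llm": 6.7e9}
trainable = {"q_former_and_projection": 188e6}

total_frozen = sum(frozen.values())
total_trainable = sum(trainable.values())
fraction = total_trainable / (total_frozen + total_trainable)

# Only the Q-Former side is updated: about 2–3% of all parameters.
print(f"trainable fraction: {fraction:.1%}")
```

Freezing the other ~97% is what makes the memory and compute budget tractable: no optimizer state and no gradients are kept for the ViT or the LLM.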
The Q-Former Architecture
The Q-Former is the core contribution. It is a lightweight transformer that sits between the frozen image encoder and the frozen LLM, responsible for extracting task-relevant visual information and projecting it into a form the language model can process.
The architecture consists of two transformer submodules that share self-attention layers:
- Image transformer — interacts with the frozen image encoder through cross-attention layers. The cross-attention queries are a set of 32 learnable embedding vectors (the "query tokens"), each of dimension 768. These queries attend to the image encoder's output features to extract visual information.
- Text transformer — functions as both a text encoder and text decoder depending on the pre-training task. It shares self-attention parameters with the image transformer but does not share cross-attention layers (the text side has no cross-attention to image features directly).
The 32 query tokens are the key mechanism. Each query learns to attend to different aspects of the image through cross-attention:

$$\mathrm{Attention}(Q_{\text{query}}, K_{\text{image}}, V_{\text{image}}) = \mathrm{softmax}\!\left(\frac{Q_{\text{query}} K_{\text{image}}^{\top}}{\sqrt{d}}\right) V_{\text{image}}$$

where $Q_{\text{query}} \in \mathbb{R}^{32 \times 768}$ are the learnable queries and $K_{\text{image}}, V_{\text{image}}$ come from the frozen image encoder's output. The queries interact with each other through shared self-attention layers, allowing them to coordinate what visual information each one extracts.
The output is a fixed set of 32 vectors of dimension 768 — regardless of the image encoder's output size. This creates a fixed-length bottleneck that compresses the visual representation before it reaches the LLM.
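A minimal single-head sketch of this cross-attention step, in numpy, makes the fixed-length bottleneck visible. The patch count (257) and the 0.02 initialization scale are illustrative assumptions; the projection matrices stand in for the Q-Former's randomly initialized cross-attention weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 768            # Q-Former hidden size
num_queries = 32   # learnable query tokens
num_patches = 257  # ViT output length (illustrative)

# Learnable queries (trained) vs. frozen image features (from the ViT).
queries = rng.normal(size=(num_queries, d)) * 0.02
image_feats = rng.normal(size=(num_patches, d))

# Single-head cross-attention: queries attend to the image features.
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.02 for _ in range(3))
Q = queries @ W_q
K = image_feats @ W_k
V = image_feats @ W_v

attn = softmax(Q @ K.T / np.sqrt(d))   # (32, 257) attention weights
out = attn @ V                         # (32, 768) fixed-length output

assert out.shape == (num_queries, d)   # same size for any patch count
```

Whatever the number of image patches, the output is always 32 vectors of width 768 — that invariance is the bottleneck the surrounding text describes.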
Why 32 queries? The paper's ablation shows diminishing returns beyond 32: going from 16 to 32 queries improves VQAv2 accuracy, but 64 queries provides marginal additional gain. The fixed query count also means inference cost is constant regardless of image resolution — a practical advantage over methods that pass all visual tokens to the LLM.
The Q-Former is initialized from BERT-base weights. The cross-attention layers (which do not exist in BERT) are randomly initialized. This initialization gives the Q-Former a strong language understanding prior from the start, which helps with the text-side objectives during stage 1 pre-training.
Two-Stage Pre-training
BLIP-2 trains the Q-Former in two stages with different objectives. The image encoder is frozen throughout; in stage 2 the LLM is frozen as well.
Stage 1: Vision-Language Representation Learning. The Q-Former learns to extract visual representations aligned with text. The image encoder is frozen; only the Q-Former is trained. Three losses are applied jointly:
- Image-Text Contrastive Loss (ITC) — aligns the query output with the text embedding by maximizing the cosine similarity of matched image-text pairs and minimizing it for unmatched pairs. The highest-similarity query output is used as the image representation. Crucially, a unimodal self-attention mask prevents queries from seeing text tokens and vice versa, forcing each modality to form independent representations before comparison.
- Image-Text Matching Loss (ITM) — a binary classification task predicting whether an image-text pair is matched or unmatched. This uses a bidirectional self-attention mask so queries and text tokens can attend to each other, enabling fine-grained alignment. Hard negative mining selects the most confusing unmatched pairs.
- Image-Grounded Text Generation Loss (ITG) — trains the Q-Former to generate text conditioned on the image. The text transformer operates as a causal decoder: queries can attend to each other and to the image features, but text tokens can only attend to previous text tokens and to the query outputs. This forces the queries to capture all visual information needed for text generation.
The three objectives use different self-attention masking strategies (unimodal, bidirectional, causal) to extract complementary visual representations from the shared architecture. This design is inspired by BLIP's Multimodal Mixture of Encoder-Decoder (MED) but adapted to work with learnable queries rather than direct image features.
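The three masking strategies can be sketched as boolean attention masks over the joint sequence of query and text positions. This is an illustrative reconstruction of the masks the three losses apply in the shared self-attention layers (the sequence layout and text length are assumptions):

```python
import numpy as np

num_queries, num_text = 32, 8
n = num_queries + num_text   # joint self-attention sequence: [queries | text]

def make_mask(objective):
    """True = attention allowed. Sketch of the three masking strategies."""
    m = np.zeros((n, n), dtype=bool)
    q = slice(0, num_queries)
    t = slice(num_queries, n)
    if objective == "itc":      # unimodal: no cross-modal attention
        m[q, q] = True
        m[t, t] = True
    elif objective == "itm":    # bidirectional: everything sees everything
        m[:, :] = True
    elif objective == "itg":    # causal text decoding conditioned on queries
        m[q, q] = True                              # queries see each other
        m[t, q] = True                              # text sees all queries
        text_ids = np.arange(num_queries, n)
        m[text_ids[:, None], text_ids[None, :]] = (
            text_ids[:, None] >= text_ids[None, :]  # causal over text only
        )
    return m

itc, itm, itg = (make_mask(o) for o in ("itc", "itm", "itg"))
assert not itc[:num_queries, num_queries:].any()  # ITC: queries can't see text
assert itm.all()                                  # ITM: fully bidirectional
assert not itg[:num_queries, num_queries:].any()  # ITG: queries can't see text
```

Because all three masks run through the same shared transformer, switching the mask is enough to turn one architecture into a contrastive encoder, a matching head, or a causal captioner.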
Stage 1 trains on 129M image-text pairs (COCO, Visual Genome, CC3M, CC12M, SBU, and LAION-400M filtered subset) for 250K steps with batch size 2320.
Stage 2: Vision-to-Language Generative Learning. The Q-Former's output is connected to the frozen LLM through a linear projection layer that maps the 32 query outputs from dimension 768 to the LLM's input dimension $d_{\text{LLM}}$:

$$H_{\text{visual}} = Z\,W + b, \qquad Z \in \mathbb{R}^{32 \times 768},\; W \in \mathbb{R}^{768 \times d_{\text{LLM}}}$$
The projected queries are prepended to the text input embeddings as soft visual prompts. For decoder-only LLMs (OPT), the visual prompts are prepended and the model is trained with a language modeling loss to generate text conditioned on them. For encoder-decoder LLMs (FlanT5), the visual prompts are given to the encoder and the decoder generates the output via a prefix language modeling loss. Only the Q-Former and the projection layer are trained; the LLM remains frozen.
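The projection-and-prepend step is mechanically simple. The sketch below assumes an illustrative LLM embedding width of 4096 (roughly OPT-6.7B scale) and random placeholder weights; in BLIP-2 only `W` and `b` (plus the Q-Former producing `q_out`) would be trained:

```python
import numpy as np

rng = np.random.default_rng(0)

d_qformer, d_llm = 768, 4096   # d_llm is an illustrative assumption
num_queries, num_text = 32, 10

q_out = rng.normal(size=(num_queries, d_qformer))   # Q-Former output Z
W = rng.normal(size=(d_qformer, d_llm)) * 0.02      # trained projection
b = np.zeros(d_llm)

visual_prompts = q_out @ W + b                      # (32, d_llm) soft prompts

# Prepend to the frozen LLM's text token embeddings.
text_embeds = rng.normal(size=(num_text, d_llm))
llm_input = np.concatenate([visual_prompts, text_embeds], axis=0)
assert llm_input.shape == (num_queries + num_text, d_llm)
```

From the LLM's perspective the 32 projected vectors are just extra input embeddings, which is why the same mechanism works for both decoder-only and encoder-decoder LLMs.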
[Figure: visual comparison of the BLIP and BLIP-2 architectures — BLIP (Bootstrapping Language-Image Pre-training) versus BLIP-2 (Bootstrapping Language-Image Pre-training with Frozen Models). Component sizes are approximated; highlighted components indicate the key improvements in BLIP-2.]
Key Results
Visual Question Answering. On VQAv2, zero-shot BLIP-2 with FlanT5XXL reaches 65.0% accuracy — outperforming Flamingo (80B parameters) by 8.7 points while using 54x fewer trainable parameters. Fine-tuned on VQAv2, BLIP-2 achieves a test-dev accuracy of 82.2%. Zero-shot results on GQA follow the same trend; on OK-VQA, however, Flamingo80B remains ahead, which the paper attributes to its much larger LLM encoding more world knowledge.
Image Captioning. For captioning, BLIP-2 is fine-tuned on COCO and reaches a CIDEr of 145.8 on the Karpathy test split. Evaluated zero-shot on NoCaps validation after that COCO fine-tuning, CIDEr is 121.6, demonstrating strong generalization to novel object categories outside the training distribution.
Image-Text Retrieval. Fine-tuned on COCO, BLIP-2 achieves 85.4% Recall@1 for image-to-text and 68.3% for text-to-image. On Flickr30K zero-shot, the same model achieves 97.6% TR@1 and 89.7% IR@1, outperforming prior methods that were fine-tuned on the dataset. Retrieval is handled by the stage-1 Q-Former alone — no LLM involved — demonstrating that the learned queries capture sufficient visual-semantic alignment.
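The retrieval scoring just described follows the stage-1 ITC recipe: each image yields 32 query outputs, each caption one text embedding, and the pair's score is the best cosine similarity over the 32 queries. A sketch, assuming a shared 256-d projection space (the projection dimension and random vectors here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    # Cosine similarity between each row of `a` and the vector `b`.
    return (a @ b) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b))

# 32 query outputs per image; one text embedding per caption,
# both projected into an assumed shared 256-d space.
query_outputs = rng.normal(size=(32, 256))
text_embed = rng.normal(size=(256,))

# ITC-style score: the best-matching query represents the whole image.
sims = cosine(query_outputs, text_embed)   # (32,)
itc_score = sims.max()
assert sims.shape == (32,)
```

Taking the max lets different queries specialize — whichever query captured the caption-relevant region determines the score.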
Scaling Behavior. The paper reports consistent improvements when scaling both the image encoder (ViT-L to ViT-g) and the LLM (OPT-2.7B to OPT-6.7B, FlanT5-XL to FlanT5-XXL). This suggests that BLIP-2 can continue to benefit from improvements in either unimodal component without architectural changes.
Critical Analysis
Strengths:
- Parameter efficiency. Training only ~188M parameters (the Q-Former) while keeping ~1–11B parameters frozen reduces compute requirements by an order of magnitude compared to end-to-end methods. This makes large-scale multimodal models accessible with moderate hardware.
- Modularity. The frozen-component design means the image encoder and LLM can be swapped independently. The paper demonstrates this by pairing the same Q-Former training procedure with OPT (decoder-only) and FlanT5 (encoder-decoder) LLMs, and with ViT-L versus ViT-g image encoders.
- Works with any LLM. Because the Q-Former produces soft visual prompts, it can interface with any LLM that accepts token embeddings as input. As stronger LLMs become available, the same architecture can leverage them without retraining the vision pipeline.
Limitations:
- Information bottleneck. Compressing all visual information into 32 fixed-length query vectors is a deliberate trade-off. For tasks requiring fine-grained spatial reasoning — counting objects, understanding spatial relationships, reading small text — this bottleneck can lose critical detail. Later work (InstructBLIP, LLaVA) has experimented with increasing the number of visual tokens or passing image features directly.
- Q-Former complexity. The three-objective training procedure with different self-attention masking strategies (unimodal, bidirectional, causal) adds architectural complexity. Each loss requires a different attention mask configuration in the same transformer, making the training pipeline harder to reproduce and debug than simpler alternatives like linear projection (as in LLaVA).
- Two-stage training. The bootstrapping approach requires sequential training: stage 1 must complete before stage 2 begins. Single-stage alternatives are simpler to implement and tune.
- Hallucination. Like other vision-language models that rely on frozen LLMs, BLIP-2 can generate text that is fluent but not grounded in the image content. The frozen LLM's language priors can override visual evidence, producing plausible-sounding but incorrect descriptions. Because the LLM is frozen, there is no mechanism to train it to attend more carefully to the visual tokens versus its own text generation patterns.
- Frozen LLM ceiling. The quality of generated text is bounded by the frozen LLM's capabilities. If the LLM cannot reason about a particular type of question or follow a particular instruction format, no amount of Q-Former training will compensate. This became clearer as instruction-tuned LLMs improved rapidly after BLIP-2's release.
Impact and Legacy
BLIP-2 established the paradigm that dominates multimodal LLM design: freeze pre-trained unimodal models and train a lightweight connector. This pattern was adopted by virtually every major multimodal system that followed.
InstructBLIP (Dai et al. 2023) extended BLIP-2 by adding instruction tuning to the Q-Former, enabling instruction-following capabilities across diverse vision-language tasks without task-specific fine-tuning. LLaVA (Liu et al. 2023) simplified the connector further, replacing the Q-Former with a single linear projection layer and showing that visual instruction tuning could achieve strong results with an even simpler architecture. The tension between Q-Former-style compression and direct projection (LLaVA-style) remains an active area of research.
The broader impact is architectural: BLIP-2 demonstrated that vision-language capability is not about training the largest model end-to-end, but about efficiently bridging modalities. This insight has informed production multimodal systems such as GPT-4V and Gemini; while their architectures are not fully public, the prevailing pattern in the field has been some form of lightweight visual adapter rather than training vision and language jointly from scratch.
The Q-Former's learnable query mechanism also influenced work beyond vision-language models. The idea of using a small set of learned tokens to extract task-relevant information from a larger representation has appeared in video understanding (Video-LLaMA), audio-language models, and 3D vision systems. BLIP-2's demonstration that 32 query vectors can compress the visual world into a form an LLM can reason about remains one of the more surprising empirical findings in multimodal learning.
Related Reading
- CLIP: Learning Transferable Visual Models From Natural Language Supervision — the contrastive pre-training framework whose frozen ViT encoders serve as BLIP-2's visual backbone
- Attention Is All You Need — the transformer architecture underlying both the Q-Former and the frozen LLMs
- Vision Transformer — the ViT architecture used as the frozen image encoder in BLIP-2
- DETR: End-to-End Object Detection with Transformers — shares the concept of learnable queries that attend to visual features via cross-attention
- Segment Anything (SAM) — another modular vision foundation model that separates the encoder from the task-specific decoder
