CoCa: Contrastive Captioners are Image-Text Foundation Models

Jiahui Yu; Zirui Wang; Vijay Vasudevan; Legg Yeung; Mojtaba Seyedhosseini; Yonghui Wu

TL;DR

CoCa trains a single image-text model jointly with two losses — a contrastive loss (like CLIP) and an autoregressive captioning loss — in one forward pass, requiring no separate training stages.
The text decoder is split into a unimodal lower half (self-attention only, produces the text embedding for contrastive alignment) and a multimodal upper half (adds cross-attention to image features, produces captioning logits).
Task-specific attentional poolers convert the image encoder's patch-token sequence into either a single global vector (for the contrastive loss) or a longer sequence of fine-grained vectors (for the captioning decoder), so each objective gets a representation at the granularity it needs.
At publication (2022), the resulting foundation model reported state-of-the-art results on image classification, image-text retrieval, and image captioning, all from the same pretrained weights.

One model, two objectives

Most vision-language models commit to one training paradigm: contrastive models (like CLIP) learn to align images and text in a shared embedding space, while generative models learn to produce descriptive captions token by token. Each approach excels at different tasks: contrastive models shine at retrieval and zero-shot classification, while generative models enable open-ended captioning and VQA. CoCa asks: why choose?

The key insight is that both objectives can share the same forward pass. The image encoder produces a sequence of patch tokens once. The text decoder also runs once: its lower layers (unimodal, self-attention only) produce a text embedding that feeds the contrastive loss, while its upper layers (multimodal, with cross-attention to image features) produce token predictions that feed the captioning loss. The total training loss is a weighted sum: L_total = λ_con · L_con + λ_cap · L_cap. The paper sets λ_con = 1 and λ_cap = 2.
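For reference, the two terms of the weighted sum can be written out as follows. This is a sketch in this article's notation, under some stated assumptions: x_i and y_j denote the L2-normalised image and text embeddings of the i-th image and j-th caption in a batch of N pairs, σ is the softmax temperature, and w_t is the t-th caption token (a symbol chosen here to avoid clashing with the text embedding y).

```latex
\begin{aligned}
L_{\text{con}} &= -\frac{1}{N}\left(
    \sum_{i=1}^{N} \log \frac{\exp(x_i^{\top} y_i / \sigma)}{\sum_{j=1}^{N} \exp(x_i^{\top} y_j / \sigma)}
  + \sum_{i=1}^{N} \log \frac{\exp(y_i^{\top} x_i / \sigma)}{\sum_{j=1}^{N} \exp(y_i^{\top} x_j / \sigma)}
\right) \\
L_{\text{cap}} &= -\sum_{t=1}^{T} \log P_{\theta}\left(w_t \mid w_{<t},\, x\right) \\
L_{\text{total}} &= \lambda_{\text{con}} \, L_{\text{con}} + \lambda_{\text{cap}} \, L_{\text{cap}}
\end{aligned}
```

The first sum in L_con is the image-to-text direction and the second is the text-to-image direction. L_cap is the usual autoregressive negative log-likelihood of the caption given the image.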

Computing both losses in a single forward pass means CoCa trains roughly as fast as a single-objective model while capturing the benefits of both. There is no alternation between contrastive and generative batches, no two-stage pretraining schedule, and no separate model to maintain for each downstream task type.

Decoupled decoder

The architecture that makes the dual objective possible is the decoupled text decoder. A standard captioning decoder applies cross-attention to image features at every layer, so every text representation it produces depends on the image. The contrastive loss needs the opposite: a text embedding computed independently of the image, so that it can be compared against the image embedding. CoCa resolves this by omitting cross-attention in the lower half of the decoder, which uses only causally masked self-attention and therefore yields a pure text representation. The upper half adds cross-attention, fusing image information so that tokens are generated conditioned on what the image contains.
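The layer layout can be sketched with toy stand-ins. This is an illustration only, not the paper's implementation: `causal_self_attention` and `cross_attention` below are hypothetical helpers that use uniform attention weights and no learned parameters, and the layer counts are arbitrary. The point is structural: the text embedding is read off before any layer sees the image.

```python
def causal_self_attention(tokens):
    """Toy uniform causal attention: position t becomes the mean of positions 0..t."""
    out = []
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]
        out.append([sum(col) / len(prefix) for col in zip(*prefix)])
    return out


def cross_attention(tokens, image_feats):
    """Toy uniform cross-attention: add the mean image feature to every text position."""
    mean_img = [sum(col) / len(image_feats) for col in zip(*image_feats)]
    return [[x + m for x, m in zip(tok, mean_img)] for tok in tokens]


def decoupled_decoder(text_tokens, image_feats, n_uni=2, n_multi=2):
    """Return (text_embedding, caption_states) from a single pass over the text."""
    h = text_tokens
    for _ in range(n_uni):        # unimodal half: text only, no image access
        h = causal_self_attention(h)
    text_embedding = h[-1]        # last position stands in for the [CLS] token
    for _ in range(n_multi):      # multimodal half: self-attention, then cross-attention
        h = causal_self_attention(h)
        h = cross_attention(h, image_feats)
    return text_embedding, h


text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # last row stands in for [CLS]
img_a = [[0.5, 0.5], [1.0, 0.0]]
img_b = [[-1.0, 2.0], [3.0, 3.0]]

emb_a, cap_a = decoupled_decoder(text, img_a)
emb_b, cap_b = decoupled_decoder(text, img_b)
```

Swapping the image changes the caption states but leaves the text embedding untouched, which is the property the contrastive loss relies on.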

At training time, a learnable [CLS] token is appended to the end of the text sequence, so that under causal masking it can attend to every preceding token. The unimodal half's final [CLS] representation is projected and compared against the image embedding under the contrastive loss. The multimodal half's outputs are compared against ground-truth tokens under the standard cross-entropy captioning loss. Both paths share the same decoder weights up to the split point, so language representations are learned jointly across both objectives.
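A minimal, dependency-free sketch of how the two losses could be computed and combined is shown below. Several details are assumptions made for illustration: the function names are hypothetical, the temperature is fixed at 0.07 (a real model may learn it), the embeddings and logits are hand-written toy values rather than model outputs, and the captioning loss is summed over positions (averaging is a common variant).

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Image-to-text plus text-to-image cross-entropy over a batch of pairs."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j] = temperature-scaled cosine similarity of image i and text j
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]
    sim_t = [list(col) for col in zip(*sim)]
    i2t = sum(log_softmax(sim[i])[i] for i in range(n))
    t2i = sum(log_softmax(sim_t[i])[i] for i in range(n))
    return -(i2t + t2i) / n


def captioning_loss(logits, targets):
    """Next-token cross-entropy; logits[t] scores the vocabulary at step t."""
    return -sum(log_softmax(row)[tok] for row, tok in zip(logits, targets))


def coca_loss(image_embs, text_embs, logits, targets,
              lambda_con=1.0, lambda_cap=2.0):
    """Weighted sum of the two objectives."""
    return (lambda_con * contrastive_loss(image_embs, text_embs)
            + lambda_cap * captioning_loss(logits, targets))


# Toy batch of two image-text pairs and one two-token caption.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[0.9, 0.1], [0.2, 0.8]]
logits = [[2.0, 0.1, 0.1], [0.1, 2.0, 0.1]]
targets = [0, 1]
loss = coca_loss(image_embs, text_embs, logits, targets)
```

With correctly paired embeddings the contrastive term is small; reversing the order of `text_embs` so the pairs no longer match makes it larger.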

The split depth is a hyperparameter. In the paper, roughly half the decoder layers are unimodal and half are multimodal. Shallower splits give the contrastive branch fewer layers to refine a clean text embedding; deeper splits leave fewer layers for cross-modal fusion. The chosen split balances alignment quality against generation quality on downstream benchmarks.

Attentional pooling

The image encoder (a ViT) outputs a sequence of patch tokens — rich spatial information, but not a single vector the contrastive loss can compare against a text CLS token. CoCa introduces task-specific attentional poolers to bridge this gap without collapsing information prematurely.

An attentional pooler is a small cross-attention module with a fixed set of learned query vectors. The queries attend over all patch tokens, and the pooler outputs one vector per query. For the contrastive objective, a single query produces one global summary vector. For the generative (captioning) objective, many queries (256 in the paper) produce a richer sequence of visual features that the multimodal decoder layers can cross-attend to with higher fidelity.
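The pooling step can be sketched as single-head cross-attention in plain Python. This is a simplified illustration under stated assumptions: `attentional_pool` is a hypothetical helper, the key and value projections and multiple heads of a real attention layer are omitted, and the "learned" queries are just randomly initialised stand-ins.

```python
import math
import random


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attentional_pool(patch_tokens, queries):
    """Each query returns an attention-weighted mean of the patch tokens."""
    dim = len(queries[0])
    pooled = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, p)) / math.sqrt(dim)
                  for p in patch_tokens]
        weights = softmax(scores)
        pooled.append([sum(w * p[d] for w, p in zip(weights, patch_tokens))
                       for d in range(dim)])
    return pooled


rng = random.Random(0)
dim, n_patches = 4, 6
patch_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_patches)]

# Random stand-ins for learned queries: one for contrastive, several for generative.
contrastive_queries = [[rng.gauss(0, 1) for _ in range(dim)]]
generative_queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)]

global_vec = attentional_pool(patch_tokens, contrastive_queries)    # 1 vector
caption_feats = attentional_pool(patch_tokens, generative_queries)  # 3 vectors
```

The number of output vectors equals the number of queries, independent of how many patch tokens the encoder emits, which is what lets the same encoder output serve both objectives.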

Using separate poolers per objective means neither task is forced to compromise. The contrastive pooler can specialise in producing compact, discriminative global representations. The generative pooler can retain spatial granularity. Both attend over the exact same patch-token sequence from the image encoder, so there is no duplication of image computation.

Why it mattered

Before CoCa, achieving top results across retrieval, classification, and captioning typically required separate models or multi-stage pipelines: first pretrain a contrastive model, then attach and fine-tune a captioning head. CoCa showed that a single pretraining recipe — one encoder, one decoupled decoder, two poolers, two losses, one forward pass — produces a foundation model competitive with or better than specialized systems on all three task families simultaneously.

This unification matters in practice. A single set of pretrained weights can be fine-tuned for image captioning (adapt the multimodal decoder), for image-text retrieval (use the contrastive embeddings directly), or for image classification (linear probe on the contrastive image embedding). No architectural changes are needed between tasks; only the fine-tuning head or objective changes.
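As an illustration of using the contrastive embeddings directly, retrieval reduces to ranking candidates by cosine similarity. The sketch below assumes the embeddings have already been produced by the model; the vectors, the captions in the comments, and the helper names are all hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_captions(image_emb, caption_embs):
    """Return caption indices sorted from most to least similar to the image."""
    scores = [cosine(image_emb, c) for c in caption_embs]
    return sorted(range(len(caption_embs)), key=lambda i: scores[i], reverse=True)


image_emb = [0.9, 0.1, 0.0]
caption_embs = [
    [0.0, 1.0, 0.0],   # hypothetical embedding of "a photo of a cat"
    [1.0, 0.0, 0.0],   # hypothetical embedding of "a photo of a dog"
    [0.5, 0.5, 0.0],   # hypothetical embedding of "a cat and a dog"
]
ranking = rank_captions(image_emb, caption_embs)
```

Zero-shot classification follows the same pattern, with one text embedding per class prompt and the top-ranked prompt taken as the prediction.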

CoCa also demonstrated that the generative captioning loss acts as a useful regulariser for the contrastive objective and vice versa: learning to generate captions forces the model to retain compositional image content, while the alignment loss anchors text representations to visual semantics. The combination yields stronger representations than either objective alone at equivalent model scale.

Related Reading

CLIP: the contrastive pretraining paradigm that CoCa extends with a joint captioning objective
SigLIP: a successor contrastive approach that replaces the softmax-based contrastive loss with a pairwise sigmoid loss, removing the need for batch-wide normalisation
BLIP-2: bridges a frozen image encoder and a frozen large language model with a lightweight Querying Transformer, the only component it trains; its first pretraining stage combines contrastive, matching, and generative image-text objectives
