Flamingo: a Visual Language Model for Few-Shot Learning

Jean-Baptiste Alayrac; Jeff Donahue; Pauline Luc; Antoine Miech; Iain Barr; Yana Hasson; Karel Lenc; Arthur Mensch; Katherine Millican; Malcolm Reynolds; Roman Ring; Eliza Rutherford; Serkan Cabi; Tengda Han; Zhitao Gong; Sina Samangooei; Marianne Monteiro; Jacob Menick; Sebastian Borgeaud; Andrew Brock; Aida Nematzadeh; Sahand Sharifzadeh; Mikolaj Binkowski; Ricardo Barreira; Oriol Vinyals; Andrew Zisserman; Karen Simonyan

TL;DR

Flamingo bridges a frozen NFNet vision encoder and a frozen Chinchilla language model by training only two bridging components: a Perceiver Resampler and gated cross-attention layers inserted between frozen LM blocks.
The Perceiver Resampler compresses a variable number of image feature tokens into a fixed 64-latent representation, decoupling the vision encoder's output resolution from the LM's context budget.
Gated cross-attention layers (GATED XATTN-DENSE) use a tanh gate initialized to zero so the augmented model starts identical to the pretrained LM and gradually learns how much visual information to admit.
Interleaved image-text sequences with per-image attention masking enable few-shot in-context learning: prepend a handful of image-question-answer examples and Flamingo answers a new query image without any weight updates.

Frozen backbones, learned bridges

Training a vision-language model from scratch requires enormous data and compute. Flamingo takes a different path: keep two powerful pretrained models, a large vision encoder and a large language model, completely frozen, and train only the components that connect them. The vision encoder is a Normalizer-Free ResNet (NFNet) pretrained with a contrastive objective on a large image-text dataset. The language model comes from the Chinchilla family: the largest Flamingo builds on the 70B Chinchilla model, and the smaller Flamingo variants build on smaller models from the same family. All of them are pretrained purely on text.

The core challenge is that image features are spatially dense (a high-resolution image may produce hundreds or thousands of patch-level feature vectors), while a language model can only attend to a bounded number of tokens efficiently. Flamingo solves this with the Perceiver Resampler: a small transformer module that holds a fixed set of 64 learned latent queries. These queries cross-attend to however many feature vectors the vision encoder produces for an image (or for the frames of a video), and the module outputs exactly 64 visual tokens per image regardless of input resolution or frame count. This fixed-size output is what the LM receives, making the bridge resolution-agnostic.
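The fixed-size property can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a single attention head, a single layer, no learned query/key/value projections, no feed-forward sublayer, and random numbers in place of learned latents and real image features. The names resample and fake_features are hypothetical helpers introduced here.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def resample(latents, features):
    """One cross-attention step: every latent query attends over all features.

    Assumption: features act directly as keys and values (no projections).
    The output has one row per latent, however many features come in.
    """
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in latents:
        weights = softmax([dot(q, k) * scale for k in features])
        mixed = [sum(w * v[d] for w, v in zip(weights, features))
                 for d in range(dim)]
        out.append([qd + md for qd, md in zip(q, mixed)])  # residual connection
    return out

rng = random.Random(0)
DIM, NUM_LATENTS = 8, 64

# Stand-in for the 64 learned latent queries (random here, learned in practice).
latents = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_LATENTS)]

def fake_features(n):
    """Hypothetical stand-in for n feature vectors from the vision encoder."""
    return [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(n)]

small = resample(latents, fake_features(49))    # e.g. a low-resolution image
large = resample(latents, fake_features(400))   # e.g. a high-resolution image
print(len(small), len(large))  # prints: 64 64
```

Whether the encoder emits 49 or 400 feature vectors, the LM-facing output stays at 64 rows, which is the property the paragraph above relies on.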

Because neither backbone is ever updated, Flamingo's training is efficient: only the Perceiver Resampler and the new cross-attention layers need gradients. The frozen LM retains all its language capabilities, and the frozen vision encoder retains all its visual representations. The learned bridges act as a translator between the two modalities.

Gated cross-attention

Getting visual information into the language model requires injecting new cross-attention layers, since the frozen LM's self-attention layers were never designed to attend to visual tokens. Flamingo inserts new GATED XATTN-DENSE blocks between the frozen LM layers, either at every layer or, to save parameters and compute, only at a subset of layers. Each new block performs cross-attention between the current text hidden states (queries) and the visual latents from the Perceiver Resampler (keys and values), followed by a dense feed-forward network.

The critical design detail is the tanh gate. Each GATED XATTN-DENSE block multiplies the output of its new sublayers by tanh(α) before adding it to the residual stream, where α is a learned scalar initialized to zero (the cross-attention and the dense feed-forward sublayer each get their own α). At initialization, tanh(0) = 0, so the new sublayers contribute nothing and the model is mathematically equivalent to the original pretrained LM. This is essential for stable training: the model starts at a known-good state (the pretrained LM) and only opens the gate as gradients push it to. Without this initialization, the outputs of the randomly initialized cross-attention layers would corrupt the pretrained LM's representations from the first step.
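A small sketch makes the zero-initialization argument concrete. It assumes one hidden vector for a single token, separate gate scalars for the cross-attention and feed-forward sublayers, and arbitrary stand-in functions (noisy_attn, noisy_ffw) in place of freshly initialized layers. All of these names are hypothetical.

```python
import math

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def scale(v, s):
    return [s * x for x in v]

def gated_block(hidden, cross_attn, feed_forward, alpha_attn, alpha_ffw):
    """Residual update with tanh gates on both new sublayers.

    hidden: list of floats for one token.
    cross_attn / feed_forward: callables standing in for the new layers.
    """
    hidden = add(hidden, scale(cross_attn(hidden), math.tanh(alpha_attn)))
    hidden = add(hidden, scale(feed_forward(hidden), math.tanh(alpha_ffw)))
    return hidden

# Stand-ins for untrained layers: their outputs are large and meaningless.
noisy_attn = lambda h: [10.0 * x + 3.0 for x in h]
noisy_ffw = lambda h: [-7.0 * x + 1.0 for x in h]

h = [0.5, -1.0, 2.0]
at_init = gated_block(h, noisy_attn, noisy_ffw, 0.0, 0.0)  # gates closed
opened = gated_block(h, noisy_attn, noisy_ffw, 0.1, 0.1)   # gates slightly open

print(at_init == h)  # prints: True
print(opened != h)   # prints: True
```

With both gate scalars at zero the block is an exact identity, however badly behaved the new sublayers are, so the frozen LM's behavior is untouched at the first step.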

As training proceeds, each block's gating parameter α moves away from zero, so tanh(α) grows in magnitude and more visual signal enters the residual stream. Different blocks can open their gates at different rates, learning at which LM depth visual information is most useful. The frozen self-attention and feed-forward layers in the original LM are never touched.

Few-shot in-context learning

The most distinctive capability Flamingo inherits from its frozen LM backbone is in-context learning: the ability to adapt to a new task from a few examples in the prompt, without any gradient updates. Flamingo extends this to the multimodal setting by training on a mixture of interleaved image-text sequences scraped from web pages and paired image-text and video-text datasets.

During training, each sequence alternates image and text segments: an image (represented to the LM by its 64 visual latents), followed by the caption or dialogue turn that refers to it, followed by the next image, and so on. The key architectural detail is per-image attention masking: in the cross-attention layers, each text token can attend only to the visual latents of the single image that most recently precedes it in the sequence. Because later images are never visible, the left-to-right causal structure required for autoregressive generation is preserved. Information about earlier images is not lost, since it can still reach a token indirectly through the LM's causal self-attention over the preceding text.
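The masking rule can be sketched as a function that builds a boolean cross-attention mask. The "<image>" placeholder token, the function name, and the choice to let the placeholder itself attend to its own image are assumptions made for illustration. The example uses 2 latents per image instead of 64 so the mask is easy to read.

```python
def per_image_cross_attention_mask(tokens, latents_per_image=64):
    """Return mask[t][j]: may text token t attend to visual latent j?

    tokens: list of strings where "<image>" marks an image position.
    Visual latents are laid out image by image: image 0 occupies columns
    [0, latents_per_image), image 1 the next block, and so on.
    """
    num_images = sum(1 for tok in tokens if tok == "<image>")
    total = num_images * latents_per_image
    mask = []
    current = -1  # index of the most recent image; -1 means none seen yet
    for tok in tokens:
        if tok == "<image>":
            current += 1
        row = [False] * total
        if current >= 0:
            start = current * latents_per_image
            for j in range(start, start + latents_per_image):
                row[j] = True
        mask.append(row)
    return mask

tokens = ["<image>", "A", "cat", "<image>", "A", "dog"]
mask = per_image_cross_attention_mask(tokens, latents_per_image=2)
for tok, row in zip(tokens, mask):
    print(f"{tok:8s}", "".join("1" if allowed else "0" for allowed in row))
# The first three rows print 1100 (first image only),
# the last three rows print 0011 (second image only).
```

Text that appears before any image gets an all-False row, meaning it receives no visual input from cross-attention at all.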

At inference, few-shot prompting is simple: prepend some number of (image, question, answer) demonstrations before the query image and question. Flamingo's per-image masking ensures each demonstration is grounded to its own image. Because every text token cross-attends to only one image at a time, the number of images in the prompt is not baked into the architecture: the paper reports training with only a handful of images per sequence, yet the same forward pass handles 0-shot, 1-shot, 4-shot, and 32-shot prompts with no task-specific training.
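Assembling such a prompt might look like the sketch below. The "<image>" and "<EOC>" markers, the "Question: ... Answer: ..." template, and the file names are illustrative assumptions, not a confirmed input format. The function only builds the interleaved text and the matching image list and does not call any model.

```python
def build_few_shot_prompt(demos, query_image, query_question):
    """Interleave demonstrations and the query into one prompt.

    demos: list of (image, question, answer) tuples.
    Returns (images, text), where the i-th "<image>" marker in text
    corresponds to images[i].
    """
    images, parts = [], []
    for image, question, answer in demos:
        images.append(image)
        parts.append(f"<image>Question: {question} Answer: {answer}<EOC>")
    # The query ends with an open "Answer:" for the model to complete.
    images.append(query_image)
    parts.append(f"<image>Question: {query_question} Answer:")
    return images, "".join(parts)

demos = [
    ("cat.jpg", "What animal is this?", "A cat."),
    ("bus.jpg", "What color is the bus?", "Red."),
]
images, text = build_few_shot_prompt(demos, "dog.jpg", "What animal is this?")
print(len(images), text.count("<image>"))  # prints: 3 3
print(text)
```

Changing the number of shots only changes the length of demos: a zero-shot prompt is the same call with an empty list.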

Why it mattered

Flamingo demonstrated that a frozen-backbone bridging strategy could produce a competitive general-purpose vision-language model. Before Flamingo, a common approach was to train or fine-tune vision and language components jointly end-to-end, which is expensive and risks degrading the knowledge already encoded in pretrained models. Flamingo showed that the two modalities could be connected via learned bridges while preserving both backbones intact.

The gated cross-attention design became an influential template: the gate-to-zero initialization trick reappears in later work as a way to safely augment pretrained transformers. The Perceiver Resampler's idea of compressing variable-length visual features to a fixed token budget is echoed in BLIP-2's Q-Former. And Flamingo's interleaved sequence format, with per-image masking, became a widely used way to think about multimodal context windows.

LLaVA later showed that instruction tuning on a curated set of visual dialogues, rather than web-scale interleaved sequences, could produce strong instruction-following behavior with far less data. It kept part of Flamingo's recipe, reusing a pretrained vision encoder and a pretrained LM joined by a small learned connector, although LLaVA also fine-tunes the LM during instruction tuning rather than keeping it frozen. Flamingo was the proof of concept that made that research direction credible.

BLIP-2. Bridges a frozen image encoder and a frozen LLM with a Q-Former, a direct descendant of Flamingo's Perceiver Resampler idea, adding a two-stage training procedure for more efficient alignment.
LLaVA: Visual Instruction Tuning. Shows that instruction-tuning a vision-language connector together with the LM on GPT-4-generated visual dialogues achieves strong multimodal reasoning with far less data than Flamingo's web-scale interleaved training.
CLIP. The contrastive vision-language pretraining approach that inspired Flamingo's vision encoder and demonstrated that large-scale image-text pairing produces powerful transferable visual representations.
Attention Is All You Need. The transformer architecture that underlies most of Flamingo: the Perceiver Resampler's cross-attention, the gated cross-attention layers, and the frozen LM backbone. The NFNet vision encoder is the exception, since it is a convolutional network.