Qwen2-VL: Vision-Language Perception at Any Resolution

Peng Wang; Shuai Bai; Sinan Tan; Shijie Wang; Zhihao Fan; Jinze Bai; Keqin Chen; Xuejing Liu; Jialin Wang; Wenbin Ge; Yang Fan; Kai Dang; Mengfei Du; Xuancheng Ren; Rui Men; Dayiheng Liu; Chang Zhou; Jingren Zhou; Junyang Lin

TL;DR

Qwen2-VL introduces naive dynamic resolution: images are encoded at their native aspect ratio and resolution, producing a variable number of visual tokens rather than forcing every image into a fixed 224×224 or 448×448 grid — more image content means more tokens, less means fewer.
M-RoPE (Multimodal Rotary Position Embedding) decomposes each token's position into three independent components — temporal, height, and width — so one rotary scheme handles text, images, and video frames without separate encoders or position tables.
A single Qwen2 language model backbone processes all modalities: still images, sequences of images in one context, and long videos, all interleaved with text via the same token stream.
Qwen2-VL was released in three sizes (2B, 7B, 72B). The 72B model matched or exceeded GPT-4o and Claude 3.5 Sonnet on several multimodal benchmarks at the time of publication, setting a strong open-VLM baseline.

Naive dynamic resolution

Most vision-language models resize every input image to a fixed resolution before encoding. A thumbnail and a 4K photograph both become the same 224×224 tensor, which discards detail in the larger image and spends a full token budget on the smaller one. Qwen2-VL calls its alternative naive dynamic resolution: the ViT processes images at their native resolution by splitting them into 14×14 pixel patches, then merging every 2×2 group of adjacent patches into a single visual token. A 448×448 image produces 32×32 = 1,024 patch features, merged into 16×16 = 256 tokens. A 1,568×1,568 image produces 112×112 = 12,544 patch features, merged into 56×56 = 3,136 tokens. The token count is therefore proportional to image area: one quarter of the raw patch count.
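The arithmetic above can be checked with a short sketch. The patch size (14) and merge factor (2) come from the text; the function name `visual_token_count` and the choice to round each side up to a multiple of 28 pixels are illustrative assumptions, not the model's actual preprocessing.

```python
import math

PATCH = 14  # ViT patch side in pixels (from the text)
MERGE = 2   # each 2x2 group of patches becomes one visual token (from the text)


def visual_token_count(height: int, width: int) -> int:
    """Count visual tokens for an image of the given pixel size.

    Assumption: each side is rounded up to a multiple of PATCH * MERGE.
    The real preprocessing may resize or round differently.
    """
    unit = PATCH * MERGE               # 28 pixels covered by one token side
    grid_h = math.ceil(height / unit)  # token rows
    grid_w = math.ceil(width / unit)   # token columns
    return grid_h * grid_w


print(visual_token_count(448, 448))    # 16 x 16 tokens
print(visual_token_count(1568, 1568))  # 56 x 56 tokens
```

Under these assumptions the two calls reproduce the 256 and 3,136 token counts quoted above.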

This matters because detail survives intact. A dense document, a high-resolution photograph, or a wide-angle video frame can contribute enough tokens to convey its content. At the same time, a small thumbnail contributes only a handful of tokens, keeping inference efficient. The LLM receives however many visual tokens the image warrants, not a one-size-fits-all block.

M-RoPE: multimodal position

Standard rotary position embeddings assign each token a single 1D position index. That works for text, but images have two spatial dimensions and video has a third temporal dimension. Qwen2-VL introduces M-RoPE, which decomposes every token's position into three independent components: temporal t, height h, and width w.

For text tokens, all three components share the same incrementing integer, which effectively recovers 1D RoPE. For image tokens, t is fixed at a constant (e.g., 0) and (h, w) vary across the patch grid, encoding 2D spatial position. For video, t increments with each frame while (h, w) cycle through the patch grid within each frame. The three components are assigned to separate subsets of the attention head's rotary dimensions, and each subset is rotated by its own component, so the attention mechanism sees how far apart two tokens are along each axis independently. Text, images, and video share one unified position scheme with no modality-specific position tables, and the same rotary frequencies apply across all three.
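A minimal sketch of the (t, h, w) assignment described above, in plain Python. All function names are hypothetical, the section sizes `(2, 3, 3)` are toy values chosen for readability, and the sketch ignores how positions are offset when modalities follow one another in a real sequence.

```python
def text_positions(start, length):
    """Text tokens: all three components share one incrementing index."""
    return [(p, p, p) for p in range(start, start + length)]


def image_positions(grid_h, grid_w, t=0):
    """Image tokens: t is constant, (h, w) walk the token grid row by row."""
    return [(t, h, w) for h in range(grid_h) for w in range(grid_w)]


def video_positions(num_frames, grid_h, grid_w):
    """Video tokens: t increments per frame, (h, w) repeat within each frame."""
    return [pos
            for frame in range(num_frames)
            for pos in image_positions(grid_h, grid_w, t=frame)]


def mrope_angles(pos, sections=(2, 3, 3), base=10000.0):
    """Rotation angle for each rotary frequency pair of one token.

    The frequency pairs are split into three sections (toy sizes, an
    assumption); pairs in each section are rotated by t, h, or w.
    """
    t, h, w = pos
    n_pairs = sum(sections)
    # Which position component drives each frequency pair.
    component = [t] * sections[0] + [h] * sections[1] + [w] * sections[2]
    # Standard RoPE frequencies: base ** (-i / n_pairs) for pair i.
    return [component[i] * base ** (-i / n_pairs) for i in range(n_pairs)]


print(text_positions(3, 2))
print(image_positions(2, 2))
print(video_positions(2, 1, 2))
```

For a text token at position p, `mrope_angles((p, p, p))` gives the same angles as ordinary 1D RoPE at p, which is the sense in which text "recovers" the 1D scheme.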

One model for image, multi-image, and video

Dynamic resolution and M-RoPE are what let a single Qwen2 LLM handle every visual input type without routing logic. A single image becomes a block of visual tokens preceded by a <|vision_start|> sentinel and followed by <|vision_end|>, interleaved inline with the surrounding text prompt. A multi-image conversation simply appends additional vision blocks in sequence within the same context window. A video is flattened frame-by-frame into consecutive vision blocks, with the temporal M-RoPE axis distinguishing frames from one another.

The LLM sees one flat token sequence: text tokens, vision tokens, text tokens, vision tokens — all carrying consistent M-RoPE position IDs. There is no separate video encoder, no modality-specific MLP head, and no special routing at inference time. The model can therefore handle arbitrary interleavings of text and visual content, including referring back to an earlier image while answering a question about a later one.
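The flat sequence described above can be sketched as follows. The `<|vision_start|>` and `<|vision_end|>` sentinels come from the text; the `<|image_pad|>` placeholder, the `flatten` helper, and the whitespace split standing in for a real tokenizer are assumptions made for illustration.

```python
VISION_START = "<|vision_start|>"  # sentinel named in the text
VISION_END = "<|vision_end|>"      # sentinel named in the text
VISION_PAD = "<|image_pad|>"       # assumed placeholder for one visual token


def flatten(segments):
    """Turn interleaved text and vision segments into one token list.

    segments: list of ("text", str) or ("vision", int) pairs, where the
    int is the number of visual tokens in that image or frame.
    """
    tokens = []
    for kind, payload in segments:
        if kind == "text":
            # Whitespace split is a stand-in for a real tokenizer.
            tokens.extend(payload.split())
        elif kind == "vision":
            tokens.append(VISION_START)
            tokens.extend([VISION_PAD] * payload)
            tokens.append(VISION_END)
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return tokens


sequence = flatten([
    ("text", "Compare these"),
    ("vision", 2),   # first image, 2 visual tokens
    ("vision", 1),   # second image, 1 visual token
    ("text", "please"),
])
print(sequence)
```

A single image, several images, and video frames all go through the same branch here, which mirrors the claim that no modality-specific routing is needed.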

Why it mattered

Qwen2-VL set a high open-VLM bar at release. The 72B model matched GPT-4o on DocVQA, outperformed it on MathVista, and rivaled Claude 3.5 Sonnet on several other benchmarks — a rare achievement for an openly released model. But the architectural contributions are arguably more durable than the benchmark numbers.

Dynamic resolution means the model is no longer limited by a fixed token budget per image. As context windows grow, so does the practical resolution ceiling. M-RoPE makes position information genuinely multimodal: rather than stitching together separate 1D text positions and 2D image positions with hacks, every token in the model carries a coherent (t, h, w) address. Future models extending to 3D volumes, point clouds, or other spatiotemporal data have a clean position framework to build on.

Together, these two ideas — token counts that scale with content, and positions that decompose across modalities — form a principled foundation for any-resolution, any-modality perception in a single language model.

Related Reading

LLaVA: Visual Instruction Tuning. The seminal open VLM recipe of frozen vision encoder + linear projection + instruction tuning that Qwen2-VL builds on.
Flamingo. The pioneering few-shot VLM that introduced interleaved image–text sequences and cross-attention bridging.
CLIP: Learning Transferable Visual Models From Natural Language Supervision. The contrastive vision encoder whose ViT architecture underlies most modern VLM image encoders.
