Explore machine learning papers and reviews related to multimodal learning. Find insights, analysis, and implementation details.

2023
Visual Instruction Tuning
LLaVA paper: aligns LLMs with visual information through instruction tuning on image-text pairs, enabling multimodal understanding and reasoning.
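A minimal sketch of the alignment step described above, assuming a frozen CLIP-style vision encoder and a decoder-only LLM. The module name and dimensions here are hypothetical, but the design follows the paper: a learned projection maps frozen image features into the LLM's token-embedding space, and the resulting "visual tokens" are prepended to the instruction's text embeddings.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Project frozen vision-encoder patch features into the LLM's
    token-embedding space so they can be consumed as visual tokens.
    LLaVA v1 uses a single linear layer for this step."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, vision_dim) from a frozen encoder
        return self.proj(image_feats)  # (batch, num_patches, llm_dim)

# Hypothetical shapes: 256 patch features per image from a frozen encoder.
projector = VisionToLLMProjector()
image_feats = torch.randn(2, 256, 1024)   # frozen vision features
text_embeds = torch.randn(2, 32, 4096)    # LLM embeddings of the instruction

visual_tokens = projector(image_feats)
# The LLM attends over [visual tokens; instruction tokens] and is fine-tuned
# to produce the target response from each image-text instruction pair.
llm_input = torch.cat([visual_tokens, text_embeds], dim=1)  # (2, 288, 4096)
```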
Introducing BLIP-2, a vision-language model that connects a frozen image encoder to a frozen large language model through a lightweight Querying Transformer (Q-Former), improving training efficiency and performance across multimodal tasks.
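A rough sketch of the frozen-backbone recipe, with hypothetical names and dimensions: a small set of learned query vectors cross-attends to frozen image features, and only this lightweight bridge (a simplified stand-in for BLIP-2's Q-Former) is trained, while the image encoder and LLM stay frozen.

```python
import torch
import torch.nn as nn

class QueryBridge(nn.Module):
    """Simplified stand-in for BLIP-2's Q-Former: learned queries pull
    information out of frozen image features via cross-attention, then
    are projected into the frozen LLM's embedding space."""

    def __init__(self, num_queries=32, dim=768, llm_dim=2560, num_heads=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_llm = nn.Linear(dim, llm_dim)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, dim) from a frozen vision encoder
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(q, image_feats, image_feats)
        return self.to_llm(attended)  # (batch, num_queries, llm_dim)

bridge = QueryBridge()
image_feats = torch.randn(4, 257, 768)  # e.g. frozen ViT patch features
soft_prompt = bridge(image_feats)       # 32 "visual tokens" per image
# Only `bridge` receives gradients; encoder and LLM parameters stay frozen,
# which is where the efficiency gain comes from.
```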
CLIP explained: contrastive learning on 400M image-text pairs enables zero-shot image classification and powerful vision-language understanding.
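The contrastive objective is easy to state in code. Below is a sketch of the symmetric loss from the CLIP paper's pseudocode, with random tensors standing in for the outputs of the real image and text encoders; the function name and the fixed temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, logit_scale):
    """Symmetric contrastive loss over a batch of N paired embeddings.
    Matched (image_i, text_i) pairs are positives; every other pairing
    in the batch serves as a negative."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (N, N) cosine-similarity matrix scaled by a learned temperature
    logits = logit_scale * image_emb @ text_emb.t()

    targets = torch.arange(logits.size(0))
    loss_i = F.cross_entropy(logits, targets)      # image -> text
    loss_t = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i + loss_t) / 2

# Hypothetical encoder outputs for a batch of 8 image-text pairs.
image_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
loss = clip_contrastive_loss(image_emb, text_emb, logit_scale=100.0)
```

Zero-shot classification then falls out of the same similarity matrix: embed the class names as text prompts and pick the class whose text embedding is most similar to the image embedding.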