The Vision-Language Alignment Problem
How vision-language models align visual and text representations using contrastive learning, cross-modal attention, and CLIP-style training.
Vision-language models, alignment techniques, and the fundamental challenges of multimodal learning.
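At the core of this alignment is CLIP-style contrastive training, which pulls matched image-text pairs together in a shared embedding space and pushes mismatched pairs apart. Below is a minimal sketch of the symmetric InfoNCE loss, assuming PyTorch; `img_emb` and `txt_emb` are illustrative names for batch-aligned encoder outputs, not any library's API:

```python
# Minimal sketch of the CLIP-style symmetric contrastive (InfoNCE) loss.
# `img_emb` and `txt_emb` are (batch, dim) tensors whose rows are paired.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # L2-normalize so the dot product is cosine similarity.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    # Pairwise similarity logits, sharpened by the temperature.
    logits = img_emb @ txt_emb.t() / temperature

    # Matched image-text pairs sit on the diagonal.
    targets = torch.arange(len(img_emb), device=img_emb.device)

    # Symmetric cross-entropy: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```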
The modality gap in CLIP and vision-language models: why image and text embeddings occupy separate regions despite contrastive training.
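A common way to quantify this gap, following Liang et al. (2022), is the distance between the centroids of the normalized image and text embedding clouds. A minimal sketch, assuming PyTorch and hypothetical embedding tensors:

```python
# Sketch: measure the modality gap as the distance between the centroids
# of L2-normalized image and text embeddings. A gap near 0 would mean the
# two modalities share one region; CLIP-like models typically show a
# clearly nonzero value.
import torch
import torch.nn.functional as F

def modality_gap(img_emb, txt_emb):
    img_center = F.normalize(img_emb, dim=-1).mean(dim=0)
    txt_center = F.normalize(txt_emb, dim=-1).mean(dim=0)
    return torch.norm(img_center - txt_center).item()
```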
How multimodal vision-language models like CLIP, ALIGN, and LLaVA scale with data, parameters, and compute, following Chinchilla-style power laws.
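As an illustration of the kind of fit such scaling studies perform, the sketch below fits a power law L(N) = E + A * N^(-alpha) to (parameter count, loss) pairs with SciPy; the data points are invented for illustration, not results from any paper:

```python
# Sketch: fit a Chinchilla-style power law L(N) = E + A * N**(-alpha),
# where E is the irreducible loss and alpha the scaling exponent.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, E, A, alpha):
    return E + A * n ** (-alpha)

# Hypothetical observations: model sizes (params) and eval losses.
n_params = np.array([1e7, 1e8, 1e9, 1e10])
losses = np.array([2.9, 2.3, 1.9, 1.7])

(E, A, alpha), _ = curve_fit(
    power_law, n_params, losses, p0=[1.5, 10.0, 0.1], maxfev=10000
)
print(f"irreducible loss E={E:.2f}, exponent alpha={alpha:.3f}")
```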
How LoRA, bottleneck adapters, and prefix tuning enable parameter-efficient fine-tuning of vision-language models like LLaVA with minimal compute and memory.
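The sketch below shows the core LoRA idea: a frozen pretrained linear layer augmented with a trainable low-rank update B A scaled by alpha / r. Assumes PyTorch; the class and argument names are illustrative, not the `peft` library's API:

```python
# Sketch of a LoRA linear layer: the pretrained weight stays frozen and
# only the low-rank factors A (r x in) and B (out x r) are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained layer
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r  # B starts at zero, so training begins at W

    def forward(self, x):
        # Frozen path plus the scaled low-rank residual update.
        return self.base(x) + self.scale * (x @ self.lora_a.t() @ self.lora_b.t())
```

Wrapping, for example, the attention projections of a LLaVA-style model in such layers leaves the base weights untouched while training only the small low-rank factors.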