The Vision-Language Alignment Problem
How vision-language models align visual and text representations using contrastive learning, cross-modal attention, and CLIP-style training.
5 min read · Concept
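At the core of this alignment is CLIP-style contrastive training: an image encoder and a text encoder project their inputs into a shared embedding space and are trained so that matching image-text pairs land close together while mismatched pairs are pushed apart. The sketch below is a minimal, assumed implementation of the symmetric InfoNCE objective in PyTorch; the function name and the random features standing in for real encoder outputs are illustrative, not CLIP's actual code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_features, text_features: (batch, dim) outputs of the two encoders
    after projection into the shared space. The i-th image and i-th text are
    assumed to be a matching pair; every other pairing in the batch is a negative.
    """
    # Normalize so dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (batch, batch) matrix of image-to-text similarities, scaled by temperature.
    logits = image_features @ text_features.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: match each image to its text and vice versa.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Illustrative usage with random features in place of encoder outputs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt))
```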
A closely related phenomenon is the modality gap: in CLIP and similar vision-language models, image and text embeddings occupy separate regions of the shared embedding space even after contrastive training.
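One common way to quantify the gap, following the centroid-based definition used in the modality-gap literature, is the distance between the mean normalized image embedding and the mean normalized text embedding. The helper below is a small illustrative sketch under that assumption, not a fixed standard metric.

```python
import torch
import torch.nn.functional as F

def modality_gap(image_features, text_features):
    """Euclidean distance between the centroids of the two modalities'
    L2-normalized embeddings; a rough, commonly used measure of the gap."""
    img_center = F.normalize(image_features, dim=-1).mean(dim=0)
    txt_center = F.normalize(text_features, dim=-1).mean(dim=0)
    return torch.norm(img_center - txt_center).item()
```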
Multimodal vision-language models such as CLIP, ALIGN, and LLaVA also scale predictably with data, parameters, and compute, following Chinchilla-style power laws.
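As a rough guide to what "Chinchilla-style" means here, the loss is typically modeled as additive power laws in model size and data size; this is the standard parameterization, and the constants are fitted per model family rather than specified here:

$$ L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} $$

where N is the number of model parameters, D the number of training examples (image-text pairs), and E, A, B, α, β are fitted constants.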
Finally, parameter-efficient fine-tuning methods such as LoRA, bottleneck adapters, and prefix tuning adapt vision-language models like LLaVA with modest compute and memory.
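To make the LoRA idea concrete, the sketch below wraps a frozen nn.Linear with a trainable low-rank update scaled by alpha/r. The class name LoRALinear and the hyperparameter defaults are illustrative assumptions, not the API of any particular PEFT library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights

        self.lora_a = nn.Linear(base.in_features, r, bias=False)   # down-projection A
        self.lora_b = nn.Linear(r, base.out_features, bias=False)  # up-projection B
        nn.init.zeros_(self.lora_b.weight)  # start as a no-op so training begins from the base model
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Illustrative usage: adapt a single projection layer.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))
```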