Transformers & LLMs

Attention mechanisms, large language models, and multimodal architectures: the building blocks of modern AI.

26 concepts

All Transformers & LLMs Concepts

April 8, 2025

CLS Token in Vision Transformers

Learn how the CLS token acts as a global information aggregator in Vision Transformers, enabling whole-image classification through attention mechanisms.

deep-learning attention architectures vision-transformers

April 8, 2025

Hierarchical Attention in Vision Transformers

How hierarchical (windowed, multi-scale) attention — pioneered by Swin Transformer — cuts the quadratic cost of self-attention to near-linear for high-resolution vision.

deep-learning attention architectures optimization

April 8, 2025

Multi-Head Attention

How multi-head attention runs scaled dot-product attention in parallel across several representation subspaces to build context-aware token embeddings.

deep-learning attention architectures neural-nets

April 8, 2025

Positional Embeddings in Vision Transformers

Explore how positional embeddings give Vision Transformers (ViT) spatial awareness by encoding where each image patch sits in the input, information that self-attention alone does not provide.

deep-learning attention architectures neural-nets

April 8, 2025

Self-Attention in Vision Transformers

Explore how self-attention enables Vision Transformers (ViT) to understand images by capturing global context, with a CNN comparison.

deep-learning attention architectures neural-nets

January 31, 2025

ALiBi: Attention with Linear Biases

Learn ALiBi, the position encoding method that adds linear biases to attention scores for exceptional length extrapolation in transformers.

deep-learning attention transformers position-encoding

January 31, 2025

Flash Attention vs MHA vs GQA vs MQA: Comparing Attention Mechanisms

How Flash Attention, Multi-Head Attention (MHA), Grouped-Query Attention (GQA), and Multi-Query Attention (MQA) compare — algorithm vs architecture, KV-cache memory, quality trade-offs, and how to choose for production transformer inference.

deep-learning attention transformers optimization

January 31, 2025

Attention Sinks: Stable Streaming LLMs

Learn about attention sinks, where LLMs concentrate attention on initial tokens, and how preserving them enables streaming inference.

deep-learning attention transformers streaming inference

January 31, 2025

Cross-Attention: Bridging Different Modalities

Understand cross-attention, the mechanism that enables transformers to align and fuse information from different sources, sequences, or modalities.

deep-learning attention transformers multimodal

January 31, 2025

Grouped-Query Attention (GQA)

Learn how Grouped-Query Attention (GQA) balances Multi-Head quality with Multi-Query efficiency for faster LLM inference.

deep-learning attention transformers optimization

January 31, 2025

Linear Attention Approximations

Explore linear complexity attention mechanisms including Performer, Linformer, and other efficient transformers that scale to very long sequences.

deep-learning attention transformers linear-attention optimization

January 31, 2025

Masked and Causal Attention

Learn how masked attention enables autoregressive generation and prevents information leakage in transformers and language models.

deep-learning attention transformers language-models

January 31, 2025

Multi-Query Attention (MQA)

Learn Multi-Query Attention (MQA), the optimization that shares a single key/value head across all query heads to shrink the KV cache and speed up inference.

deep-learning attention transformers optimization

January 31, 2025

Rotary Position Embeddings (RoPE)

Learn Rotary Position Embeddings (RoPE), the elegant position encoding using rotation matrices, powering LLaMA, Mistral, and modern LLMs.

deep-learning attention transformers position-encoding

January 31, 2025

Scaled Dot-Product Attention

Master scaled dot-product attention, the fundamental transformer building block. Learn why scaling is crucial for stable training.

deep-learning attention transformers fundamentals

January 31, 2025

Sliding Window Attention

Sliding Window Attention for long sequences: each token attends only to a fixed-size local window, so cost grows linearly with sequence length, O(n·w) for window size w. Used in Mistral and Longformer models.

deep-learning attention transformers optimization

January 31, 2025

Sparse Attention Patterns

Explore sparse attention mechanisms that reduce quadratic complexity to linear or sub-quadratic, enabling efficient processing of long sequences.

deep-learning attention transformers optimization sparse-models

January 21, 2025

The Vision-Language Alignment Problem

How vision-language models align visual and text representations using contrastive learning, cross-modal attention, and CLIP-style training.

multimodal alignment CLIP vision-language contrastive-learning

January 21, 2025

Context Windows: The Memory Limits of LLMs

Interactive visualization of LLM context windows - sliding windows, expanding contexts, and attention patterns that define model memory limits.

llms attention memory transformers

January 21, 2025

Flash Attention: IO-Aware Exact Attention

Interactive Flash Attention visualization - the IO-aware algorithm achieving memory-efficient exact attention through tiling and kernel fusion.

llms optimization attention gpu

January 21, 2025

KV Cache: The Secret to Fast LLM Inference

Interactive KV cache visualization - how key-value caching in LLM transformers enables fast text generation without quadratic recomputation.

llms optimization inference transformers

January 21, 2025

The Modality Gap in Multimodal AI

The modality gap in CLIP and vision-language models: why image and text embeddings occupy separate regions despite contrastive training.

multimodal modality-gap embeddings vision-language representation-learning

January 21, 2025

Multimodal Scaling Laws

Discover how multimodal vision-language models like CLIP, ALIGN, and LLaVA scale with data, parameters, and compute following Chinchilla-style power laws.

multimodal scaling-laws vision-language chinchilla optimization

January 21, 2025

Tokenization: Converting Text to Numbers

Interactive exploration of tokenization methods in LLMs - BPE, SentencePiece, and WordPiece. Understand how text becomes tokens that models can process.

llms tokenization nlp transformers

January 21, 2025

Vision-Language Adapters: Efficient Fine-tuning

Master LoRA, bottleneck adapters, and prefix tuning for parameter-efficient fine-tuning of vision-language models like LLaVA with minimal compute and memory.

multimodal adapters lora peft fine-tuning vision-language

January 16, 2024

Mixture of Experts (MoE)

Understanding sparse mixture of experts models: architecture, routing mechanisms, load balancing, and efficient scaling strategies for large language models.

MoE sparse-models expert-networks routing transformers scaling switch-transformer mixtral
