Dense Embeddings
How dense embeddings turn meaning into geometry: word2vec, GloVe, and contextual models, vector arithmetic, cosine similarity, and where the field is heading.
Explore machine learning concepts related to embeddings. Clear explanations and practical insights.
How a transformer’s per-token outputs become one embedding: CLS, mean, max, last-token, and attention pooling — what each does and when to use it.
Matryoshka embeddings: nested representations enabling dimension reduction by simple truncation without model retraining for flexible retrieval.
Embedding quantization simulator: explore memory-accuracy trade-offs from float32 to int8 and binary representations for retrieval.
Master vector compression techniques from scalar to product quantization. Learn how to reduce memory usage by 10-100× while preserving search quality.
Explore the fundamental data structures powering vector databases: trees, graphs, hash tables, and hybrid approaches for efficient similarity search.
How HNSW, IVF-PQ, and LSH compare for approximate nearest neighbor (ANN) search — recall, latency, memory, build cost, and update characteristics — with Annoy, ScaNN, and DiskANN included for completeness.
Explore how LSH uses probabilistic hash functions to find similar vectors in sub-linear time, which makes it well suited to streaming and high-dimensional data.
Learn how IVF-PQ combines clustering and compression to enable billion-scale vector search with minimal memory footprint.
How HNSW navigates a layered proximity graph to find approximate nearest neighbors in roughly logarithmic time, and why it is a common default in-memory index in modern vector databases.
Understanding complete, dimensional, and cluster collapse — the failure modes that every self-supervised method must prevent. Learn why collapse happens and how contrastive, asymmetric, regularization, and masking approaches solve it.
The modality gap in CLIP and vision-language models: why image and text embeddings occupy separate regions despite contrastive training.
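Several of the ideas listed above fit in a few lines of code. The sketch below is illustrative only and uses plain Python lists in place of real model outputs: it shows cosine similarity, mean pooling of per-token vectors, Matryoshka-style truncation, and binary quantization compared by Hamming distance. All function names are hypothetical, vectors are assumed to be non-zero, and truncation is only expected to preserve quality for models trained with a Matryoshka objective.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pool(token_vectors):
    """Average per-token vectors into a single embedding."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def truncate_and_renormalize(vec, k):
    """Keep the first k dimensions and rescale to unit length."""
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


def binarize(vec):
    """Binary quantization: one bit per dimension, set when the value is positive."""
    return [1 if x > 0 else 0 for x in vec]


def hamming(a, b):
    """Number of positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


# Toy per-token outputs for a two-token input (made-up numbers).
tokens = [[1.0, 0.0, 2.0, -1.0], [3.0, 2.0, 0.0, 1.0]]
pooled = mean_pool(tokens)                    # [2.0, 1.0, 1.0, 0.0]
short = truncate_and_renormalize(pooled, 2)   # first two dimensions, unit length
bits = binarize(pooled)                       # [1, 1, 1, 0]
```

Binary codes trade accuracy for memory: each dimension shrinks from a 32-bit float to a single bit, and comparison reduces to counting differing bits.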