Dense Embeddings Space Explorer
Dense embeddings revolutionized NLP by representing words and sentences as continuous vectors in high-dimensional space, where semantic similarity corresponds to geometric proximity.
How Text Becomes Vectors
Watch how text transforms into high-dimensional vectors through the embedding process:
1. Tokenization: Text is split into tokens (words or subwords)
2. Embedding: Each token is mapped to a high-dimensional vector (e.g., 768 dimensions for BERT-base)
3. Vector Space: Similar words have similar vectors, enabling semantic search
4. Clustering: Semantically related words form clusters in vector space
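The four steps above can be traced directly in code. Below is a minimal sketch using the Hugging Face transformers library and bert-base-uncased (the same model used later in this article); the example sentence is arbitrary.

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "Dense embeddings capture meaning"

# 1. Tokenization: split the text into subword tokens
tokens = tokenizer.tokenize(text)  # list of subword token strings

# 2. Embedding: map each token to a 768-dimensional vector
inputs = tokenizer(text, return_tensors="pt")
vectors = model(**inputs).last_hidden_state  # shape: [1, num_tokens, 768]
```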
Interactive 3D Embedding Space
The interactive explorer combines an embedding configuration panel, a 3D view of the embedding space, and a nearest-neighbors list for the selected word.
Understanding Dense Embeddings
Key Properties
- Continuous vector representations
- Capture semantic similarity
- Enable arithmetic operations
- Typically 50-1000 dimensions
Common Models
- Word2Vec (CBOW, Skip-gram)
- GloVe (Global Vectors)
- FastText (Subword)
- BERT (Contextual)
Applications
- Semantic search
- Document clustering
- Recommendation systems
- Machine translation
What Are Dense Embeddings?
Dense embeddings are continuous vector representations where:
- Every dimension has a value (unlike sparse representations)
- Semantic similarity = geometric proximity
- Vector arithmetic captures relationships
- Typically 50-1000 dimensions
Key Concepts
1. Word Embeddings Evolution
The progression of embedding techniques:
| Model | Year | Key Innovation | Dimensions |
|---|---|---|---|
| Word2Vec | 2013 | Skip-gram/CBOW | 50-300 |
| GloVe | 2014 | Global matrix factorization | 50-300 |
| FastText | 2016 | Subword information | 100-300 |
| BERT | 2018 | Contextual embeddings | 768 |
| GPT-3 | 2020 | Scale + few-shot | 12,288 |
2. Training Objectives
Different models use different objectives:
Word2Vec Skip-gram:
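Skip-gram (Mikolov et al.) maximizes the average log-probability of the context words within a window of size c around each center word:

$$\frac{1}{T}\sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \neq 0} \log p(w_{t+j} \mid w_t)$$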
GloVe:
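GloVe (Pennington et al.) minimizes a weighted least-squares objective over the global co-occurrence counts $X_{ij}$, where $f$ is a weighting function that caps the influence of very frequent co-occurrences:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$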
3. Cosine Similarity
The standard metric for comparing embeddings:
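Cosine similarity measures the angle between two vectors, ignoring their magnitudes:

$$\text{cosine}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}$$

A minimal NumPy helper (the `find_nearest` example further down assumes a vector-to-vector helper like this):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```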
Vector Arithmetic
The Famous Analogy
The most celebrated property of word embeddings:
king - man + woman ≈ queen
This works because embeddings encode relationships:
- `king - man` isolates a "royalty" vector
- Adding `woman` applies that royalty vector to the female concept
- The result is closest to `queen`
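This can be reproduced with gensim's analogy API. A minimal sketch, assuming the pre-trained `glove-wiki-gigaword-100` vectors from `gensim.downloader` (any pre-trained word vectors work the same way):

```python
import gensim.downloader as api

# Downloads the pre-trained GloVe vectors on first use
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman ≈ ?
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically [('queen', <similarity>)]
```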
More Examples
```
# Relationships captured by arithmetic
paris - france + italy ≈ rome
bigger - big + small ≈ smaller
walking - walk + swim ≈ swimming
```
Implementation Details
Creating Word Embeddings
```python
import numpy as np
from gensim.models import Word2Vec

# Train Word2Vec
sentences = [["cat", "sat", "mat"], ["dog", "stood", "rug"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)  # Skip-gram

# Get embeddings
cat_vector = model.wv['cat']
dog_vector = model.wv['dog']

# Compute similarity
similarity = model.wv.similarity('cat', 'dog')
```
Finding Nearest Neighbors
```python
def find_nearest(embedding, embeddings, k=5):
    """Find k nearest neighbors using cosine similarity."""
    similarities = []
    for word, vec in embeddings.items():
        sim = cosine_similarity(embedding, vec)
        similarities.append((word, sim))
    # Sort by similarity, highest first
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:k]
```
Sentence Embeddings
Moving from words to sentences:
Average Pooling
Simple but effective:
```python
# Look up each token's vector and average them
sentence_emb = np.mean([embeddings[word] for word in sentence], axis=0)
```
Weighted Average
Using TF-IDF or importance weights:
```python
# compute_tfidf returns one weight per word in the sentence
weights = compute_tfidf(sentence)
sentence_emb = np.average(word_embs, weights=weights, axis=0)
```
Sentence-BERT
Specialized models for sentence embeddings:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
```
Applications
1. Semantic Search
```python
from sklearn.metrics.pairwise import cosine_similarity

# Index documents
doc_embeddings = model.encode(documents)

# Search
query_embedding = model.encode([query])
similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
top_k = np.argsort(similarities)[-k:][::-1]  # indices of the k most similar documents
```
2. Clustering
```python
from sklearn.cluster import KMeans

# Cluster embeddings
kmeans = KMeans(n_clusters=10)
clusters = kmeans.fit_predict(embeddings)
```
3. Classification
```python
from sklearn.linear_model import LogisticRegression

# Use embeddings as features
X = np.array([get_embedding(text) for text in texts])
classifier = LogisticRegression()
classifier.fit(X, labels)
```
Visualization Techniques
t-SNE Projection
Reduce dimensions for visualization:
```python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30)
embeddings_2d = tsne.fit_transform(embeddings)
```
UMAP
Faster alternative to t-SNE:
```python
import umap

reducer = umap.UMAP(n_components=2)
embeddings_2d = reducer.fit_transform(embeddings)
```
Common Pitfalls
1. Bias in Embeddings
Word embeddings can encode societal biases:
```
# Problematic associations
doctor - man + woman ≈ nurse          # Gender bias
programmer - man + woman ≈ homemaker  # Occupation bias
```
2. Out-of-Vocabulary Words
Handling unknown words:
- Use subword tokenization (FastText), as sketched after this list
- Fall back to character embeddings
- Use contextual models (BERT)
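A minimal sketch of the subword option, using gensim's FastText on the same toy corpus as above; because vectors are composed from character n-grams, an unseen word still gets an embedding:

```python
from gensim.models import FastText

sentences = [["cat", "sat", "mat"], ["dog", "stood", "rug"]]
model = FastText(sentences, vector_size=100, window=5, min_count=1)

# "cats" never appeared in training, but FastText composes a vector
# from its character n-grams instead of raising a KeyError
oov_vector = model.wv["cats"]
```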
3. Polysemy
Single vector per word loses context:
- "bank" (financial) vs "bank" (river)
- Solution: Contextual embeddings (BERT, GPT)
Performance Considerations
Memory Usage
- Word2Vec: ~1.2 GB for 1M words × 300 dims in float32
- BERT-base: ~440 MB of weights (110M parameters), plus activation memory at inference time
- Storage: use float16 or quantization to cut the footprint (see the sketch below)
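A quick illustration of the float16 option with NumPy; the matrix size matches the 1M × 300 Word2Vec example above:

```python
import numpy as np

embeddings = np.random.rand(1_000_000, 300).astype(np.float32)
print(embeddings.nbytes / 1e9)  # ~1.2 GB

# Casting to float16 halves storage at a small cost in precision
embeddings_fp16 = embeddings.astype(np.float16)
print(embeddings_fp16.nbytes / 1e9)  # ~0.6 GB
```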
Speed Optimization
```python
from annoy import AnnoyIndex

# Batch operations: one matrix product gives all query-document dot products
similarities = np.dot(query_embs, doc_embs.T)

# Approximate nearest neighbor
index = AnnoyIndex(embedding_dim, 'angular')
for i, vec in enumerate(embeddings):
    index.add_item(i, vec)
index.build(10)  # 10 trees
```
Modern Developments
1. Contextual Embeddings
BERT and GPT models provide context-dependent embeddings:
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

# Different embeddings for the same word in different contexts
inputs1 = tokenizer("The bank is closed", return_tensors="pt")
inputs2 = tokenizer("The river bank is muddy", return_tensors="pt")

# Per-token contextual embeddings, shape [1, seq_len, 768];
# the vector for "bank" differs between the two sentences
emb1 = model(**inputs1).last_hidden_state
emb2 = model(**inputs2).last_hidden_state
```
2. Multilingual Embeddings
Cross-lingual models map text from different languages into a shared embedding space (see the sketch after this list):
- mBERT: 104 languages
- XLM-R: 100 languages
- LaBSE: Language-agnostic sentence embeddings
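A minimal sketch with sentence-transformers, assuming the `sentence-transformers/LaBSE` checkpoint on the Hugging Face Hub; a sentence and its translation should land close together:

```python
from sentence_transformers import SentenceTransformer, util

# Model name assumed; any multilingual sentence encoder works the same way
model = SentenceTransformer("sentence-transformers/LaBSE")

english = model.encode("The weather is nice today")
german = model.encode("Das Wetter ist heute schön")

# High cosine similarity despite different languages
print(util.cos_sim(english, german))
```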
3. Multimodal Embeddings
Combining text and vision in one embedding space (see the CLIP sketch after this list):
- CLIP: Text-image alignment
- ALIGN: Noisy data training
- Flamingo: Few-shot multimodal
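A minimal CLIP sketch using the transformers library; the checkpoint name `openai/clip-vit-base-patch32` and the local file `cat.jpg` are illustrative assumptions:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # hypothetical local image
inputs = processor(text=["a photo of a cat", "a photo of a dog"],
                   images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # how well each caption matches the image
```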
Best Practices
- Choose the right model:
  - Static embeddings for speed
  - Contextual for accuracy
  - Domain-specific when available
- Normalize embeddings: `normalized = embedding / np.linalg.norm(embedding)`
- Use appropriate similarity metrics:
  - Cosine for normalized vectors
  - Euclidean for positional relationships
  - Dot product for efficiency
- Consider fine-tuning (see the sketch after this list):
  - Domain adaptation improves performance
  - Contrastive learning for specific tasks
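A minimal fine-tuning sketch with sentence-transformers and a contrastive objective (MultipleNegativesRankingLoss); the training pairs here are hypothetical in-domain examples:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical pairs of texts that should embed close together
train_examples = [
    InputExample(texts=["query about refunds", "our refund policy explained"]),
    InputExample(texts=["reset my password", "how to change your password"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Matched pairs are pulled together; other in-batch examples act as negatives
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```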
Related Concepts
- Quantization Effects - Reducing embedding precision
- Matryoshka Embeddings - Multi-scale representations
- Sparse vs Dense - Comparing embedding types
References
- Mikolov et al. "Efficient Estimation of Word Representations in Vector Space"
- Pennington et al. "GloVe: Global Vectors for Word Representation"
- Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers"
- Reimers & Gurevych "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks"
