Dense Embeddings Space Explorer

Dense embeddings revolutionized NLP by representing words and sentences as continuous vectors in high-dimensional space, where semantic similarity corresponds to geometric proximity.

How Text Becomes Vectors

1. Tokenization: Text is split into tokens (words or subwords)

2. Embedding: Each token is mapped to a high-dimensional vector (for example, 768 dimensions in BERT-base)

3. Vector Space: Similar words have similar vectors, enabling semantic search

4. Clustering: Semantically related words form clusters in vector space
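The steps above can be sketched in a few lines of plain Python. This is a toy illustration only: the three-dimensional vectors in `EMBEDDINGS` are hand-picked hypothetical values, and whitespace splitting stands in for the subword tokenizers that real models use.

```python
import math

# Hypothetical 3-dimensional vectors, chosen by hand for illustration.
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "apple": [-0.7, 0.3, 0.3],
}

def tokenize(text):
    """Step 1: lowercase whitespace tokenization (real models use subwords)."""
    return text.lower().split()

def embed(tokens):
    """Step 2: look up one vector per known token."""
    return [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]

def cosine(u, v):
    """Step 3: proximity in vector space, measured as cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

tokens = tokenize("King Queen apple")
king_vec, queen_vec, apple_vec = embed(tokens)
royal_similarity = cosine(king_vec, queen_vec)
fruit_similarity = cosine(king_vec, apple_vec)
```

With these made-up vectors, `king` sits closer to `queen` than to `apple`, which is the behavior that steps 3 and 4 describe.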

Interactive 3D Embedding Space

[Interactive demo: explore semantic relationships in a 3D projection of the embedding space. Controls: selected word, embedding type, and similarity threshold (default 0.70).]

Example nearest neighbors shown in the demo (word, category, similarity):

1. prince (royalty): 0.995
2. man (gender): 0.968
3. boy (gender): 0.965
4. queen (royalty): 0.943
5. princess (royalty): 0.926

Understanding Dense Embeddings

Key Properties

• Continuous vector representations
• Capture semantic similarity
• Enable arithmetic operations
• Typically 50-1000 dimensions

Common Models

• Word2Vec (CBOW, Skip-gram)
• GloVe (Global Vectors)
• FastText (Subword)
• BERT (Contextual)

Applications

• Semantic search
• Document clustering
• Recommendation systems
• Machine translation

What Are Dense Embeddings?

Dense embeddings are continuous vector representations where:

• Every dimension has a value (unlike sparse representations)
• Semantic similarity = geometric proximity
• Vector arithmetic captures relationships
• Typically 50-1000 dimensions

Key Concepts

1. Word Embeddings Evolution

The progression of embedding techniques:

Model	Year	Key Innovation	Dimensions
Word2Vec	2013	Skip-gram/CBOW	50-300
GloVe	2014	Global matrix factorization	50-300
FastText	2016	Subword information	100-300
BERT	2018	Contextual embeddings	768 (base), 1,024 (large)
GPT-3	2020	Scale + few-shot	12,288 (largest model)

2. Training Objectives

Different models use different objectives:

Word2Vec Skip-gram:

J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)
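To make the double sum concrete, here is a minimal sketch that enumerates the (center, context) pairs and evaluates the objective with a full softmax. The names `skipgram_pairs`, `skipgram_loss`, `W_in`, and `W_out` are illustrative, not taken from any library, and practical Word2Vec training replaces the full softmax with negative sampling or hierarchical softmax.

```python
import numpy as np

def skipgram_pairs(token_ids, c):
    """All (center, context) pairs with offset j in [-c, c], j != 0."""
    pairs = []
    for t, center in enumerate(token_ids):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= t + j < len(token_ids):
                pairs.append((center, token_ids[t + j]))
    return pairs

def skipgram_loss(token_ids, c, W_in, W_out):
    """J = -(1/T) * sum of log p(context | center) under a full softmax."""
    total_log_prob = 0.0
    for center, context in skipgram_pairs(token_ids, c):
        scores = W_out @ W_in[center]      # one score per vocabulary word
        scores = scores - scores.max()     # stabilize the softmax
        log_probs = scores - np.log(np.exp(scores).sum())
        total_log_prob += log_probs[context]
    return -total_log_prob / len(token_ids)

token_ids = [0, 1, 2, 3]          # toy corpus of 4 tokens, vocabulary size 4
pairs = skipgram_pairs(token_ids, c=1)
W_in = np.zeros((4, 2))           # untrained (all-zero) input vectors
W_out = np.zeros((4, 2))          # untrained (all-zero) output vectors
loss = skipgram_loss(token_ids, 1, W_in, W_out)
```

With all-zero vectors every context word is equally likely, so the loss reduces to the number of pairs times log(V) divided by T. Training lowers it from that starting point.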

GloVe:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
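The GloVe objective can be evaluated directly on a small co-occurrence matrix. The sketch below assumes the weighting function from the GloVe paper, f(x) = (x / x_max)^α for x < x_max and 1 otherwise, with the commonly cited defaults x_max = 100 and α = 0.75. The function names and the toy matrix are illustrative.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): down-weights rare pairs and caps frequent ones at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_tilde, b, b_tilde):
    """Weighted squared error over the non-zero co-occurrence counts X_ij."""
    i, j = np.nonzero(X)                   # f(0) = 0, so zero counts drop out
    pred = np.sum(W[i] * W_tilde[j], axis=1) + b[i] + b_tilde[j]
    err = pred - np.log(X[i, j])
    return float(np.sum(glove_weight(X[i, j]) * err ** 2))

# Toy symmetric co-occurrence matrix for a 2-word vocabulary.
X = np.array([[0.0, 2.0],
              [2.0, 0.0]])
zeros_vec = np.zeros((2, 3))
zeros_bias = np.zeros(2)

# Untrained parameters: every prediction is 0, so the error is log X_ij.
loss_untrained = glove_loss(X, zeros_vec, zeros_vec, zeros_bias, zeros_bias)

# Biases chosen so that b_i + b~_j equals log X_ij exactly: the loss vanishes.
half_log = np.full(2, np.log(2.0) / 2)
loss_fitted = glove_loss(X, zeros_vec, zeros_vec, half_log, half_log)
```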

3. Cosine Similarity

The standard metric for comparing embeddings:

\text{similarity}(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2} \, \sqrt{\sum_{i=1}^{n} v_i^2}}
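A direct NumPy translation of the formula looks like the sketch below. The three sample calls illustrate the range of the metric: 1 for vectors pointing the same way (regardless of length), 0 for orthogonal vectors, and -1 for opposite vectors.

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product divided by the product of the two Euclidean norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])   # scaled copy
orthogonal = cosine_similarity([1, 0], [0, 1])
opposite = cosine_similarity([1, 0], [-1, 0])
```

Because the norms cancel out vector length, only direction affects the score. Note that the function is undefined for an all-zero vector.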

Vector Arithmetic

The Famous Analogy

The most celebrated property of word embeddings:

king - man + woman ≈ queen

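This works because embeddings encode relationships as directions in the vector space:

• king - man leaves roughly a "royalty" offset
• Adding woman applies that offset to the female concept
• The nearest remaining word vector to the result is queen (the input words themselves are usually excluded from the search)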

More Examples

# Relationships captured by arithmetic
paris - france + italy ≈ rome
bigger - big + small ≈ smaller
walking - walk + swim ≈ swimming
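The arithmetic can be tried on a toy vocabulary. In this sketch the two-dimensional vectors are hypothetical values laid out so that one axis loosely tracks "royalty" and the other "gender". Real embeddings are learned and far less tidy, and the `analogy` helper is illustrative rather than a library function.

```python
import numpy as np

# Hypothetical 2-d vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def analogy(a, b, c, vectors):
    """Word closest (by cosine) to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = vectors[a] - vectors[b] + vectors[c]
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # skip the query words themselves
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

result = analogy("king", "man", "woman", vectors)
```

Here `king - man + woman` lands exactly on the vector for `queen`. With learned embeddings the result is only approximately nearest, which is why the relation is written with ≈.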

Implementation Details

Creating Word Embeddings

import numpy as np
from gensim.models import Word2Vec

# Train Word2Vec
sentences = [["cat", "sat", "mat"], 
             ["dog", "stood", "rug"]]
model = Word2Vec(sentences, 
                  vector_size=100,
                  window=5,
                  min_count=1,
                  sg=1)  # Skip-gram

# Get embeddings
cat_vector = model.wv['cat']
dog_vector = model.wv['dog']

# Compute similarity
similarity = model.wv.similarity('cat', 'dog')

Finding Nearest Neighbors

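import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D vectors"""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def find_nearest(embedding, embeddings, k=5):
    """Find k nearest neighbors using cosine similarity"""
    # embeddings maps each word to its vector
    similarities = []
    for word, vec in embeddings.items():
        sim = cosine_similarity(embedding, vec)
        similarities.append((word, sim))
    
    # Sort by similarity, most similar first
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:k]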

Sentence Embeddings

Moving from words to sentences:

Average Pooling

Simple but effective:

# sentence is a list of tokens; model is the Word2Vec model trained above
sentence_emb = np.mean([model.wv[word] for word in sentence], axis=0)

Weighted Average

Using TF-IDF or importance weights:

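# compute_tfidf is a placeholder: it should return one weight per token
weights = compute_tfidf(sentence)
# word_embs: array of shape (n_tokens, dim), rows aligned with weights
sentence_emb = np.average(word_embs, weights=weights, axis=0)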
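Both pooling strategies fit in one small helper. The sketch below is self-contained: the word vectors and the importance weights are hypothetical numbers standing in for a trained model and real TF-IDF scores, and `sentence_embedding` is an illustrative name.

```python
import numpy as np

# Hypothetical word vectors and importance weights (stand-ins for TF-IDF).
word_vectors = {
    "the": np.array([0.1, 0.1]),
    "cat": np.array([0.9, 0.2]),
    "sat": np.array([0.3, 0.8]),
}
weights = {"the": 0.1, "cat": 1.0, "sat": 0.9}

def sentence_embedding(tokens, word_vectors, weights=None):
    """Average the vectors of known tokens; uniform if no weights are given."""
    known = [t for t in tokens if t in word_vectors]
    vecs = np.array([word_vectors[t] for t in known])
    w = None if weights is None else [weights[t] for t in known]
    return np.average(vecs, axis=0, weights=w)

tokens = ["the", "cat", "sat"]
plain = sentence_embedding(tokens, word_vectors)
weighted = sentence_embedding(tokens, word_vectors, weights)
```

The weighted version pulls the sentence vector toward the content words and away from the low-weight function word "the". A sentence with no known tokens would need separate handling, which this sketch omits.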

Sentence-BERT

Specialized models for sentence embeddings:

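from sentence_transformers import SentenceTransformer

# Input is a list of raw strings, not pre-tokenized words
sentences = ["The cat sat on the mat.", "A dog stood on the rug."]
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(sentences)  # array of shape (2, 384)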

Applications

1. Semantic Search

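import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Index documents
doc_embeddings = model.encode(documents)      # shape: (n_docs, dim)

# Search
query_embedding = model.encode([query])       # shape: (1, dim)
similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
top_k = np.argsort(similarities)[::-1][:k]    # indices of the k best matches, best first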
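The same ranking logic can be checked without downloading a model. The sketch below uses a hand-made document matrix (hypothetical values) and a hypothetical helper `top_k_search` that normalizes both sides, so a single matrix-vector product yields cosine similarities.

```python
import numpy as np

def top_k_search(query_vec, doc_matrix, k):
    """Indices and scores of the k documents most similar to the query, best first."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = d @ q                          # cosine similarity per document
    order = np.argsort(sims)[::-1][:k]    # sort descending, keep the first k
    return order, sims[order]

# Four toy "documents" in a 2-d embedding space.
doc_matrix = np.array([
    [1.0, 0.0],
    [0.7, 0.7],
    [0.0, 1.0],
    [-1.0, 0.0],
])
indices, scores = top_k_search(np.array([0.9, 0.1]), doc_matrix, k=2)
```

The query points mostly along the first axis, so document 0 ranks first and document 1 second.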

2. Clustering

from sklearn.cluster import KMeans

# Cluster embeddings
kmeans = KMeans(n_clusters=10)
clusters = kmeans.fit_predict(embeddings)

3. Classification

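import numpy as np
from sklearn.linear_model import LogisticRegression

# Use embeddings as features (get_embedding is a placeholder for any encoder)
X = np.array([get_embedding(text) for text in texts])
classifier = LogisticRegression()
classifier.fit(X, labels)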

Visualization Techniques

t-SNE Projection

Reduce dimensions for visualization:

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30)
embeddings_2d = tsne.fit_transform(embeddings)

UMAP

Often a faster alternative to t-SNE on large datasets (provided by the umap-learn package):

import umap

reducer = umap.UMAP(n_components=2)
embeddings_2d = reducer.fit_transform(embeddings)

Common Pitfalls

1. Bias in Embeddings

Word embeddings can encode societal biases:

# Problematic associations reported for embeddings trained on web and news text
doctor - man + woman ≈ nurse  # Gender stereotype
programmer - man + woman ≈ homemaker  # Gender stereotype about occupations

2. Out-of-Vocabulary Words

Handling unknown words:

• Use subword tokenization (FastText)
• Fall back to character embeddings
• Use contextual models (BERT)
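The subword approach rests on character n-grams: an unseen word still shares n-grams with words seen in training, so a vector can be assembled from those pieces. The sketch below shows only the n-gram extraction step, assuming the FastText conventions of `<` and `>` boundary markers and n-gram lengths 3 to 6. The `char_ngrams` helper is illustrative and is not the FastText API.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>' with FastText-style boundary markers."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]

trigrams = char_ngrams("where", n_min=3, n_max=3)

# An unseen word overlaps heavily with a related known word.
shared = set(char_ngrams("unhappiness")) & set(char_ngrams("happiness"))
```

Averaging or summing the vectors of the shared n-grams is what lets a subword model produce a usable embedding for a word it has never seen.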

3. Polysemy

Single vector per word loses context:

"bank" (financial) vs "bank" (river)
Solution: Contextual embeddings (BERT, GPT)

Performance Considerations

Memory Usage

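• Word2Vec: ~1.2 GB for 1M words × 300 dims in float32 (4 bytes per value)
• BERT: ~400MB of model weights (BERT-base), plus computation at inference time because embeddings are produced per input
• Storage: Use float16 or quantization to shrink stored vectors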
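The storage figures follow from simple arithmetic: number of vectors × dimensions × bytes per value. The sketch below computes them and measures the rounding error from casting to float16 on random data. The random vectors only stand in for real embeddings, and `embedding_memory_gb` is an illustrative helper.

```python
import numpy as np

def embedding_memory_gb(n_vectors, dims, dtype=np.float32):
    """Raw size of an (n_vectors x dims) matrix in GB (10^9 bytes)."""
    return n_vectors * dims * np.dtype(dtype).itemsize / 1e9

float32_gb = embedding_memory_gb(1_000_000, 300, np.float32)
float16_gb = embedding_memory_gb(1_000_000, 300, np.float16)

# Casting to float16 halves storage at a small cost in precision.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 300)).astype(np.float32)
restored = vectors.astype(np.float16).astype(np.float32)
max_abs_error = float(np.max(np.abs(vectors - restored)))
```

Whether that rounding error matters depends on the downstream task, so it is worth re-checking retrieval quality after any precision reduction.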

Speed Optimization

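import numpy as np
from annoy import AnnoyIndex

# Batch operations: one matrix product scores every query against every document
# (equals cosine similarity when the rows are L2-normalized)
similarities = np.dot(query_embs, doc_embs.T)

# Approximate nearest neighbor
index = AnnoyIndex(embedding_dim, 'angular')  # embedding_dim: length of each vector
for i, vec in enumerate(embeddings):
    index.add_item(i, vec)
index.build(10)  # 10 trees
# query_vec: a single query embedding; returns ids of 5 approximate neighbors
neighbor_ids = index.get_nns_by_vector(query_vec, 5)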

Modern Developments

1. Contextual Embeddings

BERT and GPT models provide context-dependent embeddings:

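import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

# Different embeddings for same word in different contexts
inputs1 = tokenizer("The bank is closed", return_tensors="pt")
inputs2 = tokenizer("The river bank is muddy", return_tensors="pt")

with torch.no_grad():
    hidden1 = model(**inputs1).last_hidden_state[0]  # (tokens, 768)
    hidden2 = model(**inputs2).last_hidden_state[0]

# "bank" is token 2 in the first sentence and token 3 in the second ([CLS] is token 0)
similarity = torch.cosine_similarity(hidden1[2], hidden2[3], dim=0)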

2. Multilingual Embeddings

Cross-lingual understanding:

• mBERT: 104 languages
• XLM-R: 100 languages
• LaBSE: Language-agnostic sentence embeddings

3. Multimodal Embeddings

Combining text and vision:

• CLIP: Text-image alignment through contrastive training
• ALIGN: Training on large-scale noisy image-text data
• Flamingo: Few-shot multimodal learning

Best Practices

Choose the right model:
- Static embeddings for speed
- Contextual for accuracy
- Domain-specific when available

Normalize embeddings:

normalized = embedding / np.linalg.norm(embedding)

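Use appropriate similarity metrics:
- Cosine when only direction matters (identical to dot product on normalized vectors)
- Euclidean when absolute position, including vector magnitude, matters
- Dot product for efficiency, ideally on pre-normalized vectors

Consider fine-tuning:
- Domain adaptation can improve performance on in-domain text
- Contrastive learning for specific tasks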
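The advice on normalization and metric choice is connected: once vectors are scaled to unit length, dot product equals cosine similarity, and squared Euclidean distance becomes 2 - 2 × cosine, so all three metrics produce the same ranking. A short numerical check on random vectors (used here purely as sample data):

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=(2, 8))        # two random 8-dimensional vectors

# Scale both vectors to unit length.
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)

cosine = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
dot_of_unit = float(u_hat @ v_hat)
squared_distance = float(np.sum((u_hat - v_hat) ** 2))
```

This is why many pipelines normalize once at indexing time and then rely on plain dot products for search.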


References

Mikolov et al. "Efficient Estimation of Word Representations in Vector Space"
Pennington et al. "GloVe: Global Vectors for Word Representation"
Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
Reimers & Gurevych "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks"