Dense Embeddings

Summary
How dense embeddings turn meaning into geometry: word2vec, GloVe, and contextual models, vector arithmetic, cosine similarity, and where the field is heading.

Dense embeddings revolutionized NLP by representing words and sentences as continuous vectors in high-dimensional space, where semantic similarity corresponds to geometric proximity.

What Are Dense Embeddings?

Dense embeddings are continuous vector representations where:

  • Every dimension has a value (unlike sparse representations)
  • Semantic similarity = geometric proximity
  • Vector arithmetic captures relationships
  • Typically 50 to a few thousand dimensions (LLM hidden states, such as GPT-3's 12,288, run larger)

Key Concepts

1. Word Embeddings Evolution

The progression of embedding techniques:

| Model | Year | Key Innovation | Dimensions |
|---|---|---|---|
| Word2Vec | 2013 | Skip-gram/CBOW | 50-300 |
| GloVe | 2014 | Global matrix factorization | 50-300 |
| FastText | 2016 | Subword information | 100-300 |
| BERT | 2018 | Contextual embeddings | 768 |
| GPT-3 | 2020 | Scale + few-shot | 12,288 |

2. Training Objectives

Different models use different objectives:

Word2Vec Skip-gram:

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t)$$
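To make the double sum concrete, the sketch below lists the (center, context) pairs it ranges over for a toy sentence and window size c. The function name `skipgram_pairs` is illustrative and not part of any library.

```python
def skipgram_pairs(tokens, c=2):
    """Return (center, context) pairs for every position t and offset j,
    with -c <= j <= c and j != 0, skipping offsets that fall off the sentence."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-c, c + 1):
            if j == 0:
                continue  # a word is never its own context
            if 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

# Toy sentence with a window of one word on each side.
pairs = skipgram_pairs(["the", "cat", "sat", "down"], c=1)
```

Each pair contributes one log p(context | center) term to the objective; training adjusts the vectors so that those terms grow.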

GloVe:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
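As a rough sketch of how the GloVe objective is evaluated, the code below computes the loss for a tiny co-occurrence matrix. The weighting f(x) = (x / x_max)^α below x_max and 1 above it, with x_max = 100 and α = 0.75, follows the defaults reported in the GloVe paper. The matrix, the random vectors, and the function names are made up for illustration.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(X_ij): damps rare pairs and caps frequent ones at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_tilde, b, b_tilde):
    """Weighted least-squares loss over the nonzero co-occurrence counts."""
    i, j = np.nonzero(X)  # f(0) = 0, so zero counts contribute nothing
    pred = np.sum(W[i] * W_tilde[j], axis=1) + b[i] + b_tilde[j]
    err = pred - np.log(X[i, j])
    return float(np.sum(glove_weight(X[i, j]) * err ** 2))

# Toy symmetric co-occurrence counts for a 4-word vocabulary (illustrative).
X = np.array([[0, 2, 1, 0],
              [2, 0, 5, 1],
              [1, 5, 0, 3],
              [0, 1, 3, 0]], dtype=float)

rng = np.random.default_rng(0)
V, d = 4, 3
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_tilde = np.zeros(V), np.zeros(V)
loss = glove_loss(X, W, W_tilde, b, b_tilde)
```

Training would minimize `loss` with respect to the word vectors and biases; here they are left at their random initial values.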

3. Cosine Similarity

The standard metric for comparing embeddings:

$$\text{similarity}(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2} \, \sqrt{\sum_{i=1}^{n} v_i^2}}$$
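A minimal NumPy version of the formula, with a guard for the zero vector, where the similarity is undefined. The example vectors are arbitrary.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v, in the range [-1, 1]."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(u, v) / denom)

same_direction = cosine_similarity([1, 2], [2, 4])   # parallel vectors
orthogonal = cosine_similarity([1, 0], [0, 1])       # unrelated directions
opposite = cosine_similarity([1, 0], [-1, 0])        # opposite directions
```

Because both vectors are divided by their lengths, only direction matters: scaling either input leaves the score unchanged.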

Vector Arithmetic

The Famous Analogy

The most celebrated property of word embeddings:

king - man + woman ≈ queen

This works because embeddings encode relationships:

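  • king - man ≈ an offset that keeps "royalty" and cancels "male"
  • Adding woman moves that offset to the female side
  • The nearest remaining word vector to the result is queen (the three input words are excluded from the search)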

More Examples

    # Relationships captured by arithmetic
    paris - france + italy ≈ rome
    bigger - big + small ≈ smaller
    walking - walk + swim ≈ swimming

These clean analogies hold for static word vectors like Word2Vec and GloVe. Modern contextual encoders bury the same relational structure in a richer, less interpretable geometry — the arithmetic still roughly works, but it is no longer the headline feature.
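The analogy search can be sketched in a few lines. The vectors below are invented for illustration (they are not real Word2Vec or GloVe values), and `analogy` is a hypothetical helper. It ranks candidates by cosine similarity to b - a + c and skips the three query words, as analogy evaluations conventionally do.

```python
import numpy as np

# Toy vectors invented for illustration: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":  np.array([0.9,  0.8, 0.1]),
    "queen": np.array([0.9, -0.8, 0.1]),
    "man":   np.array([0.1,  0.8, 0.0]),
    "woman": np.array([0.1, -0.8, 0.0]),
    "apple": np.array([0.0,  0.0, 1.0]),
}

def analogy(a, b, c, vectors):
    """Solve a : b :: c : ? by ranking words against b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):  # query words are excluded by convention
            continue
        score = float(np.dot(vec, target) / np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

result = analogy("man", "king", "woman", vectors)  # king - man + woman
```

The exclusion step matters in practice: without it, the nearest vector to the result is often one of the input words themselves.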

Implementation Details

Creating Word Embeddings

    from gensim.models import Word2Vec

    # Train Word2Vec on tokenized sentences
    sentences = [["cat", "sat", "mat"], ["dog", "stood", "rug"]]
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)  # sg=1 selects skip-gram

    cat_vector = model.wv["cat"]
    similarity = model.wv.similarity("cat", "dog")

Finding Nearest Neighbors

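    import numpy as np

    def cosine_similarity(u, v):
        """Cosine of the angle between two vectors."""
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def find_nearest(query, embeddings, k=5):
        """Rank stored vectors by cosine similarity to a query."""
        scored = [
            (word, cosine_similarity(query, vec))
            for word, vec in embeddings.items()
        ]
        scored.sort(key=lambda x: x[1], reverse=True)  # highest similarity first
        return scored[:k]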

At scale you would not scan every vector — see the note on approximate nearest neighbor search under Performance Considerations.
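Before reaching for an approximate index, the brute-force scan itself can be made much faster by stacking the vectors into one matrix. The sketch below is one way to do that with NumPy. `top_k_cosine` is an illustrative name and the 2-D vectors are toy data.

```python
import numpy as np

def top_k_cosine(query, matrix, k=5):
    """Return (indices, scores) of the k rows of `matrix` most similar to `query`."""
    # Unit-normalize so cosine similarity becomes a plain dot product.
    # In a real index the rows would be normalized once, at build time.
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    scores = unit_rows @ unit_query
    k = min(k, len(scores))
    # argpartition selects the top k without sorting everything;
    # only those k entries are then sorted.
    top = np.argpartition(-scores, k - 1)[:k]
    top = top[np.argsort(-scores[top])]
    return top, scores[top]

matrix = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
indices, scores = top_k_cosine(np.array([1.0, 0.1]), matrix, k=2)
```

This is still an exact, linear-time scan; it only removes the Python loop and the full sort.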

Sentence Embeddings

A single word vector is rarely what you want; most applications embed whole sentences or passages.

Average Pooling

Simple, but a surprisingly strong baseline:

    import numpy as np
    sentence_emb = np.mean([word_emb[word] for word in sentence], axis=0)  # word_emb: token -> vector lookup

Sentence-BERT

Purpose-built models produce far better sentence vectors than pooling word embeddings:

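    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    sentences = ["The cat sat on the mat.", "A dog stood on the rug."]  # plain strings, not token lists
    embeddings = model.encode(sentences)  # array of shape (2, 384)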

Applications

Dense embeddings follow one pattern everywhere: encode once, then compare by proximity. That single idea covers most of their uses:

  • Semantic search & RAG — embed documents and queries, then retrieve the nearest vectors instead of matching keywords.
  • Clustering & deduplication — group or collapse items whose vectors sit close together.
  • Classification — use the embedding as input features for a lightweight downstream model.
  • Recommendation — surface items near a user's or an item's vector.

In every case the embedding is computed up front and reused; only the cheap similarity comparison happens at query time.
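As one illustration of the pattern, here is a sketch of near-duplicate removal over precomputed vectors. The 0.95 threshold and the toy 2-D vectors are assumptions chosen for the example; a real threshold would be tuned per encoder and dataset.

```python
import numpy as np

def deduplicate(embeddings, threshold=0.95):
    """Greedy near-duplicate removal: keep an item only if its cosine
    similarity to every item already kept is below `threshold`."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i, vec in enumerate(unit):
        if all(float(vec @ unit[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Stand-in vectors; in practice these come from an encoder, computed once.
embeddings = np.array([
    [1.0, 0.0],
    [0.99, 0.05],  # near-duplicate of row 0
    [0.0, 1.0],
])
kept = deduplicate(embeddings)
```

Search, clustering, and recommendation have the same shape: the expensive encoding happens once, and everything afterwards is dot products over stored vectors.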

Common Pitfalls

1. Bias in Embeddings

Embeddings absorb the biases in their training text, and vector arithmetic exposes them:

    doctor - man + woman ≈ nurse            # gender bias
    programmer - man + woman ≈ homemaker    # occupation bias

Debiasing methods, careful data curation, and supervised fine-tuning all help, but none fully removes the bias.

2. Out-of-Vocabulary Words

Static models have no vector for a word missing from the training vocabulary. Options:

  • Use subword units (FastText character n-grams, BPE tokens)
  • Fall back to character embeddings
  • Use contextual models (BERT), which tokenize into subwords by default
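For the subword route, the sketch below extracts FastText-style character n-grams. FastText wraps each word in `<` and `>` boundary markers and by default uses n-grams of length 3 to 6. The function name `char_ngrams` is illustrative and not the library's API.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

trigrams = char_ngrams("where", n_min=3, n_max=3)

# An unseen word still shares most of its n-grams with known relatives.
shared = set(char_ngrams("embeddings", 3, 3)) & set(char_ngrams("embedding", 3, 3))
```

An out-of-vocabulary word then gets a vector built from the vectors of its n-grams, so a rare inflection or a typo lands near the words it resembles.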

3. Polysemy

A single static vector per word cannot separate senses:

  • "bank" (financial) vs "bank" (river)
  • Solution: contextual embeddings (BERT, modern LLM encoders) that produce a different vector per occurrence

Performance Considerations

Memory

  • Static vectors: ~1.2 GB for 1M words × 300 dims in float32 (1M × 300 × 4 bytes)
  • Transformer encoders: a few hundred MB of weights plus per-input compute
  • Store vectors as float16, int8, or binary to cut the footprint dramatically (see Where Dense Embeddings Are Heading)
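The storage figures are simple arithmetic: number of vectors × dimensions × bytes per value. The helper below is a sketch that ignores index overhead and metadata; `index_size_gb` is an illustrative name.

```python
# Bytes needed to store one dimension in each format.
BYTES_PER_VALUE = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "binary": 1 / 8}

def index_size_gb(num_vectors, dims, dtype="float32"):
    """Raw vector storage in decimal gigabytes, ignoring index overhead."""
    return num_vectors * dims * BYTES_PER_VALUE[dtype] / 1e9

# 1M vectors of 300 dimensions in each storage format.
sizes = {dtype: index_size_gb(1_000_000, 300, dtype) for dtype in BYTES_PER_VALUE}
```

For 1M vectors of 300 dimensions this gives about 1.2 GB in float32, 0.6 GB in float16, 0.3 GB in int8, and under 0.04 GB in binary form.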

Speed

Brute-force cosine against every stored vector is fine for thousands of items but not millions. For large corpora, precompute embeddings once and query an approximate nearest neighbor index (HNSW, IVF-PQ) that trades a little recall for orders-of-magnitude faster search.

Best Practices

  1. Pick the right encoder — static vectors for speed, a contextual or instruction-tuned model for accuracy, a domain-specific model when one exists.

  2. Normalize before comparing — unit-normalize so cosine similarity reduces to a dot product:

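    normalized = embedding / np.linalg.norm(embedding)

  3. Match the metric to the vectors — use the similarity the encoder was trained with: cosine (or dot product on unit-normalized vectors) when magnitude is noise, raw dot product when the model encodes signal in vector length.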

  4. Fine-tune when it matters — contrastive fine-tuning on in-domain pairs reliably beats an off-the-shelf model.

Where Dense Embeddings Are Heading

The center of gravity has moved from static word vectors to contextual ones and now to LLM-derived embedding models. Models like E5, GTE, BGE, nomic-embed, and OpenAI's text-embedding-3 are trained with large-scale contrastive objectives on query–document and instruction-tagged pairs, so a single encoder produces strong vectors for retrieval, clustering, and classification at once. Many are instruction-tuned — you prefix the input with a short task description — and accept far longer context (thousands of tokens per embedding) than the 512-token encoders that preceded them.

Two efficiency ideas are reshaping how these vectors are stored. Matryoshka representation learning packs coarse-to-fine information so that a 1536-dim vector can be truncated to 256 dims and still work — one model, many sizes, chosen per query. And quantization to int8 or binary cuts storage 4–32× with modest recall loss, which is what makes billion-scale vector search affordable.
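Both ideas are mechanically simple, as the sketch below suggests. It uses a random vector as a stand-in for a real embedding, so it shows the storage arithmetic only, not retrieval quality. Truncation preserves quality only when the model was trained with a Matryoshka objective, and the per-vector int8 scale used here is one of several possible quantization schemes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-in for a unit-normalized 1536-dim float32 embedding.
full = rng.normal(size=1536).astype(np.float32)
full /= np.linalg.norm(full)

# Matryoshka-style truncation: keep the leading dims, then re-normalize.
short = full[:256] / np.linalg.norm(full[:256])

# Symmetric int8 quantization: one scale per vector, 1 byte per dimension.
scale = np.abs(full).max() / 127.0
as_int8 = np.round(full / scale).astype(np.int8)
restored = as_int8.astype(np.float32) * scale  # approximate reconstruction

# Binary quantization: keep only the sign, packed 8 dimensions per byte.
as_bits = np.packbits(full > 0)

# Compression relative to float32 storage.
ratios = (full.nbytes / as_int8.nbytes, full.nbytes / as_bits.nbytes)
```

The two ratios come out at 4× for int8 and 32× for binary, matching the range quoted above. Binary vectors are usually compared with Hamming distance and the top candidates rescored with higher-precision vectors.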

The honest caveat is that the famous king − man + woman ≈ queen story belongs to the static era. Today's passage encoders capture far richer structure, but that geometry is harder to interpret — "similar vectors mean similar meaning" is a useful intuition, not a precise law, and false neighbors (fluent text that is semantically off) remain a real failure mode in retrieval.

When to use dense embeddings (and when sparse is the better default)

Dense embeddings win whenever semantic similarity matters more than exact term overlap, but they cost more to compute, more to serve, and more to debug. The honest default for production search is: use sparse first, add dense second, fuse them, and ship the hybrid. Pure-dense is a strong choice for a narrow set of problems where exact-term match is actively bad for you.

Use dense embeddings when:

  • Query and document vocabulary diverge — users search with paraphrases, synonyms, or natural language, and your corpus uses different terminology. BM25 cannot bridge that gap; dense models can.
  • You operate cross-lingual or multimodal — dense models trained on aligned pairs (multilingual SBERT, CLIP, etc.) handle translation and image-text matching that sparse methods cannot.
  • You have a curated domain and can fine-tune — even around 1k labelled pairs with a contrastive objective can lift dense retrieval well above zero-shot BM25 in that domain.
  • Downstream uses the embedding for more than retrieval — classification, clustering, recommendations, semantic deduplication. Sparse vectors serve those secondary consumers poorly.

Use sparse retrieval (BM25/TF-IDF) when:

  • Exact terms must match — names, error codes, product SKUs, legal citations. Dense models routinely retrieve "close" matches that are wrong for these queries.
  • You have no labelled training data and the domain is far from generic web text — out-of-the-box dense models can underperform BM25 on technical corpora.
  • Latency and memory budgets are tight at billions of documents — sparse inverted indexes serve from disk; dense ANN indexes generally want their vectors in RAM (or on a GPU) to stay fast.
  • The downstream consumer is human inspection and you need to explain why a result ranked where it did — BM25 score components are inspectable; cosine similarity is not.

Default to hybrid retrieval (sparse + dense + reranker) for production search. Run BM25 and dense in parallel, fuse with Reciprocal Rank Fusion, then rerank the top-K with a cross-encoder. Pure-dense is the right answer only when you have proven that the lexical signal hurts more than it helps.
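The fusion step is small enough to sketch in full. Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over the ranked lists it appears in; k = 60 is the constant used in the original RRF paper. The document ids and rankings below are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d1", "d2", "d3"]   # hypothetical BM25 ranking
dense_hits = ["d3", "d1", "d4"]  # hypothetical dense ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

RRF uses only ranks, so BM25 scores and cosine similarities never need to be put on a common scale. Documents found by both retrievers rise to the top, and the fused list is what the cross-encoder would rerank.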
