Skip to main content

Sparse vs Dense vs Hybrid Retrieval: BM25, BERT, and Reranking Compared

Summary
How sparse retrieval (BM25/TF-IDF), dense retrieval (BERT-style embeddings), and hybrid systems that combine both compare on recall, semantic understanding, computational cost, and operational complexity for modern search.

Sparse retrieval (BM25, TF-IDF), dense retrieval (BERT-style neural embeddings), and hybrid systems that combine both are the three major approaches to modern information retrieval. Sparse retrieval wins on exact-term matching and is cheap to run; dense retrieval wins on semantic matching and zero-shot generalization; hybrid systems fuse lexical and semantic signals — usually with Reciprocal Rank Fusion — and consistently beat either alone on real-world workloads.

Try it: the retrieval playground

The same query, ranked three ways over one small library. Pick a query and step through sparse, dense, and hybrid — or switch to Explore to see all three at once. Every ranking is computed live: real BM25, real cosine similarity over averaged GloVe word vectors, and real Reciprocal Rank Fusion. Nothing is staged.

Each query breaks a different ranker. "fierce animal" exposes sparse's blind spot — it scores the cat and lion documents 0.00 because they share no words with the query, while dense ranks them at the top. "diesel engine" exposes dense's drift — it floats an electric-car document just under the diesel one (engine ≈ motor), where sparse stays exact. "machine learning models" shows hybrid breaking a tie neither ranker resolves alone.

The playground's dense encoder averages a document's GloVe vectors into a single 50-dimensional vector — a lightweight stand-in for a trained sentence encoder. Production systems use models like Sentence-BERT, but the behavior the demo teaches — meaning over words — is the same.

Sparse vs dense: the fundamental difference

Sparse embeddings (lexical)

  • High-dimensional — one dimension per vocabulary term (30K–1M)
  • Mostly zero — only ~10–100 non-zero values per document
  • Exact term matching — a dimension fires only when the word appears
  • Interpretable — every dimension is a word
  • No training — pure corpus statistics

Dense embeddings (neural)

  • Low-dimensional — 128–1024 learned dimensions
  • Fully dense — every dimension carries signal
  • Semantic matching — synonyms and paraphrases land nearby
  • Opaque — individual dimensions have no clean meaning
  • Trained — learned from large corpora

That density difference is visible in the playground: the sparse column is mostly empty bars (zeros), while the dense column scores every document.

Sparse retrieval: BM25 and TF-IDF

TF-IDF weights a term by how often it appears in a document (TF) against how rare it is across the corpus (IDF):

\text{TF-IDF}(t, d, D) = \text{TF}(t, d) × \text{IDF}(t, D)

where \text{TF}(t, d) = ft,dΣt' ∈ d ft',d and \text{IDF}(t, D) = log |D||\{d ∈ D : t ∈ d\}|.

from sklearn.feature_extraction.text import TfidfVectorizer vectorizer = TfidfVectorizer(max_features=10000) sparse = vectorizer.fit_transform(documents) # [n_docs, vocab] — ~99% zeros

BM25 refines TF-IDF with term-frequency saturation (k1) and document-length normalization (b) — the exact scoring function the playground runs:

\text{BM25}(D, Q) = Σi=1n \text{IDF}(qi) · f(qi, D) · (k1 + 1)f(qi, D) + k1 · (1 - b + b · |D|\text{avgdl})
from rank_bm25 import BM25Okapi bm25 = BM25Okapi([doc.split() for doc in documents]) scores = bm25.get_scores("machine learning models".split())

In production this runs over an inverted index — a term → postings-list map that touches only the documents containing a query word, which is what keeps sparse retrieval fast at billions of documents.

Dense retrieval: neural embeddings

A dense encoder maps text to a fixed-length vector where geometric closeness means semantic closeness. Sentence-BERT-style models mean-pool token embeddings and train with a contrastive objective, so paraphrases converge and unrelated text diverges:

from sentence_transformers import SentenceTransformer model = SentenceTransformer("all-MiniLM-L6-v2") doc_vecs = model.encode(documents) # [n_docs, 384], every value non-zero query_vec = model.encode("machine learning models")

Retrieval is nearest-neighbor search by cosine similarity. At scale it's approximate (HNSW, IVF-PQ) — trading a sliver of recall for sub-linear latency:

import faiss faiss.normalize_L2(doc_vecs) # cosine via inner product index = faiss.IndexFlatIP(384) index.add(doc_vecs) scores, ids = index.search(query_vec, k=10)

Comparison

Retrieval quality

AspectSparse (BM25)Dense (BERT)
Exact matching✅ Excellent❌ Poor
Synonyms❌ Cannot handle✅ Excellent
Typos❌ Fails✅ Handles well
Long queries✅ Good⚠️ May truncate
Rare terms✅ Excellent⚠️ May miss
Common words⚠️ Down-weighted✅ Contextual

Resource requirements

MetricSparseDense
Index size~10% of corpus~100% of corpus
Query latency<10ms50-200ms
RAM usageLowHigh
GPU requiredNoRecommended
Training dataNoneLarge corpus

Hybrid retrieval

Sparse and dense fail in different directions, so fusing them is robust. The dominant method is Reciprocal Rank Fusion — it combines rankings, not scores, so it needs no score normalization:

\text{RRF}(d) = Σr ∈ R 1k + \text{rank}r(d)

with k = 60 by convention. The playground's hybrid column is exactly this:

def reciprocal_rank_fusion(rankings, k=60): scores = {} for ranking in rankings: # each: doc_ids ordered best-first for rank, doc_id in enumerate(ranking, start=1): scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank) return sorted(scores, key=scores.get, reverse=True)

Other fusion strategies trade simplicity for control:

  • Linear combinationα · \text{sparse} + (1 - α) · \text{dense} after normalizing each score range; tunable, but sensitive to scale.
  • Cascade / rerank — a cheap first stage retrieves candidates, then an expensive cross-encoder reranks the top-k.
  • Learned sparse (SPLADE) — a transformer predicts term weights across the vocabulary, producing sparse vectors with dense-like semantics that still drop into an inverted index.

Choosing an approach

Sparse is better for exact-term intent — legal document search, code search (specific function names), e-commerce (product IDs, SKUs), and hard latency budgets (<20ms).

Dense is better for meaning over wording — question answering, semantic similarity, conversational search, and cross-lingual retrieval.

Hybrid is better for mixed query types where quality is paramount — enterprise, academic, medical, and general web search; the default for production RAG.

References

  • Robertson & Zaragoza "The Probabilistic Relevance Framework: BM25 and Beyond"
  • Reimers & Gurevych "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks"
  • Formal et al. "SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking"
  • Ma et al. "A Replication Study of Dense Passage Retriever"
  • Khattab & Zaharia "ColBERT: Efficient and Effective Passage Search"

If you found this explanation helpful, consider sharing it with others.

Mastodon