Sparse retrieval (BM25, TF-IDF), dense retrieval (BERT-style neural embeddings), and hybrid systems that combine both are the three major approaches to modern information retrieval. Sparse retrieval wins on exact-term matching and is cheap to run; dense retrieval wins on semantic matching and zero-shot generalization; hybrid systems fuse lexical and semantic signals — usually with Reciprocal Rank Fusion — and consistently beat either alone on real-world workloads.
Try it: the retrieval playground
The same query, ranked three ways over one small library. Pick a query and step through sparse, dense, and hybrid — or switch to Explore to see all three at once. Every ranking is computed live: real BM25, real cosine similarity over averaged GloVe word vectors, and real Reciprocal Rank Fusion. Nothing is staged.
Each query breaks a different ranker. "fierce animal" exposes sparse's blind spot — it scores the cat and lion documents 0.00 because they share no words with the query, while dense ranks them at the top. "diesel engine" exposes dense's drift — it floats an electric-car document just under the diesel one (engine ≈ motor), where sparse stays exact. "machine learning models" shows hybrid breaking a tie neither ranker resolves alone.
The playground's dense encoder averages a document's GloVe vectors into a single 50-dimensional vector — a lightweight stand-in for a trained sentence encoder. Production systems use models like Sentence-BERT, but the behavior the demo teaches — meaning over words — is the same.
Sparse vs dense: the fundamental difference
Sparse embeddings (lexical)
- High-dimensional — one dimension per vocabulary term (30K–1M)
- Mostly zero — only ~10–100 non-zero values per document
- Exact term matching — a dimension fires only when the word appears
- Interpretable — every dimension is a word
- No training — pure corpus statistics
Dense embeddings (neural)
- Low-dimensional — 128–1024 learned dimensions
- Fully dense — every dimension carries signal
- Semantic matching — synonyms and paraphrases land nearby
- Opaque — individual dimensions have no clean meaning
- Trained — learned from large corpora
That density difference is visible in the playground: the sparse column is mostly empty bars (zeros), while the dense column scores every document.
Sparse retrieval: BM25 and TF-IDF
TF-IDF weights a term by how often it appears in a document (TF) against how rare it is across the corpus (IDF):
where \text{TF}(t, d) = ft,dΣt' ∈ d ft',d and \text{IDF}(t, D) = log |D||\{d ∈ D : t ∈ d\}|.
from sklearn.feature_extraction.text import TfidfVectorizer vectorizer = TfidfVectorizer(max_features=10000) sparse = vectorizer.fit_transform(documents) # [n_docs, vocab] — ~99% zeros
BM25 refines TF-IDF with term-frequency saturation (k1) and document-length normalization (b) — the exact scoring function the playground runs:
from rank_bm25 import BM25Okapi bm25 = BM25Okapi([doc.split() for doc in documents]) scores = bm25.get_scores("machine learning models".split())
In production this runs over an inverted index — a term → postings-list map that touches only the documents containing a query word, which is what keeps sparse retrieval fast at billions of documents.
Dense retrieval: neural embeddings
A dense encoder maps text to a fixed-length vector where geometric closeness means semantic closeness. Sentence-BERT-style models mean-pool token embeddings and train with a contrastive objective, so paraphrases converge and unrelated text diverges:
from sentence_transformers import SentenceTransformer model = SentenceTransformer("all-MiniLM-L6-v2") doc_vecs = model.encode(documents) # [n_docs, 384], every value non-zero query_vec = model.encode("machine learning models")
Retrieval is nearest-neighbor search by cosine similarity. At scale it's approximate (HNSW, IVF-PQ) — trading a sliver of recall for sub-linear latency:
import faiss faiss.normalize_L2(doc_vecs) # cosine via inner product index = faiss.IndexFlatIP(384) index.add(doc_vecs) scores, ids = index.search(query_vec, k=10)
Comparison
Retrieval quality
| Aspect | Sparse (BM25) | Dense (BERT) |
|---|---|---|
| Exact matching | ✅ Excellent | ❌ Poor |
| Synonyms | ❌ Cannot handle | ✅ Excellent |
| Typos | ❌ Fails | ✅ Handles well |
| Long queries | ✅ Good | ⚠️ May truncate |
| Rare terms | ✅ Excellent | ⚠️ May miss |
| Common words | ⚠️ Down-weighted | ✅ Contextual |
Resource requirements
| Metric | Sparse | Dense |
|---|---|---|
| Index size | ~10% of corpus | ~100% of corpus |
| Query latency | <10ms | 50-200ms |
| RAM usage | Low | High |
| GPU required | No | Recommended |
| Training data | None | Large corpus |
Hybrid retrieval
Sparse and dense fail in different directions, so fusing them is robust. The dominant method is Reciprocal Rank Fusion — it combines rankings, not scores, so it needs no score normalization:
with k = 60 by convention. The playground's hybrid column is exactly this:
def reciprocal_rank_fusion(rankings, k=60): scores = {} for ranking in rankings: # each: doc_ids ordered best-first for rank, doc_id in enumerate(ranking, start=1): scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank) return sorted(scores, key=scores.get, reverse=True)
Other fusion strategies trade simplicity for control:
- Linear combination — α · \text{sparse} + (1 - α) · \text{dense} after normalizing each score range; tunable, but sensitive to scale.
- Cascade / rerank — a cheap first stage retrieves candidates, then an expensive cross-encoder reranks the top-k.
- Learned sparse (SPLADE) — a transformer predicts term weights across the vocabulary, producing sparse vectors with dense-like semantics that still drop into an inverted index.
Choosing an approach
Sparse is better for exact-term intent — legal document search, code search (specific function names), e-commerce (product IDs, SKUs), and hard latency budgets (<20ms).
Dense is better for meaning over wording — question answering, semantic similarity, conversational search, and cross-lingual retrieval.
Hybrid is better for mixed query types where quality is paramount — enterprise, academic, medical, and general web search; the default for production RAG.
References
- Robertson & Zaragoza "The Probabilistic Relevance Framework: BM25 and Beyond"
- Reimers & Gurevych "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks"
- Formal et al. "SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking"
- Ma et al. "A Replication Study of Dense Passage Retriever"
- Khattab & Zaharia "ColBERT: Efficient and Effective Passage Search"
Related concepts
Master the BM25 algorithm, the probabilistic ranking function powering Elasticsearch and Lucene for keyword-based document retrieval and search systems.
Learn how binary embeddings use 1-bit quantization for ultra-compact vector representations, enabling billion-scale similarity search with 32x memory reduction.
Understand the fundamental differences between independent and joint encoding architectures for neural retrieval systems.
How dense embeddings turn meaning into geometry: word2vec, GloVe, and contextual models, vector arithmetic, cosine similarity, and where the field is heading.
Explore ColBERT and other multi-vector retrieval models that use fine-grained token-level matching for superior search quality.
How HNSW, IVF-PQ, and LSH compare for approximate nearest neighbor (ANN) search — recall, latency, memory, build cost, and update characteristics — with Annoy, ScaNN, and DiskANN included for completeness.
