There is one decision that shapes every neural search system: do you encode the query and the document separately, or together? Separately (bi-encoder) is fast and scales to billions of documents but blurs fine detail. Together (cross-encoder) is exquisitely accurate but can't be precomputed. The art is using each where it belongs.
Interactive Architecture Comparison
The explorer below is real: documents are genuine GloVe vectors. The bi-encoder ranks the whole corpus by pooled-vector cosine — and gets fooled by a topically-dense doc that misses a key query term. The cross-encoder re-scores the shortlist with a joint, term-level score and recovers the right answer. Watch where the gold doc sits before and after.
Two ways to score a pair
Bi-encoder (dual encoder)
Encode query and document independently into one vector each, then compare with a dot product. Because the document side never sees the query, every document embedding can be computed once, indexed, and reused for every future query.

    import torch.nn as nn
    import torch.nn.functional as F
    from transformers import AutoModel

    class BiEncoder(nn.Module):
        def __init__(self, model='bert-base-uncased'):
            super().__init__()
            self.q_encoder = AutoModel.from_pretrained(model)  # query tower
            self.d_encoder = AutoModel.from_pretrained(model)  # document tower

        def encode(self, tokens, which='q'):
            # tokens: tokenizer output as a dict of tensors
            enc = self.q_encoder if which == 'q' else self.d_encoder
            return F.normalize(enc(**tokens).pooler_output, dim=-1)

        def score(self, q_emb, d_emb):
            # vectors are unit-normalized, so this dot product is a cosine
            return (q_emb * d_emb).sum(-1)

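The "compute once, reuse for every query" property can be sketched without any model at all. The snippet below is a minimal illustration, assuming hand-written 3-d vectors in place of real encoder outputs; `doc_index` and `search` are hypothetical names, not part of any library.

```python
import math

def normalize(v):
    """Scale a vector to unit length so a dot product equals a cosine."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Offline: embed every document once and store the result.
# These vectors are made-up stand-ins for real encoder outputs.
doc_index = {
    "doc_a": normalize([1.0, 0.0, 0.0]),
    "doc_b": normalize([0.0, 1.0, 0.0]),
    "doc_c": normalize([1.0, 1.0, 0.0]),
}

def search(query_vec, index, top=2):
    """Online: normalize only the query, then score it against stored vectors."""
    q = normalize(query_vec)
    ranked = sorted(index, key=lambda doc_id: dot(q, index[doc_id]), reverse=True)
    return ranked[:top]

first = search([0.9, 0.1, 0.0], doc_index)   # document side is not recomputed
second = search([0.1, 0.9, 0.0], doc_index)  # same index, different query
print(first, second)
```

Only the query vector changes between the two calls; the document vectors are read straight from the stored index, which is what makes approximate nearest-neighbour indexing possible at scale.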
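Bi-encoders are trained with a contrastive objective that pulls the right document closer than every other document in the batch:

$$\mathcal{L} = -\log \frac{\exp\left(s(q, d^+)/\tau\right)}{\sum_{d \in D} \exp\left(s(q, d)/\tau\right)}$$
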
where s(q, d) is the similarity, d^+ the positive document, D the in-batch candidates, and τ the temperature.

    import torch
    import torch.nn.functional as F

    def in_batch_negatives_loss(q_embs, d_embs, temperature=0.07):
        # q_embs, d_embs: (batch, dim); row i of d_embs is the positive for query i
        sims = (q_embs @ d_embs.T) / temperature  # every query vs every doc
        labels = torch.arange(len(q_embs), device=q_embs.device)
        return F.cross_entropy(sims, labels)  # positives on the diagonal

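To see what the temperature does, it can help to evaluate the same objective by hand for a single query. The sketch below is a plain-Python version of the softmax cross-entropy over in-batch candidates; the similarity values are invented for illustration and `info_nce` is a hypothetical helper name.

```python
import math

def info_nce(sims, positive_idx, temperature):
    """Loss for one query: -log softmax over candidates, read at the positive."""
    logits = [s / temperature for s in sims]
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[positive_idx]

# Made-up cosine similarities of one query against 4 in-batch documents;
# index 0 is the positive.
sims = [0.80, 0.60, 0.30, 0.10]

loss_sharp = info_nce(sims, positive_idx=0, temperature=0.07)
loss_soft = info_nce(sims, positive_idx=0, temperature=1.0)
print(loss_sharp, loss_soft)
```

With the positive already ranked first, the low temperature yields the smaller loss because dividing by 0.07 stretches the gap between the positive and the negatives before the softmax. When all candidates score the same, the loss is log(N) regardless of temperature.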
Cross-encoder
Feed the concatenated pair [CLS] query [SEP] document [SEP] through one transformer, so every query token attends to every document token. The output is a single relevance score, but it depends on the pair, so nothing can be precomputed.

    import torch
    import torch.nn as nn
    from transformers import AutoModel

    class CrossEncoder(nn.Module):
        def __init__(self, model='bert-base-uncased'):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(model)
            # size the head from the config instead of hard-coding 768
            self.classifier = nn.Linear(self.encoder.config.hidden_size, 1)

        def forward(self, pair_tokens):  # pair_tokens = tokenizer(query, document, return_tensors='pt')
            cls = self.encoder(**pair_tokens).last_hidden_state[:, 0]  # [CLS] vector
            return torch.sigmoid(self.classifier(cls)).squeeze(-1)  # (batch,) scores in (0, 1)

That joint attention is exactly why the cross-encoder notices a missing query term that pooling averages away — the effect the explorer makes visible.
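The averaging effect can be reproduced with a toy example. This is a sketch under loud assumptions: the 3-d "word vectors" are hand-picked so the effect is visible, and `term_level_score` is a crude stand-in for joint attention (best match per query term), not a real cross-encoder.

```python
import math

# Toy 3-d "word vectors" (hypothetical, hand-picked for illustration).
EMB = {
    "bert": [1.0, 0.0, 0.0],
    "latency": [0.0, 1.0, 0.0],
    "filler": [0.0, 0.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def mean_pool(tokens):
    vecs = [EMB[t] for t in tokens]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def pooled_score(query, doc):
    """Bi-encoder style: one pooled vector per side, then cosine."""
    return cosine(mean_pool(query), mean_pool(doc))

def term_level_score(query, doc):
    """Term-level stand-in: each query term's best match in the doc, averaged."""
    return sum(max(cosine(EMB[q], EMB[d]) for d in doc) for q in query) / len(query)

query = ["bert", "latency"]
docs = {
    "distractor": ["bert", "bert", "bert"],           # dense on topic, never mentions latency
    "gold": ["bert", "latency", "filler", "filler"],  # covers both query terms
}

pooled = {name: pooled_score(query, doc) for name, doc in docs.items()}
term = {name: term_level_score(query, doc) for name, doc in docs.items()}
print(pooled, term)
```

In this toy setup the pooled score prefers the distractor, because repeating one on-topic term keeps its average vector close to the query's, while the term-level score prefers the gold document, because it checks that each query term has a match.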
The cost asymmetry
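The two architectures spend their compute in different places, and at query time the gap is orders of magnitude. Below, n is the number of documents in the corpus and L is the token length of a concatenated query-document pair.
| Stage | Bi-Encoder | Cross-Encoder |
|---|---|---|
| Indexing | O(n) document encodings, one-time | Not applicable |
| Query encoding | O(1): one forward pass per query | O(n): one forward pass per document |
| Scoring | O(1) dot product per document | O(L²) full attention per pair |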
| Total for 1M docs | ~50ms | ~3 hours |
The bi-encoder does its expensive work once, offline. The cross-encoder must re-encode every query-document pair at query time — feasible for a shortlist, hopeless for a corpus.
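The arithmetic behind "feasible for a shortlist, hopeless for a corpus" is short enough to write out. This sketch assumes pairs are scored one at a time and uses a per-pair cost back-derived from the table's "~3 hours for 1M docs" figure; that per-pair number is an assumption, not a measurement.

```python
def cross_encoder_seconds(num_docs, seconds_per_pair):
    """Query-time cost of scoring num_docs pairs sequentially."""
    return num_docs * seconds_per_pair

# Assumed per-pair cost: 3 hours spread over 1M documents.
SECONDS_PER_PAIR = 3 * 3600 / 1_000_000

full_corpus = cross_encoder_seconds(1_000_000, SECONDS_PER_PAIR)
shortlist = cross_encoder_seconds(100, SECONDS_PER_PAIR)

print(f"full corpus: {full_corpus / 3600:.1f} h per query")
print(f"shortlist of 100: {shortlist:.2f} s per query")
print(f"ratio: {full_corpus / shortlist:,.0f}x")
```

Batching on an accelerator would shrink both absolute numbers, so the ratio is the point here: cost scales linearly with the number of pairs scored, and a shortlist of 100 is 10,000 times smaller than a corpus of 1M.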
Two-stage retrieve-then-rerank
So you use both, in sequence: the bi-encoder retrieves a cheap shortlist from the full corpus; the cross-encoder spends its accuracy budget re-ranking just those candidates.

    import torch

    def retrieve_then_rerank(query, index, corpus, bi, cross, tokenizer, k=100, top=10):
        # Stage 1: bi-encoder over the precomputed index (fast, wide)
        q_tokens = tokenizer(query, truncation=True, return_tensors='pt')
        with torch.no_grad():
            q_emb = bi.encode(q_tokens, which='q')
        cand_ids = index.search(q_emb, k)  # ANN over the whole corpus; assumed to return a list of doc ids

        # Stage 2: cross-encoder over the shortlist (slow, sharp)
        docs = [corpus[i] for i in cand_ids]
        pair_tokens = tokenizer([query] * len(docs), docs, padding=True,
                                truncation=True, return_tensors='pt')
        with torch.no_grad():
            scores = cross(pair_tokens).flatten()  # one score per candidate
        order = scores.argsort(descending=True)[:top].tolist()
        return [cand_ids[i] for i in order]

The catch the explorer dramatizes: re-ranking can only reorder what Stage 1 surfaced. If the bi-encoder buries the right document below the cutoff k, no amount of re-ranking recovers it — so k trades recall against latency.
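Because the shortlist caps what the re-ranker can achieve, Stage-1 recall@k is worth measuring directly. A minimal sketch, assuming each query comes with a ranked list of ids and a set of known-relevant ids; the sample rankings are invented.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k of a ranking."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(runs, k):
    """Average recall@k over (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)

# Hypothetical Stage-1 output for two queries.
runs = [
    (["d3", "d7", "d1", "d9"], {"d1"}),        # gold doc at rank 3
    (["d2", "d5", "d8", "d4"], {"d4", "d2"}),  # gold docs at ranks 1 and 4
]

for k in (1, 2, 4):
    print(k, mean_recall_at_k(runs, k))
```

Sweeping k like this shows where recall stops improving, which is a reasonable place to set the cutoff before adding re-ranking latency.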
Quality vs latency
On MS MARCO passage ranking, the hybrid recovers almost all of the cross-encoder's quality at a fraction of its cost:
| Model | MRR@10 | Recall@100 | Latency |
|---|---|---|---|
| BM25 | 18.7 | 85.7 | 20ms |
| Bi-Encoder (DPR) | 31.2 | 95.2 | 50ms |
| Cross-Encoder | 39.2 | N/A | ~10ms/doc |
| Bi-Encoder + Cross-Encoder | 38.5 | 95.2 | 150ms |
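For readers unfamiliar with the MRR@10 column, the metric is the mean over queries of 1/rank of the first relevant result, counting only the top 10. A small sketch with invented rankings follows; the table above appears to report the value multiplied by 100.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """1/rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mrr_at_k(runs, k=10):
    """Mean reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)

# Hypothetical rankings for three queries.
runs = [
    (["d1", "d2", "d3"], {"d1"}),        # first relevant at rank 1
    (["d5", "d6", "d7", "d8"], {"d8"}),  # first relevant at rank 4
    (["d9", "d0"], {"d4"}),              # relevant doc not retrieved
]

print(mrr_at_k(runs))
```

Because only the first relevant hit counts, MRR rewards putting one right answer at the very top, which is exactly what a re-ranker is meant to improve.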
Choosing an architecture
Reach for a bi-encoder alone when scale and latency dominate (semantic search, recommendation, real-time QA). Reach for a cross-encoder alone only on tiny corpora where accuracy is everything (fact verification, duplicate detection). For almost everything in between — web, enterprise, e-commerce, and academic search — use both.
Beyond the two
The bi/cross split is the start of a spectrum that trades interaction for cost. Poly-encoders add a handful of attention codes to a bi-encoder for a little query-time interaction; ColBERT keeps a vector per token and does cheap late interaction at search time — recovering much of the cross-encoder's precision while staying precomputable. Those multi-vector methods get their own treatment in multi-vector late interaction.
Best practices
- Start with a bi-encoder — it defines your recall ceiling; the re-ranker can't exceed it.
- Add a cross-encoder when accuracy plateaus, re-ranking the top 50–200.
- Tune k deliberately: measure Stage-1 recall@k before paying for re-ranking.
- Distill the cross-encoder into the bi-encoder to lift Stage-1 quality for free.
- Cache cross-encoder scores for hot query-document pairs.
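The caching advice can be sketched with the standard library's `functools.lru_cache`. This is one possible shape, assuming scores are keyed by a (query, doc_id) pair of hashable values; `make_cached_scorer` and `fake_cross_encoder` are hypothetical names, and the fake scorer stands in for a real model call.

```python
from functools import lru_cache

def make_cached_scorer(score_pair, maxsize=100_000):
    """Wrap a (query, doc_id) -> float scorer with an LRU cache."""
    @lru_cache(maxsize=maxsize)
    def cached(query, doc_id):
        return score_pair(query, doc_id)
    return cached

calls = []  # records every time the expensive scorer actually runs

def fake_cross_encoder(query, doc_id):
    """Placeholder for a real cross-encoder forward pass."""
    calls.append((query, doc_id))
    return float(len(query) + len(doc_id))

score = make_cached_scorer(fake_cross_encoder)
score("bi vs cross", "d1")
score("bi vs cross", "d1")  # repeated pair: served from the cache
score("bi vs cross", "d2")

print(len(calls), score.cache_info())
```

A cache like this is only valid while both the model and the document text stay the same, so it would need clearing (for example via `score.cache_clear()`) after a model update or a re-index.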
References
- Karpukhin et al. "Dense Passage Retrieval for Open-Domain Question Answering"
- Humeau et al. "Poly-encoders: Architectures and Pre-training Strategies"
- Reimers & Gurevych "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks"
- Nogueira & Cho "Passage Re-ranking with BERT"
