
Cross-Encoder vs Bi-Encoder

Summary
Understand the fundamental differences between independent and joint encoding architectures for neural retrieval systems.

There is one decision that shapes every neural search system: do you encode the query and the document separately, or together? Separately (bi-encoder) is fast and scales to billions of documents but blurs fine detail. Together (cross-encoder) is exquisitely accurate but can't be precomputed. The art is using each where it belongs.

Interactive Architecture Comparison

The explorer below is real: documents are genuine GloVe vectors. The bi-encoder ranks the whole corpus by pooled-vector cosine — and gets fooled by a topically-dense doc that misses a key query term. The cross-encoder re-scores the shortlist with a joint, term-level score and recovers the right answer. Watch where the gold doc sits before and after.

Two ways to score a pair

Bi-encoder (dual encoder)

Encode query and document independently into one vector each, then compare with a dot product (which equals cosine similarity once both vectors are L2-normalized). Because the document side never sees the query, every document embedding can be computed once, indexed, and reused for every future query.

s_{\text{bi}}(q, d) = \cos\big(\text{pool}(E_q(q)),\; \text{pool}(E_d(d))\big)

    import torch.nn as nn
    import torch.nn.functional as F
    from transformers import AutoModel

    class BiEncoder(nn.Module):
        def __init__(self, model='bert-base-uncased'):
            super().__init__()
            # Separate towers for queries and documents
            self.q_encoder = AutoModel.from_pretrained(model)
            self.d_encoder = AutoModel.from_pretrained(model)

        def encode(self, tokens, which='q'):
            enc = self.q_encoder if which == 'q' else self.d_encoder
            # L2-normalize so the dot product in score() equals cosine similarity
            return F.normalize(enc(**tokens).pooler_output, dim=-1)

        def score(self, q_emb, d_emb):
            return (q_emb * d_emb).sum(-1)  # dot product of fixed vectors
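The precompute-once property can be sketched without any model at all. The snippet below is a minimal illustration in plain Python. The three-dimensional word vectors in `TOY_VECTORS` and the helper names `embed`, `build_index`, and `search` are made up for the example; they stand in for a trained encoder and a real vector index.

```python
import math

# Hypothetical 3-d word vectors standing in for a trained encoder.
TOY_VECTORS = {
    "cat": (0.9, 0.1, 0.0),
    "kitten": (0.8, 0.2, 0.0),
    "dog": (0.5, 0.5, 0.0),
    "stock": (0.0, 0.1, 0.9),
    "market": (0.0, 0.2, 0.8),
}

def embed(text):
    """Mean-pool the vectors of known tokens, then L2-normalize.

    Assumes the text contains at least one token from TOY_VECTORS.
    """
    vecs = [TOY_VECTORS[t] for t in text.lower().split() if t in TOY_VECTORS]
    pooled = [sum(dim) / len(vecs) for dim in zip(*vecs)]
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled]

def build_index(corpus):
    """Offline step: embed every document exactly once."""
    return {doc_id: embed(text) for doc_id, text in corpus.items()}

def search(query, index, k=2):
    """Online step: embed only the query, then take dot products."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, d)), doc_id) for doc_id, d in index.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:k]]

corpus = {"d1": "cat kitten", "d2": "stock market", "d3": "dog cat"}
index = build_index(corpus)            # computed once
top_pets = search("kitten", index)     # both queries reuse the same index
top_finance = search("market", index)
```

Only `embed(query)` runs per query; the document vectors are never recomputed. A production system would swap the brute-force loop in `search` for an approximate nearest-neighbor index.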

Bi-encoders are trained with a contrastive objective — pull the right document closer than every other document in the batch:

\mathcal{L} = -\log \frac{e^{s(q, d^+)/\tau}}{\sum_{d' \in D} e^{s(q, d')/\tau}}

where s(q, d) is the similarity, d^+ the positive document, D the in-batch candidates, and τ the temperature.

    import torch
    import torch.nn.functional as F

    def in_batch_negatives_loss(q_embs, d_embs, temperature=0.07):
        sims = (q_embs @ d_embs.T) / temperature  # every query vs every doc
        labels = torch.arange(len(q_embs), device=q_embs.device)
        return F.cross_entropy(sims, labels)  # positives on the diagonal
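To see what the temperature does, it may help to evaluate the loss by hand for a single query. The sketch below is plain Python, and the similarity values are invented for illustration. It computes the same quantity as the formula above, using the log-sum-exp trick for numerical stability.

```python
import math

def info_nce(sims, positive_idx, temperature):
    """-log softmax(sims / temperature)[positive_idx] for one query."""
    scaled = [s / temperature for s in sims]
    m = max(scaled)  # subtract the max before exponentiating, for stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in scaled))
    return log_denominator - scaled[positive_idx]

# Illustrative cosine similarities: index 0 is the positive,
# the others play the role of in-batch negatives.
sims = [0.80, 0.60, 0.10]

loss_sharp = info_nce(sims, positive_idx=0, temperature=0.07)
loss_flat = info_nce(sims, positive_idx=0, temperature=1.0)
loss_wrong = info_nce(sims, positive_idx=1, temperature=0.07)
```

With the positive already ranked first, the low temperature drives the loss close to zero, while τ = 1 leaves it much larger because the softmax stays flat. If a negative outranks the labeled positive (`loss_wrong`), the low temperature penalizes it heavily.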

Cross-encoder

Feed the concatenated pair [CLS] query [SEP] document [SEP] through one transformer, so every query token attends to every document token. The output is a single relevance score — but it depends on the pair, so nothing can be precomputed.

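s_{\text{cross}}(q, d) = \sigma\big(W \cdot \text{BERT}([q; d])_{\texttt{[CLS]}}\big)

    import torch
    import torch.nn as nn
    from transformers import AutoModel

    class CrossEncoder(nn.Module):
        def __init__(self, model='bert-base-uncased'):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(model)
            # 768 for bert-base; read from the config so other models also work
            self.classifier = nn.Linear(self.encoder.config.hidden_size, 1)

        def forward(self, pair_tokens):  # pair_tokens = tokenizer(query, document, return_tensors='pt')
            cls = self.encoder(**pair_tokens).last_hidden_state[:, 0]  # [CLS] position
            return torch.sigmoid(self.classifier(cls)).squeeze(-1)  # one score per pair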

That joint attention is exactly why the cross-encoder notices a missing query term that pooling averages away — the effect the explorer makes visible.
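The pooling effect can be reproduced on a toy example. The sketch below is not a cross-encoder: it uses a simple term-level matching score as a stand-in for joint scoring, and the three-dimensional word vectors in `VEC` are hypothetical values chosen to make the effect visible.

```python
import math

# Hypothetical 3-d word vectors; axes are roughly (animal, speed, nature).
VEC = {
    "jaguar": (1.0, 0.0, 0.0),
    "speed": (0.0, 1.0, 0.0),
    "fast": (0.1, 0.9, 0.0),
    "car": (0.4, 0.6, 0.0),
    "engine": (0.3, 0.7, 0.0),
    "forest": (0.0, 0.0, 1.0),
    "river": (0.0, 0.0, 1.0),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_pool(tokens):
    return [sum(dim) / len(tokens) for dim in zip(*(VEC[t] for t in tokens))]

def pooled_score(query, doc):
    """Bi-encoder style: pool each side to one vector, then one cosine."""
    return cosine(mean_pool(query), mean_pool(doc))

def term_level_score(query, doc):
    """Stand-in for joint scoring: each query term finds its best match in the doc."""
    return sum(max(cosine(VEC[q], VEC[d]) for d in doc) for q in query) / len(query)

query = ["jaguar", "speed"]
gold = ["jaguar", "speed", "forest", "river"]    # contains both query terms
distractor = ["speed", "fast", "car", "engine"]  # topically dense, no "jaguar"

pooled_prefers_distractor = pooled_score(query, distractor) > pooled_score(query, gold)
term_level_prefers_gold = term_level_score(query, gold) > term_level_score(query, distractor)
```

Pooling averages the off-topic words of the gold document into its vector and rewards the distractor for being uniformly about speed. The term-level score notices that nothing in the distractor matches "jaguar" well.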

The cost asymmetry

The two architectures differ by orders of magnitude in query-time compute, because they spend their effort at different stages.

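| Stage | Bi-Encoder | Cross-Encoder |
|---|---|---|
| Indexing | O(n) one-time | Not applicable |
| Query encoding | O(1) | O(n), one forward pass per document |
| Scoring | O(1) dot product per document | O(L²) full attention per pair |
| Total for 1M docs | ~50 ms | ~3 hours |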

The bi-encoder does its expensive work once, offline. The cross-encoder must re-encode every query-document pair at query time — feasible for a shortlist, hopeless for a corpus.
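A back-of-envelope calculation shows where a figure like "~3 hours" can come from. The per-pair latency below is an assumption chosen for illustration (10 ms per forward pass), not a measurement from this article.

```python
SECONDS_PER_PAIR = 0.010  # assumed cross-encoder forward-pass time, not a measurement

def cross_encoder_seconds(num_docs, seconds_per_pair=SECONDS_PER_PAIR):
    """One forward pass per query-document pair; nothing is reusable across queries."""
    return num_docs * seconds_per_pair

full_corpus_hours = cross_encoder_seconds(1_000_000) / 3600  # score every document
shortlist_seconds = cross_encoder_seconds(100)               # score a 100-doc shortlist
```

Under that assumption, scanning one million documents takes about 2.8 hours per query. A 100-document shortlist takes about one second when scored one pair at a time, and batching pairs on a GPU would typically shorten that further.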

Two-stage retrieve-then-rerank

So you use both, in sequence: the bi-encoder retrieves a cheap shortlist from the full corpus; the cross-encoder spends its accuracy budget re-ranking just those candidates.

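    import torch

    def retrieve_then_rerank(query, index, corpus, bi, cross, tokenizer, k=100, top=10):
        # tokenizer: a Hugging Face tokenizer matching both models (passed in explicitly)
        with torch.no_grad():
            # Stage 1: bi-encoder over the precomputed index (fast, wide)
            q_emb = bi.encode(tokenizer(query, return_tensors='pt'), which='q')
            cand_ids = index.search(q_emb, k)  # ANN over the whole corpus; assumed to return a list of doc ids
            # Stage 2: cross-encoder over the shortlist (slow, sharp)
            docs = [corpus[i] for i in cand_ids]
            pair_tokens = tokenizer([query] * len(docs), docs, padding=True,
                                    truncation=True, return_tensors='pt')
            scores = cross(pair_tokens).reshape(-1)  # one score per candidate
        order = scores.argsort(descending=True)[:top].tolist()
        return [cand_ids[i] for i in order]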

The catch the explorer dramatizes: re-ranking can only reorder what Stage 1 surfaced. If the bi-encoder buries the right document below the cutoff k, no amount of re-ranking recovers it — so k trades recall against latency.
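Measuring that ceiling is straightforward. The sketch below computes recall@k for a single query. The document ids and the ranking are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k of Stage 1."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Hypothetical Stage-1 ranking for one query; "gold" sits at rank 5.
stage1 = ["d7", "d2", "d9", "d4", "gold", "d1"]
relevant = {"gold"}

recall_small_k = recall_at_k(stage1, relevant, k=3)  # gold is cut off before re-ranking
recall_large_k = recall_at_k(stage1, relevant, k=5)  # gold reaches the re-ranker
```

With k=3 the re-ranker never sees the gold document, so its final rank cannot improve no matter how accurate Stage 2 is.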

Quality vs latency

On MS MARCO passage ranking, the hybrid recovers almost all of the cross-encoder's quality at a fraction of its cost:

| Model | MRR@10 | Recall@100 | Latency |
|---|---|---|---|
| BM25 | 18.7 | 85.7 | 20 ms |
| Bi-Encoder (DPR) | 31.2 | 95.2 | 50 ms |
| Cross-Encoder | 39.2 | N/A | 10 s/doc |
| Bi-Encoder + Cross-Encoder | 38.5 | 95.2 | 150 ms |
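For readers unfamiliar with the metric, MRR@10 averages the reciprocal rank of the first relevant document in each query's top 10, counting zero when none appears. The sketch below uses invented rankings. Note that it returns a value in [0, 1], whereas the table appears to report the metric multiplied by 100.

```python
def mrr_at_k(rankings, relevant, k=10):
    """Mean over queries of 1 / rank of the first relevant doc in the top-k (0 if none)."""
    total = 0.0
    for qid, ranked_ids in rankings.items():
        for rank, doc_id in enumerate(ranked_ids[:k], start=1):
            if doc_id in relevant[qid]:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Hypothetical results for three queries.
rankings = {
    "q1": ["a", "b", "c"],  # relevant doc at rank 1
    "q2": ["d", "e", "f"],  # relevant doc at rank 2
    "q3": ["g", "h", "i"],  # relevant doc missing
}
relevant = {"q1": {"a"}, "q2": {"e"}, "q3": {"z"}}

score = mrr_at_k(rankings, relevant)  # (1 + 1/2 + 0) / 3
```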

Choosing an architecture

| Property | Bi-encoder | Cross-encoder |
|---|---|---|
| Encoding | Query and doc independently | Query + doc jointly, full attention |
| Precompute docs? | Yes, build the index once | No, needs the pair at query time |
| Scoring cost | O(1) dot product | O(L²) attention per pair |
| Scales to | Millions of documents | A shortlist (tens to hundreds) |
| Pipeline role | Stage 1: retrieve | Stage 2: re-rank |

Reach for a bi-encoder alone when scale and latency dominate (semantic search, recommendation, real-time QA). Reach for a cross-encoder alone only on tiny corpora where accuracy is everything (fact verification, duplicate detection). For almost everything in between — web, enterprise, e-commerce, and academic search — use both.

Beyond the two

The bi/cross split is the start of a spectrum that trades interaction for cost. Poly-encoders add a handful of attention codes to a bi-encoder for a little query-time interaction; ColBERT keeps a vector per token and does cheap late interaction at search time — recovering much of the cross-encoder's precision while staying precomputable. Those multi-vector methods get their own treatment in multi-vector late interaction.

Best practices

  1. Start with a bi-encoder — it defines your recall ceiling; the re-ranker can't exceed it.
  2. Add a cross-encoder when accuracy plateaus, re-ranking the top 50–200.
  3. Tune k deliberately — measure Stage-1 recall@k before paying for re-ranking.
  4. Distill the cross-encoder into the bi-encoder to lift Stage-1 quality at no extra query-time cost.
  5. Cache cross-encoder scores for hot query-document pairs.
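One way to approach item 4 is margin-based distillation: train the bi-encoder (student) so that its score gap between a positive and a negative document matches the gap assigned by the cross-encoder (teacher). This margin-MSE objective is one common recipe, named here as an assumption; the article does not prescribe a particular method. The scores below are invented.

```python
def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Mean squared error between student and teacher score margins."""
    errors = [
        ((sp - sn) - (tp - tn)) ** 2
        for sp, sn, tp, tn in zip(student_pos, student_neg, teacher_pos, teacher_neg)
    ]
    return sum(errors) / len(errors)

# Hypothetical scores for two (query, positive, negative) triples.
teacher_pos, teacher_neg = [0.9, 0.8], [0.2, 0.7]  # cross-encoder margins: 0.7, 0.1
student_pos, student_neg = [0.6, 0.7], [0.5, 0.1]  # bi-encoder margins:    0.1, 0.6

loss = margin_mse(student_pos, student_neg, teacher_pos, teacher_neg)
perfect = margin_mse(teacher_pos, teacher_neg, teacher_pos, teacher_neg)
```

Matching margins rather than raw scores lets the student keep its own score scale. The loss is zero whenever the two models agree on how much better the positive is than the negative.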

References

  • Karpukhin et al. "Dense Passage Retrieval for Open-Domain Question Answering"
  • Humeau et al. "Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring"
  • Reimers & Gurevych "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks"
  • Nogueira & Cho "Passage Re-ranking with BERT"
