Bi-Encoder Recall, Cross-Encoder Rerank

Why production retrieval usually uses two stages: a fast bi-encoder to gather candidates, then a slower cross-encoder to rerank the few that matter.

Read this as: Where do you spend latency, on broad recall or on precise reranking?
Failure Trap
Sending the first embedding top-k straight to the model and mistaking similarity for relevance.
Decision Rule
Use a fast bi-encoder to gather candidates, then a cross-encoder to rerank the final set.
Figure: Bi-encoder recall and cross-encoder reranking. A six-step retrieval pipeline. A user query first goes through a fast bi-encoder recall stage that finds 100 candidates in about 20 milliseconds. Because independently embedded vectors can confuse similarity with relevance, a cross-encoder then scores query-document pairs directly and reranks the candidates into a precise top 10 in about 250 milliseconds, improving accuracy from 85 percent to 93 percent. In the recall set, A marks answer-bearing documents and N marks near misses.

Start With the User Query

Retrieval begins with one natural-language question. The system is not trying to answer yet; it is trying to find documents likely to contain the answer.

  • The query is the only online input.
  • The corpus was already chunked and indexed offline.
  • The first goal is recall: do not miss answer-bearing documents.

Bi-Encoder Recall Is Fast

A bi-encoder embeds the query independently from the documents, then uses vector search to retrieve a broad candidate set. Document vectors can be precomputed, so the online path is fast.

  • Query and document vectors are compared with similarity search.
  • The recall stage returns roughly the top 100 candidates.
  • The source chapter uses about 20ms as the fast-path latency.
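
The recall step above can be sketched in a few lines. This is a minimal illustration, not the chapter's implementation: `embed` is a hypothetical bag-of-words stand-in for a trained bi-encoder, the three-document corpus is invented, and a real system would use an approximate nearest-neighbor index instead of a full sort. What it does show is the split that makes recall fast: document vectors are built once offline, and only the query is embedded online.

```python
import math
from collections import Counter


def embed(text: str) -> dict[str, float]:
    """Hypothetical stand-in for a bi-encoder: an L2-normalized bag-of-words vector."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()}


def dot(a: dict[str, float], b: dict[str, float]) -> float:
    """Dot product of two sparse vectors; equals cosine similarity for unit vectors."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


# Invented toy corpus (assumption, for illustration only).
corpus = [
    "how to reset your api key",
    "api key format reference",
    "billing and invoices",
]

# Offline: every document is embedded once, independently of any query.
doc_vectors = [embed(doc) for doc in corpus]


def recall(query: str, k: int) -> list[int]:
    """Online: embed only the query, then return the indices of the k most similar documents."""
    q = embed(query)
    scored = sorted(((dot(q, v), i) for i, v in enumerate(doc_vectors)), reverse=True)
    return [i for _, i in scored[:k]]


candidates = recall("reset api key", k=2)
```

Note that the second candidate shares vocabulary with the query without explaining how to reset anything, which is the kind of near miss the next stage has to sort out.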

The Precision Gap

Similarity is not the same as relevance. A document can resemble the query while still failing to answer it, so the top 100 set contains both useful answers and near misses.

  • Bi-encoders are excellent filters.
  • They cannot let the query and document attend to each other.
  • Bi-encoder-only retrieval is shown here as about 85% accuracy.

Cross-Encoder Pair Scoring

A cross-encoder reads the query and one candidate document together. It scores the pair directly, so the model can inspect whether the document actually answers the question.

  • Input shape: query plus document as one sequence.
  • Attention connects terms across both sides.
  • The cost is linear in the number of candidates scored.
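
The interface and cost profile of pair scoring can be sketched as follows. Everything here is an assumption made for illustration: `score_pair` is a toy term-coverage function standing in for a real cross-encoder forward pass, and the `[SEP]` separator is borrowed from BERT-style models rather than taken from the source. The point is the shape of the computation: the query and document enter as one sequence, and there is one scoring call per candidate.

```python
SEP = "[SEP]"  # assumed separator token, BERT-style; not specified by the source


def build_pair_input(query: str, doc: str) -> str:
    """Join query and document into the single sequence the scorer reads."""
    return f"{query} {SEP} {doc}"


def score_pair(pair_input: str) -> float:
    """Toy stand-in for a cross-encoder: the fraction of query terms found in the document.

    A real cross-encoder would run attention over the whole sequence instead.
    """
    query, doc = pair_input.split(f" {SEP} ", 1)
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def score_candidates(query: str, candidates: list[str]) -> list[float]:
    """One scoring call per candidate, so cost grows linearly with the candidate count."""
    return [score_pair(build_pair_input(query, doc)) for doc in candidates]


scores = score_candidates(
    "reset api key",
    ["how to reset your api key", "api key format reference"],
)
```

Unlike the recall stage, nothing here can be precomputed per document, because each score depends on the specific query it is paired with.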

Rerank to the Top 10

The reranker scores the 100 recalled candidates, sorts them by relevance, and sends only the best 10 forward to the RAG prompt. The total path is slower, but still practical for production.

  • The bi-encoder keeps the candidate set small enough to rerank.
  • The cross-encoder spends compute only where it matters.
  • The source chapter frames the total latency around 250ms.
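
Put together, the two stages might look like the sketch below. The function `retrieve_then_rerank` and both toy scorers are hypothetical names introduced for this example; in practice the first scorer would be a vector index lookup and the second a trained cross-encoder. The defaults of 100 and 10 mirror the numbers used in this section.

```python
from typing import Callable, Sequence

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    corpus: Sequence[str],
    recall_score: Scorer,
    rerank_score: Scorer,
    recall_k: int = 100,
    final_k: int = 10,
) -> list[str]:
    """Cheap scorer over the whole corpus, expensive scorer over the survivors only."""
    # Stage 1: broad recall with the cheap scorer.
    recalled = sorted(corpus, key=lambda d: recall_score(query, d), reverse=True)[:recall_k]
    # Stage 2: precise rerank; the expensive scorer sees at most recall_k documents.
    reranked = sorted(recalled, key=lambda d: rerank_score(query, d), reverse=True)
    return reranked[:final_k]


def toy_recall_score(query: str, doc: str) -> float:
    """Toy similarity: counts document tokens that appear in the query, repeats included."""
    q_terms = set(query.lower().split())
    return float(sum(1 for tok in doc.lower().split() if tok in q_terms))


def toy_rerank_score(query: str, doc: str) -> float:
    """Toy relevance: the fraction of distinct query terms the document covers."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


corpus = ["api key api key format", "how to reset an api key", "billing invoices"]
top = retrieve_then_rerank(
    "reset api key", corpus, toy_recall_score, toy_rerank_score, recall_k=2, final_k=1
)
```

In this toy run the keyword-heavy near miss ranks first after recall, and the rerank stage promotes the document that covers the whole query.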

The Production Tradeoff

Multi-stage retrieval buys precision without scanning the whole corpus with the expensive model. The chapter's comparison is the useful mental model: about 85% accuracy for bi-encoder only, about 93% after reranking.

  • Use bi-encoder recall to avoid missing candidates.
  • Use cross-encoder reranking to order the best evidence.
  • The tradeoff is roughly +8 percentage points of accuracy in exchange for latency rising from about 20ms to about 250ms.
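
A back-of-the-envelope calculation shows why the expensive model is confined to the recalled set. The sketch below rests on assumptions that the source does not state: that the roughly 250ms figure is the total for both stages, that the whole difference is cross-encoder time, and that this time scales linearly with no batching gains. The one-million-document corpus is a hypothetical size chosen for illustration.

```python
RECALL_MS = 20.0       # bi-encoder recall latency quoted in this section
TOTAL_MS = 250.0       # recall + rerank latency, assumed here to be the total
CANDIDATES = 100       # documents passed to the reranker
ACC_BI_ONLY = 85       # percent, bi-encoder only
ACC_RERANKED = 93      # percent, after reranking

# Implied cross-encoder cost per query-document pair (about 2.3 ms under these assumptions).
per_pair_ms = (TOTAL_MS - RECALL_MS) / CANDIDATES

# Accuracy gained by adding the rerank stage, in percentage points.
accuracy_gain_pp = ACC_RERANKED - ACC_BI_ONLY


def full_scan_seconds(corpus_size: int) -> float:
    """Estimated time to score every document in the corpus with the cross-encoder."""
    return corpus_size * per_pair_ms / 1000.0


# Hypothetical corpus of one million chunks: roughly 2,300 seconds per query.
naive_seconds = full_scan_seconds(1_000_000)
```

Under these assumptions, scoring the whole corpus per query would take tens of minutes, while the two-stage path stays near a quarter of a second.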