Read this as: Where do you spend latency, on broad recall or on precise reranking?
- Failure Trap: Sending the first embedding top-k straight to the model and mistaking similarity for relevance.
- Decision Rule: Use a fast bi-encoder to gather candidates, then a cross-encoder to rerank the final set.
Start With the User Query
Retrieval begins with one natural-language question. The system is not trying to answer yet; it is trying to find documents likely to contain the answer.
- The query is the only online input.
- The corpus was already chunked and indexed offline.
- The first goal is recall: do not miss answer-bearing documents.
Bi-Encoder Recall Is Fast
A bi-encoder embeds the query independently from the documents, then uses vector search to retrieve a broad candidate set. Document vectors can be precomputed, so the online path is fast.
- Query and document vectors are compared with similarity search.
- The recall stage returns roughly the top 100 candidates.
- The source chapter uses about 20ms as the fast-path latency.
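The offline/online split described above can be sketched in a few lines. This is a minimal illustration, not a real bi-encoder: the `embed` function is a hypothetical bag-of-words stand-in for a learned embedding model, and `build_vocab`, `build_index`, and `recall` are names invented for this example. A production system would also use an approximate nearest-neighbor index rather than scoring every stored vector.

```python
import math


def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign one vector dimension to each distinct corpus token."""
    vocab: dict[str, int] = {}
    for doc in corpus:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def embed(text: str, vocab: dict[str, int]) -> list[float]:
    """Toy stand-in for a bi-encoder: L2-normalized bag-of-words counts."""
    vec = [0.0] * len(vocab)
    for token in text.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def build_index(corpus: list[str], vocab: dict[str, int]) -> list[list[float]]:
    """Offline step: embed every chunk once and keep the vectors."""
    return [embed(doc, vocab) for doc in corpus]


def recall(query: str, index: list[list[float]], vocab: dict[str, int], k: int = 100) -> list[int]:
    """Online step: embed only the query, then return the top-k document ids."""
    q = embed(query, vocab)
    scores = [sum(a * b for a, b in zip(q, d)) for d in index]  # cosine similarity
    return sorted(range(len(index)), key=lambda i: scores[i], reverse=True)[:k]


corpus = [
    "cross-encoder reranking improves precision",
    "bi-encoder retrieval uses precomputed document vectors",
    "recipes for sourdough bread",
]
vocab = build_vocab(corpus)
index = build_index(corpus, vocab)  # computed once, before any query arrives
top = recall("how does bi-encoder retrieval work", index, vocab, k=2)
print(top)  # document 1 ranks first
```

The point of the sketch is the shape of the work: document vectors are built once, and each query costs one embedding plus a similarity search.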
The Precision Gap
Similarity is not the same as relevance. A document can resemble the query while still failing to answer it, so the top 100 set contains both useful answers and near misses.
- Bi-encoders are excellent filters.
- They cannot let the query and document attend to each other.
- The source chapter puts bi-encoder-only retrieval at about 85% accuracy.
Cross-Encoder Pair Scoring
A cross-encoder reads the query and one candidate document together. It scores the pair directly, so the model can inspect whether the document actually answers the question.
- Input shape: query plus document as one sequence.
- Attention connects terms across both sides.
- The cost is linear in the number of candidates scored: each query-document pair needs its own forward pass, and nothing can be precomputed offline.
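As a rough sketch of the pair-scoring shape, the example below joins the query and each candidate into one sequence and makes one scoring call per candidate. The `[CLS] ... [SEP] ... [SEP]` layout follows the common BERT-style convention and is an assumption about the model family. `ToyCrossEncoder` is a hypothetical stand-in that only measures term overlap and counts calls. It is not a real model or library API.

```python
from typing import Callable


def build_pair_input(query: str, document: str) -> str:
    """Join query and document into a single sequence (BERT-style layout, assumed)."""
    return f"[CLS] {query} [SEP] {document} [SEP]"


def score_candidates(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str], float],
) -> list[float]:
    """One scoring call per candidate, so cost grows linearly with the list."""
    return [score_pair(build_pair_input(query, doc)) for doc in candidates]


class ToyCrossEncoder:
    """Hypothetical stand-in for a cross-encoder; counts forward passes."""

    def __init__(self) -> None:
        self.forward_passes = 0

    def score(self, pair_text: str) -> float:
        self.forward_passes += 1
        # A real model attends across both sides; this toy only compares term sets.
        query_part, doc_part = pair_text.split(" [SEP] ", 1)
        query_terms = set(query_part.replace("[CLS]", "").lower().split())
        doc_terms = set(doc_part.replace("[SEP]", "").lower().split())
        return len(query_terms & doc_terms) / max(len(query_terms), 1)


model = ToyCrossEncoder()
scores = score_candidates(
    "what is reranking",
    ["reranking is a second pass", "bread recipes"],
    model.score,
)
print(scores, model.forward_passes)
```

Because the query is part of every input, none of these scores can be computed before the query arrives, which is why this stage is reserved for a short candidate list.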
Rerank to the Top 10
The reranker scores the 100 recalled candidates, sorts them by relevance, and sends only the best 10 forward to the RAG prompt. The total path is slower, but still practical for production.
- The bi-encoder keeps the candidate set small enough to rerank.
- The cross-encoder spends compute only where it matters.
- The source chapter frames the total latency around 250ms.
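The two stages can be combined as in the sketch below, which assumes the recall scorer and the rerank scorer are passed in as plain functions. Both scorers here are hypothetical toys: `cheap_overlap` stands in for vector similarity and `phrase_match` stands in for a cross-encoder. The `retrieve_then_rerank` helper and the synthetic corpus are invented for illustration.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    corpus: list[str],
    recall_score: Scorer,
    rerank_score: Scorer,
    recall_k: int = 100,
    final_k: int = 10,
) -> list[str]:
    """Stage 1 keeps recall_k candidates cheaply; stage 2 reranks only those."""
    # Stage 1: a real system would query a vector index instead of scanning the corpus.
    candidates = sorted(corpus, key=lambda d: recall_score(query, d), reverse=True)[:recall_k]
    # Stage 2: the expensive scorer sees only the recalled candidates.
    reranked = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
    return reranked[:final_k]


def cheap_overlap(query: str, doc: str) -> float:
    """Toy similarity: shared words, ignoring order."""
    return float(len(set(query.split()) & set(doc.split())))


rerank_calls = {"n": 0}


def phrase_match(query: str, doc: str) -> float:
    """Toy relevance: rewards the exact query phrase, and counts calls."""
    rerank_calls["n"] += 1
    return 1.0 if query in doc else 0.0


def make_doc(i: int) -> str:
    if i % 3 == 0:
        return f"chunk {i} retrieval latency measured"      # answers the query
    if i % 3 == 1:
        return f"chunk {i} latency unrelated to retrieval"  # similar words, near miss
    return f"chunk {i} cooking"                             # off topic


corpus = [make_doc(i) for i in range(300)]
top10 = retrieve_then_rerank("retrieval latency", corpus, cheap_overlap, phrase_match)
print(len(top10), rerank_calls["n"])
```

In this toy corpus the recall stage cannot tell the answering chunks from the near misses, since both share the same words. The rerank stage separates them while scoring 100 documents instead of 300.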
The Production Tradeoff
Multi-stage retrieval buys precision without scanning the whole corpus with the expensive model. The chapter's comparison is the useful mental model: about 85% accuracy for bi-encoder only, about 93% after reranking.
- Use bi-encoder recall to avoid missing candidates.
- Use cross-encoder reranking to order the best evidence.
- The tradeoff is roughly +8 percentage points of accuracy in exchange for higher latency (about 20ms for recall alone versus about 250ms for the full path).
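A back-of-the-envelope budget makes the tradeoff concrete. The figures are the chapter's approximate numbers. The breakdown is an assumption made only for this sketch: it treats the total as recall plus reranking, with the 100 candidates scored one after another and no other overhead. Batched scoring on real hardware would change the per-pair figure.

```python
# Approximate figures quoted in the text above.
recall_ms = 20           # bi-encoder fast path
total_ms = 250           # recall plus reranking
candidates = 100         # documents passed to the cross-encoder
accuracy_before = 85     # percent, bi-encoder only
accuracy_after = 93      # percent, after reranking

# Assumed breakdown: everything beyond recall is rerank time, scored sequentially.
rerank_ms = total_ms - recall_ms
per_pair_ms = rerank_ms / candidates
gain_points = accuracy_after - accuracy_before
ms_per_point = rerank_ms / gain_points

print(f"rerank budget: {rerank_ms} ms, about {per_pair_ms:.1f} ms per pair")
print(f"gain: {gain_points} points, about {ms_per_point:.2f} ms of extra latency per point")
```

Under these assumptions the reranker has about 230 ms to spend. Shrinking the candidate set or the final cutoff moves the budget roughly in proportion.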