Generation to Retrieval - Grounding LLMs in Facts
Deep dive into retrieval: why pure generation hallucinates, vector similarity search, dense vs sparse retrieval, chunking strategies, and multi-stage retrieval with reranking
Building On Previous Knowledge
In the previous progression, you learned how LLMs generate text token-by-token using learned probability distributions. This creates a fundamental problem: the model can only generate from patterns it memorized during training.
If the answer isn’t in the training data, the model will either refuse to answer or—more dangerously—generate a plausible-sounding but fabricated response.
This progression solves that problem by introducing retrieval: giving the model access to external knowledge at inference time.
What Goes Wrong Without This:
Symptom: Your AI assistant confidently answers questions about your
company's products with completely fabricated information.
Cause: The LLM generates plausible text from patterns, but has no
access to your actual documentation. High confidence ≠ correctness.
Symptom: Retrieval returns documents with high similarity scores,
but the RAG system still produces incorrect answers.
Cause: You treated retrieval as similarity search. Similarity is
SYMMETRIC (A similar to B = B similar to A). Relevance is NOT.
Symptom: RAG works for demo queries but fails for real user queries.
Cause: Demo queries match document phrasing. Real queries use
different vocabulary. Query-document mismatch.
The Limits of Pure Generation
LLMs are trained on static data with a cutoff date. They generate from learned patterns, not live facts.
Problems with pure generation:
1. KNOWLEDGE CUTOFF
Q: "Who won the 2024 election?"
A: "I don't have information past my training cutoff..."
2. HALLUCINATION
Q: "What's the API for uploading files in our product?"
A: "Use POST /api/upload with multipart/form-data..."
(confidently wrong—made up based on patterns)
3. NO PRIVATE DATA
Q: "What did the client say in yesterday's email?"
A: Cannot access—not in training data
4. OUTDATED FACTS
Q: "What's the current price of Bitcoin?"
A: Training data price, not live price
The model generates plausible text, but plausible ≠ true.
Retrieval: Grounding Generation in Facts
Instead of asking the model to recall facts, give it facts to use.
Without retrieval:
User query → LLM → Generated answer (may hallucinate)
With retrieval:
User query → Search knowledge base → Relevant docs
↓
[Query + Docs] → LLM → Grounded answer
The LLM now has context to work with.
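A minimal sketch of the "[Query + Docs] → LLM" step. The prompt wording and the llm_generate call below are illustrative placeholders, not a specific API:

def build_grounded_prompt(query: str, retrieved_docs: list[str]) -> str:
    """Assemble a prompt that grounds the LLM in retrieved context."""
    context = "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(retrieved_docs, start=1))
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# answer = llm_generate(build_grounded_prompt(user_query, top_k_chunks))  # any LLM client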
The Retrieval Pipeline
+------------------------------------------------------------------+
| INDEXING (offline) |
+------------------------------------------------------------------+
| |
| Documents |
| │ |
| ↓ |
| ┌─────────┐ |
| │ Chunk │ Split into manageable pieces |
| └─────────┘ |
| │ |
| ↓ |
| ┌─────────┐ |
| │ Embed │ Convert chunks to vectors |
| └─────────┘ |
| │ |
| ↓ |
| ┌─────────┐ |
| │ Index │ Store in vector database |
| └─────────┘ |
| |
+------------------------------------------------------------------+
| QUERY (online) |
+------------------------------------------------------------------+
| |
| User query |
| │ |
| ↓ |
| ┌─────────┐ |
| │ Embed │ Same embedding model as indexing |
| └─────────┘ |
| │ |
| ↓ |
| ┌─────────┐ |
| │ Search │ Find similar vectors in index |
| └─────────┘ |
| │ |
| ↓ |
| Top-K most similar chunks |
| |
+------------------------------------------------------------------+
Vector Similarity Search
Core mechanic: find vectors closest to query vector.
Query embedding: [0.2, 0.8, -0.1, ...]
Document embeddings in index:
doc1: [0.25, 0.75, -0.05, ...] → sim = 0.98 ← most similar
doc2: [0.1, 0.6, 0.3, ...] → sim = 0.85
doc3: [-0.5, 0.1, 0.8, ...] → sim = 0.23
doc4: [0.22, 0.78, -0.08, ...] → sim = 0.97
Return top-K (e.g., K=3): [doc1, doc4, doc2]
Why it works: Embedding models map similar meanings to similar vectors. “How do I reset my password?” is close to “Password reset instructions” even though they share few words.
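A minimal NumPy sketch of that ranking step. The vectors are truncated to 3 dimensions for illustration, so the scores will not exactly reproduce the numbers above:

import numpy as np

query = np.array([0.2, 0.8, -0.1])
docs = np.array([
    [0.25, 0.75, -0.05],   # doc1
    [0.10, 0.60,  0.30],   # doc2
    [-0.50, 0.10,  0.80],  # doc3
    [0.22, 0.78, -0.08],   # doc4
])

def normalize(x):
    # Cosine similarity = dot product of L2-normalized vectors
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

sims = normalize(docs) @ normalize(query)
top_k = np.argsort(sims)[::-1][:3]   # indices of the 3 most similar docs
print(top_k, sims[top_k])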
Dense vs Sparse Retrieval
Two fundamentally different approaches:
+------------------------------------------------------------------+
| SPARSE RETRIEVAL (BM25, TF-IDF) |
+------------------------------------------------------------------+
| Representation: High-dimensional sparse vectors |
| (vocab_size dimensions, mostly zeros) |
| |
| "The cat sat" → [0, 0, ..., 1, 0, ..., 1, 0, ..., 1, ...] |
| ↑ ↑ ↑ |
| cat sat the |
| |
| Matching: Exact keyword overlap |
| Strengths: Precise keyword matching, interpretable |
| Weakness: Misses synonyms, requires exact terms |
+------------------------------------------------------------------+
| DENSE RETRIEVAL (Embeddings) |
+------------------------------------------------------------------+
| Representation: Low-dimensional dense vectors |
| (384-1536 dimensions, all non-zero) |
| |
| "The cat sat" → [0.23, -0.41, 0.89, 0.12, ...] |
| |
| Matching: Semantic similarity |
| Strengths: Captures meaning, handles synonyms |
| Weakness: May miss exact matches, less interpretable |
+------------------------------------------------------------------+
In practice: Combine both (hybrid search).
Query: "error code E1234"
Sparse (BM25): Finds docs with exact string "E1234" ✓
Dense:         May miss it if the embedding model has no meaningful representation of the rare code "E1234" ✗
Query: "my application keeps crashing"
Sparse (BM25): Needs exact word "crashing" ✗
Dense: Matches "app failure", "program stops working" ✓
Hybrid: Best of both
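A minimal sparse-retrieval sketch, assuming the third-party rank_bm25 package (pip install rank-bm25); whitespace tokenization is used only for illustration:

from rank_bm25 import BM25Okapi

documents = [
    "Error code E1234 indicates a failed disk write.",
    "The application crashed after the last update.",
    "Password reset instructions are in Settings > Security.",
]

# BM25 operates on token lists
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

query_tokens = "error code e1234".split()
print(bm25.get_scores(query_tokens))  # the exact-term match on "e1234" scores highest

For hybrid search, run a dense search and a sparse search in parallel and merge the two ranked lists, for example with RRF as described next.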
Reciprocal Rank Fusion (RRF)
When combining dense and sparse results, use RRF to merge ranked lists:
RRF_score(doc) = Σ_i 1 / (k + rank_i(doc))
Where:
- the sum runs over each ranking i being fused (e.g., dense and sparse)
- rank_i(doc) = position of doc in ranking i (1-indexed)
- k = smoothing constant (typically 60)
Example:
Dense ranking: [doc_A (rank 1), doc_B (rank 2), doc_C (rank 3)]
Sparse ranking: [doc_B (rank 1), doc_C (rank 2), doc_A (rank 3)]
RRF scores (k=60):
doc_A: 1/(60+1) + 1/(60+3) = 0.0164 + 0.0159 = 0.0323
doc_B: 1/(60+2) + 1/(60+1) = 0.0161 + 0.0164 = 0.0325 ← Highest!
doc_C: 1/(60+3) + 1/(60+2) = 0.0159 + 0.0161 = 0.0320
Final ranking: [doc_B, doc_A, doc_C]
RRF is simple and often performs just as well as learned score combination.
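A direct implementation of the formula above; with the two rankings from the example it reproduces [doc_B, doc_A, doc_C]:

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc IDs into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-indexed
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc_A", "doc_B", "doc_C"]
sparse = ["doc_B", "doc_C", "doc_A"]
print(reciprocal_rank_fusion([dense, sparse]))  # ['doc_B', 'doc_A', 'doc_C']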
Chunking: Why and How
Documents are too long to embed as single units:
- Embedding models have token limits (512-8192)
- Long texts dilute specific information
- Retrieval granularity matters
Document: 50-page manual
Bad: One embedding for entire document
→ Query matches but relevant info buried in noise
Good: Chunk into ~500 token pieces
→ Query matches specific relevant section
Chunking Strategies
+------------------------------------------------------------------+
| Chunking strategies |
+------------------------------------------------------------------+
| Fixed-size: Every N tokens |
| Simple, may break mid-sentence |
| |
| Sentence-based: Split at sentence boundaries |
| Preserves complete thoughts |
| |
| Paragraph-based: Split at paragraph breaks |
| Preserves larger context |
| |
| Semantic: Split where topic changes |
| Best quality, more complex |
| |
| Recursive: Try larger splitters first, fall back |
| Hierarchical, respects structure |
+------------------------------------------------------------------+
Overlap
Include some text from previous chunk to preserve context at boundaries:
Chunk 1: "...the password reset link. Click it to..."
Chunk 2: "...reset link. Click it to create a new password..."
↑
overlap region
Why: Context at boundaries isn't lost.
Tradeoff: More storage, potential duplicate retrieval.
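A minimal sketch of fixed-size chunking with overlap. It counts words rather than tokens for simplicity; production code would count tokens with the embedding model's tokenizer:

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks (sizes in words)."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

text = "word " * 1200                        # stand-in for a long document
chunks = chunk_text(text)
print(len(chunks), len(chunks[0].split()))   # 3 chunks; the first holds 500 words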
Multi-Stage Retrieval
Single-stage retrieval trades off speed vs accuracy. Multi-stage gets both.
SINGLE-STAGE RETRIEVAL
--------------------------------------------------------------------
Query → Embedding Model → Vector Search → Top 10 Results
Fast (~20ms), but accuracy is limited because the bi-encoder
embeds the query and docs independently; they never see each other.
MULTI-STAGE RETRIEVAL (Retrieve → Rerank)
--------------------------------------------------------------------
Stage 1: Fast retrieval (bi-encoder)
Query → Top 100 candidates (~20ms)
Uses: Dense/sparse/hybrid retrieval
Stage 2: Accurate reranking (cross-encoder)
Rerank 100 → Top 10 (~200ms for 100 pairs)
Uses: Cross-encoder model
Total: ~250ms, but significantly better accuracy
Bi-Encoder vs Cross-Encoder
BI-ENCODER (used in retrieval)
--------------------------------------------------------------------
Query ─────→ [Encoder] ─────→ query_vector
↓
cosine_similarity = score
↑
Doc ─────→ [Encoder] ─────→ doc_vector
✓ Can pre-compute doc vectors (once)
✓ Fast similarity search at query time
✓ Scales to millions of documents
✗ Query and doc don't "see" each other
✗ Lower accuracy
CROSS-ENCODER (used in reranking)
--------------------------------------------------------------------
[CLS] query [SEP] document [SEP] ─────→ [BERT] ─────→ score
✓ Query and doc interact via attention
✓ Higher accuracy (5-10% improvement)
✗ Must encode every (query, doc) pair
✗ Can't pre-compute anything
✗ Slow: O(n) for n documents
Why Two Stages?
+------------------------+------------+----------+------------------+
| Method                 | Latency    | Accuracy | Use Case         |
+------------------------+------------+----------+------------------+
| Bi-encoder only        | ~20ms      | 85%      | Speed-critical   |
| Cross-encoder only     | ~30 min/1M | 95%      | Tiny corpus only |
| Bi-encoder → Cross-enc | ~250ms     | 93%      | Production       |
+------------------------+------------+----------+------------------+
The bi-encoder filters to candidates.
The cross-encoder reranks for precision.
Best of both worlds.
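A retrieve-then-rerank sketch, assuming the sentence-transformers library with its CrossEncoder class and the public cross-encoder/ms-marco-MiniLM-L-6-v2 checkpoint; the tiny corpus and the candidate/final counts are illustrative (in production, think 100 → 10):

import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

documents = [
    "To reset your password, go to Settings > Security > Reset Password.",
    "Password reset failed for user X at 3:42pm.",
    "Two-factor authentication can be enabled in Settings > Security.",
]
doc_embeddings = bi_encoder.encode(documents, normalize_embeddings=True)

def retrieve_and_rerank(query: str, candidates_k: int = 3, final_k: int = 2):
    # Stage 1: fast bi-encoder retrieval over the whole corpus
    query_emb = bi_encoder.encode(query, normalize_embeddings=True)
    sims = doc_embeddings @ query_emb
    candidate_idx = np.argsort(sims)[::-1][:candidates_k]

    # Stage 2: cross-encoder scores each (query, candidate) pair jointly
    pairs = [(query, documents[i]) for i in candidate_idx]
    rerank_scores = reranker.predict(pairs)
    order = np.argsort(rerank_scores)[::-1][:final_k]
    return [(documents[candidate_idx[i]], float(rerank_scores[i])) for i in order]

print(retrieve_and_rerank("How do I reset my password?"))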
Retrieval Quality Metrics
How do you know retrieval is working?
+------------------------------------------------------------------+
| Recall@K |
| "Of all relevant docs, how many are in my top-K?" |
| |
| 5 relevant docs exist, top-10 retrieval finds 4 |
| Recall@10 = 4/5 = 0.80 |
| |
| Critical for RAG: if relevant doc isn't retrieved, |
| the LLM can't use it. |
+------------------------------------------------------------------+
| Precision@K |
| "Of my top-K results, how many are relevant?" |
| |
| Top-10 has 4 relevant, 6 irrelevant |
| Precision@10 = 4/10 = 0.40 |
| |
| Matters for: context window efficiency, noise reduction |
+------------------------------------------------------------------+
| MRR (Mean Reciprocal Rank) |
| "How high is the first relevant result?" |
| |
| First relevant at position 3 → RR = 1/3 |
| Average across queries = MRR |
+------------------------------------------------------------------+
For RAG, Recall@K usually matters most. If the answer isn’t retrieved, generation fails.
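Minimal implementations of these metrics, matching the definitions above (doc IDs are strings, and relevant is the set of IDs judged relevant for a query):

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant docs that appear in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return len(set(retrieved[:k]) & relevant) / k

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1/rank of the first relevant result, across queries."""
    reciprocal_ranks = []
    for retrieved, relevant in runs:
        rr = 0.0
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# The example from above: 5 relevant docs exist, top-10 retrieval finds 4
retrieved = ["d1", "x1", "d2", "x2", "d3", "x3", "d4", "x4", "x5", "x6"]
relevant = {"d1", "d2", "d3", "d4", "d5"}
print(recall_at_k(retrieved, relevant, 10))     # 0.8
print(precision_at_k(retrieved, relevant, 10))  # 0.4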
Common Pitfalls
Misconception: “High similarity score = relevant result”
Query: "How do I bake cookies?"
Document: "I baked cookies yesterday and they were delicious."
Similarity: HIGH (same topic, same words)
Relevance: ZERO (describes past event, doesn't answer the question)
Similarity is SYMMETRIC. Relevance is NOT.
A relevant document is usually similar to the query, but a
similar document isn't necessarily relevant.
The Query-Document Mismatch Problem
Query: "How do I fix the login bug?"
(question format, user language)
Doc: "Authentication failures can be resolved by..."
(statement format, technical language)
Problem: Different phrasing may have lower similarity
even when doc answers the query.
Solutions:
- Query Expansion: Add synonyms and related terms
- HyDE: Generate a hypothetical answer and embed that instead (see the sketch after this list)
- Query Rewriting: Transform user query to match document style
- Fine-tuned Retrievers: Train on your domain’s (query, doc) pairs
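A minimal HyDE-style sketch: embed a hypothetical answer instead of the raw query, so the embedded text matches document phrasing. The generate_hypothetical_answer function is a hard-coded stand-in for an LLM call:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
documents = [
    "Authentication failures can be resolved by clearing the session cache.",
    "Our refund policy allows returns within 30 days of purchase.",
]
doc_embeddings = model.encode(documents, normalize_embeddings=True)

def generate_hypothetical_answer(query: str) -> str:
    # Stand-in for an LLM call that drafts a plausible (possibly wrong) answer
    # written in the same register as the documentation.
    return "Login problems are usually fixed by clearing the session cache and retrying."

query = "How do I fix the login bug?"
hyde_emb = model.encode(generate_hypothetical_answer(query), normalize_embeddings=True)
print(documents[int(np.argmax(doc_embeddings @ hyde_emb))])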
Chunking Mistakes
The correct answer might be split across two chunks.
Neither chunk alone answers the query.
Both chunks score medium similarity.
Retrieval "succeeds" (returns chunks).
RAG fails (no chunk contains the answer).
Chunking is a system design decision, not preprocessing trivia.
Code Example
Basic semantic search with sentence-transformers:
import numpy as np
from sentence_transformers import SentenceTransformer
# Initialize embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Sample knowledge base
documents = [
"To reset your password, go to Settings > Security > Reset Password.",
"Our API rate limit is 100 requests per minute for free tier.",
"Contact support@example.com for billing questions.",
"The application requires Python 3.9 or higher.",
"Two-factor authentication can be enabled in Settings > Security.",
]
# Index: embed all documents once
# (normalize so that dot product = cosine similarity)
doc_embeddings = model.encode(documents, normalize_embeddings=True)

def search(query: str, top_k: int = 3) -> list[tuple[str, float]]:
    """Semantic search: find the documents most similar to the query."""
    # Embed the query with the same model used for indexing
    query_embedding = model.encode(query, normalize_embeddings=True)

    # Cosine similarity via dot product (embeddings are unit-normalized)
    similarities = np.dot(doc_embeddings, query_embedding)

    # Indices of the top-k most similar documents
    top_indices = np.argsort(similarities)[::-1][:top_k]

    # Return documents paired with their similarity scores
    return [(documents[idx], float(similarities[idx])) for idx in top_indices]
# Test queries
queries = [
"How do I change my password?",
"What are the API limits?",
"I need help with my bill",
]
for query in queries:
    print(f"\nQuery: {query}")
    results = search(query, top_k=2)
    for doc, score in results:
        print(f"  [{score:.3f}] {doc[:50]}...")
Key Takeaways
1. Pure LLM generation has limits
- Knowledge cutoff, hallucinations, no private data access
2. Retrieval grounds generation in facts
- Find relevant docs, then generate with context
3. Dense retrieval uses semantic similarity
- Embed query and docs, find closest vectors
4. Sparse retrieval (BM25) uses keyword matching
- Better for exact terms, combine with dense = hybrid
5. Chunking matters
- Documents → smaller pieces for granular retrieval
- Overlap preserves context at boundaries
6. Multi-stage retrieval improves accuracy
- Fast bi-encoder for recall, slow cross-encoder for precision
7. Similarity ≠ Relevance
- High similarity score doesn't mean the doc answers the query
Verify Your Understanding
Before proceeding, you should be able to:
Explain why a similarity score of 0.95 can still be useless, using a concrete example.
Given this scenario, identify the problem:
- Query: “How do I reset my password?”
- Doc1: “Password reset failed for user X at 3:42pm” (similarity = 0.91)
- Doc2: “Go to Settings > Security > Reset Password” (similarity = 0.87)
Which document is more relevant? Why is it ranked lower?
Identify the error in this statement: “BM25 is outdated, dense retrieval is always better.”
Your retrieval returns 10 documents. 8 are similar. 2 answer the question. Which metric captures this problem—Recall@10 or Precision@10?
What’s Next
After this, you can:
- Continue → Retrieval → RAG — putting it all together
- Build → Semantic search for your documents