Retrieval to RAG - The Complete Pipeline
Deep dive into RAG: prompt construction, reranking, failure modes, the debugging decision tree, and how to diagnose when things go wrong
Building On Previous Knowledge
In the previous progression, you learned how retrieval finds relevant documents for a query. This created a new problem: having the right documents doesn’t automatically produce the right answer.
The LLM might ignore the retrieved context. It might hallucinate despite having good context. It might use the context but synthesize incorrectly. Retrieval gave us ingredients; now we need to cook them properly.
This progression solves that problem by introducing the complete RAG pipeline: how to construct prompts, when to use reranking, and critically—how to diagnose failures when things go wrong.
What Goes Wrong Without This:
Symptom: "RAG doesn't work" (your team gives up on the approach).
Cause: No debugging methodology. When output is wrong, random changes
are made. Nobody identified whether the problem is retrieval
or generation.
Symptom: RAG works perfectly for demo queries, fails for real user queries.
Cause: Demo queries were crafted to match document phrasing.
Real user queries are messy and use different vocabulary.
Symptom: The LLM confidently produces an answer that contradicts
the retrieved documents.
Cause: Weak grounding instruction in prompt. The LLM's prior knowledge
is more "confident" than the provided context.
The Complete RAG Pipeline
RAG = Retrieval-Augmented Generation. Combine retrieval with LLM generation:
+------------------------------------------------------------------+
| RAG Pipeline |
+------------------------------------------------------------------+
| |
| User: "What's the refund policy for premium plans?" |
| |
| ┌─────────────────────────────────────────────────────┐ |
| │ 1. RETRIEVE │ |
| │ │ |
| │ Query → Embed → Search vector DB → Top-K docs │ |
| │ │ |
| │ Retrieved: │ |
| │ • "Premium plans have a 30-day refund window..." │ |
| │ • "To request a refund, contact support..." │ |
| │ • "Refunds are processed within 5 business days" │ |
| └─────────────────────────────────────────────────────┘ |
| │ |
| ↓ |
| ┌─────────────────────────────────────────────────────┐ |
| │ 2. AUGMENT │ |
| │ │ |
| │ Construct prompt with retrieved context: │ |
| │ │ |
| │ "Based on the following information: │ |
| │ [Retrieved docs] │ |
| │ │ |
| │ Answer the user's question: │ |
| │ [User query]" │ |
| └─────────────────────────────────────────────────────┘ |
| │ |
| ↓ |
| ┌─────────────────────────────────────────────────────┐ |
| │ 3. GENERATE │ |
| │ │ |
| │ LLM produces grounded answer using context │ |
| │ │ |
| │ "Premium plans have a 30-day refund policy. │ |
| │ To request a refund, contact support@... and │ |
| │ expect processing within 5 business days." │ |
| └─────────────────────────────────────────────────────┘ |
| |
+------------------------------------------------------------------+
Prompt Construction
How you present retrieved context to the LLM matters:
+------------------------------------------------------------------+
| Basic RAG prompt template |
+------------------------------------------------------------------+
| |
| You are a helpful assistant. Answer questions based only |
| on the provided context. If the context doesn't contain |
| enough information, say "I don't have enough information." |
| |
| Context: |
| --- |
| {retrieved_document_1} |
| --- |
| {retrieved_document_2} |
| --- |
| {retrieved_document_3} |
| |
| Question: {user_query} |
| |
| Answer: |
| |
+------------------------------------------------------------------+
Key elements:
1. GROUNDING INSTRUCTION
"Answer based only on the provided context"
→ Reduces hallucination, keeps model on topic
2. FALLBACK INSTRUCTION
"If context doesn't contain enough information, say so"
→ Prevents confident wrong answers
3. CLEAR SEPARATION
Use delimiters (---, ```, XML tags) between chunks
→ Model can distinguish sources
4. SOURCE ATTRIBUTION (optional)
Include metadata: "From: billing_policy.md, Section 3"
→ Enables citations in response
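A minimal sketch of a prompt builder that applies all four elements; the function name and the per-document "source" field are illustrative assumptions, not a required API:

def build_rag_prompt(query: str, docs: list[dict]) -> str:
    """Assemble a grounded RAG prompt from retrieved chunks.
    Each doc is assumed to be a dict with 'content' and an optional 'source' field."""
    # 3. Clear separation + 4. source attribution: delimit each chunk and tag its origin
    context = "\n---\n".join(
        f"[Source: {doc.get('source', 'unknown')}]\n{doc['content']}" for doc in docs
    )
    # 1. Grounding instruction + 2. fallback instruction come before the context
    return (
        "You are a helpful assistant. Answer questions based only on the provided context. "
        "If the context doesn't contain enough information, say \"I don't have enough information.\"\n\n"
        f"Context:\n---\n{context}\n---\n\n"
        f"Question: {query}\n\nAnswer:"
    )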
Reranking: Quality Over Quantity
Initial retrieval is fast but imprecise. Reranking improves quality:
+------------------------------------------------------------------+
| Without reranking |
+------------------------------------------------------------------+
| |
| Query → Retrieve top-20 → Use top-5 in prompt |
| |
| Problem: Top-5 by embedding similarity may not be |
| the most relevant for answering the question. |
+------------------------------------------------------------------+
| With reranking |
+------------------------------------------------------------------+
| |
| Query → Retrieve top-20 → Rerank → Use top-5 in prompt |
| |
| Reranker: Cross-encoder that scores (query, doc) pairs |
| More accurate than bi-encoder similarity, but slower |
| |
+------------------------------------------------------------------+
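A sketch of the retrieve-then-rerank step using sentence-transformers' CrossEncoder; the model name and function signature here are assumptions rather than fixed choices:

from sentence_transformers import CrossEncoder

# A cross-encoder reads the query and document together, so it scores relevance
# more accurately than bi-encoder similarity, but is too slow to run over the full corpus.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], final_k: int = 5) -> list[str]:
    """Score (query, doc) pairs and keep the best final_k.
    `candidates` is assumed to be the top-20 or so documents from fast retrieval."""
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:final_k]]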
How Many Documents?
More context isn’t always better:
+------------------------------------------------------------------+
| Trade-offs in K (number of retrieved docs) |
+------------------------------------------------------------------+
| |
| Small K (1-3): |
| ✓ Less noise, focused context |
| ✓ Lower cost (fewer tokens) |
| ✗ May miss relevant information |
| ✗ Low recall |
| |
| Large K (10-20): |
| ✓ Higher recall, more coverage |
| ✓ Redundancy can help |
| ✗ More noise, irrelevant content |
| ✗ Higher cost, possible "lost in the middle" |
| |
+------------------------------------------------------------------+
“Lost in the middle” problem: LLMs attend more to beginning and end of context. Information in the middle may be ignored.
Practical guidance:
Factoid questions: K = 3-5 (need specific answer)
Complex questions: K = 5-10 (need multiple aspects)
Research/synthesis: K = 10-20 (need comprehensive coverage)
After reranking: Use top 3-5 from reranked results
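One common mitigation for "lost in the middle" is to reorder the final context so the strongest documents sit at the beginning and end. A minimal sketch; the alternating placement is one heuristic among several:

def reorder_for_attention(docs_best_first: list[str]) -> list[str]:
    """Place top-ranked docs at the start and end of the context window.
    Input is assumed to be sorted best-first (e.g. reranker output)."""
    front, back = [], []
    for i, doc in enumerate(docs_best_first):
        (front if i % 2 == 0 else back).append(doc)
    # Weakest documents end up in the middle, e.g. ranks [1,2,3,4,5] -> [1,3,5,4,2]
    return front + back[::-1]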
RAG Failure Modes
When RAG goes wrong:
+------------------------------------------------------------------+
| 1. RETRIEVAL FAILURE |
+------------------------------------------------------------------+
| Relevant document exists but wasn't retrieved |
| |
| Causes: |
| • Query-document vocabulary mismatch |
| • Poor chunking (answer split across chunks) |
| • Embedding model doesn't capture domain semantics |
| • K too small |
| |
| Diagnosis: Check if relevant doc is in top-100 |
+------------------------------------------------------------------+
| 2. CONTEXT IGNORED |
+------------------------------------------------------------------+
| Relevant doc retrieved but LLM didn't use it |
| |
| Causes: |
| • Lost in the middle (long context) |
| • LLM's prior knowledge conflicts with context |
| • Poor prompt construction |
| • Answer requires synthesis across multiple chunks |
| |
| Diagnosis: Is the answer literally in the context? |
+------------------------------------------------------------------+
| 3. HALLUCINATION DESPITE CONTEXT |
+------------------------------------------------------------------+
| LLM generates plausible but incorrect information |
| |
| Causes: |
| • Weak grounding instruction |
| • Context partially relevant, LLM fills gaps |
| • Model confident in prior knowledge |
| |
| Diagnosis: Does response contain info not in context? |
+------------------------------------------------------------------+
| 4. MISSING INFORMATION |
+------------------------------------------------------------------+
| Information doesn't exist in knowledge base |
| |
| Correct behavior: LLM should say "I don't know" |
| Failure: LLM makes up answer anyway |
| |
| Solution: Strong fallback instruction in prompt |
+------------------------------------------------------------------+
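The diagnosis rows above can be partly automated. A rough sketch, assuming docs are {'id', 'content'} dicts as in the code example later in this article; substring overlap is only a crude proxy for real faithfulness checks:

def quick_diagnose(answer: str, retrieved_docs: list[dict],
                   known_relevant_id: str | None = None) -> list[str]:
    """Run the quick checks from the failure-mode table."""
    findings = []
    # Failure mode 1: a known-relevant doc is missing from the retrieved set
    if known_relevant_id and known_relevant_id not in [d["id"] for d in retrieved_docs]:
        findings.append("relevant doc not retrieved -> retrieval failure")
    # Failure modes 2/3: the answer shares little content with the context
    context = " ".join(d["content"].lower() for d in retrieved_docs)
    content_words = [w for w in answer.lower().split() if len(w) > 4]
    grounded = [w for w in content_words if w in context]
    if content_words and len(grounded) / len(content_words) < 0.3:
        findings.append("answer weakly grounded in context -> check grounding / hallucination")
    return findings or ["no obvious failure from quick checks"]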
The RAG Debugging Decision Tree
When RAG output is wrong, use this systematic approach:
Output is wrong
│
▼
Is the correct answer in the retrieved documents?
│
├─── YES ──▶ GENERATION PROBLEM
│ │
│ ├─ Check prompt construction
│ ├─ Check grounding instruction strength
│ ├─ Check for "lost in the middle" (reorder context)
│ └─ Check if model's prior conflicts with context
│
└─── NO ───▶ Does the correct document exist in corpus?
│
├─── YES ──▶ RETRIEVAL PROBLEM
│ │
│ ├─ Check query-document vocabulary mismatch
│ ├─ Check chunking (answer split across chunks?)
│ ├─ Check embedding model domain fit
│ └─ Check K (too small?)
│
└─── NO ───▶ KNOWLEDGE GAP
│
├─ Add missing information to corpus
└─ Or implement "I don't know" fallback
Commit this decision tree to memory. It will save you hours of random debugging.
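The tree reduces to two yes/no checks, which can be encoded directly. A sketch; answering the two questions (by inspection or an LLM judge) is assumed to happen upstream:

def classify_failure(answer_in_retrieved_docs: bool, doc_exists_in_corpus: bool) -> str:
    """Map the two decision-tree checks to a failure category."""
    if answer_in_retrieved_docs:
        return ("GENERATION PROBLEM: check prompt construction, grounding strength, "
                "context ordering (lost in the middle), prior-knowledge conflicts")
    if doc_exists_in_corpus:
        return ("RETRIEVAL PROBLEM: check vocabulary mismatch, chunking, "
                "embedding model domain fit, K")
    return "KNOWLEDGE GAP: add the missing information or rely on the 'I don't know' fallback"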
Evaluation
RAG has two components to evaluate:
+------------------------------------------------------------------+
| RETRIEVAL EVALUATION |
+------------------------------------------------------------------+
| |
| Recall@K: Are relevant docs in top-K? |
| Precision@K: Are top-K docs relevant? |
| MRR: How high is first relevant doc? |
| |
| Requires: Ground truth (query → relevant doc mappings) |
| Can be automated with labeled dataset |
| |
+------------------------------------------------------------------+
| GENERATION EVALUATION |
+------------------------------------------------------------------+
| |
| Faithfulness: Is answer supported by retrieved context? |
| Relevance: Does answer address the question? |
| Completeness: Does answer cover all aspects? |
| Correctness: Is the answer factually correct? |
| |
| Requires: Human evaluation or LLM-as-judge |
| Harder to automate than retrieval metrics |
| |
+------------------------------------------------------------------+
| END-TO-END EVALUATION |
+------------------------------------------------------------------+
| |
| Answer correctness: Given query, is final answer right? |
| |
| Note: End-to-end can mask where failures occur. |
| If answer is wrong, is it retrieval or generation fault? |
| Evaluate components separately for debugging. |
| |
+------------------------------------------------------------------+
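The retrieval metrics can be computed directly from a labeled set of (query, relevant doc IDs) pairs. A minimal sketch assuming document IDs as ground truth:

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of ground-truth relevant docs that appear in the top-k."""
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved docs that are relevant."""
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / k if k else 0.0

def mrr(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Reciprocal rank of the first relevant doc; 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0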
Advanced Patterns
Beyond basic RAG:
+------------------------------------------------------------------+
| QUERY TRANSFORMATION |
+------------------------------------------------------------------+
| Query expansion: Add synonyms, rephrase |
| Query decomposition: Break complex query into sub-queries |
| HyDE: Generate hypothetical answer, embed that |
| |
+------------------------------------------------------------------+
| ITERATIVE RETRIEVAL |
+------------------------------------------------------------------+
| Multi-hop: First retrieval informs second retrieval |
| "Who is the CEO of the company that acquired Twitter?" |
| Step 1: Retrieve → "X Corp acquired Twitter" |
| Step 2: Retrieve → "Elon Musk is CEO of X Corp" |
| |
+------------------------------------------------------------------+
| SELF-REFLECTION |
+------------------------------------------------------------------+
| Generate → Check if answer uses context → If not, retry |
| Generate → Verify answer against sources → Correct if needed |
| |
+------------------------------------------------------------------+
| AGENTIC RAG |
+------------------------------------------------------------------+
| LLM decides when to retrieve, what to search |
| Can search multiple sources, combine results |
| More flexible but harder to control |
+------------------------------------------------------------------+
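As one concrete example, a sketch of HyDE layered on the pipeline from the Code Example section below; it reuses `numpy as np`, `llm_client`, `embedder`, `documents`, and `doc_embeddings` from that section, and the draft-answer prompt wording is an assumption:

def hyde_retrieve(query: str, top_k: int = 3) -> list[dict]:
    """HyDE: retrieve with the embedding of a hypothetical answer, not the raw query."""
    draft = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Write a short, plausible answer to: {query}"}],
        temperature=0,
    ).choices[0].message.content
    # The drafted answer tends to share vocabulary with real documents,
    # which helps when the user's phrasing doesn't match the corpus.
    draft_embedding = embedder.encode(draft, normalize_embeddings=True)
    similarities = np.dot(doc_embeddings, draft_embedding)
    top_indices = np.argsort(similarities)[::-1][:top_k]
    return [documents[i] for i in top_indices]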
Common Misconceptions
“RAG is just Retrieve + Generate”
This is the happy path. The unhappy paths are:
- Retrieved docs don’t contain the answer
- Retrieved docs contain the answer but LLM ignores them
- Retrieved docs are used but LLM synthesizes incorrectly
- Retrieved docs conflict with each other
RAG is a system with multiple failure modes. Understanding the failure modes IS understanding RAG.
“If retrieval is good, generation will be good”
“Lost in the middle” is real—LLMs attend more to the beginning and end of context. Without strong grounding instructions, the model may prefer its training data.
Good retrieval is necessary but not sufficient.
“More retrieved documents = better answers”
More documents = more noise, higher cost, and “lost in the middle” problems. If you retrieve 20 documents and only 3 are relevant, you’ve added 17 distractors.
There’s an optimal K for your use case. It’s usually smaller than you think.
Code Example
Complete RAG pipeline with retrieval and generation:
import numpy as np
from sentence_transformers import SentenceTransformer
from openai import OpenAI

# Initialize models
embedder = SentenceTransformer('all-MiniLM-L6-v2')
llm_client = OpenAI()

# Knowledge base
documents = [
    {
        "id": "policy_1",
        "content": "Premium plans have a 30-day refund policy. Users can request a full refund within 30 days of purchase.",
    },
    {
        "id": "policy_2",
        "content": "To request a refund, email support@example.com with your order ID and reason for refund.",
    },
    {
        "id": "policy_3",
        "content": "Refunds are processed within 5 business days. The amount will be credited to the original payment method.",
    },
]

# Index documents (unit-normalized so a dot product equals cosine similarity)
doc_texts = [d["content"] for d in documents]
doc_embeddings = embedder.encode(doc_texts, normalize_embeddings=True)

def retrieve(query: str, top_k: int = 3) -> list[dict]:
    """Retrieve the most relevant documents for a query."""
    query_embedding = embedder.encode(query, normalize_embeddings=True)
    similarities = np.dot(doc_embeddings, query_embedding)
    top_indices = np.argsort(similarities)[::-1][:top_k]
    return [documents[i] for i in top_indices]

def generate_with_context(query: str, context_docs: list[dict]) -> str:
    """Generate an answer using the retrieved context."""
    # Format context with clear delimiters and source attribution
    context = "\n---\n".join([
        f"[Source: {doc['id']}]\n{doc['content']}"
        for doc in context_docs
    ])

    # Construct prompt with grounding and fallback instructions
    prompt = f"""Answer the question based only on the provided context.
If the context doesn't contain enough information, say "I don't have enough information to answer that."

Context:
{context}

Question: {query}

Answer:"""

    response = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

def rag(query: str) -> dict:
    """Complete RAG pipeline."""
    # 1. Retrieve
    retrieved = retrieve(query, top_k=3)
    # 2. Generate
    answer = generate_with_context(query, retrieved)
    return {
        "query": query,
        "retrieved_docs": [d["id"] for d in retrieved],
        "answer": answer,
    }

# Test
result = rag("What's the refund policy and how do I get one?")
print(f"Query: {result['query']}")
print(f"Retrieved: {result['retrieved_docs']}")
print(f"Answer: {result['answer']}")
Key Takeaways
1. RAG = Retrieve + Augment + Generate
- Find relevant docs, add to prompt, generate grounded answer
2. Prompt construction matters
- Clear grounding instructions reduce hallucination
- Delimiters help model distinguish sources
3. Reranking improves quality
- Fast bi-encoder for recall, slow cross-encoder for precision
4. Failure modes have different causes
- Retrieval failure: doc not found
- Generation failure: doc found but ignored/misused
- Diagnose separately
5. The debugging decision tree
- First check: Is correct answer in retrieved docs?
- If yes → generation problem
- If no → retrieval problem or knowledge gap
6. K (number of docs) involves trade-offs
- More docs = higher recall, more noise
- "Lost in the middle" is real
Verify Your Understanding
Before considering yourself RAG-capable:
Use the debugging decision tree from memory. Given a wrong RAG output, what’s your first diagnostic question?
Given this scenario, diagnose the problem:
RAG returns: “The refund policy is 60 days.” Ground truth: “The refund policy is 30 days.” Retrieved doc contains: “Premium plans have a 30-day refund policy.”
Is this a retrieval problem, generation problem, or knowledge gap? How do you know?
When would you use reranking vs. just increasing K? If your answer is “always rerank” or “never rerank,” you don’t understand the trade-offs.
Your RAG system works in development but fails in production. List 3 specific hypotheses for why this might happen.
What’s Next
After this, you can:
- Continue → RAG → Agents — from single-shot RAG to multi-step agents
- Build → Production RAG system with proper evaluation