Generation to Retrieval - Grounding LLMs in Facts
Deep dive into retrieval: why pure generation hallucinates, vector similarity search, dense vs sparse retrieval, chunking strategies, and multi-stage retrieval with reranking
Building On Previous Knowledge
In the previous progression, you learned how LLMs generate text token-by-token using learned probability distributions. This creates a fundamental problem: the model can only generate from patterns it memorized during training.
If the answer isn’t in the training data, the model will either refuse to answer or—more dangerously—generate a plausible-sounding but fabricated response.
This progression solves that problem by introducing retrieval: giving the model access to external knowledge at inference time.
What Goes Wrong Without This:
Symptom: Your AI assistant confidently answers questions about your
company's products with completely fabricated information.
Cause: The LLM generates plausible text from patterns, but has no
access to your actual documentation. High confidence ≠ correctness.
Symptom: Retrieval returns documents with high similarity scores,
but the RAG system still produces incorrect answers.
Cause: You treated retrieval as similarity search. Similarity is
SYMMETRIC (A similar to B = B similar to A). Relevance is NOT.
Symptom: RAG works for demo queries but fails for real user queries.
Cause: Demo queries match document phrasing. Real queries use
different vocabulary. Query-document mismatch.
The Limits of Pure Generation
LLMs are trained on static data with a cutoff date. They generate from learned patterns, not live facts.
Problems with pure generation:
1. KNOWLEDGE CUTOFF
Q: "Who won the 2024 election?"
A: "I don't have information past my training cutoff..."
2. HALLUCINATION
Q: "What's the API for uploading files in our product?"
A: "Use POST /api/upload with multipart/form-data..."
(confidently wrong—made up based on patterns)
3. NO PRIVATE DATA
Q: "What did the client say in yesterday's email?"
A: Cannot access—not in training data
4. OUTDATED FACTS
Q: "What's the current price of Bitcoin?"
A: Training data price, not live price
The model generates plausible text, but plausible ≠ true.
Retrieval: Grounding Generation in Facts
Instead of asking the model to recall facts, give it facts to use.
Without retrieval:
User query → LLM → Generated answer (may hallucinate)
With retrieval:
User query → Search knowledge base → Relevant docs
↓
[Query + Docs] → LLM → Grounded answer
The LLM now has context to work with.
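A minimal sketch of the "[Query + Docs] → LLM" step. The prompt wording and the llm_generate call below are illustrative placeholders, not a specific API:

def build_grounded_prompt(query: str, retrieved_docs: list[str]) -> str:
    """Assemble a prompt that grounds the LLM in retrieved context."""
    context = "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(retrieved_docs, start=1))
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# answer = llm_generate(build_grounded_prompt(user_query, top_k_chunks))  # any LLM client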
The Retrieval Pipeline
+------------------------------------------------------------------+
| INDEXING (offline) |
+------------------------------------------------------------------+
| |
| Documents |
| │ |
| ↓ |
| ┌─────────┐ |
| │ Chunk │ Split into manageable pieces |
| └─────────┘ |
| │ |
| ↓ |
| ┌─────────┐ |
| │ Embed │ Convert chunks to vectors |
| └─────────┘ |
| │ |
| ↓ |
| ┌─────────┐ |
| │ Index │ Store in vector database |
| └─────────┘ |
| |
+------------------------------------------------------------------+
| QUERY (online) |
+------------------------------------------------------------------+
| |
| User query |
| │ |
| ↓ |
| ┌─────────┐ |
| │ Embed │ Same embedding model as indexing |
| └─────────┘ |
| │ |
| ↓ |
| ┌─────────┐ |
| │ Search │ Find similar vectors in index |
| └─────────┘ |
| │ |
| ↓ |
| Top-K most similar chunks |
| |
+------------------------------------------------------------------+
Vector Similarity Search
Core mechanic: find vectors closest to query vector.
Query embedding: [0.2, 0.8, -0.1, ...]
Document embeddings in index:
doc1: [0.25, 0.75, -0.05, ...] → sim = 0.98 ← most similar
doc2: [0.1, 0.6, 0.3, ...] → sim = 0.85
doc3: [-0.5, 0.1, 0.8, ...] → sim = 0.23
doc4: [0.22, 0.78, -0.08, ...] → sim = 0.97
Return top-K (e.g., K=3): [doc1, doc4, doc2]
Why it works: Embedding models map similar meanings to similar vectors. “How do I reset my password?” is close to “Password reset instructions” even though they share few words.
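A minimal NumPy sketch of that ranking step. The vectors are truncated to 3 dimensions for illustration, so the scores will not exactly reproduce the numbers above:

import numpy as np

query = np.array([0.2, 0.8, -0.1])
docs = np.array([
    [0.25, 0.75, -0.05],   # doc1
    [0.10, 0.60,  0.30],   # doc2
    [-0.50, 0.10,  0.80],  # doc3
    [0.22, 0.78, -0.08],   # doc4
])

def normalize(x):
    # Cosine similarity = dot product of L2-normalized vectors
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

sims = normalize(docs) @ normalize(query)
top_k = np.argsort(sims)[::-1][:3]   # indices of the 3 most similar docs
print(top_k, sims[top_k])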
Dense vs Sparse Retrieval
Two fundamentally different approaches:
+------------------------------------------------------------------+
| SPARSE RETRIEVAL (BM25, TF-IDF) |
+------------------------------------------------------------------+
| Representation: High-dimensional sparse vectors |
| (vocab_size dimensions, mostly zeros) |
| |
| "The cat sat" → [0, 0, ..., 1, 0, ..., 1, 0, ..., 1, ...] |
| ↑ ↑ ↑ |
| cat sat the |
| |
| Matching: Exact keyword overlap |
| Strengths: Precise keyword matching, interpretable |
| Weakness: Misses synonyms, requires exact terms |
+------------------------------------------------------------------+
| DENSE RETRIEVAL (Embeddings) |
+------------------------------------------------------------------+
| Representation: Low-dimensional dense vectors |
| (384-1536 dimensions, all non-zero) |
| |
| "The cat sat" → [0.23, -0.41, 0.89, 0.12, ...] |
| |
| Matching: Semantic similarity |
| Strengths: Captures meaning, handles synonyms |
| Weakness: May miss exact matches, less interpretable |
+------------------------------------------------------------------+
In practice: Combine both (hybrid search).
Query: "error code E1234"
Sparse (BM25): Finds docs with exact string "E1234" ✓
Dense:         May miss it if the embedding model has no meaningful representation of the rare code "E1234" ✗
Query: "my application keeps crashing"
Sparse (BM25): Needs exact word "crashing" ✗
Dense: Matches "app failure", "program stops working" ✓
Hybrid: Best of both
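A minimal sparse-retrieval sketch, assuming the third-party rank_bm25 package (pip install rank-bm25); whitespace tokenization is used only for illustration:

from rank_bm25 import BM25Okapi

documents = [
    "Error code E1234 indicates a failed disk write.",
    "The application crashed after the last update.",
    "Password reset instructions are in Settings > Security.",
]

# BM25 operates on token lists
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

query_tokens = "error code e1234".split()
print(bm25.get_scores(query_tokens))  # the exact-term match on "e1234" scores highest

For hybrid search, run a dense search and a sparse search in parallel and merge the two ranked lists, for example with RRF as described next.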
Reciprocal Rank Fusion (RRF)
When combining dense and sparse results, use RRF to merge ranked lists:
RRF_score(doc) = Σ_i 1 / (k + rank_i(doc))
Where:
- the sum runs over each ranking i being fused (e.g., dense and sparse)
- rank_i(doc) = position of doc in ranking i (1-indexed)
- k = smoothing constant (typically 60)
Example:
Dense ranking: [doc_A (rank 1), doc_B (rank 2), doc_C (rank 3)]
Sparse ranking: [doc_B (rank 1), doc_C (rank 2), doc_A (rank 3)]
RRF scores (k=60):
doc_A: 1/(60+1) + 1/(60+3) = 0.0164 + 0.0159 = 0.0323
doc_B: 1/(60+2) + 1/(60+1) = 0.0161 + 0.0164 = 0.0325 ← Highest!
doc_C: 1/(60+3) + 1/(60+2) = 0.0159 + 0.0161 = 0.0320
Final ranking: [doc_B, doc_A, doc_C]
RRF is simple and often performs just as well as learned score combination.
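A direct implementation of the formula above; with the two rankings from the example it reproduces [doc_B, doc_A, doc_C]:

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc IDs into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-indexed
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc_A", "doc_B", "doc_C"]
sparse = ["doc_B", "doc_C", "doc_A"]
print(reciprocal_rank_fusion([dense, sparse]))  # ['doc_B', 'doc_A', 'doc_C']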
Chunking: Why and How
Documents are too long to embed as single units:
- Embedding models have token limits (512-8192)
- Long texts dilute specific information
- Retrieval granularity matters
Document: 50-page manual
Bad: One embedding for entire document
→ Query matches but relevant info buried in noise
Good: Chunk into ~500 token pieces
→ Query matches specific relevant section
Chunking Strategies
+------------------------------------------------------------------+
| Chunking strategies |
+------------------------------------------------------------------+
| Fixed-size: Every N tokens |
| Simple, may break mid-sentence |
| |
| Sentence-based: Split at sentence boundaries |
| Preserves complete thoughts |
| |
| Paragraph-based: Split at paragraph breaks |
| Preserves larger context |
| |
| Semantic: Split where topic changes |
| Best quality, more complex |
| |
| Recursive: Try larger splitters first, fall back |
| Hierarchical, respects structure |
+------------------------------------------------------------------+
Overlap
Include some text from previous chunk to preserve context at boundaries:
Chunk 1: "...the password reset link. Click it to..."
Chunk 2: "...reset link. Click it to create a new password..."
↑
overlap region
Why: Context at boundaries isn't lost.
Tradeoff: More storage, potential duplicate retrieval.
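A minimal sketch of fixed-size chunking with overlap. It counts words rather than tokens for simplicity; production code would count tokens with the embedding model's tokenizer:

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks (sizes in words)."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

text = "word " * 1200                        # stand-in for a long document
chunks = chunk_text(text)
print(len(chunks), len(chunks[0].split()))   # 3 chunks; the first holds 500 words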
Multi-Stage Retrieval
Single-stage retrieval trades off speed vs accuracy. Multi-stage gets both.
SINGLE-STAGE RETRIEVAL
--------------------------------------------------------------------
Query → Embedding Model → Vector Search → Top 10 Results
Fast (~20ms), but accuracy is limited because the bi-encoder
embeds the query and docs independently; they never see each other.
MULTI-STAGE RETRIEVAL (Retrieve → Rerank)
--------------------------------------------------------------------
Stage 1: Fast retrieval (bi-encoder)
Query → Top 100 candidates (~20ms)
Uses: Dense/sparse/hybrid retrieval
Stage 2: Accurate reranking (cross-encoder)
Rerank 100 → Top 10 (~200ms for 100 pairs)
Uses: Cross-encoder model
Total: ~250ms, but significantly better accuracy
Bi-Encoder vs Cross-Encoder
BI-ENCODER (used in retrieval)
--------------------------------------------------------------------
Query ─────→ [Encoder] ─────→ query_vector
↓
cosine_similarity = score
↑
Doc ─────→ [Encoder] ─────→ doc_vector
✓ Can pre-compute doc vectors (once)
✓ Fast similarity search at query time
✓ Scales to millions of documents
✗ Query and doc don't "see" each other
✗ Lower accuracy
CROSS-ENCODER (used in reranking)
--------------------------------------------------------------------
[CLS] query [SEP] document [SEP] ─────→ [BERT] ─────→ score
✓ Query and doc interact via attention
✓ Higher accuracy (5-10% improvement)
✗ Must encode every (query, doc) pair
✗ Can't pre-compute anything
✗ Slow: O(n) for n documents
Why Two Stages?
+------------------------+------------+----------+------------------+
| Method                 | Latency    | Accuracy | Use Case         |
+------------------------+------------+----------+------------------+
| Bi-encoder only        | ~20ms      | 85%      | Speed-critical   |
| Cross-encoder only     | ~30 min/1M | 95%      | Tiny corpus only |
| Bi-encoder → Cross-enc | ~250ms     | 93%      | Production       |
+------------------------+------------+----------+------------------+
The bi-encoder filters to candidates.
The cross-encoder reranks for precision.
Best of both worlds.
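A retrieve-then-rerank sketch, assuming the sentence-transformers library with its CrossEncoder class and the public cross-encoder/ms-marco-MiniLM-L-6-v2 checkpoint; the tiny corpus and the candidate/final counts are illustrative (in production, think 100 → 10):

import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

documents = [
    "To reset your password, go to Settings > Security > Reset Password.",
    "Password reset failed for user X at 3:42pm.",
    "Two-factor authentication can be enabled in Settings > Security.",
]
doc_embeddings = bi_encoder.encode(documents, normalize_embeddings=True)

def retrieve_and_rerank(query: str, candidates_k: int = 3, final_k: int = 2):
    # Stage 1: fast bi-encoder retrieval over the whole corpus
    query_emb = bi_encoder.encode(query, normalize_embeddings=True)
    sims = doc_embeddings @ query_emb
    candidate_idx = np.argsort(sims)[::-1][:candidates_k]

    # Stage 2: cross-encoder scores each (query, candidate) pair jointly
    pairs = [(query, documents[i]) for i in candidate_idx]
    rerank_scores = reranker.predict(pairs)
    order = np.argsort(rerank_scores)[::-1][:final_k]
    return [(documents[candidate_idx[i]], float(rerank_scores[i])) for i in order]

print(retrieve_and_rerank("How do I reset my password?"))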
Retrieval Quality Metrics
How do you know retrieval is working?
+------------------------------------------------------------------+
| Recall@K |
| "Of all relevant docs, how many are in my top-K?" |
| |
| 5 relevant docs exist, top-10 retrieval finds 4 |
| Recall@10 = 4/5 = 0.80 |
| |
| Critical for RAG: if relevant doc isn't retrieved, |
| the LLM can't use it. |
+------------------------------------------------------------------+
| Precision@K |
| "Of my top-K results, how many are relevant?" |
| |
| Top-10 has 4 relevant, 6 irrelevant |
| Precision@10 = 4/10 = 0.40 |
| |
| Matters for: context window efficiency, noise reduction |
+------------------------------------------------------------------+
| MRR (Mean Reciprocal Rank) |
| "How high is the first relevant result?" |
| |
| First relevant at position 3 → RR = 1/3 |
| Average across queries = MRR |
+------------------------------------------------------------------+
For RAG, Recall@K usually matters most. If the answer isn’t retrieved, generation fails.
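Minimal implementations of these metrics, matching the definitions above (doc IDs are strings, and relevant is the set of IDs judged relevant for a query):

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant docs that appear in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return len(set(retrieved[:k]) & relevant) / k

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1/rank of the first relevant result, across queries."""
    reciprocal_ranks = []
    for retrieved, relevant in runs:
        rr = 0.0
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# The example from above: 5 relevant docs exist, top-10 retrieval finds 4
retrieved = ["d1", "x1", "d2", "x2", "d3", "x3", "d4", "x4", "x5", "x6"]
relevant = {"d1", "d2", "d3", "d4", "d5"}
print(recall_at_k(retrieved, relevant, 10))     # 0.8
print(precision_at_k(retrieved, relevant, 10))  # 0.4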
Common Pitfalls
Misconception: “High similarity score = relevant result”
Query: "How do I bake cookies?"
Document: "I baked cookies yesterday and they were delicious."
Similarity: HIGH (same topic, same words)
Relevance: ZERO (describes past event, doesn't answer the question)
Similarity is SYMMETRIC. Relevance is NOT.
A relevant document is usually similar to the query, but a
similar document isn't necessarily relevant.
The Query-Document Mismatch Problem
Query: "How do I fix the login bug?"
(question format, user language)
Doc: "Authentication failures can be resolved by..."
(statement format, technical language)
Problem: Different phrasing may have lower similarity
even when doc answers the query.
Solutions:
- Query Expansion: Add synonyms and related terms
- HyDE: Generate a hypothetical answer and embed that instead (see the sketch after this list)
- Query Rewriting: Transform user query to match document style
- Fine-tuned Retrievers: Train on your domain’s (query, doc) pairs
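A minimal HyDE-style sketch: embed a hypothetical answer instead of the raw query, so the embedded text matches document phrasing. The generate_hypothetical_answer function is a hard-coded stand-in for an LLM call:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
documents = [
    "Authentication failures can be resolved by clearing the session cache.",
    "Our refund policy allows returns within 30 days of purchase.",
]
doc_embeddings = model.encode(documents, normalize_embeddings=True)

def generate_hypothetical_answer(query: str) -> str:
    # Stand-in for an LLM call that drafts a plausible (possibly wrong) answer
    # written in the same register as the documentation.
    return "Login problems are usually fixed by clearing the session cache and retrying."

query = "How do I fix the login bug?"
hyde_emb = model.encode(generate_hypothetical_answer(query), normalize_embeddings=True)
print(documents[int(np.argmax(doc_embeddings @ hyde_emb))])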
Chunking Mistakes
The correct answer might be split across two chunks.
Neither chunk alone answers the query.
Both chunks score medium similarity.
Retrieval "succeeds" (returns chunks).
RAG fails (no chunk contains the answer).
Chunking is a system design decision, not preprocessing trivia.
Code Example
Basic semantic search with sentence-transformers:
import numpy as np
from sentence_transformers import SentenceTransformer
# Initialize embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Sample knowledge base
documents = [
"To reset your password, go to Settings > Security > Reset Password.",
"Our API rate limit is 100 requests per minute for free tier.",
"Contact support@example.com for billing questions.",
"The application requires Python 3.9 or higher.",
"Two-factor authentication can be enabled in Settings > Security.",
]
# Index: embed all documents once
# (normalize so that dot product = cosine similarity)
doc_embeddings = model.encode(documents, normalize_embeddings=True)

def search(query: str, top_k: int = 3) -> list[tuple[str, float]]:
    """Semantic search: find the documents most similar to the query."""
    # Embed the query with the same model used for indexing
    query_embedding = model.encode(query, normalize_embeddings=True)

    # Cosine similarity via dot product (embeddings are unit-normalized)
    similarities = np.dot(doc_embeddings, query_embedding)

    # Indices of the top-k most similar documents
    top_indices = np.argsort(similarities)[::-1][:top_k]

    # Return documents paired with their similarity scores
    return [(documents[idx], float(similarities[idx])) for idx in top_indices]
# Test queries
queries = [
"How do I change my password?",
"What are the API limits?",
"I need help with my bill",
]
for query in queries:
    print(f"\nQuery: {query}")
    results = search(query, top_k=2)
    for doc, score in results:
        print(f"  [{score:.3f}] {doc[:50]}...")
Key Takeaways
1. Pure LLM generation has limits
- Knowledge cutoff, hallucinations, no private data access
2. Retrieval grounds generation in facts
- Find relevant docs, then generate with context
3. Dense retrieval uses semantic similarity
- Embed query and docs, find closest vectors
4. Sparse retrieval (BM25) uses keyword matching
- Better for exact terms, combine with dense = hybrid
5. Chunking matters
- Documents → smaller pieces for granular retrieval
- Overlap preserves context at boundaries
6. Multi-stage retrieval improves accuracy
- Fast bi-encoder for recall, slow cross-encoder for precision
7. Similarity ≠ Relevance
- High similarity score doesn't mean the doc answers the query
Verify Your Understanding
Before proceeding, you should be able to:
Explain why a similarity score of 0.95 can still be useless, using a concrete example.
Given this scenario, identify the problem:
- Query: “How do I reset my password?”
- Doc1: “Password reset failed for user X at 3:42pm” (similarity = 0.91)
- Doc2: “Go to Settings > Security > Reset Password” (similarity = 0.87)
Which document is more relevant? Why is it ranked lower?
Identify the error in this statement: “BM25 is outdated, dense retrieval is always better.”
Your retrieval returns 10 documents. 8 are similar. 2 answer the question. Which metric captures this problem—Recall@10 or Precision@10?
What’s Next
After this, you can:
- Continue → Retrieval → RAG — putting it all together
- Build → Semantic search for your documents