M10: Advanced RAG Patterns

Go beyond basic retrieve-and-generate: hybrid search, re-ranking, query transformation, and measuring what matters.

Learning Objectives

Identify the limitations of naive RAG and explain how advanced patterns address each one
Implement hybrid search combining BM25 keyword matching with vector similarity
Apply re-ranking to improve retrieval precision using Claude as a cross-encoder
Use query transformation (HyDE, multi-query) to improve recall on complex questions
Measure RAG quality with precision, recall, and faithfulness metrics

Naive RAG vs. Advanced RAG

In M09, you built a working RAG system: chunk documents, embed them, store vectors, retrieve by similarity, and generate an answer. That’s naive RAG — and it works surprisingly well for simple use cases. But production systems hit a ceiling fast.

💡 Everyday Analogy

Before: Imagine searching Google by typing your exact question and reading only the first result — sometimes it’s exactly right, but often it’s tangential, outdated, or misses the nuance of what you really needed.

The pain: You waste time reading irrelevant pages, miss the one article that actually answers your question because it used different wording, and have no way to know if the answer you got is actually trustworthy.

The mapping: Advanced RAG is like having a research librarian who first rephrases your question three different ways, searches both a keyword index and a topic index, carefully reads the top candidates to rank them by actual relevance, highlights only the sentences that matter, and then gives you a sourced answer. Every stage that the librarian adds makes the final answer more accurate.

What this actually looks like in practice: Your naive pipeline sends one query and gets back chunks ranked by cosine similarity alone. An advanced pipeline might produce output like this at each stage:

// Stage 1: Query Transformation (HyDE)
Original:   "What's the risk for filing 0093?"
Hypothesis: "UCC-1 filing 2024-FL-0093 creates a security interest that elevates creditor priority for the debtor XYZ Corp..."

// Stage 2: Hybrid Search (BM25 + Vector, fused via RRF with k=60)
BM25 top-3:   [Filing-0093 (12.4), Amendment-0093 (9.1), Debtor-Profile (6.8)]
Vector top-3: [Risk-Analysis (0.91), Debtor-Profile (0.87), Credit-Score (0.85)]
Fused top-3:  [Debtor-Profile (0.0320), Filing-0093 (0.0164), Risk-Analysis (0.0164)]

// Stage 3: Re-Ranking (Claude scores 1-10)
Filing-0093:    9.2 (directly about this filing)
Risk-Analysis:  8.1 (discusses debtor risk factors)
Debtor-Profile: 5.4 (general background, less relevant)

📐 Technical Definition

Advanced RAG wraps the basic retrieve-then-generate pipeline with three optimization layers. Each layer targets a different failure mode. Here’s what they do and when they kick in:

First, pre-retrieval transforms the user’s query before searching — think of it as asking a better question. Instead of sending the user’s raw words straight to your search engine, you rephrase, expand, or generate a hypothetical answer first.

Second, retrieval enhancement combines multiple search strategies. Rather than relying on vector similarity alone, you run both keyword search (BM25) and semantic search in parallel, then merge the results. A re-ranker then sorts those merged results by true relevance — not just rough cosine similarity scores.

Third, post-retrieval processing compresses the retrieved chunks to remove noise. It also verifies citations so the model doesn’t hallucinate from irrelevant context. In other words, you clean up the search results before the LLM ever sees them.

Each layer can be added independently — you don’t need all of them, and the right combination depends on your data and queries.
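To make the "add layers independently" idea concrete, here is a minimal sketch of a pipeline in which only retrieval is mandatory and every other layer is an optional plug-in. The class name `RagPipeline`, the stage signatures, and the `gather_context` method are assumptions made for this illustration, not an API from M09 or from any library.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical stage signatures, chosen for this sketch only.
Transform = Callable[[str], List[str]]            # query -> one or more search queries
Retrieve = Callable[[str], List[str]]             # search query -> candidate chunks
Rerank = Callable[[str, List[str]], List[str]]    # (query, chunks) -> reordered chunks
Compress = Callable[[str, List[str]], List[str]]  # (query, chunks) -> trimmed chunks


@dataclass
class RagPipeline:
    retrieve: Retrieve                      # the only mandatory stage (naive RAG)
    transform: Optional[Transform] = None   # pre-retrieval layer
    rerank: Optional[Rerank] = None         # retrieval-enhancement layer
    compress: Optional[Compress] = None     # post-retrieval layer
    top_k: int = 5

    def gather_context(self, query: str) -> List[str]:
        # Pre-retrieval: fall back to the raw query when no transform is set.
        queries = self.transform(query) if self.transform else [query]

        # Retrieval: run every query, de-duplicating in first-seen order.
        seen, chunks = set(), []
        for q in queries:
            for chunk in self.retrieve(q):
                if chunk not in seen:
                    seen.add(chunk)
                    chunks.append(chunk)

        # Retrieval enhancement: re-order against the ORIGINAL user query.
        if self.rerank:
            chunks = self.rerank(query, chunks)
        chunks = chunks[: self.top_k]

        # Post-retrieval: strip noise from the chunks that made the cut.
        if self.compress:
            chunks = self.compress(query, chunks)
        return chunks
```

With only `retrieve` supplied this behaves like the naive pipeline; passing a hybrid retriever as `retrieve`, or filling in any of the optional stages, upgrades one layer without touching the others.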

⚠️ Common Misconceptions

“Hybrid search is always better than pure semantic search.” — Not always. Keyword search (BM25) wins for exact IDs, codes, and proper nouns, but for abstract or conceptual queries (“explain debtor risk factors”), vector-only search can actually outperform hybrid because BM25 adds noise by matching irrelevant keyword hits. Always test with your actual queries.

“Re-ranking is free improvement.” — It adds 100–300ms of latency per query and costs real money (an extra LLM call). For simple factual lookups where the initial retrieval is already good, re-ranking just slows things down without meaningfully changing the result order. Reserve it for complex or high-stakes queries.

“More retrieval stages = better quality.” — Each stage adds latency and complexity. In practice, you see diminishing returns after 2–3 stages. A pipeline with hybrid search + re-ranking covers most use cases. Adding query transformation, compression, AND re-ranking on top might gain you 2–3% accuracy while tripling your latency and cost.

“Query decomposition helps every query.” — Simple factual queries (“What is the filing date for UCC-0093?”) do not benefit from multi-query or HyDE. These techniques shine on complex, multi-part questions where a single search cannot capture all aspects of what the user needs.

“Semantic chunking always beats fixed-size chunking.” — It depends on content structure. Semantic chunking works well for prose (articles, reports). But for structured data like tables, code, or form fields, fixed-size chunking often works better because semantic boundaries are harder to detect and splitting a table row across chunks destroys its meaning.

Naive vs. Advanced RAG: Pipeline Comparison

[Interactive figure: the naive RAG pipeline and the advanced RAG pipeline side by side, each with a quality score.]

✅ Why It Matters

Benchmarks on real-world QA datasets show naive RAG achieves 55–65% answer accuracy. Adding hybrid search lifts that to 70–75%. Adding re-ranking pushes it to 80–85%. Query transformation on top gets you to 85–90%. Each technique targets a different failure mode: hybrid search catches keyword misses, re-ranking fixes bad ordering, and query transformation handles ambiguous questions. For a customer-support bot handling 10,000 queries/day, going from 60% to 85% accuracy means 2,500 fewer wrong answers every single day.

Now you understand WHERE naive RAG falls short. The sections that follow each tackle one specific weakness: keyword blindness (hybrid search), rough ranking (re-ranking), poor queries (query transformation), single-shot retrieval (agentic RAG), and noisy context (compression). We’ll build each pattern independently so you can mix and match for your use case.

Hybrid Search: Keyword + Semantic Fusion

Vector similarity search is powerful, but it has a blind spot: exact terms. If a user asks about “invoice #INV-2024-0847”, vector search might return chunks about invoices in general rather than that specific one. BM25 keyword search finds exact matches instantly, but misses synonyms and semantic relationships. Hybrid search combines both.

💡 Everyday Analogy

Before: You’re searching for a restaurant. You know it’s called something like “Trattoria Verde” but you also remember it as “that cozy Italian place near Central Park.”

The pain: A name-only search finds the exact restaurant but misses similar ones you might like. A description-only search finds cozy Italian places but might miss Trattoria Verde if its listing doesn’t say “cozy.”

The mapping: Hybrid search runs both queries and merges the results. BM25 is the name search — fast, literal, great for proper nouns. Vector search is the description search — understands meaning, catches synonyms. Together, they cover cases that either alone would miss.

What this actually looks like: Below is an example of how the two search strategies return different documents for the same query, and how RRF fuses them:

Query: "UCC filing #2024-FL-0093 debtor risk"

BM25 Results (keyword match):
  #1 Filing-0093.txt          score: 14.2   (exact match on "2024-FL-0093")
  #2 Amendment-0093.txt       score: 11.8   (exact match on "0093")
  #3 Debtor-XYZ-Profile.txt   score: 6.3    (matches "debtor")

Vector Results (semantic similarity):
  #1 Risk-Analysis-Q4.txt     score: 0.92   (semantically about risk)
  #2 Debtor-XYZ-Profile.txt   score: 0.89   (semantically about debtor)
  #3 Credit-Score-Report.txt  score: 0.86   (related to risk assessment)

RRF Fused (rank-based merge, k=60):
  #1 Debtor-XYZ-Profile.txt   rrf: 0.0320   (rank 3 in BM25 + rank 2 in vector: 1/63 + 1/62)
  #2 Filing-0093.txt          rrf: 0.0164   (rank 1 in BM25, not in vector top-3: 1/61)
  #2 Risk-Analysis-Q4.txt     rrf: 0.0164   (rank 1 in vector, not in BM25 top-3: 1/61)

Because RRF rewards agreement between the two lists, the one document that both retrievers returned rises to the top, and the two single-list leaders tie. All three reach the candidate set, but the order within it is still approximate.

📐 Technical Definition

Hybrid search runs two retrieval strategies in parallel.

First, sparse retrieval (BM25 or TF-IDF) matches documents by exact keyword overlap. It breaks the query into individual words, counts how often each appears in each document, and boosts rare words — that’s the “inverse document frequency” part. A rare word like “INV-2024-0847” gets a high score because it appears in very few documents, making it a strong signal.
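The scoring described above can be sketched in a few lines of plain Python. This is an illustrative Okapi BM25 scorer, not a production index: whitespace tokenization, the `k1=1.5` and `b=0.75` defaults, and the function name `bm25_scores` are assumptions of this sketch, and it expects a non-empty corpus.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    df: Counter = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Inverse document frequency: rare terms get a large weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates (k1) and is length-normalized (b).
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "invoice inv-2024-0847 issued to acme",
    "invoice policy for all vendors",
    "invoice template and invoice guide",
]
scores = bm25_scores("invoice inv-2024-0847", docs)
```

In this toy corpus the word "invoice" appears in every document and contributes little, while the ID appears in one document only, so the first document scores far above the other two.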

Second, dense retrieval (vector similarity) encodes both query and documents as high-dimensional number arrays and finds the closest ones by cosine similarity. It understands meaning, not just words, so “automobile” and “car” match even though they share no keywords.

Finally, Reciprocal Rank Fusion (RRF) merges the two ranked lists into one. For each document, it computes a fused score based on where that document appeared in each list: score = Σ 1/(k + rank), where k is typically 60. The beauty of RRF is that it uses rank positions, not raw scores — so you never have to worry about normalizing BM25 scores (which might be 12.4) against cosine similarities (which might be 0.87).
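The formula translates almost directly into code. The sketch below assumes each retriever returns a ranked list of document IDs (best first); the function name `reciprocal_rank_fusion` and the alphabetical tie-break are choices made for this illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse ranked lists of document IDs. Ranks are 1-based."""
    fused: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains.
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


keyword_hits = ["doc-a", "doc-b", "doc-c"]   # e.g. a BM25 ranking
vector_hits = ["doc-d", "doc-c", "doc-e"]    # e.g. a cosine-similarity ranking
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Only rank positions enter the sum, so raw BM25 and cosine scores never need to be put on a common scale. A document that appears in both lists collects two terms, which is why agreement between retrievers tends to outweigh a single first place.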

Hybrid Search: BM25 + Vector → Fusion

[Figure: the query "UCC filing #2024-FL-0093 debtor risk" is sent to BM25 and to vector search; Reciprocal Rank Fusion merges the two result lists into one fused list.]

✅ Why It Matters

In a UCC filing system with 500,000 documents, a query like “filing #2024-FL-0093 debtor risk assessment” needs both strategies: BM25 finds the exact filing number (a string match that vector search might miss entirely), while vector search retrieves related risk analysis documents that discuss similar debtors. In real-world benchmarks (BEIR, MTEB), hybrid search improves nDCG@10 by 8–15% over vector-only search, with the biggest gains on queries containing proper nouns, IDs, or technical terms.

✔ What Just Happened?

You learned that neither keyword search nor vector search alone covers all query types. BM25 handles exact terms; vectors handle meaning. Reciprocal Rank Fusion merges their results without needing score normalization. This is the first Advanced RAG pattern — and often the one with the biggest bang-for-buck.

Re-Ranking: Why Retrieval Order Matters

Hybrid search gets better documents into your candidate set. But the ordering within that set is still approximate — BM25 scores and cosine similarity scores are rough proxies for “actually answers the user’s question.” Re-ranking applies a more powerful model to sort the candidates by true relevance.

💡 Everyday Analogy

Before: A job recruiter searches LinkedIn for “senior backend engineer, Rust experience, distributed systems.” The search returns 200 profiles ranked by keyword match.

The pain: Profile #3 mentions “Rust” once in a side project, while profile #47 has 5 years of Rust at a distributed-systems company but listed it as “systems programming.” The initial ranking is misleading.

The mapping: Re-ranking is the recruiter reading the top 50 profiles carefully and reordering them by actual fit. It’s slower (reading takes time), but the final shortlist is dramatically better. In RAG, the “recruiter” is a cross-encoder or LLM that compares each document directly against the query.

What this actually looks like: Here is a before-and-after of re-ranking. Notice how the initial retrieval order (based on cosine similarity) is misleading — the most relevant document was buried at position #4:

Before re-ranking (sorted by cosine similarity):
  #1 Debtor-Overview.txt      cosine: 0.89   ← general background
  #2 Market-Trends.txt        cosine: 0.87   ← only tangentially related
  #3 Risk-Policy-Guide.txt    cosine: 0.85   ← generic policy document
  #4 Filing-0093-Detail.txt   cosine: 0.84   ← EXACT document we need!

After re-ranking (Claude scored each 1-10):
  #1 Filing-0093-Detail.txt   relevance: 9.2   ↑ promoted from #4
  #2 Risk-Policy-Guide.txt    relevance: 7.8
  #3 Debtor-Overview.txt      relevance: 5.4   ↓ demoted from #1
  #4 Market-Trends.txt        relevance: 2.1   ↓ demoted from #2

📐 Technical Definition

A cross-encoder takes a query-document pair as a single input and produces a relevance score. Think of it this way: instead of comparing two separate summaries of the query and document, the model reads both texts side by side and decides “how relevant is this document to this specific question?”

This is fundamentally more powerful than bi-encoder embeddings. A bi-encoder creates a vector for the query and a separate vector for each document, then compares them with cosine similarity. It’s fast because you can pre-compute document vectors, but it misses subtle relevance signals that only become visible when you read both texts together.

The trade-off is speed. Cross-encoders are too slow to run on millions of documents, so you use a two-stage approach. First, retrieve broadly with fast methods (top-50 to top-100 candidates). Second, re-rank that short list precisely with a cross-encoder (or Claude itself) to get the true top-5.

If you’ve used database indexes before, the analogy is apt: a database uses a fast B-tree index to narrow 10 million rows down to 100 candidates, then applies the full WHERE clause to those 100. Re-ranking works the same way — the initial retrieval (BM25 + vector) is the fast index scan, and the cross-encoder is the expensive filter applied to the short list. You would never run the expensive filter on every document, but it’s perfectly affordable for 20–50 candidates.
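The two-stage shape described above can be sketched without committing to a particular re-ranker. In the code below, `first_stage` stands for any fast retriever and `score_fn` for any model that scores a (query, document) pair jointly, such as a cross-encoder or an LLM call; both are hypothetical callables supplied by the caller, and the word-overlap scorer at the bottom is only a stand-in so the example runs.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Stage 2: score each (query, document) pair jointly, keep the best top_n."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    # Python's sort is stable, so tied documents keep their retrieval order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def two_stage_search(
    query: str,
    first_stage: Callable[[str], List[str]],
    score_fn: Callable[[str, str], float],
    n_candidates: int = 50,
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Stage 1 retrieves broadly and cheaply; stage 2 re-ranks the short list."""
    candidates = first_stage(query)[:n_candidates]
    return rerank(query, candidates, score_fn, top_n)


def word_overlap(query: str, doc: str) -> float:
    """Toy stand-in for a real cross-encoder: counts shared words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


docs = ["market trends overview", "filing 0093 debtor detail", "debtor overview"]
results = two_stage_search("filing 0093 debtor", lambda q: docs, word_overlap, top_n=2)
```

The expensive scorer runs at most `n_candidates` times per query, regardless of corpus size.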

Re-Ranking: From Approximate to Precise

[Figure: the initial retrieval order on the left and the order after re-ranking on the right.]

✅ Why It Matters

Research from Cohere and MS MARCO benchmarks shows re-ranking improves retrieval precision@5 by 15–25%. In production, this translates directly to answer quality: if your RAG system feeds the top-5 chunks to Claude, and re-ranking moves the truly relevant chunk from position #8 to position #2, that chunk is now in the context window instead of being dropped. For a healthcare pre-authorization system processing 200 claims/day, the difference between retrieving the correct clinical guideline and a similar-but-wrong one could mean approving or denying coverage incorrectly.
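For readers who want to check a precision@5 figure on their own data, the metric itself is short. This sketch uses the common convention of dividing by k; the function name `precision_at_k` and the sample document IDs are made up for the illustration.

```python
from typing import List, Set


def precision_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k


relevant = {"guideline-17", "guideline-42"}
before = ["guideline-17", "faq-3", "memo-9", "faq-8", "memo-2", "faq-1", "memo-4", "guideline-42"]
after = ["guideline-17", "guideline-42", "faq-3", "memo-9", "faq-8", "memo-2", "faq-1", "memo-4"]

p_before = precision_at_k(before, relevant, 5)
p_after = precision_at_k(after, relevant, 5)
```

Here the second relevant document moves from position 8 to position 2, which takes precision@5 from 1/5 to 2/5 and, more importantly, puts that chunk inside the top-5 that reaches the generator.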

💰 Cost Consideration

Using Claude as a re-ranker means an extra API call per query. Re-ranking 20 chunks with a short scoring prompt costs roughly 3,000–5,000 input tokens. At $3/MTok for Claude Sonnet, that’s about $0.01–$0.015 per query. For 10,000 queries/day: ~$100–$150/day. Use a dedicated re-ranker model (Cohere Rerank, BGE-reranker) for high-volume systems, and Claude for lower-volume, higher-stakes applications.
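The arithmetic behind these figures is easy to re-run with your own volumes and prices. The sketch below counts input tokens only (the short scores the model returns are ignored, which is a simplifying assumption) and uses the $3/MTok price quoted above as its default; the function names are made up for this example.

```python
def rerank_cost_usd(input_tokens: int, price_per_mtok: float = 3.0) -> float:
    """Cost of one re-ranking call, input tokens only."""
    return input_tokens / 1_000_000 * price_per_mtok


def daily_cost_usd(queries_per_day: int, input_tokens: int, price_per_mtok: float = 3.0) -> float:
    """Daily spend if every query triggers one re-ranking call."""
    return queries_per_day * rerank_cost_usd(input_tokens, price_per_mtok)


low_per_query = rerank_cost_usd(3_000)     # low end of the token estimate
high_per_query = rerank_cost_usd(5_000)    # high end of the token estimate
low_daily = daily_cost_usd(10_000, 3_000)
high_daily = daily_cost_usd(10_000, 5_000)
```

At 3,000 tokens the per-query cost is $0.009, which the text rounds to $0.01; at 5,000 tokens it is $0.015, or $150/day at 10,000 queries.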

🎓 Cert Tip — Domain 4.6
When using an LLM as a re-ranker or evaluator, same-session self-review creates confirmation bias — the model retains reasoning context from generation. For production RAG evaluation, use SEPARATE API calls (separate sessions) for generation and quality assessment. The exam tests whether you recognize that self-evaluation in the same context window is unreliable.
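One way to picture the "separate sessions" advice is that the evaluator receives a fresh message list containing only the finished artifacts. In this sketch `complete` is a hypothetical wrapper around a single model API call (it takes a message list and returns text); the prompts and function names are likewise assumptions for illustration.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]
Complete = Callable[[List[Message]], str]  # hypothetical: one API call, returns text


def generate_answer(complete: Complete, question: str, context: str) -> str:
    """Call 1: generation."""
    messages = [{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}]
    return complete(messages)


def evaluate_answer(complete: Complete, question: str, context: str, answer: str) -> str:
    """Call 2: evaluation in a brand-new conversation.

    The evaluator sees the question, the context, and the final answer,
    but none of the generation call's conversation history.
    """
    prompt = (
        "Judge whether the answer is fully supported by the context. "
        "Reply SUPPORTED or UNSUPPORTED.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer: {answer}"
    )
    return complete([{"role": "user", "content": prompt}])
```

The anti-pattern would be appending the evaluation prompt to the generation call's `messages` list, which hands the evaluator the very reasoning it is supposed to check.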

Query Transformation Strategies

You now have better retrieval (hybrid search) and better ranking (re-ranking). But what if the user’s question itself is the problem? Vague, ambiguous, or poorly worded queries lead to poor retrieval no matter how good your search infrastructure is. Query transformation fixes the input before it reaches the search engine.

💡 Everyday Analogy

Before: You walk into a library and ask the librarian “I need info about that thing where companies report their debts.” The librarian stares blankly — that could mean annual reports, credit filings, bankruptcy proceedings, or SEC disclosures.

The pain: With a vague question, the librarian might hand you a random book about corporate finance, wasting your time.

The mapping: A skilled librarian rephrases your question into three specific queries: “UCC-1 financing statement filings,” “commercial debt disclosure requirements,” and “secured transaction public records.” Each version captures a different angle, and the combined results cover what you actually needed. That’s what query transformation does for RAG.

What this actually looks like: Here is the output of each query transformation strategy applied to the same user question:

Original query: "How does filing 2024-FL-0093 affect the debtor?"

HyDE output (hypothetical document paragraph):
"UCC-1 filing 2024-FL-0093 names XYZ Corp as debtor and ABC Bank as secured party. The filing creates a perfected security interest in all accounts receivable, which elevates ABC Bank's creditor priority and may reduce XYZ Corp's ability to obtain additional unsecured financing..."

Multi-Query output (3 reformulations):
  1. "Impact of UCC filing 2024-FL-0093 on debtor XYZ Corp"
  2. "How does a new UCC-1 financing statement affect debtor credit?"
  3. "Risks to debtor from secured transaction filing 2024-FL-0093"

Step-Back output (abstracted question):
"How do UCC-1 filings generally affect a debtor's financial risk profile and creditworthiness?"

📐 Technical Definition

Query transformation reformulates the user’s query before retrieval to improve recall. Three main strategies exist:

HyDE (Hypothetical Document Embeddings): Ask Claude to generate a hypothetical answer to the query, then embed that answer instead of the raw query. Since the hypothesis uses document-like language, its embedding lands closer to real relevant documents in vector space.
Multi-Query: Generate 3–5 different reformulations of the original query, run retrieval for each, and union the results. This catches documents that match one phrasing but not another.
Step-Back Prompting: Abstract the specific question to a broader one (“How does UCC filing #2024-FL-0093 affect debtor risk?” → “How do UCC filings affect debtor risk assessment?”), retrieve broader context first, then use it alongside the specific query.
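The three strategies above share one shape: an LLM call rewrites the query, then ordinary retrieval runs on the rewrite. The sketch below makes that explicit with injected callables; `generate`, `embed`, `vector_search`, and `text_search` are hypothetical wrappers the caller supplies, and the prompt wording is only an assumption.

```python
from typing import Callable, Dict, List, Sequence

Generate = Callable[[str], str]                        # prompt -> model text
Embed = Callable[[str], Sequence[float]]               # text -> embedding vector
VectorSearch = Callable[[Sequence[float]], List[str]]  # vector -> ranked doc IDs
TextSearch = Callable[[str], List[str]]                # query text -> ranked doc IDs


def hyde_search(query: str, generate: Generate, embed: Embed,
                vector_search: VectorSearch) -> List[str]:
    """HyDE: embed a hypothetical answer instead of the raw query."""
    hypothesis = generate(f"Write a short passage that answers this question:\n{query}")
    return vector_search(embed(hypothesis))


def multi_query_search(query: str, generate: Generate, text_search: TextSearch,
                       n: int = 3) -> List[str]:
    """Multi-query: search each reformulation and union the results."""
    raw = generate(f"Rewrite the question below in {n} different ways, one per line:\n{query}")
    variants = [line.strip() for line in raw.splitlines() if line.strip()][:n]
    seen, merged = set(), []
    # Searching the original query too is a design choice of this sketch.
    for q in [query, *variants]:
        for doc_id in text_search(q):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


def step_back_search(query: str, generate: Generate,
                     text_search: TextSearch) -> Dict[str, List[str]]:
    """Step-back: retrieve broad background alongside the specific results."""
    broad = generate(f"Rewrite this as a more general question:\n{query}").strip()
    return {"background": text_search(broad), "specific": text_search(query)}
```

The multi-query union here is a plain de-duplicated merge; a rank-aware merge such as RRF is an alternative when the order of the combined list matters.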

Query Transformation Strategies

[Figure: the original query "How does filing 2024-FL-0093 affect the debtor?" passed through three strategies: HyDE (Hypothetical Document Embeddings), Multi-Query, and Step-Back.]

✅ Why It Matters

The original HyDE paper (Gao et al., 2022, from Carnegie Mellon University and the University of Waterloo) reported that embedding a hypothetical answer outperformed embedding the raw query with the same unsupervised encoder on the TREC-DL benchmarks, without using any relevance labels. Multi-query is especially valuable for ambiguous questions: in an internal eval at a legal document search company, multi-query reduced “no relevant results” errors from 18% to 4% of queries. The cost is one extra Claude API call per query for transformation (roughly 200–500 tokens), a trivial expense compared to the retrieval quality gain.

⚠️ Common Misconceptions

“HyDE always improves retrieval.” — If Claude generates a hypothesis that’s factually wrong or off-topic, embedding that hypothesis will retrieve wrong documents. HyDE works best for domain-specific queries where Claude can generate plausible-sounding text. For generic or highly ambiguous queries, the hypothesis may hurt more than it helps. Always fuse HyDE results with original query results via RRF rather than replacing the original query entirely.

“Multi-query is just sending the same question three times.” — No. The key is that each reformulation captures a different facet of the question. “What is the debtor risk?” might become “credit default probability for XYZ Corp,” “financial health indicators for UCC debtors,” and “risk factors in secured lending.” Each phrasing retrieves different documents, and the union covers blind spots that any single phrasing would miss.

“Step-back prompting loses specificity.” — That’s actually the point. You use the broad results as background context alongside the specific query, not as a replacement. The broad context helps Claude understand the general topic, while the specific query targets the exact answer.

✔ What Just Happened?

You learned three strategies for fixing bad queries before they reach your search engine. HyDE generates a hypothetical answer to embed (better vector matches). Multi-query tries multiple phrasings (broader recall). Step-back abstracts to a general question first (better context). Each costs one cheap LLM call and can substantially improve retrieval on complex or ambiguous queries; simple factual lookups gain little.

Agentic RAG — When the Agent Decides What to Retrieve

💡 Everyday Analogy

Naive RAG (M09) is a vending machine: you put in a query, it spits out the top-k chunks, you eat what comes out. Hybrid + re-ranking + transformation (this module so far) is a better vending machine — smarter sorting, more selection — but still one-shot: one query in, one set of chunks out.

The pain shows up the moment a question can’t be answered from a single retrieval. “Compare Acme’s 2024 and 2025 quarterly filings” needs two retrievals — one per year — and then a synthesis step. “Find the clause that contradicts what we agreed in last week’s memo” needs to retrieve the memo first, then search for contradictions. A vending machine can’t do this. A research assistant can.

Agentic RAG hands retrieval over to the agent itself. The agent treats the retriever as a tool, calls it once, looks at what came back, decides whether to call it again with a refined query, and stops when it has enough — the M12 ReAct loop, but for retrieval. Same patterns you already know, applied to the search problem.

📐 Technical Definition

Agentic RAG is RAG where the agent itself orchestrates retrieval — deciding which index/corpus to query, how to phrase the query, whether to re-query after seeing partial results, and when to stop. Mechanically: search is registered as a tool (M05), and Claude decides whether and how to invoke it inside its normal tool-use loop.

Three properties distinguish it from the passive RAG of M09:

Multi-step retrieval. The agent can retrieve once, read the chunks, decide it needs more, and retrieve again with a sharpened query. Passive RAG is one-shot by construction.
Corpus routing. If you have multiple indices — product docs, support tickets, public filings, internal wikis — the agent picks which to hit based on the question. Passive RAG hits a single index it was wired to at build time.
Self-stopping. The agent stops retrieving when it has enough to answer (or escalates when it never gets there). Passive RAG always retrieves exactly k chunks, whether the question needs 1 or 20.

It’s the M12 ReAct pattern applied to retrieval, with the same upsides (smarter, handles multi-hop questions, can use multiple corpora) and the same downsides (more LLM calls, more latency, harder to evaluate). Worth reaching for when single-shot retrieval keeps failing on the questions that matter; not worth reaching for when hybrid + re-rank already lands the right chunk on the first try.

Passive RAG (M09 / Hybrid) vs Agentic RAG

Passive RAG — one-shot, fixed

Embed the question
Retrieve top-k chunks (one index)
Stuff chunks + question into prompt
Claude answers from what was handed to it

LLM calls per question: 1. Latency: low. Failure mode: question needed info not in top-k => wrong answer.

Agentic RAG — multi-step, adaptive

Claude reads the question
Claude picks an index, drafts a query, calls search tool
Claude reads the returned chunks
If insufficient: refine query, hit a different index, or both — loop
If sufficient: stop and answer
If never sufficient after N tries: escalate to human (M17)

LLM calls per question: 2-6. Latency: higher. Failure mode: agent loops; mitigation = step cap + eval.

In practice, “agentic RAG” is what you get when you take the M09 retriever, wrap it as a tool, and drop it into the M12 ReAct loop. Here’s a minimal working example over two corpora — product docs + support tickets — with the agent choosing per-call which to hit:

from anthropic import Anthropic
import chromadb

client = Anthropic()
chroma = chromadb.PersistentClient(path="./vector_db")

# Two corpora — the agent picks which to query per call.
INDICES = {
    "product_docs":    chroma.get_collection("product_docs"),
    "support_tickets": chroma.get_collection("support_tickets"),
}

# === The retrieval tool. Claude decides corpus + query + k. =================
tools = [{
    "name": "search",
    "description": (
        "Search a corpus by semantic similarity. "
        "Use 'product_docs' for behavior/spec questions, "
        "'support_tickets' for known-issues / customer cases. "
        "Call multiple times if the first result is insufficient or off-topic."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "corpus": {"type": "string", "enum": list(INDICES)},
            "query":  {"type": "string", "description": "Natural-language query"},
            "k":      {"type": "integer", "minimum": 1, "maximum": 8, "default": 4},
        },
        "required": ["corpus", "query"],
    },
}]

def run_tool(name: str, args: dict) -> str:
    if name != "search":
        raise ValueError(f"Unknown tool: {name}")
    idx = INDICES.get(args["corpus"])
    if idx is None:
        return f"ERROR: unknown corpus '{args['corpus']}'"
    res = idx.query(query_texts=[args["query"]], n_results=args.get("k", 4))
    docs = res["documents"][0]
    return "\n---\n".join(f"[chunk {i+1}] {d}" for i, d in enumerate(docs)) or "(no hits)"

# === The agentic loop — standard M12 ReAct, just with `search` as the tool. ===
def agentic_rag(question: str, *, max_iters: int = 6) -> str:
    messages = [{"role": "user", "content": question}]
    for step in range(max_iters):
        resp = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            tools=tools,
            messages=messages,
        )
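        if resp.stop_reason != "tool_use":
            # No tool call requested (end_turn, max_tokens, ...): return the text we have
            # instead of sending an empty tool_result turn back to the API.
            return "".join(b.text for b in resp.content if b.type == "text")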

        messages.append({"role": "assistant", "content": resp.content})
        tool_results = []
        for block in resp.content:
            if block.type == "tool_use":
                try:
                    result = run_tool(block.name, block.input)
                except Exception as e:
                    result = f"TOOL_ERROR: {e}"
                tool_results.append({
                    "type": "tool_result", "tool_use_id": block.id, "content": result,
                })
        messages.append({"role": "user", "content": tool_results})

    # Hit the step cap — escalate per M17 rather than guess.
    raise RuntimeError(f"Agentic RAG exceeded {max_iters} retrieval steps")

// TypeScript version of the same agentic loop.
import Anthropic from "@anthropic-ai/sdk";
import { ChromaClient } from "chromadb";

const client = new Anthropic();
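// The JS client talks to a running Chroma server; it cannot open a local directory.
// The URL below is an assumption for local development.
const chroma = new ChromaClient({ path: "http://localhost:8000" });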

const INDICES: Record<string, any> = {
  product_docs:    await chroma.getCollection({ name: "product_docs" }),
  support_tickets: await chroma.getCollection({ name: "support_tickets" }),
};

// === The retrieval tool. Claude decides corpus + query + k. =================
const tools = [{
  name: "search",
  description:
    "Search a corpus by semantic similarity. " +
    "Use 'product_docs' for behavior/spec questions, " +
    "'support_tickets' for known-issues / customer cases. " +
    "Call multiple times if the first result is insufficient or off-topic.",
  input_schema: {
    type: "object",
    properties: {
      corpus: { type: "string", enum: Object.keys(INDICES) },
      query:  { type: "string", description: "Natural-language query" },
      k:      { type: "integer", minimum: 1, maximum: 8, default: 4 },
    },
    required: ["corpus", "query"],
  },
}];

async function runTool(name: string, args: any): Promise<string> {
  if (name !== "search") throw new Error(`Unknown tool: ${name}`);
  const idx = INDICES[args.corpus];
  if (!idx) return `ERROR: unknown corpus '${args.corpus}'`;
  const res = await idx.query({ queryTexts: [args.query], nResults: args.k ?? 4 });
  const docs: string[] = res.documents[0] ?? [];
  return docs.length
    ? docs.map((d, i) => `[chunk ${i + 1}] ${d}`).join("\n---\n")
    : "(no hits)";
}

// === The agentic loop — standard M12 ReAct, just with `search` as the tool. ===
export async function agenticRag(question: string, maxIters = 6): Promise<string> {
  const messages: any[] = [{ role: "user", content: question }];
  for (let step = 0; step < maxIters; step++) {
    const resp = await client.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 2048,
      tools,
      messages,
    });
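    if (resp.stop_reason !== "tool_use") {
      // No tool call requested (end_turn, max_tokens, ...): return the text we have
      // instead of sending an empty tool_result turn back to the API.
      return resp.content.filter((b: any) => b.type === "text").map((b: any) => b.text).join("");
    }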

    messages.push({ role: "assistant", content: resp.content });
    const toolResults: any[] = [];
    for (const block of resp.content as any[]) {
      if (block.type === "tool_use") {
        let result: string;
        try { result = await runTool(block.name, block.input); }
        catch (e: any) { result = `TOOL_ERROR: ${e.message}`; }
        toolResults.push({ type: "tool_result", tool_use_id: block.id, content: result });
      }
    }
    messages.push({ role: "user", content: toolResults });
  }

  // Hit the step cap — escalate per M17 rather than guess.
  throw new Error(`Agentic RAG exceeded ${maxIters} retrieval steps`);
}

✔ What Just Happened?

You wrapped the M09 retriever as a tool and gave Claude the M12 ReAct loop. The only domain-specific lines are the corpus enum and the tool description — the loop is the same one you wrote in M12. The agent now decides per-question: which corpus to hit, what query to send, and whether the chunks it got back are sufficient to answer. The max_iters cap is the budget; if the agent can’t find enough after 6 retrieval steps, it raises — you escalate (M17), you don’t fabricate.

When to Reach for Agentic RAG

Use it when: questions span multiple corpora; questions are multi-hop (need to retrieve, read, then retrieve again with new info); single-shot retrieval keeps missing the right chunk and your re-ranker tuning has plateaued; users phrase the same question 20 different ways and a fixed query template can’t cover them all.

Skip it when: a hybrid search + good re-ranker already lands the right chunk on the first try (don’t pay for extra LLM calls you don’t need); latency budget is sub-second (each retrieval step is another model round-trip); your eval set is tiny and you can’t measure whether the extra calls actually improve answer quality.

Cost calibration on Sonnet 4.6: a 4-step agentic RAG question costs roughly 4× the LLM tokens of a 1-step passive RAG question, plus 4 retrieval round-trips. For a 5K-token question that’s ~$0.06 vs ~$0.015 (4×), with ~3-6 seconds added latency. Worth it for the questions that need it; wasteful otherwise.

⚠️ Common Misconceptions

“Agentic RAG replaces hybrid search and re-ranking.” — No. The agent uses them. Inside the search tool you should still be doing hybrid retrieval and cross-encoder re-ranking (this module’s earlier sections). Agentic RAG is the outer loop; hybrid + re-rank is the inner implementation.

“If passive RAG works, agentic RAG works better.” — Not automatically. For single-fact lookups, agentic RAG often under-performs because the agent second-guesses correct first-shot results. Measure on your eval set (M18) before switching.

“Self-RAG / Corrective RAG / Adaptive RAG are different things.” — They’re named papers; mechanically they’re all agentic RAG with different prompts and stop conditions. You can read them as reference, but you don’t need a separate framework for each — just adjust the tool description and step cap.

Agentic RAG is the most powerful retrieval pattern in this module but also the most expensive. The next section covers the cheapest fix that’s often even more impactful: compressing retrieved chunks before they hit Claude’s context.

Contextual Compression & RAG Evaluation

You’ve improved what you retrieve (hybrid search), how you rank it (re-ranking), and the quality of your queries (transformation). The final two pieces: reducing noise in retrieved chunks (compression) and measuring whether all of this actually works (evaluation).

Contextual Compression

💡 Everyday Analogy

Before: You’re studying for an exam and a friend hands you five full textbook pages that “contain the answer somewhere.”

The pain: You have to scan 5,000 words to find the three sentences that actually matter, and the surrounding text might confuse you into thinking irrelevant details are important.

The mapping: Contextual compression is your friend highlighting only the relevant sentences before handing you the pages. Claude reads each retrieved chunk, extracts only the parts that answer the query, and throws away the rest. Less noise means better answers and fewer tokens spent on generation.

What this actually looks like in practice: Here is a before-and-after showing a roughly 500-token chunk compressed down to just the relevant text. Notice that only one sentence survives; the rest was background noise:

Query: "What is the debtor's default risk?"

Original chunk (487 tokens):
"XYZ Corp was incorporated in Florida in 2018. The company operates in the logistics sector with 45 employees across 3 locations. Recent quarterly filings show revenue of $2.3M with net margin of 4.2%. The company's credit profile indicates elevated default probability of 12.3%, primarily driven by a debt-to-equity ratio of 3.8x and declining cash reserves. The CEO, John Smith, previously founded two successful startups..."

Compressed output (62 tokens):
"The company's credit profile indicates elevated default probability of 12.3%, primarily driven by a debt-to-equity ratio of 3.8x and declining cash reserves."

Token savings: 487 → 62 (87% reduction)

📐 Technical Definition

Contextual compression uses an LLM to extract or summarize only the query-relevant portions of each retrieved chunk. Here’s the idea: you give the LLM a query and a 500-token chunk. It reads both, identifies the sentences that actually help answer the question, and outputs a 50–150 token extract. Everything else gets thrown away.

This serves two purposes. First, it reduces the token count in the generation prompt, directly saving cost — fewer tokens in means fewer tokens billed. Second, it removes irrelevant context that might lead the generator model to hallucinate or get distracted by tangential information. Think of it as signal-to-noise cleanup: the retriever found the right document, but only 10% of that document is relevant to this specific query.

Compression comes in two flavors. Extractive compression selects exact sentences from the original — it copies them verbatim, preserving the source wording. Abstractive compression rewrites the content into a shorter summary. Extractive is generally safer for production systems because it preserves original wording. That matters when you need accurate citations and traceability back to source documents — if you summarize, you can’t point to an exact quote anymore.
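A minimal sketch of the extractive flavor, assuming sentence-level granularity. The relevance judge is passed in as a function; `keyword_judge` here is a hypothetical stand-in for the LLM call so the example runs offline, and the sentence splitter is deliberately naive. The point it illustrates is that every kept sentence is copied verbatim from the chunk, so it can still be quoted as a citation.

```python
import re
from typing import Callable


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_compress(
    chunk: str,
    query: str,
    is_relevant: Callable[[str, str], bool],
) -> str:
    """Keep only the sentences judged relevant, copied verbatim from the chunk."""
    kept = [s for s in split_sentences(chunk) if is_relevant(query, s)]
    return " ".join(kept)


def keyword_judge(query: str, sentence: str) -> bool:
    """Hypothetical stand-in for an LLM judgment: any shared non-stopword term."""
    stop = {"what", "is", "the", "s"}
    q_terms = set(re.findall(r"[a-z0-9]+", query.lower())) - stop
    s_terms = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    return bool(q_terms & s_terms)


chunk = (
    "XYZ Corp was incorporated in Florida in 2018. "
    "The company's credit profile indicates elevated default probability of 12.3%. "
    "The CEO previously founded two successful startups."
)
compressed = extractive_compress(chunk, "What is the debtor's default risk?", keyword_judge)
print(compressed)
```

In a real pipeline you would replace `keyword_judge` with a model call, and it is worth keeping a check that the output is a substring of the source whenever citations matter.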

RAG Evaluation Metrics

You cannot improve what you cannot measure. RAG evaluation tells you exactly where your pipeline is succeeding and where it is failing. Without it, you are guessing — and in production, guessing costs real money and real user trust.

RAG evaluation works differently from traditional software testing. You are not checking whether code runs without errors. Instead, you are measuring the quality of search results and generated answers against human-judged ground truth. This means you need a test set: a collection of questions, the documents that correctly answer each question, and (optionally) the expected answer text. Building this test set is manual work, but even 10–20 well-labeled examples are enough to catch major pipeline regressions.

If you have used information retrieval systems before (like Elasticsearch or Solr), precision and recall will feel familiar. The new metric here is faithfulness, which is specific to RAG — it measures whether the LLM stuck to the retrieved context or invented facts on its own. Traditional search systems do not generate answers, so they never needed this metric.

RAG evaluation uses three core metrics. Each one answers a different question about your pipeline’s health:

Precision: Of the documents you retrieved, how many were actually relevant? If you retrieve 5 chunks and 3 are relevant: precision = 3/5 = 60%.
Recall: Of all relevant documents in the corpus, how many did you retrieve? If 4 documents are relevant and you found 3: recall = 3/4 = 75%.
Faithfulness: Does the generated answer stick to the retrieved context without hallucinating? If the answer contains 5 claims and all 5 are grounded in retrieved chunks: faithfulness = 100%.
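The three worked numbers above can be reproduced in a few lines. The document IDs are made up for illustration, and the claim counts are entered by hand here; in practice they would come from a human or LLM judge.

```python
# Hypothetical IDs: 5 chunks retrieved, 4 documents labeled relevant in the corpus.
retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d5", "d9"}

hits = [doc_id for doc_id in retrieved if doc_id in relevant]

precision = len(hits) / len(retrieved)   # 3 of 5 retrieved chunks are relevant
recall = len(hits) / len(relevant)       # 3 of 4 relevant documents were found

# Faithfulness: grounded claims / total claims in the generated answer.
grounded_claims, total_claims = 5, 5
faithfulness = grounded_claims / total_claims

print(f"precision={precision:.0%} recall={recall:.0%} faithfulness={faithfulness:.0%}")
```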

[Interactive chart: RAG Evaluation: Cumulative Impact of Advanced Patterns. Panel shown: Naive RAG. Metrics: Precision, Recall, Faithfulness.]

✅ Why It Matters

Without evaluation metrics, you’re flying blind. A team at a financial services company spent 3 weeks optimizing their embedding model only to discover (via evaluation) that their real bottleneck was chunking strategy — 40% of relevant paragraphs were split across two chunks, and neither half was retrievable alone. Precision tells you “am I retrieving junk?” Recall tells you “am I missing relevant docs?” Faithfulness tells you “is the LLM making things up?” Measuring all three tells you exactly which pipeline stage to fix.

🎓 Cert Tip — Domain 5.5
Aggregate accuracy metrics (e.g., “85% retrieval accuracy”) can mask per-query-type failures. A RAG system might retrieve perfectly for factual lookups but fail on complex multi-hop questions. Track precision and recall per query type (factual, analytical, comparison, multi-hop) to catch hidden failure modes. The exam tests whether you know to stratify metrics rather than relying on a single average.
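One way to stratify, sketched under the assumption that each evaluated query carries a `query_type` label (a field name chosen for this example). The scores below are invented to show how a healthy-looking overall average can hide a failing category.

```python
from collections import defaultdict


def stratified_averages(results: list[dict]) -> dict[str, dict[str, float]]:
    """Average precision and recall per query type.

    Each result looks like {"query_type": str, "precision": float, "recall": float}.
    """
    buckets: dict[str, list[dict]] = defaultdict(list)
    for r in results:
        buckets[r["query_type"]].append(r)
    return {
        qtype: {
            "precision": sum(r["precision"] for r in rows) / len(rows),
            "recall": sum(r["recall"] for r in rows) / len(rows),
            "n": len(rows),
        }
        for qtype, rows in buckets.items()
    }


# Invented scores for illustration only.
results = [
    {"query_type": "factual", "precision": 1.0, "recall": 1.0},
    {"query_type": "factual", "precision": 0.8, "recall": 1.0},
    {"query_type": "multi-hop", "precision": 0.2, "recall": 0.25},
]
by_type = stratified_averages(results)
for qtype, stats in by_type.items():
    print(qtype, stats)
```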

✔ What Just Happened?

You learned two final pieces: compression (removing noise from retrieved chunks to improve generation quality and reduce cost) and evaluation (measuring precision, recall, and faithfulness to diagnose pipeline failures). Together with hybrid search, re-ranking, and query transformation, you now have a complete Advanced RAG toolkit.

Code Walkthrough: Building an Advanced RAG Pipeline

We’ll upgrade the M09 basic RAG system with four advanced patterns: BM25 keyword search, Reciprocal Rank Fusion, Claude-based re-ranking, and HyDE query transformation. Each builds on the last.

Step 1: BM25 Keyword Search & Hybrid Fusion

Let’s start by adding BM25 keyword search alongside the existing vector search from M09, then merging the results using Reciprocal Rank Fusion. The goal is simple: vector search alone misses exact term matches like invoice numbers, product codes, and proper nouns. BM25 catches those exact hits, and RRF merges both result lists without the headache of normalizing their incompatible score ranges.

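Here’s the gotcha to watch for: BM25 operates on tokenized text (split into words), and token matching is exact. If you forget to lowercase and strip punctuation, “Invoice” and “invoice” become different tokens, as do “0093” and “0093.”, and your recall takes an unnecessary hit. The code below lowercases and splits on whitespace with .lower().split(), which handles case but leaves punctuation attached to tokens. Treat it as a baseline and strip punctuation in the tokenizer for real corpora.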
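A slightly more careful tokenizer, as a sketch: it lowercases, drops punctuation, and keeps hyphenated codes such as filing numbers as single tokens. The regex is ASCII-only, which is an assumption that fits the sample filings but not multilingual text.

```python
import re

# Runs of letters/digits, optionally joined by single hyphens (e.g. "2024-fl-0093").
_TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and return alphanumeric tokens with punctuation removed."""
    return _TOKEN_RE.findall(text.lower())


naive = "Invoice #2024-FL-0093, due.".lower().split()
clean = tokenize("Invoice #2024-FL-0093, due.")
print(naive)  # punctuation stays glued to the tokens
print(clean)  # punctuation stripped, hyphenated code kept whole
```

If you adopt a tokenizer like this, use it everywhere BM25 touches text: when counting term frequencies, when measuring document length, and when splitting the query. Mixing tokenizers silently breaks matching.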

import math
from collections import Counter

class BM25:
    """Simple BM25 implementation for keyword search."""
    def __init__(self, documents: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1 = k1
        self.b = b
        self.docs = documents
        self.doc_len = [len(d.split()) for d in documents]
        self.avg_dl = sum(self.doc_len) / len(self.doc_len)
        self.doc_count = len(documents)
        # Build document frequencies (term -> number of docs containing it)
        # and per-document term counts
        self.doc_freqs: dict[str, int] = {}
        self.term_freqs: list[Counter] = []
        for doc in documents:
            tf = Counter(doc.lower().split())
            self.term_freqs.append(tf)
            for term in tf:
                self.doc_freqs[term] = self.doc_freqs.get(term, 0) + 1

    def score(self, query: str, top_k: int = 10) -> list[tuple[int, float]]:
        """Score all documents against query, return top-k (index, score) pairs."""
        query_terms = query.lower().split()
        scores = []
        for i in range(self.doc_count):
            s = 0.0
            for term in query_terms:
                if term not in self.doc_freqs:
                    continue
                df = self.doc_freqs[term]
                idf = math.log((self.doc_count - df + 0.5) / (df + 0.5) + 1)
                tf = self.term_freqs[i].get(term, 0)
                numerator = tf * (self.k1 + 1)
                denominator = tf + self.k1 * (1 - self.b + self.b * self.doc_len[i] / self.avg_dl)
                s += idf * numerator / denominator
            scores.append((i, s))
        scores.sort(key=lambda x: x[1], reverse=True)
        return scores[:top_k]


def reciprocal_rank_fusion(
    ranked_lists: list[list[tuple[int, float]]],
    k: int = 60
) -> list[tuple[int, float]]:
    """Merge multiple ranked lists using RRF.

    Each ranked_list is [(doc_index, score), ...] in descending order.
    Returns fused list sorted by RRF score.
    """
    fused_scores: dict[int, float] = {}
    for ranked_list in ranked_lists:
        for rank, (doc_idx, _) in enumerate(ranked_list):
            fused_scores[doc_idx] = fused_scores.get(doc_idx, 0) + 1.0 / (k + rank + 1)
    result = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    return result


def hybrid_search(
    query: str,
    bm25: BM25,
    vector_store,        # Your ChromaDB collection from M09
    embedding_fn,        # Function to embed text -> vector
    top_k: int = 10,
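    bm25_weight: float = 0.5  # Currently unused: RRF fuses by rank, not by weighted score
) -> list[tuple[int, float]]:
    """Run hybrid search: BM25 + vector, fused with RRF."""
    # Defined up front so the fallback in the except block never hits a NameError
    vector_results: list[tuple[int, float]] = []
    try: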
        # BM25 keyword search
        bm25_results = bm25.score(query, top_k=top_k * 2)

        # Vector similarity search (using ChromaDB from M09)
        query_embedding = embedding_fn(query)
        vector_results_raw = vector_store.query(
            query_embeddings=[query_embedding],
            n_results=top_k * 2
        )
        # Convert to (index, score) format
        vector_results = [
            (int(doc_id), 1.0 - dist)  # Distance -> similarity; assumes cosine distance and IDs that are stringified doc indices
            for doc_id, dist in zip(
                vector_results_raw["ids"][0],
                vector_results_raw["distances"][0]
            )
        ]

        # Fuse with RRF
        fused = reciprocal_rank_fusion([bm25_results, vector_results])
        return fused[:top_k]

    except Exception as e:
        print(f"Hybrid search error: {e}")
        # Fallback to vector-only search
        return vector_results[:top_k] if vector_results else []

// bm25.ts — BM25 + Reciprocal Rank Fusion for hybrid search

interface ScoredDoc { index: number; score: number; }

class BM25 {
  private k1: number;
  private b: number;
  private docLengths: number[];
  private avgDl: number;
  private docCount: number;
  private docFreqs: Map<string, number> = new Map();
  private termFreqs: Map<string, number>[];

  constructor(documents: string[], k1 = 1.5, b = 0.75) {
    this.k1 = k1;
    this.b = b;
    this.docCount = documents.length;
    this.docLengths = documents.map(d => d.split(/\s+/).length);
    this.avgDl = this.docLengths.reduce((a, b) => a + b, 0) / this.docCount;

    // Build inverted index
    this.termFreqs = documents.map(doc => {
      const tf = new Map<string, number>();
      for (const word of doc.toLowerCase().split(/\s+/)) {
        tf.set(word, (tf.get(word) || 0) + 1);
      }
      for (const term of tf.keys()) {
        this.docFreqs.set(term, (this.docFreqs.get(term) || 0) + 1);
      }
      return tf;
    });
  }

  score(query: string, topK = 10): ScoredDoc[] {
    const queryTerms = query.toLowerCase().split(/\s+/);
    const scores: ScoredDoc[] = [];

    for (let i = 0; i < this.docCount; i++) {
      let s = 0;
      for (const term of queryTerms) {
        const df = this.docFreqs.get(term) || 0;
        if (df === 0) continue;
        const idf = Math.log((this.docCount - df + 0.5) / (df + 0.5) + 1);
        const tf = this.termFreqs[i].get(term) || 0;
        const numerator = tf * (this.k1 + 1);
        const denominator = tf + this.k1 *
          (1 - this.b + this.b * this.docLengths[i] / this.avgDl);
        s += idf * numerator / denominator;
      }
      scores.push({ index: i, score: s });
    }
    return scores.sort((a, b) => b.score - a.score).slice(0, topK);
  }
}

function reciprocalRankFusion(
  rankedLists: ScoredDoc[][],
  k = 60
): ScoredDoc[] {
  const fusedScores = new Map<number, number>();
  for (const list of rankedLists) {
    list.forEach((doc, rank) => {
      const current = fusedScores.get(doc.index) || 0;
      fusedScores.set(doc.index, current + 1 / (k + rank + 1));
    });
  }
  return Array.from(fusedScores.entries())
    .map(([index, score]) => ({ index, score }))
    .sort((a, b) => b.score - a.score);
}

async function hybridSearch(
  query: string,
  bm25: BM25,
  vectorStore: any,       // Your ChromaDB collection from M09
  embedFn: (text: string) => Promise<number[]>,
  topK = 10
): Promise<ScoredDoc[]> {
  try {
    // BM25 keyword search
    const bm25Results = bm25.score(query, topK * 2);

    // Vector similarity search
    const queryEmbedding = await embedFn(query);
    const vectorRaw = await vectorStore.query({
      queryEmbeddings: [queryEmbedding],
      nResults: topK * 2,
    });
    const vectorResults: ScoredDoc[] = vectorRaw.ids[0].map(
      (id: string, i: number) => ({
        index: parseInt(id, 10),
        score: 1 - vectorRaw.distances[0][i],
      })
    );

    // Fuse with RRF
    const fused = reciprocalRankFusion([bm25Results, vectorResults]);
    return fused.slice(0, topK);
  } catch (error) {
    console.error("Hybrid search error:", error);
    return []; // Fallback: empty results
  }
}

✔ What Just Happened?

You built a BM25 keyword search engine from scratch (no external library needed for the basics), then merged it with your existing vector search using Reciprocal Rank Fusion. The hybrid_search function runs both searches, fuses the results, and falls back to vector-only if something fails. Your retrieval now catches both exact terms and semantic matches.
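To see what the fusion step does with concrete numbers, here is a tiny worked example with made-up document indices. It uses 1-based ranks, so 1 / (k + rank) here is the same quantity as 1 / (k + rank + 1) with the 0-based ranks used in the code above.

```python
def rrf(ranked_lists: list[list[int]], k: int = 60) -> dict[int, float]:
    """Reciprocal Rank Fusion over lists of doc indices, best first."""
    scores: dict[int, float] = {}
    for ranked in ranked_lists:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return scores


bm25_ranking = [3, 1, 2]     # doc 3 is BM25's top hit
vector_ranking = [1, 2, 4]   # doc 3 is absent from the vector results

fused = rrf([bm25_ranking, vector_ranking])
order = sorted(fused, key=fused.get, reverse=True)

for doc in order:
    print(doc, round(fused[doc], 4))
```

Documents 1 and 2 appear in both lists and outrank document 3, which tops only one list. RRF rewards agreement between retrievers and never looks at the raw BM25 or cosine scores, which is why no score normalization is needed.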

Step 2: Claude-Based Re-Ranking

Now for the interesting part: after hybrid retrieval returns the top candidates, we ask Claude to score each document’s relevance to the query on a 1–10 scale, then re-sort by those scores. This is where the real quality jump happens. Claude understands nuance that cosine similarity and BM25 simply cannot capture. For example, a chunk about “filing amendments” might score high on vector similarity for a query about “new filings,” but Claude knows that amendments and new filings are fundamentally different things.

Here’s the problem with LLM-based re-ranking: Claude might give all documents similar scores (e.g., 7, 7, 8, 7), which defeats the purpose of re-ranking. The trick is to tell Claude to use the full 1–10 scale and to anchor what the endpoints mean, as the prompt below does. Asking for a brief justification per score can make the scoring more discriminative still, because Claude has to articulate why one document deserves a 9 while another only deserves a 3. The prompt below leaves that out so the response stays a bare JSON array that is easy to parse.

import anthropic
import json

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY env var

async def rerank_with_claude(
    query: str,
    documents: list[dict],   # [{"id": "...", "text": "..."}, ...]
    top_k: int = 5,
    timeout: float = 30.0
) -> list[dict]:
    """Re-rank documents using Claude as a cross-encoder.

    Returns top_k documents sorted by Claude's relevance score.
    """
    if not documents:
        return []

    # Build the scoring prompt
    doc_list = "\n\n".join(
        f"[Document {i+1}] (ID: {doc['id']})\n{doc['text'][:500]}"
        for i, doc in enumerate(documents)
    )

    try:
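        # Note: this is the synchronous client, so the call blocks the event loop.
        # For concurrent use, switch to anthropic.AsyncAnthropic and await the call.
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            timeout=timeout,  # Per-request timeout in seconds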
            messages=[{
                "role": "user",
                "content": f"""Score each document's relevance to the query on a scale of 1-10.
Use the FULL scale: 1=completely irrelevant, 5=somewhat related, 10=directly answers the query.

Query: {query}

{doc_list}

Respond with ONLY a JSON array of objects, each with "doc_id" and "score" fields.
Example: [{{"doc_id": "abc", "score": 8}}, {{"doc_id": "def", "score": 3}}]"""
            }]
        )

        # Parse Claude's scoring response
        scores = json.loads(response.content[0].text)

        # Build lookup and sort
        score_map = {s["doc_id"]: s["score"] for s in scores}
        ranked = sorted(
            documents,
            key=lambda d: score_map.get(d["id"], 0),
            reverse=True
        )
        return ranked[:top_k]

    except json.JSONDecodeError as e:
        print(f"Re-rank parse error: {e}. Returning original order.")
        return documents[:top_k]
    except anthropic.APITimeoutError:
        print("Re-rank timeout. Returning original order.")
        return documents[:top_k]
    except Exception as e:
        print(f"Re-rank error: {e}. Returning original order.")
        return documents[:top_k]

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY env var

interface Document { id: string; text: string; }
interface RankedDoc extends Document { score: number; }

async function rerankWithClaude(
  query: string,
  documents: Document[],
  topK = 5
): Promise<RankedDoc[]> {
  if (documents.length === 0) return [];

  const docList = documents
    .map((doc, i) =>
      `[Document ${i + 1}] (ID: ${doc.id})\n${doc.text.slice(0, 500)}`
    )
    .join("\n\n");

  try {
    const response = await client.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 1024,
      messages: [{
        role: "user",
        content: `Score each document's relevance to the query on a scale of 1-10.
Use the FULL scale: 1=completely irrelevant, 5=somewhat related, 10=directly answers the query.

Query: ${query}

${docList}

Respond with ONLY a JSON array of objects, each with "doc_id" and "score" fields.
Example: [{"doc_id": "abc", "score": 8}, {"doc_id": "def", "score": 3}]`,
      }],
    });

    const text = response.content[0].type === "text"
      ? response.content[0].text : "";
    const scores: { doc_id: string; score: number }[] = JSON.parse(text);

    const scoreMap = new Map(scores.map(s => [s.doc_id, s.score]));
    const ranked: RankedDoc[] = documents
      .map(d => ({ ...d, score: scoreMap.get(d.id) || 0 }))
      .sort((a, b) => b.score - a.score);

    return ranked.slice(0, topK);
  } catch (error) {
    if (error instanceof SyntaxError) {
      console.error("Re-rank parse error:", error.message);
    } else {
      console.error("Re-rank error:", error);
    }
    // Fallback: return original order with score 0
    return documents.slice(0, topK).map(d => ({ ...d, score: 0 }));
  }
}
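Both re-rankers above call a JSON parser directly on the model’s reply, so any preamble or Markdown fence around the array triggers the fallback to the original order. A more tolerant parser might look like the sketch below. The function name and the clamping policy are choices made for this example, not part of the walkthrough code.

```python
import json
import re


def parse_scores(raw: str, valid_ids: set[str]) -> dict[str, float]:
    """Extract {"doc_id", "score"} objects from a reply, tolerating surrounding text.

    Entries with unknown IDs or non-numeric scores are dropped, and scores are
    clamped to the 1-10 range. Raises json.JSONDecodeError if the extracted
    text is not valid JSON.
    """
    match = re.search(r"\[.*\]", raw, flags=re.DOTALL)  # first '[' through last ']'
    payload = match.group(0) if match else raw
    scores: dict[str, float] = {}
    for item in json.loads(payload):
        if not isinstance(item, dict):
            continue
        doc_id, score = str(item.get("doc_id")), item.get("score")
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if doc_id in valid_ids and is_number:
            scores[doc_id] = min(10.0, max(1.0, float(score)))
    return scores


raw = (
    "Here are the scores:\n"
    '[{"doc_id": "a", "score": 9}, {"doc_id": "zzz", "score": 5}, {"doc_id": "b", "score": 42}]'
)
scores = parse_scores(raw, valid_ids={"a", "b"})
print(scores)
```

The unknown ID "zzz" is discarded and the out-of-range 42 is clamped to 10, so one malformed entry does not poison the whole ranking.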

Step 3: HyDE Query Transformation

This next technique is clever. Before we even search, we ask Claude to generate a hypothetical answer to the user’s question, then embed that answer instead of the raw query. Why does this help? The user’s short question (“What’s the risk for filing 0093?”) uses question language, but documents use statement language (“The debtor’s credit profile indicates elevated default probability...”). A hypothetical answer bridges that gap — its embedding lands closer to relevant chunks because it sounds like the documents themselves.

The risk with HyDE is that if the hypothetical answer is wrong or off-topic, you’ll retrieve irrelevant documents. That’s why the code below treats HyDE as an additional input to Reciprocal Rank Fusion, not a replacement for the original query. If the hypothesis is good, it boosts retrieval. If it’s bad, the original query’s results still dominate.
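A toy illustration of that safety property, using invented document indices. Two of the three ranked lists come from the original query (BM25 and the hybrid list), so an off-topic HyDE list is outvoted, while a good one nudges the ordering.

```python
def rrf_order(ranked_lists: list[list[int]], k: int = 60) -> list[int]:
    """Return doc indices sorted by Reciprocal Rank Fusion score, best first."""
    scores: dict[int, float] = {}
    for ranked in ranked_lists:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


bm25_list = [0, 1, 2]      # from the original query
hybrid_list = [0, 2, 1]    # from the original query
hyde_bad = [7, 8, 9]       # hypothesis drifted off-topic
hyde_good = [2, 0, 5]      # hypothesis agrees that doc 2 matters

top3_with_bad_hyde = rrf_order([bm25_list, hybrid_list, hyde_bad])[:3]
order_with_good_hyde = rrf_order([bm25_list, hybrid_list, hyde_good])

print(top3_with_bad_hyde)     # the off-topic docs 7, 8, 9 stay out of the top 3
print(order_with_good_hyde)   # doc 2 moves ahead of doc 1
```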

async def hyde_transform(
    query: str,
    domain_context: str = "UCC filings and commercial lending"
) -> str:
    """Generate a hypothetical document that answers the query.

    The hypothesis is embedded in addition to the raw query, since its
    wording tends to sit closer to real documents in vector space.
    """
    try:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": f"""Write a short paragraph (3-5 sentences) that would appear in a
document about {domain_context} and directly answers this question:

Question: {query}

Write as if this paragraph is from an actual document, not as a response
to a user. Use technical language appropriate for the domain.
Do NOT start with "Here is" or similar preamble."""
            }]
        )
        return response.content[0].text
    except Exception as e:
        print(f"HyDE generation failed: {e}. Using original query.")
        return query


async def advanced_rag_query(
    query: str,
    bm25: BM25,
    vector_store,
    embedding_fn,
    documents: list[dict],
    top_k: int = 5
) -> list[dict]:
    """Full advanced RAG pipeline: HyDE + Hybrid Search + Re-rank."""
    # Step 1: Generate hypothetical answer for better embedding
    hypothesis = await hyde_transform(query)

    # Step 2: Run hybrid search with BOTH original query and hypothesis
    original_results = hybrid_search(query, bm25, vector_store, embedding_fn, top_k=top_k * 2)
    hyde_embedding = embedding_fn(hypothesis)
    hyde_results_raw = vector_store.query(
        query_embeddings=[hyde_embedding], n_results=top_k * 2
    )
    hyde_results = [
        (int(doc_id), 1.0 - dist)
        for doc_id, dist in zip(
            hyde_results_raw["ids"][0], hyde_results_raw["distances"][0]
        )
    ]

    # Step 3: Fuse three ranked lists: BM25, hybrid (BM25 + vector), and HyDE-vector.
    # BM25 therefore contributes twice, which weights fusion toward the original query.
    bm25_only = bm25.score(query, top_k=top_k * 2)
    all_fused = reciprocal_rank_fusion([
        bm25_only,
        original_results,
        hyde_results
    ])

    # Step 4: Gather top candidates as documents
    candidate_docs = []
    seen = set()
    for doc_idx, _ in all_fused[:top_k * 2]:
        if doc_idx not in seen and doc_idx < len(documents):
            candidate_docs.append(documents[doc_idx])
            seen.add(doc_idx)

    # Step 5: Re-rank with Claude
    reranked = await rerank_with_claude(query, candidate_docs, top_k=top_k)
    return reranked

async function hydeTransform(
  query: string,
  domainContext = "UCC filings and commercial lending"
): Promise<string> {
  try {
    const response = await client.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 300,
      messages: [{
        role: "user",
        content: `Write a short paragraph (3-5 sentences) that would appear in a
document about ${domainContext} and directly answers this question:

Question: ${query}

Write as if this paragraph is from an actual document, not as a response
to a user. Use technical language appropriate for the domain.
Do NOT start with "Here is" or similar preamble.`,
      }],
    });
    const text = response.content[0].type === "text"
      ? response.content[0].text : query;
    return text;
  } catch (error) {
    console.error("HyDE generation failed:", error);
    return query; // Fallback to original query
  }
}

async function advancedRagQuery(
  query: string,
  bm25: BM25,
  vectorStore: any,
  embedFn: (text: string) => Promise<number[]>,
  documents: Document[],
  topK = 5
): Promise<RankedDoc[]> {
  // Step 1: HyDE — generate hypothetical answer
  const hypothesis = await hydeTransform(query);

  // Step 2: Hybrid search with original query
  const originalResults = await hybridSearch(
    query, bm25, vectorStore, embedFn, topK * 2
  );

  // Step 3: Vector search with HyDE embedding
  const hydeEmbedding = await embedFn(hypothesis);
  const hydeRaw = await vectorStore.query({
    queryEmbeddings: [hydeEmbedding],
    nResults: topK * 2,
  });
  const hydeResults: ScoredDoc[] = hydeRaw.ids[0].map(
    (id: string, i: number) => ({
      index: parseInt(id, 10),
      score: 1 - hydeRaw.distances[0][i],
    })
  );

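  // Step 4: Fuse three ranked lists: BM25, hybrid (BM25 + vector), and HyDE-vector.
  // BM25 therefore contributes twice, which weights fusion toward the original query.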
  const bm25Only = bm25.score(query, topK * 2);
  const allFused = reciprocalRankFusion([
    bm25Only, originalResults, hydeResults
  ]);

  // Step 5: Gather candidates and re-rank
  const seen = new Set<number>();
  const candidates: Document[] = [];
  for (const { index } of allFused.slice(0, topK * 2)) {
    if (!seen.has(index) && index < documents.length) {
      candidates.push(documents[index]);
      seen.add(index);
    }
  }

  return rerankWithClaude(query, candidates, topK);
}

✔ What Just Happened?

You wired together the complete Advanced RAG pipeline: HyDE generates a hypothesis for better embedding, hybrid search combines BM25 + vector + HyDE results via RRF, and Claude re-ranks the top candidates by true relevance. The advanced_rag_query function orchestrates all five steps. Error handling ensures each stage falls back gracefully — a HyDE failure uses the original query, a re-rank timeout keeps the original order.

Step 4: RAG Evaluation Harness

Now let’s build something that will save you hours of guessing: a simple evaluation framework that measures precision, recall, and faithfulness across a test set of questions with known answers. Without these metrics, you genuinely cannot tell if your advanced pipeline actually improved things. Maybe re-ranking helps for your data; maybe it adds latency without improving quality. The numbers will tell you — and they often surprise you.

The hardest part of evaluation is not the code — it’s creating the ground-truth labels. Someone (usually you) has to manually decide which documents are relevant for each test query. This is tedious but essential. Start with 10–20 test cases and label them carefully. More test cases are not better if the labels are sloppy — a small, accurate test set beats a large, noisy one every time.
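One lightweight way to keep those labels honest is to store them in a small JSON file and validate it on load. The sketch below assumes a hypothetical file layout with `query`, `relevant_doc_ids`, and an optional `query_type`; the queries and document IDs are invented.

```python
import json
from dataclasses import dataclass


@dataclass
class LabeledQuery:
    query: str
    relevant_doc_ids: list[str]
    query_type: str = "factual"


def load_test_set(raw_json: str) -> list[LabeledQuery]:
    """Parse a JSON list of labeled queries, rejecting empty or unlabeled cases."""
    cases = []
    for i, item in enumerate(json.loads(raw_json)):
        if not item.get("query", "").strip():
            raise ValueError(f"case {i}: empty query")
        if not item.get("relevant_doc_ids"):
            raise ValueError(f"case {i}: no relevant_doc_ids labeled")
        cases.append(LabeledQuery(
            query=item["query"],
            relevant_doc_ids=[str(d) for d in item["relevant_doc_ids"]],
            query_type=item.get("query_type", "factual"),
        ))
    return cases


SAMPLE = """[
  {"query": "Who is the secured party on filing 0093?", "relevant_doc_ids": ["12"]},
  {"query": "Compare lien priority across the two amendments",
   "relevant_doc_ids": ["12", "17"], "query_type": "comparison"}
]"""

test_set = load_test_set(SAMPLE)
print(len(test_set), [c.query_type for c in test_set])
```

Failing loudly on an unlabeled case is deliberate: a query with no relevant documents would otherwise score a recall of 0 and quietly drag the averages down.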

The code below has three parts. First, a TestCase data class holds a query plus its ground truth: the IDs of the documents that answer it and key phrases the answer should contain. Second, evaluate_retrieval computes precision and recall by comparing what the system retrieved against what it should have retrieved. Third, evaluate_faithfulness asks Claude to check whether the generated answer sticks to the retrieved context or invents facts. Together, these three pieces tell you where your pipeline is strong and where it’s leaking quality.

from dataclasses import dataclass

@dataclass
class TestCase:
    query: str
    relevant_doc_ids: list[str]   # Ground truth: which docs answer this
    expected_answer_fragments: list[str]  # Key phrases the answer should contain

@dataclass
class EvalResult:
    precision: float    # Retrieved relevant / total retrieved
    recall: float       # Retrieved relevant / total relevant
    faithfulness: float # Claims grounded in context / total claims

def evaluate_retrieval(
    retrieved_ids: list[str],
    relevant_ids: list[str],
    k: int = 5
) -> tuple[float, float]:
    """Compute precision@k and recall@k."""
    top_k = set(retrieved_ids[:k])
    relevant = set(relevant_ids)

    hits = len(top_k & relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

async def evaluate_faithfulness(
    answer: str,
    retrieved_chunks: list[str]
) -> float:
    """Use Claude to check if the answer is grounded in retrieved context."""
    context = "\n---\n".join(retrieved_chunks)
    try:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            messages=[{
                "role": "user",
                "content": f"""Evaluate whether the answer below is fully supported by the context.
For each factual claim in the answer, check if it appears in the context.

Context:
{context}

Answer:
{answer}

Respond with JSON: {{"total_claims": N, "grounded_claims": M, "ungrounded": ["list of claims not in context"]}}"""
            }]
        )
        result = json.loads(response.content[0].text)
        total = result.get("total_claims", 1)
        grounded = result.get("grounded_claims", 0)
        return grounded / total if total > 0 else 0.0
    except Exception as e:
        print(f"Faithfulness eval error: {e}")
        return 0.0

async def run_evaluation(
    test_cases: list[TestCase],
    search_fn,  # Function: query -> (retrieved_doc_ids, answer, chunks)
    k: int = 5
) -> dict:
    """Run full evaluation across test cases."""
    precisions, recalls, faiths = [], [], []

    for tc in test_cases:
        retrieved_ids, answer, chunks = await search_fn(tc.query)
        p, r = evaluate_retrieval(retrieved_ids, tc.relevant_doc_ids, k)
        f = await evaluate_faithfulness(answer, chunks)
        precisions.append(p)
        recalls.append(r)
        faiths.append(f)
        print(f"  Query: {tc.query[:50]}... P={p:.2f} R={r:.2f} F={f:.2f}")

    return {
        "avg_precision": sum(precisions) / len(precisions),
        "avg_recall": sum(recalls) / len(recalls),
        "avg_faithfulness": sum(faiths) / len(faiths),
        "n_queries": len(test_cases)
    }

interface TestCase {
  query: string;
  relevantDocIds: string[];
  expectedFragments: string[];
}

interface EvalResult {
  precision: number;
  recall: number;
  faithfulness: number;
}

function evaluateRetrieval(
  retrievedIds: string[],
  relevantIds: string[],
  k = 5
): { precision: number; recall: number } {
  const topK = new Set(retrievedIds.slice(0, k));
  const relevant = new Set(relevantIds);

  let hits = 0;
  for (const id of topK) {
    if (relevant.has(id)) hits++;
  }

  return {
    precision: k > 0 ? hits / k : 0,
    recall: relevant.size > 0 ? hits / relevant.size : 0,
  };
}

async function evaluateFaithfulness(
  answer: string,
  retrievedChunks: string[]
): Promise<number> {
  const context = retrievedChunks.join("\n---\n");
  try {
    const response = await client.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 512,
      messages: [{
        role: "user",
        content: `Evaluate whether the answer below is fully supported by the context.
For each factual claim in the answer, check if it appears in the context.

Context:
${context}

Answer:
${answer}

Respond with JSON: {"total_claims": N, "grounded_claims": M, "ungrounded": ["list"]}`,
      }],
    });

    const text = response.content[0].type === "text"
      ? response.content[0].text : "{}";
    const result = JSON.parse(text);
    const total = result.total_claims || 1;
    const grounded = result.grounded_claims || 0;
    return total > 0 ? grounded / total : 0;
  } catch (error) {
    console.error("Faithfulness eval error:", error);
    return 0;
  }
}

async function runEvaluation(
  testCases: TestCase[],
  searchFn: (q: string) => Promise<{ids: string[]; answer: string; chunks: string[]}>,
  k = 5
): Promise<{avgPrecision: number; avgRecall: number; avgFaithfulness: number}> {
  const precisions: number[] = [];
  const recalls: number[] = [];
  const faiths: number[] = [];

  for (const tc of testCases) {
    const { ids, answer, chunks } = await searchFn(tc.query);
    const { precision, recall } = evaluateRetrieval(ids, tc.relevantDocIds, k);
    const faith = await evaluateFaithfulness(answer, chunks);
    precisions.push(precision);
    recalls.push(recall);
    faiths.push(faith);
    console.log(`  ${tc.query.slice(0, 50)}... P=${precision.toFixed(2)} R=${recall.toFixed(2)} F=${faith.toFixed(2)}`);
  }

  const avg = (arr: number[]) => arr.reduce((a, b) => a + b, 0) / arr.length;
  return {
    avgPrecision: avg(precisions),
    avgRecall: avg(recalls),
    avgFaithfulness: avg(faiths),
  };
}

✔ What Just Happened?

You built a complete evaluation framework. evaluate_retrieval computes precision and recall by comparing retrieved doc IDs against ground truth. evaluate_faithfulness uses Claude to check whether the generated answer is grounded in the retrieved context. run_evaluation runs both across a test set and reports averages. Now you can objectively compare naive RAG vs. your advanced pipeline.

Hands-On Exercise

What You'll Build

An advanced RAG pipeline that upgrades the M09 basic system with BM25 keyword search, Reciprocal Rank Fusion, and Claude-based re-ranking — then measures the improvement with an evaluation framework.

Time Estimate: 45–60 minutes | Prerequisites: Working M09 RAG pipeline with docs/ folder | Files You'll Create: advanced_rag.py

Environment Setup

This lab builds on the M09 RAG lab. You need the rag-lab/ directory with the docs/ folder containing sample UCC filing documents. If you skipped M09, create the folder and add 2–3 .md files about UCC filings.

# Navigate to your M09 rag-lab directory
cd rag-lab

# Activate virtual environment
source venv/bin/activate          # macOS/Linux
# venv\Scripts\activate           # Windows

# Install dependencies (if not already installed)
# Quote the version specifiers so the shell does not treat ">" as output redirection
pip install "anthropic>=0.30.0" "chromadb>=0.4.0"

# Set your API key
export ANTHROPIC_API_KEY=your-key-here          # macOS/Linux
# set ANTHROPIC_API_KEY=your-key-here           # Windows CMD
# $env:ANTHROPIC_API_KEY="your-key-here"        # Windows PowerShell

Verify setup

python -c "import anthropic; import chromadb; print('Setup OK')"

Expected Output

Setup OK

Troubleshooting

ModuleNotFoundError: No module named 'chromadb' → Run pip install chromadb. If it fails, try pip install --upgrade pip first.
ModuleNotFoundError: No module named 'anthropic' → Run pip install anthropic.
AuthenticationError → Check that ANTHROPIC_API_KEY is set in your current shell session.

Step 1: Build the Advanced RAG Pipeline

What & Why: This step adds three advanced patterns on top of the M09 base: BM25 keyword search for exact term matching, RRF fusion to merge both result sets, and Claude-based re-ranking to sort by true relevance. It uses the same docs/ folder and ChromaDB setup from M09. By the end of this step, you’ll have a single Python file that runs the complete advanced pipeline end-to-end.

Create a new file called advanced_rag.py in your rag-lab/ directory and add the following:

import os
import glob
import json
import math
from collections import Counter
import anthropic
import chromadb

client = anthropic.Anthropic()

# ── Document Loading & Chunking (from M09) ──────────────────
def load_documents(docs_dir: str) -> list[dict]:
    docs = []
    for pattern in ["*.md", "*.txt"]:
        for path in glob.glob(os.path.join(docs_dir, pattern)):
            try:
                with open(path, "r", encoding="utf-8") as f:
                    docs.append({"content": f.read(), "source": os.path.basename(path)})
            except (IOError, UnicodeDecodeError):
                pass
    if not docs:
        raise FileNotFoundError(f"No documents found in {docs_dir}")
    return docs

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    if len(text) <= chunk_size:
        return [text]
    chunks, start = [], 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        if end < len(text):
            for sep in ["\n\n", "\n", ". ", " "]:
                last_sep = chunk.rfind(sep)
                if last_sep > chunk_size * 0.5:
                    end = start + last_sep + len(sep)
                    chunk = text[start:end]
                    break
        chunks.append(chunk.strip())
        start = end - overlap
    return [c for c in chunks if c]

# ── BM25 Keyword Search ─────────────────────────────────────
class BM25:
    def __init__(self, documents: list[str], k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = documents
        self.doc_len = [len(d.split()) for d in documents]
        self.avg_dl = sum(self.doc_len) / len(self.doc_len)
        self.doc_count = len(documents)
        self.doc_freqs = {}
        self.term_freqs = []
        for doc in documents:
            tf = Counter(doc.lower().split())
            self.term_freqs.append(tf)
            for term in tf:
                self.doc_freqs[term] = self.doc_freqs.get(term, 0) + 1

    def score(self, query: str, top_k: int = 10) -> list[tuple[int, float]]:
        query_terms = query.lower().split()
        scores = []
        for i in range(self.doc_count):
            s = 0.0
            for term in query_terms:
                if term not in self.doc_freqs:
                    continue
                df = self.doc_freqs[term]
                idf = math.log((self.doc_count - df + 0.5) / (df + 0.5) + 1)
                tf = self.term_freqs[i].get(term, 0)
                s += idf * tf * (self.k1 + 1) / (tf + self.k1 * (1 - self.b + self.b * self.doc_len[i] / self.avg_dl))
            scores.append((i, s))
        scores.sort(key=lambda x: x[1], reverse=True)
        return scores[:top_k]

# ── Reciprocal Rank Fusion ───────────────────────────────────
def rrf_fuse(ranked_lists: list[list[tuple[int, float]]], k: int = 60) -> list[tuple[int, float]]:
    fused = {}
    for ranked_list in ranked_lists:
        for rank, (doc_idx, _) in enumerate(ranked_list):
            fused[doc_idx] = fused.get(doc_idx, 0) + 1.0 / (k + rank + 1)
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

# ── Claude-Based Re-Ranking ─────────────────────────────────
def rerank_with_claude(query: str, chunks: list[dict], top_n: int = 5) -> list[dict]:
    if not chunks:
        return []
    docs_text = "\n".join(f"[Doc {i+1}] {c['text'][:300]}" for i, c in enumerate(chunks))
    prompt = (
        f"Query: {query}\n\nDocuments:\n{docs_text}\n\n"
        f"Score each document 1-10 for relevance to the query. "
        f"1 = completely irrelevant, 10 = directly answers the question.\n"
        f"Output ONLY the raw JSON array, no markdown fences, no commentary.\n"
        f"Format: [{{\"doc\": 1, \"score\": 8.5}}, ...]"
    )
    try:
        response = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=512,
            messages=[{"role": "user", "content": prompt}],
        )
        raw = response.content[0].text.strip()
        # Strip common markdown code fences if Claude wraps the JSON
        if raw.startswith("```"):
            raw = raw.split("```")[1]
            if raw.startswith("json"):
                raw = raw[4:]
            raw = raw.strip()
        scores = json.loads(raw)
        scored = []
        for s in scores:
            idx = s["doc"] - 1
            if 0 <= idx < len(chunks):
                chunks[idx]["rerank_score"] = s["score"]
                scored.append(chunks[idx])
        scored.sort(key=lambda x: x.get("rerank_score", 0), reverse=True)
        return scored[:top_n]
    except Exception as e:
        print(f"  Re-ranking failed: {e}, using original order")
        return chunks[:top_n]

# ── Full Advanced Pipeline ───────────────────────────────────
def ingest_and_build(docs_dir="./docs"):
    """Load, chunk, store in ChromaDB, build BM25 index."""
    docs = load_documents(docs_dir)
    all_chunks = []
    for doc in docs:
        for i, chunk in enumerate(chunk_text(doc["content"])):
            all_chunks.append({"text": chunk, "source": doc["source"], "index": i})

    chroma = chromadb.Client()
    collection = chroma.get_or_create_collection("advanced_rag", metadata={"hnsw:space": "cosine"})
    collection.add(
        ids=[f"chunk_{i}" for i in range(len(all_chunks))],
        documents=[c["text"] for c in all_chunks],
        metadatas=[{"source": c["source"], "index": c["index"]} for c in all_chunks],
    )
    bm25 = BM25([c["text"] for c in all_chunks])
    print(f"  Loaded {len(docs)} docs, {len(all_chunks)} chunks")
    return collection, bm25, all_chunks

def advanced_query(query, collection, bm25, all_chunks, verbose=True):
    """Hybrid search + RRF + re-ranking + Claude generation."""
    # 1. BM25 search
    bm25_results = bm25.score(query, top_k=10)

    # 2. Vector search
    vector_raw = collection.query(query_texts=[query], n_results=10)
    vector_results = [(int(doc_id.split("_")[1]), 1.0 - dist)
                      for doc_id, dist in zip(vector_raw["ids"][0], vector_raw["distances"][0])]

    # 3. RRF fusion
    fused = rrf_fuse([bm25_results, vector_results])
    fused_chunks = [all_chunks[idx] for idx, _ in fused[:10]]

    if verbose:
        print(f"\n  Query: {query}")
        print(f"  BM25 top-3: {[all_chunks[i]['source'] for i, _ in bm25_results[:3]]}")
        print(f"  Vector top-3: {[all_chunks[i]['source'] for i, _ in vector_results[:3]]}")
        print(f"  Fused top-3: {[all_chunks[i]['source'] for i, _ in fused[:3]]}")

    # 4. Re-rank top candidates
    reranked = rerank_with_claude(query, fused_chunks, top_n=3)

    if verbose and reranked:
        print(f"  Re-ranked top-3: {[(c['source'], c.get('rerank_score', '?')) for c in reranked]}")

    # 5. Generate answer
    context = "\n\n---\n\n".join(
        f"[Source {i+1}: {c['source']}]\n{c['text']}" for i, c in enumerate(reranked)
    )
    response = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=1024,
        system="Answer based ONLY on the provided context. Cite sources using [Source N].",
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}],
    )
    return response.content[0].text

# ── Run Tests ────────────────────────────────────────────────
if __name__ == "__main__":
    print("── Advanced RAG Pipeline ──")
    collection, bm25, all_chunks = ingest_and_build()

    questions = [
        "What are the high risk indicators for a UCC filing?",
        "How long is a UCC-1 effective and what happens when it lapses?",
        "What is the difference between a UCC-1 and a UCC-3?",
    ]

    for q in questions:
        answer = advanced_query(q, collection, bm25, all_chunks)
        print(f"\n  Answer: {answer[:200]}...")
        print(f"  {'═' * 50}")

Run the advanced pipeline:

Command

python advanced_rag.py

Expected Output (abbreviated)

── Advanced RAG Pipeline ──
  Loaded 3 docs, 8 chunks

  Query: What are the high risk indicators for a UCC filing?
  BM25 top-3: ['risk_assessment.md', 'filing_guide.md', 'compliance_faq.txt']
  Vector top-3: ['risk_assessment.md', 'filing_guide.md', 'risk_assessment.md']
  Fused top-3: ['risk_assessment.md', 'filing_guide.md', 'compliance_faq.txt']
  Re-ranked top-3: [('risk_assessment.md', 9.1), ('filing_guide.md', 4.2), ...]

  Answer: According to the risk assessment criteria, high risk indicators include...
  ══════════════════════════════════════════════════

  Query: How long is a UCC-1 effective and what happens when it lapses?
  ...
  ══════════════════════════════════════════════════

✅ Checkpoint

Look for these key behaviors:

BM25 vs Vector: BM25 and vector search should return different orderings — BM25 favors exact keyword matches, vector favors semantic similarity
Fused results: Should combine the best of both lists (documents appearing in both get boosted)
Re-ranking: Should promote the most directly relevant document to position #1 (check the scores)
Cited answers: Claude's responses should include [Source N] citations
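To see why a document that appears in both lists gets boosted, here is a small worked example of the RRF arithmetic. It uses the same formula as the lab's rrf_fuse (1 / (k + rank + 1) with zero-based ranks and k = 60), but the rrf_scores helper and the document names are hypothetical and simplified to plain ID lists.

```python
def rrf_scores(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs using Reciprocal Rank Fusion."""
    fused = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):  # rank is zero-based
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings: "profile" is the only document in both lists
bm25_ranking = ["filing", "amendment", "profile"]
vector_ranking = ["risk", "profile", "credit"]

for doc_id, score in rrf_scores([bm25_ranking, vector_ranking]):
    print(f"{doc_id:10s} {score:.4f}")
```

"profile" is third in one list and second in the other, so it collects 1/63 + 1/62 ≈ 0.0320 and ranks first. Every document that appears in only one list scores at most 1/61 ≈ 0.0164, even if it was ranked #1 there.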

Troubleshooting

Re-ranking returns error / invalid JSON → Claude's response format may vary. The code falls back to the original order if parsing fails. Try running again — it's usually a one-off issue.
FileNotFoundError: No documents found → You need the docs/ folder from M09's lab. Create it with the 3 sample files from the M09 exercise.
BM25 and vector return identical results → With a small corpus (3 docs, ~8 chunks), there's limited diversity, so both methods can surface the same chunks. The rankings tend to diverge as the corpus grows (for example, 50+ documents).
APIError: 529 → The API is temporarily overloaded. This run makes 6 API calls (3 queries, each with one re-ranking call and one generation call). Wait 30 seconds between runs.
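If the invalid-JSON fallback triggers often, one option is to make the parsing step more tolerant. The sketch below is a hypothetical helper (extract_json_array is not part of the lab code). It scans the reply for the first JSON array of objects, so surrounding commentary, markdown fences, or bracketed text like "[Doc 1]" do not break parsing. It assumes the re-ranker returns a list of objects, as the lab's prompt requests.

```python
import json


def extract_json_array(raw: str) -> list:
    """Return the first JSON array of objects found anywhere in `raw`."""
    decoder = json.JSONDecoder()
    for pos, ch in enumerate(raw):
        if ch != "[":
            continue
        try:
            # raw_decode parses one JSON value starting at `pos`
            value, _end = decoder.raw_decode(raw, pos)
        except json.JSONDecodeError:
            continue  # e.g. "[Doc 1]" is not valid JSON; keep scanning
        if isinstance(value, list) and all(isinstance(v, dict) for v in value):
            return value
    raise ValueError("no JSON array of objects found in response")


reply = 'See [Doc 1]. Scores:\n```json\n[{"doc": 1, "score": 8.5}]\n```'
print(extract_json_array(reply))  # [{'doc': 1, 'score': 8.5}]
```

In rerank_with_claude, a call like this could stand in for the fence-stripping lines before json.loads. The existing except branch would still catch the ValueError and fall back to the original order.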

Verify Everything Works

Run the complete pipeline end-to-end. All 3 queries should complete, showing the full retrieval trace (BM25 → vector → fused → re-ranked) at each stage, followed by cited answers:

Command

python advanced_rag.py

What to check in the output

You should see Loaded N docs, M chunks at the top (confirming ingestion worked)
Each query should print BM25 top-3, Vector top-3, Fused top-3, and Re-ranked top-3 with scores
BM25 and Vector results should differ — that’s the whole point of hybrid search
Re-ranked scores should be on a 1–10 scale, not cosine similarity (0–1)
Final answers should contain [Source N] citations

🎉 Congratulations

You've upgraded a naive RAG system with three production patterns: BM25 hybrid search for exact term matching, RRF fusion for combining result lists, and Claude-based re-ranking for precise relevance ordering. This retrieve, fuse, and re-rank architecture is common in production document search systems.

Stretch Goals (Optional)

HyDE: Add a hyde_transform function that generates a hypothetical answer before embedding. Measure recall improvement on abstract queries.
Contextual compression: After re-ranking, use Claude to extract only the relevant sentences from each chunk. Measure token savings.
Evaluation framework: Create a test set of 10 questions with ground-truth relevant chunk IDs. Compute precision@3 and recall@5 for naive vs. advanced pipeline.
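As a starting point for the HyDE stretch goal, here is one possible shape for hyde_transform. This is a sketch under assumptions: the prompt wording is illustrative, and the generate parameter is a hypothetical callable (prompt in, text out) so the function can be tried without an API key. The stub below stands in for a real model call.

```python
from typing import Callable

# Illustrative prompt wording; tune it for your own documents
HYDE_PROMPT = (
    "Write a short passage, in the style of a reference document, "
    "that would answer this question.\n\nQuestion: {query}"
)


def hyde_transform(query: str, generate: Callable[[str], str]) -> str:
    """Return the text to embed in place of the raw query."""
    try:
        hypothesis = generate(HYDE_PROMPT.format(query=query)).strip()
    except Exception:
        hypothesis = ""
    # Fall back to the raw query if generation fails or returns nothing
    return hypothesis or query


def fake_generate(prompt: str) -> str:
    """Stub standing in for a real LLM call."""
    return "Placeholder passage written in document style."


print(hyde_transform("What's the risk for filing 0093?", fake_generate))
```

In the lab, generate could wrap client.messages.create and return response.content[0].text, and the returned text would be passed to collection.query as query_texts in place of the original question. Keep using the original question for BM25 and for the final generation prompt, since the hypothesis may contain incorrect details.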

⚠️ Common Pitfalls

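BM25 tokenization mismatch: Build the BM25 index from exactly the same chunk list you stored in the vector index. RRF fuses results by chunk index, so if the two indexes were built from different chunkings, the same index points to different text and the fused ranking is meaningless.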
Re-ranker hallucination: Claude may assign middling scores to unrelated chunks instead of rejecting them. Anchor the low end of the scale by always including "1 = completely irrelevant" in the scoring prompt.
Test set bias: Include adversarial queries (ambiguous, no-match, multi-hop) in your evaluation set.
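A related tokenization issue is worth checking in your own pipeline. The lab's BM25 class splits on whitespace with lower().split(), which leaves punctuation attached to terms, so "UCC-1?" in a query will not match "UCC-1" in a chunk. The sketch below shows one possible fix. The tokenize helper and its regular expression are assumptions for illustration, not part of the lab code.

```python
import re

# Keep letters/digits together and preserve hyphenated identifiers like "ucc-1"
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and return word and identifier tokens without punctuation."""
    return TOKEN_RE.findall(text.lower())


query = "Is filing 2024-FL-0093 a UCC-1?"
print(query.lower().split())  # ['is', 'filing', '2024-fl-0093', 'a', 'ucc-1?']
print(tokenize(query))        # ['is', 'filing', '2024-fl-0093', 'a', 'ucc-1']
```

Whatever tokenizer you choose, apply the same one when building the index (term frequencies and document lengths) and when scoring queries. Otherwise query terms and indexed terms will not line up.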

Knowledge Check

1. When would BM25 keyword search outperform vector similarity search?

When the query is about abstract concepts like “risk assessment strategies”

When the query contains specific identifiers like invoice numbers or product codes

When the document corpus is very small (under 100 documents)

When you want to find documents in a different language than the query

✓ Correct! BM25 excels at exact term matching. Vector search understands meaning but may miss specific identifiers like “INV-2024-0847” because it compares semantic similarity, not exact strings.

✗ Not quite. BM25’s strength is exact keyword matching. It outperforms vector search specifically when queries contain proper nouns, IDs, or technical terms that need literal matching — like invoice numbers or product codes.

2. What problem does Reciprocal Rank Fusion (RRF) solve?

It makes vector search faster by reducing the embedding dimension

It translates queries into multiple languages for multilingual search

It merges ranked lists from different search systems without needing to normalize their incompatible scores

It compresses retrieved documents to save tokens in the generation prompt

✓ Correct! BM25 scores and cosine similarity scores are on completely different scales. RRF sidesteps this by using rank positions instead of raw scores: score(d) = Σ 1/(k + rank(d)), summed over every ranked list that contains document d.

✗ RRF solves the score normalization problem. BM25 returns scores like 12.4 while cosine similarity returns 0.87 — you can’t just add them. RRF uses rank positions instead of raw scores, making fusion system-agnostic.

3. A RAG system has high recall but low precision. What does this mean, and which pattern would help most?

It’s missing relevant documents — add hybrid search to cast a wider net

It retrieves relevant documents but also too many irrelevant ones — add re-ranking to filter the top results

The generated answers are hallucinating — add query transformation to fix the input

The embedding model is too small — upgrade to a larger embedding model

✓ Correct! High recall = finding most relevant docs. Low precision = too much junk in the results. Re-ranking applies a more powerful model to sort the candidates, pushing irrelevant documents down and relevant ones up.

✗ Think about what the metrics mean. High recall = the system IS finding relevant documents (so more search isn’t needed). Low precision = too many irrelevant documents in the results. Re-ranking filters the good from the bad.

4. In HyDE, why does embedding a hypothetical answer often retrieve better results than embedding the raw query?

The hypothetical answer uses document-like language, so its embedding lands closer to real documents in vector space

The hypothetical answer has more tokens, and longer texts always produce better embeddings

The LLM already knows the correct answer from its training data, so the hypothesis is always correct

HyDE bypasses the embedding step entirely and uses string matching instead

✓ Correct! A user query (“What’s the risk?”) uses question language. Documents use statement language (“The risk profile indicates...”). The hypothesis bridges this gap by generating document-like text, whose embedding is naturally closer to relevant documents.

✗ The key insight is about language style. Queries use question words; documents use declarative statements. The hypothetical answer mimics document language, so its embedding vector lands in the same neighborhood as real relevant documents. It doesn’t need to be correct — just stylistically similar.

5. Which evaluation metric tells you whether the LLM is hallucinating in its generated answer?

Precision — it measures how many retrieved documents are relevant

Recall — it measures how many relevant documents were retrieved

nDCG@10 — it measures ranking quality of search results

Faithfulness — it checks if every claim in the answer is grounded in the retrieved context

✓ Correct! Faithfulness measures groundedness — does every factual claim in the generated answer come from the retrieved chunks? A faithfulness score below 100% means the LLM added information not in the context, i.e., hallucinated.

✗ Precision and recall measure retrieval quality (finding the right documents). Faithfulness measures generation quality — specifically, whether the answer only contains information from the retrieved context, without making up facts.

6. You’re building a RAG system for a healthcare company. Which advanced pattern should you add FIRST for the biggest impact? (Recall from M09: your basic vector search is already working.)

HyDE query transformation — doctors phrase questions differently than clinical guidelines

Contextual compression — clinical documents are long and noisy

Hybrid search — clinical queries contain specific codes (CPT, ICD-10) that need exact matching

Multi-agent orchestration — separate agents for different medical specialties

✓ Correct! Healthcare queries are full of specific codes (CPT: 99213, ICD-10: E11.65) that vector search handles poorly. BM25 keyword search finds these exact codes instantly. Hybrid search is typically the highest-impact first addition to any RAG system with structured identifiers.

✗ Think about the domain: healthcare queries contain specific procedure codes (CPT: 99213) and diagnosis codes (ICD-10: E11.65). Vector search might return “diabetes management” documents generically, but BM25 finds the exact code match. Hybrid search is the biggest bang-for-buck first improvement.

Module Summary

Key Concepts Recap

Naive RAG follows a single-pass retrieve-then-generate pattern. Advanced RAG adds pre-retrieval, retrieval, and post-retrieval optimizations.
Hybrid Search combines BM25 keyword matching with vector similarity via Reciprocal Rank Fusion, catching both exact terms and semantic matches.
Re-Ranking uses a cross-encoder or LLM to re-sort retrieved candidates by true relevance, fixing the approximate ordering from initial retrieval.
Query Transformation (HyDE, multi-query, step-back) improves retrieval by reformulating the user’s question before search.
Contextual Compression extracts only query-relevant sentences from chunks, reducing noise and token cost.
RAG Evaluation uses precision, recall, and faithfulness to diagnose exactly where your pipeline is failing.

What We Built

A complete advanced RAG pipeline: BM25 + vector hybrid search with RRF fusion, Claude-based re-ranking, HyDE query transformation, and an evaluation harness measuring precision, recall, and faithfulness. Each component is independent and composable — add them as needed for your use case.

Next: M11 — Multi-Layer Memory

You’ve mastered retrieval. But what about information that persists across conversations? In M11, you’ll build a multi-layer memory system with short-term (conversation), medium-term (session summaries), and long-term (persistent knowledge) tiers — giving your agent the ability to remember users, preferences, and context across interactions.