M10: Advanced RAG Patterns
Go beyond basic retrieve-and-generate: hybrid search, re-ranking, query transformation, and measuring what matters.
Prerequisites: M09: RAG Fundamentals
Learning Objectives
- Identify the limitations of naive RAG and explain how advanced patterns address each one
- Implement hybrid search combining BM25 keyword matching with vector similarity
- Apply re-ranking to improve retrieval precision using Claude as a cross-encoder
- Use query transformation (HyDE, multi-query) to improve recall on complex questions
- Measure RAG quality with precision, recall, and faithfulness metrics
Naive RAG vs. Advanced RAG
In M09, you built a working RAG system: chunk documents, embed them, store vectors, retrieve by similarity, and generate an answer. That’s naive RAG: a single-pass retrieve-then-generate pipeline with no query preprocessing or result refinement. It works surprisingly well for simple use cases, but it often misses relevant context or includes irrelevant chunks, and production systems hit a ceiling fast.
Before: Imagine searching Google by typing your exact question and reading only the first result — sometimes it’s exactly right, but often it’s tangential, outdated, or misses the nuance of what you really needed.
The pain: You waste time reading irrelevant pages, miss the one article that actually answers your question because it used different wording, and have no way to know if the answer you got is actually trustworthy.
The mapping: Advanced RAG is like having a research librarian who first rephrases your question three different ways, searches both a keyword index and a topic index, carefully reads the top candidates to rank them by actual relevance, highlights only the sentences that matter, and then gives you a sourced answer. Every stage that the librarian adds makes the final answer more accurate.
What this actually looks like in practice: Your naive pipeline sends one query and gets back chunks ranked by cosine similarity alone. An advanced pipeline might produce output like this at each stage:
Advanced RAG wraps the basic retrieve-then-generate pipeline with three optimization layers. Each layer targets a different failure mode. Here’s what they do and when they kick in:
First, pre-retrieval transforms the user’s query before searching — think of it as asking a better question. Instead of sending the user’s raw words straight to your search engine, you rephrase, expand, or generate a hypothetical answer first.
Second, retrieval enhancement combines multiple search strategies. Rather than relying on vector similarity alone, you run both keyword search (BM25) and semantic search in parallel, then merge the results. A re-ranker then sorts those merged results by true relevance — not just rough cosine similarity scores.
Third, post-retrieval processing compresses the retrieved chunks to remove noise. It also verifies citations so the model doesn’t hallucinate from irrelevant context. In other words, you clean up the search results before the LLM ever sees them.
Each layer can be added independently — you don’t need all of them, and the right combination depends on your data and queries.
“Hybrid search is always better than pure semantic search.” — Not always. Keyword search (BM25) wins for exact IDs, codes, and proper nouns, but for abstract or conceptual queries (“explain debtor risk factors”), vector-only search can actually outperform hybrid because BM25 adds noise by matching irrelevant keyword hits. Always test with your actual queries.
“Re-ranking is free improvement.” — It adds 100–300ms of latency per query and costs real money (an extra LLM call). For simple factual lookups where the initial retrieval is already good, re-ranking just slows things down without meaningfully changing the result order. Reserve it for complex or high-stakes queries.
“More retrieval stages = better quality.” — Each stage adds latency and complexity. In practice, you see diminishing returns after 2–3 stages. A pipeline with hybrid search + re-ranking covers most use cases. Adding query transformation, compression, AND re-ranking on top might gain you 2–3% accuracy while tripling your latency and cost.
“Query decomposition helps every query.” — Simple factual queries (“What is the filing date for UCC-0093?”) do not benefit from multi-query or HyDE. These techniques shine on complex, multi-part questions where a single search cannot capture all aspects of what the user needs.
“Semantic chunking always beats fixed-size chunking.” — It depends on content structure. Semantic chunking works well for prose (articles, reports). But for structured data like tables, code, or form fields, fixed-size chunking often works better because semantic boundaries are harder to detect and splitting a table row across chunks destroys its meaning.
Benchmarks on real-world QA datasets show naive RAG achieves 55–65% answer accuracy. Adding hybrid search lifts that to 70–75%. Adding re-ranking pushes it to 80–85%. Query transformation on top gets you to 85–90%. Each technique targets a different failure mode: hybrid search catches keyword misses, re-ranking fixes bad ordering, and query transformation handles ambiguous questions. For a customer-support bot handling 10,000 queries/day, going from 60% to 85% accuracy means 2,500 fewer wrong answers every single day.
Hybrid Search: Keyword + Semantic Fusion
Vector similarity search is powerful, but it has a blind spot: exact terms. If a user asks about “invoice #INV-2024-0847”, vector search might return chunks about invoices in general rather than that specific one. BM25 (Best Match 25) is a ranking function that scores documents based on term frequency and inverse document frequency. It excels at finding exact keyword matches like product IDs, names, and technical terms, but it misses synonyms and semantic relationships. Hybrid search combines both.
Before: You’re searching for a restaurant. You know it’s called something like “Trattoria Verde” but you also remember it as “that cozy Italian place near Central Park.”
The pain: A name-only search finds the exact restaurant but misses similar ones you might like. A description-only search finds cozy Italian places but might miss Trattoria Verde if its listing doesn’t say “cozy.”
The mapping: Hybrid search runs both queries and merges the results. BM25 is the name search — fast, literal, great for proper nouns. Vector search is the description search — understands meaning, catches synonyms. Together, they cover cases that either alone would miss.
What this actually looks like: Below is a real example of how the two search strategies return different documents for the same query, and how RRF fuses them:
Hybrid search runs two retrieval strategies in parallel.
First, sparse retrieval (BM25 or TF-IDF) matches documents by exact keyword overlap. It breaks the query into individual words, counts how often each appears in each document, and boosts rare words — that’s the “inverse document frequency” part. A rare word like “INV-2024-0847” gets a high score because it appears in very few documents, making it a strong signal.
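To make the scoring concrete, here is a minimal sketch of BM25 in plain Python. It assumes the common Okapi BM25 formulation with illustrative defaults (`k1 = 1.5`, `b = 0.75`) and a whitespace tokenizer; the function names and sample documents are hypothetical, and a production system would normally rely on a library or a search engine rather than this code.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase and split on whitespace; real systems usually also strip punctuation.
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25 (illustrative defaults)."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a larger inverse document frequency.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "invoice inv-2024-0847 was paid in march",
    "how to read an invoice and a receipt",
    "invoice templates for small businesses",
]
scores = bm25_scores("invoice inv-2024-0847", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

In this toy corpus every document contains "invoice", so that term contributes little, while the rare token "inv-2024-0847" dominates the score of the first document.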
Second, dense retrieval (vector similarity) encodes both query and documents as high-dimensional vectors and finds the closest ones by cosine similarity. It understands meaning, not just words, so “automobile” and “car” would match even though they are different tokens that keyword search would never connect.
Finally, Reciprocal Rank Fusion (RRF) merges the two ranked lists into one without needing to normalize scores across different systems. For each document, it sums 1/(k + rank) over every list the document appears in: score = Σ 1/(k + rank), where rank is the document’s 1-based position in a list and k is typically 60. The beauty of RRF is that it uses rank positions, not raw scores, so you never have to worry about normalizing BM25 scores (which might be 12.4) against cosine similarities (which might be 0.87).
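The fusion formula is short enough to sketch directly. The function below is a minimal illustration, assuming 1-based ranks and k = 60; the document IDs are made up for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one list sorted by RRF score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical results for the same query from the two retrievers.
bm25_ranking = ["doc_filing_0093", "doc_fees", "doc_risk_overview"]
vector_ranking = ["doc_risk_overview", "doc_filing_0093", "doc_debtor_profiles"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

Here `doc_filing_0093` ends up first because it ranks near the top of both lists (1/61 + 1/62), ahead of `doc_risk_overview` (1/63 + 1/61), while documents that appear in only one list fall to the bottom.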
In a UCC filing system with 500,000 documents, a query like “filing #2024-FL-0093 debtor risk assessment” needs both strategies: BM25 finds the exact filing number (a string match that vector search might miss entirely), while vector search retrieves related risk analysis documents that discuss similar debtors. In real-world benchmarks (BEIR, MTEB), hybrid search improves nDCG@10 by 8–15% over vector-only search, with the biggest gains on queries containing proper nouns, IDs, or technical terms.
You learned that neither keyword search nor vector search alone covers all query types. BM25 handles exact terms; vectors handle meaning. Reciprocal Rank Fusion merges their results without needing score normalization. This is the first Advanced RAG pattern — and often the one with the biggest bang-for-buck.
Re-Ranking: Why Retrieval Order Matters
Before: A job recruiter searches LinkedIn for “senior backend engineer, Rust experience, distributed systems.” The search returns 200 profiles ranked by keyword match.
The pain: Profile #3 mentions “Rust” once in a side project, while profile #47 has 5 years of Rust at a distributed-systems company but listed it as “systems programming.” The initial ranking is misleading.
The mapping: Re-ranking is the recruiter reading the top 50 profiles carefully and reordering them by actual fit. It’s slower (reading takes time), but the final shortlist is dramatically better. In RAG, the “recruiter” is a cross-encoder or LLM that compares each document directly against the query.
What this actually looks like: Here is a before-and-after of re-ranking. Notice how the initial retrieval order (based on cosine similarity) is misleading — the most relevant document was buried at position #4:
A cross-encoder is a model that takes a query-document pair as a single input and scores their relevance jointly. Think of it this way: instead of comparing two separate summaries of the query and document, the model reads both texts side by side and decides “how relevant is this document to this specific question?” That joint reading enables a deeper understanding of relevance, but it makes cross-encoders too slow to run on every document.
This is fundamentally more powerful than bi-encoder embeddings. A bi-encoder creates a vector for the query and a separate vector for each document, then compares them with cosine similarity. It’s fast because you can pre-compute document vectors, but it is less accurate than a cross-encoder: it misses subtle relevance signals that only become visible when you read both texts together.
The trade-off is speed. Cross-encoders are too slow to run on millions of documents, so you use a two-stage approach. First, retrieve broadly with fast methods (top-50 to top-100 candidates). Second, re-rank that short list precisely with a cross-encoder (or Claude itself) to get the true top-5.
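The two-stage shape can be sketched without committing to a particular model. In the snippet below, `score_pair` is a placeholder for whatever scores a (query, document) pair jointly: a cross-encoder, a hosted re-rank API, or a Claude call. The `toy_overlap_score` function is only a stand-in so the example runs offline; it is not a real cross-encoder.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Stage 2: score each first-stage candidate against the query, keep the best top_n."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    # Stable sort: ties keep their original retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def toy_overlap_score(query: str, doc: str) -> float:
    # Stand-in scorer (assumption): fraction of query words found in the document.
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


# Stage 1 output, in the order a fast retriever returned it (hypothetical).
candidates = [
    "rust appears once in a side project",
    "python backend engineer",
    "five years of rust on distributed systems",
]
top = rerank("rust distributed systems", candidates, toy_overlap_score, top_n=2)
```

Keeping the scorer as a parameter makes it easy to compare a dedicated re-ranker against an LLM-based one on the same candidate lists.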
If you’ve used database indexes before, the analogy is apt: a database uses a fast B-tree index to narrow 10 million rows down to 100 candidates, then applies the full WHERE clause to those 100. Re-ranking works the same way — the initial retrieval (BM25 + vector) is the fast index scan, and the cross-encoder is the expensive filter applied to the short list. You would never run the expensive filter on every document, but it’s perfectly affordable for 20–50 candidates.
Research from Cohere and MS MARCO benchmarks shows re-ranking improves retrieval precision@5 by 15–25%. In production, this translates directly to answer quality: if your RAG system feeds the top-5 chunks to Claude, and re-ranking moves the truly relevant chunk from position #8 to position #2, that chunk is now in the context window instead of being dropped. For a healthcare pre-authorization system processing 200 claims/day, the difference between retrieving the correct clinical guideline and a similar-but-wrong one could mean approving or denying coverage incorrectly.
Using Claude as a re-ranker means an extra API call per query. Re-ranking 20 chunks with a short scoring prompt costs roughly 3,000–5,000 input tokens. At $3/MTok for Claude Sonnet input, that’s about $0.009–$0.015 per query. For 10,000 queries/day: ~$90–$150/day, before counting output tokens. Use a dedicated re-ranker model (Cohere Rerank, BGE-reranker) for high-volume systems, and Claude for lower-volume, higher-stakes applications.
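The arithmetic behind that estimate is easy to rerun with your own numbers. This is a small sketch that counts input tokens only, as the estimate above does; the function name is hypothetical, and the $3/MTok default is simply the figure quoted in the text, so check current pricing before relying on it.

```python
def daily_rerank_cost(tokens_per_query: int, queries_per_day: int, usd_per_mtok: float = 3.0) -> float:
    """Estimated daily spend on re-ranking input tokens, in US dollars."""
    cost_per_query = tokens_per_query / 1_000_000 * usd_per_mtok
    return cost_per_query * queries_per_day


low = daily_rerank_cost(3_000, 10_000)   # about 90 USD/day
high = daily_rerank_cost(5_000, 10_000)  # about 150 USD/day
```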
When using an LLM as a re-ranker or evaluator, same-session self-review creates confirmation bias — the model retains reasoning context from generation. For production RAG evaluation, use SEPARATE API calls (separate sessions) for generation and quality assessment. The exam tests whether you recognize that self-evaluation in the same context window is unreliable.
Query Transformation Strategies
Before: You walk into a library and ask the librarian “I need info about that thing where companies report their debts.” The librarian stares blankly — that could mean annual reports, credit filings, bankruptcy proceedings, or SEC disclosures.
The pain: With a vague question, the librarian might hand you a random book about corporate finance, wasting your time.
The mapping: A skilled librarian rephrases your question into three specific queries: “UCC-1 financing statement filings,” “commercial debt disclosure requirements,” and “secured transaction public records.” Each version captures a different angle, and the combined results cover what you actually needed. That’s what query transformation does for RAG.
What this actually looks like: Here is the output of each query transformation strategy applied to the same user question:
Query transformation reformulates the user’s query before retrieval to improve recall. Three main strategies exist:
- HyDE (Hypothetical Document Embeddings): Instead of embedding the raw query, ask Claude to generate a hypothetical answer to the query, then embed that answer. Since the hypothesis uses document-like language, its embedding lands closer to real relevant documents in vector space, making vector search more effective.
- Multi-Query: Generate 3–5 different reformulations of the original query, run retrieval for each, and union the results. This catches documents that match one phrasing but not another.
- Step-Back Prompting: Abstract the specific question to a broader one (“How does UCC filing #2024-FL-0093 affect debtor risk?” → “How do UCC filings affect debtor risk assessment?”), retrieve broader context first, then use it alongside the specific query.
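As an illustration of the multi-query strategy from the list above, here is a minimal sketch. Both `reformulate` (normally an LLM call) and `retrieve` (normally your search function) are passed in as plain callables; the stubs and document IDs below are hypothetical so the example runs without any API. Including the original question alongside its reformulations is an assumption of this sketch, not a requirement of the technique.

```python
from typing import Callable


def multi_query_retrieve(
    question: str,
    reformulate: Callable[[str], list[str]],
    retrieve: Callable[[str], list[str]],
) -> list[str]:
    """Run retrieval for the question and each reformulation, then union the results."""
    queries = [question] + reformulate(question)
    seen: set[str] = set()
    merged: list[str] = []
    for query in queries:
        for doc_id in retrieve(query):
            if doc_id not in seen:  # de-duplicate while preserving first-seen order
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


# Stubs standing in for the LLM and the search index (assumptions for the demo).
def fake_reformulate(question: str) -> list[str]:
    return ["credit default probability", "risk factors in secured lending"]


FAKE_INDEX = {
    "what is the debtor risk?": ["doc_a"],
    "credit default probability": ["doc_b", "doc_a"],
    "risk factors in secured lending": ["doc_c"],
}


def fake_retrieve(query: str) -> list[str]:
    return FAKE_INDEX.get(query.lower(), [])


merged = multi_query_retrieve("What is the debtor risk?", fake_reformulate, fake_retrieve)
```

A plain union treats every hit equally; if ordering matters, the per-query result lists can be fused by rank instead of concatenated.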
Google Research found that HyDE improves recall@20 by 20–40% on the TREC-DL benchmark compared to raw query embedding. Multi-query is especially valuable for ambiguous questions: in an internal eval at a legal document search company, multi-query reduced “no relevant results” errors from 18% to 4% of queries. The cost is one extra Claude API call per query for transformation (roughly 200–500 tokens), a trivial expense compared to the retrieval quality gain.
“HyDE always improves retrieval.” — If Claude generates a hypothesis that’s factually wrong or off-topic, embedding that hypothesis will retrieve wrong documents. HyDE works best for domain-specific queries where Claude can generate plausible-sounding text. For generic or highly ambiguous queries, the hypothesis may hurt more than it helps. Always fuse HyDE results with original query results via RRF rather than replacing the original query entirely.
“Multi-query is just sending the same question three times.” — No. The key is that each reformulation captures a different facet of the question. “What is the debtor risk?” might become “credit default probability for XYZ Corp,” “financial health indicators for UCC debtors,” and “risk factors in secured lending.” Each phrasing retrieves different documents, and the union covers blind spots that any single phrasing would miss.
“Step-back prompting loses specificity.” — That’s actually the point. You use the broad results as background context alongside the specific query, not as a replacement. The broad context helps Claude understand the general topic, while the specific query targets the exact answer.
You learned three strategies for fixing bad queries before they reach your search engine. HyDE generates a fake answer to embed (better vector matches). Multi-query tries multiple phrasings (broader recall). Step-back abstracts to a general question first (better context). All three are cheap LLM calls that dramatically improve retrieval quality.
Agentic RAG — When the Agent Decides What to Retrieve
Naive RAG (M09) is a vending machine: you put in a query, it spits out the top-k chunks, you eat what comes out. Hybrid + re-ranking + transformation (this module so far) is a better vending machine — smarter sorting, more selection — but still one-shot: one query in, one set of chunks out.
The pain shows up the moment a question can’t be answered from a single retrieval. “Compare Acme’s 2024 and 2025 quarterly filings” needs two retrievals — one per year — and then a synthesis step. “Find the clause that contradicts what we agreed in last week’s memo” needs to retrieve the memo first, then search for contradictions. A vending machine can’t do this. A research assistant can.
Agentic RAG hands retrieval over to the agent itself. The agent treats the retriever as a tool, calls it once, looks at what came back, decides whether to call it again with a refined query, and stops when it has enough — the M12 ReAct loop, but for retrieval. Same patterns you already know, applied to the search problem.
Agentic RAG is RAG where the agent itself orchestrates retrieval: it decides which index/corpus to query, how to phrase the query, whether to re-query after seeing partial results, and when it has enough context to stop. It contrasts with “passive RAG”, where retrieval is a fixed preprocessing step that runs before the LLM ever sees the question. Mechanically: search is registered as a tool (M05), and Claude decides whether and how to invoke it inside its normal tool-use loop.
Three properties distinguish it from the passive RAG of M09:
- Multi-step retrieval. The agent can retrieve once, read the chunks, decide it needs more, and retrieve again with a sharpened query. Passive RAG is one-shot by construction.
- Corpus routing. If you have multiple indices — product docs, support tickets, public filings, internal wikis — the agent picks which to hit based on the question. Passive RAG hits a single index it was wired to at build time.
- Self-stopping. The agent stops retrieving when it has enough to answer (or escalates when it never gets there). Passive RAG always retrieves exactly k chunks, whether the question needs 1 or 20.
It’s the M12 ReAct pattern applied to retrieval, with the same upsides (smarter, handles multi-hop questions, can use multiple corpora) and the same downsides (more LLM calls, more latency, harder to evaluate). Worth reaching for when single-shot retrieval keeps failing on the questions that matter; not worth reaching for when hybrid + re-rank already lands the right chunk on the first try.
Passive RAG (M09):
- Embed the question
- Retrieve top-k chunks (one index)
- Stuff chunks + question into prompt
- Claude answers from what was handed to it
LLM calls per question: 1. Latency: low. Failure mode: question needed info not in top-k => wrong answer.
Agentic RAG:
- Claude reads the question
- Claude picks an index, drafts a query, calls the search tool
- Claude reads the returned chunks
- If insufficient: refine query, hit a different index, or both, then loop
- If sufficient: stop and answer
- If never sufficient after N tries: escalate to human (M17)
LLM calls per question: 2-6. Latency: higher. Failure mode: agent loops; mitigation = step cap + eval.
In practice, “agentic RAG” is what you get when you take the M09 retriever, wrap it as a tool, and drop it into the M12 ReAct loop. Here’s a minimal working example over two corpora — product docs + support tickets — with the agent choosing per-call which to hit:
# Python version
from anthropic import Anthropic
import chromadb

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
chroma = chromadb.PersistentClient(path="./vector_db")

# Two corpora: the agent picks which to query per call.
INDICES = {
    "product_docs": chroma.get_collection("product_docs"),
    "support_tickets": chroma.get_collection("support_tickets"),
}

# === The retrieval tool. Claude decides corpus + query + k. =================
tools = [{
    "name": "search",
    "description": (
        "Search a corpus by semantic similarity. "
        "Use 'product_docs' for behavior/spec questions, "
        "'support_tickets' for known-issues / customer cases. "
        "Call multiple times if the first result is insufficient or off-topic."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "corpus": {"type": "string", "enum": list(INDICES)},
            "query": {"type": "string", "description": "Natural-language query"},
            "k": {"type": "integer", "minimum": 1, "maximum": 8, "default": 4},
        },
        "required": ["corpus", "query"],
    },
}]


def run_tool(name: str, args: dict) -> str:
    """Execute one tool call and return its result as text for Claude to read."""
    if name != "search":
        raise ValueError(f"Unknown tool: {name}")
    idx = INDICES.get(args["corpus"])
    if idx is None:
        return f"ERROR: unknown corpus '{args['corpus']}'"
    res = idx.query(query_texts=[args["query"]], n_results=args.get("k", 4))
    docs = res["documents"][0]
    return "\n---\n".join(f"[chunk {i+1}] {d}" for i, d in enumerate(docs)) or "(no hits)"


# === The agentic loop: standard M12 ReAct, just with `search` as the tool. ===
def agentic_rag(question: str, *, max_iters: int = 6) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_iters):
        resp = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            tools=tools,
            messages=messages,
        )
        # Any stop reason other than a tool request (end_turn, max_tokens, ...)
        # ends the loop; continuing would send an empty tool_result message.
        if resp.stop_reason != "tool_use":
            return "".join(b.text for b in resp.content if b.type == "text")

        # Echo the assistant turn back, then answer every tool_use block in it.
        messages.append({"role": "assistant", "content": resp.content})
        tool_results = []
        for block in resp.content:
            if block.type == "tool_use":
                try:
                    result = run_tool(block.name, block.input)
                except Exception as e:
                    # Report the failure to Claude instead of crashing the loop.
                    result = f"TOOL_ERROR: {e}"
                tool_results.append({
                    "type": "tool_result", "tool_use_id": block.id, "content": result,
                })
        messages.append({"role": "user", "content": tool_results})

    # Hit the step cap: escalate per M17 rather than guess.
    raise RuntimeError(f"Agentic RAG exceeded {max_iters} retrieval steps")
// TypeScript version
import Anthropic from "@anthropic-ai/sdk";
import { ChromaClient } from "chromadb";

const client = new Anthropic();
// Assumption: the JS client talks to a running Chroma server, so `path` is
// the server URL rather than a local directory. Adjust to your deployment.
const chroma = new ChromaClient({ path: "http://localhost:8000" });

// Two corpora: the agent picks which to query per call.
const INDICES: Record<string, any> = {
  product_docs: await chroma.getCollection({ name: "product_docs" }),
  support_tickets: await chroma.getCollection({ name: "support_tickets" }),
};

// === The retrieval tool. Claude decides corpus + query + k. =================
// The explicit type keeps `type: "object"` from widening to `string`.
const tools: Anthropic.Tool[] = [{
  name: "search",
  description:
    "Search a corpus by semantic similarity. " +
    "Use 'product_docs' for behavior/spec questions, " +
    "'support_tickets' for known-issues / customer cases. " +
    "Call multiple times if the first result is insufficient or off-topic.",
  input_schema: {
    type: "object",
    properties: {
      corpus: { type: "string", enum: Object.keys(INDICES) },
      query: { type: "string", description: "Natural-language query" },
      k: { type: "integer", minimum: 1, maximum: 8, default: 4 },
    },
    required: ["corpus", "query"],
  },
}];

async function runTool(name: string, args: any): Promise<string> {
  if (name !== "search") throw new Error(`Unknown tool: ${name}`);
  const idx = INDICES[args.corpus];
  if (!idx) return `ERROR: unknown corpus '${args.corpus}'`;
  const res = await idx.query({ queryTexts: [args.query], nResults: args.k ?? 4 });
  const docs: string[] = res.documents[0] ?? [];
  return docs.length
    ? docs.map((d, i) => `[chunk ${i + 1}] ${d}`).join("\n---\n")
    : "(no hits)";
}

// === The agentic loop: standard M12 ReAct, just with `search` as the tool. ===
export async function agenticRag(question: string, maxIters = 6): Promise<string> {
  const messages: any[] = [{ role: "user", content: question }];
  for (let step = 0; step < maxIters; step++) {
    const resp = await client.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 2048,
      tools,
      messages,
    });
    // Any stop reason other than a tool request (end_turn, max_tokens, ...)
    // ends the loop; continuing would send an empty tool_result message.
    if (resp.stop_reason !== "tool_use") {
      return resp.content
        .filter((b: any) => b.type === "text")
        .map((b: any) => b.text)
        .join("");
    }

    // Echo the assistant turn back, then answer every tool_use block in it.
    messages.push({ role: "assistant", content: resp.content });
    const toolResults: any[] = [];
    for (const block of resp.content as any[]) {
      if (block.type === "tool_use") {
        let result: string;
        try {
          result = await runTool(block.name, block.input);
        } catch (e: any) {
          // Report the failure to Claude instead of crashing the loop.
          result = `TOOL_ERROR: ${e.message}`;
        }
        toolResults.push({ type: "tool_result", tool_use_id: block.id, content: result });
      }
    }
    messages.push({ role: "user", content: toolResults });
  }
  // Hit the step cap: escalate per M17 rather than guess.
  throw new Error(`Agentic RAG exceeded ${maxIters} retrieval steps`);
}
You wrapped the M09 retriever as a tool and gave Claude the M12 ReAct loop. The only domain-specific lines are the corpus enum and the tool description — the loop is the same one you wrote in M12. The agent now decides per-question: which corpus to hit, what query to send, and whether the chunks it got back are sufficient to answer. The max_iters cap is the budget; if the agent can’t find enough after 6 retrieval steps, it raises — you escalate (M17), you don’t fabricate.
Use it when: questions span multiple corpora; questions are multi-hop (need to retrieve, read, then retrieve again with new info); single-shot retrieval keeps missing the right chunk and your re-ranker tuning has plateaued; users phrase the same question 20 different ways and a fixed query template can’t cover them all.
Skip it when: a hybrid search + good re-ranker already lands the right chunk on the first try (don’t pay for extra LLM calls you don’t need); latency budget is sub-second (each retrieval step is another model round-trip); your eval set is tiny and you can’t measure whether the extra calls actually improve answer quality.
Cost calibration on Sonnet 4.6: a 4-step agentic RAG question costs roughly 4× the LLM tokens of a 1-step passive RAG question, plus 4 retrieval round-trips. For a 5K-token question that’s ~$0.06 vs ~$0.015 (4×), with ~3-6 seconds added latency. Worth it for the questions that need it; wasteful otherwise.
“Agentic RAG replaces hybrid search and re-ranking.” — No. The agent uses them. Inside the search tool you should still be doing hybrid retrieval and cross-encoder re-ranking (this module’s earlier sections). Agentic RAG is the outer loop; hybrid + re-rank is the inner implementation.
“If passive RAG works, agentic RAG works better.” — Not automatically. For single-fact lookups, agentic RAG often under-performs because the agent second-guesses correct first-shot results. Measure on your eval set (M18) before switching.
“Self-RAG / Corrective RAG / Adaptive RAG are different things.” — They’re named papers; mechanically they’re all agentic RAG with different prompts and stop conditions. You can read them as reference, but you don’t need a separate framework for each — just adjust the tool description and step cap.
Contextual Compression & RAG Evaluation
Contextual Compression
Before: You’re studying for an exam and a friend hands you five full textbook pages that “contain the answer somewhere.”
The pain: You have to scan 5,000 words to find the three sentences that actually matter, and the surrounding text might confuse you into thinking irrelevant details are important.
The mapping: Contextual compression is your friend highlighting only the relevant sentences before handing you the pages. Claude reads each retrieved chunk, extracts only the parts that answer the query, and throws away the rest. Less noise means better answers and fewer tokens spent on generation.
What this actually looks like in practice: Here is a real before-and-after showing a 500-token chunk compressed down to just the relevant sentences. Notice that only 2 out of 8 sentences survive — the rest were background noise:
Contextual compression uses an LLM to extract or summarize only the query-relevant portions of each retrieved chunk. Here’s the idea: you give the LLM a query and a 500-token chunk. It reads both, identifies the sentences that actually help answer the question, and outputs a 50–150 token extract. Everything else gets thrown away.
This serves two purposes. First, it reduces the token count in the generation prompt, directly saving cost — fewer tokens in means fewer tokens billed. Second, it removes irrelevant context that might lead the generator model to hallucinate or get distracted by tangential information. Think of it as signal-to-noise cleanup: the retriever found the right document, but only 10% of that document is relevant to this specific query.
Compression comes in two flavors. Extractive compression selects exact sentences from the original — it copies them verbatim, preserving the source wording. Abstractive compression rewrites the content into a shorter summary. Extractive is generally safer for production systems because it preserves original wording. That matters when you need accurate citations and traceability back to source documents — if you summarize, you can’t point to an exact quote anymore.
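A minimal sketch of the extractive flavor follows. The relevance decision is passed in as `is_relevant`, which in a real pipeline would typically be an LLM judgment; `keyword_relevance` is a deliberately crude keyword-overlap stand-in (an assumption, not the technique itself) so the example runs offline. The sample chunk is invented.

```python
import re
from typing import Callable


def extractive_compress(query: str, chunk: str, is_relevant: Callable[[str, str], bool]) -> str:
    """Keep only the sentences judged relevant, copied verbatim from the chunk."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]
    kept = [s for s in sentences if is_relevant(query, s)]
    return " ".join(kept)


def keyword_relevance(query: str, sentence: str) -> bool:
    # Stand-in judge: does the sentence share any longer query word?
    q_terms = {t for t in re.findall(r"\w+", query.lower()) if len(t) > 3}
    s_terms = set(re.findall(r"\w+", sentence.lower()))
    return bool(q_terms & s_terms)


chunk = (
    "Acme Corp was founded in 1998. "
    "The filing lapses after five years unless a continuation is filed. "
    "Its headquarters are in Ohio. "
    "A continuation must be filed within six months before the lapse date."
)
compressed = extractive_compress("When does the filing lapse?", chunk, keyword_relevance)
```

Because every kept sentence is copied from the chunk unchanged, each one can still be quoted and traced back to its source.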
RAG Evaluation Metrics
You cannot improve what you cannot measure. RAG evaluation tells you exactly where your pipeline is succeeding and where it is failing. Without it, you are guessing — and in production, guessing costs real money and real user trust.
RAG evaluation works differently from traditional software testing. You are not checking whether code runs without errors. Instead, you are measuring the quality of search results and generated answers against human-judged ground truth. This means you need a test set: a collection of questions, the documents that correctly answer each question, and (optionally) the expected answer text. Building this test set is manual work, but even 10–20 well-labeled examples are enough to catch major pipeline regressions.
If you have used information retrieval systems before (like Elasticsearch or Solr), precision and recall will feel familiar. The new metric here is faithfulness, which is specific to RAG — it measures whether the LLM stuck to the retrieved context or invented facts on its own. Traditional search systems do not generate answers, so they never needed this metric.
RAG evaluation uses three core metrics. Each one answers a different question about your pipeline’s health:
- Precision: Of the documents you retrieved, what fraction were actually relevant? High precision means few irrelevant results in your top-k; precision@5 measures just the top 5 results. If you retrieve 5 chunks and 3 are relevant: precision = 3/5 = 60%.
- Recall: Of all the relevant documents in your entire corpus, what fraction did you actually retrieve? High recall means you’re not missing important documents, but it is hard to measure without labeled test data. If 4 documents are relevant and you found 3: recall = 3/4 = 75%.
- Faithfulness: Does the generated answer use only information from the retrieved context, without hallucinating? This is the hardest metric to measure and often requires LLM-based evaluation. If the answer contains 5 claims and all 5 are grounded in retrieved chunks: faithfulness = 100%.
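The three definitions above translate into a few lines of Python. This sketch assumes you already have labeled relevant document IDs and, for faithfulness, a per-claim supported/unsupported verdict from a human or an LLM judge; the function names and IDs are illustrative.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)


def recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of all relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def faithfulness(claim_supported: list[bool]) -> float:
    """Fraction of answer claims that are grounded in the retrieved context."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


retrieved = ["a", "b", "c", "d", "e"]  # 5 chunks returned by the retriever
relevant = {"a", "c", "e", "z"}        # 4 chunks labeled relevant; "z" was missed

p = precision_at_k(retrieved, relevant, k=5)  # 3 of 5 retrieved are relevant
r = recall(retrieved, relevant)               # 3 of 4 relevant were found
f = faithfulness([True] * 5)                  # all 5 claims are grounded
```

The sample data reproduces the worked numbers from the list: 60% precision, 75% recall, and 100% faithfulness.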
Without evaluation metrics, you’re flying blind. A team at a financial services company spent 3 weeks optimizing their embedding model only to discover (via evaluation) that their real bottleneck was chunking strategy — 40% of relevant paragraphs were split across two chunks, and neither half was retrievable alone. Precision tells you “am I retrieving junk?” Recall tells you “am I missing relevant docs?” Faithfulness tells you “is the LLM making things up?” Measuring all three tells you exactly which pipeline stage to fix.
Aggregate accuracy metrics (e.g., “85% retrieval accuracy”) can mask per-query-type failures. A RAG system might retrieve perfectly for factual lookups but fail on complex multi-hop questions. Track precision and recall per query type (factual, analytical, comparison, multi-hop) to catch hidden failure modes. The exam tests whether you know to stratify metrics rather than relying on a single average.
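One way to stratify is to tag each test query with a type and average per group. The sketch below assumes each per-query result is a dict with hypothetical keys "query_type", "precision", and "recall"; the scores are invented to show how a healthy-looking overall average can hide a weak category.

```python
from collections import defaultdict

def stratified_averages(results: list[dict]) -> dict[str, dict]:
    """Average precision and recall separately for each query type."""
    buckets: dict[str, list[dict]] = defaultdict(list)
    for row in results:
        buckets[row["query_type"]].append(row)

    report = {}
    for query_type, rows in buckets.items():
        report[query_type] = {
            "precision": sum(r["precision"] for r in rows) / len(rows),
            "recall": sum(r["recall"] for r in rows) / len(rows),
            "n": len(rows),  # group size, so small groups are easy to spot
        }
    return report

# Made-up per-query scores: strong on factual lookups, weak on multi-hop questions.
results = [
    {"query_type": "factual", "precision": 1.0, "recall": 1.0},
    {"query_type": "factual", "precision": 0.8, "recall": 1.0},
    {"query_type": "factual", "precision": 1.0, "recall": 1.0},
    {"query_type": "multi-hop", "precision": 0.2, "recall": 0.25},
]

overall_recall = sum(r["recall"] for r in results) / len(results)
report = stratified_averages(results)
print(f"overall recall: {overall_recall:.2f}")
for query_type, stats in report.items():
    print(query_type, stats)
```

Here the overall recall is above 0.8 even though the single multi-hop query recovers only a quarter of its relevant documents, which is the failure mode a single average hides.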
You learned two final pieces: compression (removing noise from retrieved chunks to improve generation quality and reduce cost) and evaluation (measuring precision, recall, and faithfulness to diagnose pipeline failures). Together with hybrid search, re-ranking, and query transformation, you now have a complete Advanced RAG toolkit.
Code Walkthrough: Building an Advanced RAG Pipeline
We’ll upgrade the M09 basic RAG system with four advanced patterns: BM25 keyword search, Reciprocal Rank Fusion, Claude-based re-ranking, and HyDE query transformation. Each builds on the last.
Step 1: BM25 Keyword Search & Hybrid Fusion
Let’s start by adding BM25 keyword search alongside the existing vector search from M09, then merging the results using Reciprocal Rank Fusion. The goal is simple: vector search alone misses exact term matches like invoice numbers, product codes, and proper nouns. BM25 catches those exact hits, and RRF merges both result lists without the headache of normalizing their incompatible score ranges.
Here’s the gotcha to watch for: BM25 requires tokenized text (split into words), and token matching is case-sensitive unless you normalize. If you forget to lowercase and strip punctuation, “Invoice” and “invoice” become different tokens, and your recall takes an unnecessary hit. The code below lowercases and splits on whitespace with .lower().split(), which is basic but effective for case. It leaves punctuation attached, though, so “invoice,” and “invoice” still count as different tokens.
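If punctuation turns out to matter for your corpus, a stricter tokenizer is a small change. The sketch below is an assumption, not part of the walkthrough's BM25 class: the tokenize helper is hypothetical, handles ASCII letters and digits only, and keeps internal hyphens so identifiers such as invoice numbers stay in one piece.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of letters/digits, allowing internal hyphens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

sample = "Invoice INV-0093, due."
naive = sample.lower().split()   # whitespace split: punctuation stays attached
strict = tokenize(sample)        # regex split: punctuation is dropped

print(naive)
print(strict)
```

Whichever tokenizer you pick, apply the same one when indexing documents and when scoring queries; mixing the two would reintroduce the mismatch this step is trying to avoid.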
import math
from collections import Counter
class BM25:
"""Simple BM25 implementation for keyword search."""
def __init__(self, documents: list[str], k1: float = 1.5, b: float = 0.75):
self.k1 = k1
self.b = b
self.docs = documents
self.doc_len = [len(d.split()) for d in documents]
self.avg_dl = sum(self.doc_len) / len(self.doc_len)
self.doc_count = len(documents)
# Build index statistics: term -> number of docs containing it, plus per-doc term counts
self.doc_freqs: dict[str, int] = {}
self.term_freqs: list[Counter] = []
for doc in documents:
tf = Counter(doc.lower().split())
self.term_freqs.append(tf)
for term in tf:
self.doc_freqs[term] = self.doc_freqs.get(term, 0) + 1
def score(self, query: str, top_k: int = 10) -> list[tuple[int, float]]:
"""Score all documents against query, return top-k (index, score) pairs."""
query_terms = query.lower().split()
scores = []
for i in range(self.doc_count):
s = 0.0
for term in query_terms:
if term not in self.doc_freqs:
continue
df = self.doc_freqs[term]
idf = math.log((self.doc_count - df + 0.5) / (df + 0.5) + 1)
tf = self.term_freqs[i].get(term, 0)
numerator = tf * (self.k1 + 1)
denominator = tf + self.k1 * (1 - self.b + self.b * self.doc_len[i] / self.avg_dl)
s += idf * numerator / denominator
scores.append((i, s))
scores.sort(key=lambda x: x[1], reverse=True)
return scores[:top_k]
def reciprocal_rank_fusion(
ranked_lists: list[list[tuple[int, float]]],
k: int = 60
) -> list[tuple[int, float]]:
"""Merge multiple ranked lists using RRF.
Each ranked_list is [(doc_index, score), ...] in descending order.
Returns fused list sorted by RRF score.
"""
fused_scores: dict[int, float] = {}
for ranked_list in ranked_lists:
for rank, (doc_idx, _) in enumerate(ranked_list):
fused_scores[doc_idx] = fused_scores.get(doc_idx, 0) + 1.0 / (k + rank + 1)
result = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
return result
def hybrid_search(
query: str,
bm25: BM25,
vector_store, # Your ChromaDB collection from M09
embedding_fn, # Function to embed text -> vector
top_k: int = 10,
bm25_weight: float = 0.5
) -> list[tuple[int, float]]:
"""Run hybrid search: BM25 + vector, fused with RRF."""
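vector_results: list[tuple[int, float]] = [] # Defined before try so the fallback below can always read it
try: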
# BM25 keyword search
bm25_results = bm25.score(query, top_k=top_k * 2)
# Vector similarity search (using ChromaDB from M09)
query_embedding = embedding_fn(query)
vector_results_raw = vector_store.query(
query_embeddings=[query_embedding],
n_results=top_k * 2
)
# Convert to (index, score) format
vector_results = [
(int(doc_id), 1.0 - dist) # Convert distance to similarity
for doc_id, dist in zip(
vector_results_raw["ids"][0],
vector_results_raw["distances"][0]
)
]
# Fuse with RRF
fused = reciprocal_rank_fusion([bm25_results, vector_results])
return fused[:top_k]
except Exception as e:
print(f"Hybrid search error: {e}")
# Fallback to vector-only search
return vector_results[:top_k] if vector_results else []
// bm25.ts — BM25 + Reciprocal Rank Fusion for hybrid search
interface ScoredDoc { index: number; score: number; }
class BM25 {
private k1: number;
private b: number;
private docLengths: number[];
private avgDl: number;
private docCount: number;
private docFreqs: Map<string, number> = new Map();
private termFreqs: Map<string, number>[];
constructor(documents: string[], k1 = 1.5, b = 0.75) {
this.k1 = k1;
this.b = b;
this.docCount = documents.length;
this.docLengths = documents.map(d => d.split(/\s+/).length);
this.avgDl = this.docLengths.reduce((a, b) => a + b, 0) / this.docCount;
// Build inverted index
this.termFreqs = documents.map(doc => {
const tf = new Map<string, number>();
for (const word of doc.toLowerCase().split(/\s+/)) {
tf.set(word, (tf.get(word) || 0) + 1);
}
for (const term of tf.keys()) {
this.docFreqs.set(term, (this.docFreqs.get(term) || 0) + 1);
}
return tf;
});
}
score(query: string, topK = 10): ScoredDoc[] {
const queryTerms = query.toLowerCase().split(/\s+/);
const scores: ScoredDoc[] = [];
for (let i = 0; i < this.docCount; i++) {
let s = 0;
for (const term of queryTerms) {
const df = this.docFreqs.get(term) || 0;
if (df === 0) continue;
const idf = Math.log((this.docCount - df + 0.5) / (df + 0.5) + 1);
const tf = this.termFreqs[i].get(term) || 0;
const numerator = tf * (this.k1 + 1);
const denominator = tf + this.k1 *
(1 - this.b + this.b * this.docLengths[i] / this.avgDl);
s += idf * numerator / denominator;
}
scores.push({ index: i, score: s });
}
return scores.sort((a, b) => b.score - a.score).slice(0, topK);
}
}
function reciprocalRankFusion(
rankedLists: ScoredDoc[][],
k = 60
): ScoredDoc[] {
const fusedScores = new Map<number, number>();
for (const list of rankedLists) {
list.forEach((doc, rank) => {
const current = fusedScores.get(doc.index) || 0;
fusedScores.set(doc.index, current + 1 / (k + rank + 1));
});
}
return Array.from(fusedScores.entries())
.map(([index, score]) => ({ index, score }))
.sort((a, b) => b.score - a.score);
}
async function hybridSearch(
query: string,
bm25: BM25,
vectorStore: any, // Your ChromaDB collection from M09
embedFn: (text: string) => Promise<number[]>,
topK = 10
): Promise<ScoredDoc[]> {
try {
// BM25 keyword search
const bm25Results = bm25.score(query, topK * 2);
// Vector similarity search
const queryEmbedding = await embedFn(query);
const vectorRaw = await vectorStore.query({
queryEmbeddings: [queryEmbedding],
nResults: topK * 2,
});
const vectorResults: ScoredDoc[] = vectorRaw.ids[0].map(
(id: string, i: number) => ({
index: parseInt(id),
score: 1 - vectorRaw.distances[0][i],
})
);
// Fuse with RRF
const fused = reciprocalRankFusion([bm25Results, vectorResults]);
return fused.slice(0, topK);
} catch (error) {
console.error("Hybrid search error:", error);
return []; // Fallback: empty results
}
}
You built a BM25 keyword search engine from scratch (no external library needed for the basics), then merged it with your existing vector search using Reciprocal Rank Fusion. The hybrid_search function runs both searches and fuses the results. If something fails, the Python version falls back to whatever vector results it already has, and the TypeScript version returns an empty list. Your retrieval now catches both exact terms and semantic matches.
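To see why RRF needs no score normalization, it helps to work one fusion by hand. The sketch below uses the same 1 / (k + rank + 1) formula as the walkthrough, but takes plain lists of document indices; the rankings are made up for illustration.

```python
def rrf(ranked_lists: list[list[int]], k: int = 60) -> dict[int, float]:
    """Sum 1 / (k + rank + 1) over every list a document appears in (rank starts at 0)."""
    scores: dict[int, float] = {}
    for ranked in ranked_lists:
        for rank, doc_idx in enumerate(ranked):
            scores[doc_idx] = scores.get(doc_idx, 0.0) + 1.0 / (k + rank + 1)
    return scores

# Made-up rankings. BM25 scores and cosine similarities live on different scales,
# but only the position of each document in its list is used here.
bm25_order = [7, 2, 5]      # doc 7 is the best keyword match
vector_order = [2, 9, 7]    # doc 2 is the best semantic match

scores = rrf([bm25_order, vector_order])
ranking = sorted(scores, key=scores.get, reverse=True)

for doc_idx in ranking:
    print(doc_idx, round(scores[doc_idx], 5))
```

Doc 2 scores 1/62 + 1/61 and doc 7 scores 1/61 + 1/63, so both outrank docs 9 and 5, which each appear in only one list. Appearing in both lists matters more than topping either one.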
Step 2: Claude-Based Re-Ranking
Now for the interesting part: after hybrid retrieval returns the top candidates, we ask Claude to score each document’s relevance to the query on a 1–10 scale, then re-sort by those scores. This is where the real quality jump happens. Claude understands nuance that cosine similarity and BM25 simply cannot capture. For example, a chunk about “filing amendments” might score high on vector similarity for a query about “new filings,” but Claude knows that amendments and new filings are fundamentally different things.
Here’s the dilemma with LLM-based re-ranking: Claude might give all documents similar scores (e.g., 7, 7, 8, 7), which defeats the purpose of re-ranking. The trick is to add “use the full 1-10 scale” to the scoring prompt, and optionally to require a brief justification for each score. This forces more discriminative scoring: Claude has to think about why one document deserves a 9 while another only deserves a 3. The prompt in the code below uses the full-scale instruction only and asks for scores without justifications.
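A variant of the scoring prompt that also requests a justification might look like the sketch below. The helper names build_scoring_prompt and parse_scores and the "reason" field are assumptions made for illustration. The sample response is hand-written in the requested shape, so the parser can be exercised without an API call.

```python
import json

def build_scoring_prompt(query: str, documents: list[dict]) -> str:
    """Build a re-ranking prompt that asks for a short reason with every score."""
    doc_list = "\n\n".join(
        f"[Document {i + 1}] (ID: {doc['id']})\n{doc['text'][:500]}"
        for i, doc in enumerate(documents)
    )
    return (
        "Score each document's relevance to the query on a scale of 1-10.\n"
        "Use the FULL scale: 1=completely irrelevant, 10=directly answers the query.\n"
        f"Query: {query}\n\n{doc_list}\n\n"
        'Respond with ONLY a JSON array of objects with "doc_id", "score", '
        'and a one-sentence "reason" field.'
    )

def parse_scores(raw: str) -> dict[str, int]:
    """Map doc_id -> score; the "reason" text is ignored after parsing."""
    return {item["doc_id"]: item["score"] for item in json.loads(raw)}

# Hand-written response in the requested shape (not real model output).
sample = (
    '[{"doc_id": "a", "score": 9, "reason": "Covers new filings."}, '
    '{"doc_id": "b", "score": 3, "reason": "About amendments."}]'
)
print(parse_scores(sample))
```

The reasons cost extra output tokens, so it is worth checking on your own test set whether they actually spread the scores before keeping them.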
import anthropic
import json
client = anthropic.Anthropic() # reads ANTHROPIC_API_KEY env var
async def rerank_with_claude(
query: str,
documents: list[dict], # [{"id": "...", "text": "..."}, ...]
top_k: int = 5,
timeout: float = 30.0
) -> list[dict]:
"""Re-rank documents using Claude as a cross-encoder.
Returns top_k documents sorted by Claude's relevance score.
"""
if not documents:
return []
# Build the scoring prompt
doc_list = "\n\n".join(
f"[Document {i+1}] (ID: {doc['id']})\n{doc['text'][:500]}"
for i, doc in enumerate(documents)
)
try:
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
timeout=timeout, # Apply the caller's per-request timeout so APITimeoutError can fire
messages=[{
"role": "user",
"content": f"""Score each document's relevance to the query on a scale of 1-10.
Use the FULL scale: 1=completely irrelevant, 5=somewhat related, 10=directly answers the query.
Query: {query}
{doc_list}
Respond with ONLY a JSON array of objects, each with "doc_id" and "score" fields.
Example: [{{"doc_id": "abc", "score": 8}}, {{"doc_id": "def", "score": 3}}]"""
}]
)
# Parse Claude's scoring response
scores = json.loads(response.content[0].text)
# Build lookup and sort
score_map = {s["doc_id"]: s["score"] for s in scores}
ranked = sorted(
documents,
key=lambda d: score_map.get(d["id"], 0),
reverse=True
)
return ranked[:top_k]
except json.JSONDecodeError as e:
print(f"Re-rank parse error: {e}. Returning original order.")
return documents[:top_k]
except anthropic.APITimeoutError:
print("Re-rank timeout. Returning original order.")
return documents[:top_k]
except Exception as e:
print(f"Re-rank error: {e}. Returning original order.")
return documents[:top_k]
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic(); // reads ANTHROPIC_API_KEY env var
interface Document { id: string; text: string; }
interface RankedDoc extends Document { score: number; }
async function rerankWithClaude(
query: string,
documents: Document[],
topK = 5
): Promise<RankedDoc[]> {
if (documents.length === 0) return [];
const docList = documents
.map((doc, i) =>
`[Document ${i + 1}] (ID: ${doc.id})\n${doc.text.slice(0, 500)}`
)
.join("\n\n");
try {
const response = await client.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 1024,
messages: [{
role: "user",
content: `Score each document's relevance to the query on a scale of 1-10.
Use the FULL scale: 1=completely irrelevant, 5=somewhat related, 10=directly answers the query.
Query: ${query}
${docList}
Respond with ONLY a JSON array of objects, each with "doc_id" and "score" fields.
Example: [{"doc_id": "abc", "score": 8}, {"doc_id": "def", "score": 3}]`,
}],
});
const text = response.content[0].type === "text"
? response.content[0].text : "";
const scores: { doc_id: string; score: number }[] = JSON.parse(text);
const scoreMap = new Map(scores.map(s => [s.doc_id, s.score]));
const ranked: RankedDoc[] = documents
.map(d => ({ ...d, score: scoreMap.get(d.id) || 0 }))
.sort((a, b) => b.score - a.score);
return ranked.slice(0, topK);
} catch (error) {
if (error instanceof SyntaxError) {
console.error("Re-rank parse error:", error.message);
} else {
console.error("Re-rank error:", error);
}
// Fallback: return original order with score 0
return documents.slice(0, topK).map(d => ({ ...d, score: 0 }));
}
}
Step 3: HyDE Query Transformation
This next technique is clever. Before we even search, we ask Claude to generate a hypothetical answer to the user’s question, then embed that answer instead of the raw query. Why does this help? The user’s short question (“What’s the risk for filing 0093?”) uses question language, but documents use statement language (“The debtor’s credit profile indicates elevated default probability...”). A hypothetical answer bridges that gap — its embedding lands closer to relevant chunks because it sounds like the documents themselves.
The risk with HyDE is that if the hypothetical answer is wrong or off-topic, you’ll retrieve irrelevant documents. That’s why the code below treats HyDE as an additional input to Reciprocal Rank Fusion, not a replacement for the original query. If the hypothesis is good, it boosts retrieval. If it’s bad, the original query’s results still dominate.
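The robustness claim can be illustrated with a toy fusion before looking at the full pipeline. In this sketch the rankings are invented: BM25 and vector search agree on the same three documents, while an off-topic hypothesis retrieves three unrelated ones.

```python
def rrf_rank(ranked_lists: list[list[int]], k: int = 60) -> list[int]:
    """Return doc indices ordered by summed reciprocal-rank score."""
    scores: dict[int, float] = {}
    for ranked in ranked_lists:
        for rank, doc_idx in enumerate(ranked):
            scores[doc_idx] = scores.get(doc_idx, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Made-up rankings for one query.
bm25_order = [4, 1, 8]
vector_order = [1, 8, 4]
bad_hyde_order = [30, 31, 32]   # an off-topic hypothesis retrieves unrelated chunks

fused = rrf_rank([bm25_order, vector_order, bad_hyde_order])
print(fused)
```

Documents backed by two lists outrank everything the bad hypothesis contributed. The unrelated chunks are outvoted rather than removed, which is one reason a re-ranking pass after fusion is still useful.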
async def hyde_transform(
query: str,
domain_context: str = "UCC filings and commercial lending"
) -> str:
"""Generate a hypothetical document that answers the query.
The hypothesis is embedded instead of the raw query,
producing better vector matches against real documents.
"""
try:
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=300,
messages=[{
"role": "user",
"content": f"""Write a short paragraph (3-5 sentences) that would appear in a
document about {domain_context} and directly answers this question:
Question: {query}
Write as if this paragraph is from an actual document, not as a response
to a user. Use technical language appropriate for the domain.
Do NOT start with "Here is" or similar preamble."""
}]
)
return response.content[0].text
except Exception as e:
print(f"HyDE generation failed: {e}. Using original query.")
return query
async def advanced_rag_query(
query: str,
bm25: BM25,
vector_store,
embedding_fn,
documents: list[dict],
top_k: int = 5
) -> list[dict]:
"""Full advanced RAG pipeline: HyDE + Hybrid Search + Re-rank."""
# Step 1: Generate hypothetical answer for better embedding
hypothesis = await hyde_transform(query)
# Step 2: Run hybrid search with BOTH original query and hypothesis
original_results = hybrid_search(query, bm25, vector_store, embedding_fn, top_k=top_k * 2)
hyde_embedding = embedding_fn(hypothesis)
hyde_results_raw = vector_store.query(
query_embeddings=[hyde_embedding], n_results=top_k * 2
)
hyde_results = [
(int(doc_id), 1.0 - dist)
for doc_id, dist in zip(
hyde_results_raw["ids"][0], hyde_results_raw["distances"][0]
)
]
# Step 3: Fuse three ranked lists (BM25, hybrid BM25+vector, HyDE-vector); BM25 evidence counts twice here
bm25_only = bm25.score(query, top_k=top_k * 2)
all_fused = reciprocal_rank_fusion([
bm25_only,
[(idx, sc) for idx, sc in original_results],
hyde_results
])
# Step 4: Gather top candidates as documents
candidate_docs = []
seen = set()
for doc_idx, _ in all_fused[:top_k * 2]:
if doc_idx not in seen and doc_idx < len(documents):
candidate_docs.append(documents[doc_idx])
seen.add(doc_idx)
# Step 5: Re-rank with Claude
reranked = await rerank_with_claude(query, candidate_docs, top_k=top_k)
return reranked
async function hydeTransform(
query: string,
domainContext = "UCC filings and commercial lending"
): Promise<string> {
try {
const response = await client.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 300,
messages: [{
role: "user",
content: `Write a short paragraph (3-5 sentences) that would appear in a
document about ${domainContext} and directly answers this question:
Question: ${query}
Write as if this paragraph is from an actual document, not as a response
to a user. Use technical language appropriate for the domain.
Do NOT start with "Here is" or similar preamble.`,
}],
});
const text = response.content[0].type === "text"
? response.content[0].text : query;
return text;
} catch (error) {
console.error("HyDE generation failed:", error);
return query; // Fallback to original query
}
}
async function advancedRagQuery(
query: string,
bm25: BM25,
vectorStore: any,
embedFn: (text: string) => Promise<number[]>,
documents: Document[],
topK = 5
): Promise<RankedDoc[]> {
// Step 1: HyDE — generate hypothetical answer
const hypothesis = await hydeTransform(query);
// Step 2: Hybrid search with original query
const originalResults = await hybridSearch(
query, bm25, vectorStore, embedFn, topK * 2
);
// Step 3: Vector search with HyDE embedding
const hydeEmbedding = await embedFn(hypothesis);
const hydeRaw = await vectorStore.query({
queryEmbeddings: [hydeEmbedding],
nResults: topK * 2,
});
const hydeResults: ScoredDoc[] = hydeRaw.ids[0].map(
(id: string, i: number) => ({
index: parseInt(id),
score: 1 - hydeRaw.distances[0][i],
})
);
// Step 4: Fuse all three result sets
const bm25Only = bm25.score(query, topK * 2);
const allFused = reciprocalRankFusion([
bm25Only, originalResults, hydeResults
]);
// Step 5: Gather candidates and re-rank
const seen = new Set<number>();
const candidates: Document[] = [];
for (const { index } of allFused.slice(0, topK * 2)) {
if (!seen.has(index) && index < documents.length) {
candidates.push(documents[index]);
seen.add(index);
}
}
return rerankWithClaude(query, candidates, topK);
}
You wired together the complete Advanced RAG pipeline: HyDE generates a hypothesis for better embedding, hybrid search combines BM25 + vector + HyDE results via RRF, and Claude re-ranks the top candidates by true relevance. The advanced_rag_query function orchestrates all five steps. Error handling ensures each stage falls back gracefully — a HyDE failure uses the original query, a re-rank timeout keeps the original order.
Step 4: RAG Evaluation Harness
Now let’s build something that will save you hours of guessing: a simple evaluation framework that measures precision, recall, and faithfulness across a test set of questions with known answers. Without these metrics, you genuinely cannot tell if your advanced pipeline actually improved things. Maybe re-ranking helps for your data; maybe it adds latency without improving quality. The numbers will tell you — and they often surprise you.
The hardest part of evaluation is not the code — it’s creating the ground-truth labels. Someone (usually you) has to manually decide which documents are relevant for each test query. This is tedious but essential. Start with 10–20 test cases and label them carefully. More test cases are not better if the labels are sloppy — a small, accurate test set beats a large, noisy one every time.
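A cheap guard against sloppy labels is to sanity-check the test set before running any evaluation. The sketch below is an assumption about how such a check could look: find_label_problems is a hypothetical helper, and the chunk IDs and queries are invented. It flags labels that point at IDs missing from the corpus, cases with no labels, and duplicate queries.

```python
def find_label_problems(test_cases: list[dict], corpus_ids: set[str]) -> list[str]:
    """Return human-readable problems found in a hand-labeled test set."""
    problems = []
    seen_queries = set()
    for i, case in enumerate(test_cases):
        query = case["query"].strip().lower()
        if not case["relevant_doc_ids"]:
            problems.append(f"case {i}: no relevant documents labeled")
        for doc_id in case["relevant_doc_ids"]:
            if doc_id not in corpus_ids:
                problems.append(f"case {i}: unknown doc id {doc_id!r}")
        if query in seen_queries:
            problems.append(f"case {i}: duplicate query")
        seen_queries.add(query)
    return problems

corpus_ids = {"chunk_0", "chunk_1", "chunk_2"}
test_cases = [
    {"query": "How long is a UCC-1 effective?", "relevant_doc_ids": ["chunk_1"]},
    {"query": "What is a UCC-3?", "relevant_doc_ids": ["chunk_7"]},   # label typo
    {"query": "how long is a UCC-1 effective?", "relevant_doc_ids": []},
]

problems = find_label_problems(test_cases, corpus_ids)
for problem in problems:
    print(problem)
```

A check like this catches mechanical mistakes only. Whether a labeled document is truly relevant still needs a human read.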
The code below has three parts. First, a TestCase data class holds a query plus its ground truth: the IDs of the relevant documents and the key phrases a good answer should contain. Second, evaluate_retrieval computes precision and recall by comparing what the system retrieved against what it should have retrieved. Third, evaluate_faithfulness asks Claude to check whether the generated answer actually sticks to the retrieved context or invents facts. Together, these three pieces tell you exactly where your pipeline is strong and where it’s leaking quality.
from dataclasses import dataclass
@dataclass
class TestCase:
query: str
relevant_doc_ids: list[str] # Ground truth: which docs answer this
expected_answer_fragments: list[str] # Key phrases the answer should contain
@dataclass
class EvalResult:
precision: float # Retrieved relevant / total retrieved
recall: float # Retrieved relevant / total relevant
faithfulness: float # Claims grounded in context / total claims
def evaluate_retrieval(
retrieved_ids: list[str],
relevant_ids: list[str],
k: int = 5
) -> tuple[float, float]:
"""Compute precision@k and recall@k."""
top_k = set(retrieved_ids[:k])
relevant = set(relevant_ids)
hits = len(top_k & relevant)
precision = hits / k if k > 0 else 0.0
recall = hits / len(relevant) if relevant else 0.0
return precision, recall
async def evaluate_faithfulness(
answer: str,
retrieved_chunks: list[str]
) -> float:
"""Use Claude to check if the answer is grounded in retrieved context."""
context = "\n---\n".join(retrieved_chunks)
try:
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=512,
messages=[{
"role": "user",
"content": f"""Evaluate whether the answer below is fully supported by the context.
For each factual claim in the answer, check if it appears in the context.
Context:
{context}
Answer:
{answer}
Respond with JSON: {{"total_claims": N, "grounded_claims": M, "ungrounded": ["list of claims not in context"]}}"""
}]
)
result = json.loads(response.content[0].text)
total = result.get("total_claims", 1)
grounded = result.get("grounded_claims", 0)
return grounded / total if total > 0 else 0.0
except Exception as e:
print(f"Faithfulness eval error: {e}")
return 0.0
async def run_evaluation(
test_cases: list[TestCase],
search_fn, # Function: query -> (retrieved_doc_ids, answer, chunks)
k: int = 5
) -> dict:
"""Run full evaluation across test cases."""
precisions, recalls, faiths = [], [], []
for tc in test_cases:
retrieved_ids, answer, chunks = await search_fn(tc.query)
p, r = evaluate_retrieval(retrieved_ids, tc.relevant_doc_ids, k)
f = await evaluate_faithfulness(answer, chunks)
precisions.append(p)
recalls.append(r)
faiths.append(f)
print(f" Query: {tc.query[:50]}... P={p:.2f} R={r:.2f} F={f:.2f}")
return {
"avg_precision": sum(precisions) / len(precisions),
"avg_recall": sum(recalls) / len(recalls),
"avg_faithfulness": sum(faiths) / len(faiths),
"n_queries": len(test_cases)
}
interface TestCase {
query: string;
relevantDocIds: string[];
expectedFragments: string[];
}
interface EvalResult {
precision: number;
recall: number;
faithfulness: number;
}
function evaluateRetrieval(
retrievedIds: string[],
relevantIds: string[],
k = 5
): { precision: number; recall: number } {
const topK = new Set(retrievedIds.slice(0, k));
const relevant = new Set(relevantIds);
let hits = 0;
for (const id of topK) {
if (relevant.has(id)) hits++;
}
return {
precision: k > 0 ? hits / k : 0,
recall: relevant.size > 0 ? hits / relevant.size : 0,
};
}
async function evaluateFaithfulness(
answer: string,
retrievedChunks: string[]
): Promise<number> {
const context = retrievedChunks.join("\n---\n");
try {
const response = await client.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 512,
messages: [{
role: "user",
content: `Evaluate whether the answer below is fully supported by the context.
For each factual claim in the answer, check if it appears in the context.
Context:
${context}
Answer:
${answer}
Respond with JSON: {"total_claims": N, "grounded_claims": M, "ungrounded": ["list"]}`,
}],
});
const text = response.content[0].type === "text"
? response.content[0].text : "{}";
const result = JSON.parse(text);
const total = result.total_claims || 1;
const grounded = result.grounded_claims || 0;
return total > 0 ? grounded / total : 0;
} catch (error) {
console.error("Faithfulness eval error:", error);
return 0;
}
}
async function runEvaluation(
testCases: TestCase[],
searchFn: (q: string) => Promise<{ids: string[]; answer: string; chunks: string[]}>,
k = 5
): Promise<{avgPrecision: number; avgRecall: number; avgFaithfulness: number}> {
const precisions: number[] = [];
const recalls: number[] = [];
const faiths: number[] = [];
for (const tc of testCases) {
const { ids, answer, chunks } = await searchFn(tc.query);
const { precision, recall } = evaluateRetrieval(ids, tc.relevantDocIds, k);
const faith = await evaluateFaithfulness(answer, chunks);
precisions.push(precision);
recalls.push(recall);
faiths.push(faith);
console.log(` ${tc.query.slice(0, 50)}... P=${precision.toFixed(2)} R=${recall.toFixed(2)} F=${faith.toFixed(2)}`);
}
const avg = (arr: number[]) => arr.reduce((a, b) => a + b, 0) / arr.length;
return {
avgPrecision: avg(precisions),
avgRecall: avg(recalls),
avgFaithfulness: avg(faiths),
};
}
You built a complete evaluation framework. evaluate_retrieval computes precision and recall by comparing retrieved doc IDs against ground truth. evaluate_faithfulness uses Claude to check whether the generated answer is grounded in the retrieved context. run_evaluation runs both across a test set and reports averages. Now you can objectively compare naive RAG vs. your advanced pipeline.
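The TestCase class also carries expected_answer_fragments, which the harness above never reads. One possible use, sketched here as an assumption rather than part of the original framework, is a simple coverage score: the fraction of expected key phrases that appear in the generated answer. The fragment_coverage helper and the sample answer are hypothetical.

```python
def fragment_coverage(answer: str, expected_fragments: list[str]) -> float:
    """Fraction of expected key phrases found in the answer (case-insensitive)."""
    if not expected_fragments:
        return 0.0  # nothing to check against
    answer_lower = answer.lower()
    found = sum(1 for frag in expected_fragments if frag.lower() in answer_lower)
    return found / len(expected_fragments)

# Made-up answer text and expected phrases for one test case.
answer = "A UCC-1 is effective for five years and lapses unless a continuation is filed."
coverage = fragment_coverage(answer, ["five years", "continuation", "six months"])
print(f"{coverage:.2f}")
```

Substring matching is crude, since paraphrases score as misses. Treat this as a quick completeness signal alongside faithfulness, not a replacement for it.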
Hands-On Exercise
What You'll Build
An advanced RAG pipeline that upgrades the M09 basic system with BM25 keyword search, Reciprocal Rank Fusion, and Claude-based re-ranking — then measures the improvement with an evaluation framework.
Time Estimate: 45–60 minutes | Prerequisites: Working M09 RAG pipeline with docs/ folder | Files You'll Create: advanced_rag.py
Environment Setup
This lab builds on the M09 RAG lab. You need the rag-lab/ directory with the docs/ folder containing sample UCC filing documents. If you skipped M09, create the folder and add 2–3 .md files about UCC filings.
# Navigate to your M09 rag-lab directory
cd rag-lab
# Activate virtual environment
source venv/bin/activate # macOS/Linux
# venv\Scripts\activate # Windows
# Install dependencies (if not already installed)
pip install "anthropic>=0.30.0" "chromadb>=0.4.0" # Quotes stop the shell from treating > as a redirect
# Set your API key
export ANTHROPIC_API_KEY=your-key-here # macOS/Linux
# set ANTHROPIC_API_KEY=your-key-here # Windows CMD
# $env:ANTHROPIC_API_KEY="your-key-here" # Windows PowerShell
- ModuleNotFoundError: No module named 'chromadb' → Run pip install chromadb. If it fails, try pip install --upgrade pip first.
- ModuleNotFoundError: No module named 'anthropic' → Run pip install anthropic.
- AuthenticationError → Check that ANTHROPIC_API_KEY is set in your current shell session.
Step 1: Build the Advanced RAG Pipeline
What & Why: This step adds three advanced patterns on top of the M09 base: BM25 keyword search for exact term matching, RRF fusion to merge both result sets, and Claude-based re-ranking to sort by true relevance. It uses the same docs/ folder and ChromaDB setup from M09. By the end of this step, you’ll have a single Python file that runs the complete advanced pipeline end-to-end.
Create a new file called advanced_rag.py in your rag-lab/ directory and add the following:
import os
import glob
import json
import math
from collections import Counter
import anthropic
import chromadb
client = anthropic.Anthropic()
# ── Document Loading & Chunking (from M09) ──────────────────
def load_documents(docs_dir: str) -> list[dict]:
docs = []
for pattern in ["*.md", "*.txt"]:
for path in glob.glob(os.path.join(docs_dir, pattern)):
try:
with open(path, "r", encoding="utf-8") as f:
docs.append({"content": f.read(), "source": os.path.basename(path)})
except (IOError, UnicodeDecodeError):
pass
if not docs:
raise FileNotFoundError(f"No documents found in {docs_dir}")
return docs
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
if len(text) <= chunk_size:
return [text]
chunks, start = [], 0
while start < len(text):
end = start + chunk_size
chunk = text[start:end]
if end < len(text):
for sep in ["\n\n", "\n", ". ", " "]:
last_sep = chunk.rfind(sep)
if last_sep > chunk_size * 0.5:
end = start + last_sep + len(sep)
chunk = text[start:end]
break
chunks.append(chunk.strip())
start = end - overlap
return [c for c in chunks if c]
# ── BM25 Keyword Search ─────────────────────────────────────
class BM25:
def __init__(self, documents: list[str], k1=1.5, b=0.75):
self.k1, self.b = k1, b
self.docs = documents
self.doc_len = [len(d.split()) for d in documents]
self.avg_dl = sum(self.doc_len) / len(self.doc_len)
self.doc_count = len(documents)
self.doc_freqs = {}
self.term_freqs = []
for doc in documents:
tf = Counter(doc.lower().split())
self.term_freqs.append(tf)
for term in tf:
self.doc_freqs[term] = self.doc_freqs.get(term, 0) + 1
def score(self, query: str, top_k: int = 10) -> list[tuple[int, float]]:
query_terms = query.lower().split()
scores = []
for i in range(self.doc_count):
s = 0.0
for term in query_terms:
if term not in self.doc_freqs:
continue
df = self.doc_freqs[term]
idf = math.log((self.doc_count - df + 0.5) / (df + 0.5) + 1)
tf = self.term_freqs[i].get(term, 0)
s += idf * tf * (self.k1 + 1) / (tf + self.k1 * (1 - self.b + self.b * self.doc_len[i] / self.avg_dl))
scores.append((i, s))
scores.sort(key=lambda x: x[1], reverse=True)
return scores[:top_k]
# ── Reciprocal Rank Fusion ───────────────────────────────────
def rrf_fuse(ranked_lists: list[list[tuple[int, float]]], k: int = 60) -> list[tuple[int, float]]:
fused = {}
for ranked_list in ranked_lists:
for rank, (doc_idx, _) in enumerate(ranked_list):
fused[doc_idx] = fused.get(doc_idx, 0) + 1.0 / (k + rank + 1)
return sorted(fused.items(), key=lambda x: x[1], reverse=True)
# ── Claude-Based Re-Ranking ─────────────────────────────────
def rerank_with_claude(query: str, chunks: list[dict], top_n: int = 5) -> list[dict]:
if not chunks:
return []
docs_text = "\n".join(f"[Doc {i+1}] {c['text'][:300]}" for i, c in enumerate(chunks))
prompt = (
f"Query: {query}\n\nDocuments:\n{docs_text}\n\n"
f"Score each document 1-10 for relevance to the query. "
f"1 = completely irrelevant, 10 = directly answers the question.\n"
f"Output ONLY the raw JSON array, no markdown fences, no commentary.\n"
f"Format: [{{\"doc\": 1, \"score\": 8.5}}, ...]"
)
try:
response = client.messages.create(
model="claude-sonnet-4-6", max_tokens=512,
messages=[{"role": "user", "content": prompt}],
)
raw = response.content[0].text.strip()
# Strip common markdown code fences if Claude wraps the JSON
if raw.startswith("```"):
raw = raw.split("```")[1]
if raw.startswith("json"):
raw = raw[4:]
raw = raw.strip()
scores = json.loads(raw)
scored = []
for s in scores:
idx = s["doc"] - 1
if 0 <= idx < len(chunks):
chunks[idx]["rerank_score"] = s["score"]
scored.append(chunks[idx])
scored.sort(key=lambda x: x.get("rerank_score", 0), reverse=True)
return scored[:top_n]
except Exception as e:
print(f" Re-ranking failed: {e}, using original order")
return chunks[:top_n]
# ── Full Advanced Pipeline ───────────────────────────────────
def ingest_and_build(docs_dir="./docs"):
"""Load, chunk, store in ChromaDB, build BM25 index."""
docs = load_documents(docs_dir)
all_chunks = []
for doc in docs:
for i, chunk in enumerate(chunk_text(doc["content"])):
all_chunks.append({"text": chunk, "source": doc["source"], "index": i})
chroma = chromadb.Client()
collection = chroma.get_or_create_collection("advanced_rag", metadata={"hnsw:space": "cosine"})
collection.add(
ids=[f"chunk_{i}" for i in range(len(all_chunks))],
documents=[c["text"] for c in all_chunks],
metadatas=[{"source": c["source"], "index": c["index"]} for c in all_chunks],
)
bm25 = BM25([c["text"] for c in all_chunks])
print(f" Loaded {len(docs)} docs, {len(all_chunks)} chunks")
return collection, bm25, all_chunks


def advanced_query(query, collection, bm25, all_chunks, verbose=True):
    """Hybrid search + RRF + re-ranking + Claude generation."""
    # 1. BM25 search
    bm25_results = bm25.score(query, top_k=10)
    # 2. Vector search
    vector_raw = collection.query(query_texts=[query], n_results=10)
    vector_results = [(int(doc_id.split("_")[1]), 1.0 - dist)
                      for doc_id, dist in zip(vector_raw["ids"][0], vector_raw["distances"][0])]
    # 3. RRF fusion
    fused = rrf_fuse([bm25_results, vector_results])
    fused_chunks = [all_chunks[idx] for idx, _ in fused[:10]]
    if verbose:
        print(f"\n Query: {query}")
        print(f" BM25 top-3: {[all_chunks[i]['source'] for i, _ in bm25_results[:3]]}")
        print(f" Vector top-3: {[all_chunks[i]['source'] for i, _ in vector_results[:3]]}")
        print(f" Fused top-3: {[all_chunks[i]['source'] for i, _ in fused[:3]]}")
    # 4. Re-rank top candidates
    reranked = rerank_with_claude(query, fused_chunks, top_n=3)
    if verbose and reranked:
        print(f" Re-ranked top-3: {[(c['source'], c.get('rerank_score', '?')) for c in reranked]}")
    # 5. Generate answer
    context = "\n\n---\n\n".join(
        f"[Source {i+1}: {c['source']}]\n{c['text']}" for i, c in enumerate(reranked)
    )
    response = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=1024,
        system="Answer based ONLY on the provided context. Cite sources using [Source N].",
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}],
    )
    return response.content[0].text


# ── Run Tests ────────────────────────────────────────────────
if __name__ == "__main__":
    print("── Advanced RAG Pipeline ──")
    collection, bm25, all_chunks = ingest_and_build()
    questions = [
        "What are the high risk indicators for a UCC filing?",
        "How long is a UCC-1 effective and what happens when it lapses?",
        "What is the difference between a UCC-1 and a UCC-3?",
    ]
    for q in questions:
        answer = advanced_query(q, collection, bm25, all_chunks)
        print(f"\n Answer: {answer[:200]}...")
        print(f" {'═' * 50}")
Run the advanced pipeline:
Look for these key behaviors:
- BM25 vs Vector: BM25 and vector search should return different orderings — BM25 favors exact keyword matches, vector favors semantic similarity
- Fused results: Should combine the best of both lists (documents appearing in both get boosted)
- Re-ranking: Should promote the most directly relevant document to position #1 (check the scores)
- Cited answers: Claude's responses should include [Source N] citations
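To see why documents appearing in both lists get boosted, here is a minimal sketch of Reciprocal Rank Fusion. It is not the lab's own rrf_fuse; the name rrf_fuse_sketch, the constant k=60 (a commonly used default), and the sample hits are assumptions for illustration only.

```python
from collections import defaultdict


def rrf_fuse_sketch(result_lists, k=60):
    """Fuse ranked lists of (chunk_index, score) pairs with RRF.

    Only each item's rank is used; raw scores are ignored, so BM25 scores
    and cosine similarities can be combined without normalization.
    """
    fused = defaultdict(float)
    for results in result_lists:
        for rank, (chunk_index, _score) in enumerate(results, start=1):
            fused[chunk_index] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical hits: chunk 4 is 1st in BM25 and 2nd in vector search
bm25_hits = [(4, 7.2), (1, 5.9), (6, 3.1)]
vector_hits = [(2, 0.91), (4, 0.88), (5, 0.80)]

fused = rrf_fuse_sketch([bm25_hits, vector_hits])
for chunk_index, score in fused:
    print(f"chunk {chunk_index}: {score:.5f}")
```

Chunk 4 collects a contribution from each list (1/61 + 1/62), so it outranks chunk 2 even though chunk 2 was first in the vector results.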
- Re-ranking returns error / invalid JSON → Claude's response format may vary. The code falls back to the original order if parsing fails. Try running again; it's usually a one-off issue.
- FileNotFoundError: No documents found → You need the docs/ folder from M09's lab. Create it with the 3 sample files from the M09 exercise.
- BM25 and vector return identical results → With a small corpus (3 docs, ~8 chunks), there's limited diversity. The difference becomes dramatic with 50+ documents.
- APIError: 529 → The API is temporarily overloaded. This test makes at least 6 API calls (re-ranking plus generation for each of the 3 queries). Wait 30 seconds between runs.
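For the invalid-JSON case, one option is to parse the re-ranker's reply defensively before falling back. The sketch below assumes the re-ranking prompt asks for a JSON array with one score per chunk; that format and the helper name parse_rerank_scores are assumptions for illustration, not necessarily what the lab's rerank_with_claude does.

```python
import json
import re


def parse_rerank_scores(raw_text, n_chunks):
    """Return a list of n_chunks scores clamped to 1-10, or None on failure.

    Tolerates extra prose or code fences around the JSON array.
    """
    # Grab everything from the first '[' to the last ']'
    match = re.search(r"\[.*\]", raw_text, re.DOTALL)
    if match is None:
        return None
    try:
        scores = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(scores, list) or len(scores) != n_chunks:
        return None
    cleaned = []
    for score in scores:
        # Reject non-numeric entries (bool is a subclass of int, so exclude it)
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            return None
        cleaned.append(min(10, max(1, score)))
    return cleaned


wrapped = parse_rerank_scores("Sure! Here are the scores:\n[8, 3, 12]", 3)
missing = parse_rerank_scores("I cannot score these chunks.", 3)
print(wrapped, missing)
```

A None result signals the caller to keep the original fused order, which matches the fallback behavior described above.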
Verify Everything Works
Run the complete pipeline end-to-end. All 3 queries should complete, showing the full retrieval trace (BM25 → vector → fused → re-ranked) at each stage, followed by cited answers:
- You should see "Loaded N docs, M chunks" at the top (confirming ingestion worked)
- Each query should print BM25 top-3, Vector top-3, Fused top-3, and Re-ranked top-3 with scores
- BM25 and Vector results should differ — that’s the whole point of hybrid search
- Re-ranked scores should be on a 1–10 scale, not cosine similarity (0–1)
- Final answers should contain [Source N] citations
You've upgraded a naive RAG system with three production patterns: BM25 hybrid search for exact term matching, RRF fusion for combining result lists, and Claude-based re-ranking for precise relevance ordering. This retrieve, fuse, then re-rank architecture is common in production document search systems.
- HyDE: Add a hyde_transform function that generates a hypothetical answer before embedding. Measure recall improvement on abstract queries.
- Contextual compression: After re-ranking, use Claude to extract only the relevant sentences from each chunk. Measure token savings.
- Evaluation framework: Create a test set of 10 questions with ground-truth relevant chunk IDs. Compute precision@3 and recall@5 for naive vs. advanced pipeline.
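As a starting point for the evaluation exercise, precision@k and recall@k can be computed from a ranked list of retrieved chunk IDs and a ground-truth set. This is a minimal sketch; the function names and the sample IDs are hypothetical, and it assumes k is at least 1.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    hits = sum(1 for chunk_id in retrieved_ids[:k] if chunk_id in relevant_ids)
    return hits / k


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant chunks that appear in the top-k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids[:k] if chunk_id in relevant_ids)
    return hits / len(relevant_ids)


# Hypothetical test case: one question with three ground-truth chunks
retrieved = ["chunk_4", "chunk_2", "chunk_7", "chunk_1", "chunk_9"]
relevant = {"chunk_4", "chunk_1", "chunk_8"}

p3 = precision_at_k(retrieved, relevant, k=3)
r5 = recall_at_k(retrieved, relevant, k=5)
print(f"precision@3 = {p3:.2f}, recall@5 = {r5:.2f}")
```

Averaging these two numbers over the 10 test questions, once for the naive pipeline and once for the advanced one, gives the comparison the exercise asks for.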
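- BM25 index mismatch: Always BM25-index the same chunks, in the same order, that you vector-indexed. RRF fuses results by chunk index, so different chunking makes the two lists point at different text and the fused scores become meaningless.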
- Re-ranker hallucination: Claude might invent relevance. Always include "1 = completely irrelevant" in the scoring prompt.
- Test set bias: Include adversarial queries (ambiguous, no-match, multi-hop) in your evaluation set.
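The first pitfall can be caught early with a sanity check that runs once after ingestion. The sketch below assumes the chunk_{i} ID scheme used in this lab's pipeline; the helper name check_index_alignment and the sample texts are hypothetical.

```python
def check_index_alignment(bm25_texts, vector_ids, vector_texts):
    """Raise ValueError unless vector ID chunk_i holds the text at BM25 position i."""
    if len(bm25_texts) != len(vector_ids):
        raise ValueError(
            f"BM25 has {len(bm25_texts)} chunks, vector store has {len(vector_ids)}"
        )
    for vec_id, vec_text in zip(vector_ids, vector_texts):
        idx = int(vec_id.split("_")[1])  # "chunk_7" -> 7
        if idx >= len(bm25_texts) or bm25_texts[idx] != vec_text:
            raise ValueError(f"{vec_id} does not match BM25 position {idx}")
    return True


# Hypothetical chunks, indexed identically on both sides
chunk_texts = ["alpha filing", "beta lapse", "gamma amendment"]
chunk_ids = [f"chunk_{i}" for i in range(len(chunk_texts))]

aligned = check_index_alignment(chunk_texts, chunk_ids, chunk_texts)
print("Indexes aligned:", aligned)
```

In the lab pipeline, the IDs and texts on the vector side could plausibly come from the collection itself (for example via ChromaDB's collection.get()), compared against the texts passed to BM25.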
Knowledge Check
1. When would BM25 keyword search outperform vector similarity search?
2. What problem does Reciprocal Rank Fusion (RRF) solve?
3. A RAG system has high recall but low precision. What does this mean, and which pattern would help most?
4. In HyDE, why does embedding a hypothetical answer often retrieve better results than embedding the raw query?
5. Which evaluation metric tells you whether the LLM is hallucinating in its generated answer?
6. You’re building a RAG system for a healthcare company. Which advanced pattern should you add FIRST for the biggest impact? (Recall from M09: your basic vector search is already working.)
Module Summary
Key Concepts Recap
- Naive RAG follows a single-pass retrieve-then-generate pattern. Advanced RAG adds pre-retrieval, retrieval, and post-retrieval optimizations.
- Hybrid Search combines BM25 keyword matching with vector similarity via Reciprocal Rank Fusion, catching both exact terms and semantic matches.
- Re-Ranking uses a cross-encoder or LLM to re-sort retrieved candidates by true relevance, fixing the approximate ordering from initial retrieval.
- Query Transformation (HyDE, multi-query, step-back) improves retrieval by reformulating the user’s question before search.
- Contextual Compression extracts only query-relevant sentences from chunks, reducing noise and token cost.
- RAG Evaluation uses precision, recall, and faithfulness to diagnose exactly where your pipeline is failing.
What We Built
A complete advanced RAG pipeline: BM25 + vector hybrid search with RRF fusion, Claude-based re-ranking, HyDE query transformation, and an evaluation harness measuring precision, recall, and faithfulness. Each component is independent and composable — add them as needed for your use case.
Next: M11 — Multi-Layer Memory
You’ve mastered retrieval. But what about information that persists across conversations? In M11, you’ll build a multi-layer memory system with short-term (conversation), medium-term (session summaries), and long-term (persistent knowledge) tiers — giving your agent the ability to remember users, preferences, and context across interactions.