Claude Code Mastery — Direct Track
CC12 — Knowledge Module 13 of 16
50 min · Intermediate

CC12: RAG for Claude Code

Retrieval Augmented Generation explained from a CLI-developer's lens: chunking, embeddings, BM25, hybrid retrieval, and how to wrap any of these as a Claude Code subagent or MCP server tool. Concepts first, then a runnable end-to-end pipeline.

Learning Objectives

  • Explain what RAG is, when to reach for it, and when NOT to.
  • Pick a chunking strategy for code, docs, and structured data.
  • Use embeddings + a vector store to find semantically similar chunks.
  • Use BM25 for keyword-precise lookups; combine with vectors via hybrid search.
  • Wrap a RAG pipeline as a subagent or MCP server tool that Claude Code can call.

CC10 covered citations: how Claude grounds an answer in supplied documents. CC12 covers the step before that: finding the right documents to supply.

Why RAG — And Why Not

Everyday Analogy

A new junior engineer joins your team. Do you (a) make them memorize the codebase before they can write any code, or (b) give them grep + good search and let them look things up? Option (a) is impossible at any real scale; option (b) is how everyone actually works.

Claude is like that engineer. Putting your entire codebase in the system prompt is option (a) — expensive, slow, capped by context length. Letting Claude search and pull only relevant chunks is option (b). That's RAG.

Technical Definition

Retrieval Augmented Generation (RAG) is a pattern where, before answering, the system retrieves relevant passages from a knowledge base and includes them in Claude's prompt as context. Two phases: retrieval (find relevant passages) and generation (Claude reads them and answers). The "augmentation" is the retrieved context injected at runtime.

When RAG wins

  • Knowledge base is too large to fit in context (codebase, doc set, ticket history).
  • Knowledge changes — you don't want to retrain or redeploy when docs update.
  • You need citation: "Claude said X because of doc Y line Z" (CC10).
  • You need access control: retrieve only docs the user is allowed to see.
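
The access-control point above can be sketched as a filter applied to retrieval results before they reach the prompt. The `allowed_groups` metadata field and the group names are hypothetical; a real system would take them from its own permission model, and many vector stores can apply an equivalent filter inside the query itself.

```python
def filter_by_access(results: list[dict], user_groups: set[str]) -> list[dict]:
    """Keep only passages whose allowed_groups overlap the user's groups."""
    return [r for r in results if set(r["allowed_groups"]) & user_groups]

# Hypothetical retrieval results carrying access metadata from ingest time.
results = [
    {"title": "refunds.md", "allowed_groups": {"support", "finance"}},
    {"title": "payroll.md", "allowed_groups": {"hr"}},
]

visible = filter_by_access(results, user_groups={"support"})
print([r["title"] for r in visible])
```

Filtering before augmentation means a passage the user may not see never enters Claude's context, so it cannot leak into the answer.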

When RAG loses

  • Small knowledge base (< 100K tokens) — just put it all in a cached system prompt (CC10).
  • Knowledge is in code structure (call graphs, type hierarchies) — AST tools beat embeddings.
  • Single-document tasks (summarize this PDF) — just send the document.

For Claude Code

Claude Code already has Read, Grep, and Glob — that's RAG with the codebase as the index and grep/glob as the retriever. It works great for code. RAG with embeddings becomes useful when you want to retrieve over docs, tickets, runbooks, or other corpora the CLI doesn't index by default — expose those via a subagent or MCP server.

The RAG Flow

From documents to grounded answer
A. Chunk (ingest): Split docs into passages.
B. Embed (ingest): Vectorize each chunk.
C. Index (ingest): Store vectors + BM25 index.
D. Retrieve (query): Find top-K similar chunks.
E. Augment (query): Insert chunks into prompt.
F. Generate (query): Claude answers with context.

Steps A–C run offline, once up front and again whenever documents change. Steps D–F run per query. Get them right and the rest of the system feels magical; get them wrong and Claude either hallucinates or cites the wrong passage.
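
Step E (Augment) is the least glamorous step, so here is a minimal sketch of it. The `title` and `text` keys, the tag layout, and the instruction wording are assumptions made for illustration; Claude's built-in citations (CC10) use document content blocks instead of a hand-built string.

```python
def build_prompt(question: str, passages: list[dict]) -> str:
    """Place retrieved passages ahead of the question (step E)."""
    blocks = [
        f'<document index="{n}" title="{p["title"]}">\n{p["text"]}\n</document>'
        for n, p in enumerate(passages, start=1)
    ]
    return (
        "Answer using only the documents below. "
        "If they do not contain the answer, say so.\n\n"
        + "\n\n".join(blocks)
        + f"\n\nQuestion: {question}"
    )

# Hypothetical output of step D (Retrieve).
passages = [{"title": "refunds.md", "text": "Refunds are accepted within 30 days."}]
prompt = build_prompt("What is the refund window?", passages)
print(prompt)
```

Keeping the title next to each passage is what lets the generated answer point back at a source.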

Chunking Strategies

Chunks are the "atoms" of retrieval. Too big → you lose precision and stuff context with junk. Too small → you lose surrounding context and miss the answer. Three patterns cover most cases:

1. Fixed-size with overlap (start here)

Slide a window of N tokens across the doc with M tokens of overlap. Simple, robust, works for prose. The code below approximates tokens with whitespace-separated words.

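def chunk_fixed(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be smaller than size")  # else the window never advances
    words = text.split()
    chunks = []
    for i in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[i:i + size]))
        if i + size >= len(words):
            break  # this window reached the end; another would only repeat the tail
    return chunks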

Defaults that work: 300–800 tokens per chunk, 50–150 token overlap. Smaller for highly structured docs (FAQs); larger for long-form prose.

2. Semantic chunking (better for docs with structure)

Split on natural boundaries: headings, paragraphs, code-block fences. Keeps related sentences together.

import re
def chunk_by_heading(md: str) -> list[str]:
    # Split on H1/H2/H3
    parts = re.split(r"(?=^#{1,3} )", md, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]
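
One caveat with a pure regex split: a line such as `# install deps` inside a fenced code block looks like a heading. The sketch below is one possible way to respect code fences while splitting on H1 to H3; the function name is hypothetical and it assumes fences are opened and closed with three backticks.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

def chunk_by_heading_fence_aware(md: str) -> list[str]:
    """Split on H1-H3 headings, ignoring '#' lines inside fenced code blocks."""
    chunks, current, in_fence = [], [], False
    for line in md.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence          # entering or leaving a code block
        is_heading = not in_fence and re.match(r"#{1,3} ", line)
        if is_heading and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]

doc = "\n".join([
    "# Setup", "Install it.",
    FENCE + "sh", "# not a heading", "pip install x", FENCE,
    "## Usage", "Run it.",
])
sections = chunk_by_heading_fence_aware(doc)
print(len(sections))
```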

3. Code-aware chunking

Don't split inside a function. Use the language's AST or a tree-sitter parser; chunk per top-level definition (function or class). Methods stay inside their class chunk unless you also walk the class body.

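import ast

def chunk_python(src: str) -> list[dict]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(src)
    lines = src.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # node.lineno points at the def/class line; start at the first
            # decorator, if any, so decorators stay with their definition.
            first = min([node.lineno] + [d.lineno for d in node.decorator_list])
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": first,           # 1-based, inclusive
                "end_line": node.end_lineno,   # 1-based, inclusive
                "code": "\n".join(lines[first - 1:node.end_lineno]),
            })
    return chunks

Always store metadata with each chunk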

At minimum: source file, byte range, document title. You'll need this for citations (CC10) and for de-duplicating retrieval results. Add the metadata at ingest time — you can't recover it later.
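
A minimal sketch of carrying that metadata through ingest, assuming paragraph-level chunks. The `Chunk` class and its field names are illustrative, not a required schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    text: str
    source: str   # file the chunk came from
    title: str    # document title
    start: int    # byte offset of the first byte (inclusive)
    end: int      # byte offset past the last byte (exclusive)

def chunk_with_metadata(text: str, source: str, title: str) -> list[Chunk]:
    """Split on blank lines and record each paragraph's byte range."""
    data = text.encode("utf-8")
    chunks, pos = [], 0
    for para in data.split(b"\n\n"):
        start, end = pos, pos + len(para)
        pos = end + 2                      # skip the two-byte separator
        if para.strip():
            chunks.append(Chunk(para.decode("utf-8"), source, title, start, end))
    return chunks

text = "Refunds\n\nWithin 30 days."
for c in chunk_with_metadata(text, "policies/refunds.md", "Refund policy"):
    print(c.source, c.start, c.end)
```

The pair `(source, start)` doubles as a stable key for de-duplicating results that come back from more than one retriever.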

Embeddings & Vector Search

Technical Definition

An embedding is a vector (typically 384–3072 floats) that represents the meaning of a piece of text. Texts with similar meaning land near each other in the vector space; cosine similarity ~ 1 means very similar, ~ 0 means unrelated. To find passages relevant to a query, embed the query and return the chunks with the highest cosine similarity to it.
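
To make the cosine comparison concrete, here is a small worked example in plain Python. The three-dimensional vectors are hand-made stand-ins, not output from any real embedding model.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made vectors standing in for real embeddings (illustrative only).
chunk_vectors = {
    "returning a charge": [0.9, 0.1, 0.0],
    "rotating API keys":  [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]   # pretend this is the embedding of "how do I refund?"

ranked = sorted(chunk_vectors, key=lambda name: -cosine(query_vec, chunk_vectors[name]))
print(ranked[0])
```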

Anthropic does not currently ship a first-party embedding model — you bring your own. Common choices:

Provider / model | Dim | Notes
Voyage AI voyage-3 | 1024 | Strong general-purpose, recommended in Anthropic docs.
OpenAI text-embedding-3-large | 3072 | Solid quality, widely available.
Sentence-transformers all-MiniLM-L6-v2 | 384 | Free, runs locally; lower quality but fine for prototypes.
Cohere embed-english-v3 | 1024 | Good quality, batch-friendly.

Minimal vector search end-to-end

import numpy as np
import voyageai

vo = voyageai.Client()

def embed(texts: list[str]) -> np.ndarray:
    r = vo.embed(texts, model="voyage-3")
    return np.array(r.embeddings)

# Ingest
chunks = chunk_fixed(open("docs.md").read())
chunk_vecs = embed(chunks)   # shape: (N, 1024)

# Query
def search(query: str, k: int = 5) -> list[tuple[float, str]]:
    q = embed([query])[0]
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    top = np.argsort(-sims)[:k]
    return [(float(sims[i]), chunks[i]) for i in top]

print(search("how is money represented?"))

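For real corpora, use a vector database (Qdrant, Weaviate, pgvector, Chroma) instead of in-memory NumPy. The query pattern stays the same (embed the query, ask for the nearest chunks), but the index is persistent and scales past what fits in memory.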

Strengths and weaknesses

Embeddings catch semantic matches: "how do I refund?" matches a chunk titled "returning a charge" even though no word overlaps. Weakness: they often fail on exact identifiers — class names, error codes, function signatures — because semantic similarity doesn't help you find the exact string.

BM25 — Lexical Search That Still Wins

Embeddings get all the press, but BM25, an algorithm from the 1990s, still beats them on a lot of queries. Don't skip it.

Technical Definition

BM25 is a ranking function from information retrieval: documents are scored by term frequency (how often query terms appear in the document), tempered by document length, and weighted by inverse document frequency (terms common across all docs are downweighted). It's a refinement of TF-IDF and is the default ranking function in Lucene-based search engines such as Elasticsearch and OpenSearch.
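
The definition above can be written out in a few lines. This is a from-scratch sketch for intuition, assuming the common parameters k1 = 1.5 and b = 0.75 and the IDF variant with +1 inside the logarithm; library implementations differ in details such as the IDF formula, so scores will not match them exactly.

```python
import math
from collections import Counter

def bm25_scores(query_terms: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every tokenized doc against the query terms."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs           # average doc length
    df = Counter(term for d in docs for term in set(d))  # docs containing each term
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    ["refund", "policy", "refund"],
    ["shipping", "policy"],
    ["econnrefused", "error", "network"],
]
print(bm25_scores(["refund", "policy"], docs))
```

The rare term ("refund", in one doc) contributes more than the common one ("policy", in two), which is the inverse document frequency at work.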

Where BM25 beats embeddings

  • Identifiers and codes: "PO-100023", "ECONNREFUSED", "useState" — exact matches matter.
  • Acronyms: "UCC", "HIPAA" — embeddings cluster them with similar concepts; BM25 finds the literal token.
  • Long documents: length normalization in BM25 means you don't get drowned out by huge files.
  • Cold-start: no model, no embeddings to recompute when docs change — just reindex tokens.

BM25 in 15 lines

from rank_bm25 import BM25Okapi
import re

def tokenize(s: str) -> list[str]:
    return re.findall(r"\w+", s.lower())

# Ingest
chunks = chunk_fixed(open("docs.md").read())
bm25 = BM25Okapi([tokenize(c) for c in chunks])

# Query
def bm25_search(query: str, k: int = 5) -> list[tuple[float, str]]:
    scores = bm25.get_scores(tokenize(query))
    top = sorted(range(len(chunks)), key=lambda i: -scores[i])[:k]
    return [(scores[i], chunks[i]) for i in top]

print(bm25_search("PO-100023 status"))

Tokenization matters more than the formula

BM25 with bad tokenization (no lowercasing, splits on weird boundaries) is worse than embeddings. Use a real tokenizer for your corpus — for code, snake_case and camelCase need to be split. Many implementations have language-specific tokenizers; use them.
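
As one possible take on identifier-aware tokenization, the sketch below keeps each identifier whole and also emits its snake_case and camelCase parts, so both `getUserById` and `user` can match. The function name and the splitting regex are assumptions for illustration, not part of any BM25 library.

```python
import re

# Matches acronym runs (HTTP), capitalized or lowercase words (Server, get), and digits.
PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")

def tokenize_code(s: str) -> list[str]:
    """Lowercased tokens: each identifier whole, plus its sub-words."""
    tokens = []
    for word in re.findall(r"[A-Za-z0-9_]+", s):
        tokens.append(word.lower())
        parts = [p for piece in word.split("_") for p in PART.findall(piece)]
        if len(parts) > 1:                 # only add parts when there was a split
            tokens.extend(p.lower() for p in parts)
    return tokens

print(tokenize_code("getUserById"))
print(tokenize_code("parse_http_response"))
print(tokenize_code("HTTPServer"))
```

Use the same function for both indexing and querying; a mismatch between the two silently drops matches.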

Multi-Index (Hybrid) Pipelines

The professional pattern: run BM25 and vector search in parallel, merge the two ranked lists, optionally rerank the merged list with a cross-encoder. You get keyword precision and semantic recall.

Hybrid retrieval — the production default
Stage | What runs | Output
Parallel retrieve | BM25 (top 50) and vector search (top 50) | Two ranked lists, possibly overlapping
Merge | Reciprocal Rank Fusion (RRF) or weighted score | One ranked list of top 50
Rerank | Cross-encoder (e.g. Voyage rerank-2) on top 50 → top K | Top K (usually 5–10) for the prompt
Augment + generate | Pass top K to Claude as document blocks | Grounded answer with citations

Reciprocal Rank Fusion in 8 lines

def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Merge multiple ranked lists of chunk indices using RRF."""
    scores = {}
    for ranking in rankings:
        for rank, idx in enumerate(ranking):
            scores[idx] = scores.get(idx, 0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=lambda i: -scores[i])

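q = "how do I refund an order?"
bm25_scores = bm25.get_scores(tokenize(q))   # score every chunk once, not once per index
bm25_top = np.argsort(-bm25_scores)[:50].tolist()
# A plain dot product ranks like cosine only for unit-length embeddings;
# otherwise divide by the norms as in search() above.
vec_top  = np.argsort(-(chunk_vecs @ embed([q])[0]))[:50].tolist()
merged   = rrf([bm25_top, vec_top])[:10]
final_chunks = [chunks[i] for i in merged]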
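
The table above names a weighted score as the alternative to RRF. Here is a minimal sketch of that option, assuming min-max normalization so BM25 scores and cosine similarities land on the same 0 to 1 scale; the function names and the `alpha` weight are illustrative.

```python
def minmax(scores: dict[int, float]) -> dict[int, float]:
    """Rescale scores to [0, 1]; a constant list maps to all 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {i: 1.0 for i in scores}
    return {i: (s - lo) / (hi - lo) for i, s in scores.items()}

def weighted_fusion(bm25_scores: dict[int, float], vec_scores: dict[int, float],
                    alpha: float = 0.5) -> list[int]:
    """Rank chunk indices by alpha * BM25 + (1 - alpha) * vector score."""
    b, v = minmax(bm25_scores), minmax(vec_scores)
    combined = {i: alpha * b.get(i, 0.0) + (1 - alpha) * v.get(i, 0.0)
                for i in set(b) | set(v)}
    return sorted(combined, key=lambda i: -combined[i])

# Hypothetical scores keyed by chunk index.
bm25_scores = {0: 12.0, 1: 3.0, 2: 0.0}
vec_scores = {0: 0.2, 1: 0.9, 3: 0.5}
print(weighted_fusion(bm25_scores, vec_scores))
```

RRF needs no tuning because it looks only at ranks; the weighted variant adds a knob (`alpha`) that is worth setting only with an eval behind it.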

Reranking

A cross-encoder takes (query, candidate) pairs and outputs a refined relevance score. It is slow per pair, so it is applied only to the top candidates from the earlier stages (here, the top 50). Voyage AI's rerank-2 is a common choice; Cohere offers a comparable reranker.
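
The shape of the rerank stage can be sketched without committing to a provider. Below, `score_pair` is a pluggable function; `overlap_score` is a toy stand-in based on word overlap and is not a cross-encoder. In a real pipeline that argument would wrap a call to a hosted or local reranking model.

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float], k: int = 5) -> list[tuple[float, str]]:
    """Score each (query, candidate) pair and keep the k best."""
    scored = [(score_pair(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: -pair[0])
    return scored[:k]

def overlap_score(query: str, passage: str) -> float:
    """Toy scorer: fraction of query words present in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0

candidates = [
    "Shipping times vary by region.",
    "How to refund an order in the dashboard.",
]
best = rerank("refund an order", candidates, overlap_score, k=1)
print(best)
```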

Stop adding stages when accuracy plateaus

BM25 + vectors is enough for most projects. Add a reranker only if your eval (CC11!) shows it helps. Every stage adds latency and a failure surface — only spend the complexity when you've measured the win.

RAG as a Subagent or MCP Tool

For Claude Code specifically, you don't ship "a RAG app" — you ship a tool that Claude calls when it needs to look something up. Two patterns:

Pattern A — RAG subagent

The subagent's system prompt spells out the retrieval procedure. Claude Code passes the user's question, and the subagent does the lookup and the answer in one step.

---
name: docs-qa
description: Answers questions about internal docs. Use whenever a user asks about company policy, runbooks, or onboarding.
model: claude-sonnet-4-6
tools: ["mcp__internal_docs__search_docs"]   # MCP tool from Pattern B; the middle segment is the name the server is registered under
---

You answer questions strictly from internal documentation.

Process:
1. Call `search_docs(query)` to retrieve top 8 passages.
2. If no passage is relevant, say "I don't have that documented" — do not guess.
3. Cite each claim with the doc title. Do not invent doc titles.

Pattern B — RAG as MCP server tool

Wrap retrieval as an MCP tool exposed to Claude Code. CC9 covers MCP server building in detail; here's the tool surface:

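# mcp_rag_server.py
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("internal_docs")

# Assumed to exist, built at startup as in the earlier sections and the lab:
#   chunks: list[str]; meta: list[dict] with "title" and "path" keys
#   bm25_rank(query, top) and vec_rank(query, top) -> ranked chunk indices
#   rrf(rankings) -> fused ranking of chunk indices

@mcp.tool()
def search_docs(query: str, k: int = 8) -> list[dict]:
    """Search internal docs and return top K relevant passages with citations.

    Use whenever a user mentions runbooks, policies, or onboarding.
    Returns: list of {title, source, snippet, rank}.
    """
    bm25_top = bm25_rank(query, top=50)
    vec_top  = vec_rank(query, top=50)
    merged   = rrf([bm25_top, vec_top])[:k]
    # RRF yields an ordering, not a comparable score, so report the rank.
    return [{"title": meta[i]["title"], "source": meta[i]["path"],
             "snippet": chunks[i], "rank": rank}
            for rank, i in enumerate(merged, start=1)]

if __name__ == "__main__":
    mcp.run()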

Now any Claude Code session that has this MCP server registered (CC9) gets a search_docs tool. Claude decides when to call it. No prompting required to "remember" to look something up — Claude does it on its own when the description matches the user's question.

What just happened?

You saw why RAG, in Claude Code, is just another tool in the loop from CC8. The retrieval is your code; the calling is Claude's. The combination — retrieval + grounded generation + citations (CC10) — gives you an auditable knowledge agent in < 200 lines.

Hands-On Lab — A 100-Line RAG MCP Server

You'll build a local MCP server that indexes a directory of markdown files with both BM25 and embeddings, exposes a search_docs tool, and registers with Claude Code so any session can answer "what does our doc X say about Y?"

Step 1 — Setup

$ mkdir docs-rag && cd docs-rag
$ python -m venv .venv && source .venv/bin/activate
$ pip install "mcp[cli]" rank-bm25 voyageai numpy
$ export VOYAGE_API_KEY="..."     # this lab uses Voyage; swap the embed calls to use another provider

Step 2 — The server

# server.py
import os, re, glob
import numpy as np
from rank_bm25 import BM25Okapi
from mcp.server.fastmcp import FastMCP
import voyageai

DOCS_DIR = os.environ.get("DOCS_DIR", "./docs")

def tokenize(s):
    return re.findall(r"\w+", s.lower())

# --- Ingest at startup ---
def chunk_md(text, size=500, overlap=100):
    """Split on H1-H3 headings, then window each section by words."""
    assert 0 <= overlap < size, "overlap must be smaller than size"
    out = []
    for p in re.split(r"(?=^#{1,3} )", text, flags=re.MULTILINE):
        words = p.split()
        for i in range(0, len(words), size - overlap):
            out.append(" ".join(words[i:i+size]))
            if i + size >= len(words):
                break   # window reached the end of the section
    return out

CHUNKS, META = [], []
for path in glob.glob(f"{DOCS_DIR}/**/*.md", recursive=True):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    for ch in chunk_md(text):
        CHUNKS.append(ch)
        META.append({"source": path, "title": os.path.basename(path)})
if not CHUNKS:
    raise SystemExit(f"No markdown content found under {DOCS_DIR}")

bm25 = BM25Okapi([tokenize(c) for c in CHUNKS])
vo = voyageai.Client()
# Embed in batches: one request accepts only a limited number of texts.
BATCH = 128   # assumed-safe batch size; check your provider's limit
VECS = np.array([v for i in range(0, len(CHUNKS), BATCH)
                 for v in vo.embed(CHUNKS[i:i+BATCH], model="voyage-3").embeddings])
VEC_NORMS = np.linalg.norm(VECS, axis=1)

def rrf(rankings, k=60):
    s = {}
    for r in rankings:
        for rank, idx in enumerate(r):
            s[idx] = s.get(idx, 0) + 1 / (k + rank + 1)
    return sorted(s, key=lambda i: -s[i])

# --- MCP tool ---
mcp = FastMCP("internal_docs")

@mcp.tool()
def search_docs(query: str, k: int = 6) -> list[dict]:
    """Search internal markdown docs by hybrid BM25 + vector retrieval. Returns top-K passages with source.

    Use when answering questions about internal documentation, runbooks, policies, onboarding, FAQs.
    """
    bm_scores = bm25.get_scores(tokenize(query))   # score all chunks once
    bm = np.argsort(-bm_scores)[:30].tolist()
    qv = np.array(vo.embed([query], model="voyage-3").embeddings[0])
    sims = (VECS @ qv) / (VEC_NORMS * np.linalg.norm(qv))   # cosine similarity
    vc = np.argsort(-sims)[:30].tolist()
    top = rrf([bm, vc])[:k]
    return [{"title": META[i]["title"], "source": META[i]["source"],
             "snippet": CHUNKS[i][:600]} for i in top]

if __name__ == "__main__":
    mcp.run()

Step 3 — Register with Claude Code

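$ claude mcp add docs-rag --scope project \
    --env VOYAGE_API_KEY="$VOYAGE_API_KEY" \
    --env DOCS_DIR=/path/to/your/docs \
    -- "$PWD/.venv/bin/python" "$PWD/server.py"

The command points at the virtualenv's interpreter so the server starts with the packages from Step 1 even when the venv is not activated. Project scope writes the server entry, including the expanded API key, to the project's .mcp.json; keep that file out of version control or register with --scope local instead.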

Step 4 — Use it

$ claude
> What does our refund policy say about refunds older than 30 days?

[Claude calls search_docs("refund policy 30 days")]
[6 chunks returned, top from policies/refunds-v3.md]

According to policies/refunds-v3.md, refunds are not honored after 30 days
unless... [grounded answer with cited source]

Step 5 — Add an eval

Take 20 questions you know the answers to. Run them through the search + generate flow. Use CC11's runner to score — "answer contains expected fact" via code grading, or LLM-as-judge for richer scoring. Set a threshold and gate updates to docs or the search code.
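
Before grading full answers, it can help to score retrieval alone: did the expected source show up in the top K? A minimal sketch follows, assuming each result is a dict with a `source` key as returned by search_docs. The `fake_search` function and the file paths are hypothetical stand-ins so the example runs without the server.

```python
from typing import Callable

def hit_rate_at_k(cases: list[tuple[str, str]],
                  search: Callable[[str, int], list[dict]], k: int = 6) -> float:
    """Fraction of questions whose expected source appears in the top-k results."""
    hits = 0
    for question, expected_source in cases:
        sources = {r["source"] for r in search(question, k)}
        hits += expected_source in sources
    return hits / len(cases)

# Stand-in retriever; replace with the real search_docs function.
def fake_search(query: str, k: int) -> list[dict]:
    index = {"refund": "policies/refunds.md", "deploy": "runbooks/deploy.md"}
    return [{"source": src} for word, src in index.items() if word in query.lower()][:k]

cases = [
    ("What is the refund window?", "policies/refunds.md"),
    ("How do I deploy to staging?", "runbooks/deploy.md"),
    ("Who approves travel expenses?", "policies/travel.md"),
]
rate = hit_rate_at_k(cases, fake_search)
print(f"hit rate@6 = {rate:.2f}")
```

A low hit rate points at chunking or retrieval; a high hit rate with wrong answers points at the prompt or the generation step.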

Lab complete — what you should have

An MCP RAG server backed by hybrid retrieval, registered with Claude Code, queryable from any session. Plus an eval that ensures changes don't regress retrieval quality. You can drop this server into any team's Claude Code setup and unlock grounded answers over their docs in an afternoon.

Knowledge Check

1. Your knowledge base is 50K tokens of internal policy. Should you build a RAG pipeline?

A. Yes, that's huge.
B. No. 50K fits in context. Put it in a cached system prompt and skip RAG.
C. Yes, but only if you reindex weekly.
D. Doesn't matter.
Correct. 50K tokens fits comfortably; with prompt caching (CC10), repeat reads are billed at a fraction of the normal input price. RAG adds engineering for no win at this scale. Reach for it once the corpus outgrows the context window or when knowledge updates frequently.
Look again. RAG's value comes when the corpus exceeds context. 50K tokens fits comfortably; cached system prompts beat RAG for that size.

2. Users search by error codes like "ECONNREFUSED". Embeddings only or BM25 only?

A. Embeddings only: modern semantic search wins.
B. BM25 (or hybrid): exact identifiers favor lexical search.
C. Doesn't matter.
D. Use full-text search.
Correct. Identifiers, error codes, and acronyms favor BM25 because exact-string matters. Embeddings cluster by meaning, which can map "ECONNREFUSED" to "network error" — useful sometimes, miss the literal sometimes. Hybrid covers both.
Look again. Exact identifier matches favor lexical search (BM25). Hybrid (BM25 + vectors) is the production default.

3. Your RAG answers are accurate but the citations don't match the actual source. What's wrong?

A. The model hallucinated.
B. You're not storing source metadata with chunks at ingest, so the prompt has no source to cite from.
C. Embeddings are bad.
D. Reranker dropped the right chunk.
Correct. Metadata (source path, title, byte range) must be attached at ingest. If the prompt doesn't include "this came from refunds-v3.md," Claude has no choice but to invent a citation. Use Claude's built-in citations (CC10) on document blocks for guaranteed accuracy.
Look again. Citation accuracy depends on metadata stored alongside chunks. If chunks don't carry their source, the model has nothing to cite.

4. You add a cross-encoder reranker as a third stage. Eval shows zero accuracy improvement and 2× latency. What now?

A. Keep it: rerankers are best practice.
B. Remove it: complexity without measured benefit isn't worth shipping.
C. Switch to a different reranker.
D. Ignore the eval.
Correct. Every stage is a failure surface. Keep stages whose value is measured by your eval (CC11). "Best practice" without your-corpus evidence is just cargo cult.
Look again. The point of evals is to keep what helps, drop what doesn't. A reranker that costs latency without accuracy gain shouldn't ship.

5. You're exposing RAG to Claude Code. Subagent or MCP tool?

A. Always subagent: it's simpler.
B. Always MCP: more flexible.
C. MCP tool if multiple subagents/skills need retrieval; subagent if it's a single self-contained Q&A use case.
D. Both at the same time.
Correct. MCP exposes the tool to any Claude Code agent or skill. A subagent bundles retrieval with a specific Q&A workflow. Pick by reuse: many consumers → MCP; one consumer → subagent.
Look again. The right answer depends on reuse. Single use case → subagent (simpler). Multiple consumers → MCP (one tool, many callers).

Module Summary

  • RAG = retrieve relevant passages, then generate. Two phases, three ingest steps, three query steps.
  • Use RAG when corpus exceeds context, knowledge changes, or you need access control. Skip if < 100K tokens.
  • Chunking: fixed-size + overlap (start here), semantic (heading-aware), code-aware (AST-based). Always store metadata.
  • Embeddings = semantic similarity. Strong on paraphrase, weak on identifiers. Use any provider, but switching providers means re-embedding the corpus.
  • BM25 = lexical similarity. Strong on identifiers, codes, acronyms. Tokenization matters more than the formula.
  • Hybrid (BM25 + vectors via RRF) is the production default. Add a reranker only if eval shows it helps.
  • For Claude Code: ship RAG as a subagent (single use case) or MCP tool (multiple consumers). Either way it slots into the CC8 tool-use loop.