CC12: RAG for Claude Code | Claude Code Mastery

Learning Objectives

Explain what RAG is, when to reach for it, and when NOT to.
Pick a chunking strategy for code, docs, and structured data.
Use embeddings + a vector store to find semantically similar chunks.
Use BM25 for keyword-precise lookups; combine with vectors via hybrid search.
Wrap a RAG pipeline as a subagent or MCP server tool that Claude Code can call.

CC10 covered citations — how Claude grounds an answer in supplied documents. CC12 covers the step before that: finding the right documents to supply.

Why RAG — And Why Not

Everyday Analogy

A new junior engineer joins your team. Do you (a) make them memorize the codebase before they can write any code, or (b) give them grep + good search and let them look things up? Option (a) is impossible at any real scale; option (b) is how everyone actually works.

Claude is like that engineer. Putting your entire codebase in the system prompt is option (a) — expensive, slow, capped by context length. Letting Claude search and pull only relevant chunks is option (b). That's RAG.

Technical Definition

Retrieval Augmented Generation (RAG) is a pattern where, before answering, the system retrieves relevant passages from a knowledge base and includes them in Claude's prompt as context. Two phases: retrieval (find relevant passages) and generation (Claude reads them and answers). The "augmentation" is the retrieved context injected at runtime.

When RAG wins

Knowledge base is too large to fit in context (codebase, doc set, ticket history).
Knowledge changes — you don't want to retrain or redeploy when docs update.
You need citation: "Claude said X because of doc Y line Z" (CC10).
You need access control: retrieve only docs the user is allowed to see.

When RAG loses

Small knowledge base (< 100K tokens) — just put it all in a cached system prompt (CC10).
Knowledge is in code structure (call graphs, type hierarchies) — AST tools beat embeddings.
Single-document tasks (summarize this PDF) — just send the document.

For Claude Code

Claude Code already has Read, Grep, and Glob — that's RAG with the codebase as the index and grep/glob as the retriever. It works great for code. RAG with embeddings becomes useful when you want to retrieve over docs, tickets, runbooks, or other corpora the CLI doesn't index by default — expose those via a subagent or MCP server.

The RAG Flow

From documents to grounded answer

A. INGEST

Chunk

Split docs into passages.

B. INGEST

Embed

Vectorize each chunk.

C. INGEST

Index

Store vectors + BM25 index.

D. QUERY

Retrieve

Find top-K similar chunks.

E. QUERY

Augment

Insert chunks into prompt.

F. QUERY

Generate

Claude answers with context.

Steps A–C run once when documents change (offline ingest). Steps D–F run per query. Get them right and the rest of the system feels magical; get them wrong and Claude either hallucinates or cites the wrong passage.

Chunking Strategies

Chunks are the "atoms" of retrieval. Too big → you lose precision and stuff context with junk. Too small → you lose surrounding context and miss the answer. Three patterns covering most cases:

1. Fixed-size with overlap (start here)

Slide a window of N tokens across the doc with M tokens of overlap. Simple, robust, works for prose.

def chunk_fixed(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunk = " ".join(words[i:i + size])
        chunks.append(chunk)
        i += size - overlap
    return chunks

Defaults that work: 300–800 tokens per chunk, 50–150 token overlap. Smaller for highly structured docs (FAQs); larger for long-form prose.

2. Semantic chunking (better for docs with structure)

Split on natural boundaries: headings, paragraphs, code-block fences. Keeps related sentences together.

import re
def chunk_by_heading(md: str) -> list[str]:
    # Split on H1/H2/H3
    parts = re.split(r"(?=^#{1,3} )", md, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]

3. Code-aware chunking

Don't split inside a function. Use the language's AST or a tree-sitter parser; chunk per top-level definition (function, class, method).

import ast
def chunk_python(src: str) -> list[dict]:
    tree = ast.parse(src)
    chunks = []
    lines = src.splitlines()
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start = node.lineno - 1
            end   = node.end_lineno
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "code": "\n".join(lines[start:end]),
            })
    return chunks

Always store metadata with each chunk

At minimum: source file, byte range, document title. You'll need this for citations (CC10) and for de-duplicating retrieval results. Add the metadata at ingest time — you can't recover it later.

Embeddings & Vector Search

Technical Definition

An embedding is a vector (typically 384–3072 floats) that represents the meaning of a piece of text. Texts with similar meaning land near each other in the vector space; cosine similarity ~ 1 means very similar, ~ 0 means unrelated. To find passages relevant to a query, embed the query and find the chunks with the closest cosine similarity.

Anthropic does not currently ship a first-party embedding model — you bring your own. Common choices:

Provider / model	Dim	Notes
Voyage AI `voyage-3`	1024	Strong general-purpose, recommended in Anthropic docs.
OpenAI `text-embedding-3-large`	3072	Solid quality, widely available.
Sentence-transformers `all-MiniLM-L6-v2`	384	Free, runs locally; lower quality but fine for prototypes.
Cohere `embed-english-v3`	1024	Good quality, batched-friendly.

Minimal vector search end-to-end

import numpy as np
import voyageai

vo = voyageai.Client()

def embed(texts: list[str]) -> np.ndarray:
    r = vo.embed(texts, model="voyage-3")
    return np.array(r.embeddings)

# Ingest
chunks = chunk_fixed(open("docs.md").read())
chunk_vecs = embed(chunks)   # shape: (N, 1024)

# Query
def search(query: str, k: int = 5) -> list[tuple[float, str]]:
    q = embed([query])[0]
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    top = np.argsort(-sims)[:k]
    return [(float(sims[i]), chunks[i]) for i in top]

print(search("how is money represented?"))

For real corpora, use a vector database (Qdrant, Weaviate, pgvector, Chroma) instead of in-memory NumPy. Same query interface, persistent + scalable.

Strengths and weaknesses

Embeddings catch semantic matches: "how do I refund?" matches a chunk titled "returning a charge" even though no word overlaps. Weakness: they often fail on exact identifiers — class names, error codes, function signatures — because semantic similarity doesn't help you find the exact string.

BM25 — Lexical Search That Still Wins

Embeddings get all the press, but a 25-year-old algorithm called BM25 still beats them on a lot of queries. Don't skip it.

Technical Definition

BM25 is a ranking function from information retrieval: documents are scored by term frequency (how often query terms appear in the document) tempered by document length and inverse document frequency (terms common across all docs are downweighted). It's a refinement of TF-IDF and is what powers most search engines underneath the hood.

Where BM25 beats embeddings

Identifiers and codes: "PO-100023", "ECONNREFUSED", "useState" — exact matches matter.
Acronyms: "UCC", "HIPAA" — embeddings cluster them with similar concepts; BM25 finds the literal token.
Long documents: length normalization in BM25 means you don't get drowned out by huge files.
Cold-start: no model, no embeddings to recompute when docs change — just reindex tokens.

BM25 in 15 lines

from rank_bm25 import BM25Okapi
import re

def tokenize(s: str) -> list[str]:
    return re.findall(r"\w+", s.lower())

# Ingest
chunks = chunk_fixed(open("docs.md").read())
bm25 = BM25Okapi([tokenize(c) for c in chunks])

# Query
def bm25_search(query: str, k: int = 5) -> list[tuple[float, str]]:
    scores = bm25.get_scores(tokenize(query))
    top = sorted(range(len(chunks)), key=lambda i: -scores[i])[:k]
    return [(scores[i], chunks[i]) for i in top]

print(bm25_search("PO-100023 status"))

Tokenization matters more than the formula

BM25 with bad tokenization (no lowercasing, splits on weird boundaries) is worse than embeddings. Use a real tokenizer for your corpus — for code, snake_case and camelCase need to be split. Many implementations have language-specific tokenizers; use them.

Multi-Index (Hybrid) Pipelines

The professional pattern: run BM25 and vector search in parallel, merge the two ranked lists, optionally rerank the merged list with a cross-encoder. You get keyword precision and semantic recall.

Hybrid retrieval — the production default

Stage	What runs	Output
Parallel retrieve	BM25 (top 50) and vector search (top 50)	Two ranked lists, possibly overlapping
Merge	Reciprocal Rank Fusion (RRF) or weighted score	One ranked list of top 50
Rerank	Cross-encoder (e.g. Voyage `rerank-2`) on top 50 → top K	Top K (usually 5–10) for the prompt
Augment + generate	Pass top K to Claude as document blocks	Grounded answer with citations

Reciprocal Rank Fusion in 8 lines

def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Merge multiple ranked lists of chunk indices using RRF."""
    scores = {}
    for ranking in rankings:
        for rank, idx in enumerate(ranking):
            scores[idx] = scores.get(idx, 0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=lambda i: -scores[i])

q = "how do I refund an order?"
bm25_top = sorted(range(len(chunks)), key=lambda i: -bm25.get_scores(tokenize(q))[i])[:50]
vec_top  = np.argsort(-(chunk_vecs @ embed([q])[0]))[:50].tolist()
merged   = rrf([bm25_top, vec_top])[:10]
final_chunks = [chunks[i] for i in merged]

Reranking

A cross-encoder takes (query, candidate) pairs and outputs a refined relevance score. Slow per-pair, so only used on the top 50 from earlier stages. Voyage AI's rerank-2 is a common choice; Cohere offers similar.

Stop adding stages when accuracy plateaus

BM25 + vectors is enough for most projects. Add a reranker only if your eval (CC11!) shows it helps. Every stage adds latency and a failure surface — only spend the complexity when you've measured the win.

RAG as a Subagent or MCP Tool

For Claude Code specifically, you don't ship "a RAG app" — you ship a tool that Claude calls when it needs to look something up. Two patterns:

Pattern A — RAG subagent

The subagent system prompt embeds the retrieval logic. Claude Code passes the user's question, the subagent does the lookup + answer in one step.

---
name: docs-qa
description: Answers questions about internal docs. Use whenever a user asks about company policy, runbooks, or onboarding.
model: claude-sonnet-4-6
tools: ["mcp__internal_docs__search_docs"]   # MCP tool from Pattern B
---

You answer questions strictly from internal documentation.

Process:
1. Call `search_docs(query)` to retrieve top 8 passages.
2. If no passage is relevant, say "I don't have that documented" — do not guess.
3. Cite each claim with the doc title. Do not invent doc titles.

Pattern B — RAG as MCP server tool

Wrap retrieval as an MCP tool exposed to Claude Code. CC9 covers MCP server building in detail; here's the tool surface:

# mcp_rag_server.py
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("internal_docs")

@mcp.tool()
def search_docs(query: str, k: int = 8) -> list[dict]:
    """Search internal docs and return top K relevant passages with citations.

    Use whenever a user mentions runbooks, policies, or onboarding.
    Returns: list of {title, source, snippet, score}.
    """
    bm25_top = bm25_rank(query, top=50)
    vec_top  = vec_rank(query, top=50)
    merged   = rrf([bm25_top, vec_top])[:k]
    return [{"title": meta[i].title, "source": meta[i].path,
             "snippet": chunks[i], "score": float(scores[i])} for i in merged]

if __name__ == "__main__":
    mcp.run()

Now any Claude Code session that has this MCP server registered (CC9) gets a search_docs tool. Claude decides when to call it. No prompting required to "remember" to look something up — Claude does it on its own when the description matches the user's question.

What just happened?

You saw why RAG, in Claude Code, is just another tool in the loop from CC8. The retrieval is your code; the calling is Claude's. The combination — retrieval + grounded generation + citations (CC10) — gives you an auditable knowledge agent in < 200 lines.

Hands-On Lab — A 100-Line RAG MCP Server

You'll build a local MCP server that indexes a directory of markdown files with both BM25 and embeddings, exposes a search_docs tool, and registers with Claude Code so any session can answer "what does our doc X say about Y?"

Step 1 — Setup

$ mkdir docs-rag && cd docs-rag
$ python -m venv .venv && source .venv/bin/activate
$ pip install "mcp[cli]" rank-bm25 voyageai numpy
$ export VOYAGE_API_KEY="..."     # any embedding provider

Step 2 — The server

# server.py
import os, re, glob
import numpy as np
from rank_bm25 import BM25Okapi
from mcp.server.fastmcp import FastMCP
import voyageai

DOCS_DIR = os.environ.get("DOCS_DIR", "./docs")

# --- Ingest at startup ---
def chunk_md(text, size=500, overlap=100):
    parts = re.split(r"(?=^#{1,3} )", text, flags=re.MULTILINE) or [text]
    out = []
    for p in parts:
        words = p.split()
        i = 0
        while i < len(words):
            out.append(" ".join(words[i:i+size]))
            i += size - overlap
    return out

CHUNKS, META = [], []
for path in glob.glob(f"{DOCS_DIR}/**/*.md", recursive=True):
    text = open(path).read()
    for ch in chunk_md(text):
        CHUNKS.append(ch)
        META.append({"source": path, "title": os.path.basename(path)})

bm25 = BM25Okapi([re.findall(r"\w+", c.lower()) for c in CHUNKS])
vo = voyageai.Client()
VECS = np.array(vo.embed(CHUNKS, model="voyage-3").embeddings)

def rrf(rankings, k=60):
    s = {}
    for r in rankings:
        for rank, idx in enumerate(r):
            s[idx] = s.get(idx, 0) + 1 / (k + rank + 1)
    return sorted(s, key=lambda i: -s[i])

# --- MCP tool ---
mcp = FastMCP("internal_docs")

@mcp.tool()
def search_docs(query: str, k: int = 6) -> list[dict]:
    """Search internal markdown docs by hybrid BM25 + vector retrieval. Returns top-K passages with source.

    Use when answering questions about internal documentation, runbooks, policies, onboarding, FAQs.
    """
    bm = sorted(range(len(CHUNKS)),
                key=lambda i: -bm25.get_scores(re.findall(r"\w+", query.lower()))[i])[:30]
    qv = np.array(vo.embed([query], model="voyage-3").embeddings[0])
    vc = np.argsort(-(VECS @ qv) / (np.linalg.norm(VECS, axis=1) * np.linalg.norm(qv)))[:30].tolist()
    top = rrf([bm, vc])[:k]
    return [{"title": META[i]["title"], "source": META[i]["source"],
             "snippet": CHUNKS[i][:600]} for i in top]

if __name__ == "__main__":
    mcp.run()

Step 3 — Register with Claude Code

$ claude mcp add docs-rag --scope project \
    --env VOYAGE_API_KEY="$VOYAGE_API_KEY" \
    --env DOCS_DIR=/path/to/your/docs \
    -- python "$PWD/server.py"

Step 4 — Use it

$ claude
> What does our refund policy say about refunds older than 30 days?

[Claude calls search_docs("refund policy 30 days")]
[6 chunks returned, top from policies/refunds-v3.md]

According to policies/refunds-v3.md, refunds are not honored after 30 days
unless... [grounded answer with cited source]

Step 5 — Add an eval

D

Both at the same time.

Correct. MCP exposes the tool to any Claude Code agent or skill. A subagent bundles retrieval with a specific Q&A workflow. Pick by reuse: many consumers → MCP; one consumer → subagent.

Look again. The right answer depends on reuse. Single use case → subagent (simpler). Multiple consumers → MCP (one tool, many callers).

Module Summary

RAG = retrieve relevant passages, then generate. Two phases, three ingest steps, three query steps.
Use RAG when corpus exceeds context, knowledge changes, or you need access control. Skip if < 100K tokens.
Chunking: fixed-size + overlap (start here), semantic (heading-aware), code-aware (AST-based). Always store metadata.
Embeddings = semantic similarity. Strong on paraphrase, weak on identifiers. Use any provider; switch is easy.
BM25 = lexical similarity. Strong on identifiers, codes, acronyms. Tokenization matters more than the formula.
Hybrid (BM25 + vectors via RRF) is the production default. Add a reranker only if eval shows it helps.
For Claude Code: ship RAG as a subagent (single use case) or MCP tool (multiple consumers). Either way it slots into the CC8 tool-use loop.