M09: RAG — Retrieval-Augmented Generation
Claude is brilliant but has a knowledge cutoff. RAG solves this by giving Claude access to your documents at query time — no fine-tuning required. You'll build a complete "Chat with Your Docs" system.
Learning Objectives
- Explain why RAG is needed and what problem it solves that prompt engineering cannot
- Describe the end-to-end RAG pipeline: load, chunk, embed, store, retrieve, generate
- Explain how text embeddings and cosine similarity enable semantic search
- Compare chunking strategies and their impact on retrieval quality
- Build a complete RAG system with ChromaDB and Claude
The Knowledge Problem
BEFORE: Imagine a brilliant doctor who graduated in 2023 but has never seen a 2025 medical journal — they practiced medicine using only what they learned in school and residency, with no access to anything published since.
PAIN: When a patient asks about a breakthrough treatment approved six months ago, the doctor either says "I don't know" or, worse, confidently recommends an outdated protocol that has since been superseded — and the patient has no way to tell the difference.
MAPPING: That's exactly the knowledge cutoff problem with LLMs: Claude is the doctor, your private company docs and recent data are the new medical journals, and RAG is the system that puts those journals on the doctor's desk right when they need them. (Knowledge cutoff problem: LLMs are trained on data up to a specific date. They cannot access information published after that date, proprietary data, or private documents, which leads to hallucinations when they are asked about topics outside their training data.)
What Is RAG? (End-to-End Pipeline)
BEFORE: Imagine taking a closed-book exam where you must answer every question purely from memory — you spent months cramming, but there are thousands of facts you simply could not memorize.
PAIN: When a question covers a topic you didn't study deeply, you either leave it blank or guess — and guessing confidently is exactly what hallucination looks like in an LLM.
MAPPING: RAG turns this into an open-book exam: Claude is the student, your document corpus is the textbook sitting on the desk, and the retrieval system is the index at the back of the book that helps Claude flip to exactly the right page instead of guessing from memory.
Here's what this looks like in practice. When a user asks a question, the retrieval step returns actual chunk objects like this:
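As a hypothetical illustration (the field names, scores, and chunk text below are assumptions made for this sketch, not output from a real run), the retrieved chunks and the context built from them might look like this:

```python
# Hypothetical retrieval output: field names, scores, and text are illustrative only.
retrieved_chunks = [
    {
        "text": "A UCC-1 financing statement must name the debtor and the secured party.",
        "source": "filing_guide.md",
        "chunk_index": 4,
        "score": 0.91,  # similarity to the question (higher = closer)
    },
    {
        "text": "The filing office assigns a unique file number to each accepted filing.",
        "source": "filing_guide.md",
        "chunk_index": 5,
        "score": 0.87,
    },
]

# Each chunk is labeled with its source before being placed in the prompt.
context = "\n\n".join(
    f"[Source {i}: {c['source']}]\n{c['text']}"
    for i, c in enumerate(retrieved_chunks, start=1)
)
print(context)
```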
Those chunk texts get pasted directly into Claude's prompt as context. Claude reads them, writes an answer, and cites [Source 1: filing_guide.md]. The user never sees the retrieval machinery — they just get an accurate, sourced answer.
"RAG is just fine-tuning, right?" — No. Fine-tuning changes the model's internal weights — it costs $5,000–$50,000, takes weeks, and requires ML expertise. RAG doesn't touch the model at all. It just adds relevant text to the prompt at query time. Think of it as giving Claude a cheat sheet during the exam, not retraining Claude's brain.
"Bigger chunks = better results?" — Usually the opposite. A 2,000-word chunk that contains one relevant sentence dilutes the signal. The retriever found the right chunk, but Claude has to wade through 1,999 irrelevant words to find the answer. Smaller, focused chunks almost always retrieve better.
"RAG eliminates hallucinations" — It reduces them significantly (from ~35% to under 3% in typical deployments), but it doesn't eliminate them entirely. Claude can still misinterpret retrieved context, combine facts incorrectly, or fill gaps with plausible-sounding guesses. Always include the grounding instruction ("answer based ONLY on the provided context") and verify critical outputs.
"You need a vector database to do RAG" — For production, yes — vector databases give you fast ANN search over millions of vectors. But basic RAG works with simple in-memory similarity search over a few hundred chunks. Don't let infrastructure requirements stop you from prototyping.
"More retrieved chunks = better answers" — Diminishing returns kick in fast. Retrieving 3–5 highly relevant chunks is typically better than retrieving 20 chunks of mixed quality. Too many chunks flood Claude's context with noise, and information in the middle of long context gets lower recall (the "lost in the middle" effect).
Embeddings Explained from Scratch
BEFORE: Think of traditional keyword search: you type "heart attack treatment" and the engine only finds documents containing those exact words — missing a document titled "myocardial infarction therapy" even though it means the same thing.
PAIN: This keyword mismatch problem means relevant results are invisible unless the author happened to use the exact same vocabulary as the searcher, leading to frustrating dead ends and missed information.
MAPPING: Embeddings fix this by placing text as coordinates in meaning-space: "king" and "queen" live close together, just as Paris and London are nearby on a map of capital cities. Cosine similarity measures the angle between two vectors: a small angle means similar meaning, regardless of the words used. (Cosine similarity: a measure of similarity between two vectors based on the angle between them. It ranges from -1 (opposite meaning) to 1 (identical meaning) and is preferred over Euclidean distance for text embeddings because it measures direction (meaning) regardless of magnitude (text length).)
So what IS an embedding? In plain English, it's a way of representing text as a point in space. Each word, sentence, or paragraph gets converted into a list of numbers — and those numbers encode what the text means, not just what words it contains. Two sentences that say the same thing in different words will end up at nearby points, even if they share zero words in common.
How does this actually work under the hood? An embedding model (a specialized neural network, much smaller than Claude) reads your text and outputs a fixed-size array of floats. The size depends on the model: a few hundred to a few thousand numbers is common, and 1,536 is one widely used size. During training, the model learned to place semantically similar text at nearby coordinates. It doesn't follow rules you wrote; it learned these associations from millions of text pairs. The result is that "cancel my subscription" and "I want to stop paying" land near each other, even though they share no keywords.
If you're familiar with keyword search (like SQL's LIKE '%term%'), here's the key difference: keyword search requires an exact character match, so "heart attack" won't find "myocardial infarction." Embedding-based search doesn't care about specific words at all — it compares meaning. This is why RAG systems use embeddings instead of traditional search: your users won't always use the same vocabulary as your documents.
So what does an embedding actually look like? It's just an array of floating-point numbers, commonly several hundred to a few thousand of them depending on the model:
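As a hypothetical illustration, here are three shortened, made-up vectors (the values are not produced by any actual model, and real embeddings are far longer), along with the cosine similarity between them:

```python
import math

# Made-up 6-dimensional vectors standing in for real embeddings.
embeddings = {
    "cancel my subscription": [0.12, -0.45, 0.88, 0.03, -0.27, 0.51],
    "I want to stop paying":  [0.11, -0.43, 0.85, 0.07, -0.30, 0.48],
    "best pizza in town":     [-0.62, 0.71, -0.05, 0.44, 0.19, -0.33],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


cancel = embeddings["cancel my subscription"]
stop = embeddings["I want to stop paying"]
pizza = embeddings["best pizza in town"]

print(f"cancel vs stop:  {cosine_similarity(cancel, stop):.2f}")
print(f"cancel vs pizza: {cosine_similarity(cancel, pizza):.2f}")
```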
Notice how the first two vectors start with almost identical numbers — they mean the same thing, so their coordinates are nearly the same. The pizza vector looks completely different. That's the whole trick: meaning becomes geometry, and "similar meaning" becomes "short distance."
"Embeddings understand language like humans do" — No. Embeddings capture statistical patterns of word co-occurrence, not true understanding. "Bank" (financial) and "bank" (river) get the same embedding unless the surrounding context disambiguates them. Always embed enough context (a sentence or paragraph), not isolated words.
"All embedding models produce the same quality" — Far from it. A general-purpose embedding model may not distinguish "UCC-1 filing" from "UCC-3 amendment" well, because it wasn't trained on legal text. Domain-specific or fine-tuned embedding models can dramatically improve retrieval quality for specialized content.
"You can compare embeddings from different models" — You cannot. An embedding from Voyage AI and one from OpenAI live in completely different vector spaces. Cosine similarity between them is meaningless. Always use the same model for both indexing and querying.
Chunking Strategies
BEFORE: Imagine you have a 300-page textbook and you need to create study flashcards — but you have to decide how much text goes on each card before you start studying.
PAIN: Cut too big (an entire chapter per card) and you waste time re-reading pages of irrelevant material to find one sentence. Cut too small (one sentence per card) and you lose the surrounding context that makes that sentence meaningful — "the treatment was effective" means nothing without knowing which treatment and for what condition.
MAPPING: Chunking is exactly this trade-off for your RAG system: each chunk is a flashcard that the retrieval engine can hand to Claude, and the right card size depends on the structure and density of your specific material.
Chunking is the process of splitting your documents into smaller pieces before embedding. Why not just embed the whole document? Because embedding models work best with short text (under ~512 tokens). A 5,000-word document crammed into a single vector loses nuance — the embedding becomes a blurry average of everything the document discusses, making it hard to match specific questions.
The core challenge is finding the right granularity. Too large, and each chunk covers multiple topics, so the retriever finds the chunk but Claude has to hunt for the relevant sentence inside a wall of text. Too small, and each chunk loses context — "the treatment was effective" tells you nothing without knowing which treatment or which patient. Most teams start with 300–500 character chunks and tune from there based on retrieval quality.
If you've used database pagination (LIMIT 50 OFFSET 100), chunking feels similar — you're dividing data into pages. But there's a crucial difference: database pages don't overlap, while chunks should. Without overlap, a sentence that spans a chunk boundary gets split across two chunks, and neither chunk alone answers the question. The overlap parameter (typically 10–20% of chunk size) duplicates a few sentences at each boundary to prevent this.
Here's what actual chunks look like after splitting a short document. Notice how the overlap zone duplicates text at chunk boundaries to preserve context:
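As a rough illustration, the sketch below splits a made-up five-sentence document with a simplified sentence-window chunker (a window of three sentences with one sentence of overlap). The document text and the `chunk_sentences` helper are assumptions for this example; the walkthrough later in this module uses a character-based splitter instead.

```python
def chunk_sentences(sentences: list[str], size: int = 3, overlap: int = 1) -> list[str]:
    """Group sentences into windows of `size`, repeating `overlap` sentences."""
    step = size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + size]))
        if start + size >= len(sentences):
            break  # the window already reached the end of the document
    return chunks


# Made-up sample document, one sentence per list item.
sentences = [
    "A UCC-1 financing statement is filed with the Secretary of State.",
    "The filing must name the debtor and the secured party.",
    "The filing office assigns a unique file number.",
    "That number is needed for any later amendment.",
    "Keep a copy of the acknowledgment for your records.",
]

chunks = chunk_sentences(sentences)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk}\n")
```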
The overlap ensures "The filing office assigns a unique file number" appears in both chunks, so a query about file numbers will match regardless of which chunk the retriever finds.
Vector Databases
BEFORE: Imagine a traditional library where books are shelved alphabetically by author — to find everything about "machine learning in healthcare," you'd need to know every author who ever wrote on that topic and check each shelf individually.
PAIN: You inevitably miss relevant books because you didn't know the author's name, and you waste hours browsing shelves that are organized by a criterion (author name) that has nothing to do with what you actually care about (the topic).
MAPPING: A vector database is a library where books are shelved by what they're about — all books about "machine learning in healthcare" end up near each other regardless of who wrote them, and you find them by describing the topic you need rather than knowing an exact title or keyword.
Here's what a "row" in a vector database actually contains — the vector, the original text, and metadata sitting side by side:
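As a hypothetical illustration (the id, numbers, and metadata are invented for this sketch, and the embedding is truncated to four values for readability), a single stored record might look like this:

```python
# Hypothetical record: every value here is illustrative, not real database output.
row = {
    "id": "chunk_17",
    "embedding": [0.021, -0.134, 0.087, 0.402],  # real vectors are much longer
    "document": "The filing office assigns a unique file number to each accepted filing.",
    "metadata": {"source": "filing_guide.md", "chunk_index": 17},
}

for field, value in row.items():
    print(f"{field}: {value}")
```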
When you query the database, it compares your question's vector against the stored vectors, finds the closest matches, and returns the document text and metadata for each. You never have to work with the raw vectors yourself; the database handles that.
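Conceptually, a query is a nearest-neighbor lookup. The simplest version, sketched below under toy assumptions, scores every stored vector against the query and keeps the best few, which is workable for a few hundred chunks. The three-dimensional vectors are made-up stand-ins for real embeddings, and `top_k_chunks` is a hypothetical helper name, not part of any library.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, index, k=3):
    """Brute force: score every stored chunk, then keep the k best."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy index of (chunk text, made-up 3-d "embedding") pairs.
index = [
    ("How to cancel a subscription", [0.9, 0.1, 0.0]),
    ("Refund policy for annual plans", [0.7, 0.3, 0.1]),
    ("Office pizza party schedule", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.0]  # pretend embedding of "I want to stop paying"

results = top_k_chunks(query_vec, index, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The cost of this approach grows linearly with the number of stored vectors, since every one of them is scored on every query.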
Under the hood, vector databases use approximate nearest neighbor (ANN) algorithms to avoid comparing your query against every single stored vector. (ANN: an algorithm that finds vectors "close to" a query vector without checking every single vector in the database. It trades a tiny amount of accuracy for massive speed gains, using index structures like HNSW graphs.) The most common algorithm is called HNSW (Hierarchical Navigable Small World). It works by building a graph structure that connects similar vectors to each other. When a query arrives, instead of checking all 100,000 vectors, the algorithm navigates the graph and checks maybe 200, returning results in milliseconds with 95–99% accuracy. This small trade-off in precision makes the difference between a search that takes 50ms and one that takes 30 seconds.
HNSW skims the top layer to find the rough neighborhood, then descends layer by layer to refine. That's why 100K vs 100M makes barely any difference to query time.
If you already know SQL databases, here's the key contrast: in PostgreSQL, you'd write something like SELECT * FROM filings WHERE debtor_name = 'Acme Corp' (an exact match on a hypothetical filings table). A vector database lets you say "find me everything semantically similar to 'Acme Corporation'" and it would also find "ACME CORP," "Acme Corp Inc," and related entities, because their embedding vectors are close together. It's the difference between exact lookup and meaning-based search.
Popular options include: ChromaDB (local, Python-native, great for prototyping), Pinecone (managed cloud service, scales to billions of vectors), and pgvector (a PostgreSQL extension that adds vector search to your existing database).
"Vector databases replace SQL databases" — They don't. Vector databases excel at similarity search but are terrible at exact lookups, joins, aggregations, or transactions. Most production RAG systems use both: a vector DB for retrieval and a SQL DB for metadata, user data, and business logic.
"ANN search always returns the best matches" — The "approximate" in ANN means it trades a small amount of accuracy for massive speed. In rare cases, the true nearest neighbor may not appear in results. For most RAG applications this is fine (95–99% accuracy), but if you need guaranteed exact results, use exact nearest neighbor search (much slower).
"More vectors = slower search" — Not linearly. HNSW index performance scales logarithmically, so going from 10,000 to 100,000 vectors barely changes query time. The real bottleneck is usually the embedding step (calling the API to convert the query to a vector), not the search itself.
Citations — Native Provenance From Claude
The RAG pipeline you just built returns retrieved chunks and Claude's answer, but stitching those two together — "which sentence in the answer came from which chunk?" — is your responsibility. For audit-grade applications (legal, medical, regulated finance), that stitching is fragile: Claude can reword, summarize, or merge sources, and matching back to the original passages is error-prone.
Anthropic ships a built-in Citations feature that solves this at the API level. You pass your source documents into the request as document content blocks; Claude returns its answer with explicit citations arrays attached to each sentence, telling you exactly which document and which character span supports each claim. No regex, no fuzzy matching, no hallucinated quotes.
The Citations feature accepts up to ~20 documents per request, each with a unique title. Claude's response includes citations[] on each text block, where each citation references a document_index, document_title, and the exact cited_text span used. You enable it per request by passing documents as content blocks with "type": "document" and "citations": {"enabled": true}.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY env var
# chunk_1 is the plain-text content of one source document or retrieved chunk

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {"type": "text", "media_type": "text/plain", "data": chunk_1},
                "title": "Filing UCC-2024-001",
                "citations": {"enabled": True}
            },
            # ...more documents...
            {"type": "text", "text": "Summarize the secured party's collateral interest."}
        ]
    }]
)
# Each text block in response.content may carry its own citations list, e.g.
# response.content[0].citations -> [{document_index: 0, document_title: "...", cited_text: "..."}]
RAG vs Citations — Pick Per Job
| Concern | Build RAG | Use Citations |
|---|---|---|
| Corpus size | Millions of docs — need vector search to narrow down | A small candidate set (<20 docs) per request |
| Provenance requirement | "Roughly attributable" via chunk metadata | Sentence-level character spans — audit-grade |
| Hallucination risk | Claude may paraphrase outside the sources | Each cited claim is grounded in a specific span |
| Token cost | Pay only for top-K retrieved chunks | Pay for all candidate documents (cap ~20) |
| Best stack | Vector DB + Claude (this module's pipeline) | Vector DB shortlist → pass top-K to Citations |
The strongest production pattern combines them: use your RAG pipeline to find the top ~10 candidate chunks, then pass those chunks as citation-enabled documents in the request. You get the scale of vector search and the audit-grade provenance of Citations in the same call.
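The combined pattern can be sketched as a small adapter that turns retrieved chunks into citation-enabled document blocks. The helper names `chunks_to_document_blocks` and `build_user_content`, the chunk dict shape, and the sample text are assumptions for this illustration; the block layout mirrors the request shown earlier in this section.

```python
def chunks_to_document_blocks(chunks: list[dict]) -> list[dict]:
    """Turn retrieved chunks into citation-enabled document content blocks."""
    return [
        {
            "type": "document",
            "source": {"type": "text", "media_type": "text/plain", "data": c["text"]},
            "title": f"{c['source']} (chunk {c['chunk_index']})",
            "citations": {"enabled": True},
        }
        for c in chunks
    ]


def build_user_content(chunks: list[dict], question: str) -> list[dict]:
    """Documents first, then the question as the final text block."""
    return chunks_to_document_blocks(chunks) + [{"type": "text", "text": question}]


# Made-up retrieval output standing in for the top-K chunks from the vector DB.
top_chunks = [
    {"text": "The filing office assigns a unique file number.",
     "source": "filing_guide.md", "chunk_index": 5},
    {"text": "Amendments must reference the original file number.",
     "source": "compliance_faq.txt", "chunk_index": 2},
]

content = build_user_content(top_chunks, "Who assigns the file number?")
print(f"{len(content)} content blocks, last one is type={content[-1]['type']}")
```

The resulting list would be passed as the `content` of a single user message.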
The exam tests recognition that provenance is a first-class API feature, not something you bolt on after the fact. Anti-pattern: asking Claude in the prompt to "include citations like [1] [2]" and parsing them out — the model can fabricate citation numbers. Correct pattern: pass documents as citation-enabled content blocks and read structured citations[] from the response.
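Reading the structured citations back out might look like the sketch below. It assumes each text block exposes `text` and an optional `citations` list whose items carry `document_index`, `document_title`, and `cited_text`, as described above. `collect_citations` is a hypothetical helper, and the response content is faked with `SimpleNamespace` so the example runs without an API call.

```python
from types import SimpleNamespace


def collect_citations(content_blocks) -> list[dict]:
    """Pair each answer fragment with the source spans that support it."""
    pairs = []
    for block in content_blocks:
        if getattr(block, "type", None) != "text":
            continue
        # Blocks without supporting spans have no citations (None or empty).
        for cite in getattr(block, "citations", None) or []:
            pairs.append({
                "claim": block.text,
                "document_index": cite.document_index,
                "document_title": cite.document_title,
                "cited_text": cite.cited_text,
            })
    return pairs


# Hand-built stand-in for response.content (not real API output).
fake_content = [
    SimpleNamespace(type="text", text="According to the filing, ", citations=None),
    SimpleNamespace(
        type="text",
        text="the collateral is all inventory.",
        citations=[SimpleNamespace(
            document_index=0,
            document_title="Filing UCC-2024-001",
            cited_text="Collateral: all inventory of the debtor.",
        )],
    ),
]

pairs = collect_citations(fake_content)
for pair in pairs:
    print(f"{pair['claim']!r} <- {pair['document_title']}: {pair['cited_text']!r}")
```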
Multimodal Inputs — PDFs, Images, & the Files API
The retrieval pipeline you built so far assumes your documents arrive as plain text. In production they don't. They arrive as PDFs with embedded tables and figures, scanned images, charts, and screenshots. Claude has three native primitives for handling these without a separate OCR step or vision model: PDF support, vision, and the Files API. Knowing which to reach for — and when to combine them with your RAG pipeline — closes the last gap between a toy chatbot and a production document agent.
PDF Support — Skip the OCR Pipeline
Pass a PDF directly as a document content block (base64-encoded or via Files API reference) and Claude reads it natively — including tables, headings, and the layout structure that OCR'd text would lose. Limits to know: up to ~100 pages per document, ~32MB per file. For longer documents, split into chunks and use your RAG pipeline to shortlist before passing the most relevant pages.
import base64

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY env var

# Read the PDF and base64-encode it for inline transport
with open("ucc_filing.pdf", "rb") as f:
    pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_data},
                "title": "UCC-1 Filing 2024-001",
                "citations": {"enabled": True}  # ← pairs with Citations
            },
            {"type": "text", "text": "Extract debtor, secured party, and collateral. Cite each field."}
        ]
    }]
)
Vision — Images as First-Class Inputs
Same content-block paradigm with type: "image". Useful for: extracting data from charts/graphs, reading screenshots from agents that monitor dashboards, classifying scanned forms when you don't have native PDFs, and processing diagram-heavy documents (architecture diagrams, flowcharts). Cost: image tokens are calculated from image dimensions, roughly (width × height) / 750 tokens after any resizing. A 1280x800 screenshot is roughly 1,400 tokens.
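For budgeting, a back-of-the-envelope estimate can be sketched as follows. It assumes the rule of thumb from Anthropic's vision documentation (tokens ≈ width × height / 750) and ignores the automatic downscaling applied to very large images; `estimate_image_tokens` is a hypothetical helper, not an SDK function.

```python
def estimate_image_tokens(width_px: int, height_px: int) -> int:
    """Rule-of-thumb estimate: (width * height) / 750, rounded down."""
    return (width_px * height_px) // 750


# A 1280x800 dashboard screenshot
print(estimate_image_tokens(1280, 800))  # prints 1365
```

Treat the result as an approximation; the usage figures returned with each API response are the authoritative count.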
# Reuses base64 and client from the PDF example above
with open("dashboard_screenshot.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": image_data}},
            {"type": "text", "text": "Read the latency chart and report the p95 value for the last 24h."}
        ]
    }]
)
Files API — Upload Once, Reference Many Times
If you reference the same document across many requests — an agent that answers questions about a 50-page contract over a multi-turn session, or a batch job that runs 1000 questions against the same PDF — uploading via the Files API beats re-encoding the file in every request. You upload once, get a file_id, and reference it by ID in subsequent messages.
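# Step 1: upload (do this once)
# Note: in the Anthropic Python SDK the Files API is exposed under client.beta
# and requires a beta flag; check the current docs for the exact flag value.
with open("contract.pdf", "rb") as f:
    uploaded = client.beta.files.upload(
        file=("contract.pdf", f, "application/pdf"),
    )
file_id = uploaded.id  # e.g., "file_abc123"

# Step 2: reference by ID in any request
response = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    betas=["files-api-2025-04-14"],
    messages=[{
        "role": "user",
        "content": [
            {"type": "document", "source": {"type": "file", "file_id": file_id}},
            {"type": "text", "text": "What is the termination clause?"}
        ]
    }]
)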
When to Use Each — Decision Matrix
| Scenario | Use |
|---|---|
| One-off PDF question, <100 pages | document content block (base64) + Citations |
| Same PDF referenced in 50+ requests | Files API + reference by file_id |
| PDF >100 pages or >32MB | RAG to shortlist + document blocks for top-K |
| Screenshot, chart, or scanned form | image content block |
| Diagram/architecture interpretation | image content block with explicit task prompt |
| Mixed content (PDF + screenshots + text) | All three in one content array — Claude reasons across them jointly |
PDFs: ~1500–3000 tokens per page depending on text density. Images: tokens scale with image dimensions; resize before sending if you don't need full resolution. Files API: file references count toward your context window the same as inline content, so the upload is for re-use efficiency, not free context. Caching: document and image content blocks ARE cacheable via cache_control. Combine them with prompt caching for high-frequency document Q&A, and the cached portion of the input on the second-and-onward requests is billed at a steep discount (up to roughly 90% off the normal input price).
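A minimal sketch of both ideas: a rough token budget for a PDF using the per-page range above, and a document block marked cacheable. The helper names and the placeholder file ID are assumptions for illustration; `cache_control` with type "ephemeral" is the marker used by prompt caching.

```python
def estimate_pdf_token_range(pages: int) -> tuple[int, int]:
    """Low/high token estimate using the ~1500-3000 tokens-per-page range."""
    return pages * 1500, pages * 3000


def cached_document_block(file_id: str, title: str) -> dict:
    """Document block that references an uploaded file and marks it cacheable."""
    return {
        "type": "document",
        "source": {"type": "file", "file_id": file_id},
        "title": title,
        "cache_control": {"type": "ephemeral"},
    }


low, high = estimate_pdf_token_range(50)
print(f"50-page PDF: roughly {low:,} to {high:,} tokens")

# "file_abc123" is a placeholder ID, not a real uploaded file.
block = cached_document_block("file_abc123", "Master Services Agreement")
print(block["cache_control"])
```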
The single biggest production gap for document agents isn't retrieval quality — it's format handling. Teams build elegant RAG pipelines, then hit a wall when 60% of their documents are scanned PDFs with tables. Native multimodal support collapses what used to be a 3-stage pipeline (OCR → layout parsing → LLM) into a single API call. Combined with Citations and prompt caching, you get a complete document-agent stack with audit-grade provenance for under 100 lines of code.
Code Walkthrough: Chat with Your Docs
Step 1: Document Loading & Chunking
Let's start by getting our documents into the system. The load_documents function is straightforward — it reads every markdown and text file from a folder and returns them as a list. Nothing clever here, just file I/O with proper error handling (we skip files with encoding issues rather than crashing the whole pipeline).
The interesting part is chunk_text. Here's the dilemma: embedding models work best with short text — typically under 512 tokens — but our documents might be thousands of words. We could just chop the text every 500 characters, but that might cut a sentence in half. So instead, the function tries to find a natural break point — a paragraph boundary, a period, or at worst a space — before making the cut. This small detail makes a real difference in retrieval quality.
One thing to watch for: the 50-character overlap between chunks. Without overlap, information that spans a chunk boundary gets split across two chunks, and neither chunk alone contains the full thought. The overlap duplicates a small amount of text at each boundary, acting as insurance that context isn't lost.
# pip install "chromadb>=0.4.0" "anthropic>=0.30.0"
import os
import glob


def load_documents(docs_dir: str) -> list[dict]:
    """Load all .md and .txt files from a directory."""
    docs = []
    for pattern in ["*.md", "*.txt"]:
        for path in glob.glob(os.path.join(docs_dir, pattern)):
            try:
                with open(path, "r", encoding="utf-8") as f:
                    docs.append({
                        "content": f.read(),
                        "source": os.path.basename(path),
                    })
            except (IOError, UnicodeDecodeError) as e:
                # Skip unreadable files instead of failing the whole run
                print(f"Skipping {path}: {e}")
    if not docs:
        raise FileNotFoundError(f"No documents found in {docs_dir}")
    return docs


def chunk_text(
    text: str, chunk_size: int = 500, overlap: int = 50
) -> list[str]:
    """Split text into overlapping chunks, preferring natural break points.

    Assumes overlap is well below chunk_size / 2 so each step moves forward.
    """
    if len(text) <= chunk_size:
        return [text]
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        # Try to break at a paragraph or sentence boundary
        if end < len(text):
            for sep in ["\n\n", "\n", ". ", " "]:
                last_sep = chunk.rfind(sep)
                if last_sep > chunk_size * 0.5:
                    end = start + last_sep + len(sep)
                    chunk = text[start:end]
                    break
        chunks.append(chunk.strip())
        if end >= len(text):
            break  # last chunk reached; avoid emitting a duplicate tail
        start = end - overlap  # overlap for continuity
    return [c for c in chunks if c]


# Usage
docs = load_documents("./docs")
all_chunks = []
for doc in docs:
    chunks = chunk_text(doc["content"], chunk_size=500, overlap=50)
    for i, chunk in enumerate(chunks):
        all_chunks.append({
            "text": chunk,
            "source": doc["source"],
            "chunk_index": i,
        })
print(f"Loaded {len(docs)} docs, {len(all_chunks)} chunks")
// npm install chromadb@^1.7.0 @anthropic-ai/sdk@^0.30.0
import { readFileSync, readdirSync } from 'fs';
import { join } from 'path';

function loadDocuments(docsDir) {
  const files = readdirSync(docsDir).filter(
    f => f.endsWith('.md') || f.endsWith('.txt')
  );
  if (files.length === 0)
    throw new Error(`No documents found in ${docsDir}`);
  return files.map(f => {
    try {
      return {
        content: readFileSync(join(docsDir, f), 'utf-8'),
        source: f,
      };
    } catch (e) {
      // Skip unreadable files instead of failing the whole run
      console.warn(`Skipping ${f}: ${e.message}`);
      return null;
    }
  }).filter(Boolean);
}

function chunkText(text, chunkSize = 500, overlap = 50) {
  if (text.length <= chunkSize) return [text];
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    let end = start + chunkSize;
    let chunk = text.slice(start, end);
    // Try to break at a paragraph or sentence boundary
    if (end < text.length) {
      for (const sep of ['\n\n', '\n', '. ', ' ']) {
        const lastSep = chunk.lastIndexOf(sep);
        if (lastSep > chunkSize * 0.5) {
          end = start + lastSep + sep.length;
          chunk = text.slice(start, end);
          break;
        }
      }
    }
    chunks.push(chunk.trim());
    if (end >= text.length) break; // last chunk reached; avoid a duplicate tail
    start = end - overlap; // overlap for continuity
  }
  return chunks.filter(Boolean);
}

const docs = loadDocuments('./docs');
const allChunks = [];
for (const doc of docs) {
  const chunks = chunkText(doc.content, 500, 50);
  chunks.forEach((chunk, i) => {
    allChunks.push({ text: chunk, source: doc.source, chunkIndex: i });
  });
}
console.log(`Loaded ${docs.length} docs, ${allChunks.length} chunks`);
Step 2: Embed & Store in ChromaDB
Now we need somewhere to store our chunks as searchable vectors. Here's the pleasant surprise: ChromaDB handles the embedding step for you. When you call collection.add() with text documents, ChromaDB automatically runs them through its built-in embedding model and stores the resulting vectors. You never touch a separate embedding API. The "hnsw:space": "cosine" setting tells ChromaDB to use cosine similarity when searching — the standard for text embeddings, as we discussed in the Embeddings section.
One gotcha that trips up every beginner: chromadb.Client() stores everything in memory only. When your Python script exits, all your vectors disappear. For this lab that's fine — we re-ingest each run. But for anything beyond a quick experiment, switch to chromadb.PersistentClient(path="./chroma_db"), which writes to disk and survives restarts. Also worth knowing: ChromaDB's default embedding model is good enough for prototyping, but for production you'll want Voyage AI or OpenAI embeddings, which produce more accurate similarity scores for domain-specific content like legal or medical text.
import chromadb

# ChromaDB uses its own built-in embedding model by default
client = chromadb.Client()  # in-memory; use PersistentClient for disk
collection = client.get_or_create_collection(
    name="my_docs",
    metadata={"hnsw:space": "cosine"},  # use cosine similarity
)

# Add chunks with metadata
try:
    collection.add(
        ids=[f"chunk_{i}" for i in range(len(all_chunks))],
        documents=[c["text"] for c in all_chunks],
        metadatas=[{"source": c["source"], "index": c["chunk_index"]}
                   for c in all_chunks],
    )
    print(f"Stored {collection.count()} chunks in ChromaDB")
except Exception as e:
    print(f"Error storing chunks: {e}")
    raise
import { ChromaClient } from 'chromadb';

const chroma = new ChromaClient();
const collection = await chroma.getOrCreateCollection({
  name: "my_docs",
  metadata: { "hnsw:space": "cosine" },
});

try {
  await collection.add({
    ids: allChunks.map((_, i) => `chunk_${i}`),
    documents: allChunks.map(c => c.text),
    metadatas: allChunks.map(c => ({
      source: c.source, index: c.chunkIndex
    })),
  });
  const count = await collection.count();
  console.log(`Stored ${count} chunks in ChromaDB`);
} catch (e) {
  console.error(`Error storing chunks: ${e.message}`);
  throw e;
}
Step 3: Retrieve & Generate with Claude
This is where everything comes together — the "Retrieve" and "Generate" stages working as a pair. When a user asks a question, two things happen in sequence. First, we query ChromaDB to find the top-k most semantically similar chunks. Then we format those chunks into a context block and send them to Claude along with the original question. The key insight: we inject only the most relevant chunks (typically 3–5), not the entire corpus. This keeps the prompt small enough for Claude's context window while providing exactly the information needed to answer.
Pay close attention to the system prompt — it's the most important design decision in this entire function. We tell Claude to answer "based ONLY on the provided context" and to explicitly say so if the context doesn't contain the answer. Why so strict? Without this grounding instruction, Claude will cheerfully fall back to its training data and make up plausible-sounding answers — defeating the entire purpose of RAG. Also notice we wrap retrieval and generation in separate try/except blocks. These are independent failure points: ChromaDB might be down, or the Claude API might return a rate limit error. Catching them separately means you get a clear error message pointing to exactly which stage failed.
import anthropic

claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY env var


def query_rag(question: str, top_k: int = 3) -> str:
    """Retrieve relevant chunks and generate an answer with citations."""
    # Step 1: Retrieve
    try:
        results = collection.query(
            query_texts=[question],
            n_results=top_k,
        )
    except Exception as e:
        return f"Retrieval error: {e}"
    if not results["documents"] or not results["documents"][0]:
        return "No relevant documents found. I don't have enough information."
    # Distances come back alongside the documents (lower = closer). They are
    # not filtered on here, but a relevance threshold could use them.
    distances = results["distances"][0] if results.get("distances") else []
    chunks = results["documents"][0]
    sources = results["metadatas"][0]
    # Build context from retrieved chunks
    context_parts = []
    for i, (chunk, meta) in enumerate(zip(chunks, sources)):
        context_parts.append(
            f"[Source {i+1}: {meta['source']}]\n{chunk}"
        )
    context = "\n\n---\n\n".join(context_parts)
    # Step 2: Generate with Claude
    try:
        response = claude.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=(
                "You are a helpful assistant that answers questions "
                "based ONLY on the provided context. If the context "
                "doesn't contain the answer, say so. Always cite "
                "your sources using [Source N] format."
            ),
            messages=[{
                "role": "user",
                "content": (
                    f"Context:\n{context}\n\n"
                    f"Question: {question}\n\n"
                    "Answer based on the context above, citing sources:"
                ),
            }],
        )
        return response.content[0].text
    except anthropic.APIError as e:
        # Only HTTP status errors carry status_code; connection errors do not
        status = getattr(e, "status_code", "n/a")
        return f"Generation error: {status} - {e.message}"


# Test it
answer = query_rag("What are the key features of the product?")
print(answer)
import Anthropic from '@anthropic-ai/sdk';

const claude = new Anthropic(); // reads ANTHROPIC_API_KEY env var

async function queryRag(question, topK = 3) {
  // Step 1: Retrieve
  let results;
  try {
    results = await collection.query({
      queryTexts: [question],
      nResults: topK,
    });
  } catch (e) {
    return `Retrieval error: ${e.message}`;
  }
  if (!results.documents?.[0]?.length) {
    return "No relevant documents found. I don't have enough information.";
  }
  const chunks = results.documents[0];
  const sources = results.metadatas[0];
  // Build context from retrieved chunks
  const context = chunks.map((chunk, i) =>
    `[Source ${i + 1}: ${sources[i].source}]\n${chunk}`
  ).join('\n\n---\n\n');
  // Step 2: Generate with Claude
  try {
    const response = await claude.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 1024,
      system:
        "You are a helpful assistant that answers questions " +
        "based ONLY on the provided context. If the context " +
        "doesn't contain the answer, say so. Always cite " +
        "your sources using [Source N] format.",
      messages: [{
        role: "user",
        content:
          `Context:\n${context}\n\n` +
          `Question: ${question}\n\n` +
          "Answer based on the context above, citing sources:",
      }],
    });
    return response.content[0].text;
  } catch (e) {
    return `Generation error: ${e.status} - ${e.message}`;
  }
}

const answer = await queryRag("What are the key features of the product?");
console.log(answer);
The query_rag function takes a plain-English question, searches ChromaDB for the 3 most semantically similar chunks, formats them into a context block with source labels, and sends everything to Claude with a grounding system prompt. Claude responds using only the retrieved context and cites its sources. All six pipeline stages (Load, Chunk, Embed, Store, Retrieve, Generate) are now working together.
Attach a source to every claim with a structured record such as {claim, source_id, confidence}. Synthesizing answers without source attribution is an exam anti-pattern: even when the answer is correct, you can't audit it later. When multiple sources agree, mark the claim as well-established; when sources conflict, surface it as a contested claim with both source pointers rather than silently picking one. The exam tests whether your output schema enforces source mappings, not whether the answer happens to be right.
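One way to make a schema enforce source mappings is sketched below. The Claim dataclass and classify_topic helper are hypothetical names, not part of any library, and the sketch assumes the caller has already grouped claims that are about the same topic. The "10 years" claim and the other_notes.md source are invented purely to show a conflict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """One attributed statement: {claim, source_id, confidence}."""
    claim: str
    source_id: str
    confidence: float

    def __post_init__(self):
        # Enforce the source mapping when the record is created.
        if not self.source_id.strip():
            raise ValueError("every claim needs a source_id")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


def classify_topic(claims: list[Claim]) -> dict:
    """Label a group of claims about one topic (grouping is the caller's job)."""
    if not claims:
        raise ValueError("need at least one claim")
    # Assumption: two claims "agree" when their text matches after normalization.
    statements = {c.claim.strip().lower() for c in claims}
    sources = sorted({c.source_id for c in claims})
    if len(statements) > 1:
        status = "contested"          # sources disagree: keep every pointer
    elif len(sources) > 1:
        status = "well-established"   # several sources say the same thing
    else:
        status = "single-source"
    return {"status": status, "sources": sources,
            "claims": [c.claim for c in claims]}


if __name__ == "__main__":
    a = Claim("A UCC-1 is effective for 5 years", "filing_guide.md", 0.9)
    b = Claim("A UCC-1 is effective for 5 years", "compliance_faq.txt", 0.8)
    c = Claim("A UCC-1 is effective for 10 years", "other_notes.md", 0.3)  # made up
    print(classify_topic([a, b]))
    print(classify_topic([a, c]))
```

Validating in __post_init__ means a claim without a source cannot be constructed at all, which is the property the paragraph above asks for.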
Hands-On Exercise
What You'll Build
A complete RAG pipeline that ingests sample documents, stores them in ChromaDB, and answers questions with cited sources using Claude.
Time Estimate: 30–45 minutes
Prerequisites: Python 3.9+, an Anthropic API key (ANTHROPIC_API_KEY environment variable), and a terminal/command prompt.
Files You'll Create: rag_pipeline.py (main pipeline script), docs/filing_guide.md, docs/risk_assessment.md, docs/compliance_faq.txt (sample documents)
Environment Setup
mkdir rag-lab && cd rag-lab
python -m venv venv && source venv/bin/activate # Windows: venv\Scripts\activate
pip install "anthropic>=0.30.0" "chromadb>=0.4.0"
export ANTHROPIC_API_KEY=your-key-here # Windows: set ANTHROPIC_API_KEY=your-key-here
Step 1: Create Sample Documents & Ingestion Pipeline
First, we need documents to search. We'll create a docs/ folder with 3 sample files, then build the ingestion pipeline that loads, chunks, and stores them in ChromaDB. This covers the first 4 stages of the RAG pipeline: Load → Chunk → Embed → Store.
Create a docs/ folder and add these 3 files:
docs/filing_guide.md
# UCC Filing Guide
## Initial Filing (UCC-1)
A UCC-1 financing statement must be filed in the state where the debtor is located. For individuals, this is their principal residence. For registered organizations (corporations, LLCs), this is the state of organization.
The filing office assigns a unique file number and timestamps the filing. A properly filed UCC-1 is effective for 5 years from the date of filing.
## Continuation Statements
To maintain perfection beyond 5 years, a continuation statement (UCC-3) must be filed within 6 months before the 5-year lapse date. Missing this window means the filing lapses and the secured party loses priority.
## Amendments
Amendments can change the collateral description, add or remove debtors, or assign the financing statement to a new secured party. Each amendment receives its own file number but references the original UCC-1.
docs/risk_assessment.md
# Risk Assessment Criteria
## High Risk Indicators
- Debtor has multiple UCC filings from different secured parties (competing liens)
- Collateral description uses broad terms like "all assets" (blanket lien)
- Filing is within 90 days of a bankruptcy petition (preference risk)
- Debtor name on filing doesn't exactly match legal name (perfection defect)
## Medium Risk Indicators
- Continuation statement filed close to the deadline (less than 30 days before lapse)
- Collateral type is inventory or accounts receivable (high turnover)
- Secured party is an individual rather than an institution
## Low Risk Indicators
- Single secured party with specific collateral description
- Filing has been continued at least once (established relationship)
- Debtor is a well-established registered organization
docs/compliance_faq.txt
Q: What happens if a continuation statement is filed late?
A: If filed after the 5-year lapse date, the original filing is no longer effective. The secured party must file a new UCC-1, and their priority date resets to the new filing date. Any liens filed by other parties in the gap period will have priority.
Q: Can a UCC filing be terminated?
A: Yes. The secured party can file a UCC-3 termination statement. Once filed, the financing statement is no longer effective. The debtor can also demand termination if the obligation has been satisfied.
Q: What is the difference between a UCC-1 and a UCC-3?
A: A UCC-1 is the initial financing statement that creates the public record. A UCC-3 is used for all subsequent changes: continuations, amendments, assignments, and terminations. The UCC-3 always references the original UCC-1 file number.
Now create rag_pipeline.py with the ingestion pipeline:
import os
import glob
import json
import anthropic
import chromadb

# ── Document Loading ─────────────────────────────────────────
def load_documents(docs_dir: str) -> list[dict]:
    """Load all .md and .txt files from a directory."""
    docs = []
    for pattern in ["*.md", "*.txt"]:
        for path in glob.glob(os.path.join(docs_dir, pattern)):
            try:
                with open(path, "r", encoding="utf-8") as f:
                    docs.append({"content": f.read(), "source": os.path.basename(path)})
            except (IOError, UnicodeDecodeError) as e:
                print(f"  Skipping {path}: {e}")
    if not docs:
        raise FileNotFoundError(f"No documents found in {docs_dir}")
    return docs

# ── Chunking ─────────────────────────────────────────────────
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks with overlap at natural boundaries."""
    if len(text) <= chunk_size:
        return [text]
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        if end < len(text):
            # Prefer to cut at a paragraph, line, sentence, or word boundary
            for sep in ["\n\n", "\n", ". ", " "]:
                last_sep = chunk.rfind(sep)
                if last_sep > chunk_size * 0.5:
                    end = start + last_sep + len(sep)
                    chunk = text[start:end]
                    break
        chunks.append(chunk.strip())
        if end >= len(text):
            # Reached the end; stop so the overlap is not emitted again as a tiny extra chunk
            break
        start = end - overlap
    return [c for c in chunks if c]

# ── Ingestion (Load → Chunk → Embed → Store) ────────────────
def ingest(docs_dir: str = "./docs") -> chromadb.Collection:
    """Load docs, chunk them, and store in ChromaDB."""
    print("── Ingestion Pipeline ──")
    docs = load_documents(docs_dir)
    print(f"  Loaded {len(docs)} documents")
    all_chunks = []
    for doc in docs:
        chunks = chunk_text(doc["content"], chunk_size=500, overlap=50)
        for i, chunk in enumerate(chunks):
            all_chunks.append({"text": chunk, "source": doc["source"], "index": i})
    print(f"  Created {len(all_chunks)} chunks")
    client = chromadb.Client()  # in-memory for simplicity
    collection = client.get_or_create_collection(
        name="rag_lab", metadata={"hnsw:space": "cosine"}
    )
    # ChromaDB embeds the documents with its default embedding function on add
    collection.add(
        ids=[f"chunk_{i}" for i in range(len(all_chunks))],
        documents=[c["text"] for c in all_chunks],
        metadatas=[{"source": c["source"], "index": c["index"]} for c in all_chunks],
    )
    print(f"  Stored {collection.count()} chunks in ChromaDB")
    return collection

# ── Run ingestion ────────────────────────────────────────────
if __name__ == "__main__":
    collection = ingest()
    print("\n✓ Ingestion complete. Ready for queries.")
Run the ingestion:
python rag_pipeline.py
If you see 3 documents loaded and 6–10 chunks stored, Step 1 is working. The exact chunk count depends on document length and boundary detection. If you see 0 documents, check that your docs/ folder is in the same directory as rag_pipeline.py.
- ModuleNotFoundError: No module named 'chromadb' → Run pip install chromadb
- FileNotFoundError: No documents found → Make sure the docs/ folder exists and contains .md or .txt files in the same directory where you run the script
- Very few chunks (1–2) → Your documents may be very short. Add more content to the sample files or lower chunk_size to 200.
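For a rough sense of how many chunks to expect from a document, a back-of-the-envelope estimate for fixed-size chunking is sketched below. estimate_chunks is a hypothetical helper, not part of the lab script. It assumes every chunk is exactly chunk_size characters long and ignores the boundary detection in chunk_text, which can only shorten chunks, so real counts may come out somewhat higher.

```python
import math


def estimate_chunks(length: int, chunk_size: int = 500, overlap: int = 50) -> int:
    """Estimate the chunk count for a text of `length` characters."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    if length <= chunk_size:
        return 1
    # Each new chunk starts (chunk_size - overlap) characters after the previous one.
    stride = chunk_size - overlap
    return 1 + math.ceil((length - chunk_size) / stride)


if __name__ == "__main__":
    for length in (400, 900, 1200):
        print(length, "characters ->", estimate_chunks(length), "chunk(s)")
    # 400 -> 1, 900 -> 2, 1200 -> 3
```

The first chunk covers chunk_size characters and every later chunk adds only chunk_size minus overlap new characters, which is why the overlap shows up in the denominator.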
Step 2: Add Retrieval & Generation with Claude
Depends on: Step 1. This step uses the collection object and ingest() function from Step 1, so if you're starting fresh, complete Step 1 first.
Now we wire up the last two pipeline stages — Retrieve and Generate. This function searches ChromaDB for the top-3 most relevant chunks, formats them into a context block with source labels, and sends everything to Claude with a grounding prompt.
Append the query_rag function to rag_pipeline.py and replace the existing if __name__ == "__main__" block with the code below (keep the imports and the load_documents, chunk_text, and ingest functions from Step 1):
# ── Query (Retrieve → Generate) ──────────────────────────────
def query_rag(collection, question: str, top_k: int = 3, verbose: bool = True) -> str:
    """Search for relevant chunks and generate an answer with Claude."""
    client = anthropic.Anthropic()
    # Retrieve
    results = collection.query(query_texts=[question], n_results=top_k)
    chunks = results["documents"][0]
    sources = results["metadatas"][0]
    distances = results["distances"][0] if results.get("distances") else []
    if verbose:
        print(f"\n  Query: {question}")
        print(f"  Retrieved {len(chunks)} chunks:")
        for i, (chunk, meta) in enumerate(zip(chunks, sources)):
            dist = f" (distance: {distances[i]:.3f})" if distances else ""
            print(f"    [{i+1}] {meta['source']}{dist}: {chunk[:80]}...")
    # Format context
    context = "\n\n---\n\n".join(
        f"[Source {i+1}: {meta['source']}]\n{chunk}"
        for i, (chunk, meta) in enumerate(zip(chunks, sources))
    )
    # Generate
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=(
            "You are a helpful assistant that answers questions "
            "based ONLY on the provided context. If the context "
            "doesn't contain the answer, say so explicitly. "
            "Always cite sources using [Source N] format."
        ),
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer based on the context above, citing sources:",
        }],
    )
    return response.content[0].text

# ── Test Queries ─────────────────────────────────────────────
if __name__ == "__main__":
    collection = ingest()
    print("\n" + "=" * 55)
    print("RAG QUERY TESTS")
    print("=" * 55)
    questions = [
        "How long is a UCC-1 filing effective?",
        "What are the high risk indicators for a UCC filing?",
        "What happens if a continuation statement is filed late?",
        "What is the difference between a UCC-1 and a UCC-3?",
        "What is the weather in Tokyo?",  # Should say "not in context"
    ]
    for q in questions:
        answer = query_rag(collection, q)
        print(f"\n  Answer: {answer[:200]}...")
        print(f"  {'─' * 50}")
Run the complete pipeline:
python rag_pipeline.py
Look for these key behaviors:
- Questions 1–4: Should return accurate, cited answers drawn from the sample documents
- Question 5 (weather): Should say the context doesn't contain the answer — this confirms the grounding prompt is working
- Source citations: Answers should include [Source 1], [Source 2] references matching the retrieved chunks
- Distances: Lower distances mean more relevant chunks. The best match should have the lowest distance.
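To see why a lower distance means a closer match, it helps to compute one by hand. The collection was created with "hnsw:space": "cosine", and the sketch below assumes the reported distance is 1 minus the cosine similarity of the two embeddings. The 3-dimensional vectors are toy values made up for illustration; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Distance form used for ranking: smaller means more similar."""
    return 1.0 - cosine_similarity(a, b)


if __name__ == "__main__":
    query = [1.0, 0.0, 1.0]
    same_direction = [2.0, 0.0, 2.0]  # longer vector, same direction
    unrelated = [0.0, 1.0, 0.0]       # orthogonal to the query
    print(f"same direction: {cosine_distance(query, same_direction):.3f}")
    print(f"unrelated:      {cosine_distance(query, unrelated):.3f}")
```

The first pair points the same way, so its distance is approximately 0 even though the vectors differ in length. The orthogonal pair has distance 1.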
- AuthenticationError → Check your ANTHROPIC_API_KEY is set correctly
- Claude answers the weather question → The grounding system prompt may not be strict enough. Make sure it says "based ONLY on the provided context"
- Irrelevant chunks retrieved → Try smaller chunk_size (200 instead of 500) or add more focused content to your sample documents
- APIError: 529 → The test makes 5 API calls. Wait 30 seconds and try again, or run fewer questions.
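The "wait and try again" advice for overloaded (529) responses can be automated with exponential backoff. The sketch below is deliberately library-agnostic: with_retries, TransientError, and the delay values are hypothetical, and in the lab script you would catch the SDK's own error type in place of TransientError.

```python
import time


class TransientError(Exception):
    """Stand-in for a retryable failure such as an overloaded API response."""


def with_retries(call, max_attempts: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Run call(); on TransientError wait base_delay, 2x, 4x, ... and retry."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the failure
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        # Fails twice, then succeeds, to exercise the retry path.
        state["calls"] += 1
        if state["calls"] < 3:
            raise TransientError("overloaded")
        return "ok"

    print(with_retries(flaky, base_delay=0.01), "after", state["calls"], "calls")
```

Passing sleep as a parameter keeps the helper easy to test, since a test can record the requested delays instead of actually waiting.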
Verify Everything Works
Run the complete pipeline end-to-end. All 5 queries should return answers, with the first 4 citing sources and the 5th correctly declining to answer.
If all queries complete and you see cited answers for domain questions and a "not in context" response for the weather question, you've built a working RAG pipeline.
You've built a complete RAG system from scratch — document loading, chunking with overlap, vector storage in ChromaDB, semantic retrieval, and grounded generation with Claude. This is the same architecture used by production document Q&A systems, knowledge bases, and AI assistants.
- Add metadata filtering: collection.query(where={"source": "filing_guide.md"}) to restrict search to specific files
- Implement a relevance threshold: if all chunk distances are above 0.8, return "I don't have enough information"
- Experiment with chunk sizes of 200, 500, and 1000 — which gives the best answers for the UCC content?
- Add a /sources command that prints the raw retrieved chunks and similarity scores before the answer
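As a starting point for the relevance-threshold idea, the filter below is a minimal sketch. filter_by_distance and answer_or_decline are hypothetical helpers. They assume the results dictionary has the nested-list shape that query_rag already reads (documents, metadatas, and distances, with one inner list per query), and the 0.8 cutoff is the starting value suggested in the list above, not a tuned number. The sample results are made up.

```python
NO_ANSWER = "I don't have enough information."


def filter_by_distance(results: dict, max_distance: float = 0.8) -> list[dict]:
    """Keep only the retrieved chunks whose distance is within the cutoff."""
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    dists = results["distances"][0]
    return [
        {"text": doc, "source": meta["source"], "distance": dist}
        for doc, meta, dist in zip(docs, metas, dists)
        if dist <= max_distance
    ]


def answer_or_decline(results: dict, max_distance: float = 0.8):
    """Return the usable chunks, or the decline message if none are relevant."""
    kept = filter_by_distance(results, max_distance)
    return kept if kept else NO_ANSWER


if __name__ == "__main__":
    # Made-up results in the assumed shape, for illustration only.
    sample = {
        "documents": [["UCC-1 filings last 5 years.", "Unrelated text."]],
        "metadatas": [[{"source": "filing_guide.md"}, {"source": "compliance_faq.txt"}]],
        "distances": [[0.21, 0.93]],
    }
    print(answer_or_decline(sample))
    print(answer_or_decline(sample, max_distance=0.1))
```

In query_rag, the check would sit between retrieval and generation so that Claude is never called with context that is unlikely to contain the answer.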
Knowledge Check
Q1: What problem does RAG solve that prompt engineering alone cannot?
Q2: Given a 2000-character document, chunk size of 500, and overlap of 50, approximately how many chunks result?
Q3: Why is cosine similarity preferred over Euclidean distance for comparing text embeddings?
Q4: Your RAG system retrieves irrelevant chunks for user queries. Which fix is most likely to help?
Q5: What is the correct order for the RAG pipeline steps?
Q6: (Recall from M05/M06) In the RAG query function, you call Claude's Messages API. If the API call fails, what should your code do?
Module Summary
Key Concepts
- The knowledge problem: LLMs have a training cutoff and can't access private data — RAG fixes this.
- RAG pipeline: Load → Chunk → Embed → Store → Retrieve → Generate.
- Embeddings: Dense vectors that capture semantic meaning. Cosine similarity measures closeness.
- Chunking: How you split documents directly determines retrieval quality. Experiment with sizes.
- Vector databases: ChromaDB for prototyping, Pinecone/pgvector for production scale.
Next: M10 — Advanced RAG
You've built a basic RAG system. In M10, you'll tackle the hard problems: hybrid search (combining semantic + keyword), reranking retrieved results, handling multi-document queries, and evaluating retrieval quality with metrics like MRR and recall@k.