M09: RAG — Retrieval-Augmented Generation

Claude is brilliant but has a knowledge cutoff. RAG solves this by giving Claude access to your documents at query time — no fine-tuning required. You'll build a complete "Chat with Your Docs" system.

Learning Objectives

  • Explain why RAG is needed and what problem it solves that prompt engineering cannot
  • Describe the end-to-end RAG pipeline: load, chunk, embed, store, retrieve, generate
  • Explain how text embeddings and cosine similarity enable semantic search
  • Compare chunking strategies and their impact on retrieval quality
  • Build a complete RAG system with ChromaDB and Claude

The Knowledge Problem

Everyday Analogy

BEFORE: Imagine a brilliant doctor who graduated in 2023 but has never seen a 2025 medical journal — they practiced medicine using only what they learned in school and residency, with no access to anything published since.

PAIN: When a patient asks about a breakthrough treatment approved six months ago, the doctor either says "I don't know" or, worse, confidently recommends an outdated protocol that has since been superseded — and the patient has no way to tell the difference.

MAPPING: That's exactly the knowledge cutoff problem with LLMs: Claude is the doctor, your private company docs and recent data are the new medical journals, and RAG is the system that puts those journals on the doctor's desk right when they need them. (Knowledge cutoff problem: LLMs are trained on data up to a specific date. They cannot access information published after that date, proprietary data, or private documents. This leads to hallucinations when asked about topics outside their training data.)

Technical Definition

LLMs have a fixed training cutoff (the date after which no data was included in the model's training set) and lack access to private, proprietary, or recently published data. Claude has no knowledge of events, documents, or data published after its cutoff. This leads to hallucinations when the model is asked about unknown topics: it generates confident, plausible-sounding text that is factually incorrect, "filling in" answers from patterns rather than facts. Prompt engineering alone can't fix this: no matter how clever your prompt, Claude can't reference a document it has never seen.

Why It Matters

RAG (Retrieval-Augmented Generation) combines information retrieval (searching a document corpus) with text generation (Claude). At query time, relevant documents are retrieved and injected into the prompt, which gives Claude access to knowledge beyond its training data and eliminates the need to fine-tune or retrain the model. Real-world impact: A customer support team at a mid-size SaaS company implemented RAG over their 2,000-page knowledge base and reduced hallucinated answers from ~35% to under 3%, while cutting average ticket resolution time from 12 minutes to 4 minutes. Fine-tuning a model on the same data would have cost $5,000–$50,000 and taken weeks; their RAG prototype was running in an afternoon. Your company docs, recent research, and private databases all become accessible without retraining anything.

What Is RAG? (End-to-End Pipeline)

Everyday Analogy

BEFORE: Imagine taking a closed-book exam where you must answer every question purely from memory — you spent months cramming, but there are thousands of facts you simply could not memorize.

PAIN: When a question covers a topic you didn't study deeply, you either leave it blank or guess — and guessing confidently is exactly what hallucination looks like in an LLM.

MAPPING: RAG turns this into an open-book exam: Claude is the student, your document corpus is the textbook sitting on the desk, and the retrieval system is the index at the back of the book that helps Claude flip to exactly the right page instead of guessing from memory.

Here's what this looks like in practice. When a user asks a question, the retrieval step returns actual chunk objects like this:

Retrieved chunks (what the retriever hands to Claude)
[
  { "text": "UCC amendments must be filed within...", "source": "filing_guide.md", "score": 0.92 },
  { "text": "The debtor name on a UCC-1 must exactly...", "source": "compliance_faq.txt", "score": 0.87 },
  { "text": "Continuation statements are due every...", "source": "filing_guide.md", "score": 0.84 }
]

Those chunk texts get pasted directly into Claude's prompt as context. Claude reads them, writes an answer, and cites [Source 1: filing_guide.md]. The user never sees the retrieval machinery — they just get an accurate, sourced answer.
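To make the "pasted into the prompt" step concrete, here is a minimal sketch of how retrieved chunks could be formatted into a grounded prompt. The function name `build_rag_prompt` and the exact instruction wording are assumptions for illustration, not a fixed API; the chunk dictionaries mirror the example above.

```python
def build_rag_prompt(question, chunks):
    """Format retrieved chunks into a prompt with numbered, named sources."""
    context_parts = []
    for i, chunk in enumerate(chunks, start=1):
        # Label each chunk so the model can cite it as [Source N: filename].
        context_parts.append(f"[Source {i}: {chunk['source']}]\n{chunk['text']}")
    context = "\n\n".join(context_parts)
    return (
        "Answer based ONLY on the provided context. "
        "Cite the sources you used by their labels. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# The same three chunks shown in the retrieval example.
retrieved = [
    {"text": "UCC amendments must be filed within...", "source": "filing_guide.md", "score": 0.92},
    {"text": "The debtor name on a UCC-1 must exactly...", "source": "compliance_faq.txt", "score": 0.87},
    {"text": "Continuation statements are due every...", "source": "filing_guide.md", "score": 0.84},
]

prompt = build_rag_prompt("How do I file a UCC amendment?", retrieved)
print(prompt)
```

The resulting string would be sent as the user message. Keeping the source label next to each chunk is what lets the answer point back to a specific file.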

Technical Definition

RAG works in two phases. First, the setup phase (you do this once): take your documents and break them into small pieces. This splitting process is called chunking; each chunk should be small enough to embed meaningfully but large enough to preserve context, typically 200 to 1,000 tokens. Next, convert each piece into a dense numerical vector (an array of floats, typically 768 to 1,536 dimensions) using an embedding model. Semantically similar texts produce vectors that are close together in this high-dimensional space, which is what enables similarity search. Finally, save those vectors in a vector database: a database optimized for storing and searching high-dimensional vectors, which uses approximate nearest neighbor (ANN) algorithms to quickly find the vectors most similar to a query vector. Examples: ChromaDB, Pinecone, pgvector. Second, the query phase (this happens every time a user asks a question): search that database for the pieces most related to the question. Then paste them into the prompt and let Claude answer using that context. That's it: search, then generate.
⚠️ Common Misconceptions

"RAG is just fine-tuning, right?" — No. Fine-tuning changes the model's internal weights — it costs $5,000–$50,000, takes weeks, and requires ML expertise. RAG doesn't touch the model at all. It just adds relevant text to the prompt at query time. Think of it as giving Claude a cheat sheet during the exam, not retraining Claude's brain.

"Bigger chunks = better results?" — Usually the opposite. A 2,000-word chunk that contains one relevant sentence dilutes the signal. The retriever found the right chunk, but Claude has to wade through 1,999 irrelevant words to find the answer. Smaller, focused chunks almost always retrieve better.

"RAG eliminates hallucinations" — It reduces them significantly (from ~35% to under 3% in typical deployments), but it doesn't eliminate them entirely. Claude can still misinterpret retrieved context, combine facts incorrectly, or fill gaps with plausible-sounding guesses. Always include the grounding instruction ("answer based ONLY on the provided context") and verify critical outputs.

"You need a vector database to do RAG" — For production, yes — vector databases give you fast ANN search over millions of vectors. But basic RAG works with simple in-memory similarity search over a few hundred chunks. Don't let infrastructure requirements stop you from prototyping.

"More retrieved chunks = better answers" — Diminishing returns kick in fast. Retrieving 3–5 highly relevant chunks is typically better than retrieving 20 chunks of mixed quality. Too many chunks flood Claude's context with noise, and information in the middle of long context gets lower recall (the "lost in the middle" effect).

RAG Pipeline — End-to-End Flow
[Diagram. Setup phase (once): Documents (PDFs, CSVs, APIs; 1,000s of docs) → Chunk (split into pieces, 300-500 chars each) → Embed (text → vectors, 1,536 dimensions) → Vector DB (indexed embeddings). Query phase (every request): User question ("How do I file a UCC amendment?") → Embed query (same model) → Top-K chunks (k=3 most relevant) → Claude + context (system prompt + retrieved chunks + question) → Cited response ([Source: filing_guide.md]).]

Embeddings Explained from Scratch

Everyday Analogy

BEFORE: Think of traditional keyword search: you type "heart attack treatment" and the engine only finds documents containing those exact words — missing a document titled "myocardial infarction therapy" even though it means the same thing.

PAIN: This keyword mismatch problem means relevant results are invisible unless the author happened to use the exact same vocabulary as the searcher, leading to frustrating dead ends and missed information.

MAPPING: Embeddings fix this by placing words as coordinates in meaning-space: "king" and "queen" live close together, just as Paris and London are nearby on a map of capital cities. Cosine similarity measures the angle between two embedding vectors: a small angle means similar meaning, regardless of the words used. It ranges from -1 (opposite meaning) to 1 (identical meaning), and it is preferred over Euclidean distance for text embeddings because it compares direction (meaning) regardless of magnitude (text length).

So what IS an embedding? In plain English, it's a way of representing text as a point in space. Each word, sentence, or paragraph gets converted into a list of numbers — and those numbers encode what the text means, not just what words it contains. Two sentences that say the same thing in different words will end up at nearby points, even if they share zero words in common.

How does this actually work under the hood? An embedding model (a specialized neural network, much smaller than Claude) reads your text and outputs a fixed-size array of floats — typically 1,536 numbers. During training, the model learned to place semantically similar text at nearby coordinates. It doesn't follow rules you wrote; it learned these associations from millions of text pairs. The result is that "cancel my subscription" and "I want to stop paying" land near each other, even though they share no keywords.

If you're familiar with keyword search (like SQL's LIKE '%term%'), here's the key difference: keyword search requires an exact character match, so "heart attack" won't find "myocardial infarction." Embedding-based search doesn't care about specific words at all — it compares meaning. This is why RAG systems use embeddings instead of traditional search: your users won't always use the same vocabulary as your documents.

So what does an embedding actually look like? It's just an array of floating-point numbers — typically 1,536 of them:

Actual embedding vector (truncated)
"heart attack treatment"        → [0.023, -0.841, 0.129, 0.567, -0.302, ... 1531 more floats]
"myocardial infarction therapy" → [0.019, -0.838, 0.134, 0.571, -0.298, ... 1531 more floats]
"best pizza recipes"            → [0.892, 0.104, -0.667, 0.023, 0.445, ... 1531 more floats]

Notice how the first two vectors start with almost identical numbers — they mean the same thing, so their coordinates are nearly the same. The pizza vector looks completely different. That's the whole trick: meaning becomes geometry, and "similar meaning" becomes "short distance."

Technical Definition

Text embeddings are dense numerical vectors: long arrays of floating-point numbers, typically 768 to 1,536 dimensions. They are produced by a specialized embedding model, a neural network trained specifically to convert text into fixed-size vectors that capture semantic meaning (examples include Voyage AI embeddings and OpenAI's text-embedding-3-small). This is not the same as an LLM: embedding models output numbers, not text. When two pieces of text mean similar things, their vectors end up close together, measured by cosine similarity (a score from -1 to 1, where values near 1 mean nearly identical meaning; in practice, scores between text embeddings are mostly positive). This is what makes semantic search possible: finding relevant content by meaning rather than keyword matching.
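As a small worked example of cosine similarity, the sketch below computes it by hand for the three truncated vectors shown earlier. It uses only the first five numbers of each vector, so the scores are illustrative rather than what a real 1,536-dimension model would return.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# First five dimensions of the example vectors (the rest are elided in the text).
heart_attack = [0.023, -0.841, 0.129, 0.567, -0.302]
infarction = [0.019, -0.838, 0.134, 0.571, -0.298]
pizza = [0.892, 0.104, -0.667, 0.023, 0.445]

sim_medical = cosine_similarity(heart_attack, infarction)
sim_pizza = cosine_similarity(heart_attack, pizza)

print(f"heart attack vs myocardial infarction: {sim_medical:.3f}")
print(f"heart attack vs pizza:                 {sim_pizza:.3f}")
```

The two medical phrases score very close to 1, while the pizza vector points in a different direction and scores far lower. A retriever ranks chunks by exactly this kind of score.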
⚠️ Common Misconceptions About Embeddings

"Embeddings understand language like humans do" — No. Embeddings capture statistical patterns of word co-occurrence, not true understanding. "Bank" (financial) and "bank" (river) get the same embedding unless the surrounding context disambiguates them. Always embed enough context (a sentence or paragraph), not isolated words.

"All embedding models produce the same quality" — Far from it. A general-purpose embedding model may not distinguish "UCC-1 filing" from "UCC-3 amendment" well, because it wasn't trained on legal text. Domain-specific or fine-tuned embedding models can dramatically improve retrieval quality for specialized content.

"You can compare embeddings from different models" — You cannot. An embedding from Voyage AI and one from OpenAI live in completely different vector spaces. Cosine similarity between them is meaningless. Always use the same model for both indexing and querying.

Embedding Space — 2D Scatter
[Scatter plot. Axes: Semantic Dimension 1 and Semantic Dimension 2. Medical cluster: "heart attack treatment", "myocardial infarction", "cardiac symptoms", "blood pressure meds". Legal/UCC cluster: "UCC filing requirements", "lien priority rules", "debtor obligations", "continuation statements". Food cluster: "best pizza recipes", "pasta cooking tips". The query "UCC amendment filing process" is plotted with its top-K matches (k=3) highlighted.]

Chunking Strategies

Everyday Analogy

BEFORE: Imagine you have a 300-page textbook and you need to create study flashcards — but you have to decide how much text goes on each card before you start studying.

PAIN: Cut too big (an entire chapter per card) and you waste time re-reading pages of irrelevant material to find one sentence. Cut too small (one sentence per card) and you lose the surrounding context that makes that sentence meaningful — "the treatment was effective" means nothing without knowing which treatment and for what condition.

MAPPING: Chunking is exactly this trade-off for your RAG system: each chunk is a flashcard that the retrieval engine can hand to Claude, and the right card size depends on the structure and density of your specific material.

Chunking is the process of splitting your documents into smaller pieces before embedding. Why not just embed the whole document? Because embedding models work best with short text (under ~512 tokens). A 5,000-word document crammed into a single vector loses nuance — the embedding becomes a blurry average of everything the document discusses, making it hard to match specific questions.

The core challenge is finding the right granularity. Too large, and each chunk covers multiple topics, so the retriever finds the chunk but Claude has to hunt for the relevant sentence inside a wall of text. Too small, and each chunk loses context — "the treatment was effective" tells you nothing without knowing which treatment or which patient. Most teams start with 300–500 character chunks and tune from there based on retrieval quality.

If you've used database pagination (LIMIT 50 OFFSET 100), chunking feels similar: you're dividing data into pages. But there's a crucial difference: database pages don't overlap, while chunks should. Without overlap, a sentence that spans a chunk boundary gets split across two chunks, and neither chunk alone answers the question. The overlap parameter (typically 10–20% of chunk size) duplicates a short span of text at each boundary to prevent this.

Here's what actual chunks look like after splitting a short document. Notice how the overlap zone duplicates text at chunk boundaries to preserve context:

Original document (~200 chars)
"UCC filings must be filed in the state where the debtor is located. The filing office assigns a unique file number. Continuation statements must be filed within 6 months before the 5-year lapse date."
After chunking (chunk_size=120, overlap=30, cutting at sentence boundaries) → 2 chunks
Chunk 0: "UCC filings must be filed in the state where the debtor is located. The filing office assigns a unique file number."
Chunk 1: "assigns a unique file number. Continuation statements must be filed within 6 months before the 5-year lapse date."
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ← overlap zone (30 chars from previous chunk)

The overlap ensures the phrase "assigns a unique file number" appears in both chunks, so a query about file numbers can match whichever chunk the retriever surfaces. Note that only Chunk 0 contains the full sentence; a larger overlap would carry the whole sentence into Chunk 1.
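The splitting behavior in this example can be sketched in a few lines. This is a minimal illustration, not the module's own implementation: the function name `chunk_with_overlap` and the order of preferred break points (paragraph break, then sentence end, then space) are assumptions.

```python
def chunk_with_overlap(text, chunk_size=500, overlap=50):
    """Split text into chunks of at most chunk_size chars, sharing `overlap` chars."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            # Back off to the last natural boundary inside the window.
            for sep in ("\n\n", ". ", " "):
                cut = window.rfind(sep)
                if cut > overlap:  # guarantees the next start moves forward
                    end = start + cut + len(sep)
                    break
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            break
        start = end - overlap  # step back so neighbors share `overlap` chars
    return chunks


doc = (
    "UCC filings must be filed in the state where the debtor is located. "
    "The filing office assigns a unique file number. "
    "Continuation statements must be filed within 6 months before the 5-year lapse date."
)

for i, chunk in enumerate(chunk_with_overlap(doc, chunk_size=120, overlap=30)):
    print(f"Chunk {i}: {chunk!r}")
```

With chunk_size=120 and overlap=30 this yields two chunks, and the second begins with the tail of the first. Because the overlap is measured in characters, the second chunk can start mid-sentence; a sentence-aware overlap is a common refinement.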

Technical Definition

Chunking divides source documents into smaller segments so each one can be embedded and searched independently. There are three main strategies. Fixed-size chunking splits by character or token count with overlap between adjacent chunks. Overlap is the number of characters or tokens shared between neighbors, typically 10-20% of the chunk size (e.g., chunk_size=500, overlap=50), and it ensures that information at chunk boundaries isn't lost. This strategy is simple and predictable, but may cut mid-sentence. Recursive chunking tries to split at natural boundaries first (headings, then paragraphs, then sentences), respecting document structure. Semantic chunking is the most advanced: it detects topic shifts by comparing embedding similarity between consecutive paragraphs, and splits where the topic changes.
Chunking Strategy Comparison
Original document: 2,000 characters covering 3 topics (Filing Rules, Fee Schedule, Deadlines).

Strategy | Settings | Chunks | Result
Fixed-size | 500 chars each, 50 overlap | 5 equal chunks | ✘ cuts through topics and mid-sentence
Recursive | Headings, then paragraph boundaries | 4 natural chunks (Filing Rules §1, Filing Rules §2, Fee Schedule, Deadlines) | ✔ respects headings
Semantic | Topic-shift detection | 3 chunks (Filing Rules, Fee Schedule, Deadlines; each complete) | ✔ topic-aligned
Key Insight Chunking strategy directly determines retrieval quality. Poor chunks mean the right answer exists in your corpus but the system cannot find it. Always experiment with chunk sizes for your specific content.
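The semantic strategy described above can be sketched as follows, assuming each paragraph has already been embedded. The function name `split_on_topic_shift`, the 0.5 threshold, and the three-dimensional toy vectors are all illustrative assumptions; real embeddings are much longer and a sensible threshold depends on the model.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def split_on_topic_shift(paragraphs, vectors, threshold=0.5):
    """Group consecutive paragraphs; start a new chunk when similarity drops."""
    if len(paragraphs) != len(vectors):
        raise ValueError("need exactly one vector per paragraph")
    if not paragraphs:
        return []
    groups = [[paragraphs[0]]]
    for prev_vec, vec, para in zip(vectors, vectors[1:], paragraphs[1:]):
        if cosine_similarity(prev_vec, vec) < threshold:
            groups.append([para])    # topic changed: open a new chunk
        else:
            groups[-1].append(para)  # same topic: extend the current chunk
    return ["\n\n".join(group) for group in groups]


paragraphs = [
    "Filing rules, part 1",
    "Filing rules, part 2",
    "Fee schedule",
    "Deadlines",
]
# Toy vectors standing in for real embeddings: the first two point the same way.
vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

topic_chunks = split_on_topic_shift(paragraphs, vectors)
for i, chunk in enumerate(topic_chunks):
    print(f"Chunk {i}: {chunk!r}")
```

The two filing-rules paragraphs stay together while the fee schedule and deadlines each get their own chunk, matching the three topic-aligned chunks in the comparison.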

Vector Databases

Everyday Analogy

BEFORE: Imagine a traditional library where books are shelved alphabetically by author — to find everything about "machine learning in healthcare," you'd need to know every author who ever wrote on that topic and check each shelf individually.

PAIN: You inevitably miss relevant books because you didn't know the author's name, and you waste hours browsing shelves that are organized by a criterion (author name) that has nothing to do with what you actually care about (the topic).

MAPPING: A vector database is a library where books are shelved by what they're about — all books about "machine learning in healthcare" end up near each other regardless of who wrote them, and you find them by describing the topic you need rather than knowing an exact title or keyword.

Here's what a "row" in a vector database actually contains — the vector, the original text, and metadata sitting side by side:

One row in ChromaDB (conceptual view)
{
  "id": "chunk_14",
  "vector": [0.023, -0.841, 0.129, ... 1533 more floats],
  "document": "UCC continuation statements must be filed within 6 months before the 5-year lapse date...",
  "metadata": { "source": "filing_guide.md", "chunk_index": 14 }
}
Anatomy of a vector DB row (four fields, one record)
id ("chunk_14"): unique key, used to re-fetch by id
vector (1,536 floats): what gets searched, by cosine similarity against the query vector
document ("UCC continuation statements must be filed within 6 months..."): what gets returned, the chunk Claude reads
metadata (source: filing_guide.md, chunk_index: 14): filters and provenance, where the chunk came from
All four fields travel together: vector for searching, text for reading, metadata for filtering.

When you query the database, it compares your question's vector against the stored vectors (typically through an index rather than a full scan), finds the closest matches, and returns the document text and metadata for each. You never have to work with the raw vectors yourself; the database handles that.

Technical Definition A vector database is a specialized storage system designed for one job: storing high-dimensional vectors (those long arrays of floats from the embedding step) and finding the closest matches to a query vector extremely fast. It also stores the original text and any metadata you attach — so when a search returns a match, you get back the actual chunk text, not just a vector.
Query flow: what happens on .query()
[Diagram: user question ("How long until…?") → embed → query vector [0.02, -0.84, …] → ANN search in the vector DB (top-K by cosine) → top-K matches ranked by similarity (chunk_4 0.91, chunk_2 0.88, chunk_9 0.84, chunk_1 0.79) → your code gets text + metadata. You never touch the raw vectors: the DB embeds, searches, and unpacks for you.]
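To show what a vector store does at its simplest, here is a toy in-memory version that keeps the same four fields per row and scans every vector on each query. The class `TinyVectorStore` is hypothetical and the three-dimensional vectors and second document source are made up for illustration. This exact scan is fine for a few hundred chunks; a real vector database replaces it with an ANN index.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class TinyVectorStore:
    """Exact (brute-force) nearest-neighbor search over rows held in memory."""

    def __init__(self):
        self.rows = []

    def add(self, id, vector, document, metadata):
        self.rows.append(
            {"id": id, "vector": vector, "document": document, "metadata": metadata}
        )

    def query(self, query_vector, k=3):
        # Score every row, then keep the k most similar.
        scored = [(cosine_similarity(query_vector, row["vector"]), row) for row in self.rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [
            {
                "id": row["id"],
                "document": row["document"],
                "metadata": row["metadata"],
                "score": round(score, 3),
            }
            for score, row in scored[:k]
        ]


store = TinyVectorStore()
store.add("chunk_14", [0.9, 0.1, 0.0],
          "UCC continuation statements must be filed within 6 months before the 5-year lapse date...",
          {"source": "filing_guide.md", "chunk_index": 14})
store.add("chunk_2", [0.1, 0.9, 0.0],
          "The filing office assigns a unique file number.",
          {"source": "filing_guide.md", "chunk_index": 2})
store.add("chunk_30", [0.0, 0.1, 0.9],
          "Preheat the oven before baking the pizza.",
          {"source": "recipes.txt", "chunk_index": 30})

results = store.query([0.8, 0.2, 0.0], k=2)  # toy query vector
for hit in results:
    print(hit["id"], hit["score"], hit["metadata"]["source"])
```

The caller gets text and metadata back, ranked by similarity, and never handles the stored vectors directly.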

Under the hood, vector databases use approximate nearest neighbor (ANN) algorithms to avoid comparing your query against every single stored vector. ANN finds vectors "close to" the query without checking them all, trading a tiny amount of accuracy for massive speed gains. The most common index structure is HNSW (Hierarchical Navigable Small World), which builds a graph that connects similar vectors to each other. When a query arrives, instead of checking all 100,000 vectors, the algorithm navigates the graph and checks maybe 200, returning results in milliseconds with roughly 95–99% recall (the share of true nearest neighbors it finds). On large collections, this small loss of accuracy can make the difference between a search that takes milliseconds and one that takes seconds.

Why ANN search is fast: brute force vs HNSW
[Diagram: 100,000 vectors, two ways to find the nearest. Brute force checks every vector: 100,000 comparisons, slow, linear scaling (O(n)). HNSW navigates a layered graph (L2 → L1 → L0, sparse at the top, dense at the bottom): enter at the top, move toward the nearest node, about 200 comparisons, fast, O(log n).]

HNSW skims the top layer to find the rough neighborhood, then descends layer by layer to refine. That's why 100K vs 100M makes barely any difference to query time.

If you already know SQL databases, here's the key contrast: in PostgreSQL, you'd write SELECT * FROM filings WHERE debtor_name = 'Acme Corp' (an exact match; the table name is illustrative). A vector database lets you say "find me everything semantically similar to 'Acme Corporation'" and it would also find "ACME CORP," "Acme Corp Inc," and related entities, because their embedding vectors are close together. It's the difference between exact lookup and meaning-based search.

Popular options include: ChromaDB (local, Python-native, great for prototyping), Pinecone (managed cloud service, scales to billions of vectors), and pgvector (a PostgreSQL extension that adds vector search to your existing database).

Choosing a Vector DB

For learning and prototyping, use ChromaDB: it runs locally, needs no API key, and stores data in-process. For production at scale, evaluate Pinecone or pgvector based on your infrastructure.
⚠️ Common Misconceptions About Vector Databases

"Vector databases replace SQL databases" — They don't. Vector databases excel at similarity search but are terrible at exact lookups, joins, aggregations, or transactions. Most production RAG systems use both: a vector DB for retrieval and a SQL DB for metadata, user data, and business logic.

"ANN search always returns the best matches" — The "approximate" in ANN means it trades a small amount of accuracy for massive speed. In rare cases, the true nearest neighbor may not appear in results. For most RAG applications this is fine (95–99% accuracy), but if you need guaranteed exact results, use exact nearest neighbor search (much slower).

"More vectors = slower search" — Not linearly. HNSW index performance scales logarithmically, so going from 10,000 to 100,000 vectors barely changes query time. The real bottleneck is usually the embedding step (calling the API to convert the query to a vector), not the search itself.

Citations — Native Provenance From Claude

The RAG pipeline described so far returns retrieved chunks and Claude's answer, but stitching those two together ("which sentence in the answer came from which chunk?") is your responsibility. For audit-grade applications (legal, medical, regulated finance), that stitching is fragile: Claude can reword, summarize, or merge sources, and matching back to the original passages is error-prone.

Anthropic ships a built-in Citations feature that solves this at the API level. You pass your source documents into the request as document content blocks; Claude returns its answer with explicit citations arrays attached to each sentence, telling you exactly which document and which character span supports each claim. No regex, no fuzzy matching, no hallucinated quotes.

Technical Definition

The Citations feature accepts up to ~20 documents per request, each with a unique title. Claude's response includes citations[] on each text block, where each citation references a document_index, document_title, and the exact cited_text span used. You enable it per request by passing documents as content blocks with "type": "document" and "citations": {"enabled": true}.

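import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
# chunk_1 is assumed to hold the plain text of one retrieved chunk

response = client.messages.create(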
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {"type": "text", "media_type": "text/plain", "data": chunk_1},
                "title": "Filing UCC-2024-001",
                "citations": {"enabled": True}
            },
            # ...more documents...
            {"type": "text", "text": "Summarize the secured party's collateral interest."}
        ]
    }]
)
# Each text block in response.content carries a citations list (or None), e.g.
# response.content[0].citations -> entries with document_index, document_title, cited_text
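Reading the citations back out can be sketched as a small helper that walks the response's content blocks. The helper name `collect_citations` is an assumption, and the `fake_content` objects below are stand-ins built with `SimpleNamespace` so the sketch runs without an API call; with a real response you would pass `response.content` instead.

```python
from types import SimpleNamespace


def collect_citations(content_blocks):
    """Pair each cited claim with the document span that supports it."""
    pairs = []
    for block in content_blocks:
        if getattr(block, "type", None) != "text":
            continue
        # Uncited text blocks have no citations (None or an empty list).
        for citation in getattr(block, "citations", None) or []:
            pairs.append({
                "claim": block.text,
                "document_index": citation.document_index,
                "document_title": citation.document_title,
                "cited_text": citation.cited_text,
            })
    return pairs


# Stand-in for response.content (hypothetical values, for illustration only).
fake_content = [
    SimpleNamespace(
        type="text",
        text="The secured party holds an interest in all inventory.",
        citations=[SimpleNamespace(
            document_index=0,
            document_title="Filing UCC-2024-001",
            cited_text="all inventory of the debtor",
        )],
    ),
    SimpleNamespace(type="text", text=" No other collateral is listed.", citations=None),
]

pairs = collect_citations(fake_content)
for pair in pairs:
    print(f'{pair["claim"]!r} <- [{pair["document_title"]}] {pair["cited_text"]!r}')
```

Because the pairing comes from structured fields, there is no need to parse citation markers out of the answer text.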

RAG vs Citations — Pick Per Job

When to Reach for Each
Concern | Build RAG | Use Citations
Corpus size | Millions of docs; need vector search to narrow down | A small candidate set (<20 docs) per request
Provenance requirement | "Roughly attributable" via chunk metadata | Sentence-level character spans; audit-grade
Hallucination risk | Claude may paraphrase outside the sources | Each cited claim is grounded in a specific span
Token cost | Pay only for top-K retrieved chunks | Pay for all candidate documents (cap ~20)
Best stack | Vector DB + Claude (this module's pipeline) | Vector DB shortlist → pass top-K to Citations

The strongest production pattern combines them: use your RAG pipeline to find the top ~10 candidate chunks, then pass those chunks as citation-enabled documents in the request. You get the scale of vector search and the audit-grade provenance of Citations in the same call.
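The combined pattern can be sketched as a function that turns retrieved chunks into citation-enabled document blocks. The helper names and the `max_docs=10` default are assumptions for illustration; the block shape follows the request example earlier in this section, and the chunk dictionaries are assumed to carry `text` and `source` keys.

```python
def chunks_to_document_blocks(chunks, max_docs=10):
    """Wrap each retrieved chunk as a citation-enabled document content block."""
    blocks = []
    for i, chunk in enumerate(chunks[:max_docs]):
        blocks.append({
            "type": "document",
            "source": {"type": "text", "media_type": "text/plain", "data": chunk["text"]},
            # Add the position so titles stay unique when one file yields several chunks.
            "title": f"{chunk['source']} (chunk {i})",
            "citations": {"enabled": True},
        })
    return blocks


def build_cited_content(question, chunks):
    """Document blocks first, then the user's question as a text block."""
    return chunks_to_document_blocks(chunks) + [{"type": "text", "text": question}]


retrieved = [
    {"text": "UCC amendments must be filed within...", "source": "filing_guide.md"},
    {"text": "The debtor name on a UCC-1 must exactly...", "source": "compliance_faq.txt"},
    {"text": "Continuation statements are due every...", "source": "filing_guide.md"},
]

content = build_cited_content("How do I file a UCC amendment?", retrieved)
for block in content:
    print(block["type"], block.get("title", ""))
```

The returned list would be used as the `content` of the user message in `client.messages.create`.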

🎓 Cert Tip — Domain 5.6 (Provenance)

The exam tests recognition that provenance is a first-class API feature, not something you bolt on after the fact. Anti-pattern: asking Claude in the prompt to "include citations like [1] [2]" and parsing them out — the model can fabricate citation numbers. Correct pattern: pass documents as citation-enabled content blocks and read structured citations[] from the response.

Multimodal Inputs — PDFs, Images, & the Files API

The retrieval pipeline you built so far assumes your documents arrive as plain text. In production they don't. They arrive as PDFs with embedded tables and figures, scanned images, charts, and screenshots. Claude has three native primitives for handling these without a separate OCR step or vision model: PDF support, vision, and the Files API. Knowing which to reach for — and when to combine them with your RAG pipeline — closes the last gap between a toy chatbot and a production document agent.

Three Input Modalities, One API Surface
[Diagram: content blocks feed into Claude for unified multimodal reasoning, producing structured extraction + citations. Blocks shown: type: "text" (user message, prompt); type: "document" (PDF, up to 100 pages, native text + table parsing); type: "image" (PNG, JPEG, GIF, WebP; base64 or URL); file_id: "file_abc" (Files API reference).]

PDF Support — Skip the OCR Pipeline

Pass a PDF directly as a document content block (base64-encoded or via Files API reference) and Claude reads it natively — including tables, headings, and the layout structure that OCR'd text would lose. Limits to know: up to ~100 pages per document, ~32MB per file. For longer documents, split into chunks and use your RAG pipeline to shortlist before passing the most relevant pages.

import base64

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("ucc_filing.pdf", "rb") as f:
    pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_data},
                "title": "UCC-1 Filing 2024-001",
                "citations": {"enabled": True}  # ← pairs with Citations
            },
            {"type": "text", "text": "Extract debtor, secured party, and collateral. Cite each field."}
        ]
    }]
)

Vision — Images as First-Class Inputs

Same content-block paradigm with type: "image". Useful for: extracting data from charts/graphs, reading screenshots from agents that monitor dashboards, classifying scanned forms when you don't have native PDFs, and processing diagram-heavy documents (architecture diagrams, flowcharts). Cost: image tokens are calculated from image dimensions, at roughly (width × height) / 750 tokens, and images larger than about 1.15 megapixels are downscaled first. A 1280x800 screenshot is roughly 1,400 tokens.

with open("dashboard_screenshot.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": image_data}},
            {"type": "text", "text": "Read the latency chart and report the p95 value for the last 24h."}
        ]
    }]
)

Files API — Upload Once, Reference Many Times

If you reference the same document across many requests — an agent that answers questions about a 50-page contract over a multi-turn session, or a batch job that runs 1000 questions against the same PDF — uploading via the Files API beats re-encoding the file in every request. You upload once, get a file_id, and reference it by ID in subsequent messages.

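# Step 1: upload (do this once).
# The Files API is in beta at the time of writing, so these calls live under
# client.beta and use a beta flag; check the current SDK docs before relying on them.
with open("contract.pdf", "rb") as f:
    file = client.beta.files.upload(file=("contract.pdf", f, "application/pdf"))
file_id = file.id  # e.g., "file_abc123"

# Step 2: reference by ID in any request
response = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    betas=["files-api-2025-04-14"],
    messages=[{
        "role": "user",
        "content": [
            {"type": "document", "source": {"type": "file", "file_id": file_id}},
            {"type": "text", "text": "What is the termination clause?"}
        ]
    }]
)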

When to Use Each — Decision Matrix

Pick the Right Input Modality
Scenario | Use
One-off PDF question, <100 pages | document content block (base64) + Citations
Same PDF referenced in 50+ requests | Files API + reference by file_id
PDF >100 pages or >32MB | RAG to shortlist + document blocks for top-K
Screenshot, chart, or scanned form | image content block
Diagram/architecture interpretation | image content block with explicit task prompt
Mixed content (PDF + screenshots + text) | All three in one content array; Claude reasons across them jointly
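
The matrix above can also be read as a small routing function. This is only a sketch of the same rules: the function name `pick_input_modality` and its parameters are hypothetical, and the thresholds (100 pages, 32 MB, 50 requests) are the ones stated in this section.

```python
def pick_input_modality(kind, pages=0, size_mb=0.0, expected_requests=1):
    """Return the suggested approach for one input, following the decision matrix."""
    if kind == "image":
        return "image content block"
    if kind != "pdf":
        raise ValueError(f"unsupported kind: {kind!r}")
    if pages > 100 or size_mb > 32:
        # Too large to send whole: shortlist with retrieval first.
        return "RAG to shortlist + document blocks for top-K"
    if expected_requests >= 50:
        return "Files API + reference by file_id"
    return "document content block (base64) + Citations"


print(pick_input_modality("pdf", pages=12, size_mb=1.5))
print(pick_input_modality("pdf", pages=50, size_mb=4.0, expected_requests=1000))
print(pick_input_modality("pdf", pages=400, size_mb=60.0))
print(pick_input_modality("image"))
```

Mixed inputs are handled by applying the function to each item and placing the resulting blocks in one content array.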
Cost & Limits

PDFs: ~1500–3000 tokens per page depending on text density. Images: tokens scale with image dimensions; resize before sending if you don't need full resolution. Files API: file references count toward your context window the same as inline content, so the upload buys re-use efficiency, not free context. Caching: document and image content blocks are cacheable via cache_control. Combine them with prompt caching for high-frequency document Q&A, and from the second request onward the cached input tokens are billed at roughly 10% of the normal input price (about 90% savings on the cached portion).
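
For rough budgeting, these figures can be turned into a back-of-the-envelope estimator. This is a sketch under stated assumptions: the per-page range comes from this section, while the divisor of 750 for images is the approximation given in Anthropic's vision documentation at the time of writing, not something defined in this module. It ignores any downscaling of large images.

```python
def estimate_pdf_tokens(pages, low_per_page=1500, high_per_page=3000):
    """Return a (low, high) token range for a PDF with the given page count."""
    return pages * low_per_page, pages * high_per_page


def estimate_image_tokens(width_px, height_px):
    """Approximate image tokens as (width * height) / 750, rounded down."""
    return (width_px * height_px) // 750


low, high = estimate_pdf_tokens(50)
print(f"50-page PDF: about {low:,} to {high:,} tokens")
print(f"1280x800 screenshot: about {estimate_image_tokens(1280, 800):,} tokens")
```

Estimates like these help decide when a document is cheap enough to send whole and when to shortlist pages with retrieval first.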

Why It Matters

The single biggest production gap for document agents isn't retrieval quality — it's format handling. Teams build elegant RAG pipelines, then hit a wall when 60% of their documents are scanned PDFs with tables. Native multimodal support collapses what used to be a 3-stage pipeline (OCR → layout parsing → LLM) into a single API call. Combined with Citations and prompt caching, you get a complete document-agent stack with audit-grade provenance for under 100 lines of code.

Code Walkthrough: Chat with Your Docs

Conceptual Bridge: You now understand the six-stage RAG pipeline conceptually (Load → Chunk → Embed → Store → Retrieve → Generate), how embeddings turn text into searchable coordinates, and why chunking strategy matters. Next, you'll implement all six stages in working code. We'll build it in three steps: first load and chunk documents, then embed and store them in ChromaDB, and finally wire up retrieval with Claude to answer questions.

Step 1: Document Loading & Chunking

Let's start by getting our documents into the system. The load_documents function is straightforward — it reads every markdown and text file from a folder and returns them as a list. Nothing clever here, just file I/O with proper error handling (we skip files with encoding issues rather than crashing the whole pipeline).

The interesting part is chunk_text. Here's the dilemma: embedding models work best with short text — typically under 512 tokens — but our documents might be thousands of words. We could just chop the text every 500 characters, but that might cut a sentence in half. So instead, the function tries to find a natural break point — a paragraph boundary, a period, or at worst a space — before making the cut. This small detail makes a real difference in retrieval quality.

One thing to watch for: the 50-character overlap between chunks. Without overlap, information that spans a chunk boundary gets split across two chunks, and neither chunk alone contains the full thought. The overlap duplicates a small amount of text at each boundary, acting as insurance that context isn't lost.

# pip install "chromadb>=0.4.0" "anthropic>=0.30.0"   (quote the specs so the shell doesn't treat >= as a redirect)
import os
import glob

def load_documents(docs_dir: str) -> list[dict]:
    """Load all .md and .txt files from a directory."""
    docs = []
    for pattern in ["*.md", "*.txt"]:
        for path in glob.glob(os.path.join(docs_dir, pattern)):
            try:
                with open(path, "r", encoding="utf-8") as f:
                    docs.append({
                        "content": f.read(),
                        "source": os.path.basename(path),
                    })
            except (IOError, UnicodeDecodeError) as e:
                print(f"Skipping {path}: {e}")
    if not docs:
        raise FileNotFoundError(f"No documents found in {docs_dir}")
    return docs

def chunk_text(
    text: str, chunk_size: int = 500, overlap: int = 50
) -> list[str]:
    """Character splitter with overlap that prefers natural break points."""
    if len(text) <= chunk_size:
        return [text]
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        # Try to break at a paragraph or sentence boundary
        if end < len(text):
            for sep in ["\n\n", "\n", ". ", " "]:
                last_sep = chunk.rfind(sep)
                if last_sep > chunk_size * 0.5:
                    end = start + last_sep + len(sep)
                    chunk = text[start:end]
                    break
        chunks.append(chunk.strip())
        if end >= len(text):
            break  # last chunk emitted; skip a redundant tail made only of overlap
        start = end - overlap  # overlap for continuity
    return [c for c in chunks if c]

# Usage
docs = load_documents("./docs")
all_chunks = []
for doc in docs:
    chunks = chunk_text(doc["content"], chunk_size=500, overlap=50)
    for i, chunk in enumerate(chunks):
        all_chunks.append({
            "text": chunk,
            "source": doc["source"],
            "chunk_index": i,
        })
print(f"Loaded {len(docs)} docs, {len(all_chunks)} chunks")
// npm install chromadb@^1.7.0 @anthropic-ai/sdk@^0.30.0
import { readFileSync, readdirSync } from 'fs';
import { join } from 'path';

function loadDocuments(docsDir) {
  const files = readdirSync(docsDir).filter(
    f => f.endsWith('.md') || f.endsWith('.txt')
  );
  if (files.length === 0)
    throw new Error(`No documents found in ${docsDir}`);
  return files.map(f => {
    try {
      return {
        content: readFileSync(join(docsDir, f), 'utf-8'),
        source: f,
      };
    } catch (e) {
      console.warn(`Skipping ${f}: ${e.message}`);
      return null;
    }
  }).filter(Boolean);
}

function chunkText(text, chunkSize = 500, overlap = 50) {
  if (text.length <= chunkSize) return [text];
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    let end = start + chunkSize;
    let chunk = text.slice(start, end);
    if (end < text.length) {
      for (const sep of ['\n\n', '\n', '. ', ' ']) {
        const lastSep = chunk.lastIndexOf(sep);
        if (lastSep > chunkSize * 0.5) {
          end = start + lastSep + sep.length;
          chunk = text.slice(start, end);
          break;
        }
      }
    }
    chunks.push(chunk.trim());
    if (end >= text.length) break; // last chunk emitted; skip a redundant overlap-only tail
    start = end - overlap;
  }
  return chunks.filter(Boolean);
}

const docs = loadDocuments('./docs');
const allChunks = [];
for (const doc of docs) {
  const chunks = chunkText(doc.content, 500, 50);
  chunks.forEach((chunk, i) => {
    allChunks.push({ text: chunk, source: doc.source, chunkIndex: i });
  });
}
console.log(`Loaded ${docs.length} docs, ${allChunks.length} chunks`);
What Just Happened? You loaded raw files from disk and split them into ~500-character chunks with 50-character overlap. Each chunk is a dictionary carrying the text, its source filename, and its position index. You now have a flat list of chunks ready for embedding — the "Load" and "Chunk" stages of the pipeline are complete.

Step 2: Embed & Store in ChromaDB

Now we need somewhere to store our chunks as searchable vectors. Here's the pleasant surprise: ChromaDB handles the embedding step for you. When you call collection.add() with text documents, ChromaDB automatically runs them through its built-in embedding model and stores the resulting vectors. You never touch a separate embedding API. The "hnsw:space": "cosine" setting tells ChromaDB to use cosine similarity when searching — the standard for text embeddings, as we discussed in the Embeddings section.

One gotcha that trips up many beginners: chromadb.Client() stores everything in memory only. When your Python script exits, all your vectors disappear. For this lab that's fine, since we re-ingest on each run. For anything beyond a quick experiment, switch to chromadb.PersistentClient(path="./chroma_db"), which writes to disk and survives restarts. Also worth knowing: ChromaDB's default embedding model (all-MiniLM-L6-v2, a small general-purpose sentence-transformer) is good enough for prototyping, but for production you'll likely want a stronger model such as Voyage AI or OpenAI embeddings, which typically retrieve more accurately on domain-specific content like legal or medical text.

import chromadb

# ChromaDB uses its own built-in embedding model by default
client = chromadb.Client()  # in-memory; use PersistentClient for disk
collection = client.get_or_create_collection(
    name="my_docs",
    metadata={"hnsw:space": "cosine"},  # use cosine similarity
)

# Add chunks with metadata
try:
    collection.add(
        ids=[f"chunk_{i}" for i in range(len(all_chunks))],
        documents=[c["text"] for c in all_chunks],
        metadatas=[{"source": c["source"], "index": c["chunk_index"]}
                   for c in all_chunks],
    )
    print(f"Stored {collection.count()} chunks in ChromaDB")
except Exception as e:
    print(f"Error storing chunks: {e}")
    raise
import { ChromaClient } from 'chromadb';

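// Unlike Python's chromadb.Client(), the JS client has no in-process mode:
// it connects to a running Chroma server (http://localhost:8000 by default).
const chroma = new ChromaClient();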
const collection = await chroma.getOrCreateCollection({
  name: "my_docs",
  metadata: { "hnsw:space": "cosine" },
});

try {
  await collection.add({
    ids: allChunks.map((_, i) => `chunk_${i}`),
    documents: allChunks.map(c => c.text),
    metadatas: allChunks.map(c => ({
      source: c.source, index: c.chunkIndex
    })),
  });
  const count = await collection.count();
  console.log(`Stored ${count} chunks in ChromaDB`);
} catch (e) {
  console.error(`Error storing chunks: ${e.message}`);
  throw e;
}
What Just Happened? You created a ChromaDB collection and inserted all your chunks. ChromaDB automatically converted each chunk's text into an embedding vector and stored it alongside the metadata. The "Embed" and "Store" pipeline stages are now complete — your documents are searchable by meaning.

Step 3: Retrieve & Generate with Claude

This is where everything comes together — the "Retrieve" and "Generate" stages working as a pair. When a user asks a question, two things happen in sequence. First, we query ChromaDB to find the top-k most semantically similar chunks. Then we format those chunks into a context block and send them to Claude along with the original question. The key insight: we inject only the most relevant chunks (typically 3–5), not the entire corpus. This keeps the prompt small enough for Claude's context window while providing exactly the information needed to answer.

Pay close attention to the system prompt — it's the most important design decision in this entire function. We tell Claude to answer "based ONLY on the provided context" and to explicitly say so if the context doesn't contain the answer. Why so strict? Without this grounding instruction, Claude will cheerfully fall back to its training data and make up plausible-sounding answers — defeating the entire purpose of RAG. Also notice we wrap retrieval and generation in separate try/except blocks. These are independent failure points: ChromaDB might be down, or the Claude API might return a rate limit error. Catching them separately means you get a clear error message pointing to exactly which stage failed.

import anthropic

claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY env var

def query_rag(question: str, top_k: int = 3) -> str:
    """Retrieve relevant chunks and generate an answer with citations."""
    # Step 1: Retrieve
    try:
        results = collection.query(
            query_texts=[question],
            n_results=top_k,
        )
    except Exception as e:
        return f"Retrieval error: {e}"

    if not results["documents"] or not results["documents"][0]:
        return "No relevant documents found. I don't have enough information."

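    # Chroma returns hits ordered by distance (lower = more similar).
    # distances is not used yet; it is the hook for a relevance threshold.
    distances = results["distances"][0] if results.get("distances") else []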
    chunks = results["documents"][0]
    sources = results["metadatas"][0]

    # Build context from retrieved chunks
    context_parts = []
    for i, (chunk, meta) in enumerate(zip(chunks, sources)):
        context_parts.append(
            f"[Source {i+1}: {meta['source']}]\n{chunk}"
        )
    context = "\n\n---\n\n".join(context_parts)

    # Step 2: Generate with Claude
    try:
        response = claude.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=(
                "You are a helpful assistant that answers questions "
                "based ONLY on the provided context. If the context "
                "doesn't contain the answer, say so. Always cite "
                "your sources using [Source N] format."
            ),
            messages=[{
                "role": "user",
                "content": (
                    f"Context:\n{context}\n\n"
                    f"Question: {question}\n\n"
                    "Answer based on the context above, citing sources:"
                ),
            }],
        )
        return response.content[0].text
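    except anthropic.APIError as e:
        # Only HTTP status errors carry status_code; connection errors do not
        status = getattr(e, "status_code", "n/a")
        return f"Generation error: {status} - {e.message}"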

# Test it
answer = query_rag("What are the key features of the product?")
print(answer)
import Anthropic from '@anthropic-ai/sdk';

const claude = new Anthropic(); // reads ANTHROPIC_API_KEY env var

async function queryRag(question, topK = 3) {
  // Step 1: Retrieve
  let results;
  try {
    results = await collection.query({
      queryTexts: [question],
      nResults: topK,
    });
  } catch (e) {
    return `Retrieval error: ${e.message}`;
  }

  if (!results.documents?.[0]?.length) {
    return "No relevant documents found. I don't have enough information.";
  }

  const chunks = results.documents[0];
  const sources = results.metadatas[0];

  const context = chunks.map((chunk, i) =>
    `[Source ${i + 1}: ${sources[i].source}]\n${chunk}`
  ).join('\n\n---\n\n');

  // Step 2: Generate with Claude
  try {
    const response = await claude.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 1024,
      system:
        "You are a helpful assistant that answers questions " +
        "based ONLY on the provided context. If the context " +
        "doesn't contain the answer, say so. Always cite " +
        "your sources using [Source N] format.",
      messages: [{
        role: "user",
        content:
          `Context:\n${context}\n\n` +
          `Question: ${question}\n\n` +
          "Answer based on the context above, citing sources:",
      }],
    });
    return response.content[0].text;
  } catch (e) {
    return `Generation error: ${e.status ?? 'n/a'} - ${e.message}`;
  }
}

const answer = await queryRag("What are the key features of the product?");
console.log(answer);
What Just Happened? You built the complete RAG pipeline end-to-end. The query_rag function takes a plain-English question, searches ChromaDB for the 3 most semantically similar chunks, formats them into a context block with source labels, and sends everything to Claude with a grounding system prompt. Claude responds using only the retrieved context and cites its sources. All six pipeline stages — Load, Chunk, Embed, Store, Retrieve, Generate — are now working together.
🎓 Cert Tip — Domain 5.1 The "lost in the middle" effect means information placed in the middle of a long context gets lower recall than information at the start or end. In RAG systems, this means the ORDER of your retrieved chunks matters: place the highest-relevance chunk first. If you retrieve 5 chunks, chunk #3 in the middle gets the least attention. The exam tests awareness of this positional bias and expects you to mitigate it by ranking chunks by relevance score.
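
As a rough illustration of this tip, the sketch below first ranks chunks by distance (lowest first) and then shows one optional arrangement that pushes the weakest chunks toward the middle of the context. Both function names are hypothetical, the "edges first" layout is one possible mitigation rather than a prescribed one, and the distances are made-up values.

```python
def rank_by_relevance(chunks: list[str], distances: list[float]) -> list[str]:
    """Return chunks sorted from most to least relevant (lowest distance first)."""
    paired = sorted(zip(distances, chunks), key=lambda pair: pair[0])
    return [chunk for _, chunk in paired]

def edges_first(ranked: list[str]) -> list[str]:
    """Put the strongest chunks at the start and end, the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

chunks = ["c", "a", "e", "b", "d"]
distances = [0.30, 0.10, 0.50, 0.20, 0.40]

ranked = rank_by_relevance(chunks, distances)
print(ranked)               # ['a', 'b', 'c', 'd', 'e']
print(edges_first(ranked))  # ['a', 'c', 'e', 'd', 'b']
```

ChromaDB already returns results in ascending distance order, so the ranking step mainly matters when you merge hits from several queries or collections.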
🎓 Cert Tip — Domain 4.3 RAG provides CONTEXT but not CORRECTNESS guarantees. Even with perfect retrieval, Claude may still misinterpret or combine facts incorrectly. The exam-recommended pattern is to add a validation step after generation: check that cited source numbers actually exist, and that key claims in the answer appear verbatim in the retrieved chunks. Anti-pattern: trusting RAG output without any post-generation validation.
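
A minimal sketch of such a validation step, assuming answers cite sources in the [Source N] format used in this module. The helper names are hypothetical, and the verbatim check is deliberately naive (case-insensitive substring match after whitespace normalization), so treat it as a starting point rather than a complete validator.

```python
import re

def invalid_citations(answer: str, num_sources: int) -> list[int]:
    """Return cited source numbers that don't match any retrieved chunk."""
    cited = {int(n) for n in re.findall(r"\[Source (\d+)", answer)}
    return sorted(n for n in cited if not 1 <= n <= num_sources)

def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks don't block a match."""
    return " ".join(text.lower().split())

def quote_is_supported(quote: str, chunks: list[str]) -> bool:
    """Check whether a key phrase appears verbatim in at least one chunk."""
    needle = _normalize(quote)
    return any(needle in _normalize(chunk) for chunk in chunks)

retrieved = [
    "A properly filed UCC-1 is effective for 5 years\nfrom the date of filing.",
    "A continuation statement (UCC-3) must be filed within 6 months before lapse.",
    "The secured party can file a UCC-3 termination statement.",
]
answer = "A UCC-1 is effective for 5 years [Source 1]. It renews automatically [Source 4]."

print(invalid_citations(answer, num_sources=len(retrieved)))             # [4]
print(quote_is_supported("effective for 5 years from the date", retrieved))  # True
print(quote_is_supported("renews automatically", retrieved))             # False
```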
🎓 Cert Tip — Domain 5.6 (Provenance) Every claim a RAG agent emits must be tagged to its source chunk: {claim, source_id, confidence}. Synthesizing answers without source attribution is an exam anti-pattern — even when the answer is correct, you can't audit it later. When multiple sources agree, mark the claim as well-established; when sources conflict, surface it as a contested claim with both source pointers rather than silently picking one. The exam tests whether your output schema enforces source mappings, not whether the answer happens to be right.
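
One way to picture an output schema that enforces source mappings is sketched below. The SourcedClaim class, the reconcile helper, the source-ID format, and the confidence numbers are all illustrative assumptions, and "agreement" here is just an exact match on normalized text, which a real system would replace with something more robust.

```python
from dataclasses import dataclass

@dataclass
class SourcedClaim:
    claim: str
    source_ids: list[str]
    confidence: float   # placeholder scale, not a calibrated probability
    status: str         # "single-source", "well-established", or "contested"

def reconcile(statements: dict[str, str]) -> list[SourcedClaim]:
    """Turn per-source statements about one fact into tagged claims.

    statements maps a source ID to the value that source asserts.
    """
    by_value: dict[str, list[str]] = {}
    for source_id, value in statements.items():
        by_value.setdefault(value.strip().lower(), []).append(source_id)

    if len(by_value) == 1:
        value, sources = next(iter(by_value.items()))
        if len(sources) > 1:
            return [SourcedClaim(value, sources, 0.9, "well-established")]
        return [SourcedClaim(value, sources, 0.6, "single-source")]

    # Sources disagree: surface every variant with its own source pointers
    return [SourcedClaim(value, sources, 0.3, "contested")
            for value, sources in by_value.items()]

agree = reconcile({"filing_guide.md#1": "5 years", "compliance_faq.txt#0": "5 years"})
conflict = reconcile({"a.md#0": "5 years", "b.md#2": "7 years"})

print(agree[0].status, agree[0].source_ids)
print([(c.claim, c.status, c.source_ids) for c in conflict])
```

Because every claim object requires source_ids, an answer cannot be emitted without attribution, which is the property the tip describes.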

Hands-On Exercise

What You'll Build

A complete RAG pipeline that ingests sample documents, stores them in ChromaDB, and answers questions with cited sources using Claude.

Time Estimate: 30–45 minutes

Prerequisites: Python 3.9+, an Anthropic API key (ANTHROPIC_API_KEY environment variable), and a terminal/command prompt.

Files You'll Create: rag_pipeline.py (main pipeline script), docs/filing_guide.md, docs/risk_assessment.md, docs/compliance_faq.txt (sample documents)

Environment Setup

mkdir rag-lab && cd rag-lab
python -m venv venv && source venv/bin/activate   # Windows: venv\Scripts\activate
pip install "anthropic>=0.30.0" "chromadb>=0.4.0"
export ANTHROPIC_API_KEY=your-key-here             # Windows: set ANTHROPIC_API_KEY=your-key-here

Step 1: Create Sample Documents & Ingestion Pipeline

First, we need documents to search. We'll create a docs/ folder with 3 sample files, then build the ingestion pipeline that loads, chunks, and stores them in ChromaDB. This covers the first 4 stages of the RAG pipeline: Load → Chunk → Embed → Store.

Create a docs/ folder and add these 3 files:

# UCC Filing Guide

## Initial Filing (UCC-1)
A UCC-1 financing statement must be filed in the state where the debtor is located. For individuals, this is their principal residence. For registered organizations (corporations, LLCs), this is the state of organization.

The filing office assigns a unique file number and timestamps the filing. A properly filed UCC-1 is effective for 5 years from the date of filing.

## Continuation Statements
To maintain perfection beyond 5 years, a continuation statement (UCC-3) must be filed within 6 months before the 5-year lapse date. Missing this window means the filing lapses and the secured party loses priority.

## Amendments
Amendments can change the collateral description, add or remove debtors, or assign the financing statement to a new secured party. Each amendment receives its own file number but references the original UCC-1.
# Risk Assessment Criteria

## High Risk Indicators
- Debtor has multiple UCC filings from different secured parties (competing liens)
- Collateral description uses broad terms like "all assets" (blanket lien)
- Filing is within 90 days of a bankruptcy petition (preference risk)
- Debtor name on filing doesn't exactly match legal name (perfection defect)

## Medium Risk Indicators
- Continuation statement filed close to the deadline (less than 30 days before lapse)
- Collateral type is inventory or accounts receivable (high turnover)
- Secured party is an individual rather than an institution

## Low Risk Indicators
- Single secured party with specific collateral description
- Filing has been continued at least once (established relationship)
- Debtor is a well-established registered organization
Q: What happens if a continuation statement is filed late?
A: If filed after the 5-year lapse date, the original filing is no longer effective. The secured party must file a new UCC-1, and their priority date resets to the new filing date. Any liens filed by other parties in the gap period will have priority.

Q: Can a UCC filing be terminated?
A: Yes. The secured party can file a UCC-3 termination statement. Once filed, the financing statement is no longer effective. The debtor can also demand termination if the obligation has been satisfied.

Q: What is the difference between a UCC-1 and a UCC-3?
A: A UCC-1 is the initial financing statement that creates the public record. A UCC-3 is used for all subsequent changes: continuations, amendments, assignments, and terminations. The UCC-3 always references the original UCC-1 file number.

Now create rag_pipeline.py with the ingestion pipeline:

import os
import glob
import anthropic
import chromadb

# ── Document Loading ─────────────────────────────────────────
def load_documents(docs_dir: str) -> list[dict]:
    """Load all .md and .txt files from a directory."""
    docs = []
    for pattern in ["*.md", "*.txt"]:
        for path in glob.glob(os.path.join(docs_dir, pattern)):
            try:
                with open(path, "r", encoding="utf-8") as f:
                    docs.append({"content": f.read(), "source": os.path.basename(path)})
            except (IOError, UnicodeDecodeError) as e:
                print(f"  Skipping {path}: {e}")
    if not docs:
        raise FileNotFoundError(f"No documents found in {docs_dir}")
    return docs

# ── Chunking ─────────────────────────────────────────────────
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks with overlap at natural boundaries."""
    if len(text) <= chunk_size:
        return [text]
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        if end < len(text):
            for sep in ["\n\n", "\n", ". ", " "]:
                last_sep = chunk.rfind(sep)
                if last_sep > chunk_size * 0.5:
                    end = start + last_sep + len(sep)
                    chunk = text[start:end]
                    break
        chunks.append(chunk.strip())
        if end >= len(text):
            break  # last chunk emitted; skip a redundant tail made only of overlap
        start = end - overlap
    return [c for c in chunks if c]

# ── Ingestion (Load → Chunk → Embed → Store) ────────────────
def ingest(docs_dir: str = "./docs") -> chromadb.Collection:
    """Load docs, chunk them, and store in ChromaDB."""
    print("── Ingestion Pipeline ──")
    docs = load_documents(docs_dir)
    print(f"  Loaded {len(docs)} documents")

    all_chunks = []
    for doc in docs:
        chunks = chunk_text(doc["content"], chunk_size=500, overlap=50)
        for i, chunk in enumerate(chunks):
            all_chunks.append({"text": chunk, "source": doc["source"], "index": i})
    print(f"  Created {len(all_chunks)} chunks")

    client = chromadb.Client()  # in-memory for simplicity
    collection = client.get_or_create_collection(
        name="rag_lab", metadata={"hnsw:space": "cosine"}
    )
    collection.add(
        ids=[f"chunk_{i}" for i in range(len(all_chunks))],
        documents=[c["text"] for c in all_chunks],
        metadatas=[{"source": c["source"], "index": c["index"]} for c in all_chunks],
    )
    print(f"  Stored {collection.count()} chunks in ChromaDB")
    return collection

# ── Run ingestion ────────────────────────────────────────────
if __name__ == "__main__":
    collection = ingest()
    print("\n✓ Ingestion complete. Ready for queries.")

Run the ingestion:

Command
python rag_pipeline.py
Expected Output
── Ingestion Pipeline ──
  Loaded 3 documents
  Created 9 chunks
  Stored 9 chunks in ChromaDB

✓ Ingestion complete. Ready for queries.
✅ Checkpoint

If you see 3 documents loaded and 6–10 chunks stored, Step 1 is working. The exact chunk count depends on document length and boundary detection. If the script raises FileNotFoundError instead, check that you are running it from the directory that contains the docs/ folder, since ./docs is resolved relative to your current working directory.

Troubleshooting
  • ModuleNotFoundError: No module named 'chromadb' → Run pip install chromadb
  • FileNotFoundError: No documents found → Make sure the docs/ folder exists and contains .md or .txt files in the same directory where you run the script
  • Very few chunks (1–2) → Your documents may be very short. Add more content to the sample files or lower chunk_size to 200.

Step 2: Add Retrieval & Generation with Claude

Depends on: Step 1. This step uses the collection object and the ingest() function from Step 1, so if you're starting fresh, complete Step 1 first.

Now we wire up the last two pipeline stages — Retrieve and Generate. This function searches ChromaDB for the top-3 most relevant chunks, formats them into a context block with source labels, and sends everything to Claude with a grounding prompt.

Append the query_rag function to rag_pipeline.py and replace the existing if __name__ == "__main__" block with the code below (keep the imports and the load_documents, chunk_text, and ingest functions from Step 1):

# ── Query (Retrieve → Generate) ──────────────────────────────
def query_rag(collection, question: str, top_k: int = 3, verbose: bool = True) -> str:
    """Search for relevant chunks and generate an answer with Claude."""
    client = anthropic.Anthropic()

    # Retrieve
    results = collection.query(query_texts=[question], n_results=top_k)
    chunks = results["documents"][0]
    sources = results["metadatas"][0]
    distances = results["distances"][0] if results.get("distances") else []

    if verbose:
        print(f"\n  Query: {question}")
        print(f"  Retrieved {len(chunks)} chunks:")
        for i, (chunk, meta) in enumerate(zip(chunks, sources)):
            dist = f" (distance: {distances[i]:.3f})" if distances else ""
            print(f"    [{i+1}] {meta['source']}{dist}: {chunk[:80]}...")

    # Format context
    context = "\n\n---\n\n".join(
        f"[Source {i+1}: {meta['source']}]\n{chunk}"
        for i, (chunk, meta) in enumerate(zip(chunks, sources))
    )

    # Generate
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=(
            "You are a helpful assistant that answers questions "
            "based ONLY on the provided context. If the context "
            "doesn't contain the answer, say so explicitly. "
            "Always cite sources using [Source N] format."
        ),
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer based on the context above, citing sources:",
        }],
    )
    return response.content[0].text


# ── Test Queries ─────────────────────────────────────────────
if __name__ == "__main__":
    collection = ingest()

    print("\n" + "=" * 55)
    print("RAG QUERY TESTS")
    print("=" * 55)

    questions = [
        "How long is a UCC-1 filing effective?",
        "What are the high risk indicators for a UCC filing?",
        "What happens if a continuation statement is filed late?",
        "What is the difference between a UCC-1 and a UCC-3?",
        "What is the weather in Tokyo?",  # Should say "not in context"
    ]

    for q in questions:
        answer = query_rag(collection, q)
        print(f"\n  Answer: {answer[:200]}...")
        print(f"  {'─' * 50}")

Run the complete pipeline:

Command
python rag_pipeline.py
Expected Output (abbreviated)
── Ingestion Pipeline ──
  Loaded 3 documents
  Created 9 chunks
  Stored 9 chunks in ChromaDB

=======================================================
RAG QUERY TESTS
=======================================================

  Query: How long is a UCC-1 filing effective?
  Retrieved 3 chunks:
    [1] filing_guide.md (distance: 0.234): A UCC-1 financing statement must be filed in the state where...
    [2] filing_guide.md (distance: 0.412): To maintain perfection beyond 5 years, a continuation...
    [3] compliance_faq.txt (distance: 0.489): Q: What happens if a continuation statement is filed late?...

  Answer: According to the filing guide, a properly filed UCC-1 is effective for 5 years from the date of filing [Source 1]. To maintain...
  ──────────────────────────────────────────────────
...

  Query: What is the weather in Tokyo?
  Retrieved 3 chunks:
    ...

  Answer: The provided context does not contain any information about the weather in Tokyo. The documents cover UCC filing procedures...
  ──────────────────────────────────────────────────
✅ Checkpoint

Look for these key behaviors:

  • Questions 1–4: Should return accurate, cited answers drawn from the sample documents
  • Question 5 (weather): Should say the context doesn't contain the answer — this confirms the grounding prompt is working
  • Source citations: Answers should include [Source 1], [Source 2] references matching the retrieved chunks
  • Distances: Lower distances mean more relevant chunks. The best match should have the lowest distance.
Troubleshooting
  • AuthenticationError → Check your ANTHROPIC_API_KEY is set correctly
  • Claude answers the weather question → The grounding system prompt may not be strict enough. Make sure it says "based ONLY on the provided context"
  • Irrelevant chunks retrieved → Try smaller chunk_size (200 instead of 500) or add more focused content to your sample documents
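  • APIError: 529 → A 529 means the API is temporarily overloaded, independent of your own usage. Wait 30 seconds and try again. If you see 429 (rate limit) instead, the 5 back-to-back calls may be exceeding your limit, so run fewer questions.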
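
If transient errors keep interrupting your test run, a bounded retry with exponential backoff is a common pattern. The sketch below uses only the standard library: TransientError is a hypothetical stand-in for a retriable failure, call_with_backoff is an illustrative helper name, and the sleep function is injectable so the demo runs instantly.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical stand-in for a retriable failure (e.g., HTTP 429 or 529)."""

def call_with_backoff(fn, max_attempts: int = 4, base_delay: float = 1.0,
                      sleep=time.sleep):
    """Call fn(); on TransientError wait base_delay * 2**attempt (plus jitter), then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # bounded: give up instead of retrying forever
            jitter = random.uniform(0, 0.1 * base_delay)
            sleep(base_delay * (2 ** attempt) + jitter)

# Demo: fail twice, then succeed. Waits are recorded instead of slept.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("overloaded")
    return "ok"

waits = []
result = call_with_backoff(flaky, sleep=waits.append)
print(result, attempts["n"], [round(w) for w in waits])  # ok 3 [1, 2]
```

In the lab script you would wrap the query_rag call and catch the SDK's own exception types (for example anthropic.RateLimitError) in place of TransientError. Note that the Anthropic SDK already retries some failures on its own by default, so this mostly helps when those built-in retries are exhausted.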

Verify Everything Works

Run the complete pipeline end-to-end. All 5 queries should return answers, with the first 4 citing sources and the 5th correctly declining to answer:

Command
python rag_pipeline.py

If all queries complete and you see cited answers for domain questions and a "not in context" response for the weather question, you've built a working RAG pipeline.

🎉 Congratulations

You've built a complete RAG system from scratch — document loading, chunking with overlap, vector storage in ChromaDB, semantic retrieval, and grounded generation with Claude. This is the same architecture used by production document Q&A systems, knowledge bases, and AI assistants.

Stretch Goals (Optional)
  • Add metadata filtering: pass where={"source": "filing_guide.md"} to collection.query(...) alongside query_texts and n_results to restrict search to specific files
  • Implement a relevance threshold: if all chunk distances are above 0.8, return "I don't have enough information"
  • Experiment with chunk sizes of 200, 500, and 1000 — which gives the best answers for the UCC content?
  • Add a /sources command that prints the raw retrieved chunks and similarity scores before the answer
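
As a starting point for the relevance-threshold stretch goal, here is a small sketch that filters hits by distance. It runs on a hand-written dictionary shaped like the result of collection.query(...), so no ChromaDB instance is needed. The function name filter_by_distance is hypothetical, and 0.8 is simply the cutoff suggested above; with cosine space, ChromaDB reports distance as 1 minus cosine similarity, so you will likely need to tune it for your data.

```python
def filter_by_distance(results: dict, max_distance: float = 0.8) -> list[dict]:
    """Keep only hits whose distance is at or below max_distance."""
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    dists = results["distances"][0]
    return [
        {"text": doc, "source": meta["source"], "distance": dist}
        for doc, meta, dist in zip(docs, metas, dists)
        if dist <= max_distance
    ]

# Hand-written stand-in for a query result (values are made up)
fake_results = {
    "documents": [["A properly filed UCC-1 is effective for 5 years.",
                   "Collateral description uses broad terms.",
                   "Q: Can a UCC filing be terminated?"]],
    "metadatas": [[{"source": "filing_guide.md"},
                   {"source": "risk_assessment.md"},
                   {"source": "compliance_faq.txt"}]],
    "distances": [[0.23, 0.85, 0.91]],
}

hits = filter_by_distance(fake_results)
print([h["source"] for h in hits])  # ['filing_guide.md']

no_hits = filter_by_distance(fake_results, max_distance=0.1)
if not no_hits:
    print("I don't have enough information.")
```

Inside query_rag, you would run this right after the retrieval call and return the fallback message before calling Claude when the filtered list is empty, which also saves an API call.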

Knowledge Check

Q1: What problem does RAG solve that prompt engineering alone cannot?

A Making Claude respond faster
B Reducing API costs
C Giving Claude access to private, proprietary, or recent data beyond its training cutoff
D Improving Claude's grammar and writing style
Correct! No matter how clever your prompt, Claude can't reference documents it has never seen. RAG injects relevant external documents into the prompt at query time.

Q2: Given a 2000-character document, chunk size of 500, and overlap of 50, approximately how many chunks result?

A 3 chunks
B 5 chunks
C 4 chunks
D 8 chunks
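Correct! Each chunk starts (500 - 50) = 450 characters after the previous one, so chunks begin at offsets 0, 450, 900, 1350, and 1800. The chunk starting at 1800 reaches the end of the document, giving 5 chunks. Overlap increases the chunk count (it would be 4 without overlap) but keeps text that straddles a boundary intact in at least one chunk.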
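
The arithmetic can be checked with a few lines of Python. This sketch assumes plain fixed-size windows and ignores the boundary snapping that chunk_text performs, so real chunk counts may come out slightly higher. The helper name chunk_starts is hypothetical.

```python
import math

def chunk_starts(doc_len: int, chunk_size: int = 500, overlap: int = 50) -> list[int]:
    """Start offsets of fixed-size windows that advance by chunk_size - overlap."""
    step = chunk_size - overlap
    starts = []
    start = 0
    while True:
        starts.append(start)
        if start + chunk_size >= doc_len:
            break  # this window reaches the end of the document
        start += step
    return starts

starts = chunk_starts(2000)
print(starts)                          # [0, 450, 900, 1350, 1800]
print(len(starts))                     # 5
print(math.ceil((2000 - 50) / 450))    # 5, the closed-form equivalent
```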

Q3: Why is cosine similarity preferred over Euclidean distance for comparing text embeddings?

A It measures direction (meaning) regardless of vector magnitude (text length)
B It's faster to compute
C It produces integers instead of floats
D Euclidean distance doesn't work with high-dimensional data
Correct! Cosine similarity compares the angle between vectors, making it invariant to magnitude. A short and long text about the same topic will have similar direction (meaning) even if their magnitudes differ.
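
A toy example makes the magnitude-invariance point visible. The three-dimensional vectors below are made up for illustration and are not real embeddings, which typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

short_doc = [1.0, 2.0, 3.0]
long_doc = [10.0, 20.0, 30.0]   # same direction, 10x the magnitude
other_doc = [3.0, -1.0, 0.5]    # different direction

print(round(cosine_similarity(short_doc, long_doc), 6))    # 1.0
print(round(euclidean_distance(short_doc, long_doc), 3))   # 33.675
print(cosine_similarity(short_doc, other_doc) < cosine_similarity(short_doc, long_doc))  # True
```

Cosine similarity rates short_doc and long_doc as identical in direction, while Euclidean distance puts them far apart purely because of scale.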

Q4: Your RAG system retrieves irrelevant chunks for user queries. Which fix is most likely to help?

A Use larger chunks (2000+ characters)
B Try smaller chunks with more overlap, or switch to a better embedding model
C Remove the vector database and use keyword search
D Increase Claude's max_tokens parameter
Correct! Smaller chunks improve retrieval precision by isolating distinct concepts. A better embedding model produces more accurate semantic representations. These two changes address the root cause: retrieval quality.

Q5: What is the correct order for the RAG pipeline steps?

A Embed → Chunk → Load → Store → Retrieve → Generate
B Load → Embed → Chunk → Store → Generate → Retrieve
C Chunk → Load → Embed → Retrieve → Store → Generate
D Load → Chunk → Embed → Store → Retrieve → Generate
Correct! Load documents first, chunk them into segments, embed chunks into vectors, store in the vector DB. At query time: retrieve relevant chunks, then generate an answer with Claude using those chunks as context.

Q6: (Recall from M05/M06) In the RAG query function, you call Claude's Messages API. If the API call fails, what should your code do?

A Return the raw chunks without an answer
B Silently return an empty string
C Catch the APIError and return a descriptive error message
D Retry indefinitely until it succeeds
Correct! Just like the error handling patterns from M05 and M06, catch the error and return a clear message. Never crash silently or retry forever.

Module Summary

Key Concepts

  • The knowledge problem: LLMs have a training cutoff and can't access private data — RAG fixes this.
  • RAG pipeline: Load → Chunk → Embed → Store → Retrieve → Generate.
  • Embeddings: Dense vectors that capture semantic meaning. Cosine similarity measures closeness.
  • Chunking: How you split documents directly determines retrieval quality. Experiment with sizes.
  • Vector databases: ChromaDB for prototyping, Pinecone/pgvector for production scale.

Next: M10 — Advanced RAG

You've built a basic RAG system. In M10, you'll tackle the hard problems: hybrid search (combining semantic + keyword), reranking retrieved results, handling multi-document queries, and evaluating retrieval quality with metrics like MRR and recall@k.

References & Resources