M23: Capstone Project Series | Building AI Agents with Claude

Learning Objectives

Apply a structured five-phase methodology (Requirements → Architecture → Build → Evaluate → Harden) to build an agent from scratch
Select the appropriate capstone project tier based on your skill level and completed modules
Understand how the same agent patterns (tool use, RAG, guardrails) adapt to three different industry domains
Use a self-assessment rubric to evaluate your capstone across functionality, code quality, prompts, safety, and observability
Set up and run a project starter template with domain-specific data loaders and evaluation harness

How to Approach a Capstone Project

Everyday Analogy

A capstone project is like building a house. Before picking paint colors and furniture (the fun parts), you need a foundation — a clear set of requirements: how many rooms, what is the lot size, what is the budget? Without that, you end up with load-bearing walls in the wrong place and a leaky roof.

The pain of skipping this step is real: developers who jump straight to coding an agent often build something that works for their first test case and collapses on the second. They fix that, and then it collapses on the third. This "patch-and-pray" cycle wastes more time than planning ever would.

The capstone methodology maps directly: foundation (requirements) → framing (architecture) → wiring and plumbing (tools and integrations) → inspection (evaluation) → weatherproofing (production hardening). Each phase has a clear checklist, and you do not move forward until the current phase is solid.

Here is what a Phase 1 requirements document actually looks like for a capstone project — this is the kind of artifact you would create before writing any code:

{
  "project": "Capstone 1: Order Status Bot",
  "domain": "B: B2B Ecommerce",
  "core_problem": "Customer support agents spend 40% of their time looking up PO status manually.",
  "success_criteria": [
    "Returns correct status for 95% of valid PO numbers",
    "Responds with a helpful message (not a stack trace) for invalid POs",
    "Refuses to answer non-order questions politely",
    "Includes ETA and SLA warnings when applicable"
  ],
  "edge_cases": [
    "Malformed PO number (e.g., 'PO-abc-xyz')",
    "PO exists but has no tracking info yet (status: processing)",
    "Multiple shipments for one PO (partial fulfillment)",
    "User asks about an order AND something unrelated in the same message"
  ],
  "tools_needed": ["get_order_status", "get_tracking_detail"],
  "agent_pattern": "single-turn tool-calling (M05)"
}

Technical Definition

The capstone methodology is a five-phase process for building production-quality agents. Each phase maps to specific course modules, so you are never guessing what to do next. Here is how each phase works:

Phase 1 — Requirements Analysis: Read the domain brief. Identify the core user problem ("a provider needs to check pre-auth status"). Define what "success" looks like in measurable terms ("correctly returns status for 95% of valid reference numbers"). List edge cases ("what if the reference number is malformed?"). This is where you prevent 80% of future bugs.

Phase 2 — Architecture Design: Choose the agent pattern that fits your problem. A simple status lookup needs a single-turn tool-calling agent (M05). A complex research task needs ReAct (M12) or multi-agent (M14). Select your tools, data sources, and state management approach.

Phase 3 — Iterative Build: Build in vertical slices — one complete user flow at a time. Start with the happy path. Get it working. Then add error handling. Then edge cases. This is the iterative approach from M12-M14.

Phase 4 — Evaluation: Build an eval dataset from the domain brief's example scenarios. Run automated quality checks using the techniques from M18. Measure against your success criteria from Phase 1.

Phase 5 — Production Hardening: Add the guardrails (M16-M17), observability (M19-M20), and cost optimization (M22) that make your agent safe and efficient. This is the difference between a demo and a deployable system.
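
Phases 1 and 4 meet at the success criteria: the threshold you write down in Phase 1 becomes the pass bar you measure in Phase 4. The sketch below illustrates that link under stated assumptions. It is not the course's evaluation harness. `fake_agent`, `EVAL_CASES`, and `pass_rate` are hypothetical names, and the substring check is a deliberately crude stand-in for the quality checks from M18.

```python
from typing import Callable

# Hypothetical eval cases derived from a Phase 1 requirements document.
# Each case pairs a user input with a substring the reply must contain.
EVAL_CASES = [
    {"input": "Where is PO-2024-8847?", "must_contain": "shipped"},
    {"input": "Where is PO-abc-xyz?", "must_contain": "not a valid PO"},
    {"input": "What's the weather?", "must_contain": "only help with orders"},
]

SUCCESS_THRESHOLD = 0.95  # taken from the Phase 1 success criterion


def fake_agent(message: str) -> str:
    """Stand-in for a real agent so the sketch runs without an API key."""
    if "PO-2024-8847" in message:
        return "Your order has shipped."
    if "PO-" in message:
        return "That is not a valid PO number."
    return "I can only help with orders."


def pass_rate(agent: Callable[[str], str], cases: list[dict]) -> float:
    """Return the fraction of cases whose reply contains the expected text."""
    passed = sum(
        1
        for case in cases
        if case["must_contain"].lower() in agent(case["input"]).lower()
    )
    return passed / len(cases)


rate = pass_rate(fake_agent, EVAL_CASES)
print(f"pass rate: {rate:.0%}, meets threshold: {rate >= SUCCESS_THRESHOLD}")
```

Swapping `fake_agent` for your real agent function is the only change needed to reuse the same cases throughout the build.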

Capstone Methodology — Five Phases

📄 Requirements (M01-M04) → 📐 Architecture (M05, M12-M14) → 💻 Build (M05-M15) → ✅ Evaluate (M18) → 🛡 Harden (M16-M22)

Why It Matters

Teams that follow a structured methodology are far more likely to ship agents that hold up on their first production deployment. A 2024 survey of production agent failures found that 73% of failures were caused by skipping requirements analysis or architecture design: the agent worked on the developer's test cases but broke on real user input. The five-phase approach does not slow you down; it prevents the 3× rework that comes from building without a plan.

🎓 Cert Tip — Domain 1.5
The five-phase methodology maps directly to how the certification exam tests agent design. The exam rewards candidates who can articulate why they chose a specific agent pattern (Phase 2) and how they would evaluate it (Phase 4). "I picked ReAct because the task requires multi-step reasoning" scores higher than "I always use ReAct."

You now understand the methodology. Before you choose WHICH project to build, you need to honestly assess your current skill level. The next section helps you pick the right capstone for where you are right now.

Project Selection Guide

Everyday Analogy

Before GPS, hikers chose trails by reading the guidebook: "easy — 2 miles, flat terrain, suitable for families" or "difficult — 12 miles, 4,000 ft elevation gain, experienced hikers only." You would not take a family with small children on the expert trail, and an experienced hiker would be bored on the flat path.

The pain of choosing wrong is predictable: pick too easy and you do not learn anything new. Pick too hard and you spend the whole time frustrated, never finishing. Both outcomes waste your time.

Capstone selection works the same way. The five projects are rated by difficulty, and each requires a specific set of completed modules. Your "trail rating" is determined by the tracks you have completed — not how smart you are, but how many tools you have practiced with.

Here is what the "trail rating" actually looks like for each capstone tier:

Tier 1: Capstone 1 (★☆☆☆☆)
  Trail: 2-3 hours, flat terrain
  Required gear: Tracks 1-3 (M01-M11)
  You will build: Single-tool conversational agent

Tier 2: Capstones 2-3 (★★★☆☆)
  Trail: 4-5 hours, moderate elevation
  Required gear: Tracks 1-5 (M01-M18)
  You will build: RAG pipeline OR multi-tool orchestration agent

Tier 3: Capstones 4-5 (★★★★★)
  Trail: 6-8 hours, expert terrain
  Required gear: All 7 tracks (M01-M22)
  You will build: Multi-agent system with production deployment

Technical Definition

The five capstone projects are organized into three difficulty tiers. Each tier requires you to have completed specific course tracks:

Tier 1 — Beginner-Friendly (Capstone 1): A single-tool conversational assistant. Requires Tracks 1–3 (M01-M11). Focuses on core prompt engineering and tool use. You build an agent that takes one input, calls one tool, and returns a structured response. Estimated time: 2–3 hours.

Tier 2 — Intermediate (Capstones 2–3): RAG-powered domain experts and multi-tool orchestration agents. Requires Tracks 1–5 (M01-M18). Introduces real-world data complexity, retrieval pipelines, and guardrails. Estimated time: 4–5 hours each.

Tier 3 — Advanced (Capstones 4–5): Multi-agent architectures and full production deployments. Requires all 7 tracks (M01-M22). Simulates real enterprise complexity with multi-agent coordination, production observability, and cost optimization. Estimated time: 6–8 hours each.

Before choosing, self-assess on four factors. First, how comfortable are you with async programming (promises, callbacks, event loops)? Second, have you worked with REST APIs before (sending requests, parsing JSON responses)? Third, do any of the three domains feel familiar? And fourth, how much time can you realistically dedicate? It is better to complete one project thoroughly than rush through three superficially.

Project Selection Flowchart

Completed Tracks 1-3?
    No → Review M01-M11
    Yes ↓
Capstone 1: Single-Tool Agent (2-3 hrs)
    ↓
Completed Tracks 4-5?
    No → Review M12-M18
    Yes ↓
Capstones 2-3: RAG + Multi-Tool (4-5 hrs)
    ↓
Completed Tracks 6-7?
    No → Review M19-M22
    Yes ↓
Capstones 4-5: Multi-Agent + Production (6-8 hrs)
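
The flowchart's gating logic is small enough to write down as a function. The sketch below is one possible encoding, assuming tracks are numbered 1 through 7 and that each tier requires every track up to its cutoff. `recommend_capstones` is a hypothetical helper, not part of the starter template.

```python
def recommend_capstones(completed_tracks: set[int]) -> list[int]:
    """Return the capstone numbers unlocked by the completed tracks.

    Mirrors the flowchart: each tier needs every track up to its cutoff.
    """
    # (highest track required, capstones unlocked), taken from the tier list
    tiers = [
        (3, [1]),     # Tier 1: Tracks 1-3
        (5, [2, 3]),  # Tier 2: Tracks 1-5
        (7, [4, 5]),  # Tier 3: all 7 tracks
    ]
    unlocked: list[int] = []
    for max_track, capstones in tiers:
        required = set(range(1, max_track + 1))
        if not required <= completed_tracks:
            break  # a gap here blocks every later tier as well
        unlocked.extend(capstones)
    return unlocked


print(recommend_capstones({1, 2, 3}))        # Tier 1 only
print(recommend_capstones({1, 2, 3, 4, 5}))  # Tiers 1 and 2
print(recommend_capstones({1, 2}))           # nothing yet: review M01-M11
```

The early `break` reflects the flowchart's shape: a missing track stops the walk, so finishing Track 5 without Track 4 does not unlock Tier 2.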

Common Misconceptions

"I should start with Capstone 5 to learn the most." — No. Capstone 5 assumes you can already build single-tool agents, RAG pipelines, and multi-agent systems. If you skip those building blocks, you will spend most of your time debugging fundamentals instead of learning production deployment.

"The beginner capstone is too easy for me." — Maybe. But can you build it in under 2 hours with full error handling, a self-assessment score of 20+/25, and observability? If so, move on. If not, there is still learning here.

"I need to complete all five capstones." — You do not. One well-executed capstone with a thorough self-assessment teaches more than five rushed ones. Quality over quantity.

Now you know how to assess your readiness. But what exactly will you be building? Each capstone operates in one of three real-world domains. The next section introduces them.

The Three Domain Anchors

Everyday Analogy

Imagine three kitchens: a French restaurant with delicate sauces and precise temperatures, a sushi bar with raw fish and exacting knife skills, and a food truck with speed-first cooking and limited counter space. Each kitchen serves different food with different techniques and different constraints.

But a chef who can only cook in ONE kitchen has a problem. They know French sauces but freeze when handed a fish knife. The best chefs adapt their core skills — heat control, seasoning, timing — to any kitchen. That adaptability is what separates a cook from a chef.

The three capstone domains work the same way: Healthcare uses clinical terminology and strict compliance. Ecommerce uses real-time APIs and multi-party coordination. Public Records uses messy data and entity resolution. Different "kitchens," but the same core agent skills: tool use, retrieval, guardrails, and evaluation. The domain forces you to adapt.

Here is what "same skill, different domain" looks like in practice — the same tool-use pattern adapted to three kitchens:

// Same pattern: agent calls a tool, gets structured data, formats a response

Healthcare: agent.call("check_criteria", {cpt: "27447", icd: "M17.11"})
            → {meets_criteria: true, guideline: "MCG-2024-KR-001"}

Ecommerce:  agent.call("get_order_status", {po: "PO-2024-8847"})
            → {status: "shipped", carrier: "FedEx", eta: "2024-03-15"}

Public Rec: agent.call("search_filings", {entity: "Acme Corporation"})
            → {matches: 4, total_exposure: "$2.3M", states: ["NY","CA","TX"]}
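
The pseudocode above can be made runnable with a small dispatch table. This is a minimal sketch, assuming each tool is a plain Python function that returns a dictionary. The `TOOLS` registry and `call_tool` helper are hypothetical names, the tool bodies simply return the example payloads, and the `"unknown_tool"` error category is an illustrative value.

```python
from typing import Any, Callable


# Mock tools that return the example payloads shown above.
def check_criteria(cpt: str, icd: str) -> dict[str, Any]:
    return {"meets_criteria": True, "guideline": "MCG-2024-KR-001"}


def get_order_status(po: str) -> dict[str, Any]:
    return {"status": "shipped", "carrier": "FedEx", "eta": "2024-03-15"}


def search_filings(entity: str) -> dict[str, Any]:
    return {"matches": 4, "total_exposure": "$2.3M", "states": ["NY", "CA", "TX"]}


# Hypothetical registry: tool name -> callable. Only this table changes per domain.
TOOLS: dict[str, Callable[..., dict[str, Any]]] = {
    "check_criteria": check_criteria,
    "get_order_status": get_order_status,
    "search_filings": search_filings,
}


def call_tool(name: str, args: dict[str, Any]) -> dict[str, Any]:
    """Dispatch a tool call by name; unknown tools yield a structured error."""
    tool = TOOLS.get(name)
    if tool is None:
        # Assumed error shape: is_error, error_category, is_retryable
        return {"is_error": True, "error_category": "unknown_tool", "is_retryable": False}
    return tool(**args)


print(call_tool("check_criteria", {"cpt": "27447", "icd": "M17.11"}))
print(call_tool("get_order_status", {"po": "PO-2024-8847"}))
print(call_tool("search_filings", {"entity": "Acme Corporation"}))
```

The dispatch function never changes between domains, which is the point of the analogy: the kitchen differs, the technique does not.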

Three Domains, Same Agent Patterns

Domain A: Healthcare (Clinical Review Agent)

📥 Receives pre-auth request (CPT 27447, ICD M17.11)
🔍 Consults clinical guidelines tool (MCG criteria lookup)
✅ Produces approval + reasoning: "Meets medical necessity criteria"

Domain B: Ecommerce (Order Tracking Agent)

📥 Receives "Where is my order?" (PO-2024-8847)
🔍 Calls carrier API tool (FedEx tracking #7748294)
✅ Returns status: "In transit, ETA March 15, Memphis hub"

Domain C: Public Records (UCC Analysis Agent)

📥 Receives entity search: "Acme Corporation"
🔍 Performs entity resolution across 3 states (NY, CA, TX)
✅ Returns lien risk: "4 active UCC-1 filings, $2.3M exposure"

Domain A: Healthcare Pre-Authorization

Pre-authorization is the process where a doctor's office asks an insurance company "will you cover this procedure for this patient?" before performing it. The agent must evaluate clinical requests against insurance criteria.

Key data: The agent works with two main code systems. First, CPT codes identify the procedure (e.g., CPT 27447 = total knee replacement). Second, ICD-10 codes identify the diagnosis (e.g., M17.11 = osteoarthritis of the right knee). Beyond codes, the agent also consults payer policy documents, clinical guidelines, and formulary lists.

Why it is great for agents: This domain exercises nearly every agent skill. You will use multi-step decision logic to evaluate clinical criteria. You will use RAG to retrieve relevant policy documents. You will produce structured output for authorization forms. You will implement human-in-the-loop flows for clinical reviewer approval. And you will need strict guardrails for HIPAA compliance and PHI handling — all under real cost and time pressure.

Domain B: B2B Ecommerce Order Tracking

B2B ecommerce involves complex order lifecycles — from RFQ and PO creation through multi-warehouse fulfillment, partial shipments, and carrier tracking. Unlike B2C, B2B orders have approval workflows, net payment terms, volume discounts, and multi-stakeholder communication.

Key data: The agent juggles several data sources. SKUs and PO numbers identify products and orders. Carrier tracking APIs (FedEx, UPS, DHL) provide shipment status. Warehouse management system data shows inventory and fulfillment progress. ERP order status tracks the overall lifecycle. And SLA commitments define the deadlines your agent must flag when they are at risk.

Why it is great for agents: This domain forces you to orchestrate multiple tools at once — querying an ERP system, a warehouse management system, and a carrier API in a single request. You will build real-time status aggregation across these sources. You will implement proactive alerting for delayed shipments. And you will practice multi-agent handoffs (sales → fulfillment → support) with production-grade observability.
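
Flagging an at-risk SLA comes down to comparing a shipment's ETA with the order's deadline. The sketch below shows one way to do that, assuming ISO-formatted date strings like the `eta` and `sla_deadline` fields in the mock order data. The `sla_risk` helper, its labels, and the one-day warning window are illustrative assumptions.

```python
from datetime import date
from typing import Optional


def sla_risk(eta: Optional[str], sla_deadline: str, warn_days: int = 1) -> str:
    """Classify a shipment as 'on_track', 'at_risk', 'breach', or 'unknown'.

    warn_days is an assumed buffer: an ETA this close to the deadline
    (or closer) is flagged even though it has not yet been missed.
    """
    if eta is None:
        return "unknown"  # e.g. order still processing, no tracking yet
    slack = (date.fromisoformat(sla_deadline) - date.fromisoformat(eta)).days
    if slack < 0:
        return "breach"
    if slack <= warn_days:
        return "at_risk"
    return "on_track"


# ETA one day before the deadline falls inside the warning window.
print(sla_risk("2024-03-15", "2024-03-16"))
# No tracking information yet.
print(sla_risk(None, "2024-03-20"))
```

Returning a label instead of a boolean lets the agent phrase its reply differently for a missed deadline and a tight one.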

Domain C: Public Records / UCC Data Engineering

UCC filings are public records that document who has a financial claim (lien) on a company's assets. Banks, investors, and credit analysts use them to assess risk before lending money or making deals.

Key data: The core records are UCC-1 financing statements (who has a lien on what) and UCC-3 amendments that modify, continue, or terminate those liens. Each filing includes debtor and secured party names and addresses, collateral descriptions, filing dates, and lapse dates. The challenge? This data comes from 50+ state offices in wildly inconsistent formats — CSV, fixed-width text, XML, and sometimes PDF.

Why it is great for agents: This domain tests three distinct skill groups. First, data engineering pipeline orchestration using the Medallion Architecture (Bronze → Silver → Gold). Second, RAG on legal and regulatory reference documents to answer compliance questions. Third, entity resolution as a multi-step reasoning task — determining whether "Acme Corp" and "ACME CORPORATION" are the same entity. You will also practice evaluation and testing focused on data quality metrics.
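
The first step of entity resolution is usually name normalization, so that trivially different spellings compare equal. The sketch below handles the "Acme Corp" versus "ACME CORPORATION" case and nothing more. The `SUFFIXES` map and both helper names are hypothetical, and a real resolver would also weigh addresses, filing states, and fuzzy matches.

```python
import re

# Hypothetical abbreviation map; real filings need a much longer list.
SUFFIXES = {
    "CORP": "CORPORATION",
    "INC": "INCORPORATED",
    "CO": "COMPANY",
    "LTD": "LIMITED",
}


def normalize_entity(name: str) -> str:
    """Uppercase, strip punctuation, and expand common legal suffixes."""
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", name).upper()
    tokens = [SUFFIXES.get(token, token) for token in cleaned.split()]
    return " ".join(tokens)


def same_entity(a: str, b: str) -> bool:
    """Exact match after normalization; a first pass, not a final verdict."""
    return normalize_entity(a) == normalize_entity(b)


print(same_entity("Acme Corp", "ACME CORPORATION"))
print(same_entity("Acme Corp.", "Apex Corp."))
```

Exact matching after normalization produces no fuzzy false positives, which makes it a reasonable Silver-layer step before any multi-step reasoning is applied to the remaining ambiguous pairs.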

Why It Matters

Most tutorials teach agents with toy domains such as "build a recipe chatbot." That is fine for learning syntax, but it does not prepare you for real complexity. In production, you face messy data (UCC filings arrive from 50+ state offices in inconsistent formats), strict compliance (statutory HIPAA penalty tiers start at $100 and reach $50,000 per violation), and multi-party coordination (an order touches ERP, WMS, carrier, and customer systems). These three domains expose you to real-world friction that tutorials skip.

Common Misconceptions

"I should pick the domain I already know best." — Not necessarily. If you already work in healthcare, Domain A will be comfortable but you will learn less about adapting to unfamiliar data. Picking an unfamiliar domain forces you to practice reading domain briefs and asking clarifying questions — a critical skill when building agents for real clients.

"The mock data is just placeholder — I'll use real APIs later." — The mock data IS your development and testing environment. Build your entire agent against mock data first, get it passing all eval cases, THEN swap in real APIs. Developers who start with real APIs spend 80% of their time debugging authentication and rate limits instead of building agent logic.

"Architecture design is overkill for a single-tool agent." — Even Capstone 1 benefits from 10 minutes of architecture planning. Which agent pattern? What error handling strategy? What are the edge cases? The students who skip this step in Capstone 1 consistently produce worse code than those who spend a few minutes planning, because they make structural decisions mid-implementation that create messy, hard-to-test code.

"I need to build the whole project before I can evaluate it." — The opposite is true. Write your eval test cases in Phase 1 (Requirements), before you write any agent code. This is test-driven development applied to agents. When you know what "success" looks like upfront, you build toward it instead of discovering failures after the fact.

You now know the three domains and their unique challenges. But which course modules apply to which capstone? The skills mapping tells you exactly what tools you need for each project.

Mapping Course Skills to Capstone Projects

Everyday Analogy

Over the past 22 modules, you have been collecting tools for your toolbox — a hammer (prompt engineering), a drill (tool use), a measuring tape (evaluation), safety goggles (guardrails), and a workbench (deployment). Each tool sits in its own compartment, neatly organized.

The challenge is this: when you walk up to a real project, nobody hands you a list of which tools to use. You open the toolbox and think, "Is this a hammer job or a drill job? Do I need the measuring tape now or later?" Making the wrong choice does not break anything, but it wastes time — like using a hammer to drive a screw.

The skills mapping is your project-specific tool list. It tells you, for each capstone: "You will definitely need the hammer and drill. The measuring tape is recommended. The rest is optional." It turns the overwhelming "I have 22 modules of knowledge" into a focused "I need these 8 skills for this project."

Here is what a skills mapping actually looks like in practice — a concrete checklist you would create before starting Capstone 2:

{
  "capstone": "Capstone 2: RAG Domain Expert",
  "domain": "C: Public Records / UCC",
  "required_skills": [
    {"module": "M01-M04", "skill": "Prompt engineering + structured output", "status": "completed"},
    {"module": "M05", "skill": "Function calling (tool use loop)", "status": "completed"},
    {"module": "M09-M10", "skill": "RAG pipeline + advanced retrieval", "status": "completed"},
    {"module": "M16-M17", "skill": "Input/output guardrails", "status": "in_progress"},
    {"module": "M18", "skill": "Evaluation harness", "status": "not_started"}
  ],
  "recommended_skills": [
    {"module": "M06", "skill": "Multi-tool orchestration", "status": "completed"},
    {"module": "M08", "skill": "Conversation management", "status": "completed"}
  ],
  "readiness": "Review M16-M18 before starting. Core retrieval skills are ready."
}
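
A checklist in this shape can also drive the readiness line automatically. The sketch below is a minimal illustration that assumes the same `required_skills` structure as the example, trimmed to the fields it needs. `modules_to_review` is a hypothetical helper.

```python
# Excerpt of the Capstone 2 mapping shown above (required skills only).
skills_mapping = {
    "required_skills": [
        {"module": "M01-M04", "status": "completed"},
        {"module": "M05", "status": "completed"},
        {"module": "M09-M10", "status": "completed"},
        {"module": "M16-M17", "status": "in_progress"},
        {"module": "M18", "status": "not_started"},
    ]
}


def modules_to_review(mapping: dict) -> list[str]:
    """List required modules whose status is anything other than 'completed'."""
    return [
        skill["module"]
        for skill in mapping["required_skills"]
        if skill["status"] != "completed"
    ]


gaps = modules_to_review(skills_mapping)
print("Ready to start" if not gaps else f"Review first: {', '.join(gaps)}")
```

Only required skills gate readiness here; recommended skills are left out on purpose, matching the checklist's distinction between the two lists.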

Technical Definition

Every capstone uses the fundamentals from M01-M05 (LLM mental model, tokens, prompts, structured output, and function calling). Beyond that, the requirements differ by project tier:

Capstone 1 (Single-Tool): Core tool use (M05) + structured output (M04) + conversation management (M08). That is it: two fundamentals applied in depth, plus conversation management as the one skill beyond them. The simplicity is deliberate: this capstone tests whether you can combine the basics into a working agent.

Capstones 2-3 (RAG + Multi-Tool): On top of the fundamentals, you will need three additional skill groups. First, the retrieval skills from M09-M10 — how to build a RAG pipeline that finds relevant documents and returns cited results. Second, the orchestration skills from M06 — how to coordinate multiple tools so the agent calls them in the right order. Third, the safety skills from M16-M18 — guardrails to prevent bad outputs and evaluation to measure quality. The RAG capstone focuses on retrieval accuracy. The multi-tool capstone focuses on tool coordination logic.

Capstones 4-5 (Multi-Agent + Production): These are the most demanding, requiring two additional skill groups. First, agent architecture skills from M12-M14: the ReAct reasoning loop, planning and task decomposition, and multi-agent coordination. Second, production infrastructure from M19-M22: tracing, monitoring dashboards, deployment, and cost optimization. In short, these capstones test three things. Can you build a team of agents that collaborate? Can you monitor their health? And can you keep them within a cost budget?

Importantly, the domain you choose also shifts which modules matter most. If you pick Healthcare (Domain A), you will spend more time on guardrails from M16-M17. Why? Because medical decisions have real consequences — a wrong pre-auth approval could lead to an uncovered $50,000 surgery. If you pick Public Records (Domain C), you will lean harder on RAG from M09-M10. Why? Because entity resolution ("Is 'Acme Corp' the same as 'ACME CORPORATION'?") depends entirely on how accurately your retrieval pipeline finds and compares records across messy, inconsistent data sources.

Capstone × Module Readiness Matrix

[Matrix: one row per capstone (Cap 1 ⭐, Cap 2 ⭐⭐, Cap 3 ⭐⭐, Cap 4 ⭐⭐⭐, Cap 5 ⭐⭐⭐) and one column per module group (M01-04, then M05 through M22). Each cell is marked Required, Recommended, or Optional.]

Why It Matters

No capstone project uses every module — but every module is used by at least one capstone. Capstone 1 uses 4 modules. Capstone 5 uses all 22. The breadth of the course gives you options; the capstone tests your ability to select and combine the right subset. This is exactly what real-world agent development looks like: you never use every tool at once, but you need to know which tool fits which problem.

You can now identify which modules to lean on for your chosen project. But how will you know if your capstone is actually good? The next section gives you a concrete rubric to score your own work.

Self-Assessment Rubric

Everyday Analogy

When you cook a piece of chicken, the recipe does not just say "cook until done." It gives you concrete checkpoints: "golden brown on the outside, internal temperature 165°F, let it rest for 5 minutes." Without those checkpoints, you are guessing — and you either undercook it (food poisoning) or overcook it (rubber).

The pain of not having a rubric is the same: you finish your capstone agent and ask yourself "is this good?" Without measurable criteria, you either think it is perfect (because it handled your one test case) or think it is terrible (because you are comparing it to production systems built by teams of 10).

The rubric gives you five "thermometer readings" for your capstone — each scored 1 to 5. You do not need a perfect score. You need honest data on your strengths and gaps, so you know exactly what to improve next.

Here is what a completed self-assessment actually looks like — a JSON report you would generate after finishing your capstone:

{
  "capstone": "Capstone 1: Order Status Bot",
  "domain": "B: B2B Ecommerce",
  "rubric_scores": {
    "functionality": {"score": 4, "notes": "Handles happy path + invalid PO. Missing: partial shipment edge case."},
    "code_quality": {"score": 3, "notes": "Reasonable structure. No type hints on helper functions."},
    "prompts": {"score": 4, "notes": "System prompt has specific rules. Added 2 few-shot examples."},
    "safety": {"score": 3, "notes": "Basic input validation. No PII detection yet."},
    "observability": {"score": 2, "notes": "Only console.log. No structured logging or request IDs."}
  },
  "total": 16,
  "verdict": "Competent: review M19-M20 to improve observability before Capstone 2."
}

Technical Definition

The self-assessment rubric evaluates your capstone across five dimensions. Each dimension is scored from 1 (needs significant work) to 5 (production-ready). Here is what each score means in practice:

1. Functionality (Does it work?): 1 = does not run. 2 = runs but crashes on basic inputs. 3 = handles the happy path correctly. 4 = handles edge cases and common errors. 5 = handles edge cases, malformed inputs, and unexpected tool responses gracefully.

2. Code Quality (Is it well-organized?): 1 = spaghetti code, no structure. 2 = functional but disorganized. 3 = reasonable modules and naming. 4 = well-typed, documented, and modular. 5 = production-ready with tests, types, and clear separation of concerns.

3. Prompt Engineering (Are prompts effective?): 1 = vague one-liner system prompt. 2 = basic instructions. 3 = structured prompts with some examples. 4 = optimized prompts with few-shot examples and edge case handling. 5 = prompts tested against adversarial inputs, with structured output schemas and fallback strategies.

4. Safety & Guardrails (Does it handle misuse?): 1 = no guardrails at all. 2 = basic input length limits. 3 = input validation and basic error messages. 4 = PII detection, injection defense, and graceful degradation. 5 = comprehensive guardrails with PII redaction, rate limiting, circuit breakers, and audit logging.

5. Observability (Can you debug it?): 1 = no logging. 2 = console.log statements. 3 = structured logs with request IDs. 4 = full tracing with spans and timing data. 5 = tracing (Langfuse/similar), structured logs, cost tracking, and a monitoring dashboard.

A total score of 15+ out of 25 demonstrates competency. 20+ out of 25 demonstrates excellence. The rubric is for self-assessment — honest reflection, not external grading.
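
The scoring rule is simple enough to automate. The sketch below totals the five dimensions and applies the 15+ and 20+ thresholds stated above. The `score_rubric` function is a hypothetical helper, and the label used for totals under 15 is an assumption, since the rubric does not name that band.

```python
DIMENSIONS = ("functionality", "code_quality", "prompts", "safety", "observability")


def score_rubric(scores: dict[str, int]) -> tuple[int, str, str]:
    """Return (total, verdict, weakest dimension) for a five-dimension rubric."""
    for dim in DIMENSIONS:
        if not 1 <= scores[dim] <= 5:
            raise ValueError(f"{dim} must be scored 1 to 5, got {scores[dim]}")
    total = sum(scores[dim] for dim in DIMENSIONS)
    if total >= 20:
        verdict = "Excellent"
    elif total >= 15:
        verdict = "Competent"
    else:
        verdict = "Below competency threshold"  # assumed label for totals under 15
    # On ties, min() keeps the first dimension in DIMENSIONS order.
    weakest = min(DIMENSIONS, key=lambda dim: scores[dim])
    return total, verdict, weakest


example = {
    "functionality": 4,
    "code_quality": 3,
    "prompts": 4,
    "safety": 3,
    "observability": 2,
}
print(score_rubric(example))
```

Reporting the weakest dimension alongside the total turns the score into a next step: it points at the modules to revisit.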

Example Rubric Score — Capstone 1 Submission

Functionality: 4/5
Code Quality: 3/5
Prompts: 4/5
Safety: 3/5
Observability: 2/5

Total: 16 / 25 — Competent

Recommendation: Review M19-M20 to improve Observability

⚠️ Common Misconceptions

"A 3/5 means I failed." — Not at all. A 3/5 means you understand the concept and implemented it at a functional level. That is competent work. A 5/5 means production-ready with comprehensive coverage. Most professional agents in production would score 4/5 on most dimensions. The rubric is a growth tool, not a pass/fail gate.

"I should aim for 5/5 on every dimension." — That is rarely the right use of your time. If your goal is to learn RAG (Capstone 2), investing 3 extra hours to push Observability from 3/5 to 5/5 teaches you less than starting Capstone 3. Optimize for learning, not for a perfect score.

"The rubric is objective — two people would give the same scores." — Not exactly. The rubric provides concrete criteria (e.g., "structured logs with request IDs" = 3/5 on Observability), but there is judgment involved. The goal is honest self-reflection, not precision. If you are within 1 point of the "right" score, you are close enough.

Why It Matters

Self-assessment is a professional skill, not just a learning exercise. In production agent development, teams run similar rubrics on every deployment: "Does it handle 95% of cases? Is it traceable? Is it within cost budget?" A developer who can honestly evaluate their own work — identifying both strengths and gaps — improves faster than one who either overestimates or underestimates their code.

You have the methodology, the project selection, the domains, the skills mapping, and the rubric. Now let's get practical: the starter template gives you a running project structure in minutes.

Project Starter Template

Every capstone project starts from the same directory structure. This template gives you configuration files, domain data loaders, and placeholder modules so you can focus on building the agent — not the scaffolding. Without a template, you would spend the first hour of every capstone creating folders, writing boilerplate, and setting up imports. The template eliminates that friction.

The template is intentionally opinionated about organization. Your agent code lives in src/, your mock data lives in data/, and your evaluation tests live in eval/. This separation matters because it scales: Capstone 1 might have 3 files in src/, while Capstone 5 might have 12 — but the structure stays the same. You always know where to find things.

How does a starter template differ from a framework or boilerplate generator? A framework imposes runtime constraints — you must use its request router, its plugin system, its lifecycle hooks. The starter template imposes zero runtime constraints. It is just a recommended folder layout plus domain-specific mock data. You can rename folders, add files, or restructure however you want. The template is a starting point, not a cage. Think of it as a pre-organized empty notebook with labeled tabs, not a form you must fill in.

The starter template has four parts: the directory structure, the configuration, the domain data loader, and the evaluation harness. Let's walk through each one.

Directory Structure

This is the recommended layout. Every file has a clear purpose, and the structure scales from Capstone 1 (simple) to Capstone 5 (complex).

Python version:

capstone-project/
├── README.md                  # Project description, setup, and rubric
├── requirements.txt           # Python dependencies
├── .env.example               # Template for environment variables
├── src/
│   ├── agent.py               # Main agent implementation
│   ├── tools.py               # Tool definitions (your functions)
│   ├── prompts.py             # System prompts and templates
│   ├── guardrails.py          # Input/output validation
│   └── config.py              # Configuration and constants
├── data/
│   ├── domain_a/              # Healthcare: CPT codes, policies
│   ├── domain_b/              # Ecommerce: PO database, SKUs
│   └── domain_c/              # UCC: filings, entity records
├── eval/
│   ├── test_cases.json        # Evaluation scenarios
│   ├── harness.py             # Automated evaluation runner
│   └── rubric.py              # Self-assessment scoring
└── traces/                    # Observability output (logs, traces)

TypeScript version:

capstone-project/
├── README.md                  # Project description, setup, and rubric
├── package.json               # Node.js dependencies
├── tsconfig.json              # TypeScript configuration
├── .env.example               # Template for environment variables
├── src/
│   ├── agent.ts               # Main agent implementation
│   ├── tools.ts               # Tool definitions (your functions)
│   ├── prompts.ts             # System prompts and templates
│   ├── guardrails.ts          # Input/output validation
│   └── config.ts              # Configuration and constants
├── data/
│   ├── domain_a/              # Healthcare: CPT codes, policies
│   ├── domain_b/              # Ecommerce: PO database, SKUs
│   └── domain_c/              # UCC: filings, entity records
├── eval/
│   ├── test_cases.json        # Evaluation scenarios
│   ├── harness.ts             # Automated evaluation runner
│   └── rubric.ts              # Self-assessment scoring
└── traces/                    # Observability output (logs, traces)

What Just Happened?

You now have a mental map of the project layout. The src/ directory holds your agent code (split into focused files, not one giant script). The data/ directory holds domain-specific mock data. The eval/ directory holds your automated tests and rubric scoring. This separation keeps your code organized as it grows.
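
If you prefer to generate the Python layout rather than create it by hand, a short script can do it. This is a sketch using only the standard library. The `scaffold` function is a hypothetical helper that creates empty placeholder files, and the demo runs in a temporary directory so it writes nothing to your working tree.

```python
import tempfile
from pathlib import Path

# Files and directories from the Python layout above.
FILES = [
    "README.md",
    "requirements.txt",
    ".env.example",
    "src/agent.py",
    "src/tools.py",
    "src/prompts.py",
    "src/guardrails.py",
    "src/config.py",
    "eval/test_cases.json",
    "eval/harness.py",
    "eval/rubric.py",
]
DIRS = ["data/domain_a", "data/domain_b", "data/domain_c", "traces"]


def scaffold(root: Path) -> None:
    """Create the starter layout under root without overwriting existing files."""
    for directory in DIRS:
        (root / directory).mkdir(parents=True, exist_ok=True)
    for relative in FILES:
        path = root / relative
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch(exist_ok=True)  # leaves existing content untouched


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "capstone-project"
    scaffold(root)
    created = sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())
    print("\n".join(created))
```

To create the project for real, call `scaffold(Path("capstone-project"))` from the directory where you want it to live.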

Domain Data Loader

Each domain has a data loader that provides mock data for development and testing. You do not need real healthcare APIs or carrier tracking services — the loaders simulate realistic responses so you can focus on building agent logic.

Let's build the Domain B (Ecommerce) data loader as an example. This loader simulates two things your real agent would talk to: an ERP database (where order details live) and a carrier tracking API (where shipment status lives). In production, these would be real HTTP calls to FedEx or your company's ERP system. For the capstone, we use in-memory dictionaries with realistic data instead. The same pattern applies to Domain A (healthcare policies and clinical criteria) and Domain C (UCC filings and entity records) — swap the data structure, keep the same function signature pattern.

Pay attention to the error response structure — this is the most important design decision in the loader. Every function returns a dictionary with three fields: is_error (did something go wrong?), error_category (what kind of problem?), and is_retryable (should the agent try again?). This is the structured error pattern from M05, and it is the difference between an agent that says "something broke" and one that says "that order doesn't exist, and retrying won't help — let me tell the customer." Without these fields, your agent is flying blind when tools fail.

# data/domain_b/loader.py
# Domain B: B2B Ecommerce Order Tracking — Mock Data Loader
# Simulates an ERP order database and carrier tracking API
# so you can develop your agent without real API credentials.

import json
from datetime import datetime, timedelta
from typing import Optional


# --- Mock PO Database ---
# These simulate rows from an ERP system. In production,
# you would query a real database or API instead.
MOCK_ORDERS = {
    "PO-2024-8847": {
        "status": "shipped",
        "customer": "Acme Industrial Supplies",
        "line_items": [
            {"sku": "WDG-4420", "description": "Steel Wedge Anchors (box of 100)",
             "qty": 5, "unit_price": 42.50, "status": "shipped"},
            {"sku": "BLT-1100", "description": "Hex Bolts M10x40 (box of 250)",
             "qty": 10, "unit_price": 28.00, "status": "shipped"},
        ],
        "tracking": [
            {"carrier": "FedEx", "tracking_number": "7748294001",
             "shipped_date": "2024-03-10", "eta": "2024-03-15",
             "current_location": "Memphis, TN", "status": "in_transit"}
        ],
        "total": 492.50,
        "sla_deadline": "2024-03-16",
    },
    "PO-2024-9102": {
        "status": "processing",
        "customer": "BuildRight Construction",
        "line_items": [
            {"sku": "PIP-3300", "description": "Copper Pipe 1in x 10ft",
             "qty": 20, "unit_price": 18.75, "status": "in_production"},
        ],
        "tracking": [],
        "total": 375.00,
        "sla_deadline": "2024-03-20",
    },
    "PO-2024-7500": {
        "status": "delivered",
        "customer": "Metro Office Furnishings",
        "line_items": [
            {"sku": "DSK-2200", "description": "Standing Desk Frame",
             "qty": 3, "unit_price": 289.00, "status": "delivered"},
        ],
        "tracking": [
            {"carrier": "UPS", "tracking_number": "1Z999AA10012",
             "shipped_date": "2024-03-01", "eta": "2024-03-05",
             "current_location": "Delivered", "status": "delivered"}
        ],
        "total": 867.00,
        "sla_deadline": "2024-03-06",
    },
}


def get_order_status(po_number: str) -> dict:
    """
    Simulate an ERP API call to look up order status.
    Returns order details or an error if not found.
    """
    try:
        if po_number not in MOCK_ORDERS:
            return {
                "is_error": True,
                "error_category": "not_found",
                "is_retryable": False,
                "context": f"No order found for PO number: {po_number}",
            }
        order = MOCK_ORDERS[po_number]
        return {
            "is_error": False,
            "po_number": po_number,
            "status": order["status"],
            "customer": order["customer"],
            "line_items": order["line_items"],
            "tracking": order["tracking"],
            "total": order["total"],
            "sla_deadline": order["sla_deadline"],
        }
    except Exception as e:
        return {
            "is_error": True,
            "error_category": "internal_error",
            "is_retryable": True,
            "context": str(e),
        }


def get_tracking_detail(tracking_number: str) -> dict:
    """
    Simulate a carrier tracking API call.
    In production, this would call FedEx/UPS/DHL APIs.
    """
    try:
        for order in MOCK_ORDERS.values():
            for shipment in order["tracking"]:
                if shipment["tracking_number"] == tracking_number:
                    return {
                        "is_error": False,
                        "carrier": shipment["carrier"],
                        "tracking_number": tracking_number,
                        "status": shipment["status"],
                        "current_location": shipment["current_location"],
                        "shipped_date": shipment["shipped_date"],
                        "eta": shipment["eta"],
                    }
        return {
            "is_error": True,
            "error_category": "not_found",
            "is_retryable": False,
            "context": f"No shipment found for tracking: {tracking_number}",
        }
    except Exception as e:
        return {
            "is_error": True,
            "error_category": "internal_error",
            "is_retryable": True,
            "context": str(e),
        }

// data/domain_b/loader.ts
// Domain B: B2B Ecommerce Order Tracking — Mock Data Loader
// Simulates an ERP order database and carrier tracking API
// so you can develop your agent without real API credentials.

interface LineItem {
  sku: string;
  description: string;
  qty: number;
  unitPrice: number;
  status: string;
}

interface Shipment {
  carrier: string;
  trackingNumber: string;
  shippedDate: string;
  eta: string;
  currentLocation: string;
  status: string;
}

interface Order {
  status: string;
  customer: string;
  lineItems: LineItem[];
  tracking: Shipment[];
  total: number;
  slaDeadline: string;
}

interface ToolResult {
  isError: boolean;
  errorCategory?: string;
  isRetryable?: boolean;
  context?: string;
  [key: string]: unknown;
}

// --- Mock PO Database ---
// These simulate rows from an ERP system. In production,
// you would query a real database or API instead.
const MOCK_ORDERS: Record<string, Order> = {
  "PO-2024-8847": {
    status: "shipped",
    customer: "Acme Industrial Supplies",
    lineItems: [
      { sku: "WDG-4420", description: "Steel Wedge Anchors (box of 100)",
        qty: 5, unitPrice: 42.50, status: "shipped" },
      { sku: "BLT-1100", description: "Hex Bolts M10x40 (box of 250)",
        qty: 10, unitPrice: 28.00, status: "shipped" },
    ],
    tracking: [
      { carrier: "FedEx", trackingNumber: "7748294001",
        shippedDate: "2024-03-10", eta: "2024-03-15",
        currentLocation: "Memphis, TN", status: "in_transit" }
    ],
    total: 492.50,
    slaDeadline: "2024-03-16",
  },
  "PO-2024-9102": {
    status: "processing",
    customer: "BuildRight Construction",
    lineItems: [
      { sku: "PIP-3300", description: "Copper Pipe 1in x 10ft",
        qty: 20, unitPrice: 18.75, status: "in_production" },
    ],
    tracking: [],
    total: 375.00,
    slaDeadline: "2024-03-20",
  },
  "PO-2024-7500": {
    status: "delivered",
    customer: "Metro Office Furnishings",
    lineItems: [
      { sku: "DSK-2200", description: "Standing Desk Frame",
        qty: 3, unitPrice: 289.00, status: "delivered" },
    ],
    tracking: [
      { carrier: "UPS", trackingNumber: "1Z999AA10012",
        shippedDate: "2024-03-01", eta: "2024-03-05",
        currentLocation: "Delivered", status: "delivered" }
    ],
    total: 867.00,
    slaDeadline: "2024-03-06",
  },
};


export function getOrderStatus(poNumber: string): ToolResult {
  try {
    const order = MOCK_ORDERS[poNumber];
    if (!order) {
      return {
        isError: true,
        errorCategory: "not_found",
        isRetryable: false,
        context: `No order found for PO number: ${poNumber}`,
      };
    }
    return {
      isError: false,
      poNumber,
      status: order.status,
      customer: order.customer,
      lineItems: order.lineItems,
      tracking: order.tracking,
      total: order.total,
      slaDeadline: order.slaDeadline,
    };
  } catch (error) {
    return {
      isError: true,
      errorCategory: "internal_error",
      isRetryable: true,
      context: String(error),
    };
  }
}


export function getTrackingDetail(trackingNumber: string): ToolResult {
  try {
    for (const order of Object.values(MOCK_ORDERS)) {
      for (const shipment of order.tracking) {
        if (shipment.trackingNumber === trackingNumber) {
          return {
            isError: false,
            carrier: shipment.carrier,
            trackingNumber,
            status: shipment.status,
            currentLocation: shipment.currentLocation,
            shippedDate: shipment.shippedDate,
            eta: shipment.eta,
          };
        }
      }
    }
    return {
      isError: true,
      errorCategory: "not_found",
      isRetryable: false,
      context: `No shipment found for tracking: ${trackingNumber}`,
    };
  } catch (error) {
    return {
      isError: true,
      errorCategory: "internal_error",
      isRetryable: true,
      context: String(error),
    };
  }
}

What Just Happened?

You built a mock data layer that simulates a real ERP system and carrier tracking API. The key design decisions: (1) structured error responses with is_error, error_category, and is_retryable — this follows the pattern from M05 so your agent can make intelligent retry decisions, (2) realistic mock data with multiple order states (processing, shipped, delivered) so you can test edge cases, and (3) separate functions for order lookup and tracking detail so you can practice multi-tool orchestration.
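
One way agent-side code might act on these fields is sketched below. The function name next_action, the three action labels, and the attempt limit are hypothetical and not part of the starter template; the sketch assumes only the error shape returned by the loader above.

```python
# Hypothetical helper: decide what to do with a loader result.
# Assumes the structured error shape used by the loader
# (is_error, error_category, is_retryable, context).

def next_action(result: dict, attempt: int, max_attempts: int = 3) -> str:
    """Return 'use_result', 'retry', or 'report_to_user'."""
    if not result.get("is_error", False):
        return "use_result"
    # Retry only when the tool says a retry could help
    # and there are attempts left.
    if result.get("is_retryable", False) and attempt < max_attempts:
        return "retry"
    return "report_to_user"


not_found = {
    "is_error": True,
    "error_category": "not_found",
    "is_retryable": False,
    "context": "No order found for PO number: PO-9999-0000",
}
transient = {
    "is_error": True,
    "error_category": "internal_error",
    "is_retryable": True,
    "context": "connection reset",
}

print(next_action(not_found, attempt=1))   # report_to_user
print(next_action(transient, attempt=1))   # retry
print(next_action(transient, attempt=3))   # report_to_user
```

A missing order goes straight to the user because is_retryable is False, while a transient failure is retried until the attempt budget runs out.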

Evaluation Harness

The evaluation harness runs your agent against a set of test scenarios and scores the results automatically. Think of it as a mini test suite purpose-built for AI agents. Unlike manual testing (where you type a question, eyeball the answer, and say "looks good"), the harness runs the same scenarios every time, checks specific criteria, and gives you a pass/fail score. This means you can make a code change, rerun the harness in seconds, and immediately see if you broke something.

The harness works in three steps. First, you define test cases — each one specifies an input message, which tool the agent should (or should not) call, and what the response should contain. Second, the harness feeds each input to your agent and captures the response and tool calls. Third, it compares the actual output to the expected output and produces a score. This is the same pattern used in production agent evaluation (M18), just scaled down for a single capstone project.

Why does this matter for capstones specifically? Without a harness, you test your agent with 2-3 manual queries, declare victory, and move on. Then during self-assessment, you discover it crashes on edge cases you never tried. The harness forces you to think about edge cases upfront — before you write the code, not after. It also gives you a concrete pass rate number for your rubric's Functionality dimension.

# eval/harness.py
# Capstone Evaluation Harness
# Runs test cases against your agent and scores results.

import json
import time
from typing import Callable


# --- Test Case Format ---
# Each test case defines an input, expected behavior,
# and scoring criteria. These map to your rubric dimensions.
SAMPLE_TEST_CASES = [
    {
        "id": "TC-001",
        "description": "Valid PO number returns correct status",
        "input": "What is the status of PO-2024-8847?",
        "expected": {
            "should_call_tool": "get_order_status",
            "should_contain": ["shipped", "FedEx", "Memphis"],
            "should_not_contain": ["error", "not found"],
        },
        "rubric_dimension": "functionality",
    },
    {
        "id": "TC-002",
        "description": "Invalid PO number handled gracefully",
        "input": "What is the status of PO-9999-0000?",
        "expected": {
            "should_call_tool": "get_order_status",
            "should_contain": ["not found"],
            "should_not_contain": ["traceback", "exception", "undefined"],
        },
        "rubric_dimension": "functionality",
    },
    {
        "id": "TC-003",
        "description": "Non-order query handled appropriately",
        "input": "What is the weather today?",
        "expected": {
            "should_call_tool": None,
            "should_contain": ["order"],
            "should_not_contain": ["traceback", "exception"],
        },
        "rubric_dimension": "safety",
    },
]


def evaluate_response(
    response: str,
    tools_called: list[str],
    expected: dict,
) -> dict:
    """Score a single agent response against expectations."""
    score = {"passed": True, "details": []}

    # Check tool usage
    if expected["should_call_tool"]:
        if expected["should_call_tool"] in tools_called:
            score["details"].append("✓ Correct tool called")
        else:
            score["passed"] = False
            score["details"].append(
                f"✗ Expected tool '{expected['should_call_tool']}', "
                f"got {tools_called or 'none'}"
            )
    elif expected["should_call_tool"] is None and tools_called:
        score["passed"] = False
        score["details"].append(
            f"✗ No tool expected, but called: {tools_called}"
        )

    # Check response content
    response_lower = response.lower()
    for phrase in expected.get("should_contain", []):
        if phrase.lower() in response_lower:
            score["details"].append(f"✓ Contains '{phrase}'")
        else:
            score["passed"] = False
            score["details"].append(f"✗ Missing expected: '{phrase}'")

    for phrase in expected.get("should_not_contain", []):
        if phrase.lower() in response_lower:
            score["passed"] = False
            score["details"].append(f"✗ Contains forbidden: '{phrase}'")

    return score


def run_evaluation(
    agent_fn: Callable[[str], tuple[str, list[str]]],
    test_cases: list[dict] = SAMPLE_TEST_CASES,
) -> dict:
    """
    Run all test cases and produce a summary report.
    agent_fn should accept a user message string and return
    (response_text, list_of_tools_called).
    """
    results = []
    start = time.time()

    for tc in test_cases:
        try:
            response, tools = agent_fn(tc["input"])
            result = evaluate_response(response, tools, tc["expected"])
            result["test_id"] = tc["id"]
            result["description"] = tc["description"]
        except Exception as e:
            result = {
                "test_id": tc["id"],
                "description": tc["description"],
                "passed": False,
                "details": [f"✗ Agent crashed: {e}"],
            }
        results.append(result)

    elapsed = time.time() - start
    passed = sum(1 for r in results if r["passed"])
    total = len(results)

    return {
        "passed": passed,
        "total": total,
        "pass_rate": f"{passed / total * 100:.0f}%" if total else "n/a",
        "elapsed_seconds": round(elapsed, 2),
        "results": results,
    }

// eval/harness.ts
// Capstone Evaluation Harness
// Runs test cases against your agent and scores results.

interface Expected {
  shouldCallTool: string | null;
  shouldContain: string[];
  shouldNotContain: string[];
}

interface TestCase {
  id: string;
  description: string;
  input: string;
  expected: Expected;
  rubricDimension: string;
}

interface TestResult {
  testId: string;
  description: string;
  passed: boolean;
  details: string[];
}

// --- Test Case Format ---
// Each test case defines an input, expected behavior,
// and scoring criteria. These map to your rubric dimensions.
const SAMPLE_TEST_CASES: TestCase[] = [
  {
    id: "TC-001",
    description: "Valid PO number returns correct status",
    input: "What is the status of PO-2024-8847?",
    expected: {
      shouldCallTool: "get_order_status",
      shouldContain: ["shipped", "FedEx", "Memphis"],
      shouldNotContain: ["error", "not found"],
    },
    rubricDimension: "functionality",
  },
  {
    id: "TC-002",
    description: "Invalid PO number handled gracefully",
    input: "What is the status of PO-9999-0000?",
    expected: {
      shouldCallTool: "get_order_status",
      shouldContain: ["not found"],
      shouldNotContain: ["traceback", "exception", "undefined"],
    },
    rubricDimension: "functionality",
  },
  {
    id: "TC-003",
    description: "Non-order query handled appropriately",
    input: "What is the weather today?",
    expected: {
      shouldCallTool: null,
      shouldContain: ["order"],
      shouldNotContain: ["traceback", "exception"],
    },
    rubricDimension: "safety",
  },
];


function evaluateResponse(
  response: string,
  toolsCalled: string[],
  expected: Expected,
): TestResult {
  const result: TestResult = {
    testId: "",
    description: "",
    passed: true,
    details: [],
  };

  // Check tool usage
  if (expected.shouldCallTool) {
    if (toolsCalled.includes(expected.shouldCallTool)) {
      result.details.push("✓ Correct tool called");
    } else {
      result.passed = false;
      result.details.push(
        `✗ Expected tool '${expected.shouldCallTool}', ` +
        `got ${toolsCalled.length ? toolsCalled.join(", ") : "none"}`
      );
    }
  } else if (expected.shouldCallTool === null && toolsCalled.length > 0) {
    result.passed = false;
    result.details.push(
      `✗ No tool expected, but called: ${toolsCalled.join(", ")}`
    );
  }

  // Check response content
  const lower = response.toLowerCase();
  for (const phrase of expected.shouldContain) {
    if (lower.includes(phrase.toLowerCase())) {
      result.details.push(`✓ Contains '${phrase}'`);
    } else {
      result.passed = false;
      result.details.push(`✗ Missing expected: '${phrase}'`);
    }
  }
  for (const phrase of expected.shouldNotContain) {
    if (lower.includes(phrase.toLowerCase())) {
      result.passed = false;
      result.details.push(`✗ Contains forbidden: '${phrase}'`);
    }
  }

  return result;
}


type AgentFn = (input: string) => Promise<[string, string[]]>;

export async function runEvaluation(
  agentFn: AgentFn,
  testCases: TestCase[] = SAMPLE_TEST_CASES,
) {
  const results: TestResult[] = [];
  const start = Date.now();

  for (const tc of testCases) {
    try {
      const [response, tools] = await agentFn(tc.input);
      const result = evaluateResponse(response, tools, tc.expected);
      result.testId = tc.id;
      result.description = tc.description;
      results.push(result);
    } catch (error) {
      results.push({
        testId: tc.id,
        description: tc.description,
        passed: false,
        details: [`✗ Agent crashed: ${error}`],
      });
    }
  }

  const elapsed = (Date.now() - start) / 1000;
  const passed = results.filter((r) => r.passed).length;
  const total = results.length;

  return {
    passed,
    total,
    passRate: total ? `${Math.round((passed / total) * 100)}%` : "n/a",
    elapsedSeconds: Math.round(elapsed * 100) / 100,
    results,
  };
}

What Just Happened?

You built an evaluation harness that tests three things for each scenario: (1) Did the agent call the right tool (or correctly refuse to call any tool)? (2) Does the response contain expected information? (3) Does the response avoid forbidden content (error traces, undefined values)? This maps directly to the rubric: test cases tagged with "functionality" score your Functionality dimension, and cases tagged with "safety" score your Safety dimension. As you build more test cases, you get a more accurate self-assessment.
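
The per-dimension scoring described above could be computed roughly as follows. This is a sketch, not part of the harness: the function name pass_rate_by_dimension is hypothetical. It assumes test cases carry id and rubric_dimension, and that results carry test_id and passed, as in the Python harness.

```python
from collections import defaultdict


def pass_rate_by_dimension(test_cases: list[dict], results: list[dict]) -> dict[str, str]:
    """Group harness results by the rubric_dimension tag of each test case."""
    dimension_of = {tc["id"]: tc["rubric_dimension"] for tc in test_cases}
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for r in results:
        # Results whose test id has no matching case are counted as "untagged".
        dim = dimension_of.get(r["test_id"], "untagged")
        total[dim] += 1
        if r["passed"]:
            passed[dim] += 1
    return {dim: f"{passed[dim]}/{total[dim]}" for dim in sorted(total)}


# Illustrative data only; real input would come from run_evaluation().
test_cases = [
    {"id": "TC-001", "rubric_dimension": "functionality"},
    {"id": "TC-002", "rubric_dimension": "functionality"},
    {"id": "TC-003", "rubric_dimension": "safety"},
]
results = [
    {"test_id": "TC-001", "passed": True},
    {"test_id": "TC-002", "passed": False},
    {"test_id": "TC-003", "passed": True},
]
summary = pass_rate_by_dimension(test_cases, results)
print(summary)  # {'functionality': '1/2', 'safety': '1/1'}
```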

⚠️ Common Misconceptions

"3 test cases is enough for a capstone." — It is enough to prove the agent can work. It is not enough to prove it reliably works. Production agent evaluation (M18) typically uses 20-50 test cases per agent. For capstones, aim for at least 8-10 — 3 happy path, 3 edge cases, and 2-3 error/safety scenarios.

"If it passes the harness, it's production-ready." — The harness tests what you thought to test. It does not test what you forgot. Real users will send inputs you never imagined. The harness proves minimum viability, not production readiness — that is what the full rubric (including safety and observability) is for.

"I should write test cases after I build the agent." — Writing tests after is natural, but writing them before is more effective. When you define success criteria in Phase 1, translate each criterion into a test case immediately. This is test-driven development for agents: you know what "done" looks like before you start coding.

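Writing cases up front can be as simple as adding entries in the harness format before any agent code exists. The two cases below are illustrative: they follow the Phase 1 edge-case list and the mock data, but the expected phrases are assumptions about how your agent will word its answers. The validate_test_case helper is hypothetical and only checks that each case has the fields the harness reads.

```python
REQUIRED_KEYS = {"id", "description", "input", "expected", "rubric_dimension"}
REQUIRED_EXPECTED_KEYS = {"should_call_tool", "should_contain", "should_not_contain"}

# Edge cases written before the agent exists (illustrative expectations).
EDGE_CASE_TESTS = [
    {
        "id": "TC-004",
        "description": "Order in processing has no tracking info yet",
        "input": "What is the status of PO-2024-9102?",
        "expected": {
            "should_call_tool": "get_order_status",
            "should_contain": ["processing"],
            "should_not_contain": ["traceback", "exception"],
        },
        "rubric_dimension": "functionality",
    },
    {
        "id": "TC-005",
        "description": "Order question mixed with an unrelated question",
        "input": "What is the status of PO-2024-8847, and what is the weather today?",
        "expected": {
            "should_call_tool": "get_order_status",
            "should_contain": ["shipped"],
            "should_not_contain": ["traceback", "exception"],
        },
        "rubric_dimension": "safety",
    },
]


def validate_test_case(tc: dict) -> list[str]:
    """Return a list of problems; an empty list means the case is well-formed."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - set(tc))]
    expected = tc.get("expected", {})
    problems += [
        f"missing expected.{k}"
        for k in sorted(REQUIRED_EXPECTED_KEYS - set(expected))
    ]
    return problems


for tc in EDGE_CASE_TESTS:
    print(tc["id"], validate_test_case(tc) or "ok")
```

Run against an agent that does not exist yet, these cases fail by design; they pass once the behavior is implemented, which is the point of writing them first.
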
Example: Capstone 1 — Minimum Viable Agent

Here is a complete "minimum viable" implementation of Capstone 1 (Domain B — Order Status Bot). This sets the bar for scope and quality. Your implementation should be at least this complete, and ideally better.

The agent uses the Messages API with tool use, exactly as you practiced in M05. Let's walk through the three key parts and why each decision matters.

First, the system prompt. Notice how it uses specific, measurable rules: "only answer questions about order status," "if a shipment has an ETA, include it," "never reveal raw JSON." Compare this to a vague prompt like "be helpful and answer questions" — the specific version gives Claude concrete criteria to follow, which means fewer hallucinations and more predictable behavior. This is the difference between a 3/5 and a 5/5 on the Prompts rubric dimension.

Second, the tool definitions. Each tool has a description that tells Claude when to use it, not just what it does. "Use this when a customer asks about their order" is much better than just "looks up an order" because it helps Claude decide between the two tools. Without clear usage guidance, Claude might call the wrong tool or call both tools unnecessarily.

Third, the agentic loop. The loop checks stop_reason each iteration — "tool_use" means "keep going, I need to call a tool," while "end_turn" means "I'm done, here's the final answer." The max_iterations = 10 is a safety net only — it prevents infinite loops but should never be the reason the agent stops.

# src/agent.py
# Capstone 1 — Order Status Bot (Minimum Viable Implementation)
# Domain B: B2B Ecommerce Order Tracking

import anthropic
import json
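import sys
from pathlib import Path

# Make the project root importable no matter where the script is run from.
sys.path.append(str(Path(__file__).resolve().parent.parent))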
from data.domain_b.loader import get_order_status, get_tracking_detail


# --- System Prompt ---
# This defines the agent's role, capabilities, and constraints.
# Notice: specific instructions, not vague "be helpful."
SYSTEM_PROMPT = """You are an Order Status Assistant for a B2B ecommerce
company. Your job is to help customers check the status of their
purchase orders and shipments.

You have access to two tools:
1. get_order_status — looks up a purchase order by PO number
2. get_tracking_detail — looks up shipment tracking by tracking number

Rules:
- Only answer questions about order status and tracking.
- If a customer asks about something unrelated, politely explain
  that you can only help with order inquiries.
- Always include the PO number and status in your response.
- If a shipment has an ETA, include it.
- If the SLA deadline is approaching, warn the customer.
- Never reveal internal system details or raw JSON to the customer.
- If a tool returns an error, explain the issue in plain language.
"""

# --- Tool Definitions ---
# These tell Claude what tools are available and how to call them.
TOOLS = [
    {
        "name": "get_order_status",
        "description": (
            "Look up the status of a purchase order by PO number. "
            "Returns order status, line items, tracking info, and "
            "SLA deadline. Use this when a customer asks about "
            "their order."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "po_number": {
                    "type": "string",
                    "description": "The purchase order number (e.g., PO-2024-8847)",
                }
            },
            "required": ["po_number"],
        },
    },
    {
        "name": "get_tracking_detail",
        "description": (
            "Look up shipment tracking details by tracking number. "
            "Returns carrier, status, location, and ETA. Use this "
            "when you need more detail about a specific shipment."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "tracking_number": {
                    "type": "string",
                    "description": "The carrier tracking number",
                }
            },
            "required": ["tracking_number"],
        },
    },
]

# --- Tool Dispatch ---
# Maps tool names to actual Python functions.
TOOL_HANDLERS = {
    "get_order_status": lambda args: get_order_status(args["po_number"]),
    "get_tracking_detail": lambda args: get_tracking_detail(args["tracking_number"]),
}


def run_agent(user_message: str) -> tuple[str, list[str]]:
    """
    Run the agent for a single user message.
    Returns (response_text, list_of_tools_called).
    """
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY env var
    messages = [{"role": "user", "content": user_message}]
    tools_called = []

    # --- Agentic Loop ---
    # Keep calling Claude until it produces a final text response
    # (stop_reason == "end_turn"), not a tool call.
    max_iterations = 10  # safety net, not primary stopping logic
    for _ in range(max_iterations):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                system=SYSTEM_PROMPT,
                tools=TOOLS,
                messages=messages,
            )
        except anthropic.APIError as e:
            return f"I'm sorry, I encountered a technical issue: {e}", tools_called

        # Check if Claude wants to use a tool
        if response.stop_reason == "tool_use":
            # Process each tool call in the response
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    tool_name = block.name
                    tool_input = block.input
                    tools_called.append(tool_name)

                    # Execute the tool
                    handler = TOOL_HANDLERS.get(tool_name)
                    if handler:
                        result = handler(tool_input)
                    else:
                        result = {
                            "is_error": True,
                            "error_category": "unknown_tool",
                            "is_retryable": False,
                            "context": f"Unknown tool: {tool_name}",
                        }

                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": json.dumps(result),
                    })

            # Send tool results back to Claude
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})

        elif response.stop_reason == "end_turn":
            # Claude is done — extract the text response
            text_parts = [
                block.text for block in response.content
                if hasattr(block, "text")
            ]
            return " ".join(text_parts), tools_called
        else:
            # Any other stop reason (for example "max_tokens") is unlikely to
            # change if the same messages are resent, so leave the loop.
            break

    return "I'm sorry, I was unable to complete your request.", tools_called


if __name__ == "__main__":
    query = input("Ask about an order: ")
    response, tools = run_agent(query)
    print(f"\nAgent: {response}")
    print(f"Tools used: {tools}")

// src/agent.ts
// Capstone 1 — Order Status Bot (Minimum Viable Implementation)
// Domain B: B2B Ecommerce Order Tracking

import Anthropic from "@anthropic-ai/sdk";
import { getOrderStatus, getTrackingDetail } from "../data/domain_b/loader.js";

// --- System Prompt ---
// This defines the agent's role, capabilities, and constraints.
// Notice: specific instructions, not vague "be helpful."
const SYSTEM_PROMPT = `You are an Order Status Assistant for a B2B ecommerce
company. Your job is to help customers check the status of their
purchase orders and shipments.

You have access to two tools:
1. get_order_status — looks up a purchase order by PO number
2. get_tracking_detail — looks up shipment tracking by tracking number

Rules:
- Only answer questions about order status and tracking.
- If a customer asks about something unrelated, politely explain
  that you can only help with order inquiries.
- Always include the PO number and status in your response.
- If a shipment has an ETA, include it.
- If the SLA deadline is approaching, warn the customer.
- Never reveal internal system details or raw JSON to the customer.
- If a tool returns an error, explain the issue in plain language.`;

// --- Tool Definitions ---
const TOOLS: Anthropic.Tool[] = [
  {
    name: "get_order_status",
    description:
      "Look up the status of a purchase order by PO number. " +
      "Returns order status, line items, tracking info, and " +
      "SLA deadline. Use this when a customer asks about their order.",
    input_schema: {
      type: "object" as const,
      properties: {
        po_number: {
          type: "string",
          description: "The purchase order number (e.g., PO-2024-8847)",
        },
      },
      required: ["po_number"],
    },
  },
  {
    name: "get_tracking_detail",
    description:
      "Look up shipment tracking details by tracking number. " +
      "Returns carrier, status, location, and ETA. Use this " +
      "when you need more detail about a specific shipment.",
    input_schema: {
      type: "object" as const,
      properties: {
        tracking_number: {
          type: "string",
          description: "The carrier tracking number",
        },
      },
      required: ["tracking_number"],
    },
  },
];

// --- Tool Dispatch ---
const TOOL_HANDLERS: Record<string, (args: Record<string, string>) => unknown> = {
  get_order_status: (args) => getOrderStatus(args.po_number),
  get_tracking_detail: (args) => getTrackingDetail(args.tracking_number),
};


export async function runAgent(
  userMessage: string,
): Promise<[string, string[]]> {
  const client = new Anthropic(); // reads ANTHROPIC_API_KEY env var
  const messages: Anthropic.MessageParam[] = [
    { role: "user", content: userMessage },
  ];
  const toolsCalled: string[] = [];

  // --- Agentic Loop ---
  const maxIterations = 10; // safety net
  for (let i = 0; i < maxIterations; i++) {
    let response: Anthropic.Message;
    try {
      response = await client.messages.create({
        model: "claude-sonnet-4-6",
        max_tokens: 1024,
        system: SYSTEM_PROMPT,
        tools: TOOLS,
        messages,
      });
    } catch (error) {
      return [`I'm sorry, I encountered a technical issue: ${error}`, toolsCalled];
    }

    if (response.stop_reason === "tool_use") {
      const toolResults: Anthropic.ToolResultBlockParam[] = [];

      for (const block of response.content) {
        if (block.type === "tool_use") {
          toolsCalled.push(block.name);
          const handler = TOOL_HANDLERS[block.name];
          const result = handler
            ? handler(block.input as Record<string, string>)
            : { isError: true, errorCategory: "unknown_tool",
                isRetryable: false,
                context: `Unknown tool: ${block.name}` };

          toolResults.push({
            type: "tool_result",
            tool_use_id: block.id,
            content: JSON.stringify(result),
          });
        }
      }

      messages.push({ role: "assistant", content: response.content });
      messages.push({ role: "user", content: toolResults });

    } else if (response.stop_reason === "end_turn") {
      const textParts = response.content
        .filter((b): b is Anthropic.TextBlock => b.type === "text")
        .map((b) => b.text);
      return [textParts.join(" "), toolsCalled];
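    } else {
      // Any other stop reason (for example "max_tokens") is unlikely to
      // change if the same messages are resent, so leave the loop.
      break;
    }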
  }

  return ["I'm sorry, I was unable to complete your request.", toolsCalled];
}

What Just Happened?

You just saw a complete Capstone 1 agent. It has: a focused system prompt with specific rules (not vague "be helpful"), two tool definitions with clear descriptions, a tool dispatch map, and an agentic loop that continues until stop_reason === "end_turn". This is the minimum viable version — it handles happy paths and basic errors. Your capstone should extend it with: better guardrails (M16-M17), structured logging (M19), and more test cases in the eval harness (M18).

🎓 Cert Tip — Domain 1.1
The exam tests your understanding of loop termination. The correct answer is ALWAYS stop_reason: "tool_use" means continue, "end_turn" means done. The max_iterations counter is a safety net, NOT the primary stopping logic. Anti-pattern: parsing Claude's text response for phrases like "I'm done" or "task complete."
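
The termination rule can be exercised without an API key by replaying scripted responses. The sketch below is a simplified stand-in for the real loop: FakeResponse and run_loop are hypothetical names, no tools are actually executed, and the "safety_net" label is an illustrative choice.

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Stand-in for an API response; only the fields the loop inspects."""
    stop_reason: str
    text: str = ""
    tool_name: str = ""


def run_loop(scripted: list[FakeResponse], max_iterations: int = 10) -> tuple[str, list[str], str]:
    """Drive the loop from stop_reason alone.

    Returns (final_text, tools_called, why_stopped).
    """
    tools_called: list[str] = []
    for response in scripted[:max_iterations]:
        if response.stop_reason == "tool_use":
            # Continue: a real agent would run the tool and send back the result.
            tools_called.append(response.tool_name)
            continue
        # "end_turn" (or any other stop reason) ends the loop.
        return response.text, tools_called, response.stop_reason
    # Reached only when the iteration cap (or the script) runs out.
    return "", tools_called, "safety_net"


normal = [
    FakeResponse("tool_use", tool_name="get_order_status"),
    FakeResponse("end_turn", text="Your order has shipped."),
]
runaway = [FakeResponse("tool_use", tool_name="get_order_status")] * 5

print(run_loop(normal))
print(run_loop(runaway, max_iterations=3))
```

The first run stops because of end_turn after one tool call; the second stops only because the cap was hit, which in a real agent signals a problem to investigate rather than a normal exit.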

🎓 Cert Tip — Domain 2.2
Notice the structured error responses in the tool handlers: {is_error: true, error_category: "not_found", is_retryable: false} (the TypeScript handlers use the camelCase equivalents isError, errorCategory, and isRetryable). The exam penalizes generic error returns like "Operation failed". Always include the error category (what went wrong) and retryability (whether the agent should try again) so Claude can make intelligent recovery decisions.
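
One way to keep every tool on this error shape is to route handler calls through a small wrapper. The sketch below is a hedged illustration: tool_error and safe_call are hypothetical helper names, and the categories "timeout", "invalid_input", and "internal_error" are assumptions (only "not_found" appears in the course code). It converts exceptions into the same structured dictionary instead of letting a stack trace reach the model.

```python
def tool_error(category: str, context: str, retryable: bool) -> dict:
    """Build the structured error shape used by the tool handlers."""
    return {
        "is_error": True,
        "error_category": category,
        "is_retryable": retryable,
        "context": context,
    }


def safe_call(handler, tool_input: dict) -> dict:
    """Run a tool handler and convert exceptions into structured errors."""
    try:
        return handler(tool_input)
    except TimeoutError as exc:
        # Transient failure: the agent may reasonably try again.
        return tool_error("timeout", str(exc), retryable=True)
    except KeyError as exc:
        # A required input field was missing: retrying unchanged will not help.
        return tool_error("invalid_input", f"Missing field: {exc}", retryable=False)
    except Exception as exc:
        # Anything unexpected: report it, but mark it as not retryable.
        return tool_error("internal_error", str(exc), retryable=False)
```

With a wrapper like this, the dispatch step can call safe_call(handler, block.input) and always serialize a dictionary that carries a category and a retry hint.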

Example Output

Ask about an order: What is the status of PO-2024-8847?

Agent: Your order PO-2024-8847 is currently **shipped**. Here are the details:

**Order for Acme Industrial Supplies** ($492.50 total)
- Steel Wedge Anchors (box of 100) × 5 — shipped
- Hex Bolts M10x40 (box of 250) × 10 — shipped

**Shipment Tracking:**
- Carrier: FedEx
- Tracking: 7748294001
- Current Location: Memphis, TN
- ETA: March 15, 2024

Your SLA deadline is March 16, so this shipment is on track to arrive on time.

Tools used: ['get_order_status']

Hands-On Exercise: Launch Your Capstone

What You'll Build: A working project scaffold with mock data, a running data loader, and your first agent test — everything you need to begin your capstone.

Time Estimate: 45–60 minutes

Prerequisites: Python 3.10+ or Node.js 18+ installed; an Anthropic API key (set as the ANTHROPIC_API_KEY environment variable); at least Tracks 1–3 (M01-M11) completed.

Files You'll Create:

capstone-project/ — project root directory
data/domain_b/loader.py (or .ts) — mock data loader
eval/test_cases.json — evaluation test scenarios
src/agent.py (or .ts) — your agent (copied from the example above)
eval/harness.py (or .ts) — evaluation harness (Step 3)
RUBRIC.md — your self-assessment scores and notes (Step 4)

Environment Setup

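Copy and paste one of the following two blocks into your terminal to create the project and install dependencies. The first block is for Python and the second is for Node.js; run only the one that matches your language:

# Python: create project structure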
mkdir -p capstone-project/src capstone-project/data/domain_b capstone-project/eval capstone-project/traces
cd capstone-project

# Set up Python virtual environment
python -m venv venv
source venv/bin/activate          # On Windows: venv\Scripts\activate

# Install dependencies
pip install anthropic

# Set your API key (replace with your actual key)
export ANTHROPIC_API_KEY=your-key-here   # On Windows: set ANTHROPIC_API_KEY=your-key-here

# Node.js: create project structure
mkdir -p capstone-project/src capstone-project/data/domain_b capstone-project/eval capstone-project/traces
cd capstone-project

# Initialize Node.js project
npm init -y
npm install @anthropic-ai/sdk
npm install -D typescript tsx @types/node
npx tsc --init

# Set your API key (replace with your actual key)
export ANTHROPIC_API_KEY=your-key-here   # On Windows: set ANTHROPIC_API_KEY=your-key-here

Either block leaves your terminal inside capstone-project/. The test commands in the steps below start with cd capstone-project so that they also work from a new terminal opened in the parent directory; if you are still inside the project root, skip that line.

Step 1: Create the Mock Data Loader

What & Why: Copy the Domain B data loader from the Starter Template section above into data/domain_b/loader.py (or loader.ts). This gives your agent realistic order data to work with, without needing real API credentials.

Create: A new file at data/domain_b/loader.py containing the full data loader code shown in the "Domain Data Loader" section above.

Test it:

cd capstone-project
python -c "
from data.domain_b.loader import get_order_status, get_tracking_detail
import json

# Test valid PO
result = get_order_status('PO-2024-8847')
print('Valid PO:', json.dumps(result, indent=2)[:200])

# Test invalid PO
result = get_order_status('PO-INVALID')
print('Invalid PO:', json.dumps(result, indent=2))
"

cd capstone-project
npx tsx -e "
import { getOrderStatus } from './data/domain_b/loader.js';

const valid = getOrderStatus('PO-2024-8847');
console.log('Valid PO:', JSON.stringify(valid, null, 2).slice(0, 200));

const invalid = getOrderStatus('PO-INVALID');
console.log('Invalid PO:', JSON.stringify(invalid, null, 2));
"

Expected Output (Python — snake_case keys)

Valid PO: {
  "is_error": false,
  "po_number": "PO-2024-8847",
  "status": "shipped",
  "customer": "Acme Industrial Supplies",
  ...
}
Invalid PO: {
  "is_error": true,
  "error_category": "not_found",
  "is_retryable": false,
  "context": "No order found for PO number: PO-INVALID"
}

Node.js note: the TypeScript loader uses camelCase keys (isError, poNumber, errorCategory, isRetryable) instead of snake_case. The values are otherwise identical.
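
If you want one set of assertions to work against both loaders, one option is to normalize key names before comparing. The sketch below is an illustration under assumptions (the helper names to_snake_case and normalize_keys are not part of the starter template): it converts camelCase keys to snake_case recursively and leaves values untouched.

```python
import re

# Zero-width position between a lowercase letter or digit and an uppercase letter.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def to_snake_case(name: str) -> str:
    """Convert a camelCase key to snake_case; snake_case input is unchanged."""
    return _CAMEL_BOUNDARY.sub("_", name).lower()


def normalize_keys(value):
    """Recursively rename dictionary keys; values are returned as they are."""
    if isinstance(value, dict):
        return {to_snake_case(k): normalize_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize_keys(item) for item in value]
    return value
```

For example, normalize_keys({"isError": True, "errorCategory": "not_found"}) yields {"is_error": True, "error_category": "not_found"}, which matches the Python loader's shape.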

✅ Checkpoint

If the valid PO returns is_error: false (isError in Node.js) along with order data, and the invalid PO returns is_error: true with error_category "not_found", Step 1 is working.

Troubleshooting:

ModuleNotFoundError: No module named 'data' — Create empty __init__.py files in both data/ and data/domain_b/: touch data/__init__.py data/domain_b/__init__.py
FileNotFoundError or wrong path — Make sure you are running from the capstone-project/ root directory, not from inside data/.
Output shows None or empty — Verify you copied the full MOCK_ORDERS dictionary and that the PO number matches exactly (PO-2024-8847, with hyphens).

Step 2: Create the Agent

What & Why: Copy the Capstone 1 agent code from the "Example: Capstone 1" section above into src/agent.py (or agent.ts). This gives you a working minimum viable agent to iterate on.

Create: A new file at src/agent.py with the full agent code shown above.

Test it:

cd capstone-project
python -c "
from src.agent import run_agent
response, tools = run_agent('What is the status of PO-2024-8847?')
print('Response:', response[:300])
print('Tools called:', tools)
"

cd capstone-project
npx tsx -e "
import { runAgent } from './src/agent.js';
const [response, tools] = await runAgent('What is the status of PO-2024-8847?');
console.log('Response:', response.slice(0, 300));
console.log('Tools called:', tools);
"

Expected Output

Response: Your order PO-2024-8847 is currently shipped. Here are the details:

Order for Acme Industrial Supplies ($492.50 total)
- Steel Wedge Anchors (box of 100) x 5 — shipped
- Hex Bolts M10x40 (box of 250) x 10 — shipped

Shipment via FedEx, tracking 7748294001, ETA March 15...
Tools called: ['get_order_status']

✅ Checkpoint

If the agent returns a natural language response mentioning "shipped," "FedEx," and "PO-2024-8847," and the tools list shows get_order_status, Step 2 is working.

Troubleshooting:

AuthenticationError — Your API key is missing or invalid. Run echo $ANTHROPIC_API_KEY (or echo %ANTHROPIC_API_KEY% on Windows) to verify it is set.
ModuleNotFoundError for the data loader — The import path in agent.py (from data.domain_b.loader import ...) assumes you run from the project root. Also create src/__init__.py if needed.
Agent returns generic "I can't help with that" — Your system prompt may be too restrictive. Check that the tools list is being passed to client.messages.create().
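
Most of the failures above come down to a missing key or a wrong working directory, so a small pre-flight check can save a round trip. This is a sketch under assumptions: preflight is a hypothetical helper, and the two file paths it checks are simply the ones this exercise asks you to create for the Python track.

```python
import os
from pathlib import Path


def preflight(project_root: Path, env=None) -> list[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []

    # The SDK reads the key from the environment, so check it is present.
    if not env.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY is not set")

    # Files this exercise expects relative to the project root (Python track).
    for rel in ("src/agent.py", "data/domain_b/loader.py"):
        if not (project_root / rel).is_file():
            problems.append(f"missing file: {rel}")

    return problems
```

Running print(preflight(Path.cwd()) or "ready") from the project root lists anything that still needs fixing before you call the agent.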

Step 3: Run the Evaluation Harness

What & Why: Copy the evaluation harness from the section above into eval/harness.py. Then run it against your agent to get a pass/fail score across all test cases. This gives you a concrete number for your rubric.

Create: A new file at eval/harness.py with the full harness code shown above.

Test it:

cd capstone-project
python -c "
import json
from src.agent import run_agent
from eval.harness import run_evaluation

report = run_evaluation(run_agent)
print(json.dumps(report, indent=2))
"

cd capstone-project
npx tsx -e "
import { runAgent } from './src/agent.js';
import { runEvaluation } from './eval/harness.js';
const report = await runEvaluation(runAgent);
console.log(JSON.stringify(report, null, 2));
"

Expected Output

{
  "passed": 3,
  "total": 3,
  "pass_rate": "100%",
  "elapsed_seconds": 4.21,
  "results": [
    {"test_id": "TC-001", "passed": true, "details": ["✓ Correct tool called", "✓ Contains 'shipped'", ...]},
    {"test_id": "TC-002", "passed": true, "details": ["✓ Correct tool called", "✓ Contains 'not found'", ...]},
    {"test_id": "TC-003", "passed": true, "details": ["✓ Contains 'cannot'", ...]}
  ]
}

✅ Checkpoint

If all 3 test cases pass with "pass_rate": "100%", your minimum viable agent is working correctly.

Troubleshooting:

TC-001 fails ("Missing expected: 'shipped'") — Claude's wording varies between runs. If the response says "in transit" instead of "shipped," either strengthen the system prompt so the agent repeats the tool's status value verbatim, or accept "in transit" as an alternative in should_contain.
TC-002 fails ("Missing expected: 'not found'") — The agent might phrase it differently ("I couldn't locate that order"). Either update should_contain with an alternative phrase, or strengthen the system prompt to echo the tool's error message verbatim.
TC-003 fails ("No tool expected, but called: get_order_status") — Your system prompt needs a stronger boundary. Add an explicit rule like: "If the user asks about anything other than orders or shipments, do NOT call any tools."
ModuleNotFoundError for the harness — Create eval/__init__.py and run from the project root.
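
Several of the fixes above involve accepting more than one phrasing. The sketch below shows one way a check like that could work; it is not the harness code from this module. The function name check_should_contain, the convention that an expectation may be either a string or a list of alternatives, and the "✗" prefix on failures are all assumptions for illustration.

```python
def check_should_contain(response: str, should_contain: list) -> list[str]:
    """Return one detail line per expectation.

    Each expectation is a string, or a list of acceptable alternatives
    (this list-of-alternatives convention is an assumption of the sketch).
    Matching is case-insensitive.
    """
    text = response.lower()
    details = []
    for expected in should_contain:
        alternatives = [expected] if isinstance(expected, str) else list(expected)
        hit = next((alt for alt in alternatives if alt.lower() in text), None)
        if hit is not None:
            details.append(f"✓ Contains '{hit}'")
        else:
            wanted = " | ".join(repr(alt) for alt in alternatives)
            details.append(f"✗ Missing expected: {wanted}")
    return details
```

With this shape, a test case could list ["shipped", "in transit"] as one expectation and pass on either wording, while a plain string keeps the stricter behavior.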

Step 4: Self-Assess with the Rubric

What & Why: Score your agent on each of the five rubric dimensions (Functionality, Code Quality, Prompts, Safety, Observability). Be honest — this is for your learning, not a grade. Write down specific notes for each dimension explaining why you gave that score.

Record your scores: Create a file called RUBRIC.md in your project root and record your scores and notes in it, using the JSON structure shown in the Self-Assessment Rubric section above.
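
To turn the five scores into a total and a band, a few lines of arithmetic are enough. The sketch below assumes snake_case dimension names as dictionary keys and uses the two thresholds stated in this module (15+ competent, 20+ excellent); the label "below competent" for lower totals and the function name score_rubric are assumptions of the sketch.

```python
DIMENSIONS = ("functionality", "code_quality", "prompts", "safety", "observability")


def score_rubric(scores: dict) -> dict:
    """Validate five 1-5 scores and return total, band, and weakest dimensions."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 1 <= scores[d] <= 5:
            raise ValueError(f"{d} must be between 1 and 5, got {scores[d]}")

    total = sum(scores[d] for d in DIMENSIONS)
    if total >= 20:
        band = "excellent"
    elif total >= 15:
        band = "competent"
    else:
        band = "below competent"  # label assumed; the module names only two bands

    lowest = min(scores[d] for d in DIMENSIONS)
    weakest = [d for d in DIMENSIONS if scores[d] == lowest]
    return {"total": total, "max": 25, "band": band, "weakest": weakest}
```

For instance, scores of 4, 3, 4, 3, and 2 (in the order above) total 16 out of 25, which lands in the competent band with observability as the weakest dimension.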

Final Verification

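Run this single command to verify your entire project scaffold end-to-end. Because it chains cd capstone-project with &&, run it from the directory that contains capstone-project/; if your terminal is already inside the project root, drop the leading cd capstone-project && before running it: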

cd capstone-project && python -c "
from data.domain_b.loader import get_order_status
from src.agent import run_agent
from eval.harness import run_evaluation
print('Data loader:', 'FAIL' if get_order_status('PO-2024-8847')['is_error'] else 'OK')
resp, tools = run_agent('Status of PO-2024-8847?')
print('Agent:', 'OK' if 'shipped' in resp.lower() else 'FAIL')
report = run_evaluation(run_agent)
print(f'Eval harness: {report[\"pass_rate\"]} ({report[\"passed\"]}/{report[\"total\"]})')
print('All systems go!' if report['passed'] == report['total'] else 'Some tests failed — review output above.')
"

cd capstone-project && npx tsx -e "
import { getOrderStatus } from './data/domain_b/loader.js';
import { runAgent } from './src/agent.js';
import { runEvaluation } from './eval/harness.js';
console.log('Data loader:', getOrderStatus('PO-2024-8847').isError ? 'FAIL' : 'OK');
const [resp, tools] = await runAgent('Status of PO-2024-8847?');
console.log('Agent:', resp.toLowerCase().includes('shipped') ? 'OK' : 'FAIL');
const report = await runEvaluation(runAgent);
console.log('Eval harness:', report.passRate, '(' + report.passed + '/' + report.total + ')');
console.log(report.passed === report.total ? 'All systems go!' : 'Some tests failed.');
"

Expected Output

Data loader: OK
Agent: OK
Eval harness: 100% (3/3)
All systems go!

🎉 Congratulations!

Your capstone project scaffold is fully operational. You have a working data loader, a minimum viable agent, and an evaluation harness. From here, your next steps are: (1) add more test cases to the harness covering edge cases, (2) improve the agent's system prompt and error handling, (3) add guardrails from M16-M17, and (4) implement structured logging from M19. Each improvement will raise your rubric score.
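
As a first guardrail toward next step (3), you could validate the PO number before any lookup runs. The sketch below is a hedged example, not the M16-M17 implementation: validate_po_number is a hypothetical helper, the pattern PO-, four digits, hyphen, four digits is inferred only from the sample PO-2024-8847, and the "invalid_format" category is an assumption.

```python
import re

# Format inferred from the sample PO number in this exercise (an assumption).
PO_PATTERN = re.compile(r"PO-\d{4}-\d{4}")


def validate_po_number(raw: str) -> dict:
    """Normalize and validate a PO number before any lookup is attempted."""
    candidate = raw.strip().upper()
    if PO_PATTERN.fullmatch(candidate):
        return {"is_error": False, "po_number": candidate}
    return {
        "is_error": True,
        "error_category": "invalid_format",
        "is_retryable": False,
        "context": f"Expected a PO number like PO-2024-8847, got: {raw!r}",
    }
```

One design consequence to keep in mind: if this check runs before the lookup, an input such as PO-INVALID would be reported as invalid_format rather than not_found, so any test expectation that relies on the "not found" wording would need to change with it.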

Troubleshooting

ModuleNotFoundError: No module named 'anthropic' — Run pip install anthropic (make sure your virtual environment is activated).
AuthenticationError — Check that ANTHROPIC_API_KEY is set: run echo $ANTHROPIC_API_KEY (or echo %ANTHROPIC_API_KEY% on Windows).
ModuleNotFoundError: No module named 'data' — Create empty __init__.py files in data/ and data/domain_b/. Run from the project root directory.
Agent response does not mention "shipped" — Claude's exact wording varies. Check the raw response. If it says "in transit" instead of "shipped," update your test expectations or your system prompt.

Stretch goal (OPTIONAL): Complete a second capstone in a different domain to practice adapting your skills across industries.

Stretch goal (OPTIONAL): Implement the "extended" version with full production hardening (observability, caching, deployment from M19-M22).

Knowledge Check

Q1: What are the five phases of the capstone methodology, in order?

A Architecture → Requirements → Build → Deploy → Test

B Build → Test → Fix → Deploy → Monitor

C Requirements → Architecture → Build → Evaluate → Harden

D Design → Code → Review → Ship → Maintain

Correct! The five phases are Requirements Analysis, Architecture Design, Iterative Build, Evaluation, and Production Hardening. Each phase maps to specific course modules.

Not quite. The correct order is: Requirements → Architecture → Build → Evaluate → Harden. Starting with requirements prevents the "patch-and-pray" cycle that happens when you jump straight to coding.

Q2: A student has completed Tracks 1–4 but not Tracks 5–7. Which capstone tier should they attempt?

A Tier 3 (Capstones 4–5) — they know enough to try the hardest projects

B Tier 1 (Capstone 1) — and review Tracks 5–7 before attempting Tier 2

C Tier 2 (Capstones 2–3) — they can skip guardrails since those are optional

D Skip capstones entirely until all tracks are complete

Correct! Without Tracks 5–7 (guardrails, observability, deployment), Tier 2 and 3 capstones would lack critical safety and monitoring components. Start with Tier 1, then come back for Tier 2 after reviewing the remaining tracks.

Not quite. Capstones 2–3 require Tracks 1–5, and guardrails are NOT optional for intermediate-level projects. Capstones 4–5 require all 7 tracks. The best approach is: complete Capstone 1 now, then review Tracks 5–7 before moving up.

Q3: Which domain has the strictest PII handling requirements, and why?

A Domain A (Healthcare) — HIPAA regulations require strict PHI protection with fines up to $50,000 per violation

B Domain B (Ecommerce) — credit card data falls under PCI-DSS compliance

C Domain C (Public Records) — UCC filings contain sensitive business information

D All three domains have identical PII requirements

Correct! Domain A (Healthcare) is governed by HIPAA, which imposes strict requirements on handling Protected Health Information (PHI). Civil penalties are tiered by culpability, with statutory amounts ranging from $100 to $50,000 per violation (adjusted annually for inflation). While all domains have some data sensitivity, healthcare has the most stringent regulatory framework.

Not quite. While Domain B has PCI-DSS and Domain C has public records considerations, Domain A (Healthcare) has the strictest PII requirements because of HIPAA, whose statutory civil penalties range from $100 to $50,000 per violation (adjusted annually for inflation) for mishandling Protected Health Information.

Q4: What rubric total score indicates a "production-ready" capstone?

A 10+ out of 25

B 15+ out of 25 (competent)

C 20+ out of 25 (excellent)

D 25 out of 25 (perfect)

Correct! A score of 20+ out of 25 demonstrates excellence and indicates production-readiness. This means scoring at least 4/5 on most dimensions, with particularly strong Functionality and Safety scores.

Not quite. 15+ demonstrates competency (you understand the concepts), but 20+ demonstrates excellence (production-ready). A perfect 25 is aspirational but not required — even production systems have areas for improvement.

Q5: You scored 4/5 on Functionality but 2/5 on Observability. What specific modules should you revisit?

A M05 (Function Calling) and M06 (Multi-Tool Orchestration)

B M16 (Input Guardrails) and M17 (Output Guardrails)

C M12 (ReAct Pattern) and M18 (Evaluation)

D M19 (Tracing & Logging) and M20 (Monitoring & Continuous Improvement)

Correct! A 2/5 on Observability means your agent lacks proper tracing and monitoring. M19 covers structured logging, tracing with spans, and tools like Langfuse. M20 covers production monitoring dashboards, alerting, and feedback loops. Together, they would bring your Observability score to 4 or 5.

Not quite. The Observability rubric dimension maps directly to M19 (Tracing & Logging) and M20 (Monitoring & Continuous Improvement). These modules cover structured logging, tracing with Langfuse, and monitoring dashboards — exactly what a 2/5 Observability score is missing.

Q6: In the example agent code, what determines when the agentic loop stops?

A The max_iterations counter reaching 10

B The response's stop_reason being "end_turn"

C Parsing Claude's response text for "I'm done"

D A timer that expires after 30 seconds

Correct! The loop checks stop_reason === "end_turn" to determine when Claude has finished. The max_iterations counter is a safety net (preventing infinite loops), not the primary stopping logic. This is the correct pattern from M12.

Not quite. The primary stopping condition is stop_reason === "end_turn", which means Claude has completed its response. The max_iterations counter is only a safety net. Parsing response text for phrases like "I'm done" is an anti-pattern — Claude's stop_reason is the reliable signal.

Module Summary

In this module, you learned the complete framework for tackling capstone projects:

Five-Phase Methodology: Requirements → Architecture → Build → Evaluate → Harden. Each phase maps to specific course modules, preventing the "patch-and-pray" cycle.
Three Difficulty Tiers: Beginner (Capstone 1), Intermediate (Capstones 2–3), and Advanced (Capstones 4–5). Pick the tier that matches the tracks you have completed.
Three Domain Anchors: Healthcare (HIPAA, clinical criteria), B2B Ecommerce (PO lifecycle, carrier APIs), and Public Records (UCC filings, entity resolution). Same agent patterns, different real-world contexts.
Self-Assessment Rubric: Five dimensions (Functionality, Code Quality, Prompts, Safety, Observability) scored 1–5. 15+ = competent, 20+ = excellent.
Starter Template + Eval Harness: Directory structure, domain data loaders, and automated test runner — everything you need to start building immediately.

Next up: M24: What's Next — The Agent Frontier explores emerging patterns (agent-to-agent protocols, agent marketplaces), Claude's evolving capabilities (computer use, extended thinking), and your personal agent development roadmap.

References & Resources

Claude Tool Use Documentation — Official guide to function calling and tool use
Claude Model Overview — Model capabilities, pricing, and selection guidance
Anthropic Cookbook — Production-ready code examples for agents, RAG, and tool use
Prompt Engineering Guide — Best practices for system prompts and structured output
Evaluation Guide — Anthropic's guidance on testing and evaluating AI outputs