Capstone 7-C — Agent Evolution: UCC Risk Analyzer
Build the SAME UCC delinquency-risk agent THREE times — first by hand with the raw API, then with the Agent SDK and Claude Code, then from a spec document. Same tools, same mock data, same business question. Three radically different developer experiences.
Project Brief
Imagine you decide to build the same house three times. The first time, you fell every tree, mill every plank, and hammer every nail by hand. You learn exactly where the load-bearing walls go, why the joists are spaced 16 inches apart, and what happens when a sill plate is undersized. The build takes six months, but you understand every joint in the structure.
The second time, you order pre-cut lumber from a mill, pre-fab trusses, and use a nail gun. The framing crew shows up with a power tool for every job. You finish in six weeks. The walls go up in the same places, but you are no longer wondering why: you are picking which tools to use and where to apply them. You build roughly four times as fast and the structure is just as strong, but only because you already know what a structure should look like.
The third time, you hand a contractor a set of architectural drawings. Two weeks later, the house is done — framing, plumbing, electrical, finishes — built by people you never met to specifications you wrote in plain English. You are now an architect. The previous two builds taught you what to draw and what to leave to the crew. Skip those, and your drawings would not survive contact with reality.
This capstone is those three builds, compressed into one week. You will write the same UCC risk agent in raw API code, then in the Agent SDK with Claude Code, then as a 12-section spec that Claude Code reads and implements. By the end, you will know in your hands — not just in theory — why each layer of abstraction exists, what it costs, and when to reach for each one.
You build a UCC Filing Risk Analyzer agent that takes a business name and produces a delinquency risk narrative: it searches UCC filings (handling name variations, abbreviations, DBAs), aggregates filing statistics, runs an ML delinquency model, examines the highest-risk filings, and writes a narrative report citing specific evidence.
You build it three times:
- Iteration 1, Raw API loop: ~250 lines, ~3 hours, hand-coded everything (M15B way)
- Iteration 2, Agent SDK + Claude Code: ~120 lines, ~2 hours (M25–M26 way)
- Iteration 3, Spec-driven: ~100 lines of spec, ~1 hour (production way)
Production teams do not pick "raw API vs SDK vs spec" in the abstract — they pick based on what the team already understands and what the problem requires. If you skip Iteration 1, you cannot debug Iteration 3 when the generated code does something weird. If you skip Iteration 3, you are 10x slower than the teams shipping agents in 2026. The point of building the same thing three times is that the differences teach you the trade-offs in a way no diagram or article ever will. You are not learning three different agents. You are learning three different levels of abstraction, and gaining the judgment to know which one fits which problem.
The Three-Iteration Concept
Each iteration produces a working agent that solves the SAME business problem with the SAME tools and SAME mock data. The agent's output is identical across all three. What changes is everything around the agent: the lines you wrote, the time you spent, the abstractions you used, and the way you debug when something breaks.
"Iteration 3 is just better — why would I bother with the others?" Because Iteration 3's generated code is not magic. When it produces a tool that searches case-sensitively and you wanted case-insensitive, you have to read the generated tools.py, find the bug, and fix it — either in the code or in the spec. Without Iteration 1's deep familiarity with what tool calls, message shapes, and stop reasons look like, you cannot tell what the generated code is doing wrong. Iteration 3 is fast precisely because you can read its output. Take away that ability and it becomes a black box that produces silently broken agents.
The Scenario — UCC Filing Risk Analyzer
The agent takes a business name and produces a delinquency risk narrative. It searches UCC filings (with name variations including DBAs and abbreviations), aggregates filing statistics, runs an ML delinquency model, examines the highest-risk filings, and writes a narrative report citing specific evidence.
"Assess the delinquency risk for Acme Corporation using UCC filing data."
Tools (3)
- search_filings(debtor_name, state?): partial-match UCC search, case-insensitive
- get_filing_details(filing_number): full record lookup
- predict_delinquency(features...): sklearn RandomForest pickle, returns probability + HIGH/MEDIUM/LOW
Mock Data Shape
- 9 filings for Acme Corporation across NY (4), CA (2), TX (2), FL (1, DBA "ACME CORP DBA ROADRUNNER SUPPLIES")
- 2 filings for Pinnacle Industries, 1 for Sunrise Holdings
- Trained pickle model with 6 features (active count, state count, collateral types, age, amendment frequency, months-to-lapse)
- Files: mock_data.py, delinquency_model.pkl, test_scenarios.json
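The six model features can be derived from the raw filing records. The sketch below is one plausible derivation, not the course's reference implementation: the function name derive_features, the as_of date, and the exact definitions (oldest active filing for age, share of UCC3 filings for amendment frequency, soonest lapse for months-to-lapse) are assumptions made for illustration. The output keys match the predict_delinquency argument names.

```python
from datetime import date

FEATURE_NAMES = ("active_filing_count", "state_count", "collateral_types",
                 "filing_age_years", "amendment_frequency", "months_to_lapse")

def derive_features(filings: list[dict], as_of: date) -> dict:
    """Aggregate one debtor's filings into the six model inputs (assumed definitions)."""
    active = [f for f in filings if f["status"] == "ACTIVE"]
    if not active:
        return dict.fromkeys(FEATURE_NAMES, 0)
    # Age of each active filing in years, and months remaining until each lapses.
    ages = [(as_of - date.fromisoformat(f["filing_date"])).days / 365.25 for f in active]
    months_left = [(date.fromisoformat(f["lapse_date"]) - as_of).days / 30.44 for f in active]
    amendments = sum(1 for f in active if f["filing_type"].startswith("UCC3"))
    return {
        "active_filing_count": len(active),
        "state_count": len({f["state"] for f in active}),
        "collateral_types": len({f["collateral"] for f in active}),
        "filing_age_years": round(max(ages), 2),          # oldest active filing
        "amendment_frequency": round(amendments / len(active), 2),
        "months_to_lapse": round(min(months_left), 1),    # soonest lapse
    }

# Hypothetical three-filing sample in the same shape as the mock data.
SAMPLE = [
    {"state": "NY", "status": "ACTIVE", "filing_type": "UCC1",
     "filing_date": "2024-03-15", "lapse_date": "2029-03-15", "collateral": "Inventory"},
    {"state": "CA", "status": "ACTIVE", "filing_type": "UCC3_AMENDMENT",
     "filing_date": "2024-06-01", "lapse_date": "2029-04-22", "collateral": "Equipment"},
    {"state": "NJ", "status": "TERMINATED", "filing_type": "UCC1",
     "filing_date": "2022-05-15", "lapse_date": "2027-05-15", "collateral": "Inventory"},
]
print(derive_features(SAMPLE, as_of=date(2025, 1, 1)))
```

Passing as_of explicitly keeps the result reproducible in tests; a live agent would more likely use today's date.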
You have already seen this domain in M09 (RAG), M12 (ReAct), M15B (build), CAPSTONE-3-DOMAIN-C (entity resolution), CAPSTONE-5-DOMAIN-C (production), and CAPSTONE-6 (parallel state testing). Picking this scenario lets you focus on the iteration differences instead of learning a new domain. If you'd rather work in a different industry, switch to Domain A (Healthcare) or Domain B (B2B Orders).
Animation 1: Three-Lane Evolution
Watch three lanes count down lines of code while the same five capabilities populate underneath each. The agent's capabilities never change — what shrinks is the code you write to express them. This is the entire thesis of the capstone in 8 seconds.
Animation 2: Code Size Waterfall
The waterfall makes the comparison concrete. Iteration 1 is 250 lines you wrote. Iteration 2 is 120 lines you wrote. Iteration 3 is 100 lines of spec plus ~300 lines of code Claude Code generated for you — shown stacked. The total system size grows, but the lines on your keyboard fall off a cliff.
Animation 3: Time Comparison
The bars fill in proportion to wall-clock time per iteration. Iteration 1 is the longest because you build everything — loop, validation, redaction, logging, sessions, deployment. Iteration 2 cuts the loop and guardrails (the SDK and hooks handle them). Iteration 3 cuts almost everything except the thinking: writing what the agent should do.
Total: 6 hours across 3 sessions for the same agent, three different builds.
Animation 4: Architecture Per Iteration
Each iteration has the same logical architecture (system prompt → tools → loop → output) but the physical architecture differs. Click the tabs to compare. Notice that more boxes do not mean more complexity for you — in Iteration 3, the SDK and Claude Code own most of the boxes; you only own the spec.
Animation 5: Spec-to-Code Flow
Watch the 12-section spec on the left get read line-by-line. Claude Code (the engine in the middle) translates each section into generated code on the right. Files appear as their corresponding spec section is consumed: section 3 (Tools) generates tools.py, section 5 (Hooks) generates hooks.py, and so on. The whole flow takes ~10 minutes in real life; here it is compressed to 8 seconds.
Prerequisites
- M05 — Tool Use: Tool definitions, tool_use blocks, the message loop
- M12 — ReAct Agents: Multi-step reasoning, the think→act→observe pattern
- M15B — Build Complete Agent: The whole Iter-1 mental model lives here
- M16–M17 — Guardrails & HITL: The hooks you'll wire up in Iter 2
- M19 — Tracing (Langfuse): Optional but useful for Iter 2 debugging
- M21 — Deployment: FastAPI + Docker patterns
- M22B — Cloud Deployment: Tier 1 deployment for all three iterations
- M25 — Claude Code: CLAUDE.md, slash commands, sessions
- M26 — Hooks & Sessions: Pre/post tool-use hooks, session forking
Iteration 1 is essentially a re-implementation of the M15B reference agent for the UCC scenario. If you have not built an agent from raw API calls before, do M15B first — otherwise Iteration 1 will be confusing rather than illuminating, and Iteration 3's debugging steps will not work because you will not recognize what the generated code is doing. Iteration 2 leans heavily on the claude-agent-sdk patterns taught in M26 (Hooks & Sessions & Agent SDK) — reach for it if any of the @tool / HookMatcher / ClaudeAgentOptions calls feel unfamiliar.
- Python 3.10+ with pip
- scikit-learn (for the delinquency pickle model)
- Claude Code CLI (npm i -g @anthropic-ai/claude-code), only needed for Iter 2 and Iter 3
- Docker Desktop (for the Tier 1 deployment in each iteration)
- An Anthropic API key (ANTHROPIC_API_KEY environment variable)
- Optional: a Langfuse account for Iter 2 tracing (free tier is fine)
Iteration 1: Raw API Loop
Build the agent the M15B way. You write the loop, the validation, the logging, the redaction, the sessions, and the deployment. Every line is yours. Every bug is yours to find.
What & Why: Create the project folder, install anthropic + scikit-learn, then build the mock data your agent will read. Mock data is what separates a "demo" agent from a "doesn't compile" agent — without it, every tool call returns an error and you cannot tell if the loop is broken or the data is. We also train a tiny RandomForest pickle so predict_delinquency has something real to call.
mkdir agent-iter1-raw && cd agent-iter1-raw
python -m venv venv && source venv/bin/activate # Windows: venv\Scripts\activate
pip install "anthropic>=0.40" "scikit-learn>=1.3" "pandas>=2.0" "fastapi>=0.110" "uvicorn>=0.27"
# Train the delinquency_model.pkl
python -c "
from sklearn.ensemble import RandomForestClassifier
import pickle, numpy as np
np.random.seed(42)
X = np.random.rand(200, 6); y = (X.sum(axis=1) > 3).astype(int)
clf = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)
pickle.dump(clf, open('delinquency_model.pkl', 'wb'))
print('Model saved')
"
"""mock_data.py — 12 UCC filings across 3 entities."""
FILINGS = [
# Acme Corporation — 9 filings across 4 states (incl. 1 DBA)
{"filing_number": "NY-2024-0847", "debtor_name": "ACME CORPORATION",
"state": "NY", "filing_type": "UCC1", "status": "ACTIVE",
"filing_date": "2024-03-15", "lapse_date": "2029-03-15",
"secured_party": "FIRST NATIONAL BANK",
"collateral": "All inventory and accounts receivable"},
{"filing_number": "NY-2024-0848", "debtor_name": "ACME CORPORATION",
"state": "NY", "filing_type": "UCC1", "status": "ACTIVE",
"filing_date": "2024-04-22", "lapse_date": "2029-04-22",
"secured_party": "JPMORGAN CHASE",
"collateral": "Equipment and machinery"},
{"filing_number": "NY-2024-0849", "debtor_name": "ACME CORP",
"state": "NY", "filing_type": "UCC3_AMENDMENT", "status": "ACTIVE",
"filing_date": "2024-06-01", "lapse_date": "2029-04-22",
"secured_party": "JPMORGAN CHASE",
"collateral": "Equipment, machinery, and titled vehicles"},
{"filing_number": "NY-2024-0850", "debtor_name": "ACME CORPORATION INC",
"state": "NY", "filing_type": "UCC1", "status": "ACTIVE",
"filing_date": "2024-08-10", "lapse_date": "2029-08-10",
"secured_party": "WELLS FARGO",
"collateral": "Accounts receivable"},
{"filing_number": "CA-2024-1201", "debtor_name": "ACME CORP",
"state": "CA", "filing_type": "UCC1", "status": "ACTIVE",
"filing_date": "2024-02-18", "lapse_date": "2029-02-18",
"secured_party": "BANK OF AMERICA",
"collateral": "All assets of debtor"},
{"filing_number": "CA-2024-1202", "debtor_name": "ACME CORPORATION",
"state": "CA", "filing_type": "UCC1", "status": "ACTIVE",
"filing_date": "2024-05-05", "lapse_date": "2029-05-05",
"secured_party": "CITIBANK NA",
"collateral": "Inventory"},
{"filing_number": "TX-2024-0903", "debtor_name": "ACME CORP",
"state": "TX", "filing_type": "UCC1", "status": "ACTIVE",
"filing_date": "2024-07-12", "lapse_date": "2029-07-12",
"secured_party": "PNC FINANCIAL",
"collateral": "Equipment"},
{"filing_number": "TX-2024-0904", "debtor_name": "ACME CORPORATION",
"state": "TX", "filing_type": "UCC3_CONTINUATION", "status": "ACTIVE",
"filing_date": "2024-09-01", "lapse_date": "2034-07-12",
"secured_party": "PNC FINANCIAL",
"collateral": "Equipment"},
{"filing_number": "FL-2024-0455",
"debtor_name": "ACME CORP DBA ROADRUNNER SUPPLIES",
"state": "FL", "filing_type": "UCC1", "status": "ACTIVE",
"filing_date": "2024-10-30", "lapse_date": "2029-10-30",
"secured_party": "TRUIST FINANCIAL",
"collateral": "Motor vehicles and titled goods"},
# Pinnacle Industries — 2 filings
{"filing_number": "NY-2024-0501", "debtor_name": "PINNACLE INDUSTRIES",
"state": "NY", "filing_type": "UCC1", "status": "ACTIVE",
"filing_date": "2023-11-10", "lapse_date": "2028-11-10",
"secured_party": "TD BANK", "collateral": "Equipment"},
{"filing_number": "NJ-2024-0302", "debtor_name": "PINNACLE INDUSTRIES",
"state": "NJ", "filing_type": "UCC1", "status": "TERMINATED",
"filing_date": "2022-05-15", "lapse_date": "2027-05-15",
"secured_party": "US BANCORP", "collateral": "Inventory"},
# Sunrise Holdings — 1 filing
{"filing_number": "CA-2024-1500", "debtor_name": "SUNRISE HOLDINGS LLC",
"state": "CA", "filing_type": "UCC1", "status": "ACTIVE",
"filing_date": "2024-01-05", "lapse_date": "2029-01-05",
"secured_party": "FIRST NATIONAL BANK",
"collateral": "All assets of debtor"},
]
Run: python -c "from mock_data import FILINGS; print(len(FILINGS), 'filings loaded')"
Expected output:
Model saved
12 filings loaded
You should see 12 filings loaded and have delinquency_model.pkl in the folder. If the pickle write fails, check write permissions.
What & Why: The Anthropic API needs every tool described as a JSON Schema object so Claude knows what arguments to pass. You also need a Python function that executes the tool when Claude asks. Get the schema wrong and Claude either ignores the tool or passes the wrong types; get the executor wrong and Claude gets back errors instead of data and may either retry forever or give up early.
"""tools.py — UCC tool schemas + executors for the raw API loop."""
import pickle
from mock_data import FILINGS
TOOLS = [
{
"name": "search_filings",
"description": "Search UCC filings by debtor name (partial, case-insensitive). Optional state filter.",
"input_schema": {
"type": "object",
"properties": {
"debtor_name": {"type": "string"},
"state": {"type": "string", "description": "2-letter code (optional)"},
},
"required": ["debtor_name"],
},
},
{
"name": "get_filing_details",
"description": "Get full details for a specific filing by filing number.",
"input_schema": {
"type": "object",
"properties": {"filing_number": {"type": "string"}},
"required": ["filing_number"],
},
},
{
"name": "predict_delinquency",
"description": "Run the ML delinquency model. Returns probability and HIGH/MEDIUM/LOW.",
"input_schema": {
"type": "object",
"properties": {
"active_filing_count": {"type": "integer"},
"state_count": {"type": "integer"},
"collateral_types": {"type": "integer"},
"filing_age_years": {"type": "number"},
"amendment_frequency": {"type": "number"},
"months_to_lapse": {"type": "number"},
},
"required": ["active_filing_count", "state_count", "collateral_types",
"filing_age_years", "amendment_frequency", "months_to_lapse"],
},
},
]
_MODEL = pickle.load(open("delinquency_model.pkl", "rb"))
def execute_tool(name, args):
    if name == "search_filings":
        q = args["debtor_name"].lower()
        st = args.get("state", "").upper()
        return [f for f in FILINGS
                if q in f["debtor_name"].lower()
                and (not st or f["state"] == st)]
    if name == "get_filing_details":
        for f in FILINGS:
            if f["filing_number"] == args["filing_number"]:
                return f
        return {"error": "not found"}
    if name == "predict_delinquency":
        feats = [[args["active_filing_count"], args["state_count"],
                  args["collateral_types"], args["filing_age_years"],
                  args["amendment_frequency"], args["months_to_lapse"]]]
        prob = float(_MODEL.predict_proba(feats)[0][1])
        return {"probability": prob,
                "prediction": "HIGH" if prob > 0.66 else "MEDIUM" if prob > 0.33 else "LOW"}
    raise ValueError(f"Unknown tool: {name}")
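The input_schema entries are plain JSON Schema, so they can also serve as a cheap pre-flight check before execute_tool runs. The sketch below is a hand-rolled checker that covers only required keys and the three primitive types used in these tools; check_args and _PY_TYPES are hypothetical names, and a full validator such as the jsonschema package would be the more complete option.

```python
# Map the JSON Schema primitive names used by the UCC tools to Python types.
_PY_TYPES = {"string": str, "integer": int, "number": (int, float)}

def check_args(schema: dict, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the args look valid."""
    problems = []
    for key in schema.get("required", []):
        if key not in args:
            problems.append(f"missing required argument: {key}")
    for key, value in args.items():
        spec = schema.get("properties", {}).get(key)
        if spec is None:
            problems.append(f"unexpected argument: {key}")
            continue
        expected = _PY_TYPES.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it explicitly.
        if expected and (isinstance(value, bool) or not isinstance(value, expected)):
            problems.append(f"{key}: expected {spec['type']}, got {type(value).__name__}")
    return problems

# Same shape as the search_filings input_schema.
SEARCH_SCHEMA = {
    "type": "object",
    "properties": {"debtor_name": {"type": "string"}, "state": {"type": "string"}},
    "required": ["debtor_name"],
}
print(check_args(SEARCH_SCHEMA, {"debtor_name": "acme", "state": "NY"}))
print(check_args(SEARCH_SCHEMA, {"state": 5}))
```

Returning the problems as strings, instead of raising, makes it easy to send them back to Claude as an is_error tool_result so it can correct the call.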
Run: python -c "from tools import execute_tool; print(len(execute_tool('search_filings', {'debtor_name': 'acme'})))"
Expected: 9 (all Acme variants found via case-insensitive partial match, including the DBA).
What & Why: This is the heart of the iteration: the agentic loop you will replace twice over in later iterations. Write it once by hand and you will recognize what the SDK and Claude Code generate later. Skip it and you will not be able to debug them.
The loop has three jobs in a fixed order: (1) send the running message history to messages.create, (2) inspect response.stop_reason, (3) if tool_use, execute the tool, append both the assistant turn and the tool_result turn to the messages list, and loop back. The single most common Iter-1 bug is appending the tool_use without the matching tool_result, which fails on the next turn with a confusing 400 error.
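The pairing rule is easy to check mechanically before a request goes out. The sketch below scans a message history for tool_use blocks that have no tool_result in the very next message. It assumes content blocks are plain dicts; the loop in agent.py stores SDK response objects, so a version for that code would read block.type and block.id as attributes instead. The name unmatched_tool_uses is hypothetical.

```python
def unmatched_tool_uses(messages: list[dict]) -> list[str]:
    """Return ids of tool_use blocks with no tool_result in the next message."""
    missing = []
    for i, msg in enumerate(messages):
        if msg["role"] != "assistant" or isinstance(msg["content"], str):
            continue
        use_ids = [b["id"] for b in msg["content"] if b.get("type") == "tool_use"]
        nxt = messages[i + 1] if i + 1 < len(messages) else None
        result_ids = set()
        if nxt and nxt["role"] == "user" and not isinstance(nxt["content"], str):
            result_ids = {b["tool_use_id"] for b in nxt["content"]
                          if b.get("type") == "tool_result"}
        missing += [u for u in use_ids if u not in result_ids]
    return missing

good = [
    {"role": "user", "content": "Assess Acme"},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "tu_1", "name": "search_filings",
         "input": {"debtor_name": "acme"}}]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "tu_1", "content": "[]"}]},
]
# Same history, but the tool_result turn was replaced by a plain user message.
broken = good[:2] + [{"role": "user", "content": "What next?"}]

print(unmatched_tool_uses(good))    # []
print(unmatched_tool_uses(broken))  # ['tu_1']
```

A check like this could run just before each messages.create call during development, so the failure shows up as a readable list of ids instead of a 400 response.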
"""agent.py — the raw API loop. Everything is hand-coded.
Note: we split the loop into a private _run_messages() helper that takes a
message list and returns (text, updated_messages). run_agent() is a thin
wrapper for the single-shot case; in Step 6 the session manager will call
_run_messages directly to support multi-turn."""
import json, anthropic
from tools import TOOLS, execute_tool
client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-6"
MAX_TURNS = 10
SYSTEM = """You are a UCC filing risk analyst. When given a business name, search
thoroughly using name variations including abbreviations and DBAs, gather filing
statistics, run the delinquency model, examine the riskiest filings, and write
a narrative report citing specific evidence."""
def _run_messages(messages: list) -> tuple[str, list]:
    """Drive the tool-use loop on the given messages list. Mutates messages.
    Returns (final_text, messages)."""
    for turn in range(MAX_TURNS):
        response = client.messages.create(
            model=MODEL,
            max_tokens=4096,
            system=SYSTEM,
            tools=TOOLS,
            messages=messages,
        )
        # Always append the assistant turn BEFORE handling tool calls.
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason == "end_turn":
            # Default to "" so a response with no text block does not raise StopIteration.
            text = next((b.text for b in response.content if b.type == "text"), "")
            return text, messages
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    try:
                        result = execute_tool(block.name, block.input)
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": json.dumps(result, default=str),
                        })
                    except Exception as e:
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": f"ERROR: {e}",
                            "is_error": True,
                        })
            messages.append({"role": "user", "content": tool_results})
            continue
        raise RuntimeError(f"Unexpected stop_reason: {response.stop_reason}")
    raise RuntimeError(f"Agent exceeded {MAX_TURNS} turns without finishing")
def run_agent(question: str) -> str:
    """Single-shot entry point. Wraps _run_messages with a fresh history."""
    text, _ = _run_messages([{"role": "user", "content": question}])
    return text
if __name__ == "__main__":
    print(run_agent("Assess the delinquency risk for Acme Corporation."))
// agent.ts — the raw API loop. Everything is hand-coded.
import Anthropic from "@anthropic-ai/sdk";
import { TOOLS, executeTool } from "./tools.js";
const client = new Anthropic();
const MODEL = "claude-sonnet-4-6";
const SYSTEM = `You are a UCC filing risk analyst. When given a business name,
search thoroughly using name variations including abbreviations and DBAs,
gather filing statistics, run the delinquency model, examine the riskiest
filings, and write a narrative report citing specific evidence.`;
export async function runAgent(question: string, maxTurns = 10): Promise<string> {
  type Msg = Anthropic.MessageParam;
  const messages: Msg[] = [{ role: "user", content: question }];
  for (let turn = 0; turn < maxTurns; turn++) {
    const response = await client.messages.create({
      model: MODEL, max_tokens: 4096,
      system: SYSTEM, tools: TOOLS, messages,
    });
    // Always append the assistant turn BEFORE handling tool calls
    messages.push({ role: "assistant", content: response.content });
    if (response.stop_reason === "end_turn") {
      const text = response.content.find(b => b.type === "text");
      return text?.type === "text" ? text.text : "";
    }
    if (response.stop_reason === "tool_use") {
      const toolResults: Anthropic.ToolResultBlockParam[] = [];
      for (const block of response.content) {
        if (block.type === "tool_use") {
          try {
            const result = await executeTool(block.name, block.input);
            toolResults.push({
              type: "tool_result",
              tool_use_id: block.id,
              content: JSON.stringify(result),
            });
          } catch (e) {
            toolResults.push({
              type: "tool_result",
              tool_use_id: block.id,
              content: `ERROR: ${(e as Error).message}`,
              is_error: true,
            });
          }
        }
      }
      messages.push({ role: "user", content: toolResults });
      continue;
    }
    throw new Error(`Unexpected stop_reason: ${response.stop_reason}`);
  }
  throw new Error(`Agent exceeded ${maxTurns} turns without finishing`);
}
if (import.meta.url === `file://${process.argv[1]}`) {
  console.log(await runAgent("Assess the delinquency risk for Acme Corporation."));
}
Run: python agent.py
Expected output (paraphrased — Claude's exact wording varies):
- tool_use_id error → the assistant turn must be appended to messages BEFORE the tool_result turn. Check the order in _run_messages.
- Agent stops with empty text → max_tokens=4096 may be too low. Bump it to 8192.
- Agent loops forever / hits max_turns → the system prompt is too vague about when to stop. Add: "Once you have searched, scored, and written the report, stop."
- 0 filings found → the agent isn't trying name variations. Add to the prompt: "Try ACME, ACME CORP, ACME CORPORATION, and DBA forms."
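If prompt changes alone do not fix the "0 filings found" case, name variations can also be generated in code and passed to search_filings one at a time. The sketch below is an illustrative heuristic only: name_variations and the _SUFFIXES list are hypothetical, and real entity resolution needs far more than suffix stripping.

```python
import re

# Legal suffixes to strip; a deliberately short, assumed list.
_SUFFIXES = ("CORPORATION", "CORP", "INCORPORATED", "INC", "LLC", "CO")

def name_variations(name: str) -> list[str]:
    """Generate search strings from a business name (illustrative heuristic)."""
    # Uppercase and replace punctuation with spaces, then split into tokens.
    tokens = re.sub(r"[^A-Z0-9 ]", " ", name.upper()).split()
    # Keep only the legal name that precedes a DBA clause.
    if "DBA" in tokens:
        tokens = tokens[:tokens.index("DBA")]
    # Strip trailing legal suffixes to reach the core name.
    while tokens and tokens[-1] in _SUFFIXES:
        tokens.pop()
    core = " ".join(tokens)
    if not core:
        return []
    return [core, f"{core} CORP", f"{core} CORPORATION", f"{core} INC"]

print(name_variations("Acme Corporation, Inc."))
print(name_variations("Acme Corp DBA Roadrunner Supplies"))
```

Because search_filings does partial matching, the bare core name ("ACME") already matches every variant in the mock data; the longer forms matter more against an exact-match backend.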
What & Why: A loop that does whatever Claude asks is not safe to ship. Production agents need at minimum: input validation (reject overly broad queries like search_filings("a")), PII redaction, a cost cap (kill the run after N tokens), and a circuit breaker (after K consecutive failures, stop and alert). You write all four by hand, sprinkled into the loop.
"""guardrails.py — hand-coded checks called from the loop."""
import re
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
COST_LIMIT_TOKENS = 50_000
CIRCUIT_FAIL_THRESHOLD = 3
def validate_input(tool_name: str, args: dict):
    if tool_name == "search_filings":
        q = args.get("debtor_name", "")
        if len(q.strip()) < 3:
            raise ValueError("Query too broad: debtor_name must be >=3 characters")
def redact_pii(payload: str) -> str:
    payload = SSN_RE.sub("[SSN_REDACTED]", payload)
    payload = PHONE_RE.sub("[PHONE_REDACTED]", payload)
    return payload
class CircuitBreaker:
    def __init__(self):
        self.failures = 0
    def record(self, ok: bool):
        if ok: self.failures = 0
        else: self.failures += 1
        if self.failures >= CIRCUIT_FAIL_THRESHOLD:
            raise RuntimeError("Circuit breaker tripped — aborting")
Now wire the guardrails into _run_messages in agent.py. This is the modified loop — replace your existing _run_messages function with the version below. New lines are flagged with # NEW comments.
# Add to the imports at the top of agent.py:
from guardrails import (validate_input, redact_pii,
CircuitBreaker, COST_LIMIT_TOKENS) # NEW
_breaker = CircuitBreaker() # NEW — module-level instance
def _run_messages(messages: list) -> tuple[str, list]:
    total_tokens = 0  # NEW: cost cap counter
    for turn in range(MAX_TURNS):
        response = client.messages.create(
            model=MODEL, max_tokens=4096, system=SYSTEM,
            tools=TOOLS, messages=messages,
        )
        total_tokens += (response.usage.input_tokens
                         + response.usage.output_tokens)  # NEW
        if total_tokens > COST_LIMIT_TOKENS:  # NEW
            raise RuntimeError(f"Cost cap exceeded: {total_tokens} tokens")
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason == "end_turn":
            # Default to "" so a response with no text block does not raise StopIteration.
            text = next((b.text for b in response.content if b.type == "text"), "")
            return text, messages
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    try:
                        validate_input(block.name, block.input)  # NEW: pre-check
                        result = execute_tool(block.name, block.input)
                        _breaker.record(ok=True)  # NEW
                        # Redact PII from the JSON we send back to Claude.
                        content = redact_pii(json.dumps(result, default=str))  # NEW
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": content,
                        })
                    except Exception as e:
                        _breaker.record(ok=False)  # NEW: trips on 3rd consecutive fail
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": f"ERROR: {e}",
                            "is_error": True,
                        })
            messages.append({"role": "user", "content": tool_results})
            continue
        raise RuntimeError(f"Unexpected stop_reason: {response.stop_reason}")
    raise RuntimeError(f"Agent exceeded {MAX_TURNS} turns without finishing")
Run: python agent.py (no behavior change yet on a clean Acme query — guardrails only fire on bad inputs)
Test that the guardrails actually fire. Run this one-liner that calls the agent with a deliberately-broken question and confirms validate_input rejects it:
python -c "from agent import run_agent; print(run_agent('Search for company A'))"
# Expected: agent attempts search_filings(debtor_name='A'), validate_input
# raises ValueError, the error is fed back to Claude as is_error: true,
# and Claude reformulates with a longer fragment.
The short query is rejected by validate_input, the error flows back to Claude, and Claude either reformulates or apologizes. If you see RuntimeError: Cost cap exceeded, your loop is unbounded; check that total_tokens is being checked after every messages.create.
- ImportError: cannot import name 'validate_input' from 'guardrails' → you didn't save guardrails.py in the project folder, OR the function name has a typo.
- Circuit breaker trips on first call → you wired _breaker.record(ok=False) on every tool call instead of only in the except. Move it back to the except branch.
- Cost cap fires immediately → COST_LIMIT_TOKENS = 50_000 is for the whole run; if you set it to 50 by mistake it'll fire after the first response.
You are now copy-pasting the same four guardrail calls into every tool execution path. This is exactly the boilerplate that hooks (Iter 2) eliminate. Notice how it feels — this annoyance is what motivates the next iteration.
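One way to feel what hooks will take over is to gather the pre-checks and post-processing into a single wrapper in plain Python. The sketch below is not the Agent SDK hook API: guarded_call, pre_checks, and post_filters are hypothetical names, and the small validator, redactor, and executor are stand-ins for the guardrails.py and tools.py functions.

```python
import json
import re
from typing import Any, Callable

def guarded_call(executor: Callable[[str, dict], Any], name: str, args: dict,
                 pre_checks: list[Callable[[str, dict], None]],
                 post_filters: list[Callable[[str], str]]) -> dict:
    """Run pre-checks, execute the tool, then filter the serialized result."""
    try:
        for check in pre_checks:
            check(name, args)              # a check raises to block the call
        payload = json.dumps(executor(name, args), default=str)
        for flt in post_filters:
            payload = flt(payload)         # e.g. PII redaction
        return {"content": payload}
    except Exception as exc:
        return {"content": f"ERROR: {exc}", "is_error": True}

# Stand-ins for validate_input, redact_pii, and execute_tool.
def min_length(name: str, args: dict) -> None:
    if name == "search_filings" and len(args.get("debtor_name", "").strip()) < 3:
        raise ValueError("Query too broad")

def redact_ssn(payload: str) -> str:
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[SSN_REDACTED]", payload)

def fake_executor(name: str, args: dict) -> dict:
    return {"debtor": args["debtor_name"], "note": "SSN 123-45-6789 on file"}

ok = guarded_call(fake_executor, "search_filings", {"debtor_name": "acme"},
                  [min_length], [redact_ssn])
blocked = guarded_call(fake_executor, "search_filings", {"debtor_name": "a"},
                       [min_length], [redact_ssn])
print(ok)
print(blocked)
```

The loop would then only add type and tool_use_id to the returned dict. This is roughly the separation that pre and post tool-use hooks formalize in Iteration 2.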
What & Why: Auditability is non-negotiable for production agents. Every tool call needs a timestamped record: which tool, which inputs (redacted), the result summary, and the token count. Write a small append_audit() function in its own file, then call it from _run_messages after every successful tool execution.
Create a new file audit.py in the project folder:
"""audit.py: append-only audit log writer for the raw API loop."""
import json, datetime
from guardrails import redact_pii
def append_audit(tool_name: str, args: dict, result, tokens: int,
                 case_id: str = "default") -> None:
    # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12.
    now = datetime.datetime.now(datetime.timezone.utc)
    rec = {
        "ts": now.strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
        "case_id": case_id,
        "tool": tool_name,
        "input_redacted": redact_pii(json.dumps(args, default=str)),
        "output_summary": redact_pii(str(result))[:200],
        "tokens": tokens,
    }
    # One JSON object per line (JSONL), appended so earlier records are never rewritten.
    with open("audit_log.jsonl", "a") as f:
        f.write(json.dumps(rec) + "\n")
Now wire it into _run_messages. Add the import and the append_audit call right after the successful execute_tool:
# Add to the imports at the top of agent.py:
from audit import append_audit # NEW
# Inside _run_messages, in the try-block right after `result = execute_tool(...)`:
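                        result = execute_tool(block.name, block.input)
                        _breaker.record(ok=True)
                        # The first message is normally the opening question, but after
                        # session trimming it may be a block list instead of a string.
                        first = messages[0]["content"] if messages else "default"
                        append_audit(  # NEW
                            tool_name=block.name,
                            args=block.input,
                            result=result,
                            tokens=total_tokens,
                            case_id=first[:40] if isinstance(first, str) else "default",
                        )
                        content = redact_pii(json.dumps(result, default=str))
                        # ... rest of the try-block unchanged ...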
Run: python agent.py
Then inspect the audit file:
cat audit_log.jsonl # macOS/Linux
type audit_log.jsonl # Windows cmd
Get-Content audit_log.jsonl # Windows PowerShell
Expected output (one JSON object per tool call — you should see roughly 4–6 lines):
Check audit_log.jsonl. If the file is missing, you ran the agent from the wrong directory; cd into the project folder. If the records run together on one line, you forgot the + "\n" in the f.write call.
- ImportError: No module named 'audit' → audit.py is in the wrong folder. Move it next to agent.py.
- audit_log.jsonl is empty → either the agent didn't make any tool calls (rare; check the system prompt) or your append_audit call is in the wrong place. It must be inside the try-block, after execute_tool, before the append to tool_results.
- JSON parse errors when re-reading the log → missing newline. Confirm f.write(json.dumps(rec) + "\n").
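Reading the log back is a quick way to confirm that every line parses and to see which tools the agent leaned on. The sketch below assumes the record shape written by append_audit (a "tool" key and a running "tokens" total per line); summarize_audit is a hypothetical helper name.

```python
import json
from collections import Counter

def summarize_audit(path: str = "audit_log.jsonl") -> dict:
    """Count calls per tool and report the highest running token total seen."""
    calls = Counter()
    peak_tokens = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue                  # tolerate blank lines
            rec = json.loads(line)        # raises if a record is malformed
            calls[rec["tool"]] += 1
            peak_tokens = max(peak_tokens, rec.get("tokens", 0))
    return {"calls": dict(calls), "peak_tokens": peak_tokens}
```

Calling summarize_audit() from the project folder after a run returns a per-tool call count; a json.JSONDecodeError here points at the missing-newline problem described above.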
What & Why: Real users ask follow-ups. "What about Pinnacle?" should reuse the analyst persona without re-asking the system prompt and without losing the prior conversation. Iter 1 implements multi-turn by maintaining a per-session messages list, appending the new user message, and reusing the same _run_messages helper from agent.py. A sliding window keeps the list from growing forever.
Create a new file session.py in the project folder:
"""session.py — multi-turn over the same _run_messages helper from Step 3."""
from agent import _run_messages
SESSIONS: dict[str, list] = {} # session_id -> messages list
WINDOW = 20 # keep the last N messages
def chat(session_id: str, user_msg: str) -> str:
    """Append the user's message to the session, run the loop, return the answer."""
    msgs = SESSIONS.setdefault(session_id, [])
    msgs.append({"role": "user", "content": user_msg})
    answer, msgs = _run_messages(msgs)
    # Sliding window: keep only the last WINDOW messages so context doesn't grow forever.
    SESSIONS[session_id] = msgs[-WINDOW:]
    return answer
Try it — multi-turn from the Python REPL:
python -c "
from session import chat
print(chat('case-001', 'What is the lien exposure for Acme Corporation?'))
print('---')
print(chat('case-001', 'Now compare to Pinnacle Industries.'))
"
Expected behavior: the second call refers to Pinnacle in contrast to Acme — only possible if the session retained the first turn. If you instead get a generic "Pinnacle Industries has 2 active filings..." that doesn't reference Acme, the session isn't carrying over.
Try a third turn, chat('case-001', 'Which one is higher risk?'): the agent should answer with reference to both. If it says "I don't know which two you mean", your SESSIONS dict isn't persisting across calls.
- ImportError: cannot import name '_run_messages' from 'agent' → you skipped Step 3's refactor of agent.py. Go back and split the loop into _run_messages + run_agent.
- Each call starts fresh → SESSIONS is a module-level dict. If you import session separately each time (e.g., from a fresh subprocess), the dict resets. For a real server, use a database or Redis instead of an in-memory dict.
- Window slices off the wrong end → msgs[-WINDOW:] keeps the LAST N messages (recent), not the first N. If your follow-ups lose context after many turns, the window is too small; bump WINDOW to 40.
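One caveat with a plain msgs[-WINDOW:] slice: the cut can land between an assistant tool_use turn and its tool_result turn, or leave an assistant turn first, and the API rejects both on the next request. The sketch below trims to a safe boundary instead. It assumes user questions are stored as plain strings (as chat() does) and tool results as block lists; trim_history is a hypothetical helper, not part of the course code.

```python
def trim_history(messages: list[dict], window: int = 20) -> list[dict]:
    """Keep at most `window` recent messages, starting on a plain-text user turn."""
    trimmed = messages[-window:]
    # Skip leading assistant turns and tool_result turns whose tool_use was cut off.
    start = 0
    while start < len(trimmed) and not (
        trimmed[start]["role"] == "user"
        and isinstance(trimmed[start]["content"], str)
    ):
        start += 1
    return trimmed[start:]

history = [
    {"role": "user", "content": "Lien exposure for Acme?"},
    {"role": "assistant", "content": [{"type": "tool_use", "id": "tu_1",
                                       "name": "search_filings", "input": {}}]},
    {"role": "user", "content": [{"type": "tool_result", "tool_use_id": "tu_1",
                                  "content": "[]"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Acme has ..."}]},
    {"role": "user", "content": "Now compare to Pinnacle."},
    {"role": "assistant", "content": [{"type": "text", "text": "Pinnacle has ..."}]},
]
# A window of 4 would start on the orphaned tool_result; trimming skips past it.
print(len(trim_history(history, window=4)))  # 2
```

The trade-off is that the kept history can be shorter than the window, and it is empty if no plain user turn survives, so the window should comfortably exceed the longest expected tool chain.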
What & Why: Wrap the agent in a small HTTP API and a container so it is reachable from anywhere. Same Tier-1 deployment as M22B; the agent code does not care that it is in a container. We need three new files: server.py (FastAPI handlers), Dockerfile (build recipe), and requirements.txt (pinned dependencies the Dockerfile installs).
Create three files at the project root.
"""server.py — FastAPI wrapper around run_agent + session.chat."""
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from agent import run_agent
from session import chat as session_chat
app = FastAPI(title="UCC Risk Analyzer (Iter 1)")
class Q(BaseModel):
    question: str
class C(BaseModel):
    session_id: str
    message: str
@app.get("/health")
def health():
    return {"status": "ok", "iter": 1}
@app.post("/query")
def query(q: Q):
    try:
        return {"answer": run_agent(q.question)}
    except Exception as e:
        raise HTTPException(500, str(e))
@app.post("/chat")
def chat_ep(c: C):
    try:
        return {"answer": session_chat(c.session_id, c.message)}
    except Exception as e:
        raise HTTPException(500, str(e))
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
ENV PYTHONUNBUFFERED=1
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]
anthropic>=0.40
fastapi>=0.110
uvicorn>=0.27
pydantic>=2.0
scikit-learn>=1.3
Run locally first (no Docker):
uvicorn server:app --reload --port 8000
In another terminal:
curl localhost:8000/health
# Expected: {"status":"ok","iter":1}
curl -X POST localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"question":"Risk for Acme Corporation?"}'
# Expected: {"answer":"Acme Corporation has 9 active UCC filings ..."}
Then build and run the Docker container:
docker build -t iter1-c .
docker run --rm -p 8000:8000 -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY iter1-c
# In another terminal:
curl localhost:8000/health
- docker: command not found → install Docker Desktop and confirm docker --version works.
- OSError: [Errno 98] Address already in use → port 8000 is taken. Use --port 8001 for uvicorn or -p 8001:8000 for docker.
- Container starts but /query returns 500 with "ANTHROPIC_API_KEY not set" → you forgot the -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY flag on docker run.
- Container rebuild is slow → expected. Each build re-installs dependencies. Use --reload with uvicorn for dev work; only build the image when you ship.
You should now have 10 files in your project folder:
agent-iter1-raw/
│── agent.py               # the loop + _run_messages + run_agent
│── tools.py               # 3 tool schemas + execute_tool dispatcher
│── mock_data.py           # 12 UCC filings (9 Acme + 2 Pinnacle + 1 Sunrise)
│── delinquency_model.pkl  # sklearn RandomForest from Step 1
│── guardrails.py          # validate_input + redact_pii + CircuitBreaker
│── audit.py               # append_audit writer
│── session.py             # multi-turn chat() over _run_messages
│── server.py              # FastAPI handlers
│── Dockerfile             # python:3.11-slim build recipe
│── requirements.txt       # pinned deps
Run the full Iter-1 acceptance test:
# 1. Single-shot risk analysis
curl -s -X POST localhost:8000/query -H "Content-Type: application/json" \
-d '{"question":"Risk for Acme Corporation?"}' | python -m json.tool
# 2. Multi-turn (session_id keeps context)
curl -s -X POST localhost:8000/chat -H "Content-Type: application/json" \
-d '{"session_id":"acceptance","message":"Lien exposure for Acme?"}' | python -m json.tool
curl -s -X POST localhost:8000/chat -H "Content-Type: application/json" \
-d '{"session_id":"acceptance","message":"Now compare to Pinnacle."}' | python -m json.tool
# 3. Guardrails fire on a too-broad query
curl -s -X POST localhost:8000/query -H "Content-Type: application/json" \
-d '{"question":"Search for company A"}' | python -m json.tool
# 4. Inspect the audit log
docker exec $(docker ps -q --filter ancestor=iter1-c) cat audit_log.jsonl | head -10
You pass Iter 1 if: (1) /health returns ok; (2) /query returns a narrative mentioning 9 Acme filings and a HIGH score; (3) the second /chat call references Acme (proving multi-turn); (4) the broad query gets reformulated by the agent after the gate denies it; (5) audit_log.jsonl has redacted entries for every tool call.
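Criterion (5) is easy to eyeball incorrectly, so a small checker can help. The sketch below is an assumption-laden illustration: it reuses the SSN and phone patterns that appear in this capstone's redact_pii.py hook, and the record fields in the sample are hypothetical, since the exact shape written by your audit.py may differ.

```python
import json
import re

# Same shapes the redaction hook in this capstone targets (NNN-NN-NNNN, NNN-NNN-NNNN).
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def unredacted_lines(jsonl_text: str) -> list:
    """Return 1-based line numbers of audit records that still contain PII."""
    bad = []
    for n, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        json.loads(line)  # raises ValueError if a record is not valid JSON
        if SSN_RE.search(line) or PHONE_RE.search(line):
            bad.append(n)
    return bad


# Hypothetical records: the first is redacted, the second leaks a phone number.
sample = "\n".join([
    json.dumps({"tool": "search_filings", "output": "SSN [SSN_REDACTED]"}),
    json.dumps({"tool": "get_filing_details", "output": "call 555-123-4567"}),
])
print(unredacted_lines(sample))
```

To run it against the real log, pass open("audit_log.jsonl").read() instead of sample; an empty list means no record matched either pattern.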
- Files: 10 (agent.py, tools.py, mock_data.py, delinquency_model.pkl, guardrails.py, audit.py, session.py, server.py, Dockerfile, requirements.txt)
- Lines you wrote: ~250 (count with find . -name "*.py" -exec wc -l {} + | tail -1)
- Time: ~3 hours
- Abstractions used: none, just the anthropic Messages API and FastAPI
Debugging in Iteration 1: Print Statements + Manual Inspection
When the agent gives wrong output in Iter 1, you debug like you debug any Python program: by reading your own code. There is no abstraction between you and Claude; you can print everything.
Drop this into agent.py and call it after each messages.create:
def debug_turn(turn_num, response, messages):
    print(f"\n=== TURN {turn_num} ===")
    print(f"stop_reason: {response.stop_reason}")
    for block in response.content:
        if block.type == "tool_use":
            print(f"  TOOL: {block.name}({block.input})")
        elif block.type == "text":
            print(f"  TEXT: {block.text[:100]}...")
    print(f"  Messages in history: {len(messages)}")
    print(f"  Tokens: in={response.usage.input_tokens} out={response.usage.output_tokens}")
The most common Iter-1 bug is malformed messages. Add import json; print(json.dumps(messages, default=str, indent=2)) right before each messages.create call. You should see strict alternation: user → assistant → user (with tool_results) → assistant ... If you see two assistant turns in a row, that is your bug.
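Reading a long JSON dump for role mistakes gets tedious, so the alternation rule above can also be checked mechanically. This is a minimal sketch; the function name find_role_violations and the sample messages are illustrative assumptions, and it only inspects the role field, not the content blocks.

```python
def find_role_violations(messages: list) -> list:
    """Report places where the user/assistant alternation described above breaks."""
    problems = []
    if messages and messages[0]["role"] != "user":
        problems.append("first message must have role 'user'")
    for i in range(1, len(messages)):
        if messages[i]["role"] == messages[i - 1]["role"]:
            problems.append(
                f"messages {i - 1} and {i} are both '{messages[i]['role']}'")
    return problems


ok = [{"role": "user", "content": "q"}, {"role": "assistant", "content": "a"}]
bad = ok + [{"role": "assistant", "content": "b"}]
print(find_role_violations(ok))
print(find_role_violations(bad))
```

Calling it right before each messages.create gives a one-line answer to "is my history well-formed?" without scrolling through the dump.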
- Wrong tool_use_id in tool_result → API returns 400. Check the id on the tool_use block matches tool_use_id exactly.
- Forgot to append the assistant turn → API complains about message order. Always append response.content BEFORE handling tool calls.
- Loop never stops → stop_reason is tool_use every turn. Either Claude keeps asking for tools (system prompt is too vague) or you forgot to handle end_turn.
- Loop stops too early → the reply is empty or cut off. Usually max_tokens is too low; a truncated response reports stop_reason max_tokens rather than end_turn. Bump it from 1024 to 4096.
- Tool returns garbage → You forgot json.dumps(result). The API expects tool_result.content as a string (or a list of content blocks), not a Python dict.
- UCC-specific: agent finds 4 Acme filings but should find 9. Almost always a case-sensitive match in search_filings or skipping the DBA variant. Add .lower() on both sides.
In agent.py, change "tool_use_id": block.id to "tool_use_id": "wrong-id" and re-run. You should get a 400 error from the API mentioning the missing tool_use_id. Read the error message carefully — it says which id was expected. Restore the correct id and re-run. This is the muscle memory you build in Iter 1 that lets you debug Iter 3 generated code later.
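Two of the bugs above (a mismatched tool_use_id and a dict passed as content) can be ruled out by building every tool_result through one helper. This is a sketch under assumptions: make_tool_result is a hypothetical name, and the id "toolu_example" is a placeholder for the id you read off the tool_use block.

```python
import json


def make_tool_result(tool_use_id: str, result) -> dict:
    """Build one tool_result content block for the next user message."""
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,  # must equal the id on the tool_use block
        # Serialize to a string; default=str covers dates and similar objects.
        "content": json.dumps(result, default=str),
    }


block = make_tool_result("toolu_example", {"count": 9})
print(block["content"])
```

In the loop, you would call it as make_tool_result(block.id, execute_tool(...)) so the id is always taken from the block that requested the tool.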
Iteration 2: Agent SDK + Claude Code
Now you let the SDK run the loop, hooks handle the guardrails, sessions handle multi-turn, and Claude Code does most of the typing. Same agent. Half the lines. Different debugging.
What & Why: CLAUDE.md is the project memory file Claude Code reads at every prompt. It tells Claude Code your conventions: where files live, what dependencies you use, what the system prompt should be, what tests to run. Without CLAUDE.md, every prompt becomes a re-explanation of the project.
mkdir agent-iter2-sdk && cd agent-iter2-sdk
python -m venv venv && source venv/bin/activate # Windows: venv\Scripts\activate
pip install "claude-agent-sdk>=0.2" "scikit-learn>=1.3" "fastapi>=0.110" "uvicorn>=0.27"
npm i -g @anthropic-ai/claude-code # if not already installed
claude
> /init
# Agent: UCC Filing Risk Analyzer (Iteration 2)
## Stack
- Python 3.11+
- `claude-agent-sdk` (the official Agent SDK — NOT a wrapper around `client.messages.create()`)
- scikit-learn for delinquency_model.pkl
- FastAPI + Docker for deployment
- Mock data in mock_data.py (12 UCC filings)
## File Layout
- agent.py — query() entry point + @tool-decorated tools + create_sdk_mcp_server
- hooks/ — PreToolUse + PostToolUse hook scripts (referenced from .claude/settings.json)
- .claude/settings.json — hooks registration
- .claude/agents/ — subagent definitions (optional)
- sessions.py — multi-turn session management
- server.py — FastAPI wrapper
## System Prompt
You are a UCC filing risk analyst. Search thoroughly using name variations
including abbreviations and DBAs, gather statistics, run the delinquency
model, examine riskiest filings, write a narrative report citing evidence.
What & Why: The claude-agent-sdk lets you define tools as @tool-decorated async functions registered with an in-process MCP server. query() drives the loop for you — no while True, no messages.append, no stop_reason checks. The SDK is a real package; do NOT simulate it with client.messages.create().
If you've used client.messages.create() in Iter 1, you might expect the SDK to be a thin Agent class wrapping it. It is not. The SDK is built around MCP tools + an async query() generator + options/hooks via ClaudeAgentOptions. Tools return {"content": [{"type": "text", "text": ...}]} (MCP shape), not bare Python values.
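Because every tool has to return the same MCP-shaped dictionary, it may be worth wrapping that shape once. The helper below is a small sketch; text_result is a hypothetical name, not part of the SDK, and it only covers the single-text-block case used throughout this capstone.

```python
import json


def text_result(value) -> dict:
    """Wrap any JSON-serializable value in the MCP text-content shape."""
    return {"content": [{"type": "text",
                         "text": json.dumps(value, default=str)}]}


# Hypothetical payload standing in for a list of filing records.
wrapped = text_result([{"filing_number": "EX-0001"}])
print(wrapped)
```

Each tool body could then end with return text_result(hits) instead of repeating the nested literal.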
> Create agent.py using `claude-agent-sdk`. Define three UCC tools as
> @tool-decorated async functions: search_filings(debtor_name, state),
> get_filing_details(filing_number), predict_delinquency(...6 features).
> Wire them into a create_sdk_mcp_server, build ClaudeAgentOptions with the
> system prompt from CLAUDE.md, and expose an async run(question) entry
> point that drives query() and concatenates AssistantMessage text.
"""agent.py — claude-agent-sdk version. ~70 lines incl. tools."""
import json, pickle
from claude_agent_sdk import (
    query, tool, create_sdk_mcp_server,
    ClaudeAgentOptions, AssistantMessage,
)
from mock_data import FILINGS

# Load the pickled model once at import time; the with-block closes the file.
with open("delinquency_model.pkl", "rb") as fh:
    _MODEL = pickle.load(fh)

@tool("search_filings",
      "Search UCC filings by debtor name (case-insensitive partial match).",
      {"debtor_name": str, "state": str})
async def search_filings(args):
    q = args["debtor_name"].lower()
    st = (args.get("state") or "").upper()
    hits = [f for f in FILINGS
            if q in f["debtor_name"].lower()
            and (not st or f["state"] == st)]
    return {"content": [{"type": "text", "text": json.dumps(hits)}]}

@tool("get_filing_details",
      "Get full details for a filing by filing number.",
      {"filing_number": str})
async def get_filing_details(args):
    rec = next((f for f in FILINGS if f["filing_number"] == args["filing_number"]),
               {"error": "not found"})
    return {"content": [{"type": "text", "text": json.dumps(rec)}]}

@tool("predict_delinquency",
      "Run the delinquency ML model.",
      {"active_filing_count": int, "state_count": int,
       "collateral_types": int, "filing_age_years": float,
       "amendment_frequency": float, "months_to_lapse": float})
async def predict_delinquency(args):
    feats = [[args["active_filing_count"], args["state_count"],
              args["collateral_types"], args["filing_age_years"],
              args["amendment_frequency"], args["months_to_lapse"]]]
    p = float(_MODEL.predict_proba(feats)[0][1])
    out = {"probability": p,
           "prediction": "HIGH" if p > 0.66 else "MEDIUM" if p > 0.33 else "LOW"}
    return {"content": [{"type": "text", "text": json.dumps(out)}]}

ucc_server = create_sdk_mcp_server(
    name="ucc_tools", version="1.0.0",
    tools=[search_filings, get_filing_details, predict_delinquency],
)

OPTIONS = ClaudeAgentOptions(
    model="claude-sonnet-4-6",
    system_prompt=("You are a UCC filing risk analyst. Search thoroughly using "
                   "name variations including abbreviations and DBAs, gather "
                   "statistics, run the model, examine riskiest filings, cite "
                   "evidence."),
    mcp_servers={"ucc": ucc_server},
    allowed_tools=["mcp__ucc__search_filings",
                   "mcp__ucc__get_filing_details",
                   "mcp__ucc__predict_delinquency"],
    max_turns=10,
)

async def run(question: str) -> str:
    parts = []
    async for msg in query(prompt=question, options=OPTIONS):
        if isinstance(msg, AssistantMessage):
            for block in msg.content:
                if getattr(block, "text", None):
                    parts.append(block.text)
    return "\n".join(parts)
// agent.ts — @anthropic-ai/claude-agent-sdk version
import { query, tool, createSdkMcpServer } from "@anthropic-ai/claude-agent-sdk";
import { z } from "zod";
import { readFileSync } from "fs";
import { FILINGS } from "./mock_data.js";
// (load delinquency model via your preferred Node ML lib, e.g. onnxruntime-node)
const searchFilings = tool(
  "search_filings",
  "Search UCC filings by debtor name (case-insensitive partial match).",
  { debtor_name: z.string(), state: z.string().optional() },
  async (args) => {
    const q = args.debtor_name.toLowerCase();
    const st = (args.state ?? "").toUpperCase();
    const hits = FILINGS.filter(f =>
      f.debtor_name.toLowerCase().includes(q) &&
      (!st || f.state === st));
    return { content: [{ type: "text", text: JSON.stringify(hits) }] };
  }
);

const getFilingDetails = tool(
  "get_filing_details",
  "Get full details for a filing by filing number.",
  { filing_number: z.string() },
  async (args) => {
    const rec = FILINGS.find(f => f.filing_number === args.filing_number)
      ?? { error: "not found" };
    return { content: [{ type: "text", text: JSON.stringify(rec) }] };
  }
);

const predictDelinquency = tool(
  "predict_delinquency",
  "Run the delinquency ML model.",
  {
    active_filing_count: z.number().int(), state_count: z.number().int(),
    collateral_types: z.number().int(), filing_age_years: z.number(),
    amendment_frequency: z.number(), months_to_lapse: z.number(),
  },
  async (args) => {
    const p = await runModel(args); // your model wrapper
    const out = { probability: p,
                  prediction: p > 0.66 ? "HIGH" : p > 0.33 ? "MEDIUM" : "LOW" };
    return { content: [{ type: "text", text: JSON.stringify(out) }] };
  }
);

const uccServer = createSdkMcpServer({
  name: "ucc_tools",
  tools: [searchFilings, getFilingDetails, predictDelinquency],
});

const OPTIONS = {
  model: "claude-sonnet-4-6",
  systemPrompt: "You are a UCC filing risk analyst. ...",
  mcpServers: { ucc: uccServer },
  allowedTools: [
    "mcp__ucc__search_filings",
    "mcp__ucc__get_filing_details",
    "mcp__ucc__predict_delinquency",
  ],
  maxTurns: 10,
};

export async function run(question: string): Promise<string> {
  const parts: string[] = [];
  for await (const msg of query({ prompt: question, options: OPTIONS })) {
    if (msg.type === "assistant") {
      // Content blocks live on the wrapped API message, not on the SDK message itself.
      for (const block of msg.message.content) {
        if (block.type === "text") parts.push(block.text);
      }
    }
  }
  return parts.join("\n");
}
You no longer write the message loop, the stop_reason check, the tool_result append, or JSON schema dicts. The 90-line raw loop from Iter 1 collapsed to one async for msg in query(...) (Python) / for await (TS). The MCP server is created in 4 lines. Same behavior, ~70 lines instead of ~140.
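One detail in the options above is easy to miss: the allowed_tools entries follow the pattern mcp__<server>__<tool>, and in this capstone's code the server part is the key used in mcp_servers ("ucc"), not the name passed to create_sdk_mcp_server ("ucc_tools"). A tiny helper, sketched here under the hypothetical name mcp_tool_names, keeps the two in sync.

```python
def mcp_tool_names(server_key: str, tool_names: list) -> list:
    """Build allowed_tools entries in the mcp__<server>__<tool> form."""
    return [f"mcp__{server_key}__{name}" for name in tool_names]


# "ucc" matches the key in mcp_servers={"ucc": ucc_server}.
ALLOWED = mcp_tool_names(
    "ucc", ["search_filings", "get_filing_details", "predict_delinquency"])
print(ALLOWED)
```

Passing allowed_tools=ALLOWED means renaming the server key only has to happen in one place.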
- ModuleNotFoundError: No module named 'claude_agent_sdk' → pip install "claude-agent-sdk>=0.2" in your venv.
- ImportError: cannot import name 'Agent' from 'anthropic' → you're trying the old fictional API. The real SDK is claude_agent_sdk, not anthropic.Agent.
- Tool returns "Tool not allowed" → add "mcp__<server>__<tool>" to allowed_tools.
What & Why: The SDK supports two hook surfaces: (a) file-based hooks in .claude/settings.json that shell out to scripts (production-friendly, language-agnostic), and (b) in-process hooks via HookMatcher in ClaudeAgentOptions(hooks={...}). We use (a) for redaction/audit (matches the Tier 3 cert pattern) and (b) for in-process validation that needs to deny a tool call before it runs.
{
  "hooks": {
    "PreToolUse": [
      { "matcher": "mcp__ucc__search_filings",
        "hooks": [{ "type": "command", "command": "python hooks/log_call.py" }] }
    ],
    "PostToolUse": [
      { "matcher": "*",
        "hooks": [
          { "type": "command", "command": "python hooks/redact_pii.py" },
          { "type": "command", "command": "python hooks/audit.py" }
        ] }
    ]
  }
}
"""hooks/redact_pii.py — reads tool result on stdin, writes redacted JSON on stdout."""
import sys, json, re
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
payload = json.load(sys.stdin)
text = json.dumps(payload.get("tool_result"), default=str)
text = SSN_RE.sub("[SSN_REDACTED]", text)
text = PHONE_RE.sub("[PHONE_REDACTED]", text)
payload["tool_result"] = json.loads(text)
json.dump(payload, sys.stdout)
"""Add to agent.py: in-process validation hook via HookMatcher."""
from claude_agent_sdk import HookMatcher
async def validate_query(input_data, tool_use_id, context):
    if input_data.get("tool_name", "").endswith("search_filings"):
        q = (input_data.get("tool_input", {}).get("debtor_name") or "").strip()
        if len(q) < 3:
            return {"hookSpecificOutput": {"hookEventName": "PreToolUse",
                                           "permissionDecision": "deny",
                                           "permissionDecisionReason": "Query too broad"}}
    return {}

# Then update OPTIONS in agent.py:
OPTIONS = ClaudeAgentOptions(
    # ... existing fields ...
    hooks={"PreToolUse": [HookMatcher(matcher="mcp__ucc__search_filings",
                                      hooks=[validate_query])]},
)
Create the supporting hook scripts — both follow the same stdin/stdout pattern as redact_pii.py. Save these in hooks/:
"""hooks/log_call.py — PreToolUse: print to stderr, pass payload through unchanged."""
import sys, json, datetime
payload = json.load(sys.stdin)
ts = datetime.datetime.utcnow().isoformat() + "Z"
name = payload.get("tool_name", "?")
args = payload.get("tool_input", {})
print(f"[{ts}] PRE {name}({args})", file=sys.stderr)
json.dump(payload, sys.stdout) # pass-through — do NOT modify
"""hooks/audit.py — PostToolUse: append a redacted record to audit_log.jsonl."""
import sys, json, datetime
payload = json.load(sys.stdin)
rec = {
    "ts": datetime.datetime.utcnow().isoformat() + "Z",
    "tool": payload.get("tool_name"),
    "input": payload.get("tool_input"),
    "output_summary": str(payload.get("tool_result"))[:200],
}
with open("audit_log.jsonl", "a") as f:
    f.write(json.dumps(rec, default=str) + "\n")
json.dump(payload, sys.stdout)  # pass-through
Debug each hook standalone before wiring them up:
# Smoke-test the redactor (should replace the SSN):
echo '{"tool_name":"search_filings","tool_result":"customer SSN 123-45-6789"}' \
| python hooks/redact_pii.py
# Smoke-test the audit writer (should append a line):
echo '{"tool_name":"search_filings","tool_input":{"debtor_name":"Acme"},"tool_result":"ok"}' \
| python hooks/audit.py
cat audit_log.jsonl # should show the new entry
Then run the agent end-to-end:
python -c "import asyncio, agent; print(asyncio.run(agent.run('Risk for Acme Corporation?')))"
You should see: (1) [timestamp] PRE search_filings(...) log lines on stderr from log_call.py, (2) the agent's narrative report on stdout, (3) one line per tool call in audit_log.jsonl. Try a 1-character query: the in-process HookMatcher validator should deny it with "Query too broad".
- Hooks don't run at all → the SDK looks for .claude/settings.json in the current working directory. Run from the project root.
- Hook script crashes on JSON parse → you forgot to json.dump(payload, sys.stdout) as the LAST line. The SDK reads the script's stdout and parses it back as JSON; an empty or malformed stdout breaks the pipe.
- Audit log missing entries → the matcher is too narrow. "matcher": "*" matches every tool call; "matcher": "search_filings" only matches that one tool name.
- In-process validator never fires → check that hooks={"PreToolUse": [HookMatcher(...)]} is a kwarg on ClaudeAgentOptions, not a separate variable.
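Since the in-process validator is a plain async function, it can be exercised without starting the agent at all. The sketch below repeats the validator so it runs standalone; the decision helper and its "allow" label for an empty return value are illustrative conventions, not SDK behavior.

```python
import asyncio


async def validate_query(input_data, tool_use_id, context):
    """Copy of the PreToolUse validator, repeated so this file runs on its own."""
    if input_data.get("tool_name", "").endswith("search_filings"):
        q = (input_data.get("tool_input", {}).get("debtor_name") or "").strip()
        if len(q) < 3:
            return {"hookSpecificOutput": {
                "hookEventName": "PreToolUse",
                "permissionDecision": "deny",
                "permissionDecisionReason": "Query too broad"}}
    return {}


def decision(tool_name: str, debtor_name: str) -> str:
    """Run the hook once; label an empty result "allow" for readability."""
    payload = {"tool_name": tool_name, "tool_input": {"debtor_name": debtor_name}}
    out = asyncio.run(validate_query(payload, "test-id", None))
    return out.get("hookSpecificOutput", {}).get("permissionDecision", "allow")


print(decision("mcp__ucc__search_filings", "A"))
print(decision("mcp__ucc__search_filings", "Acme"))
```

If the first call does not print deny, the problem is in the validator itself rather than in how it is registered on ClaudeAgentOptions.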
What & Why: The SDK supports session continuation via the resume option (or continue_conversation=True for the most recent), and what-if branching via copying a session id. Iter 1's SESSIONS dict was a hand-rolled approximation; here we keep a thin in-memory map of session_id → last_resume_token and let the SDK handle the rest.
"""sessions.py — multi-turn via SDK resume tokens."""
from dataclasses import replace
from claude_agent_sdk import query, AssistantMessage
from agent import OPTIONS
SESSIONS: dict[str, str] = {} # case_id -> resume token (session_id)
async def _drive(prompt: str, options) -> tuple[str, str | None]:
    parts, last_session_id = [], None
    async for msg in query(prompt=prompt, options=options):
        if isinstance(msg, AssistantMessage):
            for block in msg.content:
                if getattr(block, "text", None):
                    parts.append(block.text)
        # The SDK emits a session_id on the result message at the end.
        sid = getattr(msg, "session_id", None)
        if sid:
            last_session_id = sid
    return "\n".join(parts), last_session_id

async def chat(session_id: str, msg: str) -> str:
    resume = SESSIONS.get(session_id)
    options = replace(OPTIONS, resume=resume) if resume else OPTIONS
    text, sid = await _drive(msg, options)
    if sid:
        SESSIONS[session_id] = sid
    return text

async def what_if(session_id: str, hypothetical: str) -> str:
    """Fork: resume from the same point but do NOT update the main session."""
    resume = SESSIONS.get(session_id)
    options = replace(OPTIONS, resume=resume) if resume else OPTIONS
    text, _ = await _drive(hypothetical, options)
    return text
Try it — multi-turn + fork demo:
python -c "
import asyncio
from sessions import chat, what_if
async def main():
    # Turn 1: establish context
    print('T1:', await chat('case-001', 'What is the lien exposure for Acme Corporation?'))
    # Turn 2: builds on T1
    print('T2:', await chat('case-001', 'Now compare to Pinnacle Industries.'))
    # Fork: hypothetical that does NOT pollute the main session
    print('FORK:', await what_if('case-001', 'What if Acme files a UCC-3 termination on NY-2024-0848?'))
    # Turn 3 on the original session: the fork's hypothetical was NOT seen
    print('T3:', await chat('case-001', 'Summarize Acme vs Pinnacle in one sentence.'))
asyncio.run(main())
"
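Expected behavior: T3 should summarize Acme vs Pinnacle without mentioning the hypothetical UCC-3 termination. If T3 does mention it, what_if is updating SESSIONS[session_id] when it shouldn't.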
- TypeError: replace() got an unexpected keyword argument 'resume' → ClaudeAgentOptions in your installed SDK version doesn't support resume=. Upgrade with pip install -U "claude-agent-sdk>=0.2".
- session_id is always None on first call → expected. The SDK only emits session_id on the result message AFTER the agent finishes. The first turn has no resume token; subsequent turns do.
- Fork still updates the main session → check that what_if uses text, _ = await _drive(...) (discarding the new sid) rather than text, sid = await _drive(...) followed by an assignment.
What & Why: Claude Code slash commands let you run reusable workflows from inside the IDE. /run-agent calls the agent on a question; /test-agent runs the unit test suite; /eval-agent runs the 10-question UCC eval set and reports per-scenario scoring.
Create 3 files in .claude/commands/:
---
description: Run the UCC Risk Analyzer agent on a question
argument-hint: [question]
---
Run `python -c "import asyncio, agent; print(asyncio.run(agent.run('$ARGUMENTS')))"`.
If --verbose is in $ARGUMENTS, set AGENT_VERBOSE=1 (read by hooks/log_call.py).
Print the final report and the token + cost summary from query()'s ResultMessage.
---
description: Run the unit test suite for the UCC agent
---
Run `pytest tests/ -v`. The test suite covers tools, hooks, sessions, and the
end-to-end agent (must find all 9 Acme filings including the DBA variant).
Report each test's pass/fail and a summary at the end. If any tests fail,
read the spec at `spec/agent-spec.md` (in Iter 3) to see expected behavior,
then fix the implementation.
---
description: Run the 10-question evaluation suite
---
Read `test_scenarios.json` (10 question/expected-output pairs). For each
question, call `agent.run()`, score the response on a rubric: did it find
the right entities? did it return the right risk level? did it cite specific
filings? Report per-scenario score (0-5) and an overall percentage.
Try it from inside Claude Code:
claude
> /run-agent What is the lien exposure for Acme Corporation?
> /test-agent
> /eval-agent
When you type / in Claude Code, the three commands appear in the autocomplete. Running /run-agent with a UCC question produces the narrative report. /test-agent requires you to first create a tests/ folder; in Iter 3 the spec generates these for you.
- Slash commands don't appear in autocomplete → the directory must be exactly .claude/commands/ at the project root. Restart Claude Code after creating new files.
- $ARGUMENTS isn't substituted → this is a Claude Code template variable; it works inside the slash command, not in plain bash. If you're testing the underlying Python directly, hardcode the question.
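The /eval-agent rubric is described in prose, so the scoring is left to Claude Code's judgment. If you want a deterministic baseline, one possible reading of the rubric is sketched below. The point split (2 for entities, 2 for risk level, 1 for a cited filing) and the scenario field names are assumptions made for illustration; the capstone does not define the layout of test_scenarios.json.

```python
def score_response(answer: str, expected: dict) -> int:
    """Score one answer 0-5 using naive substring checks (assumed weights)."""
    text = answer.lower()
    score = 0
    if all(e.lower() in text for e in expected["entities"]):
        score += 2   # found the right entities
    if expected["risk_level"].lower() in text:
        score += 2   # returned the right risk level
    if any(n.lower() in text for n in expected["filing_numbers"]):
        score += 1   # cited at least one specific filing
    return score


def overall_percent(scores: list) -> float:
    """Overall percentage across scenarios, each worth 5 points."""
    return 100.0 * sum(scores) / (5 * len(scores)) if scores else 0.0


# Hypothetical scenario entry and answer.
expected = {"entities": ["Acme"], "risk_level": "HIGH",
            "filing_numbers": ["NY-2024-0848"]}
answer = "Acme Corporation is HIGH risk; see filing NY-2024-0848."
print(score_response(answer, expected))
```

Substring matching is crude (for example, "high" also matches "highest"), so treat this as a sanity check alongside the model-graded rubric rather than a replacement for it.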
What & Why: Same FastAPI + Docker pattern as Iter 1, but you ask Claude Code to write it.
> Create server.py and Dockerfile. Endpoints: GET /health, POST /query
> (single-shot), POST /chat (session_id + message). The endpoints are async
> FastAPI handlers that await agent.run() and sessions.chat() respectively
> (both are async coroutines from claude-agent-sdk). Dockerfile based on
> python:3.11-slim, install claude-agent-sdk + dependencies, expose 8000,
> uvicorn entrypoint. Mount .claude/ into the container so settings.json
> and hook scripts are picked up.
What Claude Code produces (review before saving):
"""server.py — async FastAPI wrapper around the SDK agent."""
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from agent import run as agent_run
from sessions import chat as session_chat
app = FastAPI(title="UCC Risk Analyzer (Iter 2 — SDK)")
class Q(BaseModel): question: str
class C(BaseModel): session_id: str; message: str
@app.get("/health")
def health(): return {"status": "ok", "iter": 2}
@app.post("/query")
async def query(q: Q):
    try:
        return {"answer": await agent_run(q.question)}
    except Exception as e:
        raise HTTPException(500, str(e))

@app.post("/chat")
async def chat_ep(c: C):
    try:
        return {"answer": await session_chat(c.session_id, c.message)}
    except Exception as e:
        raise HTTPException(500, str(e))
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# .claude/settings.json + hooks/ must be in the build context for hooks to fire
EXPOSE 8000
ENV PYTHONUNBUFFERED=1
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]
claude-agent-sdk>=0.2
scikit-learn>=1.3
fastapi>=0.110
uvicorn>=0.27
pydantic>=2.0
Run locally first:
uvicorn server:app --reload --port 8000
# In another terminal:
curl localhost:8000/health
# Expected: {"status":"ok","iter":2}
curl -X POST localhost:8000/query -H "Content-Type: application/json" \
-d '{"question":"Risk for Acme Corporation?"}'
Then build and run with Docker:
docker build -t iter2-c .
docker run --rm -p 8000:8000 -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY iter2-c
- Container starts but hooks don't fire → .claude/ wasn't copied into the image. Confirm COPY . . includes the .claude/ folder; some .dockerignore setups exclude dotfiles.
- RuntimeError: This event loop is already running → the FastAPI handlers must be async def, not plain def (they're awaiting the SDK's async query()).
- Slow build → pin versions in requirements.txt and add a .dockerignore with venv/, __pycache__/, *.pyc to skip unnecessary copies.
You should now have ~13 files in your project:
agent-iter2-sdk/
│── CLAUDE.md
│── agent.py               # query() + 3 @tool functions + create_sdk_mcp_server
│── sessions.py            # chat() and what_if() over SDK resume tokens
│── mock_data.py           # same 12 records as Iter 1
│── delinquency_model.pkl
│── .claude/
│   │── settings.json      # PreToolUse + PostToolUse matchers
│   │── commands/run-agent.md, test-agent.md, eval-agent.md
│── hooks/
│   │── log_call.py, redact_pii.py, audit.py
│── server.py | Dockerfile | requirements.txt
Run the same Iter-1 acceptance test against the Iter-2 deployment — the curl shape and outputs are identical:
# Same 4 acceptance curls as Iter 1, but pointing at the Iter-2 container.
# Outputs should be functionally identical to Iter 1 (same 9 Acme filings, HIGH score).
curl -s -X POST localhost:8000/query -H "Content-Type: application/json" \
-d '{"question":"Risk for Acme Corporation?"}' | python -m json.tool
# Multi-turn via the SDK's resume token (different mechanism than Iter 1's transcript)
curl -s -X POST localhost:8000/chat -H "Content-Type: application/json" \
-d '{"session_id":"acceptance","message":"Lien exposure for Acme?"}' | python -m json.tool
curl -s -X POST localhost:8000/chat -H "Content-Type: application/json" \
-d '{"session_id":"acceptance","message":"Now compare to Pinnacle."}' | python -m json.tool
# Hook-driven audit (instead of Iter-1's hand-rolled append_audit call)
docker exec $(docker ps -q --filter ancestor=iter2-c) cat audit_log.jsonl | head -10
You pass Iter 2 if: all 4 outputs are functionally equivalent to Iter 1, but you wrote ~120 lines instead of ~250. The agent's observable behavior is unchanged; the implementation underneath uses the SDK + hooks + Claude Code.
- Files: ~13 (CLAUDE.md, agent.py, sessions.py, mock_data.py, delinquency_model.pkl, .claude/settings.json, .claude/commands/×3, hooks/×3, server.py, Dockerfile, requirements.txt)
- Lines you wrote: ~120 (vs Iter 1's ~250)
- Time: ~2 hours (vs Iter 1's ~3)
- Abstractions used: claude-agent-sdk (query / @tool / create_sdk_mcp_server / ClaudeAgentOptions / HookMatcher / PermissionResultDeny) + Claude Code slash commands
Debugging in Iteration 2: Hooks + Console Web UI + Langfuse
The SDK abstracts the loop — you cannot drop a print statement inside it. You debug from the outside instead.
Hooks fire at every tool call. The simplest debug surface is a stdin-stdout script registered in .claude/settings.json that pretty-prints to stderr (so it doesn't pollute the hook payload):
"""hooks/debug.py — passthrough that pretty-prints to stderr."""
import sys, json
payload = json.load(sys.stdin)
print(f"[HOOK {payload.get('hook_event_name','?')}] "
      f"{payload.get('tool_name','?')}: "
      f"{json.dumps(payload.get('tool_input', payload.get('tool_result')))[:200]}",
      file=sys.stderr)
json.dump(payload, sys.stdout) # pass-through
{
  "hooks": {
    "PreToolUse": [
      { "matcher": "*",
        "hooks": [{ "type": "command", "command": "python hooks/debug.py" }] }
    ],
    "PostToolUse": [
      { "matcher": "*",
        "hooks": [{ "type": "command", "command": "python hooks/debug.py" }] }
    ]
  }
}
This is better than Iter-1 print statements because hooks are modular: swap .claude/settings.json back to the production version when you're done debugging and the agent code stays untouched.
The Anthropic Console (console.anthropic.com) shows every API call your agent made.
Worked example: Agent calls search_filings("acme") but gets 0 results when 9 are expected. Console → Logs → failed call → tool_use block shows Claude sent "acme" (lowercase) but the generated search_filings compares without lowercasing the data side. Fix: q in f["debtor_name"].lower(). Re-run; Console shows the fix working. Total time to diagnose: ~3 minutes vs ~15 in Iter 1.
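The bug in the worked example is small enough to reproduce in a few lines. The records below are hypothetical stand-ins for mock_data.py, and search_buggy and search_fixed are illustrative names; the point is only that lowercasing the query without lowercasing the stored name silently drops every mixed-case match.

```python
# Hypothetical records; the real mock_data.py has 12 filings.
FILINGS = [
    {"filing_number": "EX-0001", "debtor_name": "ACME CORPORATION", "state": "NY"},
    {"filing_number": "EX-0002", "debtor_name": "Acme Corp", "state": "CA"},
    {"filing_number": "EX-0003", "debtor_name": "Pinnacle Industries", "state": "TX"},
]


def search_buggy(query: str) -> list:
    # Lowercases the query but compares against the raw stored name.
    return [f for f in FILINGS if query.lower() in f["debtor_name"]]


def search_fixed(query: str) -> list:
    # Lowercases BOTH sides, so case never affects the match.
    return [f for f in FILINGS if query.lower() in f["debtor_name"].lower()]


print(len(search_buggy("acme")), len(search_fixed("acme")))
```

A unit test that asserts on the fixed count would have caught this before the Console was needed.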
If you wired Langfuse in M19, every agent run is a waterfall trace. Each tool call is a span with timing, input, output. Compare traces across runs — "why did today's run take 12 seconds vs yesterday's 4?" gets answered by reading the spans.
The slash command can include a verbose flag that turns on the debug hooks for one run. Output: every tool call, every hook fire, every token count, total cost, total time. Zero permanent code changes.
Iteration 3: Spec-Driven
You stop writing code. You write a spec describing what the agent should do, then ask Claude Code to build it. ~18 files appear. You review, iterate on the spec, regenerate. The spec IS the documentation.
What & Why: The spec is your full design doc. Claude Code reads it and produces every file from it. Be specific in section 3 about parameter names and types — that is what generates the schemas.
# Agent Specification: UCC Filing Risk Analyzer
## 1. Overview
An agent that assesses delinquency risk for businesses by searching UCC
filings, running an ML model, and writing a narrative risk report.
## 2. Configuration
- Model: claude-sonnet-4-6
- Framework: claude-agent-sdk (Python). Tools registered via
create_sdk_mcp_server. Driver: query() with ClaudeAgentOptions.
- Max turns: 10
- Max tokens per response: 4096
## 3. Tools (registered as MCP server "ucc")
### search_filings — mcp__ucc__search_filings
- @tool decorator with schema {debtor_name: str, state: str}
- Returns list of filing dicts wrapped in {"content": [{"type":"text","text": json}]}
- Mock data: 9 Acme filings across NY/CA/TX/FL incl. one DBA variant
### get_filing_details — mcp__ucc__get_filing_details
- {filing_number: str}
### predict_delinquency — mcp__ucc__predict_delinquency
- 6 features (active_filing_count, state_count, collateral_types,
filing_age_years, amendment_frequency, months_to_lapse)
- Implementation: load sklearn RandomForest from delinquency_model.pkl
## 4. System Prompt (passed to ClaudeAgentOptions.system_prompt)
You are a UCC filing risk analyst. Search thoroughly using name variations
including abbreviations and DBAs, gather statistics, run the model, examine
riskiest filings, write a narrative report citing specific evidence.
## 5. Hooks
File-based via .claude/settings.json:
- PreToolUse matcher "*" command "python hooks/log_call.py"
- PostToolUse matcher "*" command "python hooks/redact_pii.py"
- PostToolUse matcher "*" command "python hooks/audit.py"
In-process via ClaudeAgentOptions.hooks (HookMatcher):
- PreToolUse matcher "mcp__ucc__search_filings": deny if debtor_name < 3 chars
PII patterns to redact: SSN (NNN-NN-NNNN), phone (NNN-NNN-NNNN).
## 6. Sessions
Multi-turn via SDK resume tokens (sessions.py keeps a session_id -> resume map).
what_if() forks by re-using a resume token without persisting the new session_id.
## 7. Mock Data
9 Acme filings (NY:4, CA:2, TX:2, FL:1 DBA). 2 Pinnacle. 1 Sunrise.
## 8. API Wrapper
FastAPI async handlers: POST /query, POST /chat, GET /health.
Auth via X-API-Key header. Rate limit 10/min/key.
## 9. Deployment
Tier 1: Docker. Mount .claude/ into container so hooks resolve at runtime.
## 10. Tests
test_tools.py, test_agent.py (must find all 9 Acme incl. DBA),
test_hooks.py (validates HookMatcher denies short queries),
test_sessions.py (resume token reuse), test_api.py.
## 11. Evaluation
spec/test_scenarios.json with 10 questions covering name variations, state
filter, risk levels, what-if, broad synthesis, missing entities, lapse detection.
## 12. File Structure
ucc-risk-agent/
│── spec/agent-spec.md # this file
│── spec/test_scenarios.json
│── CLAUDE.md
│── .claude/settings.json # PreToolUse/PostToolUse matchers
│── .claude/commands/ {run-agent.md, test-agent.md, eval-agent.md}
│── agent.py # query() + @tool + create_sdk_mcp_server
│── sessions.py # resume-token sessions
│── mock_data.py | delinquency_model.pkl
│── hooks/{log_call,redact_pii,audit}.py
│── server.py | Dockerfile | docker-compose.yml | requirements.txt
│── tests/ x5
│── appendix/manual-loop.py # Iter-1 reference, not for production
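Section 5 of the spec names two redaction patterns and one validation rule. As a rough illustration of the logic the generated hooks might contain, here is a minimal stdlib-only sketch. The function names `redact` and `debtor_name_allowed` and the replacement tokens are assumptions for illustration, not part of the spec, and the real hooks would wrap this logic in whatever input/output shape Claude Code and the SDK expect.

```python
import re

# Patterns taken from spec section 5: SSN (NNN-NN-NNNN), phone (NNN-NNN-NNNN).
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def redact(text: str) -> str:
    """Replace SSN-like and phone-like substrings with placeholder tokens.

    The placeholder strings are hypothetical; the spec does not name them.
    """
    text = SSN_RE.sub("[REDACTED-SSN]", text)
    return PHONE_RE.sub("[REDACTED-PHONE]", text)


def debtor_name_allowed(debtor_name: str) -> bool:
    """Mirror the in-process rule: deny searches shorter than 3 characters."""
    return len(debtor_name.strip()) >= 3


print(redact("Contact 212-555-0147, SSN 123-45-6789"))
print(debtor_name_allowed("AC"), debtor_name_allowed("Acme"))
```

Keeping the rules as plain functions like this makes them easy to cover in test_hooks.py without starting the agent.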
What & Why: Open Claude Code in the project folder. Issue ONE prompt that points at the spec and asks for a full build. Watch it work — this is the moment Iteration 3 earns its name.
> /generate-from-spec spec/agent-spec.md
# (or, if /generate-from-spec is not installed:)
> Read spec/agent-spec.md and build the entire project. Create every file
> listed in section 12 (File Structure). Implement every tool, hook, test,
> and endpoint exactly as specified. Use the `claude-agent-sdk` package
> (NOT a fictional anthropic.Agent class). Tools must be @tool-decorated
> async functions registered via create_sdk_mcp_server. The driver is
> query() + ClaudeAgentOptions. Hooks live in .claude/settings.json
> AND in OPTIONS.hooks (HookMatcher) per section 5. Generate realistic
> mock UCC filing data. Train and save the pickle model. Also generate
> appendix/manual-loop.py as an under-the-hood reference.
What you should see Claude Code create — expect a stream of file-creation messages over 5–10 minutes:
Verify it actually works:
# From inside Claude Code:
> /test-agent
# Or from a regular shell:
pytest tests/ -v
Expected pytest output:
- test_finds_all_acme_filings_including_dba fails (got 5, expected 9) → the generated search_filings is probably matching names too narrowly (a case-sensitive match, or it skips the DBA variant) because the spec was ambiguous. See the debugging block below for how to fix it via spec edit (NOT by editing the generated code).
- Claude Code asks clarifying questions instead of generating → the spec has gaps. Either answer inline or update the spec and re-prompt with "Use the updated spec".
- Some files are missing → section 12 of the spec lists every expected file; if the generator missed one, point it out: "tests/test_hooks.py wasn't created — please add it per spec section 10".
- Generated code uses fictional API → re-issue the prompt and explicitly say "use the real claude_agent_sdk package; do NOT use anthropic.Agent or @agent.tool". Then ask it to compare the generated agent.py against spec section 2.
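The fictional-API case can also be caught with a quick scan before you run anything. The sketch below is one possible check, not part of the course tooling: the helper name `check_source` is hypothetical, and the lists of patterns come only from the names mentioned in the prompt and in spec section 2.

```python
import re

# Names the troubleshooting list calls fictional.
FICTIONAL_PATTERNS = {
    "anthropic.Agent": re.compile(r"\banthropic\.Agent\b"),
    "@agent.tool": re.compile(r"@agent\.tool\b"),
}

# Names spec section 2 says the generated agent.py should use.
EXPECTED_NAMES = ["claude_agent_sdk", "create_sdk_mcp_server", "ClaudeAgentOptions"]


def check_source(source: str) -> dict:
    """Report fictional API usages found and expected SDK names not found."""
    fictional = [name for name, pat in FICTIONAL_PATTERNS.items() if pat.search(source)]
    missing = [name for name in EXPECTED_NAMES if name not in source]
    return {"fictional": fictional, "missing": missing}


sample = "import anthropic\nagent = anthropic.Agent()\n"
print(check_source(sample))
```

To use it on the real file, pass the contents of agent.py, for example `check_source(open("agent.py").read())`. An empty `fictional` list and an empty `missing` list are a reasonable first signal, but they do not replace running the tests.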
What & Why: The spec is the source of truth. When you want to add or change behavior, edit the spec FIRST, then ask Claude Code to regenerate the affected files. Don't edit generated code directly — the next regen would overwrite you.
Example: add a fourth tool check_lapse_dates(months_ahead: int = 12) that returns filings approaching lapse:
- Edit agent-spec.md section 3 to add the new tool
- Edit section 11 to add an eval scenario for it
- In Claude Code: "I added check_lapse_dates to agent-spec.md. Update tools.py, mock_data.py, add a test, and an eval scenario."
- Claude Code reads the spec diff and makes targeted changes, not a full regen
- /test-agent → all green
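To make the target of that spec edit concrete, here is a minimal sketch of the core logic a generated check_lapse_dates might contain. Only the name and the `months_ahead: int = 12` parameter come from the text above; the `lapse_date` field, the `today` parameter (added so the example is deterministic), and the sample filings are assumptions. The real tool would also be @tool-decorated and wrap its result in the MCP content envelope described in spec section 3.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole months, clamping the day to month end."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def check_lapse_dates(filings, months_ahead: int = 12, today: date = None):
    """Return filings whose lapse date falls between today and the cutoff."""
    today = today or date.today()
    cutoff = add_months(today, months_ahead)
    due = [
        f for f in filings
        if today <= date.fromisoformat(f["lapse_date"]) <= cutoff
    ]
    # ISO date strings sort chronologically, so soonest lapse comes first.
    return sorted(due, key=lambda f: f["lapse_date"])


# Hypothetical mock filings, not the course's actual mock data.
SAMPLE = [
    {"filing_number": "NY-001", "lapse_date": "2025-03-15"},
    {"filing_number": "CA-002", "lapse_date": "2027-01-01"},
    {"filing_number": "TX-003", "lapse_date": "2024-12-31"},
    {"filing_number": "FL-004", "lapse_date": "2025-11-30"},
]

for f in check_lapse_dates(SAMPLE, today=date(2025, 1, 1)):
    print(f["filing_number"], f["lapse_date"])
```

Whether already-lapsed filings should be included is a design decision this sketch answers with "no"; state the rule you want in the spec so the generator does not have to guess.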
What & Why: The Dockerfile and docker-compose.yml are already generated. Build, run, curl, compare output to Iter 1 and Iter 2 — the test of "spec-driven works" is that the agent's observable behavior matches the prior iterations.
# Build and run via docker-compose (generated by Claude Code)
docker compose up --build -d
docker compose ps # confirm container is "Up"
# Health check
curl localhost:8000/health
# Expected: {"status":"ok","iter":3}
# Same query as Iter 1 and Iter 2
curl -X POST localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"question":"Risk for Acme Corporation?"}' | python -m json.tool
Cross-iteration diff — this is where the punchline lands:
# Run the same query against all three deployments and diff
# (assumes Iter 1 on :8001, Iter 2 on :8002, Iter 3 on :8003)
for port in 8001 8002 8003; do
echo "=== Iter on :$port ==="
curl -s -X POST localhost:$port/query \
-H "Content-Type: application/json" \
-d '{"question":"Risk for Acme Corporation?"}' \
| python -c "import json, sys; d = json.load(sys.stdin); print(d['answer'][:300])"
echo
done
Expected: all three responses mention 9 Acme filings, 4 states, the DBA variant, and HIGH risk. Wording will vary (Claude is non-deterministic), but the facts match.
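Because the wording varies between runs, a plain text diff of the three answers will rarely be empty. One option is to check each answer for the expected facts instead. The sketch below is an assumption-laden illustration: the helper name `missing_facts` and the regular expressions are hypothetical, and the patterns are deliberately loose, so treat a pass as a hint rather than proof.

```python
import re

# Facts the three answers are expected to share, as loose patterns.
EXPECTED_FACTS = {
    "9 filings": re.compile(r"\b(9|nine)\b[^.]*\bfilings?\b", re.IGNORECASE),
    "4 states": re.compile(r"\b(4|four)\b[^.]*\bstates?\b", re.IGNORECASE),
    "DBA variant": re.compile(r"\bDBA\b", re.IGNORECASE),
    "HIGH risk": re.compile(
        r"\bhigh\b[^.]*\brisk\b|\brisk\b[^.]*\bhigh\b", re.IGNORECASE
    ),
}


def missing_facts(answer: str) -> list[str]:
    """Return the labels of expected facts that the answer does not mention."""
    return [label for label, pat in EXPECTED_FACTS.items() if not pat.search(answer)]


example = (
    "Acme Corporation has 9 active filings across 4 states, "
    "including one DBA variant. Overall delinquency risk is HIGH."
)
print(missing_facts(example))
```

You could feed it the `answer` field from each of the three curl responses and compare the resulting lists across ports.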
You should have ~24 generated files (Claude Code wrote them) plus your ~100-line spec:
ucc-risk-agent/
│── spec/agent-spec.md # YOU wrote this
│── spec/test_scenarios.json # generated
│── CLAUDE.md # generated
│── .claude/settings.json # generated
│── .claude/commands/ ×3 # generated
│── agent.py | tools.py | sessions.py # generated (SDK)
│── hooks/ ×3 # generated
│── mock_data.py | delinquency_model.pkl # generated
│── tests/ ×5 # generated
│── server.py | Dockerfile | docker-compose.yml | requirements.txt # generated
│── appendix/manual-loop.py # generated reference
Acceptance test — same shape as Iter 1 and Iter 2:
# 1. Eval suite passes
pytest tests/ -v
# Expected: 10 passed
# 2. End-to-end question on the deployed container
curl -s -X POST localhost:8000/query -H "Content-Type: application/json" \
-d '{"question":"Risk for Acme Corporation?"}' | python -m json.tool
# 3. Multi-turn (SDK resume tokens)
curl -s -X POST localhost:8000/chat -H "Content-Type: application/json" \
-d '{"session_id":"acceptance","message":"Lien exposure for Acme?"}' | python -m json.tool
curl -s -X POST localhost:8000/chat -H "Content-Type: application/json" \
-d '{"session_id":"acceptance","message":"Now compare to Pinnacle."}' | python -m json.tool
# 4. Spec-vs-code drift check (the Iter-3 specific verification)
claude
> Read spec/agent-spec.md and compare to agent.py + hooks/. Report any drift.
# Expected: "No drift detected. Code matches spec section by section."
You pass Iter 3 if: (1) all 10 generated tests pass; (2) curl outputs match Iter 1/2 facts; (3) the spec-vs-code drift check returns "no drift". You wrote a 100-line spec; Claude Code wrote 510 lines of working code from it.
- Files: ~24 (all generated by Claude Code from your spec)
- Lines you wrote: ~100 (the spec only)
- Lines Claude Code wrote: ~510 (and they're all readable, runnable, and pass the tests)
- Time: ~1 hour (~30 min writing the spec, ~10 min on the generation prompt, ~20 min reviewing + fixing)
- Abstractions used: spec format + /generate-from-spec + Claude Code + the same SDK runtime as Iter 2
Debugging in Iteration 3: Spec Comparison + Tests + Evals
When generated code has bugs, you do NOT debug it line by line. You debug at the spec level. The code is regenerable; the spec is not.
Most spec-driven bugs are spec/code drift. Ask Claude Code:
> Read agent-spec.md and compare it to the generated agent.py and tools.py.
> Report any deviations where the code does not match the spec.
Claude Code reads both and tells you exactly what diverged. The spec is the truth — if the code deviates, the code is wrong.
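A cheap mechanical pre-check can complement that prompt. The sketch below pulls the tool identifiers out of the spec text and reports any whose function name never appears in the code. It assumes the `mcp__<server>__<tool>` naming used in spec section 3; the helper names are hypothetical, and a substring match only shows that a name is present, not that the implementation is correct.

```python
import re

# Matches identifiers such as mcp__ucc__search_filings (server, then tool).
TOOL_ID_RE = re.compile(r"mcp__(\w+?)__(\w+)")


def spec_tool_names(spec_text: str) -> list[str]:
    """Collect the distinct tool names referenced in the spec."""
    return sorted({m.group(2) for m in TOOL_ID_RE.finditer(spec_text)})


def missing_tools(spec_text: str, code_text: str) -> list[str]:
    """List tools named in the spec whose name does not occur in the code."""
    return [name for name in spec_tool_names(spec_text) if name not in code_text]


spec = """
### search_filings — mcp__ucc__search_filings
### get_filing_details — mcp__ucc__get_filing_details
### predict_delinquency — mcp__ucc__predict_delinquency
"""
code = "async def search_filings(args): ...\nasync def predict_delinquency(args): ...\n"
print(missing_tools(spec, code))
```

For the real project you would pass the contents of spec/agent-spec.md and of the generated source files instead of the inline strings.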
The spec includes test definitions. When tests fail:
> Test test_finds_all_acme_filings failed: expected 9 filings, got 5.
> Read agent-spec.md section 3 (Tools) for the search_filings spec.
> Read the generated search_filings function. Find why it misses 4 filings
> (likely: case-sensitive match or skipping DBA variant). Fix it.
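The two suspected causes in that prompt are easy to reproduce in isolation. The sketch below contrasts a naive case-sensitive substring match with a normalized match. Everything in it is illustrative: the abbreviation table, the function names, and the sample debtor names are assumptions, apart from the DBA example string, which appears in the knowledge check.

```python
import re

# Hypothetical abbreviation table; extend it to match your mock data.
ABBREVIATIONS = {
    "CORPORATION": "CORP",
    "INCORPORATED": "INC",
    "COMPANY": "CO",
    "LIMITED": "LTD",
}


def normalize(name: str) -> str:
    """Uppercase, unify D/B/A, drop punctuation, and collapse abbreviations."""
    upper = name.upper().replace("D/B/A", "DBA")
    tokens = re.sub(r"[^A-Z0-9 ]", " ", upper).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)


def name_matches(query: str, debtor_name: str) -> bool:
    """True when the normalized query appears as whole tokens in the name."""
    return f" {normalize(query)} " in f" {normalize(debtor_name)} "


# Hypothetical debtor names in the style of the mock data.
NAMES = [
    "ACME CORPORATION",
    "Acme Corp.",
    "acme corp",
    "ACME CORP DBA ROADRUNNER SUPPLIES",
    "Pinnacle Holdings LLC",
]

query = "ACME CORPORATION"
naive = [n for n in NAMES if query in n]  # case-sensitive substring match
robust = [n for n in NAMES if name_matches(query, n)]
print(len(naive), len(robust))
```

Matching on whole normalized tokens also lets a search for the trade name alone find the DBA record, which is the behavior the eval scenarios check for.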
Run /eval-agent. Output: scenario 6 scored 2/5 — "agent did not try DBA variations." This is a behavior bug, not a code bug.
- Open agent-spec.md section 4 (System Prompt)
- Make it more explicit: "Always try at least 3 name variations including any known DBAs"
- In Claude Code: "I updated the system prompt. Regenerate only agent.py."
- /eval-agent again → scenario 6 now scores 5/5
Because the generated code uses the same Agent SDK and hooks as Iter 2, all the same Console Web UI and Langfuse debugging from Iter 2 still work.
| Iteration | Primary debug method | Secondary | Speed to fix |
|---|---|---|---|
| 1 Raw | print() in the loop | Manual message inspection | Slow (find the line) |
| 2 SDK | Hooks + Console Web UI | Langfuse traces | Medium (modular probes) |
| 3 Spec | Spec vs code comparison | Tests + evals + Console | Fast (Claude Code finds it) |
The Comparison Table
Same UCC agent, same business question. Here is everything that changed:
| Metric | Iter 1: Raw API | Iter 2: Agent SDK | Iter 3: Spec-Driven |
|---|---|---|---|
| Lines YOU wrote | ~250 | ~120 | ~100 (spec only) |
| Time to build | ~3 hours | ~2 hours | ~1 hour |
| Agent output | Baseline | Same | Same |
| Guardrails | Inline code in loop | Hooks (modular) | Hooks (generated) |
| Multi-turn | Manual history dict | SDK sessions | SDK sessions (generated) |
| Adding a new tool | Edit 3 files | One Claude Code prompt | Update spec, ask Claude Code to regen |
| Tests | Manual | Claude Code generated | Spec generates them |
| Documentation | Separate (or none) | CLAUDE.md | Spec IS the docs |
| Control over internals | Full | SDK-managed | Least direct (but reviewable) |
| Understanding needed | Every line | SDK abstractions | Architecture-level |
| Debugging | print() in loop | Hooks + Console Web UI + Langfuse | Spec comparison + tests + evals |
| Onboarding a new dev | Read 7 files | Read CLAUDE.md + 8 files | Read 1 spec file |
All three produce the same UCC risk agent. The difference is what each iteration teaches you. Iteration 1 teaches WHAT an agent is. Iteration 2 teaches HOW to build efficiently. Iteration 3 teaches HOW production teams work. You need all three. Skip Iter 1 and you cannot debug Iter 3. Skip Iter 3 and you build two to three times slower (compare the build times in the table above). Skip Iter 2 and you cannot understand the trade-off curve between them.
Grading Rubric
This capstone is the graduation project. The rubric weights each iteration equally and the comparison artifact heavily. You self-grade or peer-grade against the criteria below.
- Iteration 2 (Agent SDK): claude-agent-sdk with @tool-decorated async functions registered via create_sdk_mcp_server. Hooks split between .claude/settings.json (file-based redact/audit) and HookMatcher in ClaudeAgentOptions.hooks (in-process validate). Sessions resumed via SDK resume token. At least 3 slash commands. Same FastAPI + Docker deployment. CLAUDE.md present and accurate. Code count ~120 lines.
- Iteration 3 (spec-driven): agent-spec.md. Single Claude Code prompt generates all ~18 files. Tests pass on first or second iteration. At least one targeted spec edit + regen demonstrated. Same FastAPI + Docker deployment runs.
- Passing score: 720/1000 to pass (matches the Claude Certified Architect — Foundations exam). All three iterations must run; the comparison table must use your actual numbers; the reflection must show genuine judgment.
Reflection Prompts
Answer these in a REFLECTION.md file at the root of your capstone project. Aim for 200–400 words total. Be honest — the reflection is graded on judgment, not on agreeing with the course.
- Which iteration was the hardest for YOU specifically, and why? Was it because of unfamiliar tooling, conceptual complexity, or something subtler (like decision fatigue when picking what to put in the spec)?
- Which iteration would you default to in a real production project? If your answer is "Iter 3 always," you are probably wrong — describe a scenario where you would NOT pick it.
- Give one concrete situation where each iteration is the right choice. Iter 1 = ?, Iter 2 = ?, Iter 3 = ?
- What surprised you about the comparison table? Did the time savings match expectations? Was the line-count difference larger or smaller than you predicted?
- What would you change about the spec format if you were writing it from scratch?
Knowledge Check
Q1: All three iterations produce the same agent output. What is the most defensible reason to still go through Iteration 1 rather than skipping straight to Iteration 3?
Q2: In Iteration 2, your agent calls search_filings("acme") and gets 0 results when 9 are expected. What is the FASTEST debugging path?
Q3: In Iteration 3, an eval scenario fails because the agent did not try DBA name variations like "ACME CORP DBA ROADRUNNER SUPPLIES". The CORRECT fix is:
Q4: The most common Iteration 1 bug is — according to the debug exercise — ?
Q5: You need to add SSN/phone redaction to all three iterations. Which iteration requires the LEAST disruptive change?
Q6: When would you NOT pick Iteration 3 (spec-driven) for a real production project?
Q7: The mock data has 9 Acme filings including one DBA variant. If your agent finds only 8 (skipping the DBA), the bug is most likely:
Going Further (Optional)
- Build all three domains. After completing C, write specs (Iteration 3 only) for Domain A and Domain B. You'll see how the spec format generalizes.
- Add HITL to Iter 3. Extend the spec with a "Human Review" section that triggers when prediction is MEDIUM. Watch Claude Code wire it into the API and the slash commands.
- Cloud deployment. Take the Iter 3 generated code and ship it to GCP Cloud Run or AWS Lambda (M22B Tier 2/3). Only the spec's deployment section needs to change; the rest of the spec stays as it is.
- Spec linting. Write a Claude Code slash command /lint-spec that checks the spec for missing sections, undefined tools referenced in hooks, or untestable claims.
- Multi-agent spec. Pick CAPSTONE-4's 4-agent system. Write a SINGLE spec covering all 4 agents and have Claude Code generate the whole pipeline.
- Cross-iteration eval. Run the same 10-question eval against all three iterations. Are the agent outputs literally identical? Where do they differ, and why?
This is the final capstone. You have now built agents seven different ways. You have the foundation, the patterns, AND the judgment to pick the right one. Time to ship.