Capstone 7-B — Agent Evolution: Order Exception
Build the SAME B2B order exception agent THREE times — first by hand with the raw API, then with the Agent SDK and Claude Code, then from a spec document. Same five operations tools, same mock orders, same business question. Three radically different developer experiences.
Project Brief
Imagine you decide to build the same house three times. The first time, you fell every tree, mill every plank, and hammer every nail by hand. You learn exactly where the load-bearing walls go, why the joists are spaced 16 inches apart, and what happens when a sill plate is undersized. The build takes six months, but you understand every joint in the structure.
The second time, you order pre-cut lumber and pre-fab trusses and use a nail gun. You finish in six weeks. You build about four times as fast and the structure is just as strong, but only because you already know what a structure should look like.
The third time, you hand a contractor a set of architectural drawings. Two weeks later, the house is done — built by people you never met to specifications you wrote in plain English. The previous two builds taught you what to draw and what to leave to the crew.
This capstone is those three builds, compressed into one week. You will write the same B2B order-exception agent three ways. By the end, you will know in your hands — not just in theory — why each layer of abstraction exists, what it costs, and when to reach for each one.
You build a B2B Order Exception Agent. It takes a flagged purchase order — PO-2024-5678 — and walks through five tool calls: pulls order details from the ERP, tracks the shipment with the carrier, checks contract pricing against the invoice, queries available inventory, and drafts a customer notification. The output is a structured exception report (type + root cause + proposed resolution) plus a customer-tier-appropriate email draft.
You build it three times:
- Iteration 1: Raw API loop — ~250 lines, ~3 hours, hand-coded everything (M15B way)
- Iteration 2: Agent SDK + Claude Code — ~120 lines, ~2 hours (M25–M26 way)
- Iteration 3: Spec-driven — ~100 lines of spec, ~1 hour (production way)
B2B notifications have to match customer-tier tone (terse for enterprise, warmer for SMB), reference contract clauses where relevant, and never overpromise resolution timelines. In Iteration 1 you encode tone rules in the system prompt and check the output by hand. In Iteration 2 you add an output guardrail hook that runs a tone classifier or pattern check. In Iteration 3 you specify the rules and Claude Code generates both the prompt rules AND the output hook. The pattern is identical to HIPAA in Domain A or PII redaction in Domain C — the spec encodes compliance, the iterations only differ in where that compliance lives.
Production teams do not pick "raw API vs SDK vs spec" in the abstract — they pick based on what the team already understands and what the problem requires. If you skip Iteration 1, you cannot debug Iteration 3 when the generated code does something weird. If you skip Iteration 3, you are 10x slower than the teams shipping agents in 2026. The point of building the same thing three times is that the differences teach you the trade-offs.
The Three-Iteration Concept
Each iteration produces a working agent that solves the SAME order exception with the SAME five tools and SAME mock data. The agent's output is identical across all three. What changes is everything around the agent: the lines you wrote, the time you spent, the abstractions you used, and the way you debug when something breaks.
"Iteration 3 is just better — why bother with the others?" Because Iteration 3's generated code is not magic. When it produces a track_shipment that returns the wrong field name and the agent then mis-reads "delivered" as "in transit," you have to read the generated tools.py, find the bug, and fix it — either in the code or in the spec. Without Iteration 1's familiarity with what tool calls and message shapes look like, you cannot tell what the generated code is doing wrong.
The Scenario — Order Exception Agent
The agent investigates problematic purchase orders: identifies the exception type (delayed shipment, partial delivery, pricing discrepancy, quality hold), gathers data from ERP, WMS, and carrier systems, determines root cause, and proposes resolution with a customer communication draft.
"PO-2024-5678 from Acme Corp has a flagged exception. Investigate, determine the root cause, and draft a customer notification with a resolution proposal."
Tools (5)
- get_order_details(po_number) — line items, status, dates, customer tier
- track_shipment(tracking_number) — carrier status & location
- check_contract_pricing(customer_id, sku) — contract vs invoiced price
- query_inventory(sku, warehouse) — stock availability
- draft_notification(customer_id, exception_type, resolution) — email draft with tier-aware tone
Mock Data Shape
- 10 purchase orders with 3 exception types
- Exceptions: delayed shipment (4), partial delivery (3), pricing discrepancy (3)
- Customers: Acme Corp (Enterprise), Globex Industries (Enterprise), Initech LLC (SMB), Stark Enterprises (Enterprise), Wayne Industries (SMB)
- Files: orders.json, tracking.json, contracts.json, inventory.json, customers.json
Switch to Domain A (Healthcare Pre-Auth) or Domain C (UCC Risk Analyzer — default). The lab structure is identical; only the tools and data change.
Animation 1: Three-Lane Evolution
Watch three lanes count down lines of code while the same six capabilities populate underneath each. The agent's capabilities never change — what shrinks is the code you write to express them.
Animation 2: Code Size Waterfall
Iteration 1 is 250 lines you wrote. Iteration 2 is 120 lines you wrote. Iteration 3 is 100 lines of spec plus ~300 lines of code Claude Code generated — shown stacked. The total system size grows; the lines on your keyboard fall off a cliff.
Animation 3: Time Comparison
Iteration 1 is the longest because you build everything — loop, validation, tone-checking, audit, sessions, deployment. Iteration 2 cuts the loop and guardrails. Iteration 3 cuts almost everything except the thinking.
Total: 6 hours across 3 sessions for the same agent, three different builds.
Animation 4: Architecture Per Iteration
Each iteration has the same logical architecture but the physical architecture differs. Click the tabs to compare.
Animation 5: Spec-to-Code Flow
Watch the 12-section spec on the left get read line-by-line. Claude Code translates each section into generated code on the right. Files appear as their corresponding spec section is consumed.
Prerequisites
- M05 — Tool Use: Tool definitions, tool_use blocks, the message loop
- M12 — ReAct Agents: Multi-step reasoning across the 5-tool diagnostic flow
- M15B — Build Complete Agent: The whole Iter-1 mental model
- M16–M17 — Guardrails & HITL: Hooks, especially tone enforcement
- M19 — Tracing: Optional but useful for Iter 2 debugging
- M21, M22B — Deployment: FastAPI + Docker + Tier 1
- M25, M26 — Claude Code & Hooks: CLAUDE.md, slash commands, hooks API
Iteration 1 is a re-implementation of the M15B reference agent for the order-exception scenario. Do M15B first if you have not built an agent from raw API calls. Iteration 2 leans on the claude-agent-sdk patterns taught in M26 (Hooks & Sessions & Agent SDK) — reach for it if @tool / HookMatcher / ClaudeAgentOptions feels unfamiliar.
- Python 3.10+ with pip
- Claude Code CLI (npm i -g @anthropic-ai/claude-code) — for Iter 2 and 3
- Docker Desktop for the Tier 1 deployment
- ANTHROPIC_API_KEY environment variable
- Optional: a Langfuse account for Iter 2 tracing
Iteration 1: Raw API Loop
Build the agent the M15B way. You write the loop, the validation, the tone enforcement, the audit, the sessions, and the deployment. Every line is yours. Every bug is yours to find.
What & Why: Create the project folder, install anthropic + FastAPI, then build the five mock JSON files. Mock data is what separates a "demo" agent from a "doesn't compile" agent — without realistic ERP, carrier, contract, and inventory data, every tool call returns garbage.
mkdir -p agent-iter1-raw/mock_data && cd agent-iter1-raw
python -m venv venv && source venv/bin/activate # Windows: venv\Scripts\activate
pip install "anthropic>=0.40" "fastapi>=0.110" "uvicorn>=0.27" "pydantic>=2.0"
// mock_data/orders.json (excerpt)
{
  "PO-2024-5678": {
    "po_number": "PO-2024-5678",
    "customer_id": "ACME001",
    "submitted": "2024-04-01",
    "promised_ship": "2024-04-08",
    "status": "PARTIALLY_SHIPPED",
    "tracking_numbers": ["1Z999AA10123456784"],
    "line_items": [
      {"sku": "BRG-4892", "qty_ordered": 100, "qty_shipped": 60, "unit_price": 42.00},
      {"sku": "GSK-1175", "qty_ordered": 50, "qty_shipped": 0, "unit_price": 18.50}
    ],
    "exception_flagged": "PARTIAL_DELIVERY"
  }
}

// mock_data/customers.json (excerpt)
{
  "ACME001": {"name": "Acme Corporation", "tier": "ENTERPRISE",
              "csm_email": "csm-acme@example.com", "sla_hours": 4},
  "INIT003": {"name": "Initech LLC", "tier": "SMB",
              "csm_email": "support@example.com", "sla_hours": 24}
}

// mock_data/tracking.json (excerpt)
{
  "1Z999AA10123456784": {
    "carrier": "UPS",
    "status": "DELIVERED",
    "delivered_at": "2024-04-12T14:30:00Z",
    "history": [
      {"ts": "2024-04-09T10:00:00Z", "event": "PICKED_UP", "loc": "Memphis, TN"},
      {"ts": "2024-04-12T14:30:00Z", "event": "DELIVERED", "loc": "Newark, NJ"}
    ]
  }
}

// mock_data/contracts.json (excerpt — for ACME001 + BRG-4892)
{
  "ACME001_BRG-4892": {
    "customer_id": "ACME001", "sku": "BRG-4892",
    "contract_unit_price": 39.50, "volume_tier_min_qty": 100,
    "effective": "2024-01-01", "expires": "2024-12-31"
  }
}

// mock_data/inventory.json (excerpt)
{
  "GSK-1175": {
    "warehouses": {
      "WH-MEMPHIS": {"qty_available": 0, "qty_reserved": 0},
      "WH-NEWARK": {"qty_available": 200, "qty_reserved": 50}
    }
  }
}
Build all five JSON files with realistic data: 10 POs, 5 customers (3 Enterprise, 2 SMB), tracking for 8 shipments, contracts for 5 SKUs, inventory across 3 warehouses. Cover three exception types: 4 delayed shipments, 3 partial deliveries (incl. PO-2024-5678), 3 pricing discrepancies.
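Before moving on, it can help to confirm that the exception mix in orders.json matches the 4 / 3 / 3 split described above. The sketch below is one way to do that. It assumes every order carries an `exception_flagged` field as in the excerpt; `summarize_orders` and `load_orders` are hypothetical helper names, not part of the lab's reference code.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_orders(orders: dict) -> Counter:
    """Count purchase orders per exception type (orders without the field count as NONE)."""
    return Counter(po.get("exception_flagged", "NONE") for po in orders.values())


def load_orders(path: str = "mock_data/orders.json") -> dict:
    """Load the orders file; return an empty dict if it has not been created yet."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}


if __name__ == "__main__":
    counts = summarize_orders(load_orders())
    for exception_type, n in sorted(counts.items()):
        print(f"{exception_type}: {n}")
```

If the counts differ from the target split, adjust the mock orders now; later steps assume PO-2024-5678 is one of the partial deliveries.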
Run: python -c "import json; print(len(json.load(open('mock_data/orders.json'))), 'POs')"
Expected output: 10 POs. If you see 0 or a JSON parse error, check that all five files are valid JSON.
What & Why: The Anthropic API needs every tool described as a JSON Schema. PO numbers, SKUs, and tracking numbers all have specific formats (PO-YYYY-NNNN, XXX-NNNN, UPS/FedEx 12–22 chars). Get the schema right and Claude passes correct types.
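The formats named above can be written as JSON Schema `pattern` constraints. The sketch below shows what that might look like for all three identifiers. Only the PO pattern appears in the lab's tools.py; the SKU pattern mirrors the XXX-NNNN format, and the tracking-number character set is an assumption. The `matches` helper is hypothetical and only checks a string against a pattern locally.

```python
import re

# Hypothetical property schemas for the three identifier formats.
PO_NUMBER = {"type": "string", "pattern": "^PO-[0-9]{4}-[0-9]{4}$"}
SKU = {"type": "string", "pattern": "^[A-Z]{3}-[0-9]{4}$"}
# Assumed character set: uppercase letters and digits, 12 to 22 characters.
TRACKING_NUMBER = {"type": "string", "pattern": "^[A-Z0-9]{12,22}$"}


def matches(schema: dict, value) -> bool:
    """Return True when value is a string that satisfies the schema's pattern."""
    return isinstance(value, str) and re.search(schema["pattern"], value) is not None


if __name__ == "__main__":
    print(matches(PO_NUMBER, "PO-2024-5678"))
    print(matches(SKU, "BRG-4892"))
    print(matches(TRACKING_NUMBER, "1Z999AA10123456784"))
```

A pattern in the schema tells the model what shape to produce. This sketch does not assume the API rejects non-matching values, so a local check like `matches` is still worth running before a tool executes.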
"""tools.py — order-exception tool schemas + executors."""
import json
from pathlib import Path

DATA = Path("mock_data")
ORDERS = json.loads((DATA / "orders.json").read_text())
TRACKING = json.loads((DATA / "tracking.json").read_text())
CONTRACTS = json.loads((DATA / "contracts.json").read_text())
INVENTORY = json.loads((DATA / "inventory.json").read_text())
CUSTOMERS = json.loads((DATA / "customers.json").read_text())

TOOLS = [
    {
        "name": "get_order_details",
        "description": "Get full PO details: line items, status, dates, tracking numbers, customer.",
        "input_schema": {
            "type": "object",
            "properties": {"po_number": {"type": "string", "pattern": "^PO-[0-9]{4}-[0-9]{4}$"}},
            "required": ["po_number"],
        },
    },
    {
        "name": "track_shipment",
        "description": "Get carrier status for a tracking number (UPS/FedEx/DHL).",
        "input_schema": {
            "type": "object",
            "properties": {"tracking_number": {"type": "string"}},
            "required": ["tracking_number"],
        },
    },
    {
        "name": "check_contract_pricing",
        "description": "Return contract unit price for a customer + SKU. Used to detect discrepancies.",
        "input_schema": {
            "type": "object",
            "properties": {"customer_id": {"type": "string"}, "sku": {"type": "string"}},
            "required": ["customer_id", "sku"],
        },
    },
    {
        "name": "query_inventory",
        "description": "Return stock available for a SKU at a specific warehouse.",
        "input_schema": {
            "type": "object",
            "properties": {"sku": {"type": "string"}, "warehouse": {"type": "string"}},
            "required": ["sku", "warehouse"],
        },
    },
    {
        "name": "draft_notification",
        "description": "Generate an exception-notification email draft, tier-aware tone.",
        "input_schema": {
            "type": "object",
            "properties": {
                "customer_id": {"type": "string"},
                "exception_type": {"type": "string",
                                   "enum": ["DELAYED", "PARTIAL", "PRICING", "QUALITY_HOLD"]},
                "resolution": {"type": "string"},
            },
            "required": ["customer_id", "exception_type", "resolution"],
        },
    },
]


def execute_tool(name, args):
    if name == "get_order_details":
        return ORDERS.get(args["po_number"], {"error": "PO not found"})
    if name == "track_shipment":
        return TRACKING.get(args["tracking_number"], {"error": "tracking not found"})
    if name == "check_contract_pricing":
        key = f"{args['customer_id']}_{args['sku']}"
        return CONTRACTS.get(key, {"error": "no contract on file"})
    if name == "query_inventory":
        sku = INVENTORY.get(args["sku"], {})
        return sku.get("warehouses", {}).get(args["warehouse"],
                                             {"error": "no inventory record"})
    if name == "draft_notification":
        cust = CUSTOMERS.get(args["customer_id"], {})
        tier = cust.get("tier", "SMB")
        # Tier-aware salutation/tone
        salutation = "Team," if tier == "ENTERPRISE" else f"Hi {cust.get('name','team')},"
        sign_off = ("We will follow up before SLA expiry. — Operations"
                    if tier == "ENTERPRISE"
                    else "Let us know how we can help! — Customer Success")
        body = (f"Regarding {args['exception_type']} on your recent order:\n\n"
                f"{args['resolution']}\n\n{sign_off}")
        return {"to": cust.get("csm_email", "ops@example.com"),
                "tier": tier, "salutation": salutation, "body": body}
    raise ValueError(f"Unknown tool: {name}")
Run: python -c "from tools import execute_tool; print(execute_tool('get_order_details', {'po_number': 'PO-2024-5678'})['exception_flagged'])"
Expected: PARTIAL_DELIVERY.
What & Why: The system prompt is critical for the order-exception agent: Claude needs to know to investigate ALL related data (order, tracking, contract, inventory) before drafting a notification, not just react to the first finding.
"""agent.py — the raw API loop. Everything is hand-coded."""
import json, anthropic
from tools import TOOLS, execute_tool

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-6"
SYSTEM = """You are an order-exception specialist. For every flagged PO:
1. get_order_details to understand the situation
2. track_shipment for any tracking numbers (CARRIER VIEW)
3. check_contract_pricing for any line item with potential discrepancy
4. query_inventory for any unshipped or backordered SKUs
5. draft_notification with the right exception_type and a clear resolution
ALWAYS investigate before drafting. Never recommend a credit without checking
contract pricing AND inventory. Match the customer tier's tone (terse for
ENTERPRISE, warmer for SMB). Never overpromise — cite SLA hours, not specifics."""
MAX_TURNS = 12


def _run_messages(messages: list) -> tuple[str, list]:
    """Drive the loop on the given messages list. Returns (final_text, messages).
    Used by run_agent() for single-shot queries and by session.chat() for multi-turn."""
    for turn in range(MAX_TURNS):
        response = client.messages.create(
            model=MODEL, max_tokens=4096, system=SYSTEM,
            tools=TOOLS, messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason == "end_turn":
            # Default to "" so a final turn without a text block does not raise StopIteration.
            text = next((b.text for b in response.content if b.type == "text"), "")
            return text, messages
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    try:
                        result = execute_tool(block.name, block.input)
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": json.dumps(result, default=str),
                        })
                    except Exception as e:
                        # Report the failure to Claude instead of crashing the loop.
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": f"ERROR: {e}",
                            "is_error": True,
                        })
            messages.append({"role": "user", "content": tool_results})
            continue
        raise RuntimeError(f"Unexpected stop_reason: {response.stop_reason}")
    raise RuntimeError(f"Agent exceeded {MAX_TURNS} turns without finishing")


def run_agent(question: str) -> str:
    """Single-shot entry point. Wraps _run_messages with a fresh history."""
    text, _ = _run_messages([{"role": "user", "content": question}])
    return text


if __name__ == "__main__":
    q = "PO-2024-5678 from Acme Corp has a flagged exception. Investigate, " \
        "determine the root cause, and draft a customer notification."
    print(run_agent(q))
// agent.ts — the raw API loop. Order-exception investigation.
import Anthropic from "@anthropic-ai/sdk";
import { TOOLS, executeTool } from "./tools.js";

const client = new Anthropic();
const MODEL = "claude-sonnet-4-6";
const SYSTEM = `You are an order-exception specialist. For every flagged PO:
1. get_order_details to understand the situation
2. track_shipment for any tracking numbers
3. check_contract_pricing for any line item with potential discrepancy
4. query_inventory for any unshipped or backordered SKUs
5. draft_notification with the right exception_type and a clear resolution
ALWAYS investigate before drafting. Never recommend a credit without checking
contract pricing AND inventory. Match the customer tier's tone (terse for
ENTERPRISE, warmer for SMB). Never overpromise — cite SLA hours, not specifics.`;

export async function runAgent(question: string, maxTurns = 12): Promise<string> {
  const messages: Anthropic.MessageParam[] = [{ role: "user", content: question }];
  for (let turn = 0; turn < maxTurns; turn++) {
    const response = await client.messages.create({
      model: MODEL, max_tokens: 4096, system: SYSTEM, tools: TOOLS, messages,
    });
    messages.push({ role: "assistant", content: response.content });
    if (response.stop_reason === "end_turn") {
      const text = response.content.find(b => b.type === "text");
      return text?.type === "text" ? text.text : "";
    }
    if (response.stop_reason === "tool_use") {
      const toolResults: Anthropic.ToolResultBlockParam[] = [];
      for (const block of response.content) {
        if (block.type === "tool_use") {
          try {
            const result = await executeTool(block.name, block.input);
            toolResults.push({ type: "tool_result", tool_use_id: block.id,
                               content: JSON.stringify(result) });
          } catch (e) {
            toolResults.push({ type: "tool_result", tool_use_id: block.id,
                               content: `ERROR: ${(e as Error).message}`, is_error: true });
          }
        }
      }
      messages.push({ role: "user", content: toolResults });
      continue;
    }
    throw new Error(`Unexpected stop_reason: ${response.stop_reason}`);
  }
  throw new Error(`Agent exceeded ${maxTurns} turns without finishing`);
}
Run: python agent.py
Expected output (paraphrased — Claude's exact wording varies):
get_order_details returns the customer_id and the agent looks it up in the customers data.
- Agent skips track_shipment → the order has no tracking number yet (e.g., for partial-shipment cases, only the shipped portion has tracking). The system prompt should say "call track_shipment for each tracking_number returned by get_order_details — skip if none."
- Agent recommends a credit instead of investigating → reinforce the system prompt: "Never recommend a credit without first checking inventory at OTHER warehouses."
- Draft uses "Hi Acme!" for an ENTERPRISE customer → the agent didn't read the customer tier. Add explicit guidance: "Look up the customer's tier in customers.json and match tone: ENTERPRISE = formal/terse, SMB = warm/personal."
- Agent makes up dates or quantities → system prompt: "Cite ONLY data from tool responses. Never invent numbers."
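When the loop misbehaves, it can be easier to reason about the message shapes without spending API calls. The sketch below replays the same loop structure against a scripted stand-in for the model. `fake_model`, `run_loop`, and the canned responses are hypothetical; `SimpleNamespace` objects only imitate the attributes the loop reads (`stop_reason`, `content`, `type`, `id`, `name`, `input`, `text`).

```python
import json
from types import SimpleNamespace


def fake_model():
    """Scripted model turns: one tool call, then a final text answer."""
    yield SimpleNamespace(stop_reason="tool_use", content=[
        SimpleNamespace(type="tool_use", id="toolu_1", name="get_order_details",
                        input={"po_number": "PO-2024-5678"}),
    ])
    yield SimpleNamespace(stop_reason="end_turn", content=[
        SimpleNamespace(type="text", text="Root cause: partial delivery."),
    ])


def run_loop(question, next_response, execute_tool, max_turns=12):
    """Same control flow as the raw API loop, with the model call injected."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        response = next_response(messages)
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason == "end_turn":
            text = next((b.text for b in response.content if b.type == "text"), "")
            return text, messages
        if response.stop_reason != "tool_use":
            raise RuntimeError(f"Unexpected stop_reason: {response.stop_reason}")
        # One tool_result per tool_use block, keyed by the block's id.
        results = [{"type": "tool_result", "tool_use_id": b.id,
                    "content": json.dumps(execute_tool(b.name, b.input))}
                   for b in response.content if b.type == "tool_use"]
        messages.append({"role": "user", "content": results})
    raise RuntimeError("Agent exceeded max turns without finishing")


if __name__ == "__main__":
    script = fake_model()
    text, history = run_loop(
        "Investigate PO-2024-5678",
        lambda msgs: next(script),
        lambda name, args: {"status": "PARTIALLY_SHIPPED"},
    )
    print(text)
    print([m["role"] for m in history])
```

Injecting the model call as a function is a design choice for testing only; the real agent.py keeps calling client.messages.create directly.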
What & Why: A loop that does whatever Claude asks is not safe to ship. Production order agents need at minimum: input validation (PO format, SKU format), output tone enforcement (enterprise drafts should not start with "Hi there!"), a cost cap, and a circuit breaker. The tone check is the B2B-specific compliance layer.
"""guardrails.py — B2B-aware checks for order agents."""
import re

PO_RE = re.compile(r"^PO-[0-9]{4}-[0-9]{4}$")
SKU_RE = re.compile(r"^[A-Z]{3}-[0-9]{4}$")
COST_LIMIT_TOKENS = 50_000
CIRCUIT_FAIL_THRESHOLD = 3

# Tokens that should NOT appear in ENTERPRISE-tier drafts
ENTERPRISE_BANNED_TOKENS = ["hey there", "hi there", "no worries", "super sorry"]
# Tokens that should NOT appear in any draft (overpromise)
OVERPROMISE_TOKENS = ["guarantee", "definitely tomorrow", "100% certain"]


def validate_input(tool_name: str, args: dict):
    if tool_name == "get_order_details":
        if not PO_RE.match(args.get("po_number", "")):
            raise ValueError(f"Invalid PO format: {args.get('po_number')}")
    if tool_name in ("check_contract_pricing", "query_inventory"):
        if not SKU_RE.match(args.get("sku", "")):
            raise ValueError(f"Invalid SKU format: {args.get('sku')}")


def check_draft_tone(draft: dict):
    """Run AFTER draft_notification. Raise if tone mismatch or overpromise."""
    body = draft.get("body", "").lower()
    tier = draft.get("tier", "SMB")
    if tier == "ENTERPRISE":
        for tok in ENTERPRISE_BANNED_TOKENS:
            if tok in body:
                raise ValueError(f"Tone mismatch: '{tok}' in ENTERPRISE draft")
    for tok in OVERPROMISE_TOKENS:
        if tok in body:
            raise ValueError(f"Overpromise detected: '{tok}'")


class CircuitBreaker:
    def __init__(self):
        self.failures = 0

    def record(self, ok):
        # A success resets the streak; consecutive failures accumulate.
        self.failures = 0 if ok else self.failures + 1
        if self.failures >= CIRCUIT_FAIL_THRESHOLD:
            raise RuntimeError("Circuit breaker tripped — aborting")
Now wire the guardrails into _run_messages in agent.py. Replace your existing _run_messages with the version below. The tone check fires on draft_notification results — if it raises, the error flows back to Claude as a tool error, prompting a redraft.
# Add to imports at the top of agent.py:
from guardrails import (validate_input, check_draft_tone,
                        CircuitBreaker, COST_LIMIT_TOKENS)    # NEW

_breaker = CircuitBreaker()                                   # NEW — module-level instance


def _run_messages(messages: list) -> tuple[str, list]:
    total_tokens = 0                                          # NEW — cost cap counter
    for turn in range(MAX_TURNS):
        response = client.messages.create(
            model=MODEL, max_tokens=4096, system=SYSTEM,
            tools=TOOLS, messages=messages,
        )
        total_tokens += (response.usage.input_tokens
                         + response.usage.output_tokens)      # NEW
        if total_tokens > COST_LIMIT_TOKENS:                  # NEW
            raise RuntimeError(f"Cost cap exceeded: {total_tokens} tokens")
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason == "end_turn":
            # Default to "" so a final turn without a text block does not raise StopIteration.
            text = next((b.text for b in response.content if b.type == "text"), "")
            return text, messages
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    try:
                        validate_input(block.name, block.input)   # NEW — PO/SKU check
                        result = execute_tool(block.name, block.input)
                        # B2B-specific: tone-check draft_notification results.
                        if block.name == "draft_notification":    # NEW
                            check_draft_tone(result)              # NEW — raises on bad tone
                        _breaker.record(ok=True)                  # NEW
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": json.dumps(result, default=str),
                        })
                    except Exception as e:
                        _breaker.record(ok=False)                 # NEW — trips on 3rd fail
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": f"ERROR: {e}",
                            "is_error": True,
                        })
            messages.append({"role": "user", "content": tool_results})
            continue
        raise RuntimeError(f"Unexpected stop_reason: {response.stop_reason}")
    raise RuntimeError(f"Agent exceeded {MAX_TURNS} turns without finishing")
Test that the tone check fires on a bad ENTERPRISE draft:
# Temporarily edit your draft_notification mock in tools.py to inject bad tone:
# "Hi there team! No worries about PO-2024-5678 ..."
# Run the agent. The check_draft_tone hook should raise ValueError,
# the error flows back as is_error: true, and Claude redrafts with formal tone.
python agent.py
In Domain A you redact PHI; in Domain B you enforce tone. Same hook pattern, different rules. Both are post-tool-use checks that can either reject the result (hard fail) or rewrite it (soft fail). For ENTERPRISE customers, a casual tone in a delay notification can damage the relationship; for SMB, an overly formal tone can feel cold. The tone check is what lets you scale customer comms without manual review.
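The guardrail in this step is a hard fail: it raises and lets Claude redraft. For comparison, a soft-fail variant rewrites the draft in place. The sketch below is a minimal illustration of that idea, not part of the lab's reference code; `ENTERPRISE_REPLACEMENTS` and `soften_draft` are hypothetical names, and the replacement phrases are placeholders you would tune for your own customers.

```python
import re

# Hypothetical mapping from casual phrases to formal substitutes.
ENTERPRISE_REPLACEMENTS = {
    "hi there": "Hello",
    "no worries": "We understand the concern",
}


def soften_draft(draft: dict) -> dict:
    """Soft-fail tone check: rewrite casual phrases in ENTERPRISE drafts.

    Returns a new dict; drafts for other tiers are returned unchanged.
    """
    if draft.get("tier") != "ENTERPRISE":
        return draft
    body = draft.get("body", "")
    for phrase, replacement in ENTERPRISE_REPLACEMENTS.items():
        # Case-insensitive literal match of the casual phrase.
        body = re.sub(re.escape(phrase), replacement, body, flags=re.IGNORECASE)
    return {**draft, "body": body}


if __name__ == "__main__":
    sample = {"tier": "ENTERPRISE", "body": "Hi there team. No worries about the delay."}
    print(soften_draft(sample)["body"])
```

A rewrite keeps the turn count down, but it also hides the violation from the model, so the same phrasing may keep coming back. Raising, as the lab does, puts the correction in Claude's context.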
- ImportError: cannot import name 'check_draft_tone' from 'guardrails' → guardrails.py isn't in the project folder, or the function name is misspelled.
- Agent never redrafts after tone error → Claude may interpret the error as terminal. Strengthen the system prompt: "If draft_notification returns a tone mismatch or overpromise error, redraft with formal language and try again."
- Cost cap fires immediately → COST_LIMIT_TOKENS = 50_000 is generous; if you mistakenly set it lower, increase it back.
What & Why: SOX and customer-contract audits require an exception-handling trail. Every tool call needs a timestamped record: case_id (PO number), tool, inputs, output summary, token count.
"""audit.py — SOX-compliant audit log for B2B order operations."""
import json, datetime


def append_audit(tool_name, args, result, tokens, po_number):
    # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12.
    now = datetime.datetime.now(datetime.timezone.utc)
    rec = {
        "ts": now.isoformat().replace("+00:00", "Z"),
        "po_number": po_number,
        "tool": tool_name,
        "input": args,
        "output_summary": str(result)[:200],
        "tokens": tokens,
    }
    with open("audit_log.jsonl", "a") as f:
        f.write(json.dumps(rec) + "\n")
Now wire append_audit into _run_messages. Add the import and call right after the successful execute_tool:
# Add to imports at the top of agent.py:
from audit import append_audit                                    # NEW

# Inside _run_messages, after `result = execute_tool(...)` and the tone check:
                        if block.name == "draft_notification":
                            check_draft_tone(result)
                        _breaker.record(ok=True)
                        # Extract PO number from the user's first message for audit context
                        po = next(
                            (w for w in messages[0]["content"].split() if w.startswith("PO-")),
                            "default"
                        )
                        append_audit(                             # NEW
                            tool_name=block.name,
                            args=block.input,
                            result=result,
                            tokens=total_tokens,
                            po_number=po,
                        )
                        tool_results.append({...})
                        # ... rest of try-block unchanged ...
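The word-split lookup above depends on the PO appearing as a clean, space-delimited word in the first message. A regex-based helper is one possible alternative that tolerates trailing punctuation and scans every plain-text user message. This is a sketch under those assumptions; `extract_po` and `PO_PATTERN` are hypothetical names, and the pattern reuses the PO-YYYY-NNNN format from the tool schema.

```python
import re

# Same PO-YYYY-NNNN shape as the schema pattern, but unanchored for searching.
PO_PATTERN = re.compile(r"PO-\d{4}-\d{4}")


def extract_po(messages: list, default: str = "default") -> str:
    """Return the first PO number found in any plain-text user message."""
    for msg in messages:
        content = msg.get("content")
        # Skip tool_result turns, whose content is a list of blocks.
        if msg.get("role") == "user" and isinstance(content, str):
            match = PO_PATTERN.search(content)
            if match:
                return match.group(0)
    return default


if __name__ == "__main__":
    history = [{"role": "user", "content": "Please check PO-2024-5678, it was flagged."}]
    print(extract_po(history))
```

If you adopt something like this, call it once per run rather than once per tool call, since the PO does not change mid-conversation.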
Run: python agent.py
Then inspect the audit file:
cat audit_log.jsonl # macOS/Linux
type audit_log.jsonl # Windows cmd
Expected output (5 lines, one per tool call, all with the same po_number):
The entries in audit_log.jsonl should all be keyed by po_number: "PO-2024-5678". If po_number is "default", the parser didn't find a PO in the user's first message — either the question doesn't include the PO, or the parsing logic doesn't match your prompt format.
What & Why: Ops reviewers ask follow-ups about the same PO: "What if we expedite-ship from Newark instead?" should reuse the prior investigation context. Iter 1 implements this by maintaining a per-PO messages list and reusing the same _run_messages helper from Step 3.
Create a new file session.py in the project folder:
"""session.py — multi-turn PO sessions over the same _run_messages helper."""
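from agent import _run_messages

SESSIONS: dict[str, list] = {}   # po_number -> messages list
WINDOW = 24                      # roughly two 5-tool investigations (~12 messages each)


def _trim(msgs: list) -> list:
    """Keep at most WINDOW messages, starting on a plain-text user turn so that
    no tool_result is left without its matching tool_use."""
    trimmed = msgs[-WINDOW:]
    while trimmed and not (trimmed[0]["role"] == "user"
                           and isinstance(trimmed[0]["content"], str)):
        trimmed = trimmed[1:]
    return trimmed


def chat(po_number: str, user_msg: str) -> str:
    """Append the user's message to the PO session, run the loop, return the answer."""
    msgs = SESSIONS.setdefault(po_number, [])
    msgs.append({"role": "user", "content": user_msg})
    answer, msgs = _run_messages(msgs)
    # Sliding window: keep only the most recent messages.
    SESSIONS[po_number] = _trim(msgs)
    return answer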
Try it — multi-turn PO investigation:
python -c "
from session import chat
print(chat('PO-2024-5678', 'PO-2024-5678 from Acme Corp has a flagged exception. Investigate and draft a notification.'))
print('---')
print(chat('PO-2024-5678', 'What would change if we expedite-ship from Newark instead of waiting for Memphis restock?'))
"
Expected behavior: the second call references the prior investigation findings (Memphis stockout, Newark availability) and proposes the expedite scenario as an alternative. Without session continuity, the agent would have no memory of the GSK-1175 issue.
SESSIONS[po_number] persists across calls.
- ImportError: cannot import name '_run_messages' from 'agent' → you skipped Step 3's refactor of agent.py. Go back and split the loop into _run_messages + run_agent.
- Each call re-investigates → SESSIONS is module-level. If you're calling from separate Python processes (e.g., via subprocess), use a database or Redis instead.
"""server.py: FastAPI wrapper around run_agent and session.chat."""
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from agent import run_agent
from session import chat

app = FastAPI()


class Q(BaseModel):
    question: str


class C(BaseModel):
    po_number: str
    message: str


@app.get("/health")
def health():
    return {"status": "ok"}


@app.post("/exception")
def exception_ep(q: Q):
    try:
        return {"report": run_agent(q.question)}
    except Exception as e:
        raise HTTPException(500, str(e))


@app.post("/chat")
def chat_ep(c: C):
    return {"answer": chat(c.po_number, c.message)}
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
ENV PYTHONUNBUFFERED=1
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]
Also create requirements.txt at the project root:
anthropic>=0.40
fastapi>=0.110
uvicorn>=0.27
pydantic>=2.0
Run locally first:
uvicorn server:app --reload --port 8000
# In another terminal:
curl localhost:8000/health
curl -X POST localhost:8000/exception -H "Content-Type: application/json" \
-d '{"question":"PO-2024-5678 from Acme Corp has a flagged exception. Investigate."}'
Then build and run with Docker:
docker build -t iter1-b .
docker run --rm -p 8000:8000 -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY iter1-b
You should now have 13 files in your project folder:
agent-iter1-raw/
├── agent.py            # _run_messages + run_agent
├── tools.py            # 5 ops tool schemas + execute_tool
├── mock_data/
│   ├── orders.json     # 10 POs across 3 exception types
│   ├── tracking.json   # 8 carrier records
│   ├── contracts.json  # 5 customer-SKU contracts
│   ├── inventory.json  # 6 SKUs × 3 warehouses
│   └── customers.json  # 5 customers (3 ENTERPRISE, 2 SMB)
├── guardrails.py       # validate_input + check_draft_tone + CircuitBreaker
├── audit.py            # SOX-compliant append_audit
├── session.py          # multi-turn PO sessions
├── server.py           # FastAPI handlers
├── Dockerfile
└── requirements.txt
Run the full Iter-1 acceptance test:
# 1. Investigate the canonical PO (Acme, ENTERPRISE)
curl -s -X POST localhost:8000/exception -H "Content-Type: application/json" \
-d '{"question":"PO-2024-5678 from Acme has a flagged exception. Investigate."}' | python -m json.tool
# 2. Multi-turn PO session (expedite-ship what-if)
curl -s -X POST localhost:8000/chat -H "Content-Type: application/json" \
-d '{"po_number":"PO-2024-5678","message":"Investigate the flagged exception on PO-2024-5678."}' | python -m json.tool
curl -s -X POST localhost:8000/chat -H "Content-Type: application/json" \
-d '{"po_number":"PO-2024-5678","message":"What if we expedite-ship from Newark?"}' | python -m json.tool
# 3. Tone enforcement: bad ENTERPRISE draft gets rejected
# (You'd inject "Hi there!" into the draft_notification mock to test this)
# 4. Audit log keyed by PO
docker exec $(docker ps -q --filter ancestor=iter1-b) cat audit_log.jsonl | head -5
# Look for "po_number": "PO-2024-5678" on every line
You pass Iter 1 if: (1) /health returns ok; (2) /exception returns a PARTIAL exception report with formal ENTERPRISE-tier draft; (3) the second /chat call references the prior Memphis stockout findings; (4) audit log entries are all keyed by po_number: "PO-2024-5678".
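Criterion (4) can be checked mechanically rather than by eye. The sketch below reads a JSONL audit log and returns the distinct po_number values it contains; a passing run should yield exactly one. `audit_po_numbers` is a hypothetical helper, and the default path assumes you run it from the folder where audit_log.jsonl is written.

```python
import json
from pathlib import Path


def audit_po_numbers(path: str = "audit_log.jsonl") -> set:
    """Return the distinct po_number values found in a JSONL audit log."""
    lines = Path(path).read_text().splitlines()
    # Records without the key are reported as "missing" so they stand out.
    return {json.loads(line).get("po_number", "missing")
            for line in lines if line.strip()}


if __name__ == "__main__":
    log = Path("audit_log.jsonl")
    if log.exists():
        found = audit_po_numbers(str(log))
        print("PASS" if found == {"PO-2024-5678"} else f"FAIL: {sorted(found)}")
    else:
        print("audit_log.jsonl not found; run the agent first")
```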
- Files: 13 (agent.py, tools.py, mock_data/×5, guardrails.py, audit.py, session.py, server.py, Dockerfile, requirements.txt)
- Lines you wrote: ~250
- Time: ~3 hours
- Abstractions used: none — just the anthropic Messages API and FastAPI
Debugging in Iteration 1: Print Statements + Manual Inspection
When the agent gives wrong output in Iter 1, you debug like you debug any Python program: by reading your own code.
def debug_turn(turn_num, response, messages):
    print(f"\n=== TURN {turn_num} === stop_reason: {response.stop_reason}")
    for block in response.content:
        if block.type == "tool_use":
            print(f"  TOOL: {block.name}({block.input})")
        elif block.type == "text":
            print(f"  TEXT: {block.text[:120]}...")
    print(f"  Tokens: in={response.usage.input_tokens} out={response.usage.output_tokens}")
The most common Iter-1 bug is malformed messages. Add print(json.dumps(messages, default=str, indent=2)) before each messages.create.
- Wrong tool_use_id in tool_result → 400 from API. Match the id on the tool_use block.
- Forgot to append the assistant turn → API complains about message order.
- Domain-specific: agent drafts a notification BEFORE checking inventory or contract pricing — root cause incomplete. The system prompt did not enforce the investigate-then-draft order. Strengthen with: "ALWAYS call get_order_details, track_shipment, check_contract_pricing, AND query_inventory before draft_notification."
- Domain-specific: agent drafts an SMB-tone notification for an ENTERPRISE customer (or vice versa). Either tone-check hook didn't run, or the draft_notification result didn't include the customer tier. Add the tier to the response payload and to the audit log.
- Domain-specific: agent recommends a credit when a re-shipment would resolve. Usually means the agent stopped after track_shipment showed DELIVERED without checking inventory for unshipped lines. Re-read the system prompt — PARTIAL delivery requires inventory check.
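The two message-shape bugs at the top of this list can be caught before the request is sent. Below is a minimal sketch of such a check, assuming the history is stored as plain dicts in the Messages API shape (role plus a content list of typed blocks); the function name find_message_errors is hypothetical, and it only covers tool_use / tool_result pairing.

```python
def find_message_errors(messages: list[dict]) -> list[str]:
    """Check tool_use / tool_result pairing in a dict-based message history."""
    errors = []
    pending: set[str] = set()  # tool_use ids from the latest assistant turn
    for i, msg in enumerate(messages):
        content = msg.get("content")
        blocks = content if isinstance(content, list) else []
        if msg.get("role") == "assistant":
            pending = {b["id"] for b in blocks if b.get("type") == "tool_use"}
        elif msg.get("role") == "user":
            results = [b for b in blocks if b.get("type") == "tool_result"]
            for b in results:
                if b.get("tool_use_id") not in pending:
                    errors.append(
                        f"message {i}: tool_result id {b.get('tool_use_id')!r} "
                        "matches no tool_use in the previous assistant turn")
            answered = {b.get("tool_use_id") for b in results}
            for missing in sorted(pending - answered):
                errors.append(f"message {i}: no tool_result for tool_use {missing!r}")
            pending = set()
    return errors

history = [
    {"role": "user", "content": "Investigate PO-2024-5678"},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "toolu_1", "name": "get_order_details",
         "input": {"po_number": "PO-2024-5678"}}]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "toolu_1", "content": "{}"}]},
]
print(find_message_errors(history))  # a well-formed history yields no errors
```

Calling it right before each messages.create turns a 400 from the API into a readable local message. If the assistant turn was never appended, the tool_result shows up as matching no tool_use.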
Add a hard tone violation: change execute_tool for ENTERPRISE customers so the body starts with "Hey there!". Re-run the agent. The check_draft_tone guardrail should raise. Read the error message and trace where the tone got injected. Restore and re-run. This builds the muscle memory you need to debug Iter 3 generated drafts.
Iteration 2: Agent SDK + Claude Code
Now you let the SDK run the loop, hooks handle tone enforcement and validation, sessions handle multi-turn, and Claude Code does most of the typing. Same agent. Half the lines. Different debugging.
mkdir agent-iter2-sdk && cd agent-iter2-sdk
python -m venv venv && source venv/bin/activate # Windows: venv\Scripts\activate
pip install "claude-agent-sdk>=0.2" "fastapi>=0.110" "uvicorn>=0.27" "pydantic>=2.0"
npm i -g @anthropic-ai/claude-code # if not already installed
claude
> /init
# Agent: Order Exception Investigation Agent (Iteration 2)
## Stack
- Python 3.11+
- `claude-agent-sdk` (the official Agent SDK — NOT a wrapper around `client.messages.create()`)
- FastAPI + Docker for deployment
- Mock data in mock_data/*.json (5 files: orders, tracking, contracts, inventory, customers)
## File Layout
- agent.py — query() entry + 5 @tool-decorated MCP tools + create_sdk_mcp_server
- hooks/ — PreToolUse + PostToolUse hook scripts (tone enforcement on draft_notification)
- .claude/settings.json — hooks registration (matchers + commands)
- sessions.py — multi-PO session resume
- server.py — FastAPI async wrapper
## Compliance Rules (encode in hooks)
- ENTERPRISE drafts MUST NOT contain casual tokens ("hi there", "no worries")
- All drafts MUST NOT contain overpromise tokens ("guarantee", "definitely tomorrow")
- All tool calls MUST be audited with po_number as the case key
## System Prompt Rules
For every flagged PO, follow this order:
1. get_order_details
2. track_shipment for each tracking_number returned
3. check_contract_pricing for any line with potential discrepancy
4. query_inventory for any unshipped/backordered SKU
5. draft_notification (matching customer tier tone)
NEVER draft before completing the investigation. NEVER overpromise.
If you've used client.messages.create() in Iter 1, you might expect the SDK to be a thin Agent class wrapping it. It is not. The SDK is built around MCP tools + an async query() generator + options/hooks via ClaudeAgentOptions. Tools return {"content": [{"type": "text", "text": ...}]} (MCP shape), not bare Python values.
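To make the MCP shape concrete, here is a small sketch of wrapping a Python value and reading it back. The helper names wrap and unwrap and the sample order record are illustrative assumptions, not part of the SDK.

```python
import json

def wrap(payload) -> dict:
    """Wrap a Python value in the MCP text-content shape described above."""
    return {"content": [{"type": "text", "text": json.dumps(payload)}]}

def unwrap(result: dict):
    """Recover the Python value; the text field is a JSON string, not a dict."""
    return json.loads(result["content"][0]["text"])

# Made-up record, for illustration only.
order = {"po_number": "PO-2024-5678", "status": "PARTIAL"}
wrapped = wrap(order)
print(type(wrapped["content"][0]["text"]).__name__)  # the payload travels as text
print(unwrap(wrapped) == order)
```

Anything that inspects a tool result later (a hook, a test) has to go through the same json.loads step.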
> Create agent.py using `claude-agent-sdk`. Define five order-ops tools as
> @tool-decorated async functions returning MCP-shaped {"content":[...]}:
> get_order_details, track_shipment, check_contract_pricing, query_inventory,
> draft_notification (tier-aware tone). Wire them into a create_sdk_mcp_server,
> build ClaudeAgentOptions with the system prompt from CLAUDE.md, and expose
> async run(question) driving query() and concatenating AssistantMessage text.
"""agent.py — claude-agent-sdk version. ~85 lines incl. 5 tools."""
import json
from pathlib import Path
from claude_agent_sdk import (
query, tool, create_sdk_mcp_server,
ClaudeAgentOptions, AssistantMessage,
)
DATA = Path("mock_data")
ORDERS = json.loads((DATA / "orders.json").read_text())
TRACKING = json.loads((DATA / "tracking.json").read_text())
CONTRACTS = json.loads((DATA / "contracts.json").read_text())
INVENTORY = json.loads((DATA / "inventory.json").read_text())
CUSTOMERS = json.loads((DATA / "customers.json").read_text())
def _ok(payload): return {"content": [{"type": "text", "text": json.dumps(payload)}]}
@tool("get_order_details", "Full PO details: line items, status, tracking, customer.",
      {"po_number": str})
async def get_order_details(args):
    return _ok(ORDERS.get(args["po_number"], {"error": "PO not found"}))

@tool("track_shipment", "Carrier status for a tracking number.",
      {"tracking_number": str})
async def track_shipment(args):
    return _ok(TRACKING.get(args["tracking_number"], {"error": "tracking not found"}))

@tool("check_contract_pricing", "Contract unit price for customer + SKU.",
      {"customer_id": str, "sku": str})
async def check_contract_pricing(args):
    return _ok(CONTRACTS.get(f"{args['customer_id']}_{args['sku']}",
                             {"error": "no contract"}))

@tool("query_inventory", "Stock available for a SKU at a warehouse.",
      {"sku": str, "warehouse": str})
async def query_inventory(args):
    s = INVENTORY.get(args["sku"], {})
    return _ok(s.get("warehouses", {}).get(args["warehouse"],
                                           {"error": "no inventory"}))

@tool("draft_notification",
      "Generate exception-notification email with tier-aware tone.",
      {"customer_id": str, "exception_type": str, "resolution": str})
async def draft_notification(args):
    cust = CUSTOMERS.get(args["customer_id"], {})
    tier = cust.get("tier", "SMB")
    salutation = "Team," if tier == "ENTERPRISE" else f"Hi {cust.get('name','team')},"
    sign_off = ("— Operations" if tier == "ENTERPRISE"
                else "— Customer Success")
    body = (f"Regarding {args['exception_type']}: {args['resolution']}\n\n{sign_off}")
    return _ok({"to": cust.get("csm_email"), "tier": tier,
                "salutation": salutation, "body": body})

ops_server = create_sdk_mcp_server(
    name="ops_tools", version="1.0.0",
    tools=[get_order_details, track_shipment, check_contract_pricing,
           query_inventory, draft_notification],
)

OPTIONS = ClaudeAgentOptions(
    model="claude-sonnet-4-6",
    system_prompt=("You are an order-exception specialist. For every flagged PO: "
                   "1. get_order_details, 2. track_shipment, 3. check_contract_pricing, "
                   "4. query_inventory, 5. draft_notification (tier-aware). "
                   "NEVER draft before completing investigation. NEVER overpromise."),
    mcp_servers={"ops": ops_server},
    allowed_tools=[f"mcp__ops__{n}" for n in (
        "get_order_details", "track_shipment", "check_contract_pricing",
        "query_inventory", "draft_notification")],
    max_turns=12,
)

async def run(question: str) -> str:
    parts = []
    async for msg in query(prompt=question, options=OPTIONS):
        if isinstance(msg, AssistantMessage):
            for block in msg.content:
                if getattr(block, "text", None):
                    parts.append(block.text)
    return "\n".join(parts)
// agent.ts — @anthropic-ai/claude-agent-sdk version
import { query, tool, createSdkMcpServer } from "@anthropic-ai/claude-agent-sdk";
import { z } from "zod";
import * as fs from "fs";
const ORDERS = JSON.parse(fs.readFileSync("mock_data/orders.json", "utf8"));
const TRACKING = JSON.parse(fs.readFileSync("mock_data/tracking.json", "utf8"));
const CONTRACTS = JSON.parse(fs.readFileSync("mock_data/contracts.json", "utf8"));
const INVENTORY = JSON.parse(fs.readFileSync("mock_data/inventory.json", "utf8"));
const CUSTOMERS = JSON.parse(fs.readFileSync("mock_data/customers.json", "utf8"));
const ok = (p: unknown) => ({ content: [{ type: "text" as const, text: JSON.stringify(p) }] });
const getOrderDetails = tool(
"get_order_details", "Full PO details.",
{ po_number: z.string() },
async (args) => ok(ORDERS[args.po_number] ?? { error: "PO not found" })
);
const draftNotification = tool(
"draft_notification", "Tier-aware exception email draft.",
{ customer_id: z.string(), exception_type: z.string(), resolution: z.string() },
async (args) => {
const cust = CUSTOMERS[args.customer_id] ?? {};
const tier = cust.tier ?? "SMB";
const salutation = tier === "ENTERPRISE" ? "Team," : `Hi ${cust.name ?? "team"},`;
const signOff = tier === "ENTERPRISE" ? "— Operations" : "— Customer Success";
const body = `Regarding ${args.exception_type}: ${args.resolution}\n\n${signOff}`;
return ok({ to: cust.csm_email, tier, salutation, body });
}
);
// (track_shipment, check_contract_pricing, query_inventory defined the same way)
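const opsServer = createSdkMcpServer({
  name: "ops_tools",
  // List all five tools; trackShipment, checkContractPricing and queryInventory
  // are the three defined the same way as the two shown above.
  tools: [getOrderDetails, trackShipment, checkContractPricing, queryInventory, draftNotification],
});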
const OPTIONS = {
model: "claude-sonnet-4-6",
systemPrompt: "You are an order-exception specialist. For every flagged PO: " +
"1. get_order_details, 2. track_shipment, 3. check_contract_pricing, " +
"4. query_inventory, 5. draft_notification (tier-aware). " +
"NEVER draft before completing investigation. NEVER overpromise.",
mcpServers: { ops: opsServer },
allowedTools: [
"mcp__ops__get_order_details", "mcp__ops__track_shipment",
"mcp__ops__check_contract_pricing", "mcp__ops__query_inventory",
"mcp__ops__draft_notification",
],
maxTurns: 12,
};
export async function run(question: string): Promise<string> {
  const parts: string[] = [];
  for await (const msg of query({ prompt: question, options: OPTIONS })) {
    if (msg.type === "assistant") {
      // Content blocks live on the wrapped API message, not on the SDK message itself.
      for (const block of msg.message.content) {
        if (block.type === "text") parts.push(block.text);
      }
    }
  }
  return parts.join("\n");
}
You no longer write the message loop, the stop_reason check, the tool_result append, or JSON schema dicts. Same five-tool investigation flow, ~85 lines instead of ~140.
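To see what the schema shorthand saves, the sketch below expands a {"name": type} mapping into the kind of JSON Schema dict you wrote by hand in Iter 1. It illustrates the idea only and is not the SDK's internal code; expand_schema and _JSON_TYPES are hypothetical names.

```python
# Illustrative mapping from Python types to JSON Schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}

def expand_schema(shorthand: dict[str, type]) -> dict:
    """Expand {"field": type} shorthand into a JSON Schema object definition."""
    return {
        "type": "object",
        "properties": {name: {"type": _JSON_TYPES[t]} for name, t in shorthand.items()},
        "required": list(shorthand),
    }

print(expand_schema({"customer_id": str, "sku": str}))
```

One line of shorthand per tool replaces a nested dict per tool, which accounts for much of the drop in line count.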
- ModuleNotFoundError: No module named 'claude_agent_sdk' → pip install "claude-agent-sdk>=0.2" in your venv.
- ImportError: cannot import name 'Agent' from 'anthropic' → you're trying the old fictional API. The real SDK is claude_agent_sdk.
- Tool call rejected with "not allowed" → add the tool's mcp__ops__<name> entry to allowed_tools.
What & Why: The SDK supports two hook surfaces: (a) file-based hooks in .claude/settings.json shelling out to scripts (production-friendly), and (b) in-process hooks via HookMatcher in ClaudeAgentOptions(hooks={...}). We use both: file-based for audit + tone enforcement on draft_notification, in-process for input validation that needs to deny.
{
  "hooks": {
    "PreToolUse": [
      { "matcher": "*",
        "hooks": [{ "type": "command", "command": "python hooks/log_call.py" }] }
    ],
    "PostToolUse": [
      { "matcher": "mcp__ops__draft_notification",
        "hooks": [{ "type": "command", "command": "python hooks/tone_check.py" }] },
      { "matcher": "*",
        "hooks": [{ "type": "command", "command": "python hooks/audit.py" }] }
    ]
  }
}
"""hooks/tone_check.py — reject ENTERPRISE drafts with casual tokens
or any draft containing overpromise tokens. Pass the draft through if OK."""
import sys, json
ENTERPRISE_BANNED = ["hi there", "hey there", "no worries", "super sorry"]
OVERPROMISE = ["guarantee", "definitely tomorrow", "100% certain"]
payload = json.load(sys.stdin)
result = payload.get("tool_result") or {}
# tool_result is wrapped {"content":[{"type":"text","text": json}]}
try:
    inner = json.loads(result["content"][0]["text"])
except Exception:
    json.dump(payload, sys.stdout); sys.exit(0)
body = (inner.get("body") or "").lower()
tier = inner.get("tier", "SMB")
violations = []
if tier == "ENTERPRISE":
    violations += [t for t in ENTERPRISE_BANNED if t in body]
violations += [t for t in OVERPROMISE if t in body]
if violations:
    # Rewrite the tool_result so the agent sees an error and re-drafts.
    err = {"error": "tone_violation", "violations": violations,
           "tier": tier, "advice": "Re-draft with formal/non-overpromising tone."}
    payload["tool_result"] = {"content": [{"type": "text",
                                           "text": json.dumps(err)}]}
json.dump(payload, sys.stdout)
"""Add to agent.py: in-process input validation via HookMatcher."""
import re
from claude_agent_sdk import HookMatcher
PO_RE = re.compile(r"^PO-[0-9]{4}-[0-9]{4}$")
SKU_RE = re.compile(r"^[A-Z]{3}-[0-9]{4}$")
async def validate_input(input_data, tool_use_id, context):
    name = input_data.get("tool_name", "")
    args = input_data.get("tool_input", {}) or {}
    fail = None
    if name.endswith("get_order_details") and not PO_RE.match(args.get("po_number", "")):
        fail = f"Invalid PO format: {args.get('po_number')!r}"
    elif name.endswith(("check_contract_pricing", "query_inventory")) \
            and not SKU_RE.match(args.get("sku", "")):
        fail = f"Invalid SKU format: {args.get('sku')!r}"
    if fail:
        return {"hookSpecificOutput": {"hookEventName": "PreToolUse",
                                       "permissionDecision": "deny",
                                       "permissionDecisionReason": fail}}
    return {}

# Then update OPTIONS in agent.py:
OPTIONS = ClaudeAgentOptions(
    # ... existing fields ...
    # Matchers are regex-style patterns, so use ".*" (not a bare "*") to cover every ops tool.
    hooks={"PreToolUse": [HookMatcher(matcher="mcp__ops__.*",
                                      hooks=[validate_input])]},
)
Create the supporting log_call.py and audit.py hooks — same stdin/stdout pattern:
"""hooks/log_call.py — PreToolUse: print to stderr, pass through stdout."""
import sys, json, datetime
payload = json.load(sys.stdin)
ts = datetime.datetime.utcnow().isoformat() + "Z"
print(f"[{ts}] PRE {payload.get('tool_name','?')}({payload.get('tool_input',{})})",
file=sys.stderr)
json.dump(payload, sys.stdout) # pass-through
"""hooks/audit.py — PostToolUse: append a SOX-compliant record to audit_log.jsonl."""
import sys, json, datetime
payload = json.load(sys.stdin)
# Extract PO number from tool_input if present, else 'default'.
ti = payload.get("tool_input", {}) or {}
po = ti.get("po_number") or "default"
rec = {
"ts": datetime.datetime.utcnow().isoformat() + "Z",
"po_number": po,
"tool": payload.get("tool_name"),
"input": ti,
"output_summary": str(payload.get("tool_result"))[:200],
}
with open("audit_log.jsonl", "a") as f:
    f.write(json.dumps(rec, default=str) + "\n")
json.dump(payload, sys.stdout) # pass-through
Smoke-test each hook standalone:
# tone_check should reject ENTERPRISE drafts with banned tokens
echo '{"tool_name":"mcp__ops__draft_notification","tool_result":{"content":[{"type":"text","text":"{\"body\":\"Hi there team!\",\"tier\":\"ENTERPRISE\"}"}]}}' \
| python hooks/tone_check.py
# Should rewrite tool_result to an error payload prompting redraft.
# audit should append a record keyed by po_number
echo '{"tool_name":"get_order_details","tool_input":{"po_number":"PO-2024-5678"},"tool_result":"ok"}' \
| python hooks/audit.py
cat audit_log.jsonl | tail -1
Then run the agent end-to-end:
python -c "import asyncio, agent; print(asyncio.run(agent.run('Investigate PO-2024-5678 from Acme Corp')))"
Expected: (1) [timestamp] PRE log lines on stderr, (2) the agent's exception report on stdout with a formal ENTERPRISE-tier draft, (3) audit_log.jsonl with entries keyed by PO. If you trigger a tone violation (inject "Hi there!" in your draft_notification mock), the tone_check hook rewrites the tool_result as an error and the agent redrafts.
- Tone hook crashes on JSON parse → the SDK wraps tool results in {"content":[{"type":"text","text":...}]}. The text field is a JSON STRING that needs json.loads to inspect.
- Agent ignores the rewritten error and uses the bad draft anyway → check that you assigned the new payload back: payload["tool_result"] = {"content": [...error payload...]} before json.dump.
- audit_log.jsonl shows po_number "default" → only get_order_details has a po_number in tool_input. Other tool calls (draft_notification, query_inventory) need po_number extracted from elsewhere; safest to extract it from the user's first message in agent.py instead of inside the hook.
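For the po_number "default" case, extracting the PO from the user's first message can be sketched as below. The names PO_PATTERN and extract_po are hypothetical; the pattern mirrors the PO-NNNN-NNNN format used throughout this capstone.

```python
import re

# Same PO shape the input-validation hook enforces, but searched inside free text.
PO_PATTERN = re.compile(r"\bPO-\d{4}-\d{4}\b")

def extract_po(question: str, default: str = "default") -> str:
    """Return the first PO number mentioned in the question, else the default key."""
    match = PO_PATTERN.search(question)
    return match.group(0) if match else default

print(extract_po("Investigate PO-2024-5678 from Acme Corp"))
print(extract_po("Why is my order late?"))
```

How the extracted value reaches hooks/audit.py is left open here. One option is an environment variable set in agent.py before query() runs, assuming hook subprocesses inherit the parent environment, which you should verify in your setup.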
"""sessions.py: multi-PO via SDK resume tokens."""
from dataclasses import replace
from claude_agent_sdk import query, AssistantMessage
from agent import OPTIONS

SESSIONS: dict[str, str] = {}  # po_number -> resume token

async def _drive(prompt, options):
    parts, sid = [], None
    async for msg in query(prompt=prompt, options=options):
        if isinstance(msg, AssistantMessage):
            for block in msg.content:
                if getattr(block, "text", None): parts.append(block.text)
        # session_id is carried by non-assistant messages (e.g. the final result),
        # so check every message, not only AssistantMessage.
        s = getattr(msg, "session_id", None)
        if s: sid = s
    return "\n".join(parts), sid

async def chat(po_number: str, msg: str) -> str:
    resume = SESSIONS.get(po_number)
    options = replace(OPTIONS, resume=resume) if resume else OPTIONS
    text, sid = await _drive(msg, options)
    if sid: SESSIONS[po_number] = sid
    return text

async def what_if(po_number: str, hypothetical: str) -> str:
    """Fork: 'what if we expedite-ship from Newark instead?' Do NOT save the new sid."""
    resume = SESSIONS.get(po_number)
    # fork_session=True branches into a new session id; a plain resume would append
    # the hypothetical to the original session even if the sid is never saved.
    options = replace(OPTIONS, resume=resume, fork_session=True) if resume else OPTIONS
    text, _ = await _drive(hypothetical, options)
    return text
Try it — multi-PO + fork demo:
python -c "
import asyncio
from sessions import chat, what_if
async def main():
print('T1:', await chat('PO-2024-5678', 'PO-2024-5678 from Acme has a flagged exception. Investigate.'))
print('FORK:', await what_if('PO-2024-5678', 'What if we expedite-ship from Newark instead of waiting for Memphis?'))
print('T2:', await chat('PO-2024-5678', 'Stick with the original plan. Draft the customer notification.'))
asyncio.run(main())
"
If T2 answers as though the Newark expedite had already been chosen, what_if is leaking into SESSIONS.
What & Why: /run-exception PO-2024-5678, /test-agent, /eval-agent turn the agent into a one-line workflow. Ops reviewers can investigate exceptions without leaving Claude Code.
Create 3 files in .claude/commands/:
---
description: Investigate an order exception by PO number
argument-hint: [po_number]
---
Look up $ARGUMENTS in mock_data/orders.json. Build the question string.
Run `python -c "import asyncio, agent; print(asyncio.run(agent.run(q)))"`
where q is the question. Print the exception report, draft, total tokens,
total cost from the ResultMessage emitted by query().
---
description: Run the unit test suite for the order-exception agent
---
Run `pytest tests/ -v`. Critical tests: test_partial_delivery_resolution
(PO-2024-5678 must identify Memphis stockout + Newark availability),
test_tone_check_rejects_bad_enterprise_draft (tone hook works),
test_audit_keyed_by_po (every audit line has the right po_number).
---
description: Run the 10-PO evaluation suite
---
Read test_scenarios.json (10 POs: 4 DELAYED, 3 PARTIAL, 3 PRICING). For each,
call agent.run(), score on: correct exception_type, root cause cited, tier-
appropriate tone, no overpromise. Report per-case score and overall percentage.
Type / in Claude Code and the three commands appear in autocomplete. /run-exception PO-2024-5678 investigates and drafts the formal ENTERPRISE notification. /eval-agent requires a test_scenarios.json file; in Iter 3 the spec generates this for you.
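The rubric that /eval-agent applies (four checks per PO, then an overall percentage) can be sketched in a few lines. This assumes each case has already been judged on the four criteria; CaseResult, CRITERIA, and summarize are hypothetical names, and the check values in the example are invented.

```python
from dataclasses import dataclass

CRITERIA = ("exception_type", "root_cause", "tone", "no_overpromise")

@dataclass
class CaseResult:
    po_number: str
    checks: dict[str, bool]  # one boolean per criterion

    @property
    def score(self) -> int:
        """Number of criteria this case passed."""
        return sum(bool(self.checks.get(c)) for c in CRITERIA)

def summarize(results: list[CaseResult]) -> dict:
    """Per-case 'n/4' strings plus the overall percentage across all cases."""
    total = sum(r.score for r in results)
    possible = len(results) * len(CRITERIA)
    return {
        "per_case": {r.po_number: f"{r.score}/{len(CRITERIA)}" for r in results},
        "overall_pct": round(100 * total / possible, 1) if possible else 0.0,
    }

# Invented judgments for two POs, for illustration only.
results = [
    CaseResult("PO-2024-5678", {c: True for c in CRITERIA}),
    CaseResult("PO-2024-5683", {"exception_type": True, "root_cause": True,
                                "tone": False, "no_overpromise": False}),
]
print(summarize(results))
```

Deciding whether each criterion passed is the hard part and is left to the eval command; this sketch covers only the arithmetic.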
What & Why: Same FastAPI + Docker pattern as Iter 1, but Claude Code writes it.
> Create server.py and Dockerfile. Endpoints: GET /health,
> POST /exception (single-shot), POST /chat (po_number + message).
> Async FastAPI handlers awaiting agent.run() and sessions.chat() (both
> async coroutines from claude-agent-sdk). python:3.11-slim base, install
> claude-agent-sdk + dependencies, expose 8000. Mount .claude/ into the
> container so settings.json + hook scripts resolve at runtime.
"""server.py — async FastAPI wrapper around the SDK order-exception agent."""
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from agent import run as agent_run
from sessions import chat as session_chat
app = FastAPI(title="Order Exception Agent (Iter 2 — SDK)")
class Q(BaseModel): question: str
class C(BaseModel): po_number: str; message: str
@app.get("/health")
def health(): return {"status": "ok", "iter": 2}
@app.post("/exception")
async def exception_ep(q: Q):
    try: return {"report": await agent_run(q.question)}
    except Exception as e: raise HTTPException(500, str(e))

@app.post("/chat")
async def chat_ep(c: C):
    try: return {"answer": await session_chat(c.po_number, c.message)}
    except Exception as e: raise HTTPException(500, str(e))
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
ENV PYTHONUNBUFFERED=1
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]
Run locally first, then Docker:
uvicorn server:app --reload --port 8000
# In another terminal:
curl localhost:8000/health
curl -X POST localhost:8000/exception -H "Content-Type: application/json" \
-d '{"question":"Investigate PO-2024-5678"}'
# Then build the container:
docker build -t iter2-b .
docker run --rm -p 8000:8000 -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY iter2-b
You should now have ~18 files in your project:
agent-iter2-sdk/
│── CLAUDE.md
│── agent.py # query() + 5 @tool functions + create_sdk_mcp_server
│── sessions.py
│── mock_data/ ×5 # orders, tracking, contracts, inventory, customers
│── .claude/
│   │── settings.json
│   │── commands/run-exception.md, test-agent.md, eval-agent.md
│── hooks/
│   │── log_call.py, tone_check.py, audit.py
│── server.py | Dockerfile | requirements.txt
Acceptance test — same shape as Iter 1:
curl -s -X POST localhost:8000/exception -H "Content-Type: application/json" \
-d '{"question":"Investigate PO-2024-5678"}' | python -m json.tool
curl -s -X POST localhost:8000/chat -H "Content-Type: application/json" \
-d '{"po_number":"PO-2024-5678","message":"Investigate."}' | python -m json.tool
curl -s -X POST localhost:8000/chat -H "Content-Type: application/json" \
-d '{"po_number":"PO-2024-5678","message":"What if we expedite from Newark?"}' | python -m json.tool
# Tone enforcement: check audit log shows ENTERPRISE drafts only
docker exec $(docker ps -q --filter ancestor=iter2-b) cat audit_log.jsonl | grep ENTERPRISE
You pass Iter 2 if: all 3 outputs are functionally equivalent to Iter 1 but with ~120 lines of code instead of ~250.
- Files: ~13 (CLAUDE.md, agent.py, sessions.py, server.py, Dockerfile, requirements.txt, .claude/settings.json, hooks/{log_call,tone_check,audit}.py, 3 slash commands) + 5 mock JSON
- Lines you wrote: ~120
- Time: ~2 hours
- Abstractions used:
claude-agent-sdk(query / @tool / MCP server / ClaudeAgentOptions / HookMatcher) +.claude/settings.json+ Claude Code
Debugging in Iteration 2: Hooks + Console Web UI + Langfuse
The SDK abstracts the loop — you cannot drop a print inside it. You debug from the outside.
Swap .claude/settings.json to a debug variant for diagnostic runs (the script writes to stderr and passes the original payload through stdout):
"""hooks/debug.py — pretty-print to stderr, pass payload through."""
import sys, json
payload = json.load(sys.stdin)
print(f"[HOOK {payload.get('hook_event_name','?')}] "
f"{payload.get('tool_name','?')}: "
f"{json.dumps(payload.get('tool_input', payload.get('tool_result')))[:200]}",
file=sys.stderr)
json.dump(payload, sys.stdout)
{
  "hooks": {
    "PreToolUse": [{ "matcher": "*",
      "hooks": [{ "type": "command", "command": "python hooks/debug.py" }] }],
    "PostToolUse": [{ "matcher": "*",
      "hooks": [{ "type": "command", "command": "python hooks/debug.py" }] }]
  }
}
This beats Iter-1 print statements because hooks are modular: symlink .claude/settings.json to the debug variant for one run, then back to production.
Console Web UI: console.anthropic.com shows every API call. Worked example: the agent recommends a credit even though inventory had stock. Console → Logs → failed call → tool_use shows the agent never called query_inventory. Fix: strengthen the system prompt to require ALL 4 investigation tools before draft_notification. Re-run; ~3 min vs ~15 in Iter 1.
Langfuse: each PO becomes a waterfall trace with 5 tool calls, each a span. Compare a 2-second simple delay to a 12-second pricing-discrepancy investigation and see exactly which tool took the time (usually check_contract_pricing when contracts have many tier rules).
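Per-tool timing of this kind can be prototyped without a tracing backend. The sketch below uses plain Python, not the Langfuse API: a decorator records the wall-clock duration of each async tool call. The names timed and SPANS are hypothetical, and the sleep and price are stand-in values.

```python
import asyncio
import functools
import time

SPANS: list[tuple[str, float]] = []  # (tool name, seconds elapsed)

def timed(fn):
    """Record how long each call to an async function takes."""
    @functools.wraps(fn)
    async def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return await fn(*args, **kwargs)
        finally:
            SPANS.append((fn.__name__, time.perf_counter() - start))
    return wrapper

@timed
async def check_contract_pricing(args):
    await asyncio.sleep(0.01)  # stand-in for a slow contract lookup
    return {"contract_unit_price": 12.5}  # made-up value

asyncio.run(check_contract_pricing({"customer_id": "ACME", "sku": "ABC-1234"}))
slowest = max(SPANS, key=lambda span: span[1])
print(slowest[0])
```

Whether this composes cleanly underneath the SDK's @tool decorator is untested here; treat it as a way to reason about spans, not a drop-in integration.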
Verbose mode turns on debug hooks for one run. Shows every tool call, hook fire, token count, total cost.
Iteration 3: Spec-Driven
You stop writing code. You write a spec describing what the order-exception agent should do, then ask Claude Code to build it. ~24 files appear. The spec IS the documentation ops leadership can read.
# Agent Specification: Order Exception Investigation Agent
## 1. Overview
Investigates flagged POs, identifies exception type, gathers evidence from
ERP/carrier/contract/inventory systems, drafts tier-appropriate customer
notification with proposed resolution.
## 2. Configuration
- Model: claude-sonnet-4-6
- Framework: claude-agent-sdk (Python). Tools registered via
create_sdk_mcp_server. Driver: query() with ClaudeAgentOptions.
- Max turns: 12, Max tokens: 4096
## 3. Tools (registered as MCP server "ops", call in this order)
### mcp__ops__get_order_details
- @tool with schema {po_number: str (regex ^PO-[0-9]{4}-[0-9]{4}$)}
- Returns line items, status, tracking_numbers, customer_id, exception_flagged
### mcp__ops__track_shipment
- {tracking_number: str}
- Returns {carrier, status, delivered_at?, history[]}
### mcp__ops__check_contract_pricing
- {customer_id: str, sku: str (regex ^[A-Z]{3}-[0-9]{4}$)}
- Returns {contract_unit_price, volume_tier_min_qty, effective, expires}
### mcp__ops__query_inventory
- {sku: str, warehouse: str}
- Returns {qty_available, qty_reserved}
### mcp__ops__draft_notification
- {customer_id: str, exception_type (DELAYED|PARTIAL|PRICING|QUALITY_HOLD),
resolution: str}
- Returns {to, tier, salutation, body}
- Tone rules: ENTERPRISE = formal/terse; SMB = warm/personal.
## 4. System Prompt (passed to ClaudeAgentOptions.system_prompt)
For every flagged PO follow:
1. get_order_details
2. track_shipment for each tracking_number returned
3. check_contract_pricing for any line with potential discrepancy
4. query_inventory for any unshipped line
5. draft_notification (tier-aware tone)
NEVER draft before completing relevant investigation. NEVER overpromise.
Match tier from get_order_details → customer_id → tier.
## 5. Hooks
In-process via ClaudeAgentOptions.hooks (HookMatcher):
- PreToolUse matcher "mcp__ops__get_order_details": deny if PO format invalid
- PreToolUse matcher "mcp__ops__check_contract_pricing|mcp__ops__query_inventory":
deny if sku format invalid
File-based via .claude/settings.json:
- PreToolUse matcher "*" command "python hooks/log_call.py"
- PostToolUse matcher "mcp__ops__draft_notification" command "python hooks/tone_check.py"
- PostToolUse matcher "*" command "python hooks/audit.py"
ENTERPRISE banned tokens: "hi there", "hey there", "no worries", "super sorry"
Overpromise tokens (any tier): "guarantee", "definitely tomorrow", "100% certain"
tone_check.py rewrites tool_result to a structured error so the agent re-drafts.
## 6. Sessions
Multi-PO via SDK resume tokens (sessions.py keeps po_number -> session_id map).
what_if() forks by re-using a resume token without persisting the new session_id.
## 7. Mock Data
- mock_data/orders.json: 10 POs with 3 exception types
- mock_data/tracking.json: 8 tracking records (UPS/FedEx/DHL)
- mock_data/contracts.json: 5 customer-SKU contracts with volume tiers
- mock_data/inventory.json: 6 SKUs across 3 warehouses
- mock_data/customers.json: 5 customers (3 ENTERPRISE, 2 SMB)
## 8. API Wrapper
FastAPI async handlers: GET /health, POST /exception, POST /chat.
## 9. Deployment
Tier 1: Docker. Mount .claude/ into container. Audit log on mounted volume.
## 10. Tests
test_tools.py, test_agent.py (PO-2024-5678 PARTIAL, cites Newark warehouse),
test_hooks.py (tone_check rewrites bad ENTERPRISE drafts to errors),
test_sessions.py (resume), test_api.py.
## 11. Evaluation
spec/test_scenarios.json with 10 POs: 4 DELAYED, 3 PARTIAL, 3 PRICING.
Score on: correct exception_type, root cause cited, tier-appropriate tone,
no overpromise.
## 12. File Structure
order-exception-agent/
│── spec/agent-spec.md # this file
│── spec/test_scenarios.json
│── CLAUDE.md
│── .claude/settings.json # PreToolUse/PostToolUse matchers
│── .claude/commands/ {run-exception.md, test-agent.md, eval-agent.md}
│── agent.py # query() + 5 @tool + create_sdk_mcp_server
│── sessions.py # resume-token sessions
│── mock_data/ {orders, tracking, contracts, inventory, customers}.json
│── hooks/{log_call,tone_check,audit}.py
│── server.py | Dockerfile | docker-compose.yml | requirements.txt
│── tests/ x5
│── appendix/manual-loop.py # Iter-1 reference, not for production
> /generate-from-spec spec/agent-spec.md
# (or, if /generate-from-spec is not installed:)
> Read spec/agent-spec.md and build the entire project. Create every file
> in section 12. Use the `claude-agent-sdk` package (NOT a fictional
> anthropic.Agent class). Tools must be @tool-decorated async functions
> registered via create_sdk_mcp_server. Hooks split between
> .claude/settings.json (file-based log/tone/audit) and OPTIONS.hooks
> (HookMatcher in-process validation) per section 5. Generate realistic
> mock B2B data with 10 POs covering all 3 exception types. Also generate
> appendix/manual-loop.py as an under-the-hood reference.
What you should see Claude Code create: every file listed in spec section 12.
Verify it actually works:
pytest tests/ -v
Focus on the B2B-critical tests. If test_tone_check_rejects_bad_enterprise_draft fails, the generated tone_check hook probably checked tone case-sensitively (so "Hi There" passes when "hi there" is banned). Fix spec section 5 to say "case-insensitive substring match" and regenerate.
- Generated tone hook is case-sensitive → spec ambiguity. Section 5 must say: "compare body.lower() against the banned tokens list."
- Generated mock_data/customers.json doesn't have ENTERPRISE/SMB tier field → section 7 of spec should give a sample record per file.
- tone_check rewrites the draft as a 200-line error → section 5 should constrain the error format: "Return error payload as JSON object with keys violations, tier, advice."
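The first and third fixes above can be pinned down with a test, so that a regenerated hook cannot regress silently. A minimal sketch follows, assuming the tone logic is factored into pure functions; tone_violations and tone_error are hypothetical names, and the token lists are copied from spec section 5.

```python
from typing import Optional

ENTERPRISE_BANNED = ["hi there", "hey there", "no worries", "super sorry"]
OVERPROMISE = ["guarantee", "definitely tomorrow", "100% certain"]

def tone_violations(body: str, tier: str) -> list[str]:
    """Case-insensitive substring match against the banned-token lists."""
    text = body.lower()
    found = [t for t in OVERPROMISE if t in text]
    if tier == "ENTERPRISE":
        found = [t for t in ENTERPRISE_BANNED if t in text] + found
    return found

def tone_error(body: str, tier: str) -> Optional[dict]:
    """Return the compact error payload (violations, tier, advice) or None."""
    found = tone_violations(body, tier)
    if not found:
        return None
    return {"violations": found, "tier": tier,
            "advice": "Re-draft with formal/non-overpromising tone."}

def test_tone_check_rejects_bad_enterprise_draft():
    err = tone_error("Hi There team, we GUARANTEE delivery.", "ENTERPRISE")
    assert err is not None
    assert err["violations"] == ["hi there", "guarantee"]
    assert set(err) == {"violations", "tier", "advice"}

def test_smb_greeting_allowed():
    assert tone_error("Hi there, thanks for your patience.", "SMB") is None

test_tone_check_rejects_bad_enterprise_draft()
test_smb_greeting_allowed()
```

The mixed-case draft is the important case: it is exactly the input a case-sensitive hook would let through.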
Example: add a sixth tool get_carrier_alternatives(origin, destination, sku, qty) that suggests faster carriers when the current one is delayed. Edit spec sections 3 and 11, then ask Claude Code: "I added get_carrier_alternatives. Update agent.py, mock_data, add a test, and an eval scenario."
What & Why: Build, run, curl, compare against Iter 1 and Iter 2.
docker compose up --build -d
docker compose ps # confirm "Up"
curl localhost:8000/health
# Expected: {"status":"ok","iter":3}
curl -X POST localhost:8000/exception \
-H "Content-Type: application/json" \
-d '{"question":"Investigate PO-2024-5678"}' | python -m json.tool
Cross-iteration diff — the punchline:
# Same query against all three deployments (assumes :8001/8002/8003)
for port in 8001 8002 8003; do
echo "=== Iter on :$port ==="
curl -s -X POST localhost:$port/exception -H "Content-Type: application/json" \
-d '{"question":"Investigate PO-2024-5678"}' \
| python -c "import json, sys; d = json.load(sys.stdin); print((d.get('report') or d.get('answer'))[:300])"
echo
done
Expected: all three return the PARTIAL exception report with Memphis stockout root cause, Newark availability, and a formal ENTERPRISE-tier draft. Wording varies; facts match.
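"Wording varies; facts match" can be turned into a check by looking for the key facts, not comparing the text itself. This is a rough sketch under the assumption that a case-insensitive substring test is good enough for three keywords; REQUIRED_FACTS, missing_facts, and the sample reports are illustrative.

```python
REQUIRED_FACTS = ("partial", "memphis", "newark")

def missing_facts(report: str, facts=REQUIRED_FACTS) -> list[str]:
    """Return the required facts that the report never mentions."""
    text = report.lower()
    return [fact for fact in facts if fact not in text]

def compare(reports: dict[str, str]) -> dict[str, list[str]]:
    """Map each deployment label to the facts its report omits."""
    return {label: missing_facts(text) for label, text in reports.items()}

# Invented report snippets standing in for the three curl responses.
reports = {
    "iter1": "PARTIAL delivery: Memphis stockout; Newark has stock to re-ship.",
    "iter2": "Exception type PARTIAL. Root cause: stockout at Memphis. Newark can fulfil.",
    "iter3": "We apologize for the inconvenience with your order.",
}
print(compare(reports))
```

An empty list per deployment means the facts line up; a generic apology like the third sample is flagged on all three.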
You have ~24 generated files + your ~100-line spec:
order-exception-agent/
│── spec/agent-spec.md # YOU wrote this
│── spec/test_scenarios.json # generated
│── CLAUDE.md | .claude/settings.json | .claude/commands/ ×3
│── agent.py | sessions.py # generated (SDK)
│── hooks/log_call.py | hooks/tone_check.py | hooks/audit.py
│── mock_data/ ×5 # 10 POs, 8 tracking, 5 contracts, 6 SKUs, 5 customers
│── tests/ ×5 # generated
│── server.py | Dockerfile | docker-compose.yml | requirements.txt
│── appendix/manual-loop.py # under-the-hood reference
Acceptance test — same shape as Iter 1 and Iter 2:
# 1. All 10 generated tests pass
pytest tests/ -v
# 2. End-to-end exception investigation
curl -s -X POST localhost:8000/exception -H "Content-Type: application/json" \
-d '{"question":"Investigate PO-2024-5678"}' | python -m json.tool
# 3. Multi-PO session via SDK resume tokens
curl -s -X POST localhost:8000/chat -H "Content-Type: application/json" \
-d '{"po_number":"PO-2024-5678","message":"Investigate."}' | python -m json.tool
curl -s -X POST localhost:8000/chat -H "Content-Type: application/json" \
-d '{"po_number":"PO-2024-5678","message":"What if we expedite from Newark?"}' | python -m json.tool
# 4. Spec-vs-code drift check (B2B-specific)
claude
> Read spec/agent-spec.md sections 4 and 5. Compare to agent.py and hooks/.
> Check that ENTERPRISE banned tokens are checked case-insensitively.
You pass Iter 3 if: (1) all 10 tests pass; (2) PO-2024-5678 returns the PARTIAL report; (3) the second /chat call references prior context; (4) drift check returns "no drift".
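Part of the drift check in criterion (4) can also be scripted. The sketch below compares the banned tokens quoted in a spec line against the list found in the hook. It assumes tokens appear in double quotes on one spec line, as they do in section 5; quoted_tokens and token_drift are hypothetical helpers, and the hook list is deliberately missing one token to show a finding.

```python
import re

def quoted_tokens(line: str) -> set[str]:
    """Collect every double-quoted token on a spec line, lowercased."""
    return {token.lower() for token in re.findall(r'"([^"]+)"', line)}

def token_drift(spec_line: str, code_tokens: list[str]) -> dict[str, set[str]]:
    """Report tokens present in the spec but not the code, and vice versa."""
    spec = quoted_tokens(spec_line)
    code = {token.lower() for token in code_tokens}
    return {"missing_in_code": spec - code, "extra_in_code": code - spec}

spec_line = 'ENTERPRISE banned tokens: "hi there", "hey there", "no worries", "super sorry"'
hook_tokens = ["hi there", "hey there", "no worries"]  # simulated drift
print(token_drift(spec_line, hook_tokens))
```

Two empty sets correspond to "no drift" for this one rule; the Claude Code prompt above remains the broader check.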
- Files: ~24 (all generated by Claude Code from your spec)
- Lines you wrote: ~100 (the spec only)
- Lines Claude Code wrote: ~545 (readable, runnable, all tests pass)
- Time: ~1 hour
- Abstractions used: spec format + /generate-from-spec + same SDK runtime as Iter 2
Debugging in Iteration 3: Spec Comparison + Tests + Evals
You debug at the spec level. The code is regenerable; the spec is not.
> Read agent-spec.md and compare it to the generated agent.py and hooks/.
> Report any deviations — especially around tone enforcement (section 5).
> Test test_partial_delivery_resolution failed: expected resolution to mention
> Newark warehouse, got generic "we apologize" text. Read agent-spec.md
> section 4 (System Prompt). Read generated agent.py. Find why the agent
> skipped query_inventory and fix it.
Run /eval-agent. Output: PO-2024-5683 scored 2/4 — "draft tone too casual for ENTERPRISE customer." Fix: strengthen spec section 5 with the explicit banned-token list. Regenerate hooks/tone_check.py. Re-eval → 4/4.
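The banned-token rule in spec section 5 can be pictured as a small, case-insensitive matcher. The sketch below is illustrative only. The `BANNED_BY_TIER` mapping and the `find_banned_tokens` signature are assumptions, and the two phrases come from the casual-tone examples used in this capstone. The generated hook may be structured differently.

```python
import re

# Assumed banned phrases per tier. Only "hi there" and "no worries" come from
# the examples in this capstone; the mapping itself is hypothetical.
BANNED_BY_TIER = {
    "ENTERPRISE": ("hi there", "no worries"),
}


def find_banned_tokens(draft: str, tier: str) -> list[str]:
    """Return banned phrases found in the draft, matched case-insensitively."""
    hits = []
    for phrase in BANNED_BY_TIER.get(tier.upper(), ()):
        # \b anchors the phrase on word boundaries so it cannot match
        # inside a longer word.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, draft, flags=re.IGNORECASE):
            hits.append(phrase)
    return hits
```

An empty list means the draft passes. A non-empty list is what a tone hook could report back so the agent re-drafts.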
| Iteration | Primary debug method | Secondary | Speed to fix |
|---|---|---|---|
| 1 Raw | print() in the loop | Manual message inspection | Slow (find the line) |
| 2 SDK | Hooks + Console Web UI | Langfuse traces | Medium (modular probes) |
| 3 Spec | Spec vs code comparison | Tests + evals + Console | Fast (Claude Code finds it) |
The Comparison Table
Same order-exception agent, same business question. Here is everything that changed:
| Metric | Iter 1: Raw API | Iter 2: Agent SDK | Iter 3: Spec-Driven |
|---|---|---|---|
| Lines YOU wrote | ~250 | ~120 | ~100 (spec only) |
| Time to build | ~3 hours | ~2 hours | ~1 hour |
| Agent output | Baseline | Same | Same |
| Tone enforcement | Inline check after draft | One .claude/settings.json entry + a 25-line stdin/stdout script | 5 lines in spec section 5 |
| Multi-PO sessions | Manual history dict | SDK sessions | SDK sessions (generated) |
| Adding alternative-carrier tool | Edit 3 files + add validation | One Claude Code prompt | Update spec, ask to regen |
| Tests (incl. tone) | Manual | Claude Code generated | Spec generates them |
| Documentation for ops leadership | Separate | CLAUDE.md | Spec IS the doc ops can read |
| Control over internals | Full | SDK-managed | Least direct (but reviewable) |
| Understanding needed | Every line | SDK abstractions | Architecture-level |
| Debugging | print() in loop | Hooks + Console + Langfuse | Spec compare + tests + evals |
| Onboarding a new ops engineer | Read 7 files | Read CLAUDE.md + 8 files | Read 1 spec file |
All three produce the same exception report and tier-appropriate draft. The difference is what each iteration teaches you. Iteration 1 teaches WHAT an agent is. Iteration 2 teaches HOW to build efficiently. Iteration 3 teaches HOW production teams work — especially for B2B where the spec doubles as the contract-aware playbook ops can review.
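For reference, the "Manual history dict" entry in the Multi-PO sessions row amounts to something like the following. This is a minimal sketch under assumed names (`histories`, `append_turn`). The role/content message shape follows the Messages API convention, and your Iteration 1 code may differ.

```python
from collections import defaultdict

# Hypothetical in-memory store: one message list per PO number.
histories: dict[str, list[dict]] = defaultdict(list)


def append_turn(po_number: str, role: str, content: str) -> list[dict]:
    """Append one message to the PO's history and return the full history to send."""
    if role not in ("user", "assistant"):
        raise ValueError(f"unexpected role: {role!r}")
    histories[po_number].append({"role": role, "content": content})
    return histories[po_number]


if __name__ == "__main__":
    append_turn("PO-2024-5678", "user", "Investigate.")
    append_turn("PO-2024-5678", "assistant", "PARTIAL: Memphis stockout.")
    messages = append_turn("PO-2024-5678", "user", "What if we expedite from Newark?")
    print(len(messages))  # 3
```

The history lives in process memory and disappears on restart. In Iterations 2 and 3, SDK sessions take over this bookkeeping.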
Grading Rubric
claude-agent-sdk with @tool-decorated async functions registered via create_sdk_mcp_server for all 5 tools. Tone hook (in .claude/settings.json) rewrites bad ENTERPRISE drafts to errors so the agent re-drafts. In-process HookMatcher validates PO/SKU formats. Sessions resumed via SDK resume token (what-if expedite from Newark demoed). At least 3 slash commands. Same FastAPI + Docker. CLAUDE.md present with compliance rules.
agent-spec.md. Single Claude Code prompt generates all ~24 files. Tests pass on first or second iteration (especially test_tone_check). At least one targeted spec edit + regen demonstrated. Same FastAPI + Docker deployment.
720/1000 to pass. All three iterations must run; the comparison table must use your actual numbers; tone enforcement must work in all three iterations on the same set of test POs.
Reflection Prompts
Answer in REFLECTION.md. 200–400 words total.
- Which iteration was hardest for YOU specifically, and why?
- Which iteration would you default to for production B2B order ops? Consider: the need for ops leadership to read and approve playbooks, customer-tier audit requirements, and pace of contract changes.
- Give one concrete situation where each iteration is the right choice.
- Was the tone-enforcement hook harder or easier than you expected? What would you change about the banned-token approach?
- What would you change about the spec format if you were writing it from scratch for B2B ops? Add a "tier-policy" section? An "SLA" section?
Knowledge Check
Q1: All three iterations produce the same exception report. What is the most defensible reason to still go through Iteration 1 rather than skipping straight to Iteration 3?
Q2: Your agent drafts an SMB-tone notification ("Hi there!") for an ENTERPRISE customer. What is the FASTEST debugging path in Iteration 2?
Q3: In Iteration 3, an eval case fails because the agent skipped query_inventory and recommended a credit when stock was actually available in another warehouse. The CORRECT fix is:
Q4: The most common Iteration 1 bug in the order-exception scenario is:
Q5: You need to add an "expedite shipping suggestion" capability to all three iterations. Which iteration requires the LEAST disruptive change?
Q6: When would you NOT pick Iteration 3 (spec-driven) for a real production order-exception system?
Q7: An ENTERPRISE-tier draft contains "no worries, we'll get this sorted." The tone-check hook should:
Going Further (Optional)
- Build all three domains. After completing B, write specs for Domain A and Domain C. The spec format generalizes.
- Add HITL to Iter 3. Extend the spec with a "Human Review" section that triggers when proposed credit > $50K or expedite cost exceeds margin. Ops manager dashboard surfaces the case.
- Cloud deployment. Take the Iter 3 generated code and ship it to GCP Cloud Run with Pub/Sub for order events (M22B Tier 2). Only the spec's deployment section needs to change; the rest of the spec stays as written.
- Carrier integration. Replace the mock track_shipment with real UPS/FedEx APIs. Add retry + circuit breaker for the live carrier endpoint.
- Multi-agent ops pipeline. Extend to CAPSTONE-4's 4-agent pattern: Intake, Investigation, Decision (auto/HITL), Communication. Single spec for all four.
- Tier-policy linting. Write a slash command /lint-tiers that flags drafts where banned tokens slipped past the hook in production logs.
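For the carrier-integration item above, retry plus a circuit breaker might look like the sketch below. Everything here is an assumption for illustration: the `CircuitBreaker` class, its thresholds, and the `fetch` callable standing in for a real carrier client. No actual UPS or FedEx API is shown.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when calls are short-circuited because the breaker is open."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failed calls, retries after a cooldown."""

    def __init__(self, max_failures=3, cooldown_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock          # injectable so tests need not sleep
        self.failures = 0
        self.opened_at = None

    def call(self, fetch, *args, retries=2, backoff_s=0.0):
        """Run fetch(*args) with up to `retries` extra attempts, honoring the breaker state."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("carrier endpoint temporarily disabled")
            self.opened_at = None   # cooldown elapsed: allow one trial call
        last_error = None
        for attempt in range(retries + 1):
            try:
                result = fetch(*args)
            except Exception as exc:  # a real client would catch narrower errors
                last_error = exc
                if attempt < retries:
                    time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
            else:
                self.failures = 0   # any success closes the breaker
                return result
        # All attempts failed: count one failed call and maybe open the breaker.
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()
        raise last_error
```

Wrapping the real track_shipment call as `breaker.call(client_fn, tracking_number)` keeps the tool signature unchanged. A `CircuitOpenError` can then be turned into a "carrier unavailable" tool result so the agent does not keep hitting a failing endpoint.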
This is the final capstone. You have now built agents seven different ways. Time to ship.