Building AI Agents with Claude
Capstone 7 — Graduation Domain B — B2B Order Exception
Capstone 7-B · 6–8 hours · 3 sessions

Capstone 7-B — Agent Evolution: Order Exception

Build the SAME B2B order exception agent THREE times — first by hand with the raw API, then with the Agent SDK and Claude Code, then from a spec document. Same five operations tools, same mock orders, same business question. Three radically different developer experiences.

Project Brief

Three Ways to Build a House

Imagine you decide to build the same house three times. The first time, you fell every tree, mill every plank, and hammer every nail by hand. You learn exactly where the load-bearing walls go, why the joists are spaced 16 inches apart, and what happens when a sill plate is undersized. The build takes six months, but you understand every joint in the structure.

The second time, you order pre-cut lumber and pre-fab trusses, and you use a nail gun. You finish in six weeks, roughly four times as fast, and the structure is just as strong, but only because you already know what a structure should look like.

The third time, you hand a contractor a set of architectural drawings. Two weeks later, the house is done — built by people you never met to specifications you wrote in plain English. The previous two builds taught you what to draw and what to leave to the crew.

This capstone is those three builds, compressed into one week. You will write the same B2B order-exception agent three ways. By the end, you will know in your hands — not just in theory — why each layer of abstraction exists, what it costs, and when to reach for each one.

What You'll Build

You build a B2B Order Exception Agent. It takes a flagged purchase order — PO-2024-5678 — and walks through five tool calls: pulls order details from the ERP, tracks the shipment with the carrier, checks contract pricing against the invoice, queries available inventory, and drafts a customer notification. The output is a structured exception report (type + root cause + proposed resolution) plus a customer-tier-appropriate email draft.
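The report shape described above (type + root cause + proposed resolution, plus an email draft) can be pinned down as a small data structure. The sketch below is one possible layout, not something the course prescribes: the class name `ExceptionReport` and its field names are illustrative assumptions, and the allowed types mirror the enum that `draft_notification` uses later in this lab.

```python
from dataclasses import dataclass, asdict
import json

# Assumed set of exception types; mirrors the draft_notification enum in tools.py.
EXCEPTION_TYPES = {"DELAYED", "PARTIAL", "PRICING", "QUALITY_HOLD"}


@dataclass
class ExceptionReport:
    """Hypothetical container for the agent's final output."""
    po_number: str
    exception_type: str
    root_cause: str
    proposed_resolution: str
    email_draft: str

    def __post_init__(self):
        # Fail early if the agent reports a type outside the known set.
        if self.exception_type not in EXCEPTION_TYPES:
            raise ValueError(f"Unknown exception type: {self.exception_type}")

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


report = ExceptionReport(
    po_number="PO-2024-5678",
    exception_type="PARTIAL",
    root_cause="GSK-1175 out of stock at WH-MEMPHIS at pick time",
    proposed_resolution="Ship remaining units from WH-NEWARK",
    email_draft="Team, ...",
)
print(report.to_json())
```

Keeping the report as a typed object makes it easier to compare the output of the three iterations field by field.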

You build it three times:

  • Iteration 1 Raw API loop — ~250 lines, ~3 hours, hand-coded everything (M15B way)
  • Iteration 2 Agent SDK + Claude Code — ~120 lines, ~2 hours (M25–M26 way)
  • Iteration 3 Spec-driven — ~100 lines of spec, ~1 hour (production way)
SLA & Tone Constraints (the B2B-specific compliance layer)

B2B notifications have to match customer-tier tone (terse for enterprise, warmer for SMB), reference contract clauses where relevant, and never overpromise resolution timelines. In Iteration 1 you encode tone rules in the system prompt and check the output by hand. In Iteration 2 you add an output guardrail hook that runs a tone classifier or pattern check. In Iteration 3 you specify the rules and Claude Code generates both the prompt rules AND the output hook. The pattern is identical to HIPAA in Domain A or PII redaction in Domain C — the spec encodes compliance, the iterations only differ in where that compliance lives.

Why It Matters

Production teams do not pick "raw API vs SDK vs spec" in the abstract — they pick based on what the team already understands and what the problem requires. If you skip Iteration 1, you cannot debug Iteration 3 when the generated code does something weird. If you skip Iteration 3, you are 10x slower than the teams shipping agents in 2026. The point of building the same thing three times is that the differences teach you the trade-offs.

The Three-Iteration Concept

Each iteration produces a working agent that solves the SAME order exception with the SAME five tools and SAME mock data. The agent's output is identical across all three. What changes is everything around the agent: the lines you wrote, the time you spent, the abstractions you used, and the way you debug when something breaks.

An agent is "a system prompt + tools + a loop." Iteration 1 makes you build all three from scratch. Iteration 2 keeps the system prompt and tools the same, but the SDK runs the loop for you. Iteration 3 keeps everything the same, but Claude Code writes the prompt, tools, AND loop from a spec you gave it. The agent is unchanged. You changed.

A Common Misunderstanding

"Iteration 3 is just better — why bother with the others?" Because Iteration 3's generated code is not magic. When it produces a track_shipment that returns the wrong field name and the agent then mis-reads "delivered" as "in transit," you have to read the generated tools.py, find the bug, and fix it — either in the code or in the spec. Without Iteration 1's familiarity with what tool calls and message shapes look like, you cannot tell what the generated code is doing wrong.

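The wrong-field-name failure described above is cheap to catch if each tool result is checked against the fields the agent expects. The following is a minimal sketch under stated assumptions: `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, and the field lists are taken from the mock data shown later in this lab.

```python
# Hypothetical contract: the fields each tool result is expected to carry.
REQUIRED_FIELDS = {
    "track_shipment": {"carrier", "status", "history"},
    "get_order_details": {"po_number", "customer_id", "status", "line_items"},
}


def missing_fields(tool_name: str, result: dict) -> set:
    """Return the expected fields that are absent from a tool result."""
    expected = REQUIRED_FIELDS.get(tool_name, set())
    return expected - set(result)


# A generated tool that renamed "status" to "state" is flagged immediately,
# before the agent has a chance to misread the shipment.
good = {"carrier": "UPS", "status": "DELIVERED", "history": []}
bad = {"carrier": "UPS", "state": "DELIVERED", "history": []}

print(missing_fields("track_shipment", good))  # set()
print(missing_fields("track_shipment", bad))   # {'status'}
```

A check like this can run in a unit test against the generated tools.py, so the bug surfaces without reading an agent transcript.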
The Scenario — Order Exception Agent

The agent investigates problematic purchase orders: identifies the exception type (delayed shipment, partial delivery, pricing discrepancy, quality hold), gathers data from ERP, WMS, and carrier systems, determines root cause, and proposes resolution with a customer communication draft.

Business Question (use this in all three iterations)

"PO-2024-5678 from Acme Corp has a flagged exception. Investigate, determine the root cause, and draft a customer notification with a resolution proposal."

Tools (5)
  • get_order_details(po_number) — line items, status, dates, customer tier
  • track_shipment(tracking_number) — carrier status & location
  • check_contract_pricing(customer_id, sku) — contract vs invoiced price
  • query_inventory(sku, warehouse) — stock availability
  • draft_notification(customer_id, exception_type, resolution) — email draft with tier-aware tone
Mock Data Shape
  • 10 purchase orders with 3 exception types
  • Exceptions: delayed shipment (4), partial delivery (3), pricing discrepancy (3)
  • Customers: Acme Corp (Enterprise), Globex Industries (Enterprise), Initech LLC (SMB), Stark Enterprises (Enterprise), Wayne Industries (SMB)
  • Files: orders.json, tracking.json, contracts.json, inventory.json, customers.json
Want a Different Domain?

Switch to Domain A (Healthcare Pre-Auth) or Domain C (UCC Risk Analyzer — default). The lab structure is identical; only the tools and data change.

Animation 1: Three-Lane Evolution

Watch three lanes count down lines of code while the same six capabilities populate underneath each. The agent's capabilities never change — what shrinks is the code you write to express them.

[Animation: "Code Shrinks — Capabilities Stay". Three counters: Iteration 1: Raw API (lines you wrote), Iteration 2: Agent SDK (lines you wrote), Iteration 3: Spec-Driven (lines of spec).]

Animation 2: Code Size Waterfall

Iteration 1 is 250 lines you wrote. Iteration 2 is 120 lines you wrote. Iteration 3 is 100 lines of spec plus ~300 lines of code Claude Code generated — shown stacked. The total system size grows; the lines on your keyboard fall off a cliff.
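The arithmetic behind the waterfall can be worked through directly. The snippet below uses only the approximate line counts quoted above; the dictionary layout is an illustrative choice.

```python
# Approximate line counts quoted in this section.
iterations = {
    "Iter 1 (raw API)": {"you_wrote": 250, "generated": 0},
    "Iter 2 (Agent SDK)": {"you_wrote": 120, "generated": 0},
    "Iter 3 (spec-driven)": {"you_wrote": 100, "generated": 300},
}

for name, counts in iterations.items():
    total = counts["you_wrote"] + counts["generated"]
    share = counts["you_wrote"] / total
    print(f"{name}: total={total}, hand-written share={share:.0%}")
```

By these numbers, Iteration 3 is the largest system (400 lines) while only a quarter of it comes from your keyboard.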

Lines of Code: You Wrote vs Generated
  • Iter 1: 250 lines hand-written
  • Iter 2: 120 lines hand-written
  • Iter 3: 100 lines of spec (you wrote) + 300 lines generated by Claude Code

Animation 3: Time Comparison

Iteration 1 is the longest because you build everything — loop, validation, tone-checking, audit, sessions, deployment. Iteration 2 cuts the loop and guardrails. Iteration 3 cuts almost everything except the thinking.

Wall-Clock Hours per Iteration
  • Iter 1: 3 hr
  • Iter 2: 2 hr
  • Iter 3: 1 hr

Total: 6 hours across 3 sessions for the same agent, three different builds.

Animation 4: Architecture Per Iteration

Each iteration has the same logical architecture, but the physical architecture differs. The three diagrams below show Iterations 1, 2, and 3 in order.

System Architecture — Three Versions
YOU OWN EVERYTHING IN BLUE
+----------------------------------------------------------+
| agent.py (~250 lines, all hand-written)                  |
|                                                          |
| while True:                         <-- loop YOU wrote   |
|   response = client.messages.create(...)                 |
|   if response.stop_reason == "end_turn": break           |
|   for block in response.content:                         |
|     if block.type == "tool_use":                         |
|       validate_input(block.input)   <-- YOU wrote        |
|       log_with_timestamp(block)     <-- YOU wrote        |
|       result = execute_tool(...)    <-- YOU wrote        |
|       check_tone_if_email(result)   <-- YOU wrote        |
|       append_messages(...)          <-- YOU wrote        |
|       write_audit_log(...)          <-- YOU wrote        |
+----------------------------------------------------------+
      |                  |                    |
      v                  v                    v
  tools.py         circuit_breaker      audit_log.jsonl
  (5 ops tools     (YOU built)          (YOU rotate)
   YOU wrote)
      |
      v
+----------------------------------------------------------+
| server.py (FastAPI)         <-- YOU wrote everything     |
| Dockerfile                  <-- YOU wrote                |
+----------------------------------------------------------+

YOU OWN PINK; SDK OWNS GRAY
+----------------------------------------------------------+
| agent.py (~85 lines)                                     |
|                                                          |
| @tool("get_order_details", ..., {...})                   |
| async def get_order_details(args): ...                   |
| (5 @tool functions total)                << YOU wrote    |
|                                                          |
| ops_server = create_sdk_mcp_server(                      |
|     name="ops_tools", tools=[...])                       |
|                                                          |
| OPTIONS = ClaudeAgentOptions(                            |
|     system_prompt="Order ops analyst...",                |
|     mcp_servers={"ops": ops_server},                     |
|     allowed_tools=["mcp__ops__..."],                     |
|     hooks={"PreToolUse": [HookMatcher(validate_input)]}, |
| )                                                        |
|                                                          |
| async for msg in query(prompt=..., options=OPTIONS): ... |
|                                                          |
| -- LOOP, MESSAGE-PASSING, RETRIES, STREAMING ----        |
| -- LIVE INSIDE claude-agent-sdk. YOU DO NOT WRITE THEM. -|
+----------------------------------------------------------+
          |                          |
          v                          v
+--------------------+    +--------------------+
| Claude Code        |    | claude-agent-sdk   |
| generated:         |    | managed:           |
| - server.py        |    | - query() loop     |
| - Dockerfile       |    | - MCP transport    |
| - tests            |    | - HookMatcher      |
| .claude/           |    | - resume tokens    |
|   settings.json    |    | - streaming        |
|   + hooks/*.py     |    +--------------------+
|   (incl. tone)     |
+--------------------+

YOU OWN ONLY THE GREEN BOX
+----------------------------------------------------------+
| spec/agent-spec.md (~100 lines)          << YOU wrote    |
| ----------------------------------                       |
| # Sections                                               |
| 1. Overview           8. API Wrapper                     |
| 2. Configuration      9. Deployment                      |
| 3. Tools (5 ops)      10. Tests                          |
| 4. System Prompt      11. Evaluation (10 POs)            |
| 5. Hooks (tone)       12. File Structure                 |
| 6. Sessions                                              |
| 7. Mock Data                                             |
+----------------------------------------------------------+
                             |
                             v
+----------------------------------------------------------+
| /generate-from-spec spec/agent-spec.md                   |
| Claude Code reads spec, generates ~18 files:             |
| agent.py (claude-agent-sdk) | sessions.py                |
| mock_data/*.json (5 files) | server.py | Dockerfile      |
| .claude/settings.json | hooks/*.py | .claude/commands/   |
| tests/test_*.py x5 | spec/agent-spec.md (you)            |
| appendix/manual-loop.py (Iter-1 reference)               |
+----------------------------------------------------------+

Animation 5: Spec-to-Code Flow

Watch the 12-section spec on the left get read line-by-line. Claude Code translates each section into generated code on the right. Files appear as their corresponding spec section is consumed.

[Animation: agent-spec.md → Claude Code (read → plan → write) → 18 Files]

Prerequisites

Required Modules
  • M05 — Tool Use: Tool definitions, tool_use blocks, the message loop
  • M12 — ReAct Agents: Multi-step reasoning across the 5-tool diagnostic flow
  • M15B — Build Complete Agent: The whole Iter-1 mental model
  • M16–M17 — Guardrails & HITL: Hooks, especially tone enforcement
  • M19 — Tracing: Optional but useful for Iter 2 debugging
  • M21, M22B — Deployment: FastAPI + Docker + Tier 1
  • M25, M26 — Claude Code & Hooks: CLAUDE.md, slash commands, hooks API
If You Have Not Done M15B / M26

Iteration 1 is a re-implementation of the M15B reference agent for the order-exception scenario. Do M15B first if you have not built an agent from raw API calls. Iteration 2 leans on the claude-agent-sdk patterns taught in M26 (Hooks & Sessions & Agent SDK) — reach for it if @tool / HookMatcher / ClaudeAgentOptions feels unfamiliar.

Tools You'll Need Installed
  • Python 3.10+ with pip
  • Claude Code CLI (npm i -g @anthropic-ai/claude-code) — for Iter 2 and 3
  • Docker Desktop for the Tier 1 deployment
  • ANTHROPIC_API_KEY environment variable
  • Optional: a Langfuse account for Iter 2 tracing
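Before starting Session 1, a short preflight script can confirm the items in the list above. This is a sketch under assumptions: it presumes the Claude Code CLI and Docker are exposed on PATH as `claude` and `docker`, and the function name `preflight` is hypothetical.

```python
import os
import shutil
import sys


def preflight() -> dict:
    """Report which prerequisites from the list above look satisfied."""
    return {
        "python_3_10_plus": sys.version_info >= (3, 10),
        # Assumes the CLIs are installed under these executable names.
        "claude_cli_on_path": shutil.which("claude") is not None,
        "docker_on_path": shutil.which("docker") is not None,
        "anthropic_api_key_set": bool(os.environ.get("ANTHROPIC_API_KEY")),
    }


if __name__ == "__main__":
    for check, ok in preflight().items():
        print(f"{'OK     ' if ok else 'MISSING'}  {check}")
```

The script only checks that the tools are discoverable; it does not verify versions or that the API key is valid.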
SESSION 1

Iteration 1: Raw API Loop

Build the agent the M15B way. You write the loop, the validation, the tone enforcement, the audit, the sessions, and the deployment. Every line is yours. Every bug is yours to find.

~3 hours · ~250 lines · 7 files · 0 abstractions
Step 1: Setup & Mock Data
15 min · mock_data/*.json

What & Why: Create the project folder, install anthropic + FastAPI, then build the five mock JSON files. Mock data is what separates a working demo from a broken one: without realistic ERP, carrier, contract, and inventory data, every tool call returns garbage.

mkdir -p agent-iter1-raw/mock_data && cd agent-iter1-raw
python -m venv venv && source venv/bin/activate    # Windows: venv\Scripts\activate
pip install "anthropic>=0.40" "fastapi>=0.110" "uvicorn>=0.27" "pydantic>=2.0"
// mock_data/orders.json (excerpt)
{
  "PO-2024-5678": {
    "po_number": "PO-2024-5678",
    "customer_id": "ACME001",
    "submitted": "2024-04-01",
    "promised_ship": "2024-04-08",
    "status": "PARTIALLY_SHIPPED",
    "tracking_numbers": ["1Z999AA10123456784"],
    "line_items": [
      {"sku": "BRG-4892", "qty_ordered": 100, "qty_shipped": 60, "unit_price": 42.00},
      {"sku": "GSK-1175", "qty_ordered": 50,  "qty_shipped": 0,  "unit_price": 18.50}
    ],
    "exception_flagged": "PARTIAL_DELIVERY"
  }
}

// mock_data/customers.json (excerpt)
{
  "ACME001": {"name": "Acme Corporation", "tier": "ENTERPRISE",
              "csm_email": "csm-acme@example.com", "sla_hours": 4},
  "INIT003": {"name": "Initech LLC",       "tier": "SMB",
              "csm_email": "support@example.com", "sla_hours": 24}
}

// mock_data/tracking.json (excerpt)
{
  "1Z999AA10123456784": {
    "carrier": "UPS",
    "status": "DELIVERED",
    "delivered_at": "2024-04-12T14:30:00Z",
    "history": [
      {"ts": "2024-04-09T10:00:00Z", "event": "PICKED_UP", "loc": "Memphis, TN"},
      {"ts": "2024-04-12T14:30:00Z", "event": "DELIVERED", "loc": "Newark, NJ"}
    ]
  }
}

// mock_data/contracts.json (excerpt — for ACME001 + BRG-4892)
{
  "ACME001_BRG-4892": {
    "customer_id": "ACME001", "sku": "BRG-4892",
    "contract_unit_price": 39.50, "volume_tier_min_qty": 100,
    "effective": "2024-01-01", "expires": "2024-12-31"
  }
}

// mock_data/inventory.json (excerpt)
{
  "GSK-1175": {
    "warehouses": {
      "WH-MEMPHIS": {"qty_available": 0,  "qty_reserved": 0},
      "WH-NEWARK":  {"qty_available": 200, "qty_reserved": 50}
    }
  }
}

Build all five JSON files with realistic data: 10 POs, 5 customers (3 Enterprise, 2 SMB), tracking for 8 shipments, contracts for 5 SKUs, inventory across 3 warehouses. Cover three exception types: 4 delayed shipments, 3 partial deliveries (incl. PO-2024-5678), 3 pricing discrepancies.

Run: python -c "import json; print(len(json.load(open('mock_data/orders.json'))), 'POs')"

10 POs
Checkpoint
You should see 10 POs. If you see 0 or a JSON parse error, check that all five files are valid JSON.
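Counting POs confirms the file parses, but not that the five files agree with each other. The sketch below shows one way to cross-check them; the helper names `check_mock_data` and `exception_counts` are hypothetical, and the inline dictionaries are trimmed stand-ins for the JSON excerpts above.

```python
from collections import Counter


def check_mock_data(orders: dict, customers: dict, tracking: dict) -> list:
    """Return a list of human-readable problems; empty means consistent."""
    problems = []
    for po, order in orders.items():
        if order["customer_id"] not in customers:
            problems.append(f"{po}: unknown customer {order['customer_id']}")
        for tn in order.get("tracking_numbers", []):
            if tn not in tracking:
                problems.append(f"{po}: tracking {tn} missing from tracking.json")
    return problems


def exception_counts(orders: dict) -> Counter:
    """Count POs per exception_flagged value."""
    return Counter(o["exception_flagged"] for o in orders.values())


# Inline stand-ins for the JSON files, trimmed from the excerpts above.
orders = {"PO-2024-5678": {"customer_id": "ACME001",
                           "tracking_numbers": ["1Z999AA10123456784"],
                           "exception_flagged": "PARTIAL_DELIVERY"}}
customers = {"ACME001": {"tier": "ENTERPRISE"}}
tracking = {"1Z999AA10123456784": {"status": "DELIVERED"}}

print(check_mock_data(orders, customers, tracking))  # []
print(exception_counts(orders))
```

With the real files loaded via json.load, the counts should match the 4 / 3 / 3 split described above.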
Step 2: Define Tools as JSON Schema
15 min · tools.py

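What & Why: The Anthropic API needs every tool's input described as a JSON Schema. PO numbers, SKUs, and tracking numbers all have specific formats (PO-YYYY-NNNN, XXX-NNNN, UPS/FedEx 12–22 chars). A precise schema steers Claude toward correctly typed, correctly formatted arguments; it is guidance rather than enforcement, which is why Step 4 validates inputs again in code.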

"""tools.py — order-exception tool schemas + executors."""
import json
from pathlib import Path

DATA = Path("mock_data")
ORDERS = json.loads((DATA / "orders.json").read_text())
TRACKING = json.loads((DATA / "tracking.json").read_text())
CONTRACTS = json.loads((DATA / "contracts.json").read_text())
INVENTORY = json.loads((DATA / "inventory.json").read_text())
CUSTOMERS = json.loads((DATA / "customers.json").read_text())

TOOLS = [
    {
        "name": "get_order_details",
        "description": "Get full PO details: line items, status, dates, tracking numbers, customer.",
        "input_schema": {
            "type": "object",
            "properties": {"po_number": {"type": "string", "pattern": "^PO-[0-9]{4}-[0-9]{4}$"}},
            "required": ["po_number"],
        },
    },
    {
        "name": "track_shipment",
        "description": "Get carrier status for a tracking number (UPS/FedEx/DHL).",
        "input_schema": {
            "type": "object",
            "properties": {"tracking_number": {"type": "string"}},
            "required": ["tracking_number"],
        },
    },
    {
        "name": "check_contract_pricing",
        "description": "Return contract unit price for a customer + SKU. Used to detect discrepancies.",
        "input_schema": {
            "type": "object",
            "properties": {"customer_id": {"type": "string"}, "sku": {"type": "string"}},
            "required": ["customer_id", "sku"],
        },
    },
    {
        "name": "query_inventory",
        "description": "Return stock available for a SKU at a specific warehouse.",
        "input_schema": {
            "type": "object",
            "properties": {"sku": {"type": "string"}, "warehouse": {"type": "string"}},
            "required": ["sku", "warehouse"],
        },
    },
    {
        "name": "draft_notification",
        "description": "Generate an exception-notification email draft, tier-aware tone.",
        "input_schema": {
            "type": "object",
            "properties": {
                "customer_id": {"type": "string"},
                "exception_type": {"type": "string",
                                   "enum": ["DELAYED", "PARTIAL", "PRICING", "QUALITY_HOLD"]},
                "resolution": {"type": "string"},
            },
            "required": ["customer_id", "exception_type", "resolution"],
        },
    },
]

def execute_tool(name, args):
    if name == "get_order_details":
        return ORDERS.get(args["po_number"], {"error": "PO not found"})
    if name == "track_shipment":
        return TRACKING.get(args["tracking_number"], {"error": "tracking not found"})
    if name == "check_contract_pricing":
        key = f"{args['customer_id']}_{args['sku']}"
        return CONTRACTS.get(key, {"error": "no contract on file"})
    if name == "query_inventory":
        sku = INVENTORY.get(args["sku"], {})
        return sku.get("warehouses", {}).get(args["warehouse"],
                                             {"error": "no inventory record"})
    if name == "draft_notification":
        cust = CUSTOMERS.get(args["customer_id"], {})
        tier = cust.get("tier", "SMB")
        # Tier-aware salutation/tone
        salutation = "Team," if tier == "ENTERPRISE" else f"Hi {cust.get('name','team')},"
        sign_off = ("We will follow up before SLA expiry. — Operations"
                    if tier == "ENTERPRISE"
                    else "Let us know how we can help! — Customer Success")
        body = (f"Regarding {args['exception_type']} on your recent order:\n\n"
                f"{args['resolution']}\n\n{sign_off}")
        return {"to": cust.get("csm_email", "ops@example.com"),
                "tier": tier, "salutation": salutation, "body": body}
    raise ValueError(f"Unknown tool: {name}")
Checkpoint
Run python -c "from tools import execute_tool; print(execute_tool('get_order_details', {'po_number': 'PO-2024-5678'})['exception_flagged'])". Expected: PARTIAL_DELIVERY.
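The "pattern" keyword in a tool schema can also be checked locally before a tool runs. The following is a hand-rolled sketch that covers only "required", string "type", "pattern", and "enum"; it is not a full JSON Schema validator, and `schema_errors` is a hypothetical helper name. The schema itself is copied from the TOOLS list above.

```python
import re

# Input schema for get_order_details, copied from TOOLS.
PO_SCHEMA = {
    "type": "object",
    "properties": {"po_number": {"type": "string",
                                 "pattern": "^PO-[0-9]{4}-[0-9]{4}$"}},
    "required": ["po_number"],
}


def schema_errors(schema: dict, args: dict) -> list:
    """Return violations of the small JSON Schema subset this sketch checks."""
    errors = [f"missing required field: {name}"
              for name in schema.get("required", []) if name not in args]
    for name, rules in schema.get("properties", {}).items():
        if name not in args:
            continue
        value = args[name]
        if rules.get("type") == "string" and not isinstance(value, str):
            errors.append(f"{name}: expected string")
            continue
        # JSON Schema patterns are unanchored, hence re.search.
        if ("pattern" in rules and isinstance(value, str)
                and not re.search(rules["pattern"], value)):
            errors.append(f"{name}: does not match {rules['pattern']}")
        if "enum" in rules and value not in rules["enum"]:
            errors.append(f"{name}: not one of {rules['enum']}")
    return errors


print(schema_errors(PO_SCHEMA, {"po_number": "PO-2024-5678"}))  # []
print(schema_errors(PO_SCHEMA, {"po_number": "2024-5678"}))
print(schema_errors(PO_SCHEMA, {}))
```

Driving validation from the schema keeps the format rules in one place instead of duplicating them as separate regexes.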
Step 3: Build the While Loop
45 min · agent.py · The CORE of Iter 1

What & Why: The system prompt is critical for order-exception handling: Claude needs to investigate ALL related data (order, tracking, contract, inventory) before drafting a notification, not just react to the first finding.

"""agent.py — the raw API loop. Everything is hand-coded."""
import json, anthropic
from tools import TOOLS, execute_tool

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-6"
SYSTEM = """You are an order-exception specialist. For every flagged PO:
1. get_order_details to understand the situation
2. track_shipment for any tracking numbers (CARRIER VIEW)
3. check_contract_pricing for any line item with potential discrepancy
4. query_inventory for any unshipped or backordered SKUs
5. draft_notification with the right exception_type and a clear resolution

ALWAYS investigate before drafting. Never recommend a credit without checking
contract pricing AND inventory. Match the customer tier's tone (terse for
ENTERPRISE, warmer for SMB). Never overpromise — cite SLA hours, not specifics."""

MAX_TURNS = 12

def _run_messages(messages: list) -> tuple[str, list]:
    """Drive the loop on the given messages list. Returns (final_text, messages).
    Used by run_agent() for single-shot queries and by session.chat() for multi-turn."""
    for turn in range(MAX_TURNS):
        response = client.messages.create(
            model=MODEL, max_tokens=4096, system=SYSTEM,
            tools=TOOLS, messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason == "end_turn":
            # Default to "" so a reply with no text block does not raise StopIteration.
            text = next((b.text for b in response.content if b.type == "text"), "")
            return text, messages

        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    try:
                        result = execute_tool(block.name, block.input)
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": json.dumps(result, default=str),
                        })
                    except Exception as e:
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": f"ERROR: {e}",
                            "is_error": True,
                        })
            messages.append({"role": "user", "content": tool_results})
            continue

        raise RuntimeError(f"Unexpected stop_reason: {response.stop_reason}")
    raise RuntimeError(f"Agent exceeded {MAX_TURNS} turns without finishing")

def run_agent(question: str) -> str:
    """Single-shot entry point. Wraps _run_messages with a fresh history."""
    text, _ = _run_messages([{"role": "user", "content": question}])
    return text

if __name__ == "__main__":
    q = "PO-2024-5678 from Acme Corp has a flagged exception. Investigate, " \
        "determine the root cause, and draft a customer notification."
    print(run_agent(q))
// agent.ts — the raw API loop. Order-exception investigation.
import Anthropic from "@anthropic-ai/sdk";
import { TOOLS, executeTool } from "./tools.js";

const client = new Anthropic();
const MODEL = "claude-sonnet-4-6";
const SYSTEM = `You are an order-exception specialist. For every flagged PO:
1. get_order_details to understand the situation
2. track_shipment for any tracking numbers
3. check_contract_pricing for any line item with potential discrepancy
4. query_inventory for any unshipped or backordered SKUs
5. draft_notification with the right exception_type and a clear resolution

ALWAYS investigate before drafting. Never recommend a credit without checking
contract pricing AND inventory. Match the customer tier's tone (terse for
ENTERPRISE, warmer for SMB). Never overpromise — cite SLA hours, not specifics.`;

export async function runAgent(question: string, maxTurns = 12): Promise<string> {
  const messages: Anthropic.MessageParam[] = [{ role: "user", content: question }];
  for (let turn = 0; turn < maxTurns; turn++) {
    const response = await client.messages.create({
      model: MODEL, max_tokens: 4096, system: SYSTEM, tools: TOOLS, messages,
    });
    messages.push({ role: "assistant", content: response.content });
    if (response.stop_reason === "end_turn") {
      const text = response.content.find(b => b.type === "text");
      return text?.type === "text" ? text.text : "";
    }
    if (response.stop_reason === "tool_use") {
      const toolResults: Anthropic.ToolResultBlockParam[] = [];
      for (const block of response.content) {
        if (block.type === "tool_use") {
          try {
            const result = await executeTool(block.name, block.input);
            toolResults.push({ type: "tool_result", tool_use_id: block.id,
                                content: JSON.stringify(result) });
          } catch (e) {
            toolResults.push({ type: "tool_result", tool_use_id: block.id,
                                content: `ERROR: ${(e as Error).message}`, is_error: true });
          }
        }
      }
      messages.push({ role: "user", content: toolResults });
      continue;
    }
    throw new Error(`Unexpected stop_reason: ${response.stop_reason}`);
  }
  throw new Error(`Agent exceeded ${maxTurns} turns without finishing`);
}

Run: python agent.py

Expected output (paraphrased — Claude's exact wording varies):

Exception Report: PO-2024-5678 (Acme Corp, ENTERPRISE tier)
Type: PARTIAL_DELIVERY
Root cause: 50 units of GSK-1175 unshipped — Memphis warehouse had 0 stock at pick time.
Newark warehouse has 200 units available; can ship within SLA window.

Draft notification (ENTERPRISE tone — formal):
"Team,
Regarding the partial delivery on PO-2024-5678: 60 of 100 units of BRG-4892 have shipped (UPS tracking 1Z999AA10123456784, delivered Newark NJ 2024-04-12). The remaining 50 units of GSK-1175 are unshipped due to a Memphis stockout; we have 200 units available in Newark and will dispatch within your 4-hour SLA.
— Operations"
✅ Checkpoint — Step 3
You should see an exception report identifying PARTIAL delivery, a root cause naming the Memphis stockout and Newark availability, and a draft notification with ENTERPRISE-tier tone (formal salutation, no "Hi there!"). If the agent drafts an SMB-tone notification ("Hi Acme team!"), it isn't picking up the customer tier. Check that get_order_details returns the customer_id and that the tier from customers.json reaches the agent in a tool result (the draft_notification result includes a "tier" field).
Troubleshooting
  • Agent skips track_shipment → the order has no tracking number yet (e.g., for partial-shipment cases, only the shipped portion has tracking). System prompt should say "call track_shipment for each tracking_number returned by get_order_details — skip if none."
  • Agent recommends a credit instead of investigating → reinforce the system prompt: "Never recommend a credit without first checking inventory at OTHER warehouses."
  • Draft uses "Hi Acme!" for an ENTERPRISE customer → the agent didn't read the customer tier. The agent cannot open customers.json itself, so the tier has to arrive in a tool result. Add explicit guidance: "Use the customer tier reported in tool results and match tone: ENTERPRISE = formal/terse, SMB = warm/personal."
  • Agent makes up dates or quantities → system prompt: "Cite ONLY data from tool responses. Never invent numbers."
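When debugging issues like the ones above, it can help to exercise the loop without calling the API at all. The sketch below replays scripted responses through a trimmed copy of the control flow in _run_messages (no error handling). `ScriptedClient` and `run_loop` are hypothetical names, and the response objects are plain SimpleNamespace stand-ins that carry only the attributes the loop reads.

```python
import json
from types import SimpleNamespace as NS


class ScriptedClient:
    """Hypothetical stand-in for anthropic.Anthropic(): replays canned responses."""

    def __init__(self, responses):
        self._responses = iter(responses)
        self.messages = self  # so client.messages.create(...) resolves

    def create(self, **kwargs):
        return next(self._responses)


def run_loop(client, execute_tool, question, max_turns=12):
    """Trimmed version of the raw API loop, with the client injected."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        response = client.messages.create(messages=messages)
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason == "end_turn":
            text = next((b.text for b in response.content if b.type == "text"), "")
            return text, messages
        if response.stop_reason == "tool_use":
            results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = execute_tool(block.name, block.input)
                    results.append({"type": "tool_result",
                                    "tool_use_id": block.id,
                                    "content": json.dumps(result)})
            messages.append({"role": "user", "content": results})
            continue
        raise RuntimeError(f"Unexpected stop_reason: {response.stop_reason}")
    raise RuntimeError("turn limit reached")


def fake_tool(name, args):
    return {"status": "PARTIALLY_SHIPPED"}


# Turn 1: the model asks for a tool. Turn 2: it answers in plain text.
script = [
    NS(stop_reason="tool_use", content=[
        NS(type="tool_use", id="toolu_1", name="get_order_details",
           input={"po_number": "PO-2024-5678"})]),
    NS(stop_reason="end_turn", content=[NS(type="text", text="PARTIAL delivery")]),
]

text, history = run_loop(ScriptedClient(script), fake_tool, "Investigate PO-2024-5678")
print(text)          # PARTIAL delivery
print(len(history))  # 4
```

The four history entries are the user question, the assistant tool request, the user tool_result message, and the final assistant reply, which is the message shape worth knowing before reading generated code in Iteration 3.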
Step 4: Add Guardrails Manually (incl. Tone Check)
30 min · guardrails.py

What & Why: A loop that does whatever Claude asks is not safe to ship. Production order agents need at minimum: input validation (PO format, SKU format), output tone enforcement (enterprise drafts should not start with "Hi there!"), a cost cap, and a circuit breaker. The tone check is the B2B-specific compliance layer.

"""guardrails.py — B2B-aware checks for order agents."""
import re

PO_RE = re.compile(r"^PO-[0-9]{4}-[0-9]{4}$")
SKU_RE = re.compile(r"^[A-Z]{3}-[0-9]{4}$")
COST_LIMIT_TOKENS = 50_000
CIRCUIT_FAIL_THRESHOLD = 3

# Tokens that should NOT appear in ENTERPRISE-tier drafts
ENTERPRISE_BANNED_TOKENS = ["hey there", "hi there", "no worries", "super sorry"]
# Tokens that should NOT appear in any draft (overpromise)
OVERPROMISE_TOKENS = ["guarantee", "definitely tomorrow", "100% certain"]

def validate_input(tool_name: str, args: dict):
    if tool_name == "get_order_details":
        if not PO_RE.match(args.get("po_number", "")):
            raise ValueError(f"Invalid PO format: {args.get('po_number')}")
    if tool_name in ("check_contract_pricing", "query_inventory"):
        if not SKU_RE.match(args.get("sku", "")):
            raise ValueError(f"Invalid SKU format: {args.get('sku')}")

def check_draft_tone(draft: dict):
    """Run AFTER draft_notification. Raise if tone mismatch or overpromise."""
    body = draft.get("body", "").lower()
    tier = draft.get("tier", "SMB")
    if tier == "ENTERPRISE":
        for tok in ENTERPRISE_BANNED_TOKENS:
            if tok in body:
                raise ValueError(f"Tone mismatch: '{tok}' in ENTERPRISE draft")
    for tok in OVERPROMISE_TOKENS:
        if tok in body:
            raise ValueError(f"Overpromise detected: '{tok}'")

class CircuitBreaker:
    def __init__(self): self.failures = 0
    def record(self, ok):
        self.failures = 0 if ok else self.failures + 1
        if self.failures >= CIRCUIT_FAIL_THRESHOLD:
            raise RuntimeError("Circuit breaker tripped — aborting")

Now wire the guardrails into _run_messages in agent.py. Replace your existing _run_messages with the version below. The tone check fires on draft_notification results — if it raises, the error flows back to Claude as a tool error, prompting a redraft.

# Add to imports at the top of agent.py:
from guardrails import (validate_input, check_draft_tone,
                        CircuitBreaker, COST_LIMIT_TOKENS)   # NEW

_breaker = CircuitBreaker()        # NEW — module-level instance

def _run_messages(messages: list) -> tuple[str, list]:
    total_tokens = 0                # NEW — cost cap counter
    for turn in range(MAX_TURNS):
        response = client.messages.create(
            model=MODEL, max_tokens=4096, system=SYSTEM,
            tools=TOOLS, messages=messages,
        )
        total_tokens += (response.usage.input_tokens
                         + response.usage.output_tokens)              # NEW
        if total_tokens > COST_LIMIT_TOKENS:                          # NEW
            raise RuntimeError(f"Cost cap exceeded: {total_tokens} tokens")

        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason == "end_turn":
            # Default to "" so a reply with no text block does not raise StopIteration.
            text = next((b.text for b in response.content if b.type == "text"), "")
            return text, messages

        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    try:
                        validate_input(block.name, block.input)        # NEW — PO/SKU check
                        result = execute_tool(block.name, block.input)
                        # B2B-specific: tone-check draft_notification results.
                        if block.name == "draft_notification":         # NEW
                            check_draft_tone(result)                    # NEW — raises on bad tone
                        _breaker.record(ok=True)                        # NEW
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": json.dumps(result, default=str),
                        })
                    except Exception as e:
                        _breaker.record(ok=False)                       # NEW — trips on 3rd fail
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": f"ERROR: {e}",
                            "is_error": True,
                        })
            messages.append({"role": "user", "content": tool_results})
            continue

        raise RuntimeError(f"Unexpected stop_reason: {response.stop_reason}")

    raise RuntimeError(f"Agent exceeded {MAX_TURNS} turns without finishing")

Test that the tone check fires on a bad ENTERPRISE draft:

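# Temporarily edit your draft_notification mock in tools.py to inject bad tone
# on the FIRST call only (for example, behind a module-level flag):
#   "Hi there team! No worries about PO-2024-5678 ..."
# If the mock injects it on every call, each redraft fails the same check and
# the circuit breaker trips on the third consecutive failure.
# Run the agent. check_draft_tone should raise ValueError, the error flows
# back as is_error: true, and Claude redrafts with formal tone.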
python agent.py
✅ Checkpoint — Step 4
The clean PO-2024-5678 query still produces an ENTERPRISE-tier formal draft. With the injected bad tone, the agent receives the tone-check error, redrafts, and the second attempt passes. Restore the clean draft_notification mock when done testing.
Tone Enforcement is the B2B Equivalent of HIPAA Redaction

In Domain A you redact PHI; in Domain B you enforce tone. Same hook pattern, different rules. Both are post-tool-use checks that can either reject the result (hard fail) or rewrite it (soft fail). For ENTERPRISE customers, a casual tone in a delay notification can damage the relationship; for SMB, an overly formal tone can feel cold. The tone check is what lets you scale customer comms without manual review.
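The difference between a hard fail (reject) and a soft fail (rewrite) can be sketched in a few lines. This is an illustration only, not the course's guardrails.py: the function name enforce_tone, the CASUAL_TOKENS map, and the formal replacement phrases are all assumptions made for the example.

```python
"""Sketch: hard-fail vs soft-fail tone enforcement (hypothetical helper)."""
import re

# Assumed casual phrases and formal replacements, chosen only for illustration.
CASUAL_TOKENS = {
    "hi there": "Hello",
    "no worries": "We apologize for the inconvenience",
}


def enforce_tone(body: str, tier: str, mode: str = "reject") -> str:
    """Return an acceptable body, or raise ValueError when mode is 'reject'."""
    if tier != "ENTERPRISE":
        return body  # this sketch only polices ENTERPRISE drafts
    hits = [t for t in CASUAL_TOKENS if t in body.lower()]
    if not hits:
        return body
    if mode == "reject":
        # Hard fail: the caller turns this into an is_error tool_result.
        raise ValueError(f"tone_violation: {hits}")
    # Soft fail: replace each casual phrase in place, ignoring case.
    for token, formal in CASUAL_TOKENS.items():
        body = re.sub(re.escape(token), formal, body, flags=re.IGNORECASE)
    return body
```

A hard fail keeps the model in the loop (it sees the error and redrafts), while a soft fail is cheaper but silently changes what the model wrote, so the audit trail should record which path was taken.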

Troubleshooting
  • ImportError: cannot import name 'check_draft_tone' from 'guardrails' → guardrails.py isn't in the project folder, or the function name is misspelled.
  • Agent never redrafts after tone error → Claude may interpret the error as terminal. Strengthen the system prompt: "If you receive a tone_violation error from draft_notification, redraft with formal language and try again."
  • Cost cap fires immediately → COST_LIMIT_TOKENS = 50_000 is generous; if you mistakenly set it lower, increase it back.
Step 5: Add Audit Logging
15 min · audit_log.jsonl

What & Why: SOX and customer-contract audits require an exception-handling trail. Every tool call needs a timestamped record: po_number (the case key), tool, inputs, output summary, and the running token count.

"""audit.py — SOX-compliant audit log for B2B order operations."""
import json, datetime
def append_audit(tool_name, args, result, tokens, po_number):
    rec = {
        "ts": datetime.datetime.utcnow().isoformat() + "Z",
        "po_number": po_number,
        "tool": tool_name,
        "input": args,
        "output_summary": str(result)[:200],
        "tokens": tokens,
    }
    with open("audit_log.jsonl", "a") as f:
        f.write(json.dumps(rec) + "\n")

Now wire append_audit into _run_messages. Add the import and call right after the successful execute_tool:

# Add to imports at the top of agent.py:
from audit import append_audit                    # NEW

# Inside _run_messages, after `result = execute_tool(...)` and the tone check:
                        if block.name == "draft_notification":
                            check_draft_tone(result)
                        _breaker.record(ok=True)
                        # Extract PO number from the user's first message for audit context
                        po = next(
                            (w for w in messages[0]["content"].split() if w.startswith("PO-")),
                            "default"
                        )
                        append_audit(                              # NEW
                            tool_name=block.name,
                            args=block.input,
                            result=result,
                            tokens=total_tokens,
                            po_number=po,
                        )
                        tool_results.append({...})
                        # ... rest of try-block unchanged ...
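The whitespace split above works for the canonical prompt, but it has two weak spots: a PO followed by punctuation ("PO-2024-5678.") is logged with the punctuation attached, and messages[0]["content"] is a list rather than a string once a session history starts with tool results. A more forgiving extractor might look like the sketch below. The name find_po and the choice to scan every user turn are assumptions for illustration, not part of the course code.

```python
"""Sketch: regex-based PO extraction for audit context (hypothetical helper)."""
import re

PO_PATTERN = re.compile(r"\bPO-\d{4}-\d{4}\b")


def find_po(messages, default="default"):
    """Return the first PO number found in any user turn, else `default`."""
    for msg in messages:
        if msg.get("role") != "user":
            continue
        content = msg.get("content")
        if isinstance(content, str):
            texts = [content]
        elif isinstance(content, list):
            # Only plain text blocks are searched; tool_result blocks are skipped.
            texts = [b.get("text", "") for b in content
                     if isinstance(b, dict) and b.get("type") == "text"]
        else:
            texts = []
        for text in texts:
            match = PO_PATTERN.search(text)
            if match:
                return match.group(0)
    return default
```

Calling find_po(messages) once per run, before the tool loop, also avoids re-parsing the history on every tool call.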

Run: python agent.py

Then inspect the audit file:

cat audit_log.jsonl   # macOS/Linux
type audit_log.jsonl  # Windows cmd

Expected output (one line per tool call, 4–5 lines in total, all with the same po_number; four are shown here):

{"ts": "2026-05-09T12:01:14Z", "po_number": "PO-2024-5678", "tool": "get_order_details", "input": {"po_number": "PO-2024-5678"}, "output_summary": "{\"customer_id\": \"ACME001\", \"status\": \"PARTIALLY_SHIPPED\", ...}", "tokens": 1842}
{"ts": "2026-05-09T12:01:18Z", "po_number": "PO-2024-5678", "tool": "track_shipment", "input": {"tracking_number": "1Z999AA10123456784"}, "output_summary": "{\"carrier\": \"UPS\", \"status\": \"DELIVERED\", ...}", "tokens": 2510}
{"ts": "2026-05-09T12:01:22Z", "po_number": "PO-2024-5678", "tool": "query_inventory", "input": {"sku": "GSK-1175", "warehouse": "WH-NEWARK"}, "output_summary": "{\"qty_available\": 200, ...}", "tokens": 3185}
{"ts": "2026-05-09T12:01:26Z", "po_number": "PO-2024-5678", "tool": "draft_notification", "input": {"customer_id": "ACME001", "exception_type": "PARTIAL", ...}, "output_summary": "{\"to\": \"csm-acme@example.com\", \"tier\": \"ENTERPRISE\", ...}", "tokens": 3920}
✅ Checkpoint — Step 5
You should see 4–5 lines in audit_log.jsonl all keyed by po_number: "PO-2024-5678". If po_number is "default", the parser didn't find a PO in the user's first message — either the question doesn't include the PO, or the parsing logic doesn't match your prompt format.
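Eyeballing the file works for five lines; for longer runs a small checker is handier. The sketch below assumes only the record fields written by append_audit (po_number and tool). The function name summarize_audit is made up for this example.

```python
"""Sketch: summarize audit_log.jsonl by PO number (hypothetical helper)."""
import json
from collections import Counter


def summarize_audit(lines):
    """Return ({po_number: record_count}, [tool names in order])."""
    po_counts = Counter()
    tools = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        rec = json.loads(line)
        po_counts[rec.get("po_number", "default")] += 1
        tools.append(rec.get("tool"))
    return dict(po_counts), tools
```

To use it against a real run, pass the open file object: `summarize_audit(open("audit_log.jsonl"))`. A "default" key in the first return value is the signal that PO parsing failed for at least one call.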
Step 6: Multi-Turn PO Sessions
15 min · session.py

What & Why: Ops reviewers ask follow-ups about the same PO: "What if we expedite-ship from Newark instead?" should reuse the prior investigation context. Iter 1 implements this by maintaining a per-PO messages list and reusing the same _run_messages helper from Step 3.

Create a new file session.py in the project folder:

"""session.py — multi-turn PO sessions over the same _run_messages helper."""
from agent import _run_messages

SESSIONS: dict[str, list] = {}   # po_number -> messages list
WINDOW = 24                      # 5 tools per turn × ~5 turns + buffer

def chat(po_number: str, user_msg: str) -> str:
    """Append the user's message to the PO session, run the loop, return the answer."""
    msgs = SESSIONS.setdefault(po_number, [])
    msgs.append({"role": "user", "content": user_msg})
    answer, msgs = _run_messages(msgs)
    # Sliding window: keep only the last WINDOW messages.
    SESSIONS[po_number] = msgs[-WINDOW:]
    return answer
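One caveat with the plain msgs[-WINDOW:] slice: it can cut a history in the middle of a tool exchange, and the Messages API rejects a tool_result whose matching tool_use is not in the preceding assistant message. A possible refinement, sketched below, only ever starts the window at a user turn that carries no tool results. The helper names are assumptions for this example, and the fallback deliberately keeps more than the window rather than produce an invalid history.

```python
"""Sketch: a sliding window that never splits a tool_use / tool_result pair."""


def is_plain_user_turn(msg) -> bool:
    """True for a user message that contains no tool_result blocks."""
    if msg.get("role") != "user":
        return False
    content = msg.get("content")
    if isinstance(content, str):
        return True
    return not any(isinstance(b, dict) and b.get("type") == "tool_result"
                   for b in content)


def trim_window(messages, window):
    """Keep roughly the last `window` messages, starting at a plain user turn."""
    cut = max(0, len(messages) - window)
    # Prefer the first plain user turn at or after the cut point.
    for i in range(cut, len(messages)):
        if is_plain_user_turn(messages[i]):
            return messages[i:]
    # Otherwise back up to the most recent one before it (keeps extra messages).
    for i in range(cut - 1, -1, -1):
        if is_plain_user_turn(messages[i]):
            return messages[i:]
    return messages
```

In chat(), this would replace the slice with `SESSIONS[po_number] = trim_window(msgs, WINDOW)`.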

Try it — multi-turn PO investigation:

python -c "
from session import chat
print(chat('PO-2024-5678', 'PO-2024-5678 from Acme Corp has a flagged exception. Investigate and draft a notification.'))
print('---')
print(chat('PO-2024-5678', 'What would change if we expedite-ship from Newark instead of waiting for Memphis restock?'))
"

Expected behavior: the second call references the prior investigation findings (Memphis stockout, Newark availability) and proposes the expedite scenario as an alternative. Without session continuity, the agent would have no memory of the GSK-1175 issue.

✅ Checkpoint — Step 6
The second call references "the GSK-1175 stockout from Memphis" and proposes a different resolution (expedite ship from Newark with revised SLA estimate). If the second call says "I need to investigate first", session isn't carrying over — verify SESSIONS[po_number] persists across calls.
Troubleshooting
  • ImportError: cannot import name '_run_messages' from 'agent' → you skipped Step 3's refactor of agent.py. Go back and split the loop into _run_messages + run_agent.
  • Each call re-investigates → SESSIONS is module-level. If you're calling from separate Python processes (e.g., via subprocess), use a database or Redis instead.
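As a rough illustration of the "use a database" suggestion, the sketch below swaps the in-memory dict for a SQLite table using only the standard library. The class name SessionStore, the table layout, and the sessions.db filename are assumptions. Assistant turns hold SDK block objects, so the encoder falls back to their model_dump() method when present.

```python
"""Sketch: SQLite-backed PO sessions (hypothetical replacement for SESSIONS)."""
import json
import sqlite3


def _to_jsonable(obj):
    # SDK content blocks are pydantic models; plain dicts never reach this hook.
    if hasattr(obj, "model_dump"):
        return obj.model_dump()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


class SessionStore:
    def __init__(self, path="sessions.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS sessions ("
            "po_number TEXT PRIMARY KEY, messages TEXT NOT NULL)"
        )
        self.conn.commit()

    def load(self, po_number):
        """Return the stored messages list, or [] for an unknown PO."""
        row = self.conn.execute(
            "SELECT messages FROM sessions WHERE po_number = ?", (po_number,)
        ).fetchone()
        return json.loads(row[0]) if row else []

    def save(self, po_number, messages):
        """Overwrite the stored history for this PO."""
        self.conn.execute(
            "INSERT OR REPLACE INTO sessions (po_number, messages) VALUES (?, ?)",
            (po_number, json.dumps(messages, default=_to_jsonable)),
        )
        self.conn.commit()
```

chat() would then call store.load(po_number) in place of SESSIONS.setdefault(...) and store.save(po_number, msgs) in place of the final assignment.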
Step 7: Deploy as FastAPI + Docker
20 min · server.py + Dockerfile
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from agent import run_agent
from session import chat

app = FastAPI()
class Q(BaseModel): question: str
class C(BaseModel): po_number: str; message: str

@app.get("/health")
def health(): return {"status": "ok"}

@app.post("/exception")
def exception_ep(q: Q):
    try: return {"report": run_agent(q.question)}
    except Exception as e: raise HTTPException(500, str(e))

@app.post("/chat")
def chat_ep(c: C):
    return {"answer": chat(c.po_number, c.message)}
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
ENV PYTHONUNBUFFERED=1
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]

Also create requirements.txt at the project root:

anthropic>=0.40
fastapi>=0.110
uvicorn>=0.27
pydantic>=2.0

Run locally first:

uvicorn server:app --reload --port 8000
# In another terminal:
curl localhost:8000/health
curl -X POST localhost:8000/exception -H "Content-Type: application/json" \
     -d '{"question":"PO-2024-5678 from Acme Corp has a flagged exception. Investigate."}'

Then build and run with Docker:

docker build -t iter1-b .
docker run --rm -p 8000:8000 -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY iter1-b
🎉 Iteration 1 Complete — End-to-End Verification

You should now have 13 files in your project folder:

agent-iter1-raw/
├── agent.py            # _run_messages + run_agent
├── tools.py            # 5 ops tool schemas + execute_tool
├── mock_data/
│   ├── orders.json       # 10 POs across 3 exception types
│   ├── tracking.json     # 8 carrier records
│   ├── contracts.json    # 5 customer-SKU contracts
│   ├── inventory.json    # 6 SKUs × 3 warehouses
│   └── customers.json    # 5 customers (3 ENTERPRISE, 2 SMB)
├── guardrails.py       # validate_input + check_draft_tone + CircuitBreaker
├── audit.py            # SOX-compliant append_audit
├── session.py          # multi-turn PO sessions
├── server.py           # FastAPI handlers
├── Dockerfile
└── requirements.txt

Run the full Iter-1 acceptance test:

# 1. Investigate the canonical PO (Acme, ENTERPRISE)
curl -s -X POST localhost:8000/exception -H "Content-Type: application/json" \
     -d '{"question":"PO-2024-5678 from Acme has a flagged exception. Investigate."}' | python -m json.tool

# 2. Multi-turn PO session (expedite-ship what-if)
curl -s -X POST localhost:8000/chat -H "Content-Type: application/json" \
     -d '{"po_number":"PO-2024-5678","message":"Investigate this exception."}' | python -m json.tool
curl -s -X POST localhost:8000/chat -H "Content-Type: application/json" \
     -d '{"po_number":"PO-2024-5678","message":"What if we expedite-ship from Newark?"}' | python -m json.tool

# 3. Tone enforcement: bad ENTERPRISE draft gets rejected
# (You'd inject "Hi there!" into the draft_notification mock to test this)

# 4. Audit log keyed by PO
docker exec $(docker ps -q --filter ancestor=iter1-b) cat audit_log.jsonl | head -5
# Look for "po_number": "PO-2024-5678" on every line

You pass Iter 1 if: (1) /health returns ok; (2) /exception returns a PARTIAL exception report with formal ENTERPRISE-tier draft; (3) the second /chat call references the prior Memphis stockout findings; (4) audit log entries are all keyed by po_number: "PO-2024-5678".

Iteration 1 Metrics
  • Files: 13 (agent.py, tools.py, mock_data/×5, guardrails.py, audit.py, session.py, server.py, Dockerfile, requirements.txt)
  • Lines you wrote: ~250
  • Time: ~3 hours
  • Abstractions used: none — just anthropic Messages API and FastAPI

Debugging in Iteration 1: Print Statements + Manual Inspection

When the agent gives wrong output in Iter 1, you debug like you debug any Python program: by reading your own code.

A. Add a debug_turn() helper
def debug_turn(turn_num, response, messages):
    print(f"\n=== TURN {turn_num} === stop_reason: {response.stop_reason}")
    for block in response.content:
        if block.type == "tool_use":
            print(f"  TOOL: {block.name}({block.input})")
        elif block.type == "text":
            print(f"  TEXT: {block.text[:120]}...")
    print(f"  Tokens: in={response.usage.input_tokens} out={response.usage.output_tokens}")
B. Inspect messages list manually

The most common Iter-1 bug is malformed messages. Add print(json.dumps(messages, default=str, indent=2)) before each messages.create.

C. Common Iter-1 bugs and how to spot them
  • Wrong tool_use_id in tool_result → 400 from API. Match the id on the tool_use block.
  • Forgot to append the assistant turn → API complains about message order.
  • Domain-specific: agent drafts a notification BEFORE checking inventory or contract pricing — root cause incomplete. The system prompt did not enforce the investigate-then-draft order. Strengthen with: "ALWAYS call get_order_details, track_shipment, check_contract_pricing, AND query_inventory before draft_notification."
  • Domain-specific: agent drafts an SMB-tone notification for an ENTERPRISE customer (or vice versa). Either tone-check hook didn't run, or the draft_notification result didn't include the customer tier. Add the tier to the response payload and to the audit log.
  • Domain-specific: agent recommends a credit when a re-shipment would resolve. Usually means the agent stopped after track_shipment showed DELIVERED without checking inventory for unshipped lines. Re-read the system prompt — PARTIAL delivery requires inventory check.
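The first two bugs in this list (a wrong tool_use_id, or a missing assistant turn) can be caught before the API call with a small linter over the messages list. The sketch below is one way to do it. The name lint_messages is an assumption, and it only checks tool_use / tool_result pairing, not every rule the API enforces.

```python
"""Sketch: lint a messages list for tool_use / tool_result mismatches."""


def _field(block, name):
    # Blocks may be plain dicts (your tool_results) or SDK objects (response.content).
    return block.get(name) if isinstance(block, dict) else getattr(block, name, None)


def _blocks(content, block_type):
    if isinstance(content, str):
        return []
    return [b for b in content if _field(b, "type") == block_type]


def lint_messages(messages):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for i, msg in enumerate(messages):
        if msg["role"] != "user":
            continue
        results = _blocks(msg["content"], "tool_result")
        if not results:
            continue
        prev = messages[i - 1] if i > 0 else None
        if prev is None or prev["role"] != "assistant":
            problems.append(f"messages[{i}]: tool_result without a preceding assistant turn")
            continue
        expected = {str(_field(b, "id")) for b in _blocks(prev["content"], "tool_use")}
        got = {str(_field(b, "tool_use_id")) for b in results}
        if got != expected:
            problems.append(
                f"messages[{i}]: tool_use_id mismatch, "
                f"expected {sorted(expected)}, got {sorted(got)}"
            )
    return problems
```

Calling `print(lint_messages(messages))` right before each messages.create turns a 400 from the API into a pointed local message.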
D. Debug exercise (do this before moving on)

Add a hard tone violation: change execute_tool for ENTERPRISE customers so the body starts with "Hey there!". Re-run the agent. The check_draft_tone guardrail should raise. Read the error message and trace where the tone got injected. Restore and re-run. This builds the muscle memory you need to debug Iter 3 generated drafts.

SESSION 2

Iteration 2: Agent SDK + Claude Code

Now you let the SDK run the loop, hooks handle tone enforcement and validation, sessions handle multi-turn, and Claude Code does most of the typing. Same agent. Half the lines. Different debugging.

~2 hours · ~120 lines · 8 files · SDK + hooks + sessions
Step 8: Create CLAUDE.md via Claude Code
10 min · CLAUDE.md
mkdir agent-iter2-sdk && cd agent-iter2-sdk
python -m venv venv && source venv/bin/activate    # Windows: venv\Scripts\activate
pip install "claude-agent-sdk>=0.2" "fastapi>=0.110" "uvicorn>=0.27" "pydantic>=2.0"
npm i -g @anthropic-ai/claude-code   # if not already installed
claude
> /init
# Agent: Order Exception Investigation Agent (Iteration 2)

## Stack
- Python 3.11+
- `claude-agent-sdk` (the official Agent SDK — NOT a wrapper around `client.messages.create()`)
- FastAPI + Docker for deployment
- Mock data in mock_data/*.json (5 files: orders, tracking, contracts, inventory, customers)

## File Layout
- agent.py            — query() entry + 5 @tool-decorated MCP tools + create_sdk_mcp_server
- hooks/              — PreToolUse + PostToolUse hook scripts (tone enforcement on draft_notification)
- .claude/settings.json — hooks registration (matchers + commands)
- sessions.py         — multi-PO session resume
- server.py           — FastAPI async wrapper

## Compliance Rules (encode in hooks)
- ENTERPRISE drafts MUST NOT contain casual tokens ("hi there", "no worries")
- All drafts MUST NOT contain overpromise tokens ("guarantee", "definitely tomorrow")
- All tool calls MUST be audited with po_number as the case key

## System Prompt Rules
For every flagged PO, follow this order:
1. get_order_details
2. track_shipment for each tracking_number returned
3. check_contract_pricing for any line with potential discrepancy
4. query_inventory for any unshipped/backordered SKU
5. draft_notification (matching customer tier tone)

NEVER draft before completing the investigation. NEVER overpromise.
Step 9: Build Agent with @tool Decorators (claude-agent-sdk)
15 min · agent.py
What the Real SDK Looks Like

If you've used client.messages.create() in Iter 1, you might expect the SDK to be a thin Agent class wrapping it. It is not. The SDK is built around MCP tools + an async query() generator + options/hooks via ClaudeAgentOptions. Tools return {"content": [{"type": "text", "text": ...}]} (MCP shape), not bare Python values.
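To make the MCP result shape concrete, here is a dependency-free sketch of wrapping a Python value and reading it back. The helper names wrap and unwrap are made up for this example. The agent code in this step uses an equivalent wrapper called _ok.

```python
"""Sketch: the MCP-shaped tool result, wrapped and unwrapped (no SDK needed)."""
import json


def wrap(payload):
    """Turn a JSON-serializable value into an MCP-style tool result."""
    return {"content": [{"type": "text", "text": json.dumps(payload)}]}


def unwrap(result):
    """Read the payload back; note the text field is a JSON string."""
    return json.loads(result["content"][0]["text"])
```

The detail to remember is the double encoding: hooks and tests that inspect a tool result need json.loads on the text field before they can look at keys such as tier or body.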

> Create agent.py using `claude-agent-sdk`. Define five order-ops tools as
> @tool-decorated async functions returning MCP-shaped {"content":[...]}:
> get_order_details, track_shipment, check_contract_pricing, query_inventory,
> draft_notification (tier-aware tone). Wire them into a create_sdk_mcp_server,
> build ClaudeAgentOptions with the system prompt from CLAUDE.md, and expose
> async run(question) driving query() and concatenating AssistantMessage text.
"""agent.py — claude-agent-sdk version. ~85 lines incl. 5 tools."""
import json
from pathlib import Path
from claude_agent_sdk import (
    query, tool, create_sdk_mcp_server,
    ClaudeAgentOptions, AssistantMessage,
)

DATA = Path("mock_data")
ORDERS    = json.loads((DATA / "orders.json").read_text())
TRACKING  = json.loads((DATA / "tracking.json").read_text())
CONTRACTS = json.loads((DATA / "contracts.json").read_text())
INVENTORY = json.loads((DATA / "inventory.json").read_text())
CUSTOMERS = json.loads((DATA / "customers.json").read_text())

def _ok(payload): return {"content": [{"type": "text", "text": json.dumps(payload)}]}

@tool("get_order_details", "Full PO details: line items, status, tracking, customer.",
      {"po_number": str})
async def get_order_details(args):
    return _ok(ORDERS.get(args["po_number"], {"error": "PO not found"}))

@tool("track_shipment", "Carrier status for a tracking number.",
      {"tracking_number": str})
async def track_shipment(args):
    return _ok(TRACKING.get(args["tracking_number"], {"error": "tracking not found"}))

@tool("check_contract_pricing", "Contract unit price for customer + SKU.",
      {"customer_id": str, "sku": str})
async def check_contract_pricing(args):
    return _ok(CONTRACTS.get(f"{args['customer_id']}_{args['sku']}",
                             {"error": "no contract"}))

@tool("query_inventory", "Stock available for a SKU at a warehouse.",
      {"sku": str, "warehouse": str})
async def query_inventory(args):
    s = INVENTORY.get(args["sku"], {})
    return _ok(s.get("warehouses", {}).get(args["warehouse"],
                                            {"error": "no inventory"}))

@tool("draft_notification",
      "Generate exception-notification email with tier-aware tone.",
      {"customer_id": str, "exception_type": str, "resolution": str})
async def draft_notification(args):
    cust = CUSTOMERS.get(args["customer_id"], {})
    tier = cust.get("tier", "SMB")
    salutation = "Team," if tier == "ENTERPRISE" else f"Hi {cust.get('name','team')},"
    sign_off = ("— Operations" if tier == "ENTERPRISE"
                else "— Customer Success")
    body = (f"Regarding {args['exception_type']}: {args['resolution']}\n\n{sign_off}")
    return _ok({"to": cust.get("csm_email"), "tier": tier,
                "salutation": salutation, "body": body})

ops_server = create_sdk_mcp_server(
    name="ops_tools", version="1.0.0",
    tools=[get_order_details, track_shipment, check_contract_pricing,
           query_inventory, draft_notification],
)

OPTIONS = ClaudeAgentOptions(
    model="claude-sonnet-4-6",
    system_prompt=("You are an order-exception specialist. For every flagged PO: "
                   "1. get_order_details, 2. track_shipment, 3. check_contract_pricing, "
                   "4. query_inventory, 5. draft_notification (tier-aware). "
                   "NEVER draft before completing investigation. NEVER overpromise."),
    mcp_servers={"ops": ops_server},
    allowed_tools=[f"mcp__ops__{n}" for n in (
        "get_order_details", "track_shipment", "check_contract_pricing",
        "query_inventory", "draft_notification")],
    max_turns=12,
)

async def run(question: str) -> str:
    parts = []
    async for msg in query(prompt=question, options=OPTIONS):
        if isinstance(msg, AssistantMessage):
            for block in msg.content:
                if getattr(block, "text", None):
                    parts.append(block.text)
    return "\n".join(parts)
// agent.ts — @anthropic-ai/claude-agent-sdk version
import { query, tool, createSdkMcpServer } from "@anthropic-ai/claude-agent-sdk";
import { z } from "zod";
import * as fs from "fs";

const ORDERS    = JSON.parse(fs.readFileSync("mock_data/orders.json", "utf8"));
const TRACKING  = JSON.parse(fs.readFileSync("mock_data/tracking.json", "utf8"));
const CONTRACTS = JSON.parse(fs.readFileSync("mock_data/contracts.json", "utf8"));
const INVENTORY = JSON.parse(fs.readFileSync("mock_data/inventory.json", "utf8"));
const CUSTOMERS = JSON.parse(fs.readFileSync("mock_data/customers.json", "utf8"));

const ok = (p: unknown) => ({ content: [{ type: "text" as const, text: JSON.stringify(p) }] });

const getOrderDetails = tool(
  "get_order_details", "Full PO details.",
  { po_number: z.string() },
  async (args) => ok(ORDERS[args.po_number] ?? { error: "PO not found" })
);

const draftNotification = tool(
  "draft_notification", "Tier-aware exception email draft.",
  { customer_id: z.string(), exception_type: z.string(), resolution: z.string() },
  async (args) => {
    const cust = CUSTOMERS[args.customer_id] ?? {};
    const tier = cust.tier ?? "SMB";
    const salutation = tier === "ENTERPRISE" ? "Team," : `Hi ${cust.name ?? "team"},`;
    const signOff = tier === "ENTERPRISE" ? "— Operations" : "— Customer Success";
    const body = `Regarding ${args.exception_type}: ${args.resolution}\n\n${signOff}`;
    return ok({ to: cust.csm_email, tier, salutation, body });
  }
);
// (track_shipment, check_contract_pricing, query_inventory defined the same way)

const opsServer = createSdkMcpServer({
  name: "ops_tools",
  tools: [getOrderDetails, /* ... 3 more ... */ draftNotification],
});

const OPTIONS = {
  model: "claude-sonnet-4-6",
  systemPrompt: "You are an order-exception specialist. For every flagged PO: " +
                "1. get_order_details, 2. track_shipment, 3. check_contract_pricing, " +
                "4. query_inventory, 5. draft_notification (tier-aware). " +
                "NEVER draft before completing investigation. NEVER overpromise.",
  mcpServers: { ops: opsServer },
  allowedTools: [
    "mcp__ops__get_order_details", "mcp__ops__track_shipment",
    "mcp__ops__check_contract_pricing", "mcp__ops__query_inventory",
    "mcp__ops__draft_notification",
  ],
  maxTurns: 12,
};

export async function run(question: string): Promise<string> {
  const parts: string[] = [];
  for await (const msg of query({ prompt: question, options: OPTIONS })) {
    if (msg.type === "assistant") {
      // In the TypeScript SDK the content blocks live on msg.message, not on msg itself.
      for (const block of msg.message.content) {
        if (block.type === "text") parts.push(block.text);
      }
    }
  }
  return parts.join("\n");
}
What Just Disappeared

You no longer write the message loop, the stop_reason check, the tool_result append, or JSON schema dicts. Same five-tool investigation flow, ~85 lines instead of ~140.

Troubleshooting
  • ModuleNotFoundError: No module named 'claude_agent_sdk' → pip install "claude-agent-sdk>=0.2" in your venv.
  • ImportError: cannot import name 'Agent' from 'anthropic' → you're trying the old fictional API. The real SDK is claude_agent_sdk.
  • Tool call rejected with "not allowed" → add the tool's mcp__ops__<name> entry to allowed_tools.
Step 10: Hooks via .claude/settings.json + HookMatcher
20 min · hooks/*.py + .claude/settings.json

What & Why: The SDK supports two hook surfaces: (a) file-based hooks in .claude/settings.json shelling out to scripts (production-friendly), and (b) in-process hooks via HookMatcher in ClaudeAgentOptions(hooks={...}). We use both: file-based for audit + tone enforcement on draft_notification, in-process for input validation that needs to deny.

{
  "hooks": {
    "PreToolUse": [
      { "matcher": "*",
        "hooks": [{ "type": "command", "command": "python hooks/log_call.py" }] }
    ],
    "PostToolUse": [
      { "matcher": "mcp__ops__draft_notification",
        "hooks": [{ "type": "command", "command": "python hooks/tone_check.py" }] },
      { "matcher": "*",
        "hooks": [{ "type": "command", "command": "python hooks/audit.py" }] }
    ]
  }
}
"""hooks/tone_check.py — reject ENTERPRISE drafts with casual tokens
or any draft containing overpromise tokens. Pass the draft through if OK."""
import sys, json

ENTERPRISE_BANNED = ["hi there", "hey there", "no worries", "super sorry"]
OVERPROMISE       = ["guarantee", "definitely tomorrow", "100% certain"]

payload = json.load(sys.stdin)
result  = payload.get("tool_result") or {}
# tool_result is wrapped {"content":[{"type":"text","text": json}]}
try:
    inner = json.loads(result["content"][0]["text"])
except Exception:
    json.dump(payload, sys.stdout); sys.exit(0)

body = (inner.get("body") or "").lower()
tier = inner.get("tier", "SMB")

violations = []
if tier == "ENTERPRISE":
    violations += [t for t in ENTERPRISE_BANNED if t in body]
violations += [t for t in OVERPROMISE if t in body]

if violations:
    # Rewrite the tool_result so the agent sees an error and re-drafts.
    err = {"error": "tone_violation", "violations": violations,
           "tier": tier, "advice": "Re-draft with formal/non-overpromising tone."}
    payload["tool_result"] = {"content": [{"type": "text",
                                            "text": json.dumps(err)}]}
json.dump(payload, sys.stdout)
"""Add to agent.py: in-process input validation via HookMatcher."""
import re
from claude_agent_sdk import HookMatcher

PO_RE  = re.compile(r"^PO-[0-9]{4}-[0-9]{4}$")
SKU_RE = re.compile(r"^[A-Z]{3}-[0-9]{4}$")

async def validate_input(input_data, tool_use_id, context):
    name = input_data.get("tool_name", "")
    args = input_data.get("tool_input", {}) or {}
    fail = None
    if name.endswith("get_order_details") and not PO_RE.match(args.get("po_number", "")):
        fail = f"Invalid PO format: {args.get('po_number')!r}"
    elif name.endswith(("check_contract_pricing", "query_inventory")) \
            and not SKU_RE.match(args.get("sku", "")):
        fail = f"Invalid SKU format: {args.get('sku')!r}"
    if fail:
        return {"hookSpecificOutput": {"hookEventName": "PreToolUse",
                                        "permissionDecision": "deny",
                                        "permissionDecisionReason": fail}}
    return {}

# Then update OPTIONS in agent.py:
OPTIONS = ClaudeAgentOptions(
    # ... existing fields ...
    hooks={"PreToolUse": [HookMatcher(matcher="mcp__ops__.*",
                                      hooks=[validate_input])]},
)
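Because an in-process hook is a plain coroutine that takes and returns dicts, it can be exercised without starting the agent. The sketch below uses a trimmed stand-in for validate_input (PO check only) and a made-up helper, decision_for, that feeds it hand-built input. Passing None for context is an assumption that holds only because this hook never reads it.

```python
"""Sketch: exercising a PreToolUse-style validation hook without the SDK."""
import asyncio
import re

PO_RE = re.compile(r"^PO-[0-9]{4}-[0-9]{4}$")


async def validate_po(input_data, tool_use_id, context):
    """Trimmed stand-in for validate_input: checks only the PO format."""
    name = input_data.get("tool_name", "")
    args = input_data.get("tool_input", {}) or {}
    if name.endswith("get_order_details") and not PO_RE.match(args.get("po_number", "")):
        return {"hookSpecificOutput": {
            "hookEventName": "PreToolUse",
            "permissionDecision": "deny",
            "permissionDecisionReason": f"Invalid PO format: {args.get('po_number')!r}",
        }}
    return {}


def decision_for(tool_name, tool_input):
    """Run the hook once and report 'deny' or 'allow' (hypothetical test helper)."""
    out = asyncio.run(validate_po(
        {"tool_name": tool_name, "tool_input": tool_input},
        tool_use_id="test-1",   # placeholder id
        context=None,           # assumption: the hook never touches context
    ))
    return out.get("hookSpecificOutput", {}).get("permissionDecision", "allow")
```

A quick check such as `decision_for("mcp__ops__get_order_details", {"po_number": "5678"})` should come back as "deny" before you wire the hook into OPTIONS.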

Create the supporting log_call.py and audit.py hooks — same stdin/stdout pattern:

"""hooks/log_call.py — PreToolUse: print to stderr, pass through stdout."""
import sys, json, datetime

payload = json.load(sys.stdin)
ts = datetime.datetime.utcnow().isoformat() + "Z"
print(f"[{ts}] PRE  {payload.get('tool_name','?')}({payload.get('tool_input',{})})",
      file=sys.stderr)
json.dump(payload, sys.stdout)   # pass-through
"""hooks/audit.py — PostToolUse: append a SOX-compliant record to audit_log.jsonl."""
import sys, json, datetime

payload = json.load(sys.stdin)
# Extract PO number from tool_input if present, else 'default'.
ti = payload.get("tool_input", {}) or {}
po = ti.get("po_number") or "default"
rec = {
    "ts": datetime.datetime.utcnow().isoformat() + "Z",
    "po_number": po,
    "tool": payload.get("tool_name"),
    "input": ti,
    "output_summary": str(payload.get("tool_result"))[:200],
}
with open("audit_log.jsonl", "a") as f:
    f.write(json.dumps(rec, default=str) + "\n")
json.dump(payload, sys.stdout)   # pass-through

Smoke-test each hook standalone:

# tone_check should reject ENTERPRISE drafts with banned tokens
echo '{"tool_name":"mcp__ops__draft_notification","tool_result":{"content":[{"type":"text","text":"{\"body\":\"Hi there team!\",\"tier\":\"ENTERPRISE\"}"}]}}' \
  | python hooks/tone_check.py
# Should rewrite tool_result to an error payload prompting redraft.

# audit should append a record keyed by po_number
echo '{"tool_name":"get_order_details","tool_input":{"po_number":"PO-2024-5678"},"tool_result":"ok"}' \
  | python hooks/audit.py
cat audit_log.jsonl | tail -1

Then run the agent end-to-end:

python -c "import asyncio, agent; print(asyncio.run(agent.run('Investigate PO-2024-5678 from Acme Corp')))"
✅ Checkpoint — Step 10
You should see (1) [timestamp] PRE log lines on stderr, (2) the agent's exception report on stdout with formal ENTERPRISE-tier draft, (3) audit_log.jsonl with entries keyed by PO. If you trigger a tone violation (inject "Hi there!" in your draft_notification mock), the tone_check hook rewrites the tool_result as an error and the agent redrafts.
Troubleshooting
  • Tone hook crashes on JSON parse → the SDK wraps tool results in {"content":[{"type":"text","text":...}]}. The text field is a JSON STRING that needs json.loads to inspect.
  • Agent ignores the rewritten error and uses the bad draft anyway → check that you assigned the new payload back: payload["tool_result"] = {"content": [...error payload...]} before json.dump.
  • audit_log.jsonl shows po_number "default" → only get_order_details has a po_number in tool_input. For the other tool calls (draft_notification, query_inventory) the PO has to come from somewhere else; the safest option is to extract it from the user's first message in agent.py rather than inside the hook.
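If you prefer to keep the lookup inside the hook, one workaround is to remember the last PO seen. File-based hooks run as separate processes, so the memory has to live on disk. The sketch below is an assumption-heavy illustration: the function name resolve_po and the state file are invented, and the approach only holds while one PO is investigated at a time.

```python
"""Sketch: resolving a PO number across separate hook processes (hypothetical)."""
import re
from pathlib import Path

PO_RE = re.compile(r"PO-\d{4}-\d{4}")


def resolve_po(tool_input: dict, state_path: Path) -> str:
    """Return the PO for this tool call, falling back to the last one seen."""
    # 1. Prefer any PO number that appears in this call's string arguments.
    for value in tool_input.values():
        if isinstance(value, str):
            match = PO_RE.search(value)
            if match:
                state_path.write_text(match.group(0))  # remember for later calls
                return match.group(0)
    # 2. Otherwise reuse the PO recorded by an earlier hook invocation.
    if state_path.exists():
        return state_path.read_text().strip() or "default"
    return "default"
```

In hooks/audit.py this would replace the `ti.get("po_number") or "default"` line, with a state path such as `Path(".last_po")`. Concurrent investigations would need a per-session key instead of a single file.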
Step 11: Sessions — Multi-PO + Fork
15 min · sessions.py
"""sessions.py — multi-PO via SDK resume tokens."""
from dataclasses import replace
from claude_agent_sdk import query, AssistantMessage
from agent import OPTIONS

SESSIONS: dict[str, str] = {}   # po_number -> resume token

async def _drive(prompt, options):
    parts, sid = [], None
    async for msg in query(prompt=prompt, options=options):
        if isinstance(msg, AssistantMessage):
            for block in msg.content:
                if getattr(block, "text", None): parts.append(block.text)
        s = getattr(msg, "session_id", None)
        if s: sid = s
    return "\n".join(parts), sid

async def chat(po_number: str, msg: str) -> str:
    resume = SESSIONS.get(po_number)
    options = replace(OPTIONS, resume=resume) if resume else OPTIONS
    text, sid = await _drive(msg, options)
    if sid: SESSIONS[po_number] = sid
    return text

async def what_if(po_number: str, hypothetical: str) -> str:
    """Fork: 'what if we expedite-ship from Newark instead?' Do NOT save the new sid."""
    resume = SESSIONS.get(po_number)
    # fork_session=True branches into a new session id, so the saved session stays untouched.
    options = replace(OPTIONS, resume=resume, fork_session=True) if resume else OPTIONS
    text, _ = await _drive(hypothetical, options)
    return text

Try it — multi-PO + fork demo:

python -c "
import asyncio
from sessions import chat, what_if

async def main():
    print('T1:', await chat('PO-2024-5678', 'PO-2024-5678 from Acme has a flagged exception. Investigate.'))
    print('FORK:', await what_if('PO-2024-5678', 'What if we expedite-ship from Newark instead of waiting for Memphis?'))
    print('T2:', await chat('PO-2024-5678', 'Stick with the original plan. Draft the customer notification.'))

asyncio.run(main())
"
✅ Checkpoint — Step 11
T1 produces a PARTIAL exception report with Memphis stockout root cause. FORK explores the Newark expedite scenario WITHOUT polluting the main session. T2 references the original plan (not the Newark hypothetical) and produces the formal ENTERPRISE-tier draft. If T2 references Newark, your what_if is leaking into SESSIONS.
Step 12: Slash Commands
15 min · .claude/commands/*.md (3 files)

What & Why: /run-exception PO-2024-5678, /test-agent, /eval-agent turn the agent into a one-line workflow. Ops reviewers can investigate exceptions without leaving Claude Code.

Create 3 files in .claude/commands/:

---
description: Investigate an order exception by PO number
argument-hint: [po_number]
---
Look up $ARGUMENTS in mock_data/orders.json. Build the question string.
Run `python -c "import asyncio, agent; print(asyncio.run(agent.run(q)))"`
where q is the question. Print the exception report, draft, total tokens,
total cost from the ResultMessage emitted by query().
---
description: Run the unit test suite for the order-exception agent
---
Run `pytest tests/ -v`. Critical tests: test_partial_delivery_resolution
(PO-2024-5678 must identify Memphis stockout + Newark availability),
test_tone_check_rejects_bad_enterprise_draft (tone hook works),
test_audit_keyed_by_po (every audit line has the right po_number).
---
description: Run the 10-PO evaluation suite
---
Read test_scenarios.json (10 POs: 4 DELAYED, 3 PARTIAL, 3 PRICING). For each,
call agent.run(), score on: correct exception_type, root cause cited, tier-
appropriate tone, no overpromise. Report per-case score and overall percentage.
✅ Checkpoint — Step 12
When you type / in Claude Code, the three commands appear in autocomplete. /run-exception PO-2024-5678 investigates and drafts the formal ENTERPRISE notification. /eval-agent requires a test_scenarios.json file — in Iter 3 the spec generates this for you.
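The four scoring criteria in /eval-agent can be pinned down as a small function, which is useful if you want repeatable scores instead of asking Claude Code to judge each case. The sketch below is only one possible reading. The scenario field names (exception_type, root_cause_keywords, tier) are assumptions about test_scenarios.json, and the token lists are copied from hooks/tone_check.py.

```python
"""Sketch: scoring one evaluation case (scenario field names are assumptions)."""

ENTERPRISE_BANNED = ("hi there", "hey there", "no worries", "super sorry")
OVERPROMISE = ("guarantee", "definitely tomorrow", "100% certain")


def score_case(expected: dict, report: str, draft_body: str) -> dict:
    """Return a pass/fail flag for each of the four criteria."""
    text, body = report.lower(), draft_body.lower()
    casual = any(t in body for t in ENTERPRISE_BANNED)
    return {
        "exception_type": expected["exception_type"].lower() in text,
        "root_cause": all(k.lower() in text for k in expected["root_cause_keywords"]),
        "tone": not (expected["tier"] == "ENTERPRISE" and casual),
        "no_overpromise": not any(t in body for t in OVERPROMISE),
    }


def overall_percentage(case_scores: list) -> float:
    """Share of passed checks across all cases, as a percentage."""
    checks = [ok for scores in case_scores for ok in scores.values()]
    return 100.0 * sum(checks) / len(checks) if checks else 0.0
```

Keyword matching is a blunt instrument: it can confirm that "Memphis" and "stockout" appear in the report, but not that the reasoning connecting them is sound, so treat the percentage as a regression signal rather than a quality grade.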
Step 13: Deploy via Claude Code
15 min · server.py + Dockerfile

What & Why: Same FastAPI + Docker pattern as Iter 1, but Claude Code writes it.

> Create server.py and Dockerfile. Endpoints: GET /health,
> POST /exception (single-shot), POST /chat (po_number + message).
> Async FastAPI handlers awaiting agent.run() and sessions.chat() (both
> async coroutines from claude-agent-sdk). python:3.11-slim base, install
> claude-agent-sdk + dependencies, expose 8000. Mount .claude/ into the
> container so settings.json + hook scripts resolve at runtime.
"""server.py — async FastAPI wrapper around the SDK order-exception agent."""
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from agent import run as agent_run
from sessions import chat as session_chat

app = FastAPI(title="Order Exception Agent (Iter 2 — SDK)")

class Q(BaseModel):
    question: str


class C(BaseModel):
    po_number: str
    message: str


@app.get("/health")
def health():
    return {"status": "ok", "iter": 2}


@app.post("/exception")
async def exception_ep(q: Q):
    # Single-shot investigation: one question in, one report out.
    try:
        return {"report": await agent_run(q.question)}
    except Exception as e:
        raise HTTPException(500, str(e)) from e


@app.post("/chat")
async def chat_ep(c: C):
    # Multi-turn: sessions.chat() resumes the session stored for this PO.
    try:
        return {"answer": await session_chat(c.po_number, c.message)}
    except Exception as e:
        raise HTTPException(500, str(e)) from e
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
ENV PYTHONUNBUFFERED=1
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]

Run locally first, then Docker:

uvicorn server:app --reload --port 8000
# In another terminal:
curl localhost:8000/health
curl -X POST localhost:8000/exception -H "Content-Type: application/json" \
     -d '{"question":"Investigate PO-2024-5678"}'

# Then build the container:
docker build -t iter2-b .
docker run --rm -p 8000:8000 -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY iter2-b
🎉 Iteration 2 Complete — End-to-End Verification

You should now have ~14 files in your project:

agent-iter2-sdk/
│── CLAUDE.md
│── agent.py            # query() + 5 @tool functions + create_sdk_mcp_server
│── sessions.py
│── mock_data/ ×5       # orders, tracking, contracts, inventory, customers
│── .claude/
│   │── settings.json
│   │── commands/run-exception.md, test-agent.md, eval-agent.md
│── hooks/
│   │── log_call.py, tone_check.py, audit.py
│── server.py | Dockerfile | requirements.txt

Acceptance test — same shape as Iter 1:

curl -s -X POST localhost:8000/exception -H "Content-Type: application/json" \
     -d '{"question":"Investigate PO-2024-5678"}' | python -m json.tool

curl -s -X POST localhost:8000/chat -H "Content-Type: application/json" \
     -d '{"po_number":"PO-2024-5678","message":"Investigate."}' | python -m json.tool
curl -s -X POST localhost:8000/chat -H "Content-Type: application/json" \
     -d '{"po_number":"PO-2024-5678","message":"What if we expedite from Newark?"}' | python -m json.tool

# Tone enforcement: check audit log shows ENTERPRISE drafts only
docker exec $(docker ps -q --filter ancestor=iter2-b) cat audit_log.jsonl | grep ENTERPRISE

You pass Iter 2 if: all 3 outputs are functionally equivalent to Iter 1 but with ~120 lines of code instead of ~250.

Iteration 2 Complete
Same curl as Iter 1; same exception report and tier-appropriate draft.
Iteration 2 Metrics
  • Files: ~10 (CLAUDE.md, agent.py, sessions.py, server.py, Dockerfile, .claude/settings.json, hooks/{log_call,tone_check,audit}.py, slash commands) + 5 mock JSON
  • Lines you wrote: ~120
  • Time: ~2 hours
  • Abstractions used: claude-agent-sdk (query / @tool / MCP server / ClaudeAgentOptions / HookMatcher) + .claude/settings.json + Claude Code

Debugging in Iteration 2: Hooks + Console Web UI + Langfuse

The SDK abstracts the loop — you cannot drop a print inside it. You debug from the outside.

A. Hooks as Debug Probes

Swap .claude/settings.json to a debug variant for diagnostic runs (the script writes to stderr and passes the original payload through stdout):

"""hooks/debug.py — pretty-print to stderr, pass payload through."""
import sys, json
payload = json.load(sys.stdin)
print(f"[HOOK {payload.get('hook_event_name','?')}] "
      f"{payload.get('tool_name','?')}: "
      f"{json.dumps(payload.get('tool_input', payload.get('tool_result')))[:200]}",
      file=sys.stderr)
json.dump(payload, sys.stdout)
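Debug variant of .claude/settings.json (each matcher entry wraps its commands in a "hooks" array of type "command"):
{
  "hooks": {
    "PreToolUse": [
      { "matcher": "*", "hooks": [{ "type": "command", "command": "python hooks/debug.py" }] }
    ],
    "PostToolUse": [
      { "matcher": "*", "hooks": [{ "type": "command", "command": "python hooks/debug.py" }] }
    ]
  }
}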

BETTER than Iter-1 print statements because hooks are modular — symlink .claude/settings.json to the debug variant for one run, then back to production.
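Because a hook is only a stdin-to-stdout script, you can exercise a probe without running the agent at all. The sketch below reimplements the probe as a function taking file-like objects so it can be driven from a test. The sample payload is hypothetical and contains only the fields that debug.py reads.

```python
"""Sketch: drive a debug probe with a sample payload, no agent required."""
import io
import json


def probe(stdin, stdout, stderr) -> None:
    """Same shape as hooks/debug.py: summary to stderr, payload echoed to stdout."""
    payload = json.load(stdin)
    detail = json.dumps(payload.get("tool_input"))[:200]
    print(f"[HOOK {payload.get('hook_event_name', '?')}] "
          f"{payload.get('tool_name', '?')}: {detail}", file=stderr)
    json.dump(payload, stdout)


# Hypothetical payload, shaped like the fields debug.py looks up.
sample = {
    "hook_event_name": "PreToolUse",
    "tool_name": "mcp__ops__query_inventory",
    "tool_input": {"sku": "GSK-1175", "warehouse": "Newark"},
}

out, err = io.StringIO(), io.StringIO()
probe(io.StringIO(json.dumps(sample)), out, err)
print(err.getvalue().strip())
assert json.loads(out.getvalue()) == sample  # payload passed through unchanged
```

The same check works against the real script by piping a JSON file into `python hooks/debug.py` from the shell.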

B. Anthropic Console Web UI

console.anthropic.com shows every API call. Worked example: Agent recommends a credit but inventory had stock. Console → Logs → failed call → tool_use shows the agent never called query_inventory. Fix: strengthen system prompt to require ALL 4 investigation tools before draft_notification. Re-run; ~3 min vs ~15 in Iter 1.

C. Langfuse Traces (if instrumented in M19)

Each PO becomes a waterfall trace: 5 tool calls, each a span. Compare a 2-second simple delay to a 12-second pricing-discrepancy investigation and see exactly which tool took the time (usually check_contract_pricing when contracts have many tier rules).
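Reading a waterfall by eye amounts to finding the span with the largest duration. As a rough sketch, assuming each span has been exported as a dict with `name`, `start_s`, and `end_s` keys (a hypothetical shape, not Langfuse's export format) and using made-up timings for a 12-second run:

```python
"""Sketch: find the slowest tool span in one trace (span dicts are hypothetical)."""


def duration(span: dict) -> float:
    return span["end_s"] - span["start_s"]


def slowest_span(spans: list[dict]) -> dict:
    """Return the span that accounts for the most wall-clock time."""
    return max(spans, key=duration)


# Illustrative timings only; real values come from your own traces.
trace = [
    {"name": "get_order_details",      "start_s": 0.0,  "end_s": 1.0},
    {"name": "track_shipment",         "start_s": 1.0,  "end_s": 3.0},
    {"name": "check_contract_pricing", "start_s": 3.0,  "end_s": 9.5},
    {"name": "query_inventory",        "start_s": 9.5,  "end_s": 10.5},
    {"name": "draft_notification",     "start_s": 10.5, "end_s": 12.0},
]

worst = slowest_span(trace)
print(worst["name"], duration(worst))
```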

D. /run-exception --verbose

Verbose mode turns on debug hooks for one run. Shows every tool call, hook fire, token count, total cost.

SESSION 3

Iteration 3: Spec-Driven

You stop writing code. You write a spec describing what the order-exception agent should do, then ask Claude Code to build it. ~18 files appear. The spec IS the documentation ops leadership can read.

~1 hour · ~100 lines of spec · ~18 generated files · 0 lines of agent code
Step 14: Write agent-spec.md (12 sections)
30 min · agent-spec.md
# Agent Specification: Order Exception Investigation Agent

## 1. Overview
Investigates flagged POs, identifies exception type, gathers evidence from
ERP/carrier/contract/inventory systems, drafts tier-appropriate customer
notification with proposed resolution.

## 2. Configuration
- Model: claude-sonnet-4-6
- Framework: claude-agent-sdk (Python). Tools registered via
  create_sdk_mcp_server. Driver: query() with ClaudeAgentOptions.
- Max turns: 12, Max tokens: 4096

## 3. Tools (registered as MCP server "ops", call in this order)

### mcp__ops__get_order_details
- @tool with schema {po_number: str (regex ^PO-[0-9]{4}-[0-9]{4}$)}
- Returns line items, status, tracking_numbers, customer_id, exception_flagged

### mcp__ops__track_shipment
- {tracking_number: str}
- Returns {carrier, status, delivered_at?, history[]}

### mcp__ops__check_contract_pricing
- {customer_id: str, sku: str (regex ^[A-Z]{3}-[0-9]{4}$)}
- Returns {contract_unit_price, volume_tier_min_qty, effective, expires}

### mcp__ops__query_inventory
- {sku: str, warehouse: str}
- Returns {qty_available, qty_reserved}

### mcp__ops__draft_notification
- {customer_id: str, exception_type (DELAYED|PARTIAL|PRICING|QUALITY_HOLD),
   resolution: str}
- Returns {to, tier, salutation, body}
- Tone rules: ENTERPRISE = formal/terse; SMB = warm/personal.

## 4. System Prompt (passed to ClaudeAgentOptions.system_prompt)
For every flagged PO follow:
1. get_order_details
2. track_shipment for each tracking_number returned
3. check_contract_pricing for any line with potential discrepancy
4. query_inventory for any unshipped line
5. draft_notification (tier-aware tone)

NEVER draft before completing relevant investigation. NEVER overpromise.
Match tier from get_order_details → customer_id → tier.

## 5. Hooks

In-process via ClaudeAgentOptions.hooks (HookMatcher):
- PreToolUse matcher "mcp__ops__get_order_details": deny if PO format invalid
- PreToolUse matcher "mcp__ops__check_contract_pricing|mcp__ops__query_inventory":
    deny if sku format invalid

File-based via .claude/settings.json:
- PreToolUse  matcher "*"                            command "python hooks/log_call.py"
- PostToolUse matcher "mcp__ops__draft_notification" command "python hooks/tone_check.py"
- PostToolUse matcher "*"                            command "python hooks/audit.py"

ENTERPRISE banned tokens: "hi there", "hey there", "no worries", "super sorry"
Overpromise tokens (any tier): "guarantee", "definitely tomorrow", "100% certain"
tone_check.py rewrites tool_result to a structured error so the agent re-drafts.

## 6. Sessions
Multi-PO via SDK resume tokens (sessions.py keeps po_number -> session_id map).
what_if() forks by re-using a resume token without persisting the new session_id.

## 7. Mock Data
- mock_data/orders.json: 10 POs with 3 exception types
- mock_data/tracking.json: 8 tracking records (UPS/FedEx/DHL)
- mock_data/contracts.json: 5 customer-SKU contracts with volume tiers
- mock_data/inventory.json: 6 SKUs across 3 warehouses
- mock_data/customers.json: 5 customers (3 ENTERPRISE, 2 SMB)

## 8. API Wrapper
FastAPI async handlers: GET /health, POST /exception, POST /chat.

## 9. Deployment
Tier 1: Docker. Mount .claude/ into container. Audit log on mounted volume.

## 10. Tests
test_tools.py, test_agent.py (PO-2024-5678 PARTIAL, cites Newark warehouse),
test_hooks.py (tone_check rewrites bad ENTERPRISE drafts to errors),
test_sessions.py (resume), test_api.py.

## 11. Evaluation
spec/test_scenarios.json with 10 POs: 4 DELAYED, 3 PARTIAL, 3 PRICING.
Score on: correct exception_type, root cause cited, tier-appropriate tone,
no overpromise.

## 12. File Structure
order-exception-agent/
│── spec/agent-spec.md          # this file
│── spec/test_scenarios.json
│── CLAUDE.md
│── .claude/settings.json       # PreToolUse/PostToolUse matchers
│── .claude/commands/ {run-exception.md, test-agent.md, eval-agent.md}
│── agent.py                    # query() + 5 @tool + create_sdk_mcp_server
│── sessions.py                 # resume-token sessions
│── mock_data/ {orders, tracking, contracts, inventory, customers}.json
│── hooks/{log_call,tone_check,audit}.py
│── server.py | Dockerfile | docker-compose.yml | requirements.txt
│── tests/ x5
│── appendix/manual-loop.py     # Iter-1 reference, not for production
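Outside the spec file itself, the format checks that sections 3 and 5 describe reduce to two regular expressions. The sketch below shows one way the deny logic could look. `deny_reason` is a hypothetical helper, and wiring it into a HookMatcher callback is left to the generated agent.py, since that callback signature is not shown here.

```python
"""Sketch of the PO/SKU format checks from spec sections 3 and 5."""
import re
from typing import Optional

PO_RE = re.compile(r"^PO-[0-9]{4}-[0-9]{4}$")
SKU_RE = re.compile(r"^[A-Z]{3}-[0-9]{4}$")

SKU_TOOLS = ("mcp__ops__check_contract_pricing", "mcp__ops__query_inventory")


def deny_reason(tool_name: str, tool_input: dict) -> Optional[str]:
    """Return a denial message, or None when the call may proceed."""
    if tool_name == "mcp__ops__get_order_details":
        # fullmatch also rejects a trailing newline that `$` alone would allow
        if not PO_RE.fullmatch(str(tool_input.get("po_number", ""))):
            return "po_number must look like PO-2024-5678"
    if tool_name in SKU_TOOLS:
        if not SKU_RE.fullmatch(str(tool_input.get("sku", ""))):
            return "sku must be three capital letters, a hyphen, and four digits"
    return None


print(deny_reason("mcp__ops__get_order_details", {"po_number": "PO-2024-5678"}))
print(deny_reason("mcp__ops__query_inventory", {"sku": "gsk-1175"}))
```

Tools without a format rule, such as track_shipment, fall through and are allowed.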
Step 15: One Command Generates Everything
10 min · ~18 files appear
> /generate-from-spec spec/agent-spec.md

# (or, if /generate-from-spec is not installed:)
> Read spec/agent-spec.md and build the entire project. Create every file
> in section 12. Use the `claude-agent-sdk` package (NOT a fictional
> anthropic.Agent class). Tools must be @tool-decorated async functions
> registered via create_sdk_mcp_server. Hooks split between
> .claude/settings.json (file-based log/tone/audit) and OPTIONS.hooks
> (HookMatcher in-process validation) per section 5. Generate realistic
> mock B2B data with 10 POs covering all 3 exception types. Also generate
> appendix/manual-loop.py as an under-the-hood reference.

What you should see Claude Code create:

Reading spec/agent-spec.md ...
Created agent.py (98 lines: 5 @tool functions + create_sdk_mcp_server + ClaudeAgentOptions)
Created sessions.py (32 lines)
Created hooks/log_call.py (12 lines)
Created hooks/tone_check.py (28 lines: banned tokens + overpromise check)
Created hooks/audit.py (18 lines)
Created mock_data/orders.json (10 POs across 3 exception types)
Created mock_data/tracking.json (8 carrier records)
Created mock_data/contracts.json (5 customer-SKU contracts)
Created mock_data/inventory.json (6 SKUs × 3 warehouses)
Created mock_data/customers.json (5 customers: 3 ENTERPRISE, 2 SMB)
Created server.py (24 lines: async FastAPI)
Created Dockerfile (8 lines)
Created docker-compose.yml (10 lines)
Created requirements.txt (4 lines)
Created CLAUDE.md (38 lines)
Created .claude/settings.json (15 lines)
Created .claude/commands/run-exception.md, test-agent.md, eval-agent.md
Created tests/test_tools.py, test_agent.py, test_hooks.py, test_sessions.py, test_api.py
Created spec/test_scenarios.json (10 scenarios)
Created appendix/manual-loop.py (~80 lines)
Total: 24 files, ~545 generated lines + 100 lines of spec you wrote.

Verify it actually works:

pytest tests/ -v

Expected pytest output (focus on the B2B-critical tests):

tests/test_tools.py::test_get_order_details_returns_acme_partial PASSED
tests/test_tools.py::test_query_inventory_newark_has_gsk_1175 PASSED
tests/test_agent.py::test_partial_delivery_resolution_cites_newark PASSED
tests/test_agent.py::test_enterprise_draft_uses_formal_tone PASSED
tests/test_hooks.py::test_tone_check_rejects_bad_enterprise_draft PASSED   # <-- B2B-critical
tests/test_hooks.py::test_overpromise_tokens_rejected PASSED               # <-- B2B-critical
tests/test_hooks.py::test_invalid_po_format_denied_by_validator PASSED
tests/test_sessions.py::test_chat_persists_po_context PASSED
tests/test_sessions.py::test_what_if_does_not_pollute_main_po PASSED
tests/test_api.py::test_health_returns_ok PASSED
=========================== 10 passed in 13.56s ============================
✅ Checkpoint — Step 15
All 10 tests pass on the first run. If test_tone_check_rejects_bad_enterprise_draft fails, the generated tone_check hook probably checked tone case-sensitively (so "Hi There" passes when "hi there" is banned). Fix the spec section 5 to clarify "case-insensitive substring match" and regenerate.
Troubleshooting
  • Generated tone hook is case-sensitive → spec ambiguity. Section 5 must say: "compare body.lower() against the banned tokens list."
  • Generated mock_data/customers.json doesn't have ENTERPRISE/SMB tier field → section 7 of spec should give a sample record per file.
  • tone_check rewrites the draft as a 200-line error → section 5 should constrain the error format: "Return error payload as JSON object with keys violations, tier, advice."
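The three spec clarifications above fit in a few lines of code. This is a minimal sketch of the check a regenerated hooks/tone_check.py might contain, assuming the tier and draft body have already been extracted from the hook payload; `tone_violations` and `tone_error` are hypothetical names, and the advice string is illustrative.

```python
"""Sketch of a case-insensitive tone check with a compact error payload."""
from typing import Optional

ENTERPRISE_BANNED = ("hi there", "hey there", "no worries", "super sorry")
OVERPROMISE = ("guarantee", "definitely tomorrow", "100% certain")


def tone_violations(tier: str, body: str) -> list[str]:
    """Case-insensitive substring match: compare body.lower() to each token."""
    lowered = body.lower()
    found = [tok for tok in OVERPROMISE if tok in lowered]  # applies to any tier
    if tier == "ENTERPRISE":
        found += [tok for tok in ENTERPRISE_BANNED if tok in lowered]
    return found


def tone_error(tier: str, body: str) -> Optional[dict]:
    """Return the short error object the spec asks for, or None if the draft is clean."""
    violations = tone_violations(tier, body)
    if not violations:
        return None
    return {
        "violations": violations,
        "tier": tier,
        "advice": "Re-draft without the listed phrases.",  # illustrative wording
    }


print(tone_error("ENTERPRISE", "Hi There, No Worries! We will update you Friday."))
print(tone_error("SMB", "Hi there, your order ships Friday."))
```

Substring matching means "guarantee" also catches "guaranteed", which is probably what you want for an overpromise check but worth stating in the spec.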
Step 16: Review & Iterate on the Spec
15 min · spec edits + targeted regen

Example: add a sixth tool get_carrier_alternatives(origin, destination, sku, qty) that suggests faster carriers when the current one is delayed. Edit spec sections 3 and 11 and ask Claude Code: "I added get_carrier_alternatives. Update agent.py, mock_data, add a test, and an eval scenario."
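To make the request concrete, here is roughly what the core of the new tool might look like before Claude Code wraps it with @tool. Everything in this sketch is an assumption: the lane table, the destination, the transit times, and the return shape are invented for illustration, and only the function name and its four parameters come from the example above.

```python
"""Sketch of the sixth tool's core logic; the carrier data below is made up."""

# Hypothetical mock table: carrier options per (origin, destination) lane.
MOCK_LANES = {
    ("Newark", "Chicago"): [
        {"carrier": "UPS", "service": "Ground", "transit_days": 3},
        {"carrier": "FedEx", "service": "2Day", "transit_days": 2},
        {"carrier": "DHL", "service": "Express", "transit_days": 1},
    ],
}


def get_carrier_alternatives(origin: str, destination: str, sku: str, qty: int) -> dict:
    """Return known carrier options for the lane, fastest first."""
    options = sorted(
        MOCK_LANES.get((origin, destination), []),
        key=lambda option: option["transit_days"],
    )
    return {"sku": sku, "qty": qty, "options": options}


result = get_carrier_alternatives("Newark", "Chicago", "GSK-1175", 40)
print([f'{o["carrier"]} {o["service"]}' for o in result["options"]])
```

An unknown lane returns an empty options list rather than raising, so the agent can report "no alternatives found" instead of failing the turn.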

Step 17: Deploy & Compare
15 min · same curl, same output

What & Why: Build, run, curl, compare against Iter 1 and Iter 2.

docker compose up --build -d
docker compose ps   # confirm "Up"

curl localhost:8000/health
# Expected: {"status":"ok","iter":3}

curl -X POST localhost:8000/exception \
     -H "Content-Type: application/json" \
     -d '{"question":"Investigate PO-2024-5678"}' | python -m json.tool

Cross-iteration diff — the punchline:

# Same query against all three deployments (assumes :8001/8002/8003)
for port in 8001 8002 8003; do
  echo "=== Iter on :$port ==="
  curl -s -X POST localhost:$port/exception -H "Content-Type: application/json" \
       -d '{"question":"Investigate PO-2024-5678"}' \
    | python -c "import json, sys; d = json.load(sys.stdin); print((d.get('report') or d.get('answer'))[:300])"
  echo
done

Expected: all three return the PARTIAL exception report with Memphis stockout root cause, Newark availability, and a formal ENTERPRISE-tier draft. Wording varies; facts match.
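Since wording varies between runs, comparing the three responses means checking facts rather than diffing strings. A minimal sketch, assuming you have saved each report as text; the excerpts below are hypothetical, and keyword matching is only a rough stand-in for a real fact check.

```python
"""Sketch: check that each iteration's report states the same key facts."""
REQUIRED_FACTS = ("partial", "memphis", "newark")


def missing_facts(report: str) -> list[str]:
    """Return the required facts that do not appear in the report (case-insensitive)."""
    lowered = report.lower()
    return [fact for fact in REQUIRED_FACTS if fact not in lowered]


# Hypothetical excerpts keyed by port; real text comes from the curl loop above.
reports = {
    8001: "Exception type: PARTIAL. Root cause: Memphis stockout. Newark has stock.",
    8002: "PARTIAL delivery caused by a stockout at Memphis; ship the balance from Newark.",
    8003: "Partial shipment. Memphis was out of stock.",
}

for port, text in reports.items():
    gaps = missing_facts(text)
    print(port, "OK" if not gaps else f"missing: {gaps}")
```

In this made-up sample the third report omits the Newark availability, which is the kind of factual gap the cross-iteration check is meant to surface.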

🎉 Iteration 3 Complete — The Whole Capstone

You have ~24 generated files + your ~100-line spec:

order-exception-agent/
│── spec/agent-spec.md          # YOU wrote this
│── spec/test_scenarios.json    # generated
│── CLAUDE.md | .claude/settings.json | .claude/commands/ ×3
│── agent.py | sessions.py      # generated (SDK)
│── hooks/log_call.py | hooks/tone_check.py | hooks/audit.py
│── mock_data/ ×5               # 10 POs, 8 tracking, 5 contracts, 6 SKUs, 5 customers
│── tests/ ×5                    # generated
│── server.py | Dockerfile | docker-compose.yml | requirements.txt
│── appendix/manual-loop.py     # under-the-hood reference

Acceptance test — same shape as Iter 1 and Iter 2:

# 1. All 10 generated tests pass
pytest tests/ -v

# 2. End-to-end exception investigation
curl -s -X POST localhost:8000/exception -H "Content-Type: application/json" \
     -d '{"question":"Investigate PO-2024-5678"}' | python -m json.tool

# 3. Multi-PO session via SDK resume tokens
curl -s -X POST localhost:8000/chat -H "Content-Type: application/json" \
     -d '{"po_number":"PO-2024-5678","message":"Investigate."}' | python -m json.tool
curl -s -X POST localhost:8000/chat -H "Content-Type: application/json" \
     -d '{"po_number":"PO-2024-5678","message":"What if we expedite from Newark?"}' | python -m json.tool

# 4. Spec-vs-code drift check (B2B-specific)
claude
> Read spec/agent-spec.md sections 4 and 5. Compare to agent.py and hooks/.
> Check that ENTERPRISE banned tokens are checked case-insensitively.

You pass Iter 3 if: (1) all 10 tests pass; (2) PO-2024-5678 returns the PARTIAL report; (3) the second /chat call references prior context; (4) drift check returns "no drift".

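Iteration 3 Metrics
  • Lines you wrote: ~100 (spec only; 0 lines of agent code)
  • Time: ~1 hour
  • Abstractions used: agent-spec.md + Claude Code generating the claude-agent-sdk project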

Debugging in Iteration 3: Spec Comparison + Tests + Evals

You debug at the spec level. The code is regenerable; the spec is not.

A. Spec vs Code Comparison
> Read spec/agent-spec.md and compare it to the generated agent.py and hooks/*.py.
> Report any deviations — especially around tone enforcement (section 5).
B. Test-Driven Debugging
> Test test_partial_delivery_resolution failed: expected resolution to mention
> Newark warehouse, got generic "we apologize" text. Read agent-spec.md
> section 4 (System Prompt). Read generated agent.py. Find why the agent
> skipped query_inventory and fix it.
C. Eval-Driven Debugging

Run /eval-agent. Output: PO-2024-5683 scored 2/4: "draft tone too casual for ENTERPRISE customer." Fix: strengthen spec section 5 with the explicit banned-token list. Regenerate hooks/tone_check.py. Re-eval → 4/4.

D. Console + Langfuse (same as Iter 2)
The Debugging Evolution Summary
| Iteration | Primary debug method | Secondary | Speed to fix |
|---|---|---|---|
| 1 Raw | print() in the loop | Manual message inspection | Slow (find the line) |
| 2 SDK | Hooks + Console Web UI | Langfuse traces | Medium (modular probes) |
| 3 Spec | Spec vs code comparison | Tests + evals + Console | Fast (Claude Code finds it) |

The Comparison Table

Same order-exception agent, same business question. Here is everything that changed:

| Metric | Iter 1: Raw API | Iter 2: Agent SDK | Iter 3: Spec-Driven |
|---|---|---|---|
| Lines YOU wrote | ~250 | ~120 | ~100 (spec only) |
| Time to build | ~3 hours | ~2 hours | ~1 hour |
| Agent output | Baseline | Same | Same |
| Tone enforcement | Inline check after draft | One .claude/settings.json entry + a 25-line stdin/stdout script | 5 lines in spec section 5 |
| Multi-PO sessions | Manual history dict | SDK sessions | SDK sessions (generated) |
| Adding alternative-carrier tool | Edit 3 files + add validation | One Claude Code prompt | Update spec, ask to regen |
| Tests (incl. tone) | Manual | Claude Code generated | Spec generates them |
| Documentation for ops leadership | Separate | CLAUDE.md | Spec IS the doc ops can read |
| Control over internals | Full | SDK-managed | Least direct (but reviewable) |
| Understanding needed | Every line | SDK abstractions | Architecture-level |
| Debugging | print() in loop | Hooks + Console + Langfuse | Spec compare + tests + evals |
| Onboarding a new ops engineer | Read 7 files | Read CLAUDE.md + 8 files | Read 1 spec file |
Key Takeaway

All three produce the same exception report and tier-appropriate draft. The difference is what each iteration teaches you. Iteration 1 teaches WHAT an agent is. Iteration 2 teaches HOW to build efficiently. Iteration 3 teaches HOW production teams work — especially for B2B where the spec doubles as the contract-aware playbook ops can review.

Grading Rubric

  • Iteration 1: Raw API Loop (25%). Working tool-use loop with all 5 tools. Tone check on draft_notification (rejects ENTERPRISE drafts containing casual tokens). Validation regexes for PO and SKU. Audit log keyed by po_number. Multi-PO sessions. FastAPI + Docker deployment. Curl on PO-2024-5678 returns PARTIAL with inventory-based resolution.
  • Iteration 2: Agent SDK + Claude Code (25%). Real claude-agent-sdk with @tool-decorated async functions registered via create_sdk_mcp_server for all 5 tools. Tone hook (in .claude/settings.json) rewrites bad ENTERPRISE drafts to errors so the agent re-drafts. In-process HookMatcher validates PO/SKU formats. Sessions resumed via SDK resume token (what-if expedite from Newark demoed). At least 3 slash commands. Same FastAPI + Docker. CLAUDE.md present with compliance rules.
  • Iteration 3: Spec-Driven (25%). Complete 12-section agent-spec.md. Single Claude Code prompt generates all ~18 files. Tests pass on first or second iteration (especially test_tone_check). At least one targeted spec edit + regen demonstrated. Same FastAPI + Docker deployment.
  • Comparison Table, honest metrics (15%). Filled-in copy with YOUR actual numbers. Honest reporting. Identical curl outputs (modulo non-determinism) across all three iterations on the same 10 test POs.
  • Reflection write-up (10%). 200–400 words: hardest iteration and why; default for production exception handling; one concrete situation per iteration. Bonus for honest reflection on tone-enforcement surprises.
Passing Threshold

720/1000 to pass. All three iterations must run; the comparison table must use your actual numbers; tone enforcement must work in all three iterations on the same set of test POs.
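To estimate where you stand, the rubric weights can be turned into points. The sketch below assumes the 1000-point scale is simply the percentage weights multiplied by ten, which the rubric implies but does not state. The component keys and the sample marks are hypothetical.

```python
"""Sketch of the rubric arithmetic: weights sum to 100, pass mark is 720/1000."""
WEIGHTS = {"iter1": 25, "iter2": 25, "iter3": 25, "comparison": 15, "reflection": 10}
PASS_MARK = 720


def total_points(percent_earned: dict) -> int:
    """percent_earned maps each component to an integer mark from 0 to 100."""
    # weight (max 25) * mark (max 100) sums to at most 10000; divide by 10 for /1000
    return sum(WEIGHTS[name] * percent_earned[name] for name in WEIGHTS) // 10


# Hypothetical self-assessment, not real grading data.
marks = {"iter1": 90, "iter2": 80, "iter3": 70, "comparison": 60, "reflection": 50}
points = total_points(marks)
print(points, "pass" if points >= PASS_MARK else "below threshold")  # 740 pass
```

The points total is only part of the threshold: the run, honest-numbers, and tone-enforcement conditions in the same paragraph still apply whatever the score.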

Reflection Prompts

Answer in REFLECTION.md. 200–400 words total.

  1. Which iteration was hardest for YOU specifically, and why?
  2. Which iteration would you default to for production B2B order ops? Consider: the need for ops leadership to read and approve playbooks, customer-tier audit requirements, and pace of contract changes.
  3. Give one concrete situation where each iteration is the right choice.
  4. Was the tone-enforcement hook harder or easier than you expected? What would you change about the banned-token approach?
  5. What would you change about the spec format if you were writing it from scratch for B2B ops? Add a "tier-policy" section? An "SLA" section?

Knowledge Check

Q1: All three iterations produce the same exception report. What is the most defensible reason to still go through Iteration 1 rather than skipping straight to Iteration 3?

Q2: Your agent drafts an SMB-tone notification ("Hi there!") for an ENTERPRISE customer. What is the FASTEST debugging path in Iteration 2?

Q3: In Iteration 3, an eval case fails because the agent skipped query_inventory and recommended a credit when stock was actually available in another warehouse. The CORRECT fix is:

Q4: The most common Iteration 1 bug in the order-exception scenario is:

Q5: You need to add an "expedite shipping suggestion" capability to all three iterations. Which iteration requires the LEAST disruptive change?

Q6: When would you NOT pick Iteration 3 (spec-driven) for a real production order-exception system?

Q7: An ENTERPRISE-tier draft contains "no worries, we'll get this sorted." The tone-check hook should:

Going Further (Optional)

  1. Build all three domains. After completing B, write specs for Domain A and Domain C. The spec format generalizes.
  2. Add HITL to Iter 3. Extend the spec with a "Human Review" section that triggers when proposed credit > $50K or expedite cost exceeds margin. Ops manager dashboard surfaces the case.
  3. Cloud deployment. Take the Iter 3 generated code and ship it to GCP Cloud Run with Pub/Sub for order events (M22B Tier 2). The spec did not need to change — only the deployment section.
  4. Carrier integration. Replace the mock track_shipment with real UPS/FedEx APIs. Add retry + circuit breaker for the live carrier endpoint.
  5. Multi-agent ops pipeline. Extend to CAPSTONE-4's 4-agent pattern: Intake, Investigation, Decision (auto/HITL), Communication. Single spec for all four.
  6. Tier-policy linting. Write a slash command /lint-tiers that flags drafts where banned tokens slipped past the hook in production logs.
Course Complete

This is the final capstone. You have now built agents seven different ways. Time to ship.
