Capstone 7-B — Agent Evolution: B2B Order Exception

Project Brief

Three Ways to Build a House

Imagine you decide to build the same house three times. The first time, you fell every tree, mill every plank, and hammer every nail by hand. You learn exactly where the load-bearing walls go, why the joists are spaced 16 inches apart, and what happens when a sill plate is undersized. The build takes six months, but you understand every joint in the structure.

The second time, you order pre-cut lumber and pre-fab trusses, and you use a nail gun. You finish in six weeks, roughly four times as fast, and the structure is just as strong, but only because you already know what a structure should look like.

The third time, you hand a contractor a set of architectural drawings. Two weeks later, the house is done — built by people you never met to specifications you wrote in plain English. The previous two builds taught you what to draw and what to leave to the crew.

This capstone is those three builds, compressed into one week. You will write the same B2B order-exception agent three ways. By the end, you will know in your hands — not just in theory — why each layer of abstraction exists, what it costs, and when to reach for each one.

What You'll Build

You build a B2B Order Exception Agent. It takes a flagged purchase order — PO-2024-5678 — and walks through five tool calls: pulls order details from the ERP, tracks the shipment with the carrier, checks contract pricing against the invoice, queries available inventory, and drafts a customer notification. The output is a structured exception report (type + root cause + proposed resolution) plus a customer-tier-appropriate email draft.
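
The report shape described above (type, root cause, proposed resolution, plus an email draft) could be modeled as a small dataclass. This is only a sketch: the class and field names (`ExceptionReport`, `EmailDraft`, `proposed_resolution`) are illustrative assumptions, not something the lab prescribes. The enum values mirror the `draft_notification` schema in Step 2.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class ExceptionType(str, Enum):
    # Values mirror the draft_notification enum defined in Step 2.
    DELAYED = "DELAYED"
    PARTIAL = "PARTIAL"
    PRICING = "PRICING"
    QUALITY_HOLD = "QUALITY_HOLD"


@dataclass
class EmailDraft:
    to: str
    tier: str  # "ENTERPRISE" or "SMB"
    salutation: str
    body: str


@dataclass
class ExceptionReport:
    po_number: str
    exception_type: ExceptionType
    root_cause: str
    proposed_resolution: str
    draft: EmailDraft

    def to_json(self) -> str:
        data = asdict(self)
        # Store the plain string value so the JSON stays readable.
        data["exception_type"] = self.exception_type.value
        return json.dumps(data, indent=2)


report = ExceptionReport(
    po_number="PO-2024-5678",
    exception_type=ExceptionType.PARTIAL,
    root_cause="GSK-1175 out of stock at WH-MEMPHIS",
    proposed_resolution="Ship 50 units of GSK-1175 from WH-NEWARK",
    draft=EmailDraft(to="csm-acme@example.com", tier="ENTERPRISE",
                     salutation="Team,", body="..."),
)
print(report.to_json())
```

Keeping the report as a typed object makes it easier to compare the output of the three iterations field by field instead of diffing free text.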

You build it three times:

Iteration 1 Raw API loop — ~250 lines, ~3 hours, hand-coded everything (M15B way)
Iteration 2 Agent SDK + Claude Code — ~120 lines, ~2 hours (M25–M26 way)
Iteration 3 Spec-driven — ~100 lines of spec, ~1 hour (production way)

SLA & Tone Constraints (the B2B-specific compliance layer)

B2B notifications have to match customer-tier tone (terse for enterprise, warmer for SMB), reference contract clauses where relevant, and never overpromise resolution timelines. In Iteration 1 you encode tone rules in the system prompt and check the output by hand. In Iteration 2 you add an output guardrail hook that runs a tone classifier or pattern check. In Iteration 3 you specify the rules and Claude Code generates both the prompt rules AND the output hook. The pattern is identical to HIPAA in Domain A or PII redaction in Domain C — the spec encodes compliance, the iterations only differ in where that compliance lives.

Why It Matters

Production teams do not pick "raw API vs SDK vs spec" in the abstract — they pick based on what the team already understands and what the problem requires. If you skip Iteration 1, you cannot debug Iteration 3 when the generated code does something weird. If you skip Iteration 3, you are 10x slower than the teams shipping agents in 2026. The point of building the same thing three times is that the differences teach you the trade-offs.

The Three-Iteration Concept

Each iteration produces a working agent that solves the SAME order exception with the SAME five tools and SAME mock data. The agent's findings are the same across all three, although Claude's exact wording varies from run to run. What changes is everything around the agent: the lines you wrote, the time you spent, the abstractions you used, and the way you debug when something breaks.

An agent is "a system prompt + tools + a loop." Iteration 1 makes you build all three from scratch. Iteration 2 keeps the system prompt and tools the same, but the SDK runs the loop for you. Iteration 3 keeps everything the same, but Claude Code writes the prompt, tools, AND loop from a spec you gave it. The agent is unchanged. You changed.

A Common Misunderstanding

"Iteration 3 is just better — why bother with the others?" Because Iteration 3's generated code is not magic. When it produces a track_shipment that returns the wrong field name and the agent then mis-reads "delivered" as "in transit," you have to read the generated tools.py, find the bug, and fix it — either in the code or in the spec. Without Iteration 1's familiarity with what tool calls and message shapes look like, you cannot tell what the generated code is doing wrong.

The Scenario — Order Exception Agent

The agent investigates problematic purchase orders: identifies the exception type (delayed shipment, partial delivery, pricing discrepancy, quality hold), gathers data from ERP, WMS, and carrier systems, determines root cause, and proposes resolution with a customer communication draft.

Business Question (use this in all three iterations)

"PO-2024-5678 from Acme Corp has a flagged exception. Investigate, determine the root cause, and draft a customer notification with a resolution proposal."

Tools (5)

get_order_details(po_number) — line items, status, dates, customer tier
track_shipment(tracking_number) — carrier status & location
check_contract_pricing(customer_id, sku) — contract vs invoiced price
query_inventory(sku, warehouse) — stock availability
draft_notification(customer_id, exception_type, resolution) — email draft with tier-aware tone
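
The wrong-field-name bug described under "A Common Misunderstanding" is the kind of drift a small contract check can catch. The sketch below compares a tool result against the keys the agent relies on. `REQUIRED_KEYS` and `check_contract` are hypothetical names, and the expected keys are taken from the tracking.json excerpt in Step 1.

```python
REQUIRED_KEYS = {
    # Keys the agent depends on, per the tracking.json excerpt in Step 1.
    "track_shipment": {"carrier", "status", "history"},
}


def check_contract(tool_name: str, result: dict) -> list[str]:
    """Return the sorted list of required keys missing from a tool result."""
    expected = REQUIRED_KEYS.get(tool_name, set())
    return sorted(expected - set(result))


good = {"carrier": "UPS", "status": "DELIVERED", "history": []}
# A generated tool that renamed "status" to "state" (illustrative drift).
drifted = {"carrier": "UPS", "state": "DELIVERED", "history": []}

print(check_contract("track_shipment", good))     # []
print(check_contract("track_shipment", drifted))  # ['status']
```

Running a check like this against generated tools in Iteration 3 points you at the spec section to fix before the agent misreads a shipment status.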

Mock Data Shape

10 purchase orders with 3 exception types
Exceptions: delayed shipment (4), partial delivery (3), pricing discrepancy (3)
Customers: Acme Corp (Enterprise), Globex Industries (Enterprise), Initech LLC (SMB), Stark Enterprises (Enterprise), Wayne Industries (SMB)
Files: orders.json, tracking.json, contracts.json, inventory.json, customers.json
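
Once orders.json exists, the 4 / 3 / 3 split above can be checked mechanically. A minimal sketch, assuming each order carries an `exception_flagged` field as in the Step 1 excerpt. Only the `PARTIAL_DELIVERY` label appears in the lab; `DELAYED_SHIPMENT` and `PRICING_DISCREPANCY` are assumed labels, and the sample orders are placeholders built purely to exercise the check.

```python
from collections import Counter

# Target distribution from the list above. Only PARTIAL_DELIVERY is a label
# shown in the lab; the other two label strings are assumptions.
EXPECTED = {
    "DELAYED_SHIPMENT": 4,
    "PARTIAL_DELIVERY": 3,
    "PRICING_DISCREPANCY": 3,
}


def distribution_errors(orders: dict, expected: dict = EXPECTED) -> list[str]:
    """Compare exception counts in an orders dict with the expected split."""
    counts = Counter(po.get("exception_flagged") for po in orders.values())
    errors = []
    if len(orders) != sum(expected.values()):
        errors.append(f"expected {sum(expected.values())} POs, found {len(orders)}")
    for label, want in expected.items():
        if counts.get(label, 0) != want:
            errors.append(f"{label}: expected {want}, found {counts.get(label, 0)}")
    return errors


# Placeholder orders, generated only to show the check passing.
labels = (["DELAYED_SHIPMENT"] * 4 + ["PARTIAL_DELIVERY"] * 3
          + ["PRICING_DISCREPANCY"] * 3)
sample = {f"PO-2024-{i:04d}": {"exception_flagged": label}
          for i, label in enumerate(labels, start=1)}

print(distribution_errors(sample))  # []
```

To run it on your own data, load orders.json with `json.load` and pass the resulting dict in place of `sample`.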

Want a Different Domain?

Switch to Domain A (Healthcare Pre-Auth) or Domain C (UCC Risk Analyzer — default). The lab structure is identical; only the tools and data change.

Animation 1: Three-Lane Evolution

Watch three lanes count down lines of code while the same six capabilities populate underneath each. The agent's capabilities never change — what shrinks is the code you write to express them.

Code Shrinks — Capabilities Stay

Iteration 1: Raw API (counter: lines you wrote)
Iteration 2: Agent SDK (counter: lines you wrote)
Iteration 3: Spec-Driven (counter: lines of spec)

Animation 2: Code Size Waterfall

Iteration 1 is 250 lines you wrote. Iteration 2 is 120 lines you wrote. Iteration 3 is 100 lines of spec plus ~300 lines of code Claude Code generated — shown stacked. The total system size grows; the lines on your keyboard fall off a cliff.
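
The figures above reduce to a few lines of arithmetic. The sketch below uses only the approximate counts quoted in the text; the text gives no generated-line figure for Iteration 2, so that value is treated as 0 here (an assumption).

```python
# (lines you wrote, lines Claude Code generated), approximate figures from the text.
# No generated-line count is given for Iteration 2, so it is assumed to be 0.
counts = {
    "Iter 1": (250, 0),
    "Iter 2": (120, 0),
    "Iter 3": (100, 300),
}


def summarize(counts: dict) -> dict:
    """Total system size and the share of it that was typed by hand."""
    rows = {}
    for name, (written, generated) in counts.items():
        total = written + generated
        rows[name] = {"total": total, "written_share": written / total}
    return rows


for name, row in summarize(counts).items():
    print(f"{name}: {row['total']} lines total, {row['written_share']:.0%} typed by you")
```

For Iteration 3 this works out to 400 total lines, of which a quarter (the spec) is yours.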

Lines of Code: You Wrote vs Generated

Iter 1: 250 lines hand-written
Iter 2: 120 lines hand-written
Iter 3: 100 lines of spec (you wrote) + 300 lines generated by Claude Code

Animation 3: Time Comparison

Iteration 1 is the longest because you build everything — loop, validation, tone-checking, audit, sessions, deployment. Iteration 2 cuts the loop and guardrails. Iteration 3 cuts almost everything except the thinking.

Wall-Clock Hours per Iteration

Iter 1: 3 hr
Iter 2: 2 hr
Iter 3: 1 hr

Total: 6 hours across 3 sessions for the same agent, three different builds.

Animation 4: Architecture Per Iteration

Each iteration has the same logical architecture, but the physical architecture differs. The three versions follow.

System Architecture — Three Versions

Iteration 1 (Raw API): YOU OWN EVERYTHING IN BLUE
+----------------------------------------------------------+
| agent.py (~250 lines, all hand-written)                  |
|                                                          |
| while True:                         <-- loop YOU wrote   |
|   response = client.messages.create(...)                 |
|   if response.stop_reason == "end_turn": break           |
|   for block in response.content:                         |
|     if block.type == "tool_use":                         |
|       validate_input(block.input)   <-- YOU wrote        |
|       log_with_timestamp(block)     <-- YOU wrote        |
|       result = execute_tool(...)    <-- YOU wrote        |
|       check_tone_if_email(result)   <-- YOU wrote        |
|       append_messages(...)          <-- YOU wrote        |
|       write_audit_log(...)          <-- YOU wrote        |
+----------------------------------------------------------+
        |                  |                    |
        v                  v                    v
    tools.py        circuit_breaker      audit_log.jsonl
  (5 ops tools        (YOU built)         (YOU rotate)
   YOU wrote)
        |
        v
+----------------------------------------------------------+
| server.py (FastAPI)        <-- YOU wrote everything      |
| Dockerfile                 <-- YOU wrote                 |
+----------------------------------------------------------+

Iteration 2 (Agent SDK): YOU OWN PINK; SDK OWNS GRAY
+----------------------------------------------------------+
| agent.py (~85 lines)                                     |
|                                                          |
| @tool("get_order_details", ..., {...})                   |
| async def get_order_details(args): ...                   |
| (5 @tool functions total)               << YOU wrote     |
|                                                          |
| ops_server = create_sdk_mcp_server(                      |
|     name="ops_tools", tools=[...])                       |
|                                                          |
| OPTIONS = ClaudeAgentOptions(                            |
|     system_prompt="Order ops analyst...",                |
|     mcp_servers={"ops": ops_server},                     |
|     allowed_tools=["mcp__ops__..."],                     |
|     hooks={"PreToolUse": [HookMatcher(validate_input)]}, |
| )                                                        |
|                                                          |
| async for msg in query(prompt=..., options=OPTIONS): ... |
|                                                          |
| -- LOOP, MESSAGE-PASSING, RETRIES, STREAMING ----        |
| -- LIVE INSIDE claude-agent-sdk. YOU DO NOT WRITE THEM. -|
+----------------------------------------------------------+
          |                             |
          v                             v
+--------------------+        +--------------------+
| Claude Code        |        | claude-agent-sdk   |
| generated:         |        | managed:           |
| - server.py        |        | - query() loop     |
| - Dockerfile       |        | - MCP transport    |
| - tests            |        | - HookMatcher      |
| .claude/           |        | - resume tokens    |
|   settings.json    |        | - streaming        |
|   + hooks/*.py     |        +--------------------+
|   (incl. tone)     |
+--------------------+

Iteration 3 (Spec-Driven): YOU OWN ONLY THE GREEN BOX
+----------------------------------------------------------+
| spec/agent-spec.md (~100 lines)         << YOU wrote     |
| ----------------------------------                       |
| # Sections                                               |
| 1. Overview            8. API Wrapper                    |
| 2. Configuration       9. Deployment                     |
| 3. Tools (5 ops)      10. Tests                          |
| 4. System Prompt      11. Evaluation (10 POs)            |
| 5. Hooks (tone)       12. File Structure                 |
| 6. Sessions                                              |
| 7. Mock Data                                             |
+----------------------------------------------------------+
                             |
                             v
+----------------------------------------------------------+
| /generate-from-spec spec/agent-spec.md                   |
| Claude Code reads spec, generates ~18 files:             |
|   agent.py (claude-agent-sdk) | sessions.py              |
|   mock_data/*.json (5 files) | server.py | Dockerfile    |
|   .claude/settings.json | hooks/*.py | .claude/commands/ |
|   tests/test_*.py x5 | spec/agent-spec.md (you)          |
|   appendix/manual-loop.py (Iter-1 reference)             |
+----------------------------------------------------------+

Animation 5: Spec-to-Code Flow

Watch the 12-section spec on the left get read line-by-line. Claude Code translates each section into generated code on the right. Files appear as their corresponding spec section is consumed.

agent-spec.md → Claude Code (read → plan → write) → 18 Files

Prerequisites

Required Modules

M05 — Tool Use: Tool definitions, tool_use blocks, the message loop
M12 — ReAct Agents: Multi-step reasoning across the 5-tool diagnostic flow
M15B — Build Complete Agent: The whole Iter-1 mental model
M16–M17 — Guardrails & HITL: Hooks, especially tone enforcement
M19 — Tracing: Optional but useful for Iter 2 debugging
M21, M22B — Deployment: FastAPI + Docker + Tier 1
M25, M26 — Claude Code & Hooks: CLAUDE.md, slash commands, hooks API

If You Have Not Done M15B / M26

Iteration 1 is a re-implementation of the M15B reference agent for the order-exception scenario. Do M15B first if you have not built an agent from raw API calls. Iteration 2 leans on the claude-agent-sdk patterns taught in M26 (Hooks & Sessions & Agent SDK); revisit that module if @tool, HookMatcher, or ClaudeAgentOptions feel unfamiliar.

Tools You'll Need Installed

Python 3.10+ with pip
Claude Code CLI (npm i -g @anthropic-ai/claude-code) — for Iter 2 and 3
Docker Desktop for the Tier 1 deployment
ANTHROPIC_API_KEY environment variable
Optional: a Langfuse account for Iter 2 tracing
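
Before starting Session 1, a quick preflight script can confirm the list above. This is a sketch: `preflight` is a hypothetical helper, and it only checks that the `claude` and `docker` executables are on your PATH, not that they are correctly configured.

```python
import os
import shutil
import sys


def preflight() -> dict[str, bool]:
    """Report which prerequisites from the list above appear to be satisfied."""
    return {
        "python>=3.10": sys.version_info >= (3, 10),
        "claude CLI on PATH": shutil.which("claude") is not None,
        "docker on PATH": shutil.which("docker") is not None,
        "ANTHROPIC_API_KEY set": bool(os.environ.get("ANTHROPIC_API_KEY")),
    }


for name, ok in preflight().items():
    print(f"{'OK     ' if ok else 'MISSING'}  {name}")
```

Iteration 1 only needs Python and the API key; the Claude Code CLI matters from Iteration 2 onward, and Docker only for the deployment step.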

SESSION 1

Iteration 1: Raw API Loop

Build the agent the M15B way. You write the loop, the validation, the tone enforcement, the audit, the sessions, and the deployment. Every line is yours. Every bug is yours to find.

~3 hours · ~250 lines · 7 files · 0 abstractions

Step 1: Setup & Mock Data

15 min · mock_data/*.json

What & Why: Create the project folder, install anthropic + FastAPI, then build the five mock JSON files. Mock data is what separates a "demo" agent from a "doesn't compile" agent — without realistic ERP, carrier, contract, and inventory data, every tool call returns garbage.

mkdir -p agent-iter1-raw/mock_data && cd agent-iter1-raw
python -m venv venv && source venv/bin/activate    # Windows: venv\Scripts\activate
pip install "anthropic>=0.40" "fastapi>=0.110" "uvicorn>=0.27" "pydantic>=2.0"

// mock_data/orders.json (excerpt)
{
  "PO-2024-5678": {
    "po_number": "PO-2024-5678",
    "customer_id": "ACME001",
    "submitted": "2024-04-01",
    "promised_ship": "2024-04-08",
    "status": "PARTIALLY_SHIPPED",
    "tracking_numbers": ["1Z999AA10123456784"],
    "line_items": [
      {"sku": "BRG-4892", "qty_ordered": 100, "qty_shipped": 60, "unit_price": 42.00},
      {"sku": "GSK-1175", "qty_ordered": 50,  "qty_shipped": 0,  "unit_price": 18.50}
    ],
    "exception_flagged": "PARTIAL_DELIVERY"
  }
}

// mock_data/customers.json (excerpt)
{
  "ACME001": {"name": "Acme Corporation", "tier": "ENTERPRISE",
              "csm_email": "csm-acme@example.com", "sla_hours": 4},
  "INIT003": {"name": "Initech LLC",       "tier": "SMB",
              "csm_email": "support@example.com", "sla_hours": 24}
}

// mock_data/tracking.json (excerpt)
{
  "1Z999AA10123456784": {
    "carrier": "UPS",
    "status": "DELIVERED",
    "delivered_at": "2024-04-12T14:30:00Z",
    "history": [
      {"ts": "2024-04-09T10:00:00Z", "event": "PICKED_UP", "loc": "Memphis, TN"},
      {"ts": "2024-04-12T14:30:00Z", "event": "DELIVERED", "loc": "Newark, NJ"}
    ]
  }
}

// mock_data/contracts.json (excerpt — for ACME001 + BRG-4892)
{
  "ACME001_BRG-4892": {
    "customer_id": "ACME001", "sku": "BRG-4892",
    "contract_unit_price": 39.50, "volume_tier_min_qty": 100,
    "effective": "2024-01-01", "expires": "2024-12-31"
  }
}

// mock_data/inventory.json (excerpt)
{
  "GSK-1175": {
    "warehouses": {
      "WH-MEMPHIS": {"qty_available": 0,  "qty_reserved": 0},
      "WH-NEWARK":  {"qty_available": 200, "qty_reserved": 50}
    }
  }
}

Build all five JSON files with realistic data: 10 POs, 5 customers (3 Enterprise, 2 SMB), tracking for 8 shipments, contracts for 5 SKUs, inventory across 3 warehouses. Cover three exception types: 4 delayed shipments, 3 partial deliveries (incl. PO-2024-5678), 3 pricing discrepancies.

Run: python -c "import json; print(len(json.load(open('mock_data/orders.json'))), 'POs')"

10 POs

Checkpoint

You should see 10 POs. If you see 0 or a JSON parse error, check that all five files are valid JSON.
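
If the checkpoint fails, it helps to know which file is broken. The sketch below checks all five files at once; `validate_mock_data` is a hypothetical helper, and the file names come from the Mock Data Shape list.

```python
import json
from pathlib import Path

MOCK_FILES = ["orders.json", "tracking.json", "contracts.json",
              "inventory.json", "customers.json"]


def validate_mock_data(data_dir: str = "mock_data") -> dict[str, str]:
    """Map each expected file to 'ok (N records)' or a short error message."""
    report = {}
    for name in MOCK_FILES:
        path = Path(data_dir) / name
        try:
            records = json.loads(path.read_text())
            report[name] = f"ok ({len(records)} records)"
        except FileNotFoundError:
            report[name] = "missing"
        except json.JSONDecodeError as e:
            report[name] = f"invalid JSON at line {e.lineno}, column {e.colno}"
    return report


if __name__ == "__main__":
    for name, status in validate_mock_data().items():
        print(f"{name}: {status}")
```

One likely cause of a parse error: the `// mock_data/...` label lines in the excerpts above are captions, not part of the files. Strict JSON has no comment syntax, so copying them into a file makes it invalid.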

Step 2: Define Tools as JSON Schema

15 min · tools.py

What & Why: The Anthropic API needs every tool described as a JSON Schema. PO numbers, SKUs, and tracking numbers all have specific formats (PO-YYYY-NNNN, XXX-NNNN, UPS/FedEx 12–22 chars). Get the schema right and Claude passes correct types.

"""tools.py — order-exception tool schemas + executors."""
import json
from pathlib import Path

DATA = Path("mock_data")
ORDERS = json.loads((DATA / "orders.json").read_text())
TRACKING = json.loads((DATA / "tracking.json").read_text())
CONTRACTS = json.loads((DATA / "contracts.json").read_text())
INVENTORY = json.loads((DATA / "inventory.json").read_text())
CUSTOMERS = json.loads((DATA / "customers.json").read_text())

TOOLS = [
    {
        "name": "get_order_details",
        "description": "Get full PO details: line items, status, dates, tracking numbers, customer.",
        "input_schema": {
            "type": "object",
            "properties": {"po_number": {"type": "string", "pattern": "^PO-[0-9]{4}-[0-9]{4}$"}},
            "required": ["po_number"],
        },
    },
    {
        "name": "track_shipment",
        "description": "Get carrier status for a tracking number (UPS/FedEx/DHL).",
        "input_schema": {
            "type": "object",
            "properties": {"tracking_number": {"type": "string"}},
            "required": ["tracking_number"],
        },
    },
    {
        "name": "check_contract_pricing",
        "description": "Return contract unit price for a customer + SKU. Used to detect discrepancies.",
        "input_schema": {
            "type": "object",
            "properties": {"customer_id": {"type": "string"}, "sku": {"type": "string"}},
            "required": ["customer_id", "sku"],
        },
    },
    {
        "name": "query_inventory",
        "description": "Return stock available for a SKU at a specific warehouse.",
        "input_schema": {
            "type": "object",
            "properties": {"sku": {"type": "string"}, "warehouse": {"type": "string"}},
            "required": ["sku", "warehouse"],
        },
    },
    {
        "name": "draft_notification",
        "description": "Generate an exception-notification email draft, tier-aware tone.",
        "input_schema": {
            "type": "object",
            "properties": {
                "customer_id": {"type": "string"},
                "exception_type": {"type": "string",
                                   "enum": ["DELAYED", "PARTIAL", "PRICING", "QUALITY_HOLD"]},
                "resolution": {"type": "string"},
            },
            "required": ["customer_id", "exception_type", "resolution"],
        },
    },
]

def execute_tool(name, args):
    if name == "get_order_details":
        return ORDERS.get(args["po_number"], {"error": "PO not found"})
    if name == "track_shipment":
        return TRACKING.get(args["tracking_number"], {"error": "tracking not found"})
    if name == "check_contract_pricing":
        key = f"{args['customer_id']}_{args['sku']}"
        return CONTRACTS.get(key, {"error": "no contract on file"})
    if name == "query_inventory":
        sku = INVENTORY.get(args["sku"], {})
        return sku.get("warehouses", {}).get(args["warehouse"],
                                             {"error": "no inventory record"})
    if name == "draft_notification":
        cust = CUSTOMERS.get(args["customer_id"], {})
        tier = cust.get("tier", "SMB")
        # Tier-aware salutation/tone
        salutation = "Team," if tier == "ENTERPRISE" else f"Hi {cust.get('name','team')},"
        sign_off = ("We will follow up before SLA expiry. — Operations"
                    if tier == "ENTERPRISE"
                    else "Let us know how we can help! — Customer Success")
        body = (f"Regarding {args['exception_type']} on your recent order:\n\n"
                f"{args['resolution']}\n\n{sign_off}")
        return {"to": cust.get("csm_email", "ops@example.com"),
                "tier": tier, "salutation": salutation, "body": body}
    raise ValueError(f"Unknown tool: {name}")

Checkpoint

Run: python -c "from tools import execute_tool; print(execute_tool('get_order_details', {'po_number': 'PO-2024-5678'})['exception_flagged'])"

Expected: PARTIAL_DELIVERY
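
The `pattern` and `enum` keywords in `input_schema` describe valid inputs to Claude, but nothing in `execute_tool` re-checks them locally. The sketch below derives a check from the schema itself. `check_args` is a hypothetical helper that covers only `required`, string `pattern`, and `enum`; a full JSON Schema validator would cover far more. Step 4 takes a different route with hand-written regexes.

```python
import re


def check_args(schema: dict, args: dict) -> list[str]:
    """Return human-readable violations of a tool's input_schema.

    Handles only `required`, string `pattern`, and `enum`.
    """
    errors = [f"missing required field: {name}"
              for name in schema.get("required", []) if name not in args]
    for name, rules in schema.get("properties", {}).items():
        if name not in args:
            continue
        value = args[name]
        if rules.get("type") == "string" and not isinstance(value, str):
            errors.append(f"{name}: expected string")
            continue
        # JSON Schema patterns are unanchored, so re.search is the right match.
        if ("pattern" in rules and isinstance(value, str)
                and not re.search(rules["pattern"], value)):
            errors.append(f"{name}: does not match {rules['pattern']}")
        if "enum" in rules and value not in rules["enum"]:
            errors.append(f"{name}: not one of {rules['enum']}")
    return errors


# Schema copied from the get_order_details entry in TOOLS above.
po_schema = {
    "type": "object",
    "properties": {"po_number": {"type": "string",
                                 "pattern": "^PO-[0-9]{4}-[0-9]{4}$"}},
    "required": ["po_number"],
}

print(check_args(po_schema, {"po_number": "PO-2024-5678"}))  # []
print(check_args(po_schema, {"po_number": "5678"}))
print(check_args(po_schema, {}))
```

Driving validation from the schema keeps the format rules in one place, at the cost of a little generic code.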

Step 3: Build the While Loop

45 min · agent.py · The CORE of Iter 1

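What & Why: The loop sends the conversation to Claude, executes whatever tools it requests, appends the results, and repeats until Claude stops asking for tools. The system prompt matters as much as the loop: Claude needs to know to investigate ALL related data (order, tracking, contract, inventory) before drafting a notification, not just react to the first finding.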

"""agent.py — the raw API loop. Everything is hand-coded."""
import json, anthropic
from tools import TOOLS, execute_tool

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-6"
SYSTEM = """You are an order-exception specialist. For every flagged PO:
1. get_order_details to understand the situation
2. track_shipment for any tracking numbers (CARRIER VIEW)
3. check_contract_pricing for any line item with potential discrepancy
4. query_inventory for any unshipped or backordered SKUs
5. draft_notification with the right exception_type and a clear resolution

ALWAYS investigate before drafting. Never recommend a credit without checking
contract pricing AND inventory. Match the customer tier's tone (terse for
ENTERPRISE, warmer for SMB). Never overpromise — cite SLA hours, not specifics."""

MAX_TURNS = 12

def _run_messages(messages: list) -> tuple[str, list]:
    """Drive the loop on the given messages list. Returns (final_text, messages).
    Used by run_agent() for single-shot queries and by session.chat() for multi-turn."""
    for turn in range(MAX_TURNS):
        response = client.messages.create(
            model=MODEL, max_tokens=4096, system=SYSTEM,
            tools=TOOLS, messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason == "end_turn":
            # Default to "" so a response with no text block does not raise StopIteration.
            text = next((b.text for b in response.content if b.type == "text"), "")
            return text, messages

        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    try:
                        result = execute_tool(block.name, block.input)
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": json.dumps(result, default=str),
                        })
                    except Exception as e:
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": f"ERROR: {e}",
                            "is_error": True,
                        })
            messages.append({"role": "user", "content": tool_results})
            continue

        raise RuntimeError(f"Unexpected stop_reason: {response.stop_reason}")
    raise RuntimeError(f"Agent exceeded {MAX_TURNS} turns without finishing")

def run_agent(question: str) -> str:
    """Single-shot entry point. Wraps _run_messages with a fresh history."""
    text, _ = _run_messages([{"role": "user", "content": question}])
    return text

if __name__ == "__main__":
    q = "PO-2024-5678 from Acme Corp has a flagged exception. Investigate, " \
        "determine the root cause, and draft a customer notification " \
        "with a resolution proposal."
    print(run_agent(q))

// agent.ts — the raw API loop. Order-exception investigation.
import Anthropic from "@anthropic-ai/sdk";
import { TOOLS, executeTool } from "./tools.js";

const client = new Anthropic();
const MODEL = "claude-sonnet-4-6";
const SYSTEM = `You are an order-exception specialist. For every flagged PO:
1. get_order_details to understand the situation
2. track_shipment for any tracking numbers
3. check_contract_pricing for any line item with potential discrepancy
4. query_inventory for any unshipped or backordered SKUs
5. draft_notification with the right exception_type and a clear resolution

ALWAYS investigate before drafting. Never recommend a credit without checking
contract pricing AND inventory. Match the customer tier's tone (terse for
ENTERPRISE, warmer for SMB). Never overpromise — cite SLA hours, not specifics.`;

export async function runAgent(question: string, maxTurns = 12): Promise<string> {
  const messages: Anthropic.MessageParam[] = [{ role: "user", content: question }];
  for (let turn = 0; turn < maxTurns; turn++) {
    const response = await client.messages.create({
      model: MODEL, max_tokens: 4096, system: SYSTEM, tools: TOOLS, messages,
    });
    messages.push({ role: "assistant", content: response.content });
    if (response.stop_reason === "end_turn") {
      const text = response.content.find(b => b.type === "text");
      return text?.type === "text" ? text.text : "";
    }
    if (response.stop_reason === "tool_use") {
      const toolResults: Anthropic.ToolResultBlockParam[] = [];
      for (const block of response.content) {
        if (block.type === "tool_use") {
          try {
            const result = await executeTool(block.name, block.input);
            toolResults.push({ type: "tool_result", tool_use_id: block.id,
                                content: JSON.stringify(result) });
          } catch (e) {
            toolResults.push({ type: "tool_result", tool_use_id: block.id,
                                content: `ERROR: ${(e as Error).message}`, is_error: true });
          }
        }
      }
      messages.push({ role: "user", content: toolResults });
      continue;
    }
    throw new Error(`Unexpected stop_reason: ${response.stop_reason}`);
  }
  throw new Error(`Agent exceeded ${maxTurns} turns without finishing`);
}

Run: python agent.py

Expected output (paraphrased — Claude's exact wording varies):

Exception Report: PO-2024-5678 (Acme Corp, ENTERPRISE tier)
Type: PARTIAL_DELIVERY
Root cause: 50 units of GSK-1175 unshipped — Memphis warehouse had 0 stock at
pick time. Newark warehouse has 200 units available; can ship within SLA window.

Draft notification (ENTERPRISE tone — formal):
"Team,
Regarding the partial delivery on PO-2024-5678: 60 of 100 units of BRG-4892 have
shipped (UPS tracking 1Z999AA10123456784, delivered Newark NJ 2024-04-12). The
remaining 50 units of GSK-1175 are unshipped due to a Memphis stockout; we have
200 units available in Newark and will dispatch within your 4-hour SLA.
— Operations"

✅ Checkpoint — Step 3

You should see an exception report identifying PARTIAL delivery, a root cause naming the Memphis stockout and Newark availability, and a draft notification with ENTERPRISE-tier tone (formal salutation, no "Hi there!"). If the agent drafts an SMB-tone notification ("Hi Acme team!"), the tier is not reaching the draft: check that get_order_details returns the customer_id and that the agent passes that same ID to draft_notification, which looks up the tier in customers.json.

Troubleshooting

Agent skips track_shipment → the order has no tracking number yet (e.g., for partial-shipment cases, only the shipped portion has tracking). System prompt should say "call track_shipment for each tracking_number returned by get_order_details — skip if none."
Agent recommends a credit instead of investigating → reinforce the system prompt: "Never recommend a credit without first checking inventory at OTHER warehouses."
Draft uses "Hi Acme!" for an ENTERPRISE customer → the agent ignored the customer tier. The agent cannot read customers.json directly, so point it at the tool output instead: "Read the tier field returned by draft_notification and match tone: ENTERPRISE = formal/terse, SMB = warm/personal."
Agent makes up dates or quantities → system prompt: "Cite ONLY data from tool responses. Never invent numbers."
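
When debugging the loop itself rather than Claude's choices, it can help to run it offline against scripted responses. The sketch below is a condensed variant of the Step 3 loop with the client injected as a parameter. `FakeMessages`, `run_loop`, and `fake_tool` are hypothetical names, and the scripted responses only imitate the attributes the loop reads (`stop_reason`, `content`, `type`, `id`, `name`, `input`, `text`).

```python
import json
from types import SimpleNamespace as NS


class FakeMessages:
    """Stand-in for client.messages that replays scripted responses in order."""

    def __init__(self, scripted):
        self._scripted = iter(scripted)
        self.calls = 0

    def create(self, **kwargs):
        self.calls += 1
        return next(self._scripted)


def run_loop(client, question, execute_tool, max_turns=12):
    """Condensed variant of the Step 3 loop, with the client injected for testing."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        response = client.messages.create(messages=messages)
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason == "end_turn":
            return next((b.text for b in response.content if b.type == "text"), "")
        if response.stop_reason != "tool_use":
            raise RuntimeError(f"Unexpected stop_reason: {response.stop_reason}")
        results = []
        for b in response.content:
            if b.type == "tool_use":
                results.append({"type": "tool_result", "tool_use_id": b.id,
                                "content": json.dumps(execute_tool(b.name, b.input))})
        messages.append({"role": "user", "content": results})
    raise RuntimeError(f"Agent exceeded {max_turns} turns without finishing")


# Script: one tool call, then a final text answer.
scripted = [
    NS(stop_reason="tool_use", content=[
        NS(type="tool_use", id="toolu_1", name="get_order_details",
           input={"po_number": "PO-2024-5678"})]),
    NS(stop_reason="end_turn",
       content=[NS(type="text", text="Report: PARTIAL_DELIVERY")]),
]
client = NS(messages=FakeMessages(scripted))
seen = []  # records every tool call the loop dispatches


def fake_tool(name, args):
    seen.append((name, args))
    return {"exception_flagged": "PARTIAL_DELIVERY"}


answer = run_loop(client, "Investigate PO-2024-5678", fake_tool)
print(answer)  # Report: PARTIAL_DELIVERY
print(seen)
```

If the offline run behaves correctly but the live agent does not, the problem is more likely in the system prompt or tool descriptions than in the loop mechanics.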

Step 4: Add Guardrails Manually (incl. Tone Check)

30 min · guardrails.py

What & Why: A loop that does whatever Claude asks is not safe to ship. Production order agents need at minimum: input validation (PO format, SKU format), output tone enforcement (enterprise drafts should not start with "Hi there!"), a cost cap, and a circuit breaker. The tone check is the B2B-specific compliance layer.

"""guardrails.py — B2B-aware checks for order agents."""
import re

PO_RE = re.compile(r"^PO-[0-9]{4}-[0-9]{4}$")
SKU_RE = re.compile(r"^[A-Z]{3}-[0-9]{4}$")
COST_LIMIT_TOKENS = 50_000
CIRCUIT_FAIL_THRESHOLD = 3

# Tokens that should NOT appear in ENTERPRISE-tier drafts
ENTERPRISE_BANNED_TOKENS = ["hey there", "hi there", "no worries", "super sorry"]
# Tokens that should NOT appear in any draft (overpromise)
OVERPROMISE_TOKENS = ["guarantee", "definitely tomorrow", "100% certain"]

def validate_input(tool_name: str, args: dict):
    if tool_name == "get_order_details":
        if not PO_RE.match(args.get("po_number", "")):
            raise ValueError(f"Invalid PO format: {args.get('po_number')}")
    if tool_name in ("check_contract_pricing", "query_inventory"):
        if not SKU_RE.match(args.get("sku", "")):
            raise ValueError(f"Invalid SKU format: {args.get('sku')}")

def check_draft_tone(draft: dict):
    """Run AFTER draft_notification. Raise if tone mismatch or overpromise."""
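    # Include the salutation: that is where a draft actually starts.
    body = f"{draft.get('salutation', '')} {draft.get('body', '')}".lower()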
    tier = draft.get("tier", "SMB")
    if tier == "ENTERPRISE":
        for tok in ENTERPRISE_BANNED_TOKENS:
            if tok in body:
                raise ValueError(f"Tone mismatch: '{tok}' in ENTERPRISE draft")
    for tok in OVERPROMISE_TOKENS:
        if tok in body:
            raise ValueError(f"Overpromise detected: '{tok}'")

class CircuitBreaker:
    def __init__(self):
        self.failures = 0  # consecutive failures so far
    def record(self, ok):
        self.failures = 0 if ok else self.failures + 1
        if self.failures >= CIRCUIT_FAIL_THRESHOLD:
            raise RuntimeError("Circuit breaker tripped — aborting")

Now wire the guardrails into _run_messages in agent.py. Replace your existing _run_messages with the version below. The tone check fires on draft_notification results — if it raises, the error flows back to Claude as a tool error, prompting a redraft.

# Add to imports at the top of agent.py:
from guardrails import (validate_input, check_draft_tone,
                        CircuitBreaker, COST_LIMIT_TOKENS)   # NEW

_breaker = CircuitBreaker()        # NEW — module-level instance

def _run_messages(messages: list) -> tuple[str, list]:
    total_tokens = 0                # NEW — cost cap counter
    for turn in range(MAX_TURNS):
        response = client.messages.create(
            model=MODEL, max_tokens=4096, system=SYSTEM,
            tools=TOOLS, messages=messages,
        )
        total_tokens += (response.usage.input_tokens
                         + response.usage.output_tokens)              # NEW
        if total_tokens > COST_LIMIT_TOKENS:                          # NEW
            raise RuntimeError(f"Cost cap exceeded: {total_tokens} tokens")

        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason == "end_turn":
            # Default to "" so a response with no text block does not raise StopIteration.
            text = next((b.text for b in response.content if b.type == "text"), "")
            return text, messages

        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    try:
                        validate_input(block.name, block.input)        # NEW — PO/SKU check
                        result = execute_tool(block.name, block.input)
                        # B2B-specific: tone-check draft_notification results.
                        if block.name == "draft_notification":         # NEW
                            check_draft_tone(result)                    # NEW — raises on bad tone
                        _breaker.record(ok=True)                        # NEW
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": json.dumps(result, default=str),
                        })
                    except Exception as e:
                        _breaker.record(ok=False)                       # NEW — trips on 3rd fail
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": f"ERROR: {e}",
                            "is_error": True,
                        })
            messages.append({"role": "user", "content": tool_results})
            continue

        raise RuntimeError(f"Unexpected stop_reason: {response.stop_reason}")

    raise RuntimeError(f"Agent exceeded {MAX_TURNS} turns without finishing")

Test that the tone check fires on a bad ENTERPRISE draft:

# Temporarily edit your draft_notification mock in tools.py to inject bad tone:
#   "Hi there team! No worries about PO-2024-5678 ..."
# Run the agent. The check_draft_tone hook should raise ValueError,
# the error flows back as is_error: true, and Claude redrafts with formal tone.
python agent.py

✅ Checkpoint — Step 4

The clean PO-2024-5678 query still produces an ENTERPRISE-tier formal draft. With the injected bad tone, the agent receives the tone-check error, redrafts, and the second attempt passes. Restore the clean draft_notification mock when done testing.

Tone Enforcement is the B2B Equivalent of HIPAA Redaction

In Domain A you redact PHI; in Domain B you enforce tone. Same hook pattern, different rules. Both are post-tool-use checks that can either reject the result (hard fail) or rewrite it (soft fail). For ENTERPRISE customers, a casual tone in a delay notification can damage the relationship; for SMB, an overly formal tone can feel cold. The tone check is what lets you scale customer comms without manual review.
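The hard-fail versus soft-fail distinction above can be made concrete with a small sketch. This is an illustration only, not the course's guardrails.py: the phrase list and the names find_violations, check_hard, and check_soft are hypothetical, and the soft rewrite is deliberately naive (it just strips the offending phrases).

```python
import re

# Hypothetical banned phrases per tier; the real guardrails.py may use a different list.
BANNED = {
    "ENTERPRISE": ["hi there", "hey there", "no worries"],
    "SMB": [],
}


def find_violations(body: str, tier: str) -> list[str]:
    """Return the banned phrases that appear in the draft body (case-insensitive)."""
    lowered = body.lower()
    return [p for p in BANNED.get(tier, []) if p in lowered]


def check_hard(body: str, tier: str) -> str:
    """Hard fail: raise, so the agent loop reports is_error and the model redrafts."""
    hits = find_violations(body, tier)
    if hits:
        raise ValueError(f"tone_violation for {tier}: {hits}")
    return body


def check_soft(body: str, tier: str) -> str:
    """Soft fail: strip the offending phrases and return the rewritten body."""
    for phrase in find_violations(body, tier):
        body = re.sub(re.escape(phrase), "", body, flags=re.IGNORECASE)
    return " ".join(body.split())  # collapse whitespace left behind by removals
```

A hard fail keeps the model in charge of the wording, at the cost of an extra turn. A soft fail saves the turn but can leave awkward text, which is why the steps in this iteration raise instead of rewriting.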

Troubleshooting

ImportError: cannot import name 'check_draft_tone' from 'guardrails' → guardrails.py isn't in the project folder, or the function name is misspelled.
Agent never redrafts after tone error → Claude may interpret the error as terminal. Strengthen the system prompt: "If you receive a tone_violation error from draft_notification, redraft with formal language and try again."
Cost cap fires immediately → COST_LIMIT_TOKENS = 50_000 leaves plenty of headroom for a five-tool run; if you lowered it while experimenting, restore the original value.

Step 5: Add Audit Logging

15 min | audit_log.jsonl

What & Why: SOX and customer-contract audits require an exception-handling trail. Every tool call needs a timestamped record: case_id (PO number), tool, inputs, output summary, token count.

"""audit.py — SOX-compliant audit log for B2B order operations."""
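import datetime
import json


def append_audit(tool_name, args, result, tokens, po_number):
    """Append one timestamped JSON line per tool call to audit_log.jsonl."""
    # Second-resolution UTC timestamp, e.g. 2026-05-09T12:01:14Z.
    now = datetime.datetime.now(datetime.timezone.utc)
    rec = {
        "ts": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "po_number": po_number,
        "tool": tool_name,
        "input": args,
        # Serialize as JSON (not repr) so the summary matches the tool output format.
        "output_summary": json.dumps(result, default=str)[:200],
        "tokens": tokens,
    }
    with open("audit_log.jsonl", "a", encoding="utf-8") as f:
        # default=str keeps the write from failing on non-JSON values in args.
        f.write(json.dumps(rec, default=str) + "\n")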

Now wire append_audit into _run_messages. Add the import and call right after the successful execute_tool:

# Add to imports at the top of agent.py:
from audit import append_audit                    # NEW

# Inside _run_messages, after `result = execute_tool(...)` and the tone check:
                        if block.name == "draft_notification":
                            check_draft_tone(result)
                        _breaker.record(ok=True)
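                        # Extract PO number from the user's first message for audit context.
                        # In a trimmed session messages[0] may hold a list of tool_result
                        # blocks instead of a string, so guard before calling .split().
                        first = messages[0]["content"]
                        words = first.split() if isinstance(first, str) else []
                        po = next(
                            (w.strip(".,;:!?") for w in words if w.startswith("PO-")),
                            "default"
                        )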
                        append_audit(                              # NEW
                            tool_name=block.name,
                            args=block.input,
                            result=result,
                            tokens=total_tokens,
                            po_number=po,
                        )
                        tool_results.append({...})
                        # ... rest of try-block unchanged ...
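The PO lookup in the wiring snippet can also be pulled into a small helper that is easier to unit-test. The sketch below is one possible version, not part of the course files: the name extract_po is hypothetical, and the pattern assumes the PO-NNNN-NNNN format used throughout this capstone.

```python
import re

# Assumed PO format: "PO-" + 4 digits + "-" + 4 digits (e.g. PO-2024-5678).
PO_PATTERN = re.compile(r"\bPO-\d{4}-\d{4}\b")


def extract_po(messages, default="default"):
    """Return the first PO number found in any plain-text user message."""
    for msg in messages:
        if msg.get("role") != "user":
            continue
        content = msg.get("content")
        # Skip tool_result turns, whose content is a list rather than a string.
        if isinstance(content, str):
            match = PO_PATTERN.search(content)
            if match:
                return match.group(0)
    return default
```

Because it scans every user turn instead of only messages[0], this version still finds the PO after a session has been trimmed, and it ignores surrounding punctuation such as a trailing period.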

Run: python agent.py

Then inspect the audit file:

cat audit_log.jsonl   # macOS/Linux
type audit_log.jsonl  # Windows cmd

Expected output (one line per tool call, all with the same po_number; four representative lines are shown):

{"ts": "2026-05-09T12:01:14Z", "po_number": "PO-2024-5678", "tool": "get_order_details", "input": {"po_number": "PO-2024-5678"}, "output_summary": "{\"customer_id\": \"ACME001\", \"status\": \"PARTIALLY_SHIPPED\", ...}", "tokens": 1842}
{"ts": "2026-05-09T12:01:18Z", "po_number": "PO-2024-5678", "tool": "track_shipment", "input": {"tracking_number": "1Z999AA10123456784"}, "output_summary": "{\"carrier\": \"UPS\", \"status\": \"DELIVERED\", ...}", "tokens": 2510}
{"ts": "2026-05-09T12:01:22Z", "po_number": "PO-2024-5678", "tool": "query_inventory", "input": {"sku": "GSK-1175", "warehouse": "WH-NEWARK"}, "output_summary": "{\"qty_available\": 200, ...}", "tokens": 3185}
{"ts": "2026-05-09T12:01:26Z", "po_number": "PO-2024-5678", "tool": "draft_notification", "input": {"customer_id": "ACME001", "exception_type": "PARTIAL", ...}, "output_summary": "{\"to\": \"csm-acme@example.com\", \"tier\": \"ENTERPRISE\", ...}", "tokens": 3920}

✅ Checkpoint — Step 5

You should see 4–5 lines in audit_log.jsonl all keyed by po_number: "PO-2024-5678". If po_number is "default", the parser didn't find a PO in the user's first message — either the question doesn't include the PO, or the parsing logic doesn't match your prompt format.
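Rather than eyeballing the file, the checkpoint can be verified with a few lines of Python. This is a minimal sketch that assumes each line carries the po_number and tool fields written by append_audit; the helper names summarize_audit and audit_is_clean are hypothetical.

```python
import json
from collections import Counter


def summarize_audit(lines):
    """Parse JSONL audit lines; return (Counter of po_number, tools in call order)."""
    po_counts, tools = Counter(), []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        rec = json.loads(line)
        po_counts[rec["po_number"]] += 1
        tools.append(rec["tool"])
    return po_counts, tools


def audit_is_clean(lines, expected_po):
    """True when at least one record exists and every record is keyed by expected_po."""
    po_counts, tools = summarize_audit(lines)
    return bool(tools) and set(po_counts) == {expected_po}
```

To run it against the real log, open audit_log.jsonl and pass the file object as lines, for example audit_is_clean(open("audit_log.jsonl"), "PO-2024-5678"). A False result with "default" among the keys points at the PO parsing problem described above.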

Step 6: Multi-Turn PO Sessions

15 min | session.py

What & Why: Ops reviewers ask follow-ups about the same PO: "What if we expedite-ship from Newark instead?" should reuse the prior investigation context. Iter 1 implements this by maintaining a per-PO messages list and reusing the same _run_messages helper from Step 3.

Create a new file session.py in the project folder:

"""session.py — multi-turn PO sessions over the same _run_messages helper."""
from agent import _run_messages

SESSIONS: dict[str, list] = {}   # po_number -> messages list
WINDOW = 24                      # 5 tools per turn × ~5 turns + buffer

def chat(po_number: str, user_msg: str) -> str:
    """Append the user's message to the PO session, run the loop, return the answer."""
    msgs = SESSIONS.setdefault(po_number, [])
    msgs.append({"role": "user", "content": user_msg})
    answer, msgs = _run_messages(msgs)
    # Sliding window: keep only the last WINDOW messages.
    SESSIONS[po_number] = msgs[-WINDOW:]
    return answer
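One caveat with a plain msgs[-WINDOW:] slice: the window can begin on an assistant turn, or on a tool_result turn whose matching tool_use was cut off, and the Messages API is expected to reject a tool_result that has no preceding tool_use. The sketch below shows one way to trim on a safe boundary. The helper name trim_window is hypothetical, and it assumes user questions are stored as plain strings while tool results are stored as lists, as in the loop above.

```python
def trim_window(messages, window):
    """Keep roughly the last `window` messages, starting on a plain-text user turn."""
    if len(messages) <= window:
        return list(messages)

    def is_plain_user(m):
        return m["role"] == "user" and isinstance(m["content"], str)

    cut = len(messages) - window
    # Prefer the first plain user turn at or after the cut (stays within the window).
    for i in range(cut, len(messages)):
        if is_plain_user(messages[i]):
            return messages[i:]
    # Otherwise back up to the latest plain user turn before the cut,
    # keeping slightly more than `window` rather than orphaning tool results.
    for i in range(cut - 1, -1, -1):
        if is_plain_user(messages[i]):
            return messages[i:]
    return list(messages)
```

In chat(), the last assignment would then read SESSIONS[po_number] = trim_window(msgs, WINDOW).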

Try it — multi-turn PO investigation:

python -c "
from session import chat
print(chat('PO-2024-5678', 'PO-2024-5678 from Acme Corp has a flagged exception. Investigate and draft a notification.'))
print('---')
print(chat('PO-2024-5678', 'What would change if we expedite-ship from Newark instead of waiting for Memphis restock?'))
"

Expected behavior: the second call references the prior investigation findings (Memphis stockout, Newark availability) and proposes the expedite scenario as an alternative. Without session continuity, the agent would have no memory of the GSK-1175 issue.

✅ Checkpoint — Step 6

The second call references "the GSK-1175 stockout from Memphis" and proposes a different resolution (expedite ship from Newark with revised SLA estimate). If the second call says "I need to investigate first", session isn't carrying over — verify SESSIONS[po_number] persists across calls.

Troubleshooting

ImportError: cannot import name '_run_messages' from 'agent' → you skipped Step 3's refactor of agent.py. Go back and split the loop into _run_messages + run_agent.
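Each call re-investigates → SESSIONS is an in-memory, module-level dict, so it is lost whenever the Python process exits. If you call chat from separate processes (e.g., via subprocess, or one python -c invocation per turn), store sessions in a database or Redis instead.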
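For the separate-process case, a file-backed store is enough to get started. The sketch below uses the standard library's sqlite3 as a stand-in for "a database"; the class name SessionStore and the table layout are hypothetical. It assumes every message is a plain JSON-serializable dict, so assistant content blocks returned by the SDK would need to be converted to dicts before saving.

```python
import json
import sqlite3


class SessionStore:
    """Persist per-PO message lists so separate processes share one session."""

    def __init__(self, path=":memory:"):
        # Pass a file path such as "sessions.db" to survive process restarts.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS sessions ("
            "po_number TEXT PRIMARY KEY, messages TEXT NOT NULL)"
        )

    def load(self, po_number):
        """Return the stored messages list, or an empty list for a new PO."""
        row = self.db.execute(
            "SELECT messages FROM sessions WHERE po_number = ?", (po_number,)
        ).fetchone()
        return json.loads(row[0]) if row else []

    def save(self, po_number, messages):
        """Overwrite the stored messages list for this PO."""
        self.db.execute(
            "INSERT OR REPLACE INTO sessions (po_number, messages) VALUES (?, ?)",
            (po_number, json.dumps(messages)),
        )
        self.db.commit()
```

In chat(), SESSIONS.setdefault(...) would become store.load(po_number), and the final assignment would become store.save(po_number, trimmed_messages).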

Step 7: Deploy as FastAPI + Docker

20 min | server.py + Dockerfile

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from agent import run_agent
from session import chat

app = FastAPI()
class Q(BaseModel): question: str
class C(BaseModel): po_number: str; message: str

@app.get("/health")
def health(): return {"status": "ok"}

@app.post("/exception")
def exception_ep(q: Q):
    try: return {"report": run_agent(q.question)}
    except Exception as e: raise HTTPException(500, str(e))

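@app.post("/chat")
def chat_ep(c: C):
    # Mirror /exception: surface agent failures as HTTP 500 instead of a raw traceback.
    try: return {"answer": chat(c.po_number, c.message)}
    except Exception as e: raise HTTPException(500, str(e))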

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
ENV PYTHONUNBUFFERED=1
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]

Also create requirements.txt at the project root:

anthropic>=0.40
fastapi>=0.110
uvicorn>=0.27
pydantic>=2.0

Run locally first:

uvicorn server:app --reload --port 8000
# In another terminal:
curl localhost:8000/health
curl -X POST localhost:8000/exception -H "Content-Type: application/json" \
     -d '{"question":"PO-2024-5678 from Acme Corp has a flagged exception. Investigate."}'

Then build and run with Docker:

docker build -t iter1-b .
docker run --rm -p 8000:8000 -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY iter1-b

🎉 Iteration 1 Complete — End-to-End Verification

You should now have 13 files in your project folder:

agent-iter1-raw/
├── agent.py            # _run_messages + run_agent
├── tools.py            # 5 ops tool schemas + execute_tool
├── mock_data/
│   ├── orders.json       # 10 POs across 3 exception types
│   ├── tracking.json     # 8 carrier records
│   ├── contracts.json    # 5 customer-SKU contracts
│   ├── inventory.json    # 6 SKUs × 3 warehouses
│   └── customers.json    # 5 customers (3 ENTERPRISE, 2 SMB)
├── guardrails.py       # validate_input + check_draft_tone + CircuitBreaker
├── audit.py            # SOX-compliant append_audit
├── session.py          # multi-turn PO sessions
├── server.py           # FastAPI handlers
├── Dockerfile
└── requirements.txt

Run the full Iter-1 acceptance test:

# 1. Investigate the canonical PO (Acme, ENTERPRISE)
curl -s -X POST localhost:8000/exception -H "Content-Type: application/json" \
     -d '{"question":"PO-2024-5678 from Acme has a flagged exception. Investigate."}' | python -m json.tool

# 2. Multi-turn PO session (expedite-ship what-if)
curl -s -X POST localhost:8000/chat -H "Content-Type: application/json" \
     -d '{"po_number":"PO-2024-5678","message":"Investigate this exception."}' | python -m json.tool
curl -s -X POST localhost:8000/chat -H "Content-Type: application/json" \
     -d '{"po_number":"PO-2024-5678","message":"What if we expedite-ship from Newark?"}' | python -m json.tool

# 3. Tone enforcement: bad ENTERPRISE draft gets rejected
# (You'd inject "Hi there!" into the draft_notification mock to test this)

# 4. Audit log keyed by PO
docker exec $(docker ps -q --filter ancestor=iter1-b) cat audit_log.jsonl | head -5
# Look for "po_number": "PO-2024-5678" on every line

You pass Iter 1 if: (1) /health returns ok; (2) /exception returns a PARTIAL exception report with formal ENTERPRISE-tier draft; (3) the second /chat call references the prior Memphis stockout findings; (4) audit log entries are all keyed by po_number: "PO-2024-5678".

Iteration 1 Metrics

Files: 13 (agent.py, tools.py, mock_data/×5, guardrails.py, audit.py, session.py, server.py, Dockerfile, requirements.txt)
Lines you wrote: ~250
Time: ~3 hours
Abstractions used: none — just anthropic Messages API and FastAPI

Debugging in Iteration 1: Print Statements + Manual Inspection

When the agent gives wrong output in Iter 1, you debug like you debug any Python program: by reading your own code.

A. Add a debug_turn() helper

def debug_turn(turn_num, response, messages):
    print(f"\n=== TURN {turn_num} === stop_reason: {response.stop_reason}")
    for block in response.content:
        if block.type == "tool_use":
            print(f"  TOOL: {block.name}({block.input})")
        elif block.type == "text":
            print(f"  TEXT: {block.text[:120]}...")
    print(f"  Tokens: in={response.usage.input_tokens} out={response.usage.output_tokens}")

B. Inspect messages list manually

The most common Iter-1 bug is malformed messages. Add print(json.dumps(messages, default=str, indent=2)) before each messages.create.
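Printing the list shows you the problem; a small checker can point at it. The sketch below is a hypothetical helper (validate_messages is not part of the course files) that pairs each tool_use id with its tool_result. It assumes content blocks are plain dicts, so SDK block objects would need converting first.

```python
def validate_messages(messages):
    """Return a list of tool_use / tool_result pairing problems in a messages list."""
    problems = []
    pending_ids = set()  # tool_use ids issued by the most recent assistant turn
    for i, msg in enumerate(messages):
        content = msg.get("content")
        blocks = [b for b in content if isinstance(b, dict)] if isinstance(content, list) else []
        if msg.get("role") == "assistant":
            pending_ids = {b["id"] for b in blocks if b.get("type") == "tool_use"}
        elif msg.get("role") == "user":
            result_ids = {b["tool_use_id"] for b in blocks if b.get("type") == "tool_result"}
            for missing in sorted(pending_ids - result_ids):
                problems.append(f"message {i}: no tool_result for tool_use id {missing}")
            for extra in sorted(result_ids - pending_ids):
                problems.append(f"message {i}: tool_result references unknown id {extra}")
            pending_ids = set()
    return problems
```

Call it right before messages.create and print any problems. An empty list means every tool_result in the history answers a tool_use from the assistant turn directly before it.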

C. Common Iter-1 bugs and how to spot them

Wrong tool_use_id in tool_result → 400 from API. Match the id on the tool_use block.
Forgot to append the assistant turn → API complains about message order.
Domain-specific: agent drafts a notification BEFORE checking inventory or contract pricing — root cause incomplete. The system prompt did not enforce the investigate-then-draft order. Strengthen with: "ALWAYS call get_order_details, track_shipment, check_contract_pricing, AND query_inventory before draft_notification."
Domain-specific: agent drafts an SMB-tone notification for an ENTERPRISE customer (or vice versa). Either tone-check hook didn't run, or the draft_notification result didn't include the customer tier. Add the tier to the response payload and to the audit log.
Domain-specific: agent recommends a credit when a re-shipment would resolve. Usually means the agent stopped after track_shipment showed DELIVERED without checking inventory for unshipped lines. Re-read the system prompt — PARTIAL delivery requires inventory check.
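The first domain-specific bug (drafting before the investigation is complete) is easy to detect mechanically from the sequence of tool names, for instance the tool field of each audit_log.jsonl record. A minimal sketch, with a hypothetical helper name:

```python
# The four investigation tools that must run before draft_notification.
INVESTIGATION_TOOLS = {
    "get_order_details",
    "track_shipment",
    "check_contract_pricing",
    "query_inventory",
}


def missing_before_draft(tool_calls):
    """Return the investigation tools not yet called when the first draft happens.

    Returns an empty list when the order is correct, and None when no
    draft_notification call appears at all.
    """
    seen = set()
    for name in tool_calls:
        if name == "draft_notification":
            return sorted(INVESTIGATION_TOOLS - seen)
        seen.add(name)
    return None
```

A non-empty result names exactly which checks the agent skipped, which tells you which part of the system prompt to strengthen.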

D. Debug exercise (do this before moving on)

Add a hard tone violation: change execute_tool for ENTERPRISE customers so the body starts with "Hey there!". Re-run the agent. The check_draft_tone guardrail should raise. Read the error message and trace where the tone got injected. Restore and re-run. This builds the muscle memory you need to debug Iter 3 generated drafts.

SESSION 2

Iteration 2: Agent SDK + Claude Code

Now you let the SDK run the loop, hooks handle tone enforcement and validation, sessions handle multi-turn, and Claude Code does most of the typing. Same agent. Half the lines. Different debugging.

~2 hours | ~120 lines | 8 files | SDK + hooks + sessions

Step 8: Create CLAUDE.md via Claude Code

10 min | CLAUDE.md

mkdir agent-iter2-sdk && cd agent-iter2-sdk
python -m venv venv && source venv/bin/activate    # Windows: venv\Scripts\activate
pip install "claude-agent-sdk>=0.2" "fastapi>=0.110" "uvicorn>=0.27" "pydantic>=2.0"
npm i -g @anthropic-ai/claude-code   # if not already installed
claude
> /init

# Agent: Order Exception Investigation Agent (Iteration 2)

## Stack
- Python 3.11+
- `claude-agent-sdk` (the official Agent SDK — NOT a wrapper around `client.messages.create()`)
- FastAPI + Docker for deployment
- Mock data in mock_data/*.json (5 files: orders, tracking, contracts, inventory, customers)

## File Layout
- agent.py            — query() entry + 5 @tool-decorated MCP tools + create_sdk_mcp_server
- hooks/              — PreToolUse + PostToolUse hook scripts (tone enforcement on draft_notification)
- .claude/settings.json — hooks registration (matchers + commands)
- sessions.py         — multi-PO session resume
- server.py           — FastAPI async wrapper

## Compliance Rules (encode in hooks)
- ENTERPRISE drafts MUST NOT contain casual tokens ("hi there", "no worries")
- All drafts MUST NOT contain overpromise tokens ("guarantee", "definitely tomorrow")
- All tool calls MUST be audited with po_number as the case key

## System Prompt Rules
For every flagged PO, follow this order:
1. get_order_details
2. track_shipment for each tracking_number returned
3. check_contract_pricing for any line with potential discrepancy
4. query_inventory for any unshipped/backordered SKU
5. draft_notification (matching customer tier tone)

NEVER draft before completing the investigation. NEVER overpromise.

Step 9: Build Agent with @tool Decorators (claude-agent-sdk)

15 min | agent.py

What the Real SDK Looks Like

If you've used client.messages.create() in Iter 1, you might expect the SDK to be a thin Agent class wrapping it. It is not. The SDK is built around MCP tools + an async query() generator + options/hooks via ClaudeAgentOptions. Tools return {"content": [{"type": "text", "text": ...}]} (MCP shape), not bare Python values.
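To make the MCP shape concrete, here is a small sketch of wrapping a Python value into that structure and reading it back. The helper names wrap and unwrap are hypothetical. The point to notice is that the text field holds a JSON string, so a consumer has to decode it before inspecting fields.

```python
import json


def wrap(payload):
    """Wrap a JSON-serializable value in the MCP tool-result shape."""
    return {"content": [{"type": "text", "text": json.dumps(payload)}]}


def unwrap(result):
    """Decode the first text block of an MCP-shaped result back into a Python value."""
    texts = [b["text"] for b in result.get("content", []) if b.get("type") == "text"]
    return json.loads(texts[0]) if texts else None
```

For example, unwrap(wrap({"tier": "ENTERPRISE"}))["tier"] gives back the tier, while wrap(...)["content"][0]["text"] on its own is only a string.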

> Create agent.py using `claude-agent-sdk`. Define five order-ops tools as
> @tool-decorated async functions returning MCP-shaped {"content":[...]}:
> get_order_details, track_shipment, check_contract_pricing, query_inventory,
> draft_notification (tier-aware tone). Wire them into a create_sdk_mcp_server,
> build ClaudeAgentOptions with the system prompt from CLAUDE.md, and expose
> async run(question) driving query() and concatenating AssistantMessage text.

"""agent.py — claude-agent-sdk version. ~85 lines incl. 5 tools."""
import json
from pathlib import Path
from claude_agent_sdk import (
    query, tool, create_sdk_mcp_server,
    ClaudeAgentOptions, AssistantMessage,
)

DATA = Path("mock_data")
ORDERS    = json.loads((DATA / "orders.json").read_text())
TRACKING  = json.loads((DATA / "tracking.json").read_text())
CONTRACTS = json.loads((DATA / "contracts.json").read_text())
INVENTORY = json.loads((DATA / "inventory.json").read_text())
CUSTOMERS = json.loads((DATA / "customers.json").read_text())

def _ok(payload): return {"content": [{"type": "text", "text": json.dumps(payload)}]}

@tool("get_order_details", "Full PO details: line items, status, tracking, customer.",
      {"po_number": str})
async def get_order_details(args):
    return _ok(ORDERS.get(args["po_number"], {"error": "PO not found"}))

@tool("track_shipment", "Carrier status for a tracking number.",
      {"tracking_number": str})
async def track_shipment(args):
    return _ok(TRACKING.get(args["tracking_number"], {"error": "tracking not found"}))

@tool("check_contract_pricing", "Contract unit price for customer + SKU.",
      {"customer_id": str, "sku": str})
async def check_contract_pricing(args):
    return _ok(CONTRACTS.get(f"{args['customer_id']}_{args['sku']}",
                             {"error": "no contract"}))

@tool("query_inventory", "Stock available for a SKU at a warehouse.",
      {"sku": str, "warehouse": str})
async def query_inventory(args):
    s = INVENTORY.get(args["sku"], {})
    return _ok(s.get("warehouses", {}).get(args["warehouse"],
                                            {"error": "no inventory"}))

@tool("draft_notification",
      "Generate exception-notification email with tier-aware tone.",
      {"customer_id": str, "exception_type": str, "resolution": str})
async def draft_notification(args):
    cust = CUSTOMERS.get(args["customer_id"], {})
    tier = cust.get("tier", "SMB")
    salutation = "Team," if tier == "ENTERPRISE" else f"Hi {cust.get('name','team')},"
    sign_off = ("— Operations" if tier == "ENTERPRISE"
                else "— Customer Success")
    body = (f"Regarding {args['exception_type']}: {args['resolution']}\n\n{sign_off}")
    return _ok({"to": cust.get("csm_email"), "tier": tier,
                "salutation": salutation, "body": body})

ops_server = create_sdk_mcp_server(
    name="ops_tools", version="1.0.0",
    tools=[get_order_details, track_shipment, check_contract_pricing,
           query_inventory, draft_notification],
)

OPTIONS = ClaudeAgentOptions(
    model="claude-sonnet-4-6",
    system_prompt=("You are an order-exception specialist. For every flagged PO: "
                   "1. get_order_details, 2. track_shipment, 3. check_contract_pricing, "
                   "4. query_inventory, 5. draft_notification (tier-aware). "
                   "NEVER draft before completing investigation. NEVER overpromise."),
    mcp_servers={"ops": ops_server},
    allowed_tools=[f"mcp__ops__{n}" for n in (
        "get_order_details", "track_shipment", "check_contract_pricing",
        "query_inventory", "draft_notification")],
    max_turns=12,
)

async def run(question: str) -> str:
    parts = []
    async for msg in query(prompt=question, options=OPTIONS):
        if isinstance(msg, AssistantMessage):
            for block in msg.content:
                if getattr(block, "text", None):
                    parts.append(block.text)
    return "\n".join(parts)

// agent.ts — @anthropic-ai/claude-agent-sdk version
import { query, tool, createSdkMcpServer } from "@anthropic-ai/claude-agent-sdk";
import { z } from "zod";
import * as fs from "fs";

const ORDERS    = JSON.parse(fs.readFileSync("mock_data/orders.json", "utf8"));
const TRACKING  = JSON.parse(fs.readFileSync("mock_data/tracking.json", "utf8"));
const CONTRACTS = JSON.parse(fs.readFileSync("mock_data/contracts.json", "utf8"));
const INVENTORY = JSON.parse(fs.readFileSync("mock_data/inventory.json", "utf8"));
const CUSTOMERS = JSON.parse(fs.readFileSync("mock_data/customers.json", "utf8"));

const ok = (p: unknown) => ({ content: [{ type: "text" as const, text: JSON.stringify(p) }] });

const getOrderDetails = tool(
  "get_order_details", "Full PO details.",
  { po_number: z.string() },
  async (args) => ok(ORDERS[args.po_number] ?? { error: "PO not found" })
);

const draftNotification = tool(
  "draft_notification", "Tier-aware exception email draft.",
  { customer_id: z.string(), exception_type: z.string(), resolution: z.string() },
  async (args) => {
    const cust = CUSTOMERS[args.customer_id] ?? {};
    const tier = cust.tier ?? "SMB";
    const salutation = tier === "ENTERPRISE" ? "Team," : `Hi ${cust.name ?? "team"},`;
    const signOff = tier === "ENTERPRISE" ? "— Operations" : "— Customer Success";
    const body = `Regarding ${args.exception_type}: ${args.resolution}\n\n${signOff}`;
    return ok({ to: cust.csm_email, tier, salutation, body });
  }
);
// (track_shipment, check_contract_pricing, query_inventory defined the same way)

const opsServer = createSdkMcpServer({
  name: "ops_tools",
  tools: [getOrderDetails, /* ...the 3 other tools... */ draftNotification],
});

const OPTIONS = {
  model: "claude-sonnet-4-6",
  systemPrompt: "You are an order-exception specialist. For every flagged PO: " +
                "1. get_order_details, 2. track_shipment, 3. check_contract_pricing, " +
                "4. query_inventory, 5. draft_notification (tier-aware). " +
                "NEVER draft before completing investigation. NEVER overpromise.",
  mcpServers: { ops: opsServer },
  allowedTools: [
    "mcp__ops__get_order_details", "mcp__ops__track_shipment",
    "mcp__ops__check_contract_pricing", "mcp__ops__query_inventory",
    "mcp__ops__draft_notification",
  ],
  maxTurns: 12,
};

export async function run(question: string): Promise<string> {
  const parts: string[] = [];
  for await (const msg of query({ prompt: question, options: OPTIONS })) {
    if (msg.type === "assistant") {
      // In the TypeScript SDK the API message is nested under `message`.
      for (const block of msg.message.content) {
        if (block.type === "text") parts.push(block.text);
      }
    }
  }
  return parts.join("\n");
}

What Just Disappeared

You no longer write the message loop, the stop_reason check, the tool_result append, or JSON schema dicts. Same five-tool investigation flow, ~85 lines instead of ~140.

Troubleshooting

ModuleNotFoundError: No module named 'claude_agent_sdk' → pip install "claude-agent-sdk>=0.2" in your venv.
ImportError: cannot import name 'Agent' from 'anthropic' → that import comes from an API that does not exist. The real SDK is a separate package, imported as claude_agent_sdk.
Tool call rejected with "not allowed" → add the tool's mcp__ops__<name> entry to allowed_tools.
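The "not allowed" error usually comes down to a naming mismatch. Judging from the OPTIONS block above, the prefix is built from the key used in mcp_servers ("ops"), not from the name passed to create_sdk_mcp_server ("ops_tools"). The sketch below, with hypothetical helper names, builds the expected names and reports any that are missing from an allow-list.

```python
def mcp_tool_names(server_key, tool_names):
    """Build fully qualified names of the form mcp__<server_key>__<tool>."""
    return [f"mcp__{server_key}__{name}" for name in tool_names]


def missing_from_allowed(server_key, tool_names, allowed):
    """Return the qualified tool names that are absent from the allow-list."""
    allowed_set = set(allowed)
    return [n for n in mcp_tool_names(server_key, tool_names) if n not in allowed_set]
```

Running missing_from_allowed("ops", [...five tool names...], your_allowed_list) before starting the agent turns a mid-run rejection into a startup check.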

Step 10: Hooks via .claude/settings.json + HookMatcher

20 min | hooks/*.py + .claude/settings.json

What & Why: The SDK supports two hook surfaces: (a) file-based hooks in .claude/settings.json shelling out to scripts (production-friendly), and (b) in-process hooks via HookMatcher in ClaudeAgentOptions(hooks={...}). We use both: file-based for audit + tone enforcement on draft_notification, in-process for input validation that needs to deny.

{
  "hooks": {
    "PreToolUse": [
      { "matcher": "*",
        "hooks": [{ "type": "command", "command": "python hooks/log_call.py" }] }
    ],
    "PostToolUse": [
      { "matcher": "mcp__ops__draft_notification",
        "hooks": [{ "type": "command", "command": "python hooks/tone_check.py" }] },
      { "matcher": "*",
        "hooks": [{ "type": "command", "command": "python hooks/audit.py" }] }
    ]
  }
}

"""hooks/tone_check.py — reject ENTERPRISE drafts with casual tokens
or any draft containing overpromise tokens. Pass the draft through if OK."""
import sys, json

ENTERPRISE_BANNED = ["hi there", "hey there", "no worries", "super sorry"]
OVERPROMISE       = ["guarantee", "definitely tomorrow", "100% certain"]

payload = json.load(sys.stdin)
result  = payload.get("tool_result") or {}
# tool_result is wrapped {"content":[{"type":"text","text": json}]}
try:
    inner = json.loads(result["content"][0]["text"])
except Exception:
    json.dump(payload, sys.stdout); sys.exit(0)

body = (inner.get("body") or "").lower()
tier = inner.get("tier", "SMB")

violations = []
if tier == "ENTERPRISE":
    violations += [t for t in ENTERPRISE_BANNED if t in body]
violations += [t for t in OVERPROMISE if t in body]

if violations:
    # Rewrite the tool_result so the agent sees an error and re-drafts.
    err = {"error": "tone_violation", "violations": violations,
           "tier": tier, "advice": "Re-draft with formal/non-overpromising tone."}
    payload["tool_result"] = {"content": [{"type": "text",
                                            "text": json.dumps(err)}]}
json.dump(payload, sys.stdout)

"""Add to agent.py: in-process input validation via HookMatcher."""
import re
from claude_agent_sdk import HookMatcher

PO_RE  = re.compile(r"^PO-[0-9]{4}-[0-9]{4}$")
SKU_RE = re.compile(r"^[A-Z]{3}-[0-9]{4}$")

async def validate_input(input_data, tool_use_id, context):
    name = input_data.get("tool_name", "")
    args = input_data.get("tool_input", {}) or {}
    fail = None
    if name.endswith("get_order_details") and not PO_RE.match(args.get("po_number", "")):
        fail = f"Invalid PO format: {args.get('po_number')!r}"
    elif name.endswith(("check_contract_pricing", "query_inventory")) \
            and not SKU_RE.match(args.get("sku", "")):
        fail = f"Invalid SKU format: {args.get('sku')!r}"
    if fail:
        return {"hookSpecificOutput": {"hookEventName": "PreToolUse",
                                        "permissionDecision": "deny",
                                        "permissionDecisionReason": fail}}
    return {}

# Then update OPTIONS in agent.py:
OPTIONS = ClaudeAgentOptions(
    # ... existing fields ...
    # Matchers are treated as regular expressions, so use ".*" to match every ops tool.
    hooks={"PreToolUse": [HookMatcher(matcher="mcp__ops__.*",
                                      hooks=[validate_input])]},
)
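The validation rules are easier to unit-test when the decision is separated from the hook plumbing. The sketch below restates the same PO and SKU checks as two plain functions that need no SDK import. The names deny_reason and to_hook_output are hypothetical, and the returned dict simply mirrors the deny shape used in validate_input above.

```python
import re

PO_RE = re.compile(r"^PO-[0-9]{4}-[0-9]{4}$")
SKU_RE = re.compile(r"^[A-Z]{3}-[0-9]{4}$")


def deny_reason(tool_name, tool_input):
    """Return a human-readable reason to deny the call, or None to allow it."""
    args = tool_input or {}
    if tool_name.endswith("get_order_details"):
        po = str(args.get("po_number") or "")
        if not PO_RE.match(po):
            return f"Invalid PO format: {po!r}"
    elif tool_name.endswith(("check_contract_pricing", "query_inventory")):
        sku = str(args.get("sku") or "")
        if not SKU_RE.match(sku):
            return f"Invalid SKU format: {sku!r}"
    return None


def to_hook_output(reason):
    """Translate a deny reason into the PreToolUse response shape ({} means allow)."""
    if reason is None:
        return {}
    return {"hookSpecificOutput": {"hookEventName": "PreToolUse",
                                   "permissionDecision": "deny",
                                   "permissionDecisionReason": reason}}
```

With this split, the async hook body reduces to one line: return to_hook_output(deny_reason(name, args)).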

Create the supporting log_call.py and audit.py hooks — same stdin/stdout pattern:

"""hooks/log_call.py — PreToolUse: print to stderr, pass through stdout."""
import sys, json, datetime

payload = json.load(sys.stdin)
ts = datetime.datetime.utcnow().isoformat() + "Z"
print(f"[{ts}] PRE  {payload.get('tool_name','?')}({payload.get('tool_input',{})})",
      file=sys.stderr)
json.dump(payload, sys.stdout)   # pass-through

"""hooks/audit.py — PostToolUse: append a SOX-compliant record to audit_log.jsonl."""
import sys, json, datetime

payload = json.load(sys.stdin)
# Extract PO number from tool_input if present, else 'default'.
ti = payload.get("tool_input", {}) or {}
po = ti.get("po_number") or "default"
rec = {
    "ts": datetime.datetime.utcnow().isoformat() + "Z",
    "po_number": po,
    "tool": payload.get("tool_name"),
    "input": ti,
    "output_summary": str(payload.get("tool_result"))[:200],
}
with open("audit_log.jsonl", "a") as f:
    f.write(json.dumps(rec, default=str) + "\n")
json.dump(payload, sys.stdout)   # pass-through

Smoke-test each hook standalone:

# tone_check should reject ENTERPRISE drafts with banned tokens
echo '{"tool_name":"mcp__ops__draft_notification","tool_result":{"content":[{"type":"text","text":"{\"body\":\"Hi there team!\",\"tier\":\"ENTERPRISE\"}"}]}}' \
  | python hooks/tone_check.py
# Should rewrite tool_result to an error payload prompting redraft.

# audit should append a record keyed by po_number
echo '{"tool_name":"get_order_details","tool_input":{"po_number":"PO-2024-5678"},"tool_result":"ok"}' \
  | python hooks/audit.py
cat audit_log.jsonl | tail -1

Then run the agent end-to-end:

python -c "import asyncio, agent; print(asyncio.run(agent.run('Investigate PO-2024-5678 from Acme Corp')))"

✅ Checkpoint — Step 10

You should see (1) [timestamp] PRE log lines on stderr, (2) the agent's exception report on stdout with formal ENTERPRISE-tier draft, (3) audit_log.jsonl with entries keyed by PO. If you trigger a tone violation (inject "Hi there!" in your draft_notification mock), the tone_check hook rewrites the tool_result as an error and the agent redrafts.

Troubleshooting

Tone hook crashes on JSON parse → the SDK wraps tool results in {"content":[{"type":"text","text":...}]}. The text field is a JSON STRING that needs json.loads to inspect.
Agent ignores the rewritten error and uses the bad draft anyway → check that you assigned the new payload back: payload["tool_result"] = {"content": [...error payload...]} before json.dump.
audit_log.jsonl shows po_number "default" → only get_order_details has a po_number in tool_input. The other tools (draft_notification, query_inventory, and so on) never receive one, so the hook cannot key their records. The safest fix is to extract the PO from the user's first message in agent.py rather than inside the hook.

Step 11: Sessions — Multi-PO + Fork

15 min | sessions.py

"""sessions.py — multi-PO via SDK resume tokens."""
from dataclasses import replace
from claude_agent_sdk import query, AssistantMessage
from agent import OPTIONS

SESSIONS: dict[str, str] = {}   # po_number -> resume token

async def _drive(prompt, options):
    parts, sid = [], None
    async for msg in query(prompt=prompt, options=options):
        if isinstance(msg, AssistantMessage):
            for block in msg.content:
                if getattr(block, "text", None): parts.append(block.text)
        s = getattr(msg, "session_id", None)
        if s: sid = s
    return "\n".join(parts), sid

async def chat(po_number: str, msg: str) -> str:
    resume = SESSIONS.get(po_number)
    options = replace(OPTIONS, resume=resume) if resume else OPTIONS
    text, sid = await _drive(msg, options)
    if sid: SESSIONS[po_number] = sid
    return text

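async def what_if(po_number: str, hypothetical: str) -> str:
    """Fork: 'what if we expedite-ship from Newark instead?' Do NOT save the new sid."""
    resume = SESSIONS.get(po_number)
    # fork_session=True asks the SDK to branch into a new session ID instead of
    # appending to the resumed one; without it the hypothetical lands in the main thread.
    options = replace(OPTIONS, resume=resume, fork_session=True) if resume else OPTIONS
    text, _ = await _drive(hypothetical, options)
    return text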

Try it — multi-PO + fork demo:

python -c "
import asyncio
from sessions import chat, what_if

async def main():
    print('T1:', await chat('PO-2024-5678', 'PO-2024-5678 from Acme has a flagged exception. Investigate.'))
    print('FORK:', await what_if('PO-2024-5678', 'What if we expedite-ship from Newark instead of waiting for Memphis?'))
    print('T2:', await chat('PO-2024-5678', 'Stick with the original plan. Draft the customer notification.'))

asyncio.run(main())
"

✅ Checkpoint — Step 11

T1 produces a PARTIAL exception report with Memphis stockout root cause. FORK explores the Newark expedite scenario WITHOUT polluting the main session. T2 references the original plan (not the Newark hypothetical) and produces the formal ENTERPRISE-tier draft. If T2 references Newark, your what_if is leaking into SESSIONS.
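The "leaking into SESSIONS" failure can be checked without calling the model at all. The sketch below replaces the SDK driver with a stub (fake_drive is hypothetical and stands in for _drive) and checks only the bookkeeping: chat stores the new session id, what_if discards it. It says nothing about how the SDK itself handles resumed sessions.

```python
import asyncio
import itertools

SESSIONS = {}                 # po_number -> session id, as in sessions.py
_ids = itertools.count(1)


async def fake_drive(prompt, resume=None):
    """Hypothetical stand-in for _drive: returns (text, freshly minted session id)."""
    return f"answer to: {prompt}", f"sid-{next(_ids)}"


async def chat(po_number, msg):
    text, sid = await fake_drive(msg, SESSIONS.get(po_number))
    SESSIONS[po_number] = sid          # main thread advances
    return text


async def what_if(po_number, hypothetical):
    text, _ = await fake_drive(hypothetical, SESSIONS.get(po_number))
    return text                        # session id deliberately discarded


async def demo():
    await chat("PO-2024-5678", "Investigate.")
    before = SESSIONS["PO-2024-5678"]
    await what_if("PO-2024-5678", "Expedite from Newark?")
    return before, SESSIONS["PO-2024-5678"]
```

If the two ids returned by demo() differ, what_if is writing to SESSIONS and T2 will resume from the hypothetical branch.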

Step 12: Slash Commands

15 min | .claude/commands/*.md (3 files)

What & Why: /run-exception PO-2024-5678, /test-agent, /eval-agent turn the agent into a one-line workflow. Ops reviewers can investigate exceptions without leaving Claude Code.

Create 3 files in .claude/commands/:

---
description: Investigate an order exception by PO number
argument-hint: [po_number]
---
Look up $ARGUMENTS in mock_data/orders.json. Build the question string.
Run `python -c "import asyncio, agent; print(asyncio.run(agent.run(q)))"`
where q is the question. Print the exception report, draft, total tokens,
total cost from the ResultMessage emitted by query().

---
description: Run the unit test suite for the order-exception agent
---
Run `pytest tests/ -v`. Critical tests: test_partial_delivery_resolution
(PO-2024-5678 must identify Memphis stockout + Newark availability),
test_tone_check_rejects_bad_enterprise_draft (tone hook works),
test_audit_keyed_by_po (every audit line has the right po_number).

---
description: Run the 10-PO evaluation suite
---
Read test_scenarios.json (10 POs: 4 DELAYED, 3 PARTIAL, 3 PRICING). For each,
call agent.run(), score on: correct exception_type, root cause cited, tier-
appropriate tone, no overpromise. Report per-case score and overall percentage.

✅ Checkpoint — Step 12

When you type / in Claude Code, the three commands appear in autocomplete. /run-exception PO-2024-5678 investigates and drafts the formal ENTERPRISE notification. /eval-agent requires a test_scenarios.json file — in Iter 3 the spec generates this for you.
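The scoring described for /eval-agent can be sketched as plain arithmetic. This assumes each of the four criteria is a pass/fail check weighted equally, which the command text does not specify; the names CRITERIA, score_case, and score_suite are hypothetical.

```python
# The four checks named in the eval-agent command, each treated as pass/fail.
CRITERIA = ("exception_type", "root_cause", "tone", "no_overpromise")


def score_case(checks):
    """Fraction of criteria passed for one PO (missing keys count as failures)."""
    return sum(1 for c in CRITERIA if checks.get(c)) / len(CRITERIA)


def score_suite(results):
    """Return (per-case scores, overall percentage) for {po_number: checks}."""
    per_case = {po: score_case(checks) for po, checks in results.items()}
    overall = 100 * sum(per_case.values()) / len(per_case) if per_case else 0.0
    return per_case, overall
```

Deciding whether each check passed (for instance, whether the root cause was cited) is the hard part and still needs either string matching against test_scenarios.json or a human or model grader.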

Step 13: Deploy via Claude Code

15 min | server.py + Dockerfile

What & Why: Same FastAPI + Docker pattern as Iter 1, but Claude Code writes it.

> Create server.py and Dockerfile. Endpoints: GET /health,
> POST /exception (single-shot), POST /chat (po_number + message).
> Async FastAPI handlers awaiting agent.run() and sessions.chat() (both
> async coroutines from claude-agent-sdk). python:3.11-slim base, install
> claude-agent-sdk + dependencies, expose 8000. Mount .claude/ into the
> container so settings.json + hook scripts resolve at runtime.

"""server.py — async FastAPI wrapper around the SDK order-exception agent."""
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from agent import run as agent_run
from sessions import chat as session_chat

app = FastAPI(title="Order Exception Agent (Iter 2 — SDK)")

class Q(BaseModel):
    question: str  # single-shot question for POST /exception

class C(BaseModel):
    po_number: str  # keys the session
    message: str

@app.get("/health")
def health():
    return {"status": "ok", "iter": 2}

@app.post("/exception")
async def exception_ep(q: Q):
    try:
        # agent.run() is a coroutine, so it must be awaited.
        return {"report": await agent_run(q.question)}
    except Exception as e:
        # Surface any agent failure as a 500 with the error text.
        raise HTTPException(status_code=500, detail=str(e)) from e

@app.post("/chat")
async def chat_ep(c: C):
    try:
        return {"answer": await session_chat(c.po_number, c.message)}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e)) from e

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
ENV PYTHONUNBUFFERED=1
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]

Run locally first, then Docker:

uvicorn server:app --reload --port 8000
# In another terminal:
curl localhost:8000/health
curl -X POST localhost:8000/exception -H "Content-Type: application/json" \
     -d '{"question":"Investigate PO-2024-5678"}'

# Then build the container:
docker build -t iter2-b .
docker run --rm -p 8000:8000 -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY iter2-b

🎉 Iteration 2 Complete — End-to-End Verification

You should now have ~14 files in your project:

agent-iter2-sdk/
├── CLAUDE.md
├── agent.py            # query() + 5 @tool functions + create_sdk_mcp_server
├── sessions.py
├── mock_data/ ×5       # orders, tracking, contracts, inventory, customers
├── .claude/
│   ├── settings.json
│   └── commands/run-exception.md, test-agent.md, eval-agent.md
├── hooks/
│   └── log_call.py, tone_check.py, audit.py
└── server.py | Dockerfile | requirements.txt

Acceptance test — same shape as Iter 1:

curl -s -X POST localhost:8000/exception -H "Content-Type: application/json" \
     -d '{"question":"Investigate PO-2024-5678"}' | python -m json.tool

curl -s -X POST localhost:8000/chat -H "Content-Type: application/json" \
     -d '{"po_number":"PO-2024-5678","message":"Investigate."}' | python -m json.tool
curl -s -X POST localhost:8000/chat -H "Content-Type: application/json" \
     -d '{"po_number":"PO-2024-5678","message":"What if we expedite from Newark?"}' | python -m json.tool

# Tone enforcement: check audit log shows ENTERPRISE drafts only
docker exec $(docker ps -q --filter ancestor=iter2-b) cat audit_log.jsonl | grep ENTERPRISE
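The grep above only shows that ENTERPRISE lines exist; it does not fail if a non-ENTERPRISE draft also slipped through for the same PO. A stricter check could be sketched as follows. The `po_number` key follows the test_audit_keyed_by_po description, but the `tier` key and the overall line shape are assumptions about audit_log.jsonl, so adjust them to match what your audit hook actually writes.

```python
import json

def non_enterprise_lines(jsonl_text: str, po_number: str) -> list[dict]:
    """Return audit entries for po_number whose tier is not ENTERPRISE."""
    bad = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the log
        entry = json.loads(line)
        # "tier" is an assumed field name; "po_number" comes from the test suite.
        if entry.get("po_number") == po_number and entry.get("tier") != "ENTERPRISE":
            bad.append(entry)
    return bad

# Hypothetical log content: one good line and one violation.
sample = "\n".join([
    json.dumps({"po_number": "PO-2024-5678", "tier": "ENTERPRISE"}),
    json.dumps({"po_number": "PO-2024-5678", "tier": "SMB"}),
])
violations = non_enterprise_lines(sample, "PO-2024-5678")
print(len(violations))
```

To use it against the container, pipe the output of the `docker exec ... cat audit_log.jsonl` command into a script that passes the text to `non_enterprise_lines` and exits non-zero when the list is not empty.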

You pass Iter 2 if all three outputs are functionally equivalent to Iter 1, produced with ~120 lines of code instead of ~250.

Iteration 2 Complete

Same curl as Iter 1; same exception report and tier-appropriate draft.

Iteration 2 Metrics

Files: ~10 (CLAUDE.md, agent.py, sessions.py, server.py, Dockerfile, .claude/settings.json, hooks/{log_call,tone_check,audit}.py, slash commands) + 5 mock JSON
Lines you wrote: ~120
Time: ~2 hours
Abstractions used: claude-agent-sdk (query / @tool / MCP server / ClaudeAgentOptions / HookMatcher) + .claude/settings.json + Claude Code

Iteration	Primary debug method	Secondary	Speed to fix
1 Raw	print() in the loop	Manual message inspection	Slow (find the line)
2 SDK	Hooks + Console Web UI	Langfuse traces	Medium (modular probes)
3 Spec	Spec vs code comparison	Tests + evals + Console	Fast (Claude Code finds it)

Metric	Iter 1: Raw API	Iter 2: Agent SDK	Iter 3: Spec-Driven
Lines YOU wrote	~250	~120	~100 (spec only)
Time to build	~3 hours	~2 hours	~1 hour
Agent output	Baseline	Same	Same
Tone enforcement	Inline check after draft	One `.claude/settings.json` entry + a 25-line stdin/stdout script	5 lines in spec section 5
Multi-PO sessions	Manual history dict	SDK sessions	SDK sessions (generated)
Adding alternative-carrier tool	Edit 3 files + add validation	One Claude Code prompt	Update spec, ask to regen
Tests (incl. tone)	Manual	Claude Code generated	Spec generates them
Documentation for ops leadership	Separate	CLAUDE.md	Spec IS the doc ops can read
Control over internals	Full	SDK-managed	Least direct (but reviewable)
Understanding needed	Every line	SDK abstractions	Architecture-level
Debugging	print() in loop	Hooks + Console + Langfuse	Spec compare + tests + evals
Onboarding a new ops engineer	Read 7 files	Read CLAUDE.md + 8 files	Read 1 spec file
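The "stdin/stdout script" in the tone-enforcement row can be pictured with a short sketch. This is not the capstone's tone_check.py. It assumes the hook receives one JSON event on stdin with a `tool_input` object, that `tier` and `draft` live inside it, and that exit code 2 signals a blocking failure; the banned-phrase list is hypothetical, seeded from the two knowledge-check examples.

```python
import io
import json

# Hypothetical phrase list; both entries come from the knowledge-check examples.
BANNED_FOR_ENTERPRISE = ("hi there", "no worries")

def run_hook(stdin) -> tuple[int, str]:
    """Read one hook event as JSON and return (exit_code, stderr_message)."""
    event = json.load(stdin)
    tool_input = event.get("tool_input", {})      # assumed payload key
    tier = tool_input.get("tier", "")             # assumed field
    draft = tool_input.get("draft", "").lower()   # assumed field
    if tier != "ENTERPRISE":
        return 0, ""                              # other tiers pass through
    hits = [p for p in BANNED_FOR_ENTERPRISE if p in draft]
    if hits:
        return 2, f"Tone check failed for ENTERPRISE draft: {hits}"
    return 0, ""

# Simulate stdin with an in-memory stream so the sketch runs anywhere.
event = {"tool_input": {"tier": "ENTERPRISE",
                        "draft": "No worries, we'll get this sorted."}}
code, message = run_hook(io.StringIO(json.dumps(event)))
print(code, message)
```

A real hook script would call `run_hook(sys.stdin)`, write the message to stderr, and exit with the returned code. Keeping the logic in a function that takes a stream is what makes test_tone_check_rejects_bad_enterprise_draft easy to write without spawning a process.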

Capstone 7-B — Agent Evolution: Order Exception

Project Brief

The Three-Iteration Concept

The Scenario — Order Exception Agent

Tools (5)

Mock Data Shape

Animation 1: Three-Lane Evolution

Animation 2: Code Size Waterfall

Animation 3: Time Comparison

Animation 4: Architecture Per Iteration

Animation 5: Spec-to-Code Flow

Prerequisites

Iteration 1: Raw API Loop

Debugging in Iteration 1: Print Statements + Manual Inspection

Iteration 2: Agent SDK + Claude Code

Debugging in Iteration 2: Hooks + Console Web UI + Langfuse

Iteration 3: Spec-Driven

Debugging in Iteration 3: Spec Comparison + Tests + Evals

The Comparison Table

Grading Rubric

Reflection Prompts

Knowledge Check

Q1: All three iterations produce the same exception report. What is the most defensible reason to still go through Iteration 1 rather than skipping straight to Iteration 3?

Q2: Your agent drafts an SMB-tone notification ("Hi there!") for an ENTERPRISE customer. What is the FASTEST debugging path in Iteration 2?

Q3: In Iteration 3, an eval case fails because the agent skipped `query_inventory` and recommended a credit when stock was actually available in another warehouse. The CORRECT fix is:

Q4: The most common Iteration 1 bug in the order-exception scenario is:

Q5: You need to add an "expedite shipping suggestion" capability to all three iterations. Which iteration requires the LEAST disruptive change?

Q6: When would you NOT pick Iteration 3 (spec-driven) for a real production order-exception system?

Q7: An ENTERPRISE-tier draft contains "no worries, we'll get this sorted." The tone-check hook should:

Going Further (Optional)
