M17: Output Guardrails & Human-in-the-Loop

Validate what your agent says, control what it costs, and know when to hand off to a human.

Learning Objectives

Implement output validation: hallucination detection, toxicity filtering, and format checks
Build cost controls with budget ceilings, token caps, and loop detection
Design human-in-the-loop patterns: approval gates, modification gates, and escalation gates
Implement the circuit breaker pattern to prevent cascading failures
Wire input and output guardrails into a complete production pipeline

Output Validation

In M16, you built input guardrails, which protect your system from users. Now we build the other half: output guardrails, which protect users from your system.

💡 Everyday Analogy

Before quality inspectors existed on factory assembly lines, products went straight from the machines to the customer. Most products were fine, but occasionally a defective one slipped through — a car with a faulty brake line, a toy with a sharp edge, a medicine with the wrong dosage. The machines themselves didn't know they'd made a mistake. Quality inspectors sit at the END of the line, checking every product before it ships. They don't make the products — they catch the defects that the manufacturing process missed. Output guardrails work exactly the same way. Your agent generates responses using Claude, and most responses are excellent. But occasionally Claude hallucinates a fact, generates inappropriate content, or produces malformed output. Output guardrails catch these defects before the response reaches your user.

Here's what an output guardrail check actually looks like in practice. The hallucination detector returns a structured verdict for each factual claim:

// Agent response: "The filing was submitted on March 15, 2024
//   by Acme Corp for $2.3M in equipment collateral."
// Source document says: March 22, 2024; Acme Corp; $2.3M equipment

{
  "claims": [
    { "text": "filed on March 15, 2024", "status": "contradicted",
      "source": "Document shows March 22, 2024" },
    { "text": "by Acme Corp", "status": "supported",
      "source": "filing_2024_NY.txt line 3" },
    { "text": "$2.3M in equipment collateral", "status": "supported",
      "source": "filing_2024_NY.txt line 7" }
  ],
  "overall": "block",       // One contradicted claim = block
  "unsupported_count": 0
}
// Result: Response blocked. Agent told to re-check the date.

📐 Technical Definition

Output validation runs three critical checks on every agent response:

Hallucination detection verifies factual claims against source documents. If your RAG agent (M09-M10) says "The filing was submitted on March 15, 2024," the hallucination detector checks whether the source documents actually contain that date. Implementation uses a separate Claude call that receives both the response and the source documents, then classifies each claim as "supported," "unsupported," or "contradicted."

Toxicity filtering screens for harmful, biased, or inappropriate content. While Claude's safety training handles most cases, edge cases slip through — especially in agentic loops where the model processes external data that may contain offensive content. A Claude-as-judge call rates content on a severity scale (1-5) and blocks anything above the configured threshold.

Format validation ensures the output matches expected structure. If your agent should return JSON with specific fields, the format validator runs three sub-checks: Is the JSON syntactically valid? Are all required fields present? Are values within expected ranges (e.g., a dollar amount is positive, a date is not in the future)? This is the cheapest check — no API call needed, just local parsing — and it catches the most common failures.
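The three format sub-checks can be sketched with the standard library alone. This is a minimal illustration, not the course's implementation: the field names in `REQUIRED_FIELDS` and the function name `validate_format` are hypothetical, chosen to match the filing example above.

```python
import json
from datetime import date

# Hypothetical schema for the filing example; adjust to your agent's output.
REQUIRED_FIELDS = ("amount_usd", "filing_date", "debtor")


def validate_format(raw: str, today: date) -> list[str]:
    """Return a list of problems; an empty list means the output passed."""
    # Sub-check 1: is the JSON syntactically valid?
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]

    # Sub-check 2: are all required fields present?
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if name not in data]
    if problems:
        return problems

    # Sub-check 3: are values within expected ranges?
    amount = data["amount_usd"]
    if (isinstance(amount, bool)
            or not isinstance(amount, (int, float))
            or amount <= 0):
        problems.append("amount_usd must be a positive number")
    try:
        filed = date.fromisoformat(str(data["filing_date"]))
        if filed > today:
            problems.append("filing_date is in the future")
    except ValueError:
        problems.append("filing_date must be an ISO date (YYYY-MM-DD)")
    return problems
```

Passing `today` in as a parameter keeps the check deterministic, which makes it easy to unit-test without touching the system clock.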

Output Validation Pipeline (static diagram): Agent Response → Format Validation (schema check) → Hallucination Detection (fact verification) → Toxicity Filter (content safety) → User. Clean outputs reach the user. Hallucinated claims get flagged. Toxic content gets blocked.
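One way to chain the three checks is a small runner that stops at the first block, so the cheap format check can spare you the cost of the judge calls. The sketch below is an assumption about how you might wire it. `Verdict`, `run_output_pipeline`, and the three stand-in checks are hypothetical names; real checks would parse a schema or call a judge model.

```python
from typing import Callable, NamedTuple, Optional


class Verdict(NamedTuple):
    result: str        # "pass", "flag", or "block"
    check: str         # which check produced this verdict
    reason: str = ""


Check = Callable[[str], Verdict]


def run_output_pipeline(response: str, checks: list[Check]) -> Verdict:
    """Run checks in order. Stop at the first block; remember the first flag."""
    first_flag: Optional[Verdict] = None
    for check in checks:
        verdict = check(response)
        if verdict.result == "block":
            return verdict  # short-circuit: later, costlier checks never run
        if verdict.result == "flag" and first_flag is None:
            first_flag = verdict
    return first_flag if first_flag is not None else Verdict("pass", "pipeline")


# Stand-in checks (hypothetical) so the runner can be exercised offline.
def format_check(response: str) -> Verdict:
    ok = response.lstrip().startswith("{")
    return Verdict("pass" if ok else "block", "format",
                   "" if ok else "not a JSON object")


def hallucination_check(response: str) -> Verdict:
    flagged = "unverified" in response
    return Verdict("flag" if flagged else "pass", "hallucination",
                   "unsupported claims" if flagged else "")


def toxicity_check(response: str, severity: int = 1, threshold: int = 3) -> Verdict:
    # In a real system, severity would come from a judge call on the 1-5 scale.
    blocked = severity > threshold
    return Verdict("block" if blocked else "pass", "toxicity",
                   f"severity {severity}" if blocked else "")


PIPELINE = [format_check, hallucination_check, toxicity_check]
```

Ordering the list from cheapest to most expensive check is the design choice that matters here: a malformed response is rejected before any API call is made.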

✅ Why It Matters

In a healthcare pre-authorization agent (Domain A), a hallucinated CPT code could approve the wrong procedure — costing thousands of dollars and potentially harming a patient. In a B2B ecommerce agent (Domain B), a format error that drops the shipping address field means packages go nowhere. A hallucination detector that costs $0.003 per check is trivially cheap compared to the cost of acting on fabricated data. Output guardrails are the last checkpoint before your agent's decisions affect real people.

⚠️ Common Misconceptions

"Output guardrails are unnecessary because Claude is already safe" — Claude's safety training catches most harmful content, but it's not infallible. In agentic loops where external data flows through the model, offensive content from data sources can leak into responses. And safety training doesn't prevent hallucinations — the model can confidently state incorrect facts.

"Human-in-the-loop means the AI is unreliable" — The opposite. HITL is a sign of a well-designed system. Even the best human employees need manager approval for large transactions or important communications. HITL gates are the same principle — they're about risk management, not reliability.

"The model can tell you how confident it is" — Self-reported confidence scores ("I'm 85% confident") are NOT calibrated. Claude might say "I'm very confident" about a hallucinated fact. Use programmatic criteria (source document matching, structured validation) for escalation decisions, not the model's self-assessment.

"Circuit breakers are overkill for AI agents" — Without them, a downstream API outage causes every agent request to fail, burn tokens on error responses, and trigger retries that compound the problem. A circuit breaker saves thousands during a 30-minute outage by short-circuiting to a fallback response after 3 failures.

Output validation catches mistakes in what your agent says. But there's another critical dimension: what it COSTS. Without budget controls, a single agent loop can burn through your entire monthly API budget in minutes.

Cost Controls

💡 Everyday Analogy

Before prepaid phone plans existed, parents gave their teenagers a phone with an unlimited plan — and then got a $2,000 bill because the teenager discovered international calling. The phone worked perfectly; the problem was that nothing stopped reasonable use from escalating into catastrophic cost. Running an AI agent without cost controls is the same story. Your agent loop works perfectly in testing — 3-4 iterations, $0.02 per request. But in production, an edge case triggers 50 iterations, each processing a 100K-token document, and suddenly one request costs $15 instead of $0.02. Without budget ceilings, a single bug can burn through your monthly budget in hours.

Here's what a cost middleware check looks like in practice — the object your middleware maintains after each API call:

// After iteration 3 of an agent loop:
{
  "iteration": 3,
  "input_tokens": 24000,    // 8K per iteration × 3
  "output_tokens": 6000,    // 2K per iteration × 3
  "total_cost": "$0.162",   // (24K × $3/M) + (6K × $15/M)
  "budget": "$0.50",
  "budget_remaining": "$0.338",
  "can_afford_next": true,  // Estimated next call: $0.054
  "status": "ok"
}

// After iteration 9:
{
  "iteration": 9,
  "total_cost": "$0.486",   // 9 × $0.054
  "budget_remaining": "$0.014",
  "can_afford_next": false, // Next call ~$0.054 would exceed budget
  "status": "budget_exceeded: returning fallback response"
}

📐 Technical Definition

Cost control operates at four levels. Here's each one and when it kicks in:

Per-request budget caps the maximum tokens (or dollar cost) for a single agent invocation. If the agent loop consumes more than the budget, it terminates and returns a fallback response instead of continuing to spend. Think of it as a prepaid card for each request — once the balance hits zero, the card declines.

Per-user budget sets daily or monthly limits per user. This prevents any single user from consuming disproportionate resources. Without it, one power user running expensive queries all day could eat your entire monthly allocation.

Loop detection identifies when the agent is calling the same tool repeatedly without making progress. For example, an agent might retry a failing API call infinitely — each retry costs tokens but produces no useful result. Loop detection spots this pattern and breaks the cycle.

Execution time limits are wall-clock timeouts that kill the agent process if it runs too long. These catch infinite loops that might not hit token limits — for example, an agent stuck in a reasoning loop that generates very few tokens per iteration but never terminates.
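Loop detection in particular is easy to prototype. The sketch below assumes a "loop" means the same tool being called with the same arguments several times within a short window. The class name `LoopDetector` and its default limits are hypothetical, and a real detector might also compare tool results to judge progress.

```python
import json
from collections import deque


class LoopDetector:
    """Flags an agent that keeps issuing the same tool call."""

    def __init__(self, max_repeats: int = 3, window: int = 6):
        self.max_repeats = max_repeats
        # Only the most recent `window` calls are remembered.
        self.recent: deque[str] = deque(maxlen=window)

    def record(self, tool_name: str, tool_input: dict) -> bool:
        """Record a tool call; return True if it looks like a loop."""
        # Canonical JSON so that key order does not hide a repeat.
        signature = tool_name + ":" + json.dumps(tool_input, sort_keys=True)
        self.recent.append(signature)
        return self.recent.count(signature) >= self.max_repeats
```

When `record()` returns True, the agent loop would typically stop calling tools and return a fallback response or escalate.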

Implementation uses middleware that wraps every Claude API call. The middleware tracks cumulative input tokens, output tokens, and cost. Before each API call, it checks whether the next call would exceed the budget. If so, it short-circuits the call and returns a budget-exceeded response. The key insight: the middleware sits BETWEEN your agent code and the API client, so your agent logic doesn't need to know about budgets at all.
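The middleware described above can be sketched as a tracker plus a wrapper function. The rates ($3 and $15 per million tokens) come from the example figures earlier in this section. The names `CostTracker` and `guarded_call`, and the convention that `call_model` returns `(content, input_tokens, output_tokens)`, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable

INPUT_USD_PER_MTOK = 3.00    # rates taken from the example figures above
OUTPUT_USD_PER_MTOK = 15.00


def price(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a call at the rates above."""
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000


@dataclass
class CostTracker:
    budget_usd: float
    input_tokens: int = 0
    output_tokens: int = 0

    @property
    def total_cost(self) -> float:
        return price(self.input_tokens, self.output_tokens)

    def can_afford(self, est_input: int, est_output: int) -> bool:
        """Would the estimated next call still fit inside the budget?"""
        return self.total_cost + price(est_input, est_output) <= self.budget_usd

    def record(self, input_tokens: int, output_tokens: int) -> None:
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens


def guarded_call(tracker: CostTracker,
                 call_model: Callable[[], tuple[str, int, int]],
                 est_input: int, est_output: int) -> dict:
    """Middleware: check the budget, make the call, record actual usage."""
    if not tracker.can_afford(est_input, est_output):
        # Short-circuit without calling the API at all.
        return {"status": "budget_exceeded", "content": "fallback response"}
    content, used_in, used_out = call_model()
    tracker.record(used_in, used_out)
    return {"status": "ok", "content": content}
```

Because the agent code only ever sees the dictionary returned by `guarded_call`, it needs no knowledge of prices or budgets, which is the separation the paragraph above describes.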

Cost Accumulation in Agent Loop (static chart): Stacked bar chart showing cost per agent loop iteration against a $0.50 budget. Input tokens (blue), output tokens (green), and tool calls (orange) accumulate. A budget ceiling line marks the $0.50 limit. When costs approach 80%, bars turn yellow. At 100%, the agent returns a fallback response instead of continuing.

💰 Cost Alert

Real numbers: A single Claude Sonnet call with 10K input tokens and 2K output tokens costs ~$0.06 (10K × $3/M + 2K × $15/M). An agentic loop that runs 20 such iterations (each with tool use) processes ~240K total tokens, which is about $1.20 per request. At 1,000 requests/day, that would be $1,200/day if every request ran that long. A per-request budget of $0.50 stops runaway loops after about 8 iterations while leaving normal 3-4 iteration flows untouched, so the worst case drops to $500/day. The budget saves you from the long tail of expensive edge cases.

Cost controls protect your wallet. But some decisions are too important — or too risky — for an AI to make alone. That's where human-in-the-loop patterns come in: designing clear handoff points where humans take the controls.

Human-in-the-Loop Patterns

💡 Everyday Analogy

Before autopilot systems matured, every second of a flight required active human control. Modern autopilot handles the routine — maintaining altitude, speed, and heading for hours. But the pilot takes over for takeoff, landing, and anything unexpected — turbulence, system warnings, unusual ATC instructions. The key insight isn't that autopilot is unreliable — it's that some decisions require human judgment, and the system is designed with clear handoff points where control transfers smoothly. Human-in-the-loop (HITL) patterns work exactly the same way. Your agent handles routine requests autonomously, but at specific decision points — irreversible actions, high-stakes choices, edge cases — it pauses and hands control to a human.

Here's what an approval gate request looks like when it reaches a human reviewer — the JSON object that gets stored in the approval queue:

// What the human reviewer sees in their approval queue:
{
  "gate_type": "approval",
  "agent_id": "order-support-agent-7",
  "proposed_action": "Send $450 refund to customer",
  "context": {
    "order_id": "ORD-12345",
    "reason": "Damaged item — photos verified by vision model",
    "return_window": "within 30-day policy",
    "customer_history": "12 orders, 0 previous refunds"
  },
  "state": "waiting_for_human",
  "options": ["approve", "deny", "modify"],
  "created_at": "2024-03-22T14:32:00Z",
  "timeout": "4 hours — auto-escalate to manager if no response"
}

📐 Technical Definition

Three HITL gate patterns serve different purposes:

Approval gates pause before irreversible actions. The agent proposes "Send refund of $450 to customer X" and waits for human confirmation. The human sees the full context, clicks Approve or Deny, and the agent proceeds accordingly. Used for: sending emails, processing payments, modifying databases, deploying code.

Modification gates present a draft for human editing. The agent generates a customer response, displays it, and the human can edit the text before it's sent. Unlike approval gates (binary yes/no), modification gates allow the human to adjust the content. Used for: customer communications, reports, legal documents, content that needs a human voice.

Escalation gates detect when the agent has hit its competence boundary. The agent recognizes "I can't resolve this" and routes to a human specialist with full context. Unlike the other gates, which fire on a fixed type of action, escalation is triggered by conditions detected at runtime, such as a policy gap or a capability limit. Detect those conditions with programmatic criteria rather than the model's self-reported confidence. Used for: edge cases, ambiguous situations, policy gaps, high-stakes decisions the agent wasn't trained for.

Each pattern requires a state machine that pauses agent execution, stores the pending action in a queue, notifies the human, waits for their response, and resumes the agent with the human's decision.
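A minimal in-memory version of that state machine might look like the sketch below. It covers the approval gate only. `ApprovalQueue`, `PendingAction`, and `resume` are hypothetical names, and a production system would persist the queue and handle timeouts rather than hold everything in a dictionary.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class GateState(Enum):
    WAITING_FOR_HUMAN = "waiting_for_human"
    APPROVED = "approved"
    DENIED = "denied"


@dataclass
class PendingAction:
    proposed_action: str
    context: dict
    state: GateState = GateState.WAITING_FOR_HUMAN
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class ApprovalQueue:
    def __init__(self, notify: Callable[[PendingAction], None]):
        self._pending: dict[str, PendingAction] = {}
        self._notify = notify

    def submit(self, proposed_action: str, context: dict) -> str:
        """Pause point: store the action, notify the human, return a ticket id."""
        item = PendingAction(proposed_action, context)
        self._pending[item.request_id] = item
        self._notify(item)
        return item.request_id

    def decide(self, request_id: str, approve: bool) -> PendingAction:
        """Record the human's decision; each request can be decided once."""
        item = self._pending[request_id]
        if item.state is not GateState.WAITING_FOR_HUMAN:
            raise ValueError("request already decided")
        item.state = GateState.APPROVED if approve else GateState.DENIED
        return item


def resume(item: PendingAction, execute: Callable[[str], str]) -> str:
    """Resume the agent: run the action only if the human approved it."""
    if item.state is GateState.APPROVED:
        return execute(item.proposed_action)
    return f"Action not executed: {item.state.value}"
```

The irreversible step lives entirely inside `execute`, so nothing can run before a decision has been recorded.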

Human-in-the-Loop Gate Patterns (static diagram): Three timelines. (1) Approval: agent proposes action, pauses for human approval, then executes. (2) Modification: agent drafts content, human edits it, then agent sends. (3) Escalation: agent detects uncertainty, routes to human specialist who resolves the issue.

🎓 Cert Tip — Domain 5.2

Escalate based on policy gaps, capability limits, explicit requests, or business thresholds. NEVER escalate based on sentiment/anger alone. An angry customer with a simple request does NOT need a human. A calm customer hitting a policy gap DOES.
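The rule in this tip translates directly into a routing function. The sketch below is illustrative only: the `EscalationSignals` fields and the $500 threshold are assumptions, and how each signal gets detected is left to your system. Note that sentiment is carried along but never consulted.

```python
from dataclasses import dataclass


@dataclass
class EscalationSignals:
    policy_gap: bool          # no policy covers this request
    capability_limit: bool    # the agent lacks a tool or permission it needs
    human_requested: bool     # the customer explicitly asked for a person
    amount_usd: float         # value at stake, for business thresholds
    sentiment: str            # kept for analytics, deliberately not used below


def should_escalate(signals: EscalationSignals,
                    amount_threshold_usd: float = 500.0) -> tuple[bool, str]:
    """Return (escalate?, reason) based on programmatic criteria only."""
    if signals.human_requested:
        return True, "explicit_request"
    if signals.policy_gap:
        return True, "policy_gap"
    if signals.capability_limit:
        return True, "capability_limit"
    if signals.amount_usd > amount_threshold_usd:
        return True, "business_threshold"
    return False, "none"
```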

🎓 Cert Tip — Domain 1.4 (Programmatic Prerequisite Gates)

This is sample question #1 in the official exam guide. When a specific tool sequence is required for critical business logic (e.g., verify customer identity via get_customer BEFORE process_refund), use a programmatic hook that blocks the downstream tool call until the prerequisite returns success. Prompt-based instructions ("you must call get_customer first") have a non-zero failure rate — that's unacceptable when errors have financial consequences. Hooks provide deterministic guarantees; prompts provide probabilistic compliance. The exam consistently rewards the hook answer over the prompt answer for high-stakes ordering. Pair with structured handoff summaries when escalating mid-process: customer ID, root cause, refund amount, recommended action — the human agent doesn't have your transcript.
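A programmatic prerequisite gate can be as small as a wrapper around tool execution. In this sketch the tool names `get_customer` and `process_refund` come from the tip above, while `ToolGate` and `PrerequisiteError` are hypothetical. "Success" is assumed to mean that the prerequisite tool returned without raising.

```python
from typing import Any, Callable


class PrerequisiteError(Exception):
    """Raised when a tool is called before its prerequisite succeeded."""


class ToolGate:
    def __init__(self, prerequisites: dict[str, str]):
        # Maps a tool name to the tool that must have succeeded first.
        self.prerequisites = prerequisites
        self.succeeded: set[str] = set()

    def run(self, tool_name: str, tool_fn: Callable[..., Any], **kwargs: Any) -> Any:
        required = self.prerequisites.get(tool_name)
        if required is not None and required not in self.succeeded:
            # Deterministic block: the downstream tool never executes.
            raise PrerequisiteError(
                f"{tool_name} requires a successful {required} call first")
        result = tool_fn(**kwargs)
        # Only reached if tool_fn did not raise, so failures are not recorded.
        self.succeeded.add(tool_name)
        return result
```

The error raised by the gate can be returned to the model as a tool result, so the agent learns which call it has to make first.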

✅ Why It Matters

The best HITL systems make human intervention effortless. When a human gets an approval request, they should see: the proposed action ("Send $450 refund to Order #12345"), the full context ("Customer received damaged item, photos verified, within 30-day return window"), and one-click actions (Approve / Deny / Modify). Forcing the human to reconstruct context from scratch defeats the purpose — the agent should do 95% of the work, and the human provides the final 5% of judgment.

HITL gates handle individual decisions. But what about systemic failures — when your downstream APIs go down, or your hallucination detector starts failing repeatedly? That's where the circuit breaker pattern prevents cascading damage.

Circuit Breaker Pattern

💡 Everyday Analogy

Before circuit breakers existed in homes, an electrical overload — a faulty appliance, a lightning strike, too many devices on one outlet — would cause the wiring to overheat and start a fire. The circuit breaker monitors current flow and TRIPS when it detects dangerous levels, cutting power instantly. Once the problem is fixed (unplug the faulty appliance), you flip the breaker back on. Without the breaker, one bad appliance could burn down the house. An agent circuit breaker works the same way. When failures pile up — API errors, hallucination detections, timeouts — the breaker trips and routes all requests to a safe fallback response instead of continuing to make failing API calls that waste money and degrade user experience.

Here's the internal state of a circuit breaker as it transitions through all three states:

// Request 1: success → state unchanged
{ "state": "closed", "failures": 0, "action": "allowed" }

// Request 2: API 500 error
{ "state": "closed", "failures": 1, "action": "allowed" }

// Request 3: API 500 error
{ "state": "closed", "failures": 2, "action": "allowed" }

// Request 4: API 500 error → threshold reached, circuit TRIPS
{ "state": "open", "failures": 3, "action": "tripped",
  "cooldown_until": "2024-03-22T14:33:00Z" }

// Request 5: arrives during cooldown
{ "state": "open", "action": "blocked",
  "response": "fallback — no API call made" }

// After 60s cooldown expires:
{ "state": "half_open", "action": "testing — 1 request allowed" }

// Test request succeeds:
{ "state": "closed", "failures": 0, "action": "recovered" }

📐 Technical Definition

The circuit breaker has three states:

Closed (normal): requests flow through to Claude and failures are counted. If failures reach the threshold (e.g., 3 consecutive failures in a 5-minute window), the circuit trips.

Open (tripped): ALL requests immediately receive a fallback response and no API calls are made. This prevents piling on a failing service. A cooldown timer starts.

Half-Open (testing): after the cooldown expires, ONE test request is allowed through. If it succeeds, the circuit closes (returns to normal). If it fails, the circuit re-opens with a longer cooldown (exponential backoff).

What counts as a "failure"? Several types: API 500 errors (the server is down), rate limit 429 responses (you're sending too many requests), hallucination detections above your threshold, guardrail violations, and execution timeouts. Each failure type can have its own threshold and cooldown. For example, you might tolerate 5 rate-limit errors (they're transient) but trip on just 2 consecutive hallucination detections (those suggest a deeper problem with the model's context).
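Per-type thresholds can be tracked with one counter per failure kind. The sketch below reads "tolerate 5 rate-limit errors" as "trip on the fifth", which is an assumption, and the `server_error` value of 3 is borrowed from the earlier example. `FailureCounters` and `TRIP_AT` are hypothetical names. For simplicity, a success resets every counter, and a failure of one kind does not reset the others.

```python
from collections import defaultdict

# Failure kind -> number of failures since the last success that trips the breaker.
TRIP_AT = {"rate_limit": 5, "hallucination": 2, "server_error": 3}


class FailureCounters:
    def __init__(self, trip_at: dict[str, int]):
        self.trip_at = trip_at
        self.counts: defaultdict[str, int] = defaultdict(int)

    def record_failure(self, kind: str) -> bool:
        """Count a failure; return True if this kind should now trip the breaker."""
        self.counts[kind] += 1
        return self.counts[kind] >= self.trip_at[kind]

    def record_success(self) -> None:
        """Any success clears all counters."""
        self.counts.clear()
```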

Circuit Breaker State Machine (static diagram): Starts in Closed state (green). After 3 consecutive failures, trips to Open state (red), where all requests get the fallback response. After a 60s cooldown, enters Half-Open (yellow), where one test request is allowed. If the test passes, returns to Closed. If the test fails, re-opens with a longer cooldown.

🎓 Cert Tip — Domain 5.5

Self-reported confidence scores are NOT reliable for escalation decisions. The model's internal confidence is not well-calibrated. Use structured programmatic criteria instead.

🎓 Cert Tip — Domain 5.5 (Stratified Sampling + Field-Level Confidence)

Don't escalate on aggregate confidence. Use field-level confidence: extract each field with its own score, escalate only the fields below threshold instead of the whole document. For human review batches, use stratified sampling — sample N from each confidence bucket (high/med/low) so reviewers see the full distribution. Anti-pattern: routing the top-N most-confident extractions to humans (they're all correct, so reviewers learn nothing) or sampling uniformly (low-confidence cases get under-reviewed).
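Both ideas in this tip fit in a few lines. The sketch below assumes each extracted field already carries a score from some scorer you trust. The bucket boundaries (0.9 and 0.7), the 0.8 escalation threshold, and the function names are illustrative assumptions.

```python
import random


def fields_to_escalate(extraction: dict[str, tuple[str, float]],
                       threshold: float = 0.8) -> list[str]:
    """Field-level routing: return only the fields scoring below threshold."""
    return sorted(name for name, (_value, score) in extraction.items()
                  if score < threshold)


def bucket(score: float) -> str:
    if score >= 0.9:
        return "high"
    if score >= 0.7:
        return "med"
    return "low"


def stratified_sample(items: list[tuple[str, float]], per_bucket: int,
                      seed: int = 0) -> dict[str, list[str]]:
    """Sample up to per_bucket item ids from each confidence bucket."""
    rng = random.Random(seed)  # seeded so review batches are reproducible
    buckets: dict[str, list[str]] = {"high": [], "med": [], "low": []}
    for item_id, score in items:
        buckets[bucket(score)].append(item_id)
    return {name: rng.sample(ids, min(per_bucket, len(ids)))
            for name, ids in buckets.items()}
```

Sampling the same number from every bucket means the rare low-confidence cases are reviewed as often as the common high-confidence ones.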

✅ Why It Matters

Without a circuit breaker, a downstream API outage causes every agent request to fail. Each failure triggers a retry. Retries pile up. Rate limits are exhausted. Error responses consume tokens (you still pay for the failed attempt). All users are affected simultaneously. A circuit breaker detects the pattern after 3 failures and immediately switches to fallback responses: no more API calls, no more retries, no more cost. When the API recovers, the half-open test request verifies it's healthy before resuming normal traffic. The breaker itself costs nothing to run, and during a long outage it can prevent a large bill for calls that were never going to succeed.

You now have all the building blocks: output validation, cost controls, HITL gates, and circuit breakers. Let's wire them into production-ready code.

Code Walkthrough

Hallucination Detector

Let's build the hallucination detector step by step. The core idea is the "Claude-as-judge" pattern: we make a SEPARATE Claude API call whose only job is fact-checking. Why separate? Because the original agent's reasoning context can bias its self-evaluation. A fresh Claude call with a focused prompt is a much more reliable judge.

The code has three logical parts. First, the prompt template — this tells the judge exactly how to classify claims and what JSON structure to return. Second, the API call — a straightforward Claude message where we inject the source documents and the response to check. Third, the decision logic — the part that turns the judge's per-claim classifications into an overall pass/flag/block verdict. The critical design choice here: we "fail closed." If the fact-checking call itself errors out (network issue, JSON parse failure), we flag the response for human review rather than letting it through unchecked.

import anthropic
import json

HALLUCINATION_PROMPT = """You are a fact-checking judge. Compare the agent's
response against the provided source documents.

For each factual claim in the response, classify it as:
- "supported": Claim is directly backed by source documents
- "unsupported": Claim is NOT in source documents (possible hallucination)
- "contradicted": Claim CONTRADICTS source documents (definite error)

Respond with ONLY a JSON object:
{
  "claims": [
    {"text": "claim text", "status": "supported|unsupported|contradicted", "source": "which document supports/contradicts, or null"}
  ],
  "overall": "pass|flag|block",
  "unsupported_count": 0
}

Source documents:
<sources>
{sources}
</sources>

Agent response to fact-check:
<response>
{response}
</response>
"""

def check_hallucination(
    client: anthropic.Anthropic,
    response_text: str,
    source_docs: list[str],
) -> dict:
    """Verify agent output against source documents.

    Uses a SEPARATE Claude call (Claude-as-judge pattern).
    The judge has its own clean context and can't be influenced
    by the agent's reasoning that led to the response.
    """
    sources_text = "\n---\n".join(source_docs)
    try:
        result = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
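            messages=[{
                "role": "user",
                # str.format() would fail on the literal JSON braces in
                # the template, so substitute the placeholders directly.
                "content": HALLUCINATION_PROMPT
                    .replace("{sources}", sources_text)
                    .replace("{response}", response_text),
            }],
        )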
        parsed = json.loads(result.content[0].text)

        # Decision logic:
        # - Any "contradicted" claim = block immediately
        # - 2+ "unsupported" claims = flag for review
        # - Otherwise (at most one unsupported claim) = pass
        contradicted = [c for c in parsed["claims"] if c["status"] == "contradicted"]
        unsupported = [c for c in parsed["claims"] if c["status"] == "unsupported"]

        if contradicted:
            return {"result": "block", "reason": f"Contradicted claims: {contradicted}", "claims": parsed["claims"]}
        elif len(unsupported) >= 2:
            return {"result": "flag", "reason": f"{len(unsupported)} unsupported claims", "claims": parsed["claims"]}
        else:
            return {"result": "pass", "claims": parsed["claims"]}

    except Exception as e:
        # Fail closed — if fact-checking fails, flag for review
        return {"result": "flag", "reason": f"Fact-check error: {e}", "claims": []}

import Anthropic from '@anthropic-ai/sdk';

const HALLUCINATION_PROMPT = `You are a fact-checking judge. Compare the agent's
response against the provided source documents.

For each factual claim, classify as:
- "supported": Backed by source documents
- "unsupported": NOT in sources (possible hallucination)
- "contradicted": CONTRADICTS sources (definite error)

Respond with ONLY JSON:
{"claims": [{"text": "...", "status": "supported|unsupported|contradicted", "source": "..."}], "overall": "pass|flag|block", "unsupported_count": 0}

Source documents:
<sources>
{SOURCES}
</sources>

Agent response to fact-check:
<response>
{RESPONSE}
</response>`;

async function checkHallucination(client, responseText, sourceDocs) {
  // Claude-as-judge: separate call with clean context.
  const sourcesText = sourceDocs.join('\n---\n');
  try {
    const result = await client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 1024,
      messages: [{
        role: 'user',
        // Function replacers keep "$" sequences in the inserted text literal.
        content: HALLUCINATION_PROMPT
          .replace('{SOURCES}', () => sourcesText)
          .replace('{RESPONSE}', () => responseText),
      }],
    });

    const parsed = JSON.parse(result.content[0].text);
    const contradicted = parsed.claims.filter((c) => c.status === 'contradicted');
    const unsupported = parsed.claims.filter((c) => c.status === 'unsupported');

    if (contradicted.length > 0) {
      return { result: 'block', reason: `Contradicted claims found`, claims: parsed.claims };
    } else if (unsupported.length >= 2) {
      return { result: 'flag', reason: `${unsupported.length} unsupported claims`, claims: parsed.claims };
    }
    return { result: 'pass', claims: parsed.claims };
  } catch (error) {
    // Fail closed — flag for review if fact-checking errors
    return { result: 'flag', reason: `Fact-check error: ${error.message}`, claims: [] };
  }
}

🔍 What Just Happened?

You built a hallucination detector using the Claude-as-judge pattern. A separate Claude call receives both the agent's response and the source documents, then classifies each factual claim as supported, unsupported, or contradicted. The decision logic: any contradicted claim = block, 2+ unsupported claims = flag for review, anything else = pass (so a single unsupported claim does not hold up the response). The judge runs in its own context so it can't be influenced by the agent's reasoning.

Circuit Breaker Implementation

Now let's build the circuit breaker itself. The interesting challenge here is managing state transitions cleanly. We need three states (Closed, Open, Half-Open), and the transitions between them must be atomic — you can't have two requests both thinking they're the "test request" in Half-Open state. The implementation below uses a simple class with methods for each state transition. In production, you'd add thread safety (a lock around state changes), but the logic is identical.

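Pay attention to the can_execute() method: that's the gatekeeper. Every API call goes through it first. And notice how record_failure() handles the Half-Open case differently: a failure during testing re-opens the circuit with a DOUBLED cooldown (exponential backoff), giving the failing service progressively more time to recover. One simplification to be aware of: the cooldown is never reset to its original value after a recovery, so a production version would restore it once a test request succeeds.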

import time
from enum import Enum
from dataclasses import dataclass, field

class BreakerState(Enum):
    CLOSED = "closed"       # Normal — requests flow through
    OPEN = "open"           # Tripped — all requests get fallback
    HALF_OPEN = "half_open" # Testing — one request allowed

@dataclass
class CircuitBreaker:
    """Circuit breaker for agent API calls.

    Monitors consecutive failures and trips when threshold
    is reached. Prevents cascading failures and wasted spend
    during outages.
    """
    failure_threshold: int = 3      # Trip after N failures
    cooldown_seconds: float = 60.0  # Wait before testing
    state: BreakerState = field(default=BreakerState.CLOSED)
    failure_count: int = field(default=0)
    last_failure_time: float = field(default=0.0)
    opened_at: float = field(default=0.0)

    def can_execute(self) -> bool:
        """Check if a request should be allowed through."""
        if self.state == BreakerState.CLOSED:
            return True

        if self.state == BreakerState.OPEN:
            # Check if cooldown has elapsed
            elapsed = time.time() - self.opened_at
            if elapsed >= self.cooldown_seconds:
                self.state = BreakerState.HALF_OPEN
                return True  # Allow one test request
            return False  # Still cooling down

        if self.state == BreakerState.HALF_OPEN:
            return False  # Already testing, block others

        return False

    def record_success(self):
        """Record a successful request."""
        if self.state == BreakerState.HALF_OPEN:
            # Test passed — close the circuit
            self.state = BreakerState.CLOSED
            self.failure_count = 0
        elif self.state == BreakerState.CLOSED:
            self.failure_count = 0  # Reset on success

    def record_failure(self):
        """Record a failed request."""
        self.failure_count += 1
        self.last_failure_time = time.time()

        if self.state == BreakerState.HALF_OPEN:
            # Test failed — re-open with longer cooldown
            self.state = BreakerState.OPEN
            self.opened_at = time.time()
            self.cooldown_seconds *= 2  # Exponential backoff

        elif self.state == BreakerState.CLOSED:
            if self.failure_count >= self.failure_threshold:
                self.state = BreakerState.OPEN
                self.opened_at = time.time()

    @property
    def fallback_response(self) -> str:
        return (
            "I'm temporarily unable to process your request. "
            "Our systems are experiencing issues. Please try "
            "again in a few minutes."
        )

# Usage with an agent
breaker = CircuitBreaker(failure_threshold=3, cooldown_seconds=60)

def call_agent_with_breaker(client, messages):
    """Wrap agent calls with circuit breaker protection."""
    if not breaker.can_execute():
        return {"content": breaker.fallback_response, "circuit_breaker": "open"}

    try:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=messages,
        )
        breaker.record_success()
        return {"content": response.content[0].text, "circuit_breaker": breaker.state.value}
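    except Exception:
        breaker.record_failure()
        # Inspect the state directly: can_execute() has side effects
        # (it can move OPEN to HALF_OPEN), so it is not a safe status check.
        if breaker.state == BreakerState.OPEN:
            return {"content": breaker.fallback_response, "circuit_breaker": "open"}
        raise  # Re-raise if breaker hasn't tripped yet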

import Anthropic from '@anthropic-ai/sdk';

const BreakerState = { CLOSED: 'closed', OPEN: 'open', HALF_OPEN: 'half_open' };

class CircuitBreaker {
  // Monitors failures and trips when threshold reached.
  // Prevents cascading failures during outages.
  constructor(failureThreshold = 3, cooldownSeconds = 60) {
    this.failureThreshold = failureThreshold;
    this.cooldownSeconds = cooldownSeconds;
    this.state = BreakerState.CLOSED;
    this.failureCount = 0;
    this.openedAt = 0;
  }

  canExecute() {
    if (this.state === BreakerState.CLOSED) return true;

    if (this.state === BreakerState.OPEN) {
      const elapsed = (Date.now() - this.openedAt) / 1000;
      if (elapsed >= this.cooldownSeconds) {
        this.state = BreakerState.HALF_OPEN;
        return true; // Allow one test request
      }
      return false;
    }

    return false; // HALF_OPEN blocks others while testing
  }

  recordSuccess() {
    if (this.state === BreakerState.HALF_OPEN) {
      this.state = BreakerState.CLOSED; // Test passed
      this.failureCount = 0;
    } else {
      this.failureCount = 0; // Reset on success
    }
  }

  recordFailure() {
    this.failureCount++;

    if (this.state === BreakerState.HALF_OPEN) {
      this.state = BreakerState.OPEN; // Test failed
      this.openedAt = Date.now();
      this.cooldownSeconds *= 2; // Exponential backoff
    } else if (this.failureCount >= this.failureThreshold) {
      this.state = BreakerState.OPEN;
      this.openedAt = Date.now();
    }
  }

  get fallbackResponse() {
    return 'I\'m temporarily unable to process your request. '
      + 'Our systems are experiencing issues. Please try again shortly.';
  }
}

// Usage with an agent
const breaker = new CircuitBreaker(3, 60);

async function callAgentWithBreaker(client, messages) {
  if (!breaker.canExecute()) {
    return { content: breaker.fallbackResponse, circuitBreaker: 'open' };
  }
  try {
    const response = await client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 1024,
      messages,
    });
    breaker.recordSuccess();
    return { content: response.content[0].text, circuitBreaker: breaker.state };
  } catch (error) {
    breaker.recordFailure();
    // Read the state directly: canExecute() may change it as a side effect
    if (breaker.state === BreakerState.OPEN) {
      return { content: breaker.fallbackResponse, circuitBreaker: 'open' };
    }
    throw error;
  }
}

🔍 What Just Happened?

You built a circuit breaker with three states: Closed (normal), Open (all requests get fallback), and Half-Open (one test request). When 3 consecutive failures occur, the circuit trips. After a 60-second cooldown, one test request probes recovery. If the test fails, the cooldown doubles (exponential backoff). This prevents burning money on retries during outages and gives downstream services time to recover.
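
To make the backoff behavior concrete, here is a minimal sketch that replays the state machine with a fake clock instead of real sleeps. The `FakeClock` helper and `simulate_failed_probes` function are illustrative names that are not part of the lesson code, and the breaker is a trimmed copy of the class above (failure path only), so treat it as a demonstration under those assumptions rather than a drop-in replacement.

```python
from dataclasses import dataclass
from enum import Enum


class BreakerState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class FakeClock:
    """Hypothetical stand-in for time.time() so the demo runs instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def advance(self, seconds: float) -> None:
        self.now += seconds


@dataclass
class CircuitBreaker:
    clock: FakeClock
    failure_threshold: int = 3
    cooldown_seconds: float = 60.0
    state: BreakerState = BreakerState.CLOSED
    failure_count: int = 0
    opened_at: float = 0.0

    def can_execute(self) -> bool:
        if self.state == BreakerState.CLOSED:
            return True
        if self.state == BreakerState.OPEN:
            if self.clock.now - self.opened_at >= self.cooldown_seconds:
                self.state = BreakerState.HALF_OPEN  # allow one probe
                return True
            return False
        return False  # HALF_OPEN: a probe is already in flight

    def record_failure(self) -> None:
        self.failure_count += 1
        if self.state == BreakerState.HALF_OPEN:
            self.state = BreakerState.OPEN
            self.opened_at = self.clock.now
            self.cooldown_seconds *= 2  # exponential backoff
        elif self.failure_count >= self.failure_threshold:
            self.state = BreakerState.OPEN
            self.opened_at = self.clock.now


def simulate_failed_probes(probes: int) -> list[float]:
    """Trip the breaker, then fail `probes` half-open tests in a row.

    Returns the cooldown that was waited before each probe.
    """
    clock = FakeClock()
    breaker = CircuitBreaker(clock=clock)
    for _ in range(breaker.failure_threshold):
        breaker.record_failure()  # third failure trips the circuit
    cooldowns = []
    for _ in range(probes):
        cooldowns.append(breaker.cooldown_seconds)
        clock.advance(breaker.cooldown_seconds)
        assert breaker.can_execute()  # cooldown elapsed, probe allowed
        breaker.record_failure()      # probe fails, circuit re-opens
    return cooldowns


print(simulate_failed_probes(3))  # [60.0, 120.0, 240.0]
```

Each failed probe doubles the wait, so a service that stays down is probed less and less often. In production you would probably also cap the cooldown at some maximum; that cap is not part of this lesson's breaker.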

Hands-On Exercise

What You'll Build

A complete output guardrail pipeline with hallucination detection (Claude-as-judge), cost tracking with budget enforcement, a circuit breaker with three-state transitions, and a CLI-based human approval gate — tested with 5 scenarios including contradicted claims, budget overruns, and cascading failures.

Time Estimate: 35-45 minutes
Prerequisites: Python 3.9+, an Anthropic API key (from console.anthropic.com), and a terminal
Files You'll Create:
- output_guardrails.py — All four guardrail components (hallucination detector, cost tracker, circuit breaker, approval gate) plus a 5-scenario test suite
API Cost: ~$0.01 total (2 Claude Sonnet calls for hallucination checks)

Environment Setup

mkdir output-guardrails && cd output-guardrails
python -m venv venv && source venv/bin/activate   # Windows: venv\Scripts\activate
pip install anthropic
export ANTHROPIC_API_KEY=your-key-here             # Windows: set ANTHROPIC_API_KEY=your-key-here

Step 1: Build the Output Guardrail Pipeline

What: Create a single file with four output guardrail components — a hallucination detector, cost tracker, circuit breaker, and approval gate — plus a test suite that exercises each one.

Why: Building all four components in one file lets you see how they work together as a pipeline. In production you'd split them into separate modules, but for learning, having everything in one place makes it easier to trace the data flow from agent response through each guardrail to the final output.

Create a new file called output_guardrails.py:

"""Output Guardrails & HITL — M17 Hands-On Lab"""
import anthropic
import json
import time
from dataclasses import dataclass, field
from enum import Enum

client = anthropic.Anthropic()


# ── Hallucination Detector (Claude-as-Judge) ─────────────

HALLUCINATION_PROMPT = """You are a fact-checking judge. Compare the response
against the source documents. For each factual claim, classify it as:
- "supported": Directly backed by sources
- "unsupported": NOT in sources (possible hallucination)
- "contradicted": CONTRADICTS sources (definite error)

Respond with ONLY JSON:
{{"claims": [{{"text": "claim", "status": "supported|unsupported|contradicted"}}],
 "overall": "pass|flag|block", "unsupported_count": 0}}

Sources:

{sources}


Response to check:

{response}
"""

def check_hallucination(response_text: str, source_docs: list[str]) -> dict:
    """Verify agent response against source documents."""
    sources = "\n---\n".join(source_docs)
    try:
        result = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": HALLUCINATION_PROMPT.format(
                    sources=sources, response=response_text
                ),
            }],
        )
        parsed = json.loads(result.content[0].text)

        contradicted = [c for c in parsed["claims"]
                       if c["status"] == "contradicted"]
        unsupported = [c for c in parsed["claims"]
                      if c["status"] == "unsupported"]

        if contradicted:
            return {"result": "block",
                    "reason": f"{len(contradicted)} contradicted claim(s)",
                    "claims": parsed["claims"]}
        elif len(unsupported) >= 2:
            return {"result": "flag",
                    "reason": f"{len(unsupported)} unsupported claim(s)",
                    "claims": parsed["claims"]}
        return {"result": "pass", "claims": parsed["claims"]}
    except Exception as e:
        return {"result": "flag", "reason": f"Check failed: {e}", "claims": []}


# ── Cost Tracker with Budget Enforcement ─────────────────

@dataclass
class CostTracker:
    """Track token usage and enforce per-request budget."""
    budget_dollars: float = 0.50
    input_tokens: int = 0
    output_tokens: int = 0

    # Claude Sonnet pricing (per token)
    INPUT_PRICE = 3.0 / 1_000_000    # $3/M input tokens
    OUTPUT_PRICE = 15.0 / 1_000_000  # $15/M output tokens

    @property
    def total_cost(self) -> float:
        return (self.input_tokens * self.INPUT_PRICE +
                self.output_tokens * self.OUTPUT_PRICE)

    @property
    def budget_remaining(self) -> float:
        return self.budget_dollars - self.total_cost

    def record_usage(self, input_toks: int, output_toks: int):
        self.input_tokens += input_toks
        self.output_tokens += output_toks

    def can_afford(self, estimated_input: int = 5000,
                   estimated_output: int = 1000) -> bool:
        """Check if the next call would stay within budget."""
        estimated_cost = (estimated_input * self.INPUT_PRICE +
                         estimated_output * self.OUTPUT_PRICE)
        return self.total_cost + estimated_cost <= self.budget_dollars

    def summary(self) -> str:
        return (f"Tokens: {self.input_tokens} in / {self.output_tokens} out "
                f"| Cost: ${self.total_cost:.4f} / ${self.budget_dollars:.2f}")


# ── Circuit Breaker ──────────────────────────────────────

class BreakerState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class CircuitBreaker:
    failure_threshold: int = 3
    cooldown_seconds: float = 5.0   # Short for demo
    state: BreakerState = field(default=BreakerState.CLOSED)
    failure_count: int = field(default=0)
    opened_at: float = field(default=0.0)

    def can_execute(self) -> bool:
        if self.state == BreakerState.CLOSED:
            return True
        if self.state == BreakerState.OPEN:
            if time.time() - self.opened_at >= self.cooldown_seconds:
                self.state = BreakerState.HALF_OPEN
                return True
            return False
        return False  # HALF_OPEN blocks while testing

    def record_success(self):
        if self.state == BreakerState.HALF_OPEN:
            self.state = BreakerState.CLOSED
        self.failure_count = 0

    def record_failure(self):
        self.failure_count += 1
        if self.state == BreakerState.HALF_OPEN:
            self.state = BreakerState.OPEN
            self.opened_at = time.time()
            self.cooldown_seconds *= 2  # Exponential backoff
        elif self.failure_count >= self.failure_threshold:
            self.state = BreakerState.OPEN
            self.opened_at = time.time()


# ── Approval Gate (CLI) ──────────────────────────────────

def approval_gate(action: str, context: str,
                  auto_approve: bool = False) -> dict:
    """Pause for human approval before irreversible actions.
    Set auto_approve=True for automated testing."""
    print(f"\n{'='*50}")
    print(f"🔒 APPROVAL REQUIRED")
    print(f"Action: {action}")
    print(f"Context: {context}")
    print(f"{'='*50}")

    if auto_approve:
        print("  [Auto-approved for testing]")
        return {"approved": True, "modified": False}

    response = input("Approve? (y/n/e to edit): ").strip().lower()
    if response == "y":
        return {"approved": True, "modified": False}
    elif response == "e":
        new_action = input("Enter modified action: ").strip()
        return {"approved": True, "modified": True, "new_action": new_action}
    return {"approved": False, "reason": "Human denied"}


# ── Test Scenarios ────────────────────────────────────────
if __name__ == "__main__":
    print("=" * 60)
    print("OUTPUT GUARDRAILS — TEST SUITE")
    print("=" * 60)

    # Test 1: Hallucination Detection
    print("\n" + "─" * 60)
    print("TEST 1: Hallucination Detection — Contradicted Claim")
    source_docs = [
        "UCC Filing #2024-NY-0042: Filed March 22, 2024 by Acme Corp. "
        "Collateral: $2.3M in manufacturing equipment. Status: Active.",
        "Amendment filed April 10, 2024: Added collateral description "
        "for warehouse inventory valued at $890K."
    ]
    agent_response = (
        "The UCC filing was submitted on March 15, 2024 by Acme Corp "
        "for $2.3M in equipment collateral. An amendment was filed "
        "on April 10, 2024 adding $890K in warehouse inventory."
    )
    result = check_hallucination(agent_response, source_docs)
    print(f"  Result: {result['result']}")
    print(f"  Reason: {result.get('reason', 'All claims supported')}")
    for claim in result.get("claims", []):
        print(f"    [{claim['status']}] {claim['text']}")

    # Test 2: Hallucination Detection — All Supported
    print("\n" + "─" * 60)
    print("TEST 2: Hallucination Detection — All Supported")
    correct_response = (
        "The UCC filing was submitted on March 22, 2024 by Acme Corp "
        "for $2.3M in equipment collateral."
    )
    result2 = check_hallucination(correct_response, source_docs)
    print(f"  Result: {result2['result']}")

    # Test 3: Cost Tracker
    print("\n" + "─" * 60)
    print("TEST 3: Cost Tracking & Budget Enforcement")
    tracker = CostTracker(budget_dollars=0.10)
    for i in range(5):
        # Estimate matches the actual per-iteration usage below
        if not tracker.can_afford(estimated_input=8000, estimated_output=2000):
            print(f"  Iteration {i+1}: BUDGET EXCEEDED — {tracker.summary()}")
            break
        # Simulate API usage
        tracker.record_usage(input_toks=8000, output_toks=2000)
        print(f"  Iteration {i+1}: {tracker.summary()}")

    # Test 4: Circuit Breaker
    print("\n" + "─" * 60)
    print("TEST 4: Circuit Breaker State Transitions")
    breaker = CircuitBreaker(failure_threshold=3, cooldown_seconds=2)

    actions = [
        ("success", "Normal request 1"),
        ("failure", "API error 1"),
        ("failure", "API error 2"),
        ("failure", "API error 3 → TRIPS"),
        ("blocked", "Request during OPEN state"),
    ]
    for action, label in actions:
        can_exec = breaker.can_execute()
        if not can_exec:
            print(f"  {label}: BLOCKED (state={breaker.state.value})")
            continue
        if action == "success":
            breaker.record_success()
            print(f"  {label}: OK (state={breaker.state.value})")
        else:
            breaker.record_failure()
            print(f"  {label}: FAIL (state={breaker.state.value}, "
                  f"failures={breaker.failure_count}/{breaker.failure_threshold})")

    print(f"  Waiting {breaker.cooldown_seconds}s for cooldown...")
    time.sleep(breaker.cooldown_seconds + 0.1)
    can_test = breaker.can_execute()
    print(f"  Half-open test: can_execute={can_test} "
          f"(state={breaker.state.value})")
    breaker.record_success()
    print(f"  Test passed → state={breaker.state.value}")

    # Test 5: Approval Gate
    print("\n" + "─" * 60)
    print("TEST 5: Approval Gate (auto-approved for testing)")
    gate_result = approval_gate(
        action="Send $450 refund to Order #12345",
        context="Customer received damaged item, within return window",
        auto_approve=True,
    )
    print(f"  Result: {gate_result}")

// Output Guardrails & HITL — M17 Hands-On Lab
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// ── Hallucination Detector ──────────────────────────────
async function checkHallucination(responseText, sourceDocs) {
  const sources = sourceDocs.join("\n---\n");
  try {
    const r = await client.messages.create({
      model: "claude-sonnet-4-6", max_tokens: 1024,
      messages: [{ role: "user", content:
        `Fact-check this response against sources. For each claim: ` +
        `"supported", "unsupported", or "contradicted". ` +
        `Return ONLY JSON: {"claims":[{"text":"...","status":"..."}],` +
        `"overall":"pass|flag|block"}\n\n` +
        `Sources:\n${sources}\n\n` +
        `Response to check:\n${responseText}`
      }],
    });
    const parsed = JSON.parse(r.content[0].text);
    const contradicted = parsed.claims.filter(c => c.status === "contradicted");
    if (contradicted.length) return { result: "block", claims: parsed.claims };
    return { result: "pass", claims: parsed.claims };
  } catch (e) {
    return { result: "flag", reason: e.message, claims: [] };
  }
}

// ── Cost Tracker ────────────────────────────────────────
class CostTracker {
  constructor(budget = 0.10) {
    this.budget = budget;
    this.inputTokens = 0; this.outputTokens = 0;
  }
  get cost() {
    return this.inputTokens * 3e-6 + this.outputTokens * 15e-6;
  }
  record(inp, out) { this.inputTokens += inp; this.outputTokens += out; }
  canAfford(estIn = 5000, estOut = 1000) {
    return this.cost + estIn * 3e-6 + estOut * 15e-6 <= this.budget;
  }
}

// ── Circuit Breaker ─────────────────────────────────────
class CircuitBreaker {
  constructor(threshold = 3, cooldown = 2) {
    this.threshold = threshold; this.cooldown = cooldown;
    this.state = "closed"; this.failures = 0; this.openedAt = 0;
  }
  canExecute() {
    if (this.state === "closed") return true;
    if (this.state === "open" &&
        (Date.now() - this.openedAt) / 1000 >= this.cooldown) {
      this.state = "half_open"; return true;
    }
    return false;
  }
  recordSuccess() {
    if (this.state === "half_open") this.state = "closed";
    this.failures = 0;
  }
  recordFailure() {
    this.failures++;
    if (this.state === "half_open") {
      this.state = "open"; this.openedAt = Date.now();
      this.cooldown *= 2;
    } else if (this.failures >= this.threshold) {
      this.state = "open"; this.openedAt = Date.now();
    }
  }
}

// ── Tests ───────────────────────────────────────────────
console.log("TEST 1: Hallucination Detection");
const sources = ["Filed March 22, 2024 by Acme Corp. $2.3M equipment."];
const r1 = await checkHallucination(
  "Filed on March 15, 2024 by Acme Corp for $2.3M.", sources);
console.log(`  Result: ${r1.result}`);
r1.claims.forEach(c => console.log(`  [${c.status}] ${c.text}`));

console.log("\nTEST 2: Cost Tracker");
const tracker = new CostTracker(0.10);
for (let i = 0; i < 5; i++) {
  // Estimate matches actual per-iteration usage below
  if (!tracker.canAfford(8000, 2000)) { console.log(`  Iter ${i+1}: BUDGET EXCEEDED`); break; }
  tracker.record(8000, 2000);
  console.log(`  Iter ${i+1}: $${tracker.cost.toFixed(4)} / $${tracker.budget}`);
}

console.log("\nTEST 3: Circuit Breaker");
const breaker = new CircuitBreaker(3, 2);
for (let i = 0; i < 4; i++) {
  if (!breaker.canExecute()) { console.log("  BLOCKED"); continue; }
  breaker.recordFailure();
  console.log(`  Failure ${breaker.failures} → ${breaker.state}`);
}
console.log(`  Waiting ${breaker.cooldown}s...`);
await new Promise(r => setTimeout(r, (breaker.cooldown + 0.1) * 1000));
console.log(`  Can execute: ${breaker.canExecute()} (${breaker.state})`);
breaker.recordSuccess();
console.log(`  After success: ${breaker.state}`);

Run Command: Execute the complete test suite with a single command:

python output_guardrails.py

The script runs 5 tests sequentially. Tests 1-2 make Claude API calls (~3 seconds each). Tests 3-5 are local-only and run instantly, except the circuit breaker cooldown wait (~2 seconds).

Expected Output

============================================================
OUTPUT GUARDRAILS — TEST SUITE
============================================================

────────────────────────────────────────────────────────────
TEST 1: Hallucination Detection — Contradicted Claim
  Result: block
  Reason: 1 contradicted claim(s)
    [contradicted] filed on March 15, 2024
    [supported] by Acme Corp
    [supported] $2.3M in equipment collateral

────────────────────────────────────────────────────────────
TEST 2: Hallucination Detection — All Supported
  Result: pass

────────────────────────────────────────────────────────────
TEST 3: Cost Tracking & Budget Enforcement
  Iteration 1: Tokens: 8000 in / 2000 out | Cost: $0.0540 / $0.10
  Iteration 2: BUDGET EXCEEDED — Tokens: 8000 in / 2000 out | Cost: $0.0540 / $0.10

────────────────────────────────────────────────────────────
TEST 4: Circuit Breaker State Transitions
  Normal request 1: OK (state=closed)
  API error 1: FAIL (state=closed, failures=1/3)
  API error 2: FAIL (state=closed, failures=2/3)
  API error 3 → TRIPS: FAIL (state=open, failures=3/3)
  Request during OPEN state: BLOCKED (state=open)
  Waiting 2s for cooldown...
  Half-open test: can_execute=True (state=half_open)
  Test passed → state=closed

────────────────────────────────────────────────────────────
TEST 5: Approval Gate (auto-approved for testing)

==================================================
🔒 APPROVAL REQUIRED
Action: Send $450 refund to Order #12345
Context: Customer received damaged item, within return window
==================================================
  [Auto-approved for testing]
  Result: {'approved': True, 'modified': False}

✅ Checkpoint

Verify these 5 key results in your output:

Test 1: Result shows block — the hallucination detector caught the wrong date (March 15 vs March 22)
Test 2: Result shows pass — all claims match source documents
Test 3: Budget exceeded at iteration 2. One iteration costs $0.054, so a second would bring the total to $0.108, over the $0.10 budget
Test 4: Circuit breaker transitions through all states: closed → open (after 3 failures) → half-open (after cooldown) → closed (after test success)
Test 5: Approval gate displays the action and auto-approves with {'approved': True, 'modified': False}

If all 5 pass, your output guardrail pipeline is working correctly.

Troubleshooting

If you see AuthenticationError → Your ANTHROPIC_API_KEY is not set or is invalid. Run echo $ANTHROPIC_API_KEY (or echo %ANTHROPIC_API_KEY% on Windows) to verify. Re-export if needed: export ANTHROPIC_API_KEY=sk-ant-...
If you see ModuleNotFoundError: No module named 'anthropic' → Run pip install anthropic. Make sure your virtual environment is activated (source venv/bin/activate).
If hallucination test shows "pass" instead of "block" → The classifier may occasionally miss the date discrepancy (March 15 vs March 22). Run again — results are probabilistic. If persistent across 3 runs, make the contradiction more obvious by changing "March 15" to "June 15" in the test data.
If circuit breaker cooldown test fails → Ensure the time.sleep duration exceeds the cooldown. On slow systems, change cooldown_seconds=2 to cooldown_seconds=3 and add an extra 0.5s to the sleep.
If cost tracker doesn't exceed budget → Check that the budget is set to $0.10 (not $0.50). At $0.054/iteration, the tracker should report BUDGET EXCEEDED when checking affordability for iteration 2.

Verify Everything Works

Run the full script and use grep to confirm that the four key guardrail behaviors trigger correctly:

# Verify key behaviors
python output_guardrails.py 2>&1 | grep -E "(block|BUDGET|TRIPS|APPROVAL)"
# Should show: block, BUDGET EXCEEDED, TRIPS, APPROVAL REQUIRED

Expected Verification Output

  Result: block
  Iteration 2: BUDGET EXCEEDED — Tokens: 8000 in / 2000 out | Cost: $0.0540 / $0.10
  API error 3 → TRIPS: FAIL (state=open, failures=3/3)
🔒 APPROVAL REQUIRED

🎉 Congratulations

You've built a complete output guardrail pipeline. Your agent now has four layers of protection: hallucination detection catches fabricated facts, cost tracking prevents runaway spending, circuit breakers handle cascading failures, and approval gates keep humans in the loop for high-stakes decisions. In M18, you'll learn how to measure whether these guardrails actually work in production.

Cost Note

The full test suite makes 2 Claude API calls (hallucination checks). At Claude Sonnet pricing, that's ~$0.01 total. The cost tracker, circuit breaker, and approval gate use zero API calls. In production, hallucination detection costs ~$0.003 per check. At 1,000 responses/day, that's $3/day for fact-checking — trivially cheap compared to the cost of acting on hallucinated data.
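
The arithmetic behind those figures can be reproduced in a few lines. This is a sketch under one loud assumption: each check uses roughly 500 input tokens and 100 output tokens, numbers chosen here only because they land on the ~$0.003 figure at the lesson's Sonnet prices. Your real token counts depend on how long your sources and responses are, so measure them before budgeting.

```python
# Claude Sonnet prices used throughout this lesson (dollars per token)
INPUT_PRICE = 3.0 / 1_000_000    # $3 per million input tokens
OUTPUT_PRICE = 15.0 / 1_000_000  # $15 per million output tokens


def check_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one hallucination check."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE


def daily_cost(per_check: float, checks_per_day: int) -> float:
    """Dollar cost of running one check per response for a day."""
    return per_check * checks_per_day


# ASSUMPTION: 500 input / 100 output tokens per check (illustrative only)
per_check = check_cost(input_tokens=500, output_tokens=100)
print(f"Per check: ${per_check:.4f}")
print(f"Per day at 1,000 checks: ${daily_cost(per_check, 1000):.2f}")
```

Longer source documents raise the input side linearly, so a check against 5,000 tokens of sources costs closer to $0.017 than $0.003 under the same prices.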

Knowledge Check

1. What is the key difference between an approval gate and a modification gate?

A. Approval gates are for important decisions; modification gates are for minor ones

B. Approval gates are binary (yes/no); modification gates allow the human to edit the content before proceeding

C. Approval gates pause the agent; modification gates don't pause

D. Approval gates require a manager; modification gates only need any user

✓ Correct! Approval gates are binary — the human approves or denies the proposed action. Modification gates present a draft that the human can edit before it's finalized. Both pause the agent; the difference is the type of human input (yes/no vs. text editing).

✗ The key difference is the type of human input. Approval gates: binary yes/no decision. Modification gates: the human can edit the agent's draft before it's sent. Both gates pause the agent and both can be used for important decisions.

2. A circuit breaker is in "half-open" state. The test request fails. What happens next?

A. The circuit closes and resumes normal operation

B. Another test request is sent immediately

C. The circuit re-opens with a longer cooldown period (exponential backoff)

D. The circuit stays half-open and tries again after a fixed 10-second delay

✓ Correct! When a half-open test fails, the circuit re-opens (back to blocking all requests) with a LONGER cooldown — typically double the previous cooldown (exponential backoff). This gives the failing service more time to recover before the next test.

✗ When the half-open test fails, the circuit re-opens with exponential backoff — the cooldown doubles. If the first cooldown was 60s, the next is 120s, then 240s. This prevents hammering a failing service with test requests and gives it progressively more time to recover.

3. An agent's response claims "The filing was submitted on March 15, 2024" but the source documents show the date was March 22. Which guardrail catches this?

A. Toxicity filter — the incorrect date is harmful

B. Format validator — the date format is wrong

C. Hallucination detector — the claim contradicts the source documents and should be blocked

D. Input guardrail — the PII detector should catch the date

✓ Correct! The hallucination detector compares agent claims against source documents. "March 15" vs "March 22" is a "contradicted" claim — the source documents explicitly show a different date. The detector should classify this as "contradicted" and block the response.

✗ This is a hallucination (specifically a "contradicted" claim). The agent generated a date that contradicts the source documents. The hallucination detector catches this by comparing claims against sources. Toxicity, format, and PII checks don't verify factual accuracy.

4. Your agent loop typically runs 3-4 iterations at $0.05 each. You set a per-request budget of $0.50. An edge case triggers 20 iterations. What happens?

A. The agent completes all 20 iterations — the budget is just a warning

B. At iteration ~10, the cost middleware halts the loop and returns a fallback response

C. The API returns an error because Anthropic enforces the budget server-side

D. The agent continues but switches to a cheaper model for remaining iterations

✓ Correct! At ~$0.05 per iteration, 10 iterations = $0.50. The cost middleware checks the budget before each API call. At iteration 10 or 11, the cumulative cost exceeds $0.50, so the middleware short-circuits the call and returns a graceful fallback response. Iterations 11-20 never happen. Budget saved: ~$0.50.

✗ The cost middleware halts the loop at ~10 iterations ($0.50 budget / $0.05 per iteration). Budgets are enforced client-side by your middleware, not by Anthropic's API. The middleware checks cumulative cost before each API call and returns a fallback response when the budget is exceeded.
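
The scenario in this question can be simulated without any API calls. The sketch below is an assumption-laden stand-in for the cost middleware: `run_agent_loop` and `FALLBACK` are hypothetical names, each iteration is charged a flat 5 cents instead of real token usage, and amounts are kept in integer cents so the comparison at the budget boundary is exact.

```python
FALLBACK = ("I couldn't finish this request within its budget. "
            "Please try a narrower question.")


def run_agent_loop(iterations_wanted: int, cost_per_iteration_cents: int,
                   budget_cents: int) -> dict:
    """Simulate an agent loop with a budget check before every call."""
    spent_cents = 0
    completed = 0
    for _ in range(iterations_wanted):
        # Check BEFORE the call: would this iteration push us over budget?
        if spent_cents + cost_per_iteration_cents > budget_cents:
            return {"status": "halted", "iterations": completed,
                    "spent_cents": spent_cents, "content": FALLBACK}
        spent_cents += cost_per_iteration_cents  # stands in for a real API call
        completed += 1
    return {"status": "completed", "iterations": completed,
            "spent_cents": spent_cents, "content": "final answer"}


normal = run_agent_loop(4, 5, 50)     # typical request
runaway = run_agent_loop(20, 5, 50)   # edge case that wants 20 iterations
print(normal["status"], normal["iterations"], normal["spent_cents"])
print(runaway["status"], runaway["iterations"], runaway["spent_cents"])
```

The typical request finishes untouched, while the runaway loop stops after 10 iterations with 50 cents spent and returns the fallback text instead of making an 11th call.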

5. Which guardrail pipeline ordering is correct?

A. Input: injection detection → rate limiting → PII scan → schema validation

B. Input: rate limiting → schema validation → PII detection → injection detection

C. Input: PII detection → injection detection → rate limiting → schema validation

D. The order doesn't matter — all checks run independently

✓ Correct! Order matters: cheapest checks first. Rate limiting (zero cost, instant) → schema validation (zero API calls) → PII detection (regex, sub-ms) → injection detection (Claude API call, ~200ms). This ensures the most expensive check only runs on inputs that passed the cheap checks. Running injection detection first wastes an API call on rate-limited or malformed requests.

✗ The correct order is B: cheapest checks first. Rate limiting is free and instant — run it first. Schema validation is free. PII detection uses regex (fast). Injection detection uses an API call (expensive) — run it last. Running expensive checks before cheap ones wastes money on inputs that would have been rejected.
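
A cheapest-first pipeline can be sketched as an ordered list of checks that short-circuits on the first failure. Everything below is a simplified assumption for illustration: the four check functions are hypothetical stubs (the real injection detector is a Claude API call, stubbed here with a keyword match, and this PII check blocks on one SSN-shaped pattern, which may differ from how your M16 scanner handles PII), and the request fields are invented for the demo.

```python
import re
from typing import Callable

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def rate_limit_ok(request: dict) -> bool:
    # Hypothetical counter lookup: free and instant
    return request.get("requests_this_minute", 0) <= 60


def schema_ok(request: dict) -> bool:
    # Free: just inspects the payload shape
    message = request.get("message")
    return isinstance(message, str) and bool(message.strip())


def pii_ok(request: dict) -> bool:
    # Regex scan: fast, no API call
    return SSN_PATTERN.search(request["message"]) is None


def injection_ok(request: dict) -> bool:
    # STUB for the expensive step; the real check calls the Claude API
    return "ignore previous instructions" not in request["message"].lower()


# Ordered cheapest first, most expensive last
CHECKS: list[tuple[str, Callable[[dict], bool]]] = [
    ("rate_limit", rate_limit_ok),
    ("schema", schema_ok),
    ("pii", pii_ok),
    ("injection", injection_ok),
]


def run_input_checks(request: dict) -> dict:
    """Run checks in order and stop at the first failure."""
    ran = []
    for name, check in CHECKS:
        ran.append(name)
        if not check(request):
            return {"allowed": False, "failed_check": name, "ran": ran}
    return {"allowed": True, "failed_check": None, "ran": ran}


malformed = run_input_checks({"message": 123})
clean = run_input_checks({"message": "What is the status of my filing?"})
print(malformed["failed_check"], malformed["ran"])
print(clean["allowed"], clean["ran"])
```

The malformed request is rejected at the schema step, so the expensive injection check never runs for it. The clean request passes through all four checks in order.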

6. When should you use an escalation gate instead of an approval gate?

A. When the customer is angry or upset

B. When the agent detects it has hit a policy gap, capability limit, or ambiguous situation it can't resolve

C. When the request involves any dollar amount over $100

D. Whenever the agent's self-reported confidence is below 80%

✓ Correct! Escalation gates are triggered by policy gaps, capability limits, explicit human requests, or business-defined thresholds — NOT by sentiment (A) or self-reported confidence (D). An approval gate would be used for the $100 threshold (C) because it's a pre-defined rule, not an uncertainty-based escalation.

✗ Escalation gates trigger on policy gaps, capability limits, or ambiguity that the agent can't resolve. Sentiment-based escalation (A) is an anti-pattern. Self-reported confidence (D) is unreliable. Dollar thresholds (C) are better handled by approval gates with pre-defined rules, not escalation.
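
One way to express these trigger rules in code is a small function that looks only at structural signals and deliberately ignores sentiment and self-reported confidence. The `TurnSignals` fields and the `escalation_reason` function are hypothetical names invented for this sketch; how your agent actually detects a policy gap or a missing tool is outside its scope.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnSignals:
    """Hypothetical per-turn facts collected by your agent harness."""
    user_asked_for_human: bool = False
    matching_policy_found: bool = True
    required_tool_available: bool = True
    sentiment: str = "neutral"              # deliberately ignored below
    self_reported_confidence: float = 1.0   # deliberately ignored below


def escalation_reason(signals: TurnSignals) -> Optional[str]:
    """Return why this turn should go to a human, or None to continue."""
    if signals.user_asked_for_human:
        return "explicit_human_request"
    if not signals.matching_policy_found:
        return "policy_gap"
    if not signals.required_tool_available:
        return "capability_limit"
    return None  # sentiment and confidence never trigger escalation here


print(escalation_reason(TurnSignals(matching_policy_found=False)))
print(escalation_reason(TurnSignals(sentiment="angry",
                                    self_reported_confidence=0.4)))
```

The first call escalates because no policy covers the situation. The second does not, even though the user is angry and the model reports low confidence, which matches the reasoning in the answer above.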

Summary

You've now built both sides of the guardrail sandwich — input validation (M16) and output validation (M17). Here's the complete picture:

Output Validation — Hallucination detection (Claude-as-judge), toxicity filtering, and format validation catch errors before responses reach users.
Cost Controls — Per-request budgets, token caps, and loop detection prevent runaway spending. At roughly $0.05 per iteration, a $0.50 budget caps edge-case loops at about 10 iterations.
HITL Gates — Approval (yes/no), modification (edit draft), and escalation (route to human) patterns handle decisions that need human judgment.
Circuit Breaker — Three-state pattern (closed → open → half-open) prevents cascading failures during outages with exponential backoff.

Next up: In M18: Evaluation & Testing, you'll learn how to measure whether your guardrails actually work — building evaluation pipelines that test accuracy, catch regressions, and quantify your agent's reliability.
