M17: Output Guardrails & Human-in-the-Loop
Validate what your agent says, control what it costs, and know when to hand off to a human.
Learning Objectives
- Implement output validation: hallucination detection, toxicity filtering, and format checks
- Build cost controls with budget ceilings, token caps, and loop detection
- Design human-in-the-loop patterns: approval gates, modification gates, and escalation gates
- Implement the circuit breaker pattern to prevent cascading failures
- Wire input and output guardrails into a complete production pipeline
Output Validation
In M16, you built input guardrails that protect your system from bad inputs. Now we build the other half: output guardrails, validation layers that intercept agent responses before they reach the user. They check for hallucinations, toxic content, and format compliance. Input guardrails protect your system from users. Output guardrails protect users from your system.
Before quality inspectors existed on factory assembly lines, products went straight from the machines to the customer. Most products were fine, but occasionally a defective one slipped through — a car with a faulty brake line, a toy with a sharp edge, a medicine with the wrong dosage. The machines themselves didn't know they'd made a mistake. Quality inspectors sit at the END of the line, checking every product before it ships. They don't make the products — they catch the defects that the manufacturing process missed. Output guardrails work exactly the same way. Your agent generates responses using Claude, and most responses are excellent. But occasionally Claude hallucinates a fact, generates inappropriate content, or produces malformed output. Output guardrails catch these defects before the response reaches your user.
Here's what an output guardrail check actually looks like in practice. The hallucination detector returns a structured verdict for each factual claim:
// Agent response: "The filing was submitted on March 15, 2024
// by Acme Corp for $2.3M in equipment collateral."
// Source document says: March 22, 2024; Acme Corp; $2.3M equipment
{
"claims": [
{ "text": "filed on March 15, 2024", "status": "contradicted",
"source": "Document shows March 22, 2024" },
{ "text": "by Acme Corp", "status": "supported",
"source": "filing_2024_NY.txt line 3" },
{ "text": "$2.3M in equipment collateral", "status": "supported",
"source": "filing_2024_NY.txt line 7" }
],
"overall": "block", // One contradicted claim = block
"unsupported_count": 0
}
// Result: Response blocked. Agent told to re-check the date.
Output validation runs three critical checks on every agent response:
Hallucination detection verifies factual claims against source documents. If your RAG agent (M09-M10) says "The filing was submitted on March 15, 2024," the hallucination detector checks whether the source documents actually contain that date. Implementation uses a separate Claude call that receives both the response and the source documents, then classifies each claim as "supported," "unsupported," or "contradicted."
Toxicity filtering screens for harmful, biased, or inappropriate content. While Claude's safety training handles most cases, edge cases slip through, especially in agentic loops where the model processes external data that may contain offensive content. A Claude-as-judge call (a separate Claude call that evaluates the main agent's output with its own focused prompt and clean context, which makes it more reliable than self-evaluation within the same conversation) rates content on a severity scale (1-5) and blocks anything above the configured threshold.
Format validation ensures the output matches expected structure. If your agent should return JSON with specific fields, the format validator runs three sub-checks: Is the JSON syntactically valid? Are all required fields present? Are values within expected ranges (e.g., a dollar amount is positive, a date is not in the future)? This is the cheapest check — no API call needed, just local parsing — and it catches the most common failures.
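The three format sub-checks can be sketched with the standard library alone. This is a minimal illustration, not the module's reference implementation: the field names (`amount`, `filing_date`) and the rules attached to them are assumptions chosen to match the examples in the paragraph above.

```python
import json
from datetime import date
from typing import List, Optional

# Hypothetical schema for illustration: required fields and their expected types.
REQUIRED_FIELDS = {"amount": (int, float), "filing_date": str}


def validate_format(raw_output: str, today: Optional[date] = None) -> List[str]:
    """Return a list of problems; an empty list means the output passed."""
    today = today or date.today()

    # Sub-check 1: is the JSON syntactically valid?
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]

    # Sub-check 2: are all required fields present with the right type?
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif isinstance(data[name], bool) or not isinstance(data[name], expected_type):
            # bool is a subclass of int in Python, so reject it explicitly.
            problems.append(f"wrong type for field: {name}")
    if problems:
        return problems

    # Sub-check 3: are values within expected ranges?
    if data["amount"] <= 0:
        problems.append("amount must be positive")
    try:
        if date.fromisoformat(data["filing_date"]) > today:
            problems.append("filing_date is in the future")
    except ValueError:
        problems.append("filing_date is not an ISO date (YYYY-MM-DD)")
    return problems
```

Because this check makes no API call, it is reasonable to run it first and skip the more expensive judge calls when it already fails.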
Output Pipeline (static): Agent response flows through Format Validation (schema check), Hallucination Detection (fact verification), and Toxicity Filter (content safety). Clean outputs reach the user. Hallucinated claims get flagged. Toxic content gets blocked.
In a healthcare pre-authorization agent (Domain A), a hallucinated CPT code could approve the wrong procedure — costing thousands of dollars and potentially harming a patient. In a B2B ecommerce agent (Domain B), a format error that drops the shipping address field means packages go nowhere. A hallucination detector that costs $0.003 per check is trivially cheap compared to the cost of acting on fabricated data. Output guardrails are the last checkpoint before your agent's decisions affect real people.
"Output guardrails are unnecessary because Claude is already safe" — Claude's safety training catches most harmful content, but it's not infallible. In agentic loops where external data flows through the model, offensive content from data sources can leak into responses. And safety training doesn't prevent hallucinations — the model can confidently state incorrect facts.
"Human-in-the-loop means the AI is unreliable" — The opposite. HITL is a sign of a well-designed system. Even the best human employees need manager approval for large transactions or important communications. HITL gates are the same principle — they're about risk management, not reliability.
"The model can tell you how confident it is" — Self-reported confidence scores ("I'm 85% confident") are NOT calibrated. Claude might say "I'm very confident" about a hallucinated fact. Use programmatic criteria (source document matching, structured validation) for escalation decisions, not the model's self-assessment.
"Circuit breakers are overkill for AI agents" — Without them, a downstream API outage causes every agent request to fail, burn tokens on error responses, and trigger retries that compound the problem. A circuit breaker saves thousands during a 30-minute outage by short-circuiting to a fallback response after 3 failures.
Cost Controls
Before prepaid phone plans existed, parents gave their teenagers a phone with an unlimited plan and then got a $2,000 bill because the teenager discovered international calling. The phone worked perfectly; the problem was that nothing stopped reasonable use from escalating into catastrophic cost. Running an AI agent without cost controls is the same story. Your agent loop works perfectly in testing: 3-4 iterations, $0.02 per request. But in production, an edge case triggers 50 iterations, each processing a 100K-token document, and suddenly one request costs $15 instead of $0.02. Without budget ceilings (hard limits on token usage or dollar cost per request, per user, or per time period), a single bug can burn through your monthly budget in hours. When a ceiling is hit, the agent returns a graceful fallback response instead of continuing to spend.
Here's what a cost middleware check looks like in practice — the object your middleware maintains after each API call:
// After iteration 3 of an agent loop:
{
"iteration": 3,
"input_tokens": 24000, // 8K per iteration × 3
"output_tokens": 6000, // 2K per iteration × 3
"total_cost": "$0.162", // (24K × $3/M) + (6K × $15/M)
"budget": "$0.50",
"budget_remaining": "$0.338",
"can_afford_next": true, // Estimated next call: $0.054
"status": "ok"
}
// After iteration 9 (9 × $0.054 per iteration):
{
"iteration": 9,
"total_cost": "$0.486",
"budget_remaining": "$0.014",
"can_afford_next": false, // Next call ~$0.054 would exceed budget
"status": "budget_exceeded: returning fallback response"
}
Cost control operates at four levels. Here's each one and when it kicks in:
Per-request budget caps the maximum tokens (or dollar cost) for a single agent invocation. If the agent loop consumes more than the budget, it terminates and returns a fallback response instead of continuing to spend. Think of it as a prepaid card for each request — once the balance hits zero, the card declines.
Per-user budget sets daily or monthly limits per user. This prevents any single user from consuming disproportionate resources. Without it, one power user running expensive queries all day could eat your entire monthly allocation.
Loop detection identifies when the agent is calling the same tool repeatedly without making progress. For example, an agent might retry a failing API call infinitely — each retry costs tokens but produces no useful result. Loop detection spots this pattern and breaks the cycle.
Execution time limits are wall-clock timeouts that kill the agent process if it runs too long. These catch infinite loops that might not hit token limits — for example, an agent stuck in a reasoning loop that generates very few tokens per iteration but never terminates.
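Loop detection and a time limit can share one small guard object. The sketch below is one possible approach under stated assumptions: `LoopGuard`, its parameters, and the rule "the same tool with identical arguments N times in a row counts as a loop" are illustrative choices, not something this module prescribes.

```python
import json
import time
from collections import deque
from typing import Any, Deque, Dict


class LoopGuard:
    """Flags repeated identical tool calls and enforces a wall-clock limit."""

    def __init__(self, max_repeats: int = 3, max_seconds: float = 30.0):
        self.max_repeats = max_repeats
        self.max_seconds = max_seconds
        self.started_at = time.monotonic()
        # Only the most recent calls matter, so keep a short sliding window.
        self.recent: Deque[str] = deque(maxlen=max_repeats)

    def check(self, tool_name: str, tool_input: Dict[str, Any]) -> str:
        """Return "ok", "loop_detected", or "timeout" for the next tool call."""
        if time.monotonic() - self.started_at > self.max_seconds:
            return "timeout"
        # The same tool with the same arguments counts as the same call.
        signature = tool_name + ":" + json.dumps(tool_input, sort_keys=True, default=str)
        self.recent.append(signature)
        window_full = len(self.recent) == self.max_repeats
        if window_full and len(set(self.recent)) == 1:
            return "loop_detected"
        return "ok"
```

Note that this timeout is cooperative: it is only evaluated between tool calls. Killing a process that hangs inside a single call needs an external mechanism such as a request timeout or a supervisor.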
Implementation uses middleware, a software layer that wraps every Claude API call and adds functionality like cost tracking, logging, and enforcement without modifying the core agent logic. In Python, middleware is often implemented as decorators or context managers. The middleware tracks cumulative input tokens, output tokens, and cost. Before each API call, it checks whether the next call would exceed the budget. If so, it short-circuits the call and returns a budget-exceeded response. The key insight: the middleware sits BETWEEN your agent code and the API client, so your agent logic doesn't need to know about budgets at all.
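A minimal sketch of such a middleware follows, assuming the $3 and $15 per million token prices used in the worked example earlier in this section. `CostTracker`, `BudgetExceeded`, and `guarded_call` are hypothetical names, and `call_model` stands in for whatever function actually calls the API. With the Anthropic SDK, the token counts would typically be read from `response.usage.input_tokens` and `response.usage.output_tokens`.

```python
from dataclasses import dataclass

# Assumed prices in dollars per million tokens (from the worked example).
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00


class BudgetExceeded(Exception):
    """Raised before a call that would push spend past the ceiling."""


def _cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


@dataclass
class CostTracker:
    budget: float = 0.50
    input_tokens: int = 0
    output_tokens: int = 0

    @property
    def total_cost(self) -> float:
        return _cost(self.input_tokens, self.output_tokens)

    def can_afford(self, est_input: int, est_output: int) -> bool:
        """True if the estimated next call still fits under the budget."""
        return self.total_cost + _cost(est_input, est_output) <= self.budget

    def record(self, input_tokens: int, output_tokens: int) -> None:
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens


def guarded_call(tracker, call_model, est_input=8_000, est_output=2_000):
    """Check the budget first, make the call, then record actual usage.

    call_model is a hypothetical zero-argument callable that returns
    (text, input_tokens, output_tokens).
    """
    if not tracker.can_afford(est_input, est_output):
        raise BudgetExceeded(
            f"spent ${tracker.total_cost:.3f} of ${tracker.budget:.2f}"
        )
    text, used_in, used_out = call_model()
    tracker.record(used_in, used_out)
    return text
```

The agent loop only ever calls `guarded_call`; catching `BudgetExceeded` at the top of the loop is where the fallback response would be returned.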
Cost Breakdown (static): Stacked bar chart showing cost per agent loop iteration. Input tokens (blue), output tokens (green), tool calls (orange) accumulate. A budget ceiling line marks the $0.50 limit. When costs approach 80%, bars turn yellow. At 100%, the agent returns a fallback response instead of continuing.
Real numbers: A single Claude Sonnet call with 10K input tokens and 2K output tokens costs ~$0.06. An agentic loop that runs 20 iterations (each with tool use) processes ~240K total tokens, which comes to $1.20 per request (20 × $0.06). At 1,000 requests/day, that would be $1,200/day if every request ran that long. A per-request budget of $0.50 caps runaway loops at 8-10 iterations while allowing normal 3-4 iteration flows, so even a worst-case day costs at most ~$500. The budget saves you from the long tail of expensive edge cases.
Human-in-the-Loop Patterns
Before autopilot systems matured, every second of a flight required active human control. Modern autopilot handles the routine: maintaining altitude, speed, and heading for hours. But the pilot takes over for takeoff, landing, and anything unexpected, such as turbulence, system warnings, or unusual ATC instructions. The key insight isn't that autopilot is unreliable. It's that some decisions require human judgment, and the system is designed with clear handoff points where control transfers smoothly. Human-in-the-loop (HITL) patterns work the same way. HITL is a design pattern where AI handles routine tasks autonomously but pauses at predefined decision points for human review, approval, or modification before proceeding; the human is part of the workflow, not just observing. Your agent handles routine requests autonomously, but at specific decision points (irreversible actions, high-stakes choices, edge cases) it pauses and hands control to a human.
Here's what an approval gate request looks like when it reaches a human reviewer — the JSON object that gets stored in the approval queue:
// What the human reviewer sees in their approval queue:
{
"gate_type": "approval",
"agent_id": "order-support-agent-7",
"proposed_action": "Send $450 refund to customer",
"context": {
"order_id": "ORD-12345",
"reason": "Damaged item — photos verified by vision model",
"return_window": "within 30-day policy",
"customer_history": "12 orders, 0 previous refunds"
},
"state": "waiting_for_human",
"options": ["approve", "deny", "modify"],
"created_at": "2024-03-22T14:32:00Z",
"timeout": "4 hours — auto-escalate to manager if no response"
}
Three HITL gate patterns serve different purposes:
Approval gates pause before irreversible actions. The agent proposes "Send refund of $450 to customer X" and waits for human confirmation. The human sees the full context, clicks Approve or Deny, and the agent proceeds accordingly. Used for: sending emails, processing payments, modifying databases, deploying code.
Modification gates present a draft for human editing. The agent generates a customer response, displays it, and the human can edit the text before it's sent. Unlike approval gates (binary yes/no), modification gates allow the human to adjust the content. Used for: customer communications, reports, legal documents, content that needs a human voice.
Escalation gates detect when the agent has hit its competence boundary. The agent recognizes "I can't resolve this" and routes to a human specialist with full context. Unlike the other gates, escalation is not tied to a fixed checkpoint in the workflow: it fires at runtime when the agent runs into a policy gap, a capability limit, or an explicit request for a human. Used for: edge cases, ambiguous situations, policy gaps, high-stakes decisions the agent wasn't trained for.
Each pattern requires a state machine: a system that exists in one of several defined states (e.g., "running," "waiting_for_human," "completed") and transitions between them based on events, which is how you track where an agent is in a HITL workflow. The state machine pauses agent execution, stores the pending action in a queue, notifies the human, waits for their response, and resumes the agent with the human's decision.
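The pause, store, decide, and resume cycle can be sketched as a small in-memory state machine. Everything here is an illustrative assumption: `ApprovalQueue` stands in for a durable queue (in practice a database table or message broker), and the notification step is left as a comment.

```python
import uuid
from enum import Enum
from typing import Dict, Optional


class GateState(Enum):
    WAITING_FOR_HUMAN = "waiting_for_human"
    APPROVED = "approved"
    DENIED = "denied"


class ApprovalQueue:
    """In-memory stand-in for a durable approval queue."""

    def __init__(self) -> None:
        self.pending: Dict[str, dict] = {}

    def submit(self, proposed_action: str, context: dict) -> str:
        """Pause point: store the pending action and return a ticket id."""
        ticket_id = str(uuid.uuid4())
        self.pending[ticket_id] = {
            "proposed_action": proposed_action,
            "context": context,
            "state": GateState.WAITING_FOR_HUMAN,
            "final_action": None,
        }
        # A real system would notify the reviewer here (email, chat, dashboard).
        return ticket_id

    def decide(self, ticket_id: str, decision: str,
               modified_action: Optional[str] = None) -> dict:
        """Apply the human's decision; the agent resumes from the returned ticket."""
        ticket = self.pending[ticket_id]
        if ticket["state"] is not GateState.WAITING_FOR_HUMAN:
            raise ValueError("ticket already decided")
        if decision == "approve":
            ticket["state"] = GateState.APPROVED
            ticket["final_action"] = ticket["proposed_action"]
        elif decision == "modify" and modified_action:
            # Modification gate: the human's edited action replaces the draft.
            ticket["state"] = GateState.APPROVED
            ticket["final_action"] = modified_action
        elif decision == "deny":
            ticket["state"] = GateState.DENIED
        else:
            raise ValueError(f"invalid decision: {decision!r}")
        return ticket
```

Rejecting a second decision on the same ticket keeps the transition out of `waiting_for_human` one-way, so an action cannot be approved twice.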
HITL Gates (static): Three timelines: (1) Approval — agent proposes action, pauses for human approval, then executes. (2) Modification — agent drafts content, human edits it, then agent sends. (3) Escalation — agent detects uncertainty, routes to human specialist who resolves the issue.
Escalate based on policy gaps, capability limits, explicit requests, or business thresholds. NEVER escalate based on sentiment/anger alone. An angry customer with a simple request does NOT need a human. A calm customer hitting a policy gap DOES.
This is sample question #1 in the official exam guide. When a specific tool sequence is required for critical business logic (e.g., verify customer identity via get_customer BEFORE process_refund), use a programmatic hook that blocks the downstream tool call until the prerequisite returns success. Prompt-based instructions ("you must call get_customer first") have a non-zero failure rate — that's unacceptable when errors have financial consequences. Hooks provide deterministic guarantees; prompts provide probabilistic compliance. The exam consistently rewards the hook answer over the prompt answer for high-stakes ordering. Pair with structured handoff summaries when escalating mid-process: customer ID, root cause, refund amount, recommended action — the human agent doesn't have your transcript.
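A programmatic hook of this kind can be sketched as a thin dispatcher that sits between the model's tool requests and the tool functions. The `PREREQUISITES` table, the `ToolOrderHook` class, and the convention that tools return a dict with an `ok` key are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, Set

# Hypothetical rule table: each tool maps to tools that must have succeeded first.
PREREQUISITES: Dict[str, Set[str]] = {"process_refund": {"get_customer"}}


class ToolOrderHook:
    """Blocks a tool call until its prerequisite tools have returned success."""

    def __init__(self, tools: Dict[str, Callable[..., Dict[str, Any]]]):
        self.tools = tools
        self.succeeded: Set[str] = set()

    def call(self, name: str, **kwargs: Any) -> Dict[str, Any]:
        missing = PREREQUISITES.get(name, set()) - self.succeeded
        if missing:
            # Deterministic refusal: the tool body never runs.
            return {"ok": False, "error": f"blocked: call {sorted(missing)} first"}
        result = self.tools[name](**kwargs)
        if result.get("ok"):
            self.succeeded.add(name)
        return result
```

The blocked result can be returned to the model as an ordinary tool error, so the model learns what to call first while the refund itself stays impossible until identity verification has succeeded.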
The best HITL systems make human intervention effortless. When a human gets an approval request, they should see: the proposed action ("Send $450 refund to Order #12345"), the full context ("Customer received damaged item, photos verified, within 30-day return window"), and one-click actions (Approve / Deny / Modify). Forcing the human to reconstruct context from scratch defeats the purpose — the agent should do 95% of the work, and the human provides the final 5% of judgment.
Circuit Breaker Pattern
Before circuit breakers existed in homes, an electrical overload (a faulty appliance, a lightning strike, too many devices on one outlet) would cause the wiring to overheat and start a fire. The circuit breaker monitors current flow and TRIPS when it detects dangerous levels, cutting power instantly. Once the problem is fixed (unplug the faulty appliance), you flip the breaker back on. Without the breaker, one bad appliance could burn down the house. An agent circuit breaker works the same way: it monitors consecutive failures and automatically halts the agent when a threshold is reached. It has three states: Closed (normal), Open (tripped, all requests get a fallback response), and Half-Open (testing, one request allowed through to check recovery). When failures pile up (API errors, hallucination detections, timeouts), the breaker trips and routes all requests to a safe fallback response instead of continuing to make failing API calls that waste money and degrade user experience.
Here's the internal state of a circuit breaker as it transitions through all three states:
// Request 1: success → state unchanged
{ "state": "closed", "failures": 0, "action": "allowed" }
// Request 2: API 500 error
{ "state": "closed", "failures": 1, "action": "allowed" }
// Request 3: API 500 error
{ "state": "closed", "failures": 2, "action": "allowed" }
// Request 4: API 500 error → threshold reached, circuit TRIPS
{ "state": "open", "failures": 3, "action": "tripped",
"cooldown_until": "2024-03-22T14:33:00Z" }
// Request 5: arrives during cooldown
{ "state": "open", "action": "blocked",
"response": "fallback — no API call made" }
// After 60s cooldown expires:
{ "state": "half_open", "action": "testing — 1 request allowed" }
// Test request succeeds:
{ "state": "closed", "failures": 0, "action": "recovered" }
The circuit breaker has three states.
In Closed (normal), requests flow through to Claude and failures are counted. If failures reach the threshold (e.g., 3 consecutive failures in a 5-minute window), the circuit trips.
In Open (tripped), ALL requests immediately receive a fallback response and no API calls are made. This prevents piling on a failing service. A cooldown timer starts.
In Half-Open (testing), after the cooldown expires, ONE test request is allowed through. If it succeeds, the circuit closes (returns to normal). If it fails, the circuit re-opens with a longer cooldown (exponential backoff).
What counts as a "failure"? Several types: API 500 errors (the server is down), rate limit 429 responses (you're sending too many requests), hallucination detections above your threshold, guardrail violations, and execution timeouts. Each failure type can have its own threshold and cooldown. For example, you might tolerate 5 rate-limit errors (they're transient) but trip on just 2 consecutive hallucination detections (those suggest a deeper problem with the model's context).
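Per-type thresholds can be sketched as a counter keyed by failure kind. The names and numbers below are illustrative assumptions; in particular, "tolerate 5 rate-limit errors" is read here as "trip on the fifth consecutive one".

```python
from collections import defaultdict
from typing import DefaultDict, Dict

# Illustrative thresholds based on the example in the paragraph above.
THRESHOLDS: Dict[str, int] = {
    "rate_limit": 5,      # transient, so tolerate more
    "hallucination": 2,   # suggests a deeper context problem
    "server_error": 3,
}
DEFAULT_THRESHOLD = 3


class FailureCounter:
    """Counts consecutive failures per type and reports when one should trip."""

    def __init__(self) -> None:
        self.consecutive: DefaultDict[str, int] = defaultdict(int)

    def record_failure(self, kind: str) -> bool:
        """Return True if this failure should trip the breaker."""
        self.consecutive[kind] += 1
        return self.consecutive[kind] >= THRESHOLDS.get(kind, DEFAULT_THRESHOLD)

    def record_success(self) -> None:
        # Any success ends every consecutive streak.
        self.consecutive.clear()
```

A breaker would call `record_failure(kind)` in place of a single shared counter and open the circuit whenever it returns True; per-type cooldowns could be looked up from a second table in the same way.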
Circuit Breaker (static): Starts in Closed state (green). After 3 consecutive failures, trips to Open state (red) — all requests get fallback response. After 60s cooldown, enters Half-Open (yellow) — one test request allowed. If test passes, returns to Closed. If test fails, re-opens with longer cooldown.
Self-reported confidence scores are NOT reliable for escalation decisions. The model's internal confidence is not well-calibrated. Use structured programmatic criteria instead.
Don't escalate on aggregate confidence. Use field-level confidence: extract each field with its own score, escalate only the fields below threshold instead of the whole document. For human review batches, use stratified sampling — sample N from each confidence bucket (high/med/low) so reviewers see the full distribution. Anti-pattern: routing the top-N most-confident extractions to humans (they're all correct, so reviewers learn nothing) or sampling uniformly (low-confidence cases get under-reviewed).
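Both ideas fit in a few lines. The sketch below assumes each extracted field arrives as a `(value, score)` pair from whatever scoring method your pipeline uses; the 0.8 threshold and the bucket edges are hypothetical values to tune, not recommendations.

```python
import random
from typing import Dict, List, Tuple


def fields_to_escalate(extraction: Dict[str, Tuple[str, float]],
                       threshold: float = 0.8) -> List[str]:
    """Return only the field names whose score falls below the threshold."""
    return [name for name, (_value, score) in extraction.items() if score < threshold]


def bucket_of(score: float) -> str:
    # Hypothetical bucket edges; tune them to your score distribution.
    if score >= 0.9:
        return "high"
    if score >= 0.6:
        return "med"
    return "low"


def stratified_sample(scores: Dict[str, float], per_bucket: int,
                      seed: int = 0) -> Dict[str, List[str]]:
    """Sample up to per_bucket item ids from each confidence bucket."""
    buckets: Dict[str, List[str]] = {"high": [], "med": [], "low": []}
    for item_id, score in scores.items():
        buckets[bucket_of(score)].append(item_id)
    rng = random.Random(seed)  # seeded so review batches are reproducible
    return {
        name: rng.sample(ids, min(per_bucket, len(ids)))
        for name, ids in buckets.items()
    }
```

Sampling the same number from every bucket means reviewers see high-confidence items too, which is what lets you measure whether high scores are actually trustworthy.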
Without a circuit breaker, a downstream API outage causes every agent request to fail. Each failure triggers a retry. Retries pile up. Rate limits are exhausted. Error responses consume tokens (you still pay for the failed attempt). All users are affected simultaneously. A circuit breaker detects the pattern after 3 failures and immediately switches to fallback responses — no more API calls, no more retries, no more cost. When the API recovers, the half-open test request verifies it's healthy before resuming normal traffic. One $0 circuit breaker saves thousands in wasted API calls during outages.
Code Walkthrough
Hallucination Detector
Let's build the hallucination detector step by step. The core idea is the "Claude-as-judge" pattern: we make a SEPARATE Claude API call whose only job is fact-checking. Why separate? Because the original agent's reasoning context can bias its self-evaluation. A fresh Claude call with a focused prompt is a much more reliable judge.
The code has three logical parts. First, the prompt template — this tells the judge exactly how to classify claims and what JSON structure to return. Second, the API call — a straightforward Claude message where we inject the source documents and the response to check. Third, the decision logic — the part that turns the judge's per-claim classifications into an overall pass/flag/block verdict. The critical design choice here: we "fail closed." If the fact-checking call itself errors out (network issue, JSON parse failure), we flag the response for human review rather than letting it through unchecked.
import anthropic
import json

# Literal braces in the JSON example are doubled ({{ }}) because the template
# is filled in with str.format(); single braces would raise an error there.
HALLUCINATION_PROMPT = """You are a fact-checking judge. Compare the agent's
response against the provided source documents.

For each factual claim in the response, classify it as:
- "supported": Claim is directly backed by source documents
- "unsupported": Claim is NOT in source documents (possible hallucination)
- "contradicted": Claim CONTRADICTS source documents (definite error)

Respond with ONLY a JSON object:
{{
  "claims": [
    {{"text": "claim text", "status": "supported|unsupported|contradicted", "source": "which document supports/contradicts, or null"}}
  ],
  "overall": "pass|flag|block",
  "unsupported_count": 0
}}

Source documents:
{sources}

Agent response to fact-check:
{response}
"""


def check_hallucination(
    client: anthropic.Anthropic,
    response_text: str,
    source_docs: list[str],
) -> dict:
    """Verify agent output against source documents.

    Uses a SEPARATE Claude call (Claude-as-judge pattern).
    The judge has its own clean context and can't be influenced
    by the agent's reasoning that led to the response.
    """
    sources_text = "\n---\n".join(source_docs)
    try:
        result = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": HALLUCINATION_PROMPT.format(
                    sources=sources_text,
                    response=response_text,
                ),
            }],
        )
        parsed = json.loads(result.content[0].text)

        # Decision logic:
        # - Any "contradicted" claim = block immediately
        # - 2+ "unsupported" claims = flag for review
        # - Otherwise (at most one unsupported claim) = pass
        contradicted = [c for c in parsed["claims"] if c["status"] == "contradicted"]
        unsupported = [c for c in parsed["claims"] if c["status"] == "unsupported"]
        if contradicted:
            return {"result": "block", "reason": f"Contradicted claims: {contradicted}", "claims": parsed["claims"]}
        elif len(unsupported) >= 2:
            return {"result": "flag", "reason": f"{len(unsupported)} unsupported claims", "claims": parsed["claims"]}
        else:
            return {"result": "pass", "claims": parsed["claims"]}
    except Exception as e:
        # Fail closed: if fact-checking fails, flag for review
        return {"result": "flag", "reason": f"Fact-check error: {e}", "claims": []}
import Anthropic from '@anthropic-ai/sdk';

const HALLUCINATION_PROMPT = `You are a fact-checking judge. Compare the agent's
response against the provided source documents.

For each factual claim, classify as:
- "supported": Backed by source documents
- "unsupported": NOT in sources (possible hallucination)
- "contradicted": CONTRADICTS sources (definite error)

Respond with ONLY JSON:
{"claims": [{"text": "...", "status": "supported|unsupported|contradicted", "source": "..."}], "overall": "pass|flag|block", "unsupported_count": 0}

Source documents:
<sources>
{SOURCES}
</sources>

Agent response to fact-check:
<response>
{RESPONSE}
</response>`;

async function checkHallucination(client, responseText, sourceDocs) {
  // Claude-as-judge: separate call with clean context.
  const sourcesText = sourceDocs.join('\n---\n');
  try {
    const result = await client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 1024,
      messages: [{
        role: 'user',
        // Replacer functions keep "$" sequences in the inserted text literal
        // (a plain string replacement would interpret patterns like "$&").
        content: HALLUCINATION_PROMPT
          .replace('{SOURCES}', () => sourcesText)
          .replace('{RESPONSE}', () => responseText),
      }],
    });
    const parsed = JSON.parse(result.content[0].text);

    // Any contradicted claim blocks; 2+ unsupported claims flag; otherwise pass.
    const contradicted = parsed.claims.filter((c) => c.status === 'contradicted');
    const unsupported = parsed.claims.filter((c) => c.status === 'unsupported');
    if (contradicted.length > 0) {
      return { result: 'block', reason: 'Contradicted claims found', claims: parsed.claims };
    } else if (unsupported.length >= 2) {
      return { result: 'flag', reason: `${unsupported.length} unsupported claims`, claims: parsed.claims };
    }
    return { result: 'pass', claims: parsed.claims };
  } catch (error) {
    // Fail closed: flag for review if fact-checking errors
    return { result: 'flag', reason: `Fact-check error: ${error.message}`, claims: [] };
  }
}
You built a hallucination detector using the Claude-as-judge pattern. A separate Claude call receives both the agent's response and the source documents, then classifies each factual claim as supported, unsupported, or contradicted. The decision logic: any contradicted claim = block, 2+ unsupported claims = flag for review, anything else = pass (so a single unsupported claim still passes). The judge runs in its own context so it can't be influenced by the agent's reasoning.
Circuit Breaker Implementation
Now let's build the circuit breaker itself. The interesting challenge here is managing state transitions cleanly. We need three states (Closed, Open, Half-Open), and the transitions between them must be atomic — you can't have two requests both thinking they're the "test request" in Half-Open state. The implementation below uses a simple class with methods for each state transition. In production, you'd add thread safety (a lock around state changes), but the logic is identical.
Pay attention to the can_execute() method — that's the gatekeeper. Every API call goes through it first. And notice how record_failure() handles the Half-Open case differently: a failure during testing re-opens the circuit with a DOUBLED cooldown (exponential backoff), giving the failing service progressively more time to recover.
import time
from enum import Enum
from dataclasses import dataclass, field


class BreakerState(Enum):
    CLOSED = "closed"        # Normal: requests flow through
    OPEN = "open"            # Tripped: all requests get fallback
    HALF_OPEN = "half_open"  # Testing: one request allowed


@dataclass
class CircuitBreaker:
    """Circuit breaker for agent API calls.

    Monitors consecutive failures and trips when threshold
    is reached. Prevents cascading failures and wasted spend
    during outages.
    """
    failure_threshold: int = 3      # Trip after N consecutive failures
    cooldown_seconds: float = 60.0  # Wait before testing
    state: BreakerState = field(default=BreakerState.CLOSED)
    failure_count: int = field(default=0)
    last_failure_time: float = field(default=0.0)
    opened_at: float = field(default=0.0)

    def can_execute(self) -> bool:
        """Check if a request should be allowed through."""
        if self.state == BreakerState.CLOSED:
            return True
        if self.state == BreakerState.OPEN:
            # Check if cooldown has elapsed
            elapsed = time.time() - self.opened_at
            if elapsed >= self.cooldown_seconds:
                self.state = BreakerState.HALF_OPEN
                return True  # Allow one test request
            return False  # Still cooling down
        if self.state == BreakerState.HALF_OPEN:
            return False  # Already testing, block others
        return False

    def record_success(self):
        """Record a successful request."""
        if self.state == BreakerState.HALF_OPEN:
            # Test passed, so close the circuit
            self.state = BreakerState.CLOSED
            self.failure_count = 0
        elif self.state == BreakerState.CLOSED:
            self.failure_count = 0  # Reset on success

    def record_failure(self):
        """Record a failed request."""
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.state == BreakerState.HALF_OPEN:
            # Test failed, so re-open with a longer cooldown
            self.state = BreakerState.OPEN
            self.opened_at = time.time()
            self.cooldown_seconds *= 2  # Exponential backoff
        elif self.state == BreakerState.CLOSED:
            if self.failure_count >= self.failure_threshold:
                self.state = BreakerState.OPEN
                self.opened_at = time.time()

    @property
    def fallback_response(self) -> str:
        return (
            "I'm temporarily unable to process your request. "
            "Our systems are experiencing issues. Please try "
            "again in a few minutes."
        )


# Usage with an agent
breaker = CircuitBreaker(failure_threshold=3, cooldown_seconds=60)


def call_agent_with_breaker(client, messages):
    """Wrap agent calls with circuit breaker protection."""
    if not breaker.can_execute():
        return {"content": breaker.fallback_response, "circuit_breaker": "open"}
    try:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=messages,
        )
        breaker.record_success()
        return {"content": response.content[0].text, "circuit_breaker": breaker.state.value}
    except Exception:
        breaker.record_failure()
        # Inspect the state directly: can_execute() can change state as a side effect.
        if breaker.state is BreakerState.OPEN:
            return {"content": breaker.fallback_response, "circuit_breaker": "open"}
        raise  # Re-raise if breaker hasn't tripped yet
import Anthropic from '@anthropic-ai/sdk';

const BreakerState = { CLOSED: 'closed', OPEN: 'open', HALF_OPEN: 'half_open' };

class CircuitBreaker {
  // Monitors failures and trips when threshold reached.
  // Prevents cascading failures during outages.
  constructor(failureThreshold = 3, cooldownSeconds = 60) {
    this.failureThreshold = failureThreshold;
    this.cooldownSeconds = cooldownSeconds;
    this.state = BreakerState.CLOSED;
    this.failureCount = 0;
    this.openedAt = 0;
  }

  canExecute() {
    if (this.state === BreakerState.CLOSED) return true;
    if (this.state === BreakerState.OPEN) {
      const elapsed = (Date.now() - this.openedAt) / 1000;
      if (elapsed >= this.cooldownSeconds) {
        this.state = BreakerState.HALF_OPEN;
        return true; // Allow one test request
      }
      return false;
    }
    return false; // HALF_OPEN blocks others while testing
  }

  recordSuccess() {
    if (this.state === BreakerState.HALF_OPEN) {
      this.state = BreakerState.CLOSED; // Test passed
      this.failureCount = 0;
    } else {
      this.failureCount = 0; // Reset on success
    }
  }

  recordFailure() {
    this.failureCount++;
    if (this.state === BreakerState.HALF_OPEN) {
      this.state = BreakerState.OPEN; // Test failed
      this.openedAt = Date.now();
      this.cooldownSeconds *= 2; // Exponential backoff
    } else if (this.failureCount >= this.failureThreshold) {
      this.state = BreakerState.OPEN;
      this.openedAt = Date.now();
    }
  }

  get fallbackResponse() {
    return 'I\'m temporarily unable to process your request. '
      + 'Our systems are experiencing issues. Please try again shortly.';
  }
}

// Usage with an agent
const breaker = new CircuitBreaker(3, 60);

async function callAgentWithBreaker(client, messages) {
  if (!breaker.canExecute()) {
    return { content: breaker.fallbackResponse, circuitBreaker: 'open' };
  }
  try {
    const response = await client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 1024,
      messages,
    });
    breaker.recordSuccess();
    return { content: response.content[0].text, circuitBreaker: breaker.state };
  } catch (error) {
    breaker.recordFailure();
    // Inspect the state directly: canExecute() can change state as a side effect.
    if (breaker.state === BreakerState.OPEN) {
      return { content: breaker.fallbackResponse, circuitBreaker: 'open' };
    }
    throw error; // Re-throw if the breaker hasn't tripped yet
  }
}
You built a circuit breaker with three states: Closed (normal), Open (all requests get fallback), and Half-Open (one test request). When 3 consecutive failures occur, the circuit trips. After a 60-second cooldown, one test request probes recovery. If the test fails, the cooldown doubles (exponential backoff). This prevents burning money on retries during outages and gives downstream services time to recover.
Hands-On Exercise
What You'll Build
A complete output guardrail pipeline with hallucination detection (Claude-as-judge), cost tracking with budget enforcement, a circuit breaker with three-state transitions, and a CLI-based human approval gate — tested with 5 scenarios including contradicted claims, budget overruns, and cascading failures.
- Time Estimate: 35-45 minutes
- Prerequisites: Python 3.9+, an Anthropic API key (from console.anthropic.com), and a terminal
- Files You'll Create:
output_guardrails.py: all four guardrail components (hallucination detector, cost tracker, circuit breaker, approval gate) plus a 5-scenario test suite
- API Cost: ~$0.01 total (2 Claude Sonnet calls for hallucination checks)
Environment Setup
mkdir output-guardrails && cd output-guardrails
python -m venv venv && source venv/bin/activate # Windows: venv\Scripts\activate
pip install anthropic
export ANTHROPIC_API_KEY=your-key-here # Windows: set ANTHROPIC_API_KEY=your-key-here
Step 1: Build the Output Guardrail Pipeline
What: Create a single file with four output guardrail components — a hallucination detector, cost tracker, circuit breaker, and approval gate — plus a test suite that exercises each one.
Why: Building all four components in one file lets you see how they work together as a pipeline. In production you'd split them into separate modules, but for learning, having everything in one place makes it easier to trace the data flow from agent response through each guardrail to the final output.
Create a new file called output_guardrails.py:
"""Output Guardrails & HITL — M17 Hands-On Lab"""
import anthropic
import json
import time
from dataclasses import dataclass, field
from enum import Enum
client = anthropic.Anthropic()
# ── Hallucination Detector (Claude-as-Judge) ─────────────
HALLUCINATION_PROMPT = """You are a fact-checking judge. Compare the response
against the source documents. For each factual claim, classify it as:
- "supported": Directly backed by sources
- "unsupported": NOT in sources (possible hallucination)
- "contradicted": CONTRADICTS sources (definite error)
Respond with ONLY JSON:
{{"claims": [{{"text": "claim", "status": "supported|unsupported|contradicted"}}],
"overall": "pass|flag|block", "unsupported_count": 0}}
Sources:
{sources}
Response to check:
{response}
"""
def check_hallucination(response_text: str, source_docs: list[str]) -> dict:
    """Verify agent response against source documents."""
    sources = "\n---\n".join(source_docs)
    try:
        result = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": HALLUCINATION_PROMPT.format(
                    sources=sources, response=response_text
                ),
            }],
        )
        parsed = json.loads(result.content[0].text)
        contradicted = [c for c in parsed["claims"]
                        if c["status"] == "contradicted"]
        unsupported = [c for c in parsed["claims"]
                       if c["status"] == "unsupported"]
        if contradicted:
            return {"result": "block",
                    "reason": f"{len(contradicted)} contradicted claim(s)",
                    "claims": parsed["claims"]}
        elif len(unsupported) >= 2:
            return {"result": "flag",
                    "reason": f"{len(unsupported)} unsupported claim(s)",
                    "claims": parsed["claims"]}
        return {"result": "pass", "claims": parsed["claims"]}
    except Exception as e:
        return {"result": "flag", "reason": f"Check failed: {e}", "claims": []}
# ── Cost Tracker with Budget Enforcement ─────────────────
@dataclass
class CostTracker:
    """Track token usage and enforce per-request budget."""
    budget_dollars: float = 0.50
    input_tokens: int = 0
    output_tokens: int = 0
    # Claude Sonnet pricing (per token)
    INPUT_PRICE = 3.0 / 1_000_000    # $3/M input tokens
    OUTPUT_PRICE = 15.0 / 1_000_000  # $15/M output tokens

    @property
    def total_cost(self) -> float:
        return (self.input_tokens * self.INPUT_PRICE +
                self.output_tokens * self.OUTPUT_PRICE)

    @property
    def budget_remaining(self) -> float:
        return self.budget_dollars - self.total_cost

    def record_usage(self, input_toks: int, output_toks: int):
        self.input_tokens += input_toks
        self.output_tokens += output_toks

    def can_afford(self, estimated_input: int = 5000,
                   estimated_output: int = 1000) -> bool:
        """Check if the next call would stay within budget."""
        estimated_cost = (estimated_input * self.INPUT_PRICE +
                          estimated_output * self.OUTPUT_PRICE)
        return self.total_cost + estimated_cost <= self.budget_dollars

    def summary(self) -> str:
        return (f"Tokens: {self.input_tokens} in / {self.output_tokens} out "
                f"| Cost: ${self.total_cost:.4f} / ${self.budget_dollars:.2f}")
# ── Circuit Breaker ──────────────────────────────────────
class BreakerState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


@dataclass
class CircuitBreaker:
    failure_threshold: int = 3
    cooldown_seconds: float = 5.0  # Short for demo
    state: BreakerState = field(default=BreakerState.CLOSED)
    failure_count: int = field(default=0)
    opened_at: float = field(default=0.0)

    def can_execute(self) -> bool:
        if self.state == BreakerState.CLOSED:
            return True
        if self.state == BreakerState.OPEN:
            if time.time() - self.opened_at >= self.cooldown_seconds:
                self.state = BreakerState.HALF_OPEN
                return True
            return False
        return False  # HALF_OPEN blocks while testing

    def record_success(self):
        if self.state == BreakerState.HALF_OPEN:
            self.state = BreakerState.CLOSED
        self.failure_count = 0

    def record_failure(self):
        self.failure_count += 1
        if self.state == BreakerState.HALF_OPEN:
            self.state = BreakerState.OPEN
            self.opened_at = time.time()
            self.cooldown_seconds *= 2  # Exponential backoff
        elif self.failure_count >= self.failure_threshold:
            self.state = BreakerState.OPEN
            self.opened_at = time.time()
# ── Approval Gate (CLI) ──────────────────────────────────
def approval_gate(action: str, context: str,
                  auto_approve: bool = False) -> dict:
    """Pause for human approval before irreversible actions.
    Set auto_approve=True for automated testing."""
    print(f"\n{'='*50}")
    print(f"🔒 APPROVAL REQUIRED")
    print(f"Action: {action}")
    print(f"Context: {context}")
    print(f"{'='*50}")
    if auto_approve:
        print(" [Auto-approved for testing]")
        return {"approved": True, "modified": False}
    response = input("Approve? (y/n/e to edit): ").strip().lower()
    if response == "y":
        return {"approved": True, "modified": False}
    elif response == "e":
        new_action = input("Enter modified action: ").strip()
        return {"approved": True, "modified": True, "new_action": new_action}
    return {"approved": False, "reason": "Human denied"}
# ── Test Scenarios ────────────────────────────────────────
if __name__ == "__main__":
    print("=" * 60)
    print("OUTPUT GUARDRAILS — TEST SUITE")
    print("=" * 60)

    # Test 1: Hallucination Detection
    print("\n" + "─" * 60)
    print("TEST 1: Hallucination Detection — Contradicted Claim")
    source_docs = [
        "UCC Filing #2024-NY-0042: Filed March 22, 2024 by Acme Corp. "
        "Collateral: $2.3M in manufacturing equipment. Status: Active.",
        "Amendment filed April 10, 2024: Added collateral description "
        "for warehouse inventory valued at $890K."
    ]
    agent_response = (
        "The UCC filing was submitted on March 15, 2024 by Acme Corp "
        "for $2.3M in equipment collateral. An amendment was filed "
        "on April 10, 2024 adding $890K in warehouse inventory."
    )
    result = check_hallucination(agent_response, source_docs)
    print(f" Result: {result['result']}")
    print(f" Reason: {result.get('reason', 'All claims supported')}")
    for claim in result.get("claims", []):
        print(f" [{claim['status']}] {claim['text']}")

    # Test 2: Hallucination Detection — All Supported
    print("\n" + "─" * 60)
    print("TEST 2: Hallucination Detection — All Supported")
    correct_response = (
        "The UCC filing was submitted on March 22, 2024 by Acme Corp "
        "for $2.3M in equipment collateral."
    )
    result2 = check_hallucination(correct_response, source_docs)
    print(f" Result: {result2['result']}")

    # Test 3: Cost Tracker
    print("\n" + "─" * 60)
    print("TEST 3: Cost Tracking & Budget Enforcement")
    tracker = CostTracker(budget_dollars=0.10)
    for i in range(5):
        # Estimate matches the actual per-iteration usage below
        if not tracker.can_afford(estimated_input=8000, estimated_output=2000):
            print(f" Iteration {i+1}: BUDGET EXCEEDED — {tracker.summary()}")
            break
        # Simulate API usage
        tracker.record_usage(input_toks=8000, output_toks=2000)
        print(f" Iteration {i+1}: {tracker.summary()}")

    # Test 4: Circuit Breaker
    print("\n" + "─" * 60)
    print("TEST 4: Circuit Breaker State Transitions")
    breaker = CircuitBreaker(failure_threshold=3, cooldown_seconds=2)
    actions = [
        ("success", "Normal request 1"),
        ("failure", "API error 1"),
        ("failure", "API error 2"),
        ("failure", "API error 3 → TRIPS"),
        ("blocked", "Request during OPEN state"),
    ]
    for action, label in actions:
        can_exec = breaker.can_execute()
        if not can_exec:
            print(f" {label}: BLOCKED (state={breaker.state.value})")
            continue
        if action == "success":
            breaker.record_success()
            print(f" {label}: OK (state={breaker.state.value})")
        else:
            breaker.record_failure()
            print(f" {label}: FAIL (state={breaker.state.value}, "
                  f"failures={breaker.failure_count}/{breaker.failure_threshold})")
    print(f" Waiting {breaker.cooldown_seconds}s for cooldown...")
    time.sleep(breaker.cooldown_seconds + 0.1)
    can_test = breaker.can_execute()
    print(f" Half-open test: can_execute={can_test} "
          f"(state={breaker.state.value})")
    breaker.record_success()
    print(f" Test passed → state={breaker.state.value}")

    # Test 5: Approval Gate
    print("\n" + "─" * 60)
    print("TEST 5: Approval Gate (auto-approved for testing)")
    gate_result = approval_gate(
        action="Send $450 refund to Order #12345",
        context="Customer received damaged item, within return window",
        auto_approve=True,
    )
    print(f" Result: {gate_result}")
// Output Guardrails & HITL — M17 Hands-On Lab
import Anthropic from "@anthropic-ai/sdk";
import * as readline from "node:readline/promises";
const client = new Anthropic();
// ── Hallucination Detector ──────────────────────────────
async function checkHallucination(responseText, sourceDocs) {
const sources = sourceDocs.join("\n---\n");
try {
const r = await client.messages.create({
model: "claude-sonnet-4-6", max_tokens: 1024,
messages: [{ role: "user", content:
`Fact-check this response against sources. For each claim: ` +
`"supported", "unsupported", or "contradicted". ` +
`Return ONLY JSON: {"claims":[{"text":"...","status":"..."}],` +
`"overall":"pass|flag|block"}\n\n` +
`Sources:\n${sources}\n\n` +
`Response to check:\n${responseText}`
}],
});
const parsed = JSON.parse(r.content[0].text);
const contradicted = parsed.claims.filter(c => c.status === "contradicted");
if (contradicted.length) return { result: "block", claims: parsed.claims };
return { result: "pass", claims: parsed.claims };
} catch (e) {
return { result: "flag", reason: e.message, claims: [] };
}
}
// ── Cost Tracker ────────────────────────────────────────
class CostTracker {
constructor(budget = 0.10) {
this.budget = budget;
this.inputTokens = 0; this.outputTokens = 0;
}
get cost() {
return this.inputTokens * 3e-6 + this.outputTokens * 15e-6;
}
record(inp, out) { this.inputTokens += inp; this.outputTokens += out; }
canAfford(estIn = 5000, estOut = 1000) {
return this.cost + estIn * 3e-6 + estOut * 15e-6 <= this.budget;
}
}
// ── Circuit Breaker ─────────────────────────────────────
class CircuitBreaker {
constructor(threshold = 3, cooldown = 2) {
this.threshold = threshold; this.cooldown = cooldown;
this.state = "closed"; this.failures = 0; this.openedAt = 0;
}
canExecute() {
if (this.state === "closed") return true;
if (this.state === "open" &&
(Date.now() - this.openedAt) / 1000 >= this.cooldown) {
this.state = "half_open"; return true;
}
return false;
}
recordSuccess() {
if (this.state === "half_open") this.state = "closed";
this.failures = 0;
}
recordFailure() {
this.failures++;
if (this.state === "half_open") {
this.state = "open"; this.openedAt = Date.now();
this.cooldown *= 2;
} else if (this.failures >= this.threshold) {
this.state = "open"; this.openedAt = Date.now();
}
}
}
// ── Tests ───────────────────────────────────────────────
console.log("TEST 1: Hallucination Detection");
const sources = ["Filed March 22, 2024 by Acme Corp. $2.3M equipment."];
const r1 = await checkHallucination(
"Filed on March 15, 2024 by Acme Corp for $2.3M.", sources);
console.log(` Result: ${r1.result}`);
r1.claims.forEach(c => console.log(` [${c.status}] ${c.text}`));
console.log("\nTEST 2: Cost Tracker");
const tracker = new CostTracker(0.10);
for (let i = 0; i < 5; i++) {
// Estimate matches actual per-iteration usage below
if (!tracker.canAfford(8000, 2000)) { console.log(` Iter ${i+1}: BUDGET EXCEEDED`); break; }
tracker.record(8000, 2000);
console.log(` Iter ${i+1}: $${tracker.cost.toFixed(4)} / $${tracker.budget}`);
}
console.log("\nTEST 3: Circuit Breaker");
const breaker = new CircuitBreaker(3, 2);
for (let i = 0; i < 4; i++) {
if (!breaker.canExecute()) { console.log(" BLOCKED"); continue; }
breaker.recordFailure();
console.log(` Failure ${breaker.failures} → ${breaker.state}`);
}
console.log(` Waiting ${breaker.cooldown}s...`);
await new Promise(r => setTimeout(r, (breaker.cooldown + 0.1) * 1000));
console.log(` Can execute: ${breaker.canExecute()} (${breaker.state})`);
breaker.recordSuccess();
console.log(` After success: ${breaker.state}`);
Run Command: Execute the complete test suite with a single command:
python output_guardrails.py
The script runs 5 tests sequentially. Tests 1-2 make Claude API calls (~3 seconds each). Tests 3-5 are local-only and run instantly, except the circuit breaker cooldown wait (~2 seconds).
Expected Output
Verify these 5 key results in your output:
- Test 1: Result shows block. The hallucination detector caught the wrong date (March 15 vs March 22).
- Test 2: Result shows pass. All claims match the source documents.
- Test 3: Budget exceeded at iteration 2. Each iteration costs $0.054, so a second one would bring the total to $0.108, over the $0.10 budget.
- Test 4: Circuit breaker transitions through all states: closed → open (after 3 failures) → half-open (after cooldown) → closed (after test success)
- Test 5: Approval gate displays the action and auto-approves with {'approved': True, 'modified': False}
If all 5 pass, your output guardrail pipeline is working correctly.
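You can check the Test 3 numbers by hand. The sketch below recomputes the per-iteration cost from the lesson's Sonnet prices and then replays the `can_afford` logic to find the first iteration that gets refused. It uses `Decimal` so that exact boundary cases are not affected by binary floating-point rounding; the function names are illustrative and not part of output_guardrails.py.

```python
from decimal import Decimal

# Claude Sonnet prices from the lesson, in dollars per token.
INPUT_PRICE = Decimal("0.000003")   # $3 per million input tokens
OUTPUT_PRICE = Decimal("0.000015")  # $15 per million output tokens


def iteration_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Dollar cost of one agent iteration."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE


def first_refused_iteration(budget: Decimal, per_iteration: Decimal) -> int:
    """1-based iteration at which a can_afford-style check first says no."""
    if per_iteration <= 0:
        raise ValueError("per_iteration must be positive")
    spent = Decimal("0")
    iteration = 1
    # Same rule as CostTracker.can_afford: spent + estimate <= budget.
    while spent + per_iteration <= budget:
        spent += per_iteration
        iteration += 1
    return iteration


cost = iteration_cost(8000, 2000)
print(f"${cost:.3f} per iteration")                    # $0.054 per iteration
print(first_refused_iteration(Decimal("0.10"), cost))  # 2
```

The same helper gives 11 for a $0.50 budget at $0.05 per iteration, meaning ten iterations run and the eleventh is refused.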
Troubleshooting
- If you see AuthenticationError → Your ANTHROPIC_API_KEY is not set or is invalid. Run echo $ANTHROPIC_API_KEY (or echo %ANTHROPIC_API_KEY% on Windows) to verify. Re-export if needed: export ANTHROPIC_API_KEY=sk-ant-...
- If you see ModuleNotFoundError: No module named 'anthropic' → Run pip install anthropic. Make sure your virtual environment is activated (source venv/bin/activate).
- If hallucination test shows "pass" instead of "block" → The classifier may occasionally miss the date discrepancy (March 15 vs March 22). Run again; results are probabilistic. If persistent across 3 runs, make the contradiction more obvious by changing "March 15" to "June 15" in the test data.
- If circuit breaker cooldown test fails → Ensure the time.sleep duration exceeds the cooldown. On slow systems, change cooldown_seconds=2 to cooldown_seconds=3 and add an extra 0.5s to the sleep.
- If cost tracker doesn't exceed budget → Check that the budget is set to $0.10 (not $0.50). At $0.054/iteration, the tracker should report BUDGET EXCEEDED when checking affordability for iteration 2.
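One more failure mode is worth knowing about: `check_hallucination` parses the judge's reply with `json.loads`, so if the model wraps its JSON in a Markdown code fence or adds a sentence around it, the call lands in the `except` branch and comes back as flag with a "Check failed" reason. A more tolerant parser might look like the sketch below. `parse_judge_json` is a hypothetical helper, not part of the lab file, and it assumes the reply contains a single top-level JSON object.

```python
import json


def parse_judge_json(raw: str) -> dict:
    """Parse the judge's reply, tolerating code fences or surrounding prose."""
    candidates = [raw]
    # Fallback candidate: the outermost {...} span in the reply.
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        candidates.append(raw[start:end + 1])
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    raise ValueError("no JSON object found in judge reply")


fence = "`" * 3  # built at runtime to keep this example readable
fenced_reply = fence + 'json\n{"claims": [], "overall": "pass"}\n' + fence
print(parse_judge_json(fenced_reply)["overall"])  # pass
```

To try it, you could replace `json.loads(result.content[0].text)` with `parse_judge_json(result.content[0].text)` inside `check_hallucination`.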
Verify Everything Works
Run the full script and use grep to confirm that the four key guardrail behaviors trigger correctly:
# Verify key behaviors
python output_guardrails.py 2>&1 | grep -E "(block|BUDGET|TRIPS|APPROVAL)"
# Should show: block, BUDGET EXCEEDED, TRIPS, APPROVAL REQUIRED
You've built a complete output guardrail pipeline. Your agent now has four layers of protection: hallucination detection catches fabricated facts, cost tracking prevents runaway spending, circuit breakers handle cascading failures, and approval gates keep humans in the loop for high-stakes decisions. In M18, you'll learn how to measure whether these guardrails actually work in production.
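As a rough illustration of how these layers can be chained, the sketch below runs a list of checks in order, stops at the first block, and collects any flags along the way. It assumes each check returns the same `{"result": ...}` shape that `check_hallucination` uses. The runner name `run_output_checks` and the two stand-in checks are hypothetical and exist only so the example runs without an API key.

```python
from typing import Callable

Verdict = dict  # {"result": "pass" | "flag" | "block", "reason": optional str}
Check = Callable[[str], Verdict]


def run_output_checks(response_text: str,
                      checks: list[tuple[str, Check]]) -> dict:
    """Run checks in order; stop at the first block, collect flags otherwise."""
    flags = []
    for name, check in checks:
        verdict = check(response_text)
        if verdict["result"] == "block":
            return {"result": "block", "blocked_by": name,
                    "reason": verdict.get("reason", ""), "flags": flags}
        if verdict["result"] == "flag":
            flags.append({"check": name, "reason": verdict.get("reason", "")})
    return {"result": "flag" if flags else "pass", "flags": flags}


# Stand-in checks (assumptions for this sketch, not the lesson's guardrails).
def length_check(text: str) -> Verdict:
    if len(text) > 2000:
        return {"result": "flag", "reason": "response too long"}
    return {"result": "pass"}


def banned_phrase_check(text: str) -> Verdict:
    if "guaranteed returns" in text.lower():
        return {"result": "block", "reason": "banned phrase"}
    return {"result": "pass"}


checks = [("length", length_check), ("banned_phrase", banned_phrase_check)]
print(run_output_checks("Your refund was issued.", checks)["result"])
# pass
print(run_output_checks("This fund has guaranteed returns.", checks)["result"])
# block
```

The real `check_hallucination` takes source documents as a second argument, so you would wrap it, for example with `lambda text: check_hallucination(text, source_docs)`, before adding it to the list.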
The full test suite makes 2 Claude API calls (hallucination checks). At Claude Sonnet pricing, that's ~$0.01 total. The cost tracker, circuit breaker, and approval gate use zero API calls. In production, hallucination detection costs ~$0.003 per check. At 1,000 responses/day, that's $3/day for fact-checking — trivially cheap compared to the cost of acting on hallucinated data.
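The daily figure is simple arithmetic on the Sonnet prices used in `CostTracker`. As a sketch, assume a hallucination check averages 500 input tokens and 100 output tokens; those token counts are an assumption chosen to match the ~$0.003 figure above, so measure your own prompts before relying on them.

```python
INPUT_PRICE = 3.0 / 1_000_000    # $3 per million input tokens
OUTPUT_PRICE = 15.0 / 1_000_000  # $15 per million output tokens


def check_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single hallucination check."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE


def daily_cost(checks_per_day: int, input_tokens: int,
               output_tokens: int) -> float:
    """Dollar cost of running one check per response for a day."""
    return checks_per_day * check_cost(input_tokens, output_tokens)


# Assumed average check size: 500 tokens in, 100 tokens out.
print(f"${check_cost(500, 100):.4f} per check")      # $0.0030 per check
print(f"${daily_cost(1000, 500, 100):.2f} per day")  # $3.00 per day
```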
Knowledge Check
1. What is the key difference between an approval gate and a modification gate?
2. A circuit breaker is in "half-open" state. The test request fails. What happens next?
3. An agent's response claims "The filing was submitted on March 15, 2024" but the source documents show the date was March 22. Which guardrail catches this?
4. Your agent loop typically runs 3-4 iterations at $0.05 each. You set a per-request budget of $0.50. An edge case triggers 20 iterations. What happens?
5. Which guardrail pipeline ordering is correct?
6. When should you use an escalation gate instead of an approval gate?
Summary
You've now built both sides of the guardrail sandwich — input validation (M16) and output validation (M17). Here's the complete picture:
- Output Validation — Hallucination detection (Claude-as-judge), toxicity filtering, and format validation catch errors before responses reach users.
- Cost Controls — Per-request budgets, token caps, and loop detection prevent runaway spending. A $0.50 budget caps edge-case loops at 10 iterations.
- HITL Gates — Approval (yes/no), modification (edit draft), and escalation (route to human) patterns handle decisions that need human judgment.
- Circuit Breaker — Three-state pattern (closed → open → half-open) prevents cascading failures during outages with exponential backoff.
Next up: In M18: Evaluation & Testing, you'll learn how to measure whether your guardrails actually work — building evaluation pipelines that test accuracy, catch regressions, and quantify your agent's reliability.