Building AI Agents with Claude
Track 9 — Certification Prep Module 30 of 30
90 min Advanced

M27: Certification Exam Prep

Anti-patterns, scenario walkthroughs, and mock exam practice for the Claude Certified Architect — Foundations exam.

Learning Objectives

  • Use the domain coverage map below to identify your weakest cert domains and target review
  • Recognize all 18 anti-patterns across the 5 certification domains
  • Understand the validation-retry loop with specific error feedback
  • Implement information provenance tracking for multi-agent systems
  • Walk through all 6 exam scenarios with correct architectural decisions
  • Practice with mock exam questions in the certification format

Cert Domain Coverage Map

This map is built directly from Anthropic's official Claude Certified Architect – Foundations Certification Exam Guide (v0.1, Feb 10 2025). Every official task statement (28 in total across 5 domains) maps to specific course modules below. If your practice score shows weakness in a domain, jump to the row instead of re-reading the whole course.

Domain Weights at a Glance
5 domains · 28 task statements · passing score: 720/1000

  • D1: Agentic Architecture & Orchestration, 27% · M12, M13, M14, M26
  • D2: Tool Design & MCP, 18% · M05, M06, M07, M25
  • D3: Claude Code Configuration, 20% · M25
  • D4: Prompt Engineering & Structured Output, 20% · M03, M04, M18, M22
  • D5: Context & Reliability, 15% · M11, M17, M27B

D1 is the heaviest at 27% — budget your study time accordingly. D5 sub-tasks 5.5/5.6 have their own deep dive in M27B.

Reading the map: "Primary" = where the topic is taught from scratch. "Cert tip" = a focused exam-relevant 🎓 callout on that point. Domain weights from the official guide: D1 = 27%, D2 = 18%, D3 = 20%, D4 = 20%, D5 = 15%.

Task | Topic (official task statement) | Where Covered

Domain 1: Agentic Architecture & Orchestration (27%)
1.1 | Agentic loops: stop_reason control flow, tool result handling | M12 (primary) · cert tips in M12, M15B, M23, M16 (iteration cap anti-pattern)
1.2 | Coordinator-subagent (hub-and-spoke), task decomposition by coordinator | M14 (primary) · M14, M15B cert tips
1.3 | Subagent invocation: Task tool, allowedTools, explicit context passing, parallel spawning | M14 · M15B · cert tips in both
1.4 | Multi-step workflows: programmatic prerequisite gates, structured handoff protocols (sample question #1) | M17 cert tip (NEW) · M16 cert tip · M26 (hooks-as-enforcement)
1.5 | Agent SDK hooks: PostToolUse, tool call interception, data normalization | M26 (primary) · M15B, M23 cert tips
1.6 | Task decomposition: prompt chaining vs. dynamic adaptive decomposition | M13 (primary) · M13, M26 cert tips
1.7 | Session state: --resume, fork_session, named sessions, resume vs. fresh w/ summary | M26 (primary) · M26 cert tip (NEW)

Domain 2: Tool Design & MCP Integration (18%)
2.1 | Tool descriptions: input formats, examples, edge cases, boundaries | M05 (primary) · M05 cert tip
2.2 | Structured MCP errors: isError, errorCategory, isRetryable; access failure vs. empty result | M05 (primary) · M05, M12, M15, M23 cert tips
2.3 | Tool distribution (4-5/agent cap), tool_choice: "auto"/"any"/forced | M06 (primary) · M06 cert tip
2.4 | MCP server config: project (.mcp.json) vs user (~/.claude.json), ${ENV_VAR} expansion, MCP resources for catalogs | M07 (primary) · M07, M15B cert tips
2.5 | Built-in tools: Read, Write, Edit, Bash, Grep, Glob; selection criteria, Edit→Read+Write fallback | M25 (primary) · M25 cert tip

Domain 3: Claude Code Configuration & Workflows (20%)
3.1 | CLAUDE.md hierarchy (user/project/directory), @import, .claude/rules/, /memory | M25 (primary) · M25 cert tip
3.2 | Custom slash commands & skills: SKILL.md frontmatter (context: fork, allowed-tools, argument-hint) | M25 (primary) · M25 cert tip
3.3 | Path-specific rules: YAML frontmatter glob patterns in .claude/rules/ | M25 · M25 cert tip
3.4 | Plan mode vs. direct execution; Explore subagent for verbose discovery | M25 · M25 cert tip
3.5 | Iterative refinement: input/output examples, TDD iteration, interview pattern | M25 · M25 cert tip
3.6 | CI/CD: -p, --output-format json, --json-schema, CLAUDE.md context for CI | M25 · M25 cert tip (NEW)

Domain 4: Prompt Engineering & Structured Output (20%)
4.1 | Explicit criteria over vague instructions; severity definitions with code examples | M03 (primary) · M03 cert tip
4.2 | Few-shot prompting: 2-4 examples for ambiguous scenarios | M03 (primary) · M03 cert tip
4.3 | tool_use + JSON schemas; tool_choice variants; nullable fields to prevent hallucination | M04 (primary) · M04, M09 cert tips
4.4 | Validation-retry loops with specific errors; detected_pattern fields for dismissal tracking | M04 · M18 · cert tips in both
4.5 | Message Batches API: 50% savings, 24h window, custom_id, no multi-turn tool calling | M22 (primary) · M22 cert tip
4.6 | Multi-instance & multi-pass review: separate sessions for gen vs review; per-file + cross-file passes | M18 (primary) · M18 cert tips · M10 cert tip

Domain 5: Context Management & Reliability (15%)
5.1 | Conversation context: progressive summarization risks, lost-in-the-middle, case-facts blocks, trimming verbose tool outputs | M03B · M08 · cert tips in M02, M08, M09, M22
5.2 | Escalation & ambiguity resolution: policy gaps, sentiment != complexity, multi-match clarification | M17 (primary) · M17, M26 cert tips
5.3 | Error propagation: structured context (failure type, attempted query, partial results), local recovery before escalation | M14 cert tip (NEW) · M17
5.4 | Large codebase exploration: scratchpads, subagent delegation, /compact, crash recovery manifests | M11 (primary) · M11 cert tip
5.5 | Human review workflows: stratified sampling, field-level confidence, accuracy by document type/field | M27B · cert tips in M17, M18, M10
5.6 | Information provenance & uncertainty: claim-source mappings, temporal data, conflict annotation, coverage gaps | M27B (deep dive) · cert tips in M09, M11

🎓 Out-of-Scope Reminder

Per the official guide, the following are NOT tested on the cert (and you shouldn't waste study time on them): fine-tuning, API auth/billing, MCP server hosting, Constitutional AI / RLHF internals, embedding models & vector DB internals, computer use, vision, streaming, rate limits / quotas / pricing, OAuth, cloud-provider configs, prompt-caching implementation details (beyond knowing it exists), token counting algorithms.

🎓 Targeted Review Strategy

If your practice exam scores show weakness in a specific domain row, jump there. The "primary" module teaches it from scratch; the cert tip pinpoints the exam-relevant scenario. Domain 5.5 + 5.6 jointly need their own deep-dive — M27B — because they cover topics scattered across M09, M11, M17, M18 that the cert tests as one discipline. Take M27B before any full timed practice exam. Domain 1 is the heaviest at 27% — budget review time accordingly.

Anti-Patterns Master Reference

The Claude Certified Architect exam tests your ability to spot bad patterns as much as good ones. It is a 5-domain certification covering agentic architecture, tool design, Claude Code configuration, prompt engineering, and context/reliability, with a passing score of 720/1000. Each question presents a scenario with four options: one correct and three distractors (often anti-patterns). The distractors are typically real anti-patterns that developers actually ship to production. This section catalogs all 18 anti-patterns grouped by domain, each as a ❌ DON'T paired with its ✅ DO.

💡 Everyday Analogy

Before medical boards tested clinical knowledge, doctors learned from textbooks alone. But knowing the right diagnosis isn't enough — you also need to recognize the wrong diagnoses that look plausible. A patient with chest pain could have a heart attack, acid reflux, or a pulled muscle. The textbook teaches the right answer; the boards test whether you can spot the impostors. The certification exam works the same way. Each question has one correct pattern and three anti-patterns. The anti-patterns aren't random nonsense — they're approaches that seem reasonable but fail in production. Knowing WHY each anti-pattern fails is more valuable than memorizing the correct answer, because in practice you'll encounter them in existing codebases and need to recognize and fix them.

That's the analogy — now let's see what a real exam question looks like. Notice how the three wrong answers are all plausible anti-patterns, not obvious nonsense:

Scenario: A support agent must enforce a $500 refund limit.

  A) Add "Never approve refunds over $500" to the system prompt ← Anti-pattern #3 (prompt-based enforcement)
  B) Check Claude's confidence score before processing ← Anti-pattern #5 (self-reported confidence)
  C) Use a programmatic hook to intercept process_refund calls ← ✅ CORRECT
  D) Set maxTurns: 1 to limit agent actions ← Anti-pattern #2 (iteration cap as control)

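Option C can be sketched as a plain Python gate that runs before a tool call is executed. This is only an illustration of the idea, not the Agent SDK's hook API: the function name `check_refund_limit`, the `amount` input field, and the returned dict shape are all assumptions made for this example.

```python
from typing import Any

REFUND_LIMIT = 500.00  # business threshold from the scenario


def check_refund_limit(tool_name: str, tool_input: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical pre-execution gate: decide in code whether a tool call may run."""
    if tool_name != "process_refund":
        return {"allow": True}

    amount = tool_input.get("amount")  # assumed field name on the tool input
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        return {"allow": False, "reason": "refund amount missing or not numeric"}

    if amount > REFUND_LIMIT:
        return {
            "allow": False,
            "reason": f"refund {amount:.2f} exceeds limit {REFUND_LIMIT:.2f}; escalate to a human",
        }
    return {"allow": True}


print(check_refund_limit("process_refund", {"amount": 750}))
print(check_refund_limit("process_refund", {"amount": 120}))
```

Because the check lives in code rather than in the prompt, nothing the model reads or writes can change the outcome for an over-limit refund.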
18 Anti-Patterns

Domain 1 — Agentic Architecture

D1.1 ❌ Don't: Parse natural language to detect loop termination ("I'm done", "task complete")
D1.1 ✅ Do: Check stop_reason: 'tool_use' = continue, 'end_turn' = done. Deterministic, not guesswork.

D1.1 ❌ Don't: Use arbitrary iteration caps (e.g., max 10 loops) as the primary stop mechanism
D1.1 ✅ Do: Let the agent terminate naturally via stop_reason. Use maxTurns as a SAFETY NET only, not control flow.

D1.3 ❌ Don't: Use prompt-based enforcement for critical business rules ("Never approve refunds over $500")
D1.3 ✅ Do: Use programmatic hooks that intercept tool calls. Hooks are deterministic: they CAN'T be bypassed by prompt injection.

D1.4 ❌ Don't: Escalate to humans based on customer sentiment or anger level
D1.4 ✅ Do: Escalate on: policy gaps, capability limits, explicit requests, or business thresholds. Angry + simple ≠ escalate. Calm + policy gap = escalate.

D1.4 ❌ Don't: Trust self-reported confidence scores for escalation decisions
D1.4 ✅ Do: Use structured programmatic criteria. Model confidence is not well-calibrated: it says "95% confident" about hallucinations too.
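The first two pairs above (stop_reason as control flow, maxTurns as a safety net) can be sketched as a single loop. To keep the example runnable without an API key, the model call is injected as a callable and stubbed out; `run_agent_loop`, `create_message`, and `run_tool` are names invented for this sketch, and the response objects only mimic the `stop_reason` and `content` attributes of a Messages API response.

```python
from types import SimpleNamespace
from typing import Any, Callable


def run_agent_loop(
    create_message: Callable[[list[dict]], Any],
    run_tool: Callable[[str, dict], str],
    user_prompt: str,
    max_turns: int = 20,
) -> dict:
    """Continue while stop_reason is 'tool_use'; max_turns is only a safety net."""
    messages: list[dict] = [{"role": "user", "content": user_prompt}]
    for _ in range(max_turns):
        response = create_message(messages)
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason != "tool_use":
            # Natural termination: the control signal is the field, not the text.
            return {
                "completed": response.stop_reason == "end_turn",
                "stop_reason": response.stop_reason,
                "messages": messages,
            }

        # Execute every requested tool and send the results back.
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": run_tool(block.name, block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})

    # Reaching this point means the safety net fired, which is worth investigating.
    return {"completed": False, "stop_reason": "max_turns", "messages": messages}


# Stubbed demo: one tool call, then a natural end_turn.
scripted = iter([
    SimpleNamespace(
        stop_reason="tool_use",
        content=[SimpleNamespace(type="tool_use", id="t1", name="lookup_order", input={"order_id": "A1"})],
    ),
    SimpleNamespace(stop_reason="end_turn", content=[SimpleNamespace(type="text", text="Order A1 has shipped.")]),
])
result = run_agent_loop(lambda msgs: next(scripted), lambda name, args: "shipped", "Where is order A1?")
print(result["stop_reason"], len(result["messages"]))
```

In a real agent, `create_message` would wrap `client.messages.create(...)`; the loop body stays the same.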

Domain 2 — Tool Design & MCP

D2.2 ❌ Don't: Return generic errors: "Operation failed"
D2.2 ✅ Do: Return structured errors: {isError: true, errorCategory: "auth_failure", isRetryable: true, context: "Token expired"}

D2.3 ❌ Don't: Return empty results for access failures: {results: []}
D2.3 ✅ Do: Distinguish "nothing found" from "couldn't check." An access failure is {isError: true, errorCategory: "access_denied"}, not {results: []}.

D2.4 ❌ Don't: Give one agent 18+ tools; selection accuracy degrades rapidly
D2.4 ✅ Do: Keep 4-5 tools per agent. Distribute across specialized subagents. Tool accuracy drops significantly above 5 tools.

D2.5 ❌ Don't: Hardcode API keys in .mcp.json, where they get committed to git
D2.5 ✅ Do: Use ${ENV_VAR} environment variable expansion. Know the config hierarchy: .mcp.json (project) vs ~/.claude.json (user).
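The D2.2 and D2.3 pairs can be sketched together as one tool wrapper. The `isError`, `errorCategory`, `isRetryable`, and `context` field names come from the cards above; the `search_records` function, the `backend` callable, and the mapping of Python exceptions to categories are assumptions for illustration.

```python
from typing import Callable


def search_records(query: str, backend: Callable[[str], list[dict]]) -> dict:
    """Hypothetical tool wrapper: never report an access failure as an empty result."""
    try:
        rows = backend(query)
    except PermissionError as exc:
        # "Couldn't check": retrying with the same credentials will not help.
        return {"isError": True, "errorCategory": "access_denied",
                "isRetryable": False, "context": str(exc)}
    except TimeoutError as exc:
        # Transient failure: the caller may retry.
        return {"isError": True, "errorCategory": "timeout",
                "isRetryable": True, "context": str(exc)}
    # "Nothing found" is a successful call with zero rows.
    return {"isError": False, "results": rows}


def denied_backend(query: str) -> list[dict]:
    raise PermissionError("service account lacks read access to filings")


def empty_backend(query: str) -> list[dict]:
    return []


print(search_records("Acme Corp", denied_backend))
print(search_records("Acme Corp", empty_backend))
```

The two printed results differ in `isError`, which is exactly the signal the agent needs to choose between "report no matches" and "recover or escalate".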

Domains 3-5 — Config, Prompts, Context

D3 ❌ Don't: Put personal preferences in project-level CLAUDE.md
D3 ✅ Do: User-level (~/.claude/) for personal prefs, project CLAUDE.md for team-shared rules. Directory-level for subdirectory overrides.

D4.1 ❌ Don't: Give vague instructions: "be thorough," "find all issues"
D4.1 ✅ Do: Provide explicit, measurable criteria: "flag functions exceeding 50 lines" not "flag long functions."

D4.3 ❌ Don't: Assume tool_use guarantees semantic correctness (valid JSON ≠ correct values)
D4.3 ✅ Do: tool_use guarantees structure (valid JSON). Always add business rule validation after extraction. Values may still be wrong.

D5.1 ❌ Don't: Use progressive summarization for critical details (names, IDs, amounts, dates get lost)
D5.1 ✅ Do: Use immutable "case facts" blocks at the START of context. These are never summarized. Critical data stays in high-recall position.

D5.6 ❌ Don't: Use aggregate accuracy metrics only: "95% overall accuracy"
D5.6 ✅ Do: Track stratified per-category metrics. Invoices at 70% and receipts at 99% can blend to roughly 95% overall when receipts dominate the volume, yet invoices are broken. Drill down.
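The D5.6 point is easy to check with arithmetic. The sketch below uses made-up volumes (400 invoices, 2,500 receipts) chosen so that the blended figure lands on 95%; the `accuracy_by_category` helper and the record layout are assumptions for this example.

```python
from collections import defaultdict


def accuracy_by_category(records: list[tuple[str, bool]]) -> dict[str, float]:
    """Return accuracy per category from (category, was_correct) pairs."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for category, was_correct in records:
        totals[category] += 1
        correct[category] += int(was_correct)
    return {category: correct[category] / totals[category] for category in totals}


# Illustrative volumes: 400 invoices at 70%, 2,500 receipts at 99%.
records = (
    [("invoice", True)] * 280 + [("invoice", False)] * 120
    + [("receipt", True)] * 2475 + [("receipt", False)] * 25
)

overall = sum(was_correct for _, was_correct in records) / len(records)
per_category = accuracy_by_category(records)

print(f"overall: {overall:.2%}")
for category, accuracy in per_category.items():
    print(f"{category}: {accuracy:.2%}")
```

The aggregate hides the problem only because receipts outnumber invoices; with a different mix the same per-category rates would produce a very different headline number.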

Anti-Patterns (static): 18 anti-patterns organized by domain. Each shows a bad practice (e.g., parsing NL for loop termination) and the correct alternative (e.g., check stop_reason). Key domains: Agentic Architecture (5), Tool Design (4), Claude Code Config (3), Prompt Engineering (3), Context & Reliability (3).

⚠️ Common Misconceptions About the Certification Exam

"I just need to memorize the anti-patterns." — Memorization alone won't work. The exam presents novel scenarios where you need to APPLY the principles. You might recognize anti-pattern #3 (prompt-based enforcement) in isolation, but can you spot it when it's embedded in a Scenario 5 CI/CD question about code review guardrails? The exam tests application, not recall.

"The exam only has one correct answer per question." — True, but the challenge is that two options often look correct. For example, both "use a hook" and "add a system prompt instruction" technically enforce a refund limit. The difference is that hooks are deterministic and can't be bypassed by prompt injection, while system prompt instructions are advisory. The exam rewards understanding WHY one approach is better.

"If I know the Anthropic API well, I'll pass." — API knowledge is necessary but insufficient. Roughly 40% of questions test architectural judgment (when to use multi-agent vs single-agent, when to escalate, how to handle conflicting data). These are design decisions, not API calls.

"Anti-patterns are obvious once you see them." — In a study setting, yes. Under exam pressure with a 90-minute timer, anti-patterns like returning empty results for access failures ({results: []} vs {isError: true}) look deceptively reasonable. Practice under timed conditions.

"I should aim for 100% — 720/1000 is a low bar." — The passing score is 72%, but the questions are designed so that only candidates with genuine understanding score above the threshold. The three distractors per question are calibrated to catch surface-level knowledge. A score of 750-800 represents solid mastery.

Now that you can recognize what NOT to do, let's fill in the remaining domain knowledge gaps — starting with the validation-retry pattern that the exam tests heavily.

Validation-Retry Pattern

📐 Technical Definition

The validation-retry loop is the exam's most-tested Domain 4 pattern: tool_use output is validated against business rules, and if validation fails, specific error details are appended to the conversation so the model can self-correct. Retries are capped at 3 to prevent infinite loops. Here's how it works, step by step.

First, you ask Claude to extract structured data using tool_use. Claude returns valid JSON — the structure is guaranteed by the schema. But the values inside that JSON may be wrong. A CPT code field might contain "knee surgery" instead of "27447".

Second, you validate those values against your business rules. If validation fails, here's the critical part: you don't just say "try again." That's the anti-pattern. Instead, you append SPECIFIC error details — which field failed, what the expected format was, and what was actually returned. Claude then self-corrects with much higher accuracy because it knows exactly what to fix.

Third, the exam tests one more detail: after 3 failed retries on the same field, stop retrying and flag for human review. This prevents infinite loops. It also catches cases where the model genuinely can't extract the data — for example, when the source document simply doesn't contain the required information.

tool_choice, which controls whether and which tools Claude uses, is a separate but related exam topic. There are three settings, and the exam tests when to use each one. 'auto' is the default: Claude decides on its own whether to use a tool. 'any' forces Claude to use a tool, but lets it pick which one. Finally, forced-specific ({"type": "tool", "name": "extract_filing"}) requires a specific tool; use this when you KNOW extraction is needed and don't want Claude to skip it.
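As a quick reference, the three settings correspond to three small dicts passed as the `tool_choice` parameter. The `tool_choice_for` helper below is a hypothetical convenience wrapper written for this sketch; only the dict shapes it returns are part of the request format.

```python
from typing import Optional


def tool_choice_for(mode: str, tool_name: Optional[str] = None) -> dict:
    """Build a tool_choice value for one of the three settings."""
    if mode == "auto":
        return {"type": "auto"}  # Claude decides whether to call a tool
    if mode == "any":
        return {"type": "any"}  # Claude must call some tool, its choice
    if mode == "forced":
        if not tool_name:
            raise ValueError("forced mode needs a tool name")
        return {"type": "tool", "name": tool_name}  # must call this exact tool
    raise ValueError(f"unknown mode: {mode!r}")


print(tool_choice_for("auto"))
print(tool_choice_for("any"))
print(tool_choice_for("forced", "extract_filing"))
```

A common reading of the trade-off: forced-specific suits a pipeline step that exists only to extract, while 'auto' suits a conversational agent that may answer without tools.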

Validation-Retry Loop: Claude + tool_use → JSON Output → Validator

  • ❌ Field "cpt_code": expected 5-digit format, got "surgery123"
  • ↻ Retry 1/3: appending specific error to conversation
  • ✓ All fields valid: {"cpt_code": "27447", "icd10": "M17.11", "units": 1}

Validation-Retry (static): Claude extracts data via tool_use → Validator checks business rules → If invalid, specific error details are appended → Claude retries with better context → After 3 failures on same field, flag for human review.

⚠️ Common Misconceptions About Validation-Retry

"If tool_use returns valid JSON, the data is correct." — No. tool_use guarantees structural correctness (valid JSON matching your schema), but NOT semantic correctness. A CPT code field containing "knee surgery" is valid JSON — it's a string in the right field — but it's not a valid CPT code. Always validate business rules after extraction.

"Generic retry messages work fine — Claude is smart enough." — In testing, generic "try again" messages produce roughly 30-40% self-correction rates. Specific field-level feedback ("field 'cpt_code': expected 5-digit numeric, got 'knee surgery'") pushes self-correction above 80%. The specificity matters enormously.

"Just retry indefinitely until it gets it right." — Some extraction failures are not self-correctable. If the source document says "the patient underwent knee surgery" but never mentions CPT code 27447, no amount of retrying will produce the code. The 3-retry-per-field cap catches these data-quality issues and routes them to humans instead of looping forever.

The validation-retry pattern ensures data quality from Claude's tool calls. But in multi-agent systems, you face a different challenge: when multiple subagents provide conflicting data, how do you decide which to trust?

Information Provenance

💡 Everyday Analogy

Before investigative journalists had editorial standards for sourcing, a reporter could publish any claim without attribution. "Sources say the company is bankrupt" told the reader nothing — was this the CEO, a disgruntled intern, or a rumor on social media? Modern journalism requires provenance: name the source, describe their credibility, note when they said it, and flag if other sources disagree. Multi-agent AI systems face exactly the same challenge. When your coordinator agent receives results from three subagents — one saying "Acme Corp has a lien," another saying "no liens found," and a third saying "data unavailable" — the coordinator needs provenance metadata to decide which claim to trust. Without it, the agent picks arbitrarily, and you can't audit why.

That's the analogy. Now let's see what provenance actually looks like in your code. Here's the metadata object that travels alongside every single claim in a multi-agent pipeline:

{
  "claim": "Acme Corp has an active UCC lien — Filing #2024-NY-0047821",
  "source": "ny_state_filing_db",
  "confidence": "high",
  "timestamp": "2026-04-02T14:30:00Z",
  "agent_id": "subagent-filings-ny",
  "source_type": "official_filing"
}

📐 Technical Definition

Information provenance is metadata that tracks the origin, reliability, and freshness of each claim in a multi-agent system, including which agent, tool, or API provided the data, its confidence level, its timestamp, and how to characterize the source. Provenance tracking means attaching that metadata to every single claim that flows through your system. Think of it as a chain-of-custody label. Each label has four core fields:

  • Source — which agent, tool, or API produced this data. Was it a direct database query, or did a third-party aggregator provide it?
  • Confidence — how reliable is that source? An official state filing database is "high." A third-party aggregator is "medium." User-submitted data is "low."
  • Timestamp — when was the data retrieved? This matters more than you might think. UCC filings can update daily, so data from last week might already be stale.
  • Agent ID — which subagent in the pipeline produced this claim? You need this for audit tracing when something goes wrong.

The exam tests this in Scenario 3 (Multi-Agent Research). When two subagents return conflicting entity data, the coordinator must use provenance to resolve the conflict. The resolution follows a priority chain. First, trust the official source over the aggregator. Second, if sources have equal reliability, trust fresher data over stale data. Third, if the conflict still can't be resolved, flag the claim as "contested" and route it to human review.
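The priority chain can be written as deterministic code. This is a minimal sketch, assuming each claim carries a `source_type` and a timezone-aware `timestamp`; the `Claim` class, the `SOURCE_RANK` table, and the labels "aggregator" and "user_submitted" are illustrative names (only "official_filing" appears in the example metadata above).

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Assumed reliability ranking: higher wins. Unknown source types rank lowest.
SOURCE_RANK = {"official_filing": 3, "aggregator": 2, "user_submitted": 1}


@dataclass(frozen=True)
class Claim:
    claim: str
    source_type: str
    timestamp: datetime
    agent_id: str


def resolve_conflict(a: Claim, b: Claim) -> dict:
    """Apply the chain: source reliability, then freshness, then human review."""
    rank_a = SOURCE_RANK.get(a.source_type, 0)
    rank_b = SOURCE_RANK.get(b.source_type, 0)
    if rank_a != rank_b:
        return {"status": "resolved", "winner": a if rank_a > rank_b else b,
                "rule": "source_reliability"}
    if a.timestamp != b.timestamp:
        return {"status": "resolved", "winner": a if a.timestamp > b.timestamp else b,
                "rule": "freshness"}
    return {"status": "contested", "winner": None, "rule": "human_review"}


official = Claim("Active UCC lien", "official_filing",
                 datetime(2026, 4, 2, 14, 30, tzinfo=timezone.utc), "subagent-filings-ny")
scraped = Claim("No liens found", "aggregator",
                datetime(2026, 4, 3, 9, 0, tzinfo=timezone.utc), "subagent-aggregator")

outcome = resolve_conflict(official, scraped)
print(outcome["rule"], "->", outcome["winner"].claim)
```

Note that the official filing wins even though the aggregator's data is fresher, because reliability is checked before freshness.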

Claims are categorized as: "Well-established" (multiple sources agree), "Contested" (sources disagree — flag for verification), or "Single-source" (only one source, note the risk).
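The three categories follow mechanically from how many sources reported a value and whether they agree. A minimal sketch, assuming claims about one fact have already been grouped into a mapping from source name to reported value (the `classify_claim` helper and the second source name are hypothetical):

```python
def classify_claim(values_by_source: dict[str, str]) -> str:
    """Label one fact by how many sources reported it and whether they agree."""
    if not values_by_source:
        raise ValueError("at least one source is required")
    if len(values_by_source) == 1:
        return "single-source"  # only one source: note the risk
    if len(set(values_by_source.values())) == 1:
        return "well-established"  # multiple sources agree
    return "contested"  # sources disagree: flag for verification


print(classify_claim({"ny_state_filing_db": "active lien"}))
print(classify_claim({"ny_state_filing_db": "active lien", "aggregator_x": "active lien"}))
print(classify_claim({"ny_state_filing_db": "active lien", "aggregator_x": "no liens found"}))
```

Agreement here is exact string equality, which is a simplification; real pipelines usually normalize values before comparing them.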

If you've worked with traditional data pipelines, you might compare provenance to data lineage in ETL systems — tracking where a row came from and which transformations touched it. The difference is that provenance in multi-agent systems must also track reliability. In an ETL pipeline, every transformation is deterministic — the same input always produces the same output. But when a subagent interprets an unstructured document, confidence varies. A filing number extracted from an official state database is high-confidence. The same filing number extracted by a third-party aggregator that scrapes PDFs is medium-confidence at best. Provenance captures that distinction so the coordinator can make informed decisions rather than treating all sources as equal.

⚠️ Common Misconceptions About Provenance

"Provenance is just logging." — Logging records what happened after the fact. Provenance is used at decision time — the coordinator reads provenance metadata to decide which conflicting claim to trust. Logs help you debug later. Provenance helps the agent decide now.

"The coordinator should ask Claude to judge which source is more reliable." — This delegates a deterministic decision to a non-deterministic model. Source reliability ranking (official DB > aggregator > user-submitted) should be hardcoded in your resolution logic, not left to Claude's judgment. The exam specifically penalizes this approach.

"If sources agree, provenance doesn't matter." — Even when sources agree, provenance tells you how much to trust the consensus. Three low-confidence sources agreeing is still lower confidence than one high-confidence official source. And you still need the audit trail for compliance.

With anti-patterns memorized and domain gaps filled, let's walk through the 6 exam scenarios — the actual architectures you'll be asked to design and evaluate.

Exam Scenario Walkthroughs

The certification exam presents 6 possible scenarios. Each scenario describes a real-world system in 2-3 sentences, then asks you to make an architectural decision. The scenarios aren't hypothetical puzzles — they mirror the actual systems that Claude developers build in production.

Think of scenarios as the exam's version of case studies. In each scenario, you're given a specific system (like a customer support agent or a CI/CD pipeline) and asked to choose the correct architecture, configuration, or error-handling approach. The scenario context stays the same across multiple questions — so you might see 3-4 questions all set in "Scenario 1: Customer Support." Each question tests a different domain concept using that same system as the backdrop.

What makes these scenarios tricky is that they test multiple domains at once. Take a customer support scenario (S1) as an example. One question might test your knowledge of escalation triggers (Domain 5, task 5.2 in the coverage map). The next might test hook-based enforcement (Domain 1). A third might test structured error handling (Domain 2). All three questions use the same scenario context, but each targets a different concept. You need to recognize which domain concept applies to the specific decision being asked about.

Here are all 6 scenarios with their correct architecture, key decisions, and the anti-patterns you'll see as distractors. Pay attention to the "Key" note on each diagram — that's the decision the exam is actually testing.

6 Exam Scenarios — Architecture Decisions
Scenario 1: Customer Support Resolution Agent. A single agent handles refund requests, account changes, and FAQ resolution. Must enforce refund limits and know when to escalate to humans.
Components: ReAct Agent (M12) · lookup_order · process_refund · update_account · Hooks (M26)
Key: Hook-based refund limits, escalate on policy gaps (NOT sentiment)

Scenario 2: Code Generation with Claude Code. Claude Code generates a full-stack app iteratively. Must plan before coding, test continuously, and organize project structure.
Components: Claude Code (M25) · Plan Mode · CLAUDE.md · TDD Loop
Key: Plan mode for complex tasks, directory-level CLAUDE.md, iterative refinement

Scenario 3: Multi-Agent Research System. Coordinator dispatches research subagents to gather data from multiple sources. Must track provenance and resolve conflicts.
Components: Coordinator (M14) · Subagent A · Subagent B · Subagent C
Key: Context isolation, provenance tracking, fork_session for exploration

Scenario 4: Developer Productivity. Agent SDK with built-in tools + MCP servers for codebase analysis and modification.
Components: Agent SDK (M26) · Read/Write/Bash · Grep/Glob · MCP Servers (M07)
Key: Tool selection strategy, MCP for external data, codebase exploration

Scenario 5: CI/CD with Claude Code. Non-interactive Claude Code in GitHub Actions for code review, test generation, and PR analysis.
Components: Claude Code -p (M25) · --output-format json · Session Isolation · Batch API
Key: -p flag, separate sessions for generate vs review, batch for non-urgent

Scenario 6: Structured Data Extraction. Extract structured fields from unstructured documents using tool_use with validation-retry.
Components: tool_use Extraction (M04) · JSON Schema · Validation · Retry Loop
Key: Schema design (required/optional/enum+other), specific error feedback, few-shot

6 Scenarios (static): S1 = Support Agent (hooks + escalation), S2 = Code Gen (plan mode + CLAUDE.md), S3 = Multi-Agent (provenance + context isolation), S4 = Dev Productivity (Agent SDK + MCP), S5 = CI/CD (non-interactive + session isolation), S6 = Data Extraction (tool_use + validation-retry).

Code Walkthrough

Validation-Retry Loop Implementation

🔬 Hands-On Lab: Validation-Retry Loop

What You'll Build: A validation-retry loop that extracts medical data from documents and self-corrects extraction errors with specific feedback.

Time Estimate: 20-30 minutes

Prerequisites: Python 3.10+ or Node.js 18+, Anthropic API key

Files You'll Create: validation_retry.py (or validation_retry.mjs)

mkdir cert-prep-lab && cd cert-prep-lab
python -m venv venv && source venv/bin/activate   # Windows: venv\Scripts\activate
pip install anthropic
export ANTHROPIC_API_KEY=your-key-here             # Windows: set ANTHROPIC_API_KEY=your-key-here

The validation-retry pattern is the most code-tested pattern on the exam. Let's walk through the complete implementation — not just what the code does, but why each decision was made the way it was.

Let's start with the validator. The validate_extraction function is where your domain knowledge lives. It checks each extracted field against business rules — is this CPT code actually 5 digits? Is this ICD-10 code in the right format? Notice that we're not just checking types (the JSON schema already guarantees that). We're checking semantic correctness — whether the values make sense in the medical domain.

Next comes the interesting part: the retry loop itself. The extract_with_retry function manages the conversation with Claude. Here's the dilemma it solves: when Claude gets a field wrong, you COULD just say "try again." But that's like telling a student "wrong answer" without explaining what was wrong. Instead, the function appends SPECIFIC error details — which field, what format was expected, and what Claude actually returned. This small change dramatically improves self-correction accuracy.

Finally, there's a subtle but critical feature: the per-field failure counter. If the same field fails 3 times in a row, the problem isn't Claude being careless — it's that the source document probably doesn't contain the information in a recognizable form. At that point, more retries won't help. Flag it for a human.

import re
from dataclasses import dataclass

import anthropic

@dataclass
class ValidationError:
    field: str
    expected: str
    actual: str

def validate_extraction(data: dict) -> list[ValidationError]:
    """Validate extracted data against business rules.
    Returns list of validation errors (empty = valid)."""
    errors = []

    # CPT codes must be exactly 5 digits
    cpt = data.get("cpt_code", "")
    if not (cpt.isdigit() and len(cpt) == 5):
        errors.append(ValidationError(
            field="cpt_code",
            expected="5-digit numeric code (e.g., '27447')",
            actual=repr(cpt),
        ))

    # ICD-10 codes (simplified check): letter + 2 digits + optional decimal part.
    # fullmatch rejects trailing newlines that '^...$' with re.match would accept.
    icd = data.get("icd10_code", "")
    if not re.fullmatch(r'[A-Z]\d{2}(\.\d{1,4})?', icd):
        errors.append(ValidationError(
            field="icd10_code",
            expected="ICD-10 format: letter + 2 digits + optional decimal (e.g., 'M17.11')",
            actual=repr(icd),
        ))

    # Units must be positive integer
    units = data.get("units")
    if not isinstance(units, int) or units < 1:
        errors.append(ValidationError(
            field="units",
            expected="positive integer >= 1",
            actual=repr(units),
        ))

    return errors

def extract_with_retry(
    client: anthropic.Anthropic,
    document: str,
    max_retries: int = 3,
) -> dict:
    """Extract structured data with validation-retry loop.

    Key pattern: on validation failure, append SPECIFIC error
    details — not generic "try again". This gives Claude the
    information it needs to self-correct.
    """
    # Define the extraction tool — Claude must return this schema
    tools = [{
        "name": "extract_medical_data",
        "description": "Extract structured medical data from a document",
        "input_schema": {
            "type": "object",
            "properties": {
                "cpt_code": {
                    "type": "string",
                    "description": "5-digit CPT procedure code"
                },
                "icd10_code": {
                    "type": "string",
                    "description": "ICD-10 diagnosis code (e.g., M17.11)"
                },
                "units": {
                    "type": "integer",
                    "description": "Number of procedure units"
                },
            },
            "required": ["cpt_code", "icd10_code", "units"],
        },
    }]

    messages = [{
        "role": "user",
        "content": f"Extract medical data from this document:\n\n{document}",
    }]

    # Track per-field failure counts — if same field fails 3 times,
    # it's not a self-correction issue, it's a data issue.
    field_failures: dict[str, int] = {}

    for attempt in range(max_retries + 1):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            tools=tools,
            # Force the specific tool on first attempt
            tool_choice=(
                {"type": "tool", "name": "extract_medical_data"}
                if attempt == 0
                else {"type": "auto"}
            ),
            messages=messages,
        )

        # Find the tool_use block in the response
        tool_use = next(
            (b for b in response.content if b.type == "tool_use"),
            None,
        )
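        if not tool_use:
            # Claude answered in text instead of calling the tool (possible once
            # tool_choice falls back to "auto"). Report that explicitly rather
            # than falling through to the "Max retries exceeded" result.
            return {
                "success": False,
                "data": None,
                "reason": "Model did not call extract_medical_data",
                "attempts": attempt + 1,
            }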

        extracted = tool_use.input
        errors = validate_extraction(extracted)

        if not errors:
            return {"success": True, "data": extracted, "attempts": attempt + 1}

        # Check for stuck fields — same field failing repeatedly
        for err in errors:
            field_failures[err.field] = field_failures.get(err.field, 0) + 1
            if field_failures[err.field] >= 3:
                return {
                    "success": False,
                    "data": extracted,
                    "reason": f"Field '{err.field}' failed 3 times — flagging for human review",
                    "attempts": attempt + 1,
                }

        if attempt < max_retries:
            # THE KEY PATTERN: append SPECIFIC errors, not "try again"
            error_details = "\n".join(
                f"- Field '{e.field}': expected {e.expected}, got {e.actual}"
                for e in errors
            )
            # Add the tool result + error feedback to conversation
            messages.append({"role": "assistant", "content": response.content})
            messages.append({
                "role": "user",
                "content": [{
                    "type": "tool_result",
                    "tool_use_id": tool_use.id,
                    "content": f"Validation failed:\n{error_details}\n\nPlease correct these specific fields.",
                    "is_error": True,
                }],
            })

    return {"success": False, "data": extracted, "reason": "Max retries exceeded", "attempts": max_retries + 1}
import Anthropic from '@anthropic-ai/sdk';

function validateExtraction(data) {
  // Validate extracted data against business rules.
  const errors = [];

  // CPT codes: exactly 5 digits
  const cpt = data.cpt_code || '';
  if (!/^\d{5}$/.test(cpt)) {
    errors.push({
      field: 'cpt_code',
      expected: "5-digit numeric code (e.g., '27447')",
      actual: JSON.stringify(cpt),
    });
  }

  // ICD-10: letter + 2 digits + optional decimal
  const icd = data.icd10_code || '';
  if (!/^[A-Z]\d{2}(\.\d{1,4})?$/.test(icd)) {
    errors.push({
      field: 'icd10_code',
      expected: "ICD-10 format: letter + 2 digits + optional decimal",
      actual: JSON.stringify(icd),
    });
  }

  // Units: positive integer
  if (!Number.isInteger(data.units) || data.units < 1) {
    errors.push({
      field: 'units',
      expected: 'positive integer >= 1',
      actual: JSON.stringify(data.units),
    });
  }

  return errors;
}

async function extractWithRetry(client, document, maxRetries = 3) {
  // Extraction tool — Claude must return this schema
  const tools = [{
    name: 'extract_medical_data',
    description: 'Extract structured medical data from a document',
    input_schema: {
      type: 'object',
      properties: {
        cpt_code: { type: 'string', description: '5-digit CPT procedure code' },
        icd10_code: { type: 'string', description: 'ICD-10 diagnosis code' },
        units: { type: 'integer', description: 'Number of procedure units' },
      },
      required: ['cpt_code', 'icd10_code', 'units'],
    },
  }];

  const messages = [{
    role: 'user',
    content: `Extract medical data from this document:\n\n${document}`,
  }];

  const fieldFailures = {};  // Track per-field failures

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 1024,
      tools,
      // Force specific tool on first attempt
      tool_choice: attempt === 0
        ? { type: 'tool', name: 'extract_medical_data' }
        : { type: 'auto' },
      messages,
    });

    const toolUse = response.content.find((b) => b.type === 'tool_use');
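    if (!toolUse) {
      // Claude answered in free text instead of calling the tool
      // (only possible on retries, where tool_choice is "auto").
      return {
        success: false,
        reason: 'Model did not call the extraction tool',
        attempts: attempt + 1,
      };
    }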

    const extracted = toolUse.input;
    const errors = validateExtraction(extracted);

    if (errors.length === 0) {
      return { success: true, data: extracted, attempts: attempt + 1 };
    }

    // Check for stuck fields
    for (const err of errors) {
      fieldFailures[err.field] = (fieldFailures[err.field] || 0) + 1;
      if (fieldFailures[err.field] >= 3) {
        return {
          success: false, data: extracted,
          reason: `Field '${err.field}' failed 3 times — flagging for human review`,
          attempts: attempt + 1,
        };
      }
    }

    if (attempt < maxRetries) {
      // KEY PATTERN: append SPECIFIC errors, not "try again"
      const errorDetails = errors
        .map((e) => `- Field '${e.field}': expected ${e.expected}, got ${e.actual}`)
        .join('\n');

      messages.push({ role: 'assistant', content: response.content });
      messages.push({
        role: 'user',
        content: [{
          type: 'tool_result',
          tool_use_id: toolUse.id,
          content: `Validation failed:\n${errorDetails}\n\nPlease correct these specific fields.`,
          is_error: true,
        }],
      });
    }
  }

  return { success: false, reason: 'Max retries exceeded', attempts: maxRetries + 1 };
}
🔍 What Just Happened?

You built a validation-retry loop with three critical features: (1) tool_choice forces the extraction tool on the first attempt, (2) validation errors include SPECIFIC field-level details (not "try again"), and (3) per-field failure tracking flags stuck fields for human review after 3 failures. This is the exam's most-tested code pattern — memorize the three components.

🎓 Cert Tip — Domain 4.3 & 4.4
The exam tests validation-retry as a two-part concept. Domain 4.3 tests that tool_use guarantees structure but NOT semantic correctness. Domain 4.4 tests that retry feedback must be specific — which field failed, expected format, and actual value. Generic "there were errors, please try again" is the #1 distractor for Domain 4 questions.
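
The 4.3 distinction can be illustrated with a minimal sketch. The `schema_types_ok` and `semantic_errors` helpers and the simplified type table are assumptions made for illustration, not part of the course code. The payload satisfies the tool schema's types, yet it fails the business rule, which is the gap the retry loop closes.

```python
import re

# Hypothetical payload: every field has the type the tool schema asks for.
payload = {"cpt_code": "knee surgery", "icd10_code": "M17.11", "units": 1}

# Simplified stand-in for the schema's type constraints (assumption).
SCHEMA_TYPES = {"cpt_code": str, "icd10_code": str, "units": int}


def schema_types_ok(data: dict) -> bool:
    """Structural check: required keys are present with the declared types."""
    return all(
        key in data and isinstance(data[key], expected)
        for key, expected in SCHEMA_TYPES.items()
    )


def semantic_errors(data: dict) -> list[str]:
    """Business-rule check that the type constraints alone do not cover."""
    errors = []
    if not re.fullmatch(r"\d{5}", data["cpt_code"]):
        errors.append(
            f"Field 'cpt_code': expected 5-digit numeric code, got {data['cpt_code']!r}"
        )
    return errors


structurally_valid = schema_types_ok(payload)
feedback = semantic_errors(payload)
print(structurally_valid)  # True
print(feedback)
```

The structural check passes while the semantic check produces one specific, field-level message, which is the kind of feedback Domain 4.4 expects in the tool result.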

Run Command

Save the Python version of the code above as validation_retry.py, then add this test block at the bottom of the file and run it:

# Add to bottom of validation_retry.py:
if __name__ == "__main__":
    client = anthropic.Anthropic()
    test_doc = """
    Patient: Jane Smith, DOB 1965-03-15
    Procedure: Total knee replacement (right knee)
    Diagnosis: Primary osteoarthritis, right knee
    Units: 1
    """
    result = extract_with_retry(client, test_doc)
    print(json.dumps(result, indent=2))

Run:

python validation_retry.py
Expected Output
{
  "success": true,
  "data": {
    "cpt_code": "27447",
    "icd10_code": "M17.11",
    "units": 1
  },
  "attempts": 1
}
✅ Checkpoint

If you see "success": true with a valid 5-digit CPT code and ICD-10 code, the validation-retry loop is working. The "attempts" field tells you how many tries Claude needed — 1 means it got it right first time.

🔧 Troubleshooting

If you see AuthenticationError: Check that ANTHROPIC_API_KEY is set in your environment. Run echo $ANTHROPIC_API_KEY (or echo %ANTHROPIC_API_KEY% on Windows) to verify.

If you see ModuleNotFoundError: No module named 'anthropic': Run pip install anthropic and make sure your virtual environment is activated.

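If "attempts" is greater than 1: That's fine. It means the validation-retry loop caught an error and Claude self-corrected. The script as written prints only the final result, so to see which fields were corrected, add print(error_details) inside the retry branch of extract_with_retry before the messages are appended.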

Provenance Tracker

The provenance tracker looks simple at first — it's just a dictionary of claims with metadata. But the real intelligence is in the resolve() method. Here's the problem it solves: your coordinator agent just received results from three subagents, and two of them disagree about whether Acme Corp has an active lien. What do you do?

The resolver applies a three-step priority chain. First, it ranks by source reliability: an official state filing database (high confidence) beats a third-party aggregator (medium confidence) every time. Second, if sources have equal reliability, it ranks by timestamp: fresher data wins because UCC filings can change daily. Third, whenever sources disagree, the result is marked "contested" even if the ranking picked a clear winner, and the losing claims are returned as alternatives so the conflict can be routed to human review. This is exactly the resolution logic the exam tests in Scenario 3.

from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum

class Confidence(Enum):
    HIGH = "high"       # Official source (state filing DB)
    MEDIUM = "medium"   # Third-party aggregator
    LOW = "low"         # User-submitted / unverified

class ClaimStatus(Enum):
    WELL_ESTABLISHED = "well_established"  # Multiple sources agree
    CONTESTED = "contested"                # Sources disagree
    SINGLE_SOURCE = "single_source"        # Only one source

@dataclass
class ProvenanceRecord:
    """Track the origin and reliability of each claim.

    The exam tests this in Scenario 3: when subagents return
    conflicting data, provenance lets the coordinator decide
    which claim to trust.
    """
    claim: str                    # The actual data/assertion
    source: str                   # Which agent/tool/API
    confidence: Confidence        # Source reliability
    timestamp: datetime           # When data was retrieved
    agent_id: str                 # Which subagent in pipeline
    source_type: str = ""         # "official_filing" vs "aggregator"

@dataclass
class ProvenanceTracker:
    """Aggregate claims from multiple subagents and resolve conflicts."""
    claims: dict[str, list[ProvenanceRecord]] = field(default_factory=dict)

    def add_claim(self, topic: str, record: ProvenanceRecord):
        """Add a claim from a subagent."""
        if topic not in self.claims:
            self.claims[topic] = []
        self.claims[topic].append(record)

    def resolve(self, topic: str) -> dict:
        """Resolve conflicting claims using provenance rules:
        1. Official sources trump aggregators
        2. Fresher data trumps stale data
        3. If still contested, flag for human review
        """
        records = self.claims.get(topic, [])
        if not records:
            return {"status": "no_data", "claim": None}
        if len(records) == 1:
            r = records[0]
            return {
                "status": ClaimStatus.SINGLE_SOURCE.value,
                "claim": r.claim,
                "source": r.source,
                "confidence": r.confidence.value,
                "warning": "Single source — verify independently",
            }

        # Check for agreement
        unique_claims = set(r.claim for r in records)
        if len(unique_claims) == 1:
            confidence_rank = {"low": 0, "medium": 1, "high": 2}
            best = max(records, key=lambda r: confidence_rank[r.confidence.value])
            return {
                "status": ClaimStatus.WELL_ESTABLISHED.value,
                "claim": best.claim,
                "sources": [r.source for r in records],
                "confidence": "high",
            }

        # Conflict — resolve by source reliability, then freshness
        # Sort descending: highest confidence first, then most recent timestamp
        ranked = sorted(
            records,
            key=lambda r: (
                ["low", "medium", "high"].index(r.confidence.value),
                r.timestamp,
            ),
            reverse=True,
        )
        best = ranked[0]
        return {
            "status": ClaimStatus.CONTESTED.value,
            "claim": best.claim,
            "source": best.source,
            "confidence": best.confidence.value,
            "alternatives": [
                {"claim": r.claim, "source": r.source}
                for r in ranked[1:]
            ],
            "warning": "Sources disagree — flagged for human review",
        }
const Confidence = { HIGH: 'high', MEDIUM: 'medium', LOW: 'low' };
const ClaimStatus = {
  WELL_ESTABLISHED: 'well_established',
  CONTESTED: 'contested',
  SINGLE_SOURCE: 'single_source',
};
const CONFIDENCE_RANK = { low: 0, medium: 1, high: 2 };

class ProvenanceTracker {
  // Aggregate claims from subagents and resolve conflicts.
  // Exam Scenario 3 tests this exact pattern.
  constructor() {
    this.claims = new Map();
  }

  addClaim(topic, record) {
    if (!this.claims.has(topic)) this.claims.set(topic, []);
    this.claims.get(topic).push({
      ...record,
      timestamp: record.timestamp || new Date(),
    });
  }

  resolve(topic) {
    const records = this.claims.get(topic) || [];
    if (records.length === 0) return { status: 'no_data', claim: null };

    if (records.length === 1) {
      return {
        status: ClaimStatus.SINGLE_SOURCE,
        claim: records[0].claim,
        source: records[0].source,
        confidence: records[0].confidence,
        warning: 'Single source — verify independently',
      };
    }

    // Check agreement
    const uniqueClaims = new Set(records.map((r) => r.claim));
    if (uniqueClaims.size === 1) {
      return {
        status: ClaimStatus.WELL_ESTABLISHED,
        claim: records[0].claim,
        sources: records.map((r) => r.source),
        confidence: 'high',
      };
    }

    // Conflict — rank by confidence then freshness
    const ranked = [...records].sort((a, b) => {
      const confDiff = CONFIDENCE_RANK[b.confidence] - CONFIDENCE_RANK[a.confidence];
      if (confDiff !== 0) return confDiff;
      return b.timestamp - a.timestamp;
    });

    return {
      status: ClaimStatus.CONTESTED,
      claim: ranked[0].claim,
      source: ranked[0].source,
      confidence: ranked[0].confidence,
      alternatives: ranked.slice(1).map((r) => ({
        claim: r.claim, source: r.source,
      })),
      warning: 'Sources disagree — flagged for human review',
    };
  }
}
🔍 What Just Happened?

You built a provenance tracker that resolves conflicting multi-agent claims. The key resolution logic: official sources outrank aggregators, fresher data outranks stale data, and "contested" claims are flagged for human review. Each claim carries four required metadata fields (source, confidence, timestamp, agent_id) plus an optional source_type. This is the Scenario 3 pattern the exam tests.

🎓 Cert Tip — Domain 1.2 & 1.3
Scenario 3 (Multi-Agent Research) tests two concepts together: context isolation (Domain 1.3 — subagents do NOT inherit the coordinator's full conversation) and provenance-based conflict resolution (Domain 1.2 — the coordinator must use structured metadata, not ask Claude to judge source quality). If you see a multi-agent question, check whether the answer respects context isolation AND uses deterministic conflict resolution.
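
The context isolation half of this tip can be sketched as follows. The `build_subagent_messages` function and the sample history are hypothetical names introduced for illustration. The point is that the subagent starts from a fresh message list containing only the brief the coordinator chooses to pass.

```python
def build_subagent_messages(task: str, facts: list[str]) -> list[dict]:
    """Build a fresh message list for a subagent.

    Only the task and the facts the coordinator chooses to pass are
    included; the coordinator's own conversation is never copied in.
    """
    brief = "\n".join(f"- {fact}" for fact in facts)
    return [{
        "role": "user",
        "content": f"Task: {task}\n\nKnown facts:\n{brief}",
    }]


# Hypothetical coordinator history that must NOT leak to the subagent.
coordinator_history = [
    {"role": "user", "content": "Research Acme Corp before the board meeting."},
    {"role": "assistant", "content": "Plan: check filings, then aggregators."},
]

subagent_messages = build_subagent_messages(
    task="Check the NY state filing database for active UCC liens on Acme Corp",
    facts=["Entity name: Acme Corp", "Jurisdiction: NY"],
)
print(len(subagent_messages))  # 1
```

Nothing from `coordinator_history` reaches the subagent unless it is passed explicitly as a fact, so any context a subagent needs has to be written into its brief.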

Run Command

Save the Python version of the provenance tracker as provenance.py, then add this test block and run it:

# Add to bottom of provenance.py:
if __name__ == "__main__":
    tracker = ProvenanceTracker()

    # Subagent A: official state filing database
    tracker.add_claim("acme_lien", ProvenanceRecord(
        claim="Acme Corp has an active UCC lien — Filing #2024-NY-0047821",
        source="ny_state_filing_db",
        confidence=Confidence.HIGH,
        timestamp=datetime(2026, 4, 2, 14, 30),
        agent_id="subagent-filings-ny",
        source_type="official_filing",
    ))

    # Subagent B: third-party aggregator disagrees
    tracker.add_claim("acme_lien", ProvenanceRecord(
        claim="No active liens found for Acme Corp",
        source="third_party_aggregator",
        confidence=Confidence.MEDIUM,
        timestamp=datetime(2026, 4, 1, 9, 0),
        agent_id="subagent-aggregator",
        source_type="aggregator",
    ))

    import json
    result = tracker.resolve("acme_lien")
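    # ensure_ascii=False keeps the dash in the claim text readable
    # instead of printing it as a \u2014 escape sequence.
    print(json.dumps(result, indent=2, default=str, ensure_ascii=False))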

Run:

python provenance.py
Expected Output
{
  "status": "contested",
  "claim": "Acme Corp has an active UCC lien — Filing #2024-NY-0047821",
  "source": "ny_state_filing_db",
  "confidence": "high",
  "alternatives": [
    {
      "claim": "No active liens found for Acme Corp",
      "source": "third_party_aggregator"
    }
  ],
  "warning": "Sources disagree — flagged for human review"
}
✅ Checkpoint

If you see "status": "contested" with the official filing winning over the aggregator, provenance resolution is working correctly. The official source (high confidence) outranked the aggregator (medium confidence). Both sources appear in the output for audit transparency.

🔧 Troubleshooting

If you see NameError: name 'datetime' is not defined: Add from datetime import datetime at the top of provenance.py.

If the wrong claim wins: Check that confidence levels are correct — Confidence.HIGH for the official source, Confidence.MEDIUM for the aggregator. The resolver ranks by confidence first, then by timestamp.
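
To check the ranking order in isolation, here is a small self-contained sketch using plain tuples instead of ProvenanceRecord. The sample claims and the `RANK` table are made up for illustration. It applies the same ordering rule as the resolver: confidence first, then timestamp.

```python
from datetime import datetime

RANK = {"low": 0, "medium": 1, "high": 2}

# Hypothetical records as (claim, confidence, timestamp) tuples.
records = [
    ("no lien", "medium", datetime(2026, 4, 3, 9, 0)),
    ("active lien", "high", datetime(2026, 4, 1, 9, 0)),
    ("lien released", "high", datetime(2026, 4, 2, 9, 0)),
]

# Same ordering rule as resolve(): confidence first, then freshness.
ranked = sorted(records, key=lambda r: (RANK[r[1]], r[2]), reverse=True)
winner = ranked[0][0]
print(winner)  # lien released
```

The medium-confidence record is the newest overall but still ranks last, because the timestamp only breaks ties between records of equal confidence.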

Final Verification — Both Components

You've now built both the validation-retry loop and the provenance tracker. These are the two most-tested code patterns on the certification exam. To verify both work end-to-end:

python validation_retry.py && python provenance.py
🎉 Congratulations

If both scripts run without errors, you've implemented the two core exam code patterns. The validation-retry loop handles Scenario 6 (Data Extraction) questions. The provenance tracker handles Scenario 3 (Multi-Agent Research) questions. Together they cover roughly 30% of the exam's code-specific questions.

Mock Exam Practice

📝 Practice Exam Instructions

What you're practicing: 8 mock exam questions in the exact certification format — scenario context, question stem, and 4 options (1 correct, 3 anti-pattern distractors).

Time target: Aim for 2 minutes per question (16 minutes total). On exam day, you'll have roughly 90 seconds per question, so this is generous practice time.

How to use: Read each scenario carefully. Before clicking an answer, identify which domain the question is testing (D1-D5) and which anti-patterns appear as distractors. Then select your answer. Detailed feedback appears immediately.

Self-assessment: After completing all 8 questions, check your domain score breakdown below. A domain with 0% or 50% accuracy means you should revisit that domain's module before exam day. Your target: 75%+ in every domain, not just 72% overall.

Read carefully — the exam rewards recognizing anti-patterns as much as knowing correct patterns. For each question, try to identify the anti-pattern number for each wrong answer before selecting.

Your Domain Scores

As you answer the quiz questions below, your scores are tracked per domain and displayed as a bar chart showing strengths and weaknesses across all 5 certification domains: D1 Agentic Architecture, D2 Tool Design & MCP, D3 Claude Code Config, D4 Prompt Engineering, and D5 Context & Reliability.

Knowledge Check — Mock Exam Questions

1. [Scenario 1] A customer support agent processes refund requests. The system must enforce a $500 refund limit. Which approach is correct?

A Add "Never approve refunds over $500" to the system prompt — Claude will follow the instruction
B Check Claude's self-reported confidence score and only process refunds when confidence exceeds 90%
C Use a programmatic hook that intercepts the process_refund tool call and blocks any amount over $500
D Set maxTurns: 1 so the agent can only make one refund call per conversation
✓ Correct! Programmatic hooks (M26) are the correct enforcement mechanism for critical business rules. They intercept tool calls deterministically — they CAN'T be bypassed by prompt injection. Option A is anti-pattern #3 (prompt-based enforcement). Option B is anti-pattern #5 (self-reported confidence). Option D misuses iteration caps.
✗ The correct answer is C — programmatic hooks. System prompt instructions (A) are advisory, not enforced — anti-pattern #3. Self-reported confidence (B) is uncalibrated — anti-pattern #5. Iteration caps (D) are safety nets, not control mechanisms — anti-pattern #2. Hooks are deterministic and can't be bypassed.
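
The enforcement idea behind option C can be sketched in a few lines. The `pre_tool_use_hook` function and its return shape are hypothetical and are not the hook API of any particular SDK. What matters is that the check runs in application code before the tool executes.

```python
REFUND_LIMIT_USD = 500  # business rule from the scenario


def pre_tool_use_hook(tool_name: str, tool_input: dict) -> dict:
    """Hypothetical hook: decide whether a tool call may run.

    Runs in application code before the tool executes, so the decision
    does not depend on what the model was told in its prompt.
    """
    if tool_name == "process_refund" and tool_input.get("amount", 0) > REFUND_LIMIT_USD:
        return {
            "allow": False,
            "reason": f"Refund exceeds ${REFUND_LIMIT_USD} limit; route to a human approver",
        }
    return {"allow": True, "reason": ""}


blocked = pre_tool_use_hook("process_refund", {"amount": 750})
allowed = pre_tool_use_hook("process_refund", {"amount": 120})
print(blocked["allow"], allowed["allow"])  # False True
```

Because the comparison is ordinary code, the outcome is the same no matter how the request was phrased in the conversation.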

2. [Scenario 4] Your agent's search tool queries a database but encounters an authentication failure. What should the tool return?

A {"results": []} — return empty results so the agent can move on
B "Operation failed" — a simple error string
C {"isError": true, "errorCategory": "auth_failure", "isRetryable": true, "context": "Token expired"}
D Throw an exception and let the agent's error handler catch it
✓ Correct! Structured errors (C) give Claude the information to decide: retry (token expired is transient), try an alternative, or escalate. Empty results (A) is anti-pattern #7 — Claude thinks "nothing found" when it should know "couldn't check." Generic errors (B) is anti-pattern #6. Unhandled exceptions (D) crash the agent loop.
✗ The answer is C — structured error with category, retryability, and context. Empty results (A) is anti-pattern #7: Claude thinks "no results" when it should know "access denied." Generic error (B) is anti-pattern #6. Exceptions (D) break the agent loop. Structured errors let Claude make informed decisions about retry, alternatives, or escalation.
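
A minimal sketch of how a tool might produce the structured error in option C. The `AuthError` class, the `query_database` stub, and the `search_tool` wrapper are all hypothetical stand-ins for a real backend. The field names follow the option text.

```python
class AuthError(Exception):
    """Hypothetical exception raised by the database client."""


def query_database(query: str) -> list[dict]:
    """Stand-in for the real backend; always fails auth in this sketch."""
    raise AuthError("Token expired")


def search_tool(query: str) -> dict:
    """Return results, or a structured error the agent can reason about."""
    try:
        return {"isError": False, "results": query_database(query)}
    except AuthError as exc:
        # Never return {"results": []} here: that would read as "nothing found".
        return {
            "isError": True,
            "errorCategory": "auth_failure",
            "isRetryable": True,
            "context": str(exc),
        }


result = search_tool("liens for Acme Corp")
print(result["errorCategory"], result["isRetryable"])  # auth_failure True
```

The exception is caught inside the tool, so the agent loop keeps running and receives enough detail to choose between retrying, trying an alternative, and escalating.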

3. [Scenario 6] Claude extracts a CPT code as "knee surgery" instead of the 5-digit code "27447". Your validation-retry loop should:

A Send "There were errors, please try again" and re-request
B Append: "Field 'cpt_code': expected 5-digit numeric code (e.g., '27447'), got 'knee surgery'" as a tool result error
C Increase max_tokens to give Claude more room to think
D Switch to a larger model (Opus) for more accurate extraction
✓ Correct! The validation-retry pattern requires SPECIFIC error feedback: which field failed, what was expected, and what was actually returned. This gives Claude the context to self-correct. Option A is anti-pattern #15 — generic retry without detail. Options C and D don't address the root cause (Claude needs format guidance, not more compute).
✗ The answer is B. The key anti-pattern is generic retry (A) — "try again" gives Claude zero information about what went wrong. Specific feedback ("field X expected Y, got Z") dramatically improves self-correction accuracy. More tokens (C) or bigger models (D) don't help when the issue is format misunderstanding, not capability.

4. [Scenario 3] Two subagents return conflicting data about a company's lien status. Subagent A (querying the state filing database) says "active lien." Subagent B (querying a third-party aggregator) says "no liens." How should the coordinator resolve this?

A Average the results — report "possible lien" with 50% confidence
B Use the most recent response regardless of source
C Trust the official state filing (high confidence) over the third-party aggregator (medium confidence), flag as "contested," and include both sources in the output
D Ask Claude to decide which source is more reliable based on the data content
✓ Correct! Provenance-based resolution: official sources (state filing DB = high confidence) outrank third-party aggregators (medium confidence). The claim is flagged as "contested" with both sources included for transparency. Averaging (A) is meaningless for binary claims. Recency alone (B) ignores source quality. Asking Claude to judge (D) delegates a deterministic decision to a non-deterministic model.
✗ The answer is C — provenance-based resolution. The state filing DB has higher confidence than a third-party aggregator. The conflict is flagged as "contested" with both sources for audit transparency. You can't average binary claims (A), recency alone ignores source quality (B), and Claude shouldn't make deterministic source-reliability judgments (D).

5. [Scenario 5] Your CI/CD pipeline uses Claude Code to generate code and then review it for security issues. What is the correct session configuration?

A Use the same session for both generation and review — it already has context about what was generated
B Use separate sessions: one for generation, a fresh one for review — prevents confirmation bias from shared context
C Use fork_session to branch the generation session for review — preserves context while allowing divergence
D Run both in a single -p command with "generate and then review" as the prompt
✓ Correct! Same-session self-review creates confirmation bias — the reviewer retains the generator's reasoning context and is less likely to catch issues. Separate sessions give the reviewer a fresh, unbiased perspective. This is anti-pattern #12. fork_session (C) still preserves the original reasoning context. Single command (D) is the worst case — no separation at all.
✗ The answer is B — separate sessions. Same-session review (A) is anti-pattern #12: the reviewer retains the generator's reasoning and develops confirmation bias. fork_session (C) still inherits context. Single command (D) provides zero separation. Fresh sessions for review give an unbiased perspective, especially critical in CI/CD security scanning.

6. [Scenario 1] A customer is angry and using profanity, but their request is straightforward: "Where the **** is my package?!" The agent should:

A Escalate to a human agent immediately — angry customers need human empathy
B Handle it normally — the request is within capability (package tracking). Anger alone is not an escalation trigger
C Ask the customer to rephrase politely before processing the request
D Reduce the agent's capabilities for this user to prevent escalation of behavior
✓ Correct! Escalation should be based on policy gaps, capability limits, explicit requests, or business thresholds — NOT sentiment or anger. An angry customer with a simple, within-capability request should be handled normally. Option A is anti-pattern #4 (sentiment-based escalation). A calm customer hitting a policy gap DOES need escalation; an angry one with a simple request does NOT.
✗ The answer is B. Anti-pattern #4 is sentiment-based escalation — escalate on policy gaps, capability limits, or business thresholds, NOT anger. This is a simple package tracking request within the agent's capabilities. The customer is angry but the request is straightforward. Escalating wastes human agent time on something the AI can handle perfectly well.

7. Your document extraction system reports "95% overall accuracy." Upon investigation, you find invoices are at 70% accuracy while receipts are at 99%. The correct response is:

A The system is performing well — 95% is above the 90% threshold
B Retrain the model with more invoice examples to improve the overall score
C Implement stratified per-document-type metrics and address the invoice accuracy gap specifically
D Lower the accuracy threshold to 85% to account for difficult document types
✓ Correct! Aggregate metrics mask per-category failures — anti-pattern #17. Invoices at 70% is a serious problem hidden by receipts at 99%. Stratified metrics reveal the real issue. You don't retrain the whole model (B) — you fix the specific extraction logic for invoices. Lowering thresholds (D) hides the problem instead of fixing it.
✗ The answer is C — stratified metrics. "95% overall" hides that invoices are at 70% (broken) while receipts are at 99% (great). Anti-pattern #17 is aggregate-only metrics. You need per-category tracking to catch hidden failures. Lowering thresholds (D) or accepting the aggregate (A) both sweep the invoice problem under the rug.
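
The arithmetic behind this question can be reproduced with a short sketch. The evaluation counts are hypothetical, chosen so that the per-type and overall accuracies match the numbers in the scenario.

```python
# Hypothetical evaluation counts chosen to reproduce the scenario's numbers.
results = {
    "invoice": {"correct": 280, "total": 400},
    "receipt": {"correct": 2475, "total": 2500},
}

overall = sum(r["correct"] for r in results.values()) / sum(
    r["total"] for r in results.values()
)
per_type = {name: r["correct"] / r["total"] for name, r in results.items()}

THRESHOLD = 0.90  # the threshold mentioned in option A
failing = [name for name, acc in per_type.items() if acc < THRESHOLD]

print(f"overall={overall:.2f}")  # overall=0.95
print(failing)  # ['invoice']
```

Receipts outnumber invoices by more than six to one in this sample, which is why a 70% category barely moves the aggregate.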

8. [Scenario 6] You need Claude to extract data from a document and MUST use the extraction tool (not free-text response). The correct tool_choice is:

A tool_choice: {"type": "auto"} (let Claude decide whether to use the tool)
B tool_choice: {"type": "any"} (force Claude to use a tool, but let it pick which one)
C tool_choice: {"type": "tool", "name": "extract_data"} (force the specific extraction tool)
D Add "You MUST use the extract_data tool" to the system prompt instead
✓ Correct! When you KNOW which tool must be used, force it with {"type": "tool", "name": "extract_data"}. "auto" (A) means Claude might skip the tool entirely. "any" (B) forces a tool but Claude might pick the wrong one. Prompt-based enforcement (D) is anti-pattern #3 — Claude might still respond in free text.
✗ The answer is C — forced specific tool. "auto" lets Claude skip the tool. "any" forces a tool but doesn't specify which. Prompt-based enforcement (D) is anti-pattern #3. When you KNOW extraction is needed, use {"type": "tool", "name": "extract_data"} for deterministic behavior.

Your Score

Next steps based on your score:

  • 8/8: You're exam-ready. Review the anti-patterns one more time on exam morning, then go.
  • 6-7/8: Above passing threshold. Check which domain(s) you missed — revisit that module and re-read the anti-pattern cards above.
  • 4-5/8: At or near threshold. Re-read the Validation-Retry and Provenance sections, then retake this quiz after 24 hours.
  • Below 4/8: Go back to M25-M26, then re-read this module's anti-pattern cards and scenario walkthroughs before retaking.

Summary & Exam Day Tips

You've completed the entire course — from your first API call in M01 to certification-level mastery here in M27. Here's your final exam preparation checklist:

  • Anti-Patterns: Know all 18. The exam's wrong answers ARE anti-patterns. Recognizing them is half the battle.
  • stop_reason: The single most important technical detail. 'tool_use' means the agent loop should continue. 'end_turn' means done. Never parse natural language to detect loop termination.
  • Hooks vs Prompts: Critical business rules use hooks — they're deterministic and can't be bypassed. System prompt instructions are advisory only. The exam treats this distinction as a bright line.
  • Structured Errors: Always return {isError, errorCategory, isRetryable, context}. Never return a generic "Operation failed" string. Claude needs the error details to decide whether to retry, try an alternative, or escalate.
  • Validation-Retry: When extraction fails, append specific field-level errors — not "try again." Cap retries at 3 per field, then flag for human review.
  • Provenance: Every claim in a multi-agent system needs four metadata fields: source, confidence, timestamp, and agent_id. Without these, you can't audit conflict resolution.
  • Session Isolation: Generate and review in SEPARATE sessions. Same-session review creates confirmation bias because the reviewer retains the generator's reasoning context.
  • Stratified Metrics: Never report aggregate-only accuracy. Always drill down per document type or category. A 95% average can hide a 70% failure rate in one category.
  • Escalation: Escalate on policy gaps and capability limits. Never escalate based on sentiment or self-reported confidence scores.
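
The stop_reason rule in the checklist above can be sketched as a small loop. The `run_agent_loop` function, the `call_model` and `execute_tool` callables, and the scripted fake responses are assumptions made for illustration; a real loop would call the API instead.

```python
from types import SimpleNamespace


def run_agent_loop(call_model, execute_tool, messages, max_iterations=10):
    """Drive the loop from stop_reason; the cap is only a safety net."""
    for _ in range(max_iterations):
        response = call_model(messages)
        if response.stop_reason != "tool_use":
            return response  # e.g. "end_turn": the model is done
        messages.append({"role": "assistant", "content": response.content})
        tool_results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": execute_tool(block.name, block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": tool_results})
    raise RuntimeError("Iteration cap reached without end_turn")


# Scripted fake responses standing in for the API (assumption for the demo).
scripted = iter([
    SimpleNamespace(
        stop_reason="tool_use",
        content=[SimpleNamespace(type="tool_use", id="t1", name="track", input={"id": 7})],
    ),
    SimpleNamespace(
        stop_reason="end_turn",
        content=[SimpleNamespace(type="text", text="Your package arrives Friday.")],
    ),
])

history = [{"role": "user", "content": "Where is my package?"}]
final = run_agent_loop(lambda msgs: next(scripted), lambda name, args: "in transit", history)
print(final.stop_reason, len(history))  # end_turn 3
```

The loop never inspects the text of the reply to decide whether to stop; it continues only while stop_reason is 'tool_use'.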

Passing score: 720/1000. The exam favors hub-and-spoke architecture, programmatic enforcement, and structured data patterns. Trust what you've learned across M01-M27. Good luck! 🎓

References & Resources