M27: Certification Exam Prep
Anti-patterns, scenario walkthroughs, and mock exam practice for the Claude Certified Architect — Foundations exam.
Learning Objectives
- Use the domain coverage map below to identify your weakest cert domains and target review
- Recognize all 18 anti-patterns across the 5 certification domains
- Understand the validation-retry loop with specific error feedback
- Implement information provenance tracking for multi-agent systems
- Walk through all 6 exam scenarios with correct architectural decisions
- Practice with mock exam questions in the certification format
Cert Domain Coverage Map
This map is built directly from Anthropic's official Claude Certified Architect – Foundations Certification Exam Guide (v0.1, Feb 10 2025). Every official task statement (30 in total across 5 domains, one per table row) maps to specific course modules below. If your practice score shows weakness in a domain, jump to the row instead of re-reading the whole course.
D1 is the heaviest at 27% — budget your study time accordingly. D5 sub-tasks 5.5/5.6 have their own deep dive in M27B.
Reading the map: "Primary" = where the topic is taught from scratch. "Cert tip" = a focused exam-relevant 🎓 callout on that point. Domain weights from the official guide: D1 = 27%, D2 = 18%, D3 = 20%, D4 = 20%, D5 = 15%.
| Task | Topic (official task statement) | Where Covered |
|---|---|---|
| Domain 1 — Agentic Architecture & Orchestration (27%) | ||
| 1.1 | Agentic loops: stop_reason control flow, tool result handling | M12 (primary) · cert tips in M12, M15B, M23, M16 (iteration cap anti-pattern) |
| 1.2 | Coordinator-subagent (hub-and-spoke), task decomposition by coordinator | M14 (primary) · M14, M15B cert tips |
| 1.3 | Subagent invocation: Task tool, allowedTools, explicit context passing, parallel spawning | M14 · M15B · cert tips in both |
| 1.4 | Multi-step workflows: programmatic prerequisite gates, structured handoff protocols (sample question #1) | M17 cert tip (NEW) · M16 cert tip · M26 (hooks-as-enforcement) |
| 1.5 | Agent SDK hooks: PostToolUse, tool call interception, data normalization | M26 (primary) · M15B, M23 cert tips |
| 1.6 | Task decomposition: prompt chaining vs. dynamic adaptive decomposition | M13 (primary) · M13, M26 cert tips |
| 1.7 | Session state: --resume, fork_session, named sessions, resume vs. fresh w/ summary | M26 (primary) · M26 cert tip (NEW) |
| Domain 2 — Tool Design & MCP Integration (18%) | ||
| 2.1 | Tool descriptions: input formats, examples, edge cases, boundaries | M05 (primary) · M05 cert tip |
| 2.2 | Structured MCP errors: isError, errorCategory, isRetryable; access failure vs. empty result | M05 (primary) · M05, M12, M15, M23 cert tips |
| 2.3 | Tool distribution (4-5/agent cap), tool_choice: "auto"/"any"/forced | M06 (primary) · M06 cert tip |
| 2.4 | MCP server config: project (.mcp.json) vs user (~/.claude.json), ${ENV_VAR} expansion, MCP resources for catalogs | M07 (primary) · M07, M15B cert tips |
| 2.5 | Built-in tools: Read, Write, Edit, Bash, Grep, Glob — selection criteria, Edit→Read+Write fallback | M25 (primary) · M25 cert tip |
| Domain 3 — Claude Code Configuration & Workflows (20%) | ||
| 3.1 | CLAUDE.md hierarchy (user/project/directory), @import, .claude/rules/, /memory | M25 (primary) · M25 cert tip |
| 3.2 | Custom slash commands & skills: SKILL.md frontmatter (context: fork, allowed-tools, argument-hint) | M25 (primary) · M25 cert tip |
| 3.3 | Path-specific rules: YAML frontmatter glob patterns in .claude/rules/ | M25 · M25 cert tip |
| 3.4 | Plan mode vs. direct execution; Explore subagent for verbose discovery | M25 · M25 cert tip |
| 3.5 | Iterative refinement: input/output examples, TDD iteration, interview pattern | M25 · M25 cert tip |
| 3.6 | CI/CD: -p, --output-format json, --json-schema, CLAUDE.md context for CI | M25 · M25 cert tip (NEW) |
| Domain 4 — Prompt Engineering & Structured Output (20%) | ||
| 4.1 | Explicit criteria over vague instructions; severity definitions with code examples | M03 (primary) · M03 cert tip |
| 4.2 | Few-shot prompting: 2-4 examples for ambiguous scenarios | M03 (primary) · M03 cert tip |
| 4.3 | tool_use + JSON schemas; tool_choice variants; nullable fields to prevent hallucination | M04 (primary) · M04, M09 cert tips |
| 4.4 | Validation-retry loops with specific errors; detected_pattern fields for dismissal tracking | M04 · M18 · cert tips in both |
| 4.5 | Message Batches API: 50% savings, 24h window, custom_id, no multi-turn tool calling | M22 (primary) · M22 cert tip |
| 4.6 | Multi-instance & multi-pass review: separate sessions for gen vs review; per-file + cross-file passes | M18 (primary) · M18 cert tips · M10 cert tip |
| Domain 5 — Context Management & Reliability (15%) | ||
| 5.1 | Conversation context: progressive summarization risks, lost-in-the-middle, case-facts blocks, trimming verbose tool outputs | M03B · M08 · cert tips in M02, M08, M09, M22 |
| 5.2 | Escalation & ambiguity resolution: policy gaps, sentiment != complexity, multi-match clarification | M17 (primary) · M17, M26 cert tips |
| 5.3 | Error propagation: structured context (failure type, attempted query, partial results), local recovery before escalation | M14 cert tip (NEW) · M17 |
| 5.4 | Large codebase exploration: scratchpads, subagent delegation, /compact, crash recovery manifests | M11 (primary) · M11 cert tip |
| 5.5 | Human review workflows: stratified sampling, field-level confidence, accuracy by document type/field | M27B · cert tips in M17, M18, M10 |
| 5.6 | Information provenance & uncertainty: claim-source mappings, temporal data, conflict annotation, coverage gaps | M27B (deep dive) · cert tips in M09, M11 |
Per the official guide, the following are NOT tested on the cert (and you shouldn't waste study time on them): fine-tuning, API auth/billing, MCP server hosting, Constitutional AI / RLHF internals, embedding models & vector DB internals, computer use, vision, streaming, rate limits / quotas / pricing, OAuth, cloud-provider configs, prompt-caching implementation details (beyond knowing it exists), token counting algorithms.
Domains 5.5 and 5.6 jointly need their own deep dive, M27B, because they cover topics scattered across M09, M11, M17, and M18 that the cert tests as one discipline. Take M27B before any full timed practice exam.
Anti-Patterns Master Reference
The Claude Certified Architect exam tests your ability to spot bad patterns as much as good ones. (The exam is a 5-domain certification covering agentic architecture, tool design, Claude Code configuration, prompt engineering, and context/reliability. The passing score is 720/1000. Each question presents a scenario with four options: one correct and three distractors, often anti-patterns.) Every wrong answer in the exam is a real anti-pattern that developers actually ship to production. This section catalogs all 18 anti-patterns grouped by domain, each as a ❌ DON'T followed by its ✅ DO.
Before medical boards tested clinical knowledge, doctors learned from textbooks alone. But knowing the right diagnosis isn't enough — you also need to recognize the wrong diagnoses that look plausible. A patient with chest pain could have a heart attack, acid reflux, or a pulled muscle. The textbook teaches the right answer; the boards test whether you can spot the impostors. The certification exam works the same way. Each question has one correct pattern and three anti-patterns. The anti-patterns aren't random nonsense — they're approaches that seem reasonable but fail in production. Knowing WHY each anti-pattern fails is more valuable than memorizing the correct answer, because in practice you'll encounter them in existing codebases and need to recognize and fix them.
That's the analogy — now let's see what a real exam question looks like. Notice how the three wrong answers are all plausible anti-patterns, not obvious nonsense:
Domain 1 — Agentic Architecture
❌ DON'T: Parse natural language to detect loop termination ("I'm done", "task complete")
✅ DO: Check stop_reason: 'tool_use' = continue, 'end_turn' = done. Deterministic, not guesswork.
❌ DON'T: Use arbitrary iteration caps (e.g., max 10 loops) as the primary stop mechanism
✅ DO: Let the agent terminate naturally via stop_reason. Use maxTurns as a SAFETY NET only, not control flow.
❌ DON'T: Use prompt-based enforcement for critical business rules ("Never approve refunds over $500")
✅ DO: Use programmatic hooks that intercept tool calls. Hooks are deterministic; they CAN'T be bypassed by prompt injection.
❌ DON'T: Escalate to humans based on customer sentiment or anger level
✅ DO: Escalate on: policy gaps, capability limits, explicit requests, or business thresholds. Angry + simple ≠ escalate. Calm + policy gap = escalate.
❌ DON'T: Trust self-reported confidence scores for escalation decisions
✅ DO: Use structured programmatic criteria. Model confidence is not well-calibrated; it says "95% confident" about hallucinations too.
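The first three Domain 1 cards can be combined into one loop. This is a minimal sketch under stated assumptions, not the Agent SDK's implementation: `call_model` and `run_tool` are hypothetical callables standing in for `client.messages.create(...)` and your tool dispatcher, `process_refund` and `amount_usd` are made-up names, and responses are plain dicts shaped like a Messages API response (`stop_reason`, `content`).

```python
from typing import Callable, Optional

REFUND_LIMIT_USD = 500  # the business rule from the third card


def pre_tool_hook(tool_name: str, tool_input: dict) -> Optional[str]:
    """Hypothetical gate: return a block reason, or None to allow the call."""
    if tool_name == "process_refund":
        amount = tool_input.get("amount_usd")
        if not isinstance(amount, (int, float)) or isinstance(amount, bool):
            return "amount_usd is missing or not numeric"
        if amount > REFUND_LIMIT_USD:
            return f"Refunds over ${REFUND_LIMIT_USD} require human approval"
    return None


def run_agent_loop(
    call_model: Callable[[list], dict],
    run_tool: Callable[[str, dict], str],
    user_prompt: str,
    max_turns: int = 25,  # safety net only, not the normal exit
) -> dict:
    messages = [{"role": "user", "content": user_prompt}]
    for turn in range(1, max_turns + 1):
        response = call_model(messages)
        messages.append({"role": "assistant", "content": response["content"]})

        # Control flow comes from stop_reason, never from parsing the text.
        if response["stop_reason"] != "tool_use":
            status = "done" if response["stop_reason"] == "end_turn" else "stopped"
            return {"status": status, "turns": turn, "messages": messages}

        results = []
        for block in response["content"]:
            if block["type"] != "tool_use":
                continue
            result = {"type": "tool_result", "tool_use_id": block["id"]}
            reason = pre_tool_hook(block["name"], block["input"])
            if reason is not None:
                # Blocked in code: the tool never runs, whatever the prompt said.
                result.update(content=reason, is_error=True)
            else:
                result.update(content=run_tool(block["name"], block["input"]))
            results.append(result)
        messages.append({"role": "user", "content": results})

    # Hitting the cap signals a problem; report it rather than treating it as success.
    return {"status": "max_turns_exceeded", "turns": max_turns, "messages": messages}
```

The loop exits normally only when the model stops asking for tools. The turn cap and the hook sit outside the model's control, which is the property the exam's correct answers share.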
Domain 2 — Tool Design & MCP
❌ DON'T: Return generic errors: "Operation failed"
✅ DO: Return structured errors: {isError: true, errorCategory: "auth_failure", isRetryable: true, context: "Token expired"}
❌ DON'T: Return empty results for access failures: {results: []}
✅ DO: Distinguish "nothing found" from "couldn't check." {isError: true, errorCategory: "access_denied"} ≠ {results: []}
❌ DON'T: Give one agent 18+ tools; selection accuracy degrades rapidly
✅ DO: Keep 4-5 tools per agent. Distribute across specialized subagents. Tool accuracy drops significantly above 5 tools.
❌ DON'T: Hardcode API keys in .mcp.json, where they get committed to git
✅ DO: Use ${ENV_VAR} environment variable expansion. Know the config hierarchy: .mcp.json (project) vs ~/.claude.json (user).
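The first two Domain 2 cards come down to one rule: a failed lookup and an empty lookup must produce different payloads. The sketch below illustrates that rule. The field names (`isError`, `errorCategory`, `isRetryable`, `context`) come from the cards above, while `search_records`, the `backend` callable, and the `"timeout"` category are assumptions made for illustration.

```python
from typing import Callable


def tool_error(category: str, retryable: bool, context: str) -> dict:
    """Build a structured error payload with the fields named in the cards."""
    return {
        "isError": True,
        "errorCategory": category,
        "isRetryable": retryable,
        "context": context,
    }


def search_records(query: str, backend: Callable[[str], list]) -> dict:
    """Hypothetical tool handler wrapping a backend lookup."""
    try:
        rows = backend(query)
    except PermissionError as exc:
        # Couldn't check: retrying with the same credentials will not help.
        return tool_error("access_denied", False, str(exc))
    except TimeoutError as exc:
        # Couldn't check, but a retry may succeed.
        return tool_error("timeout", True, str(exc))
    # Checked successfully; an empty list genuinely means "nothing found".
    return {"isError": False, "results": rows}
```

With this shape, the calling agent can retry a timeout, escalate an access failure, and report "no records" only when the lookup actually ran.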
Domains 3-5 — Config, Prompts, Context
❌ DON'T: Put personal preferences in project-level CLAUDE.md
✅ DO: User-level (~/.claude/) for personal prefs, project CLAUDE.md for team-shared rules. Directory-level for subdirectory overrides.
❌ DON'T: Give vague instructions: "be thorough," "find all issues"
✅ DO: Provide explicit, measurable criteria: "flag functions exceeding 50 lines" not "flag long functions."
❌ DON'T: Assume tool_use guarantees semantic correctness (valid JSON ≠ correct values)
✅ DO: tool_use guarantees structure (valid JSON). Always add business rule validation after extraction. Values may still be wrong.
❌ DON'T: Use progressive summarization for critical details (names, IDs, amounts, dates get lost)
✅ DO: Use immutable "case facts" blocks at the START of context. These are never summarized. Critical data stays in high-recall position.
❌ DON'T: Use aggregate accuracy metrics only: "95% overall accuracy"
✅ DO: Track stratified per-category metrics. Invoices at 70% + receipts at 99% can average out to 95% when receipts dominate the volume, but invoices are broken. Drill down.
Anti-Patterns (static): 18 anti-patterns organized by domain. Each shows a bad practice (e.g., parsing NL for loop termination) and the correct alternative (e.g., check stop_reason). Key domains: Agentic Architecture (5), Tool Design (4), Claude Code Config (3), Prompt Engineering (3), Context & Reliability (3).
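The aggregate-versus-stratified point from the last Domains 3-5 card can be checked with a few lines of arithmetic. The document volumes below (400 invoices, 2,500 receipts) are assumed purely for illustration. They are chosen so that 70% and 99% per-category accuracy combine to 95% overall.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """records: iterable of (category, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        correct[category] += int(is_correct)
    per_category = {c: correct[c] / totals[c] for c in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return overall, per_category


# Assumed volumes: 400 invoices (280 correct), 2,500 receipts (2,475 correct).
records = (
    [("invoice", True)] * 280 + [("invoice", False)] * 120
    + [("receipt", True)] * 2475 + [("receipt", False)] * 25
)

overall, per_category = accuracy_by_category(records)
print(f"overall: {overall:.2%}")
for category, accuracy in sorted(per_category.items()):
    print(f"{category}: {accuracy:.2%}")
```

The overall figure looks healthy only because receipts outnumber invoices roughly six to one. The per-category breakdown is what exposes the broken invoice path.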
"I just need to memorize the anti-patterns." — Memorization alone won't work. The exam presents novel scenarios where you need to APPLY the principles. You might recognize anti-pattern #3 (prompt-based enforcement) in isolation, but can you spot it when it's embedded in a Scenario 5 CI/CD question about code review guardrails? The exam tests application, not recall.
"The exam only has one correct answer per question." — True, but the challenge is that two options often look correct. For example, both "use a hook" and "add a system prompt instruction" technically enforce a refund limit. The difference is that hooks are deterministic and can't be bypassed by prompt injection, while system prompt instructions are advisory. The exam rewards understanding WHY one approach is better.
"If I know the Anthropic API well, I'll pass." — API knowledge is necessary but insufficient. Roughly 40% of questions test architectural judgment (when to use multi-agent vs single-agent, when to escalate, how to handle conflicting data). These are design decisions, not API calls.
"Anti-patterns are obvious once you see them." — In a study setting, yes. Under exam pressure with a 90-minute timer, anti-patterns like returning empty results for access failures ({results: []} vs {isError: true}) look deceptively reasonable. Practice under timed conditions.
"I should aim for 100% — 720/1000 is a low bar." — The passing score is 72%, but the questions are designed so that only candidates with genuine understanding score above the threshold. The three distractors per question are calibrated to catch surface-level knowledge. A score of 750-800 represents solid mastery.
Validation-Retry Pattern
The validation-retry loop is the exam's most-tested Domain 4 pattern. In this pattern, tool_use output is validated against business rules, and if validation fails, specific error details are appended to the conversation so the model can self-correct. Retries are capped at 3 to prevent infinite loops. Here's how it works, step by step.
First, you ask Claude to extract structured data using tool_use. Claude returns valid JSON — the structure is guaranteed by the schema. But the values inside that JSON may be wrong. A CPT code field might contain "knee surgery" instead of "27447".
Second, you validate those values against your business rules. If validation fails, here's the critical part: you don't just say "try again." That's the anti-pattern. Instead, you append SPECIFIC error details — which field failed, what the expected format was, and what was actually returned. Claude then self-corrects with much higher accuracy because it knows exactly what to fix.
Third, the exam tests one more detail: after 3 failed retries on the same field, stop retrying and flag for human review. This prevents infinite loops. It also catches cases where the model genuinely can't extract the data — for example, when the source document simply doesn't contain the required information.
tool_choice is a separate but related exam topic. It controls whether and which tools Claude uses. There are three settings, and the exam tests when to use each one. 'auto' is the default: Claude decides on its own whether to use a tool. 'any' forces Claude to use a tool, but lets it pick which one. Finally, forced-specific ({"type": "tool", "name": "extract_filing"}) requires a specific tool; use this when you KNOW extraction is needed and don't want Claude to skip it.
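The three settings map to three small dictionaries passed as the `tool_choice` argument. The helper below is a hypothetical convenience function, not part of any SDK. It only illustrates which value fits which situation.

```python
from typing import Optional


def pick_tool_choice(must_call_tool: bool, required_tool: Optional[str] = None) -> dict:
    """Return a tool_choice value for the situation described by the arguments.

    Hypothetical helper for illustration; only the returned dict shapes
    are what the API consumes.
    """
    if required_tool is not None:
        # Forced-specific: the model must call exactly this tool.
        return {"type": "tool", "name": required_tool}
    if must_call_tool:
        # 'any': the model must call some tool, but picks which one.
        return {"type": "any"}
    # 'auto': the model decides whether to call a tool at all.
    return {"type": "auto"}


print(pick_tool_choice(False))
print(pick_tool_choice(True))
print(pick_tool_choice(True, required_tool="extract_filing"))
```

A guaranteed extraction step is the typical case for the forced-specific form, since 'auto' leaves the model free to answer in plain text instead.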
Validation-Retry (static): Claude extracts data via tool_use → Validator checks business rules → If invalid, specific error details are appended → Claude retries with better context → After 3 failures on same field, flag for human review.
"If tool_use returns valid JSON, the data is correct." — No. tool_use guarantees structural correctness (valid JSON matching your schema), but NOT semantic correctness. A CPT code field containing "knee surgery" is valid JSON — it's a string in the right field — but it's not a valid CPT code. Always validate business rules after extraction.
"Generic retry messages work fine — Claude is smart enough." — In testing, generic "try again" messages produce roughly 30-40% self-correction rates. Specific field-level feedback ("field 'cpt_code': expected 5-digit numeric, got 'knee surgery'") pushes self-correction above 80%. The specificity matters enormously.
"Just retry indefinitely until it gets it right." — Some extraction failures are not self-correctable. If the source document says "the patient underwent knee surgery" but never mentions CPT code 27447, no amount of retrying will produce the code. The 3-retry-per-field cap catches these data-quality issues and routes them to humans instead of looping forever.
Information Provenance
Before investigative journalists had editorial standards for sourcing, a reporter could publish any claim without attribution. "Sources say the company is bankrupt" told the reader nothing — was this the CEO, a disgruntled intern, or a rumor on social media? Modern journalism requires provenance: name the source, describe their credibility, note when they said it, and flag if other sources disagree. Multi-agent AI systems face exactly the same challenge. When your coordinator agent receives results from three subagents — one saying "Acme Corp has a lien," another saying "no liens found," and a third saying "data unavailable" — the coordinator needs provenance metadata to decide which claim to trust. Without it, the agent picks arbitrarily, and you can't audit why.
That's the analogy. Now let's see what provenance actually looks like in your code. Here's the metadata object that travels alongside every single claim in a multi-agent pipeline:
Information provenance tracking means attaching metadata to every single claim that flows through your system. The metadata records the origin, reliability, and freshness of each claim, including which agent, tool, or API provided the data, its confidence level, its timestamp, and how to characterize the source. Think of it as a chain-of-custody label. Each label has four fields:
- Source — which agent, tool, or API produced this data. Was it a direct database query, or did a third-party aggregator provide it?
- Confidence — how reliable is that source? An official state filing database is "high." A third-party aggregator is "medium." User-submitted data is "low."
- Timestamp — when was the data retrieved? This matters more than you might think. UCC filings can update daily, so data from last week might already be stale.
- Agent ID — which subagent in the pipeline produced this claim? You need this for audit tracing when something goes wrong.
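The four fields above fit naturally into a small immutable record. This is one possible shape, offered as a sketch: the class names, the `is_stale` helper, the one-day freshness window, and the sample values are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Provenance:
    source: str             # which tool/API/database produced the data
    confidence: str         # "high" | "medium" | "low"
    retrieved_at: datetime  # when the data was fetched
    agent_id: str           # which subagent produced the claim

    def is_stale(self, now: datetime, max_age: timedelta = timedelta(days=1)) -> bool:
        """True when the data is older than the allowed freshness window."""
        return now - self.retrieved_at > max_age


@dataclass(frozen=True)
class Claim:
    statement: str
    provenance: Provenance


claim = Claim(
    statement="Acme Corp has an active lien",
    provenance=Provenance(
        source="state_filing_db",
        confidence="high",
        retrieved_at=datetime(2025, 1, 15, 9, 30, tzinfo=timezone.utc),
        agent_id="filing-search-agent",
    ),
)
```

Freezing the dataclasses keeps downstream agents from silently editing a label after the fact, which matters when the record doubles as an audit trail.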
The exam tests this in Scenario 3 (Multi-Agent Research). When two subagents return conflicting entity data, the coordinator must use provenance to resolve the conflict. The resolution follows a priority chain. First, trust the official source over the aggregator. Second, if sources have equal reliability, trust fresher data over stale data. Third, if the conflict still can't be resolved, flag the claim as "contested" and route it to human review.
Claims are categorized as: "Well-established" (multiple sources agree), "Contested" (sources disagree — flag for verification), or "Single-source" (only one source, note the risk).
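The priority chain and the three categories can be sketched as two small functions. This is an illustrative sketch, not the course's reference resolver: claims are plain dicts, and the key names (`value`, `confidence`, `retrieved_at`) and the `decided_by` labels are assumptions.

```python
from datetime import datetime

# Hardcoded reliability ranking: official > aggregator > user-submitted.
RELIABILITY = {"high": 3, "medium": 2, "low": 1}


def categorize(claims: list) -> str:
    """Label a group of claims about the same fact."""
    if len(claims) == 1:
        return "single_source"
    distinct_values = {c["value"] for c in claims}
    return "well_established" if len(distinct_values) == 1 else "contested"


def resolve(a: dict, b: dict) -> dict:
    """Apply the three-step priority chain to two claims about one fact."""
    if a["value"] == b["value"]:
        return {"value": a["value"], "decided_by": "agreement"}
    # Step 1: the more reliable source wins.
    rank_a, rank_b = RELIABILITY[a["confidence"]], RELIABILITY[b["confidence"]]
    if rank_a != rank_b:
        winner = a if rank_a > rank_b else b
        return {"value": winner["value"], "decided_by": "reliability"}
    # Step 2: equal reliability, so the fresher data wins.
    if a["retrieved_at"] != b["retrieved_at"]:
        winner = a if a["retrieved_at"] > b["retrieved_at"] else b
        return {"value": winner["value"], "decided_by": "freshness"}
    # Step 3: still unresolved, so flag it and hand off to a human.
    return {"value": None, "decided_by": "human_review", "status": "contested"}


official = {"value": "lien", "confidence": "high", "retrieved_at": datetime(2025, 1, 10)}
aggregator = {"value": "no lien", "confidence": "medium", "retrieved_at": datetime(2025, 1, 14)}
print(categorize([official, aggregator]))
print(resolve(official, aggregator))
```

Note that the aggregator's data is fresher here, yet the official source still wins, because freshness is only consulted when reliability is tied.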
If you've worked with traditional data pipelines, you might compare provenance to data lineage in ETL systems — tracking where a row came from and which transformations touched it. The difference is that provenance in multi-agent systems must also track reliability. In an ETL pipeline, every transformation is deterministic — the same input always produces the same output. But when a subagent interprets an unstructured document, confidence varies. A filing number extracted from an official state database is high-confidence. The same filing number extracted by a third-party aggregator that scrapes PDFs is medium-confidence at best. Provenance captures that distinction so the coordinator can make informed decisions rather than treating all sources as equal.
"Provenance is just logging." — Logging records what happened after the fact. Provenance is used at decision time — the coordinator reads provenance metadata to decide which conflicting claim to trust. Logs help you debug later. Provenance helps the agent decide now.
"The coordinator should ask Claude to judge which source is more reliable." — This delegates a deterministic decision to a non-deterministic model. Source reliability ranking (official DB > aggregator > user-submitted) should be hardcoded in your resolution logic, not left to Claude's judgment. The exam specifically penalizes this approach.
"If sources agree, provenance doesn't matter." — Even when sources agree, provenance tells you how much to trust the consensus. Three low-confidence sources agreeing is still lower confidence than one high-confidence official source. And you still need the audit trail for compliance.
Exam Scenario Walkthroughs
The certification exam presents 6 possible scenarios. Each scenario describes a real-world system in 2-3 sentences, then asks you to make an architectural decision. The scenarios aren't hypothetical puzzles — they mirror the actual systems that Claude developers build in production.
Think of scenarios as the exam's version of case studies. In each scenario, you're given a specific system (like a customer support agent or a CI/CD pipeline) and asked to choose the correct architecture, configuration, or error-handling approach. The scenario context stays the same across multiple questions — so you might see 3-4 questions all set in "Scenario 1: Customer Support." Each question tests a different domain concept using that same system as the backdrop.
What makes these scenarios tricky is that they test multiple domains at once. Take a customer support scenario (S1) as an example. One question might test your knowledge of escalation triggers (Domain 1). The next might test hook-based enforcement (Domain 1). A third might test structured error handling (Domain 2). All three questions use the same scenario context, but each targets a different concept. You need to recognize which domain concept applies to the specific decision being asked about.
Here are all 6 scenarios with their correct architecture, key decisions, and the anti-patterns you'll see as distractors. Pay attention to the "Key" note on each diagram — that's the decision the exam is actually testing.
6 Scenarios (static): S1 = Support Agent (hooks + escalation), S2 = Code Gen (plan mode + CLAUDE.md), S3 = Multi-Agent (provenance + context isolation), S4 = Dev Productivity (Agent SDK + MCP), S5 = CI/CD (non-interactive + session isolation), S6 = Data Extraction (tool_use + validation-retry).
Code Walkthrough
Validation-Retry Loop Implementation
What You'll Build: A validation-retry loop that extracts medical data from documents and self-corrects extraction errors with specific feedback.
Time Estimate: 20-30 minutes
Prerequisites: Python 3.10+ or Node.js 18+, Anthropic API key
Files You'll Create: validation_retry.py (or validation_retry.mjs)
mkdir cert-prep-lab && cd cert-prep-lab
python -m venv venv && source venv/bin/activate # Windows: venv\Scripts\activate
pip install anthropic
export ANTHROPIC_API_KEY=your-key-here # Windows: set ANTHROPIC_API_KEY=your-key-here
The validation-retry pattern is the most code-tested pattern on the exam. Let's walk through the complete implementation — not just what the code does, but why each decision was made the way it was.
Let's start with the validator. The validate_extraction function is where your domain knowledge lives. It checks each extracted field against business rules — is this CPT code actually 5 digits? Is this ICD-10 code in the right format? Notice that we're not just checking types (the JSON schema already guarantees that). We're checking semantic correctness — whether the values make sense in the medical domain.
Next comes the interesting part: the retry loop itself. The extract_with_retry function manages the conversation with Claude. Here's the dilemma it solves: when Claude gets a field wrong, you COULD just say "try again." But that's like telling a student "wrong answer" without explaining what was wrong. Instead, the function appends SPECIFIC error details — which field, what format was expected, and what Claude actually returned. This small change dramatically improves self-correction accuracy.
Finally, there's a subtle but critical feature: the per-field failure counter. If the same field fails validation 3 times, the problem isn't Claude being careless. More likely, the source document doesn't contain the information in a recognizable form. At that point, more retries won't help. Flag it for a human.
import anthropic
import json
import re
from dataclasses import dataclass


@dataclass
class ValidationError:
    field: str
    expected: str
    actual: str


def validate_extraction(data: dict) -> list[ValidationError]:
    """Validate extracted data against business rules.
    Returns list of validation errors (empty = valid)."""
    errors = []

    # CPT codes must be exactly 5 digits
    cpt = data.get("cpt_code", "")
    if not (cpt.isdigit() and len(cpt) == 5):
        errors.append(ValidationError(
            field="cpt_code",
            expected="5-digit numeric code (e.g., '27447')",
            actual=repr(cpt),
        ))

    # ICD-10 codes: letter + 2 digits + optional decimal part
    icd = data.get("icd10_code", "")
    if not re.match(r'^[A-Z]\d{2}(\.\d{1,4})?$', icd):
        errors.append(ValidationError(
            field="icd10_code",
            expected="ICD-10 format: letter + 2 digits + optional decimal (e.g., 'M17.11')",
            actual=repr(icd),
        ))

    # Units must be positive integer
    units = data.get("units")
    if not isinstance(units, int) or units < 1:
        errors.append(ValidationError(
            field="units",
            expected="positive integer >= 1",
            actual=repr(units),
        ))

    return errors
def extract_with_retry(
    client: anthropic.Anthropic,
    document: str,
    max_retries: int = 3,
) -> dict:
    """Extract structured data with validation-retry loop.

    Key pattern: on validation failure, append SPECIFIC error
    details, not a generic "try again". This gives Claude the
    information it needs to self-correct.
    """
    # Define the extraction tool; Claude must return this schema
    tools = [{
        "name": "extract_medical_data",
        "description": "Extract structured medical data from a document",
        "input_schema": {
            "type": "object",
            "properties": {
                "cpt_code": {
                    "type": "string",
                    "description": "5-digit CPT procedure code"
                },
                "icd10_code": {
                    "type": "string",
                    "description": "ICD-10 diagnosis code (e.g., M17.11)"
                },
                "units": {
                    "type": "integer",
                    "description": "Number of procedure units"
                },
            },
            "required": ["cpt_code", "icd10_code", "units"],
        },
    }]

    messages = [{
        "role": "user",
        "content": f"Extract medical data from this document:\n\n{document}",
    }]

    # Track per-field failure counts: if the same field fails 3 times,
    # it's not a self-correction issue, it's a data issue.
    field_failures: dict[str, int] = {}
    extracted = None  # stays None only if no tool call was ever returned

    for attempt in range(max_retries + 1):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            tools=tools,
            # Force the specific tool on first attempt
            tool_choice=(
                {"type": "tool", "name": "extract_medical_data"}
                if attempt == 0
                else {"type": "auto"}
            ),
            messages=messages,
        )

        # Find the tool_use block in the response
        tool_use = next(
            (b for b in response.content if b.type == "tool_use"),
            None,
        )
        if not tool_use:
            break  # Claude decided not to use the tool

        extracted = tool_use.input
        errors = validate_extraction(extracted)
        if not errors:
            return {"success": True, "data": extracted, "attempts": attempt + 1}

        # Check for stuck fields: the same field failing repeatedly
        for err in errors:
            field_failures[err.field] = field_failures.get(err.field, 0) + 1
            if field_failures[err.field] >= 3:
                return {
                    "success": False,
                    "data": extracted,
                    "reason": f"Field '{err.field}' failed 3 times; flagging for human review",
                    "attempts": attempt + 1,
                }

        if attempt < max_retries:
            # THE KEY PATTERN: append SPECIFIC errors, not "try again"
            error_details = "\n".join(
                f"- Field '{e.field}': expected {e.expected}, got {e.actual}"
                for e in errors
            )
            # Add the tool result + error feedback to conversation
            messages.append({"role": "assistant", "content": response.content})
            messages.append({
                "role": "user",
                "content": [{
                    "type": "tool_result",
                    "tool_use_id": tool_use.id,
                    "content": f"Validation failed:\n{error_details}\n\nPlease correct these specific fields.",
                    "is_error": True,
                }],
            })

    return {"success": False, "data": extracted, "reason": "Max retries exceeded", "attempts": max_retries + 1}
import Anthropic from '@anthropic-ai/sdk';

function validateExtraction(data) {
  // Validate extracted data against business rules.
  const errors = [];

  // CPT codes: exactly 5 digits
  const cpt = data.cpt_code || '';
  if (!/^\d{5}$/.test(cpt)) {
    errors.push({
      field: 'cpt_code',
      expected: "5-digit numeric code (e.g., '27447')",
      actual: JSON.stringify(cpt),
    });
  }

  // ICD-10: letter + 2 digits + optional decimal part
  const icd = data.icd10_code || '';
  if (!/^[A-Z]\d{2}(\.\d{1,4})?$/.test(icd)) {
    errors.push({
      field: 'icd10_code',
      expected: "ICD-10 format: letter + 2 digits + optional decimal",
      actual: JSON.stringify(icd),
    });
  }

  // Units: positive integer
  if (!Number.isInteger(data.units) || data.units < 1) {
    errors.push({
      field: 'units',
      expected: 'positive integer >= 1',
      actual: JSON.stringify(data.units),
    });
  }

  return errors;
}
async function extractWithRetry(client, document, maxRetries = 3) {
  // Extraction tool; Claude must return this schema
  const tools = [{
    name: 'extract_medical_data',
    description: 'Extract structured medical data from a document',
    input_schema: {
      type: 'object',
      properties: {
        cpt_code: { type: 'string', description: '5-digit CPT procedure code' },
        icd10_code: { type: 'string', description: 'ICD-10 diagnosis code' },
        units: { type: 'integer', description: 'Number of procedure units' },
      },
      required: ['cpt_code', 'icd10_code', 'units'],
    },
  }];

  const messages = [{
    role: 'user',
    content: `Extract medical data from this document:\n\n${document}`,
  }];

  const fieldFailures = {}; // Track per-field failures

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 1024,
      tools,
      // Force specific tool on first attempt
      tool_choice: attempt === 0
        ? { type: 'tool', name: 'extract_medical_data' }
        : { type: 'auto' },
      messages,
    });

    const toolUse = response.content.find((b) => b.type === 'tool_use');
    if (!toolUse) break;

    const extracted = toolUse.input;
    const errors = validateExtraction(extracted);
    if (errors.length === 0) {
      return { success: true, data: extracted, attempts: attempt + 1 };
    }

    // Check for stuck fields
    for (const err of errors) {
      fieldFailures[err.field] = (fieldFailures[err.field] || 0) + 1;
      if (fieldFailures[err.field] >= 3) {
        return {
          success: false,
          data: extracted,
          reason: `Field '${err.field}' failed 3 times; flagging for human review`,
          attempts: attempt + 1,
        };
      }
    }

    if (attempt < maxRetries) {
      // KEY PATTERN: append SPECIFIC errors, not "try again"
      const errorDetails = errors
        .map((e) => `- Field '${e.field}': expected ${e.expected}, got ${e.actual}`)
        .join('\n');
      messages.push({ role: 'assistant', content: response.content });
      messages.push({
        role: 'user',
        content: [{
          type: 'tool_result',
          tool_use_id: toolUse.id,
          content: `Validation failed:\n${errorDetails}\n\nPlease correct these specific fields.`,
          is_error: true,
        }],
      });
    }
  }

  return { success: false, reason: 'Max retries exceeded', attempts: maxRetries + 1 };
}
You built a validation-retry loop with three critical features: (1) tool_choice forces the extraction tool on the first attempt, (2) validation errors include SPECIFIC field-level details (not "try again"), and (3) per-field failure tracking flags stuck fields for human review after 3 failures. This is the exam's most-tested code pattern — memorize the three components.
The exam tests validation-retry as a two-part concept. Domain 4.3 tests that tool_use guarantees structure but NOT semantic correctness. Domain 4.4 tests that retry feedback must be specific: which field failed, the expected format, and the actual value. Generic "there were errors, please try again" is the #1 distractor for Domain 4 questions.
Run Command
Save the code above as validation_retry.py, then add this test block at the bottom of the file and run it:
# Add to bottom of validation_retry.py:
if __name__ == "__main__":
    client = anthropic.Anthropic()
    test_doc = """
    Patient: Jane Smith, DOB 1965-03-15
    Procedure: Total knee replacement (right knee)
    Diagnosis: Primary osteoarthritis, right knee
    Units: 1
    """
    result = extract_with_retry(client, test_doc)
    print(json.dumps(result, indent=2))
Run: python validation_retry.py
If you see "success": true with a valid 5-digit CPT code and ICD-10 code, the validation-retry loop is working. The "attempts" field tells you how many tries Claude needed — 1 means it got it right first time.
If you see AuthenticationError: Check that ANTHROPIC_API_KEY is set in your environment. Run echo $ANTHROPIC_API_KEY (or echo %ANTHROPIC_API_KEY% on Windows) to verify.
If you see ModuleNotFoundError: No module named 'anthropic': Run pip install anthropic and make sure your virtual environment is activated.
If "attempts" is greater than 1: That's actually fine — it means the validation-retry loop caught an error and self-corrected. Check the terminal output for the specific field errors Claude fixed.
Provenance Tracker
The provenance tracker looks simple at first — it's just a dictionary of claims with metadata. But the real intelligence is in the resolve() method. Here's the problem it solves: your coordinator agent just received results from three subagents, and two of them disagree about whether Acme Corp has an active lien. What do you do?
The resolver applies a three-step priority chain. First, it checks source reliability: an official state filing database (high confidence) beats a third-party aggregator (medium confidence) every time. Second, if both sources have equal reliability, it checks timestamps: fresher data wins because UCC filings can change daily. Third, whenever sources disagree, the result is marked "contested" and carries a human-review warning; the top-ranked claim is returned as the provisional answer and the losing claims are listed as alternatives. This is exactly the resolution logic the exam tests in Scenario 3.
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class Confidence(Enum):
    HIGH = "high"      # Official source (state filing DB)
    MEDIUM = "medium"  # Third-party aggregator
    LOW = "low"        # User-submitted / unverified


class ClaimStatus(Enum):
    WELL_ESTABLISHED = "well_established"  # Multiple sources agree
    CONTESTED = "contested"                # Sources disagree
    SINGLE_SOURCE = "single_source"        # Only one source


@dataclass
class ProvenanceRecord:
    """Track the origin and reliability of each claim.

    The exam tests this in Scenario 3: when subagents return
    conflicting data, provenance lets the coordinator decide
    which claim to trust.
    """
    claim: str              # The actual data/assertion
    source: str             # Which agent/tool/API
    confidence: Confidence  # Source reliability
    timestamp: datetime     # When data was retrieved
    agent_id: str           # Which subagent in pipeline
    source_type: str = ""   # "official_filing" vs "aggregator"


@dataclass
class ProvenanceTracker:
    """Aggregate claims from multiple subagents and resolve conflicts."""
    claims: dict[str, list[ProvenanceRecord]] = field(default_factory=dict)

    def add_claim(self, topic: str, record: ProvenanceRecord):
        """Add a claim from a subagent."""
        if topic not in self.claims:
            self.claims[topic] = []
        self.claims[topic].append(record)

    def resolve(self, topic: str) -> dict:
        """Resolve conflicting claims using provenance rules:

        1. Official sources trump aggregators
        2. Fresher data trumps stale data
        3. If still contested, flag for human review
        """
        records = self.claims.get(topic, [])
        if not records:
            return {"status": "no_data", "claim": None}
        if len(records) == 1:
            r = records[0]
            return {
                "status": ClaimStatus.SINGLE_SOURCE.value,
                "claim": r.claim,
                "source": r.source,
                "confidence": r.confidence.value,
                "warning": "Single source — verify independently",
            }
        # Check for agreement
        unique_claims = set(r.claim for r in records)
        if len(unique_claims) == 1:
            confidence_rank = {"low": 0, "medium": 1, "high": 2}
            best = max(records, key=lambda r: confidence_rank[r.confidence.value])
            return {
                "status": ClaimStatus.WELL_ESTABLISHED.value,
                "claim": best.claim,
                "sources": [r.source for r in records],
                "confidence": "high",
            }
        # Conflict: resolve by source reliability, then freshness.
        # Sort descending: highest confidence first, then most recent timestamp.
        ranked = sorted(
            records,
            key=lambda r: (
                ["low", "medium", "high"].index(r.confidence.value),
                r.timestamp,
            ),
            reverse=True,
        )
        best = ranked[0]
        return {
            "status": ClaimStatus.CONTESTED.value,
            "claim": best.claim,
            "source": best.source,
            "confidence": best.confidence.value,
            "alternatives": [
                {"claim": r.claim, "source": r.source}
                for r in ranked[1:]
            ],
            "warning": "Sources disagree — flagged for human review",
        }
const Confidence = { HIGH: 'high', MEDIUM: 'medium', LOW: 'low' };
const ClaimStatus = {
  WELL_ESTABLISHED: 'well_established',
  CONTESTED: 'contested',
  SINGLE_SOURCE: 'single_source',
};
const CONFIDENCE_RANK = { low: 0, medium: 1, high: 2 };

class ProvenanceTracker {
  // Aggregate claims from subagents and resolve conflicts.
  // Exam Scenario 3 tests this exact pattern.
  constructor() {
    this.claims = new Map();
  }

  addClaim(topic, record) {
    if (!this.claims.has(topic)) this.claims.set(topic, []);
    this.claims.get(topic).push({
      ...record,
      timestamp: record.timestamp || new Date(),
    });
  }

  resolve(topic) {
    const records = this.claims.get(topic) || [];
    if (records.length === 0) return { status: 'no_data', claim: null };
    if (records.length === 1) {
      return {
        status: ClaimStatus.SINGLE_SOURCE,
        claim: records[0].claim,
        source: records[0].source,
        confidence: records[0].confidence,
        warning: 'Single source — verify independently',
      };
    }
    // Check agreement
    const uniqueClaims = new Set(records.map((r) => r.claim));
    if (uniqueClaims.size === 1) {
      return {
        status: ClaimStatus.WELL_ESTABLISHED,
        claim: records[0].claim,
        sources: records.map((r) => r.source),
        confidence: 'high',
      };
    }
    // Conflict: rank by confidence, then freshness
    const ranked = [...records].sort((a, b) => {
      const confDiff = CONFIDENCE_RANK[b.confidence] - CONFIDENCE_RANK[a.confidence];
      if (confDiff !== 0) return confDiff;
      return b.timestamp - a.timestamp;
    });
    return {
      status: ClaimStatus.CONTESTED,
      claim: ranked[0].claim,
      source: ranked[0].source,
      confidence: ranked[0].confidence,
      alternatives: ranked.slice(1).map((r) => ({
        claim: r.claim, source: r.source,
      })),
      warning: 'Sources disagree — flagged for human review',
    };
  }
}
You built a provenance tracker that resolves conflicting multi-agent claims. The key resolution logic: official sources outrank aggregators, fresher data outranks stale data, and "contested" claims are flagged for human review. Each claim carries four metadata fields (source, confidence, timestamp, agent_id). This is the Scenario 3 pattern the exam tests.
Scenario 3 (Multi-Agent Research) tests two concepts together: context isolation (Domain 1.3 — subagents do NOT inherit the coordinator's full conversation) and provenance-based conflict resolution (Domain 1.2 — the coordinator must use structured metadata, not ask Claude to judge source quality). If you see a multi-agent question, check whether the answer respects context isolation AND uses deterministic conflict resolution.
Run Command
Save the provenance tracker as provenance.py, then add this test block and run it:
# Add to bottom of provenance.py:
if __name__ == "__main__":
    tracker = ProvenanceTracker()

    # Subagent A: official state filing database
    tracker.add_claim("acme_lien", ProvenanceRecord(
        claim="Acme Corp has an active UCC lien — Filing #2024-NY-0047821",
        source="ny_state_filing_db",
        confidence=Confidence.HIGH,
        timestamp=datetime(2026, 4, 2, 14, 30),
        agent_id="subagent-filings-ny",
        source_type="official_filing",
    ))

    # Subagent B: third-party aggregator disagrees
    tracker.add_claim("acme_lien", ProvenanceRecord(
        claim="No active liens found for Acme Corp",
        source="third_party_aggregator",
        confidence=Confidence.MEDIUM,
        timestamp=datetime(2026, 4, 1, 9, 0),
        agent_id="subagent-aggregator",
        source_type="aggregator",
    ))

    import json
    result = tracker.resolve("acme_lien")
    print(json.dumps(result, indent=2, default=str))
Run: python provenance.py
If you see "status": "contested" with the official filing winning over the aggregator, provenance resolution is working correctly. The official source (high confidence) outranked the aggregator (medium confidence). Both sources appear in the output for audit transparency.
If you see NameError: name 'datetime' is not defined: Add from datetime import datetime at the top of provenance.py.
If the wrong claim wins: Check that confidence levels are correct — Confidence.HIGH for the official source, Confidence.MEDIUM for the aggregator. The resolver ranks by confidence first, then by timestamp.
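If the ranking order is still unclear, the following self-contained sketch isolates the "confidence first, then timestamp" sort from the rest of the tracker. The record dictionaries, source names, and the `RANK` table are hypothetical stand-ins for `ProvenanceRecord` instances.

```python
from datetime import datetime

# Hypothetical rank table mirroring the low/medium/high ordering above.
RANK = {"low": 0, "medium": 1, "high": 2}

# Illustrative records: two equally reliable sources plus a newer,
# less reliable one.
records = [
    {"source": "aggregator_a", "confidence": "medium",
     "timestamp": datetime(2026, 4, 1, 9, 0)},
    {"source": "aggregator_b", "confidence": "medium",
     "timestamp": datetime(2026, 4, 2, 9, 0)},
    {"source": "user_upload", "confidence": "low",
     "timestamp": datetime(2026, 4, 3, 9, 0)},
]

# Tuples compare element by element, so confidence decides first and the
# timestamp only breaks ties. reverse=True puts the best record at index 0.
ranked = sorted(
    records,
    key=lambda r: (RANK[r["confidence"]], r["timestamp"]),
    reverse=True,
)
order = [r["source"] for r in ranked]
print(order)  # ['aggregator_b', 'aggregator_a', 'user_upload']
```

Note that `user_upload` has the newest timestamp but still ranks last: freshness never overrides reliability, it only separates sources at the same confidence level.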
Final Verification — Both Components
You've now built both the validation-retry loop and the provenance tracker. These are the two most-tested code patterns on the certification exam. To verify both work end-to-end, run validation_retry.py and then provenance.py.
If both scripts run without errors, you've implemented the two core exam code patterns. The validation-retry loop handles Scenario 6 (Data Extraction) questions. The provenance tracker handles Scenario 3 (Multi-Agent Research) questions. Together they cover roughly 30% of the exam's code-specific questions.
Mock Exam Practice
What you're practicing: 8 mock exam questions in the exact certification format — scenario context, question stem, and 4 options (1 correct, 3 anti-pattern distractors).
Time target: Aim for 2 minutes per question (16 minutes total). On exam day, you'll have roughly 90 seconds per question, so this is generous practice time.
How to use: Read each scenario carefully. Before clicking an answer, identify which domain the question is testing (D1-D5) and which anti-patterns appear as distractors. Then select your answer. Detailed feedback appears immediately.
Self-assessment: After completing all 8 questions, check your domain score breakdown below. A domain with 0% or 50% accuracy means you should revisit that domain's module before exam day. Your target: 75%+ in every domain, not just 72% overall.
Read carefully — the exam rewards recognizing anti-patterns as much as knowing correct patterns. For each question, try to identify the anti-pattern number for each wrong answer before selecting.
Domain Scores: As you answer quiz questions below, your scores per domain are tracked and displayed as a bar chart showing strengths and weaknesses across all 5 certification domains.
Knowledge Check — Mock Exam Questions
1. [Scenario 1] A customer support agent processes refund requests. The system must enforce a $500 refund limit. Which approach is correct?
process_refund tool call and blocks any amount over $500
maxTurns: 1 so the agent can only make one refund call per conversation
2. [Scenario 4] Your agent's search tool queries a database but encounters an authentication failure. What should the tool return?
{"results": []} — return empty results so the agent can move on
"Operation failed" — a simple error string
{"isError": true, "errorCategory": "auth_failure", "isRetryable": true, "context": "Token expired"}
3. [Scenario 6] Claude extracts a CPT code as "knee surgery" instead of the 5-digit code "27447". Your validation-retry loop should:
max_tokens to give Claude more room to think
4. [Scenario 3] Two subagents return conflicting data about a company's lien status. Subagent A (querying the state filing database) says "active lien." Subagent B (querying a third-party aggregator) says "no liens." How should the coordinator resolve this?
5. [Scenario 5] Your CI/CD pipeline uses Claude Code to generate code and then review it for security issues. What is the correct session configuration?
fork_session to branch the generation session for review — preserves context while allowing divergence
-p command with "generate and then review" as the prompt
6. [Scenario 1] A customer is angry and using profanity, but their request is straightforward: "Where the **** is my package?!" The agent should:
7. Your document extraction system reports "95% overall accuracy." Upon investigation, you find invoices are at 70% accuracy while receipts are at 99%. The correct response is:
8. [Scenario 6] You need Claude to extract data from a document and MUST use the extraction tool (not free-text response). The correct tool_choice is:
tool_choice: "auto" — let Claude decide whether to use the tool
tool_choice: "any" — force Claude to use a tool, but let it pick which one
tool_choice: {"type": "tool", "name": "extract_data"} — force the specific extraction tool
{"type": "tool", "name": "extract_data"}. "auto" (A) means Claude might skip the tool entirely. "any" (B) forces a tool but Claude might pick the wrong one. Prompt-based enforcement (D) is anti-pattern #3: Claude might still respond in free text.
{"type": "tool", "name": "extract_data"} for deterministic behavior.
Your Score
Next steps based on your score:
- 8/8: You're exam-ready. Review the anti-patterns one more time on exam morning, then go.
- 6-7/8: Above passing threshold. Check which domain(s) you missed — revisit that module and re-read the anti-pattern cards above.
- 4-5/8: At or near threshold. Re-read the Validation-Retry and Provenance sections, then retake this quiz after 24 hours.
- Below 4/8: Go back to M25-M26, then re-read this module's anti-pattern cards and scenario walkthroughs before retaking.
Summary & Exam Day Tips
You've completed the entire course — from your first API call in M01 to certification-level mastery here in M27. Here's your final exam preparation checklist:
- Anti-Patterns: Know all 18. The exam's wrong answers ARE anti-patterns. Recognizing them is half the battle.
- stop_reason: The single most important technical detail. 'tool_use' means the agent loop should continue. 'end_turn' means done. Never parse natural language to detect loop termination.
- Hooks vs Prompts: Critical business rules use hooks because they're deterministic and can't be bypassed. System prompt instructions are advisory only. The exam treats this distinction as a bright line.
- Structured Errors: Always return {isError, errorCategory, isRetryable, context}. Never return a generic "Operation failed" string. Claude needs the error details to decide whether to retry, try an alternative, or escalate.
- Validation-Retry: When extraction fails, append specific field-level errors, not "try again." Cap retries at 3 per field, then flag for human review.
- Provenance: Every claim in a multi-agent system needs four metadata fields: source, confidence, timestamp, and agent_id. Without these, you can't audit conflict resolution.
- Session Isolation: Generate and review in SEPARATE sessions. Same-session review creates confirmation bias because the reviewer retains the generator's reasoning context.
- Stratified Metrics: Never report aggregate-only accuracy. Always drill down per document type or category. A 95% average can hide a 70% failure rate in one category.
- Escalation: Escalate on policy gaps and capability limits. Never escalate based on sentiment or self-reported confidence scores.
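The stratified-metrics item in the checklist above can be checked with a little arithmetic. This sketch uses made-up document counts (400 invoices, 2,500 receipts) chosen so the aggregate comes out at 95%; the function name `stratified_accuracy` and the sample data are assumptions for illustration.

```python
from collections import defaultdict


def stratified_accuracy(results):
    """Return (overall accuracy, accuracy per document type).

    `results` is an iterable of (doc_type, is_correct) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # doc_type -> [correct, total]
    for doc_type, is_correct in results:
        totals[doc_type][0] += int(is_correct)
        totals[doc_type][1] += 1
    per_type = {t: correct / n for t, (correct, n) in totals.items()}
    all_correct = sum(correct for correct, _ in totals.values())
    all_docs = sum(n for _, n in totals.values())
    return all_correct / all_docs, per_type


# Hypothetical evaluation set: receipts outnumber invoices, so the strong
# receipt score dominates the aggregate.
results = (
    [("invoice", True)] * 280 + [("invoice", False)] * 120
    + [("receipt", True)] * 2475 + [("receipt", False)] * 25
)

overall, per_type = stratified_accuracy(results)
print(f"overall: {overall:.0%}")  # overall: 95%
for doc_type, acc in per_type.items():
    print(f"{doc_type}: {acc:.0%}")  # invoice: 70%, receipt: 99%
```

With these counts, 2,755 of 2,900 documents are correct (95%), yet 120 of 400 invoices are wrong. Reporting only the aggregate would hide that nearly one invoice in three fails.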
Passing score: 720/1000. The exam favors hub-and-spoke architecture, programmatic enforcement, and structured data patterns. Trust what you've learned across M01-M27. Good luck! 🎓