M27B: Cert Domain 5.5 + 5.6 Deep Dive — Human Review, Confidence, Provenance, Temporal Data, Synthesis
Per the official Anthropic cert exam guide, Domain 5.5 covers human review workflows and confidence calibration (stratified sampling, field-level confidence, accuracy by document type/field), and Domain 5.6 covers information provenance and uncertainty handling (claim-source mappings, temporal data, conflict annotation, coverage gaps). The cert tests these as one discipline. This module bundles them: stratified human review, field-level confidence, claim provenance, temporal as-of reasoning, and synthesis output for well-established vs. contested claims. Take it before any practice exam.
Learning Objectives
- Design output schemas that enforce claim → source mappings, with explicit well-established / contested / single-source / temporal-warning categories
- Store and query temporal facts with {value, valid_from, valid_to, source}, distinguishing "current" from "as-of" queries
- Apply stratified sampling for human review and field-level confidence scoring, and recognize why uniform and top-N sampling are exam anti-patterns
- Build a synthesizer that surfaces source disagreement instead of silently picking one source
- Recognize and avoid the four most common Domain 5.6 trap questions on the cert exam
Why This Module Exists
An audit of the course against the official Claude Certified Architect – Foundations Certification Exam Guide showed that Domains 5.5 and 5.6 had the most under-covered topics of any cert area. Cert tips were inserted into M09, M11, M17, and M18 to flag them in their natural homes — but the cert tests this material as a unified discipline, not as scattered footnotes across four modules.
This module exists to close that gap. The five sub-topics — stratified human review, field-level confidence, provenance, temporal handling, and synthesis output — share a single underlying skill: knowing what you don't know, and surfacing that uncertainty to humans in a structured form. Once you see them as one discipline, the cert questions stop looking like trick questions.
Domain 5.5 (human review workflows, confidence calibration) and Domain 5.6 (information provenance, uncertainty handling) — full coverage. Reinforces Domain 4.6 (multi-pass review — synthesis IS a second pass over per-chunk extractions) and Domain 5.4 (temporal handling reinforces stale-context themes). If you can pass this module's quiz, you will likely pass every 5.5/5.6 question on the practice exams.
Concept 1: Information Provenance — Claim-Source Mappings
Every claim a knowledge agent emits must be tagged to its source. Without provenance, you can't audit, can't recover from a retracted source, and can't honestly answer "where did this come from?" The cert tests whether your output schema enforces the claim → source link — not whether the answer happens to be right.
Compare a Wikipedia article without citations to one with [1][2] markers on every sentence. The first might be perfectly correct — but you can't tell, and you can't recover when one of its sources turns out to be wrong.
The pain: a journalist quotes the unsourced article. Six months later the original Wikipedia author admits they made up paragraph 4. The journalist's editor asks "which paragraphs of your article relied on that source?" — and the journalist can't answer.
The mapping: the unsourced article is your RAG agent's output without provenance. Every claim looks fine until something breaks. With {claim, source_id, confidence} markers, you can pull the thread on any retraction and immediately find every downstream claim that needs review.
Information provenance is the property of an output schema in which every emitted claim carries a structured pointer back to the source(s) it derives from. It is distinct from citation (a writing convention) because provenance lives in the schema — the output is {claims:[{text, sources:[id], confidence}]}, not prose with parenthetical references. The schema makes provenance enforceable; prose makes it optional.
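One way to make the schema "enforce" the claim → source link is to reject any output whose claims lack a source pointer. The sketch below is illustrative only: `validate_output` and the sample chunk ID are hypothetical names chosen for this example, not part of any exam-specified API.

```python
from typing import Any


def validate_output(output: dict[str, Any]) -> list[str]:
    """Return a list of schema violations; an empty list means every claim is auditable."""
    problems = []
    for i, claim in enumerate(output.get("claims", [])):
        if not claim.get("text"):
            problems.append(f"claims[{i}]: missing text")
        if not claim.get("sources"):
            problems.append(f"claims[{i}]: no source pointer")
        conf = claim.get("confidence")
        if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
            problems.append(f"claims[{i}]: confidence must be a number in [0, 1]")
    return problems


# Hypothetical outputs: same claim, with and without a source pointer.
good = {"claims": [{"text": "Step therapy is required.", "sources": ["chunk_12"], "confidence": 0.9}]}
bad = {"claims": [{"text": "Step therapy is required.", "sources": [], "confidence": 0.9}]}

print(validate_output(good))  # []
print(validate_output(bad))   # ['claims[0]: no source pointer']
```

Prose with parenthetical citations cannot be checked this way, which is the practical meaning of "the schema makes provenance enforceable."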
Here's what happens when a source is retracted in a system with provenance: every claim that relied on that source gets flagged. Suppose chunk_12 is retracted. Two claims (off-label exclusion, step therapy) immediately surface as needing review, because the schema lets you pull the thread from a single source ID. Without provenance, you'd have to re-run the entire synthesis pipeline against the new source set just to find out which claims are now unsupported. Provenance turns "we have a problem somewhere" into "these specific 2 claims need review."
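The retraction lookup can be sketched as a reverse index from source ID to claims. The claim texts, the extra source IDs, and the helper names (`build_source_index`, `claims_needing_review`) are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical claims; IDs and texts are illustrative only.
claims = [
    {"text": "Off-label use is excluded.", "sources": ["chunk_12"]},
    {"text": "Step therapy is required.", "sources": ["chunk_12", "chunk_7"]},
    {"text": "CPT 99213 is a Level 3 E/M code.", "sources": ["chunk_3"]},
]


def build_source_index(claims):
    """Map each source ID to the indices of the claims that cite it."""
    index = defaultdict(list)
    for i, claim in enumerate(claims):
        for source_id in claim["sources"]:
            index[source_id].append(i)
    return index


def claims_needing_review(claims, retracted_id):
    """Return the claims that relied on the retracted source."""
    index = build_source_index(claims)
    return [claims[i] for i in index.get(retracted_id, [])]


affected = claims_needing_review(claims, "chunk_12")
print([c["text"] for c in affected])
# ['Off-label use is excluded.', 'Step therapy is required.']
```

A claim with a second, unretracted source (the step-therapy claim here) is still flagged: it may survive review, but a human should confirm that.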
The exam doesn't test whether your agent gives correct answers — it tests whether the output schema makes correctness auditable. Trick question pattern: a long scenario describing a "high-accuracy" RAG agent that returns prose answers. The answer is always: regardless of accuracy, prose-with-parenthetical-citations is non-compliant. The output must be structured {claim, source_id, confidence}.
Concept 2: Temporal Data — As-Of Reasoning
Facts have lifespans. "CEO is Alice" was true once and may now be false. Memory layers without temporal metadata cause agents to confidently report stale facts as current — one of the most-tested Domain 5.6 scenarios.
Compare a weather forecast to a weather record. The forecast for tomorrow expires tonight — tomorrow's actual weather supersedes it. The record of yesterday's high is permanent — nothing supersedes it. They have completely different temporal semantics.
The pain: a planning agent reads "expected high tomorrow: 78°F" and uses it to recommend a t-shirt. By the time the user reads the recommendation, "tomorrow" has come and gone, and the actual high was 52°F.
The mapping: every fact in your knowledge base has a valid_from and a valid_to. Forecasts have a near-future valid_to. Records have a permanent one. CEOs, prices, and policy versions have lifespans that need explicit tracking. Confusing them is the bug.
Temporal data handling is the practice of storing every fact with explicit {value, valid_from, valid_to, source} metadata, and querying with explicit temporal predicates. The default fact query is "what's true now?" — SELECT * WHERE valid_to IS NULL. The "as-of" query is "what was true at time T?" — SELECT * WHERE valid_from <= T AND (valid_to IS NULL OR valid_to > T). Most temporal bugs come from omitting the valid_to predicate, which silently returns every historical version.
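The two query shapes can be tried against an in-memory SQLite table. This is a minimal sketch: the table name `facts`, the `entity` column, and the CEO rows and dates are assumptions for illustration. ISO-8601 date strings are used so that plain string comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts (entity TEXT, value TEXT, valid_from TEXT, valid_to TEXT, source TEXT)"
)
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?, ?, ?)",
    [
        ("acme_ceo", "Bob", "2021-01-01", "2024-07-01", "doc_1"),  # historical row
        ("acme_ceo", "Alice", "2024-07-01", None, "doc_2"),        # open-ended = current
    ],
)


def current_value(entity):
    """'What's true now?': only the row whose valid_to is NULL."""
    row = conn.execute(
        "SELECT value FROM facts WHERE entity = ? AND valid_to IS NULL", (entity,)
    ).fetchone()
    return row[0] if row else None


def value_as_of(entity, t):
    """'What was true at time t?': the row whose validity interval contains t."""
    row = conn.execute(
        "SELECT value FROM facts WHERE entity = ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to > ?)",
        (entity, t, t),
    ).fetchone()
    return row[0] if row else None


print(current_value("acme_ceo"))              # Alice
print(value_as_of("acme_ceo", "2023-06-15"))  # Bob
```

Dropping the `valid_to` predicate from `current_value` would match both rows, and whichever one the database returned first would be reported as "current."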
"Episodic memory has timestamps, so I'm covered." — Episodic memory captures when the event happened, not when the underlying fact was valid. A note from 2023 saying "Bob is CEO" doesn't tell you when Bob became CEO or stopped being CEO — only that you wrote that note in 2023.
"Most facts don't change, so I can ignore temporal handling." — Most facts do change in regulated domains (healthcare coverage, legal status, organizational structure). The cert exam is heavy on these domains because they're where stale facts cause real harm.
"I can just sort by latest timestamp." — This works for “most recent answer” but breaks for as-of queries. A user asking "what was the policy in 2023?" needs the row valid then, not the latest row in your table.
Trap question pattern: an agent confidently reports "Acme's CEO is Bob" but it's been Alice for 6 months. What's the bug class? Always temporal — the query omitted the valid_to IS NULL filter and got the historical row. This is one of the most-tested scenarios on the cert. Memorize: valid_to IS NULL is "current"; explicit timestamp is "as-of."
Concept 3: Stratified Sampling + Field-Level Confidence
When you route extractions to human reviewers, sampling matters as much as the extraction itself. Aggregate confidence is too coarse. Top-N-by-confidence over-reviews the easy cases. Uniform sampling under-reviews the hard ones. The cert pattern is stratified sampling: N samples from each confidence bucket, so reviewers see the full distribution.
Quality control on a factory line. You don't inspect every widget: that's exhaustive inspection, and it's expensive. You don't pull widgets purely at random either: that's uniform sampling, and most of what you pull is obviously fine. And you don't inspect only the shiniest widgets: that's top-N sampling, and you'd never find defects.
The pain: factories that sample uniformly waste inspector time on obviously-good batches. Factories that sample top-N never see the defective batches at all.
The mapping: stratified QC inspects N units from each grade tier: A-grade, B-grade, C-grade. Defects in every grade get a chance to surface. The same logic applies to LLM extractions: sample N items from each confidence bucket so reviewers see high-, medium-, and low-confidence cases alike. When low-confidence items are a minority of the batch, a fixed N per bucket reviews them at a higher rate than uniform sampling would, and that is where defects tend to concentrate.
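The contrast between stratified and top-N sampling can be sketched in a few lines. The bucket thresholds (0.5 and 0.85), the helper names, and the sample confidences are assumptions for illustration, not values from the exam guide.

```python
import random


def bucket_of(confidence, low=0.5, high=0.85):
    """Assign a confidence score to a bucket. Thresholds are illustrative assumptions."""
    if confidence < low:
        return "low"
    if confidence < high:
        return "medium"
    return "high"


def stratified_by_confidence(items, per_bucket, rng):
    """Sample up to `per_bucket` items from each confidence bucket."""
    buckets = {"low": [], "medium": [], "high": []}
    for item in items:
        buckets[bucket_of(item["confidence"])].append(item)
    sample = []
    for members in buckets.values():
        k = min(per_bucket, len(members))  # small bucket: take all of it, never pad
        sample.extend(rng.sample(members, k))
    return sample


def top_n(items, n):
    """Anti-pattern, shown for contrast: only the most confident items."""
    return sorted(items, key=lambda x: x["confidence"], reverse=True)[:n]


# Hypothetical extractions: 3 low, 3 medium, 6 high confidence.
confidences = [0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.91, 0.93, 0.95, 0.97, 0.99]
items = [{"id": i, "confidence": c} for i, c in enumerate(confidences)]

rng = random.Random(0)  # seeded so the sketch is repeatable
strat = stratified_by_confidence(items, per_bucket=2, rng=rng)
top = top_n(items, 6)

print(sorted({bucket_of(x["confidence"]) for x in strat}))  # ['high', 'low', 'medium']
print(sorted({bucket_of(x["confidence"]) for x in top}))    # ['high']
```

With the same review budget of six items, top-N never leaves the high bucket, while the stratified sample covers all three.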
Field-level confidence is the second half of the trick. Aggregate document confidence hides defects in individual fields. A document with 88% overall confidence might have one field at 30% confidence — the field that determines whether you approve a $50K healthcare claim. Field-level scores let you escalate just that field, not the whole document.
Comparison: ❌ Top-N (sample the 6 highest-confidence items) vs. ✅ Stratified (sample 2 from each bucket).
Two trap patterns to recognize: (1) "Send the top-N most-confident extractions to human review" — always wrong; reviewers see only correct cases and learn nothing. (2) "Escalate when overall document confidence drops below X%" — weak; aggregate hides per-field defects. The cert-correct answer is always stratified sampling + field-level confidence, in combination.
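To see why an aggregate score hides a bad field, here is a minimal sketch. The field names, values, the 0.85 threshold, and the use of a plain mean as the document score are all assumptions for illustration.

```python
def document_confidence(extraction):
    """Aggregate score: the plain mean of the field confidences (an assumed definition)."""
    scores = [f["confidence"] for f in extraction.values()]
    return sum(scores) / len(scores)


def fields_to_escalate(extraction, threshold=0.85):
    """Return the names of fields whose confidence falls below the threshold."""
    return [name for name, f in extraction.items() if f["confidence"] < threshold]


# Hypothetical extraction from one pre-auth document.
extraction = {
    "cpt_code": {"value": "99213", "confidence": 0.99},
    "patient_id": {"value": "P-001", "confidence": 0.98},
    "date_of_service": {"value": "2025-01-10", "confidence": 0.97},
    "prior_auth_on_file": {"value": True, "confidence": 0.30},
}

print(round(document_confidence(extraction), 2))  # 0.81
print(fields_to_escalate(extraction))             # ['prior_auth_on_file']
```

The document-level number looks tolerable, yet the one field that drives the approval decision sits at 0.30. Routing only that field to a reviewer is cheaper than escalating the whole document and safer than trusting the average.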
Concept 4: Synthesis Output — Well-Established vs. Contested Claims
When sources agree, the agent should say so confidently. When sources disagree, the agent must surface the disagreement, not silently pick one. This is the cert pattern that fails most often in practice — agents that “helpfully” reconcile contradictions by picking one side.
A meta-analysis paper. "Result X is replicated across 7 studies" reads very differently from "Result Y is supported by 3 studies and contradicted by 2." A reader can act on the first; the second requires more digging. Don't collapse them.
The pain: a healthcare agent that reads two payer documents disagreeing on coverage for CPT 99213. It picks the more recent one and reports "Covered." The reviewer downstream has no idea there was a contradiction — the disagreement was lost.
The mapping: your output schema must support a claim_status: "established" | "contested" | "single_source" field. When status is contested, the output must include both source pointers, not just one.
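One possible way to make that rule hard to violate is to validate it when the output object is constructed. The `Position` and `SynthesizedClaim` classes below are hypothetical names for this sketch; only the three `claim_status` values come from the text above.

```python
from dataclasses import dataclass

ALLOWED_STATUSES = {"established", "contested", "single_source"}


@dataclass
class Position:
    """One side of a claim plus the sources that support it."""
    text: str
    sources: list[str]


@dataclass
class SynthesizedClaim:
    claim_status: str
    positions: list[Position]

    def __post_init__(self):
        if self.claim_status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown claim_status: {self.claim_status!r}")
        if not self.positions or any(not p.sources for p in self.positions):
            raise ValueError("every position needs at least one source pointer")
        if self.claim_status == "contested" and len(self.positions) < 2:
            raise ValueError("contested claims must carry both disagreeing positions")
        if self.claim_status != "contested" and len(self.positions) != 1:
            raise ValueError("non-contested claims carry exactly one position")


# A contested claim with both sides present is accepted.
ok = SynthesizedClaim(
    "contested",
    [Position("Covered", ["chunk_5"]), Position("Requires prior auth", ["chunk_9"])],
)

# A "contested" claim that kept only one side is rejected.
try:
    SynthesizedClaim("contested", [Position("Covered", ["chunk_5"])])
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```

With this shape, an agent that silently picks one side cannot label the result `contested` at all, so the shortcut shows up as a validation error rather than a plausible-looking answer.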
Example synthesis output (source IDs in brackets):
- Established: sources: [chunk_3, chunk_12]
- Contested: Aetna: covered [chunk_5] vs. BCBS: requires prior auth [chunk_9]
- Single-source: source: [chunk_5] · needs_review: true
Four sources, three output categories. Sources 3 and 12 agree on the definition — that's an established claim with two source pointers. Sources 5 and 9 contradict on coverage — that's a contested claim that surfaces both, in opposition. Source 5 makes a separate claim with no corroboration — that's a single-source claim flagged for review. The schema makes the agent's epistemic state legible to whoever reads the output downstream — whether that's a human reviewer, a downstream agent, or an audit log.
The most common Domain 5.6 trap: a scenario describing two sources that disagree on a fact. The agent picks the “more recent” or “more authoritative” source and reports it. This is wrong. Even if the agent picks correctly, the cert-correct behavior is to surface the disagreement as a contested claim with both source pointers, and let the downstream consumer (often a human) decide. Silent disagreement resolution is non-compliant regardless of which source is “right.”
Code Walkthrough: The ProvenancedSynthesizer Class
We'll build the synthesizer as four module-level functions (wrap them in a ProvenancedSynthesizer class if you prefer) that take a list of retrieved chunks and a query and return a synthesis object with all four sub-categories. Annotated in 4 chunks.
Chunk 1: Per-Chunk Claim Extraction with Confidence
WHAT: Each chunk is sent to Claude with a structured tool that extracts every distinct claim plus a per-claim confidence. WHY: We need claim-level (not document-level) granularity to detect agreement and contradiction. GOTCHA: Don't ask for a single “summary” per chunk — that collapses claims and you lose the ability to detect disagreement.
# synthesizer.py
import anthropic
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    source_id: str
    confidence: float  # 0.0 - 1.0
    valid_from: str = ""  # ISO date
    valid_to: str | None = None


EXTRACT_TOOL = {
    "name": "extract_claims",
    "description": "Extract every distinct factual claim from a source chunk.",
    "input_schema": {
        "type": "object",
        "properties": {
            "claims": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "text": {"type": "string"},
                        "confidence": {"type": "number"},
                        "valid_from": {"type": "string"},
                        "valid_to": {"type": "string"},
                    },
                    "required": ["text", "confidence"],
                },
            },
        },
        "required": ["claims"],
    },
}


def extract_claims(chunk_id: str, chunk_text: str) -> list[Claim]:
    """Send one chunk to Claude; return a list of Claim objects."""
    client = anthropic.Anthropic()
    try:
        result = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=800,
            tools=[EXTRACT_TOOL],
            tool_choice={"type": "tool", "name": "extract_claims"},
            messages=[{"role": "user",
                       "content": f"Extract every factual claim from this source.\n\n{chunk_text}"}],
        )
    except anthropic.APIError as e:
        print(f"Extraction failed for {chunk_id}: {e.message}")
        return []
    # tool_use block contains the structured output; guard against a response without one
    raw = next((b.input for b in result.content if b.type == "tool_use"), None)
    if raw is None:
        return []
    # source_id comes from the caller, not the model, so provenance cannot be hallucinated
    return [Claim(text=c["text"], source_id=chunk_id,
                  confidence=c["confidence"],
                  valid_from=c.get("valid_from", ""),
                  valid_to=c.get("valid_to"))
            for c in raw.get("claims", [])]
// synthesizer.mjs
import Anthropic from '@anthropic-ai/sdk';

const EXTRACT_TOOL = {
  name: 'extract_claims',
  description: 'Extract every distinct factual claim from a source chunk.',
  input_schema: {
    type: 'object',
    properties: {
      claims: {
        type: 'array',
        items: {
          type: 'object',
          properties: {
            text: { type: 'string' },
            confidence: { type: 'number' },
            valid_from: { type: 'string' },
            valid_to: { type: 'string' },
          },
          required: ['text', 'confidence'],
        },
      },
    },
    required: ['claims'],
  },
};

export async function extractClaims(chunkId, chunkText) {
  const client = new Anthropic();
  try {
    const result = await client.messages.create({
      model: 'claude-haiku-4-5-20251001',
      max_tokens: 800,
      tools: [EXTRACT_TOOL],
      tool_choice: { type: 'tool', name: 'extract_claims' },
      messages: [{ role: 'user',
        content: `Extract every factual claim from this source.\n\n${chunkText}` }],
    });
    // A missing tool_use block throws here and is handled by the catch below
    const raw = result.content.find(b => b.type === 'tool_use').input;
    return raw.claims.map(c => ({
      text: c.text,
      source_id: chunkId,
      confidence: c.confidence,
      valid_from: c.valid_from || '',
      valid_to: c.valid_to || null,
    }));
  } catch (error) {
    console.error(`Extraction failed for ${chunkId}: ${error.message}`);
    return [];
  }
}
Chunk 2: Agreement Detection — Cluster Claims by Entailment
WHAT: Group claims that say the same thing across different chunks. WHY: This is what produces the “well-established” vs. “contested” vs. “single-source” categories. GOTCHA: Don't use raw string match. Use a small Claude call to classify pairs as agree / contradict / unrelated. Two claims that paraphrase each other should still cluster as agreement.
CLASSIFY_TOOL = {
    "name": "classify_pair",
    "description": "Classify whether two claims agree, contradict, or are unrelated.",
    "input_schema": {
        "type": "object",
        "properties": {
            "relation": {"type": "string", "enum": ["agree", "contradict", "unrelated"]},
            "topic": {"type": "string", "description": "If agree or contradict, the shared topic."},
        },
        "required": ["relation"],
    },
}


def classify_pair(client, c1: Claim, c2: Claim) -> tuple[str, str]:
    """Return (relation, topic). relation in {agree, contradict, unrelated}."""
    try:
        result = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=200,
            tools=[CLASSIFY_TOOL],
            tool_choice={"type": "tool", "name": "classify_pair"},
            messages=[{"role": "user", "content":
                       f"Claim A (source {c1.source_id}): {c1.text}\n"
                       f"Claim B (source {c2.source_id}): {c2.text}\n\n"
                       "Classify the relation."}],
        )
        raw = next(b.input for b in result.content if b.type == "tool_use")
        return raw["relation"], raw.get("topic", "")
    except (anthropic.APIError, StopIteration, KeyError):
        # fail safe: don't cluster on API errors or malformed responses
        return "unrelated", ""


def cluster_claims(client, claims: list[Claim]) -> tuple[list, list, list]:
    """Return (established, contested, single_source) lists.

    Pairwise classification is O(n^2); fine for tens of claims, switch to
    embedding-based clustering for hundreds.
    """
    n = len(claims)
    visited = [False] * n
    established, contested, single = [], [], []
    for i in range(n):
        if visited[i]:
            continue
        agree_with = [i]
        contradicts = []
        for j in range(i + 1, n):
            if visited[j]:
                continue
            relation, _topic = classify_pair(client, claims[i], claims[j])
            if relation == "agree":
                agree_with.append(j)
                visited[j] = True
            elif relation == "contradict":
                contradicts.append(j)
        visited[i] = True
        if contradicts:
            contested.append({
                "claim_a": claims[i].text,
                "sources_a": [claims[k].source_id for k in agree_with],
                # claim_b shows the first contradicting claim as representative text;
                # every contradicting source is kept so none is silently dropped
                "claim_b": claims[contradicts[0]].text,
                "sources_b": [claims[k].source_id for k in contradicts],
            })
            for k in contradicts:
                visited[k] = True
        elif len(agree_with) >= 2:
            established.append({
                "claim": claims[i].text,
                "sources": [claims[k].source_id for k in agree_with],
                "confidence": sum(claims[k].confidence for k in agree_with) / len(agree_with),
            })
        else:
            single.append({
                "claim": claims[i].text,
                "source": claims[i].source_id,
                "confidence": claims[i].confidence,
                "needs_review": claims[i].confidence < 0.85,
            })
    return established, contested, single
const CLASSIFY_TOOL = {
  name: 'classify_pair',
  description: 'Classify whether two claims agree, contradict, or are unrelated.',
  input_schema: {
    type: 'object',
    properties: {
      relation: { type: 'string', enum: ['agree', 'contradict', 'unrelated'] },
      topic: { type: 'string' },
    },
    required: ['relation'],
  },
};

async function classifyPair(client, c1, c2) {
  try {
    const result = await client.messages.create({
      model: 'claude-haiku-4-5-20251001',
      max_tokens: 200,
      tools: [CLASSIFY_TOOL],
      tool_choice: { type: 'tool', name: 'classify_pair' },
      messages: [{
        role: 'user',
        content: `Claim A (source ${c1.source_id}): ${c1.text}\n` +
          `Claim B (source ${c2.source_id}): ${c2.text}\n\nClassify the relation.`,
      }],
    });
    const raw = result.content.find(b => b.type === 'tool_use').input;
    return [raw.relation, raw.topic || ''];
  } catch {
    return ['unrelated', '']; // fail safe: don't cluster on errors
  }
}

export async function clusterClaims(client, claims) {
  const n = claims.length;
  const visited = new Array(n).fill(false);
  const established = [], contested = [], single = [];
  for (let i = 0; i < n; i++) {
    if (visited[i]) continue;
    const agreeWith = [i];
    const contradicts = [];
    for (let j = i + 1; j < n; j++) {
      if (visited[j]) continue;
      const [relation] = await classifyPair(client, claims[i], claims[j]);
      if (relation === 'agree') { agreeWith.push(j); visited[j] = true; }
      else if (relation === 'contradict') contradicts.push(j);
    }
    visited[i] = true;
    if (contradicts.length) {
      contested.push({
        claim_a: claims[i].text,
        sources_a: agreeWith.map(k => claims[k].source_id),
        // claim_b shows the first contradicting claim as representative text;
        // every contradicting source is kept so none is silently dropped
        claim_b: claims[contradicts[0]].text,
        sources_b: contradicts.map(k => claims[k].source_id),
      });
      contradicts.forEach(k => { visited[k] = true; });
    } else if (agreeWith.length >= 2) {
      established.push({
        claim: claims[i].text,
        sources: agreeWith.map(k => claims[k].source_id),
        confidence: agreeWith.reduce((s, k) => s + claims[k].confidence, 0) / agreeWith.length,
      });
    } else {
      single.push({
        claim: claims[i].text,
        source: claims[i].source_id,
        confidence: claims[i].confidence,
        needs_review: claims[i].confidence < 0.85,
      });
    }
  }
  return { established, contested, single };
}
Chunk 3: Temporal Flagging
WHAT: Compare each claim's valid_to against the query time; flag stale facts. WHY: Users querying "current" state need stale facts surfaced as warnings, not silently included. GOTCHA: A missing valid_to is ambiguous — could be permanent or could be unknown. Treat “unknown” differently from “permanent” in the output.
from datetime import date


def flag_temporal(claims: list[Claim], as_of: date) -> list[dict]:
    """Return a list of temporal warnings for stale claims."""
    warnings = []
    for c in claims:
        # valid_to is set AND in the past relative to query time -> stale
        if c.valid_to:
            try:
                vt = date.fromisoformat(c.valid_to)
                if vt < as_of:
                    warnings.append({
                        "claim": c.text,
                        "as_of": c.valid_to,
                        "query_as_of": as_of.isoformat(),
                        "freshness": "stale",
                        "source": c.source_id,
                    })
            except ValueError:
                pass  # malformed date, skip
        # valid_to is None means "current" - don't flag
        # No valid_from at all means "unknown temporal" - that's a single-source review trigger
    return warnings
export function flagTemporal(claims, asOf) {
  const warnings = [];
  for (const c of claims) {
    if (c.valid_to) {
      const vt = new Date(c.valid_to);
      if (!isNaN(vt) && vt < asOf) {
        warnings.push({
          claim: c.text,
          as_of: c.valid_to,
          query_as_of: asOf.toISOString().slice(0, 10),
          freshness: 'stale',
          source: c.source_id,
        });
      }
    }
    // valid_to null = current; missing valid_from = unknown temporal (single-source trigger)
  }
  return warnings;
}
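The GOTCHA for this chunk asks that "unknown" be treated differently from open-ended validity, which the flagging code only notes in a comment. One possible three-way classification is sketched below. The function name `temporal_status` and the rule "no `valid_from` and no `valid_to` means unknown" are assumptions that follow that comment rather than a prescribed design.

```python
from datetime import date


def temporal_status(valid_from, valid_to, as_of):
    """Classify a claim as 'stale', 'current', or 'unknown' relative to `as_of`.

    Assumption: an open-ended claim (no valid_to) counts as 'current' only when
    valid_from is known; with neither bound there is nothing to reason from.
    """
    try:
        vf = date.fromisoformat(valid_from) if valid_from else None
        vt = date.fromisoformat(valid_to) if valid_to else None
    except ValueError:
        return "unknown"  # malformed date: surface it rather than skip silently
    if vt is not None and vt < as_of:
        return "stale"
    if vt is None and vf is None:
        return "unknown"
    return "current"


as_of = date(2025, 1, 15)
print(temporal_status("", "2023-12-31", as_of))    # stale
print(temporal_status("2024-10-01", None, as_of))  # current
print(temporal_status("", None, as_of))            # unknown
```

Claims that come back `unknown` could be routed to the same review path as single-source claims, as the code comment suggests.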
Chunk 4: Stratified Sampler for Human Review
WHAT: Pick N claims from each output bucket for reviewer queues. WHY: The cert-correct sampling pattern. GOTCHA: When a bucket has fewer items than N, take all of them — don't pad with adjacent buckets.
import random


def stratified_sample(synthesis: dict, per_bucket: int = 1) -> list[dict]:
    """Pick `per_bucket` items from each output category for human review."""
    review_queue = []
    for bucket_name in ("established", "contested", "single", "temporal_warnings"):
        bucket = synthesis.get(bucket_name, [])
        n = min(len(bucket), per_bucket)  # small bucket: take all of it, never pad
        if n == 0:
            continue
        chosen = random.sample(bucket, n)
        for item in chosen:
            review_queue.append({"bucket": bucket_name, **item})
    return review_queue
export function stratifiedSample(synthesis, perBucket = 1) {
  const reviewQueue = [];
  for (const bucketName of ['established', 'contested', 'single', 'temporal_warnings']) {
    const bucket = synthesis[bucketName] || [];
    const n = Math.min(bucket.length, perBucket);
    if (n === 0) continue;
    // Fisher-Yates shuffle on a copy, then take the first n.
    // (sort(() => Math.random() - 0.5) is shorter but does not shuffle uniformly.)
    const shuffled = [...bucket];
    for (let i = shuffled.length - 1; i > 0; i--) {
      const j = Math.floor(Math.random() * (i + 1));
      [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
    }
    for (const item of shuffled.slice(0, n)) {
      reviewQueue.push({ bucket: bucketName, ...item });
    }
  }
  return reviewQueue;
}
Four chunks, four cert sub-domains. Chunk 1 gives you provenance (every claim tagged to its source via tool_use output schema). Chunk 2 gives you the synthesis output buckets via agreement detection. Chunk 3 handles the temporal layer. Chunk 4 wires up stratified sampling for human review. The lab below assembles all four against a hand-crafted Healthcare Pre-Auth fixture.
Hands-On Lab: Healthcare Pre-Auth Synthesizer
Lab Overview
What you'll build: A synthesizer that processes 8 retrieved chunks about a single CPT code's coverage policy across multiple payer documents. It produces 4 output buckets (established, contested, single-source, temporal warnings), runs a stratified human-review sampler, and asserts that contested claims always include both source pointers.
Time: 35-45 minutes · Domain: A (Healthcare Pre-Authorization)
Prerequisites: Python 3.10+ or Node.js 18+, Anthropic API key. The code from the walkthrough above.
Files you'll create:
- preauth_chunks.json: the 8-chunk fixture (provided below)
- synthesizer.py (or synthesizer.mjs): the code from the walkthrough
- run_lab.py (or run_lab.mjs): the lab runner
Environment Setup
# Python
mkdir m27b-domain56-lab && cd m27b-domain56-lab
python -m venv venv && source venv/bin/activate # Windows: venv\Scripts\activate
pip install anthropic
export ANTHROPIC_API_KEY=sk-ant-...
# Node.js (alternative)
mkdir m27b-domain56-lab && cd m27b-domain56-lab
npm init -y && npm install @anthropic-ai/sdk
export ANTHROPIC_API_KEY=sk-ant-...
Step 1: Drop In the 8-Chunk Pre-Auth Fixture
Save as preauth_chunks.json. The fixture is hand-crafted: chunks 1-3 agree on CPT definition (will become established), chunks 4-5 contradict on coverage (will become contested), chunk 6 is from 2022 (will become temporal warning), chunks 7-8 are single-source claims about exclusion criteria.
{
"query": "Is CPT 99213 covered for chronic condition follow-up?",
"as_of": "2025-01-15",
"chunks": [
{"chunk_id": "chunk_1", "text": "CMS Manual 2024: CPT 99213 is a Level 3 established patient evaluation and management service.", "valid_to": null},
{"chunk_id": "chunk_2", "text": "AMA CPT 2024: Code 99213 represents 20-29 minutes of medical decision making for an established patient.", "valid_to": null},
{"chunk_id": "chunk_3", "text": "AAPC coding guide 2024: CPT 99213 is the moderate-complexity Level 3 E/M code for established patient visits.", "valid_to": null},
{"chunk_id": "chunk_4", "text": "Aetna policy 2024-Q4: CPT 99213 is COVERED for established patient visits with documented chronic conditions.", "valid_to": null},
{"chunk_id": "chunk_5", "text": "BCBS policy 2024-Q4: CPT 99213 requires PRIOR AUTHORIZATION for chronic condition management beyond 4 visits per year.", "valid_to": null},
{"chunk_id": "chunk_6", "text": "Aetna policy 2022: CPT 99213 was covered without prior auth for any chronic condition.", "valid_to": "2023-12-31"},
{"chunk_id": "chunk_7", "text": "Aetna policy 2024-Q4: Off-label use of any E/M code is excluded from coverage.", "valid_to": null},
{"chunk_id": "chunk_8", "text": "BCBS policy 2024-Q4: Step therapy required before approving CPT 99213 for chronic condition management.", "valid_to": null}
]
}
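Expected shape: 1 established cluster (chunks 1-3), 1 contested pair (chunks 4-5), 1 temporal warning (chunk 6, valid_to in 2023), 2 single-source claims.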
Step 2: Run Synthesis — Verify the Four Output Buckets
What & why: Wire up the four functions from the walkthrough into a single runner. Output should match the fixture's hand-known structure.
"""run_lab.py - Process the 8-chunk fixture, emit synthesis, verify shape."""
import json
import anthropic
from datetime import date
from synthesizer import extract_claims, cluster_claims, flag_temporal, stratified_sample


def main():
    with open("preauth_chunks.json") as f:
        fixture = json.load(f)
    print(f"Query: {fixture['query']}")
    print(f"As-of: {fixture['as_of']}\n")

    # Step A: extract claims from each chunk
    all_claims = []
    for chunk in fixture["chunks"]:
        claims = extract_claims(chunk["chunk_id"], chunk["text"])
        for c in claims:
            # the fixture's valid_to is authoritative; override whatever the model guessed
            c.valid_to = chunk.get("valid_to")
        all_claims.extend(claims)
    print(f"Extracted {len(all_claims)} claims across {len(fixture['chunks'])} chunks\n")

    # Step B: cluster into established / contested / single
    client = anthropic.Anthropic()
    established, contested, single = cluster_claims(client, all_claims)

    # Step C: temporal flagging
    as_of = date.fromisoformat(fixture["as_of"])
    temporal_warnings = flag_temporal(all_claims, as_of)
    synthesis = {
        "established": established,
        "contested": contested,
        "single": single,
        "temporal_warnings": temporal_warnings,
    }

    # Step D: print synthesis
    print("=" * 60)
    print("SYNTHESIS OUTPUT")
    print("=" * 60)
    for bucket in ("established", "contested", "single", "temporal_warnings"):
        print(f"\n[{bucket}] {len(synthesis[bucket])} item(s):")
        for item in synthesis[bucket]:
            print(f"  - {json.dumps(item, indent=2, default=str)[:200]}")
    return synthesis


if __name__ == "__main__":
    main()
// run_lab.mjs - Process the 8-chunk fixture, emit synthesis, verify shape.
import { readFileSync } from 'fs';
import Anthropic from '@anthropic-ai/sdk';
import { extractClaims, clusterClaims, flagTemporal, stratifiedSample } from './synthesizer.mjs';

async function main() {
  const fixture = JSON.parse(readFileSync('preauth_chunks.json', 'utf-8'));
  console.log(`Query: ${fixture.query}`);
  console.log(`As-of: ${fixture.as_of}\n`);

  // Step A: extract claims from each chunk
  const allClaims = [];
  for (const chunk of fixture.chunks) {
    const claims = await extractClaims(chunk.chunk_id, chunk.text);
    // the fixture's valid_to is authoritative; override whatever the model guessed
    claims.forEach(c => { c.valid_to = chunk.valid_to; });
    allClaims.push(...claims);
  }
  console.log(`Extracted ${allClaims.length} claims across ${fixture.chunks.length} chunks\n`);

  // Steps B and C: cluster, then flag stale claims
  const client = new Anthropic();
  const { established, contested, single } = await clusterClaims(client, allClaims);
  const temporalWarnings = flagTemporal(allClaims, new Date(fixture.as_of));
  const synthesis = { established, contested, single, temporal_warnings: temporalWarnings };

  // Step D: print synthesis
  console.log('='.repeat(60));
  console.log('SYNTHESIS OUTPUT');
  console.log('='.repeat(60));
  for (const bucket of ['established', 'contested', 'single', 'temporal_warnings']) {
    console.log(`\n[${bucket}] ${synthesis[bucket].length} item(s):`);
    synthesis[bucket].forEach(item => {
      console.log(`  - ${JSON.stringify(item).slice(0, 200)}`);
    });
  }
  return synthesis;
}

main().catch(err => { console.error(err); process.exit(1); });
Run it:
python run_lab.py # or: node run_lab.mjs
Expected output shape:
- Established: at least 1 cluster with 3+ source pointers (the CPT definition agreement)
- Contested: exactly 1 item, with both sources_a AND sources_b populated
- Single: 2 items with needs_review flag
- Temporal warnings: exactly 1, pointing at chunk_6 with freshness: "stale"
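The checklist above can be turned into a small checker so you don't have to eyeball the printed output. This is a sketch under assumptions: `check_expected_shape` is a hypothetical helper, and `sample` is a hand-written stand-in for a real run. Because extraction and clustering are model-driven, an actual run may legitimately deviate from the hand-crafted expectation.

```python
def check_expected_shape(synthesis):
    """Return a list of mismatches against the lab's expected bucket shape."""
    problems = []
    if not any(len(e.get("sources", [])) >= 3 for e in synthesis.get("established", [])):
        problems.append("established: no cluster with 3+ source pointers")
    contested = synthesis.get("contested", [])
    if len(contested) != 1:
        problems.append(f"contested: expected 1 item, got {len(contested)}")
    elif not (contested[0].get("sources_a") and contested[0].get("sources_b")):
        problems.append("contested: sources_a and sources_b must both be populated")
    single = synthesis.get("single", [])
    if len(single) != 2 or any("needs_review" not in s for s in single):
        problems.append("single: expected 2 items, each with a needs_review flag")
    warnings = synthesis.get("temporal_warnings", [])
    if not (len(warnings) == 1
            and warnings[0].get("source") == "chunk_6"
            and warnings[0].get("freshness") == "stale"):
        problems.append("temporal_warnings: expected exactly 1 stale warning for chunk_6")
    return problems


# Hand-written stand-in for a synthesis result (not real model output).
sample = {
    "established": [{"claim": "CPT 99213 is a Level 3 E/M code.",
                     "sources": ["chunk_1", "chunk_2", "chunk_3"], "confidence": 0.95}],
    "contested": [{"claim_a": "Covered", "sources_a": ["chunk_4"],
                   "claim_b": "Requires prior authorization", "sources_b": ["chunk_5"]}],
    "single": [{"claim": "Off-label use is excluded.", "source": "chunk_7",
                "confidence": 0.9, "needs_review": False},
               {"claim": "Step therapy is required.", "source": "chunk_8",
                "confidence": 0.8, "needs_review": True}],
    "temporal_warnings": [{"claim": "Covered without prior auth.", "as_of": "2023-12-31",
                           "query_as_of": "2025-01-15", "freshness": "stale",
                           "source": "chunk_6"}],
}

print(check_expected_shape(sample))  # []
```

Calling `check_expected_shape(synthesis)` on the value returned by `main()` gives a list of deviations to investigate instead of a single pass/fail.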
Step 3: Run Stratified Sampling for Reviewer Queue
What & why: Pick 1 from each bucket for the human-review queue. Verify reviewers see one item from each category — not the “most confident” or “most recent”.
# Append to run_lab.py, replacing the earlier `if __name__ == "__main__":` block
# so that main() (and its API calls) runs only once
def review_step(synthesis: dict):
    queue = stratified_sample(synthesis, per_bucket=1)
    print("\n" + "=" * 60)
    print("HUMAN REVIEW QUEUE (stratified, 1 per bucket)")
    print("=" * 60)
    for item in queue:
        print(f"\n  [bucket: {item['bucket']}]")
        print(f"  {json.dumps({k: v for k, v in item.items() if k != 'bucket'}, default=str)[:180]}")
    print(f"\nTotal queue size: {len(queue)} items")


if __name__ == "__main__":
    synthesis = main()
    review_step(synthesis)
// Append to run_lab.mjs, replacing the earlier `main().catch(...)` line
// so that main() (and its API calls) runs only once
function reviewStep(synthesis) {
  const queue = stratifiedSample(synthesis, 1);
  console.log('\n' + '='.repeat(60));
  console.log('HUMAN REVIEW QUEUE (stratified, 1 per bucket)');
  console.log('='.repeat(60));
  queue.forEach(item => {
    const { bucket, ...rest } = item;
    console.log(`\n  [bucket: ${bucket}]`);
    console.log(`  ${JSON.stringify(rest).slice(0, 180)}`);
  });
  console.log(`\nTotal queue size: ${queue.length} items`);
}

const synthesis = await main();
reviewStep(synthesis);
Step 4: Add the Regression Test for Contested Claims
What & why: The most common Domain 5.6 trap is silently picking one source on disagreement. Add an assertion that catches this regression.
def assert_contested_have_both_sources(synthesis: dict):
    """Cert-correct contested claims must include BOTH source pointers."""
    failures = []
    for c in synthesis.get("contested", []):
        missing_sources = not c.get("sources_a") or not c.get("sources_b")
        missing_claims = not c.get("claim_a") or not c.get("claim_b")
        # count each bad item once, even if it fails both checks
        if missing_sources or missing_claims:
            failures.append(c)
    if failures:
        raise AssertionError(
            f"REGRESSION: {len(failures)} contested item(s) missing both sources/claims. "
            f"This is the canonical Domain 5.6 trap. Fix the synthesizer.")
    print(f"\n✅ All {len(synthesis.get('contested', []))} contested claim(s) have both source pointers.")


# Replaces the `if __name__ == "__main__":` block from Step 3
if __name__ == "__main__":
    synthesis = main()
    review_step(synthesis)
    assert_contested_have_both_sources(synthesis)
function assertContestedHaveBothSources(synthesis) {
  const failures = [];
  for (const c of synthesis.contested || []) {
    const missingSources = !c.sources_a?.length || !c.sources_b?.length;
    const missingClaims = !c.claim_a || !c.claim_b;
    // count each bad item once, even if it fails both checks
    if (missingSources || missingClaims) failures.push(c);
  }
  if (failures.length) {
    throw new Error(
      `REGRESSION: ${failures.length} contested item(s) missing both sources/claims. ` +
      `This is the canonical Domain 5.6 trap. Fix the synthesizer.`);
  }
  console.log(`\n✅ All ${synthesis.contested?.length || 0} contested claim(s) have both source pointers.`);
}

// Replaces the two entry-point lines from Step 3 (`const synthesis` cannot be declared twice)
const synthesis = await main();
reviewStep(synthesis);
assertContestedHaveBothSources(synthesis);
Stretch Goals (Optional)
- Field-level confidence: Switch from claim-level to field-level extraction for structured outputs like {cpt_code, coverage, copay, exclusions[]} and route just the low-confidence fields to review.
- Provenance audit: Add a provenance_audit() method that returns "every claim → which chunk(s)" for compliance review, the audit log a regulator would ask for.
- Temporal diff: Run synthesis twice at different as_of dates; output the diff: which claims changed, became stale, or got new sources.
- Confirmation-bias regression: Run the same synthesis twice in the same session vs. fresh sessions, and assert the outputs match (catches Domain 4.5 violations).
Knowledge Check — 8 Cert-Style Questions
Cert-density quiz. If you score 7/8 or higher, you're ready for the practice exam's Domain 5.6 sections.
Q1: Which output format is exam-compliant for a RAG synthesizer?
- {claim, source_id, confidence} tuples per claim
Explanation: {claim, source_id, confidence} makes provenance enforceable and auditable. Prose with parenthetical citations may look correct but isn't machine-checkable, so the cert treats it as non-compliant.
Q2: A knowledge-base row should include which fields for proper temporal handling?
- {value, timestamp}
- {value, last_updated}
- {value, created_at, source}
- {value, valid_from, valid_to, source}
Explanation: valid_from and valid_to capture when the FACT was valid, not when the row was written. valid_to: NULL means "current." This schema lets you query both "what's true now?" and "what was true as-of T?" Other shapes only support the latest-value query.
Q3: Two payer documents disagree on whether CPT 99213 is covered. Your agent picks the more recent one and reports "Covered." What's wrong?
Explanation: The agent resolved the disagreement silently. The output should carry claim_status: "contested" with both sources_a and sources_b so the downstream consumer (often a human) can decide. The cert tests this directly.
Q4: Which sampling strategy aligns with cert recommendations for a human-review batch?
Q5: Why does field-level confidence beat document-level confidence for high-stakes extraction?
Q6: An agent confidently reports "Acme's CEO is Bob" but Bob hasn't been CEO for 6 months. What's the most likely bug class?
- Temporal: the query omitted valid_to IS NULL and returned the historical row
Explanation: valid_to IS NULL = "current"; explicit timestamp = "as-of."
Q7: When two retrieved chunks make the same claim using different wording (paraphrase), the synthesizer should:
Q8: When sources agree on a claim, should the synthesizer collapse them into one citation?
Explanation: No. Corroboration is a quality signal, and claim_status: "established" + a list of source pointers preserves that signal for downstream consumers. Collapsing destroys the corroboration evidence and makes the output indistinguishable from a single-source claim.
Module Summary
Domain 5.6 Cheat Sheet
- Provenance lives in the schema, not in the prose. Output must be {claim, source_id, confidence}. Prose-with-parenthetical-citations is non-compliant regardless of accuracy.
- Temporal handling: {value, valid_from, valid_to, source}. valid_to IS NULL = "current." Missing valid_to filter is the most common temporal bug.
- Synthesis output has 4 categories: established, contested, single-source, temporal-warning. Silent disagreement resolution is the canonical exam trap.
- Stratified sampling beats top-N and uniform. Field-level confidence beats document-level. They compose — stratify the fields, not the documents.
- Paraphrase = agreement, not contradiction. Use entailment classification, not string match.
- Don't collapse corroborating sources. Three sources agreeing is a quality signal; preserve all pointers.
What's Next: The Practice Exam
You've completed the deepest cert-prep module. The next step is a full timed practice exam. M27 (Cert Exam Prep) covers exam strategy, question patterns, and the final review — pair it with the practice exams from the cert provider. Aim for 80%+ on Domain 5.6 questions specifically.