M27B: Cert Domain 5.5 + 5.6 Deep Dive — Human Review, Confidence, Provenance, Temporal Data, Synthesis

Per the official Anthropic cert exam guide, Domain 5.5 covers human review workflows and confidence calibration (stratified sampling, field-level confidence, accuracy by document type/field), and Domain 5.6 covers information provenance and uncertainty handling (claim-source mappings, temporal data, conflict annotation, coverage gaps). The cert tests these as one discipline. This module bundles them: stratified human review, field-level confidence, claim provenance, temporal as-of reasoning, and synthesis output for well-established vs. contested claims. Take it before any practice exam.

Learning Objectives

  • Design output schemas that enforce claim → source mappings, with explicit well-established / contested / single-source / temporal-warning categories
  • Store and query temporal facts with {value, valid_from, valid_to, source}, distinguishing "current" from "as-of" queries
  • Apply stratified sampling for human review and field-level confidence scoring — and recognize why uniform and top-N sampling are exam anti-patterns
  • Build a synthesizer that surfaces source disagreement instead of silently picking one source
  • Recognize and avoid the four most common Domain 5.5/5.6 trap questions on the cert exam

Why This Module Exists

An audit of the course against the official Claude Certified Architect – Foundations Certification Exam Guide showed that Domains 5.5 and 5.6 had the most under-covered topics of any cert area. Cert tips were inserted into M09, M11, M17, and M18 to flag them in their natural homes — but the cert tests this material as a unified discipline, not as scattered footnotes across four modules.

This module exists to close that gap. The five sub-topics — stratified human review, field-level confidence, provenance, temporal handling, and synthesis output — share a single underlying skill: knowing what you don't know, and surfacing that uncertainty to humans in a structured form. Once you see them as one discipline, the cert questions stop looking like trick questions.

🎓 This Module Maps Directly To Cert Topics

Domain 5.5 (human review workflows, confidence calibration) and Domain 5.6 (information provenance, uncertainty handling) — full coverage. Reinforces Domain 4.6 (multi-pass review — synthesis IS a second pass over per-chunk extractions) and Domain 5.4 (temporal handling reinforces stale-context themes). If you can pass this module's quiz, you will likely pass every 5.5/5.6 question on the practice exams.

Concept 1: Information Provenance — Claim-Source Mappings

Every claim a knowledge agent emits must be tagged to its source. Without provenance, you can't audit, can't recover from a retracted source, and can't honestly answer "where did this come from?" The cert tests whether your output schema enforces the claim → source link — not whether the answer happens to be right.

Provenance Schema — Every Claim Points Back
Sources:
  [chunk_3] FDA letter, 2024-Q1
  [chunk_7] Payer policy, 2024-Q3
  [chunk_12] Vendor whitepaper (retracted)

The synthesizer tags every claim with source IDs.

Structured output:
  CPT 99213 covered · sources: [3, 7] · confidence: 0.92
  Prior auth if > $5K · sources: [7] · confidence: 0.95
  Off-label excluded · sources: [12] · needs review
  Step therapy required · sources: [3, 12] · partial review

When chunk_12 is retracted, the schema lets you pull the thread: 2 claims surface for review instantly, with no re-synthesis pass. The output is {claim, sources: [id], confidence}, not prose with [1][2] markers.
💡 Everyday Analogy

Compare a Wikipedia article without citations to one with [1][2] markers on every sentence. The first might be perfectly correct — but you can't tell, and you can't recover when one of its sources turns out to be wrong.

The pain: a journalist quotes the unsourced article. Six months later the original Wikipedia author admits they made up paragraph 4. The journalist's editor asks "which paragraphs of your article relied on that source?" — and the journalist can't answer.

The mapping: the unsourced article is your RAG agent's output without provenance. Every claim looks fine until something breaks. With {claim, source_id, confidence} markers, you can pull the thread on any retraction and immediately find every downstream claim that needs review.

📐 Technical Definition

Information provenance is the property of an output schema in which every emitted claim carries a structured pointer back to the source(s) it derives from. It is distinct from citation (a writing convention) because provenance lives in the schema — the output is {claims:[{text, sources:[id], confidence}]}, not prose with parenthetical references. The schema makes provenance enforceable; prose makes it optional.

Here's what happens when a source is retracted in a system with provenance: every claim that relied on that source gets flagged.

A Source Retraction, With and Without Provenance

Sources

[chunk_3] FDA approval letter, 2024-Q1
[chunk_7] Payer policy doc, 2024-Q3
[chunk_12] Vendor whitepaper, 2023

Claims with provenance

CPT 99213 covered if dx I10 [3,7]
Prior auth needed if > $5K [7]
Off-label use excluded [12]
Step therapy required [3,12]
✅ What Just Happened?

When chunk_12 is retracted, two claims (off-label exclusion, step therapy) immediately surface as needing review: the schema lets you pull the thread from a single source ID. Without provenance, you'd have to re-run the entire synthesis pipeline against the new source set just to find out which claims are now unsupported. Provenance turns "we have a problem somewhere" into "these specific 2 claims need review."
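The retraction lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the ProvenancedClaim type and the claims_affected_by helper are hypothetical names, and the confidence values on the last two claims are made up for the example (only 0.92 and 0.95 appear in the diagram).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenancedClaim:
    """Hypothetical claim record: text plus the IDs of every chunk it relies on."""
    text: str
    sources: tuple[str, ...]
    confidence: float


def claims_affected_by(claims: list[ProvenancedClaim], retracted_id: str) -> list[ProvenancedClaim]:
    """Return every claim that cites the retracted source."""
    return [c for c in claims if retracted_id in c.sources]


claims = [
    ProvenancedClaim("CPT 99213 covered if dx I10", ("chunk_3", "chunk_7"), 0.92),
    ProvenancedClaim("Prior auth needed if > $5K", ("chunk_7",), 0.95),
    # Confidence values below are illustrative assumptions.
    ProvenancedClaim("Off-label use excluded", ("chunk_12",), 0.80),
    ProvenancedClaim("Step therapy required", ("chunk_3", "chunk_12"), 0.85),
]

# Retract chunk_12 and list the claims that now need review.
needs_review = claims_affected_by(claims, "chunk_12")
for claim in needs_review:
    print(f"REVIEW: {claim.text} (sources: {', '.join(claim.sources)})")
```

Because the source IDs live in the schema, the lookup is a filter over stored output; no model call or re-synthesis is involved.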

🎓 Cert Tip — Domain 5.6 (Provenance)

The exam doesn't test whether your agent gives correct answers — it tests whether the output schema makes correctness auditable. Trick question pattern: a long scenario describing a "high-accuracy" RAG agent that returns prose answers. The answer is always: regardless of accuracy, prose-with-parenthetical-citations is non-compliant. The output must be structured {claim, source_id, confidence}.

Concept 2: Temporal Data — As-Of Reasoning

Facts have lifespans. "CEO is Alice" was true once and may now be false. Memory layers without temporal metadata cause agents to confidently report stale facts as current — one of the most-tested Domain 5.6 scenarios.

💡 Everyday Analogy

Compare a weather forecast to a weather record. The forecast for tomorrow expires tonight — tomorrow's actual weather supersedes it. The record of yesterday's high is permanent — nothing supersedes it. They have completely different temporal semantics.

The pain: a planning agent reads "expected high tomorrow: 78°F" and uses it to recommend a t-shirt. By the time the user reads the recommendation, "tomorrow" has come and gone, and the actual high was 52°F.

The mapping: every fact in your knowledge base has a valid_from and a valid_to. Forecasts have a near-future valid_to. Records have a permanent one. CEOs, prices, and policy versions have lifespans that need explicit tracking. Confusing them is the bug.

📐 Technical Definition

Temporal data handling is the practice of storing every fact with explicit {value, valid_from, valid_to, source} metadata, and querying with explicit temporal predicates. The default fact query is "what's true now?": SELECT * WHERE valid_to IS NULL. The "as-of" query is "what was true at time T?": SELECT * WHERE valid_from <= T AND (valid_to IS NULL OR valid_to > T). Most temporal bugs come from omitting the valid_to predicate, which silently returns every historical version.
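As a rough illustration of the two query shapes, here is a sketch using Python's built-in sqlite3 module. The facts table, its entity and attribute columns, the chunk_a/chunk_b source IDs, and the exact dates are all assumptions made for the example; only the {value, valid_from, valid_to, source} fields and the two predicates come from the definition above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts (entity TEXT, attribute TEXT, value TEXT, "
    "valid_from TEXT, valid_to TEXT, source TEXT)"
)
# ISO-8601 date strings compare correctly as plain text.
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("Acme", "ceo", "Bob", "2022-01-01", "2024-07-01", "chunk_a"),
        ("Acme", "ceo", "Alice", "2024-07-01", None, "chunk_b"),
    ],
)


def current_value(entity: str, attribute: str):
    """'What's true now?' keeps only the row whose valid_to is NULL."""
    row = conn.execute(
        "SELECT value FROM facts "
        "WHERE entity = ? AND attribute = ? AND valid_to IS NULL",
        (entity, attribute),
    ).fetchone()
    return row[0] if row else None


def value_as_of(entity: str, attribute: str, t: str):
    """'What was true at time T?' uses the half-open interval [valid_from, valid_to)."""
    row = conn.execute(
        "SELECT value FROM facts "
        "WHERE entity = ? AND attribute = ? "
        "AND valid_from <= ? AND (valid_to IS NULL OR valid_to > ?)",
        (entity, attribute, t, t),
    ).fetchone()
    return row[0] if row else None


print(current_value("Acme", "ceo"))              # current CEO
print(value_as_of("Acme", "ceo", "2023-06-15"))  # CEO in mid-2023
```

Dropping the valid_to predicate from either query would return both rows, which is the stale-fact bug the definition warns about.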

The same query returns a different "current CEO" answer depending on the as-of time:

Same Question, Different As-Of Times
CEO: Bob (2022 to 2024-Q3), then Alice (2024-Q3 onward)
Headquarters: Boston (permanent)
Stock price: $42, then $58, then $71
Query as of 2024-Q3: CEO=Alice, HQ=Boston, Stock=$58
⚠️ Common Misconceptions

"Episodic memory has timestamps, so I'm covered." — Episodic memory captures when the event happened, not when the underlying fact was valid. A note from 2023 saying "Bob is CEO" doesn't tell you when Bob became CEO or stopped being CEO — only that you wrote that note in 2023.

"Most facts don't change, so I can ignore temporal handling." — Most facts do change in regulated domains (healthcare coverage, legal status, organizational structure). The cert exam is heavy on these domains because they're where stale facts cause real harm.

"I can just sort by latest timestamp." — This works for “most recent answer” but breaks for as-of queries. A user asking "what was the policy in 2023?" needs the row valid then, not the latest row in your table.

🎓 Cert Tip — Domain 5.6 (Temporal)

Trap question pattern: an agent confidently reports "Acme's CEO is Bob" but it's been Alice for 6 months. What's the bug class? Always temporal — the query omitted the valid_to IS NULL filter and got the historical row. This is one of the most-tested scenarios on the cert. Memorize: valid_to IS NULL is "current"; explicit timestamp is "as-of."

Concept 3: Stratified Sampling + Field-Level Confidence

When you route extractions to human reviewers, sampling matters as much as the extraction itself. Aggregate confidence is too coarse. Top-N-by-confidence over-reviews the easy cases. Uniform sampling under-reviews the hard ones. The cert pattern is stratified sampling: N samples from each confidence bucket, so reviewers see the full distribution.

💡 Everyday Analogy

Quality control on a factory line. You don't pull widgets purely at random: that's uniform sampling, and most of the inspector's time goes to obviously good units. You don't only inspect the shiniest widgets either: that's top-N sampling, and you'd never find defects.

The pain: factories that sample uniformly waste inspector time on obviously-good batches. Factories that sample top-N never see the defective batches at all.

The mapping: stratified QC inspects N units from each grade tier: A-grade, B-grade, C-grade. Every tier gets inspected, so defects surface wherever they live. The same logic applies to LLM extractions: sample from each confidence bucket so reviewers see high-, medium-, and low-confidence cases, and the low-confidence bucket, where defects concentrate, is never skipped.

Field-level confidence is the second half of the trick. Aggregate document confidence hides defects in individual fields. A document with 88% overall confidence might have one field at 30% confidence — the field that determines whether you approve a $50K healthcare claim. Field-level scores let you escalate just that field, not the whole document.
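A small sketch of field-level escalation, under stated assumptions: the field names, the confidence values, the 0.70 field threshold, and the 0.85 document threshold are all invented for illustration, and fields_to_escalate is a hypothetical helper.

```python
def fields_to_escalate(field_confidence: dict[str, float], threshold: float = 0.70) -> list[str]:
    """Return the names of fields whose confidence falls below the threshold."""
    return sorted(name for name, conf in field_confidence.items() if conf < threshold)


# Hypothetical per-field confidences for one extracted claim form.
doc = {
    "patient_name": 0.99,
    "member_id": 0.99,
    "cpt_code": 0.98,
    "diagnosis_code": 0.98,
    "date_of_service": 0.98,
    "billed_amount": 0.30,  # the one field that decides the payout
}

overall = sum(doc.values()) / len(doc)
print(f"overall confidence: {overall:.2f}")
print(f"document-level escalation (overall < 0.85): {overall < 0.85}")
print(f"field-level escalation: {fields_to_escalate(doc)}")
```

The document-level check passes this form through untouched, while the field-level check routes only billed_amount to a reviewer.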

Top-N vs. Stratified Sampling — Same Reviewer Budget

❌ Top-N (sample 6 highest-confidence)

HIGH conf (90%+): 6 sampled / 6 in bucket
MED conf (70-90%): 0 sampled / 8 in bucket
LOW conf (<70%): 0 sampled / 4 in bucket
Reviewers see: all easy cases. Defects in low-conf bucket: 100% missed.

✅ Stratified (sample 2 from each bucket)

HIGH conf (90%+): 2 sampled / 6 in bucket
MED conf (70-90%): 2 sampled / 8 in bucket
LOW conf (<70%): 2 sampled / 4 in bucket
Reviewers see: full distribution. Defects in low-conf: 50% caught.
🎓 Cert Tip — Domain 5.5 (Stratified + Field-Level)

Two trap patterns to recognize: (1) "Send the top-N most-confident extractions to human review" — always wrong; reviewers see only correct cases and learn nothing. (2) "Escalate when overall document confidence drops below X%" — weak; aggregate hides per-field defects. The cert-correct answer is always stratified sampling + field-level confidence, in combination.
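The top-N versus stratified comparison can be reproduced with a short sketch. The bucket boundaries (0.90 and 0.70) follow the figure above; the item IDs, the per-item confidence values, and the helper names bucket_of, top_n, and stratified are assumptions made for the example.

```python
import random


def bucket_of(confidence: float) -> str:
    """Map a confidence score to the bucket used in the comparison above."""
    if confidence >= 0.90:
        return "high"
    if confidence >= 0.70:
        return "med"
    return "low"


def top_n(items: list[dict], n: int) -> list[dict]:
    """Anti-pattern: take the n most confident items."""
    return sorted(items, key=lambda x: x["confidence"], reverse=True)[:n]


def stratified(items: list[dict], per_bucket: int, rng: random.Random) -> list[dict]:
    """Take up to per_bucket random items from each confidence bucket."""
    by_bucket = {"high": [], "med": [], "low": []}
    for item in items:
        by_bucket[bucket_of(item["confidence"])].append(item)
    sample = []
    for bucket in by_bucket.values():
        sample.extend(rng.sample(bucket, min(per_bucket, len(bucket))))
    return sample


# 6 high-, 8 medium-, and 4 low-confidence extractions, as in the figure.
items = (
    [{"id": f"h{i}", "confidence": 0.95} for i in range(6)]
    + [{"id": f"m{i}", "confidence": 0.80} for i in range(8)]
    + [{"id": f"l{i}", "confidence": 0.50} for i in range(4)]
)

rng = random.Random(0)  # seeded only so the example is repeatable
top = top_n(items, 6)
strat = stratified(items, 2, rng)

print("top-N buckets:     ", [bucket_of(x["confidence"]) for x in top])
print("stratified buckets:", [bucket_of(x["confidence"]) for x in strat])
```

Both strategies spend the same six reviews; only the stratified one reaches the medium and low buckets.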

Concept 4: Synthesis Output — Well-Established vs. Contested Claims

When sources agree, the agent should say so confidently. When sources disagree, the agent must surface the disagreement, not silently pick one. This is the cert pattern that fails most often in practice — agents that “helpfully” reconcile contradictions by picking one side.

💡 Everyday Analogy

A meta-analysis paper. "Result X is replicated across 7 studies" reads very differently from "Result Y is supported by 3 studies and contradicted by 2." A reader can act on the first; the second requires more digging. Don't collapse them.

The pain: a healthcare agent that reads two payer documents disagreeing on coverage for CPT 99213. It picks the more recent one and reports "Covered." The reviewer downstream has no idea there was a contradiction — the disagreement was lost.

The mapping: your output schema must support a claim_status: "established" | "contested" | "single_source" field. When status is contested, the output must include both source pointers, not just one.

Same two source documents, two different synthesis behaviors:

Synthesis — Same Sources, Two Output Categories

Sources

[chunk_5] Aetna policy 2024: "CPT 99213 covered for established patient visits with documented chronic condition."
[chunk_9] BCBS policy 2024: "CPT 99213 NOT covered without prior authorization for chronic conditions."
[chunk_3] CMS guidance 2024: "CPT 99213 is a Level 3 established patient E/M service."
[chunk_12] AMA CPT manual: "CPT 99213 represents 20-29 minutes of medical decision making."

Synthesis output

established: CPT 99213 = Level 3 established patient E/M, 20-29 min decision making.
  sources: [chunk_3, chunk_12]
contested: Coverage for chronic conditions DIFFERS by payer.
  Aetna: covered [chunk_5]
  BCBS: requires prior auth [chunk_9]
single: Aetna requires "documented" chronic condition, an ambiguous criterion.
  source: [chunk_5] · needs_review: true
✅ What Just Happened?

Four sources, three output categories. Sources 3 and 12 agree on the definition — that's an established claim with two source pointers. Sources 5 and 9 contradict on coverage — that's a contested claim that surfaces both, in opposition. Source 5 makes a separate claim with no corroboration — that's a single-source claim flagged for review. The schema makes the agent's epistemic state legible to whoever reads the output downstream — whether that's a human reviewer, a downstream agent, or an audit log.

🎓 Cert Tip — Domain 5.6 (Synthesis)

The most common Domain 5.6 trap: a scenario describing two sources that disagree on a fact. The agent picks the “more recent” or “more authoritative” source and reports it. This is wrong. Even if the agent picks correctly, the cert-correct behavior is to surface the disagreement as a contested claim with both source pointers, and let the downstream consumer (often a human) decide. Silent disagreement resolution is non-compliant regardless of which source is “right.”
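One way to keep silent resolution from slipping through is to validate the synthesis object before it leaves the agent. The sketch below is an assumption-laden illustration: validate_synthesis is a hypothetical helper, and the field names (sources, sources_a, sources_b, source) are borrowed from the code walkthrough later in this module.

```python
def validate_synthesis(synthesis: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the output is well-formed."""
    errors = []
    for i, item in enumerate(synthesis.get("established", [])):
        # "Established" means corroborated by at least two distinct sources.
        if len(set(item.get("sources", []))) < 2:
            errors.append(f"established[{i}]: needs at least two distinct sources")
    for i, item in enumerate(synthesis.get("contested", [])):
        # A contested claim must point at both sides of the disagreement.
        if not item.get("sources_a") or not item.get("sources_b"):
            errors.append(f"contested[{i}]: both sides must carry source pointers")
    for i, item in enumerate(synthesis.get("single", [])):
        if not item.get("source"):
            errors.append(f"single[{i}]: missing source pointer")
    return errors


good = {
    "established": [{"claim": "CPT 99213 is a Level 3 E/M service",
                     "sources": ["chunk_3", "chunk_12"], "confidence": 0.9}],
    "contested": [{"claim_a": "Covered", "sources_a": ["chunk_5"],
                   "claim_b": "Requires prior auth", "sources_b": ["chunk_9"]}],
    "single": [{"claim": "Requires documented chronic condition",
                "source": "chunk_5", "confidence": 0.8, "needs_review": True}],
}

# The trap scenario: the agent kept one side and dropped the other.
silently_resolved = {
    "contested": [{"claim_a": "Covered", "sources_a": ["chunk_5"],
                   "claim_b": "", "sources_b": []}],
}

print(validate_synthesis(good))
print(validate_synthesis(silently_resolved))
```

A check like this only verifies shape. It cannot tell whether a disagreement was missed entirely, which is why contested detection still depends on the pairwise classification step.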

Code Walkthrough: The ProvenancedSynthesizer Class

We'll build the synthesizer as four functions that together take a list of retrieved chunks and a query, and return a synthesis object with all four output categories (established, contested, single-source, temporal warnings). Annotated in 4 chunks.

Chunk 1: Per-Chunk Claim Extraction with Confidence

WHAT: Each chunk is sent to Claude with a structured tool that extracts every distinct claim plus a per-claim confidence. WHY: We need claim-level (not document-level) granularity to detect agreement and contradiction. GOTCHA: Don't ask for a single “summary” per chunk — that collapses claims and you lose the ability to detect disagreement.

import anthropic
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    source_id: str
    confidence: float  # 0.0 - 1.0
    valid_from: str = ""  # ISO date
    valid_to: str | None = None

EXTRACT_TOOL = {
    "name": "extract_claims",
    "description": "Extract every distinct factual claim from a source chunk.",
    "input_schema": {
        "type": "object",
        "properties": {
            "claims": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "text": {"type": "string"},
                        "confidence": {"type": "number"},
                        "valid_from": {"type": "string"},
                        "valid_to": {"type": "string"},
                    },
                    "required": ["text", "confidence"],
                },
            },
        },
        "required": ["claims"],
    },
}

def extract_claims(chunk_id: str, chunk_text: str) -> list[Claim]:
    """Send one chunk to Claude; return a list of Claim objects."""
    client = anthropic.Anthropic()
    try:
        result = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=800,
            tools=[EXTRACT_TOOL],
            tool_choice={"type": "tool", "name": "extract_claims"},
            messages=[{"role": "user",
                       "content": f"Extract every factual claim from this source.\n\n{chunk_text}"}],
        )
    except anthropic.APIError as e:
        print(f"Extraction failed for {chunk_id}: {e.message}")
        return []
    # A forced tool_choice should yield exactly one tool_use block; return an
    # empty list if it is missing or its input was truncated by max_tokens.
    raw = next((b.input for b in result.content if b.type == "tool_use"), None)
    if not raw:
        return []
    return [Claim(text=c["text"], source_id=chunk_id,
                  confidence=c["confidence"],
                  valid_from=c.get("valid_from", ""),
                  valid_to=c.get("valid_to"))
            for c in raw.get("claims", [])]

// JavaScript version (synthesizer.mjs)
import Anthropic from '@anthropic-ai/sdk';

const EXTRACT_TOOL = {
  name: 'extract_claims',
  description: 'Extract every distinct factual claim from a source chunk.',
  input_schema: {
    type: 'object',
    properties: {
      claims: {
        type: 'array',
        items: {
          type: 'object',
          properties: {
            text: { type: 'string' },
            confidence: { type: 'number' },
            valid_from: { type: 'string' },
            valid_to: { type: 'string' },
          },
          required: ['text', 'confidence'],
        },
      },
    },
    required: ['claims'],
  },
};

export async function extractClaims(chunkId, chunkText) {
  const client = new Anthropic();
  try {
    const result = await client.messages.create({
      model: 'claude-haiku-4-5-20251001',
      max_tokens: 800,
      tools: [EXTRACT_TOOL],
      tool_choice: { type: 'tool', name: 'extract_claims' },
      messages: [{ role: 'user',
                   content: `Extract every factual claim from this source.\n\n${chunkText}` }],
    });
    const raw = result.content.find(b => b.type === 'tool_use').input;
    return raw.claims.map(c => ({
      text: c.text,
      source_id: chunkId,
      confidence: c.confidence,
      valid_from: c.valid_from || '',
      valid_to: c.valid_to || null,
    }));
  } catch (error) {
    console.error(`Extraction failed for ${chunkId}: ${error.message}`);
    return [];
  }
}

Chunk 2: Agreement Detection — Cluster Claims by Entailment

WHAT: Group claims that say the same thing across different chunks. WHY: This is what produces the “well-established” vs. “contested” vs. “single-source” categories. GOTCHA: Don't use raw string match. Use a small Claude call to classify pairs as agree / contradict / unrelated. Two claims that paraphrase each other should still cluster as agreement.

CLASSIFY_TOOL = {
    "name": "classify_pair",
    "description": "Classify whether two claims agree, contradict, or are unrelated.",
    "input_schema": {
        "type": "object",
        "properties": {
            "relation": {"type": "string", "enum": ["agree", "contradict", "unrelated"]},
            "topic": {"type": "string", "description": "If agree or contradict, the shared topic."},
        },
        "required": ["relation"],
    },
}

def classify_pair(client, c1: Claim, c2: Claim) -> tuple[str, str]:
    """Return (relation, topic). relation in {agree, contradict, unrelated}."""
    try:
        result = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=200,
            tools=[CLASSIFY_TOOL],
            tool_choice={"type": "tool", "name": "classify_pair"},
            messages=[{"role": "user", "content":
                f"Claim A (source {c1.source_id}): {c1.text}\n"
                f"Claim B (source {c2.source_id}): {c2.text}\n\n"
                "Classify the relation."}],
        )
        raw = next(b.input for b in result.content if b.type == "tool_use")
        return raw["relation"], raw.get("topic", "")
    except anthropic.APIError as e:
        return "unrelated", ""  # fail safe: don't cluster on errors

def cluster_claims(client, claims: list[Claim]) -> tuple[list, list, list]:
    """Return (established, contested, single_source) lists.
    Pairwise classification is O(n^2); fine for tens of claims, switch to
    embedding-based clustering for hundreds.
    """
    n = len(claims)
    visited = [False] * n
    established, contested, single = [], [], []

    for i in range(n):
        if visited[i]:
            continue
        agree_with = [i]
        contradicts = []
        for j in range(i + 1, n):
            if visited[j]:
                continue
            relation, topic = classify_pair(client, claims[i], claims[j])
            if relation == "agree":
                agree_with.append(j); visited[j] = True
            elif relation == "contradict":
                contradicts.append(j)

        visited[i] = True
        # De-duplicate source IDs: two claims from the same chunk are not corroboration.
        agree_sources = list(dict.fromkeys(claims[k].source_id for k in agree_with))
        if contradicts:
            contested.append({
                "claim_a": claims[i].text,
                "sources_a": agree_sources,
                "claim_b": claims[contradicts[0]].text,
                # Keep every contradicting source, not just the first one.
                "sources_b": list(dict.fromkeys(claims[k].source_id for k in contradicts)),
            })
            for k in contradicts:
                visited[k] = True
        elif len(agree_sources) >= 2:
            established.append({
                "claim": claims[i].text,
                "sources": agree_sources,
                "confidence": sum(claims[k].confidence for k in agree_with) / len(agree_with),
            })
        else:
            single.append({
                "claim": claims[i].text,
                "source": claims[i].source_id,
                "confidence": claims[i].confidence,
                "needs_review": claims[i].confidence < 0.85,
            })
    return established, contested, single

// JavaScript version
const CLASSIFY_TOOL = {
  name: 'classify_pair',
  description: 'Classify whether two claims agree, contradict, or are unrelated.',
  input_schema: {
    type: 'object',
    properties: {
      relation: { type: 'string', enum: ['agree', 'contradict', 'unrelated'] },
      topic: { type: 'string' },
    },
    required: ['relation'],
  },
};

async function classifyPair(client, c1, c2) {
  try {
    const result = await client.messages.create({
      model: 'claude-haiku-4-5-20251001',
      max_tokens: 200,
      tools: [CLASSIFY_TOOL],
      tool_choice: { type: 'tool', name: 'classify_pair' },
      messages: [{
        role: 'user',
        content: `Claim A (source ${c1.source_id}): ${c1.text}\n` +
                 `Claim B (source ${c2.source_id}): ${c2.text}\n\nClassify the relation.`,
      }],
    });
    const raw = result.content.find(b => b.type === 'tool_use').input;
    return [raw.relation, raw.topic || ''];
  } catch {
    return ['unrelated', '']; // fail safe
  }
}

export async function clusterClaims(client, claims) {
  const n = claims.length;
  const visited = new Array(n).fill(false);
  const established = [], contested = [], single = [];

  for (let i = 0; i < n; i++) {
    if (visited[i]) continue;
    const agreeWith = [i];
    const contradicts = [];
    for (let j = i + 1; j < n; j++) {
      if (visited[j]) continue;
      const [relation] = await classifyPair(client, claims[i], claims[j]);
      if (relation === 'agree') { agreeWith.push(j); visited[j] = true; }
      else if (relation === 'contradict') contradicts.push(j);
    }
    visited[i] = true;
    // De-duplicate source IDs: two claims from the same chunk are not corroboration.
    const agreeSources = [...new Set(agreeWith.map(k => claims[k].source_id))];
    if (contradicts.length) {
      contested.push({
        claim_a: claims[i].text,
        sources_a: agreeSources,
        claim_b: claims[contradicts[0]].text,
        // Keep every contradicting source, not just the first one.
        sources_b: [...new Set(contradicts.map(k => claims[k].source_id))],
      });
      contradicts.forEach(k => { visited[k] = true; });
    } else if (agreeSources.length >= 2) {
      established.push({
        claim: claims[i].text,
        sources: agreeSources,
        confidence: agreeWith.reduce((s, k) => s + claims[k].confidence, 0) / agreeWith.length,
      });
    } else {
      single.push({
        claim: claims[i].text,
        source: claims[i].source_id,
        confidence: claims[i].confidence,
        needs_review: claims[i].confidence < 0.85,
      });
    }
  }
  return { established, contested, single };
}

Chunk 3: Temporal Flagging

WHAT: Compare each claim's valid_to against the query time; flag stale facts. WHY: Users querying "current" state need stale facts surfaced as warnings, not silently included. GOTCHA: A missing valid_to is ambiguous — could be permanent or could be unknown. Treat “unknown” differently from “permanent” in the output.

from datetime import date

def flag_temporal(claims: list[Claim], as_of: date) -> list[dict]:
    """Return a list of temporal warnings for stale claims."""
    warnings = []
    for c in claims:
        # valid_to is set AND in the past relative to query time -> stale
        if c.valid_to:
            try:
                vt = date.fromisoformat(c.valid_to)
                if vt < as_of:
                    warnings.append({
                        "claim": c.text,
                        "as_of": c.valid_to,
                        "query_as_of": as_of.isoformat(),
                        "freshness": "stale",
                        "source": c.source_id,
                    })
            except ValueError:
                pass  # malformed date, skip
        # valid_to is None means "current" - don't flag
        # No valid_from at all means "unknown temporal" - that's a single-source review trigger
    return warnings

// JavaScript version
export function flagTemporal(claims, asOf) {
  const warnings = [];
  for (const c of claims) {
    if (c.valid_to) {
      const vt = new Date(c.valid_to);
      if (!isNaN(vt) && vt < asOf) {
        warnings.push({
          claim: c.text,
          as_of: c.valid_to,
          query_as_of: asOf.toISOString().slice(0, 10),
          freshness: 'stale',
          source: c.source_id,
        });
      }
    }
    // valid_to null = current; missing valid_from = unknown temporal (single-source trigger)
  }
  return warnings;
}
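The gotcha above asks for "unknown" to be kept apart from "permanent", which flag_temporal as written does not do. A minimal sketch of one possible rule follows. It assumes, as the code comments suggest, that a missing valid_to with a known valid_from means the fact is still current, and that a claim with neither date is temporally unknown; temporal_status is a hypothetical helper, not part of the walkthrough.

```python
from datetime import date
from typing import Optional


def temporal_status(valid_from: str, valid_to: Optional[str], as_of: date) -> str:
    """Classify a claim as 'stale', 'current', or 'unknown' relative to as_of."""
    if valid_to:
        try:
            end = date.fromisoformat(valid_to)
        except ValueError:
            # A malformed end date tells us nothing reliable about validity.
            return "unknown"
        return "stale" if end < as_of else "current"
    if not valid_from:
        # Neither bound is known: surface it instead of assuming it is current.
        return "unknown"
    # Known start, open end: treated as still valid (assumption).
    return "current"


as_of = date(2025, 1, 15)
print(temporal_status("2022-01-01", "2023-12-31", as_of))
print(temporal_status("2024-10-01", None, as_of))
print(temporal_status("", None, as_of))
```

Claims that come back as "unknown" could be routed to the same review queue as single-source claims, so that missing dates are visible to a reviewer instead of being read as "current".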

Chunk 4: Stratified Sampler for Human Review

WHAT: Pick N claims from each output bucket for reviewer queues. WHY: The cert-correct sampling pattern. GOTCHA: When a bucket has fewer items than N, take all of them — don't pad with adjacent buckets.

import random

def stratified_sample(synthesis: dict, per_bucket: int = 1) -> list[dict]:
    """Pick `per_bucket` items from each output category for human review."""
    review_queue = []
    for bucket_name in ("established", "contested", "single", "temporal_warnings"):
        bucket = synthesis.get(bucket_name, [])
        n = min(len(bucket), per_bucket)
        if n == 0:
            continue
        chosen = random.sample(bucket, n)
        for item in chosen:
            review_queue.append({"bucket": bucket_name, **item})
    return review_queue

// JavaScript version
export function stratifiedSample(synthesis, perBucket = 1) {
  const reviewQueue = [];
  for (const bucketName of ['established', 'contested', 'single', 'temporal_warnings']) {
    const bucket = synthesis[bucketName] || [];
    const n = Math.min(bucket.length, perBucket);
    if (n === 0) continue;
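    // Fisher-Yates shuffle, then take N (sorting with a random comparator is biased)
    const shuffled = [...bucket];
    for (let i = shuffled.length - 1; i > 0; i--) {
      const j = Math.floor(Math.random() * (i + 1));
      [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
    }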
    for (const item of shuffled.slice(0, n)) {
      reviewQueue.push({ bucket: bucketName, ...item });
    }
  }
  return reviewQueue;
}
✅ What Just Happened?

Four chunks, four cert sub-topics. Chunk 1 gives you provenance (every claim is tagged with its chunk ID as it comes out of the tool_use output). Chunk 2 gives you the synthesis output buckets via agreement detection. Chunk 3 handles the temporal layer. Chunk 4 wires up stratified sampling for human review. The lab below assembles all four against a hand-crafted Healthcare Pre-Auth fixture.

Hands-On Lab: Healthcare Pre-Auth Synthesizer

Lab Overview

What you'll build: A synthesizer that processes 8 retrieved chunks about a single CPT code's coverage policy across multiple payer documents. It produces 4 output buckets (established, contested, single-source, temporal warnings), runs a stratified human-review sampler, and asserts that contested claims always include both source pointers.

Time: 35-45 minutes · Domain: A (Healthcare Pre-Authorization)

Prerequisites: Python 3.10+ or Node.js 18+, Anthropic API key. The four functions from the walkthrough above.

Files you'll create:

  • preauth_chunks.json — the 8-chunk fixture (provided below)
  • synthesizer.py (or synthesizer.mjs) — the four functions from the walkthrough
  • run_lab.py (or run_lab.mjs) — the lab runner

Environment Setup

# Python
mkdir m27b-domain56-lab && cd m27b-domain56-lab
python -m venv venv && source venv/bin/activate    # Windows: venv\Scripts\activate
pip install anthropic
export ANTHROPIC_API_KEY=sk-ant-...

# Node.js (alternative)
mkdir m27b-domain56-lab && cd m27b-domain56-lab
npm init -y && npm install @anthropic-ai/sdk
export ANTHROPIC_API_KEY=sk-ant-...

Step 1: Drop In the 8-Chunk Pre-Auth Fixture

Save as preauth_chunks.json. The fixture is hand-crafted: chunks 1-3 agree on CPT definition (will become established), chunks 4-5 contradict on coverage (will become contested), chunk 6 is from 2022 (will become temporal warning), chunks 7-8 are single-source claims about exclusion criteria.

{
  "query": "Is CPT 99213 covered for chronic condition follow-up?",
  "as_of": "2025-01-15",
  "chunks": [
    {"chunk_id": "chunk_1", "text": "CMS Manual 2024: CPT 99213 is a Level 3 established patient evaluation and management service.", "valid_to": null},
    {"chunk_id": "chunk_2", "text": "AMA CPT 2024: Code 99213 represents 20-29 minutes of medical decision making for an established patient.", "valid_to": null},
    {"chunk_id": "chunk_3", "text": "AAPC coding guide 2024: CPT 99213 is the moderate-complexity Level 3 E/M code for established patient visits.", "valid_to": null},
    {"chunk_id": "chunk_4", "text": "Aetna policy 2024-Q4: CPT 99213 is COVERED for established patient visits with documented chronic conditions.", "valid_to": null},
    {"chunk_id": "chunk_5", "text": "BCBS policy 2024-Q4: CPT 99213 requires PRIOR AUTHORIZATION for chronic condition management beyond 4 visits per year.", "valid_to": null},
    {"chunk_id": "chunk_6", "text": "Aetna policy 2022: CPT 99213 was covered without prior auth for any chronic condition.", "valid_to": "2023-12-31"},
    {"chunk_id": "chunk_7", "text": "Aetna policy 2024-Q4: Off-label use of any E/M code is excluded from coverage.", "valid_to": null},
    {"chunk_id": "chunk_8", "text": "BCBS policy 2024-Q4: Step therapy required before approving CPT 99213 for chronic condition management.", "valid_to": null}
  ]
}
✅ Checkpoint: File roughly 1.5 KB. The deliberate structure: 3 agreeing chunks, 2 contradicting chunks, 1 stale chunk (valid_to in 2023), 2 single-source claims.

Step 2: Run Synthesis — Verify the Four Output Buckets

What & why: Wire up the four functions from the walkthrough into a single runner. The output should match the structure the fixture was hand-crafted to produce.

"""run_lab.py - Process the 8-chunk fixture, emit synthesis, verify shape."""
import json
import anthropic
from datetime import date
from synthesizer import extract_claims, cluster_claims, flag_temporal, stratified_sample

def main():
    with open("preauth_chunks.json") as f:
        fixture = json.load(f)

    print(f"Query: {fixture['query']}")
    print(f"As-of: {fixture['as_of']}\n")

    # Step A: extract claims from each chunk
    all_claims = []
    for chunk in fixture["chunks"]:
        claims = extract_claims(chunk["chunk_id"], chunk["text"])
        for c in claims:
            c.valid_to = chunk.get("valid_to")
        all_claims.extend(claims)

    print(f"Extracted {len(all_claims)} claims across {len(fixture['chunks'])} chunks\n")

    # Step B: cluster into established / contested / single
    client = anthropic.Anthropic()
    established, contested, single = cluster_claims(client, all_claims)

    # Step C: temporal flagging
    as_of = date.fromisoformat(fixture["as_of"])
    temporal_warnings = flag_temporal(all_claims, as_of)

    synthesis = {
        "established": established,
        "contested": contested,
        "single": single,
        "temporal_warnings": temporal_warnings,
    }

    # Step D: print synthesis
    print("=" * 60)
    print("SYNTHESIS OUTPUT")
    print("=" * 60)
    for bucket in ("established", "contested", "single", "temporal_warnings"):
        print(f"\n[{bucket}] {len(synthesis[bucket])} item(s):")
        for item in synthesis[bucket]:
            # One line per item (no indent=) so the 200-char preview stays on a single row.
            print(f"  - {json.dumps(item, default=str)[:200]}")

    return synthesis

if __name__ == "__main__":
    main()
// run_lab.mjs - Process the 8-chunk fixture, emit synthesis, verify shape.
import { readFileSync } from 'fs';
import Anthropic from '@anthropic-ai/sdk';
import { extractClaims, clusterClaims, flagTemporal, stratifiedSample } from './synthesizer.mjs';

async function main() {
  const fixture = JSON.parse(readFileSync('preauth_chunks.json', 'utf-8'));
  console.log(`Query: ${fixture.query}`);
  console.log(`As-of: ${fixture.as_of}\n`);

  const allClaims = [];
  for (const chunk of fixture.chunks) {
    const claims = await extractClaims(chunk.chunk_id, chunk.text);
    claims.forEach(c => { c.valid_to = chunk.valid_to; });
    allClaims.push(...claims);
  }
  console.log(`Extracted ${allClaims.length} claims across ${fixture.chunks.length} chunks\n`);

  const client = new Anthropic();
  const { established, contested, single } = await clusterClaims(client, allClaims);
  const temporalWarnings = flagTemporal(allClaims, new Date(fixture.as_of));

  const synthesis = { established, contested, single, temporal_warnings: temporalWarnings };

  console.log('='.repeat(60));
  console.log('SYNTHESIS OUTPUT');
  console.log('='.repeat(60));
  for (const bucket of ['established', 'contested', 'single', 'temporal_warnings']) {
    console.log(`\n[${bucket}] ${synthesis[bucket].length} item(s):`);
    synthesis[bucket].forEach(item => {
      console.log(`  - ${JSON.stringify(item).slice(0, 200)}`);
    });
  }
  return synthesis;
}

main().catch(err => { console.error(err); process.exit(1); });

Run it:

python run_lab.py     # or: node run_lab.mjs
Expected Output (item counts; exact text varies):
Query: Is CPT 99213 covered for chronic condition follow-up?
As-of: 2025-01-15

Extracted 11 claims across 8 chunks

============================================================
SYNTHESIS OUTPUT
============================================================

[established] 1 item(s):
  - {"claim": "CPT 99213 is a Level 3 E/M for established patients", "sources": ["chunk_1", "chunk_2", "chunk_3"], ...}

[contested] 1 item(s):
  - {"claim_a": "CPT 99213 covered for chronic conditions", "sources_a": ["chunk_4"], "claim_b": "CPT 99213 requires prior authorization", "sources_b": ["chunk_5"], ...}

[single] 2 item(s):
  - {"claim": "Off-label use excluded", "source": "chunk_7", ...}
  - {"claim": "Step therapy required", "source": "chunk_8", ...}

[temporal_warnings] 1 item(s):
  - {"claim": "CPT 99213 was covered without prior auth", "as_of": "2023-12-31", "freshness": "stale", "source": "chunk_6"}
✅ Checkpoint — verify all four:
  • Established: at least 1 cluster with 3+ source pointers (the CPT definition agreement)
  • Contested: exactly 1 item, with both sources_a AND sources_b populated
  • Single: 2 items with needs_review flag
  • Temporal warnings: exactly 1, pointing at chunk_6 with freshness: "stale"
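The four checks above can also be run programmatically rather than by eye. The sketch below assumes each bucket item is a plain dict with the keys shown in the expected output (`sources`, `sources_a`, `sources_b`, `source`, `freshness`) plus the `needs_review` flag; `check_synthesis_shape` is a hypothetical helper, and if your synthesizer returns objects instead of dicts the lookups would need adapting.

```python
def check_synthesis_shape(synthesis: dict) -> list[str]:
    """Return a list of checkpoint violations; an empty list means all four pass."""
    problems = []

    # Established: at least one cluster backed by 3+ source pointers.
    established = synthesis.get("established", [])
    if not any(len(item.get("sources", [])) >= 3 for item in established):
        problems.append("established: no cluster with 3+ sources")

    # Contested: exactly one item, with both sides carrying sources.
    contested = synthesis.get("contested", [])
    if len(contested) != 1:
        problems.append(f"contested: expected 1 item, got {len(contested)}")
    for item in contested:
        if not item.get("sources_a") or not item.get("sources_b"):
            problems.append("contested: item missing sources_a or sources_b")

    # Single: two items, each flagged for review.
    single = synthesis.get("single", [])
    if len(single) != 2:
        problems.append(f"single: expected 2 items, got {len(single)}")
    if not all(item.get("needs_review") for item in single):
        problems.append("single: item without needs_review flag")

    # Temporal warnings: exactly one, pointing at the stale chunk_6.
    warnings = synthesis.get("temporal_warnings", [])
    if not (len(warnings) == 1
            and warnings[0].get("source") == "chunk_6"
            and warnings[0].get("freshness") == "stale"):
        problems.append("temporal_warnings: expected exactly one stale warning for chunk_6")

    return problems


# Hand-built synthesis mirroring the expected output above.
expected = {
    "established": [{"claim": "CPT 99213 is a Level 3 E/M for established patients",
                     "sources": ["chunk_1", "chunk_2", "chunk_3"]}],
    "contested": [{"claim_a": "CPT 99213 covered for chronic conditions",
                   "sources_a": ["chunk_4"],
                   "claim_b": "CPT 99213 requires prior authorization",
                   "sources_b": ["chunk_5"]}],
    "single": [{"claim": "Off-label use excluded", "source": "chunk_7", "needs_review": True},
               {"claim": "Step therapy required", "source": "chunk_8", "needs_review": True}],
    "temporal_warnings": [{"claim": "CPT 99213 was covered without prior auth",
                           "freshness": "stale", "source": "chunk_6"}],
}
print(check_synthesis_shape(expected))  # []
```

Returning a list of problems instead of raising on the first one lets you see every failed checkpoint in a single run.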

Step 3: Run Stratified Sampling for Reviewer Queue

What & why: Pick 1 from each bucket for the human-review queue. Verify reviewers see one item from each category — not the “most confident” or “most recent”.

# Append to run_lab.py after main(). The entry point at the bottom replaces the
# earlier `if __name__ == "__main__":` block, so main() runs only once.
def review_step(synthesis: dict):
    queue = stratified_sample(synthesis, per_bucket=1)
    print("\n" + "=" * 60)
    print("HUMAN REVIEW QUEUE (stratified, 1 per bucket)")
    print("=" * 60)
    for item in queue:
        print(f"\n  [bucket: {item['bucket']}]")
        print(f"  {json.dumps({k: v for k, v in item.items() if k != 'bucket'}, default=str)[:180]}")
    print(f"\nTotal queue size: {len(queue)} items")

if __name__ == "__main__":
    synthesis = main()
    review_step(synthesis)
// Append to run_lab.mjs after main()
function reviewStep(synthesis) {
  const queue = stratifiedSample(synthesis, 1);
  console.log('\n' + '='.repeat(60));
  console.log('HUMAN REVIEW QUEUE (stratified, 1 per bucket)');
  console.log('='.repeat(60));
  queue.forEach(item => {
    const { bucket, ...rest } = item;
    console.log(`\n  [bucket: ${bucket}]`);
    console.log(`  ${JSON.stringify(rest).slice(0, 180)}`);
  });
  console.log(`\nTotal queue size: ${queue.length} items`);
}

// Replace the earlier `main().catch(...)` line with this entry point so main() runs only once.
const synthesis = await main();
reviewStep(synthesis);
✅ Checkpoint: Queue size = 4 items, one from each bucket. Compare with naive top-N on a production-sized pool: it would fill all 4 slots with the highest-confidence items, typically all from the established bucket, and reviewers would never see the contested or stale items.
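To make the top-N comparison concrete, here is a toy side-by-side. It is a sketch under stated assumptions, not the walkthrough's `stratified_sample`: the pool, its confidence scores, and the helper names `top_n` and `stratified` are all hypothetical, and the pool is larger than the lab fixture so that top-N has enough established items to fill the queue.

```python
import random


def top_n(pool: list[dict], n: int) -> list[dict]:
    """Naive baseline: take the n highest-confidence items."""
    return sorted(pool, key=lambda item: item["confidence"], reverse=True)[:n]


def stratified(pool: list[dict], per_bucket: int, seed: int = 0) -> list[dict]:
    """Take up to per_bucket random items from every bucket."""
    rng = random.Random(seed)  # seeded so the demo is repeatable
    by_bucket: dict[str, list[dict]] = {}
    for item in pool:
        by_bucket.setdefault(item["bucket"], []).append(item)
    queue = []
    for items in by_bucket.values():
        queue.extend(rng.sample(items, min(per_bucket, len(items))))
    return queue


# Hypothetical pool: many confident established claims, few risky ones.
pool = (
    [{"bucket": "established", "confidence": 0.90 + i / 100} for i in range(5)]
    + [{"bucket": "contested", "confidence": 0.55}]
    + [{"bucket": "single", "confidence": 0.60}, {"bucket": "single", "confidence": 0.50}]
    + [{"bucket": "temporal_warnings", "confidence": 0.40}]
)

print(sorted({item["bucket"] for item in top_n(pool, 4)}))
# ['established']
print(sorted({item["bucket"] for item in stratified(pool, per_bucket=1)}))
# ['contested', 'established', 'single', 'temporal_warnings']
```

Both queues hold 4 items, but only the stratified one puts a contested claim and a stale claim in front of a reviewer.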

Step 4: Add the Regression Test for Contested Claims

What & why: The most common Domain 5.6 trap is silently picking one source on disagreement. Add an assertion that catches this regression.

def assert_contested_have_both_sources(synthesis: dict):
    """Cert-correct contested claims must include BOTH source pointers."""
    failures = []
    for c in synthesis.get("contested", []):
        # Flag each item once, whether it lacks a source pointer or a claim text.
        missing_sources = not c.get("sources_a") or not c.get("sources_b")
        missing_claims = not c.get("claim_a") or not c.get("claim_b")
        if missing_sources or missing_claims:
            failures.append(c)
    if failures:
        raise AssertionError(
            f"REGRESSION: {len(failures)} contested item(s) missing a source pointer or claim text. "
            f"This is the canonical Domain 5.6 trap. Fix the synthesizer.")
    print(f"\n✅ All {len(synthesis.get('contested', []))} contested claim(s) have both source pointers.")

# Final entry point; it replaces the Step 3 version so main() still runs only once.
if __name__ == "__main__":
    synthesis = main()
    review_step(synthesis)
    assert_contested_have_both_sources(synthesis)
function assertContestedHaveBothSources(synthesis) {
  const failures = [];
  for (const c of synthesis.contested || []) {
    // Flag each item once, whether it lacks a source pointer or a claim text.
    const missingSources = !c.sources_a?.length || !c.sources_b?.length;
    const missingClaims = !c.claim_a || !c.claim_b;
    if (missingSources || missingClaims) failures.push(c);
  }
  if (failures.length) {
    throw new Error(
      `REGRESSION: ${failures.length} contested item(s) missing a source pointer or claim text. ` +
      `This is the canonical Domain 5.6 trap. Fix the synthesizer.`);
  }
  console.log(`\n✅ All ${synthesis.contested?.length || 0} contested claim(s) have both source pointers.`);
}

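// Final entry point; it replaces the Step 3 version (declaring `synthesis` twice is a SyntaxError).
const synthesis = await main();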
reviewStep(synthesis);
assertContestedHaveBothSources(synthesis);
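One gap worth noting: the assertion above passes vacuously when the contested list is empty, which is exactly what a synthesizer produces when it silently picks one source. A possible complement, sketched below, checks that the disagreement known to exist in the fixture actually surfaces. The helper name `assert_known_conflict_is_contested` is hypothetical; the chunk_4 / chunk_5 pair comes from the expected output, and the items are assumed to be dicts with `sources_a` and `sources_b` lists.

```python
def assert_known_conflict_is_contested(
    synthesis: dict,
    expected_pair: tuple[str, str] = ("chunk_4", "chunk_5"),
) -> None:
    """Fail unless the fixture's known conflict appears as a contested item."""
    a, b = expected_pair
    for item in synthesis.get("contested", []):
        side_a = set(item.get("sources_a", []))
        side_b = set(item.get("sources_b", []))
        # The two chunks must sit on opposite sides, in either order.
        if (a in side_a and b in side_b) or (b in side_a and a in side_b):
            return
    raise AssertionError(
        f"REGRESSION: known conflict between {a} and {b} is not in the contested "
        f"bucket. The synthesizer may have silently picked one source.")


# A synthesis that surfaced the disagreement passes.
good = {"contested": [{"sources_a": ["chunk_4"], "sources_b": ["chunk_5"]}]}
assert_known_conflict_is_contested(good)

# A synthesis that silently resolved it (empty contested bucket) fails.
try:
    assert_known_conflict_is_contested({"contested": []})
except AssertionError as err:
    print(f"caught: {err}")
```

Running both assertions together covers the two failure shapes: a contested item with a missing side, and a conflict that never reached the contested bucket at all.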
🎉 Congratulations! You've built a Domain 5.6-compliant synthesizer end-to-end. The four cert sub-topics — provenance, temporal, stratified review, synthesis — are now wired into one pipeline you can adapt to any RAG-style agent. Reach for this exact pattern when you encounter Domain 5.6 questions on the practice exams.

Stretch Goals (Optional)

  • Field-level confidence: Switch from claim-level to field-level extraction for structured outputs like {cpt_code, coverage, copay, exclusions[]} and route just the low-confidence fields to review.
  • Provenance audit: Add a provenance_audit() method that returns "every claim → which chunk(s)" for compliance review — the audit log a regulator would ask for.
  • Temporal diff: Run synthesis twice at different as_of dates; output the diff: which claims changed, became stale, or got new sources.
  • Confirmation-bias regression: Run the same synthesis twice in the same session vs. fresh sessions, and assert the outputs match (catches Domain 4.5 violations).
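For the field-level confidence stretch goal, one possible starting point is a router that splits an extraction into accepted fields and fields needing review. This is a sketch only: the `{"value": ..., "confidence": ...}` shape per field, the 0.7 threshold, the sample values, and the name `route_fields` are all assumptions for illustration.

```python
def route_fields(extraction: dict, threshold: float = 0.7) -> tuple[dict, dict]:
    """Split fields into (accepted, needs_review) by per-field confidence."""
    accepted, needs_review = {}, {}
    for field, scored in extraction.items():
        # Each field is judged on its own score, never on a document average.
        target = accepted if scored["confidence"] >= threshold else needs_review
        target[field] = scored
    return accepted, needs_review


# Hypothetical extraction: one critical field is weak while the rest look strong.
extraction = {
    "cpt_code": {"value": "99213", "confidence": 0.97},
    "coverage": {"value": "covered", "confidence": 0.30},
    "copay": {"value": "25.00", "confidence": 0.91},
    "exclusions": {"value": ["off-label use"], "confidence": 0.88},
}

accepted, needs_review = route_fields(extraction)
print(sorted(accepted))      # ['copay', 'cpt_code', 'exclusions']
print(sorted(needs_review))  # ['coverage']
```

Only the coverage field goes to a reviewer; the other three fields are accepted without reviewing the whole document.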

Knowledge Check — 8 Cert-Style Questions

Cert-density quiz. If you score 7/8 or higher, you're ready for the practice exam's Domain 5.5 and 5.6 sections.

Q1: Which output format is exam-compliant for a RAG synthesizer?

A. Prose with parenthetical citations: "CPT 99213 is covered (Aetna 2024) but requires authorization (BCBS 2024)."
B. Structured: {claim, source_id, confidence} tuples per claim
C. A markdown summary that mentions sources at the end
D. Whichever format the user prefers
Correct answer: B. Provenance must live in the schema, not in the prose. Structured {claim, source_id, confidence} makes provenance enforceable and auditable. Prose with parenthetical citations may look correct but isn't machine-checkable, so the cert treats it as non-compliant.

Q2: A knowledge-base row should include which fields for proper temporal handling?

A. {value, timestamp}
B. {value, last_updated}
C. {value, created_at, source}
D. {value, valid_from, valid_to, source}
Correct answer: D. valid_from and valid_to capture when the FACT was valid, not when the row was written. valid_to: NULL means "current." This schema lets you query both "what's true now?" and "what was true as-of T?" Other shapes only support the latest-value query.

Q3: Two payer documents disagree on whether CPT 99213 is covered. Your agent picks the more recent one and reports "Covered." What's wrong?

A. Nothing: the more recent source is more authoritative
B. The agent should refuse to answer when sources conflict
C. The disagreement was lost: this should be a contested claim with both source pointers, not a silently-resolved one
D. The agent should pick the source with higher confidence score
Correct answer: C. This is the canonical Domain 5.6 trap. Even if the agent picks correctly, silent disagreement resolution is non-compliant. The schema must support claim_status: "contested" with both sources_a and sources_b so the downstream consumer (often a human) can decide. The cert tests this directly.

Q4: Which sampling strategy aligns with cert recommendations for a human-review batch?

A. Top-N: send the N most-confident extractions to reviewers
B. Stratified: N samples from each confidence bucket (high/med/low)
C. Uniform: random N from the full pool
D. Bottom-N: send only the lowest-confidence extractions
Correct answer: B. Stratified ensures reviewers see the full distribution. Top-N over-reviews easy cases (reviewers learn nothing). Uniform under-reviews low-confidence cases (where defects actually live). Bottom-N misses defects that occur even at high confidence (calibration errors).

Q5: Why does field-level confidence beat document-level confidence for high-stakes extraction?

A. An aggregate score hides per-field defects: a doc at 88% overall might have one critical field at 30%
B. Field-level scores are cheaper to compute
C. Document-level scores aren't supported by the Anthropic API
D. It lets you use a smaller model
Correct answer: A. The aggregate hides where it matters. In a $50K healthcare claim extraction, the field that determines approval might be at 30% confidence while the document averages 88%. Field-level lets you escalate just that field for human review without reviewing the whole document.

Q6: An agent confidently reports "Acme's CEO is Bob" but Bob hasn't been CEO for 6 months. What's the most likely bug class?

A. Hallucination: the agent fabricated the CEO name
B. Tool error: the lookup tool returned bad data
C. Prompt injection: the user altered the system prompt
D. Temporal: the query omitted valid_to IS NULL and returned the historical row
Correct answer: D. This is the most-tested Domain 5.6 scenario. Memory layers without temporal predicates return historical rows as if they were current. Memorize: valid_to IS NULL = "current"; explicit timestamp = "as-of."

Q7: When two retrieved chunks make the same claim using different wording (paraphrase), the synthesizer should:

A. Treat them as separate single-source claims
B. Pick whichever wording is shorter
C. Cluster them together as one well-established claim with both source pointers
D. Mark both as contested because the wording differs
Correct answer: C. Paraphrase = agreement, not single-source and not contested. Use a small entailment classifier (cheap Haiku call) to detect "these two say the same thing." Raw string match would split them into single-source claims and lose the corroboration.

Q8: When sources agree on a claim, should the synthesizer collapse them into one citation?

A. Yes: one citation per claim keeps output clean
B. No: preserve all source pointers; agreement count IS itself a quality signal
C. Pick the most authoritative source and discard the rest
D. Yes: redundant citations are a cost optimization opportunity
Correct answer: B. Three-source agreement is meaningfully different from one-source. The schema's claim_status: "established" + a list of source pointers preserves that signal for downstream consumers. Collapsing destroys the corroboration evidence and makes the output indistinguishable from a single-source claim.

Module Summary

Domain 5.5 + 5.6 Cheat Sheet

  • Provenance lives in the schema, not in the prose. Output must be {claim, source_id, confidence}. Prose-with-parenthetical-citations is non-compliant regardless of accuracy.
  • Temporal handling: {value, valid_from, valid_to, source}. valid_to IS NULL = "current." Missing valid_to filter is the most common temporal bug.
  • Synthesis output: 4 categories — established, contested, single-source, temporal-warning. Silent disagreement resolution is the canonical exam trap.
  • Stratified sampling beats top-N and uniform. Field-level confidence beats document-level. They compose — stratify the fields, not the documents.
  • Paraphrase = agreement, not contradiction. Use entailment classification, not string match.
  • Don't collapse corroborating sources. Three sources agreeing is a quality signal; preserve all pointers.
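The temporal bullet is easiest to remember with the two queries side by side. The sketch below uses an in-memory SQLite table with the {value, valid_from, valid_to, source} shape; the table name `facts`, the `entity` and `attribute` columns, the sample dates, the successor "Alice", and the choice to treat valid_to as an inclusive end date are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts (entity TEXT, attribute TEXT, value TEXT, "
    "valid_from TEXT, valid_to TEXT, source TEXT)"
)
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?, ?, ?, ?)",
    [
        # Historical row: closed interval, so valid_to is set.
        ("Acme", "ceo", "Bob", "2020-01-01", "2024-06-30", "doc_1"),
        # Current row: valid_to is NULL.
        ("Acme", "ceo", "Alice", "2024-07-01", None, "doc_2"),
    ],
)


def current_value(conn, entity, attribute):
    """'What is true now?' Only rows with valid_to IS NULL qualify."""
    return conn.execute(
        "SELECT value, source FROM facts "
        "WHERE entity = ? AND attribute = ? AND valid_to IS NULL",
        (entity, attribute),
    ).fetchone()


def value_as_of(conn, entity, attribute, as_of):
    """'What was true at as_of?' ISO date strings compare correctly as text."""
    return conn.execute(
        "SELECT value, source FROM facts "
        "WHERE entity = ? AND attribute = ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to >= ?)",
        (entity, attribute, as_of, as_of),
    ).fetchone()


print(current_value(conn, "Acme", "ceo"))              # ('Alice', 'doc_2')
print(value_as_of(conn, "Acme", "ceo", "2023-03-01"))  # ('Bob', 'doc_1')
```

Dropping the `valid_to IS NULL` predicate from the first query would let the historical Bob row match as well, and which row comes back first is then unspecified: that is the stale-CEO bug from Q6.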

What's Next: The Practice Exam

You've completed the deepest cert-prep module. The next step is a full timed practice exam. M27 (Cert Exam Prep) covers exam strategy, question patterns, and the final review — pair it with the practice exams from the cert provider. Aim for 80%+ on Domain 5.6 questions specifically.