
M03B: Context Engineering — Curating What the Model Sees

M03 taught you how to write the message. This module teaches you how to curate everything the model sees on each turn — system prompt, tools, history, retrieved docs, tool results — under a finite budget. Get this right and every later module (RAG, memory, multi-agent, caching) snaps into place.

Learning Objectives

Distinguish prompt engineering (writing the message) from context engineering (curating what the model sees) and explain why both matter
Inventory the six layers Claude reads on every API call: system, tools, history, retrieved docs, tool results, current turn
Apply the four context-engineering levers — add, compress, retrieve, offload — to a context budget problem
Order static and dynamic content correctly so prompt caching can hit, and explain the lost-in-the-middle effect
Diagnose context rot in a long-running agent transcript and fix it without losing critical case facts

Prompt Engineering vs. Context Engineering

In M03 you learned prompt engineering: the craft of writing a single message that guides Claude toward the answer you want. Roles, few-shot examples, structured output instructions, chain-of-thought. That is the visible tip of the iceberg.

Underneath it is the bigger discipline: context engineering. Every API call assembles a stack of content for the model — the system prompt you wrote, the tool definitions you registered, the conversation history that has piled up, any retrieved documents, the results of past tool calls, and finally the user's current message. The model attends to all of it. Most learners only think about the last line they typed. The other 95% sets the constraints, the priors, and (when neglected) the failure modes.

Here's the working definition we'll use throughout the course:

📐 Technical Definition

Context engineering is the practice of deciding what content occupies the model's context window on each turn, in what order, and how it is compressed or fetched. (Context window: the fixed-size buffer of tokens the model reads on each call. For Claude Sonnet 4.6 it's up to 200K tokens, or 1M with the long-context variant. Everything, including system, tools, history, retrieved docs, tool results, the current message, AND the response, competes for this space.) Prompt engineering is a sub-discipline of context engineering that handles the authoring of one piece of that context (typically the system prompt or current user message).

Why split them out? Because the levers are different. Prompt engineering asks "what words go in this message?" Context engineering asks "what gets included at all, in what order, fresh or cached, full or summarized?" The first is a writing problem. The second is a budgeting problem. By M11 you'll be juggling both on every agent you build — this module makes the budgeting half explicit.

Concept 1: What Does the Model Actually See?

Before you can engineer context, you have to know what's in it. On every Anthropic API call, Claude receives a layered stack of content. Not all of it lives in the messages array you pass in: the system prompt and tool definitions travel in separate parameters, and everything is assembled into one sequence before the model reads it. Let's enumerate the six layers.

💡 Everyday Analogy

Before you walk into a courtroom, the lawyer hands the jury an evidence binder — pages of contracts, photos, prior testimony, and exhibits. The closing argument the lawyer delivers is what the jury hears last, but they've been reading the binder for hours.

The pain: rookie lawyers obsess over the closing argument and pay no attention to what's in the binder. The jury reaches its verdict based mostly on the binder. A great closing on top of a sloppy binder loses the case.

The mapping: the closing argument is your prompt. The binder is your context. Claude is the jury — it reads everything. If your retrieved chunks are irrelevant, your tool definitions are vague, or your history is full of resolved errors, no amount of clever prompting fixes the verdict.

The Six Layers, From Top to Bottom

Every Anthropic API call assembles these six layers into one giant prompt that the model reads in order. (Note: in the SDK you don't see "the prompt" as one string; you pass system, tools, and messages separately. Internally, though, the model receives a single sequence of tokens assembled from all of them.) Each one costs tokens, and each one influences the answer:

System prompt — your instructions, persona, output rules. Set once per conversation. Static across turns.
Tool definitions — the JSON Schema for every tool you've registered. Even if Claude doesn't use any, it reads them all to decide. Static across turns.
Conversation history — every prior user/assistant message, in order. Grows with every turn.
Retrieved documents — chunks injected by your RAG pipeline (M09). Different on every turn.
Tool results — the JSON output of past tool calls, embedded as messages. Accumulates with every tool use.
Current user turn — the message you just appended. Smallest, most-attended layer.
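As a rough sketch of where each layer travels in a request, the snippet below builds (but does not send) a payload as a plain dictionary. The text values and the tool-use ID are placeholders, and folding retrieved chunks into the current user message is one common convention assumed here, not something the API requires.

```python
# Illustrative payload only: values are placeholders and nothing is sent.
system_prompt = "You are a UCC filings research assistant."           # layer 1

tools = [{                                                             # layer 2
    "name": "search_filings",
    "description": "Search UCC filings by debtor name",
    "input_schema": {
        "type": "object",
        "properties": {"debtor": {"type": "string"}},
        "required": ["debtor"],
    },
}]

retrieved_chunks = ["(retrieved chunk text goes here)"]               # layer 4

messages = [
    # Layer 3: conversation history.
    {"role": "user", "content": "Research debtor Acme Corp"},
    {"role": "assistant", "content": [
        {"type": "text", "text": "Searching filings."},
        {"type": "tool_use", "id": "toolu_example_01",
         "name": "search_filings", "input": {"debtor": "Acme Corp"}},
    ]},
    # Layer 5: a tool result, sent back inside a user-role message.
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "toolu_example_01",
         "content": "12 filings found"},
    ]},
    {"role": "assistant", "content": "Found 12 filings. Want a breakdown?"},
    # Layers 4 and 6: retrieved chunks folded into the current user turn.
    {"role": "user", "content":
        "<reference_docs>\n" + "\n---\n".join(retrieved_chunks)
        + "\n</reference_docs>\n\nWhat's the latest filing?"},
]

request = {
    "model": "claude-haiku-4-5-20251001",
    "max_tokens": 1024,
    "system": system_prompt,
    "tools": tools,
    "messages": messages,
}
```

Only three of the six layers have their own top-level parameter or slot (system, tools, and the final message); history, tool results, and retrieved chunks all ride inside the messages list.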

Here's what that stack looks like for a UCC research agent on turn 8 of a session:

The Context Stack — Turn 8 of a UCC Research Agent

System prompt: 420 tok
Tool definitions (5 tools): 1,180 tok
Conversation history (7 prior turns): 3,250 tok
Retrieved RAG chunks (top 5): 2,800 tok
Tool results (4 prior calls): 1,950 tok
Current user message: 85 tok

Total input: 9,685 tokens

User message is 0.9% of total input

✅ What Just Happened?

The user typed 85 tokens. The model received 9,685 tokens. That means 99.1% of what shaped Claude's answer was content you assembled — not what the user said. Welcome to context engineering. From now on, when an agent gives a bad answer, your first question shouldn't be "was the prompt wrong?" — it should be "what did the model actually see?"
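The arithmetic behind those percentages can be checked with a few lines. This is a minimal sketch that hard-codes the per-layer counts from the turn-8 example; the dictionary keys are just illustrative labels.

```python
# Per-layer token counts from the turn-8 example above.
stack = {
    "system_prompt": 420,
    "tool_definitions": 1_180,
    "conversation_history": 3_250,
    "retrieved_chunks": 2_800,
    "tool_results": 1_950,
    "current_user_message": 85,
}

total = sum(stack.values())
user_share = stack["current_user_message"] / total

# Print the heaviest layers first, each with its share of the total.
for layer, tokens in sorted(stack.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{layer:<22}{tokens:>6} tok  {tokens / total:6.1%}")
print(f"{'total':<22}{total:>6} tok")
print(f"user message share: {user_share:.1%}")
```

Sorting by size makes the first curation target obvious: in this example history and retrieved chunks together account for well over half the input.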

⚠️ Common Misconceptions

"Tool definitions don't count if Claude doesn't use them." — They do. Every tool's JSON Schema is in the prompt the model reads, even on turns where no tool is called. 5 tools at ~250 tokens each = 1,250 tokens of overhead per turn.

"The model focuses on the user message." — The model attends across the whole window. With 99% of tokens elsewhere, those other layers dominate the output unless you actively curate them.

"Long context windows mean I can stop worrying about size." — A 1M-token window doesn't mean 1M useful tokens. Cost scales linearly with input. Latency rises with input. And a noisy 50K-token context often beats a noisy 500K-token context, because signal-to-noise matters more than raw capacity.

Concept 2: The Four Levers

Now that you can see the stack, the question becomes: what do you do when it gets too big, too noisy, or too stale? Every context-engineering decision — every one — comes down to one of four moves. The rest of Track 3 is just specialized applications of these four.

💡 Everyday Analogy

You're packing for a two-week trip and your suitcase is full. You have four options for any item. (1) Just throw it in. (2) Use packing cubes to compress your shirts so they take half the space. (3) Leave it home and ship it ahead to the hotel for pickup if you need it. (4) Hand it to a travel buddy who will carry it on their trip and meet you there.

The pain: people default to option 1, the bag bursts, and they end up paying overweight fees or leaving things behind randomly at the airport.

The mapping: those four moves are add, compress, retrieve, offload. They are the only four moves you have when curating context. Every RAG technique, every memory architecture, every multi-agent pattern you'll learn later is one of these four wearing a different costume.

Here are the four levers, with what each one does to a single fact:

A Single Fact, Through Four Levers

➕ ADD: Include the fact directly in context. Simple, but spends tokens every turn.
🔅 COMPRESS: Summarize or truncate the fact. Lossy but cheap. Good for old turns.
🔍 RETRIEVE: Store the fact externally, fetch it only when relevant. Lossless but adds a lookup step.
👪 OFFLOAD: Hand the fact to a subagent with its own context. Returns only a summary.

Original fact: "The 2019 GE Capital filing for Acme Corp lists collateral as 'all assets, tangible and intangible, including all goodwill, IP, and accounts receivable, valued at $42.3M as of filing date.'"

✅ What Just Happened?

The same fact lived four different lives. Add kept the full sentence verbatim but burned roughly 38 tokens every turn. Compress kept the gist (8 tokens) but lost the exact dollar amount: fine for "what does this filing cover?", fatal for "what's the collateral value?" Retrieve kept the full fact in a vector database, paid 0 tokens until the user asked something relevant, then fetched it. Offload spawned a subagent that read the whole filing, did the analysis, and returned only a 1-sentence verdict. There is no universally right answer; the lever depends on access frequency, freshness needs, and how lossy you can afford to be.

When to Reach for Each Lever

A simple decision heuristic, refined throughout Track 3:

Add when the fact is small, accessed every turn, and exact wording matters — e.g. policy rules, the user's name, current date.
Compress when the fact's gist matters but specifics are nice-to-have — e.g. "we already discussed the auth flow and concluded X" replacing 12 turns of debate.
Retrieve when the fact is one of many candidates and only relevant ones matter — e.g. one document out of 10,000 in a knowledge base.
Offload when the fact requires its own reasoning chain that would otherwise pollute your main context — e.g. "research the entire filing history of this debtor" handed to a subagent.
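One way to make the heuristic above concrete is a small decision function. This is only a sketch: the `Fact` fields, the `pick_lever` name, and the 200-token threshold for "small" are all assumptions chosen for illustration, not values from this course.

```python
from dataclasses import dataclass

@dataclass
class Fact:
    tokens: int                        # approximate size of the fact
    needed_every_turn: bool            # accessed on (almost) every turn?
    exact_wording_matters: bool        # would a paraphrase be harmful?
    candidate_pool_size: int = 1       # how many similar facts compete for the slot
    needs_own_reasoning: bool = False  # would it drag a long work chain into context?

def pick_lever(fact: Fact, small_fact_tokens: int = 200) -> str:
    """Return 'add', 'compress', 'retrieve', or 'offload' for a single fact."""
    if fact.needs_own_reasoning:
        return "offload"
    if fact.candidate_pool_size > 1 and not fact.needed_every_turn:
        return "retrieve"
    if fact.exact_wording_matters:
        # Lossless options only: keep small facts inline, fetch large ones on demand.
        return "add" if fact.tokens <= small_fact_tokens else "retrieve"
    return "compress"

policy_rule = Fact(tokens=40, needed_every_turn=True, exact_wording_matters=True)
auth_debate = Fact(tokens=3_000, needed_every_turn=False, exact_wording_matters=False)
kb_document = Fact(tokens=800, needed_every_turn=False, exact_wording_matters=True,
                   candidate_pool_size=10_000)
filing_history = Fact(tokens=50_000, needed_every_turn=False,
                      exact_wording_matters=False, needs_own_reasoning=True)

for name, fact in [("policy rule", policy_rule), ("auth debate", auth_debate),
                   ("kb document", kb_document), ("filing history", filing_history)]:
    print(f"{name}: {pick_lever(fact)}")
```

The order of the checks encodes a priority: isolating heavy reasoning comes first, then avoiding permanent token spend on rarely needed facts, and lossy compression is the fallback only when exact wording does not matter.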

Concept 3: Static vs. Dynamic Context

Inside your context stack, some layers don't change between turns — the system prompt, the tool definitions, sometimes a fixed reference doc. Other layers change constantly — the conversation history grows, retrieved chunks differ per query. This split has a name and a consequence.

📐 Technical Definition

Static context is content that is identical across multiple consecutive API calls in a session. Dynamic context changes between calls. The split matters because Anthropic's prompt caching only kicks in for prefixes that match exactly. (Prompt caching: a feature where Anthropic stores the result of processing a prefix of your prompt and reuses it on subsequent calls if the prefix is identical. Cached input tokens cost ~10% of regular input tokens. Cache lifetime is 5 minutes by default. Covered in detail in M22.) If you put dynamic content before static content, none of the static content can be cached, and you're billed full price every turn.

The Ordering Rule

The rule is simple: static blocks first, dynamic blocks last. The cache reuses a prefix as long as the prefix is byte-identical to a previous call. As soon as the prefix diverges, every byte after that point is "new" — even if it would have been identical further down. So the order has to be: static system, static tools, static reference doc, then dynamic history, then dynamic current turn.
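The prefix rule can be simulated without touching the API. The sketch below is a deliberately simplified model: each call is a list of blocks, and matching stops at the first block that differs from the previous call. The per-block token sizes are illustrative assumptions, and real caching also depends on where cache breakpoints are declared.

```python
def cached_prefix_tokens(previous_call, current_call):
    """Sum the tokens of leading blocks that are identical in both calls.

    Each call is a list of (block_text, token_count) tuples in send order.
    """
    cached = 0
    for (prev_text, _), (cur_text, cur_tokens) in zip(previous_call, current_call):
        if prev_text != cur_text:
            break  # the prefix diverged; everything from here on counts as new
        cached += cur_tokens
    return cached

# Static blocks (sizes are assumptions for illustration).
SYSTEM = ("system: You are...", 420)
TOOLS = ("tools: [search_filings, ...]", 1_180)
REFERENCE = ("reference_doc: UCC schema...", 2_600)

def turn(history_label, history_tokens, question, question_tokens, static_first=True):
    """Assemble one call's blocks in either static-first or dynamic-first order."""
    static = [SYSTEM, TOOLS, REFERENCE]
    history = [(history_label, history_tokens)]
    user = [(question, question_tokens)]
    return static + history + user if static_first else history + static + user

prev_good = turn("history: turns 1-6", 170, "user: earlier question", 60)
now_good = turn("history: turns 1-7", 200, "user: what's the latest filing?", 80)
prev_bad = turn("history: turns 1-6", 170, "user: earlier question", 60, static_first=False)
now_bad = turn("history: turns 1-7", 200, "user: what's the latest filing?", 80, static_first=False)

print("static-first cached tokens: ", cached_prefix_tokens(prev_good, now_good))
print("dynamic-first cached tokens:", cached_prefix_tokens(prev_bad, now_bad))
```

Treating history as a single block that changes every turn is a simplification; with append-only history, the unchanged early turns can also form part of a stable prefix.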

Same Content, Different Order — What the Cache Sees

✅ Static-First (cacheable)

system: "You are..." → HIT
tools: [search_filings, ...] → HIT
reference_doc: "UCC schema..." → HIT
history: turns 1–7 → NEW
user: "what's the latest filing?" → NEW

Cached: 4,200 tok · New: 280 tok · Cost: $0.002

❌ Dynamic-First (cache miss)

history: turns 1–7 → NEW
system: "You are..." → NEW
tools: [search_filings, ...] → NEW
reference_doc: "UCC schema..." → NEW
user: "what's the latest filing?" → NEW

Cached: 0 tok · New: 4,480 tok · Cost: $0.013

💰 Cost Implication

Same content, different ordering: the static-first version costs about 6.5x less per call once the cache warms up. Across 1,000 daily calls in a production agent, that's the difference between roughly $2/day and $13/day, or about $4,000 a year on ordering alone. M22 will cover the full caching mechanics; for now, just remember: static first.
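To check the cost figures yourself, here is a minimal worked calculation. It assumes a base input price of $3 per million tokens (an assumption chosen because it reproduces the per-call costs shown above) and cached reads at 10% of that price; it ignores output tokens and the one-time cost of writing the cache.

```python
PRICE_PER_MTOK = 3.00      # ASSUMED base input price, USD per million tokens
CACHED_READ_FACTOR = 0.10  # cached reads billed at ~10% of the base price

def call_cost(cached_tokens: int, new_tokens: int) -> float:
    """Input cost in USD for one call (no output tokens, no cache writes)."""
    per_token = PRICE_PER_MTOK / 1_000_000
    return cached_tokens * per_token * CACHED_READ_FACTOR + new_tokens * per_token

static_first = call_cost(cached_tokens=4_200, new_tokens=280)
dynamic_first = call_cost(cached_tokens=0, new_tokens=4_480)

calls_per_day = 1_000
daily_gap = (dynamic_first - static_first) * calls_per_day
annual_gap = daily_gap * 365

print(f"static-first:  ${static_first:.4f} per call")
print(f"dynamic-first: ${dynamic_first:.4f} per call")
print(f"ratio: {dynamic_first / static_first:.1f}x")
print(f"daily gap: ${daily_gap:.2f}   annual gap: ${annual_gap:,.0f}")
```

Swap in your own model's prices and call volume; the ratio depends only on how much of the input sits in the cached prefix.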

💰 Production Note — Prompt Caching

Prompt caching reuses prefixes that match exactly. A common production bug: identical content in different order produces no cache hits because the prefix breaks. Use cache_control markers to declare cacheable boundaries explicitly. Note: per the official cert exam guide, prompt-caching implementation details are out of scope — the exam expects you to know caching exists and to recognize prefix-stable ordering, but won't ask about cache TTL, hit rates, or pricing.
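As a hedged illustration of a cache_control marker, the sketch below builds request arguments with a single cache breakpoint on the last static system block. The helper name `build_cached_request` and its parameters are hypothetical; nothing is sent, and whether a prefix is actually cached also depends on details (such as minimum prefix length) covered in M22.

```python
def build_cached_request(system_text, reference_doc, tools, history, user_message,
                         model="claude-haiku-4-5-20251001"):
    """Build request kwargs with one cache breakpoint after the static blocks."""
    return {
        "model": model,
        "max_tokens": 1024,
        "tools": tools,
        "system": [
            {"type": "text", "text": system_text},
            # Breakpoint: the prefix up to and including this block is cacheable.
            {"type": "text", "text": reference_doc,
             "cache_control": {"type": "ephemeral"}},
        ],
        # Dynamic content comes after the breakpoint.
        "messages": list(history) + [{"role": "user", "content": user_message}],
    }

request = build_cached_request(
    system_text="You are a UCC filings research assistant.",
    reference_doc="UCC schema...",  # placeholder for a fixed reference document
    tools=[],
    history=[{"role": "user", "content": "Research debtor Acme Corp"},
             {"role": "assistant", "content": "Found 12 filings."}],
    user_message="What's the latest filing?",
)
# To send it: anthropic.Anthropic().messages.create(**request)
```

Keeping the marker on the last static block means history and the current turn can change freely without disturbing the cached prefix.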

Concept 4: Position Effects — Lost in the Middle

Even if every byte you put in context is high-quality, where you put it matters. Large language models attend more strongly to content at the start and end of the context window than to content in the middle. This isn't a bug to fix; it's a property of how attention layers work, replicated across nearly every model family. It's been documented under the name "lost in the middle," from the 2023 Liu et al. paper "Lost in the Middle: How Language Models Use Long Contexts." Researchers placed the same answer at different positions in a long context and measured retrieval accuracy. Accuracy was high at the start (~75%), high at the end (~70%), and dropped to ~45% in the middle.

💡 Everyday Analogy

Think about a 23-bullet meeting agenda. A week later, ask a participant what was decided. They'll remember bullet 1 (the opening goal), bullet 23 (the action items), and maybe one of the most heated discussions in the middle. Bullets 7–15? Forgotten.

The pain: organizers stuff the most important detail at bullet 12 because that's when the meeting hit its stride — and a week later nobody remembers it.

The mapping: Claude reads context the same way. Information at position 1 and at the very end is recalled with high fidelity. Information in the middle of a long context, even when literally present, is sometimes effectively invisible to the model's reasoning.

Here's what that looks like when researchers test the same fact at different positions:

Same Needle, Three Haystacks — Recall by Position

START: 76% recall
MIDDLE: 47% recall
END: 71% recall

Source: Liu et al. 2023. The pattern reproduces across most modern LLMs, including Claude, GPT-4, and Llama 2.

What This Means For Your Code

Three concrete rules drop out of the position-effects research:

Put the most critical instruction at the start of the system prompt, not buried at line 47.
Put the most-relevant retrieved chunk at the end of the retrieved-docs block (closest to the user's current question). Re-ranking pipelines exist exactly for this reason — they reorder chunks so the best one is last.
Repeat critical case facts — once in the system prompt (start position) and once just before the user message (end position). This redundancy costs ~50 tokens and meaningfully improves recall.
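The three rules above could be sketched as a pair of helpers. The function names, the `<case_facts>` tag, and the (score, text) input shape are assumptions made for this example; adapt them to whatever your retrieval pipeline returns.

```python
def order_for_recall(scored_chunks):
    """Sort (score, text) pairs ascending so the highest-scoring chunk comes last."""
    return [text for _, text in sorted(scored_chunks, key=lambda pair: pair[0])]

def build_system_prompt(case_facts, instructions):
    """Put critical case facts at the very start of the system prompt."""
    return f"<case_facts>\n{case_facts}\n</case_facts>\n\n{instructions}"

def build_user_turn(case_facts, scored_chunks, question):
    """Place chunks best-last, then repeat the case facts right before the question."""
    docs = "\n---\n".join(order_for_recall(scored_chunks))
    return (f"<reference_docs>\n{docs}\n</reference_docs>\n\n"
            f"<case_facts>\n{case_facts}\n</case_facts>\n\n{question}")

facts = "Debtor: Acme Corp. Scope: 2019+ filings only."
chunks = [(0.91, "Chunk about F-2020-0883"),
          (0.40, "Chunk about filing fees"),
          (0.75, "Chunk about F-2019-1101")]

system = build_system_prompt(facts, "You are a UCC filings research assistant.")
user_turn = build_user_turn(facts, chunks, "What's the latest filing?")
print(user_turn)
```

The case facts appear twice on purpose: once at the start position (system prompt) and once at the end position (just before the question).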

🎓 Cert Tip — Domain 5.1

The "lost in the middle" effect is a recurring exam topic. Trick questions describe a long-context agent that "reads" a 50-page document and ask why it misses a fact in the middle. The answer is always position effects, not "the model wasn't given the document." Mitigation: use immutable case-facts blocks at the START position, position the current query at the END.

Concept 5: Context Rot

Long-running agents accumulate junk. Tool results from searches that went nowhere. Plans that got abandoned three turns later. User instructions that were superseded ("actually, ignore that last bit"). Errors that were resolved. The model dutifully attends to all of it — including the parts that no longer apply. The result: by turn 30, your agent retries already-failed searches, contradicts its own corrections, and answers questions you never asked.

💡 Everyday Analogy

Think about an email thread that's been forwarded fifteen times with everyone leaving the full quoted history. The actual decision is buried under "Re: Re: Re: Re: FW: FW:" — and a new participant joining the thread acts on stale information from forward #3 because they read top-down and that's where the dollar figure happened to live.

The pain: every reply makes the thread harder to read, not easier. People reply to the wrong version of the spec. Decisions get re-litigated. The thread becomes net-negative information.

The mapping: the same thing happens in agent context windows. Each turn appends. Nothing self-cleans. By turn 25, the model is reading a thread that's mostly noise, and a small but critical fact at turn 12 is buried under tool-result garbage from turns 13–24.

📐 Technical Definition

Context rot is the degradation of agent quality caused by accumulated stale, contradictory, or low-relevance content in the context window. It's distinct from running out of tokens — you can hit context rot at 60% window utilization. The key signal is signal-to-noise ratio, not raw size.

Here's a real symptom you'll see in production: the agent works great for the first 10 turns, then starts visibly degrading. Token usage may still be well under the limit. The fix isn't a bigger window — it's a smaller, cleaner one. You compact, summarize, or evict the rotted content.

Before and After a Compaction Pass

BEFORE: 30 turns, lots of rot

user: research debtor "Acme Corp"

tool: search_filings("Acme") → 12 results

tool: search_filings("ACME CORP") → ERROR: rate limit

tool: search_filings("ACME CORP") → 12 results (duplicate)

user: actually ignore the 2018 filings

assistant: ok, focusing on 2019+

tool: get_filing("F-2018-04421") → {...}

user: I said ignore 2018!

assistant: apologies, switching to 2019+

tool: get_filing("F-2018-04422") → ERROR

user: what's the total collateral value across all 2019+ filings?

AFTER: 8 turns, clean signal

[summary] Researched debtor "Acme Corp." User narrowed scope to 2019+ filings only.

[case fact] Active filings (2019+): F-2019-1101, F-2019-2287, F-2020-0883, F-2020-1455

user: what's the total collateral value across all 2019+ filings?

✅ What Just Happened?

The compacted context isn't just shorter (3 messages shown vs. 11); it's cleaner. The duplicate and failed searches are gone. The repeated "ignore 2018" correction is folded into a positive case fact ("active filings = 2019+"). The current question is now adjacent to the relevant facts, not buried under turns of resolved errors. Same model, same question, dramatically better answer. This is what M08 will operationalize for conversation history specifically, but the principle is broader: compaction isn't just about budget, it's about signal.
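A rule-based pass can remove the most obvious rot before any summarization happens. The sketch below is a simplified illustration, assuming each message is a dict with string content and that tool output carries the hypothetical role "tool"; it only drops failed and exactly repeated tool results, then pins a case-facts message in front of the most recent messages. Older turns are dropped without being summarized, so treat it as a lossy starting point.

```python
def evict_rot(history):
    """Drop failed tool results and exact repeats of earlier tool results."""
    seen = set()
    kept = []
    for msg in history:
        if msg["role"] == "tool":
            if "ERROR" in msg["content"]:
                continue  # failed call: nothing useful to keep
            if msg["content"] in seen:
                continue  # identical result is already in context
            seen.add(msg["content"])
        kept.append(msg)
    return kept

def compact(history, case_facts, keep_recent=2):
    """Pin case facts up front, then keep only the last few cleaned messages."""
    cleaned = evict_rot(history)
    pinned = {"role": "user", "content": "[case fact] " + case_facts}
    return [pinned] + cleaned[-keep_recent:]

history = [
    {"role": "user", "content": 'research debtor "Acme Corp"'},
    {"role": "tool", "content": 'search_filings("Acme") -> 12 results'},
    {"role": "tool", "content": 'search_filings("ACME CORP") -> ERROR: rate limit'},
    {"role": "tool", "content": 'search_filings("Acme") -> 12 results'},
    {"role": "user", "content": "actually ignore the 2018 filings"},
    {"role": "user", "content": "what's the total collateral value across all 2019+ filings?"},
]

for msg in compact(history, "Scope: 2019+ filings only"):
    print(f'{msg["role"]}: {msg["content"]}')
```

In practice you would combine a pass like this with the summarization step from the code walkthrough, so that facts from evicted turns survive in compressed form.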

Code Walkthrough: The ContextBudget Class

Time to make this concrete. We're going to build a small class that takes the six layers of your context, accounts for tokens in each, and applies one of the four levers when the total exceeds a budget. You'll use this directly in the lab and reuse the pattern in M08 (history) and M11 (memory).

Here's the design: ContextBudget is initialized with a target budget and a strategy. You feed it the six layers via constructor. It exposes account() for the per-layer breakdown, fits() for the budget check, and build_messages() for the assembled message array ready for the Anthropic SDK. We'll annotate the implementation in four chunks.

Chunk 1: Per-layer Token Accounting

WHAT: The constructor stores the six layers separately and counts tokens in each. WHY: You can't apply a lever intelligently without knowing which layer is the heaviest. GOTCHA: Don't rely on len(text) // 4 as a token count for billing: that approximation can be off by 20%+ for tool definitions and JSON-heavy content. Use the SDK's token-counting endpoint, client.messages.count_tokens(), for production code; we'll use a fast heuristic plus a real API call wrapper for the lab.

import json
from dataclasses import dataclass, field

# Fast heuristic: ~4 chars per token. Good enough for rough budgeting decisions.
# For production cost tracking, use client.messages.count_tokens().
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

@dataclass
class ContextBudget:
    """Accounts for and curates the six layers of context."""
    system_prompt: str = ""
    tool_definitions: list = field(default_factory=list)  # list of dicts
    history: list = field(default_factory=list)            # list of {"role", "content"}
    retrieved_docs: list = field(default_factory=list)     # list of strings
    tool_results: list = field(default_factory=list)       # list of strings
    current_user_message: str = ""
    target_budget: int = 8000  # tokens

    def account(self) -> dict:
        """Return per-layer token counts (heuristic estimates)."""
        return {
            "system": estimate_tokens(self.system_prompt),
            "tools": estimate_tokens(json.dumps(self.tool_definitions)),
            "history": sum(estimate_tokens(m["content"]) for m in self.history),
            "retrieved": sum(estimate_tokens(d) for d in self.retrieved_docs),
            "tool_results": sum(estimate_tokens(r) for r in self.tool_results),
            "current": estimate_tokens(self.current_user_message),
        }

    def total(self) -> int:
        return sum(self.account().values())

    def fits(self) -> bool:
        return self.total() <= self.target_budget

// Fast heuristic: ~4 chars per token. Good enough for rough budgeting decisions.
// For production cost tracking, use client.messages.countTokens().
function estimateTokens(text) {
  return Math.max(1, Math.floor((text || '').length / 4));
}

export class ContextBudget {
  constructor({
    systemPrompt = '',
    toolDefinitions = [],
    history = [],            // [{ role, content }]
    retrievedDocs = [],      // [string]
    toolResults = [],        // [string]
    currentUserMessage = '',
    targetBudget = 8000,
  } = {}) {
    Object.assign(this, {
      systemPrompt, toolDefinitions, history,
      retrievedDocs, toolResults, currentUserMessage, targetBudget,
    });
  }

  account() {
    return {
      system: estimateTokens(this.systemPrompt),
      tools: estimateTokens(JSON.stringify(this.toolDefinitions)),
      history: this.history.reduce((s, m) => s + estimateTokens(m.content), 0),
      retrieved: this.retrievedDocs.reduce((s, d) => s + estimateTokens(d), 0),
      toolResults: this.toolResults.reduce((s, r) => s + estimateTokens(r), 0),
      current: estimateTokens(this.currentUserMessage),
    };
  }

  total() {
    return Object.values(this.account()).reduce((a, b) => a + b, 0);
  }

  fits() {
    return this.total() <= this.targetBudget;
  }
}
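For the "real API call wrapper" mentioned in the gotcha, one possible shape is sketched below. It assumes a recent `anthropic` Python SDK that exposes `client.messages.count_tokens()`, accepts an injected client so it can be exercised offline, and falls back to the 4-characters-per-token heuristic on any failure. The function name `count_input_tokens` is hypothetical.

```python
import json

def estimate_tokens(text: str) -> int:
    """Fast heuristic: roughly 4 characters per token."""
    return max(1, len(text) // 4)

def count_input_tokens(system, messages, tools=None, *, client=None,
                       model="claude-haiku-4-5-20251001"):
    """Ask the API for an input-token count; fall back to the heuristic on failure."""
    try:
        if client is None:
            import anthropic  # imported lazily so the fallback works without the SDK
            client = anthropic.Anthropic()
        kwargs = {"model": model, "messages": messages}
        if system:
            kwargs["system"] = system
        if tools:
            kwargs["tools"] = tools
        return client.messages.count_tokens(**kwargs).input_tokens
    except Exception as exc:  # broad on purpose: missing SDK, no key, network error
        print(f"count_tokens unavailable ({exc}); using heuristic estimate.")

    total = estimate_tokens(system) if system else 0
    if tools:
        total += estimate_tokens(json.dumps(tools))
    for message in messages:
        content = message["content"]
        total += estimate_tokens(content if isinstance(content, str)
                                 else json.dumps(content))
    return total
```

Passing `client=` explicitly also lets you reuse one client across many calls instead of constructing a new one each time.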

Chunk 2: The Compress Lever

WHAT: When the budget is busted, replace old turns with a Claude-generated summary, keeping the most recent k turns verbatim. WHY: History is usually the heaviest layer in long sessions, and old turns have lower relevance per token than recent ones. GOTCHA: Use a separate Claude call to write the summary, with the explicit instruction "preserve all named entities, IDs, and numeric values." Without that instruction, summarization eats critical specifics first — this is the cert-tip flag for Domain 5.1.

import anthropic

SUMMARY_INSTRUCTION = """Summarize the following conversation in 3-5 sentences.
CRITICAL: preserve every named entity (people, companies, IDs), exact numeric
values (dollar amounts, dates, counts), and any explicit user instructions.
Drop only redundant or resolved tool errors and abandoned plans.

Conversation:
{transcript}

Summary:"""

def compress_history(history: list, keep_recent: int = 6) -> list:
    """Summarize older turns; keep the last `keep_recent` turns verbatim."""
    if len(history) <= keep_recent:
        return history  # nothing to compress

    older = history[:-keep_recent]
    recent = history[-keep_recent:]

    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    client = anthropic.Anthropic()
    try:
        result = client.messages.create(
            model="claude-haiku-4-5-20251001",  # cheap model for summarization
            max_tokens=400,
            messages=[{"role": "user",
                       "content": SUMMARY_INSTRUCTION.format(transcript=transcript)}],
        )
        summary_text = result.content[0].text
    except anthropic.APIError as e:
        # If summarization fails, fall back to truncation rather than crashing.
        print(f"Summarization failed: {e.message}; falling back to truncation.")
        return recent

    summary_msg = {"role": "user",
                   "content": f"[Summary of {len(older)} earlier turns] {summary_text}"}
    return [summary_msg] + recent

import Anthropic from '@anthropic-ai/sdk';

const SUMMARY_INSTRUCTION = `Summarize the following conversation in 3-5 sentences.
CRITICAL: preserve every named entity (people, companies, IDs), exact numeric
values (dollar amounts, dates, counts), and any explicit user instructions.
Drop only redundant or resolved tool errors and abandoned plans.

Conversation:
{transcript}

Summary:`;

export async function compressHistory(history, keepRecent = 6) {
  if (history.length <= keepRecent) return history;

  const older = history.slice(0, -keepRecent);
  const recent = history.slice(-keepRecent);
  const transcript = older.map(m => `${m.role}: ${m.content}`).join('\n');

  const client = new Anthropic();
  let summaryText;
  try {
    const result = await client.messages.create({
      model: 'claude-haiku-4-5-20251001',
      max_tokens: 400,
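      messages: [{ role: 'user',
                   // Function replacer: inserts the transcript literally, so sequences
                   // like "$&" or "$$" in it are not treated as replacement patterns.
                   content: SUMMARY_INSTRUCTION.replace('{transcript}', () => transcript) }],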
    });
    summaryText = result.content[0].text;
  } catch (error) {
    console.error(`Summarization failed: ${error.message}; falling back to truncation.`);
    return recent;
  }

  const summaryMsg = {
    role: 'user',
    content: `[Summary of ${older.length} earlier turns] ${summaryText}`,
  };
  return [summaryMsg, ...recent];
}
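Because the summary is lossy, it can help to verify that the specifics survived before swapping it into history. The sketch below is one possible check, not part of the walkthrough's class: the regular expressions assume filing IDs shaped like F-2019-1101 and dollar amounts shaped like $42.3M, and the function name `missing_specifics` is hypothetical.

```python
import re

ID_PATTERN = re.compile(r"\bF-\d{4}-\d+\b")                        # e.g. F-2019-1101
MONEY_PATTERN = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?[MKB]?")    # e.g. $42.3M

def missing_specifics(older_turns, summary):
    """Return IDs and dollar amounts found in the older turns but absent from the summary."""
    source = "\n".join(m["content"] for m in older_turns)
    wanted = set(ID_PATTERN.findall(source)) | set(MONEY_PATTERN.findall(source))
    return sorted(item for item in wanted if item not in summary)

older = [
    {"role": "user", "content": "Filing F-2019-1101 lists collateral of $42.3M."},
    {"role": "assistant", "content": "Also found F-2020-0883."},
]

good_summary = "Acme filings F-2019-1101 and F-2020-0883; collateral $42.3M."
bad_summary = "Discussed two Acme filings and their collateral."

print(missing_specifics(older, good_summary))
print(missing_specifics(older, bad_summary))
```

If the returned list is non-empty, a reasonable policy is to retry the summary or append the missing items to it verbatim rather than accept the loss.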

Chunk 3: Strategy Dispatch

WHAT: One method picks the right lever based on the budget overflow. WHY: You don't want strategy selection scattered across your agent code — centralize it. GOTCHA: The order of fallback matters — we compress history before evicting retrieved docs, because retrieval was already curated by the RAG pipeline whereas history is "everything that ever happened." If that still overflows, we'd reach for retrieve/offload (left as stretch goals).

    def apply_compress(self, keep_recent: int = 6) -> "ContextBudget":
        """Compress history; return self for chaining."""
        self.history = compress_history(self.history, keep_recent=keep_recent)
        return self

    def apply_strategy(self, strategy: str = "auto") -> "ContextBudget":
        """Dispatch to a specific lever, or auto-pick based on overflow.
        Strategies: 'compress', 'truncate_retrieved', 'auto', 'noop'.
        """
        if self.fits() or strategy == "noop":
            return self

        if strategy == "compress" or strategy == "auto":
            self.apply_compress()

        if not self.fits() and strategy in ("truncate_retrieved", "auto"):
            # Last resort: keep only the top 2 retrieved chunks.
            self.retrieved_docs = self.retrieved_docs[:2]

        return self

// Add these methods to ContextBudget class

  async applyCompress(keepRecent = 6) {
    this.history = await compressHistory(this.history, keepRecent);
    return this;
  }

  async applyStrategy(strategy = 'auto') {
    if (this.fits() || strategy === 'noop') return this;

    if (strategy === 'compress' || strategy === 'auto') {
      await this.applyCompress();
    }

    if (!this.fits() && (strategy === 'truncateRetrieved' || strategy === 'auto')) {
      this.retrievedDocs = this.retrievedDocs.slice(0, 2);
    }
    return this;
  }

Chunk 4: Static-First Assembly

WHAT: build_messages() assembles the final SDK payload with static blocks first, dynamic last: the cache-friendly order from Concept 3. WHY: This is where Concept 3's ordering rule turns into actual money saved. GOTCHA: Retrieved docs are dynamic (they change per query), so this implementation folds them into the final user message, after history and right next to the question. That keeps the system and tools prefix stable for caching and puts the chunks in the high-recall end position from Concept 4.

    def build_messages(self) -> tuple[str, list, list]:
        """Return (system, tools, messages) ready for client.messages.create().
        Order: STATIC (system, tools), then DYNAMIC (history, retrieved docs + current turn).
        """
        # Retrieved docs are wrapped in a block and prepended to the current user message,
        # so they sit right before the question (closer to the question = better recall).
        retrieved_block = ""
        if self.retrieved_docs:
            retrieved_block = "\n\n<reference_docs>\n" + \
                "\n---\n".join(self.retrieved_docs) + "\n</reference_docs>\n\n"

        messages = list(self.history)  # copy to avoid mutating caller's list
        # Append current user turn with retrieved docs at the END (high-recall position).
        messages.append({
            "role": "user",
            "content": retrieved_block + self.current_user_message,
        })
        return self.system_prompt, self.tool_definitions, messages

  buildMessages() {
    let retrievedBlock = '';
    if (this.retrievedDocs.length) {
      retrievedBlock = '\n\n<reference_docs>\n' +
        this.retrievedDocs.join('\n---\n') + '\n</reference_docs>\n\n';
    }

    const messages = [
      ...this.history,
      { role: 'user', content: retrievedBlock + this.currentUserMessage },
    ];
    return {
      system: this.systemPrompt,
      tools: this.toolDefinitions,
      messages,
    };
  }

✅ What Just Happened?

You now have a reusable class that turns the four levers into actual code: account() shows you what's heavy, apply_strategy("auto") picks the right lever, build_messages() emits the SDK payload in cache-friendly order. In the lab below you'll point this at a 30-turn rotted transcript, watch it shrink, and confirm the answer quality holds. From M08 forward, you'll see this exact pattern recur — just specialized: M08 swaps the compress strategy, M09 fills retrieved_docs, M14 spawns subagents to populate one of the layers.

Hands-On Lab: The Poisoned Transcript

Lab Overview

What you'll build: A diagnostic that reads a 30-turn rotted UCC research transcript, prints per-layer token breakdown, applies the compress lever, and shows the before/after token + answer-quality delta.

Time: 25-35 minutes · Level: Beginner → Intermediate

Prerequisites: Python 3.10+ or Node.js 18+, an Anthropic API key, the ContextBudget class from the walkthrough above.

Files you'll create:

poisoned_transcript.json — the rotted 30-turn fixture (provided below, copy-paste)
context_budget.py (or contextBudget.mjs) — the class from the walkthrough
diagnose.py (or diagnose.mjs) — the diagnostic and fix runner

Environment Setup

Single copy-pasteable block. Pick Python or Node.js, not both:

# Python
mkdir m03b-context-lab && cd m03b-context-lab
python -m venv venv && source venv/bin/activate    # Windows: venv\Scripts\activate
pip install anthropic
export ANTHROPIC_API_KEY=sk-ant-...                # Windows: set ANTHROPIC_API_KEY=sk-ant-...

# Node.js (alternative)
mkdir m03b-context-lab && cd m03b-context-lab
npm init -y && npm install @anthropic-ai/sdk
export ANTHROPIC_API_KEY=sk-ant-...

Step 1: Drop in the Poisoned Transcript Fixture

This is your raw material — a UCC research session that started fine and rotted by turn 30. The user is asking about debtor "Acme Corp" filings; over the session they corrected scope twice, hit two rate limits, and abandoned one tool path. The current question is at the end. Save this as poisoned_transcript.json:

{
  "system_prompt": "You are a UCC filings research assistant. Help users find and analyze public records filings. Always cite filing IDs. If a search fails, retry once with a normalized query.",
  "tool_definitions": [
    {"name": "search_filings", "description": "Search UCC filings by debtor name", "input_schema": {"type": "object", "properties": {"debtor": {"type": "string"}}, "required": ["debtor"]}},
    {"name": "get_filing", "description": "Retrieve a single filing by ID", "input_schema": {"type": "object", "properties": {"filing_id": {"type": "string"}}, "required": ["filing_id"]}},
    {"name": "calc_collateral_total", "description": "Sum collateral values across filing IDs", "input_schema": {"type": "object", "properties": {"filing_ids": {"type": "array", "items": {"type": "string"}}}, "required": ["filing_ids"]}}
  ],
  "history": [
    {"role": "user", "content": "Research debtor Acme Corp - I need an overview of their UCC filings"},
    {"role": "assistant", "content": "I'll search for filings under 'Acme Corp'."},
    {"role": "user", "content": "[tool_result] search_filings(debtor='Acme Corp') returned 12 filings spanning 2017-2023, IDs F-2017-0044, F-2018-04421, F-2018-04422, F-2019-1101, F-2019-2287, F-2020-0883, F-2020-1455, F-2021-0192, F-2021-3340, F-2022-0028, F-2022-1199, F-2023-0501"},
    {"role": "assistant", "content": "Found 12 filings from 2017 to 2023. Want a year-by-year breakdown?"},
    {"role": "user", "content": "Yes, but actually focus only on 2019 and later"},
    {"role": "assistant", "content": "Got it. Restricting to 2019+: 8 filings. Pulling details."},
    {"role": "user", "content": "[tool_result] get_filing('F-2018-04421') returned full filing data for 2018 (collateral: $8.1M, secured party: First Bank)"},
    {"role": "assistant", "content": "I see I pulled a 2018 filing by mistake. Discarding."},
    {"role": "user", "content": "[tool_result] get_filing('F-2018-04422') returned ERROR: rate_limit_exceeded after 3 retries"},
    {"role": "assistant", "content": "Rate limit on another 2018 filing - skipping, that's outside our scope anyway."},
    {"role": "user", "content": "[tool_result] search_filings(debtor='ACME CORP') returned 12 filings (same as before, normalization quirk)"},
    {"role": "assistant", "content": "Duplicate result from a normalized retry. Ignoring."},
    {"role": "user", "content": "I said 2019+! Why are you still pulling 2018 filings?"},
    {"role": "assistant", "content": "Apologies - that was a confused retry. Going forward, only 2019 and later: F-2019-1101, F-2019-2287, F-2020-0883, F-2020-1455, F-2021-0192, F-2021-3340, F-2022-0028, F-2022-1199, F-2023-0501. That's 9 filings."},
    {"role": "user", "content": "[tool_result] get_filing('F-2019-1101') returned collateral: $4.2M, secured party: Capital Trust, year: 2019"},
    {"role": "assistant", "content": "F-2019-1101: $4.2M collateral, Capital Trust as secured party."},
    {"role": "user", "content": "[tool_result] get_filing('F-2019-2287') returned collateral: $7.8M, secured party: First Bank, year: 2019"},
    {"role": "assistant", "content": "F-2019-2287: $7.8M, First Bank."},
    {"role": "user", "content": "[tool_result] get_filing('F-2020-0883') returned collateral: $11.5M, secured party: Capital Trust, year: 2020"},
    {"role": "assistant", "content": "F-2020-0883: $11.5M, Capital Trust."},
    {"role": "user", "content": "[tool_result] get_filing('F-2020-1455') returned collateral: $3.4M, secured party: GE Capital, year: 2020"},
    {"role": "assistant", "content": "F-2020-1455: $3.4M, GE Capital."},
    {"role": "user", "content": "[tool_result] get_filing('F-2021-0192') returned collateral: $9.0M, secured party: First Bank, year: 2021"},
    {"role": "assistant", "content": "F-2021-0192: $9.0M, First Bank."},
    {"role": "user", "content": "[tool_result] get_filing('F-2021-3340') returned collateral: $5.5M, secured party: Capital Trust, year: 2021"},
    {"role": "assistant", "content": "F-2021-3340: $5.5M, Capital Trust."},
    {"role": "user", "content": "[tool_result] get_filing('F-2022-0028') returned collateral: $13.2M, secured party: First Bank, year: 2022"},
    {"role": "assistant", "content": "F-2022-0028: $13.2M, First Bank."},
    {"role": "user", "content": "[tool_result] get_filing('F-2022-1199') returned collateral: $6.7M, secured party: GE Capital, year: 2022"},
    {"role": "assistant", "content": "F-2022-1199: $6.7M, GE Capital."},
    {"role": "user", "content": "[tool_result] get_filing('F-2023-0501') returned collateral: $8.9M, secured party: Capital Trust, year: 2023"},
    {"role": "assistant", "content": "F-2023-0501: $8.9M, Capital Trust."}
  ],
  "current_user_message": "What's the total collateral value across all 2019+ filings, and which secured party shows up most often?"
}

What just landed in your folder: a fixture with one system prompt, three tool definitions, 32 history messages (a mix of user, assistant, and tool-result messages), and a current question. The poisoning is subtle but real: one rate-limit error, two 2018 filings the user explicitly excluded, and one duplicate normalized search. Your diagnostic will surface the failed call and the duplicate.
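
To see the fixture's shape for yourself, a short script can count messages by role. This is a minimal sketch; `summarize_fixture` is a hypothetical helper, and `SAMPLE` is a three-message stand-in with the same keys as the fixture so the block runs without the file.

```python
import json
from collections import Counter


def summarize_fixture(fixture: dict) -> dict:
    """Count history messages by role and count tool-result messages."""
    history = fixture["history"]
    roles = Counter(msg["role"] for msg in history)
    # In this fixture, tool results are user messages prefixed with "[tool_result]".
    tool_results = sum(
        1 for msg in history if msg["content"].startswith("[tool_result]")
    )
    return {
        "messages": len(history),
        "user": roles["user"],
        "assistant": roles["assistant"],
        "tool_results": tool_results,
        "tools": len(fixture["tool_definitions"]),
    }


# Tiny stand-in with the same keys as poisoned_transcript.json.
SAMPLE = {
    "system_prompt": "You are a UCC filings research assistant.",
    "tool_definitions": [{"name": "search_filings"}],
    "history": [
        {"role": "user", "content": "Research debtor Acme Corp"},
        {"role": "assistant", "content": "I'll search for filings."},
        {"role": "user", "content": "[tool_result] search_filings(...) returned 12 filings"},
    ],
    "current_user_message": "What's the total?",
}

if __name__ == "__main__":
    print(json.dumps(summarize_fixture(SAMPLE), indent=2))
```

To inspect the real file, load it with `json.load` and pass the result to `summarize_fixture` instead of `SAMPLE`.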

✅ Checkpoint: If ls -lh poisoned_transcript.json (Windows: dir poisoned_transcript.json) shows a file around 4-5 KB, you're set. If you see a 0-byte file, copy-paste didn't include all lines — re-copy the JSON.

Step 2: Build the Diagnostic + Compress Runner

What & why: This script loads the fixture, instantiates the ContextBudget class from the walkthrough, prints per-layer token breakdown before any fix, applies the compress lever, then prints the breakdown after. It's the smallest possible loop that exercises the full module.

First, save the ContextBudget class from the Code Walkthrough as context_budget.py (Python) or contextBudget.mjs (Node). Then create the runner:

"""diagnose.py - Reads poisoned_transcript.json, runs ContextBudget, prints before/after."""
import json
from context_budget import ContextBudget

def print_breakdown(label: str, account: dict, total: int, target: int):
    print(f"\n--- {label} ---")
    for layer, tokens in account.items():
        bar = "#" * max(1, tokens // 100)
        print(f"  {layer:<14} {tokens:>5} tok  {bar}")
    pct = round(100 * total / target)
    status = "OVER BUDGET" if total > target else "ok"
    print(f"  TOTAL          {total:>5} tok  ({pct}% of {target} budget) [{status}]")

def detect_rot(history: list) -> list:
    """Quick heuristic rot scan: flag failure and duplicate signals in any message (turn index is 0-based)."""
    signals = []
    for i, msg in enumerate(history):
        c = msg.get("content", "")
        if "rate_limit" in c.lower() or "ERROR" in c:
            signals.append(f"  turn {i}: failed tool call (likely retried)")
        if "duplicate" in c.lower() or "same as before" in c.lower():
            signals.append(f"  turn {i}: duplicate result")
    return signals

def main():
    with open("poisoned_transcript.json") as f:
        fixture = json.load(f)

    # Build the budget object from the fixture.
    budget = ContextBudget(
        system_prompt=fixture["system_prompt"],
        tool_definitions=fixture["tool_definitions"],
        history=fixture["history"],
        current_user_message=fixture["current_user_message"],
        target_budget=4000,  # small lab budget; the compress step below runs whether or not it is exceeded
    )

    print("=" * 60)
    print("POISONED TRANSCRIPT DIAGNOSTIC")
    print("=" * 60)
    print_breakdown("BEFORE", budget.account(), budget.total(), budget.target_budget)

    rot_signals = detect_rot(budget.history)
    if rot_signals:
        print("\nRot signals detected:")
        for s in rot_signals:
            print(s)

    # Apply the compress lever.
    print("\nApplying compress lever (keep_recent=6)...")
    budget.apply_compress(keep_recent=6)

    print_breakdown("AFTER COMPRESSION", budget.account(), budget.total(), budget.target_budget)

    # Build the final payload that would be sent to Claude.
    system, tools, messages = budget.build_messages()
    print(f"\nFinal payload: {len(messages)} messages, "
          f"{budget.total()} input tokens, fits budget = {budget.fits()}")

if __name__ == "__main__":
    main()

// diagnose.mjs - Reads poisoned_transcript.json, runs ContextBudget, prints before/after.
import { readFileSync } from 'fs';
import { ContextBudget } from './contextBudget.mjs';

function printBreakdown(label, account, total, target) {
  console.log(`\n--- ${label} ---`);
  for (const [layer, tokens] of Object.entries(account)) {
    const bar = '#'.repeat(Math.max(1, Math.floor(tokens / 100)));
    console.log(`  ${layer.padEnd(14)} ${String(tokens).padStart(5)} tok  ${bar}`);
  }
  const pct = Math.round((100 * total) / target);
  const status = total > target ? 'OVER BUDGET' : 'ok';
  console.log(`  TOTAL          ${String(total).padStart(5)} tok  (${pct}% of ${target} budget) [${status}]`);
}

function detectRot(history) {
  const signals = [];
  history.forEach((msg, i) => {
    const c = msg.content || '';
    if (c.toLowerCase().includes('rate_limit') || c.includes('ERROR')) {
      signals.push(`  turn ${i}: failed tool call (likely retried)`);
    }
    if (c.toLowerCase().includes('duplicate') || c.toLowerCase().includes('same as before')) {
      signals.push(`  turn ${i}: duplicate result`);
    }
  });
  return signals;
}

async function main() {
  const fixture = JSON.parse(readFileSync('poisoned_transcript.json', 'utf-8'));

  const budget = new ContextBudget({
    systemPrompt: fixture.system_prompt,
    toolDefinitions: fixture.tool_definitions,
    history: fixture.history,
    currentUserMessage: fixture.current_user_message,
    targetBudget: 4000,
  });

  console.log('='.repeat(60));
  console.log('POISONED TRANSCRIPT DIAGNOSTIC');
  console.log('='.repeat(60));
  printBreakdown('BEFORE', budget.account(), budget.total(), budget.targetBudget);

  const rotSignals = detectRot(budget.history);
  if (rotSignals.length) {
    console.log('\nRot signals detected:');
    rotSignals.forEach(s => console.log(s));
  }

  console.log('\nApplying compress lever (keepRecent=6)...');
  await budget.applyCompress(6);

  printBreakdown('AFTER COMPRESSION', budget.account(), budget.total(), budget.targetBudget);

  const { messages } = budget.buildMessages();
  console.log(`\nFinal payload: ${messages.length} messages, ${budget.total()} input tokens, fits budget = ${budget.fits()}`);
}

main().catch(err => { console.error(err); process.exit(1); });

Run it:

python diagnose.py     # or: node diagnose.mjs

Expected Output (numbers will vary by ~5% due to token estimation):

============================================================
POISONED TRANSCRIPT DIAGNOSTIC
============================================================

--- BEFORE ---
  system           105 tok  #
  tools            230 tok  ##
  history         1980 tok  ###################
  retrieved          1 tok  #
  tool_results       1 tok  #
  current           24 tok  #
  TOTAL           2341 tok  (59% of 4000 budget) [ok]

Rot signals detected:
  turn 8: failed tool call (likely retried)
  turn 10: duplicate result
  turn 11: duplicate result

Applying compress lever (keep_recent=6)...

--- AFTER COMPRESSION ---
  system           105 tok  #
  tools            230 tok  ##
  history          580 tok  #####
  retrieved          1 tok  #
  tool_results       1 tok  #
  current           24 tok  #
  TOTAL            941 tok  (24% of 4000 budget) [ok]

Final payload: 8 messages, 941 input tokens, fits budget = True
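
The numbers vary because a local breakdown like this is an estimate, not a tokenizer count. A common rough heuristic is about four characters per token. The sketch below assumes that heuristic (the ContextBudget class from the walkthrough may estimate differently) and uses the hypothetical names `estimate_tokens` and `rough_account`.

```python
import json


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token, never less than 1."""
    return max(1, len(text) // 4)


def rough_account(system_prompt, tool_definitions, history, current_user_message):
    """Per-layer estimate for the layers present in the lab fixture."""
    return {
        "system": estimate_tokens(system_prompt),
        # Tool definitions are sent as JSON, so measure their serialized form.
        "tools": estimate_tokens(json.dumps(tool_definitions)),
        "history": sum(estimate_tokens(m["content"]) for m in history),
        "current": estimate_tokens(current_user_message),
    }


if __name__ == "__main__":
    account = rough_account(
        "You are a UCC filings research assistant.",
        [{"name": "search_filings", "description": "Search UCC filings by debtor name"}],
        [{"role": "user", "content": "Research debtor Acme Corp"}],
        "What's the total collateral value?",
    )
    for layer, tokens in account.items():
        print(f"{layer:<8} {tokens:>4} tok")
```

Treat the result as a planning number only. The `usage.input_tokens` value that the API returns with each response is the authoritative count.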

✅ Checkpoint: You should see a roughly 3-4x reduction in history tokens after compression, and at least 2 rot signals detected. The summary message preserves the named filing IDs and dollar amounts — if you don't see those in the summary message, your SUMMARY_INSTRUCTION in context_budget.py may be missing the "preserve every named entity" line. Re-check Chunk 2 of the walkthrough.

Troubleshooting

ModuleNotFoundError: No module named 'anthropic' — Run pip install anthropic in your activated venv (or npm install @anthropic-ai/sdk).
AuthenticationError: Could not resolve API key — export ANTHROPIC_API_KEY=sk-ant-... in the same shell. Check with echo $ANTHROPIC_API_KEY.
FileNotFoundError: poisoned_transcript.json — You're running from the wrong directory. cd into m03b-context-lab first.
Summary doesn't preserve filing IDs — Check the SUMMARY_INSTRUCTION string. The "preserve every named entity..." line is critical — without it, Haiku will summarize the gist and drop the IDs you need to answer the final question.
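
A quick way to confirm the last item is to compare filing IDs before and after compression. The sketch below is one possible check, assuming IDs follow the `F-<year>-<number>` pattern used in the fixture; `missing_ids` is a hypothetical helper, not part of ContextBudget.

```python
import re

# Filing IDs in the fixture look like F-2019-1101 or F-2018-04421.
FILING_ID = re.compile(r"F-\d{4}-\d+")


def missing_ids(original_text: str, summary: str, min_year: int = 2019) -> list[str]:
    """Return in-scope filing IDs found in the original text but not in the summary."""
    wanted = {
        fid
        for fid in FILING_ID.findall(original_text)
        if int(fid.split("-")[1]) >= min_year
    }
    present = set(FILING_ID.findall(summary))
    return sorted(wanted - present)


if __name__ == "__main__":
    original = "Fetched F-2018-04421, F-2019-1101 and F-2020-0883."
    summary = "User wants 2019+ only. F-2019-1101: $4.2M, Capital Trust."
    print(missing_ids(original, summary))  # ['F-2020-0883']
```

An empty list means every in-scope ID survived the summary. Anything else is a sign that the summary instruction needs tightening.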

Step 3: Compare Answer Quality Before vs. After

What & why: Token reduction is a means, not the end. The end is answer quality. This step sends the same final question to Claude twice — once with the rotted context, once with the compressed context — and prints both answers so you can compare. Append this to diagnose.py (or create a new file). Important: replace the existing if __name__ == "__main__": block from Step 2 with the new one at the bottom of this snippet so main() only runs once. For the Node.js version, replace the trailing main().catch(...) line from Step 2 with the new top-level await calls at the bottom.

# Append to diagnose.py (or save as compare.py and import as needed)
import anthropic
import time

def ask_claude(budget: ContextBudget) -> tuple[str, dict]:
    """Send the budget's payload to Claude and return (text, usage)."""
    client = anthropic.Anthropic()
    system, _tools, messages = budget.build_messages()  # tool definitions are not sent in this comparison
    t0 = time.time()
    try:
        result = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=400,
            system=system,
            messages=messages,
        )
    except anthropic.APIError as e:
        return f"[error: {e.message}]", {"input_tokens": 0, "output_tokens": 0, "latency_s": 0}
    elapsed = round(time.time() - t0, 2)
    return result.content[0].text, {
        "input_tokens": result.usage.input_tokens,
        "output_tokens": result.usage.output_tokens,
        "latency_s": elapsed,
    }

def compare_run():
    with open("poisoned_transcript.json") as f:
        fixture = json.load(f)

    # Run 1: rotted context (no fix applied)
    rotted = ContextBudget(
        system_prompt=fixture["system_prompt"],
        tool_definitions=fixture["tool_definitions"],
        history=fixture["history"],
        current_user_message=fixture["current_user_message"],
        target_budget=4000,
    )
    print("\n>>> Run 1: ROTTED context (no fix)")
    answer_a, usage_a = ask_claude(rotted)
    print(f"Tokens: {usage_a['input_tokens']} in, {usage_a['output_tokens']} out  ({usage_a['latency_s']}s)")
    print(f"Answer:\n{answer_a}\n")

    # Run 2: compressed context (compress lever applied)
    fixed = ContextBudget(
        system_prompt=fixture["system_prompt"],
        tool_definitions=fixture["tool_definitions"],
        history=fixture["history"],
        current_user_message=fixture["current_user_message"],
        target_budget=4000,
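    )
    # Compress as a separate call rather than chaining, so this works
    # whether or not apply_compress() returns the budget object.
    fixed.apply_compress(keep_recent=6)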
    print(">>> Run 2: COMPRESSED context")
    answer_b, usage_b = ask_claude(fixed)
    print(f"Tokens: {usage_b['input_tokens']} in, {usage_b['output_tokens']} out  ({usage_b['latency_s']}s)")
    print(f"Answer:\n{answer_b}\n")

    # Delta
    print("=" * 60)
    print(f"Token delta:    {usage_a['input_tokens'] - usage_b['input_tokens']} fewer input tokens")
    print(f"Latency delta:  {usage_a['latency_s'] - usage_b['latency_s']:.2f}s faster")

if __name__ == "__main__":
    main()        # original diagnostic
    compare_run() # new comparison

// Append to diagnose.mjs
import Anthropic from '@anthropic-ai/sdk';

async function askClaude(budget) {
  const client = new Anthropic();
  const { system, messages } = budget.buildMessages();
  const t0 = Date.now();
  let result;
  try {
    result = await client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 400,
      system,
      messages,
    });
  } catch (error) {
    return [`[error: ${error.message}]`, { input_tokens: 0, output_tokens: 0, latency_s: 0 }];
  }
  const elapsed = ((Date.now() - t0) / 1000).toFixed(2);
  return [result.content[0].text, {
    input_tokens: result.usage.input_tokens,
    output_tokens: result.usage.output_tokens,
    latency_s: parseFloat(elapsed),
  }];
}

async function compareRun() {
  const fixture = JSON.parse(readFileSync('poisoned_transcript.json', 'utf-8'));
  const baseArgs = {
    systemPrompt: fixture.system_prompt,
    toolDefinitions: fixture.tool_definitions,
    history: fixture.history,
    currentUserMessage: fixture.current_user_message,
    targetBudget: 4000,
  };

  const rotted = new ContextBudget(baseArgs);
  console.log('\n>>> Run 1: ROTTED context (no fix)');
  const [answerA, usageA] = await askClaude(rotted);
  console.log(`Tokens: ${usageA.input_tokens} in, ${usageA.output_tokens} out  (${usageA.latency_s}s)`);
  console.log(`Answer:\n${answerA}\n`);

  const fixed = new ContextBudget(baseArgs);
  await fixed.applyCompress(6);
  console.log('>>> Run 2: COMPRESSED context');
  const [answerB, usageB] = await askClaude(fixed);
  console.log(`Tokens: ${usageB.input_tokens} in, ${usageB.output_tokens} out  (${usageB.latency_s}s)`);
  console.log(`Answer:\n${answerB}\n`);

  console.log('='.repeat(60));
  console.log(`Token delta:    ${usageA.input_tokens - usageB.input_tokens} fewer input tokens`);
  console.log(`Latency delta:  ${(usageA.latency_s - usageB.latency_s).toFixed(2)}s faster`);
}

await main();
await compareRun();

Run it:

python diagnose.py     # or: node diagnose.mjs

Expected Output (abbreviated — exact text varies):

>>> Run 1: ROTTED context (no fix)
Tokens: 2298 in, 142 out  (3.81s)
Answer:
The total collateral value across the 2019+ filings is $70.2M. Capital Trust appears most frequently with 4 filings. (Note: I had some confusion earlier with 2018 filings being pulled by mistake.)

>>> Run 2: COMPRESSED context
Tokens: 920 in, 138 out  (2.14s)
Answer:
Total collateral across 2019+ filings: $70.2M
Most frequent secured party: Capital Trust (4 of 9 filings).
Breakdown by party: Capital Trust 4, First Bank 3, GE Capital 2.

============================================================
Token delta:    1378 fewer input tokens
Latency delta:  1.67s faster

✅ Checkpoint — verify all three:

Token reduction: at least 1,000 fewer input tokens after compression
Faster latency: 1-2 seconds quicker (varies by network)
Same answer (or better): both runs should report the same total ($70.2M) and same most-frequent secured party (Capital Trust). The compressed run usually gives a cleaner structured answer because there's less noise to wade through.
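
To check the "same answer" item without eyeballing it, you can compute the ground truth straight from the tool results in the history. The sketch below assumes the result strings follow the fixture's exact wording; `ground_truth` and `SAMPLE_HISTORY` are hypothetical names. Run against the full fixture history, it should reproduce the $70.2M total and Capital Trust (4 filings).

```python
import re
from collections import Counter
from decimal import Decimal

# Matches the fixture's successful get_filing tool results.
RESULT = re.compile(
    r"get_filing\('([^']+)'\) returned collateral: \$([\d.]+)M, "
    r"secured party: ([^,]+), year: (\d{4})"
)


def ground_truth(history, min_year=2019):
    """Return (total in $M, most frequent secured party, its filing count)."""
    filings = {}  # keyed by filing ID so a repeated fetch is counted once
    for msg in history:
        match = RESULT.search(msg["content"])
        if match and int(match.group(4)) >= min_year:
            filings[match.group(1)] = (Decimal(match.group(2)), match.group(3))
    if not filings:
        return 0.0, None, 0
    total = sum(amount for amount, _ in filings.values())
    party, count = Counter(p for _, p in filings.values()).most_common(1)[0]
    return float(total), party, count


# Three results copied from the fixture, plus one repeated fetch.
SAMPLE_HISTORY = [
    {"role": "user", "content": "[tool_result] get_filing('F-2019-1101') returned collateral: $4.2M, secured party: Capital Trust, year: 2019"},
    {"role": "user", "content": "[tool_result] get_filing('F-2019-2287') returned collateral: $7.8M, secured party: First Bank, year: 2019"},
    {"role": "user", "content": "[tool_result] get_filing('F-2020-0883') returned collateral: $11.5M, secured party: Capital Trust, year: 2020"},
    {"role": "user", "content": "[tool_result] get_filing('F-2019-1101') returned collateral: $4.2M, secured party: Capital Trust, year: 2019"},
]

if __name__ == "__main__":
    print(ground_truth(SAMPLE_HISTORY))  # (23.5, 'Capital Trust', 2)
```

Using `Decimal` for the sum avoids floating-point drift in the dollar amounts, and keying by filing ID keeps a duplicate fetch from inflating the total.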

Verify Everything Works

One command runs the full diagnostic + comparison end-to-end:

python diagnose.py    # or: node diagnose.mjs

🎉 Congratulations! You just diagnosed a poisoned transcript, applied the compress lever, and verified that token reduction didn't sacrifice answer quality. The ContextBudget class you wrote is the same pattern you'll specialize for conversation history (M08), retrieved chunks (M09), and memory layers (M11). Every later module in Track 3 is just one of these levers wearing a different costume.

Stretch Goals (Optional)

Implement the truncate-retrieved lever: when the budget overflows even after compression, drop retrieved docs to the top 2 by relevance score.
Static-first ordering test: send the same payload twice with cache_control markers, measure cache hit rate from usage.cache_read_input_tokens.
Position-effects test: place the critical case fact ("user wants 2019+ only") at three positions — start of system prompt, middle of history, just before the user message — and compare answer accuracy.
Auto-detect summary failures: parse the summary message after compression and assert that all 9 filing IDs from the original history still appear. If not, fall back to a less aggressive keep_recent.
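
As a starting point for the first stretch goal, here is a minimal sketch of a truncate step. It assumes each retrieved doc is a dict with a numeric `score` field, which is a hypothetical shape; adapt it to however your ContextBudget stores retrieved docs.

```python
def truncate_retrieved(docs, keep=2):
    """Keep the `keep` highest-scoring docs, preserving their original order."""
    # Rank positions by score, highest first, and take the top `keep`.
    top = sorted(range(len(docs)), key=lambda i: docs[i]["score"], reverse=True)[:keep]
    # Re-sort the surviving positions so the docs stay in their original sequence.
    return [docs[i] for i in sorted(top)]


if __name__ == "__main__":
    docs = [
        {"id": "a", "score": 0.2},
        {"id": "b", "score": 0.9},
        {"id": "c", "score": 0.5},
    ]
    print([d["id"] for d in truncate_retrieved(docs)])  # ['b', 'c']
```

Preserving the original order is a design choice: it keeps the retrieved layer stable from turn to turn when the same docs survive.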

Knowledge Check

Six questions on context inventory, the four levers, ordering, position effects, and rot.

Q1: Your agent has 5 registered tools but only uses 1 per turn on average. How many tokens do the 4 unused tool definitions cost on a turn?

A. 0 tokens: only the tool that gets called counts

B. 0 tokens: tool definitions are stored server-side

C. The full token cost of all 4 unused definitions, every turn

D. Only their names: descriptions are lazily loaded

Answer: C. Every tool definition is included in the prompt the model reads, on every turn, regardless of whether Claude calls it. The model has to read all 5 definitions to decide which (if any) to call. This is why the cert tip in M06 caps tools at 4-5 per agent, both for selection accuracy and for token cost.
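
To make the per-turn cost concrete, the sketch below estimates what a set of tool definitions costs across a session. It assumes the rough four-characters-per-token heuristic, and `tool_definition_cost` is a hypothetical helper; real counts come from the API's usage fields.

```python
import json


def tool_definition_cost(tool_definitions, turns, chars_per_token=4):
    """Estimate tokens spent on tool definitions per turn and across a session."""
    per_tool = {
        tool["name"]: max(1, len(json.dumps(tool)) // chars_per_token)
        for tool in tool_definitions
    }
    per_turn = sum(per_tool.values())
    # Definitions are re-sent on every call, so the cost scales with turn count.
    return per_tool, per_turn, per_turn * turns


if __name__ == "__main__":
    tools = [
        {"name": "search_filings", "description": "Search UCC filings by debtor name"},
        {"name": "get_filing", "description": "Retrieve a single filing by ID"},
    ]
    per_tool, per_turn, total = tool_definition_cost(tools, turns=30)
    print(per_tool)
    print(f"{per_turn} tok per turn, {total} tok over 30 turns")
```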

Q2: Which is NOT one of the four context-engineering levers?

A. Compress: summarize old turns to save tokens

B. Encrypt: reduce context length via cryptographic compression

C. Retrieve: fetch facts from external storage on demand

D. Offload: hand a sub-task to a subagent with its own context

Answer: B. The four levers are add, compress, retrieve, offload. "Encrypt" is a distractor: encryption does not reduce the token count, because the model still has to read the cleartext. Every later module in Track 3 is a specialized application of one of these four.

Q3: An agent works fine for the first 8 turns, then degrades sharply. Token usage is at 60%, well under the limit. What's the most likely cause?

A. The model is rate-limited: switch to Haiku

B. The system prompt is too short: add more instructions

C. Token budget exceeded: the agent is silently truncating

D. Context rot: accumulated stale tool results and superseded instructions are lowering signal-to-noise

Answer: D. Context rot is signal-to-noise degradation, not a budget problem. You can hit it at 60% utilization. The fix isn't a bigger window; it's a cleaner one. Apply the compress lever to fold resolved errors and superseded instructions into a single positive case-fact summary.

Q4: Which message ordering allows prompt caching to hit on subsequent calls?

A. system → tools → reference doc → history → current user turn

B. history → system → tools → current user turn

C. current user turn → history → system → tools

D. Order doesn't affect caching: only content matters

Answer: A. Prompt caching reuses prefixes that match exactly. Static blocks (system, tools, fixed reference docs) must come first so they form a stable prefix across calls. As soon as a dynamic block (like history) appears, everything after it is "new" from the cache's perspective, even if it would have been identical further down. Static-first is the cache-friendly order.
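
The prefix rule is easy to see with a toy model. The sketch below treats each layer as one opaque block and counts how many leading blocks two consecutive calls share; it is an illustration under that simplifying assumption, not how the cache is implemented, and `shared_prefix_len` is a hypothetical helper.

```python
def shared_prefix_len(call_a, call_b):
    """Count leading blocks that are identical in both calls."""
    count = 0
    for block_a, block_b in zip(call_a, call_b):
        if block_a != block_b:
            break
        count += 1
    return count


STATIC = ["system", "tools", "reference doc"]

# Static-first: only the tail changes between turn 1 and turn 2.
static_first_1 = STATIC + ["history (1 turn)", "user: question 1"]
static_first_2 = STATIC + ["history (2 turns)", "user: question 2"]

# Dynamic-first: the very first block differs, so nothing lines up.
dynamic_first_1 = ["history (1 turn)"] + STATIC + ["user: question 1"]
dynamic_first_2 = ["history (2 turns)"] + STATIC + ["user: question 2"]

if __name__ == "__main__":
    print(shared_prefix_len(static_first_1, static_first_2))    # 3
    print(shared_prefix_len(dynamic_first_1, dynamic_first_2))  # 0
```

The same three static blocks appear in both orderings, but only the static-first version leaves a reusable prefix.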

Q5: You have a 25-page reference doc the agent needs occasional facts from across many turns. Which lever is the best fit?

A. Add: paste the full doc into the system prompt every turn

B. Compress: summarize the doc down to one paragraph

C. Retrieve: chunk and embed the doc, fetch only relevant chunks per turn

D. Offload: spawn a subagent that holds the doc in its own window

Answer: C. "Occasional facts" is the giveaway: most turns won't need most of the doc. Retrieve is the right lever: pay 0 tokens until a query needs a chunk, then fetch only the relevant slice. Add wastes tokens every turn, compress loses specifics, and offload is overkill for a single doc lookup. This is the setup for M09 (RAG).
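
Here is a toy version of the retrieve lever to show its shape. It swaps embeddings for plain keyword overlap so it runs with no dependencies; that substitution, the `retrieve` helper, and the sample chunks are all assumptions made for illustration. M09 covers real retrieval.

```python
import re


def tokenize(text):
    """Lowercase word set used for overlap scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, chunks, k=2):
    """Return up to k chunks that share the most words with the query."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(chunk)), i) for i, chunk in enumerate(chunks)]
    # Highest overlap first; ties keep document order.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [chunks[i] for score, i in ranked[:k] if score > 0]


# Made-up sample chunks standing in for a chunked reference doc.
CHUNKS = [
    "Filing fees are due within 30 days.",
    "A continuation statement extends a filing for five years.",
    "Office hours are 9 to 5.",
]

if __name__ == "__main__":
    print(retrieve("When is a continuation statement due?", CHUNKS, k=1))
```

Only the chunks that score above zero enter the context, so a turn that never mentions the document pays nothing for it.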

Q6: A long context contains a critical instruction at position 12 of 23 layers. Why might Claude miss it?

A. The instruction has fewer tokens than the surrounding content

B. The "lost in the middle" effect: recall accuracy drops in middle positions even when the content is present

C. Claude truncates context in the middle when it's too long

D. Tools always override instructions in the middle of context

Answer: B. Liu et al. (2023) documented the lost-in-the-middle effect across several LLMs: recall is high at the start (~75%) and end (~70%) but drops to ~45% in the middle. The fix: put critical instructions at the start (system prompt) or end (just before the user query), or repeat them at both positions. This is exam Domain 5.2.
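
The placement fix can be sketched as a small assembly step. This is a minimal illustration with string blocks; `assemble` is a hypothetical helper, and the "Reminder:" prefix is an arbitrary choice rather than a required format.

```python
def assemble(system_prompt, middle_blocks, user_message, critical, repeat=True):
    """Order blocks so the critical instruction sits at the start and, optionally, near the end."""
    ordered = [f"{critical}\n\n{system_prompt}"]   # start: lead the system prompt
    ordered.extend(middle_blocks)                   # middle: history, docs, tool results
    if repeat:
        ordered.append(f"Reminder: {critical}")     # end: just before the user query
    ordered.append(user_message)
    return ordered


if __name__ == "__main__":
    blocks = assemble(
        system_prompt="You are a UCC filings research assistant.",
        middle_blocks=["<history>", "<tool results>"],
        user_message="What's the total collateral value?",
        critical="Only consider filings from 2019 or later.",
    )
    for block in blocks:
        print(repr(block))
```

Set `repeat=False` to keep the instruction at the start only, which costs fewer tokens when the context is short enough that the middle is not a concern.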

Module Summary

Key Takeaways

Prompt engineering writes one message; context engineering curates everything the model sees. The user message is often <1% of the input tokens — the other 99% is the layers you assembled.
Six layers — system, tools, history, retrieved docs, tool results, current turn. Inventory them before you debug an agent.
Four levers, no exceptions — add, compress, retrieve, offload. Every later module in Track 3 is one of these four wearing a different costume.
Static-first ordering is what makes prompt caching work. Same content in the wrong order can cost 6x more.
Lost in the middle is real. Put critical content at the start or end of long contexts — and consider repeating it at both.
Context rot is a signal-to-noise problem, not a budget problem. You can hit it at 60% window utilization. Compaction is the cure.

Next Module Preview: M04 — Structured Output

Now that you know what goes into the model, the next question is how to get reliably parseable output back. M04 covers JSON mode, tool-use as a structured-output mechanism, and Pydantic/Zod validation — turning Claude's natural-language replies into data your agent can act on without crashing on malformed responses.