M03B: Context Engineering — Curating What the Model Sees
M03 taught you how to write the message. This module teaches you how to curate everything the model sees on each turn — system prompt, tools, history, retrieved docs, tool results — under a finite budget. Get this right and every later module (RAG, memory, multi-agent, caching) snaps into place.
Learning Objectives
- Distinguish prompt engineering (writing the message) from context engineering (curating what the model sees) and explain why both matter
- Inventory the six layers Claude reads on every API call: system, tools, history, retrieved docs, tool results, current turn
- Apply the four context-engineering levers — add, compress, retrieve, offload — to a context budget problem
- Order static and dynamic content correctly so prompt caching can hit, and explain the lost-in-the-middle effect
- Diagnose context rot in a long-running agent transcript and fix it without losing critical case facts
Prompt Engineering vs. Context Engineering
In M03 you learned prompt engineering: the craft of writing a single message that guides Claude toward the answer you want. Roles, few-shot examples, structured output instructions, chain-of-thought. That is the visible tip of the iceberg.
Underneath it is the bigger discipline: context engineering. Every API call assembles a stack of content for the model — the system prompt you wrote, the tool definitions you registered, the conversation history that has piled up, any retrieved documents, the results of past tool calls, and finally the user's current message. The model attends to all of it. Most learners only think about the last line they typed. The other 95% sets the constraints, the priors, and (when neglected) the failure modes.
Here's the working definition we'll use throughout the course:
Context engineering is the practice of deciding what content occupies the model's context window on each turn, in what order, and how it is compressed or fetched. (The context window is the fixed-size buffer of tokens the model reads on each call. For Claude Sonnet 4.6 it's up to 200K tokens, or 1M with the long-context variant. Everything competes for this space: system, tools, history, retrieved docs, tool results, current message, AND the response.) Prompt engineering is a sub-discipline of context engineering that handles the authoring of one piece of that context (typically the system prompt or current user message).
Why split them out? Because the levers are different. Prompt engineering asks "what words go in this message?" Context engineering asks "what gets included at all, in what order, fresh or cached, full or summarized?" The first is a writing problem. The second is a budgeting problem. By M11 you'll be juggling both on every agent you build — this module makes the budgeting half explicit.
Concept 1: What Does the Model Actually See?
Before you can engineer context, you have to know what's in it. On every Anthropic API call, Claude receives a layered stack of content. Most of it isn't in the messages array you passed in — it's assembled by the SDK from various parameters. Let's enumerate the six layers.
Before you walk into a courtroom, the lawyer hands the jury an evidence binder — pages of contracts, photos, prior testimony, and exhibits. The closing argument the lawyer delivers is what the jury hears last, but they've been reading the binder for hours.
The pain: rookie lawyers obsess over the closing argument and pay no attention to what's in the binder. The jury reaches its verdict based mostly on the binder. A great closing on top of a sloppy binder loses the case.
The mapping: the closing argument is your prompt. The binder is your context. Claude is the jury — it reads everything. If your retrieved chunks are irrelevant, your tool definitions are vague, or your history is full of resolved errors, no amount of clever prompting fixes the verdict.
The Six Layers, From Top to Bottom
Every Anthropic API call assembles these six layers into one giant prompt that the model reads in order. (Note: in the SDK you don't see "the prompt" as one string; you pass system, tools, and messages separately. Internally, the model receives a single sequence of tokens assembled from all of them.) Each one costs tokens, each one influences the answer:
- System prompt — your instructions, persona, output rules. Set once per conversation. Static across turns.
- Tool definitions — the JSON Schema for every tool you've registered. Even if Claude doesn't use any, it reads them all to decide. Static across turns.
- Conversation history — every prior user/assistant message, in order. Grows with every turn.
- Retrieved documents — chunks injected by your RAG pipeline (M09). Different on every turn.
- Tool results — the JSON output of past tool calls, embedded as messages. Accumulates with every tool use.
- Current user turn — the message you just appended. Smallest, most-attended layer.
Here's what that stack looks like for a UCC research agent on turn 8 of a session:
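As a rough numeric illustration of such a stack, the sketch below tallies a per-layer breakdown and reports how much of the assembled context did not come from the user's message. Only the 85-token user turn and the 9,685-token total are figures from this example; the other layer sizes are made-up placeholders chosen to add up to that total.

```python
# Hypothetical per-layer token counts. Only "current" (85) and the overall
# total (9,685) come from the module text; the other values are placeholders.
layer_tokens = {
    "system": 600,
    "tools": 1250,
    "history": 3900,
    "retrieved": 2650,
    "tool_results": 1200,
    "current": 85,
}


def assembled_share(layers: dict, user_layer: str = "current") -> float:
    """Percent of the context that was NOT typed by the user on this turn."""
    total = sum(layers.values())
    return 100 * (total - layers[user_layer]) / total


total_tokens = sum(layer_tokens.values())
share = assembled_share(layer_tokens)
print(f"total={total_tokens} tokens, assembled by you={share:.1f}%")
```

Swapping in the real output of a per-layer accounting function gives the same view for your own agent.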
The user typed 85 tokens. The model received 9,685 tokens. That means 99.1% of what shaped Claude's answer was content you assembled, not what the user said. Welcome to context engineering. From now on, when an agent gives a bad answer, your first question shouldn't be "was the prompt wrong?" It should be "what did the model actually see?"
Misconception: "Tool definitions don't count if Claude doesn't use them." Reality: They do. Every tool's JSON Schema is in the prompt the model reads, even on turns where no tool is called. 5 tools at ~250 tokens each = 1,250 tokens of overhead per turn.
Misconception: "The model focuses on the user message." Reality: The model attends across the whole window. With 99% of tokens elsewhere, those other layers dominate the output unless you actively curate them.
Misconception: "Long context windows mean I can stop worrying about size." Reality: A 1M-token window doesn't mean 1M useful tokens. Cost scales linearly with input. Latency rises with input. And a clean 50K-token context often beats a noisy 500K-token context, because signal-to-noise matters more than raw capacity.
Concept 2: The Four Levers
Now that you can see the stack, the question becomes: what do you do when it gets too big, too noisy, or too stale? Every context-engineering decision — every one — comes down to one of four moves. The rest of Track 3 is just specialized applications of these four.
You're packing for a two-week trip and your suitcase is full. You have four options for any item. (1) Just throw it in. (2) Use packing cubes to compress your shirts so they take half the space. (3) Leave it home and ship it ahead to the hotel for pickup if you need it. (4) Hand it to a travel buddy who will carry it on their trip and meet you there.
The pain: people default to option 1, the bag bursts, and they end up paying overweight fees or leaving things behind randomly at the airport.
The mapping: those four moves are add, compress, retrieve, offload. They are the only four moves you have when curating context. Every RAG technique, every memory architecture, every multi-agent pattern you'll learn later is one of these four wearing a different costume.
Here are the four levers, with what each one does to a single fact:
ADD
Include the fact directly in context. Simple, but spends tokens every turn.
COMPRESS
Summarize or truncate the fact. Lossy but cheap. Good for old turns.
RETRIEVE
Store the fact externally, fetch it only when relevant. Lossless but adds a lookup step.
OFFLOAD
Hand the fact to a subagent with its own context. Returns only a summary.
The same fact lived four different lives. Add kept all 38 words but burned 38 tokens every turn. Compress kept the gist (8 tokens) but lost the exact dollar amount — fine for "what does this filing cover?", fatal for "what's the collateral value?" Retrieve kept the full fact in a vector database, paid 0 tokens until the user asked something relevant, then fetched it. Offload spawned a subagent that read the whole filing, did the analysis, and returned only a 1-sentence verdict. There is no universally right answer — the lever depends on access frequency, freshness needs, and how lossy you can afford to be.
When to Reach for Each Lever
A simple decision heuristic, refined throughout Track 3:
- Add when the fact is small, accessed every turn, and exact wording matters — e.g. policy rules, the user's name, current date.
- Compress when the fact's gist matters but specifics are nice-to-have — e.g. "we already discussed the auth flow and concluded X" replacing 12 turns of debate.
- Retrieve when the fact is one of many candidates and only relevant ones matter — e.g. one document out of 10,000 in a knowledge base.
- Offload when the fact requires its own reasoning chain that would otherwise pollute your main context — e.g. "research the entire filing history of this debtor" handed to a subagent.
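The heuristic above can be sketched as a small decision function. This is only an illustration: the `Fact` fields, the order of the checks, and the 200-token cutoff for "small" are assumptions made for the example, not rules from the course.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    """Hypothetical description of one piece of candidate context."""
    tokens: int                  # size if included verbatim
    used_every_turn: bool        # needed on (nearly) every call?
    exact_wording_matters: bool  # would a paraphrase lose something critical?
    one_of_many: bool            # one candidate out of a large corpus?
    needs_own_reasoning: bool    # requires a long side investigation?


def choose_lever(fact: Fact, small_limit: int = 200) -> str:
    """Map a fact to add / compress / retrieve / offload.

    The check order and the default cutoff are illustrative assumptions.
    """
    if fact.needs_own_reasoning:
        return "offload"
    if fact.one_of_many:
        return "retrieve"
    if fact.used_every_turn and fact.exact_wording_matters and fact.tokens <= small_limit:
        return "add"
    return "compress"


policy_rule = Fact(40, True, True, False, False)
old_debate = Fact(3000, False, False, False, False)
kb_document = Fact(800, False, True, True, False)
filing_history = Fact(50000, False, False, False, True)

for name, fact in [("policy_rule", policy_rule), ("old_debate", old_debate),
                   ("kb_document", kb_document), ("filing_history", filing_history)]:
    print(name, "->", choose_lever(fact))
```

In practice the inputs are judgment calls, so treat the function as a checklist in code form rather than an automatic router.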
Concept 3: Static vs. Dynamic Context
Inside your context stack, some layers don't change between turns — the system prompt, the tool definitions, sometimes a fixed reference doc. Other layers change constantly — the conversation history grows, retrieved chunks differ per query. This split has a name and a consequence.
Static context is content that is identical across multiple consecutive API calls in a session. Dynamic context changes between calls. The split matters because Anthropic's prompt caching only kicks in for prefixes that match exactly. (Prompt caching is a feature where Anthropic stores the result of processing a prefix of your prompt and reuses it on subsequent calls if the prefix is identical. Cached input tokens cost ~10% of regular input tokens. Cache lifetime is 5 minutes by default. Covered in detail in M22.) If you put dynamic content before static content, none of the static content can be cached, and you're billed full price every turn.
The Ordering Rule
The rule is simple: static blocks first, dynamic blocks last. The cache reuses a prefix as long as the prefix is byte-identical to a previous call. As soon as the prefix diverges, every byte after that point is "new" — even if it would have been identical further down. So the order has to be: static system, static tools, static reference doc, then dynamic history, then dynamic current turn.
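To make the prefix rule concrete, here is a simplified sketch that compares two consecutive calls block by block and counts how many leading blocks are identical. Real caching matches on the token prefix rather than on whole blocks, and the block strings are invented for the example, so read this as a model of the idea only.

```python
def shared_prefix_blocks(call_a: list, call_b: list) -> int:
    """Count the leading blocks that are identical in both calls."""
    count = 0
    for a, b in zip(call_a, call_b):
        if a != b:
            break  # the prefix diverges here; nothing after this can be reused
        count += 1
    return count


# Hypothetical content blocks for two consecutive calls.
SYSTEM = "system: You are a UCC filings research assistant."
TOOLS = "tools: search_filings, get_filing"
REFERENCE = "reference: filing-format guide"

static_first_1 = [SYSTEM, TOOLS, REFERENCE, "retrieved: chunk on F-2019-1101", "user: question A"]
static_first_2 = [SYSTEM, TOOLS, REFERENCE, "retrieved: chunk on F-2020-0883", "user: question B"]

dynamic_first_1 = ["retrieved: chunk on F-2019-1101", SYSTEM, TOOLS, REFERENCE, "user: question A"]
dynamic_first_2 = ["retrieved: chunk on F-2020-0883", SYSTEM, TOOLS, REFERENCE, "user: question B"]

print("static-first reusable blocks:", shared_prefix_blocks(static_first_1, static_first_2))
print("dynamic-first reusable blocks:", shared_prefix_blocks(dynamic_first_1, dynamic_first_2))
```

The same three static blocks appear in both orderings, but only the static-first layout leaves them in a shared prefix.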
[Figure: the same content in two orderings: ✅ Static-First (cacheable) vs. ❌ Dynamic-First (cache miss)]
Same exact content, just different ordering: the static-first version costs 6.5x less per call once the cache warms up. Across 1,000 daily calls in a production agent, that's the difference between $2/day and $13/day. Over a year that's roughly $4,000 on ordering alone, and the gap grows linearly with call volume. M22 will cover the full caching mechanics; for now, just remember: static first.
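One way to see where a ratio like 6.5x can come from is to weight tokens by their billing rate. The sketch below assumes cached reads are billed at about 10% of normal input (the figure quoted in this module) and a hypothetical split of 9,400 static and 600 dynamic tokens per call. It ignores the one-time cost of writing the cache and all output tokens.

```python
CACHED_READ_RATE = 0.10  # cached input billed at ~10% of normal input, per this module


def effective_input_tokens(static_tokens: int, dynamic_tokens: int, cache_hit: bool) -> float:
    """Input tokens for one call, weighted by the rate they are billed at."""
    static_rate = CACHED_READ_RATE if cache_hit else 1.0
    return static_tokens * static_rate + dynamic_tokens


# Hypothetical split: 9,400 static tokens and 600 dynamic tokens per call.
miss = effective_input_tokens(9_400, 600, cache_hit=False)
hit = effective_input_tokens(9_400, 600, cache_hit=True)
ratio = miss / hit
print(f"cache miss: {miss:.0f}  cache hit: {hit:.0f}  ratio: {ratio:.1f}x")
```

The larger the static share of the context, the closer the saving gets to the full 10x discount.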
Prompt caching reuses prefixes that match exactly. A common production bug: identical content in different order produces no cache hits because the prefix breaks. Use cache_control markers to declare cacheable boundaries explicitly. Note: per the official cert exam guide, prompt-caching implementation details are out of scope — the exam expects you to know caching exists and to recognize prefix-stable ordering, but won't ask about cache TTL, hit rates, or pricing.
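A minimal sketch of declaring a cacheable boundary with a cache_control marker follows. It only builds the keyword arguments for client.messages.create() and sends nothing; the helper name, the reference-doc block, and the max_tokens value are assumptions for illustration. Note that the API only caches prefixes above a minimum length, so very short prompts may see no benefit.

```python
def build_cached_request(system_prompt: str, reference_doc: str,
                         tools: list, messages: list,
                         model: str = "claude-haiku-4-5-20251001") -> dict:
    """Build kwargs for client.messages.create() with the static prefix
    marked as cacheable. Hypothetical helper; no network call is made."""
    system_blocks = [
        {"type": "text", "text": system_prompt},
        {
            "type": "text",
            "text": reference_doc,
            # Cache breakpoint: everything up to and including this block
            # (tools and system content) is eligible for reuse.
            "cache_control": {"type": "ephemeral"},
        },
    ]
    return {
        "model": model,
        "max_tokens": 1024,       # illustrative value
        "system": system_blocks,  # static content first
        "tools": tools,
        "messages": messages,     # dynamic content last
    }


request = build_cached_request(
    system_prompt="You are a UCC filings research assistant.",
    reference_doc="(static reference text goes here)",
    tools=[],
    messages=[{"role": "user", "content": "Summarize filing F-2019-1101."}],
)
# With a configured client: client.messages.create(**request)
print(request["system"][-1]["cache_control"])
```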
Concept 4: Position Effects — Lost in the Middle
Even if every byte you put in context is high-quality, where you put it matters. Large language models attend more strongly to content at the start and end of the context window than to content in the middle. This isn't a bug to fix; it's a property of how attention layers work, replicated across nearly every model family. It's been documented under the name "lost in the middle," from the 2023 Liu et al. paper "Lost in the Middle: How Language Models Use Long Contexts." Researchers placed the same answer at different positions in a long context and measured retrieval accuracy. Accuracy was high at the start (~75%), high at the end (~70%), and dropped to ~45% in the middle.
Think about a 23-bullet meeting agenda. A week later, ask a participant what was decided. They'll remember bullet 1 (the opening goal), bullet 23 (the action items), and maybe one of the most heated discussions in the middle. Bullets 7–15? Forgotten.
The pain: organizers stuff the most important detail at bullet 12 because that's when the meeting hit its stride — and a week later nobody remembers it.
The mapping: Claude reads context the same way. Information at position 1 and at the very end is recalled with high fidelity. Information in the middle of a long context, even when literally present, is sometimes effectively invisible to the model's reasoning.
Here's what that looks like when researchers test the same fact at different positions:
Source: Liu et al. 2023 — pattern reproduces across most modern LLMs including Claude, GPT-4, Llama 2.
What This Means For Your Code
Three concrete rules drop out of the position-effects research:
- Put the most critical instruction at the start of the system prompt, not buried at line 47.
- Put the most-relevant retrieved chunk at the end of the retrieved-docs block (closest to the user's current question). Re-ranking pipelines exist exactly for this reason — they reorder chunks so the best one is last.
- Repeat critical case facts — once in the system prompt (start position) and once just before the user message (end position). This redundancy costs ~50 tokens and meaningfully improves recall.
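The second and third rules can be sketched in a few lines: sort retrieved chunks so the most relevant one lands last, then repeat the case facts right before the question. The relevance scores, the helper names, and the <case_facts> tag are hypothetical choices for this example.

```python
def order_best_last(scored_chunks: list[tuple[float, str]]) -> list[str]:
    """Sort chunks by ascending relevance so the best one ends up last."""
    return [text for _, text in sorted(scored_chunks, key=lambda pair: pair[0])]


def assemble_user_turn(case_facts: str, scored_chunks: list[tuple[float, str]],
                       question: str) -> str:
    """Place chunks first, then repeat the case facts just before the question."""
    docs = "\n---\n".join(order_best_last(scored_chunks))
    return (
        f"<reference_docs>\n{docs}\n</reference_docs>\n\n"
        f"<case_facts>\n{case_facts}\n</case_facts>\n\n"
        f"{question}"
    )


# Hypothetical (score, text) pairs, as a re-ranker might return them.
chunks = [
    (0.91, "Filing F-2019-1101 details"),
    (0.35, "General UCC background"),
    (0.62, "Secured party overview"),
]
ordered = order_best_last(chunks)
turn = assemble_user_turn("Active filings = 2019+", chunks, "What is the collateral total?")
print(turn)
```

The same case facts would also sit at the top of the system prompt, which covers the start position.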
The "lost in the middle" effect is a recurring exam topic. Trick questions describe a long-context agent that "reads" a 50-page document and ask why it misses a fact in the middle. The answer is always position effects, not "the model wasn't given the document." Mitigation: use immutable case-facts blocks at the START position, position the current query at the END.
Concept 5: Context Rot
Long-running agents accumulate junk. Tool results from searches that went nowhere. Plans that got abandoned three turns later. User instructions that were superseded ("actually, ignore that last bit"). Errors that were resolved. The model dutifully attends to all of it — including the parts that no longer apply. The result: by turn 30, your agent retries already-failed searches, contradicts its own corrections, and answers questions you never asked.
Think about an email thread that's been forwarded fifteen times with everyone leaving the full quoted history. The actual decision is buried under "Re: Re: Re: Re: FW: FW:" — and a new participant joining the thread acts on stale information from forward #3 because they read top-down and that's where the dollar figure happened to live.
The pain: every reply makes the thread harder to read, not easier. People reply to the wrong version of the spec. Decisions get re-litigated. The thread becomes net-negative information.
The mapping: the same thing happens in agent context windows. Each turn appends. Nothing self-cleans. By turn 25, the model is reading a thread that's mostly noise, and a small but critical fact at turn 12 is buried under tool-result garbage from turns 13–24.
Context rot is the degradation of agent quality caused by accumulated stale, contradictory, or low-relevance content in the context window. It's distinct from running out of tokens — you can hit context rot at 60% window utilization. The key signal is signal-to-noise ratio, not raw size.
Here's a real symptom you'll see in production: the agent works great for the first 10 turns, then starts visibly degrading. Token usage may still be well under the limit. The fix isn't a bigger window — it's a smaller, cleaner one. You compact, summarize, or evict the rotted content.
[Figure: before/after comparison of the agent's context. BEFORE: 30 turns, lots of rot. AFTER: 8 turns, clean signal.]
The AFTER context isn't just shorter (3 messages vs. 11); it's cleaner. The duplicate failed search is gone. The superseded "ignore 2018" instruction is folded into a positive case fact ("active filings = 2019+"). The current question is now adjacent to the relevant facts, not buried under 8 turns of resolved errors. Same model, same question, dramatically better answer. This is what M08 will operationalize for conversation history specifically, but the principle is broader: compaction isn't just about budget, it's about signal.
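As a rough sketch of the "evict" option, the function below drops tool results that ended in an error and repeated calls to the same tool with the same argument. It assumes tool results are stored as strings of the form "[tool_result] call(...) returned ...", which is a convention invented for this example. It does not remove the assistant replies that commented on the evicted results, so a fuller cleanup would still need summarization.

```python
def evict_rot(history: list[dict]) -> list[dict]:
    """Drop errored tool results and repeated identical tool calls."""
    seen_calls = set()
    cleaned = []
    for msg in history:
        content = msg["content"]
        if content.startswith("[tool_result]"):
            if "ERROR" in content:
                continue  # failed call: nothing useful left to attend to
            # Case-insensitive call signature: the text before " returned".
            call = content.split(" returned", 1)[0].lower()
            if call in seen_calls:
                continue  # same call already answered earlier in the session
            seen_calls.add(call)
        cleaned.append(msg)
    return cleaned


history = [
    {"role": "user", "content": "[tool_result] search_filings(debtor='Acme Corp') returned 12 filings"},
    {"role": "user", "content": "[tool_result] get_filing('F-2018-04422') returned ERROR: rate_limit_exceeded"},
    {"role": "user", "content": "[tool_result] search_filings(debtor='ACME CORP') returned 12 filings (same as before)"},
    {"role": "user", "content": "Focus only on 2019 and later"},
]
cleaned = evict_rot(history)
print(len(history), "->", len(cleaned), "messages")
```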
Code Walkthrough: The ContextBudget Class
Time to make this concrete. We're going to build a small class that takes the six layers of your context, accounts for tokens in each, and applies one of the four levers when the total exceeds a budget. You'll use this directly in the lab and reuse the pattern in M08 (history) and M11 (memory).
Here's the design: ContextBudget is initialized with the six layers and a target budget. It exposes account() for the per-layer breakdown, fits() for the budget check, apply_strategy() for picking a lever, and build_messages() for the assembled message array ready for the Anthropic SDK. We'll annotate the implementation in four chunks.
Chunk 1: Per-layer Token Accounting
WHAT: The constructor stores the six layers separately and counts tokens in each. WHY: You can't apply a lever intelligently without knowing which layer is the heaviest. GOTCHA: Don't treat len(text) // 4 as an exact token count: that approximation is off by 20%+ for tool definitions and JSON-heavy content. For production code, use the SDK's token-counting endpoint, client.messages.count_tokens(); we'll use a fast heuristic plus a real API call wrapper for the lab.
import json
from dataclasses import dataclass, field

# Fast heuristic: ~4 chars per token. Good enough for rough budgeting decisions.
# For production cost tracking, use client.messages.count_tokens() from the SDK.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

@dataclass
class ContextBudget:
    """Accounts for and curates the six layers of context."""
    system_prompt: str = ""
    tool_definitions: list = field(default_factory=list)  # list of dicts
    history: list = field(default_factory=list)           # list of {"role", "content"}
    retrieved_docs: list = field(default_factory=list)    # list of strings
    tool_results: list = field(default_factory=list)      # list of strings
    current_user_message: str = ""
    target_budget: int = 8000                              # tokens

    def account(self) -> dict:
        """Return per-layer token counts."""
        return {
            "system": estimate_tokens(self.system_prompt),
            "tools": estimate_tokens(json.dumps(self.tool_definitions)),
            "history": sum(estimate_tokens(m["content"]) for m in self.history),
            "retrieved": sum(estimate_tokens(d) for d in self.retrieved_docs),
            "tool_results": sum(estimate_tokens(r) for r in self.tool_results),
            "current": estimate_tokens(self.current_user_message),
        }

    def total(self) -> int:
        return sum(self.account().values())

    def fits(self) -> bool:
        return self.total() <= self.target_budget
// Fast heuristic: ~4 chars per token. Good enough for rough budgeting decisions.
// For production cost tracking, use client.messages.countTokens() from the SDK.
function estimateTokens(text) {
  return Math.max(1, Math.floor((text || '').length / 4));
}

export class ContextBudget {
  constructor({
    systemPrompt = '',
    toolDefinitions = [],
    history = [],          // [{ role, content }]
    retrievedDocs = [],    // [string]
    toolResults = [],      // [string]
    currentUserMessage = '',
    targetBudget = 8000,
  } = {}) {
    Object.assign(this, {
      systemPrompt, toolDefinitions, history,
      retrievedDocs, toolResults, currentUserMessage, targetBudget,
    });
  }

  // Return per-layer token counts.
  account() {
    return {
      system: estimateTokens(this.systemPrompt),
      tools: estimateTokens(JSON.stringify(this.toolDefinitions)),
      history: this.history.reduce((s, m) => s + estimateTokens(m.content), 0),
      retrieved: this.retrievedDocs.reduce((s, d) => s + estimateTokens(d), 0),
      toolResults: this.toolResults.reduce((s, r) => s + estimateTokens(r), 0),
      current: estimateTokens(this.currentUserMessage),
    };
  }

  total() {
    return Object.values(this.account()).reduce((a, b) => a + b, 0);
  }

  fits() {
    return this.total() <= this.targetBudget;
  }
}
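The GOTCHA in this chunk recommends an exact count for production. One possible shape for that wrapper is sketched below: it falls back to the heuristic when no client is passed and otherwise calls the SDK's messages.count_tokens endpoint. The wrapper name and the default model are assumptions, and wrapping the text as a single user message adds a few tokens of message overhead to the result.

```python
def estimate_tokens(text: str) -> int:
    """Heuristic fallback: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def count_tokens(text: str, client=None,
                 model: str = "claude-haiku-4-5-20251001") -> int:
    """Return an API-based count when a client is supplied, else the heuristic.

    `client` is expected to be an anthropic.Anthropic() instance. Hypothetical
    wrapper; the text is counted as if it were a single user message.
    """
    if client is None:
        return estimate_tokens(text)
    result = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": text}],
    )
    return result.input_tokens


# Offline usage: no client, so the heuristic is used.
print(count_tokens("x" * 40))
# With the SDK installed and an API key set:
#   import anthropic
#   print(count_tokens("some text", client=anthropic.Anthropic()))
```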
Chunk 2: The Compress Lever
WHAT: When the budget is busted, replace old turns with a Claude-generated summary, keeping the most recent k turns verbatim. WHY: History is usually the heaviest layer in long sessions, and old turns have lower relevance per token than recent ones. GOTCHA: Use a separate Claude call to write the summary, with the explicit instruction "preserve all named entities, IDs, and numeric values." Without that instruction, summarization eats critical specifics first — this is the cert-tip flag for Domain 5.1.
import anthropic

SUMMARY_INSTRUCTION = """Summarize the following conversation in 3-5 sentences.
CRITICAL: preserve every named entity (people, companies, IDs), exact numeric
values (dollar amounts, dates, counts), and any explicit user instructions.
Drop only redundant or resolved tool errors and abandoned plans.
Conversation:
{transcript}
Summary:"""

def compress_history(history: list, keep_recent: int = 6) -> list:
    """Summarize older turns; keep the last `keep_recent` turns verbatim."""
    if len(history) <= keep_recent:
        return history  # nothing to compress
    # Split by index so keep_recent=0 works (history[-0:] would return everything).
    split = len(history) - keep_recent
    older, recent = history[:split], history[split:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    client = anthropic.Anthropic()
    try:
        result = client.messages.create(
            model="claude-haiku-4-5-20251001",  # cheap model for summarization
            max_tokens=400,
            messages=[{"role": "user",
                       "content": SUMMARY_INSTRUCTION.format(transcript=transcript)}],
        )
        summary_text = result.content[0].text
    except anthropic.APIError as e:
        # If summarization fails, fall back to truncation rather than crashing.
        print(f"Summarization failed: {e.message}; falling back to truncation.")
        return recent
    summary_msg = {"role": "user",
                   "content": f"[Summary of {len(older)} earlier turns] {summary_text}"}
    return [summary_msg] + recent
import Anthropic from '@anthropic-ai/sdk';

const SUMMARY_INSTRUCTION = `Summarize the following conversation in 3-5 sentences.
CRITICAL: preserve every named entity (people, companies, IDs), exact numeric
values (dollar amounts, dates, counts), and any explicit user instructions.
Drop only redundant or resolved tool errors and abandoned plans.
Conversation:
{transcript}
Summary:`;

export async function compressHistory(history, keepRecent = 6) {
  if (history.length <= keepRecent) return history; // nothing to compress
  // Split by index so keepRecent = 0 works (slice(-0) would return everything).
  const split = history.length - keepRecent;
  const older = history.slice(0, split);
  const recent = history.slice(split);
  const transcript = older.map(m => `${m.role}: ${m.content}`).join('\n');
  const client = new Anthropic();
  let summaryText;
  try {
    const result = await client.messages.create({
      model: 'claude-haiku-4-5-20251001', // cheap model for summarization
      max_tokens: 400,
      messages: [{
        role: 'user',
        // Function replacer: "$" sequences in the transcript are inserted literally.
        content: SUMMARY_INSTRUCTION.replace('{transcript}', () => transcript),
      }],
    });
    summaryText = result.content[0].text;
  } catch (error) {
    // If summarization fails, fall back to truncation rather than crashing.
    console.error(`Summarization failed: ${error.message}; falling back to truncation.`);
    return recent;
  }
  const summaryMsg = {
    role: 'user',
    content: `[Summary of ${older.length} earlier turns] ${summaryText}`,
  };
  return [summaryMsg, ...recent];
}
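Because summarization tends to drop specifics first, it can help to check a summary before trusting it. The sketch below compares IDs and dollar amounts found in the original transcript against the summary text. The two regex patterns are assumptions modeled on the UCC examples in this module (filing IDs like F-2019-1101, amounts like $4.2M), and the substring check is deliberately simple.

```python
import re

# Assumed formats, modeled on this module's UCC examples.
ENTITY_PATTERNS = [
    r"F-\d{4}-\d+",            # filing IDs such as F-2019-1101
    r"\$\d+(?:\.\d+)?[MK]?",   # dollar amounts such as $4.2M
]


def missing_entities(original: str, summary: str) -> set[str]:
    """Return IDs and amounts present in the original but absent from the summary."""
    found = set()
    for pattern in ENTITY_PATTERNS:
        found.update(re.findall(pattern, original))
    return {entity for entity in found if entity not in summary}


original = "F-2019-1101 has $4.2M collateral; F-2019-2287 has $7.8M."
faithful = "Two filings: F-2019-1101 ($4.2M) and F-2019-2287 ($7.8M)."
lossy = "Two 2019 filings with several million dollars in collateral."

print("faithful summary is missing:", missing_entities(original, faithful))
print("lossy summary is missing:", sorted(missing_entities(original, lossy)))
```

A non-empty result is a reasonable trigger for re-running the summary or falling back to keeping more turns verbatim.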
Chunk 3: Strategy Dispatch
WHAT: One method picks the right lever based on the budget overflow. WHY: You don't want strategy selection scattered across your agent code — centralize it. GOTCHA: The order of fallback matters — we compress history before evicting retrieved docs, because retrieval was already curated by the RAG pipeline whereas history is "everything that ever happened." If that still overflows, we'd reach for retrieve/offload (left as stretch goals).
    # Add these methods to the ContextBudget class
    def apply_compress(self, keep_recent: int = 6) -> "ContextBudget":
        """Compress history; return self for chaining."""
        self.history = compress_history(self.history, keep_recent=keep_recent)
        return self

    def apply_strategy(self, strategy: str = "auto") -> "ContextBudget":
        """Dispatch to a specific lever, or auto-pick based on overflow.

        Strategies: 'compress', 'truncate_retrieved', 'auto', 'noop'.
        """
        if self.fits() or strategy == "noop":
            return self
        if strategy == "compress" or strategy == "auto":
            self.apply_compress()
        if not self.fits() and strategy in ("truncate_retrieved", "auto"):
            # Last resort: keep only the top 2 retrieved chunks.
            self.retrieved_docs = self.retrieved_docs[:2]
        return self
  // Add these methods to ContextBudget class
  async applyCompress(keepRecent = 6) {
    this.history = await compressHistory(this.history, keepRecent);
    return this;
  }

  async applyStrategy(strategy = 'auto') {
    if (this.fits() || strategy === 'noop') return this;
    if (strategy === 'compress' || strategy === 'auto') {
      await this.applyCompress();
    }
    if (!this.fits() && (strategy === 'truncateRetrieved' || strategy === 'auto')) {
      // Last resort: keep only the top 2 retrieved chunks.
      this.retrievedDocs = this.retrievedDocs.slice(0, 2);
    }
    return this;
  }
Chunk 4: Static-First Assembly
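WHAT: build_messages() assembles the final SDK payload with static blocks first and dynamic blocks last, the cache-friendly order from Concept 3. WHY: This is where Concept 3's ordering rule turns into actual money saved. GOTCHA: Retrieved docs are dynamic (they change per query), so this implementation attaches them to the current user turn, after history. That keeps the static prefix (system, tools) and the earlier history identical across calls, and it puts the chunks in the high-recall end position from Concept 4. If the same chunks recur for similar queries within a session, placing them ahead of history instead can still earn partial cache hits; treat that as a trade-off to measure, not a default.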
    # Add this method to the ContextBudget class
    def build_messages(self) -> tuple[str, list, list]:
        """Return (system, tools, messages) ready for client.messages.create().

        Order: STATIC (system, tools), then DYNAMIC (history, then retrieved
        docs plus the current turn).
        """
        # Retrieved docs are wrapped in one block and prepended to the current user message.
        retrieved_block = ""
        if self.retrieved_docs:
            retrieved_block = (
                "\n\n<reference_docs>\n"
                + "\n---\n".join(self.retrieved_docs)
                + "\n</reference_docs>\n\n"
            )
        messages = list(self.history)  # copy to avoid mutating caller's list
        # Append current user turn with retrieved docs at the END (high-recall position).
        messages.append({
            "role": "user",
            "content": retrieved_block + self.current_user_message,
        })
        return self.system_prompt, self.tool_definitions, messages
  // Add this method to the ContextBudget class
  buildMessages() {
    let retrievedBlock = '';
    if (this.retrievedDocs.length) {
      retrievedBlock = '\n\n<reference_docs>\n' +
        this.retrievedDocs.join('\n---\n') + '\n</reference_docs>\n\n';
    }
    const messages = [
      ...this.history,
      { role: 'user', content: retrievedBlock + this.currentUserMessage },
    ];
    return {
      system: this.systemPrompt,
      tools: this.toolDefinitions,
      messages,
    };
  }
You now have a reusable class that turns the four levers into actual code: account() shows you what's heavy, apply_strategy("auto") picks the right lever, build_messages() emits the SDK payload in cache-friendly order. In the lab below you'll point this at a 30-turn rotted transcript, watch it shrink, and confirm the answer quality holds. From M08 forward, you'll see this exact pattern recur — just specialized: M08 swaps the compress strategy, M09 fills retrieved_docs, M14 spawns subagents to populate one of the layers.
Hands-On Lab: The Poisoned Transcript
Lab Overview
What you'll build: A diagnostic that reads a 30-turn rotted UCC research transcript, prints per-layer token breakdown, applies the compress lever, and shows the before/after token + answer-quality delta.
Time: 25-35 minutes · Level: Beginner → Intermediate
Prerequisites: Python 3.10+ or Node.js 18+, an Anthropic API key, the ContextBudget class from the walkthrough above.
Files you'll create:
- poisoned_transcript.json: the rotted 30-turn fixture (provided below, copy-paste)
- context_budget.py (or contextBudget.mjs): the class from the walkthrough
- diagnose.py (or diagnose.mjs): the diagnostic and fix runner
Environment Setup
Single copy-pasteable block. Pick Python or Node.js, not both:
# Python
mkdir m03b-context-lab && cd m03b-context-lab
python -m venv venv && source venv/bin/activate # Windows: venv\Scripts\activate
pip install anthropic
export ANTHROPIC_API_KEY=sk-ant-... # Windows: set ANTHROPIC_API_KEY=sk-ant-...
# Node.js (alternative)
mkdir m03b-context-lab && cd m03b-context-lab
npm init -y && npm install @anthropic-ai/sdk
export ANTHROPIC_API_KEY=sk-ant-...
Step 1: Drop in the Poisoned Transcript Fixture
This is your raw material: a UCC research session that started fine and rotted by turn 30. The user is asking about debtor "Acme Corp" filings; over the session they corrected scope twice, hit a rate limit, and abandoned one tool path. The current question is at the end. Save this as poisoned_transcript.json:
{
"system_prompt": "You are a UCC filings research assistant. Help users find and analyze public records filings. Always cite filing IDs. If a search fails, retry once with a normalized query.",
"tool_definitions": [
{"name": "search_filings", "description": "Search UCC filings by debtor name", "input_schema": {"type": "object", "properties": {"debtor": {"type": "string"}}, "required": ["debtor"]}},
{"name": "get_filing", "description": "Retrieve a single filing by ID", "input_schema": {"type": "object", "properties": {"filing_id": {"type": "string"}}, "required": ["filing_id"]}},
{"name": "calc_collateral_total", "description": "Sum collateral values across filing IDs", "input_schema": {"type": "object", "properties": {"filing_ids": {"type": "array", "items": {"type": "string"}}}, "required": ["filing_ids"]}}
],
"history": [
{"role": "user", "content": "Research debtor Acme Corp - I need an overview of their UCC filings"},
{"role": "assistant", "content": "I'll search for filings under 'Acme Corp'."},
{"role": "user", "content": "[tool_result] search_filings(debtor='Acme Corp') returned 12 filings spanning 2017-2023, IDs F-2017-0044, F-2018-04421, F-2018-04422, F-2019-1101, F-2019-2287, F-2020-0883, F-2020-1455, F-2021-0192, F-2021-3340, F-2022-0028, F-2022-1199, F-2023-0501"},
{"role": "assistant", "content": "Found 12 filings from 2017 to 2023. Want a year-by-year breakdown?"},
{"role": "user", "content": "Yes, but actually focus only on 2019 and later"},
{"role": "assistant", "content": "Got it. Restricting to 2019+: 8 filings. Pulling details."},
{"role": "user", "content": "[tool_result] get_filing('F-2018-04421') returned full filing data for 2018 (collateral: $8.1M, secured party: First Bank)"},
{"role": "assistant", "content": "I see I pulled a 2018 filing by mistake. Discarding."},
{"role": "user", "content": "[tool_result] get_filing('F-2018-04422') returned ERROR: rate_limit_exceeded after 3 retries"},
{"role": "assistant", "content": "Rate limit on another 2018 filing - skipping, that's outside our scope anyway."},
{"role": "user", "content": "[tool_result] search_filings(debtor='ACME CORP') returned 12 filings (same as before, normalization quirk)"},
{"role": "assistant", "content": "Duplicate result from a normalized retry. Ignoring."},
{"role": "user", "content": "I said 2019+! Why are you still pulling 2018 filings?"},
{"role": "assistant", "content": "Apologies - that was a confused retry. Going forward, only 2019 and later: F-2019-1101, F-2019-2287, F-2020-0883, F-2020-1455, F-2021-0192, F-2021-3340, F-2022-0028, F-2022-1199, F-2023-0501. That's 9 filings."},
{"role": "user", "content": "[tool_result] get_filing('F-2019-1101') returned collateral: $4.2M, secured party: Capital Trust, year: 2019"},
{"role": "assistant", "content": "F-2019-1101: $4.2M collateral, Capital Trust as secured party."},
{"role": "user", "content": "[tool_result] get_filing('F-2019-2287') returned collateral: $7.8M, secured party: First Bank, year: 2019"},
{"role": "assistant", "content": "F-2019-2287: $7.8M, First Bank."},
{"role": "user", "content": "[tool_result] get_filing('F-2020-0883') returned collateral: $11.5M, secured party: Capital Trust, year: 2020"},
{"role": "assistant", "content": "F-2020-0883: $11.5M, Capital Trust."},
{"role": "user", "content": "[tool_result] get_filing('F-2020-1455') returned collateral: $3.4M, secured party: GE Capital, year: 2020"},
{"role": "assistant", "content": "F-2020-1455: $3.4M, GE Capital."},
{"role": "user", "content": "[tool_result] get_filing('F-2021-0192') returned collateral: $9.0M, secured party: First Bank, year: 2021"},
{"role": "assistant", "content": "F-2021-0192: $9.0M, First Bank."},
{"role": "user", "content": "[tool_result] get_filing('F-2021-3340') returned collateral: $5.5M, secured party: Capital Trust, year: 2021"},
{"role": "assistant", "content": "F-2021-3340: $5.5M, Capital Trust."},
{"role": "user", "content": "[tool_result] get_filing('F-2022-0028') returned collateral: $13.2M, secured party: First Bank, year: 2022"},
{"role": "assistant", "content": "F-2022-0028: $13.2M, First Bank."},
{"role": "user", "content": "[tool_result] get_filing('F-2022-1199') returned collateral: $6.7M, secured party: GE Capital, year: 2022"},
{"role": "assistant", "content": "F-2022-1199: $6.7M, GE Capital."},
{"role": "user", "content": "[tool_result] get_filing('F-2023-0501') returned collateral: $8.9M, secured party: Capital Trust, year: 2023"},
{"role": "assistant", "content": "F-2023-0501: $8.9M, Capital Trust."}
],
"current_user_message": "What's the total collateral value across all 2019+ filings, and which secured party shows up most often?"
}
What just landed in your folder: a fixture with one system prompt, three tool definitions, 30 history turns (mix of user, assistant, and tool-result messages), and a current question. The poisoning is subtle but real: two rate-limit results, two 2018 filings the user explicitly excluded, one duplicate normalized search. Your diagnostic will surface these.
If ls -lh poisoned_transcript.json (Windows: dir poisoned_transcript.json) shows a file around 4-5 KB, you're set. If you see a 0-byte file, the copy-paste didn't include all lines; re-copy the JSON.
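File size only tells you that something was pasted. A structural check is a slightly stronger signal that the JSON is complete. The sketch below is an illustration rather than part of the lab scripts: check_fixture is a hypothetical helper, the four key names are the ones the Step 2 runner reads, and the inline sample is a made-up miniature so the snippet runs without the real file.

```python
from collections import Counter

# Keys the Step 2 runner reads from the fixture.
REQUIRED_KEYS = ("system_prompt", "tool_definitions", "history", "current_user_message")


def check_fixture(fixture: dict) -> dict:
    """Return a small report; raise ValueError if a required key is missing."""
    missing = [k for k in REQUIRED_KEYS if k not in fixture]
    if missing:
        raise ValueError(f"fixture is missing keys: {missing}")
    roles = Counter(msg["role"] for msg in fixture["history"])
    return {
        "tools": len(fixture["tool_definitions"]),
        "turns": len(fixture["history"]),
        "roles": dict(roles),
    }


# Made-up miniature fixture, for demonstration only.
sample = {
    "system_prompt": "placeholder",
    "tool_definitions": [{"name": "search_filings"}],
    "history": [
        {"role": "user", "content": "hello"},
        {"role": "assistant", "content": "hi"},
    ],
    "current_user_message": "placeholder question",
}
print(check_fixture(sample))
```

To check the real file, load it with json.load and pass the result to check_fixture. On the lab fixture the report should show 3 tools and 30 turns, matching the description above.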
Step 2: Build the Diagnostic + Compress Runner
What & why: This script loads the fixture, instantiates the ContextBudget class from the walkthrough, prints per-layer token breakdown before any fix, applies the compress lever, then prints the breakdown after. It's the smallest possible loop that exercises the full module.
First, save the ContextBudget class from the Code Walkthrough as context_budget.py (Python) or contextBudget.mjs (Node). Then create the runner:
"""diagnose.py - Reads poisoned_transcript.json, runs ContextBudget, prints before/after."""
import json

from context_budget import ContextBudget


def print_breakdown(label: str, account: dict, total: int, target: int):
    print(f"\n--- {label} ---")
    for layer, tokens in account.items():
        bar = "#" * max(1, tokens // 100)
        print(f" {layer:<14} {tokens:>5} tok {bar}")
    pct = round(100 * total / target)
    status = "OVER BUDGET" if total > target else "ok"
    print(f" TOTAL {total:>5} tok ({pct}% of {target} budget) [{status}]")


def detect_rot(history: list) -> list:
    """Quick heuristic rot scan - look for explicit signals in tool-result messages."""
    signals = []
    for i, msg in enumerate(history):
        c = msg.get("content", "")
        if "rate_limit" in c.lower() or "ERROR" in c:
            signals.append(f" turn {i}: failed tool call (likely retried)")
        if "duplicate" in c.lower() or "same as before" in c.lower():
            signals.append(f" turn {i}: duplicate result")
    return signals


def main():
    with open("poisoned_transcript.json") as f:
        fixture = json.load(f)

    # Build the budget object from the fixture.
    budget = ContextBudget(
        system_prompt=fixture["system_prompt"],
        tool_definitions=fixture["tool_definitions"],
        history=fixture["history"],
        current_user_message=fixture["current_user_message"],
        target_budget=4000,  # deliberately tight to force compression
    )

    print("=" * 60)
    print("POISONED TRANSCRIPT DIAGNOSTIC")
    print("=" * 60)
    print_breakdown("BEFORE", budget.account(), budget.total(), budget.target_budget)

    rot_signals = detect_rot(budget.history)
    if rot_signals:
        print("\nRot signals detected:")
        for s in rot_signals:
            print(s)

    # Apply the compress lever.
    print("\nApplying compress lever (keep_recent=6)...")
    budget.apply_compress(keep_recent=6)
    print_breakdown("AFTER COMPRESSION", budget.account(), budget.total(), budget.target_budget)

    # Build the final payload that would be sent to Claude.
    system, tools, messages = budget.build_messages()
    print(f"\nFinal payload: {len(messages)} messages, "
          f"{budget.total()} input tokens, fits budget = {budget.fits()}")


if __name__ == "__main__":
    main()
// diagnose.mjs - Reads poisoned_transcript.json, runs ContextBudget, prints before/after.
import { readFileSync } from 'fs';
import { ContextBudget } from './contextBudget.mjs';

function printBreakdown(label, account, total, target) {
  console.log(`\n--- ${label} ---`);
  for (const [layer, tokens] of Object.entries(account)) {
    const bar = '#'.repeat(Math.max(1, Math.floor(tokens / 100)));
    console.log(` ${layer.padEnd(14)} ${String(tokens).padStart(5)} tok ${bar}`);
  }
  const pct = Math.round((100 * total) / target);
  const status = total > target ? 'OVER BUDGET' : 'ok';
  console.log(` TOTAL ${String(total).padStart(5)} tok (${pct}% of ${target} budget) [${status}]`);
}

function detectRot(history) {
  const signals = [];
  history.forEach((msg, i) => {
    const c = msg.content || '';
    if (c.toLowerCase().includes('rate_limit') || c.includes('ERROR')) {
      signals.push(` turn ${i}: failed tool call (likely retried)`);
    }
    if (c.toLowerCase().includes('duplicate') || c.toLowerCase().includes('same as before')) {
      signals.push(` turn ${i}: duplicate result`);
    }
  });
  return signals;
}

async function main() {
  const fixture = JSON.parse(readFileSync('poisoned_transcript.json', 'utf-8'));

  // Build the budget object from the fixture.
  const budget = new ContextBudget({
    systemPrompt: fixture.system_prompt,
    toolDefinitions: fixture.tool_definitions,
    history: fixture.history,
    currentUserMessage: fixture.current_user_message,
    targetBudget: 4000,
  });

  console.log('='.repeat(60));
  console.log('POISONED TRANSCRIPT DIAGNOSTIC');
  console.log('='.repeat(60));
  printBreakdown('BEFORE', budget.account(), budget.total(), budget.targetBudget);

  const rotSignals = detectRot(budget.history);
  if (rotSignals.length) {
    console.log('\nRot signals detected:');
    rotSignals.forEach(s => console.log(s));
  }

  // Apply the compress lever.
  console.log('\nApplying compress lever (keepRecent=6)...');
  await budget.applyCompress(6);
  printBreakdown('AFTER COMPRESSION', budget.account(), budget.total(), budget.targetBudget);

  // Build the final payload that would be sent to Claude.
  const { messages } = budget.buildMessages();
  console.log(`\nFinal payload: ${messages.length} messages, ${budget.total()} input tokens, fits budget = ${budget.fits()}`);
}

main().catch(err => { console.error(err); process.exit(1); });
Run it:
python diagnose.py # or: node diagnose.mjs
If the compressed summary drops filing IDs, SUMMARY_INSTRUCTION in context_budget.py may be missing the "preserve every named entity" line. Re-check Chunk 2 of the walkthrough.
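The detect_rot heuristic in the runner keys on failed calls and duplicate results; it does not look for filings that fall outside the user's stated scope. One way to cover that case is sketched below. find_out_of_scope and its min_year parameter are hypothetical names, the regex assumes filing IDs follow the F-YYYY-NNNN pattern seen in the fixture, and the 2018 ID in the sample is invented for the demonstration.

```python
import re

# Assumes IDs look like F-2019-1101: "F-", a four-digit year, "-", four digits.
FILING_ID = re.compile(r"\bF-(\d{4})-\d{4}\b")


def find_out_of_scope(history: list, min_year: int = 2019) -> list:
    """Return (turn_index, filing_id) pairs whose year precedes min_year."""
    hits = []
    for i, msg in enumerate(history):
        content = msg.get("content", "")
        for match in FILING_ID.finditer(content):
            if int(match.group(1)) < min_year:
                hits.append((i, match.group(0)))
    return hits


# Tiny sample: the first entry uses an invented 2018 ID and amount.
history = [
    {"role": "user", "content": "[tool_result] get_filing('F-2018-0042') returned collateral: $1.0M"},
    {"role": "assistant", "content": "F-2019-1101: $4.2M collateral, Capital Trust as secured party."},
]
print(find_out_of_scope(history))  # [(0, 'F-2018-0042')]
```

Its output could be appended to the signals list that detect_rot returns. It only catches turns that mention an ID explicitly, so a message that refers to "another 2018 filing" without naming it would still slip through.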
Troubleshooting
- ModuleNotFoundError: No module named 'anthropic'. Run pip install anthropic in your activated venv (or npm install @anthropic-ai/sdk).
- AuthenticationError: Could not resolve API key. Run export ANTHROPIC_API_KEY=sk-ant-... in the same shell. Check with echo $ANTHROPIC_API_KEY.
- FileNotFoundError: poisoned_transcript.json. You're running from the wrong directory. cd into m03b-context-lab first.
- Summary doesn't preserve filing IDs. Check the SUMMARY_INSTRUCTION string. The "preserve every named entity..." line is critical: without it, Haiku will summarize the gist and drop the IDs you need to answer the final question.
Step 3: Compare Answer Quality Before vs. After
What & why: Token reduction is a means, not the end. The end is answer quality. This step sends the same final question to Claude twice — once with the rotted context, once with the compressed context — and prints both answers so you can compare. Append this to diagnose.py (or create a new file). Important: replace the existing if __name__ == "__main__": block from Step 2 with the new one at the bottom of this snippet so main() only runs once. For the Node.js version, replace the trailing main().catch(...) line from Step 2 with the new top-level await calls at the bottom.
# Append to diagnose.py (or save as compare.py and import as needed)
import time

import anthropic


def ask_claude(budget: ContextBudget) -> tuple[str, dict]:
    """Send the budget's payload to Claude and return (text, usage)."""
    client = anthropic.Anthropic()
    # tools is unpacked but not passed to the API in this comparison call.
    system, tools, messages = budget.build_messages()
    t0 = time.time()
    try:
        result = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=400,
            system=system,
            messages=messages,
        )
    except anthropic.APIError as e:
        return f"[error: {e.message}]", {"input_tokens": 0, "output_tokens": 0, "latency_s": 0}
    elapsed = round(time.time() - t0, 2)
    return result.content[0].text, {
        "input_tokens": result.usage.input_tokens,
        "output_tokens": result.usage.output_tokens,
        "latency_s": elapsed,
    }


def compare_run():
    with open("poisoned_transcript.json") as f:
        fixture = json.load(f)

    # Run 1: rotted context (no fix applied)
    rotted = ContextBudget(
        system_prompt=fixture["system_prompt"],
        tool_definitions=fixture["tool_definitions"],
        history=fixture["history"],
        current_user_message=fixture["current_user_message"],
        target_budget=4000,
    )
    print("\n>>> Run 1: ROTTED context (no fix)")
    answer_a, usage_a = ask_claude(rotted)
    print(f"Tokens: {usage_a['input_tokens']} in, {usage_a['output_tokens']} out ({usage_a['latency_s']}s)")
    print(f"Answer:\n{answer_a}\n")

    # Run 2: compressed context (compress lever applied)
    fixed = ContextBudget(
        system_prompt=fixture["system_prompt"],
        tool_definitions=fixture["tool_definitions"],
        history=fixture["history"],
        current_user_message=fixture["current_user_message"],
        target_budget=4000,
    )
    # Called as a statement (as in main()) so this works whether or not
    # apply_compress returns the budget object.
    fixed.apply_compress(keep_recent=6)
    print(">>> Run 2: COMPRESSED context")
    answer_b, usage_b = ask_claude(fixed)
    print(f"Tokens: {usage_b['input_tokens']} in, {usage_b['output_tokens']} out ({usage_b['latency_s']}s)")
    print(f"Answer:\n{answer_b}\n")

    # Delta
    print("=" * 60)
    print(f"Token delta: {usage_a['input_tokens'] - usage_b['input_tokens']} fewer input tokens")
    print(f"Latency delta: {usage_a['latency_s'] - usage_b['latency_s']:.2f}s faster")


if __name__ == "__main__":
    main()         # original diagnostic
    compare_run()  # new comparison
// Append to diagnose.mjs
import Anthropic from '@anthropic-ai/sdk';

async function askClaude(budget) {
  const client = new Anthropic();
  const { system, messages } = budget.buildMessages();
  const t0 = Date.now();
  let result;
  try {
    result = await client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 400,
      system,
      messages,
    });
  } catch (error) {
    return [`[error: ${error.message}]`, { input_tokens: 0, output_tokens: 0, latency_s: 0 }];
  }
  const elapsed = ((Date.now() - t0) / 1000).toFixed(2);
  return [result.content[0].text, {
    input_tokens: result.usage.input_tokens,
    output_tokens: result.usage.output_tokens,
    latency_s: parseFloat(elapsed),
  }];
}

async function compareRun() {
  const fixture = JSON.parse(readFileSync('poisoned_transcript.json', 'utf-8'));
  const baseArgs = {
    systemPrompt: fixture.system_prompt,
    toolDefinitions: fixture.tool_definitions,
    history: fixture.history,
    currentUserMessage: fixture.current_user_message,
    targetBudget: 4000,
  };

  // Run 1: rotted context (no fix applied)
  const rotted = new ContextBudget(baseArgs);
  console.log('\n>>> Run 1: ROTTED context (no fix)');
  const [answerA, usageA] = await askClaude(rotted);
  console.log(`Tokens: ${usageA.input_tokens} in, ${usageA.output_tokens} out (${usageA.latency_s}s)`);
  console.log(`Answer:\n${answerA}\n`);

  // Run 2: compressed context (compress lever applied)
  const fixed = new ContextBudget(baseArgs);
  await fixed.applyCompress(6);
  console.log('>>> Run 2: COMPRESSED context');
  const [answerB, usageB] = await askClaude(fixed);
  console.log(`Tokens: ${usageB.input_tokens} in, ${usageB.output_tokens} out (${usageB.latency_s}s)`);
  console.log(`Answer:\n${answerB}\n`);

  // Delta
  console.log('='.repeat(60));
  console.log(`Token delta: ${usageA.input_tokens - usageB.input_tokens} fewer input tokens`);
  console.log(`Latency delta: ${(usageA.latency_s - usageB.latency_s).toFixed(2)}s faster`);
}

await main();
await compareRun();
Run it:
python diagnose.py # or: node diagnose.mjs
What to look for in the output:
- Token reduction: at least 1,000 fewer input tokens after compression
- Faster latency: 1-2 seconds quicker (varies by network)
- Same answer (or better): both runs should report the same total ($70.2M) and same most-frequent secured party (Capital Trust). The compressed run usually gives a cleaner structured answer because there's less noise to wade through.
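The expected figures can be checked by hand from the nine get_filing results in the fixture. The snippet below re-enters those values (amounts in millions of dollars) and tallies them; it is a verification aid for grading Claude's answers, not part of the lab scripts.

```python
from collections import Counter

# (filing ID, collateral in $M, secured party), copied from the fixture's tool results.
FILINGS = [
    ("F-2019-1101", 4.2, "Capital Trust"),
    ("F-2019-2287", 7.8, "First Bank"),
    ("F-2020-0883", 11.5, "Capital Trust"),
    ("F-2020-1455", 3.4, "GE Capital"),
    ("F-2021-0192", 9.0, "First Bank"),
    ("F-2021-3340", 5.5, "Capital Trust"),
    ("F-2022-0028", 13.2, "First Bank"),
    ("F-2022-1199", 6.7, "GE Capital"),
    ("F-2023-0501", 8.9, "Capital Trust"),
]

# Round to one decimal to absorb floating-point noise in the sum.
total = round(sum(amount for _, amount, _ in FILINGS), 1)
party, count = Counter(p for _, _, p in FILINGS).most_common(1)[0]

print(f"Total: ${total}M")                  # Total: $70.2M
print(f"Most frequent: {party} ({count})")  # Most frequent: Capital Trust (4)
```

If either run reports a different total, a likely culprit is an excluded 2018 filing leaking into the sum or a 2019+ filing lost during summarization.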
Verify Everything Works
One command runs the full diagnostic + comparison end-to-end:
python diagnose.py # or: node diagnose.mjs
The ContextBudget class you wrote is the same pattern you'll specialize for conversation history (M08), retrieved chunks (M09), and memory layers (M11). Every later module in Track 3 is just one of these levers wearing a different costume.
Stretch Goals (Optional)
- Implement the truncate-retrieved lever: when the budget overflows even after compression, drop retrieved docs to the top 2 by relevance score.
- Static-first ordering test: send the same payload twice with cache_control markers, measure cache hit rate from usage.cache_read_input_tokens.
- Position-effects test: place the critical case fact ("user wants 2019+ only") at three positions (start of system prompt, middle of history, just before the user message) and compare answer accuracy.
- Auto-detect summary failures: parse the summary message after compression and assert that all 9 filing IDs from the original history still appear. If not, fall back to a less aggressive keep_recent.
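For the last stretch goal, the check itself can be plain string matching. The sketch below is one possible starting point under stated assumptions: IDs follow the F-YYYY-NNNN pattern, missing_filing_ids and pick_keep_recent are hypothetical helpers, the candidate values (6, 12, 18) are arbitrary, and the summarize callback stands in for however your ContextBudget exposes the text that remains in context after compression.

```python
import re
from typing import Callable, Iterable, Optional

FILING_ID = re.compile(r"\bF-\d{4}-\d{4}\b")

# The nine in-scope filing IDs listed in the fixture history.
EXPECTED_IDS = [
    "F-2019-1101", "F-2019-2287", "F-2020-0883", "F-2020-1455", "F-2021-0192",
    "F-2021-3340", "F-2022-0028", "F-2022-1199", "F-2023-0501",
]


def missing_filing_ids(text: str, expected: Iterable[str]) -> list:
    """Return the expected IDs that do not appear anywhere in text."""
    found = set(FILING_ID.findall(text))
    return [fid for fid in expected if fid not in found]


def pick_keep_recent(
    summarize: Callable[[int], str],
    expected: Iterable[str],
    candidates: tuple = (6, 12, 18),
) -> Optional[int]:
    """Try each keep_recent in order; return the first that keeps every ID.

    summarize(keep_recent) is a hypothetical callback returning the text left
    in context for that setting. Returns None if no candidate passes.
    """
    expected = list(expected)
    for k in candidates:
        if not missing_filing_ids(summarize(k), expected):
            return k
    return None


def fake_summarize(keep_recent: int) -> str:
    # Stand-in for a real summarizer: pretends small keep_recent values lose IDs.
    kept = EXPECTED_IDS if keep_recent >= 12 else EXPECTED_IDS[:5]
    return "Filings reviewed: " + ", ".join(kept)


print(missing_filing_ids(fake_summarize(6), EXPECTED_IDS))
print(pick_keep_recent(fake_summarize, EXPECTED_IDS))  # 12
```

A None result means no candidate preserved every ID; in that case skipping compression for the turn is probably safer than sending a lossy summary.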
Knowledge Check
Six questions on context inventory, the four levers, ordering, position effects, and rot.
Q1: Your agent has 5 registered tools but only uses 1 per turn on average. How many tokens do the 4 unused tool definitions cost on a turn?
Q2: Which is NOT one of the four context-engineering levers?
Q3: An agent works fine for the first 8 turns, then degrades sharply. Token usage is at 60%, well under the limit. What's the most likely cause?
Q4: Which message ordering allows prompt caching to hit on subsequent calls?
Q5: You have a 25-page reference doc the agent needs occasional facts from across many turns. Which lever is the best fit?
Q6: A long context contains a critical instruction at position 12 of 23 layers. Why might Claude miss it?
Module Summary
Key Takeaways
- Prompt engineering writes one message; context engineering curates everything the model sees. The user message is often <1% of the input tokens — the other 99% is the layers you assembled.
- Six layers — system, tools, history, retrieved docs, tool results, current turn. Inventory them before you debug an agent.
- Four levers, no exceptions — add, compress, retrieve, offload. Every later module in Track 3 is one of these four wearing a different costume.
- Static-first ordering is what makes prompt caching work. Same content in the wrong order can cost 6x more.
- Lost in the middle is real. Put critical content at the start or end of long contexts — and consider repeating it at both.
- Context rot is a signal-to-noise problem, not a budget problem. You can hit it at 60% window utilization. Compaction is the cure.
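The ordering and position takeaways can be combined in a few lines. The sketch below is a minimal illustration, not the course's ContextBudget implementation: assemble_context is a hypothetical helper and the sample strings are made up. It keeps the system prompt first and byte-identical across calls as long as the critical fact itself does not change, and restates that fact immediately before the question.

```python
def assemble_context(system_prompt: str, critical_fact: str, history: list, question: str):
    """Return (system, messages) with the critical fact at both ends of the context.

    The system string stays stable across turns if system_prompt and
    critical_fact are stable, which is what a cacheable prefix needs.
    """
    system = f"{system_prompt}\n\nCritical constraint: {critical_fact}"
    final_user = f"Reminder: {critical_fact}\n\n{question}"
    # Copy the history so the caller's list is not mutated.
    messages = list(history) + [{"role": "user", "content": final_user}]
    return system, messages


# Made-up sample values for demonstration.
history = [
    {"role": "user", "content": "Review ACME CORP filings."},
    {"role": "assistant", "content": "Starting the search."},
]
system, messages = assemble_context(
    "You are a filings analyst.",
    "Only consider filings from 2019 or later.",
    history,
    "What's the total collateral value?",
)
print(system)
print(messages[-1]["content"])
```

Whether the repetition helps on a given task is worth measuring rather than assuming; the position-effects stretch goal is a natural place to test it.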
Next Module Preview: M04 — Structured Output
Now that you know what goes into the model, the next question is how to get reliably parseable output back. M04 covers JSON mode, tool-use as a structured-output mechanism, and Pydantic/Zod validation — turning Claude's natural-language replies into data your agent can act on without crashing on malformed responses.