Building AI Agents with Claude Track 3: Memory & Context
Module 9 of 30 ~55 min Intermediate

M08: Conversation Management

Master the art of managing multi-turn conversations with Claude, from stateless API calls to production-grade context management with token budgets, pruning, and summarization. (Token: the basic unit of text that language models process. A token is roughly 3–4 characters in English. Both input (what you send) and output (what the model generates) are measured in tokens, and you pay per token.)

Prerequisites: M01–M04 (Foundations track)

Learning Objectives

  • Explain why the Claude Messages API is stateless and how the illusion of memory is created by replaying message history
  • Implement three conversation history management strategies: full history, sliding window, and summarization. (Sliding window: a strategy that keeps only the most recent N messages in the conversation history, discarding older ones, like a window that slides forward over time and always shows only the latest portion.)
  • Calculate token budget allocation across system prompt, history, current message, and reserved response tokens. (System prompt: a special instruction block sent with every API request that sets the assistant's behavior, personality, and constraints. It is separate from the messages array and always occupies context window space. Input tokens are what you send to the model (system prompt + messages); output tokens are what the model generates in response. Both count against the context window, and they are billed separately.)
  • Apply message pruning strategies that preserve information density while staying within token limits
  • Build a production-grade ConversationManager class with automatic pruning, summarization, and serialization. (Serialization: converting an in-memory data structure, such as a conversation history object, into a format that can be stored in a file or transmitted over a network, typically JSON. Deserialization is the reverse process.)

The Stateless Reality: Claude Has No Memory

Everyday Analogy

BEFORE: Imagine you could hire a world-class expert for advice, but they have perfect amnesia — every time you walk into the room, they have zero memory of any previous conversation you have had with them.

PAIN: If you just said "so what do you think about option B?" they would stare blankly, because they have no idea what option A was, who you are, or what problem you are solving. Every interaction starts from absolute zero.

MAPPING: This is exactly how the Claude Messages API works. Each API call is a fresh room with a fresh expert. The only way to give Claude "memory" is to hand it a written transcript of every previous exchange — your code replays the entire conversation from scratch on every single request.

What this actually looks like in code: On your third message to Claude, the API request does not just send message 3. It sends all previous messages again from scratch:

# API call #3: the ENTIRE payload your code sends
{
  "system": "You are a helpful assistant.",
  "messages": [
    {"role": "user", "content": "Hi, I need help with Python."},
    {"role": "assistant", "content": "Sure! What do you need help with?"},
    {"role": "user", "content": "How do I read a CSV file?"},
    {"role": "assistant", "content": "Use pandas: pd.read_csv('file.csv')"},
    {"role": "user", "content": "Can you add filtering?"}   ← NEW
  ]
}

Notice: messages 1 through 4 are resent verbatim. Claude does not "remember" them — your code replays them every time.

Technical Definition The Messages API is stateless. (Messages API: the primary API endpoint for interacting with Claude. Each request contains a messages array and returns Claude's response; there is no server-side session state. Stateless: each API request is independent, and the server does not store any information between requests, so every request must contain all the context the model needs.) Here is what that means, step by step.

First, each request is completely independent. There is no session ID. There is no server-side conversation state. There is no implicit memory of any kind.

Second, what users perceive as a continuous conversation is actually the client re-sending the entire message history in the messages array with every single request.

In other words, the assistant's "memory" of earlier messages exists only because the developer explicitly includes those messages in the next API call. If you leave a message out, it is gone — Claude will never know it happened.
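The replay mechanism described above can be sketched in a few lines. This is a minimal illustration rather than SDK code: `run_turn` is a helper name invented here, and `fake_model` is a hypothetical stand-in for the API call that simply reports how many messages it was handed.

```python
from typing import Callable

Message = dict[str, str]


def run_turn(history: list[Message], user_text: str,
             call_model: Callable[[list[Message]], str]) -> list[Message]:
    """Send one turn and return the exact payload that was sent."""
    history.append({"role": "user", "content": user_text})
    payload = list(history)              # full replay: everything so far
    reply = call_model(payload)
    history.append({"role": "assistant", "content": reply})
    return payload


def fake_model(messages: list[Message]) -> str:
    """Hypothetical stand-in for the API: it sees only what it is handed."""
    return f"(reply based on {len(messages)} messages)"


history: list[Message] = []
payload_sizes = [
    len(run_turn(history, text, fake_model))
    for text in ["Hi, I'm Alice", "What's my name?", "Thanks!"]
]
print(payload_sizes)  # number of messages sent on each call
```

The list `history` lives entirely in your process. If you skipped the `append` calls, the stand-in model would see a one-message conversation every time, which is exactly what the real API would see too.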

[Figure: The Stateless Reality: Every Call Starts from Zero. API call 1 sends 1 message, call 2 sends 3 (message 1 and response 1 are re-sent along with message 2), and call 3 sends 5. Each call re-sends everything, so token usage grows with every turn; Claude has zero memory between API calls and your code manages all state.]
[Interactive demo: Stateless API, Perception vs. Reality. Panels: What the User Sees; What Actually Happens (API Calls).]

Why It Matters Statelessness is not a limitation — it is a superpower. Because you control the entire context, you can edit history, branch conversations, and inject context from external sources. You can also implement sophisticated memory strategies that would be impossible with server-managed state.

Concrete examples: A customer support agent handling 10,000 concurrent sessions does not need 10,000 server-side session stores — each request is self-contained. A healthcare agent can surgically remove PHI from history before sending it to a logging service. A debugging agent can "rewind" a conversation to turn 5 and try a different approach.

Understanding statelessness is the foundation of everything else in this module.

⚠️ Common Misconceptions

"Claude remembers our previous conversation." — No. Each API call is completely stateless. There is no session, no server-side storage, no memory of any kind. If you do not include previous messages in the messages array, they do not exist as far as Claude is concerned. The "memory" you experience in chat interfaces like claude.ai is built by the client replaying message history — the exact same technique you are learning here.

"Context window = memory." — No. The context window is temporary — it exists only for the duration of a single API request. Memory persists across requests. The context window is more like a whiteboard that gets erased after every meeting: you can write a lot on it, but once the meeting ends, it is blank. Persistent memory requires your code to save and reload state.

"Longer context = better results." — Not necessarily. Research shows that recall degrades with context length — a phenomenon called the "lost in the middle" effect. Information buried in the middle of a very long context gets lower attention than information at the start or end. Cramming everything into a 200K-token payload can actually make Claude worse at finding the relevant facts.

"Summarization is lossless." — No. Summarization is a lossy compression. It preserves the gist but loses specifics: exact names, account numbers, dates, dollar amounts, and nuanced phrasing. That is why production systems use "pinned facts" blocks that are never summarized — critical specifics stay verbatim while surrounding context gets compressed.

Conversation History Management Patterns

Everyday Analogy

BEFORE: Imagine packing for a two-week trip, but your airline only allows one carry-on bag with a strict 7 kg weight limit — you cannot check luggage.

PAIN: If you try to bring everything — every outfit, every gadget, every book — the bag overflows and the airline rejects it at the gate. But if you only bring what fits right now, you might arrive at a formal dinner with nothing but hiking clothes because you dropped the dress shoes three days ago.

MAPPING: Your context window is that carry-on bag. Full history means cramming everything in (works for short trips). Sliding window means only packing the last few days of clothes (cheap but you lose early essentials). Summarization means writing a packing list of what you left at home plus bringing the current essentials — the best balance for long journeys.

What this actually looks like in practice: Suppose a conversation has accumulated 20 messages (10 user/assistant turns). Here is how each strategy would handle sending the next user message to the API:

# Full History: sends all 20 messages + the new one (21 total)
messages: [msg1, msg2, msg3, ... msg20, NEW_MSG]
→ ~3,000 tokens

# Sliding Window (N=6): sends only the last 6 + the new one
messages: [msg15, msg16, msg17, msg18, msg19, msg20, NEW_MSG]
→ ~1,050 tokens

# Summarization: a summary of msgs 1-14, then last 6 + new one
messages: [
  {"role":"user","content":"[Summary: user asked about Python CSV parsing, resolved encoding issue, moved to data filtering...]"},
  {"role":"assistant","content":"Understood, I have the context."},
  msg15, msg16, msg17, msg18, msg19, msg20, NEW_MSG
]
→ ~1,400 tokens

Technical Definition Three primary patterns exist for managing the conversation history (the ordered array of user and assistant messages sent with each API request; this array is the model's memory, and nothing else persists between calls):
  1. Full History: send all messages every time. Simple but hits token limits quickly. Best for short conversations (< 20 turns). (Token limit: the maximum number of tokens, input + output, a model can process in a single request. For Claude, this can be up to 200K tokens for input.)
  2. Sliding Window: keep only the last N messages. Maintains recency but loses early context.
  3. Summarization: periodically compress older messages into a summary, prepend it to the history, and drop the originals. Preserves key information while staying within token budgets.
Most production agents use a hybrid approach: a summary of old turns + full recent turns + an always-present system prompt.
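The three patterns differ only in which slice of the stored history they hand to the API. Here is a minimal sketch, assuming messages are plain role/content dictionaries and that the summary text has already been produced elsewhere; the function names are invented for illustration.

```python
Message = dict[str, str]


def full_history(history: list[Message]) -> list[Message]:
    """Strategy 1: send everything."""
    return list(history)


def sliding_window(history: list[Message], n: int) -> list[Message]:
    """Strategy 2: keep the last n messages, starting on a user message."""
    window = history[-n:] if n > 0 else []
    # A window that opens on an assistant reply would break the
    # user-first ordering, so drop leading assistant messages.
    while window and window[0]["role"] != "user":
        window = window[1:]
    return window


def with_summary(history: list[Message], summary: str, n: int) -> list[Message]:
    """Strategy 3 (hybrid): summary of old turns + recent turns verbatim."""
    recent = sliding_window(history, n)
    if len(recent) == len(history):
        return recent                    # nothing old enough to summarize
    preamble = [
        {"role": "user", "content": f"[Summary: {summary}]"},
        {"role": "assistant", "content": "Understood, I have the context."},
    ]
    return preamble + recent


# 20 stored messages, alternating user/assistant, as in the example above.
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"msg{i + 1}"}
    for i in range(20)
]
for name, selected in [
    ("full", full_history(history)),
    ("window", sliding_window(history, 6)),
    ("summary", with_summary(history, "user asked about CSV parsing", 6)),
]:
    print(name, len(selected))
```

The new user message is appended after selection in all three cases. The system prompt travels outside the messages array, so it is always present regardless of strategy.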

Summarization: The Strategy That Deserves a Closer Look

Summarization is the most powerful of the three strategies, so it is worth understanding at a deeper level. In plain English, summarization means using Claude itself to read a block of older messages and compress them into a short paragraph that captures the key facts, decisions, and preferences — then replacing those original messages with the summary. The result is a much shorter message history that still carries the essential context.

Here is how it works internally. When your conversation manager detects that token usage has crossed a threshold (say 100,000 tokens), it splits the message history into two parts: "old" messages (everything before the last few turns) and "recent" messages (the last 4 turns, kept verbatim). It sends the old messages to Claude with a special instruction like "Summarize this conversation. Preserve key decisions, user preferences, important facts, and tool results. Skip greetings and filler." Claude returns a compact summary, which gets injected as a synthetic user/assistant message pair at the start of the history, and the original old messages are discarded.

If you are thinking "this sounds a lot like a sliding window, just smarter," you are right, with one key difference. A sliding window throws away old messages and loses their information forever. Summarization compresses them instead, preserving the meaning while discarding the verbatim text. The trade-off is that summarization costs an extra API call (and its own tokens), and some specific details (exact numbers, names, IDs) may be paraphrased away. That is why production systems often combine summarization with "pinned facts" that are never summarized.
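The split-summarize-replace step can be sketched as follows. This is an illustration under stated assumptions: `summarize` is a hypothetical callable that wraps whatever model call you use, the history is assumed to alternate user/assistant and start with a user message, and `compress_history` is a name invented here.

```python
from typing import Callable

Message = dict[str, str]

SUMMARY_INSTRUCTION = (
    "Summarize this conversation. Preserve key decisions, user preferences, "
    "important facts, and tool results. Skip greetings and filler."
)


def compress_history(history: list[Message],
                     summarize: Callable[[str], str],
                     recent_turns: int = 4) -> list[Message]:
    """Replace everything before the last `recent_turns` turns with a summary."""
    keep = recent_turns * 2              # one turn = user + assistant
    split = len(history) - keep
    if split <= 0:
        return list(history)             # too short to be worth compressing
    old, recent = history[:split], history[split:]

    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = summarize(f"{SUMMARY_INSTRUCTION}\n\n{transcript}")

    # Inject the summary as a synthetic user/assistant pair so the
    # array still starts with a user message and keeps alternating.
    return [
        {"role": "user", "content": f"[Summary of earlier conversation: {summary}]"},
        {"role": "assistant", "content": "Understood, I have the context."},
    ] + recent
```

Passing the summarizer in as a parameter keeps the compression logic testable without network access, and it lets you route summaries to a cheaper model if that suits your budget.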

[Figure: Three History Management Strategies. Full History: complete context and the simplest code, but tokens grow unbounded and it breaks after ~20 turns (short chats only). Sliding Window: fixed token cost and easy to implement, but loses early context (the "Who are you?" problem). Summarization: preserves key facts with bounded token use, at the cost of an extra API call and possible loss of specific details (best for production).]
[Interactive demo: History Strategies Compared. Token counters for Full History, Sliding (N=6), and Summarize.]
Why It Matters There is no single best strategy: the right approach depends on your conversation length, token budget, and how much early context matters. Real numbers: at roughly 1,500 tokens per turn, a 50-turn customer support conversation with full history sends ~75,000 tokens per request. At Claude Sonnet pricing ($3/M input tokens), each reply at that length costs ~$0.23; across 10,000 such replies a day (one per conversation), that is about $2,250/day. A sliding window over the last 10 turns cuts that to ~15,000 tokens (~$0.05/reply, about $450/day). Summarization with the last 4 turns typically lands around 8,000–12,000 tokens (~$0.03/reply, about $300/day) while preserving far more context than a window alone. Choosing wrong means either burning money on bloated payloads or losing critical context mid-conversation.
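The arithmetic behind these figures is simple enough to keep in a helper. The sketch below uses the $3 per million input tokens quoted above as an assumption (check current pricing before relying on it) and takes 10,000 tokens as a midpoint for the summarization case.

```python
PRICE_PER_MILLION_INPUT_USD = 3.00   # assumption: the rate quoted in the text


def input_cost_usd(tokens_per_request: int, requests: int = 1) -> float:
    """Input-token cost for a number of equally sized requests."""
    return tokens_per_request * requests * PRICE_PER_MILLION_INPUT_USD / 1_000_000


scenarios = {
    "full history": 75_000,
    "sliding window": 15_000,
    "summarization": 10_000,
}
for label, tokens in scenarios.items():
    per_reply = input_cost_usd(tokens)
    per_day = input_cost_usd(tokens, requests=10_000)
    print(f"{label:15s} ${per_reply:.3f}/reply   ${per_day:,.0f}/day")
```

Output tokens are billed separately and are not included here.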

Token Budget Allocation

Everyday Analogy

BEFORE: Imagine you have a fixed-size suitcase for an international trip — exactly 200,000 cubic centimeters, not one more — and everything you need must fit inside it.

PAIN: You must pack a travel guide (system prompt), a photo album of memories (conversation history), the souvenirs you are buying today (current user message), and leave empty space to bring things home (response tokens). If you overstuff the album with every photo from every trip, there is no room left for today's souvenirs or the space to bring anything home.

MAPPING: Your context window (e.g., 200K tokens) is that suitcase. (Context window: the total number of tokens a model can process in one request, including both input tokens (system prompt + messages) and reserved output tokens (max_tokens). Claude supports up to 200K input tokens.) The four consumers (system prompt, history, current message, and reserved response tokens) compete for the same fixed space. If conversation history grows unchecked, it squeezes out room for the model's response, eventually causing errors or truncation.

What this actually looks like in your code: When you set max_tokens in an API call, that is you reserving suitcase space for the return trip. Here is the actual budget breakdown for a typical production agent at turn 25:

Context window total:                 200,000 tokens
─────────────────────────────────────────────
System prompt:                          1,200 tokens  (fixed: your instructions)
Conversation history (25 turns):       37,500 tokens  (growing every turn!)
Current user message:                     800 tokens  (variable)
Reserved for response (max_tokens):     4,096 tokens  (you set this)
─────────────────────────────────────────────
REMAINING for more history:           156,404 tokens  ✓ plenty of room

...but at turn 120:
Conversation history (120 turns):     180,000 tokens  ← nearly fills the window!
REMAINING:                             13,904 tokens  ⚠ room for only a few more turns

Technical Definition The context window (e.g., 200K tokens for Claude) is divided among four competing consumers:
  • System prompt: typically 500–2,000 tokens (fixed)
  • Conversation history: variable, grows per turn
  • Current user message: variable
  • max_tokens: reserved for the response (set via API parameter). Setting max_tokens=4096 means up to 4,096 tokens of the context window are reserved for output, reducing the space available for input.

Here is how the math works: start with your total context window (e.g., 200,000 tokens). Subtract the system prompt, the current user message, and the tokens reserved for Claude's response (max_tokens). Whatever is left is the space available for conversation history. As a formula: available_for_history = context_window - system_tokens - current_message_tokens - max_tokens
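As a quick check of the formula, here is a small sketch that plugs in the turn-25 and turn-120 numbers from the budget breakdown earlier in this section. The function name simply mirrors the formula; it is not part of any SDK.

```python
def available_for_history(context_window: int, system_tokens: int,
                          current_message_tokens: int, max_tokens: int) -> int:
    """Tokens left for conversation history after the fixed consumers."""
    remaining = context_window - system_tokens - current_message_tokens - max_tokens
    return max(remaining, 0)             # never report a negative budget


budget = available_for_history(
    context_window=200_000,
    system_tokens=1_200,
    current_message_tokens=800,
    max_tokens=4_096,
)
headroom_turn_25 = budget - 37_500       # history size at turn 25
headroom_turn_120 = budget - 180_000     # history size at turn 120
print(budget, headroom_turn_25, headroom_turn_120)
```

Computing the budget once per request, before choosing what history to include, makes the pruning decision a simple comparison instead of a guess.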

If you are coming from traditional web development, token budgets are a new concept — there is no equivalent in a typical REST API. The closest analogy is a database connection with a maximum query size, but unlike a database, you pay per token on every request. That means the budget is both a technical constraint (exceed it and the API rejects your request) and a cost constraint (bigger payloads = bigger bills).

One subtlety that surprises beginners: the system prompt occupies budget space on every single request, not just the first one. A 2,000-token system prompt that seems negligible at turn 1 has quietly consumed 2,000 tokens × 50 turns = 100,000 tokens of cumulative billing by turn 50. This is why production teams obsess over concise system prompts — every word is multiplied by the total number of API calls in the conversation.

[Figure: Token Budget Allocation (200K Context Window). Stacked bars of system prompt, history, user message, response, and available space by conversation turn. Turn 1: 194K free. Turn 10: 15K history, 179K free. Turn 25: 37.5K history, 157K free. Turn 50: 75K history, 120K free. Turn 100: 150K history, 45K left. History grows ~1,500 tokens/turn, so prune or summarize before it crowds out the response.]
[Interactive demo: Token Budget Allocation. With System 1,000, Current 500, and Response 4,096 tokens in a 200,000-token context window, 194,404 tokens are available for history. Pruning is triggered when history exceeds 80% of the available budget.]
⚠️ Common Misconceptions

"max_tokens limits how much the model reads." — No. max_tokens only limits how much the model writes (output tokens). It does not affect input processing. If you send 150,000 input tokens with max_tokens=100, Claude still reads all 150K — it just cannot write more than 100 tokens in response.

"I can just set max_tokens really high to be safe." — Setting it higher than needed wastes budget space. The reserved response tokens are subtracted from your available context window. If you reserve 8,192 tokens but Claude only needs 200, you have blocked 7,992 tokens of space that could have held conversation history.

"Token counts are exact and predictable." — Not quite. Token counts depend on the model's tokenizer and can vary slightly between words, languages, and even code vs. prose. Use the API response's usage.input_tokens field for exact counts rather than trying to predict them yourself.

Cost Implications Every token in your conversation history is billed as an input token on every subsequent API call. A 50-turn conversation with full history means those early messages are billed 50 times. Aggressive pruning and summarization directly reduce your API costs.
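To see how quickly replayed history adds up, the following sketch totals the input tokens billed over a whole conversation under the full-history strategy. The per-message sizes (500 tokens per user message, 1,000 per assistant reply) are illustrative assumptions, and the function name is invented here.

```python
def cumulative_input_tokens(turns: int, system_tokens: int,
                            user_tokens: int, assistant_tokens: int) -> int:
    """Total input tokens billed across all requests with full history."""
    total = 0
    history = 0                          # tokens of prior turns replayed
    for _ in range(turns):
        # Each request carries the system prompt, all prior turns,
        # and the new user message.
        total += system_tokens + history + user_tokens
        history += user_tokens + assistant_tokens
    return total


total_billed = cumulative_input_tokens(
    turns=50, system_tokens=2_000, user_tokens=500, assistant_tokens=1_000
)
system_share = 50 * 2_000                # the system prompt alone
last_request = 2_000 + 49 * 1_500 + 500  # size of the final request
print(total_billed, system_share, last_request)
```

The total grows roughly with the square of the number of turns, which is why trimming history pays off far more than the size of any single request suggests.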
🎓 Cert Tip — Domain 5.1

The "lost in the middle" effect means information in the middle of long context gets lower recall than information at the start or end. Position critical data at the beginning (case facts) or end (current query).

Practical Context Window Management

Knowing the budget exists is one thing — actively managing it in production is another. The patterns below are what you will actually wire into a real ConversationManager. Four moving pieces work together: counting tokens before you send, allocating by percentage, handling overflow, and dynamically sizing the sliding window.

1. Count tokens before the request. Never send first and hope it fits. Use Anthropic's count_tokens endpoint or a local tokenizer to size the payload up front:

# Pseudocode: run BEFORE every API call
fixed_tokens = count_tokens(system) + count_tokens(user_msg)
input_tokens = fixed_tokens + count_tokens(messages)

if input_tokens + max_tokens > CONTEXT_LIMIT:
    messages = prune_to_budget(
        messages,
        # history budget = window minus everything else, minus a safety margin
        target=CONTEXT_LIMIT - max_tokens - fixed_tokens - 5_000,
    )

response = client.messages.create(...)

2. Allocate by percentage, not absolutes. A typical production split for a 200K window: 20% system prompt, 50% history, 20% current user message, 10% response headroom. Tune the percentages per use case; long-context RAG often inverts system/history.
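A percentage split translates into token budgets with a few lines of integer arithmetic. This is a sketch using the example split from step 2; the band names and the `allocate_budget` helper are illustrative, not part of any library.

```python
def allocate_budget(context_window: int, percent: dict[str, int]) -> dict[str, int]:
    """Split a context window into token budgets by whole-number percentages."""
    if sum(percent.values()) != 100:
        raise ValueError("percentages must sum to 100")
    # Integer math avoids floating-point rounding surprises.
    return {name: context_window * pct // 100 for name, pct in percent.items()}


budgets = allocate_budget(
    200_000,
    {"system": 20, "history": 50, "current_message": 20, "response": 10},
)
print(budgets)
```

Because the split is expressed in percentages, the same configuration carries over unchanged if you move to a model with a different window size.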

3. What overflow actually returns. Exceed the window and the request never reaches the model. The API answers with HTTP 400:

{
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "message": "prompt is too long: 215743 tokens > 200000 maximum"
  }
}

Catch this in code: anthropic.BadRequestError (Python SDK) or status 400 with error.type === "invalid_request_error". Then prune retroactively and retry once — do not loop, since a single oversized message will retry forever.
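The "prune and retry once" rule can be sketched independently of the SDK. In the example below, `PromptTooLongError` is a hypothetical stand-in for the 400 error your client raises (with the Python SDK you would catch `anthropic.BadRequestError` instead), and `send` and `prune` are callables you supply.

```python
from typing import Callable


class PromptTooLongError(Exception):
    """Hypothetical stand-in for the SDK's HTTP 400 'prompt is too long' error."""


def send_with_one_retry(send: Callable[[list], str],
                        messages: list,
                        prune: Callable[[list], list]) -> str:
    """Try once; on overflow, prune and try exactly one more time."""
    try:
        return send(messages)
    except PromptTooLongError:
        pruned = prune(messages)
        if len(pruned) >= len(messages):
            raise                        # pruning made no progress: give up
        # No loop here: if this second attempt also overflows,
        # the error propagates to the caller.
        return send(pruned)
```

Bounding the retry to a single attempt means a lone oversized message surfaces as an error you can handle, for example by truncating that message or asking the user to shorten it.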

4. Slide on tokens, not message count. A naive "keep last 10 messages" rule breaks the moment one tool result is 30K tokens. Walk newest-to-oldest and accumulate until you hit your history budget — short chitchat keeps 50 turns, large tool outputs keep only 3:

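def fit_to_budget(history, budget_for_history):
    """Keep the newest messages whose combined token count fits the budget."""
    kept, used = [], 0
    for msg in reversed(history):        # walk newest to oldest
        t = count_tokens(msg)            # count_tokens: your token counter
        if used + t > budget_for_history:
            break
        kept.append(msg)
        used += t
    kept.reverse()                       # restore chronological order
    return kept                          # newest preserved, oldest dropped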

This adapts automatically to message size: a chatty support bot retains weeks of context, while a code-search agent that just received a giant grep dump keeps only the last two turns.

Message Pruning Strategies

Everyday Analogy

BEFORE: Imagine you are a film editor with 6 hours of raw footage that must become a 90-minute movie — you cannot simply cut from the end or the beginning, because key plot points are scattered throughout.

PAIN: If you blindly cut the first 4.5 hours, you lose the character introductions and the setup that makes the climax meaningful. If you keep everything, the audience (the model) loses focus during the filler scenes and misses the important moments buried in the middle.

MAPPING: Pruning a conversation works the same way. You score each message by importance: greetings and filler ("thanks!", "you're welcome!") are cut first. Key facts (account numbers, decisions, tool results) are preserved no matter how old they are. And sometimes you replace a block of scenes with a narrator's summary — "previously on this conversation..." — to compress without losing the plot.

What this actually looks like in real code: In practice, you attach metadata to each message — a simple dictionary with an importance field. Here is what a scored message object looks like, followed by a full example from a support conversation:

# A single scored message: this is the data structure your pruning logic operates on
{
    "role": "user",
    "content": "My account number is #4521",
    "metadata": {
        "importance": "high",        # high | medium | low
        "timestamp": 1712345678,
        "token_count": 12,
        "tags": ["account_id", "key_fact"]
    }
}

And here is a full conversation with importance scores assigned:

messages = [
    {"role": "user",      "content": "Hi there!",                "importance": "low"},     # greeting, safe to cut
    {"role": "assistant", "content": "Hello! How can I help?",   "importance": "low"},     # greeting
    {"role": "user",      "content": "My account is #4521",      "importance": "high"},    # KEY FACT: never prune
    {"role": "assistant", "content": "Found account #4521..",    "importance": "high"},    # KEY FACT
    {"role": "user",      "content": "Thanks!",                  "importance": "low"},     # filler
    {"role": "assistant", "content": "You're welcome!",          "importance": "low"},     # filler
    {"role": "user",      "content": "What's my order status?",  "importance": "medium"},  # question (summarizable)
    {"role": "assistant", "content": "Order #889 is in transit", "importance": "high"},    # KEY FACT: result
]
# After pruning: 4 low-importance messages removed, 1 medium-importance message summarized
# Result: 8 messages → 3 messages + 1 summary = ~60% token reduction

Technical Definition Pruning is the art of selectively removing messages from conversation history to stay within token limits while keeping the information Claude actually needs. Five main strategies exist:
  1. FIFO (first in, first out): drop the oldest messages first when the history gets too long. Simple but may lose critical early context.
  2. Importance scoring: tag messages with metadata and drop low-importance messages first. (Metadata: additional data attached to each message beyond its content, such as timestamps, token counts, importance scores, or tags, which helps the conversation manager make smarter pruning decisions.)
  3. Semantic deduplication: identify and remove redundant exchanges, meaning messages that convey the same information even if worded differently. For example, if the user asked the same question twice, the duplicate exchange can be safely removed.
  4. Role-based retention: always keep certain message types (tool results, user preferences, key decisions).
  5. Summarize-and-replace: condense a block of messages into a single summary before dropping originals.

Critical implementation detail: after removing any messages, your pruning function must verify that the remaining array still has valid message alternation. The Claude API requires messages to alternate between user and assistant roles, and the array must start with a user message. Removing a message can break this alternation: if your pruning logic removes a user message but leaves the assistant reply, the API will reject the request.
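A post-pruning check along these lines is cheap to run before every request. The sketch below is one possible shape for it; `alternation_errors` is a name invented here, and it only inspects the `role` field of each message.

```python
def alternation_errors(messages: list[dict]) -> list[str]:
    """Return a list of ordering problems; an empty list means the array is valid."""
    errors = []
    if messages and messages[0]["role"] != "user":
        errors.append("first message must have role 'user'")
    for i in range(1, len(messages)):
        if messages[i]["role"] == messages[i - 1]["role"]:
            errors.append(
                f"messages {i - 1} and {i} both have role '{messages[i]['role']}'"
            )
    return errors


# A pruning pass removed the user message but left the assistant reply behind.
broken = [
    {"role": "assistant", "content": "Found account #4521.."},
    {"role": "user", "content": "What's my order status?"},
    {"role": "assistant", "content": "Order #889 is in transit"},
]
print(alternation_errors(broken))
```

Returning a list of problems instead of raising on the first one makes the check useful in logs and tests, where you usually want to see every violation at once.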

⚠️ Common Misconceptions

"Just keep the last N messages — that's good enough." — A pure sliding window is the simplest pruning strategy, but it is also the most dangerous. If the user's account number, diagnosis code, or key decision was in message 1, a window of size 10 will silently drop it by turn 6. The user will say "what's my account number?" and Claude will have no idea.

"Pruning only matters for long conversations." — Even a 15-turn conversation with tool results can easily hit 50K+ tokens. Tool call results (JSON payloads, API responses, database rows) are often much larger than human messages. A single tool result can be 2,000–5,000 tokens.

"I can prune any message safely." — Removing a message that a later message references creates a confusing context. If Claude said "Based on your order #889..." and you prune the message containing order #889, Claude's reference becomes a non sequitur. Always prune in pairs (user + assistant) and check for downstream references.

If you are familiar with database query optimization, pruning is conceptually similar: you are reducing the amount of data processed while preserving the rows (messages) that matter most. But unlike a database index that speeds up retrieval without losing data, pruning permanently removes messages from the context. That is why importance scoring is critical — you need a reliable way to distinguish "this message contains the user's account number" from "this message says 'thanks!'" before deciding what to cut.
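Importance scoring and pair-wise pruning can be combined in a short function. This sketch assumes each message carries a top-level `importance` key, as in the support conversation shown earlier, and that messages alternate user/assistant; both helper names are invented for illustration.

```python
def prune_low_importance(messages: list[dict]) -> list[dict]:
    """Drop user/assistant pairs in which BOTH messages are low importance."""
    kept = []
    for i in range(0, len(messages) - 1, 2):
        user, assistant = messages[i], messages[i + 1]
        if user["importance"] == "low" and assistant["importance"] == "low":
            continue                     # pure filler: remove the whole pair
        kept.extend([user, assistant])
    if len(messages) % 2:                # trailing user message awaiting a reply
        kept.append(messages[-1])
    return kept


def to_api_messages(messages: list[dict]) -> list[dict]:
    """Strip bookkeeping keys so only role and content reach the API."""
    return [{"role": m["role"], "content": m["content"]} for m in messages]


conversation = [
    {"role": "user", "content": "Hi there!", "importance": "low"},
    {"role": "assistant", "content": "Hello! How can I help?", "importance": "low"},
    {"role": "user", "content": "My account is #4521", "importance": "high"},
    {"role": "assistant", "content": "Found account #4521..", "importance": "high"},
    {"role": "user", "content": "Thanks!", "importance": "low"},
    {"role": "assistant", "content": "You're welcome!", "importance": "low"},
    {"role": "user", "content": "What's my order status?", "importance": "medium"},
    {"role": "assistant", "content": "Order #889 is in transit", "importance": "high"},
]
pruned = prune_low_importance(conversation)
print(len(conversation), "->", len(pruned))
```

Removing whole pairs keeps the user/assistant alternation intact by construction. A pair is retained if either side matters, so a low-importance question is never separated from its high-importance answer.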

[Interactive demo: Importance-Based Pruning in Action. Counters: messages (of 20), tokens, and information retained.]
Why It Matters Smart pruning preserves the information density of your context. Real scenario: In the animation above, a 20-message customer support conversation (3,000 tokens) was pruned to 7 messages (1,200 tokens) — a 60% token reduction — while retaining 91% of the actionable information (account number, order status, return policy). In production, a B2B ecommerce agent handling 100-turn order negotiations can prune from ~150K tokens to ~20K tokens per request, cutting per-request cost from $0.45 to $0.06 while keeping every PO number, price agreement, and delivery date intact.

Building a Production Conversation Manager

Everyday Analogy

BEFORE: Imagine a CEO who takes every meeting without a personal assistant — they walk into each room carrying a growing stack of every note from every past meeting, verbatim, unfiltered.

PAIN: By month three, the stack is so tall they spend more time flipping through old notes than actually engaging in the current meeting. Worse, critical decisions from week one are buried under pages of "sounds good" and "let's circle back," and the CEO sometimes misses them entirely.

MAPPING: A conversation manager is that personal assistant. It sits between your application and the Claude API, recording every exchange, scoring messages by importance, compressing old meetings into executive summaries, and ensuring the CEO (Claude) walks into every meeting with exactly the right briefing — recent details in full, older context summarized, and critical facts always pinned to the top.

What this actually looks like as a class interface: The conversation manager sits between your app and the Claude API — like a personal assistant intercepting every message. Here is what the "assistant's desk" looks like in code:

manager = SmartConversationManager(
    system_prompt="You are a healthcare pre-auth agent.",
    token_threshold=100_000,     # trigger summarization at 100K tokens
    recent_turns_to_keep=4,      # always keep last 4 full turns
)

manager.send("Patient has ICD-10 code M54.5...")    # adds message, calls API, tracks tokens
manager.send("Check if CPT 97110 is covered...")    # auto-summarizes if threshold hit
manager.save("session_12345.json")                  # persist full state to disk

# Later, or on a different server:
restored = SmartConversationManager.load("session_12345.json")
restored.send("What was the patient's diagnosis?")  # picks up where we left off

Technical Definition A production conversation manager is a single class that sits between your application logic and the Claude API. It handles six responsibilities:
  • Message storage: append, retrieve, and edit messages in an ordered array
  • Token counting: track token usage per message and cumulatively across the whole conversation
  • Automatic pruning: triggered when token counts cross a configurable threshold
  • Summarization: using Claude itself to compress older turns into a short paragraph
  • Metadata tracking: timestamps, token counts, and importance flags attached to each message
  • Serialization: save and load the full conversation state to JSON files, so conversations survive server restarts. (JSON: JavaScript Object Notation, a lightweight text format for storing and exchanging structured data. Looks like: {"key": "value", "list": [1, 2, 3]}. The standard format for API requests and data persistence.)
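Of the six responsibilities, serialization is the easiest to sketch in isolation. The example below is a simplified state container, not the manager built later in this module: `ConversationState` and its fields are assumptions chosen for illustration, and it makes no API calls.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class ConversationState:
    """Everything needed to resume a conversation on another server."""
    system_prompt: str
    messages: list = field(default_factory=list)
    summary: str = ""

    def add(self, role: str, content: str, importance: str = "medium") -> None:
        """Store a message together with its metadata."""
        self.messages.append({
            "role": role,
            "content": content,
            "metadata": {"importance": importance, "timestamp": time.time()},
        })

    def save(self, path: str) -> None:
        """Serialize the full state to a JSON file."""
        Path(path).write_text(json.dumps(asdict(self), indent=2), encoding="utf-8")

    @classmethod
    def load(cls, path: str) -> "ConversationState":
        """Rebuild the state from a JSON file written by save()."""
        return cls(**json.loads(Path(path).read_text(encoding="utf-8")))
```

Keeping the persisted state to plain strings, numbers, lists, and dictionaries is what makes the JSON round trip lossless. An API client object or open file handle stored on the same class would need to be excluded and recreated on load.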
[Interactive demo: Conversation Manager Lifecycle (30 Turns). Counters: turns, total tokens, tokens saved, and prune events.]
Why It Matters The conversation manager is the unsung hero of every production agent. Real-world impact: Without one, a healthcare pre-authorization agent processing 500 cases/day with an average of 15 turns per case would send ~22,500 tokens per request at turn 15. Counting only that final request of each case, that is ~$34/day in input tokens (500 × 22,500 tokens at $3/M); the earlier turns add to it. With a well-tuned conversation manager (summarization at 8 turns, critical-fact pinning), that drops to ~8,000 tokens/request, or ~$12/day on the same basis, a 65% cost reduction, while maintaining 95%+ accuracy on clinical criteria references. The conversation manager also prevents the catastrophic failure mode where a long conversation silently exceeds the context window and the API returns an error mid-session, losing the patient's case.
🎓 Cert Tip — Domain 5.1

Progressive summarization loses critical specifics: names, IDs, amounts, dates. For production systems, use immutable "case facts" blocks positioned at the START of context (high-recall position). These are never summarized.
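One way to act on this tip is to assemble the messages array so that the pinned facts always come first and the current query always comes last. The sketch below is illustrative only: `build_messages` and the bracketed labels are invented here, and `recent` is assumed to start with a user message and end with an assistant reply.

```python
def build_messages(case_facts: dict[str, str], summary: str,
                   recent: list[dict], query: str) -> list[dict]:
    """Pinned facts at the start, summary next, recent turns, query at the end."""
    facts = "\n".join(f"- {key}: {value}" for key, value in case_facts.items())
    preamble = f"[Case facts, verbatim]\n{facts}"
    if summary:
        preamble += f"\n\n[Summary of earlier conversation]\n{summary}"
    return [
        {"role": "user", "content": preamble},
        {"role": "assistant", "content": "Understood, I have the context."},
        *recent,
        {"role": "user", "content": query},
    ]


payload = build_messages(
    case_facts={"ICD-10": "M54.5", "CPT": "97110"},
    summary="Patient reported lower back pain; coverage check requested.",
    recent=[
        {"role": "user", "content": "Is prior authorization required?"},
        {"role": "assistant", "content": "Checking the plan criteria."},
    ],
    query="What was the patient's diagnosis?",
)
print(payload[0]["content"])
```

The facts dictionary is passed in unchanged on every request and is never routed through the summarizer, so identifiers and codes cannot be paraphrased away no matter how many times the older turns are compressed.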

Code Walkthrough

Concept → Code Bridge: You now understand the five core concepts: statelessness, history patterns, token budgets, pruning strategies, and the role of a conversation manager. Next, we will translate each concept into working code. We build in three incremental steps: (1) a basic manager that stores messages and talks to Claude, (2) a sliding window extension that caps what gets sent, and (3) a smart manager that auto-summarizes and persists state to disk. Each step builds on the previous class using inheritance.

Step 1: Basic Conversation Manager

Let's start with the simplest possible version: a class that stores messages, sends them to Claude, and keeps a running count of token usage. This is the "full history" strategy in action — every message ever sent gets included in every API call. The key thing to watch is the messages array. It grows by two entries every turn (one user, one assistant), and the entire array is shipped to the API each time. That is what makes full history simple but expensive.

One important safety detail: look at the except block in send(). If the API call raises anthropic.APIError, the manager pops the user message it just added. Without this, a failed call would leave a "phantom" user message in your history with no assistant reply, so the next call would carry two user turns in a row and the unanswered message would be resent alongside the new one.

import anthropic
from dataclasses import dataclass, field
from typing import Optional

# pip install "anthropic>=0.30.0"   (quote it so the shell does not treat > as a redirect)

@dataclass
class ConversationManager:
    """Manages multi-turn conversations with Claude."""
    system_prompt: str = "You are a helpful assistant."
    model: str = "claude-sonnet-4-6"
    max_tokens: int = 4096
    messages: list = field(default_factory=list)

    def __post_init__(self):
        self.client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
        self.total_input_tokens = 0
        self.total_output_tokens = 0

    def add_user_message(self, content: str) -> None:
        """Add a user message to the conversation history."""
        self.messages.append({"role": "user", "content": content})

    def add_assistant_message(self, content: str) -> None:
        """Add an assistant message to the conversation history."""
        self.messages.append({"role": "assistant", "content": content})

    def send(self, user_message: str) -> str:
        """Send a message and get Claude's response."""
        self.add_user_message(user_message)

        try:
            response = self.client.messages.create(
                model=self.model,
                max_tokens=self.max_tokens,
                system=self.system_prompt,
                messages=self.messages,
            )

            # Track token usage from the API response
            self.total_input_tokens += response.usage.input_tokens
            self.total_output_tokens += response.usage.output_tokens

            assistant_text = response.content[0].text
            self.add_assistant_message(assistant_text)
            return assistant_text

        except anthropic.APIError as e:
            # Remove the user message we just added on failure
            self.messages.pop()
            raise RuntimeError(f"API call failed: {e}") from e

    def get_messages(self) -> list:
        """Return the current message history."""
        return list(self.messages)

    def get_token_usage(self) -> dict:
        """Return cumulative token usage."""
        return {
            "total_input_tokens": self.total_input_tokens,
            "total_output_tokens": self.total_output_tokens,
            "message_count": len(self.messages),
        }


# Usage
manager = ConversationManager(
    system_prompt="You are a coding tutor. Be concise."
)

reply1 = manager.send("What is a list comprehension in Python?")
print(reply1)

reply2 = manager.send("Show me an example with filtering.")
print(reply2)

print(manager.get_token_usage())

// JavaScript version
import Anthropic from "@anthropic-ai/sdk";
// npm install @anthropic-ai/sdk@^0.30.0

class ConversationManager {
  constructor({
    systemPrompt = "You are a helpful assistant.",
    model = "claude-sonnet-4-6",
    maxTokens = 4096,
  } = {}) {
    this.client = new Anthropic(); // reads ANTHROPIC_API_KEY
    this.systemPrompt = systemPrompt;
    this.model = model;
    this.maxTokens = maxTokens;
    this.messages = [];
    this.totalInputTokens = 0;
    this.totalOutputTokens = 0;
  }

  addUserMessage(content) {
    this.messages.push({ role: "user", content });
  }

  addAssistantMessage(content) {
    this.messages.push({ role: "assistant", content });
  }

  async send(userMessage) {
    this.addUserMessage(userMessage);

    try {
      const response = await this.client.messages.create({
        model: this.model,
        max_tokens: this.maxTokens,
        system: this.systemPrompt,
        messages: this.messages,
      });

      // Track token usage from the API response
      this.totalInputTokens += response.usage.input_tokens;
      this.totalOutputTokens += response.usage.output_tokens;

      const assistantText = response.content[0].text;
      this.addAssistantMessage(assistantText);
      return assistantText;
    } catch (error) {
      // Remove the user message we just added on failure
      this.messages.pop();
      throw new Error(`API call failed: ${error.message}`);
    }
  }

  getMessages() {
    return [...this.messages];
  }

  getTokenUsage() {
    return {
      totalInputTokens: this.totalInputTokens,
      totalOutputTokens: this.totalOutputTokens,
      messageCount: this.messages.length,
    };
  }
}

// Usage
const manager = new ConversationManager({
  systemPrompt: "You are a coding tutor. Be concise.",
});

const reply1 = await manager.send(
  "What is a list comprehension in Python?"
);
console.log(reply1);

const reply2 = await manager.send(
  "Show me an example with filtering."
);
console.log(reply2);

console.log(manager.getTokenUsage());
What Just Happened? You built a ConversationManager that implements the "full history" strategy from earlier in this module. Let's trace what happens when you call send(). First, the user message is appended to the messages array. Then the entire array, every message from the beginning of the conversation, is sent to Claude. When the response arrives, the manager reads response.usage to track cumulative input and output tokens (you will need these numbers later for budgeting). The safety detail: if the API call raises an error (anthropic.APIError in the Python version, any thrown error in the JavaScript version), the manager pops the user message it just added, so your history stays clean. This class is fully functional, but here is the catch: it has no protection against the context window filling up. That is exactly what Step 2 solves.
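The rollback behavior can be exercised without any network access. The sketch below is a stripped-down stand-in, not the module's class: `History`, the `call_api` parameter, and the two stub functions are hypothetical, and it catches any exception rather than only anthropic.APIError.

```python
class History:
    """Minimal stand-in that shows the append-then-roll-back pattern."""

    def __init__(self):
        self.messages = []

    def send(self, user_message: str, call_api) -> str:
        self.messages.append({"role": "user", "content": user_message})
        try:
            reply = call_api(self.messages)
        except Exception:
            self.messages.pop()   # roll back the unanswered user message
            raise
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def failing_api(messages):
    """Stub that simulates an outage."""
    raise ConnectionError("simulated outage")


def echo_api(messages):
    """Stub that simulates a successful reply."""
    return f"echo: {messages[-1]['content']}"


history = History()
try:
    history.send("hello", failing_api)
except ConnectionError:
    pass
after_failure = list(history.messages)   # empty: the user message was removed

history.send("hello again", echo_api)
after_success = list(history.messages)   # one user message, one assistant reply
```

After the failed call the history is empty again, so the retry starts from a clean user/assistant sequence.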

Step 2: Sliding Window Mode

Now here is the dilemma with Step 1: as conversations grow, you are sending thousands of tokens of old greetings and resolved questions on every call. The sliding window fixes this with a simple override. Instead of sending all messages, get_messages() returns only the last N. The interesting subtlety is the check on the first message's role: if our slice happens to start on an assistant message, we drop it to maintain the required user-first alternation, so a call may send N-1 messages rather than N. Without that check, the API would reject the request.

class SlidingWindowManager(ConversationManager):
    """Extends ConversationManager with a sliding window."""

    def __init__(self, window_size: int = 10, **kwargs):
        super().__init__(**kwargs)
        self.window_size = window_size

    def get_messages(self) -> list:
        """Return only the most recent N messages."""
        if len(self.messages) <= self.window_size:
            return list(self.messages)
        # Keep the last window_size messages
        # Ensure we start with a user message (valid alternation)
        windowed = self.messages[-self.window_size:]
        if windowed and windowed[0]["role"] == "assistant":
            windowed = windowed[1:]  # drop to maintain user-first
        return windowed

    def send(self, user_message: str) -> str:
        """Send using only the windowed history."""
        self.add_user_message(user_message)

        try:
            response = self.client.messages.create(
                model=self.model,
                max_tokens=self.max_tokens,
                system=self.system_prompt,
                messages=self.get_messages(),  # windowed!
            )
            self.total_input_tokens += response.usage.input_tokens
            self.total_output_tokens += response.usage.output_tokens

            assistant_text = response.content[0].text
            self.add_assistant_message(assistant_text)
            return assistant_text

        except anthropic.APIError as e:
            self.messages.pop()
            raise RuntimeError(f"API call failed: {e}") from e


# Usage: at most the last 6 messages are sent per call
manager = SlidingWindowManager(
    window_size=6,
    system_prompt="You are a concise coding tutor.",
)
for q in [
    "What is Python?",
    "What are variables?",
    "Explain loops.",
    "What are functions?",
    "Explain classes.",
    "What is inheritance?",
]:
    print(f"Q: {q}")
    print(f"A: {manager.send(q)}\n")

# All 12 messages are stored; get_messages() now returns only the last 6
print(f"Stored: {len(manager.messages)} messages")
print(f"Sent: {len(manager.get_messages())} messages")

// JavaScript version
class SlidingWindowManager extends ConversationManager {
  constructor({ windowSize = 10, ...opts } = {}) {
    super(opts);
    this.windowSize = windowSize;
  }

  getMessages() {
    if (this.messages.length <= this.windowSize) {
      return [...this.messages];
    }
    // Keep the last windowSize messages
    let windowed = this.messages.slice(-this.windowSize);
    // Ensure we start with a user message
    if (windowed[0]?.role === "assistant") {
      windowed = windowed.slice(1);
    }
    return windowed;
  }

  async send(userMessage) {
    this.addUserMessage(userMessage);

    try {
      const response = await this.client.messages.create({
        model: this.model,
        max_tokens: this.maxTokens,
        system: this.systemPrompt,
        messages: this.getMessages(), // windowed!
      });

      this.totalInputTokens += response.usage.input_tokens;
      this.totalOutputTokens += response.usage.output_tokens;

      const assistantText = response.content[0].text;
      this.addAssistantMessage(assistantText);
      return assistantText;
    } catch (error) {
      this.messages.pop();
      throw new Error(`API call failed: ${error.message}`);
    }
  }
}

// Usage: at most the last 6 messages are sent per call
const manager = new SlidingWindowManager({
  windowSize: 6,
  systemPrompt: "You are a concise coding tutor.",
});

for (const q of [
  "What is Python?",
  "What are variables?",
  "Explain loops.",
  "What are functions?",
  "Explain classes.",
  "What is inheritance?",
]) {
  console.log(`Q: ${q}`);
  console.log(`A: ${await manager.send(q)}\n`);
}

console.log(`Stored: ${manager.messages.length} messages`);
console.log(`Sent: ${manager.getMessages().length} messages`);
What Just Happened? You extended the base manager with a sliding window that (1) stores all messages internally for audit/replay purposes, (2) sends only the last N messages to the API via an overridden get_messages(), and (3) ensures valid alternation by dropping a leading assistant message if the slice starts on one. Trade-off: this approach is cheap and fast but permanently loses early context from Claude's perspective — if the user's account number was in message 1, it vanishes after a few turns.
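To see both the alternation fix and the trade-off in isolation, the windowing logic can be pulled out into a standalone function and run on a hand-built history. The function name `window` and the sample messages are made up for this sketch; the slicing mirrors get_messages() above and assumes window_size is at least 1.

```python
def window(messages: list, window_size: int) -> list:
    """Return the last window_size messages, starting on a user turn."""
    if len(messages) <= window_size:
        return list(messages)
    windowed = messages[-window_size:]
    if windowed and windowed[0]["role"] == "assistant":
        windowed = windowed[1:]   # drop the leading assistant message
    return windowed


# Three completed turns plus a fourth user message that is about to be sent.
history = []
for i in range(1, 4):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})
history.append({"role": "user", "content": "question 4"})

sent = window(history, 6)
print(len(history), len(sent))   # 7 5
print(sent[0]["content"])        # question 2
```

With seven stored messages and a window of six, the slice starts on "answer 1", which is dropped, so five messages are sent. "question 1" is no longer visible to Claude, which is the forgetting behavior described above.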
Bridge: The sliding window is fast but forgetful. Step 3 solves this by using Claude itself to summarize old messages before discarding them, and adds JSON persistence so conversations survive server restarts. This summarize-then-persist pattern is a common choice for deployed agents.

Step 3: Auto-Summarization & Persistence

This is the production-grade version, and it solves the sliding window's biggest weakness: losing early context forever. The key idea is elegant — instead of just dropping old messages, we ask Claude to read them and write a summary first. Then we replace the originals with that summary. The result is a much shorter history that still carries the essential context from earlier in the conversation.

The other major addition is save() and load(). Real agents need conversations to survive server restarts, deployments, and load balancer switches. These methods serialize the full state — messages, summary, token counts — to a JSON file. In production, you would swap the file system for a database or Redis, but the pattern is identical.

import json
import time
from pathlib import Path

class SmartConversationManager(ConversationManager):
    """Full-featured manager with summarization and persistence."""

    def __init__(
        self,
        token_threshold: int = 100_000,
        recent_turns_to_keep: int = 4,
        **kwargs
    ):
        super().__init__(**kwargs)
        self.token_threshold = token_threshold
        self.recent_turns_to_keep = recent_turns_to_keep
        self.summary: Optional[str] = None
        self.summary_history: list = []
        self.last_input_tokens = 0

    def _estimate_tokens(self) -> int:
        """Estimate current token usage from last API response."""
        return self.last_input_tokens

    def _should_summarize(self) -> bool:
        """Check if we've exceeded the token threshold."""
        return self.last_input_tokens > self.token_threshold

    def _summarize_old_messages(self) -> None:
        """Use Claude to summarize older messages."""
        # Keep the most recent turns intact
        keep_count = self.recent_turns_to_keep * 2  # user+assistant pairs
        if len(self.messages) <= keep_count:
            return

        old_messages = self.messages[:-keep_count]
        recent_messages = self.messages[-keep_count:]

        # Build a summary of the old messages
        summary_prompt = (
            "Summarize this conversation concisely. "
            "Preserve: key decisions, user preferences, "
            "important facts, and tool results. "
            "Skip: greetings, filler, repeated information.\n\n"
        )
        for msg in old_messages:
            summary_prompt += f"{msg['role']}: {msg['content']}\n"

        try:
            response = self.client.messages.create(
                model=self.model,
                max_tokens=1024,
                system="You are a conversation summarizer. Be concise.",
                messages=[{"role": "user", "content": summary_prompt}],
            )
            new_summary = response.content[0].text

            # After the first pass, old_messages begins with the previous
            # summary message, so new_summary already covers it. Prepending
            # self.summary again would duplicate that context on every pass.

            self.summary = new_summary
            self.summary_history.append({
                "timestamp": time.time(),
                "messages_summarized": len(old_messages),
                "summary_tokens": response.usage.output_tokens,
            })

            # Replace old messages with a summary message
            self.messages = [
                {"role": "user", "content": f"[Conversation summary: {self.summary}]"},
                {"role": "assistant", "content": "Understood. I have the conversation context."},
                *recent_messages,
            ]
        except anthropic.APIError as e:
            # If summarization fails, fall back to sliding window
            self.messages = recent_messages

    def send(self, user_message: str) -> str:
        """Send with automatic summarization when needed."""
        self.add_user_message(user_message)

        try:
            response = self.client.messages.create(
                model=self.model,
                max_tokens=self.max_tokens,
                system=self.system_prompt,
                messages=self.messages,
            )
            self.last_input_tokens = response.usage.input_tokens
            self.total_input_tokens += response.usage.input_tokens
            self.total_output_tokens += response.usage.output_tokens

            assistant_text = response.content[0].text
            self.add_assistant_message(assistant_text)

            # Check if we need to summarize
            if self._should_summarize():
                self._summarize_old_messages()

            return assistant_text

        except anthropic.APIError as e:
            self.messages.pop()
            raise RuntimeError(f"API call failed: {e}") from e

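    def save(self, filepath: str) -> None:
        """Persist conversation state and settings to a JSON file."""
        state = {
            "system_prompt": self.system_prompt,
            "model": self.model,
            "max_tokens": self.max_tokens,
            "token_threshold": self.token_threshold,
            "recent_turns_to_keep": self.recent_turns_to_keep,
            "messages": self.messages,
            "summary": self.summary,
            "summary_history": self.summary_history,
            "last_input_tokens": self.last_input_tokens,
            "total_input_tokens": self.total_input_tokens,
            "total_output_tokens": self.total_output_tokens,
            "saved_at": time.time(),
        }
        Path(filepath).write_text(json.dumps(state, indent=2))

    @classmethod
    def load(cls, filepath: str) -> "SmartConversationManager":
        """Load conversation state from a JSON file."""
        data = json.loads(Path(filepath).read_text())
        # Restore settings as well as history; the .get() defaults keep
        # files written without these keys loadable.
        mgr = cls(
            system_prompt=data["system_prompt"],
            model=data["model"],
            max_tokens=data.get("max_tokens", 4096),
            token_threshold=data.get("token_threshold", 100_000),
            recent_turns_to_keep=data.get("recent_turns_to_keep", 4),
        )
        mgr.messages = data["messages"]
        mgr.summary = data.get("summary")
        mgr.summary_history = data.get("summary_history", [])
        mgr.last_input_tokens = data.get("last_input_tokens", 0)
        mgr.total_input_tokens = data.get("total_input_tokens", 0)
        mgr.total_output_tokens = data.get("total_output_tokens", 0)
        return mgr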


# Usage
manager = SmartConversationManager(
    token_threshold=50_000,
    recent_turns_to_keep=4,
    system_prompt="You are a helpful coding assistant.",
)

reply = manager.send("Help me build a REST API with FastAPI.")
print(reply)

# Save and restore across sessions
manager.save("conversation_state.json")
restored = SmartConversationManager.load("conversation_state.json")
reply2 = restored.send("Where were we?")
print(reply2)

// JavaScript version
import Anthropic from "@anthropic-ai/sdk";
import { readFileSync, writeFileSync } from "node:fs";

class SmartConversationManager extends ConversationManager {
  constructor({
    tokenThreshold = 100_000,
    recentTurnsToKeep = 4,
    ...opts
  } = {}) {
    super(opts);
    this.tokenThreshold = tokenThreshold;
    this.recentTurnsToKeep = recentTurnsToKeep;
    this.summary = null;
    this.summaryHistory = [];
    this.lastInputTokens = 0;
  }

  _shouldSummarize() {
    return this.lastInputTokens > this.tokenThreshold;
  }

  async _summarizeOldMessages() {
    const keepCount = this.recentTurnsToKeep * 2;
    if (this.messages.length <= keepCount) return;

    const oldMessages = this.messages.slice(0, -keepCount);
    const recentMessages = this.messages.slice(-keepCount);

    let summaryPrompt =
      "Summarize this conversation concisely. " +
      "Preserve: key decisions, user preferences, " +
      "important facts, and tool results. " +
      "Skip: greetings, filler, repeated information.\n\n";
    for (const msg of oldMessages) {
      summaryPrompt += `${msg.role}: ${msg.content}\n`;
    }

    try {
      const response = await this.client.messages.create({
        model: this.model,
        max_tokens: 1024,
        system: "You are a conversation summarizer. Be concise.",
        messages: [{ role: "user", content: summaryPrompt }],
      });
      let newSummary = response.content[0].text;

      // After the first pass, oldMessages begins with the previous summary
      // message, so newSummary already covers it. Prepending this.summary
      // again would duplicate that context on every pass.

      this.summary = newSummary;
      this.summaryHistory.push({
        timestamp: Date.now(),
        messagesSummarized: oldMessages.length,
        summaryTokens: response.usage.output_tokens,
      });

      this.messages = [
        { role: "user", content: `[Conversation summary: ${this.summary}]` },
        { role: "assistant", content: "Understood. I have the conversation context." },
        ...recentMessages,
      ];
    } catch {
      // Fallback to sliding window on failure
      this.messages = recentMessages;
    }
  }

  async send(userMessage) {
    this.addUserMessage(userMessage);

    try {
      const response = await this.client.messages.create({
        model: this.model,
        max_tokens: this.maxTokens,
        system: this.systemPrompt,
        messages: this.messages,
      });

      this.lastInputTokens = response.usage.input_tokens;
      this.totalInputTokens += response.usage.input_tokens;
      this.totalOutputTokens += response.usage.output_tokens;

      const assistantText = response.content[0].text;
      this.addAssistantMessage(assistantText);

      if (this._shouldSummarize()) {
        await this._summarizeOldMessages();
      }

      return assistantText;
    } catch (error) {
      this.messages.pop();
      throw new Error(`API call failed: ${error.message}`);
    }
  }

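  save(filepath) {
    const state = {
      systemPrompt: this.systemPrompt,
      model: this.model,
      maxTokens: this.maxTokens,
      tokenThreshold: this.tokenThreshold,
      recentTurnsToKeep: this.recentTurnsToKeep,
      messages: this.messages,
      summary: this.summary,
      summaryHistory: this.summaryHistory,
      lastInputTokens: this.lastInputTokens,
      totalInputTokens: this.totalInputTokens,
      totalOutputTokens: this.totalOutputTokens,
      savedAt: Date.now(),
    };
    writeFileSync(filepath, JSON.stringify(state, null, 2));
  }

  static load(filepath) {
    const data = JSON.parse(readFileSync(filepath, "utf-8"));
    // Restore settings as well as history; the ?? defaults keep
    // files written without these keys loadable.
    const mgr = new SmartConversationManager({
      systemPrompt: data.systemPrompt,
      model: data.model,
      maxTokens: data.maxTokens ?? 4096,
      tokenThreshold: data.tokenThreshold ?? 100_000,
      recentTurnsToKeep: data.recentTurnsToKeep ?? 4,
    });
    mgr.messages = data.messages;
    mgr.summary = data.summary ?? null;
    mgr.summaryHistory = data.summaryHistory ?? [];
    mgr.lastInputTokens = data.lastInputTokens ?? 0;
    mgr.totalInputTokens = data.totalInputTokens ?? 0;
    mgr.totalOutputTokens = data.totalOutputTokens ?? 0;
    return mgr;
  }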
}

// Usage
const manager = new SmartConversationManager({
  tokenThreshold: 50_000,
  recentTurnsToKeep: 4,
  systemPrompt: "You are a helpful coding assistant.",
});

const reply = await manager.send("Help me build a REST API.");
console.log(reply);

// Save and restore across sessions
manager.save("conversation_state.json");
const restored = SmartConversationManager.load("conversation_state.json");
const reply2 = await restored.send("Where were we?");
console.log(reply2);
What Just Happened? You built a production-grade SmartConversationManager. Let's walk through the interesting parts. The manager watches last_input_tokens after every API response — this is how it knows when context is getting dangerously large. When tokens cross your configurable threshold, it triggers the real magic: a separate Claude call that reads all the old messages and compresses them into a concise summary. The interesting design choice is what gets preserved — the summarization prompt explicitly asks Claude to keep key decisions, facts, and tool results while dropping greetings and filler. After summarization, the old messages are replaced with a summary/acknowledgment pair plus the most recent turns kept verbatim. Here is the dilemma the code handles gracefully: what if the summarization call itself fails? Rather than crashing, it falls back to a simple sliding window — not perfect, but the conversation keeps going. Finally, save() and load() serialize the full state (messages, summary, token counts) to JSON so conversations survive server restarts.
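The idea that the persistence pattern stays the same when the file system is swapped for a database can be sketched with a small storage interface. Everything named here (`ConversationStore`, `InMemoryStore`, `JsonFileStore`, `put`, `get`) is hypothetical; a database or Redis backend would implement the same two methods with its own client.

```python
import json
from pathlib import Path
from typing import Optional, Protocol


class ConversationStore(Protocol):
    """Anything that can save and fetch a state dict by conversation id."""
    def put(self, conversation_id: str, state: dict) -> None: ...
    def get(self, conversation_id: str) -> Optional[dict]: ...


class InMemoryStore:
    """Stand-in for a key-value service; stores JSON strings in a dict."""

    def __init__(self):
        self._data = {}

    def put(self, conversation_id: str, state: dict) -> None:
        self._data[conversation_id] = json.dumps(state)

    def get(self, conversation_id: str) -> Optional[dict]:
        raw = self._data.get(conversation_id)
        return json.loads(raw) if raw is not None else None


class JsonFileStore:
    """One JSON file per conversation inside a directory."""

    def __init__(self, directory: str):
        self._dir = Path(directory)
        self._dir.mkdir(parents=True, exist_ok=True)

    def put(self, conversation_id: str, state: dict) -> None:
        path = self._dir / f"{conversation_id}.json"
        tmp = self._dir / f"{conversation_id}.json.tmp"
        tmp.write_text(json.dumps(state, indent=2))
        tmp.replace(path)   # swap the finished file into place in one step

    def get(self, conversation_id: str) -> Optional[dict]:
        path = self._dir / f"{conversation_id}.json"
        return json.loads(path.read_text()) if path.exists() else None
```

Writing to a temporary file and then renaming it means a crash in the middle of a save leaves the previous state file intact rather than a half-written one.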
Production Warning The summarization step itself costs tokens and adds latency. In production, run summarization asynchronously or batch it. Also consider that the summarization model may lose nuance — always keep the last few turns verbatim.
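One possible shape for asynchronous summarization is to hand the old messages to a worker thread and fold the result into the history on a later turn. This is a sketch under assumptions, not the module's implementation: `BackgroundSummarizer` and its methods are hypothetical, the `summarize` callable stands in for the Claude summarization call, and keep_count is assumed to be even so the kept tail starts on a user message.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional


class BackgroundSummarizer:
    """Summarize old messages on a worker thread; apply the result later."""

    def __init__(self, summarize: Callable[[list], str]):
        self._summarize = summarize
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending: Optional[Future] = None
        self._pending_count = 0   # how many leading messages are being summarized

    def start(self, messages: list, keep_count: int) -> None:
        """Kick off a summary unless one is running or there is nothing to trim."""
        if self._pending is not None or len(messages) <= keep_count:
            return
        old = list(messages[:-keep_count])
        self._pending_count = len(old)
        self._pending = self._pool.submit(self._summarize, old)

    def apply_if_ready(self, messages: list) -> list:
        """Return a compacted history if the summary is done, else the input."""
        if self._pending is None or not self._pending.done():
            return messages
        future, count = self._pending, self._pending_count
        self._pending = None
        try:
            summary = future.result()
        except Exception:
            return messages   # summarization failed: keep the full history
        return [
            {"role": "user", "content": f"[Conversation summary: {summary}]"},
            {"role": "assistant", "content": "Understood. I have the conversation context."},
            *messages[count:],   # everything after the summarized prefix
        ]
```

A manager would call start() after a reply pushes usage over the threshold and apply_if_ready() at the top of the next send(), so the user never waits on the summarization call. Messages added while the summary was running are preserved because only the summarized prefix is replaced.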

Hands-On Exercise

What You'll Build

A conversation manager that progresses through three strategies: full history, sliding window, and auto-summarization with JSON persistence. You'll see the token usage difference between each approach in real time.

Time Estimate: 30–45 minutes  |  Files You'll Create: conversation_manager.py

Environment Setup

mkdir conversation-lab && cd conversation-lab
python -m venv venv && source venv/bin/activate   # Windows: venv\Scripts\activate
pip install "anthropic>=0.30.0"
export ANTHROPIC_API_KEY=your-key-here             # Windows: set ANTHROPIC_API_KEY=your-key-here

Step 1: Basic ConversationManager with Token Tracking

What: Create a ConversationManager class that stores messages, sends them to Claude, and tracks token usage.

Why: This implements the "full history" strategy — every message is sent with every request. You will see input tokens grow with each turn, which demonstrates the cost problem that Steps 2 and 3 solve.

Create a new file called conversation_manager.py and add the following:

import anthropic
import json
import time
from pathlib import Path
from dataclasses import dataclass, field
from typing import Optional

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

# ── Step 1: Basic ConversationManager ────────────────────────
@dataclass
class ConversationManager:
    """Full-history conversation manager with token tracking."""
    system_prompt: str = "You are a helpful assistant."
    model: str = "claude-sonnet-4-6"
    max_tokens: int = 4096
    messages: list = field(default_factory=list)

    def __post_init__(self):
        self.total_input_tokens = 0
        self.total_output_tokens = 0

    def add_user_message(self, content: str) -> None:
        self.messages.append({"role": "user", "content": content})

    def add_assistant_message(self, content: str) -> None:
        self.messages.append({"role": "assistant", "content": content})

    def get_messages(self) -> list:
        return list(self.messages)

    def send(self, user_message: str) -> str:
        self.add_user_message(user_message)
        try:
            response = client.messages.create(
                model=self.model,
                max_tokens=self.max_tokens,
                system=self.system_prompt,
                messages=self.get_messages(),
            )
            self.total_input_tokens += response.usage.input_tokens
            self.total_output_tokens += response.usage.output_tokens
            assistant_text = response.content[0].text
            self.add_assistant_message(assistant_text)
            return assistant_text
        except anthropic.APIError as e:
            self.messages.pop()  # remove failed user message
            raise RuntimeError(f"API call failed: {e}") from e

    def get_token_usage(self) -> dict:
        return {
            "total_input": self.total_input_tokens,
            "total_output": self.total_output_tokens,
            "messages": len(self.messages),
        }

# ── Test Step 1 ──────────────────────────────────────────────
if __name__ == "__main__":
    print("═" * 50)
    print("TEST 1: Basic ConversationManager (Full History)")
    print("═" * 50)

    mgr = ConversationManager(system_prompt="You are a concise coding tutor. Answer in 1-2 sentences.")

    questions = [
        "What is a list in Python?",
        "How do I add an item to it?",
        "What about removing items?",
    ]

    for i, q in enumerate(questions, 1):
        print(f"\n  Turn {i}: {q}")
        reply = mgr.send(q)
        print(f"  Claude: {reply[:100]}...")
        usage = mgr.get_token_usage()
        print(f"  [Messages: {usage['messages']} | Input tokens so far: {usage['total_input']}]")

    print(f"\n  ✓ Final: {usage['messages']} messages, {usage['total_input']} total input tokens")
    print(f"  Note: Input tokens grow each turn because ALL messages are resent!")

Run it:

Command
python conversation_manager.py
Expected Output (token counts will vary)
══════════════════════════════════════════════════
TEST 1: Basic ConversationManager (Full History)
══════════════════════════════════════════════════

  Turn 1: What is a list in Python?
  Claude: A list is an ordered, mutable collection...
  [Messages: 2 | Input tokens so far: 38]

  Turn 2: How do I add an item to it?
  Claude: Use the .append() method...
  [Messages: 4 | Input tokens so far: 120]

  Turn 3: What about removing items?
  Claude: Use .remove(value) or .pop(index)...
  [Messages: 6 | Input tokens so far: 248]

  ✓ Final: 6 messages, 248 total input tokens
  Note: Input tokens grow each turn because ALL messages are resent!
✅ Checkpoint

Watch the input token count grow with each turn. The printed figure is a running total: turn 1 sends ~38 input tokens, turn 2 resends turn 1 plus the new message (~82 tokens, total 120), and turn 3 resends everything (~128 tokens, total 248). This is the stateless reality in action, and exactly why we need the strategies in the next steps.
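The running totals in the sample output can be converted back into per-request sizes, and the growth can be projected forward. The sketch below uses the illustrative numbers from the expected output; the helper names and the assumed growth of about 45 tokens per turn are choices made for this example, and real conversations will differ.

```python
def per_request(cumulative: list) -> list:
    """Convert running totals into the size of each individual request."""
    previous = [0] + cumulative[:-1]
    return [total - before for before, total in zip(previous, cumulative)]


def projected_total(first: int, step: int, turns: int) -> int:
    """Cumulative input tokens if request k sends first + (k - 1) * step."""
    return turns * first + step * turns * (turns - 1) // 2


running_totals = [38, 120, 248]          # "Input tokens so far" per turn
sizes = per_request(running_totals)
print(sizes)                             # [38, 82, 128]

# Assume each turn adds about 45 tokens to the next request.
print(projected_total(38, 45, 3))        # 249
print(projected_total(38, 45, 15))       # 5295
```

Each request grows linearly with the number of turns, so the cumulative total grows roughly quadratically: three turns cost about 250 input tokens in this model, while fifteen turns cost over 5,000.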

Troubleshooting
  • AuthenticationError → Check your ANTHROPIC_API_KEY is set. Run echo $ANTHROPIC_API_KEY to verify.
  • ModuleNotFoundError: No module named 'anthropic' → Run pip install anthropic
  • Responses are very long → Make sure your system prompt says "Answer in 1-2 sentences" to keep responses short for testing.
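The first troubleshooting item can also be checked from Python before any request is made. This is a small optional sketch; the function name `api_key_problem` and the wording of its message are made up here, and it only confirms that the variable is set, not that the key is valid.

```python
import os
from typing import Mapping, Optional


def api_key_problem(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a human-readable problem, or None if the key looks set."""
    value = env.get("ANTHROPIC_API_KEY", "")
    if not value.strip():
        return "ANTHROPIC_API_KEY is not set in this shell session."
    return None


def main() -> None:
    problem = api_key_problem()
    if problem:
        raise SystemExit(problem)
    print("API key found; the anthropic client should be able to read it.")
```

Calling main() at the top of conversation_manager.py turns a missing key into a one-line message instead of an AuthenticationError raised partway through the first turn.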

Step 2: Add Sliding Window & Smart Summarization

What: Add two more manager classes: a SlidingWindowManager that only sends the last N messages, and a SmartConversationManager that auto-summarizes old turns and persists state to JSON.

Why: This step demonstrates the token savings from each strategy side-by-side. The final comparison table shows you exactly how many tokens each approach uses over the same 6-turn conversation.

This step uses the ConversationManager from Step 1. Add the following to the bottom of conversation_manager.py, replacing the if __name__ == "__main__" block:

# ── Step 2a: Sliding Window Manager ──────────────────────────
class SlidingWindowManager(ConversationManager):
    """Sends only the last N messages to the API."""

    def __init__(self, window_size: int = 6, **kwargs):
        super().__init__(**kwargs)
        self.window_size = window_size

    def get_messages(self) -> list:
        if len(self.messages) <= self.window_size:
            return list(self.messages)
        windowed = self.messages[-self.window_size:]
        # Ensure we start with a user message (valid alternation)
        if windowed and windowed[0]["role"] == "assistant":
            windowed = windowed[1:]
        return windowed


# ── Step 2b: Smart Manager with Summarization + Persistence ──
class SmartConversationManager(ConversationManager):
    """Auto-summarizes old turns and saves/loads state."""

    def __init__(self, token_threshold: int = 50_000, recent_turns: int = 4, **kwargs):
        super().__init__(**kwargs)
        self.token_threshold = token_threshold
        self.recent_turns = recent_turns
        self.summary: Optional[str] = None
        self.last_input_tokens = 0

    def send(self, user_message: str) -> str:
        self.add_user_message(user_message)
        try:
            response = client.messages.create(
                model=self.model,
                max_tokens=self.max_tokens,
                system=self.system_prompt,
                messages=self.get_messages(),
            )
            self.last_input_tokens = response.usage.input_tokens
            self.total_input_tokens += response.usage.input_tokens
            self.total_output_tokens += response.usage.output_tokens
            assistant_text = response.content[0].text
            self.add_assistant_message(assistant_text)

            # Auto-summarize if threshold exceeded
            if self.last_input_tokens > self.token_threshold:
                self._summarize()

            return assistant_text
        except anthropic.APIError as e:
            self.messages.pop()
            raise RuntimeError(f"API call failed: {e}") from e

    def _summarize(self) -> None:
        keep_count = self.recent_turns * 2  # user+assistant pairs
        if len(self.messages) <= keep_count:
            return

        old_msgs = self.messages[:-keep_count]
        recent_msgs = self.messages[-keep_count:]

        prompt = "Summarize this conversation concisely. Preserve: key decisions, facts, tool results. Skip: greetings, filler.\n\n"
        for m in old_msgs:
            prompt += f"{m['role']}: {m['content']}\n"

        try:
            resp = client.messages.create(
                model=self.model, max_tokens=1024,
                system="You are a conversation summarizer.",
                messages=[{"role": "user", "content": prompt}],
            )
            summary_text = resp.content[0].text
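            # After the first pass, old_msgs begins with the previous summary
            # message, so summary_text already covers it. Prepending
            # self.summary again would duplicate that context on every pass.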
            self.summary = summary_text
            self.messages = [
                {"role": "user", "content": f"[Context summary: {self.summary}]"},
                {"role": "assistant", "content": "Understood. I have the context."},
                *recent_msgs,
            ]
            print(f"    ⚡ Summarized {len(old_msgs)} old messages → {len(self.messages)} total")
        except Exception:
            self.messages = recent_msgs  # fallback to sliding window

    def save(self, filepath: str) -> None:
        state = {
            "system_prompt": self.system_prompt,
            "model": self.model,
            "messages": self.messages,
            "summary": self.summary,
            "total_input_tokens": self.total_input_tokens,
            "total_output_tokens": self.total_output_tokens,
        }
        Path(filepath).write_text(json.dumps(state, indent=2))
        print(f"    💾 Saved to {filepath}")

    @classmethod
    def load(cls, filepath: str) -> "SmartConversationManager":
        data = json.loads(Path(filepath).read_text())
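        # token_threshold and recent_turns are not saved, so the restored
        # manager falls back to the constructor defaults for both
        mgr = cls(system_prompt=data["system_prompt"], model=data["model"])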
        mgr.messages = data["messages"]
        mgr.summary = data.get("summary")
        mgr.total_input_tokens = data.get("total_input_tokens", 0)
        mgr.total_output_tokens = data.get("total_output_tokens", 0)
        return mgr


# ── Test All Three Strategies ────────────────────────────────
if __name__ == "__main__":
    questions = [
        "What is a Python list?",
        "How do I sort a list?",
        "What about list comprehensions?",
        "How do dictionaries differ from lists?",
        "What are sets?",
        "When should I use tuples?",
    ]

    # --- Test 1: Full History ---
    print("═" * 55)
    print("TEST 1: Full History (all messages sent every time)")
    print("═" * 55)
    mgr1 = ConversationManager(system_prompt="Answer in 1 sentence.")
    for i, q in enumerate(questions, 1):
        mgr1.send(q)
        u = mgr1.get_token_usage()
        print(f"  Turn {i}: stored={u['messages']} sent={u['messages']} input_tokens={u['total_input']}")

    # --- Test 2: Sliding Window ---
    print(f"\n{'═' * 55}")
    print("TEST 2: Sliding Window (last 4 messages only)")
    print("═" * 55)
    mgr2 = SlidingWindowManager(window_size=4, system_prompt="Answer in 1 sentence.")
    for i, q in enumerate(questions, 1):
        mgr2.send(q)
        u = mgr2.get_token_usage()
        sent = len(mgr2.get_messages())
        print(f"  Turn {i}: stored={u['messages']} sent={sent} input_tokens={u['total_input']}")

    # --- Test 3: Smart Manager + Persistence ---
    print(f"\n{'═' * 55}")
    print("TEST 3: Smart Manager (auto-summarization + save/load)")
    print("═" * 55)
    mgr3 = SmartConversationManager(
        token_threshold=500,  # low threshold to trigger summarization quickly
        recent_turns=2,
        system_prompt="Answer in 1 sentence.",
    )
    for i, q in enumerate(questions, 1):
        mgr3.send(q)
        u = mgr3.get_token_usage()
        print(f"  Turn {i}: messages={u['messages']} input_tokens={u['total_input']}")

    # Save and restore
    mgr3.save("test_session.json")
    restored = SmartConversationManager.load("test_session.json")
    reply = restored.send("What did we discuss?")
    print(f"\n  Restored reply: {reply[:120]}...")

    # --- Comparison ---
    print(f"\n{'═' * 55}")
    print("COMPARISON (total input tokens after 6 turns):")
    print(f"  Full History:    {mgr1.total_input_tokens:,} tokens")
    print(f"  Sliding Window:  {mgr2.total_input_tokens:,} tokens")
    print(f"  Smart Manager:   {mgr3.total_input_tokens:,} tokens")
    print("═" * 55)

Run the full test suite:

Command
python conversation_manager.py
Expected Output (token counts will vary)
═══════════════════════════════════════════════════════
TEST 1: Full History (all messages sent every time)
═══════════════════════════════════════════════════════
  Turn 1: stored=2 sent=2 input_tokens=35
  Turn 2: stored=4 sent=4 input_tokens=112
  Turn 3: stored=6 sent=6 input_tokens=225
  Turn 4: stored=8 sent=8 input_tokens=370
  Turn 5: stored=10 sent=10 input_tokens=545
  Turn 6: stored=12 sent=12 input_tokens=750

═══════════════════════════════════════════════════════
TEST 2: Sliding Window (last 4 messages only)
═══════════════════════════════════════════════════════
  Turn 1: stored=2 sent=2 input_tokens=35
  Turn 2: stored=4 sent=4 input_tokens=112
  Turn 3: stored=6 sent=4 input_tokens=200
  Turn 4: stored=8 sent=4 input_tokens=285
  Turn 5: stored=10 sent=4 input_tokens=368
  Turn 6: stored=12 sent=4 input_tokens=450

═══════════════════════════════════════════════════════
TEST 3: Smart Manager (auto-summarization + save/load)
═══════════════════════════════════════════════════════
  Turn 1: messages=2 input_tokens=35
  Turn 2: messages=4 input_tokens=112
  Turn 3: messages=6 input_tokens=225
  Turn 4: messages=8 input_tokens=370
  Turn 5: messages=10 input_tokens=545
    ⚡ Summarized 6 old messages → 6 total
  Turn 6: messages=8 input_tokens=180
    💾 Saved to test_session.json

  Restored reply: We discussed Python data structures...

═══════════════════════════════════════════════════════
COMPARISON (total input tokens after 6 turns):
  Full History:    750 tokens
  Sliding Window:  450 tokens
  Smart Manager:   1,467 tokens (higher due to short test; pays off over long sessions)
═══════════════════════════════════════════════════════
✅ Checkpoint

Look for these key behaviors:

  • Test 1: input_tokens is a running total, and the increase between turns should get larger each turn (full history resends everything)
  • Test 2: The sent column should cap at 4 after turn 2, while stored keeps growing
  • Test 3: You should see ⚡ Summarized messages when the threshold is hit, and the restored manager should know what was discussed
  • Comparison: Sliding window uses the fewest tokens in this short test. Smart manager appears higher here because the lab uses a deliberately low token_threshold=500 on a short conversation — in real long sessions (50+ turns) it ends up well below full history while preserving far more context
Troubleshooting
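  • Summarization never triggers → The token_threshold=500 is set low for testing, but it is compared with the input tokens of the most recent request (last_input_tokens), not with the running total printed in the input_tokens column. With one-sentence answers a single request may never reach 500 tokens; add more questions or lower the threshold further.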
  • FileNotFoundError on load → Make sure save() ran successfully first. Check that test_session.json exists in the current directory.
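  • Restored manager doesn't remember context → save() writes whatever is in self.messages at that moment, so the restored manager knows only what the saved file contains. If a summarization call failed and _summarize fell back to the sliding window, the older turns were dropped without a summary. Open test_session.json and check its messages and summary fields.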
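  • APIError: 529 (overloaded) → The full run makes at least 19 API calls (6 per test, 1 for the restored session, plus any summarization calls). If you hit 529 (overloaded) or 429 (rate limit) errors, add time.sleep(1) between turns or run the tests one at a time.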
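If a fixed time.sleep(1) is not enough, retrying with exponential backoff is a common alternative. The helper below is a minimal sketch, not part of the lab code: call_with_backoff and its parameters are hypothetical names, and the sleep argument exists only so the delay can be replaced in tests.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying with exponential backoff plus jitter on failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Base delays of 1s, 2s, 4s, ... plus up to 25% random jitter
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.25))
    raise AssertionError("unreachable")
```

In the lab, send() re-raises API failures as RuntimeError, so a call might look like call_with_backoff(lambda: mgr3.send(q), retry_on=(RuntimeError,)). This retries every API failure, including ones that will never succeed (such as an invalid request), so a real implementation would likely inspect the underlying error first.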

Verify Everything Works

Run the complete file end-to-end. All 3 tests should complete, the comparison table should show Sliding Window using the fewest tokens in this short run, and the save/load cycle should successfully restore a conversation:

Command
python conversation_manager.py

If you see the COMPARISON table and the restored reply, you've successfully built a production-pattern conversation manager with all three strategies.

🎉 Congratulations

You've built three conversation management strategies and measured their real token usage side-by-side. The SmartConversationManager follows a pattern that is common in production agents: you can tune token_threshold and recent_turns to balance cost against context retention for your specific use case.

Stretch Goals (Optional)
  • Importance-based pruning: Add an importance field to each message and drop lowest-scored messages first when pruning
  • Pinned facts: Add a pinned_facts list that is always prepended to the context and never summarized
  • Visual dashboard: Print a bar chart showing token usage per turn for each strategy
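As a possible starting point for the importance-based pruning goal, the sketch below prunes whole user/assistant turns rather than single messages, so the pruned list still starts with a user message. It assumes messages strictly alternate starting with a user message and that each message may carry a numeric importance key; prune_by_importance and strip_metadata are hypothetical helpers, not part of the lab code.

```python
from typing import Any, Dict, List

Message = Dict[str, Any]


def prune_by_importance(messages: List[Message], max_turns: int) -> List[Message]:
    """Keep the max_turns highest-importance user/assistant pairs, in original order."""
    # Group the flat list into (user, assistant) turns
    turns = [messages[i:i + 2] for i in range(0, len(messages), 2)]
    if len(turns) <= max_turns:
        return list(messages)

    def score(index: int) -> float:
        # A turn is as important as its most important message
        return max(float(m.get("importance", 0.0)) for m in turns[index])

    # Rank turns by score, breaking ties in favor of newer turns
    ranked = sorted(range(len(turns)), key=lambda i: (score(i), i), reverse=True)
    keep = sorted(ranked[:max_turns])  # restore chronological order
    return [m for i in keep for m in turns[i]]


def strip_metadata(messages: List[Message]) -> List[Message]:
    """Drop bookkeeping keys such as importance before sending to the API."""
    return [{"role": m["role"], "content": m["content"]} for m in messages]
```

Scoring per turn instead of per message is a design choice that trades some precision for simplicity: dropping a lone assistant reply would leave two user messages in a row, while dropping the pair keeps the roles alternating.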

Knowledge Check

1. What happens if the client sends an API request to Claude without including previous messages?

A Claude returns an error because the conversation session has expired
B Claude responds with no memory of the conversation — it's a fresh start
C Claude retrieves the previous messages from its server-side session storage
D Claude uses cached embeddings from the previous messages to maintain context
Correct! The API is stateless — there is no server-side session. If you don't send the history, Claude has no memory of it. This is by design.
Not quite. The Claude API is fully stateless: there is no session storage and no conversation state carried over between requests. If you omit previous messages, Claude treats it as a brand new conversation.

2. Given: system prompt = 1,500 tokens, conversation history = 40,000 tokens, current message = 800 tokens, max_tokens = 4,096. With a 200K context window, how many tokens are available for additional history before hitting the limit?

A 200,000 tokens
B 159,604 tokens
C 153,604 tokens — calculated as 200,000 − 1,500 − 40,000 − 800 − 4,096
D 193,604 tokens
Correct! 200,000 − 1,500 (system) − 40,000 (history) − 800 (message) − 4,096 (response) = 153,604 tokens remaining for additional history growth.
Remember the formula: available = context_window − system_tokens − history_tokens − current_message − max_tokens. That gives 200,000 − 1,500 − 40,000 − 800 − 4,096 = 153,604.
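The budget formula above can be checked with a few lines of Python. This is a small sketch; remaining_history_budget is a hypothetical helper name, and the token counts are the ones given in the question.

```python
def remaining_history_budget(
    context_window: int,
    system_tokens: int,
    history_tokens: int,
    current_message_tokens: int,
    max_tokens: int,
) -> int:
    """Tokens still free for additional history; a negative value means over budget."""
    return (
        context_window
        - system_tokens
        - history_tokens
        - current_message_tokens
        - max_tokens
    )


print(remaining_history_budget(200_000, 1_500, 40_000, 800, 4_096))  # → 153604
```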

3. A customer support bot needs to remember the user's account number from message 1, but conversations can go 50+ turns. Which strategy is best?

A Full history — keep every message
B Sliding window (N=10) — drop messages older than 10 turns
C Summarization with importance-based retention — summarize old turns but always preserve key facts like account numbers
D FIFO pruning — drop the oldest messages first
Correct! Summarization with importance-based retention preserves critical facts (account number) while compressing less important exchanges. Pure sliding window or FIFO would lose the account number.
A sliding window or FIFO approach would drop message 1 (with the account number) after a few turns. Full history works but is wasteful at 50+ turns. The best fit is summarization with importance scoring to preserve key facts.
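One way to keep a fact like the account number alive is to store it outside the prunable history and prepend it on every request. The sketch below reuses the lab's "[Context summary: ...]" message pattern; build_context and pinned_facts are hypothetical names, and where the facts should live (a leading message or the system prompt) is a design choice left open here.

```python
from typing import Dict, List, Optional


def build_context(
    pinned_facts: List[str],
    summary: Optional[str],
    recent_msgs: List[Dict[str, str]],
) -> List[Dict[str, str]]:
    """Prepend pinned facts and the running summary to the recent messages."""
    parts = []
    if pinned_facts:
        parts.append("Pinned facts: " + "; ".join(pinned_facts))
    if summary:
        parts.append(f"Context summary: {summary}")
    if not parts:
        return list(recent_msgs)
    return [
        # Pinned facts are rebuilt on every call, so pruning never touches them
        {"role": "user", "content": "[" + " | ".join(parts) + "]"},
        {"role": "assistant", "content": "Understood. I have the context."},
        *recent_msgs,
    ]
```

Because the pinned facts are never passed through the summarizer, they cannot be paraphrased away or dropped, no matter how many times the older turns are compressed.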

4. When summarizing a 10-turn technical conversation about debugging a database query, which information is MOST likely to be lost?

A The final solution that fixed the query
B The database engine being used (PostgreSQL, MySQL, etc.)
C The exact error message that was reported
D The intermediate hypotheses and dead-end approaches that were tried and rejected
Correct! Summarization naturally compresses the narrative, keeping outcomes (the fix) and key facts (the DB engine, the error) while discarding the exploratory process. Dead-end approaches rarely survive summarization.
Summarization preserves key outcomes and facts but tends to compress the exploratory process. The intermediate hypotheses and dead-end approaches are exactly the kind of "filler" that a summarizer drops. The final solution, the DB engine, and the error message are much more likely to be preserved.

5. Which of these message arrays would cause an API error when sent to Claude?

A [{role:"user",...}, {role:"assistant",...}, {role:"user",...}]
B [{role:"assistant",...}, {role:"user",...}, {role:"assistant",...}]
C [{role:"user",...}, {role:"assistant",..., content:[{type:"tool_use",...}]}, {role:"user",..., content:[{type:"tool_result",...}]}]
D [{role:"user",...}] (single message)
Correct! The messages array must start with a user role message. Starting with assistant violates the required message alternation and will cause an API error. This is a common bug when implementing pruning — always verify the first message role after removing messages.
The issue is that array B starts with an assistant role message. The Claude API requires the messages array to begin with a user message. A single user message (D) is valid, and tool_use/tool_result sequences (C) are valid when properly ordered. This is critical to check in your pruning logic!
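A small guard after pruning can catch this bug before the request is sent. The function below is a minimal sketch with a hypothetical name, drop_leading_non_user; it checks only the role of the first message and does not handle other cases, such as a leading tool_result whose matching tool_use was pruned.

```python
from typing import Any, Dict, List


def drop_leading_non_user(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return messages starting from the first user message (empty if there is none)."""
    for index, message in enumerate(messages):
        if message.get("role") == "user":
            return messages[index:]
    return []


pruned = [
    {"role": "assistant", "content": "Sorted lists use list.sort()."},
    {"role": "user", "content": "What about sets?"},
    {"role": "assistant", "content": "Sets hold unique items."},
]
print([m["role"] for m in drop_leading_non_user(pruned)])  # → ['user', 'assistant']
```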

Module Summary

Key Concepts Recap

  • Statelessness — The Claude API has no memory. You replay the full conversation every call. This gives you total control.
  • Full History — Send everything. Simple, but hits limits fast. Good for <20 turns.
  • Sliding Window — Keep last N messages. Cheap but loses early context.
  • Summarization — Compress old turns into a summary. Best balance of cost and context.
  • Token Budget — System + History + Message + Response must fit in the context window. Monitor and prune proactively.
  • Pruning — FIFO, importance scoring, semantic dedup, and summarize-and-replace. Always maintain valid message alternation.
  • ConversationManager — A production class that encapsulates storage, counting, pruning, summarization, and persistence.

What We Built

A full ConversationManager class (Python & Node.js) that handles multi-turn conversations with automatic token tracking, sliding window mode, Claude-powered summarization, and JSON serialization for cross-session persistence.

Next Module Preview

M09: RAG — Retrieval-Augmented Generation takes conversation management further by connecting Claude to external knowledge bases. Instead of relying solely on conversation history, RAG lets your agent search documents, databases, and APIs to ground its responses in real data — building on the context management skills you learned here.