M08: Conversation Management
Master the art of managing multi-turn conversations with Claude, from stateless API calls to production-grade context management with token budgets, pruning, and summarization. (A token is the basic unit of text a language model processes, roughly 3–4 characters in English; input and output are both measured and billed in tokens.)
Prerequisites: M01–M04 (Foundations track)
Learning Objectives
- Explain why the Claude Messages API is stateless and how the illusion of memory is created by replaying message history
- Implement three conversation history management strategies: full history, sliding window, and summarization
- Calculate token budget allocation across system prompt, history, current message, and reserved response tokens
- Apply message pruning strategies that preserve information density while staying within token limits
- Build a production-grade ConversationManager class with automatic pruning, summarization, and serialization
The Stateless Reality: Claude Has No Memory
BEFORE: Imagine you could hire a world-class expert for advice, but they have perfect amnesia — every time you walk into the room, they have zero memory of any previous conversation you have had with them.
PAIN: If you just said "so what do you think about option B?" they would stare blankly, because they have no idea what option A was, who you are, or what problem you are solving. Every interaction starts from absolute zero.
MAPPING: This is exactly how the Claude Messages API works. Each API call is a fresh room with a fresh expert. The only way to give Claude "memory" is to hand it a written transcript of every previous exchange — your code replays the entire conversation from scratch on every single request.
What this actually looks like in code: On your third message to Claude, the API request does not just send message 3. It sends all previous messages again from scratch:
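A minimal sketch of that third request, built as a plain dictionary so it runs without an API key. The transcript contents and the build_request helper are illustrative assumptions; the keys mirror the parameters of the Messages API call used later in this module.

```python
# Hypothetical transcript: two completed turns (messages 1 through 4).
history = [
    {"role": "user", "content": "I need to pick a database for a small web app."},
    {"role": "assistant", "content": "Option A is PostgreSQL; option B is SQLite."},
    {"role": "user", "content": "How do they compare on setup effort?"},
    {"role": "assistant", "content": "SQLite needs no server; PostgreSQL needs one."},
]


def build_request(history, new_user_message, system="You are a helpful assistant."):
    """Return the keyword arguments for the next messages.create() call."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "system": system,
        # Every earlier message is replayed, then the new one is appended.
        "messages": [*history, {"role": "user", "content": new_user_message}],
    }


request = build_request(history, "So what do you think about option B?")
for number, msg in enumerate(request["messages"], start=1):
    print(number, msg["role"], msg["content"])
```

Passing this dictionary to client.messages.create(**request) would send all five messages in one call.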
Notice: messages 1 through 4 are resent verbatim. Claude does not "remember" them — your code replays them every time.
First, each request is completely independent. There is no session ID. There is no server-side conversation state. There is no implicit memory of any kind.
Second, what users perceive as a continuous conversation is actually the client re-sending the entire message history in the messages array with every single request.
In other words, the assistant's "memory" of earlier messages exists only because the developer explicitly includes those messages in the next API call. If you leave a message out, it is gone — Claude will never know it happened.
Concrete examples of what statelessness buys you: A customer support agent handling 10,000 concurrent sessions does not need 10,000 server-side session stores, because each request is self-contained. A healthcare agent can surgically remove PHI from history before sending it to a logging service. A debugging agent can "rewind" a conversation to turn 5 and try a different approach.
Understanding statelessness is the foundation of everything else in this module.
"Claude remembers our previous conversation." — No. Each API call is completely stateless. There is no session, no server-side storage, no memory of any kind. If you do not include previous messages in the messages array, they do not exist as far as Claude is concerned. The "memory" you experience in chat interfaces like claude.ai is built by the client replaying message history — the exact same technique you are learning here.
"Context window = memory." — No. The context window is temporary — it exists only for the duration of a single API request. Memory persists across requests. The context window is more like a whiteboard that gets erased after every meeting: you can write a lot on it, but once the meeting ends, it is blank. Persistent memory requires your code to save and reload state.
"Longer context = better results." Not necessarily. Recall can degrade as the context grows, and research describes a "lost in the middle" effect: information buried in the middle of a very long context is recalled less reliably than information at the start or end. Cramming everything into a 200K-token payload can actually make Claude worse at finding the relevant facts.
"Summarization is lossless." — No. Summarization is a lossy compression. It preserves the gist but loses specifics: exact names, account numbers, dates, dollar amounts, and nuanced phrasing. That is why production systems use "pinned facts" blocks that are never summarized — critical specifics stay verbatim while surrounding context gets compressed.
Conversation History Management Patterns
BEFORE: Imagine packing for a two-week trip, but your airline only allows one carry-on bag with a strict 7 kg weight limit — you cannot check luggage.
PAIN: If you try to bring everything — every outfit, every gadget, every book — the bag overflows and the airline rejects it at the gate. But if you only bring what fits right now, you might arrive at a formal dinner with nothing but hiking clothes because you dropped the dress shoes three days ago.
MAPPING: Your context window is that carry-on bag. Full history means cramming everything in (works for short trips). Sliding window means only packing the last few days of clothes (cheap but you lose early essentials). Summarization means writing a packing list of what you left at home plus bringing the current essentials — the best balance for long journeys.
What this actually looks like in practice: Suppose a 20-turn conversation has accumulated 40 messages (20 from the user, 20 from the assistant). Here is how each strategy would handle sending turn 21 to the API:
- Full History: send all messages every time. Simple but hits token limits quickly. Best for short conversations (< 20 turns).
- Sliding Window: keep only the last N messages. Maintains recency but loses early context.
- Summarization: periodically compress older messages into a summary, prepend it to the history, and drop the originals. Preserves key information while staying within token budgets.
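The three strategies can be sketched as small functions over a message list. This is an illustration under stated assumptions, not a production implementation: the 40-message history is synthetic, the window of 8 is arbitrary, and fake_summarize is a hypothetical stand-in for the Claude call that would write the real summary.

```python
def full_history(messages):
    """Strategy 1: send everything."""
    return list(messages)


def sliding_window(messages, n):
    """Strategy 2: keep the last n messages (n > 0), starting on a user turn."""
    window = messages[-n:]
    if window and window[0]["role"] == "assistant":
        window = window[1:]
    return window


def summarized(messages, keep_last, summarize):
    """Strategy 3: replace everything before the last keep_last messages with a summary pair."""
    if keep_last <= 0 or len(messages) <= keep_last:
        return list(messages)
    old, recent = messages[:-keep_last], messages[-keep_last:]
    return [
        {"role": "user", "content": f"[Conversation summary: {summarize(old)}]"},
        {"role": "assistant", "content": "Understood. I have the conversation context."},
        *recent,
    ]


def fake_summarize(old_messages):
    """Hypothetical stand-in for a summarization call to Claude."""
    return f"{len(old_messages)} earlier messages condensed"


# Synthetic 20-turn history: 40 alternating messages.
history = []
for turn in range(1, 21):
    history.append({"role": "user", "content": f"question {turn}"})
    history.append({"role": "assistant", "content": f"answer {turn}"})

new_message = {"role": "user", "content": "question 21"}

payloads = {
    "full": full_history(history) + [new_message],
    "window": sliding_window(history, 8) + [new_message],
    "summary": summarized(history, 8, fake_summarize) + [new_message],
}
for name, payload in payloads.items():
    print(name, len(payload), "messages")
```

With these inputs the full payload carries 41 messages, the window 9, and the summarized version 11 (a two-message summary pair, eight recent messages, and the new question).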
Summarization: The Strategy That Deserves a Closer Look
Summarization is the most powerful of the three strategies, so it is worth understanding at a deeper level. In plain English, summarization means using Claude itself to read a block of older messages and compress them into a short paragraph that captures the key facts, decisions, and preferences — then replacing those original messages with the summary. The result is a much shorter message history that still carries the essential context.
Here is how it works internally. When your conversation manager detects that token usage has crossed a threshold (say 100,000 tokens), it splits the message history into two parts: "old" messages (everything before the last few turns) and "recent" messages (the last 4 turns, kept verbatim). It sends the old messages to Claude with a special instruction like "Summarize this conversation. Preserve key decisions, user preferences, important facts, and tool results. Skip greetings and filler." Claude returns a compact summary, which gets injected as a synthetic user/assistant message pair at the start of the history, and the original old messages are discarded.
If you are thinking "this sounds a lot like a sliding window, just smarter" — you are right, and that is the key difference. A sliding window throws away old messages and loses their information forever. Summarization compresses them instead, preserving the meaning while discarding the verbatim text. The trade-off is that summarization costs an extra API call (and its own tokens), and some specific details — exact numbers, names, IDs — may be paraphrased away. That is why production systems often combine summarization with "pinned facts" that are never summarized.
Token Budget Allocation
BEFORE: Imagine you have a fixed-size suitcase for an international trip — exactly 200,000 cubic centimeters, not one more — and everything you need must fit inside it.
PAIN: You must pack a travel guide (system prompt), a photo album of memories (conversation history), the souvenirs you are buying today (current user message), and leave empty space to bring things home (response tokens). If you overstuff the album with every photo from every trip, there is no room left for today's souvenirs or the space to bring anything home.
MAPPING: Your context window (e.g., 200K tokens) is that suitcase. The four consumers (system prompt, history, current message, and reserved response tokens) compete for the same fixed space. If conversation history grows unchecked, it squeezes out room for the model's response, eventually causing errors or truncation.
What this actually looks like in your code: When you set max_tokens in an API call, that is you reserving suitcase space for the return trip. A typical production agent's budget has four consumers:
- System prompt: typically 500–2,000 tokens (fixed)
- Conversation history: variable, grows per turn
- Current user message: variable
- max_tokens: reserved for the response (set via API parameter)
Here is how the math works: start with your total context window (e.g., 200,000 tokens). Subtract the system prompt, the current user message, and the tokens reserved for Claude's response (max_tokens). Whatever is left is the space available for conversation history. As a formula: available_for_history = context_window - system_tokens - current_message_tokens - max_tokens
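The formula translates directly into a small helper. The sketch below uses illustrative numbers (a 1,500-token system prompt, a 300-token user message, and max_tokens of 4,096); the function name is an assumption chosen to match the formula.

```python
def available_for_history(context_window, system_tokens, current_message_tokens, max_tokens):
    """Tokens left for conversation history after the other three consumers."""
    remaining = context_window - system_tokens - current_message_tokens - max_tokens
    return max(remaining, 0)  # never report a negative budget


budget = available_for_history(
    context_window=200_000,
    system_tokens=1_500,
    current_message_tokens=300,
    max_tokens=4_096,
)
print(budget)  # 194104
```

Clamping at zero is a design choice for this sketch: a result of 0 signals that the fixed consumers alone already fill the window, so there is no room for any history.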
If you are coming from traditional web development, token budgets are a new concept — there is no equivalent in a typical REST API. The closest analogy is a database connection with a maximum query size, but unlike a database, you pay per token on every request. That means the budget is both a technical constraint (exceed it and the API rejects your request) and a cost constraint (bigger payloads = bigger bills).
One subtlety that surprises beginners: the system prompt occupies budget space on every single request, not just the first one. A 2,000-token system prompt that seems negligible at turn 1 has quietly consumed 2,000 tokens × 50 turns = 100,000 tokens of cumulative billing by turn 50. This is why production teams obsess over concise system prompts — every word is multiplied by the total number of API calls in the conversation.
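To make the multiplication concrete, here is a simplified tally of input tokens over a 50-turn full-history conversation. The per-message sizes (150 tokens per user message, 350 per assistant reply) are assumptions picked for round numbers, and the model ignores everything except raw input token counts.

```python
SYSTEM_TOKENS = 2_000
USER_TOKENS = 150        # assumed size of each user message
ASSISTANT_TOKENS = 350   # assumed size of each assistant reply
TURNS = 50

system_total = 0    # input tokens spent re-sending the system prompt
input_total = 0     # all input tokens across the conversation
history_tokens = 0  # tokens of earlier messages replayed on the current turn

for turn in range(1, TURNS + 1):
    request_input = SYSTEM_TOKENS + history_tokens + USER_TOKENS
    system_total += SYSTEM_TOKENS
    input_total += request_input
    # After the reply arrives, this turn's pair joins the replayed history.
    history_tokens += USER_TOKENS + ASSISTANT_TOKENS

print(system_total)  # 100000
print(input_total)   # 720000
```

Under these assumptions the system prompt accounts for 100,000 of the 720,000 cumulative input tokens, and the replayed history accounts for most of the rest.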
"max_tokens limits how much the model reads." — No. max_tokens only limits how much the model writes (output tokens). It does not affect input processing. If you send 150,000 input tokens with max_tokens=100, Claude still reads all 150K — it just cannot write more than 100 tokens in response.
"I can just set max_tokens really high to be safe." — Setting it higher than needed wastes budget space. The reserved response tokens are subtracted from your available context window. If you reserve 8,192 tokens but Claude only needs 200, you have blocked 7,992 tokens of space that could have held conversation history.
"Token counts are exact and predictable." — Not quite. Token counts depend on the model's tokenizer and can vary slightly between words, languages, and even code vs. prose. Use the API response's usage.input_tokens field for exact counts rather than trying to predict them yourself.
The "lost in the middle" effect means information in the middle of long context gets lower recall than information at the start or end. Position critical data at the beginning (case facts) or end (current query).
Practical Context Window Management
Knowing the budget exists is one thing — actively managing it in production is another. The patterns below are what you will actually wire into a real ConversationManager. Four moving pieces work together: counting tokens before you send, allocating by percentage, handling overflow, and dynamically sizing the sliding window.
1. Count tokens before the request. Never send first and hope it fits. Use Anthropic's count_tokens endpoint or a local tokenizer to size the payload up front.
2. Allocate by percentage, not absolutes. A typical production split for a 200K window: 20% system prompt, 50% history, 20% current user message, 10% response headroom.
3. What overflow actually returns. Exceed the window and the request never reaches the model. The API answers with HTTP 400 and an invalid_request_error.
Catch this in code: anthropic.BadRequestError (Python SDK) or status 400 with error.type === "invalid_request_error". Then prune and retry once. Do not loop: if a single message is larger than the window, pruning the history will never make the retry succeed.
4. Slide on tokens, not message count. A naive "keep last 10 messages" rule breaks the moment one tool result is 30K tokens. Walk newest-to-oldest and accumulate until you hit your history budget: short chitchat keeps 50 turns, large tool outputs keep only 3.
This adapts automatically to message size: a chatty support bot retains weeks of context, while a code-search agent that just received a giant grep dump keeps only the last two turns.
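The percentage split and the token-based window can be sketched together. This is a minimal illustration, not the module's reference implementation: estimate_tokens is a crude 4-characters-per-token heuristic standing in for a real token counter, and the function names and the tiny 100-token budget are assumptions chosen so the example is easy to follow.

```python
def estimate_tokens(text):
    """Rough local estimate (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def split_budget(context_window):
    """Percentage split from the text: 20% system, 50% history, 20% message, 10% response."""
    shares = {"system": 20, "history": 50, "current_message": 20, "response": 10}
    return {name: context_window * pct // 100 for name, pct in shares.items()}


def window_by_tokens(messages, history_budget, count=estimate_tokens):
    """Walk newest to oldest, keeping messages until the token budget is used up."""
    kept = []
    used = 0
    for msg in reversed(messages):
        cost = count(msg["content"])
        if used + cost > history_budget:
            break
        kept.append(msg)
        used += cost
    kept.reverse()
    # The array sent to the API must start with a user message.
    if kept and kept[0]["role"] == "assistant":
        kept = kept[1:]
    return kept


print(split_budget(200_000)["history"])  # 100000

# Ten short messages of about 10 estimated tokens each.
chat = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": "a" * 40}
    for i in range(10)
]
print(len(window_by_tokens(chat, history_budget=100)))  # 10

# One large tool-style reply (about 80 estimated tokens) crowds out older turns.
with_dump = chat + [
    {"role": "user", "content": "b" * 40},
    {"role": "assistant", "content": "x" * 320},
]
print(len(window_by_tokens(with_dump, history_budget=100)))  # 2
```

The count parameter is injectable on purpose: in production you would pass a function backed by a real token counter instead of the character heuristic.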
Message Pruning Strategies
BEFORE: Imagine you are a film editor with 6 hours of raw footage that must become a 90-minute movie — you cannot simply cut from the end or the beginning, because key plot points are scattered throughout.
PAIN: If you blindly cut the first 4.5 hours, you lose the character introductions and the setup that makes the climax meaningful. If you keep everything, the audience (the model) loses focus during the filler scenes and misses the important moments buried in the middle.
MAPPING: Pruning a conversation works the same way. You score each message by importance: greetings and filler ("thanks!", "you're welcome!") are cut first. Key facts (account numbers, decisions, tool results) are preserved no matter how old they are. And sometimes you replace a block of scenes with a narrator's summary — "previously on this conversation..." — to compress without losing the plot.
What this actually looks like in real code: In practice, you attach metadata to each message — a simple dictionary with an importance field. Here is what a scored message object looks like, followed by a full example from a support conversation:
And here is a full conversation with importance scores assigned:
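A sketch of what such scored messages could look like. The "metadata" key, its "importance" and "tags" fields, the 0.0 to 1.0 scale, and the conversation itself are all hypothetical; the only thing taken from the text is the idea of a dictionary with an importance field attached to each message.

```python
# A single scored message. The "metadata" key and its fields are hypothetical.
scored_message = {
    "role": "user",
    "content": "My account number is 4417-2290.",
    "metadata": {"importance": 0.9, "tags": ["account_id"]},
}

# A short support conversation with illustrative importance scores (0.0 to 1.0).
conversation = [
    {"role": "user", "content": "Hi there!", "metadata": {"importance": 0.1}},
    {"role": "assistant", "content": "Hello! How can I help?", "metadata": {"importance": 0.1}},
    scored_message,
    {"role": "assistant", "content": "Thanks, I found account 4417-2290.", "metadata": {"importance": 0.8}},
    {"role": "user", "content": "Thanks!", "metadata": {"importance": 0.1}},
    {"role": "assistant", "content": "You're welcome!", "metadata": {"importance": 0.1}},
]


def to_api_messages(messages):
    """Strip local metadata so only role and content are sent to the API."""
    return [{"role": m["role"], "content": m["content"]} for m in messages]


for m in conversation:
    print(m["metadata"]["importance"], m["role"], m["content"])
```

The metadata lives only in your application. Calling to_api_messages() before each request keeps the payload to the role and content fields the earlier examples in this module send.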
- FIFO (first in, first out): drop oldest messages first. Simple but may lose critical early context.
- Importance scoring: tag messages with metadata (timestamps, token counts, importance scores, or tags) and drop low-importance messages first.
- Semantic deduplication: identify and remove redundant exchanges, such as the same question asked twice in different words.
- Role-based retention: always keep certain message types (tool results, user preferences, key decisions).
- Summarize-and-replace: condense a block of messages into a single summary before dropping originals.
Critical implementation detail: after removing any messages, your pruning function must verify that the remaining array still has valid message alternation. The Claude API requires messages to alternate between user and assistant roles, and the array must start with a user message. If your pruning logic removes a user message but leaves the assistant reply, the API will reject the request.
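One way to keep alternation intact is to prune whole user+assistant pairs and validate the result before returning it. The sketch below assumes the hypothetical "metadata.importance" score on each message and an input list that already alternates starting with a user message; the function names and the 0.5 threshold are illustrative.

```python
def is_valid_alternation(messages):
    """True if the list starts with a user message and roles strictly alternate."""
    if not messages:
        return True
    if messages[0]["role"] != "user":
        return False
    return all(a["role"] != b["role"] for a, b in zip(messages, messages[1:]))


def importance(msg):
    """Read the hypothetical importance score, defaulting to a neutral 0.5."""
    return msg.get("metadata", {}).get("importance", 0.5)


def prune_pairs(messages, threshold):
    """Drop user+assistant pairs in which both messages score below threshold."""
    kept = []
    for i in range(0, len(messages) - 1, 2):
        user_msg, assistant_msg = messages[i], messages[i + 1]
        if max(importance(user_msg), importance(assistant_msg)) >= threshold:
            kept.extend([user_msg, assistant_msg])
    if len(messages) % 2 == 1:
        kept.append(messages[-1])  # keep a trailing, not yet answered user message
    if not is_valid_alternation(kept):
        raise ValueError("pruning broke user/assistant alternation")
    return kept


history = [
    {"role": "user", "content": "Hi!", "metadata": {"importance": 0.1}},
    {"role": "assistant", "content": "Hello!", "metadata": {"importance": 0.1}},
    {"role": "user", "content": "My account number is 4417-2290.", "metadata": {"importance": 0.9}},
    {"role": "assistant", "content": "Found it.", "metadata": {"importance": 0.8}},
    {"role": "user", "content": "Ok, go ahead.", "metadata": {"importance": 0.2}},
    {"role": "assistant", "content": "Refund issued for order #889.", "metadata": {"importance": 0.9}},
    {"role": "user", "content": "What's my account number?"},
]

# Filtering message by message leaves two assistant messages in a row.
naive = [m for m in history if importance(m) >= 0.5]
print(is_valid_alternation(naive))   # False

pruned = prune_pairs(history, threshold=0.5)
print(len(pruned), is_valid_alternation(pruned))  # 5 True
```

Scoring a pair by its higher-scoring member is a deliberate choice here: it keeps the low-value "Ok, go ahead." message because dropping it would orphan the important reply that follows.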
"Just keep the last N messages — that's good enough." — A pure sliding window is the simplest pruning strategy, but it is also the most dangerous. If the user's account number, diagnosis code, or key decision was in message 1, a window of size 10 will silently drop it by turn 6. The user will say "what's my account number?" and Claude will have no idea.
"Pruning only matters for long conversations." — Even a 15-turn conversation with tool results can easily hit 50K+ tokens. Tool call results (JSON payloads, API responses, database rows) are often much larger than human messages. A single tool result can be 2,000–5,000 tokens.
"I can prune any message safely." — Removing a message that a later message references creates a confusing context. If Claude said "Based on your order #889..." and you prune the message containing order #889, Claude's reference becomes a non sequitur. Always prune in pairs (user + assistant) and check for downstream references.
If you are familiar with database query optimization, pruning is conceptually similar: you are reducing the amount of data processed while preserving the rows (messages) that matter most. But unlike a database index that speeds up retrieval without losing data, pruning permanently removes messages from the context. That is why importance scoring is critical — you need a reliable way to distinguish "this message contains the user's account number" from "this message says 'thanks!'" before deciding what to cut.
Building a Production Conversation Manager
BEFORE: Imagine a CEO who takes every meeting without a personal assistant — they walk into each room carrying a growing stack of every note from every past meeting, verbatim, unfiltered.
PAIN: By month three, the stack is so tall they spend more time flipping through old notes than actually engaging in the current meeting. Worse, critical decisions from week one are buried under pages of "sounds good" and "let's circle back," and the CEO sometimes misses them entirely.
MAPPING: A conversation manager is that personal assistant. It sits between your application and the Claude API, recording every exchange, scoring messages by importance, compressing old meetings into executive summaries, and ensuring the CEO (Claude) walks into every meeting with exactly the right briefing — recent details in full, older context summarized, and critical facts always pinned to the top.
What this actually looks like as a class interface: The conversation manager sits between your app and the Claude API, like a personal assistant intercepting every message. The "assistant's desk" covers six responsibilities:
- Message storage: append, retrieve, and edit messages in an ordered array
- Token counting: track token usage per message and cumulatively across the whole conversation
- Automatic pruning: triggered when token counts cross a configurable threshold
- Summarization: using Claude itself to compress older turns into a short paragraph
- Metadata tracking: timestamps, token counts, and importance flags attached to each message
- Serialization: save and load the full conversation state to JSON files, so conversations survive server restarts
Progressive summarization loses critical specifics: names, IDs, amounts, dates. For production systems, use immutable "case facts" blocks positioned at the START of context (high-recall position). These are never summarized.
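A minimal sketch of the pinned "case facts" idea: facts are stored outside the message list, rendered verbatim at the very start of the history, and never passed through the summarizer. The build_context name, the bracketed labels, and the sample case are assumptions for illustration.

```python
def build_context(case_facts, summary, recent_messages):
    """Assemble history: pinned facts first, then the summary, then recent turns."""
    facts = "\n".join(f"- {key}: {value}" for key, value in case_facts.items())
    preamble = f"[Case facts, verbatim, never summarized]\n{facts}"
    if summary:
        preamble += f"\n\n[Conversation summary]\n{summary}"
    return [
        {"role": "user", "content": preamble},
        {"role": "assistant", "content": "Understood. I have the case facts and context."},
        *recent_messages,
    ]


# Hypothetical support case.
case_facts = {
    "account_number": "4417-2290",
    "order_id": "#889",
    "refund_amount": "$40.00",
}
summary = "Customer reported a damaged item and agreed to a refund."
recent = [{"role": "user", "content": "When will the refund arrive?"}]

context = build_context(case_facts, summary, recent)
print(context[0]["content"])
```

Because the facts are rebuilt from the dictionary on every request, re-summarizing the conversation any number of times cannot paraphrase them away.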
Code Walkthrough
Step 1: Basic Conversation Manager
Let's start with the simplest possible version: a class that stores messages, sends them to Claude, and keeps a running count of token usage. This is the "full history" strategy in action — every message ever sent gets included in every API call. The key thing to watch is the messages array. It grows by two entries every turn (one user, one assistant), and the entire array is shipped to the API each time. That is what makes full history simple but expensive.
One important safety detail: look at the except block in send(). If the API call fails for any reason, the manager pops the user message it just added. Without this, a failed call would leave a "phantom" user message in your history with no assistant reply — breaking the required user/assistant alternation and causing the next call to fail too.
import anthropic
from dataclasses import dataclass, field
from typing import Optional

# pip install anthropic>=0.30.0


@dataclass
class ConversationManager:
    """Manages multi-turn conversations with Claude."""

    system_prompt: str = "You are a helpful assistant."
    model: str = "claude-sonnet-4-6"
    max_tokens: int = 4096
    messages: list = field(default_factory=list)

    def __post_init__(self):
        self.client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
        self.total_input_tokens = 0
        self.total_output_tokens = 0

    def add_user_message(self, content: str) -> None:
        """Add a user message to the conversation history."""
        self.messages.append({"role": "user", "content": content})

    def add_assistant_message(self, content: str) -> None:
        """Add an assistant message to the conversation history."""
        self.messages.append({"role": "assistant", "content": content})

    def send(self, user_message: str) -> str:
        """Send a message and get Claude's response."""
        self.add_user_message(user_message)
        try:
            response = self.client.messages.create(
                model=self.model,
                max_tokens=self.max_tokens,
                system=self.system_prompt,
                messages=self.messages,
            )
            # Track token usage from the API response
            self.total_input_tokens += response.usage.input_tokens
            self.total_output_tokens += response.usage.output_tokens
            assistant_text = response.content[0].text
            self.add_assistant_message(assistant_text)
            return assistant_text
        except anthropic.APIError as e:
            # Remove the user message we just added on failure
            self.messages.pop()
            raise RuntimeError(f"API call failed: {e}") from e

    def get_messages(self) -> list:
        """Return the current message history."""
        return list(self.messages)

    def get_token_usage(self) -> dict:
        """Return cumulative token usage."""
        return {
            "total_input_tokens": self.total_input_tokens,
            "total_output_tokens": self.total_output_tokens,
            "message_count": len(self.messages),
        }


# Usage
manager = ConversationManager(
    system_prompt="You are a coding tutor. Be concise."
)
reply1 = manager.send("What is a list comprehension in Python?")
print(reply1)
reply2 = manager.send("Show me an example with filtering.")
print(reply2)
print(manager.get_token_usage())
import Anthropic from "@anthropic-ai/sdk";
// npm install @anthropic-ai/sdk@^0.30.0

class ConversationManager {
  constructor({
    systemPrompt = "You are a helpful assistant.",
    model = "claude-sonnet-4-6",
    maxTokens = 4096,
  } = {}) {
    this.client = new Anthropic(); // reads ANTHROPIC_API_KEY
    this.systemPrompt = systemPrompt;
    this.model = model;
    this.maxTokens = maxTokens;
    this.messages = [];
    this.totalInputTokens = 0;
    this.totalOutputTokens = 0;
  }

  addUserMessage(content) {
    this.messages.push({ role: "user", content });
  }

  addAssistantMessage(content) {
    this.messages.push({ role: "assistant", content });
  }

  async send(userMessage) {
    this.addUserMessage(userMessage);
    try {
      const response = await this.client.messages.create({
        model: this.model,
        max_tokens: this.maxTokens,
        system: this.systemPrompt,
        messages: this.messages,
      });
      // Track token usage from the API response
      this.totalInputTokens += response.usage.input_tokens;
      this.totalOutputTokens += response.usage.output_tokens;
      const assistantText = response.content[0].text;
      this.addAssistantMessage(assistantText);
      return assistantText;
    } catch (error) {
      // Remove the user message we just added on failure
      this.messages.pop();
      throw new Error(`API call failed: ${error.message}`);
    }
  }

  getMessages() {
    return [...this.messages];
  }

  getTokenUsage() {
    return {
      totalInputTokens: this.totalInputTokens,
      totalOutputTokens: this.totalOutputTokens,
      messageCount: this.messages.length,
    };
  }
}

// Usage
const manager = new ConversationManager({
  systemPrompt: "You are a coding tutor. Be concise.",
});
const reply1 = await manager.send(
  "What is a list comprehension in Python?"
);
console.log(reply1);
const reply2 = await manager.send(
  "Show me an example with filtering."
);
console.log(reply2);
console.log(manager.getTokenUsage());
This is a ConversationManager that implements the "full history" strategy from earlier in this module. Let's trace what happens when you call send(). First, the user message is appended to the messages array. Then the entire array, every message from the beginning of the conversation, is sent to Claude. When the response arrives, the manager reads response.usage to track cumulative input and output tokens (you will need these numbers later for budgeting). The interesting safety detail: if the API call fails for any reason, the manager pops the user message it just added, so your history stays clean. This class is fully functional, but here is the catch: it has no protection against the context window filling up. That is exactly what Step 2 solves.
Step 2: Sliding Window Mode
Now here is the dilemma with Step 1: as conversations grow, you are sending thousands of tokens of old greetings and resolved questions on every call. The sliding window fixes this with a simple override. Instead of sending all messages, get_messages() returns only the last N. The interesting subtlety is the check on the first message's role: if our slice happens to start on an assistant message, we drop it to maintain the required user-first alternation. Without that check, the API would reject the request.
class SlidingWindowManager(ConversationManager):
    """Extends ConversationManager with a sliding window."""

    def __init__(self, window_size: int = 10, **kwargs):
        super().__init__(**kwargs)
        self.window_size = window_size

    def get_messages(self) -> list:
        """Return only the most recent N messages."""
        if len(self.messages) <= self.window_size:
            return list(self.messages)
        # Keep the last window_size messages
        # Ensure we start with a user message (valid alternation)
        windowed = self.messages[-self.window_size:]
        if windowed and windowed[0]["role"] == "assistant":
            windowed = windowed[1:]  # drop to maintain user-first
        return windowed

    def send(self, user_message: str) -> str:
        """Send using only the windowed history."""
        self.add_user_message(user_message)
        try:
            response = self.client.messages.create(
                model=self.model,
                max_tokens=self.max_tokens,
                system=self.system_prompt,
                messages=self.get_messages(),  # windowed!
            )
            self.total_input_tokens += response.usage.input_tokens
            self.total_output_tokens += response.usage.output_tokens
            assistant_text = response.content[0].text
            self.add_assistant_message(assistant_text)
            return assistant_text
        except anthropic.APIError as e:
            self.messages.pop()
            raise RuntimeError(f"API call failed: {e}") from e


# Usage: only last 6 messages sent per call
manager = SlidingWindowManager(
    window_size=6,
    system_prompt="You are a concise coding tutor.",
)
for q in [
    "What is Python?",
    "What are variables?",
    "Explain loops.",
    "What are functions?",
    "Explain classes.",
    "What is inheritance?",
]:
    print(f"Q: {q}")
    print(f"A: {manager.send(q)}\n")

# All 12 messages stored, but only last 6 sent to API
print(f"Stored: {len(manager.messages)} messages")
print(f"Sent: {len(manager.get_messages())} messages")
class SlidingWindowManager extends ConversationManager {
  constructor({ windowSize = 10, ...opts } = {}) {
    super(opts);
    this.windowSize = windowSize;
  }

  getMessages() {
    if (this.messages.length <= this.windowSize) {
      return [...this.messages];
    }
    // Keep the last windowSize messages
    let windowed = this.messages.slice(-this.windowSize);
    // Ensure we start with a user message
    if (windowed[0]?.role === "assistant") {
      windowed = windowed.slice(1);
    }
    return windowed;
  }

  async send(userMessage) {
    this.addUserMessage(userMessage);
    try {
      const response = await this.client.messages.create({
        model: this.model,
        max_tokens: this.maxTokens,
        system: this.systemPrompt,
        messages: this.getMessages(), // windowed!
      });
      this.totalInputTokens += response.usage.input_tokens;
      this.totalOutputTokens += response.usage.output_tokens;
      const assistantText = response.content[0].text;
      this.addAssistantMessage(assistantText);
      return assistantText;
    } catch (error) {
      this.messages.pop();
      throw new Error(`API call failed: ${error.message}`);
    }
  }
}

// Usage: only last 6 messages sent per call
const manager = new SlidingWindowManager({
  windowSize: 6,
  systemPrompt: "You are a concise coding tutor.",
});
for (const q of [
  "What is Python?",
  "What are variables?",
  "Explain loops.",
  "What are functions?",
  "Explain classes.",
  "What is inheritance?",
]) {
  console.log(`Q: ${q}`);
  console.log(`A: ${await manager.send(q)}\n`);
}
console.log(`Stored: ${manager.messages.length} messages`);
console.log(`Sent: ${manager.getMessages().length} messages`);
The SlidingWindowManager (1) stores every message in full, (2) sends only the most recent window_size messages via get_messages(), and (3) ensures valid alternation by dropping a leading assistant message if the slice starts on one. Trade-off: this approach is cheap and fast but permanently loses early context from Claude's perspective. If the user's account number was in message 1, it vanishes after a few turns.
Step 3: Auto-Summarization & Persistence
This is the production-grade version, and it solves the sliding window's biggest weakness: losing early context forever. The key idea is elegant — instead of just dropping old messages, we ask Claude to read them and write a summary first. Then we replace the originals with that summary. The result is a much shorter history that still carries the essential context from earlier in the conversation.
The other major addition is save() and load(). Real agents need conversations to survive server restarts, deployments, and load balancer switches. These methods serialize the full state — messages, summary, token counts — to a JSON file. In production, you would swap the file system for a database or Redis, but the pattern is identical.
import json
import time
from pathlib import Path
from typing import Optional

import anthropic  # needed for anthropic.APIError below


class SmartConversationManager(ConversationManager):
    """Full-featured manager with summarization and persistence."""

    def __init__(
        self,
        token_threshold: int = 100_000,
        recent_turns_to_keep: int = 4,
        **kwargs
    ):
        super().__init__(**kwargs)
        self.token_threshold = token_threshold
        self.recent_turns_to_keep = recent_turns_to_keep
        self.summary: Optional[str] = None
        self.summary_history: list = []
        self.last_input_tokens = 0

    def _estimate_tokens(self) -> int:
        """Estimate current token usage from last API response."""
        return self.last_input_tokens

    def _should_summarize(self) -> bool:
        """Check if we've exceeded the token threshold."""
        return self.last_input_tokens > self.token_threshold

    def _summarize_old_messages(self) -> None:
        """Use Claude to summarize older messages."""
        # Keep the most recent turns intact
        keep_count = self.recent_turns_to_keep * 2  # user+assistant pairs
        if len(self.messages) <= keep_count:
            return
        old_messages = self.messages[:-keep_count]
        recent_messages = self.messages[-keep_count:]
        # Build a summary of the old messages
        summary_prompt = (
            "Summarize this conversation concisely. "
            "Preserve: key decisions, user preferences, "
            "important facts, and tool results. "
            "Skip: greetings, filler, repeated information.\n\n"
        )
        for msg in old_messages:
            summary_prompt += f"{msg['role']}: {msg['content']}\n"
        try:
            response = self.client.messages.create(
                model=self.model,
                max_tokens=1024,
                system="You are a conversation summarizer. Be concise.",
                messages=[{"role": "user", "content": summary_prompt}],
            )
            new_summary = response.content[0].text
            # Combine with existing summary if present
            if self.summary:
                new_summary = f"Previous context: {self.summary}\n\nRecent: {new_summary}"
            self.summary = new_summary
            self.summary_history.append({
                "timestamp": time.time(),
                "messages_summarized": len(old_messages),
                "summary_tokens": response.usage.output_tokens,
            })
            # Replace old messages with a summary message
            self.messages = [
                {"role": "user", "content": f"[Conversation summary: {self.summary}]"},
                {"role": "assistant", "content": "Understood. I have the conversation context."},
                *recent_messages,
            ]
        except anthropic.APIError:
            # If summarization fails, fall back to sliding window
            self.messages = recent_messages

    def send(self, user_message: str) -> str:
        """Send with automatic summarization when needed."""
        self.add_user_message(user_message)
        try:
            response = self.client.messages.create(
                model=self.model,
                max_tokens=self.max_tokens,
                system=self.system_prompt,
                messages=self.messages,
            )
            self.last_input_tokens = response.usage.input_tokens
            self.total_input_tokens += response.usage.input_tokens
            self.total_output_tokens += response.usage.output_tokens
            assistant_text = response.content[0].text
            self.add_assistant_message(assistant_text)
            # Check if we need to summarize
            if self._should_summarize():
                self._summarize_old_messages()
            return assistant_text
        except anthropic.APIError as e:
            self.messages.pop()  # remove the user message that was never answered
            raise RuntimeError(f"API call failed: {e}") from e

    def save(self, filepath: str) -> None:
        """Persist conversation state to a JSON file."""
        state = {
            "system_prompt": self.system_prompt,
            "model": self.model,
            "messages": self.messages,
            "summary": self.summary,
            "summary_history": self.summary_history,
            "total_input_tokens": self.total_input_tokens,
            "total_output_tokens": self.total_output_tokens,
            "saved_at": time.time(),
        }
        Path(filepath).write_text(json.dumps(state, indent=2))

    @classmethod
    def load(cls, filepath: str) -> "SmartConversationManager":
        """Load conversation state from a JSON file."""
        data = json.loads(Path(filepath).read_text())
        mgr = cls(
            system_prompt=data["system_prompt"],
            model=data["model"],
        )
        mgr.messages = data["messages"]
        mgr.summary = data.get("summary")
        mgr.summary_history = data.get("summary_history", [])
        mgr.total_input_tokens = data.get("total_input_tokens", 0)
        mgr.total_output_tokens = data.get("total_output_tokens", 0)
        return mgr


# Usage
manager = SmartConversationManager(
    token_threshold=50_000,
    recent_turns_to_keep=4,
    system_prompt="You are a helpful coding assistant.",
)
reply = manager.send("Help me build a REST API with FastAPI.")
print(reply)

# Save and restore across sessions
manager.save("conversation_state.json")
restored = SmartConversationManager.load("conversation_state.json")
reply2 = restored.send("Where were we?")
print(reply2)
import Anthropic from "@anthropic-ai/sdk";
import { readFileSync, writeFileSync } from "node:fs";

class SmartConversationManager extends ConversationManager {
  constructor({
    tokenThreshold = 100_000,
    recentTurnsToKeep = 4,
    ...opts
  } = {}) {
    super(opts);
    this.tokenThreshold = tokenThreshold;
    this.recentTurnsToKeep = recentTurnsToKeep;
    this.summary = null;
    this.summaryHistory = [];
    this.lastInputTokens = 0;
  }

  _shouldSummarize() {
    return this.lastInputTokens > this.tokenThreshold;
  }

  async _summarizeOldMessages() {
    const keepCount = this.recentTurnsToKeep * 2;
    if (this.messages.length <= keepCount) return;
    const oldMessages = this.messages.slice(0, -keepCount);
    const recentMessages = this.messages.slice(-keepCount);
    let summaryPrompt =
      "Summarize this conversation concisely. " +
      "Preserve: key decisions, user preferences, " +
      "important facts, and tool results. " +
      "Skip: greetings, filler, repeated information.\n\n";
    for (const msg of oldMessages) {
      summaryPrompt += `${msg.role}: ${msg.content}\n`;
    }
    try {
      const response = await this.client.messages.create({
        model: this.model,
        max_tokens: 1024,
        system: "You are a conversation summarizer. Be concise.",
        messages: [{ role: "user", content: summaryPrompt }],
      });
      let newSummary = response.content[0].text;
      if (this.summary) {
        newSummary = `Previous context: ${this.summary}\n\nRecent: ${newSummary}`;
      }
      this.summary = newSummary;
      this.summaryHistory.push({
        timestamp: Date.now(),
        messagesSummarized: oldMessages.length,
        summaryTokens: response.usage.output_tokens,
      });
      this.messages = [
        { role: "user", content: `[Conversation summary: ${this.summary}]` },
        { role: "assistant", content: "Understood. I have the conversation context." },
        ...recentMessages,
      ];
    } catch {
      // Fallback to sliding window on failure
      this.messages = recentMessages;
    }
  }

  async send(userMessage) {
    this.addUserMessage(userMessage);
    try {
      const response = await this.client.messages.create({
        model: this.model,
        max_tokens: this.maxTokens,
        system: this.systemPrompt,
        messages: this.messages,
      });
      this.lastInputTokens = response.usage.input_tokens;
      this.totalInputTokens += response.usage.input_tokens;
      this.totalOutputTokens += response.usage.output_tokens;
      const assistantText = response.content[0].text;
      this.addAssistantMessage(assistantText);
      if (this._shouldSummarize()) {
        await this._summarizeOldMessages();
      }
      return assistantText;
    } catch (error) {
      this.messages.pop();
      throw new Error(`API call failed: ${error.message}`);
    }
  }

  save(filepath) {
    const state = {
      systemPrompt: this.systemPrompt,
      model: this.model,
      messages: this.messages,
      summary: this.summary,
      summaryHistory: this.summaryHistory,
      totalInputTokens: this.totalInputTokens,
      totalOutputTokens: this.totalOutputTokens,
      savedAt: Date.now(),
    };
    writeFileSync(filepath, JSON.stringify(state, null, 2));
  }

  static load(filepath) {
    const data = JSON.parse(readFileSync(filepath, "utf-8"));
    const mgr = new SmartConversationManager({
      systemPrompt: data.systemPrompt,
      model: data.model,
    });
    mgr.messages = data.messages;
    mgr.summary = data.summary ?? null;
    mgr.summaryHistory = data.summaryHistory ?? [];
    mgr.totalInputTokens = data.totalInputTokens ?? 0;
    mgr.totalOutputTokens = data.totalOutputTokens ?? 0;
    return mgr;
  }
}

// Usage
const manager = new SmartConversationManager({
  tokenThreshold: 50_000,
  recentTurnsToKeep: 4,
  systemPrompt: "You are a helpful coding assistant.",
});
const reply = await manager.send("Help me build a REST API.");
console.log(reply);

// Save and restore across sessions
manager.save("conversation_state.json");
const restored = SmartConversationManager.load("conversation_state.json");
const reply2 = await restored.send("Where were we?");
console.log(reply2);
SmartConversationManager: let's walk through the interesting parts. The manager records last_input_tokens after every API response; this is how it knows when the context is getting dangerously large. When that count crosses your configurable threshold, it makes a separate Claude call that reads the old messages and compresses them into a concise summary. The interesting design choice is what gets preserved: the summarization prompt explicitly asks Claude to keep key decisions, user preferences, important facts, and tool results while dropping greetings and filler. After summarization, the old messages are replaced with a summary/acknowledgment pair, and the most recent turns are kept verbatim. What if the summarization call itself fails? Rather than crashing, the manager falls back to a simple sliding window: not perfect, but the conversation keeps going. Finally, save() and load() serialize the conversation state (messages, summary, token counts) to JSON so conversations survive server restarts. Note that token_threshold and recent_turns_to_keep are constructor settings rather than saved state, so load() returns a manager that uses their default values.
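The summarize-and-replace step, including its fallback, can be exercised without any API call by injecting the summarizer as a function. The sketch below is a simplified stand-in for the method above, not the class itself: compact_history, fake_summarizer, and broken_summarizer are hypothetical names used only for illustration.

```python
from typing import Callable


def compact_history(
    messages: list[dict],
    summarize: Callable[[str], str],
    recent_turns: int = 2,
) -> list[dict]:
    """Replace older messages with a summary pair, keeping recent turns verbatim."""
    keep_count = recent_turns * 2  # one turn = user + assistant
    if len(messages) <= keep_count:
        return list(messages)
    old, recent = messages[:-keep_count], messages[-keep_count:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    try:
        summary = summarize(transcript)
    except Exception:
        # Summarizer failed: degrade to a sliding window instead of crashing.
        return list(recent)
    return [
        {"role": "user", "content": f"[Conversation summary: {summary}]"},
        {"role": "assistant", "content": "Understood. I have the conversation context."},
        *recent,
    ]


def fake_summarizer(transcript: str) -> str:
    """Stand-in for the Claude call; it only counts transcript lines."""
    return f"{len(transcript.splitlines())} earlier messages condensed"


def broken_summarizer(transcript: str) -> str:
    """Stand-in for a failed summarization request."""
    raise RuntimeError("simulated outage")


# Five complete turns: question 1 / answer 1 ... question 5 / answer 5.
history = []
for i in range(1, 6):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

compacted = compact_history(history, fake_summarizer, recent_turns=2)
fallback = compact_history(history, broken_summarizer, recent_turns=2)
print(len(history), len(compacted), len(fallback))  # 10 6 4
print(compacted[0]["content"])  # [Conversation summary: 6 earlier messages condensed]
```

Passing the summarizer in as a parameter is a design choice made here for testability; the class above calls Claude directly inside the method.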
Hands-On Exercise
What You'll Build
A conversation manager that progresses through three strategies: full history, sliding window, and auto-summarization with JSON persistence. You'll see the token usage difference between each approach in real time.
Time Estimate: 30–45 minutes | Files You'll Create: conversation_manager.py
Environment Setup
mkdir conversation-lab && cd conversation-lab
python -m venv venv && source venv/bin/activate # Windows: venv\Scripts\activate
pip install "anthropic>=0.30.0"
export ANTHROPIC_API_KEY=your-key-here # Windows: set ANTHROPIC_API_KEY=your-key-here
Step 1: Basic ConversationManager with Token Tracking
What: Create a ConversationManager class that stores messages, sends them to Claude, and tracks token usage.
Why: This implements the "full history" strategy — every message is sent with every request. You will see input tokens grow with each turn, which demonstrates the cost problem that Steps 2 and 3 solve.
Create a new file called conversation_manager.py and add the following:
import anthropic
import json
import time
from pathlib import Path
from dataclasses import dataclass, field
from typing import Optional

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY


# ── Step 1: Basic ConversationManager ────────────────────────
@dataclass
class ConversationManager:
    """Full-history conversation manager with token tracking."""
    system_prompt: str = "You are a helpful assistant."
    model: str = "claude-sonnet-4-6"
    max_tokens: int = 4096
    messages: list = field(default_factory=list)

    def __post_init__(self):
        self.total_input_tokens = 0
        self.total_output_tokens = 0

    def add_user_message(self, content: str) -> None:
        self.messages.append({"role": "user", "content": content})

    def add_assistant_message(self, content: str) -> None:
        self.messages.append({"role": "assistant", "content": content})

    def get_messages(self) -> list:
        return list(self.messages)

    def send(self, user_message: str) -> str:
        self.add_user_message(user_message)
        try:
            response = client.messages.create(
                model=self.model,
                max_tokens=self.max_tokens,
                system=self.system_prompt,
                messages=self.get_messages(),
            )
            self.total_input_tokens += response.usage.input_tokens
            self.total_output_tokens += response.usage.output_tokens
            assistant_text = response.content[0].text
            self.add_assistant_message(assistant_text)
            return assistant_text
        except anthropic.APIError as e:
            self.messages.pop()  # remove failed user message
            raise RuntimeError(f"API call failed: {e}") from e

    def get_token_usage(self) -> dict:
        return {
            "total_input": self.total_input_tokens,
            "total_output": self.total_output_tokens,
            "messages": len(self.messages),
        }


# ── Test Step 1 ──────────────────────────────────────────────
if __name__ == "__main__":
    print("═" * 50)
    print("TEST 1: Basic ConversationManager (Full History)")
    print("═" * 50)
    mgr = ConversationManager(system_prompt="You are a concise coding tutor. Answer in 1-2 sentences.")
    questions = [
        "What is a list in Python?",
        "How do I add an item to it?",
        "What about removing items?",
    ]
    for i, q in enumerate(questions, 1):
        print(f"\n Turn {i}: {q}")
        reply = mgr.send(q)
        print(f" Claude: {reply[:100]}...")
        usage = mgr.get_token_usage()
        print(f" [Messages: {usage['messages']} | Input tokens so far: {usage['total_input']}]")
    print(f"\n ✓ Final: {usage['messages']} messages, {usage['total_input']} total input tokens")
    print(" Note: Input tokens grow each turn because ALL messages are resent!")
Run it:
python conversation_manager.py
Watch the input token count grow with each turn. The script prints a running total, so expect something like ~38 tokens after turn 1, ~120 after turn 2 (turn 1 is resent along with the new message), and ~248 after turn 3 (everything is resent again); exact figures depend on Claude's replies. This is the stateless reality in action, and exactly why we need the strategies in the next steps.
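The growth pattern can be reproduced offline with simple arithmetic. The sketch below assumes every message costs the same fixed number of tokens, which is a simplification (real counts come from response.usage); the function names and the values 20 and 40 are illustrative assumptions.

```python
def full_history_input_tokens(turn: int, system_tokens: int, tokens_per_message: int) -> int:
    """Input tokens for request number `turn` (1-based) when all history is resent."""
    # Before turn t the history holds 2*(t-1) messages; the new user message adds one.
    messages_sent = 2 * (turn - 1) + 1
    return system_tokens + messages_sent * tokens_per_message


def cumulative_input_tokens(turns: int, system_tokens: int, tokens_per_message: int) -> int:
    """Total input tokens billed across the first `turns` requests."""
    return sum(
        full_history_input_tokens(t, system_tokens, tokens_per_message)
        for t in range(1, turns + 1)
    )


SYSTEM_TOKENS = 20        # assumed size of the system prompt
TOKENS_PER_MESSAGE = 40   # assumed average size of one message

for turns in (1, 3, 10, 50):
    per_request = full_history_input_tokens(turns, SYSTEM_TOKENS, TOKENS_PER_MESSAGE)
    total = cumulative_input_tokens(turns, SYSTEM_TOKENS, TOKENS_PER_MESSAGE)
    print(f"turn {turns:>2}: request={per_request:>5}  running total={total:>7}")
```

Under these assumptions the size of each request grows linearly with the turn number, while the running total grows roughly with its square: 4,200 input tokens after 10 turns and 101,000 after 50.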
- AuthenticationError → Check your ANTHROPIC_API_KEY is set. Run echo $ANTHROPIC_API_KEY to verify.
- ModuleNotFoundError: No module named 'anthropic' → Run pip install anthropic
- Responses are very long → Make sure your system prompt says "Answer in 1-2 sentences" to keep responses short for testing.
Step 2: Add Sliding Window & Smart Summarization
What: Add two more manager classes: a SlidingWindowManager that only sends the last N messages, and a SmartConversationManager that auto-summarizes old turns and persists state to JSON.
Why: This step demonstrates the token savings from each strategy side-by-side. The final comparison table shows you exactly how many tokens each approach uses over the same 6-turn conversation.
This step uses the ConversationManager from Step 1. Add the following to the bottom of conversation_manager.py, replacing the if __name__ == "__main__" block:
# ── Step 2a: Sliding Window Manager ──────────────────────────
class SlidingWindowManager(ConversationManager):
    """Sends only the last N messages to the API."""

    def __init__(self, window_size: int = 6, **kwargs):
        super().__init__(**kwargs)
        self.window_size = window_size

    def get_messages(self) -> list:
        if len(self.messages) <= self.window_size:
            return list(self.messages)
        windowed = self.messages[-self.window_size:]
        # Ensure we start with a user message (valid alternation)
        if windowed and windowed[0]["role"] == "assistant":
            windowed = windowed[1:]
        return windowed


# ── Step 2b: Smart Manager with Summarization + Persistence ──
class SmartConversationManager(ConversationManager):
    """Auto-summarizes old turns and saves/loads state."""

    def __init__(self, token_threshold: int = 50_000, recent_turns: int = 4, **kwargs):
        super().__init__(**kwargs)
        self.token_threshold = token_threshold
        self.recent_turns = recent_turns
        self.summary: Optional[str] = None
        self.last_input_tokens = 0

    def send(self, user_message: str) -> str:
        self.add_user_message(user_message)
        try:
            response = client.messages.create(
                model=self.model,
                max_tokens=self.max_tokens,
                system=self.system_prompt,
                messages=self.get_messages(),
            )
            self.last_input_tokens = response.usage.input_tokens
            self.total_input_tokens += response.usage.input_tokens
            self.total_output_tokens += response.usage.output_tokens
            assistant_text = response.content[0].text
            self.add_assistant_message(assistant_text)
            # Auto-summarize if threshold exceeded
            if self.last_input_tokens > self.token_threshold:
                self._summarize()
            return assistant_text
        except anthropic.APIError as e:
            self.messages.pop()
            raise RuntimeError(f"API call failed: {e}") from e

    def _summarize(self) -> None:
        keep_count = self.recent_turns * 2  # user+assistant pairs
        if len(self.messages) <= keep_count:
            return
        old_msgs = self.messages[:-keep_count]
        recent_msgs = self.messages[-keep_count:]
        prompt = "Summarize this conversation concisely. Preserve: key decisions, facts, tool results. Skip: greetings, filler.\n\n"
        for m in old_msgs:
            prompt += f"{m['role']}: {m['content']}\n"
        try:
            resp = client.messages.create(
                model=self.model, max_tokens=1024,
                system="You are a conversation summarizer.",
                messages=[{"role": "user", "content": prompt}],
            )
            summary_text = resp.content[0].text
            if self.summary:
                summary_text = f"Previous: {self.summary}\n\nRecent: {summary_text}"
            self.summary = summary_text
            self.messages = [
                {"role": "user", "content": f"[Context summary: {self.summary}]"},
                {"role": "assistant", "content": "Understood. I have the context."},
                *recent_msgs,
            ]
            print(f" ⚡ Summarized {len(old_msgs)} old messages → {len(self.messages)} total")
        except Exception:
            self.messages = recent_msgs  # fallback to sliding window

    def save(self, filepath: str) -> None:
        state = {
            "system_prompt": self.system_prompt,
            "model": self.model,
            "messages": self.messages,
            "summary": self.summary,
            "total_input_tokens": self.total_input_tokens,
            "total_output_tokens": self.total_output_tokens,
        }
        Path(filepath).write_text(json.dumps(state, indent=2))
        print(f" 💾 Saved to {filepath}")

    @classmethod
    def load(cls, filepath: str) -> "SmartConversationManager":
        data = json.loads(Path(filepath).read_text())
        mgr = cls(system_prompt=data["system_prompt"], model=data["model"])
        mgr.messages = data["messages"]
        mgr.summary = data.get("summary")
        mgr.total_input_tokens = data.get("total_input_tokens", 0)
        mgr.total_output_tokens = data.get("total_output_tokens", 0)
        return mgr


# ── Test All Three Strategies ────────────────────────────────
if __name__ == "__main__":
    questions = [
        "What is a Python list?",
        "How do I sort a list?",
        "What about list comprehensions?",
        "How do dictionaries differ from lists?",
        "What are sets?",
        "When should I use tuples?",
    ]

    # --- Test 1: Full History ---
    print("═" * 55)
    print("TEST 1: Full History (all messages sent every time)")
    print("═" * 55)
    mgr1 = ConversationManager(system_prompt="Answer in 1 sentence.")
    for i, q in enumerate(questions, 1):
        mgr1.send(q)
        u = mgr1.get_token_usage()
        print(f" Turn {i}: stored={u['messages']} sent={u['messages']} input_tokens={u['total_input']}")

    # --- Test 2: Sliding Window ---
    print(f"\n{'═' * 55}")
    print("TEST 2: Sliding Window (last 4 messages only)")
    print("═" * 55)
    mgr2 = SlidingWindowManager(window_size=4, system_prompt="Answer in 1 sentence.")
    for i, q in enumerate(questions, 1):
        mgr2.send(q)
        u = mgr2.get_token_usage()
        sent = len(mgr2.get_messages())
        print(f" Turn {i}: stored={u['messages']} sent={sent} input_tokens={u['total_input']}")

    # --- Test 3: Smart Manager + Persistence ---
    print(f"\n{'═' * 55}")
    print("TEST 3: Smart Manager (auto-summarization + save/load)")
    print("═" * 55)
    mgr3 = SmartConversationManager(
        token_threshold=500,  # low threshold to trigger summarization quickly
        recent_turns=2,
        system_prompt="Answer in 1 sentence.",
    )
    for i, q in enumerate(questions, 1):
        mgr3.send(q)
        u = mgr3.get_token_usage()
        print(f" Turn {i}: messages={u['messages']} input_tokens={u['total_input']}")

    # Save and restore
    mgr3.save("test_session.json")
    restored = SmartConversationManager.load("test_session.json")
    reply = restored.send("What did we discuss?")
    print(f"\n Restored reply: {reply[:120]}...")

    # --- Comparison ---
    print(f"\n{'═' * 55}")
    print("COMPARISON (total input tokens after 6 turns):")
    print(f" Full History: {mgr1.total_input_tokens:,} tokens")
    print(f" Sliding Window: {mgr2.total_input_tokens:,} tokens")
    print(f" Smart Manager: {mgr3.total_input_tokens:,} tokens")
    print("═" * 55)
Run the full test suite:
python conversation_manager.py
Look for these key behaviors:
- Test 1: Input tokens should grow significantly each turn (full history resends everything)
- Test 2: The sent column should cap at 4 after turn 2, while stored keeps growing
- Test 3: You should see ⚡ Summarized messages when the threshold is hit, and the restored manager should know what was discussed
- Comparison: Sliding window uses the fewest tokens in this short test. Smart manager appears higher here because the lab uses a deliberately low token_threshold=500 on a short conversation; in real long sessions (50+ turns) it ends up well below full history while preserving far more context
- Summarization never triggers → The token_threshold=500 is set low for testing. If Claude's responses are very short, increase the number of questions or lower the threshold further.
- FileNotFoundError on load → Make sure save() ran successfully first. Check that test_session.json exists in the current directory.
- Restored manager doesn't remember context → This can happen when the summarization call failed and the manager fell back to keeping only the most recent messages. Check the output for the ⚡ Summarized line and rerun.
- APIError: 529 (overloaded) → The full run makes at least 19 API calls. If the API is overloaded or you are rate-limited, add time.sleep(1) between turns or run tests one at a time.
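The time.sleep(1) suggestion generalizes to a small retry helper with exponential backoff. The sketch below is deliberately library-agnostic: call_with_retry, TransientError, and the delay values are hypothetical names and numbers chosen for illustration, and in the lab you would catch the SDK's own overload or rate-limit exceptions instead.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for an overloaded or rate-limited API response."""


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, waiting base_delay * 2**attempt seconds after each transient failure."""
    for attempt in range(max_attempts - 1):
        try:
            return fn()
        except TransientError:
            sleep(base_delay * (2 ** attempt))
    return fn()  # final attempt: let any error propagate to the caller


# Demo with a fake call that fails twice, then succeeds.
# Delays are recorded in a list instead of actually sleeping.
attempts = {"count": 0}
delays: list[float] = []


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("overloaded")
    return "ok"


result = call_with_retry(flaky_call, sleep=delays.append)
print(result, attempts["count"], delays)  # ok 3 [1.0, 2.0]
```

In the lab, the wrapped call would be something like lambda: mgr.send(q), with TransientError replaced by the exception types you actually observe.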
Verify Everything Works
Run the complete file end-to-end. All 3 tests should complete, the comparison table should show Full History using the most tokens, and the save/load cycle should successfully restore a conversation:
python conversation_manager.py
If you see the COMPARISON table and the restored reply, you've successfully built a production-pattern conversation manager with all three strategies.
You've built three conversation management strategies and measured their real token usage side-by-side. The SmartConversationManager follows a pattern common in production agents, and you can tune token_threshold and recent_turns to balance cost vs. context retention for your specific use case.
- Importance-based pruning: Add an importance field to each message and drop lowest-scored messages first when pruning
- Pinned facts: Add a pinned_facts list that is always prepended to the context and never summarized
- Visual dashboard: Print a bar chart showing token usage per turn for each strategy
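As a starting point for the first two ideas, here is a minimal sketch. It assumes each stored message carries a numeric importance field (hypothetical; higher means keep longer) and approximates token cost as len(content) // 4, a rough rule of thumb rather than a real tokenizer. The names estimate_tokens, prune_by_importance, and build_context are illustrative, and messages are dropped in whole user+assistant pairs so the remaining list still alternates roles.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token (assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def prune_by_importance(messages: list[dict], budget: int) -> list[dict]:
    """Drop the least important user+assistant pairs until the estimate fits `budget`.

    Pairs are removed whole so role alternation stays valid, and the most
    recent pair (or trailing user message) is never removed.
    """
    pairs = [messages[i:i + 2] for i in range(0, len(messages), 2)]

    def cost(kept: list[list[dict]]) -> int:
        return sum(estimate_tokens(m["content"]) for pair in kept for m in pair)

    # Candidate indexes, least important first; ties drop the oldest pair first.
    candidates = sorted(
        range(len(pairs) - 1),
        key=lambda i: (max(m.get("importance", 0) for m in pairs[i]), i),
    )
    dropped: set[int] = set()
    for i in candidates:
        if cost([p for j, p in enumerate(pairs) if j not in dropped]) <= budget:
            break
        dropped.add(i)
    return [m for j, p in enumerate(pairs) if j not in dropped for m in p]


def build_context(pinned_facts: list[str], messages: list[dict]) -> list[dict]:
    """Prepend pinned facts and strip bookkeeping fields before sending."""
    clean = [{"role": m["role"], "content": m["content"]} for m in messages]
    if not pinned_facts:
        return clean
    facts = "[Pinned facts: " + "; ".join(pinned_facts) + "]"
    return [
        {"role": "user", "content": facts},
        {"role": "assistant", "content": "Understood."},
        *clean,
    ]


# Hypothetical history with hand-assigned importance scores.
history = [
    {"role": "user", "content": "Hi there!", "importance": 0},
    {"role": "assistant", "content": "Hello! How can I help?", "importance": 0},
    {"role": "user", "content": "My account number is 12345.", "importance": 5},
    {"role": "assistant", "content": "Got it, account 12345.", "importance": 5},
    {"role": "user", "content": "Thanks a lot.", "importance": 1},
    {"role": "assistant", "content": "You're welcome.", "importance": 1},
    {"role": "user", "content": "What is my balance?", "importance": 3},
]

pruned = prune_by_importance(history, budget=18)
context = build_context(["Customer tier: Pro"], pruned)
for m in context:
    print(f"{m['role']}: {m['content']}")
```

With this budget the greeting and the thank-you exchange are dropped, while the account-number exchange and the latest question survive. How importance scores get assigned is left open; that is the part of the exercise worth experimenting with.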
Knowledge Check
1. What happens if the client sends an API request to Claude without including previous messages?
2. Given: system prompt = 1,500 tokens, conversation history = 40,000 tokens, current message = 800 tokens, max_tokens = 4,096. With a 200K context window, how many tokens are available for additional history before hitting the limit?
3. A customer support bot needs to remember the user's account number from message 1, but conversations can go 50+ turns. Which strategy is best?
4. When summarizing a 10-turn technical conversation about debugging a database query, which information is MOST likely to be lost?
5. Which of these message arrays would cause an API error when sent to Claude?
A. [{role:"user",...}, {role:"assistant",...}, {role:"user",...}]
B. [{role:"assistant",...}, {role:"user",...}, {role:"assistant",...}]
C. [{role:"user",...}, {role:"assistant",..., content:[{type:"tool_use",...}]}, {role:"user",..., content:[{type:"tool_result",...}]}]
D. [{role:"user",...}] (single message)
Answer: B. The array starts with an assistant role message, but the Claude API requires the messages array to begin with a user role message, so the request fails with an API error. A single user message (D) is valid, and tool_use/tool_result sequences (C) are valid when properly ordered. This is a common bug when implementing pruning: always verify the first message role after removing messages.
Module Summary
Key Concepts Recap
- Statelessness — The Claude API has no memory. You replay the full conversation every call. This gives you total control.
- Full History — Send everything. Simple, but hits limits fast. Good for <20 turns.
- Sliding Window — Keep last N messages. Cheap but loses early context.
- Summarization — Compress old turns into a summary. Best balance of cost and context.
- Token Budget — System + History + Message + Response must fit in the context window. Monitor and prune proactively.
- Pruning — FIFO, importance scoring, semantic dedup, and summarize-and-replace. Always maintain valid message alternation.
- ConversationManager — A production class that encapsulates storage, counting, pruning, summarization, and persistence.
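The Token Budget relationship above reduces to one subtraction. The sketch below uses made-up example figures and hypothetical names (TokenBudget, history_budget, needs_pruning, and the 0.9 safety margin); it only illustrates the arithmetic and does not query the API for real limits.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    context_window: int     # total tokens the model accepts per request
    system_prompt: int      # tokens used by the system prompt
    reserved_response: int  # max_tokens reserved for the reply

    def history_budget(self, current_message: int) -> int:
        """Tokens left for conversation history once fixed costs are subtracted."""
        return (
            self.context_window
            - self.system_prompt
            - current_message
            - self.reserved_response
        )

    def needs_pruning(self, history: int, current_message: int,
                      safety_margin: float = 0.9) -> bool:
        """True when history exceeds `safety_margin` of its budget (prune proactively)."""
        return history > self.history_budget(current_message) * safety_margin


# Example figures (assumed): 200K window, 2,000-token system prompt, 4,096 reserved.
budget = TokenBudget(context_window=200_000, system_prompt=2_000, reserved_response=4_096)

print(budget.history_budget(current_message=500))                   # 193404
print(budget.needs_pruning(history=60_000, current_message=500))    # False
print(budget.needs_pruning(history=180_000, current_message=500))   # True
```

The safety margin is one way to act on "monitor and prune proactively": pruning starts before the hard limit is reached, leaving room for a longer-than-usual next message.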
What We Built
A full ConversationManager class (Python & Node.js) that handles multi-turn conversations with automatic token tracking, sliding window mode, Claude-powered summarization, and JSON serialization for cross-session persistence.
Next Module Preview
M09: RAG — Retrieval-Augmented Generation takes conversation management further by connecting Claude to external knowledge bases. Instead of relying solely on conversation history, RAG lets your agent search documents, databases, and APIs to ground its responses in real data — building on the context management skills you learned here.