Building AI Agents with Claude · Track 3: Memory & Context
Module 12 of 30 · ~65 min · Advanced

Multi-Layer Memory Architecture

Give your agent a brain that remembers — across turns, across sessions, across tasks.

Prerequisites: M08: Conversation Management, M09: RAG Fundamentals

Learning Objectives

  • Explain why production agents need multiple memory tiers instead of a single context window or vector store (context window: the maximum amount of text, measured in tokens, that Claude can process in a single API call; it includes the system prompt, conversation history, tool definitions, and Claude's response. Current Claude models support up to 200K tokens, but a larger context doesn't mean better recall)
  • Implement a working memory scratchpad that tracks current task state and injects it into each LLM call
  • Build an episodic memory system that stores conversation summaries in a vector database for cross-session retrieval
  • Create a procedural memory store that saves and retrieves proven tool-call sequences for reuse
  • Wire all three memory tiers together with a Memory Manager that orchestrates loading, updating, and persisting memory across sessions

Why One Memory Type Isn’t Enough

Everyday Analogy

Before: Imagine if your brain stored everything in one place — today’s grocery list, how to ride a bike, your childhood memories, and the recipe for your favorite pasta — all jumbled in a single notebook. Every time you needed something, you’d flip through every page.

Pain: You’d waste enormous time searching, the notebook would fill up fast, and important long-term memories would get crowded out by temporary notes. You’d forget how to ride a bike because you overwrote it with a shopping list.

Mapping: Your brain uses different memory systems for a reason: working memory for what you’re thinking about right now (like holding a phone number while you dial), episodic memory for past experiences (what happened at your wedding), and procedural memory for learned skills (how to ride a bike). AI agents need the same separation — a single context window or vector store is the “one notebook” approach, and it breaks down at scale.

Technical Definition

A multi-layer memory architecture is a design pattern that separates an agent’s memory into specialized tiers. Each tier has its own storage backend, retrieval strategy, and retention policy, chosen for the kind of information it holds:

  • Working memory: a fast, mutable scratchpad for the current task. Think of it as a key-value store (a simple data structure that maps unique keys such as “user_intent” to values such as “book a flight”; fast to read and write, often held in RAM), like a Python dict or Redis cache. It holds the user’s current intent, extracted entities, intermediate tool results, and the plan for the current turn. It’s included in every LLM prompt and cleared when the task finishes.
  • Episodic memory: a searchable archive of past interactions. It stores summarized records of past conversations in a vector database, indexed by semantic embeddings and timestamps, and is retrieved via similarity search when the agent needs context from prior sessions. (A vector database stores data as high-dimensional vectors, that is, arrays of numbers, and supports similarity search: finding items whose vectors are closest to a query vector. Examples: ChromaDB, Pinecone, Weaviate.)
  • Procedural memory: a library of reusable action sequences. It stores proven tool-call chains (like “search → filter → summarize → respond”) as structured templates. When a new task matches a known pattern, the agent retrieves and executes the template instead of reasoning from scratch.

The key insight: mixing all three in a single store causes retrieval pollution — when you search for “what did the user ask last week?”, you don’t want to also get back the current task’s scratchpad entries and stored tool sequences. Separation keeps retrieval precise and context windows lean.

Three-Tier Memory Architecture
  • Working Memory: current task scratchpad. Storage: key-value store (Redis/dict). Holds: intent, entities, plan, tool results. Lifetime: cleared after task completes. Latency: <1ms (in-memory dict). Accessed: 100% of requests.
  • Episodic Memory: past interactions and session summaries. Storage: vector DB (ChromaDB/Pinecone). Holds: session summaries, decisions, prefs. Lifetime: persists across sessions. Latency: ~50-200ms (vector search). Accessed: ~40% of requests.
  • Procedural Memory: learned patterns and reusable action sequences. Storage: template store (JSON/DB). Holds: tool chains, proven workflows. Lifetime: updated via reinforcement. Latency: ~5ms (template lookup). Accessed: ~10% of requests.
Memory Tiers Architecture
User: “What did we decide about the deployment schedule last Tuesday?”
  • Working Memory (current task state, fast read/write): intent: "recall_decision", topic: "deployment schedule"
  • Episodic Memory (past interactions, semantic search): found Session #42 (Tue), "Deploy moved to Friday"
  • Procedural Memory (learned skills, pattern matching): template recall_decision → search episodes → cite → respond

So what does multi-layer memory actually look like in an agent's system prompt? Here's the formatted output from all three tiers combined — this is what gets injected into every Claude call:

    [Working Memory: Session sess_044]
    Created: 2026-03-29T10:00:00Z
    intent: recall_decision
    entities: {"topic": "deployment", "timeframe": "last Tuesday"}

    [Relevant Past Interactions]
    - Session sess_042 (2026-03-25): Discussed deployment schedule. Decision: deploy moved to Friday due to QA delays.
    - Session sess_038 (2026-03-20): Reviewed API rate limits. No deployment-related decisions.

    [Suggested Procedure: recall_decision (used 8x)]
    Step 1: Search episodic memory for matching topic (tool: search_episodes)
    Step 2: Cite the source session and date (tool: format_citation)
    Step 3: Respond with the decision and context (tool: generate_response)

Three tiers, under 400 tokens total. Claude sees the current task state (working memory), the 2 most relevant past conversations (episodic memory), and a proven action template (procedural memory). Compare that to stuffing the full transcript of every past conversation into the prompt, which would use 50,000+ tokens and exceed context limits within a few days.
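A minimal sketch of how such a block could be assembled from the three tiers. The function name `build_memory_context` and the dictionary shapes it expects are assumptions made for illustration; they are not part of any library.

```python
import json
from typing import Any, Optional


def build_memory_context(
    session_id: str,
    working: dict[str, Any],
    episodes: list[dict[str, str]],
    procedure: Optional[dict[str, Any]],
) -> str:
    """Render the three memory tiers into one text block for the system prompt."""
    lines = [f"[Working Memory: Session {session_id}]"]
    for key, value in working.items():
        # Nested values are serialized as JSON so each entry stays on one line.
        rendered = value if isinstance(value, str) else json.dumps(value)
        lines.append(f"{key}: {rendered}")

    if episodes:
        lines.append("[Relevant Past Interactions]")
        for ep in episodes:
            lines.append(f"- Session {ep['session_id']} ({ep['date']}): {ep['summary']}")

    if procedure is not None:
        lines.append(
            f"[Suggested Procedure: {procedure['name']} (used {procedure['success_count']}x)]"
        )
        for i, step in enumerate(procedure["steps"], start=1):
            lines.append(f"Step {i}: {step['description']} (tool: {step['tool']})")

    return "\n".join(lines)


# Hypothetical inputs mirroring the example in the text.
context = build_memory_context(
    "sess_044",
    {"intent": "recall_decision",
     "entities": {"topic": "deployment", "timeframe": "last Tuesday"}},
    [{"session_id": "sess_042", "date": "2026-03-25",
      "summary": "Deploy moved to Friday due to QA delays."}],
    {"name": "recall_decision", "success_count": 8,
     "steps": [{"description": "Search episodic memory for matching topic",
                "tool": "search_episodes"}]},
)
print(context)
```

Keeping the renderer separate from the stores means each tier can be omitted (pass an empty list or `None`) without changing the prompt format for the others.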

⚠️ Common Misconceptions

"Every agent needs all three memory tiers" — Not at all. A simple Q&A bot needs zero memory. A single-session agent might only need working memory. Episodic memory matters when users return across sessions. Procedural memory matters when the agent repeats complex workflows. Start with the tier that solves your actual problem, not the full architecture.

"Episodic memory gives the agent perfect recall" — Episodic memory stores summaries, not transcripts. Summarization is lossy — specific names, numbers, and nuances may be dropped. If you need exact recall of a specific detail (like a contract amount), store it as structured metadata, not just in the summary text.

"More retrieved episodes = better responses" — Injecting 10 past episodes into the prompt does more harm than good. Each extra episode adds noise and tokens, and Claude may get confused about which details apply to the current situation. Best practice: retrieve 2-3 most relevant episodes, max.

"Procedural memory replaces Claude's reasoning" — Procedural memory provides a suggested plan, not a mandate. Claude can adapt or override the template based on the current context. Think of it as a recipe that an experienced chef can riff on — not a rigid script.

"Episodic memory has no privacy implications" — Storing conversation summaries means you're persisting user data across sessions. This has real GDPR/CCPA implications: users may have the right to request deletion of their stored episodes. You need to be able to delete all episodes for a specific user_id, and your privacy policy must disclose that conversations are summarized and stored.

Why It Matters

A customer-facing support agent handling 500 conversations per day generates roughly 2 million tokens of conversation history per day (500 conversations at about 4,000 tokens each). Without separated memory tiers, you’d either hit context window limits constantly (200K tokens max) or pay $60+/day in API costs injecting irrelevant history. Multi-layer memory keeps the prompt lean: only the current task state (working), the 2–3 most relevant past conversations (episodic), and the matching skill template (procedural) go into each call, typically under 4,000 tokens of memory context in total.

Now you understand WHY agents need multiple memory types. Let’s build each tier, starting with the fastest and most immediate: working memory.

Tier 1: Working Memory — The Scratchpad

Everyday Analogy

Before: Imagine a doctor seeing a patient without a clipboard. They walk into the room, the patient describes three symptoms, the doctor orders a blood test, then walks to the lab — and has to ask the patient to repeat everything because they had nowhere to jot notes.

Pain: Without a scratchpad, the doctor loses track of what they’ve already learned, repeats questions, orders duplicate tests, and the visit takes three times longer. The patient loses trust.

Mapping: Working memory is the agent’s clipboard. It holds the current user intent, extracted entities (names, dates, IDs), intermediate tool results, and the plan for the current task. Every LLM call sees this scratchpad, so the agent never “forgets” what it learned two tool calls ago. When the task is done, the scratchpad is cleared or archived.

In concrete terms, working memory is a Python dictionary that looks like this at any given moment during a task:

    {
      "intent": "find_deployment_date",
      "entities": { "topic": "deployment", "timeframe": "last Tuesday" },
      "search_results": [{ "session": 42, "summary": "Deploy moved to Friday" }],
      "confidence": 0.94,
      "response_plan": "cite session #42, confirm Friday"
    }

Each key-value pair gets added as the agent progresses through the task. The to_prompt() method formats this dict into a text block that Claude reads at the start of every call. That's all working memory is — a structured dict that rides along with each API request.

Technical Definition

Working memory is a structured, mutable state object (a Python dictionary or JavaScript Map) that holds everything the agent needs for the current task. It includes:

  • User intent — what the user is trying to accomplish, parsed from their message
  • Extracted entities — specific values like names, dates, IDs pulled from the conversation
  • Intermediate results — outputs from tool calls that inform the next step
  • Task plan — the sequence of steps the agent plans to execute

The working memory is injected into the system prompt of every LLM call, so Claude always has full awareness of the current state. It lives in RAM (or a fast cache like Redis for distributed agents), and its lifetime matches the task — created when a request arrives, destroyed when the response is sent.
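The scratchpad described above can be sketched as a small class. This is one possible shape, assuming an in-process dict; the class name `WorkingMemory` and its methods other than `to_prompt()` are illustrative choices, and the system prompt text is a placeholder.

```python
import json
from datetime import datetime, timezone
from typing import Any


class WorkingMemory:
    """Per-task scratchpad: a dict plus a formatter for the system prompt."""

    def __init__(self, session_id: str) -> None:
        self.session_id = session_id
        self.created = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self._state: dict[str, Any] = {}

    def set(self, key: str, value: Any) -> None:
        self._state[key] = value

    def get(self, key: str, default: Any = None) -> Any:
        return self._state.get(key, default)

    def clear(self) -> None:
        """Call when the task finishes so stale state never leaks into the next one."""
        self._state.clear()

    def to_prompt(self) -> str:
        """Format the current state as a text block to prepend to the system prompt."""
        lines = [f"[Working Memory: Session {self.session_id}]", f"Created: {self.created}"]
        for key, value in self._state.items():
            rendered = value if isinstance(value, str) else json.dumps(value)
            lines.append(f"{key}: {rendered}")
        return "\n".join(lines)


wm = WorkingMemory("sess_044")
wm.set("intent", "find_deployment_date")
wm.set("entities", {"topic": "deployment", "timeframe": "last Tuesday"})
wm.set("confidence", 0.94)

# The scratchpad rides along with every call as part of the system prompt.
system_prompt = "You are a helpful assistant.\n\n" + wm.to_prompt()
print(system_prompt)
```

For a distributed agent the same interface could sit in front of Redis instead of a local dict; only `set`, `get`, and `clear` would change.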

Working Memory Scratchpad
  1. Parse user request
  2. Extract entities
  3. Call search tool
  4. Evaluate results
  5. Generate response
Scratchpad State
    intent: "find_deployment_date"
    entities: { topic: "deployment", timeframe: "last Tuesday" }
    search_results: [Session #42: "Deploy moved to Fri"]
    confidence: 0.94 (single clear match)
    response_plan: "cite session #42, confirm Friday"

Why It Matters

Without working memory, an agent that calls 3 tools in sequence has to re-derive context from the full conversation history each time — burning tokens and losing track of intermediate state. In benchmarks, adding a structured scratchpad to a multi-step agent reduces token usage by 30–40% and improves task completion rates by 15–25%, because the agent always knows exactly where it is in the task.

Working memory handles the “now.” But what about information from yesterday, last week, or last month? That’s where episodic memory comes in.

Tier 2: Episodic Memory — Past Interactions

Everyday Analogy

Before: Imagine a personal assistant who keeps a detailed diary that’s searchable by topic. They don’t read the entire diary every morning — that would take hours. Instead, when you ask “What did we discuss about the budget last week?”, they search their diary and find the right entry in seconds.

Pain: Without the diary, the assistant starts every day with amnesia. You’d have to re-explain your preferences, past decisions, and ongoing projects every single session. For a support agent, this means every returning customer feels like they’re talking to a stranger.

Mapping: Episodic memory is the agent’s searchable diary. After each conversation, the agent writes a summary of what happened — key topics, decisions made, user preferences learned, outcomes. These summaries are stored as vector embeddings in a database. When a new conversation starts, the agent searches for relevant past episodes and injects them into the prompt, giving it cross-session continuity.

Here's what an episode record actually looks like once it's stored in ChromaDB:

    {
      "id": "ep_a3f2b1c9d0e4",
      "document": "Discussed deployment schedule. Decision: deploy moved to Friday due to QA delays. User prefers email notifications.",
      "metadata": {
        "session_id": "sess_042",
        "user_id": "user_alice",
        "timestamp": "2026-03-25T14:30:00Z",
        "topics": "deployment,scheduling"
      },
      "embedding": [0.023, -0.841, 0.129, ... 1533 more floats]
    }

The document is the summary text. The embedding is an array of floats representing its semantic meaning (1,536 of them in this example; the length depends on the embedding model you use). The metadata enables filtered search (e.g., "find episodes for user_alice about deployment"). When a new message comes in, ChromaDB compares the message's embedding against the stored episodes and returns the closest matches.

Technical Definition

Episodic memory stores summarized records of past conversations in a vector database, indexed by semantic embeddings and timestamped metadata. Semantic embeddings are numerical representations (arrays of floating-point numbers) of text meaning: texts with similar meanings have similar embedding vectors, which enables similarity search even when the exact words differ. Metadata is additional structured data attached to each record, such as session ID, timestamp, user ID, and topic tags, that enables filtered search beyond semantic similarity alone. The workflow is:

  1. At conversation end: Summarize the conversation into a structured episode record (topics, decisions, preferences, outcomes) using Claude
  2. Embed & store: Convert the summary into a vector embedding and store it in ChromaDB (or Pinecone, Weaviate, etc.) with metadata (session ID, timestamp, user ID)
  3. At conversation start: Take the new user message, embed it, and search the vector database for the most similar past episodes using cosine similarity (a measure of how similar two vectors are, based on the angle between them: 1.0 means identical direction and very similar meaning, 0.0 means unrelated). ChromaDB's HNSW index, an algorithm that keeps nearest-neighbor search fast even with millions of vectors, defaults to squared L2 distance, so select cosine explicitly when you create the collection.
  4. Inject: Add the top 2–3 matching episode summaries into the system prompt as “relevant past context”

This gives the agent cross-session memory without replaying full conversation logs (which would be prohibitively expensive and exceed context limits).
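The store-then-search workflow above can be sketched without any external services. In this sketch a word-count vector stands in for a real embedding model and a Python list stands in for the vector database, so the similarity scores are purely lexical; both substitutions are assumptions made to keep the example self-contained. The `delete_user` method reflects the per-user deletion requirement raised earlier in the privacy misconception.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class EpisodicMemory:
    def __init__(self) -> None:
        self._episodes: list[dict] = []

    def store(self, summary: str, session_id: str, user_id: str, timestamp: str) -> None:
        """Embed a conversation summary and keep it with its metadata."""
        self._episodes.append({
            "document": summary,
            "embedding": embed(summary),
            "metadata": {"session_id": session_id, "user_id": user_id, "timestamp": timestamp},
        })

    def search(self, query: str, user_id: str, k: int = 3) -> list[dict]:
        """Return up to k episodes for this user, most similar first."""
        q = embed(query)
        scored = [
            (cosine(q, ep["embedding"]), ep)
            for ep in self._episodes
            if ep["metadata"]["user_id"] == user_id  # metadata filter
        ]
        scored = [(score, ep) for score, ep in scored if score > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [ep for _, ep in scored[:k]]

    def delete_user(self, user_id: str) -> int:
        """Remove every episode for a user; returns the number removed."""
        before = len(self._episodes)
        self._episodes = [ep for ep in self._episodes if ep["metadata"]["user_id"] != user_id]
        return before - len(self._episodes)


memory = EpisodicMemory()
memory.store("Discussed deployment schedule. Decision: deploy moved to Friday due to QA delays.",
             "sess_042", "user_alice", "2026-03-25T14:30:00Z")
memory.store("Reviewed API rate limits for billing.",
             "sess_038", "user_alice", "2026-03-20T09:00:00Z")
memory.store("Deployment schedule for another customer.",
             "sess_900", "user_bob", "2026-03-26T11:00:00Z")

hits = memory.search("What did we decide about the deployment schedule?", user_id="user_alice", k=2)
print([h["metadata"]["session_id"] for h in hits])
removed = memory.delete_user("user_bob")
```

With a real vector database the `search` body would become a single query call with a metadata filter, and `embed` would call an embedding model; the surrounding interface can stay the same.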

Episodic Memory: Semantic Retrieval
Stored episodes: Session #38 (API rate limits), Session #39 (User prefers email), Session #40 (Bug in auth flow), Session #41 (Pricing discussion), Session #42 (Deploy → Friday), Session #43 (Onboarding docs)
Query: “deployment schedule”

Why It Matters

Episodic memory is the difference between a stateless chatbot and a persistent assistant. A SaaS support agent with episodic memory can say “Last time we spoke, you were having trouble with the OAuth redirect — did that get resolved?” instead of asking the customer to explain from scratch. Studies show that agents with cross-session memory reduce average handle time by 40% and increase customer satisfaction scores by 20%, because users don’t have to repeat themselves.

Episodic memory remembers what happened. But what about remembering how to do things? That’s procedural memory — the agent’s skill library.

Tier 3: Procedural Memory — The Skill Library

Everyday Analogy

Before: Imagine an experienced chef who has cooked the same pasta dish 200 times. They don’t re-read the recipe or measure spices from scratch each time. They’ve internalized the steps: boil water, salt it, cook pasta 8 minutes, sauté garlic in olive oil, toss everything together. It’s automatic.

Pain: A novice chef without any recipes has to reason through every step from first principles, makes mistakes, and takes 3x longer. Worse, they might solve the same problem differently each time, leading to inconsistent results. Customers who order the same dish twice get different meals.

Mapping: Procedural memory is the agent’s recipe book. When the agent successfully completes a multi-step task (e.g., “search → filter → summarize → format as table”), it stores that tool sequence as a reusable template. Next time a similar request arrives, the agent retrieves the template instead of reasoning from scratch — faster execution, consistent results, fewer tokens burned on planning.

Here's what a stored procedure template looks like as a JSON record:

    {
      "name": "weekly_sales_report",
      "trigger": "Generate a weekly sales report with charts and insights",
      "steps": [
        { "tool": "query_db", "params": {"query": "sales_last_7_days"}, "description": "Fetch sales data" },
        { "tool": "aggregate", "params": {"group_by": "region"}, "description": "Aggregate by region" },
        { "tool": "chart_gen", "params": {"type": "bar"}, "description": "Generate bar chart" },
        { "tool": "format_pdf", "params": {}, "description": "Format as PDF report" }
      ],
      "success_count": 12
    }

The trigger is stored as an embedding for semantic matching. When a user says "Can you make the weekly report?", the agent embeds that request, searches procedural memory, and finds this template with high similarity. The steps array tells Claude exactly which tools to call and in what order. The success_count gives the agent confidence that this is a proven workflow.

Technical Definition

Procedural memory stores reusable action sequences (tool chains, proven workflows, and task templates) as structured records. An action sequence is an ordered list of tool calls with their parameters, representing a proven workflow, for example [search_db(query), filter_results(criteria), format_output(template)]; it can include conditional branches and parallel steps. Each record has two components:

  • Trigger condition: A natural language description of when this procedure applies (e.g., “user asks to generate a weekly sales report”). Stored as an embedding for similarity matching.
  • Execution steps: An ordered list of tool calls with parameter templates, expected outputs, and error handling instructions. Stored as JSON.

At runtime, the agent embeds the current user request, searches procedural memory for matching triggers, and — if a match is found with high confidence — loads the procedure into the prompt as a suggested plan. The agent can then follow or adapt the plan rather than reasoning from scratch.
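The match-then-suggest flow can be sketched as follows. Jaccard word overlap stands in for embedding similarity here, and the 0.3 threshold is an arbitrary value chosen for this toy measure; with real embeddings both the scoring function and a sensible threshold would differ. The names `ProceduralMemory` and `to_plan` are hypothetical.

```python
import re
from typing import Optional


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a: str, b: str) -> float:
    """Jaccard word overlap: a toy stand-in for embedding similarity."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


class ProceduralMemory:
    def __init__(self, threshold: float = 0.3) -> None:
        self.threshold = threshold  # minimum score to count as a confident match
        self._procedures: list[dict] = []

    def add(self, name: str, trigger: str, steps: list[dict]) -> None:
        self._procedures.append(
            {"name": name, "trigger": trigger, "steps": steps, "success_count": 0}
        )

    def match(self, request: str) -> Optional[dict]:
        """Return the best-matching procedure, or None if nothing clears the threshold."""
        scored = [(similarity(request, p["trigger"]), p) for p in self._procedures]
        if not scored:
            return None
        score, best = max(scored, key=lambda pair: pair[0])
        return best if score >= self.threshold else None

    def record_success(self, name: str) -> None:
        for p in self._procedures:
            if p["name"] == name:
                p["success_count"] += 1


def to_plan(procedure: dict) -> str:
    """Render a procedure as a suggested plan; the model may follow or adapt it."""
    header = f"[Suggested Procedure: {procedure['name']} (used {procedure['success_count']}x)]"
    steps = [
        f"Step {i}: {s['description']} (tool: {s['tool']})"
        for i, s in enumerate(procedure["steps"], start=1)
    ]
    return "\n".join([header, *steps])


pm = ProceduralMemory()
pm.add(
    "weekly_sales_report",
    "Generate a weekly sales report with charts and insights",
    [{"tool": "query_db", "description": "Fetch sales data"},
     {"tool": "aggregate", "description": "Aggregate by region"}],
)
hit = pm.match("Can you generate the weekly sales report?")
miss = pm.match("What is the weather today?")
if hit is not None:
    print(to_plan(hit))
```

Returning `None` below the threshold matters as much as returning a match: when no template fits, the agent should fall back to reasoning from scratch rather than follow a loosely related plan.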

Procedural Memory: Skill Retrieval & Execution
User: “Weekly report” → Match: report_gen (confidence: 0.93)
  1. query_db (sales data)
  2. aggregate (by region)
  3a. chart_gen (bar chart)
  3b. summary (key insights)
  4. format (PDF report)

Why It Matters

Procedural memory is how agents get faster and more reliable over time. Without it, an agent that generates weekly reports reasons through the same 5-step plan every Monday — burning ~2,000 tokens on planning alone. With procedural memory, it retrieves the proven template in a single lookup, saving tokens and executing the same reliable sequence every time. For a B2B ecommerce agent handling 50 report requests per week, this saves roughly $15/week in API costs and eliminates the variance in report quality that comes from re-deriving the approach each time.

You now understand all three memory tiers. The next question is: how do conversations become memories? That’s the summarization pipeline — the bridge between short-term conversation and long-term storage.
🎓 Cert Tip — Domain 5.6 (Temporal Data)

Facts have an "as-of" timestamp. Memory layers must distinguish current facts ("CEO is Alice") from historical facts ("CEO was Bob through 2024-Q3"). Without temporal metadata, your agent will confidently report stale facts as current — the canonical exam scenario. Store every fact with {value, valid_from, valid_to, source}; on retrieval, prefer the row where valid_to IS NULL for "current" queries and filter by date for "as-of" queries. Episodic memory naturally has timestamps, but semantic and procedural memory often don't — this is where temporal bugs hide.
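A small sketch of the {value, valid_from, valid_to, source} pattern from the tip above. It assumes `valid_to` is an exclusive end date and that `None` marks the current fact; Bob's start date and the source labels are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Fact:
    key: str
    value: str
    valid_from: date
    valid_to: Optional[date]  # exclusive end; None means "still current"
    source: str


def current(facts: list[Fact], key: str) -> Optional[Fact]:
    """Answer a "current" query: the row whose valid_to is None."""
    for fact in facts:
        if fact.key == key and fact.valid_to is None:
            return fact
    return None


def as_of(facts: list[Fact], key: str, when: date) -> Optional[Fact]:
    """Answer an "as-of" query: the row whose validity interval contains the date."""
    for fact in facts:
        if fact.key != key or when < fact.valid_from:
            continue
        if fact.valid_to is None or when < fact.valid_to:
            return fact
    return None


facts = [
    # "CEO was Bob through 2024-Q3": validity ends on the first day of Q4.
    Fact("ceo", "Bob", date(2020, 1, 1), date(2024, 10, 1), "illustrative record"),
    Fact("ceo", "Alice", date(2024, 10, 1), None, "illustrative record"),
]

print(current(facts, "ceo").value)
print(as_of(facts, "ceo", date(2024, 6, 15)).value)
```

Using an exclusive end date means adjacent intervals never overlap, so an as-of query on the handover day returns exactly one row.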

Summarization Pipeline & Cross-Session Persistence

Everyday Analogy

Before: At the end of each workday, imagine a project manager who writes a brief summary of what was accomplished: key decisions made, action items assigned, blockers identified, and what’s pending for tomorrow. They file this summary in a searchable archive.

Pain: Without this habit, Monday morning is chaos. Nobody remembers what was decided on Friday, action items fall through the cracks, and the team re-discusses the same issues. The project manager effectively has weekly amnesia.

Mapping: The summarization pipeline is this end-of-day summary habit, automated. When a conversation ends, Claude summarizes the key information — decisions, preferences, outcomes — into a compact record. This record is embedded and stored in episodic memory. When the next session starts, the stored summaries are retrieved and injected, giving the agent continuity across sessions without replaying entire conversation logs.

Here's what the summary record looks like after Claude processes a 10-turn conversation:

    {
      "summary": "Discussed deployment schedule and notification preferences. Decision: deploy moved to Friday due to QA delays. User prefers email over Slack for notifications.",
      "topics": ["deployment", "notifications", "scheduling"],
      "decisions": ["Deploy moved to Friday"],
      "user_preferences": ["Email notifications preferred over Slack"],
      "action_items": ["Confirm Friday deploy with QA team"]
    }

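That's it: a 10-turn, 4,000-token conversation compressed into roughly 80 tokens of structured JSON. This example is deliberately short; records for richer conversations typically run 200–400 tokens. The topics array is used for metadata filtering, and the summary text is what gets embedded and stored in ChromaDB.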

Memory Activation Timeline: Single Request Lifecycle
Time axis: 0ms, 1ms, 50ms, 200ms, 500ms, ~1-3s. Phases: Parse, Retrieve, Execute, Persist.
  • Working: load context (intent, entities), update state (tool results, plan), clear
  • Procedural: pattern match (find template), learn
  • Episodic: retrieve (semantic search), store

Summarization Pipeline
Conversation (~4,000 tokens) → Summarize (Claude extracts key info) → Embed (convert to vector) → Store (episodic DB + metadata) → Retrieve (next session loads it)

Technical Definition

The summarization pipeline is a multi-stage process that converts raw conversations into compact, searchable memory records:

  1. Summarize: Send the conversation to Claude with a structured prompt that extracts: key topics discussed, decisions made, user preferences learned, action items, and outcomes. Output is a structured JSON record (~200–400 tokens), down from the original ~4,000+ token conversation.
  2. Embed: Convert the summary text into a vector embedding using an embedding model (like Voyage AI or OpenAI embeddings). This enables semantic search later.
  3. Store: Save the embedding + summary text + metadata (session ID, timestamp, user ID, topic tags) in a vector database like ChromaDB.
  4. Compact: Periodically merge related episodes and remove outdated information to prevent the memory store from growing unbounded. A compaction strategy is a maintenance process that reviews stored memories on a schedule, merges duplicates, and removes entries older than a threshold or below a relevance score; for example, it might merge all “deployment” episodes from the same week into one consolidated record.
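The summarize and compact stages can be sketched like this. The model call is injected as `call_llm`, a hypothetical callable that takes a prompt string and returns the model's text, so the sketch runs without network access; embedding and storage are omitted. Grouping by first topic and ISO week is one possible compaction rule, not the only one.

```python
import json
from collections import defaultdict
from datetime import datetime
from typing import Callable

REQUIRED_KEYS = ("summary", "topics", "decisions", "user_preferences", "action_items")

SUMMARY_INSTRUCTIONS = (
    "Summarize the conversation below as a JSON object with exactly these keys: "
    + ", ".join(REQUIRED_KEYS)
    + ". Distinguish what was decided from what was only discussed. Return JSON only."
)


def summarize_conversation(messages: list[dict], call_llm: Callable[[str], str]) -> dict:
    """Stage 1: turn a transcript into a validated, structured summary record."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    raw = call_llm(f"{SUMMARY_INSTRUCTIONS}\n\n{transcript}")
    record = json.loads(raw)  # raises ValueError if the model returned non-JSON
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        raise ValueError(f"summary is missing keys: {missing}")
    return record


def compact(episodes: list[dict]) -> list[dict]:
    """Stage 4: merge episodes that share a first topic and an ISO week."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for ep in episodes:
        year, week, _ = datetime.fromisoformat(ep["timestamp"]).isocalendar()
        groups[(ep["topics"][0], year, week)].append(ep)

    merged = []
    for (topic, _year, _week), group in groups.items():
        group.sort(key=lambda ep: ep["timestamp"])
        merged.append({
            "topics": [topic],
            "timestamp": group[-1]["timestamp"],  # keep the most recent timestamp
            "summary": " ".join(ep["summary"] for ep in group),
            "source_count": len(group),
        })
    return merged


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real model call."""
    return json.dumps({
        "summary": "Deploy moved to Friday due to QA delays.",
        "topics": ["deployment"],
        "decisions": ["Deploy moved to Friday"],
        "user_preferences": [],
        "action_items": ["Confirm Friday deploy with QA team"],
    })


record = summarize_conversation(
    [{"role": "user", "content": "When do we deploy?"},
     {"role": "assistant", "content": "Friday, because QA needs more time."}],
    fake_llm,
)

compacted = compact([
    {"topics": ["deployment"], "timestamp": "2026-03-23T10:00:00+00:00", "summary": "Deploy planned for Wednesday."},
    {"topics": ["deployment"], "timestamp": "2026-03-25T14:30:00+00:00", "summary": "Deploy moved to Friday."},
    {"topics": ["deployment"], "timestamp": "2026-03-31T09:00:00+00:00", "summary": "Friday deploy succeeded."},
])
print(len(compacted), compacted[0]["source_count"])
```

Validating the keys before storing is worth the few extra lines: a malformed summary that reaches the vector store is much harder to find and fix later than one rejected at write time.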
Why It Matters

A 10-turn conversation with Claude uses roughly 4,000–8,000 tokens. Storing the full transcript for 500 daily conversations would mean 2–4 million tokens of raw history per day. The summarization pipeline compresses each conversation to ~300 tokens, a 92–96% reduction. That brings a full day of 500 conversations down to ~150,000 tokens of summaries (500 × 300), which fits comfortably in a single ChromaDB collection and can be searched in under 50ms. Without summarization, raw history would grow by 14–28 million tokens every week.

⚠️ Common Misconceptions

"Summarization is lossless — Claude captures everything" — No. Summarization is inherently lossy. Specific numbers (contract amounts, API rate limits), exact timestamps, and subtle nuances often get dropped. If a detail is business-critical, store it as structured metadata alongside the summary, not just in the summary text.

"I should summarize after every message" — Summarize at conversation END, not after every turn. Mid-conversation summarization wastes API calls and may lose context that only makes sense in the full conversation flow. The one exception: very long conversations (50+ turns) where you want to summarize-and-compact periodically to prevent context overflow.

"Any LLM can summarize equally well" — Summarization quality varies significantly between models. A weak summarizer might miss that "the user seemed frustrated about the delay" or drop the distinction between "we discussed X" and "we decided X." Use a capable model (Claude Sonnet or better) for summarization — the cost difference is negligible and the quality gap is significant.

What Just Happened?

You’ve learned the four components of the memory architecture: working memory (current state), episodic memory (past conversations), procedural memory (learned skills), and the summarization pipeline (the bridge between them). Now let’s build the persistence layer that keeps all this data alive across process restarts.

Persistence Layer

Memory that only lives in RAM disappears when your process crashes or restarts. For production agents, all three memory tiers need durable storage — meaning the data is written to disk (or a managed database) and survives server restarts, crashes, and deployments.

Here's how persistence works internally, tier by tier. When working memory updates a key, the change happens in RAM first (that's what makes it fast). Periodically, the in-memory state is flushed to a backing store — Redis for distributed agents, or SQLite or a JSON file for simpler setups.

Episodic memory is a bit different. When you store an episode through a persistent ChromaDB client, three things are written to disk: the embedding vector (the 1536-float array), the document text (the summary), and the metadata. All of it lives under ChromaDB's persist directory (a SQLite file plus on-disk index files), so it survives restarts.

Procedural memory splits its data across two stores. The trigger description goes into the vector DB as an embedding (for semantic search). The steps JSON goes into a relational table (SQLite or PostgreSQL). On restart, each tier reads its persisted state and resumes where it left off.

If you've worked with web apps, this is similar to the difference between storing user data in a session cookie versus a database. Session cookies disappear when the browser closes. A database persists across restarts. The same pattern applies here: without persistence, your agent's memory is a session cookie that vanishes on crash.

This is fundamentally different from the context-window approach in M08, where conversation history lived only in the messages array passed to each API call. That array exists only while your code is running. Persistence adds a storage layer underneath each memory tier, turning ephemeral state into durable records.

The tradeoff is additional infrastructure complexity — you now need a database, backup strategy, and a plan for when storage grows unbounded. For a hobby project, SQLite and a local ChromaDB directory are fine. For a production agent handling thousands of users, you'll likely graduate to a managed vector database (Pinecone, Weaviate Cloud) with automatic backups and horizontal scaling.

  • Working memory: Persisted to Redis or a session store — survives brief disconnects. For simpler deployments, a JSON file or SQLite row per active session works.
  • Episodic memory: Vector database (ChromaDB with SQLite backend, Pinecone, Weaviate) — survives restarts and scales to millions of episodes.
  • Procedural memory: SQLite or PostgreSQL table with trigger embeddings stored in the same vector database — survives restarts and supports filtered retrieval.
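A minimal sketch of the SQLite option for the working and procedural tiers, using only the standard library. The table names, column names, and helper functions are assumptions made for this example; the trigger embeddings that would live in the vector database are left out.

```python
import json
import os
import sqlite3
import tempfile
from typing import Optional


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the database file and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS working_memory ("
        "session_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS procedures ("
        "name TEXT PRIMARY KEY, trigger_text TEXT NOT NULL, steps TEXT NOT NULL)"
    )
    return conn


def flush_working_memory(conn: sqlite3.Connection, session_id: str, state: dict) -> None:
    """Write the in-memory scratchpad to disk as one JSON row per session."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO working_memory (session_id, state) VALUES (?, ?)",
            (session_id, json.dumps(state)),
        )


def load_working_memory(conn: sqlite3.Connection, session_id: str) -> dict:
    row = conn.execute(
        "SELECT state FROM working_memory WHERE session_id = ?", (session_id,)
    ).fetchone()
    return json.loads(row[0]) if row else {}


def save_procedure(conn: sqlite3.Connection, name: str, trigger: str, steps: list[dict]) -> None:
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO procedures (name, trigger_text, steps) VALUES (?, ?, ?)",
            (name, trigger, json.dumps(steps)),
        )


def load_procedure_steps(conn: sqlite3.Connection, name: str) -> Optional[list]:
    row = conn.execute("SELECT steps FROM procedures WHERE name = ?", (name,)).fetchone()
    return json.loads(row[0]) if row else None


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "agent_memory.db")

    conn = open_store(db_path)
    flush_working_memory(conn, "sess_044", {"intent": "recall_decision"})
    save_procedure(conn, "weekly_sales_report", "Generate a weekly sales report",
                   [{"tool": "query_db"}])
    conn.close()  # simulate a process restart

    conn = open_store(db_path)
    restored = load_working_memory(conn, "sess_044")
    steps = load_procedure_steps(conn, "weekly_sales_report")
    conn.close()

print(restored, steps)
```

Closing and reopening the connection is the point of the demo: the state read back after the "restart" comes from disk, not from the Python objects that wrote it.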
Production Warning

ChromaDB’s default in-memory mode loses all data on restart. Always configure persistent storage: chromadb.PersistentClient(path="./chroma_db") in Python or specify a persist directory. In production, consider a managed vector database (Pinecone, Weaviate Cloud) for automatic backups and scaling.

🎓 Cert Tip — Domain 5.4

Long sessions accumulate stale context — Claude may reference code that’s been changed. Mitigations: /compact (lossy), scratchpad files (lossless), subagent delegation (fresh context), or crash recovery manifests for interrupted sessions. The exam tests your knowledge of these tradeoffs.

You've built memory yourself across three tiers. But Anthropic also ships a managed memory tool that abstracts most of this for you — handing you a ready-made persistence layer at the cost of less control. Knowing when to reach for it (and when to keep your DIY stack) is the next skill.

The Managed Memory Tool

Claude exposes a built-in memory tool in the API: a managed key/value-and-namespace store that persists across requests for a given user, project, or org. Instead of standing up Postgres, building a summarization pipeline, and writing your own retrieval prompts, you call memory.write and memory.read tools and Anthropic handles storage, retrieval, and quota management.

Technical Definition

The memory tool is enabled like any other built-in tool in your tools array. Claude can call memory.write to store a fact under a namespaced key (e.g. user_prefs/timezone), memory.read to retrieve, and memory.list to enumerate. Memory is scoped: you can isolate per-user, per-session, or per-tenant by passing scoping headers. Storage and retrieval cost is metered separately from input/output tokens.

Memory Tool vs Building It Yourself

When to Use the Managed Memory Tool
Concern | Memory tool | DIY 3-tier (this module)
Setup time | Minutes: enable the tool, done | Days: DB schema, summarizer, retrieval logic
Where data lives | Anthropic-managed storage | Your DB, your VPC, your control plane
Retrieval policy | Claude decides when to read | You decide: eager, lazy, hybrid
Compliance / data residency | Bound by Anthropic's region/ZDR options | Whatever your infra allows (HIPAA, GDPR, etc.)
Cost model | Per-read/write metering | Your storage + compute
Best for | Prototypes, B2C agents, simple personalization | Regulated data, complex retrieval, multi-tenant SaaS

A common production pattern: use the memory tool for short-lived per-user preferences (timezone, locale, "remember I want JSON output") and keep your DIY tiered system for case-grade context (medical history, account state, audit-relevant decisions). They aren't competitive — they target different problem shapes.

When NOT to Use the Memory Tool

Three cases where the DIY 3-tier system you just built is the right answer: (1) any data subject to data-residency regulation that requires you to store inside your own VPC. (2) retrieval policies more nuanced than "Claude reads when it thinks it should" — e.g., always-load-on-session-start case facts. (3) high-volume personalization with thousands of facts per user, where DB indexes give you better economics than per-read metering.

The managed memory tool is API-level — useful in any Claude application. Claude Code adds a second managed memory system on top of it: Auto Memory. Where the memory tool stores facts you tell Claude to remember, Auto Memory is something Claude does on its own, accumulating notes about your project as it works. It's a critical primitive that didn't exist before Claude Code v2.1.59.

Auto Memory — Claude Writes Its Own Notes

Imagine pairing with a colleague who silently keeps a running notebook of everything they've learned about your codebase — build commands they figured out, debugging quirks, your preferences — and reads that notebook before every session. That's Auto Memory: a Claude Code feature where Claude itself decides what's worth remembering across sessions and writes it to disk, without you typing a single line of CLAUDE.md.

Auto Memory is on by default in Claude Code v2.1.59 and later. Most learners discover it the first time they see "Writing memory…" in the terminal and wonder what just happened — that's Claude updating its notes. The course wouldn't be complete without showing you exactly how it works, where it stores data, and how it interacts with the CLAUDE.md memory you write yourself.

Technical Definition

Auto Memory is an automatic note-taking system built into Claude Code. Claude decides at runtime whether something is worth remembering — build commands, debugging insights, architectural decisions, preferences it has discovered — and writes those notes to a per-project memory directory. The directory is loaded at the start of every future session. Auto Memory is complementary to CLAUDE.md, not a replacement: CLAUDE.md is what you tell Claude; Auto Memory is what Claude tells itself.

CLAUDE.md vs Auto Memory — Two Complementary Systems

Two Memory Systems, Different Authors
Aspect | CLAUDE.md files | Auto Memory
Who writes it | You | Claude
What it contains | Instructions, rules, conventions | Learnings, patterns, discovered preferences
Scope | Project · user · org | Per git repository (shared by all of its worktrees)
Loaded into | Every session, in full | Every session: first 200 lines or 25KB of MEMORY.md
Best for | Coding standards, architecture, "always do X" | Build quirks, debug insights, preferences Claude discovers
Storage | ./CLAUDE.md, ./.claude/CLAUDE.md, ~/.claude/CLAUDE.md | ~/.claude/projects/<project>/memory/

Both systems load at the start of every conversation. CLAUDE.md captures what you'd otherwise re-explain every session; Auto Memory captures what Claude would otherwise re-discover every session. Together they form a clean division of labor: you write the stable rules, Claude takes notes on the rest.

How It Works — The Memory Directory

Each git repository gets its own auto memory directory at ~/.claude/projects/<project>/memory/. The directory contains an entrypoint plus optional topic files:

Auto Memory Directory Layout
~/.claude/projects/<project>/memory/
  MEMORY.md            always loaded at session start (capped: first 200 lines or 25KB)
  debugging.md         on demand: debug patterns
  api-conventions.md   on demand: API decisions
  build-quirks.md      on demand: env gotchas
  preferences.md       on demand: user prefs

MEMORY.md is the index; topic files are read by Claude on demand using its file tools. All files are plain markdown, so you can read, edit, or delete them anytime via /memory.
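To make the loading rule concrete, here is a minimal sketch of how a capped read of MEMORY.md could work. It is not Claude Code's actual implementation: the function names are hypothetical, "25KB" is assumed to mean 25 * 1024 bytes, and the cap is assumed to stop at whichever limit is reached first.

```python
from pathlib import Path

MAX_LINES = 200          # line cap quoted in this section
MAX_BYTES = 25 * 1024    # assumption: "25KB" means 25 * 1024 bytes


def load_memory_index(memory_dir: Path) -> str:
    """Return the capped head of MEMORY.md, or "" if the file is missing."""
    index = memory_dir / "MEMORY.md"
    if not index.exists():
        return ""
    kept: list[str] = []
    used = 0
    with index.open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            size = len(line.encode("utf-8"))
            # Assumption: stop at whichever cap is hit first.
            if line_no > MAX_LINES or used + size > MAX_BYTES:
                break
            kept.append(line)
            used += size
    return "".join(kept)


def list_topic_files(memory_dir: Path) -> list[str]:
    """Topic files are not loaded eagerly; only their names are listed."""
    return sorted(
        p.name for p in memory_dir.glob("*.md") if p.name != "MEMORY.md"
    )
```

The practical consequence is the same regardless of the exact cap semantics: keep MEMORY.md short and index-like, and push detail into topic files that are read only when needed.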

The /memory Command

The /memory slash command is your control plane for both CLAUDE.md and Auto Memory. From inside a session, it lists every file currently loaded (CLAUDE.md, CLAUDE.local.md, rules files, MEMORY.md), lets you toggle Auto Memory on or off, and provides a link to open the auto memory folder. Selecting any file opens it in your editor.

Two natural prompts control where a note lands:

  • "Remember that X": for example, "Remember that the API tests need a local Redis on port 6380." Claude writes this to Auto Memory.
  • "Add to CLAUDE.md": for example, "Add this to CLAUDE.md." Claude appends the rule to your project CLAUDE.md instead.

Choose deliberately: rules everyone on the team needs go in CLAUDE.md (committed); per-machine quirks Claude discovered go in Auto Memory (machine-local).

Configuration & Toggles

{
  "autoMemoryEnabled": true,
  "autoMemoryDirectory": "~/my-custom-memory-dir"
}
  • autoMemoryEnabled — on by default. Toggle from /memory in-session, or set in user/local settings (not project settings, to prevent shared projects from disabling it for your machine).
  • autoMemoryDirectory — redirects the storage location. Accepted from policy, local, and user settings only — not project settings, so a shared repo cannot redirect your auto memory writes to a sensitive location.
  • CLAUDE_CODE_DISABLE_AUTO_MEMORY=1 — environment-variable kill switch for sandboxed runs (CI, ephemeral containers).
Auto Memory Is Machine-Local

All worktrees and subdirectories of the same git repo share one auto memory directory on your machine. But that directory is not synced across machines, cloud environments, or teammates. If you switch laptops, you start with empty Auto Memory. This is by design — Auto Memory captures machine-specific quirks (your local Redis port, your sandbox URL) that wouldn't be useful or safe to share.

For team-shared learnings, use CLAUDE.md (committed to git). For machine-local Claude-discovered notes, use Auto Memory.

🎓 Cert Tip — Domain 3 (Memory)

The exam may test recognition of when Auto Memory applies vs CLAUDE.md. Pattern: a developer says "every session I have to remind Claude that the local DB is on port 5433." Wrong answer: add to CLAUDE.md. Correct answer: that's exactly what Auto Memory exists for — it's machine-local, Claude-written, and survives across sessions. Use CLAUDE.md only when the rule should be shared with the team via git.

Now you've seen all five scopes of Claude memory: in-call (M01), cross-call server-assisted (M22, M11, M09), within-session (M26), persistent across sessions both human-authored (CLAUDE.md in M25) and Claude-authored (Auto Memory, this section), and external (this module). Two cert-critical patterns — crash recovery manifests and tool-output trimming — close out the memory toolkit before we drop into code.

Claude's Native Memory Landscape & Cert-Critical Patterns

You've learned the academic 3-tier architecture (working / episodic / procedural) and the managed memory tool. Before we build it in code, take one step back and look at the full set of Claude-native memory mechanisms. Claude provides at least twelve distinct memory primitives across five scopes, from the conversation-level message array to file-based persistent memory in Claude Code. Knowing these primitives, and which scope each one operates at, is what separates "I learned the API" from "I can architect a Claude system."

Claude Memory Mechanisms — The Full Landscape
1. Within a single API call
   Stateless. Memory = whatever you send in this request.
   • system prompt (M01)
   • messages array (M03)
   • tool_use cycle (M05)

2. Across API calls: developer-managed, server-assisted
   You re-send context every time, but Claude offers caching and managed stores to make it cheap.
   • prompt caching (M22)
   • memory tool (M11)
   • Files API (M09)
   • Citations (M09)

3. Within a session: Claude Code & Agent SDK
   The runtime keeps conversation state for you; you control compaction and forking.
   • --resume session (M26 sessions)
   • fork_session (M26 fork)
   • /compact (M26 compaction)
   • PreCompact hook (M26 hooks)
   • subagent isolation (M14 + M26)

4. Persistent across sessions: Claude Code file-based memory
   Files on disk that auto-load every session. CLAUDE.md = you write it. Auto Memory = Claude writes it.
   • CLAUDE.md hierarchy (M25)
   • Auto Memory (NEW) (M11, Auto Memory section)
   • Skills (M25 skills)
   • Slash commands (M25 commands)
   • Hooks (M25 + M26 hooks)
   • Plugins (M25 plugins)

5. External persistence: you build it
   When the managed primitives aren't enough: scale, compliance, custom retrieval policies.
   • RAG + vector DB (M09 RAG)
   • 3-tier memory architecture (M11, this module)
   • crash recovery manifests (M11, this section)
   • tool trim wrapper (M11, this section)

Reading the Landscape

Three signals tell you which scope to reach for. Volatility: if the data lives only for this turn, scope 1 is enough; just put it in the prompt. Reuse rate: if you'll re-send the same context across many calls, reach for scope 2: prompt caching can cut the cost of the cached input tokens by up to 90%, and the Files API lets you upload a document once and reference it by ID instead of re-uploading it. Persistence: if the data must outlive the session, scope 4 (CLAUDE.md, skills) auto-loads in every future session; scope 5 (your DB) handles scale and compliance. Most production agents use 3–4 of the five scopes simultaneously, not because more is better, but because each scope solves a different memory problem.
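The reuse-rate signal can be checked with simple arithmetic. The sketch below compares re-sending a shared prefix with and without prompt caching. The multipliers (a cache write billed at 1.25x the base input price, a cache read at 0.10x) and the $3 per million tokens price are assumptions for illustration; check the current pricing page before relying on them. It also assumes every call lands while the cache entry is still alive.

```python
def prefix_cost(
    prefix_tokens: int,
    calls: int,
    price_per_mtok: float,
    write_mult: float = 1.25,   # assumed cache-write multiplier
    read_mult: float = 0.10,    # assumed cache-read multiplier
) -> dict:
    """Compare the input cost of a shared prefix with and without caching."""
    per_call = prefix_tokens * price_per_mtok / 1_000_000
    uncached = per_call * calls
    # The first call writes the cache; the remaining calls read from it.
    cached = per_call * (write_mult + read_mult * (calls - 1))
    return {
        "uncached": uncached,
        "cached": cached,
        "savings": 1 - cached / uncached,
    }


result = prefix_cost(prefix_tokens=50_000, calls=20, price_per_mtok=3.0)
print(
    f"uncached ${result['uncached']:.2f}, cached ${result['cached']:.2f}, "
    f"savings {result['savings']:.0%}"
)
# uncached $3.00, cached $0.47, savings 84%
```

Two things fall out of the numbers: the savings approach 90% only as the number of reuses grows, and a prefix that is sent once costs more with caching than without it, because the write is billed at a premium.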

Cert-Critical Patterns: Crash Recovery & Tool-Output Trimming

Two memory patterns from scope 5 don't fit cleanly into the working/episodic/procedural taxonomy but show up repeatedly as Domain 5 cert questions. Both solve real production failures — an interrupted long-running session, and a context window flooded with verbose tool output.

Pattern 1: Crash Recovery Manifests

You're 40 minutes into a long-running migration. The agent has touched 18 files, run 6 test passes, and has 3 outstanding TODOs in flight. Your laptop reboots for an OS update. Without a manifest, the next session starts cold — you re-prompt from scratch and Claude rediscovers everything. With a manifest, the next session reads a small structured file and picks up exactly where it left off.

Technical Definition

A crash recovery manifest is a small structured file (typically .claude/state/manifest.json or a markdown scratchpad) that the agent writes after every meaningful step. It captures the minimum information needed to resume work: modified files with one-line summaries, current test status, the agreed-upon plan, and any unresolved TODOs. The manifest lives outside the context window, so it survives session crashes, compaction, and machine restarts.

Crash Recovery — Manifest as External State
Without manifest: steps 1 → 2 → 3 → crash at step 4 → restart from scratch at step 1 (re-prompt, rediscover everything; 40 minutes lost).
With manifest (cert-recommended): steps 1 → 2 → 3, with .claude/state/manifest.json written after every step → the next session resumes at step 5 with full state.

manifest.json schema:
{
  "modified_files": [...],
  "test_status": "3/47 fail",
  "plan": [step1, step2, ...],
  "current_step": 4,
  "open_todos": [...]
}

Pair with the PreCompact hook to also survive auto-compaction (M26).

from pathlib import Path
import json
import os
from datetime import datetime, timezone

MANIFEST = Path(".claude/state/manifest.json")

def write_manifest(modified_files, test_status, plan, current_step, open_todos):
    """Call after EVERY meaningful step in a long-running session."""
    MANIFEST.parent.mkdir(parents=True, exist_ok=True)
    payload = json.dumps({
        # timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
        "updated_at": datetime.now(timezone.utc).isoformat(),
        "modified_files": modified_files,   # [{"path": "src/api.ts", "summary": "added /v2/search"}]
        "test_status": test_status,         # "47/50 passing; auth_test.ts:42 failing"
        "plan": plan,                       # ["step1: schema", "step2: route", ...]
        "current_step": current_step,
        "open_todos": open_todos,           # ["fix locale fallback", "add rate-limit test"]
    }, indent=2)
    # Write to a temp file, then swap it in, so a crash mid-write
    # never leaves a half-written manifest behind.
    tmp = MANIFEST.with_suffix(".json.tmp")
    tmp.write_text(payload)
    os.replace(tmp, MANIFEST)

def read_manifest():
    """First action of any resumed session."""
    if not MANIFEST.exists():
        return None
    return json.loads(MANIFEST.read_text())

Pattern 2: Trimming Verbose Tool Outputs

The single fastest way to fill a context window is by piping in raw tool output. A Bash(grep -r foo .) call can return 8,000 lines of matches; a Read on a 4,000-line file dumps the entire file into context. Most of those tokens are noise — the agent only needs the first 50 matches or the relevant function. Trimming tool output before it enters context is a Domain 5 cert pattern that's easy to miss because it lives in the boundary between the tool and the agent.

Technical Definition

Tool-output trimming is a pre-context filter applied to tool results before they're appended to the message history. The pattern is: the tool runs, returns its full result; a wrapper trims the result to the relevant slice (top-N matches, head/tail of a file, summary of a long stdout); only the trimmed slice is sent back to the model. The full result is preserved in a side log for debugging; the model only sees the trimmed version.

Trim Strategies — Pick Per Tool
Tool output type | Trim strategy | Limit
grep / search results | Top-N by relevance + total count footer | 50 lines + "(... 247 more matches)"
File Read on large files | Offset + limit; agent re-reads on demand | 2000 lines per call
Bash stdout / log streams | Head + tail; drop middle | First 30 + last 30 lines
DB query results | LIMIT clause + row count | 100 rows max per call
HTTP responses | Strip headers, body limit | 10KB body slice
Test runner output | Failures only; pass-count summary | "47/50 passing" + 3 failure traces
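Two of the strategies in the table are simple enough to sketch directly. The function names are hypothetical, and the first one is a simplification: it keeps the first N lines in their original order instead of ranking matches by relevance, which would need a scoring step that depends on your tool.

```python
def trim_matches(output: str, limit: int = 50) -> str:
    """Keep the first `limit` lines and append a count of what was dropped."""
    lines = output.splitlines()
    if len(lines) <= limit:
        return output
    extra = len(lines) - limit
    return "\n".join(lines[:limit] + [f"(... {extra} more matches)"])


def trim_head_tail(output: str, head: int = 30, tail: int = 30) -> str:
    """Keep the first `head` and last `tail` lines; drop the middle."""
    lines = output.splitlines()
    if len(lines) <= head + tail:
        return output
    dropped = len(lines) - head - tail
    # Slice from an explicit index so tail=0 keeps nothing (lines[-0:] would keep all).
    kept_tail = lines[len(lines) - tail:]
    return "\n".join(
        lines[:head] + [f"(... {dropped} lines omitted)"] + kept_tail
    )
```

The footer line matters as much as the cut: it tells the model that more data exists, so it can decide to narrow the query instead of treating the slice as the full result.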
The Two-Tier Logging Rule

Never throw away the full result — just don't show it to the model. The trim pattern is: (1) the tool wrapper writes the full result to a side log on disk (.claude/state/tool_log.jsonl) for debugging and audit. (2) The wrapper returns a trimmed slice to the model. (3) If the agent needs more, it can call the tool again with a different query (more specific grep, different file offset). This way the model sees clean signal but you keep full forensic data.
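The three steps of the rule fit in one small wrapper. This is a minimal sketch under stated assumptions: the function name, the log record fields, and the idea of passing the trim strategy in as a callable are all illustrative choices, not part of any Claude API. Only the log path comes from the text above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable


def run_tool_trimmed(
    tool_name: str,
    tool_fn: Callable[..., str],
    args: dict,
    trim: Callable[[str], str],
    log_path: Path = Path(".claude/state/tool_log.jsonl"),
) -> str:
    """Run a tool, log its full output to disk, return only a trimmed slice."""
    full = tool_fn(**args)

    # (1) Full result goes to the side log for debugging and audit.
    log_path.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "tool": tool_name,
        "args": args,
        "full_lines": len(full.splitlines()),
        "result": full,
    }
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")

    # (2) Only the trimmed slice is returned for the model's context.
    # (3) If the agent needs more, it calls the tool again with new args.
    return trim(full)
```

Passing `trim` as a parameter keeps the wrapper generic: each tool can get its own strategy without changing the logging path.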

🎓 Cert Tip — Domain 5 (Context Optimization)

The exam tests recognition that "context degraded" isn't always solved by a bigger model or longer window — sometimes it's solved by less input. Trick scenario: an agent's quality drops after a Bash search returns 8000 lines. Wrong answers: switch to Opus, increase max_tokens, add more case facts. Correct answer: trim the tool output to the relevant slice before it enters the model's context. Pair with the manifest pattern above for crash recovery, and you've got the full Domain 5 toolkit.

Memory & Context Cert Topics — Where Each Lives

Domain 5 (15% of the exam) tests context management as a single discipline, but the topics are taught across multiple modules. Use this as your study map:

Cert Domain 5 — Topic-to-Module Map
Cert topic | Primary module
"Lost in the middle" effect & position-aware ordering | M03B Concept 4 + M08
Immutable "case facts" blocks at START position | M03B + M08
Progressive summarization risks & context rot | M03B Concept 5
3-tier memory (working / episodic / procedural) | M11 (this module)
Managed memory tool (Anthropic-hosted) | M11 (this module)
Auto Memory (Claude Code, machine-local) | M11 Auto Memory section
Crash recovery manifests | M11 (this section)
Trimming verbose tool outputs | M11 (this section)
/compact discipline (manual at 50%) | M26 Compaction section
CLAUDE.md compaction preservation directive | M26 Compaction section
PreCompact hook for critical state | M26 12 hook events
Subagent context isolation | M14 + M26
Session forking (fork_session) vs resume | M26 Sessions
Information provenance & claim-source mappings | M27B
Temporal data & as-of reasoning | M27B
Stratified sampling + field-level confidence | M27B
Synthesis output: well-established vs contested | M27B

With crash recovery and tool-output trimming closing the cert-critical patterns, you now have the complete memory toolkit. Time to drop into code: implementing the 3-tier system end-to-end, the version you'd reach for when the managed memory tool isn't a fit.

Code Walkthrough: 3-Tier Memory System

Step 1: Working Memory Class

Let's start with the simplest tier. The WorkingMemory class is essentially a key-value store (a Python dict with timestamps) that tracks the current task state. The interesting design choice is the to_prompt() method — it formats the entire state into a structured text block that you inject into Claude's system prompt. Without this, every LLM call within a multi-step task would be stateless — Claude would forget what it learned two tool calls ago. The scratchpad solves that by riding along with every API request.

import json
from datetime import datetime, timezone
from typing import Any


class WorkingMemory:
    """Fast, mutable scratchpad for current task state.

    Stores key-value pairs in memory and provides a formatted
    string for injection into Claude's system prompt.
    """

    def __init__(self, session_id: str):
        self.session_id = session_id
        self.created_at = datetime.now(timezone.utc).isoformat()
        self._state: dict[str, Any] = {}

    def set(self, key: str, value: Any) -> None:
        """Store a key-value pair in working memory."""
        self._state[key] = {
            "value": value,
            "updated_at": datetime.now(timezone.utc).isoformat()
        }

    def get(self, key: str, default: Any = None) -> Any:
        """Retrieve a value by key. Returns default if not found."""
        entry = self._state.get(key)
        return entry["value"] if entry else default

    def delete(self, key: str) -> bool:
        """Remove a key from working memory. Returns True if existed."""
        return self._state.pop(key, None) is not None

    def clear(self) -> None:
        """Clear all working memory state."""
        self._state.clear()

    def to_prompt(self) -> str:
        """Format working memory for injection into system prompt.

        Returns a structured string that Claude can parse to understand
        the current task state.
        """
        if not self._state:
            return "[Working Memory: empty — new task]"

        lines = [
            f"[Working Memory — Session {self.session_id}]",
            f"  Created: {self.created_at}"
        ]
        for key, entry in self._state.items():
            value = entry["value"]
            val = value if isinstance(value, str) else json.dumps(value)
            lines.append(f"  {key}: {val}")

        return "\n".join(lines)

    def to_dict(self) -> dict:
        """Serialize for persistence (e.g., to Redis or SQLite)."""
        return {
            "session_id": self.session_id,
            "created_at": self.created_at,
            "state": {
                k: v["value"] for k, v in self._state.items()
            }
        }


# --- Usage ---
wm = WorkingMemory(session_id="sess_001")
wm.set("intent", "find_deployment_date")
wm.set("entities", {"topic": "deployment", "timeframe": "last Tuesday"})
wm.set("search_results", [{"session": 42, "summary": "Deploy moved to Friday"}])

print(wm.to_prompt())
# [Working Memory — Session sess_001]
#   Created: 2026-03-29T10:00:00+00:00
#   intent: find_deployment_date
#   entities: {"topic": "deployment", "timeframe": "last Tuesday"}
#   search_results: [{"session": 42, "summary": "Deploy moved to Friday"}]
interface MemoryEntry {
  value: unknown;
  updatedAt: string;
}

class WorkingMemory {
  /** Fast, mutable scratchpad for current task state. */

  public readonly sessionId: string;
  public readonly createdAt: string;
  private state: Map<string, MemoryEntry> = new Map();

  constructor(sessionId: string) {
    this.sessionId = sessionId;
    this.createdAt = new Date().toISOString();
  }

  set(key: string, value: unknown): void {
    this.state.set(key, {
      value,
      updatedAt: new Date().toISOString(),
    });
  }

  get(key: string, defaultValue: unknown = null): unknown {
    const entry = this.state.get(key);
    return entry ? entry.value : defaultValue;
  }

  delete(key: string): boolean {
    return this.state.delete(key);
  }

  clear(): void {
    this.state.clear();
  }

  toPrompt(): string {
    if (this.state.size === 0) {
      return "[Working Memory: empty — new task]";
    }
    const lines = [
      `[Working Memory — Session ${this.sessionId}]`,
      `  Created: ${this.createdAt}`,
    ];
    for (const [key, entry] of this.state) {
      const val =
        typeof entry.value === "string"
          ? entry.value
          : JSON.stringify(entry.value);
      lines.push(`  ${key}: ${val}`);
    }
    return lines.join("\n");
  }

  toJSON(): Record<string, unknown> {
    const state: Record<string, unknown> = {};
    for (const [key, entry] of this.state) {
      state[key] = entry.value;
    }
    return {
      sessionId: this.sessionId,
      createdAt: this.createdAt,
      state,
    };
  }
}

// --- Usage ---
const wm = new WorkingMemory("sess_001");
wm.set("intent", "find_deployment_date");
wm.set("entities", { topic: "deployment", timeframe: "last Tuesday" });
wm.set("search_results", [{ session: 42, summary: "Deploy moved to Friday" }]);

console.log(wm.toPrompt());
What Just Happened?

You built a WorkingMemory class that acts as a structured scratchpad. It stores key-value pairs with timestamps, and the to_prompt() method formats the entire state into a string that you inject into Claude’s system prompt. This means every LLM call during a multi-step task can “see” everything the agent has learned so far.
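The injection step itself is small. The sketch below builds the keyword arguments for one Messages API call, appending the working-memory block to a base system prompt. The function name is hypothetical and the model ID is a placeholder you would replace with a real one; the memory block is whatever to_prompt() returned for the current state.

```python
def build_request(
    base_system: str,
    memory_block: str,
    user_message: str,
    model: str,
    max_tokens: int = 1024,
) -> dict:
    """Build Messages API kwargs with working memory in the system prompt."""
    # The scratchpad rides along with every request, after the static
    # instructions, so each call sees the current task state.
    system = f"{base_system}\n\n{memory_block}"
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": system,
        "messages": [{"role": "user", "content": user_message}],
    }


request = build_request(
    base_system="You are a helpful operations agent.",
    memory_block="[Working Memory]\n  intent: find_deployment_date",
    user_message="When is the deployment?",
    model="your-model-id",  # placeholder, not a real model name
)
print(request["system"])
```

With the Anthropic Python SDK, a dict like this would typically be passed as `client.messages.create(**request)`. Rebuilding it before every call is what keeps the scratchpad current as the state changes between tool calls.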

Step 2: Episodic Memory Class

Now for the tier that gives agents cross-session continuity. The EpisodicMemory class wraps ChromaDB to store conversation summaries and retrieve them by semantic similarity. The core idea: after each conversation, you store a summary; before each new conversation, you search for relevant past summaries and inject them into the prompt. One catch with ChromaDB: its default Python client is in-memory, which means all your carefully stored episodes vanish when the process restarts. That's why the Python constructor uses PersistentClient with a disk path. Skip this and you'll spend hours debugging "why doesn't my agent remember anything?"

import chromadb  # pip install "chromadb>=0.5.0"  (quote it so the shell doesn't treat > as a redirect)
import uuid
from datetime import datetime, timezone


class EpisodicMemory:
    """Stores and retrieves conversation summaries via semantic search.

    Uses ChromaDB with persistent storage so memories survive restarts.
    """

    def __init__(self, persist_dir: str = "./chroma_db", collection_name: str = "episodes"):
        try:
            self.client = chromadb.PersistentClient(path=persist_dir)
            self.collection = self.client.get_or_create_collection(
                name=collection_name,
                metadata={"hnsw:space": "cosine"}  # cosine similarity
            )
        except Exception as e:
            raise ConnectionError(
                f"Failed to connect to ChromaDB at {persist_dir}: {e}"
            ) from e

    def store_episode(
        self,
        summary: str,
        session_id: str,
        user_id: str = "default",
        topics: list[str] | None = None,
    ) -> str:
        """Store a conversation summary as an episode.

        Returns the episode ID for later reference.
        """
        episode_id = f"ep_{uuid.uuid4().hex[:12]}"
        metadata = {
            "session_id": session_id,
            "user_id": user_id,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "topics": ",".join(topics) if topics else "",
        }

        try:
            self.collection.add(
                documents=[summary],
                metadatas=[metadata],
                ids=[episode_id],
            )
        except Exception as e:
            raise RuntimeError(f"Failed to store episode: {e}") from e

        return episode_id

    def retrieve(
        self,
        query: str,
        n_results: int = 3,
        user_id: str | None = None,
    ) -> list[dict]:
        """Find the most relevant past episodes for a query.

        Returns a list of dicts with 'summary', 'session_id',
        'timestamp', and 'similarity' keys.
        """
        where_filter = {"user_id": user_id} if user_id else None

        try:
            results = self.collection.query(
                query_texts=[query],
                n_results=min(n_results, self.collection.count() or 1),
                where=where_filter,
            )
        except Exception as e:
            # Return empty on search failure — don't crash the agent
            print(f"Warning: episodic search failed: {e}")
            return []

        if not results["documents"] or not results["documents"][0]:
            return []

        episodes = []
        for i, doc in enumerate(results["documents"][0]):
            meta = results["metadatas"][0][i]
            distance = results["distances"][0][i] if results["distances"] else 0
            episodes.append({
                "summary": doc,
                "session_id": meta.get("session_id", ""),
                "timestamp": meta.get("timestamp", ""),
                # Filter out empty strings: "".split(",") yields [""], not []
                "topics": [t for t in meta.get("topics", "").split(",") if t],
                "similarity": round(1 - distance, 3),  # cosine distance → similarity
            })

        return episodes

    def to_prompt(self, query: str, n_results: int = 3) -> str:
        """Retrieve relevant episodes and format for prompt injection."""
        episodes = self.retrieve(query, n_results)
        if not episodes:
            return "[Episodic Memory: no relevant past interactions found]"

        lines = ["[Relevant Past Interactions]"]
        for ep in episodes:
            lines.append(
                f"  - Session {ep['session_id']} ({ep['timestamp'][:10]}): "
                f"{ep['summary']}"
            )
        return "\n".join(lines)


# --- Usage ---
em = EpisodicMemory(persist_dir="./memory_db")

# Store a past conversation summary
em.store_episode(
    summary="User asked about deployment schedule. Decision: deploy moved to Friday due to QA delays.",
    session_id="sess_042",
    user_id="user_alice",
    topics=["deployment", "scheduling"],
)

# Later, retrieve relevant episodes
results = em.retrieve("When is the deployment?", user_id="user_alice")
for r in results:
    print(f"[{r['similarity']}] Session {r['session_id']}: {r['summary']}")
// npm install chromadb@^1.9.0
import { ChromaClient, Collection } from "chromadb";
import { randomUUID } from "crypto";

interface Episode {
  summary: string;
  sessionId: string;
  timestamp: string;
  topics: string[];
  similarity: number;
}

class EpisodicMemory {
  /** Stores and retrieves conversation summaries via semantic search. */

  private client: ChromaClient;
  private collection: Collection | null = null;
  private collectionName: string;

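  constructor(collectionName: string = "episodes") {
    // Unlike Python's PersistentClient, the JS client connects to a running
    // Chroma server (http://localhost:8000 by default); persistence is
    // configured on that server, not here.
    this.client = new ChromaClient();
    this.collectionName = collectionName;
  }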

  async init(): Promise<void> {
    try {
      this.collection = await this.client.getOrCreateCollection({
        name: this.collectionName,
        metadata: { "hnsw:space": "cosine" },
      });
    } catch (error) {
      throw new Error(`Failed to connect to ChromaDB: ${error}`);
    }
  }

  async storeEpisode(
    summary: string,
    sessionId: string,
    userId: string = "default",
    topics: string[] = []
  ): Promise<string> {
    if (!this.collection) throw new Error("Call init() first");

    const episodeId = `ep_${randomUUID().replace(/-/g, "").slice(0, 12)}`;
    const metadata = {
      session_id: sessionId,
      user_id: userId,
      timestamp: new Date().toISOString(),
      topics: topics.join(","),
    };

    try {
      await this.collection.add({
        documents: [summary],
        metadatas: [metadata],
        ids: [episodeId],
      });
    } catch (error) {
      throw new Error(`Failed to store episode: ${error}`);
    }

    return episodeId;
  }

  async retrieve(
    query: string,
    nResults: number = 3,
    userId?: string
  ): Promise<Episode[]> {
    if (!this.collection) throw new Error("Call init() first");

    const whereFilter = userId ? { user_id: userId } : undefined;

    try {
      const count = await this.collection.count();
      const results = await this.collection.query({
        queryTexts: [query],
        nResults: Math.min(nResults, count || 1),
        where: whereFilter,
      });

      if (!results.documents?.[0]?.length) return [];

      return results.documents[0].map((doc, i) => ({
        summary: doc ?? "",
        sessionId: String(results.metadatas?.[0]?.[i]?.session_id ?? ""),
        timestamp: String(results.metadatas?.[0]?.[i]?.timestamp ?? ""),
        topics: String(results.metadatas?.[0]?.[i]?.topics ?? "")
          .split(",")
          .filter(Boolean),
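        // Use ?? instead of a truthiness check: a distance of 0 (exact match)
        // is falsy and would otherwise be reported as similarity 0.
        similarity:
          Math.round((1 - (results.distances?.[0]?.[i] ?? 1)) * 1000) / 1000,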
      }));
    } catch (error) {
      console.warn(`Episodic search failed: ${error}`);
      return [];
    }
  }

  async toPrompt(query: string, nResults: number = 3): Promise<string> {
    const episodes = await this.retrieve(query, nResults);
    if (!episodes.length) {
      return "[Episodic Memory: no relevant past interactions found]";
    }
    const lines = ["[Relevant Past Interactions]"];
    for (const ep of episodes) {
      lines.push(
        `  - Session ${ep.sessionId} (${ep.timestamp.slice(0, 10)}): ${ep.summary}`
      );
    }
    return lines.join("\n");
  }
}

// --- Usage ---
const em = new EpisodicMemory("episodes");
await em.init();

await em.storeEpisode(
  "User asked about deployment schedule. Decision: deploy moved to Friday.",
  "sess_042",
  "user_alice",
  ["deployment", "scheduling"]
);

const results = await em.retrieve("When is the deployment?", 3, "user_alice");
results.forEach((r) =>
  console.log(`[${r.similarity}] Session ${r.sessionId}: ${r.summary}`)
);
What Just Happened?

You built an EpisodicMemory class that stores conversation summaries in ChromaDB with metadata (session ID, user ID, timestamp, topics). The retrieve() method does semantic search — even if the user asks about “deployment” using different words, ChromaDB finds the matching episode. The to_prompt() method formats retrieved episodes for direct injection into Claude’s prompt.
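The similarity score in retrieve() comes from one line: 1 minus the cosine distance. The sketch below reproduces that relationship in plain Python on toy 3-dimensional vectors, with no ChromaDB involved. Real embeddings have hundreds of dimensions, and the vectors here are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Distance as used with the cosine space: 1 - similarity."""
    return 1 - cosine_similarity(a, b)


query = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]   # same meaning, different magnitude
unrelated = [0.0, 1.0, 0.0]        # orthogonal: nothing in common

for name, vec in [("same_direction", same_direction), ("unrelated", unrelated)]:
    distance = cosine_distance(query, vec)
    print(name, "distance:", round(distance, 3), "similarity:", round(1 - distance, 3))
```

Because cosine ignores vector length, a short summary and a long one about the same topic can still score close to 1, which is why it is a common choice for text retrieval.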

Step 3: Procedural Memory Class

The third tier is where agents get faster over time. ProceduralMemory stores reusable action templates (proven sequences of tool calls) and retrieves them by matching user requests to trigger conditions via semantic similarity. The clever part is the min_similarity threshold (0.7 by default): if no stored procedure matches the user's request closely enough, the agent falls back to reasoning from scratch instead of executing an irrelevant template. The agent thus avoids re-planning the same multi-step workflow every time, and the threshold keeps false matches out.

import chromadb
import json
import uuid


class ProceduralMemory:
    """Stores and retrieves reusable action templates (tool-call sequences).

    Trigger conditions are stored as embeddings for semantic matching.
    Execution steps are stored as structured JSON.
    """

    def __init__(self, persist_dir: str = "./chroma_db"):
        try:
            self.client = chromadb.PersistentClient(path=persist_dir)
            self.collection = self.client.get_or_create_collection(
                name="procedures",
                metadata={"hnsw:space": "cosine"},
            )
        except Exception as e:
            raise ConnectionError(f"Failed to connect to ChromaDB: {e}") from e

    def store_procedure(
        self,
        name: str,
        trigger: str,
        steps: list[dict],
        success_count: int = 1,
    ) -> str:
        """Store a reusable procedure.

        Args:
            name: Human-readable procedure name (e.g., 'weekly_report')
            trigger: Description of when this procedure applies
            steps: Ordered list of step dicts with 'tool', 'params', 'description'
            success_count: Number of times this procedure has succeeded
        """
        proc_id = f"proc_{uuid.uuid4().hex[:10]}"
        metadata = {
            "name": name,
            "steps_json": json.dumps(steps),
            "success_count": success_count,
        }

        try:
            self.collection.add(
                documents=[trigger],   # trigger is the searchable text
                metadatas=[metadata],
                ids=[proc_id],
            )
        except Exception as e:
            raise RuntimeError(f"Failed to store procedure: {e}") from e

        return proc_id

    def find_procedure(
        self,
        query: str,
        min_similarity: float = 0.7,
    ) -> dict | None:
        """Find the best matching procedure for a user request.

        Returns None if no procedure matches above the similarity threshold.
        """
        try:
            count = self.collection.count()
            if count == 0:
                return None

            results = self.collection.query(
                query_texts=[query],
                n_results=1,
            )
        except Exception as e:
            print(f"Warning: procedural search failed: {e}")
            return None

        if not results["documents"] or not results["documents"][0]:
            return None

        distance = results["distances"][0][0] if results["distances"] else 1.0
        similarity = 1 - distance

        if similarity < min_similarity:
            return None

        meta = results["metadatas"][0][0]
        return {
            "name": meta["name"],
            "trigger": results["documents"][0][0],
            "steps": json.loads(meta["steps_json"]),
            "success_count": meta.get("success_count", 0),
            "similarity": round(similarity, 3),
        }

    def to_prompt(self, query: str) -> str:
        """Find a matching procedure and format it as a suggested plan."""
        proc = self.find_procedure(query)
        if not proc:
            return "[Procedural Memory: no matching skill template found]"

        lines = [
            f"[Suggested Procedure: {proc['name']} (used {proc['success_count']}x)]",
        ]
        for i, step in enumerate(proc["steps"], 1):
            lines.append(
                f"  Step {i}: {step['description']} "
                f"(tool: {step['tool']})"
            )
        return "\n".join(lines)


# --- Usage ---
pm = ProceduralMemory(persist_dir="./memory_db")

# Store a proven procedure
pm.store_procedure(
    name="weekly_sales_report",
    trigger="Generate a weekly sales report with charts and insights",
    steps=[
        {"tool": "query_db", "params": {"query": "sales_last_7_days"}, "description": "Fetch sales data"},
        {"tool": "aggregate", "params": {"group_by": "region"}, "description": "Aggregate by region"},
        {"tool": "chart_gen", "params": {"type": "bar"}, "description": "Generate bar chart"},
        {"tool": "summarize", "params": {}, "description": "Extract key insights"},
        {"tool": "format_pdf", "params": {}, "description": "Format as PDF report"},
    ],
    success_count=12,
)

# Later, find a matching procedure
result = pm.find_procedure("Can you create the weekly sales report?")
if result:
    print(f"Found: {result['name']} (similarity: {result['similarity']})")
    for step in result["steps"]:
        print(f"  → {step['description']}")
import { ChromaClient, Collection } from "chromadb";
import { randomUUID } from "crypto";

interface ProcedureStep {
  tool: string;
  params: Record<string, unknown>;
  description: string;
}

interface Procedure {
  name: string;
  trigger: string;
  steps: ProcedureStep[];
  successCount: number;
  similarity: number;
}

class ProceduralMemory {
  /** Stores and retrieves reusable action templates. */

  private client: ChromaClient;
  private collection: Collection | null = null;

  constructor() {
    this.client = new ChromaClient();
  }

  async init(): Promise<void> {
    try {
      this.collection = await this.client.getOrCreateCollection({
        name: "procedures",
        metadata: { "hnsw:space": "cosine" },
      });
    } catch (error) {
      throw new Error(`Failed to connect to ChromaDB: ${error}`);
    }
  }

  async storeProcedure(
    name: string,
    trigger: string,
    steps: ProcedureStep[],
    successCount: number = 1
  ): Promise<string> {
    if (!this.collection) throw new Error("Call init() first");

    const procId = `proc_${randomUUID().replace(/-/g, "").slice(0, 10)}`;
    try {
      await this.collection.add({
        documents: [trigger],
        metadatas: [{
          name,
          steps_json: JSON.stringify(steps),
          success_count: successCount,
        }],
        ids: [procId],
      });
    } catch (error) {
      throw new Error(`Failed to store procedure: ${error}`);
    }
    return procId;
  }

  async findProcedure(
    query: string,
    minSimilarity: number = 0.7
  ): Promise<Procedure | null> {
    if (!this.collection) throw new Error("Call init() first");

    try {
      const count = await this.collection.count();
      if (count === 0) return null;

      const results = await this.collection.query({
        queryTexts: [query],
        nResults: 1,
      });

      if (!results.documents?.[0]?.length) return null;

      const distance = results.distances?.[0]?.[0] ?? 1.0;
      const similarity = 1 - distance;
      if (similarity < minSimilarity) return null;

      const meta = results.metadatas?.[0]?.[0] ?? {};
      return {
        name: String(meta.name ?? ""),
        trigger: results.documents[0][0] ?? "",
        steps: JSON.parse(String(meta.steps_json ?? "[]")),
        successCount: Number(meta.success_count ?? 0),
        similarity: Math.round(similarity * 1000) / 1000,
      };
    } catch (error) {
      console.warn(`Procedural search failed: ${error}`);
      return null;
    }
  }

  async toPrompt(query: string): Promise<string> {
    const proc = await this.findProcedure(query);
    if (!proc) {
      return "[Procedural Memory: no matching skill template found]";
    }
    const lines = [
      `[Suggested Procedure: ${proc.name} (used ${proc.successCount}x)]`,
    ];
    proc.steps.forEach((step, i) => {
      lines.push(`  Step ${i + 1}: ${step.description} (tool: ${step.tool})`);
    });
    return lines.join("\n");
  }
}

// --- Usage ---
const pm = new ProceduralMemory();
await pm.init();

await pm.storeProcedure(
  "weekly_sales_report",
  "Generate a weekly sales report with charts and insights",
  [
    { tool: "query_db", params: { query: "sales_last_7_days" }, description: "Fetch sales data" },
    { tool: "aggregate", params: { group_by: "region" }, description: "Aggregate by region" },
    { tool: "chart_gen", params: { type: "bar" }, description: "Generate bar chart" },
    { tool: "summarize", params: {}, description: "Extract key insights" },
    { tool: "format_pdf", params: {}, description: "Format as PDF report" },
  ],
  12
);

const result = await pm.findProcedure("Create the weekly sales report");
if (result) {
  console.log(`Found: ${result.name} (${result.similarity})`);
  result.steps.forEach((s) => console.log(`  → ${s.description}`));
}
What Just Happened?

You built a ProceduralMemory class that stores skill templates as pairs of a trigger description and an ordered list of steps. ChromaDB embeds the trigger text, so when a user request comes in, semantic search finds the closest matching procedure. The steps are stored as a JSON string in the record's metadata and formatted into the prompt as a suggested plan. The min_similarity threshold (0.7) filters out weak matches: if no procedure is similar enough, the agent falls back to reasoning from scratch.

Memory Manager — Orchestrating All Three Tiers

Step 4: Conversation Summarizer

This is the bridge between short-term conversation and long-term memory. The summarize_conversation function takes a full conversation log and asks Claude a simple question: "What were the important parts of this conversation?" Claude extracts the essential information — topics discussed, decisions made, user preferences learned, and action items remaining — and returns it as a compact JSON record.

The output is roughly 200-400 tokens, about 10-20x smaller than the raw conversation. That compression is what makes episodic memory economically viable: without it, every recalled episode would pull a full transcript back into the prompt, and storing transcripts for 500 daily conversations would steadily bloat the vector store. The interesting design choice here is using Claude itself as the summarizer. It handles conversational nuance that rule-based extraction tends to miss, and the cost (~$0.002 per summary) is small compared to the savings on future prompt tokens.

import anthropic  # pip install "anthropic>=0.30.0"  (quote it so the shell does not treat > as a redirect)
import json


async def summarize_conversation(
    messages: list[dict],
    client: anthropic.AsyncAnthropic | None = None,
) -> dict:
    """Summarize a conversation into a structured episode record.

    Args:
        messages: List of conversation messages (role + content dicts)
        client: Anthropic client (creates one if not provided)

    Returns:
        Dict with: summary, topics, decisions, user_preferences, action_items
    """
    if client is None:
        client = anthropic.AsyncAnthropic()  # reads ANTHROPIC_API_KEY env var

    # Format the conversation for summarization
    conversation_text = "\n".join(
        f"{msg['role'].upper()}: {msg['content']}" for msg in messages
    )

    try:
        response = await client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=500,
            system=(
                "You are a conversation summarizer. Extract the key information "
                "from the conversation and return a JSON object with these fields:\n"
                '- "summary": 1-2 sentence overview of the conversation\n'
                '- "topics": list of main topics discussed\n'
                '- "decisions": list of decisions made (empty if none)\n'
                '- "user_preferences": any user preferences learned (empty if none)\n'
                '- "action_items": pending items (empty if none)\n'
                "Return ONLY valid JSON, no markdown fences."
            ),
            messages=[{"role": "user", "content": conversation_text}],
        )

        result = json.loads(response.content[0].text)
        return result

    except json.JSONDecodeError:
        # Fallback: return raw summary if Claude didn't produce valid JSON
        return {
            "summary": response.content[0].text[:500],
            "topics": [],
            "decisions": [],
            "user_preferences": [],
            "action_items": [],
        }
    except anthropic.APIError as e:
        raise RuntimeError(f"Summarization API call failed: {e}") from e
import Anthropic from "@anthropic-ai/sdk"; // npm install @anthropic-ai/sdk@^0.30.0

interface EpisodeRecord {
  summary: string;
  topics: string[];
  decisions: string[];
  userPreferences: string[];
  actionItems: string[];
}

async function summarizeConversation(
  messages: Array<{ role: string; content: string }>,
  client?: Anthropic
): Promise<EpisodeRecord> {
  const anthropic = client ?? new Anthropic(); // reads ANTHROPIC_API_KEY

  const conversationText = messages
    .map((msg) => `${msg.role.toUpperCase()}: ${msg.content}`)
    .join("\n");

  try {
    const response = await anthropic.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 500,
      system:
        "You are a conversation summarizer. Extract key information " +
        "and return a JSON object with: summary, topics, decisions, " +
        "user_preferences, action_items. Return ONLY valid JSON.",
      messages: [{ role: "user", content: conversationText }],
    });

    const text =
      response.content[0].type === "text" ? response.content[0].text : "";
    const parsed = JSON.parse(text);
    return {
      summary: parsed.summary ?? "",
      topics: parsed.topics ?? [],
      decisions: parsed.decisions ?? [],
      userPreferences: parsed.user_preferences ?? [],
      actionItems: parsed.action_items ?? [],
    };
  } catch (error) {
    if (error instanceof SyntaxError) {
      return {
        summary: "Summarization produced invalid JSON",
        topics: [],
        decisions: [],
        userPreferences: [],
        actionItems: [],
      };
    }
    throw new Error(`Summarization failed: ${error}`);
  }
}
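
Even with the "no markdown fences" instruction, a model can occasionally wrap its JSON in a code fence or leave out a field. The sketch below shows one more forgiving way to parse the summarizer's reply. The helper name parse_episode_json and the fence-stripping pattern are assumptions for illustration and are not part of the Anthropic SDK; the field names are the ones requested in the summarizer prompt above.

```python
import json
import re

EPISODE_FIELDS = ("topics", "decisions", "user_preferences", "action_items")

# Three backticks, built indirectly so this listing stays easy to embed
FENCE = "`" * 3
FENCED_JSON = re.compile(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", re.DOTALL)


def parse_episode_json(text: str) -> dict:
    """Parse summarizer output into an episode record with every field present."""
    cleaned = text.strip()
    # Unwrap a surrounding code fence if the model added one
    fenced = FENCED_JSON.match(cleaned)
    if fenced:
        cleaned = fenced.group(1)
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict):
        # Fall back to using the raw text as the summary
        return {"summary": text.strip()[:500], **{f: [] for f in EPISODE_FIELDS}}
    record = {"summary": str(data.get("summary", ""))}
    for field in EPISODE_FIELDS:
        value = data.get(field) or []
        # Normalize a bare string or number into a one-item list
        record[field] = value if isinstance(value, list) else [value]
    return record
```

In summarize_conversation, a helper like this could stand in for the bare json.loads call, so a fenced or partial reply still yields a usable episode record instead of falling through to the error path.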

Step 5: Memory Manager

Finally, the orchestrator that ties everything together. The MemoryManager has a clear lifecycle: start_session() creates fresh working memory and resets the conversation log. During the conversation, you update working memory, log each turn, and call build_memory_context(), which retrieves the episodes and procedures relevant to the current message and combines all three tiers into a single formatted string that becomes part of Claude's system prompt. end_session() summarizes the conversation, stores the summary in episodic memory, and clears working memory. Individual memory classes are building blocks; the Memory Manager is the glue that makes them work as a coherent system.

import anthropic


class MemoryManager:
    """Orchestrates all three memory tiers for a conversational agent.

    Lifecycle:
      1. start_session(): creates working memory and resets the conversation log
      2. During conversation: update working memory via set/get, log turns, and call
         build_memory_context() to pull in relevant episodes and procedures
      3. end_session(): summarizes conversation, stores episode, clears working memory
    """

    def __init__(self, persist_dir: str = "./memory_db"):
        self.episodic = EpisodicMemory(persist_dir=persist_dir)
        self.procedural = ProceduralMemory(persist_dir=persist_dir)
        self.working: WorkingMemory | None = None
        self.client = anthropic.AsyncAnthropic()
        self.conversation_log: list[dict] = []

    def start_session(self, session_id: str, user_id: str = "default") -> str:
        """Initialize a new session with fresh working memory.

        Returns the formatted memory context for the first prompt.
        """
        self.working = WorkingMemory(session_id=session_id)
        self.working.set("user_id", user_id)
        self.conversation_log = []
        return self.working.to_prompt()

    def build_memory_context(self, user_message: str) -> str:
        """Build the full memory context to inject into Claude's system prompt.

        Combines all three tiers into a single formatted string.
        """
        if not self.working:
            raise RuntimeError("No active session. Call start_session() first.")

        sections = [
            self.working.to_prompt(),
            self.episodic.to_prompt(user_message, n_results=3),
            self.procedural.to_prompt(user_message),
        ]
        return "\n\n".join(sections)

    def log_turn(self, role: str, content: str) -> None:
        """Log a conversation turn for later summarization."""
        self.conversation_log.append({"role": role, "content": content})

    async def end_session(self) -> str | None:
        """End the session: summarize, store episode, clear working memory.

        Returns the episode ID if a summary was stored, None otherwise.
        """
        if not self.working or not self.conversation_log:
            return None

        session_id = self.working.session_id
        user_id = self.working.get("user_id", "default")

        # Summarize the conversation
        try:
            summary_record = await summarize_conversation(
                self.conversation_log, self.client
            )
        except Exception as e:
            print(f"Warning: summarization failed: {e}")
            summary_record = {
                "summary": f"Session {session_id} — summarization failed",
                "topics": [],
            }

        # Store in episodic memory
        episode_id = self.episodic.store_episode(
            summary=summary_record["summary"],
            session_id=session_id,
            user_id=user_id,
            topics=summary_record.get("topics", []),
        )

        # Clear working memory
        self.working.clear()
        self.working = None
        self.conversation_log = []

        return episode_id


# --- Usage: Full agent loop ---
async def agent_loop():
    """Demonstrates a full conversation with memory management."""
    manager = MemoryManager(persist_dir="./memory_db")
    client = anthropic.AsyncAnthropic()

    # Start session
    session_id = "sess_044"
    manager.start_session(session_id, user_id="user_alice")

    # Simulate a user message
    user_msg = "What did we decide about the deployment schedule?"
    manager.log_turn("user", user_msg)

    # Update working memory first so the context built below includes it
    manager.working.set("intent", "recall_decision")
    manager.working.set("topic", "deployment schedule")

    # Build memory-enriched system prompt
    memory_context = manager.build_memory_context(user_msg)

    system_prompt = f"""You are a helpful assistant with memory.

{memory_context}

Use the memory context above to provide informed, personalized responses.
If you find relevant past interactions, reference them naturally."""

    # Call Claude with memory context
    response = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": user_msg}],
    )

    assistant_msg = response.content[0].text
    manager.log_turn("assistant", assistant_msg)
    print(f"Agent: {assistant_msg}")

    # End session: summarize and persist
    episode_id = await manager.end_session()
    print(f"Session saved as episode: {episode_id}")


if __name__ == "__main__":
    import asyncio

    asyncio.run(agent_loop())
import Anthropic from "@anthropic-ai/sdk";

class MemoryManager {
  /** Orchestrates all three memory tiers for a conversational agent. */

  private episodic: EpisodicMemory;
  private procedural: ProceduralMemory;
  private working: WorkingMemory | null = null;
  private client: Anthropic;
  private conversationLog: Array<{ role: string; content: string }> = [];

  constructor() {
    this.episodic = new EpisodicMemory("episodes");
    this.procedural = new ProceduralMemory();
    this.client = new Anthropic();
  }

  async init(): Promise<void> {
    await this.episodic.init();
    await this.procedural.init();
  }

  startSession(sessionId: string, userId: string = "default"): string {
    this.working = new WorkingMemory(sessionId);
    this.working.set("user_id", userId);
    this.conversationLog = [];
    return this.working.toPrompt();
  }

  async buildMemoryContext(userMessage: string): Promise<string> {
    if (!this.working) throw new Error("No active session");

    const sections = [
      this.working.toPrompt(),
      await this.episodic.toPrompt(userMessage, 3),
      await this.procedural.toPrompt(userMessage),
    ];
    return sections.join("\n\n");
  }

  logTurn(role: string, content: string): void {
    this.conversationLog.push({ role, content });
  }

  async endSession(): Promise<string | null> {
    if (!this.working || !this.conversationLog.length) return null;

    const sessionId = this.working.sessionId;
    const userId = String(this.working.get("user_id", "default"));

    let summaryRecord: EpisodeRecord;
    try {
      summaryRecord = await summarizeConversation(
        this.conversationLog,
        this.client
      );
    } catch {
      summaryRecord = {
        summary: `Session ${sessionId} — summarization failed`,
        topics: [],
        decisions: [],
        userPreferences: [],
        actionItems: [],
      };
    }

    const episodeId = await this.episodic.storeEpisode(
      summaryRecord.summary,
      sessionId,
      userId,
      summaryRecord.topics
    );

    this.working.clear();
    this.working = null;
    this.conversationLog = [];

    return episodeId;
  }
}

// --- Usage: Full agent loop ---
async function agentLoop() {
  const manager = new MemoryManager();
  await manager.init();
  const client = new Anthropic();

  manager.startSession("sess_044", "user_alice");

  const userMsg = "What did we decide about the deployment schedule?";
  manager.logTurn("user", userMsg);

  const memoryContext = await manager.buildMemoryContext(userMsg);

  const systemPrompt = `You are a helpful assistant with memory.

${memoryContext}

Use the memory context above to provide informed, personalized responses.`;

  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: systemPrompt,
    messages: [{ role: "user", content: userMsg }],
  });

  const text =
    response.content[0].type === "text" ? response.content[0].text : "";
  manager.logTurn("assistant", text);
  console.log(`Agent: ${text}`);

  const episodeId = await manager.endSession();
  console.log(`Session saved as episode: ${episodeId}`);
}

agentLoop().catch(console.error);
What Just Happened?

You built a complete MemoryManager that orchestrates all three tiers. The lifecycle is: start_session() creates working memory → build_memory_context() combines all three tiers into a block for the system prompt → during the conversation, you update working memory and log turns → end_session() summarizes the conversation, stores it in episodic memory, and clears working memory. Each Claude call gets a memory context that includes current task state, relevant past interactions, and matching skill templates. With three short episode summaries and one procedure, that block should usually stay under about 2,000 tokens of memory overhead.
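
A 2,000-token target is easier to rely on with an explicit cap. One possible approach is to trim the memory sections in priority order before joining them. The sketch below assumes a rough heuristic of about 4 characters per token (an approximation, not the model tokenizer's real count), and the helper names estimate_tokens and fit_to_budget are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: about 4 characters per token (assumption)."""
    return max(1, len(text) // 4)


def fit_to_budget(sections: list[str], max_tokens: int = 2000) -> str:
    """Join memory sections in priority order, truncating or dropping
    later sections once the estimated token budget is used up."""
    kept: list[str] = []
    remaining = max_tokens
    for section in sections:  # earlier sections have higher priority
        cost = estimate_tokens(section)
        if cost <= remaining:
            kept.append(section)
            remaining -= cost
        elif remaining > 0:
            # Keep as much of this section as still fits, then stop
            kept.append(section[: remaining * 4] + " [truncated]")
            break
        else:
            break
    return "\n\n".join(kept)
```

Passing the three to_prompt() strings in the order working, episodic, procedural keeps current task state intact and trims retrieved material first. For exact counts you would swap the heuristic for a real tokenizer or a token-counting API call.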

Cost Consideration

The summarization call at session end uses roughly 500 input tokens and 200 output tokens for a short conversation, costing roughly $0.002 per session with Claude Sonnet (check current pricing, since rates change). For 500 sessions/day, that’s $1/day in summarization costs. Longer sessions cost proportionally more, because the summarizer reads the entire conversation log. The savings from not replaying full conversation history (4,000+ tokens per session) far outweigh this cost: roughly $8/day saved on input tokens alone.
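
The arithmetic behind estimates like these is simple enough to script, which helps when your session volume or model pricing changes. The sketch below is plain arithmetic. The per-million-token prices are placeholder assumptions rather than quoted rates, so substitute the current published figures before relying on the result; the function name daily_cost is hypothetical.

```python
def daily_cost(
    sessions_per_day: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimated daily spend in dollars, assuming one call per session."""
    per_session = (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000
    return sessions_per_day * per_session


# Placeholder prices (assumptions): replace with current published rates
INPUT_PRICE = 3.00    # dollars per million input tokens
OUTPUT_PRICE = 15.00  # dollars per million output tokens

summarize = daily_cost(500, 500, 200, INPUT_PRICE, OUTPUT_PRICE)
replay = daily_cost(500, 4000, 0, INPUT_PRICE, OUTPUT_PRICE)
print(f"Summarization: ${summarize:.2f}/day")
print(f"Replaying 4,000 tokens once per session: ${replay:.2f}/day")
```

Treat the output as an order-of-magnitude check. The totals shift with the prices you plug in, with how long your transcripts really are, and with how many turns would otherwise replay the old history.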

Hands-On Exercise

What You'll Build

A 3-tier memory system (working, episodic, procedural) with a MemoryManager that wires them together. You'll run a multi-turn session, save it to episodic memory, start a new session, and verify the agent remembers key details from the first session.

Time Estimate: 45–60 minutes

Prerequisites: Python 3.10+, an Anthropic API key (console.anthropic.com), and a terminal

Files You'll Create:

  • memory_system.py — All 3 memory tiers + MemoryManager + test harness
  • ./memory_db/ — Auto-created ChromaDB persistent storage directory

Environment Setup

mkdir memory-lab && cd memory-lab
python -m venv venv && source venv/bin/activate   # Windows: venv\Scripts\activate
pip install "anthropic>=0.30.0" "chromadb>=0.4.0"
export ANTHROPIC_API_KEY=your-key-here             # Windows: set ANTHROPIC_API_KEY=your-key-here

Step 1: Build All 3 Memory Tiers + MemoryManager

This step creates the complete memory system in one file: WorkingMemory (key-value scratchpad), EpisodicMemory (ChromaDB-backed past interactions), ProceduralMemory (reusable action templates), and a MemoryManager that wires them together and provides build_context() for injecting memory into prompts.

Create a new file called memory_system.py and add the following:

import json
import time
import uuid
import chromadb
import anthropic

client = anthropic.Anthropic()

# ── Tier 1: Working Memory (key-value scratchpad) ───────────
class WorkingMemory:
    """Fast, mutable scratchpad for current task state."""

    def __init__(self):
        self._store: dict = {}
        self.session_id = f"sess_{uuid.uuid4().hex[:8]}"
        self.created_at = time.time()

    def set(self, key: str, value) -> None:
        self._store[key] = value

    def get(self, key: str, default=None):
        return self._store.get(key, default)

    def delete(self, key: str) -> None:
        self._store.pop(key, None)

    def clear(self) -> None:
        self._store.clear()

    def to_prompt(self) -> str:
        if not self._store:
            return "[Working Memory: empty]"
        lines = [f"[Working Memory — Session {self.session_id}]"]
        for k, v in self._store.items():
            lines.append(f"  {k}: {v}")
        return "\n".join(lines)

    def to_dict(self) -> dict:
        return dict(self._store)


# ── Tier 2: Episodic Memory (past interactions via ChromaDB) ─
class EpisodicMemory:
    """Searchable archive of past conversation summaries."""

    def __init__(self, persist_dir: str = "./memory_db"):
        self._client = chromadb.PersistentClient(path=persist_dir)
        self._collection = self._client.get_or_create_collection(
            name="episodes", metadata={"hnsw:space": "cosine"}
        )

    def store_episode(self, summary: str, session_id: str, metadata: dict | None = None) -> str:
        episode_id = f"ep_{uuid.uuid4().hex[:12]}"
        meta = {
            "session_id": session_id,
            "timestamp": time.time(),
            **(metadata or {}),
        }
        self._collection.add(
            ids=[episode_id],
            documents=[summary],
            metadatas=[meta],
        )
        return episode_id

    def recall(self, query: str, top_k: int = 3) -> list[dict]:
        if self._collection.count() == 0:
            return []
        results = self._collection.query(query_texts=[query], n_results=min(top_k, self._collection.count()))
        episodes = []
        for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
            episodes.append({"summary": doc, **meta})
        return episodes

    def to_prompt(self, query: str) -> str:
        episodes = self.recall(query, top_k=2)
        if not episodes:
            return "[Past Interactions: none yet]"
        lines = ["[Relevant Past Interactions]"]
        for ep in episodes:
            ts = time.strftime("%Y-%m-%d", time.localtime(ep.get("timestamp", 0)))
            lines.append(f"  - Session {ep.get('session_id', '?')} ({ts}): {ep['summary']}")
        return "\n".join(lines)

    @property
    def count(self) -> int:
        return self._collection.count()


# ── Tier 3: Procedural Memory (reusable action templates) ────
class ProceduralMemory:
    """Library of proven tool-call sequences."""

    def __init__(self):
        self._procedures: dict[str, dict] = {}

    def store_procedure(self, name: str, description: str, steps: list[str]) -> None:
        self._procedures[name] = {
            "description": description,
            "steps": steps,
            "usage_count": 0,
        }

    def find_procedure(self, query: str) -> dict | None:
        """Simple keyword matching (production: use embeddings)."""
        query_lower = query.lower()
        best_match, best_score = None, 0
        for name, proc in self._procedures.items():
            score = sum(1 for word in query_lower.split()
                       if word in name.lower() or word in proc["description"].lower())
            if score > best_score:
                best_match, best_score = (name, proc), score
        if best_match and best_score > 0:
            best_match[1]["usage_count"] += 1
            return {"name": best_match[0], **best_match[1]}
        return None

    def to_prompt(self, query: str) -> str:
        proc = self.find_procedure(query)
        if not proc:
            return "[Procedures: no matching template]"
        steps_str = "\n".join(f"    Step {i+1}: {s}" for i, s in enumerate(proc["steps"]))
        return f"[Suggested Procedure: {proc['name']} (used {proc['usage_count']}x)]\n{steps_str}"


# ── MemoryManager (wires all 3 tiers together) ───────────────
class MemoryManager:
    """Orchestrates all three memory tiers."""

    def __init__(self, persist_dir: str = "./memory_db"):
        self.working = WorkingMemory()
        self.episodic = EpisodicMemory(persist_dir=persist_dir)
        self.procedural = ProceduralMemory()
        self.conversation_log: list[dict] = []

    def build_context(self, current_query: str) -> str:
        """Build combined memory context for injection into the LLM prompt."""
        parts = [
            self.working.to_prompt(),
            self.episodic.to_prompt(current_query),
            self.procedural.to_prompt(current_query),
        ]
        return "\n\n".join(parts)

    def log_turn(self, role: str, content: str) -> None:
        self.conversation_log.append({"role": role, "content": content})

    def end_session(self, summary: str | None = None) -> str:
        """Summarize and archive the session to episodic memory."""
        if not summary and self.conversation_log:
            # Auto-summarize with Claude
            transcript = "\n".join(f"{m['role']}: {m['content']}" for m in self.conversation_log)
            try:
                response = client.messages.create(
                    model="claude-sonnet-4-6", max_tokens=300,
                    system="Summarize this conversation in 2-3 sentences. Preserve key decisions, preferences, and facts.",
                    messages=[{"role": "user", "content": transcript}],
                )
                summary = response.content[0].text
            except Exception:
                summary = f"Session with {len(self.conversation_log)} turns."
        if not summary:
            # No turns were logged and no summary was passed in; ChromaDB needs a string document
            summary = "Empty session (no turns logged)."

        episode_id = self.episodic.store_episode(
            summary=summary,
            session_id=self.working.session_id,
            metadata={"turns": len(self.conversation_log)},
        )
        # Reset for next session
        self.working.clear()
        self.working = WorkingMemory()  # new session_id
        self.conversation_log = []
        return episode_id

    def chat(self, user_message: str) -> str:
        """Send a message with full memory context."""
        self.log_turn("user", user_message)
        memory_context = self.build_context(user_message)

        response = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=1024,
            system=(
                "You are a helpful assistant with multi-session memory. "
                "Use the memory context below to inform your responses.\n\n"
                f"{memory_context}"
            ),
            messages=[{"role": m["role"], "content": m["content"]}
                      for m in self.conversation_log],
        )
        reply = response.content[0].text
        self.log_turn("assistant", reply)
        return reply


# ── Tests ────────────────────────────────────────────────────
if __name__ == "__main__":
    import shutil
    # Clean up any previous test data
    shutil.rmtree("./memory_db", ignore_errors=True)

    print("═" * 55)
    print("TEST 1: Individual Memory Tiers")
    print("═" * 55)

    # Working Memory
    wm = WorkingMemory()
    wm.set("intent", "schedule_meeting")
    wm.set("date", "2026-04-05")
    wm.set("attendees", ["Alice", "Bob"])
    print("\n  Working Memory:")
    print("  " + wm.to_prompt().replace("\n", "\n  "))

    # Episodic Memory
    em = EpisodicMemory(persist_dir="./memory_db")
    em.store_episode("User prefers email over Slack for notifications.", "sess_001")
    em.store_episode("Discussed deployment schedule. Decision: deploy Friday.", "sess_002")
    em.store_episode("Reviewed API rate limits. Set max to 1000 req/min.", "sess_003")
    print(f"\n  Episodic Memory: {em.count} episodes stored")
    print("  " + em.to_prompt("deployment schedule").replace("\n", "\n  "))

    # Procedural Memory
    pm = ProceduralMemory()
    pm.store_procedure("generate_report", "Generate a formatted report from data",
                       ["Gather data from sources", "Analyze key metrics", "Format as markdown", "Send to user"])
    pm.store_procedure("debug_api_error", "Debug and fix API errors",
                       ["Check error code and message", "Search logs for context", "Identify root cause", "Suggest fix"])
    print(f"\n  Procedural Memory:")
    print("  " + pm.to_prompt("generate a report").replace("\n", "\n  "))

    print(f"\n{'═' * 55}")
    print("TEST 2: Cross-Session Memory (with Claude)")
    print("═" * 55)

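    # Session 1
    # Drop the episodes seeded in Test 1 so Session 2 can only recall what Session 1 stored
    chromadb.PersistentClient(path="./memory_db").delete_collection("episodes")
    mm = MemoryManager(persist_dir="./memory_db")
    mm.working.set("intent", "setup_preferences")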

    print("\n  Session 1:")
    reply1 = mm.chat("Hi! I prefer getting notifications via email, not Slack.")
    print(f"    Turn 1: {reply1[:120]}...")
    reply2 = mm.chat("Also, let's plan to deploy the new API on Friday.")
    print(f"    Turn 2: {reply2[:120]}...")

    ep_id = mm.end_session()
    print(f"    Session ended → episode {ep_id}")

    # Session 2 — should remember preferences from Session 1
    print("\n  Session 2:")
    reply3 = mm.chat("Do you remember my notification preference?")
    print(f"    Turn 1: {reply3[:150]}...")
    print(f"\n  Memory context that was injected:")
    print("    " + mm.build_context("notification preference").replace("\n", "\n    "))

    # Cleanup
    shutil.rmtree("./memory_db", ignore_errors=True)
    print(f"\n{'═' * 55}")
    print("✓ All tests complete.")
    print("═" * 55)

Run it: This single command runs both Test 1 (individual tiers, no API calls) and Test 2 (cross-session memory with live Claude calls). Make sure your ANTHROPIC_API_KEY is set before running.

Command
python memory_system.py
Expected Output (abbreviated)
═══════════════════════════════════════════════════════
TEST 1: Individual Memory Tiers
═══════════════════════════════════════════════════════

  Working Memory:
  [Working Memory — Session sess_a3f2b1c9]
  intent: schedule_meeting
  date: 2026-04-05
  attendees: ['Alice', 'Bob']

  Episodic Memory: 3 episodes stored
  [Relevant Past Interactions]
  - Session sess_002 (today's date): Discussed deployment schedule. Decision: deploy Friday.
  - Session sess_001 (today's date): User prefers email over Slack for notifications.

  Procedural Memory:
  [Suggested Procedure: generate_report (used 1x)]
  Step 1: Gather data from sources
  Step 2: Analyze key metrics
  Step 3: Format as markdown
  Step 4: Send to user

═══════════════════════════════════════════════════════
TEST 2: Cross-Session Memory (with Claude)
═══════════════════════════════════════════════════════

  Session 1:
    Turn 1: I've noted your preference for email notifications...
    Turn 2: Great! I'll plan for a Friday deployment of the new API...
    Session ended → episode ep_a3f2b1c9d0e4

  Session 2:
    Turn 1: Yes! From our previous conversation, I can see you prefer email notifications over Slack...

  Memory context that was injected:
    [Working Memory: empty]
    [Relevant Past Interactions]
    - Session sess_a3f2b1c9 (2026-05-07): User prefers email...
    [Procedures: no matching template]

═══════════════════════════════════════════════════════
✓ All tests complete.
═══════════════════════════════════════════════════════
✅ Checkpoint

Look for these key behaviors:

  • Test 1: All 3 tiers produce formatted prompt output — working memory shows key-value pairs, episodic shows matching summaries, procedural shows matching template
  • Test 2, Session 1: Claude responds to preferences and deployment planning normally
  • Test 2, Session 2: Claude should reference the email preference from Session 1 — this confirms episodic memory retrieval is working
  • Memory context: Should show past interactions injected with the correct session ID and summary text
Troubleshooting
  • ModuleNotFoundError: No module named 'chromadb' → Run pip install chromadb
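  • Episodic memory returns empty in Session 2 → Make sure end_session() was called after Session 1. Check that the ./memory_db directory was created, which only happens when ChromaDB is opened with PersistentClient. The script deletes this directory in its cleanup step, so look for it while the script is still running or temporarily comment out the cleanup.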
  • Claude doesn't mention the preference in Session 2 → Check the memory context output. If it shows "[Past Interactions: none yet]", the episode wasn't stored. Verify end_session() succeeded.
  • Permission error on ./memory_db → Delete the memory_db folder and try again. On Windows, make sure no other process has the folder open.
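
The directory check in the troubleshooting list above can be scripted. The helper below is a minimal sketch that uses only the standard library. `describe_persist_dir` is a hypothetical name, not part of memory_system.py, and it only reports whether the folder exists and how many files it holds. It does not inspect what ChromaDB stored inside.

```python
from pathlib import Path


def describe_persist_dir(path="./memory_db"):
    """Return a one-line status for the persistence directory.

    Hypothetical helper: it checks the filesystem only and knows
    nothing about ChromaDB's internal file layout.
    """
    p = Path(path)
    if not p.exists():
        return f"{p}: missing"
    if not p.is_dir():
        return f"{p}: exists but is not a directory"
    # Count regular files at any depth below the directory.
    n_files = sum(1 for f in p.rglob("*") if f.is_file())
    return f"{p}: {n_files} file(s)"
```

Because the test script removes ./memory_db during cleanup, a call such as print(describe_persist_dir()) is only informative when placed right after mm.end_session(), before the cleanup line runs.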

Verify Everything Works

Run the full test suite. Both tests should complete, with Session 2 demonstrating cross-session memory recall:

Command
python memory_system.py

If Claude references the email preference in Session 2 without being told again, your 3-tier memory system is working correctly.

🎉 Congratulations

You've built a complete multi-layer memory system with working memory (in-session scratchpad), episodic memory (cross-session recall via ChromaDB), and procedural memory (reusable action templates). This is the architecture used by production agents that need to remember user preferences, past decisions, and learned workflows across sessions.

Stretch Goals (Optional)
  • Memory compaction: When episodic memory exceeds 100 entries, merge related episodes from the same week into consolidated records
  • /memory command: Add a command that lets the user inspect all three memory tiers in a formatted display
  • Procedural learning: When the agent completes a multi-step task, automatically extract and store the tool sequence as a new procedure
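
For the memory compaction goal, one possible starting point is sketched below. It works on plain Python dicts rather than on ChromaDB directly. The episode shape (`session_id`, `date`, `summary`), the function name `compact_episodes`, and the `merged_from` field are all assumptions made for illustration, not the module's schema. It also simplifies two things: "related" is treated as "same ISO week" with no topic check, and summaries are merged by concatenation where a real version would likely ask Claude to re-summarize.

```python
from collections import defaultdict
from datetime import date


def compact_episodes(episodes, max_entries=100):
    """Merge same-week episodes once the store exceeds max_entries.

    Assumes each episode is a dict with 'session_id', 'date'
    (ISO string such as '2026-04-05'), and 'summary' keys.
    """
    if len(episodes) <= max_entries:
        return list(episodes)  # under the limit: leave everything as-is

    # Group episodes by (ISO year, ISO week number).
    by_week = defaultdict(list)
    for ep in episodes:
        iso = date.fromisoformat(ep["date"]).isocalendar()
        by_week[(iso[0], iso[1])].append(ep)

    compacted = []
    for (year, week), group in sorted(by_week.items()):
        group.sort(key=lambda ep: ep["date"])  # oldest first within the week
        compacted.append({
            "session_id": f"week_{year}_{week:02d}",
            "date": group[-1]["date"],  # date of the most recent episode
            # Placeholder merge: plain concatenation of the summaries.
            "summary": " ".join(ep["summary"] for ep in group),
            "merged_from": [ep["session_id"] for ep in group],
        })
    return compacted
```

To apply this to the real store, you would read the episodes out of the collection, run the compaction, and write the consolidated records back in place of the originals. Keeping `merged_from` makes it possible to trace a consolidated record to the sessions it came from.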
Common Mistakes
  • Forgetting persistent storage: chromadb.Client() is in-memory only. Always use PersistentClient(path="...") for cross-session memory.
  • Not handling empty retrieval: On the first session, episodic and procedural memory are empty. Your code must return graceful defaults, not crash.
  • Overstuffing the prompt: Limit episodic retrieval to 2–3 episodes. More context ≠ better responses.
  • Skipping summarization: Store summaries, not raw transcripts. Raw transcripts bloat retrieval and waste tokens.
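
Two of the mistakes above, empty retrieval and overstuffing the prompt, can be guarded against in the function that renders episodes for the prompt. The sketch below is an illustration under stated assumptions: `format_episodes` is a hypothetical name, the input is assumed to be a list of dicts already sorted by relevance, and the output strings mirror the ones shown in this module's expected output and troubleshooting notes.

```python
def format_episodes(episodes, max_episodes=3):
    """Render retrieved episodes for the prompt.

    Returns a safe default when nothing was retrieved, and caps the
    number of injected episodes so the prompt stays small.
    Assumes episodes are dicts with 'session_id', 'date', 'summary',
    ordered from most to least relevant.
    """
    if not episodes:  # covers both None and an empty list
        return "[Past Interactions: none yet]"
    lines = ["[Relevant Past Interactions]"]
    for ep in episodes[:max_episodes]:
        lines.append(
            f"- Session {ep['session_id']} ({ep['date']}): {ep['summary']}"
        )
    return "\n".join(lines)
```

On a first session, format_episodes([]) returns the placeholder string instead of raising, so the prompt builder never needs a special case for an empty store.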

Knowledge Check

1. Which memory tier fits the following scenario?

Scenario: “The agent needs to store that the user’s preferred output format is CSV, learned during last week’s conversation.”

A Episodic Memory — this is a user preference from a past interaction, stored as a conversation summary
B Working Memory — preferences are part of the current task state
C Procedural Memory — preferences are learned skills
D Context Window — just keep the old conversation in the prompt
Correct! User preferences from past conversations belong in episodic memory. They persist across sessions and are retrieved via semantic search when relevant.
Not quite. Working memory is for the current task only. Procedural memory stores action sequences, not preferences. And keeping old conversations in the context window wastes tokens. User preferences from past conversations belong in episodic memory, where they persist and are retrieved by relevance.

2. Why is it more efficient to store conversation summaries in episodic memory rather than full transcripts?

A Full transcripts contain sensitive information that shouldn’t be stored
B Summaries are 10–20x smaller, produce better embedding quality for retrieval, and cost less to inject into future prompts
C ChromaDB can only store documents under 500 characters
D Full transcripts cannot be embedded into vectors
Correct! Summaries compress ~4,000 tokens to ~300 tokens (93% reduction), embeddings of focused summaries match better than embeddings of rambling conversations, and injecting summaries into prompts uses far fewer tokens — saving cost and leaving more room for the actual task.
While security can be a consideration, the main reasons are efficiency: summaries are dramatically smaller (93% compression), produce better semantic embeddings for retrieval, and cost much less to inject into future prompts. Full transcripts CAN be embedded, and ChromaDB doesn’t have a 500-char limit.

3. An agent keeps forgetting user preferences between sessions even though it uses a vector database. What is the most likely cause?

A The embedding model is too small
B The agent is running out of RAM
C ChromaDB is running in default in-memory mode without persistent storage configured
D The summarization pipeline is using the wrong Claude model
Correct! ChromaDB’s default mode stores everything in RAM, which is lost on process restart. The fix is to use PersistentClient(path="./chroma_db") so data is written to disk and survives restarts.
The most common cause of “memory loss” across sessions is ChromaDB running in its default in-memory mode. When the process restarts, all stored episodes are gone. Always configure persistent storage with PersistentClient(path="./chroma_db").

4. What is the role of the summarization pipeline in the memory architecture?

A It compresses the model weights so the agent runs faster
B It translates conversations between Python and TypeScript
C It generates embeddings for the context window
D It converts raw conversations into compact, structured episode records that are stored in episodic memory for cross-session retrieval
Correct! The summarization pipeline is the bridge between short-term conversation and long-term memory. It uses Claude to extract key decisions, preferences, topics, and outcomes from a conversation, compressing ~4,000 tokens into ~300. These compact records are then embedded and stored in the vector database for retrieval in future sessions.
The summarization pipeline converts raw conversations into compact, structured episode records. It uses Claude to extract key decisions, preferences, and outcomes, then stores these compressed summaries in episodic memory. It’s the bridge between short-term conversation and long-term memory.

5. A customer support agent handles 500 conversations per day. Without any management, the episodic memory store will grow unbounded. Which approach best prevents this while preserving useful memories?

A Delete all episodes older than 24 hours
B Implement memory compaction: periodically merge related episodes and remove entries below a relevance threshold, keeping the store under a configured limit
C Store only the first 100 episodes and ignore the rest
D Use a larger vector database instance and never delete anything
Correct! Memory compaction merges related episodes (e.g., all “deployment” conversations from one week become a single consolidated record), removes outdated entries, and keeps the store at a manageable size. Deleting everything after 24 hours loses valuable context. Capping at 100 is arbitrary. And “never delete” just delays the problem.
The best approach is memory compaction — periodically merging related episodes and removing low-relevance entries. Deleting after 24 hours loses important context. A hard cap of 100 is arbitrary and lossy. And “never delete” just delays the problem while increasing storage costs and degrading search quality.

6. In M08, you learned about conversation context management. How does multi-layer memory improve upon the basic approaches covered there?

A It replaces the context window entirely — agents no longer need to send conversation history to Claude
B It uses a larger context window model, which solves all memory problems
C It adds persistent, cross-session memory (episodic + procedural) on top of the within-session conversation management, and uses structured working memory to reduce the amount of raw history needed in the context window
D It only works with RAG systems, not with basic conversation management
Correct! M08’s context management handles within-session history (sliding windows, summarization). Multi-layer memory adds two new dimensions: cross-session persistence (episodic memory remembers past conversations) and skill reuse (procedural memory avoids re-planning). Working memory is a structured upgrade to “just stuff the whole conversation into the prompt.”
Multi-layer memory doesn’t replace the context window — agents still send conversation history. It adds persistent, cross-session memory on top of within-session management. Episodic memory remembers past conversations (M08 only handles the current one). Procedural memory reuses proven skills. Working memory structures the current task state instead of relying on raw conversation history.


Module Summary

Key Concepts Recap

  • Multi-layer memory architecture: Separate memory into working (current task), episodic (past interactions), and procedural (learned skills) tiers
  • Working memory: A mutable key-value scratchpad injected into every LLM call, cleared when the task completes
  • Episodic memory: Conversation summaries stored in a vector database, retrieved by semantic search for cross-session continuity
  • Procedural memory: Reusable action templates with trigger conditions, retrieved by similarity matching to avoid re-planning
  • Summarization pipeline: Uses Claude to compress raw conversations into compact episode records (93% token reduction)
  • Memory Manager: Orchestrates all three tiers — loads at session start, maintains during conversation, persists at session end
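
The 93% figure in the recap follows from the example sizes used in this module: roughly 4,000 tokens of raw conversation compressed to roughly 300. A quick check of that arithmetic, with `token_reduction` as a hypothetical helper name:

```python
def token_reduction(raw_tokens, summary_tokens):
    """Return (size ratio, fractional reduction) for a summary."""
    ratio = raw_tokens / summary_tokens          # how many times smaller
    reduction = 1 - summary_tokens / raw_tokens  # share of tokens saved
    return ratio, reduction


ratio, reduction = token_reduction(4000, 300)
print(f"{ratio:.1f}x smaller, {reduction:.1%} fewer tokens")
# prints: 13.3x smaller, 92.5% fewer tokens
```

The result, 92.5%, rounds to the quoted 93%, and the 13.3x ratio sits inside the 10–20x range given in the Knowledge Check. Actual savings depend on how long your conversations and summaries are.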

Next: M13: ReAct Agent Loop

You’ve given your agent a brain that remembers. Now it’s time to give it the ability to reason and act. In M13, you’ll implement the ReAct pattern: a structured loop in which Claude thinks about what to do, takes an action (tool call), observes the result, and decides the next step. This is the foundation of autonomous agent behavior, and your multi-layer memory will play a key role in giving the agent context for its decisions.