Multi-Layer Memory Architecture
Give your agent a brain that remembers — across turns, across sessions, across tasks.
Prerequisites: M08: Conversation Management, M09: RAG Fundamentals
Learning Objectives
- Explain why production agents need multiple memory tiers instead of a single context window or vector store. (Context window: the maximum amount of text, measured in tokens, that Claude can process in a single API call. It includes the system prompt, conversation history, tool definitions, and Claude's response. Current Claude models support up to 200K tokens, but a larger context doesn't mean better recall.)
- Implement a working memory scratchpad that tracks current task state and injects it into each LLM call
- Build an episodic memory system that stores conversation summaries in a vector database for cross-session retrieval
- Create a procedural memory store that saves and retrieves proven tool-call sequences for reuse
- Wire all three memory tiers together with a Memory Manager that orchestrates loading, updating, and persisting memory across sessions
Why One Memory Type Isn’t Enough
Before: Imagine if your brain stored everything in one place — today’s grocery list, how to ride a bike, your childhood memories, and the recipe for your favorite pasta — all jumbled in a single notebook. Every time you needed something, you’d flip through every page.
Pain: You’d waste enormous time searching, the notebook would fill up fast, and important long-term memories would get crowded out by temporary notes. You’d forget how to ride a bike because you overwrote it with a shopping list.
Mapping: Your brain uses different memory systems for a reason: working memory for what you’re thinking about right now (like holding a phone number while you dial), episodic memory for past experiences (what happened at your wedding), and procedural memory for learned skills (how to ride a bike). AI agents need the same separation — a single context window or vector store is the “one notebook” approach, and it breaks down at scale.
A multi-layer memory architecture separates an agent’s memory into specialized tiers, each optimized for a different kind of information. It is a design pattern in which every tier has its own storage backend, retrieval strategy, and retention policy, chosen for the type of information that tier holds:
- Working memory: a fast, mutable scratchpad for the current task. Think of it as a key-value store (like a Python dict or Redis cache) that holds the user’s current intent, extracted entities, intermediate tool results, and the plan for the current turn. A key-value store is a simple data structure that maps unique keys (like “user_intent”) to values (like “book a flight”); it is fast to read and write and often held in RAM. Working memory is included in every LLM prompt and cleared when the task finishes.
- Episodic memory: a searchable archive of past interactions. It stores summarized records of past conversations in a vector database, indexed by semantic embeddings and timestamps, and is retrieved via similarity search when the agent needs context from prior sessions. A vector database stores data as high-dimensional vectors (arrays of numbers) and supports similarity search, that is, finding items whose vectors are closest to a query vector. Examples: ChromaDB, Pinecone, Weaviate.
- Procedural memory: a library of reusable action sequences. It stores proven tool-call chains (like “search → filter → summarize → respond”) as structured templates. When a new task matches a known pattern, the agent retrieves and executes the template instead of reasoning from scratch.
The key insight: mixing all three in a single store causes retrieval pollution — when you search for “what did the user ask last week?”, you don’t want to also get back the current task’s scratchpad entries and stored tool sequences. Separation keeps retrieval precise and context windows lean.
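[Figure: example contents of the three tiers. topic: "deployment schedule" (working memory); "Deploy moved to Friday" (episodic memory); search episodes → cite → respond (procedural memory).]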
So what does multi-layer memory actually look like in an agent's system prompt? Here's the formatted output from all three tiers combined — this is what gets injected into every Claude call:
Three tiers, three colors, under 400 tokens total. Claude sees the current task state (working memory in amber), the 2 most relevant past conversations (episodic in blue), and a proven action template (procedural in green). Compare that to stuffing the full transcript of every past conversation into the prompt — which would use 50,000+ tokens and exceed context limits within a few days.
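A prompt block of the kind described above could be assembled roughly as follows. This is a minimal sketch: the function name `build_memory_block`, the section headings, and the sample values are illustrative assumptions, not the module's actual output format.

```python
def build_memory_block(working: dict, episodes: list[str], procedure: list[str]) -> str:
    """Format the three memory tiers as one text block for the system prompt."""
    lines = ["## Working memory (current task)"]
    lines += [f"- {key}: {value}" for key, value in working.items()]

    lines.append("## Episodic memory (relevant past sessions)")
    lines += [f"- {summary}" for summary in episodes[:3]]  # cap at 3 episodes

    lines.append("## Procedural memory (suggested plan)")
    lines.append(" -> ".join(procedure))
    return "\n".join(lines)


block = build_memory_block(
    working={"topic": "deployment schedule"},
    episodes=["Deploy moved to Friday"],
    procedure=["search episodes", "cite", "respond"],
)
print(block)
```

Capping the episode list inside the formatter keeps the block small even if a caller passes in more retrieved episodes than intended.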
"Every agent needs all three memory tiers" — Not at all. A simple Q&A bot needs zero memory. A single-session agent might only need working memory. Episodic memory matters when users return across sessions. Procedural memory matters when the agent repeats complex workflows. Start with the tier that solves your actual problem, not the full architecture.
"Episodic memory gives the agent perfect recall" — Episodic memory stores summaries, not transcripts. Summarization is lossy — specific names, numbers, and nuances may be dropped. If you need exact recall of a specific detail (like a contract amount), store it as structured metadata, not just in the summary text.
"More retrieved episodes = better responses" — Injecting 10 past episodes into the prompt does more harm than good. Each extra episode adds noise and tokens, and Claude may get confused about which details apply to the current situation. Best practice: retrieve 2-3 most relevant episodes, max.
"Procedural memory replaces Claude's reasoning" — Procedural memory provides a suggested plan, not a mandate. Claude can adapt or override the template based on the current context. Think of it as a recipe that an experienced chef can riff on — not a rigid script.
"Episodic memory has no privacy implications" — Storing conversation summaries means you're persisting user data across sessions. This has real GDPR/CCPA implications: users may have the right to request deletion of their stored episodes. You need to be able to delete all episodes for a specific user_id, and your privacy policy must disclose that conversations are summarized and stored.
A customer-facing support agent handling 500 conversations per day generates roughly 2 million tokens of conversation history per week. Without separated memory tiers, you’d either hit context window limits constantly (200K tokens max) or pay $60+/day in API costs injecting irrelevant history. Multi-layer memory keeps the prompt lean: only the current task state (working), the 2–3 most relevant past conversations (episodic), and the matching skill template (procedural) go into each call — typically under 4,000 tokens of memory context total.
Tier 1: Working Memory — The Scratchpad
Before: Imagine a doctor seeing a patient without a clipboard. They walk into the room, the patient describes three symptoms, the doctor orders a blood test, then walks to the lab — and has to ask the patient to repeat everything because they had nowhere to jot notes.
Pain: Without a scratchpad, the doctor loses track of what they’ve already learned, repeats questions, orders duplicate tests, and the visit takes three times longer. The patient loses trust.
Mapping: Working memory is the agent’s clipboard. It holds the current user intent, extracted entities (names, dates, IDs), intermediate tool results, and the plan for the current task. Every LLM call sees this scratchpad, so the agent never “forgets” what it learned two tool calls ago. When the task is done, the scratchpad is cleared or archived.
In concrete terms, working memory is a Python dictionary that looks like this at any given moment during a task:
Each key-value pair gets added as the agent progresses through the task. The to_prompt() method formats this dict into a text block that Claude reads at the start of every call. That's all working memory is — a structured dict that rides along with each API request.
Working memory is a structured, mutable state object (a Python dictionary or JavaScript Map) that holds everything the agent needs for the current task. It includes:
- User intent — what the user is trying to accomplish, parsed from their message
- Extracted entities — specific values like names, dates, IDs pulled from the conversation
- Intermediate results — outputs from tool calls that inform the next step
- Task plan — the sequence of steps the agent plans to execute
The working memory is injected into the system prompt of every LLM call, so Claude always has full awareness of the current state. It lives in RAM (or a fast cache like Redis for distributed agents), and its lifetime matches the task — created when a request arrives, destroyed when the response is sent.
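The scratchpad described above can be sketched in a few lines. The class name `WorkingMemory`, the `<working_memory>` wrapper tags, and the sample keys are assumptions made for illustration; only the idea of a dict with a `to_prompt()` method comes from the text.

```python
import json


class WorkingMemory:
    """In-RAM scratchpad for one task (hypothetical class, illustrative names)."""

    def __init__(self) -> None:
        self._state: dict[str, object] = {}

    def set(self, key: str, value: object) -> None:
        self._state[key] = value

    def get(self, key: str, default: object = None) -> object:
        return self._state.get(key, default)

    def clear(self) -> None:
        """Call when the task finishes."""
        self._state.clear()

    def to_prompt(self) -> str:
        """Render the scratchpad as a text block for the system prompt."""
        if not self._state:
            return "<working_memory>(empty)</working_memory>"
        body = "\n".join(f"{k}: {json.dumps(v)}" for k, v in self._state.items())
        return f"<working_memory>\n{body}\n</working_memory>"


memory = WorkingMemory()
memory.set("user_intent", "book a flight")
memory.set("entities", {"destination": "Lisbon", "date": "2025-06-01"})
memory.set("plan", ["search_flights", "confirm_price", "book"])
prompt_block = memory.to_prompt()
print(prompt_block)
```

Serializing each value with `json.dumps` keeps nested entities and lists unambiguous in the prompt; a distributed agent would swap the dict for a cache such as Redis while keeping the same interface.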
Without working memory, an agent that calls 3 tools in sequence has to re-derive context from the full conversation history each time — burning tokens and losing track of intermediate state. In benchmarks, adding a structured scratchpad to a multi-step agent reduces token usage by 30–40% and improves task completion rates by 15–25%, because the agent always knows exactly where it is in the task.
Tier 2: Episodic Memory — Past Interactions
Before: Imagine a personal assistant who keeps a detailed diary that’s searchable by topic. They don’t read the entire diary every morning — that would take hours. Instead, when you ask “What did we discuss about the budget last week?”, they search their diary and find the right entry in seconds.
Pain: Without the diary, the assistant starts every day with amnesia. You’d have to re-explain your preferences, past decisions, and ongoing projects every single session. For a support agent, this means every returning customer feels like they’re talking to a stranger.
Mapping: Episodic memory is the agent’s searchable diary. After each conversation, the agent writes a summary of what happened — key topics, decisions made, user preferences learned, outcomes. These summaries are stored as vector embeddings in a database. When a new conversation starts, the agent searches for relevant past episodes and injects them into the prompt, giving it cross-session continuity.
Here's what an episode record actually looks like once it's stored in ChromaDB:
The document is the summary text. The embedding is an array of floats representing its semantic meaning (1,536 floats with the embedding model assumed here; the dimension depends on the model you use). The metadata enables filtered search (e.g., "find episodes for user_alice about deployment"). When a new message comes in, ChromaDB compares its embedding against the stored episodes and returns the closest matches.
Episodic memory stores summarized records of past conversations in a vector database, indexed by semantic embeddings and timestamped metadata. Semantic embeddings are numerical representations (arrays of floating-point numbers) of text meaning; texts with similar meanings have similar embedding vectors, which enables similarity search even when the exact words differ. Metadata is additional structured data attached to each record, such as session ID, timestamp, user ID, and topic tags, that enables filtered search beyond semantic similarity. The workflow is:
- At conversation end: Summarize the conversation into a structured episode record (topics, decisions, preferences, outcomes) using Claude
- Embed & store: Convert the summary into a vector embedding and store it in ChromaDB (or Pinecone, Weaviate, etc.) with metadata (session ID, timestamp, user ID)
- At conversation start: Take the new user message, embed it, and search the vector database for the most similar past episodes using cosine similarity, a measure of how similar two vectors are based on the angle between them. A score of 1.0 means identical direction (very similar meaning); 0.0 means unrelated. ChromaDB searches through an HNSW index, an algorithm that keeps nearest-neighbor search fast even with millions of vectors. Its default distance is squared L2, so configure the collection for cosine distance (for example via the hnsw:space setting) if you want cosine similarity.
- Inject: Add the top 2–3 matching episode summaries into the system prompt as “relevant past context”
This gives the agent cross-session memory without replaying full conversation logs (which would be prohibitively expensive and exceed context limits).
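The store-then-search workflow above can be sketched without any external services. Everything here is a stand-in: `toy_embed` is a bag-of-words counter rather than a real embedding model, and `EpisodeStore` is a hypothetical in-memory substitute for a vector database collection such as one in ChromaDB.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


@dataclass
class EpisodeStore:
    """In-memory stand-in for a vector database collection."""
    records: list[dict] = field(default_factory=list)

    def add(self, summary: str, user_id: str, timestamp: str) -> None:
        self.records.append({
            "document": summary,
            "embedding": toy_embed(summary),
            "metadata": {"user_id": user_id, "timestamp": timestamp},
        })

    def search(self, query: str, user_id: str, k: int = 3) -> list[str]:
        """Return up to k summaries for this user, most similar first."""
        q = toy_embed(query)
        scored = [
            (cosine(q, r["embedding"]), r["document"])
            for r in self.records
            if r["metadata"]["user_id"] == user_id  # metadata filter
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for score, doc in scored[:k] if score > 0]


store = EpisodeStore()
store.add("Deployment schedule moved to Friday after QA delay", "user_alice", "2025-01-10")
store.add("User prefers JSON output for all reports", "user_alice", "2025-01-12")
store.add("Deployment rollback discussed", "user_bob", "2025-01-11")
hits = store.search("when is the deployment schedule?", "user_alice", k=2)
print(hits)
```

Filtering on `user_id` before ranking keeps one user's episodes from leaking into another user's prompt. A word-count vector only matches on shared words, which is exactly the limitation real embeddings remove.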
Episodic memory is the difference between a stateless chatbot and a persistent assistant. A SaaS support agent with episodic memory can say “Last time we spoke, you were having trouble with the OAuth redirect — did that get resolved?” instead of asking the customer to explain from scratch. Studies show that agents with cross-session memory reduce average handle time by 40% and increase customer satisfaction scores by 20%, because users don’t have to repeat themselves.
Tier 3: Procedural Memory — The Skill Library
Before: Imagine an experienced chef who has cooked the same pasta dish 200 times. They don’t re-read the recipe or measure spices from scratch each time. They’ve internalized the steps: boil water, salt it, cook pasta 8 minutes, sauté garlic in olive oil, toss everything together. It’s automatic.
Pain: A novice chef without any recipes has to reason through every step from first principles, makes mistakes, and takes 3x longer. Worse, they might solve the same problem differently each time, leading to inconsistent results. Customers who order the same dish twice get different meals.
Mapping: Procedural memory is the agent’s recipe book. When the agent successfully completes a multi-step task (e.g., “search → filter → summarize → format as table”), it stores that tool sequence as a reusable template. Next time a similar request arrives, the agent retrieves the template instead of reasoning from scratch — faster execution, consistent results, fewer tokens burned on planning.
Here's what a stored procedure template looks like as a JSON record:
The trigger is stored as an embedding for semantic matching. When a user says "Can you make the weekly report?", the agent embeds that request, searches procedural memory, and finds this template with high similarity. The steps array tells Claude exactly which tools to call and in what order. The success_count gives the agent confidence that this is a proven workflow.
Procedural memory stores reusable action sequences (tool chains, proven workflows, and task templates) as structured records. An action sequence is an ordered list of tool calls with their parameters that represents a proven workflow, for example [search_db(query), filter_results(criteria), format_output(template)]; it can include conditional branches and parallel steps. Each record has two components:
- Trigger condition: A natural language description of when this procedure applies (e.g., “user asks to generate a weekly sales report”). Stored as an embedding for similarity matching.
- Execution steps: An ordered list of tool calls with parameter templates, expected outputs, and error handling instructions. Stored as JSON.
At runtime, the agent embeds the current user request, searches procedural memory for matching triggers, and — if a match is found with high confidence — loads the procedure into the prompt as a suggested plan. The agent can then follow or adapt the plan rather than reasoning from scratch.
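The trigger-matching step described above might look like the following sketch. The `Procedure` record, the tool names in `steps`, and the 0.3 threshold are hypothetical, and `similarity` uses Jaccard word overlap as a stand-in for the embedding similarity a real system would use.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Procedure:
    trigger: str            # natural-language description of when it applies
    steps: list[str]        # ordered tool names (hypothetical tools)
    success_count: int = 0


def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a: str, b: str) -> float:
    """Jaccard word overlap; a stand-in for embedding similarity."""
    wa, wb = _words(a), _words(b)
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def match_procedure(request: str, library: list[Procedure],
                    threshold: float = 0.3) -> Optional[Procedure]:
    """Return the best-matching procedure, or None if nothing clears the threshold."""
    best = max(library, key=lambda p: similarity(request, p.trigger), default=None)
    if best is None or similarity(request, best.trigger) < threshold:
        return None
    return best


library = [
    Procedure(
        trigger="generate a weekly sales report",
        steps=["query_sales", "group_by_region", "make_chart", "summarize", "export_pdf"],
        success_count=12,
    ),
]

found = match_procedure("please generate the weekly sales report", library)
missed = match_procedure("reset my password", library)
print(found.steps if found else None, missed)
```

Returning None below the threshold matters as much as the match itself: a weak match injected as a plan would steer the agent toward the wrong workflow, so falling back to from-scratch reasoning is the safer default.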
[Figure: a matched procedure (confidence: 0.93) with five steps, labeled sales data, by region, bar chart, key insights, and PDF report.]
Procedural memory is how agents get faster and more reliable over time. Without it, an agent that generates weekly reports reasons through the same 5-step plan every Monday — burning ~2,000 tokens on planning alone. With procedural memory, it retrieves the proven template in a single lookup, saving tokens and executing the same reliable sequence every time. For a B2B ecommerce agent handling 50 report requests per week, this saves roughly $15/week in API costs and eliminates the variance in report quality that comes from re-deriving the approach each time.
Facts have an "as-of" timestamp. Memory layers must distinguish current facts ("CEO is Alice") from historical facts ("CEO was Bob through 2024-Q3"). Without temporal metadata, your agent will confidently report stale facts as current — the canonical exam scenario. Store every fact with {value, valid_from, valid_to, source}; on retrieval, prefer the row where valid_to IS NULL for "current" queries and filter by date for "as-of" queries. Episodic memory naturally has timestamps, but semantic and procedural memory often don't — this is where temporal bugs hide.
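The {value, valid_from, valid_to, source} scheme maps directly onto a small table. The sketch below uses SQLite from the standard library; the table name, the `key` column, the sample dates, and the choice to treat valid_to as an exclusive bound are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE facts (
           key TEXT, value TEXT, valid_from TEXT, valid_to TEXT, source TEXT
       )"""
)
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?, ?, ?)",
    [
        # Bob was CEO through 2024-Q3, so his interval closes on 2024-10-01.
        ("ceo", "Bob", "2020-01-01", "2024-10-01", "press release"),
        ("ceo", "Alice", "2024-10-01", None, "press release"),
    ],
)


def current_fact(key: str):
    """Current value: the row whose validity interval is still open."""
    row = conn.execute(
        "SELECT value FROM facts WHERE key = ? AND valid_to IS NULL", (key,)
    ).fetchone()
    return row[0] if row else None


def fact_as_of(key: str, date: str):
    """Value that was valid on the given ISO date (valid_to is exclusive)."""
    row = conn.execute(
        """SELECT value FROM facts
           WHERE key = ? AND valid_from <= ?
             AND (valid_to IS NULL OR valid_to > ?)""",
        (key, date, date),
    ).fetchone()
    return row[0] if row else None


print(current_fact("ceo"), fact_as_of("ceo", "2024-06-15"))
```

ISO-8601 date strings sort lexicographically in date order, which is why plain TEXT columns and string comparison are enough here.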
Summarization Pipeline & Cross-Session Persistence
Before: At the end of each workday, imagine a project manager who writes a brief summary of what was accomplished: key decisions made, action items assigned, blockers identified, and what’s pending for tomorrow. They file this summary in a searchable archive.
Pain: Without this habit, Monday morning is chaos. Nobody remembers what was decided on Friday, action items fall through the cracks, and the team re-discusses the same issues. The project manager effectively has weekly amnesia.
Mapping: The summarization pipeline is this end-of-day summary habit, automated. When a conversation ends, Claude summarizes the key information — decisions, preferences, outcomes — into a compact record. This record is embedded and stored in episodic memory. When the next session starts, the stored summaries are retrieved and injected, giving the agent continuity across sessions without replaying entire conversation logs.
Here's what the summary record looks like after Claude processes a 10-turn conversation:
That's it — a 10-turn, 4,000-token conversation compressed into ~80 tokens of structured JSON. The topics array is used for metadata filtering, and the summary text is what gets embedded and stored in ChromaDB.
The summarization pipeline is a multi-stage process that converts raw conversations into compact, searchable memory records:
- Summarize: Send the conversation to Claude with a structured prompt that extracts: key topics discussed, decisions made, user preferences learned, action items, and outcomes. Output is a structured JSON record (~200–400 tokens), down from the original ~4,000+ token conversation.
- Embed: Convert the summary text into a vector embedding using an embedding model (like Voyage AI or OpenAI embeddings). This enables semantic search later.
- Store: Save the embedding + summary text + metadata (session ID, timestamp, user ID, topic tags) in a vector database like ChromaDB.
- Compact: Periodically merge related episodes and remove outdated information to prevent the memory store from growing unbounded. Compaction is a maintenance process that reviews stored memories, merges duplicates, and removes entries older than a threshold or below a relevance score. A compaction strategy might merge all “deployment” episodes from the same week into one consolidated record.
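The compaction step is the least obvious of the four, so here is one way it could work: group episodes by topic and ISO week, then merge each group into a single record. The record fields (`topic`, `date`, `summary`, `merged_from`) are hypothetical, and a real pipeline would likely re-summarize and re-embed the merged text rather than concatenate it.

```python
from collections import defaultdict
from datetime import date


def compact(episodes: list[dict]) -> list[dict]:
    """Merge episodes that share a topic and ISO week into one record each."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for ep in episodes:
        iso = date.fromisoformat(ep["date"]).isocalendar()
        groups[(ep["topic"], iso[0], iso[1])].append(ep)  # (topic, year, week)

    merged = []
    for (topic, _year, _week), group in groups.items():
        group.sort(key=lambda ep: ep["date"])
        merged.append({
            "topic": topic,
            "date": group[-1]["date"],  # keep the most recent date
            "summary": " ".join(ep["summary"] for ep in group),
            "merged_from": len(group),
        })
    return merged


episodes = [
    {"topic": "deployment", "date": "2025-01-06", "summary": "Deploy planned for Thursday."},
    {"topic": "deployment", "date": "2025-01-08", "summary": "Deploy moved to Friday."},
    {"topic": "billing", "date": "2025-01-08", "summary": "Invoice format changed."},
    {"topic": "deployment", "date": "2025-01-14", "summary": "Deploy succeeded."},
]
compacted = compact(episodes)
for record in compacted:
    print(record)
```

Here the two deployment episodes from the week of 6 January collapse into one record, while the billing episode and the following week's deployment episode stay separate.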
A 10-turn conversation with Claude uses roughly 4,000–8,000 tokens. Storing the full transcript for 500 daily conversations would mean 2–4 million tokens of raw history per day. The summarization pipeline compresses each conversation to ~300 tokens, a reduction of roughly 93% or more. A full day of 500 conversation summaries therefore comes to about 150,000 tokens, small enough to fit in a single ChromaDB collection and be searched in under 50ms. Without summarization, you’d blow through storage limits within a week.
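The arithmetic behind these figures is easy to check. The sketch below uses only the numbers quoted in the text (500 conversations per day, 4,000 to 8,000 raw tokens each, about 300 tokens per summary); the variable names are illustrative.

```python
conversations_per_day = 500
raw_tokens_low, raw_tokens_high = 4_000, 8_000   # per conversation, from the text
summary_tokens = 300                             # per conversation, from the text

raw_per_day = (conversations_per_day * raw_tokens_low,
               conversations_per_day * raw_tokens_high)
summaries_per_day = conversations_per_day * summary_tokens
reduction_low = 1 - summary_tokens / raw_tokens_low
reduction_high = 1 - summary_tokens / raw_tokens_high

print(f"raw history per day: {raw_per_day[0]:,} to {raw_per_day[1]:,} tokens")
print(f"summaries per day:   {summaries_per_day:,} tokens")
print(f"reduction:           {reduction_low:.1%} to {reduction_high:.1%}")
```

Note that 150,000 tokens corresponds to one day of summaries at this volume (500 summaries of about 300 tokens each), so the store grows by that amount daily until compaction trims it.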
"Summarization is lossless — Claude captures everything" — No. Summarization is inherently lossy. Specific numbers (contract amounts, API rate limits), exact timestamps, and subtle nuances often get dropped. If a detail is business-critical, store it as structured metadata alongside the summary, not just in the summary text.
"I should summarize after every message" — Summarize at conversation END, not after every turn. Mid-conversation summarization wastes API calls and may lose context that only makes sense in the full conversation flow. The one exception: very long conversations (50+ turns) where you want to summarize-and-compact periodically to prevent context overflow.
"Any LLM can summarize equally well" — Summarization quality varies significantly between models. A weak summarizer might miss that "the user seemed frustrated about the delay" or drop the distinction between "we discussed X" and "we decided X." Use a capable model (Claude Sonnet or better) for summarization — the cost difference is negligible and the quality gap is significant.
You’ve learned the four components of the memory architecture: working memory (current state), episodic memory (past conversations), procedural memory (learned skills), and the summarization pipeline (the bridge between them). Now let’s build the persistence layer that keeps all this data alive across process restarts.
Persistence Layer
Memory that only lives in RAM disappears when your process crashes or restarts. For production agents, all three memory tiers need durable storage — meaning the data is written to disk (or a managed database) and survives server restarts, crashes, and deployments.
Here's how persistence works internally, tier by tier. When working memory updates a key, the change happens in RAM first (that's what makes it fast). Periodically, the in-memory state is flushed to a backing store — Redis for distributed agents, or SQLite or a JSON file for simpler setups.
Episodic memory is a bit different. When you store an episode, ChromaDB writes two things to disk: the embedding vector (the 1536-float array) and the document text (the summary). Both go into ChromaDB's SQLite backend, so they survive restarts.
Procedural memory splits its data across two stores. The trigger description goes into the vector DB as an embedding (for semantic search). The steps JSON goes into a relational table (SQLite or PostgreSQL). On restart, each tier reads its persisted state and resumes where it left off.
If you've worked with web apps, this is similar to the difference between storing user data in a session cookie versus a database. Session cookies disappear when the browser closes. A database persists across restarts. The same pattern applies here: without persistence, your agent's memory is a session cookie that vanishes on crash.
This is fundamentally different from the context-window approach in M08, where conversation history lived only in the messages array passed to each API call. That array exists only while your code is running. Persistence adds a storage layer underneath each memory tier, turning ephemeral state into durable records.
The tradeoff is additional infrastructure complexity — you now need a database, backup strategy, and a plan for when storage grows unbounded. For a hobby project, SQLite and a local ChromaDB directory are fine. For a production agent handling thousands of users, you'll likely graduate to a managed vector database (Pinecone, Weaviate Cloud) with automatic backups and horizontal scaling.
- Working memory: Persisted to Redis or a session store — survives brief disconnects. For simpler deployments, a JSON file or SQLite row per active session works.
- Episodic memory: Vector database (ChromaDB with SQLite backend, Pinecone, Weaviate) — survives restarts and scales to millions of episodes.
- Procedural memory: SQLite or PostgreSQL table with trigger embeddings stored in the same vector database — survives restarts and supports filtered retrieval.
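For the simplest case in the list above, a SQLite row per active session, persistence can be sketched as follows. The class name `PersistentWorkingMemory`, the table layout, and the explicit `flush()` call are assumptions for illustration; a production setup might flush on a timer or use Redis instead.

```python
import json
import os
import sqlite3
import tempfile


class PersistentWorkingMemory:
    """Working memory backed by one SQLite row per session (hypothetical class)."""

    def __init__(self, db_path: str, session_id: str) -> None:
        self.session_id = session_id
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS working_memory "
            "(session_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )
        row = self.conn.execute(
            "SELECT state FROM working_memory WHERE session_id = ?", (session_id,)
        ).fetchone()
        # Resume from the persisted state if this session already exists.
        self.state: dict = json.loads(row[0]) if row else {}

    def set(self, key: str, value) -> None:
        self.state[key] = value  # update RAM first; disk is written on flush()

    def flush(self) -> None:
        """Write the in-memory state to disk."""
        self.conn.execute(
            "INSERT OR REPLACE INTO working_memory (session_id, state) VALUES (?, ?)",
            (self.session_id, json.dumps(self.state)),
        )
        self.conn.commit()


db_path = os.path.join(tempfile.mkdtemp(), "agent.db")
first = PersistentWorkingMemory(db_path, "session-1")
first.set("user_intent", "book a flight")
first.flush()

resumed = PersistentWorkingMemory(db_path, "session-1")  # simulates a restart
print(resumed.state)
```

Anything set after the last flush is lost on a crash, which is the usual trade between write frequency and durability; flushing after each tool call is a reasonable starting point.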
ChromaDB’s default in-memory mode loses all data on restart. Always configure persistent storage: chromadb.PersistentClient(path="./chroma_db") in Python or specify a persist directory. In production, consider a managed vector database (Pinecone, Weaviate Cloud) for automatic backups and scaling.
Long sessions accumulate stale context — Claude may reference code that’s been changed. Mitigations: /compact (lossy), scratchpad files (lossless), subagent delegation (fresh context), or crash recovery manifests for interrupted sessions. The exam tests your knowledge of these tradeoffs.
The Managed Memory Tool
Claude exposes a built-in memory tool in the API: a managed key/value-and-namespace store that persists across requests for a given user, project, or org. Instead of standing up Postgres, building a summarization pipeline, and writing your own retrieval prompts, you call memory.write and memory.read tools and Anthropic handles storage, retrieval, and quota management.
The memory tool is enabled like any other built-in tool in your tools array. Claude can call memory.write to store a fact under a namespaced key (e.g. user_prefs/timezone), memory.read to retrieve, and memory.list to enumerate. Memory is scoped: you can isolate per-user, per-session, or per-tenant by passing scoping headers. Storage and retrieval cost is metered separately from input/output tokens.
Memory Tool vs Building It Yourself
| Concern | Memory tool | DIY 3-tier (this module) |
|---|---|---|
| Setup time | Minutes — enable the tool, done | Days — DB schema, summarizer, retrieval logic |
| Where data lives | Anthropic-managed storage | Your DB, your VPC, your control plane |
| Retrieval policy | Claude decides when to read | You decide — eager, lazy, hybrid |
| Compliance / data residency | Bound by Anthropic's region/ZDR options | Whatever your infra allows (HIPAA, GDPR, etc.) |
| Cost model | Per-read/write metering | Your storage + compute |
| Best for | Prototypes, B2C agents, simple personalization | Regulated data, complex retrieval, multi-tenant SaaS |
A common production pattern: use the memory tool for short-lived per-user preferences (timezone, locale, "remember I want JSON output") and keep your DIY tiered system for case-grade context (medical history, account state, audit-relevant decisions). They aren't competitive — they target different problem shapes.
Three cases where the DIY 3-tier system you just built is the right answer: (1) any data subject to data-residency regulation that requires you to store inside your own VPC. (2) retrieval policies more nuanced than "Claude reads when it thinks it should" — e.g., always-load-on-session-start case facts. (3) high-volume personalization with thousands of facts per user, where DB indexes give you better economics than per-read metering.
Auto Memory — Claude Writes Its Own Notes
Imagine pairing with a colleague who silently keeps a running notebook of everything they've learned about your codebase — build commands they figured out, debugging quirks, your preferences — and reads that notebook before every session. That's Auto Memory: a Claude Code feature where Claude itself decides what's worth remembering across sessions and writes it to disk, without you typing a single line of CLAUDE.md.
Auto Memory is on by default in Claude Code v2.1.59 and later. Most learners discover it the first time they see "Writing memory…" in the terminal and wonder what just happened — that's Claude updating its notes. The course wouldn't be complete without showing you exactly how it works, where it stores data, and how it interacts with the CLAUDE.md memory you write yourself.
Auto Memory is an automatic note-taking system built into Claude Code. Claude decides at runtime whether something is worth remembering — build commands, debugging insights, architectural decisions, preferences it has discovered — and writes those notes to a per-project memory directory. The directory is loaded at the start of every future session. Auto Memory is complementary to CLAUDE.md, not a replacement: CLAUDE.md is what you tell Claude; Auto Memory is what Claude tells itself.
CLAUDE.md vs Auto Memory — Two Complementary Systems
| | CLAUDE.md files | Auto Memory |
|---|---|---|
| Who writes it | You | Claude |
| What it contains | Instructions, rules, conventions | Learnings, patterns, discovered preferences |
| Scope | Project · user · org | Per working tree (git repo) |
| Loaded into | Every session, in full | Every session: first 200 lines or 25KB of MEMORY.md |
| Best for | Coding standards, architecture, "always do X" | Build quirks, debug insights, preferences Claude discovers |
| Storage | ./CLAUDE.md, ./.claude/CLAUDE.md, ~/.claude/CLAUDE.md | ~/.claude/projects/<project>/memory/ |
Both systems load at the start of every conversation. CLAUDE.md captures what you'd otherwise re-explain every session; Auto Memory captures what Claude would otherwise re-discover every session. Together they're a Pareto-optimal split — you write the unchangeable rules, Claude takes notes on the rest.
How It Works — The Memory Directory
Each git repository gets its own auto memory directory at ~/.claude/projects/<project>/memory/. The directory contains an entrypoint (MEMORY.md) plus optional topic files.
The /memory Command
The /memory slash command is your control plane for both CLAUDE.md and Auto Memory. From inside a session, it lists every file currently loaded (CLAUDE.md, CLAUDE.local.md, rules files, MEMORY.md), lets you toggle Auto Memory on or off, and provides a link to open the auto memory folder. Selecting any file opens it in your editor.
Two natural prompts decide where a note gets written:
- "Remember that X" — "Remember that the API tests need a local Redis on port 6380." Claude writes this to Auto Memory.
- "Add to CLAUDE.md" — "Add this to CLAUDE.md." Claude appends the rule to your project CLAUDE.md instead.
Choose deliberately: rules everyone on the team needs go in CLAUDE.md (committed); per-machine quirks Claude discovered go in Auto Memory (machine-local).
Configuration & Toggles
{
  "autoMemoryEnabled": true,
  "autoMemoryDirectory": "~/my-custom-memory-dir"
}
- autoMemoryEnabled: on by default. Toggle it from /memory in-session, or set it in user or local settings (not project settings, to prevent shared projects from disabling it for your machine).
- autoMemoryDirectory: redirects the storage location. Accepted from policy, local, and user settings only, not project settings, so a shared repo cannot redirect your auto memory writes to a sensitive location.
- CLAUDE_CODE_DISABLE_AUTO_MEMORY=1: environment-variable kill switch for sandboxed runs (CI, ephemeral containers).
All worktrees and subdirectories of the same git repo share one auto memory directory on your machine. But that directory is not synced across machines, cloud environments, or teammates. If you switch laptops, you start with empty Auto Memory. This is by design — Auto Memory captures machine-specific quirks (your local Redis port, your sandbox URL) that wouldn't be useful or safe to share.
For team-shared learnings, use CLAUDE.md (committed to git). For machine-local Claude-discovered notes, use Auto Memory.
The exam may test recognition of when Auto Memory applies vs CLAUDE.md. Pattern: a developer says "every session I have to remind Claude that the local DB is on port 5433." Wrong answer: add to CLAUDE.md. Correct answer: that's exactly what Auto Memory exists for — it's machine-local, Claude-written, and survives across sessions. Use CLAUDE.md only when the rule should be shared with the team via git.
Claude's Native Memory Landscape & Cert-Critical Patterns
You've learned the academic 3-tier architecture (working / episodic / procedural) and the managed memory tool. Before we build it in code, take one step back and look at the full set of Claude-native memory mechanisms. Claude provides at least twelve distinct memory primitives across five scopes — from the conversation-level message array to file-based persistent memory in Claude Code. Knowing all twelve, and which scope each one operates at, is what separates "I learned the API" from "I can architect a Claude system."
Three signals tell you which scope to reach for. Volatility: if the data lives only for this turn, scope 1 is enough; just put it in the prompt. Reuse rate: if you'll re-send the same context across many calls, scope 2 (prompt caching, Files API) cuts the cost of that repeated input, by up to 90% for cached tokens. Persistence: if the data must outlive the session, scope 4 (CLAUDE.md, skills) auto-loads in every future session; scope 5 (your DB) handles scale and compliance. Most production agents use 3–4 of the five scopes simultaneously, not because more is better, but because each scope solves a different memory problem.
Cert-Critical Patterns: Crash Recovery & Tool-Output Trimming
Two memory patterns from scope 5 don't fit cleanly into the working/episodic/procedural taxonomy but show up repeatedly as Domain 5 cert questions. Both solve real production failures — an interrupted long-running session, and a context window flooded with verbose tool output.
Pattern 1: Crash Recovery Manifests
You're 40 minutes into a long-running migration. The agent has touched 18 files, run 6 test passes, and has 3 outstanding TODOs in flight. Your laptop reboots for an OS update. Without a manifest, the next session starts cold — you re-prompt from scratch and Claude rediscovers everything. With a manifest, the next session reads a small structured file and picks up exactly where it left off.
A crash recovery manifest is a small structured file (typically .claude/state/manifest.json or a markdown scratchpad) that the agent writes after every meaningful step. It captures the minimum information needed to resume work: modified files with one-line summaries, current test status, the agreed-upon plan, and any unresolved TODOs. The manifest lives outside the context window, so it survives session crashes, compaction, and machine restarts.
from pathlib import Path
import json
from datetime import datetime, timezone

MANIFEST = Path(".claude/state/manifest.json")

def write_manifest(modified_files, test_status, plan, current_step, open_todos):
    """Call after EVERY meaningful step in a long-running session."""
    MANIFEST.parent.mkdir(parents=True, exist_ok=True)
    MANIFEST.write_text(json.dumps({
        # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated as of Python 3.12
        "updated_at": datetime.now(timezone.utc).isoformat(),
        "modified_files": modified_files,  # [{"path": "src/api.ts", "summary": "added /v2/search"}]
        "test_status": test_status,        # "47/50 passing; auth_test.ts:42 failing"
        "plan": plan,                      # ["step1: schema", "step2: route", ...]
        "current_step": current_step,
        "open_todos": open_todos,          # ["fix locale fallback", "add rate-limit test"]
    }, indent=2))

def read_manifest():
    """First action of any resumed session."""
    if not MANIFEST.exists():
        return None
    return json.loads(MANIFEST.read_text())
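Because the manifest exists to survive crashes, it helps if a crash during the write itself cannot leave a half-written file behind. The sketch below shows one way to get that property: write to a temporary file in the same directory, then swap it into place with `os.replace`. The function name `write_manifest_atomic` and the demo path are assumptions for illustration, not part of the pattern as described above.

```python
import json
import os
import tempfile
from pathlib import Path


def write_manifest_atomic(path: Path, manifest: dict) -> None:
    """Write the manifest so a reader sees either the old file or the new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(manifest, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, path)  # rename over the old manifest (atomic on POSIX)
    except BaseException:
        Path(tmp_name).unlink(missing_ok=True)  # clean up the partial temp file
        raise


# Hypothetical demo in a throwaway directory
demo_path = Path(tempfile.mkdtemp()) / "state" / "manifest.json"
write_manifest_atomic(demo_path, {"current_step": "step2: route", "open_todos": []})
restored = json.loads(demo_path.read_text(encoding="utf-8"))
print(restored["current_step"])
```

The extra `fsync` costs a little latency per step, which is usually acceptable for a file written once per meaningful step rather than once per token.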
Pattern 2: Trimming Verbose Tool Outputs
The single fastest way to fill a context window is by piping in raw tool output. A Bash(grep -r foo .) call can return 8,000 lines of matches; a Read on a 4,000-line file dumps the entire file into context. Most of those tokens are noise — the agent only needs the first 50 matches or the relevant function. Trimming tool output before it enters context is a Domain 5 cert pattern that's easy to miss because it lives in the boundary between the tool and the agent.
Tool-output trimming is a pre-context filter applied to tool results before they're appended to the message history. The pattern is: the tool runs, returns its full result; a wrapper trims the result to the relevant slice (top-N matches, head/tail of a file, summary of a long stdout); only the trimmed slice is sent back to the model. The full result is preserved in a side log for debugging; the model only sees the trimmed version.
| Tool output type | Trim strategy | Limit |
|---|---|---|
| grep / search results | Top-N by relevance + total count footer | 50 lines + "(... 247 more matches)" |
| file Read on large files | Offset + limit; agent re-reads on demand | 2000 lines per call |
| Bash stdout / log streams | Head + tail; drop middle | first 30 + last 30 lines |
| DB query results | LIMIT clause + row count | 100 rows max per call |
| HTTP responses | Strip headers, body limit | 10KB body slice |
| test runner output | Failures only; pass-count summary | "47/50 passing" + 3 failure traces |
Never throw away the full result — just don't show it to the model. The trim pattern is: (1) the tool wrapper writes the full result to a side log on disk (.claude/state/tool_log.jsonl) for debugging and audit. (2) The wrapper returns a trimmed slice to the model. (3) If the agent needs more, it can call the tool again with a different query (more specific grep, different file offset). This way the model sees clean signal but you keep full forensic data.
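The three steps above can be sketched as a small wrapper. This is an illustrative outline, assuming a head-plus-tail strategy and a JSONL side log; the names `trim_output` and `run_with_trim`, the default limits, and the demo log location are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def trim_output(text: str, head: int = 30, tail: int = 30) -> str:
    """Keep the first `head` and last `tail` lines; replace the middle with a count."""
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text  # already small enough, pass through unchanged
    omitted = len(lines) - head - tail
    kept = lines[:head] + [f"(... {omitted} lines omitted ...)"] + lines[len(lines) - tail:]
    return "\n".join(kept)


def run_with_trim(tool_name: str, raw_output: str, log_path: Path) -> str:
    """Append the full result to a side log, then return only the trimmed slice."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as log:
        log.write(json.dumps({"tool": tool_name, "output": raw_output}) + "\n")
    return trim_output(raw_output)


# Hypothetical demo: 8,000 lines of fake search output
raw = "\n".join(f"match {i}" for i in range(8000))
log_file = Path(tempfile.mkdtemp()) / "tool_log.jsonl"
trimmed = run_with_trim("grep", raw, log_file)
print(len(trimmed.splitlines()))  # 61: 30 head + 1 marker + 30 tail
```

The omitted-count marker matters: it tells the model that more output exists, so it can decide to re-run the tool with a narrower query instead of assuming it saw everything.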
The exam tests recognition that "context degraded" isn't always solved by a bigger model or longer window — sometimes it's solved by less input. Trick scenario: an agent's quality drops after a Bash search returns 8000 lines. Wrong answers: switch to Opus, increase max_tokens, add more case facts. Correct answer: trim the tool output to the relevant slice before it enters the model's context. Pair with the manifest pattern above for crash recovery, and you've got the full Domain 5 toolkit.
Memory & Context Cert Topics — Where Each Lives
Domain 5 (15% of the exam) tests context management as a single discipline, but the topics are taught across multiple modules. Use this as your study map:
| Cert topic | Primary module |
|---|---|
| "Lost in the middle" effect & position-aware ordering | M03B Concept 4 + M08 |
| Immutable "case facts" blocks at START position | M03B + M08 |
| Progressive summarization risks & context rot | M03B Concept 5 |
| 3-tier memory (working / episodic / procedural) | M11 (this module) |
| Managed memory tool (Anthropic-hosted) | M11 (this module) |
| Auto Memory (Claude Code, machine-local) | M11 Auto Memory section |
| Crash recovery manifests | M11 (this section) |
| Trimming verbose tool outputs | M11 (this section) |
| /compact discipline (manual at 50%) | M26 Compaction section |
| CLAUDE.md compaction preservation directive | M26 Compaction section |
| PreCompact hook for critical state | M26 12 hook events |
| Subagent context isolation | M14 + M26 |
| Session forking (fork_session) vs resume | M26 Sessions |
| Information provenance & claim-source mappings | M27B |
| Temporal data & as-of reasoning | M27B |
| Stratified sampling + field-level confidence | M27B |
| Synthesis output: well-established vs contested | M27B |
Code Walkthrough: 3-Tier Memory System
Step 1: Working Memory Class
Let's start with the simplest tier. The WorkingMemory class is essentially a key-value store (a Python dict with timestamps) that tracks the current task state. The interesting design choice is the to_prompt() method — it formats the entire state into a structured text block that you inject into Claude's system prompt. Without this, every LLM call within a multi-step task would be stateless — Claude would forget what it learned two tool calls ago. The scratchpad solves that by riding along with every API request.
import json
from datetime import datetime, timezone
from typing import Any
class WorkingMemory:
"""Fast, mutable scratchpad for current task state.
Stores key-value pairs in memory and provides a formatted
string for injection into Claude's system prompt.
"""
def __init__(self, session_id: str):
self.session_id = session_id
self.created_at = datetime.now(timezone.utc).isoformat()
self._state: dict[str, Any] = {}
def set(self, key: str, value: Any) -> None:
"""Store a key-value pair in working memory."""
self._state[key] = {
"value": value,
"updated_at": datetime.now(timezone.utc).isoformat()
}
def get(self, key: str, default: Any = None) -> Any:
"""Retrieve a value by key. Returns default if not found."""
entry = self._state.get(key)
return entry["value"] if entry else default
def delete(self, key: str) -> bool:
"""Remove a key from working memory. Returns True if existed."""
return self._state.pop(key, None) is not None
def clear(self) -> None:
"""Clear all working memory state."""
self._state.clear()
def to_prompt(self) -> str:
"""Format working memory for injection into system prompt.
Returns a structured string that Claude can parse to understand
the current task state.
"""
if not self._state:
return "[Working Memory: empty — new task]"
lines = [
f"[Working Memory — Session {self.session_id}]",
f" Created: {self.created_at}"
]
for key, entry in self._state.items():
val = json.dumps(entry["value"]) if not isinstance(
entry["value"], str
) else entry["value"]
lines.append(f" {key}: {val}")
return "\n".join(lines)
def to_dict(self) -> dict:
"""Serialize for persistence (e.g., to Redis or SQLite)."""
return {
"session_id": self.session_id,
"created_at": self.created_at,
"state": {
k: v["value"] for k, v in self._state.items()
}
}
# --- Usage ---
wm = WorkingMemory(session_id="sess_001")
wm.set("intent", "find_deployment_date")
wm.set("entities", {"topic": "deployment", "timeframe": "last Tuesday"})
wm.set("search_results", [{"session": 42, "summary": "Deploy moved to Friday"}])
print(wm.to_prompt())
# [Working Memory — Session sess_001]
# Created: 2026-03-29T10:00:00+00:00
# intent: find_deployment_date
# entities: {"topic": "deployment", "timeframe": "last Tuesday"}
# search_results: [{"session": 42, "summary": "Deploy moved to Friday"}]
interface MemoryEntry {
value: unknown;
updatedAt: string;
}
class WorkingMemory {
/** Fast, mutable scratchpad for current task state. */
public readonly sessionId: string;
public readonly createdAt: string;
private state: Map<string, MemoryEntry> = new Map();
constructor(sessionId: string) {
this.sessionId = sessionId;
this.createdAt = new Date().toISOString();
}
set(key: string, value: unknown): void {
this.state.set(key, {
value,
updatedAt: new Date().toISOString(),
});
}
get(key: string, defaultValue: unknown = null): unknown {
const entry = this.state.get(key);
return entry ? entry.value : defaultValue;
}
delete(key: string): boolean {
return this.state.delete(key);
}
clear(): void {
this.state.clear();
}
toPrompt(): string {
if (this.state.size === 0) {
return "[Working Memory: empty — new task]";
}
const lines = [
`[Working Memory — Session ${this.sessionId}]`,
` Created: ${this.createdAt}`,
];
for (const [key, entry] of this.state) {
const val =
typeof entry.value === "string"
? entry.value
: JSON.stringify(entry.value);
lines.push(` ${key}: ${val}`);
}
return lines.join("\n");
}
toJSON(): Record<string, unknown> {
const state: Record<string, unknown> = {};
for (const [key, entry] of this.state) {
state[key] = entry.value;
}
return {
sessionId: this.sessionId,
createdAt: this.createdAt,
state,
};
}
}
// --- Usage ---
const wm = new WorkingMemory("sess_001");
wm.set("intent", "find_deployment_date");
wm.set("entities", { topic: "deployment", timeframe: "last Tuesday" });
wm.set("search_results", [{ session: 42, summary: "Deploy moved to Friday" }]);
console.log(wm.toPrompt());
You built a WorkingMemory class that acts as a structured scratchpad. It stores key-value pairs with timestamps, and the to_prompt() method formats the entire state into a string that you inject into Claude’s system prompt. This means every LLM call during a multi-step task can “see” everything the agent has learned so far.
Step 2: Episodic Memory Class
Now for the tier that gives agents cross-session continuity. The EpisodicMemory class wraps ChromaDB to store conversation summaries and retrieve them by semantic similarity. The core idea: after each conversation, you store a summary. Before each new conversation, you search for relevant past summaries and inject them into the prompt. Watch out for one ChromaDB pitfall: the basic client (chromadb.Client()) keeps everything in memory, which means all your carefully stored episodes vanish when the process restarts. That's why the constructor uses PersistentClient with a disk path; skip this and you'll spend hours debugging "why doesn't my agent remember anything?"
import chromadb # pip install "chromadb>=0.5.0" (quote it so the shell does not treat > as a redirect)
import uuid
from datetime import datetime, timezone
class EpisodicMemory:
"""Stores and retrieves conversation summaries via semantic search.
Uses ChromaDB with persistent storage so memories survive restarts.
"""
def __init__(self, persist_dir: str = "./chroma_db", collection_name: str = "episodes"):
try:
self.client = chromadb.PersistentClient(path=persist_dir)
self.collection = self.client.get_or_create_collection(
name=collection_name,
metadata={"hnsw:space": "cosine"} # cosine similarity
)
except Exception as e:
raise ConnectionError(
f"Failed to connect to ChromaDB at {persist_dir}: {e}"
) from e
def store_episode(
self,
summary: str,
session_id: str,
user_id: str = "default",
topics: list[str] | None = None,
) -> str:
"""Store a conversation summary as an episode.
Returns the episode ID for later reference.
"""
episode_id = f"ep_{uuid.uuid4().hex[:12]}"
metadata = {
"session_id": session_id,
"user_id": user_id,
"timestamp": datetime.now(timezone.utc).isoformat(),
"topics": ",".join(topics) if topics else "",
}
try:
self.collection.add(
documents=[summary],
metadatas=[metadata],
ids=[episode_id],
)
except Exception as e:
raise RuntimeError(f"Failed to store episode: {e}") from e
return episode_id
def retrieve(
self,
query: str,
n_results: int = 3,
user_id: str | None = None,
) -> list[dict]:
"""Find the most relevant past episodes for a query.
Returns a list of dicts with 'summary', 'session_id',
'timestamp', and 'similarity' keys.
"""
where_filter = {"user_id": user_id} if user_id else None
try:
results = self.collection.query(
query_texts=[query],
n_results=min(n_results, self.collection.count() or 1),
where=where_filter,
)
except Exception as e:
# Return empty on search failure — don't crash the agent
print(f"Warning: episodic search failed: {e}")
return []
if not results["documents"] or not results["documents"][0]:
return []
episodes = []
for i, doc in enumerate(results["documents"][0]):
meta = results["metadatas"][0][i]
distance = results["distances"][0][i] if results["distances"] else 0
episodes.append({
"summary": doc,
"session_id": meta.get("session_id", ""),
"timestamp": meta.get("timestamp", ""),
"topics": [t for t in meta.get("topics", "").split(",") if t], # drop the empty string produced when no topics were stored
"similarity": round(1 - distance, 3), # cosine distance → similarity
})
return episodes
def to_prompt(self, query: str, n_results: int = 3) -> str:
"""Retrieve relevant episodes and format for prompt injection."""
episodes = self.retrieve(query, n_results)
if not episodes:
return "[Episodic Memory: no relevant past interactions found]"
lines = ["[Relevant Past Interactions]"]
for ep in episodes:
lines.append(
f" - Session {ep['session_id']} ({ep['timestamp'][:10]}): "
f"{ep['summary']}"
)
return "\n".join(lines)
# --- Usage ---
em = EpisodicMemory(persist_dir="./memory_db")
# Store a past conversation summary
em.store_episode(
summary="User asked about deployment schedule. Decision: deploy moved to Friday due to QA delays.",
session_id="sess_042",
user_id="user_alice",
topics=["deployment", "scheduling"],
)
# Later, retrieve relevant episodes
results = em.retrieve("When is the deployment?", user_id="user_alice")
for r in results:
print(f"[{r['similarity']}] Session {r['session_id']}: {r['summary']}")
// npm install chromadb@^1.9.0
import { ChromaClient, Collection } from "chromadb";
import { randomUUID } from "crypto";
interface Episode {
summary: string;
sessionId: string;
timestamp: string;
topics: string[];
similarity: number;
}
class EpisodicMemory {
/** Stores and retrieves conversation summaries via semantic search. */
private client: ChromaClient;
private collection: Collection | null = null;
private collectionName: string;
constructor(collectionName: string = "episodes") {
this.client = new ChromaClient();
this.collectionName = collectionName;
}
async init(): Promise<void> {
try {
this.collection = await this.client.getOrCreateCollection({
name: this.collectionName,
metadata: { "hnsw:space": "cosine" },
});
} catch (error) {
throw new Error(`Failed to connect to ChromaDB: ${error}`);
}
}
async storeEpisode(
summary: string,
sessionId: string,
userId: string = "default",
topics: string[] = []
): Promise<string> {
if (!this.collection) throw new Error("Call init() first");
const episodeId = `ep_${randomUUID().replace(/-/g, "").slice(0, 12)}`;
const metadata = {
session_id: sessionId,
user_id: userId,
timestamp: new Date().toISOString(),
topics: topics.join(","),
};
try {
await this.collection.add({
documents: [summary],
metadatas: [metadata],
ids: [episodeId],
});
} catch (error) {
throw new Error(`Failed to store episode: ${error}`);
}
return episodeId;
}
async retrieve(
query: string,
nResults: number = 3,
userId?: string
): Promise<Episode[]> {
if (!this.collection) throw new Error("Call init() first");
const whereFilter = userId ? { user_id: userId } : undefined;
try {
const count = await this.collection.count();
const results = await this.collection.query({
queryTexts: [query],
nResults: Math.min(nResults, count || 1),
where: whereFilter,
});
if (!results.documents?.[0]?.length) return [];
return results.documents[0].map((doc, i) => ({
summary: doc ?? "",
sessionId: String(results.metadatas?.[0]?.[i]?.session_id ?? ""),
timestamp: String(results.metadatas?.[0]?.[i]?.timestamp ?? ""),
topics: String(results.metadatas?.[0]?.[i]?.topics ?? "")
.split(",")
.filter(Boolean),
// A distance of 0 (exact match) is falsy, so use ?? rather than a truthiness check.
similarity:
Math.round((1 - (results.distances?.[0]?.[i] ?? 1)) * 1000) / 1000,
}));
} catch (error) {
console.warn(`Episodic search failed: ${error}`);
return [];
}
}
async toPrompt(query: string, nResults: number = 3): Promise<string> {
const episodes = await this.retrieve(query, nResults);
if (!episodes.length) {
return "[Episodic Memory: no relevant past interactions found]";
}
const lines = ["[Relevant Past Interactions]"];
for (const ep of episodes) {
lines.push(
` - Session ${ep.sessionId} (${ep.timestamp.slice(0, 10)}): ${ep.summary}`
);
}
return lines.join("\n");
}
}
// --- Usage ---
const em = new EpisodicMemory("episodes");
await em.init();
await em.storeEpisode(
"User asked about deployment schedule. Decision: deploy moved to Friday.",
"sess_042",
"user_alice",
["deployment", "scheduling"]
);
const results = await em.retrieve("When is the deployment?", 3, "user_alice");
results.forEach((r) =>
console.log(`[${r.similarity}] Session ${r.sessionId}: ${r.summary}`)
);
You built an EpisodicMemory class that stores conversation summaries in ChromaDB with metadata (session ID, user ID, timestamp, topics). The retrieve() method does semantic search — even if the user asks about “deployment” using different words, ChromaDB finds the matching episode. The to_prompt() method formats retrieved episodes for direct injection into Claude’s prompt.
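Both retrieve() implementations convert ChromaDB's cosine distance into a similarity score with 1 - distance. To make that conversion concrete, here is a small stand-alone sketch with toy two-dimensional vectors. The vectors and the helper name `cosine_distance` are assumptions for illustration; real embeddings have hundreds of dimensions and are computed by the embedding function, not by hand.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance: 0.0 for the same direction, 1.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [2.0, 0.0]  # longer vector, same direction
orthogonal = [0.0, 3.0]      # unrelated direction

sim_same = 1 - cosine_distance(query, same_direction)
sim_orth = 1 - cosine_distance(query, orthogonal)
print(sim_same, sim_orth)  # 1.0 0.0
```

Vector length does not affect the score, only direction does. That is why a short query and a longer stored summary can still score as close matches.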
Step 3: Procedural Memory Class
The third tier is where agents get faster over time. ProceduralMemory stores reusable action templates (proven sequences of tool calls) and retrieves them by matching user requests to trigger conditions via semantic similarity. The clever part is the min_similarity threshold (0.7 by default): if no stored procedure matches the user's request closely enough, the agent falls back to reasoning from scratch instead of executing an irrelevant template. Stored procedures spare the agent from re-planning the same multi-step workflow every time, and the threshold guards against false matches.
import chromadb
import json
import uuid
class ProceduralMemory:
"""Stores and retrieves reusable action templates (tool-call sequences).
Trigger conditions are stored as embeddings for semantic matching.
Execution steps are stored as structured JSON.
"""
def __init__(self, persist_dir: str = "./chroma_db"):
try:
self.client = chromadb.PersistentClient(path=persist_dir)
self.collection = self.client.get_or_create_collection(
name="procedures",
metadata={"hnsw:space": "cosine"},
)
except Exception as e:
raise ConnectionError(f"Failed to connect to ChromaDB: {e}") from e
def store_procedure(
self,
name: str,
trigger: str,
steps: list[dict],
success_count: int = 1,
) -> str:
"""Store a reusable procedure.
Args:
name: Human-readable procedure name (e.g., 'weekly_report')
trigger: Description of when this procedure applies
steps: Ordered list of step dicts with 'tool', 'params', 'description'
success_count: Number of times this procedure has succeeded
"""
proc_id = f"proc_{uuid.uuid4().hex[:10]}"
metadata = {
"name": name,
"steps_json": json.dumps(steps),
"success_count": success_count,
}
try:
self.collection.add(
documents=[trigger], # trigger is the searchable text
metadatas=[metadata],
ids=[proc_id],
)
except Exception as e:
raise RuntimeError(f"Failed to store procedure: {e}") from e
return proc_id
def find_procedure(
self,
query: str,
min_similarity: float = 0.7,
) -> dict | None:
"""Find the best matching procedure for a user request.
Returns None if no procedure matches above the similarity threshold.
"""
try:
count = self.collection.count()
if count == 0:
return None
results = self.collection.query(
query_texts=[query],
n_results=1,
)
except Exception as e:
print(f"Warning: procedural search failed: {e}")
return None
if not results["documents"] or not results["documents"][0]:
return None
distance = results["distances"][0][0] if results["distances"] else 1.0
similarity = 1 - distance
if similarity < min_similarity:
return None
meta = results["metadatas"][0][0]
return {
"name": meta["name"],
"trigger": results["documents"][0][0],
"steps": json.loads(meta["steps_json"]),
"success_count": meta.get("success_count", 0),
"similarity": round(similarity, 3),
}
def to_prompt(self, query: str) -> str:
"""Find a matching procedure and format it as a suggested plan."""
proc = self.find_procedure(query)
if not proc:
return "[Procedural Memory: no matching skill template found]"
lines = [
f"[Suggested Procedure: {proc['name']} (used {proc['success_count']}x)]",
]
for i, step in enumerate(proc["steps"], 1):
lines.append(
f" Step {i}: {step['description']} "
f"(tool: {step['tool']})"
)
return "\n".join(lines)
# --- Usage ---
pm = ProceduralMemory(persist_dir="./memory_db")
# Store a proven procedure
pm.store_procedure(
name="weekly_sales_report",
trigger="Generate a weekly sales report with charts and insights",
steps=[
{"tool": "query_db", "params": {"query": "sales_last_7_days"}, "description": "Fetch sales data"},
{"tool": "aggregate", "params": {"group_by": "region"}, "description": "Aggregate by region"},
{"tool": "chart_gen", "params": {"type": "bar"}, "description": "Generate bar chart"},
{"tool": "summarize", "params": {}, "description": "Extract key insights"},
{"tool": "format_pdf", "params": {}, "description": "Format as PDF report"},
],
success_count=12,
)
# Later, find a matching procedure
result = pm.find_procedure("Can you create the weekly sales report?")
if result:
print(f"Found: {result['name']} (similarity: {result['similarity']})")
for step in result["steps"]:
print(f" → {step['description']}")
import { ChromaClient, Collection } from "chromadb";
import { randomUUID } from "crypto";
interface ProcedureStep {
tool: string;
params: Record<string, unknown>;
description: string;
}
interface Procedure {
name: string;
trigger: string;
steps: ProcedureStep[];
successCount: number;
similarity: number;
}
class ProceduralMemory {
/** Stores and retrieves reusable action templates. */
private client: ChromaClient;
private collection: Collection | null = null;
constructor() {
this.client = new ChromaClient();
}
async init(): Promise<void> {
try {
this.collection = await this.client.getOrCreateCollection({
name: "procedures",
metadata: { "hnsw:space": "cosine" },
});
} catch (error) {
throw new Error(`Failed to connect to ChromaDB: ${error}`);
}
}
async storeProcedure(
name: string,
trigger: string,
steps: ProcedureStep[],
successCount: number = 1
): Promise<string> {
if (!this.collection) throw new Error("Call init() first");
const procId = `proc_${randomUUID().replace(/-/g, "").slice(0, 10)}`;
try {
await this.collection.add({
documents: [trigger],
metadatas: [{
name,
steps_json: JSON.stringify(steps),
success_count: successCount,
}],
ids: [procId],
});
} catch (error) {
throw new Error(`Failed to store procedure: ${error}`);
}
return procId;
}
async findProcedure(
query: string,
minSimilarity: number = 0.7
): Promise<Procedure | null> {
if (!this.collection) throw new Error("Call init() first");
try {
const count = await this.collection.count();
if (count === 0) return null;
const results = await this.collection.query({
queryTexts: [query],
nResults: 1,
});
if (!results.documents?.[0]?.length) return null;
const distance = results.distances?.[0]?.[0] ?? 1.0;
const similarity = 1 - distance;
if (similarity < minSimilarity) return null;
const meta = results.metadatas?.[0]?.[0] ?? {};
return {
name: String(meta.name ?? ""),
trigger: results.documents[0][0] ?? "",
steps: JSON.parse(String(meta.steps_json ?? "[]")),
successCount: Number(meta.success_count ?? 0),
similarity: Math.round(similarity * 1000) / 1000,
};
} catch (error) {
console.warn(`Procedural search failed: ${error}`);
return null;
}
}
async toPrompt(query: string): Promise<string> {
const proc = await this.findProcedure(query);
if (!proc) {
return "[Procedural Memory: no matching skill template found]";
}
const lines = [
`[Suggested Procedure: ${proc.name} (used ${proc.successCount}x)]`,
];
proc.steps.forEach((step, i) => {
lines.push(` Step ${i + 1}: ${step.description} (tool: ${step.tool})`);
});
return lines.join("\n");
}
}
// --- Usage ---
const pm = new ProceduralMemory();
await pm.init();
await pm.storeProcedure(
"weekly_sales_report",
"Generate a weekly sales report with charts and insights",
[
{ tool: "query_db", params: { query: "sales_last_7_days" }, description: "Fetch sales data" },
{ tool: "aggregate", params: { group_by: "region" }, description: "Aggregate by region" },
{ tool: "chart_gen", params: { type: "bar" }, description: "Generate bar chart" },
{ tool: "summarize", params: {}, description: "Extract key insights" },
{ tool: "format_pdf", params: {}, description: "Format as PDF report" },
],
12
);
const result = await pm.findProcedure("Create the weekly sales report");
if (result) {
console.log(`Found: ${result.name} (${result.similarity})`);
result.steps.forEach((s) => console.log(` → ${s.description}`));
}
You built a ProceduralMemory class that stores skill templates as trigger-description/steps pairs. The trigger is stored as an embedding in ChromaDB, so when a user request comes in, semantic search finds the closest matching procedure. The steps are stored as JSON and loaded into the prompt as a suggested plan. The min_similarity threshold (0.7) prevents false matches — if no procedure is relevant enough, the agent falls back to reasoning from scratch.
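to_prompt() reads step['description'] and step['tool'] directly, so a malformed step would only surface as a KeyError at retrieval time. One possible safeguard, sketched below, is to check the step shape before calling store_procedure(). The helper `validate_steps` and its error messages are hypothetical and not part of the class above.

```python
REQUIRED_STEP_KEYS = {"tool", "params", "description"}


def validate_steps(steps: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the steps are well-formed."""
    problems: list[str] = []
    if not steps:
        problems.append("procedure has no steps")
    for i, step in enumerate(steps, 1):
        missing = REQUIRED_STEP_KEYS - step.keys()
        if missing:
            problems.append(f"step {i} is missing: {', '.join(sorted(missing))}")
        elif not isinstance(step["params"], dict):
            problems.append(f"step {i}: params must be a dict")
    return problems


good = [{"tool": "query_db", "params": {"query": "sales_last_7_days"}, "description": "Fetch sales data"}]
bad = [{"tool": "chart_gen"}]

print(validate_steps(good))  # []
print(validate_steps(bad))   # ['step 1 is missing: description, params']
```

Returning a list of problems rather than raising on the first one lets the caller log everything that is wrong with a candidate procedure in a single pass.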
Memory Manager — Orchestrating All Three Tiers
Step 4: Conversation Summarizer
This is the bridge between short-term conversation and long-term memory. The summarize_conversation function takes a full conversation log and asks Claude a simple question: "What were the important parts of this conversation?" Claude extracts the essential information — topics discussed, decisions made, user preferences learned, and action items remaining — and returns it as a compact JSON record.
The output is ~200-400 tokens, which is 10-20x smaller than the raw conversation. That compression is what makes episodic memory economically viable. Without it, storing full transcripts for 500 daily conversations would blow through your storage budget within a week. The interesting design choice here is using Claude itself as the summarizer — it understands conversational nuance better than any rule-based approach, and the cost (~$0.002 per summary) is trivial compared to the savings on future prompt tokens.
import anthropic # pip install "anthropic>=0.30.0" (quote it so the shell does not treat > as a redirect)
import json
async def summarize_conversation(
messages: list[dict],
client: anthropic.AsyncAnthropic | None = None,
) -> dict:
"""Summarize a conversation into a structured episode record.
Args:
messages: List of conversation messages (role + content dicts)
client: Anthropic client (creates one if not provided)
Returns:
Dict with: summary, topics, decisions, user_preferences, action_items
"""
if client is None:
client = anthropic.AsyncAnthropic() # reads ANTHROPIC_API_KEY env var
# Format the conversation for summarization
conversation_text = "\n".join(
f"{msg['role'].upper()}: {msg['content']}" for msg in messages
)
try:
response = await client.messages.create(
model="claude-sonnet-4-6",
max_tokens=500,
system=(
"You are a conversation summarizer. Extract the key information "
"from the conversation and return a JSON object with these fields:\n"
'- "summary": 1-2 sentence overview of the conversation\n'
'- "topics": list of main topics discussed\n'
'- "decisions": list of decisions made (empty if none)\n'
'- "user_preferences": any user preferences learned (empty if none)\n'
'- "action_items": pending items (empty if none)\n'
"Return ONLY valid JSON, no markdown fences."
),
messages=[{"role": "user", "content": conversation_text}],
)
result = json.loads(response.content[0].text)
return result
except json.JSONDecodeError:
# Fallback: return raw summary if Claude didn't produce valid JSON
return {
"summary": response.content[0].text[:500],
"topics": [],
"decisions": [],
"user_preferences": [],
"action_items": [],
}
except anthropic.APIError as e:
raise RuntimeError(f"Summarization API call failed: {e}") from e
import Anthropic from "@anthropic-ai/sdk"; // npm install @anthropic-ai/sdk@^0.30.0
interface EpisodeRecord {
summary: string;
topics: string[];
decisions: string[];
userPreferences: string[];
actionItems: string[];
}
async function summarizeConversation(
messages: Array<{ role: string; content: string }>,
client?: Anthropic
): Promise<EpisodeRecord> {
const anthropic = client ?? new Anthropic(); // reads ANTHROPIC_API_KEY
const conversationText = messages
.map((msg) => `${msg.role.toUpperCase()}: ${msg.content}`)
.join("\n");
try {
const response = await anthropic.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 500,
system:
"You are a conversation summarizer. Extract key information " +
"and return a JSON object with: summary, topics, decisions, " +
"user_preferences, action_items. Return ONLY valid JSON.",
messages: [{ role: "user", content: conversationText }],
});
const text =
response.content[0].type === "text" ? response.content[0].text : "";
const parsed = JSON.parse(text);
return {
summary: parsed.summary ?? "",
topics: parsed.topics ?? [],
decisions: parsed.decisions ?? [],
userPreferences: parsed.user_preferences ?? [],
actionItems: parsed.action_items ?? [],
};
} catch (error) {
if (error instanceof SyntaxError) {
return {
summary: "Summarization produced invalid JSON",
topics: [],
decisions: [],
userPreferences: [],
actionItems: [],
};
}
throw new Error(`Summarization failed: ${error}`);
}
}
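Even with the "no markdown fences" instruction, a model will occasionally wrap its JSON in a fenced block, which sends both summarizers straight to their fallback branch. The sketch below shows one tolerant way to parse the reply before giving up. The helper names `empty_record` and `parse_summary` are assumptions for illustration; the record fields mirror the ones requested in the system prompt above.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built this way to keep the example readable


def empty_record() -> dict:
    """Fresh record with every expected field present."""
    return {"summary": "", "topics": [], "decisions": [],
            "user_preferences": [], "action_items": []}


def parse_summary(text: str) -> dict:
    """Parse model output into an episode record, tolerating markdown fences."""
    cleaned = text.strip()
    cleaned = re.sub(rf"^{FENCE}(?:json)?\s*", "", cleaned)  # leading fence, optional "json" tag
    cleaned = re.sub(rf"\s*{FENCE}$", "", cleaned)           # trailing fence
    try:
        parsed = json.loads(cleaned)
    except json.JSONDecodeError:
        parsed = None
    if not isinstance(parsed, dict):
        # Not usable JSON: keep the raw text as the summary, truncated to 500 characters.
        return {**empty_record(), "summary": text[:500]}
    return {**empty_record(), **parsed}  # fill in any fields the model left out


fenced_reply = FENCE + 'json\n{"summary": "Deploy moved to Friday", "topics": ["deployment"]}\n' + FENCE
record = parse_summary(fenced_reply)
print(record["summary"], record["topics"], record["decisions"])
```

Merging into a fresh record means downstream code can rely on every field being present, even when the model omits one.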
Step 5: Memory Manager
Finally, the orchestrator that ties everything together. The MemoryManager has a clear lifecycle: start_session() creates fresh working memory and resets the conversation log. During the conversation, you update working memory, log each turn, and call build_memory_context() for each user message; it queries episodic and procedural memory for relevant entries and combines all three tiers into a single formatted string that becomes part of Claude's system prompt. end_session() summarizes the conversation, stores the summary in episodic memory, and clears working memory. Individual memory classes are building blocks; the Memory Manager is the glue that makes them work as a coherent system.
import anthropic
class MemoryManager:
"""Orchestrates all three memory tiers for a conversational agent.
Lifecycle:
1. start_session(): creates working memory and resets the conversation log
2. During conversation: update working memory via set/get; call build_memory_context() per user message to load relevant episodes + procedures
3. end_session(): summarizes conversation, stores episode, clears working memory
"""
def __init__(self, persist_dir: str = "./memory_db"):
self.episodic = EpisodicMemory(persist_dir=persist_dir)
self.procedural = ProceduralMemory(persist_dir=persist_dir)
self.working: WorkingMemory | None = None
self.client = anthropic.AsyncAnthropic()
self.conversation_log: list[dict] = []
def start_session(self, session_id: str, user_id: str = "default") -> str:
"""Initialize a new session with fresh working memory.
Returns the formatted memory context for the first prompt.
"""
self.working = WorkingMemory(session_id=session_id)
self.working.set("user_id", user_id)
self.conversation_log = []
return self.working.to_prompt()
def build_memory_context(self, user_message: str) -> str:
"""Build the full memory context to inject into Claude's system prompt.
Combines all three tiers into a single formatted string.
"""
if not self.working:
raise RuntimeError("No active session. Call start_session() first.")
sections = [
self.working.to_prompt(),
self.episodic.to_prompt(user_message, n_results=3),
self.procedural.to_prompt(user_message),
]
return "\n\n".join(sections)
def log_turn(self, role: str, content: str) -> None:
"""Log a conversation turn for later summarization."""
self.conversation_log.append({"role": role, "content": content})
async def end_session(self) -> str | None:
"""End the session: summarize, store episode, clear working memory.
Returns the episode ID if a summary was stored, None otherwise.
"""
if not self.working or not self.conversation_log:
return None
session_id = self.working.session_id
user_id = self.working.get("user_id", "default")
# Summarize the conversation
try:
summary_record = await summarize_conversation(
self.conversation_log, self.client
)
except Exception as e:
print(f"Warning: summarization failed: {e}")
summary_record = {
"summary": f"Session {session_id} — summarization failed",
"topics": [],
}
# Store in episodic memory
episode_id = self.episodic.store_episode(
summary=summary_record["summary"],
session_id=session_id,
user_id=user_id,
topics=summary_record.get("topics", []),
)
# Clear working memory
self.working.clear()
self.working = None
self.conversation_log = []
return episode_id
# --- Usage: Full agent loop ---
import asyncio


async def agent_loop():
    """Demonstrates a full conversation with memory management."""
    manager = MemoryManager(persist_dir="./memory_db")
    client = anthropic.AsyncAnthropic()

    # Start session
    session_id = "sess_044"
    manager.start_session(session_id, user_id="user_alice")

    # Simulate a user message
    user_msg = "What did we decide about the deployment schedule?"
    manager.log_turn("user", user_msg)

    # Update working memory first so the new state is included in the context
    manager.working.set("intent", "recall_decision")
    manager.working.set("topic", "deployment schedule")

    # Build memory-enriched system prompt
    memory_context = manager.build_memory_context(user_msg)
    system_prompt = f"""You are a helpful assistant with memory.
{memory_context}
Use the memory context above to provide informed, personalized responses.
If you find relevant past interactions, reference them naturally."""

    # Call Claude with memory context
    response = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": user_msg}],
    )
    assistant_msg = response.content[0].text
    manager.log_turn("assistant", assistant_msg)
    print(f"Agent: {assistant_msg}")

    # End session: summarize and persist
    episode_id = await manager.end_session()
    print(f"Session saved as episode: {episode_id}")


if __name__ == "__main__":
    asyncio.run(agent_loop())

import Anthropic from "@anthropic-ai/sdk";

class MemoryManager {
  /** Orchestrates all three memory tiers for a conversational agent. */
  private episodic: EpisodicMemory;
  private procedural: ProceduralMemory;
  private working: WorkingMemory | null = null;
  private client: Anthropic;
  private conversationLog: Array<{ role: string; content: string }> = [];

  constructor() {
    this.episodic = new EpisodicMemory("episodes");
    this.procedural = new ProceduralMemory();
    this.client = new Anthropic();
  }

  async init(): Promise<void> {
    await this.episodic.init();
    await this.procedural.init();
  }

  startSession(sessionId: string, userId: string = "default"): string {
    this.working = new WorkingMemory(sessionId);
    this.working.set("user_id", userId);
    this.conversationLog = [];
    return this.working.toPrompt();
  }

  async buildMemoryContext(userMessage: string): Promise<string> {
    if (!this.working) throw new Error("No active session");
    const sections = [
      this.working.toPrompt(),
      await this.episodic.toPrompt(userMessage, 3),
      await this.procedural.toPrompt(userMessage),
    ];
    return sections.join("\n\n");
  }

  logTurn(role: string, content: string): void {
    this.conversationLog.push({ role, content });
  }

  async endSession(): Promise<string | null> {
    if (!this.working || !this.conversationLog.length) return null;
    const sessionId = this.working.sessionId;
    const userId = String(this.working.get("user_id", "default"));

    let summaryRecord: EpisodeRecord;
    try {
      summaryRecord = await summarizeConversation(
        this.conversationLog,
        this.client
      );
    } catch {
      // Fall back to a placeholder record if summarization fails
      summaryRecord = {
        summary: `Session ${sessionId} - summarization failed`,
        topics: [],
        decisions: [],
        userPreferences: [],
        actionItems: [],
      };
    }

    const episodeId = await this.episodic.storeEpisode(
      summaryRecord.summary,
      sessionId,
      userId,
      summaryRecord.topics
    );

    this.working.clear();
    this.working = null;
    this.conversationLog = [];
    return episodeId;
  }
}

// --- Usage: Full agent loop ---
async function agentLoop() {
  const manager = new MemoryManager();
  await manager.init();
  const client = new Anthropic();

  manager.startSession("sess_044", "user_alice");
  const userMsg = "What did we decide about the deployment schedule?";
  manager.logTurn("user", userMsg);

  const memoryContext = await manager.buildMemoryContext(userMsg);
  const systemPrompt = `You are a helpful assistant with memory.
${memoryContext}
Use the memory context above to provide informed, personalized responses.`;

  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: systemPrompt,
    messages: [{ role: "user", content: userMsg }],
  });
  const text =
    response.content[0].type === "text" ? response.content[0].text : "";
  manager.logTurn("assistant", text);
  console.log(`Agent: ${text}`);

  const episodeId = await manager.endSession();
  console.log(`Session saved as episode: ${episodeId}`);
}

You built a complete MemoryManager that orchestrates all three tiers. The lifecycle is: start_session() creates working memory → build_memory_context() combines all three tiers into a system prompt → during the conversation, you update working memory and log turns → end_session() summarizes the conversation, stores it in episodic memory, and clears working memory. Each Claude call gets a rich memory context that includes current task state, relevant past interactions, and matching skill templates — all in under 2,000 tokens of memory overhead.
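Nothing in the MemoryManager itself enforces a ceiling on memory overhead, so a budget check is one way to keep it bounded. The following is a minimal sketch under stated assumptions: `estimate_tokens` and `fit_to_budget` are hypothetical helpers, the four-characters-per-token ratio is a rough heuristic and not a real tokenizer, and the sections are assumed to arrive in priority order (working, then episodic, then procedural).

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (heuristic, not a tokenizer)."""
    return max(1, len(text) // 4)


def fit_to_budget(sections: list[str], max_tokens: int = 2000) -> str:
    """Join sections in priority order, skipping any that would exceed the budget."""
    kept: list[str] = []
    used = 0
    for section in sections:
        cost = estimate_tokens(section)
        if used + cost > max_tokens:
            continue  # skip this one; a later, smaller section may still fit
        kept.append(section)
        used += cost
    return "\n\n".join(kept)


# Example: the oversized episodic section is dropped, the other two are kept.
sections = [
    "[Working Memory]\n intent: recall_decision",
    "[Relevant Past Interactions]\n" + "x" * 10_000,  # oversized on purpose
    "[Suggested Procedure: generate_report]",
]
context = fit_to_budget(sections, max_tokens=2000)
print(context)
```

Dropping a whole section is the bluntest policy; truncating the episodic section to fewer episodes would preserve more information at the cost of slightly more logic.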
The summarization call at session end uses ~500 input tokens + ~200 output tokens per conversation. Assuming Claude Sonnet list prices of about $3 per million input tokens and $15 per million output tokens (check current pricing), that is roughly $0.0045 per session, or about $2.25/day for 500 sessions. Not replaying full conversation history (4,000+ tokens per session) saves at least $6/day on input tokens at the same rate, and more when the history would otherwise be resent on every turn, so the savings outweigh the summarization cost.
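The arithmetic behind estimates like these can be sketched as follows. The per-million-token rates are assumed example values, not figures defined by this module, so substitute the current prices for your model; with different rates the totals will differ from any figures quoted above. `daily_cost` is a hypothetical helper.

```python
def daily_cost(sessions_per_day: int, input_tokens: int, output_tokens: int,
               input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Daily cost in dollars for one LLM call per session."""
    per_session = (input_tokens * input_price_per_mtok
                   + output_tokens * output_price_per_mtok) / 1_000_000
    return sessions_per_day * per_session


# ASSUMED example rates in dollars per million tokens; check current pricing.
INPUT_PRICE = 3.00
OUTPUT_PRICE = 15.00

# Cost of one summarization call per session (~500 tokens in, ~200 out).
summarization = daily_cost(500, 500, 200, INPUT_PRICE, OUTPUT_PRICE)

# Input cost avoided by not replaying ~4,000 tokens of history once per session.
replay = daily_cost(500, 4000, 0, INPUT_PRICE, OUTPUT_PRICE)

print(f"Summarization: ${summarization:.2f}/day")
print(f"History replay avoided: ${replay:.2f}/day")
```

If history would be replayed on every turn instead of once per session, multiply the replay figure by the average number of turns.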
Hands-On Exercise
What You'll Build
A 3-tier memory system (working, episodic, procedural) with a MemoryManager that wires them together. You'll run a multi-turn session, save it to episodic memory, start a new session, and verify the agent remembers key details from the first session.
Time Estimate: 45–60 minutes
Prerequisites: Python 3.10+, an Anthropic API key (console.anthropic.com), and a terminal
Files You'll Create:
- memory_system.py: All 3 memory tiers + MemoryManager + test harness
- ./memory_db/: Auto-created ChromaDB persistent storage directory
Environment Setup
mkdir memory-lab && cd memory-lab
python -m venv venv && source venv/bin/activate # Windows: venv\Scripts\activate
pip install "anthropic>=0.30.0" "chromadb>=0.4.0"
export ANTHROPIC_API_KEY=your-key-here # Windows: set ANTHROPIC_API_KEY=your-key-here
Step 1: Build All 3 Memory Tiers + MemoryManager
This step creates the complete memory system in one file: WorkingMemory (key-value scratchpad), EpisodicMemory (ChromaDB-backed past interactions), ProceduralMemory (reusable action templates), and a MemoryManager that wires them together and provides build_context() for injecting memory into prompts.
Create a new file called memory_system.py and add the following:
import json
import time
import uuid

import chromadb
import anthropic

client = anthropic.Anthropic()


# ── Tier 1: Working Memory (key-value scratchpad) ───────────
class WorkingMemory:
    """Fast, mutable scratchpad for current task state."""

    def __init__(self):
        self._store: dict = {}
        self.session_id = f"sess_{uuid.uuid4().hex[:8]}"
        self.created_at = time.time()

    def set(self, key: str, value) -> None:
        self._store[key] = value

    def get(self, key: str, default=None):
        return self._store.get(key, default)

    def delete(self, key: str) -> None:
        self._store.pop(key, None)

    def clear(self) -> None:
        self._store.clear()

    def to_prompt(self) -> str:
        if not self._store:
            return "[Working Memory: empty]"
        lines = [f"[Working Memory — Session {self.session_id}]"]
        for k, v in self._store.items():
            lines.append(f" {k}: {v}")
        return "\n".join(lines)

    def to_dict(self) -> dict:
        return dict(self._store)


# ── Tier 2: Episodic Memory (past interactions via ChromaDB) ─
class EpisodicMemory:
    """Searchable archive of past conversation summaries."""

    def __init__(self, persist_dir: str = "./memory_db"):
        self._client = chromadb.PersistentClient(path=persist_dir)
        self._collection = self._client.get_or_create_collection(
            name="episodes", metadata={"hnsw:space": "cosine"}
        )

    def store_episode(self, summary: str, session_id: str, metadata: dict | None = None) -> str:
        episode_id = f"ep_{uuid.uuid4().hex[:12]}"
        meta = {
            "session_id": session_id,
            "timestamp": time.time(),
            **(metadata or {}),
        }
        self._collection.add(
            ids=[episode_id],
            documents=[summary],
            metadatas=[meta],
        )
        return episode_id

    def recall(self, query: str, top_k: int = 3) -> list[dict]:
        if self._collection.count() == 0:
            return []
        # Never request more results than the collection holds
        results = self._collection.query(
            query_texts=[query],
            n_results=min(top_k, self._collection.count()),
        )
        episodes = []
        for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
            episodes.append({"summary": doc, **meta})
        return episodes

    def to_prompt(self, query: str) -> str:
        episodes = self.recall(query, top_k=2)
        if not episodes:
            return "[Past Interactions: none yet]"
        lines = ["[Relevant Past Interactions]"]
        for ep in episodes:
            ts = time.strftime("%Y-%m-%d", time.localtime(ep.get("timestamp", 0)))
            lines.append(f" - Session {ep.get('session_id', '?')} ({ts}): {ep['summary']}")
        return "\n".join(lines)

    @property
    def count(self) -> int:
        return self._collection.count()


# ── Tier 3: Procedural Memory (reusable action templates) ────
class ProceduralMemory:
    """Library of proven tool-call sequences."""

    def __init__(self):
        self._procedures: dict[str, dict] = {}

    def store_procedure(self, name: str, description: str, steps: list[str]) -> None:
        self._procedures[name] = {
            "description": description,
            "steps": steps,
            "usage_count": 0,
        }

    def find_procedure(self, query: str) -> dict | None:
        """Simple keyword matching (production: use embeddings)."""
        query_lower = query.lower()
        best_match, best_score = None, 0
        for name, proc in self._procedures.items():
            # Score = number of query words found in the name or description
            score = sum(1 for word in query_lower.split()
                        if word in name.lower() or word in proc["description"].lower())
            if score > best_score:
                best_match, best_score = (name, proc), score
        if best_match and best_score > 0:
            best_match[1]["usage_count"] += 1
            return {"name": best_match[0], **best_match[1]}
        return None

    def to_prompt(self, query: str) -> str:
        proc = self.find_procedure(query)
        if not proc:
            return "[Procedures: no matching template]"
        steps_str = "\n".join(f" Step {i+1}: {s}" for i, s in enumerate(proc["steps"]))
        return f"[Suggested Procedure: {proc['name']} (used {proc['usage_count']}x)]\n{steps_str}"


# ── MemoryManager (wires all 3 tiers together) ───────────────
class MemoryManager:
    """Orchestrates all three memory tiers."""

    def __init__(self, persist_dir: str = "./memory_db"):
        self.working = WorkingMemory()
        self.episodic = EpisodicMemory(persist_dir=persist_dir)
        self.procedural = ProceduralMemory()
        self.conversation_log: list[dict] = []

    def build_context(self, current_query: str) -> str:
        """Build combined memory context for injection into the LLM prompt."""
        parts = [
            self.working.to_prompt(),
            self.episodic.to_prompt(current_query),
            self.procedural.to_prompt(current_query),
        ]
        return "\n\n".join(parts)

    def log_turn(self, role: str, content: str) -> None:
        self.conversation_log.append({"role": role, "content": content})

    def end_session(self, summary: str | None = None) -> str | None:
        """Summarize and archive the session to episodic memory.

        Returns the episode ID, or None if there is nothing to archive.
        """
        if not summary and not self.conversation_log:
            # Nothing to summarize; storing a None document would fail
            return None
        if not summary:
            # Auto-summarize with Claude
            transcript = "\n".join(f"{m['role']}: {m['content']}" for m in self.conversation_log)
            try:
                response = client.messages.create(
                    model="claude-sonnet-4-6", max_tokens=300,
                    system="Summarize this conversation in 2-3 sentences. Preserve key decisions, preferences, and facts.",
                    messages=[{"role": "user", "content": transcript}],
                )
                summary = response.content[0].text
            except Exception:
                summary = f"Session with {len(self.conversation_log)} turns."
        episode_id = self.episodic.store_episode(
            summary=summary,
            session_id=self.working.session_id,
            metadata={"turns": len(self.conversation_log)},
        )
        # Reset for next session
        self.working.clear()
        self.working = WorkingMemory()  # new session_id
        self.conversation_log = []
        return episode_id

    def chat(self, user_message: str) -> str:
        """Send a message with full memory context."""
        self.log_turn("user", user_message)
        memory_context = self.build_context(user_message)
        response = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=1024,
            system=(
                "You are a helpful assistant with multi-session memory. "
                "Use the memory context below to inform your responses.\n\n"
                f"{memory_context}"
            ),
            messages=[{"role": m["role"], "content": m["content"]}
                      for m in self.conversation_log],
        )
        reply = response.content[0].text
        self.log_turn("assistant", reply)
        return reply


# ── Tests ────────────────────────────────────────────────────
if __name__ == "__main__":
    import shutil

    # Clean up any previous test data
    shutil.rmtree("./memory_db", ignore_errors=True)

    print("═" * 55)
    print("TEST 1: Individual Memory Tiers")
    print("═" * 55)

    # Working Memory
    wm = WorkingMemory()
    wm.set("intent", "schedule_meeting")
    wm.set("date", "2026-04-05")
    wm.set("attendees", ["Alice", "Bob"])
    print("\n Working Memory:")
    print(" " + wm.to_prompt().replace("\n", "\n "))

    # Episodic Memory
    em = EpisodicMemory(persist_dir="./memory_db")
    em.store_episode("User prefers email over Slack for notifications.", "sess_001")
    em.store_episode("Discussed deployment schedule. Decision: deploy Friday.", "sess_002")
    em.store_episode("Reviewed API rate limits. Set max to 1000 req/min.", "sess_003")
    print(f"\n Episodic Memory: {em.count} episodes stored")
    print(" " + em.to_prompt("deployment schedule").replace("\n", "\n "))

    # Procedural Memory
    pm = ProceduralMemory()
    pm.store_procedure("generate_report", "Generate a formatted report from data",
                       ["Gather data from sources", "Analyze key metrics", "Format as markdown", "Send to user"])
    pm.store_procedure("debug_api_error", "Debug and fix API errors",
                       ["Check error code and message", "Search logs for context", "Identify root cause", "Suggest fix"])
    print(f"\n Procedural Memory:")
    print(" " + pm.to_prompt("generate a report").replace("\n", "\n "))

    print(f"\n{'═' * 55}")
    print("TEST 2: Cross-Session Memory (with Claude)")
    print("═" * 55)

    # Session 1
    # Note: this manager shares ./memory_db with Test 1, so the episodes seeded
    # above (sess_001 to sess_003) can also appear in retrieval results.
    mm = MemoryManager(persist_dir="./memory_db")
    mm.working.set("intent", "setup_preferences")
    print("\n Session 1:")
    reply1 = mm.chat("Hi! I prefer getting notifications via email, not Slack.")
    print(f" Turn 1: {reply1[:120]}...")
    reply2 = mm.chat("Also, let's plan to deploy the new API on Friday.")
    print(f" Turn 2: {reply2[:120]}...")
    ep_id = mm.end_session()
    print(f" Session ended → episode {ep_id}")

    # Session 2: should remember preferences from Session 1
    print("\n Session 2:")
    reply3 = mm.chat("Do you remember my notification preference?")
    print(f" Turn 1: {reply3[:150]}...")
    print(f"\n Memory context that was injected:")
    print(" " + mm.build_context("notification preference").replace("\n", "\n "))

    # Cleanup
    shutil.rmtree("./memory_db", ignore_errors=True)

    print(f"\n{'═' * 55}")
    print("✓ All tests complete.")
    print("═" * 55)
Run it:
python memory_system.py
This single command runs both Test 1 (individual tiers, no API calls) and Test 2 (cross-session memory with live Claude calls). Make sure your ANTHROPIC_API_KEY is set before running.
Look for these key behaviors:
- Test 1: All 3 tiers produce formatted prompt output — working memory shows key-value pairs, episodic shows matching summaries, procedural shows matching template
- Test 2, Session 1: Claude responds to preferences and deployment planning normally
- Test 2, Session 2: Claude should reference the email preference from Session 1 — this confirms episodic memory retrieval is working
- Memory context: Should show past interactions injected with the correct session ID and summary text
- ModuleNotFoundError: No module named 'chromadb' → Run pip install chromadb
- Episodic memory returns empty in Session 2 → Make sure end_session() was called after Session 1. Check that the ./memory_db directory was created (PersistentClient).
- Claude doesn't mention the preference in Session 2 → Check the memory context output. If it shows "[Past Interactions: none yet]", the episode wasn't stored. Verify end_session() succeeded.
- Permission error on ./memory_db → Delete the memory_db folder and try again. On Windows, make sure no other process has the folder open.
Verify Everything Works
Run the full test suite. Both tests should complete, with Session 2 demonstrating cross-session memory recall:
python memory_system.py
If Claude references the email preference in Session 2 without being told again, your 3-tier memory system is working correctly.
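Checking the reply by eye works for one run; a small helper can make the same check repeatable. This is a rough sketch, assuming that a case-insensitive keyword match is a good enough signal. Model output varies between runs, so treat a miss as a prompt to inspect the injected memory context and not as proof of failure. `recalled` is a hypothetical helper name, and the sample reply is invented for illustration.

```python
def recalled(reply: str, expected_terms: list[str]) -> bool:
    """Return True if every expected term appears in the reply (case-insensitive)."""
    text = reply.lower()
    return all(term.lower() in text for term in expected_terms)


# Illustrative reply text; in the exercise you would pass reply3 instead.
sample_reply = "Yes, you told me you prefer email notifications rather than Slack."
print(recalled(sample_reply, ["email"]))
```

In the exercise script, a call such as `recalled(reply3, ["email"])` after Session 2 would give a quick pass/fail signal.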
You've built a complete multi-layer memory system with working memory (in-session scratchpad), episodic memory (cross-session recall via ChromaDB), and procedural memory (reusable action templates). This is the architecture used by production agents that need to remember user preferences, past decisions, and learned workflows across sessions.
- Memory compaction: When episodic memory exceeds 100 entries, merge related episodes from the same week into consolidated records
- /memory command: Add a command that lets the user inspect all three memory tiers in a formatted display
- Procedural learning: When the agent completes a multi-step task, automatically extract and store the tool sequence as a new procedure
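The memory compaction idea above could be prototyped along these lines. This is a minimal sketch under stated assumptions: episodes are plain dicts with `summary` and `timestamp` keys (the shape `recall()` returns in the exercise), every episode from the same ISO week is merged instead of only related ones, and merging is plain concatenation where a real system would probably re-summarize with Claude. `week_key` and `compact_episodes` are hypothetical helper names.

```python
from collections import defaultdict
from datetime import datetime, timezone


def week_key(timestamp: float) -> str:
    """Return an ISO year-week label (e.g. '2026-W14') for a Unix timestamp in UTC."""
    iso = datetime.fromtimestamp(timestamp, tz=timezone.utc).isocalendar()
    return f"{iso[0]}-W{iso[1]:02d}"


def compact_episodes(episodes: list[dict], threshold: int = 100) -> list[dict]:
    """Merge same-week episodes into one record each once the store exceeds threshold."""
    if len(episodes) <= threshold:
        return episodes  # small enough; leave untouched
    by_week: dict[str, list[dict]] = defaultdict(list)
    for ep in episodes:
        by_week[week_key(ep["timestamp"])].append(ep)
    merged = []
    for week, group in sorted(by_week.items()):
        group.sort(key=lambda ep: ep["timestamp"])
        merged.append({
            "summary": " ".join(ep["summary"] for ep in group),  # naive merge
            "timestamp": group[-1]["timestamp"],  # keep the most recent time
            "week": week,
            "merged_from": len(group),
        })
    return merged
```

Writing the merged records back to the vector store, and deleting the originals, is left out here because it depends on how your store handles updates.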
- Forgetting persistent storage: chromadb.Client() is in-memory only. Always use PersistentClient(path="...") for cross-session memory.
- Not handling empty retrieval: On the first session, episodic and procedural memory are empty. Your code must return graceful defaults, not crash.
- Overstuffing the prompt: Limit episodic retrieval to 2–3 episodes. More context ≠ better responses.
- Skipping summarization: Store summaries, not raw transcripts. Raw transcripts bloat retrieval and waste tokens.
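One way to guard against overstuffing the prompt is to cap retrieval by both count and relevance. The sketch below assumes each retrieved hit is a dict carrying a `distance` value where lower means more similar; that shape is hypothetical and is not the return format of `recall()` in the exercise. The 0.5 cutoff is an arbitrary illustrative threshold that you would tune on your own data, and `select_episodes` is a hypothetical helper name.

```python
def select_episodes(hits: list[dict], max_results: int = 3,
                    max_distance: float = 0.5) -> list[dict]:
    """Keep at most max_results hits whose distance is within max_distance."""
    relevant = [h for h in hits if h["distance"] <= max_distance]
    relevant.sort(key=lambda h: h["distance"])  # most similar first
    return relevant[:max_results]


# Illustrative hits; distances are made up for the example.
hits = [
    {"summary": "Prefers email notifications.", "distance": 0.12},
    {"summary": "Set API rate limit to 1000 req/min.", "distance": 0.81},
    {"summary": "Deploy on Friday.", "distance": 0.35},
]
for hit in select_episodes(hits):
    print(hit["summary"])
```

An empty input simply yields an empty list, which also covers the first-session case where nothing has been stored yet.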
Knowledge Check
1. Match each scenario to the correct memory tier:
Scenario: “The agent needs to store that the user’s preferred output format is CSV, learned during last week’s conversation.”
2. Why is it more efficient to store conversation summaries in episodic memory rather than full transcripts?
3. An agent keeps forgetting user preferences between sessions even though it uses a vector database. What is the most likely cause?
Use PersistentClient(path="./chroma_db") so data is written to disk and survives restarts.
4. What is the role of the summarization pipeline in the memory architecture?
5. A customer support agent handles 500 conversations per day. Without any management, the episodic memory store will grow unbounded. Which approach best prevents this while preserving useful memories?
6. In M08, you learned about conversation context management. How does multi-layer memory improve upon the basic approaches covered there?
Module Summary
Key Concepts Recap
- Multi-layer memory architecture: Separate memory into working (current task), episodic (past interactions), and procedural (learned skills) tiers
- Working memory: A mutable key-value scratchpad injected into every LLM call, cleared when the task completes
- Episodic memory: Conversation summaries stored in a vector database, retrieved by semantic search for cross-session continuity
- Procedural memory: Reusable action templates with trigger conditions, retrieved by similarity matching to avoid re-planning
- Summarization pipeline: Uses Claude to compress raw conversations into compact episode records (93% token reduction)
- Memory Manager: Orchestrates all three tiers — loads at session start, maintains during conversation, persists at session end
Next: M12 — ReAct Agent Loop
You’ve given your agent a brain that remembers. Now it’s time to give it the ability to reason and act. In M12, you’ll implement the ReAct pattern — a structured loop where Claude thinks about what to do, takes an action (tool call), observes the result, and decides the next step. This is the foundation of autonomous agent behavior, and your multi-layer memory will play a key role in giving the agent context for its decisions.