M03: Prompts — Programming in Natural Language

Prompts are how you program Claude. This module teaches you the anatomy of effective prompts, battle-tested engineering patterns, and how to build a conversation manager that gives your agent a persistent memory.

Learning Objectives

  • Explain the roles of system, user, and assistant messages in the Messages API
  • Apply zero-shot, few-shot, and chain-of-thought prompting patterns and predict which works best for a given task
  • Describe the stateless prompt-to-completion loop and why your code must manage context
  • Build effective system prompts with structured sections for personaA role or identity assigned to Claude via the system prompt. For example, "You are a senior Python developer" makes Claude respond with that expertise and perspective., constraints, and output format
  • Implement a ConversationManager class that maintains multi-turnA conversation with multiple back-and-forth exchanges between user and assistant. Since the API is stateless, your code must store and resend the full message history with each new turn. context

Anatomy of a Prompt: Message Roles

Everyday Analogy

Before: Imagine you could only communicate with an AI by typing one giant blob of text with no structure — your instructions, your question, and any prior context all mashed together with no labels.

The pain: The AI had no way to distinguish "this is how you should behave" from "this is what the user is asking," which led to inconsistent, hard-to-control responses.

The mapping: The Messages API solves this like a screenplay: the system messageA special instruction sent with every API call that defines Claude's persona, rules, and behavior. It's invisible to end users but shapes every response. It counts as input tokens. sets the stage directions (persistent rules the audience never sees), the user delivers their lines (the question or task), and the assistant responds in character. The director (system) never appears on screen but controls the entire performance — just like a well-structured system prompt invisibly shapes every reply.

What this actually looks like in code: Here's the actual structure you send to the API. Notice how each piece has its own dedicated slot — system prompt separate, messages tagged with roles:

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system="You are a helpful coding assistant.",  # ← the "director"
    messages=[
        {"role": "user", "content": "How do I reverse a string?"},     # ← actor 1
        {"role": "assistant", "content": "Use slicing: s[::-1]"},       # ← actor 2
        {"role": "user", "content": "What about in JavaScript?"},       # ← actor 1 again
    ]  # Roles MUST alternate: user, assistant, user, assistant...
)
Technical Definition When you call the Messages APIAnthropic's primary API for interacting with Claude. You send a list of role-tagged messages and receive a response. Supports system prompts, tool use, streaming, and more., you send two things. First, a system parameter — a string of persistent instructions that Claude always follows but the end user never sees. Second, a messages array — a list of objects, each tagged with a role.

The roles must strictly alternate: "user" (what the human says) then "assistant" (what Claude said previously). Think of it as handing Claude the full script of the conversation so far, plus the director's notes. Claude reads the entire array as one seamless context and generates the next assistant turn. It does not "remember" anything from previous API calls — each request starts from scratch.
Animation: Message Role Stack
systemYou are a helpful coding assistant. Be concise. Use examples.
↓ influences every response
userHow do I reverse a string in Python?
assistantUse slicing: reversed_str = my_string[::-1]
userWhat about in JavaScript?
assistantSplit, reverse, join: str.split('').reverse().join('')
Diagram: Message Role Flow
system You are a helpful coding assistant. Be concise. persistent influences every turn user How do I reverse a string in Python? assistant Use slicing: reversed_str = my_string[::-1] user What about in JavaScript? assistant str.split('').reverse().join('') messages[ ] Each API call sends the FULL array — Claude has no memory between calls
Why It Matters In production, a poorly structured system prompt can drop task accuracy from 92% to under 60%. At one AI consultancy, simply restructuring a customer-service agent's system prompt — adding explicit role, constraints, and output format sections — reduced hallucinated refund amounts by 74% and cut average response latency by 0.8 seconds. As you build agents in later modules, the system prompt will define your agent's personality, capabilities, tool-use instructions, and safety guardrails. Getting roles right here is the foundation for everything that follows.

Prompt Engineering Patterns

Everyday Analogy

Before: Early LLM users had exactly one tool: type a question and hope for the best. There was no systematic way to improve the quality of the response beyond rewording your question and retrying.

The pain: Results were wildly inconsistent — the same model might get a math problem right once and wrong three times, and users had no framework for understanding why or how to fix it.

The mapping: Prompt patterns are like teaching strategies that give you repeatable levers. Sometimes you just ask the question (zero-shotA prompting pattern where you give the model a task with no examples. It relies entirely on the model's pre-trained knowledge. Works well for simple, common tasks.), like asking a student a pop quiz question. Sometimes you show examples first (few-shotA prompting pattern where you include 2-5 input/output examples before your actual question. The model learns the desired pattern from the examples. Great for formatting and classification tasks.), like demonstrating solved problems before the test. And sometimes you walk through the reasoning step by step (chain-of-thoughtA prompting pattern that instructs the model to show its reasoning step by step before giving the final answer. Dramatically improves accuracy on math, logic, and multi-step problems. Often triggered by adding "Let's think step by step" to the prompt.), like a tutor working through a problem on a whiteboard so the student can follow the logic.

What these patterns actually look like: Here's the same question sent three ways. Notice how the prompt structure changes — not the question itself:

# ZERO-SHOT — just the question, no help
"Classify this email as spam or not-spam: 'You won $1M! Click here!'"

# FEW-SHOT — show examples first, then ask
"""Classify these emails:
Email: 'Meeting tomorrow at 3pm' → not-spam
Email: 'Your invoice is attached' → not-spam
Email: 'FREE VIAGRA!!!' → spam

Now classify: 'You won $1M! Click here!' →"""

# CHAIN-OF-THOUGHT — ask for step-by-step reasoning
"""Classify this email as spam or not-spam: 'You won $1M! Click here!'

Think step by step:
1. What signals suggest spam?
2. What signals suggest legitimate?
3. Weigh the evidence
4. Final classification:"""

The zero-shot version uses the fewest tokens but gives Claude no guidance on format. Few-shot "teaches by example" — Claude mirrors the → spam/not-spam format from your examples. Chain-of-thought forces Claude to show its reasoning, making errors visible and easy to debug.

Technical Definition Let's walk through the three core patterns one at a time.

Zero-shot means you give the model a task with no examples at all — it relies entirely on what it learned during training. This works fine for common, well-defined tasks like "translate this to French" or "summarize this paragraph."

Few-shot means you include 2–5 solved examples before your actual question. The model detects the pattern from those examples and applies it to your new input. For instance, you might show three product descriptions formatted as bullet points, and then ask it to format a fourth the same way.

Chain-of-thought (CoT) instructs the model to show its intermediate reasoning step by step before giving the final answer. Why does this help? Because when the model jumps straight to an answer, small errors compound invisibly. When it reasons out loud, each step becomes a checkpoint. CoT improves accuracy by 20–40% on multi-step tasks like math, logic, and planning.

Beyond these three core patterns, two additional techniques are worth knowing. Role promptingAssigning Claude a specific persona or expertise (e.g., "You are a security auditor"). This focuses the model's knowledge and response style on a particular domain. assigns Claude a specific persona — for example, "You are a security auditor with 15 years of experience." This focuses Claude's responses on that domain and adjusts its tone and depth accordingly. Role prompting works especially well combined with other patterns: a "security auditor" role + chain-of-thought produces thorough, step-by-step security analyses.

DelimitersSpecial markers (like triple backticks, XML tags, or dashes) used to clearly separate different parts of a prompt. They help Claude distinguish instructions from data and prevent prompt injection. are markers like XML tags (<data>...</data>) or triple backticks that clearly separate your instructions from the data you want processed. Why does this matter? Without delimiters, Claude might confuse user-provided data for instructions. For example, if you ask Claude to summarize an email and the email contains "ignore all previous instructions," delimiters help Claude understand that text is data to be summarized, not a command to follow. You'll see XML delimiters used heavily in system prompts throughout this course.

Zero-Shot vs. Few-Shot vs. Chain-of-Thought

Animation: Prompt Pattern Comparison
Zero-shot
What is 15% of 85?
12.75
~70%
Few-shot
10% of 50 = 5
20% of 80 = 16
15% of 85?
12.75
~85%
Chain-of-thought
15% of 85?
15/100 = 0.15
0.15 × 85
12.75
~95%
Diagram: Prompting Patterns Comparison
Zero-Shot No examples, just the task Prompt: "Classify this email as spam or not-spam" "spam" ~70% Accuracy on math tasks Few-Shot 2-5 examples first Examples: 'Meeting at 3pm' → not-spam 'Invoice attached' → not-spam 'FREE VIAGRA!!!' → spam Now classify: → "spam" (mirrors format) ~85% Accuracy on math tasks Chain-of-Thought Step-by-step reasoning Prompt: "Think step by step..." Step 1: Check for urgency Step 2: "$1M" = suspicious Step 3: "Click here" = phishing Step 4: No legit signals Conclusion: "spam" ~95% Accuracy on math tasks
Why It Matters In benchmarks, switching from zero-shot to chain-of-thought on GSM8K (a standard math reasoning test) improved accuracy from 58% to 93% — a 35-point jump from changing only the prompt, not the model. In a real production scenario, an e-commerce returns agent using CoT correctly identified policy-eligible returns 91% of the time vs. 68% with zero-shot, saving an estimated $12,000/month in manual review costs. For agents, chain-of-thought is especially powerful — it makes Claude "think out loud," which you'll use in M12 (The ReAct Pattern) as the core of agent reasoning.
🎓 Cert Tip — Domain 4.1

The exam penalizes vague instructions like "be thorough" or "find all issues." Always provide explicit, measurable criteria: "flag functions exceeding 50 lines" not "flag long functions."

🎓 Cert Tip — Domain 4.2

Few-shot prompting with 2–4 examples is the exam-recommended approach for ambiguous format requirements. More examples = diminishing returns.

The Prompt-to-Completion Loop

Everyday Analogy

Before: In chat apps like iMessage, you type a message and the other person simply remembers the entire conversation — you never have to repeat yourself.

The pain: Developers new to LLM APIs assume the same thing and are baffled when Claude "forgets" what they said two messages ago, leading to broken multi-turn conversations and agents that lose context mid-task.

The mapping: Sending a prompt is actually like mailing a detailed letter to an expert who has amnesia. You must include everything they need — full context, the history of past letters, and your new question — because they have zero memory of previous letters. If you want them to "remember" something, you photocopy it and include it in the envelope every single time.

What the "envelope" actually looks like: Here's the literal data your code sends on Turn 3 of a conversation. Notice how the entire history from Turns 1 and 2 is included — you're photocopying every previous letter:

# Turn 3 — your code sends ALL of this, not just the new question:
{
  "system": "You are a helpful tutor.",           # ← always included
  "messages": [
    {"role": "user", "content": "What is a list?"},      # ← Turn 1
    {"role": "assistant", "content": "A list is..."},     # ← Turn 1 reply
    {"role": "user", "content": "How do I sort one?"},    # ← Turn 2
    {"role": "assistant", "content": "Use sorted()..."},  # ← Turn 2 reply
    {"role": "user", "content": "What about reverse?"},   # ← Turn 3 (NEW)
  ]
}
# Total input: system prompt + all 5 messages. You pay for ALL of it.

And here's the response object you get back: The key fields are content (Claude's answer), stop_reason (why it stopped), and usage (your token bill):

# What you get back from client.messages.create(...)
{
  "role": "assistant",
  "content": [{"type": "text", "text": "Use sorted()..."}],
  "stop_reason": "end_turn",       # "end_turn" = finished naturally
                                    # "max_tokens" = hit the limit
                                    # "tool_use" = wants to call a tool
  "usage": {
    "input_tokens": 35,            # what you sent (you pay for this)
    "output_tokens": 52            # what Claude generated (costs more)
  }
}
Technical Definition Each API call is statelessClaude has no memory between API calls. Each request is independent. If you want Claude to "remember" a conversation, your code must send the full message history with every request.. Here's what that means step by step.

First, your code constructs the full messages array. That array contains three things: the system prompt, the conversation history, and the new user message. Second, you send that array to the API. Third, you receive back a completionThe model's generated response to your prompt. A completion includes the assistant's content blocks, a stop_reason (why generation stopped), and usage metadata (input/output token counts)..

The completion contains three things. The assistant's response text is the main one. Next is stop_reasonA field in the API response that tells you why Claude stopped generating. Common values: "end_turn" (finished naturally), "max_tokens" (hit the token limit), "tool_use" (wants to call a tool)., which tells you why Claude stopped generating. Finally, the usage metadata shows how many tokens were consumed.

As you learned in M02, every token in this loop — including the history you resend each time — counts toward cost and context windowThe fixed-size token buffer for a single API call. Everything — system prompt, history, user message, and response — must fit. Claude models support up to 200K tokens. limits.
Animation: Stateless Request/Response Loop
Your Prompt
"How do I sort a list?"
35 tokens
Claude API
Processing...
Completion
"Use sorted()..."
52 tokens
Input: 35 · Output: 52 · Cost: $0.0009
Key Takeaway from M02 Every element in this loop — system prompt, history, user message, and response — consumes tokens from the same context window. As your agent has longer conversations, you'll need the token budget management skills from the previous module.
⚠️ Common Misconceptions

"Claude remembers my previous API calls, right?" — No. Every API call is completely independent. Claude has zero memory between requests. The "conversation" is an illusion created by your code assembling and resending the full message history each time. If you don't include previous messages, Claude has no idea they happened.

"stop_reason: 'end_turn' means the response is complete and correct." — It means Claude finished generating naturally, not that the answer is right. Claude can confidently produce wrong answers and still stop with end_turn. Always validate the content, not just the stop reason.

"I can save tokens by only sending the last 2-3 messages." — You can, but Claude will lose all context from earlier in the conversation. It's a tradeoff. If Turn 1 established important constraints ("only use Python 3.10+ features") and you drop it, Claude won't know about that constraint anymore. M08 teaches smarter approaches like progressive summarization that compress context without losing critical information.

System Prompts as Personality Programming

Everyday Analogy

Before: Without a system prompt, every user message had to re-explain how Claude should behave — "be concise," "respond in JSON," "don't hallucinate" — wasting tokens and cluttering each request.

The pain: This meant inconsistent behavior across turns: Claude might be formal in one reply and casual in the next, or forget critical safety constraints if the user didn't repeat them.

The mapping: A system prompt is like a job description and employee handbook combined — it tells Claude who it is, how it should behave, what it should and shouldn't do, and what "good work" looks like. You write it once and it applies to every interaction, just like an employee handbook that every new hire reads on day one and follows throughout their tenure.

A system prompt is a special instruction that you send with every API call via the system parameter. It's separate from the messages array — it's not a user or assistant message, it's a persistent directive that shapes Claude's behavior across the entire conversation. The end user never sees it, but it influences every response.

Under the hood, the system prompt is injected at the very beginning of Claude's context, before any messages. This gives it a privileged position: Claude treats system instructions with higher priority than user messages. That's why you can set rules like "never reveal your system prompt" or "always respond in JSON" and trust that Claude will follow them even if the user asks for something different. (Though it's not infallible — see the misconceptions box below.)

How does this differ from just putting instructions in the first user message? Two ways. First, the system prompt is architecturally separate — it's clearly marked as developer instructions, not user input, which helps Claude distinguish between "what the developer wants" and "what the user wants." Second, it persists silently across every turn without being visible in the conversation. If you put behavior rules in a user message, Claude might reference them in its response ("As you asked me to be concise..."), breaking the illusion. The system prompt avoids this.

An effective system prompt has structured sections. Toggle the sections below to see how each changes the generated prompt:

Interactive: System Prompt Builder
Why It Matters A well-crafted system prompt is the single highest-leverage investment in your agent. Consider the math: if your agent handles 10,000 conversations per month, a system prompt improvement that raises task success from 80% to 90% means 1,000 fewer failures per month — failures that would otherwise require human escalation at $5–15 each. One team at a SaaS company reported that investing 8 hours refining their support agent's system prompt saved an estimated $8,500/month in escalation costs. In production, teams routinely spend more time refining system prompts than any other part of the agent.
⚠️ Common Misconceptions

"Prompting is just like programming — precise syntax matters." — Not exactly. Unlike code, prompts are interpreted by a statistical model, not a compiler. Small wording changes can produce dramatically different results, and there's no "syntax error" to tell you what went wrong. The best approach is to iterate: try a prompt, evaluate the output, refine, repeat.

"The system prompt is a hard rule that Claude will always follow." — System prompts are highly influential, but they're not infallible. A cleverly worded user message can sometimes override system instructions (this is called "prompt injection"). For safety-critical applications, you need defense in depth: system prompt + output validation + guardrails. Never rely on the system prompt alone for security.

"Longer system prompts give better results." — There's a sweet spot. A 50-word prompt is usually too vague, but a 5,000-word prompt wastes tokens and can actually confuse Claude with contradictory instructions. Most production system prompts land in the 200–800 word range. Be specific, not verbose.

"Few-shot examples are always better than zero-shot." — For well-known tasks (translation, summarization, simple Q&A), zero-shot often performs just as well — and uses fewer tokens. Few-shot shines when the task has an ambiguous or non-obvious output format that Claude can't infer from the instruction alone.

"Chain-of-thought is always the best pattern." — CoT adds significant output tokens (and cost). For simple, single-step tasks like "translate this to Spanish," forcing step-by-step reasoning is wasteful and can even reduce quality. Use CoT for multi-step reasoning, math, logic, and planning — not for everything.

Code Walkthrough

Bridging concept to code: You now understand the three message roles, the three prompting patterns, the stateless loop, and system prompts as personality programming. Next, we will translate each of these concepts into runnable code. The first example shows a single API call with a structured system prompt. The second compares zero-shot, few-shot, and CoT side by side. The third builds a ConversationManager class that handles the stateless loop for you — the pattern you will reuse throughout this entire course.

Single-Turn with System Prompt

Let's start with the simplest case: a single API call with a well-structured system prompt. The system prompt below uses XML-tagged sections — <role>, <constraints>, <output_format>. Why XML? Because Claude treats XML tags as semantic boundaries. When Claude sees <constraints>, it knows "everything inside here is a rule I must follow." This makes it much better at following multi-part instructions compared to a wall of plain text.

Here's a mistake that trips up nearly every beginner: putting the system prompt inside the messages array as a user message. Don't do that. The system parameter is its own top-level field, architecturally separate from the conversation. The tricky part? Getting this wrong won't throw an error — your code will run fine. But Claude's behavior will be subtly worse, and you'll spend hours debugging something that "should work" without understanding why the output quality is inconsistent.

# pip install anthropic>=0.30.0
import anthropic

client = anthropic.Anthropic()

# A well-structured system prompt with clear sections
system_prompt = """You are a senior Python developer conducting code reviews.

You review code for bugs, performance issues, and style violations.

- Be concise: max 3 bullet points per issue category
- Always suggest a fix, not just identify the problem
- If the code is clean, say so — don't invent issues


## Bugs
- ...
## Performance
- ...
## Style
- ...
"""

try:
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system_prompt,
        messages=[
            {"role": "user", "content": "Review this:\ndef add(a,b): return a+b"}
        ]
    )
    print(message.content[0].text)
    print(f"\nTokens: {message.usage.input_tokens} in, {message.usage.output_tokens} out")
except anthropic.APIError as e:
    print(f"API error: {e.status_code} - {e.message}")
// npm install @anthropic-ai/sdk@^0.30.0
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

const systemPrompt = `You are a senior Python developer conducting code reviews.

<role>You review code for bugs, performance issues, and style violations.</role>
<constraints>
- Be concise: max 3 bullet points per issue category
- Always suggest a fix, not just identify the problem
- If the code is clean, say so — don't invent issues
</constraints>
<output_format>
## Bugs
- ...
## Performance
- ...
## Style
- ...
</output_format>`;

try {
  const message = await client.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    system: systemPrompt,
    messages: [
      { role: 'user', content: 'Review this:\ndef add(a,b): return a+b' }
    ]
  });
  console.log(message.content[0].text);
  console.log(`\nTokens: ${message.usage.input_tokens} in, ${message.usage.output_tokens} out`);
} catch (error) {
  if (error instanceof Anthropic.APIError) {
    console.error(`API error: ${error.status} - ${error.message}`);
  } else { throw error; }
}
What Just Happened? You created an Anthropic client, defined a structured system prompt with XML sections, sent a single user message asking for a code review, and received Claude's response along with token usage metadata. The system prompt shaped Claude's behavior (concise bullet points, suggest fixes, honest about clean code) without the user needing to ask for any of that. The try/except block ensures your app does not crash on API errors.

Comparing Prompt Patterns

Now let's see the three patterns compete head-to-head. The code below sends the exact same math word problem three different ways — zero-shot, few-shot, and chain-of-thought — and prints the results side by side. This is the fastest way to build intuition for which pattern to reach for in your own projects.

One thing to watch in the output: notice how the input_tokens count differs between patterns. Few-shot examples are "free" in terms of accuracy, but they're not free in tokens — each example adds to your input cost. Chain-of-thought also produces longer outputs (Claude "thinks out loud"), so budget accordingly on both sides.

import anthropic

client = anthropic.Anthropic()

question = "A store sells apples for $1.50 each. If you buy 5 or more, you get a 20% discount. How much do 7 apples cost?"

prompts = {
    "zero-shot": question,
    "few-shot": f"""Example: A shirt costs $25. With a 10% discount, it costs $25 * 0.90 = $22.50.
Example: A book costs $15. Buy 3+ and get 15% off. 4 books = $15 * 4 * 0.85 = $51.00.

Now solve: {question}""",
    "chain-of-thought": f"""{question}

Let's solve this step by step:
1. First, determine the base price
2. Check if the discount applies
3. Calculate the discount amount
4. Compute the final price"""
}

for name, prompt in prompts.items():
    try:
        msg = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=300,
            messages=[{"role": "user", "content": prompt}]
        )
        print(f"\n{'='*40}")
        print(f"Pattern: {name}")
        print(f"Response: {msg.content[0].text[:200]}")
        print(f"Tokens: {msg.usage.input_tokens} in, {msg.usage.output_tokens} out")
    except anthropic.APIError as e:
        print(f"Error ({name}): {e.message}")
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

const question = 'A store sells apples for $1.50 each. If you buy 5 or more, you get a 20% discount. How much do 7 apples cost?';

const prompts = {
  'zero-shot': question,
  'few-shot': `Example: A shirt costs $25. With a 10% discount, it costs $25 * 0.90 = $22.50.
Example: A book costs $15. Buy 3+ and get 15% off. 4 books = $15 * 4 * 0.85 = $51.00.

Now solve: ${question}`,
  'chain-of-thought': `${question}

Let's solve this step by step:
1. First, determine the base price
2. Check if the discount applies
3. Calculate the discount amount
4. Compute the final price`
};

for (const [name, prompt] of Object.entries(prompts)) {
  try {
    const msg = await client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 300,
      messages: [{ role: 'user', content: prompt }]
    });
    console.log(`\n${'='.repeat(40)}`);
    console.log(`Pattern: ${name}`);
    console.log(`Response: ${msg.content[0].text.slice(0, 200)}`);
    console.log(`Tokens: ${msg.usage.input_tokens} in, ${msg.usage.output_tokens} out`);
  } catch (error) {
    if (error instanceof Anthropic.APIError) {
      console.error(`Error (${name}): ${error.message}`);
    } else { throw error; }
  }
}
Expected Output (chain-of-thought):
Pattern: chain-of-thought Response: 1. Base price per apple: $1.50 2. Buying 7 apples — that's 5 or more, so the 20% discount applies 3. Discount: 20% off means we pay 80% of the price 4. Total: 7 × $1.50 × 0.80 = $8.40 The 7 apples cost **$8.40**. Tokens: 89 in, 72 out
What Just Happened? You sent the same math question three different ways. Zero-shot gave a bare answer (sometimes wrong). Few-shot showed the model a pattern and improved formatting. Chain-of-thought made Claude show its work step by step — and got the right answer ($8.40) with explicit reasoning you can verify. Notice that CoT used more output tokens (Claude "thinks out loud"), but the accuracy gain is worth it for any task requiring multi-step logic.

ConversationManager Class

This is the most important code in the module — you will extend this class throughout the course. Let me walk you through it the way you'd think about building it yourself.

The constructor stores three things: the system prompt, the model name, and an empty messages list. That messages list is the heart of the entire class. Since the API is stateless, this Python list is literally the only place the conversation exists. If your process crashes and you lose this list, the entire conversation history is gone — poof. (In M08, you'll learn how to persist this to a database so it survives restarts.)

Now let's look at the send() method, where the real magic happens. The flow is: append the user message to the list, call the API with the full history, then append Claude's response. By keeping both sides in the list, each subsequent call automatically includes the full conversation context. Here's the subtle but critical detail: if the API call fails, we pop() the user message back off the list. Why? Without this rollback, you'd have an orphaned user message sitting in the list with no assistant reply after it. The next call would send two consecutive user messages, violate the strict role alternation rule, and throw a confusing error that doesn't mention the real cause.

Finally, the demo at the bottom sends two turns and prints the token counts. Pay attention to the numbers: the second call uses noticeably more input tokens than the first, because it includes the entire first turn's content. This is the cost growth pattern you must plan for in any long-running agent — and it's exactly why M08 teaches conversation summarization techniques.

import anthropic

class ConversationManager:
    """Manages multi-turn conversations with Claude."""

    def __init__(self, system_prompt: str, model: str = "claude-sonnet-4-6"):
        self.client = anthropic.Anthropic()
        self.system = system_prompt
        self.model = model
        self.messages: list[dict] = []

    def send(self, user_message: str) -> tuple[str, dict]:
        """Send a message and get a response. Returns (text, usage)."""
        self.messages.append({"role": "user", "content": user_message})

        try:
            response = self.client.messages.create(
                model=self.model,
                max_tokens=1024,
                system=self.system,
                messages=self.messages,
            )
            assistant_text = response.content[0].text
            self.messages.append({"role": "assistant", "content": assistant_text})

            return assistant_text, {
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens,
            }
        except anthropic.APIError as e:
            self.messages.pop()  # Remove failed user message
            raise

    def get_history(self) -> list[dict]:
        """Return the full conversation history."""
        return self.messages.copy()

    def clear(self):
        """Clear conversation history (keeps system prompt)."""
        self.messages = []


# Usage
conv = ConversationManager(
    system_prompt="You are a helpful Python tutor. Be concise."
)

try:
    reply, usage = conv.send("What is a list comprehension?")
    print(f"Claude: {reply}")
    print(f"Tokens: {usage}")

    reply, usage = conv.send("Show me an example with filtering.")
    print(f"\nClaude: {reply}")
    print(f"Tokens: {usage}")
    print(f"History length: {len(conv.get_history())} messages")
except anthropic.APIError as e:
    print(f"Error: {e.message}")
import Anthropic from '@anthropic-ai/sdk';

class ConversationManager {
  constructor(systemPrompt, model = 'claude-sonnet-4-6') {
    this.client = new Anthropic();
    this.system = systemPrompt;
    this.model = model;
    this.messages = [];
  }

  async send(userMessage) {
    this.messages.push({ role: 'user', content: userMessage });

    try {
      const response = await this.client.messages.create({
        model: this.model,
        max_tokens: 1024,
        system: this.system,
        messages: this.messages,
      });
      const assistantText = response.content[0].text;
      this.messages.push({ role: 'assistant', content: assistantText });

      return {
        text: assistantText,
        usage: {
          inputTokens: response.usage.input_tokens,
          outputTokens: response.usage.output_tokens,
        },
      };
    } catch (error) {
      this.messages.pop(); // Remove failed user message
      throw error;
    }
  }

  getHistory() { return [...this.messages]; }
  clear() { this.messages = []; }
}

// Usage
const conv = new ConversationManager(
  'You are a helpful Python tutor. Be concise.'
);

try {
  let result = await conv.send('What is a list comprehension?');
  console.log(`Claude: ${result.text}`);
  console.log(`Tokens:`, result.usage);

  result = await conv.send('Show me an example with filtering.');
  console.log(`\nClaude: ${result.text}`);
  console.log(`Tokens:`, result.usage);
  console.log(`History length: ${conv.getHistory().length} messages`);
} catch (error) {
  if (error instanceof Anthropic.APIError) {
    console.error(`Error: ${error.message}`);
  } else { throw error; }
}
What Just Happened? You built a reusable ConversationManager that solves the stateless API problem. It stores the full message history in a list, sends it with every API call, appends both user and assistant messages to maintain alternation, and gracefully rolls back on errors. After two turns, the history contains 4 messages (2 user + 2 assistant), and the second API call consumed more input tokens because it included the first turn's content. This class is the foundation you will build on for tool use (M07), ReAct agents (M12), and production conversation management.

Hands-On Exercise

What You'll Build

A multi-turn Code Review Agent that uses a structured system prompt, compares prompt patterns, and tracks token growth across turns using the ConversationManager class.

Time estimate: 25–35 minutes • Prerequisites: Completed M01/M02 labs (API key set, SDK installed) • Files you'll create: review_agent.py (or review_agent.mjs)

Environment Setup

If you completed the M01 lab, you're already set up. Otherwise:

pip install "anthropic>=0.30.0"           # or: npm install "@anthropic-ai/sdk@^0.30.0"
export ANTHROPIC_API_KEY="your-key-here"  # or (Windows): set ANTHROPIC_API_KEY=your-key-here

Step 1: Build a Code Review System Prompt

A great system prompt is the highest-leverage thing you can write for an agent. This step builds one with 5 XML-tagged sections — the same structure production teams use. XML tags help Claude parse multi-part instructions cleanly.

Create a new file called review_agent.py (or review_agent.mjs):

import anthropic

client = anthropic.Anthropic()

REVIEW_SYSTEM_PROMPT = """You are a senior software engineer conducting code reviews.

<role>You review code for correctness, performance, security, and style.</role>
<expertise>Python, JavaScript, SQL. You know OWASP top 10 and PEP 8.</expertise>
<review_criteria>
- Bugs: logic errors, off-by-one, null handling
- Performance: unnecessary loops, missing caching opportunities
- Security: injection risks, hardcoded secrets, unsafe deserialization
- Style: naming conventions, function length, missing docstrings
</review_criteria>
<output_format>
For each category with findings, use this format:
## [Category]
- **Issue**: description
- **Fix**: suggested code change
If a category has no issues, omit it entirely.
</output_format>
<tone>Be constructive and specific. Praise good patterns. Never be dismissive.</tone>"""

# Test the system prompt with a simple review
test_code = '''def get_user(id):
    query = f"SELECT * FROM users WHERE id = {id}"
    return db.execute(query)'''

try:
    msg = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=REVIEW_SYSTEM_PROMPT,
        messages=[{"role": "user", "content": f"Review this code:\n```python\n{test_code}\n```"}]
    )
    print(msg.content[0].text)
    print(f"\nTokens: {msg.usage.input_tokens} in, {msg.usage.output_tokens} out")
except anthropic.APIError as e:
    print(f"Error: {e.message}")
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

const REVIEW_SYSTEM_PROMPT = `You are a senior software engineer conducting code reviews.

<role>You review code for correctness, performance, security, and style.</role>
<expertise>Python, JavaScript, SQL. You know OWASP top 10 and PEP 8.</expertise>
<review_criteria>
- Bugs: logic errors, off-by-one, null handling
- Performance: unnecessary loops, missing caching opportunities
- Security: injection risks, hardcoded secrets, unsafe deserialization
- Style: naming conventions, function length, missing docstrings
</review_criteria>
<output_format>
For each category with findings, use this format:
## [Category]
- **Issue**: description
- **Fix**: suggested code change
If a category has no issues, omit it entirely.
</output_format>
<tone>Be constructive and specific. Praise good patterns. Never be dismissive.</tone>`;

const testCode = `def get_user(id):
    query = f"SELECT * FROM users WHERE id = {id}"
    return db.execute(query)`;

try {
  const msg = await client.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    system: REVIEW_SYSTEM_PROMPT,
    messages: [{ role: 'user', content: `Review this code:\n\`\`\`python\n${testCode}\n\`\`\`` }]
  });
  console.log(msg.content[0].text);
  console.log(`\nTokens: ${msg.usage.input_tokens} in, ${msg.usage.output_tokens} out`);
} catch (error) {
  console.error(`Error: ${error.message}`);
}

Run it: python review_agent.py (or node review_agent.mjs)

Expected Output (your wording will vary):
## Security - **Issue**: SQL injection vulnerability — user input is interpolated directly into the query string via f-string - **Fix**: Use parameterized queries: `db.execute("SELECT * FROM users WHERE id = ?", (id,))` ## Style - **Issue**: `id` shadows the built-in `id()` function - **Fix**: Rename to `user_id`: `def get_user(user_id):` Tokens: 185 in, 92 out
✅ Checkpoint: If Claude flagged the SQL injection vulnerability and used the category/issue/fix format from your system prompt, Step 1 is working. The structured output comes entirely from the <output_format> section you defined.
Troubleshooting
  • ModuleNotFoundError: No module named 'anthropic' — Run pip install anthropic to install the SDK.
  • AuthenticationError: Could not resolve API key — Make sure you've set export ANTHROPIC_API_KEY="sk-..." in your terminal (or set ANTHROPIC_API_KEY=sk-... on Windows).
  • Claude doesn't follow the output format — Make sure the <output_format> XML tags are inside the system parameter, not in the messages array. System-level instructions get higher priority.

Step 2: Compare Prompt Patterns

Now let's see how zero-shot, few-shot, and chain-of-thought affect the same review task. This step sends the same buggy code three ways and prints the results so you can compare quality and token cost side by side.

Create a new file called pattern_compare.py (or pattern_compare.mjs):

import anthropic

client = anthropic.Anthropic()

code = '''def process_items(items):
    result = []
    for i in range(len(items)):
        if items[i] != None:
            result.append(items[i].upper())
    return result'''

patterns = {
    "zero-shot": f"Review this Python code for issues:\n```python\n{code}\n```",
    "few-shot": f"""Here are example code reviews:

Code: `x = x + 1` → Style: Use `x += 1` for augmented assignment.
Code: `if x == None` → Bug: Use `is None` instead of `== None` for identity checks.

Now review this code:
```python
{code}
```""",
    "chain-of-thought": f"""Review this Python code step by step:
```python
{code}
```

Think through it methodically:
1. Read each line and check for bugs
2. Look for performance issues
3. Check for style violations
4. Summarize your findings""",
}

for name, prompt in patterns.items():
    try:
        msg = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=500,
            messages=[{"role": "user", "content": prompt}]
        )
        print(f"\n{'='*50}")
        print(f"Pattern: {name}")
        print(f"Tokens: {msg.usage.input_tokens} in, {msg.usage.output_tokens} out")
        print(f"Response:\n{msg.content[0].text[:300]}")
    except anthropic.APIError as e:
        print(f"Error ({name}): {e.message}")
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

const code = `def process_items(items):
    result = []
    for i in range(len(items)):
        if items[i] != None:
            result.append(items[i].upper())
    return result`;

const patterns = {
  'zero-shot': `Review this Python code for issues:\n\`\`\`python\n${code}\n\`\`\``,
  'few-shot': `Here are example code reviews:\n\nCode: \`x = x + 1\` → Style: Use \`x += 1\`.\nCode: \`if x == None\` → Bug: Use \`is None\`.\n\nNow review:\n\`\`\`python\n${code}\n\`\`\``,
  'chain-of-thought': `Review this step by step:\n\`\`\`python\n${code}\n\`\`\`\n\n1. Check for bugs\n2. Performance issues\n3. Style violations\n4. Summarize`,
};

for (const [name, prompt] of Object.entries(patterns)) {
  try {
    const msg = await client.messages.create({
      model: 'claude-sonnet-4-6',
      max_tokens: 500,
      messages: [{ role: 'user', content: prompt }]
    });
    console.log(`\n${'='.repeat(50)}`);
    console.log(`Pattern: ${name}`);
    console.log(`Tokens: ${msg.usage.input_tokens} in, ${msg.usage.output_tokens} out`);
    console.log(`Response:\n${msg.content[0].text.slice(0, 300)}`);
  } catch (error) {
    console.error(`Error (${name}): ${error.message}`);
  }
}

Run it: python pattern_compare.py (or node pattern_compare.mjs)

Expected Output (abbreviated):
================================================== Pattern: zero-shot Tokens: 42 in, 180 out Response: Here are the issues I found... ================================================== Pattern: few-shot Tokens: 78 in, 150 out Response: Code: `if items[i] != None` → Bug: Use `is not None`... ================================================== Pattern: chain-of-thought Tokens: 65 in, 250 out Response: 1. Reading line by line... 2. `!= None` should be `is not None`... 3. `range(len(items))` is un-Pythonic...
✅ Checkpoint: You should see three reviews of the same code. Compare: Does few-shot match the format of your examples? Does chain-of-thought show step-by-step reasoning? Does zero-shot use fewer input tokens? All three should flag != None (should be is not None) and the un-Pythonic range(len(...)) loop.
Troubleshooting
  • ModuleNotFoundError: No module named 'anthropic' — Run pip install anthropic to install the SDK.
  • AuthenticationError — Check your ANTHROPIC_API_KEY environment variable is set correctly.
  • Only one or two patterns print — One of the API calls may have hit a rate limit. Wait a few seconds and re-run. The loop will resume from the failed pattern.

Step 3: Multi-Turn Review Conversation

Now combine the system prompt from Step 1 with the ConversationManager from the code walkthrough to have a multi-turn review conversation. This demonstrates how context builds across turns — and how token costs grow. This step uses the ConversationManager class from the Code Walkthrough section above.

Create a new file called review_conversation.py (or review_conversation.mjs). Copy the ConversationManager class from the Code Walkthrough above, then add:

# After the ConversationManager class definition...

REVIEW_SYSTEM_PROMPT = """You are a senior software engineer conducting code reviews.
<role>Review code for correctness, performance, security, and style.</role>
<output_format>Use ## Category headers with bullet points. Be concise.</output_format>
<tone>Be constructive. Praise good patterns.</tone>"""

conv = ConversationManager(system_prompt=REVIEW_SYSTEM_PROMPT)
total_in = 0
total_out = 0

turns = [
    "Review this:\n```python\ndef get_user(id):\n    query = f'SELECT * FROM users WHERE id = {id}'\n    return db.execute(query)\n```",
    "Can you show me the fixed version with parameterized queries?",
    "Now add error handling for the case where the user is not found.",
    "What about connection pooling — is that important here?",
    "Summarize all the improvements we discussed in a checklist.",
]

for i, turn in enumerate(turns, 1):
    try:
        reply, usage = conv.send(turn)
        total_in += usage["input_tokens"]
        total_out += usage["output_tokens"]
        print(f"\n--- Turn {i} ---")
        print(f"You: {turn[:60]}...")
        print(f"Claude: {reply[:150]}...")
        print(f"This turn: {usage['input_tokens']} in, {usage['output_tokens']} out")
        print(f"Cumulative: {total_in} in, {total_out} out")
    except Exception as e:
        print(f"Error on turn {i}: {e}")
        break

print(f"\n{'='*50}")
print(f"Total: {len(conv.get_history())} messages, {total_in} input + {total_out} output tokens")
// After the ConversationManager class definition...

const conv = new ConversationManager(
  `You are a senior software engineer conducting code reviews.
<role>Review code for correctness, performance, security, and style.</role>
<output_format>Use ## Category headers with bullet points. Be concise.</output_format>
<tone>Be constructive. Praise good patterns.</tone>`
);

let totalIn = 0, totalOut = 0;
const turns = [
  "Review this:\n```python\ndef get_user(id):\n    query = f'SELECT * FROM users WHERE id = {id}'\n    return db.execute(query)\n```",
  'Can you show me the fixed version with parameterized queries?',
  'Now add error handling for the case where the user is not found.',
  'What about connection pooling — is that important here?',
  'Summarize all the improvements we discussed in a checklist.',
];

for (let i = 0; i < turns.length; i++) {
  try {
    const { text, usage } = await conv.send(turns[i]);
    totalIn += usage.inputTokens;
    totalOut += usage.outputTokens;
    console.log(`\n--- Turn ${i + 1} ---`);
    console.log(`You: ${turns[i].slice(0, 60)}...`);
    console.log(`Claude: ${text.slice(0, 150)}...`);
    console.log(`This turn: ${usage.inputTokens} in, ${usage.outputTokens} out`);
    console.log(`Cumulative: ${totalIn} in, ${totalOut} out`);
  } catch (error) {
    console.error(`Error on turn ${i + 1}: ${error.message}`);
    break;
  }
}
console.log(`\n${'='.repeat(50)}`);
console.log(`Total: ${conv.getHistory().length} messages, ${totalIn} input + ${totalOut} output tokens`);

Run it: python review_conversation.py (or node review_conversation.mjs)

Expected Output (abbreviated):
--- Turn 1 --- You: Review this:... Claude: ## Security... This turn: 195 in, 88 out Cumulative: 195 in, 88 out --- Turn 2 --- You: Can you show me the fixed version with parameterized qu... Claude: Here's the fixed version... This turn: 310 in, 120 out Cumulative: 505 in, 208 out ... --- Turn 5 --- You: Summarize all the improvements we discussed in a check... Claude: Here's your improvement checklist... This turn: 890 in, 145 out Cumulative: 2100 in, 580 out ================================================== Total: 10 messages, 2100 input + 580 output tokens
✅ Checkpoint: Watch the input tokens grow with each turn — by Turn 5, input tokens per call should be noticeably higher than Turn 1, because each call resends the full history. If Claude references earlier turns in its responses (e.g., "As we discussed with the parameterized queries..."), the ConversationManager is working correctly.
Troubleshooting
  • NameError: name 'ConversationManager' is not defined — Make sure you copied the full ConversationManager class from the Code Walkthrough section into the top of your file.
  • Roles alternation error — If an API call fails, the pop() rollback in send() should handle it. If you manually edited the messages list, ensure it strictly alternates user/assistant.
  • High token counts — This is expected! Each turn resends the full history. At 5 turns, cumulative input of 1,500–3,000 tokens is normal. This is exactly why M08 (Conversation Management) teaches summarization techniques.

Verify Everything Works

Run both scripts to confirm your setup:

python review_agent.py && python pattern_compare.py && python review_conversation.py
🎉 Congratulations! You can now build structured system prompts, choose the right prompt pattern for a task, and manage multi-turn conversations with automatic token tracking. These three skills form the backbone of every agent you'll build from M05 onward.

Stretch Goals (Optional)

  • Auto-summarization: When token count exceeds 80% of the context window, automatically summarize older turns
  • Prompt template library: Build a function that switches between zero-shot, few-shot, and CoT patterns with a single parameter

Knowledge Check

Test your understanding of prompts, patterns, and conversation management.

Q1: What happens if you send two consecutive "user" messages without an "assistant" message between them?

AClaude automatically generates a response between them
BThe first user message is silently ignored
CThe API returns a validation error — roles must alternate between user and assistant
DBoth messages are concatenated into a single user turn
Correct! The Messages API requires strict alternation between user and assistant roles. Sending two consecutive user messages will return an error. Your ConversationManager must enforce this ordering.

Q2: A system prompt is 200 tokens and your conversation has 10 turns averaging 100 tokens each (50 user + 50 assistant). Approximately how many input tokens will the next API call use?

A100 tokens (just the latest user message)
B~1,250 tokens (200 system + 1,000 history + ~50 new message)
C200 tokens (just the system prompt)
D50 tokens (just the latest user message without history)
Correct! Because the API is stateless, every call sends the full context: system prompt (200) + all 10 past turns (1,000) + the new user message (~50) = ~1,250 input tokens. This is why token budget management from M02 matters!

Q3: Which prompting pattern should you use for a multi-step math word problem?

AZero-shot — Claude is good at math
BFew-shot — show similar solved problems
CChain-of-thought — have Claude reason step by step
DRole prompting — tell Claude it's a mathematician
Correct! Chain-of-thought is the best pattern for multi-step reasoning. By making Claude show its work, each step becomes a checkpoint that reduces compounding errors. CoT improves accuracy by 20-40% on these tasks.

Q4: Which system prompt is most effective for a code review agent?

AStructured with XML sections defining role, constraints, output format, and examples
B"You are a code reviewer. Review the code."
CA 5,000-word essay describing every possible code issue
DNo system prompt — just ask for a code review in the user message
Correct! Structured system prompts with clear sections give Claude explicit guidance without being wastefully long. XML tags help Claude parse the sections. Option B is too vague, C wastes tokens, and D misses the most powerful control mechanism.

Q5: Why must your code send the full conversation history with every API call?

AFor billing purposes — Anthropic needs the full context to charge correctly
BClaude stores history on the server but needs confirmation
CIt's optional — Claude remembers previous calls automatically
DBecause the API is stateless — Claude has no memory between calls
Correct! Each API call is completely independent. Claude has zero memory of previous calls. The "conversation" is an illusion created by your code assembling and sending the full message history every time. This is why your ConversationManager class is essential.

Q6: Fill in the blank to make a valid API call with a system prompt:

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    ______="You are a helpful assistant.",
    messages=[{"role": "user", "content": "Hello!"}]
)
Aprompt
Bsystem
Cinstructions
Dcontext
Correct! The parameter is system. It's a top-level parameter in the Messages API, separate from the messages array. The system prompt is not a message with a role — it's its own dedicated parameter.

Module Summary

Key Takeaways

  • Three roles shape every conversation — system (director), user (questioner), assistant (responder). The system prompt is your most powerful lever.
  • Pattern choice matters enormously — zero-shot for simple tasks, few-shot for formatting/classification, chain-of-thought for reasoning. CoT improves accuracy 20-40%.
  • Every call is stateless — Claude has zero memory between requests. Your code is the memory manager.
  • Structure your system prompts — use XML sections for role, constraints, format, and tone. This compounds across every interaction.
  • The ConversationManager pattern — a reusable class that maintains history and sends full context with each call. You'll extend this throughout the course.

Next Module Preview: M04 — Structured Output

Now that you can prompt Claude effectively, the next challenge is getting structured, parseable responses. In Module 4, you'll learn JSON mode, Pydantic/Zod validation, and error recovery — turning Claude's natural language into data your agent can reliably act on.