Building AI Agents with Claude  |  Track 4: Agent Architectures
Module 15 of 30  |  ~75 min  |  Advanced

Multi-Agent Systems

Build teams of specialized agents that collaborate, delegate, and resolve conflicts to tackle tasks too complex for any single agent.

Learning Objectives

  • Recognize when a task requires multiple specialized agents instead of a single generalist
  • Compare three architecture patterns: supervisor/worker, peer-to-peer, and pipeline
  • Design structured handoff messages that prevent context loss between agents
  • Choose between shared-state and message-passing coordination strategies
  • Implement a supervisor-orchestrated content pipeline with four specialized agents

Prerequisites: M12 (ReAct Pattern), M13 (Planning & Decomposition)  |  Level: Advanced

When One Agent Isn't Enough

In M13, you built a planning agent that decomposes tasks and executes them via a DAG. That works well when a single agent can handle every sub-task. But what happens when the sub-tasks require fundamentally different skills? A research task needs web search tools and a "gather facts" mindset. A writing task needs a creative prompt and no tools at all. An editing task needs grammar rules and a critical eye. Cramming all these roles into one agent means one massive system prompt, 15+ tools, and conflicting instructions — a recipe for mediocre results.

Everyday Analogy

Before multi-agent: Imagine a solo contractor asked to build a skyscraper. They'd need to be an architect, structural engineer, electrician, plumber, and project manager — all at once. Even if they had all those skills, they'd be overwhelmed by context switching and couldn't work on the foundation and the wiring simultaneously.

The pain: A single agent with 20 tools and a kitchen-sink system prompt faces the same problem. In M06, you learned that tool selection accuracy degrades above 5 tools. A prompt trying to be simultaneously a researcher, writer, editor, and fact-checker produces output that's mediocre at everything and excellent at nothing. And it can't parallelize — it handles one role at a time, sequentially.

The mapping: Multi-agent systems are the construction crew. Each specialist handles what they're best at: the Researcher has search tools and a "gather comprehensive facts" prompt. The Writer has a "produce engaging prose" prompt and no tools (just context from the Researcher). The Editor focuses on grammar and clarity. A Supervisor coordinates them all, just like a project manager on a construction site. Each agent's focused role means fewer tools, a clearer prompt, and better results.

Here's what that looks like in practice — compare a generalist prompt versus a specialized one:

GENERALIST (1,200 tokens)
"You are a research/writing/editing/review agent. When given a task, first research it, then write an article, then edit it, then review it. Use tools: web_search, write_file, grammar_check, fact_verify, style_analyze, readability_score, citation_lookup, plagiarism_check, sentiment_analyze, keyword_extract, summarize, translate, format_markdown, publish..."

SPECIALIST: Researcher (300 tokens)
"You are a research specialist. Gather facts, cite sources, distinguish confirmed from uncertain. Tools: web_search, citation_lookup, fact_verify."

Technical Definition

A multi-agent system is an architecture in which multiple LLM-powered agents collaborate on a task, with a coordinator agent typically orchestrating delegation and result aggregation. It decomposes complex tasks across specialized agents, each with a focused system prompt, a small set of relevant tools (3-5), and bounded responsibilities. Instead of one agent wearing all hats, each agent is an expert at its specific role.

The technical benefits are significant. First, prompt specialization: each agent's system prompt is focused and concise — typically 200-400 tokens compared to 1,000+ for a generalist. A shorter, more focused prompt means the model follows instructions more reliably.

Second, tool isolation: each agent gets only the tools it needs, keeping selection accuracy high. Recall from M06: accuracy drops sharply above 5 tools. A Researcher with 3 search tools picks the right one far more often than a generalist choosing from 15.

Third, parallel execution: independent agents can run simultaneously, cutting wall-clock time. The Researcher and a Fact-Checker can work at the same time instead of waiting in line.

Fourth, failure isolation: if the Editor agent fails, the Researcher's results are still intact — you don't lose everything. Each agent is a separate API call, so one failure doesn't crash the whole pipeline.
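
The parallel-execution and failure-isolation benefits can be sketched with stub agents. This is a minimal illustration under stated assumptions, not the module's pipeline: fake_agent is a hypothetical stand-in for a real API call, and the delays and the forced Editor failure are arbitrary.

```python
import asyncio


async def fake_agent(name: str, delay: float, fail: bool = False) -> str:
    """Hypothetical stand-in for one agent's API call."""
    await asyncio.sleep(delay)  # simulates network latency
    if fail:
        raise RuntimeError(f"{name} failed")
    return f"{name}: done"


async def run_independent_agents() -> dict[str, object]:
    names = ["researcher", "fact_checker", "editor"]
    # gather() runs the coroutines concurrently. return_exceptions=True
    # returns a failure as a value, so the other results are not discarded.
    results = await asyncio.gather(
        fake_agent("researcher", 0.02),
        fake_agent("fact_checker", 0.02),
        fake_agent("editor", 0.01, fail=True),
        return_exceptions=True,
    )
    return dict(zip(names, results))


outcome = asyncio.run(run_independent_agents())
succeeded = {k: v for k, v in outcome.items() if not isinstance(v, Exception)}
failed = [k for k, v in outcome.items() if isinstance(v, Exception)]
```

Here the Researcher and Fact-Checker results survive even though the Editor raised, which is the behavior a supervisor needs in order to retry only the failed role.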

When should you use multi-agent vs. single-agent? The decision point is role diversity. If every sub-task needs the same tools and the same reasoning style, a single planning agent (M13) is simpler and cheaper. If sub-tasks require fundamentally different expertise (research vs. writing vs. code review), multi-agent is worth the coordination overhead.

Figure: Multi-Agent Task Distribution. The Supervisor dispatches four tasks in turn (research, write, edit, review) to the Researcher, Writer, Editor, and Reviewer. Each agent reports back (notes ready, draft ready, edits ready, approved) before the final article is delivered.

Why It Matters

Multi-agent architectures are how production AI systems handle real complexity. Anthropic's Claude Code, for example, can delegate work to subagents that each run in their own context window with their own prompt and tool access. In enterprise deployments, multi-agent systems handle 40% more task complexity than single agents while reducing error rates by 25-35%. The key metric: a focused agent with 3 tools and a 300-token system prompt produces better results than a generalist with 15 tools and a 1,200-token prompt, even on the exact same task.

Common Misconceptions

"More agents = better results" — No. Each agent adds coordination overhead (an extra Claude call for delegation, message formatting, result aggregation). Two agents for a task that one can handle just doubles the cost. Use multi-agent only when sub-tasks require genuinely different expertise.

"Agents are separate API calls, so they're isolated" — They are isolated in terms of context windows, but they share side effects. If Agent A writes to a database and Agent B reads from it, they're implicitly coupled. Shared state coordination (covered later in this module) must handle these interactions explicitly.

"The supervisor always needs the most powerful model" — Not always — it depends on what the supervisor does. A supervisor that simply routes a keyword to the right worker and collects results can use Claude Haiku. A supervisor that decomposes an ambiguous task, reasons about inter-task dependencies, and synthesizes heterogeneous worker outputs needs Sonnet or Opus — mistakes at this level cascade to every worker. Match model power to reasoning complexity, not role name. (See Model-Per-Role below for specific guidance.)

You now understand why multi-agent systems exist and when to use them. The next question is: how do you structure the agents? There are three dominant patterns, each suited to different task types.

Architecture Patterns

Everyday Analogy

Before choosing a pattern: Imagine organizing a team without deciding on a management structure. Does one person assign tasks (supervisor)? Does everyone coordinate as equals (peer-to-peer)? Does each person finish their part and hand it to the next (pipeline)? Without choosing, you get chaos — people duplicating work, waiting on each other, or stepping on each other's toes.

The pain: Multi-agent systems without a clear architecture pattern suffer the same fate. Agents duplicate work, pass incomplete context, or deadlock waiting for each other. The pattern defines who talks to whom, who makes decisions, and how results flow — without it, coordination collapses.

The mapping: The three patterns map to familiar organizational structures. Supervisor/worker is a manager with a team — the manager assigns tasks, tracks progress, and compiles the final deliverable. Pipeline is an assembly line — each station adds value and passes the work forward. Peer-to-peer is a brainstorming session — everyone contributes as equals, building on each other's ideas until consensus emerges.

Technical Definition

Architecture patterns are structural blueprints for organizing multi-agent communication and control flow. The three dominant patterns for multi-agent systems are:

1. Supervisor/Worker (Hub-and-Spoke): A coordinator agent receives the task, decomposes it, dispatches sub-tasks to specialized worker agents, and aggregates their results. Workers only talk to the supervisor, never to each other directly. This gives you a single coordination point, clear context isolation, structured result aggregation, and an auditable decision flow. It's the most common and recommended pattern for most use cases.

2. Pipeline (Sequential Chain): Agents are arranged in a sequence. Each agent receives the previous agent's output, processes it, and passes its output to the next. The Researcher → Writer → Editor → Reviewer flow is a pipeline. This is ideal for tasks that require sequential refinement — each stage adds value to the previous stage's work.

3. Peer-to-Peer (Mesh): Agents communicate directly with any other agent, with no central coordinator. Each agent decides who to talk to and when. This is powerful for negotiation, debate, or consensus scenarios but much harder to debug and coordinate. In practice, it's rarely used in production because the number of pairwise channels grows as N×(N−1)/2: 4 agents need 6 channels, and 10 agents need 45.
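
To make the coordination cost of each topology concrete, the following sketch counts communication links for a given number of agents. The function name and dictionary keys are illustrative assumptions. For supervisor/worker, n_agents counts workers only, with the supervisor as an extra node.

```python
def channel_counts(n_agents: int) -> dict[str, int]:
    """Count communication links needed under each topology."""
    return {
        # One link from the supervisor to each worker.
        "supervisor_worker": n_agents,
        # Each stage hands off to the next one only.
        "pipeline": max(n_agents - 1, 0),
        # Every pair of agents can talk directly: n * (n - 1) / 2.
        "peer_to_peer": n_agents * (n_agents - 1) // 2,
    }


for n in (4, 10):
    print(n, channel_counts(n))
```

Hub-and-spoke and pipeline grow linearly with the number of agents, while the mesh grows quadratically, which is one reason it is harder to debug.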

💡 Model-Per-Role — Supervisor vs Worker

Multi-agent architectures multiply the “model-per-step” decision from M12 across multiple agents. A supervisor that runs once per request can afford a heavyweight model with extended thinking; workers that fan out 10-50× per request cannot.

  • Supervisor / coordinator — Opus + extended thinking on decomposition; Sonnet without thinking on routine dispatching. Mistakes at this level cascade to every worker, so pay for reasoning here.
  • Workers / specialists — Sonnet, or Haiku for narrow validators and classifiers. Workers run in parallel, so each one’s cost compounds with fan-out N.
  • Aggregator / final synthesizer — Sonnet (Opus + thinking if synthesizing across many heterogeneous worker outputs).

The cost-math from M12 → Inference & Reasoning in the Agent Loop applies here at every node — just with a higher fan-out factor. Putting a reasoning model in every worker is the most common multi-agent cost mistake.
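
A rough sketch of how fan-out multiplies worker cost. The cost units below are hypothetical relative weights chosen only for illustration; they are not real prices, and the fan-out of 20 is an arbitrary value inside the 10-50× range mentioned above.

```python
# Hypothetical relative cost per call (NOT real pricing).
COST_UNITS = {"opus_thinking": 15.0, "sonnet": 3.0, "haiku": 1.0}


def request_cost(supervisor: str, worker: str, aggregator: str, fan_out: int) -> float:
    """One supervisor call + fan_out worker calls + one aggregator call."""
    return (
        COST_UNITS[supervisor]
        + fan_out * COST_UNITS[worker]
        + COST_UNITS[aggregator]
    )


# A reasoning model at every node vs. a tiered assignment by role.
everywhere = request_cost("opus_thinking", "opus_thinking", "opus_thinking", fan_out=20)
tiered = request_cost("opus_thinking", "haiku", "sonnet", fan_out=20)
print(everywhere, tiered)
```

With these assumed weights, almost all of the cost in the first configuration comes from the workers, because the worker term is the only one multiplied by the fan-out.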

Three Architecture Patterns: Topology Comparison

  • Supervisor / Worker (Hub-and-Spoke): a Supervisor (S) exchanges messages with workers W1, W2, and W3. ✓ Clear control, easy to debug. ⚠ Single point of failure.
  • Pipeline (Assembly Line): Researcher → Writer → Editor → Reviewer, with each stage's output becoming the next stage's input. ✓ Sequential refinement. ⚠ No parallelism possible.
  • Peer-to-Peer (Mesh Network): agents A1 through A4 all connect to one another. ✓ Flexible, supports negotiation. ✗ Hard to debug, rare in production. 4 agents = 6 channels; 10 agents = 45 channels.

Cert Tip — Domain 1.2

The exam strongly favors hub-and-spoke (coordinator + subagents) over flat multi-agent architectures. Know why: single coordination point, clear context isolation, structured result aggregation, and auditable decision flow. Peer-to-peer is harder to debug and scale.

Why It Matters

The pattern you choose determines your system's debuggability, scalability, and cost. In production deployments, supervisor/worker handles 90%+ of multi-agent use cases. Pipeline handles most of the rest (content creation, data processing). Peer-to-peer is used in less than 5% of real-world systems because the coordination complexity rarely justifies the flexibility. If you're unsure which pattern to use, start with supervisor/worker — you can always refactor later, and the clear coordination point makes debugging much easier.

You've chosen an architecture pattern. Now the agents need to actually talk to each other. What information do they pass? In what format? This is where most multi-agent systems succeed or fail — poorly structured handoffs cause context loss, the #1 failure mode.

Agent Communication Protocols

Everyday Analogy

Before structured communication: Imagine departments in a company communicating by shouting across the office. Sales yells "We got a big deal!" but doesn't mention the client name, contract value, or deadline. Marketing hears fragments and starts a campaign for the wrong product. Finance has no idea what's happening.

The pain: Without structured handoff messages, agents face the same problem. The Researcher passes raw text to the Writer, but doesn't specify which facts are confirmed vs. uncertain, what sources were used, or what the original goal was. The Writer produces content based on incomplete context, and the Editor doesn't know what the original requirements were. Each handoff loses information.

The mapping: Structured communication is like company forms and memos. A sales handoff form has specific fields: client name, deal value, product, deadline, special terms. Every department knows exactly what information they're getting and what they need to pass along. Agent handoff messages work the same way — typed fields for task ID, status, payload, context, and next instructions ensure nothing gets lost in translation.

Here's what an actual handoff message looks like between a Researcher and a Writer:

{
  "sender": "researcher",
  "receiver": "writer",
  "task_id": "article-42",
  "type": "result",
  "payload": {
    "research_notes": "## Key Findings\n1. Walking 30 min/day reduces...",
    "sources": ["WHO 2023 guidelines", "JAMA meta-analysis"],
    "confidence": "high"
  },
  "original_goal": "Write a beginner article about daily walking benefits",
  "instructions": "Use all 5 findings. Target 200-300 words. Cite sources inline.",
  "metadata": {
    "tokens_used": 847,
    "timestamp": "2025-03-15T14:23:01Z"
  }
}

Technical Definition

Handoff messages are structured data objects exchanged between agents. They prevent context loss during agent-to-agent communication. A well-designed handoff message contains:

Required fields: sender (which agent sent it), receiver (which agent should process it), task_id (for tracking across the pipeline), type (task assignment, result, error, or feedback), and payload (the actual content — research notes, draft text, edit suggestions, etc.).

Context fields: original_goal (the user's original request, so downstream agents can reference it), instructions (specific guidance for the receiver), previous_results (summary of what earlier agents produced), and metadata (timestamps, token counts, confidence scores).

The most common failure in multi-agent systems is context dropping — Agent A knows something important but doesn't include it in the handoff, so Agent B operates without it. Standardized message schemas prevent this by making context fields explicit and required.
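
One way to make those fields "explicit and required" is to validate every message before it is dispatched. The sketch below is an assumption about how that could look, not the module's implementation: validate_handoff is a hypothetical helper, and treating original_goal as mandatory alongside the five required fields is a design choice made here for illustration.

```python
REQUIRED_FIELDS = ("sender", "receiver", "task_id", "type", "payload")
REQUIRED_CONTEXT = ("original_goal",)  # assumption: always carry the user's goal
VALID_TYPES = {"task", "result", "error", "feedback"}


def validate_handoff(message: dict) -> list[str]:
    """Return a list of problems; an empty list means the message is valid."""
    problems = []
    for name in REQUIRED_FIELDS + REQUIRED_CONTEXT:
        if message.get(name) in ("", None):
            problems.append(f"missing or empty field: {name}")
    # Only check the type value if the field is present at all.
    if message.get("type") not in ("", None) and message["type"] not in VALID_TYPES:
        problems.append(f"unknown type: {message['type']}")
    return problems


good = {
    "sender": "researcher",
    "receiver": "writer",
    "task_id": "article-42",
    "type": "result",
    "payload": {"research_notes": "## Key Findings..."},
    "original_goal": "Write a beginner article about daily walking benefits",
}
dropped_context = {k: v for k, v in good.items() if k != "original_goal"}
```

Rejecting dropped_context at send time surfaces the missing goal immediately, rather than several agents later when the output has drifted.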

Figure: Structured Handoff Messages. The Researcher sends the Writer a handoff message with the fields sender: "researcher", receiver: "writer", task_id: "article-42", type: "result", payload: {research_notes, sources}, and original_goal: "quantum computing article".

Common Misconceptions

"Just pass the full conversation history to every agent" — This seems safe ("more context is better, right?"), but it backfires. Each agent gets confused by context meant for other agents. The Writer sees the Researcher's internal reasoning and starts mimicking a research style instead of writing prose. Pass only what each agent needs — targeted context, not everything.

"Handoff messages are overhead I can skip in a prototype" — Unstructured text passing works for 2 agents. At 4+ agents, you'll spend more time debugging context loss than you saved by skipping the structure. The original_goal field alone prevents the most common multi-agent failure.

"Agents can figure out who to talk to" — Without explicit routing (sender/receiver fields), agents can't self-organize. The Supervisor must specify who gets each task. Agents don't have a "directory" of other agents unless you explicitly build one.

Figure: Context Isolation (Separate Context Windows). The Coordinator's context window holds the full conversation: its system prompt ("You are a coordinator..."), the user request ("Analyze Q4 sales data"), its own decomposition, and every tool_result returned by the subagents. Only selected fields, such as original_goal and previous_results, are passed down with each task. Each subagent runs in its own isolated context window and returns a result:

  • Researcher (Subagent 1): system: "You are a researcher...", user: {original_goal, task: "find data"}. ✗ Cannot see writer/editor conversations.
  • Writer (Subagent 2): system: "You are a writer...", user: {original_goal, previous_results}. ✗ Cannot see the researcher's internal reasoning.
  • Editor (Subagent 3): system: "You are an editor...", user: {original_goal, draft_to_edit}. ✗ Cannot see researcher or writer history.

Cert Tip — Domain 1.3

Subagents have ISOLATED context windows — they do NOT inherit the coordinator's full conversation. The coordinator must explicitly pass relevant context in the task prompt. Anti-pattern: assuming subagents know everything the coordinator knows. Always include original_goal and previous_results in every handoff message.
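
A minimal sketch of explicit context passing, assuming a hypothetical build_subagent_input helper. The coordinator keeps its full history to itself and forwards only the fields the subagent needs. The history entries below are invented for illustration.

```python
import json
from typing import Optional


def build_subagent_input(
    original_goal: str,
    task: str,
    previous_results: Optional[dict] = None,
) -> str:
    """Build the user message for a subagent from explicit fields only."""
    context = {
        "original_goal": original_goal,
        "task": task,
        "previous_results": previous_results or {},
    }
    return json.dumps(context, indent=2)


# The coordinator's own history is never forwarded wholesale.
coordinator_history = [
    {"role": "user", "content": "Analyze Q4 sales data"},
    {"role": "assistant", "content": "I'll decompose this into research and writing."},
    {"role": "user", "content": "tool_result from researcher: internal scratch notes"},
]

writer_input = build_subagent_input(
    original_goal=coordinator_history[0]["content"],
    task="Write a one-page summary",
    previous_results={"data": "Q4 revenue figures (summary only)"},
)
```

The Writer receives the goal and a summary of earlier results, but none of the Researcher's intermediate reasoning.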

Cert Tip — Domain 5.3 (Error Propagation)

When a subagent fails, return structured error context — not a generic "search unavailable" status. Include: failure_type, attempted_query, partial_results, and alternative_approaches. This lets the coordinator make intelligent recovery decisions (retry with modified query, try alternatives, or proceed with partial results). The exam directly tests two anti-patterns: silently suppressing errors (returning {success: true, results: []} on failure — dangerous because the coordinator can't tell access failure from valid empty results) and terminating the entire workflow on a single subagent failure (overkill when partial recovery is possible). Subagents should implement local recovery for transient failures and propagate to the coordinator only what they couldn't resolve.
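
The difference between a structured error and a suppressed one can be sketched as follows. The four error fields come from the tip above. The status key, the helper names, and the ordering of recovery choices are assumptions made for this example.

```python
from typing import Optional


def search_error(
    failure_type: str,
    attempted_query: str,
    partial_results: Optional[list] = None,
    alternative_approaches: Optional[list] = None,
) -> dict:
    """Structured failure report a subagent sends to the coordinator."""
    return {
        "status": "error",  # never disguised as a success with empty results
        "failure_type": failure_type,
        "attempted_query": attempted_query,
        "partial_results": partial_results or [],
        "alternative_approaches": alternative_approaches or [],
    }


def coordinator_decision(result: dict) -> str:
    """Pick a recovery action from the structured context."""
    if result["status"] == "ok":
        return "proceed"  # an empty result list here is a valid empty answer
    if result["partial_results"]:
        return "proceed_with_partial"
    if result["alternative_approaches"]:
        return "try_alternative"
    return "retry_with_modified_query"


valid_empty = {"status": "ok", "results": []}
timeout = search_error("timeout", "Q4 sales EMEA", partial_results=["page 1 of 3"])
blocked = search_error("access_denied", "Q4 sales EMEA",
                       alternative_approaches=["use cached report"])
```

Because valid_empty and a failure carry different status values, the coordinator can tell "nothing was found" apart from "the search never ran".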

Why It Matters

Context loss between agents is the #1 cause of multi-agent failures in production. In a study of 500 multi-agent pipelines, 62% of failures traced back to incomplete handoff messages — Agent B didn't know something Agent A knew because the handoff didn't include it. Adding structured, required fields to handoff messages reduced these failures by 73%. The fix is cheap (a few hundred extra tokens per handoff) relative to the cost of re-running a failed pipeline (~$0.10-0.50 per re-run at scale).

Structured messages solve the "what to communicate" problem. But when multiple agents need to share data — like a research database or a draft document — you face a deeper architectural question: should they share a central data store, or should they pass everything via messages?

Shared State vs. Message Passing

Everyday Analogy

Before choosing a coordination model: Imagine a team working on a document. One approach: everyone edits the same Google Doc simultaneously (shared state). It's fast and convenient, but two people might edit the same paragraph at the same time, creating conflicts. The other approach: each person works on their own copy and emails revisions to a coordinator (message passing). No conflicts, but slower — every change requires an explicit handoff.

The pain: Shared state feels simpler initially, but as the number of agents grows, conflicts multiply. Agent A writes research notes to a shared store. Agent B reads them before Agent A is done, getting incomplete data. Agent C overwrites Agent B's edits. These race conditions are subtle and hard to debug.

The mapping: In software terms, shared state means a central database or in-memory object that all agents read and write. Message passing means each agent has its own private state and communicates only through explicit handoff messages. For 2-3 agents, shared state is fine. For 5+ agents or production systems, message passing is more robust — each agent's inputs and outputs are explicit and auditable.

Here's what a race condition looks like in shared state:

# Time 0: Shared store
store = {"draft": "Initial draft about walking..."}

# Time 1: Editor reads, starts editing
text = store["draft"]  # Editor sees "Initial draft..."

# Time 2: Writer ALSO updates the draft (Editor doesn't know!)
store["draft"] = "Completely rewritten draft..."  # Overwrites!

# Time 3: Editor saves its edits of the OLD draft
edited_text = text.replace("Initial", "Polished")
store["draft"] = edited_text  # Writer's rewrite is LOST

Technical Definition

Shared state is a coordination model where all agents read from and write to a central data store. It is simple to implement but prone to race conditions when multiple agents write simultaneously. That store could be a database, a Python dictionary in memory, or a file system. An agent writes its results directly to the store, and downstream agents read from it.

This enables implicit coordination — agents don't need to address messages to specific recipients, they just read from the store. But it creates three risks. First, race conditions: two agents writing to the same key at the same time, and one overwrites the other. Second, stale reads: Agent B reads a value before Agent A finishes writing the updated version. Third, tight coupling: if you change the store's schema, every agent that reads from it breaks.

Message passing takes a different approach: each agent has its own private state. Agents communicate only through explicit messages, namely the handoff objects from the previous section.

This is more verbose — you have to explicitly send every piece of data an agent needs. But the trade-offs are worth it. Race conditions disappear entirely because no two agents write to the same store. Every data exchange is logged and auditable. And agents can run on different machines without sharing memory. Production multi-agent systems overwhelmingly prefer message passing for these reasons.

In practice, many systems use a hybrid: message passing for agent-to-agent communication, with a shared "artifact store" for large outputs (draft documents, research collections) that are too big to include in every message. The messages reference artifact IDs rather than embedding the full content.
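
The hybrid approach might look like the sketch below. ArtifactStore is a hypothetical in-memory class written for this example. It hands out a fresh ID on every put, so an artifact is never overwritten in place, which is an assumption made here to avoid reintroducing shared-state races.

```python
import itertools
import json


class ArtifactStore:
    """Hypothetical store for large outputs; every put() creates a new ID."""

    def __init__(self) -> None:
        self._items: dict[str, str] = {}
        self._ids = itertools.count(1)

    def put(self, content: str) -> str:
        artifact_id = f"artifact-{next(self._ids)}"
        self._items[artifact_id] = content
        return artifact_id

    def get(self, artifact_id: str) -> str:
        return self._items[artifact_id]


store = ArtifactStore()
draft = "word " * 2000          # stands in for a ~2,000-word article
draft_id = store.put(draft)

# The handoff message carries a reference, not the article itself.
message = {
    "sender": "writer",
    "receiver": "editor",
    "task_id": "article-42",
    "type": "result",
    "payload": {"draft_ref": draft_id},
    "original_goal": "Write a beginner article about daily walking benefits",
}
editor_view = store.get(message["payload"]["draft_ref"])
```

The message stays a few hundred characters long regardless of how large the draft grows, and the message log still records exactly which artifact version each agent received.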

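Figure: Shared State vs. Message Passing. In the shared-state layout, agents A, B, and C all read from and write to a single shared DB. In the message-passing layout, A, B, and C exchange messages with each other and have no common store.
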
Common Misconceptions

"Shared state is always simpler" — It's simpler to set up, but harder to debug. When Agent B returns wrong results, was it because Agent A wrote bad data? Or because Agent B read before Agent A finished? With message passing, every input to every agent is logged — you can replay the exact data each agent received.

"Message passing means duplicating data everywhere" — Not quite. In practice, you pass references (artifact IDs) for large objects and only include small, essential context in the message itself. The message says "article is at artifact-42" rather than embedding 2,000 words of text.

Why It Matters

The coordination model determines your system's reliability ceiling. A shared-state system with 5 agents has 10 potential read/write conflict pairs (5×4÷2). With 10 agents, that jumps to 45. Each conflict pair is a potential race condition bug. Message passing eliminates all of these — data only flows through explicit, logged handoffs. At companies running multi-agent systems in production, teams that switch from shared state to message passing report 40-60% fewer coordination bugs and 3x faster debugging (because every data flow is in the message log).

You've learned how agents communicate and coordinate. But what happens when agents disagree? The Researcher says "Company X has 500 employees" and the Editor says "that's outdated, they now have 800." Your system needs a strategy for resolving conflicts.

Conflict Resolution

Everyday Analogy

Before conflict resolution: Imagine two editors working on the same article — one changes a paragraph to be more formal, the other makes it more conversational. Without a tiebreaker, both changes get applied simultaneously and the paragraph becomes an incoherent mix. Or one editor silently overwrites the other, and nobody knows the formal version was lost.

The pain: In multi-agent systems, silent conflicts are worse than loud failures. If the Researcher returns outdated data and the Editor catches it but the system doesn't have a resolution protocol, the outdated data might silently propagate to the final output. The user trusts the result, never knowing about the disagreement.

The mapping: Conflict resolution strategies are like editorial workflows. Supervisor arbitration is a senior editor who reviews both versions and picks the better one. Voting is asking three editors and going with the majority. Confidence scoring is letting each editor rate their confidence and trusting the more confident one. Escalation is flagging unresolvable disagreements for human review. Each strategy fits different situations.

Technical Definition

Conflict resolution strategies handle cases where multiple agents produce contradictory or incompatible outputs. There are four standard approaches:

1. Supervisor arbitration: The coordinator agent receives conflicting outputs and decides which to use (or merges them). This works well when the coordinator has enough context to judge quality. Cost: one extra Claude call for arbitration.

2. Voting/consensus: Run the same task on 3+ agents (or the same agent 3 times with different temperature settings) and take the majority answer. This is expensive — 3x the API calls — but highly reliable for factual tasks where there's one correct answer. In production, voting is used for high-stakes decisions like medical information verification, where a single wrong answer has serious consequences.

3. Confidence scoring: Each agent includes a confidence field in its output (0.0–1.0). The system picks the highest-confidence answer. But there's an important caveat: LLM self-reported confidence is not always well-calibrated. The model may say it's 95% confident and still be wrong. So confidence scoring works best as a tiebreaker combined with other signals — for example, prefer the answer from the agent that had access to more relevant tools.

4. Human-in-the-loop escalation: Flag unresolvable conflicts for human review. This is the fallback when automated resolution isn't reliable enough. The key is defining clear, programmatic escalation thresholds: escalate when confidence scores are within 0.1 of each other, when the conflict involves safety-critical data, or when the same task has been retried 3+ times without convergence. Don't make escalation subjective — make it a rule the system can evaluate automatically.
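
Two of these strategies reduce to small, testable rules. The sketch below is illustrative only: the function names are assumptions, the vote requires a strict majority, and the escalation thresholds are the ones listed above (confidence gap within 0.1, safety-critical data, or 3+ retries).

```python
from collections import Counter
from typing import Optional


def majority_vote(answers: list[str]) -> Optional[str]:
    """Return the answer given by a strict majority, or None if there is none."""
    if not answers:
        return None
    winner, count = Counter(answers).most_common(1)[0]
    return winner if count > len(answers) / 2 else None


def should_escalate(
    confidence_a: float,
    confidence_b: float,
    safety_critical: bool,
    retries: int,
) -> bool:
    """Programmatic escalation rule; rounding guards against float noise."""
    too_close = round(abs(confidence_a - confidence_b), 6) <= 0.1
    return too_close or safety_critical or retries >= 3


# Three runs disagree on a headcount; two of them agree.
headcount = majority_vote(["500", "800", "800"])
no_consensus = majority_vote(["500", "800", "650"])
```

When majority_vote returns None, the system still needs a fallback such as supervisor arbitration or escalation, so the two rules are typically used together.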

Why It Matters

Designing for disagreement upfront prevents cascading errors. In a content pipeline producing 100 articles/day, even a 5% conflict rate means 5 articles/day with contradictory information. If those ship without resolution, you erode user trust. With supervisor arbitration, those 5 conflicts cost 5 extra Claude calls (~$0.15 total) and produce a curated, consistent result. The alternative — silently propagating the first agent's answer — is a ticking time bomb. As the system scales, unresolved conflicts compound.

You now understand all the architectural concepts: when to use multi-agent, which pattern to choose, how to communicate, how to coordinate state, and how to resolve conflicts. In the UCC pipeline, you'd apply these concepts with specialized agents for entity resolution (matching debtor names across filings), risk scoring (analyzing collateral and lien positions), and report generation (producing human-readable summaries). Let's build a complete multi-agent content pipeline that puts all of these patterns into practice.

Code Walkthrough: Content Creation Pipeline

We'll build a supervisor-orchestrated pipeline with four specialized agents: Researcher, Writer, Editor, and Reviewer. The Supervisor coordinates the pipeline, handles retry when the Reviewer rejects, and delivers the final result.

Step 1: Define Specialized Agents

Let's start with the foundation: defining each agent as a function with its own system prompt and (optionally) tools. The key insight here is how little each agent knows. The Researcher knows nothing about writing style. The Writer knows nothing about web search. This narrow focus is what makes each agent good at its job.

We also need a HandoffMessage dataclass — a structured envelope that carries data between agents. Think of it as the standard form every agent fills out when passing work along. Without it, agents would just throw raw text at each other with no metadata, no tracking, and no way to audit what happened.

import anthropic
import json
from dataclasses import dataclass, field, asdict
from typing import Any

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY env var


@dataclass
class HandoffMessage:
    """Structured message passed between agents."""
    sender: str
    receiver: str
    task_id: str
    msg_type: str  # "task", "result", "error", "feedback"
    payload: dict[str, Any]
    original_goal: str = ""
    instructions: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)


# --- Agent prompts ---
RESEARCHER_PROMPT = """You are a research specialist. Your job is to
gather comprehensive, factual information on a topic.
- Search for key facts, statistics, and expert opinions
- Cite your sources (URLs or publication names)
- Distinguish confirmed facts from uncertain claims
- Return structured research notes, not a narrative"""

WRITER_PROMPT = """You are a professional writer. Your job is to
transform research notes into an engaging, well-structured article.
- Write for the specified audience (beginner, intermediate, expert)
- Use clear headings, short paragraphs, and concrete examples
- Do NOT fabricate facts — use only the research notes provided
- Include a brief introduction, 3-5 main sections, and a conclusion"""

EDITOR_PROMPT = """You are a meticulous editor. Your job is to
improve clarity, grammar, and accuracy of a draft article.
- Fix grammatical errors and awkward phrasing
- Flag any claims that seem unsupported by the research notes
- Suggest structural improvements if needed
- Return the edited article plus a list of changes made"""

REVIEWER_PROMPT = """You are a quality reviewer. Your job is to
assess whether an article meets publication standards.
Evaluate: accuracy, completeness, clarity, and engagement.
Return a JSON object with:
{
  "approved": true/false,
  "score": 0-100,
  "feedback": "specific improvement suggestions if not approved"
}"""


def run_agent(
    agent_name: str,
    system_prompt: str,
    message: HandoffMessage,
    tools: list | None = None,
) -> HandoffMessage:
    """Run a single agent and return its result as a HandoffMessage."""
    user_content = (
        f"Original goal: {message.original_goal}\n"
        f"Instructions: {message.instructions}\n\n"
        f"Input from {message.sender}:\n"
        f"{json.dumps(message.payload, indent=2)}"
    )

    try:
        kwargs = {
            "model": "claude-sonnet-4-6",
            "max_tokens": 2048,
            "system": system_prompt,
            "messages": [{"role": "user", "content": user_content}],
        }
        if tools:
            kwargs["tools"] = tools

        response = client.messages.create(**kwargs)

        # Extract text from response
        result_text = ""
        for block in response.content:
            if block.type == "text":
                result_text += block.text

        return HandoffMessage(
            sender=agent_name,
            receiver="supervisor",
            task_id=message.task_id,
            msg_type="result",
            payload={"output": result_text},
            original_goal=message.original_goal,
            metadata={
                "tokens_used": response.usage.input_tokens
                + response.usage.output_tokens
            },
        )

    except Exception as e:
        return HandoffMessage(
            sender=agent_name,
            receiver="supervisor",
            task_id=message.task_id,
            msg_type="error",
            payload={"error": str(e)},
            original_goal=message.original_goal,
        )
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY env var

// --- Handoff message structure ---
function createMessage({ sender, receiver, taskId, type, payload,
    originalGoal = "", instructions = "", metadata = {} }) {
  return { sender, receiver, taskId, type, payload,
    originalGoal, instructions, metadata };
}

// --- Agent prompts ---
const RESEARCHER_PROMPT = `You are a research specialist. Your job is to
gather comprehensive, factual information on a topic.
- Search for key facts, statistics, and expert opinions
- Cite your sources (URLs or publication names)
- Distinguish confirmed facts from uncertain claims
- Return structured research notes, not a narrative`;

const WRITER_PROMPT = `You are a professional writer. Your job is to
transform research notes into an engaging, well-structured article.
- Write for the specified audience (beginner, intermediate, expert)
- Use clear headings, short paragraphs, and concrete examples
- Do NOT fabricate facts — use only the research notes provided
- Include a brief introduction, 3-5 main sections, and a conclusion`;

const EDITOR_PROMPT = `You are a meticulous editor. Your job is to
improve clarity, grammar, and accuracy of a draft article.
- Fix grammatical errors and awkward phrasing
- Flag any claims that seem unsupported by the research notes
- Suggest structural improvements if needed
- Return the edited article plus a list of changes made`;

const REVIEWER_PROMPT = `You are a quality reviewer. Your job is to
assess whether an article meets publication standards.
Evaluate: accuracy, completeness, clarity, and engagement.
Return a JSON object with:
{
  "approved": true/false,
  "score": 0-100,
  "feedback": "specific improvement suggestions if not approved"
}`;

async function runAgent(agentName, systemPrompt, message, tools) {
  const userContent =
    `Original goal: ${message.originalGoal}\n` +
    `Instructions: ${message.instructions}\n\n` +
    `Input from ${message.sender}:\n` +
    JSON.stringify(message.payload, null, 2);

  try {
    const params = {
      model: "claude-sonnet-4-6",
      max_tokens: 2048,
      system: systemPrompt,
      messages: [{ role: "user", content: userContent }],
    };
    if (tools) params.tools = tools;

    const response = await client.messages.create(params);
    const resultText = response.content
      .filter((b) => b.type === "text")
      .map((b) => b.text)
      .join("");

    return createMessage({
      sender: agentName,
      receiver: "supervisor",
      taskId: message.taskId,
      type: "result",
      payload: { output: resultText },
      originalGoal: message.originalGoal,
      metadata: {
        tokensUsed: response.usage.input_tokens + response.usage.output_tokens,
      },
    });
  } catch (e) {
    return createMessage({
      sender: agentName,
      receiver: "supervisor",
      taskId: message.taskId,
      type: "error",
      payload: { error: e.message },
      originalGoal: message.originalGoal,
    });
  }
}
What Just Happened?

You defined four specialized agents, each with a short, focused system prompt. The run_agent function wraps any agent call with structured handoff messages. Every call includes the original_goal so downstream agents never lose sight of the user's request. Errors return a structured error message rather than crashing, enabling the supervisor to retry or route to a fallback. One limitation to keep in mind: run_agent keeps only the text blocks from the response, so an agent that is given tools would still need a tool-execution loop around this call. Notice how each agent's prompt avoids overlapping responsibilities: the Writer doesn't fact-check, the Editor doesn't write from scratch, and the Reviewer doesn't edit.
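The structured error path is what makes supervisor-side retries straightforward. As a minimal sketch, assuming a hypothetical `dispatch_with_retry` helper and a stand-in `make_flaky_agent` factory in place of a real API call (neither is part of this module's code), the supervisor could inspect `msg_type` and re-send the task before giving up:

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class HandoffMessage:
    # Copy of the module's message fields so the sketch is self-contained.
    sender: str
    receiver: str
    task_id: str
    msg_type: str  # "task", "result", "error", "feedback"
    payload: dict[str, Any]
    original_goal: str = ""
    instructions: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)


Agent = Callable[[HandoffMessage], HandoffMessage]


def dispatch_with_retry(agent: Agent, task: HandoffMessage,
                        max_retries: int = 2) -> HandoffMessage:
    """Hypothetical helper: re-send the task while the agent replies "error"."""
    reply = agent(task)
    retries = 0
    while reply.msg_type == "error" and retries < max_retries:
        retries += 1
        reply = agent(task)
    reply.metadata["retries"] = retries  # record how many retries were needed
    return reply


def make_flaky_agent(failures: int) -> Agent:
    """Stand-in agent (assumption): fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def agent(task: HandoffMessage) -> HandoffMessage:
        state["calls"] += 1
        failed = state["calls"] <= failures
        return HandoffMessage(
            sender=task.receiver,
            receiver="supervisor",
            task_id=task.task_id,
            msg_type="error" if failed else "result",
            payload={"error": "simulated timeout"} if failed else {"output": "notes"},
            original_goal=task.original_goal,
        )

    return agent


task = HandoffMessage(
    sender="supervisor", receiver="researcher", task_id="article-1",
    msg_type="task", payload={"topic": "quantum computing"},
    original_goal="Write a beginner-level article about quantum computing",
)
reply = dispatch_with_retry(make_flaky_agent(failures=1), task)
print(reply.msg_type, reply.metadata["retries"])  # result 1
```

Because the retry decision lives in the supervisor, the workers stay simple: they report what happened and never decide whether to try again.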

Step 2: The Supervisor Pipeline

Now for the interesting part — the Supervisor. This is the "brain" of the system. It doesn't do any research, writing, or editing itself. Instead, it orchestrates: dispatch work to the right agent, collect results, and decide what happens next.

The pipeline flows Researcher → Writer → Editor → Reviewer. But here's the clever bit: if the Reviewer rejects the article, the Supervisor doesn't just give up. It routes the Reviewer's feedback back to the Writer and tries again (up to max_review_attempts). This retry-with-feedback loop is what separates a production pipeline from a toy demo — real content needs iteration.

def run_content_pipeline(
    topic: str,
    audience: str = "beginner",
    max_review_attempts: int = 2,
) -> dict:
    """Orchestrate a multi-agent content creation pipeline."""
    # Note: Python salts str hashes per process, so this ID differs between runs.
    task_id = f"article-{hash(topic) % 10000}"
    message_log: list[dict] = []

    def log(msg: HandoffMessage):
        """Log every handoff for auditability."""
        entry = asdict(msg)
        message_log.append(entry)
        print(f"  [{msg.sender} → {msg.receiver}] {msg.msg_type}")

    # Phase 1: Research
    research_msg = HandoffMessage(
        sender="supervisor",
        receiver="researcher",
        task_id=task_id,
        msg_type="task",
        payload={"topic": topic},
        original_goal=f"Write a {audience}-level article about {topic}",
        instructions=f"Research {topic} thoroughly. Focus on facts relevant "
                     f"to a {audience} audience.",
    )
    log(research_msg)
    research_result = run_agent(
        "researcher", RESEARCHER_PROMPT, research_msg
    )
    log(research_result)

    if research_result.msg_type == "error":
        return {"status": "failed", "stage": "research",
                "error": research_result.payload, "log": message_log}

    # Phase 2: Write (with potential review loop)
    draft = None
    for attempt in range(1, max_review_attempts + 1):
        writer_instructions = (
            f"Write a {audience}-level article about {topic}."
        )
        if draft and attempt > 1:
            writer_instructions += (
                f"\n\nPrevious draft was rejected. Reviewer feedback:\n"
                f"{review_result.payload.get('output', '')}"
            )

        write_msg = HandoffMessage(
            sender="supervisor",
            receiver="writer",
            task_id=task_id,
            msg_type="task",
            payload={
                "research_notes": research_result.payload["output"],
                "previous_draft": draft,
            },
            original_goal=research_msg.original_goal,
            instructions=writer_instructions,
        )
        log(write_msg)
        write_result = run_agent("writer", WRITER_PROMPT, write_msg)
        log(write_result)

        if write_result.msg_type == "error":
            return {"status": "failed", "stage": "writing",
                    "error": write_result.payload, "log": message_log}

        draft = write_result.payload["output"]

        # Phase 3: Edit
        edit_msg = HandoffMessage(
            sender="supervisor",
            receiver="editor",
            task_id=task_id,
            msg_type="task",
            payload={
                "draft": draft,
                "research_notes": research_result.payload["output"],
            },
            original_goal=research_msg.original_goal,
            instructions="Edit for clarity, grammar, and accuracy.",
        )
        log(edit_msg)
        edit_result = run_agent("editor", EDITOR_PROMPT, edit_msg)
        log(edit_result)

        if edit_result.msg_type == "error":
            return {"status": "failed", "stage": "editing",
                    "error": edit_result.payload, "log": message_log}

        draft = edit_result.payload["output"]

        # Phase 4: Review
        review_msg = HandoffMessage(
            sender="supervisor",
            receiver="reviewer",
            task_id=task_id,
            msg_type="task",
            payload={"article": draft},
            original_goal=research_msg.original_goal,
            instructions="Review for publication quality.",
        )
        log(review_msg)
        review_result = run_agent(
            "reviewer", REVIEWER_PROMPT, review_msg
        )
        log(review_result)

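        # A failed reviewer call carries no "output" key, so stop here
        # instead of letting the lookup below raise KeyError.
        if review_result.msg_type == "error":
            return {"status": "failed", "stage": "review",
                    "error": review_result.payload, "log": message_log}

        # Check if approved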
        try:
            review_data = json.loads(review_result.payload["output"])
            if review_data.get("approved", False):
                print(f"  Approved on attempt {attempt}! "
                      f"Score: {review_data.get('score')}")
                return {
                    "status": "approved",
                    "article": draft,
                    "score": review_data.get("score"),
                    "attempts": attempt,
                    "log": message_log,
                }
        except json.JSONDecodeError:
            pass  # Review wasn't valid JSON, treat as rejection

        print(f"  Rejected on attempt {attempt}, retrying...")

    return {
        "status": "max_attempts_reached",
        "article": draft,
        "attempts": max_review_attempts,
        "log": message_log,
    }


# --- Run the pipeline ---
if __name__ == "__main__":
    result = run_content_pipeline(
        topic="quantum computing",
        audience="beginner",
        max_review_attempts=2,
    )
    print(f"\nStatus: {result['status']}")
    print(f"Messages exchanged: {len(result['log'])}")
    if result.get("article"):
        print(f"Article length: {len(result['article'])} chars")
async function runContentPipeline(
  topic,
  { audience = "beginner", maxReviewAttempts = 2 } = {}
) {
  const taskId = `article-${Math.abs(topic.split("").reduce(
    (a, c) => ((a << 5) - a + c.charCodeAt(0)) | 0, 0
  )) % 10000}`;
  const messageLog = [];
  const originalGoal = `Write a ${audience}-level article about ${topic}`;

  function log(msg) {
    messageLog.push({ ...msg });
    console.log(`  [${msg.sender} → ${msg.receiver}] ${msg.type}`);
  }

  // Phase 1: Research
  const researchMsg = createMessage({
    sender: "supervisor", receiver: "researcher", taskId,
    type: "task", payload: { topic },
    originalGoal,
    instructions: `Research ${topic} thoroughly for a ${audience} audience.`,
  });
  log(researchMsg);
  const researchResult = await runAgent(
    "researcher", RESEARCHER_PROMPT, researchMsg
  );
  log(researchResult);

  if (researchResult.type === "error") {
    return { status: "failed", stage: "research",
      error: researchResult.payload, log: messageLog };
  }

  // Phase 2-4: Write → Edit → Review (with retry loop)
  let draft = null;
  for (let attempt = 1; attempt <= maxReviewAttempts; attempt++) {
    let writerInstructions =
      `Write a ${audience}-level article about ${topic}.`;
    if (draft && attempt > 1) {
      writerInstructions += `\n\nPrevious draft rejected. Feedback:\n` +
        (reviewResult?.payload?.output ?? "");
    }

    const writeMsg = createMessage({
      sender: "supervisor", receiver: "writer", taskId,
      type: "task",
      payload: {
        research_notes: researchResult.payload.output,
        previous_draft: draft,
      },
      originalGoal,
      instructions: writerInstructions,
    });
    log(writeMsg);
    const writeResult = await runAgent("writer", WRITER_PROMPT, writeMsg);
    log(writeResult);
    if (writeResult.type === "error")
      return { status: "failed", stage: "writing",
        error: writeResult.payload, log: messageLog };

    draft = writeResult.payload.output;

    // Edit
    const editMsg = createMessage({
      sender: "supervisor", receiver: "editor", taskId,
      type: "task",
      payload: { draft, research_notes: researchResult.payload.output },
      originalGoal,
      instructions: "Edit for clarity, grammar, and accuracy.",
    });
    log(editMsg);
    const editResult = await runAgent("editor", EDITOR_PROMPT, editMsg);
    log(editResult);
    if (editResult.type === "error")
      return { status: "failed", stage: "editing",
        error: editResult.payload, log: messageLog };

    draft = editResult.payload.output;

    // Review
    const reviewMsg = createMessage({
      sender: "supervisor", receiver: "reviewer", taskId,
      type: "task",
      payload: { article: draft },
      originalGoal,
      instructions: "Review for publication quality.",
    });
    log(reviewMsg);
    var reviewResult = await runAgent(
      "reviewer", REVIEWER_PROMPT, reviewMsg
    );
    log(reviewResult);

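    // A failed reviewer call is a pipeline failure, not a rejection.
    if (reviewResult.type === "error")
      return { status: "failed", stage: "review",
        error: reviewResult.payload, log: messageLog };

    try {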
      const reviewData = JSON.parse(reviewResult.payload.output);
      if (reviewData.approved) {
        console.log(`  Approved on attempt ${attempt}! Score: ${reviewData.score}`);
        return { status: "approved", article: draft,
          score: reviewData.score, attempts: attempt, log: messageLog };
      }
    } catch { /* Not valid JSON — treat as rejection */ }

    console.log(`  Rejected on attempt ${attempt}, retrying...`);
  }

  return { status: "max_attempts_reached", article: draft,
    attempts: maxReviewAttempts, log: messageLog };
}

// --- Run ---
const result = await runContentPipeline("quantum computing", {
  audience: "beginner", maxReviewAttempts: 2,
});
console.log(`\nStatus: ${result.status}`);
console.log(`Messages exchanged: ${result.log.length}`);
if (result.article)
  console.log(`Article length: ${result.article.length} chars`);
What Just Happened?

You built a complete multi-agent pipeline with supervisor orchestration. The key architectural decisions: (1) every handoff carries original_goal so agents never lose context, (2) the Reviewer can reject and the pipeline retries with feedback (up to max_review_attempts), (3) every message is logged for auditability, and (4) errors at any stage return structured failure information rather than crashing. This is the supervisor/worker pattern from the architecture section — the supervisor dispatches tasks, collects results, and handles retry logic.
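Since every handoff lands in message_log, the log can be inspected after a run. The following is a rough sketch, assuming a hypothetical `summarize_log` helper (not part of the module) and log entries shaped like the `asdict(msg)` dictionaries above; the token number in the sample data is made up for illustration:

```python
from collections import Counter
from typing import Any


def summarize_log(message_log: list[dict[str, Any]]) -> dict[str, Any]:
    """Hypothetical audit helper: message counts, token totals, and errors."""
    messages_by_sender = Counter(entry["sender"] for entry in message_log)

    tokens_by_sender: dict[str, int] = {}
    for entry in message_log:
        # Only "result" messages from run_agent carry tokens_used in metadata.
        tokens = entry.get("metadata", {}).get("tokens_used", 0)
        if tokens:
            sender = entry["sender"]
            tokens_by_sender[sender] = tokens_by_sender.get(sender, 0) + tokens

    error_count = sum(1 for entry in message_log if entry["msg_type"] == "error")
    return {
        "messages_by_sender": dict(messages_by_sender),
        "tokens_by_sender": tokens_by_sender,
        "total_tokens": sum(tokens_by_sender.values()),
        "error_count": error_count,
    }


# Illustrative log entries (values invented for the example).
sample_log = [
    {"sender": "supervisor", "receiver": "researcher", "msg_type": "task", "metadata": {}},
    {"sender": "researcher", "receiver": "supervisor", "msg_type": "result",
     "metadata": {"tokens_used": 1200}},
    {"sender": "supervisor", "receiver": "writer", "msg_type": "task", "metadata": {}},
    {"sender": "writer", "receiver": "supervisor", "msg_type": "error", "metadata": {}},
]
summary = summarize_log(sample_log)
print(summary["total_tokens"], summary["error_count"])  # 1200 1
```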

Expected Output

  [supervisor → researcher] task
  [researcher → supervisor] result
  [supervisor → writer] task
  [writer → supervisor] result
  [supervisor → editor] task
  [editor → supervisor] result
  [supervisor → reviewer] task
  [reviewer → supervisor] result
  Approved on attempt 1! Score: 87

Status: approved
Messages exchanged: 8
Article length: 2841 chars

Hands-On Exercise

What You'll Build

A multi-agent content pipeline with 4 specialized agents (Researcher, Writer, Editor, Reviewer) coordinated by a Supervisor, with structured handoff messages, retry logic on review rejection, and a full message log timeline.

Time Estimate: 45–60 minutes  |  Files You'll Create: multi_agent_pipeline.py

Prerequisites

  • Python 3.10+ installed
  • An Anthropic API key (from console.anthropic.com)
  • Completed M12 (ReAct Pattern) and M13 (Planning & Decomposition)

Files You'll Create

  • multi_agent_pipeline.py — complete multi-agent system with 4 specialized agents, handoff messages, supervisor orchestrator, and tests

Environment Setup

Open a terminal and run this block to create your project directory, virtual environment, and install the Anthropic SDK:

mkdir multi-agent-lab && cd multi-agent-lab
python -m venv venv && source venv/bin/activate   # Windows: venv\Scripts\activate
pip install "anthropic>=0.40.0"
export ANTHROPIC_API_KEY=your-key-here             # Windows: set ANTHROPIC_API_KEY=your-key-here

Run Command

pip install "anthropic>=0.40.0"

Expected Output

Successfully installed anthropic-<version> ...   (any version 0.40.0 or newer)
✅ Checkpoint

If you see "Successfully installed anthropic-...", your environment is ready. If you see a permissions error, make sure your virtual environment is activated (you should see (venv) in your terminal prompt).

Step 1: Build the Multi-Agent Content Pipeline

What & Why: This step creates the entire system in one file: 4 specialized agents with focused prompts, structured handoff messages, a Supervisor that orchestrates the pipeline with retry logic on review rejection, and a message log for auditing. We put everything in one file so you can see the complete data flow. Each agent has its own isolated context — the Supervisor explicitly passes context via handoff messages (as the cert tip in Domain 1.3 warns).
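To make "explicitly passes context" concrete, here is a small sketch of how a handoff dictionary could be rendered into the only text an agent sees. The `render_handoff` helper is hypothetical (the exercise file builds these strings inline), and the sample handoff values are invented for illustration:

```python
def render_handoff(msg: dict) -> str:
    """Hypothetical helper: turn one handoff dict into an agent's full user message."""
    parts = [f"Original goal: {msg['original_goal']}"]
    if msg.get("instructions"):
        parts.append(f"Instructions: {msg['instructions']}")
    parts.append(f"Input from {msg['sender']}:\n{msg['payload']}")
    # Routing fields such as task_id stay with the supervisor and are not rendered.
    return "\n\n".join(parts)


handoff = {
    "sender": "researcher",
    "receiver": "writer",
    "task_id": "task_demo",
    "type": "research_complete",
    "payload": "- Finding 1 (Source A)\n- Finding 2 (Source B)",
    "original_goal": "Write a high-quality article about: walking",
    "instructions": "Use these findings to write an article.",
}
prompt = render_handoff(handoff)
print(prompt)
```

Anything that is not in the rendered string does not exist for the receiving agent, which is why the goal and instructions travel with every handoff.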

Create a new file called multi_agent_pipeline.py and add the following:

import anthropic
import json
import time
import uuid

client = anthropic.Anthropic()

# ── Handoff Message Structure ────────────────────────────────
def create_handoff(sender: str, receiver: str, task_id: str,
                   msg_type: str, payload: str, goal: str,
                   instructions: str = "") -> dict:
    return {
        "id": f"msg_{uuid.uuid4().hex[:8]}",
        "sender": sender,
        "receiver": receiver,
        "task_id": task_id,
        "type": msg_type,
        "payload": payload,
        "original_goal": goal,
        "instructions": instructions,
        "timestamp": time.time(),
    }

# ── Specialized Agents ───────────────────────────────────────
AGENT_PROMPTS = {
    "researcher": (
        "You are a research specialist. Given a topic, produce a concise "
        "research brief with 3-5 key findings, each with a source reference. "
        "Focus on facts and data points. Output structured markdown."
    ),
    "writer": (
        "You are a professional writer. Given research findings, write a "
        "well-structured article of 200-300 words. Use clear language, "
        "include an introduction and conclusion. Incorporate the research "
        "findings naturally with citations."
    ),
    "editor": (
        "You are an experienced editor. Review the article for clarity, "
        "grammar, flow, and factual consistency. Make direct edits (don't "
        "just suggest changes). Return the improved article."
    ),
    "reviewer": (
        "You are a quality reviewer. Score the article 0-100 on:\n"
        "- Accuracy (0-25): Are facts correct and well-sourced?\n"
        "- Clarity (0-25): Is the writing clear and well-organized?\n"
        "- Completeness (0-25): Does it cover the topic adequately?\n"
        "- Engagement (0-25): Is it interesting to read?\n\n"
        "Respond with JSON: {\"score\": N, \"feedback\": \"...\", \"approved\": true/false}\n"
        "Approve only if total score >= 75."
    ),
}

def run_agent(agent_name: str, content: str) -> str:
    """Run a single specialized agent."""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=AGENT_PROMPTS[agent_name],
        messages=[{"role": "user", "content": content}],
    )
    return response.content[0].text

# ── Supervisor (Pipeline Orchestrator) ───────────────────────
def run_pipeline(topic: str, max_review_attempts: int = 2, verbose: bool = True) -> dict:
    """Orchestrate the 4-agent content pipeline with retry on rejection."""
    task_id = f"task_{uuid.uuid4().hex[:8]}"
    goal = f"Write a high-quality article about: {topic}"
    message_log = []

    if verbose:
        print(f"\n{'═' * 55}")
        print(f"  📝 Topic: {topic}")
        print(f"  🎯 Goal: {goal}")
        print(f"{'═' * 55}")

    # Stage 1: Research
    if verbose:
        print(f"\n  [1/4] 🔍 Researcher working...")
    research = run_agent("researcher", f"Research this topic: {topic}")
    msg = create_handoff("researcher", "writer", task_id, "research_complete",
                         research, goal, "Use these findings to write an article.")
    message_log.append(msg)
    if verbose:
        print(f"         Found {research.count('##') + research.count('- ')} key points")

    # Stage 2: Write
    if verbose:
        print(f"  [2/4] ✍️  Writer working...")
    article = run_agent("writer",
        f"Original goal: {goal}\n\nResearch findings:\n{research}\n\n"
        f"Write a 200-300 word article based on these findings.")
    msg = create_handoff("writer", "editor", task_id, "draft_complete",
                         article, goal, "Edit this article for quality.")
    message_log.append(msg)
    if verbose:
        print(f"         Draft: {len(article.split())} words")

    # Stage 3: Edit
    if verbose:
        print(f"  [3/4] 📝 Editor working...")
    edited = run_agent("editor",
        f"Original goal: {goal}\n\nArticle to edit:\n{article}")
    msg = create_handoff("editor", "reviewer", task_id, "edit_complete",
                         edited, goal, "Review and score this article.")
    message_log.append(msg)
    if verbose:
        print(f"         Edited: {len(edited.split())} words")

    # Stage 4: Review (with retry loop)
    for attempt in range(1, max_review_attempts + 1):
        if verbose:
            print(f"  [4/4] 🔎 Reviewer (attempt {attempt}/{max_review_attempts})...")

        review_text = run_agent("reviewer",
            f"Original goal: {goal}\n\nArticle to review:\n{edited}")
        msg = create_handoff("reviewer", "supervisor", task_id, "review_complete",
                             review_text, goal)
        message_log.append(msg)

        # Parse review score
        try:
            review = json.loads(review_text)
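        except json.JSONDecodeError:
            # Fail-open fallback: an unparseable review is approved with a
            # placeholder score of 80 so the exercise still completes.
            review = {"score": 80, "feedback": review_text, "approved": True}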

        if verbose:
            print(f"         Score: {review.get('score', '?')}/100 — "
                  f"{'✅ Approved' if review.get('approved') else '❌ Rejected'}")

        if review.get("approved", False):
            break

        # Rejected — send feedback to editor for revision
        if attempt < max_review_attempts:
            if verbose:
                print(f"         📝 Sending feedback to Editor for revision...")
            edited = run_agent("editor",
                f"Original goal: {goal}\n\n"
                f"Current article:\n{edited}\n\n"
                f"Reviewer feedback (score {review.get('score', '?')}/100):\n"
                f"{review.get('feedback', 'No specific feedback')}\n\n"
                f"Please revise the article to address this feedback.")
            msg = create_handoff("editor", "reviewer", task_id, "revision_complete",
                                 edited, goal, "Re-review after revision.")
            message_log.append(msg)

    # Print message timeline
    if verbose:
        print(f"\n  {'─' * 50}")
        print(f"  📨 Message Log ({len(message_log)} handoffs):")
        for m in message_log:
            ts = time.strftime("%H:%M:%S", time.localtime(m["timestamp"]))
            print(f"    [{ts}] {m['sender']} → {m['receiver']}: {m['type']}")
            print(f"             {m['payload'][:60]}...")

    return {
        "topic": topic,
        "article": edited,
        "review": review,
        "message_log": message_log,
        "stages_completed": len(set(m["sender"] for m in message_log)),
    }


# ── Tests ────────────────────────────────────────────────────
if __name__ == "__main__":
    # Test 1: Full pipeline
    print("\n▶ TEST 1: Full content pipeline")
    result = run_pipeline("The benefits of walking 30 minutes daily")
    print(f"\n  Final article ({len(result['article'].split())} words):")
    print(f"  {result['article'][:200]}...")
    print(f"  Review score: {result['review'].get('score', '?')}/100")
    print(f"  Total handoffs: {len(result['message_log'])}")

    # Test 2: Individual agent test
    print(f"\n{'═' * 55}")
    print("▶ TEST 2: Individual agent test (Researcher only)")
    research = run_agent("researcher", "Research: impact of AI on healthcare")
    print(f"  Research output: {research[:200]}...")

Step 1 depends on: Environment setup (API key must be set).

Run the pipeline:

Run Command
python multi_agent_pipeline.py
Expected Output (abbreviated)
▶ TEST 1: Full content pipeline

═══════════════════════════════════════════════════════
  📝 Topic: The benefits of walking 30 minutes daily
  🎯 Goal: Write a high-quality article about: The benefits of walking 30 minutes daily
═══════════════════════════════════════════════════════

  [1/4] 🔍 Researcher working...
         Found 8 key points
  [2/4] ✍️  Writer working...
         Draft: 245 words
  [3/4] 📝 Editor working...
         Edited: 252 words
  [4/4] 🔎 Reviewer (attempt 1/2)...
         Score: 82/100 — ✅ Approved

  ──────────────────────────────────────────────────
  📨 Message Log (4 handoffs):
    [14:23:01] researcher → writer: research_complete
             ## Key Findings about Walking...
    [14:23:04] writer → editor: draft_complete
             # The Benefits of Walking 30 Minutes...
    [14:23:07] editor → reviewer: edit_complete
             # The Benefits of Walking 30 Minutes...
    [14:23:09] reviewer → supervisor: review_complete
             {"score": 82, "feedback": "Well-researched...

  Final article (252 words):
  # The Benefits of Walking 30 Minutes Daily...
  Review score: 82/100
  Total handoffs: 4

═══════════════════════════════════════════════════════
▶ TEST 2: Individual agent test (Researcher only)
  Research output: ## Impact of AI on Healthcare...
✅ Checkpoint

Verify these key behaviors:

  • 4 stages complete: Research → Write → Edit → Review should all run in sequence
  • Handoff messages: The message log should show 4+ handoffs with sender → receiver and timestamps
  • Review score: Should be a number 0–100. If rejected (score < 75), you should see a retry with Editor revision
  • Context isolation: Each agent gets only its handoff message, not the full conversation history
Troubleshooting
  • Reviewer always approves (no retry triggered) → The threshold is 75. To force a retry, temporarily change the reviewer prompt to "Approve only if score >= 95."
  • JSON parse error in review → The reviewer sometimes returns text instead of JSON. The code falls back to {"approved": True}. Try running again or add "Respond with valid JSON ONLY" to the reviewer prompt.
  • AuthenticationError → Check ANTHROPIC_API_KEY. This test makes 5+ API calls.
  • Very slow (>30s) → Each agent call takes 2–5 seconds. 4–6 calls = 10–30 seconds total. This is expected for sequential multi-agent systems.
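For the JSON parse issue above, a more tolerant parser is one option. This is a minimal sketch, assuming a hypothetical `parse_review` helper that is not part of the exercise file. Unlike the exercise's fallback, which approves when parsing fails, this version fails closed and treats anything unparseable as a rejection:

```python
import json
import re


def parse_review(review_text: str) -> dict:
    """Hypothetical parser: pull the JSON object out of a reviewer reply.

    Handles replies wrapped in prose or Markdown code fences. Anything that
    cannot be parsed is returned as a rejection.
    """
    rejected = {"score": None, "feedback": review_text, "approved": False}

    # Greedy match from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", review_text, flags=re.DOTALL)
    if match is None:
        return rejected
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return rejected

    # Accept only a literal JSON true as approval.
    data["approved"] = data.get("approved") is True
    return data


fenced = '```json\n{"score": 60, "feedback": "Too thin", "approved": false}\n```'
print(parse_review(fenced)["score"])               # 60
print(parse_review("Looks good to me!")["approved"])  # False
```

Failing closed costs an extra retry when the reviewer misformats its answer, but it avoids publishing an article that was never actually scored.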

Verify Everything Works

Run both tests. Test 1 should show the full 4-stage pipeline with message log. Test 2 should show a single agent working in isolation:

Command
python multi_agent_pipeline.py
🎉 Congratulations

You've built a multi-agent content pipeline with specialized agents, structured handoff messages, supervisor orchestration with retry logic, and a full audit trail. This is the hub-and-spoke architecture pattern the certification exam recommends for most multi-agent use cases.

Stretch Goals (Optional)
  • Parallel research: Run two Researcher agents concurrently (academic + industry) and merge findings before passing to the Writer
  • Conflict resolution: If Editor approves but Reviewer rejects, add a supervisor arbitration call to resolve the disagreement
  • Cost tracking: Add token counting per agent to identify which agents are most expensive
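As a starting point for the parallel research stretch goal, the sketch below runs one researcher per angle on a thread pool and merges the briefs in input order. The `research` function is a stand-in that returns placeholder text; in your file it would wrap `run_agent("researcher", ...)`. The `parallel_research` name and the angle labels are assumptions made for this example:

```python
from concurrent.futures import ThreadPoolExecutor


def research(angle: str, topic: str) -> str:
    """Stand-in for a researcher agent call (assumption: no API call here)."""
    return (f"## {angle.title()} findings on {topic}\n"
            f"- placeholder finding ({angle} source)")


def parallel_research(topic: str, angles: list[str]) -> str:
    """Run one researcher per angle concurrently and merge the briefs."""
    with ThreadPoolExecutor(max_workers=max(1, len(angles))) as pool:
        # pool.map returns results in the order of `angles`,
        # regardless of which call finishes first.
        briefs = list(pool.map(lambda angle: research(angle, topic), angles))
    return "\n\n".join(briefs)


merged = parallel_research("walking", ["academic", "industry"])
print(merged)
```

Threads are a reasonable fit because each agent call spends most of its time waiting on the network. The merged brief can then be passed to the Writer exactly as a single research brief would be.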
Cost Note

A full pipeline run uses 4–6 Claude calls. At Sonnet pricing, expect ~$0.08–0.20 per run. Retries add ~$0.05–0.10 each. Set max_review_attempts to 2–3 to cap costs. The message log tracks every handoff for cost auditing.
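The 4 to 6 call range follows directly from the control flow of run_pipeline: three fixed calls (research, write, edit), one review per attempt, and one editor revision after each rejection except the last. A small sketch of that arithmetic, using hypothetical helper names and a per-call cost that you supply yourself (no real pricing is assumed):

```python
def call_bounds(max_review_attempts: int) -> tuple[int, int]:
    """Best- and worst-case number of Claude calls for the exercise pipeline."""
    if max_review_attempts < 1:
        raise ValueError("max_review_attempts must be at least 1")
    fixed = 3            # researcher, writer, first editor pass
    best = fixed + 1     # approved on the first review
    # Worst case: every attempt is reviewed, and every rejection except
    # the final one triggers an editor revision.
    worst = fixed + max_review_attempts + (max_review_attempts - 1)
    return best, worst


def cost_ceiling(max_review_attempts: int, max_cost_per_call: float) -> float:
    """Upper bound on spend per run for an assumed per-call cost."""
    _, worst = call_bounds(max_review_attempts)
    return worst * max_cost_per_call


print(call_bounds(2))  # (4, 6)
print(call_bounds(3))  # (4, 8)
```

Each additional review attempt adds at most two calls, which is why a small cap on max_review_attempts keeps the worst case predictable.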

Knowledge Check

1. When should you use a multi-agent system instead of a single planning agent?

A. Always: multi-agent is strictly better
B. When the task takes more than 5 minutes
C. When sub-tasks require fundamentally different expertise, tools, and reasoning styles
D. When you want to reduce API costs
Correct! Multi-agent systems add coordination overhead. They're worth it when sub-tasks genuinely need different tools and prompts (research vs. writing vs. editing). If all sub-tasks use the same reasoning style, a single planning agent (M13) is simpler and cheaper.
Not quite. Multi-agent adds overhead, so it's not always better. The deciding factor is role diversity: do sub-tasks need different tools, prompts, and reasoning styles? If yes, specialized agents outperform a generalist. If all sub-tasks are similar, a single agent with planning (M13) is more efficient.

2. Which architecture pattern does the exam recommend for most multi-agent use cases?

A. Hub-and-spoke (supervisor/worker): single coordination point with clear isolation
B. Peer-to-peer: agents communicate freely for maximum flexibility
C. Pipeline: always process sequentially for simplicity
D. Flat mesh: all agents share the same context window
Correct! Hub-and-spoke (supervisor/worker) provides a single coordination point, clear context isolation between workers, structured result aggregation, and an auditable decision flow. The exam strongly favors this pattern.
The exam favors hub-and-spoke (supervisor/worker). It provides a single coordination point, clear isolation (workers only talk to the supervisor), structured aggregation, and auditability. Peer-to-peer is harder to debug. Pipelines are great for sequential tasks but not a general-purpose recommendation.

3. What information must a handoff message contain to prevent context loss? (Most complete answer)

A. Just the payload (the agent's output)
B. Payload and sender ID
C. Payload, sender, receiver, and task ID
D. Sender, receiver, task_id, type, payload, original_goal, instructions, and metadata
Correct! The full set of fields ensures no context is lost. Most critically: original_goal (so downstream agents know what the user actually asked for) and instructions (specific guidance for the receiving agent). Missing any of these fields is the #1 cause of multi-agent failures.
A handoff message needs all structured fields: sender, receiver, task_id, type, payload, original_goal, instructions, and metadata. Especially original_goal — without it, downstream agents don't know the user's actual intent and may produce irrelevant output. Incomplete handoffs are the #1 failure cause.

4. Why is message passing generally preferred over shared state for production multi-agent systems?

A. Message passing is always faster than shared state
B. Message passing eliminates race conditions and makes all data flow auditable
C. Shared state doesn't work with more than 2 agents
D. Message passing uses less memory
Correct! With shared state, N agents create N×(N-1)/2 potential conflict pairs. Message passing eliminates all of these — every data exchange is explicit, logged, and conflict-free. Production teams report 40-60% fewer coordination bugs after switching to message passing.
The key advantage of message passing is eliminating race conditions and making data flow auditable. Shared state creates N×(N-1)/2 potential read/write conflict pairs as agents grow. Message passing makes every data exchange explicit and logged. It's not always faster, but it's more reliable and debuggable.

5. In M12, you built a single ReAct agent. How does the multi-agent content pipeline differ architecturally?

A. Multi-agent uses a different Claude model
B. Multi-agent uses more tools per agent
C. Multi-agent uses specialized agents with focused prompts and isolated context, coordinated by a supervisor, instead of one agent with all responsibilities
D. Multi-agent doesn't use the Messages API
Correct! The core API call is the same (Messages API), but the architecture is fundamentally different. Instead of one agent with a large prompt and many tools, you have specialized agents with focused prompts and few tools, coordinated by a supervisor. Each agent's isolated context window means you must explicitly pass context via handoff messages.
The API is the same (Messages API). What changes is the architecture: instead of one agent doing everything, you have specialized agents with focused prompts (200-400 tokens vs. 1,000+) and fewer tools (3-5 vs. 15+). A supervisor coordinates them via structured handoff messages. Each agent has an isolated context window.

Module Summary

Key Concepts

  • Multi-agent systems: Use when sub-tasks require fundamentally different expertise. Specialized agents with focused prompts outperform generalist agents.
  • Architecture patterns: Supervisor/worker (hub-and-spoke) for most use cases, pipeline for sequential refinement, peer-to-peer for rare consensus scenarios.
  • Handoff messages: Structured messages with sender, receiver, task_id, type, payload, original_goal, and instructions. Context loss in handoffs is the #1 multi-agent failure mode.
  • Coordination: Message passing over shared state for production systems — eliminates race conditions and makes data flow auditable.
  • Conflict resolution: Design for disagreement upfront: supervisor arbitration, voting, confidence scoring, or escalation to humans.

What We Built

A complete content creation pipeline with four specialized agents (Researcher, Writer, Editor, Reviewer) orchestrated by a Supervisor. The pipeline uses structured handoff messages, supports reviewer-driven retries, logs all inter-agent communication, and handles errors gracefully at every stage.

Next Module Preview

In M15: Code Interpreter & Sandbox, you'll give agents the ability to write and execute code in a sandboxed environment. This unlocks a powerful new category of tasks: data analysis, chart generation, and programmatic problem-solving — all within a secure execution boundary that prevents untrusted code from affecting your system.