Multi-Agent Systems
Build teams of specialized agents that collaborate, delegate, and resolve conflicts to tackle tasks too complex for any single agent.
Learning Objectives
- Recognize when a task requires multiple specialized agents instead of a single generalist
- Compare three architecture patterns: supervisor/worker, peer-to-peer, and pipeline
- Design structured handoff messages that prevent context loss between agents
- Choose between shared-state and message-passing coordination strategies
- Implement a supervisor-orchestrated content pipeline with four specialized agents
Prerequisites: M12 (ReAct Pattern), M13 (Planning & Decomposition) | Level: Advanced
When One Agent Isn't Enough
In M13, you built a planning agent that decomposes tasks and executes them via a DAG. That works well when a single agent can handle every sub-task. But what happens when the sub-tasks require fundamentally different skills? A research task needs web search tools and a "gather facts" mindset. A writing task needs a creative prompt and no tools at all. An editing task needs grammar rules and a critical eye. Cramming all these roles into one agent means one massive system prompt, 15+ tools, and conflicting instructions — a recipe for mediocre results.
Before multi-agent: Imagine a solo contractor asked to build a skyscraper. They'd need to be an architect, structural engineer, electrician, plumber, and project manager — all at once. Even if they had all those skills, they'd be overwhelmed by context switching and couldn't work on the foundation and the wiring simultaneously.
The pain: A single agent with 20 tools and a kitchen-sink system prompt faces the same problem. In M06, you learned that tool selection accuracy degrades above 5 tools. A prompt trying to be simultaneously a researcher, writer, editor, and fact-checker produces output that's mediocre at everything and excellent at nothing. And it can't parallelize — it handles one role at a time, sequentially.
The mapping: Multi-agent systems are the construction crew. Each specialist handles what they're best at: the Researcher has search tools and a "gather comprehensive facts" prompt. The Writer has a "produce engaging prose" prompt and no tools (just context from the Researcher). The Editor focuses on grammar and clarity. A Supervisor coordinates them all, just like a project manager on a construction site. Each agent's focused role means fewer tools, a clearer prompt, and better results.
Here's what that looks like in practice — compare a generalist prompt versus a specialized one:
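A minimal sketch of that comparison, written as plain Python data. The prompt wording and every tool name here (`web_search`, `verify_claim`, and so on) are hypothetical and chosen only to illustrate the contrast; they do not come from a real deployment.

```python
# Illustrative only: prompt text and tool names are hypothetical.

GENERALIST = {
    "system": (
        "You are a researcher, writer, editor, and fact-checker. "
        "Search the web, gather facts, write engaging prose, fix grammar, "
        "verify every claim, and format the result for publication."
    ),
    "tools": [
        "web_search", "fetch_page", "search_papers", "search_news",
        "spell_check", "grammar_check", "style_guide_lookup",
        "verify_claim", "check_citation", "lookup_statistic",
        "render_markdown", "count_words", "summarize",
        "translate", "publish_draft",
    ],
}

RESEARCHER = {
    "system": (
        "You are a research specialist. Gather comprehensive, factual "
        "information on the topic and cite your sources."
    ),
    "tools": ["web_search", "fetch_page", "search_papers"],
}

WRITER = {
    "system": (
        "You are a professional writer. Turn the research notes you are "
        "given into an engaging article. Do not add facts of your own."
    ),
    "tools": [],  # the Writer works only from the Researcher's notes
}

for name, agent in [("generalist", GENERALIST),
                    ("researcher", RESEARCHER),
                    ("writer", WRITER)]:
    print(f"{name}: {len(agent['tools'])} tools")
```

The generalist has to pick from 15 tools on every step, while each specialist chooses from three or none.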
A multi-agent system is an architecture in which multiple LLM-powered agents collaborate on a task, and a coordinator agent typically orchestrates delegation and result aggregation. It decomposes complex tasks across specialized agents, each with a focused system prompt, a small set of relevant tools (3-5), and bounded responsibilities. Instead of one agent wearing all hats, each agent is an expert at its specific role.
The technical benefits are significant. First, prompt specialization: each agent's system prompt is focused and concise — typically 200-400 tokens compared to 1,000+ for a generalist. A shorter, more focused prompt means the model follows instructions more reliably.
Second, tool isolation: each agent gets only the tools it needs, keeping selection accuracy high. Recall from M06: accuracy drops sharply above 5 tools. A Researcher with 3 search tools picks the right one far more often than a generalist choosing from 15.
Third, parallel execution: independent agents can run simultaneously, cutting wall-clock time. The Researcher and a Fact-Checker can work at the same time instead of waiting in line.
Fourth, failure isolation: if the Editor agent fails, the Researcher's results are still intact — you don't lose everything. Each agent is a separate API call, so one failure doesn't crash the whole pipeline.
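The third and fourth benefits can be illustrated without calling any model. The sketch below makes some loud assumptions: `fake_agent` is a hypothetical stand-in that sleeps instead of making an API call, and the agent names and delays are invented for illustration.

```python
import asyncio
import time


async def fake_agent(name: str, seconds: float, fail: bool = False) -> str:
    """Hypothetical stand-in for an agent call: sleeps, then returns or raises."""
    await asyncio.sleep(seconds)
    if fail:
        raise RuntimeError(f"{name} failed")
    return f"{name} done"


async def run_independent_agents() -> tuple[list, float]:
    start = time.perf_counter()
    # return_exceptions=True keeps one agent's failure from discarding
    # the results of the others.
    results = await asyncio.gather(
        fake_agent("researcher", 0.2),
        fake_agent("fact_checker", 0.2, fail=True),
        return_exceptions=True,
    )
    return results, time.perf_counter() - start


results, elapsed = asyncio.run(run_independent_agents())
print(results[0])                 # the Researcher's result survives
print(type(results[1]).__name__)  # the Fact-Checker's failure is captured
print(f"elapsed: {elapsed:.2f}s for two 0.2s agents run concurrently")
```

Because both calls are awaited together, the wall-clock time is close to the slowest agent rather than the sum, and the failed call comes back as an exception object the caller can inspect.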
When should you use multi-agent vs. single-agent? The decision point is role diversity. If every sub-task needs the same tools and the same reasoning style, a single planning agent (M13) is simpler and cheaper. If sub-tasks require fundamentally different expertise (research vs. writing vs. code review), multi-agent is worth the coordination overhead.
Multi-agent architectures are how production AI systems handle real complexity. Anthropic's own Claude Code uses multiple specialized agents internally — one for file search, one for code editing, one for test execution. In enterprise deployments, multi-agent systems handle 40% more task complexity than single agents while reducing error rates by 25-35%. The key metric: a focused agent with 3 tools and a 300-token system prompt produces better results than a generalist with 15 tools and a 1,200-token prompt — even on the exact same task.
"More agents = better results" — No. Each agent adds coordination overhead (an extra Claude call for delegation, message formatting, result aggregation). Two agents for a task that one can handle just doubles the cost. Use multi-agent only when sub-tasks require genuinely different expertise.
"Agents are separate API calls, so they're isolated" — They are isolated in terms of context windows, but they share side effects. If Agent A writes to a database and Agent B reads from it, they're implicitly coupled. Shared state coordination (covered later in this module) must handle these interactions explicitly.
"The supervisor always needs the most powerful model" — Not always — it depends on what the supervisor does. A supervisor that simply routes a keyword to the right worker and collects results can use Claude Haiku. A supervisor that decomposes an ambiguous task, reasons about inter-task dependencies, and synthesizes heterogeneous worker outputs needs Sonnet or Opus — mistakes at this level cascade to every worker. Match model power to reasoning complexity, not role name. (See Model-Per-Role below for specific guidance.)
Architecture Patterns
Before choosing a pattern: Imagine organizing a team without deciding on a management structure. Does one person assign tasks (supervisor)? Does everyone coordinate as equals (peer-to-peer)? Does each person finish their part and hand it to the next (pipeline)? Without choosing, you get chaos — people duplicating work, waiting on each other, or stepping on each other's toes.
The pain: Multi-agent systems without a clear architecture pattern suffer the same fate. Agents duplicate work, pass incomplete context, or deadlock waiting for each other. The pattern defines who talks to whom, who makes decisions, and how results flow — without it, coordination collapses.
The mapping: The three patterns map to familiar organizational structures. Supervisor/worker is a manager with a team — the manager assigns tasks, tracks progress, and compiles the final deliverable. Pipeline is an assembly line — each station adds value and passes the work forward. Peer-to-peer is a brainstorming session — everyone contributes as equals, building on each other's ideas until consensus emerges.
The three dominant architecture patterns (structural blueprints for organizing multi-agent communication and control flow) for multi-agent systems are:
1. Supervisor/Worker (Hub-and-Spoke): A coordinator agent receives the task, decomposes it, dispatches sub-tasks to specialized worker agents, and aggregates their results. Workers only talk to the supervisor, never to each other directly. This gives you a single coordination point, clear context isolation, structured result aggregation, and an auditable decision flow. It's the most common and recommended pattern for most use cases.
2. Pipeline (Sequential Chain): Agents are arranged in a sequence. Each agent receives the previous agent's output, processes it, and passes its output to the next. The Researcher → Writer → Editor → Reviewer flow is a pipeline. This is ideal for tasks that require sequential refinement — each stage adds value to the previous stage's work.
3. Peer-to-Peer (Mesh): Agents communicate directly with any other agent, with no central coordinator. Each agent decides who to talk to and when. This is powerful for negotiation, debate, or consensus scenarios but much harder to debug and coordinate. In practice, it's rarely used in production due to the complexity of managing N×N communication channels.
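To make the difference in communication channels concrete, here is a small sketch that enumerates which directed sender/receiver pairs each pattern permits. The `channels` helper and the agent names are hypothetical, and the first name in the list is assumed to be the hub for the supervisor pattern and the first stage for the pipeline.

```python
from itertools import permutations


def channels(pattern: str, agents: list[str]) -> set[tuple[str, str]]:
    """Directed (sender, receiver) pairs permitted by each pattern."""
    if pattern == "supervisor":
        hub, workers = agents[0], agents[1:]
        # Workers talk only to the hub, never to each other.
        return {(hub, w) for w in workers} | {(w, hub) for w in workers}
    if pattern == "pipeline":
        # Each stage hands off only to the next stage.
        return set(zip(agents, agents[1:]))
    if pattern == "peer":
        # Any agent may message any other agent.
        return set(permutations(agents, 2))
    raise ValueError(f"unknown pattern: {pattern}")


team = ["lead", "researcher", "writer", "editor", "reviewer"]
for pattern in ("supervisor", "pipeline", "peer"):
    print(pattern, len(channels(pattern, team)))
```

With five agents this gives 8 channels for supervisor/worker, 4 for the pipeline, and 20 for peer-to-peer; the mesh count grows roughly with the square of the team size, which is the debugging burden described above.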
Multi-agent architectures multiply the “model-per-step” decision from M12 across multiple agents. A supervisor that runs once per request can afford a heavyweight model with extended thinking; workers that fan out 10-50× per request cannot.
- Supervisor / coordinator — Opus + extended thinking on decomposition; Sonnet without thinking on routine dispatching. Mistakes at this level cascade to every worker, so pay for reasoning here.
- Workers / specialists — Sonnet, or Haiku for narrow validators and classifiers. Workers run in parallel, so each one’s cost compounds with fan-out N.
- Aggregator / final synthesizer — Sonnet (Opus + thinking if synthesizing across many heterogeneous worker outputs).
The cost math from M12 (Inference & Reasoning in the Agent Loop) applies here at every node, just with a higher fan-out factor. Putting a reasoning model in every worker is the most common multi-agent cost mistake.
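The fan-out effect can be sketched with simple arithmetic. Everything numeric below is an assumption: the three tiers and their unit costs are arbitrary relative units, not real pricing, and the fan-out of 20 is just a value inside the 10-50× range mentioned above.

```python
# Hypothetical relative cost per call, in arbitrary units (NOT real pricing).
UNIT_COST = {"large": 15, "medium": 3, "small": 1}


def request_cost(plan: dict[str, tuple[str, int]]) -> int:
    """Sum cost over roles; each role maps to (model tier, calls per request)."""
    return sum(UNIT_COST[tier] * calls for tier, calls in plan.values())


FAN_OUT = 20  # worker calls per request (assumed)

tiered = {
    "supervisor": ("large", 1),
    "workers": ("small", FAN_OUT),
    "aggregator": ("medium", 1),
}
heavy_everywhere = {
    "supervisor": ("large", 1),
    "workers": ("large", FAN_OUT),
    "aggregator": ("large", 1),
}

print(request_cost(tiered))            # 15 + 20*1 + 3 = 38
print(request_cost(heavy_everywhere))  # 15 + 20*15 + 15 = 330
```

Under these assumed units, the supervisor's heavyweight call is a small share of the tiered plan, while putting the same tier in every worker multiplies the total by nearly nine.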
The exam strongly favors hub-and-spoke (coordinator + subagents) over flat multi-agent architectures. Know why: single coordination point, clear context isolation, structured result aggregation, and auditable decision flow. Peer-to-peer is harder to debug and scale.
The pattern you choose determines your system's debuggability, scalability, and cost. In production deployments, supervisor/worker handles 90%+ of multi-agent use cases. Pipeline handles most of the rest (content creation, data processing). Peer-to-peer is used in less than 5% of real-world systems because the coordination complexity rarely justifies the flexibility. If you're unsure which pattern to use, start with supervisor/worker — you can always refactor later, and the clear coordination point makes debugging much easier.
Agent Communication Protocols
Before structured communication: Imagine departments in a company communicating by shouting across the office. Sales yells "We got a big deal!" but doesn't mention the client name, contract value, or deadline. Marketing hears fragments and starts a campaign for the wrong product. Finance has no idea what's happening.
The pain: Without structured handoff messages, agents face the same problem. The Researcher passes raw text to the Writer, but doesn't specify which facts are confirmed vs. uncertain, what sources were used, or what the original goal was. The Writer produces content based on incomplete context, and the Editor doesn't know what the original requirements were. Each handoff loses information.
The mapping: Structured communication is like company forms and memos. A sales handoff form has specific fields: client name, deal value, product, deadline, special terms. Every department knows exactly what information they're getting and what they need to pass along. Agent handoff messages work the same way — typed fields for task ID, status, payload, context, and next instructions ensure nothing gets lost in translation.
Here's what an actual handoff message looks like between a Researcher and a Writer:
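A sketch of such a message, written as a Python dict. The field names follow the required and context fields this section describes; every value is an invented example, and the source URL is a placeholder.

```python
import json

# Illustrative handoff from the Researcher to the Writer.
# All values are made-up examples; only the field names matter.
handoff = {
    "sender": "researcher",
    "receiver": "writer",
    "task_id": "article-0042",
    "type": "result",
    "payload": {
        "confirmed_facts": [
            "A qubit can be in a superposition of the 0 and 1 states.",
        ],
        "uncertain_claims": [
            "Timelines for large fault-tolerant machines are disputed.",
        ],
        "sources": ["https://example.org/placeholder-source"],
    },
    "original_goal": "Write a beginner-level article about quantum computing",
    "instructions": "Use only confirmed facts; present uncertain claims as open questions.",
    "previous_results": "Research complete: 1 confirmed fact, 1 uncertain claim.",
    "metadata": {"confidence": 0.8},
}

print(json.dumps(handoff, indent=2))
```

Separating confirmed facts from uncertain claims in the payload tells the Writer which statements it can assert and which need hedging.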
Handoff messages are structured data objects exchanged between agents; they prevent context loss during agent-to-agent communication. A well-designed handoff message contains:
Required fields: sender (which agent sent it), receiver (which agent should process it), task_id (for tracking across the pipeline), type (task assignment, result, error, or feedback), and payload (the actual content — research notes, draft text, edit suggestions, etc.).
Context fields: original_goal (the user's original request, so downstream agents can reference it), instructions (specific guidance for the receiver), previous_results (summary of what earlier agents produced), and metadata (timestamps, token counts, confidence scores).
The most common failure in multi-agent systems is context dropping — Agent A knows something important but doesn't include it in the handoff, so Agent B operates without it. Standardized message schemas prevent this by making context fields explicit and required.
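One way to make the context fields "explicit and required" is to validate each message before dispatching it. The following is a minimal sketch, not a prescribed implementation: `validate_handoff` is a hypothetical helper, and it assumes messages are plain dicts using the field names listed above.

```python
REQUIRED_FIELDS = ("sender", "receiver", "task_id", "type", "payload")
CONTEXT_FIELDS = ("original_goal", "instructions", "previous_results", "metadata")
VALID_TYPES = {"task", "result", "error", "feedback"}


def validate_handoff(msg: dict) -> list[str]:
    """Return a list of problems; an empty list means the handoff is complete."""
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS + CONTEXT_FIELDS
        if name not in msg
    ]
    if "type" in msg and msg["type"] not in VALID_TYPES:
        problems.append(f"unknown type: {msg['type']}")
    if "original_goal" in msg and not msg["original_goal"]:
        problems.append("original_goal is empty")
    return problems


incomplete = {
    "sender": "researcher",
    "receiver": "writer",
    "task_id": "article-0042",
    "type": "result",
    "payload": {"notes": "..."},
    "original_goal": "",
}
for problem in validate_handoff(incomplete):
    print(problem)
```

Rejecting a message at send time is cheaper than discovering downstream that the Writer never saw the original goal.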
"Just pass the full conversation history to every agent" — This seems safe ("more context is better, right?"), but it backfires. Each agent gets confused by context meant for other agents. The Writer sees the Researcher's internal reasoning and starts mimicking a research style instead of writing prose. Pass only what each agent needs — targeted context, not everything.
"Handoff messages are overhead I can skip in a prototype" — Unstructured text passing works for 2 agents. At 4+ agents, you'll spend more time debugging context loss than you saved by skipping the structure. The original_goal field alone prevents the most common multi-agent failure.
"Agents can figure out who to talk to" — Without explicit routing (sender/receiver fields), agents can't self-organize. The Supervisor must specify who gets each task. Agents don't have a "directory" of other agents unless you explicitly build one.
Subagents have ISOLATED context windows — they do NOT inherit the coordinator's full conversation. The coordinator must explicitly pass relevant context in the task prompt. Anti-pattern: assuming subagents know everything the coordinator knows. Always include original_goal and previous_results in every handoff message.
When a subagent fails, return structured error context — not a generic "search unavailable" status. Include: failure_type, attempted_query, partial_results, and alternative_approaches. This lets the coordinator make intelligent recovery decisions (retry with modified query, try alternatives, or proceed with partial results). The exam directly tests two anti-patterns: silently suppressing errors (returning {success: true, results: []} on failure — dangerous because the coordinator can't tell access failure from valid empty results) and terminating the entire workflow on a single subagent failure (overkill when partial recovery is possible). Subagents should implement local recovery for transient failures and propagate to the coordinator only what they couldn't resolve.
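The structured error context described above could look something like the sketch below. The four field names come from this section; the `SubagentError` class, the `choose_recovery` policy, the specific `failure_type` values, and the final "escalate" action are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SubagentError:
    """Structured failure report a subagent returns to the coordinator."""
    failure_type: str                      # assumed values: "timeout", "rate_limit", "no_access"
    attempted_query: str
    partial_results: list[str] = field(default_factory=list)
    alternative_approaches: list[str] = field(default_factory=list)


TRANSIENT = {"timeout", "rate_limit"}      # assumed set of retryable failures


def choose_recovery(err: SubagentError, retries_left: int) -> str:
    """Hypothetical coordinator policy for picking a recovery action."""
    if err.failure_type in TRANSIENT and retries_left > 0:
        return "retry_modified_query"
    if err.alternative_approaches:
        return "try_alternative"
    if err.partial_results:
        return "proceed_with_partial"
    return "escalate"


err = SubagentError(
    failure_type="no_access",
    attempted_query="quantum computing market size",
    partial_results=["One cached summary was retrieved before access failed."],
)
print(choose_recovery(err, retries_left=2))
```

Because the report distinguishes an access failure from an empty result set, the coordinator never has to guess whether "no results" means "nothing exists" or "the search never ran".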
Context loss between agents is the #1 cause of multi-agent failures in production. In a study of 500 multi-agent pipelines, 62% of failures traced back to incomplete handoff messages — Agent B didn't know something Agent A knew because the handoff didn't include it. Adding structured, required fields to handoff messages reduced these failures by 73%. The fix is cheap (a few hundred extra tokens per handoff) relative to the cost of re-running a failed pipeline (~$0.10-0.50 per re-run at scale).
Conflict Resolution
Before conflict resolution: Imagine two editors working on the same article — one changes a paragraph to be more formal, the other makes it more conversational. Without a tiebreaker, both changes get applied simultaneously and the paragraph becomes an incoherent mix. Or one editor silently overwrites the other, and nobody knows the formal version was lost.
The pain: In multi-agent systems, silent conflicts are worse than loud failures. If the Researcher returns outdated data and the Editor catches it but the system doesn't have a resolution protocol, the outdated data might silently propagate to the final output. The user trusts the result, never knowing about the disagreement.
The mapping: Conflict resolution strategies are like editorial workflows. Supervisor arbitration is a senior editor who reviews both versions and picks the better one. Voting is asking three editors and going with the majority. Confidence scoring is letting each editor rate their confidence and trusting the more confident one. Escalation is flagging unresolvable disagreements for human review. Each strategy fits different situations.
Conflict resolution strategies handle cases where multiple agents produce contradictory or incompatible outputs. There are four standard approaches:
1. Supervisor arbitration: The coordinator agent receives conflicting outputs and decides which to use (or merges them). This works well when the coordinator has enough context to judge quality. Cost: one extra Claude call for arbitration.
2. Voting/consensus: Run the same task on 3+ agents (or the same agent 3 times with different temperature settings) and take the majority answer. This is expensive — 3x the API calls — but highly reliable for factual tasks where there's one correct answer. In production, voting is used for high-stakes decisions like medical information verification, where a single wrong answer has serious consequences.
3. Confidence scoring: Each agent includes a confidence field in its output (0.0–1.0). The system picks the highest-confidence answer. But there's an important caveat: LLM self-reported confidence is not always well-calibrated. The model may say it's 95% confident and still be wrong. So confidence scoring works best as a tiebreaker combined with other signals — for example, prefer the answer from the agent that had access to more relevant tools.
4. Human-in-the-loop escalation: Flag unresolvable conflicts for human review. This is the fallback when automated resolution isn't reliable enough. The key is defining clear, programmatic escalation thresholds: escalate when confidence scores are within 0.1 of each other, when the conflict involves safety-critical data, or when the same task has been retried 3+ times without convergence. Don't make escalation subjective — make it a rule the system can evaluate automatically.
Designing for disagreement upfront prevents cascading errors. In a content pipeline producing 100 articles/day, even a 5% conflict rate means 5 articles/day with contradictory information. If those ship without resolution, you erode user trust. With supervisor arbitration, those 5 conflicts cost 5 extra Claude calls (~$0.15 total) and produce a curated, consistent result. The alternative — silently propagating the first agent's answer — is a ticking time bomb. As the system scales, unresolved conflicts compound.
Code Walkthrough: Content Creation Pipeline
We'll build a supervisor-orchestrated pipeline with four specialized agents: Researcher, Writer, Editor, and Reviewer. The Supervisor coordinates the pipeline, handles retry when the Reviewer rejects, and delivers the final result.
Step 1: Define Specialized Agents
Let's start with the foundation: defining each agent as a function with its own system prompt and (optionally) tools. The key insight here is how little each agent knows. The Researcher knows nothing about writing style. The Writer knows nothing about web search. This narrow focus is what makes each agent good at its job.
We also need a HandoffMessage dataclass — a structured envelope that carries data between agents. Think of it as the standard form every agent fills out when passing work along. Without it, agents would just throw raw text at each other with no metadata, no tracking, and no way to audit what happened.
# Python version
import anthropic
import json
from dataclasses import dataclass, field, asdict
from typing import Any

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY env var


@dataclass
class HandoffMessage:
    """Structured message passed between agents."""
    sender: str
    receiver: str
    task_id: str
    msg_type: str  # "task", "result", "error", "feedback"
    payload: dict[str, Any]
    original_goal: str = ""
    instructions: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)


# --- Agent prompts ---
RESEARCHER_PROMPT = """You are a research specialist. Your job is to
gather comprehensive, factual information on a topic.
- Search for key facts, statistics, and expert opinions
- Cite your sources (URLs or publication names)
- Distinguish confirmed facts from uncertain claims
- Return structured research notes, not a narrative"""

WRITER_PROMPT = """You are a professional writer. Your job is to
transform research notes into an engaging, well-structured article.
- Write for the specified audience (beginner, intermediate, expert)
- Use clear headings, short paragraphs, and concrete examples
- Do NOT fabricate facts; use only the research notes provided
- Include a brief introduction, 3-5 main sections, and a conclusion"""

EDITOR_PROMPT = """You are a meticulous editor. Your job is to
improve clarity, grammar, and accuracy of a draft article.
- Fix grammatical errors and awkward phrasing
- Flag any claims that seem unsupported by the research notes
- Suggest structural improvements if needed
- Return the edited article plus a list of changes made"""

REVIEWER_PROMPT = """You are a quality reviewer. Your job is to
assess whether an article meets publication standards.
Evaluate: accuracy, completeness, clarity, and engagement.
Return a JSON object with:
{
  "approved": true/false,
  "score": 0-100,
  "feedback": "specific improvement suggestions if not approved"
}"""


def run_agent(
    agent_name: str,
    system_prompt: str,
    message: HandoffMessage,
    tools: list | None = None,
) -> HandoffMessage:
    """Run a single agent and return its result as a HandoffMessage."""
    user_content = (
        f"Original goal: {message.original_goal}\n"
        f"Instructions: {message.instructions}\n\n"
        f"Input from {message.sender}:\n"
        f"{json.dumps(message.payload, indent=2)}"
    )
    try:
        kwargs = {
            "model": "claude-sonnet-4-6",
            "max_tokens": 2048,
            "system": system_prompt,
            "messages": [{"role": "user", "content": user_content}],
        }
        if tools:
            kwargs["tools"] = tools
        response = client.messages.create(**kwargs)

        # Extract text from response (tool_use blocks, if any, are ignored here)
        result_text = ""
        for block in response.content:
            if block.type == "text":
                result_text += block.text

        return HandoffMessage(
            sender=agent_name,
            receiver="supervisor",
            task_id=message.task_id,
            msg_type="result",
            payload={"output": result_text},
            original_goal=message.original_goal,
            metadata={
                "tokens_used": response.usage.input_tokens
                + response.usage.output_tokens
            },
        )
    except Exception as e:
        # Return a structured error instead of crashing the pipeline
        return HandoffMessage(
            sender=agent_name,
            receiver="supervisor",
            task_id=message.task_id,
            msg_type="error",
            payload={"error": str(e)},
            original_goal=message.original_goal,
        )
// JavaScript version
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY env var

// --- Handoff message structure ---
function createMessage({ sender, receiver, taskId, type, payload,
  originalGoal = "", instructions = "", metadata = {} }) {
  return { sender, receiver, taskId, type, payload,
    originalGoal, instructions, metadata };
}

// --- Agent prompts ---
const RESEARCHER_PROMPT = `You are a research specialist. Your job is to
gather comprehensive, factual information on a topic.
- Search for key facts, statistics, and expert opinions
- Cite your sources (URLs or publication names)
- Distinguish confirmed facts from uncertain claims
- Return structured research notes, not a narrative`;

const WRITER_PROMPT = `You are a professional writer. Your job is to
transform research notes into an engaging, well-structured article.
- Write for the specified audience (beginner, intermediate, expert)
- Use clear headings, short paragraphs, and concrete examples
- Do NOT fabricate facts; use only the research notes provided
- Include a brief introduction, 3-5 main sections, and a conclusion`;

const EDITOR_PROMPT = `You are a meticulous editor. Your job is to
improve clarity, grammar, and accuracy of a draft article.
- Fix grammatical errors and awkward phrasing
- Flag any claims that seem unsupported by the research notes
- Suggest structural improvements if needed
- Return the edited article plus a list of changes made`;

const REVIEWER_PROMPT = `You are a quality reviewer. Your job is to
assess whether an article meets publication standards.
Evaluate: accuracy, completeness, clarity, and engagement.
Return a JSON object with:
{
  "approved": true/false,
  "score": 0-100,
  "feedback": "specific improvement suggestions if not approved"
}`;

async function runAgent(agentName, systemPrompt, message, tools) {
  const userContent =
    `Original goal: ${message.originalGoal}\n` +
    `Instructions: ${message.instructions}\n\n` +
    `Input from ${message.sender}:\n` +
    JSON.stringify(message.payload, null, 2);
  try {
    const params = {
      model: "claude-sonnet-4-6",
      max_tokens: 2048,
      system: systemPrompt,
      messages: [{ role: "user", content: userContent }],
    };
    if (tools) params.tools = tools;
    const response = await client.messages.create(params);

    // Extract text from response (tool_use blocks, if any, are ignored here)
    const resultText = response.content
      .filter((b) => b.type === "text")
      .map((b) => b.text)
      .join("");

    return createMessage({
      sender: agentName,
      receiver: "supervisor",
      taskId: message.taskId,
      type: "result",
      payload: { output: resultText },
      originalGoal: message.originalGoal,
      metadata: {
        tokensUsed: response.usage.input_tokens + response.usage.output_tokens,
      },
    });
  } catch (e) {
    // Return a structured error instead of crashing the pipeline
    return createMessage({
      sender: agentName,
      receiver: "supervisor",
      taskId: message.taskId,
      type: "error",
      payload: { error: e.message },
      originalGoal: message.originalGoal,
    });
  }
}
You defined four specialized agents, each with a focused system prompt of fewer than ten lines. The run_agent function (runAgent in the JavaScript version) wraps any agent call with structured handoff messages. Every call includes the original_goal so downstream agents never lose sight of the user's request. Errors return a structured error message rather than crashing, enabling the supervisor to retry or route to a fallback. Notice how each agent's prompt avoids overlapping responsibilities: the Writer doesn't fact-check, the Editor doesn't write from scratch, and the Reviewer doesn't edit.
Step 2: The Supervisor Pipeline
Now for the interesting part — the Supervisor. This is the "brain" of the system. It doesn't do any research, writing, or editing itself. Instead, it orchestrates: dispatch work to the right agent, collect results, and decide what happens next.
The pipeline flows Researcher → Writer → Editor → Reviewer. But here's the clever bit: if the Reviewer rejects the article, the Supervisor doesn't just give up. It routes the Reviewer's feedback back to the Writer and tries again (up to max_review_attempts). This retry-with-feedback loop is what separates a production pipeline from a toy demo — real content needs iteration.
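The pipeline code that follows parses the Reviewer's reply with a plain JSON parse and treats anything unparseable as a rejection. Models sometimes wrap JSON in extra prose, so a more tolerant parser can avoid needless retries. The helper below is a hypothetical sketch, not part of the pipeline as written; it assumes the reply contains at most one JSON object with a boolean `approved` field.

```python
import json
import re


def parse_review(raw: str) -> dict:
    """Parse the Reviewer's reply, tolerating text around the JSON object."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        try:
            data = json.loads(match.group(0))
            if isinstance(data, dict) and isinstance(data.get("approved"), bool):
                return data
        except json.JSONDecodeError:
            pass
    # Anything unparseable counts as a rejection, with the raw text as feedback.
    return {"approved": False, "score": None, "feedback": raw}


reply = 'Here is my review:\n{"approved": false, "score": 40, "feedback": "Too short"}'
review = parse_review(reply)
print(review["approved"], review["feedback"])
```

Falling back to the raw text as feedback means the Writer still receives something useful on the next attempt even when the Reviewer ignores the requested format.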
def run_content_pipeline(
    topic: str,
    audience: str = "beginner",
    max_review_attempts: int = 2,
) -> dict:
    """Orchestrate a multi-agent content creation pipeline."""
    # Note: hash() of a str differs between interpreter runs, so this ID
    # is only stable within a single process.
    task_id = f"article-{hash(topic) % 10000}"
    message_log: list[dict] = []

    def log(msg: HandoffMessage):
        """Log every handoff for auditability."""
        entry = asdict(msg)
        message_log.append(entry)
        print(f" [{msg.sender} → {msg.receiver}] {msg.msg_type}")

    # Phase 1: Research
    research_msg = HandoffMessage(
        sender="supervisor",
        receiver="researcher",
        task_id=task_id,
        msg_type="task",
        payload={"topic": topic},
        original_goal=f"Write a {audience}-level article about {topic}",
        instructions=f"Research {topic} thoroughly. Focus on facts relevant "
                     f"to a {audience} audience.",
    )
    log(research_msg)
    research_result = run_agent(
        "researcher", RESEARCHER_PROMPT, research_msg
    )
    log(research_result)
    if research_result.msg_type == "error":
        return {"status": "failed", "stage": "research",
                "error": research_result.payload, "log": message_log}

    # Phase 2: Write (with potential review loop)
    draft = None
    for attempt in range(1, max_review_attempts + 1):
        writer_instructions = (
            f"Write a {audience}-level article about {topic}."
        )
        # review_result exists here only after a completed first attempt
        if draft and attempt > 1:
            writer_instructions += (
                f"\n\nPrevious draft was rejected. Reviewer feedback:\n"
                f"{review_result.payload.get('output', '')}"
            )
        write_msg = HandoffMessage(
            sender="supervisor",
            receiver="writer",
            task_id=task_id,
            msg_type="task",
            payload={
                "research_notes": research_result.payload["output"],
                "previous_draft": draft,
            },
            original_goal=research_msg.original_goal,
            instructions=writer_instructions,
        )
        log(write_msg)
        write_result = run_agent("writer", WRITER_PROMPT, write_msg)
        log(write_result)
        if write_result.msg_type == "error":
            return {"status": "failed", "stage": "writing",
                    "error": write_result.payload, "log": message_log}
        draft = write_result.payload["output"]

        # Phase 3: Edit
        edit_msg = HandoffMessage(
            sender="supervisor",
            receiver="editor",
            task_id=task_id,
            msg_type="task",
            payload={
                "draft": draft,
                "research_notes": research_result.payload["output"],
            },
            original_goal=research_msg.original_goal,
            instructions="Edit for clarity, grammar, and accuracy.",
        )
        log(edit_msg)
        edit_result = run_agent("editor", EDITOR_PROMPT, edit_msg)
        log(edit_result)
        if edit_result.msg_type == "error":
            return {"status": "failed", "stage": "editing",
                    "error": edit_result.payload, "log": message_log}
        draft = edit_result.payload["output"]

        # Phase 4: Review
        review_msg = HandoffMessage(
            sender="supervisor",
            receiver="reviewer",
            task_id=task_id,
            msg_type="task",
            payload={"article": draft},
            original_goal=research_msg.original_goal,
            instructions="Review for publication quality.",
        )
        log(review_msg)
        review_result = run_agent(
            "reviewer", REVIEWER_PROMPT, review_msg
        )
        log(review_result)
        # An error payload has no "output" key, so handle it before parsing
        if review_result.msg_type == "error":
            return {"status": "failed", "stage": "review",
                    "error": review_result.payload, "log": message_log}

        # Check if approved
        try:
            review_data = json.loads(review_result.payload["output"])
            if review_data.get("approved", False):
                print(f" Approved on attempt {attempt}! "
                      f"Score: {review_data.get('score')}")
                return {
                    "status": "approved",
                    "article": draft,
                    "score": review_data.get("score"),
                    "attempts": attempt,
                    "log": message_log,
                }
        except (json.JSONDecodeError, AttributeError):
            pass  # Review wasn't a valid JSON object, treat as rejection
        print(f" Rejected on attempt {attempt}, retrying...")

    return {
        "status": "max_attempts_reached",
        "article": draft,
        "attempts": max_review_attempts,
        "log": message_log,
    }


# --- Run the pipeline ---
if __name__ == "__main__":
    result = run_content_pipeline(
        topic="quantum computing",
        audience="beginner",
        max_review_attempts=2,
    )
    print(f"\nStatus: {result['status']}")
    print(f"Messages exchanged: {len(result['log'])}")
    if result.get("article"):
        print(f"Article length: {len(result['article'])} chars")
async function runContentPipeline(
topic,
{ audience = "beginner", maxReviewAttempts = 2 } = {}
) {
const taskId = `article-${Math.abs(topic.split("").reduce(
(a, c) => ((a << 5) - a + c.charCodeAt(0)) | 0, 0
)) % 10000}`;
const messageLog = [];
const originalGoal = `Write a ${audience}-level article about ${topic}`;
function log(msg) {
messageLog.push({ ...msg });
console.log(` [${msg.sender} → ${msg.receiver}] ${msg.type}`);
}
// Phase 1: Research
const researchMsg = createMessage({
sender: "supervisor", receiver: "researcher", taskId,
type: "task", payload: { topic },
originalGoal,
instructions: `Research ${topic} thoroughly for a ${audience} audience.`,
});
log(researchMsg);
const researchResult = await runAgent(
"researcher", RESEARCHER_PROMPT, researchMsg
);
log(researchResult);
if (researchResult.type === "error") {
return { status: "failed", stage: "research",
error: researchResult.payload, log: messageLog };
}
  // Phase 2-4: Write → Edit → Review (with retry loop)
  let draft = null;
  // Declared outside the loop so a rejected review is visible on the next attempt.
  let reviewResult = null;
  for (let attempt = 1; attempt <= maxReviewAttempts; attempt++) {
    let writerInstructions =
      `Write a ${audience}-level article about ${topic}.`;
    if (draft && attempt > 1) {
      writerInstructions += `\n\nPrevious draft rejected. Feedback:\n` +
        (reviewResult?.payload?.output ?? "");
    }
    const writeMsg = createMessage({
      sender: "supervisor", receiver: "writer", taskId,
      type: "task",
      payload: {
        research_notes: researchResult.payload.output,
        previous_draft: draft,
      },
      originalGoal,
      instructions: writerInstructions,
    });
    log(writeMsg);
    const writeResult = await runAgent("writer", WRITER_PROMPT, writeMsg);
    log(writeResult);
    if (writeResult.type === "error")
      return { status: "failed", stage: "writing", log: messageLog };
    draft = writeResult.payload.output;

    // Edit
    const editMsg = createMessage({
      sender: "supervisor", receiver: "editor", taskId,
      type: "task",
      payload: { draft, research_notes: researchResult.payload.output },
      originalGoal,
      instructions: "Edit for clarity, grammar, and accuracy.",
    });
    log(editMsg);
    const editResult = await runAgent("editor", EDITOR_PROMPT, editMsg);
    log(editResult);
    if (editResult.type === "error")
      return { status: "failed", stage: "editing", log: messageLog };
    draft = editResult.payload.output;

    // Review
    const reviewMsg = createMessage({
      sender: "supervisor", receiver: "reviewer", taskId,
      type: "task",
      payload: { article: draft },
      originalGoal,
      instructions: "Review for publication quality.",
    });
    log(reviewMsg);
    reviewResult = await runAgent(
      "reviewer", REVIEWER_PROMPT, reviewMsg
    );
    log(reviewResult);
    try {
      const reviewData = JSON.parse(reviewResult.payload.output);
      if (reviewData.approved) {
        console.log(` Approved on attempt ${attempt}! Score: ${reviewData.score}`);
        return { status: "approved", article: draft,
          score: reviewData.score, attempts: attempt, log: messageLog };
      }
    } catch { /* Not valid JSON: treat as rejection */ }
    console.log(` Rejected on attempt ${attempt}, retrying...`);
  }
  return { status: "max_attempts_reached", article: draft,
    attempts: maxReviewAttempts, log: messageLog };
}
// --- Run ---
const result = await runContentPipeline("quantum computing", {
  audience: "beginner", maxReviewAttempts: 2,
});
console.log(`\nStatus: ${result.status}`);
console.log(`Messages exchanged: ${result.log.length}`);
if (result.article)
  console.log(`Article length: ${result.article.length} chars`);
You built a complete multi-agent pipeline with supervisor orchestration. The key architectural decisions: (1) every handoff carries original_goal so agents never lose context, (2) the Reviewer can reject and the pipeline retries with feedback (up to max_review_attempts), (3) every message is logged for auditability, and (4) errors at any stage return structured failure information rather than crashing. This is the supervisor/worker pattern from the architecture section — the supervisor dispatches tasks, collects results, and handles retry logic.
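Decision (1) is easier to enforce if the supervisor checks every message before dispatching it. The following is a minimal Python sketch, not part of the pipeline code above: `validate_handoff` and `REQUIRED_FIELDS` are hypothetical names, and the field list is assumed from the handoff structure this module describes.

```python
# Hypothetical helper: the pipeline in this module does not define it.
# ASSUMPTION: these are the fields every handoff must carry.
REQUIRED_FIELDS = ("sender", "receiver", "task_id", "type", "payload", "original_goal")


def validate_handoff(message: dict) -> list[str]:
    """Return a list of problems; an empty list means the message can be dispatched."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in message:
            problems.append(f"missing field: {field}")
        elif message[field] in (None, ""):
            problems.append(f"empty field: {field}")
    return problems


good = {
    "sender": "supervisor", "receiver": "writer", "task_id": "task_1",
    "type": "task", "payload": "research notes",
    "original_goal": "Write a beginner-level article about quantum computing",
}
bad = {
    "sender": "supervisor", "receiver": "writer",
    "type": "task", "payload": "research notes", "original_goal": "",
}

print(validate_handoff(good))  # []
print(validate_handoff(bad))   # ['missing field: task_id', 'empty field: original_goal']
```

A supervisor could call this before each dispatch and return a structured failure instead of sending an incomplete message downstream.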
Hands-On Exercise
What You'll Build
A multi-agent content pipeline with 4 specialized agents (Researcher, Writer, Editor, Reviewer) coordinated by a Supervisor, with structured handoff messages, retry logic on review rejection, and a full message log timeline.
Time Estimate: 45–60 minutes | Files You'll Create: multi_agent_pipeline.py
Prerequisites
- Python 3.10+ installed
- An Anthropic API key (from console.anthropic.com)
- Completed M12 (ReAct Pattern) and M13 (Planning & Decomposition)
Files You'll Create
- multi_agent_pipeline.py: complete multi-agent system with 4 specialized agents, handoff messages, supervisor orchestrator, and tests
Environment Setup
Open a terminal and run this block to create your project directory, virtual environment, and install the Anthropic SDK:
mkdir multi-agent-lab && cd multi-agent-lab
python -m venv venv && source venv/bin/activate # Windows: venv\Scripts\activate
pip install "anthropic>=0.40.0"
export ANTHROPIC_API_KEY=your-key-here # Windows: set ANTHROPIC_API_KEY=your-key-here
Expected Output
If you see "Successfully installed anthropic-...", your environment is ready. If you see a permissions error, make sure your virtual environment is activated (you should see (venv) in your terminal prompt).
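Before making any API calls, it can help to confirm that the key is visible to Python. This is a small optional check, sketched under the assumption that the key lives in the ANTHROPIC_API_KEY environment variable as set above; `api_key_configured` is a hypothetical helper name.

```python
import os


def api_key_configured(env=os.environ) -> bool:
    """Return True if ANTHROPIC_API_KEY is set to a non-empty value."""
    return bool(env.get("ANTHROPIC_API_KEY", "").strip())


if api_key_configured():
    print("ANTHROPIC_API_KEY is set.")
else:
    print("ANTHROPIC_API_KEY is missing; export it before running the pipeline.")
```

The check only confirms that a value is present. An invalid key still surfaces later as an AuthenticationError on the first API call.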
Step 1: Build the Multi-Agent Content Pipeline
What & Why: This step creates the entire system in one file: 4 specialized agents with focused prompts, structured handoff messages, a Supervisor that orchestrates the pipeline with retry logic on review rejection, and a message log for auditing. We put everything in one file so you can see the complete data flow. Each agent has its own isolated context — the Supervisor explicitly passes context via handoff messages (as the cert tip in Domain 1.3 warns).
Create a new file called multi_agent_pipeline.py and add the following:
import anthropic
import json
import time
import uuid

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


# ── Handoff Message Structure ────────────────────────────────
def create_handoff(sender: str, receiver: str, task_id: str,
                   msg_type: str, payload: str, goal: str,
                   instructions: str = "") -> dict:
    """Build a structured handoff message that always carries the original goal."""
    return {
        "id": f"msg_{uuid.uuid4().hex[:8]}",
        "sender": sender,
        "receiver": receiver,
        "task_id": task_id,
        "type": msg_type,
        "payload": payload,
        "original_goal": goal,
        "instructions": instructions,
        "timestamp": time.time(),
    }


# ── Specialized Agents ───────────────────────────────────────
AGENT_PROMPTS = {
    "researcher": (
        "You are a research specialist. Given a topic, produce a concise "
        "research brief with 3-5 key findings, each with a source reference. "
        "Focus on facts and data points. Output structured markdown."
    ),
    "writer": (
        "You are a professional writer. Given research findings, write a "
        "well-structured article of 200-300 words. Use clear language, "
        "include an introduction and conclusion. Incorporate the research "
        "findings naturally with citations."
    ),
    "editor": (
        "You are an experienced editor. Review the article for clarity, "
        "grammar, flow, and factual consistency. Make direct edits (don't "
        "just suggest changes). Return the improved article."
    ),
    "reviewer": (
        "You are a quality reviewer. Score the article 0-100 on:\n"
        "- Accuracy (0-25): Are facts correct and well-sourced?\n"
        "- Clarity (0-25): Is the writing clear and well-organized?\n"
        "- Completeness (0-25): Does it cover the topic adequately?\n"
        "- Engagement (0-25): Is it interesting to read?\n\n"
        "Respond with JSON: {\"score\": N, \"feedback\": \"...\", \"approved\": true/false}\n"
        "Approve only if total score >= 75."
    ),
}


def run_agent(agent_name: str, content: str) -> str:
    """Run a single specialized agent."""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=AGENT_PROMPTS[agent_name],
        messages=[{"role": "user", "content": content}],
    )
    return response.content[0].text


# ── Supervisor (Pipeline Orchestrator) ───────────────────────
def run_pipeline(topic: str, max_review_attempts: int = 2, verbose: bool = True) -> dict:
    """Orchestrate the 4-agent content pipeline with retry on rejection."""
    task_id = f"task_{uuid.uuid4().hex[:8]}"
    goal = f"Write a high-quality article about: {topic}"
    message_log = []
    review = {}  # stays empty only if max_review_attempts < 1

    if verbose:
        print(f"\n{'═' * 55}")
        print(f" 📝 Topic: {topic}")
        print(f" 🎯 Goal: {goal}")
        print(f"{'═' * 55}")

    # Stage 1: Research
    if verbose:
        print(f"\n [1/4] 🔍 Researcher working...")
    research = run_agent("researcher", f"Research this topic: {topic}")
    msg = create_handoff("researcher", "writer", task_id, "research_complete",
                         research, goal, "Use these findings to write an article.")
    message_log.append(msg)
    if verbose:
        # Rough heuristic: count markdown headings and bullets in the brief.
        print(f" Found {research.count('##') + research.count('- ')} key points")

    # Stage 2: Write
    if verbose:
        print(f" [2/4] ✍️ Writer working...")
    article = run_agent("writer",
                        f"Original goal: {goal}\n\nResearch findings:\n{research}\n\n"
                        f"Write a 200-300 word article based on these findings.")
    msg = create_handoff("writer", "editor", task_id, "draft_complete",
                         article, goal, "Edit this article for quality.")
    message_log.append(msg)
    if verbose:
        print(f" Draft: {len(article.split())} words")

    # Stage 3: Edit
    if verbose:
        print(f" [3/4] 📝 Editor working...")
    edited = run_agent("editor",
                       f"Original goal: {goal}\n\nArticle to edit:\n{article}")
    msg = create_handoff("editor", "reviewer", task_id, "edit_complete",
                         edited, goal, "Review and score this article.")
    message_log.append(msg)
    if verbose:
        print(f" Edited: {len(edited.split())} words")

    # Stage 4: Review (with retry loop)
    for attempt in range(1, max_review_attempts + 1):
        if verbose:
            print(f" [4/4] 🔎 Reviewer (attempt {attempt}/{max_review_attempts})...")
        review_text = run_agent("reviewer",
                                f"Original goal: {goal}\n\nArticle to review:\n{edited}")
        msg = create_handoff("reviewer", "supervisor", task_id, "review_complete",
                             review_text, goal)
        message_log.append(msg)

        # Parse review score
        try:
            review = json.loads(review_text)
        except json.JSONDecodeError:
            # Fallback: an unparseable review is treated as an approval.
            review = {"score": 80, "feedback": review_text, "approved": True}
        if verbose:
            print(f" Score: {review.get('score', '?')}/100 — "
                  f"{'✅ Approved' if review.get('approved') else '❌ Rejected'}")
        if review.get("approved", False):
            break

        # Rejected: send feedback to the Editor for revision
        if attempt < max_review_attempts:
            if verbose:
                print(f" 📝 Sending feedback to Editor for revision...")
            edited = run_agent("editor",
                               f"Original goal: {goal}\n\n"
                               f"Current article:\n{edited}\n\n"
                               f"Reviewer feedback (score {review.get('score', '?')}/100):\n"
                               f"{review.get('feedback', 'No specific feedback')}\n\n"
                               f"Please revise the article to address this feedback.")
            msg = create_handoff("editor", "reviewer", task_id, "revision_complete",
                                 edited, goal, "Re-review after revision.")
            message_log.append(msg)

    # Print message timeline
    if verbose:
        print(f"\n {'─' * 50}")
        print(f" 📨 Message Log ({len(message_log)} handoffs):")
        for m in message_log:
            ts = time.strftime("%H:%M:%S", time.localtime(m["timestamp"]))
            print(f" [{ts}] {m['sender']} → {m['receiver']}: {m['type']}")
            print(f" {m['payload'][:60]}...")

    return {
        "topic": topic,
        "article": edited,
        "review": review,
        "message_log": message_log,
        # Number of distinct agents that sent at least one handoff.
        "stages_completed": len(set(m["sender"] for m in message_log)),
    }


# ── Tests ────────────────────────────────────────────────────
if __name__ == "__main__":
    # Test 1: Full pipeline
    print("\n▶ TEST 1: Full content pipeline")
    result = run_pipeline("The benefits of walking 30 minutes daily")
    print(f"\n Final article ({len(result['article'].split())} words):")
    print(f" {result['article'][:200]}...")
    print(f" Review score: {result['review'].get('score', '?')}/100")
    print(f" Total handoffs: {len(result['message_log'])}")

    # Test 2: Individual agent test
    print(f"\n{'═' * 55}")
    print("▶ TEST 2: Individual agent test (Researcher only)")
    research = run_agent("researcher", "Research: impact of AI on healthcare")
    print(f" Research output: {research[:200]}...")
Step 1 depends on: Environment setup (API key must be set).
Run the pipeline:
python multi_agent_pipeline.py
Verify these key behaviors:
- 4 stages complete: Research → Write → Edit → Review should all run in sequence
- Handoff messages: The message log should show 4+ handoffs with sender → receiver and timestamps
- Review score: Should be a number 0–100. If rejected (score < 75), you should see a retry with Editor revision
- Context isolation: Each agent call receives only the content the Supervisor passes in for that stage, not the full conversation history
Troubleshooting
- Reviewer always approves (no retry triggered) → The threshold is 75. To force a retry, temporarily change the reviewer prompt to "Approve only if total score >= 95."
- JSON parse error in review → The reviewer sometimes returns text instead of JSON. The code falls back to {"score": 80, "approved": True}, so the run continues as if the article were approved. Try running again or add "Respond with valid JSON ONLY" to the reviewer prompt.
- AuthenticationError → Check ANTHROPIC_API_KEY. This test makes 5+ API calls.
- Very slow (>30s) → Each agent call takes 2–5 seconds. 4–6 calls = 10–30 seconds total. This is expected for sequential multi-agent systems.
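One way to reduce JSON parse failures is to extract the JSON object from whatever surrounds it (code fences, a sentence of preamble) before parsing. The sketch below is a stricter alternative to the exercise's fallback, not a drop-in from the course code: `parse_review` is a hypothetical helper, and it treats unparseable output as a rejection, as the JavaScript version of the pipeline does, rather than as an approval.

```python
import json
import re


def parse_review(review_text: str) -> dict:
    """Extract the reviewer's JSON verdict; treat anything unparseable as a rejection."""
    # Take the span from the first "{" to the last "}" so fences and preamble are ignored.
    match = re.search(r"\{.*\}", review_text, re.DOTALL)
    if match:
        try:
            data = json.loads(match.group(0))
            if isinstance(data, dict):
                return data
        except json.JSONDecodeError:
            pass
    # Nothing parseable: reject, and keep the raw text as feedback for the next revision.
    return {"score": 0, "feedback": review_text, "approved": False}


fenced = '```json\n{"score": 82, "feedback": "Solid.", "approved": true}\n```'
print(parse_review(fenced)["score"])                    # 82
print(parse_review("Looks good to me!")["approved"])    # False
```

Rejecting on parse failure costs an extra revision round in the worst case, but it avoids publishing an article that was never actually scored.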
Verify Everything Works
Run the script; it executes both tests. Test 1 should show the full 4-stage pipeline with its message log, and Test 2 should show a single agent working in isolation.
You've built a multi-agent content pipeline with specialized agents, structured handoff messages, supervisor orchestration with retry logic, and a full audit trail. This is the hub-and-spoke architecture pattern the certification exam recommends for most multi-agent use cases.
- Parallel research: Run two Researcher agents concurrently (academic + industry) and merge findings before passing to the Writer
- Conflict resolution: If Editor approves but Reviewer rejects, add a supervisor arbitration call to resolve the disagreement
- Cost tracking: Add token counting per agent to identify which agents are most expensive
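The parallel-research idea above can be sketched with a thread pool, since each researcher call spends most of its time waiting on the network. This is a minimal illustration under stated assumptions: `fake_research` is a hypothetical stand-in for an API-backed researcher call, and `parallel_research` is not part of the exercise file.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_research(angle: str, topic: str) -> str:
    """Hypothetical stand-in for a real researcher call; returns a canned brief."""
    return f"## {angle.title()} findings on {topic}\n- placeholder finding"


def parallel_research(topic: str, angles: list[str], research_fn=fake_research) -> str:
    """Run one researcher per angle concurrently and merge the briefs in input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(angles))) as pool:
        # pool.map yields results in input order, so the merged brief is deterministic.
        briefs = list(pool.map(lambda angle: research_fn(angle, topic), angles))
    return "\n\n".join(briefs)


merged = parallel_research("remote work", ["academic", "industry"])
print(merged)
```

In the exercise, `research_fn` could wrap `run_agent("researcher", ...)` with an angle-specific instruction, and the merged brief would then be handed to the Writer in place of the single research result.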
A full pipeline run uses 4–6 Claude calls. At Sonnet pricing, expect ~$0.08–0.20 per run. Retries add ~$0.05–0.10 each. Set max_review_attempts to 2–3 to cap costs. The message log tracks every handoff for cost auditing.
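Per-agent cost tracking can be sketched as a small accumulator. The prices below are illustrative assumptions, not quoted rates, and `CostTracker` is a hypothetical class that the exercise code does not define. Check current pricing before relying on the totals.

```python
from collections import defaultdict

# ASSUMPTION: illustrative prices in USD per million tokens.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


class CostTracker:
    """Accumulates token usage and estimated cost per agent."""

    def __init__(self):
        self.usage = defaultdict(lambda: {"input": 0, "output": 0, "calls": 0})

    def record(self, agent_name: str, input_tokens: int, output_tokens: int) -> None:
        entry = self.usage[agent_name]
        entry["input"] += input_tokens
        entry["output"] += output_tokens
        entry["calls"] += 1

    def cost(self, agent_name: str) -> float:
        entry = self.usage[agent_name]
        return (entry["input"] * INPUT_PRICE_PER_MTOK
                + entry["output"] * OUTPUT_PRICE_PER_MTOK) / 1_000_000

    def total(self) -> float:
        return sum(self.cost(name) for name in list(self.usage))


tracker = CostTracker()
tracker.record("researcher", input_tokens=200, output_tokens=600)
tracker.record("editor", input_tokens=900, output_tokens=400)
tracker.record("editor", input_tokens=1000, output_tokens=450)  # a revision pass

for name in tracker.usage:
    print(f"{name}: {tracker.usage[name]['calls']} call(s), ${tracker.cost(name):.4f}")
print(f"total: ${tracker.total():.4f}")
```

In the exercise's `run_agent`, the token counts would come from the response's `usage.input_tokens` and `usage.output_tokens` fields, recorded once per call.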
Knowledge Check
1. When should you use a multi-agent system instead of a single planning agent?
2. Which architecture pattern does the exam recommend for most multi-agent use cases?
3. What information must a handoff message contain to prevent context loss? (Most complete answer)
4. Why is message passing generally preferred over shared state for production multi-agent systems?
5. In M12, you built a single ReAct agent. How does the multi-agent content pipeline differ architecturally?
Module Summary
Key Concepts
- Multi-agent systems: Use when sub-tasks require fundamentally different expertise. Specialized agents with focused prompts outperform generalist agents.
- Architecture patterns: Supervisor/worker (hub-and-spoke) for most use cases, pipeline for sequential refinement, peer-to-peer for rare consensus scenarios.
- Handoff messages: Structured messages with sender, receiver, task_id, type, payload, original_goal, and instructions. Context loss in handoffs is the #1 multi-agent failure mode.
- Coordination: Message passing over shared state for production systems — eliminates race conditions and makes data flow auditable.
- Conflict resolution: Design for disagreement upfront: supervisor arbitration, voting, confidence scoring, or escalation to humans.
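As a concrete illustration of the conflict-resolution point, the sketch below combines confidence scoring with escalation. Everything in it is an assumption made for illustration: the verdict shape (`agent`, `approved`, `confidence`), the `min_margin` threshold, and the `resolve_conflict` name do not come from the pipeline built in this module.

```python
def resolve_conflict(verdicts: list[dict], min_margin: float = 0.2) -> str:
    """Decide by summed confidence; escalate when the two sides are too close.

    ASSUMPTION: each verdict looks like
    {"agent": str, "approved": bool, "confidence": float between 0 and 1}.
    """
    approve = sum(v["confidence"] for v in verdicts if v["approved"])
    reject = sum(v["confidence"] for v in verdicts if not v["approved"])
    if abs(approve - reject) < min_margin:
        return "escalate_to_human"
    return "approve" if approve > reject else "reject"


clear_case = [
    {"agent": "editor", "approved": True, "confidence": 0.9},
    {"agent": "reviewer", "approved": False, "confidence": 0.4},
]
close_case = [
    {"agent": "editor", "approved": True, "confidence": 0.6},
    {"agent": "reviewer", "approved": False, "confidence": 0.55},
]

print(resolve_conflict(clear_case))  # approve
print(resolve_conflict(close_case))  # escalate_to_human
```

The margin makes the escalation rule explicit: a narrow disagreement goes to a human instead of being settled by a small difference in self-reported confidence.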
What We Built
A complete content creation pipeline with four specialized agents (Researcher, Writer, Editor, Reviewer) orchestrated by a Supervisor. The pipeline uses structured handoff messages, supports reviewer-driven retries, logs all inter-agent communication, and handles errors gracefully at every stage.
Next Module Preview
In M15: Code Interpreter & Sandbox, you'll give agents the ability to write and execute code in a sandboxed environment. This unlocks a powerful new category of tasks: data analysis, chart generation, and programmatic problem-solving — all within a secure execution boundary that prevents untrusted code from affecting your system.