Deploy Your Agent — Local, GCP & AWS
Hands-on lab: take a working UCC filing research agent and deploy it to three production environments — Local Docker, Google Cloud Run, and AWS Lambda. You will end with a live URL you can call from anywhere.
Learning Objectives
- Wrap an AI agent as a production REST API using FastAPI with health checks and error handling
- Containerize an agent application with Docker using security best practices (non-root user, no baked-in secrets)
- Deploy a containerized agent to Google Cloud Run with proper resource limits and secret management
- Deploy the same agent to AWS Lambda using SAM, including the handler adapter pattern
- Compare Local Docker, Cloud Run, and Lambda across cold start, cost, scaling, and timeout dimensions
What You'll Build
Lab Overview
Final artifact: A UCC filing research agent deployed to Local Docker, GCP Cloud Run, and AWS Lambda — all testable with the same curl command.
Time estimate: 90–120 minutes
Prerequisites:
- M15B (Build Complete Agent & Subagent System) — the agent code we are deploying
- M21 (API Design) and M22 (Cost Optimization) — production design concepts
- Docker Desktop installed and running (install Docker)
- Optional: GCP account with the gcloud CLI for Cloud Run steps
- Optional: AWS account with the aws CLI and SAM CLI for Lambda steps
Files you will create:
Steps 8–11 (GCP and AWS) require cloud accounts. If you do not have them, each step includes a Mock Mode alternative so you can simulate the deployment locally and still learn the concepts. The Docker steps (1–7) work entirely on your machine.
Step 1: Project Setup
What: Create a clean project directory with all the dependency files you will need. Why: A proper project structure keeps your agent code, server code, and infrastructure files separated. This makes it easy to test locally, build a Docker image, and deploy to any cloud provider without restructuring.
Environment Setup
Open your terminal and run this block to create the project and install dependencies:
mkdir ucc-agent-deploy && cd ucc-agent-deploy
python -m venv venv
# Linux/macOS:
source venv/bin/activate
# Windows:
# venv\Scripts\activate
# Quote each requirement so the shell does not treat ">" as an output redirect
pip install "anthropic>=0.30.0" "fastapi>=0.115.0" "uvicorn>=0.34.0" "mangum>=0.19.0" "pydantic>=2.0.0"
pip freeze > requirements.txt
# Set your API key (never hardcode this in files!)
export ANTHROPIC_API_KEY=your-key-here
# Windows: set ANTHROPIC_API_KEY=your-key-here
# Node.js alternative to the Python setup above:
mkdir ucc-agent-deploy && cd ucc-agent-deploy
npm init -y
npm install @anthropic-ai/sdk express cors dotenv
# Set your API key (never hardcode this in files!)
export ANTHROPIC_API_KEY=your-key-here
# Windows: set ANTHROPIC_API_KEY=your-key-here
Now create the .env.example file (this gets checked into git as a template):
# Copy this file to .env and fill in your values
# NEVER commit .env to git!
ANTHROPIC_API_KEY=your-anthropic-api-key-here
PORT=8000
And the .dockerignore file to keep your Docker image clean:
venv/
node_modules/
__pycache__/
*.pyc
.env
.git/
*.md
You have a project directory with a virtual environment, installed dependencies, and configuration files. If pip list | grep anthropic shows anthropic 0.30.x or higher, you are ready for Step 2.
pip install fails with "externally managed environment" → You forgot to activate the venv. Run source venv/bin/activate (or venv\Scripts\activate on Windows) first.
python not found → Try python3 instead of python. On some systems only python3 is available.
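The rule from the setup block (keys come from the environment, never from source files) can also be enforced when the process starts. The sketch below is one way to do that. Note that load_settings is a hypothetical helper, not one of the lab files, and it assumes only the two variables listed in .env.example.

```python
import os
from typing import Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Hypothetical helper: read the lab's two settings from the environment."""
    api_key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not api_key:
        # Fail fast with a clear message instead of failing on the first API call
        raise RuntimeError(
            "ANTHROPIC_API_KEY is not set; export it or copy .env.example to .env"
        )
    port = int(env.get("PORT", "8000"))  # same default as .env.example
    return {"api_key": api_key, "port": port}
```

Passing a plain dict instead of os.environ, for example load_settings({"ANTHROPIC_API_KEY": "test-key"}), makes the helper easy to exercise without touching your real environment.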
Step 2: The Agent
What: Create the UCC filing research agent with mock data and two tools: search_filings and get_risk_score. Why: We need a real, working agent to deploy. Using mock data means the lab works without any external database, while the agent logic is identical to what you would use in production with real data sources.
First, create mock_data.py with realistic UCC filing records. This file simulates what you would normally fetch from a database or API:
# mock_data.py — Simulated UCC filing records
# In production, these would come from a database or API
FILINGS = [
{
"filing_id": "UCC-2024-NY-001847",
"debtor_name": "Acme Corporation",
"debtor_state": "NY",
"secured_party": "First National Bank",
"collateral_description": "All inventory, equipment, and accounts receivable",
"filing_date": "2024-03-15",
"status": "active",
"collateral_value": 2500000,
},
{
"filing_id": "UCC-2024-NY-002103",
"debtor_name": "Acme Corporation",
"debtor_state": "NY",
"secured_party": "TechLease Partners LLC",
"collateral_description": "Specific equipment: CNC machines, serial #TL-8842, #TL-8843",
"filing_date": "2024-06-01",
"status": "active",
"collateral_value": 450000,
},
{
"filing_id": "UCC-2023-CA-009821",
"debtor_name": "Pacific Freight Inc",
"debtor_state": "CA",
"secured_party": "West Coast Capital",
"collateral_description": "All assets including rolling stock and warehouse inventory",
"filing_date": "2023-11-20",
"status": "active",
"collateral_value": 8700000,
},
{
"filing_id": "UCC-2024-TX-004210",
"debtor_name": "Lone Star Drilling Co",
"debtor_state": "TX",
"secured_party": "Energy Finance Corp",
"collateral_description": "Drilling equipment, mineral rights assignments, accounts",
"filing_date": "2024-01-10",
"status": "active",
"collateral_value": 15000000,
},
{
"filing_id": "UCC-2022-NY-000412",
"debtor_name": "Acme Corporation",
"debtor_state": "NY",
"secured_party": "Metro Business Lending",
"collateral_description": "Accounts receivable",
"filing_date": "2022-08-05",
"status": "terminated",
"collateral_value": 750000,
},
]
def search_filings(debtor_name: str) -> list[dict]:
    """Search filings by debtor name (case-insensitive partial match)."""
    name_lower = debtor_name.lower()
    return [f for f in FILINGS if name_lower in f["debtor_name"].lower()]


def calculate_risk_score(debtor_name: str) -> dict:
    """Calculate a risk score based on filing history."""
    matches = search_filings(debtor_name)
    active = [f for f in matches if f["status"] == "active"]
    total_exposure = sum(f["collateral_value"] for f in active)
    if not matches:
        return {
            "debtor_name": debtor_name,
            "risk_level": "unknown",
            "risk_score": 0,
            "reason": "No filings found for this entity",
            "total_liens": 0,
            "total_exposure": 0,
        }
    if total_exposure > 10000000:
        risk_level, score = "high", 85
    elif total_exposure > 2000000:
        risk_level, score = "medium", 55
    elif len(active) > 1:
        risk_level, score = "medium", 45
    else:
        risk_level, score = "low", 20
    return {
        "debtor_name": debtor_name,
        "risk_level": risk_level,
        "risk_score": score,
        "reason": f"{len(active)} active lien(s) totaling ${total_exposure:,.0f}",
        "total_liens": len(active),
        "total_exposure": total_exposure,
    }
// mock_data.js — Simulated UCC filing records
// In production, these would come from a database or API
const FILINGS = [
{
filing_id: "UCC-2024-NY-001847",
debtor_name: "Acme Corporation",
debtor_state: "NY",
secured_party: "First National Bank",
collateral_description: "All inventory, equipment, and accounts receivable",
filing_date: "2024-03-15",
status: "active",
collateral_value: 2500000,
},
{
filing_id: "UCC-2024-NY-002103",
debtor_name: "Acme Corporation",
debtor_state: "NY",
secured_party: "TechLease Partners LLC",
collateral_description: "Specific equipment: CNC machines, serial #TL-8842, #TL-8843",
filing_date: "2024-06-01",
status: "active",
collateral_value: 450000,
},
{
filing_id: "UCC-2023-CA-009821",
debtor_name: "Pacific Freight Inc",
debtor_state: "CA",
secured_party: "West Coast Capital",
collateral_description: "All assets including rolling stock and warehouse inventory",
filing_date: "2023-11-20",
status: "active",
collateral_value: 8700000,
},
{
filing_id: "UCC-2024-TX-004210",
debtor_name: "Lone Star Drilling Co",
debtor_state: "TX",
secured_party: "Energy Finance Corp",
collateral_description: "Drilling equipment, mineral rights assignments, accounts",
filing_date: "2024-01-10",
status: "active",
collateral_value: 15000000,
},
{
filing_id: "UCC-2022-NY-000412",
debtor_name: "Acme Corporation",
debtor_state: "NY",
secured_party: "Metro Business Lending",
collateral_description: "Accounts receivable",
filing_date: "2022-08-05",
status: "terminated",
collateral_value: 750000,
},
];
function searchFilings(debtorName) {
const nameLower = debtorName.toLowerCase();
return FILINGS.filter(f => f.debtor_name.toLowerCase().includes(nameLower));
}
function calculateRiskScore(debtorName) {
const matches = searchFilings(debtorName);
const active = matches.filter(f => f.status === "active");
const totalExposure = active.reduce((sum, f) => sum + f.collateral_value, 0);
if (matches.length === 0) {
return {
debtor_name: debtorName, risk_level: "unknown", risk_score: 0,
reason: "No filings found for this entity",
total_liens: 0, total_exposure: 0,
};
}
let riskLevel, score;
if (totalExposure > 10000000) { riskLevel = "high"; score = 85; }
else if (totalExposure > 2000000) { riskLevel = "medium"; score = 55; }
else if (active.length > 1) { riskLevel = "medium"; score = 45; }
else { riskLevel = "low"; score = 20; }
return {
debtor_name: debtorName, risk_level: riskLevel, risk_score: score,
reason: `${active.length} active lien(s) totaling $${totalExposure.toLocaleString()}`,
total_liens: active.length, total_exposure: totalExposure,
};
}
module.exports = { searchFilings, calculateRiskScore, FILINGS };
Now let's build the agent itself. If you completed M12 (ReAct Pattern) and M15B (Build Complete Agent System), this will look familiar. The core pattern is a tool use loop: a repeating cycle in which Claude requests a tool call, your code executes it, and you send the result back. Concretely, you send a message to Claude, check whether it wants to call a tool, execute the tool, send the result back, and repeat until Claude signals that it is done (stop_reason is "end_turn" instead of "tool_use"). Here we simplify it into a single-file agent with two tools:
# agent.py — UCC Filing Research Agent
# Uses the Anthropic Messages API with tool use (M12 pattern)
import json
import os
import anthropic
from mock_data import search_filings, calculate_risk_score
# --- Tool definitions tell Claude what functions are available ---
# Each tool has a name, description (crucial for selection accuracy),
# and an input_schema that Claude uses to generate valid arguments.
TOOLS = [
{
"name": "search_filings",
"description": (
"Search UCC filings by debtor name. Returns a list of filing "
"records including filing ID, secured party, collateral "
"description, filing date, status, and collateral value. "
"Use this to find all liens against a specific company."
),
"input_schema": {
"type": "object",
"properties": {
"debtor_name": {
"type": "string",
"description": "The company or individual name to search for (partial match supported)",
}
},
"required": ["debtor_name"],
},
},
{
"name": "get_risk_score",
"description": (
"Calculate a lien risk score for a debtor based on their UCC "
"filing history. Returns risk level (low/medium/high), numeric "
"score (0-100), total active liens, and total collateral exposure. "
"Use this after searching filings to assess overall risk."
),
"input_schema": {
"type": "object",
"properties": {
"debtor_name": {
"type": "string",
"description": "The debtor name to assess risk for",
}
},
"required": ["debtor_name"],
},
},
]
# --- Map tool names to Python functions ---
TOOL_HANDLERS = {
"search_filings": lambda args: search_filings(args["debtor_name"]),
"get_risk_score": lambda args: calculate_risk_score(args["debtor_name"]),
}
def run_agent(question: str, max_turns: int = 10) -> str:
    """
    Run the UCC research agent with a tool use loop.

    The loop checks stop_reason after each Claude response:
    - "tool_use" means Claude wants to call a tool: execute it and continue
    - "end_turn" means Claude is done: return the final text

    max_turns is a safety net to prevent infinite loops, not a control mechanism.
    """
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from environment
    messages = [{"role": "user", "content": question}]
    system_prompt = (
        "You are a UCC filing research assistant. You help users investigate "
        "Uniform Commercial Code filings, assess lien risk, and understand "
        "collateral exposure for business entities. Always search for filings "
        "first, then calculate risk scores to give comprehensive answers. "
        "Present findings clearly with specific numbers and filing IDs."
    )

    for turn in range(max_turns):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                system=system_prompt,
                tools=TOOLS,
                messages=messages,
            )
        except anthropic.AuthenticationError:
            # Must come before APIError: AuthenticationError is a subclass,
            # so the broader handler would otherwise catch it first.
            return "Authentication failed. Check your ANTHROPIC_API_KEY."
        except anthropic.APIError as e:
            return f"API error: {e.message}"

        # Check if Claude wants to use tools or is done
        if response.stop_reason == "end_turn":
            # Extract the text from the response
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
            return "Agent completed but produced no text output."

        if response.stop_reason == "tool_use":
            # Append Claude's response (which includes tool_use blocks)
            messages.append({"role": "assistant", "content": response.content})
            # Execute each tool call and collect results
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    handler = TOOL_HANDLERS.get(block.name)
                    if handler:
                        try:
                            result = handler(block.input)
                            tool_results.append({
                                "type": "tool_result",
                                "tool_use_id": block.id,
                                "content": json.dumps(result, default=str),
                            })
                        except Exception as exc:
                            tool_results.append({
                                "type": "tool_result",
                                "tool_use_id": block.id,
                                "content": json.dumps({
                                    "error": str(exc),
                                    "is_retryable": False,
                                }),
                                "is_error": True,
                            })
                    else:
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": json.dumps({
                                "error": f"Unknown tool: {block.name}",
                                "is_retryable": False,
                            }),
                            "is_error": True,
                        })
            messages.append({"role": "user", "content": tool_results})
            continue

        # Any other stop_reason (for example "max_tokens") would resend the
        # same messages on the next turn, so stop instead of looping.
        return f"Agent stopped early (stop_reason: {response.stop_reason})."

    return "Agent reached maximum turns without completing. Try a simpler question."


if __name__ == "__main__":
    # Quick test: run directly to verify the agent works
    answer = run_agent("What is the lien exposure for Acme Corporation?")
    print(answer)
// agent.js — UCC Filing Research Agent
// Uses the Anthropic Messages API with tool use (M12 pattern)
const Anthropic = require("@anthropic-ai/sdk");
const { searchFilings, calculateRiskScore } = require("./mock_data");
const TOOLS = [
{
name: "search_filings",
description:
"Search UCC filings by debtor name. Returns a list of filing " +
"records including filing ID, secured party, collateral " +
"description, filing date, status, and collateral value. " +
"Use this to find all liens against a specific company.",
input_schema: {
type: "object",
properties: {
debtor_name: {
type: "string",
description: "The company or individual name to search for (partial match supported)",
},
},
required: ["debtor_name"],
},
},
{
name: "get_risk_score",
description:
"Calculate a lien risk score for a debtor based on their UCC " +
"filing history. Returns risk level (low/medium/high), numeric " +
"score (0-100), total active liens, and total collateral exposure. " +
"Use this after searching filings to assess overall risk.",
input_schema: {
type: "object",
properties: {
debtor_name: {
type: "string",
description: "The debtor name to assess risk for",
},
},
required: ["debtor_name"],
},
},
];
const TOOL_HANDLERS = {
search_filings: (args) => searchFilings(args.debtor_name),
get_risk_score: (args) => calculateRiskScore(args.debtor_name),
};
async function runAgent(question, maxTurns = 10) {
const client = new Anthropic(); // reads ANTHROPIC_API_KEY from env
const systemPrompt =
"You are a UCC filing research assistant. You help users investigate " +
"Uniform Commercial Code filings, assess lien risk, and understand " +
"collateral exposure for business entities. Always search for filings " +
"first, then calculate risk scores to give comprehensive answers. " +
"Present findings clearly with specific numbers and filing IDs.";
const messages = [{ role: "user", content: question }];
for (let turn = 0; turn < maxTurns; turn++) {
let response;
try {
response = await client.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 1024,
system: systemPrompt,
tools: TOOLS,
messages,
});
} catch (err) {
if (err instanceof Anthropic.AuthenticationError) {
return "Authentication failed. Check your ANTHROPIC_API_KEY.";
}
return `API error: ${err.message}`;
}
if (response.stop_reason === "end_turn") {
const textBlock = response.content.find((b) => b.type === "text");
return textBlock ? textBlock.text : "Agent completed but produced no text.";
}
if (response.stop_reason === "tool_use") {
messages.push({ role: "assistant", content: response.content });
const toolResults = [];
for (const block of response.content) {
if (block.type === "tool_use") {
const handler = TOOL_HANDLERS[block.name];
if (handler) {
try {
const result = handler(block.input);
toolResults.push({
type: "tool_result",
tool_use_id: block.id,
content: JSON.stringify(result),
});
} catch (exc) {
toolResults.push({
type: "tool_result",
tool_use_id: block.id,
content: JSON.stringify({ error: exc.message, is_retryable: false }),
is_error: true,
});
}
} else {
toolResults.push({
type: "tool_result",
tool_use_id: block.id,
content: JSON.stringify({ error: `Unknown tool: ${block.name}`, is_retryable: false }),
is_error: true,
});
}
}
}
messages.push({ role: "user", content: toolResults });
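} else {
// Any other stop_reason (for example "max_tokens") would resend the same messages, so stop here
return `Agent stopped early (stop_reason: ${response.stop_reason}).`;
}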
}
return "Agent reached maximum turns without completing.";
}
module.exports = { runAgent };
// Quick test when run directly
if (require.main === module) {
runAgent("What is the lien exposure for Acme Corporation?").then(console.log);
}
You created two files: mock_data.py provides five realistic UCC filing records with search and risk functions, and agent.py wraps those functions as Claude tools with a standard tool use loop. The agent reads ANTHROPIC_API_KEY from the environment — never from a hardcoded string. When a user asks about a debtor, Claude calls search_filings to get the data, then get_risk_score to assess risk, then composes a human-readable answer.
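To see the control flow of the tool use loop without spending API calls, the sketch below replays two scripted replies in place of client.messages.create(). Everything in it is a stand-in: SCRIPT, fake_search, and run_scripted_loop are hypothetical names, and the reply objects only imitate the attributes the loop reads (stop_reason, content, type, id, name, input, text).

```python
import json
from types import SimpleNamespace


def fake_search(debtor_name: str) -> list:
    """Stand-in for search_filings(); returns one fixed record."""
    return [{"filing_id": "UCC-2024-NY-001847", "debtor_name": "Acme Corporation"}]


HANDLERS = {"search_filings": lambda args: fake_search(args["debtor_name"])}

# Scripted replies standing in for real API responses (assumption: two turns)
SCRIPT = [
    SimpleNamespace(
        stop_reason="tool_use",
        content=[SimpleNamespace(type="tool_use", id="call_1",
                                 name="search_filings",
                                 input={"debtor_name": "Acme"})],
    ),
    SimpleNamespace(
        stop_reason="end_turn",
        content=[SimpleNamespace(type="text", text="Found 1 filing for Acme.")],
    ),
]


def run_scripted_loop(question: str, script, max_turns: int = 10):
    """Same branching as the agent loop, driven by the scripted replies."""
    messages = [{"role": "user", "content": question}]
    replies = iter(script)
    for _ in range(max_turns):
        response = next(replies)
        if response.stop_reason == "end_turn":
            text = next(b.text for b in response.content if b.type == "text")
            return text, messages
        if response.stop_reason == "tool_use":
            # Record the assistant turn, then answer every tool_use block
            messages.append({"role": "assistant", "content": response.content})
            results = [
                {"type": "tool_result", "tool_use_id": b.id,
                 "content": json.dumps(HANDLERS[b.name](b.input))}
                for b in response.content if b.type == "tool_use"
            ]
            messages.append({"role": "user", "content": results})
    return "max turns reached", messages


answer, history = run_scripted_loop("Any liens on Acme?", SCRIPT)
print(answer)                        # Found 1 filing for Acme.
print([m["role"] for m in history])  # ['user', 'assistant', 'user']
```

The history shows the shape the real loop builds: each tool call adds one assistant message and one user message carrying the tool_result, and the final text reply ends the loop without adding anything.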
If you see a response mentioning Acme Corporation's filings and risk score, the agent is working. The exact text will differ each run since Claude generates it, but the numbers should match the mock data. If you see "Authentication failed", double-check your ANTHROPIC_API_KEY environment variable.
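To know which numbers to expect, you can work the risk thresholds by hand. The sketch below copies the active collateral values from the mock data and applies the same threshold rules as calculate_risk_score; ACTIVE_EXPOSURE and classify are illustrative names, not part of the lab files.

```python
# Active filings only, taken from the mock data
# (Acme's terminated 750,000 filing is excluded)
ACTIVE_EXPOSURE = {
    "Acme Corporation": [2_500_000, 450_000],
    "Pacific Freight Inc": [8_700_000],
    "Lone Star Drilling Co": [15_000_000],
}


def classify(active_values: list) -> tuple:
    """Apply the same thresholds as calculate_risk_score to active liens."""
    total = sum(active_values)
    if total > 10_000_000:
        return "high", 85, total
    if total > 2_000_000:
        return "medium", 55, total
    if len(active_values) > 1:
        return "medium", 45, total
    return "low", 20, total


for name, values in ACTIVE_EXPOSURE.items():
    print(name, classify(values))
# Acme Corporation ('medium', 55, 2950000)
# Pacific Freight Inc ('medium', 55, 8700000)
# Lone Star Drilling Co ('high', 85, 15000000)
```

For the Acme test question, a correct answer should therefore report two active liens, $2,950,000 of total exposure, and a medium risk score of 55.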
ModuleNotFoundError: No module named 'anthropic' → Make sure your venv is activated. Run source venv/bin/activate then try again.
AuthenticationError → Your API key is missing or invalid. Run echo $ANTHROPIC_API_KEY to verify it is set.
Connection error → Check your internet connection. The agent needs to reach api.anthropic.com.
Step 3: Wrap the Agent as a FastAPI Server
What: Create a REST API server that exposes the agent through HTTP endpoints. Why: Docker, Cloud Run, and Lambda all need your agent to respond to HTTP requests. Right now, your agent only works as a Python script you run from the terminal. A REST API lets any client — a web app, a mobile app, a curl command — send questions and get answers over HTTP.
We will use FastAPI as the web framework. FastAPI is a modern Python web framework built for APIs: it uses Python type hints to auto-generate documentation, validate inputs, and support async operations out of the box, and it is faster to develop with than Flask for API-focused projects. It gives us automatic input validation via Pydantic, a library that validates data using type annotations. When you define a model class with typed fields, Pydantic checks that incoming data matches those types and rejects invalid requests, so a malformed request is rejected before it reaches the agent. FastAPI also auto-generates interactive API docs.
BEFORE: Right now, your agent is like a brilliant researcher who only works in person. You have to walk to their office, sit down, and ask your question face-to-face. Only one person at a time. No remote access.
THE PAIN: That means nobody else can use this researcher. Your teammates cannot ask questions. Your web app cannot ask questions. The researcher is locked in one room with one person.
THE FIX: Wrapping the agent in a REST API is like giving that researcher a phone number. Now anyone in the world can call them, ask a question, and get an answer back — without being in the same room. The FastAPI server IS that phone system: it receives calls (HTTP requests), routes them to the researcher (agent), and sends back the answer (HTTP response).
Here is what the "phone call" actually looks like in practice — the raw HTTP request and response your server handles:
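A rough sketch of that exchange for the /query endpoint, assembled from the request and response fields defined in server.py. The two helper functions are hypothetical, the Host value assumes the local default port, and the answer text passed in is a placeholder rather than real agent output.

```python
import json


def raw_query_request(question: str, host: str = "localhost:8000") -> str:
    """Build the HTTP/1.1 text a client sends to POST /query."""
    body = json.dumps({"question": question})
    lines = [
        "POST /query HTTP/1.1",
        f"Host: {host}",
        "Content-Type: application/json",
        f"Content-Length: {len(body.encode('utf-8'))}",
        "",  # blank line separates headers from the body
        body,
    ]
    return "\r\n".join(lines)


def raw_query_response(answer: str, elapsed_seconds: float) -> str:
    """Build the matching success response, using the QueryResponse field names."""
    body = json.dumps(
        {"answer": answer, "elapsed_seconds": elapsed_seconds, "status": "success"}
    )
    lines = [
        "HTTP/1.1 200 OK",
        "Content-Type: application/json",
        f"Content-Length: {len(body.encode('utf-8'))}",
        "",
        body,
    ]
    return "\r\n".join(lines)


print(raw_query_request("Find filings for Acme Corporation"))
print()
print(raw_query_response("(agent answer text goes here)", 3.2))
```

A real server adds further headers (date, server name), but the request line, the JSON body, and the status line are the parts your code depends on.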
Create server.py. Let's walk through it in three logical chunks so you understand the reasoning behind each part:
Chunk 1 — Imports and models. The first thing any API needs is a contract: what shape does the request come in, and what shape does the response go out? That is what the Pydantic models define. When a client sends {"question": 123} instead of a string, Pydantic catches the type mismatch and returns a clear 422 error — your agent never even sees the bad request. This matters because an unvalidated question could cause confusing errors deep in the agent loop.
Chunk 2: The /query endpoint. This is the heart of the server. It receives a question, calls run_agent(), times how long it takes, and returns the answer wrapped in JSON. The important design choice is the try/except block: it catches any exception raised by the agent, logs it, counts it, and returns an HTTP 500 with a readable detail message. Without it, FastAPI would still answer with a generic 500 and keep running, but the failure would not appear in your error count and the client would get no explanation.
Chunk 3: The /health endpoint. Every production service needs a health check, and skipping it is a common beginner mistake. Load balancers (systems that distribute incoming traffic across multiple servers and use health check endpoints to decide which servers should receive it) and container orchestrators (Docker, Kubernetes, Cloud Run) call this endpoint every 10–30 seconds to verify your service is alive. If it stops responding, traffic gets routed elsewhere and your container gets restarted. Our health check also verifies the API key is configured, because a service without a key is "alive" but useless.
# server.py — FastAPI wrapper for the UCC research agent
# Exposes the agent as a REST API with health check
import os
import time
import logging
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field
from agent import run_agent
# --- Logging setup ---
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ucc-agent-api")
# --- Request/Response models ---
# Pydantic validates every incoming request automatically.
# If a client sends {"question": 123} instead of a string, they get
# a clear 422 error before the agent ever runs.
class QueryRequest(BaseModel):
    question: str = Field(
        ...,
        min_length=1,
        max_length=2000,
        description="The question to ask the UCC research agent",
        json_schema_extra={"examples": ["Find filings for Acme Corporation"]},
    )


class QueryResponse(BaseModel):
    answer: str
    elapsed_seconds: float
    status: str = "success"


class HealthResponse(BaseModel):
    status: str
    service: str
    version: str


class ErrorResponse(BaseModel):
    detail: str
    status: str = "error"
# --- Application setup ---
app = FastAPI(
title="UCC Filing Research Agent API",
description="AI-powered UCC filing research and lien risk assessment",
version="1.0.0",
)
# Allow browser-based clients to call this API
app.add_middleware(
CORSMiddleware,
allow_origins=["*"], # Restrict in production!
allow_credentials=False, # Credentials cannot be combined with a wildcard origin
allow_methods=["*"],
allow_headers=["*"],
)
# Track request count for basic metrics
request_count = 0
error_count = 0
start_time = time.time()
@app.get("/health", response_model=HealthResponse, tags=["System"])
async def health_check():
    """Health check endpoint for load balancers and container orchestration."""
    # Check that the API key is configured (do not reveal the key!)
    api_key = os.environ.get("ANTHROPIC_API_KEY", "")
    if not api_key:
        raise HTTPException(
            status_code=503,
            detail="ANTHROPIC_API_KEY not configured",
        )
    return HealthResponse(
        status="healthy",
        service="ucc-agent",
        version="1.0.0",
    )


@app.post(
    "/query",
    response_model=QueryResponse,
    responses={500: {"model": ErrorResponse}},
    tags=["Agent"],
)
def query_agent(request: QueryRequest):
    """Send a question to the UCC research agent and get an answer."""
    # Plain def (not async def): run_agent() blocks on network I/O, so FastAPI
    # runs this handler in its thread pool instead of on the event loop.
    global request_count, error_count
    request_count += 1
    logger.info(f"Query received: {request.question[:80]}...")
    t0 = time.time()
    try:
        answer = run_agent(request.question)
        elapsed = round(time.time() - t0, 2)
        logger.info(f"Query completed in {elapsed}s")
        return QueryResponse(
            answer=answer,
            elapsed_seconds=elapsed,
        )
    except Exception as exc:
        error_count += 1
        logger.error(f"Agent error: {exc}")
        raise HTTPException(
            status_code=500,
            detail=f"Agent failed: {str(exc)}",
        )


@app.get("/metrics", tags=["System"])
async def metrics():
    """Basic metrics endpoint for monitoring."""
    uptime = round(time.time() - start_time, 1)
    return {
        "total_requests": request_count,
        "total_errors": error_count,
        "uptime_seconds": uptime,
        "error_rate": round(error_count / max(request_count, 1), 4),
    }


if __name__ == "__main__":
    import uvicorn
    port = int(os.environ.get("PORT", 8000))
    uvicorn.run(app, host="0.0.0.0", port=port)
// server.js — Express wrapper for the UCC research agent
// Exposes the agent as a REST API with health check
const express = require("express");
const cors = require("cors");
const { runAgent } = require("./agent");
const app = express();
app.use(cors());
app.use(express.json());
let requestCount = 0;
let errorCount = 0;
const startTime = Date.now();
// --- Health check ---
app.get("/health", (req, res) => {
const apiKey = process.env.ANTHROPIC_API_KEY || "";
if (!apiKey) {
return res.status(503).json({
status: "unhealthy",
service: "ucc-agent",
detail: "ANTHROPIC_API_KEY not configured",
});
}
res.json({ status: "healthy", service: "ucc-agent", version: "1.0.0" });
});
// --- Main query endpoint ---
app.post("/query", async (req, res) => {
requestCount++;
const { question } = req.body;
if (!question || typeof question !== "string" || question.length === 0) {
return res.status(422).json({
detail: "Missing or invalid 'question' field (must be a non-empty string)",
status: "error",
});
}
if (question.length > 2000) {
return res.status(422).json({
detail: "Question must be 2000 characters or fewer",
status: "error",
});
}
console.log(`Query received: ${question.substring(0, 80)}...`);
const t0 = Date.now();
try {
const answer = await runAgent(question);
const elapsed = ((Date.now() - t0) / 1000).toFixed(2);
console.log(`Query completed in ${elapsed}s`);
res.json({ answer, elapsed_seconds: parseFloat(elapsed), status: "success" });
} catch (err) {
errorCount++;
console.error(`Agent error: ${err.message}`);
res.status(500).json({ detail: `Agent failed: ${err.message}`, status: "error" });
}
});
// --- Metrics ---
app.get("/metrics", (req, res) => {
const uptime = ((Date.now() - startTime) / 1000).toFixed(1);
res.json({
total_requests: requestCount,
total_errors: errorCount,
uptime_seconds: parseFloat(uptime),
error_rate: parseFloat((errorCount / Math.max(requestCount, 1)).toFixed(4)),
});
});
const port = process.env.PORT || 8000;
app.listen(port, "0.0.0.0", () => {
console.log(`UCC Agent API running on http://0.0.0.0:${port}`);
console.log(`Health: http://localhost:${port}/health`);
});
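One detail matters for any server that wraps a blocking call like run_agent(): where that call runs. The standard-library sketch below illustrates the difference between blocking the event loop and handing the work to a thread. It does not use FastAPI. slow_agent_call is a stand-in for run_agent(), and asyncio.to_thread (Python 3.9+) plays the role that a framework's thread pool plays for plain def endpoints.

```python
import asyncio
import contextlib
import time


def slow_agent_call(seconds: float = 0.2) -> str:
    """Stand-in for run_agent(): blocks the calling thread."""
    time.sleep(seconds)
    return "answer"


async def ticks_during(work) -> int:
    """Count how often a background heartbeat runs while `work` is awaited."""
    ticks = 0

    async def heartbeat():
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    task = asyncio.create_task(heartbeat())
    await asyncio.sleep(0)  # let the heartbeat start
    await work()
    observed = ticks
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task
    return observed


async def blocking_handler():
    slow_agent_call()  # runs on the event loop thread


async def threaded_handler():
    await asyncio.to_thread(slow_agent_call)  # runs on a worker thread


blocked = asyncio.run(ticks_during(blocking_handler))
threaded = asyncio.run(ticks_during(threaded_handler))
print(f"heartbeat ticks, blocking call on the event loop: {blocked}")
print(f"heartbeat ticks, call moved to a thread: {threaded}")
```

The first count is zero because nothing else can run while the loop is blocked; the second is well above zero. The heartbeat here stands for every other request your server is trying to handle at the same time.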
You created an HTTP server that wraps the agent. It has three endpoints: POST /query sends a question and returns the answer, GET /health tells infrastructure your service is alive, and GET /metrics provides basic monitoring data. The server validates input (rejects empty or oversized questions), logs every request, and catches errors gracefully. This same server works for local development, Docker, Cloud Run, and Lambda.
"I need different server code for each cloud platform": No. Your server.py is platform-agnostic. The same FastAPI app runs locally with uvicorn, inside Docker, on Cloud Run (Docker + managed infrastructure), and on Lambda (via the Mangum adapter). You never rewrite server logic for different platforms.
"FastAPI is only for async workloads": Not so. FastAPI supports both sync and async endpoints. Our run_agent() function is synchronous (it blocks while waiting for Claude's API response), and that is fine as long as the endpoint that calls it is declared with plain def: FastAPI runs def endpoints in a thread pool, so other requests are not blocked. Calling a blocking function directly inside an async def endpoint would stall the event loop for every request.
"CORS should always be allow_origins=['*']": Only in development! In production, restrict allow_origins to your specific frontend domain(s). A wildcard means any website can call your agent API, which creates security and cost risks.
Step 4: Test Locally (Without Docker)
What: Run the server on your machine and verify all endpoints work. Why: Always test locally before containerizing. If something breaks in Docker, you want to know whether the bug is in your code or your Docker setup. Testing locally first establishes a known-good baseline.
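Start the server with python server.py (or uvicorn server:app --port 8000); it listens on port 8000 unless PORT is set. Then, in a separate terminal, test each endpoint: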
# Test 1: Health check
curl -s http://localhost:8000/health | python -m json.tool
# Test 2: Query the agent
curl -s -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"question": "Find filings for Acme Corporation"}' | python -m json.tool
# Test 3: Metrics
curl -s http://localhost:8000/metrics | python -m json.tool
# Test 4: Invalid request (should get 422)
curl -s -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"question": ""}' | python -m json.tool
All four curl commands should work: health returns "healthy", query returns an answer with filing details, metrics shows your request count, and the empty-question test returns a 422 validation error. Stop the server with Ctrl+C when done.
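If curl is not available (for example on a stock Windows shell), the same four checks can be scripted with the Python standard library. This is a sketch under assumptions: call and smoke_test are hypothetical helpers, the base URL assumes the default port 8000, and the pass criteria mirror the expectations stated above rather than any official test suite.

```python
import json
import urllib.error
import urllib.request


def call(method: str, url: str, payload=None, timeout: float = 60.0):
    """Send one request and return (status_code, parsed_json_body)."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, json.loads(response.read())
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; the JSON body is still readable
        return err.code, json.loads(err.read())


def smoke_test(base_url: str = "http://localhost:8000", send=call) -> list:
    """Run the four checks from this step; returns a list of (name, passed)."""
    results = []
    status, body = send("GET", f"{base_url}/health")
    results.append(("health", status == 200 and body.get("status") == "healthy"))

    status, body = send("POST", f"{base_url}/query",
                        {"question": "Find filings for Acme Corporation"})
    results.append(("query", status == 200 and bool(body.get("answer"))))

    status, body = send("GET", f"{base_url}/metrics")
    results.append(("metrics", status == 200 and body.get("total_requests", 0) >= 1))

    status, _ = send("POST", f"{base_url}/query", {"question": ""})
    results.append(("validation", status == 422))
    return results
```

With the server running, calling smoke_test() and printing each (name, passed) pair gives a quick pass/fail summary. The send parameter exists only so the checks can be exercised against a fake transport.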
"Address already in use" → Another process is using port 8000. Either stop it (lsof -ti:8000 | xargs kill) or use a different port: --port 8001.
"ModuleNotFoundError: No module named 'fastapi'" → Your venv is not activated. Run source venv/bin/activate and try again.
The server now runs on your machine under uvicorn. But your machine is not a production server: it goes to sleep, it is behind a firewall, and it does not scale. The next step is to package everything into a Docker container, which makes your agent portable: the same container runs identically on your laptop, a cloud VM, or a managed service like Cloud Run.
Step 5: Create the Dockerfile
What: Write a Dockerfile that packages the agent, server, and all dependencies into a portable container image. A Dockerfile is a text file containing instructions to build a Docker image; each instruction (FROM, COPY, RUN) creates a layer, and the final image contains everything needed to run your application: OS, Python, dependencies, and your code. A container image is a lightweight, standalone package of your application and everything it needs to run (code, runtime, libraries, system tools). Unlike a virtual machine, containers share the host OS kernel, which makes them much smaller and faster to start.
Why: Containers solve "works on my machine" problems. Your Docker image runs identically everywhere — your laptop, a colleague's machine, GCP, or AWS. It also provides isolation: your agent cannot accidentally access files or processes outside the container. If something goes wrong, the damage stays inside the container — it cannot touch the host system.
BEFORE: Imagine shipping a home-cooked meal to a friend. You would need to send the recipe, the exact brand of every ingredient, the specific oven model, and instructions for their kitchen layout. If anything differs, the result tastes different.
THE PAIN: That is what deploying software without containers feels like. "It works on my machine" fails because your machine has Python 3.12, your colleague has 3.10, the server has different system libraries, and the cloud VM has a different OS entirely.
THE FIX: A Docker container is like shipping a fully-equipped kitchen WITH the meal inside. The container includes the OS, Python, all libraries, and your code. Open it anywhere and it runs identically. Here is what the shipping label (Dockerfile) looks like — each line adds one item to the kitchen:
We will annotate every line. Each Dockerfile instruction creates a layer in the image, and the order matters for build speed. Layers are cached, so when you change your code but not your dependencies, Docker only rebuilds the code layer (fast) instead of reinstalling all packages (slow).
# Dockerfile — Multi-stage build for the UCC agent
# ----- Stage 1: Build dependencies in a full Python image -----
# WHAT: Use the full Python image to install compiled dependencies
# WHY: Some pip packages (like uvloop) need C compilers to build.
# The slim image does not have compilers, so we build here first.
FROM python:3.12 AS builder
WORKDIR /build
# WHAT: Copy only requirements.txt first, then install
# WHY: Docker caches each layer. If requirements.txt hasn't changed,
# Docker reuses the cached pip install (saves 30-60 seconds).
# GOTCHA: If you COPY . . first, ANY code change invalidates the cache.
COPY requirements.txt .
RUN pip install --no-cache-dir --target=/build/deps -r requirements.txt
# ----- Stage 2: Slim runtime image -----
# WHAT: Start fresh with a minimal Python image
# WHY: The builder image is ~900MB (compilers, headers). The slim image
# is ~150MB. We only copy the installed packages, not the compilers.
FROM python:3.12-slim
# WHAT: Create a non-root user
# WHY: Running as root inside a container is a security risk. If an
# attacker exploits a bug, they get root access to the container.
# A non-root user limits the damage.
RUN useradd --create-home --shell /bin/bash agent
WORKDIR /app
# WHAT: Copy installed dependencies from the builder stage
# WHY: We get all the pip packages without the build tools
COPY --from=builder /build/deps /usr/local/lib/python3.12/site-packages/
# WHAT: Copy application source code
# WHY: This layer changes most often (your code changes), so it goes LAST
# to maximize cache hits on the layers above.
COPY mock_data.py agent.py server.py ./
# WHAT: Switch to the non-root user
USER agent
# WHAT: Expose port 8000 and set the default command
# WHY: EXPOSE documents which port the container listens on.
# CMD sets what runs when the container starts.
# GOTCHA: EXPOSE does not publish the port — you still need -p 8000:8000
EXPOSE 8000
ENV PORT=8000
# WHAT: Run uvicorn with production settings
# WHY: --host 0.0.0.0 makes it accessible from outside the container.
# --workers 1 is fine for agent workloads (they are I/O bound,
# waiting on Claude API, not CPU bound).
CMD ["python", "-m", "uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
# Dockerfile — Node.js version for the UCC agent
FROM node:20-slim
# Create non-root user for security
RUN useradd --create-home --shell /bin/bash agent
WORKDIR /app
# Copy package files first for layer caching
COPY package.json package-lock.json* ./
RUN npm ci --omit=dev
# Copy application code (changes most often = last layer)
COPY mock_data.js agent.js server.js ./
# Switch to non-root user
USER agent
EXPOSE 8000
ENV PORT=8000
CMD ["node", "server.js"]
"I need a different Dockerfile for each cloud provider" — No. The same Docker image runs on Cloud Run, AWS ECS, Azure Container Instances, and your laptop. That is the whole point of containers — build once, run anywhere. The only thing that changes is how you pass secrets and configure networking.
"Multi-stage builds are optional optimization" — For production, they are essential. A single-stage build with python:3.12 produces a ~900MB image (includes compilers, headers, build tools). Multi-stage drops it to ~235MB. Smaller images mean faster deployments, lower storage costs, and reduced attack surface.
"EXPOSE publishes the port" — No! EXPOSE 8000 is documentation only. It tells humans and tools "this container listens on 8000." You still need -p 8000:8000 in docker run to actually map the port. Many beginners skip the -p flag thinking EXPOSE handled it, then wonder why they cannot connect.
Notice that ANTHROPIC_API_KEY is NOT in the Dockerfile. Docker images get pushed to registries, shared with teammates, and stored in CI/CD systems. If you bake a secret into an image, anyone who pulls the image can extract it. Always pass secrets at runtime via -e flags or a secrets manager.
The exam tests secret management best practices for agent deployments. Know the hierarchy: environment variables (acceptable for dev), cloud secret managers (GCP Secret Manager, AWS Secrets Manager — required for production), and never hardcoded in code, Dockerfiles, or config files committed to git. Anti-pattern: using --set-env-vars with the actual key value in CI/CD logs.
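The runtime-secret rule above can be enforced in code as well as in the Dockerfile. The following is a minimal sketch, assuming the server reads the key from the ANTHROPIC_API_KEY environment variable; the names load_api_key, MissingSecretError, and redact are hypothetical helpers for illustration, not part of the lab's server.py.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent from the environment."""


def load_api_key(env=os.environ, name="ANTHROPIC_API_KEY"):
    """Return the API key from the environment, or fail fast with a clear message."""
    key = env.get(name, "").strip()
    if not key:
        raise MissingSecretError(
            f"{name} is not set; pass it at runtime (docker run -e {name}=...) "
            "or inject it from a secrets manager."
        )
    return key


def redact(secret, visible=4):
    """Mask a secret for log output, keeping only the last few characters."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

Failing at startup makes a missing key visible immediately in the container logs, and logging only redact(key) keeps the full value out of CI/CD output.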
If the build completes with "naming to docker.io/library/ucc-agent", your Dockerfile is correct. Run docker images ucc-agent to verify the image exists — it should be approximately 235 MB. If the build fails, check the error message and verify all .py files exist in your project directory.
"COPY failed: file not found in build context" → You are missing one of the .py files. Run ls *.py to confirm agent.py, mock_data.py, and server.py all exist.
"Cannot connect to the Docker daemon" → Docker Desktop is not running. Start it and wait for the whale icon to appear in your system tray.
Build takes more than 5 minutes → First build downloads the Python base image (~150 MB). Subsequent builds use the cached layer and take under 10 seconds (only the code layer rebuilds).
Total image size: ~235 MB — Layers are cached. Code changes only rebuild the top 12 KB layer.
Step 6: Build and Run the Docker Container
What: Build the Docker image and run it as a container, passing the API key at runtime. Why: This verifies that your Dockerfile is correct and that the agent works inside a container, not just on your bare machine. It also proves the secret-passing pattern works.
# Build the image (run from the project directory)
docker build -t ucc-agent .
# Run the container, passing the API key from your environment
docker run -p 8000:8000 -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY ucc-agent
# Or on Windows PowerShell:
# docker run -p 8000:8000 -e ANTHROPIC_API_KEY=$env:ANTHROPIC_API_KEY ucc-agent
Test with the same curl commands from Step 4:
curl -s http://localhost:8000/health | python -m json.tool
curl -s -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"question": "What is the lien exposure for Acme Corporation?"}' \
| python -m json.tool
The responses from the Docker container should be identical to what you saw in Step 4. Same health check, same agent answer. If it works here, your container is ready for the cloud. Stop the container with Ctrl+C.
"port already in use" → Stop any process using port 8000, or change the host port: docker run -p 8001:8000 ... then test on localhost:8001.
Container exits immediately → Check logs: docker logs $(docker ps -lq). Common cause: missing ANTHROPIC_API_KEY.
"ANTHROPIC_API_KEY not configured" in health check → The -e flag did not pass the key. Verify: echo $ANTHROPIC_API_KEY should show your key (not empty).
Step 7: Docker Compose for Development
What: Create a docker-compose.yml that simplifies running the container with environment variables from a .env file.
Why: Typing the full docker run -p 8000:8000 -e ANTHROPIC_API_KEY=... command every time is tedious and error-prone. One typo in the port mapping or a forgotten -e flag and your container starts without the API key. Docker Compose is a tool that lets you define multi-container applications in a YAML file. Instead of typing long docker run commands, you define ports, environment variables, volumes, and health checks in docker-compose.yml, and the command becomes just docker compose up, every time, with the same settings.
Docker Compose also supports features that raw docker run does not make easy: mounting your source code as a volume for hot-reloading during development, defining health checks that report whether the service is actually responding, setting restart policies that bring back crashed containers, and orchestrating multiple services (e.g., agent + database) in a single command. For this lab, we use it for convenience. In production, it becomes essential for managing multi-service deployments.
# docker-compose.yml: development orchestration
version: "3.8"
services:
  agent:
    build: .
    ports:
      - "8000:8000"
    env_file:
      - .env  # Reads ANTHROPIC_API_KEY from .env file
    environment:
      - PORT=8000
    # Uncomment the next 2 lines for hot-reload during development:
    # volumes:
    #   - ./:/app  # Mount source code into container
    restart: unless-stopped  # Auto-restart on crash
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s
Create your .env file (this file is gitignored — never commit it):
ANTHROPIC_API_KEY=your-actual-api-key-here
PORT=8000
The docker compose up command should build the image (if needed), start the container, and show the Uvicorn startup message. The health check should return "healthy". You now have a fully containerized agent running locally.
"no configuration file provided: not found" → You are not in the project directory. Run cd ucc-agent-deploy first, or verify docker-compose.yml exists with ls docker-compose.yml.
"error while loading .env" → You have not created the .env file yet. Copy .env.example to .env and fill in your API key: cp .env.example .env.
Health check keeps failing → The container might not have curl installed (the slim image does not include it). Replace the healthcheck test with: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
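The one-line urllib health check above can be hard to read and debug. As a sketch, the same probe can live in a small script; healthcheck.py is a hypothetical filename, and you would need to add it to the COPY line in the Dockerfile before the compose file could call it. The opener parameter exists only so the function can be tested without a running server.

```python
import urllib.error
import urllib.request


def is_healthy(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if a GET to url answers with HTTP 200 within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status raised as HTTPError
        return False


def main(argv):
    """Return a process exit code: 0 when healthy, 1 otherwise."""
    target = argv[1] if len(argv) > 1 else "http://localhost:8000/health"
    return 0 if is_healthy(target) else 1
```

To run it as a script, end the file with import sys and raise SystemExit(main(sys.argv)); the healthcheck test would then become ["CMD", "python", "healthcheck.py"]. Docker treats exit code 0 as healthy and 1 as unhealthy.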
Step 8: GCP Cloud Run — Prerequisites
What: Set up your GCP project and authenticate the gcloud CLI. Why: Cloud Run runs your Docker container on Google's infrastructure. It handles three things automatically: scaling (from zero to thousands of instances), TLS certificates (free HTTPS), and load balancing (distributing traffic across instances).
You pay only when your container is handling requests — $0 when idle. For agent workloads, Cloud Run is often the best default choice. It supports long timeouts (up to 60 minutes, which matters for complex multi-tool agent loops) and keeps containers warm between requests so repeat callers avoid cold starts.
Steps 8–9 require a GCP account. If you do not have one, skip to the Mock Mode box at the end of Step 9, which simulates the deploy locally.
# 1. Authenticate (opens browser for Google login)
gcloud auth login
# 2. Set your project (replace with your GCP project ID)
gcloud config set project YOUR_PROJECT_ID
# 3. Enable the required APIs
gcloud services enable run.googleapis.com artifactregistry.googleapis.com
# 4. Create an Artifact Registry repository for Docker images
gcloud artifacts repositories create agents \
--repository-format=docker \
--location=us-central1 \
--description="Agent container images"
# 5. Configure Docker to authenticate with Artifact Registry
gcloud auth configure-docker us-central1-docker.pkg.dev
# 6. Verify your setup
gcloud config list
If gcloud config list shows your project ID and account, you are ready to push your image. If you see "ERROR: (gcloud.config.set) INVALID_VALUE", double-check your project ID — it must match exactly (case-sensitive).
Step 9: Deploy to Cloud Run
What: Push your Docker image to Artifact Registry and deploy it to Cloud Run. Artifact Registry is Google Cloud's container registry service: it stores your Docker images so Cloud Run can pull and run them, similar to Docker Hub but private and integrated with GCP's access control. Cloud Run is a fully managed Google Cloud service that runs containers; it automatically scales from zero to many instances, handles HTTPS, and charges only while your container is processing a request. Why: This gives you a public HTTPS URL that anyone can call. Cloud Run handles TLS, scaling, and load balancing; you just provide the container.
# 1. Tag the image for Artifact Registry
docker tag ucc-agent \
us-central1-docker.pkg.dev/YOUR_PROJECT_ID/agents/ucc-agent:v1
# 2. Push the image
docker push us-central1-docker.pkg.dev/YOUR_PROJECT_ID/agents/ucc-agent:v1
# 3. Deploy to Cloud Run
# Every flag explained:
# --image : the container image to run
# --region : where to run (us-central1 is cheap, low latency for US)
# --set-env-vars: pass the API key (for production, use Secret Manager instead)
# --memory 512Mi: 512 MB RAM — enough for an agent + Python runtime
# --cpu 1 : 1 vCPU — agents are I/O bound, not CPU bound
# --timeout 60 : 60-second request timeout — agent loops can take 10-30s
# --max-instances 3: cost safety net — prevents scaling to 1000 instances
# from a traffic spike (each instance = money)
# --port 8000 : the port the container listens on. Cloud Run sends traffic
# to port 8080 by default, but the Dockerfile CMD binds uvicorn to 8000.
# --allow-unauthenticated: makes the URL public (remove for production)
gcloud run deploy ucc-agent \
--image us-central1-docker.pkg.dev/YOUR_PROJECT_ID/agents/ucc-agent:v1 \
--region us-central1 \
--set-env-vars ANTHROPIC_API_KEY=your-key-here \
--memory 512Mi \
--cpu 1 \
--timeout 60 \
--max-instances 3 \
--port 8000 \
--allow-unauthenticated
Test your deployed agent with the public URL:
# Replace with YOUR actual service URL from the deploy output
SERVICE_URL="https://ucc-agent-abc123-uc.a.run.app"
# Health check
curl -s $SERVICE_URL/health | python -m json.tool
# Query the agent
curl -s -X POST $SERVICE_URL/query \
-H "Content-Type: application/json" \
-d '{"question": "Find all filings for Pacific Freight Inc"}' \
| python -m json.tool
The Cloud Run URL should return the same response as your local Docker container. You now have a public HTTPS API for your agent. Note the first request may take 2–3 seconds longer (cold start) as Cloud Run boots the container.
"PERMISSION_DENIED: Cloud Run Admin API has not been used" → Run gcloud services enable run.googleapis.com and wait 1–2 minutes for the API to activate.
"ERROR: failed to push image" → Docker is not authenticated with Artifact Registry. Run gcloud auth configure-docker us-central1-docker.pkg.dev then retry the push.
Service deployed but returns 503 → The API key was not set correctly. Check with gcloud run services describe ucc-agent --region us-central1 and look for the ANTHROPIC_API_KEY env var. Redeploy with the correct key.
Cloud Run charges per request: CPU time + memory time + number of requests. With --max-instances 3 and typical agent traffic (a few queries per hour), expect less than $1/month. When idle, you pay $0. For comparison, a dedicated VM running 24/7 would cost $25–$50/month.
For production, remove --allow-unauthenticated and use IAM authentication or an API gateway. Also use Secret Manager for the API key instead of --set-env-vars:
gcloud run deploy ucc-agent ... --set-secrets ANTHROPIC_API_KEY=anthropic-key:latest
If you do not have a GCP account, you can simulate the Cloud Run experience locally. The key insight is that Cloud Run just runs your Docker container — it does not change your code. Run docker run -p 8080:8000 -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY ucc-agent and pretend http://localhost:8080 is your Cloud Run URL. The behavior is identical.
Step 10: AWS Lambda — Prerequisites
What: Set up your AWS account and understand the Lambda execution model. Why: AWS Lambda is a fundamentally different deployment model from Cloud Run. It is a serverless compute service from AWS that runs your code in response to events. You pay only for the time your code actually executes (billed in 1 ms increments), and Lambda scales automatically from zero to thousands of concurrent executions.
Instead of running a persistent web server that waits for requests, Lambda runs your code in response to individual events. When an event arrives and no execution environment is available, Lambda creates one (a cold start), runs your handler function, and then freezes the environment. It may reuse that environment for later requests (a warm start) or discard it after a period of inactivity. You pay per invocation, and the first 1 million requests per month are free. Lambda is ideal for event-driven agents: a webhook fires and starts processing, a new file appears in S3 and triggers analysis, or a cron job runs every hour.
Steps 10–11 require an AWS account with the AWS SAM CLI installed. If you do not have one, skip to the Mock Mode box at the end of Step 11.
Cloud Run is your default for agent APIs. It supports long-running requests (up to 60 minutes), keeps containers warm between requests, and works with any HTTP framework. Think of it as "your Docker container, but managed."
Lambda is best for event-driven agents: a new file appears in S3 and triggers analysis, a webhook fires and starts processing, or a cron job runs every hour. Lambda's max timeout is 15 minutes, which works for most agent tasks but not for complex multi-agent pipelines that take longer.
Tradeoff: Lambda has cold starts (2–5 seconds to load Python + dependencies). Cloud Run also has cold starts, but you can set minimum instances to 1 to eliminate them ($0.05/hour).
# 1. Configure AWS CLI (you will be prompted for access key, secret, region)
aws configure
# Access Key ID: your-access-key
# Secret Access Key: your-secret-key
# Default region: us-east-1
# Default output format: json
# 2. Verify your identity
aws sts get-caller-identity
# 3. Verify SAM CLI is installed
sam --version
If aws sts get-caller-identity returns your account ID, and sam --version shows a version number, you are ready for Step 11.
Step 11: Deploy to AWS Lambda
What: Create a Lambda handler that wraps the same agent code, package it with SAM, and deploy it behind API Gateway, the AWS service that acts as a front door for your Lambda functions. It handles HTTP routing, authentication, rate limiting, and CORS, then invokes the Lambda function for each request.
Why: Lambda does not run a web server. It expects a specific handler function that receives an event object and returns a response. Mangum, a Python library that adapts ASGI applications such as FastAPI to work as AWS Lambda handlers, bridges this gap: it converts the Lambda event (from API Gateway) into a format FastAPI understands, and converts FastAPI's response back into Lambda's expected format. This means you reuse 100% of your server.py code without rewriting it.
Create lambda_handler.py — this is the adapter between Lambda and your FastAPI app:
# lambda_handler.py — AWS Lambda adapter for FastAPI
# Mangum converts API Gateway events into ASGI requests that FastAPI understands.
from mangum import Mangum
from server import app
# This is the entry point Lambda calls.
# Mangum wraps the FastAPI app: Lambda event -> HTTP request -> FastAPI -> HTTP response -> Lambda response
handler = Mangum(app, lifespan="off")
# That's it! Mangum does all the translation. Your FastAPI routes,
# Pydantic validation, error handling — everything works exactly
# the same as it does locally. The only difference is HOW the
# request arrives (Lambda event vs. HTTP request).
// lambda_handler.js — AWS Lambda adapter (native handler, no Express)
// For Node.js, we use a native Lambda handler since serverless-http
// adds overhead. The agent logic is called directly.
const { runAgent } = require("./agent");
exports.handler = async (event) => {
// API Gateway sends the HTTP method and body in the event object
const method = event.httpMethod || event.requestContext?.http?.method;
const path = event.path || event.rawPath;
// Health check
if (path === "/health" && method === "GET") {
const apiKey = process.env.ANTHROPIC_API_KEY || "";
if (!apiKey) {
return { statusCode: 503, body: JSON.stringify({ status: "unhealthy" }) };
}
return {
statusCode: 200,
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ status: "healthy", service: "ucc-agent", version: "1.0.0" }),
};
}
// Query endpoint
if (path === "/query" && method === "POST") {
try {
const body = JSON.parse(event.body || "{}");
const question = body.question;
if (!question || typeof question !== "string" || question.length === 0) {
return {
statusCode: 422,
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ detail: "Missing or invalid question", status: "error" }),
};
}
const t0 = Date.now();
const answer = await runAgent(question);
const elapsed = ((Date.now() - t0) / 1000).toFixed(2);
return {
statusCode: 200,
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ answer, elapsed_seconds: parseFloat(elapsed), status: "success" }),
};
} catch (err) {
return {
statusCode: 500,
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ detail: `Agent failed: ${err.message}`, status: "error" }),
};
}
}
return { statusCode: 404, body: JSON.stringify({ detail: "Not found" }) };
};
Now create template.yaml, the SAM template that defines your Lambda function and API Gateway. An AWS Serverless Application Model (SAM) template is a YAML file that describes your Lambda function, API Gateway, and other AWS resources; the SAM CLI builds, packages, and deploys everything from it:
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Description: UCC Filing Research Agent - Lambda deployment
Globals:
  Function:
    # 120 seconds - agent loops can take 10-30s.
    # Note: API Gateway REST APIs time out the HTTP response after
    # 29 seconds by default, even if the function keeps running.
    Timeout: 120
    MemorySize: 512  # 512 MB - enough for Python + anthropic SDK
    Runtime: python3.12
Resources:
  UCCAgentFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: lambda_handler.handler
      CodeUri: .
      Description: UCC Filing Research Agent
      Architectures:
        - x86_64
      Environment:
        Variables:
          ANTHROPIC_API_KEY: !Ref AnthropicApiKey
      Events:
        HealthCheck:
          Type: Api
          Properties:
            Path: /health
            Method: get
        Query:
          Type: Api
          Properties:
            Path: /query
            Method: post
        Metrics:
          Type: Api
          Properties:
            Path: /metrics
            Method: get
Parameters:
  AnthropicApiKey:
    Type: String
    NoEcho: true  # Masks the value in CloudFormation logs
    Description: Anthropic API key for Claude access
Outputs:
  ApiUrl:
    Description: API Gateway URL for the UCC Agent
    Value: !Sub "https://${ServerlessRestApi}.execute-api.${AWS::Region}.amazonaws.com/Prod/"
Build and deploy:
# Build the Lambda package (installs deps, creates deployment artifact)
sam build
# Deploy (first time: use --guided for interactive prompts)
sam deploy --guided
# Stack name: ucc-agent-stack
# Region: us-east-1
# Parameter AnthropicApiKey: (paste your API key — it will be masked)
# Confirm changes before deploy: y
# Allow SAM CLI IAM role creation: y
# Save arguments to samconfig.toml: y
# After first deploy, subsequent deploys are simpler:
# sam build && sam deploy
Test the Lambda deployment:
# Replace with YOUR API Gateway URL from the deploy output
LAMBDA_URL="https://abc123def4.execute-api.us-east-1.amazonaws.com/Prod"
curl -s $LAMBDA_URL/health | python -m json.tool
curl -s -X POST $LAMBDA_URL/query \
-H "Content-Type: application/json" \
-d '{"question": "What is the risk level for Lone Star Drilling Co?"}' \
| python -m json.tool
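The Lambda endpoint should return the same agent response as local and Cloud Run. The first request will be slower (3–6 seconds cold start) as Lambda initializes the Python runtime and loads the Anthropic SDK. Subsequent requests that reach an already-initialized (warm) execution environment will be faster. How long Lambda keeps an idle environment warm is not guaranteed, so expect occasional cold starts after quiet periods.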
"Task timed out after 120 seconds" → The agent loop took too long. Check your tool implementation and increase the timeout in template.yaml, or simplify the query.
"Unable to import module 'lambda_handler'" → SAM did not include all files. Run sam build again and check .aws-sam/build/ for your files.
"Internal server error" with no details → Check CloudWatch logs: sam logs --stack-name ucc-agent-stack --tail
Lambda's free tier includes 1 million requests/month and 400,000 GB-seconds of compute. A typical agent query (512 MB, 15 seconds) uses 7.5 GB-seconds. You would need over 53,000 agent queries/month to exceed the free tier. After that, it is roughly $0.20 per 1 million requests + $0.0000166667 per GB-second.
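The free-tier arithmetic above is easy to reproduce for your own memory size and query duration. This is a rough sketch that uses only the prices quoted in the paragraph above (x86 pricing, which may change); the function names are illustrative.

```python
FREE_REQUESTS = 1_000_000            # free requests per month
FREE_GB_SECONDS = 400_000            # free compute per month
PRICE_PER_MILLION_REQUESTS = 0.20    # USD, as quoted above
PRICE_PER_GB_SECOND = 0.0000166667   # USD, as quoted above


def gb_seconds(memory_mb, duration_s):
    """Compute consumed by one invocation, in GB-seconds."""
    return (memory_mb / 1024) * duration_s


def free_tier_queries(memory_mb, duration_s):
    """How many queries per month fit inside the free tier."""
    per_query = gb_seconds(memory_mb, duration_s)
    return min(FREE_REQUESTS, int(FREE_GB_SECONDS // per_query))


def monthly_cost(queries, memory_mb, duration_s):
    """Estimated monthly bill in USD after subtracting the free tier."""
    billable_requests = max(0, queries - FREE_REQUESTS)
    used = queries * gb_seconds(memory_mb, duration_s)
    billable_gb_seconds = max(0.0, used - FREE_GB_SECONDS)
    return (
        billable_requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS
        + billable_gb_seconds * PRICE_PER_GB_SECOND
    )
```

For the 512 MB, 15-second query in the text, gb_seconds(512, 15) is 7.5 and free_tier_queries(512, 15) is 53,333, which matches the "over 53,000" figure. At 100,000 such queries a month the estimate is about $5.83.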
The exam expects you to know the Lambda adapter pattern: reuse your existing web framework (FastAPI/Express) code via an adapter library (Mangum for Python, serverless-http for Node.js) rather than rewriting handler logic from scratch. Know the tradeoffs: adapters add ~50ms overhead per cold start, but save weeks of duplicate code maintenance. For latency-critical Lambda functions, a native handler (no framework) is faster but harder to maintain.
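To make the adapter pattern concrete, here is a deliberately simplified illustration of the translation an adapter performs. It is not Mangum's actual implementation: the route table, route functions, and lambda_adapter are hypothetical names, and a real adapter also handles headers, query strings, and binary bodies.

```python
import json


def health_route(body):
    """Framework-style route: returns (status, payload)."""
    return 200, {"status": "healthy"}


def query_route(body):
    """Framework-style route that echoes the question back."""
    return 200, {"question": body.get("question", "")}


ROUTES = {("GET", "/health"): health_route, ("POST", "/query"): query_route}


def lambda_adapter(routes):
    """Wrap a route table in a Lambda-style handler(event, context)."""

    def handler(event, context):
        route = routes.get((event.get("httpMethod", ""), event.get("path", "")))
        if route is None:
            status, payload = 404, {"detail": "Not found"}
        else:
            try:
                # API Gateway delivers the request body as a string (or None)
                body = json.loads(event.get("body") or "{}")
            except json.JSONDecodeError:
                status, payload = 400, {"detail": "Invalid JSON"}
            else:
                status, payload = route(body)
        # Lambda proxy integrations expect this response shape
        return {
            "statusCode": status,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(payload),
        }

    return handler
```

The route functions never see a Lambda event, which is the point of the pattern: the same routes could sit behind a normal HTTP server, and only the thin adapter changes per platform.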
You can test the Lambda handler locally without AWS. SAM provides a local invoke feature if you have Docker installed:
echo '{"UCCAgentFunction": {"ANTHROPIC_API_KEY": "your-key"}}' > env.json
sam local start-api --env-vars env.json
This starts a local API Gateway that invokes your Lambda handler in a Docker container. Test it at http://localhost:3000/query. If you do not have SAM installed, you can also test the handler directly in Python:
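python -c "from lambda_handler import handler; print(handler({'resource':'/health','path':'/health','httpMethod':'GET','headers':{},'multiValueHeaders':{},'queryStringParameters':None,'multiValueQueryStringParameters':None,'requestContext':{},'body':None,'isBase64Encoded':False}, None))"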
Deployment Comparison: Docker vs Cloud Run vs Lambda
You have now deployed the same agent to three environments. Here is how they compare across the dimensions that matter most for AI agent workloads:
"Lambda is always cheaper than Cloud Run" — Not for agent workloads. Agents make multiple API calls per request (10–30 seconds of compute). Lambda's per-millisecond billing adds up. Cloud Run's per-request model with short idle periods can be cheaper for consistent traffic.
"Serverless means no server" — There IS a server. You just do not manage it. AWS provisions, patches, and scales the servers for you. Your code still runs on a real machine somewhere.
"Cold starts are always a problem" — For agents, the Claude API call itself takes 3–15 seconds. A 2-second cold start is a small fraction of total response time. Cold starts matter more for sub-100ms APIs.
The certification exam tests your ability to choose the right deployment model for a given scenario. Key decision factors: request timeout requirements (Cloud Run for long-running agents), traffic pattern (Lambda for spiky/event-driven), and whether you need persistent connections (Cloud Run or containers, not Lambda).
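The decision factors above can be written down as a small rule of thumb. This is only a sketch of the guidance in this lab, not an official decision procedure; the function name and the returned labels are illustrative, and the timeout limits are the ones quoted in this lab (15 minutes for Lambda, 60 minutes for Cloud Run).

```python
LAMBDA_MAX_TIMEOUT_S = 15 * 60       # Lambda's maximum timeout
CLOUD_RUN_MAX_TIMEOUT_S = 60 * 60    # Cloud Run's maximum request timeout


def recommend_platform(max_request_s, event_driven=False,
                       needs_persistent_connections=False):
    """Suggest a deployment model from the three factors named above."""
    if max_request_s > CLOUD_RUN_MAX_TIMEOUT_S:
        # Longer than any request timeout: run the agent off the request path
        return "async-worker"
    if needs_persistent_connections or max_request_s > LAMBDA_MAX_TIMEOUT_S:
        return "cloud-run"
    if event_driven:
        return "lambda"
    # Default for plain agent APIs, per the guidance in Step 10
    return "cloud-run"
```

For example, a webhook-triggered agent that finishes in 30 seconds maps to "lambda", while the same trigger with a 20-minute run maps to "cloud-run".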
Health Checks and Basic Monitoring
Your deployed agent already has a /health endpoint and a /metrics endpoint. These are not optional extras — they are the minimum viable monitoring for any production service.
A health check answers one question: "Is this service alive and able to handle requests?" Without it, infrastructure has no way to detect a crashed or misconfigured service. Your users would be the first to find out, and that is the worst way to discover an outage.
Metrics go one step further. They answer: "How is this service performing over time?" Request count, error rate, and uptime tell you whether your agent is healthy, degrading, or on fire — before users complain. Here is how each platform uses these endpoints:
How Platforms Use Health Checks
- Docker Compose: The healthcheck directive in your compose file pings /health every 30 seconds. After 3 consecutive failures, Docker marks the container as unhealthy (visible in docker ps). Plain Docker Compose does not restart an unhealthy container on its own; the restart: unless-stopped policy only brings the container back when its process exits.
- Cloud Run: GCP automatically probes your container's port. If your service cannot serve requests, Cloud Run routes traffic to other instances or starts new ones.
- Lambda: Health checks are less relevant since each invocation is independent. API Gateway has its own health monitoring.
Checking Logs
# Docker: view container logs
docker logs -f $(docker ps -q --filter ancestor=ucc-agent)
# GCP Cloud Run: read recent logs
gcloud run services logs read ucc-agent --region us-central1
# AWS Lambda: tail CloudWatch logs
sam logs --stack-name ucc-agent-stack --tail
# All platforms: check the /metrics endpoint
curl -s http://localhost:8000/metrics | python -m json.tool
This basic monitoring tells you how many requests the agent has handled, how many failed, and the error rate. For full observability (distributed tracing, latency percentiles, cost tracking), see M19: Tracing & Logging and M20: Monitoring & Continuous Improvement.
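For reference, the counters behind an endpoint like /metrics can be very small. The sketch below is an assumption about how such counters might be kept, not the lab's actual server.py code; the Metrics class and the field names in snapshot() are hypothetical.

```python
import threading
import time


class Metrics:
    """In-process request counters; they reset whenever the instance restarts."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._started = clock()
        self._lock = threading.Lock()
        self._requests = 0
        self._errors = 0

    def record(self, ok):
        """Count one handled request; ok=False marks it as failed."""
        with self._lock:
            self._requests += 1
            if not ok:
                self._errors += 1

    def snapshot(self):
        """Return the values a /metrics endpoint could serialize as JSON."""
        with self._lock:
            requests, errors = self._requests, self._errors
        rate = errors / requests if requests else 0.0
        return {
            "total_requests": requests,
            "total_errors": errors,
            "error_rate": round(rate, 4),
            "uptime_seconds": round(self._clock() - self._started, 1),
        }
```

Because the counters live in process memory, each Cloud Run instance or Lambda environment reports only its own traffic, which is one reason full observability needs an external system.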
Final Verification: Test All Three Deployments
The ultimate test: send the same query to all three environments and verify you get equivalent responses. This proves your deployment pipeline works end-to-end.
QUERY='{"question": "What is the lien exposure for Acme Corporation?"}'
echo "=== Local Docker ==="
curl -s -X POST http://localhost:8000/query \
-H "Content-Type: application/json" -d "$QUERY" | python -m json.tool
echo ""
echo "=== GCP Cloud Run ==="
curl -s -X POST https://ucc-agent-abc123-uc.a.run.app/query \
-H "Content-Type: application/json" -d "$QUERY" | python -m json.tool
echo ""
echo "=== AWS Lambda ==="
curl -s -X POST https://abc123def4.execute-api.us-east-1.amazonaws.com/Prod/query \
-H "Content-Type: application/json" -d "$QUERY" | python -m json.tool
All three deployments should return equivalent answers about Acme Corporation's filings and risk score. The exact wording will differ (Claude generates it fresh each time), but the data — 2 active filings, $2,950,000 total exposure, medium risk — should be consistent.
You have deployed a multi-tool AI agent to three production environments, each with proper health checks, error handling, secret management, and monitoring. The same code, three platforms, one curl command to test them all.
Real-World Agent Deployment Patterns
The lab you just finished gives you the canonical “HTTP request in, JSON response out” deployment. That covers a real slice of production agents — but the moment your agent runs longer than 5 minutes, needs persistent memory across requests, or processes batches unattended, the simple Cloud Run / Lambda recipe stops fitting. This section covers the four topologies you will actually meet in the field, when each one applies, and what to use when serverless breaks down.
The Four Canonical Topologies
Almost every production agent fits into one of these four shapes. Pick the topology first — the platform follows from it.
| Topology | Shape | Real example | Typical stack |
|---|---|---|---|
| 1. Sync API | Request → agent runs → JSON response. Under ~30 s end-to-end. | A classification API. A “summarize this PDF” endpoint. The lab you just built. | Cloud Run / Lambda + FastAPI |
| 2. Streaming chat | SSE or WebSocket from client to a long-lived handler. Tokens stream back as they generate. | A copilot UI, a customer-support chat, a coding assistant. | Cloud Run (native streaming) or Lambda Function URLs with response streaming. Redis for session state. |
| 3. Async worker | API enqueues a task → returns a job ID immediately → worker pulls from queue and runs the agent for minutes-to-hours → client polls or gets a webhook. | A research agent that browses 50 sources. A code-migration agent on a 200-file repo. | SQS / Pub-Sub / Cloud Tasks → ECS Fargate / GKE / Modal / Inngest. Postgres for job state. |
| 4. Scheduled batch | Cron triggers a batch run that processes N items in parallel and writes to a sink. | Nightly support-ticket categorization. Weekly invoice extraction across 10 k PDFs. | EventBridge / Cloud Scheduler → Batch / Cloud Run Jobs / Anthropic Message Batches API (50% off). |
The lab built topology 1 (Sync API). The other three are where most production agents actually live, because real work rarely fits in 30 seconds.
When Cloud Run / Lambda Stop Fitting
Three concrete failure modes push you off the simple recipe. Recognising them early saves a painful re-architecture later:
Lambda caps at 15 minutes. Cloud Run requests cap at 60 minutes. Anthropic API calls themselves can run several minutes for long tool-use chains, and a research agent making 30 sequential Claude calls easily blows past these limits. Fix: move to topology 3 (async worker). The HTTP request returns a job ID in 100 ms; the actual agent runs on ECS Fargate, GKE, or Modal where there is no timeout. The client polls GET /jobs/{id} or receives a webhook on completion.
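The job-ID pattern described above can be sketched in a single process. This is an illustration only: the in-memory dict and queue stand in for Postgres and SQS or Pub/Sub, the thread stands in for a separate worker service, and JobRunner is a hypothetical name.

```python
import queue
import threading
import uuid


class JobRunner:
    """Accept work, return a job ID immediately, run the agent in the background."""

    def __init__(self, run_agent):
        self._run_agent = run_agent
        self._jobs = {}                  # stand-in for a job table
        self._lock = threading.Lock()
        self._queue = queue.Queue()      # stand-in for a message queue
        threading.Thread(target=self._work, daemon=True).start()

    def submit(self, question):
        """What POST /jobs would do: enqueue and return the ID at once."""
        job_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[job_id] = {"status": "queued", "result": None, "error": None}
        self._queue.put((job_id, question))
        return job_id

    def get(self, job_id):
        """What GET /jobs/{id} would return; None for an unknown ID."""
        with self._lock:
            job = self._jobs.get(job_id)
            return dict(job) if job else None

    def wait_idle(self):
        """Block until the queue is drained (useful in tests)."""
        self._queue.join()

    def _update(self, job_id, **fields):
        with self._lock:
            self._jobs[job_id].update(fields)

    def _work(self):
        while True:
            job_id, question = self._queue.get()
            self._update(job_id, status="running")
            try:
                result = self._run_agent(question)
            except Exception as exc:
                self._update(job_id, status="failed", error=str(exc))
            else:
                self._update(job_id, status="succeeded", result=result)
            finally:
                self._queue.task_done()
```

The HTTP handler only ever calls submit() and get(), so its latency no longer depends on how long the agent runs.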
Cloud Run and Lambda are stateless by design — the next request can land on a brand-new instance with empty memory. If your agent needs to remember the last 20 turns of conversation, you cannot keep that in process memory; the next request might hit a cold instance and lose it. Fix: externalize session state. Three common patterns: (a) Redis or Memorystore keyed by session_id, (b) Postgres with a conversations table, (c) a managed agent platform that owns the session for you (Bedrock AgentCore, Vertex Agent Engine — covered below). The instance is still stateless; the data lives outside it.
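Pattern (a) can be sketched as follows. InMemoryKV is a stand-in for Redis or Memorystore so that the example runs without a server, and SessionStore is a hypothetical wrapper; a real deployment would swap in a Redis client and would likely use atomic list operations rather than the read-modify-write shown here.

```python
import json


class InMemoryKV:
    """Stand-in for an external key-value store such as Redis."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class SessionStore:
    """Keep the last max_turns conversation turns per session_id, outside the process."""

    def __init__(self, kv, max_turns=20):
        self._kv = kv
        self._max_turns = max_turns

    def history(self, session_id):
        raw = self._kv.get(f"session:{session_id}")
        return json.loads(raw) if raw else []

    def append(self, session_id, role, content):
        # Not atomic: concurrent writers to one session could lose a turn
        turns = self.history(session_id)
        turns.append({"role": role, "content": content})
        self._kv.set(f"session:{session_id}", json.dumps(turns[-self._max_turns:]))
```

Two SessionStore objects that share one backing store behave like two stateless instances: a turn appended by one is visible to the other on the next request.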
Cloud Run will happily scale to 1,000 instances under load. The Anthropic API will not happily accept 1,000 concurrent requests — you hit your tokens-per-minute or requests-per-minute limit and start getting 429s, which the user sees as failures. Fix: put a queue in front. Even for a sync-feeling API, requests can flow into SQS or Cloud Tasks, then a worker pool with a configured concurrency limit (e.g., 20 in-flight Claude calls max) drains the queue at a sustainable rate. The queue absorbs bursts; backpressure replaces 429s.
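The concurrency cap on the worker side can be sketched with asyncio.Semaphore. Here call_model is a hypothetical stand-in for one Claude API call, and the limit of 20 is taken from the example above; the function also reports the peak concurrency it observed so the cap can be checked.

```python
import asyncio

MAX_IN_FLIGHT = 20  # upper bound on simultaneous model calls


async def call_model(prompt: str) -> str:
    """Hypothetical stand-in for one Claude API call."""
    await asyncio.sleep(0.01)
    return f"answer to {prompt}"


async def drain(prompts: list, limit: int = MAX_IN_FLIGHT) -> tuple:
    """Process every prompt while never running more than `limit` calls at once.

    Returns (answers in input order, peak number of concurrent calls).
    """
    gate = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def one(prompt: str) -> str:
        nonlocal in_flight, peak
        async with gate:  # waits here while `limit` calls are already running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await call_model(prompt)
            finally:
                in_flight -= 1

    answers = await asyncio.gather(*(one(p) for p in prompts))
    return list(answers), peak
```

Calling asyncio.run(drain(prompts)) with 100 prompts still completes all of them, but the burst is spread out instead of hitting the API all at once. A cap inside one process only bounds that process; with several workers, the effective limit is the per-worker cap times the number of workers.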
Reference Architecture: A Production Agent System
Here is what a real production agent stack looks like end-to-end — the kind of diagram you would whiteboard at a design review. None of this is exotic; each box is a service you have probably used before. The point is to see how they fit together.
┌──────────┐ ┌──────────────┐ ┌─────────────────┐
│ Client │───▶│ API Gateway │───▶│ Auth (Cognito │
│ (web/app)│ │ + WAF + TLS │ │ / Auth0 / IAM)│
└──────────┘ └──────┬───────┘ └─────────────────┘
│
┌────────────┴────────────┐
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Sync path │ │ Async path │
│ (Cloud Run) │ │ (SQS/PubSub) │
│ fast tasks │ │ long tasks │
└──────┬───────┘ └──────┬───────┘
│ │
│ ▼
│ ┌────────────────┐
│ │ Worker pool │
│ │ (Fargate/GKE/ │
│ │ Modal/Inngest)│
│ └──────┬─────────┘
│ │
▼ ▼
┌────────────────────────────────────────────┐
│ AGENT LOOP (your code) │
│ ┌──────────────────────────────────────┐ │
│ │ Claude API (direct / Bedrock / │ │
│ │ Vertex AI) │ │
│ └──────────────────────────────────────┘ │
│ ┌─────────┐ ┌─────────┐ ┌──────────────┐ │
│ │ Tool 1 │ │ Tool 2 │ │ Vector DB │ │
│ │ (HTTP) │ │ (DB) │ │ (pgvector, │ │
│ │ │ │ │ │ Pinecone) │ │
│ └─────────┘ └─────────┘ └──────────────┘ │
└─────────────────┬──────────────────────────┘
│
┌────────────┼────────────────┐
▼ ▼ ▼
┌─────────┐ ┌──────────┐ ┌────────────────┐
│ Session │ │ Job/run │ │ Observability │
│ store │ │ store │ │ (OTel + Lang- │
│ (Redis) │ │ (Postgres│ │ fuse / Datadog│
│ │ │ / Dynamo│ │ / CloudWatch) │
└─────────┘ └──────────┘ └────────────────┘
A few details worth calling out, since they are easy to miss when you copy this from a slide:
- Sync path and async path share the same agent code. The split is in how the agent is invoked, not what it does. A research agent might be exposed as both POST /query for short questions (sync, Cloud Run) and POST /jobs for deep research (async, queue + worker).
- Session store and job store are different. Sessions hold conversation state (Redis is fine, sub-millisecond). The job store holds the run lifecycle (status, started_at, output, errors) and needs durability (Postgres or DynamoDB).
- Observability is non-negotiable for agents. Unlike CRUD APIs, agents fail in ways logs alone cannot diagnose: the wrong tool was picked, the model loop oscillated, a planning step truncated. You need per-step traces. M19 covers this; the box exists in this diagram so you do not forget it on day one.
- Vector DB is optional — only present if your agent does retrieval. Many production agents do not have one.
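To make the job store's lifecycle concrete, here is a minimal sketch using sqlite3 as a local stand-in for Postgres or DynamoDB. The table layout, the four status values, and the helper names are assumptions chosen for illustration; the columns mirror the fields listed above (status, started_at, output, errors).

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id         TEXT PRIMARY KEY,
    status     TEXT NOT NULL
               CHECK (status IN ('queued', 'running', 'succeeded', 'failed')),
    started_at TEXT,
    output     TEXT,
    error      TEXT
)
"""


def create_job(db: sqlite3.Connection, job_id: str) -> None:
    """Insert a new job in the 'queued' state."""
    db.execute("INSERT INTO jobs (id, status) VALUES (?, 'queued')", (job_id,))
    db.commit()


def start_job(db: sqlite3.Connection, job_id: str) -> None:
    """Move queued -> running; the WHERE clause blocks any other transition."""
    started = datetime.now(timezone.utc).isoformat()
    db.execute(
        "UPDATE jobs SET status = 'running', started_at = ? "
        "WHERE id = ? AND status = 'queued'",
        (started, job_id),
    )
    db.commit()


def finish_job(db: sqlite3.Connection, job_id: str, output=None, error=None) -> None:
    """Move running -> succeeded or failed, storing the output or the error."""
    status = "failed" if error else "succeeded"
    db.execute(
        "UPDATE jobs SET status = ?, output = ?, error = ? "
        "WHERE id = ? AND status = 'running'",
        (status, output, error, job_id),
    )
    db.commit()


def get_status(db: sqlite3.Connection, job_id: str):
    """Return the job's status, or None if the job does not exist."""
    row = db.execute("SELECT status FROM jobs WHERE id = ?", (job_id,)).fetchone()
    return row[0] if row else None
```

Guarding each UPDATE with the expected current status keeps a retried or duplicated worker message from moving a job backwards, which is one reason this data belongs in a durable store rather than in Redis.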
Managed Agent Platforms: A Working Example
Everything above assumes you host the agent loop yourself. The alternative is to hand off the loop to a managed agent platform — the cloud provider runs the reasoning loop, persists session state, and orchestrates tool calls. You upload a system prompt and tool definitions; you never write a FastAPI server. The two production options for Claude in 2026 are AWS Bedrock AgentCore and Google Vertex AI Agent Engine.
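Here is a complete working example on AWS. The calls shown (create_agent, invoke_agent) belong to the Amazon Bedrock Agents API, which boto3 exposes through the bedrock-agent and bedrock-agent-runtime clients; AgentCore is a separate service with its own clients. Two parts: (1) a one-time setup that creates the agent (control plane), and (2) an invocation script you run from any client (data plane). This is what production code actually looks like, not pseudocode.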
Part 1 — Create the agent (run once, e.g., from CI/CD or a setup script):
import time

import boto3

# Control-plane client: creates and configures agents
ctrl = boto3.client("bedrock-agent", region_name="us-west-2")


def wait_while(agent_id: str, busy_status: str, poll_seconds: float = 2.0) -> str:
    """Poll until the agent leaves a transitional status (CREATING, PREPARING)."""
    while True:
        status = ctrl.get_agent(agentId=agent_id)["agent"]["agentStatus"]
        if status != busy_status:
            return status
        time.sleep(poll_seconds)


# Step 1: create the agent shell (system prompt + foundation model)
agent = ctrl.create_agent(
    agentName="ucc-research-agent",
    foundationModel="anthropic.claude-opus-4-7-v1:0",
    instruction=(
        "You are a UCC filing research assistant. Use the lookup_filings "
        "tool to search public records. Always cite filing numbers."
    ),
    agentResourceRoleArn="arn:aws:iam::123456789012:role/AgentExecutionRole",
    idleSessionTTLInSeconds=600,  # session memory expires after 10 min idle
)
agent_id = agent["agent"]["agentId"]
wait_while(agent_id, "CREATING")  # creation is asynchronous; wait before modifying

# Step 2: attach an action group (your tools, defined as a Lambda function)
ctrl.create_agent_action_group(
    agentId=agent_id,
    agentVersion="DRAFT",
    actionGroupName="filing-tools",
    actionGroupExecutor={
        "lambda": "arn:aws:lambda:us-west-2:123456789012:function:lookup-filings"
    },
    apiSchema={
        "s3": {
            "s3BucketName": "my-agent-schemas",
            "s3ObjectKey": "filing-tools-openapi.yaml",  # OpenAPI 3 spec
        }
    },
)

# Step 3: prepare and create an alias (the stable endpoint your clients call)
ctrl.prepare_agent(agentId=agent_id)
wait_while(agent_id, "PREPARING")  # preparation is asynchronous as well
alias = ctrl.create_agent_alias(agentId=agent_id, agentAliasName="prod")
print(f"Agent ready: agentId={agent_id}, aliasId={alias['agentAlias']['agentAliasId']}")
The same setup using the AWS CLI:
# Create the agent
aws bedrock-agent create-agent \
  --agent-name ucc-research-agent \
  --foundation-model anthropic.claude-opus-4-7-v1:0 \
  --instruction "You are a UCC filing research assistant..." \
  --agent-resource-role-arn arn:aws:iam::123456789012:role/AgentExecutionRole \
  --idle-session-ttl-in-seconds 600 \
  --region us-west-2
# Attach the action group (tools)
# The --api-schema value is quoted so the shell does not brace-expand it
aws bedrock-agent create-agent-action-group \
  --agent-id ABCDE12345 \
  --agent-version DRAFT \
  --action-group-name filing-tools \
  --action-group-executor lambda=arn:aws:lambda:us-west-2:123456789012:function:lookup-filings \
  --api-schema 's3={s3BucketName=my-agent-schemas,s3ObjectKey=filing-tools-openapi.yaml}'
# Prepare (compiles the agent) and create a stable alias
aws bedrock-agent prepare-agent --agent-id ABCDE12345
aws bedrock-agent create-agent-alias --agent-id ABCDE12345 --agent-alias-name prod
Part 2 — Invoke the agent (your application code, run on every user request):
import uuid

import boto3

# Data-plane client: invokes deployed agents
rt = boto3.client("bedrock-agent-runtime", region_name="us-west-2")


def ask(question: str, session_id: str) -> str:
    response = rt.invoke_agent(
        agentId="ABCDE12345",
        agentAliasId="PRODALIAS1",  # the alias you created above
        sessionId=session_id,       # same id across calls = same conversation
        inputText=question,
        enableTrace=True,           # returns reasoning steps for debugging
    )
    # invoke_agent streams; concatenate chunks for the final answer
    answer = []
    for event in response["completion"]:
        if "chunk" in event:
            answer.append(event["chunk"]["bytes"].decode("utf-8"))
        elif "trace" in event:
            # Useful for production logs: which tool was called, why
            print("trace:", event["trace"])
    return "".join(answer)


# Same session_id = Bedrock automatically maintains conversation memory.
# No Redis. No conversations table. The platform owns it.
session = str(uuid.uuid4())
print(ask("What were Q4 sales for Acme Corporation?", session))
print(ask("And how does that compare to Q3?", session))  # context preserved
The same invocation in TypeScript:
import {
  BedrockAgentRuntimeClient,
  InvokeAgentCommand,
} from "@aws-sdk/client-bedrock-agent-runtime";
import { randomUUID } from "node:crypto";

const rt = new BedrockAgentRuntimeClient({ region: "us-west-2" });

async function ask(question: string, sessionId: string): Promise<string> {
  const cmd = new InvokeAgentCommand({
    agentId: "ABCDE12345",
    agentAliasId: "PRODALIAS1",
    sessionId,
    inputText: question,
    enableTrace: true,
  });
  const res = await rt.send(cmd);
  let answer = "";
  for await (const event of res.completion ?? []) {
    if (event.chunk?.bytes) {
      answer += new TextDecoder().decode(event.chunk.bytes);
    } else if (event.trace) {
      console.log("trace:", JSON.stringify(event.trace));
    }
  }
  return answer;
}

const session = randomUUID();
console.log(await ask("What were Q4 sales for Acme Corporation?", session));
console.log(await ask("And how does that compare to Q3?", session));
The Vertex AI equivalent is similar in shape: vertexai.agent_engines.create() registers your agent (typically a LangGraph or ADK graph), and you invoke the deployed agent through its query method, for example remote_agent.query(input=...); the exact arguments depend on the agent framework you deployed. Same idea: you stop owning the runtime.
Pick managed when your agent fits the platform’s loop (system prompt + tools + RAG), you are already deep in AWS or GCP, and you would rather pay slightly more per call than maintain a worker pool. Time-to-first-deploy is hours, not days.
Pick self-hosted (the lab’s pattern) when you need custom planning logic, want to swap models per step (cheap model for routing, expensive model for synthesis), need exact cost control per turn, or your tools touch systems that cannot be reached from the platform’s execution environment. You also stay portable — the same code runs against direct Anthropic API, Bedrock-as-a-model, or Vertex-as-a-model without rewriting the loop.
Hybrid in practice: many teams ship v1 on a managed platform to validate the product, then graduate the parts that hit the platform’s ceiling onto a self-hosted worker pool. The managed platform handles the chat surface; the worker pool handles the long, custom, expensive runs.
Going Further (Optional Stretch Goals)
These are optional extensions for learners who want to go deeper:
- CI/CD Pipeline: Set up GitHub Actions to automatically build your Docker image and deploy to Cloud Run on every push to main. Use gcloud run deploy in your workflow.
- Auto-scaling Config: Configure Cloud Run's --min-instances 1 to eliminate cold starts for production traffic. Calculate the cost tradeoff: $0.05/hour idle vs. 2-second cold starts.
- Multi-region Deployment: Deploy to us-central1 and europe-west1 on Cloud Run, then use a global load balancer to route users to the nearest region.
- Add Streaming: Implement a POST /query/stream endpoint that uses Server-Sent Events (SSE) to stream the agent's response token by token. This dramatically improves perceived latency for users.
- Lambda Layers: Package the Anthropic SDK as a Lambda Layer to reduce cold start time and deployment package size. Layers are cached across invocations.
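For the auto-scaling stretch goal, the cost tradeoff can be worked out with a few lines of arithmetic. This sketch takes the $0.05/hour idle rate and the 2-second cold start from the bullet above as given; the 30-day month and the number of cold starts per day are assumptions you should replace with your own traffic figures.

```python
DAYS_PER_MONTH = 30                      # assumed billing month
HOURS_PER_MONTH = 24 * DAYS_PER_MONTH


def idle_cost_per_month(idle_rate_per_hour: float = 0.05) -> float:
    """Monthly cost of keeping one warm instance billed at the idle rate."""
    return idle_rate_per_hour * HOURS_PER_MONTH


def cold_start_seconds_avoided(cold_starts_per_day: int,
                               cold_start_seconds: float = 2.0) -> float:
    """Total user-facing wait removed per month by never starting cold."""
    return cold_starts_per_day * cold_start_seconds * DAYS_PER_MONTH


def cost_per_cold_start_avoided(cold_starts_per_day: int,
                                idle_rate_per_hour: float = 0.05) -> float:
    """Dollars paid for each cold start that the warm instance prevents."""
    monthly_cold_starts = cold_starts_per_day * DAYS_PER_MONTH
    return idle_cost_per_month(idle_rate_per_hour) / monthly_cold_starts
```

At the quoted rate, one warm instance costs about $36 per month. If your service would otherwise see 50 cold starts a day, that is roughly 2.4 cents per cold start avoided, which you can weigh against how much a 2-second delay matters for your users.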
Knowledge Check
Test your understanding of agent deployment concepts. Select the best answer for each question.
1. Why is streaming important for agent API endpoints?
2. Why should you NEVER bake ANTHROPIC_API_KEY into a Docker image?
3. Your agent takes 3 minutes to complete a complex multi-tool analysis. Which deployment platform is the best fit?
4. What does --max-instances 3 prevent in a Cloud Run deployment?
5. Your agent works perfectly locally but returns timeout errors on AWS Lambda. What is the most likely cause?
6. What is the role of the Mangum library in the Lambda deployment?
Summary
What We Built
In this lab you took a UCC filing research agent and deployed it to three environments:
- Local Docker — Containerized the agent with a multi-stage Dockerfile, non-root user, and runtime secret injection
- GCP Cloud Run — Pushed the image to Artifact Registry and deployed with resource limits, cost caps, and auto-scaling
- AWS Lambda — Created a Mangum adapter, defined a SAM template, and deployed behind API Gateway
All three respond to the same curl command with equivalent results. The agent code did not change between platforms — only the infrastructure wrapper.
Key Takeaways
- Wrap agents as REST APIs with health checks before containerizing
- Never bake secrets into Docker images — pass them at runtime
- Cloud Run is the best default for production agent APIs (long timeouts, auto-scaling, low cost at idle)
- Lambda is best for event-driven agents (webhooks, cron jobs, file triggers)
- Cold starts matter less for agents since the Claude API call dominates response time
- Always set --max-instances to prevent cost explosions
What Comes Next
In M23: Capstone Project Series, you will combine everything from the course — agent architecture, tool use, guardrails, observability, cost optimization, and deployment — into a complete, production-grade system for one of the three domain projects.