Building AI Agents with Claude Track 7: Production Deployment
Module 25 of 30 ~90-120 min Advanced

Deploy Your Agent — Local, GCP & AWS

Hands-on lab: take a working UCC filing research agent and deploy it to three production environments — Local Docker, Google Cloud Run, and AWS Lambda. You will end with a live URL you can call from anywhere.

Learning Objectives

  • Wrap an AI agent as a production REST API using FastAPI with health checks and error handling
  • Containerize an agent application with Docker using security best practices (non-root user, no baked-in secrets)
  • Deploy a containerized agent to Google Cloud Run with proper resource limits and secret management
  • Deploy the same agent to AWS Lambda using SAM, including the handler adapter pattern
  • Compare Local Docker, Cloud Run, and Lambda across cold start, cost, scaling, and timeout dimensions

What You'll Build

Lab Overview

Final artifact: A UCC filing research agent deployed to Local Docker, GCP Cloud Run, and AWS Lambda — all testable with the same curl command.

Time estimate: 90–120 minutes

Prerequisites:

  • M15B (Build Complete Agent & Subagent System) — the agent code we are deploying
  • M21 (API Design) and M22 (Cost Optimization) — production design concepts
  • Docker Desktop installed and running (install Docker)
  • Optional: GCP account with gcloud CLI for Cloud Run steps
  • Optional: AWS account with aws CLI and SAM CLI for Lambda steps

Files you will create:

ucc-agent-deploy/
├── agent.py             # Agent logic with tool definitions
├── mock_data.py         # Mock UCC filing data
├── server.py            # FastAPI REST API wrapper
├── lambda_handler.py    # AWS Lambda adapter
├── Dockerfile           # Container definition
├── docker-compose.yml   # Local development orchestration
├── requirements.txt     # Python dependencies
├── template.yaml        # AWS SAM template
├── .env.example         # Environment variable template
└── .dockerignore        # Files to exclude from image

Cloud accounts optional

Steps 8–11 (GCP and AWS) require cloud accounts. If you do not have them, each step includes a Mock Mode alternative so you can simulate the deployment locally and still learn the concepts. The Docker steps (1–7) work entirely on your machine.

3-Tier Deployment — Same Agent, Three Targets
Figure: the agent source (agent.py, server.py) deploys to three targets:

  • Local Docker: development, stateful, full control. Image built from the Dockerfile, started with docker compose up.
  • GCP Cloud Run: scale-to-zero containers, ~1-3s cold start. Same Dockerfile, pushed to GCR, deployed with gcloud run deploy.
  • AWS Lambda: serverless functions, ~2-5s cold start. lambda_handler.py packaged as a SAM zip with template.yaml, deployed with sam deploy.

The same test command (curl POST /query) verifies all three.

Now you understand what we are building and what you need. Before writing code, think about the deployment journey: first you wrap the agent as an API, then you containerize it, then you push that container to cloud platforms. Each step adds one layer of infrastructure. Let us start with the project structure.

Step 1: Project Setup

What: Create a clean project directory with all the dependency files you will need. Why: A proper project structure keeps your agent code, server code, and infrastructure files separated. This makes it easy to test locally, build a Docker image, and deploy to any cloud provider without restructuring.

Environment Setup

Open your terminal and run this block to create the project and install dependencies:

mkdir ucc-agent-deploy && cd ucc-agent-deploy
python -m venv venv
# Linux/macOS:
source venv/bin/activate
# Windows:
# venv\Scripts\activate

# Quote each requirement: an unquoted ">" is treated by the shell as output redirection
pip install "anthropic>=0.30.0" "fastapi>=0.115.0" "uvicorn>=0.34.0" "mangum>=0.19.0" "pydantic>=2.0.0"
pip freeze > requirements.txt

# Set your API key (never hardcode this in files!)
export ANTHROPIC_API_KEY=your-key-here
# Windows: set ANTHROPIC_API_KEY=your-key-here

# Node.js alternative (run these instead of the Python commands above):
mkdir ucc-agent-deploy && cd ucc-agent-deploy
npm init -y
npm install @anthropic-ai/sdk express cors dotenv

# Set your API key (never hardcode this in files!)
export ANTHROPIC_API_KEY=your-key-here
# Windows: set ANTHROPIC_API_KEY=your-key-here

Now create the .env.example file (this gets checked into git as a template):

# Copy this file to .env and fill in your values
# NEVER commit .env to git!
ANTHROPIC_API_KEY=your-anthropic-api-key-here
PORT=8000
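
A service that starts without its configuration tends to fail later and less clearly. The following is a minimal sketch of a fail-fast settings loader for the two variables in .env.example. The names Settings and load_settings are hypothetical and not part of the lab files; the placeholder check assumes the template value shown above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Runtime settings read from environment variables."""
    anthropic_api_key: str
    port: int


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (defaults to os.environ) and fail fast."""
    env = os.environ if env is None else env

    api_key = env.get("ANTHROPIC_API_KEY", "").strip()
    # Reject a missing key and the untouched template placeholder
    if not api_key or api_key == "your-anthropic-api-key-here":
        raise RuntimeError("ANTHROPIC_API_KEY is not set (or is still the placeholder)")

    port = int(env.get("PORT", "8000"))
    if not 1 <= port <= 65535:
        raise RuntimeError(f"PORT out of range: {port}")

    return Settings(anthropic_api_key=api_key, port=port)
```

Passing a plain dict instead of os.environ, as in load_settings({"ANTHROPIC_API_KEY": "sk-test"}), makes the loader easy to exercise without touching your real environment.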

And the .dockerignore file to keep your Docker image clean:

venv/
node_modules/
__pycache__/
*.pyc
.env
.git/
*.md
Run command
ls -la
# You should see: .env.example, .dockerignore, requirements.txt, venv/
# (agent.py and mock_data.py do not exist yet; you create them in Step 2)
Checkpoint

You have a project directory with a virtual environment, installed dependencies, and configuration files. If pip list | grep anthropic shows anthropic 0.30.x or higher, you are ready for Step 2.

Troubleshooting

pip install fails with "externally-managed-environment" → The venv is not active, so pip is targeting the system Python. Run source venv/bin/activate (or venv\Scripts\activate on Windows) first.

python not found → Try python3 instead of python. On some systems only python3 is available.

Step 2: The Agent

What: Create the UCC filing research agent with mock data and two tools: search_filings and get_risk_score. Why: We need a real, working agent to deploy. Using mock data means the lab works without any external database, while the agent logic is identical to what you would use in production with real data sources.

First, create mock_data.py with realistic UCC filing records. This file simulates what you would normally fetch from a database or API:

# mock_data.py — Simulated UCC filing records
# In production, these would come from a database or API

FILINGS = [
    {
        "filing_id": "UCC-2024-NY-001847",
        "debtor_name": "Acme Corporation",
        "debtor_state": "NY",
        "secured_party": "First National Bank",
        "collateral_description": "All inventory, equipment, and accounts receivable",
        "filing_date": "2024-03-15",
        "status": "active",
        "collateral_value": 2500000,
    },
    {
        "filing_id": "UCC-2024-NY-002103",
        "debtor_name": "Acme Corporation",
        "debtor_state": "NY",
        "secured_party": "TechLease Partners LLC",
        "collateral_description": "Specific equipment: CNC machines, serial #TL-8842, #TL-8843",
        "filing_date": "2024-06-01",
        "status": "active",
        "collateral_value": 450000,
    },
    {
        "filing_id": "UCC-2023-CA-009821",
        "debtor_name": "Pacific Freight Inc",
        "debtor_state": "CA",
        "secured_party": "West Coast Capital",
        "collateral_description": "All assets including rolling stock and warehouse inventory",
        "filing_date": "2023-11-20",
        "status": "active",
        "collateral_value": 8700000,
    },
    {
        "filing_id": "UCC-2024-TX-004210",
        "debtor_name": "Lone Star Drilling Co",
        "debtor_state": "TX",
        "secured_party": "Energy Finance Corp",
        "collateral_description": "Drilling equipment, mineral rights assignments, accounts",
        "filing_date": "2024-01-10",
        "status": "active",
        "collateral_value": 15000000,
    },
    {
        "filing_id": "UCC-2022-NY-000412",
        "debtor_name": "Acme Corporation",
        "debtor_state": "NY",
        "secured_party": "Metro Business Lending",
        "collateral_description": "Accounts receivable",
        "filing_date": "2022-08-05",
        "status": "terminated",
        "collateral_value": 750000,
    },
]


def search_filings(debtor_name: str) -> list[dict]:
    """Search filings by debtor name (case-insensitive partial match)."""
    name_lower = debtor_name.lower()
    return [f for f in FILINGS if name_lower in f["debtor_name"].lower()]


def calculate_risk_score(debtor_name: str) -> dict:
    """Calculate a risk score based on filing history."""
    matches = search_filings(debtor_name)
    active = [f for f in matches if f["status"] == "active"]
    total_exposure = sum(f["collateral_value"] for f in active)

    if not matches:
        return {
            "debtor_name": debtor_name,
            "risk_level": "unknown",
            "risk_score": 0,
            "reason": "No filings found for this entity",
            "total_liens": 0,
            "total_exposure": 0,
        }

    if total_exposure > 10000000:
        risk_level, score = "high", 85
    elif total_exposure > 2000000:
        risk_level, score = "medium", 55
    elif len(active) > 1:
        risk_level, score = "medium", 45
    else:
        risk_level, score = "low", 20

    return {
        "debtor_name": debtor_name,
        "risk_level": risk_level,
        "risk_score": score,
        "reason": f"{len(active)} active lien(s) totaling ${total_exposure:,.0f}",
        "total_liens": len(active),
        "total_exposure": total_exposure,
    }

// mock_data.js: Simulated UCC filing records (Node.js version of mock_data.py)
// In production, these would come from a database or API

const FILINGS = [
  {
    filing_id: "UCC-2024-NY-001847",
    debtor_name: "Acme Corporation",
    debtor_state: "NY",
    secured_party: "First National Bank",
    collateral_description: "All inventory, equipment, and accounts receivable",
    filing_date: "2024-03-15",
    status: "active",
    collateral_value: 2500000,
  },
  {
    filing_id: "UCC-2024-NY-002103",
    debtor_name: "Acme Corporation",
    debtor_state: "NY",
    secured_party: "TechLease Partners LLC",
    collateral_description: "Specific equipment: CNC machines, serial #TL-8842, #TL-8843",
    filing_date: "2024-06-01",
    status: "active",
    collateral_value: 450000,
  },
  {
    filing_id: "UCC-2023-CA-009821",
    debtor_name: "Pacific Freight Inc",
    debtor_state: "CA",
    secured_party: "West Coast Capital",
    collateral_description: "All assets including rolling stock and warehouse inventory",
    filing_date: "2023-11-20",
    status: "active",
    collateral_value: 8700000,
  },
  {
    filing_id: "UCC-2024-TX-004210",
    debtor_name: "Lone Star Drilling Co",
    debtor_state: "TX",
    secured_party: "Energy Finance Corp",
    collateral_description: "Drilling equipment, mineral rights assignments, accounts",
    filing_date: "2024-01-10",
    status: "active",
    collateral_value: 15000000,
  },
  {
    filing_id: "UCC-2022-NY-000412",
    debtor_name: "Acme Corporation",
    debtor_state: "NY",
    secured_party: "Metro Business Lending",
    collateral_description: "Accounts receivable",
    filing_date: "2022-08-05",
    status: "terminated",
    collateral_value: 750000,
  },
];

function searchFilings(debtorName) {
  const nameLower = debtorName.toLowerCase();
  return FILINGS.filter(f => f.debtor_name.toLowerCase().includes(nameLower));
}

function calculateRiskScore(debtorName) {
  const matches = searchFilings(debtorName);
  const active = matches.filter(f => f.status === "active");
  const totalExposure = active.reduce((sum, f) => sum + f.collateral_value, 0);

  if (matches.length === 0) {
    return {
      debtor_name: debtorName, risk_level: "unknown", risk_score: 0,
      reason: "No filings found for this entity",
      total_liens: 0, total_exposure: 0,
    };
  }

  let riskLevel, score;
  if (totalExposure > 10000000) { riskLevel = "high"; score = 85; }
  else if (totalExposure > 2000000) { riskLevel = "medium"; score = 55; }
  else if (active.length > 1) { riskLevel = "medium"; score = 45; }
  else { riskLevel = "low"; score = 20; }

  return {
    debtor_name: debtorName, risk_level: riskLevel, risk_score: score,
    reason: `${active.length} active lien(s) totaling $${totalExposure.toLocaleString()}`,
    total_liens: active.length, total_exposure: totalExposure,
  };
}

module.exports = { searchFilings, calculateRiskScore, FILINGS };
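
The order of the threshold checks decides the result, so it helps to work one case by hand. The sketch below isolates the threshold logic from calculate_risk_score in a hypothetical helper named classify_exposure (not part of the lab files) and applies it to the mock records. It leaves out the "unknown" branch for debtors with no filings.

```python
def classify_exposure(total_exposure: int, active_liens: int) -> tuple[str, int]:
    """Mirror the threshold order used by calculate_risk_score."""
    if total_exposure > 10_000_000:
        return "high", 85
    if total_exposure > 2_000_000:
        return "medium", 55
    if active_liens > 1:
        return "medium", 45
    return "low", 20


# Acme Corporation: two active liens; the terminated 750,000 filing is excluded
acme_exposure = 2_500_000 + 450_000
print(classify_exposure(acme_exposure, 2))  # ('medium', 55)
print(classify_exposure(15_000_000, 1))     # ('high', 85), Lone Star Drilling Co
print(classify_exposure(8_700_000, 1))      # ('medium', 55), Pacific Freight Inc
```

Because exposure is checked before lien count, Acme lands on the 55 score from the exposure branch rather than the 45 score from the multiple-liens branch.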

Now let's build the agent itself. If you completed M12 (ReAct Pattern) and M15B (Build Complete Agent System), this will look familiar. The core pattern is a tool use loop: send a message to Claude, check whether it wants to call a tool, execute the tool, send the result back, and repeat until Claude's stop_reason is "end_turn" instead of "tool_use". Here we simplify it into a single-file agent with two tools:

# agent.py — UCC Filing Research Agent
# Uses the Anthropic Messages API with tool use (M12 pattern)

import json
import os
import anthropic
from mock_data import search_filings, calculate_risk_score

# --- Tool definitions tell Claude what functions are available ---
# Each tool has a name, description (crucial for selection accuracy),
# and an input_schema that Claude uses to generate valid arguments.

TOOLS = [
    {
        "name": "search_filings",
        "description": (
            "Search UCC filings by debtor name. Returns a list of filing "
            "records including filing ID, secured party, collateral "
            "description, filing date, status, and collateral value. "
            "Use this to find all liens against a specific company."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "debtor_name": {
                    "type": "string",
                    "description": "The company or individual name to search for (partial match supported)",
                }
            },
            "required": ["debtor_name"],
        },
    },
    {
        "name": "get_risk_score",
        "description": (
            "Calculate a lien risk score for a debtor based on their UCC "
            "filing history. Returns risk level (low/medium/high), numeric "
            "score (0-100), total active liens, and total collateral exposure. "
            "Use this after searching filings to assess overall risk."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "debtor_name": {
                    "type": "string",
                    "description": "The debtor name to assess risk for",
                }
            },
            "required": ["debtor_name"],
        },
    },
]

# --- Map tool names to Python functions ---
TOOL_HANDLERS = {
    "search_filings": lambda args: search_filings(args["debtor_name"]),
    "get_risk_score": lambda args: calculate_risk_score(args["debtor_name"]),
}


def run_agent(question: str, max_turns: int = 10) -> str:
    """
    Run the UCC research agent with a tool use loop.

    The loop checks stop_reason after each Claude response:
    - "tool_use" means Claude wants to call a tool — execute it and continue
    - "end_turn" means Claude is done — return the final text

    max_turns is a safety net to prevent infinite loops, not a control mechanism.
    """
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from environment

    messages = [{"role": "user", "content": question}]

    system_prompt = (
        "You are a UCC filing research assistant. You help users investigate "
        "Uniform Commercial Code filings, assess lien risk, and understand "
        "collateral exposure for business entities. Always search for filings "
        "first, then calculate risk scores to give comprehensive answers. "
        "Present findings clearly with specific numbers and filing IDs."
    )

    for turn in range(max_turns):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                system=system_prompt,
                tools=TOOLS,
                messages=messages,
            )
        except anthropic.AuthenticationError:
            # Must come before APIError: AuthenticationError is a subclass of it
            return "Authentication failed. Check your ANTHROPIC_API_KEY."
        except anthropic.APIError as e:
            return f"API error: {e.message}"

        # Check if Claude wants to use tools or is done
        if response.stop_reason == "end_turn":
            # Extract the text from the response
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
            return "Agent completed but produced no text output."

        if response.stop_reason == "tool_use":
            # Append Claude's response (which includes tool_use blocks)
            messages.append({"role": "assistant", "content": response.content})

            # Execute each tool call and collect results
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    handler = TOOL_HANDLERS.get(block.name)
                    if handler:
                        try:
                            result = handler(block.input)
                            tool_results.append({
                                "type": "tool_result",
                                "tool_use_id": block.id,
                                "content": json.dumps(result, default=str),
                            })
                        except Exception as exc:
                            tool_results.append({
                                "type": "tool_result",
                                "tool_use_id": block.id,
                                "content": json.dumps({
                                    "error": str(exc),
                                    "is_retryable": False,
                                }),
                                "is_error": True,
                            })
                    else:
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": json.dumps({
                                "error": f"Unknown tool: {block.name}",
                                "is_retryable": False,
                            }),
                            "is_error": True,
                        })

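            messages.append({"role": "user", "content": tool_results})
        else:
            # Any other stop_reason (for example "max_tokens"): stop here rather
            # than resend the same messages until max_turns runs out
            return f"Agent stopped early (stop_reason: {response.stop_reason})."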

    return "Agent reached maximum turns without completing. Try a simpler question."


if __name__ == "__main__":
    # Quick test — run directly to verify the agent works
    answer = run_agent("What is the lien exposure for Acme Corporation?")
    print(answer)

// agent.js: UCC Filing Research Agent (Node.js version of agent.py)
// Uses the Anthropic Messages API with tool use (M12 pattern)

const Anthropic = require("@anthropic-ai/sdk");
const { searchFilings, calculateRiskScore } = require("./mock_data");

const TOOLS = [
  {
    name: "search_filings",
    description:
      "Search UCC filings by debtor name. Returns a list of filing " +
      "records including filing ID, secured party, collateral " +
      "description, filing date, status, and collateral value. " +
      "Use this to find all liens against a specific company.",
    input_schema: {
      type: "object",
      properties: {
        debtor_name: {
          type: "string",
          description: "The company or individual name to search for (partial match supported)",
        },
      },
      required: ["debtor_name"],
    },
  },
  {
    name: "get_risk_score",
    description:
      "Calculate a lien risk score for a debtor based on their UCC " +
      "filing history. Returns risk level (low/medium/high), numeric " +
      "score (0-100), total active liens, and total collateral exposure. " +
      "Use this after searching filings to assess overall risk.",
    input_schema: {
      type: "object",
      properties: {
        debtor_name: {
          type: "string",
          description: "The debtor name to assess risk for",
        },
      },
      required: ["debtor_name"],
    },
  },
];

const TOOL_HANDLERS = {
  search_filings: (args) => searchFilings(args.debtor_name),
  get_risk_score: (args) => calculateRiskScore(args.debtor_name),
};

async function runAgent(question, maxTurns = 10) {
  const client = new Anthropic(); // reads ANTHROPIC_API_KEY from env

  const systemPrompt =
    "You are a UCC filing research assistant. You help users investigate " +
    "Uniform Commercial Code filings, assess lien risk, and understand " +
    "collateral exposure for business entities. Always search for filings " +
    "first, then calculate risk scores to give comprehensive answers. " +
    "Present findings clearly with specific numbers and filing IDs.";

  const messages = [{ role: "user", content: question }];

  for (let turn = 0; turn < maxTurns; turn++) {
    let response;
    try {
      response = await client.messages.create({
        model: "claude-sonnet-4-6",
        max_tokens: 1024,
        system: systemPrompt,
        tools: TOOLS,
        messages,
      });
    } catch (err) {
      if (err instanceof Anthropic.AuthenticationError) {
        return "Authentication failed. Check your ANTHROPIC_API_KEY.";
      }
      return `API error: ${err.message}`;
    }

    if (response.stop_reason === "end_turn") {
      const textBlock = response.content.find((b) => b.type === "text");
      return textBlock ? textBlock.text : "Agent completed but produced no text.";
    }

    if (response.stop_reason === "tool_use") {
      messages.push({ role: "assistant", content: response.content });

      const toolResults = [];
      for (const block of response.content) {
        if (block.type === "tool_use") {
          const handler = TOOL_HANDLERS[block.name];
          if (handler) {
            try {
              const result = handler(block.input);
              toolResults.push({
                type: "tool_result",
                tool_use_id: block.id,
                content: JSON.stringify(result),
              });
            } catch (exc) {
              toolResults.push({
                type: "tool_result",
                tool_use_id: block.id,
                content: JSON.stringify({ error: exc.message, is_retryable: false }),
                is_error: true,
              });
            }
          } else {
            toolResults.push({
              type: "tool_result",
              tool_use_id: block.id,
              content: JSON.stringify({ error: `Unknown tool: ${block.name}`, is_retryable: false }),
              is_error: true,
            });
          }
        }
      }
      messages.push({ role: "user", content: toolResults });
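    } else {
      // Any other stop_reason (for example "max_tokens"): stop instead of resending the same messages
      return `Agent stopped early (stop_reason: ${response.stop_reason}).`;
    }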
  }
  return "Agent reached maximum turns without completing.";
}

module.exports = { runAgent };

// Quick test when run directly
if (require.main === module) {
  runAgent("What is the lien exposure for Acme Corporation?").then(console.log);
}
What Just Happened?

You created two files: mock_data.py provides five realistic UCC filing records with search and risk functions, and agent.py wraps those functions as Claude tools with a standard tool use loop. The agent reads ANTHROPIC_API_KEY from the environment — never from a hardcoded string. When a user asks about a debtor, Claude calls search_filings to get the data, then get_risk_score to assess risk, then composes a human-readable answer.
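
To watch the message flow without spending API credits, the loop can be run against a scripted stand-in for the model. This is only a sketch: fake_model and run_loop are hypothetical names, and the stand-in imitates just the two response shapes the loop relies on (stop_reason plus a list of content blocks), not the real SDK objects.

```python
import json
from types import SimpleNamespace


def fake_model(messages):
    """Hypothetical stand-in for client.messages.create: scripted, no network."""
    last = messages[-1]
    if isinstance(last["content"], str):
        # First turn (plain user question): request one tool call
        block = SimpleNamespace(type="tool_use", id="toolu_1",
                                name="search_filings",
                                input={"debtor_name": "Acme"})
        return SimpleNamespace(stop_reason="tool_use", content=[block])
    # A tool_result list came back: finish with a text answer
    results = json.loads(last["content"][0]["content"])
    text = SimpleNamespace(type="text", text=f"Found {len(results)} filing(s).")
    return SimpleNamespace(stop_reason="end_turn", content=[text])


def run_loop(question, handlers, max_turns=5):
    """Same control flow as run_agent, with the model call swapped out."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        response = fake_model(messages)
        if response.stop_reason == "end_turn":
            return next(b.text for b in response.content if b.type == "text")
        # Echo the assistant turn, then answer every tool_use block
        messages.append({"role": "assistant", "content": response.content})
        results = [
            {"type": "tool_result", "tool_use_id": b.id,
             "content": json.dumps(handlers[b.name](b.input))}
            for b in response.content if b.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
    return "max turns reached"


handlers = {"search_filings": lambda args: [{"filing_id": "UCC-2024-NY-001847"}]}
print(run_loop("Any liens on Acme?", handlers))  # Found 1 filing(s).
```

The conversation ends up with three messages before the final answer: the user question, the assistant turn containing the tool_use block, and a user turn carrying the matching tool_result.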

Run command
python agent.py
Expected output (will vary since Claude generates the text)
Based on the UCC filing search, Acme Corporation has the following lien exposure:

**Active Filings (2):**
1. UCC-2024-NY-001847 - First National Bank
   Collateral: All inventory, equipment, and accounts receivable
   Value: $2,500,000
2. UCC-2024-NY-002103 - TechLease Partners LLC
   Collateral: CNC machines (serial #TL-8842, #TL-8843)
   Value: $450,000

**Risk Assessment:**
- Risk Level: Medium (score: 55/100)
- Total Active Liens: 2
- Total Exposure: $2,950,000

Note: One previous filing (UCC-2022-NY-000412) has been terminated.
Checkpoint

If you see a response mentioning Acme Corporation's filings and risk score, the agent is working. The exact text will differ each run since Claude generates it, but the numbers should match the mock data. If you see "Authentication failed", double-check your ANTHROPIC_API_KEY environment variable.

Troubleshooting

ModuleNotFoundError: No module named 'anthropic' → Make sure your venv is activated. Run source venv/bin/activate then try again.

AuthenticationError → Your API key is missing or invalid. Run echo $ANTHROPIC_API_KEY to verify it is set.

Connection error → Check your internet connection. The agent needs to reach api.anthropic.com.

Step 3: Wrap the Agent as a FastAPI Server

What: Create a REST API server that exposes the agent through HTTP endpoints. Why: Docker, Cloud Run, and Lambda all need your agent to respond to HTTP requests. Right now, your agent only works as a Python script you run from the terminal. A REST API lets any client — a web app, a mobile app, a curl command — send questions and get answers over HTTP.

We will use FastAPI as the web framework. FastAPI is a modern Python web framework built for APIs: it uses Python type hints to validate inputs, auto-generate interactive API docs, and support async operations out of the box. For API-focused projects it is often quicker to develop with than Flask. Input validation comes from Pydantic, a library that checks incoming data against the typed fields of a model class and rejects anything that does not match. If a client sends malformed JSON, the request is rejected before it reaches the agent.

Everyday Analogy

BEFORE: Right now, your agent is like a brilliant researcher who only works in person. You have to walk to their office, sit down, and ask your question face-to-face. Only one person at a time. No remote access.

THE PAIN: That means nobody else can use this researcher. Your teammates cannot ask questions. Your web app cannot ask questions. The researcher is locked in one room with one person.

THE FIX: Wrapping the agent in a REST API is like giving that researcher a phone number. Now anyone in the world can call them, ask a question, and get an answer back — without being in the same room. The FastAPI server IS that phone system: it receives calls (HTTP requests), routes them to the researcher (agent), and sends back the answer (HTTP response).

Here is what the "phone call" actually looks like in practice — the raw HTTP request and response your server handles:

POST /query HTTP/1.1
Host: localhost:8000
Content-Type: application/json

{"question": "Find filings for Acme Corporation"}

---

HTTP/1.1 200 OK

{"answer": "Acme Corporation has 2 active liens...", "elapsed_seconds": 4.72, "status": "success"}
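
The same request can be assembled from Python with only the standard library. This is a sketch of a client, not part of the lab files: build_query_request and ask_agent are hypothetical names, and the base URL assumes the server is listening on localhost:8000.

```python
import json
import urllib.request


def build_query_request(base_url: str, question: str) -> urllib.request.Request:
    """Build (but do not send) the POST /query request."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_agent(base_url: str, question: str, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    request = build_query_request(base_url, question)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


req = build_query_request("http://localhost:8000", "Find filings for Acme Corporation")
print(req.get_method(), req.full_url)  # POST http://localhost:8000/query
```

Keeping request construction separate from sending lets you inspect the method, URL, and body before any network call is made. The generous default timeout reflects that an agent answer can take several seconds.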

Create server.py. Let's walk through it in three logical chunks so you understand the reasoning behind each part:

Chunk 1: Imports and models. The first thing any API needs is a contract: what shape does the request come in, and what shape does the response go out? That is what the Pydantic models define. When a client sends {"question": 123} instead of a string, Pydantic catches the type mismatch and returns a clear 422 error, so your agent never sees the bad request. This matters because an unvalidated question could cause confusing errors deep in the agent loop.

Chunk 2: The /query endpoint. This is the heart of the server. It receives a question, calls run_agent(), times how long it takes, and returns the answer wrapped in JSON. The important design choice is the try/except block: it catches any exception that escapes the agent and returns an HTTP 500 response with a readable detail message, so a failed query produces a logged, structured error while the server keeps serving other requests. Note that run_agent() already converts Anthropic API errors into plain-text answers, so those come back as a normal 200 response; the 500 path covers unexpected failures.

Chunk 3: The /health endpoint. Every production service needs a health check, and skipping it is a common beginner mistake. Load balancers (systems that distribute incoming traffic across multiple servers) and container orchestrators (Docker, Kubernetes, Cloud Run) use this endpoint to decide which instances are alive and should receive traffic, typically polling it every 10 to 30 seconds. If it stops responding, traffic gets routed elsewhere and your container gets restarted. Our health check also verifies the API key is configured, because a service without a key is "alive" but useless.
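
To make the polling behavior concrete, here is a rough sketch of what an orchestrator's health probe does, using only the standard library. The names http_probe and wait_until_healthy are hypothetical, and the attempt count and delay are illustrative rather than any platform's real defaults.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True when GET url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and non-2xx replies such as 503
        return False


def wait_until_healthy(probe: Callable[[], bool],
                       attempts: int = 5,
                       delay: float = 1.0) -> bool:
    """Call probe up to `attempts` times, sleeping `delay` seconds between failures."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Once the server is running, wait_until_healthy(lambda: http_probe("http://localhost:8000/health")) returns True as soon as /health answers 200. A 503 from a missing API key counts as unhealthy, which matches the intent of the check.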

# server.py — FastAPI wrapper for the UCC research agent
# Exposes the agent as a REST API with health check

import os
import time
import logging
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field

from agent import run_agent

# --- Logging setup ---
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ucc-agent-api")

# --- Request/Response models ---
# Pydantic validates every incoming request automatically.
# If a client sends {"question": 123} instead of a string, they get
# a clear 422 error before the agent ever runs.

class QueryRequest(BaseModel):
    question: str = Field(
        ...,
        min_length=1,
        max_length=2000,
        description="The question to ask the UCC research agent",
        json_schema_extra={"examples": ["Find filings for Acme Corporation"]},
    )

class QueryResponse(BaseModel):
    answer: str
    elapsed_seconds: float
    status: str = "success"

class HealthResponse(BaseModel):
    status: str
    service: str
    version: str

class ErrorResponse(BaseModel):
    detail: str
    status: str = "error"

# --- Application setup ---
app = FastAPI(
    title="UCC Filing Research Agent API",
    description="AI-powered UCC filing research and lien risk assessment",
    version="1.0.0",
)

# Allow browser-based clients to call this API
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Restrict in production!
    allow_credentials=False,  # the CORS spec forbids credentials with a wildcard origin
    allow_methods=["*"],
    allow_headers=["*"],
)

# Track request count for basic metrics
request_count = 0
error_count = 0
start_time = time.time()


@app.get("/health", response_model=HealthResponse, tags=["System"])
async def health_check():
    """Health check endpoint for load balancers and container orchestration."""
    # Check that the API key is configured (do not reveal the key!)
    api_key = os.environ.get("ANTHROPIC_API_KEY", "")
    if not api_key:
        raise HTTPException(
            status_code=503,
            detail="ANTHROPIC_API_KEY not configured",
        )
    return HealthResponse(
        status="healthy",
        service="ucc-agent",
        version="1.0.0",
    )


@app.post(
    "/query",
    response_model=QueryResponse,
    responses={500: {"model": ErrorResponse}},
    tags=["Agent"],
)
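# Plain def (not async def): run_agent() blocks, so FastAPI runs this handler in a
# worker thread and the event loop stays free to answer /health during a query.
def query_agent(request: QueryRequest):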
    """Send a question to the UCC research agent and get an answer."""
    global request_count, error_count
    request_count += 1

    logger.info(f"Query received: {request.question[:80]}...")
    t0 = time.time()

    try:
        answer = run_agent(request.question)
        elapsed = round(time.time() - t0, 2)
        logger.info(f"Query completed in {elapsed}s")
        return QueryResponse(
            answer=answer,
            elapsed_seconds=elapsed,
        )
    except Exception as exc:
        error_count += 1
        logger.error(f"Agent error: {exc}")
        raise HTTPException(
            status_code=500,
            detail=f"Agent failed: {str(exc)}",
        )


@app.get("/metrics", tags=["System"])
async def metrics():
    """Basic metrics endpoint for monitoring."""
    uptime = round(time.time() - start_time, 1)
    return {
        "total_requests": request_count,
        "total_errors": error_count,
        "uptime_seconds": uptime,
        "error_rate": round(error_count / max(request_count, 1), 4),
    }


if __name__ == "__main__":
    import uvicorn
    port = int(os.environ.get("PORT", 8000))
    uvicorn.run(app, host="0.0.0.0", port=port)

// server.js: Express wrapper for the UCC research agent (Node.js version of server.py)
// Exposes the agent as a REST API with health check

const express = require("express");
const cors = require("cors");
const { runAgent } = require("./agent");

const app = express();
app.use(cors());
app.use(express.json());

let requestCount = 0;
let errorCount = 0;
const startTime = Date.now();

// --- Health check ---
app.get("/health", (req, res) => {
  const apiKey = process.env.ANTHROPIC_API_KEY || "";
  if (!apiKey) {
    return res.status(503).json({
      status: "unhealthy",
      service: "ucc-agent",
      detail: "ANTHROPIC_API_KEY not configured",
    });
  }
  res.json({ status: "healthy", service: "ucc-agent", version: "1.0.0" });
});

// --- Main query endpoint ---
app.post("/query", async (req, res) => {
  requestCount++;
  const { question } = req.body;

  if (!question || typeof question !== "string" || question.length === 0) {
    return res.status(422).json({
      detail: "Missing or invalid 'question' field (must be a non-empty string)",
      status: "error",
    });
  }
  if (question.length > 2000) {
    return res.status(422).json({
      detail: "Question must be 2000 characters or fewer",
      status: "error",
    });
  }

  console.log(`Query received: ${question.substring(0, 80)}...`);
  const t0 = Date.now();

  try {
    const answer = await runAgent(question);
    const elapsed = ((Date.now() - t0) / 1000).toFixed(2);
    console.log(`Query completed in ${elapsed}s`);
    res.json({ answer, elapsed_seconds: parseFloat(elapsed), status: "success" });
  } catch (err) {
    errorCount++;
    console.error(`Agent error: ${err.message}`);
    res.status(500).json({ detail: `Agent failed: ${err.message}`, status: "error" });
  }
});

// --- Metrics ---
app.get("/metrics", (req, res) => {
  const uptime = ((Date.now() - startTime) / 1000).toFixed(1);
  res.json({
    total_requests: requestCount,
    total_errors: errorCount,
    uptime_seconds: parseFloat(uptime),
    error_rate: parseFloat((errorCount / Math.max(requestCount, 1)).toFixed(4)),
  });
});

const port = process.env.PORT || 8000;
app.listen(port, "0.0.0.0", () => {
  console.log(`UCC Agent API running on http://0.0.0.0:${port}`);
  console.log(`Health: http://localhost:${port}/health`);
});
What Just Happened?

You created an HTTP server that wraps the agent. It has three endpoints: POST /query sends a question and returns the answer, GET /health tells infrastructure your service is alive, and GET /metrics provides basic monitoring data. The server validates input (rejects empty or oversized questions), logs every request, and catches errors gracefully. This same server works for local development, Docker, Cloud Run, and Lambda.

Common Misconceptions

"I need different server code for each cloud platform" — No. Your server.py is platform-agnostic. The same FastAPI app runs locally with uvicorn, inside Docker, on Cloud Run (Docker + managed infrastructure), and on Lambda (via the Mangum adapter). You never rewrite server logic for different platforms.
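
"FastAPI is only for async workloads" — FastAPI supports both sync and async. Our run_agent() function is synchronous (it blocks while waiting for Claude's API response), and that is fine as long as the endpoint that calls it is declared with plain def. FastAPI runs def endpoints in a thread pool, so other requests are not blocked. A blocking call made directly inside an async def endpoint runs on the event loop and stalls every other request until it returns; in that case, declare the route with def or offload the call with run_in_threadpool.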

"CORS should always be allow_origins=['*']" — Only in development! In production, restrict allow_origins to your specific frontend domain(s). A wildcard means any website can call your agent API, which creates security and cost risks.

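The thread-pool point can be checked without FastAPI. The sketch below uses only the standard library: fake_run_agent is a hypothetical stand-in for the blocking run_agent() call, and asyncio.to_thread plays the role of the thread pool that FastAPI applies to endpoints declared with plain def. It illustrates the behavior; it is not FastAPI's internal implementation.

```python
import asyncio
import time


def fake_run_agent(question: str) -> str:
    """Hypothetical stand-in for run_agent(): blocks like a slow API call."""
    time.sleep(0.3)
    return f"answer to: {question}"


async def blocking_endpoint(question: str) -> str:
    # Calling blocking code directly inside a coroutine stalls the event loop.
    return fake_run_agent(question)


async def threaded_endpoint(question: str) -> str:
    # Offloading to a worker thread keeps the event loop free for other requests.
    return await asyncio.to_thread(fake_run_agent, question)


async def time_two_requests(endpoint) -> float:
    """Run two requests concurrently and return the wall-clock time in seconds."""
    t0 = time.perf_counter()
    await asyncio.gather(endpoint("a"), endpoint("b"))
    return time.perf_counter() - t0


def compare() -> tuple[float, float]:
    """Return (seconds when blocking the loop, seconds when using threads)."""
    blocked = asyncio.run(time_two_requests(blocking_endpoint))
    threaded = asyncio.run(time_two_requests(threaded_endpoint))
    return blocked, threaded
```

Calling compare() should show the two blocking requests running one after the other, while the threaded pair overlaps and finishes in roughly the time of a single call.
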
Step 4: Test Locally (Without Docker)

What: Run the server on your machine and verify all endpoints work. Why: Always test locally before containerizing. If something breaks in Docker, you want to know whether the bug is in your code or your Docker setup. Testing locally first establishes a known-good baseline.

Start the server
uvicorn server:app --host 0.0.0.0 --port 8000 --reload
Expected output
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: Started reloader process [12345]

In a separate terminal, test each endpoint:

# Test 1: Health check
curl -s http://localhost:8000/health | python -m json.tool

# Test 2: Query the agent
curl -s -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "Find filings for Acme Corporation"}' | python -m json.tool

# Test 3: Metrics
curl -s http://localhost:8000/metrics | python -m json.tool

# Test 4: Invalid request (should get 422)
curl -s -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": ""}' | python -m json.tool
Expected output for health check
{ "status": "healthy", "service": "ucc-agent", "version": "1.0.0" }
Expected output for query
{ "answer": "Based on the UCC filing search, Acme Corporation has 2 active liens...", "elapsed_seconds": 4.72, "status": "success" }
Checkpoint

All four curl commands should work: health returns "healthy", query returns an answer with filing details, metrics shows your request count, and the empty-question test returns a 422 validation error. Stop the server with Ctrl+C when done.
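
The four curl checks can also be scripted, which is handy once the same tests need to run against Docker, Cloud Run, and Lambda. This is a minimal sketch using only the standard library; the function names call and smoke_test are made up for illustration, and the response fields it inspects are the ones shown in the expected output for this step.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def call(base_url: str, path: str, payload: Optional[dict] = None) -> tuple[int, dict]:
    """Send GET (no payload) or POST (JSON payload); return (status, parsed body)."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(
        base_url + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST" if payload is not None else "GET",
    )
    try:
        with urllib.request.urlopen(req, timeout=60) as resp:
            return resp.status, json.loads(resp.read())
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a JSON body worth inspecting.
        return err.code, json.loads(err.read() or b"{}")


def smoke_test(base_url: str) -> dict[str, bool]:
    """Mirror the four curl checks from this step."""
    health = call(base_url, "/health")
    query = call(base_url, "/query", {"question": "Find filings for Acme Corporation"})
    metrics = call(base_url, "/metrics")
    invalid = call(base_url, "/query", {"question": ""})
    return {
        "health": health[0] == 200 and health[1].get("status") == "healthy",
        "query": query[0] == 200 and bool(query[1].get("answer")),
        "metrics": metrics[0] == 200 and "total_requests" in metrics[1],
        "rejects_empty": invalid[0] == 422,
    }
```

With the server running, print(smoke_test("http://localhost:8000")) should report True for every check. Pointing the same function at another base URL reuses the checks unchanged.
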

Troubleshooting

"Address already in use" → Another process is using port 8000. Either stop it (lsof -ti:8000 | xargs kill) or use a different port: --port 8001.

"ModuleNotFoundError: No module named 'fastapi'" → Your venv is not activated. Run source venv/bin/activate and try again.

Request Flow: Client to Agent to Response
Client (curl / browser)
↓ POST /query {"question": "..."}
FastAPI Server (server.py)
↓ run_agent(question)
UCC Agent (agent.py)
↓ tool calls
search_filings()
get_risk_score()
↑ tool results
Claude composes answer
↑ JSON response
{"answer": "...", "status": "success"}
Static diagram: Client sends POST /query to FastAPI, which calls the agent, which calls tools (search_filings, get_risk_score), then Claude composes an answer that flows back as a JSON response.
Your agent works locally with uvicorn. But your machine is not a production server — it goes to sleep, it is behind a firewall, and it does not scale. The next step is to package everything into a Docker container, which makes your agent portable: the same container runs identically on your laptop, a cloud VM, or a managed service like Cloud Run.

Step 5: Create the Dockerfile

What: Write a Dockerfile that packages the agent, server, and all dependencies into a portable container image. A Dockerfile is a text file containing instructions to build a Docker image. Each instruction (FROM, COPY, RUN) creates a layer, and the final image contains everything needed to run your application: OS, Python, dependencies, and your code. A container image is a lightweight, standalone package that includes your application and everything it needs to run (code, runtime, libraries, system tools). Unlike a virtual machine, a container shares the host OS kernel, which makes it much smaller and faster to start.

Why: Containers solve "works on my machine" problems. Your Docker image runs identically everywhere — your laptop, a colleague's machine, GCP, or AWS. It also provides isolation: your agent cannot accidentally access files or processes outside the container. If something goes wrong, the damage stays inside the container — it cannot touch the host system.

Everyday Analogy

BEFORE: Imagine shipping a home-cooked meal to a friend. You would need to send the recipe, the exact brand of every ingredient, the specific oven model, and instructions for their kitchen layout. If anything differs, the result tastes different.

THE PAIN: That is what deploying software without containers feels like. "It works on my machine" fails because your machine has Python 3.12, your colleague has 3.10, the server has different system libraries, and the cloud VM has a different OS entirely.

THE FIX: A Docker container is like shipping a fully-equipped kitchen WITH the meal inside. The container includes the OS, Python, all libraries, and your code. Open it anywhere and it runs identically. Here is what the shipping label (Dockerfile) looks like — each line adds one item to the kitchen:

FROM python:3.12-slim                  # The kitchen (OS + Python)
COPY requirements.txt .                # The ingredient list
RUN pip install -r req...              # Buy all ingredients
COPY *.py .                            # The recipe (your code)
CMD ["python", "-m", "uvicorn", ...]   # "Start cooking"

We will annotate every line. Each Dockerfile instruction creates a layer in the image, and the order matters for build speed. Layers are cached, so when you change your code but not your dependencies, Docker only rebuilds the code layer (fast) instead of reinstalling all packages (slow).

# Dockerfile — Multi-stage build for the UCC agent

# ----- Stage 1: Build dependencies in a full Python image -----
# WHAT: Use the full Python image to install compiled dependencies
# WHY: Some pip packages (like uvloop) need C compilers to build.
#       The slim image does not have compilers, so we build here first.
FROM python:3.12 AS builder

WORKDIR /build

# WHAT: Copy only requirements.txt first, then install
# WHY: Docker caches each layer. If requirements.txt hasn't changed,
#       Docker reuses the cached pip install (saves 30-60 seconds).
# GOTCHA: If you COPY . . first, ANY code change invalidates the cache.
COPY requirements.txt .
RUN pip install --no-cache-dir --target=/build/deps -r requirements.txt

# ----- Stage 2: Slim runtime image -----
# WHAT: Start fresh with a minimal Python image
# WHY: The builder image is ~900MB (compilers, headers). The slim image
#       is ~150MB. We only copy the installed packages, not the compilers.
FROM python:3.12-slim

# WHAT: Create a non-root user
# WHY: Running as root inside a container is a security risk. If an
#       attacker exploits a bug, they get root access to the container.
#       A non-root user limits the damage.
RUN useradd --create-home --shell /bin/bash agent

WORKDIR /app

# WHAT: Copy installed dependencies from the builder stage
# WHY: We get all the pip packages without the build tools
COPY --from=builder /build/deps /usr/local/lib/python3.12/site-packages/

# WHAT: Copy application source code
# WHY: This layer changes most often (your code changes), so it goes LAST
#       to maximize cache hits on the layers above.
COPY mock_data.py agent.py server.py ./

# WHAT: Switch to the non-root user
USER agent

# WHAT: Expose port 8000 and set the default command
# WHY: EXPOSE documents which port the container listens on.
#       CMD sets what runs when the container starts.
# GOTCHA: EXPOSE does not publish the port — you still need -p 8000:8000
EXPOSE 8000
ENV PORT=8000

# WHAT: Run uvicorn with production settings
# WHY: --host 0.0.0.0 makes it accessible from outside the container.
#       --workers 1 is fine for agent workloads (they are I/O bound,
#       waiting on Claude API, not CPU bound).
CMD ["python", "-m", "uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
# Dockerfile — Node.js version for the UCC agent

FROM node:20-slim

# Create non-root user for security
RUN useradd --create-home --shell /bin/bash agent

WORKDIR /app

# Copy package files first for layer caching
COPY package.json package-lock.json* ./
RUN npm ci --omit=dev

# Copy application code (changes most often = last layer)
COPY mock_data.js agent.js server.js ./

# Switch to non-root user
USER agent

EXPOSE 8000
ENV PORT=8000

CMD ["node", "server.js"]
Common Misconceptions

"I need a different Dockerfile for each cloud provider" — No. The same Docker image runs on Cloud Run, AWS ECS, Azure Container Instances, and your laptop. That is the whole point of containers — build once, run anywhere. The only thing that changes is how you pass secrets and configure networking.

"Multi-stage builds are an optional optimization" — For production, they are essential. A single-stage build with python:3.12 produces a ~900MB image (includes compilers, headers, build tools). Multi-stage drops it to ~235MB. Smaller images mean faster deployments, lower storage costs, and reduced attack surface.

"EXPOSE publishes the port" — No! EXPOSE 8000 is documentation only. It tells humans and tools "this container listens on 8000." You still need -p 8000:8000 in docker run to actually map the port. Many beginners skip the -p flag thinking EXPOSE handled it, then wonder why they cannot connect.

Security: Never Bake Secrets Into Images

Notice that ANTHROPIC_API_KEY is NOT in the Dockerfile. Docker images get pushed to registries, shared with teammates, and stored in CI/CD systems. If you bake a secret into an image, anyone who pulls the image can extract it. Always pass secrets at runtime via -e flags or a secrets manager.
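
On the application side, the runtime-secret pattern can be as small as the sketch below. The helper names load_api_key and mask_secret are hypothetical (they are not part of the lab files); the idea is to read the key from the environment when the process starts, fail fast if it is missing, and never write the full value to logs.

```python
import os
from typing import Mapping


def load_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Read the key at runtime and fail fast if it is missing."""
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "ANTHROPIC_API_KEY is not set; pass it with -e or a secrets manager"
        )
    return key


def mask_secret(value: str, visible: int = 4) -> str:
    """Return a log-safe form that shows only the last few characters."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]
```

A startup log line such as logger.info("key loaded: %s", mask_secret(load_api_key())) confirms that a key is present without exposing it.
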

Cert Tip — Domain 3.1

The exam tests secret management best practices for agent deployments. Know the hierarchy: environment variables (acceptable for dev), cloud secret managers (GCP Secret Manager, AWS Secrets Manager — required for production), and never hardcoded in code, Dockerfiles, or config files committed to git. Anti-pattern: using --set-env-vars with the actual key value in CI/CD logs.

Run command (verify the Dockerfile is valid)
docker build -t ucc-agent . --progress=plain 2>&1 | tail -5
Expected output
=> [stage-1 4/5] COPY --from=builder /build/deps /usr/local/lib/python3.12/site-packages/
=> [stage-1 5/5] COPY mock_data.py agent.py server.py ./
=> exporting to image
=> naming to docker.io/library/ucc-agent
=> done
Checkpoint

If the build completes with "naming to docker.io/library/ucc-agent", your Dockerfile is correct. Run docker images ucc-agent to verify the image exists — it should be approximately 235 MB. If the build fails, check the error message and verify all .py files exist in your project directory.

Troubleshooting

"COPY failed: file not found in build context" → You are missing one of the .py files. Run ls *.py to confirm agent.py, mock_data.py, and server.py all exist.

"Cannot connect to the Docker daemon" → Docker Desktop is not running. Start it and wait for the whale icon to appear in your system tray.

Build takes more than 5 minutes → The first build downloads both base images: the full python:3.12 builder image (~900 MB) and python:3.12-slim (~150 MB). Subsequent builds reuse the cached layers and take under 10 seconds when only the code layer changes.

Docker Image Layers (Built Bottom to Top)
CMD uvicorn server:app 0 KB (metadata)
COPY *.py (your code) ~12 KB
COPY deps from builder ~85 MB
RUN useradd agent ~1 KB
FROM python:3.12-slim ~150 MB

Total image size: ~235 MB — Layers are cached. Code changes only rebuild the top 12 KB layer.

Static diagram: Docker layers from bottom to top: python:3.12-slim (150MB), useradd (1KB), deps (85MB), your code (12KB), CMD (metadata). Code changes only rebuild top layer.
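
The layer arithmetic and the caching rule from the diagram can be written out as a short sketch. It is a simplified model, not how Docker is implemented: the sizes are the approximate figures from the diagram, 1 MB is treated as 1000 KB, and layers_to_rebuild is a hypothetical helper name.

```python
# Layers from bottom to top, with approximate sizes in KB taken from the diagram.
LAYERS = [
    ("FROM python:3.12-slim", 150 * 1000),
    ("RUN useradd agent", 1),
    ("COPY deps from builder", 85 * 1000),
    ("COPY *.py (your code)", 12),
    ("CMD uvicorn server:app", 0),
]


def total_size_mb(layers: list[tuple[str, int]]) -> float:
    """Sum the layer sizes and convert KB to MB."""
    return sum(size_kb for _, size_kb in layers) / 1000


def layers_to_rebuild(layers: list[tuple[str, int]], changed: str) -> list[str]:
    """A changed instruction invalidates its own layer and every layer above it."""
    names = [name for name, _ in layers]
    return names[names.index(changed):]
```

Under this model a code change rebuilds only the code layer and the zero-size CMD metadata above it, while a dependency change also forces the 85 MB dependency layer to be rebuilt. That is why requirements.txt is copied before the source files.
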

Step 6: Build and Run the Docker Container

What: Build the Docker image and run it as a container, passing the API key at runtime. Why: This verifies that your Dockerfile is correct and that the agent works inside a container, not just on your bare machine. It also proves the secret-passing pattern works.

# Build the image (run from the project directory)
docker build -t ucc-agent .

# Run the container, passing the API key from your environment
docker run -p 8000:8000 -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY ucc-agent

# Or on Windows PowerShell:
# docker run -p 8000:8000 -e ANTHROPIC_API_KEY=$env:ANTHROPIC_API_KEY ucc-agent
Expected build output (abbreviated)
[+] Building 45.2s (12/12) FINISHED
=> [builder 1/3] FROM python:3.12
=> [builder 2/3] COPY requirements.txt .
=> [builder 3/3] RUN pip install ...
=> [stage-1 1/5] FROM python:3.12-slim
=> [stage-1 2/5] RUN useradd ...
=> [stage-1 3/5] COPY --from=builder ...
=> [stage-1 4/5] COPY *.py .
=> naming to docker.io/library/ucc-agent
Expected run output
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Test with the same curl commands from Step 4:

curl -s http://localhost:8000/health | python -m json.tool

curl -s -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the lien exposure for Acme Corporation?"}' \
  | python -m json.tool
Checkpoint

The responses from the Docker container should be identical to what you saw in Step 4. Same health check, same agent answer. If it works here, your container is ready for the cloud. Stop the container with Ctrl+C.

Troubleshooting

"port already in use" → Stop any process using port 8000, or change the host port: docker run -p 8001:8000 ... then test on localhost:8001.

Container exits immediately → Check logs: docker logs $(docker ps -lq). Common cause: missing ANTHROPIC_API_KEY.

"ANTHROPIC_API_KEY not configured" in health check → The -e flag did not pass the key. Verify: echo $ANTHROPIC_API_KEY should show your key (not empty).

Step 7: Docker Compose for Development

What: Create a docker-compose.yml that simplifies running the container with environment variables from a .env file.

Why: Typing the full docker run -p 8000:8000 -e ANTHROPIC_API_KEY=... command every time is tedious and error-prone. One typo in the port mapping or a forgotten -e flag and your container starts without the API key. Docker Compose is a tool that lets you define multi-container applications in a YAML file: ports, environment variables, volumes, and health checks live in docker-compose.yml instead of in long docker run commands. The command becomes just docker compose up, with the same settings every time.

Docker Compose also supports features that raw docker run does not make easy: mounting your source code as a volume for hot-reloading during development, pairing a health check (which reports whether the service is responding) with a restart policy (which restarts containers that exit), and orchestrating multiple services (e.g., agent + database) in a single command. Note that a failing health check only marks the container as unhealthy; it is the restart policy that brings a crashed container back. For this lab, we use Compose for convenience. It becomes more valuable as a deployment grows to multiple services.

# docker-compose.yml — Development orchestration
version: "3.8"

services:
  agent:
    build: .
    ports:
      - "8000:8000"
    env_file:
      - .env                    # Reads ANTHROPIC_API_KEY from .env file
    environment:
      - PORT=8000
    # Uncomment the next 2 lines to mount your source code into the container.
    # For hot-reload, also override the command so uvicorn runs with --reload.
    # volumes:
    #   - ./:/app               # Mount source code into container
    restart: unless-stopped     # Auto-restart on crash
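    healthcheck:
      # python:3.12-slim does not ship curl. If this check keeps failing,
      # switch to the Python-based test listed under Troubleshooting.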
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s

Create your .env file (this file is gitignored — never commit it):

ANTHROPIC_API_KEY=your-actual-api-key-here
PORT=8000
Run commands
# Start the container
docker compose up

# Test (same curl commands as before)
curl -s http://localhost:8000/health | python -m json.tool

# Stop when done
docker compose down
Checkpoint

The docker compose up command should build the image (if needed), start the container, and show the Uvicorn startup message. The health check should return "healthy". You now have a fully containerized agent running locally.

Troubleshooting

"no configuration file provided: not found" → You are not in the project directory. Run cd ucc-agent-deploy first, or verify docker-compose.yml exists with ls docker-compose.yml.

"error while loading .env" → You have not created the .env file yet. Copy .env.example to .env and fill in your API key: cp .env.example .env.

Health check keeps failing → The container might not have curl installed (the slim image does not include it). Replace the healthcheck test with: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
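
If the one-line Python health check becomes hard to read, the same idea fits in a small script. This is a sketch under assumptions: the file name healthcheck.py and the function names are hypothetical, and the script would need to be copied into the image before Compose could run it.

```python
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True only when the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and error statuses such as 503
        # (urlopen raises HTTPError, a URLError subclass, for those).
        return False


def main(argv: list[str]) -> int:
    """Exit code for a container health check: 0 = healthy, 1 = unhealthy."""
    url = argv[1] if len(argv) > 1 else "http://localhost:8000/health"
    return 0 if is_healthy(url) else 1
```

To use it as a health check, end the file with sys.exit(main(sys.argv)) (after importing sys) and set the Compose test to ["CMD", "python", "healthcheck.py"]. Because /health returns 503 when the API key is missing, the check also catches a container that started without its secret.
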

Your agent is containerized and running on your machine. The exact same Docker image can now be pushed to any cloud provider. Next, we will deploy it to Google Cloud Run — a managed service that runs containers without you managing servers, scaling, or infrastructure.

Step 8: GCP Cloud Run — Prerequisites

What: Set up your GCP project and authenticate the gcloud CLI. Why: Cloud Run runs your Docker container on Google's infrastructure. It handles three things automatically: scaling (from zero to thousands of instances), TLS certificates (free HTTPS), and load balancing (distributing traffic across instances).

You pay only when your container is handling requests — $0 when idle. For agent workloads, Cloud Run is often the best default choice. It supports long timeouts (up to 60 minutes, which matters for complex multi-tool agent loops) and keeps containers warm between requests so repeat callers avoid cold starts.

Cloud Account Required (or use Mock Mode)

Steps 8–9 require a GCP account. If you do not have one, skip to the Mock Mode box at the end of Step 9, which simulates the deploy locally.

# 1. Authenticate (opens browser for Google login)
gcloud auth login

# 2. Set your project (replace with your GCP project ID)
gcloud config set project YOUR_PROJECT_ID

# 3. Enable the required APIs
gcloud services enable run.googleapis.com artifactregistry.googleapis.com

# 4. Create an Artifact Registry repository for Docker images
gcloud artifacts repositories create agents \
  --repository-format=docker \
  --location=us-central1 \
  --description="Agent container images"

# 5. Configure Docker to authenticate with Artifact Registry
gcloud auth configure-docker us-central1-docker.pkg.dev

# 6. Verify your setup
gcloud config list
Expected output for gcloud config list
[core]
project = YOUR_PROJECT_ID
account = your-email@gmail.com
Checkpoint

If gcloud config list shows your project ID and account, you are ready to push your image. If you see "ERROR: (gcloud.config.set) INVALID_VALUE", double-check your project ID — it must match exactly (case-sensitive).

Step 9: Deploy to Cloud Run

What: Push your Docker image to Artifact Registry and deploy it to Cloud Run. Artifact Registry is Google Cloud's container registry service: it stores your Docker images so Cloud Run can pull and run them, similar to Docker Hub but private and integrated with GCP's access control. Cloud Run is a fully managed service that runs containers, scales automatically from zero to many instances, handles HTTPS, and charges only while your container is processing a request. Why: This gives you a public HTTPS URL that anyone can call. Cloud Run handles TLS, scaling, and load balancing; you just provide the container.

# 1. Tag the image for Artifact Registry
docker tag ucc-agent \
  us-central1-docker.pkg.dev/YOUR_PROJECT_ID/agents/ucc-agent:v1

# 2. Push the image
docker push us-central1-docker.pkg.dev/YOUR_PROJECT_ID/agents/ucc-agent:v1

# 3. Deploy to Cloud Run
# Every flag explained:
#   --image       : the container image to run
#   --region      : where to run (us-central1 is cheap, low latency for US)
#   --port 8000   : the port the container listens on. Cloud Run sends traffic
#                   to port 8080 by default, but the Dockerfile CMD binds
#                   uvicorn to 8000, so the deploy fails without this flag
#   --set-env-vars: pass the API key (for production, use Secret Manager instead)
#   --memory 512Mi: 512 MB RAM, enough for an agent + Python runtime
#   --cpu 1       : 1 vCPU; agents are I/O bound, not CPU bound
#   --timeout 60  : 60-second request timeout; agent loops can take 10-30s
#   --max-instances 3: cost safety net that prevents scaling to 1000 instances
#                       from a traffic spike (each instance = money)
#   --allow-unauthenticated: makes the URL public (remove for production)

gcloud run deploy ucc-agent \
  --image us-central1-docker.pkg.dev/YOUR_PROJECT_ID/agents/ucc-agent:v1 \
  --region us-central1 \
  --port 8000 \
  --set-env-vars ANTHROPIC_API_KEY=your-key-here \
  --memory 512Mi \
  --cpu 1 \
  --timeout 60 \
  --max-instances 3 \
  --allow-unauthenticated
Expected deploy output
Deploying container to Cloud Run service [ucc-agent] in project [YOUR_PROJECT_ID] region [us-central1]
OK Deploying... Done.
  OK Creating Revision...
  OK Routing traffic...
Done.
Service [ucc-agent] revision [ucc-agent-00001-abc] has been deployed and is serving 100 percent of traffic.
Service URL: https://ucc-agent-abc123-uc.a.run.app

Test your deployed agent with the public URL:

# Replace with YOUR actual service URL from the deploy output
SERVICE_URL="https://ucc-agent-abc123-uc.a.run.app"

# Health check
curl -s $SERVICE_URL/health | python -m json.tool

# Query the agent
curl -s -X POST $SERVICE_URL/query \
  -H "Content-Type: application/json" \
  -d '{"question": "Find all filings for Pacific Freight Inc"}' \
  | python -m json.tool
Checkpoint

The Cloud Run URL should return the same response as your local Docker container. You now have a public HTTPS API for your agent. Note the first request may take 2–3 seconds longer (cold start) as Cloud Run boots the container.

Troubleshooting

"PERMISSION_DENIED: Cloud Run Admin API has not been used" → Run gcloud services enable run.googleapis.com and wait 1–2 minutes for the API to activate.

"ERROR: failed to push image" → Docker is not authenticated with Artifact Registry. Run gcloud auth configure-docker us-central1-docker.pkg.dev then retry the push.

Service deployed but returns 503 → The API key was not set correctly. Check with gcloud run services describe ucc-agent --region us-central1 and look for the ANTHROPIC_API_KEY env var. Redeploy with the correct key.

Cost: Cloud Run Pricing

Cloud Run charges per request: CPU time + memory time + number of requests. With --max-instances 3 and typical agent traffic (a few queries per hour), expect less than $1/month. When idle, you pay $0. For comparison, a dedicated VM running 24/7 would cost $25–$50/month.
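
The "less than $1/month" estimate can be reproduced with a few lines of arithmetic. The sketch below is a rough model, and the per-unit rates are illustrative assumptions passed in as parameters rather than authoritative prices; check the current Cloud Run pricing page before relying on them. It ignores the free tier, billing granularity, and any time an instance stays up between requests.

```python
def monthly_cloud_run_cost(
    requests_per_month: int,
    seconds_per_request: float,
    vcpus: float = 1.0,
    memory_gib: float = 0.5,
    # Assumed illustrative rates; replace with current published prices.
    usd_per_vcpu_second: float = 0.000024,
    usd_per_gib_second: float = 0.0000025,
    usd_per_million_requests: float = 0.40,
) -> float:
    """Estimate request-based billing: CPU time + memory time + request count."""
    busy_seconds = requests_per_month * seconds_per_request
    cpu = busy_seconds * vcpus * usd_per_vcpu_second
    memory = busy_seconds * memory_gib * usd_per_gib_second
    requests = requests_per_month / 1_000_000 * usd_per_million_requests
    return cpu + memory + requests


# "A few queries per hour": 3 per hour, around the clock, 5 seconds each.
estimate = monthly_cloud_run_cost(requests_per_month=3 * 24 * 30, seconds_per_request=5.0)
```

With these assumed rates the estimate comes to roughly $0.27 a month, dominated by CPU time, which is consistent with the figure quoted above.
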

Production Security

For production, remove --allow-unauthenticated and use IAM authentication or an API gateway. Also use Secret Manager for the API key instead of --set-env-vars:

gcloud run deploy ucc-agent ... --set-secrets ANTHROPIC_API_KEY=anthropic-key:latest

Mock Mode (No GCP Account)

If you do not have a GCP account, you can simulate the Cloud Run experience locally. The key insight is that Cloud Run just runs your Docker container — it does not change your code. Run docker run -p 8080:8000 -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY ucc-agent and pretend http://localhost:8080 is your Cloud Run URL. The behavior is identical.

Step 10: AWS Lambda — Prerequisites

What: Set up your AWS account and understand the Lambda execution model. Why: AWS Lambda is a fundamentally different deployment model from Cloud Run. It is a serverless compute service that runs your code in response to events, bills only for the time your code actually executes (in 1 ms increments), and scales automatically from zero to thousands of concurrent executions.

Instead of running a persistent web server that waits for requests, Lambda runs your code in response to individual events. When a request arrives and no execution environment is free, Lambda creates one (a cold start), runs your handler function, and then keeps the environment around for a while so later requests can reuse it. You pay per invocation and for execution time; the free tier covers the first 1 million requests per month. Lambda is ideal for event-driven agents: a webhook fires and starts processing, a new file appears in S3 and triggers analysis, or a cron job runs every hour.
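
One practical consequence of this execution model is that code at module level runs once per execution environment, while the handler runs once per request. Lambda may reuse an environment for later invocations, so module-level state survives warm calls. The sketch below simulates that with a hypothetical get_client helper and a plain dict standing in for a real SDK client.

```python
import json
import time

_CLIENT = None     # module-level: survives between warm invocations
_INIT_COUNT = 0    # how many times this environment built the client


def get_client() -> dict:
    """Build the (stand-in) expensive client once per execution environment."""
    global _CLIENT, _INIT_COUNT
    if _CLIENT is None:
        _INIT_COUNT += 1
        _CLIENT = {"created_at": time.time()}  # placeholder for a real client
    return _CLIENT


def handler(event, context):
    """Lambda-style entry point: reuses the client on warm invocations."""
    client = get_client()
    body = {"init_count": _INIT_COUNT, "client_created_at": client["created_at"]}
    return {"statusCode": 200, "body": json.dumps(body)}
```

Calling handler twice in the same process reports an init count of 1 both times, which is what a warm invocation looks like. The same reasoning explains why imports and client setup belong outside the handler function: that cost is paid on cold starts only.
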

Cloud Account Required (or use Mock Mode)

Steps 10–11 require an AWS account with the AWS SAM CLI installed. If you do not have one, skip to the Mock Mode box at the end of Step 11.

Cloud Run vs Lambda — When to Use Which

Cloud Run is your default for agent APIs. It supports long-running requests (up to 60 minutes), keeps containers warm between requests, and works with any HTTP framework. Think of it as "your Docker container, but managed."

Lambda is best for event-driven agents: a new file appears in S3 and triggers analysis, a webhook fires and starts processing, or a cron job runs every hour. Lambda's max timeout is 15 minutes, which works for most agent tasks but not for complex multi-agent pipelines that take longer.

Tradeoff: Lambda has cold starts (2–5 seconds to load Python + dependencies). Cloud Run also has cold starts, but you can set minimum instances to 1 to eliminate them ($0.05/hour).
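
The comparison above can be condensed into a rule of thumb. The function below is a hypothetical helper that encodes only the two timeout limits and the event-driven distinction from this section; a real decision would also weigh cold starts, cost, and existing infrastructure.

```python
CLOUD_RUN_MAX_SECONDS = 60 * 60   # 60-minute request limit cited above
LAMBDA_MAX_SECONDS = 15 * 60      # 15-minute function limit cited above


def suggest_platform(expected_runtime_s: float, event_driven: bool) -> str:
    """Suggest a deployment target from expected runtime and trigger style."""
    if expected_runtime_s > CLOUD_RUN_MAX_SECONDS:
        return "neither: split the job or use a batch or VM service"
    if event_driven and expected_runtime_s <= LAMBDA_MAX_SECONDS:
        return "lambda"
    return "cloud-run"
```

For example, an S3-triggered analysis that takes 30 seconds maps to Lambda, while an interactive agent API or a 20-minute pipeline maps to Cloud Run.
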

# 1. Configure AWS CLI (you will be prompted for access key, secret, region)
aws configure
# Access Key ID: your-access-key
# Secret Access Key: your-secret-key
# Default region: us-east-1
# Default output format: json

# 2. Verify your identity
aws sts get-caller-identity

# 3. Verify SAM CLI is installed
sam --version
Expected output for aws sts get-caller-identity
{ "UserId": "AIDA...", "Account": "123456789012", "Arn": "arn:aws:iam::123456789012:user/your-user" }
Checkpoint

If aws sts get-caller-identity returns your account ID, and sam --version shows a version number, you are ready for Step 11.

Step 11: Deploy to AWS Lambda

What: Create a Lambda handler that wraps the same agent code, package it with SAM, and deploy it behind API Gateway, the AWS service that acts as a front door for your Lambda functions. API Gateway handles HTTP routing, authentication, rate limiting, and CORS, then invokes the Lambda function for each request.

Why: Lambda does not run a web server. It expects a specific handler function that receives an event object and returns a response. Mangum, a Python library that adapts ASGI applications such as FastAPI to work as AWS Lambda handlers, bridges this gap: it converts the Lambda event (from API Gateway) into a format FastAPI understands, and converts FastAPI's response back into Lambda's expected format. This means you reuse 100% of your server.py code without rewriting it.

Create lambda_handler.py — this is the adapter between Lambda and your FastAPI app:

# lambda_handler.py — AWS Lambda adapter for FastAPI
# Mangum converts API Gateway events into ASGI requests that FastAPI understands.

from mangum import Mangum
from server import app

# This is the entry point Lambda calls.
# Mangum wraps the FastAPI app: Lambda event -> HTTP request -> FastAPI -> HTTP response -> Lambda response
handler = Mangum(app, lifespan="off")

# That's it! Mangum does all the translation. Your FastAPI routes,
# Pydantic validation, error handling — everything works exactly
# the same as it does locally. The only difference is HOW the
# request arrives (Lambda event vs. HTTP request).
// lambda_handler.js — AWS Lambda adapter (native handler, no Express)
// For Node.js, we use a native Lambda handler since serverless-http
// adds overhead. The agent logic is called directly.

const { runAgent } = require("./agent");

exports.handler = async (event) => {
  // API Gateway sends the HTTP method and body in the event object
  const method = event.httpMethod || event.requestContext?.http?.method;
  const path = event.path || event.rawPath;

  // Health check
  if (path === "/health" && method === "GET") {
    const apiKey = process.env.ANTHROPIC_API_KEY || "";
    if (!apiKey) {
      return { statusCode: 503, body: JSON.stringify({ status: "unhealthy" }) };
    }
    return {
      statusCode: 200,
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ status: "healthy", service: "ucc-agent", version: "1.0.0" }),
    };
  }

  // Query endpoint
  if (path === "/query" && method === "POST") {
    try {
      const body = JSON.parse(event.body || "{}");
      const question = body.question;

      if (!question || typeof question !== "string" || question.length === 0) {
        return {
          statusCode: 422,
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({ detail: "Missing or invalid question", status: "error" }),
        };
      }

      const t0 = Date.now();
      const answer = await runAgent(question);
      const elapsed = ((Date.now() - t0) / 1000).toFixed(2);

      return {
        statusCode: 200,
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ answer, elapsed_seconds: parseFloat(elapsed), status: "success" }),
      };
    } catch (err) {
      return {
        statusCode: 500,
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ detail: `Agent failed: ${err.message}`, status: "error" }),
      };
    }
  }

  return { statusCode: 404, body: JSON.stringify({ detail: "Not found" }) };
};
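
To make the event-to-response translation concrete, here is a hand-rolled Python handler that mirrors the Node.js version above. It is a sketch for illustration only and is not how Mangum is implemented. fake_run_agent is a hypothetical stub standing in for run_agent so the example runs without an API key.

```python
import json


def fake_run_agent(question: str) -> str:
    """Hypothetical stand-in for agent.run_agent so the sketch runs offline."""
    return f"(stub answer for: {question})"


def _response(status: int, body: dict) -> dict:
    """Build the response shape API Gateway expects from a proxy integration."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body),
    }


def handler(event: dict, context=None) -> dict:
    # REST API events carry httpMethod/path; HTTP API events nest the method
    # under requestContext.http and put the path in rawPath.
    method = event.get("httpMethod") or (
        event.get("requestContext", {}).get("http", {}).get("method")
    )
    path = event.get("path") or event.get("rawPath")

    if (method, path) == ("GET", "/health"):
        return _response(200, {"status": "healthy", "service": "ucc-agent"})

    if (method, path) == ("POST", "/query"):
        try:
            payload = json.loads(event.get("body") or "{}")
        except json.JSONDecodeError:
            return _response(400, {"detail": "Body must be valid JSON", "status": "error"})
        question = payload.get("question") if isinstance(payload, dict) else None
        if not isinstance(question, str) or not question:
            return _response(422, {"detail": "Missing or invalid question", "status": "error"})
        return _response(200, {"answer": fake_run_agent(question), "status": "success"})

    return _response(404, {"detail": "Not found"})
```

Everything this handler does by hand (routing, body parsing, validation, response shaping) is what the FastAPI app already does, which is why the Mangum one-liner is usually the better choice for the Python version.
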

Now create template.yaml, the AWS Serverless Application Model (SAM) template. It is a YAML file that defines your Lambda function, API Gateway, and other AWS resources, which the SAM CLI then builds, packages, and deploys:

AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Description: UCC Filing Research Agent - Lambda deployment

Globals:
  Function:
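    Timeout: 120           # 120 seconds - agent loops can take 10-30s
                           # (API Gateway's default integration timeout is 29s, so calls made through the API fail at the gateway before this limit)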
    MemorySize: 512        # 512 MB - enough for Python + anthropic SDK
    Runtime: python3.12

Resources:
  UCCAgentFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: lambda_handler.handler
      CodeUri: .
      Description: UCC Filing Research Agent
      Architectures:
        - x86_64
      Environment:
        Variables:
          ANTHROPIC_API_KEY: !Ref AnthropicApiKey
      Events:
        HealthCheck:
          Type: Api
          Properties:
            Path: /health
            Method: get
        Query:
          Type: Api
          Properties:
            Path: /query
            Method: post
        Metrics:
          Type: Api
          Properties:
            Path: /metrics
            Method: get

Parameters:
  AnthropicApiKey:
    Type: String
    NoEcho: true           # Masks the value in CloudFormation logs
    Description: Anthropic API key for Claude access

Outputs:
  ApiUrl:
    Description: API Gateway URL for the UCC Agent
    Value: !Sub "https://${ServerlessRestApi}.execute-api.${AWS::Region}.amazonaws.com/Prod/"

Build and deploy:

# Build the Lambda package (installs deps, creates deployment artifact)
sam build

# Deploy (first time: use --guided for interactive prompts)
sam deploy --guided
# Stack name: ucc-agent-stack
# Region: us-east-1
# Parameter AnthropicApiKey: (paste your API key — it will be masked)
# Confirm changes before deploy: y
# Allow SAM CLI IAM role creation: y
# Save arguments to samconfig.toml: y

# After first deploy, subsequent deploys are simpler:
# sam build && sam deploy
Expected deploy output (abbreviated)

CloudFormation outputs from deployed stack
-------------------------------------------
Outputs
-------------------------------------------
Key: ApiUrl
Value: https://abc123def4.execute-api.us-east-1.amazonaws.com/Prod/
-------------------------------------------
Successfully created/updated stack - ucc-agent-stack in us-east-1

Test the Lambda deployment:

# Replace with YOUR API Gateway URL from the deploy output
LAMBDA_URL="https://abc123def4.execute-api.us-east-1.amazonaws.com/Prod"

curl -s $LAMBDA_URL/health | python -m json.tool

curl -s -X POST $LAMBDA_URL/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the risk level for Lone Star Drilling Co?"}' \
  | python -m json.tool
Checkpoint

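The Lambda endpoint should return the same agent response as local and Cloud Run. The first request will be slower (roughly 3–6 seconds of cold start) as Lambda initializes the Python runtime and loads the Anthropic SDK. Subsequent requests are faster (warm start) as long as the execution environment is reused, which is often the case within about 15 minutes of the previous request but is not guaranteed by AWS.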

Troubleshooting

"Task timed out after 120 seconds" → The agent loop took too long. Check your tool implementation and increase the timeout in template.yaml, or simplify the query.

"Unable to import module 'lambda_handler'" → SAM did not include all files. Run sam build again and check .aws-sam/build/ for your files.

"Internal server error" with no details → Check CloudWatch logs: sam logs --stack-name ucc-agent-stack --tail

Cost: Lambda Pricing

Lambda's free tier includes 1 million requests/month and 400,000 GB-seconds of compute. A typical agent query (512 MB, 15 seconds) uses 7.5 GB-seconds. You would need over 53,000 agent queries/month to exceed the free tier. After that, it is roughly $0.20 per 1 million requests + $0.0000166667 per GB-second.
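The free-tier arithmetic above can be checked in a few lines. This is a sketch that uses only the figures quoted in this section (512 MB, 15 seconds, 400,000 GB-seconds, 1 million free requests, and the two prices); the function names are illustrative, and actual bills also vary by region and architecture.

```python
FREE_GB_SECONDS = 400_000            # monthly free compute quoted above
FREE_REQUESTS = 1_000_000            # monthly free requests quoted above
PRICE_PER_GB_SECOND = 0.0000166667   # USD, quoted above
PRICE_PER_REQUEST = 0.20 / 1_000_000 # USD, quoted above


def gb_seconds_per_query(memory_mb: int, duration_s: float) -> float:
    """Compute billed for one invocation: memory in GB times duration."""
    return (memory_mb / 1024) * duration_s


def free_queries_per_month(memory_mb: int, duration_s: float) -> int:
    """How many queries fit inside the free compute allowance."""
    return int(FREE_GB_SECONDS // gb_seconds_per_query(memory_mb, duration_s))


def monthly_cost(queries: int, memory_mb: int, duration_s: float) -> float:
    """Estimated monthly charge after subtracting both free allowances."""
    used = queries * gb_seconds_per_query(memory_mb, duration_s)
    billable_compute = max(0.0, used - FREE_GB_SECONDS)
    billable_requests = max(0, queries - FREE_REQUESTS)
    return billable_compute * PRICE_PER_GB_SECOND + billable_requests * PRICE_PER_REQUEST
```

Calling free_queries_per_month(512, 15) reproduces the "over 53,000 queries" figure, and monthly_cost lets you plug in your own memory size and average agent duration.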

Cert Tip — Domain 3.3

The exam expects you to know the Lambda adapter pattern: reuse your existing web framework (FastAPI/Express) code via an adapter library (Mangum for Python, serverless-http for Node.js) rather than rewriting handler logic from scratch. Know the tradeoffs: adapters add ~50ms overhead per cold start, but save weeks of duplicate code maintenance. For latency-critical Lambda functions, a native handler (no framework) is faster but harder to maintain.

Mock Mode (No AWS Account)

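You can test the Lambda handler locally without AWS. If you have Docker installed, SAM can emulate API Gateway on your machine. The --env-vars flag expects the path to a JSON file, so write the key to a file first and keep that file out of version control:

echo '{"UCCAgentFunction": {"ANTHROPIC_API_KEY": "your-key"}}' > env.json
sam local start-api --env-vars env.json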

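This starts a local API Gateway that invokes your Lambda handler in a Docker container. Test it at http://localhost:3000/query. If you do not have SAM installed, you can also call the handler directly in Python. The minimal event below suits a hand-written handler; if your handler is wrapped by an adapter such as Mangum, pass a complete API Gateway event instead, because the adapter inspects fields such as requestContext to decide how to parse it: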

python -c "from lambda_handler import handler; print(handler({'httpMethod':'GET','path':'/health'}, None))"

Deployment Comparison: Docker vs Cloud Run vs Lambda

You have now deployed the same agent to three environments. Here is how they compare across the dimensions that matter most for AI agent workloads:

Deployment Platform Comparison

Platform      | Cold Start | Max Timeout | Cost at Idle | Scaling     | Best For
Local Docker  | None       | Unlimited   | $0           | Manual      | Dev & testing
GCP Cloud Run | ~1-2s      | 60 min      | $0           | Auto 0→1000 | Production APIs
AWS Lambda    | ~2-5s      | 15 min      | $0           | Auto 0→1000 | Event-driven

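Common Misconceptions

"Lambda is always cheaper than Cloud Run" — Not for agent workloads. Agents spend 10–30 seconds per request, mostly waiting on Claude API calls. Lambda bills every invocation for its full duration, and each execution environment handles one request at a time, so ten concurrent 20-second requests are billed as roughly 200 seconds. A Cloud Run instance can serve several requests concurrently and is billed for the time the instance is active, so the same ten requests can share about 20 seconds of instance time. For consistent traffic, that can make Cloud Run the cheaper option.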

"Serverless means no server" — There IS a server. You just do not manage it. AWS provisions, patches, and scales the servers for you. Your code still runs on a real machine somewhere.

"Cold starts are always a problem" — For agents, the Claude API call itself takes 3–15 seconds. A 2-second cold start is a small fraction of total response time. Cold starts matter more for sub-100ms APIs.

Cert Tip — Domain 3.2

The certification exam tests your ability to choose the right deployment model for a given scenario. Key decision factors: request timeout requirements (Cloud Run for long-running agents), traffic pattern (Lambda for spiky/event-driven), and whether you need persistent connections (Cloud Run or containers, not Lambda).

Health Checks and Basic Monitoring

Your deployed agent already has a /health endpoint and a /metrics endpoint. These are not optional extras — they are the minimum viable monitoring for any production service.

A health check answers one question: "Is this service alive and able to handle requests?" Without it, infrastructure has no way to detect a crashed or misconfigured service. Your users would be the first to find out, and that is the worst way to discover an outage.

Metrics go one step further. They answer: "How is this service performing over time?" Request count, error rate, and uptime tell you whether your agent is healthy, degrading, or on fire — before users complain. Here is how each platform uses these endpoints:

How Platforms Use Health Checks

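  • Docker Compose: The healthcheck directive in your compose file pings /health every 30 seconds. After 3 consecutive failures, Docker marks the container as unhealthy (visible in docker ps). Docker on its own does not restart an unhealthy container, because restart policies react only when the container exits; automatic recovery needs an orchestrator (Swarm, Kubernetes) or a watchdog.
  • Cloud Run: By default, Cloud Run runs a TCP startup probe against your container's port and sends traffic only once it passes. You can also configure HTTP startup and liveness probes that call /health, so that Cloud Run restarts an instance that stops responding.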
  • Lambda: Health checks are less relevant since each invocation is independent. API Gateway has its own health monitoring.
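The "three failures in a row" rule that health checkers apply can be sketched in plain Python. This is an illustration of the idea only, not how Docker or Cloud Run implement it; http_probe and is_unhealthy are hypothetical names, and the 3-retry default mirrors the compose setting described above.

```python
import urllib.error
import urllib.request
from typing import Iterable


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True when a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection errors, timeouts, and non-2xx responses
        return False


def is_unhealthy(results: Iterable[bool], retries: int = 3) -> bool:
    """True once `retries` consecutive probe results have failed."""
    streak = 0
    for ok in results:
        streak = 0 if ok else streak + 1
        if streak >= retries:
            return True
    return False
```

A single failed probe does not flip the state: a success in between resets the streak, which is what keeps one slow response from triggering recovery actions.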

Checking Logs

# Docker: view container logs
docker logs -f $(docker ps -q --filter ancestor=ucc-agent)

# GCP Cloud Run: read recent logs
gcloud run services logs read ucc-agent --region us-central1

# AWS Lambda: tail CloudWatch logs
sam logs --stack-name ucc-agent-stack --tail

# All platforms: check the /metrics endpoint (local URL shown; substitute your Cloud Run or Lambda base URL)
curl -s http://localhost:8000/metrics | python -m json.tool
Example /metrics response
{ "total_requests": 12, "total_errors": 1, "uptime_seconds": 3847.2, "error_rate": 0.0833 }

This basic monitoring tells you how many requests the agent has handled, how many failed, and the error rate. For full observability (distributed tracing, latency percentiles, cost tracking), see M19: Tracing & Logging and M20: Monitoring & Continuous Improvement.
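For reference, counters like the ones in that response can be kept with a small in-process object. This is a minimal sketch, not the lab's server.py; the Metrics class and its method names are assumptions chosen to match the fields in the example response.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Metrics:
    total_requests: int = 0
    total_errors: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def record(self, ok: bool) -> None:
        """Call once per handled request."""
        self.total_requests += 1
        if not ok:
            self.total_errors += 1

    def snapshot(self) -> dict:
        """Build the JSON-serializable body for a /metrics response."""
        rate = self.total_errors / self.total_requests if self.total_requests else 0.0
        return {
            "total_requests": self.total_requests,
            "total_errors": self.total_errors,
            "uptime_seconds": round(time.monotonic() - self.started_at, 1),
            "error_rate": round(rate, 4),
        }
```

Because the counters live in process memory, every container instance or Lambda execution environment reports only its own numbers and starts from zero after a restart. That is acceptable for a quick look but is one reason to move to external metrics for real monitoring.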

Final Verification: Test All Three Deployments

The ultimate test: send the same query to all three environments and verify you get equivalent responses. This proves your deployment pipeline works end-to-end.

QUERY='{"question": "What is the lien exposure for Acme Corporation?"}'

echo "=== Local Docker ==="
curl -s -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" -d "$QUERY" | python -m json.tool

echo ""
echo "=== GCP Cloud Run ==="
curl -s -X POST https://ucc-agent-abc123-uc.a.run.app/query \
  -H "Content-Type: application/json" -d "$QUERY" | python -m json.tool

echo ""
echo "=== AWS Lambda ==="
curl -s -X POST https://abc123def4.execute-api.us-east-1.amazonaws.com/Prod/query \
  -H "Content-Type: application/json" -d "$QUERY" | python -m json.tool
Congratulations!

All three deployments should return equivalent answers about Acme Corporation's filings and risk score. The exact wording will differ (Claude generates it fresh each time), but the data — 2 active filings, $2,950,000 total exposure, medium risk — should be consistent.

You have deployed a multi-tool AI agent to three production environments, each with proper health checks, error handling, secret management, and monitoring. The same code, three platforms, one curl command to test them all.

Real-World Agent Deployment Patterns

The lab you just finished gives you the canonical “HTTP request in, JSON response out” deployment. That covers a real slice of production agents — but the moment your agent runs longer than 5 minutes, needs persistent memory across requests, or processes batches unattended, the simple Cloud Run / Lambda recipe stops fitting. This section covers the four topologies you will actually meet in the field, when each one applies, and what to use when serverless breaks down.

The Four Canonical Topologies

Almost every production agent fits into one of these four shapes. Pick the topology first — the platform follows from it.

Topology | Shape | Real example | Typical stack
1. Sync API | Request → agent runs → JSON response. Under ~30 s end-to-end. | A classification API. A “summarize this PDF” endpoint. The lab you just built. | Cloud Run / Lambda + FastAPI
2. Streaming chat | SSE or WebSocket from client to a long-lived handler. Tokens stream back as they generate. | A copilot UI, a customer-support chat, a coding assistant. | Cloud Run (native streaming) or Lambda Function URLs with response streaming. Redis for session state.
3. Async worker | API enqueues a task → returns a job ID immediately → worker pulls from queue and runs the agent for minutes-to-hours → client polls or gets a webhook. | A research agent that browses 50 sources. A code-migration agent on a 200-file repo. | SQS / Pub-Sub / Cloud Tasks → ECS Fargate / GKE / Modal / Inngest. Postgres for job state.
4. Scheduled batch | Cron triggers a batch run that processes N items in parallel and writes to a sink. | Nightly support-ticket categorization. Weekly invoice extraction across 10 k PDFs. | EventBridge / Cloud Scheduler → Batch / Cloud Run Jobs / Anthropic Message Batches API (50% off).

The lab built topology 1 (Sync API). The other three are where most production agents actually live, because real work rarely fits in 30 seconds.
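The choice between the four shapes can be written down as a small decision function. This is a sketch of the table's logic only: the 30-second boundary comes from the Sync API row, while the parameter names and the order of the checks are assumptions made for illustration.

```python
from enum import Enum


class Topology(Enum):
    SYNC_API = "sync API"
    STREAMING_CHAT = "streaming chat"
    ASYNC_WORKER = "async worker"
    SCHEDULED_BATCH = "scheduled batch"


def pick_topology(
    expected_seconds: float,
    *,
    interactive_stream: bool = False,
    cron_triggered: bool = False,
) -> Topology:
    """Map workload traits to one of the four topologies."""
    if cron_triggered:
        return Topology.SCHEDULED_BATCH   # unattended runs on a schedule
    if interactive_stream:
        return Topology.STREAMING_CHAT    # user watches tokens arrive
    if expected_seconds <= 30:
        return Topology.SYNC_API          # fits in one request/response
    return Topology.ASYNC_WORKER          # too long to hold a request open
```

Real systems often mix shapes, so treat the result as a starting point for the design discussion, not a rule.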

When Cloud Run / Lambda Stop Fitting

Three concrete failure modes push you off the simple recipe. Recognising them early saves a painful re-architecture later:

Failure Mode 1: Runs Exceed Timeouts

Lambda caps at 15 minutes. Cloud Run requests cap at 60 minutes. Anthropic API calls themselves can run several minutes for long tool-use chains, and a research agent making 30 sequential Claude calls can exceed these limits. Fix: move to topology 3 (async worker). The HTTP request returns a job ID in about 100 ms; the actual agent runs on ECS Fargate, GKE, or Modal, where the run is not tied to an HTTP request timeout (some worker platforms still enforce their own, much longer, task limits). The client polls GET /jobs/{id} or receives a webhook on completion.
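The submit-then-poll flow can be sketched with the standard library alone. In this illustration an in-memory queue stands in for SQS or Pub/Sub, a dict stands in for the durable job store, and a background thread stands in for the worker pool; JobRunner and its method names are hypothetical.

```python
import queue
import threading
import uuid
from typing import Callable


class JobRunner:
    def __init__(self, agent: Callable[[str], str]) -> None:
        self._agent = agent
        self._jobs: dict = {}                 # stand-in for Postgres / DynamoDB
        self._queue: queue.Queue = queue.Queue()  # stand-in for SQS / Pub/Sub
        self._lock = threading.Lock()
        threading.Thread(target=self._work, daemon=True).start()

    def submit(self, question: str) -> str:
        """What POST /jobs would do: enqueue and return a job ID at once."""
        job_id = str(uuid.uuid4())
        with self._lock:
            self._jobs[job_id] = {"status": "queued", "question": question,
                                  "output": None, "error": None}
        self._queue.put(job_id)
        return job_id

    def get(self, job_id: str) -> dict:
        """What GET /jobs/{id} would return."""
        with self._lock:
            return dict(self._jobs[job_id])

    def wait_idle(self) -> None:
        """Block until every queued job has finished (handy in tests)."""
        self._queue.join()

    def _work(self) -> None:
        while True:
            job_id = self._queue.get()
            with self._lock:
                self._jobs[job_id]["status"] = "running"
                question = self._jobs[job_id]["question"]
            try:
                update = {"status": "succeeded", "output": self._agent(question)}
            except Exception as exc:  # record the failure instead of crashing the worker
                update = {"status": "failed", "error": str(exc)}
            with self._lock:
                self._jobs[job_id].update(update)
            self._queue.task_done()
```

The important property is that submit returns before the agent runs, so the HTTP layer never holds a connection open for the duration of the job.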

Failure Mode 2: Stateful Sessions Across Requests

Cloud Run and Lambda are stateless by design — the next request can land on a brand-new instance with empty memory. If your agent needs to remember the last 20 turns of conversation, you cannot keep that in process memory; the next request might hit a cold instance and lose it. Fix: externalize session state. Three common patterns: (a) Redis or Memorystore keyed by session_id, (b) Postgres with a conversations table, (c) a managed agent platform that owns the session for you (Bedrock AgentCore, Vertex Agent Engine — covered below). The instance is still stateless; the data lives outside it.
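Externalizing session state usually comes down to a load-then-save wrapper around each turn. The sketch below assumes a hypothetical SessionStore interface; InMemoryStore is only a stand-in so the example runs without Redis or Postgres, and the agent argument is any callable that takes the message history and returns a reply.

```python
from typing import Callable, Protocol

MAX_TURNS = 20  # keep the last 20 user/assistant exchanges


class SessionStore(Protocol):
    def load(self, session_id: str) -> list: ...
    def save(self, session_id: str, messages: list) -> None: ...


class InMemoryStore:
    """Stand-in for Redis or a conversations table; swap in a real client."""

    def __init__(self) -> None:
        self._data: dict = {}

    def load(self, session_id: str) -> list:
        return list(self._data.get(session_id, []))

    def save(self, session_id: str, messages: list) -> None:
        self._data[session_id] = list(messages)


def handle_turn(
    store: SessionStore,
    session_id: str,
    user_text: str,
    agent: Callable[[list], str],
) -> str:
    """Load history, run one turn, persist the trimmed history."""
    history = store.load(session_id)
    history.append({"role": "user", "content": user_text})
    reply = agent(history)
    history.append({"role": "assistant", "content": reply})
    store.save(session_id, history[-MAX_TURNS * 2:])  # two messages per turn
    return reply
```

Nothing in handle_turn depends on which instance serves the request, which is the point: any cold instance can pick up the conversation from the store.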

Failure Mode 3: Concurrency Outpaces Anthropic Rate Limits

Cloud Run will happily scale to 1,000 instances under load. The Anthropic API will not happily accept 1,000 concurrent requests — you hit your tokens-per-minute or requests-per-minute limit and start getting 429s, which the user sees as failures. Fix: put a queue in front. Even for a sync-feeling API, requests can flow into SQS or Cloud Tasks, then a worker pool with a configured concurrency limit (e.g., 20 in-flight Claude calls max) drains the queue at a sustainable rate. The queue absorbs bursts; backpressure replaces 429s.
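The "configured concurrency limit" part of that fix can be sketched with asyncio.Semaphore. This assumes the worker drains work in one process; run_bounded and call_model are hypothetical names, and call_model is whatever coroutine wraps your Claude call.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def run_bounded(
    questions: Sequence,
    call_model: Callable[[object], Awaitable],
    max_in_flight: int = 20,
) -> list:
    """Run call_model for every question, never more than max_in_flight at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def one(question):
        # Extra tasks wait here instead of hitting the API and earning a 429
        async with semaphore:
            return await call_model(question)

    # gather preserves input order in its result list
    return await asyncio.gather(*(one(q) for q in questions))
```

Run it with asyncio.run(run_bounded(batch, call_model)). A semaphore only limits in-flight calls inside one worker, so the overall ceiling is the per-worker limit multiplied by the number of workers you run.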

Reference Architecture: A Production Agent System

Here is what a real production agent stack looks like end-to-end — the kind of diagram you would whiteboard at a design review. None of this is exotic; each box is a service you have probably used before. The point is to see how they fit together.

  ┌──────────┐    ┌──────────────┐    ┌─────────────────┐
  │  Client  │───▶│ API Gateway  │───▶│  Auth (Cognito  │
  │ (web/app)│    │  + WAF + TLS │    │   / Auth0 / IAM)│
  └──────────┘    └──────┬───────┘    └─────────────────┘
                         │
            ┌────────────┴────────────┐
            ▼                         ▼
    ┌──────────────┐          ┌──────────────┐
    │  Sync path   │          │  Async path  │
    │  (Cloud Run) │          │ (SQS/PubSub) │
    │  fast tasks  │          │  long tasks  │
    └──────┬───────┘          └──────┬───────┘
           │                         │
           │                         ▼
           │                 ┌────────────────┐
           │                 │  Worker pool   │
           │                 │ (Fargate/GKE/  │
           │                 │  Modal/Inngest)│
           │                 └──────┬─────────┘
           │                        │
           ▼                        ▼
  ┌────────────────────────────────────────────┐
  │           AGENT LOOP (your code)           │
  │  ┌──────────────────────────────────────┐  │
  │  │ Claude API (direct / Bedrock /       │  │
  │  │            Vertex AI)                │  │
  │  └──────────────────────────────────────┘  │
  │  ┌─────────┐ ┌─────────┐ ┌──────────────┐  │
  │  │ Tool 1  │ │ Tool 2  │ │ Vector DB    │  │
  │  │ (HTTP)  │ │ (DB)    │ │ (pgvector,   │  │
  │  │         │ │         │ │  Pinecone)   │  │
  │  └─────────┘ └─────────┘ └──────────────┘  │
  └─────────────────┬──────────────────────────┘
                    │
       ┌────────────┼────────────────┐
       ▼            ▼                ▼
  ┌─────────┐  ┌──────────┐  ┌────────────────┐
  │ Session │  │ Job/run  │  │ Observability  │
  │ store   │  │ store    │  │ (OTel + Lang-  │
  │ (Redis) │  │ (Postgres│  │  fuse / Datadog│
  │         │  │  / Dynamo│  │  / CloudWatch) │
  └─────────┘  └──────────┘  └────────────────┘

A few details worth calling out, since they are easy to miss when you copy this from a slide:

  • Sync path and async path share the same agent code. The split is in how the agent is invoked, not what it does. A research agent might be exposed as both: POST /query for short questions (sync, Cloud Run) and POST /jobs for deep research (async, queue + worker).
  • Session store and job store are different. Sessions hold conversation state (Redis is fine, sub-millisecond). Job store holds run lifecycle — status, started_at, output, errors — and needs durability (Postgres or DynamoDB).
  • Observability is non-negotiable for agents. Unlike CRUD APIs, agents fail in ways logs alone cannot diagnose: the wrong tool was picked, the model loop oscillated, a planning step truncated. You need per-step traces. M19 covers this; the box exists in this diagram so you do not forget it on day one.
  • Vector DB is optional — only present if your agent does retrieval. Many production agents do not have one.

Managed Agent Platforms: A Working Example

Everything above assumes you host the agent loop yourself. The alternative is to hand off the loop to a managed agent platform — the cloud provider runs the reasoning loop, persists session state, and orchestrates tool calls. You upload a system prompt and tool definitions; you never write a FastAPI server. The two production options for Claude in 2026 are AWS Bedrock AgentCore and Google Vertex AI Agent Engine.

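Here is a worked example on the AWS side. Note that the code calls the Amazon Bedrock Agents API (the bedrock-agent and bedrock-agent-runtime clients in boto3); AgentCore is a separate Bedrock service with its own clients and is not shown here. Two parts: (1) a one-time setup that creates the agent (control plane), and (2) an invocation script you run from any client (data plane). Treat the account IDs, ARNs, bucket names, agent IDs, and model ID as placeholders to replace with your own.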

Part 1 — Create the agent (run once, e.g., from CI/CD or a setup script):

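import time

import boto3

# Control-plane client: creates and configures agents
ctrl = boto3.client("bedrock-agent", region_name="us-west-2")


def wait_for_status(agent_id, wanted, timeout_s=120):
    """Poll get_agent until the agent reaches one of the wanted states."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = ctrl.get_agent(agentId=agent_id)["agent"]["agentStatus"]
        if status in wanted:
            return status
        if status == "FAILED":
            raise RuntimeError(f"Agent {agent_id} entered FAILED state")
        time.sleep(2)
    raise TimeoutError(f"Agent {agent_id} not in {wanted} after {timeout_s}s")


# Step 1: create the agent shell (system prompt + foundation model)
agent = ctrl.create_agent(
    agentName="ucc-research-agent",
    foundationModel="anthropic.claude-opus-4-7-v1:0",
    instruction=(
        "You are a UCC filing research assistant. Use the lookup_filings "
        "tool to search public records. Always cite filing numbers."
    ),
    agentResourceRoleArn="arn:aws:iam::123456789012:role/AgentExecutionRole",
    idleSessionTTLInSeconds=600,  # session memory expires after 10 min idle
)
agent_id = agent["agent"]["agentId"]
# Creation is asynchronous: wait until the agent leaves CREATING before editing it
wait_for_status(agent_id, {"NOT_PREPARED"})

# Step 2: attach an action group (your tools, defined as a Lambda function)
ctrl.create_agent_action_group(
    agentId=agent_id,
    agentVersion="DRAFT",
    actionGroupName="filing-tools",
    actionGroupExecutor={
        "lambda": "arn:aws:lambda:us-west-2:123456789012:function:lookup-filings"
    },
    apiSchema={
        "s3": {
            "s3BucketName": "my-agent-schemas",
            "s3ObjectKey": "filing-tools-openapi.yaml",  # OpenAPI 3 spec
        }
    },
)

# Step 3: prepare, wait for PREPARED, then create an alias (the stable endpoint your clients call)
ctrl.prepare_agent(agentId=agent_id)
wait_for_status(agent_id, {"PREPARED"})
alias = ctrl.create_agent_alias(agentId=agent_id, agentAliasName="prod")
print(f"Agent ready: agentId={agent_id}, aliasId={alias['agentAlias']['agentAliasId']}")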

AWS CLI equivalent of Part 1:

# Create the agent
aws bedrock-agent create-agent \
  --agent-name ucc-research-agent \
  --foundation-model anthropic.claude-opus-4-7-v1:0 \
  --instruction "You are a UCC filing research assistant..." \
  --agent-resource-role-arn arn:aws:iam::123456789012:role/AgentExecutionRole \
  --idle-session-ttl-in-seconds 600 \
  --region us-west-2

# Attach the action group (tools). Replace ABCDE12345 with the agentId returned above.
# The --api-schema value is quoted so the shell does not brace-expand it.
aws bedrock-agent create-agent-action-group \
  --agent-id ABCDE12345 \
  --agent-version DRAFT \
  --action-group-name filing-tools \
  --action-group-executor lambda=arn:aws:lambda:us-west-2:123456789012:function:lookup-filings \
  --api-schema 's3={s3BucketName=my-agent-schemas,s3ObjectKey=filing-tools-openapi.yaml}' \
  --region us-west-2

# Prepare (compiles the agent) and create a stable alias
aws bedrock-agent prepare-agent --agent-id ABCDE12345 --region us-west-2
aws bedrock-agent create-agent-alias --agent-id ABCDE12345 --agent-alias-name prod --region us-west-2

Part 2 — Invoke the agent (your application code, run on every user request):

import boto3, uuid

# Data-plane client: invokes deployed agents
rt = boto3.client("bedrock-agent-runtime", region_name="us-west-2")

def ask(question: str, session_id: str) -> str:
    response = rt.invoke_agent(
        agentId="ABCDE12345",
        agentAliasId="PRODALIAS1",   # the alias you created above
        sessionId=session_id,         # same id across calls = same conversation
        inputText=question,
        enableTrace=True,             # returns reasoning steps for debugging
    )
    # invoke_agent streams; concatenate chunks for the final answer
    answer = []
    for event in response["completion"]:
        if "chunk" in event:
            answer.append(event["chunk"]["bytes"].decode("utf-8"))
        elif "trace" in event:
            # Useful for production logs: which tool was called, why
            print("trace:", event["trace"])
    return "".join(answer)

# Same session_id = Bedrock automatically maintains conversation memory.
# No Redis. No conversations table. The platform owns it.
session = str(uuid.uuid4())
print(ask("What were Q4 sales for Acme Corporation?", session))
print(ask("And how does that compare to Q3?", session))  # context preserved

TypeScript equivalent of Part 2:

import {
  BedrockAgentRuntimeClient,
  InvokeAgentCommand,
} from "@aws-sdk/client-bedrock-agent-runtime";
import { randomUUID } from "node:crypto";

const rt = new BedrockAgentRuntimeClient({ region: "us-west-2" });

async function ask(question: string, sessionId: string): Promise<string> {
  const cmd = new InvokeAgentCommand({
    agentId: "ABCDE12345",
    agentAliasId: "PRODALIAS1",
    sessionId,
    inputText: question,
    enableTrace: true,
  });
  const res = await rt.send(cmd);
  let answer = "";
  for await (const event of res.completion ?? []) {
    if (event.chunk?.bytes) {
      answer += new TextDecoder().decode(event.chunk.bytes);
    } else if (event.trace) {
      console.log("trace:", JSON.stringify(event.trace));
    }
  }
  return answer;
}

const session = randomUUID();
console.log(await ask("What were Q4 sales for Acme Corporation?", session));
console.log(await ask("And how does that compare to Q3?", session));

What you get for free with this approach

  • The agent loop (tool selection, tool execution, response synthesis)
  • Session memory across calls (no Redis, no conversations table)
  • Tool execution via Lambda (auto-scaled, IAM-scoped per tool)
  • Built-in tracing in CloudWatch (every reasoning step logged)
  • Optional: Bedrock Knowledge Bases for RAG, Bedrock Guardrails for safety filters
  • TLS, IAM auth, regional failover, all inherited from AWS

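The Vertex AI equivalent is similar in shape: agent_engines.create() in the vertexai SDK registers your agent (typically a LangGraph or ADK graph), and you then call a query method on the returned remote agent, passing the input and a session identifier. The exact method name and arguments depend on the agent template you deploy. Same idea: you stop owning the runtime.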

When to Pick Managed vs Self-Hosted

Pick managed when your agent fits the platform’s loop (system prompt + tools + RAG), you are already deep in AWS or GCP, and you would rather pay slightly more per call than maintain a worker pool. Time-to-first-deploy is hours, not days.

Pick self-hosted (the lab’s pattern) when you need custom planning logic, want to swap models per step (cheap model for routing, expensive model for synthesis), need exact cost control per turn, or your tools touch systems that cannot be reached from the platform’s execution environment. You also stay portable — the same code runs against direct Anthropic API, Bedrock-as-a-model, or Vertex-as-a-model without rewriting the loop.
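Swapping models per step is straightforward once you own the loop. The sketch below is illustrative only: the model identifiers are placeholders, not real model names, and call_model stands for whatever function wraps your Claude client.

```python
from typing import Callable

# Placeholder identifiers: substitute the model IDs you actually use
STEP_MODELS = {
    "route": "small-model-placeholder",       # cheap, fast classification
    "synthesize": "large-model-placeholder",  # slower, higher-quality answer
}


def run_two_step(question: str, call_model: Callable[[str, str], str]) -> str:
    """Route with the cheap model, then answer with the expensive one."""
    intent = call_model(
        STEP_MODELS["route"],
        f"Classify this request in one word: {question}",
    )
    prompt = f"Request type: {intent}\nQuestion: {question}"
    return call_model(STEP_MODELS["synthesize"], prompt)
```

Because the mapping is plain data, changing which model handles which step is a configuration change, and the per-step model name is available for cost attribution in your logs.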

Hybrid in practice: many teams ship v1 on a managed platform to validate the product, then graduate the parts that hit the platform’s ceiling onto a self-hosted worker pool. The managed platform handles the chat surface; the worker pool handles the long, custom, expensive runs.

What Just Happened?
You moved from “I deployed one container three ways” to “I can pick the right topology and platform for any agent workload.” The key map: short sync requests → Cloud Run / Lambda; streaming chat → Cloud Run with SSE or Lambda Function URLs; long agent runs → queue + worker pool (Fargate / Modal / Inngest); scheduled batch → Cloud Run Jobs or the Anthropic Message Batches API. When you do not want to operate any of it, hand the loop to a managed platform: on AWS that means calling InvokeAgent as in the example above, and on GCP it means querying an agent deployed to Vertex AI Agent Engine.

Going Further (Optional Stretch Goals)

These are optional extensions for learners who want to go deeper:

  1. CI/CD Pipeline: Set up GitHub Actions to automatically build your Docker image and deploy to Cloud Run on every push to main. Use gcloud run deploy in your workflow.
  2. Auto-scaling Config: Configure Cloud Run's --min-instances 1 to eliminate cold starts for production traffic. Calculate the cost tradeoff: $0.05/hour idle vs. 2-second cold starts.
  3. Multi-region Deployment: Deploy to us-central1 and europe-west1 on Cloud Run, then use a global load balancer to route users to the nearest region.
  4. Add Streaming: Implement a POST /query/stream endpoint that uses Server-Sent Events (SSE) to stream the agent's response token by token. This dramatically improves perceived latency for users.
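  5. Lambda Layers: Package the Anthropic SDK as a Lambda Layer to shrink your function's own deployment package and speed up deploys. Measure cold start before and after rather than assuming a gain, since layer contents are still loaded when a new execution environment starts.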
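For stretch goal 2, the cost tradeoff can be worked out with a few lines of arithmetic. This sketch uses the two figures given in that item ($0.05/hour idle, 2-second cold starts); the 730 hours per month and the function names are assumptions for illustration, and your actual idle rate depends on the CPU and memory you configure.

```python
IDLE_COST_PER_HOUR = 0.05   # USD, figure from stretch goal 2
COLD_START_SECONDS = 2.0    # figure from stretch goal 2
HOURS_PER_MONTH = 730       # assumption: average month length


def monthly_idle_cost(hours: float = HOURS_PER_MONTH) -> float:
    """Cost of keeping one warm instance for the whole month."""
    return IDLE_COST_PER_HOUR * hours


def cost_per_avoided_cold_start(cold_starts_per_month: int) -> float:
    """What each avoided cold start costs if min-instances is set to 1."""
    return monthly_idle_cost() / cold_starts_per_month


def user_seconds_saved(cold_starts_per_month: int) -> float:
    """Total waiting time removed for users per month."""
    return cold_starts_per_month * COLD_START_SECONDS
```

At these figures a warm instance costs about $36.50 per month, so with 1,000 cold starts a month you would be paying under four cents to remove each 2-second wait.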

Knowledge Check

Test your understanding of agent deployment concepts. Select the best answer for each question.

1. Why is streaming important for agent API endpoints?

Streaming reduces the total cost of Claude API calls by 50%
Streaming is required by Cloud Run and Lambda
Agent responses take 5-30 seconds, and users need progress feedback instead of a frozen screen
Streaming allows the agent to use more tools per request
Correct! Agent tool use loops typically take 5-30 seconds. Without streaming, the user stares at a blank screen the entire time. Streaming sends partial results as they become available, improving perceived performance.
Not quite. Streaming does not affect cost or tool count. The key reason is user experience: agent responses take 5-30 seconds, and streaming provides progress feedback instead of making users wait in silence.

2. Why should you NEVER bake ANTHROPIC_API_KEY into a Docker image?

Docker images cannot contain environment variables
Images get pushed to registries and shared — anyone who pulls the image can extract the key
The Claude API rejects keys that come from Docker containers
It makes the Docker image too large to push
Correct! Docker images are stored in registries, shared with teammates, and used in CI/CD systems. Any secret baked into the image can be extracted by anyone who has access to the image. Always pass secrets at runtime via -e flags or a secrets manager.
Not quite. Docker images CAN contain env vars and the API works fine from containers. The issue is security: images get shared and stored in registries, so anyone who pulls the image can extract baked-in secrets.

3. Your agent takes 3 minutes to complete a complex multi-tool analysis. Which deployment platform is the best fit?

GCP Cloud Run — supports up to 60-minute timeouts
AWS Lambda — scales better for long-running tasks
Local Docker — cloud platforms cannot handle 3-minute requests
Any platform — they all support unlimited timeouts
Correct! Cloud Run supports timeouts up to 60 minutes, making it ideal for long-running agent tasks. Lambda's max is 15 minutes (which would work for 3 minutes but Cloud Run is a better default), and local Docker is not a production solution.
Not quite. Lambda supports up to 15 minutes (3 minutes would work but is not the best default). Cloud Run supports up to 60 minutes and provides better DX for long-running requests, making it the best choice for production agent APIs.

4. What does --max-instances 3 prevent in a Cloud Run deployment?

It prevents more than 3 users from accessing the service at once
It limits the Docker image to 3 layers
It ensures at least 3 instances are always running for availability
It caps the maximum number of container instances to prevent cost explosion from unexpected traffic spikes
Correct! Without a max instance limit, a traffic spike or accidental loop could scale to hundreds of instances, each costing money. --max-instances 3 is a cost safety net that limits your exposure while still allowing concurrent requests.
Not quite. --max-instances limits how many container instances Cloud Run can create. It is a cost safety net: without it, a traffic spike could auto-scale to hundreds of instances, each generating charges. It does NOT limit users or Docker layers.

5. Your agent works perfectly locally but returns timeout errors on AWS Lambda. What is the most likely cause?

Lambda does not support the Anthropic SDK
The Docker image is too large for Lambda
The agent's tool use loop takes longer than the Lambda function's timeout setting
Lambda cannot make outbound HTTP requests to the Claude API
Correct! The default Lambda timeout is 3 seconds. Agent loops with multiple tool calls can take 10-30 seconds. If your SAM template timeout is too low, Lambda kills the function before it finishes. Fix: increase the Timeout in template.yaml.
Not quite. Lambda supports the Anthropic SDK and can make outbound HTTP requests. The most likely issue is the timeout setting: Lambda defaults to 3 seconds, but agent loops take 10-30 seconds. Increase the Timeout value in your SAM template.

6. What is the role of the Mangum library in the Lambda deployment?

It provides a Python runtime for Lambda
It translates Lambda events into ASGI requests so FastAPI works without code changes
It manages the Anthropic API key as an encrypted secret
It optimizes the Docker image size for Lambda's container format
Correct! Mangum is an adapter. Lambda sends events in its own format (from API Gateway), and Mangum converts them into HTTP requests that FastAPI/ASGI understands. This means you reuse 100% of your server.py code without modification.
Not quite. Mangum is an adapter library. Its job is to translate API Gateway events (Lambda's input format) into ASGI requests that FastAPI can process. This lets you reuse your FastAPI server code on Lambda without any changes.

Summary

What We Built

In this lab you took a UCC filing research agent and deployed it to three environments:

  1. Local Docker — Containerized the agent with a multi-stage Dockerfile, non-root user, and runtime secret injection
  2. GCP Cloud Run — Pushed the image to Artifact Registry and deployed with resource limits, cost caps, and auto-scaling
  3. AWS Lambda — Created a Mangum adapter, defined a SAM template, and deployed behind API Gateway

All three respond to the same curl command with equivalent results. The agent code did not change between platforms — only the infrastructure wrapper.

Key Takeaways

  • Wrap agents as REST APIs with health checks before containerizing
  • Never bake secrets into Docker images — pass them at runtime
  • Cloud Run is the best default for production agent APIs (long timeouts, auto-scaling, low cost at idle)
  • Lambda is best for event-driven agents (webhooks, cron jobs, file triggers)
  • Cold starts matter less for agents since the Claude API call dominates response time
  • Always set --max-instances to prevent cost explosions

What Comes Next

In M23: Capstone Project Series, you will combine everything from the course — agent architecture, tool use, guardrails, observability, cost optimization, and deployment — into a complete, production-grade system for one of the three domain projects.