M19: Tracing & Logging
See inside your agent's decision-making with distributed traces, structured logs, and production-ready observability pipelines.
Learning Objectives
- Explain why traditional logging is insufficient for AI agents and what observability adds
- Create end-to-end traces that capture every decision an agent makes
- Nest spans to show parent-child relationships and timing breakdowns
- Implement structured JSON logging with PII redaction
- Compare observability tools (Langfuse, OpenTelemetry, LangSmith, Arize) and choose the right one
Why Observability Matters for Agents
The pain: This is exactly what running an AI agent without observability feels like. A user reports "it gave me the wrong answer," and you have no trace of which tools were called, what the model reasoning was, or where the chain broke. You are debugging blind, guessing at causes, and wasting hours reproducing issues that may be non-deterministic.
The mapping: Observability is your headlights, dashboard, and GPS combined. Traces show you the road your agent took (every step in sequence). Spans show you timing for each turn and acceleration. Structured logs are your gauges — real-time readings of speed, fuel, and engine temperature. With these instruments, you can see exactly what happened, when, and why.
Here is what observability data actually looks like in practice. This is a single structured log entry from an agent run — the "dashboard reading" from the analogy above:
Every field is a "gauge" you can query. The trace_id links this entry to all other entries from the same request. The latency_ms tells you how fast the engine is running. The stop_reason tells you what the agent decided to do next (in this case, call a tool).
For AI agents, observability is especially critical because agents are non-deterministic. The same input can produce different outputs depending on model temperature, tool availability, and conversation context. Traditional debugging — set a breakpoint, reproduce the issue, step through the code — often fails because you simply cannot reproduce the exact same conditions twice. Observability sidesteps this problem entirely: it captures every execution as it happens, giving you a permanent, queryable record to analyze after the fact.
Why is agent observability different from traditional application observability? Three reasons:
- Non-determinism: The same prompt can yield different tool calls and reasoning paths on each run. Unlike a REST API that returns the same JSON for the same input, Claude might call
search_orderson one run andcheck_inventoryon another. You need traces to see which path was taken for each specific request. - Multi-step chains: Agents make 5-20 decisions per request. Each decision is its own step: call a tool, parse the result, decide the next action. A bug in step 7 may only manifest as a wrong answer in step 12, with 5 correct steps in between hiding the root cause.
- Cost visibility: Each LLM call costs money — typically $0.003 to $0.015 per call depending on the model and token count. Without token-level tracking, a runaway loop (where the agent calls Claude 50 times instead of 3) can burn through your budget before anyone notices. Traces show you exactly how many calls each request consumed.
"Observability is just logging with a fancier name." — Not quite. Logging captures individual events as text lines. Observability combines three pillars — traces (structured execution records), logs (machine-parseable event entries), and metrics (aggregated measurements) — into a system where you can answer questions you did not anticipate when you wrote the code. Logs alone cannot show you causal relationships between steps.
"I only need observability when something breaks." — Observability is most valuable for understanding normal behavior so you can spot anomalies. If you do not know that your P50 latency is normally 1.2 seconds, you cannot detect when it creeps to 3.5 seconds. Baseline data is the foundation of anomaly detection.
"Adding tracing will slow down my agent." — Modern tracing libraries (Langfuse, OpenTelemetry) use asynchronous batching — spans are queued in memory and flushed in background threads. The overhead is typically under 5ms per request. The performance insight you gain (finding a 3-second database bottleneck, for example) far outweighs the microsecond cost of recording it.
"I can just use print() statements and grep the output." — This works for local debugging with one user. It collapses at scale. When your agent handles 1,000 concurrent requests, print output from different requests interleaves randomly. Without trace_id correlation, you cannot reconstruct which log lines belong to which request. Structured logging with trace IDs solves this.
"More data is always better — log everything." — Over-logging creates two problems. First, cost: shipping 10 GB/day of logs to a cloud aggregator is expensive. Second, noise: when every line is logged, finding the important signals requires complex queries. Log the right things (decisions, errors, timing, token counts) rather than everything.
In the UCC filing pipeline, observability would capture every step of a filing analysis: the initial query to BigQuery for debtor records, the entity resolution call that matches "ACME CORP" to "Acme Corporation Inc," the risk scoring computation, and the final report generation. If a filing risk score seems wrong, you pull up the trace and immediately see whether the entity resolution matched the wrong company or the risk model received stale collateral data.
Agent Traces
The pain: Without a flight recorder, investigators would have to rely on pilot memory and guesswork. For AI agents, "pilot memory" is your application logs, which typically only say "request received" and "response sent" with nothing in between. When a user says "the agent told me to cancel my order but I asked for a refund," you have no record of the reasoning chain.
The mapping: A traceA complete record of a single agent execution, identified by a unique trace ID. Contains all spans (units of work), their timing, and metadata. Think of it as the flight recorder for one agent run. is your agent's flight recorder. It captures every decision, tool call, and model invocation from the moment a user request arrives to the moment the response is sent. Each trace has a unique trace_id that lets you pull up the complete story of any execution, even months later.
Here is what a trace object actually looks like as a data structure — the "flight recorder" in JSON form:
Notice how the trace is a single JSON object that tells the complete story: who asked, what the agent did (three spans), how long each step took, and what it answered. This is the data structure you will build in the code walkthrough below.
Traces follow the OpenTelemetry standard. That means they work across languages (Python, TypeScript, Go) and across tools (Langfuse, Jaeger, Datadog). You write your tracing code once, and any OpenTelemetry-compatible backend can ingest it. In short: the trace is the top-level container, and spans are the building blocks inside it.
What a Trace Captures
Every trace for an AI agent should record:
- Input: The user's message (or a hash/summary if it contains PII)
- Model calls: Which model was invoked, what prompt was sent, tokens used, latency
- Tool calls: Which tools were invoked, their inputs and outputs, success/failure
- Decisions: The agent's reasoning at each step (what it chose to do and why)
- Output: The final response delivered to the user
- Metadata: Total cost, total latency, error status, user satisfaction signal
search_orders took 450ms — if this suddenly jumped to 3000ms in production, the trace would immediately show you where the bottleneck is. That is the power of traces: instant root-cause visibility.
Spans: Nesting & Timing
The pain: If your lasagna tastes wrong, you need to know which sub-step failed. Was it the sauce (burnt onions)? The noodles (overcooked)? The baking (too long)? Without timing each step, you are tasting the final product and guessing backwards.
The mapping: A spanA single unit of work within a trace, identified by a span_id. Spans have parent-child relationships (a tool call span is a child of the agent loop span) and record start time, end time, status, and metadata. is one step in the recipe. The root span ("handle user request") contains child spans ("call LLM," "execute tool," "validate output"). Each child can have its own children ("parse JSON response" inside "execute tool"). The nesting shows you exactly where time is spent and where errors originate.
Here is what a single span looks like as a data structure — the "recipe step card" from the analogy above. Just as a recipe step has a name ("brown the meat"), a start time, an end time, and belongs to a larger step ("make sauce"), a span has all the same pieces in JSON form:
The parent_span_id is the key field — it links this LLM call to its parent (the root agent span), creating the tree structure you see in waterfall visualizations.
Fourth, start_time and end_time timestamps — these let you calculate exactly how long this unit of work took. Fifth, a status field (ok or error) telling you whether the operation succeeded. And sixth, attributes — key-value metadata specific to this span. For an LLM span, attributes might include model: "claude-sonnet-4-6" and tokens_used: 1240. For a tool span, you might store tool_name: "search_orders" and success: true.
The parent-child relationships between spans create a tree structure that mirrors the actual execution flow of your agent. When you visualize this tree as a waterfall chart, you can instantly see which operations are slow, which ones failed, and how they depend on each other.
Span Types for AI Agents
When instrumenting an agent, you will typically create five types of spans. Each type captures different information, and together they give you a complete picture of every decision the agent made.
- Root span (agent): The entire request lifecycle, from the moment a user message arrives to the moment a response is sent. This is the top-level container — every other span is a child (or grandchild) of this one. It captures total latency, overall status (success or error), and the final output summary.
- LLM span (generation): Each call to Claude. This is usually the most expensive span in terms of both time and money. Key attributes to record: model name, prompt tokens, completion tokens, latency in milliseconds, cost in USD, and
stop_reason(which tells you whether Claude wants to call a tool or is finished). - Tool span (tool): Each tool invocation — database queries, API calls, file reads. Attributes: tool name, input parameters, output summary (truncated to avoid storing large payloads), success/failure status, and latency. Tool spans are where you most often find performance bottlenecks.
- Retrieval span: RAG lookups, where you search a vector database for relevant context. Attributes: the search query, number of results returned, relevance scores, and which documents were selected. This is critical for debugging "the agent gave a wrong answer" issues — often the retriever found the wrong documents.
- Guardrail span: Input/output validation checks. Attributes: check name (e.g., "pii_output_check"), pass/fail result, and the reason for failure. These spans are typically very fast (under 10ms) but their pass/fail status is essential for compliance auditing.
Timing and Performance Analysis
The timing data in spans is where observability pays for itself. Without spans, you know the total request took 5 seconds. With spans, you know exactly where those 5 seconds went. Here are the three performance questions that span timing answers:
- Where is time spent? If your agent takes 5 seconds, is it 3 seconds in LLM calls, 1.5 seconds in tool calls, and 0.5 seconds in processing? Or is one tool call taking 4 seconds while everything else is fast? The waterfall chart makes this immediately visual — long bars are slow operations.
- Are there sequential bottlenecks? Two independent tool calls running sequentially instead of in parallel would show up as back-to-back spans instead of overlapping ones. For example, if your agent looks up order status and checks inventory, those two lookups can run at the same time. Sequential spans tell you "these could be parallelized for a 2x speedup."
- Is there a slow dependency? A database lookup that takes 2000ms+ consistently is immediately visible in the waterfall. More importantly, you can track this over time: if the database call was 50ms last week and is 2000ms today, something changed in your database — not your agent code.
The key insight is that optimization without span data is guesswork. Developers often assume the LLM call is the bottleneck (because "AI is slow"), but in practice, database queries, API calls to third-party services, and large payload serialization are frequently the real culprits.
Structured Logging
The pain: Most application logs are the junk drawer. Lines like
INFO: Processing request... and DEBUG: Tool called successfully tell you almost nothing when you are searching through 10 million log lines for the one request that failed at 3:47 AM. You end up writing fragile regex patterns to parse free-form text, and half your queries return false matches.The mapping: Structured loggingWriting log entries as machine-parseable key-value pairs (typically JSON) instead of free-form text strings. Enables filtering, aggregation, and alerting without regex parsing. Example: {"level":"error","trace_id":"abc","tool":"search","latency_ms":2100} is the filing cabinet. Every log entry is a JSON object — not a free-form text string. Each object has consistent keys like
trace_id, span_id, timestamp, level, and message, plus domain-specific fields relevant to your agent. The payoff is queryability: instead of writing fragile regex, you can ask precise questions like "show me all ERROR logs where tool_name=search_orders and latency_ms > 2000." That query takes seconds, even across millions of entries, because every field is a first-class key that your log aggregator can index and filter on.
How does structured logging work under the hood? When your agent calls logger.info("tool_call", tool_name="search_orders", latency_ms=47), the logging library does three things. First, it creates a dictionary with your key-value pairs. Second, it adds automatic fields like the ISO timestamp, log level, and process ID. Third, it serializes the entire dictionary to a JSON string and writes it to the output stream (stdout, a file, or a log aggregation service). The result is a single line of JSON per event — no multi-line stack traces breaking your parsers, no free-form messages with inconsistent formats.
How does structured logging differ from the print() statements and logging.info("some message") calls you have used before? Traditional logging outputs human-readable text: "2025-03-15 Tool search_orders called, took 47ms, found 3 results". That is easy to read in a terminal, but impossible to query at scale. Structured logging outputs machine-readable JSON: {"tool_name":"search_orders","latency_ms":47,"result_count":3}. That is slightly harder to read in a terminal (though tools like jq help), but trivially easy to query, filter, aggregate, and alert on. The tradeoff is clear: you give up a bit of readability in exchange for massive gains in operability.
What to Log: The Essential Fields
Every structured log entry from your agent should include these fields:
{
// Correlation fields — link this log to its trace
"trace_id": "tr_a1b2c3d4e5f6",
"span_id": "sp_7g8h9i0j",
"timestamp": "2025-03-15T14:32:01.847Z",
// Classification fields — filter and aggregate
"level": "info", // debug, info, warn, error
"agent_id": "order-agent-v2",
"step_number": 3, // which step in the agent loop
// Execution fields — understand what happened
"model": "claude-sonnet-4-6",
"token_count": { "input": 1840, "output": 312 },
"tool_name": "search_orders",
"latency_ms": 847,
// Context fields — understand the situation
"message": "Tool call completed successfully",
"user_id": "usr_hashed_abc", // hashed, never raw
"session_id": "sess_xyz789"
}
What NOT to Log: The Danger Zone
- PII (Personally Identifiable Information): Names, emails, phone numbers, addresses, SSNs. Hash or redact before logging.
- API keys and secrets: Never log
Authorizationheaders, API keys, or database credentials. Even in debug mode. - Full prompt contents: Prompts may contain user data. Log a hash or summary instead of the full text.
- Full model responses: Responses may contain generated PII or sensitive reasoning. Log a truncated summary.
- Medical/financial data: PHI (HIPAA), payment card numbers (PCI-DSS), and financial account details have strict regulatory requirements.
"Structured logging means I have to give up human-readable output." — Not true. Tools like jq (command line) and log viewers (Kibana, Grafana Loki) render structured JSON beautifully. During local development, you can configure structlog or pino to output pretty-printed, colorized text. You get human-readable locally AND machine-parseable in production.
"I should log the full prompt and full response for debugging." — This is a compliance and cost trap. Full prompts often contain user PII. Full responses can be thousands of tokens. Instead, log a hash of the prompt (for deduplication), token counts (for cost tracking), and the first 200 characters of the response (for quick inspection). Store full prompt/response pairs only in a secure, access-controlled trace backend like Langfuse — never in general application logs.
"Log levels (DEBUG, INFO, WARN, ERROR) are enough for filtering." — Log levels are one dimension. With structured logging, you can filter on any field: tool_name="search_orders" AND latency_ms > 2000 finds slow search calls regardless of log level. Think of log levels as a coarse filter and structured fields as fine-grained filters.
Observability Tools: The Landscape
The AI agent observability ecosystem has matured rapidly. Here are the four major tools you should know, each with different strengths:
Langfuse (Open Source)
Best for: Teams that want full control over their data and prefer self-hosting. Langfuse is purpose-built for LLM observability — it understands what an LLM call is, what tokens cost, and how traces flow through agent loops. Unlike general-purpose observability tools, you do not need to teach it what "input_tokens" means or how to calculate cost per model.
Langfuse integrates directly with the Anthropic SDK via a simple decorator or client wrapper. It provides a web UI where you can click on any trace, see the full span waterfall, inspect the prompt and response for each LLM call, and see exactly how much that request cost.
- Self-hosted (Docker Compose) or cloud option — your data stays on your infrastructure if needed
- Native Python and JS/TS SDKs with minimal code changes to add tracing
- Built-in cost calculation that knows current pricing for Claude, GPT-4, and other models
- Prompt management and versioning — track which prompt version produced which results
OpenTelemetry (CNCF Standard)
Best for: Teams already using OpenTelemetry for microservice observability who want to extend it to LLM calls. If your company already has Jaeger or Grafana Tempo collecting traces from your backend services, OpenTelemetry lets you add LLM spans into that same pipeline. Your agent's trace shows up alongside your API server's trace, giving you end-to-end visibility from the HTTP request all the way through the LLM call and back.
The tradeoff: OpenTelemetry is a general-purpose standard, not LLM-specific. It does not know what "tokens" or "model cost" are out of the box. You need to add those as custom span attributes. The GenAI semantic conventions (standard attribute names for LLM operations) are still being finalized by the CNCF community.
- Vendor-neutral, CNCF-backed standard — no lock-in to any specific backend
- Works with your existing infrastructure (Jaeger, Datadog, Honeycomb, Grafana Tempo)
- Requires more setup for LLM-specific features like cost tracking and prompt inspection
- Semantic conventions for GenAI are still evolving — expect attribute names to stabilize over the next year
OpenTelemetry for Agent Observability
Why OTel: OpenTelemetry is the vendor-neutral CNCF standard for traces, metrics, and logs. You instrument your agent once with the OTel SDK and ship the same telemetry to Datadog, Grafana (Tempo + Loki), or New Relic without rewriting code. For enterprises where the platform team has already standardized on one of these APMs, OTel is the path of least resistance — agent traces flow into the same dashboards engineers already use for the rest of the stack.
Key concepts mapped to agents: one trace per user request (the full agent run from prompt to final response), and one span per tool call, LLM call, or guardrail check inside that run. Span attributes carry the agent-specific signal: llm.model, llm.input_tokens, llm.output_tokens, tool.name, tool.success.
Pseudocode for OTel instrumentation:
tracer = trace.get_tracer("agent")
with tracer.start_as_current_span("agent.run") as root:
root.set_attribute("user.id", hash_user_id(user_id))
while not done:
with tracer.start_as_current_span("llm.call") as s:
s.set_attribute("llm.model", "claude-sonnet-4-6")
response = client.messages.create(...)
s.set_attribute("llm.output_tokens", response.usage.output_tokens)
for tool_call in response.tool_uses:
with tracer.start_as_current_span(f"tool.{tool_call.name}") as ts:
ts.set_attribute("tool.name", tool_call.name)
result = run_tool(tool_call)
When to use which: reach for Langfuse when you need agent-specific UX out of the box — prompt playground, model cost rollups, eval datasets, side-by-side trace diffing. Reach for OTel when the requirement is enterprise APM consolidation — SLO dashboards, on-call alerting, and end-to-end traces that span the load balancer, API server, agent, and downstream microservices in one waterfall. Many production teams run both: OTel feeds the platform APM, Langfuse owns the LLM-native workflow.
LangSmith (LangChain Ecosystem)
Best for: Teams already using LangChain/LangGraph who want tight integration. LangSmith provides tracing, evaluation, and dataset management in one platform.
- Deep LangChain integration (auto-instrumentation)
- Built-in evaluation and testing features
- Cloud-hosted (no self-host option)
- May create vendor lock-in if you move away from LangChain
Arize Phoenix (Open Source)
Best for: Teams focused on model performance analysis, embedding drift detection, and retrieval quality. Strong visualization for understanding how your RAG pipeline performs over time.
- Embedding visualization and drift detection
- RAG-specific evaluation metrics
- Open-source with a cloud option
- Strong evaluation integration
Code Walkthrough: Instrumented Agent with Langfuse
We will build an order-tracking agent instrumented with Langfuse for tracing and structlog/pino for structured logging. The agent creates traces, nests spans, redacts PII, and — critically — never lets observability failures break the agent itself.
Step 1: Setup and Configuration
Let's start by initializing the two pillars of our observability stack: a Langfuse client for tracing and a structured logger for machine-readable log entries. Langfuse will capture the "what happened" story — traces and spans showing every decision the agent made. Structlog (Python) or pino (Node.js) will capture the "what should we alert on" signals — JSON log entries with fields like latency, token counts, and error status.
There is one critical design decision here: the Langfuse client is wrapped in a try/catch block. Why? Because if Langfuse's servers are down or your API keys are misconfigured, your agent must still serve users. Observability is important, but never more important than the agent's primary job. Also, never hardcode API keys — always pull them from environment variables so they do not end up in version control.
import os
import re
import time
import hashlib
import anthropic
import structlog
from langfuse import Langfuse
# --- Langfuse client (tracing) ---
# Falls back gracefully if credentials are missing
try:
langfuse = Langfuse(
public_key=os.environ.get("LANGFUSE_PUBLIC_KEY", ""),
secret_key=os.environ.get("LANGFUSE_SECRET_KEY", ""),
host=os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com"),
)
TRACING_ENABLED = True
except Exception:
langfuse = None
TRACING_ENABLED = False
# --- Structured logger (structlog) ---
structlog.configure(
processors=[
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.add_log_level,
structlog.processors.JSONRenderer(),
]
)
logger = structlog.get_logger()
# --- Anthropic client ---
client = anthropic.Anthropic()
# --- PII redaction utility ---
def redact_pii(text: str) -> str:
"""Remove emails, phone numbers, and SSNs from text."""
text = re.sub(
r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
'[EMAIL_REDACTED]', text
)
text = re.sub(
r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
'[PHONE_REDACTED]', text
)
text = re.sub(
r'\b\d{3}-\d{2}-\d{4}\b',
'[SSN_REDACTED]', text
)
return text
def hash_user_id(user_id: str) -> str:
"""One-way hash for user identifiers in logs."""
return "usr_" + hashlib.sha256(
user_id.encode()
).hexdigest()[:12]
import Anthropic from "@anthropic-ai/sdk";
import { Langfuse } from "langfuse";
import pino from "pino";
import crypto from "crypto";
// --- Langfuse client (tracing) ---
let langfuse: Langfuse | null = null;
let TRACING_ENABLED = false;
try {
langfuse = new Langfuse({
publicKey: process.env.LANGFUSE_PUBLIC_KEY ?? "",
secretKey: process.env.LANGFUSE_SECRET_KEY ?? "",
baseUrl: process.env.LANGFUSE_HOST ?? "https://cloud.langfuse.com",
});
TRACING_ENABLED = true;
} catch {
langfuse = null;
}
// --- Structured logger (pino) ---
const logger = pino({
level: "info",
timestamp: pino.stdTimeFunctions.isoTime,
formatters: { level: (label) => ({ level: label }) },
});
// --- Anthropic client ---
const client = new Anthropic();
// --- PII redaction utility ---
function redactPii(text: string): string {
return text
.replace(
/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g,
"[EMAIL_REDACTED]"
)
.replace(/\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g, "[PHONE_REDACTED]")
.replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[SSN_REDACTED]");
}
function hashUserId(userId: string): string {
return (
"usr_" +
crypto.createHash("sha256").update(userId).digest("hex").slice(0, 12)
);
}
hash_user_id function creates a one-way hash so you can correlate logs by user without storing their actual ID.
Step 2: The Instrumented Agent Loop
Now for the heart of the module: the agent loop itself, fully instrumented. Here is the mental model for what this code does. Every time a user sends a message, three things happen in parallel with the agent's work: a trace is created (the "flight recorder" starts), spans are opened and closed for each operation (LLM calls, tool calls), and structured log entries are emitted at every step. Think of it as the agent doing its job while a diligent assistant takes notes on every decision, timing each one with a stopwatch.
There is one design pattern you will see repeated throughout this code, and it is the most important lesson of this entire module: every single observability call is wrapped in try/except. This is not paranoia — it is a production requirement. Here is the reasoning: if Langfuse's servers go down at 3 AM, what should happen? Option A: your agent crashes and 1,000 users get error pages. Option B: your agent keeps serving users, and you lose 30 minutes of trace data that you can reconstruct from structured logs later. The answer is always Option B. Observability is a supporting actor, never the lead.
TOOLS = [
{
"name": "search_orders",
"description": "Search for orders by order ID or customer ID.",
"input_schema": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "Order ID or customer ID"}
},
"required": ["query"],
},
}
]
def search_orders(query: str) -> dict:
"""Simulated order lookup."""
return {
"order_id": "ORD-12345",
"status": "shipped",
"carrier": "FedEx",
"tracking": "FX9876543210",
"estimated_delivery": "2025-03-18",
}
def run_agent(user_message: str, user_id: str = "anonymous"):
"""Run the agent with full tracing and structured logging."""
hashed_uid = hash_user_id(user_id)
trace = None
trace_id = None
# --- Create trace ---
if TRACING_ENABLED:
try:
trace = langfuse.trace(
name="order-agent",
user_id=hashed_uid,
metadata={"agent_version": "v2.1"},
input=redact_pii(user_message),
)
trace_id = trace.id
except Exception as e:
logger.warning("tracing_init_failed", error=str(e))
logger.info(
"agent_start",
trace_id=trace_id,
user_id=hashed_uid,
input_length=len(user_message),
)
messages = [{"role": "user", "content": user_message}]
step = 0
max_steps = 10
while step < max_steps:
step += 1
start_time = time.time()
# --- LLM span ---
llm_span = None
if trace:
try:
llm_span = trace.span(
name="llm.chat",
metadata={"model": "claude-sonnet-4-6", "step": step},
)
except Exception:
pass
try:
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
tools=TOOLS,
messages=messages,
)
latency_ms = int((time.time() - start_time) * 1000)
# Log the LLM call
logger.info(
"llm_call",
trace_id=trace_id,
step=step,
model="claude-sonnet-4-6",
input_tokens=response.usage.input_tokens,
output_tokens=response.usage.output_tokens,
latency_ms=latency_ms,
stop_reason=response.stop_reason,
)
# Update LLM span
if llm_span:
try:
llm_span.end(
metadata={
"input_tokens": response.usage.input_tokens,
"output_tokens": response.usage.output_tokens,
"latency_ms": latency_ms,
"stop_reason": response.stop_reason,
}
)
except Exception:
pass
except Exception as e:
logger.error(
"llm_call_failed",
trace_id=trace_id,
step=step,
error=str(e),
)
if llm_span:
try:
llm_span.end(metadata={"error": str(e)})
except Exception:
pass
break
# --- Handle tool use ---
if response.stop_reason == "tool_use":
for block in response.content:
if block.type == "tool_use":
tool_start = time.time()
tool_span = None
if trace:
try:
tool_span = trace.span(
name=f"tool.{block.name}",
input=str(block.input),
)
except Exception:
pass
try:
result = search_orders(block.input.get("query", ""))
tool_latency = int((time.time() - tool_start) * 1000)
logger.info(
"tool_call",
trace_id=trace_id,
step=step,
tool_name=block.name,
latency_ms=tool_latency,
success=True,
)
if tool_span:
try:
tool_span.end(
output=str(result),
metadata={
"latency_ms": tool_latency,
"success": True,
},
)
except Exception:
pass
messages.append({"role": "assistant", "content": response.content})
messages.append({
"role": "user",
"content": [{
"type": "tool_result",
"tool_use_id": block.id,
"content": str(result),
}],
})
except Exception as e:
logger.error(
"tool_call_failed",
trace_id=trace_id,
step=step,
tool_name=block.name,
error=str(e),
)
if tool_span:
try:
tool_span.end(metadata={"error": str(e)})
except Exception:
pass
break
else:
# End turn — extract final text
final_text = ""
for block in response.content:
if hasattr(block, "text"):
final_text += block.text
logger.info(
"agent_complete",
trace_id=trace_id,
total_steps=step,
output_length=len(final_text),
)
if trace:
try:
trace.update(output=redact_pii(final_text[:200]))
except Exception:
pass
return final_text
logger.warning("agent_max_steps", trace_id=trace_id, steps=max_steps)
return "I was unable to complete your request. Please try again."
# --- Run it ---
if __name__ == "__main__":
result = run_agent(
"Where is my order ORD-12345?",
user_id="john@example.com"
)
print(result)
if TRACING_ENABLED and langfuse:
langfuse.flush()
const TOOLS = [
{
name: "search_orders",
description: "Search for orders by order ID or customer ID.",
input_schema: {
type: "object" as const,
properties: {
query: { type: "string", description: "Order ID or customer ID" },
},
required: ["query"],
},
},
];
function searchOrders(query: string) {
return {
order_id: "ORD-12345",
status: "shipped",
carrier: "FedEx",
tracking: "FX9876543210",
estimated_delivery: "2025-03-18",
};
}
async function runAgent(userMessage: string, userId = "anonymous") {
const hashedUid = hashUserId(userId);
let trace: any = null;
let traceId: string | null = null;
// --- Create trace ---
if (TRACING_ENABLED && langfuse) {
try {
trace = langfuse.trace({
name: "order-agent",
userId: hashedUid,
metadata: { agent_version: "v2.1" },
input: redactPii(userMessage),
});
traceId = trace.id;
} catch (e) {
logger.warn({ error: String(e) }, "tracing_init_failed");
}
}
logger.info(
{ traceId, userId: hashedUid, inputLength: userMessage.length },
"agent_start"
);
const messages: any[] = [{ role: "user", content: userMessage }];
let step = 0;
const maxSteps = 10;
while (step < maxSteps) {
step++;
const startTime = Date.now();
// --- LLM span ---
let llmSpan: any = null;
if (trace) {
try {
llmSpan = trace.span({
name: "llm.chat",
metadata: { model: "claude-sonnet-4-6", step },
});
} catch {}
}
try {
const response = await client.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 1024,
tools: TOOLS,
messages,
});
const latencyMs = Date.now() - startTime;
logger.info(
{
traceId,
step,
model: "claude-sonnet-4-6",
inputTokens: response.usage.input_tokens,
outputTokens: response.usage.output_tokens,
latencyMs,
stopReason: response.stop_reason,
},
"llm_call"
);
if (llmSpan) {
try {
llmSpan.end({
metadata: {
input_tokens: response.usage.input_tokens,
output_tokens: response.usage.output_tokens,
latency_ms: latencyMs,
stop_reason: response.stop_reason,
},
});
} catch {}
}
// --- Handle tool use ---
if (response.stop_reason === "tool_use") {
for (const block of response.content) {
if (block.type === "tool_use") {
const toolStart = Date.now();
let toolSpan: any = null;
if (trace) {
try {
toolSpan = trace.span({
name: `tool.${block.name}`,
input: JSON.stringify(block.input),
});
} catch {}
}
try {
const result = searchOrders(
(block.input as any).query ?? ""
);
const toolLatency = Date.now() - toolStart;
logger.info(
{
traceId,
step,
toolName: block.name,
latencyMs: toolLatency,
success: true,
},
"tool_call"
);
if (toolSpan) {
try {
toolSpan.end({
output: JSON.stringify(result),
metadata: { latency_ms: toolLatency, success: true },
});
} catch {}
}
messages.push({ role: "assistant", content: response.content });
messages.push({
role: "user",
content: [
{
type: "tool_result",
tool_use_id: block.id,
content: JSON.stringify(result),
},
],
});
} catch (e) {
logger.error(
{ traceId, step, toolName: block.name, error: String(e) },
"tool_call_failed"
);
if (toolSpan) {
try { toolSpan.end({ metadata: { error: String(e) } }); } catch {}
}
break;
}
}
}
} else {
// End turn
let finalText = "";
for (const block of response.content) {
if (block.type === "text") finalText += block.text;
}
logger.info(
{ traceId, totalSteps: step, outputLength: finalText.length },
"agent_complete"
);
if (trace) {
try {
trace.update({ output: redactPii(finalText.slice(0, 200)) });
} catch {}
}
return finalText;
}
} catch (e) {
logger.error({ traceId, step, error: String(e) }, "llm_call_failed");
if (llmSpan) {
try { llmSpan.end({ metadata: { error: String(e) } }); } catch {}
}
break;
}
}
logger.warn({ traceId, steps: maxSteps }, "agent_max_steps");
return "I was unable to complete your request. Please try again.";
}
// --- Run it ---
(async () => {
const result = await runAgent(
"Where is my order ORD-12345?",
"john@example.com"
);
console.log(result);
if (TRACING_ENABLED && langfuse) {
await langfuse.flushAsync();
}
})();
trace_id so you can correlate logs with traces. (5) The agent logs input_tokens, output_tokens, and latency_ms for every call — essential for cost and performance monitoring.
Compliance & Audit Logging
Production agents in regulated industries must meet strict logging requirements. The traces and structured logs you built earlier in this module are your foundation — but compliance adds rules about what you log, how you protect it, and how long you keep it.
What to Log for Each Framework
- HIPAA: Log every access to protected health information (PHI) — who requested it, which agent tool accessed it, what record was returned, and when. You must capture the minimum necessary data for the operation. Never log full patient records in traces; log only record IDs and access metadata.
- SOC 2: Log all authentication events, authorization decisions, configuration changes, and data access patterns. SOC 2 auditors want to see that every agent action is attributable to a specific user and that access controls are enforced consistently.
- GDPR: Log the legal basis for processing personal data, consent status, and any cross-border data transfers. Critically, GDPR’s right-to-erasure means you must be able to locate and delete all traces containing a specific user’s data — design your trace storage with a
user_idindex from day one.
PII Redaction in Logs
Apply PII redaction before data reaches your logging pipeline, not after. Use pattern matching to detect and mask emails, phone numbers, SSNs, and credit card numbers in both agent inputs and outputs. The redaction function you saw in the code walkthrough’s redact_pii() helper is the right pattern — wrap every trace span’s input and output through it.
Audit Trail & Data Retention
Audit logs must be immutable and tamper-proof. Write them to append-only storage (such as S3 with Object Lock or a dedicated audit log service) separate from your application database. Timestamp every entry with a synchronized clock. For retention: HIPAA requires six years, SOC 2 typically one year, and GDPR requires you to delete data when the processing purpose expires — but keep anonymized audit records to prove compliance. Build retention automation early; retrofitting deletion across years of unindexed traces is painful.
Prompt Management & Versioning
Your agent’s system prompt is as critical as any function in your codebase — a single word change can shift behavior across thousands of requests. Yet many teams treat prompts as hardcoded strings buried inside application code, with no review process, no version history, and no way to roll back. Treat prompts as code.
Prompts as Code in Version Control
Store every system prompt, few-shot example set, and tool description in version-controlled files (e.g., prompts/order_agent_v12.txt) rather than inline strings. This gives you Git history showing exactly when a prompt changed, who changed it, and why. Use a prompt registry — a simple config file or a service like Langfuse’s prompt management — to map prompt names to versions at runtime. Your agent loads get_prompt("order_agent", version="latest") instead of reading a hardcoded string, making it trivial to switch versions without redeploying code.
A/B Testing Prompts
When you write a new prompt version, don’t replace the old one immediately. Instead, route a small percentage of traffic (5–10%) to the new version and compare metrics: task success rate, average tokens used, user satisfaction scores, and hallucination rate. Log the prompt version in every trace (add a prompt_version field to your structured logs) so you can slice dashboards by version. Only promote the new prompt to 100% once metrics confirm it performs at least as well as the previous version.
Prompt Regression Testing
Maintain an evaluation suite of 20–50 representative inputs with expected outputs. Run this suite automatically (in CI or as a pre-merge check) whenever a prompt file changes. Compare results against baseline scores — if accuracy drops below your threshold, the change is blocked. This catches regressions that manual review misses: a prompt tweak that improves one edge case but degrades ten common cases. Combine this with the evaluation framework from M18 for a complete quality gate.
Hands-On Exercise
What You'll Build
A fully instrumented order-tracking agent with a PII guardrail span, per-call cost tracking, and a log analysis function that computes summary metrics from structured logs — extending the code walkthrough from earlier in this module.
Time Estimate: 30–40 minutes
Prerequisites: Python 3.9+ installed, an Anthropic API key (ANTHROPIC_API_KEY environment variable set), and pip available. Langfuse account is optional — the agent works without it by using local trace IDs.
Files You'll Create:
traced_agent.py— The instrumented agent with guardrail span and cost trackinglog_analyzer.py— A log query function to analyze agent run data
Environment Setup
mkdir tracing-lab && cd tracing-lab
python -m venv venv
# macOS/Linux:
source venv/bin/activate
# Windows:
# venv\Scripts\activate
pip install anthropic structlog
export ANTHROPIC_API_KEY=your-key-here
# Optional — only if you have a Langfuse account:
# pip install langfuse
# export LANGFUSE_PUBLIC_KEY=your-public-key
# export LANGFUSE_SECRET_KEY=your-secret-key
Step 1: Create the Base Agent with Guardrail Span
We will start from the code walkthrough's agent and add two new features: a guardrail span that checks the final response for PII, and per-call cost tracking. Why the guardrail? Even if your tool data is clean, the LLM might echo back PII from the user's message. For example, if the user says "my email is john@test.com, where's my order?", Claude might include that email in its response. The guardrail catches and redacts it before the response reaches the user.
The cost tracking is straightforward multiplication — input tokens times the per-million price, plus output tokens times their price — but having it in every log entry means you can spot cost anomalies in real time.
Create a new file called traced_agent.py with the following complete code:
import os
import re
import sys
import time
import json
import hashlib
import anthropic
import structlog
# --- Structured logger ---
# Send JSON logs to stderr so stdout stays clean for the agent's user-facing
# response. This lets us pipe stderr to a log file with `2>agent_logs.jsonl`.
structlog.configure(
processors=[
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.add_log_level,
structlog.processors.JSONRenderer(),
],
logger_factory=structlog.PrintLoggerFactory(file=sys.stderr),
)
logger = structlog.get_logger()
# --- Anthropic client ---
client = anthropic.Anthropic()
# --- PII redaction ---
PII_PATTERNS = [
(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', '[EMAIL_REDACTED]'),
(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE_REDACTED]'),
(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN_REDACTED]'),
]
def redact_pii(text: str) -> str:
"""Remove emails, phone numbers, and SSNs from text."""
for pattern, replacement in PII_PATTERNS:
text = re.sub(pattern, replacement, text)
return text
def contains_pii(text: str) -> bool:
"""Check if text contains any PII patterns."""
for pattern, _ in PII_PATTERNS:
if re.search(pattern, text):
return True
return False
def hash_user_id(user_id: str) -> str:
return "usr_" + hashlib.sha256(user_id.encode()).hexdigest()[:12]
# --- Cost calculation ---
# Prices per million tokens (Claude Sonnet)
COST_PER_M_INPUT = 3.00
COST_PER_M_OUTPUT = 15.00
def calculate_cost(input_tokens: int, output_tokens: int) -> float:
"""Calculate cost in USD for a single LLM call."""
input_cost = (input_tokens / 1_000_000) * COST_PER_M_INPUT
output_cost = (output_tokens / 1_000_000) * COST_PER_M_OUTPUT
return round(input_cost + output_cost, 6)
# --- Tool definition ---
TOOLS = [
{
"name": "search_orders",
"description": "Search for orders by order ID or customer ID.",
"input_schema": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "Order ID or customer ID"}
},
"required": ["query"],
},
}
]
def search_orders(query: str) -> dict:
"""Simulated order lookup."""
return {
"order_id": "ORD-12345",
"status": "shipped",
"carrier": "FedEx",
"tracking": "FX9876543210",
"estimated_delivery": "2025-03-18",
}
# --- Agent with guardrail span and cost tracking ---
def run_agent(user_message: str, user_id: str = "anonymous"):
hashed_uid = hash_user_id(user_id)
trace_id = f"tr_{hashlib.md5(f'{time.time()}'.encode()).hexdigest()[:10]}"
total_cost = 0.0
logger.info(
"agent_start",
trace_id=trace_id,
user_id=hashed_uid,
input_length=len(user_message),
)
messages = [{"role": "user", "content": user_message}]
step = 0
max_steps = 10
while step < max_steps:
step += 1
start_time = time.time()
try:
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
tools=TOOLS,
messages=messages,
)
latency_ms = int((time.time() - start_time) * 1000)
call_cost = calculate_cost(
response.usage.input_tokens,
response.usage.output_tokens,
)
total_cost += call_cost
logger.info(
"llm_call",
trace_id=trace_id,
step=step,
model="claude-sonnet-4-6",
input_tokens=response.usage.input_tokens,
output_tokens=response.usage.output_tokens,
latency_ms=latency_ms,
cost_usd=call_cost,
stop_reason=response.stop_reason,
)
except Exception as e:
logger.error(
"llm_call_failed",
trace_id=trace_id,
step=step,
error=str(e),
)
break
# Handle tool use
if response.stop_reason == "tool_use":
for block in response.content:
if block.type == "tool_use":
tool_start = time.time()
try:
result = search_orders(block.input.get("query", ""))
tool_latency = int((time.time() - tool_start) * 1000)
logger.info(
"tool_call",
trace_id=trace_id,
step=step,
tool_name=block.name,
latency_ms=tool_latency,
success=True,
)
messages.append({"role": "assistant", "content": response.content})
messages.append({
"role": "user",
"content": [{
"type": "tool_result",
"tool_use_id": block.id,
"content": str(result),
}],
})
except Exception as e:
logger.error(
"tool_call_failed",
trace_id=trace_id,
step=step,
tool_name=block.name,
error=str(e),
)
break
else:
# Extract final text
final_text = ""
for block in response.content:
if hasattr(block, "text"):
final_text += block.text
# --- GUARDRAIL SPAN: Check for PII in output ---
guard_start = time.time()
pii_found = contains_pii(final_text)
guard_latency = int((time.time() - guard_start) * 1000)
if pii_found:
final_text = redact_pii(final_text)
guard_status = "fail_redacted"
else:
guard_status = "pass"
logger.info(
"guardrail_check",
trace_id=trace_id,
span_name="guardrail.output_check",
pii_detected=pii_found,
status=guard_status,
latency_ms=guard_latency,
)
logger.info(
"agent_complete",
trace_id=trace_id,
total_steps=step,
total_cost_usd=total_cost,
output_length=len(final_text),
)
return final_text
logger.warning("agent_max_steps", trace_id=trace_id, steps=max_steps)
return "I was unable to complete your request. Please try again."
if __name__ == "__main__":
result = run_agent(
"Where is my order ORD-12345?",
user_id="john@example.com"
)
print(f"\nAgent response:\n{result}")
Run command:
Expected output (structured log lines followed by the agent response):
cost_usd fields on each llm_call event, a guardrail_check event with "pii_detected": false and "status": "pass", and an agent_complete event with total_cost_usd. If you see these, Step 1 is working correctly.
Troubleshooting:
- If you see
ModuleNotFoundError: No module named 'structlog'→ runpip install structlog - If you see
AuthenticationError→ check that yourANTHROPIC_API_KEYenvironment variable is set - If you see
RateLimitError→ wait 30 seconds and try again
Step 2: Create the Log Analyzer
Now that the agent produces structured JSON logs, let's write a function that reads those logs and answers operational questions: "What was the total cost and average latency for all agent runs?" This is the kind of query that structured logging makes trivial but free-form logging makes nearly impossible.
Step dependency: This step uses the log output from Step 1. Make sure traced_agent.py from Step 1 is working before proceeding.
Create a new file called log_analyzer.py:
import json
import sys
from collections import defaultdict
def analyze_logs(log_lines: list[str]) -> dict:
"""Analyze structured JSON log lines from the traced agent.
Returns a summary with total cost, average latency,
tool call counts, and guardrail results.
"""
llm_calls = []
tool_calls = []
guardrail_results = []
agent_runs = []
for line in log_lines:
line = line.strip()
if not line:
continue
try:
entry = json.loads(line)
except json.JSONDecodeError:
continue
event = entry.get("event", "")
if event == "llm_call":
llm_calls.append({
"trace_id": entry.get("trace_id"),
"latency_ms": entry.get("latency_ms", 0),
"cost_usd": entry.get("cost_usd", 0),
"input_tokens": entry.get("input_tokens", 0),
"output_tokens": entry.get("output_tokens", 0),
})
elif event == "tool_call":
tool_calls.append({
"trace_id": entry.get("trace_id"),
"tool_name": entry.get("tool_name"),
"latency_ms": entry.get("latency_ms", 0),
"success": entry.get("success", False),
})
elif event == "guardrail_check":
guardrail_results.append({
"trace_id": entry.get("trace_id"),
"pii_detected": entry.get("pii_detected", False),
"status": entry.get("status"),
})
elif event == "agent_complete":
agent_runs.append({
"trace_id": entry.get("trace_id"),
"total_steps": entry.get("total_steps", 0),
"total_cost_usd": entry.get("total_cost_usd", 0),
})
# Compute summary
total_cost = sum(c["cost_usd"] for c in llm_calls)
avg_latency = (
sum(c["latency_ms"] for c in llm_calls) / len(llm_calls)
if llm_calls else 0
)
total_tokens = sum(
c["input_tokens"] + c["output_tokens"] for c in llm_calls
)
pii_detections = sum(1 for g in guardrail_results if g["pii_detected"])
summary = {
"total_llm_calls": len(llm_calls),
"total_tool_calls": len(tool_calls),
"total_agent_runs": len(agent_runs),
"total_cost_usd": round(total_cost, 6),
"avg_llm_latency_ms": round(avg_latency, 1),
"total_tokens": total_tokens,
"guardrail_pii_detections": pii_detections,
"tool_success_rate": (
f"{sum(1 for t in tool_calls if t['success'])}/{len(tool_calls)}"
if tool_calls else "N/A"
),
}
return summary
if __name__ == "__main__":
# Read log lines from stdin or a file
if len(sys.argv) > 1:
with open(sys.argv[1], "r") as f:
lines = f.readlines()
else:
print("Usage: python log_analyzer.py ")
print(" Or pipe logs: python traced_agent.py 2>&1 | python log_analyzer.py /dev/stdin")
sys.exit(1)
summary = analyze_logs(lines)
print(json.dumps(summary, indent=2))
Run command (pipe agent logs to the analyzer):
Expected output:
total_cost_usd should be a small number (under $0.05 for a single run). The guardrail_pii_detections should be 0 since the simulated order data does not contain PII. If you see this output, Step 2 is working correctly.
Troubleshooting:
- If the log file is empty → make sure you used
2>(stderr redirect). Step 1'straced_agent.pyconfigures structlog withPrintLoggerFactory(file=sys.stderr), so logs go to stderr while the agent response goes to stdout. By default (without that factory) structlog writes to stdout, which would mix logs with the response — double-check thelogger_factory=line is present. - If you see
json.JSONDecodeError→ some non-JSON output mixed in. The analyzer skips bad lines, but check that your traced_agent.py is printing the agent response to stdout (not stderr)
Verify Everything Works
Run both steps together as a single end-to-end pipeline:
You should see: (1) the agent response printed to the terminal, (2) all structured JSON log lines, and (3) the analysis summary. If all three appear with sensible values, you have a working instrumented agent with guardrail spans, cost tracking, and log analysis.
Knowledge Check
1. What is the relationship between a trace and a span?
2. Which of the following should NEVER appear in production agent logs?
3. An agent trace shows: LLM call (200ms) → tool.search_db (3200ms) → LLM call (250ms). Total latency is 3650ms. What is the bottleneck?
4. What is the primary advantage of OpenTelemetry over LLM-specific tools like Langfuse?
5. A user reports: "The agent gave me the wrong order status." What should you check FIRST?
6. In a span hierarchy, which statement about parent-child relationships is correct?
Your Score
Summary
You have completed the first module of Track 6: Observability. Here is what you learned:
- Observability vs. Monitoring: Monitoring tells you that something broke; observability tells you why. For non-deterministic AI agents, observability is essential.
- Traces: The top-level container capturing the complete lifecycle of a single agent request. Every trace has a unique ID and contains a tree of spans.
- Spans: Individual units of work (LLM calls, tool invocations, guardrail checks) nested in parent-child hierarchies with timing data.
- Structured Logging: JSON-formatted logs with consistent keys (trace_id, span_id, latency_ms, tokens) that enable filtering, aggregation, and alerting. Always redact PII before logging.
- Tools: Langfuse (LLM-native, self-hostable), OpenTelemetry (vendor-neutral standard), LangSmith (LangChain ecosystem), Arize Phoenix (embedding/RAG focus).
- Golden Rule: Never let observability failures break your agent. Wrap all tracing/logging calls in try/except.
Next up: M20: Monitoring & Continuous Improvement builds on this foundation — you will learn to set up dashboards, alerts, and feedback loops that keep your agent improving over time.