← M18: Evaluation & Testing 🏠 Home M20: Monitoring & CI →

M19: Tracing & Logging

See inside your agent's decision-making with distributed traces, structured logs, and production-ready observability pipelines.

Learning Objectives

Explain why traditional logging is insufficient for AI agents and what observability adds
Create end-to-end traces that capture every decision an agent makes
Nest spans to show parent-child relationships and timing breakdowns
Implement structured JSON logging with PII redaction
Compare observability tools (Langfuse, OpenTelemetry, LangSmith, Arize) and choose the right one

Why Observability Matters for Agents

Analogy: Driving at Night Before: Imagine driving a car at night with no headlights, no dashboard gauges, and no GPS. You press the accelerator and hope you are going the right speed, in the right direction, on the right road. If something goes wrong, you have no way to know what happened until you crash.

The pain: This is exactly what running an AI agent without observability feels like. A user reports "it gave me the wrong answer," and you have no trace of which tools were called, what the model reasoning was, or where the chain broke. You are debugging blind, guessing at causes, and wasting hours reproducing issues that may be non-deterministic.

The mapping: Observability is your headlights, dashboard, and GPS combined. Traces show you the road your agent took (every step in sequence). Spans show you timing for each turn and acceleration. Structured logs are your gauges — real-time readings of speed, fuel, and engine temperature. With these instruments, you can see exactly what happened, when, and why.

Here is what observability data actually looks like in practice. This is a single structured log entry from an agent run — the "dashboard reading" from the analogy above:

{"timestamp":"2025-03-15T14:32:02.694Z","level":"info","event":"llm_call","trace_id":"tr_a1b2c3d4","step":1,"model":"claude-sonnet-4-6","input_tokens":1840,"output_tokens":89,"latency_ms":847,"stop_reason":"tool_use"}

Every field is a "gauge" you can query. The trace_id links this entry to all other entries from the same request. The latency_ms tells you how fast the engine is running. The stop_reason tells you what the agent decided to do next (in this case, call a tool).

Technical Definition: Observability is the practice of instrumenting your system so you can answer any question about its behavior after the fact, without deploying new code. In other words: when a user reports a bug, you can pull up the exact execution record and see what happened — without adding new logging, without asking the user to reproduce the issue, and without guessing.

For AI agents, observability is especially critical because agents are non-deterministic. The same input can produce different outputs depending on model temperature, tool availability, and conversation context. Traditional debugging — set a breakpoint, reproduce the issue, step through the code — often fails because you simply cannot reproduce the exact same conditions twice. Observability sidesteps this problem entirely: it captures every execution as it happens, giving you a permanent, queryable record to analyze after the fact.

Why is agent observability different from traditional application observability? Three reasons:

Non-determinism: The same prompt can yield different tool calls and reasoning paths on each run. Unlike a REST API that returns the same JSON for the same input, Claude might call search_orders on one run and check_inventory on another. You need traces to see which path was taken for each specific request.
Multi-step chains: Agents make 5-20 decisions per request. Each decision is its own step: call a tool, parse the result, decide the next action. A bug in step 7 may only manifest as a wrong answer in step 12, with 5 correct steps in between hiding the root cause.
Cost visibility: Each LLM call costs money — typically $0.003 to $0.015 per call depending on the model and token count. Without token-level tracking, a runaway loop (where the agent calls Claude 50 times instead of 3) can burn through your budget before anyone notices. Traces show you exactly how many calls each request consumed.

Agent Without vs With Observability

Without Observability

With Observability

Left: Agent runs with no visibility — user reports "wrong answer," developer has no data. Right: Agent runs with traces and logs — developer instantly sees the tool returned stale data at step 3, causing the wrong answer at step 5.

Why It Matters Teams without observability spend an average of 4-8 hours debugging a single agent failure. Teams with proper tracing resolve the same issues in 10-15 minutes because they can see the exact sequence of decisions, tool calls, and model outputs that led to the problem. At scale (1,000+ agent runs per day), that difference is the difference between a sustainable product and a team drowning in support tickets.

⚠️ Common Misconceptions

"Observability is just logging with a fancier name." — Not quite. Logging captures individual events as text lines. Observability combines three pillars — traces (structured execution records), logs (machine-parseable event entries), and metrics (aggregated measurements) — into a system where you can answer questions you did not anticipate when you wrote the code. Logs alone cannot show you causal relationships between steps.

"I only need observability when something breaks." — Observability is most valuable for understanding normal behavior so you can spot anomalies. If you do not know that your P50 latency is normally 1.2 seconds, you cannot detect when it creeps to 3.5 seconds. Baseline data is the foundation of anomaly detection.

"Adding tracing will slow down my agent." — Modern tracing libraries (Langfuse, OpenTelemetry) use asynchronous batching — spans are queued in memory and flushed in background threads. The overhead is typically under 5ms per request. The performance insight you gain (finding a 3-second database bottleneck, for example) far outweighs the microsecond cost of recording it.

"I can just use print() statements and grep the output." — This works for local debugging with one user. It collapses at scale. When your agent handles 1,000 concurrent requests, print output from different requests interleaves randomly. Without trace_id correlation, you cannot reconstruct which log lines belong to which request. Structured logging with trace IDs solves this.

"More data is always better — log everything." — Over-logging creates two problems. First, cost: shipping 10 GB/day of logs to a cloud aggregator is expensive. Second, noise: when every line is logged, finding the important signals requires complex queries. Log the right things (decisions, errors, timing, token counts) rather than everything.

In the UCC filing pipeline, observability would capture every step of a filing analysis: the initial query to BigQuery for debtor records, the entity resolution call that matches "ACME CORP" to "Acme Corporation Inc," the risk scoring computation, and the final report generation. If a filing risk score seems wrong, you pull up the trace and immediately see whether the entity resolution matched the wrong company or the risk model received stale collateral data.

Now that you understand why observability is essential, let's look at the primary building block: the trace. A trace captures the complete story of a single agent execution from start to finish.

Agent Traces

Analogy: Flight Recorder (Black Box) Before: After a plane lands safely, nobody needs the flight recorder. But when something goes wrong — a near-miss, a mechanical failure, an unexpected diversion — the flight recorder is the single most valuable piece of evidence. It captures every input (altitude, speed, control inputs) and every output (engine response, heading changes) in a continuous timeline.

The pain: Without a flight recorder, investigators would have to rely on pilot memory and guesswork. For AI agents, "pilot memory" is your application logs, which typically only say "request received" and "response sent" with nothing in between. When a user says "the agent told me to cancel my order but I asked for a refund," you have no record of the reasoning chain.

The mapping: A is your agent's flight recorder. It captures every decision, tool call, and model invocation from the moment a user request arrives to the moment the response is sent. Each trace has a unique trace_id that lets you pull up the complete story of any execution, even months later.

Here is what a trace object actually looks like as a data structure — the "flight recorder" in JSON form:

{ "trace_id": "tr_a1b2c3d4e5f6", "name": "order-agent", "start_time": "2025-03-15T14:32:01.847Z", "end_time": "2025-03-15T14:32:03.454Z", "duration_ms": 1607, "user_id": "usr_8f14e45fce", "input": "Where is my order ORD-12345?", "output": "Your order ORD-12345 shipped via FedEx...", "status": "ok", "spans": [ { "span_id": "sp_001", "name": "llm.chat", "duration_ms": 847 }, { "span_id": "sp_002", "name": "tool.search_orders", "duration_ms": 47 }, { "span_id": "sp_003", "name": "llm.chat", "duration_ms": 712 } ] }

Notice how the trace is a single JSON object that tells the complete story: who asked, what the agent did (three spans), how long each step took, and what it answered. This is the data structure you will build in the code walkthrough below.

Technical Definition: Trace A trace is a data structure that represents the complete lifecycle of a single request through your agent. Think of it as a container with four key pieces. First, a globally unique trace_id — typically a UUID — that identifies this specific execution. Second, a start timestamp and end timestamp so you know exactly how long the request took. Third, metadata about the request: who made it (user_id), which conversation it belongs to (session_id), and a summary of the input. Fourth, a tree of spans representing individual units of work — we will cover spans in detail in the next section.

Traces follow the OpenTelemetry standard. That means they work across languages (Python, TypeScript, Go) and across tools (Langfuse, Jaeger, Datadog). You write your tracing code once, and any OpenTelemetry-compatible backend can ingest it. In short: the trace is the top-level container, and spans are the building blocks inside it.

What a Trace Captures

Every trace for an AI agent should record:

Input: The user's message (or a hash/summary if it contains PII)
Model calls: Which model was invoked, what prompt was sent, tokens used, latency
Tool calls: Which tools were invoked, their inputs and outputs, success/failure
Decisions: The agent's reasoning at each step (what it chose to do and why)
Output: The final response delivered to the user
Metadata: Total cost, total latency, error status, user satisfaction signal

Trace Waterfall — Nested Span Timing

Trace Waterfall — Nested Spans with Timing

0ms500ms1000ms1500ms2000ms2500ms

A horizontal waterfall chart showing a trace with nested spans: the full trace spans 2400ms. Inside it, an LLM call (800ms), a tool call to search_orders (450ms), another LLM call (600ms), a guardrail check (120ms), and a final response assembly (200ms). Parent-child nesting shows which spans are contained within others.

What Just Happened?

The waterfall animation showed a single trace with multiple spans. Notice how the LLM calls take the most time (800ms and 600ms), while guardrail checks are fast (120ms). The tool call to search_orders took 450ms — if this suddenly jumped to 3000ms in production, the trace would immediately show you where the bottleneck is. That is the power of traces: instant root-cause visibility.

Common Misconception "Traces are just fancy logs." No. Logs are individual lines of text with timestamps. Traces are structured, hierarchical records with parent-child relationships, timing data, and cross-service correlation. A log says "tool called at 14:32:01." A trace says "this tool call was the 3rd step in trace abc-123, took 450ms, was a child of the LLM reasoning span, and returned 3 results." Traces connect the dots; logs are the individual dots.

Each trace is made up of spans — the individual units of work. Understanding how spans nest inside each other is the key to reading (and building) effective traces.

Spans: Nesting & Timing

Analogy: Recipe Steps Before: Think about following a complex recipe. "Make lasagna" is the top-level task. Inside it, you have "make sauce" (which contains "chop onions," "brown meat," "simmer"), "make pasta layers" (which contains "boil noodles," "drain"), and "assemble and bake." Each sub-step is nested inside a larger step, and each has its own start time and end time.

The pain: If your lasagna tastes wrong, you need to know which sub-step failed. Was it the sauce (burnt onions)? The noodles (overcooked)? The baking (too long)? Without timing each step, you are tasting the final product and guessing backwards.

The mapping: A is one step in the recipe. The root span ("handle user request") contains child spans ("call LLM," "execute tool," "validate output"). Each child can have its own children ("parse JSON response" inside "execute tool"). The nesting shows you exactly where time is spent and where errors originate.

Here is what a single span looks like as a data structure — the "recipe step card" from the analogy above. Just as a recipe step has a name ("brown the meat"), a start time, an end time, and belongs to a larger step ("make sauce"), a span has all the same pieces in JSON form:

{ "span_id": "sp_7g8h9i0j", "parent_span_id": "sp_root_001", "trace_id": "tr_a1b2c3d4e5f6", "name": "llm.chat", "start_time": "2025-03-15T14:32:01.850Z", "end_time": "2025-03-15T14:32:02.697Z", "status": "ok", "attributes": { "model": "claude-sonnet-4-6", "input_tokens": 1840, "output_tokens": 89, "latency_ms": 847, "stop_reason": "tool_use" } }

The parent_span_id is the key field — it links this LLM call to its parent (the root agent span), creating the tree structure you see in waterfall visualizations.

Technical Definition: Span A span represents a single unit of work within a trace. Every span carries six pieces of information. First, a unique span_id that identifies this specific operation. Second, a parent_span_id that links it to the span that triggered it — if this is the top-level span (the root), the parent is null. Third, a name describing the operation, like "llm.chat" or "tool.search_orders."

Fourth, start_time and end_time timestamps — these let you calculate exactly how long this unit of work took. Fifth, a status field (ok or error) telling you whether the operation succeeded. And sixth, attributes — key-value metadata specific to this span. For an LLM span, attributes might include model: "claude-sonnet-4-6" and tokens_used: 1240. For a tool span, you might store tool_name: "search_orders" and success: true.

The parent-child relationships between spans create a tree structure that mirrors the actual execution flow of your agent. When you visualize this tree as a waterfall chart, you can instantly see which operations are slow, which ones failed, and how they depend on each other.

Span Types for AI Agents

When instrumenting an agent, you will typically create five types of spans. Each type captures different information, and together they give you a complete picture of every decision the agent made.

Root span (agent): The entire request lifecycle, from the moment a user message arrives to the moment a response is sent. This is the top-level container — every other span is a child (or grandchild) of this one. It captures total latency, overall status (success or error), and the final output summary.
LLM span (generation): Each call to Claude. This is usually the most expensive span in terms of both time and money. Key attributes to record: model name, prompt tokens, completion tokens, latency in milliseconds, cost in USD, and stop_reason (which tells you whether Claude wants to call a tool or is finished).
Tool span (tool): Each tool invocation — database queries, API calls, file reads. Attributes: tool name, input parameters, output summary (truncated to avoid storing large payloads), success/failure status, and latency. Tool spans are where you most often find performance bottlenecks.
Retrieval span: RAG lookups, where you search a vector database for relevant context. Attributes: the search query, number of results returned, relevance scores, and which documents were selected. This is critical for debugging "the agent gave a wrong answer" issues — often the retriever found the wrong documents.
Guardrail span: Input/output validation checks. Attributes: check name (e.g., "pii_output_check"), pass/fail result, and the reason for failure. These spans are typically very fast (under 10ms) but their pass/fail status is essential for compliance auditing.

Timing and Performance Analysis

The timing data in spans is where observability pays for itself. Without spans, you know the total request took 5 seconds. With spans, you know exactly where those 5 seconds went. Here are the three performance questions that span timing answers:

Where is time spent? If your agent takes 5 seconds, is it 3 seconds in LLM calls, 1.5 seconds in tool calls, and 0.5 seconds in processing? Or is one tool call taking 4 seconds while everything else is fast? The waterfall chart makes this immediately visual — long bars are slow operations.
Are there sequential bottlenecks? Two independent tool calls running sequentially instead of in parallel would show up as back-to-back spans instead of overlapping ones. For example, if your agent looks up order status and checks inventory, those two lookups can run at the same time. Sequential spans tell you "these could be parallelized for a 2x speedup."
Is there a slow dependency? A database lookup that takes 2000ms+ consistently is immediately visible in the waterfall. More importantly, you can track this over time: if the database call was 50ms last week and is 2000ms today, something changed in your database — not your agent code.

The key insight is that optimization without span data is guesswork. Developers often assume the LLM call is the bottleneck (because "AI is slow"), but in practice, database queries, API calls to third-party services, and large payload serialization are frequently the real culprits.

Why It Matters A production agent handling 10,000 requests/day with an average latency of 3 seconds generates 30,000 seconds of execution time daily. If spans reveal that 40% of that time is spent in a single tool call that could be cached, you can cut your P50 latency from 3s to 1.8s — and your compute costs drop proportionally. Without span-level timing, you would never know where to optimize.

You now understand the structure of traces and spans. But raw trace data is only useful if it is paired with well-structured logs. Next, we will cover how to log the right fields in the right format — and, critically, what to never log.

Structured Logging

Analogy: Filing Cabinet vs Junk Drawer Before: Imagine two ways to store receipts. In the junk drawer approach, you toss every receipt into a single drawer in whatever order they arrive. To find a specific receipt, you dump out the entire drawer and read every piece of paper. In the filing cabinet approach, every receipt goes into a labeled folder, sorted by date, category, and amount.

The pain: Most application logs are the junk drawer. Lines like INFO: Processing request... and DEBUG: Tool called successfully tell you almost nothing when you are searching through 10 million log lines for the one request that failed at 3:47 AM. You end up writing fragile regex patterns to parse free-form text, and half your queries return false matches.

The mapping: is the filing cabinet. Every log entry is a JSON object — not a free-form text string. Each object has consistent keys like trace_id, span_id, timestamp, level, and message, plus domain-specific fields relevant to your agent. The payoff is queryability: instead of writing fragile regex, you can ask precise questions like "show me all ERROR logs where tool_name=search_orders and latency_ms > 2000." That query takes seconds, even across millions of entries, because every field is a first-class key that your log aggregator can index and filter on.

How does structured logging work under the hood? When your agent calls logger.info("tool_call", tool_name="search_orders", latency_ms=47), the logging library does three things. First, it creates a dictionary with your key-value pairs. Second, it adds automatic fields like the ISO timestamp, log level, and process ID. Third, it serializes the entire dictionary to a JSON string and writes it to the output stream (stdout, a file, or a log aggregation service). The result is a single line of JSON per event — no multi-line stack traces breaking your parsers, no free-form messages with inconsistent formats.

How does structured logging differ from the print() statements and logging.info("some message") calls you have used before? Traditional logging outputs human-readable text: "2025-03-15 Tool search_orders called, took 47ms, found 3 results". That is easy to read in a terminal, but impossible to query at scale. Structured logging outputs machine-readable JSON: {"tool_name":"search_orders","latency_ms":47,"result_count":3}. That is slightly harder to read in a terminal (though tools like jq help), but trivially easy to query, filter, aggregate, and alert on. The tradeoff is clear: you give up a bit of readability in exchange for massive gains in operability.

What to Log: The Essential Fields

Every structured log entry from your agent should include these fields:

{
  // Correlation fields — link this log to its trace
  "trace_id": "tr_a1b2c3d4e5f6",
  "span_id": "sp_7g8h9i0j",
  "timestamp": "2025-03-15T14:32:01.847Z",

  // Classification fields — filter and aggregate
  "level": "info",           // debug, info, warn, error
  "agent_id": "order-agent-v2",
  "step_number": 3,          // which step in the agent loop

  // Execution fields — understand what happened
  "model": "claude-sonnet-4-6",
  "token_count": { "input": 1840, "output": 312 },
  "tool_name": "search_orders",
  "latency_ms": 847,

  // Context fields — understand the situation
  "message": "Tool call completed successfully",
  "user_id": "usr_hashed_abc",  // hashed, never raw
  "session_id": "sess_xyz789"
}

What NOT to Log: The Danger Zone

Security: Never Log These

PII (Personally Identifiable Information): Names, emails, phone numbers, addresses, SSNs. Hash or redact before logging.
API keys and secrets: Never log Authorization headers, API keys, or database credentials. Even in debug mode.
Full prompt contents: Prompts may contain user data. Log a hash or summary instead of the full text.
Full model responses: Responses may contain generated PII or sensitive reasoning. Log a truncated summary.
Medical/financial data: PHI (HIPAA), payment card numbers (PCI-DSS), and financial account details have strict regulatory requirements.

What to Log vs What NOT to Log

Structured Logging Pipeline with PII Redaction

Raw Log Event

trace_id: tr_abc tool: search user_email: john@test.com latency: 847ms patient_name: John Smith tokens: 1840

→

🚫

PII

FILTER

→

Clean Log Entry

trace_id: tr_abc tool: search user_id: usr_h4sh3d latency: 847ms patient_id: pt_r3d4ct tokens: 1840

↓

Log Aggregator

Queryable JSON store

Alerting System

latency_ms > 5000

Dashboard

Real-time metrics

A pipeline showing: (1) Raw log with PII fields highlighted in red, (2) PII filter gate that strips/hashes sensitive data, (3) Clean log entry with safe fields, (4) Clean logs flowing to a log aggregator, alerting system, and dashboard.

What Just Happened?

The animation showed a raw log entry flowing through a PII filter that strips emails and names, replacing them with hashed identifiers. The clean entry then fans out to three destinations: a log aggregator for querying, an alerting system for threshold violations, and a dashboard for real-time visibility. This pipeline is the backbone of production observability.

⚠️ Common Misconceptions

"Structured logging means I have to give up human-readable output." — Not true. Tools like jq (command line) and log viewers (Kibana, Grafana Loki) render structured JSON beautifully. During local development, you can configure structlog or pino to output pretty-printed, colorized text. You get human-readable locally AND machine-parseable in production.

"I should log the full prompt and full response for debugging." — This is a compliance and cost trap. Full prompts often contain user PII. Full responses can be thousands of tokens. Instead, log a hash of the prompt (for deduplication), token counts (for cost tracking), and the first 200 characters of the response (for quick inspection). Store full prompt/response pairs only in a secure, access-controlled trace backend like Langfuse — never in general application logs.

"Log levels (DEBUG, INFO, WARN, ERROR) are enough for filtering." — Log levels are one dimension. With structured logging, you can filter on any field: tool_name="search_orders" AND latency_ms > 2000 finds slow search calls regardless of log level. Think of log levels as a coarse filter and structured fields as fine-grained filters.

With structured logs and traces in place, you need tools to collect, store, and visualize this data. Let's survey the observability landscape for AI agents.

Observability Tools: The Landscape

The AI agent observability ecosystem has matured rapidly. Here are the four major tools you should know, each with different strengths:

Langfuse (Open Source)

Best for: Teams that want full control over their data and prefer self-hosting. Langfuse is purpose-built for LLM observability — it understands what an LLM call is, what tokens cost, and how traces flow through agent loops. Unlike general-purpose observability tools, you do not need to teach it what "input_tokens" means or how to calculate cost per model.

Langfuse integrates directly with the Anthropic SDK via a simple decorator or client wrapper. It provides a web UI where you can click on any trace, see the full span waterfall, inspect the prompt and response for each LLM call, and see exactly how much that request cost.

Self-hosted (Docker Compose) or cloud option — your data stays on your infrastructure if needed
Native Python and JS/TS SDKs with minimal code changes to add tracing
Built-in cost calculation that knows current pricing for Claude, GPT-4, and other models
Prompt management and versioning — track which prompt version produced which results

OpenTelemetry (CNCF Standard)

Best for: Teams already using OpenTelemetry for microservice observability who want to extend it to LLM calls. If your company already has Jaeger or Grafana Tempo collecting traces from your backend services, OpenTelemetry lets you add LLM spans into that same pipeline. Your agent's trace shows up alongside your API server's trace, giving you end-to-end visibility from the HTTP request all the way through the LLM call and back.

The tradeoff: OpenTelemetry is a general-purpose standard, not LLM-specific. It does not know what "tokens" or "model cost" are out of the box. You need to add those as custom span attributes. The GenAI semantic conventions (standard attribute names for LLM operations) are still being finalized by the CNCF community.

Vendor-neutral, CNCF-backed standard — no lock-in to any specific backend
Works with your existing infrastructure (Jaeger, Datadog, Honeycomb, Grafana Tempo)
Requires more setup for LLM-specific features like cost tracking and prompt inspection
Semantic conventions for GenAI are still evolving — expect attribute names to stabilize over the next year

OpenTelemetry for Agent Observability

Why OTel: OpenTelemetry is the vendor-neutral CNCF standard for traces, metrics, and logs. You instrument your agent once with the OTel SDK and ship the same telemetry to Datadog, Grafana (Tempo + Loki), or New Relic without rewriting code. For enterprises where the platform team has already standardized on one of these APMs, OTel is the path of least resistance — agent traces flow into the same dashboards engineers already use for the rest of the stack.

Key concepts mapped to agents: one trace per user request (the full agent run from prompt to final response), and one span per tool call, LLM call, or guardrail check inside that run. Span attributes carry the agent-specific signal: llm.model, llm.input_tokens, llm.output_tokens, tool.name, tool.success.

Pseudocode for OTel instrumentation:

tracer = trace.get_tracer("agent")

with tracer.start_as_current_span("agent.run") as root:
    root.set_attribute("user.id", hash_user_id(user_id))
    while not done:
        with tracer.start_as_current_span("llm.call") as s:
            s.set_attribute("llm.model", "claude-sonnet-4-6")
            response = client.messages.create(...)
            s.set_attribute("llm.output_tokens", response.usage.output_tokens)
        for tool_call in response.tool_uses:
            with tracer.start_as_current_span(f"tool.{tool_call.name}") as ts:
                ts.set_attribute("tool.name", tool_call.name)
                result = run_tool(tool_call)

When to use which: reach for Langfuse when you need agent-specific UX out of the box — prompt playground, model cost rollups, eval datasets, side-by-side trace diffing. Reach for OTel when the requirement is enterprise APM consolidation — SLO dashboards, on-call alerting, and end-to-end traces that span the load balancer, API server, agent, and downstream microservices in one waterfall. Many production teams run both: OTel feeds the platform APM, Langfuse owns the LLM-native workflow.

LangSmith (LangChain Ecosystem)

Best for: Teams already using LangChain/LangGraph who want tight integration. LangSmith provides tracing, evaluation, and dataset management in one platform.

Deep LangChain integration (auto-instrumentation)
Built-in evaluation and testing features
Cloud-hosted (no self-host option)
May create vendor lock-in if you move away from LangChain

Arize Phoenix (Open Source)

Best for: Teams focused on model performance analysis, embedding drift detection, and retrieval quality. Strong visualization for understanding how your RAG pipeline performs over time.

Embedding visualization and drift detection
RAG-specific evaluation metrics
Open-source with a cloud option
Strong evaluation integration

Observability Tool Comparison

Langfuse

Open Source / Cloud

✓ LLM-native traces

✓ Cost tracking

✓ Self-hostable

✓ Prompt mgmt

~ GenAI evals

OpenTelemetry

CNCF Standard

✓ Vendor neutral

✓ Existing infra

✓ Multi-language

~ LLM support

— Cost tracking

LangSmith

Cloud Platform

✓ LangChain native

✓ Eval + datasets

✓ Auto-instrument

— Self-hostable

~ Non-LC agents

Arize Phoenix

Open Source / Cloud

✓ Embedding viz

✓ Drift detection

✓ RAG evals

✓ Self-hostable

~ Cost tracking

Four cards comparing Langfuse, OpenTelemetry, LangSmith, and Arize Phoenix across features like tracing, cost tracking, self-hosting, and evaluation support.

Why It Matters Choosing the wrong observability tool can lock you into a vendor or leave critical blind spots. If you are building with Claude directly (not LangChain), Langfuse or OpenTelemetry are the strongest choices. Langfuse gives you LLM-specific features out of the box; OpenTelemetry gives you vendor neutrality and integration with your existing infrastructure. Most production teams end up using both: OpenTelemetry for infrastructure spans and Langfuse (or similar) for LLM-specific trace analysis.

Now let's put it all together. In the code walkthrough, you will build an instrumented agent that creates traces, nests spans, logs structured data with PII redaction, and handles observability failures gracefully.

Code Walkthrough: Instrumented Agent with Langfuse

We will build an order-tracking agent instrumented with Langfuse for tracing and structlog/pino for structured logging. The agent creates traces, nests spans, redacts PII, and — critically — never lets observability failures break the agent itself.

Step 1: Setup and Configuration

Let's start by initializing the two pillars of our observability stack: a Langfuse client for tracing and a structured logger for machine-readable log entries. Langfuse will capture the "what happened" story — traces and spans showing every decision the agent made. Structlog (Python) or pino (Node.js) will capture the "what should we alert on" signals — JSON log entries with fields like latency, token counts, and error status.

There is one critical design decision here: the Langfuse client is wrapped in a try/catch block. Why? Because if Langfuse's servers are down or your API keys are misconfigured, your agent must still serve users. Observability is important, but never more important than the agent's primary job. Also, never hardcode API keys — always pull them from environment variables so they do not end up in version control.

import os
import re
import time
import hashlib
import anthropic
import structlog
from langfuse import Langfuse

# --- Langfuse client (tracing) ---
# Falls back gracefully if credentials are missing
try:
    langfuse = Langfuse(
        public_key=os.environ.get("LANGFUSE_PUBLIC_KEY", ""),
        secret_key=os.environ.get("LANGFUSE_SECRET_KEY", ""),
        host=os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com"),
    )
    TRACING_ENABLED = True
except Exception:
    langfuse = None
    TRACING_ENABLED = False

# --- Structured logger (structlog) ---
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ]
)
logger = structlog.get_logger()

# --- Anthropic client ---
client = anthropic.Anthropic()

# --- PII redaction utility ---
def redact_pii(text: str) -> str:
    """Remove emails, phone numbers, and SSNs from text."""
    text = re.sub(
        r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
        '[EMAIL_REDACTED]', text
    )
    text = re.sub(
        r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
        '[PHONE_REDACTED]', text
    )
    text = re.sub(
        r'\b\d{3}-\d{2}-\d{4}\b',
        '[SSN_REDACTED]', text
    )
    return text

def hash_user_id(user_id: str) -> str:
    """One-way hash for user identifiers in logs."""
    return "usr_" + hashlib.sha256(
        user_id.encode()
    ).hexdigest()[:12]

import Anthropic from "@anthropic-ai/sdk";
import { Langfuse } from "langfuse";
import pino from "pino";
import crypto from "crypto";

// --- Langfuse client (tracing) ---
let langfuse: Langfuse | null = null;
let TRACING_ENABLED = false;

try {
  langfuse = new Langfuse({
    publicKey: process.env.LANGFUSE_PUBLIC_KEY ?? "",
    secretKey: process.env.LANGFUSE_SECRET_KEY ?? "",
    baseUrl: process.env.LANGFUSE_HOST ?? "https://cloud.langfuse.com",
  });
  TRACING_ENABLED = true;
} catch {
  langfuse = null;
}

// --- Structured logger (pino) ---
const logger = pino({
  level: "info",
  timestamp: pino.stdTimeFunctions.isoTime,
  formatters: { level: (label) => ({ level: label }) },
});

// --- Anthropic client ---
const client = new Anthropic();

// --- PII redaction utility ---
function redactPii(text: string): string {
  return text
    .replace(
      /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g,
      "[EMAIL_REDACTED]"
    )
    .replace(/\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g, "[PHONE_REDACTED]")
    .replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[SSN_REDACTED]");
}

function hashUserId(userId: string): string {
  return (
    "usr_" +
    crypto.createHash("sha256").update(userId).digest("hex").slice(0, 12)
  );
}

What Just Happened?

We set up three things: (1) A Langfuse client wrapped in a try/catch so the agent works even if tracing is down, (2) A structured logger that outputs JSON with ISO timestamps, and (3) A PII redaction function that strips emails, phone numbers, and SSNs before they reach any log or trace. The hash_user_id function creates a one-way hash so you can correlate logs by user without storing their actual ID.

Step 2: The Instrumented Agent Loop

Now for the heart of the module: the agent loop itself, fully instrumented. Here is the mental model for what this code does. Every time a user sends a message, three things happen in parallel with the agent's work: a trace is created (the "flight recorder" starts), spans are opened and closed for each operation (LLM calls, tool calls), and structured log entries are emitted at every step. Think of it as the agent doing its job while a diligent assistant takes notes on every decision, timing each one with a stopwatch.

There is one design pattern you will see repeated throughout this code, and it is the most important lesson of this entire module: every single observability call is wrapped in try/except. This is not paranoia — it is a production requirement. Here is the reasoning: if Langfuse's servers go down at 3 AM, what should happen? Option A: your agent crashes and 1,000 users get error pages. Option B: your agent keeps serving users, and you lose 30 minutes of trace data that you can reconstruct from structured logs later. The answer is always Option B. Observability is a supporting actor, never the lead.

TOOLS = [
    {
        "name": "search_orders",
        "description": "Search for orders by order ID or customer ID.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Order ID or customer ID"}
            },
            "required": ["query"],
        },
    }
]

def search_orders(query: str) -> dict:
    """Simulated order lookup."""
    return {
        "order_id": "ORD-12345",
        "status": "shipped",
        "carrier": "FedEx",
        "tracking": "FX9876543210",
        "estimated_delivery": "2025-03-18",
    }

def run_agent(user_message: str, user_id: str = "anonymous"):
    """Run the agent with full tracing and structured logging."""
    hashed_uid = hash_user_id(user_id)
    trace = None
    trace_id = None

    # --- Create trace ---
    if TRACING_ENABLED:
        try:
            trace = langfuse.trace(
                name="order-agent",
                user_id=hashed_uid,
                metadata={"agent_version": "v2.1"},
                input=redact_pii(user_message),
            )
            trace_id = trace.id
        except Exception as e:
            logger.warning("tracing_init_failed", error=str(e))

    logger.info(
        "agent_start",
        trace_id=trace_id,
        user_id=hashed_uid,
        input_length=len(user_message),
    )

    messages = [{"role": "user", "content": user_message}]
    step = 0
    max_steps = 10

    while step < max_steps:
        step += 1
        start_time = time.time()

        # --- LLM span ---
        llm_span = None
        if trace:
            try:
                llm_span = trace.span(
                    name="llm.chat",
                    metadata={"model": "claude-sonnet-4-6", "step": step},
                )
            except Exception:
                pass

        try:
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                tools=TOOLS,
                messages=messages,
            )
            latency_ms = int((time.time() - start_time) * 1000)

            # Log the LLM call
            logger.info(
                "llm_call",
                trace_id=trace_id,
                step=step,
                model="claude-sonnet-4-6",
                input_tokens=response.usage.input_tokens,
                output_tokens=response.usage.output_tokens,
                latency_ms=latency_ms,
                stop_reason=response.stop_reason,
            )

            # Update LLM span
            if llm_span:
                try:
                    llm_span.end(
                        metadata={
                            "input_tokens": response.usage.input_tokens,
                            "output_tokens": response.usage.output_tokens,
                            "latency_ms": latency_ms,
                            "stop_reason": response.stop_reason,
                        }
                    )
                except Exception:
                    pass

        except Exception as e:
            logger.error(
                "llm_call_failed",
                trace_id=trace_id,
                step=step,
                error=str(e),
            )
            if llm_span:
                try:
                    llm_span.end(metadata={"error": str(e)})
                except Exception:
                    pass
            break

        # --- Handle tool use ---
        if response.stop_reason == "tool_use":
            for block in response.content:
                if block.type == "tool_use":
                    tool_start = time.time()
                    tool_span = None

                    if trace:
                        try:
                            tool_span = trace.span(
                                name=f"tool.{block.name}",
                                input=str(block.input),
                            )
                        except Exception:
                            pass

                    try:
                        result = search_orders(block.input.get("query", ""))
                        tool_latency = int((time.time() - tool_start) * 1000)

                        logger.info(
                            "tool_call",
                            trace_id=trace_id,
                            step=step,
                            tool_name=block.name,
                            latency_ms=tool_latency,
                            success=True,
                        )

                        if tool_span:
                            try:
                                tool_span.end(
                                    output=str(result),
                                    metadata={
                                        "latency_ms": tool_latency,
                                        "success": True,
                                    },
                                )
                            except Exception:
                                pass

                        messages.append({"role": "assistant", "content": response.content})
                        messages.append({
                            "role": "user",
                            "content": [{
                                "type": "tool_result",
                                "tool_use_id": block.id,
                                "content": str(result),
                            }],
                        })

                    except Exception as e:
                        logger.error(
                            "tool_call_failed",
                            trace_id=trace_id,
                            step=step,
                            tool_name=block.name,
                            error=str(e),
                        )
                        if tool_span:
                            try:
                                tool_span.end(metadata={"error": str(e)})
                            except Exception:
                                pass
                        break
        else:
            # End turn — extract final text
            final_text = ""
            for block in response.content:
                if hasattr(block, "text"):
                    final_text += block.text

            logger.info(
                "agent_complete",
                trace_id=trace_id,
                total_steps=step,
                output_length=len(final_text),
            )

            if trace:
                try:
                    trace.update(output=redact_pii(final_text[:200]))
                except Exception:
                    pass

            return final_text

    logger.warning("agent_max_steps", trace_id=trace_id, steps=max_steps)
    return "I was unable to complete your request. Please try again."


# --- Run it ---
if __name__ == "__main__":
    result = run_agent(
        "Where is my order ORD-12345?",
        user_id="john@example.com"
    )
    print(result)
    if TRACING_ENABLED and langfuse:
        langfuse.flush()

const TOOLS = [
  {
    name: "search_orders",
    description: "Search for orders by order ID or customer ID.",
    input_schema: {
      type: "object" as const,
      properties: {
        query: { type: "string", description: "Order ID or customer ID" },
      },
      required: ["query"],
    },
  },
];

function searchOrders(query: string) {
  return {
    order_id: "ORD-12345",
    status: "shipped",
    carrier: "FedEx",
    tracking: "FX9876543210",
    estimated_delivery: "2025-03-18",
  };
}

async function runAgent(userMessage: string, userId = "anonymous") {
  const hashedUid = hashUserId(userId);
  let trace: any = null;
  let traceId: string | null = null;

  // --- Create trace ---
  if (TRACING_ENABLED && langfuse) {
    try {
      trace = langfuse.trace({
        name: "order-agent",
        userId: hashedUid,
        metadata: { agent_version: "v2.1" },
        input: redactPii(userMessage),
      });
      traceId = trace.id;
    } catch (e) {
      logger.warn({ error: String(e) }, "tracing_init_failed");
    }
  }

  logger.info(
    { traceId, userId: hashedUid, inputLength: userMessage.length },
    "agent_start"
  );

  const messages: any[] = [{ role: "user", content: userMessage }];
  let step = 0;
  const maxSteps = 10;

  while (step < maxSteps) {
    step++;
    const startTime = Date.now();

    // --- LLM span ---
    let llmSpan: any = null;
    if (trace) {
      try {
        llmSpan = trace.span({
          name: "llm.chat",
          metadata: { model: "claude-sonnet-4-6", step },
        });
      } catch {}
    }

    try {
      const response = await client.messages.create({
        model: "claude-sonnet-4-6",
        max_tokens: 1024,
        tools: TOOLS,
        messages,
      });
      const latencyMs = Date.now() - startTime;

      logger.info(
        {
          traceId,
          step,
          model: "claude-sonnet-4-6",
          inputTokens: response.usage.input_tokens,
          outputTokens: response.usage.output_tokens,
          latencyMs,
          stopReason: response.stop_reason,
        },
        "llm_call"
      );

      if (llmSpan) {
        try {
          llmSpan.end({
            metadata: {
              input_tokens: response.usage.input_tokens,
              output_tokens: response.usage.output_tokens,
              latency_ms: latencyMs,
              stop_reason: response.stop_reason,
            },
          });
        } catch {}
      }

      // --- Handle tool use ---
      if (response.stop_reason === "tool_use") {
        for (const block of response.content) {
          if (block.type === "tool_use") {
            const toolStart = Date.now();
            let toolSpan: any = null;

            if (trace) {
              try {
                toolSpan = trace.span({
                  name: `tool.${block.name}`,
                  input: JSON.stringify(block.input),
                });
              } catch {}
            }

            try {
              const result = searchOrders(
                (block.input as any).query ?? ""
              );
              const toolLatency = Date.now() - toolStart;

              logger.info(
                {
                  traceId,
                  step,
                  toolName: block.name,
                  latencyMs: toolLatency,
                  success: true,
                },
                "tool_call"
              );

              if (toolSpan) {
                try {
                  toolSpan.end({
                    output: JSON.stringify(result),
                    metadata: { latency_ms: toolLatency, success: true },
                  });
                } catch {}
              }

              messages.push({ role: "assistant", content: response.content });
              messages.push({
                role: "user",
                content: [
                  {
                    type: "tool_result",
                    tool_use_id: block.id,
                    content: JSON.stringify(result),
                  },
                ],
              });
            } catch (e) {
              logger.error(
                { traceId, step, toolName: block.name, error: String(e) },
                "tool_call_failed"
              );
              if (toolSpan) {
                try { toolSpan.end({ metadata: { error: String(e) } }); } catch {}
              }
              break;
            }
          }
        }
      } else {
        // End turn
        let finalText = "";
        for (const block of response.content) {
          if (block.type === "text") finalText += block.text;
        }

        logger.info(
          { traceId, totalSteps: step, outputLength: finalText.length },
          "agent_complete"
        );

        if (trace) {
          try {
            trace.update({ output: redactPii(finalText.slice(0, 200)) });
          } catch {}
        }

        return finalText;
      }
    } catch (e) {
      logger.error({ traceId, step, error: String(e) }, "llm_call_failed");
      if (llmSpan) {
        try { llmSpan.end({ metadata: { error: String(e) } }); } catch {}
      }
      break;
    }
  }

  logger.warn({ traceId, steps: maxSteps }, "agent_max_steps");
  return "I was unable to complete your request. Please try again.";
}

// --- Run it ---
(async () => {
  const result = await runAgent(
    "Where is my order ORD-12345?",
    "john@example.com"
  );
  console.log(result);
  if (TRACING_ENABLED && langfuse) {
    await langfuse.flushAsync();
  }
})();

Example Structured Log Output (JSON)

{"timestamp":"2025-03-15T14:32:01.847Z","level":"info","event":"agent_start","trace_id":"tr_a1b2c3d4","user_id":"usr_8f14e45fce","input_length":29} {"timestamp":"2025-03-15T14:32:02.694Z","level":"info","event":"llm_call","trace_id":"tr_a1b2c3d4","step":1,"model":"claude-sonnet-4-6","input_tokens":1840,"output_tokens":89,"latency_ms":847,"stop_reason":"tool_use"} {"timestamp":"2025-03-15T14:32:02.741Z","level":"info","event":"tool_call","trace_id":"tr_a1b2c3d4","step":1,"tool_name":"search_orders","latency_ms":47,"success":true} {"timestamp":"2025-03-15T14:32:03.453Z","level":"info","event":"llm_call","trace_id":"tr_a1b2c3d4","step":2,"model":"claude-sonnet-4-6","input_tokens":2102,"output_tokens":156,"latency_ms":712,"stop_reason":"end_turn"} {"timestamp":"2025-03-15T14:32:03.454Z","level":"info","event":"agent_complete","trace_id":"tr_a1b2c3d4","total_steps":2,"output_length":198}

What Just Happened?

We built a complete instrumented agent. Key patterns: (1) Every observability call is wrapped in try/except — if Langfuse is down, the agent still serves users. (2) PII is redacted before it reaches any trace or log. (3) Every LLM call and tool call gets its own span with timing metadata. (4) Structured logs include trace_id so you can correlate logs with traces. (5) The agent logs input_tokens, output_tokens, and latency_ms for every call — essential for cost and performance monitoring.

Compliance & Audit Logging

Production agents in regulated industries must meet strict logging requirements. The traces and structured logs you built earlier in this module are your foundation — but compliance adds rules about what you log, how you protect it, and how long you keep it.

What to Log for Each Framework

HIPAA: Log every access to protected health information (PHI) — who requested it, which agent tool accessed it, what record was returned, and when. You must capture the minimum necessary data for the operation. Never log full patient records in traces; log only record IDs and access metadata.
SOC 2: Log all authentication events, authorization decisions, configuration changes, and data access patterns. SOC 2 auditors want to see that every agent action is attributable to a specific user and that access controls are enforced consistently.
GDPR: Log the legal basis for processing personal data, consent status, and any cross-border data transfers. Critically, GDPR’s right-to-erasure means you must be able to locate and delete all traces containing a specific user’s data — design your trace storage with a user_id index from day one.

PII Redaction in Logs

Apply PII redaction before data reaches your logging pipeline, not after. Use pattern matching to detect and mask emails, phone numbers, SSNs, and credit card numbers in both agent inputs and outputs. The redaction function you saw in the code walkthrough’s redact_pii() helper is the right pattern — wrap every trace span’s input and output through it.

Compliance Log Pipeline — Redact Then Store

Audit Trail & Data Retention

Audit logs must be immutable and tamper-proof. Write them to append-only storage (such as S3 with Object Lock or a dedicated audit log service) separate from your application database. Timestamp every entry with a synchronized clock. For retention: HIPAA requires six years, SOC 2 typically one year, and GDPR requires you to delete data when the processing purpose expires — but keep anonymized audit records to prove compliance. Build retention automation early; retrofitting deletion across years of unindexed traces is painful.

What Just Happened?

You learned the three compliance dimensions of agent logging: what to capture (access events, authorization decisions, legal basis), what to exclude (PII/PHI redacted before storage), and how to store it (immutable, indexed by user ID, with framework-specific retention periods).

Prompt Management & Versioning

Your agent’s system prompt is as critical as any function in your codebase — a single word change can shift behavior across thousands of requests. Yet many teams treat prompts as hardcoded strings buried inside application code, with no review process, no version history, and no way to roll back. Treat prompts as code.

Prompts as Code in Version Control

Store every system prompt, few-shot example set, and tool description in version-controlled files (e.g., prompts/order_agent_v12.txt) rather than inline strings. This gives you Git history showing exactly when a prompt changed, who changed it, and why. Use a prompt registry — a simple config file or a service like Langfuse’s prompt management — to map prompt names to versions at runtime. Your agent loads get_prompt("order_agent", version="latest") instead of reading a hardcoded string, making it trivial to switch versions without redeploying code.

A/B Testing Prompts

When you write a new prompt version, don’t replace the old one immediately. Instead, route a small percentage of traffic (5–10%) to the new version and compare metrics: task success rate, average tokens used, user satisfaction scores, and hallucination rate. Log the prompt version in every trace (add a prompt_version field to your structured logs) so you can slice dashboards by version. Only promote the new prompt to 100% once metrics confirm it performs at least as well as the previous version.

Prompt Regression Testing

Maintain an evaluation suite of 20–50 representative inputs with expected outputs. Run this suite automatically (in CI or as a pre-merge check) whenever a prompt file changes. Compare results against baseline scores — if accuracy drops below your threshold, the change is blocked. This catches regressions that manual review misses: a prompt tweak that improves one edge case but degrades ten common cases. Combine this with the evaluation framework from M18 for a complete quality gate.

What Just Happened?

You learned the three pillars of prompt management: version control (prompts in Git with a registry for runtime lookup), A/B testing (gradual rollout with metric comparison), and regression testing (automated eval suites that block bad changes before they reach production).

Hands-On Exercise

What You'll Build

A fully instrumented order-tracking agent with a PII guardrail span, per-call cost tracking, and a log analysis function that computes summary metrics from structured logs — extending the code walkthrough from earlier in this module.

Time Estimate: 30–40 minutes

Prerequisites: Python 3.9+ installed, an Anthropic API key (ANTHROPIC_API_KEY environment variable set), and pip available. Langfuse account is optional — the agent works without it by using local trace IDs.

Files You'll Create:

traced_agent.py — The instrumented agent with guardrail span and cost tracking
log_analyzer.py — A log query function to analyze agent run data

Environment Setup

mkdir tracing-lab && cd tracing-lab
python -m venv venv
# macOS/Linux:
source venv/bin/activate
# Windows:
# venv\Scripts\activate
pip install anthropic structlog
export ANTHROPIC_API_KEY=your-key-here
# Optional — only if you have a Langfuse account:
# pip install langfuse
# export LANGFUSE_PUBLIC_KEY=your-public-key
# export LANGFUSE_SECRET_KEY=your-secret-key

Step 1: Create the Base Agent with Guardrail Span

We will start from the code walkthrough's agent and add two new features: a guardrail span that checks the final response for PII, and per-call cost tracking. Why the guardrail? Even if your tool data is clean, the LLM might echo back PII from the user's message. For example, if the user says "my email is john@test.com, where's my order?", Claude might include that email in its response. The guardrail catches and redacts it before the response reaches the user.

The cost tracking is straightforward multiplication — input tokens times the per-million price, plus output tokens times their price — but having it in every log entry means you can spot cost anomalies in real time.

Create a new file called traced_agent.py with the following complete code:

import os
import re
import sys
import time
import json
import hashlib
import anthropic
import structlog

# --- Structured logger ---
# Send JSON logs to stderr so stdout stays clean for the agent's user-facing
# response. This lets us pipe stderr to a log file with `2>agent_logs.jsonl`.
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ],
    logger_factory=structlog.PrintLoggerFactory(file=sys.stderr),
)
logger = structlog.get_logger()

# --- Anthropic client ---
client = anthropic.Anthropic()

# --- PII redaction ---
PII_PATTERNS = [
    (r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', '[EMAIL_REDACTED]'),
    (r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE_REDACTED]'),
    (r'\b\d{3}-\d{2}-\d{4}\b', '[SSN_REDACTED]'),
]

def redact_pii(text: str) -> str:
    """Remove emails, phone numbers, and SSNs from text."""
    for pattern, replacement in PII_PATTERNS:
        text = re.sub(pattern, replacement, text)
    return text

def contains_pii(text: str) -> bool:
    """Check if text contains any PII patterns."""
    for pattern, _ in PII_PATTERNS:
        if re.search(pattern, text):
            return True
    return False

def hash_user_id(user_id: str) -> str:
    return "usr_" + hashlib.sha256(user_id.encode()).hexdigest()[:12]

# --- Cost calculation ---
# Prices per million tokens (Claude Sonnet)
COST_PER_M_INPUT = 3.00
COST_PER_M_OUTPUT = 15.00

def calculate_cost(input_tokens: int, output_tokens: int) -> float:
    """Calculate cost in USD for a single LLM call."""
    input_cost = (input_tokens / 1_000_000) * COST_PER_M_INPUT
    output_cost = (output_tokens / 1_000_000) * COST_PER_M_OUTPUT
    return round(input_cost + output_cost, 6)

# --- Tool definition ---
TOOLS = [
    {
        "name": "search_orders",
        "description": "Search for orders by order ID or customer ID.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Order ID or customer ID"}
            },
            "required": ["query"],
        },
    }
]

def search_orders(query: str) -> dict:
    """Simulated order lookup."""
    return {
        "order_id": "ORD-12345",
        "status": "shipped",
        "carrier": "FedEx",
        "tracking": "FX9876543210",
        "estimated_delivery": "2025-03-18",
    }

# --- Agent with guardrail span and cost tracking ---
def run_agent(user_message: str, user_id: str = "anonymous"):
    hashed_uid = hash_user_id(user_id)
    trace_id = f"tr_{hashlib.md5(f'{time.time()}'.encode()).hexdigest()[:10]}"
    total_cost = 0.0

    logger.info(
        "agent_start",
        trace_id=trace_id,
        user_id=hashed_uid,
        input_length=len(user_message),
    )

    messages = [{"role": "user", "content": user_message}]
    step = 0
    max_steps = 10

    while step < max_steps:
        step += 1
        start_time = time.time()

        try:
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                tools=TOOLS,
                messages=messages,
            )
            latency_ms = int((time.time() - start_time) * 1000)
            call_cost = calculate_cost(
                response.usage.input_tokens,
                response.usage.output_tokens,
            )
            total_cost += call_cost

            logger.info(
                "llm_call",
                trace_id=trace_id,
                step=step,
                model="claude-sonnet-4-6",
                input_tokens=response.usage.input_tokens,
                output_tokens=response.usage.output_tokens,
                latency_ms=latency_ms,
                cost_usd=call_cost,
                stop_reason=response.stop_reason,
            )

        except Exception as e:
            logger.error(
                "llm_call_failed",
                trace_id=trace_id,
                step=step,
                error=str(e),
            )
            break

        # Handle tool use
        if response.stop_reason == "tool_use":
            for block in response.content:
                if block.type == "tool_use":
                    tool_start = time.time()
                    try:
                        result = search_orders(block.input.get("query", ""))
                        tool_latency = int((time.time() - tool_start) * 1000)

                        logger.info(
                            "tool_call",
                            trace_id=trace_id,
                            step=step,
                            tool_name=block.name,
                            latency_ms=tool_latency,
                            success=True,
                        )

                        messages.append({"role": "assistant", "content": response.content})
                        messages.append({
                            "role": "user",
                            "content": [{
                                "type": "tool_result",
                                "tool_use_id": block.id,
                                "content": str(result),
                            }],
                        })
                    except Exception as e:
                        logger.error(
                            "tool_call_failed",
                            trace_id=trace_id,
                            step=step,
                            tool_name=block.name,
                            error=str(e),
                        )
                        break
        else:
            # Extract final text
            final_text = ""
            for block in response.content:
                if hasattr(block, "text"):
                    final_text += block.text

            # --- GUARDRAIL SPAN: Check for PII in output ---
            guard_start = time.time()
            pii_found = contains_pii(final_text)
            guard_latency = int((time.time() - guard_start) * 1000)

            if pii_found:
                final_text = redact_pii(final_text)
                guard_status = "fail_redacted"
            else:
                guard_status = "pass"

            logger.info(
                "guardrail_check",
                trace_id=trace_id,
                span_name="guardrail.output_check",
                pii_detected=pii_found,
                status=guard_status,
                latency_ms=guard_latency,
            )

            logger.info(
                "agent_complete",
                trace_id=trace_id,
                total_steps=step,
                total_cost_usd=total_cost,
                output_length=len(final_text),
            )

            return final_text

    logger.warning("agent_max_steps", trace_id=trace_id, steps=max_steps)
    return "I was unable to complete your request. Please try again."


if __name__ == "__main__":
    result = run_agent(
        "Where is my order ORD-12345?",
        user_id="john@example.com"
    )
    print(f"\nAgent response:\n{result}")

Run command:

python traced_agent.py

Expected output (structured log lines followed by the agent response):

{"timestamp":"...","level":"info","event":"agent_start","trace_id":"tr_...","user_id":"usr_8f14e45fce","input_length":29} {"timestamp":"...","level":"info","event":"llm_call","trace_id":"tr_...","step":1,"model":"claude-sonnet-4-6","input_tokens":...,"output_tokens":...,"latency_ms":...,"cost_usd":0.00xxxx,"stop_reason":"tool_use"} {"timestamp":"...","level":"info","event":"tool_call","trace_id":"tr_...","step":1,"tool_name":"search_orders","latency_ms":0,"success":true} {"timestamp":"...","level":"info","event":"llm_call","trace_id":"tr_...","step":2,...,"cost_usd":0.00xxxx,"stop_reason":"end_turn"} {"timestamp":"...","level":"info","event":"guardrail_check","trace_id":"tr_...","span_name":"guardrail.output_check","pii_detected":false,"status":"pass","latency_ms":0} {"timestamp":"...","level":"info","event":"agent_complete","trace_id":"tr_...","total_steps":2,"total_cost_usd":0.00xxxx,"output_length":...} Agent response: Your order ORD-12345 has been shipped via FedEx...

✅ Checkpoint

You should see structured JSON log lines with cost_usd fields on each llm_call event, a guardrail_check event with "pii_detected": false and "status": "pass", and an agent_complete event with total_cost_usd. If you see these, Step 1 is working correctly.

Troubleshooting:

If you see ModuleNotFoundError: No module named 'structlog' → run pip install structlog
If you see AuthenticationError → check that your ANTHROPIC_API_KEY environment variable is set
If you see RateLimitError → wait 30 seconds and try again

Step 2: Create the Log Analyzer

Now that the agent produces structured JSON logs, let's write a function that reads those logs and answers operational questions: "What was the total cost and average latency for all agent runs?" This is the kind of query that structured logging makes trivial but free-form logging makes nearly impossible.

Step dependency: This step uses the log output from Step 1. Make sure traced_agent.py from Step 1 is working before proceeding.

Create a new file called log_analyzer.py:

import json
import sys
from collections import defaultdict


def analyze_logs(log_lines: list[str]) -> dict:
    """Analyze structured JSON log lines from the traced agent.

    Returns a summary with total cost, average latency,
    tool call counts, and guardrail results.
    """
    llm_calls = []
    tool_calls = []
    guardrail_results = []
    agent_runs = []

    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue

        event = entry.get("event", "")

        if event == "llm_call":
            llm_calls.append({
                "trace_id": entry.get("trace_id"),
                "latency_ms": entry.get("latency_ms", 0),
                "cost_usd": entry.get("cost_usd", 0),
                "input_tokens": entry.get("input_tokens", 0),
                "output_tokens": entry.get("output_tokens", 0),
            })
        elif event == "tool_call":
            tool_calls.append({
                "trace_id": entry.get("trace_id"),
                "tool_name": entry.get("tool_name"),
                "latency_ms": entry.get("latency_ms", 0),
                "success": entry.get("success", False),
            })
        elif event == "guardrail_check":
            guardrail_results.append({
                "trace_id": entry.get("trace_id"),
                "pii_detected": entry.get("pii_detected", False),
                "status": entry.get("status"),
            })
        elif event == "agent_complete":
            agent_runs.append({
                "trace_id": entry.get("trace_id"),
                "total_steps": entry.get("total_steps", 0),
                "total_cost_usd": entry.get("total_cost_usd", 0),
            })

    # Compute summary
    total_cost = sum(c["cost_usd"] for c in llm_calls)
    avg_latency = (
        sum(c["latency_ms"] for c in llm_calls) / len(llm_calls)
        if llm_calls else 0
    )
    total_tokens = sum(
        c["input_tokens"] + c["output_tokens"] for c in llm_calls
    )
    pii_detections = sum(1 for g in guardrail_results if g["pii_detected"])

    summary = {
        "total_llm_calls": len(llm_calls),
        "total_tool_calls": len(tool_calls),
        "total_agent_runs": len(agent_runs),
        "total_cost_usd": round(total_cost, 6),
        "avg_llm_latency_ms": round(avg_latency, 1),
        "total_tokens": total_tokens,
        "guardrail_pii_detections": pii_detections,
        "tool_success_rate": (
            f"{sum(1 for t in tool_calls if t['success'])}/{len(tool_calls)}"
            if tool_calls else "N/A"
        ),
    }

    return summary


if __name__ == "__main__":
    # Read log lines from stdin or a file
    if len(sys.argv) > 1:
        with open(sys.argv[1], "r") as f:
            lines = f.readlines()
    else:
        print("Usage: python log_analyzer.py ")
        print("  Or pipe logs: python traced_agent.py 2>&1 | python log_analyzer.py /dev/stdin")
        sys.exit(1)

    summary = analyze_logs(lines)
    print(json.dumps(summary, indent=2))

Run command (pipe agent logs to the analyzer):

python traced_agent.py 2>agent_logs.jsonl && python log_analyzer.py agent_logs.jsonl

Expected output:

{ "total_llm_calls": 2, "total_tool_calls": 1, "total_agent_runs": 1, "total_cost_usd": 0.008xxx, "avg_llm_latency_ms": 780.0, "total_tokens": 4087, "guardrail_pii_detections": 0, "tool_success_rate": "1/1" }

✅ Checkpoint

You should see a JSON summary with all fields populated. The total_cost_usd should be a small number (under $0.05 for a single run). The guardrail_pii_detections should be 0 since the simulated order data does not contain PII. If you see this output, Step 2 is working correctly.

Troubleshooting:

If the log file is empty → make sure you used 2> (stderr redirect). Step 1's traced_agent.py configures structlog with PrintLoggerFactory(file=sys.stderr), so logs go to stderr while the agent response goes to stdout. By default (without that factory) structlog writes to stdout, which would mix logs with the response — double-check the logger_factory= line is present.
If you see json.JSONDecodeError → some non-JSON output mixed in. The analyzer skips bad lines, but check that your traced_agent.py is printing the agent response to stdout (not stderr)

Verify Everything Works

Run both steps together as a single end-to-end pipeline:

python traced_agent.py 2>agent_logs.jsonl && echo "--- Agent logs ---" && cat agent_logs.jsonl && echo "" && echo "--- Analysis ---" && python log_analyzer.py agent_logs.jsonl

You should see: (1) the agent response printed to the terminal, (2) all structured JSON log lines, and (3) the analysis summary. If all three appear with sensible values, you have a working instrumented agent with guardrail spans, cost tracking, and log analysis.

🎉 Congratulations

You have built a production-style observability pipeline from scratch. Your agent creates structured logs with trace IDs, tracks per-call costs, runs a PII guardrail on output, and you can analyze the logs programmatically. These are the same patterns used by teams running agents at scale.

Knowledge Check

1. What is the relationship between a trace and a span?

AA trace is a type of span with additional metadata

BA trace is the top-level container; spans are the individual units of work nested inside it

CTraces and spans are independent data structures linked by timestamps

DA span contains multiple traces, one per tool call

Correct! A trace is the top-level container representing the entire request lifecycle. Spans are nested inside the trace and represent individual operations (LLM calls, tool invocations, guardrail checks). Each span has a parent_span_id that creates the tree structure.

Not quite. A trace is the top-level container for an entire request lifecycle. Spans are nested inside it, each representing a single unit of work (LLM call, tool invocation, etc.). Spans link to each other via parent_span_id to form a tree.

2. Which of the following should NEVER appear in production agent logs?

AToken counts and model names

BLatency measurements and trace IDs

CUser email addresses and full prompt contents

DTool names and error messages

Correct! PII like email addresses and full prompt contents must never appear in production logs. Emails are PII that violate privacy regulations; prompts may contain user-provided sensitive data. Always hash user identifiers and log prompt summaries or hashes instead of full text.

The answer is user email addresses and full prompt contents. Emails are PII, and prompts often contain user-provided sensitive data. Token counts, latency, trace IDs, tool names, and error messages are all safe and useful to log.

3. An agent trace shows: LLM call (200ms) → tool.search_db (3200ms) → LLM call (250ms). Total latency is 3650ms. What is the bottleneck?

AThe LLM calls are too slow and need a faster model

BThe database tool call at 3200ms is the bottleneck — consider caching or query optimization

CThe agent is making too many steps and needs fewer tool calls

DThe total latency is normal for an AI agent and needs no optimization

Correct! The database tool call accounts for 3200ms out of 3650ms total (87.7% of execution time). The LLM calls are fast (200ms and 250ms). Optimizing the database query, adding caching, or using connection pooling would have the highest impact on latency.

Look at the timing breakdown: the database tool call is 3200ms out of 3650ms total (87.7%). The LLM calls are actually quite fast at 200ms and 250ms. The database call is the clear bottleneck — caching or query optimization would have the biggest impact.

4. What is the primary advantage of OpenTelemetry over LLM-specific tools like Langfuse?

ABetter cost tracking for LLM calls

BBuilt-in prompt management and versioning

CVendor neutrality — instrument once, send to any compatible backend

DAutomatic PII redaction in traces

Correct! OpenTelemetry's primary advantage is vendor neutrality. It is a CNCF standard supported by dozens of backends (Jaeger, Grafana, Datadog, Honeycomb, etc.). You instrument your code once using the OTel SDK, and you can switch backends without changing your application code. LLM-specific tools like Langfuse offer better AI-native features but may create vendor dependency.

The key advantage of OpenTelemetry is vendor neutrality. You write instrumentation once and send it to any compatible backend (Jaeger, Datadog, Honeycomb, etc.). Cost tracking and prompt management are actually strengths of LLM-specific tools like Langfuse, not OpenTelemetry.

5. A user reports: "The agent gave me the wrong order status." What should you check FIRST?

AThe model version and temperature settings

BThe user's account permissions

CThe trace for that request — specifically the tool call output and LLM reasoning

DThe system prompt for recent changes

Correct! The trace for that specific request is your first stop. Pull up the trace_id (from request logs) and examine: (1) Did the tool return the correct data? (2) Did the LLM interpret the tool output correctly? (3) Was the right tool called? This immediately narrows the root cause to either a data issue (tool returned stale data) or a model issue (LLM misinterpreted the result).

Start with the trace. Pull up the trace_id for that specific request and look at the tool call output (did the tool return correct data?) and the LLM reasoning (did the model interpret the data correctly?). This immediately narrows the root cause. Model versions and system prompts are secondary investigations.

6. In a span hierarchy, which statement about parent-child relationships is correct?

AChild spans always start after the parent span ends

BA parent span's duration is always exactly the sum of its children's durations

CChild spans execute within the parent span's time window, and the parent's duration includes overhead beyond its children

DSpans cannot have more than one level of nesting

Correct! Child spans execute within the time window of their parent span. The parent span's duration is typically longer than the sum of its children because of processing overhead between child operations (parsing, decision logic, data transformation). This overhead is visible in the waterfall as gaps between child spans.

Child spans run within the parent's time window, not after it. The parent's duration includes both the children's execution time and any overhead (parsing, logic) between them. Spans can be nested many levels deep — e.g., trace > agent_loop > llm_call > token_streaming.

Your Score

0/0

Summary

You have completed the first module of Track 6: Observability. Here is what you learned:

Observability vs. Monitoring: Monitoring tells you that something broke; observability tells you why. For non-deterministic AI agents, observability is essential.
Traces: The top-level container capturing the complete lifecycle of a single agent request. Every trace has a unique ID and contains a tree of spans.
Spans: Individual units of work (LLM calls, tool invocations, guardrail checks) nested in parent-child hierarchies with timing data.
Structured Logging: JSON-formatted logs with consistent keys (trace_id, span_id, latency_ms, tokens) that enable filtering, aggregation, and alerting. Always redact PII before logging.
Tools: Langfuse (LLM-native, self-hostable), OpenTelemetry (vendor-neutral standard), LangSmith (LangChain ecosystem), Arize Phoenix (embedding/RAG focus).
Golden Rule: Never let observability failures break your agent. Wrap all tracing/logging calls in try/except.

Next up: M20: Monitoring & Continuous Improvement builds on this foundation — you will learn to set up dashboards, alerts, and feedback loops that keep your agent improving over time.

← M18: Evaluation & Testing 🏠 Home M20: Monitoring & CI →