Agent Observability: Tracing, Logging, and Debugging AI in Production
By Diesel
Tags: architecture, observability, debugging
## The Black Box Problem
You deployed an agent. Users are complaining it gives wrong answers. Your logs say `agent completed successfully`. Helpful, right?
This is the state of observability in most agent systems. They log the start and the end. Everything in between is a black box. When things go wrong, and they always go wrong, you're left guessing. Was it the prompt? The tool call? The model? The input? You don't know because you never instrumented the middle.
Traditional application observability gives you metrics, logs, and traces. Agent observability needs all three plus two more: token accounting and decision auditing. Let me explain each.
## Structured Logging That Actually Helps
Most agent logs look like this:
```
INFO: Agent started task
INFO: Calling tool search_documents
INFO: Tool returned results
INFO: Agent completed
```
Useless. Here's what agent logs should look like:
```typescript
interface AgentLogEntry {
  timestamp: number;
  traceId: string;        // connects related operations
  spanId: string;         // this specific operation
  parentSpanId?: string;  // what triggered this
  level: "debug" | "info" | "warn" | "error";
  event: string;          // machine-readable event type
  agent: string;          // which agent
  model?: string;         // which model was used
  data: {
    input?: unknown;
    output?: unknown;
    tokenUsage?: TokenUsage;
    durationMs?: number;
    toolName?: string;
    error?: ErrorDetails;
    metadata?: Record<string, unknown>;
  };
}
```
Every log entry has a trace ID that connects it to the entire operation, a span ID for this specific step, and a parent span ID showing what triggered it. This isn't overkill. This is the minimum you need to debug anything.
```typescript
const logger = {
  agentStep(ctx: AgentContext, event: string, data: Record<string, unknown>) {
    const entry: AgentLogEntry = {
      timestamp: Date.now(),
      traceId: ctx.traceId,
      spanId: crypto.randomUUID(),
      parentSpanId: ctx.currentSpanId,
      level: "info",
      event,
      agent: ctx.agentId,
      model: ctx.currentModel,
      data,
    };
    ctx.currentSpanId = entry.spanId;
    emit(entry);
  },
};
```
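To see why the parent-span chain matters, here's a minimal, self-contained sketch (the `Ctx`, `Entry`, and `logStep` names are illustrative, trimmed down to just the chaining fields) showing how each step links back to the one that triggered it:

```typescript
import { randomUUID } from "node:crypto";

// Illustrative types, stripped down to just the chaining fields.
interface Ctx { traceId: string; currentSpanId?: string; }
interface Entry { traceId: string; spanId: string; parentSpanId?: string; event: string; }

function logStep(ctx: Ctx, event: string): Entry {
  const entry: Entry = {
    traceId: ctx.traceId,
    spanId: randomUUID(),
    parentSpanId: ctx.currentSpanId, // points at whatever step ran last
    event,
  };
  ctx.currentSpanId = entry.spanId; // the next step will point back here
  return entry;
}

const ctx: Ctx = { traceId: randomUUID() };
const plan = logStep(ctx, "plan");
const tool = logStep(ctx, "tool_call");
```

After the two calls, `tool.parentSpanId` equals `plan.spanId`, so a query by trace ID can reconstruct the exact order of operations from otherwise flat log entries.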
## Distributed Tracing for Agent Chains
When Agent A calls Agent B which calls Tool C which queries API D, you need to trace the entire chain. This is distributed tracing, and it's the single most valuable observability tool for multi-agent systems.
```typescript
class AgentTracer {
  startTrace(task: Task): TraceContext {
    return {
      traceId: crypto.randomUUID(),
      spans: [],
      startTime: Date.now(),
      task: task.id,
    };
  }

  startSpan(ctx: TraceContext, name: string, attributes?: Record<string, unknown>): Span {
    const span: Span = {
      spanId: crypto.randomUUID(),
      parentSpanId: ctx.currentSpanId,
      traceId: ctx.traceId,
      name,
      startTime: Date.now(),
      attributes: attributes || {},
      events: [],
      status: "running",
    };
    ctx.spans.push(span);
    ctx.currentSpanId = span.spanId;
    return span;
  }

  endSpan(span: Span, status: "ok" | "error", attributes?: Record<string, unknown>) {
    span.endTime = Date.now();
    span.duration = span.endTime - span.startTime;
    span.status = status;
    Object.assign(span.attributes, attributes);
  }
}
```
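One design note: manual `startSpan`/`endSpan` pairs are easy to leak when a tool call throws. A try/finally wrapper guarantees every span gets closed. Here's a sketch with a hypothetical `withSpan` helper and a trimmed-down `Span` type (both assumptions, not part of the tracer above):

```typescript
interface Span {
  name: string;
  startTime: number;
  endTime?: number;
  status: "running" | "ok" | "error";
}

const spans: Span[] = [];

// Wrap any async operation so its span is closed even if the operation throws.
async function withSpan<T>(name: string, fn: () => Promise<T>): Promise<T> {
  const span: Span = { name, startTime: Date.now(), status: "running" };
  spans.push(span);
  try {
    const result = await fn();
    span.status = "ok";
    return result;
  } catch (err) {
    span.status = "error";
    throw err; // propagate after recording the failure
  } finally {
    span.endTime = Date.now(); // runs on both paths
  }
}
```

Usage looks like `await withSpan("web_search", () => searchDocuments(query))`, which keeps the instrumentation out of the tool code itself.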
With tracing in place, you can visualize the entire agent execution as a tree:
```
[trace: abc-123] Task: "Research and summarize topic X"
  [span: plan] Planning (model: sonnet, 450ms, 230 tokens)
  [span: search-1] Tool: web_search (320ms)
  [span: search-2] Tool: web_search (280ms)
  [span: evaluate] Evaluating results (model: sonnet, 380ms, 180 tokens)
  [span: search-3] Tool: web_search (350ms) <- follow-up
  [span: synthesize] Generating summary (model: sonnet, 1200ms, 890 tokens)
  [span: filter] Output filtering (12ms)
Total: 2992ms | Tokens: 1300 | Cost: $0.0039
```
One glance tells you where the time went, what the agent decided at each step, and what it cost. When a user reports a slow response, you pull the trace ID and see exactly what happened.
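Rendering that tree from the flat span list is a small exercise in grouping by `parentSpanId` and walking depth-first. A minimal sketch, with `Span` trimmed to just the fields the renderer needs:

```typescript
interface Span { spanId: string; parentSpanId?: string; name: string; }

// Walk the span list depth-first, indenting each child under its parent.
function renderTree(spans: Span[], parent?: string, depth = 0): string[] {
  const lines: string[] = [];
  for (const s of spans.filter((x) => x.parentSpanId === parent)) {
    lines.push(`${"  ".repeat(depth)}[span: ${s.name}]`);
    lines.push(...renderTree(spans, s.spanId, depth + 1));
  }
  return lines;
}

const demo: Span[] = [
  { spanId: "1", name: "plan" },
  { spanId: "2", parentSpanId: "1", name: "web_search" },
  { spanId: "3", parentSpanId: "1", name: "synthesize" },
];
// renderTree(demo) → ["[span: plan]", "  [span: web_search]", "  [span: synthesize]"]
```

The quadratic filter-per-node is fine at trace sizes (dozens of spans); index children by parent ID first if your traces run longer.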
## Token Accounting
Every LLM call consumes tokens. Every token costs money. If you're not tracking this per-request, per-agent, and per-user, you're flying blind on cost.
```typescript
class TokenAccountant {
  private usage: Map<string, TokenBucket> = new Map();

  record(ctx: TraceContext, usage: TokenUsage) {
    const buckets = [
      `trace:${ctx.traceId}`,
      `agent:${ctx.agentId}`,
      `model:${usage.model}`,
      `user:${ctx.userId}`,
      `hour:${this.currentHour()}`,
    ];
    for (const key of buckets) {
      const bucket = this.usage.get(key) || new TokenBucket();
      bucket.add(usage);
      this.usage.set(key, bucket);
    }
  }

  getCost(key: string): CostSummary {
    const bucket = this.usage.get(key);
    if (!bucket) return CostSummary.zero();
    return bucket.toCost();
  }
}

class TokenBucket {
  inputTokens = 0;
  outputTokens = 0;
  cacheReadTokens = 0;
  cacheWriteTokens = 0;
  calls = 0;

  add(usage: TokenUsage) {
    this.inputTokens += usage.input;
    this.outputTokens += usage.output;
    this.cacheReadTokens += usage.cacheRead || 0;
    this.cacheWriteTokens += usage.cacheWrite || 0;
    this.calls++;
  }

  toCost(): CostSummary {
    // Model-specific pricing would go here
    return {
      inputCost: this.inputTokens * PRICE_PER_INPUT_TOKEN,
      outputCost: this.outputTokens * PRICE_PER_OUTPUT_TOKEN,
      totalCost: (this.inputTokens * PRICE_PER_INPUT_TOKEN) +
                 (this.outputTokens * PRICE_PER_OUTPUT_TOKEN),
      calls: this.calls,
    };
  }
}
```
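The "model-specific pricing would go here" comment deserves one concrete shape. Here's a sketch with a per-model rate table; the model names and dollar rates below are made up for illustration, not real price sheets:

```typescript
// Rates in dollars per token; the values here are illustrative only.
const PRICING: Record<string, { input: number; output: number }> = {
  "model-small": { input: 0.25 / 1_000_000, output: 1.25 / 1_000_000 },
  "model-large": { input: 3.0 / 1_000_000, output: 15.0 / 1_000_000 },
};

function modelCost(model: string, inputTokens: number, outputTokens: number): number {
  const rates = PRICING[model];
  if (!rates) throw new Error(`no pricing entry for model: ${model}`);
  return inputTokens * rates.input + outputTokens * rates.output;
}
```

Many providers price cache reads and writes at different rates than fresh input, so a production table would carry those rates too and the bucket's `cacheReadTokens` and `cacheWriteTokens` would feed into the total.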
This data drives real decisions. When your monthly bill spikes, you can drill down from "we spent $X" to "agent Y spent $X on model Z for user W" in seconds. That's the difference between "we need to optimize" and knowing exactly what to optimize.
## Decision Auditing
This is the one nobody else talks about. AI agents make decisions. You need to be able to audit those decisions after the fact.
```typescript
interface DecisionRecord {
  traceId: string;
  timestamp: number;
  decisionType: "tool_selection" | "plan_step" | "output_choice" | "error_recovery";
  options: DecisionOption[];
  chosen: string;
  reasoning: string;
  confidence: number;
  context: {
    messageCount: number;
    tokensSoFar: number;
    toolCallsSoFar: number;
    elapsedMs: number;
  };
}
```
Every time the agent makes a choice, log it. What were the options? What did it choose? Why? What was the context at the time of the decision?
This is gold for debugging. "The agent chose to call the API again instead of using cached results because the prompt didn't mention the cache was available." That's actionable. That's something you can fix.
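A hypothetical `recordDecision` helper makes this concrete. The types below are simplified from the interface above, and the option scores and cache scenario are illustrative:

```typescript
interface DecisionOption { id: string; score: number; }
interface Decision {
  traceId: string;
  timestamp: number;
  decisionType: string;
  options: DecisionOption[];
  chosen: string;
  reasoning: string;
}

const decisionLog: Decision[] = [];

// Append one auditable record per choice the agent makes.
function recordDecision(
  traceId: string,
  decisionType: string,
  options: DecisionOption[],
  chosen: string,
  reasoning: string,
): void {
  decisionLog.push({ traceId, timestamp: Date.now(), decisionType, options, chosen, reasoning });
}

recordDecision(
  "abc-123",
  "tool_selection",
  [{ id: "web_search", score: 0.8 }, { id: "cached_results", score: 0.2 }],
  "web_search",
  "prompt did not mention cached results were available",
);
```

Querying `decisionLog` by trace ID later gives you the rejected options alongside the chosen one, which is exactly what you need to answer "why didn't it use the cache?"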
## Metrics That Matter for Agents
Traditional metrics like request latency and error rate are necessary but insufficient. Here are agent-specific metrics you need:
```typescript
const agentMetrics = {
  // Efficiency
  tokensPerTask: histogram(),        // how many tokens to complete a task
  toolCallsPerTask: histogram(),     // how many tool calls per task
  llmCallsPerTask: histogram(),      // how many LLM roundtrips
  planRevisionsPerTask: histogram(), // how often the agent replans

  // Quality
  taskSuccessRate: counter(),        // completed vs failed
  escalationRate: counter(),         // needed human intervention
  userSatisfaction: histogram(),     // if you have feedback

  // Cost
  costPerTask: histogram(),          // dollar cost
  costPerUser: counter(),            // spending by user
  cacheHitRate: counter(),           // prompt cache efficiency

  // Performance
  timeToFirstToken: histogram(),     // perceived latency
  totalTaskDuration: histogram(),    // end-to-end time
  toolCallLatency: histogram(),      // per-tool performance
};
```
Set alerts on the right ones. A sudden spike in `toolCallsPerTask` means your agent is struggling. A drop in `cacheHitRate` means your prompt structure changed. A rise in `costPerTask` means something got more expensive. These are the early warning signals.
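One simple way to encode those alerts is a ratio against a rolling baseline rather than a fixed threshold, so the check adapts as traffic changes. A sketch, where the 1.5x ratio is an arbitrary starting point you would tune per metric:

```typescript
interface MetricWindow { name: string; current: number; baseline: number; }

// Alert when the current window exceeds the rolling baseline by more than maxRatio.
function shouldAlert(m: MetricWindow, maxRatio = 1.5): boolean {
  if (m.baseline === 0) return false; // no history yet, nothing to compare against
  return m.current / m.baseline > maxRatio;
}

shouldAlert({ name: "toolCallsPerTask", current: 9, baseline: 4 }); // 2.25x baseline → true
```

For metrics where a drop is the bad signal (like `cacheHitRate`), you'd run the same check with the ratio inverted.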
## Building a Debug View
All this data is worthless if you can't look at it quickly. Build a debug view that shows a single agent interaction from end to end.
```typescript
async function debugView(traceId: string): Promise<DebugView> {
  const spans = await traceStore.getSpans(traceId);
  const decisions = await decisionStore.getDecisions(traceId);
  const tokens = await tokenStore.getUsage(traceId);
  const logs = await logStore.getLogs(traceId);

  return {
    timeline: buildTimeline(spans),
    // `await` isn't legal inside a plain .map() callback, so make the
    // callback async and await the whole batch with Promise.all
    decisions: await Promise.all(decisions.map(async (d) => ({
      ...d,
      wasOptimal: await evaluateDecision(d),
    }))),
    cost: tokens.toCost(),
    errors: logs.filter((l) => l.level === "error"),
    duration: spans[spans.length - 1].endTime - spans[0].startTime,
    summary: generateSummary(spans, decisions, tokens),
  };
}
```
When a user reports an issue, you paste the trace ID, get a complete picture of what happened, and know exactly where to look. No more "let me try to reproduce this." No more reading through thousands of log lines.
## The Observability Tax
Yes, all this instrumentation has a cost. More code. More storage. More processing. The tracing middleware adds a few milliseconds per span. The token accounting adds computation per LLM call.
It's worth it. Every single time.
The alternative is shipping an agent to production and hoping it works. That's not engineering. That's prayer. And in my experience, production systems don't respond well to prayer.
Instrument everything. Store everything. Query it when you need it. The cost of observability is always less than the cost of debugging without it.