Agent Observability: Tracing, Logging, and Debugging AI in Production
By Diesel
Tags: architecture, observability, debugging
## The Black Box Problem
You deployed an agent. Users are complaining it gives wrong answers. Your logs say `agent completed successfully`. Helpful, right?
This is the state of observability in most agent systems. They log the start and the end. Everything in between is a black box. When things go wrong, and they always go wrong, you're left guessing. Was it the prompt? The tool call? The model? The input? You don't know because you never instrumented the middle.
Traditional application observability gives you metrics, logs, and traces. Agent observability needs all three plus two more: token accounting and decision auditing. Let me explain each.
## Structured Logging That Actually Helps
Most agent logs look like this:
```
INFO: Agent started task
INFO: Calling tool search_documents
INFO: Tool returned results
INFO: Agent completed
```
Useless. Here's what agent logs should look like:
```typescript
interface AgentLogEntry {
  timestamp: number;
  traceId: string;        // connects related operations
  spanId: string;         // this specific operation
  parentSpanId?: string;  // what triggered this
  level: "debug" | "info" | "warn" | "error";
  event: string;          // machine-readable event type
  agent: string;          // which agent
  model?: string;         // which model was used
  data: {
    input?: unknown;
    output?: unknown;
    tokenUsage?: TokenUsage;
    durationMs?: number;
    toolName?: string;
    error?: ErrorDetails;
    metadata?: Record<string, unknown>;
  };
}
```
Every log entry has a trace ID that connects it to the entire operation, a span ID for this specific step, and a parent span ID showing what triggered it. This isn't overkill. This is the minimum you need to debug anything.
```typescript
const logger = {
  agentStep(ctx: AgentContext, event: string, data: Record<string, unknown>) {
    const entry: AgentLogEntry = {
      timestamp: Date.now(),
      traceId: ctx.traceId,
      spanId: crypto.randomUUID(),
      parentSpanId: ctx.currentSpanId,
      level: "info",
      event,
      agent: ctx.agentId,
      model: ctx.currentModel,
      data,
    };
    ctx.currentSpanId = entry.spanId;
    emit(entry);
  },
};
```
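To see why the parent-span chain matters, here's a minimal, self-contained sketch (the `Ctx`, `Entry`, and `logStep` names are illustrative, trimmed down to just the chaining fields) showing how each step links back to the one that triggered it:

```typescript
import { randomUUID } from "node:crypto";

// Illustrative types, stripped down to just the chaining fields.
interface Ctx { traceId: string; currentSpanId?: string; }
interface Entry { traceId: string; spanId: string; parentSpanId?: string; event: string; }

function logStep(ctx: Ctx, event: string): Entry {
  const entry: Entry = {
    traceId: ctx.traceId,
    spanId: randomUUID(),
    parentSpanId: ctx.currentSpanId, // points at whatever step ran last
    event,
  };
  ctx.currentSpanId = entry.spanId; // the next step will point back here
  return entry;
}

const ctx: Ctx = { traceId: randomUUID() };
const plan = logStep(ctx, "plan");
const tool = logStep(ctx, "tool_call");
```

After the two calls, `tool.parentSpanId` equals `plan.spanId`, so a query by trace ID can reconstruct the exact order of operations from otherwise flat log entries.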
## Distributed Tracing for Agent Chains
When Agent A calls Agent B which calls Tool C which queries API D, you need to trace the entire chain. This is distributed tracing, and it's the single most valuable observability tool for multi-agent systems.
```typescript
class AgentTracer {
  startTrace(task: Task): TraceContext {
    return {
      traceId: crypto.randomUUID(),
      spans: [],
      startTime: Date.now(),
      task: task.id,
    };
  }

  startSpan(ctx: TraceContext, name: string, attributes?: Record<string, unknown>): Span {
    const span: Span = {
      spanId: crypto.randomUUID(),
      parentSpanId: ctx.currentSpanId,
      traceId: ctx.traceId,
      name,
      startTime: Date.now(),
      attributes: attributes || {},
      events: [],
      status: "running",
    };
    ctx.spans.push(span);
    ctx.currentSpanId = span.spanId;
    return span;
  }

  endSpan(span: Span, status: "ok" | "error", attributes?: Record<string, unknown>) {
    span.endTime = Date.now();
    span.duration = span.endTime - span.startTime;
    span.status = status;
    Object.assign(span.attributes, attributes);
  }
}
```
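One design note: manual `startSpan`/`endSpan` pairs are easy to leak when a tool call throws. A try/finally wrapper guarantees every span gets closed. Here's a sketch with a hypothetical `withSpan` helper and a trimmed-down `Span` type (both assumptions, not part of the tracer above):

```typescript
interface Span {
  name: string;
  startTime: number;
  endTime?: number;
  status: "running" | "ok" | "error";
}

const spans: Span[] = [];

// Wrap any async operation so its span is closed even if the operation throws.
async function withSpan<T>(name: string, fn: () => Promise<T>): Promise<T> {
  const span: Span = { name, startTime: Date.now(), status: "running" };
  spans.push(span);
  try {
    const result = await fn();
    span.status = "ok";
    return result;
  } catch (err) {
    span.status = "error";
    throw err; // propagate after recording the failure
  } finally {
    span.endTime = Date.now(); // runs on both paths
  }
}
```

Usage looks like `await withSpan("web_search", () => searchDocuments(query))`, which keeps the instrumentation out of the tool code itself.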
With tracing in place, you can visualize the entire agent execution as a tree:
```
[trace: abc-123] Task: "Research and summarize topic X"
  [span: plan] Planning (model: sonnet, 450ms, 230 tokens)
  [span: search-1] Tool: web_search (320ms)
  [span: search-2] Tool: web_search (280ms)
  [span: evaluate] Evaluating results (model: sonnet, 380ms, 180 tokens)
  [span: search-3] Tool: web_search (350ms) <- follow-up
  [span: synthesize] Generating summary (model: sonnet, 1200ms, 890 tokens)
  [span: filter] Output filtering (12ms)
Total: 2992ms | Tokens: 1300 | Cost: $0.0039
```
One glance tells you where the time went, what the agent decided at each step, and what it cost. When a user reports a slow response, you pull the trace ID and see exactly what happened.
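Rendering that tree from the flat span list is a small exercise in grouping by `parentSpanId` and walking depth-first. A minimal sketch, with `Span` trimmed to just the fields the renderer needs:

```typescript
interface Span { spanId: string; parentSpanId?: string; name: string; }

// Walk the span list depth-first, indenting each child under its parent.
function renderTree(spans: Span[], parent?: string, depth = 0): string[] {
  const lines: string[] = [];
  for (const s of spans.filter((x) => x.parentSpanId === parent)) {
    lines.push(`${"  ".repeat(depth)}[span: ${s.name}]`);
    lines.push(...renderTree(spans, s.spanId, depth + 1));
  }
  return lines;
}

const demo: Span[] = [
  { spanId: "1", name: "plan" },
  { spanId: "2", parentSpanId: "1", name: "web_search" },
  { spanId: "3", parentSpanId: "1", name: "synthesize" },
];
// renderTree(demo) → ["[span: plan]", "  [span: web_search]", "  [span: synthesize]"]
```

The quadratic filter-per-node is fine at trace sizes (dozens of spans); index children by parent ID first if your traces run longer.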
## Token Accounting
Every LLM call consumes tokens. Every token costs money. If you're not tracking this per-request, per-agent, and per-user, you're flying blind on cost.
```typescript
class TokenAccountant {
  private usage: Map<string, TokenBucket> = new Map();

  record(ctx: TraceContext, usage: TokenUsage) {
    const buckets = [
      `trace:${ctx.traceId}`,
      `agent:${ctx.agentId}`,
      `model:${usage.model}`,
      `user:${ctx.userId}`,
      `hour:${this.currentHour()}`,
    ];
    for (const key of buckets) {
      const bucket = this.usage.get(key) || new TokenBucket();
      bucket.add(usage);
      this.usage.set(key, bucket);
    }
  }

  getCost(key: string): CostSummary {
    const bucket = this.usage.get(key);
    if (!bucket) return CostSummary.zero();
    return bucket.toCost();
  }
}

class TokenBucket {
  inputTokens = 0;
  outputTokens = 0;
  cacheReadTokens = 0;
  cacheWriteTokens = 0;
  calls = 0;

  add(usage: TokenUsage) {
    this.inputTokens += usage.input;
    this.outputTokens += usage.output;
    this.cacheReadTokens += usage.cacheRead || 0;
    this.cacheWriteTokens += usage.cacheWrite || 0;
    this.calls++;
  }

  toCost(): CostSummary {
    // Model-specific pricing would go here
    return {
      inputCost: this.inputTokens * PRICE_PER_INPUT_TOKEN,
      outputCost: this.outputTokens * PRICE_PER_OUTPUT_TOKEN,
      totalCost: (this.inputTokens * PRICE_PER_INPUT_TOKEN) +
                 (this.outputTokens * PRICE_PER_OUTPUT_TOKEN),
      calls: this.calls,
    };
  }
}
```
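The "model-specific pricing would go here" comment deserves one concrete shape. Here's a sketch with a per-model rate table; the model names and dollar rates below are made up for illustration, not real price sheets:

```typescript
// Rates in dollars per token; the values here are illustrative only.
const PRICING: Record<string, { input: number; output: number }> = {
  "model-small": { input: 0.25 / 1_000_000, output: 1.25 / 1_000_000 },
  "model-large": { input: 3.0 / 1_000_000, output: 15.0 / 1_000_000 },
};

function modelCost(model: string, inputTokens: number, outputTokens: number): number {
  const rates = PRICING[model];
  if (!rates) throw new Error(`no pricing entry for model: ${model}`);
  return inputTokens * rates.input + outputTokens * rates.output;
}
```

Many providers price cache reads and writes at different rates than fresh input, so a production table would carry those rates too and the bucket's `cacheReadTokens` and `cacheWriteTokens` would feed into the total.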
This data drives real decisions. When your monthly bill spikes, you can drill down from "we spent $X" to "agent Y spent $X on model Z for user W" in seconds. That's the difference between "we need to optimize" and knowing exactly what to optimize.
## Decision Auditing
This is the one nobody else talks about. AI agents make decisions. You need to be able to audit those decisions after the fact.
```typescript
interface DecisionRecord {
  traceId: string;
  timestamp: number;
  decisionType: "tool_selection" | "plan_step" | "output_choice" | "error_recovery";
  options: DecisionOption[];
  chosen: string;
  reasoning: string;
  confidence: number;
  context: {
    messageCount: number;
    tokensSoFar: number;
    toolCallsSoFar: number;
    elapsedMs: number;
  };
}
```
Every time the agent makes a choice, log it. What were the options? What did it choose? Why? What was the context at the time of the decision?
This is gold for debugging. "The agent chose to call the API again instead of using cached results because the prompt didn't mention the cache was available." That's actionable. That's something you can fix.
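A hypothetical `recordDecision` helper makes this concrete. The types below are simplified from the interface above, and the option scores and cache scenario are illustrative:

```typescript
interface DecisionOption { id: string; score: number; }
interface Decision {
  traceId: string;
  timestamp: number;
  decisionType: string;
  options: DecisionOption[];
  chosen: string;
  reasoning: string;
}

const decisionLog: Decision[] = [];

// Append one auditable record per choice the agent makes.
function recordDecision(
  traceId: string,
  decisionType: string,
  options: DecisionOption[],
  chosen: string,
  reasoning: string,
): void {
  decisionLog.push({ traceId, timestamp: Date.now(), decisionType, options, chosen, reasoning });
}

recordDecision(
  "abc-123",
  "tool_selection",
  [{ id: "web_search", score: 0.8 }, { id: "cached_results", score: 0.2 }],
  "web_search",
  "prompt did not mention cached results were available",
);
```

Querying `decisionLog` by trace ID later gives you the rejected options alongside the chosen one, which is exactly what you need to answer "why didn't it use the cache?"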
## Metrics That Matter for Agents
Traditional metrics like request latency and error rate are necessary but insufficient. Here are agent-specific metrics you need:
```typescript
const agentMetrics = {
  // Efficiency
  tokensPerTask: histogram(),        // how many tokens to complete a task
  toolCallsPerTask: histogram(),     // how many tool calls per task
  llmCallsPerTask: histogram(),      // how many LLM roundtrips
  planRevisionsPerTask: histogram(), // how often the agent replans

  // Quality
  taskSuccessRate: counter(),        // completed vs failed
  escalationRate: counter(),         // needed human intervention
  userSatisfaction: histogram(),     // if you have feedback

  // Cost
  costPerTask: histogram(),          // dollar cost
  costPerUser: counter(),            // spending by user
  cacheHitRate: counter(),           // prompt cache efficiency

  // Performance
  timeToFirstToken: histogram(),     // perceived latency
  totalTaskDuration: histogram(),    // end-to-end time
  toolCallLatency: histogram(),      // per-tool performance
};
```
Set alerts on the right ones. A sudden spike in `toolCallsPerTask` means your agent is struggling. A drop in `cacheHitRate` means your prompt structure changed. A rise in `costPerTask` means something got more expensive. These are the early warning signals.
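One simple way to encode those alerts is a ratio against a rolling baseline rather than a fixed threshold, so the check adapts as traffic changes. A sketch, where the 1.5x ratio is an arbitrary starting point you would tune per metric:

```typescript
interface MetricWindow { name: string; current: number; baseline: number; }

// Alert when the current window exceeds the rolling baseline by more than maxRatio.
function shouldAlert(m: MetricWindow, maxRatio = 1.5): boolean {
  if (m.baseline === 0) return false; // no history yet, nothing to compare against
  return m.current / m.baseline > maxRatio;
}

shouldAlert({ name: "toolCallsPerTask", current: 9, baseline: 4 }); // 2.25x baseline → true
```

For metrics where a drop is the bad signal (like `cacheHitRate`), you'd run the same check with the ratio inverted.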
## Building a Debug View
All this data is worthless if you can't look at it quickly. Build a debug view that shows a single agent interaction from end to end.
```typescript
async function debugView(traceId: string): Promise<DebugView> {
  const spans = await traceStore.getSpans(traceId);
  const decisions = await decisionStore.getDecisions(traceId);
  const tokens = await tokenStore.getUsage(traceId);
  const logs = await logStore.getLogs(traceId);

  return {
    timeline: buildTimeline(spans),
    // `await` isn't legal inside a plain .map() callback, so make the
    // callback async and await the whole batch with Promise.all
    decisions: await Promise.all(decisions.map(async (d) => ({
      ...d,
      wasOptimal: await evaluateDecision(d),
    }))),
    cost: tokens.toCost(),
    errors: logs.filter((l) => l.level === "error"),
    duration: spans[spans.length - 1].endTime - spans[0].startTime,
    summary: generateSummary(spans, decisions, tokens),
  };
}
```
When a user reports an issue, you paste the trace ID, get a complete picture of what happened, and know exactly where to look. No more "let me try to reproduce this." No more reading through thousands of log lines.
## The Observability Tax
Yes, all this instrumentation has a cost. More code. More storage. More processing. The tracing middleware adds a few milliseconds per span. The token accounting adds computation per LLM call.
It's worth it. Every single time.
The alternative is shipping an agent to production and hoping it works. That's not engineering. That's prayer. And in my experience, production systems don't respond well to prayer.
Instrument everything. Store everything. Query it when you need it. The cost of observability is always less than the cost of debugging without it.