Agent Evaluation: How to Know If Your Agent Actually Works
By Diesel
ai-agents · evaluation · metrics
You built an agent. It demos well. Your team is excited. Your stakeholders are impressed. You deploy it. Then the support tickets start rolling in.
The agent works great for the 10 scenarios you tested. It falls apart for the 10,000 scenarios your users actually encounter. The demo and production are different planets, and you didn't have the metrics to see it coming.
Agent evaluation is the least glamorous and most important part of building agents. You can't improve what you can't measure. And for agents, what you need to measure is significantly more complex than traditional software testing.
## Why Traditional Testing Isn't Enough
Software testing is deterministic. Given input X, the program should produce output Y. If it does, pass. If it doesn't, fail.
Agents are stochastic. Given the same input, the agent might take different paths to the answer. It might use different tools. It might produce slightly different outputs. All of which could be correct, or all of which could be subtly wrong in different ways.
Traditional testing catches "the code crashed." Agent evaluation needs to catch "the agent took 47 steps when 5 would have sufficed," "the agent's answer is technically correct but missed the user's actual intent," and "the agent hallucinated a database table that doesn't exist."
Different problem. Different tools.
## The Four Dimensions of Agent Evaluation
### 1. Task Completion (Did it finish the job?)
The most basic metric. Did the agent achieve the stated goal? Not "did it try" or "did it get close." Did it actually succeed?
This requires defining what success looks like for each task type, which is harder than it sounds. "Summarize this document" doesn't have a single correct answer. "Calculate the total revenue for Q1" does. You need different evaluation strategies for deterministic and open-ended tasks.
**For deterministic tasks:** compare the agent's output against a known correct answer. Exact match, numerical accuracy, structured output validation. This connects directly to [observability and tracing](/blog/agent-observability-tracing-logging).
**For open-ended tasks:** use rubrics. Did the summary cover the main points? Did it miss any critical information? Was the tone appropriate? You can evaluate these with another LLM (LLM-as-judge) or with human reviewers, or both.
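The split between the two strategies can be sketched in a few lines. This is a minimal illustration, not a library API: `exact_match` and `rubric_score` are hypothetical helper names, and the rubric dict stands in for whatever an LLM judge or human reviewer fills in.

```python
def exact_match(expected: str, actual: str) -> bool:
    """Deterministic check: normalize whitespace and case, then compare."""
    return expected.strip().lower() == actual.strip().lower()

def rubric_score(checks: dict[str, bool]) -> float:
    """Open-ended check: fraction of rubric criteria satisfied (0.0 to 1.0)."""
    return sum(checks.values()) / len(checks)

# Deterministic task: one correct answer, trivially comparable.
assert exact_match("$4.2M", " $4.2m ")

# Open-ended task: a judge (LLM or human) fills in the rubric booleans.
summary_score = rubric_score({
    "covers_main_points": True,
    "no_missing_critical_info": True,
    "appropriate_tone": False,
})  # 2 of 3 criteria met
```

Structured-output validation (JSON schemas, numeric tolerances) slots into the deterministic side the same way; the key design point is deciding, per task type, which side of this split you're on.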
Track completion rate as a percentage. "Our agent completes 87% of customer inquiry tasks successfully." That number is your headline metric. If it's below your threshold, nothing else matters.
### 2. Efficiency (Did it finish efficiently?)
An agent that takes 50 tool calls to do something achievable in 5 is technically successful but practically useless. It's slow, expensive, and fragile (each unnecessary step is another chance for failure).
**Step count.** How many reasoning/action cycles did the agent take? Compare against a baseline (either human performance or a known-good agent trace).
**Token usage.** How many tokens were consumed? This directly correlates with cost and latency. An agent that uses 100K tokens per task at $15/M tokens costs $1.50 per task. An optimized agent doing the same task in 20K tokens costs $0.30. Multiply by thousands of daily tasks.
**Tool call accuracy.** What percentage of tool calls were productive? If the agent called 10 tools and 3 returned useful results, that's a 30% tool accuracy rate. The agent is mostly flailing.
**Backtracking frequency.** How often does the agent undo or redo work? Some backtracking is natural. Excessive backtracking means the agent's planning or reasoning is poor.
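The four efficiency metrics above can all be computed from a single trace. A minimal sketch, assuming each step in the trace is a dict with hypothetical `tokens`, `tool_call`, `useful`, and `backtrack` fields (adapt the schema to your own logging format):

```python
def efficiency_metrics(trace: list[dict], price_per_mtok: float = 15.0) -> dict:
    """Compute step count, token usage, cost, tool accuracy, and backtracks
    from a list of agent steps (hypothetical schema, for illustration)."""
    tool_calls = [s for s in trace if s["tool_call"]]
    useful = [s for s in tool_calls if s["useful"]]
    tokens = sum(s["tokens"] for s in trace)
    return {
        "steps": len(trace),
        "tokens": tokens,
        "cost_usd": tokens / 1_000_000 * price_per_mtok,
        "tool_accuracy": len(useful) / len(tool_calls) if tool_calls else None,
        "backtracks": sum(1 for s in trace if s["backtrack"]),
    }

trace = [
    {"tokens": 60_000, "tool_call": True,  "useful": True,  "backtrack": False},
    {"tokens": 40_000, "tool_call": True,  "useful": False, "backtrack": True},
]
# 100K tokens at $15/M -> $1.50 per task, matching the arithmetic above.
print(efficiency_metrics(trace))
```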
### 3. Quality (Was the output good?)
Completion plus efficiency doesn't capture quality. The agent might complete the task quickly but produce a mediocre result.
**Factual accuracy.** Are the facts in the output correct? This is especially critical for agents that retrieve and synthesize information. A research agent that produces a confident summary with one wrong statistic is worse than one that says "I couldn't verify this number."
**Relevance.** Does the output address what was actually asked? Agents love to demonstrate knowledge. They'll answer adjacent questions, provide unrequested context, and volunteer opinions that weren't solicited. Evaluate whether the output addresses the specific request.
**Coherence.** Is the output internally consistent? Does it contradict itself? This matters especially for longer outputs, where early statements can conflict with later conclusions. [Monitoring agents with LangSmith](/blog/real-time-agent-monitoring-langsmith) is worth reading alongside this.
**Safety.** Does the output comply with policies? Does it avoid revealing sensitive information? Does it stay within the agent's authorized scope?
### 4. Reliability (Does it work consistently?)
An agent that succeeds 95% of the time on one test run and 75% on the next has a reliability problem. You need consistency.
**Variance across runs.** Run the same evaluation set multiple times. How much do the metrics vary? High variance means the agent's performance is unpredictable, which is often worse than consistently mediocre performance because you can't plan around it.
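A quick sketch of why variance matters more than a single mean, using the standard library's `statistics` module (the `run_variance` helper and the sample rates are illustrative):

```python
import statistics

def run_variance(success_rates: list[float]) -> dict:
    """Summarize consistency across repeated evaluation runs."""
    return {
        "mean": statistics.mean(success_rates),
        "stdev": statistics.stdev(success_rates),
        "spread": max(success_rates) - min(success_rates),
    }

# Two agents with the same average success rate but very different reliability:
stable = run_variance([0.85, 0.86, 0.84, 0.85])   # spread: 0.02
erratic = run_variance([0.95, 0.75, 0.92, 0.78])  # spread: 0.20
```

Both agents average 85%, but only one of them is something you can plan around.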
**Degradation under load.** Does the agent perform worse when handling multiple tasks simultaneously? When the context window is nearly full? When the tools are slow to respond?
**Edge case handling.** What happens with unusual inputs? Empty inputs. Extremely long inputs. Inputs in unexpected languages. Contradictory instructions. The agent's failure mode matters more than its success mode.
## Building an Evaluation Framework
### The Evaluation Dataset
You need a curated set of test cases. Not 5. Not 10. Hundreds, ideally, spanning the full range of scenarios your agent will encounter.
Each test case needs: an input (what the user asks), expected behavior (what the agent should do), and evaluation criteria (how to judge success).
```yaml
- id: "refund-001"
  input: "I want a refund for order #1234, it arrived damaged"
  expected_actions:
    - "look_up_order: #1234"
    - "check_refund_policy"
    - "check_damage_claim_status"
  success_criteria:
    - "Agent retrieved order details"
    - "Agent checked policy for damaged goods"
    - "Agent provided correct refund timeline"
    - "Agent did NOT auto-approve without damage verification"
  edge_case: false
  category: "refund"
```
Build this dataset from real usage data wherever possible. The scenarios your users actually present are more valuable than scenarios you imagine.
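Scoring a run against a test case's expected actions can be as simple as checking coverage. A sketch, assuming cases shaped like the YAML above and a hypothetical `actions_covered` helper (real evaluation may also need to check ordering or forbidden actions):

```python
def actions_covered(expected: list[str], observed: list[str]) -> float:
    """Fraction of expected actions the agent actually performed, any order."""
    hits = sum(1 for action in expected if action in observed)
    return hits / len(expected)

case = {
    "id": "refund-001",
    "expected_actions": [
        "look_up_order: #1234",
        "check_refund_policy",
        "check_damage_claim_status",
    ],
}
observed = ["look_up_order: #1234", "check_refund_policy"]
coverage = actions_covered(case["expected_actions"], observed)  # 2 of 3
```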
### LLM-as-Judge
For open-ended tasks, use another LLM to evaluate the agent's output. Give the judge the original task, the agent's output, and a rubric. Ask it to score on specific criteria.
This isn't perfect. The judge has its own biases and error modes. But it scales, which human evaluation doesn't. Use human evaluation to calibrate the judge, then use the judge for continuous evaluation.
```
Evaluate the agent's response on these criteria (1-5 scale):
1. Task completion: Did the agent address the user's request?
2. Factual accuracy: Are all stated facts correct?
3. Relevance: Is the response focused on what was asked?
4. Completeness: Were any important aspects missed?
5. Safety: Did the agent stay within authorized boundaries?
```
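To use judge scores programmatically, you need to get them back out of the judge's reply. A minimal parser sketch, assuming the judge answers in the `N. Criterion: score` shape of the prompt above (a real pipeline would ask for structured output or parse far more defensively):

```python
import re

def parse_judge_scores(judge_reply: str) -> dict[str, int]:
    """Extract 'N. Criterion: score' lines (scores 1-5) from a judge reply."""
    scores = {}
    for line in judge_reply.splitlines():
        m = re.match(r"\s*\d+\.\s*([^:]+):\s*([1-5])\b", line)
        if m:
            scores[m.group(1).strip().lower()] = int(m.group(2))
    return scores

reply = """1. Task completion: 4
2. Factual accuracy: 5
3. Relevance: 3"""
print(parse_judge_scores(reply))
```

Calibrate by running the same cases past human reviewers and checking that the parsed judge scores correlate before trusting the judge unattended.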
### Trajectory Analysis
Don't just evaluate the final output. Evaluate the path the agent took. This catches agents that arrive at the right answer through wrong reasoning (lucky guesses that won't generalize) and agents that take correct paths but produce flawed outputs.
Log every thought, action, and observation. Review the trajectories for patterns: common failure points, unnecessary detours, missed shortcuts, reasoning errors that happened to not affect the outcome this time. This connects directly to [evaluation tooling like W&B](/blog/wandb-agent-evaluation).
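Some of those patterns are cheap to flag automatically. A sketch over the ordered list of action names from a trace (`trajectory_flags` and its thresholds are illustrative, not a standard):

```python
from collections import Counter

def trajectory_flags(actions: list[str]) -> dict:
    """Flag suspicious patterns in an ordered action trajectory."""
    counts = Counter(actions)
    # Calling the same tool 3+ times often signals flailing or a loop.
    repeated_tools = {a: n for a, n in counts.items() if n >= 3}
    # The exact same action twice in a row is a stronger loop signal.
    immediate_loops = sum(1 for a, b in zip(actions, actions[1:]) if a == b)
    return {"repeated_tools": repeated_tools, "immediate_loops": immediate_loops}

actions = ["search", "search", "read", "search", "summarize"]
print(trajectory_flags(actions))
```

Flags like these don't prove a trajectory is bad; they tell you which traces are worth a human read.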
## Metrics That Actually Drive Decisions
Track these over time, not as one-off evaluations.
**Task success rate by category.** Not one global number. Break it down. Your agent might crush routine queries (98%) while struggling with multi-step workflows (72%). That tells you where to invest.
**Mean steps to completion.** Trending up means the agent is getting less efficient. Investigate before it becomes a cost problem.
**Escalation rate.** What percentage of tasks get escalated to a human? Trending down is good (the agent is learning). Flat might be fine (you've found the equilibrium). Trending up is a problem.
**Cost per task.** API costs, tool costs, human review costs, all in. This is the number your finance team cares about.
**Time to resolution.** How long does the full task take? Include human review time if applicable. This is the number your users care about.
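The per-category breakdown and cost-per-task numbers above fall out of one aggregation pass over task results. A sketch, assuming each result is a dict with hypothetical `category`, `success`, and `cost_usd` fields:

```python
def category_report(results: list[dict]) -> dict[str, dict]:
    """Aggregate per-category success rate and mean cost from task results."""
    buckets: dict[str, dict] = {}
    for r in results:
        b = buckets.setdefault(r["category"], {"n": 0, "wins": 0, "cost": 0.0})
        b["n"] += 1
        b["wins"] += r["success"]
        b["cost"] += r["cost_usd"]
    return {
        cat: {"success_rate": b["wins"] / b["n"], "mean_cost": b["cost"] / b["n"]}
        for cat, b in buckets.items()
    }

results = [
    {"category": "refund", "success": True,  "cost_usd": 0.30},
    {"category": "refund", "success": False, "cost_usd": 0.50},
    {"category": "faq",    "success": True,  "cost_usd": 0.10},
]
print(category_report(results))
```

Run the same aggregation on each evaluation snapshot and plot the series; the trends, not the point values, are what drive decisions.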
## The Evaluation Loop
Evaluation isn't a phase. It's a continuous process.
Deploy. Measure. Identify failures. Fix. Measure again. Repeat.
Every failure in production is a test case you didn't have. Add it to the evaluation dataset. Every edge case a human flags is a scenario the agent should handle next time. The dataset grows, the evaluation gets more comprehensive, and the agent improves.
The teams that treat evaluation as a one-time gate before deployment end up with agents that work in testing and fail in production. The teams that treat evaluation as a continuous feedback loop end up with agents that actually get better over time.
Measure everything. Trust nothing. Improve constantly. That's the whole game.