Weights & Biases for Agent Evaluation: Beyond Model Training
By Diesel
tools, wandb, evaluation, mlops
Here's a dirty secret of AI agent development: most teams have no idea if their agents are getting better or worse. They ship a prompt change, run a few manual tests, the results "look good," and they push to production. Then a week later, customer support starts getting weird tickets about the agent doing something it never used to do.
The problem isn't that people don't want to evaluate their agents. It's that agent evaluation is genuinely hard, and the tooling hasn't caught up to the need. Model training has mature experiment tracking. Agent workflows have vibes.
Weights & Biases (W&B) is changing that. Their platform, originally built for ML experiment tracking, turns out to be an excellent fit for the agent evaluation problem. Not because they designed it for agents from day one, but because the underlying primitives (logging, comparison, versioning, datasets) are exactly what you need.
## Why Agent Evaluation Is Different
Model evaluation has clear metrics. Accuracy, F1, perplexity, BLEU score. You train a model, evaluate it on a held-out set, get a number. Higher is better. Done.
Agent evaluation is messier. Your agent doesn't just generate text. It reasons, calls tools, handles errors, retries, makes decisions, and produces multi-step outputs where any step can fail in ways that affect the final result. A single number doesn't capture that. If you're comparing tooling, [LangSmith is an alternative](/blog/real-time-agent-monitoring-langsmith) worth reading about alongside this.
What you actually need to evaluate:
- **Tool call accuracy**: Did the agent call the right tools with the right arguments?
- **Reasoning quality**: Did the intermediate thinking steps make sense?
- **End-to-end correctness**: Is the final output right?
- **Efficiency**: How many steps and tokens did it take?
- **Robustness**: Does it handle edge cases, ambiguous inputs, and tool failures?
- **Regression detection**: Did your latest change break something that worked before?
W&B gives you the infrastructure to track all of these across runs, compare changes, and catch regressions before they hit production.
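Tool call accuracy in particular has no canonical formula, so you'll define it yourself. One minimal definition, sketched here as a hypothetical `compute_tool_accuracy(trace, case)` matching the helper called in the evaluation script below, is set overlap on (tool name, arguments) pairs:

```python
def compute_tool_accuracy(trace, case):
    """Fraction of expected tool calls the agent actually made with
    matching name and arguments. Hypothetical helper: assumes the trace
    has .tool_calls and the test case labels "expected_tool_calls",
    each a list of {"name": ..., "args": {...}} dicts.
    """
    expected = {
        (c["name"], tuple(sorted(c["args"].items())))
        for c in case.get("expected_tool_calls", [])
    }
    if not expected:
        return 1.0  # nothing expected: trivially correct
    actual = {
        (c["name"], tuple(sorted(c["args"].items())))
        for c in trace.tool_calls
    }
    return len(expected & actual) / len(expected)
```

Exact argument matching is strict; for tools with free-text arguments you'd likely relax this to name-only matching or a fuzzy comparison.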
## Setting Up Agent Evaluation with W&B
The core loop is: run your agent on a test set, log every step to W&B, compute metrics, compare against baselines.
```python
import wandb

# Initialize a run for this evaluation
wandb.init(
    project="agent-evaluation",
    config={
        "model": "claude-sonnet-4-20250514",
        "agent_version": "v2.3",
        "prompt_version": "2026-03-13",
        "max_steps": 10,
    },
)

# Your test dataset (load_test_cases and run_agent_with_trace are
# your own helpers; W&B only sees what you log)
test_cases = load_test_cases("./eval/test_set.json")

results = []
for case in test_cases:
    # Run the agent and capture the full trace
    trace = run_agent_with_trace(case["input"])

    # Compute metrics for this case
    metrics = {
        "correct": trace.final_output == case["expected_output"],
        "tool_calls": len(trace.tool_calls),
        "tokens_used": trace.total_tokens,
        "latency_ms": trace.latency_ms,
        "tool_accuracy": compute_tool_accuracy(trace, case),
        "steps": trace.num_steps,
    }
    results.append(metrics)

    # Log individual case results
    wandb.log({
        "case_id": case["id"],
        **metrics,
    })

# Log aggregate metrics
wandb.log({
    "accuracy": sum(r["correct"] for r in results) / len(results),
    "avg_tool_calls": sum(r["tool_calls"] for r in results) / len(results),
    "avg_tokens": sum(r["tokens_used"] for r in results) / len(results),
    "avg_latency_ms": sum(r["latency_ms"] for r in results) / len(results),
})

wandb.finish()
```
Every evaluation run becomes a W&B experiment. You can compare runs side by side. You can see how accuracy changed when you modified the system prompt. You can track token usage across versions and catch cost regressions.
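The UI isn't the only way to compare. `wandb.Api()` exposes run configs and summaries programmatically, which is handy for ad-hoc reports. A small sketch (the entity/project path and the `agent_version` config key are assumptions carried over from the config above):

```python
def summarize_runs(runs, metric="accuracy"):
    """Sort run records by a summary metric, descending.
    Works on wandb.Api() run objects or anything exposing
    .config and .summary dict-like attributes.
    """
    rows = [
        (r.config.get("agent_version", "?"), r.summary.get(metric, 0.0))
        for r in runs
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Typical usage against the public API (placeholder entity/project path):
#   import wandb
#   api = wandb.Api()  # reads WANDB_API_KEY from the environment
#   for version, acc in summarize_runs(api.runs("my-team/agent-evaluation")):
#       print(version, acc)
```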
## Evaluation Datasets with W&B Artifacts
Test sets drift. You add cases, fix labels, remove duplicates. Without versioning, you're comparing results from different test sets and drawing wrong conclusions.
W&B Artifacts version your evaluation datasets the same way Git versions your code.
```python
# Create and version your test dataset (inside an active run)
artifact = wandb.Artifact("agent-test-set", type="dataset")
artifact.add_file("./eval/test_set.json")
wandb.log_artifact(artifact)

# In your evaluation script, pull the specific version
run = wandb.init(project="agent-evaluation")
artifact = run.use_artifact("agent-test-set:v3")
artifact_dir = artifact.download()
test_cases = load_test_cases(f"{artifact_dir}/test_set.json")
```
Now every evaluation run records which dataset version it used. When you look at historical results, you know exactly what test set produced them. No more "wait, did we add those new edge cases before or after the v2.1 evaluation?"
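A detail worth knowing: artifacts are content-addressed, so re-logging an unchanged file doesn't create a spurious new version. If you also want a fingerprint you can grep for outside W&B (in logs, PR descriptions, spreadsheets), hashing the canonical JSON is cheap. A hypothetical helper:

```python
import hashlib
import json

def dataset_fingerprint(path):
    """Stable short hash of a JSON test set, independent of key order.
    Hypothetical helper for tagging results with the exact data used."""
    with open(path) as f:
        data = json.load(f)
    canonical = json.dumps(data, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:12]
```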
## Tracing Agent Workflows
The tool calling trace is where W&B's logging really earns its keep. For each test case, you want to see not just the final answer but every reasoning step, every tool call, every intermediate result.
```python
# W&B Tables for structured trace logging
trace_table = wandb.Table(columns=[
    "case_id", "step", "action", "tool_name",
    "tool_input", "tool_output", "reasoning", "tokens",
])

for case in test_cases:
    trace = run_agent_with_trace(case["input"])
    for i, step in enumerate(trace.steps):
        trace_table.add_data(
            case["id"],
            i,
            step.action_type,
            step.tool_name or "",
            json.dumps(step.tool_input) if step.tool_input else "",
            str(step.tool_output)[:500] if step.tool_output else "",
            step.reasoning[:500] if step.reasoning else "",
            step.tokens,
        )

wandb.log({"traces": trace_table})
```
W&B Tables give you filterable, sortable, interactive views of your agent traces. Filter to cases where `correct == False`. Sort by token usage to find inefficient runs. Group by tool name to see which tools are called most frequently. This is the kind of analysis that's impossible with console logs and spreadsheets.
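If you also keep traces in memory, the same group-by is a few lines of plain Python. This sketch assumes the step shape used in the table above (`tool_name` may be `None` for non-tool steps):

```python
from collections import Counter

def tool_call_frequency(traces):
    """Count tool invocations across a list of traces, most frequent first.
    Assumes each trace has .steps, and each step has a .tool_name
    attribute that is None for reasoning-only steps."""
    counts = Counter()
    for trace in traces:
        for step in trace.steps:
            if step.tool_name:
                counts[step.tool_name] += 1
    return counts.most_common()
```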
## Comparing Agent Versions
The killer feature for production agent development is comparison. You change the system prompt, swap a model, modify tool descriptions, or refactor the agent loop. You need to know if things got better, worse, or stayed the same.
```python
# Tag runs for easy comparison
wandb.init(
    project="agent-evaluation",
    tags=["prompt-v3", "claude-sonnet", "production-candidate"],
    config={
        "change_description": "Added chain-of-thought to tool selection",
        "base_version": "v2.2",
    },
)
```
W&B's comparison view shows two runs side by side with metrics overlaid. Accuracy went from 87% to 91%. Average tokens dropped from 3200 to 2800. But tool accuracy on the "database query" category dropped from 95% to 88%. That's a regression you need to investigate before shipping.
Without this comparison infrastructure, you'd catch the accuracy improvement, celebrate, ship it, and discover the database query regression from production errors two weeks later. For a deeper look at which metrics to track, see [agent evaluation metrics](/blog/agent-evaluation-metrics).
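The per-category check is worth automating rather than eyeballing. A minimal sketch, assuming you've already aggregated per-category accuracy into dicts for the baseline and candidate runs (`find_regressions` is a hypothetical helper):

```python
def find_regressions(baseline, candidate, threshold=0.02):
    """Return categories whose accuracy dropped by more than `threshold`
    between a baseline run and a candidate run. Both arguments are
    dicts mapping category name -> accuracy in [0, 1]."""
    return sorted(
        cat
        for cat, base_acc in baseline.items()
        if base_acc - candidate.get(cat, 0.0) > threshold
    )
```

In the scenario above, `find_regressions` would surface "database query" (0.95 to 0.88) while ignoring the small win on other categories, so the overall accuracy gain can't hide the localized loss.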
## LLM-as-Judge for Subjective Evaluation
Some agent outputs can't be evaluated with exact matching. "Summarize this document" doesn't have one right answer. For these tasks, you need an LLM judge.
```python
def llm_judge(agent_output: str, reference: str, criteria: str) -> dict:
    """Use a strong model to evaluate agent output."""
    judge_prompt = f"""Evaluate the following agent output against the reference.

Criteria: {criteria}

Agent output: {agent_output}

Reference: {reference}

Score from 1-5 and explain your reasoning.
Return JSON: {{"score": int, "reasoning": str}}"""
    result = judge_model.generate(judge_prompt)
    return json.loads(result)

# Log judge scores to W&B
for case in test_cases:
    output = run_agent(case["input"])
    judgment = llm_judge(
        output,
        case["reference"],
        "accuracy, completeness, and clarity",
    )
    wandb.log({
        "case_id": case["id"],
        "judge_score": judgment["score"],
        "judge_reasoning": judgment["reasoning"],
    })
```
Log the judge scores to W&B alongside your other metrics. Now you can track subjective quality across versions with the same comparison tools you use for objective metrics.
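One practical wrinkle: judge models don't always return bare JSON. They wrap it in prose or code fences, and `json.loads` on the raw response then throws mid-evaluation. A defensive parse (hypothetical helper) is cheap insurance before logging:

```python
import json
import re

def parse_judge_response(text):
    """Extract the first JSON object from a judge response, tolerating
    surrounding prose or markdown fences. Returns None if no valid
    JSON object is found, so the caller can log a parse failure
    instead of crashing the evaluation run."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group())
    except json.JSONDecodeError:
        return None
```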
## CI/CD Integration
The real payoff is running evaluations automatically. Every PR that changes agent code triggers an evaluation run. The results appear in the PR as a W&B report link.
```yaml
# .github/workflows/agent-eval.yml
name: Agent Evaluation

on:
  pull_request:
    paths:
      - 'agents/**'
      - 'prompts/**'
      - 'tools/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements.txt  # or however you install deps
      - name: Run evaluation
        env:
          WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          python eval/run_evaluation.py \
            --dataset agent-test-set:latest \
            --tag "pr-${{ github.event.number }}"
      - name: Check thresholds
        run: |
          python eval/check_thresholds.py \
            --min-accuracy 0.85 \
            --max-cost-increase 0.20
```
If accuracy drops below 85% or cost increases by more than 20%, the check fails and the PR gets blocked. No more shipping regressions because the manual test "looked fine."
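The `check_thresholds.py` script itself can be small. A sketch of the gating logic (the metric keys and the hardcoded demo values are assumptions; in CI you'd pull the candidate metrics from the PR's run and the baseline from main, e.g. via `wandb.Api()`):

```python
import sys

def check_thresholds(metrics, baseline, min_accuracy=0.85, max_cost_increase=0.20):
    """Return human-readable failure messages; an empty list means the
    gate passes. `metrics` and `baseline` are dicts with "accuracy"
    and "avg_tokens" keys (hypothetical shape)."""
    failures = []
    if metrics["accuracy"] < min_accuracy:
        failures.append(f"accuracy {metrics['accuracy']:.2f} below {min_accuracy}")
    increase = metrics["avg_tokens"] / baseline["avg_tokens"] - 1.0
    if increase > max_cost_increase:
        failures.append(f"token cost up {increase:.0%}")
    return failures

if __name__ == "__main__":
    # Placeholder values; load these from W&B in a real CI run.
    failures = check_thresholds(
        {"accuracy": 0.91, "avg_tokens": 2800},
        {"avg_tokens": 3200},
    )
    if failures:
        print("\n".join(failures))
        sys.exit(1)  # non-zero exit fails the GitHub Actions step
```

The non-zero exit code is all GitHub Actions needs to mark the check failed and block the merge.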
## What W&B Doesn't Do (Yet)
W&B is infrastructure for logging, comparison, and visualization. It's not an agent evaluation framework in the way that Ragas or DeepEval are. It doesn't come with pre-built metrics for faithfulness, hallucination detection, or retrieval relevancy. You write those metrics yourself and log the results to W&B.
That's fine for teams with evaluation expertise. For teams that want opinionated, pre-built evaluation suites, pairing W&B with a framework like Ragas gives you the best of both worlds: Ragas computes the metrics, W&B tracks them over time. For retrieval-heavy agents, [RAG-specific evaluation](/blog/rag-evaluation-retrieval-quality) is worth reading alongside this.
The agent-specific features are also still evolving. W&B Weave is their newer offering specifically designed for LLM application tracking, with built-in tracing and evaluation tools. It's worth watching, but the core W&B platform already handles 90% of what you need.
## The Bottom Line
If you're building agents and you're not evaluating them systematically, you're flying blind. You might be shipping improvements. You might be shipping regressions. You genuinely don't know.
W&B gives you the infrastructure to know. It's not the only option (MLflow, Langfuse, and Braintrust all play in this space), but W&B's combination of experiment tracking, dataset versioning, interactive tables, and team collaboration features makes it the most complete platform for the job.
The overhead is real. You need test datasets, evaluation scripts, metrics definitions, and CI integration. That's a week of work to set up properly. But once it's running, every agent change gets evaluated automatically, every regression gets caught before production, and every improvement gets quantified.
That's not optional for production agents. That's the minimum bar.