Your agent works in development. You deploy it. Users start hitting it. Something goes wrong. The agent gives a weird answer, or loops on a tool call, or spends $4 on a single request. You check the logs. They say "200 OK."
That's the problem with agents. HTTP status codes tell you nothing about what happened inside. You need to see the reasoning chain. Which tools did it call? What did it get back? Why did it make that decision? How much did it cost?
LangSmith answers all of that. It's an observability platform purpose-built for LLM applications. Think of it as Datadog for agents. The related post on [observability principles](/blog/agent-observability-tracing-logging) goes further on this point.
## Setup
```bash
pip install langsmith langchain-anthropic langchain-community
```
```bash
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="your-langsmith-api-key"
export LANGCHAIN_PROJECT="my-agent-prod"
```
That's it. Three environment variables and every LangChain operation gets traced automatically. No code changes needed.
## What Gets Traced Automatically
Once tracing is enabled, LangSmith captures:
- Every LLM call (model, prompt, response, tokens, latency, cost)
- Every tool invocation (name, input, output, duration)
- Every chain execution (input, output, intermediate steps)
- Every retrieval operation (query, results, scores)
- Error traces with full stack traces
- Parent-child relationships between operations
All of it structured, searchable, and linked together in a trace tree.
## Custom Tracing with the `@traceable` Decorator
For your own functions, add the `@traceable` decorator:
```python
from langsmith import traceable

@traceable(name="process_user_request")
def process_request(user_input: str, session_id: str) -> dict:
    """Main entry point for agent requests."""
    # Your agent logic here
    result = agent.invoke({
        "messages": [{"role": "user", "content": user_input}]
    })
    return {"response": result, "session_id": session_id}

@traceable(name="validate_output", tags=["validation"])
def validate_output(response: str) -> tuple[str, list[str]]:
    """Validate agent output before returning to user."""
    warnings = []
    if len(response) > 5000:
        warnings.append("Response exceeds length limit")
    return response, warnings
```
Nested `@traceable` functions create a trace tree automatically. The parent call shows the child calls underneath it. You see exactly how time is spent at every level.
## Structured Metadata
Attach business context to your traces. This is what makes LangSmith useful for debugging, not just monitoring.
```python
from langsmith import traceable
from langsmith.run_helpers import get_current_run_tree

@traceable(name="agent_invoke")
def invoke_agent(user_input: str, user_id: str, tier: str):
    run_tree = get_current_run_tree()
    if run_tree:
        run_tree.metadata = {
            "user_id": user_id,
            "tier": tier,
            "input_length": len(user_input),
        }
        run_tree.tags = ["production", tier]
    result = agent.invoke({"messages": [("human", user_input)]})
    # Add output metadata
    if run_tree:
        run_tree.metadata["output_length"] = len(str(result))
        run_tree.metadata["tool_calls"] = count_tool_calls(result)
    return result
```
Now you can filter traces in LangSmith by user tier, search for all requests from a specific user, or find all traces where tool call count exceeded 5.
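As a sketch, the tier tag can be queried back out programmatically. The filter-string builder below reuses the same `has()` operator this post uses later for feedback filtering; treat the exact query syntax as an assumption to verify against the LangSmith filter documentation for your SDK version:

```python
def tier_filter(tier: str) -> str:
    """Build a LangSmith filter string matching the tier tag attached above.
    has(tags, ...) is the same operator used for feedback filtering."""
    return f'has(tags, "{tier}")'

# Usage sketch (requires a configured LANGCHAIN_API_KEY):
# from langsmith import Client
# runs = Client().list_runs(project_name="my-agent-prod", filter=tier_filter("premium"))
```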
## Feedback Collection
LangSmith tracks user feedback tied to specific traces. This closes the loop between "what the agent did" and "was it actually helpful."
```python
from langsmith import Client

ls_client = Client()

def record_feedback(run_id: str, score: float, comment: str = ""):
    """Record user feedback on an agent response."""
    ls_client.create_feedback(
        run_id=run_id,
        key="user_rating",
        score=score,  # 0.0 to 1.0
        comment=comment,
    )

def record_correction(run_id: str, correct_answer: str):
    """Record a correction when the agent got it wrong."""
    ls_client.create_feedback(
        run_id=run_id,
        key="correction",
        score=0.0,
        comment=correct_answer,
    )
```
After enough corrections pile up, you have a labeled dataset for evaluation. LangSmith didn't just monitor your agent. It built your test suite.
## Building Evaluation Datasets
Extract traces with feedback into reusable evaluation datasets:
```python
from langsmith import Client

ls_client = Client()

def create_evaluation_dataset(project_name: str, dataset_name: str):
    """Create a dataset from traced runs with feedback."""
    dataset = ls_client.create_dataset(dataset_name)
    # Get runs that have feedback
    runs = ls_client.list_runs(
        project_name=project_name,
        filter='has(feedback_key, "correction")',
    )
    for run in runs:
        feedbacks = list(ls_client.list_feedback(run_ids=[run.id]))
        correction = next(
            (f.comment for f in feedbacks if f.key == "correction"),
            None
        )
        if correction:
            ls_client.create_example(
                inputs=run.inputs,
                outputs={"expected": correction},
                dataset_id=dataset.id,
            )
    return dataset
```

This connects directly to [evaluation metrics to track](/blog/agent-evaluation-metrics).
## Automated Evaluation
Run evaluations against your dataset to catch regressions before they hit production.
```python
from langsmith.evaluation import evaluate
from langchain_anthropic import ChatAnthropic

def correctness_evaluator(run, example):
    """Check if the agent's response matches the expected output."""
    prediction = run.outputs.get("response", "")
    expected = example.outputs.get("expected", "")
    # Use an LLM to judge equivalence
    judge = ChatAnthropic(model="claude-haiku-4-20250514")
    result = judge.invoke([
        ("system", "Compare two responses. Are they semantically equivalent? Answer YES or NO."),
        ("human", f"Response A: {prediction}\n\nResponse B: {expected}")
    ])
    return {
        "key": "correctness",
        "score": 1.0 if "YES" in result.content.upper() else 0.0,
    }

# Run the evaluation
results = evaluate(
    lambda inputs: agent.invoke(inputs),
    data="my-evaluation-dataset",
    evaluators=[correctness_evaluator],
    experiment_prefix="v2-agent",
)
```
Every time you change your agent (new prompt, different model, added tools), run the evaluation suite. If correctness drops, you caught the regression before users did.
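One way to wire that into CI is a simple regression gate: compare the candidate experiment's aggregate correctness score against the last known-good baseline and block the deploy on a drop. The gate below is a sketch; the tolerance value is an arbitrary assumption, not a LangSmith default:

```python
def regression_gate(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    """Allow a release only when the candidate's correctness score is
    within `tolerance` of the baseline (or better)."""
    return candidate >= baseline - tolerance

# e.g. baseline experiment scored 0.90
assert regression_gate(0.90, 0.89) is True   # within tolerance, ship it
assert regression_gate(0.90, 0.84) is False  # regression, block the deploy
```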
## Cost Tracking
Every LLM call in LangSmith includes token counts. Aggregate them for cost monitoring.
```python
from langsmith import Client
from datetime import datetime, timedelta

ls_client = Client()

def get_cost_report(project_name: str, days: int = 7) -> dict:
    """Calculate agent costs over a time period."""
    start = datetime.utcnow() - timedelta(days=days)
    runs = ls_client.list_runs(
        project_name=project_name,
        start_time=start,
        run_type="llm",
    )
    total_input_tokens = 0
    total_output_tokens = 0
    total_runs = 0
    for run in runs:
        if run.total_tokens:
            total_input_tokens += run.prompt_tokens or 0
            total_output_tokens += run.completion_tokens or 0
            total_runs += 1
    # Claude Sonnet pricing (approximate)
    input_cost = (total_input_tokens / 1_000_000) * 3.00
    output_cost = (total_output_tokens / 1_000_000) * 15.00
    return {
        "period_days": days,
        "total_llm_calls": total_runs,
        "total_input_tokens": total_input_tokens,
        "total_output_tokens": total_output_tokens,
        "estimated_cost_usd": round(input_cost + output_cost, 2),
        "avg_cost_per_call": round((input_cost + output_cost) / max(total_runs, 1), 4),
    }
```

For a deeper look, see [W&B as an alternative](/blog/wandb-agent-evaluation).
Run this daily. Set alerts when the average cost per call spikes. A prompt change that adds 2,000 tokens looks harmless until it's multiplied by 10,000 daily requests.
## Alert Rules
Set up monitoring rules that fire when something goes wrong:
```python
from datetime import datetime, timedelta
from langsmith import Client

def check_agent_health(project_name: str) -> list[str]:
    """Run health checks and return any alerts."""
    alerts = []
    ls_client_local = Client()
    recent_runs = list(ls_client_local.list_runs(
        project_name=project_name,
        start_time=datetime.utcnow() - timedelta(hours=1),
        is_root=True,
    ))
    if not recent_runs:
        alerts.append("WARNING: No agent runs in the last hour")
        return alerts
    # Error rate
    errors = [r for r in recent_runs if r.error]
    error_rate = len(errors) / len(recent_runs)
    if error_rate > 0.1:
        alerts.append(f"CRITICAL: Error rate is {error_rate:.0%} (threshold: 10%)")
    # Latency
    latencies = [
        (r.end_time - r.start_time).total_seconds()
        for r in recent_runs
        if r.end_time and r.start_time
    ]
    if latencies:
        avg_latency = sum(latencies) / len(latencies)
        if avg_latency > 30:
            alerts.append(f"WARNING: Average latency is {avg_latency:.1f}s (threshold: 30s)")
        p95 = sorted(latencies)[int(len(latencies) * 0.95)]
        if p95 > 60:
            alerts.append(f"WARNING: P95 latency is {p95:.1f}s (threshold: 60s)")
    return alerts
```
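A usage sketch for routing the results: split alerts by severity so `CRITICAL` ones can page on-call while `WARNING`s go to a dashboard channel. The routing scheme is an assumption about your paging setup, not a LangSmith feature:

```python
def route_alerts(alerts: list[str]) -> dict[str, list[str]]:
    """Split health-check output by severity, keyed on the
    CRITICAL/WARNING prefixes used by check_agent_health."""
    routed: dict[str, list[str]] = {"critical": [], "warning": []}
    for alert in alerts:
        bucket = "critical" if alert.startswith("CRITICAL") else "warning"
        routed[bucket].append(alert)
    return routed

# e.g. from a cron job every 15 minutes:
# routed = route_alerts(check_agent_health("my-agent-prod"))
# if routed["critical"]:
#     page_oncall(routed["critical"])  # page_oncall is a hypothetical hook
```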
## The Dashboard Checklist
At minimum, monitor these in production:
1. **Error rate.** Anything above 5% needs investigation.
2. **P50 and P95 latency.** Know your baseline, alert on spikes.
3. **Cost per request.** Track daily, set budget alerts.
4. **Tool call count per request.** Rising counts mean the agent is struggling.
5. **User feedback scores.** The ultimate quality metric.
6. **Token usage trends.** Creeping token counts mean something changed.
LangSmith gives you all of these out of the box. You just need to look at them.
## The Point
Observability isn't optional for production agents. They're non-deterministic systems making autonomous decisions. Without tracing, you're flying blind. LangSmith plugs in with three environment variables and gives you full visibility into every decision your agent makes.
You can't fix what you can't see. Now you can see everything.