Your agent works in development. You deploy it. Users start hitting it. Something goes wrong. The agent gives a weird answer, or loops on a tool call, or spends $4 on a single request. You check the logs. They say "200 OK."
That's the problem with agents. HTTP status codes tell you nothing about what happened inside. You need to see the reasoning chain. Which tools did it call? What did it get back? Why did it make that decision? How much did it cost?
LangSmith answers all of that. It's an observability platform purpose-built for LLM applications. Think of it as Datadog for agents. The related post on [observability principles](/blog/agent-observability-tracing-logging) goes further on this point.
## Setup
```bash
pip install langsmith langchain-anthropic langchain-community
```
```bash
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="your-langsmith-api-key"
export LANGCHAIN_PROJECT="my-agent-prod"
```
That's it. Three environment variables and every LangChain operation gets traced automatically. No code changes needed.
## What Gets Traced Automatically
Once tracing is enabled, LangSmith captures:
- Every LLM call (model, prompt, response, tokens, latency, cost)
- Every tool invocation (name, input, output, duration)
- Every chain execution (input, output, intermediate steps)
- Every retrieval operation (query, results, scores)
- Error traces with full stack traces
- Parent-child relationships between operations
All of it structured, searchable, and linked together in a trace tree.
## Custom Tracing with the `@traceable` Decorator
For your own functions, add the `@traceable` decorator:
```python
from langsmith import traceable

@traceable(name="process_user_request")
def process_request(user_input: str, session_id: str) -> dict:
    """Main entry point for agent requests."""
    # Your agent logic here
    result = agent.invoke({
        "messages": [{"role": "user", "content": user_input}]
    })
    return {"response": result, "session_id": session_id}

@traceable(name="validate_output", tags=["validation"])
def validate_output(response: str) -> tuple[str, list[str]]:
    """Validate agent output before returning to user."""
    warnings = []
    if len(response) > 5000:
        warnings.append("Response exceeds length limit")
    return response, warnings
```
Nested `@traceable` functions create a trace tree automatically. The parent call shows the child calls underneath it. You see exactly how time is spent at every level.
## Structured Metadata
Attach business context to your traces. This is what makes LangSmith useful for debugging, not just monitoring.
```python
from langsmith import traceable
from langsmith.run_helpers import get_current_run_tree

@traceable(name="agent_invoke")
def invoke_agent(user_input: str, user_id: str, tier: str):
    run_tree = get_current_run_tree()
    if run_tree:
        run_tree.metadata = {
            "user_id": user_id,
            "tier": tier,
            "input_length": len(user_input),
        }
        run_tree.tags = ["production", tier]
    result = agent.invoke({"messages": [("human", user_input)]})
    # Add output metadata
    if run_tree:
        run_tree.metadata["output_length"] = len(str(result))
        run_tree.metadata["tool_calls"] = count_tool_calls(result)
    return result
```
Now you can filter traces in LangSmith by user tier, search for all requests from a specific user, or find all traces where tool call count exceeded 5.
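As a sketch, the tier tag can be queried back out programmatically. The filter-string builder below reuses the same `has()` operator this post uses later for feedback filtering; treat the exact query syntax as an assumption to verify against the LangSmith filter documentation for your SDK version:

```python
def tier_filter(tier: str) -> str:
    """Build a LangSmith filter string matching the tier tag attached above.
    has(tags, ...) is the same operator used for feedback filtering."""
    return f'has(tags, "{tier}")'

# Usage sketch (requires a configured LANGCHAIN_API_KEY):
# from langsmith import Client
# runs = Client().list_runs(project_name="my-agent-prod", filter=tier_filter("premium"))
```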
## Feedback Collection
LangSmith tracks user feedback tied to specific traces. This closes the loop between "what the agent did" and "was it actually helpful."
```python
from langsmith import Client

ls_client = Client()

def record_feedback(run_id: str, score: float, comment: str = ""):
    """Record user feedback on an agent response."""
    ls_client.create_feedback(
        run_id=run_id,
        key="user_rating",
        score=score,  # 0.0 to 1.0
        comment=comment,
    )

def record_correction(run_id: str, correct_answer: str):
    """Record a correction when the agent got it wrong."""
    ls_client.create_feedback(
        run_id=run_id,
        key="correction",
        score=0.0,
        comment=correct_answer,
    )
```
After enough corrections pile up, you have a labeled dataset for evaluation. LangSmith didn't just monitor your agent. It built your test suite.
## Building Evaluation Datasets
Extract traces with feedback into reusable evaluation datasets:
```python
from langsmith import Client

ls_client = Client()

def create_evaluation_dataset(project_name: str, dataset_name: str):
    """Create a dataset from traced runs with feedback."""
    dataset = ls_client.create_dataset(dataset_name)
    # Get runs that have feedback
    runs = ls_client.list_runs(
        project_name=project_name,
        filter='has(feedback_key, "correction")',
    )
    for run in runs:
        feedbacks = list(ls_client.list_feedback(run_ids=[run.id]))
        correction = next(
            (f.comment for f in feedbacks if f.key == "correction"),
            None
        )
        if correction:
            ls_client.create_example(
                inputs=run.inputs,
                outputs={"expected": correction},
                dataset_id=dataset.id,
            )
    return dataset
```

This connects directly to [evaluation metrics to track](/blog/agent-evaluation-metrics).
## Automated Evaluation
Run evaluations against your dataset to catch regressions before they hit production.
```python
from langsmith.evaluation import evaluate
from langchain_anthropic import ChatAnthropic

def correctness_evaluator(run, example):
    """Check if the agent's response matches the expected output."""
    prediction = run.outputs.get("response", "")
    expected = example.outputs.get("expected", "")
    # Use an LLM to judge equivalence
    judge = ChatAnthropic(model="claude-haiku-4-20250514")
    result = judge.invoke([
        ("system", "Compare two responses. Are they semantically equivalent? Answer YES or NO."),
        ("human", f"Response A: {prediction}\n\nResponse B: {expected}")
    ])
    return {
        "key": "correctness",
        "score": 1.0 if "YES" in result.content.upper() else 0.0,
    }

# Run the evaluation
results = evaluate(
    lambda inputs: agent.invoke(inputs),
    data="my-evaluation-dataset",
    evaluators=[correctness_evaluator],
    experiment_prefix="v2-agent",
)
```
Every time you change your agent (new prompt, different model, added tools), run the evaluation suite. If correctness drops, you caught the regression before users did.
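One way to wire that into CI is a simple regression gate: compare the candidate experiment's aggregate correctness score against the last known-good baseline and block the deploy on a drop. The gate below is a sketch; the tolerance value is an arbitrary assumption, not a LangSmith default:

```python
def regression_gate(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    """Allow a release only when the candidate's correctness score is
    within `tolerance` of the baseline (or better)."""
    return candidate >= baseline - tolerance

# e.g. baseline experiment scored 0.90
assert regression_gate(0.90, 0.89) is True   # within tolerance, ship it
assert regression_gate(0.90, 0.84) is False  # regression, block the deploy
```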
## Cost Tracking
Every LLM call in LangSmith includes token counts. Aggregate them for cost monitoring.
```python
from langsmith import Client
from datetime import datetime, timedelta

ls_client = Client()

def get_cost_report(project_name: str, days: int = 7) -> dict:
    """Calculate agent costs over a time period."""
    start = datetime.utcnow() - timedelta(days=days)
    runs = ls_client.list_runs(
        project_name=project_name,
        start_time=start,
        run_type="llm",
    )
    total_input_tokens = 0
    total_output_tokens = 0
    total_runs = 0
    for run in runs:
        if run.total_tokens:
            total_input_tokens += run.prompt_tokens or 0
            total_output_tokens += run.completion_tokens or 0
            total_runs += 1
    # Claude Sonnet pricing (approximate)
    input_cost = (total_input_tokens / 1_000_000) * 3.00
    output_cost = (total_output_tokens / 1_000_000) * 15.00
    return {
        "period_days": days,
        "total_llm_calls": total_runs,
        "total_input_tokens": total_input_tokens,
        "total_output_tokens": total_output_tokens,
        "estimated_cost_usd": round(input_cost + output_cost, 2),
        "avg_cost_per_call": round((input_cost + output_cost) / max(total_runs, 1), 4),
    }
```

For a deeper look, see [W&B as an alternative](/blog/wandb-agent-evaluation).
Run this daily. Set alerts when the average cost per call spikes. A prompt change that adds 2,000 tokens looks harmless until it's multiplied by 10,000 daily requests.
## Alert Rules
Set up monitoring rules that fire when something goes wrong:
```python
from datetime import datetime, timedelta
from langsmith import Client

def check_agent_health(project_name: str) -> list[str]:
    """Run health checks and return any alerts."""
    alerts = []
    ls_client_local = Client()
    recent_runs = list(ls_client_local.list_runs(
        project_name=project_name,
        start_time=datetime.utcnow() - timedelta(hours=1),
        is_root=True,
    ))
    if not recent_runs:
        alerts.append("WARNING: No agent runs in the last hour")
        return alerts
    # Error rate
    errors = [r for r in recent_runs if r.error]
    error_rate = len(errors) / len(recent_runs)
    if error_rate > 0.1:
        alerts.append(f"CRITICAL: Error rate is {error_rate:.0%} (threshold: 10%)")
    # Latency
    latencies = [
        (r.end_time - r.start_time).total_seconds()
        for r in recent_runs
        if r.end_time and r.start_time
    ]
    if latencies:
        avg_latency = sum(latencies) / len(latencies)
        if avg_latency > 30:
            alerts.append(f"WARNING: Average latency is {avg_latency:.1f}s (threshold: 30s)")
        p95 = sorted(latencies)[int(len(latencies) * 0.95)]
        if p95 > 60:
            alerts.append(f"WARNING: P95 latency is {p95:.1f}s (threshold: 60s)")
    return alerts
```
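A usage sketch for routing the results: split alerts by severity so `CRITICAL` ones can page on-call while `WARNING`s go to a dashboard channel. The routing scheme is an assumption about your paging setup, not a LangSmith feature:

```python
def route_alerts(alerts: list[str]) -> dict[str, list[str]]:
    """Split health-check output by severity, keyed on the
    CRITICAL/WARNING prefixes used by check_agent_health."""
    routed: dict[str, list[str]] = {"critical": [], "warning": []}
    for alert in alerts:
        bucket = "critical" if alert.startswith("CRITICAL") else "warning"
        routed[bucket].append(alert)
    return routed

# e.g. from a cron job every 15 minutes:
# routed = route_alerts(check_agent_health("my-agent-prod"))
# if routed["critical"]:
#     page_oncall(routed["critical"])  # page_oncall is a hypothetical hook
```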
## The Dashboard Checklist
At minimum, monitor these in production:
1. **Error rate.** Anything above 5% needs investigation.
2. **P50 and P95 latency.** Know your baseline, alert on spikes.
3. **Cost per request.** Track daily, set budget alerts.
4. **Tool call count per request.** Rising counts mean the agent is struggling.
5. **User feedback scores.** The ultimate quality metric.
6. **Token usage trends.** Creeping token counts mean something changed.
LangSmith gives you all of these out of the box. You just need to look at them.
## The Point
Observability isn't optional for production agents. They're non-deterministic systems making autonomous decisions. Without tracing, you're flying blind. LangSmith plugs in with three environment variables and gives you full visibility into every decision your agent makes.
You can't fix what you can't see. Now you can see everything.