Your agent works beautifully on your laptop. It searches, it reasons, it generates answers. Demo goes great. Then someone asks "how do we deploy this?" and the room goes quiet.
Deploying AI agents is not the same as deploying a web app. Agents are stateful. They make multiple LLM calls per request. They use tools that talk to external services. A single user request might take 30 seconds and consume real money in API calls. You need to think about all of that.
## The Architecture
FastAPI handles HTTP. Your agent runs inside it. Docker packages everything. A reverse proxy sits in front for TLS and load balancing.
```
Client → Caddy (TLS) → FastAPI (Docker) → Agent → LLM API + Tools
```
## Project Structure
```
agent-api/
├── src/
│   ├── __init__.py
│   ├── main.py      # FastAPI app
│   ├── agent.py     # Agent logic
│   ├── models.py    # Request/response schemas
│   └── config.py    # Settings
├── Dockerfile
├── docker-compose.yml
├── pyproject.toml
└── .env.example
```
## The Agent Module
Keep your agent separate from your API. The API is a delivery mechanism. The agent is the product.
```python
# src/agent.py
from langchain_anthropic import ChatAnthropic
from langchain_community.tools.tavily_search import TavilySearchResults
from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.memory import MemorySaver
memory = MemorySaver()
def create_agent():
    llm = ChatAnthropic(
        model="claude-sonnet-4-20250514",
        temperature=0,
        max_tokens=4096,
    )
    tools = [TavilySearchResults(max_results=3)]
    agent = create_react_agent(
        llm,
        tools,
        checkpointer=memory,
    )
    return agent


# Singleton. Create once, reuse across requests.
agent = create_agent()
```
## Request and Response Models
```python
# src/models.py
from pydantic import BaseModel, Field
class AgentRequest(BaseModel):
    message: str = Field(..., min_length=1, max_length=4000)
    thread_id: str = Field(default="default")


class AgentResponse(BaseModel):
    response: str
    thread_id: str
    tool_calls: list[dict] = []


class HealthResponse(BaseModel):
    status: str
    version: str
```
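Pydantic enforces these constraints before your handler runs, so malformed input never reaches the agent or costs you an API call. A quick sketch of what that buys you (the message text is made up):

```python
from pydantic import BaseModel, Field, ValidationError


class AgentRequest(BaseModel):
    message: str = Field(..., min_length=1, max_length=4000)
    thread_id: str = Field(default="default")


# Valid request: thread_id falls back to its default
req = AgentRequest(message="What is the weather in Lisbon?")
print(req.thread_id)  # default

# An empty message is rejected up front, before any LLM spend
try:
    AgentRequest(message="")
except ValidationError as e:
    print(e.errors()[0]["type"])  # string_too_short
```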
## Configuration
```python
# src/config.py
from pydantic_settings import BaseSettings
class Settings(BaseSettings):
    anthropic_api_key: str
    tavily_api_key: str
    api_version: str = "0.1.0"
    max_concurrent_requests: int = 10
    request_timeout: int = 120

    model_config = {"env_file": ".env"}


settings = Settings()
```
## The FastAPI Application
```python
# src/main.py
import asyncio
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from langchain_core.messages import HumanMessage
from .agent import agent
from .models import AgentRequest, AgentResponse, HealthResponse
from .config import settings
# Semaphore to limit concurrent agent executions
semaphore = asyncio.Semaphore(settings.max_concurrent_requests)


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: warm up the agent
    print("Agent API starting up...")
    yield
    # Shutdown: cleanup
    print("Agent API shutting down...")


app = FastAPI(
    title="Agent API",
    version=settings.api_version,
    lifespan=lifespan,
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Lock this down in production
    allow_methods=["POST", "GET"],
    allow_headers=["*"],
)


@app.get("/health", response_model=HealthResponse)
async def health():
    return HealthResponse(status="healthy", version=settings.api_version)


@app.post("/agent/invoke", response_model=AgentResponse)
async def invoke_agent(request: AgentRequest):
    async with semaphore:
        try:
            config = {"configurable": {"thread_id": request.thread_id}}
            result = await asyncio.wait_for(
                asyncio.to_thread(
                    agent.invoke,
                    {"messages": [HumanMessage(content=request.message)]},
                    config,
                ),
                timeout=settings.request_timeout,
            )
            last_message = result["messages"][-1]
            tool_calls = []
            for msg in result["messages"]:
                if hasattr(msg, "tool_calls") and msg.tool_calls:
                    tool_calls.extend(
                        [{"name": tc["name"], "args": tc["args"]} for tc in msg.tool_calls]
                    )
            return AgentResponse(
                response=last_message.content,
                thread_id=request.thread_id,
                tool_calls=tool_calls,
            )
        except asyncio.TimeoutError:
            raise HTTPException(
                status_code=504,
                detail="Agent execution timed out",
            )
        except Exception as e:
            raise HTTPException(
                status_code=500,
                detail=f"Agent execution failed: {e}",
            )
```
The semaphore is critical. Without it, 100 concurrent requests means 100 concurrent LLM API calls. Your rate limits evaporate and your bill explodes. Cap it.
The timeout prevents runaway agent loops from holding connections forever. 120 seconds is generous. Adjust based on your agent's typical execution time. The related post on [observability after deployment](/blog/agent-observability-tracing-logging) goes further on this point.
## Streaming Responses
Agents are slow. Users need feedback. Stream the response.
```python
from fastapi.responses import StreamingResponse
import json
@app.post("/agent/stream")
async def stream_agent(request: AgentRequest):
    async def generate():
        # The concurrency cap applies here too; streaming shouldn't
        # be a back door around the semaphore
        async with semaphore:
            config = {"configurable": {"thread_id": request.thread_id}}
            # astream keeps the event loop free while tokens arrive;
            # the sync stream() would block it
            async for event in agent.astream(
                {"messages": [HumanMessage(content=request.message)]},
                config,
                stream_mode="messages",
            ):
                message, metadata = event
                if message.content:
                    chunk = json.dumps({
                        "type": "content",
                        "content": message.content,
                    })
                    yield f"data: {chunk}\n\n"
        yield f"data: {json.dumps({'type': 'done'})}\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
    )
```
Server-sent events. The client gets partial responses as the agent thinks. Much better UX than a loading spinner for 30 seconds.
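On the client side, those frames are simple to consume. A sketch of parsing the SSE format this endpoint emits, using a hard-coded captured fragment rather than a live connection (the content strings are made up):

```python
import json

# A captured fragment of the stream, as the client would receive it
raw_stream = (
    'data: {"type": "content", "content": "Lisbon is "}\n\n'
    'data: {"type": "content", "content": "sunny today."}\n\n'
    'data: {"type": "done"}\n\n'
)


def parse_sse(raw: str):
    """Yield the JSON payload of each `data:` frame."""
    for frame in raw.split("\n\n"):
        if frame.startswith("data: "):
            yield json.loads(frame[len("data: "):])


text = "".join(
    event["content"] for event in parse_sse(raw_stream) if event["type"] == "content"
)
print(text)  # Lisbon is sunny today.
```

A real client would feed chunks from the HTTP response into the same frame-splitting logic as they arrive.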
## The Dockerfile
```dockerfile
FROM python:3.14-slim AS base
WORKDIR /app
# Install uv for fast dependency management
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
# Copy dependency files
COPY pyproject.toml uv.lock ./
# Install dependencies only; the project source isn't copied yet
RUN uv sync --frozen --no-dev --no-install-project
# Copy application code
COPY src/ ./src/
# Non-root user
RUN useradd -m -r agent && chown -R agent:agent /app
USER agent
EXPOSE 8000
CMD ["uv", "run", "uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
```
Slim base image. Non-root user. No dev dependencies in production. `uv` instead of pip because it's 10x faster and you don't want to spend 5 minutes on dependency resolution in every build.
## Docker Compose
```yaml
# docker-compose.yml
services:
  agent-api:
    build: .
    ports:
      - "8000:8000"
    env_file:
      - .env
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: "2.0"
    healthcheck:
      # slim images ship without curl, so probe with the Python stdlib instead
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 15s
Memory limit at 2GB. Agents can be memory hungry, especially with large context windows and multiple tool calls. One caveat on the health check: Compose only marks the container unhealthy, it won't restart it on that signal alone. `restart: unless-stopped` covers crashes; for hangs you need an external monitor or an autoheal-style sidecar.
## Environment Configuration
```bash
# .env.example
ANTHROPIC_API_KEY=
TAVILY_API_KEY=
API_VERSION=0.1.0
MAX_CONCURRENT_REQUESTS=10
REQUEST_TIMEOUT=120
```
## Reverse Proxy with Caddy
```
# Caddyfile
api.yourdomain.com {
    reverse_proxy localhost:8000

    header {
        X-Content-Type-Options nosniff
        X-Frame-Options DENY
    }
}
```
Caddy handles TLS automatically. No certbot, no renewal scripts, no expired certificate at 3am.
## Structured Logging
You need to know what your agent is doing in production. Add structured logging.
```python
# src/main.py (add to top)
import logging
import json
from datetime import datetime, timezone


class JSONFormatter(logging.Formatter):
    # Attributes present on every LogRecord; anything else came from `extra=`
    STANDARD_ATTRS = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {
        "message",
        "asctime",
    }

    def format(self, record):
        log_data = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module,
        }
        # Fields passed via `extra=` are set directly on the record,
        # not nested under a single "extra" attribute
        for key, value in record.__dict__.items():
            if key not in self.STANDARD_ATTRS:
                log_data[key] = value
        return json.dumps(log_data, default=str)


logger = logging.getLogger("agent-api")
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```
Then log every agent invocation:
```python
# `duration` is measured around the agent.invoke call
logger.info("Agent invoked", extra={
    "thread_id": request.thread_id,
    "message_length": len(request.message),
    "duration_ms": duration,
    "tool_calls": len(tool_calls),
})
```
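Structured logs pay off at query time: one `json.loads` per line and you can aggregate anything. A sketch using a few made-up log lines in the shape the formatter emits:

```python
import json

# Made-up log lines, shaped like the JSON formatter's output
log_lines = [
    '{"level": "INFO", "message": "Agent invoked", "thread_id": "a", "duration_ms": 812, "tool_calls": 1}',
    '{"level": "INFO", "message": "Agent invoked", "thread_id": "b", "duration_ms": 2440, "tool_calls": 3}',
    '{"level": "ERROR", "message": "Agent execution failed: timeout", "thread_id": "a"}',
]

records = [json.loads(line) for line in log_lines]
invocations = [r for r in records if r["message"] == "Agent invoked"]

avg_ms = sum(r["duration_ms"] for r in invocations) / len(invocations)
errors = sum(r["level"] == "ERROR" for r in records)

print(f"{avg_ms:.0f}ms average, {errors} error(s)")  # 1626ms average, 1 error(s)
```

The same one-liner-per-metric approach works in `jq` or any log aggregator once the output is JSON.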
## What You Get
A containerized agent API that handles concurrent requests safely, streams responses, logs everything, and restarts itself when it crashes. That's the baseline for production.
What's missing: authentication (add API key middleware), rate limiting per user, cost tracking per request, persistent memory (swap MemorySaver for a database-backed checkpointer), and horizontal scaling. But the foundation is solid. Build from here.