Your agent works beautifully on your laptop. It searches, it reasons, it generates answers. Demo goes great. Then someone asks "how do we deploy this?" and the room goes quiet.
Deploying AI agents is not the same as deploying a web app. Agents are stateful. They make multiple LLM calls per request. They use tools that talk to external services. A single user request might take 30 seconds and consume real money in API calls. You need to think about all of that.
## The Architecture
FastAPI handles HTTP. Your agent runs inside it. Docker packages everything. A reverse proxy sits in front for TLS and load balancing.
```
Client → Caddy (TLS) → FastAPI (Docker) → Agent → LLM API + Tools
```
## Project Structure
```
agent-api/
├── src/
│   ├── __init__.py
│   ├── main.py      # FastAPI app
│   ├── agent.py     # Agent logic
│   ├── models.py    # Request/response schemas
│   └── config.py    # Settings
├── Dockerfile
├── docker-compose.yml
├── pyproject.toml
└── .env.example
```
## The Agent Module
Keep your agent separate from your API. The API is a delivery mechanism. The agent is the product.
```python
# src/agent.py
from langchain_anthropic import ChatAnthropic
from langchain_community.tools.tavily_search import TavilySearchResults
from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.memory import MemorySaver
memory = MemorySaver()
def create_agent():
    llm = ChatAnthropic(
        model="claude-sonnet-4-20250514",
        temperature=0,
        max_tokens=4096,
    )
    tools = [TavilySearchResults(max_results=3)]
    agent = create_react_agent(
        llm,
        tools,
        checkpointer=memory,
    )
    return agent


# Singleton. Create once, reuse across requests.
agent = create_agent()
```
## Request and Response Models
```python
# src/models.py
from pydantic import BaseModel, Field
class AgentRequest(BaseModel):
    message: str = Field(..., min_length=1, max_length=4000)
    thread_id: str = Field(default="default")


class AgentResponse(BaseModel):
    response: str
    thread_id: str
    tool_calls: list[dict] = []


class HealthResponse(BaseModel):
    status: str
    version: str
```
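Pydantic enforces these constraints before your handler runs, so malformed input never reaches the agent or costs you an API call. A quick sketch of what that buys you (the message text is made up):

```python
from pydantic import BaseModel, Field, ValidationError


class AgentRequest(BaseModel):
    message: str = Field(..., min_length=1, max_length=4000)
    thread_id: str = Field(default="default")


# Valid request: thread_id falls back to its default
req = AgentRequest(message="What is the weather in Lisbon?")
print(req.thread_id)  # default

# An empty message is rejected up front, before any LLM spend
try:
    AgentRequest(message="")
except ValidationError as e:
    print(e.errors()[0]["type"])  # string_too_short
```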
## Configuration
```python
# src/config.py
from pydantic_settings import BaseSettings
class Settings(BaseSettings):
    anthropic_api_key: str
    tavily_api_key: str
    api_version: str = "0.1.0"
    max_concurrent_requests: int = 10
    request_timeout: int = 120

    model_config = {"env_file": ".env"}


settings = Settings()
```
## The FastAPI Application
```python
# src/main.py
import asyncio
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from langchain_core.messages import HumanMessage
from .agent import agent
from .models import AgentRequest, AgentResponse, HealthResponse
from .config import settings
# Semaphore to limit concurrent agent executions
semaphore = asyncio.Semaphore(settings.max_concurrent_requests)


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: warm up the agent
    print("Agent API starting up...")
    yield
    # Shutdown: cleanup
    print("Agent API shutting down...")


app = FastAPI(
    title="Agent API",
    version=settings.api_version,
    lifespan=lifespan,
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Lock this down in production
    allow_methods=["POST", "GET"],
    allow_headers=["*"],
)


@app.get("/health", response_model=HealthResponse)
async def health():
    return HealthResponse(status="healthy", version=settings.api_version)


@app.post("/agent/invoke", response_model=AgentResponse)
async def invoke_agent(request: AgentRequest):
    async with semaphore:
        try:
            config = {"configurable": {"thread_id": request.thread_id}}
            result = await asyncio.wait_for(
                asyncio.to_thread(
                    agent.invoke,
                    {"messages": [HumanMessage(content=request.message)]},
                    config,
                ),
                timeout=settings.request_timeout,
            )
            last_message = result["messages"][-1]
            tool_calls = []
            for msg in result["messages"]:
                if hasattr(msg, "tool_calls") and msg.tool_calls:
                    tool_calls.extend(
                        [{"name": tc["name"], "args": tc["args"]} for tc in msg.tool_calls]
                    )
            return AgentResponse(
                response=last_message.content,
                thread_id=request.thread_id,
                tool_calls=tool_calls,
            )
        except asyncio.TimeoutError:
            raise HTTPException(
                status_code=504,
                detail="Agent execution timed out",
            )
        except Exception as e:
            raise HTTPException(
                status_code=500,
                detail=f"Agent execution failed: {e}",
            )
```
The semaphore is critical. Without it, 100 concurrent requests means 100 concurrent LLM API calls. Your rate limits evaporate and your bill explodes. Cap it.
The timeout prevents runaway agent loops from holding connections forever. 120 seconds is generous. Adjust based on your agent's typical execution time. The related post on [observability after deployment](/blog/agent-observability-tracing-logging) goes further on this point.
## Streaming Responses
Agents are slow. Users need feedback. Stream the response.
```python
from fastapi.responses import StreamingResponse
import json
@app.post("/agent/stream")
async def stream_agent(request: AgentRequest):
    async def generate():
        # The concurrency cap applies here too; streaming shouldn't
        # be a back door around the semaphore
        async with semaphore:
            config = {"configurable": {"thread_id": request.thread_id}}
            # astream keeps the event loop free while tokens arrive;
            # the sync stream() would block it
            async for event in agent.astream(
                {"messages": [HumanMessage(content=request.message)]},
                config,
                stream_mode="messages",
            ):
                message, metadata = event
                if message.content:
                    chunk = json.dumps({
                        "type": "content",
                        "content": message.content,
                    })
                    yield f"data: {chunk}\n\n"
        yield f"data: {json.dumps({'type': 'done'})}\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
    )
```
Server-sent events. The client gets partial responses as the agent thinks. Much better UX than a loading spinner for 30 seconds.
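On the client side, those frames are simple to consume. A sketch of parsing the SSE format this endpoint emits, using a hard-coded captured fragment rather than a live connection (the content strings are made up):

```python
import json

# A captured fragment of the stream, as the client would receive it
raw_stream = (
    'data: {"type": "content", "content": "Lisbon is "}\n\n'
    'data: {"type": "content", "content": "sunny today."}\n\n'
    'data: {"type": "done"}\n\n'
)


def parse_sse(raw: str):
    """Yield the JSON payload of each `data:` frame."""
    for frame in raw.split("\n\n"):
        if frame.startswith("data: "):
            yield json.loads(frame[len("data: "):])


text = "".join(
    event["content"] for event in parse_sse(raw_stream) if event["type"] == "content"
)
print(text)  # Lisbon is sunny today.
```

A real client would feed chunks from the HTTP response into the same frame-splitting logic as they arrive.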
## The Dockerfile
```dockerfile
FROM python:3.14-slim AS base
WORKDIR /app
# Install uv for fast dependency management
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
# Copy dependency files
COPY pyproject.toml uv.lock ./
# Install dependencies only; the project source isn't copied yet
RUN uv sync --frozen --no-dev --no-install-project
# Copy application code
COPY src/ ./src/
# Non-root user
RUN useradd -m -r agent && chown -R agent:agent /app
USER agent
EXPOSE 8000
CMD ["uv", "run", "uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
```
Slim base image. Non-root user. No dev dependencies in production. `uv` instead of pip because it's 10x faster and you don't want to spend 5 minutes on dependency resolution in every build.
## Docker Compose
```yaml
# docker-compose.yml
services:
  agent-api:
    build: .
    ports:
      - "8000:8000"
    env_file:
      - .env
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: "2.0"
    healthcheck:
      # slim images ship without curl, so probe with the Python stdlib instead
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 15s
Memory limit at 2GB. Agents can be memory hungry, especially with large context windows and multiple tool calls. One caveat on the health check: Compose only marks the container unhealthy, it won't restart it on that signal alone. `restart: unless-stopped` covers crashes; for hangs you need an external monitor or an autoheal-style sidecar.
## Environment Configuration
```bash
# .env.example
ANTHROPIC_API_KEY=
TAVILY_API_KEY=
API_VERSION=0.1.0
MAX_CONCURRENT_REQUESTS=10
REQUEST_TIMEOUT=120
```
## Reverse Proxy with Caddy
```
# Caddyfile
api.yourdomain.com {
    reverse_proxy localhost:8000

    header {
        X-Content-Type-Options nosniff
        X-Frame-Options DENY
    }
}
```
Caddy handles TLS automatically. No certbot, no renewal scripts, no expired certificate at 3am.
## Structured Logging
You need to know what your agent is doing in production. Add structured logging.
```python
# src/main.py (add to top)
import logging
import json
from datetime import datetime, timezone


class JSONFormatter(logging.Formatter):
    # Attributes present on every LogRecord; anything else came from `extra=`
    STANDARD_ATTRS = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {
        "message",
        "asctime",
    }

    def format(self, record):
        log_data = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module,
        }
        # Fields passed via `extra=` are set directly on the record,
        # not nested under a single "extra" attribute
        for key, value in record.__dict__.items():
            if key not in self.STANDARD_ATTRS:
                log_data[key] = value
        return json.dumps(log_data, default=str)


logger = logging.getLogger("agent-api")
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```
Then log every agent invocation:
```python
# `duration` is measured around the agent.invoke call
logger.info("Agent invoked", extra={
    "thread_id": request.thread_id,
    "message_length": len(request.message),
    "duration_ms": duration,
    "tool_calls": len(tool_calls),
})
```
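Structured logs pay off at query time: one `json.loads` per line and you can aggregate anything. A sketch using a few made-up log lines in the shape the formatter emits:

```python
import json

# Made-up log lines, shaped like the JSON formatter's output
log_lines = [
    '{"level": "INFO", "message": "Agent invoked", "thread_id": "a", "duration_ms": 812, "tool_calls": 1}',
    '{"level": "INFO", "message": "Agent invoked", "thread_id": "b", "duration_ms": 2440, "tool_calls": 3}',
    '{"level": "ERROR", "message": "Agent execution failed: timeout", "thread_id": "a"}',
]

records = [json.loads(line) for line in log_lines]
invocations = [r for r in records if r["message"] == "Agent invoked"]

avg_ms = sum(r["duration_ms"] for r in invocations) / len(invocations)
errors = sum(r["level"] == "ERROR" for r in records)

print(f"{avg_ms:.0f}ms average, {errors} error(s)")  # 1626ms average, 1 error(s)
```

The same one-liner-per-metric approach works in `jq` or any log aggregator once the output is JSON.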
## What You Get
A containerized agent API that handles concurrent requests safely, streams responses, logs everything, and restarts itself when it crashes. That's the baseline for production.
What's missing: authentication (add API key middleware), rate limiting per user, cost tracking per request, persistent memory (swap MemorySaver for a database-backed checkpointer), and horizontal scaling. But the foundation is solid. Build from here.