Agent Deployment Patterns: From Dev to Production Without Losing Sleep
By Diesel
architecture · deployment · devops
## The Deployment Problem
You've built your agent. It works on your laptop. It passes your tests. You're ready to ship.
Stop. Take a breath. Because deploying an AI agent to production is fundamentally different from deploying a traditional web application, and the differences will bite you if you don't plan for them.
A web app is deterministic. Same input, same output. You can test exhaustively. An agent is probabilistic. Same input, different output every time. You can't test exhaustively. You can only test the boundaries and hope the middle holds.
A web app costs compute. An agent costs compute plus LLM tokens plus tool execution. A bug in a web app returns a wrong answer. A bug in an agent returns a wrong answer AND charges you for the privilege.
Different beast. Different deployment strategy.
## Pattern 1: Shadow Deployment
Before your agent talks to real users, have it listen. Shadow deployment runs the new agent alongside the existing system without serving its results to users. For a deeper look, see [deploying with FastAPI and Docker](/blog/deploying-ai-agents-fastapi-docker).
```typescript
class ShadowDeployment {
  async handle(request: Request): Promise<Response> {
    // Production agent handles the request
    const prodResponse = await this.prodAgent.run(request);

    // Shadow agent processes the same request in the background
    this.shadowAgent.run(request).then(shadowResponse => {
      this.compare(request, prodResponse, shadowResponse);
    }).catch(error => {
      this.logShadowFailure(request, error);
    });

    // Only the production response goes to the user
    return prodResponse;
  }

  private async compare(
    request: Request,
    prod: Response,
    shadow: Response,
  ): Promise<void> {
    const comparison = {
      request: request.id,
      prodCost: prod.tokenUsage.totalCost,
      shadowCost: shadow.tokenUsage.totalCost,
      prodLatency: prod.durationMs,
      shadowLatency: shadow.durationMs,
      outputSimilarity: await computeSimilarity(prod.output, shadow.output),
      qualityDelta: await evaluateQuality(prod.output, shadow.output, request),
    };
    await this.metrics.record(comparison);
  }
}
```
Shadow deployment answers the question "would the new version be better?" without any risk to users. You get real traffic patterns, real edge cases, real cost data. The shadow agent's cost is your testing budget. It's usually worth it.
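The `computeSimilarity` helper above is deliberately abstract. In practice you would likely compare embeddings, but even a cheap token-overlap metric catches gross divergence between production and shadow outputs. A minimal sketch (token-level Jaccard similarity — a stand-in, not a semantic comparison):

```typescript
// Token-level Jaccard similarity: |intersection| / |union| of the
// two outputs' word sets. Cheap to compute on every shadow request;
// low scores flag responses worth a closer (human or LLM) look.
function jaccardSimilarity(a: string, b: string): number {
  const tokensA = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const tokensB = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (tokensA.size === 0 && tokensB.size === 0) return 1;

  let intersection = 0;
  for (const t of tokensA) {
    if (tokensB.has(t)) intersection++;
  }
  const union = tokensA.size + tokensB.size - intersection;
  return intersection / union;
}
```

Identical outputs score 1.0, disjoint outputs score 0. The useful signal is the trend: if the shadow agent's average similarity suddenly drops, something changed in its behavior.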
## Pattern 2: Canary Releases
Route a small percentage of traffic to the new agent. Monitor everything. Gradually increase if things look good.
```typescript
class CanaryRouter {
  constructor(
    private prodAgent: Agent,
    private canaryAgent: Agent,
    private canaryPercent: number = 5,
  ) {}

  async handle(request: Request): Promise<Response> {
    const isCanary = this.shouldRouteToCanary(request);
    const agent = isCanary ? this.canaryAgent : this.prodAgent;
    const response = await agent.run(request);

    await this.metrics.record({
      variant: isCanary ? "canary" : "production",
      cost: response.tokenUsage.totalCost,
      latency: response.durationMs,
      success: response.success,
      quality: await this.evaluateQuality(response),
    });

    return response;
  }

  private shouldRouteToCanary(request: Request): boolean {
    // Consistent routing: the same user always gets the same variant
    const hash = hashUserId(request.userId);
    return hash % 100 < this.canaryPercent;
  }
}
```
The key is consistent routing. The same user should always get the same variant during the canary period. Otherwise you get confused users who see different behavior on consecutive requests.
Start at 5%. If error rate, cost, and latency are within bounds after 24 hours, go to 20%. Then 50%. Then 100%. At any point, if metrics degrade, roll back instantly.
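That ramp schedule is mechanical enough to encode as a small decision function. A sketch, assuming a `GateResult` shape that distinguishes critical failures from warnings (the shape and the step values are assumptions, not a fixed API):

```typescript
type GateResult = { passed: boolean; critical: boolean };

// The ramp described above: 5% -> 20% -> 50% -> 100%.
const RAMP_STEPS = [5, 20, 50, 100];

// Decide the next canary percentage from the current one and the
// latest quality-gate result. Returns 0 to signal "roll back now";
// returns the current value unchanged to signal "hold and investigate".
function nextCanaryPercent(current: number, gate: GateResult): number {
  if (!gate.passed) return gate.critical ? 0 : current;
  const idx = RAMP_STEPS.indexOf(current);
  if (idx === -1 || idx === RAMP_STEPS.length - 1) return current;
  return RAMP_STEPS[idx + 1];
}
```

A scheduler can call this once per evaluation window (24 hours in the schedule above) and apply the result to the router's `canaryPercent`.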
## Pattern 3: Feature Flags for Agent Capabilities
Don't deploy the whole agent at once. Deploy capabilities incrementally behind feature flags.
```typescript
class FeatureFlaggedAgent {
  async run(context: AgentContext): Promise<Response> {
    const tools = this.getAvailableTools(context.userId);
    const model = this.getModel(context.userId);
    const maxSteps = this.getMaxSteps(context.userId);

    return this.agent.run({
      ...context,
      tools,
      model,
      maxSteps,
    });
  }

  private getAvailableTools(userId: string): Tool[] {
    const base = [searchTool, readFileTool];
    if (featureFlags.isEnabled("agent-write-files", userId)) {
      base.push(writeFileTool);
    }
    if (featureFlags.isEnabled("agent-execute-code", userId)) {
      base.push(executeCodeTool);
    }
    return base;
  }
}
```
New tool? Roll it out to 10% of users. New model? Test it on internal users first. Higher step limits? Enable for power users only. Each capability gets its own flag, its own rollout schedule, its own monitoring.
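A percentage rollout like "10% of users" is typically implemented by hashing the flag name together with the user ID, so each flag slices the user base differently while staying stable for any given user. A sketch of what sits behind `featureFlags.isEnabled` (the FNV-1a hash and bucketing scheme are illustrative assumptions, not a specific flag library's internals):

```typescript
// FNV-1a: a simple, deterministic 32-bit string hash.
// Good enough for bucketing; not for anything cryptographic.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// A user is in the rollout if their bucket (0-99) falls below the
// rollout percentage. Hashing flag + user keeps each user's bucket
// stable per flag, but uncorrelated across flags.
function isEnabledFor(flag: string, userId: string, percent: number): boolean {
  return fnv1a(`${flag}:${userId}`) % 100 < percent;
}
```

The uncorrelated-across-flags property matters: if every flag used the same user hash, the same unlucky 10% of users would absorb the risk of every experiment at once.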
## Pattern 4: Environment Parity with Guardrails
Your staging environment needs to be as close to production as possible. For agents, that means real LLM calls, not mocks. Mocking the LLM removes the exact thing you're trying to test: the non-deterministic behavior.
```typescript
const envConfig = {
  development: {
    model: "haiku",          // cheap model for iteration
    maxToolCalls: 5,         // low limits
    budgetCents: 10,         // tight budget
    tools: developmentTools, // sandboxed tools
    logging: "verbose",
  },
  staging: {
    model: "sonnet",         // production model
    maxToolCalls: 15,        // production limits
    budgetCents: 100,        // reasonable budget
    tools: stagingTools,     // real tools, sandboxed data
    logging: "verbose",
  },
  production: {
    model: "sonnet",
    maxToolCalls: 15,
    budgetCents: 500,
    tools: productionTools,
    logging: "structured",
  },
};
```
In staging, use the same model as production. Use the same tools. The only differences should be the data (sanitized copies of production data) and the users (your team, not real customers). This connects directly to [cost optimization](/blog/cost-optimization-ai-agents).
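Selecting the right block at startup should fail loudly rather than quietly fall back to a default. A minimal sketch (the environment-variable name and config shape are assumptions):

```typescript
type EnvName = "development" | "staging" | "production";

interface EnvConfig {
  model: string;
  maxToolCalls: number;
  budgetCents: number;
  logging: "verbose" | "structured";
}

const configs: Record<EnvName, EnvConfig> = {
  development: { model: "haiku", maxToolCalls: 5, budgetCents: 10, logging: "verbose" },
  staging: { model: "sonnet", maxToolCalls: 15, budgetCents: 100, logging: "verbose" },
  production: { model: "sonnet", maxToolCalls: 15, budgetCents: 500, logging: "structured" },
};

// Crash at boot on a typo'd or unset environment name. The failure
// mode you want to avoid is an agent silently running with the wrong
// budget and tool set.
function loadConfig(envName: string | undefined): EnvConfig {
  if (envName === "development" || envName === "staging" || envName === "production") {
    return configs[envName];
  }
  throw new Error(`Unknown environment: ${envName ?? "(unset)"}`);
}
```

Call it once at startup, e.g. `loadConfig(process.env.APP_ENV)`, and pass the result down rather than re-reading the environment everywhere.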
## Pattern 5: Rollback Strategy
When things go wrong (not if), you need to roll back fast. Agent rollbacks are trickier than web app rollbacks because agents have state.
```typescript
class AgentDeployment {
  private versions: Map<string, AgentVersion> = new Map();
  private activeVersion: string;

  async rollback(reason: string): Promise<void> {
    // Capture the current version before switching, so the audit
    // record reflects what we actually rolled back from
    const fromVersion = this.activeVersion;
    const previousVersion = this.getPreviousStableVersion();

    // 1. Stop routing to the current version
    this.router.pauseRouting();

    // 2. Drain in-flight requests (give running agents time to complete)
    await this.drainInflight(30_000);

    // 3. Switch to the previous version
    this.activeVersion = previousVersion;
    this.router.resumeRouting();

    // 4. Record the rollback
    await this.audit.record({
      action: "rollback",
      from: fromVersion,
      to: previousVersion,
      reason,
      timestamp: Date.now(),
    });

    // 5. Alert the team
    await this.notify(`Agent rolled back: ${reason}`);
  }

  private async drainInflight(timeoutMs: number): Promise<void> {
    const deadline = Date.now() + timeoutMs;
    while (this.inflightCount() > 0 && Date.now() < deadline) {
      await sleep(1000);
    }
    if (this.inflightCount() > 0) {
      // Force-terminate remaining requests
      await this.terminateInflight();
    }
  }
}
```
The drain step is critical. You can't just cut over to a new version while agents are mid-execution. They have context, they have pending tool calls, they have partial results. Give them a grace period to complete. If they don't finish, terminate them with a clear error message to the user.
## Pattern 6: Automated Quality Gates
Don't rely on humans to catch regressions. Automate quality checks at every stage.
```typescript
class QualityGate {
  private checks: QualityCheck[] = [
    new CostCheck({ maxIncrease: 0.20 }),       // cost no more than 20% higher
    new LatencyCheck({ maxP99Ms: 10_000 }),     // P99 under 10s
    new ErrorRateCheck({ maxRate: 0.05 }),      // error rate under 5%
    new ToolCallCheck({ maxAverage: 12 }),      // average tool calls under 12
    new QualityScoreCheck({ minAverage: 0.7 }), // quality score above 0.7
  ];

  async evaluate(metrics: DeploymentMetrics): Promise<GateResult> {
    const results = await Promise.all(
      this.checks.map(c => c.evaluate(metrics))
    );
    const failed = results.filter(r => !r.passed);

    return {
      passed: failed.length === 0,
      checks: results,
      recommendation: failed.length === 0
        ? "proceed"
        : failed.some(f => f.severity === "critical")
          ? "rollback"
          : "pause_and_investigate",
    };
  }
}
```
Run these gates continuously during canary deployments. If any critical check fails, roll back automatically. If a warning fires, pause the rollout and alert the team. No human should have to watch dashboards 24/7 to catch deployment issues.
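What does an individual check like `CostCheck` look like inside? One plausible shape compares the canary's average cost against the baseline and escalates severity as the regression grows. A sketch, with an assumed metrics shape (the field names and the double-threshold escalation rule are illustrative):

```typescript
interface CostMetrics {
  baselineAvgCost: number;  // cents per request, current version
  candidateAvgCost: number; // cents per request, canary
}

interface CheckResult {
  name: string;
  passed: boolean;
  severity: "ok" | "warning" | "critical";
  detail: string;
}

// Fail the check past the allowed increase; escalate to critical at
// double the allowed increase, which triggers automatic rollback.
function costCheck(metrics: CostMetrics, maxIncrease = 0.2): CheckResult {
  const ratio = metrics.candidateAvgCost / metrics.baselineAvgCost - 1;
  const passed = ratio <= maxIncrease;
  return {
    name: "cost",
    passed,
    severity: passed ? "ok" : ratio > maxIncrease * 2 ? "critical" : "warning",
    detail: `cost change: ${(ratio * 100).toFixed(1)}%`,
  };
}
```

The warning band between "failed" and "critical" is what feeds the `pause_and_investigate` recommendation: bad enough to stop the ramp, not bad enough to auto-rollback.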
## Pattern 7: Blue-Green with State Migration
For major agent updates that change how state is structured, use blue-green deployment with state migration. This connects directly to [observability after deployment](/blog/agent-observability-tracing-logging).
```
Blue (current): Agent v1 + State Schema v1
Green (new):    Agent v2 + State Schema v2

Migration:
  1. Deploy green environment
  2. Migrate state from v1 schema to v2 schema
  3. Validate migrated state
  4. Switch traffic to green
  5. Monitor
  6. Decommission blue after stability period
```
```typescript
class StateMigration {
  async migrate(userId: string): Promise<void> {
    const v1State = await stateStoreV1.load(userId);
    const v2State = this.transform(v1State);

    // Validate the transformation
    const valid = v2Schema.safeParse(v2State);
    if (!valid.success) {
      throw new MigrationError(userId, valid.error);
    }

    await stateStoreV2.save(userId, v2State);
  }

  private transform(v1: V1State): V2State {
    return {
      ...v1,
      // New fields in v2
      preferences: v1.settings || defaultPreferences,
      interactionHistory: v1.history?.map(this.transformHistoryEntry) || [],
      version: 2,
    };
  }
}
```
Migrate state for a small batch of users first. Verify the green environment works with migrated state. Then migrate the rest. Keep the blue environment alive for a week as a safety net.
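The batch-then-verify approach can be sketched as a generic driver that runs the per-user migration in batches, collects failures, and aborts if the failure rate climbs, so a broken transform can't plough through the whole user base. The batch size and failure threshold here are illustrative defaults:

```typescript
// Run `migrate` over userIds in batches. Abort once the cumulative
// failure rate exceeds `maxFailureRate`; surviving failures are
// returned so they can be retried or investigated individually.
async function migrateInBatches(
  userIds: string[],
  migrate: (userId: string) => Promise<void>,
  batchSize = 100,
  maxFailureRate = 0.01,
): Promise<{ migrated: string[]; failed: string[] }> {
  const migrated: string[] = [];
  const failed: string[] = [];

  for (let i = 0; i < userIds.length; i += batchSize) {
    const batch = userIds.slice(i, i + batchSize);
    // allSettled: one user's bad state must not sink the whole batch
    const results = await Promise.allSettled(batch.map(id => migrate(id)));
    results.forEach((r, j) =>
      (r.status === "fulfilled" ? migrated : failed).push(batch[j]),
    );

    const processed = migrated.length + failed.length;
    if (failed.length / processed > maxFailureRate) {
      throw new Error(`Migration aborted after ${processed} users: ${failed.length} failures`);
    }
  }
  return { migrated, failed };
}
```

Start with a small `userIds` slice (your internal users), eyeball the green environment, then hand the driver the rest.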
## The Deployment Checklist
Before every agent deployment, run through this:
```
Pre-deploy:
  [ ] Shadow deployment results reviewed
  [ ] Cost comparison: new vs current
  [ ] Latency comparison: new vs current
  [ ] Quality comparison: new vs current
  [ ] Rollback plan documented
  [ ] Monitoring dashboards configured
  [ ] Alert thresholds set
  [ ] On-call engineer assigned

Deploy:
  [ ] Canary at 5% for 24 hours
  [ ] Quality gates passing
  [ ] No anomalies in cost or latency
  [ ] Canary at 20% for 24 hours
  [ ] Quality gates still passing
  [ ] Full rollout
  [ ] Monitor for 48 hours

Post-deploy:
  [ ] Clean up shadow environment
  [ ] Archive old version (don't delete)
  [ ] Update runbook
  [ ] Record deployment metrics
```
It's not glamorous. It's thorough. And thorough is what keeps you sleeping while your agent talks to users at 3 AM.
## The Rule
Here's the rule I live by: deploy your agent like it's going to do something stupid at the worst possible moment. Because eventually, it will. The question isn't whether. It's whether you've built the system to catch it, contain it, and recover from it.
That's not pessimism. That's production engineering.