The Agent Loop: State Machines for Production AI
By Diesel
Tags: architecture, state-machines, production
## The Loop Nobody Talks About
Every AI agent you've ever used runs in a loop. Receive input, think, act, observe, repeat. It's deceptively simple on paper. In production, it's where everything falls apart.
The reason most agent frameworks feel brittle isn't the LLM. It's the loop. When your agent decides to call a tool, fails, retries, calls a different tool, gets confused by the response, and spirals into an infinite cycle of self-correction, that's not an AI problem. That's an engineering problem. And engineering problems have engineering solutions. The related post on [the ReAct pattern](/blog/react-pattern-agents) goes further on this point.
State machines are that solution.
## What's Actually Happening Inside an Agent
Strip away the abstractions and every agent follows this pattern:
```
IDLE -> PLANNING -> EXECUTING -> OBSERVING -> DECIDING -> (back to PLANNING or DONE)
```
Most frameworks hide this behind a `while True` loop with some break conditions. That works for demos. It doesn't work when you need to answer questions like "why did the agent spend $4.50 on 47 tool calls to accomplish a task that should have taken 3?"
A state machine makes every transition explicit. You know exactly which state the agent is in, what caused the transition, and what the valid next states are. No more guessing.
## The Minimal Agent State Machine
Here's what a production agent state machine actually looks like:
```typescript
type AgentState =
  | "idle"
  | "planning"
  | "executing_tool"
  | "waiting_for_response"
  | "evaluating_result"
  | "error_recovery"
  | "completed"
  | "failed";

const transitions: Record<AgentState, AgentState[]> = {
  idle: ["planning"],
  planning: ["executing_tool", "completed", "failed"],
  executing_tool: ["waiting_for_response", "error_recovery"],
  waiting_for_response: ["evaluating_result", "error_recovery"],
  evaluating_result: ["planning", "completed", "failed"],
  error_recovery: ["planning", "failed"],
  completed: ["idle"],
  failed: ["idle"],
};
```
Notice the `error_recovery` state. That's the one most people skip, and it's the one that saves you at 3 AM. When a tool call fails, you don't just retry blindly. You enter a dedicated state that can reason about the failure, adjust the approach, or gracefully give up. It is worth reading about [event-driven agent design](/blog/event-driven-agent-architecture) alongside this.
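The table above becomes enforceable with a small guard that rejects any transition not listed for the current state. This is a minimal sketch; it restates the `AgentState` type and `transitions` record from above so it stands alone.

```typescript
type AgentState =
  | "idle" | "planning" | "executing_tool" | "waiting_for_response"
  | "evaluating_result" | "error_recovery" | "completed" | "failed";

const transitions: Record<AgentState, AgentState[]> = {
  idle: ["planning"],
  planning: ["executing_tool", "completed", "failed"],
  executing_tool: ["waiting_for_response", "error_recovery"],
  waiting_for_response: ["evaluating_result", "error_recovery"],
  evaluating_result: ["planning", "completed", "failed"],
  error_recovery: ["planning", "failed"],
  completed: ["idle"],
  failed: ["idle"],
};

// Throw loudly on an illegal move instead of letting the agent
// silently drift into an undefined state.
function transition(current: AgentState, next: AgentState): AgentState {
  if (!transitions[current].includes(next)) {
    throw new Error(`invalid transition: ${current} -> ${next}`);
  }
  return next;
}
```

Failing fast here means an invalid transition shows up as a stack trace at the moment it happens, not as a confusing message history three cycles later.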
## Why "Just Use a While Loop" Breaks
I've seen this pattern hundreds of times:
```python
while not done:
    response = llm.generate(messages)
    if response.tool_calls:
        results = execute_tools(response.tool_calls)
        messages.append(results)
    else:
        done = True
```
Three problems with this.
First, there's no concept of "I've been going too long." The loop runs until the LLM decides to stop. That's giving a probabilistic system veto power over your compute budget.
Second, error handling is an afterthought. When `execute_tools` throws, you're in limbo. The message history is corrupted. The LLM doesn't know what happened. Recovery is basically "start over and hope."
Third, you can't inspect it. When something goes wrong (and it will), you can't answer "what state was the agent in when it derailed?" because there are no states. There's just "in the loop" and "out of the loop."
## Adding Depth Limits and Escape Hatches
The most important feature of a state machine agent isn't the states. It's the constraints on transitions.
```typescript
const config = {
  maxPlanningCycles: 5,
  maxToolCallsPerCycle: 3,
  maxTotalToolCalls: 15,
  maxErrorRecoveries: 2,
  timeoutMs: 30_000,
};
```
Every transition checks these constraints. If the agent has planned 5 times without completing, something is wrong. Force it to `failed` with a clear reason. If it has already recovered from two errors and hits a third, stop. The definition of insanity applies to agents too.
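One way to wire those checks in is a budget function that runs before every transition and returns a failure reason when a limit is exceeded. This is a sketch: the counter names are assumptions that mirror the config above, not an existing API.

```typescript
const config = {
  maxPlanningCycles: 5,
  maxTotalToolCalls: 15,
  maxErrorRecoveries: 2,
  timeoutMs: 30_000,
};

interface Counters {
  planningCycles: number;
  totalToolCalls: number;
  errorRecoveries: number;
  startedAt: number; // epoch millis when the run began
}

// Returns a human-readable reason to force the agent to "failed",
// or null if the run is still within budget.
function checkBudget(c: Counters, now: number): string | null {
  if (c.planningCycles > config.maxPlanningCycles) return "too many planning cycles";
  if (c.totalToolCalls > config.maxTotalToolCalls) return "tool call budget exhausted";
  if (c.errorRecoveries > config.maxErrorRecoveries) return "too many error recoveries";
  if (now - c.startedAt > config.timeoutMs) return "timed out";
  return null;
}
```

Returning a reason string rather than a bare boolean pays off later: it flows straight into the transition log, so the audit trail records *why* the run was cut off.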
## The Evaluation State Is Where the Magic Happens
Most agent implementations go straight from "got tool result" back to "ask LLM what to do next." That's wasteful. The evaluation state is where you add programmatic intelligence.
```typescript
function evaluate(result: ToolResult, context: AgentContext): AgentState {
  // Did we get what we needed?
  if (result.satisfiesGoal(context.currentObjective)) {
    return "completed";
  }
  // Are we going in circles?
  if (context.isRepeatingPattern(result, { lastNResults: 3 })) {
    context.addSystemMessage("You're repeating the same approach. Try something different.");
    return "planning";
  }
  // Is the result actionable?
  if (result.isError && result.isRetryable) {
    return "error_recovery";
  }
  // Default: let the LLM plan the next move
  return "planning";
}
```
This is where you encode domain knowledge. The LLM doesn't need to figure out that it's going in circles. You can detect that programmatically and tell it. You're not replacing the AI's reasoning. You're giving it guardrails so it reasons about the right things.
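The circle detection itself can be as simple as fingerprinting recent tool calls and flagging when the last N are identical. A sketch under assumed types: the `ToolCall` shape and the name-plus-args fingerprint scheme are illustrative, not from a particular framework.

```typescript
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

// Collapse a call to a comparable string. Assumes args serialize
// deterministically; a real system might sort keys first.
function fingerprint(call: ToolCall): string {
  return `${call.tool}:${JSON.stringify(call.args)}`;
}

// True when the last `lastN` calls are all the same call.
function isRepeatingPattern(history: ToolCall[], lastN = 3): boolean {
  if (history.length < lastN) return false;
  const recent = history.slice(-lastN).map(fingerprint);
  return recent.every((f) => f === recent[0]);
}
```

Exact-match fingerprints catch the most common failure (the agent re-issuing the same search verbatim); fuzzier similarity checks can be layered on later without changing the evaluation state's contract.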
## Nested State Machines for Complex Agents
Real agents aren't flat. A "research agent" might have a top-level state machine for the overall task, with nested state machines for individual research queries, each with their own retry logic and evaluation criteria.
```
TopLevel: PLANNING -> RESEARCHING -> SYNTHESIZING -> REVIEWING -> DONE
                          |
                          v
RESEARCHING (sub-machine):
  QUERY_FORMULATION -> SEARCH -> FILTER -> EXTRACT -> VALIDATE
```
Each sub-machine is self-contained. It can fail without crashing the parent. The parent can decide whether to retry the sub-task, skip it, or abort the whole operation. This is composition. Same principle that makes Unix pipes powerful, applied to AI agents. For a deeper look, see [stateful agent state management](/blog/stateful-vs-stateless-agents).
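The parent's decision point can be sketched as a small runner: the parent hands a sub-machine a failure policy and interprets the outcome. All names here are illustrative assumptions; a real sub-machine would be async and carry its own context.

```typescript
type SubOutcome = "completed" | "failed";
type FailurePolicy = "retry" | "skip" | "abort";

interface SubMachine {
  run(): SubOutcome;
}

// Run one sub-task under the parent's policy. "skipped" tells the
// parent the sub-task was abandoned without poisoning the whole run.
function runSubTask(
  sub: SubMachine,
  policy: FailurePolicy,
  maxRetries = 1,
): SubOutcome | "skipped" {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    if (sub.run() === "completed") return "completed";
    if (policy === "retry" && attempt < maxRetries) continue; // try again
    if (policy === "skip") return "skipped"; // move on without this result
    return "failed"; // abort: propagate failure to the parent machine
  }
  return "failed";
}
```

Because the policy lives in the parent, the sub-machine stays ignorant of how its failures are handled, which is exactly the isolation that makes composition work.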
## State Persistence: Surviving Crashes
If your agent runs for more than a few seconds, you need to persist state. Not just for crash recovery. For debugging, auditing, and understanding what happened after the fact.
```typescript
interface AgentSnapshot {
  state: AgentState;
  context: AgentContext;
  messageHistory: Message[];
  toolCallCount: number;
  errorCount: number;
  timestamp: number;
  transitionLog: Transition[];
}
```
Every state transition, serialize and store. When something goes wrong, you have a complete timeline. When a customer asks "why did the agent do X," you can replay the exact sequence of states and transitions that led there.
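A minimal version of that record-and-replay flow, with snapshots held in memory for the sketch; in a real system `save` would write to a database or an append-only log. The trimmed `Snapshot` shape below is an assumption that mirrors a slice of the `AgentSnapshot` interface above.

```typescript
interface Transition {
  from: string;
  to: string;
  reason: string;
  timestamp: number;
}

interface Snapshot {
  state: string;
  transitionLog: Transition[];
  timestamp: number;
}

class SnapshotStore {
  private snapshots: Snapshot[] = [];

  save(state: string, log: Transition[], timestamp: number): void {
    // Copy the log so later mutations can't rewrite history.
    this.snapshots.push({ state, transitionLog: [...log], timestamp });
  }

  // Replay: the full sequence of states, in order, for post-mortems.
  timeline(): string[] {
    return this.snapshots.map((s) => s.state);
  }
}
```

Snapshotting on every transition is cheap relative to an LLM call, and it turns "why did the agent do X" from archaeology into a query.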
## The Pattern in Practice
Here's how this plays out in a real system. An agent needs to research a topic and write a summary.
1. **IDLE to PLANNING**: Agent receives the task. Plans to search three sources.
2. **PLANNING to EXECUTING**: First search query fires.
3. **EXECUTING to EVALUATING**: Results come back. Evaluation: relevant but incomplete.
4. **EVALUATING to PLANNING**: Agent plans a follow-up query.
5. **PLANNING to EXECUTING**: Second query fires. API timeout.
6. **EXECUTING to ERROR_RECOVERY**: Logged the timeout. Decides to retry with modified query.
7. **ERROR_RECOVERY to PLANNING**: New plan, adjusted approach.
8. **...continues with full auditability...**
9. **EVALUATING to COMPLETED**: Sufficient information gathered. Summary written.
Every step is logged. Every transition is validated. The whole thing took 8 cycles instead of the 23 that a naive loop would have burned through.
## What This Gets You
State machines aren't glamorous. They don't make for exciting demos. But they give you something that matters far more than demos: confidence that your agent will behave predictably in production.
You get debuggability. You get cost control. You get the ability to say "the agent can never do more than X tool calls" and actually mean it. You get crash recovery. You get audit trails.
The agent loop is the foundation. Get it right, and everything built on top of it works better. Get it wrong, and no amount of prompt engineering will save you.
Build the state machine first. Then make it smart.