The Agent Loop: State Machines for Production AI
By Diesel
Tags: architecture, state-machines, production
## The Loop Nobody Talks About
Every AI agent you've ever used runs in a loop. Receive input, think, act, observe, repeat. It's deceptively simple on paper. In production, it's where everything falls apart.
The reason most agent frameworks feel brittle isn't the LLM. It's the loop. When your agent decides to call a tool, fails, retries, calls a different tool, gets confused by the response, and spirals into an infinite cycle of self-correction, that's not an AI problem. That's an engineering problem. And engineering problems have engineering solutions. The related post on [the ReAct pattern](/blog/react-pattern-agents) goes further on this point.
State machines are that solution.
## What's Actually Happening Inside an Agent
Strip away the abstractions and every agent follows this pattern:
```
IDLE -> PLANNING -> EXECUTING -> OBSERVING -> DECIDING -> (back to PLANNING or DONE)
```
Most frameworks hide this behind a `while True` loop with some break conditions. That works for demos. It doesn't work when you need to answer questions like "why did the agent spend $4.50 on 47 tool calls to accomplish a task that should have taken 3?"
A state machine makes every transition explicit. You know exactly which state the agent is in, what caused the transition, and what the valid next states are. No more guessing.
## The Minimal Agent State Machine
Here's what a production agent state machine actually looks like:
```typescript
type AgentState =
  | "idle"
  | "planning"
  | "executing_tool"
  | "waiting_for_response"
  | "evaluating_result"
  | "error_recovery"
  | "completed"
  | "failed";

const transitions: Record<AgentState, AgentState[]> = {
  idle: ["planning"],
  planning: ["executing_tool", "completed", "failed"],
  executing_tool: ["waiting_for_response", "error_recovery"],
  waiting_for_response: ["evaluating_result", "error_recovery"],
  evaluating_result: ["planning", "completed", "failed"],
  error_recovery: ["planning", "failed"],
  completed: ["idle"],
  failed: ["idle"],
};
```
Notice the `error_recovery` state. That's the one most people skip, and it's the one that saves you at 3 AM. When a tool call fails, you don't just retry blindly. You enter a dedicated state that can reason about the failure, adjust the approach, or gracefully give up. It is worth reading about [event-driven agent design](/blog/event-driven-agent-architecture) alongside this.
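The table above becomes enforceable with a small guard that rejects any transition not listed for the current state. This is a minimal sketch; it restates the `AgentState` type and `transitions` record from above so it stands alone.

```typescript
type AgentState =
  | "idle" | "planning" | "executing_tool" | "waiting_for_response"
  | "evaluating_result" | "error_recovery" | "completed" | "failed";

const transitions: Record<AgentState, AgentState[]> = {
  idle: ["planning"],
  planning: ["executing_tool", "completed", "failed"],
  executing_tool: ["waiting_for_response", "error_recovery"],
  waiting_for_response: ["evaluating_result", "error_recovery"],
  evaluating_result: ["planning", "completed", "failed"],
  error_recovery: ["planning", "failed"],
  completed: ["idle"],
  failed: ["idle"],
};

// Throw loudly on an illegal move instead of letting the agent
// silently drift into an undefined state.
function transition(current: AgentState, next: AgentState): AgentState {
  if (!transitions[current].includes(next)) {
    throw new Error(`invalid transition: ${current} -> ${next}`);
  }
  return next;
}
```

Failing fast here means an invalid transition shows up as a stack trace at the moment it happens, not as a confusing message history three cycles later.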
## Why "Just Use a While Loop" Breaks
I've seen this pattern hundreds of times:
```python
while not done:
    response = llm.generate(messages)
    if response.tool_calls:
        results = execute_tools(response.tool_calls)
        messages.append(results)
    else:
        done = True
```
Three problems with this.
First, there's no concept of "I've been going too long." The loop runs until the LLM decides to stop. That's giving a probabilistic system veto power over your compute budget.
Second, error handling is an afterthought. When `execute_tools` throws, you're in limbo. The message history is corrupted. The LLM doesn't know what happened. Recovery is basically "start over and hope."
Third, you can't inspect it. When something goes wrong (and it will), you can't answer "what state was the agent in when it derailed?" because there are no states. There's just "in the loop" and "out of the loop."
## Adding Depth Limits and Escape Hatches
The most important feature of a state machine agent isn't the states. It's the constraints on transitions.
```typescript
const config = {
  maxPlanningCycles: 5,
  maxToolCallsPerCycle: 3,
  maxTotalToolCalls: 15,
  maxErrorRecoveries: 2,
  timeoutMs: 30_000,
};
```
Every transition checks these constraints. If the agent has planned 5 times without completing, something is wrong. Force it to `failed` with a clear reason. If it has already recovered from two errors and hits a third, stop. The definition of insanity applies to agents too.
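One way to wire those checks in is a budget function that runs before every transition and returns a failure reason when a limit is exceeded. This is a sketch: the counter names are assumptions that mirror the config above, not an existing API.

```typescript
const config = {
  maxPlanningCycles: 5,
  maxTotalToolCalls: 15,
  maxErrorRecoveries: 2,
  timeoutMs: 30_000,
};

interface Counters {
  planningCycles: number;
  totalToolCalls: number;
  errorRecoveries: number;
  startedAt: number; // epoch millis when the run began
}

// Returns a human-readable reason to force the agent to "failed",
// or null if the run is still within budget.
function checkBudget(c: Counters, now: number): string | null {
  if (c.planningCycles > config.maxPlanningCycles) return "too many planning cycles";
  if (c.totalToolCalls > config.maxTotalToolCalls) return "tool call budget exhausted";
  if (c.errorRecoveries > config.maxErrorRecoveries) return "too many error recoveries";
  if (now - c.startedAt > config.timeoutMs) return "timed out";
  return null;
}
```

Returning a reason string rather than a bare boolean pays off later: it flows straight into the transition log, so the audit trail records *why* the run was cut off.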
## The Evaluation State Is Where the Magic Happens
Most agent implementations go straight from "got tool result" back to "ask LLM what to do next." That's wasteful. The evaluation state is where you add programmatic intelligence.
```typescript
function evaluate(result: ToolResult, context: AgentContext): AgentState {
  // Did we get what we needed?
  if (result.satisfiesGoal(context.currentObjective)) {
    return "completed";
  }
  // Are we going in circles?
  if (context.isRepeatingPattern(result, { lastNResults: 3 })) {
    context.addSystemMessage("You're repeating the same approach. Try something different.");
    return "planning";
  }
  // Is the result actionable?
  if (result.isError && result.isRetryable) {
    return "error_recovery";
  }
  // Default: let the LLM plan the next move
  return "planning";
}
```
This is where you encode domain knowledge. The LLM doesn't need to figure out that it's going in circles. You can detect that programmatically and tell it. You're not replacing the AI's reasoning. You're giving it guardrails so it reasons about the right things.
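The circle detection itself can be as simple as fingerprinting recent tool calls and flagging when the last N are identical. A sketch under assumed types: the `ToolCall` shape and the name-plus-args fingerprint scheme are illustrative, not from a particular framework.

```typescript
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

// Collapse a call to a comparable string. Assumes args serialize
// deterministically; a real system might sort keys first.
function fingerprint(call: ToolCall): string {
  return `${call.tool}:${JSON.stringify(call.args)}`;
}

// True when the last `lastN` calls are all the same call.
function isRepeatingPattern(history: ToolCall[], lastN = 3): boolean {
  if (history.length < lastN) return false;
  const recent = history.slice(-lastN).map(fingerprint);
  return recent.every((f) => f === recent[0]);
}
```

Exact-match fingerprints catch the most common failure (the agent re-issuing the same search verbatim); fuzzier similarity checks can be layered on later without changing the evaluation state's contract.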
## Nested State Machines for Complex Agents
Real agents aren't flat. A "research agent" might have a top-level state machine for the overall task, with nested state machines for individual research queries, each with their own retry logic and evaluation criteria.
```
TopLevel: PLANNING -> RESEARCHING -> SYNTHESIZING -> REVIEWING -> DONE
                          |
                          v
RESEARCHING (sub-machine):
  QUERY_FORMULATION -> SEARCH -> FILTER -> EXTRACT -> VALIDATE
```
Each sub-machine is self-contained. It can fail without crashing the parent. The parent can decide whether to retry the sub-task, skip it, or abort the whole operation. This is composition. Same principle that makes Unix pipes powerful, applied to AI agents. For a deeper look, see [stateful agent state management](/blog/stateful-vs-stateless-agents).
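The parent's decision point can be sketched as a small runner: the parent hands a sub-machine a failure policy and interprets the outcome. All names here are illustrative assumptions; a real sub-machine would be async and carry its own context.

```typescript
type SubOutcome = "completed" | "failed";
type FailurePolicy = "retry" | "skip" | "abort";

interface SubMachine {
  run(): SubOutcome;
}

// Run one sub-task under the parent's policy. "skipped" tells the
// parent the sub-task was abandoned without poisoning the whole run.
function runSubTask(
  sub: SubMachine,
  policy: FailurePolicy,
  maxRetries = 1,
): SubOutcome | "skipped" {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    if (sub.run() === "completed") return "completed";
    if (policy === "retry" && attempt < maxRetries) continue; // try again
    if (policy === "skip") return "skipped"; // move on without this result
    return "failed"; // abort: propagate failure to the parent machine
  }
  return "failed";
}
```

Because the policy lives in the parent, the sub-machine stays ignorant of how its failures are handled, which is exactly the isolation that makes composition work.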
## State Persistence: Surviving Crashes
If your agent runs for more than a few seconds, you need to persist state. Not just for crash recovery. For debugging, auditing, and understanding what happened after the fact.
```typescript
interface AgentSnapshot {
  state: AgentState;
  context: AgentContext;
  messageHistory: Message[];
  toolCallCount: number;
  errorCount: number;
  timestamp: number;
  transitionLog: Transition[];
}
```
Every state transition, serialize and store. When something goes wrong, you have a complete timeline. When a customer asks "why did the agent do X," you can replay the exact sequence of states and transitions that led there.
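A minimal version of that record-and-replay flow, with snapshots held in memory for the sketch; in a real system `save` would write to a database or an append-only log. The trimmed `Snapshot` shape below is an assumption that mirrors a slice of the `AgentSnapshot` interface above.

```typescript
interface Transition {
  from: string;
  to: string;
  reason: string;
  timestamp: number;
}

interface Snapshot {
  state: string;
  transitionLog: Transition[];
  timestamp: number;
}

class SnapshotStore {
  private snapshots: Snapshot[] = [];

  save(state: string, log: Transition[], timestamp: number): void {
    // Copy the log so later mutations can't rewrite history.
    this.snapshots.push({ state, transitionLog: [...log], timestamp });
  }

  // Replay: the full sequence of states, in order, for post-mortems.
  timeline(): string[] {
    return this.snapshots.map((s) => s.state);
  }
}
```

Snapshotting on every transition is cheap relative to an LLM call, and it turns "why did the agent do X" from archaeology into a query.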
## The Pattern in Practice
Here's how this plays out in a real system. An agent needs to research a topic and write a summary.
1. **IDLE to PLANNING**: Agent receives the task. Plans to search three sources.
2. **PLANNING to EXECUTING**: First search query fires.
3. **EXECUTING to EVALUATING**: Results come back. Evaluation: relevant but incomplete.
4. **EVALUATING to PLANNING**: Agent plans a follow-up query.
5. **PLANNING to EXECUTING**: Second query fires. API timeout.
6. **EXECUTING to ERROR_RECOVERY**: Logged the timeout. Decides to retry with modified query.
7. **ERROR_RECOVERY to PLANNING**: New plan, adjusted approach.
8. **...continues with full auditability...**
9. **EVALUATING to COMPLETED**: Sufficient information gathered. Summary written.
Every step is logged. Every transition is validated. The whole thing took 8 cycles instead of the 23 that a naive loop would have burned through.
## What This Gets You
State machines aren't glamorous. They don't make for exciting demos. But they give you something that matters far more than demos: confidence that your agent will behave predictably in production.
You get debuggability. You get cost control. You get the ability to say "the agent can never do more than X tool calls" and actually mean it. You get crash recovery. You get audit trails.
The agent loop is the foundation. Get it right, and everything built on top of it works better. Get it wrong, and no amount of prompt engineering will save you.
Build the state machine first. Then make it smart.