Fault Tolerance in Multi-Agent Systems: When Agents Die
By Diesel
multi-agent · fault-tolerance · reliability
## Everything Fails
In distributed systems, the mantra is "everything fails, all the time." AI agents are worse. They don't just crash. They fail in creative, novel ways that no error handler anticipated.
An agent might:

- return confidently wrong results with no error signal
- enter an infinite loop of self-correction that burns your entire token budget
- hallucinate a file path and try to read it 47 times before timing out
- produce valid JSON that passes schema validation but contains complete nonsense
- go silent mid-task with no error, no timeout, just... nothing
These aren't hypothetical. I've seen each of them in production. The last one is the worst because you don't even know something went wrong until a human notices the pipeline stalled three hours ago.
## The Failure Taxonomy
Not all failures are equal. Your fault tolerance strategy needs to handle each type differently.
**Crash failures:** The agent process dies. Clean, detectable, recoverable. The easy case.
**Byzantine failures:** The agent produces incorrect results without signaling an error. This is the hard case. The agent says "task complete, here are the results" and the results are wrong. No exception. No error code. Just bad output that looks plausible.
**Performance failures:** The agent works but takes 10x longer than expected. Is it thinking deeply or is it stuck? You can't tell from outside. This connects directly to [retry strategies](/blog/retry-strategies-ai-agents).
**Cascade failures:** Agent A fails, which causes Agent B to receive bad input, which causes Agent C to hallucinate based on Agent B's bad output. One failure propagates through the system.
```python
from enum import Enum

class FailureType(Enum):
    CRASH = "crash"          # Process died
    BYZANTINE = "byzantine"  # Wrong results, no error
    TIMEOUT = "timeout"      # Too slow
    CASCADE = "cascade"      # Failure propagated from upstream
    BUDGET = "budget"        # Token/cost limit exceeded
```
## Circuit Breakers
Borrowed directly from electrical engineering and popularized by Netflix for microservices. When an agent fails repeatedly, stop sending it work.
```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and the agent is blocked."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=60):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = "closed"  # closed=normal, open=blocked
        self.opened_at = None

    async def call(self, agent, task):
        if self.state == "open":
            if time.time() - self.opened_at > self.reset_timeout:
                self.state = "half-open"  # let one test task through
            else:
                raise CircuitOpenError(agent.id)
        try:
            result = await agent.execute(task)
            if self.state == "half-open":
                self.state = "closed"  # test task succeeded; restore
            self.failures = 0  # reset on any success: we count failures in a row
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.state = "open"
                self.opened_at = time.time()
            raise
```
Three failures in a row trips the circuit. No more tasks go to that agent for 60 seconds. After the timeout, one test task goes through (half-open state). If it succeeds, the agent is restored. If it fails, the circuit opens again for another 60 seconds.
**The agent-specific twist:** You need circuit breakers per agent AND per task type. An agent might be perfectly fine for code generation but consistently failing at security reviews. A global circuit breaker would take it offline for everything. A per-type breaker only blocks the failing task category.
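One way to wire that up is a small registry keyed on the (agent, task type) pair, so each pair gets its own independent breaker. A minimal sketch (`BreakerRegistry` is my own name, not a library API):

```python
from collections import defaultdict

class BreakerRegistry:
    """One circuit breaker per (agent_id, task_type) pair, so a failing
    skill doesn't take the agent offline for everything else."""
    def __init__(self, breaker_factory):
        # Lazily create a breaker the first time a pair is seen
        self._breakers = defaultdict(breaker_factory)

    def get(self, agent_id, task_type):
        return self._breakers[(agent_id, task_type)]

# With the CircuitBreaker class above:
#   registry = BreakerRegistry(lambda: CircuitBreaker(failure_threshold=3))
#   await registry.get("agent-7", "security_review").call(agent, task)
```

The factory argument keeps the registry agnostic about thresholds; different task types can get different factories if some work deserves a longer fuse.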
## Supervision Trees
Erlang got this right in the 1980s. Every process has a supervisor. When a process dies, the supervisor decides what to do: restart it, restart all its siblings, or escalate to its own supervisor. For a deeper look, see [observability tooling](/blog/agent-observability-tracing-logging).
```python
class AgentCrash(Exception):
    """Raised when a child agent's process dies."""

class AgentSupervisor:
    def __init__(self, strategy="one_for_one"):
        self.children = {}
        self.strategy = strategy
        self.restart_counts = {}
        self.max_restarts = 5
        self.restart_window = 300  # seconds

    async def supervise(self, agent_id, task):
        try:
            return await self.children[agent_id].execute(task)
        except AgentCrash:
            return await self._handle_crash(agent_id, task)

    async def _handle_crash(self, agent_id, task):
        self._record_restart(agent_id)
        if self._too_many_restarts(agent_id):
            # Escalate: this agent is fundamentally broken
            await self.escalate(agent_id)
            return None
        if self.strategy == "one_for_one":
            # Restart just the failed agent
            await self._restart(agent_id)
        elif self.strategy == "one_for_all":
            # Restart all children (shared state might be corrupt)
            for child_id in self.children:
                await self._restart(child_id)
        elif self.strategy == "rest_for_one":
            # Restart failed + all agents started after it
            await self._restart_from(agent_id)
        return await self.children[agent_id].execute(task)
```
The strategy choice matters.
**one_for_one:** Good when agents are independent. A code writer crashing doesn't affect the test writer.
**one_for_all:** Good when agents share state. If the shared context is corrupted, all agents need a clean restart.
**rest_for_one:** Good for pipeline architectures. If stage 3 crashes, stages 4 and 5 have been working with potentially corrupted data from stage 3's partial output.
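The rest_for_one restart order is the only subtle part: you need the children in start order so you can restart from the failure point onward. A standalone sketch of the `_restart_from` helper the supervisor calls (names are illustrative; it relies on the children dict preserving insertion order, which Python guarantees since 3.7):

```python
import asyncio

async def restart_from(children, restart, failed_id):
    """rest_for_one: restart the failed agent plus every sibling
    started after it, in start order."""
    start_order = list(children)
    for child_id in start_order[start_order.index(failed_id):]:
        await restart(child_id)

# Example: a three-stage pipeline where stage 2 crashes
async def demo():
    restarted = []
    async def fake_restart(child_id):
        restarted.append(child_id)
    children = {"parse": None, "analyze": None, "report": None}
    await restart_from(children, fake_restart, "analyze")
    return restarted  # ["analyze", "report"] -- "parse" is left alone
```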
## Byzantine Fault Detection
The hardest problem. How do you detect when an agent is wrong but doesn't know it's wrong?
```python
import asyncio

class ByzantineDetector:
    async def validate(self, result, task, agent):
        checks = [
            self._schema_valid(result),
            self._internally_consistent(result),
            self._cross_reference(result, task),
            self._confidence_calibrated(result, agent),
            self._not_copied(result, task.input),
        ]
        return all(await asyncio.gather(*checks))

    async def _internally_consistent(self, result):
        """Does the result contradict itself?"""
        # Ask a cheap model to check for contradictions
        check = await self.validator_model.check(
            f"Does this output contain internal "
            f"contradictions? {result.summary}"
        )
        return not check.has_contradictions

    async def _confidence_calibrated(self, result, agent):
        """Is the stated confidence realistic?"""
        historical = agent.confidence_history
        if result.confidence > 0.95:
            # Agents that claim >95% confidence are usually
            # less accurate than agents claiming 70-85%
            actual_accuracy = historical.accuracy_at_confidence(0.95)
            return actual_accuracy > 0.8
        return True

    async def _not_copied(self, result, input_data):
        """Did the agent just parrot the input back?"""
        similarity = compute_similarity(result.output, input_data)
        return similarity < 0.85
```
The confidence calibration check is gold. I've found that agents claiming 95%+ confidence are wrong more often than agents claiming 75%. Overconfidence is a consistent signal of low-quality output. Track it historically per agent and flag anomalies.
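Tracking that signal takes only a small calibration log per agent. A minimal sketch of the history object the detector's `accuracy_at_confidence` call assumes (the `ConfidenceHistory` name and the 0.05-wide buckets are my choices, not from any library):

```python
from collections import defaultdict

class ConfidenceHistory:
    """Per-agent calibration tracker. Buckets outcomes by stated
    confidence in 0.05-wide bins and reports observed accuracy."""
    def __init__(self):
        self._outcomes = defaultdict(list)  # bucket index -> [bool, ...]

    def record(self, confidence, was_correct):
        # round(c * 20) maps confidence to a 0.05-wide bucket
        self._outcomes[round(confidence * 20)].append(was_correct)

    def accuracy_at_confidence(self, confidence):
        outcomes = self._outcomes.get(round(confidence * 20), [])
        if not outcomes:
            return None  # no data for this bucket yet
        return sum(outcomes) / len(outcomes)
```

Feed it one `record()` call per validated result and the anomaly check becomes a single lookup: if accuracy at the 0.95 bucket drifts below your floor, the agent's confidence is miscalibrated.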
## Retry Strategies
Not all retries are equal.
```python
class RetryStrategy:
    async def retry_with_feedback(self, agent, task, failure_info):
        """Don't just retry. Tell the agent what went wrong."""
        enriched_task = task.with_context(
            f"Previous attempt failed: {failure_info}. "
            f"Avoid this specific failure mode."
        )
        return await agent.execute(enriched_task)

    async def retry_with_different_agent(self, task, failed_agent):
        """Same task, different agent."""
        alternative = self.find_alternative(task, exclude=[failed_agent])
        return await alternative.execute(task)

    async def retry_with_decomposition(self, task):
        """Break the task into smaller pieces."""
        subtasks = await self.decomposer.split(task)
        results = []
        for subtask in subtasks:
            results.append(await self.route_and_execute(subtask))
        return self.synthesize(results)

    async def retry_with_escalation(self, task):
        """Use a more capable (expensive) model."""
        escalated_agent = self.spawn_agent(
            model="claude-opus",
            task_type=task.type
        )
        return await escalated_agent.execute(task)
```
The hierarchy: same agent with feedback, different agent, decompose and retry, escalate to a better model. Each step costs more. Don't jump to opus because haiku had one bad response.
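The hierarchy reduces to one control pattern: try each rung in cost order, stop at the first success, and only raise after the last rung fails. A sketch (function and agent names are illustrative, not from the strategies above):

```python
import asyncio

async def execute_with_escalation(attempts):
    """Walk the retry ladder: `attempts` is a list of zero-argument
    async callables, ordered cheapest-first."""
    last_error = None
    for attempt in attempts:
        try:
            return await attempt()
        except Exception as exc:
            last_error = exc  # fall through to the next, costlier rung
    raise last_error  # every rung failed

async def demo():
    calls = []
    async def cheap_model():
        calls.append("haiku")
        raise RuntimeError("bad output")
    async def strong_model():
        calls.append("opus")
        return "ok"
    result = await execute_with_escalation([cheap_model, strong_model])
    return result, calls  # the expensive rung only ran after the cheap one failed
```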
## Graceful Degradation
When multiple agents are down, the system should get worse gradually, not fail entirely.
```python
class GracefulDegradation:
    def __init__(self, feature_priorities):
        self.priorities = feature_priorities

    async def process(self, task, available_agents):
        capacity = sum(a.available_capacity
                       for a in available_agents)
        if capacity > 0.8:
            # Full service
            return await self.full_pipeline(task)
        elif capacity > 0.5:
            # Skip non-critical steps
            return await self.essential_pipeline(task)
        elif capacity > 0.2:
            # Single-agent mode
            best = max(available_agents,
                       key=lambda a: a.skill_score(task.type))
            return await best.execute(task)
        else:
            # Emergency: queue for later
            await self.queue.enqueue(task, priority="urgent")
            return DegradedResult(
                status="queued",
                message="System at reduced capacity"
            )
```
It is worth reading about [load balancing to route around failures](/blog/load-balancing-ai-agents) alongside this.
This is what Netflix does with their microservices. When the recommendation engine is down, you still see content. It's just not personalized. Your multi-agent system should work the same way. Security review is down? Skip it but flag the code as "unreviewed." Code generation is degraded? Produce simpler output with more comments for human review.
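The "skip it but flag it" move is only a few lines once breakers expose their state. A sketch, assuming the `state`/`call` interface of the `CircuitBreaker` above (`review_or_flag` and the result shape are my own, illustrative):

```python
import asyncio

async def review_or_flag(code, security_agent, breaker):
    """If the security reviewer's circuit is open, ship the code
    anyway but mark it unreviewed so a human knows to look."""
    if breaker.state == "open":
        return {"code": code, "security_review": None,
                "flags": ["unreviewed"]}
    review = await breaker.call(security_agent, code)
    return {"code": code, "security_review": review, "flags": []}
```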
## The Kill Switch
Every multi-agent system needs an emergency stop.
```python
class KillSwitch:
    def __init__(self):
        self.active = True
        self.budget_limit = 100_000  # tokens
        self.time_limit = 3600       # seconds
        self.error_rate_limit = 0.5  # 50% error rate

    async def check(self, metrics):
        if metrics.total_tokens > self.budget_limit:
            await self.emergency_stop("Token budget exceeded")
        if metrics.elapsed > self.time_limit:
            await self.emergency_stop("Time limit exceeded")
        if metrics.error_rate > self.error_rate_limit:
            await self.emergency_stop("Error rate critical")

    async def emergency_stop(self, reason):
        self.active = False
        for agent in self.all_agents:
            await agent.terminate(graceful=True)
        await self.persist_state()
        await self.notify_human(reason)
```
Without a kill switch, a runaway agent system will drain your API budget, produce mountains of garbage, and keep going until you manually intervene. I learned this after a $400 incident that ran for six hours overnight. Now every system has hard budget limits that trigger immediate shutdown.
## The Resilience Checklist
Before deploying any multi-agent system:
1. Every agent has a circuit breaker
2. Every agent has a supervisor
3. Output validation runs on every result (including "successful" ones)
4. Retry strategy escalates, doesn't repeat blindly
5. Graceful degradation is defined for 80%, 50%, and 20% capacity (the tiers used above)
6. Kill switch with token budget, time limit, and error rate triggers
7. All failures are logged with full context for post-mortem
8. State is persisted frequently enough to survive restarts
Fault tolerance isn't a feature. It's the difference between a demo and a production system. Build it first, not last.