Fault Tolerance in Multi-Agent Systems: When Agents Die
By Diesel
multi-agent · fault-tolerance · reliability
## Everything Fails
In distributed systems, the mantra is "everything fails, all the time." AI agents are worse. They don't just crash. They fail in creative, novel ways that no error handler anticipated.
An agent might:

- return confidently wrong results with no error signal
- enter an infinite loop of self-correction that burns your entire token budget
- hallucinate a file path and try to read it 47 times before timing out
- produce valid JSON that passes schema validation but contains complete nonsense
- go silent mid-task with no error, no timeout, just... nothing
These aren't hypothetical. I've seen each of them in production. The last one is the worst because you don't even know something went wrong until a human notices the pipeline stalled three hours ago.
## The Failure Taxonomy
Not all failures are equal. Your fault tolerance strategy needs to handle each type differently.
**Crash failures:** The agent process dies. Clean, detectable, recoverable. The easy case.
**Byzantine failures:** The agent produces incorrect results without signaling an error. This is the hard case. The agent says "task complete, here are the results" and the results are wrong. No exception. No error code. Just bad output that looks plausible.
**Performance failures:** The agent works but takes 10x longer than expected. Is it thinking deeply or is it stuck? You can't tell from outside. This connects directly to [retry strategies](/blog/retry-strategies-ai-agents).
**Cascade failures:** Agent A fails, which causes Agent B to receive bad input, which causes Agent C to hallucinate based on Agent B's bad output. One failure propagates through the system.
```python
from enum import Enum

class FailureType(Enum):
    CRASH = "crash"          # Process died
    BYZANTINE = "byzantine"  # Wrong results, no error
    TIMEOUT = "timeout"      # Too slow
    CASCADE = "cascade"      # Failure propagated from upstream
    BUDGET = "budget"        # Token/cost limit exceeded
```
## Circuit Breakers
Borrowed directly from electrical engineering and popularized by Netflix for microservices. When an agent fails repeatedly, stop sending it work.
```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and the agent is blocked."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=60):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = "closed"  # closed=normal, open=blocked
        self.opened_at = None

    async def call(self, agent, task):
        if self.state == "open":
            if time.time() - self.opened_at > self.reset_timeout:
                self.state = "half-open"  # let one test task through
            else:
                raise CircuitOpenError(agent.id)
        try:
            result = await agent.execute(task)
            if self.state == "half-open":
                self.state = "closed"  # test task succeeded; restore
            self.failures = 0  # reset on any success: we count failures in a row
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.state = "open"
                self.opened_at = time.time()
            raise
```
Three failures in a row trips the circuit. No more tasks go to that agent for 60 seconds. After the timeout, one test task goes through (half-open state). If it succeeds, the agent is restored. If it fails, the circuit opens again for another 60 seconds.
**The agent-specific twist:** You need circuit breakers per agent AND per task type. An agent might be perfectly fine for code generation but consistently failing at security reviews. A global circuit breaker would take it offline for everything. A per-type breaker only blocks the failing task category.
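One way to wire that up is a small registry keyed on the (agent, task type) pair, so each pair gets its own independent breaker. A minimal sketch (`BreakerRegistry` is my own name, not a library API):

```python
from collections import defaultdict

class BreakerRegistry:
    """One circuit breaker per (agent_id, task_type) pair, so a failing
    skill doesn't take the agent offline for everything else."""
    def __init__(self, breaker_factory):
        # Lazily create a breaker the first time a pair is seen
        self._breakers = defaultdict(breaker_factory)

    def get(self, agent_id, task_type):
        return self._breakers[(agent_id, task_type)]

# With the CircuitBreaker class above:
#   registry = BreakerRegistry(lambda: CircuitBreaker(failure_threshold=3))
#   await registry.get("agent-7", "security_review").call(agent, task)
```

The factory argument keeps the registry agnostic about thresholds; different task types can get different factories if some work deserves a longer fuse.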
## Supervision Trees
Erlang got this right in the 1980s. Every process has a supervisor. When a process dies, the supervisor decides what to do: restart it, restart all its siblings, or escalate to its own supervisor. For a deeper look, see [observability tooling](/blog/agent-observability-tracing-logging).
```python
class AgentCrash(Exception):
    """Raised when a child agent's process dies."""

class AgentSupervisor:
    def __init__(self, strategy="one_for_one"):
        self.children = {}
        self.strategy = strategy
        self.restart_counts = {}
        self.max_restarts = 5
        self.restart_window = 300  # seconds

    async def supervise(self, agent_id, task):
        try:
            return await self.children[agent_id].execute(task)
        except AgentCrash:
            return await self._handle_crash(agent_id, task)

    async def _handle_crash(self, agent_id, task):
        self._record_restart(agent_id)
        if self._too_many_restarts(agent_id):
            # Escalate: this agent is fundamentally broken
            await self.escalate(agent_id)
            return None
        if self.strategy == "one_for_one":
            # Restart just the failed agent
            await self._restart(agent_id)
        elif self.strategy == "one_for_all":
            # Restart all children (shared state might be corrupt)
            for child_id in self.children:
                await self._restart(child_id)
        elif self.strategy == "rest_for_one":
            # Restart failed + all agents started after it
            await self._restart_from(agent_id)
        return await self.children[agent_id].execute(task)
```
The strategy choice matters.
**one_for_one:** Good when agents are independent. A code writer crashing doesn't affect the test writer.
**one_for_all:** Good when agents share state. If the shared context is corrupted, all agents need a clean restart.
**rest_for_one:** Good for pipeline architectures. If stage 3 crashes, stages 4 and 5 have been working with potentially corrupted data from stage 3's partial output.
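The rest_for_one restart order is the only subtle part: you need the children in start order so you can restart from the failure point onward. A standalone sketch of the `_restart_from` helper the supervisor calls (names are illustrative; it relies on the children dict preserving insertion order, which Python guarantees since 3.7):

```python
import asyncio

async def restart_from(children, restart, failed_id):
    """rest_for_one: restart the failed agent plus every sibling
    started after it, in start order."""
    start_order = list(children)
    for child_id in start_order[start_order.index(failed_id):]:
        await restart(child_id)

# Example: a three-stage pipeline where stage 2 crashes
async def demo():
    restarted = []
    async def fake_restart(child_id):
        restarted.append(child_id)
    children = {"parse": None, "analyze": None, "report": None}
    await restart_from(children, fake_restart, "analyze")
    return restarted  # ["analyze", "report"] -- "parse" is left alone
```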
## Byzantine Fault Detection
The hardest problem. How do you detect when an agent is wrong but doesn't know it's wrong?
```python
import asyncio

class ByzantineDetector:
    async def validate(self, result, task, agent):
        checks = [
            self._schema_valid(result),
            self._internally_consistent(result),
            self._cross_reference(result, task),
            self._confidence_calibrated(result, agent),
            self._not_copied(result, task.input),
        ]
        return all(await asyncio.gather(*checks))

    async def _internally_consistent(self, result):
        """Does the result contradict itself?"""
        # Ask a cheap model to check for contradictions
        check = await self.validator_model.check(
            f"Does this output contain internal "
            f"contradictions? {result.summary}"
        )
        return not check.has_contradictions

    async def _confidence_calibrated(self, result, agent):
        """Is the stated confidence realistic?"""
        historical = agent.confidence_history
        if result.confidence > 0.95:
            # Agents that claim >95% confidence are usually
            # less accurate than agents claiming 70-85%
            actual_accuracy = historical.accuracy_at_confidence(0.95)
            return actual_accuracy > 0.8
        return True

    async def _not_copied(self, result, input_data):
        """Did the agent just parrot the input back?"""
        similarity = compute_similarity(result.output, input_data)
        return similarity < 0.85
```
The confidence calibration check is gold. I've found that agents claiming 95%+ confidence are wrong more often than agents claiming 75%. Overconfidence is a consistent signal of low-quality output. Track it historically per agent and flag anomalies.
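Tracking that signal takes only a small calibration log per agent. A minimal sketch of the history object the detector's `accuracy_at_confidence` call assumes (the `ConfidenceHistory` name and the 0.05-wide buckets are my choices, not from any library):

```python
from collections import defaultdict

class ConfidenceHistory:
    """Per-agent calibration tracker. Buckets outcomes by stated
    confidence in 0.05-wide bins and reports observed accuracy."""
    def __init__(self):
        self._outcomes = defaultdict(list)  # bucket index -> [bool, ...]

    def record(self, confidence, was_correct):
        # round(c * 20) maps confidence to a 0.05-wide bucket
        self._outcomes[round(confidence * 20)].append(was_correct)

    def accuracy_at_confidence(self, confidence):
        outcomes = self._outcomes.get(round(confidence * 20), [])
        if not outcomes:
            return None  # no data for this bucket yet
        return sum(outcomes) / len(outcomes)
```

Feed it one `record()` call per validated result and the anomaly check becomes a single lookup: if accuracy at the 0.95 bucket drifts below your floor, the agent's confidence is miscalibrated.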
## Retry Strategies
Not all retries are equal.
```python
class RetryStrategy:
    async def retry_with_feedback(self, agent, task, failure_info):
        """Don't just retry. Tell the agent what went wrong."""
        enriched_task = task.with_context(
            f"Previous attempt failed: {failure_info}. "
            f"Avoid this specific failure mode."
        )
        return await agent.execute(enriched_task)

    async def retry_with_different_agent(self, task, failed_agent):
        """Same task, different agent."""
        alternative = self.find_alternative(task, exclude=[failed_agent])
        return await alternative.execute(task)

    async def retry_with_decomposition(self, task):
        """Break the task into smaller pieces."""
        subtasks = await self.decomposer.split(task)
        results = []
        for subtask in subtasks:
            results.append(await self.route_and_execute(subtask))
        return self.synthesize(results)

    async def retry_with_escalation(self, task):
        """Use a more capable (expensive) model."""
        escalated_agent = self.spawn_agent(
            model="claude-opus",
            task_type=task.type
        )
        return await escalated_agent.execute(task)
```
The hierarchy: same agent with feedback, different agent, decompose and retry, escalate to a better model. Each step costs more. Don't jump to opus because haiku had one bad response.
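The hierarchy reduces to one control pattern: try each rung in cost order, stop at the first success, and only raise after the last rung fails. A sketch (function and agent names are illustrative, not from the strategies above):

```python
import asyncio

async def execute_with_escalation(attempts):
    """Walk the retry ladder: `attempts` is a list of zero-argument
    async callables, ordered cheapest-first."""
    last_error = None
    for attempt in attempts:
        try:
            return await attempt()
        except Exception as exc:
            last_error = exc  # fall through to the next, costlier rung
    raise last_error  # every rung failed

async def demo():
    calls = []
    async def cheap_model():
        calls.append("haiku")
        raise RuntimeError("bad output")
    async def strong_model():
        calls.append("opus")
        return "ok"
    result = await execute_with_escalation([cheap_model, strong_model])
    return result, calls  # the expensive rung only ran after the cheap one failed
```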
## Graceful Degradation
When multiple agents are down, the system should get worse gradually, not fail entirely.
```python
class GracefulDegradation:
    def __init__(self, feature_priorities):
        self.priorities = feature_priorities

    async def process(self, task, available_agents):
        capacity = sum(a.available_capacity
                       for a in available_agents)
        if capacity > 0.8:
            # Full service
            return await self.full_pipeline(task)
        elif capacity > 0.5:
            # Skip non-critical steps
            return await self.essential_pipeline(task)
        elif capacity > 0.2:
            # Single-agent mode
            best = max(available_agents,
                       key=lambda a: a.skill_score(task.type))
            return await best.execute(task)
        else:
            # Emergency: queue for later
            await self.queue.enqueue(task, priority="urgent")
            return DegradedResult(
                status="queued",
                message="System at reduced capacity"
            )
```
It is worth reading about [load balancing to route around failures](/blog/load-balancing-ai-agents) alongside this.
This is what Netflix does with their microservices. When the recommendation engine is down, you still see content. It's just not personalized. Your multi-agent system should work the same way. Security review is down? Skip it but flag the code as "unreviewed." Code generation is degraded? Produce simpler output with more comments for human review.
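The "skip it but flag it" move is only a few lines once breakers expose their state. A sketch, assuming the `state`/`call` interface of the `CircuitBreaker` above (`review_or_flag` and the result shape are my own, illustrative):

```python
import asyncio

async def review_or_flag(code, security_agent, breaker):
    """If the security reviewer's circuit is open, ship the code
    anyway but mark it unreviewed so a human knows to look."""
    if breaker.state == "open":
        return {"code": code, "security_review": None,
                "flags": ["unreviewed"]}
    review = await breaker.call(security_agent, code)
    return {"code": code, "security_review": review, "flags": []}
```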
## The Kill Switch
Every multi-agent system needs an emergency stop.
```python
class KillSwitch:
    def __init__(self):
        self.active = True
        self.budget_limit = 100_000  # tokens
        self.time_limit = 3600       # seconds
        self.error_rate_limit = 0.5  # 50% error rate

    async def check(self, metrics):
        if metrics.total_tokens > self.budget_limit:
            await self.emergency_stop("Token budget exceeded")
        if metrics.elapsed > self.time_limit:
            await self.emergency_stop("Time limit exceeded")
        if metrics.error_rate > self.error_rate_limit:
            await self.emergency_stop("Error rate critical")

    async def emergency_stop(self, reason):
        self.active = False
        for agent in self.all_agents:
            await agent.terminate(graceful=True)
        await self.persist_state()
        await self.notify_human(reason)
```
Without a kill switch, a runaway agent system will drain your API budget, produce mountains of garbage, and keep going until you manually intervene. I learned this after a $400 incident that ran for six hours overnight. Now every system has hard budget limits that trigger immediate shutdown.
## The Resilience Checklist
Before deploying any multi-agent system:
1. Every agent has a circuit breaker
2. Every agent has a supervisor
3. Output validation runs on every result (including "successful" ones)
4. Retry strategy escalates, doesn't repeat blindly
5. Graceful degradation is defined for 80%, 50%, and 20% capacity (the tiers used above)
6. Kill switch with token budget, time limit, and error rate triggers
7. All failures are logged with full context for post-mortem
8. State is persisted frequently enough to survive restarts
Fault tolerance isn't a feature. It's the difference between a demo and a production system. Build it first, not last.