The Supervisor Pattern: Building Agent Managers That Work
By Diesel
Tags: multi-agent, supervisor, patterns
## What Is a Supervisor Agent?
A supervisor is an agent whose job isn't to do work. It's to make other agents do work well. It decomposes tasks, assigns them, monitors progress, handles failures, and synthesizes results. It's a manager. And like human managers, there are good ones and terrible ones.
The terrible ones micromanage every decision, bloat their context window tracking minutiae, and become the bottleneck they were supposed to prevent. The good ones set clear objectives, trust their workers, intervene only when things go wrong, and add genuine value in synthesis.
Building a good supervisor is harder than building a good worker. Workers need domain expertise. Supervisors need judgment.
## The Basic Supervisor
At its simplest, a supervisor decomposes, delegates, and collects.
```python
class BasicSupervisor:
    def __init__(self, workers, model="claude-sonnet"):
        self.workers = workers
        self.llm = LLM(model)

    async def handle(self, task):
        # 1. Decompose
        plan = await self.llm.generate(
            f"Break this task into subtasks for these "
            f"specialists: {[w.role for w in self.workers]}. "
            f"Task: {task.description}"
        )
        # 2. Delegate
        results = {}
        for subtask in plan.subtasks:
            worker = self.match_worker(subtask)
            result = await worker.execute(subtask)
            results[subtask.id] = result
        # 3. Synthesize
        synthesis = await self.llm.generate(
            f"Combine these results into a final output: "
            f"{results}"
        )
        return synthesis
```
This works for demos. It falls apart in production because it assumes everything succeeds, executes sequentially when it could parallelize, has no error handling, no quality validation, and no ability to iterate.
## The Production Supervisor
Here's what a real supervisor looks like.
```python
import asyncio

class ProductionSupervisor:
    def __init__(self, workers, config):
        self.workers = {w.role: w for w in workers}
        self.config = config
        self.circuit_breakers = {
            w.role: CircuitBreaker() for w in workers
        }
        self.execution_log = []

    async def handle(self, task):
        # Phase 1: Plan
        plan = await self._create_plan(task)
        self._log("plan_created", plan)
        # Phase 2: Execute with monitoring
        results = await self._execute_plan(plan)
        # Phase 3: Validate
        validated = await self._validate_results(
            results, task
        )
        # Phase 4: Iterate if needed
        if validated.needs_revision:
            results = await self._iterate(
                plan, results, validated.feedback
            )
        # Phase 5: Synthesize
        return await self._synthesize(task, results)

    async def _execute_plan(self, plan):
        # Build dependency graph
        graph = self._build_dag(plan.subtasks)
        results = {}
        # Execute in topological order, parallelize
        # where possible
        for batch in graph.parallel_batches():
            batch_results = await asyncio.gather(*[
                self._execute_subtask(st, results)
                for st in batch
            ], return_exceptions=True)
            for subtask, result in zip(batch, batch_results):
                if isinstance(result, Exception):
                    result = await self._handle_failure(
                        subtask, result, results
                    )
                results[subtask.id] = result
        return results

    async def _execute_subtask(self, subtask, prior_results):
        worker = self._select_worker(subtask)
        breaker = self.circuit_breakers[worker.role]
        # Inject relevant context from prior results
        context = self._extract_context(
            subtask, prior_results
        )
        enriched = subtask.with_context(context)
        async with self._timeout(subtask.budget):
            result = await breaker.call(
                worker, enriched
            )
        self._log("subtask_complete", subtask, result)
        return result
```
The difference is in the details. The plan becomes a DAG (directed acyclic graph) that identifies which subtasks can run in parallel. Each subtask gets relevant context from prior results without dumping everything into every worker's prompt. Circuit breakers prevent cascading failures. Timeouts prevent runaway execution. These are all instances of [broader orchestration patterns](/blog/multi-agent-orchestration-patterns) worth knowing as a family.
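The `CircuitBreaker` above is left abstract. Here's a minimal sketch of what it might look like; the parameter names (`failure_threshold`, `reset_timeout`) and the count-and-open policy are assumptions, not a fixed implementation:

```python
import time

class CircuitOpenError(Exception):
    """Raised when calls are rejected because the breaker is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def _is_open(self):
        if self.opened_at is None:
            return False
        # After the reset window, allow a trial call through
        return time.monotonic() - self.opened_at < self.reset_timeout

    async def call(self, worker, subtask):
        if self._is_open():
            raise CircuitOpenError(f"{worker.role} breaker is open")
        try:
            result = await worker.execute(subtask)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                # Trip the breaker: stop sending work to this worker
                self.opened_at = time.monotonic()
            raise
        # Any success closes the breaker and resets the count
        self.failures = 0
        self.opened_at = None
        return result
```

The point is isolation: a worker that fails repeatedly stops receiving traffic for a cooldown period instead of dragging every batch down with it.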
## The Decomposition Problem
The supervisor's most important job is decomposition. Bad decomposition torpedoes everything downstream.
```python
class SmartDecomposer:
    async def decompose(self, task, workers):
        # Step 1: Identify required capabilities
        capabilities_needed = await self._analyze_task(task)
        # Step 2: Map capabilities to workers
        assignments = {}
        for cap in capabilities_needed:
            candidates = [
                w for w in workers
                if w.can_handle(cap)
            ]
            if not candidates:
                # No specialist. Can we combine?
                combo = self._find_combination(cap, workers)
                if combo:
                    assignments[cap] = combo
                else:
                    raise NoCapableWorkerError(cap)
            else:
                assignments[cap] = candidates[0]
        # Step 3: Identify dependencies
        deps = await self._find_dependencies(
            capabilities_needed
        )
        # Step 4: Create subtask DAG
        subtasks = []
        for cap, worker in assignments.items():
            subtask = SubTask(
                capability=cap,
                assigned_to=worker,
                depends_on=[
                    d for d in deps if d.target == cap
                ],
                estimated_tokens=self._estimate_cost(cap)
            )
            subtasks.append(subtask)
        return Plan(subtasks=subtasks, dag=deps)
```
Three common decomposition failures:
**Over-decomposition.** Breaking "write a function" into "write the signature," "write the body," "write the return statement." The coordination overhead exceeds the work. If a task fits in one agent's context with room to spare, don't decompose it.
**Under-decomposition.** Giving one agent a task that requires three domains of expertise. The agent does all three poorly instead of one well. If the task crosses domain boundaries, it needs decomposition.
**Wrong boundaries.** Splitting a task at a point that creates excessive dependency between subtasks. "Write the frontend" and "write the API" with both needing to agree on the data schema. Now the supervisor is shuttling schema negotiations back and forth. Better decomposition: "Define the shared schema," then "implement frontend using schema," then "implement API using schema."
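The schema-first fix can be expressed directly as a dependency DAG. A minimal sketch (the `SubTask` shape and `parallel_batches` helper here are simplified illustrations, not the production classes above):

```python
from dataclasses import dataclass, field

@dataclass
class SubTask:
    id: str
    description: str
    depends_on: list = field(default_factory=list)

def parallel_batches(subtasks):
    """Group subtasks into batches where each batch depends only
    on subtasks from earlier batches (topological layering)."""
    done, batches = set(), []
    remaining = list(subtasks)
    while remaining:
        batch = [st for st in remaining
                 if all(d in done for d in st.depends_on)]
        if not batch:
            raise ValueError("cycle in subtask dependencies")
        batches.append(batch)
        done.update(st.id for st in batch)
        remaining = [st for st in remaining if st not in batch]
    return batches

# The schema-first decomposition from the text:
plan = [
    SubTask("schema", "Define the shared data schema"),
    SubTask("frontend", "Implement frontend using schema",
            depends_on=["schema"]),
    SubTask("api", "Implement API using schema",
            depends_on=["schema"]),
]
```

With this shape, `parallel_batches(plan)` yields the schema task alone in the first batch, then frontend and API together in the second: the boundary creates parallelism instead of a negotiation loop.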
## The Context Management Challenge
The supervisor's context window is its most precious resource. And the temptation is to stuff everything into it.
```python
class ContextManager:
    def __init__(self, max_context_tokens=100000):
        self.max_tokens = max_context_tokens
        self.context = {}
        self.importance = {}

    def store(self, key, value, importance=0.5):
        self.context[key] = value
        self.importance[key] = importance
        self._evict_if_needed()

    def _evict_if_needed(self):
        total = sum(
            self._count_tokens(v)
            for v in self.context.values()
        )
        while total > self.max_tokens * 0.8:
            # Evict least important
            least = min(
                self.importance, key=self.importance.get
            )
            total -= self._count_tokens(
                self.context[least]
            )
            del self.context[least]
            del self.importance[least]

    def get_summary_for_worker(self, worker_role):
        """Only pass relevant context to each worker."""
        return {
            k: v for k, v in self.context.items()
            if self._is_relevant(k, worker_role)
        }
```
The supervisor should hold: the original task description (always), the plan (always), summaries of completed subtask results (important), full results only for the current active subtask (medium), worker capability profiles (low, refresh from memory). How these priorities propagate across layers is covered in [hierarchical topologies](/blog/multi-agent-topology-hierarchical-flat).
The supervisor should NOT hold: full code output from every worker, detailed logs, raw data, anything the supervisor doesn't need to make routing or synthesis decisions.
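The hold/don't-hold split falls out of the importance scores. A self-contained sketch (the crude chars-divided-by-four token estimate is a stand-in for a real tokenizer, and the class is a compressed version of the policy, not production code):

```python
class PriorityContext:
    """Importance-weighted context store: when over budget,
    evict the least important entries first."""
    def __init__(self, max_tokens=100):
        self.max_tokens = max_tokens
        self.context, self.importance = {}, {}

    def _tokens(self, value):
        # Rough estimate: ~4 characters per token
        return max(1, len(str(value)) // 4)

    def store(self, key, value, importance=0.5):
        self.context[key] = value
        self.importance[key] = importance
        total = sum(self._tokens(v) for v in self.context.values())
        while total > self.max_tokens * 0.8:
            least = min(self.importance, key=self.importance.get)
            total -= self._tokens(self.context[least])
            del self.context[least]
            del self.importance[least]

ctx = PriorityContext(max_tokens=100)
ctx.store("plan", "p" * 100, importance=1.0)       # ~25 tokens
ctx.store("summary:1", "s" * 100, importance=0.9)  # ~25 tokens
ctx.store("full:1", "x" * 200, importance=0.1)     # ~50 tokens
# Budget exceeded: the low-importance raw output is evicted,
# the plan and the summary survive
```

Storing a cheap summary at high importance and the raw output at low importance means memory pressure automatically discards exactly what the don't-hold list says to discard.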
## The Iteration Loop
First-pass results are rarely final. The supervisor needs to evaluate and iterate.
```python
class IterationManager:
    def __init__(self, max_iterations=3):
        self.max_iter = max_iterations

    async def iterate(self, supervisor, plan,
                      results, validation):
        for i in range(self.max_iter):
            if validation.all_passed:
                return results
            # Identify which subtasks need revision
            failures = validation.failed_checks
            for check in failures:
                subtask = check.subtask
                feedback = check.feedback
                # Re-execute with feedback
                worker = plan.get_worker(subtask)
                revised = await worker.execute(
                    subtask.with_feedback(
                        f"Revision {i+1}: {feedback}"
                    )
                )
                results[subtask.id] = revised
            # Re-validate
            validation = await supervisor.validate(
                results, plan.original_task
            )
        # Max iterations reached
        return results  # Return best effort
```
Key principle: feedback must be specific. "This is wrong, try again" produces the same wrong answer. "The SQL query on line 15 is vulnerable to injection because the user input on line 12 isn't parameterized" produces a fix.
## Supervisor Hierarchies
For complex systems, supervisors supervise supervisors.
```
               Project Supervisor
              /         |        \
      Frontend       Backend      Testing
     Supervisor     Supervisor   Supervisor
      /  |  \        /  |  \      /  |  \
     R   C   S      A   D   C    U   I   E
```
Each layer adds a level of abstraction. The Project Supervisor thinks in features. The Frontend Supervisor thinks in components. The React worker thinks in JSX.
**The rule:** Never go deeper than three levels. Each level adds latency and information loss. Three levels (project, domain, specialist) handles most enterprise complexity. If you need a fourth level, your decomposition is wrong.
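The three-level rule is easy to enforce mechanically. A sketch, representing the hierarchy as `(role, children)` tuples; the leaf role names are illustrative guesses expanded from the diagram's initials:

```python
def tree_depth(node):
    """Depth of a (role, children) tree; a specialist leaf is 1."""
    role, children = node
    if not children:
        return 1
    return 1 + max(tree_depth(c) for c in children)

def validate_hierarchy(root, max_levels=3):
    """Reject hierarchies deeper than project / domain / specialist."""
    depth = tree_depth(root)
    if depth > max_levels:
        raise ValueError(
            f"hierarchy is {depth} levels deep; more than "
            f"{max_levels} usually means the decomposition is wrong"
        )
    return depth

hierarchy = ("project", [
    ("frontend", [("react", []), ("css", []), ("state", [])]),
    ("backend",  [("api", []), ("db", []), ("cache", [])]),
    ("testing",  [("unit", []), ("integration", []), ("e2e", [])]),
])
```

Running the check at construction time turns "never go deeper than three levels" from a guideline into a guardrail: a fourth layer fails loudly at build time rather than degrading quietly at run time.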
## Anti-Patterns I've Seen (and Built)
**The Micromanager.** Supervisor that checks every worker's intermediate step. Context window fills up with status checks. The supervisor becomes slower than just having one agent do the whole task.
**The Absent Manager.** Supervisor that fires off all tasks and blindly concatenates results. No validation, no iteration, no synthesis. You're paying for coordination overhead without getting coordination value.
**The Context Hoarder.** Supervisor that keeps every worker's full output in context "just in case." By the third iteration, it's operating at 95% context utilization and its own reasoning quality has cratered. Keeping narrowly scoped [specialist agents underneath](/blog/agent-specialization-vs-generalist) makes this easier to avoid.
**The Single Point of Failure.** Supervisor crashes and the entire system stops. If your supervisor doesn't have a checkpoint/recovery mechanism, your fault tolerance is nonexistent at the most critical point.
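The checkpoint/recovery mechanism can be very simple. A minimal sketch of the idea, persisting completed results after every subtask so a replacement supervisor can resume (file-based JSON here is an assumption; any durable store works):

```python
import json
from pathlib import Path

class CheckpointStore:
    """Persist supervisor progress so a crashed supervisor can be
    replaced by a fresh instance that picks up where it left off."""
    def __init__(self, path):
        self.path = Path(path)

    def save(self, task_id, completed_results):
        state = {"task_id": task_id, "results": completed_results}
        # Write to a temp file, then rename: a crash mid-write
        # can't corrupt the previous good checkpoint
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(state))
        tmp.replace(self.path)

    def resume(self):
        """Return prior state, or None if starting fresh."""
        if not self.path.exists():
            return None
        return json.loads(self.path.read_text())
```

Call `save` after each subtask completes; on startup, `resume` tells the new supervisor which subtasks to skip. This only works if the supervisor is stateless between subtasks, which is exactly the design argued for below.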
## What Makes a Good Supervisor
After building supervisor agents for two years, the pattern is clear:
The supervisor's system prompt should be about judgment, not domain knowledge. It doesn't need to know TypeScript. It needs to know when TypeScript output is good enough, when it needs iteration, and when to escalate.
Keep the supervisor's model high-quality. Workers can be cheaper models because their scope is narrow. The supervisor needs the best reasoning available because its decisions affect the entire pipeline.
Supervisors should be stateless between tasks. All state lives in the plan and the results store. A supervisor that crashes mid-task should be replaceable by a new instance that reads the same plan and results.
The best supervisors I've built do three things: they decompose intelligently, they validate ruthlessly, and they synthesize creatively. Everything else is plumbing.