The Supervisor Pattern: Building Agent Managers That Work
By Diesel
Tags: multi-agent, supervisor, patterns
## What Is a Supervisor Agent?
A supervisor is an agent whose job isn't to do work. It's to make other agents do work well. It decomposes tasks, assigns them, monitors progress, handles failures, and synthesizes results. It's a manager. And like human managers, there are good ones and terrible ones.
The terrible ones micromanage every decision, bloat their context window tracking minutiae, and become the bottleneck they were supposed to prevent. The good ones set clear objectives, trust their workers, intervene only when things go wrong, and add genuine value in synthesis.
Building a good supervisor is harder than building a good worker. Workers need domain expertise. Supervisors need judgment.
## The Basic Supervisor
At its simplest, a supervisor decomposes, delegates, and collects.
```python
class BasicSupervisor:
    def __init__(self, workers, model="claude-sonnet"):
        self.workers = workers
        self.llm = LLM(model)

    async def handle(self, task):
        # 1. Decompose
        plan = await self.llm.generate(
            f"Break this task into subtasks for these "
            f"specialists: {[w.role for w in self.workers]}. "
            f"Task: {task.description}"
        )
        # 2. Delegate
        results = {}
        for subtask in plan.subtasks:
            worker = self.match_worker(subtask)
            result = await worker.execute(subtask)
            results[subtask.id] = result
        # 3. Synthesize
        synthesis = await self.llm.generate(
            f"Combine these results into a final output: "
            f"{results}"
        )
        return synthesis
```
This works for demos. It falls apart in production because it assumes everything succeeds, executes sequentially when it could parallelize, has no error handling, no quality validation, and no ability to iterate.
## The Production Supervisor
Here's what a real supervisor looks like.
```python
import asyncio

class ProductionSupervisor:
    def __init__(self, workers, config):
        self.workers = {w.role: w for w in workers}
        self.config = config
        self.circuit_breakers = {
            w.role: CircuitBreaker() for w in workers
        }
        self.execution_log = []

    async def handle(self, task):
        # Phase 1: Plan
        plan = await self._create_plan(task)
        self._log("plan_created", plan)
        # Phase 2: Execute with monitoring
        results = await self._execute_plan(plan)
        # Phase 3: Validate
        validated = await self._validate_results(
            results, task
        )
        # Phase 4: Iterate if needed
        if validated.needs_revision:
            results = await self._iterate(
                plan, results, validated.feedback
            )
        # Phase 5: Synthesize
        return await self._synthesize(task, results)

    async def _execute_plan(self, plan):
        # Build dependency graph
        graph = self._build_dag(plan.subtasks)
        results = {}
        # Execute in topological order, parallelize
        # where possible
        for batch in graph.parallel_batches():
            batch_results = await asyncio.gather(*[
                self._execute_subtask(st, results)
                for st in batch
            ], return_exceptions=True)
            for subtask, result in zip(batch, batch_results):
                if isinstance(result, Exception):
                    result = await self._handle_failure(
                        subtask, result, results
                    )
                results[subtask.id] = result
        return results

    async def _execute_subtask(self, subtask, prior_results):
        worker = self._select_worker(subtask)
        breaker = self.circuit_breakers[worker.role]
        # Inject relevant context from prior results
        context = self._extract_context(
            subtask, prior_results
        )
        enriched = subtask.with_context(context)
        async with self._timeout(subtask.budget):
            result = await breaker.call(
                worker, enriched
            )
        self._log("subtask_complete", subtask, result)
        return result
```
The difference is in the details. The plan becomes a DAG (directed acyclic graph) that identifies which subtasks can run in parallel. Each subtask gets relevant context from prior results without dumping everything into every worker's prompt. Circuit breakers prevent cascading failures. Timeouts prevent runaway execution. These are all instances of [broader orchestration patterns](/blog/multi-agent-orchestration-patterns) worth knowing as a family.
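The `CircuitBreaker` above is left abstract. Here's a minimal sketch of what it might look like; the parameter names (`failure_threshold`, `reset_timeout`) and the count-and-open policy are assumptions, not a fixed implementation:

```python
import time

class CircuitOpenError(Exception):
    """Raised when calls are rejected because the breaker is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def _is_open(self):
        if self.opened_at is None:
            return False
        # After the reset window, allow a trial call through
        return time.monotonic() - self.opened_at < self.reset_timeout

    async def call(self, worker, subtask):
        if self._is_open():
            raise CircuitOpenError(f"{worker.role} breaker is open")
        try:
            result = await worker.execute(subtask)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                # Trip the breaker: stop sending work to this worker
                self.opened_at = time.monotonic()
            raise
        # Any success closes the breaker and resets the count
        self.failures = 0
        self.opened_at = None
        return result
```

The point is isolation: a worker that fails repeatedly stops receiving traffic for a cooldown period instead of dragging every batch down with it.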
## The Decomposition Problem
The supervisor's most important job is decomposition. Bad decomposition torpedoes everything downstream.
```python
class SmartDecomposer:
    async def decompose(self, task, workers):
        # Step 1: Identify required capabilities
        capabilities_needed = await self._analyze_task(task)
        # Step 2: Map capabilities to workers
        assignments = {}
        for cap in capabilities_needed:
            candidates = [
                w for w in workers
                if w.can_handle(cap)
            ]
            if not candidates:
                # No specialist. Can we combine?
                combo = self._find_combination(cap, workers)
                if combo:
                    assignments[cap] = combo
                else:
                    raise NoCapableWorkerError(cap)
            else:
                assignments[cap] = candidates[0]
        # Step 3: Identify dependencies
        deps = await self._find_dependencies(
            capabilities_needed
        )
        # Step 4: Create subtask DAG
        subtasks = []
        for cap, worker in assignments.items():
            subtask = SubTask(
                capability=cap,
                assigned_to=worker,
                depends_on=[
                    d for d in deps if d.target == cap
                ],
                estimated_tokens=self._estimate_cost(cap)
            )
            subtasks.append(subtask)
        return Plan(subtasks=subtasks, dag=deps)
```
Three common decomposition failures:
**Over-decomposition.** Breaking "write a function" into "write the signature," "write the body," "write the return statement." The coordination overhead exceeds the work. If a task fits in one agent's context with room to spare, don't decompose it.
**Under-decomposition.** Giving one agent a task that requires three domains of expertise. The agent does all three poorly instead of one well. If the task crosses domain boundaries, it needs decomposition.
**Wrong boundaries.** Splitting a task at a point that creates excessive dependency between subtasks. "Write the frontend" and "write the API" with both needing to agree on the data schema. Now the supervisor is shuttling schema negotiations back and forth. Better decomposition: "Define the shared schema," then "implement frontend using schema," then "implement API using schema."
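The schema-first fix can be expressed directly as a dependency DAG. A minimal sketch (the `SubTask` shape and `parallel_batches` helper here are simplified illustrations, not the production classes above):

```python
from dataclasses import dataclass, field

@dataclass
class SubTask:
    id: str
    description: str
    depends_on: list = field(default_factory=list)

def parallel_batches(subtasks):
    """Group subtasks into batches where each batch depends only
    on subtasks from earlier batches (topological layering)."""
    done, batches = set(), []
    remaining = list(subtasks)
    while remaining:
        batch = [st for st in remaining
                 if all(d in done for d in st.depends_on)]
        if not batch:
            raise ValueError("cycle in subtask dependencies")
        batches.append(batch)
        done.update(st.id for st in batch)
        remaining = [st for st in remaining if st not in batch]
    return batches

# The schema-first decomposition from the text:
plan = [
    SubTask("schema", "Define the shared data schema"),
    SubTask("frontend", "Implement frontend using schema",
            depends_on=["schema"]),
    SubTask("api", "Implement API using schema",
            depends_on=["schema"]),
]
```

With this shape, `parallel_batches(plan)` yields the schema task alone in the first batch, then frontend and API together in the second: the boundary creates parallelism instead of a negotiation loop.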
## The Context Management Challenge
The supervisor's context window is its most precious resource. And the temptation is to stuff everything into it.
```python
class ContextManager:
    def __init__(self, max_context_tokens=100000):
        self.max_tokens = max_context_tokens
        self.context = {}
        self.importance = {}

    def store(self, key, value, importance=0.5):
        self.context[key] = value
        self.importance[key] = importance
        self._evict_if_needed()

    def _evict_if_needed(self):
        total = sum(
            self._count_tokens(v)
            for v in self.context.values()
        )
        while total > self.max_tokens * 0.8:
            # Evict least important
            least = min(
                self.importance, key=self.importance.get
            )
            total -= self._count_tokens(
                self.context[least]
            )
            del self.context[least]
            del self.importance[least]

    def get_summary_for_worker(self, worker_role):
        """Only pass relevant context to each worker."""
        return {
            k: v for k, v in self.context.items()
            if self._is_relevant(k, worker_role)
        }
```
The supervisor should hold: the original task description (always), the plan (always), summaries of completed subtask results (important), full results only for the current active subtask (medium), worker capability profiles (low, refresh from memory). How these priorities propagate across layers is covered in [hierarchical topologies](/blog/multi-agent-topology-hierarchical-flat).
The supervisor should NOT hold: full code output from every worker, detailed logs, raw data, anything the supervisor doesn't need to make routing or synthesis decisions.
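The hold/don't-hold split falls out of the importance scores. A self-contained sketch (the crude chars-divided-by-four token estimate is a stand-in for a real tokenizer, and the class is a compressed version of the policy, not production code):

```python
class PriorityContext:
    """Importance-weighted context store: when over budget,
    evict the least important entries first."""
    def __init__(self, max_tokens=100):
        self.max_tokens = max_tokens
        self.context, self.importance = {}, {}

    def _tokens(self, value):
        # Rough estimate: ~4 characters per token
        return max(1, len(str(value)) // 4)

    def store(self, key, value, importance=0.5):
        self.context[key] = value
        self.importance[key] = importance
        total = sum(self._tokens(v) for v in self.context.values())
        while total > self.max_tokens * 0.8:
            least = min(self.importance, key=self.importance.get)
            total -= self._tokens(self.context[least])
            del self.context[least]
            del self.importance[least]

ctx = PriorityContext(max_tokens=100)
ctx.store("plan", "p" * 100, importance=1.0)       # ~25 tokens
ctx.store("summary:1", "s" * 100, importance=0.9)  # ~25 tokens
ctx.store("full:1", "x" * 200, importance=0.1)     # ~50 tokens
# Budget exceeded: the low-importance raw output is evicted,
# the plan and the summary survive
```

Storing a cheap summary at high importance and the raw output at low importance means memory pressure automatically discards exactly what the don't-hold list says to discard.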
## The Iteration Loop
First-pass results are rarely final. The supervisor needs to evaluate and iterate.
```python
class IterationManager:
    def __init__(self, max_iterations=3):
        self.max_iter = max_iterations

    async def iterate(self, supervisor, plan,
                      results, validation):
        for i in range(self.max_iter):
            if validation.all_passed:
                return results
            # Identify which subtasks need revision
            failures = validation.failed_checks
            for check in failures:
                subtask = check.subtask
                feedback = check.feedback
                # Re-execute with feedback
                worker = plan.get_worker(subtask)
                revised = await worker.execute(
                    subtask.with_feedback(
                        f"Revision {i+1}: {feedback}"
                    )
                )
                results[subtask.id] = revised
            # Re-validate
            validation = await supervisor.validate(
                results, plan.original_task
            )
        # Max iterations reached
        return results  # Return best effort
```
Key principle: feedback must be specific. "This is wrong, try again" produces the same wrong answer. "The SQL query on line 15 is vulnerable to injection because the user input on line 12 isn't parameterized" produces a fix.
## Supervisor Hierarchies
For complex systems, supervisors supervise supervisors.
```
               Project Supervisor
              /         |        \
      Frontend       Backend      Testing
     Supervisor     Supervisor   Supervisor
      /  |  \        /  |  \      /  |  \
     R   C   S      A   D   C    U   I   E
```
Each layer adds a level of abstraction. The Project Supervisor thinks in features. The Frontend Supervisor thinks in components. The React worker thinks in JSX.
**The rule:** Never go deeper than three levels. Each level adds latency and information loss. Three levels (project, domain, specialist) handles most enterprise complexity. If you need a fourth level, your decomposition is wrong.
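The three-level rule is easy to enforce mechanically. A sketch, representing the hierarchy as `(role, children)` tuples; the leaf role names are illustrative guesses expanded from the diagram's initials:

```python
def tree_depth(node):
    """Depth of a (role, children) tree; a specialist leaf is 1."""
    role, children = node
    if not children:
        return 1
    return 1 + max(tree_depth(c) for c in children)

def validate_hierarchy(root, max_levels=3):
    """Reject hierarchies deeper than project / domain / specialist."""
    depth = tree_depth(root)
    if depth > max_levels:
        raise ValueError(
            f"hierarchy is {depth} levels deep; more than "
            f"{max_levels} usually means the decomposition is wrong"
        )
    return depth

hierarchy = ("project", [
    ("frontend", [("react", []), ("css", []), ("state", [])]),
    ("backend",  [("api", []), ("db", []), ("cache", [])]),
    ("testing",  [("unit", []), ("integration", []), ("e2e", [])]),
])
```

Running the check at construction time turns "never go deeper than three levels" from a guideline into a guardrail: a fourth layer fails loudly at build time rather than degrading quietly at run time.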
## Anti-Patterns I've Seen (and Built)
**The Micromanager.** Supervisor that checks every worker's intermediate step. Context window fills up with status checks. The supervisor becomes slower than just having one agent do the whole task.
**The Absent Manager.** Supervisor that fires off all tasks and blindly concatenates results. No validation, no iteration, no synthesis. You're paying for coordination overhead without getting coordination value.
**The Context Hoarder.** Supervisor that keeps every worker's full output in context "just in case." By the third iteration, it's operating at 95% context utilization and its own reasoning quality has cratered. Keeping narrowly scoped [specialist agents underneath](/blog/agent-specialization-vs-generalist) makes this easier to avoid.
**The Single Point of Failure.** Supervisor crashes and the entire system stops. If your supervisor doesn't have a checkpoint/recovery mechanism, your fault tolerance is nonexistent at the most critical point.
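The checkpoint/recovery mechanism can be very simple. A minimal sketch of the idea, persisting completed results after every subtask so a replacement supervisor can resume (file-based JSON here is an assumption; any durable store works):

```python
import json
from pathlib import Path

class CheckpointStore:
    """Persist supervisor progress so a crashed supervisor can be
    replaced by a fresh instance that picks up where it left off."""
    def __init__(self, path):
        self.path = Path(path)

    def save(self, task_id, completed_results):
        state = {"task_id": task_id, "results": completed_results}
        # Write to a temp file, then rename: a crash mid-write
        # can't corrupt the previous good checkpoint
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(state))
        tmp.replace(self.path)

    def resume(self):
        """Return prior state, or None if starting fresh."""
        if not self.path.exists():
            return None
        return json.loads(self.path.read_text())
```

Call `save` after each subtask completes; on startup, `resume` tells the new supervisor which subtasks to skip. This only works if the supervisor is stateless between subtasks, which is exactly the design argued for below.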
## What Makes a Good Supervisor
After building supervisor agents for two years, the pattern is clear:
The supervisor's system prompt should be about judgment, not domain knowledge. It doesn't need to know TypeScript. It needs to know when TypeScript output is good enough, when it needs iteration, and when to escalate.
Keep the supervisor's model high-quality. Workers can be cheaper models because their scope is narrow. The supervisor needs the best reasoning available because its decisions affect the entire pipeline.
Supervisors should be stateless between tasks. All state lives in the plan and the results store. A supervisor that crashes mid-task should be replaceable by a new instance that reads the same plan and results.
The best supervisors I've built do three things: they decompose intelligently, they validate ruthlessly, and they synthesize creatively. Everything else is plumbing.