Event-Driven Agent Architecture: Reactive Systems That Scale
By Diesel
architecture · event-driven · scalability
## The Polling Problem
Most AI agents are needy. They sit in a loop, constantly asking "is there something for me to do?" Every few seconds, they poll for new messages, check for tool results, ping external APIs. It works when you have one agent. It falls apart spectacularly when you have fifty.
Event-driven architecture flips this entirely. Instead of agents pulling work in, work pushes to agents. The agent sleeps until something interesting happens, then wakes up, handles it, and goes back to sleep. No wasted cycles. No unnecessary LLM calls. No polling loops burning tokens while nothing is happening.
This isn't theoretical. It's how every serious production system handles concurrency. And it's how AI agents should work too.
## Events vs. Messages vs. Commands
Before we build anything, let's get the vocabulary straight. These three concepts look similar but serve different purposes. It is worth reading about [state machine internals](/blog/agent-loop-state-machines) alongside this.
**Events** are facts about things that already happened. "ToolExecutionCompleted", "UserMessageReceived", "ErrorOccurred". They're immutable. They describe the past.
**Commands** are requests for something to happen. "ExecuteTool", "GenerateResponse", "RetryOperation". They might fail. They describe intent.
**Messages** are the payload. The actual data being passed around. An event carries a message. A command carries a message. The message is the content, not the intent.
Most agent frameworks conflate all three. That's why they're hard to debug. When everything is a "message," you can't distinguish between "this happened" and "please make this happen."
## The Event Bus Pattern
At the core of an event-driven agent is an event bus. Think of it as a central nervous system.
```typescript
// A minimal event shape; real systems also carry correlation IDs and timestamps.
interface AgentEvent {
  type: string;
  payload?: unknown;
}

type EventHandler = (event: AgentEvent) => Promise<void>;

class AgentEventBus {
  private handlers: Map<string, EventHandler[]> = new Map();

  on(eventType: string, handler: EventHandler) {
    const existing = this.handlers.get(eventType) || [];
    existing.push(handler);
    this.handlers.set(eventType, existing);
  }

  async emit(event: AgentEvent) {
    const handlers = this.handlers.get(event.type) || [];
    // allSettled: one failing handler doesn't prevent the others from running.
    await Promise.allSettled(handlers.map((h) => h(event)));
  }
}
```
Simple, right? The power isn't in the bus itself. It's in what you connect to it.
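A quick usage sketch, with a minimal inline copy of the bus so it runs standalone: two independent subscribers to the same event type, both firing on a single emit.

```typescript
type Handler = (e: { type: string; payload?: unknown }) => Promise<void>;

// Inline copy of the bus above, trimmed for the example.
class Bus {
  private handlers = new Map<string, Handler[]>();
  on(type: string, h: Handler) {
    this.handlers.set(type, [...(this.handlers.get(type) ?? []), h]);
  }
  async emit(e: { type: string; payload?: unknown }) {
    await Promise.allSettled((this.handlers.get(e.type) ?? []).map((h) => h(e)));
  }
}

const bus = new Bus();
const seen: string[] = [];

// Two independent subscribers to the same event type: both run on emit.
bus.on("task:received", async (e) => { seen.push(`planner saw ${e.type}`); });
bus.on("task:received", async (e) => { seen.push(`logger saw ${e.type}`); });

await bus.emit({ type: "task:received", payload: { task: "summarize" } });
```

Neither subscriber knows the other exists. That isolation is the whole point.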
## Wiring an Agent to Events
Here's where it gets interesting. Instead of a monolithic agent loop, you decompose the agent into event handlers, each responsible for one thing.
```typescript
// Planning handler: triggered when a new task arrives
bus.on("task:received", async (event) => {
  const plan = await llm.plan(event.payload.task);
  await bus.emit({ type: "plan:created", payload: { plan } });
});

// Execution handler: triggered when a plan step needs executing
bus.on("plan:step:ready", async (event) => {
  const result = await toolRunner.execute(event.payload.step);
  await bus.emit({
    type: result.success ? "tool:completed" : "tool:failed",
    payload: { step: event.payload.step, result },
  });
});

// Evaluation handler: triggered when a tool completes
bus.on("tool:completed", async (event) => {
  const evaluation = await evaluator.assess(event.payload);
  if (evaluation.goalMet) {
    await bus.emit({ type: "task:completed", payload: evaluation });
  } else {
    await bus.emit({ type: "plan:step:ready", payload: evaluation.nextStep });
  }
});

// Error handler: triggered on any failure
bus.on("tool:failed", async (event) => {
  const recovery = await errorHandler.analyze(event.payload);
  await bus.emit({ type: recovery.eventType, payload: recovery });
});
```
It is worth reading about [fault tolerance](/blog/fault-tolerance-multi-agent) alongside this.
Each handler is small, testable, and replaceable. You can swap out the planning strategy without touching execution. You can add logging by subscribing to every event type. You can add rate limiting by wrapping the emission.
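Wrapping the emission looks roughly like this. A sketch with hypothetical decorator functions (`withLogging`, `withBudget`) composed around any emit function:

```typescript
interface Ev { type: string; payload?: unknown }
type Emit = (e: Ev) => Promise<void>;

// Cross-cutting logging: record every event that flows through emit.
function withLogging(emit: Emit, log: string[]): Emit {
  return async (e) => {
    log.push(e.type);
    await emit(e);
  };
}

// Crude rate limiting: allow at most `max` emits, then refuse.
function withBudget(emit: Emit, max: number): Emit {
  let used = 0;
  return async (e) => {
    if (used >= max) throw new Error(`emit budget of ${max} exhausted`);
    used++;
    await emit(e);
  };
}

const log: string[] = [];
let delivered = 0;
const baseEmit: Emit = async () => { delivered++; };

// Compose: budget check first, then logging, then the real emit.
const emit = withBudget(withLogging(baseEmit, log), 2);
await emit({ type: "plan:created" });
await emit({ type: "tool:completed" });
```

No handler changed; the cross-cutting behavior lives entirely in the wrappers.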
## Why This Scales Better Than Loops
Three reasons.
**Concurrency becomes natural.** Multiple events can be processed simultaneously. If your agent is researching three topics in parallel, three `plan:step:ready` events fire and three tool executions happen concurrently. No thread management. No mutex locks. Just events flowing through handlers.
**Backpressure is built in.** When the system is overloaded, you can queue events instead of dropping them. The event bus becomes a buffer. Agents process events at their own pace. Nobody crashes because too many requests came in at once.
**Horizontal scaling is straightforward.** Want to handle more load? Run more event handlers. Each one subscribes to the same event types and picks up work from the queue. This is the same pattern that powers every major cloud service. It works because it's proven. This connects directly to [deployment considerations](/blog/agent-deployment-patterns).
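The backpressure point can be sketched with a bounded in-memory queue; a real deployment would use Redis, SQS, or similar, but the shape is the same: producers enqueue instantly, a worker drains at its own pace, and overflow is signaled rather than silently dropped.

```typescript
interface Ev { type: string; payload?: unknown }

// A bounded event queue: the buffer absorbs bursts,
// and enqueue returns false instead of crashing when full.
class EventQueue {
  private buffer: Ev[] = [];
  constructor(private capacity: number) {}

  enqueue(e: Ev): boolean {
    if (this.buffer.length >= this.capacity) return false; // backpressure signal
    this.buffer.push(e);
    return true;
  }

  async drain(handle: (e: Ev) => Promise<void>) {
    while (this.buffer.length > 0) {
      const e = this.buffer.shift()!;
      await handle(e); // one at a time: the consumer sets the pace
    }
  }
}

const q = new EventQueue(2);
const accepted = [
  q.enqueue({ type: "plan:step:ready" }),
  q.enqueue({ type: "plan:step:ready" }),
  q.enqueue({ type: "plan:step:ready" }), // over capacity: rejected
];

const processed: string[] = [];
await q.drain(async (e) => { processed.push(e.type); });
```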
## Event Sourcing for Agent Memory
Here's a pattern that changes how you think about agent state: event sourcing. Instead of storing the current state, store every event that ever happened. Derive the current state by replaying events.
```typescript
class AgentState {
  private events: AgentEvent[] = [];

  apply(event: AgentEvent) {
    this.events.push(event);
  }

  get currentPlan(): Plan | null {
    const planEvents = this.events.filter(e => e.type === "plan:created");
    return planEvents.length > 0
      ? planEvents[planEvents.length - 1].payload.plan
      : null;
  }

  get toolCallCount(): number {
    return this.events.filter(e =>
      e.type === "tool:completed" || e.type === "tool:failed"
    ).length;
  }

  get timeline(): EventTimeline {
    return this.events.map(e => ({
      type: e.type,
      timestamp: e.timestamp,
      summary: summarize(e),
    }));
  }
}
```
This gives you perfect auditability. You can replay any agent session from start to finish. You can debug issues by examining the exact sequence of events. You can "rewind" to any point and understand exactly what the agent knew and when it knew it.
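The "rewind" falls straight out of the structure: replay only a prefix of the log. A minimal sketch, assuming events carry a `timestamp`:

```typescript
interface Ev { type: string; timestamp: number; payload?: unknown }

// Derive the tool-call count as it stood at any moment,
// by replaying only the events recorded up to that point.
function toolCallsAsOf(events: Ev[], upTo: number): number {
  return events
    .filter((e) => e.timestamp <= upTo)
    .filter((e) => e.type === "tool:completed" || e.type === "tool:failed")
    .length;
}

const log: Ev[] = [
  { type: "task:received", timestamp: 1 },
  { type: "tool:completed", timestamp: 2 },
  { type: "tool:failed", timestamp: 3 },
  { type: "tool:completed", timestamp: 4 },
];
```

Any derived value (current plan, error count, timeline) can get the same `asOf` treatment, since the events themselves never change.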
## Dead Letter Queues: Where Failed Events Go to Be Useful
Events that can't be processed need somewhere to go. That's the dead letter queue (DLQ). In agent systems, this is surprisingly valuable.
```typescript
// Assumes the bus supports wildcard subscriptions like "*:failed";
// with the simple bus above, you'd subscribe to each failure type explicitly.
bus.on("*:failed", async (event) => {
  if (event.retryCount >= MAX_RETRIES) {
    await deadLetterQueue.push({
      originalEvent: event,
      failureReason: event.error,
      agentState: currentState.snapshot(),
      timestamp: Date.now(),
    });
    await bus.emit({ type: "task:escalated", payload: event });
  }
});
```
Your DLQ becomes a dataset. Every failed event tells you something about what your agent struggles with. Aggregate them. Look for patterns. The events that end up in the DLQ are the roadmap for your next improvement cycle.
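Aggregating the DLQ can be as simple as grouping by failure reason and surfacing the most common one. A sketch with illustrative field names:

```typescript
interface DeadLetter { failureReason: string; originalType: string }

// Count failures per reason and return the most frequent:
// the first candidate for the next improvement cycle.
function topFailure(dlq: DeadLetter[]): { reason: string; count: number } | null {
  const counts = new Map<string, number>();
  for (const d of dlq) {
    counts.set(d.failureReason, (counts.get(d.failureReason) ?? 0) + 1);
  }
  let best: { reason: string; count: number } | null = null;
  for (const [reason, count] of counts) {
    if (!best || count > best.count) best = { reason, count };
  }
  return best;
}

const dlq: DeadLetter[] = [
  { failureReason: "timeout", originalType: "tool:failed" },
  { failureReason: "timeout", originalType: "tool:failed" },
  { failureReason: "schema_mismatch", originalType: "tool:failed" },
];
```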
## Event Schemas: The Contract Between Components
This is where most teams cut corners and pay for it later. Every event needs a schema. Not optional. Not "we'll add it later."
```typescript
import { z } from "zod";

const ToolCompletedEvent = z.object({
  type: z.literal("tool:completed"),
  timestamp: z.number(),
  correlationId: z.string().uuid(),
  payload: z.object({
    toolName: z.string(),
    input: z.record(z.unknown()),
    output: z.unknown(),
    durationMs: z.number(),
    tokenUsage: z.object({
      input: z.number(),
      output: z.number(),
    }).optional(),
  }),
});
```
The `correlationId` is critical. It connects related events across the entire chain. When a task comes in, it gets a correlation ID. Every event spawned from that task carries the same ID. Debugging goes from "find the needle in the haystack" to "filter by correlation ID."
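In practice that means minting the ID once at task intake and copying it onto every derived event. A sketch using Node's built-in `crypto.randomUUID` (the `intake`/`derive`/`chainOf` helpers are illustrative):

```typescript
import { randomUUID } from "node:crypto";

interface Ev { type: string; correlationId: string; payload?: unknown }

// Mint the correlation ID exactly once, when the task arrives...
function intake(task: string): Ev {
  return { type: "task:received", correlationId: randomUUID(), payload: { task } };
}

// ...and copy it onto every event spawned from that one.
function derive(parent: Ev, type: string, payload?: unknown): Ev {
  return { type, correlationId: parent.correlationId, payload };
}

// Debugging: filter the whole log down to one task's chain.
function chainOf(log: Ev[], correlationId: string): Ev[] {
  return log.filter((e) => e.correlationId === correlationId);
}
```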
## Choreography vs. Orchestration
Two schools of thought on how events flow.
**Choreography**: each handler knows what events to emit next. No central coordinator. The system emerges from the interactions. Simpler to build, harder to reason about at scale.
**Orchestration**: a central coordinator subscribes to events and decides what happens next. More control, single point of failure.
For AI agents, I've found a hybrid works best. Use orchestration for the top-level task flow (plan, execute, evaluate). Use choreography for the details within each phase (tool selection, error recovery, result processing). The orchestrator provides structure. The choreography provides flexibility.
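A sketch of what the orchestrated half looks like: one function owns the top-level phase transitions, while everything inside a phase (tool selection, retries, recovery) stays choreographed. The phase names and transitions here are illustrative.

```typescript
type Phase = "planning" | "executing" | "evaluating" | "done";

// The orchestrator reacts to events but only decides *phase* transitions;
// it never dictates which tool runs or how errors are retried.
function nextPhase(current: Phase, eventType: string): Phase {
  switch (current) {
    case "planning":
      return eventType === "plan:created" ? "executing" : current;
    case "executing":
      return eventType === "tool:completed" ? "evaluating" : current;
    case "evaluating":
      return eventType === "task:completed" ? "done" : "executing";
    default:
      return current;
  }
}
```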
## The Real Win: Composability
The biggest advantage of event-driven agents isn't performance or scalability. It's composability. You can build complex agent behaviors by combining simple event handlers.
Need a logging agent? Subscribe to all events, write them to a store.
Need a cost tracker? Subscribe to `tool:completed` events, sum up token usage.
Need a safety monitor? Subscribe to `plan:created` events, validate the plan before execution.
Need a multi-agent system? Each agent emits events. Other agents subscribe to them.
None of these require modifying the core agent code. They're additive. Plug them in when you need them, unplug when you don't.
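The cost tracker from that list, as a concrete sketch (field names assumed to match the `ToolCompletedEvent` schema earlier):

```typescript
interface ToolCompleted {
  type: "tool:completed";
  payload: { tokenUsage?: { input: number; output: number } };
}

// A cost tracker is just another subscriber: it sums token usage
// from tool:completed events without touching core agent code.
class CostTracker {
  inputTokens = 0;
  outputTokens = 0;

  handle(e: ToolCompleted) {
    const usage = e.payload.tokenUsage;
    if (!usage) return; // tokenUsage is optional in the schema
    this.inputTokens += usage.input;
    this.outputTokens += usage.output;
  }
}
```

Wire `tracker.handle` to `bus.on("tool:completed", ...)` and it starts counting; unsubscribe and it's gone.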
That's the architecture you want. Not a monolith that does everything, but a system of small, focused components that communicate through events. It's how you build agents that survive contact with production.