Agent Guardrails: Input Validation, Output Filtering, and Circuit Breakers
By Diesel
Tags: architecture, guardrails, safety
## The Trust Problem
Here's a question that keeps me up at night: how much do you trust your AI agent?
If your answer is "completely," you haven't been in production long enough. LLMs hallucinate. They misinterpret instructions. They generate SQL that drops tables, shell commands that delete directories, and API calls that charge your credit card. Not because they're malicious. Because they're probabilistic systems doing their best with ambiguous instructions.
Guardrails aren't about limiting your agent. They're about making it safe to give your agent more power. The more constraints you put around dangerous operations, the more dangerous operations you can actually allow.
Counterintuitive, I know. But that's how trust works in engineering.
## Layer 1: Input Validation
Before your agent processes anything, validate the input. This means validating what the user sends AND what the LLM generates as intermediate steps. The related post on [implementation guide for guardrails](/blog/implementing-guardrails-production) goes further on this point.
```typescript
import { z } from "zod";

const UserInputSchema = z.object({
  task: z.string().min(1).max(10_000),
  context: z.record(z.unknown()).optional(),
  constraints: z
    .object({
      maxBudgetCents: z.number().int().positive().max(1000).default(100),
      maxToolCalls: z.number().int().positive().max(50).default(10),
      allowedTools: z.array(z.string()).optional(),
      blockedTools: z.array(z.string()).optional(),
    })
    .optional(),
});

type ValidatedInput = z.infer<typeof UserInputSchema>;

function validateInput(raw: unknown): ValidatedInput {
  const parsed = UserInputSchema.safeParse(raw);
  if (!parsed.success) {
    throw new InputValidationError(parsed.error.flatten());
  }
  return sanitize(parsed.data);
}
```
The `sanitize` function is where the real work happens. Strip potential prompt injections. Remove control characters. Normalize Unicode. If you're accepting user input that gets concatenated into prompts (and you probably are), this is your first line of defense.
```typescript
function sanitize(input: ValidatedInput): ValidatedInput {
  return {
    ...input,
    task: input.task
      .replace(/[\x00-\x08\x0B\x0C\x0E-\x1F]/g, "") // control chars
      .replace(/\{%.*?%\}/g, "") // template injection
      .trim(),
  };
}
```
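To see what those two rules actually catch, here is a standalone sketch applying them to a hostile-ish input (the sample string is illustrative):

```typescript
// The sanitizer's rules applied to a hostile-ish input (standalone sketch)
const raw = "Summarize {% inject %} this\u0000 report  ";
const cleaned = raw
  .replace(/[\x00-\x08\x0B\x0C\x0E-\x1F]/g, "") // strip control chars (keeps \t, \n, \r)
  .replace(/\{%.*?%\}/g, "")                    // strip template-injection markers
  .trim();
console.log(cleaned); // → "Summarize  this report"
```

Note that tabs, newlines, and carriage returns survive on purpose: the character class skips `\x09`, `\x0A`, and `\x0D`.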
## Layer 2: Tool Call Validation
This is the big one. When your agent decides to call a tool, you need to validate the call before it executes. Not after. After is too late.
```typescript
interface ToolGuardrail {
  validate(call: ToolCall, context: AgentContext): GuardrailResult;
}

class ToolCallValidator {
  private guardrails: ToolGuardrail[] = [];

  register(guardrail: ToolGuardrail) {
    this.guardrails.push(guardrail);
  }

  async validate(call: ToolCall, context: AgentContext): Promise<void> {
    for (const guardrail of this.guardrails) {
      const result = guardrail.validate(call, context);
      if (result.blocked) {
        throw new ToolCallBlockedError(
          call.toolName,
          result.reason,
          result.suggestion
        );
      }
      if (result.modified) {
        call.args = result.modifiedArgs;
      }
    }
  }
}
```
Notice the `modified` path. Sometimes you don't want to block a call entirely. You want to constrain it. The agent wants to query a database? Fine, but add a `LIMIT 100` to that query. The agent wants to write a file? Fine, but only in the designated output directory.
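The `LIMIT 100` idea can be sketched as a modifying guardrail. The `ToolCall` and `GuardrailResult` shapes below mirror the interfaces used in this post, and `RowLimitGuardrail` is an illustrative name, not a library class:

```typescript
// Minimal shapes mirroring the interfaces used in this post (assumed for this sketch)
interface ToolCall {
  toolName: string;
  args: Record<string, any>;
}
interface GuardrailResult {
  blocked: boolean;
  reason?: string;
  suggestion?: string;
  modified?: boolean;
  modifiedArgs?: Record<string, any>;
}

// Constrain rather than block: clamp unbounded queries to a row limit
class RowLimitGuardrail {
  constructor(private maxRows = 100) {}

  validate(call: ToolCall): GuardrailResult {
    if (call.toolName !== "sql_query") return { blocked: false };
    const query: string = call.args.query;
    // Query already bounded? Leave it alone.
    if (/\bLIMIT\s+\d+\b/i.test(query)) return { blocked: false };
    return {
      blocked: false,
      modified: true,
      modifiedArgs: { ...call.args, query: `${query.trim()} LIMIT ${this.maxRows}` },
    };
  }
}

const result = new RowLimitGuardrail(100).validate({
  toolName: "sql_query",
  args: { query: "SELECT * FROM users" },
});
console.log(result.modifiedArgs?.query); // → "SELECT * FROM users LIMIT 100"
```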
Here are guardrails that actually matter in production:
```typescript
// Prevent destructive database operations
class SQLGuardrail implements ToolGuardrail {
  private blocked = /\b(DROP|DELETE|TRUNCATE|ALTER|UPDATE|INSERT|GRANT)\b/i;

  validate(call: ToolCall): GuardrailResult {
    if (call.toolName !== "sql_query") return { blocked: false };
    if (this.blocked.test(call.args.query)) {
      return {
        blocked: true,
        reason: "Destructive SQL operation detected",
        suggestion: "Use SELECT for read operations only",
      };
    }
    return { blocked: false };
  }
}
// Prevent file system operations outside the sandbox
class FileSystemGuardrail implements ToolGuardrail {
  constructor(private allowedPaths: string[]) {}

  validate(call: ToolCall): GuardrailResult {
    if (!["read_file", "write_file"].includes(call.toolName)) {
      return { blocked: false };
    }
    const targetPath = path.resolve(call.args.path);
    // Compare against the resolved root plus a separator so that
    // "/tmp/out" doesn't accidentally allow "/tmp/output-evil".
    const allowed = this.allowedPaths.some(p => {
      const root = path.resolve(p);
      return targetPath === root || targetPath.startsWith(root + path.sep);
    });
    if (!allowed) {
      return {
        blocked: true,
        reason: `Path ${targetPath} is outside allowed directories`,
        suggestion: `Allowed paths: ${this.allowedPaths.join(", ")}`,
      };
    }
    return { blocked: false };
  }
}
```
It is worth reading about [prompt injection attacks](/blog/prompt-injection-attacks-ai-agents) alongside this.
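The path check deserves a closer look: `path.resolve` collapses `..` segments before the comparison, which is what defeats traversal tricks. A standalone sketch of the logic (POSIX paths assumed):

```typescript
import * as path from "node:path";

// Is `target` inside one of the allowed sandbox roots?
function isAllowed(target: string, allowedPaths: string[]): boolean {
  const resolved = path.resolve(target); // collapses ".." before checking
  return allowedPaths.some(p => {
    const root = path.resolve(p);
    return resolved === root || resolved.startsWith(root + path.sep);
  });
}

console.log(isAllowed("/tmp/agent/out.txt", ["/tmp/agent"]));       // → true
console.log(isAllowed("/tmp/agent/../etc/passwd", ["/tmp/agent"])); // → false
```

Checking the raw string instead of the resolved one would pass the second example, since it starts with `/tmp/agent` textually.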
## Layer 3: Output Filtering
Your agent's final output goes to users. Filter it. Always.
```typescript
class OutputFilter {
  private filters: OutputFilterRule[] = [
    new PIIFilter(),        // emails, phone numbers, SSNs
    new CredentialFilter(), // API keys, passwords, tokens
    new ConfidenceFilter(), // flag low-confidence claims
    new ToneFilter(),       // catch inappropriate language
  ];

  async filter(output: AgentOutput): Promise<FilteredOutput> {
    let filtered = output;
    const appliedFilters: string[] = [];
    for (const rule of this.filters) {
      const result = await rule.apply(filtered);
      if (result.modified) {
        filtered = result.output;
        appliedFilters.push(rule.name);
      }
    }
    return {
      output: filtered,
      filtersApplied: appliedFilters,
      originalHash: hash(output), // for audit trail
    };
  }
}
```
The PII filter is non-negotiable. If your agent has access to a database with customer data, it WILL eventually include a real email address or phone number in its output. Regex isn't perfect for PII detection, but it catches the obvious stuff:
```typescript
class PIIFilter implements OutputFilterRule {
  name = "pii";

  private patterns = [
    { type: "email", regex: /\b[\w.-]+@[\w.-]+\.\w{2,}\b/g },
    { type: "phone", regex: /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g },
    { type: "ssn", regex: /\b\d{3}-\d{2}-\d{4}\b/g },
  ];

  apply(output: AgentOutput): FilterResult {
    let text = output.text;
    let modified = false;
    for (const { type, regex } of this.patterns) {
      // Replace first, then compare. Calling .test() on a /g regex
      // advances its lastIndex, which makes reusing these instance-level
      // patterns across calls error-prone.
      const next = text.replace(regex, `[REDACTED_${type.toUpperCase()}]`);
      if (next !== text) {
        text = next;
        modified = true;
      }
    }
    return { output: { ...output, text }, modified };
  }
}
```
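Those three patterns are easy to sanity-check in isolation. A standalone sketch applying the same regexes (sample strings are illustrative):

```typescript
// Standalone redaction using the same patterns as PIIFilter above
function redactPII(text: string): string {
  return text
    .replace(/\b[\w.-]+@[\w.-]+\.\w{2,}\b/g, "[REDACTED_EMAIL]")
    .replace(/\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g, "[REDACTED_PHONE]")
    .replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[REDACTED_SSN]");
}

console.log(redactPII("Contact jane@example.com or call 555-123-4567"));
// → "Contact [REDACTED_EMAIL] or call [REDACTED_PHONE]"
console.log(redactPII("SSN on file: 123-45-6789"));
// → "SSN on file: [REDACTED_SSN]"
```

Note the phone pattern (3-3-4 digits) and SSN pattern (3-2-4 digits) don't overlap, so the replacement order here doesn't change the labels.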
## Layer 4: Circuit Breakers
This is the kill switch. When things go wrong, they need to stop going wrong fast.
A circuit breaker has three states: closed (normal operation), open (everything blocked), and half-open (testing if things are working again). The related post on [human oversight loops](/blog/human-in-the-loop-agents) goes further on this point.
```typescript
class CircuitBreaker {
  private state: "closed" | "open" | "half-open" = "closed";
  private failureCount = 0;
  private lastFailureTime = 0;
  private successCount = 0;

  constructor(
    private config: {
      failureThreshold: number; // failures before opening
      resetTimeoutMs: number;   // how long to stay open
      halfOpenMax: number;      // test requests in half-open
    }
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.lastFailureTime > this.config.resetTimeoutMs) {
        this.state = "half-open";
        this.successCount = 0;
      } else {
        throw new CircuitOpenError("Circuit breaker is open");
      }
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    if (this.state === "half-open") {
      this.successCount++;
      if (this.successCount >= this.config.halfOpenMax) {
        this.state = "closed";
        this.failureCount = 0;
      }
    } else {
      this.failureCount = 0;
    }
  }

  private onFailure() {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    if (this.failureCount >= this.config.failureThreshold) {
      this.state = "open";
    }
  }
}
```
Use circuit breakers around every external dependency. LLM API down? Circuit opens. Tool returning errors? Circuit opens. Database unreachable? Circuit opens. The agent gets a clear signal that something is broken instead of hammering a failing service.
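To see the fail-fast behavior concretely, here is a compressed breaker with the same closed/open semantics as the class above (half-open handling dropped for brevity) against a dependency that always fails; `MiniBreaker` is an illustrative name:

```typescript
// A compressed breaker, just to watch the fail-fast transition
// (same closed/open semantics as the class above, minus half-open)
class MiniBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private threshold: number, private resetMs: number) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.failures >= this.threshold &&
        Date.now() - this.openedAt < this.resetMs) {
      throw new Error("circuit open: failing fast");
    }
    try {
      const result = await fn();
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      this.openedAt = Date.now();
      throw err;
    }
  }
}

// A dependency that always times out: after 3 failures the breaker
// opens, and the remaining attempts never reach the dead service.
const breaker = new MiniBreaker(3, 30_000);
let reachedService = 0;
for (let i = 0; i < 5; i++) {
  try {
    await breaker.execute(async () => {
      reachedService++;
      throw new Error("upstream timeout");
    });
  } catch { /* expected */ }
}
console.log(reachedService); // → 3: the last two attempts failed fast
```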
## Budget Enforcement: The Guardrail Nobody Builds
Cost is a guardrail. Maybe the most important one.
```typescript
class BudgetEnforcer {
  private spent = 0;

  constructor(private maxCents: number) {}

  async checkBudget(estimatedCost: number): Promise<void> {
    if (this.spent + estimatedCost > this.maxCents) {
      throw new BudgetExceededError(
        `Budget exhausted: spent ${this.spent}c of ${this.maxCents}c, ` +
        `estimated next call: ${estimatedCost}c`
      );
    }
  }

  recordSpend(actualCost: number) {
    this.spent += actualCost;
  }
}
```
Wrap every LLM call with the budget enforcer. Wrap every paid API call. Make the budget a first-class parameter of every agent invocation. "Do this task, but don't spend more than $0.50." That's a constraint the agent should respect, and the guardrail should enforce.
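The check-then-record rhythm looks like this in practice. The enforcer below mirrors `BudgetEnforcer` above, and the per-call costs are illustrative numbers, not real pricing:

```typescript
// Budget flow sketch: check before each call, record actuals after.
// Mirrors BudgetEnforcer above; all numbers are illustrative.
class Budget {
  private spent = 0;
  constructor(private maxCents: number) {}
  check(estimated: number) {
    if (this.spent + estimated > this.maxCents) {
      throw new Error(`budget exhausted: ${this.spent}c of ${this.maxCents}c spent`);
    }
  }
  record(actual: number) { this.spent += actual; }
}

const budget = new Budget(50); // "don't spend more than $0.50"
let completed = 0;
try {
  for (let step = 0; step < 10; step++) {
    budget.check(12);  // estimate ~12c per LLM call before making it
    budget.record(12); // in reality, record the provider's actual charge
    completed++;
  }
} catch { /* the fifth call would exceed 50c, so the loop stops */ }
console.log(completed); // → 4
```

The estimate-before, record-actual split matters: estimates gate the call, but only real charges accumulate, so a cheaper-than-expected call leaves budget for more work.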
## Composing Guardrails
The real power comes from composing these layers into a pipeline:
```typescript
class GuardrailPipeline {
  constructor(
    private toolValidator: ToolCallValidator,
    private budget: BudgetEnforcer,
    private circuitBreaker: CircuitBreaker,
    private outputFilter: OutputFilter
  ) {}

  async process(input: unknown, agent: Agent): Promise<FilteredOutput> {
    // Layer 1: Validate input
    const validated = validateInput(input);
    // Layer 2: Run agent with tool validation
    const rawOutput = await agent.run(validated, {
      onToolCall: (call, ctx) => this.toolValidator.validate(call, ctx),
      onBudgetCheck: (est) => this.budget.checkBudget(est),
      circuitBreaker: this.circuitBreaker,
    });
    // Layer 3: Filter output
    return this.outputFilter.filter(rawOutput);
  }
}
```
Every layer is independent. You can test them in isolation. You can add new guardrails without touching existing ones. You can make them configurable per user, per task, per environment.
## The Guardrail Mindset
Guardrails aren't constraints on your agent's capabilities. They're what makes it possible to deploy those capabilities safely. Every guardrail you add is a risk you've mitigated, a failure mode you've handled, a 3 AM page you won't receive.
Build them before you need them. Because by the time you need them, it's already too late.