Hallucination Detection: Catching When Your Agent Makes Things Up
By Diesel
Tags: security, hallucination, quality, detection
A financial services agent confidently told a customer that their account balance was $47,832.16. The actual balance was $4,783.21. The customer made financial decisions based on the wrong number. The agent had hallucinated the extra digit with the same confident tone it used for everything else.
This is the hallucination problem. Not the dramatic, obvious kind where the model invents fictional people or cites nonexistent papers (though that happens too). The subtle kind, where the agent states something plausible, specific, and completely wrong. And nobody catches it because it sounds authoritative.
For chatbots, hallucination is embarrassing. For agents that take actions based on their outputs, hallucination is dangerous. An agent that hallucinates a database query result and then acts on it isn't just wrong. It's confidently wrong with consequences.
## Why Agents Hallucinate
Let's be precise about what's happening, because "hallucination" is a sloppy term that covers several distinct failure modes.
**Confabulation.** The model generates information that isn't grounded in any source. It fills gaps in its knowledge with plausible-sounding fabrications. This is the classic hallucination: citing a paper that doesn't exist, inventing a statistic, attributing a quote to the wrong person.
**Misattribution.** The model has the right information but connects it to the wrong source, entity, or context. It remembers a fact about Company A and attributes it to Company B. The data is real but the association is wrong.
**Extrapolation.** The model takes a partial piece of information and extends it beyond what the source supports. A document says revenue grew 15% in Q1. The model states revenue grew 15% annually. Small difference, potentially large consequences.
**Temporal confusion.** The model applies information from one time period to another. Training data from 2024 being presented as current fact in 2026. An API response from this morning being referenced as if it were from an hour ago.
**Numeric instability.** Models are notoriously bad with numbers. They transpose digits, miscalculate, and confuse orders of magnitude. The financial example I opened with follows a real pattern: a dropped or added digit that shifts the order of magnitude. Models handle text far better than they handle numerical precision. It's worth reading about [guardrails to intercept bad outputs](/blog/agent-guardrails-production) alongside this.
For agents specifically, there's an additional failure mode:
**Tool output hallucination.** The agent claims a tool returned a result that it didn't. Instead of admitting a tool call failed or returned unexpected data, the agent fabricates a plausible result and proceeds as if it's real. This is especially dangerous because the agent's reasoning chain looks valid. The fabricated data point just wasn't real.
## Detection Strategies
No single technique catches all hallucinations. You need layers, just like every other security problem.
### Grounding Verification
The most reliable detection method: compare agent outputs against the source data.
If the agent used a retrieval system, check that every claimed fact appears in the retrieved documents. If it called a tool, verify that the tool response contains the information the agent references. If it cites a number, trace that number back to its origin.
This is computationally expensive. You're essentially running a fact-checking pipeline on every agent output. But for high-stakes applications, it's the only approach that catches subtle hallucinations reliably.
Implementation pattern: extract factual claims from the agent's output, match each claim against the source data, flag claims that can't be grounded. Use a separate, cheaper model for the extraction and matching. The verification model doesn't need to be as capable as the primary model. It just needs to compare statements against documents.
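The pattern above can be sketched in a few lines. This is a toy version: claim extraction is a naive sentence split and grounding is token overlap, where a production pipeline would use a cheaper LLM for extraction and an embedding model for matching. All function names and the threshold are illustrative.

```python
import re

def extract_claims(output: str) -> list[str]:
    """Naive claim extraction: split output into sentences.
    A real pipeline would use a cheap LLM for this step."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", output) if s.strip()]

def is_grounded(claim: str, sources: list[str], threshold: float = 0.6) -> bool:
    """Crude grounding check: token overlap between the claim and any
    source document. Stand-in for semantic matching with embeddings."""
    claim_tokens = set(claim.lower().split())
    if not claim_tokens:
        return True
    for doc in sources:
        doc_tokens = set(doc.lower().split())
        if len(claim_tokens & doc_tokens) / len(claim_tokens) >= threshold:
            return True
    return False

def ungrounded_claims(output: str, sources: list[str]) -> list[str]:
    """Return the claims in an agent output that no source supports."""
    return [c for c in extract_claims(output) if not is_grounded(c, sources)]
```

The key design point survives the simplification: verification is a separate, cheap pass over the output, and anything it can't trace back to a source gets flagged rather than trusted.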
### Self-Consistency Checking
Ask the model the same question multiple times (with temperature > 0) and compare the answers. Consistent answers across runs are more likely to be grounded. Inconsistent answers suggest the model is generating rather than recalling.
This works surprisingly well for factual claims. If three out of five runs say the answer is X and two say Y, X is more likely correct but warrants verification. If all five runs give different answers, the model is clearly confabulating.
The cost is latency and compute: you're running multiple inferences instead of one. For critical decisions, it's worth it. For low-stakes outputs, it's probably not.
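The voting step, once you have the sampled answers in hand, is simple. The sketch below assumes you've already called the model N times at temperature > 0 (the model call itself is omitted); the agreement threshold is a tuning knob, not a standard value.

```python
from collections import Counter

def consistency_check(answers: list[str], agreement_threshold: float = 0.6):
    """Majority-vote over N sampled answers. Returns the top answer,
    the fraction of runs that agreed, and whether that fraction
    clears the trust threshold."""
    counts = Counter(a.strip().lower() for a in answers)
    top_answer, top_count = counts.most_common(1)[0]
    agreement = top_count / len(answers)
    return top_answer, agreement, agreement >= agreement_threshold
```

In practice exact string matching is too strict for free-form answers; you'd normalise or semantically cluster them first. But the decision logic is the same: strong agreement means proceed, weak agreement means verify or abstain.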
### Confidence Calibration
Some models expose logprob information for their outputs. Tokens generated with high confidence (high logprobs) are more likely to be correct than tokens generated with low confidence.
This is imperfect. Models can be confidently wrong. But low-confidence outputs are a useful signal for "maybe verify this one." Use logprobs as a triage mechanism, not a definitive quality measure.
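A triage pass over logprobs might look like this. It assumes you already have per-token logprobs from your provider's API (most expose them as an opt-in field); the probability floor is an illustrative tuning knob.

```python
import math

def flag_low_confidence(tokens: list[str], logprobs: list[float],
                        prob_floor: float = 0.5) -> list[str]:
    """Flag tokens whose generation probability falls below a floor.
    Flagged tokens aren't necessarily wrong; they're candidates for
    the more expensive verification passes described above."""
    return [tok for tok, lp in zip(tokens, logprobs)
            if math.exp(lp) < prob_floor]
```

Notice in a balance-reporting scenario that the flagged tokens tend to be the specific digits, which is exactly where verification effort should concentrate.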
### Cross-Reference Checking
For outputs that reference external facts, verify against authoritative sources. If the agent says a regulation requires X, check the actual regulation. If it quotes a price, check the actual price feed. If it cites a date, verify the date. For a deeper look, see [grounding quality via RAG evaluation](/blog/rag-evaluation-retrieval-quality).
This requires building a verification pipeline specific to your domain. Generic fact-checking doesn't work well because the agent might be referencing internal data that no external source can verify. You need domain-specific verification sources.
### Semantic Drift Detection
Monitor the agent's outputs over time for semantic drift. If the agent consistently provides one type of answer for a given query pattern and then suddenly shifts, investigate. The shift might be legitimate (updated data) or it might indicate a hallucination pattern.
This requires a baseline of expected behaviour, which takes time to build. But once you have it, it's a powerful anomaly detection mechanism.
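As a toy sketch of the scoring step: given a baseline of known-good answers for a query pattern, measure how far a new answer sits from the closest one. Real systems would use embedding distance; token-set Jaccard is an assumption made here to keep the example self-contained.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set similarity between two answers (1.0 = identical sets)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def drift_score(new_answer: str, baseline_answers: list[str]) -> float:
    """Distance from the closest baseline answer. Near 0 means the
    answer matches established behaviour; high values warrant review."""
    return 1.0 - max(jaccard(new_answer, b) for b in baseline_answers)
```

The alerting threshold on top of this score is where the judgment lives: too low and every data refresh pages someone, too high and a hallucination pattern slides through.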
## Architecture for Hallucination-Resistant Agents
Detection is important. Prevention is better.
### Retrieval-First Architecture
Force the agent to retrieve information before answering. Don't let it answer from parametric memory alone. Every factual claim should be backed by a retrieved document, a tool response, or an explicit source.
This doesn't eliminate hallucination (the model can still misinterpret retrieved data) but it dramatically reduces confabulation. The model generates from source material rather than from its training data.
### Structured Output Enforcement
Where possible, force the agent to output structured data rather than free text. A JSON object with specific fields is easier to validate than a prose paragraph. You can check that each field contains valid data, that numbers are in expected ranges, that referenced entities actually exist.
This works for operational outputs (API calls, database queries, configuration changes) better than for conversational outputs. But even for conversations, you can require the agent to separate factual claims into a structured section that's validated separately from the conversational text.
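A minimal sketch of the validation side, assuming a hypothetical balance-report schema (the field names and allowed currencies are invented for illustration). Real systems would likely reach for `jsonschema` or Pydantic rather than a hand-rolled check, but the shape is the same:

```python
# Illustrative schema: field name -> expected Python type.
SCHEMA = {
    "account_id": str,
    "balance": (int, float),
    "currency": str,
}

ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}

def validate_output(payload: dict) -> list[str]:
    """Check a structured agent output against the expected schema.
    Returns a list of validation errors; empty means it passed."""
    errors = []
    for field, expected_type in SCHEMA.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            errors.append(f"wrong type for {field}")
    currency = payload.get("currency")
    if isinstance(currency, str) and currency not in ALLOWED_CURRENCIES:
        errors.append(f"unknown currency: {currency}")
    return errors
```

The point is that a free-text "your balance is $47,832.16" offers nothing to validate, while a typed `balance` field can be range-checked, type-checked, and reconciled before anything acts on it.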
### Citation Requirements
Require the agent to cite its sources for every factual claim. Not just "based on the retrieved documents" but specifically which document, which section, which data point. Then verify the citations.
Models that are required to cite sources hallucinate less, because the citation requirement forces them to ground their outputs. They can't just generate a plausible fact. They have to point to where it came from. When they can't find a source, they're more likely to say "I don't know" instead of fabricating one.
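The verification half of this is mechanical once citations are structured. The sketch below assumes each claim carries a `source_id` and a supporting `quote`, and that you hold the cited documents; those field names are illustrative, and exact substring matching is a deliberate simplification (real checks tolerate whitespace and paraphrase).

```python
def verify_citations(claims: list[dict], documents: dict[str, str]) -> list[dict]:
    """Confirm each claim's quoted support actually appears in the
    document it cites. Returns the claims that fail verification."""
    failures = []
    for claim in claims:
        doc = documents.get(claim["source_id"], "")
        if claim["quote"] not in doc:
            failures.append(claim)
    return failures
```

This catches the extrapolation failure mode directly: a claim citing "grew 15% annually" against a document that only says "grew 15% in Q1" fails the check, because the quoted support was never in the source.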
### Numeric Guardrails
For any operation involving numbers, implement range checks, consistency checks, and format validation. If the agent reports a financial balance, verify it against the database. If it calculates a percentage, verify the arithmetic. If it cites a quantity, check it's in a plausible range. This connects directly to [red-teaming for hallucination](/blog/red-teaming-ai-agents).
Numbers are where hallucination is most dangerous and most preventable. A simple validation layer catches the majority of numeric hallucinations.
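A sketch of such a layer, using the opening balance example. The tolerance and the order-of-magnitude test are illustrative choices; the second check specifically targets the dropped-or-added-digit pattern.

```python
import math

def numeric_guardrail(reported: float, actual: float,
                      tolerance: float = 0.005) -> list[str]:
    """Validate a number the agent reported against the system of
    record. Flags any mismatch, and separately flags mismatches big
    enough to be an order-of-magnitude slip (a dropped/added digit)."""
    errors = []
    if abs(reported - actual) > tolerance:
        errors.append(f"mismatch: reported {reported}, actual {actual}")
        if reported > 0 and actual > 0 and \
                abs(math.log10(reported) - math.log10(actual)) >= 0.5:
            errors.append("order-of-magnitude slip (possible dropped/added digit)")
    return errors
```

Run against the opening example, $47,832.16 reported versus $4,783.21 actual, this flags both the mismatch and the magnitude slip before the number ever reaches a customer.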
## The Human Factor
Here's the uncomfortable part. Humans are terrible at catching AI hallucinations. Studies consistently show that people trust AI-generated text more than they should, especially when it's well-written and confident. The more fluent the output, the less critically people evaluate it.
Human-in-the-loop only works if the human is trained to be skeptical, given the tools to verify, and incentivised to actually check. A human who rubber-stamps every AI output isn't oversight. It's theatre.
If your hallucination mitigation strategy is "a human reviews everything," make sure that human has:
- Access to the source data the agent used
- Training on common hallucination patterns
- Enough time to actually verify (not a queue of 200 reviews per hour)
- A process for escalation when they're unsure
## Measuring Hallucination Rates
You can't improve what you don't measure. Establish a hallucination rate metric for your agent system.
Sample agent outputs regularly. Manually verify a subset against source data. Categorise hallucinations by type (confabulation, misattribution, extrapolation, numeric). Track rates over time. Set targets for acceptable rates by category.
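The bookkeeping for this is trivial, which is part of the argument for doing it. A sketch, assuming each manually reviewed sample records either `None` (no hallucination) or one of the category labels from earlier in this post:

```python
from collections import Counter

def hallucination_rates(reviews: list[dict]) -> dict[str, float]:
    """Per-category hallucination rates from manually reviewed samples.
    Each review dict carries a "category" key: None for clean outputs,
    or "confabulation" / "misattribution" / "extrapolation" / "numeric"."""
    total = len(reviews)
    counts = Counter(r["category"] for r in reviews if r["category"])
    return {cat: n / total for cat, n in counts.items()}
```

Track these per category over time and per release, and regressions in a specific failure mode (say, numeric) become visible instead of averaging away into an overall error rate.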
An honest hallucination rate is more valuable than a perfect demo. Know where your agent fails, and you can improve it. Pretend it doesn't fail, and you'll learn the truth from your users. They'll be less forgiving than your test suite.