## The Single-Shot Problem
Traditional RAG is a one-shot game. User asks a question. System converts it to a query. System retrieves top-k chunks. System generates an answer from those chunks.
One shot. No feedback loop. No self-correction.
This works for simple factual questions. "What's our return policy?" Retrieve, read, answer. Done.
It falls apart for anything complex. "Compare our Q3 2025 performance across all regions and identify which products underperformed relative to their targets." That requires multiple retrievals across different document types, cross-referencing results, identifying gaps, and synthesizing information from potentially dozens of chunks.
No single query captures all of that. And if your first retrieval misses something, there's no mechanism to try again.
Agentic RAG solves this by putting an agent in the retrieval loop.
## What Agentic RAG Actually Means
Instead of a fixed retrieve-then-generate pipeline, you have an agent that can:
1. **Analyze** the user's question and decompose it into sub-questions
2. **Plan** which searches to run and in what order
3. **Retrieve** using different strategies (semantic search, keyword search, metadata filters)
4. **Evaluate** whether the retrieved context is sufficient
5. **Reformulate** queries when results are inadequate
6. **Synthesize** across multiple retrieval rounds

It is worth reading about [hybrid search backends](/blog/hybrid-search-rag-production) alongside this.
The agent has retrieval as a tool, not as a fixed pipeline step. It decides when to search, what to search for, and whether to search again.
## The Architecture
```python
import json

from langchain.agents import AgentExecutor, create_react_agent
from langchain.tools import Tool

# Define retrieval tools the agent can use
tools = [
    Tool(
        name="semantic_search",
        description="Search the knowledge base using natural language. "
                    "Best for conceptual questions and finding related content.",
        func=lambda q: vector_store.similarity_search(q, k=5),
    ),
    Tool(
        name="keyword_search",
        description="Search using exact keywords. Best for finding specific "
                    "documents, codes, names, or identifiers.",
        func=lambda q: bm25_index.search(q, k=5),
    ),
    Tool(
        name="metadata_filter",
        description="Filter documents by metadata. Accepts JSON with fields: "
                    "department, document_type, date_range, author.",
        func=lambda q: metadata_store.filter(json.loads(q)),
    ),
    Tool(
        name="lookup_document",
        description="Retrieve a specific document by ID. Use when you know "
                    "which document you need.",
        func=lambda doc_id: document_store.get(doc_id),
    ),
]

agent = AgentExecutor(
    agent=create_react_agent(llm, tools, prompt),
    tools=tools,
    max_iterations=10,  # prevent infinite loops
    verbose=True,
)
```
The agent doesn't just have one "search" tool. It has multiple retrieval strategies and can choose the right one for each sub-question.
## Query Decomposition: The First Win
The simplest form of agentic RAG is query decomposition. Break a complex question into simpler ones, retrieve for each, combine.
```python
DECOMPOSE_PROMPT = """
Given this complex question, break it down into simpler sub-questions
that can each be answered independently from a knowledge base.

Question: {question}

Sub-questions (one per line):
"""

async def agentic_retrieve(question: str, retriever, llm):
    # Step 1: Decompose
    sub_questions = await llm.generate(
        DECOMPOSE_PROMPT.format(question=question)
    )

    # Step 2: Retrieve for each sub-question
    all_contexts = []
    for sub_q in sub_questions.split("\n"):
        if not sub_q.strip():
            continue  # skip blank lines in the model's output
        results = await retriever.search(sub_q.strip(), top_k=3)
        all_contexts.extend(results)

    # Step 3: Deduplicate and rank
    unique_contexts = deduplicate(all_contexts)

    # Step 4: Generate answer from combined context
    return await llm.generate(
        ANSWER_PROMPT.format(
            question=question,
            context=format_contexts(unique_contexts),
        )
    )
```
"Compare Q3 performance across regions" becomes: "What was Q3 2025 performance in EMEA?", "What was Q3 2025 performance in North America?", "What was Q3 2025 performance in APAC?", "What were the Q3 2025 targets per region?"
Each sub-question retrieves relevant chunks. The combined context gives the LLM what it needs to actually compare.
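The `deduplicate` helper used above is left undefined. A minimal sketch, assuming retrieved chunks are plain strings, that drops duplicates while preserving retrieval order:

```python
def deduplicate(contexts: list[str]) -> list[str]:
    """Drop duplicate chunks while preserving first-seen order."""
    seen: set[str] = set()
    unique = []
    for chunk in contexts:
        key = chunk.strip().lower()  # normalize whitespace and case before comparing
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique
```

A production version would typically hash chunk IDs or use near-duplicate detection rather than exact string matching, since overlapping chunks rarely match verbatim.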
## Self-Evaluation: Knowing When Retrieval Failed
The real power of agentic RAG is the feedback loop. The agent evaluates its own retrieval results.
```python
EVALUATION_PROMPT = """
Given the original question and the retrieved context, evaluate:
1. Does the context contain enough information to answer the question?
2. What information is missing?
3. What additional searches would help?

Question: {question}
Retrieved Context: {context}

Evaluation:
"""

async def retrieval_with_evaluation(question, retriever, llm, max_rounds=3):
    all_context = []
    for round_num in range(max_rounds):
        if round_num == 0:
            query = question
        else:
            # Use evaluation to generate a better query
            eval_result = await llm.generate(
                EVALUATION_PROMPT.format(
                    question=question,
                    context=format_contexts(all_context),
                )
            )
            if "sufficient" in eval_result.lower():
                break
            query = extract_follow_up_query(eval_result)
        results = await retriever.search(query, top_k=5)
        all_context.extend(results)
    return deduplicate(all_context)
```
Round 1: "What's our data retention policy for EU customers?" Retrieves general data retention policy. Agent evaluates: "I found the general policy but nothing EU-specific. Need to search for GDPR-related retention requirements."
Round 2: Searches for "GDPR data retention requirements." Finds the EU-specific addendum. Agent evaluates: "Now I have both general policy and EU-specific requirements. Sufficient."
Without self-evaluation, the system would have answered with just the general policy, missing the EU-specific nuance that was the whole point of the question.
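The loop above calls an `extract_follow_up_query` helper that is not shown. One way to sketch it, assuming the evaluation prompt is extended to ask the model to end with a line like `FOLLOW-UP: <query>` (that marker convention is an assumption, not part of the prompt above):

```python
def extract_follow_up_query(evaluation: str, marker: str = "FOLLOW-UP:") -> str:
    """Pull the next search query out of the model's evaluation text.

    Looks for a line beginning with the marker; falls back to the last
    non-empty line when the model ignores the format.
    """
    lines = [ln.strip() for ln in evaluation.splitlines() if ln.strip()]
    for ln in reversed(lines):
        if ln.upper().startswith(marker):
            return ln[len(marker):].strip()
    return lines[-1] if lines else ""
```

Parsing free-form model output is fragile; structured output (JSON mode or tool calling) is the more robust choice when the LLM provider supports it.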
## Tool Selection: Right Strategy for Each Sub-Problem
Different sub-questions need different retrieval approaches.
```python
ROUTING_PROMPT = """
For each sub-question, select the best retrieval strategy:
1. semantic_search - for conceptual questions, "how does X work?"
2. keyword_search - for specific terms, codes, names, identifiers
3. metadata_filter - for filtering by date, department, type
4. lookup_document - when you know the specific document

Sub-question: {sub_question}

Best strategy and query:
"""
```

The related post on [agent memory patterns](/blog/agent-memory-patterns) goes further on this point.
"Who approved the budget change in Q3?" doesn't need semantic search. It needs a metadata filter on document type (approval/decision), date range (Q3), and maybe a keyword search for "budget."
"What's our general approach to cost optimization?" needs semantic search, not keywords.
The agent makes this decision per sub-question, which is exactly what a human researcher would do.
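Wiring that decision up can be sketched as a small dispatcher. The compressed prompt template, the `llm.generate` interface, and the `'strategy: query'` reply format are all assumptions of this sketch:

```python
ROUTING_TEMPLATE = (
    "Select the best retrieval strategy for this sub-question and reply "
    "as 'strategy: query'.\nSub-question: {sub_question}"
)

async def route_and_retrieve(sub_question: str, llm, tools: dict):
    """Ask the model for a strategy, then dispatch to the matching tool.

    `tools` maps strategy names (e.g. 'semantic_search') to async
    search callables.
    """
    reply = await llm.generate(ROUTING_TEMPLATE.format(sub_question=sub_question))
    strategy, _, query = reply.partition(":")
    strategy = strategy.strip().lower()
    if strategy not in tools:
        # Fall back to semantic search when the reply doesn't parse
        strategy, query = "semantic_search", sub_question
    return await tools[strategy](query.strip())
```

The fallback branch matters: routing replies fail to parse often enough in practice that every dispatcher needs a safe default.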
## The Routing Pattern
For organizations with multiple knowledge bases (common in enterprise), the agent also decides WHERE to search.
```python
tools = [
    Tool(name="search_policies", func=policy_store.search,
         description="Company policies, SOPs, compliance docs"),
    Tool(name="search_technical", func=tech_store.search,
         description="Technical documentation, architecture, API docs"),
    Tool(name="search_financial", func=finance_store.search,
         description="Financial reports, budgets, forecasts"),
    Tool(name="search_hr", func=hr_store.search,
         description="HR policies, benefits, org structure"),
    Tool(name="search_projects", func=project_store.search,
         description="Project plans, status updates, meeting notes"),
]
```
"Can an intern on Project Phoenix access the production database?" requires searching HR (intern policies), technical (database access control), and project docs (Project Phoenix team structure). The agent routes to all three.
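Once the agent has picked its targets, the searches can run concurrently rather than sequentially. A minimal sketch, assuming each store exposes an async search callable (that interface is an assumption):

```python
import asyncio

async def search_stores(question: str, stores: dict, names: list[str],
                        top_k: int = 5) -> list[tuple[str, str]]:
    """Fan a question out to the selected knowledge bases concurrently.

    Returns (store_name, hit) pairs so downstream synthesis can cite
    which knowledge base each chunk came from.
    """
    async def search_one(name: str):
        hits = await stores[name](question, top_k)
        return [(name, hit) for hit in hits]

    # gather() preserves the order of the names list in its results
    grouped = await asyncio.gather(*(search_one(n) for n in names))
    return [item for group in grouped for item in group]
```

Tagging results with their source store also makes the final answer auditable: the user can see that the HR policy and the database ACL came from different systems.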
## Guardrails: Preventing Infinite Loops
Agentic RAG can go wrong. The agent can loop endlessly, reformulating queries that never find what it's looking for (because the information doesn't exist). It can run up LLM costs with excessive iterations. It can hallucinate that it needs more information when it already has enough.
Essential guardrails:
```python
from dataclasses import dataclass

@dataclass
class AgenticRAGConfig:
    max_iterations: int = 5                # hard stop on retrieval rounds
    max_tool_calls: int = 15               # total tool invocations
    timeout_seconds: int = 30              # wall-clock limit
    min_confidence_to_answer: float = 0.7  # below this, say "I don't know"
    cost_budget_usd: float = 0.50          # per-query cost cap
```
The "I don't know" path is critical. If after 5 rounds of retrieval the agent still can't find sufficient context, it should say so explicitly rather than generating a low-confidence answer.
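Caps in a config object do nothing unless something enforces them during the run. A minimal tracker sketch, with the field names assumed from the `AgenticRAGConfig` above:

```python
import time

class BudgetExceeded(Exception):
    """Raised when an agent run exceeds its configured limits."""

class GuardrailTracker:
    """Enforce tool-call, cost, and wall-clock caps during an agent run."""

    def __init__(self, config):
        self.config = config
        self.tool_calls = 0
        self.cost_usd = 0.0
        self.start = time.monotonic()

    def record_tool_call(self, cost_usd: float = 0.0) -> None:
        """Call once per tool invocation; raises when any cap is breached."""
        self.tool_calls += 1
        self.cost_usd += cost_usd
        if self.tool_calls > self.config.max_tool_calls:
            raise BudgetExceeded("tool-call cap reached")
        if self.cost_usd > self.config.cost_budget_usd:
            raise BudgetExceeded("cost budget exhausted")
        if time.monotonic() - self.start > self.config.timeout_seconds:
            raise BudgetExceeded("wall-clock timeout")
```

Catching `BudgetExceeded` at the top of the agent loop is a natural place to trigger the explicit "I don't know" response rather than forcing a low-confidence answer.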
## When to Use Agentic RAG
Not every query needs an agent. Simple factual lookups ("What's the office WiFi password?") don't benefit from query decomposition and multi-round retrieval. They just get slower and more expensive.
**Use agentic RAG for:** Complex multi-part questions. Comparative questions. Questions that span multiple document types or knowledge bases. Research-style queries where the user expects a synthesized answer. For a deeper look, see [reranking retrieved results](/blog/cross-encoder-reranking-rag).
**Use standard RAG for:** Simple factual lookups. Single-topic questions. High-volume, low-complexity query patterns.
A practical approach: route queries based on complexity. Simple queries go through the fast path (single retrieval, single generation). Complex queries go through the agentic path.
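That complexity router can start as a crude heuristic before graduating to an LLM classifier. A sketch (the cue list and word-count threshold are illustrative assumptions, not tuned values):

```python
def needs_agentic_path(question: str) -> bool:
    """Heuristic pre-filter: route multi-part questions to the agentic path.

    Surface cues catch many comparative and cross-cutting questions;
    real systems often back this up with a cheap LLM classifier.
    """
    multi_part_cues = (" and ", "compare", "versus", " vs ", "across",
                       "relative to", "difference between")
    q = question.lower()
    return len(q.split()) > 20 or any(cue in q for cue in multi_part_cues)
```

Misrouting is cheap in one direction and expensive in the other: a simple question sent down the agentic path just wastes some tokens, while a complex question sent down the fast path gets a shallow answer. Biasing the threshold toward the agentic path is usually the safer default.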
The best RAG systems aren't uniform. They adapt their retrieval strategy to the question. Just like a good researcher would.