RAG Evaluation: Measuring Retrieval Quality Before It Ruins Your Answers
By Diesel
## The Blame Game
Something goes wrong with a RAG answer. Who's at fault?
Most teams immediately blame the LLM. "It's hallucinating." "The model isn't smart enough." "We need GPT-5."
In my experience, 70% of the time the problem is retrieval, not generation. The LLM generated a bad answer because it was given bad context. Or no relevant context. Or relevant context buried under irrelevant noise.
But nobody knows this, because nobody measures retrieval quality independently of generation quality. (The same argument for separating evaluation stages applies one level up; see [agent-level evaluation metrics](/blog/agent-evaluation-metrics).)
If you're evaluating RAG as a black box (question in, answer out), you can't diagnose anything. You need to evaluate the retrieval step and the generation step separately. This article is about the retrieval side.
## The Metrics That Matter
### Retrieval Precision
Of the chunks returned, how many were actually relevant?
```python
def retrieval_precision(retrieved_chunks, relevant_chunks):
    """What fraction of retrieved chunks are relevant?"""
    if not retrieved_chunks:
        return 0.0
    relevant_retrieved = set(retrieved_chunks) & set(relevant_chunks)
    return len(relevant_retrieved) / len(retrieved_chunks)
```
Low precision means your retrieval is returning noise. The LLM has to wade through irrelevant chunks to find the answer, which costs tokens and increases hallucination risk.
### Retrieval Recall
Of all the relevant chunks that exist, how many did retrieval actually find?
```python
def retrieval_recall(retrieved_chunks, relevant_chunks):
    """What fraction of relevant chunks were retrieved?"""
    if not relevant_chunks:
        return 0.0
    relevant_retrieved = set(retrieved_chunks) & set(relevant_chunks)
    return len(relevant_retrieved) / len(relevant_chunks)
```
Low recall means relevant information exists in your knowledge base but retrieval isn't finding it. This is the worst failure mode because the user (and the system) don't know the answer was there all along.
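A toy example shows how the two metrics diverge (the chunk IDs are made up):

```python
retrieved = ["c1", "c2", "c3", "c4", "c5"]  # what the retriever returned
relevant = ["c2", "c5", "c9"]               # ground-truth relevant chunks

hits = set(retrieved) & set(relevant)       # {"c2", "c5"}
precision = len(hits) / len(retrieved)      # 2/5 = 0.4  -> lots of noise
recall = len(hits) / len(relevant)          # 2/3 ~ 0.67 -> one answer missed

print(precision, recall)
```

Here retrieval found two of the three relevant chunks (decent recall) but padded them with three irrelevant ones (poor precision).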
### Mean Reciprocal Rank (MRR)
Where does the first relevant result appear in the ranked list?
```python
def mean_reciprocal_rank(queries_results):
    """Average of 1/rank_of_first_relevant_result across queries."""
    reciprocal_ranks = []
    for retrieved, relevant in queries_results:
        for rank, chunk in enumerate(retrieved, start=1):
            if chunk in relevant:
                reciprocal_ranks.append(1.0 / rank)
                break
        else:
            # No relevant chunk was retrieved at all
            reciprocal_ranks.append(0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```
If MRR is high, users (and LLMs) see the right answer first. If MRR is low, the relevant chunk is there but buried at position 8 out of 10. Many LLMs pay less attention to later context items, so position matters.
### Normalized Discounted Cumulative Gain (nDCG)
The sophisticated version that accounts for graded relevance and position.
```python
import numpy as np
def ndcg_at_k(retrieved, relevance_scores, k=10):
    """
    nDCG accounts for both relevance grade and rank position.
    relevance_scores: dict of chunk_id -> relevance (0-3)
    """
    dcg = sum(
        relevance_scores.get(chunk.id, 0) / np.log2(i + 2)
        for i, chunk in enumerate(retrieved[:k])
    )
    ideal = sorted(relevance_scores.values(), reverse=True)[:k]
    idcg = sum(
        score / np.log2(i + 2)
        for i, score in enumerate(ideal)
    )
    return dcg / idcg if idcg > 0 else 0.0
```
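A hand-worked example makes the position discounting concrete (the grades are made up): the same three chunks score differently depending purely on their ordering.

```python
import numpy as np

# Graded relevance of the top-3 results, in retrieved order
good_order = [3, 2, 0]  # best chunk first
bad_order = [0, 2, 3]   # best chunk buried at rank 3

def dcg(grades):
    # position i (0-based) is discounted by log2(i + 2)
    return sum(g / np.log2(i + 2) for i, g in enumerate(grades))

ideal = dcg(sorted(good_order, reverse=True))
print(dcg(good_order) / ideal)  # 1.0 -> perfect ordering
print(dcg(bad_order) / ideal)   # ~0.65 -> same chunks, worse ordering
```

Same chunks, same grades, roughly a third of the score lost to ordering alone. That gap is what a reranker buys back.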
## Building an Evaluation Dataset
Metrics are useless without a ground-truth dataset. This is where most teams cut corners and regret it later. (Production traces, for example from [LangSmith](/blog/real-time-agent-monitoring-langsmith), are a good source of representative queries.)
### The Gold Standard: Human Annotation
Pick 100-200 representative queries. For each query, have a human annotate which chunks in your knowledge base are relevant (and how relevant, on a 0-3 scale).
```python
# evaluation_dataset.json
[
  {
    "query": "What is our data retention policy for EU customers?",
    "relevant_chunks": [
      {"chunk_id": "policy-42-chunk-3", "relevance": 3},
      {"chunk_id": "gdpr-addendum-chunk-7", "relevance": 3},
      {"chunk_id": "general-retention-chunk-1", "relevance": 2},
      {"chunk_id": "compliance-faq-chunk-12", "relevance": 1}
    ]
  },
  ...
]
```
This is labor-intensive. But it's the only way to get reliable ground truth. LLM-generated annotations can supplement but shouldn't replace human judgment for your core evaluation set.
### The Practical Approach: Synthetic + Verified
Generate candidate Q&A pairs from your documents using an LLM, then have humans verify and correct.
```python
GENERATE_QA_PROMPT = """
Given this document chunk, generate 3 questions that this chunk
would be a good answer to. Questions should be natural queries a
user might actually ask.
Chunk: {chunk_text}
Source: {source_document}
Questions:
"""
# Generate candidates, then human review
for chunk in sample_chunks:
    candidate_questions = llm.generate(
        GENERATE_QA_PROMPT.format(
            chunk_text=chunk.text,
            source_document=chunk.metadata["source"],
        )
    )
    # Human reviews: Is this question realistic? Is this chunk
    # the best answer? Are there other relevant chunks?
```
### Continuous Evaluation from User Feedback
Don't stop at a static dataset. Instrument your production system.
```python
class RetrievalLogger:
    def log_query(self, query, retrieved_chunks, user_feedback=None):
        """Log every query and its retrieval results."""
        self.store.insert({
            "query": query,
            "retrieved_chunks": [c.id for c in retrieved_chunks],
            "timestamp": now(),
            "user_feedback": user_feedback,  # thumbs up/down
        })

    def get_failed_queries(self, days=7):
        """Find queries where users gave negative feedback."""
        return self.store.query(
            feedback="negative",
            since=days_ago(days),
        )
```
Failed queries become your next evaluation dataset entries. This creates a flywheel: bad retrieval gets flagged, gets added to the eval set, gets measured, gets fixed.
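The promotion step in that flywheel can be sketched like this (the `annotation_queue` list and the record shape are assumptions; `get_failed_queries` is the logger method above):

```python
def promote_failures_to_eval(logger, annotation_queue, days=7):
    """Queue negatively-rated production queries for human annotation."""
    for record in logger.get_failed_queries(days=days):
        annotation_queue.append({
            "query": record["query"],
            # Start from what was retrieved; the annotator marks which
            # chunks are relevant, or finds the ones retrieval missed.
            "candidate_chunks": record["retrieved_chunks"],
            "relevant_chunks": None,  # filled in during human review
        })
    return annotation_queue
```

The key design choice: failed queries enter as *candidates*, not ground truth. A human still assigns the relevance labels before anything lands in the eval set.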
## LLM-as-Judge for Retrieval
When human annotation is too expensive for continuous evaluation, use an LLM to judge retrieval relevance.
```python
RELEVANCE_JUDGE_PROMPT = """
Given a user question and a retrieved document chunk, rate the
relevance of the chunk on a scale of 0-3:
0: Not relevant at all
1: Marginally relevant, contains tangentially related information
2: Relevant, contains useful information for answering the question
3: Highly relevant, directly answers or is essential for answering
Question: {question}
Chunk: {chunk_text}
Rating (0-3) and one-sentence justification:
"""
async def evaluate_retrieval_batch(eval_set, retriever, judge_llm):
    results = []
    for item in eval_set:
        retrieved = await retriever.search(item["query"], top_k=10)
        relevance_scores = {}
        for chunk in retrieved:
            judgment = await judge_llm.generate(
                RELEVANCE_JUDGE_PROMPT.format(
                    question=item["query"],
                    chunk_text=chunk.text,
                )
            )
            relevance_scores[chunk.id] = parse_rating(judgment)
        results.append({
            "query": item["query"],
            "precision": compute_precision(retrieved, relevance_scores),
            "ndcg": ndcg_at_k(retrieved, relevance_scores, k=5),
            "mrr": compute_mrr(retrieved, relevance_scores),
        })
    return aggregate_results(results)
```
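The `parse_rating` helper above is left undefined; a defensive sketch, assuming the judge's reply contains the digit somewhere near the front:

```python
import re

def parse_rating(judgment: str) -> int:
    """Extract a 0-3 rating from the judge's free-text reply.

    Falls back to 0 (not relevant) if no in-range digit is found,
    so a malformed judgment never inflates the scores.
    """
    match = re.search(r"\b([0-3])\b", judgment)
    return int(match.group(1)) if match else 0
```

Failing closed (to 0) is deliberate: a judge that rambles without giving a rating should hurt measured relevance, not silently pass.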
LLM judges correlate well with human judgments for relevance (around 80-85% agreement). Not perfect, but good enough for continuous monitoring and regression detection.
## The Evaluation Pipeline
Put it all together into something you actually run regularly.
```python
import time

class RAGEvaluator:
    def __init__(self, retriever, eval_dataset, judge_llm):
        self.retriever = retriever
        self.dataset = eval_dataset
        self.judge = judge_llm

    async def run_evaluation(self):
        metrics = {
            "precision@5": [],
            "recall@10": [],
            "mrr": [],
            "latency_ms": [],  # summarize() reports p50/p95 from the raw list
        }
        for item in self.dataset:
            start = time.time()
            retrieved = await self.retriever.search(item["query"], top_k=10)
            metrics["latency_ms"].append((time.time() - start) * 1000)
            # Compute metrics against ground truth
            relevant_ids = {r["chunk_id"] for r in item["relevant_chunks"]}
            retrieved_ids = [c.id for c in retrieved]
            metrics["precision@5"].append(
                precision_at_k(retrieved_ids[:5], relevant_ids)
            )
            metrics["recall@10"].append(
                recall_at_k(retrieved_ids[:10], relevant_ids)
            )
            metrics["mrr"].append(
                reciprocal_rank(retrieved_ids, relevant_ids)
            )
        return {k: summarize(v) for k, v in metrics.items()}
```
Run this on every change to your retrieval pipeline. New embedding model? Run the eval. Changed chunk size? Run the eval. Added a reranker? Run the eval. Updated the index? Run the eval.
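To make "run the eval" enforceable in CI, a small regression gate can sit on top of the evaluator's output; the metric names and the 0.05 drop threshold here are illustrative:

```python
def check_regression(current, baseline, max_drop=0.05):
    """Return descriptions of metrics that dropped more than max_drop.

    current/baseline: dicts like {"precision@5": 0.74, "mrr": 0.81, ...}
    An empty return value means the change is safe to ship.
    """
    failures = []
    for metric, base_value in baseline.items():
        value = current.get(metric)
        if value is not None and value < base_value - max_drop:
            failures.append(f"{metric}: {base_value:.3f} -> {value:.3f}")
    return failures

# Example: a new embedding model tanked precision
failures = check_regression(
    {"precision@5": 0.62, "mrr": 0.80},
    {"precision@5": 0.74, "mrr": 0.81},
)
# One failure: precision@5 dropped 0.12; MRR's 0.01 dip is within tolerance
```

Wire the failure list into your CI step's exit code and retrieval regressions stop reaching production unnoticed.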
## What Good Numbers Look Like
These vary by domain, but rough benchmarks for enterprise RAG:
- **Precision@5:** Above 0.7 is good. Below 0.5 means too much noise.
- **Recall@10:** Above 0.8 is good. Below 0.6 means you're missing answers.
- **MRR:** Above 0.7 means the right answer is usually in the top 2-3 results.
- **nDCG@5:** Above 0.7 is solid.
- **Latency p95:** Under 500ms for the retrieval step alone.
If your precision is high but recall is low, you're finding good chunks when you find them but missing a lot. Probably need to adjust your search strategy or chunk size.
If recall is high but precision is low, you're finding everything relevant but also returning a lot of garbage. Add a reranker (see [reranking to lift precision](/blog/cross-encoder-reranking-rag)).
## The One Thing That Changes Everything
Measure retrieval separately from generation. That's it. That's the insight.
When you see a bad RAG answer, run the query through your retriever alone. Look at what came back. Was the answer in the retrieved chunks? If yes, it's a generation problem. If no, it's a retrieval problem.
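That check is easy to script. A minimal sketch, assuming a retriever with an async `search` method returning chunks with a `.text` attribute; the substring check is a stand-in for real answer-containment checks, which usually need an LLM judge or span matching:

```python
async def diagnose(query, expected_answer, retriever, top_k=10):
    """Classify a bad RAG answer as a retrieval or generation problem."""
    chunks = await retriever.search(query, top_k=top_k)
    found = any(expected_answer.lower() in c.text.lower() for c in chunks)
    if found:
        return "generation problem: the answer was retrieved but not used"
    return "retrieval problem: the answer never reached the LLM"
```

Run this on every user-reported bad answer and the "is it retrieval or generation?" argument settles itself in seconds.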
This simple diagnostic saves weeks of debugging in the wrong direction. Most teams that "can't figure out why RAG isn't working" have never once looked at what their retriever actually returned.
Don't be that team. Measure retrieval. Fix retrieval. Everything else gets easier.