RAG Evaluation: Measuring Retrieval Quality Before It Ruins Your Answers
By Diesel
## The Blame Game
Something goes wrong with a RAG answer. Who's at fault?
Most teams immediately blame the LLM. "It's hallucinating." "The model isn't smart enough." "We need GPT-5."
In my experience, 70% of the time the problem is retrieval, not generation. The LLM generated a bad answer because it was given bad context. Or no relevant context. Or relevant context buried under irrelevant noise.
But nobody knows this, because nobody measures retrieval quality independently of generation quality. (The same argument for separating evaluation stages applies one level up; see [agent-level evaluation metrics](/blog/agent-evaluation-metrics).)
If you're evaluating RAG as a black box (question in, answer out), you can't diagnose anything. You need to evaluate the retrieval step and the generation step separately. This article is about the retrieval side.
## The Metrics That Matter
### Retrieval Precision
Of the chunks returned, how many were actually relevant?
```python
def retrieval_precision(retrieved_chunks, relevant_chunks):
    """What fraction of retrieved chunks are relevant?"""
    if not retrieved_chunks:
        return 0.0
    relevant_retrieved = set(retrieved_chunks) & set(relevant_chunks)
    return len(relevant_retrieved) / len(retrieved_chunks)
```
Low precision means your retrieval is returning noise. The LLM has to wade through irrelevant chunks to find the answer, which costs tokens and increases hallucination risk.
### Retrieval Recall
Of all the relevant chunks that exist, how many did retrieval actually find?
```python
def retrieval_recall(retrieved_chunks, relevant_chunks):
    """What fraction of relevant chunks were retrieved?"""
    if not relevant_chunks:
        return 0.0
    relevant_retrieved = set(retrieved_chunks) & set(relevant_chunks)
    return len(relevant_retrieved) / len(relevant_chunks)
```
Low recall means relevant information exists in your knowledge base but retrieval isn't finding it. This is the worst failure mode because the user (and the system) don't know the answer was there all along.
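A toy example shows how the two metrics diverge (the chunk IDs are made up):

```python
retrieved = ["c1", "c2", "c3", "c4", "c5"]  # what the retriever returned
relevant = ["c2", "c5", "c9"]               # ground-truth relevant chunks

hits = set(retrieved) & set(relevant)       # {"c2", "c5"}
precision = len(hits) / len(retrieved)      # 2/5 = 0.4  -> lots of noise
recall = len(hits) / len(relevant)          # 2/3 ~ 0.67 -> one answer missed

print(precision, recall)
```

Here retrieval found two of the three relevant chunks (decent recall) but padded them with three irrelevant ones (poor precision).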
### Mean Reciprocal Rank (MRR)
Where does the first relevant result appear in the ranked list?
```python
def mean_reciprocal_rank(queries_results):
    """Average of 1/rank_of_first_relevant_result across queries."""
    reciprocal_ranks = []
    for retrieved, relevant in queries_results:
        for rank, chunk in enumerate(retrieved, start=1):
            if chunk in relevant:
                reciprocal_ranks.append(1.0 / rank)
                break
        else:
            # No relevant chunk was retrieved at all
            reciprocal_ranks.append(0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```
If MRR is high, users (and LLMs) see the right answer first. If MRR is low, the relevant chunk is there but buried at position 8 out of 10. Many LLMs pay less attention to later context items, so position matters.
### Normalized Discounted Cumulative Gain (nDCG)
The sophisticated version that accounts for graded relevance and position.
```python
import numpy as np
def ndcg_at_k(retrieved, relevance_scores, k=10):
    """
    nDCG accounts for both relevance grade and rank position.
    relevance_scores: dict of chunk_id -> relevance (0-3)
    """
    dcg = sum(
        relevance_scores.get(chunk.id, 0) / np.log2(i + 2)
        for i, chunk in enumerate(retrieved[:k])
    )
    ideal = sorted(relevance_scores.values(), reverse=True)[:k]
    idcg = sum(
        score / np.log2(i + 2)
        for i, score in enumerate(ideal)
    )
    return dcg / idcg if idcg > 0 else 0.0
```
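A hand-worked example makes the position discounting concrete (the grades are made up): the same three chunks score differently depending purely on their ordering.

```python
import numpy as np

# Graded relevance of the top-3 results, in retrieved order
good_order = [3, 2, 0]  # best chunk first
bad_order = [0, 2, 3]   # best chunk buried at rank 3

def dcg(grades):
    # position i (0-based) is discounted by log2(i + 2)
    return sum(g / np.log2(i + 2) for i, g in enumerate(grades))

ideal = dcg(sorted(good_order, reverse=True))
print(dcg(good_order) / ideal)  # 1.0 -> perfect ordering
print(dcg(bad_order) / ideal)   # ~0.65 -> same chunks, worse ordering
```

Same chunks, same grades, roughly a third of the score lost to ordering alone. That gap is what a reranker buys back.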
## Building an Evaluation Dataset
Metrics are useless without a ground-truth dataset. This is where most teams cut corners and regret it later. (Production traces, for example from [LangSmith](/blog/real-time-agent-monitoring-langsmith), are a good source of representative queries.)
### The Gold Standard: Human Annotation
Pick 100-200 representative queries. For each query, have a human annotate which chunks in your knowledge base are relevant (and how relevant, on a 0-3 scale).
```python
# evaluation_dataset.json
[
  {
    "query": "What is our data retention policy for EU customers?",
    "relevant_chunks": [
      {"chunk_id": "policy-42-chunk-3", "relevance": 3},
      {"chunk_id": "gdpr-addendum-chunk-7", "relevance": 3},
      {"chunk_id": "general-retention-chunk-1", "relevance": 2},
      {"chunk_id": "compliance-faq-chunk-12", "relevance": 1}
    ]
  },
  ...
]
```
This is labor-intensive. But it's the only way to get reliable ground truth. LLM-generated annotations can supplement but shouldn't replace human judgment for your core evaluation set.
### The Practical Approach: Synthetic + Verified
Generate candidate Q&A pairs from your documents using an LLM, then have humans verify and correct.
```python
GENERATE_QA_PROMPT = """
Given this document chunk, generate 3 questions that this chunk
would be a good answer to. Questions should be natural queries a
user might actually ask.
Chunk: {chunk_text}
Source: {source_document}
Questions:
"""
# Generate candidates, then human review
for chunk in sample_chunks:
    candidate_questions = llm.generate(
        GENERATE_QA_PROMPT.format(
            chunk_text=chunk.text,
            source_document=chunk.metadata["source"],
        )
    )
    # Human reviews: Is this question realistic? Is this chunk
    # the best answer? Are there other relevant chunks?
```
### Continuous Evaluation from User Feedback
Don't stop at a static dataset. Instrument your production system.
```python
class RetrievalLogger:
    def log_query(self, query, retrieved_chunks, user_feedback=None):
        """Log every query and its retrieval results."""
        self.store.insert({
            "query": query,
            "retrieved_chunks": [c.id for c in retrieved_chunks],
            "timestamp": now(),
            "user_feedback": user_feedback,  # thumbs up/down
        })

    def get_failed_queries(self, days=7):
        """Find queries where users gave negative feedback."""
        return self.store.query(
            feedback="negative",
            since=days_ago(days),
        )
```
Failed queries become your next evaluation dataset entries. This creates a flywheel: bad retrieval gets flagged, gets added to the eval set, gets measured, gets fixed.
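The promotion step in that flywheel can be sketched like this (the `annotation_queue` list and the record shape are assumptions; `get_failed_queries` is the logger method above):

```python
def promote_failures_to_eval(logger, annotation_queue, days=7):
    """Queue negatively-rated production queries for human annotation."""
    for record in logger.get_failed_queries(days=days):
        annotation_queue.append({
            "query": record["query"],
            # Start from what was retrieved; the annotator marks which
            # chunks are relevant, or finds the ones retrieval missed.
            "candidate_chunks": record["retrieved_chunks"],
            "relevant_chunks": None,  # filled in during human review
        })
    return annotation_queue
```

The key design choice: failed queries enter as *candidates*, not ground truth. A human still assigns the relevance labels before anything lands in the eval set.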
## LLM-as-Judge for Retrieval
When human annotation is too expensive for continuous evaluation, use an LLM to judge retrieval relevance.
```python
RELEVANCE_JUDGE_PROMPT = """
Given a user question and a retrieved document chunk, rate the
relevance of the chunk on a scale of 0-3:
0: Not relevant at all
1: Marginally relevant, contains tangentially related information
2: Relevant, contains useful information for answering the question
3: Highly relevant, directly answers or is essential for answering
Question: {question}
Chunk: {chunk_text}
Rating (0-3) and one-sentence justification:
"""
async def evaluate_retrieval_batch(eval_set, retriever, judge_llm):
    results = []
    for item in eval_set:
        retrieved = await retriever.search(item["query"], top_k=10)
        relevance_scores = {}
        for chunk in retrieved:
            judgment = await judge_llm.generate(
                RELEVANCE_JUDGE_PROMPT.format(
                    question=item["query"],
                    chunk_text=chunk.text,
                )
            )
            relevance_scores[chunk.id] = parse_rating(judgment)
        results.append({
            "query": item["query"],
            "precision": compute_precision(retrieved, relevance_scores),
            "ndcg": ndcg_at_k(retrieved, relevance_scores, k=5),
            "mrr": compute_mrr(retrieved, relevance_scores),
        })
    return aggregate_results(results)
```
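The `parse_rating` helper above is left undefined; a defensive sketch, assuming the judge's reply contains the digit somewhere near the front:

```python
import re

def parse_rating(judgment: str) -> int:
    """Extract a 0-3 rating from the judge's free-text reply.

    Falls back to 0 (not relevant) if no in-range digit is found,
    so a malformed judgment never inflates the scores.
    """
    match = re.search(r"\b([0-3])\b", judgment)
    return int(match.group(1)) if match else 0
```

Failing closed (to 0) is deliberate: a judge that rambles without giving a rating should hurt measured relevance, not silently pass.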
LLM judges correlate well with human judgments for relevance (around 80-85% agreement). Not perfect, but good enough for continuous monitoring and regression detection.
## The Evaluation Pipeline
Put it all together into something you actually run regularly.
```python
import time

class RAGEvaluator:
    def __init__(self, retriever, eval_dataset, judge_llm):
        self.retriever = retriever
        self.dataset = eval_dataset
        self.judge = judge_llm

    async def run_evaluation(self):
        metrics = {
            "precision@5": [],
            "recall@10": [],
            "mrr": [],
            "latency_ms": [],  # summarize() reports p50/p95 from the raw list
        }
        for item in self.dataset:
            start = time.time()
            retrieved = await self.retriever.search(item["query"], top_k=10)
            metrics["latency_ms"].append((time.time() - start) * 1000)
            # Compute metrics against ground truth
            relevant_ids = {r["chunk_id"] for r in item["relevant_chunks"]}
            retrieved_ids = [c.id for c in retrieved]
            metrics["precision@5"].append(
                precision_at_k(retrieved_ids[:5], relevant_ids)
            )
            metrics["recall@10"].append(
                recall_at_k(retrieved_ids[:10], relevant_ids)
            )
            metrics["mrr"].append(
                reciprocal_rank(retrieved_ids, relevant_ids)
            )
        return {k: summarize(v) for k, v in metrics.items()}
```
Run this on every change to your retrieval pipeline. New embedding model? Run the eval. Changed chunk size? Run the eval. Added a reranker? Run the eval. Updated the index? Run the eval.
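To make "run the eval" enforceable in CI, a small regression gate can sit on top of the evaluator's output; the metric names and the 0.05 drop threshold here are illustrative:

```python
def check_regression(current, baseline, max_drop=0.05):
    """Return descriptions of metrics that dropped more than max_drop.

    current/baseline: dicts like {"precision@5": 0.74, "mrr": 0.81, ...}
    An empty return value means the change is safe to ship.
    """
    failures = []
    for metric, base_value in baseline.items():
        value = current.get(metric)
        if value is not None and value < base_value - max_drop:
            failures.append(f"{metric}: {base_value:.3f} -> {value:.3f}")
    return failures

# Example: a new embedding model tanked precision
failures = check_regression(
    {"precision@5": 0.62, "mrr": 0.80},
    {"precision@5": 0.74, "mrr": 0.81},
)
# One failure: precision@5 dropped 0.12; MRR's 0.01 dip is within tolerance
```

Wire the failure list into your CI step's exit code and retrieval regressions stop reaching production unnoticed.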
## What Good Numbers Look Like
These vary by domain, but rough benchmarks for enterprise RAG:
- **Precision@5:** Above 0.7 is good. Below 0.5 means too much noise.
- **Recall@10:** Above 0.8 is good. Below 0.6 means you're missing answers.
- **MRR:** Above 0.7 means the right answer is usually in the top 2-3 results.
- **nDCG@5:** Above 0.7 is solid.
- **Latency p95:** Under 500ms for the retrieval step alone.
If your precision is high but recall is low, you're finding good chunks when you find them but missing a lot. Probably need to adjust your search strategy or chunk size.
If recall is high but precision is low, you're finding everything relevant but also returning a lot of garbage. Add a reranker (see [reranking to lift precision](/blog/cross-encoder-reranking-rag)).
## The One Thing That Changes Everything
Measure retrieval separately from generation. That's it. That's the insight.
When you see a bad RAG answer, run the query through your retriever alone. Look at what came back. Was the answer in the retrieved chunks? If yes, it's a generation problem. If no, it's a retrieval problem.
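That check is easy to script. A minimal sketch, assuming a retriever with an async `search` method returning chunks with a `.text` attribute; the substring check is a stand-in for real answer-containment checks, which usually need an LLM judge or span matching:

```python
async def diagnose(query, expected_answer, retriever, top_k=10):
    """Classify a bad RAG answer as a retrieval or generation problem."""
    chunks = await retriever.search(query, top_k=top_k)
    found = any(expected_answer.lower() in c.text.lower() for c in chunks)
    if found:
        return "generation problem: the answer was retrieved but not used"
    return "retrieval problem: the answer never reached the LLM"
```

Run this on every user-reported bad answer and the "is it retrieval or generation?" argument settles itself in seconds.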
This simple diagnostic saves weeks of debugging in the wrong direction. Most teams that "can't figure out why RAG isn't working" have never once looked at what their retriever actually returned.
Don't be that team. Measure retrieval. Fix retrieval. Everything else gets easier.