Every RAG tutorial does the same thing. Load a PDF, chunk it, embed it, query it, done. Fifteen minutes, looks great in a demo, falls apart the moment real users touch it.
Production RAG is a different beast. You need hybrid search, reranking, metadata filtering, chunk overlap strategies, and a database that doesn't fall over at scale. That's what we're building today.
## The Stack
PostgreSQL with pgvector. Not Pinecone, not Chroma, not Weaviate. Postgres. Because you already run it, your team already knows it, and pgvector gives you vector search alongside your relational data. One database. One backup strategy. One ops story.
```bash
pip install langchain langchain-anthropic langchain-postgres langchain-community \
    "psycopg[binary]" sentence-transformers rank_bm25
```
## Database Setup
You need Postgres with the pgvector extension. Docker makes this painless.
```bash
docker run -d \
--name pgvector-db \
-e POSTGRES_PASSWORD=your_password \
-e POSTGRES_DB=rag_db \
-p 5432:5432 \
pgvector/pgvector:pg17
```
Enable the extension, once per database (e.g. via `docker exec -it pgvector-db psql -U postgres -d rag_db`):
```sql
CREATE EXTENSION IF NOT EXISTS vector;
```
## Document Processing That Actually Works
The difference between demo RAG and production RAG starts at ingestion. You can't just split text every 500 characters and hope for the best.
```python
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
def load_documents(file_path: str):
    """Load documents with appropriate loader."""
    if file_path.endswith(".pdf"):
        loader = PyPDFLoader(file_path)
    else:
        loader = TextLoader(file_path)
    return loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
)

docs = load_documents("./data/technical_manual.pdf")
chunks = splitter.split_documents(docs)
```
800 characters with 200 overlap. The overlap is critical. Without it, you lose context at chunk boundaries and your retrieval quality tanks. The recursive splitter tries paragraph breaks first, then sentences, then words. Much better than blind character splitting.
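To see why overlap matters, here is a hypothetical character-level sliding-window chunker (deliberately naive; the recursive splitter above is smarter about separators). Without overlap, a fact that straddles a chunk boundary gets cut in half; with overlap, at least one chunk keeps it intact:

```python
def chunk(text: str, size: int, overlap: int) -> list[str]:
    """Naive sliding window: each chunk starts `size - overlap` chars after the last.
    Requires overlap < size."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

text = "The API key must be rotated every 90 days. Rotation is automatic in cloud mode."
no_overlap = chunk(text, size=30, overlap=0)
with_overlap = chunk(text, size=30, overlap=15)

phrase = "rotated every 90 days"
# The fact straddles a boundary in the no-overlap split, so no chunk contains it...
assert not any(phrase in c for c in no_overlap)
# ...but an overlapping chunk keeps it whole, so retrieval can still find it.
assert any(phrase in c for c in with_overlap)
```

Scale the same effect up to 800/200 and real documents: the overlap is insurance against splitting exactly where the answer lives.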
## Embedding and Storage
```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_postgres.vectorstores import PGVector

# Use a local embedding model. Faster, cheaper, no API dependency.
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True},
)

# langchain-postgres uses psycopg 3, so the driver is "psycopg", not "psycopg2"
CONNECTION_STRING = "postgresql+psycopg://postgres:your_password@localhost:5432/rag_db"

vectorstore = PGVector.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name="technical_docs",
    connection=CONNECTION_STRING,
    pre_delete_collection=False,
)
```
BGE-large is the embedding model. It's open source, runs locally, and scores competitively with paid embedding APIs on retrieval benchmarks like MTEB. No API costs, no rate limits, no third party seeing your data.
## Hybrid Search: Vectors Plus Keywords
Pure vector search misses exact matches. Pure keyword search misses semantic similarity. You need both.
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Vector retriever
vector_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 10},
)

# BM25 keyword retriever (in-memory index over the same chunks)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 10

# Combine with weighted ensemble
hybrid_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4],  # Favor semantic, but keywords matter
)
```
60/40 split favoring semantic search. Adjust based on your data. Technical docs with lots of acronyms and specific terms? Push keywords higher. Conversational content? Push semantic higher.
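Worth knowing what those weights actually do: `EnsembleRetriever` fuses the two ranked lists with weighted Reciprocal Rank Fusion rather than mixing raw scores, since BM25 scores and cosine similarities aren't on comparable scales. A minimal sketch of the idea, with toy document IDs (the constant 60 is the conventional RRF smoothing term):

```python
def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists: each doc earns weight / (c + rank) from every list it appears in."""
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]  # ranked by cosine similarity
bm25_hits = ["doc_c", "doc_a", "doc_d"]    # ranked by BM25 score
fused = weighted_rrf([vector_hits, bm25_hits], weights=[0.6, 0.4])
# → ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Note how `doc_a`, ranked well by both retrievers, beats `doc_c`, which topped only the lower-weighted BM25 list. That agreement bonus is the point of fusion.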
## Reranking: The Secret Weapon
Retrieval gets you candidates. Reranking picks the winners. This is the single biggest quality improvement you can make.
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker
# Cross-encoder reranker
cross_encoder = HuggingFaceCrossEncoder(
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"
)
reranker = CrossEncoderReranker(model=cross_encoder, top_n=5)

# Wrap the hybrid retriever with reranking
final_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=hybrid_retriever,
)
```
The cross-encoder sees the query AND each document together, scoring relevance directly. Bi-encoders (your embedding model) encode them separately and compare vectors. Cross-encoders are slower but dramatically more accurate for reranking a small candidate set.
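That cost asymmetry is exactly why you rerank a small candidate set instead of the whole corpus: a bi-encoder embeds each document once and reuses the vectors, while a cross-encoder needs a full forward pass per (query, document) pair. Toy arithmetic with hypothetical workload numbers makes the gap concrete:

```python
corpus_size = 1_000_000
queries = 1_000
candidates = 10  # docs reranked per query

# Bi-encoder: embed the corpus once, then one pass per incoming query.
bi_encoder_passes = corpus_size + queries            # 1,001,000

# Cross-encoder over everything: one pass per (query, doc) pair. Infeasible.
cross_encoder_everything = queries * corpus_size     # 1,000,000,000

# Cross-encoder over retrieved candidates only: cheap.
cross_encoder_rerank = queries * candidates          # 10,000

assert bi_encoder_passes == 1_001_000
assert cross_encoder_everything == 1_000_000_000
assert cross_encoder_rerank == 10_000
```

Retrieval does the coarse, cheap pass over millions of chunks; the cross-encoder spends its accuracy budget on the ten that matter.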
## The RAG Chain
```python
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
llm = ChatAnthropic(model="claude-sonnet-4-20250514", temperature=0)

template = """Answer the question based on the following context. If the context
doesn't contain enough information, say so clearly. Don't make things up.

Context:
{context}

Question: {question}

Provide a detailed answer with specific references to the source material."""

prompt = ChatPromptTemplate.from_template(template)

def format_docs(docs):
    return "\n\n---\n\n".join(
        f"[Source: {doc.metadata.get('source', 'unknown')}, "
        f"Page: {doc.metadata.get('page', 'n/a')}]\n{doc.page_content}"
        for doc in docs
    )

rag_chain = (
    {"context": final_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
```
Notice the source formatting. Every chunk carries its metadata into the prompt. The LLM can cite specific documents and pages. Users need to verify answers. Give them the breadcrumbs.
## Metadata Filtering
Real applications need filtered search. "Find me information about authentication, but only from documents uploaded this month."
```python
# Store documents with rich metadata
for chunk in chunks:
    chunk.metadata.update({
        "department": "engineering",
        "doc_type": "technical_manual",
        "uploaded_at": "2026-03-01",
    })

# Query with filters
filtered_results = vectorstore.similarity_search(
    "authentication flow",
    k=5,
    filter={"department": "engineering", "doc_type": "technical_manual"},
)
```
langchain-postgres stores document metadata in a JSONB column alongside each embedding. You get the full power of Postgres JSON operators for filtering. No separate metadata store needed.
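Because it's plain Postgres, you can also query that metadata directly in SQL. A sketch assuming the table and column names langchain-postgres currently uses (`langchain_pg_embedding` with a `cmetadata` JSONB column; verify against your schema):

```sql
SELECT document, cmetadata->>'department' AS department
FROM langchain_pg_embedding
WHERE cmetadata->>'department' = 'engineering'
  AND (cmetadata->>'uploaded_at')::date >= '2026-03-01';
```

Handy for debugging ingestion: you can eyeball exactly what metadata landed on each chunk without going through the retriever at all.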
## Evaluation: Know Your Numbers
You can't improve what you don't measure. Track these three metrics.
```python
from dataclasses import dataclass
@dataclass
class RAGEvaluation:
    query: str
    retrieved_docs: list
    answer: str
    expected_answer: str

    def _is_relevant(self, doc) -> bool:
        """Plug in your relevance check: human labels or an LLM judge."""
        raise NotImplementedError

    @property
    def context_relevance(self) -> float:
        """What percentage of retrieved docs are actually relevant?"""
        relevant = sum(1 for doc in self.retrieved_docs if self._is_relevant(doc))
        return relevant / len(self.retrieved_docs) if self.retrieved_docs else 0.0

    @property
    def answer_faithfulness(self) -> float:
        """Does the answer only use information from the context?"""
        # Use LLM-as-judge for this
        raise NotImplementedError

    @property
    def answer_relevance(self) -> float:
        """Does the answer actually address the question?"""
        # Use LLM-as-judge for this
        raise NotImplementedError
```
Context relevance, faithfulness, and answer relevance. The RAG triad. If your context relevance is below 70%, fix your retrieval before touching anything else. Garbage in, garbage out.
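To turn per-query scores into a number you can track over time, aggregate over a small hand-labeled gold set. A hypothetical harness, self-contained for illustration: the keyword check here stands in for the human label or LLM judge you'd use in practice, and `EvalCase` is a simplified stand-in for the dataclass above:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    """One gold-set query plus the chunk texts retrieval returned for it."""
    query: str
    retrieved: list[str]
    relevant_terms: list[str]  # hand-labeled terms a relevant chunk must mention

    def context_relevance(self) -> float:
        hits = sum(
            1 for doc in self.retrieved
            if any(term in doc.lower() for term in self.relevant_terms)
        )
        return hits / len(self.retrieved) if self.retrieved else 0.0

cases = [
    EvalCase("how do I rotate API keys?",
             ["Keys rotate every 90 days.", "The UI theme is dark by default."],
             ["rotate", "key"]),
    EvalCase("what database does the service use?",
             ["The service stores state in Postgres.", "Postgres runs pgvector."],
             ["postgres"]),
]

mean_relevance = sum(c.context_relevance() for c in cases) / len(cases)
# 0.5 on the first case, 1.0 on the second → mean 0.75
assert mean_relevance == 0.75
```

Run this on every retrieval change. A 20-query gold set is enough to catch most regressions before users do.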
## The Full Pipeline
```python
async def query_rag(question: str, filters: dict | None = None) -> dict:
    """Production RAG query with full pipeline."""
    # 1. Retrieve with hybrid search + reranking
    # (filters would be threaded into the retriever's search_kwargs)
    docs = await final_retriever.ainvoke(question)

    # 2. Generate answer
    answer = await rag_chain.ainvoke(question)

    # 3. Return with sources for verification
    sources = [
        {"source": d.metadata.get("source"), "page": d.metadata.get("page")}
        for d in docs
    ]
    return {
        "answer": answer,
        "sources": sources,
        "num_chunks_retrieved": len(docs),
    }
```
## What Makes This Production-Grade
Hybrid search catches what pure vector misses. Reranking promotes the best results. Metadata filtering scopes queries to relevant subsets. Source citations let users verify. And pgvector means one database, one backup, one team that already knows how to operate it.
The tutorial RAG pipeline is five lines of code. The production one is fifty. That difference is where reliability lives.