## The Boring Part That Breaks Everything
Nobody writes conference talks about chunking. It's not glamorous. There's no paper with a clever acronym. But chunking is where most RAG pipelines quietly succeed or silently fail.
Here's the thing. Your embedding model sees chunks, not documents. Your retrieval returns chunks, not documents. Your LLM reads chunks, not documents. If your chunks are bad, everything downstream is bad. No reranker, no prompt engineering, no model upgrade will fix garbage input.
Chunking is the foundation. Treat it like one.
## The Naive Approach (And Why It Persists)
Split text every N tokens with M token overlap. Done.
```python
# The "good enough" chunker everyone starts with
def naive_chunk(text, chunk_size=512, overlap=50):
    # Whitespace "tokens" keep the example self-contained; production
    # code should use the embedding model's own tokenizer.
    tokens = text.split()
    step = chunk_size - overlap  # must stay positive, or the loop never advances
    chunks = []
    for i in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[i:i + chunk_size]))
    return chunks
```
This works for demos. It even works reasonably well for homogeneous collections where every document has the same shape. But it falls apart when your documents have internal structure. And enterprise documents always have structure.
Fixed-size chunking will happily:

- split a table across two chunks, rendering both useless
- cut a code block in half
- separate a heading from its content
- merge the end of one section with the beginning of an unrelated one
- ignore paragraph boundaries, cutting mid-sentence
These aren't edge cases. In a corpus of real enterprise documents, these happen constantly.
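You can see the table failure on a toy document. This is a deliberately tiny sketch (whitespace "tokens" and an absurdly small chunk size, chosen only to make the breakage visible):

```python
# Sketch: fixed-size splitting cuts a Markdown table in half.
doc = (
    "## Q3 Revenue\n"
    "| Region | Revenue |\n"
    "|--------|---------|\n"
    "| EMEA   | $4.2M   |\n"
    "| APAC   | $3.1M   |\n"
)

words = doc.split()
chunk_size = 6  # unrealistically small, to force the failure in a short example
chunks = [" ".join(words[i:i + chunk_size])
          for i in range(0, len(words), chunk_size)]

# The header row is severed mid-row: "Region" lands in one chunk,
# the EMEA/APAC data rows in others. No chunk is interpretable alone.
```

At realistic chunk sizes the same thing happens, just less often per document and much harder to spot.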
## Strategy 1: Structure-Aware Chunking
If your documents have structure (headers, sections, paragraphs, lists), use it.
```python
from langchain.text_splitter import MarkdownHeaderTextSplitter
headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(markdown_content)
```
For HTML, parse the DOM. For PDFs, use layout analysis (more on that in the multi-modal article). For Word docs, use the heading hierarchy.
The principle: let the document's own structure define chunk boundaries. Authors put section breaks where topics change. That's exactly where you want chunk boundaries too.
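For HTML, the same idea can be sketched with nothing but the standard library: a minimal `HTMLParser` subclass (illustrative, not production-grade) that starts a new chunk at each h1-h3 and keeps the heading attached to its body.

```python
from html.parser import HTMLParser

class HeadingChunker(HTMLParser):
    """Split HTML into (heading, body) chunks at h1-h3 boundaries."""

    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.chunks = []        # list of (heading, body_text) tuples
        self._heading = None
        self._in_heading = False
        self._buffer = []

    def _flush(self):
        body = " ".join(self._buffer).strip()
        if body:
            self.chunks.append((self._heading, body))
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._flush()       # a new heading closes the previous section
            self._in_heading = True
            self._heading = ""

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self._heading += data.strip()
        elif data.strip():
            self._buffer.append(data.strip())

    def close(self):
        super().close()
        self._flush()           # emit the final section

parser = HeadingChunker()
parser.feed("<h2>Refunds</h2><p>Refunds take 5 days.</p>"
            "<h2>Returns</h2><p>Returns need a receipt.</p>")
parser.close()
```

Each `(heading, body)` pair becomes one chunk, so the heading travels with its content into the embedding.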
**When to use:** Structured documents with clear heading hierarchies: technical documentation, policies, manuals. Chunking also interacts with your retrieval strategy, so [hybrid search](/blog/hybrid-search-rag-production) is worth reading alongside this.
**Limitation:** Sections vary wildly in size. A one-liner heading creates a tiny chunk. A massive section with no sub-headings creates a chunk that exceeds your model's effective embedding window.
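One common mitigation, sketched here without any library, is a two-pass split: keep heading-defined sections that fit the budget, and fall back to paragraph-level packing for the ones that don't (`split_oversized` and its character budget are illustrative names, not a real API).

```python
def split_oversized(sections, max_chars=200):
    """Two-pass sketch: keep heading-defined sections as-is when they fit,
    fall back to paragraph-level packing when they don't."""
    chunks = []
    for section in sections:
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Fallback: split the oversized section on blank lines,
        # greedily packing paragraphs up to the budget.
        # (A single paragraph over the budget would need a further
        # sentence- or word-level split; omitted here for brevity.)
        current = ""
        for para in section.split("\n\n"):
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append(current)
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current:
            chunks.append(current)
    return chunks
```

This keeps the heading-defined boundaries wherever possible and only degrades to coarser splitting where the document gives you no choice.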
## Strategy 2: Recursive Character Splitting With Hierarchy
LangChain's recursive splitter is popular for good reason. It tries multiple separators in order of preference.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],  # try each in order
)
```
First, try splitting on double newlines (paragraphs). If chunks are still too big, split on single newlines. Then sentences. Then words. Last resort: characters.
This preserves the most meaningful boundaries it can while respecting size constraints.
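The cascade itself can be sketched in a few lines. This is a simplified illustration of the idea, not the library's actual implementation: it drops separators on split and skips the overlap and merge logic the real splitter adds.

```python
def recursive_split(text, chunk_size=1000,
                    separators=("\n\n", "\n", ". ", " ", "")):
    """Sketch of the cascade: try the most meaningful separator first,
    recurse with the next one on any piece that is still too big."""
    if len(text) <= chunk_size:
        return [text]
    sep, *rest = separators
    if sep == "":
        # Last resort: hard character cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in text.split(sep):
        if not piece:
            continue
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks
```

The recursion bottoms out at the empty-string separator, so every piece eventually fits the budget even in pathological text with no whitespace at all.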
**When to use:** General-purpose. Good default when you don't have strong document structure.
**Limitation:** Still doesn't understand semantic boundaries. A topic change mid-paragraph gets missed.
## Strategy 3: Semantic Chunking
Use the embedding model itself to detect topic shifts.
```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def semantic_chunk(sentences, embedder, threshold=0.75):
    """
    Group sentences into chunks based on embedding similarity.
    When similarity between adjacent sentences drops below the
    threshold, start a new chunk.
    """
    embeddings = embedder.embed(sentences)
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = cosine_similarity(embeddings[i - 1], embeddings[i])
        if similarity < threshold:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    chunks.append(" ".join(current_chunk))
    return chunks
```
This is slower (you're embedding every sentence) but produces chunks that are semantically coherent. Each chunk is about one topic, even if the document doesn't have explicit section breaks.
**When to use:** Unstructured text. Meeting transcripts. Email threads. Chat logs. Anything where topic shifts happen without formatting cues.
**Limitation:** Expensive at scale. The embedding call per sentence adds up when you're processing millions of documents. You can batch, but it's still significantly more compute than rule-based splitting. For a deeper look, see [incremental document sync](/blog/rag-document-sync-incremental).
## Strategy 4: Parent-Child Chunking
This is my go-to for enterprise RAG. Store chunks at two granularity levels.
Small chunks (256-512 tokens) for precise retrieval. Large chunks (1024-2048 tokens) for context. When a small chunk is retrieved, return its parent chunk to the LLM.
```python
def parent_child_chunk(document, small_size=400, large_size=1600):
    """Create hierarchical chunks: retrieve on children, return parents.

    Assumes a `recursive_split` helper that returns chunk objects with
    `.id` and `.text`, and an `embed` function for the vector index.
    """
    large_chunks = recursive_split(document, chunk_size=large_size)
    index_entries = []
    for parent in large_chunks:
        children = recursive_split(parent.text, chunk_size=small_size)
        for child in children:
            index_entries.append({
                "child_text": child.text,
                "child_embedding": embed(child.text),
                "parent_text": parent.text,
                "parent_id": parent.id,
            })
    return index_entries
```
Small chunks embed better (less noise in the vector) and match queries more precisely. But small chunks lack context. The parent chunk provides that context, giving the LLM enough information to generate a good answer.
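The retrieval half can be sketched as follows, assuming the `index_entries` shape built at ingestion and unit-normalized embeddings so a dot product works as cosine similarity (`retrieve_parents` is an illustrative name, not a library call):

```python
import numpy as np

def retrieve_parents(query_embedding, index_entries, top_k=3):
    """Rank by child-embedding similarity, return deduplicated parents.

    Assumes unit-normalized embeddings, so the dot product equals
    cosine similarity. A real system would use a vector index here
    instead of a full sort.
    """
    scored = sorted(
        index_entries,
        key=lambda e: -np.dot(query_embedding, e["child_embedding"]),
    )
    parents, seen = [], set()
    for entry in scored:
        if entry["parent_id"] not in seen:
            seen.add(entry["parent_id"])
            parents.append(entry["parent_text"])
        if len(parents) == top_k:
            break
    return parents
```

Deduplicating on `parent_id` matters: several children of the same parent often match the same query, and you don't want to spend context window on the same parent twice.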
**When to use:** Almost always. This is the best general strategy for balancing retrieval precision with generation quality.
**Limitation:** Storage roughly doubles, since you index the small chunks and store their parents as well. Worth it.
## Strategy 5: Proposition-Based Chunking
Extract atomic propositions from text, embed those. This is the most aggressive approach to chunk quality.
```python
# Use an LLM to decompose text into propositions
PROMPT = """
Decompose the following text into simple, self-contained propositions.
Each proposition should:
- Be a single factual statement
- Be understandable without context
- Include necessary entity references
Text: {text}
Propositions:
"""
```
A paragraph like "The company was founded in 2015 by Jane Smith in Austin, Texas, and now has 500 employees across 3 offices" becomes three propositions that each stand alone and embed cleanly.
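The LLM call itself is whatever client you already use; the part worth sketching is turning its raw output into proposition chunks. `parse_propositions` and the sample output below are illustrative assumptions, and real model output will vary:

```python
import re

def parse_propositions(llm_output):
    """Parse a newline-delimited proposition list from LLM output."""
    props = []
    for line in llm_output.splitlines():
        # Strip a leading bullet or "1." / "2)" style numbering, if any.
        line = re.sub(r"^\s*(?:[-*]|\d+[.)])\s*", "", line).strip()
        if line:
            props.append(line)
    return props

# What a well-behaved model might return for the paragraph above
# (hypothetical output, shown for illustration):
raw = """1. The company was founded in 2015.
2. The company was founded by Jane Smith in Austin, Texas.
3. The company has 500 employees across 3 offices."""
```

Each parsed proposition is then embedded and indexed individually, typically with a pointer back to its source passage for context at generation time.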
**When to use:** When retrieval precision matters more than anything. FAQ systems. Factual knowledge bases. Compliance documentation where every statement needs to be independently retrievable.
**Limitation:** Expensive. You're calling an LLM during ingestion for every document. At enterprise scale, this is a serious cost consideration.
## The Scale Problem
When you're processing millions of documents, chunking strategy interacts with compute cost, storage cost, and latency in ways that matter.
| Strategy | Ingestion Cost | Storage | Retrieval Quality |
|----------|---------------|---------|-------------------|
| Fixed-size | Low | 1x | Baseline |
| Recursive | Low | 1x | Better |
| Structure-aware | Low-Medium | 1x | Better (structured docs) |
| Semantic | High | 1x | Good |
| Parent-child | Low-Medium | 2x | Best general |
| Proposition | Very High | 3-5x | Best precision |
My recommendation for most enterprise deployments: start with parent-child chunking using recursive splitting for the child chunks. Add structure-aware splitting for document types that have clear hierarchies. Reserve semantic and proposition-based chunking for high-value, low-volume document collections where precision justifies the cost.
## Chunk Size: The Eternal Debate
There's no universal right answer, but there are constraints.
**Embedding model context window.** Most models perform best well below their maximum. A 512-token model doesn't embed 512 tokens of text well. It embeds 256-384 tokens well and starts degrading after that. The related post on [multi-modal documents](/blog/multi-modal-rag-enterprise) goes further on this point.
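A cheap screen for this, using the rough 4-characters-per-token rule of thumb for English (swap in your embedding model's real tokenizer for exact counts; `flag_oversized` and the budget value are illustrative):

```python
def flag_oversized(chunks, budget=384, chars_per_token=4):
    """Rough screen for chunks past the model's effective window.

    The chars-per-token ratio is a heuristic for English text;
    use the embedding model's actual tokenizer for a precise count.
    """
    flagged = []
    for i, chunk in enumerate(chunks):
        approx_tokens = len(chunk) / chars_per_token
        if approx_tokens > budget:
            flagged.append((i, round(approx_tokens)))
    return flagged
```

Running this over a sample of your corpus after chunking is a five-minute sanity check that catches the oversized-section problem before it silently degrades retrieval.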
**Retrieval precision vs. generation context.** Smaller chunks are more precise to retrieve. Larger chunks give more context for generation. Parent-child chunking solves this tension.
**Your actual queries.** If your users ask short, specific questions, small chunks work better. If they ask complex, multi-faceted questions, larger chunks help.
Measure. A/B test. Don't guess.
## One Last Thing
Whatever strategy you choose, always include metadata with your chunks. Source document, section heading, page number, document type, creation date. You'll need it for access control, source attribution, freshness scoring, and debugging when retrieval goes sideways.
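A minimal sketch of such a record, with illustrative field names you'd map onto your vector store's schema:

```python
def make_chunk_record(text, *, source, heading=None, page=None,
                      doc_type=None, created=None):
    """Sketch of a chunk record carrying the metadata listed above.

    Field names are illustrative, not a standard schema.
    """
    return {
        "text": text,
        "metadata": {
            "source": source,      # for attribution and access control
            "heading": heading,    # for display and reranking context
            "page": page,          # for citations back to the original
            "doc_type": doc_type,  # for filtering at query time
            "created": created,    # for freshness scoring
        },
    }
```

Attach this at ingestion time: reconstructing metadata after the fact means re-processing the whole corpus.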
Chunking isn't glamorous. But it's where the real engineering lives. Get it right, and everything else in your RAG pipeline gets easier.