
Series

RAG Engineering in Production · Part 2

Guide · For: AI Engineers, ML Engineers, Platform Engineers, AI Systems Architects

Why Your RAG Chunks Are Lying to Your Retriever

Chunking strategy is the upstream failure that no retrieval optimization can fix - a practitioner's guide to fixed-size, recursive, semantic, hierarchical, late, and contextual chunking and when each one silently breaks.

#rag #chunking #document-splitting #semantic-chunking #late-chunking #contextual-retrieval #production-ai #llm-infrastructure

Three weeks after shipping their internal knowledge base, a compliance team member asked the system about their contractor onboarding process. The answer was confident, well-structured, and wrong in exactly the way that matters most in compliance work: it described the general process but left out the exception clause that applied to contractors in regulated projects. The exception was in the document. It had been ingested. The embedding model had encoded it. The LLM, given the right context, would have answered without hesitation.

The retrieval system never surfaced the exception because the chunk containing it had been split at the paragraph boundary where the general rule ended and the qualification began. One chunk had the rule. Another had the qualification. Neither was complete enough to surface for the query.

This is not an embedding problem. It is not a model problem. It is not a retrieval strategy problem. It is a chunking problem - and no amount of tuning downstream can compensate for it.

In Part 1 of this series, we identified the Retrieval Tax: the compounding cost your system pays every query when retrieval strategy is mismatched to your data and query type. Chunking is where that tax accrues before retrieval even runs. If you split documents wrong, you have created irretrievable context - text that exists in your corpus, cannot be found by your retriever, and will silently degrade every answer that needed it.

The thesis of this article is specific and provable: most teams optimize their embedding model and ignore their chunking strategy, but chunking configuration has as much or more influence on retrieval quality than embedding model choice. A Vectara study at NAACL 2025 demonstrated this across 25 chunking configurations and 48 embedding models. The largest controlled comparison of chunking strategies to date - 36 methods, 6 domains, 5 embedding models, 1,080 total configurations (Shaukat et al., arXiv:2603.06976, March 2026) - confirmed it: content-aware chunking significantly outperforms naive fixed-length splitting, and the gap is not marginal.

You are probably running the wrong chunking strategy. Here is how to know, and how to fix it.


The Four Ways Chunking Silently Breaks Retrieval

Chunking failures are invisible in development. Your system produces answers. The RAGAS evaluation is not yet running. The compliance team has not yet asked the edge-case question. Here is what is breaking underneath.

1. Boundary Fragmentation: The Split That Destroys Semantic Units

Fixed-size chunking cuts at a token count. It has no concept of where a sentence ends, where a paragraph concludes, or that a three-clause legal exception should be treated as a single retrievable unit. The result: chunk boundaries fall in the middle of ideas.

A paragraph explaining revenue growth gets split between two chunks. The first has the numbers; the second has the context. Neither answers the question properly. A numbered list gets split at step four: the retriever returns two orphaned fragments that individually answer nothing. A table gets bisected: headers in one chunk, values in the next, the relationship between them gone.

Chroma's benchmarks quantify this: structure-aware chunking outperforms fixed-size by up to 9 percentage points in recall on the same corpus. That gap does not narrow as you scale - it widens, because the proportion of your corpus that spans chunk boundaries grows with document length and complexity.

2. The Pronoun Disconnect: Anaphoric Reference Failure

Standard chunk-then-embed pipelines embed each chunk in isolation. The embedding model sees only what is inside the chunk boundary - no surrounding context, no document-level awareness. This destroys anaphoric references.

"Berlin" appears in chunk 4. "Its population exceeds 3.85 million" appears in chunk 5. When chunk 5 is embedded in isolation, the model encodes "population", "inhabitants" - but not "Berlin". When a user queries "What is the population of Berlin?", the query embedding contains ["population", "Berlin"]. The chunk 5 embedding contains ["population", "inhabitants"] but not "Berlin". Cosine similarity drops. The right chunk does not surface.

This is not a retriever failure. The retriever retrieved correctly from what the embeddings encoded. The problem is that the embeddings encoded incomplete information because the chunks were incomplete at the point of encoding.
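
To see the gap concretely, here is a minimal sketch - assuming the sentence-transformers package, with all-MiniLM-L6-v2 as an illustrative model choice - that scores the query against the chunk exactly as it was split, and against the same chunk with its antecedent restored:

code
# Hedged demonstration of the anaphoric-reference gap.
# Assumes: pip install sentence-transformers; model choice is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "What is the population of Berlin?"
chunk_as_split = "Its population exceeds 3.85 million."              # what the pipeline embedded
chunk_with_antecedent = "Berlin's population exceeds 3.85 million."  # what it needed

q, split_emb, ctx_emb = model.encode([query, chunk_as_split, chunk_with_antecedent])

print("query vs chunk as split:       ", util.cos_sim(q, split_emb).item())
print("query vs chunk with antecedent:", util.cos_sim(q, ctx_emb).item())
# Expect the second score to be noticeably higher: the isolated chunk's
# embedding never saw "Berlin", so the query term has nothing to match.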

Late chunking (Jina AI, arXiv:2409.04701, Oct 2024) exists specifically to fix this. Embed the full document first - run the transformer over the entire text, let self-attention carry context across the document - then split the resulting contextual embeddings into chunks. Each chunk's embedding carries document-level awareness. The chunk that contains "Its population exceeds 3.85 million" carries the context that the subject is Berlin because the full-document encoding preserved that relationship.

3. Table and Structured Data Destruction

Fixed-size chunking is a semantic disaster for structured data. Tables are two-dimensional: the meaning of any cell depends on its row label and column header. Fixed-size chunking destroys this structure by linearizing the table and splitting at token count without regard for row or column boundaries.

The header row ends up in one chunk. Values end up in another. The model receives numbers without column context, or column names without values. The NVIDIA 2024 benchmark measured this directly: page-level chunking - which keeps tables and layouts intact - won at 0.648 accuracy specifically because it preserved the two-dimensional structure. Fixed-size chunking and naive semantic chunking both lost to page-level on document types containing tables.

The correct treatment of tables is to never split them. Treat each table as an atomic unit. If the table exceeds your chunk size budget, either increase the budget for table chunks or extract and store the table separately as structured data - not as text fragments.
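
One way to enforce that rule on a Markdown corpus is to lift table blocks out of the text before any splitter runs. The sketch below is a simplified illustration under that assumption - it recognizes pipe-style Markdown tables only, and the helper name is ours, not a library API:

code
import re

# Matches contiguous pipe-table blocks (header, separator, rows).
# Simplified: assumes well-formed pipe tables; not a general table parser.
TABLE_BLOCK = re.compile(r"(?:^\|.*\|\s*\n)+", re.MULTILINE)

def extract_atomic_tables(document: str) -> tuple[str, list[str]]:
    """Remove table blocks from the text and return them as atomic chunks.

    The remaining prose goes to the normal splitter; each table is indexed
    as a single chunk (or routed to structured storage) so the header/value
    relationships are never bisected.
    """
    tables = [m.group(0).strip() for m in TABLE_BLOCK.finditer(document)]
    prose = TABLE_BLOCK.sub("\n", document)
    return prose, tables

prose, table_chunks = extract_atomic_tables(markdown_document)
# prose        -> recursive splitting as usual
# table_chunks -> one chunk per table, regardless of token count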

4. The Context Cliff: When Chunk Size Mismatches Query Type

There is a measurable performance threshold at approximately 2,500 tokens of retrieved context. Above it, response quality degrades - not because the model cannot handle the context, but because large chunks produce embeddings that average over multiple ideas. When a chunk contains three distinct concepts, its embedding sits somewhere between all three, matching none of them precisely.

The NVIDIA 2024 benchmark showed the query-type dependence directly: factoid queries (specific names, dates, values) perform best with 256-512 token chunks. Analytical queries (compare, summarize, reason across multiple facts) need 1024+ tokens. A corpus serving both query types with a single chunk size is wrong for at least one of them - on every query.


Named Concept: Chunking Debt

Every document you ingest with the wrong chunking strategy creates Chunking Debt - the accumulated retrieval quality degradation that compounds with every query. The debt accrues at ingestion time, is invisible until evaluation, and is expensive to pay down because fixing the chunking strategy requires re-ingesting and re-embedding the entire corpus.

Unlike technical debt in code, which you can accumulate and pay down incrementally, Chunking Debt is all-or-nothing. You cannot fix the chunk boundaries on a subset of documents and leave the rest. Every retrieval call over the mal-chunked corpus pays the debt, every day, until you rebuild the index from scratch.

The practical consequence: chunking strategy decisions are almost irreversible at production scale. A team that ships with fixed-size 512-token chunks and discovers the problem six months later faces a full corpus re-ingestion. For corpora of millions of documents, that is a weekend-scale infrastructure event, not a config change.

This is why chunking deserves architectural attention before a single document is ingested - not after.


The Chunking Strategy Map

The wrong way to think about chunking is to ask "which strategy is best?" Benchmarks give different winners depending on corpus type, query type, and embedding model. The right question is: what are the structural properties of my documents and queries, and which strategy preserves the information that retrieval needs?

Fixed-Size Chunking: The Correct Baseline for Only One Case

Split at a fixed token count. Fast. Zero model calls. Used as the default in LangChain and LlamaIndex starters.

When it is acceptable: short, uniform, single-topic documents where every section is self-contained. Product FAQs. Support ticket descriptions. News summaries. Documents where the information density is even and the topic does not span more than one natural paragraph.

When it fails: everything else. The compliance knowledge base. Technical documentation with nested structure. Legal contracts. Financial reports. Any corpus where an idea or a rule spans more than one paragraph.

code
# Wrong way: fixed-size as a universal default
from langchain.text_splitter import CharacterTextSplitter

# This is what most RAG starter templates use.
# It ignores sentence boundaries, paragraph structure,
# tables, lists, and every other semantic unit in your documents.
splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separator="\n",
)
chunks = splitter.split_text(document)

# What this produces on a compliance policy document:
# Chunk 14: "...standard onboarding process. New contractors must complete
#             form HR-22 and submit credentials within 5 business days. For
#             contractors engaged in regulated projects, the following additi"
# Chunk 15: "onal requirements apply: security clearance at level 2 or above,
#             completion of the compliance module within 48 hours of start..."
#
# Query: "What are the requirements for contractors in regulated projects?"
# Result: Neither chunk surfaces. Chunk 14 is about standard onboarding.
#         Chunk 15 has no context establishing it applies to regulated projects.
#         The sentence was split mid-clause. Both chunks fail the query.
#
# The overlap of 200 tokens gives false confidence.
# It inflates your index by 20-25% without fixing boundary fragmentation.
# arXiv:2601.14123 shows overlap rarely improves BERTScore or EM.

Recursive Character Splitting: The Correct Default

Recursive character splitting tries a hierarchy of separators in order: double newlines (paragraph breaks), single newlines, spaces, characters. It will not split mid-sentence if it can split at a paragraph boundary. It will not split mid-paragraph if it can split at a sentence boundary.

This is meaningfully better than fixed-size and costs nothing extra. It is the correct baseline for any corpus that has paragraph or sentence structure, which is almost every corpus.

In Chroma's analysis, recursive splitting at 400 tokens reaches 85-90% recall; semantic chunking reaches 91-92%. Closing that 2-3 point gap requires embedding every sentence in the corpus - roughly 10x the processing cost. Unless your domain specifically rewards that recall improvement, recursive is the right default.

code
# Right way: recursive splitting as the production baseline
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,          # Validated default for most corpora
    chunk_overlap=0,         # Skip overlap unless you have evidence it helps
    separators=["\n\n", "\n", ".", "!", "?", " ", ""],
    length_function=len,     # len() counts characters; for true token budgets,
                             # use RecursiveCharacterTextSplitter.from_tiktoken_encoder
)
chunks = splitter.split_text(document)

# For structured documents (Markdown, HTML), add structure-awareness first:
from langchain.text_splitter import MarkdownHeaderTextSplitter

header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "h1"),
        ("##", "h2"),
        ("###", "h3"),
    ]
)
# Split on structure first, then recursively split sections that are too large
md_splits = header_splitter.split_text(markdown_document)

Chunk size is not a universal constant. Factoid queries need 256-512 tokens. Analytical queries need 1024+. For corpora serving mixed query types, the right answer is two indexes at different granularities - not one compromise chunk size that is wrong for both.
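
A dual-index setup is less exotic than it sounds: two collections built from the same corpus at different granularities, plus a per-query routing decision. The sketch below assumes the same LangChain-plus-Chroma stack as the other examples; the classify_query heuristic is a deliberate placeholder for whatever router (keyword rules, a classifier, or an LLM call) you actually use:

code
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

# Two granularities of the same corpus (token-based sizing via tiktoken)
factoid_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=384, chunk_overlap=0,    # factoid queries: 256-512 token range
)
analytical_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1200, chunk_overlap=0,   # analytical queries: 1024+ token range
)

factoid_index = Chroma.from_texts(
    factoid_splitter.split_text(document),
    embedding=embeddings,
    collection_name="factoid_chunks",
)
analytical_index = Chroma.from_texts(
    analytical_splitter.split_text(document),
    embedding=embeddings,
    collection_name="analytical_chunks",
)

def classify_query(query: str) -> str:
    """Placeholder router - replace with your own classifier or LLM intent check."""
    analytical_markers = ("compare", "summarize", "why", "trend", "across", "explain")
    return "analytical" if any(m in query.lower() for m in analytical_markers) else "factoid"

def retrieve(query: str, k: int = 5):
    index = analytical_index if classify_query(query) == "analytical" else factoid_index
    return index.similarity_search(query, k=k)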

Semantic Chunking: Right for One Specific Corpus Type

Semantic chunking computes embedding similarity between consecutive sentences and splits when similarity drops below a threshold. It produces topically coherent chunks that do not artificially bisect ideas.
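
Before reaching for a library, it is worth seeing how small the core mechanism is. The sketch below is a deliberately naive version - crude sentence splitting, an illustrative threshold, no minimum chunk size - of the embed-compare-split loop that semantic chunkers implement:

code
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def naive_semantic_chunks(text: str, threshold: float = 0.5) -> list[str]:
    """Start a new chunk wherever consecutive-sentence similarity drops below the threshold.

    Naive on purpose: real implementations use proper sentence segmentation,
    windowed similarity, and a minimum chunk size floor.
    """
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if not sentences:
        return []
    embeddings = model.encode(sentences)
    chunks, current = [], [sentences[0]]
    for sent, prev_emb, emb in zip(sentences[1:], embeddings, embeddings[1:]):
        if util.cos_sim(prev_emb, emb).item() < threshold:  # topic shift detected
            chunks.append(". ".join(current) + ".")
            current = []
        current.append(sent)
    chunks.append(". ".join(current) + ".")
    return chunks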

The benchmark evidence is mixed. A NAACL 2025 paper (arXiv:2410.13070) found that fixed 200-word chunks matched or beat semantic chunking on retrieval and answer generation across tasks - and the computational cost was not justified by consistent gains. Where semantic chunking genuinely outperforms recursive splitting is on mixed-format documents where neither sentence windows nor hierarchical structure provides a strong signal: Notion pages that combine narrative, inline tables, short code snippets, and bullet lists in no particular hierarchy.

One operational issue is non-negotiable: set a minimum chunk size. Semantic chunking can produce extremely small chunks - as few as 43 tokens on average in the FloTorch benchmark failure case, which produced a 54% pipeline failure rate. A floor of 200-400 tokens is not optional in production.

code
# Semantic chunking - only when corpus is genuinely mixed-format
# and recursive splitting is consistently underperforming on evaluation
from chonkie import SemanticChunker

chunker = SemanticChunker(
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    chunk_size=512,
    similarity_threshold=0.5,   # Lower = more splits; tune against your corpus
    min_chunk_size=200,         # Non-optional. Prevents fragment-size chunks.
)
chunks = chunker.chunk(document_text)

# Validate against your actual queries before shipping.
# If recall is not at least 2-3 points above recursive at 400 tokens,
# the cost is not justified.

Late Chunking: For Cross-Reference Heavy Documents

Late chunking inverts the standard pipeline. Instead of splitting documents into chunks and then embedding each chunk independently, late chunking embeds the full document through a long-context model first - preserving document-level context across the entire self-attention span - and then splits the resulting contextual embeddings into smaller chunks.

The result: a chunk containing "Its population exceeds 3.85 million" carries the context that the subject is Berlin, because the full-document encoding preserved that relationship. Pronoun references, cross-section citations, and header context all survive into the chunk embeddings.

The constraint is hard: late chunking requires embedding the full document at once. Documents longer than the embedding model's context window cannot be late-chunked. For long-context models like jina-embeddings-v3 (8K context window), documents up to that length are candidates. Beyond that length, late chunking is not applicable.
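
The code block after this sketch uses Jina's hosted API, which hides the mechanism. To make it concrete first, here is an approximate local version - the model name, the fixed token spans, and the trust_remote_code flag are illustrative assumptions: run the encoder over the full document once, then mean-pool the token embeddings over each chunk's span.

code
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative long-context encoder; any encoder whose context window covers
# the full document works for this pattern.
MODEL = "jinaai/jina-embeddings-v2-base-en"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, trust_remote_code=True)

def late_chunk_locally(document: str, span_tokens: int = 256) -> list[torch.Tensor]:
    """Encode the whole document once, then pool token embeddings per fixed span.

    Because self-attention ran over the full document, each span's pooled
    embedding carries document-level context (e.g. the antecedent of "its").
    Real implementations align spans to sentence or paragraph boundaries
    instead of fixed token windows.
    """
    inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        token_embeddings = model(**inputs).last_hidden_state[0]  # (seq_len, dim)
    return [
        token_embeddings[start : start + span_tokens].mean(dim=0)
        for start in range(0, token_embeddings.shape[0], span_tokens)
    ]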

code
# Late chunking with Jina embeddings
# Requires a long-context embedding model that supports late chunking
from jina import Client

# Step 1: Late chunk embedding - model sees full document context
def late_chunk_embed(document: str, client: Client) -> list[dict]:
    """
    Embed document with late chunking.
    The model encodes the full document first, then produces per-chunk embeddings
    that carry full-document context. Each chunk embedding is contextually aware.
    """
    response = client.encode(
        [document],
        late_chunking=True,  # jina-embeddings-v3 specific parameter
        max_length=8192,
    )
    # Returns per-chunk embeddings with full-document context preserved
    return response.chunks  # list of {text, embedding, start, end}

# Step 2: Index the contextual chunk embeddings
def index_late_chunks(chunks: list[dict], vector_store) -> None:
    for chunk in chunks:
        vector_store.upsert(
            id=chunk["id"],
            vector=chunk["embedding"],
            metadata={"text": chunk["text"]},
        )

# Constraint check before deciding to use late chunking
def can_late_chunk(document: str, max_model_tokens: int = 8192) -> bool:
    """
    Late chunking requires the full document to fit in the embedding model's
    context window. If the document is too long, fall back to recursive splitting.
    """
    from transformers import AutoTokenizer  # pip install transformers
    tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v3")
    token_count = len(tokenizer.encode(document))
    return token_count <= max_model_tokens

Contextual Retrieval: The Context Layer on Top of Any Strategy

Contextual Retrieval, released by Anthropic in September 2024, is not a chunking strategy. It is a context enrichment step applied after chunking and before embedding. For each chunk, a brief context summary (typically <100 words) is generated from the full document and prepended to the chunk before it is embedded and indexed.

The benchmark numbers from Anthropic's own testing are specific and verified:

  • Contextual embeddings alone: 35% reduction in retrieval failures (5.7% → 3.7%)
  • Contextual embeddings + contextual BM25: 49% reduction (5.7% → 2.9%)
  • Full pipeline with reranking: 67% reduction (5.7% → 1.9%)

This addresses the same problem as late chunking - decontextualized chunk embeddings - through a different mechanism. Instead of embedding full-document context via the transformer's attention, it explicitly generates a context summary via LLM call and prepends it. Late chunking is more computationally efficient per chunk (no LLM call). Contextual Retrieval is more accurate at the cost of one LLM call per chunk at ingestion time.

code
from anthropic import Anthropic

client = Anthropic()

CONTEXT_PROMPT = """<document>
{document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Please give a short succinct context to situate this chunk within the overall document
for the purposes of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else."""

def add_chunk_context(document: str, chunk: str) -> str:
    """
    Prepend document-level context to a chunk before embedding.
    This is the core operation in Anthropic's Contextual Retrieval technique.

    Cost note: one API call per chunk. For large corpora, use prompt caching
    to cache the document portion across all chunks from the same document.
    Anthropic reports >90% cost reduction via prompt caching for this pattern.
    """
    response = client.messages.create(
        model="claude-3-haiku-20240307",  # Haiku for cost; context generation
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(
                document=document,
                chunk=chunk,
            ),
        }],
    )
    context = response.content[0].text
    return f"{context}\n\n{chunk}"

def contextual_ingest(document: str, chunks: list[str]) -> list[str]:
    """
    Apply contextual retrieval to a list of chunks from a document.
    Returns contextualized chunks ready for embedding and indexing.
    """
    return [add_chunk_context(document, chunk) for chunk in chunks]

Small-to-Large Retrieval: Decoupling Search from Generation

The fundamental tension in chunking is that small chunks enable precise retrieval while large chunks provide sufficient context for generation. Small-to-Large (the ParentDocumentRetriever pattern in LangChain) resolves this by maintaining two granularities simultaneously.

Index small chunks (128-256 tokens) for retrieval precision. When a small chunk matches a query, retrieve its parent chunk (512-2048 tokens) and pass the parent to the LLM for generation. Search granularity and generation granularity are decoupled.

code
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

# Child splitter: small, precise, optimized for retrieval
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

# Parent splitter: larger, rich context, what the LLM sees
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)

# Dual-store setup
vectorstore = Chroma(
    collection_name="child_chunks",
    embedding_function=embeddings,
)
docstore = InMemoryStore()  # Stores parent chunks; child chunks reference the parent's ID in metadata

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Ingestion: adds both child and parent chunks to their respective stores
retriever.add_documents(documents)

# Retrieval: finds matching child chunks, returns their parents
# The LLM receives parent context, not the narrow child chunks
results = retriever.get_relevant_documents("your query here")

The Chunking Strategy Decision Guide

mermaid
flowchart TD
    A[Document arrives for ingestion] --> B{Does the document contain\ntables, code blocks, or\nstrict layout structure?}
    B -- Yes --> C{Is it paginated?\nPDF, formal report}
    C -- Yes --> D[Page-level chunking\nTreat each page as atomic unit]
    C -- No --> E[Structure-aware splitting\nMarkdownHeader or HTML splitter\nthen recursive on sections]
    B -- No --> F{Does the corpus have\nheavy cross-references,\npronouns, or inter-section\ndependencies?}
    F -- Yes --> G{Does the document fit\nwithin embedding model\ncontext window?}
    G -- Yes --> H[Late chunking\njina-embeddings-v3 or equivalent]
    G -- No --> I[Contextual Retrieval\nLLM-generated chunk context\nbefore embedding]
    F -- No --> J{Is the corpus genuinely\nunstructured - mixed format\nno clear hierarchy?}
    J -- Yes --> K[Semantic chunking\nmin_chunk_size 200+ tokens\nvalidate vs recursive first]
    J -- No --> L[Recursive character splitting\n400 tokens, separators hierarchy\nno overlap unless proven]
    L --> M{Mixed query types?\nFactoid AND analytical}
    M -- Yes --> N[Dual index:\n256-512 tokens for factoid\n1024+ tokens for analytical]
    M -- No --> O[Single index at\nquery-appropriate size]

    style A fill:#4A90E2,color:#fff
    style D fill:#6BCF7F,color:#fff
    style E fill:#6BCF7F,color:#fff
    style H fill:#98D8C8,color:#333
    style I fill:#98D8C8,color:#333
    style K fill:#FFD93D,color:#333
    style L fill:#4A90E2,color:#fff
    style N fill:#9B59B6,color:#fff
    style O fill:#4A90E2,color:#fff
    style B fill:#7B68EE,color:#fff
    style C fill:#7B68EE,color:#fff
    style F fill:#7B68EE,color:#fff
    style G fill:#7B68EE,color:#fff
    style J fill:#7B68EE,color:#fff
    style M fill:#7B68EE,color:#fff

Walk this tree before writing a single line of chunking code. The output of this decision is not "use semantic chunking" or "use recursive chunking" - it is a specific configuration for your specific corpus.
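
If you want the same branches as code in an ingestion pipeline, the sketch below mirrors the flowchart. The DocumentProfile fields are assumptions about what your ingestion metadata can answer, and the returned strings are strategy labels, not full configurations:

code
from dataclasses import dataclass

@dataclass
class DocumentProfile:
    # Assumed ingestion metadata; populate from your own document analysis.
    has_tables_or_code: bool
    is_paginated: bool
    has_heavy_cross_references: bool
    fits_embedding_context: bool
    is_unstructured_mixed_format: bool
    mixed_query_types: bool

def choose_chunking_strategy(doc: DocumentProfile) -> str:
    """Mirror of the flowchart above; returns a strategy label, not a config."""
    if doc.has_tables_or_code:
        return "page-level" if doc.is_paginated else "structure-aware + recursive"
    if doc.has_heavy_cross_references:
        return "late chunking" if doc.fits_embedding_context else "contextual retrieval"
    if doc.is_unstructured_mixed_format:
        return "semantic (min_chunk_size >= 200, validate vs recursive)"
    if doc.mixed_query_types:
        return "recursive, dual index (256-512 + 1024+ tokens)"
    return "recursive, 400 tokens, single index"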


The Pre-Ingestion Chunking Checklist

Before ingesting any corpus at production scale, answer these questions. Wrong answers here are costly to fix later.

Document structure:

  • Does the corpus contain tables? If yes, how will you handle atomic table units?
  • Does the corpus contain code blocks? Are they split-safe?
  • Is the document structure hierarchical (sections, subsections)? If yes, use structure-aware splitting before recursive.
  • Are documents paginated (PDFs, formal reports)? If yes, consider page-level as the atomic unit.

Query type:

  • Are most queries factoid (specific lookups)? Target 256-512 tokens.
  • Are most queries analytical (reasoning, comparison, synthesis)? Target 1024+ tokens.
  • Is the query mix genuinely heterogeneous? Consider a dual-index approach.

Cross-reference density:

  • Do documents use pronouns across paragraph boundaries?
  • Do sections reference other sections ("as described above", "see clause 4.2")?
  • If yes: does the document fit in the embedding model's context window? Late chunking if yes; Contextual Retrieval if no.

Validation gate (non-optional):

  • Run a RAGAS evaluation (context_recall, faithfulness) on a representative sample of 50-100 queries before committing to a chunking strategy - a minimal sketch follows this list
  • Compare at minimum: recursive at 400 tokens vs your chosen strategy. If the gap is <2 points of recall, ship recursive.
  • Set a minimum chunk size floor if using semantic chunking: 200 tokens minimum.
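
A minimal version of that validation gate, assuming the ragas and datasets packages and that you have already collected, per eval query and per candidate configuration, the question, retrieved contexts, generated answer, and ground-truth reference (the sample variables are hypothetical):

code
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_recall, faithfulness

def score_chunking_config(samples: list[dict]):
    """samples: one dict per eval query with keys
    'question', 'contexts', 'answer', 'ground_truth'."""
    dataset = Dataset.from_list(samples)
    return evaluate(dataset, metrics=[context_recall, faithfulness])

# Hypothetical sample sets, one per chunking configuration under test
recursive_scores = score_chunking_config(recursive_400_samples)
candidate_scores = score_chunking_config(candidate_strategy_samples)
# Ship recursive unless the candidate's context_recall is at least ~2 points higher.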

Operational readiness:

  • Do you have a re-ingestion pipeline ready for when chunking strategy changes?
  • Have you budgeted compute for contextual retrieval LLM calls if using that approach?
  • Is your chunk size validated against your embedding model's maximum token input?

How Chunking Debt Compounds the Retrieval Tax

In Part 1 of this series, we introduced the Retrieval Tax: the compounding cost your system pays when retrieval strategy is mismatched to your data and query type. Chunking Debt is what accumulates before the Retrieval Tax is even assessed.

The sequence: bad chunking → decontextualized embeddings → retriever cannot surface the right content → retrieval strategy choice is irrelevant because the content was never retrievable. You can switch from pure vector search to hybrid BM25 + dense with a cross-encoder reranker - the Part 1 minimum viable baseline - and still receive wrong answers if the chunk boundaries destroyed the semantic units that answers depend on.

The Retrieval Tax is applied on top of Chunking Debt. Both are invisible in development. Both compound at production scale. Both require architectural intent at the beginning of the pipeline, not after observing failures.

The correct order of operations:

  1. Decide chunking strategy first - before choosing retrieval strategy, before choosing embedding model
  2. Run evaluation before committing at scale
  3. Then apply the retrieval strategy decision guide from Part 1
  4. Then layer hybrid retrieval and reranking on top

Chunking is not a config parameter. It is an architectural decision with production consequences that persist for as long as your index exists.

