Why Your RAG System’s Chunking Strategy Is Probably Broken (And How I Fixed Mine)

By Cael Lee | 10 min read

TL;DR: Swapping static for dynamic chunking lifted our legal-tech RAG system’s recall@5 from 67.3% to 82.6%, a 23% relative improvement. Sounds like a quick config change, right? It wasn't. It took two weeks, a lot of cold brew, and some 3 AM breakthrough moments in a Berlin apartment. Here's the code, the benchmarks, and the things I broke along the way.

Cover image description: A split-screen illustration showing puzzle pieces snapping together dynamically on one side versus rigid rectangular blocks on the other, with a search bar glowing between them.

The Night Everything Broke (And Why I Started This)

2 AM in my Berlin apartment. My third cup of cold brew sat untouched. I'd switched to cold brew around midnight because I couldn't be bothered to make another French press. We all know that feeling.

Our legal-tech RAG system was returning completely irrelevant case precedents for contract law queries. The embeddings were solid — I'd fine-tuned them myself. The retrieval pipeline was clean. We were using the right vector database with proper indexing. But users kept reporting that "the system doesn't understand context."

I spent 3 hours on this bug before realizing: it wasn't a retrieval problem at all.

It was how we were slicing the documents. The chunker was literally splitting definitions away from their parent clauses. Section headers from their content. Cross-references from their context.

That night sparked a two-week experiment comparing dynamic versus static chunking strategies. Here's what we learned. Well—here's what I learned. And broke. And frantically fixed at 4 AM while my neighbors probably debated calling the police about the guy muttering "sentence boundaries" repeatedly.

Setting Up the Experiment

We used a dataset of 1,200 German legal documents. Got permission from a Berlin law firm — shoutout to Lena for making those calls when I was too swamped with the actual engineering.

The goal was simple: find out which chunking strategy retrieves the right passage more often, with recall@5 as the headline metric.

Actually, wait, I should clarify. We also tracked Mean Reciprocal Rank and precision@1, but recall@5 was what the lawyers cared about. They wanted to know if the answer was somewhere in the results. Made sense for their workflow: they'd rather scan five chunks than miss the answer entirely.
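For anyone who wants the three metrics spelled out, here's a minimal sketch. It is not the evaluation code from the project. The function names and the toy data are made up for illustration, and I'm assuming the "hit rate" flavor of recall@k (did any relevant chunk land in the top k?), since that matches what the lawyers asked for.

```python
from typing import Dict, List, Set, Tuple

# One evaluation run: (chunk ids in ranked order, ids of the relevant chunks)
Run = Tuple[List[str], Set[str]]


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int = 5) -> float:
    """1.0 if any relevant chunk shows up in the top k results, else 0.0."""
    return 1.0 if any(cid in relevant_ids for cid in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids: List[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, cid in enumerate(ranked_ids, start=1):
        if cid in relevant_ids:
            return 1.0 / rank
    return 0.0


def precision_at_1(ranked_ids: List[str], relevant_ids: Set[str]) -> float:
    """1.0 if the top result is relevant, else 0.0."""
    return 1.0 if ranked_ids and ranked_ids[0] in relevant_ids else 0.0


def evaluate(runs: List[Run]) -> Dict[str, float]:
    """Average each metric over all queries."""
    n = len(runs)
    return {
        "recall@5": sum(recall_at_k(r, rel, 5) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
        "precision@1": sum(precision_at_1(r, rel) for r, rel in runs) / n,
    }


# Toy data (hypothetical chunk ids, not from the real dataset)
runs: List[Run] = [
    (["c7", "c2", "c9"], {"c2"}),                    # relevant chunk at rank 2
    (["c1", "c4", "c3", "c8", "c5", "c6"], {"c6"}),  # rank 6: a miss for recall@5
]
scores = evaluate(runs)
print(scores)
```

The second toy query shows why the metrics can disagree: the answer was retrieved, so MRR gives partial credit, but it sits outside the top five, so recall@5 counts it as a miss.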

The Static Chunking Approach


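from typing import List

def static_chunker(text: str, chunk_size: int = 512, overlap: int = 50) -> List[str]:
    """
    Simple, predictable, but often cuts through sentences.
    I've written this exact function at least 20 times.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])  # note: this slices characters, not tokens

        if end >= len(text):
            break  # text is used up; don't emit a chunk that is pure overlap

        # Step back by `overlap` so the next chunk repeats the tail of this one
        start = end - overlap

    return chunks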

I think I first wrote this in 2021. It's fine. It works. It's also the software equivalent of slicing a baguette with a ruler — clean, consistent, but you'll definitely cut through some raisins in unfortunate ways.

If you're building a quick prototype? Sure, static works. If users are relying on this for actual legal research? That's where the trouble starts.
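To make the "cutting through sentences" problem concrete, here's a small self-contained sketch. The helper names (`fixed_size_chunks`, `ends_mid_sentence`) and the sample text are my own for illustration, and the tiny chunk size is only there so the effect shows up on two sentences. The end-of-sentence check is a crude heuristic: abbreviations like "Abs." would fool it.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Character-based fixed-size slicing with overlap (same idea as a static chunker)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def ends_mid_sentence(chunk: str) -> bool:
    """Crude check: a chunk that doesn't end in ., ! or ? was probably cut mid-sentence."""
    return not chunk.rstrip().endswith((".", "!", "?"))


text = (
    "Die Kündigung bedarf der Schriftform. "
    "Die elektronische Form ist ausgeschlossen."
)
chunks = fixed_size_chunks(text, chunk_size=30, overlap=5)
cut = [c for c in chunks if ends_mid_sentence(c)]

for c in chunks:
    print(repr(c))
print(f"{len(cut)} of {len(chunks)} chunks end mid-sentence")
```

Running a check like this over a real corpus is a cheap way to see how often a fixed-size splitter breaks sentences before you ever look at retrieval metrics.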

The Dynamic Chunking Approach

Here's what I built around 1 AM, right after the cold brew finally started working:


from typing import List

import spacy

nlp = spacy.load("de_core_news_lg")  # German legal text, obviously

def dynamic_chunker(text: str, target_size: int = 512) -> List[str]:
    """
    Respects sentence boundaries and section headers.
    The coffee finally kicked in when I built this.
    """
    doc = nlp(text)
    chunks = []
    current_chunk = ""

    for sent in doc.sents:
        # Start a new chunk if adding this sentence would exceed the target
        if len(current_chunk) + len(sent.text) > target_size and current_chunk:
            chunks.append(current_chunk.strip())
            current_chunk = sent.text
        else:
            current_chunk += " " + sent.text

    # Don't forget the last chunk (learned this the hard way)
    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

That if current_chunk line at the bottom? Yeah. That was a 4-hour bug at 3 AM.

I kept getting missing final chunks and couldn't figure out why my recall numbers kept tanking. The debugger showed nothing obviously wrong. Turns out I was just... not appending the last chunk. The chunk was being built, the loop exited, and then it just vanished into the void. Classic.
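A check like the one below would have caught that bug in seconds instead of hours. It's a dependency-free sketch, not the project's test suite. `split_sentences` is a naive regex stand-in for spaCy's sentence segmentation (it would trip over abbreviations such as "Abs. 2"), and `pack_sentences` and `lost_sentences` are hypothetical names for the same greedy packing idea plus a coverage check.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack_sentences(sentences: List[str], target_size: int) -> List[str]:
    """Greedily pack whole sentences into chunks of roughly target_size characters."""
    chunks: List[str] = []
    current = ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > target_size:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}" if current else sent
    if current:  # flush the final chunk; leaving this out silently drops text
        chunks.append(current)
    return chunks


def lost_sentences(sentences: List[str], chunks: List[str]) -> List[str]:
    """Return every input sentence that does not appear in any chunk."""
    joined = " ".join(chunks)
    return [s for s in sentences if s not in joined]


text = (
    "Erster Satz. Zweiter Satz ist etwas länger. "
    "Dritter Satz. Vierter und letzter Satz."
)
sentences = split_sentences(text)
chunks = pack_sentences(sentences, target_size=40)

missing = lost_sentences(sentences, chunks)
print("lost sentences:", missing)
```

If the final flush is removed from `pack_sentences`, `missing` stops being empty, which is exactly the symptom described above.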

🔥 The key difference: Dynamic chunking respects natural boundaries — sentence endings, section breaks, semantic completeness. It doesn't just count tokens and slice.

The Results (With Real Numbers, Finally)

We tested both strategies across 150 real legal queries. These were anonymized queries from the firm's internal system, not synthetic examples I cooked up. Here's what happened:

| Strategy | Recall@5 | Avg Chunk Size | Processing Time |
|---|---|---|---|
| Static (512 tokens) | 67.3% | 512 | 0.4s/doc |
| Dynamic (target 512) | 82.6% | 487 | 1.2s/doc |
| Dynamic (target 256) | 78.1% | 241 | 0.9s/doc |

That's a 23% relative improvement in recall@5 (67.3% to 82.6%, or 15.3 percentage points) just by changing how we split text. Not a new embedding model. Not a fancy reranker. Just... respecting that sentences are meaningful units.
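Percent and percentage points get mixed up a lot, so here is the arithmetic on the recall numbers from the table. The variable names are mine. The values are the ones reported above.

```python
from typing import Dict, Tuple

baseline = 67.3  # static chunking, recall@5 in percent
candidates = {
    "Dynamic (target 512)": 82.6,
    "Dynamic (target 256)": 78.1,
}

# For each strategy: (absolute gain in percentage points, relative gain in percent)
gains: Dict[str, Tuple[float, float]] = {}
for name, recall in candidates.items():
    absolute = recall - baseline
    relative = absolute / baseline * 100
    gains[name] = (absolute, relative)
    print(f"{name}: +{absolute:.1f} points, +{relative:.1f}% relative")
```

The headline "23%" is the relative figure for the 512-target run. The absolute gap between the two recall values is smaller.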

I stared at these numbers for a while. Then I sent them to my colleague at 2 AM. He responded with "bro go to sleep."

He was right. But I was too jacked up on cold brew to sleep, so I just kept testing edge cases.

Three Times Dynamic Chunking Saved Us

1. The "Buried Definition" Problem

A query for "Schriftform" (written form requirement) kept failing with static chunks because the definition and the requirement appeared in separate chunks:


Static Chunk #47: "...die Schriftform ist erforderlich für..."
Static Chunk #48: "...gemäß § 126 BGB definiert als..."

The embedding for chunk #47 knew about the requirement, but not what it actually meant. Chunk #48 had the definition, but lost the context of why it mattered.

Dynamic chunking kept the definition intact with its context. Recall for definition-seeking queries jumped from 45% to 89%.

That's not an improvement. That's the difference between a usable system and expensive enterprise garbage.

2. The Section Header Disaster

German legal documents use numbered sections (like § 623, Abs. 2) extensively. Static chunking kept placing section headers at the bottom of one chunk with the actual content in the next chunk. The embeddings completely lost the connection between header and body.


# Static chunk boundary disaster:
# Chunk 45: "...Vertragskündigung gemäß § 623"
# Chunk 46: "Abs. 2: Die Kündigung bedarf..."

# Dynamic handled it properly:
# Chunk: "§ 623 Abs. 2: Die Kündigung bedarf der Schriftform..."

I found this by accident. Was manually scrolling through chunks at 11 PM, spotted the split, and just sat there facepalming for a solid 30 seconds.
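Scrolling through chunks by hand doesn't scale, so here's a rough sketch of how that check could be automated. The pattern and the function name are assumptions for illustration. It only flags chunks whose text ends in something like "§ 623" or "§ 623 Abs. 2", and a sentence that legitimately ends with a section reference and no period would be a false positive.

```python
import re
from typing import List

# Hypothetical heuristic: the chunk ends with a bare section reference
DANGLING_HEADER = re.compile(r"§\s*\d+[a-z]?(\s+Abs\.\s*\d+)?\s*$")


def find_dangling_headers(chunks: List[str]) -> List[int]:
    """Return indexes of chunks that end with a section reference and nothing after it."""
    return [i for i, chunk in enumerate(chunks) if DANGLING_HEADER.search(chunk)]


chunks = [
    "...Vertragskündigung gemäß § 623",
    "Abs. 2: Die Kündigung bedarf...",
    "§ 623 Abs. 2: Die Kündigung bedarf der Schriftform...",
]
flagged = find_dangling_headers(chunks)
print(flagged)  # [0]
```

Counting flagged chunks per strategy gives a quick structural sanity check that doesn't need any labeled queries.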

3. The Cross-Reference Nightmare

Some legal texts reference other sections mid-paragraph — like "gemäß § 242 BGB" appearing right in the middle of a § 623 discussion. Dynamic chunking kept these references within the same chunk, preserving the semantic relationships.

Static chunking? Complete coin flip whether the reference and its context survived together. Pure luck. Sometimes it worked, sometimes it didn't. Our users couldn't figure out why the system seemed to randomly forget about certain legal principles.

Spoiler: it wasn't random. It was our chunker.

The Hybrid Approach We Actually Deployed

Pure dynamic chunking was great for accuracy, but we needed speed too. The firm's system processes about 200 new documents daily. At 1.2 seconds per document, that was adding up — and they were planning to scale.

Here's what we ended up shipping:


import re
from typing import List

def hybrid_chunker(text: str, target_size: int = 512) -> List[str]:
    """
    Uses document structure hints first, falls back to dynamic.
    Best of both worlds. Took me way too long to realize this.
    """
    # Split before lines that start with a marker like "1. ", "§12. " or "A. ".
    # Headers written as "§ 623 ..." (space after §) are not caught by this pattern.
    sections = re.split(r'\n(?=[§\d]+\.\s|[A-C]\.\s)', text)

    chunks = []
    for section in sections:
        if not section.strip():
            continue  # skip empty pieces left over from the split
        if len(section) <= target_size:
            chunks.append(section)
        else:
            # Fall back to dynamic chunking for long sections
            chunks.extend(dynamic_chunker(section, target_size))

    return chunks

This gave us 84.2% recall while keeping processing reasonable. The regex isn't perfect — German legal formatting is, honestly, kind of chaotic — but it catches about 90% of section boundaries.
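To see what that split pattern does and doesn't catch, here's a small demo on a made-up snippet (the sample text is invented, not from the dataset). It uses the same regex as `hybrid_chunker`.

```python
import re

# Same pattern as in hybrid_chunker: split at a newline that is followed by
# "<digits or §>. " or by "A. ", "B. " or "C. "
SECTION_SPLIT = re.compile(r'\n(?=[§\d]+\.\s|[A-C]\.\s)')

sample = (
    "Präambel\n"
    "1. Vertragsgegenstand\n"
    "Der Anbieter erbringt die Leistung.\n"
    "2. Laufzeit\n"
    "§ 623 Abs. 2: Die Kündigung bedarf der Schriftform.\n"
    "A. Anlagen"
)

sections = SECTION_SPLIT.split(sample)
for i, section in enumerate(sections):
    print(i, repr(section))
```

The numbered headings and "A. Anlagen" each start a new section, but the line beginning with "§ 623 Abs. 2" stays glued to "2. Laufzeit" because the pattern expects a period directly after the marker. That's one example of the boundaries the regex misses.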

We deployed this on a Tuesday. By Thursday, the complaints about context understanding had basically stopped. No celebration. No high-fives. Just... silence. Which, when you're dealing with legal tech users, is actually the best outcome possible.

What I Wish I Knew Before Starting

💡 Three things I learned the hard way:

Berlin tech scene insight: The best chunking strategy discussion I had was at a Späti at 1 AM with an NLP researcher from TU Berlin. We were both grabbing Club Mate and somehow ended up debating sentence boundary detection for 45 minutes. Me with my practical engineering view, him with formal linguistic theory. Neither of us was fully right. Both of us learned something.

Sometimes the best debugging happens when you step away from the keyboard entirely.

The Code Is on GitHub

I've open-sourced our testing framework — includes the legal dataset generator (anonymized, obviously — can't share actual client data) and all chunking implementations.


# Quick start — should work out of the box
from chunking_compare import ChunkingBenchmark

benchmark = ChunkingBenchmark(
 documents=your_docs,
 queries=test_queries,
 chunkers=[static_chunker, dynamic_chunker, hybrid_chunker]
)

results = benchmark.run()
print(results.summary())

There's also a Docker setup if you don't want to deal with spaCy model dependencies. Trust me, getting de_core_news_lg installed properly at 2 AM is absolutely not the move. I learned this. So you don't have to.

What's Next?

I'm currently testing semantic-aware chunking — using embedding similarity to determine boundaries dynamically. The basic idea: calculate cosine similarity between adjacent sentences, then chunk at "topic shift" points where similarity drops below a threshold.

Early results? Promising but expensive.

Processing time jumped to 3.8 seconds per document with all-MiniLM-L6-v2. That's... a lot. We're experimenting with smaller models and smarter caching, but it's not production-ready yet. Threshold sensitivity is the real problem — small changes in the cutoff value create wildly different chunk sizes.
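Here's a minimal sketch of the idea, so the threshold problem is easier to picture. This is not the experimental chunker itself. To keep it runnable without any model download, `bag_of_words` is a toy stand-in for a real sentence embedding such as all-MiniLM-L6-v2, and `semantic_chunks` and the sample sentences are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def bag_of_words(sentence: str) -> Vector:
    """Toy embedding: lowercase word counts. Swap in a real model for actual use."""
    return dict(Counter(re.findall(r"\w+", sentence.lower())))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(
    sentences: List[str],
    threshold: float,
    embed: Callable[[str], Vector] = bag_of_words,
) -> List[str]:
    """Start a new chunk wherever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks: List[str] = []
    current = [sentences[0]]
    for prev_vec, vec, sent in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, vec) < threshold:  # topic shift: close the chunk
            chunks.append(" ".join(current))
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "Die Kündigung bedarf der Schriftform.",
    "Die Schriftform der Kündigung ist zwingend.",
    "Das Gericht tagt am Montag.",
]
for threshold in (0.0, 0.3, 0.8):
    print(threshold, len(semantic_chunks(sentences, threshold)))
```

Even on three sentences, moving the threshold changes the chunk count, which is the sensitivity issue in miniature. With a real embedding model the similarities bunch together more tightly, so small threshold changes can matter even more.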

I'll probably write about that in a few weeks if there's interest. Still debugging the threshold calibration, but I think I'm close.

What's your experience with chunking strategies? Have you found static chunking actually works better in certain domains? I'm especially curious about medical and technical documentation use cases. From what I've seen, medical texts have similar structure problems to legal docs — nested definitions, cross-references, context-dependent meaning — but I haven't tested it properly yet.

Drop a comment below. I'll be here with my coffee, probably debugging something completely different by now.

The semantic chunker, if I'm being honest.

🚀

Update: Someone on Hacker News pointed out that LangChain's RecursiveCharacterTextSplitter does something similar to our hybrid approach. I checked — it does, but it doesn't handle German section markers well out of the box. Might submit a PR when I have time between deployments.

#rag #nlp #python #beginners #webdev

| Strategy | Recall@5 | Avg Chunk Size | Processing Time |
|---|---|---|---|
| Hybrid (static + dynamic) | 84.2% | 510 | 1.5s/doc |
Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
