I Fixed My RAG System Without Touching the Model — A Chunking Story

By Cael Lee | 6 min read

Last week I pulled a client's knowledge base QA system from 67% accuracy to 89%. Didn't touch the model. Didn't swap embeddings. Changed exactly one thing: I killed fixed-size chunking and replaced it with semantic chunking.

Sounds simple, right?

The rabbit hole was deeper than I expected. And the bugs. The bugs were spectacular.

The Setup

I'd been running LangChain 0.1.9's RecursiveCharacterTextSplitter (chunk_size=512, overlap=50) for about six months without major issues. Solid, boring, predictable. Until March 15th, when a client lit up our shared Slack channel:

"Why does searching 'Q3 return rate last year' show me data from the year before? Is this thing deliberately stupid?"

I checked the logs. Oh boy.

The document mentioned "2023 Q3" at the end of one chunk, and "return rate data below" at the start of the next. Two pieces of the same answer, sitting on opposite sides of a chunk boundary like they'd had a fight. The embedding search never stood a chance.

That's the problem with fixed chunking. It counts characters like a ruler measures water — technically accurate, completely missing the point.
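To make the boundary problem concrete, here's a minimal sketch of a plain character-window splitter. It is a simplification, not LangChain's RecursiveCharacterTextSplitter (which also tries separators first), and the sample sentence and tiny sizes are made up so the effect is visible.

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical sample text; sizes are tiny on purpose.
doc = "Summary for 2023 Q3. The return rate data below covers all regions."
chunks = fixed_size_chunks(doc, chunk_size=24, overlap=4)

# No single chunk holds both halves of the answer.
split_apart = not any("2023 Q3" in c and "return rate" in c for c in chunks)
```

The splitter has no idea that the period and the metric belong together, so neither chunk can match a query that needs both.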

So I went down the semantic chunking rabbit hole. Here's what I tried, what broke, and what finally worked.

Three Approaches I Actually Tested

1. Sentence Embedding Similarity Chunking

The idea's straightforward: split your doc into sentences, compute cosine similarity between adjacent sentence embeddings, and cut wherever similarity takes a nosedive.

I ran SentenceTransformers' all-MiniLM-L6-v2 against 200 documents. The results were genuinely better than fixed chunking. A section about "Product A pricing strategy" followed by "competitive analysis" would drop from 0.85 similarity to 0.41 — boundaries so clean you could frame them.
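A rough sketch of the cut-on-similarity-drop idea, in plain Python. The 0.5 threshold and the hand-made 2D vectors are assumptions for illustration only; real embeddings would come from a model.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_on_similarity_drops(sentences, embeddings, threshold=0.5):
    """Start a new chunk wherever adjacent sentences fall below `threshold`."""
    if len(sentences) != len(embeddings):
        raise ValueError("need exactly one embedding per sentence")
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            groups.append([])  # similarity dropped: open a new chunk
        groups[-1].append(sentences[i])
    return [" ".join(group) for group in groups]


# Toy vectors standing in for model output (assumption, not real embeddings).
sentences = [
    "Product A is priced at a premium.",
    "Discounts apply to annual plans.",
    "Competitor B undercuts on entry tiers.",
]
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]
chunks = split_on_similarity_drops(sentences, embeddings)
```

With sentence-transformers installed, the vectors could instead come from `SentenceTransformer("all-MiniLM-L6-v2").encode(sentences)`. The threshold is something to tune on your own documents.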

But here's where it got ugly.

The computation cost was absurd. A 100,000-character document library took 43 minutes to chunk. My CPU screamed at 100% the entire time. Fan was going full jet engine.

And short sentences? Disaster. "OK." and "Next we'll look at..." would get flagged as boundaries. I ended up with a chunk that was literally three characters: "Spe" — the rest of "Specifically" had been sliced off into the next chunk.

Actually, let me correct that. It's not that "short sentence embeddings are unstable." The real issue — which I confirmed by digging through sentence-transformers GitHub issues — is that their model produces high-variance vector representations for sentences under 10 tokens. Someone filed this bug in June 2024. Still open.

I slapped on a minimum chunk size of 200 characters, forcing short sentences to merge upward. Ugly fix? Absolutely. But it worked.
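The merge-upward rule is small enough to sketch. This is an illustrative version, not the production code; joining with a single space and folding a short first chunk into the next one are assumptions.

```python
def merge_short_chunks(chunks: list[str], min_chars: int = 200) -> list[str]:
    """Fold any chunk shorter than `min_chars` into the chunk above it."""
    merged: list[str] = []
    for chunk in chunks:
        if merged and len(chunk) < min_chars:
            merged[-1] = f"{merged[-1]} {chunk}"  # merge upward
        else:
            merged.append(chunk)
    # A short first chunk has nothing above it, so fold it into the next one.
    if len(merged) > 1 and len(merged[0]) < min_chars:
        merged[1] = f"{merged[0]} {merged[1]}"
        merged.pop(0)
    return merged


# "OK." can no longer become a chunk of its own.
result = merge_short_chunks(["A" * 250, "OK.", "B" * 220])
```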

2. NLP Document Structure Chunking

This approach feels smarter — use NLP models to identify document structure (headings, paragraphs, lists, tables) and chunk along those natural boundaries.

I used Unstructured 0.12.5, which pulls apart PDFs into titles, body text, and tables. For technical docs and contracts, it was beautiful. Headings became perfect chunk summaries. Saved me so much metadata work.

Then I hit the table problem.

A client's financial report had a table spanning two pages. Unstructured split the header into one chunk and the data rows into another. User searches "Q2 2024 gross margin" — gets back the header row and nothing else. Pure silence where numbers should be.

I stared at that chunk for ten minutes and said something I won't repeat here.

Wrote a post-processing script to detect <table> tags and consecutive | delimiter patterns, then force-merge adjacent chunks regardless of similarity scores. The code's an embarrassment of if-else blocks, but at least tables stay intact now. I could optimize this further (store tables as structured data instead of plaintext chunks), but deadlines exist, so here we are.
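A minimal sketch of that kind of post-processing, not the actual script. It assumes table fragments show up either as HTML table tags or as runs of pipe-delimited rows; the `min_rows` cutoff and the placeholder cell values are made up.

```python
import re

# A line that starts and ends with a pipe, e.g. "| Q2 | x |" or "|---|---|".
PIPE_ROW = re.compile(r"^\s*\|.*\|\s*$")


def looks_like_table(chunk: str, min_rows: int = 2) -> bool:
    """Heuristic: HTML table tags, or `min_rows` consecutive pipe rows."""
    lowered = chunk.lower()
    if "<table" in lowered or "</table>" in lowered:
        return True
    run = 0
    for line in chunk.splitlines():
        run = run + 1 if PIPE_ROW.match(line) else 0
        if run >= min_rows:
            return True
    return False


def merge_table_fragments(chunks: list[str]) -> list[str]:
    """Glue a table-like chunk onto the previous chunk if that was one too."""
    merged: list[str] = []
    prev_is_table = False
    for chunk in chunks:
        is_table = looks_like_table(chunk)
        if merged and is_table and prev_is_table:
            merged[-1] = f"{merged[-1]}\n{chunk}"
        else:
            merged.append(chunk)
        prev_is_table = is_table
    return merged


# Header and data rows arrive as separate chunks (placeholder values).
chunks = [
    "Intro paragraph.",
    "| Quarter | Gross margin |\n|---|---|",
    "| Q1 | x |\n| Q2 | y |",
    "Closing note.",
]
merged = merge_table_fragments(chunks)
```

One known weakness of this heuristic: two genuinely separate tables that sit back to back get merged as well.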

3. LLM-Powered Semantic Chunking

Twitter's been hyping this one hard lately. Just let an LLM read the doc and output "cut here" markers. Makes intuitive sense — who understands semantics better than a language model?
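The model call itself depends on your provider, so this sketch only covers the part after it: turning marked-up output into chunks. The `<<<CUT>>>` marker is a hypothetical convention you'd ask for in the prompt, and the check that the model changed nothing except adding markers is an extra safeguard I'm assuming you'd want.

```python
CUT_MARKER = "<<<CUT>>>"  # hypothetical marker requested in the prompt


def split_on_markers(marked_text: str, original_text: str,
                     marker: str = CUT_MARKER) -> list[str]:
    """Split LLM output on cut markers, rejecting output that edits the text."""
    parts = marked_text.split(marker)
    # Removing the markers must give back the document exactly.
    if "".join(parts) != original_text:
        raise ValueError("model output altered the document text")
    return [part.strip() for part in parts if part.strip()]


original = "Risk factors. Legal disclaimer."
marked = "Risk factors. <<<CUT>>>Legal disclaimer."  # stand-in for model output
chunks = split_on_markers(marked, original)
```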

I tested GPT-4-turbo (gpt-4-0125-preview) on 50 documents. The chunk quality was stunning. Boundaries looked hand-annotated. In some places, I swear it did better than I would have done manually.

Two deal-breakers, though: cost and latency.

Plus, the LLM occasionally got clever in ways I didn't want. It merged "Risk Factors" and "Legal Disclaimer" into one chunk because they were "semantically related." User searches "investment risk" and gets back paragraphs of legalese. They closed the tab immediately. Brilliant user experience, truly.

What I Actually Use Now (After All That Pain)

Hybrid strategy. Here's the playbook:

  1. Rough cut with Unstructured — use document structure (headings, paragraphs) for initial splitting
  2. Embedding similarity check — scan each chunk internally, split further if similarity drops
  3. Hard constraints — minimum 200 characters, maximum 800
  4. Hands off tables and code blocks — they stay whole, period
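Steps 3 and 4 can be sketched like this, assuming steps 1 and 2 already ran upstream and produced a list of blocks. The `Block` type, the `atomic` flag, and the sentence-end regex are illustrative assumptions, not part of any library.

```python
import re
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    atomic: bool = False  # True for tables and code blocks: never touched


def split_long(text: str, max_chars: int) -> list[str]:
    """Greedily pack sentences into pieces of at most `max_chars`."""
    pieces, current = [], ""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            pieces.append(current)
        # A single oversized sentence is hard-sliced as a last resort.
        while len(sentence) > max_chars:
            pieces.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current = sentence
    if current:
        pieces.append(current)
    return pieces


def apply_constraints(blocks, min_chars=200, max_chars=800):
    """Enforce size limits on normal blocks; pass atomic blocks through whole."""
    out: list[Block] = []
    for block in blocks:
        if block.atomic:
            out.append(block)
            continue
        for piece in split_long(block.text, max_chars):
            prev = out[-1] if out else None
            fits = (prev is not None and not prev.atomic
                    and len(prev.text) + 1 + len(piece) <= max_chars)
            if len(piece) < min_chars and fits:
                prev.text = f"{prev.text} {piece}"  # merge short piece upward
            else:
                out.append(Block(piece))  # short pieces after a table stay short
    return out


result = apply_constraints([
    Block("A" * 300 + "."),
    Block("Short note."),
    Block("| a | b |\n| 1 | 2 |", atomic=True),
    Block("Tail."),
])
```

Note the trade-off in the last branch: a short piece that follows a table, or that would push its neighbor past the maximum, is left under the minimum rather than forced into a bad merge.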

This combo runs 3x faster than pure embedding chunking and scores 22 percentage points higher than fixed chunking. But the real win? Stability. No weird boundary decisions. No 2 AM phone calls.

Let me show you the numbers:

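| Chunking Method | Recall@5 | Chunk Count (vs. fixed) | Retrieval Latency |
|---|---|---|---|
| Fixed (512/50) | 0.67 | 100% | 2.3s |
| Pure embedding | 0.78 | 115% | 2.8s |
| Hybrid | 0.89 | 70% | 1.1s |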

Same test set, 200 queries. Fewer chunks means faster vector DB lookups. Sometimes the dumb optimizations pay off.
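For anyone reproducing this kind of comparison, here's one common way to compute Recall@k. The exact definition behind the numbers above isn't spelled out, so treat this as an assumption: per query, the share of labeled relevant chunks that show up in the top k, averaged over queries. The IDs below are made up.

```python
def recall_at_k(retrieved: dict, relevant: dict, k: int = 5) -> float:
    """Mean over queries of (relevant chunks found in top k) / (relevant chunks)."""
    scores = []
    for query_id, relevant_ids in relevant.items():
        if not relevant_ids:
            continue  # skip queries with no labeled answer
        top_k = set(retrieved.get(query_id, [])[:k])
        scores.append(len(top_k & set(relevant_ids)) / len(relevant_ids))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical IDs: q1 finds both relevant chunks, q2 finds none.
retrieved = {"q1": ["c1", "c9", "c3"], "q2": ["c7", "c8"]}
relevant = {"q1": {"c1", "c3"}, "q2": {"c2"}}
score = recall_at_k(retrieved, relevant, k=5)
```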

Things I Learned the Hard Way

Don't trust paper benchmarks. Real-world documents are chaos. Two-column PDF layouts. OCR-scanned documents with text soup. Someone embedding an Excel table inside a Word doc. I once got a contract with Chinese on the left, English on the right — Unstructured just gave up entirely. Had to preprocess that one by hand.

Overlap isn't your friend. I once set overlap=100 thinking "more context is better." Retrieved chunks had massive overlap. Users got duplicate information and thought the system was broken. Complaint email got CC'd to their CTO. Dropped it to 30. Problem vanished.

Match chunking to your embedding model. I tested BGE-large-en vs text-embedding-3-small — their optimal chunk sizes differed by nearly 200 tokens. Don't copy-paste someone else's config. Test it on your own data. Seriously.
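A tiny harness makes that test cheap to rerun per embedding model. This is only a sketch: `evaluate` is a hypothetical callback you supply that rebuilds the index at a given chunk size and returns a score such as Recall@5 on your own queries.

```python
from typing import Callable, Iterable


def best_chunk_size(candidate_sizes: Iterable[int],
                    evaluate: Callable[[int], float]) -> tuple[int, dict]:
    """Score every candidate size and return the winner plus all scores."""
    results = {size: evaluate(size) for size in candidate_sizes}
    if not results:
        raise ValueError("need at least one candidate size")
    best = max(results, key=results.get)  # first size wins on ties
    return best, results
```

Run it once per embedding model and keep the full `results` dict around. The runner-up is often close enough that latency or chunk count should decide.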

And here's one that drove me absolutely insane: in November 2024, OpenAI changed the default dimensions for text-embedding-3. My carefully tuned overlap parameters? Completely useless overnight. Retrieval accuracy dropped 5 points. I spent two days debugging my code before finding a Hacker News thread discussing the change. Nearly threw my keyboard across the room.

TL;DR

Semantic chunking dramatically improves RAG retrieval quality. But don't reach for LLMs first — the cost and latency aren't worth it for most real-world applications.

Start with NLP structure chunking + embedding similarity. It's the pragmatic sweet spot: fast enough for production, accurate enough to matter.

Most importantly: run A/B tests on your own data. Ignore everyone's "best practices" — including mine. Your document formats, your users' query patterns, your embedding model — change any one variable and the optimal strategy might be completely different.

What chunking approach are you using? Ever dealt with truly cursed document formats? I'll go first: I once got a PDF with vertical traditional Chinese text that embedded horizontal English tables. Unstructured refused to even open it.

Drop a comment below. I'll buy you a virtual coffee. (Virtual, because tech salaries only stretch so far.)

#rag #semanticchunking #nlp #llmops #vectordatabase #embeddings

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.