When Your RAG System Lies to You—and You Almost Believe It
Last week, our team was building an internal knowledge-base Q&A system, and the accuracy was stuck at 73%. Properly stuck. My boss wandered over asking what was going on, so I pulled up the logs. Here's what I found: out of the top 5 retrieved documents, 3 had absolutely nothing to do with the user's question. But the model? Oh, it was diligent. It took those irrelevant snippets and wove them into a perfectly coherent answer. My boss nodded along, genuinely impressed.
That's when it hit me.
The scariest thing about RAG systems isn't when the model hallucinates. It's when it hallucinates with such conviction—backed by seemingly relevant sources—that even you start doubting yourself.
The Retrieval Hallucination That Cost Me Two Weeks of Sleep
Q3 last year. I was leading a team building a financial report analyser for a client. The brief sounded dead simple: upload a PDF earnings report, ask questions, get answers. Basic CRUD work, I thought. How hard could it be?
Launch day. The client drops a test question in the group chat: "What's our Q3 gross margin?"
The system fires back instantly: "Based on the financial report, Q3 gross margin is 42.3%."
Thumbs-up emoji from the client. I'm feeling pretty smug.
Half an hour later, another message: "I've gone through the entire report. There's no mention of 42.3%. Where did your system get this number?"
My stomach dropped.
Digging through the logs, I discovered the retrieval module had pulled several passages about "gross profit"—but all from Q2. The model, armed with Q2 context, essentially thought, "Q3's probably similar," and conjured up a number that looked entirely reasonable. Here's the darkly funny part: the actual figure was 41.8%. It was close. Spooky close.
But let me be clear: that wasn't accuracy. It was luck. We tested with a different report later, and it confidently spat out numbers that were off by a mile.
This is what I now call "retrieval quality collapse": the retrieval gives the model wrong context, and the generation layer gift-wraps it as a correct answer. With unsettling confidence.
We eventually bolted on validation logic, tweaked retrieval parameters, and clawed our way to 85% accuracy. But that experience burned something into my brain: in RAG architecture, retrieval and generation aren't two independent stages. They're a pair of frenemies who constantly blame each other—and cover for each other.
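To make "validation logic" a bit more concrete, here is a minimal sketch of one check of that kind, offered as an illustration rather than the production code: flag any number in the answer that never appears in the retrieved passages. The function name, the regex, and the sample context string are all assumptions for the example, and plain string matching will miss reformatted figures (0.423 versus 42.3%, or numbers with thousands separators).

```python
import re

# Matches plain decimals with an optional percent sign. The lookbehind skips
# digits glued to letters (the "3" in "Q3") and the tail of a decimal.
NUMBER_PATTERN = re.compile(r"(?<![\w.])\d+(?:\.\d+)?%?")


def ungrounded_numbers(answer: str, contexts: list[str]) -> list[str]:
    """Return numbers from the answer that appear in none of the contexts."""
    context_numbers: set[str] = set()
    for text in contexts:
        context_numbers.update(NUMBER_PATTERN.findall(text))
    return [n for n in NUMBER_PATTERN.findall(answer) if n not in context_numbers]


if __name__ == "__main__":
    answer = "Based on the financial report, Q3 gross margin is 42.3%."
    # Hypothetical Q2 passage standing in for what retrieval returned.
    contexts = ["Q2 gross margin was 40.1%."]
    print(ungrounded_numbers(answer, contexts))
```

Anything this returns is a reason to block the answer or send it back for regeneration; an empty list only means the digits exist somewhere in the context, not that they answer the question.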
Three Cases That Made Me Rethink RAG Entirely
Case 1: The Butterfly Effect of Chunk Size
Last year I helped an e-commerce team build a product description generator. We got stuck on the most basic question: how big should the document chunks be?
We tried 512 tokens first. Retrieval was blazing fast. But when users asked "How's the battery life on this phone?", the returned chunks were all fluffy marketing speak—"massive battery," "all-day power"—with zero specific numbers. The model, having nothing concrete to work with, invented "up to two days of battery life." The official spec? 18 hours.
So we switched to 256 tokens. Retrieval precision shot up. New problem: when users asked "How does this compare to competitor X?", the small chunks meant retrieval only found one product's description. The model had no idea what competitor X even was, but it happily compared the two anyway.
Bit of a pickle, that.
Our eventual fix was a dynamic chunking strategy with overlap windows: 256 tokens for technical specs, 512 tokens for descriptive content, with 50-100 token overlaps between adjacent chunks. Retrieval accuracy jumped from 68% to 89%. Engineering complexity? Doubled. I remember we were on LangChain 0.1.16 at the time—the Document Transformer's overlap window support was, let's say, patchy. Three days of source code wrangling to make it work properly.
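For anyone who wants the shape of it without the LangChain wrangling, here is a rough, library-free sketch of per-content-type chunking with overlap. It is not the code from that project: the profile names, the 64-token overlap (picked from inside the 50-100 range), and the whitespace "tokenizer" are all assumptions for illustration.

```python
def chunk_tokens(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Hypothetical profiles: (chunk size, overlap) per content type.
CHUNK_PROFILES = {
    "spec": (256, 64),
    "descriptive": (512, 64),
}


def chunk_document(text: str, content_type: str) -> list[str]:
    """Chunk text using the profile for its content type."""
    size, overlap = CHUNK_PROFILES[content_type]
    tokens = text.split()  # whitespace split stands in for a real tokenizer
    return [" ".join(chunk) for chunk in chunk_tokens(tokens, size, overlap)]
```

The hard part in practice is deciding which profile a passage belongs to; this sketch assumes that label already exists.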
Case 2: The Embedding Model Trap
Lots of folks assume newer embedding models are always better. I've got the scars to prove otherwise.
Early this year, building a legal document retrieval system, I lazily went with the latest BGE-M3. During testing, something odd kept happening: queries about "employment contract termination conditions" would return chunks about "labour dispatch agreement termination." Semantically related? Sure. In legal practice? Completely different beasts. I actually checked with the solicitor we were working with—she said mixing these up could trigger genuine legal liability.
Switching to a legally-fine-tuned BERT model immediately boosted accuracy by 12 percentage points. The trade-off? Inference speed dropped about 40%. We ended up adding a rerank step: coarse retrieval with the general model, then fine-grained ranking with the domain-specific one. I think we used Cohere's Rerank API, v3. Expensive as hell, but the results were night and day.
This architecture seems obvious in hindsight, but I had to write two full pages of ROI analysis to convince our CTO to swallow that 40% inference slowdown, plus the rerank API bill. Brought in the legal director to help argue the case. Took about two weeks to get sign-off.
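Stripped of vendors, the two-stage pattern is small enough to sketch. The version below assumes nothing about Cohere or any embedding model: both scorers are plain callables, and the two toy scorers (word overlap and exact phrase match) are hypothetical stand-ins for a general embedding model and a domain-specific reranker.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    docs: list[str],
    coarse_score: Scorer,
    fine_score: Scorer,
    coarse_k: int = 50,
    final_k: int = 5,
) -> list[str]:
    """Shortlist with a cheap scorer, then reorder the shortlist with an expensive one."""
    shortlist = sorted(docs, key=lambda d: coarse_score(query, d), reverse=True)[:coarse_k]
    return sorted(shortlist, key=lambda d: fine_score(query, d), reverse=True)[:final_k]


# Toy stand-ins, for illustration only.
def word_overlap(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_match(query: str, doc: str) -> float:
    return 1.0 if query.lower() in doc.lower() else 0.0
```

The point of the split is cost: the expensive scorer only ever sees `coarse_k` candidates, so its latency and price are bounded regardless of corpus size.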
Case 3: When Better Retrieval Makes Things Worse
Here's a counterintuitive one. Honestly, this is the finding that's made me rethink the most.
We were optimising a medical Q&A system and managed to push retrieval accuracy from 80% to 95%. Should be a win, right?
User satisfaction dropped.
Took us ages to figure out why: the retrieval was too precise. It was pulling highly specialised medical literature verbatim—full of terms like "glucocorticoids" and "angiotensin-converting enzyme inhibitors." Normal humans don't talk like that. The model, armed with this hyper-technical context, generated answers that were technically flawless but completely impenetrable to patients.
When retrieval was a bit sloppier, the model actually translated the content into plain language on its own. Users loved it.
The irony's thick enough to cut. The metric for retrieval quality might not be "how accurate is it?" but rather "how useful is it for generation?" These two things are often at odds.
What I Actually Do Now
After face-planting enough times, I've cobbled together a set of practical strategies. Not claiming they're optimal, but they've held up decently across our team's projects.
1. Measure Retrieval by "Generation-Friendliness"
Traditional metrics like Recall and Precision still matter, but the ultimate question is: given this context, what's the probability the model generates a correct answer? We now run end-to-end evaluations for every retrieval experiment rather than just looking at retrieval metrics in isolation. Takes about 40 minutes per run. Worth every second.
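A bare-bones version of that kind of evaluation might look like the sketch below. Everything here is an assumption for illustration: the `EvalCase` fields, the `(doc_id, text)` shape returned by `retrieve`, and the idea that correctness can be judged by a simple callable. The useful part is that both numbers come out of the same run, so you can see when they diverge.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    expected_answer: str
    relevant_doc_id: str


def evaluate(
    cases: list[EvalCase],
    retrieve: Callable[[str], list[tuple[str, str]]],  # returns (doc_id, text) pairs
    generate: Callable[[str, list[str]], str],
    is_correct: Callable[[str, str], bool],
) -> dict[str, float]:
    """Report retrieval hit rate and end-to-end answer accuracy side by side."""
    if not cases:
        raise ValueError("need at least one evaluation case")
    retrieval_hits = 0
    answer_hits = 0
    for case in cases:
        retrieved = retrieve(case.question)
        if any(doc_id == case.relevant_doc_id for doc_id, _ in retrieved):
            retrieval_hits += 1
        answer = generate(case.question, [text for _, text in retrieved])
        if is_correct(answer, case.expected_answer):
            answer_hits += 1
    return {
        "retrieval_hit_rate": retrieval_hits / len(cases),
        "end_to_end_accuracy": answer_hits / len(cases),
    }
```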
2. Tag Retrieved Results with Confidence Labels
We attach a signal to each retrieved chunk: exact match, semantic similarity, or weak relevance. The generation model uses these tags to calibrate its "fabrication tendency"—it'll confidently quote exact matches but proactively say "I'm not entirely sure" for weak relevance. I borrowed this idea from studying Anthropic's system prompt designs. For implementation, we use Langfuse to track tag accuracy. Works surprisingly well.
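As a rough sketch of the tagging step (not our production rules): the labels below come straight from the three categories above, but how a chunk earns each label is an assumption. Here "exact match" means the query string appears verbatim in the chunk, and the 0.75 similarity cut-off is a made-up number.

```python
def label_chunk(
    query: str,
    chunk: str,
    similarity: float,
    semantic_threshold: float = 0.75,  # hypothetical cut-off
) -> str:
    """Assign one of three confidence labels to a retrieved chunk."""
    if query.lower() in chunk.lower():
        return "exact_match"
    if similarity >= semantic_threshold:
        return "semantic_similarity"
    return "weak_relevance"


def build_context(query: str, scored_chunks: list[tuple[str, float]]) -> str:
    """Prefix each chunk with its label so the generation prompt can refer to it."""
    lines = []
    for chunk, similarity in scored_chunks:
        lines.append(f"[{label_chunk(query, chunk, similarity)}] {chunk}")
    return "\n".join(lines)
```

The prompt then needs a matching instruction, for example telling the model to quote only from `exact_match` chunks and to state its uncertainty when all it has is `weak_relevance`.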
3. Build Adversarial Test Cases Deliberately
Before any launch, we construct a batch of questions designed to break retrieval. Questions about figures that don't exist in the source documents. Questions that require cross-document synthesis. The real test is seeing whether the model admits ignorance or doubles down on fabrication.
In our November 2024 test run, the model still "confidently hallucinated" on 12% of adversarial cases. Adding one line to the prompt—"If you're uncertain, please state this explicitly"—dropped it to 3%.
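A minimal harness for that measurement could look like this. It is a sketch under loud assumptions: the refusal markers are a hypothetical keyword list (a real run would want a human or model-based judge), and `answer_fn` stands in for the whole RAG pipeline.

```python
from typing import Callable

# Hypothetical phrases treated as "the model admitted it doesn't know".
REFUSAL_MARKERS = (
    "i'm not sure",
    "i am not sure",
    "not mentioned",
    "cannot find",
    "don't know",
)


def admits_uncertainty(answer: str) -> bool:
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def hallucination_rate(
    unanswerable_questions: list[str],
    answer_fn: Callable[[str], str],
) -> float:
    """Share of unanswerable questions that still got a confident answer."""
    if not unanswerable_questions:
        raise ValueError("need at least one adversarial question")
    confident = sum(
        1 for question in unanswerable_questions
        if not admits_uncertainty(answer_fn(question))
    )
    return confident / len(unanswerable_questions)
```

Run it once per prompt variant on the same question set and compare the two rates; that is the comparison behind a "12% versus 3%" style result.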
4. Degrade Gracefully When Retrieval Fails
This is the most overlooked piece. I now require every RAG system to have a fallback for when retrieval confidence dips below a threshold. Either refuse to answer or clearly flag: "The following response is based on incomplete information and should be treated as indicative only." Our threshold's set at 0.65. I literally pulled that number out of thin air; later A/B testing suggested 0.7 works slightly better, but the difference is marginal.
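In code, the fallback is only a few lines. The sketch below assumes "retrieval confidence" means the best similarity score among the retrieved chunks, which is one reasonable reading, not the only one. The refusal wording and the `on_low_confidence` switch are hypothetical; the disclaimer text and the 0.65 default are the ones quoted above.

```python
from typing import Callable

DISCLAIMER = (
    "The following response is based on incomplete information "
    "and should be treated as indicative only."
)
# Hypothetical refusal wording.
REFUSAL = "I couldn't find enough relevant information to answer that."


def answer_with_fallback(
    question: str,
    retrieved: list[tuple[str, float]],  # (chunk text, similarity score)
    generate: Callable[[str, list[str]], str],
    threshold: float = 0.65,
    on_low_confidence: str = "flag",  # "flag" or "refuse"
) -> str:
    """Answer normally when retrieval looks solid; otherwise refuse or flag."""
    if on_low_confidence not in ("flag", "refuse"):
        raise ValueError("on_low_confidence must be 'flag' or 'refuse'")
    # Assumption: confidence is the best score among retrieved chunks.
    best_score = max((score for _, score in retrieved), default=0.0)
    texts = [text for text, _ in retrieved]
    if best_score >= threshold:
        return generate(question, texts)
    if on_low_confidence == "refuse":
        return REFUSAL
    return f"{DISCLAIMER}\n\n{generate(question, texts)}"
```

Whether the default should be "flag" or "refuse" is exactly the product decision that needs settling up front; the code just makes the choice explicit.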
The Real Question to Ask Before You Build
At its core, the retrieval-generation dynamic in RAG is an information transfer problem. Retrieval says, "Here's what I found." Generation says, "Right, I'll answer based on that." But retrieval never communicates its uncertainty, and generation never articulates what kind of information it actually needs.
When I lead RAG projects now, the first thing I do isn't model selection or architecture. It's getting product and business stakeholders in a room to hash out one question: when retrieval gets it wrong, would you rather the system stay silent or take a guess?
There's no universal answer. Medical contexts and e-commerce contexts have wildly different tolerance levels. But you need to decide before a single line of code is written.
What's your experience? Have you run into situations where retrieval was spot-on but generation still face-planted? Or the reverse—retrieval was a mess but the model somehow guessed correctly? I'd genuinely love to hear your war stories. I've been looking at ByteDance's Doubao and Kimi's RAG approaches lately—seems like everyone's piling into multi-path retrieval. Do you reckon that's the right direction?
Drop your thoughts in the comments.
#RAG #LLM #AIEngineering #MachineLearning #TechLeadership
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.