I Fixed My RAG System's Hallucination Problem by Beating Up Bad Documents
Last year I built an enterprise knowledge base Q&A system. During testing, the RAG accuracy was a miserable 67%. Users weren't gentle: "Why does your AI keep making stuff up?"
After digging around, I found the problem wasn't the retrieval itself. The documents coming back were poisoned — the top-ranked results had nothing to do with the actual question. The model was force-fed garbage and had to improvise. Hallucination city.
Here's the fix: between retrieval and generation, kill or demote those unreliable documents so the model only sees genuinely useful context. The industry calls this Post-Retrieval Processing & Re-ranking, and it's one of the most direct ways to curb RAG hallucinations.
Let me show you how it works.
Why Retrieved Documents Sabotage Your Model
Here's something people forget: vector search isn't magic.
It finds semantic similarity, not semantic relevance.
Let me give you an example. A user asks "What are Redis persistence methods?" Vector search might return an article about Redis cluster setup. Why? Because words like "persistence," "RDB," and "AOF" appear all over it, so the embedding similarity score is high. But that article never actually explains persistence methods systematically. The model gets this document and has to wing it — of course it hallucinates.
I once hit an even more absurd case. The system had a batch of ops manuals — pure commands and config examples, almost no natural language. A user asked "How do I configure Nginx reverse proxy?" The top-3 retrieved results were nginx.conf snippets from three different projects. The model looked at this pile of configs and decided to freestyle — it generated a completely fictional proxypassdynamic directive.
Wait, I need to correct myself. It wasn't proxypassdynamic — it was something called proxypassbackend, and the syntax looked terrifyingly real. Our frontend dev spent hours debugging before realizing it was a hallucination. This became an inside joke on the team. Now whenever someone asks "Is this config correct?" someone else inevitably fires back: "You sure it's not proxypassbackend?"
The core contradiction: Vector search looks at overall semantic similarity, but Q&A needs precise information matching. The retrieved documents might look right, but they're actually wrong.
The Pre-Ranking Dirty Work
Before we get to the fancy re-ranking stuff, there's some grunt work to handle. I've got three techniques that are cheap and effective.
1. Similarity Threshold Filtering
The simplest, most brutal approach: set a similarity score cutoff. Anything below it gets nuked.
# Quick example: keep only documents whose similarity score clears the cutoff
SIMILARITY_THRESHOLD = 0.75
filtered_docs = [
    doc for doc in retrieved_docs
    if doc.score >= SIMILARITY_THRESHOLD
]
But there's a catch. Absolute similarity scores vary widely between embedding models. With text-embedding-ada-002, scores typically float above 0.8. Switch to bge-large-en-v1.5 and they might drop to 0.6. Your threshold needs testing against your specific model and use case; don't just guess.
My approach: grab 200-300 labeled query-document pairs, plot a histogram of their similarity scores, and find the boundary between relevant and irrelevant docs. I usually pick the cutoff that maximizes F1. I learned this at the AI Engineer World's Fair in June 2024; the speaker even shared their internal threshold tuning script. Too bad it wasn't open-sourced.
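Here's a minimal sketch of that F1 sweep, not the script from the talk. It assumes you already have one similarity score and one relevant/irrelevant label per example; the function names `f1_at_threshold` and `best_threshold` are made up for illustration.

```python
from typing import List, Tuple


def f1_at_threshold(scores: List[float], labels: List[bool], threshold: float) -> float:
    """F1 of the rule 'keep the document if score >= threshold'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores: List[float], labels: List[bool]) -> Tuple[float, float]:
    """Try every observed score as a cutoff and return (threshold, f1)."""
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and the same length")
    candidates = sorted(set(scores))
    # On an F1 tie, prefer the higher (stricter) threshold.
    best = max(candidates, key=lambda t: (f1_at_threshold(scores, labels, t), t))
    return best, f1_at_threshold(scores, labels, best)


# Toy data standing in for the 200-300 labeled examples.
example_scores = [0.90, 0.85, 0.80, 0.70, 0.60, 0.50]
example_labels = [True, True, True, False, False, False]
threshold, f1 = best_threshold(example_scores, example_labels)
```

Only observed scores are tried as cutoffs, which is enough because F1 can only change at those values. Rerun the sweep whenever you swap embedding models.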
2. Deduplication and Redundancy Removal
Those top-k retrieved documents? They're often packed with duplicate content. Especially in enterprise knowledge bases where the same document gets saved in multiple versions, or crawlers snag the same article from different mirrors.
Duplicate content doesn't just waste tokens — it makes the model obsess over certain information while ignoring other crucial context.
I typically use two approaches:
- Exact dedup: MD5-hash the document content, kill identical hashes
- Fuzzy dedup: Calculate Jaccard similarity or n-gram overlap between documents. If it exceeds a threshold (say, 0.8), keep only the highest-scored version
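Exact dedup fits in a few lines. This is a rough sketch: the `Doc` class is a hypothetical stand-in for whatever your retriever returns, and collapsing whitespace before hashing is an assumption you can drop if you want byte-exact matching.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class Doc:
    # Hypothetical document shape used only for this example.
    content: str
    score: float


def exact_dedup(docs: List[Doc]) -> List[Doc]:
    """Keep the first document seen for each distinct content hash."""
    seen = set()
    unique = []
    for doc in docs:
        # Assumption: whitespace differences should not count as different content.
        normalized = " ".join(doc.content.split())
        digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


docs = [Doc("Redis supports RDB and AOF.", 0.91),
        Doc("Redis  supports RDB and AOF.", 0.88),
        Doc("Nginx reverse proxy basics.", 0.80)]
unique_docs = exact_dedup(docs)
```

If the input is sorted by score, keeping the first occurrence also keeps the highest-scored copy.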
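import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def deduplicate_docs(docs, threshold=0.8):
    """Drop near-duplicates. Assumes docs are sorted by retrieval score, best first."""
    if not docs:
        return []
    # binary=True records n-gram presence, which is what set-based Jaccard needs
    vectorizer = CountVectorizer(ngram_range=(2, 3), binary=True)
    vectors = vectorizer.fit_transform([doc.content for doc in docs]).toarray().astype(bool)
    keep = []
    for i in range(len(docs)):
        is_duplicate = False
        for j in keep:
            # Jaccard = shared n-grams / all n-grams across the two documents
            union = np.logical_or(vectors[i], vectors[j]).sum()
            intersection = np.logical_and(vectors[i], vectors[j]).sum()
            if union and intersection / union >= threshold:
                is_duplicate = True
                break
        if not is_duplicate:
            keep.append(i)
    return [docs[i] for i in keep]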
Honestly, this code crawls in production. Once you're past 20 documents, the Jaccard computation time blows up. I've since shifted to MinHash for approximate deduplication — trades a bit of precision for actual speed.
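For a feel of what MinHash does, here's a standard-library sketch. It is not my production code: the 64-hash signature length, the word 3-gram shingles, and the use of salted `blake2b` as the hash family are all assumptions picked for readability.

```python
import hashlib
from typing import List, Set

NUM_HASHES = 64  # assumed signature length; more hashes = better estimate, more CPU


def shingles(text: str, n: int = 3) -> Set[str]:
    """Word n-grams; very short texts fall back to a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(items: Set[str], num_hashes: int = NUM_HASHES) -> List[int]:
    """For each salted hash function, record the smallest hash over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching slots approximates the Jaccard similarity."""
    return sum(1 for a, b in zip(sig_a, sig_b) if a == b) / len(sig_a)


def minhash_dedup(texts: List[str], threshold: float = 0.8) -> List[int]:
    """Return indices to keep; assumes texts are ordered best-first."""
    signatures = [minhash_signature(shingles(t)) for t in texts]
    keep: List[int] = []
    for i, sig in enumerate(signatures):
        if all(estimated_jaccard(sig, signatures[j]) < threshold for j in keep):
            keep.append(i)
    return keep
```

Each comparison now touches a fixed 64 integers instead of a vocabulary-sized vector. The loop is still pairwise, though; at larger scale you'd add LSH banding on top, which the `datasketch` library provides as `MinHashLSH`.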
3. Document Chunking and Context Window Optimization
This one's easy to overlook.
Often the document is relevant, but the chunking strategy butchered it — the retrieved chunk is missing critical context.
I once worked on a contract review system where users asked "Which article contains the liability clause?" The retrieved chunk had the liability content perfectly, but the article number was in the previous chunk. The model just... invented one. This gets complicated because contract formatting varies so much — some use "Article X," others use "X.," and some just use bullet points. Hard to standardize.
The fix: use small chunks for retrieval, then expand the context window when feeding the model. For example, retrieve with 256-token chunks but return them to the model padded with 128 tokens before and after. LangChain's ParentDocumentRetriever follows this logic, though I find LlamaIndex's SentenceWindowNodeParser more flexible to configure.
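A bare-bones sketch of the small-chunk, wide-window idea, independent of LangChain or LlamaIndex. It assumes the document is already tokenized and uses whitespace-split words as a stand-in for real model tokens; `split_into_chunks` and `expand_chunk` are illustrative names.

```python
from typing import List


def split_into_chunks(tokens: List[str], chunk_size: int = 256) -> List[List[str]]:
    """Fixed-size chunks used for indexing and retrieval."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def expand_chunk(tokens: List[str], chunk_index: int,
                 chunk_size: int = 256, padding: int = 128) -> List[str]:
    """Return the retrieved chunk plus `padding` tokens on each side, clamped to the document."""
    start = max(0, chunk_index * chunk_size - padding)
    end = min(len(tokens), (chunk_index + 1) * chunk_size + padding)
    return tokens[start:end]


# Toy document of 1000 "tokens"; suppose retrieval matched chunk 1.
document_tokens = [str(i) for i in range(1000)]
chunks = split_into_chunks(document_tokens)
context_for_llm = expand_chunk(document_tokens, chunk_index=1)
```

In the contract example, the extra 128 tokens in front of the liability text is what would carry the article number into the prompt.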
Re-Ranking: The Main Event
Everything above is just appetizers. Re-ranking is the main course.
The core idea is dead simple: use a more accurate (but slower) model to rescore and reorder your candidate documents.
Cross-Encoder Re-Ranking
Vector search uses Bi-Encoders — the query and document get encoded separately, then compared. Fast, but precision is limited. Cross-Encoders mash the query and document together as a single input, letting the model directly judge relevance. High precision, slower speed.
The typical architecture: Bi-Encoder retrieves top-100 from your massive document store, Cross-Encoder rescores those 100 and surfaces the top-5 for the LLM.
from sentence_transformers import CrossEncoder

# Load a Cross-Encoder model
reranker = CrossEncoder('BAAI/bge-reranker-large')

# Rescore the retrieval results
pairs = [[query, doc.content] for doc in retrieved_docs]
scores = reranker.predict(pairs)

# Sort by the new scores, highest first
reranked = sorted(
    zip(retrieved_docs, scores),
    key=lambda x: x[1],
    reverse=True
)
In my actual project, adding bge-reranker-large boosted Q&A accuracy from 72% to 89%. The tradeoff? About 300-500ms of added latency per query. For non-real-time use cases, that's completely worth it.
The Re-Ranking Model Selection Trap
Don't reach for the biggest model immediately.
bge-reranker-v2-m3 is genuinely powerful, but its inference latency is more than double the large version. If latency matters, try bge-reranker-base first, or even accelerate with ONNX. I benchmarked on an A10 instance: ONNX-quantized base model inference took ~15ms, large model ~40ms, and v2-m3 shot up to 90ms.
Another trap: Cross-Encoders have input length limits, typically 512 tokens, and that budget covers the query and the document together. If your document chunks are too long, they get truncated. This is where the earlier chunking strategy pays off again: keep retrieval chunks under 300 tokens to leave breathing room for re-ranking.
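A quick pre-flight check can flag chunks that would blow the budget. This is a sketch under assumptions: `whitespace_token_count` is a crude stand-in for your reranker's real tokenizer (subword tokenizers usually produce more tokens), and the allowance of 4 special tokens is a guess you should replace with your model's actual value.

```python
from typing import Callable, List


def whitespace_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer; swap in your reranker's tokenizer."""
    return len(text.split())


def fits_reranker(query: str, document: str, max_tokens: int = 512,
                  special_tokens: int = 4,
                  count_tokens: Callable[[str], int] = whitespace_token_count) -> bool:
    """True if query + document + special tokens fit in the model's input window."""
    return count_tokens(query) + count_tokens(document) + special_tokens <= max_tokens


def find_truncated(query: str, documents: List[str], **kwargs) -> List[int]:
    """Indices of documents that would be cut off when paired with this query."""
    return [i for i, doc in enumerate(documents)
            if not fits_reranker(query, doc, **kwargs)]


query = "what are redis persistence methods"
documents = ["word " * 300, "word " * 600]
too_long = find_truncated(query, documents)
```

Logging how often this fires is a cheap way to find out whether your chunk size and your reranker actually agree.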
LLM-as-Reranker
A trendy approach: use the LLM itself for re-ranking. Give it a prompt to judge document relevance, or even assign scores.
You are a document relevance evaluator. Determine whether the following
document can answer the user's question. Reply only "relevant" or
"not relevant" — no explanations.
User question: {query}
Document content: {document}
Relevance judgment:
This approach is genuinely accurate, but the cost and latency are brutal. I only use it when the final candidate set is tiny (3-5 documents) or for accuracy-critical domains like medical or legal Q&A.
A compromise: small model for coarse ranking, LLM for fine ranking. Cross-Encoder narrows top-20 to top-5, then LLM filters top-5 to top-3. Controls cost while preserving accuracy. From what I understand, Cohere's Compass framework (launched late 2024) follows a similar pattern, though with their own models.
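The coarse-then-fine cascade could look roughly like this. It is a sketch, not anyone's production pipeline: `cross_score` and `llm_is_relevant` are hypothetical callables you would back with a Cross-Encoder and an LLM call, and `parse_verdict` assumes the model follows the "relevant" / "not relevant" reply format from the prompt above.

```python
from typing import Callable, List


def parse_verdict(reply: str) -> bool:
    """Map the LLM's one-phrase reply to a boolean; anything unexpected counts as not relevant."""
    return reply.strip().strip('".').lower() == "relevant"


def cascade_rerank(query: str, docs: List[str],
                   cross_score: Callable[[str, str], float],
                   llm_is_relevant: Callable[[str, str], bool],
                   coarse_k: int = 5, final_k: int = 3) -> List[str]:
    """Stage 1: cheap scorer orders everything. Stage 2: LLM judges only the shortlist."""
    ordered = sorted(docs, key=lambda d: cross_score(query, d), reverse=True)
    shortlist = ordered[:coarse_k]
    # The expensive call runs at most coarse_k times per query.
    kept = [d for d in shortlist if llm_is_relevant(query, d)]
    return kept[:final_k]


# Toy stand-ins so the example runs without any model.
def word_overlap(query: str, doc: str) -> float:
    return float(len(set(query.split()) & set(doc.split())))


def keyword_judge(query: str, doc: str) -> bool:
    return "persistence" in doc


candidates = ["redis rdb aof persistence", "redis cluster setup", "nginx proxy",
              "redis persistence overview", "kafka topics", "redis aof rewrite"]
top_docs = cascade_rerank("redis persistence", candidates, word_overlap, keyword_judge,
                          coarse_k=4, final_k=2)
```

Treating any unexpected reply as "not relevant" is a conservative choice; whether that is right depends on how costly a dropped document is for you.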
Real-World Case: Optimizing a Financial Q&A System
Let me share a project from last year — an internal research report Q&A system for a brokerage firm.
Initial state:
- Retrieval: text-embedding-ada-002 + FAISS
- Fed top-5 directly to GPT-4
- Accuracy: 71%
Round 1 optimization (April 2024):
- Added similarity threshold filtering (threshold 0.78)
- Added document deduplication
- Accuracy: 76% (+5 points)
Round 2 optimization (May 2024):
- Introduced bge-reranker-large for re-ranking
- Retrieved top-20, re-ranked to top-5
- Accuracy: 87% (+11 points)
Round 3 optimization (June 2024):
- Adjusted chunking strategy for research reports (by paragraph + tables)
- Added metadata filtering (only retrieve reports from the past year)
- Accuracy: 92% (+5 points)
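The metadata filter from Round 3 is simple in principle. Here's a sketch that post-filters candidates by publication date; the `Report` class and its `published` field are hypothetical, and in a real system you would more likely push this condition into the vector store query itself.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class Report:
    # Hypothetical metadata shape for a research report.
    title: str
    published: date


def recent_only(reports: List[Report], today: date, max_age_days: int = 365) -> List[Report]:
    """Keep reports published within the last `max_age_days` days."""
    cutoff = today - timedelta(days=max_age_days)
    return [r for r in reports if r.published >= cutoff]


reports = [Report("Q1 2024 sector outlook", date(2024, 1, 15)),
           Report("2023 mid-year review", date(2023, 5, 1))]
fresh = recent_only(reports, today=date(2024, 6, 30))
```

Passing `today` in explicitly keeps the function easy to test and avoids surprises around time zones.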
Across the three rounds, roughly three weeks of actual optimization work pushed accuracy from 71% to 92%. My biggest takeaway: post-retrieval processing has way higher ROI than optimizing retrieval itself. Changing embedding models or tuning vector DB parameters often has subtle effects. But adding a processing layer between retrieval and generation? Immediate impact.
Oh, and there was one hilarious incident. I noticed accuracy suddenly dropped to 80%. After hours of debugging, I discovered an intern had set the re-ranking model threshold to 0.99 — almost all documents were getting filtered out. Lesson learned: don't push config changes on Friday afternoons.
Practical Recommendations
- Analyze failure cases before touching anything. Grab the hallucination examples and inspect what went wrong with the retrieved documents. Are they irrelevant? Relevant but incomplete? Too many documents drowning the signal? Different problems, different solutions.
- Re-ranking isn't a silver bullet. If your top-20 retrieved results contain zero relevant documents, re-ranking can't save you. Time to revisit your retrieval strategy: adjust chunk sizes, improve your embedding model, or introduce keyword search for hybrid recall. I typically use BM25 + vector search hybrid recall, which improves retrieval recall by about 10-15% in my tests.
- Monitor re-ranking latency and cost. Cross-Encoder inference needs a GPU — CPU deployment is painfully slow. Consider ONNX Runtime or TensorRT for inference acceleration, or use a dedicated re-ranking API. Jina AI launched a re-ranking API in 2024 priced around $0.018 per million tokens — pretty reasonable for small-scale use.
- Think about caching. User queries repeat a lot, so cache the re-ranking results and let repeated questions skip recomputation. I use Redis: the key is the query's MD5 hash, the value is the sorted document ID list, and the TTL is 24 hours. Because the key is a hash, only identical query strings hit the cache, so normalize case and whitespace before hashing. Hit rate hovers around 30%, saving a decent chunk of GPU time.
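To make the caching recommendation concrete, here's a sketch with an in-memory dict standing in for Redis so it runs anywhere. The `RerankCache` class, its method names, and the lowercase-plus-whitespace normalization are assumptions for illustration; with Redis you'd store the same key and value with an expiry instead of tracking timestamps yourself.

```python
import hashlib
import time
from typing import Callable, Dict, List, Optional, Tuple


class RerankCache:
    """Maps a normalized query hash to a sorted list of document IDs, with a TTL."""

    def __init__(self, ttl_seconds: float = 24 * 60 * 60,
                 clock: Callable[[], float] = time.time) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, List[str]]] = {}

    @staticmethod
    def key_for(query: str) -> str:
        # Assumption: case and extra whitespace should not cause cache misses.
        normalized = " ".join(query.lower().split())
        return hashlib.md5(normalized.encode("utf-8")).hexdigest()

    def get(self, query: str) -> Optional[List[str]]:
        key = self.key_for(query)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, doc_ids = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return doc_ids

    def set(self, query: str, doc_ids: List[str]) -> None:
        self._store[self.key_for(query)] = (self._clock() + self._ttl, list(doc_ids))


cache = RerankCache()
cache.set("How do I configure Nginx reverse proxy?", ["doc-17", "doc-4", "doc-92"])
hit = cache.get("how do I configure  nginx reverse proxy?")
```

A hash key only catches repeats of the same question. Catching paraphrases would need a semantic cache keyed on embeddings, which is a different trade-off.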
TL;DR / Key Takeaways
- Vector search finds semantic similarity, not relevance — that's the root of most RAG hallucinations
- Post-retrieval processing (filtering, dedup, chunk optimization) is cheap and effective pre-ranking prep
- Cross-Encoder re-ranking is the highest-ROI single change you can make; I've seen accuracy jump by 15-20 percentage points
- Test your thresholds, monitor your latency, and cache aggressively
- Re-ranking fixes ordering problems, not absence problems — if relevant docs aren't in your top-N, fix retrieval first
What's Your Horror Story?
What's the most absurd hallucination you've seen in a RAG system? I'm genuinely curious about those "this should never happen but it did" cases. A friend told me last week their system interpreted "Chairman's speech" as "Chairman's resignation" and nearly caused a PR disaster. I still laugh thinking about it.
Drop your war stories in the comments — I read every single one.
#RAG #LLM #HallucinationMitigation #ReRanking #CrossEncoder #VectorSearch #AIEngineering
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.