| Multi-company coverage | 60% | 95% | +35 p.p. |
Look, 20 seconds is still a bit sluggish for real-time chat. But for an internal analytics tool? It's usable. And honestly, when users go from waiting over a minute to just 20 seconds, they're practically weeping with gratitude. Or so I imagine. I didn't actually ask.
Reranking: The Elephant in the Room
When I first ran profiling and stared at the flame graph, one number jumped out.
54 seconds.
The reranking step was eating 54 seconds. Every. Single. Query.
My immediate thought: I'm completely screwed.
The reason was embarrassingly simple: I was using an LLM to do reranking. Every time, I'd toss candidate documents at GPT and ask it to score and sort them. Each call took 2-3 seconds, I had 20 candidates in the pool, and I was scoring them one at a time. Twenty calls at roughly 2.5 seconds each comes to about 50 seconds. My grandmother moves faster than this thing did.
The fix? Swap in a specialised model. I deployed gte-rerank-v2 locally and switched to batch inference. Two seconds. Done.
The lesson here is painfully obvious in hindsight: specialised models beat general-purpose ones for focused tasks. Don't use an LLM to brute-force a ranking problem. It's like using a sledgehammer to swat a fly—expensive, slow, and frankly a bit ridiculous. I still cringe thinking about how naive I was.
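To make the serial-versus-batch difference concrete, here's a minimal sketch. `CountingScorer` is a hypothetical stand-in for a reranking model (it scores by simple word overlap, nothing like gte-rerank-v2), and the function names are mine. The only point is how many model calls each approach pays for.

```python
from typing import List, Sequence, Tuple


class CountingScorer:
    """Hypothetical stand-in for a reranking model; counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def score(self, query: str, docs: Sequence[str]) -> List[float]:
        self.calls += 1  # a real model would pay its per-call overhead here
        q_terms = set(query.lower().split())
        return [
            len(q_terms & set(doc.lower().split())) / max(len(q_terms), 1)
            for doc in docs
        ]


def rerank_serial(scorer: CountingScorer, query: str, docs: Sequence[str]) -> List[Tuple[str, float]]:
    """One model call per candidate: N candidates means N round trips."""
    scored = [(doc, scorer.score(query, [doc])[0]) for doc in docs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def rerank_batch(scorer: CountingScorer, query: str, docs: Sequence[str]) -> List[Tuple[str, float]]:
    """A single model call that scores every candidate together."""
    scores = scorer.score(query, docs)
    return sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
```

Both functions return the same ranking. With 20 candidates the serial version makes 20 calls and the batch version makes one, which is where the time goes when each call has a fixed overhead.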
Vector Database Selection: My Early Overengineering Disaster
We're talking about 5,000 vectors here.
Five. Thousand. Not five million.
And what did I do? I spun up Milvus Standalone like I was building something for Google-scale operations. Startup time: 10-30 seconds. Memory usage: 500MB including Docker. Query latency: 20ms. I actually felt quite professional about it at the time—distributed vector database, very fancy, very enterprise.
Then I switched to FAISS: 50MB memory, instant startup, 15ms queries.
Let that sink in.
For anything under 100,000 vectors, FAISS is honestly all you need. Milvus is brilliant software, but at small scale it's overkill of the highest order. Not even using a sledgehammer for a fly—more like using a dragon-slaying sword to chop spring onions.
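If it helps to see why a flat index is enough at this size: exact search is just "score the query against every vector and keep the best k". The sketch below does that in plain Python with cosine similarity. It is not FAISS and not what I deployed, and the function names are made up for illustration. It only shows how little machinery the problem actually needs.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def flat_search(query: Sequence[float], vectors: Sequence[Sequence[float]], k: int) -> List[Tuple[int, float]]:
    """Exhaustive (brute-force) top-k search: compare the query with every vector."""
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(vec))), idx)
        for idx, vec in enumerate(vectors)
    )
    # nlargest keeps only the k best (score, index) pairs
    return [(idx, score) for score, idx in heapq.nlargest(k, scored)]
```

A library like FAISS does the same comparison with optimised native code, so a few thousand vectors is a trivial workload for it.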
That said, FAISS did have one nasty surprise that cost me an entire afternoon. Bloody hell.
The C++ backend doesn't play nicely with non-ASCII file paths. On Windows, it just throws an error—and the error message is so abstract you'd never guess it's a path issue. I spent two hours debugging before a sudden flash of inspiration made me try an English path. Worked instantly.
The workaround uses a temporary file as an intermediary:
# Write to a temp path first, then move the file to the real (possibly non-ASCII) target.
with tempfile.NamedTemporaryFile(delete=False, suffix=".index") as tmp:
    faiss.write_index(index, tmp.name)
shutil.move(tmp.name, target_path)
Three lines of code. One afternoon of my life I'll never get back. If you know, you know.
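For reuse, the same trick can be wrapped in a small helper. This is a sketch under a few assumptions: `save_via_temp` and `write_fn` are names I made up, the system temp directory is assumed to have an ASCII-only path, and the helper closes its own handle before the writer opens the file (which tends to be the safer order on Windows).

```python
import os
import shutil
import tempfile
from typing import Callable


def save_via_temp(write_fn: Callable[[str], None], target_path: str, suffix: str = ".index") -> None:
    """Let write_fn write to a temp path, then move the result to target_path."""
    fd, tmp_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # release our handle so the writer can open the path itself
    try:
        write_fn(tmp_path)
        shutil.move(tmp_path, target_path)  # Python handles the final path
    finally:
        # Clean up if the write or the move failed part-way through
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

With FAISS this would be called as `save_via_temp(lambda p: faiss.write_index(index, p), target_path)`.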
Don't Use LLMs for Query Expansion
When I first built query expansion, I thought, "Well, I've already got the LLM wired up—might as well let it expand the search terms too."
That decision added 3 seconds to every query.
Worse, the LLM would occasionally generate completely off-target expansions. It'd take "revenue" and expand it to "revenue growth rate"—which is a different concept entirely. You ask it for synonyms, and it gives you conceptual drift. Brilliant.
I replaced the whole thing with a predefined terminology dictionary:
TERM_EXPANSIONS = {
    "revenue": ["operating revenue", "core business income", "turnover"],
    "profit": ["net profit", "attributable net profit", "total profit"],
    # ... 18 groups total
}
Direct substitution, zero latency. More stable than LLM expansion, and—this matters—completely predictable. You know exactly what it'll expand to. No surprises like "revenue quarter-over-quarter growth velocity" appearing out of nowhere.
The principle is simple: if you can compute it offline, don't compute it online.
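Here's roughly what dictionary-based expansion can look like. It's a sketch, not my production code: `expand_query` is a hypothetical helper, only the two groups shown above are included, and matching on whole words is my assumption about how substitution should behave.

```python
import re
from typing import Dict, List

TERM_EXPANSIONS: Dict[str, List[str]] = {
    "revenue": ["operating revenue", "core business income", "turnover"],
    "profit": ["net profit", "attributable net profit", "total profit"],
}


def expand_query(query: str, expansions: Dict[str, List[str]] = TERM_EXPANSIONS) -> List[str]:
    """Return the original query plus one variant per synonym of each matched term."""
    variants = [query]
    for term, synonyms in expansions.items():
        # Whole-word, case-insensitive match so "profit" does not fire on "profitable"
        pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
        if pattern.search(query):
            variants.extend(
                [pattern.sub(lambda _match, s=synonym: s, query) for synonym in synonyms]
            )
    return variants
```

The output for a given query is always the same, which is exactly the predictability the LLM version lacked.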
Deduplication Details: Small Change, Big Impact
Vector search and BM25 can retrieve the same document chunk. My initial deduplication logic was dead simple: "keep whichever appears first."
This meant a low score could shadow a high one. Vector search might give a chunk 0.95 similarity and BM25 give the same chunk 0.6, but if the BM25 hit was seen first, the 0.6 score won. Infuriating.
Changed it to keep the highest score:
# Keep the higher score when the same chunk shows up again
if new_score > old_score:
    merged[pk]["scores"]["vector"] = new_score
Retrieval precision improved by roughly 15%. One line. One condition check.
Sometimes it's not the algorithm that's wrong—it's just sloppy logic.
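A self-contained version of the "keep the highest score" merge might look like this. The `Hit` type and `merge_keep_best` are illustrative names, and the sketch assumes the scores from both retrievers have already been normalised to a comparable range (raw BM25 scores usually are not).

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, float]  # (chunk primary key, normalised score)


def merge_keep_best(*result_lists: Iterable[Hit]) -> List[Hit]:
    """Merge hits from several retrievers, keeping the best score per chunk."""
    best: Dict[str, float] = {}
    for hits in result_lists:
        for pk, score in hits:
            # The one condition that matters: never let a lower score replace a higher one
            if pk not in best or score > best[pk]:
                best[pk] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

The result no longer depends on which retriever happens to return first.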
Fair Allocation Across Companies: An Interesting Edge Case
A user asks: "Compare revenue across the three telecom operators." If one company's high-scoring chunks dominate the results, the others get squeezed out entirely. Imagine China Mobile's documents occupying 12 of the top 15 slots—China Unicom and China Telecom vanish from view.
I added a fair allocation mechanism:
per_company = max(5, HYBRID_TOP_K // len(companies))
Combined with a three-layer safety net—coverage checking, candidate recall, and guaranteed re-retrieval—multi-company coverage jumped from 60% to 95%.
This isn't algorithmic innovation. It's engineering fastidiousness. But it works.
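As a rough sketch of the per-company quota, assuming each chunk is a dict with hypothetical `company` and `score` keys and that `HYBRID_TOP_K` is 15 (the real value isn't shown here):

```python
from collections import defaultdict
from typing import Dict, List, Sequence

HYBRID_TOP_K = 15  # assumed value for illustration


def allocate_fairly(chunks: Sequence[dict], companies: Sequence[str], top_k: int = HYBRID_TOP_K) -> List[dict]:
    """Cap how many chunks any one company may contribute to the final list."""
    per_company = max(5, top_k // len(companies))
    by_company: Dict[str, List[dict]] = defaultdict(list)
    # Walk chunks best-first so each company keeps its strongest ones
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        bucket = by_company[chunk["company"]]
        if chunk["company"] in companies and len(bucket) < per_company:
            bucket.append(chunk)
    selected = [chunk for company in companies for chunk in by_company[company]]
    return sorted(selected, key=lambda c: c["score"], reverse=True)
```

Note that the floor of 5 means the total can exceed `top_k` once more than three companies are involved. This covers only the quota, not the coverage check or re-retrieval layers.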
Engineering Safety Nets: Always Set Timeouts
Every external call needs a timeout and a fallback. Without timeouts, you get occasional deadlocks where users stare at loading spinners indefinitely.
Seriously, setting timeouts is one of those things that won't kill you if you skip it, but will absolutely save your bacon if you don't.
LLM_TIMEOUT = 60
EMBEDDING_TIMEOUT = 30
RERANK_TIMEOUT = 15
try:
    reranked = gte_reranker.rerank(candidates)
except TimeoutError:
    # Degrade gracefully: sort by the hybrid retrieval score instead
    logger.warning("gte-rerank timed out, falling back to hybrid score sorting")
    reranked = sorted(candidates, key=lambda x: x["scores"]["hybrid"], reverse=True)
The fallback strategy might not be perfect, but it's infinitely better than deadlocking. Users would rather see "decent" results than stare at a loading animation wondering if the system has died.
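If the reranker client doesn't enforce a timeout on its own, one way to impose it from the outside is to run the call in a worker thread and stop waiting after the deadline. This is a sketch with a hypothetical wrapper name, `rerank_with_fallback`, and it takes any callable as `rerank_fn`. One caveat: Python can't kill the stuck thread, so the call keeps running in the background; we just stop waiting for it.

```python
import concurrent.futures
import logging
from typing import Callable, List

logger = logging.getLogger(__name__)

RERANK_TIMEOUT = 15  # seconds


def rerank_with_fallback(
    rerank_fn: Callable[[List[dict]], List[dict]],
    candidates: List[dict],
    timeout: float = RERANK_TIMEOUT,
) -> List[dict]:
    """Run rerank_fn with a deadline; fall back to hybrid-score order on timeout."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(rerank_fn, candidates)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        logger.warning("rerank timed out, falling back to hybrid score sorting")
        return sorted(candidates, key=lambda x: x["scores"]["hybrid"], reverse=True)
    finally:
        # Do not block on the worker; a timed-out call may still be running
        executor.shutdown(wait=False)
```

Where the client library accepts its own timeout parameter, that is the better option, since it actually cancels the request.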
Chunking Strategy: The Most Underrated Piece
Chunking is the most neglected aspect of RAG. Or rather, the most underestimated.
If you naively split by paragraphs, tables get shredded. Headers end up in one chunk, data in another, and the LLM receives crippled information—like showing someone a financial statement but only giving them half the columns.
My approach: child chunks of 150 tokens for precise retrieval granularity, paired with parent chunks of 500 tokens to preserve full context. Retrieval hits the child chunks, but generation uses the corresponding parent chunk as context.
Markdown tables, code blocks, and lists are "atomic semantic blocks" that need special protection. Once you break them apart, the meaning fragments. Keeping them whole works well in practice.
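The parent-child relationship can be sketched in a few lines. This is a simplified illustration: it treats a pre-split list of strings as "tokens" (a real tokenizer would differ), the function names are made up, and it deliberately leaves out the atomic-block protection for tables, code blocks, and lists.

```python
from typing import Dict, List, Sequence, Tuple


def build_parent_child_chunks(
    tokens: Sequence[str], parent_size: int = 500, child_size: int = 150
) -> Tuple[List[List[str]], List[Dict]]:
    """Split tokens into parent chunks, then split each parent into child chunks."""
    parents: List[List[str]] = []
    children: List[Dict] = []
    for p_start in range(0, len(tokens), parent_size):
        parent_tokens = list(tokens[p_start:p_start + parent_size])
        parent_id = len(parents)
        parents.append(parent_tokens)
        for c_start in range(0, len(parent_tokens), child_size):
            children.append({
                "parent_id": parent_id,  # link used at generation time
                "tokens": parent_tokens[c_start:c_start + child_size],
            })
    return parents, children


def context_for(child: Dict, parents: List[List[str]]) -> List[str]:
    """Retrieval matches the child; generation receives the whole parent."""
    return parents[child["parent_id"]]
```

Only the child chunks get embedded and indexed. The parents are looked up by ID once a child is retrieved.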
Evaluation Needs Multiple Metrics
For coarse retrieval, I track MAP@k and Recall@k. The core question: "Did we find all the relevant documents?"
For fine-grained ranking, I use NDCG@k and Precision@k. The core question: "Is the most valuable content at the top?"
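For reference, here's a minimal sketch of three of these metrics using their standard textbook definitions. The function names and input shapes are my own choices, not part of any evaluation library, and MAP@k is left out because its normalisation varies between implementations.

```python
import math
from typing import Dict, Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def ndcg_at_k(ranked: Sequence[str], gains: Dict[str, float], k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    dcg = sum(
        gains.get(doc, 0.0) / math.log2(pos + 2)
        for pos, doc in enumerate(ranked[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(pos + 2) for pos, gain in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```

Recall rewards finding everything; NDCG additionally rewards putting the most valuable results first, which is why the two stages get different metrics.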
I built the evaluation framework on RAGAS, using GPT-4 as a judge to automatically compute Faithfulness, Answer Relevance, and Context Relevance. The test dataset includes 20% unanswerable questions to verify the rejection mechanism—when the knowledge base doesn't have the answer, the system should say "I don't know" rather than hallucinating.
Before deploying this mechanism, I ran a regression test suite to confirm it wouldn't reject answerable questions. About 200 test cases, manually spot-checked 50 of them. Exhausting, but worth it.
If I Were to Keep Optimising
Beyond that 20-second query time, the system as a whole still has room for improvement:
- Multi-turn query rewriting to resolve elliptical references, probably another 10-15% accuracy gain
- Post-hoc factual consistency checking to detect contradictions with previous answers
- Hierarchical conversation memory management with automatic summarisation of distant turns
The complexity isn't trivial, so I've held off for now. Or maybe I'll tackle it next month. Who knows? That's the thing about independent development—the optimisation backlog is always longer than the feature backlog.
What I Actually Learned (After Stepping in Every Pothole)
Profile before you optimise. Don't guess where the bottleneck is—use a flame graph. If I'd profiled earlier, I wouldn't have wasted so much time on Milvus.
Fix the biggest problem first. Anything consuming over 50% of your time budget is your priority. That 54-second reranking had to die before anything else mattered. The order is non-negotiable.
Specialised models for specialised tasks. Use gte-rerank-v2 for reranking, not an LLM. I've said this already. I won't repeat myself.
Batching beats sequential calls every time. The number of API calls matters more than individual call latency. Swapping 20 sequential calls for one batch inference run was transformative. I knew this intellectually, but I didn't feel it until this project.
Always set timeouts, always have fallbacks. Every external call needs a safety net. This is engineering hygiene. Skip it, and you're planting landmines for your future self.
That's about it.
RAG optimisation, when you strip it down, isn't about tuning some magic parameter. It's about dissecting every link in the chain, finding the one that's actually dragging everything down, and replacing it. The 54-second reranker. The 500MB Milvus instance. The 3-second LLM query expansion. Every single bottleneck was self-inflicted.
What about you? Stepped in any similar potholes lately? Drop a comment—misery loves company.
Key Takeaways:
- Profile first, optimise second (flame graphs are your friend)
- Specialised models crush general-purpose ones for focused tasks
- FAISS handles <100K vectors brilliantly; save Milvus for actual big data
- Precompute everything you can offline
- Timeouts and fallbacks aren't optional—they're survival mechanisms
#rag #aiengineering #performanceoptimization #llm #vectordatabase