RAG Isn't Dead — It's Just Growing Up (And Your Production Users Can Tell)

By Cael Lee | 6 min read

Last Friday afternoon, a mate who builds knowledge base software pinged me a screenshot.

"RAG is dead, long context rules everything" — some tech forum post title. He added three facepalm emojis.

I was in the middle of wrestling with a hybrid search weighting parameter — you know the one, where you're convinced 0.7 is the magic number but deep down you know it's probably 0.68 — and I fired back: "People declaring RAG dead have almost certainly never run it in production."

Sent. Then immediately wished I hadn't. Too absolute.

But I didn't unsend it either. Because on some level, I meant it.

I've been chewing on this for nearly two years now. Started mucking about with RAG in early 2023, and the playbook back then was — how do I put this — beautifully naive. Chunk PDFs into 500-word blocks, embed them with text-embedding-ada-002, stuff them into Chroma, then when a user asks something, retrieve the Top-3 chunks, and slot them into a prompt. Done.
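For the record, that whole playbook fits in about thirty lines. This is a minimal sketch rather than production code: the bag-of-words `embed` function is a toy stand-in assumed here so the example runs without an API key or a vector store (it replaces text-embedding-ada-002 and Chroma), and every function name is made up for illustration.

```python
import math
from collections import Counter


def chunk_words(text, size=500):
    """Split text into consecutive blocks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy embedding (assumption): word counts instead of a real model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=3):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query, chunks, k=3):
    """Slot the retrieved chunks into a prompt for the LLM."""
    context = "\n\n".join(top_k(query, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Nothing in there asks whether a chunk actually answers the question. It only asks which chunks look most like it.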

Demo ran like a dream.

Production? Crashed and burned. Spectacularly.

Here's what that looked like: user asks "how do I get a refund", and the system proudly retrieves three documents about "payment processing". Semantically close, sure — they're all about money. But the user wanted a refund, not a payment flow. User muttered something about "rubbish bot" and rage-clicked over to human support.

This is RAG 1.0's signature problem. Pure vector search finds similarity, not relevance. Sit with that for a second. In vector space, "refund" and "payment" might have a cosine similarity north of 0.85, but in business logic, they're practically opposites.

Brilliant.

I've since stepped in enough potholes to understand why some folks reckon RAG's had its day.

One reason: long-context models are genuinely impressive. Claude 3 shipped with a 200K-token window, Gemini 1.5 Pro straight-up hit 1 million, and now everyone's in an arms race. If your knowledge base is only a few hundred pages, why bother with retrieval? Just cram everything in. Done. Simple.

But here's the trap. I've tested this. Once you push past 100K tokens, the "lost in the middle" phenomenon gets properly noticeable. Chuck a 200-page report into the context window, ask for a specific figure from page 137, and the model will either miss it entirely or confidently hallucinate something plausible. It's not that the model isn't clever. It's that models tend to use what sits at the start and end of a long context far more reliably than what's buried in the middle. DeepMind's paper on this was refreshingly blunt: the model doesn't actually "remember" 10 million words. It just shoves them in, and retrieval gets messy.

Then there's cost. Stuff the entirety of War and Peace into every prompt, and you're paying for those 580,000 words on every single query. I ran the numbers: with GPT-4o, a single query with a full 100K context costs roughly 8 to 12 times more than a retrieval-based approach. Scale that to concurrent users, and your bill doesn't just grow — it detonates.
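The arithmetic behind that kind of multiple is easy to sketch. The numbers below are assumptions for illustration, not a real price list or measured figures: `PRICE` is a hypothetical per-million-token rate, and `RAG_PROMPT` assumes the retrieved chunks plus instructions come to about 10K tokens. Only input tokens are counted.

```python
def query_cost(input_tokens, usd_per_million_tokens):
    """Input-token cost of one query, in USD."""
    return input_tokens / 1_000_000 * usd_per_million_tokens


def cost_ratio(full_context_tokens, rag_prompt_tokens):
    """How many times more the full-context query costs (input tokens only)."""
    return full_context_tokens / rag_prompt_tokens


PRICE = 2.50            # hypothetical USD per million input tokens
FULL_CONTEXT = 100_000  # the "cram it all in" prompt
RAG_PROMPT = 10_000     # assumed: retrieved chunks + instructions + question

print(f"full context: ${query_cost(FULL_CONTEXT, PRICE):.3f} per query")
print(f"retrieval:    ${query_cost(RAG_PROMPT, PRICE):.3f} per query")
print(f"ratio:        {cost_ratio(FULL_CONTEXT, RAG_PROMPT):.0f}x")
```

The per-token price cancels out of the ratio, so what matters is how lean the retrieval prompt is. Under this input-only model, a retrieval prompt of roughly 8K to 12.5K tokens would give a multiple between 12 and 8. Output tokens and any extra model calls in the retrieval pipeline are ignored here and would narrow the gap.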

So long context isn't a silver bullet. It's gobbled up some simple use cases, absolutely. But in genuinely complex enterprise scenarios, RAG is quietly evolving into something much more interesting.

What I've been building instead

Last year I started experimenting with approaches that don't have catchy names (and honestly, naming things is half the battle). The core ideas aren't rocket science, but they work:

Parent-child window chunking. Traditional chunking has this impossible tension: big chunks lose retrieval precision, small chunks lose context. The parent-child approach is clever — use tiny 300-word child chunks for the actual retrieval step, but return the parent 1,000-word chunk to the LLM. You get surgical retrieval and rich context. I tested this on a Dream of the Red Chamber knowledge base (don't ask), and Q&A accuracy on questions about a specific scene jumped from 62% to 89%.
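A rough sketch of the parent-child idea, under a few assumptions: chunks are split on whitespace by word count, `overlap_score` is a toy scorer standing in for real vector similarity, and all names are invented for this example.

```python
def split_words(text, size):
    """Split text into consecutive blocks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_index(document, parent_size=1000, child_size=300):
    """Return (parents, children); each child remembers its parent's index."""
    parents = split_words(document, parent_size)
    children = []  # list of (child_text, parent_id)
    for parent_id, parent in enumerate(parents):
        for child in split_words(parent, child_size):
            children.append((child, parent_id))
    return parents, children


def overlap_score(query, text):
    """Toy relevance score (assumption): number of shared words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query, parents, children, score=overlap_score, k=3):
    """Rank the small child chunks, but hand back their larger parents."""
    ranked = sorted(children, key=lambda c: score(query, c[0]), reverse=True)
    seen, results = set(), []
    for _, parent_id in ranked:
        if parent_id not in seen:  # several children can share one parent
            seen.add(parent_id)
            results.append(parents[parent_id])
        if len(results) == k:
            break
    return results
```

The de-duplication step matters more than it looks: without it, three children of the same parent would fill all three slots with the same 1,000-word block.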

Hybrid search with reranking. Run vector search and BM25 keyword search in parallel, merge the results, then let a reranker sort them properly. This solves the "product code SKU-8472 returns nothing" problem: pure vector search is rubbish with exact keywords, but BM25 was literally born for this. Anthropic's Contextual Retrieval from 2024 follows the same logic; they reported that pairing contextual embeddings with contextual BM25 cut retrieval failure rates by 49%, and adding reranking brought the reduction to 67%.
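One common way to do the "merge" step is reciprocal rank fusion (RRF), which is the technique sketched here; a weighted score blend is another option. The example assumes the two retrievers have already run and each returned a ranked list of document IDs. The `rerank` argument is a hypothetical callable standing in for a real reranking model, and the document IDs are made up.

```python
def rrf_merge(rankings, k=60):
    """Fuse several ranked lists of doc IDs with reciprocal rank fusion.

    Each document scores 1 / (k + rank) per list it appears in; k=60 is a
    conventional default that damps the influence of the very top ranks.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


def hybrid_search(vector_ranking, bm25_ranking, rerank=None, top_n=3):
    """Merge the two rankings, optionally rerank, and keep the top results."""
    merged = rrf_merge([vector_ranking, bm25_ranking])
    if rerank is not None:
        merged = rerank(merged)  # hypothetical: takes and returns doc IDs
    return merged[:top_n]


# Vector search buries the exact product code; BM25 puts it first.
vector_hits = ["payments-faq", "billing-guide", "sku-8472-spec"]
bm25_hits = ["sku-8472-spec", "returns-policy"]
print(hybrid_search(vector_hits, bm25_hits))
```

Because `sku-8472-spec` appears in both lists, its fused score beats documents that only one retriever liked, so it moves to the top even though vector search ranked it third.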

Query rewriting. Users are lazy. "how do i do that expenses thing" — this kind of chatty, vague query is a nightmare for vector search. So you add a lightweight step before retrieval: let an LLM rewrite the user's question into something more structured and searchable. Cheap to run, surprisingly effective.
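The rewriting step can be as small as this. It's a sketch with assumptions: `complete` is a hypothetical callable that takes a prompt string and returns the model's text (wire it to whichever LLM client you use), and the prompt wording is illustrative rather than tuned.

```python
REWRITE_PROMPT = (
    "Rewrite the user's question as a short, specific search query. "
    "Keep product names and codes exactly as written. "
    "Reply with the query only.\n\n"
    "User question: {query}"
)


def rewrite_query(user_query, complete):
    """Rewrite a vague question before retrieval.

    `complete` is a hypothetical callable: prompt string in, model text out.
    """
    cleaned = user_query.strip()
    rewritten = complete(REWRITE_PROMPT.format(query=cleaned)).strip()
    # Fall back to the original question if the model returns nothing.
    return rewritten or cleaned


# Stand-in for a real model call, just to show the flow.
def fake_llm(prompt):
    return "expense claim submission process"


print(rewrite_query("how do i do that expenses thing", fake_llm))
```

The fallback line is the bit worth keeping: if the rewrite step fails or comes back empty, retrieval still runs on the user's own words instead of on nothing.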

Self-reflection loop. Before generating an answer, have the LLM check whether the retrieved documents are actually relevant and sufficient. If not, run another retrieval round with a reframed query. The Self-RAG paper formalised this idea, but my production implementation is much lighter — just one evaluation step — and it bumped retrieval quality by 15 percentage points.
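The single-check version has roughly this shape. It's a sketch, not the production code: `retrieve`, `is_sufficient` and `reframe` are hypothetical callables (in practice the last two would be LLM calls), and the in-memory `store` exists only to make the example runnable.

```python
def retrieve_with_check(query, retrieve, is_sufficient, reframe):
    """Retrieve, evaluate once, and retry with a reframed query if needed.

    retrieve(query) -> list of documents
    is_sufficient(query, docs) -> bool, e.g. an LLM relevance judgement
    reframe(query, docs) -> a reworded query for the second attempt
    """
    docs = retrieve(query)
    if is_sufficient(query, docs):
        return docs
    # One retry only: reframe the question and take whatever comes back.
    return retrieve(reframe(query, docs))


# Toy stand-ins to show the control flow.
store = {"expense claim policy": ["doc-7"]}
docs = retrieve_with_check(
    "expenses thing",
    retrieve=lambda q: store.get(q, []),
    is_sufficient=lambda q, d: len(d) > 0,
    reframe=lambda q, d: "expense claim policy",
)
print(docs)
```

Capping it at one retry keeps the worst-case latency predictable: at most two retrievals and one evaluation per question.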

Here's the thing, though. This stack has been running in production for over a year, and it's solid. But I've got to be honest: it's dramatically more complex than RAG 1.0. We're talking three extra model calls, latency jumping from 800ms to 2.3 seconds, and operational overhead that makes my DevOps person give me meaningful looks.

So now I tell my team: for simple scenarios, don't overthink it. Just cram it into the prompt or use basic RAG. Only reach for the advanced stuff when you've got clear, measured retrieval quality problems you can point to.

GraphRAG, Agentic RAG — I've tried the fancier architectures, and yes, they shine in specific contexts (cross-document reasoning for legal case analysis, that sort of thing). But most business use cases don't need them, and forcing them in is just overengineering with extra steps.

Simple systems are more stable. Fewer moving parts mean fewer things to break.

That sentence took me the better part of a year to genuinely understand.

So is RAG actually dying?

My take: no. But it's shifting from "application" to "infrastructure."

Think about databases. Nobody sits around debating whether databases are dead, but what system doesn't depend on one? RAG — or more broadly, "context engineering" — is becoming the default component of every AI application.

The 2023 starter-pack RAG (chunk → embed → top-K, done) is obsolete, sure. But the core idea is being absorbed into something bigger. Hybrid retrieval, intelligent reranking, query rewriting, memory systems, tool calling — these pieces together form a proper context engine.

And the big labs aren't sleeping on this. OpenAI's API now has built-in file search doing hybrid retrieval. Anthropic shipped Contextual Retrieval. On the research side, the recent Recursive Language Models (RLM) paper has models retrieve from long texts by generating code, which is, if you squint, just a more sophisticated retrieval strategy.

Nobody's abandoning RAG. They're graduating it from a demo to industrial-grade infrastructure.

The real world doesn't care about benchmarks

Funny epilogue. That friend who asked me "is RAG dead" last week? He messaged again yesterday. His team evaluated their options and decided to stick with RAG — but upgrade to hybrid search with reranking.

I asked why.

He said: "Long context is too expensive, and when users ask 'what about that product', the model has absolutely no idea what 'that' refers to."

There it is.

That's real-world requirements. Not paper benchmarks. Not forum debates. A user says something vague, and the system has to catch it.

RAG is alive and well. It's just stopped being a buzzword.

It's become plumbing. Wiring. Foundation. Invisible, and indispensable.

What's your experience? Has your team gone all-in on long context, or are you still iterating on retrieval? Drop a comment — I'm genuinely curious what's working in the wild.

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
