
RAG vs. Pure LLM: One Can Open the Book, the Other Just Makes Things Up, and the Gap Is Ridiculous

By CaelLee | 6 min read


Generated: 2026-06-22 06:26:45

---


I Spent a Week Going Through RAG from Scratch—Here's My Trench Diary

Let me hit you with a bombshell right off the bat: RAG isn't a silver bullet, but for knowledge-base Q&A it proved far more reliable for me than fine-tuning! Don't rush off. Let me explain slowly.

I've been writing this column for ten years, from the early days of SEO to today's LLM applications. I've seen countless tech concepts hyped to the moon. But RAG (Retrieval-Augmented Generation) is one of the few things that made me think, "Holy crap, this thing can actually work in the real world."

---

How Did I End Up on This Path?

Last year, I took on a project where a client wanted an internal enterprise knowledge base Q&A system. The documents covered product manuals, compliance policies, and technical specs—thousands of pages in total. Think about it: thousands of pages! My first reaction was, "Why not just fine-tune a large model?"

What happened? I hit so many pitfalls I started questioning my life choices.

Then I switched to RAG. Guess what? Every single problem was solved. Honestly, the results were way better than I expected—so good I felt like bowing down to RAG three times.

---

What Does RAG Actually Solve? Don't Get Fooled

Pure LLMs have two fatal flaws, and you've probably run into them too:

Hallucination problem—The model will confidently spout nonsense. I tested GPT-4 by asking, "When is the 2024 Spring Festival?" and it said February 9th (it's actually February 10th). It's due to the knowledge cutoff, but it still dared to lie. Can you believe that?

Data freshness—Ask "What's the weather in Beijing today?" and the model can never answer. Its knowledge is stuck at the training data cutoff, like an old man living in the past.

RAG's approach? So simple it'll make you slap your forehead: Don't let the model make things up—give it real materials and have it answer based on those. It's like being allowed to open your textbook during an exam instead of memorizing everything. Tell me, isn't that a cheat code? But cheating is a hundred times better than making stuff up!

---

My First RAG System: Built in Three Days

I spent three days building a minimum viable system using LangChain + FAISS. Here's the flow:

Offline Phase (Preparing Materials):

  1. Convert PDF documents to text—this step alone had three pitfalls just from PDF parsing
  2. Split into chunks of 512 tokens—I tweaked this parameter over a dozen times, seriously, a dozen!
  3. Convert to vectors using text-embedding-3-small
  4. Store in a FAISS index

Online Phase (Answering Questions):

  1. User asks a question
  2. Convert the question to a vector
  3. Search FAISS for the top 5 most similar text chunks
  4. Combine the question + chunks into a prompt
  5. Send to GPT-4 to generate an answer
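
To make the two phases concrete, here is a minimal sketch in plain Python. It is not the code from the project: `embed` is a toy bag-of-words stand-in for text-embedding-3-small, and `ToyIndex` is a brute-force stand-in for a FAISS index. All names here are hypothetical, and the final call to GPT-4 is left out.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyIndex:
    """In-memory stand-in for a vector index (brute-force search)."""

    def __init__(self) -> None:
        self.chunks: list[str] = []
        self.vectors: list[Counter] = []

    def add(self, chunks: list[str]) -> None:
        # Offline phase: embed every chunk once and store it.
        for chunk in chunks:
            self.chunks.append(chunk)
            self.vectors.append(embed(chunk))

    def search(self, query: str, k: int = 5) -> list[tuple[float, str]]:
        # Online phase: embed the question and return the k closest chunks.
        q = embed(query)
        scored = [(cosine(q, v), c) for v, c in zip(self.vectors, self.chunks)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


def retrieve_context(index: ToyIndex, question: str, k: int = 5) -> str:
    """Join the top-k chunks into one context string for the prompt."""
    hits = index.search(question, k)
    return "\n\n".join(chunk for _, chunk in hits)
```

Swapping in a real embedding model and a real index changes the bodies of `embed` and `ToyIndex`, but the shape of the flow stays the same: embed once offline, then embed the question and search at query time.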

The first time it worked, I asked, "What's the refund policy?" and it directly cited content from Chapter 3, Section 2 of the document, even including the original text. At that moment, I knew: This is the right path! You know that feeling? Like you've been searching for something forever, and suddenly someone hands it to you and says, "You're welcome."

---

Real-World Data: RAG vs. Pure LLM—The Gap Is Ridiculous

I ran a comparison test using the company's internal knowledge base, 50 questions. The results will blow your mind:

| Metric | Pure GPT-4 | RAG + GPT-4 |
|---|---|---|
| Accuracy | 62% | 94% |
| Hallucination Rate | 28% | 4% |
| Traceability Rate | 0% | 96% |
| Average Response Time | 1.2s | 2.8s |

Accuracy jumped from 62% to 94%, and the hallucination rate dropped from 28% to 4%. The trade-off: response time more than doubled (1.2s to 2.8s), and cost increased fivefold.
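
If you want to run a similar comparison yourself, here is a rough sketch of how rates like these could be tallied from hand-graded answers. The `GradedAnswer` fields and the grading scheme are assumptions for illustration; how the original 50 answers were graded isn't described here. On 50 questions, 94% accuracy works out to 47 correct answers and a 4% hallucination rate to 2 answers.

```python
from dataclasses import dataclass


@dataclass
class GradedAnswer:
    correct: bool       # answer matches the reference answer
    hallucinated: bool  # answer asserts something the documents don't support
    cited: bool         # answer points back to a source passage


def summarize(results: list[GradedAnswer]) -> dict[str, float]:
    """Turn per-question grades into the three rates, as fractions of 1."""
    n = len(results)
    if n == 0:
        raise ValueError("no graded answers to summarize")
    return {
        "accuracy": sum(r.correct for r in results) / n,
        "hallucination_rate": sum(r.hallucinated for r in results) / n,
        "traceability_rate": sum(r.cited for r in results) / n,
    }
```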

But honestly, in enterprise scenarios, accuracy matters way more than cost. One wrong compliance answer could cost hundreds of thousands in losses. Think about it: would you rather spend five times the cost for 94% accuracy, or save a bit of money but get complaints every day? I choose the former, without hesitation.

---

Core Module Breakdown: The Pitfalls I Fell Into—Don't You Do the Same

1. Text Chunking—It's More Complicated Than It Looks

I started with fixed 512-token chunks. What happened?

Later, I switched to semantic chunking: split by paragraphs, each chunk no more than 1024 tokens, with 10% overlap between adjacent chunks. The results improved significantly. Remember: chunking isn't slicing sausage—it needs to make logical sense.
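
Here is a minimal sketch of that kind of paragraph-based chunking, under loud assumptions: "tokens" are approximated by whitespace-separated words (a real tokenizer will count differently), paragraphs are separated by blank lines, and `chunk_by_paragraph` is a hypothetical helper, not the project's actual code.

```python
import re


def chunk_by_paragraph(text: str, max_tokens: int = 1024,
                       overlap_ratio: float = 0.10) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_tokens words,
    repeating the tail of each chunk at the start of the next one."""
    if max_tokens < 1 or not 0 <= overlap_ratio < 1:
        raise ValueError("need max_tokens >= 1 and 0 <= overlap_ratio < 1")
    overlap = int(max_tokens * overlap_ratio)
    budget = max_tokens - overlap  # room left for new words after the overlap

    paragraphs = [p.split() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[list[str]] = []
    current: list[str] = []
    has_new_words = False
    for words in paragraphs:
        # An oversized paragraph is cut into budget-sized pieces.
        for start in range(0, len(words), budget):
            piece = words[start:start + budget]
            if has_new_words and len(current) + len(piece) > max_tokens:
                chunks.append(current)
                # Carry the last `overlap` words into the next chunk.
                current = current[-overlap:] if overlap > 0 else []
                has_new_words = False
            current = current + piece
            has_new_words = True
    if has_new_words:
        chunks.append(current)
    return [" ".join(chunk) for chunk in chunks]
```

Because paragraphs are only split when they exceed the budget on their own, most chunk boundaries fall between paragraphs, which is the point of chunking by meaning rather than by a fixed length.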

2. Vector Search—Don't Jump Straight to K8s

I tried three vector databases.

My advice: Start with FAISS, then migrate to Milvus when data gets big. Don't jump straight into a K8s cluster—you're not building an aircraft carrier. Get it running first!

3. Reranking—The Step Everyone Overlooks

After vector search retrieves the top-K results, using cosine similarity for ranking is only so-so. Think about it: high similarity doesn't always mean high relevance—like looking for a friend, someone with a similar name isn't necessarily the one you want.

I added a Rerank model (BGE-Reranker) to re-rank the retrieved 20 chunks and take the top 5. Accuracy improved by another 5–8 percentage points. This step is worth it!
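
The retrieve-wide-then-rerank pattern looks roughly like this. It's a sketch only: `lexical_overlap` is a toy scorer standing in for a real reranker such as BGE-Reranker (whose API is not shown here), and both function names are hypothetical.

```python
from typing import Callable


def lexical_overlap(query: str, passage: str) -> float:
    """Toy relevance score: share of query words that appear in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


def rerank(query: str, candidates: list[str],
           scorer: Callable[[str, str], float] = lexical_overlap,
           top_n: int = 5) -> list[str]:
    """Re-score every candidate against the query and keep the best top_n.

    `candidates` would be the ~20 chunks returned by vector search;
    `scorer` is where a real cross-encoder reranker would plug in.
    """
    ranked = sorted(candidates, key=lambda passage: scorer(query, passage),
                    reverse=True)
    return ranked[:top_n]
```

The design choice is that vector search stays cheap and broad (20 candidates), while the slower, more precise scorer only ever looks at that short list.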

4. Prompt Design—The Most Mysterious Part

I iterated through over a dozen versions and finally found an effective template:


You are a knowledge Q&A assistant. Please answer based on the following reference materials.
If the reference materials don't contain enough information, clearly say "Cannot find an answer in the available materials." Do not make things up.

Reference Materials:
{context}

Question: {question}

Please answer, and include the source number of the reference material after your answer.

Key point: Force the model to admit "I don't know" instead of fabricating. This is more important than any technical optimization. Think about it: an assistant that says "I don't know" is a thousand times more reliable than one that rambles nonsense.
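
Filling the template can be sketched like this. The `[1]`, `[2]` numbering format and the `build_prompt` helper are assumptions added for illustration; the template text itself is the one shown above.

```python
PROMPT_TEMPLATE = """You are a knowledge Q&A assistant. Please answer based on the following reference materials.
If the reference materials don't contain enough information, clearly say "Cannot find an answer in the available materials." Do not make things up.

Reference Materials:
{context}

Question: {question}

Please answer, and include the source number of the reference material after your answer."""


def build_prompt(question: str, chunks: list[str]) -> str:
    """Number each retrieved chunk so the model can cite it by source number."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

Numbering the chunks is what makes the last instruction in the template workable: the model can only cite a source number if each reference material carries one.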

---

RAG vs. Fine-Tuning: When to Use Which

| Token Cost per Query | 0.003 yuan | 0.015 yuan |

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
