RAG vs. Pure LLM: One Can Flip Through the Book, the Other Just Makes Things Up, and the Gap Is Ridiculous

Generated: 2026-06-22 06:26:45

---


I Spent a Week Going Through RAG from Scratch—Here's My Trench Diary

Let me hit you with a bombshell right off the bat: RAG isn't a silver bullet, but it's ten times more reliable than fine-tuning! Don't rush off—let me explain slowly.

I've been writing this column for ten years, from the early days of SEO to today's LLM applications. I've seen countless tech concepts hyped to the moon. But RAG (Retrieval-Augmented Generation) is one of the few things that made me think, "Holy crap, this thing can actually work in the real world."

---

How Did I End Up on This Path?

Last year, I took on a project where a client wanted an internal enterprise knowledge base Q&A system. The documents covered product manuals, compliance policies, and technical specs—thousands of pages in total. Think about it: thousands of pages! My first reaction was, "Why not just fine-tune a large model?"

What happened? I hit so many pitfalls I started questioning my life choices:

Documents were updated weekly; fine-tuning cost thousands each time, blowing the client's budget
Some of the data was sensitive, like contract terms and customer info. Would you dare feed that to a model? Too risky
The model often made stuff up, even inventing page numbers. Once it claimed, "The refund policy is in Chapter 8," and I searched the entire document and found nothing

Then I switched to RAG. Guess what? Every single problem was solved. Honestly, the results were way better than I expected—so good I felt like bowing down to RAG three times.

---

What Does RAG Actually Solve? Don't Get Fooled

Pure LLMs have two fatal flaws, and you've probably run into them too:

Hallucination problem—The model will confidently spout nonsense. I tested GPT-4 by asking, "When is the 2024 Spring Festival?" and it said February 9th (it's actually February 10th). It's due to the knowledge cutoff, but it still dared to lie. Can you believe that?

Data freshness—Ask "What's the weather in Beijing today?" and the model can never answer. Its knowledge is stuck at the training data cutoff, like an old man living in the past.

RAG's approach? So simple it'll make you slap your forehead: Don't let the model make things up—give it real materials and have it answer based on those. It's like being allowed to open your textbook during an exam instead of memorizing everything. Tell me, isn't that a cheat code? But cheating is a hundred times better than making stuff up!

---

My First RAG System: Built in Three Days

I spent three days building a minimum viable system using LangChain + FAISS. Here's the flow:

Offline Phase (Preparing Materials):

Convert PDF documents to text—this step alone had three pitfalls just from PDF parsing
Split into chunks of 512 tokens—I tweaked this parameter over a dozen times, seriously, a dozen!
Convert to vectors using text-embedding-3-small
Store in a FAISS index

Online Phase (Answering Questions):

User asks a question
Convert the question to a vector
Search FAISS for the top 5 most similar text chunks
Combine the question + chunks into a prompt
Send to GPT-4 to generate an answer
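The two phases above can be sketched in a few lines. To keep it runnable without API keys or extra packages, this is a toy under loud assumptions: `embed` is a bag-of-words stand-in for text-embedding-3-small, `ToyIndex` is a brute-force stand-in for a FAISS index, and the sample chunks are made up for illustration. It is not the LangChain code from the project.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class ToyIndex:
    """In-memory stand-in for a vector index: brute-force cosine search."""

    def __init__(self, chunks):
        # Offline phase: embed every chunk once and keep the vectors.
        self.chunks = list(chunks)
        self.vectors = [embed(c) for c in self.chunks]

    def search(self, question: str, k: int = 5):
        # Online phase: embed the question, score every chunk, keep the top k.
        q = embed(question)
        scored = sorted(
            ((cosine(q, v), i) for i, v in enumerate(self.vectors)),
            key=lambda pair: pair[0],
            reverse=True,
        )
        return [(self.chunks[i], score) for score, i in scored[:k]]

# Made-up sample chunks, only for illustration.
chunks = [
    "Refund policy: refunds are accepted within 30 days of purchase.",
    "Installation: unpack the device, then connect power.",
    "Warranty period: 1 year, only for non-human damage.",
]
index = ToyIndex(chunks)
top = index.search("What is the refund policy?", k=2)
for chunk, score in top:
    print(f"{score:.3f}  {chunk}")
```

The retrieved chunks are what gets pasted into the prompt together with the question before the call to the generation model.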

The first time it worked, I asked, "What's the refund policy?" and it directly cited content from Chapter 3, Section 2 of the document, even including the original text. At that moment, I knew: This is the right path! You know that feeling? Like you've been searching for something forever, and suddenly someone hands it to you and says, "You're welcome."

---

Real-World Data: RAG vs. Pure LLM—The Gap Is Ridiculous

I ran a comparison test using the company's internal knowledge base, 50 questions. The results will blow your mind:

Metric                   Pure GPT-4   RAG + GPT-4
Accuracy                 62%          94%
Hallucination Rate       28%          4%
Traceability Rate        0%           96%
Average Response Time    1.2s         2.8s

Accuracy jumped from 62% to 94%, and hallucination rate dropped from 28% to 4%. The trade-off: response time more than doubled (1.2s to 2.8s), and cost increased fivefold.
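For anyone who wants to run the same kind of comparison, here is a minimal scoring sketch. The `Graded` record and its three flags are my assumption about how each answer could be hand-graded, not the grading sheet from the actual test. The records below are synthetic and only reproduce the headline counts that the RAG percentages imply on 50 questions (47 correct, 2 hallucinated, 48 traceable).

```python
from dataclasses import dataclass

@dataclass
class Graded:
    correct: bool       # answer matches the reference answer
    hallucinated: bool  # answer asserts something no document supports
    cited: bool         # answer points to a source that checks out

def summarize(results):
    """Turn per-question grades into the three rates used in the table."""
    n = len(results)
    return {
        "accuracy": sum(r.correct for r in results) / n,
        "hallucination_rate": sum(r.hallucinated for r in results) / n,
        "traceability_rate": sum(r.cited for r in results) / n,
    }

# Synthetic grades for a 50-question run (illustration only).
rag_run = (
    [Graded(True, False, True)] * 47
    + [Graded(False, True, True)]
    + [Graded(False, True, False)]
    + [Graded(False, False, False)]
)
summary = summarize(rag_run)
print(summary)

# Latency ratio from the table: 2.8s vs 1.2s.
print(round(2.8 / 1.2, 2))
```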

But honestly, in enterprise scenarios, accuracy matters way more than cost. One wrong compliance answer could cost hundreds of thousands in losses. Think about it: would you rather spend five times the cost for 94% accuracy, or save a bit of money but get complaints every day? I choose the former, without hesitation.

---

Core Module Breakdown: The Pitfalls I Fell Into—Don't You Do the Same

1. Text Chunking—It's More Complicated Than It Looks

I started with fixed 512-character chunks. What happened?

When asked "What's the product warranty period?" it only retrieved the snippet "warranty period: 1 year," but the context also included the important condition "only for non-human damage." Guess how the model answered? It just said "warranty period: 1 year," completely ignoring the restriction. Isn't that misleading?
When asked "How to install?" the retrieved content only had the first half of the installation steps. The user got stuck halfway and had to ask again.

Later, I switched to semantic chunking: split by paragraphs, each chunk no more than 1024 tokens, with 10% overlap between adjacent chunks. The results improved significantly. Remember: chunking isn't slicing sausage—it needs to make logical sense.
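A rough sketch of that paragraph-based chunking, under a couple of assumptions: whitespace-separated words stand in for real tokens (a real setup would count with the embedding model's tokenizer), paragraphs are separated by blank lines, and the overlap is taken as 10% of the size limit from the tail of the previous chunk. The function name `chunk_paragraphs` is mine.

```python
def chunk_paragraphs(text: str, max_tokens: int = 1024, overlap_ratio: float = 0.10) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_tokens words,
    carrying the tail of each chunk into the next one as overlap."""
    overlap = int(max_tokens * overlap_ratio)
    step = max_tokens - overlap  # room left for new words after the overlap
    chunks: list[str] = []
    current: list[str] = []
    for para in text.split("\n\n"):
        words = para.split()
        if not words:
            continue
        # A paragraph longer than the budget is cut into budget-sized pieces.
        pieces = [words[i:i + step] for i in range(0, len(words), step)]
        for piece in pieces:
            if current and len(current) + len(piece) > max_tokens:
                chunks.append(" ".join(current))
                # Start the next chunk with the tail of the one just closed.
                current = current[-overlap:] if overlap else []
            current.extend(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks

# Tiny demo with a 10-word limit and 20% overlap so the effect is visible.
doc = "a1 a2 a3 a4 a5 a6\n\nb1 b2 b3 b4 b5 b6\n\nc1 c2 c3 c4 c5 c6"
demo_chunks = chunk_paragraphs(doc, max_tokens=10, overlap_ratio=0.2)
for c in demo_chunks:
    print(c)
```

Paragraphs stay whole whenever they fit, so a condition like "only for non-human damage" is less likely to be cut away from the sentence it qualifies.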

2. Vector Search—Don't Jump Straight to K8s

I tried three vector databases:

FAISS: Local deployment, free, suitable for small scale (<1 million entries). Perfect for initial use, no issues
Milvus: Distributed, suitable for large scale, but high maintenance costs. Consider it when data volume grows
Pinecone: Managed service, one-click deployment, but expensive. Go for it if you're rich and don't care

My advice: Start with FAISS, then migrate to Milvus when data gets big. Don't jump straight into a K8s cluster—you're not building an aircraft carrier. Get it running first!
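One way to keep that migration cheap is to hide the store behind a small interface from day one. This is a design sketch, not any library's API: `VectorStore`, `InMemoryStore`, and `top_chunk_ids` are hypothetical names, and the in-memory class is only a placeholder for a FAISS-backed or Milvus-backed implementation with the same two methods.

```python
from typing import Protocol, Sequence

class VectorStore(Protocol):
    """The only two operations the rest of the pipeline is allowed to use."""
    def add(self, ids: Sequence[str], vectors: Sequence[Sequence[float]]) -> None: ...
    def search(self, vector: Sequence[float], k: int) -> list[str]: ...

class InMemoryStore:
    """Placeholder backend: brute-force dot-product search over a dict."""

    def __init__(self) -> None:
        self._rows: dict[str, list[float]] = {}

    def add(self, ids, vectors) -> None:
        for id_, vec in zip(ids, vectors):
            self._rows[id_] = list(vec)

    def search(self, vector, k):
        def dot(other):
            return sum(a * b for a, b in zip(vector, other))
        ranked = sorted(self._rows, key=lambda id_: dot(self._rows[id_]), reverse=True)
        return ranked[:k]

def top_chunk_ids(store: VectorStore, query_vector, k: int = 5) -> list[str]:
    """Pipeline code depends on the interface, not on a concrete backend."""
    return store.search(query_vector, k)

store = InMemoryStore()
store.add(["a", "b"], [[1.0, 0.0], [0.0, 1.0]])
print(top_chunk_ids(store, [0.9, 0.1], k=1))
```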

3. Reranking—The Step Everyone Overlooks

After vector search retrieves the top-K results, using cosine similarity for ranking is only so-so. Think about it: high similarity doesn't always mean high relevance—like looking for a friend, someone with a similar name isn't necessarily the one you want.

I added a Rerank model (BGE-Reranker) to re-rank the retrieved 20 chunks and take the top 5. Accuracy improved by another 5–8 percentage points. This step is worth it!
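The retrieve-20-then-keep-5 step looks roughly like this. The scoring function is passed in as a parameter. In a real setup it would wrap a cross-encoder such as BGE-Reranker. Here `toy_score` is a hypothetical word-overlap stand-in so the sketch runs on its own, and the candidate strings are made up.

```python
from typing import Callable, Sequence

def retrieve_then_rerank(
    question: str,
    candidates: Sequence[str],
    score_pair: Callable[[str, str], float],
    recall_k: int = 20,
    final_k: int = 5,
) -> list[str]:
    """candidates: chunks already sorted by vector similarity, best first."""
    shortlist = list(candidates[:recall_k])
    # Re-score each (question, chunk) pair and sort by the new score.
    rescored = sorted(shortlist, key=lambda chunk: score_pair(question, chunk), reverse=True)
    return rescored[:final_k]

def toy_score(question: str, chunk: str) -> float:
    """Stand-in for a reranker: share of question words found in the chunk."""
    q = set(question.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

# Made-up candidates in their original vector-search order.
candidates = [
    "warranty card design guidelines",
    "the product warranty period is 1 year",
    "period drama listings",
]
reranked = retrieve_then_rerank("product warranty period", candidates, toy_score, final_k=2)
print(reranked)
```

The point of the two stages is cost: the cheap vector search narrows thousands of chunks down to 20, and the slower pairwise scorer only has to look at those 20.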

4. Prompt Design—The Most Mysterious Part

I iterated through over a dozen versions and finally found an effective template:


You are a knowledge Q&A assistant. Please answer based on the following reference materials.
If the reference materials don't contain enough information, clearly say "Cannot find an answer in the available materials." Do not make things up.

Reference Materials:
{context}

Question: {question}

Please answer, and include the source number of the reference material after your answer.

Key point: Force the model to admit "I don't know" instead of fabricating. This is more important than any technical optimization. Think about it: an assistant that says "I don't know" is a thousand times more reliable than one that rambles nonsense.
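Filling the template is plain string formatting. A minimal sketch, assuming the retrieved chunks are numbered `[1]`, `[2]`, and so on so the model has a source number to cite. The numbering format and the `build_prompt` name are my choices, and the sample chunks are made up.

```python
PROMPT_TEMPLATE = """You are a knowledge Q&A assistant. Please answer based on the following reference materials.
If the reference materials don't contain enough information, clearly say "Cannot find an answer in the available materials." Do not make things up.

Reference Materials:
{context}

Question: {question}

Please answer, and include the source number of the reference material after your answer."""

def build_prompt(question: str, chunks: list[str]) -> str:
    """Number each retrieved chunk and drop everything into the template."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return PROMPT_TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    "What's the product warranty period?",
    ["Warranty period: 1 year, only for non-human damage.", "Installation steps: ..."],
)
print(prompt)
```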

---

RAG vs. Fine-Tuning: When to Use

Token Cost per Query	0.003 yuan	0.015 yuan


Cael Lee
