Building a Knowledge Q&A System with Large Language Models

Generated: 2026-06-21 13:31:59

---


Don't Let LLMs Fool You! 7 Pitfalls I Stumbled Into Building an Enterprise Q&A System—All Learned the Hard Way

Two months ago, my boss clapped me on the shoulder, eyes full of expectation: "Kid, turn our two thousand-plus contracts, policy manuals, and operating procedures into a smart Q&A system. Legal needs to search clauses, HR needs to ask about leave rules, and the shop floor workers need to look up manuals by voice."

I patted my chest right back: "Isn't this just RAG? A big LLM will handle it all!"

Don't laugh. I really thought I had it in the bag.

And then the first version dropped. I was completely dumbfounded.

---

A colleague asked: "Which contract has the confidentiality obligation?"

The system said: "On page four."

I flipped to the original—it wasn't there at all.

Then someone else followed up: "What about the breach of contract clause?"

The system just made up a legal provision on the spot, even earnestly adding a "Source: Labor Contract, Chapter 2, Section 3." I had just read that contract. There was nothing like it in there.

The engineers in production were even worse. Someone asked "How to handle error code E102," and the system listed five steps—turns out half of them were for the previous generation model.

I just sat there at my desk, feeling one thing: I got sold out by the LLM, and I was still counting the money for it.

---

Now, you think I'm about to trash LLMs?

Nope. Quite the opposite. LLMs are incredibly powerful—so powerful that they can make you believe every word of their confidently spewed nonsense.

The Weakest Link in RAG? It's Not the LLM

At first, I built a standard RAG pipeline: convert PDFs to text, split them into paragraphs, embed them with an embedding model, store them in Milvus, retrieve the top_k chunks, and feed them to the LLM to generate an answer.

I used LangChain and naively set chunk_size=512 and chunk_overlap=50. I thought, isn't OpenAI unstoppable? Just pair it with GPT-4, right?
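To make those two numbers concrete, here is a minimal sketch of what fixed-size chunking with overlap does. It is not LangChain's implementation: `chunk_text` is a hypothetical helper, and it assumes sizes are counted in characters and that windows are cut blindly, with no regard for sentence or clause boundaries.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "".join(str(i % 10) for i in range(1000))
    for i, chunk in enumerate(chunk_text(sample)):
        print(i, len(chunk))
```

A splitter like this will happily cut a contract clause in half, which is one way a retrieved chunk ends up missing the sentence that actually answers the question.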

So what happened?

The problem choked right at retrieval.

Someone asked "What's the invoice reimbursement process for last year's Q3?" The system returned a snippet from the "General Expense Reimbursement Policy." But there was actually a separate policy just for invoices—and it never got recalled.

The top results ranked by cosine similarity were all useless junk.

You know what that's like?

It's like going to the library for The Three-Body Problem, and the librarian hands you a Xinhua Dictionary—saying, "Well, they're both books, aren't they?"

It took days of debugging before I realized the cause: I had swapped the embedding model for a lightweight domestic one and done no index tuning at all.

The weakest link in RAG? It's retrieval.

If you retrieve garbage, even the best LLM can only do fancy tricks with that garbage.
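If you want to check whether retrieval is the culprit before blaming the LLM, a small harness like the one below is enough. It is a sketch under loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, the three documents are invented, and `hit_at_k` is a hypothetical helper that just asks "did the document I expected make it into the top k?"

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, docs: dict[str, str], k: int = 2) -> list[tuple[str, float]]:
    """Rank documents by cosine similarity to the query and keep the best k."""
    q = embed(query)
    scored = [(doc_id, cosine(q, embed(text))) for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def hit_at_k(expected_id: str, results: list[tuple[str, float]]) -> bool:
    """True if the document we expected was actually recalled."""
    return expected_id in [doc_id for doc_id, _ in results]


# Invented documents, for illustration only.
DOCS = {
    "general_expense_policy": "general expense reimbursement policy for travel and meals",
    "invoice_policy": "invoice reimbursement process and required invoice fields",
    "leave_policy": "annual leave and sick leave rules",
}

if __name__ == "__main__":
    results = top_k("invoice reimbursement process", DOCS, k=2)
    print(results, hit_at_k("invoice_policy", results))
```

Run a few dozen real questions with known source documents through a check like this. If the right document is not in the top k, no prompt tweak downstream will save the answer.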

So what did I do?

Later I learned a trick—for small-scale knowledge bases (a few hundred thousand characters or less), just skip the retrieval altogether and stuff the entire text into the context window.

And guess what?

It actually performed better.

I was using some long-context models at the time (like GLM-4-128K or Qwen2.5-32K), which could swallow most of my internal documents. And without the information loss from chunking, retrieval, and recombining, the problem vanished.

But this trick has a fatal flaw—scale. Try cramming two thousand contracts in there, and you'll blow right past the context limit.

That's when you have to fall back on retrieval.
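The decision itself is simple enough to write down. Below is a rough sketch, not a recipe: `estimate_tokens` uses a made-up characters-per-token ratio (use the model's own tokenizer for a real count), and the 128,000-token limit and 4,000-token reserve for the question and answer are assumed defaults.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 1.5) -> int:
    """Very rough token estimate; the ratio is an assumption, not a measured value."""
    return math.ceil(len(text) / chars_per_token)


def choose_strategy(documents: list[str],
                    context_limit: int = 128_000,
                    reserved: int = 4_000) -> str:
    """Return 'full_context' if every document fits in the window, else 'retrieval'."""
    budget = context_limit - reserved  # leave room for the question and the answer
    total = sum(estimate_tokens(doc) for doc in documents)
    return "full_context" if total <= budget else "retrieval"


if __name__ == "__main__":
    handbook = ["x" * 1_000] * 10        # a small internal handbook
    contracts = ["x" * 5_000] * 2_000    # two thousand contracts
    print(choose_strategy(handbook), choose_strategy(contracts))
```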

That "Generate-Then-Retrieve" Approach That Got Me Excited

Then I read the ChatKBQA paper, and it completely reset my thinking.

The idea: first let the LLM generate a logical form (a logical expression) from the question, and then query the knowledge base precisely.

According to the paper, this method blows traditional natural-language-based retrieval out of the water in terms of entity and relation retrieval efficiency.

I tested it myself on a contract clause retrieval task. With a fine-tuned Llama-2-7B, I hit Exact Match (EM) at 58%.

And my naive RAG version? What did it score?

Less than 20%.

In other words, on the same questions, the original RAG failed to produce the exact answer more than 80% of the time.
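For anyone unsure what that number measures, this is a minimal sketch of an Exact Match score. It assumes each question has a single reference answer and that answers are compared as strings after lowercasing and whitespace cleanup; benchmarks that compare logical forms or answer sets define it differently.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(answer.lower().split())


def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that equal their reference after normalization."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and the same length")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    preds = ["Article 5", "article  5 ", "Article 7"]
    refs = ["Article 5", "Article 5", "Article 5"]
    print(f"EM = {exact_match(preds, refs):.2%}")
```

EM is strict: a nearly right answer scores the same as an invented one, so a low EM tells you how often the system missed, not how it missed.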

But—generating that "logical form" itself requires fine-tuning.

Throwing GPT-4 at it directly? The results were abysmal. The paper gave a comparison too—EM at only 12%.

So, if you're in an enterprise private setting, don't cut corners.

Take your contract data, and fine-tune an open-source model properly.

I used ChatGLM2-6B with QLoRA. Low cost, solid performance.
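Why is it low cost? A back-of-the-envelope calculation shows the LoRA half of the story: instead of updating a full weight matrix, you train two thin matrices of rank r. The numbers below (a 4096 by 4096 matrix, rank 8) are assumed for illustration and are not taken from my actual training config. QLoRA additionally keeps the frozen base weights in 4-bit precision, which this sketch does not model.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when fine-tuning the whole weight matrix."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for a LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d, r = 4096, 8  # assumed sizes, for illustration only
    full = full_params(d, d)
    lora = lora_params(d, d, r)
    print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```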

---

Knowledge Graphs—Giving Your Q&A System a "Brain"

Relying solely on RAG for multi-hop reasoning? That's a disaster waiting to happen.

Ask something like: "Which contracts specify confidentiality obligations AND have Shanghai as the jurisdiction?"

RAG either chokes or cobbles together a half-baked answer you can't trust an inch.

Why?

Because plain-text retrieval can't model the structural relationships between entities.

What's the connection between "confidentiality obligation" and "jurisdiction in Shanghai"? The text doesn't spell it out.

That's where knowledge graphs come in.

I picked two hundred contracts and spent two weekends annotating entities and relations. Well, semi-automated, actually: I let a large model extract them first, then verified the results by hand.

Then I built a small Neo4j graph database.

The improvement was obvious the moment you saw it.

Someone asked: "Which contracts have Shanghai as the jurisdiction AND have a penalty clause in Article 5?"

The system returned three contract IDs, along with snippets from the original text.

Traceable. Verifiable.

The biggest value of a knowledge graph isn't that it's "smarter"—it's that it's "interpretable." LLMs make stuff up all the time, but with a graph structure, every step of reasoning maps to a specific entity-relation path. And when things go wrong, it's easy to troubleshoot.
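To show what "every step maps to a path" means, here is a small in-memory sketch. It is not my Neo4j setup: the `Edge` class, the three contracts, and the snippets are all invented, and a real graph database would express the same thing as a pattern-matching query. The idea carries over, though: each condition in the question must be satisfied by a concrete edge, and each edge carries the source text it came from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    contract: str   # contract ID the edge belongs to
    relation: str   # e.g. "jurisdiction" or "has_clause"
    target: str     # the entity on the other end
    snippet: str    # source sentence that supports this edge


# Invented graph, for illustration only.
EDGES = [
    Edge("C-001", "has_clause", "confidentiality obligation", "Art. 7: Both parties shall keep..."),
    Edge("C-001", "jurisdiction", "Shanghai", "Art. 12: Disputes go to the courts of Shanghai."),
    Edge("C-002", "has_clause", "confidentiality obligation", "Art. 6: The receiving party shall..."),
    Edge("C-002", "jurisdiction", "Beijing", "Art. 11: Disputes go to the courts of Beijing."),
    Edge("C-003", "jurisdiction", "Shanghai", "Art. 9: Disputes go to the courts of Shanghai."),
]


def find_contracts(edges: list[Edge],
                   conditions: list[tuple[str, str]]) -> dict[str, list[Edge]]:
    """Return {contract_id: supporting edges} for contracts that satisfy every condition."""
    by_contract: dict[str, list[Edge]] = {}
    for edge in edges:
        by_contract.setdefault(edge.contract, []).append(edge)

    result: dict[str, list[Edge]] = {}
    for contract_id, own_edges in by_contract.items():
        support = []
        for relation, target in conditions:
            hit = next((e for e in own_edges
                        if e.relation == relation and e.target == target), None)
            if hit is None:
                break  # one unmet condition rules the contract out
            support.append(hit)
        else:
            result[contract_id] = support  # every condition had a supporting edge
    return result


if __name__ == "__main__":
    query = [("has_clause", "confidentiality obligation"), ("jurisdiction", "Shanghai")]
    for contract_id, support in find_contracts(EDGES, query).items():
        print(contract_id)
        for edge in support:
            print("  ", edge.relation, "->", edge.target, "|", edge.snippet)
```

When an answer looks wrong, you inspect the edges that produced it. Either the edge is wrong, which is an extraction bug, or the query is wrong, which is a question-understanding bug. Both are findable.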

Can you fully automate building the graph with an LLM?

I tried.

The quality was all over the place.

So I recommend semi-automated: use an LLM to extract entities and relations, then have a human verify them.
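One way to keep the human part manageable is to make the model quote its evidence and throw out anything whose quote does not appear in the source. The sketch below shows that triage step only. The `Candidate` structure and the sample text are hypothetical, and passing the check does not make a triple correct; it only means a reviewer has something real to look at.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    subject: str
    relation: str
    obj: str
    evidence: str  # the sentence the LLM claims supports this triple


def triage(candidates: list[Candidate],
           source_text: str) -> tuple[list[Candidate], list[Candidate]]:
    """Split LLM-extracted triples into (needs_human_review, rejected).

    A candidate is rejected outright when its evidence quote is empty or
    cannot be found in the source text after whitespace normalization.
    """
    normalized_source = " ".join(source_text.split())
    review, rejected = [], []
    for candidate in candidates:
        quote = " ".join(candidate.evidence.split())
        if quote and quote in normalized_source:
            review.append(candidate)
        else:
            rejected.append(candidate)
    return review, rejected


if __name__ == "__main__":
    source = "Article 9. Disputes shall be submitted to the\ncourts of Shanghai."
    candidates = [
        Candidate("C-003", "jurisdiction", "Shanghai", "submitted to the courts of Shanghai"),
        Candidate("C-003", "has_clause", "penalty clause", "Labor Contract, Chapter 2, Section 3"),
    ]
    review, rejected = triage(candidates, source)
    print(len(review), "for review,", len(rejected), "rejected")
```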

The pitfall I hit: when doing automatic extraction across different domains (contracts vs

Cael Lee
