New arXiv paper explains why language models hallucinate, why LL (English)

Generated: 2026-06-21 02:06:46

---


Guess what? The other day I had dinner with a buddy who works in product. Mid-meal, he suddenly slammed his chopsticks down and said, "You're always hyping up ChatGPT—why does it make up answers when it doesn't know? Why can't it just tell me straight up, 'I don't know'?"

I nearly spat out my rice. I've written no fewer than twenty columns on that question over the past decade, covering autoregressive models, probability sampling, biased training data... Every term was technical and every explanation correct, yet every time I finished I felt something was missing, like a dish cooked perfectly that you know still lacks a pinch of salt.

Then last month, I stumbled upon an arXiv paper by a few old friends working in information theory. After reading it, I sat in front of my computer for ten minutes, then slapped my thigh: "Damn, so that's how it is!"

Today, no formalities. I'm going to tell you, in the crudest terms possible, why models would rather make stuff up than admit they don't know.

---

First, let's remove the distractions—assume the training data is clean, parameters are maxed out, and the model isn't just snowballing errors due to some engineering glitch. In a purely ideal state, would hallucinations still pop up out of nowhere?

The paper starts by giving two existing explanations:

One says, well, it's statistical learning—there are so many random facts that the model can't possibly remember them all; it has to guess.

The other says, too few parameters, not enough capacity; it can't store all the facts, so compression causes loss.

Both are correct, but it's like telling me, "The car won't move because it's out of gas"—true, but it doesn't touch on why the engine is designed to burn fuel in the first place. See, both explanations point to the same conclusion: if you allow the model to say "I don't know," wouldn't hallucinations just disappear?

But in reality? I've tried—even if you use RLHF or system prompts to train the model to be more willing to admit ignorance, it still fabricates things. Why?

The paper gives you a brutal answer with math: The model isn't unwilling to say "I don't know"; it fundamentally doesn't know what "not knowing" means.

They formalized this as a "membership test" and then used rate-distortion theory from information theory to prove a nasty lower bound:

When the model's memory capacity is smaller than the total amount of factual information in the training data (and of course, the data is always bigger than the model), the optimal storage scheme ends up storing some non-facts in a form indistinguishable from real facts, and the model's confidence in these false answers is exactly the same as its confidence in true answers.
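A Bloom filter is a handy way to get a feel for this. It is only an analogy, not the paper's construction: a fixed-size membership test that never forgets what it stored, but that also says "yes" to things it never saw, with exactly the same output. The sizes and the made-up fact strings below are assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size membership test: no false negatives, but false positives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item: str):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item: str) -> bool:
        # The answer is a plain True/False: there is no "not sure" output.
        return all(self.bits[pos] for pos in self._positions(item))


def false_positive_rate(num_bits: int, facts, non_facts) -> float:
    """Store the facts, then measure how many never-seen claims are accepted."""
    bloom = BloomFilter(num_bits=num_bits, num_hashes=2)
    for fact in facts:
        bloom.add(fact)
    assert all(fact in bloom for fact in facts)  # stored facts are always recognised
    return sum(item in bloom for item in non_facts) / len(non_facts)


# Hypothetical "random facts": 40 that were seen, 1000 that never were.
facts = [f"person-{i} has id {i * 7919 % 100000}" for i in range(40)]
non_facts = [f"person-{i} has id {i * 104729 % 100000}" for i in range(1000, 2000)]

for num_bits in (64, 256, 4096):
    rate = false_positive_rate(num_bits, facts, non_facts)
    print(f"{num_bits:5d} bits -> {rate:.1%} of never-seen claims accepted as facts")
```

Shrinking `num_bits` relative to the number of stored facts pushes the false-positive rate up, and nothing in the filter's answer tells a real member from a false one.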

Think about how terrifying that is—it's not that it's "vaguely feeling this might be right." It genuinely believes it is right, using the same neural circuitry it uses to remember that "Beijing is the capital."

In other words, the model doesn't have a state of "I don't know." It thinks it knows everything.

---

Honestly, I'm always half-skeptical of theory. So after reading that, I closed the paper and did a quick test myself with GPT-4.

I asked it common-sense questions first, like "Which direction does the sun rise?" It answered quickly and accurately. Then I made up random stuff it definitely hadn't seen—like "Zhang San is ID 13988" or the deliberately absurd "The moon is made of cheese." I asked each question back and forth several times.

The result? On common-sense questions, it barely ever strayed. But when faced with random facts it had never seen, it almost never said "I don't know" directly. Instead, it instantly fabricated an answer in the exact same tone as saying "1+1=2." For example, when I asked, "What's Wang Mazi's phone number?" it gave a number straight away: 138**6723, and when I asked multiple times, it stuck to the same number. For obviously absurd claims, sometimes it corrected me, but more often it just went along and made something up—and once it did, it held its ground stubbornly, refusing to back down.
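If you want to repeat this kind of probing yourself, a small harness along these lines would do. It is a sketch, not the exact procedure used above: `ask` is whatever function calls your model, `stub_model` is a hypothetical stand-in that mimics the behaviour just described, and the refusal phrases are an assumed, incomplete list.

```python
from collections import Counter
from typing import Callable

# Assumption: a refusal can be spotted by a few fixed phrases.
REFUSAL_MARKERS = ("i don't know", "i do not know", "cannot verify", "no information")


def probe(ask: Callable[[str], str], question: str, repeats: int = 5) -> dict:
    """Ask the same question several times; summarise refusals and consistency."""
    answers = [ask(question).strip() for _ in range(repeats)]
    refusals = sum(any(m in a.lower() for m in REFUSAL_MARKERS) for a in answers)
    most_common_answer, count = Counter(answers).most_common(1)[0]
    return {
        "question": question,
        "refusal_rate": refusals / repeats,  # share of answers that declined
        "consistency": count / repeats,      # share that match the most common answer
        "most_common_answer": most_common_answer,
    }


def stub_model(question: str) -> str:
    # Hypothetical stand-in for a real chat API call; swap in your own client.
    canned = {"Which direction does the sun rise?": "In the east."}
    if question in canned:
        return canned[question]
    # Mimics the behaviour described above: a fixed, confident, made-up answer.
    return "The number is 000-0000-0000."


for q in ("Which direction does the sun rise?", "What's Wang Mazi's phone number?"):
    print(probe(stub_model, q))
```

A low refusal rate together with high consistency on a question the model cannot possibly know is the pattern described above.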

This isn't "making stuff up." It's genuine belief.

---

Now you're probably thinking: why not just give it an "I don't know" button?

The paper drops an even more brutal result: Allowing the model to decline to answer doesn't fully eliminate hallucinations either. Because the model can't perfectly distinguish between "I really learned this" and "I feel like I learned this, but it's actually false." If you make it shut up when it's least confident, it will inevitably cut out a huge number of correct answers too—the false rejection rate skyrockets.

From a KL divergence angle, the paper proves that even if you design the optimal filter, there's always a non-zero lower bound for false positives and false negatives. You cannot make the model perfectly refuse without sacrificing accuracy.
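A toy simulation makes the trade-off concrete. This is not the paper's proof, and the numbers are invented: it assumes confidence scores for true and false answers follow two heavily overlapping bell curves. A standard textbook relation (not necessarily the bound the paper proves) says the two error rates of any accept/refuse rule sum to at least one minus the total variation distance between those two distributions, so overlap alone rules out a perfect filter.

```python
import random


def simulate_scores(n: int, seed: int = 0):
    """Draw hypothetical confidence scores for true and false answers."""
    rng = random.Random(seed)
    # Assumption: the two score distributions overlap heavily.
    true_scores = [rng.gauss(0.70, 0.15) for _ in range(n)]
    false_scores = [rng.gauss(0.60, 0.15) for _ in range(n)]
    return true_scores, false_scores


def error_rates(true_scores, false_scores, threshold: float):
    """Answer only when score >= threshold; otherwise refuse."""
    false_reject = sum(s < threshold for s in true_scores) / len(true_scores)
    false_accept = sum(s >= threshold for s in false_scores) / len(false_scores)
    return false_reject, false_accept


true_scores, false_scores = simulate_scores(2000)
for threshold in (0.3, 0.5, 0.65, 0.8, 0.95):
    frr, far = error_rates(true_scores, false_scores, threshold)
    print(f"threshold {threshold:.2f}: correct answers refused {frr:.1%}, "
          f"false answers let through {far:.1%}")
```

Raising the threshold lets fewer false answers through, but only by refusing more correct ones. No setting drives both numbers to zero.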

I talked to a friend working on recommendation systems about this, and he blurted out, "Isn't that just the exploration-exploitation dilemma in the bandit problem?"

I thought about it, and no, it's not the same. This isn't the model hesitating over whether to explore. It's that the model has no idea which things it hasn't learned. Its ignorance of its own ignorance is total.

---

So what do we do? This paper isn't here to offer a cure; it's here to deliver a verdict: On the path of piling up parameters and piling up data, hallucinations will never be fully cured.

The bigger the model, the more true facts it remembers, but the number of false facts also grows—because the base of unheard facts out there is so vast; you remember one hundred million, but there are still ten billion you've never heard of.

But that doesn't mean we give up. Recently I tried Tsinghua's H-neuron work; adjusting only 0.1% of the neurons significantly affected whether the model goes off track. That approach doesn't contradict the information-theoretic perspective—H-neurons give you a practical handle, but information theory tells you: don't expect a fine-tuning tweak to eradicate the problem.

From what I've seen, the real effective strategies right now boil down to three:

Structural compensation. Give it a library—RAG (Retrieval-Augmented Generation). When the model starts hallucinating, you yank it back. Unfortunately, most open-source RAG implementations are too rough; they dump a bunch of noise into the context and just confuse the model more.

Active refusal + external calibration. Use an independent small classifier (like RoBERTa) to monitor the LLM's internal representations and decide whether it should answer. I tried it; in specific domains, it can push hallucinations below 5%, but the refusal rate shoots up to 30%—lots of correct answers get blocked too. That KL lower bound from the paper is real.

Explicit isolation of "random facts." Treat highly random data like phone numbers or ISBNs separately, so the model knows "this is unpredictable." But the cost is sky-high—you'd need massive annotation.
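Two of these ideas, the library and the isolation of random facts, can be sketched as a small router in front of the model. Everything here is a hypothetical toy: the regex patterns, the three-sentence corpus, the stopword list and the `min_overlap` cutoff are assumptions, and real systems would use a proper retriever and a real lookup store.

```python
import re

# Assumption: simple patterns mark "random fact" queries that should never be guessed.
RANDOM_FACT_PATTERNS = [
    re.compile(r"\bphone number\b", re.IGNORECASE),
    re.compile(r"\bISBN\b", re.IGNORECASE),
    re.compile(r"\bID\b"),
]

STOPWORDS = {"the", "is", "of", "a", "an", "in", "and", "to", "what", "which", "does", "do"}

# Hypothetical toy corpus standing in for a real document index.
DOCUMENTS = [
    "The sun rises in the east and sets in the west.",
    "Beijing is the capital of China.",
    "Retrieval-augmented generation adds retrieved text to the prompt.",
]


def tokenize(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def is_random_fact_query(question: str) -> bool:
    return any(pattern.search(question) for pattern in RANDOM_FACT_PATTERNS)


def retrieve(question: str, min_overlap: float = 0.5, top_k: int = 2) -> list:
    """Return documents whose word overlap with the question clears min_overlap."""
    query_tokens = tokenize(question)
    scored = []
    for doc in DOCUMENTS:
        overlap = len(query_tokens & tokenize(doc)) / len(query_tokens) if query_tokens else 0.0
        if overlap >= min_overlap:  # drop weak matches instead of padding the context
            scored.append((overlap, doc))
    scored.sort(reverse=True)
    return [doc for _, doc in scored[:top_k]]


def route(question: str) -> dict:
    """Decide how a question should be handled before the model sees it."""
    if is_random_fact_query(question):
        # Unpredictable data: use an exact lookup or refuse, never generate.
        return {"action": "lookup_or_refuse", "context": []}
    context = retrieve(question)
    action = "answer_with_context" if context else "refuse"
    return {"action": action, "context": context}


for q in ("What is the capital of China?", "What's Wang Mazi's phone number?"):
    print(q, "->", route(q))
```

The `min_overlap` cutoff is the part that addresses the noise problem: a weak match is left out of the context rather than stuffed in. Refusing when nothing is retrieved is a design choice, and it carries the same cost as any refusal rule: some questions the model could have answered correctly get blocked.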

---

You might be thinking, well, is AI even reliable then?

Don't worry—let me tell you something that'll make you feel better.

Back in 2019, when I first started writing an AI column, I got into a heated argument with someone—I said, "Language models aren't knowledge bases." I was torn apart for it. Now this paper mathematically confirms what my gut was telling me back then.

And look, think about humans. On most questions, aren't most people making stuff up too?

Ask someone who trades stocks: "Will the Fed raise interest rates next week?" They'll give you a whole analysis, from inflation numbers to employment data—but do you really think they know?

What's the difference? Humans have metacognition—they know what they don't know. Models don't. This paper reveals a hard truth

Cael Lee
