Why Would an LLM Rather Make Things Up Than Say "I Don't Know"? An Information-Theoretic Answer

Generated: 2026-06-21 02:56:05

---

Oh my god, I'm so tired of being asked about this — but honestly, it's the puzzle I can't stop obsessing over.

Let me start with a story.

Last month, a friend of mine used a large language model to look up an obscure legal precedent. The model spat out a paragraph that looked flawless: case number, judge's name, year of the ruling, all there. My friend trusted it and used it in a presentation. Turned out the case never existed. The model had taken three real precedents, ground them up, and pieced together a fourth, fictional one.

He was furious: "Why couldn't it just say it didn't know?!"

Great question. And the real answer is more heartbreaking — and more terrifying — than you think.

---

I. Rather Lie Than Shut Up? Because It's Mathematically the Optimal Strategy

Let me give you the bottom line upfront: It's not that it's stubborn. It's that it literally cannot tell when it's making stuff up. And from the perspective of information theory, this is mathematically locked in — it's not a bug, it's a feature.

Let me break it down for you.

Two studies have crushed any illusions we might have had. One is the OpenAI team's "Why Language Models Hallucinate." The other is "Calibrated Language Models Must Hallucinate." They reframed hallucination from a generation problem into a discrimination problem: the model is trying to answer a question it was never built to get right, namely "Have I seen this information before or not?"

What really made me slam my fist on the table was a paper this year that used rate-distortion theory. It laid out something clearly: inside an LLM, there's a module that's inherently doing membership testing — judging whether a certain fact was in the training data. But the model has finite capacity; it can't remember every fact. So what's the information-theoretically optimal memory strategy? Assign high confidence to some facts it has never seen, and pretend they're true. Hallucination is mathematically locked in from there.
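If "finite capacity forces confident false positives" sounds abstract, a Bloom filter is the classic everyday version of it. To be clear, this is my own analogy and not the construction from that paper: a fixed-size membership tester that never misses a stored item but, once memory gets tight, starts saying "yes, I've seen that" to items it never saw. The class name, the bit sizes, and the `seen-fact-` / `unseen-fact-` strings below are all made up for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size membership tester: no false negatives, but possible false positives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item: str):
        # Derive several bit positions per item by salting a SHA-256 hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))


def false_positive_rate(num_bits: int, num_seen: int = 1000,
                        num_probes: int = 5000, num_hashes: int = 3) -> float:
    """Store `num_seen` facts, then probe with facts that were never stored."""
    bf = BloomFilter(num_bits, num_hashes)
    for i in range(num_seen):
        bf.add(f"seen-fact-{i}")
    hits = sum(f"unseen-fact-{i}" in bf for i in range(num_probes))
    return hits / num_probes


if __name__ == "__main__":
    # Less memory per stored fact means more confident "yes" answers on unseen facts.
    for bits in (2_000, 8_000, 32_000):
        print(bits, false_positive_rate(bits))
```

The filter has no "maybe" output. A false positive looks exactly like a true positive from the outside, which is the part of the analogy that matters here.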

I tested this myself on a small model (1B parameters). I manually constructed a bunch of completely random facts, like "Zhang Wei's phone number is 138xxxx," and then asked the model about them. Guess what? When the proportion of relevant data the model had seen dropped below a certain threshold, it started fabricating with extremely high confidence, and the fabricated numbers looked perfectly realistic. It didn't know it was guessing. I ran a t-SNE visualization: the internal activation patterns for "fact" and "hallucination" showed almost no separation.
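For anyone who wants to set up a similar probe, here is a minimal sketch of the data side only. It is not my exact script: the `person_00000` naming scheme, the 11-digit number format, and the `coverage` parameter are assumptions chosen for illustration. The idea is to generate facts with no pattern to learn, then control what fraction of them the model ever gets to see.

```python
import random


def make_random_facts(n: int, seed: int = 0) -> dict[str, str]:
    """Build n name -> phone-number facts with no learnable pattern between them."""
    rng = random.Random(seed)
    facts = {}
    for i in range(n):
        name = f"person_{i:05d}"  # hypothetical naming scheme
        number = "138" + "".join(rng.choice("0123456789") for _ in range(8))
        facts[name] = number
    return facts


def split_by_coverage(facts: dict[str, str], coverage: float,
                      seed: int = 0) -> tuple[dict[str, str], dict[str, str]]:
    """Split facts into a 'seen' part (goes into training) and an 'unseen' part."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    rng = random.Random(seed)
    names = sorted(facts)
    rng.shuffle(names)
    cut = round(len(names) * coverage)
    seen = {name: facts[name] for name in names[:cut]}
    unseen = {name: facts[name] for name in names[cut:]}
    return seen, unseen


if __name__ == "__main__":
    facts = make_random_facts(1000)
    seen, unseen = split_by_coverage(facts, coverage=0.3)
    print(len(seen), len(unseen))
```

Querying the model on the `unseen` names is where the confident fabrications would show up, since there is nothing in the `seen` part that lets it derive them.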

In other words: to the model, a fabrication doesn't feel like a vague guess that might be true. It feels just as certain as a real fact. It can't tell the difference.

So the answer to the first question is brutal: It's not "preferring to make things up." It's that it literally has no ability to know it's making things up. It's just executing the optimal bit allocation.

---

II. Clean Data and Full Training? Don't Dream of It.

You might be thinking: "Maybe the training data was too dirty, or the model wasn't trained well enough. What if I scrub the data clean and make it memorize every fact — then it'll work, right?"

No. Statistical learning and information theory both tell you: it's impossible.

Think of the no-free-lunch theorem. The real world is full of random facts: phone numbers, ISBNs, the year of some obscure legal precedent. There's no pattern linking them; you can't derive one from another. The model can only answer correctly if it has seen the vast majority of them. But a training set can never cover every random fact in the real world — so there will always be a huge pile of things it doesn't know, and it can only guess.
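You can even put a rough number on "the pile of things it doesn't know." The sketch below uses the classical Good-Turing estimator, which is a standard statistics tool I'm bringing in for illustration, not something taken from the experiments in this post. It estimates the probability that the next fact you draw is one you have never seen, using the share of observations that appeared exactly once. The function name and the toy data are hypothetical.

```python
from collections import Counter


def good_turing_unseen_mass(observations: list[str]) -> float:
    """Estimate the probability mass of never-seen items as (# singletons) / (# observations)."""
    if not observations:
        raise ValueError("need at least one observation")
    counts = Counter(observations)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(observations)


if __name__ == "__main__":
    # Toy corpus of "facts": b and d each appear once, so the estimate is 2/7.
    corpus = ["a", "a", "b", "c", "c", "c", "d"]
    print(good_turing_unseen_mass(corpus))
```

When most facts in a corpus show up only once, as random facts like phone numbers and ISBNs tend to, this estimate stays large no matter how clean the data is.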

I tried it myself: I asked GPT-4 and Claude-3 for the ISBN of an obscure textbook. Both models fabricated a number that looked plausible. I checked, and the textbook does exist, but the ISBN was something else. The model didn't store that specific fact, but it knew the format of an ISBN, so it filled in the blanks. Smart? In most contexts, we call that "generalization." But when you need exact recall, it's hallucination.

And here's a real gut-punch number: 2 bits/parameter. That's the empirical upper bound given by the knowledge capacity scaling laws. Translation: no matter how well you train your model, each parameter can hold no more than 2 bits of random fact memory. A LLaMA-7B can only store about 1.75 GB of pure facts (7 billion parameters at 2 bits each is 14 billion bits, or 1.75 billion bytes). But the volume of knowledge in internet-scale data is far larger than that, so a huge number of facts will inevitably be forgotten. Which ones get dropped? The random facts. Non-random knowledge like grammar, logic, and common sense gets priority because it's more useful for predicting the next word. So "clean data and full training" can only improve how well you use that ceiling. It cannot change the ceiling itself.
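The arithmetic behind that ceiling is short enough to check yourself. This sketch takes the 2 bits/parameter figure above as given and assumes 1 GB means 10^9 bytes. The function names are mine.

```python
BITS_PER_PARAM = 2.0  # the empirical bound quoted in this post


def fact_capacity_gb(num_params: float, bits_per_param: float = BITS_PER_PARAM) -> float:
    """Upper bound on storable random facts, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def params_needed(fact_gb: float, bits_per_param: float = BITS_PER_PARAM) -> float:
    """Inverse: how many parameters it takes to hold fact_gb of random facts."""
    return fact_gb * 1e9 * 8 / bits_per_param


if __name__ == "__main__":
    print(fact_capacity_gb(7e9))  # 1.75
    print(fact_capacity_gb(1e9))  # 0.25
    print(params_needed(1.75))    # 7000000000.0
```

The relationship is linear, so doubling the parameter count only doubles the ceiling. That is why cleaner data alone can't close the gap.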

I measured it myself. I tested models from 1B to 7B parameters on what proportion of random facts they could answer correctly. It grew exactly along the slope of 2 bits/parameter. For a model to remember all the facts? It would need hundreds of billions of parameters. Current models are one or two orders of magnitude short.

---

III. So Just Make It Say "I Don't Know"? Sorry, That Path Is Blocked.

If the model can't distinguish fact from hallucination, then make it conservative — have it refuse to answer when it's uncertain. That should work, right?

I tried it. Mixed results, and it revealed a deeper problem.

The simplest method: add a system prompt instruction like "If you are not sure of the answer, say you don't know." Then test it. What happened? For a lot of questions it could have answered correctly, it now said "I don't know." That's over-refusal. For example, asked "In which direction does the sun rise?" it hesitated and said "I'm not sure." Is that better than making something up? Maybe not by much.
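To compare "refuses too much" against "makes things up" you need two separate numbers, not one accuracy score. Here's a minimal sketch of how I'd tally them. The `Result` fields and the definition of "answerable" (the model gets the question right without the refusal instruction) are assumptions for illustration, and the sample data is invented.

```python
from dataclasses import dataclass


@dataclass
class Result:
    answerable: bool  # assumed label: model answers this correctly without the instruction
    refused: bool     # model said "I don't know" with the instruction
    correct: bool     # model gave a correct answer with the instruction


def over_refusal_rate(results: list[Result]) -> float:
    """Share of answerable questions that the model now refuses."""
    answerable = [r for r in results if r.answerable]
    if not answerable:
        return 0.0
    return sum(r.refused for r in answerable) / len(answerable)


def wrong_answer_rate(results: list[Result]) -> float:
    """Share of attempted (non-refused) answers that are wrong."""
    attempted = [r for r in results if not r.refused]
    if not attempted:
        return 0.0
    return sum(not r.correct for r in attempted) / len(attempted)


if __name__ == "__main__":
    sample = [
        Result(answerable=True, refused=True, correct=False),    # over-refusal
        Result(answerable=True, refused=False, correct=True),    # fine
        Result(answerable=False, refused=False, correct=False),  # fabrication
        Result(answerable=False, refused=True, correct=False),   # appropriate refusal
    ]
    print(over_refusal_rate(sample), wrong_answer_rate(sample))
```

A prompt that pushes the second number down while pushing the first one up has only moved the problem around.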

The deeper issue comes from a really interesting study (AMRSTrace): the model actually has an internal signal that represents its confidence in the output. But there is no pathway between this signal and the final selection of the output. I replicated that experiment: I had Qwen-7B generate an answer while extracting a metric called "internal margin" from the hidden layers. The model's internal uncertainty about the answer was quite high (margin around 0.5, close to completely flat). But when it output the answer, the first sentence still started with "I am certain that..." (probability > 0.85). Internally it was already in chaos, but the external path had been forcibly compressed onto a "confidence track" by RLHF.
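I can't reproduce the exact definition of "internal margin" here, so treat the following as one plausible stand-in and not the study's metric: the gap between the top two probabilities after a softmax over candidate scores. A gap near 0 means the model is torn between options, and a gap near 1 means one option dominates. The scale of this version may not match the 0.5 figure above, and feeding it logits instead of hidden-layer readouts is my simplification.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_margin(logits: list[float]) -> float:
    """Gap between the two most probable candidates (assumed definition of 'margin')."""
    if len(logits) < 2:
        raise ValueError("need at least two candidates")
    probs = sorted(softmax(logits), reverse=True)
    return probs[0] - probs[1]


if __name__ == "__main__":
    print(top2_margin([0.0, 0.0, 0.0]))   # flat distribution: no preferred answer
    print(top2_margin([10.0, 0.0, 0.0]))  # one candidate dominates
```

The interesting comparison is between a number like this, read from inside the model, and the confidence wording in the text it actually emits.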

Why does RLHF do that? Because when human annotators score outputs, they prefer complete, confident answers. They tend to give low scores to refusals or uncertain responses. So the model is trained to be confident on the outside, even if uncertain on the inside.
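The incentive is easy to see with a toy scoring rule. This is a simplified model of my own, not measured annotator data: a correct answer scores 1, a refusal scores `abstain_score`, and a wrong answer costs `wrong_penalty`. All three parameter names and their values are assumptions.

```python
def expected_guess_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected score of answering when the answer is right with probability p_correct."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_guess(p_correct: float, wrong_penalty: float = 0.0,
                 abstain_score: float = 0.0) -> bool:
    """True if guessing has a strictly higher expected score than refusing."""
    return expected_guess_score(p_correct, wrong_penalty) > abstain_score


if __name__ == "__main__":
    # With no penalty for being wrong, even a 10% shot beats "I don't know".
    print(should_guess(0.10))
    # With a penalty equal to the reward, guessing only pays above 50%.
    print(should_guess(0.10, wrong_penalty=1.0))
    print(should_guess(0.60, wrong_penalty=1.0))
```

Under the no-penalty rule, any nonzero chance of being right makes guessing the better policy, so a model optimized against that rule learns to answer every time.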

Also, "I don't know" is rare in training data. Look at the internet: how many people reply to a forum post with "I don't know"? Most people either stay silent or make something up. The model learned that pattern from the pretraining stage.

So under the current training framework, the problem with "just say I don't know" isn't that the model doesn't want to take that path. The path is almost completely blocked. I tried prefix intervention: injecting two special tokens into the hidden layers to shift the first token distribution from "I" -> "am certain" to "I" -> "guess." The effect was immediate: the model started saying "I guess..." or "I think I remember..." But interestingly, the model didn't know it had been altered. When I asked, "Why didn't you say that with certainty just now?" it fabricated a reason, like "Because I was a bit uncertain about that specific date." It had no idea it was influenced by two extra tokens.

This shows the model can express uncertainty. But it doesn't by default — and that's the result of training and

Cael Lee
