The Technical Principles of Chain-of-Thought in Large Language Models

Generated: 2026-06-20 18:05:02

---

Just now, a friend came running over excitedly and asked me: "Quick, look! This model says it 'thought' for 30 seconds. Is the answer right?" I glanced at the screen, and good grief, it had filled the entire screen with a long derivation, only to end up with the wrong answer. My friend was baffled: "Didn't it 'think deeply'? So why did it mess up?" I sighed and thought to myself: same old problem. Every time a buzzword pops up in the AI world, people treat it like a cure-all. Chain-of-Thought, or CoT for short, is a case in point: it's been three years since that 2022 paper went viral, and most people still don’t get what it actually does. That includes me: the first time I used CoT, I stepped into so many pitfalls that I’m embarrassed to even mention them. Today, no fluff. I’ll break it down for you with four questions.

---

Question 1: Is the essence of Chain-of-Thought that the model is “thinking”?

Don’t kid yourself. Large models don’t think; they only have probabilities. You type some text, and the model guesses which token is most likely to come next, which is not far from reciting memorized answers. When you say “床前明月” (the first four characters of the opening line of Li Bai’s famous poem “Quiet Night Thought”), it spits out “光” (the character that completes the line) because that sequence appears together with high probability in the training data. It’s not that it understands Li Bai; it’s that it remembers well.
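
To make "it only has probabilities" concrete, here is a minimal sketch of a next-character guesser built from nothing but counts. It is a toy, not how a real large model is implemented: the corpus is made up for illustration, and the names `train_next_char_counts` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict

def train_next_char_counts(corpus, context_len=4):
    """Count which character follows each context of `context_len` characters."""
    counts = defaultdict(Counter)
    for text in corpus:
        for i in range(len(text) - context_len):
            context = text[i:i + context_len]
            counts[context][text[i + context_len]] += 1
    return counts

def predict_next(counts, context):
    """Return (most frequent next character, its estimated probability), or None."""
    followers = counts.get(context)
    if not followers:
        return None  # this context never appeared in the toy corpus
    char, n = followers.most_common(1)[0]
    return char, n / sum(followers.values())

# Made-up toy corpus: the real line appears three times, an invented variant once.
toy_corpus = ["床前明月光", "床前明月光", "床前明月光", "床前明月照"]
counts = train_next_char_counts(toy_corpus)
print(predict_next(counts, "床前明月"))  # ('光', 0.75)
```

Nothing in this code knows who Li Bai is. It answers “光” only because that character followed the context most often.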

Now, if you ask it to output a result directly, the moment it encounters a reasoning problem that requires multiple steps, it falls flat on its face. Why? Let me tell you a secret: the Transformer architecture fixes the amount of computation spent on each token. Suppose the model you’re using has 100 layers; then each time it generates a token, it only goes through those 100 layers of computation internally. Now give it a problem that requires 500 steps of logical deduction to solve. A single pass through a 100-layer “brain” can’t carry that many intermediate conclusions. If you force it to guess, the answer will naturally be absurd. You’ve seen models confidently spew nonsense, right? That’s one source of hallucinations.

So what does CoT do? In plain terms, it gives the model a piece of scratch paper. Every time it generates a token as an intermediate step, that token immediately becomes new context, fed back into the model to trigger those 100 layers of computation again. The “thinking process” you see on the screen isn’t its inner monologue. It’s the model writing down intermediate states that a single pass through the network can’t hold, then using what it has written to continue reasoning.

That’s why I’ve always felt the most accurate metaphor isn’t “thinking” but “scratch paper.” Think about it: your high school math teacher made you write out steps on exams—not because the steps themselves were valuable, but because they kept your brain from going off track. CoT works the same way.

The first time I used GPT-3, I naively went straight in, and it couldn’t even correctly compute "2 + 3 × 4." Later, I wrote the calculation steps into the prompt, teaching it step by step like an elementary school student, and only then did it barely get it right. Back then, chains of thought were also very short and fragile: if a wrong number was written at step three, there was no going back, even if the problem became obvious by step ten. Later, everyone started calling this method “Chain-of-Thought,” which sounds much nicer. But the essence hasn’t changed: you trade sequence length for computational depth.
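
The trade of sequence length for depth can be sketched with a toy. Assume a hypothetical `one_step` function that stands in for one fixed-depth pass and can apply exactly one arithmetic operation. On its own it cannot finish "2 + 3 * 4", but if each result is written down and fed back in, the same limited step gets there. This is an analogy in code, not a model of a real Transformer.

```python
import re

def one_step(expr):
    """Apply exactly one operation; a stand-in for one fixed-depth pass."""
    # Multiplication binds tighter than addition, so look for it first.
    rules = (
        (r"(\d+) \* (\d+)", lambda a, b: a * b),
        (r"(\d+) \+ (\d+)", lambda a, b: a + b),
    )
    for pattern, op in rules:
        match = re.search(pattern, expr)
        if match:
            value = op(int(match.group(1)), int(match.group(2)))
            return expr[:match.start()] + str(value) + expr[match.end():]
    return expr  # nothing left to simplify

def solve_with_scratch_paper(expr):
    """Write each intermediate line down and feed it back into one_step."""
    scratch = [expr]
    while True:
        next_line = one_step(scratch[-1])
        if next_line == scratch[-1]:
            return scratch
        scratch.append(next_line)

print(one_step("2 + 3 * 4"))                  # 2 + 12  (one pass is not enough)
print(solve_with_scratch_paper("2 + 3 * 4"))  # ['2 + 3 * 4', '2 + 12', '14']
```

The step function never gets deeper. Only the scratch list gets longer.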

---

Question 2: Why is simple CoT not enough? When must you resort to “long thinking”?

Many beginner tutorials tell you CoT is just adding the phrase “Let’s think step by step” to your prompt. Try it, and some problems do get better. But truly simple ones don’t need it. Take “Xiao Ming has 5 apples. He eats 2. How many are left?”: whether you add that sentence or not, the model can recite the answer directly.
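
For reference, the tutorial trick amounts to little more than string concatenation. The sketch below only builds the prompt text and calls no model; the `Q:`/`A:` layout and the `build_prompt` helper are assumptions for illustration, not any particular library's API.

```python
COT_TRIGGER = "Let's think step by step."

def build_prompt(question, use_cot=True):
    """Format a question as a prompt, optionally ending with the step-by-step cue."""
    prompt = f"Q: {question.strip()}\nA:"
    if use_cot:
        prompt += " " + COT_TRIGGER
    return prompt

apples = "Xiao Ming has 5 apples. He eats 2. How many are left?"
print(build_prompt(apples, use_cot=False))
print(build_prompt(apples, use_cot=True))
```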

But try this problem on it:

A pool has an inlet pipe and an outlet pipe. The inlet pipe alone fills the pool in 6 hours; the outlet pipe alone empties it in 8 hours. If the outlet pipe runs for 2 hours first, then the inlet pipe is turned on, how many hours will it take to fill the pool?

The first time I asked GPT-3 this, it confidently calculated 3.5 hours. Upon checking, I found it had swapped the signs for filling and draining at step three.

What’s the root cause? Not that the model doesn’t understand fraction arithmetic, but that its fixed number of computational layers can’t juggle so many intermediate variables at once. Pool capacity, inlet rate, outlet rate, water already drained: they are all packed into a 100-layer network like a crowd in an elevator, pushing and shoving until nobody can get out.

So when must you use “long thinking”? When you find that a problem requires recording multiple intermediate states, performing multi-step reasoning, and cross-referencing them. A single “step by step” is like handing you a piece of blank paper but forbidding you to write—you’re still forced to compute in your head. True long CoT writes out every intermediate result, letting the model “see” how far it has computed, so it can keep going.
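
Here is what "writing out every intermediate result" might look like for the pool problem above, as a small sketch with exact fractions. The problem as stated leaves two things open, so the code assumes that the pool starts full and that the outlet stays open after the inlet is turned on. Under a different reading the answer changes (for example, it would be 1.5 hours if the outlet were closed at that point). The function name `pool_scratch_paper` is hypothetical.

```python
from fractions import Fraction

def pool_scratch_paper(fill_hours=6, drain_hours=8, drain_first_hours=2):
    """Return (scratch lines, hours to refill).

    Assumes the pool starts full and the outlet stays open throughout.
    """
    inlet_rate = Fraction(1, fill_hours)       # fraction of the pool filled per hour
    outlet_rate = Fraction(1, drain_hours)     # fraction of the pool drained per hour
    drained = outlet_rate * drain_first_hours  # water lost before the inlet opens
    net_rate = inlet_rate - outlet_rate        # filling minus draining: the sign matters
    hours = drained / net_rate
    steps = [
        f"inlet rate = {inlet_rate} pool per hour",
        f"outlet rate = {outlet_rate} pool per hour",
        f"drained in the first {drain_first_hours} hours = {drained} pool",
        f"net rate with both pipes open = {net_rate} pool per hour",
        f"hours to refill = ({drained}) / ({net_rate}) = {hours}",
    ]
    return steps, hours

steps, hours = pool_scratch_paper()
for line in steps:
    print(line)
```

Under these assumptions the result is 6 hours. Each line depends only on lines already written, which is exactly what lets a reader, or a model, pick up from the last intermediate value instead of holding all of them at once.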

---

Question 3: So is CoT a cure-all? Where does it fail?

Guess what? CoT dreads two types of problems.

The first type: the problem itself has no logic. For example, “What did you eat yesterday? Answer with three steps of reasoning.” The more the model “thinks,” the more absurd the answer, because no thinking was needed in the first place. Forcing CoT here is like putting wings on a bicycle—not only useless, but cumbersome.

The second type: the problem requires external knowledge that the model hasn’t learned. CoT can only help it organize existing knowledge; it can’t conjure up new facts. If you ask it a tough college-entrance exam math problem it’s never seen, even if it writes a hundred steps, it still won’t know the answer. That’s when you see it writing eloquently, making you excited, but then the final step delivers an absurd number—that’s exactly what my friend encountered just now.

---

Question 4: So what’s the correct way to use it? Remember two sentences.

At this point, let me give you the most straightforward criterion: If a problem requires you to grab a pen and do more than a couple of steps on scratch paper, then it’s time to use CoT—and you should make the model write each step clearly. Conversely, if you can blurt out the answer at a glance, don’t let the model waste time.

Also, don’t worship “long thinking.” Longer isn’t always better. If you make the model ramble on, it’s more likely to make a mistake around some corner—and the earlier the mistake, the farther off it will be later. That’s why today’s truly powerful reasoning models are all doing one thing: for steps with high uncertainty, they voluntarily write more detailed intermediate processes; for simple steps, they breeze past them. Just like humans—slow down where you need to take a detour, stride ahead on the straight road.
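
Why longer isn't automatically better can be seen with a back-of-the-envelope calculation. As a simplifying assumption, suppose every step is correct with the same probability and errors are independent. Real model errors are not independent, so treat this as an illustration only.

```python
def chain_success_probability(per_step_accuracy, num_steps):
    """Probability that every step in a chain is right, assuming independent steps."""
    return per_step_accuracy ** num_steps

# Even at 99% accuracy per step, long chains get shaky.
for num_steps in (5, 20, 100):
    p = chain_success_probability(0.99, num_steps)
    print(f"{num_steps:>3} steps: {p:.3f}")
```

With 99% per-step accuracy, a 100-step chain comes out fully correct only about 37% of the time in this toy model, which is the arithmetic behind not padding the chain with steps that don't need to be there.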

---

One last honest thing.

Four years ago, when I first started tinkering with CoT, I thought I’d found the key to unlocking the wisdom of large models. Later, I realized it was just a piece of scratch paper: useful, but don’t count on it to make the model “awaken.” Today, whenever you see slogans like “deep thinking” or “reasoning enhancement,” remember just one sentence: CoT doesn’t make the model smarter; it makes the clumsy process visible to you. You think it’s thinking? It’s just doing its sums on scratch paper.

And the one who should actually think is you.


Cael Lee
