GPT-4 Reasoning Up 1750%! Princeton Researcher and Tsinghua Yao Class Alumnus Proposes… (English)

Generated: 2026-06-23 11:52:02

---

That 1750%? Don't buy it! But what's behind it, you've gotta see.

Let me start with something that'll wake you right up.

Ever seen those ads? "Use our method and your success rate skyrockets 1750%!" First reaction: "Holy crap, that's insane!" And then you want to share it?

Stop right there. Let me spell it out for you: That number is a gorgeous piece of clickbait math magic!

Look: going from 4% to 74% is a 70 percentage point increase. So where does 1750% come from? It's the relative increase: (74 - 4) / 4 × 100% = 17.5 × 100% = 1750%. In other words, the gain is 17.5 times the original rate, and the new rate is 18.5 times the old one. Yeah, the arithmetic checks out. But seriously, what normal person calculates it that way? Would it kill them to just say "success rate improved by 70 percentage points"? No! Then why 1750%? Because it's shocking! Because some people see it, get hyped, and hit share!
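If you want to see the two ways of counting side by side, here's a tiny sketch. The function names are mine, made up for illustration; only the 4% and 74% figures come from the article.

```python
def percentage_point_gain(before_pct, after_pct):
    """Absolute gain: simply subtract the two rates."""
    return after_pct - before_pct


def relative_gain_pct(before_pct, after_pct):
    """Relative gain: the absolute gain measured against the starting rate."""
    return (after_pct - before_pct) / before_pct * 100


before, after = 4.0, 74.0
print(percentage_point_gain(before, after))  # 70.0 percentage points
print(relative_gain_pct(before, after))      # 1750.0 percent
print(after / before)                        # 18.5 times the old rate
```

Same data, three numbers. The headline just picked the loudest one.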

Now, you might think this article is here to trash that statistic as a scam.

No, no — that's not the point at all. What really catches the eye of an old-timer like me who's been wrestling with AI for years isn't the absurdly exaggerated number. It's the brilliant thinking framework hiding behind it — Tree of Thoughts. This thing is, hands down, the most impressive method I've seen in the last two years for giving AI a "System 2" slow-thinking brain.

I first saw that paper back in May 2023, when I was still running CoT (Chain of Thought) baselines. Honestly, it blew my mind. Made my scalp tingle.

---

The first thing that won me over: It taught GPT-4 how to "backtrack"

Ever seen GPT-4 play the 24 Game using Chain of Thought? I tried it. The scene was both dumb and hilarious.

Give it "3, 3, 8, 8" — four cards — and guess what it does?

Step 1: 8+3=11

Step 2: 11+3=14

Step 3: 14+8=22

Conclusion: After all that, can't reach 24.

It just plows ahead into the dark! Wrong? Doesn't care, never looks back! Why? Because CoT is a one-way street — it goes down one path until it hits a dead end. If you regenerate, it might choose a different path, but it's still a new one-way street to nowhere.
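Here's the funny part: "3, 3, 8, 8" is solvable, via 8 / (3 - 8/3). The dead end above is the path's fault, not the cards'. A minimal sketch of an exhaustive solver shows it; this is plain brute force with exact fractions, written for illustration (the name `solve24` is hypothetical), and it is not anything GPT-4 or the paper runs.

```python
from fractions import Fraction
from itertools import combinations


def solve24(cards, target=24):
    """Return one expression that evaluates to target, or None if none exists."""
    start = [(Fraction(c), str(c)) for c in cards]

    def search(items):
        if len(items) == 1:
            value, expr = items[0]
            return expr if value == target else None
        # Pick any two numbers, combine them, and recurse on the smaller hand.
        for i, j in combinations(range(len(items)), 2):
            (a, ea), (b, eb) = items[i], items[j]
            rest = [items[k] for k in range(len(items)) if k not in (i, j)]
            candidates = [
                (a + b, f"({ea}+{eb})"),
                (a * b, f"({ea}*{eb})"),
                (a - b, f"({ea}-{eb})"),
                (b - a, f"({eb}-{ea})"),
            ]
            if b != 0:
                candidates.append((a / b, f"({ea}/{eb})"))
            if a != 0:
                candidates.append((b / a, f"({eb}/{ea})"))
            for value, expr in candidates:
                found = search(rest + [(value, expr)])
                if found is not None:
                    return found  # first success wins; failures fall through
        return None

    return search(start)


print(solve24([3, 3, 8, 8]))
```

Notice that the solution needs an ugly intermediate fraction. A model marching down a single chain of tidy additions never goes near it.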

But what did ToT do? Something so simple yet incredibly effective!

It makes the model take a few steps, then stop and check: "Is this path looking good? Feels off? Switch to another!"

Sounds familiar? It's like when you were a kid playing chess — look one move ahead, think three moves ahead. Right! This is actually that old "search tree" trick from AI research in the 1970s, but applied to the new body of LLMs, and the results were mind-blowing!

I read through the examples in that paper over and over. In the 24 Game, ToT makes GPT-4 generate an intermediate step, then act as its own judge: "Buddy, how far is this expression from 24? Any hope?" Then it keeps only the five most promising paths and continues exploring. Once it hits a dead end, it really does turn back from the wall: not just saying "I was wrong," but retreating to a previous node and starting a new exploration from there.

Result? Success rate jumped from 4% to 74%! That's not magic — it's the power of making the model honestly "think twice before acting"!
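To make the "propose, judge, keep the best few, continue" loop concrete, here's a rough sketch of a breadth-first version. Big assumption up front: the judge here is a toy distance-to-24 heuristic standing in for the LLM's self-evaluation, and every name (`propose`, `heuristic_value`, `tree_of_thoughts_bfs`, `beam_width`) is mine, not the paper's code.

```python
from fractions import Fraction
from itertools import combinations

TARGET = 24


def show(x):
    """Print whole numbers bare and other fractions in parentheses."""
    return str(x) if x.denominator == 1 else f"({x})"


def propose(state):
    """Every state reachable by combining two of the remaining numbers."""
    nums, steps = state
    children = []
    for i, j in combinations(range(len(nums)), 2):
        a, b = nums[i], nums[j]
        rest = [nums[k] for k in range(len(nums)) if k not in (i, j)]
        options = [(a + b, f"{show(a)}+{show(b)}"), (a * b, f"{show(a)}*{show(b)}"),
                   (a - b, f"{show(a)}-{show(b)}"), (b - a, f"{show(b)}-{show(a)}")]
        if b != 0:
            options.append((a / b, f"{show(a)}/{show(b)}"))
        if a != 0:
            options.append((b / a, f"{show(b)}/{show(a)}"))
        for value, text in options:
            children.append((rest + [value], steps + [f"{text}={show(value)}"]))
    return children


def heuristic_value(state):
    """Toy stand-in for the LLM judge: a number closer to 24 scores higher."""
    nums, _ = state
    return -min(abs(n - TARGET) for n in nums)


def tree_of_thoughts_bfs(cards, beam_width, evaluate=heuristic_value):
    """Expand every surviving path, score the results, keep the top beam_width."""
    frontier = [([Fraction(c) for c in cards], [])]
    for _ in range(len(cards) - 1):
        candidates = [child for state in frontier for child in propose(state)]
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam_width]  # prune: weaker paths are dropped here
    for nums, steps in frontier:
        if nums[0] == TARGET:
            return steps
    return None


print(tree_of_thoughts_bfs([1, 2, 3, 4], beam_width=5))
```

Because several partial paths stay alive at every level, one bad branch doesn't sink the whole attempt. With a judge this crude, a narrow beam may still prune the winning branch on hard hands, which is exactly why the quality of the judge matters so much.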

You might say: "Isn't that just brute-force trial and error?" No, no. The key is the evaluation function: letting the LLM be its own judge and score its own intermediate steps for promise. That ability to "score itself" was a striking novelty in 2023. Now? Having the model check and revise its own reasoning is a hallmark of top-tier reasoning models like o1 and R1.
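What might "being its own judge" look like in code? One plausible shape: ask the judge for a verdict a few times and add up the votes. Everything below is an illustrative assumption: the three labels, the weights in `LABEL_WEIGHTS`, and `stub_judge`, which is a hard-coded rule sitting where a real LLM call would go.

```python
from collections import Counter

# Hypothetical weights: a confident verdict should dominate, a hopeless one
# should count for almost nothing. The exact numbers are an assumption.
LABEL_WEIGHTS = {"sure": 20.0, "likely": 1.0, "impossible": 0.001}


def stub_judge(state_text):
    """Stand-in for an LLM call that labels a state's chance of reaching 24."""
    numbers = [float(token) for token in state_text.split()]
    if len(numbers) == 1:
        return "sure" if numbers[0] == 24 else "impossible"
    return "likely"  # with several numbers left, this stub cannot tell


def score_state(state_text, judge, n_votes=3):
    """Query the judge n_votes times and sum the weights of its verdicts."""
    votes = Counter(judge(state_text) for _ in range(n_votes))
    return sum(LABEL_WEIGHTS[label] * count for label, count in votes.items())


print(score_state("3 8", stub_judge))
print(score_state("24", stub_judge))
```

With a real model behind `judge`, repeated votes would smooth out its inconsistency: one stray "impossible" can't kill a path that two other votes called "likely".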

---

The second thing that won me over: The guy who wrote ToT, Yao Shunyu

This guy's story itself is pretty interesting.

He got a silver medal in NOI (National Olympiad in Informatics) — no direct admission to university. Most people would hold a grudge. But he scored 704 on the college entrance exam (the Gaokao) and got into Tsinghua's Yao Class on his own merit. And that's not even the kicker — while at Tsinghua, he also founded a rap club! Imagine: a guy working on the hardest-core AI, also getting into the most expressive music. The vibe is something else.

The biggest thing I felt from reading his paper is: He's not afraid to experiment; he almost enjoys the hustle.

Before ToT, he did ReAct — making AI think and act at the same time. Then ToT — making AI think and backtrack. Back in 2023 when he released ToT, a lot of old-school academics said: "Isn't this just ancient search? What's the big deal?" But a year later? OpenAI's o1, DeepSeek's R1 — all doing the same thing: making the model spend more "brainpower" to think through reasoning.

When I used DeepSeek-R1 in early 2025, I saw it suddenly blurt out: "Wait, let me rethink." I laughed immediately — isn't that a personified version of ToT's "backtrack" operation? The model had learned to reflect during training! R1-Zero's training curve even has a clear inflection point that the paper calls the "Aha Moment." Seeing that, don't you think it's cool?

After ToT, Yao Shunyu went to OpenAI and built Operator (an AI agent that can operate your computer). Then Tencent poached him with a huge offer to lead their Hunyuan large model. I bet his logic has always been: What's the point of just writing papers and releasing concepts? Actually build something that hundreds of millions of people use — that's real skill! His SWE-bench is now the world standard for measuring AI coding ability.

See? A truly great researcher isn't someone who just throws out incomprehensible concepts. It's someone who can personally forge a tool that others have to use.

---

The third thing that won me over: ToT pointed a clear path for the entire AI industry

When ToT first came out, CoT was just getting hot, and everyone was still figuring out fancier prompts to trick out good results. ToT slammed the table and shouted something eye-opening: Stop obsessing over prompts! Can we make the model take more steps when it thinks?

And what happened after? You've all seen it:

OpenAI's o1 uses reinforcement learning to train a "hidden chain of thought," taking seconds to minutes per problem, sometimes longer.
Claude 3.7 launched "extended thinking" mode, where you can set a budget (say, let it think for 3000 tokens) to control depth.
Gemini 2.5 Pro's Deep Think mode reportedly explores multiple reasoning chains in parallel and checks its own mistakes.

All these big companies' best products are essentially answering the core question ToT raised back then: Large models predict the next token autoregressively, which is like human System 1 (fast thinking). How can we equip them with System 2 (slow thinking) wings?

You might think these big company products are far from you. Let me put it another way: these days, do you still write a prompt that just says "Let's think step by step"? No! You'll definitely ask it to first list several options, evaluate each, then execute step by step. You're already using ToT's idea without knowing it!
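That "options first, judge them, then execute" habit is easy to bottle. Here's a small sketch of a prompt builder; the wording of the steps and the name `deliberate_prompt` are my own illustration, not a template from the paper or from any vendor.

```python
def deliberate_prompt(task, n_options=3):
    """Build a prompt that asks for options, an evaluation, then execution."""
    return "\n".join([
        f"Task: {task}",
        f"1. List {n_options} distinct approaches to this task.",
        "2. For each approach, rate how promising it is and say why.",
        "3. Pick the most promising approach and carry it out step by step.",
        "4. If it hits a dead end, go back to step 3 with the next-best approach.",
    ])


print(deliberate_prompt("Use 3, 3, 8, 8 to make 24"))
```

It's all in a single prompt, so the model is still walking one chain. But the chain now contains proposing, scoring, and a way back.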

In 2024, when I led a team on an agent project, the core insight was: **An LLM that generates an answer in one shot is like a fresh intern — if you tell them to do it once and hand it in, quality

Cael Lee
