DeepSeek-R1 Released, with Performance Rivaling OpenAI

Generated: 2026-06-22 02:13:43

---

Oh, speaking of last night—I stayed up until 3 a.m. Guess what I was doing?

Not working overtime, not binge-watching shows, not gaming. I was "playing" with a model—DeepSeek R1.

Honestly, it’s been a long time since an AI shook me like this. The last time I got this excited was when GPT-4 first dropped. But this time it’s different—this thing isn’t just powerful; it gave me the bizarre feeling that I was watching something wake up.

---

Where the hell did this thing come from?

Let me set the stage.

Back in December, when DeepSeek-V3 went open-source, I already sensed something was off. A 671B-parameter model, reportedly trained for only $5.57 million, going toe-to-toe with GPT-4 in performance. I wrote an article at the time calling it China’s “Pearl Harbor moment” for AI, and the comments section blew up.

Looking back now? V3 was just the appetizer.

The real bomb dropped on the evening of January 20th with R1. I was having dinner when my phone buzzed with the notification. I opened it and nearly dropped my chopsticks. AIME 2024 math test: 79.8%, on par with OpenAI o1-1217. MATH-500: 97.3%. Codeforces Elo rating: 2029, surpassing 96.3% of human competitors.

That’s not impressive. That’s insane.

Even more insane? It’s open-source. Two 671B models (R1-Zero and R1), plus six distilled versions ranging from 1.5B to 70B, weights and all, released in the open.

Think about it: a top-tier reasoning model, just handed out like candy. Two years ago, who would have believed it?

---

R1-Zero: the thing that gives you goosebumps

I need to talk about R1-Zero.

A lot of people overlook how terrifying this thing is. It was trained with no supervised fine-tuning on human demonstrations at all: reinforcement learning (RL) was applied directly to the base model. Nobody handed it worked solutions to imitate; it just stumbled around in the dark, groping for answers and getting rewarded when it found them.

And the result? On AIME 2024, pass@1 shot from an initial 15.6% all the way to 71.0%. With majority voting, it hit 86.7%, surpassing o1-0912.
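In case those two numbers blur together: pass@1 grades individual sampled answers, while majority voting samples many answers per problem and grades only the most common one. Here is a rough Python sketch of the difference. The function names and the sample answers are made up for illustration; this is not the report's evaluation code.

```python
from collections import Counter

def pass_at_1(samples, truth):
    """Fraction of individual samples that are correct (an estimate of pass@1)."""
    return sum(s == truth for s in samples) / len(samples)

def majority_vote(samples):
    """Most frequent answer among the samples."""
    return Counter(samples).most_common(1)[0][0]

# Hypothetical answers sampled for one problem whose true answer is "204".
samples = ["204", "204", "17", "204", "96", "204", "17", "204"]
truth = "204"

print(pass_at_1(samples, truth))        # share of samples that are right
print(majority_vote(samples) == truth)  # is the voted answer right?
```

Voting helps whenever the correct answer is the single most popular one, even if plenty of individual samples miss. That is why the voted score can sit well above pass@1.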

There’s a passage in the technical report that gave me full-body chills. Paraphrasing: even the researchers didn’t anticipate R1-Zero’s “aha moment.”

Do you hear that? The model started learning to reflect. It began checking its own answers, spending more time on complex problems. Nobody taught it to reflect—it learned on its own.

Just like how AlphaZero taught itself Go, except this time it’s a language model.

Someone online said, “2025 might be the year of RL.” I’m starting to think that might really come true.

---

What the hell is GRPO? Let me break it down for you.

Online articles about GRPO (Group Relative Policy Optimization) always throw around terms like policy, reward, advantage, actor, critic… a bunch of jargon that leaves you dizzy.

Let me put it another way.

Traditional PPO is like hiring a professional coach and a referee to train an athlete. The coach (Critic model) has to watch every move in real-time and judge whether it’s good or bad; the referee (Reward Model) gives scores. Both roles are huge neural networks that grow alongside the athlete, consuming enormous memory and compute.

What does GRPO do? It fires the coach outright.

It makes the athletes form their own small groups and compete. Whoever performs better gets a higher score; you just compare them. It’s like group competitions in a classroom—you generate eight answers, correct ones get points, wrong ones lose points. Simple and brutal. Only when there’s no standard answer for a subjective problem do you call in the referee to score.
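If you want the classroom contest in code, here is a minimal sketch of the group-relative scoring idea: score each answer with a simple rule, then normalize every score against its own group's mean and standard deviation. The helper names, the exact-match rule, and the use of population standard deviation are my assumptions for illustration. This is not DeepSeek's training code, and it leaves out the policy update itself.

```python
import statistics

def rule_based_reward(answer, truth):
    """1.0 for a correct final answer, 0.0 otherwise (no learned referee needed)."""
    return 1.0 if answer.strip() == truth.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sample relative to its own group: (r - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]

# Eight hypothetical answers to one question whose correct answer is "42".
answers = ["42", "41", "42", "7", "42", "40", "42", "13"]
rewards = [rule_based_reward(a, "42") for a in answers]
advantages = group_relative_advantages(rewards)

for a, adv in zip(answers, advantages):
    print(f"{a:>3}  advantage {adv:+.2f}")
```

Correct answers end up with a positive advantage and wrong ones with a negative advantage, with no critic network involved. If every answer in a group gets the same reward, all advantages come out as zero and that group contributes no learning signal.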

The advantages are obvious: less memory, less compute, much lower engineering complexity.

But the cost is also high: training is unstable in the early stages and prone to collapse. It’s like a bunch of students with no teacher, just muddling along and smashing into walls. The DeepSeek team had to invest a ton of effort to stabilize the process.

I once wrote an article called “The Prisoners and Escapers of Probability,” and this is exactly the logic. Large models essentially search for optimal solutions in a probability space based on training signals. GRPO’s idea is to find a more efficient way to “find the path”—and it succeeded.

---

I gave it a spin myself

I picked a decryption puzzle that o1 had handled and threw it at R1.

It took 74 seconds and gave the correct answer.

During the process, its thinking was wrapped in ... tags. You could watch its step-by-step reasoning like a live internal broadcast. That feeling is definitely different from other models—Gemini 1206, Claude 3.5, GPT-4 Turbo; with those you have to craft your prompts meticulously, adjusting the wording over and over.

With R1, I felt more like I was chatting with a colleague.

It makes mistakes too, but often gets it right. The most amazing part: when it realizes it's wrong, it changes tack and starts over. This ability isn’t hard-coded; it emerged during RL training.

One netizen put it perfectly: “It’s like it pushed open a door.” I even think that metaphor isn’t enough—it pushed open a small door toward AGI.

---

The industry is in an uproar: Meta’s panicking, Nvidia snarks

The academic reaction was more intense than I expected.

UC Berkeley’s Alex Dimakis straight-up said: “DeepSeek is already ahead; US companies need to catch up.” When a Berkeley professor says that, it carries weight.

Casper Hansen, the author of AutoAWQ, analyzed that R1 uses multi-stage cyclic training: base → RL → fine-tune → RL → fine-tune → RL. As I understand it, the core idea is to first let the model grow reasoning capabilities through RL, then use SFT for quality control, iterating back and forth.

Machine Heart ran a story claiming that Meta is “already in a panic,” frantically analyzing DeepSeek and baffled by how they achieved such low training costs.

Even more intriguing is the statement from Nvidia senior scientist Jim Fan. He posted on X, roughly: “Impact can be achieved through ‘ASI achieved internally’ or fancy names like ‘Project Strawberry.’ Impact can also be achieved by simply showing raw algorithms and matplotlib learning curves.”

Read between the lines.

OpenAI wraps AGI in religious ceremony—secret codenames, private beta invitations. DeepSeek just dumps the code on GitHub, hangs the learning curves out, and says: “See? This is how it’s trained. You can do it too.”

The contrast couldn’t be starker.

---

The money: I did the math

R1’s API pricing: per million input tokens, ¥1 (about $0.14) on a cache hit and ¥4 (about $0.55) on a miss. Output tokens: ¥16 (about $2.19) per million.

What’s o1’s price? Input $15, output $60.

Convert it: R1 runs about 3% to 5% of o1’s cost.
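Here is that arithmetic as a small Python sketch, so you can plug in your own traffic. It assumes launch-era list prices in USD per million tokens: R1 at $0.55 input (cache miss) and $2.19 output, o1 at $15 and $60. The workload is hypothetical, and prices may have changed since.

```python
# Assumed list prices in USD per million tokens (may be out of date).
PRICES = {
    "r1": {"input": 0.55, "output": 2.19},    # R1 input priced at cache miss
    "o1": {"input": 15.00, "output": 60.00},
}

def cost_usd(model, input_tokens, output_tokens):
    """Total API cost in USD for a given token volume."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical monthly workload: 200M input tokens, 50M output tokens.
r1 = cost_usd("r1", 200_000_000, 50_000_000)
o1 = cost_usd("o1", 200_000_000, 50_000_000)
ratio = r1 / o1

print(f"R1 ${r1:,.2f} vs o1 ${o1:,.2f} -> {ratio:.1%} of o1's cost")
```

Under these assumptions the ratio lands a bit under 4%, inside the 3% to 5% range. Cache hits on the input side would push it lower still.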

I’ve already started switching.

For anyone building AI applications, this is huge. A top-tier reasoning model at such low cost means many products that were previously shelved due to cost can now be resurrected.

---

Don’t celebrate too soon—a few pitfalls I have to point out

First, while R1-Zero is amazing, it has readability issues. It mixes multiple languages and its expressions aren’t natural enough. The later DeepSeek-R1 fixed this by adding multi-stage SFT.

Second, these are 671B-parameter models, and most people can’t run them themselves. I tried inference on a single A100 80G and hit OOM immediately. The official API works fine, but if you want local deployment, you need serious hardware.
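The OOM is no surprise once you do the back-of-envelope math. The sketch below counts memory for the weights alone, taking the parameter count as roughly 671B and assuming 1 byte per parameter for FP8 and 2 for BF16. KV cache, activations, and framework overhead are ignored, so real requirements are higher. The function names are mine.

```python
import math

def weight_memory_gb(params_billion, bytes_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param

def min_gpus(params_billion, bytes_per_param, gpu_gb=80):
    """Lower bound on GPU count just to hold the weights."""
    return math.ceil(weight_memory_gb(params_billion, bytes_per_param) / gpu_gb)

PARAMS_B = 671  # assumed total parameter count, in billions

for name, bytes_per_param in [("FP8", 1), ("BF16", 2)]:
    need = weight_memory_gb(PARAMS_B, bytes_per_param)
    gpus = min_gpus(PARAMS_B, bytes_per_param)
    print(f"{name}: ~{need} GB of weights, at least {gpus} x 80 GB GPUs")
```

Even in the most compact of these two formats, the weights alone are several times what one 80 GB card can hold, which is why a single A100 falls over before generating a token.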

Third, RL training instability is a big problem. GRPO saves the cost of a Critic model but loses fine-grained guidance. This trade-off isn’t acceptable to everyone. If your business requires extremely high output quality in complex scenarios, traditional PPO might still be the safer bet.

---

The bigger picture: Scaling Law is shifting direction

There’s a trend I’ve been tracking that was explicitly highlighted here.

“Until recently, most of the compute used to train LLMs was spent on pre-training. In the past, we focused mainly on scaling pre-training, while post-training was a minor expense.”

But that’s changed.

DeepSeek-R1-Zero used 100,000 H800 GPU hours for RL training

Cael Lee
