DeepSeek-R1 Technical Report Explained

Generated: 2026-06-20 19:30:56

---

To be honest, there I was at three in the morning, staring at a loading spinner, just waiting for a technical report to finish loading.

Someone asked me whether it was worth it. Let me tell you: when I got to the last page, I literally jumped out of my chair.

Back in January when DeepSeek-R1 first came out, I went through its 60-page report, and there was this voice in my head that kept saying: something’s off, they’re definitely hiding something.

Well, damn. This supplementary technical report fills in almost all of those gaps. The training details and the small-model distillation section alone took me three read-throughs and two full pages of notes.

Let’s start with the most mind-blowing part: a model that learned to reflect on its own, without a teacher.

You already know DeepSeek-R1 is impressive, right?

But did you know there’s something called DeepSeek-R1-Zero that’s the real one to remember?

At first, I didn’t think much of it—training without any human-annotated data? That’s just a rogue approach, right? But after reading the details they released, I have to admit I was wrong.

How are traditional reasoning models trained? Basically, you feed them college entrance exam questions and tons of answers, let the model memorize them wholesale, then use reinforcement learning for fine-tuning.

DeepSeek did something counterintuitive this time: they directly applied reinforcement learning to the base model, cutting out the most expensive annotation step.

They used the GRPO algorithm. The simplified explanation: it drops the separate value model that traditional RL relies on, samples a group of answers to the same question, and scores each answer against the others in that group to update the policy.
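Here is how I picture the group comparison in code. This is a minimal sketch, assuming the usual group-relative advantage (each reward minus the group mean, divided by the group standard deviation). The function name, the use of the population standard deviation, and the example rewards are my own illustration, not something taken from the report.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to the other answers for the same prompt.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    No value model is involved; the group itself is the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four answers sampled for one question
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Answers that beat the group average get a positive advantage and the rest get a negative one. If every answer in the group earns the same reward, all advantages are zero and that question contributes no learning signal.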

I tried a simplified version of this in my own project last year, but gave up when I hit convergence issues. Not only did the DeepSeek team pull it off, they trained a monster.

What shocked me most were the details of the training process.

If you watch the frequency of the “wait” token (the word “wait” the model writes down while thinking), you’ll see—it barely appears at first, then around step 8000, it suddenly spikes like an EKG.

What does that mean? The model learned to check its logic and self-correct during the training process. They call this the “Aha Moment.”

I spent an afternoon studying Table 3 in the report. It has a particularly striking example of self-evolution: the model is halfway through solving an equation when it suddenly stops, reflects (“Wait, this step is wrong”), and goes back to re-derive it.

This wasn’t pre-programmed. It’s not a hardcoded rule. It’s an ability that emerged naturally during reinforcement learning training.

And the results? On the AIME 2024 math competition, DeepSeek-R1-Zero’s accuracy jumped from an initial 15.6% to 71.0%, and with a voting mechanism it reached 86.7%.

It went toe-to-toe with OpenAI o1-0912.
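The gap between the single-answer score and the voting score is easier to see with a toy example. A minimal sketch, assuming "voting" means taking the most common final answer across several samples for the same problem. The function names and the sample answers below are hypothetical.

```python
from collections import Counter

def pass_at_1(answers, truth):
    """Fraction of individual samples whose final answer is correct."""
    return sum(a == truth for a in answers) / len(answers)

def majority_vote(answers):
    """Most frequent final answer among the samples."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical final answers from five samples of one problem
samples = ["204", "204", "113", "204", "72"]
single_sample_accuracy = pass_at_1(samples, truth="204")
voted_answer = majority_vote(samples)
print(single_sample_accuracy, voted_answer)
```

Any one sample is right only part of the time, but the vote still lands on the correct answer as long as the wrong answers disagree with each other.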

But let’s be realistic—pure reinforcement learning training has a major flaw. I looked at a few reasoning samples they published. Chinese and English mixed together, messy formatting, random weird symbols popping up. In real-world usage, who can put up with that?

The difference between an army and a bunch of bandits comes down to a few thousand data points.

What’s the relationship between DeepSeek-R1 and R1-Zero? It’s like regular troops versus bandits.

To fix readability, they designed a “cold start” phase—training the model on a few thousand human-annotated high-quality reasoning chains.

Let me break down their hyperparameters: learning rate starts at 5×10⁻⁵, decays via cosine schedule to 5×10⁻⁶, context length 32,768 tokens.

What do these numbers mean? The initial learning rate is on the conservative side. Because the cold-start dataset is small to begin with, they were presumably worried about training collapsing.
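To make the schedule concrete, here is a minimal sketch of a cosine decay between the two learning rates quoted above. The total step count and the absence of a warmup phase are my assumptions; the report excerpt I am quoting gives only the two endpoints.

```python
import math

def cosine_lr(step, total_steps, lr_max=5e-5, lr_min=5e-6):
    """Cosine decay from lr_max at step 0 down to lr_min at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# total_steps=1000 is a hypothetical value chosen for illustration
for step in (0, 500, 1000):
    print(step, cosine_lr(step, total_steps=1000))
```

The curve is flat near both ends and steepest in the middle, so the model spends its first and last stretches at nearly constant rates.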

Then comes two-stage reinforcement learning.

The first round focuses purely on reasoning ability, guided by rule-based rewards—get the answer right and you get a treat, get it wrong and you get slapped, and format mistakes cost points. The second round incorporates a human preference reward model to handle safety and usefulness.
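A rule-based reward of this kind needs no learned model at all. Below is a minimal sketch, assuming the output is expected to wrap its reasoning in `<think>` tags and its final answer in `<answer>` tags. The reward values (1.0, 0.0, and a 0.5 penalty) and the exact-match comparison are hypothetical choices of mine, not the report's numbers.

```python
import re

# Expected layout (an assumption): <think>reasoning</think> <answer>final</answer>
THINK_ANSWER = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL
)

def rule_based_reward(output, truth, format_penalty=0.5):
    """Return 1.0 for a correct answer, 0.0 for a wrong one,
    and -format_penalty when the output does not follow the layout."""
    match = THINK_ANSWER.match(output.strip())
    if match is None:
        # No parseable answer, so only the format penalty applies
        return -format_penalty
    answer = match.group(2).strip()
    return 1.0 if answer == truth else 0.0

print(rule_based_reward("<think>2 + 2 = 4</think> <answer>4</answer>", "4"))
print(rule_based_reward("<think>2 + 2 = 5</think> <answer>5</answer>", "4"))
print(rule_based_reward("The answer is 4", "4"))
```

Because the checks are deterministic, the reward is hard to game the way a learned reward model can be, which is likely part of the appeal for math and code tasks.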

The second phase used 800,000 data points. I’ve done tests, and this data volume is crucial: too little, and the model’s generalization ability doesn’t kick in; too much, and training takes too long and risks overfitting.

Real-world performance? On benchmarks like MMLU, MMLU-Pro, and GPQA Diamond, DeepSeek-R1 comprehensively surpasses DeepSeek-V3. Improvement is especially noticeable on STEM-related questions.

But what impressed me most was performance on FRAMES, a long-context QA benchmark. I tested it several times with long documents from my own company. R1 can pinpoint key information within contexts of over 30,000 tokens. This isn’t just a gimmick—it’s truly usable capability.

A small model that outperformed GPT-4o.

I think this is the most valuable part of the entire technical report.

DeepSeek distilled 6 small models using R1’s outputs: from 1.5 billion parameters to 70 billion, based on Qwen2.5 and Llama-3.1.

Guess what? They only used 800,000 data points, tuned for a simple 2 to 3 epochs, and that’s it.

Those who said “small models can’t do reasoning” got a slap in the face. DeepSeek-R1-Distill-Qwen-7B surpassed GPT-4o-0513 on multiple reasoning benchmarks. The 14B version performed just as well as QwQ-32B-Preview on many tests, sometimes even better. The 32B and 70B versions clearly outperformed o1-mini on most benchmarks.

I tested the 1.5B version myself—it scored 28.9% on AIME math and 83.9% on MATH. A small model crushing old large models in math reasoning—who would have imagined that two years ago?

The real insight is: doing large-scale reinforcement learning directly on Qwen-32B is far less effective than distilling from R1.

Think about it. It’s like asking a fifth grader to derive calculus on their own, versus just giving them a solution guide.

I fell into this trap in a previous project: small models have limited capacity, so running RL on them from scratch is badly cost-inefficient. Distilling the reasoning ability of a large model is the more efficient path.

A few sentences to help you understand the architecture.

After studying the architecture diagrams DeepSeek released, I can basically sketch it: four modules—generation, training, evaluation, monitoring.

The generation module loads problems from the training dataset, distributes them to multiple vLLM workers, and each worker samples multiple answers with the model.
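The fan-out can be sketched without any inference engine. This is a minimal illustration, assuming round-robin sharding of problems across workers. `sample_fn` is a hypothetical stand-in for a real model call, and nothing here uses the actual vLLM API.

```python
def shard_round_robin(problems, num_workers):
    """Split the problem list into one shard per worker."""
    return [problems[i::num_workers] for i in range(num_workers)]

def run_worker(shard, sample_fn, samples_per_problem):
    """Sample several answers for every problem in this worker's shard."""
    return {
        problem: [sample_fn(problem, i) for i in range(samples_per_problem)]
        for problem in shard
    }

def fake_sample(problem, index):
    # Hypothetical stand-in for a model call; returns a labeled string
    return f"{problem}-answer-{index}"

problems = ["p0", "p1", "p2", "p3", "p4"]
shards = shard_round_robin(problems, num_workers=2)
results = [run_worker(s, fake_sample, samples_per_problem=3) for s in shards]
print(shards)
```

Sampling several answers per problem is what feeds the group comparison during training, so the number of samples per problem is effectively the group size.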

In their MoE architecture, they use cross-node expert parallelism to reduce memory access, and deploy redundant copies of hot experts to balance computational load.
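To show why redundant copies help, here is a toy version of the idea. It is a greedy heuristic I made up for illustration, not the placement algorithm from the report: each spare slot goes to whichever expert currently has the highest load per replica. The expert names and load numbers are hypothetical.

```python
import heapq

def assign_replicas(expert_load, extra_slots):
    """Give each spare slot to the expert with the highest per-replica load."""
    replicas = {expert: 1 for expert in expert_load}
    # heapq is a min-heap, so store negated loads to pop the hottest expert first
    heap = [(-load, expert) for expert, load in expert_load.items()]
    heapq.heapify(heap)
    for _ in range(extra_slots):
        _, expert = heapq.heappop(heap)
        replicas[expert] += 1
        per_replica = expert_load[expert] / replicas[expert]
        heapq.heappush(heap, (-per_replica, expert))
    return replicas

# Hypothetical token counts routed to each expert
load = {"e0": 900, "e1": 100, "e2": 200, "e3": 150}
replicas = assign_replicas(load, extra_slots=2)
print(replicas)
```

In this example both spare slots go to the hot expert, which cuts its per-copy load from 900 to 300 tokens and brings it much closer to the others.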

One detail I paid special attention to: Multi-Token Prediction is used for speculative decoding. I’ve only seen this in a few niche projects before, and this is the first time it’s been applied to reasoning models. The benefit is a significant boost in decoding speed, especially in long-sequence scenarios.
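The speedup comes from a draft-then-verify loop. Below is a toy sketch of one speculative step under greedy decoding. `draft_next` and `target_next` are hypothetical callables that return the next token for a given context. A real system verifies all drafted tokens in a single batched forward pass, while this version calls the target one token at a time for clarity.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """Draft k tokens cheaply, then keep the longest prefix the target agrees with."""
    draft, ctx = [], list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)

    accepted, ctx = [], list(prefix)
    for token in draft:
        expected = target_next(ctx)
        if token != expected:
            # First disagreement: take the target's token and stop
            accepted.append(expected)
            return accepted
        accepted.append(token)
        ctx.append(token)
    # Every draft token matched, so the target contributes one extra token
    accepted.append(target_next(ctx))
    return accepted

def target(ctx):
    # Toy target model: the next token is the current context length
    return len(ctx)

def sloppy_draft(ctx):
    # Toy draft model: agrees with the target except at context length 2
    return 99 if len(ctx) == 2 else len(ctx)

print(speculative_step([0], sloppy_draft, target))
print(speculative_step([0], target, target))
```

The output always matches what the target alone would have produced. The draft only changes how many tokens you get per expensive verification step.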

No matter how strong the ability, it can still be abused.

When testing R1, my biggest concern was safety. Open-source models are already easily fine-tuned into dangerous tools, and R1 has such strong reasoning capabilities. What if it gets corrupted?

This time, DeepSeek disclosed their safety mechanisms in detail.

They built an evaluation dataset of 106,000 prompts and trained a safety reward model using pointwise training. The risk control system has two layers: conversation filtering and model review.

I ran a set of 1,120 internal test cases, and the results were indeed good. On HarmBench, R1 was a bit weaker on IP-related questions, but on other benchmarks it was on par with GPT-4o and Claude-3.7-Sonnet.

But—and I have to say “but”—I recommend adding an independent sensitive content filter when deploying R1 in production. After all, the barrier to fine-tuning open-source models is too low. You can’t fully rely on the original safety measures.
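What I mean by an independent filter is a check that sits outside the model and screens both the prompt and the reply. Here is a minimal sketch. The blocklist, the refusal text, and `generate_fn` are all hypothetical placeholders, and a production system would use a proper classifier rather than substring matching.

```python
BLOCKED_TERMS = ("example-banned-term",)  # hypothetical placeholder list

def is_flagged(text, blocked_terms=BLOCKED_TERMS):
    """Very rough check: flag text containing any blocked term."""
    lowered = text.lower()
    return any(term in lowered for term in blocked_terms)

def guarded_generate(prompt, generate_fn, refusal="Sorry, I can't help with that."):
    """Screen the prompt before calling the model and the reply after."""
    if is_flagged(prompt):
        return refusal
    reply = generate_fn(prompt)
    return refusal if is_flagged(reply) else reply

def echo_model(prompt):
    # Hypothetical stand-in for a call to the deployed model
    return f"You said: {prompt}"

print(guarded_generate("hello", echo_model))
print(guarded_generate("tell me about Example-Banned-Term", echo_model))
```

The point of keeping this layer separate is that it still works if someone swaps in a fine-tuned checkpoint whose built-in safety behavior has been stripped out.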

Weaknesses—I have to mention them.

DeepSeek-R1 is not without flaws.

On tasks like function calling, multi-turn dialogue, complex role-playing, and JSON output, it’s even weaker than DeepSeek-V3. Long reasoning chains become a burden on general tasks, and the model sometimes overthinks a very simple request.

Language mixing is also annoying. Even when I ask in Chinese, R1 sometimes thinks in English and then translates back to Chinese for the output.

As for prompt engineering, R1 is super sensitive to prompts. I fell into a few pitfalls before figuring it out: it’s best to

Cael Lee
