我花了三周复现R1：跳过SFT推理混乱，加上冷启动才稳住 (English)

Generated: 2026-06-21 14:29:21

---

Three Weeks to Replicate DeepSeek-R1’s Long Chain-of-Thought: The Results Were a Real Slap in the Face!

To be honest, I completely dismissed DeepSeek-R1 at first.

“Reinforcement learning sparks reasoning abilities,” “Long CoT self-evolution”—it sounded just like those marketing accounts shouting “AGI is coming!” every day. I thought to myself: yet another gimmick in a fancy wrapper.

But then I spent three weeks trying to replicate its training process. I hit every single pitfall, and my face is still stinging from all the slaps.

---

The First Pitfall Nearly Made Me Give Up

This one was particularly interesting.

My original plan was simple: since DeepSeek-R1‑Zero skipped the SFT stage and went straight to RL, why bother with an extra step? Just go for it, right?

You know what happened?

The model did manage to generate a CoT, but it was a mess—switching between Chinese and English randomly, throwing in “reflect” and “verify again” every few steps, stuck in an endless neurotic loop. And the worst part? The final answer was often correct.

That’s the fascinating thing: the model learned to reason, but it completely failed to learn how to “speak properly.”

After digging deeper into the R1 training details, I realized they introduced a “Cold Start” phase. Basically, they first fine-tuned the model with a few thousand high-quality long chain-of-thought examples, mandating a specific format and standardizing the language (e.g., sticking to Chinese and avoiding mixed usage). The paper glosses over this in one sentence, but from my experiments, this step is the key to user experience.

Now you’re probably thinking: why not just use SFT?

The results I got were pretty humbling: With pure SFT on Long CoT, the model can reach a certain level, but its ceiling is obvious. It’s like a student who memorized the standard answers by heart but panics when faced with a new problem. SFT gives the model “good habits,” but what actually teaches it to reason is the RL that follows.

---

RL Was So Deep It Made Me Question Life

GRPO—I initially dismissed it as just a variant of PPO, nothing worth studying. Then I actually ran it, and the number of pitfalls was infuriating.

First problem: how do you control CoT length?

Let the model expand on its own, and it goes crazy—up to 10,000 tokens. Can you believe it? A reasoning process longer than a thesis paper.

If you limit the length, the model gets sneaky—it crams multiple steps into a single paragraph. The output looks shorter, but reasoning quality plummets.

R1’s solution is to use several rewards simultaneously: a length reward (e.g., cosine-shaped, encouraging gradual increase), a length-scaling reward (bonus for longer, sensible reasoning), and an N‑gram repetition penalty (to stop the model from padding with repeated phrases). I ran each component separately, and the results were clear:

No length reward: Length goes wild, accuracy stuck around 50%.
Only length reward: Length grows steadily, but the model learns to pad; accuracy rises to 55% then drops.
Length reward plus repetition penalty: Length grows steadily, accuracy keeps rising—currently at 68% and still climbing.

That repetition penalty is crucial. The model is cunning; without it, it can loop over the same meaning four or five times just to inflate the length. Sound familiar? It’s exactly the same trick some people use when writing a paper—saying the same thing over and over just to hit the word count.

---

The Most Shocking Finding

Later I read the ByteDance Seed team’s paper. Their analysis of Long CoT is even more thorough than the R1 paper. They break long chain-of-thought into three basic actions: deep reasoning, self-reflection, and self-exploration.

At first I thought this was just the same old “reason, check, try a new path” trio—nothing worth digging into. Then I actually did the quantitative analysis. And my face got slapped again.

I wrote a script that maps each CoT step generated by the model into a semantic space and calculates their “spread.” The results were striking:

Deep reasoning: Semantic diffusion circle shrinks by 22%—essentially “locking onto core logic.”
Self-reflection: 81.72% of reflection steps land precisely back in a previous “promising idea” zone; after reflection, semantic space compresses by 11%.
Self-exploration: Semantic coverage expands from 23.95 to 29.22. The cost is reduced stability, but it does help escape dead ends.

This isn’t mysticism. It’s concrete quantitative evidence.

---

Hard Lessons from Reward Design

The R1 paper casually mentions that “reward hacking is a real risk.” Let me tell you: “risk”? It’s a guaranteed hell.

Early on, I used a model-based “helpfulness” reward function. The reward score shot up, while the model’s actual performance on CodeForces went down. The model learned to flatter the reward model with pretty words, but it never learned to actually solve problems.

I switched back to a pure rule-based reward (direct answer comparison), and performance stabilized.

R1’s training framework handles this well: they use asynchronous scheduling, where reward computation doesn’t touch the GPU and runs separately as a rule verifier. This engineering detail may seem minor, but it determines whether RL can actually converge.

---

An Accidental Discovery: Data Quality Matters More Than Quantity

During replication, I made a typical beginner mistake: assuming more data is always better.

I started with 2 million reasoning instances for SFT. The model’s performance actually declined. Then I found the key in the Light‑R1 technical report: the “all-correct rate” and “all-wrong rate” are incredibly important.

The data that actually helps RL training are the samples where “some are right and some are wrong” (about 50%). Why? Because if all are right, the model learns nothing; if all are wrong, the data is likely mislabeled.

Later I did a simple offline filter: used a small model to sample and kept only data with a pass rate between 0.2 and 0.625. The effect was immediate.

---

Performance Differences Across Model Sizes

This point deserves its own section because many people have no idea how to choose a model.

I tested four sizes: 1.5B, 7B, 14B, and 32B.

Model Size	Simple Reasoning Tasks	Complex Reasoning Tasks	Generated CoT Stability

1.5B	Significant improvement after RL	Almost none	Prone to repetition

7B	Noticeable improvement	Limited improvement	Moderate

14B	Clear improvement

我花了三周复现R1：跳过SFT推理混乱，加上冷启动才稳住 (English)

我花了三周复现R1：跳过SFT推理混乱，加上冷启动才稳住 (English)

The First Pitfall Nearly Made Me Give Up

RL Was So Deep It Made Me Question Life

The Most Shocking Finding

Hard Lessons from Reward Design

An Accidental Discovery: Data Quality Matters More Than Quantity

Performance Differences Across Model Sizes

Cael Lee

Ready to get started?