The "Hero" Behind ChatGPT: RLHF Explained in Detail

Generated: 2026-06-23 07:21:02

---


You know what? The first time I used ChatGPT, I was totally stunned: it felt like chatting with a real person! Not some textbook-reciting robot, not a broken record. It wasn't until I dug into the InstructGPT paper that I finally got it: the secret weapon was RLHF, Reinforcement Learning from Human Feedback.

That got me hooked. I spent two weeks running the entire RLHF pipeline from scratch using a small LLaMA-2-7B model. I ran into enough pitfalls to fill a pamphlet, but I also got a thorough grasp of how the whole thing works. Today, like I'm chatting with an old friend, I'll lay out all the twists and turns from those papers, and throw in my own hard-learned lessons along the way.

---

Why isn't SFT enough? Why go through RLHF?

You might be thinking: just feed the model a bunch of "perfect" Q&A pairs and call it a day, right? That's Supervised Fine-Tuning (SFT).

I was just as naive at first! I carefully selected 5,000 high-quality conversations from the OpenAssistant dataset, used LoRA (rank=8, lr=2e-4), and ran SFT on LLaMA-2-7B for 3 epochs. The result? The model could answer, and there wasn't anything obviously wrong—but it just felt off! Too templated! If you asked it "write a poem about AI," it would crank out a neat seven-character verse. Every word was on point, but reading it felt like a textbook—you couldn't find a mistake, but you also couldn't find a spark.
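To get a feel for how small a rank-8 LoRA update really is, here is a back-of-the-envelope count. It uses LLaMA-2-7B's hidden size of 4096 and its 32 layers. Which modules the adapters were attached to is not stated above, so the query/value choice below is purely an assumption for illustration.

```python
# Rough count of trainable LoRA parameters for a rank=8 setup.
# ASSUMPTION: adapters on q_proj and v_proj only; the article does not say
# which modules were actually targeted.
HIDDEN = 4096           # LLaMA-2-7B hidden size
LAYERS = 32             # LLaMA-2-7B transformer layers
RANK = 8                # LoRA rank from the text
TARGETS_PER_LAYER = 2   # assumed: q_proj and v_proj


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) beside a frozen d_out x d_in weight."""
    return rank * d_in + d_out * rank


per_matrix = lora_params(HIDDEN, HIDDEN, RANK)
total = per_matrix * TARGETS_PER_LAYER * LAYERS
full_matrix = HIDDEN * HIDDEN

print(per_matrix)                 # 65536
print(total)                      # 4194304
print(full_matrix // per_matrix)  # 256
```

Under these assumptions each adapted matrix trains 256 times fewer weights than the full projection, which is why a 7B model is workable for this kind of experiment.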

What's the problem? SFT essentially teaches the model to "imitate"—to mimic a single, labeled "correct answer." But think about it: for any given prompt, there are thousands of great answers. Some are concise, some witty, some full of poetry. SFT couldn't care less about that variety—it only cares if the cross-entropy loss goes down.
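A toy sketch of that point, with made-up numbers: three answer styles stand in for three equally good responses, and the SFT label happens to be the concise one. The loss only looks at the probability of the labeled answer, so it keeps pushing until everything else is squeezed out.

```python
import math

# Toy distribution over three equally good answer styles for one prompt.
# All numbers are invented for illustration.
model_probs = {"concise": 0.5, "witty": 0.3, "poetic": 0.2}
reference = "concise"  # the single labeled "correct answer" in the SFT data


def cross_entropy(probs: dict, label: str) -> float:
    """Negative log-likelihood of the labeled answer; other answers never enter the loss directly."""
    return -math.log(probs[label])


loss_now = cross_entropy(model_probs, reference)

# The loss only reaches zero once all probability mass sits on the reference.
collapsed = {"concise": 1.0, "witty": 0.0, "poetic": 0.0}
loss_collapsed = cross_entropy(collapsed, reference)

print(round(loss_now, 3))   # 0.693
print(loss_collapsed == 0)  # True
```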

That's where RLHF comes in to break the pattern! It doesn't force the model to output something identical to a particular answer. Instead, it throws in a reward signal: "Hey, that was a good response, go in that direction next time!" It's the same as training a dog: give it a treat when it does something right and nothing when it gets it wrong, instead of forcing it into a specific pose. The model suddenly has room to "freestyle"!

---

Step 1: Build a reward model, teach the machine to "appreciate"

The core of RLHF is simple: find a judge that can tell "good" from "bad," then use reinforcement learning to keep that judge happy.

That judge is the reward model. How do you train it? You need to collect human preference data: for each prompt, have the model generate four different answers, then rank them yourself from "best" to "worst."
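A common way to feed a ranking like this into reward-model training is to expand it into pairwise comparisons, so four ranked answers yield six (prompt, preferred, rejected) triples. The sketch below shows that expansion; the function name and the placeholder answers are hypothetical.

```python
from itertools import combinations


def ranking_to_pairs(prompt: str, ranked_answers: list[str]) -> list[tuple[str, str, str]]:
    """Turn a best-to-worst ranking into (prompt, preferred, rejected) triples."""
    # combinations() preserves input order, so the first element of each
    # pair is always the higher-ranked answer.
    return [(prompt, better, worse) for better, worse in combinations(ranked_answers, 2)]


# Hypothetical example: four answers already sorted from best to worst.
pairs = ranking_to_pairs(
    "Write a poem about AI",
    ["answer_a", "answer_b", "answer_c", "answer_d"],
)
print(len(pairs))  # 6
print(pairs[0])    # ('Write a poem about AI', 'answer_a', 'answer_b')
```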

I manually labeled 500 preference pairs (x, y_w, y_l), where x is the prompt, y_w is the preferred answer, and y_l is the rejected one. Two afternoons, and my eyes were bleeding! Once you have the data, you train a scoring model. This uses something called the Bradley-Terry model. The name sounds intimidating, but the logic is dead simple:

Suppose the reward model gives y1 a score of 3 and y2 a score of 1. Then it thinks the probability that humans prefer y1 is e³/(e³+e¹) ≈ 0.88. All we want is to make that probability as large as possible—to make the model think the "correct answer" humans chose is the most reasonable.
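The same arithmetic as a small sketch: the Bradley-Terry probability for the scores 3 and 1, plus the negative log-likelihood that reward-model training would minimize for that pair. The function names are my own, not from any library.

```python
import math


def preference_probability(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry probability that the first answer is preferred: e^a / (e^a + e^b)."""
    return math.exp(score_preferred) / (math.exp(score_preferred) + math.exp(score_rejected))


def pairwise_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood of the human choice; minimizing it maximizes the probability above."""
    return -math.log(preference_probability(score_preferred, score_rejected))


p = preference_probability(3.0, 1.0)
print(round(p, 2))                        # 0.88
print(round(pairwise_loss(3.0, 1.0), 3))  # 0.127
```

Note that only the difference between the two scores matters: the probability is the sigmoid of 3 - 1 = 2, so shifting both scores by the same amount changes nothing.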

I used DeBERTa-v3-base as the backbone and added a linear layer on top to output scores. Learning rate 1e-5, batch size 16, trained for 5 epochs. Validation accuracy was 72%: a passing grade, but far from perfect. That's because human annotations are inherently subjective! Given the same prompt and the same answers, a different annotator might rank them in the exact opposite order. How is the model supposed to learn that?
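How the accuracy figure was measured isn't spelled out above. A common choice for reward models is pairwise accuracy: the fraction of held-out pairs where the preferred answer gets the higher score. Here is a minimal sketch under that assumption, with invented scores.

```python
def pairwise_accuracy(score_pairs: list[tuple[float, float]]) -> float:
    """Fraction of (preferred_score, rejected_score) pairs where the preferred answer scores higher."""
    correct = sum(1 for preferred, rejected in score_pairs if preferred > rejected)
    return correct / len(score_pairs)


# Made-up reward-model scores for four validation pairs, for illustration only.
example_scores = [(2.1, 0.4), (0.3, 1.2), (1.5, 1.1), (0.9, -0.2)]
print(pairwise_accuracy(example_scores))  # 0.75
```

With this metric, random guessing sits at 50%, which puts a number in the low 70s in perspective.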

---

Step 2: Bring in PPO, turn the reward signal into the model's evolutionary drive

Once the reward model is in place, it's time for reinforcement learning training. The algorithm used here is called PPO (Proximal Policy Optimization). Don't let the name scare you—the core idea is super intuitive:

The current model (the "policy") generates an answer.
The reward model gives it a score.
Based on the score, adjust the model parameters so that patterns leading to high scores become more likely to appear again.
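The "proximal" part of PPO is a clip on how far a single update may move the policy. Below is a minimal per-sample sketch of the clipped objective. The clip range of 0.2 is a widely used default and an assumption here, since the article does not state one, and the advantage values are made up.

```python
import math


def clipped_objective(logp_new: float, logp_old: float, advantage: float,
                      clip_range: float = 0.2) -> float:
    """PPO clipped surrogate for one sample (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)  # how much more likely the sample is under the new policy
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    # Taking the minimum keeps the more pessimistic of the two estimates.
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain is capped once the ratio exceeds 1 + clip_range.
print(round(clipped_objective(math.log(1.5), 0.0, advantage=2.0), 6))   # 2.4
# Negative advantage: the unclipped, more pessimistic term is kept.
print(round(clipped_objective(math.log(1.5), 0.0, advantage=-2.0), 6))  # -3.0
```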

But there's a huge pitfall! If you only let the model chase high scores, it will quickly go off the rails. For example, if the reward model happens to score punctuation-free text highly, the policy learns to delete all punctuation and ends up outputting gibberish. This is the infamous "reward hacking."

How do you fix it? Add a penalty term: make sure the current model's outputs don't stray too far from the original language model (the one before RLHF started). The specific approach is to calculate the KL divergence between the token probability distributions of the two models, and subtract it directly from the reward.
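As a rough sketch of that penalty: the version below estimates the KL term from the log-probabilities of the sampled tokens (a simplification I'm assuming here, rather than summing over the full vocabulary) and folds it into a single scalar reward. Real implementations typically apply the penalty per token. All numbers are made up.

```python
def kl_penalized_reward(reward: float, logprobs_policy: list[float],
                        logprobs_ref: list[float], beta: float) -> float:
    """Reward-model score minus beta times a sampled KL estimate over the response tokens."""
    # Summing log pi(token) - log pi_ref(token) over the sampled tokens gives a
    # single-sample estimate of the KL divergence between the two models.
    kl_estimate = sum(lp - lr for lp, lr in zip(logprobs_policy, logprobs_ref))
    return reward - beta * kl_estimate


# Made-up numbers: a 3-token response that the policy likes more than the reference does.
policy_lp = [-0.5, -0.7, -0.2]
ref_lp = [-1.0, -1.2, -0.7]
print(round(kl_penalized_reward(2.0, policy_lp, ref_lp, beta=0.02), 4))  # 1.97
```

The further the policy's token probabilities drift from the reference model's, the larger the estimate gets and the more reward is taken away.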

I used HuggingFace's trl library, with this configuration:


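```python
from trl import PPOConfig, PPOTrainer

# Assumes the classic trl PPO API (the one that takes `config` and `tokenizer`);
# newer trl releases use different argument names.
config = PPOConfig(
    learning_rate=1.4e-5,
    batch_size=32,
    ppo_epochs=4,
    init_kl_coef=0.02,  # KL coefficient; `kl_penalty` selects the estimator type, not the weight
)

ppo_trainer = PPOTrainer(
    config=config,
    model=model,          # policy model with a value head
    ref_model=ref_model,  # frozen copy of the pre-RLHF model, used for the KL term
    tokenizer=tokenizer,
)
# In this API the reward model is not a constructor argument: its scores are
# passed in at each update via ppo_trainer.step(queries, responses, rewards).
```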

Speaking of the KL penalty coefficient, there's a trade-off that's easy to get wrong: if the coefficient is too small, the model goes wild and starts babbling nonsense; if it's too large, the model can't learn, as if it had never been trained. I stubbornly swept through 0.1, 0.05, 0.02, and 0.01, running each for 200 steps and monitoring the outputs. In the end, 0.02 was the most balanced. Around 1000 training steps in, I started to see visible changes in the model's outputs: more colloquial, with attitude, almost like a living person speaking!

---

SFT vs RLHF: I compared a few examples side by side

| Prompt | SFT output (3 epochs, LoRA) | RLHF output (after PPO) |
| --- | --- | --- |
| Explain quantum entanglement | Quantum entanglement refers to the correlation between two particles, where measuring one instantly affects the other… (standard textbook tone) | Imagine two dice. No matter how far apart they are, shake one and the other follows. That's quantum entanglement. (Uses metaphor, easier to understand) |
| Recommend a good book | I recommend One Hundred Years of Solitude. It's a classic by Gabriel García Márquez, telling the story of seven generations of the Buendía family… (dry summary) | If you're new to magical realism, One Hundred Years of Solitude is a great place to start. But don't read it before bed, or you'll be up all night. I love how it feels so absurd that it feels real. (Personal opinion, practical advice) |
| How to fall asleep quickly | Maintain a regular routine, avoid electronic devices before bed, try deep breathing… (generic health advice) | A trick that's worked for me: lights off, lying down, imagine a random scenario like "If I were a cat, how would I sneak a nap on the sofa tonight?" That kind of mind-wandering scatters your attention and actually makes it easier to fall asleep. (Specific scenario, feels like real human experience) |

Can you see the difference? The


Cael Lee
