A Panoramic Guide to LLM Post-Training Techniques

Generated: 2026-06-22 04:46:57

---

A couple of days ago, a friend came to me, saying he was trying to figure out how to turn a pre-trained model into a real assistant, and he asked me where he should start. I told him, don’t rush to read papers—let me tell you a story first. The first time I got my hands on the LLaMA3-8B base model, I very reverently asked it, “What day is it today?” Guess what? It wrote me three full-length essays on the philosophy of time—long-winded, full of citations—but never once said “It’s Tuesday!” 😤

Annoying, right? It’s like if you ask your renovation guy what color to paint the wall, and he gives you a lecture on the history of paint chemistry. That’s exactly the problem with base models—they know a little about everything, but they just won’t talk straight!

That’s what post-training is here to fix. Don’t let the term “post-training” scare you. Put simply, pre-training gives you a bare concrete shell of a house—post-training is the finishing touches: painting the walls, running the wiring, custom-building the furniture—turning the house into a place where you can actually live. I spent the better part of half a year going down the full post-training pipeline, from SFT all the way to Agentic RL. Today I’ll tell you which pitfalls are worth falling into and which ones you can skip.

SFT: Thought It Would Be the Easiest, Ended Up Being the Worst

At first, I thought it would be dead simple: just get some high-quality Q&A pairs, do a little supervised fine-tuning, done. So I spent nearly two weeks carefully annotating 500 human-bot conversation pairs, ran it on LLaMA3-8B for 3 epochs, loss dropped to 0.15—and I almost thought I was already a hyperparameter-tuning master.

And the result? Sure, the model learned to answer questions. But ask it to write a long text: the first two lines are fine, by the third it starts going off the rails, and by the fifth it has completely lost the thread of its own argument.

This pitfall has a fancy name: exposure bias. During training, the model conditions on my perfect token history (teacher forcing), but at inference time it has to condition on the tokens it generated itself. One wrong step throws off every step after it, like falling dominoes. I specifically tested with a held-out set: for the same prompt, the first 50 tokens had 87% accuracy, but by token 150 accuracy had dropped to 31%. From 87% to 31%! That kind of gap makes you want to throw your keyboard.
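To make the domino effect concrete, here is a toy sketch, not the evaluation script behind the numbers above. `toy_model` is a made-up next-token rule that counts upward and has learned exactly one mistake. Scored with the gold history it looks nearly perfect, but when it feeds on its own output, that single mistake poisons everything after it.

```python
def toy_model(prev_token):
    """Hypothetical next-token rule: count upward, with one learned mistake."""
    if prev_token == 5:
        return 7  # the single error this toy model makes
    return prev_token + 1


def teacher_forced(gold):
    """Training-style prediction: every step conditions on the gold previous token."""
    return [toy_model(gold[i - 1]) for i in range(1, len(gold))]


def free_running(first_token, steps):
    """Inference-style generation: every step conditions on the model's own output."""
    generated = []
    prev = first_token
    for _ in range(steps):
        prev = toy_model(prev)
        generated.append(prev)
    return generated


def accuracy(predicted, target):
    """Fraction of positions where the prediction matches the target."""
    return sum(p == t for p, t in zip(predicted, target)) / len(target)


gold = list(range(11))   # 0, 1, ..., 10
targets = gold[1:]       # what the model should produce at each step

tf_acc = accuracy(teacher_forced(gold), targets)
fr_acc = accuracy(free_running(gold[0], len(targets)), targets)

print(f"teacher-forced accuracy: {tf_acc:.2f}")  # 0.90, one isolated error
print(f"free-running accuracy:   {fr_acc:.2f}")  # 0.50, every step after the error is off
```

The model is identical in both runs. Only the history it conditions on changes, which is exactly the train/inference mismatch that exposure bias names.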

So SFT is really just a warm-up; don’t expect it to solve everything. But I also wouldn’t recommend skipping it entirely. If you try to do RL directly on a pure base model, training time and resource consumption at least triple, and the model is more likely to pick up some really weird behaviors. Models can be craftier than you’d imagine.

RLHF: The Idealism of Musk, the Reality of a Tractor

Later I tackled the classic RLHF pipeline—first train a Reward Model, then use PPO to optimize the policy. Reading the paper, I was full of confidence; the whole framework looked so elegant. Then I actually started coding—and every step had a trap waiting.

The first trap was the reward model. I used an open-source 7B Reward Model trained on Anthropic’s preference data, which seemed solid. But in my domain, legal Q&A, it couldn’t score properly at all! It gave high scores to answers that “directly cited the statute,” while the kind of responses users actually preferred, the ones that “explain it in plain language,” got low scores. So fine, I’ll train my own. I spent two weeks collecting 2,000 comparison pairs, and the RM I trained barely reached 62% accuracy on the validation set. That is better than the 50% you would get from flipping a coin, sure, but still far from practical.
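For readers who haven’t trained a reward model before, here is a minimal sketch of the two quantities involved, assuming the common Bradley-Terry pairwise setup. I am also assuming the validation accuracy above is the usual pairwise kind: the fraction of pairs where the preferred answer gets the higher score. The function names and toy scores are illustrative, not taken from any library.

```python
import math


def pairwise_loss(chosen_score, rejected_score):
    """Bradley-Terry loss for one pair: -log(sigmoid(chosen - rejected))."""
    margin = chosen_score - rejected_score
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def pairwise_accuracy(scored_pairs):
    """Fraction of (chosen, rejected) score pairs that are ranked correctly."""
    correct = sum(1 for chosen, rejected in scored_pairs if chosen > rejected)
    return correct / len(scored_pairs)


# Toy validation set: reward-model scores for (chosen, rejected) answers.
validation_scores = [(2.0, 1.0), (0.5, 1.5), (3.0, 0.0), (1.0, -1.0)]

print(pairwise_accuracy(validation_scores))  # 0.75: three of four pairs ranked correctly
print(pairwise_loss(2.0, 1.0))   # small loss, the pair is ranked correctly
print(pairwise_loss(0.5, 1.5))   # larger loss, the pair is ranked the wrong way round
```

On this metric a model that scores at random lands around 50%, which is why 62% is nothing to celebrate.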

Then there was the stability issue with PPO training. This algorithm has to keep four models in play simultaneously: policy model, reference model, value model, reward model. I used the trlx library with batch_size 8 and a KL penalty of 0.04, and let it run all night. The reward curve kept oscillating, and eventually it collapsed completely: the model started outputting gibberish along the lines of “super duper safe” just to collect a high reward, a classic case of reward hacking. Later, reading up, I found it might be related to variance in advantage estimation. I tuned GAE lambda from 0.95 down to 0.9, added learning rate decay, and made a bunch of other changes just to get it to barely converge.
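Why would GAE lambda matter for variance? Here is a minimal pure-Python sketch of Generalized Advantage Estimation, written from the standard definition rather than copied from trlx. The discount `gamma=1.0`, the toy rewards, and the all-zero value estimates are assumptions for illustration only.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    `values` needs one more entry than `rewards`: the final entry is the
    bootstrap value of the state after the last step (0.0 if terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of this and all later TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Toy trajectory: the reward only arrives at the final token.
rewards = [0.0, 0.0, 1.0]
values = [0.0, 0.0, 0.0, 0.0]

for lam in (0.95, 0.9):
    print(lam, gae_advantages(rewards, values, lam=lam))
```

A lower lambda shrinks how much a late, noisy reward leaks back into the advantage of earlier tokens. That trades a bit of bias for lower variance, which fits the oscillation described above.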

Honestly, PPO isn’t bad, but you need enough engineers babying it. If you’re a small team jumping straight into PPO, I’d advise you to think twice. You know that saying: the theory is rich, the engineering is poor. No phrase fits PPO better.

DPO: Looks Like a Cheat Code, Actually Feels Amazing

When I first read the DPO paper, I had nothing but question marks: “Isn’t this just forcing the RL objective into a classification loss? How could that work?” But I gave it a shot anyway, and I have to admit—it’s a masterpiece!

I used the same 2,000 pairs of preference data and ran both DPO and PPO. Guess what? DPO took just 40 minutes on a single A100; PPO took 6 hours on two A100s and still hadn’t fully converged. Yet in the final evaluation—human blind comparison on 500 test prompts—DPO actually had a slightly higher win rate!

DPO has no explicit reward model and no value network; it treats preference pairs as classification data, with a frozen reference model as the only extra component. Hyperparameters? Barely any. I set beta=0.1, ran two epochs, done. At that moment my mindset was: “This is too simple, there has to be a catch.”
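The “classification loss” is short enough to write out. Below is a minimal sketch of the per-pair DPO loss as given in the DPO paper, taking summed sequence log-probabilities as plain floats. The argument names and toy numbers are illustrative, and a real trainer computes this over batches of tensors.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more (in log space) the policy likes each answer than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Policy identical to the reference: the loss starts at log(2).
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen answer: the loss goes down.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(start, improved)
```

The only model-specific inputs are four log-probabilities per pair, two from the policy and two from the frozen reference, which is why there is so little to tune beyond beta.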

And sure enough, I eventually found its ceiling. DPO only optimizes on the data you already have—no online exploration. If your preference data covers limited scenarios, the model’s learned boundaries are fixed. Meanwhile, because PPO samples online, it can uncover behaviors that never appeared in your dataset. For example, in code generation tasks, the PPO model sometimes comes up with completely novel solution strategies; DPO mostly just imitates the style from the training set.

So my verdict: if your preference data is high-quality and covers a wide range, DPO is the best bang for your buck. If the scenario is open-ended and requires the model to explore, you’ll need online RL. There’s no silver bullet—only what fits.

GRPO: On Math Problems, It Made Me Believe in RL Again

The thing that really changed my mind about post-training was GRPO. Its idea is incredibly clever—no value model needed. It estimates advantage using the relative scores of multiple responses to the same prompt, neatly sidestepping PPO’s biggest engineering headache.

I tested it on GSM8K math problems. The base model was Qwen2.5-7B. First, I did SFT on 1,000 worked solutions, then ran DPO and GRPO separately. DPO didn’t improve much: accuracy went from 55% to 62%. GRPO, after an afternoon of training, jumped straight to 78%! The key is the group scoring mechanism: for each problem, generate 8 solutions, score them all, and compute each solution’s advantage directly from how its score compares with the rest of the group, so the best ones are pushed up and the worst ones pushed down. No extra value network training needed.
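Here is a minimal sketch of group-relative advantages in the form used in the GRPO paper: each response’s reward is normalized by the mean and standard deviation of its own group. The 0/1 correctness rewards, the small `eps`, and the use of the population standard deviation are assumptions for illustration, and implementations differ on the last point.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# One GSM8K-style prompt, 8 sampled solutions, reward 1.0 if the final answer is correct.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.0f}  advantage={advantage:+.3f}")
```

Correct solutions get a positive advantage and wrong ones a negative advantage, with the group mean playing the role that the value network plays in PPO. If every sample in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.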

I also got the group size wrong at first: with a group size of 4, the effect was marginal; when I bumped it to 8, the improvement was huge; but going up to 16, returns started diminishing while training time doubled. I settled on 8 as the best cost-benefit ratio.

GRPO is perfect for scenarios where answers can be automatically graded for correctness—math, coding, logical reasoning. If your task requires human judgment, it’s not as suitable. And there’s an

Cael Lee
