SFT, RLHF, DPO, IFT

Generated: 2026-06-20 17:56:39

---

To be honest, these days whenever I hear the phrase "DPO is cheap," I get a headache. A real one.

Last year I spent three months running five comparison experiments on Llama 3 8B. You know what conclusion I came to? Even I was shocked:

Offline DPO doesn't actually cost less than RLHF. The difference is that RLHF burns money on GPUs, while DPO burns money on data and trial-and-error.

When the money goes to GPUs, at least you can see it—fans spinning, temperatures climbing. But data and trial-and-error? You never know when it'll end. Tweak one hyperparameter, wait three days for the loss curve, and boom—your validation set tanks.

Those explainer posts on Zhihu make it look so elegant: skip training the reward model, optimize directly with preference data, how neat! But wait until you've prepped the data, plotted the training loss, and watched your model degrade on the validation set—that's when you realize exactly how cursed the word "offline" is.
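For reference, here is roughly what that "neat" objective looks like. This is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python rather than any particular training library. The function names, argument names, and the beta=0.1 default are illustrative assumptions, not the setup from the experiments above. The inputs are the summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses to the same prompt.

    beta is an assumed illustrative default, not a recommended value.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


# Policy identical to the reference: margin is 0, so the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the chosen response: loss is lower.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

Notice that nothing here samples from the model being trained. The preference pairs are fixed ahead of time, which is what "offline" means, and it is why so much of the outcome rides on the data.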

Before I go further, let's talk about SFT for a second.

A lot of beginners think SFT is just "feed in answers, teach the model the format." That's technically right, but here's the problem: most public SFT data doesn't even pass the quality bar.

Two years ago I fine-tuned Llama 2 7B for code using an Alpaca-style instruction format with a Human/Assistant template: "Human: Write a binary search\nAssistant: ". The result? The model learned the format but not the code. Ask it to "write a binary search," and it would give you something that looked structurally sound but was really pseudocode full of syntax errors. It looked right, but you couldn't run it. Bloated, useless, and it crashed the moment you tried.

What really opened my eyes to SFT's limits was a math reasoning test. I fine-tuned Qwen2 7B on a manually corrected GSM8K dataset. After three epochs, accuracy jumped from 0% to 22%. And then it hit a wall. More data, more training steps: it stayed stuck around 20%. Did not budge.

Why? SFT is essentially maximum likelihood estimation—the model learns a conditional probability distribution, but it's just copying the distribution from your training data. Give it a hundred problems with full solutions, and it takes them all at face value; change the question a little, and it falls apart immediately. Because it never learned to verify its answers.
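To make the "maximum likelihood" point concrete, here is a minimal sketch of the SFT objective for one training example, in plain Python. The function name and the toy probabilities are made up for illustration; a real training loop computes the same quantity from model logits over the answer tokens.

```python
import math


def sft_loss(reference_token_probs):
    """Mean negative log-likelihood the model assigns to the reference answer.

    reference_token_probs: the probability the model gives to each token
    of the reference solution, in order.
    """
    nll = [-math.log(p) for p in reference_token_probs]
    return sum(nll) / len(nll)


# Toy numbers (hypothetical), one probability per reference token.
confident = [0.9, 0.95, 0.9, 0.99]
unsure = [0.5, 0.4, 0.6, 0.3]

print(sft_loss(confident))  # small: the model already imitates the reference
print(sft_loss(unsure))     # larger: training pushes these probabilities up
```

Look at what is missing: nothing in this loss checks whether the final answer is correct. Only agreement with the reference tokens counts, so copying the distribution is exactly what gets rewarded.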

You know that kid from school who could nail an exam if it was exactly the same as the practice test, but froze the second the numbers changed? That's SFT.

So what about RLHF? The results are the best, no question, but the cost is invisible.

The InstructGPT trifecta—SFT, Reward Model, PPO—I reproduced it on an A100 back in early 2023. Guess how long it took? An entire week. And the most exhausting kind of week at that. Gradient explosion halfway through training the reward model, two days of tuning the KL penalty coefficient, PPO crashing left and right… just dealing with those alone was enough to make you want to quit.
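For anyone wondering what "tuning the KL penalty coefficient" actually touches: in the usual RLHF setup, the reward PPO sees is the reward model's score minus a penalty for drifting away from the SFT model. Below is a minimal sketch of that reward shaping in plain Python. The function name, argument names, and the kl_coef values are hypothetical and chosen for illustration; this is not the code from the 2023 reproduction.

```python
def kl_shaped_rewards(rm_score, policy_logps, ref_logps, kl_coef=0.1):
    """Per-token rewards for one sampled response.

    rm_score:     scalar score from the reward model for the full response
    policy_logps: log-prob of each sampled token under the current policy
    ref_logps:    log-prob of the same tokens under the frozen SFT model
    kl_coef:      KL penalty coefficient (assumed value, needs tuning)
    """
    # Each token is penalized by how far the policy has moved from the
    # reference on that token (a per-token KL estimate).
    rewards = [-kl_coef * (lp - lr) for lp, lr in zip(policy_logps, ref_logps)]
    # The reward model's score is added once, at the final token.
    rewards[-1] += rm_score
    return rewards


# The policy has drifted on the first token only.
print(kl_shaped_rewards(2.0, [-0.5, -1.0], [-1.0, -1.0], kl_coef=0.2))
```

Set kl_coef too low and the policy can drift far enough to exploit the reward model; set it too high and the policy barely moves away from the SFT model. That single number is where a lot of the two days went.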

So you see, "cheap" versus "expensive" isn't really about GPU costs. GPUs are visible; you know exactly what they cost. But with DPO, the data, the trial-and-error, the parameter tuning, the rerunning and then rerunning again are all hidden costs. By the time you've stepped on every landmine and done the math, the money and effort you've put in is no less than what RLHF would have taken.

It comes down to this: the places where you think you're saving money often hide the biggest costs.

Offline DPO is a good path, but it's not a shortcut. It just sends you the bill in a different form.

Next time somebody tells you "DPO is cheap," smile and ask them: "Have you counted the cost of trial-and-error?"

Cael Lee
