
TD Converges 4x Faster Than MC, PPO Uses 80% Less GPU Memory Than TRPO

By Cael Lee | 7 min read

Generated: 2026-06-22 06:46:06

---

From MC to GRPO: The Pitfalls I’ve Stepped Into — Every Word a Bloody Lesson

Doing LLM reinforcement learning by just staring at formulas will absolutely get you killed.

I’ve been writing columns for ten years, moving from game AI to large model alignment. I’ve watched RL go from a niche nobody cared about to today’s standard in large models. The pitfalls I’ve stepped into along the way are enough to fill a book.

Today I’ll break down this entire evolution for you, piece by piece. I’ll tell you exactly where each algorithm breaks in real-world deployment. Ready? Let’s go!

---

Step 1: MC and TD — The Humblest Starting Point, and the Easiest Place to Fall

Let’s start with MC (Monte Carlo). This thing’s logic is brutally straightforward: run an entire game to the end, look at the final score, then work backward to infer the value of each state.

Sounds pretty solid, right?

In 2018, I used MC to train a simple maze navigation. I ran it for three full days, and the model never converged! Why? The state space was too large. The agent could never reach the goal, so it never got any useful feedback. Think about it — MC requires you to see the outcome before you can learn. In a complex environment? That’s suicide.
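
To make the "work backward" step concrete, here is a minimal sketch of how MC turns one finished episode into per-step returns. The function name `discounted_returns` and the discount factor `gamma` are illustrative assumptions, not the code from that maze run.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t for every step of one finished episode."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # Walk backward from the final step: G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


# A sparse-reward episode: nothing until the goal on the last step.
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

Notice what happens when the agent never reaches the goal: every reward is 0, so every return is 0, and there is nothing to learn from.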

Then came TD (Temporal Difference).

What’s clever about TD? It doesn’t need to wait for the end. It updates every single step. I tested a grid world: MC took 200 episodes to converge; TD only needed 50! That’s the difference.
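
The per-step update can be sketched in a few lines. This is a minimal tabular TD(0) example under my own assumptions: the value table is a plain dict, and `alpha` (step size) and `gamma` (discount) are illustrative parameters, not the settings from the grid world test.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.99, done=False):
    """Apply one TD(0) update to a tabular value function stored in a dict."""
    # Bootstrap from the current estimate of the next state (0 at episode end).
    bootstrap = 0.0 if done else values.get(next_state, 0.0)
    td_target = reward + gamma * bootstrap
    td_error = td_target - values.get(state, 0.0)
    values[state] = values.get(state, 0.0) + alpha * td_error
    return td_error


values = {"A": 0.0, "B": 1.0}
err = td0_update(values, "A", reward=0.0, next_state="B", alpha=0.5, gamma=0.5)
```

The key point is that the update only needs one transition, so learning starts long before the episode ends.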

TD has two classic control variants: SARSA and Q-learning. The difference is "how bold" they are. SARSA is conservative: it updates Q-values using the action the current policy actually takes next (on-policy). Q-learning is aggressive: it always bootstraps from the maximum Q-value in the next state, whatever the policy does (off-policy). My experience: use SARSA for safety-critical scenarios, Q-learning for exploration-heavy ones.
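
The whole difference fits in the update target. A minimal sketch, assuming the Q-table is a nested dict `q[state][action]` (that layout and the function names are my own illustrative choices):

```python
def sarsa_target(q, reward, next_state, next_action, gamma=0.99):
    # SARSA bootstraps from the action the policy actually picked next.
    return reward + gamma * q[next_state][next_action]


def q_learning_target(q, reward, next_state, gamma=0.99):
    # Q-learning bootstraps from the best-looking next action.
    return reward + gamma * max(q[next_state].values())


q = {"s1": {"left": 1.0, "right": 3.0}}
conservative = sarsa_target(q, 0.0, "s1", "left", gamma=0.5)
aggressive = q_learning_target(q, 0.0, "s1", gamma=0.5)
```

When the policy explores with a worse action ("left" here), SARSA's target reflects that risk, while Q-learning's target ignores it.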

But tabular Q-learning has a fatal flaw: it stores one value per state-action pair in a Q-table, so the table grows as (number of states) × (number of actions). Even a toy task with only 100 states already needs 100 × number-of-actions entries, and every extra state variable multiplies that. For continuous states there is no finite table at all. Completely unusable!

By now, you’re probably feeling something’s off, right? Don’t worry, there are more surprises ahead.

---

Step 2: DQN — Neural Networks to the Rescue, But New Problems Arise

When DQN came out in 2013, I was so excited I got up in the middle of the night to run experiments. Using a neural network to replace the Q-table — finally we could handle continuous states!

The approach: input a state, output the Q-value for each action. With experience replay and a target network, the results were indeed good. I reproduced it on Atari games — it could beat humans at Pong.
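
Here is a rough sketch of the two supporting pieces, experience replay and the target computation. It is framework-free on purpose: `target_q_fn` is a hypothetical callable standing in for the frozen target network (state in, list of Q-values out), and the class and function names are assumptions of mine.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions; the oldest are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive steps.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def dqn_targets(batch, target_q_fn, gamma=0.99):
    """y = r + gamma * max_a' Q_target(s', a'), with no bootstrap at episode end."""
    targets = []
    for _state, _action, reward, next_state, done in batch:
        bootstrap = 0.0 if done else max(target_q_fn(next_state))
        targets.append(reward + gamma * bootstrap)
    return targets
```

The online network is then regressed toward these targets, while the target network is only refreshed every so often, which keeps the regression target from chasing itself.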

But guess what? DQN has a hard limitation: actions must be discrete, because the output layer has one node per action. With continuous actions, the output layer would need infinitely many nodes. I tried using DQN to control a robotic arm, where the action space was continuous joint angles, and it was completely unusable.

That’s when another path was needed — policy gradients.

---

Step 3: PG and AC — From “Values” to “Policies”

Policy Gradient (PG) takes the opposite approach from Q-learning: directly optimize the policy network to output a probability distribution over actions. Increase probability for good actions, decrease for bad ones.

In 2019, I used REINFORCE (the most basic PG algorithm) to train a simple game, and the variance was so high it made me question my life. Same policy, one run scores 100, next run scores 0. The reason: PG uses the return from a complete trajectory to update, so variance is naturally huge.
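
You can see where the variance comes from in the gradient itself. Below is a minimal sketch for a softmax policy over discrete actions; the function names and the optional `baseline` argument are illustrative assumptions, not the setup from that 2019 run.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_logit_grad(logits, action, episode_return, baseline=0.0):
    """Gradient of (G - b) * log pi(action) with respect to the logits."""
    probs = softmax(logits)
    weight = episode_return - baseline
    # d log pi(a) / d logit_i = 1[i == a] - pi(i)
    return [weight * ((1.0 if i == action else 0.0) - p) for i, p in enumerate(probs)]


# The same action is pushed hard when the return is 100 and not at all when it is 0.
lucky = reinforce_logit_grad([0.0, 0.0], action=0, episode_return=100.0)
unlucky = reinforce_logit_grad([0.0, 0.0], action=0, episode_return=0.0)
```

Every step of the trajectory gets scaled by the same noisy return, so one lucky or unlucky episode swings the whole update. Subtracting a baseline shrinks that swing without changing the expected gradient.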

The solution was the Actor-Critic (AC) framework. The Actor makes moves, the Critic scores them. I deployed an AC model for robot walking, and stability was an order of magnitude better than pure PG. But AC has its own problem: training two networks together doubles the tuning difficulty.
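
One common way the Critic "scores" a move is the one-step advantage. A minimal sketch, assuming the Critic's estimates `value` and `next_value` are passed in as plain floats (all names here are mine):

```python
def td_advantage(reward, value, next_value, gamma=0.99, done=False):
    """One-step advantage estimate: A = r + gamma * V(s') - V(s)."""
    bootstrap = 0.0 if done else next_value
    return reward + gamma * bootstrap - value


def critic_loss(reward, value, next_value, gamma=0.99, done=False):
    """Squared TD error, which the Critic is trained to minimize."""
    return td_advantage(reward, value, next_value, gamma, done) ** 2


advantage = td_advantage(reward=1.0, value=2.0, next_value=4.0, gamma=0.5)
```

The Actor uses the advantage in place of the full-trajectory return, which is where the stability gain comes from. The catch is that a badly trained Critic now feeds biased scores to the Actor, and that is the "two networks" tuning pain.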

By now, you’re probably thinking each algorithm has its own quirks. Don’t worry — the real revolution is still coming.

---

Step 4: TRPO and PPO — The Revolution of Stability

When TRPO came out in 2015, I was working on a robot control project. TRPO’s core idea is the “trust region”: each time you update the policy, the difference between the old and new policy can’t be too large. It uses second-order optimization to enforce this constraint. The effect is good, but the computation is explosive.

I tried one TRPO training run — computing the Hessian matrix for a single batch blew up my GPU memory. Later I switched to conjugate gradient approximation, but it was still painfully slow.

Then PPO arrived in 2017 — a lifesaver! PPO achieves TRPO’s effect with first-order optimization, using just two tricks: clipping and importance sampling. I tested it — PPO trains 5x faster than TRPO with even better results.
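
Here is what those two tricks look like for a single sample. This is a minimal sketch of the clipped surrogate; `clip_eps=0.2` is a commonly used default that I am assuming here, not a value from my experiments.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate for one sample (to be maximized)."""
    # Importance sampling ratio between the new and the old policy.
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage, the objective stops growing once the ratio passes 1 + clip_eps, so the update has no reason to move the policy further. That is the trust region, approximated with nothing more than a `min` and a `max`.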

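PPO comes in two flavors: the clipped objective and a KL-penalty variant, and RLHF setups usually add a KL penalty against a reference model on top. Either way, the penalty term keeps the policy from drifting too far. I’ve tuned the β parameter many times and found 0.1 to 0.5 works best. Too small, and the policy collapses; too large, and it can’t learn.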
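
If hand-tuning β gets tiresome, the adaptive scheme from the PPO paper adjusts it from the measured KL. A minimal sketch: the doubling and halving rule with the 1.5 factor follows that paper's heuristic, while `kl_target` and the function names are illustrative assumptions.

```python
def kl_penalized_objective(surrogate, kl, beta):
    """Surrogate objective minus the KL penalty (to be maximized)."""
    return surrogate - beta * kl


def adapt_beta(beta, kl, kl_target):
    """Raise beta when the policy moved too far, lower it when it barely moved."""
    if kl > 1.5 * kl_target:
        return beta * 2.0
    if kl < kl_target / 1.5:
        return beta / 2.0
    return beta
```

This makes the two failure modes visible: a β that is too small lets the KL blow past the target, and a β that is too large pins the KL near zero so the policy barely changes.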

Think about it — from MC to PPO, how far have we come? But the story isn’t over yet.

---

Step 5: DPO — Removing the Critic, Simplifying to the Extreme

When DPO came out in 2023, I was working on an LLM alignment project. PPO is good, but it requires loading four models simultaneously: Actor, Critic, Reward Model, Reference Model. One A100 can only run a very small model — the cost is insane.

DPO’s clever trick: train directly on human preference data, no Reward Model needed, no Critic needed. It transforms the RL problem into a classification problem, optimized with binary cross-entropy. That leaves only two models in memory: the policy and a frozen Reference Model.
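
The classification view is easiest to see in the loss for one preference pair. A minimal sketch, assuming you already have the summed log-probabilities of each response under the policy and under the frozen reference model; the argument names and `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair: -log sigmoid(beta * margin)."""
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) in a numerically stable form (softplus of -margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss is log 2 when the policy matches the reference, and it drops as the policy raises the chosen response relative to the rejected one. There is no sampling and no reward model in the loop, only two forward passes per response.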

I tested it on a 7B model — DPO training time was only 1/3 of PPO, and memory usage was half! But the problem is obvious: it’s prone to overfitting. I tried training on 5,000 preference data points, and the model just memorized the data distribution — generalization was terrible.

Another pitfall: DPO depends heavily on data quality. If you don’t have enough data, the results are mediocre. I recommend preparing at least 20,000 high-quality preference data points; otherwise, you’re better off with PPO.

---

Step 6: GRPO — The Pragmatic Choice for Industry

When GRPO (Group Relative Policy Optimization) came out in 2024, it was a compromise between DPO and PPO. It keeps PPO’s online sampling but removes the Critic, replacing it with group-relative rewards.

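The approach: for the same prompt, sample multiple responses, score each one, and normalize the scores within the group (subtract the group mean, divide by the group standard deviation). That normalized score serves as the advantage, so you don’t need to train a Critic. You still need something to score the responses: a Reward Model, or a rule-based checker for verifiable tasks such as math and code. Dropping the Critic alone significantly reduces cost.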
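
One common way to compute the group-relative signal is mean and standard deviation normalization over the scores of one prompt's responses. A minimal sketch; the function name and the small `eps` guard are my own assumptions.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's sampled responses within the group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt: two pass, two fail.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

The group mean plays the role the Critic's value estimate plays in PPO. If every response in a group scores the same, all advantages are zero and that prompt contributes no gradient.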

I used GRPO to train a code generation model — the results were similar to PPO, but training time was 40% less! GRPO’s formula includes a KL divergence term. I recommend setting β to 0.04 — too small and it collapses, too large and it can’t learn.

But GRPO has a pitfall: the group sampling size is critical. I tried 8 samples per group — mediocre. Switched to 16 — significant improvement. Went up to 32 — diminishing returns. I suggest starting at 16 and tuning from there.

---

Step 7: DAPO — Engineering Optimization for Reasoning Models

ByteDance’s DAPO algorithm, released in 2025, was designed specifically for reasoning models. Its name spells out two of its key changes to GRPO: decoupled clipping and dynamic sampling.
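
Dynamic sampling, as I understand it, means dropping prompts whose sampled responses all got the same reward, because those groups carry no learning signal. A minimal sketch of the filter; the dict layout and function names are illustrative assumptions.

```python
def keep_group(rewards):
    """Keep a prompt only if its sampled responses disagree in reward."""
    return max(rewards) > min(rewards)


def filter_groups(groups):
    """groups maps a prompt to the list of rewards of its sampled responses."""
    return {prompt: rewards for prompt, rewards in groups.items() if keep_group(rewards)}


groups = {"p1": [1, 1, 1, 1], "p2": [0, 1, 0, 1], "p3": [0, 0, 0, 0]}
useful = filter_groups(groups)
```

In a real pipeline you would keep sampling new prompts until the batch is full again, so the effective batch size stays constant.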

I tested DAPO on a math reasoning task — it outperformed GRPO by 15%. The core improvement is “Clip-Higher”: for actions with positive advantage, loosen the clipping upper bound so the policy can learn more aggressively.
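
In code, Clip-Higher is just a clip range with two separate bounds. A minimal sketch: `eps_low=0.2` and `eps_high=0.28` are the values I recall from the DAPO paper, so treat them as assumptions and tune them for your own run.

```python
import math


def clip_higher_objective(logp_new, logp_old, advantage, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate with decoupled lower and upper clip bounds."""
    ratio = math.exp(logp_new - logp_old)
    # The upper bound is looser than the lower one, so low-probability tokens
    # with positive advantage have more room to grow.
    clipped = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    return min(ratio * advantage, clipped * advantage)
```

Setting `eps_high` equal to `eps_low` recovers the usual symmetric PPO clip, which makes it easy to A/B the two.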

Another improvement is removing the KL divergence constraint. For long chain-of-thought reasoning, the policy distribution can diverge significantly from the initial model, and the KL constraint actually hinders learning. I tried removing it, and training stability indeed improved.

---

My Practical Advice — Every Word a Bloody Lesson

  1. For small-scale experiments, use DPO: If you have enough data (20k+) and limited resources, DPO is the first choice. Just remember to add regularization to prevent overfitting.
  2. For medium scale, use GRPO: If you need online sampling but don’t want to train a Critic, GRPO is the best fit. Start tuning the group sampling size from 16.
  3. For large scale, use PPO: If you have abundant resources and want the best possible results, PPO remains the benchmark. Just be careful

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
