Policy Gradient, PPO, and PPG (English)
Generated: 2026-06-22 10:04:16
---
That Day, I Got a Severe Beating from Policy Gradient
Five years ago, I encountered reinforcement learning for the first time.
Not in a school lab, not while grinding LeetCode—it was on a robotic arm grasping project. The big shots on the team made the call in a meeting: "Let's start with Policy Gradient."
I said okay, let's just get it done.
And the result? I wrote REINFORCE, ran it—and holy crap, it blew up instantly. Two hours of training, and the reward curve didn't budge. Occasionally it twitched upward a tiny bit, then on the next step it came crashing back to the bottom.
Naively, I thought my hyperparameters were off. Later I found out it wasn't that I sucked: this algorithm really does collapse on its own.
Can you believe it? An algorithm touted as "revolutionary," running for a hundred thousand episodes, and the actions are still random—staggering around like a drunkard.
From that day on, I had a bone to pick with policy gradients. I went all the way from REINFORCE to PPO, and the pitfalls I hit along the way could line up and wrap around the office twice.
Today's post isn't about throwing fancy formula derivations in your face. I just want to talk to you about what these algorithms are really like in practice—where they flip you over, and where you need to tiptoe around.
---
1. Original PG: A Double-Edged Sword, with the Edge Pointed at You
When it comes to Policy Gradient, tons of online tutorials start by slapping up the formula.
∇J(θ) ∝ Σ_t ∇ log π_θ(a_t | s_t) · G_t
Read literally, it's easy to understand: G_t is the return collected from step t onward. If the return that followed an action is high, push that action's probability up; if it's low, push it down.
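To make the formula concrete, here is a minimal sketch in plain Python. The function names and the discount factor `gamma` are my own assumptions for illustration; in a real implementation the log-probabilities would be autograd tensors so the gradient can actually flow through them.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step t."""
    returns = []
    running = 0.0
    # Walk the episode backwards so each G_t reuses G_{t+1}.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def reinforce_loss(log_probs, returns):
    """Surrogate loss: minimizing it follows sum_t grad log pi(a_t|s_t) * G_t.

    log_probs[t] is log pi(a_t | s_t) for the action actually taken.
    """
    return -sum(lp * g for lp, g in zip(log_probs, returns))


episode_rewards = [1.0, 1.0, 1.0]
returns = discounted_returns(episode_rewards, gamma=0.5)
loss = reinforce_loss([-1.0, -2.0, -0.5], returns)
```

Note that every action in the episode gets weighted by a raw return, with nothing to compare it against.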
Sounds fine, right?
But when you actually run it—let me tell you, it's a disaster waiting to happen.
The biggest problem? The variance is ridiculously high. Run the same set of parameters for ten episodes, and the reward can bounce from -200 to +200. You have no idea whether the policy actually improved or your random seed just gave you a godlike hand.
I tested it on Pong with a batch size of 32, let it train all night—the policy still couldn't learn to catch the ball.
Tell me that's not frustrating.
Then I switched to A2C, and it worked within half a day.
So, let me say this: raw PG without a baseline is practically useless.
Almost every tutorial mentions "subtract a baseline to reduce variance," but very few slam the table and tell you: without it, this thing is broken.
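If you want to see the variance problem with your own eyes, a toy experiment is enough. The sketch below is not from my project; it is a hypothetical two-armed bandit with made-up payouts (about 10 and about 11) and a uniform softmax policy, and it compares the gradient estimate with and without a constant baseline.

```python
import random
import statistics


def grad_samples(baseline, n=10_000, seed=0):
    """Sample the score-function gradient w.r.t. the logit of arm 0.

    Hypothetical setup: two arms, uniform policy (p = 0.5 each),
    arm 0 pays about 10, arm 1 pays about 11, both with unit Gaussian noise.
    """
    rng = random.Random(seed)
    p0 = 0.5
    samples = []
    for _ in range(n):
        action = 0 if rng.random() < p0 else 1
        reward = (10.0 if action == 0 else 11.0) + rng.gauss(0.0, 1.0)
        # For a softmax policy: d/d(logit_0) log pi(action) = 1[action == 0] - p0
        score = (1.0 if action == 0 else 0.0) - p0
        samples.append(score * (reward - baseline))
    return samples


no_baseline = grad_samples(baseline=0.0)
with_baseline = grad_samples(baseline=10.5)  # roughly the average reward

var_raw = statistics.pvariance(no_baseline)
var_base = statistics.pvariance(with_baseline)
mean_base = statistics.fmean(with_baseline)
print(f"variance without baseline: {var_raw:.3f}")
print(f"variance with baseline:    {var_base:.3f}")
```

Both estimators point in the same direction on average (arm 1 is slightly better, so the gradient for arm 0 is slightly negative), but the version without a baseline buries that small signal under noise from the large raw rewards.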
On that robotic arm project, I tried a PG version without a baseline. A hundred thousand episodes! The actions were still chaotic, like a hand that had no idea what it was doing. After I added the state value V(s) as a baseline, I finally saw the curve slowly start to rise.
So if you're going to play with PG, rule number one: use the advantage function A(s,a) instead of Gt.
A(s,a) = Q(s,a) - V(s)
In plain English: "How much better is this action compared to the average?"
If it's above baseline, raise its probability; if below, suppress it.
That's the only reason modern PG even works.
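Here is a minimal sketch of that "above baseline, raise it; below baseline, suppress it" behavior. It assumes a Monte Carlo estimate A ≈ G - V(s) and a tiny softmax policy over raw logits; the function names and the step size `lr` are illustrative, not taken from any library.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def advantages_from_returns(returns, values):
    """Monte Carlo advantage estimate: A_t ~= G_t - V(s_t)."""
    return [g - v for g, v in zip(returns, values)]


def pg_step(logits, action, advantage, lr=0.1):
    """One gradient-ascent step on advantage * log pi(action)."""
    probs = softmax(logits)
    # d/d(logit_i) log softmax(action) = 1[i == action] - probs[i]
    return [
        logit + lr * advantage * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


advs = advantages_from_returns(returns=[3.0, 1.0], values=[2.0, 2.0])
better = softmax(pg_step([0.0, 0.0], action=0, advantage=advs[0]))
worse = softmax(pg_step([0.0, 0.0], action=0, advantage=advs[1]))
```

With a positive advantage the probability of the chosen action goes above 0.5; with a negative one it drops below.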
---
2. Why Did PPO Become So Popular? Because It's Smartly Cautious
Think about it—how annoying is it to set the step size for PG updates?
Too small, and it barely learns; too large, and it collapses. Adam won't save you either: crank the learning rate too high, and you get the whole gradient-explosion package.
Then TRPO came along, using KL divergence as a constraint—theoretically solid as a rock.
But here's the thing—try implementing it yourself.
You need second-order derivatives, conjugate gradients, line searches. I tried TRPO once. Just setting up the environment took two days. And it ran painfully slow.
Who uses this in production? Are you looking for trouble?
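For reference, the quantity TRPO keeps in check is easy to compute even if the optimizer around it is not. The sketch below assumes a discrete action space and a single state; the threshold `max_kl=0.01` is an illustrative value, and real TRPO averages the KL over a batch of states and enforces it with conjugate gradient plus a line search, none of which is shown here.

```python
import math


def categorical_kl(p_old, p_new):
    """KL(p_old || p_new) between two discrete action distributions."""
    return sum(
        po * math.log(po / pn)
        for po, pn in zip(p_old, p_new)
        if po > 0.0
    )


def within_trust_region(p_old, p_new, max_kl=0.01):
    """True if the new policy stays close enough to the old one."""
    return categorical_kl(p_old, p_new) <= max_kl


small_step = within_trust_region([0.5, 0.5], [0.52, 0.48])
big_step = within_trust_region([0.5, 0.5], [0.9, 0.1])
```

The check itself is a one-liner. The pain is in finding the largest update that still passes it.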
Then PPO arrived.
In 2017, Schulman published that paper. My first reaction after reading it: Huh? Something this simple can get published?
The core is just a clip operation: the probability ratio between the new and old policy, r = πθ(a|s) / πθ_old(a|s), is clipped to [1-ε, 1+ε] inside the objective. Basically, don't let a single update be too aggressive.
That one little trick, and the results took off.
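The clip is short enough to write out in full. This is a per-sample sketch in plain Python, with `eps=0.2` as an assumed default; a real implementation would average this over a batch of tensors and minimize the negative of it.

```python
import math


def ppo_clip_objective(new_logp, old_logp, advantage, eps=0.2):
    """Clipped surrogate objective for one (state, action) sample.

    new_logp / old_logp are log pi(a|s) under the current and the
    data-collecting policy. The objective is meant to be maximized.
    """
    ratio = math.exp(new_logp - old_logp)
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the min means the policy gets no extra credit for pushing
    # the ratio outside the clip range in the direction the advantage favors.
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio 1.5 with a good action: credit is capped at 1.2 * advantage.
capped = ppo_clip_objective(math.log(1.5), 0.0, advantage=1.0)
# Ratio 1.5 with a bad action: no clipping, the full penalty applies.
penalized = ppo_clip_objective(math.log(1.5), 0.0, advantage=-1.0)
```

The asymmetry is the point: moving too far in a good direction earns nothing extra, while moving toward a bad action is still fully penalized.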
I've used PPO for continuous control and also for training dialogue models. The most common version is PPO-clip, combined with GAE, with 4 to 8 parallel environments. The whole training process is as stable as a straight line.
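Since GAE carries a lot of that stability, here is a minimal sketch of it. It assumes a single trajectory with no episode boundary in the middle, and `last_value` is the critic's estimate for the state after the final step (0 if the episode ended); the default `gamma` and `lam` are common choices, not values from my projects.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    values[t] is V(s_t); last_value is V(s_T) used for bootstrapping.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # One-step TD error at step t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of this and all later TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# lam=1 recovers the Monte Carlo return minus the value estimate;
# lam=0 recovers the one-step TD error.
mc_like = gae([1.0, 1.0], [0.0, 0.0], last_value=0.0, gamma=1.0, lam=1.0)
td_like = gae([1.0, 1.0], [0.5, 0.5], last_value=0.0, gamma=1.0, lam=0.0)
```

The `lam` knob trades variance for bias: closer to 1 trusts the sampled rewards more, closer to 0 trusts the critic more.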
One detail I really want to tell you: Don't set the loss weights of Actor and Critic to 1:1.
Why? Because if the Critic learns too slowly, the Advantage estimate is inaccurate, and the Actor's update tends to drift.
My habit is to set it to 0.5:1, and then double the Critic's learning rate. The improvement is noticeable.
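As a config sketch, that habit might look like the following. Two loud assumptions: I'm reading "0.5:1" in actor-to-critic order, and the base learning rate of 3e-4 is only a placeholder. The class and field names are hypothetical, not from any library.

```python
from dataclasses import dataclass


@dataclass
class LossConfig:
    actor_weight: float = 0.5          # assumption: "0.5:1" read as actor:critic
    critic_weight: float = 1.0
    actor_lr: float = 3e-4             # placeholder base learning rate
    critic_lr_multiplier: float = 2.0  # "double the Critic's learning rate"

    @property
    def critic_lr(self):
        return self.actor_lr * self.critic_lr_multiplier


def combined_loss(policy_loss, value_loss, cfg):
    """Weighted sum of the actor and critic losses."""
    return cfg.actor_weight * policy_loss + cfg.critic_weight * value_loss


cfg = LossConfig()
total = combined_loss(policy_loss=2.0, value_loss=3.0, cfg=cfg)
```

In PyTorch, separate learning rates like these are usually set through optimizer parameter groups, one group for the actor's parameters and one for the critic's.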
And PPO really is simple to implement. Go check out the source code of Stable Baselines3: the core is only a few dozen lines. Running Atari on PyTorch, the default hyperparameters (clip_range=0.2, learning_rate=3e-4, gae_lambda=0.95) already beat most built-in AIs.
I did a facial expression controller project before. With the same compute budget, PPO reached the same reward level almost twice as fast as A2C.
At this point, you might ask: PPO is so strong, what's its secret?
I'll tell you—it's not that the math is beautiful, it's that it has a high tolerance for errors.
If your parameters are a bit off, it won't die abruptly; if you tweak a few lines of code, it won't crash immediately.
For engineers, that's a huge blessing.
---
3. PPG: Splitting Policy and Value, Letting Each Do Its Own Thing
You might think: if PPO is already this powerful, is it even worth messing with
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.