I Spent 3 Months Tuning Parameters and the Car Still Couldn't Get Through a Roundabout: It's Just an Optimization Problem
Generated: 2026-06-22 06:42:43
---
What Is Reinforcement Learning Really Doing? Don’t Be Fooled by the Algorithm Names!
This story starts with a moment that still makes me cringe to this day.
In 2018, I jumped into an autonomous driving project. A few guys on the team who came from academia kept talking to me about PPO, SAC, TD3—all these high-end algorithms being thrown at the car. We spent three months tuning parameters. Guess what? The car still couldn’t navigate a roundabout! I was about to lose my mind.
Then an older colleague who specialized in control theory walked by, glanced at our code, and laughed: “What the hell are you guys doing? This is just black-box optimization! Something that dynamic programming could solve in three minutes, and you’ve been messing with it for three months.”
At that moment, I was furious and refused to accept it. But later, in the quiet of the night, I thought it over carefully—he was right, absolutely right!
---
1. Reinforcement Learning Isn’t Magic—It’s Optimization! Really!
A lot of people think of reinforcement learning as something mystical, and others hype it up to the skies. I’ve been writing this column for ten years, and I’ve seen too many projects die on one phrase: “Let’s use reinforcement learning!”
Think about it—from a mathematical standpoint, what is reinforcement learning actually doing? It’s learning a decision-making rule that maximizes long-term rewards in an uncertain environment. It’s not some mysterious algorithm, not PPO, not DQN, not RLHF, and definitely not a magic button that suddenly makes large models capable of reasoning.
The essence of reinforcement learning is an optimization problem. And it’s a tough, messy, reality-slapping optimization problem.
Let me break it down for you. What reinforcement learning cares about is actually very simple:
- There’s a state s
- There’s an action a
- After taking the action, the environment gives you a reward r
- Then it takes you to the next state s'
How does the environment change? It’s represented by a transition probability: P(s' | s, a)
The agent needs to learn a policy: π(a | s), the probability of choosing action a when it is in state s.
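To make these pieces concrete, here is a minimal sketch of the interaction loop. The two-state environment, its transition table `P`, the reward table `R`, and the fixed policy `PI` are all made-up numbers for illustration; none of them come from the driving project described above.

```python
import random

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
# P[s][a] is a list of (next_state, probability) pairs, i.e. P(s' | s, a).
P = {
    0: {0: [(0, 0.9), (1, 0.1)], 1: [(0, 0.2), (1, 0.8)]},
    1: {0: [(0, 0.7), (1, 0.3)], 1: [(0, 0.1), (1, 0.9)]},
}
# R[s][a] is the reward r received for taking action a in state s.
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 2.0, 1: 0.5}}
# PI[s][a] is the policy pi(a | s).
PI = {0: [0.5, 0.5], 1: [0.3, 0.7]}


def rollout(steps, rng):
    """Sample one trajectory as a list of (s, a, r, s') tuples."""
    s = 0
    trajectory = []
    for _ in range(steps):
        a = rng.choices([0, 1], weights=PI[s])[0]            # act according to pi(a | s)
        r = R[s][a]                                          # environment hands back a reward
        next_states, probs = zip(*P[s][a])
        s_next = rng.choices(next_states, weights=probs)[0]  # sample s' from P(s' | s, a)
        trajectory.append((s, a, r, s_next))
        s = s_next
    return trajectory


traj = rollout(5, random.Random(0))
```

Everything reinforcement learning does later is built on tuples like these; the only thing the agent gets to change is `PI`.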
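The problem reinforcement learning solves is to find a set of policy parameters θ that maximizes the expected long-term cumulative reward:
J(θ) = E[ ∑_{t≥0} γ^t r_t ]
Here, the expectation is taken over trajectories generated by running the policy π with parameters θ, and γ is the discount factor, a number between 0 and 1: the farther away the reward, the lower its weight.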
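A small sketch of the quantity inside that expectation may help. The function name `discounted_return` and the choice γ = 0.9 are my own assumptions for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute sum over t of gamma**t * r_t for one trajectory's reward list."""
    total = 0.0
    # Work backwards so each step is just: r_t + gamma * (return from t+1 onward).
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# The same reward of 1, arriving immediately versus ten steps later.
now = discounted_return([1.0], gamma=0.9)
later = discounted_return([0.0] * 10 + [1.0], gamma=0.9)
```

With γ = 0.9 the delayed reward is worth 0.9 to the power of 10 of the immediate one. J(θ) is then the average of this number over many trajectories sampled from the policy.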
See, this formula is the core of reinforcement learning! Don’t get confused by all the algorithm names—DQN, A3C, PPO, SAC, GRPO, all the RLHF variants. They’re all circling around this objective function, just with different estimation methods, constraint methods, and sampling methods.
I’ve tested many projects, and honestly, not many teams truly understand this problem. Most people treat reinforcement learning as a black box, throw data into it, and expect it to get smarter on its own. And what happens? The data goes in, the money burns up, and the model doesn’t converge. Frustrating, isn’t it?
---
2. What’s Actually Happening Mathematically? A Trick That’ll Make Your Head Spin
There’s a particularly clever mathematical trick here called “dual pairing for transferring differentiation targets.” The name sounds intimidating, but the idea is actually very simple.
Let me give you an example to make it clear.
Suppose you need to calculate the derivative of a function df/dx, but the function has terrible properties—it’s non-smooth, and differentiation is prohibitively expensive. What do you do?
The answer is integration by parts:
∫_a^b f'(x)g(x)dx = [f(x)g(x)]_a^b - ∫_a^b f(x)g'(x)dx
As long as you design g(x) to be zero at the boundaries a and b, the bracketed term vanishes and the derivative moves from f to g. Pretty clever, isn’t it?
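If you want to convince yourself numerically, here is a rough check. The non-smooth function `f`, the test function `g`, and the kink location are all choices I made up for this sketch, and the integrals use a plain midpoint rule.

```python
import math


def midpoint_integral(h, a, b, n=20000):
    """Approximate the integral of h over [a, b] with the midpoint rule."""
    width = (b - a) / n
    return sum(h(a + (i + 0.5) * width) for i in range(n)) * width


KINK = 0.3  # hypothetical location of the non-smooth point


def f(x):        # non-smooth at x = KINK
    return abs(x - KINK)


def f_prime(x):  # derivative of f everywhere except the kink itself
    return 1.0 if x > KINK else -1.0


def g(x):        # test function that is zero at x = 0 and x = 1
    return math.sin(math.pi * x)


def g_prime(x):
    return math.pi * math.cos(math.pi * x)


# Side that needs the derivative of the awkward function f ...
lhs = midpoint_integral(lambda x: f_prime(x) * g(x), 0.0, 1.0)
# ... and the side that only ever differentiates the well-behaved g.
rhs = -midpoint_integral(lambda x: f(x) * g_prime(x), 0.0, 1.0)
```

The two numbers agree up to discretization error, even though the second one never touches the derivative of `f`.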
The broader idea is: linear operators can be transferred to the paired side through duality. This technique runs through everything from weak solutions of PDEs and Galerkin methods to modern machine learning techniques like flow matching and diffusion.
Applied to reinforcement learning, an MDP can be written as an optimization problem over trajectory space:
J(π) = E_{τ~p_π}[R(τ)]
Here, τ is a trajectory, p_π is the trajectory distribution induced by policy π, and R(τ) is the cumulative reward along the trajectory.
The problem is that trajectory space is enormous! In discrete time with horizon T, there are on the order of (|S|·|A|)^T possible trajectories, and a distribution over them carries the complete joint information of the whole sequence. Solving directly in this space would cause a computational explosion.
This is where the dual pairing trick comes in handy. Instead of differentiating through the reward or the environment dynamics, which may be non-smooth or simply unknown, you move the gradient onto the log-probability of the policy, which you control and can differentiate with respect to its parameters. That’s the mathematical essence of the policy gradient theorem.
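One standard way to write this step out is the log-derivative (score function) identity. The symbols p_θ(τ) for the trajectory distribution under parameters θ, ρ_0 for the initial-state distribution, and the finite horizon T are notation I am assuming here to match the formulas above.

```latex
\begin{aligned}
p_\theta(\tau) &= \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t) \\
\nabla_\theta J(\theta)
  &= \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau
   = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau \\
  &= \mathbb{E}_{\tau \sim p_\theta}\!\left[ R(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
\end{aligned}
```

The last line follows because ρ_0 and P do not depend on θ, so they drop out of ∇_θ log p_θ(τ). Neither R nor P is ever differentiated, which is why the environment can stay a black box.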
In plain terms, reinforcement learning is essentially an optimization algorithm that “uses dual pairing to avoid differentiating the environment.” Isn’t that stunning?
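Here is a minimal sketch of that idea on the smallest problem I could think of: a two-armed bandit whose reward is a noisy, non-differentiable black box. The payoff probabilities, learning rate, and the running-average baseline (a common variance-reduction add-on, not something the argument above depends on) are all assumptions for illustration.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def black_box_reward(action, rng):
    """Noisy 0/1 reward we cannot differentiate; arm 1 pays off more often (made-up numbers)."""
    win_prob = [0.2, 0.8][action]
    return 1.0 if rng.random() < win_prob else 0.0


def train(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # policy parameters: one logit per action
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choices([0, 1], weights=probs)[0]
        r = black_box_reward(a, rng)
        # For a softmax policy, d/d theta_k of log pi(a) is 1[k == a] - probs[k].
        for k in range(2):
            grad_log_pi = (1.0 if k == a else 0.0) - probs[k]
            theta[k] += lr * (r - baseline) * grad_log_pi
        baseline += 0.05 * (r - baseline)  # running average of past rewards
    return softmax(theta)


final_probs = train()
```

The only derivative in the loop is that of log π with respect to θ. The reward function is called, never differentiated, yet the policy should still drift toward the better arm.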
---
3. Why Is Reinforcement Learning So Hard? Three Words: It Doesn’t Obey!
Supervised learning solves the problem of predicting a label y given an input x. Its training data is basically fixed—you just learn from it.
But where does reinforcement learning get tricky? The data isn’t pre-prepared; the data is generated by your own policy! As soon as your policy changes, the sampled data changes too, and the data distribution shifts along with it. It’s like a disobedient child—the more you try to control it, the more it runs off course.
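A toy sketch of this effect: the same environment, two different policies, two very different datasets. The 5-cell corridor and the `p_right` parameter are hypothetical, chosen only to keep the example small.

```python
import random
from collections import Counter


def visited_states(p_right, steps=5000, seed=0):
    """Roll out a hypothetical 5-cell corridor and return state visit frequencies."""
    rng = random.Random(seed)
    s = 2  # start in the middle cell
    counts = Counter()
    for _ in range(steps):
        counts[s] += 1
        move = 1 if rng.random() < p_right else -1
        s = min(4, max(0, s + move))  # walls at both ends
    return {state: counts[state] / steps for state in range(5)}


cautious = visited_states(p_right=0.2)  # policy that mostly moves left
bold = visited_states(p_right=0.8)      # policy that mostly moves right
```

The first policy piles up its data in the leftmost cells and the second in the rightmost ones. Whatever you learned from one dataset says very little about the states the other policy will actually visit.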
Even more annoying is that you can’t tell whether an action is good or bad right away. You might only know the result ten steps later. You make a decision today, and you won’t see the consequences until next month. Who can handle that?
These are the two most critical problems in reinforcement learning: the data distribution changes with the policy, and rewards are delayed.
In more engineering terms, supervised learning is mostly about fitting the past, while reinforcement learning is about making decisions for future rewards. The past is certain, the future is unknown—that’s why it’s hard.
I once saw a recommendation system team using reinforcement learning to optimize CTR. They spent six months building the environment, defining rewards, and training the model. When they launched, they found that the model’s recommended list did increase short-term click-through rates, but user retention dropped. Why? Because the model learned to recommend “clickbait” content—users clicked but found it didn’t deliver, so they stopped coming back.
That’s a reward design problem. There’s a huge gap between the reward you define and the actual business goal. You think you’re optimizing A, but the model is cheating to get B.
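The gap can be shown with a deliberately tiny sketch. The catalog, the click probabilities, and the satisfaction probabilities below are invented for illustration and are not data from that team.

```python
# Hypothetical catalog:
# (name, click probability, probability the user is satisfied after clicking)
ITEMS = [
    ("clickbait", 0.30, 0.10),
    ("solid_article", 0.15, 0.80),
    ("niche_deep_dive", 0.08, 0.95),
]


def proxy_reward(item):
    """What the model is trained on: expected clicks."""
    _, p_click, _ = item
    return p_click


def business_value(item):
    """What the team actually wants: clicks that leave the user satisfied."""
    _, p_click, p_satisfied = item
    return p_click * p_satisfied


best_for_proxy = max(ITEMS, key=proxy_reward)[0]
best_for_business = max(ITEMS, key=business_value)[0]
```

An optimizer that only sees `proxy_reward` picks the clickbait item, while the item that is best under `business_value` is a different one. The better the optimizer, the more reliably it finds that gap.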
---
4. Reinforcement Learning After the 2024 Turing Award: An Amplifier, Not a Panacea
The 2024 Turing Award went to Andrew Barto and Richard Sutton for their foundational work on reinforcement learning, an official stamp of recognition for the field they built. By early 2026, the industry’s attitude toward reinforcement learning had clearly shifted again.
In the past, people working on recommendations, advertising, games, and robotics talked about reinforcement learning, but many companies found it too heavy, too slow, and too unstable. Now, with the rise of large models—especially RLHF, RLAIF, rule-based reward RL, and models like DeepSeek showing impressive reasoning abilities—people suddenly realized:
Supervised learning can teach a model up to a certain point, but to go further, simply feeding it standard answers isn’t enough. The model must adjust its own behavior around a goal.
But there’s a big misconception here.
Many people think reinforcement learning makes models smarter. More accurately, reinforcement learning changes the model’s behavior distribution, pushing it toward outputs that yield higher rewards. It doesn’t create knowledge out of thin air; it’s more like shifting probability mass toward more valuable areas within an existing capability space.
If the base model is weak, reinforcement learning can’t save it; if the reward signal is bad, reinforcement learning will amplify the problem. It’s an amplifier, not a magic wand.
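One way to picture "shifting probability mass" is the reweighting below. It assumes a KL-regularized fine-tuning objective, whose optimum has the well-known form p_new(y) proportional to p_base(y)·exp(r(y)/β); whether a given RLHF pipeline behaves this way is an assumption, and the answer categories, probabilities, and β are made up.

```python
import math


def tilt(base_probs, rewards, beta):
    """Reweight a base distribution toward high reward.

    p_new(y) is proportional to p_base(y) * exp(r(y) / beta).
    """
    weights = {y: p * math.exp(rewards[y] / beta) for y, p in base_probs.items()}
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}


# Hypothetical answer distributions for a single prompt.
rewards = {"correct": 1.0, "plausible_wrong": 0.0, "nonsense": 0.0}
strong_base = {"correct": 0.40, "plausible_wrong": 0.50, "nonsense": 0.10}
weak_base = {"correct": 0.00, "plausible_wrong": 0.60, "nonsense": 0.40}

strong_after = tilt(strong_base, rewards, beta=0.5)
weak_after = tilt(weak_base, rewards, beta=0.5)
```

For the stronger base, the probability of the correct answer rises from 0.40 to about 0.83. For the weaker base it stays at exactly zero: there was no mass on the right answer to amplify.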
Last year, I ran a comparative test on a large model project. Using the same base model, RLHF tuning improved accuracy on math reasoning tasks by 15%. But when I switched to a worse base model, the same RLHF process only improved accuracy by 2%, and it produced more “hallucinations”—the model started fabricating reasoning steps to get rewards.
This is interesting. Reinforcement learning isn’t a panacea; it’s an amplifier. A good base model plus good reward design can surprise you. A bad base model plus bad reward design can make you question your life choices.
---
5. My Take: Don’t Treat It as Magic—Treat It as a Tool!
From a mathematical standpoint, reinforcement learning is an optimization problem. It uses the dual pairing technique to move the gradient computation from the trajectory distribution onto the policy parameters. At its core, it lives at the optimizer level.
But why do so many people find it complex? Because in the real world, MDPs have enormous state spaces, complex action spaces, fuzzy reward definitions, and often unknown environmental dynamics. What’s simple mathematically becomes a tangled mess in engineering implementation.
I predict that in the coming years, reinforcement learning will become increasingly
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.