Generated: 2026-06-21 03:03:30
---
Let's Talk Again About Reinforcement Learning in Autonomous Driving
Speaking of this topic, I remember a particularly embarrassing scene last month.
I was all excited, having just reproduced an imitation learning baseline. The model looked incredible on the open-loop validation set, with the loss down to 0.0-something. I almost thought I was about to get a paper into a top conference.
So what happened? We put it in the car for a quick test drive, and its true colors were revealed immediately.
At the intersection, it veered half a meter to the left and nearly scraped the curb.
You know what I was thinking? This isn't the first time this has happened to me!
---
Think about it—during training, the model gets nothing but perfect human driving trajectories. Every state is a "correct answer" state. But once the car is on the road, even the slightest deviation throws it into states it has never seen during training. And then it's completely lost, with no idea how to get back.
This is what we call distribution shift. In plain English? The model hasn't seen much of the world.
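To make that concrete, here is a minimal toy sketch of the gap between open-loop and closed-loop evaluation. Everything in it is an assumption made up for illustration: a 1-D lateral offset as the state, a hypothetical `expert` that steers back to the lane center, and a hypothetical `cloned_policy` that matches the expert only inside the offsets it saw in training. It is not my actual baseline.

```python
from typing import Callable, List


def expert(offset: float) -> float:
    """Hypothetical expert: always steers halfway back to the lane center."""
    return -0.5 * offset


def cloned_policy(offset: float, seen: float = 0.1) -> float:
    """Toy clone: close to the expert inside the training range, clueless outside."""
    if abs(offset) <= seen:
        return -0.5 * offset + 0.01  # small fitting error on seen states
    return 0.01  # never saw this state, so there is no learned recovery


def open_loop_error(states: List[float]) -> float:
    """Mean action error on expert-visited states, which is what validation loss sees."""
    return sum(abs(cloned_policy(s) - expert(s)) for s in states) / len(states)


def closed_loop_rollout(policy: Callable[[float], float], start: float, steps: int = 50) -> float:
    """Let the policy drive on the states it creates itself and return the final offset."""
    offset = start
    for _ in range(steps):
        offset = offset + policy(offset)
    return offset


if __name__ == "__main__":
    expert_states = [-0.1, -0.05, 0.0, 0.05, 0.1]
    print("open-loop error:", open_loop_error(expert_states))
    print("expert after a 0.15 m nudge:", closed_loop_rollout(expert, 0.15))
    print("clone after a 0.15 m nudge:", closed_loop_rollout(cloned_policy, 0.15))
```

The open-loop number stays tiny because it is only ever measured on states the expert visited. The moment the start offset sits slightly outside that range, the expert recovers and the toy clone keeps drifting.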
So it was almost inevitable that RL (reinforcement learning) would enter the autonomous driving landscape.
But to be honest, RL itself hasn't had an easy road. The pitfalls I've stumbled into and the papers I've read over the years have made me increasingly aware: this is far from simple.
---
1. Waymo's ECCV RL Fine-Tuning Paper: Smart, but Really Sneaky
Let's start with the paper Waymo published at ECCV 2024, "Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving."
The core idea is particularly interesting: since you lack a real environment for closed-loop training, let the network create an environment for itself.
They used MotionLM, which generates trajectories autoregressively: at each step it outputs one action for the ego vehicle and for every other agent, feeds those back in as input, and repeats until the steps add up to a full 6-second trajectory. In other words, the network itself serves as the simulator. It's rough, but good enough for running RL fine-tuning.
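The rollout loop can be sketched roughly like this. It is only a skeleton under my own assumptions: `constant_velocity_policy` is a hypothetical stand-in for the learned model (not MotionLM's real interface), the state is just an (x, y) position, and the 0.5-second step is a value I picked for illustration.

```python
from typing import Callable, Dict, List, Tuple

State = Tuple[float, float]    # (x, y) position; deliberately tiny
Action = Tuple[float, float]   # (dx, dy) displacement for one step
JointState = Dict[str, State]  # agent name -> state, ego included


def rollout(
    step_policy: Callable[[JointState], Dict[str, Action]],
    initial: JointState,
    horizon_s: float = 6.0,
    dt: float = 0.5,  # assumed step length, not taken from the paper
) -> List[JointState]:
    """Autoregressive rollout: each step's output becomes the next step's input."""
    trajectory = [initial]
    current = initial
    for _ in range(int(round(horizon_s / dt))):
        actions = step_policy(current)  # one action per agent, ego included
        current = {
            name: (x + actions[name][0], y + actions[name][1])
            for name, (x, y) in current.items()
        }
        trajectory.append(current)
    return trajectory


def constant_velocity_policy(joint: JointState) -> Dict[str, Action]:
    """Hypothetical stand-in for the learned model: everyone moves 1 m along x."""
    return {name: (1.0, 0.0) for name in joint}


if __name__ == "__main__":
    start = {"ego": (0.0, 0.0), "agent_1": (5.0, 3.5)}
    traj = rollout(constant_velocity_policy, start)
    print(len(traj), "joint states, final ego position:", traj[-1]["ego"])
```

The point of the structure is that nothing outside the policy ever corrects the state, so whatever the policy gets wrong at one step is baked into every later step.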
I immediately tried it on my own data after reading it—and guess what? The results were terrible.
One detail left a deep impression on me: the paper emphasizes scene-centric encoding. The static information fed into the network isn't just the HD map from the current frame, but an aggregation of static information across all time steps during the 6-second rollout.
I thought, "How important could that be?" and lazily used only the current frame's HD map.
By the third second of rollout, the car started driving into areas with no road.
Why? Because the model had been given no map beyond the current frame, so it had no idea where the road went later in the 6-second rollout!
That lesson completely humbled me: what seems like a minor detail is often the difference between success and failure.
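Here is a rough sketch of what went wrong in my shortcut. The numbers and helper names (`crop`, `horizon_radius`, `has_map_support`) are all hypothetical, and sizing the crop by speed times horizon is my own simplification, not the paper's encoding scheme.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def crop(map_points: List[Point], center: Point, radius: float) -> List[Point]:
    """Keep only the static map points within `radius` meters of `center`."""
    return [p for p in map_points if math.dist(p, center) <= radius]


def horizon_radius(base_radius: float, speed_mps: float, horizon_s: float) -> float:
    """Grow the crop so it covers everywhere the car could reach during the rollout."""
    return base_radius + speed_mps * horizon_s


def has_map_support(position: Point, map_points: List[Point], tolerance: float = 5.0) -> bool:
    """True if some map point lies within `tolerance` meters of the position."""
    return any(math.dist(position, p) <= tolerance for p in map_points)


if __name__ == "__main__":
    # A straight lane centerline sampled every 5 m, purely illustrative.
    lane = [(float(x), 0.0) for x in range(0, 101, 5)]
    ego = (0.0, 0.0)
    frame_only = crop(lane, ego, 30.0)
    whole_rollout = crop(lane, ego, horizon_radius(30.0, 15.0, 6.0))
    position_at_3s = (45.0, 0.0)  # 15 m/s for 3 seconds
    print("current-frame crop covers 3 s position:", has_map_support(position_at_3s, frame_only))
    print("rollout-wide crop covers 3 s position:", has_map_support(position_at_3s, whole_rollout))
```

With the current-frame crop, the car leaves the mapped area partway through the rollout, and from the model's point of view the road simply stops existing.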
But on the other hand, relying on the network to generate its own rollout isn't truly closed-loop. Errors in the generated interaction trajectories accumulate over time. If I were to deploy this at scale, would this simplified world model hold up? Honestly, I'm not confident.
---
2. The Three Mountains of RL: Closed-Loop, Reward, Optimization Objective
I recently read an analysis on Zhihu that summarized the difficulties of RL in autonomous driving as "three tigers blocking the road." I mostly agree, but I want to expand a bit.
First, let's talk about how hard closed-loop really is.
Think about it: right now there are only three kinds of closed-loop setup, namely vector closed-loop, intermediate-feature closed-loop, and sensor closed-loop.
Vector closed-loop is the most mature, but it can only do two-stage training—train the planning module separately, then connect it to a closed-loop simulator. I've spent a lot of time on feature closed-loop, trying to feed BEV features into the next frame's prediction, but the accuracy is hard to verify. As for sensor closed-loop? That's the world model. No matter how fancy the name sounds, it's riddled with pitfalls.
I've seen several startups that claimed to be building world models, and they all ended up falling into the same trap: **"nov
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.