The Rise of On-Policy Distillation as Seen Through Technical Reports: A New Paradigm for Large-Model Post-Training
Generated: 2026-06-20 02:59:35
Let me tell you a story today.
It begins in 2025. That year, something strange happened in the world of post-training for large models: several top-tier technical reports all mentioned the same term—On-Policy Distillation, or OPD.
You think this is just another piece of academic jargon? No. It had already become the default move in the industry.
Look: Qwen3 used OPD to train a lightweight model, slashing inference costs. GLM-5 used OPD to fix the capability forgetting that occurs after multi-stage RL—that catastrophic "learn then lose" pattern, a chronic headache for long-chain RL pipelines. MiMo-V2 went even further: they built a multi-teacher OPD—one teacher for math, one for code, one for search. Three teachers teaching a single student. Elite education!
But the most systematic approach came from DeepSeek-V4. They laid out a multi-expert OPD pipeline: first, train more than ten expert models independently, each going through the full post-training sequence of SFT followed by GRPO. Then a still-immature student model generates its own answers on-policy, and for every token it produces, the teacher stands by, evaluating and correcting.
Put simply: this fused the "online exploration" of reinforcement learning with the "dense signals" of knowledge distillation into a single thing.
Now you might ask: What's so great about OPD? Didn't we already have SFT and RL?
The thing is, SFT suffers from exposure bias: the student only ever trains on prefixes taken from the training data, so it falls apart once it has to continue from its own, imperfect outputs. RL, on the other hand, has reward signals that are too sparse: the student fumbles for a long time without knowing where it went wrong. OPD offers a middle ground. The student generates trajectories from its own distribution, while the teacher provides soft supervision at every token. That avoids the distribution shift of SFT and is easier to optimize than a sparse reward.
In other words, this trick combines "walking the path yourself" with "a teacher guiding you step by step."
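To make "the student walks, the teacher signals at every step" concrete, here is a minimal sketch in plain Python. It assumes the per-token signal is the reverse KL divergence, KL(student || teacher), which is one common choice and not something the reports above are quoted as specifying. The function names and the toy three-word vocabulary are hypothetical, chosen only for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one position, summed over the vocabulary."""
    log_p = log_softmax(student_logits)   # student distribution
    log_q = log_softmax(teacher_logits)   # teacher distribution
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def opd_loss(student_logits_seq, teacher_logits_seq):
    """Mean per-token reverse KL along a trajectory the student sampled itself.

    Both arguments hold one logit vector per generated position; the teacher's
    logits are assumed to be computed on the same student-written prefix.
    """
    per_token = [reverse_kl(s, t)
                 for s, t in zip(student_logits_seq, teacher_logits_seq)]
    return sum(per_token) / len(per_token), per_token

if __name__ == "__main__":
    # Position 0: student and teacher disagree. Position 1: they agree.
    student = [[2.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
    teacher = [[0.0, 1.0, 2.0], [0.0, 0.0, 0.0]]
    mean_loss, per_token = opd_loss(student, teacher)
    print(per_token, mean_loss)
```

The point of the sketch is the shape of the signal: every position gets its own number, so the student can see which step diverged from the teacher, instead of receiving one reward for the whole answer.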
In April of that year, a paper titled Rethinking On-Policy Distillation came out, jointly from Tsinghua's THUNLP and several institutions. Using a unified f-divergence framework, they mapped out the entire landscape along three dimensions: feedback signals, teacher access, and loss granularity. In effect, they drew a map of OPD.
And on that map appeared two new species. One is OPSD: self-distillation using the model's own output, matching GRPO's performance while saving over half the resources. The other is OPCD: asymmetric distillation, letting the student absorb the teacher's distribution at the context level. In October 2025, Thinking Machines Lab released the most comprehensive engineering practice guide for OPD, covering everything from sampling to KL divergence computation, a mature pipeline that became the community's benchmark.
But a free lunch never comes without a catch.
Live broadcasts always glitch. In real training logs, researchers repeatedly observed instability in sampled-token OPD—the model could converge to a locally reasonable point, while overall performance actually dropped. The Tsinghua team pinpointed the issue: teachers have a "quality threshold." Once the teacher itself is biased, or if the student's sampling space is too large, distillation actually amplifies the errors. How do you balance computational cost and gain? What if multiple teachers conflict? The answers to these questions determine whether OPD can evolve from a "star technique" into a "general infrastructure."
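One property of a sampled-token signal that is easy to check numerically is its noise. The sketch below assumes that "sampled-token" means scoring only the token the student actually drew, using log p_student(x) - log p_teacher(x), instead of summing over the whole vocabulary. That reading, the function names, and the toy logits are all assumptions made for illustration. Whether this noise is what drives the instability described above is not something the source establishes.

```python
import math
import random

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def full_reverse_kl(log_p, log_q):
    """Exact KL(student || teacher), summed over the whole vocabulary."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def sampled_token_estimate(log_p, log_q, token):
    """Single-sample estimate: log-ratio at the one token the student drew."""
    return log_p[token] - log_q[token]

def compare(student_logits, teacher_logits, n_samples=20000, seed=0):
    """Return (exact KL, mean of sampled estimates, variance of sampled estimates)."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    probs = [math.exp(lp) for lp in log_p]
    rng = random.Random(seed)
    # Draw tokens from the student's own distribution (on-policy sampling).
    tokens = rng.choices(range(len(probs)), weights=probs, k=n_samples)
    estimates = [sampled_token_estimate(log_p, log_q, t) for t in tokens]
    mean = sum(estimates) / n_samples
    var = sum((e - mean) ** 2 for e in estimates) / n_samples
    return full_reverse_kl(log_p, log_q), mean, var

if __name__ == "__main__":
    exact, mean, var = compare([2.0, 1.0, 0.0], [0.0, 1.0, 2.0])
    print(f"exact={exact:.4f} sampled_mean={mean:.4f} sampled_var={var:.4f}")
```

Averaged over many draws, the sampled estimate approaches the exact divergence, but any single draw can land far from it and can even be negative, which the exact KL never is. In this toy setup that is the gap between a full-distribution signal and a sampled-token one.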
And so, the story draws to a close. Did you notice? OPD is not the final destination. It's a bridge. On one side stand supervised fine-tuning and reinforcement learning; on the other, a more continuous form of evolution. The signals it provides are denser than RL's and more robust than SFT's. In the future, as mechanisms for mixed teachers, adaptive sampling, and failure recovery mature, OPD will coexist with RL, each complementing the other.
What we're witnessing is not just the rise of a technique, but a shift in training philosophy.
Let the model learn from a stronger version of itself, on the very paths it has walked.
That's the story of OPD. And remember this: the best teacher is not the one who walks the path for you, but the one who gives you a signal at every step you take.
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.