The Real o1 Breakthrough: RLHF Is Dead, Long Live Self-Play RL

o1 isn't about model architecture. It's about a training paradigm shift so fundamental that I think we'll look back on 2024 as the year everything changed for LLMs.

Here's the one-sentence summary: RLHF's era is over. Self-play RL's era has begun.

I know that sounds like hype. But last week I was reading OpenAI's o1 blog post—really reading it, not skimming—and one phrase made me stop cold: "training models via end-to-end trial and error." That's AlphaGo's playbook, straight up. Except instead of a Go board, you've got token sequences. Instead of placing stones, you're generating reasoning steps.

But here's the thing—and I only realized this after trying to replicate it myself—it's way harder than it sounds.

Self-play RL is the engine, not the architecture

o1's breakthrough isn't architectural. Under the hood, it's probably GPT-4o's foundation, maybe even smaller. Cao Yu's team estimates o1-preview's actual parameter count could be 20-30% less than GPT-4o's.

So how does a smaller model absolutely demolish its bigger sibling on reasoning tasks?

Training paradigm shift. RLHF → pure RL.

My evidence is embarrassingly simple: the o1 blog repeatedly mentions "RL training scaling" but barely references human annotation data. Think about it—if they were still doing RLHF, what exactly would they be scaling? Human labelers can produce maybe 500 high-quality reasoning chains per day, tops. That's a data bottleneck, not a scaling opportunity.

Self-play doesn't have that problem. The model generates its own data, verifies it, and iterates. Theoretically infinite training data.

I'm guessing o1's training loop looks something like this:

Generator produces reasoning path → Verifier scores it → Score signal updates Generator → Stronger Generator creates harder problems → Verifier levels up too

Classic self-play. Standard stuff if you've followed DeepMind's work.

The difference from AlphaZero? Go has sparse but deterministic reward signals—count the stones, black wins if they control 184.5+ intersections, done. Reasoning tasks? Good luck designing a reward function for "is this intermediate step logically sound?" GPT-4 evaluating whether step 3 of a mathematical proof is rigorous might itself get it wrong.

This is o1's central technical challenge: how do you build a verifier that actually works?

My attempt to reverse-engineer the technical approach

Based on what's leaked out—Cao Yu and Zhang Junlin's analyses, Peking University's alignment team's interpretation, plus the ReST-MCTS* paper—I've pieced together two probable technical routes:

Route 1: MCTS + Learning (the "orthodox" approach)

Use Monte Carlo Tree Search during training to explore reasoning space, find good paths, and train on those. During inference, use similar search strategies—which is why it's dramatically slower. That 43-second "thinking" period where o1-preview generated 2,930 tokens? That's about 68 tokens/second for final output, but internally it probably generated 10x more tokens and pruned heavily. Traditional GPT-4o inference runs at 40-50 tokens/second. o1-preview looks faster on the surface, but those 43 seconds of "thinking" are almost certainly internal tree search.
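If you haven't met MCTS before, the selection step is usually driven by the UCT rule, which balances a node's average value against how rarely it has been visited. Here's a minimal sketch of that rule applied to reasoning steps. To be clear, nothing here is confirmed by OpenAI: `Node`, `uct_score`, `select_child`, and the exploration constant `c=1.4` are hypothetical names and values I picked for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One partial reasoning path in the search tree (hypothetical structure)."""
    step: str
    visits: int = 0
    total_value: float = 0.0  # sum of verifier scores backed up through this node
    children: list["Node"] = field(default_factory=list)


def uct_score(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT: mean value plus an exploration bonus for rarely visited nodes."""
    if child.visits == 0:
        return math.inf  # always try unvisited steps first
    mean_value = child.total_value / child.visits
    bonus = c * math.sqrt(math.log(parent_visits) / child.visits)
    return mean_value + bonus


def select_child(parent: Node) -> Node:
    """Pick the child reasoning step with the highest UCT score."""
    return max(parent.children, key=lambda ch: uct_score(ch, parent.visits))
```

A full search would repeat select, expand, score, and back up many times per query, which is where the extra hidden tokens would come from under this route.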

Route 2: Extended STaR + Iterative Bootstrap + Self-reflection

The model generates its own reasoning, filters for quality, and trains on the good stuff. The secret sauce is "reflection"—the ability to spot errors and backtrack. o1's hidden reasoning chains apparently contain lots of "Hmm," "Wait," "Alternatively" markers followed by immediate corrections. I've seen leaked o1 chain-of-thought fragments (don't ask for links, OpenAI already took them down), and roughly 15-20% of steps include self-correction behavior.
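For what it's worth, a number like "15-20% of steps" can be estimated with a crude marker count. The sketch below assumes each non-empty line is one reasoning step and that the three markers above are a fair proxy for self-correction; both are simplifications, and `self_correction_rate` is a name I made up for the example.

```python
import re

# Markers mentioned in the post; treating them as a proxy for
# self-correction is an assumption, not a validated metric.
MARKERS = ("hmm", "wait", "alternatively")
_MARKER_RE = re.compile(r"\b(" + "|".join(MARKERS) + r")\b", re.IGNORECASE)


def self_correction_rate(chain_of_thought: str) -> float:
    """Fraction of non-empty lines (treated as steps) that contain a marker."""
    steps = [line for line in chain_of_thought.splitlines() if line.strip()]
    if not steps:
        return 0.0
    flagged = sum(1 for step in steps if _MARKER_RE.search(step))
    return flagged / len(steps)


# Made-up chain of thought, purely for illustration.
example = """Let x = 3, so x^2 = 6.
Wait, 3^2 is 9, not 6.
So x^2 + 1 = 10.
Alternatively, check by expanding (x+1)^2 - 2x.
That also gives 10."""

print(f"{self_correction_rate(example):.0%}")
```

Keyword matching will miss corrections that don't use these words and will flag casual uses of "wait," so treat the result as a rough signal only.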

Both routes are self-play driven, no massive human annotation needed. The difference: one leans on search compute, the other on the model's own discriminative ability. My bet? OpenAI uses a hybrid—MCTS provides exploration breadth, bootstrap provides learning efficiency.

My experiment crashed and burned (predictably)

I couldn't replicate o1, obviously, but I ran a small experiment to test the self-play idea on math reasoning.

Used Qwen2.5-Math-7B—don't ask why not a bigger model, I'm not made of GPU money—on high school competition problems from the MATH benchmark (Level 4-5 questions, 500 total).

The setup was straightforward: model generates solution steps (temperature=0.7, 3 paths sampled), a separate verifier model judges each step's correctness, collect the good steps, SFT them back into the original model. Repeat for 3 rounds. Used GPT-4o-mini as verifier because it's cheap: $0.15 per 1M input tokens.
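In code, the loop has roughly the shape below. This is a sketch of the structure, not my actual training script: `sample_paths`, `verify_step`, and `fine_tune` are hypothetical callables standing in for the model, the verifier API call, and the SFT job. Keeping only paths where every step passes is one reading of "collect the good steps"; keeping the verified prefix of each path would be another.

```python
from typing import Callable

Path = list[str]  # one solution path = ordered list of reasoning steps


def self_training_round(
    problems: list[str],
    sample_paths: Callable[[str, int, float], list[Path]],
    verify_step: Callable[[str, Path, str], bool],
    n_paths: int = 3,
    temperature: float = 0.7,
) -> list[tuple[str, Path]]:
    """Collect (problem, path) pairs whose every step the verifier accepts."""
    accepted = []
    for problem in problems:
        for path in sample_paths(problem, n_paths, temperature):
            # verify_step sees the problem, the steps so far, and the new step
            if all(verify_step(problem, path[:i], step) for i, step in enumerate(path)):
                accepted.append((problem, path))
    return accepted


def run_experiment(problems, sample_paths, verify_step, fine_tune, rounds: int = 3):
    """Repeat sample -> verify -> SFT; fine_tune returns the updated sampler."""
    for _ in range(rounds):
        data = self_training_round(problems, sample_paths, verify_step)
        sample_paths = fine_tune(sample_paths, data)
    return sample_paths
```

Note that the verifier is fixed across rounds here, so every mistake it makes is baked into the next round's training data.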

Results:

Round 1: Accuracy jumped from 32% (baseline) to 41%. +9 points. Expected.
Round 2: 47%. +6 more points. Okay, decent.
Round 3: 48%.

Barely moved.

I panicked and thought it was a data issue, so I expanded from 500 to 2,000 problems and ran another round. Still 48%.

Then I dug into the generated solutions and found the problem: GPT-4o-mini's step-level judgment accuracy was trash. I manually spot-checked 200 samples—it was correct only about 71.5% of the time. Plausible-sounding but wrong reasoning got labeled as correct, and the training data got poisoned.

Example: one problem asked to prove a sequence converges. The model used an incorrect inequality bound, but GPT-4o-mini rated it "reasonable reasoning." That error got fed back into training, and the model learned garbage.

Here's what I should've realized earlier: the ceiling on self-play RL is mostly determined by verifier quality. AlphaGo's verifier is deterministic—count stones, you're done. Reasoning verifiers are themselves an unsolved problem. Use a 70% accurate verifier, and your model's performance ceiling is probably around 50%. Beyond that, you're just recycling noise.
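A quick Bayes calculation shows how fast a noisy verifier contaminates the accepted data. The sketch assumes the verifier is equally accurate on correct and incorrect steps (symmetric errors), which I did not measure; `accepted_precision` and the 30% figure in the usage line are hypothetical.

```python
def accepted_precision(p_correct: float, verifier_acc: float) -> float:
    """P(step is truly correct | verifier accepted it), assuming symmetric errors.

    p_correct:    fraction of generated steps that are actually correct
    verifier_acc: probability the verifier judges any given step correctly
    """
    true_accepts = p_correct * verifier_acc
    false_accepts = (1.0 - p_correct) * (1.0 - verifier_acc)
    total = true_accepts + false_accepts
    return true_accepts / total if total > 0 else 0.0


# Hypothetical: on hard problems only 30% of candidate steps are truly correct.
print(f"{accepted_precision(0.30, 0.715):.1%}")
```

Under those assumptions, roughly half of what a 71.5% verifier accepts on hard problems is wrong. And the harder the problems get, the lower `p_correct` goes and the worse the accepted data becomes, which fits a plateau after a couple of rounds.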

OpenAI must have a dramatically better verifier. Probably a dedicated Process Reward Model with hundreds of billions of parameters, or more sophisticated cross-validation through search. The ReST-MCTS* paper proposes jointly training policy and process reward models—let verifier and generator co-evolve, rather than using a fixed weak verifier like I did.

Amateur hour over here.

Surprise finding: test-time scaling is the actual killer feature

During my experiment, I stumbled onto something: giving the model more "thinking time"—more sampling rounds plus verification—produced gains way beyond what I expected.

I used the simplest possible approach: generate 5 solution paths, let the verifier vote for the best one. That's it.
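This is plain best-of-N selection, and it fits in a few lines. In the sketch, `generate` and `score` are hypothetical stand-ins for a sampled model call and a verifier that returns a numeric score.

```python
from typing import Callable


def best_of_n(
    problem: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 5,
) -> str:
    """Sample n candidate solutions and return the one the verifier scores highest."""
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda solution: score(problem, solution))
```

No weights change here. All the extra accuracy comes from spending n generator calls plus n verifier calls per question.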

Accuracy jumped from 48% to 61%. Thirteen points.

I was floored.

Then I tried 10 paths: 67%. 20 paths: 71%. After that, diminishing returns—it saturated around 75%, because the verifier's own accuracy capped the ceiling.

This reframed how I understand o1's "test-time scaling law." The key insight isn't just about training compute; it's that you can scale at inference time. Give the model more thinking time, and it performs better. Before saturation, accuracy in my runs was roughly linear in the log of inference compute: each doubling of sampled paths bought about 4-6 percentage points (61% → 67% → 71% going from 5 to 10 to 20 paths).
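Taking the numbers reported above at face value, the gain per doubling of inference compute can be computed directly. The one assumption is that the 48% figure corresponds to a single sampled path; `gain_per_doubling` is just a helper name for the example.

```python
import math

# (paths sampled, accuracy %) as reported above.
# Assumption: the 48% post-training result corresponds to 1 path.
results = [(1, 48), (5, 61), (10, 67), (20, 71)]


def gain_per_doubling(a: tuple[int, float], b: tuple[int, float]) -> float:
    """Accuracy points gained per doubling of sampled paths between two runs."""
    (n1, acc1), (n2, acc2) = a, b
    return (acc2 - acc1) / math.log2(n2 / n1)


for a, b in zip(results, results[1:]):
    print(f"{a[0]:>2} -> {b[0]:>2} paths: {gain_per_doubling(a, b):.1f} points per doubling")
```

Four data points from one small experiment are not a scaling law, but the shape (steady gains per doubling, then saturation) is the interesting part.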

Traditional LLM inference is "generate once"—cost is fixed. GPT-4o producing 100 tokens costs exactly 100 tokens' worth of compute. o1 is "search-based generation"—inference cost is variable and directly correlated with output quality. Want a smarter model? Throw more compute at it, no retraining required.

Translation: o1 turns "thinking time" into a tunable knob. From a product perspective, this is genius. Give paid users more inference compute, free users less. Same model, differentiated experience. Perfect commercialization.

This changes everything about inference accelerators

I used to look at Groq and Cerebras—those ultra-low-latency inference chips—and think they were solving a problem nobody had. Human reading speed is 5-10 tokens/second. GPT-4o outputs 40 tokens/second. We're already faster than people can read. Why do we need 450 tokens/second?

Now I get it.

o1-style models do massive internal search and verification—dozens or hundreds of model calls per user-facing query. Each call generates candidates, verifies, backtracks, regenerates. Latency becomes everything.

Let's do the math: suppose one o1 inference requires 100 internal model calls, each generating 50 tokens.

Traditional GPU (50ms/token): 250 seconds total. Four minutes. User's gone.
Groq LPU (5ms/token): 25 seconds. Tolerable.
Cerebras WSE3 (2.2ms/token): 11 seconds. Actually smooth.
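The arithmetic behind those three numbers, as a sketch you can plug your own assumptions into. It assumes the 100 calls run strictly one after another with no batching or parallel search, which is the worst case; the per-token latencies are the figures quoted above.

```python
CALLS = 100           # internal model calls per user-facing query (assumed above)
TOKENS_PER_CALL = 50  # tokens generated per call (assumed above)

# Per-token latency in milliseconds, as quoted in this post.
MS_PER_TOKEN = {
    "Traditional GPU": 50.0,
    "Groq LPU": 5.0,
    "Cerebras WSE3": 2.2,
}


def total_seconds(ms_per_token: float, calls: int = CALLS, tokens: int = TOKENS_PER_CALL) -> float:
    """End-to-end latency if every call runs sequentially."""
    return calls * tokens * ms_per_token / 1000.0


for name, ms in MS_PER_TOKEN.items():
    print(f"{name}: {total_seconds(ms):.0f} s")
```

If some of the search runs in parallel, wall-clock time drops but total compute does not, so per-token latency still sets the floor on the sequential part of the search.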

Groq hits 5ms/token. Cerebras WSE3 is even more absurd—450 tokens/second throughput, ~2.2ms per token latency. These chips might be purpose-built for models like o1. I'd bet actual money OpenAI is in talks with Cerebras. Pure speculation, no evidence, but it makes too much sense.

o1-preview is still slow though—68 tokens/second with frequent multi-second "thinking" pauses. OpenAI's inference infrastructure clearly hasn't caught up. Those aggressive rate limits (50 o1-preview queries per week, seriously?) probably aren't about model size. They're about self-play inference eating compute alive. One o1 inference might cost 10-50x what a GPT-4o query costs. OpenAI can't afford to be generous.

My worldview got rearranged

After all this analysis and experimentation, a few convictions:

Post-Training Scaling Law is real. Pre-training scaling is hitting walls (GPT-5 is MIA, Gemini Ultra was underwhelming), but post-training scaling, especially RL + search, is just getting started. o1-preview is reportedly smaller than GPT-4o but destroys it on math reasoning (AIME 2024: o1-preview 56.7% vs GPT-4o 13.4%). That's post-training scaling in action. Pre-training scales data, post-training scales compute. The latter looks like better ROI right now.

Self-play RL will become standard. Currently only OpenAI, Anthropic, and Google play this game seriously (all labs with a serious RL pedigree, shocker), but it'll spread fast. Sparse global reward signals plus self-play to crack specialized domains: the pattern is proven. I'd bet China's top LLM companies follow within six months. Verifier quality will be the differentiator.

Inference infrastructure is about to get weird. Ultra-long KV-cache management (o1's reasoning chains can hit tens of thousands of tokens), low-latency inference chips, distributed search orchestration—these become necessities, not nice-to-haves. Groq and Cerebras might've bet correctly. NVIDIA still owns training, but for search-based inference, specialized silicon's advantage is massive.

o1-preview is the appetizer. This is the "non-full" version, trained with roughly 100x compute—following AlphaZero's trajectory (AlphaGo Lee → AlphaZero: ~10x compute jump, AlphaZero → MuZero: another 10x). When full o1 and o2 drop, expect another explosion. I'm calling o2 for Q2 2025 with 90%+ AIME accuracy. Screenshot this.

Here's my spicy take: RLHF's era ended the day o1 shipped. RL's era started.

o1 reminds me of AlphaGo beating Lee Sedol in 2016. Everyone said "Go is dead," but the real story was the methodology. Eight years later, it's marched into LLMs and looks ready to dominate again. Self-play RL conquered Go, chess, StarCraft, and now language reasoning. Every time, it proves to be the strongest approach.

Honestly? It's beautiful.

TL;DR / Key Takeaways

o1's breakthrough is training paradigm, not architecture—self-play RL replaces RLHF, eliminating the human annotation bottleneck
Verifier quality is everything—my experiment capped at 48% because GPT-4o-mini's 71.5% judgment accuracy poisoned the training data; OpenAI likely uses dedicated Process Reward Models
Test-time scaling is the product superpower—give the model more inference compute, get better results, no retraining needed. Paid users get smarter AI, same model
Inference hardware is about to matter a lot more—Groq and Cerebras chips that seemed unnecessary for traditional LLMs become critical for search-based generation
Post-training scaling is the new frontier—pre-training is hitting diminishing returns; RL + search at inference time is where the next 10x improvements will come from

What's your experience with o1's reasoning? Have you tried self-play approaches on smaller models? Drop a comment—I'm especially curious if anyone's built a better verifier than my disaster of an attempt.

References I leaned on:

Cao Yu, "OpenAI o1 Self-Play RL Technical Roadmap Analysis"
Zhang Junlin, "Reverse-o1: OpenAI o1 Principle Reverse Engineering Illustrated"
Peking University Alignment Team, "o1 Opens the Post-Training Era: A New Paradigm for Reinforcement Learning"
ReST-MCTS* paper (Zhang et al., Tsinghua University, 2024)
STaR/Quiet-STaR papers (Stanford, 2022-2024)
Groq/Cerebras inference performance data (official technical whitepapers, Q3 2024)
MATH benchmark data (Hendrycks et al., UC Berkeley, 2021)
AIME 2024 results (OpenAI official blog, September 2024)

#ai #machinelearning #llm #openai #reinforcementlearning

Cael Lee
