I Spent 6 Months Trying to Build OpenAI's Secret Sauce. Here's Why It Didn't Work
Look, I have a confession to make. I've been obsessed with Process Reward Models (PRMs) for the better part of a year now. Ever since OpenAI dropped o1 last September, I've been that guy—the one who won't shut up about reproducing it at all costs. The number of rabbit holes I've fallen into? Honestly, I've lost count.
It started with what I thought was a brilliantly simple idea.
The "Just Score Every Step" Fantasy
Here's what I figured: if you grade each step of a reasoning chain, the model learns which steps are right and which ones are wrong. Simple, right? Think about a classic math word problem—like the one where a farmer's ducks lay 16 eggs a day, the family eats 3 for breakfast, uses 4 for pancakes, and sells the rest at $2 each. The correct reasoning is straightforward: 16 - 3 = 13, then 13 - 4 = 9, then 9 × 2 = $18.
But sometimes the model spits out $17. Maddening.
My theory? It's messing up somewhere in the middle. So if I just label which step went wrong and train a model to catch those mistakes... problem solved, right?
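To make that idea concrete, here's a minimal Python sketch of what "label which step went wrong" means for the duck-egg problem. The `Step` tuple and the `first_wrong_step` helper are hypothetical names I made up for illustration, and the faulty chain is just one assumed way to land on $17. A real PRM has to learn this judgment from text; it can't simply recompute the arithmetic like this toy does.

```python
import operator
from typing import NamedTuple, Optional


class Step(NamedTuple):
    left: float
    op: str
    right: float
    claimed: float  # the result the model wrote down for this step


OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def first_wrong_step(chain: list[Step]) -> Optional[int]:
    """Return the index of the first step whose claimed result is wrong, else None."""
    for i, step in enumerate(chain):
        if OPS[step.op](step.left, step.right) != step.claimed:
            return i
    return None


correct = [Step(16, "-", 3, 13), Step(13, "-", 4, 9), Step(9, "*", 2, 18)]
# Assumed failure mode: a slip in the final multiplication gives $17.
faulty = [Step(16, "-", 3, 13), Step(13, "-", 4, 9), Step(9, "*", 2, 17)]

print(first_wrong_step(correct))  # None
print(first_wrong_step(faulty))   # 2
```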
Nope.
That first approach buried me. Completely.
The First Disaster: Teaching Silence
I spent two solid weeks annotating 800 math problems, carefully breaking down each reasoning chain, finding the first error, and slapping a negative label on it. Feeling pretty good about myself, I trained a PRM and ran it on the test set.
67% accuracy.
I was stunned. Two weeks of work for barely-better-than-random performance on hard problems. What went wrong?
Then it hit me—I'd only labeled up to the first mistake. The model never learned what a good step looked like. It only knew when to yell "WRONG!" Imagine teaching a kid math by only speaking up when they screw up, staying dead silent when they get something right. What would they actually learn? Nothing useful.
The labeling strategy itself was fundamentally broken.
Round Two: The Generalization Trap
So I pivoted. Following OpenAI's "Let's Verify Step by Step" paper more closely, I re-annotated everything. Every single step got a label: positive for correct ones, negative for errors. Another 800 examples. More sleepless nights.
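The difference between my two annotation rounds is easier to see side by side. This is a rough Python sketch, not my actual annotation tooling: the function names are hypothetical, and I'm assuming each chain has already been reduced to a list of "was this step correct?" booleans.

```python
from typing import Optional


def label_first_error_only(step_ok: list[bool]) -> list[Optional[int]]:
    """Round one: only the first wrong step gets a label (0); the rest stay unlabeled."""
    labels: list[Optional[int]] = [None] * len(step_ok)
    for i, ok in enumerate(step_ok):
        if not ok:
            labels[i] = 0
            break
    return labels


def label_every_step(step_ok: list[bool]) -> list[Optional[int]]:
    """Round two: every step gets 1 (correct) or 0 (incorrect)."""
    return [1 if ok else 0 for ok in step_ok]


chain = [True, True, False, False]
print(label_first_error_only(chain))  # [None, None, 0, None]
print(label_every_step(chain))        # [1, 1, 0, 0]

# A fully correct chain yields no training signal at all under round one.
print(label_first_error_only([True, True, True]))  # [None, None, None]
```

Under the first scheme the model never sees a single positive label, which is exactly the "teaching silence" problem.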
This time? 74% accuracy. Progress.
But the real problem was hiding in plain sight: the model couldn't generalize. At all. On math problems, it was decent. Switch to logical reasoning or—god forbid—code debugging, and the scores became essentially random. I'd spent hundreds of dollars on annotation costs training a specialist that was useless outside its tiny comfort zone.
I wanted to throw my keyboard across the room. Seriously.
There are papers comparing PRM vs. Outcome Reward Model (ORM) generalization, and technically PRM does come out ahead. But not by much. And honestly? I'm not convinced those academic datasets resemble the messy, real-world problems I was throwing at my model. You know what I mean?
The Hidden Pitfall: Speed Kills
The third problem was sneakier.
I started combining my PRM with search algorithms—the whole Monte Carlo Tree Search (MCTS) pipeline. On paper, it's beautiful: the PRM guides the search tree, scoring candidates at each node, pruning low-quality branches. I read Tsinghua's ReST-MCTS paper three times. I could sketch the architecture diagram from memory.
But in practice? Glacial. Painfully, unusably slow.
For a simple three-step reasoning problem, here's what happened: the policy model generated 10 candidates per node, the PRM scored all of them, the search kept the top 3, expanded those, scored again... and then had to pick the highest-quality complete path. Total time per problem: 20 seconds.
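To see where the time goes, here's a toy Python sketch of that loop that just counts scoring calls. Fair warning: this is a plain beam-style expansion, not full MCTS (no rollouts, no backed-up values), and `toy_propose` and `toy_score` are made-up stand-ins for the policy model and the PRM. Only the numbers 10, 3, and 3 come from my setup.

```python
import heapq
from typing import Callable


def prm_guided_search(
    propose: Callable[[list[str], int], list[str]],
    score: Callable[[list[str]], float],
    depth: int = 3,
    candidates_per_node: int = 10,
    beam_width: int = 3,
) -> tuple[list[str], int]:
    """Expand step by step, keeping the top `beam_width` partial paths.

    Returns the best complete path and the number of scoring calls made.
    """
    beam: list[list[str]] = [[]]
    prm_calls = 0
    for _ in range(depth):
        scored: list[tuple[float, list[str]]] = []
        for path in beam:
            for step in propose(path, candidates_per_node):
                new_path = path + [step]
                scored.append((score(new_path), new_path))
                prm_calls += 1
        top = heapq.nlargest(beam_width, scored, key=lambda item: item[0])
        beam = [path for _, path in top]
    return beam[0], prm_calls


def toy_propose(path: list[str], n: int) -> list[str]:
    # Stand-in for the policy model: n labeled candidates for the next step.
    return [f"s{len(path)}c{i}" for i in range(n)]


def toy_score(path: list[str]) -> float:
    # Stand-in for the PRM: pretend candidate 7 is always the best next step.
    return -float(sum(abs(int(step.split("c")[1]) - 7) for step in path))


best_path, calls = prm_guided_search(toy_propose, toy_score)
print(best_path)  # ['s0c7', 's1c7', 's2c7']
print(calls)      # 70
```

That's 10 + 30 + 30 = 70 scoring passes, plus 70 generated steps, for a three-step problem. With real models behind both functions, it adds up fast.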
Twenty seconds. For a problem a fifth-grader could solve in five.
And then DeepSeek R1 dropped.
No PRM. No fancy step-by-step verification. Pure reinforcement learning. The R1-Zero variant's AIME 2024 pass@1 accuracy climbed from 15.6% to 71.0% over the course of RL training. I remember that day vividly: January 2025, Beijing was -12°C outside, and my coffee went cold while I stared at those results. Completely cold. Just like my enthusiasm for PRMs at that moment.
So What Was I Even Doing for Six Months?
After the initial shock wore off, I realized something: PRMs aren't useless. I was just using them wrong.
A PRM is fundamentally a pruning tool. It shines when search costs are astronomical—complex multi-step reasoning, problems where the solution space is massive. For simple problems, best-of-N sampling works fine. Medium difficulty? Beam search might be your sweet spot. It's only when problems get genuinely hard, with long reasoning chains and enormous search spaces, that the fine-grained feedback of a PRM actually pays off.
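For the "simple problems" end of that spectrum, best-of-N with a verifier is only a few lines. Here's a hedged Python sketch: the per-step scores are invented numbers, `best_of_n` and `aggregate` are hypothetical helpers, and taking the minimum or the product of step scores are two common ways to collapse them into one number per sample, not the only ones.

```python
import math


def aggregate(step_scores: list[float], how: str = "min") -> float:
    """Collapse per-step PRM scores into one score for the whole solution."""
    if how == "min":
        return min(step_scores)
    if how == "prod":
        return math.prod(step_scores)
    raise ValueError(f"unknown aggregation: {how}")


def best_of_n(samples: list[tuple[str, list[float]]], how: str = "min") -> str:
    """Return the final answer of the sample with the highest aggregated score."""
    answer, _ = max(samples, key=lambda sample: aggregate(sample[1], how))
    return answer


# Three sampled solutions to the duck-egg problem with made-up step scores.
samples = [
    ("$17", [0.9, 0.9, 0.2]),
    ("$18", [0.9, 0.8, 0.9]),
    ("$26", [0.3, 0.9, 0.9]),
]
print(best_of_n(samples, how="min"))   # $18
print(best_of_n(samples, how="prod"))  # $18
```

No tree, no expansion, just N full samples and one scoring pass each.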
This reminds me of two camps I've seen in competitive programming: the "search solves everything" people and the "graph theory solves everything" people. PRM is a pruning knife for the search crowd. But if your problem doesn't need searching in the first place, that knife is just dead weight. Expensive, slow dead weight.
The Plot Twist: Maybe PRM Is Just Training Wheels
Here's something interesting I noticed.
DeepSeek didn't use a PRM at all, but their pure-RL model, R1-Zero, had a weird problem: strong reasoning capabilities, sure, but its outputs were hard to read and often mixed languages, Chinese and English in the same answer. For R1 itself, they added a cold-start supervised fine-tuning stage, built on thousands of curated long chain-of-thought examples, to make the model coherent before unleashing RL.
That got me thinking. Maybe PRMs serve a similar purpose. They're not the final answer—they're a crutch to get you through the unstable early training phase. Once the model internalizes the reasoning patterns, you can probably toss the crutch aside.
At least... I think that's what's going on? DeepSeek's paper is pretty coy about the details. Big lab papers always hide the interesting bits, don't they?
What I'd Tell My Past Self (and You)
After all these bruises, my relationship with PRMs has changed. They're not a silver bullet—they're a specialized wrench in the toolbox. If you're going down this road, here's what I wish someone had told me:
Don't annotate 800 examples on day one. Run a 100-example pilot first. I was way too eager and burned two weeks labeling data in the wrong direction. That feeling? Let's just say it wasn't great.
Train on all tokens in each step, not just the last one. When I made this switch, accuracy jumped 3-5 percentage points. Papers rarely mention this detail, but it matters. A lot.
Verify you actually need a PRM. Simple problems? Best-of-N is plenty. Only reach for PRM+MCTS when your problems are so hard that vanilla models completely fail. Don't be like me—bringing a chainsaw to slice a tomato and then complaining about the mess.
Watch the DeepSeek approach closely. If pure RL can get us there, maybe we don't need PRMs at all. The jury's still out, but the evidence is... intriguing.
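Since the "train on all tokens in each step" tip is the one papers tend to skip, here's roughly what it means at the target-building level. This is a sketch under assumptions: steps are already tokenized, `step_targets` is a hypothetical helper, and `-100` is just a "no loss here" marker (it mirrors the default `ignore_index` of PyTorch's `CrossEntropyLoss`, but nothing below depends on PyTorch).

```python
IGNORE = -100  # positions with this target contribute no loss


def step_targets(
    step_token_counts: list[int],
    step_labels: list[int],
    all_tokens: bool,
) -> list[int]:
    """Build one target per token from per-step labels (1 = correct, 0 = wrong)."""
    if len(step_token_counts) != len(step_labels):
        raise ValueError("need exactly one label per step")
    targets: list[int] = []
    for n_tokens, label in zip(step_token_counts, step_labels):
        if n_tokens < 1:
            raise ValueError("every step needs at least one token")
        if all_tokens:
            # Every token in the step carries the step's label.
            targets.extend([label] * n_tokens)
        else:
            # Only the last token of the step carries the label.
            targets.extend([IGNORE] * (n_tokens - 1) + [label])
    return targets


counts = [3, 2]  # a 3-token step followed by a 2-token step
labels = [1, 0]  # first step correct, second step wrong
print(step_targets(counts, labels, all_tokens=False))  # [-100, -100, 1, -100, 0]
print(step_targets(counts, labels, all_tokens=True))   # [1, 1, 1, 0, 0]
```

Same data, but the second layout gives the model a supervision signal at every position instead of one per step.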
TL;DR
- PRMs sound elegant but are brutally hard to implement well
- My first approach (labeling only errors) failed spectacularly—67% accuracy
- Second approach (labeling all steps) hit 74% but couldn't generalize beyond math
- PRM+MCTS is painfully slow for simple problems (20 seconds per problem in my tests)
- DeepSeek R1-Zero reached 71.0% pass@1 on AIME 2024 with zero PRM, just pure RL
- PRMs are pruning tools, not magic wands—only useful when search costs are genuinely high
- They might just be training wheels for the unstable early phase of RL training
So yeah, I basically spent six months learning when not to use a tool. But honestly? That's not nothing. At least now I know that some tools aren't bad—you just have to pick the right moment. Like using a chef's knife instead of a screwdriver. Sure, you could make it work, but why would you?
What's been your experience with PRMs or alternative approaches? Drop a comment—I'd love to hear if anyone else has gone down this particular rabbit hole. Misery loves company, right?
#AI #MachineLearning #DeepLearning #OpenAI #Reasoning #PRM
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.