The Night I Stayed Up Till 2:30 AM Reverse-Engineering OpenAI's o1 (And What It Actually Means for U
It was half two in the morning when I dug up Zhang Junlin's article on reverse-engineering o1.
o3-mini had just dropped, and I'm staring at my screen when it hits me — last September, when o1 first came out, I was in the middle of a code generation project that was absolutely kicking my arse. GPT-4o would spit out 200 lines of code, but somewhere around line 30 it'd get a variable name wrong, and then cheerfully write the remaining 170 lines based on that mistake. The result? A glorious mess of bugs that made no sense unless you traced back to that one dodgy assumption.
And here's the kicker — when you asked it to fix the bug, it'd just regenerate everything from scratch. Still capable of making the exact same type of mistake.
This is what Zhang calls "no take-backs on tokens." Once an LLM commits to a wrong turn, it has to invent 100 more errors just to make the first one look intentional. For the sake of logical consistency, it doubles down. This, right here, is a major source of LLM hallucinations.
o1 was built to solve exactly this.
The Hidden Chain of Thought (Or: What Happens in the Black Box)
o1 introduced something called Hidden COT — a concealed chain of thought. Before outputting its final answer, the model reasons, verifies, and self-corrects inside a "black box." You never see this process. You only get the cleaned-up result at the end.
Think about how you solve a difficult maths problem. You scribble on scratch paper for ages — crossing things out, trying different approaches, realising you went down the wrong path. What you finally write on the answer sheet is just the neat, logical progression. That's essentially what o1 does.
Honestly, this reminds me of Daniel Kahneman's Thinking, Fast and Slow. System 1 is rapid, intuitive thinking. System 2 is slow, deliberate, step-by-step reasoning. GPT-4o behaves more like System 1. o1? That's System 2 in action.
OpenAI themselves emphasised that o1 uses "reinforcement learning to generate Hidden COT," but beyond that one sentence? Technical details are practically non-existent. They're even more secretive than they were with Sora — and at least Sora got a rough architectural diagram.
How Zhang Reverse-Engineered It
Zhang mainly referenced AlphaZero's approach, attempting to figure out how LLMs and RL might be fused together. After reading through his reasoning, a few things stood out.
o1 probably isn't a single model
He speculates that o1 consists of three components: a main model, a summary model, and a pool of tree-search-related models whose quantity can be flexibly configured. These three work in concert.
How do you define the state space and action space for RL?
This might be the most critical question. Zhang argues that the state space is "the currently generated chain-of-thought content," while the action space is "the next reasoning step to generate." A Reward Model judges the quality of each step.
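To make that framing concrete, here's a minimal sketch of how I picture it. To be clear, this is my own toy illustration of Zhang's state/action definition, not anything OpenAI has published: `State`, `rollout`, `propose`, and `reward` are all hypothetical names, with `propose` standing in for the policy model and `reward` for the Reward Model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class State:
    """State = the problem plus the chain of thought generated so far."""
    problem: str
    steps: Tuple[str, ...] = ()

    def apply(self, action: str) -> "State":
        # Action = the next reasoning step, appended to the chain.
        return State(self.problem, self.steps + (action,))


def rollout(
    problem: str,
    propose: Callable[[State], List[str]],  # hypothetical policy: candidate next steps
    reward: Callable[[State, str], float],  # hypothetical reward model: score one step
    max_steps: int = 8,
) -> Tuple[State, List[float]]:
    """Greedy rollout: at each state, take the highest-scoring candidate step."""
    state = State(problem)
    rewards: List[float] = []
    for _ in range(max_steps):
        candidates = propose(state)
        if not candidates:  # the policy has nothing more to add
            break
        scored = [(reward(state, action), action) for action in candidates]
        best_score, best_action = max(scored, key=lambda pair: pair[0])
        rewards.append(best_score)
        state = state.apply(best_action)
    return state, rewards
```

A real system would sample steps from the model and learn from the collected rewards rather than act greedily, but the state and action definitions stay the same.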
About that Reward Model — I later read another article discussing three scoring approaches for Process Reward Models (PRMs):
- Min method: Takes the lowest score across all steps. One wrong step, and the entire logical chain potentially collapses.
- Last step method: Takes only the final step's score. Since the PRM evaluates with full context, the last step's score can reflect overall quality.
- Prod method: Multiplies all step scores together.
OpenAI experimented with prod and min in their paper Let's Verify Step by Step. DeepMind went with last step. There's no universally superior option — it depends entirely on how you design your training process.
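The three strategies are easy to compare side by side. A minimal sketch, assuming the PRM has already produced one score per step in the range 0 to 1 (the `aggregate` function and its `method` names are my own, not taken from either paper):

```python
import math
from typing import Sequence


def aggregate(step_scores: Sequence[float], method: str = "min") -> float:
    """Collapse per-step PRM scores into one score for the whole chain."""
    if not step_scores:
        raise ValueError("need at least one step score")
    if method == "min":
        return min(step_scores)        # the weakest step decides
    if method == "last":
        return step_scores[-1]         # trust the final step's score
    if method == "prod":
        return math.prod(step_scores)  # every step drags the total down
    raise ValueError(f"unknown method: {method!r}")


# A chain with one shaky step in the middle.
scores = [0.9, 0.2, 0.8]
for name in ("min", "last", "prod"):
    print(name, aggregate(scores, name))
```

For this chain, min returns 0.2 and last returns 0.8, while prod lands at roughly 0.144. Same chain, three quite different verdicts, which is exactly why the choice has to match how the PRM was trained.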
Absolutely fascinating stuff.
But here's what's been bugging me
If a PRM needs to score every single step, where on earth does the training data come from?
Mathematical problems are the most natural fit. A detailed solution has clearly defined reasoning steps, and the final answer is deterministic. You can use this data to train a model on single-step reasoning, or truncate at any point and let the model learn the next step. Wrong reasoning steps can even be used to construct error-correction data.
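Here's roughly what that truncation trick looks like in code. This is a sketch under my own assumptions: the solution is already split into steps, and the context is just the problem and prior steps joined with newlines. `next_step_pairs` and `correction_pair` are hypothetical helpers, and the correction format is one plausible choice among many.

```python
from typing import List, Tuple


def next_step_pairs(problem: str, steps: List[str]) -> List[Tuple[str, str]]:
    """Truncate a worked solution at every point: (context so far, next step)."""
    pairs = []
    for i, step in enumerate(steps):
        context = "\n".join([problem, *steps[:i]])
        pairs.append((context, step))
    return pairs


def correction_pair(
    problem: str, steps: List[str], wrong_index: int, wrong_step: str
) -> Tuple[str, str]:
    """Put a known-wrong step into the context; the target is the correct step."""
    context = "\n".join([problem, *steps[:wrong_index], wrong_step])
    return context, steps[wrong_index]


problem = "Solve 2x + 1 = 7"
steps = ["2x = 6", "x = 3"]
for context, target in next_step_pairs(problem, steps):
    print(repr(context), "->", repr(target))
```

A two-step solution yields two next-step examples, and every wrong step you can collect yields one more correction example on top.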
Code is trickier. LeetCode problems have solutions and verifiable test cases, but the format of those solutions differs significantly from how models actually think. Someone proposed reverse-generating thought steps from existing solutions, but that strategy still requires verifying the correctness of each generated step. I haven't figured out how to do that properly yet.
It probably requires a more granular verification mechanism.
The Numbers Are Impressive (But That's Not What Excites Me)
o1 scored 83% on the AIME, a qualifying exam for the International Mathematics Olympiad (IMO). GPT-4o? 13%.
On GPQA-diamond, a test evaluating expert-level knowledge in chemistry, physics, and biology, OpenAI reports that o1 became the first model to outperform PhD-level human experts. In Codeforces programming competitions, it landed in the 89th percentile.
But honestly? Those numbers aren't what get me excited.
What matters is that o1 brings self-reflection and error correction capabilities to large models. I cannot overstate how significant this is. Previous models generated mistakes and that was that. Now, a model can internally re-examine its own reasoning, scrap a flawed approach, and start over.
Let's Not Get Carried Away Though
Someone raised a sharp question: Can large models continuously improve through self-play? The keyword here is "continuously."
If improvement converges quickly, o1 might only raise the ceiling from "near human average" to a somewhat higher plateau — not enable unlimited growth. RL itself scales reasonably well, but whether every input component in the pipeline can scale... that's an open question.
Speaking of self-play, o1 is believed to use something similar to MCTS (Monte Carlo Tree Search) during inference: it explores multiple reasoning paths in parallel, scores them with the PRM, and continues down the most promising one.
The search strategy depends on problem difficulty: Beam Search for harder problems, Best-of-N for simpler ones.
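The difference between the two is easier to see in code. A minimal sketch, not o1's actual search: `expand` is a hypothetical stand-in for the main model proposing next steps, and `score` is a hypothetical stand-in for the PRM rating a partial or complete chain.

```python
from typing import Callable, List, Sequence, Tuple

Chain = Tuple[str, ...]  # a reasoning chain is a tuple of steps


def best_of_n(candidates: Sequence[Chain], score: Callable[[Chain], float]) -> Chain:
    """Generate N complete chains up front, then keep the best-scoring one."""
    return max(candidates, key=score)


def beam_search(
    expand: Callable[[Chain], List[str]],
    score: Callable[[Chain], float],
    beam_width: int = 2,
    max_depth: int = 3,
) -> Chain:
    """Grow chains step by step, keeping only the top beam_width at each depth."""
    beams: List[Chain] = [()]
    for _ in range(max_depth):
        grown: List[Chain] = []
        for chain in beams:
            steps = expand(chain)
            if steps:
                grown.extend(chain + (step,) for step in steps)
            else:
                grown.append(chain)  # a finished chain stays in the pool
        if grown == beams:  # nothing could be extended any further
            break
        beams = sorted(grown, key=score, reverse=True)[:beam_width]
    return max(beams, key=score)
```

Best-of-N scores only finished chains, so it's cheap to run but wastes effort on chains that went wrong early. Beam search prunes at every step, which helps on long, hard problems but leans much more heavily on the PRM scoring partial chains well.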
Here's something curious: when the PRM is trained well enough, more complex search methods can actually perform worse. Lookahead search, for instance. And when problems become exceptionally difficult, even test-time scaling laws have limited impact — you might need to go back to the pretraining stage, add more data, expand model size.
This brings us to a new Scaling Law: Inference Time Scaling. Previously, we only focused on training-stage scaling, meaning more parameters and more data. Now, compute during inference can be scaled too. o1 appears to allocate computational resources dynamically by adjusting search depth and breadth at runtime, though OpenAI hasn't confirmed the mechanism.
When o3-mini launched on 31 January 2025, it even offered three reasoning levels: low, medium, and high. Users can choose. Low is fast but rough. High is slow but precise.
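If you want a mental model for what a dial like that could control, here's a purely illustrative sketch. The names `SearchBudget`, `BUDGETS`, and `budget_for` are hypothetical, the numbers are made up, and none of this reflects how OpenAI actually implements the setting. It just ties an effort level to search breadth and depth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchBudget:
    beam_width: int  # how many reasoning paths are kept alive at once
    max_depth: int   # how many reasoning steps each path may take

    @property
    def max_expansions(self) -> int:
        # Upper bound on how many partial chains get expanded in one search.
        return self.beam_width * self.max_depth


# Illustrative numbers only; real systems would tune these empirically.
BUDGETS = {
    "low": SearchBudget(beam_width=1, max_depth=4),
    "medium": SearchBudget(beam_width=4, max_depth=8),
    "high": SearchBudget(beam_width=8, max_depth=16),
}


def budget_for(effort: str) -> SearchBudget:
    """Look up the search budget for an effort level, rejecting unknown levels."""
    try:
        return BUDGETS[effort]
    except KeyError:
        raise ValueError(f"unknown effort level: {effort!r}") from None
```

With these made-up numbers, high does 32 times the expansions of low, which is the "slow but precise" trade-off in a nutshell.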
My Real-World Test (And Where It Went Wrong)
Last Wednesday afternoon, I threw a complex SQL optimisation task at o1-mini. Seven layers of nested subqueries. A proper mess.
o1-mini "thought" for about 45 seconds, then produced a completely rewritten query. Execution time dropped from 23 seconds to 0.8 seconds. During its thinking process — though I couldn't see the details — it must have decomposed the query logic, identified index failures, and redesigned the JOIN order.
Brilliant.
But here's the plot twist. I ran the exact same task again later. This time it "thought" for nearly two minutes, and the solution it gave was actually worse. I suspect the search space was too large, the PRM's scoring went a bit wonky, and it pruned a path that was actually decent.
So no, o1 isn't magic.
A Quick Word on DeepSeek
DeepSeek's evolution from V1 to V2 has been following a similar trajectory, just without OpenAI's fanfare about "reasoning capabilities." V2 adopted a MoE architecture with extensive optimisations for inference efficiency. I'd wager they're working on something similar to Hidden COT internally — just not productised yet.
Can't blame them. This stuff burns through compute like mad.
o1-pro launched on 20 March 2025, priced at eye-watering levels. I haven't tested it myself, but from what I've seen in evaluations, it genuinely outperforms o1 on extremely complex reasoning tasks. Whether it's worth the cost depends entirely on your use case.
For medical diagnosis, drug discovery, high-precision code generation — scenarios where error tolerance is near zero — spending extra for peace of mind makes sense. For writing a weekly report or summarising meeting notes? GPT-4o will do just fine.
Seriously.
It's brilliant, but it's expensive. The model has learned to play chess against itself, but every move on that board burns real money. What do we do with that?
Key Takeaways:
- o1 uses Hidden COT to self-correct before outputting — this is the real breakthrough, not benchmark scores
- It's likely three components working together, not one monolithic model
- Inference Time Scaling is the new frontier — we can now scale compute at runtime, not just during training
- PRM scoring strategies (min, last step, prod) have real trade-offs — no silver bullet
- For everyday tasks, stick with cheaper models. o1 shines when errors are genuinely costly
What's your experience with o1 or similar reasoning models? Have you hit the same inconsistency issues I did? Drop a comment below — I'd love to hear war stories.
#AI #MachineLearning #OpenAI #DeepLearning #SoftwareEngineering
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.