OpenAI Reveals Its Training Details, but Holds Back the Most Valuable Part: the Data Recipe
Generated: 2026-06-21 06:51:32
---
OpenAI's First "Reveal" of Training Details? Let Me Be Real: They Gave You a Piece of Meat, but the Secret Broth Recipe Is Still Locked Away!
Guess what?
Late last night, a friend suddenly sent me a WeChat message: "How the hell is ChatGPT actually trained? I've been reading through all this material, and the more I read, the more confused I get!"
I was lying in bed scrolling through my phone when I saw that message, and I just laughed out loud. I replied: "This stuff — it's hard if you think it's hard, simple if you think it's simple. Just four steps: pre-training, supervised fine-tuning, reward modeling, reinforcement learning."
My friend was silent for five seconds, then sent back two words: "That's it?"
Me: "That's it."
Haha, you see — that's most people's reaction when they hear the answer. Four seemingly straightforward steps, doesn't sound like much. But if you actually buy that, you're being naïve.
---
Speaking of which, I have to bring up Andrej Karpathy's talk at Microsoft Build 2023
To be fair, that really was the first time OpenAI laid the training pipeline out on the table for everyone to see. But did you watch closely? They laid out the process, not the recipe.
I watched that video forwards and backwards three times! Then I hunted down the interviews with Altman and the GPT-4.5 team, reading them over and over again. Add to that all the pitfalls and landmines I've stepped on while tinkering with open-source models myself — today I've got to talk to you about the real truth hidden behind those PowerPoint slides.
---
The first counterintuitive truth: **99% of the compute goes to pre-training, but the most valuable part — they didn't say a single word about it**
In Andrej's flowchart, the pre-training block alone accounts for roughly 99% of the total training compute!
He used Meta's open-source Llama as an example and talked about data mixing ratios: CommonCrawl, C4, GitHub, Wikipedia… Sounds super clear, right?
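To make "data mixing ratios" concrete, here is a minimal sketch of what sampling from a weighted mix looks like. The percentages are the ones Meta published in the LLaMA paper (it also lists Books, ArXiv and StackExchange). The function name `sample_sources` and the fixed seed are my own illustration, not anything shown in the talk.

```python
import random
from collections import Counter

# Sampling proportions as reported in the LLaMA paper (percent of the mix).
# These are Meta's published numbers, not OpenAI's, which remain undisclosed.
LLAMA_MIX = {
    "CommonCrawl": 67.0,
    "C4": 15.0,
    "GitHub": 4.5,
    "Wikipedia": 4.5,
    "Books": 4.5,
    "ArXiv": 2.5,
    "StackExchange": 2.0,
}


def sample_sources(mix, n, seed=0):
    """Draw n source names with probability proportional to their weight."""
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


# Count how often each source is picked over 10,000 draws.
counts = Counter(sample_sources(LLAMA_MIX, 10_000))
```

The ratios themselves fit in a seven-line dictionary. Everything that decides what actually ends up inside each bucket is not in there.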
But here's the thing — what does OpenAI's own data mix look like?
No idea.
Andrej just casually remarked: "We don't do public scaling studies."
Hmm… think about it. Really think about it.
My guess? It's not that he didn't want to say — it's that saying it wouldn't matter. You can copy their ratios, but you can't copy their data cleaning pipeline or deduplication strategy. That whole system is the real black box!
I once fine-tuned a small model myself, and just dealing with the garbage data in Chinese corpora nearly cost me half my life. Typos, duplicate content, messy formatting… I wanted to smash my computer.
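For a sense of what even the most basic cleaning step looks like, here is a minimal sketch of exact-duplicate removal after normalization. It is the kind of thing I hacked together for my own small experiments, not anyone's production pipeline. The helper names `normalize` and `dedupe` are hypothetical, and real pipelines also need near-duplicate detection, which this does not attempt.

```python
import hashlib
import re
import unicodedata


def normalize(text):
    """Fold full-width/half-width and case variants, then collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def dedupe(docs):
    """Keep the first occurrence of each normalized document, preserving order."""
    seen = set()
    kept = []
    for doc in docs:
        key = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```

NFKC normalization matters a lot for Chinese text, where the same sentence often shows up once with full-width punctuation and once with half-width. This only catches copies that are identical after normalization. Paragraph-level reposts with small edits sail right through.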
In the GPT-4.5 interview, OpenAI mentioned a concept called the "data long-tail effect": getting the model to learn patterns that are rare but critical. Do you know how much sweat and money it took to grind that out? Two years, 100,000 GPUs, hundreds of people running experiments day and night!
So here's the first truth: The essence of pre-training isn't some algorithmic breakthrough — it's a marathon of data engineering and infrastructure! Andrej gave you a map, but at every critical fork in the road, he didn't say a word about which way to go.
---
The fine-tuning and RLHF part — you can steal a few tricks… but don't take it too seriously
When it came to the second stage — supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) — Andrej went into a bit more detail: how to collect human demonstration data, how to train a reward model, how to use PPO for RL optimization.
This did point the open-source community in the right direction! Alpaca, Vicuna and fine-tuned Qwen variants followed, and some people claimed to have "replicated" ChatGPT's results for just a few hundred dollars.
But seriously? I asked the same questions to Alpaca and GPT-3.5 myself — the gap was obvious to the naked eye!
Why such a big difference?
The answer in one word: data.
OpenAI has a massive amount of high-quality conversational data! This data comes from real users and from iterative collection during the RL process. Andrej mentioned that during the RLHF stage, they constantly generate multiple candidate responses and have humans rank them. The scale and quality of this dataset — you simply can't buy it on the open market!
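Here is a rough sketch of how "have humans rank candidates" turns into a training signal. I am assuming the pairwise ranking loss commonly described in published RLHF work, where the reward model is pushed to score the preferred response higher. The talk describes the ranking step but does not show code, so the function names and the scalar rewards below are purely illustrative.

```python
import math


def pairwise_ranking_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), in a numerically stable form."""
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(d)) == log(1 + exp(-d)); branch to avoid overflow in exp().
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def ranking_to_pairs(ranked):
    """Turn a best-to-worst ranking into (better, worse) comparison pairs."""
    return [
        (ranked[i], ranked[j])
        for i in range(len(ranked))
        for j in range(i + 1, len(ranked))
    ]
```

A single ranking of four candidates already yields six comparison pairs, which is why the ranking data piles up so fast. The loss is the easy part. The expensive part is the humans producing consistent rankings at scale.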
Altman also clearly stated in an interview: data efficiency is the next bottleneck; compute has taken a back seat.
In other words — even if you had 100,000 GPUs, without good data, it's useless!
And then there's PPO. Tuning hyperparameters for that thing is straight-up hell. Andrej said they use an internal tool, but he didn't tell you how many experimental runs it took to find settings that work. I tried tuning PPO once myself: the loss shot straight into the stratosphere, and it took me a full week just to barely stabilize it.
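If you have never looked at PPO, the core idea fits in a few lines. This is a minimal per-sample sketch of the clipped surrogate objective from the published PPO paper, not OpenAI's internal tooling. The `clip_eps=0.2` default is just a commonly used value, and a real setup adds a value loss, a KL penalty and batching on top.

```python
import math


def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate objective (a quantity to be maximized)."""
    # Probability ratio between the updated policy and the old policy.
    ratio = math.exp(logp_new - logp_old)
    # Limit how far a single update can move the policy.
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Take the more pessimistic of the clipped and unclipped estimates.
    return min(ratio * advantage, clipped_ratio * advantage)
```

The clipping is what keeps one bad batch from wrecking the policy. Get `clip_eps`, the learning rate or the KL coefficient slightly wrong and you find out very quickly.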
OpenAI's systems team — dozens of people, two years of grinding — you think you can master that from a single talk?
---
The pitfalls of a 100,000-GPU cluster: this is where OpenAI was most honest
If Andrej's talk was the spectacle for outsiders, then the 45-minute conversation between Altman and the GPT-4.5 team — that's the real substance for insiders.
For the first time, they admitted to the "catastrophic problems" of a 100,000-GPU training cluster!
A single hidden bug caused the system to fail frequently. And they didn't discover it until the training progress bar reached 40%!
40%, folks! That's like running a marathon and realizing halfway through that the grain of sand in your shoe is actually a rock!
The chief systems architect, Amin Tootoonchian, put it this way: every single accelerator, the network, and every other component has to work as expected at the same time. When you scale a cluster from 10,000 GPUs to 100,000 GPUs, failures that used to be once-in-a-blue-moon events become daily certainties.
Think about it: with 10,000 GPUs, maybe one small issue a week. With 100,000 GPUs, every day brings a new surprise!
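The arithmetic behind that is simple enough to sketch. The per-GPU daily failure probability below is a made-up number, picked only so that 10,000 GPUs average about one failure a week, to match my rough framing above. It is not a measured rate from OpenAI or anyone else, and it assumes failures are independent.

```python
def expected_failures_per_day(num_gpus, p_fail_per_gpu_day):
    """Average number of failing GPUs per day, assuming independent failures."""
    return num_gpus * p_fail_per_gpu_day


def prob_any_failure(num_gpus, p_fail_per_gpu_day):
    """Probability that at least one GPU fails on a given day."""
    return 1.0 - (1.0 - p_fail_per_gpu_day) ** num_gpus


# Hypothetical rate: one failure per 70,000 GPU-days.
P_FAIL = 1 / 70_000

for n in (10_000, 100_000):
    per_day = expected_failures_per_day(n, P_FAIL)
    any_fail = prob_any_failure(n, P_FAIL)
    print(f"{n:>7} GPUs: {per_day:.2f} failures/day, P(any failure today) = {any_fail:.0%}")
```

Same hardware, same failure rate per GPU. The only thing that changed is the number of GPUs, and the weekly hiccup turns into more than one failure every single day.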
I've read some AWS whitepapers, so I know how hard large-scale distributed training is in terms of network topology and fault tolerance. But OpenAI's confession still shocked me.
The birth of GPT-4.5 wasn't just an algorithmic victory — it was an engineering miracle by the systems team. They "repaired while training," running the entire training process over two months. That capability alone is a mountain others can't climb.
So do you get it now? The so-called 'barrier to entry' for large models — compute is just the ticket. Whether you can treat 100,000 GPUs as a single computer — that's the true dividing line.
---
I want to respond to one point of view
Some might say: "Andrej's talk was already very transparent! He even revealed the training objective functions for all four stages. What more do you want?"
I admit, compared to the old CloseAI that guarded technical details like secrets, the new OpenAI has definitely improved. But friend, you need to understand the difference between a "
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.