LLM Training: Pre-training (English)

Generated: 2026-06-21 02:48:29

---

Stop Believing the Lie That "Only Big Tech Can Do Pre-training"

Let me tell you a story about the pitfalls I fell into.

Last year, I took on a project in the medical vertical. The client's requirement: use an open-source model to generate structured diagnostic reports that were accurate, professional, and production-ready. Without a second thought, I grabbed a popular base model and ran instruction fine-tuning on it.

Guess what happened?

The model started wildly fabricating drug names.

Concoctions like "metformin combined with chlorpromazine for diabetes" — I've never seen that combo in any pharmacology textbook in my life. The whole team was dumbfounded.

I spent the next three weeks trying to reverse-engineer the model's training data, desperate to figure out which corpus it had seen these nonsense drugs in. Result? No way to check. The data source wasn't public at all.

That's when it hit me: if I had done the pre-training myself, even if only on a few hundred billion tokens, I would have known exactly where the model saw that drug combination, down to which book and which paper.

---

The Real "Black Magic" Has Never Been the Model Architecture

These days, people in the field keep shouting: "Pre-training is already monopolized by big tech. Small teams might as well give up."

Let me be straight with you: That claim is both stupid and lazy.

Think about it. If you don't even know what's in your own training data, and you just take someone else's pre-trained model and fine-tune it with SFT or RLHF, then when things go wrong you have no idea where to look. Isn't that like driving a race car that someone else modified, and when it breaks down, you don't even know where the throttle adjustment is?

Speaking of which, here's a bigger bombshell: Data cleaning is the real "black magic."

A lot of people think the core of pre-training is the model architecture or the parallelization strategy. Bullshit. Feed a pile of garbage into Megatron, and what comes out is garbage. Data cleaning is what determines whether your LLM is a straight-A student or a parrot on repeat.

Back then, I referenced the approaches from RedPajama and FineWeb and distilled them into four steps. Every step came from blood and tears.

Start with data extraction. EPUB, PDF, HTML: the formats are all over the place. I tried pypdf and trafilatura, and the tables in PDFs would often come out garbled. The hard-learned lesson? You must do line-level validation after extraction. If a document has too many short lines, the way chat logs do, throw it out. Otherwise, the model will learn a bunch of carriage returns and gibberish.
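Here is a minimal sketch of that line-level check. The names `short_line_ratio` and `keep_document`, and the thresholds `min_chars=20` and `max_short_ratio=0.5`, are my own illustrative assumptions, not values from the original pipeline. Tune them on your own corpus.

```python
def short_line_ratio(text: str, min_chars: int = 20) -> float:
    """Fraction of non-empty lines shorter than min_chars characters."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 1.0  # treat an empty extraction as all-short so it gets dropped
    short = sum(1 for line in lines if len(line) < min_chars)
    return short / len(lines)


def keep_document(text: str, min_chars: int = 20, max_short_ratio: float = 0.5) -> bool:
    """Keep a document only if short lines do not dominate it."""
    return short_line_ratio(text, min_chars) <= max_short_ratio


chat_log = "hi\nhello\nu there?\nyes\nok bye"
article = (
    "Metformin is a first-line medication for type 2 diabetes.\n"
    "It lowers glucose production in the liver and improves insulin sensitivity.\n"
    "See also:"
)
print(keep_document(chat_log))  # chat-style text: every line is short
print(keep_document(article))   # mostly long lines, one short trailer
```

A character threshold is crude for Chinese text, where a 20-character line already carries a full sentence, so the cutoff probably needs to be set per language.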

Then run heuristic filtering. In the rough screening phase, look at a few metrics: stopword ratio, special character ratio, word/phrase repetition rate. What was the most bizarre pitfall I fell into? Some novel websites we crawled — every chapter ending had a sentence repeated twenty times. We didn't notice at first. As training went on, the model started cycling through "To find out what happens next, listen to the next chapter."

Can you believe it? It became a broken record!

We quickly added an n-gram repetition filter with a window size of 5, and any document with a repetition rate over 0.8 got tossed. That saved us.
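The rough-screening metrics can be sketched roughly like this. The tiny English `STOPWORDS` set, the regex tokenizer, and the exact definition of "repetition rate" (share of 5-grams that repeat an earlier 5-gram) are assumptions on my part. The original filter may have defined them differently.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "that", "for", "on", "with"}


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def stopword_ratio(text):
    """Share of words that are stopwords; very low values often mean non-prose."""
    words = tokenize(text)
    return sum(w in STOPWORDS for w in words) / len(words) if words else 0.0


def special_char_ratio(text):
    """Share of non-whitespace characters that are neither letters nor digits."""
    chars = [c for c in text if not c.isspace()]
    return sum(not c.isalnum() for c in chars) / len(chars) if chars else 0.0


def ngram_repetition_rate(text, n=5):
    """Share of word n-grams that repeat an n-gram already seen in the document."""
    words = tokenize(text)
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


looped = "to find out what happens next listen to the next chapter " * 20
print(ngram_repetition_rate(looped) > 0.8)  # the broken-record case gets flagged
```

For Chinese text, a word segmenter or character n-grams would have to replace the whitespace-style tokenizer used here.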

The third step is fine-grained screening. Here, I strongly recommend first training a small language model to score perplexity (PPL). KenLM works fine. Run it on all the data, and throw out anything with an abnormally high PPL. Guess what? A lot of garbage documents full of gibberish had PPL values in the thousands. Then score your data from 0 to 5 with a BERT-based quality classifier like the one behind FineWeb-Edu. Anything below 2 doesn't deserve to be in the training set.
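To show the shape of this two-stage filter without depending on a trained KenLM model or a real classifier, the sketch below swaps in a toy add-one-smoothed unigram model for the PPL scorer and takes the quality scorer as a plain callable. `train_unigram_ppl`, `filter_corpus`, and the `max_ppl` value are hypothetical names and numbers. In a real run, `ppl_fn` would wrap your KenLM model and `quality_fn` your classifier.

```python
import math
from collections import Counter


def train_unigram_ppl(reference_texts):
    """Toy stand-in for a KenLM model: add-one smoothed unigram perplexity."""
    counts = Counter(w for t in reference_texts for w in t.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra bucket for unseen words

    def perplexity(text):
        words = text.lower().split()
        if not words:
            return float("inf")
        log_prob = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
        return math.exp(-log_prob / len(words))

    return perplexity


def filter_corpus(docs, ppl_fn, quality_fn, max_ppl, min_quality=2):
    """Drop documents with abnormally high PPL or a quality score below min_quality."""
    kept = []
    for doc in docs:
        if ppl_fn(doc) > max_ppl:
            continue
        if quality_fn(doc) < min_quality:
            continue
        kept.append(doc)
    return kept


reference = [
    "the patient was given metformin for diabetes",
    "the doctor reviewed the patient report",
]
ppl = train_unigram_ppl(reference)
docs = ["the patient was given metformin", "xq zzv kkp qwj ppl0"]
# quality_fn is a constant placeholder here; a real one would return 0 to 5.
print(filter_corpus(docs, ppl, quality_fn=lambda d: 3, max_ppl=15))
```

A fixed `max_ppl` is the simplest choice. Cutting at a percentile of the PPL distribution per data source is another option, since forum text and papers sit at very different baselines.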

If you don't do this, your model won't be able to tell the difference between People's Daily and some clickbait account.

Finally, safety filtering. Don't think this step is just about regulatory compliance. I tested it: if just 1% of inappropriate content sneaks in, the model will inexplicably generate cringey text in some domain. Use keyword filtering plus a small model as a double safeguard.
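The double safeguard could look something like this. `BLOCKLIST` holds placeholder terms, and `unsafe_prob` stands for whatever small classifier you use. Both are hypothetical, as is the 0.5 threshold.

```python
import re

BLOCKLIST = {"badword", "slur"}  # hypothetical placeholder terms


def keyword_hit(text, blocklist=BLOCKLIST):
    """True if any whole word in the text appears in the blocklist."""
    words = set(re.findall(r"\w+", text.lower()))
    return not words.isdisjoint(blocklist)


def is_safe(text, unsafe_prob, threshold=0.5, blocklist=BLOCKLIST):
    """Reject on a keyword hit OR a high score from the small model."""
    if keyword_hit(text, blocklist):
        return False
    return unsafe_prob(text) < threshold


# unsafe_prob is a stub here; a real one would call the small classifier.
print(is_safe("a harmless sentence about diabetes", unsafe_prob=lambda t: 0.1))
```

The keyword pass is cheap, so it runs first, and the model only scores text that survives it.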

And then there's deduplication. I used MinHash + LSH with 6 permutations and a threshold of 0.7. After running it, I found that the Chinese forum data had a duplication rate of up to 40%.

The same question, copy-pasted across Tieba, Zhihu, and Baidu Zhidao — the model just learned "copy and paste."
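For readers who haven't seen MinHash + LSH before, here is a dependency-free sketch of the idea. It is not the code I ran: `num_perm=128` and `bands=32` are illustrative defaults rather than my settings, seeded BLAKE2 hashes stand in for random permutations, and word 3-gram shingles are an assumption. Only the 0.7 threshold matches the text above.

```python
import hashlib
from collections import defaultdict


def shingles(text, n=3):
    """Set of word n-grams; very short texts collapse to a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(items, num_perm=128):
    """One minimum per seeded hash function; seeds stand in for permutations."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return sig


def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def find_near_duplicates(docs, num_perm=128, bands=32, threshold=0.7):
    """Return index pairs (i, j), i < j, whose estimated Jaccard >= threshold."""
    rows = num_perm // bands
    sigs = [minhash_signature(shingles(d), num_perm) for d in docs]
    buckets = defaultdict(list)
    for idx, sig in enumerate(sigs):
        for b in range(bands):
            # Documents sharing any full band land in the same bucket.
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(idx)
    candidates = set()
    for members in buckets.values():
        for x in range(len(members)):
            for y in range(x + 1, len(members)):
                candidates.add((members[x], members[y]))
    return sorted(
        p for p in candidates
        if estimated_jaccard(sigs[p[0]], sigs[p[1]]) >= threshold
    )


posts = [
    "how do I lower my blood sugar without medication please help",
    "how do I lower my blood sugar without medication please help",
    "what is the best way to learn pipeline parallelism from scratch",
]
print(find_near_duplicates(posts))
```

At real scale you would use an optimized library instead of pure Python. The banding step is what keeps the comparison from being quadratic in the number of documents.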

---

The Tricks of Training Are Hidden in These Details

Okay, data is sorted. What about training?

First, a quick review. The goal of pre-training is just one thing: next-token prediction, trained with a cross-entropy loss.

I copied this table from a Zhihu article and stuck it on my wall:

Loss 0.0 → PPL 1.0 (perfect, impossible)
Loss 1.0 → PPL 2.7 (picking between two or three choices)
Loss 3.0 → PPL 20.1 (picking from 20 candidates)
Loss 10.0 → PPL 22026 (completely guessing)

See what the table is telling you? When training a large model, dropping Loss by 0.5 is a massive improvement.
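The table is just PPL = exp(Loss), with the loss measured in nats per token. A few lines reproduce it and show why a 0.5 drop matters: it multiplies PPL by exp(-0.5), roughly 0.61, wherever you start.

```python
import math


def perplexity(loss: float) -> float:
    """Perplexity for a cross-entropy loss measured in nats per token."""
    return math.exp(loss)


for loss in (0.0, 1.0, 3.0, 10.0):
    print(f"Loss {loss:4.1f} -> PPL {perplexity(loss):.1f}")

# Ratio of perplexities when the loss falls from 2.5 to 2.0.
print(perplexity(2.0) / perplexity(2.5))
```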

For a 7B model I worked on, going from Loss 2.5 to 2.0 required an additional 100B tokens of data. So don't listen to people who boast "I trained a SOTA in a week" — they're probably using data distilled from ChatGPT to cheat on PPL, and it's not reliable.

But what really trips up many teams?

Data concatenation.

Most pre-training randomly concatenates multiple documents into a long sequence, with an EOS token between them. The problem? The noisy co-occurrence of two unrelated documents makes the model learn false correlations.

For example, document A talks about "apples are delicious," and document B talks about "Apple's stock price." The model might learn a magical connection like "delicious → stock price."
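To make the setup concrete, here is a sketch of that kind of packing. `pack_documents` is a hypothetical helper. Real pipelines usually shuffle the documents first and may handle the leftover tail differently. Besides the token rows, it returns a parallel `doc_ids` row that records which document each position came from.

```python
def pack_documents(docs, seq_len, eos_id):
    """Concatenate tokenized docs with EOS separators and cut fixed-length rows.

    Returns (sequences, doc_ids); doc_ids[r][t] is the index of the source
    document for token t of row r. The EOS token is counted with the document
    it terminates.
    """
    stream, ids = [], []
    for doc_index, tokens in enumerate(docs):
        stream.extend(tokens + [eos_id])
        ids.extend([doc_index] * (len(tokens) + 1))
    n_full = len(stream) // seq_len  # drop the ragged tail for simplicity
    sequences = [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]
    doc_ids = [ids[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]
    return sequences, doc_ids


docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
sequences, doc_ids = pack_documents(docs, seq_len=4, eos_id=0)
print(sequences)
print(doc_ids)
```

The second row mixes the end of one document with the start of the next, which is exactly where the false correlations come from.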

I tried a segment causal mask: add attention masks at document boundaries so the model only looks at tokens within the current document. The code wasn't that complicated, and in practice it improved few-shot ICL by 1.6%.
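The masking rule itself fits in a few lines. This is a plain-Python sketch for one packed sequence, assuming you already have a per-token list of document indices (called `doc_ids` here, a name I'm introducing). In a training loop you would build the same boolean matrix as a tensor and pass it to the attention call.

```python
def segment_causal_mask(doc_ids):
    """mask[i][j] is True when query position i may attend to key position j.

    A position may attend only to itself and earlier positions (causal),
    and only within its own document (segment).
    """
    n = len(doc_ids)
    return [
        [j <= i and doc_ids[j] == doc_ids[i] for j in range(n)]
        for i in range(n)
    ]


mask = segment_causal_mask([0, 0, 1, 1, 1])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Printed as a grid, the mask is block-diagonal and lower-triangular: each document gets its own causal triangle and nothing crosses the boundary.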

But there's a side effect: if you use relative position encodings like RoPE or ALiBi, the cross-document position information gets wiped out by the mask, and the model essentially can't learn inter-sentence positional relationships. I still haven't fully solved this issue. My current approach is to keep a small finite bias instead of -inf when masking, so the softmax doesn't kill the gradient entirely.
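A sketch of the finite-bias workaround, under my own assumptions: `segment_attention_bias` is a hypothetical helper, and `cross_doc_bias=-4.0` is an arbitrary illustrative value, not a tuned one. Future positions still get -inf so causality is preserved. Only earlier tokens from other documents get the finite penalty.

```python
import math


def segment_attention_bias(doc_ids, cross_doc_bias=-4.0):
    """Additive attention bias: 0 within a document, a finite penalty across
    documents, and -inf for future positions."""
    n = len(doc_ids)
    bias = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j > i:
                bias[i][j] = -math.inf
            elif doc_ids[j] != doc_ids[i]:
                bias[i][j] = cross_doc_bias
    return bias


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


bias = segment_attention_bias([0, 0, 1])
# With all raw attention scores equal to zero, the bias alone sets the weights.
print(softmax(bias[2]))
```

The last token still puts most of its weight on its own document, but the earlier document keeps a small nonzero share, so gradient can flow through those positions.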

---

Don't Let Megatron Intimidate You

Speaking of training frameworks, you've definitely heard of Megatron-LM.

Yes, it's the industry standard, supporting model parallelism, pipeline parallelism, and data parallelism — the whole trio. But honestly — for individuals or small teams starting a new project, the cost of getting into Megatron might be bigger than actually training a model.

Early on, I ran an experiment: using the PyTorch implementation from llama2.c, training a 110M parameter model on a single GPU with 10B tokens of data. It took about two weeks (24GB VRAM, batch size just barely acceptable). It was slow, but it let me run through the entire pipeline — data cleaning, tokenization, training loop, decontamination validation, PPL analysis.

In that process, my understanding of preprocessing, gradient accumulation, and learning rate annealing went deeper than if I'd read ten papers.

Once you've run a full pipeline, switching to Megatron becomes smooth. If you jump straight into Megatron with 128 GPUs without that experience, you won't even be able to tell Tensor Parallelism from Sequence Parallelism.

Also, one

Cael Lee
