NLP Algorithm Interview Essentials! PTMs: NLP Pre-trained Models
Generated: 2026-06-20 21:00:00
---
Recently, I interviewed at a search engine company. The interviewer started by asking me to draw the Transformer architecture. When I finished, he said, "Tell me about pre-trained models, from word2vec to BERT. What do you think?" I laughed inside: he wanted an impromptu lecture! But honestly, over the years I've made plenty of mistakes, from training word vectors from scratch to fine-tuning various pre-trained models for downstream tasks, and I've accumulated quite a bit of hard-won experience. Today I'm going to lay it all out: practical, actionable stuff. I'll call out what works and what doesn't, and say what I'd recommend and what I'd steer you away from. It's useful for interviews, and even more useful for real work.
---
1. What's the Point of Pre-training? Let's Cut the Fluff
Back in 2015, when I first got into NLP, there were no "pre-trained models" in today's sense. For text classification, you either trained your own word vectors on your corpus or downloaded the Google News word2vec vectors: 300 dimensions, trained on about 100 billion words, a several-gigabyte binary file that took forever to download. For downstream tasks, you had to engineer features by hand: part-of-speech tags, named entities, dependency parses... you'd throw the whole kitchen sink at it, and the results still felt like buying a lottery ticket. Why? Because apart from the embedding layer, every model parameter was randomly initialized, so if your dataset was small and your model even slightly deep, it would overfit almost immediately. You know that hopeless feeling, right?
Then pre-trained models came along. Basically, they did three things. First, they learned general language representations from massive amounts of unlabeled data, so you needed far less labeled data for each downstream task. Second, they gave downstream tasks a much better starting point than random initialization, leading to faster convergence and better generalization. Third, they acted as a form of regularization, reducing overfitting on small datasets.
The first time I used word2vec to initialize my embedding layer, it felt amazing! The loss dropped like crazy, like I had swapped in a new engine. But the real game-changer was BERT—using pre-trained Transformers and just fine-tuning on downstream tasks. Suddenly, state-of-the-art on many tasks was being shattered. Honestly, in those first couple of years, I threw BERT at everything: classification, sequence labeling, matching. The results were so good they felt unreal. If you've tried it, you know that feeling—like getting the answers to the exam beforehand. It's so good it makes you question everything.
---
2. What's the Real Difference Between word2vec and BERT? Let's Start with Static Word Vectors
Interviewers love asking, "What's the relationship between word embeddings and distributed representations?" My understanding is simple: word embeddings are a specific implementation of distributed representations. Under one-hot encoding, each word is a sparse vector as long as the vocabulary, and any two different words are orthogonal, so their similarity is always zero. It's like a phone book for the entire universe where you can't find similar names. Distributed representations map words to dense, low-dimensional vectors. No single dimension has inherent meaning, but together they capture semantic similarity. It's like how our brains encode concepts: not in a single cell, but through a network of neurons.
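To make the one-hot versus dense contrast concrete, here is a minimal sketch. The three-word vocabulary and the dense vectors are made-up numbers for illustration only, not trained embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["king", "queen", "banana"]

# One-hot: each word owns exactly one dimension of a vocabulary-sized vector.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Dense vectors: hypothetical values chosen by hand, NOT real word2vec output.
dense = {
    "king":   [0.8, 0.6, 0.1],
    "queen":  [0.7, 0.7, 0.1],
    "banana": [0.1, 0.0, 0.9],
}

# Distinct one-hot vectors never share a non-zero dimension, so this is 0.0.
one_hot_sim = cosine(one_hot["king"], one_hot["queen"])

# Dense vectors can place related words close together.
dense_related = cosine(dense["king"], dense["queen"])
dense_unrelated = cosine(dense["king"], dense["banana"])

print(one_hot_sim, dense_related, dense_unrelated)
```

With one-hot vectors every pair of different words looks equally unrelated; with dense vectors the geometry can carry similarity, which is the whole point of a distributed representation.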
Pre-trained models can be split into two major phases: shallow word embeddings and pre-trained encoders.
Shallow word embeddings are static and context-independent. NNLM, word2vec, GloVe, fastText—they all fall into this category. The word vectors they produce are one-size-fits-all: one word gets one vector. Whether the context is "This apple is delicious" or "Apple just released a new phone," the vector is identical. That's a huge problem!
Back when I was doing sentiment analysis on reviews, I used word2vec to initialize my embedding layer. I hit a wall with polysemous words. For example, the model learned fine for "This phone is cool," but when faced with "It's cool today" (meaning "very" in some dialects), the vector for "cool" clashed with its common semantics, and the classification went completely wrong. So what's the fundamental issue with static word vectors? They can't adjust dynamically based on context. Frustrating, isn't it?
Pre-trained encoders, on the other hand, are context-dependent and generate representations dynamically. ELMo used a bidirectional LSTM, but note that its "bidirectional" was actually two separate unidirectional LSTMs, one left-to-right and one right-to-left, whose hidden states were concatenated. It wasn't true bidirectional context, because each LSTM only ever sees one side of the sentence and the two views are joined only afterwards. Even so, ELMo was leagues ahead of static word vectors at the time.
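The "two unidirectional passes, concatenated" idea can be sketched in a few lines. This is a toy under loud assumptions: a plain tanh RNN stands in for ELMo's LSTMs, the weights are random and untrained, and every name here (`run_rnn`, `contextual_vectors`, `EMBED`) is hypothetical. It only shows the wiring, not ELMo itself.

```python
import math
import random

random.seed(0)   # fixed seed so the toy weights are reproducible
DIM = 4          # hidden size, chosen arbitrarily for the demo

def rand_matrix(rows, cols, scale=0.5):
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

VOCAB = ["this", "apple", "is", "delicious", "released", "a", "phone"]
# Static table: one fixed vector per word (random, untrained).
EMBED = {w: [random.gauss(0.0, 1.0) for _ in range(DIM)] for w in VOCAB}
W_FWD, U_FWD = rand_matrix(DIM, DIM), rand_matrix(DIM, DIM)
W_BWD, U_BWD = rand_matrix(DIM, DIM), rand_matrix(DIM, DIM)

def run_rnn(inputs, w, u):
    """Plain tanh RNN (a stand-in for an LSTM); returns one state per step."""
    h = [0.0] * DIM
    states = []
    for x in inputs:
        wx, uh = matvec(w, x), matvec(u, h)
        h = [math.tanh(a + b) for a, b in zip(wx, uh)]
        states.append(h)
    return states

def contextual_vectors(tokens):
    xs = [EMBED[t] for t in tokens]
    fwd = run_rnn(xs, W_FWD, U_FWD)               # reads left to right
    bwd = run_rnn(xs[::-1], W_BWD, U_BWD)[::-1]   # reads right to left, then re-aligned
    return [f + b for f, b in zip(fwd, bwd)]      # list concatenation: [fwd ; bwd]

s1 = ["this", "apple", "is", "delicious"]
s2 = ["this", "apple", "released", "a", "phone"]
apple_1 = contextual_vectors(s1)[1]
apple_2 = contextual_vectors(s2)[1]

# Same left context ("this") in both sentences, so the forward halves match.
forward_same = apple_1[:DIM] == apple_2[:DIM]
# Different right context, so the backward halves differ.
backward_same = apple_1[DIM:] == apple_2[DIM:]
print(forward_same, backward_same)
```

The static lookup `EMBED["apple"]` is identical in both sentences, while the concatenated vector changes with context. The forward half never reacts to anything on the right, which is exactly the limitation described above.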
GPT is a unidirectional Transformer that models language from left to right. It's good for generating long text, but it can't use both left and right context simultaneously. I tried using GPT for classification, and the results were mediocre. Because it only looks at the preceding context, it misses a lot of semantics. It's like asking someone to make a judgment after hearing only half a sentence—how accurate can they be?
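What "only looks at the preceding context" means mechanically is a causal attention mask. The sketch below is a simplified illustration, not GPT's actual code: it uses all-zero attention scores so the effect of the mask is easy to read, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax; -inf scores get weight 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, causal):
    """Row i holds the attention weights of position i over all positions.

    With causal=True, position i may only attend to positions j <= i;
    future positions are set to -inf before the softmax.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        row = [
            scores[i][j] if (not causal or j <= i) else float("-inf")
            for j in range(n)
        ]
        weights.append(softmax(row))
    return weights

n = 4
uniform_scores = [[0.0] * n for _ in range(n)]   # toy scores, all equal

causal = attention_weights(uniform_scores, causal=True)    # left-to-right style
full = attention_weights(uniform_scores, causal=False)     # encoder style

for row in causal:
    print(row)
```

In the causal case the first token can only attend to itself, and every weight above the diagonal is zero. In the unmasked case every token spreads attention over the whole sequence, which is what an encoder such as BERT relies on.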
BERT completely changed the game. It uses the Transformer encoder with a Masked Language Model objective, randomly masking 15% of tokens and making the model predict them, to achieve true bidirectional context. Why not just train a regular language model in both directions at once? Because in a deep network where every position can attend to every other position, each word can indirectly "see itself" through the other positions, so predicting it becomes trivial. That's label leakage. Masking some tokens forces the model to infer them from the context on both sides, giving it bidirectional information without cheating. That's a stroke of genius!
The first time I used BERT for text classification, I was amazed: even without fine-tuning—just taking the [CLS] vector and feeding it to a linear classifier—it outperformed my carefully tuned word2vec + LSTM model! It felt like you had spent years training in martial arts, and then someone shows up who just ate a magic pill and skipped all the basics. Infuriating, right? But that's the power of pre-training: it has already learned language knowledge from massive data. You could attach a potato to it and it would still work.
---
3. A Zoo of Pre-training Tasks: How Do You Choose?
Let's talk about the different pre-training tasks. There are quite a few pitfalls here; let me walk through them.
MLM (Masked Language Model): BERT's classic approach: randomly select tokens, corrupt them, and predict the originals. But there's an inherent issue: during pre-training the input contains [MASK] tokens, while downstream inputs during fine-tuning never do, creating a mismatch. BERT tries to mitigate this by treating the selected tokens differently: 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged. It's a patch, but is it perfect? Not quite.
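The 80/10/10 rule is easier to remember once you have written it out. Here is a minimal sketch of the corruption step, with two simplifying assumptions: each position is selected independently with probability 0.15 (rather than picking a fixed number of positions per sequence), and the function name `mlm_corrupt` and its signature are my own, not from any library.

```python
import random

MASK = "[MASK]"

def mlm_corrupt(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (corrupted_tokens, labels).

    labels[i] is the original token at positions selected for prediction,
    and None everywhere else (no loss is computed there).
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                          # not selected: left alone, no loss
        labels[i] = tok                       # selected: model must recover this
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return corrupted, labels

sentence = "the quick brown fox jumps over the lazy dog".split()
toy_vocab = sorted(set(sentence))             # tiny vocabulary for the demo
corrupted, labels = mlm_corrupt(sentence, toy_vocab, seed=0)
print(corrupted)
print(labels)
```

The "unchanged 10%" case is what keeps the model honest on real tokens: a selected position can look perfectly normal in the input and still carry a prediction target.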
NSP (Next Sentence Prediction): BERT uses this to learn sentence relationships, determining if two sentences are consecutive. Later research showed NSP isn't very useful and might even be harmful. I did some QA tasks and tried a version of BERT without NSP. The results were about the same, and I saved pre-training time. Sometimes what you think is icing on the cake is actually just sabotaging you.
SOP (Sentence Order Prediction): ALBERT's authors found NSP too easy: because the negative pairs come from different documents, it can often be solved by topic prediction alone. So they replaced it with judging whether two consecutive sentences have been swapped. This task is harder and seems to learn better sentence relationships. When I worked on paragraph ordering, ALBERT indeed outperformed BERT. Another counterintuitive insight: a harder pre-training task often teaches more useful knowledge.
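The difference between NSP and SOP comes down to how the negative examples are built. The sketch below shows that contrast under simple assumptions: documents are plain lists of sentences, positives and negatives are drawn 50/50, and the helper names are hypothetical.

```python
import random

def make_nsp_example(doc, other_doc, i, rng):
    """NSP: positive is (doc[i], doc[i+1]); negative pairs doc[i] with a
    random sentence from a different document."""
    if rng.random() < 0.5:
        return doc[i], doc[i + 1], 1           # label 1: really the next sentence
    return doc[i], rng.choice(other_doc), 0    # label 0: sentence from elsewhere

def make_sop_example(doc, i, rng):
    """SOP: both sentences always come from the same document; the negative
    is the same consecutive pair with the order swapped."""
    first, second = doc[i], doc[i + 1]
    if rng.random() < 0.5:
        return first, second, 1                # label 1: original order
    return second, first, 0                    # label 0: swapped order

rng = random.Random(0)
cooking = ["Boil the water.", "Add the noodles.", "Drain after eight minutes."]
finance = ["Rates rose again.", "Bond prices fell."]

print(make_nsp_example(cooking, finance, 0, rng))
print(make_sop_example(cooking, 0, rng))
```

An NSP negative mixes cooking with finance, so the topic alone can give it away. An SOP negative stays on topic, so the only remaining signal is the order of the two sentences.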
RTD (Replaced Token Detection): ELECTRA trains a discriminator to determine if each token in the input is real or replaced by a generator. The generator does MLM, and the discriminator does binary classification. The benefit is that every token contributes to the loss, unlike ML
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.