Transformer & BERT: A Retrospective on Related Questions (English)

Generated: 2026-06-20 18:39:41

---

Have you ever had that kind of interview? After three months of fall recruitment, I was so sick of answering Transformer and BERT trivia that I could've thrown up. And it wasn't just me—my labmates grinding through the same questions were all complaining too: Why does this stuff just get deeper and deeper? The answers you find online are the same few lines repeated over and over, but when you actually try to explain it yourself, you freeze up on the spot.

So today I'll skip the fluff. I'll break down the traps I fell into, the experiments I ran myself, and those hair-raising follow-up questions from interviewers, peeling them back layer by layer like an onion. By the time you finish reading, you'll realize that those "frequently asked questions" all share the same underlying logic.

---

Positional Encoding: The Interviewer Hit Me with Three Questions in a Row, and I Completely Crashed

ByteDance's first round. The interviewer fired off three questions, and I was drenched in sweat.

Why does Transformer need positional encoding?
Why use sinusoidal functions instead of learnable ones?
Self-Attention loses relative position information, so what's the point of adding it anyway?

I could barely handle the first two, but the third one completely stumped me—I only remembered that "position information disappears in Attention," but I'd never thought about "why bother adding it at all if it disappears."

First, why it's needed.

Transformer has no recurrent structure. Take "I hit you" and "you hit me": without positional encoding, self-attention treats the input as an unordered bag of three words, so it can't tell who hit whom. The sinusoidal formula gives each position a unique "fingerprint" made of sines and cosines at different frequencies, which tells the model that order exists.
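To make the "fingerprint" concrete, here is a minimal sketch of the sinusoidal formula from the original Transformer paper. The function name and the tiny d_model=8 are my own choices for illustration; real models use something like 512.

```python
import math

def sinusoidal_encoding(position: int, d_model: int) -> list[float]:
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    encoding = []
    for i in range(0, d_model, 2):
        # Each sin/cos pair shares one frequency; frequencies shrink geometrically with i.
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        if i + 1 < d_model:
            encoding.append(math.cos(angle))
    return encoding

# Every position gets a distinct vector, so word order becomes visible to the model.
for pos in range(3):
    print(pos, [round(x, 3) for x in sinusoidal_encoding(pos, d_model=8)])
```

These vectors are simply added to the token embeddings before the first layer.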

So why not use learnable embeddings?

I dug into a lot of blog posts on this later and even ran my own comparison experiment, replacing the sinusoidal encoding with learnable embeddings. On WMT English-German, the BLEU difference was less than 0.3, but convergence was noticeably slower. The biggest advantages of sinusoidal encoding are that it requires no extra parameters and that, in theory, it exposes relative positions: the encoding at position pos + k is a fixed linear function of the encoding at pos. But after passing through Self-Attention's weighted sums, that positional signal does get diluted, like ink poured into the ocean.
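The "relative positions in theory" part can be written out for a single sin/cos pair. This is a sketch using my own symbol \omega_i for the frequency of pair i (\omega_i = 1 / 10000^{2i/d_{model}}); it follows directly from the angle-addition identities.

```latex
\begin{pmatrix} \sin\big(\omega_i (pos + k)\big) \\ \cos\big(\omega_i (pos + k)\big) \end{pmatrix}
=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} \sin(\omega_i \, pos) \\ \cos(\omega_i \, pos) \end{pmatrix}
```

The matrix depends only on the offset k, not on pos, which is why a fixed offset looks the same to the model anywhere in the sequence.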

So what's the point of adding it at all?

Later it clicked: Self-Attention can attend to any position, but it needs an initial compass. Without positional encoding, the model can't even figure out "which token is the first." Positional encoding isn't about preserving order in the final representation; it's about guiding attention to develop ordering dependencies in the early layers. Even though the signal blurs in later layers, the gradient can still propagate that "order sensitivity" back. It's like training a puppy with a hand gesture first—later the gesture fades, but the conditioned reflex is already there.
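Here is a toy check of the "can't even figure out which token is the first" claim. It is a sketch under heavy assumptions: a single head, identity Q/K/V projections, made-up 2-dimensional embeddings, and "I"/"me" treated as the same token. Without positional encoding, reordering the input just reorders the output.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)])
    return outputs

# Toy embeddings for "I", "hit", "you" (made-up numbers, no positional encoding added).
i_vec, hit_vec, you_vec = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]

out_a = self_attention([i_vec, hit_vec, you_vec])   # "I hit you"
out_b = self_attention([you_vec, hit_vec, i_vec])   # "you hit me"

# The representation of "hit" matches in both sentences (up to float rounding).
print(out_a[1])
print(out_b[1])
```

Adding a positional vector to each embedding before calling self_attention breaks this symmetry, which is the "compass" in question.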

In a nutshell: It's not there to "preserve" order; it's there to "kickstart" it.

---

BERT, GPT, Transformer: Split the Architecture in Half, and Each Has Its Own Achilles' Heel

Interviewers love to ask: "Why does BERT only use the Encoder, and GPT only the Decoder? Could you swap them?"

The first time I was asked, without thinking I answered: "BERT is for understanding, GPT is for generation." Then the interviewer followed up: "If you put a language model head on BERT's Encoder, could it generate text?" I was completely stuck.

Later, I drew out all three architectures on paper and studied them over and over. It comes down to one idea: field of view.

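| Model | Architecture | Attention Visibility | Pre-training Task | Typical Use Case |
| --- | --- | --- | --- | --- |
| Transformer | Encoder-Decoder | Self + Masked Self + Cross | Translation (conditional generation) | Machine translation, summarization |
| BERT | Encoder only | Bidirectional, full visibility | MLM + NSP | Classification, NER, QA |
| GPT | Decoder only | Left-to-right, unidirectional | Next token prediction | Dialogue, creative writing |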

BERT's full bidirectional view makes it naturally suited for "understanding"—every token can see its full context, just like when you read a sentence you grasp it all at once. GPT's masked self-attention means each token can only look at what's to its left, simulating the process of generating one word at a time.
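The two fields of view are easiest to see as visibility matrices. A minimal sketch, with a helper name of my own invention; row i lists which positions token i is allowed to look at.

```python
def visibility_mask(seq_len: int, causal: bool) -> list[list[int]]:
    """mask[i][j] == 1 means position i may attend to position j."""
    return [[1 if (not causal or j <= i) else 0 for j in range(seq_len)]
            for i in range(seq_len)]

print("BERT-style (bidirectional):")
for row in visibility_mask(4, causal=False):
    print(row)

print("GPT-style (causal):")
for row in visibility_mask(4, causal=True):
    print(row)
```

The bidirectional matrix is all ones; the causal one is lower-triangular, so the first token sees only itself and the last token sees everything.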

So could they be interchanged? Technically yes, but performance suffers. If you make BERT generate, it never trained under the "look left only" constraint. You could force a causal mask at inference time, but the generated sentences would tend to be logically chaotic. If you make GPT do classification, each token's representation is built without seeing anything to its right. Only the final token has seen the whole sentence, so the classifier has to lean on that single position, and accuracy tends to trail a bidirectional encoder of similar size.

So when an interviewer asks "can you swap them?", they're really asking: Have you thought about how the nature of the task determines the attention field of view? Bidirectional understanding relies on the Encoder; unidirectional generation relies on the Decoder—this wasn't something the designers decided on a whim; it's what the data taught.

---

BERT's Mask: Why Mess with the Input Layer?

Another question I bombed.

"Transformer Decoder uses Attention Mask to hide future words. Why doesn't BERT use an Attention Mask, and instead directly replaces input tokens with [MASK]?"

Intuitively, they're both masking, just different approaches, right? But dig deeper, and you realize: These two operate on completely opposite logic.

The Transformer Decoder's Attention Mask sets the attention scores for future positions to negative infinity before softmax, which makes their attention weights exactly zero. The vectors of future words are still sitting in the input, but they contribute nothing to the current position's output. This is masking at the attention level, like covering your ears in a meeting so you can't hear your boss.
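The "negative infinity before softmax" trick fits in a few lines. This is a sketch with hypothetical function names and made-up scores, not any library's implementation.

```python
import math

def masked_softmax(scores, allowed):
    """Softmax over scores; positions where allowed is False are set to -inf first."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(score_matrix):
    """Row i may only attend to columns j <= i."""
    n = len(score_matrix)
    return [masked_softmax(row, [j <= i for j in range(n)])
            for i, row in enumerate(score_matrix)]

# Raw attention scores for a 3-token sequence (made-up numbers).
scores = [[0.5, 2.0, 1.0],
          [1.0, 0.3, 4.0],
          [0.2, 0.7, 0.1]]
for row in causal_attention_weights(scores):
    print([round(w, 3) for w in row])
```

Even though the second row has a large raw score (4.0) for the third token, its weight comes out as zero, and the remaining weights still sum to one.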

BERT's [MASK], on the other hand, directly erases the input word and replaces it with a special token. The model has no idea what the original word was; it has to guess purely from context—this is destruction at the input level, with the explicit goal of forcing the model to learn "contextual reasoning."

Why doesn't BERT use attention masking?

Because if BERT only masked future words like GPT, it would become unidirectional and couldn't learn bidirectional understanding. Using [MASK] allows you to mask any position, and the model must look both left and right to reconstruct it—this is what's called "denoising auto-encoding": first dirty the input, then have the model clean it up.
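For comparison, here is a sketch of the input-level corruption. One detail that goes beyond what I described above: the BERT paper does not always substitute [MASK]. Of the 15% of positions selected, 80% become [MASK], 10% become a random token, and 10% are left unchanged. The function name and the toy vocabulary are my own.

```python
import random

def mask_for_mlm(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels); labels[i] is None where no prediction is needed."""
    rng = rng or random.Random()
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                          # position left alone, not predicted
        labels[i] = token                     # the model must reconstruct the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with the special token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels

sentence = "the cat sat on the mat because it was tired".split()
corrupted, labels = mask_for_mlm(sentence, vocab=sentence, rng=random.Random(0))
print(corrupted)
print(labels)
```

Note that the attention mask stays fully bidirectional throughout; only the input tokens are damaged.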

I ran a small experiment myself: I changed BERT's MLM task to a unidirectional Attention Mask (masking 15% of positions and preventing them from seeing later tokens), keeping everything else the same. After pre-training, the GLUE score dropped by almost 4 points. For me that settled it: bidirectional understanding requires real input destruction; you can't cut corners with attention masks.

---

Training Tricks: Parameters That Seem Like Black Magic Are Actually Lessons Learned in Blood

The first time I set up a Transformer for translation, I just used the default Adam (β2=0.999). Halfway through training, the loss started oscillating wildly, bouncing up and down like it was on caffeine. I thought my model configuration was wrong and spent two days debugging before I found out: the original paper uses β2=0.98, not 0.999.

Why? Because Transformer gradients have high variance, especially once you stack deep FFN layers and residual connections. If the second-moment estimate remembers old gradients for too long, the adaptive learning rate becomes too smooth and reacts too slowly, which makes convergence harder. With β2=0.98 the half-life of that moving average is only about 34 steps; with β2=0.999 it is nearly 700 steps, roughly 20 times longer. Transformer needs faster second-moment updates to keep up with drastic gradient changes. The paper doesn't spell this out, but if you don't tune it, you'll crash.
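The half-life numbers are easy to verify yourself. A small sketch (the function name is mine): an observation's weight in the moving average decays as β2 to the power n, so the half-life is the n that solves β2^n = 0.5.

```python
import math

def ema_half_life(beta: float) -> float:
    """Steps until an observation's weight in an exponential moving average halves."""
    return math.log(0.5) / math.log(beta)

for beta2 in (0.98, 0.999):
    print(f"beta2={beta2}: half-life ~ {ema_half_life(beta2):.1f} steps")
```

In PyTorch, for example, this is the second element of the betas tuple passed to torch.optim.Adam.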

Then there's Warmup. I used to be impatient and would start at the peak learning rate right away, but within the first few hundred steps the loss would turn into NaN. After reading the paper, I understood: right after initialization, the parameters are all over the place, and the Self-Attention distribution is very unstable. A large learning rate can blow up the embeddings in a single step. Growing the learning rate linearly over the first 4000 steps gives the model time to settle, letting the parameters slowly find a stable descent direction before the step size gets large. When an interviewer asks "why use Warmup?", answering "linear growth for the first xx steps" is superficial. Saying that "the initial self-attention distribution is unstable and needs a small learning rate to stabilize gradients" is the real key.
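For reference, here is the schedule from the original paper as a short sketch: linear growth for warmup_steps, then decay proportional to the inverse square root of the step. The function name is mine, and d_model=512 is the paper's base configuration.

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate from the original Transformer schedule; step counts from 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000):
    print(step, f"{transformer_lr(step):.2e}")
```

The two branches of min meet exactly at step == warmup_steps, which is where the learning rate peaks.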

As for Label Smoothing (ε=0.1), I initially hated it because it kept the loss from going down. But after adding it, the generalization BLEU improved by 0.5. The reason is simple


Cael Lee