The Transformer Architecture and Its Applications, Explained

Generated: 2026-06-20 16:20:40

---


It was a winter afternoon. I stared at the training logs on my terminal, frozen.

An 8-layer Transformer had cut a WMT translation task from 6 hours per epoch with an RNN down to 45 minutes, an 8x speedup. On top of that, the BLEU score for long sentences jumped by 3 points. I couldn't believe it, so I ran it again. Same result. When those numbers popped up, only one thought crossed my mind: the era of the horse-drawn carriage is over!

You may not know this, but before 2017 the hottest networks in NLP were RNNs, and LSTMs and GRUs were all the rage. I had used LSTMs for text classification myself, and it was a pain. You had to feed in a sentence word by word, waiting for each word to be processed before moving on to the next, so training a model often took ten-plus hours. Worse, when sequences got long, gradients would vanish or explode. You had to carefully tune gradient clipping and the learning rate, nursing the thing like a fragile furnace: too hot and it burns through, too cool and you never forge the steel.

I used to joke with friends that an RNN is like a horse-drawn carriage. You can fit it with the best shock absorbers (LSTM), but it's still a carriage, and it's never going to get much faster.

So when Google Brain dropped the paper "Attention Is All You Need" in 2017, my first reaction wasn't excitement. It was doubt. Abandon RNNs entirely and rely on the attention mechanism alone? Could that really work?

Well, as you know, it didn't just work. It took over NLP. In this article, we're going to break the Transformer down piece by piece and see how superstar models like GPT, BERT, GPT‑2, and MT‑DNN each flex their muscles on top of its skeleton. I'll also share every pothole I stumbled into and the weird phenomena I ran into while tuning models. Follow along, and the next time you come across these terms, you won't just think, "I think I've heard of that."

---

First, the Core Parts: Positional Encoding and Self-Attention

The basic design of the Transformer is simple: one sentence goes in, one sentence comes out. The input is the sequence of word vectors for the whole sentence; the output is an enriched sequence of the same length. Every word can see all the other words at the same time, so the path between any two words is always a single step, and you never have to carry hidden states across long distances the way RNNs do.

But this also brings a headache: how does the model know the order of the words? Attention by itself ignores order, so without extra information "I hit you" and "you hit me" are just the same bag of words to the Transformer. That is why the design rests on two key parts: positional encoding, which stamps each word vector with its position, and the self-attention mechanism.
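The article does not say which positional encoding it has in mind, so here is a minimal sketch assuming the fixed sinusoidal scheme from the original paper. It is written in plain Python so every step is visible; the function names and the 4-dimensional toy word vectors are made up for illustration.

```python
import math

def positional_encoding(seq_len, d_model):
    """Build a seq_len x d_model table of sinusoidal position vectors."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each even/odd pair of dimensions shares one wavelength.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    """Add the position vector to each word vector, element by element."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(word, pos)]
            for word, pos in zip(embeddings, pe)]

# Hypothetical word vectors; "me" is treated as the same token as "I" for simplicity.
vec = {
    "I": [1.0, 0.0, 0.0, 0.0],
    "hit": [0.0, 1.0, 0.0, 0.0],
    "you": [0.0, 0.0, 1.0, 0.0],
}
a = add_positions([vec["I"], vec["hit"], vec["you"]])  # "I hit you"
b = add_positions([vec["you"], vec["hit"], vec["I"]])  # "you hit me"

# The same word now gets a different vector depending on where it sits.
print(a[0] != b[2])  # True
```

Without the added position vectors, the two sentences would hand the model the same three vectors in a different order, and attention alone cannot tell the orders apart.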

Now for self-attention. The core idea is to let each word "look at" every other word in the sentence and decide how much attention to pay to each one. Specifically, three vectors are generated for each word: Query (Q), Key (K), and Value (V). You compute the dot product between a word's Q and every K, normalize the scores with Softmax to get weights, and take the weighted sum of all the V's to get that word's new representation.

When I first learned this, I couldn't help thinking it's exactly like a search engine: you enter a query (Q), match it against keys (K) in the database, and then retrieve the corresponding content (V) based on the match scores. That analogy really works.

But when I actually wrote the code, I hit a massive pitfall: without scaling, the dot products balloon as the dimension grows, pushing Softmax into a saturated region where the gradients are almost flat, and training completely fails to converge! The paper fixes this by dividing the scores by the square root of d_k, the dimension of the key vectors. When I built a simplified version, I forgot that factor, and the loss sat at 4.5 without moving at all. After adding the scaling, everything worked. Don't ever skip this detail.
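Here is a minimal sketch of scaled dot-product attention as described above, again in plain Python rather than a tensor library. The toy vectors are chosen by hand so the raw dot products are easy to follow; the `scale` flag is an illustrative switch of my own, there only to show what happens when the division by the square root of d_k is left out.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(q, K, scale=True):
    """Softmax over the dot products of one query with every key."""
    d_k = len(q)
    factor = math.sqrt(d_k) if scale else 1.0
    return softmax([sum(a * b for a, b in zip(q, k)) / factor for k in K])

def attention(Q, K, V, scale=True):
    """For each query, return the weighted sum of the value vectors."""
    output = []
    for q in Q:
        weights = attention_weights(q, K, scale)
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

# Toy data with d_k = 64: the raw dot products are 16, 8 and 0.
d_k = 64
q = [0.5] * d_k
K = [[0.5] * d_k, [0.25] * d_k, [0.0] * d_k]
V = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

unscaled = attention_weights(q, K, scale=False)  # scores 16, 8, 0
scaled = attention_weights(q, K, scale=True)     # scores 2, 1, 0
print([round(w, 3) for w in unscaled])
print([round(w, 3) for w in scaled])
print(attention([q], K, V))
```

Without scaling, the first weight is almost exactly 1 and the other two are almost 0, which is the saturated regime where Softmax passes back next to no gradient. With scaling, the weights come out at roughly 0.67, 0.24, and 0.09, so every key still contributes.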

Multi-Head Attention: A Committee of Experts in Your Brain

A single "attention mode" is not enough. Consider the sentence "This apple is really tasty, wash it before eating." A single attention head might focus only on the relationship between "apple" and "eating" and ignore the action "wash." Multi-head attention splits Q, K, and V into several parts (the paper uses 8 heads); each head computes attention independently, and the results are concatenated.

I like to think of it as letting the model read a sentence from 8 different perspectives. It's like a meeting where eight experts each give advice from their own angle, and then you combine it all into a single report.

The number of heads is one of the hyperparameters you get to tune. I once tried 4, 8, and 16 heads on a classification task. 8 heads did best; 16 actually performed worse. Why? Possibly because with too many heads, each head gets too few dimensions (splitting 512 dimensions across 16 heads leaves only 32 per head) and becomes too weak to express anything. If you're using a small model, don't use too many heads. A rule of thumb is num_heads = d_model // 64, which keeps each head at 64 dimensions.
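The bookkeeping behind those numbers can be sketched in a few lines. This covers only the splitting and concatenating step, not full multi-head attention: a real layer also applies learned projections before the split and after the merge, which are left out here. The helper names are hypothetical.

```python
def split_heads(vector, num_heads):
    """Cut one d_model-sized vector into num_heads equal slices."""
    d_model = len(vector)
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    return [vector[h * d_head:(h + 1) * d_head] for h in range(num_heads)]

def merge_heads(heads):
    """Concatenate the per-head slices back into one vector."""
    return [x for head in heads for x in head]

d_model = 512
vector = [float(i) for i in range(d_model)]  # stand-in for one projected Q, K or V row

for num_heads in (4, 8, 16):
    heads = split_heads(vector, num_heads)
    print(num_heads, "heads ->", len(heads[0]), "dimensions per head")

# The rule of thumb: keep each head at 64 dimensions.
rule_of_thumb_heads = d_model // 64
print("suggested heads for d_model = 512:", rule_of_thumb_heads)
```

For d_model = 512 this prints 128, 64, and 32 dimensions per head for 4, 8, and 16 heads, and the rule of thumb lands on 8 heads.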

Feed-Forward Network, Residual Connections, and LayerNorm

After self-attention, each word passes through a two-layer fully connected network (the hidden layer is typically 2048 dimensions wide). You can think of this step as further refinement applied to each word individually. Each sublayer, attention and feed-forward alike, is then wrapped in a residual connection followed by LayerNorm: output = LayerNorm(input + sublayer_output).
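As a rough sketch of that formula, the snippet below applies a tiny feed-forward network to each word vector and wraps it in the residual connection and LayerNorm. The sizes (4 and 8 instead of the real widths) and the hand-made weights are assumptions for illustration, ReLU is assumed as the activation as in the original paper, and LayerNorm's learned gain and bias are omitted.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one word vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def feed_forward(x, W1, b1, W2, b2):
    """Two linear layers with a ReLU in between, applied to one word vector."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row)) + b)
              for row, b in zip(W1, b1)]
    return [sum(h * w for h, w in zip(hidden, row)) + b
            for row, b in zip(W2, b2)]

def residual_sublayer(x, sublayer):
    """output = LayerNorm(input + sublayer_output)"""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

# Tiny hypothetical sizes and weights, chosen only to keep the example readable.
d_model, d_ff = 4, 8
W1 = [[0.01 * (i + j) for i in range(d_model)] for j in range(d_ff)]
b1 = [0.0] * d_ff
W2 = [[0.01 * (i - j) for j in range(d_ff)] for i in range(d_model)]
b2 = [0.0] * d_model

def ffn(x):
    return feed_forward(x, W1, b1, W2, b2)

sentence = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]  # two word vectors
# The same weights are applied to every position independently.
output = [residual_sublayer(word, ffn) for word in sentence]
print(output)
```

Because the input is added back before normalization, the sublayer only has to learn a correction to its input, which is one common way to explain why gradients flow more easily with the residual path in place.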

At first, I didn't think much of this design. But then I tried removing the residual connection just to see what would happen: the training loss oscillated wildly, and convergence took nearly twice as long. Residual connections are definitely not optional in deep Transformers; without them, gradients have a much harder time flowing back through the layers.

LayerNorm is also well thought out. It normalizes along the feature dimension within each sample, so unlike BatchNorm, it is not affected by the batch size or by what else is in the batch. In NLP, the sentences in a batch often have varying lengths, and LayerNorm is more stable. I once replaced LayerNorm with BatchNorm, and when training on variable-length batches, the validation performance kept jumping up and down. Switching back to LayerNorm made it stable again.
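The difference is easy to see on a toy batch. The sketch below is a simplified illustration, not a full implementation: both normalizations skip their learned parameters, and the BatchNorm here uses only the current batch's statistics (training-mode behavior, with no running averages). The same sample is placed in two different batches to see whether its normalized value changes.

```python
import math

def normalize(values, eps=1e-5):
    """Shift and rescale a list of numbers to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm(batch):
    """Normalize each sample across its own features."""
    return [normalize(sample) for sample in batch]

def batch_norm(batch):
    """Normalize each feature across the samples in the batch."""
    columns = [normalize(list(column)) for column in zip(*batch)]
    return [list(row) for row in zip(*columns)]

sample = [1.0, 2.0, 3.0]
batch_a = [sample, [0.0, 0.0, 0.0]]
batch_b = [sample, [10.0, 10.0, 10.0]]

# LayerNorm: the sample's result does not depend on its batch neighbors.
print(layer_norm(batch_a)[0] == layer_norm(batch_b)[0])  # True

# BatchNorm: the same sample lands somewhere else when its neighbors change.
print(batch_norm(batch_a)[0])
print(batch_norm(batch_b)[0])
```

In the first batch the sample sits above the batch mean on every feature, in the second it sits below, so its BatchNorm output flips sign even though the sample itself never changed.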

---

Three Branches: Three Directions of the Transformer Family

The original Transformer is an Encoder-Decoder structure, suitable for seq2seq tasks like translation. But later researchers found that using just the Encoder for understanding tasks, or just the Decoder for generation tasks, works great too. So three branches evolved:

Encoder-only: Represented by BERT. It takes a complete sentence as input and models context bidirectionally, which makes it good for sentence classification, entity recognition, and reading comprehension. I think of BERT's pre-training as a "fill-in-the-blank fanatic" (predicting masked words) plus a "neighbor judge" (deciding whether one sentence follows another), but the key point is that its attention sees the words on both sides at once.
Decoder-only: Represented by the GPT series. It takes a sequence and autoregressively predicts the next word, with attention masked so that each position can only look left. GPT is great for text generation, but it is essentially unidirectional.
Encoder-Decoder: The original Transformer. Note that MT‑DNN, although it also combines pre-training with fine-tuning, does not belong here: its foundation is BERT's architecture used for multi-task training, so it is mainly an Encoder model. Part 4 will go into detail.

I made a mistake here: when I first tried using BERT for text generation, I naively thought I could just bolt a Decoder on top and be done. Later I discovered that BERT's Encoder cannot do autoregressive generation by itself, and the Decoder you attach has to be initialized and trained separately, because BERT's pre-training gives you nothing for it. In contrast, a pure Decoder structure like GPT is naturally suited for generation, but during training it can only use the left context. So before choosing a model, you must decide whether your goal is "understanding" or "generation."
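The practical difference between "sees both sides" and "only looks left" comes down to the attention mask. Here is a minimal sketch, assuming the usual approach of setting blocked scores to negative infinity before the Softmax; the function names are hypothetical, and all raw scores are set to zero so that only the mask shapes the result.

```python
import math

def full_mask(n):
    """Encoder-style (BERT-like): every position may attend to every other."""
    return [[True] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style (GPT-like): position i may attend only to positions <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax that gives zero weight to every blocked position."""
    masked = [s if keep else -math.inf for s, keep in zip(scores, allowed)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

n = 4
scores = [[0.0] * n for _ in range(n)]  # equal raw scores for every pair

bidirectional = [masked_softmax(row, m) for row, m in zip(scores, full_mask(n))]
left_only = [masked_softmax(row, m) for row, m in zip(scores, causal_mask(n))]

for row in bidirectional:
    print(row)
for row in left_only:
    print(row)
```

With the full mask every row spreads its weight evenly over all four words. With the causal mask the first word can only attend to itself, the second to the first two, and so on, which is exactly what lets a Decoder be trained to predict the next word without peeking at it.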

Cael Lee
