PyTorch-based, building-block-style implementation of Transfor (English)

Generated: 2026-06-20 22:13:42

---

Title: Don’t Be Fooled – Transformer Isn’t a “Model,” It’s a LEGO Set!

You know, I have this weird obsession: whenever a new model goes viral in the ML community, if I don’t sit down and code it up from scratch myself, I get this itch that the knowledge hasn't really sunk in.

Back in 2017, when Google dropped “Attention Is All You Need,” I was still wrestling with LSTMs for machine translation. When I read the title, I thought, “Whoa, that’s a bold claim – ‘All you need is attention’?” Honestly, I didn’t buy it at first.

But then? Guess what happened. I actually pulled down the code and went through it line by line, and suddenly it clicked: “Hey, this is just a LEGO set! A bunch of bricks that you snap together.” But I swear, if you actually try to build it yourself, I’ll bet you a pack of spicy sticks that you’ll get tripped up by plenty of hidden traps before you finally get it working. If not, I’ll eat my laptop.

I’ve hand-coded the Transformer from scratch five times! From following the paper exactly to later stuffing in all kinds of “black magic,” I’ve stepped in enough holes to write A Tearful History of Transformers. Today, let’s get real and I’ll spill all the things I’ve learned and all the mistakes I’ve made.

Act One: Peel Off the Skin – What’s Really Inside?

You’ve seen the classic Transformer architecture diagram – Encoder on the left, Decoder on the right, fused together like Siamese twins.

Each Encoder layer has just two sub-layers: a self-attention and a feed-forward network, each wrapped in a residual connection and layer normalization. Each Decoder layer makes its self-attention “masked” and adds a little cross-attention pipe that peeks at the Encoder’s output.
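To make the "two bricks plus wrapping" idea concrete, here is a minimal sketch of one Encoder layer. It is only an illustration under stated assumptions: it borrows PyTorch's built-in nn.MultiheadAttention rather than a hand-written attention, it places LayerNorm after the residual (the Post-LN arrangement of the original paper), and the sizes (d_model=512, 8 heads, d_ff=2048, dropout 0.1) are the paper's base configuration. The class name EncoderLayerSketch is made up for this example.

```python
import torch
import torch.nn as nn


class EncoderLayerSketch(nn.Module):
    """One Encoder layer: self-attention + feed-forward, each with residual + LayerNorm."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Built-in attention used as a stand-in for a hand-written one (assumption)
        self.self_attn = nn.MultiheadAttention(
            d_model, num_heads, dropout=dropout, batch_first=True
        )
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Brick 1: self-attention, then residual add, then LayerNorm
        attn_out, _ = self.self_attn(x, x, x, need_weights=False)
        x = self.norm1(x + self.dropout(attn_out))
        # Brick 2: feed-forward, then residual add, then LayerNorm
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x


layer = EncoderLayerSketch().eval()
x = torch.randn(2, 10, 512)  # [batch, tokens, d_model]
out = layer(x)
print(out.shape)
```

The layer keeps the input shape, which is exactly why you can stack as many of these bricks as you like.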

What confused me the most at first was the Decoder. Think about it: during inference, it works like an old steam locomotive, chugging out one word at a time. But during training? It can compute the whole sentence in one shot! I sat in front of my computer for two whole days trying to wrap my head around that. Then I realized the secret was in the mask – a triangular matrix that covers up the future tokens. It tells the model, “Hey, you can only look at what came before; the future is off limits.” That way, you can parallelize the computation while still keeping the causal “look only at the past” order.

I drew a diagram in my notebook: the input is a start token followed by “I love you”. After the masked attention, position 1 can only see the start token, position 2 can see the start token and “I”, and position 3 can see the first three tokens. Once I got that, the whole Decoder logic fell into place like dominoes!
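The triangular mask from that notebook diagram can be sketched in a few lines. The token list, including the "<start>" placeholder name, is an assumption for illustration, not the author's actual data.

```python
import torch

# "<start>" is a hypothetical name for the start-of-sequence token
tokens = ["<start>", "I", "love", "you"]
T = len(tokens)

# Lower-triangular boolean matrix: row i marks the positions that position i may attend to
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))

for i, row in enumerate(causal_mask):
    visible = [tok for tok, keep in zip(tokens, row.tolist()) if keep]
    print(f"position {i + 1} sees {visible}")
```

Every row is computed at once during training, which is where the parallelism comes from; the mask alone enforces the "look only at the past" rule.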

The Second Brick: Self-Attention – The Heart of the Transformer

The first version of self-attention I wrote was pretty naive – I just followed the formula directly: Q, K, V, dot product, scale, softmax, weighted sum, done.


class ScaledDotProductAttention(nn.Module):
 # … you know the code, I won’t paste it all. The core is just a few lines.

And then I hit the first trap, one that nearly buried me alive!

99% of people walking in will make this mistake. Never set the masked scores to 0! Fill them with a large negative number such as -1e9 before the softmax. Think about it: if you put a 0 on a position to tell the model to ignore it, the softmax that follows turns that 0 into exp(0) = 1 before normalizing, so the position still gets a real share of the probability. That’s like secretly letting the model eavesdrop! I once had a training loss that just wouldn’t drop. I stayed up all night tracking it down, and when I finally found this bug, I felt like slapping myself.
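Here is a minimal sketch of the attention core together with the mask trap, assuming a boolean mask where True means "may attend". The function name and the toy tensor sizes are made up for the example; this is not the author's elided class.

```python
import math
import torch
import torch.nn.functional as F


def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: [..., T, d_k]; mask broadcastable to [..., T, T], True = may attend."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Blocked positions get a huge negative score, so softmax sends them to 0
        scores = scores.masked_fill(~mask, -1e9)
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights


torch.manual_seed(0)
q = k = v = torch.randn(1, 4, 8)
mask = torch.tril(torch.ones(4, 4, dtype=torch.bool))

_, good = scaled_dot_product_attention(q, k, v, mask)

# The buggy variant: writing 0 into the blocked scores instead of -1e9
raw = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
bad = F.softmax(raw.masked_fill(~mask, 0.0), dim=-1)

print(good[0, 0])  # first row: all weight sits on position 0
print(bad[0, 0])   # first row: the "hidden" future positions still receive weight
```

Comparing the two printed rows shows the leak directly: with 0 as the fill value, the first token still attends to tokens it should never see.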

And the scaling factor? People ask why you have to divide by sqrt(d_k). I’ve tested it: skip the division and the gradients and loss go on a roller coaster, and with a deeper model training blows up. Later I reread the paper and understood: if the components of q and k are independent with mean 0 and variance 1, the variance of their dot product is d_k, so it grows linearly with dimension. Divide by sqrt(d_k) and the variance stays at 1, which keeps the softmax out of its saturated region so the gradients can flow steadily.
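The variance argument is easy to check numerically. This is a rough sketch assuming q and k are drawn from independent standard normals; the sample count and the list of dimensions are arbitrary choices for the demo.

```python
import torch

torch.manual_seed(0)
num_samples = 20000
results = {}

for d_k in (16, 64, 256):
    q = torch.randn(num_samples, d_k)
    k = torch.randn(num_samples, d_k)
    dots = (q * k).sum(dim=-1)                 # one dot product per sample
    raw_var = dots.var().item()                # should land near d_k
    scaled_var = (dots / d_k ** 0.5).var().item()  # should land near 1
    results[d_k] = (raw_var, scaled_var)
    print(f"d_k={d_k}: raw variance {raw_var:.1f}, scaled variance {scaled_var:.2f}")
```

The raw variance tracks d_k while the scaled one hovers around 1, which is the whole point of the division.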

Multi-Head Attention: From One Brick to a Box of Bricks

This one is even trickier! You split the big vector (d_model) into smaller heads (d_k), do self-attention on each, and concatenate them back. I thought, “How hard can that be?” Then I spent an entire afternoon messing up the dimension order!

The correct way:

Input [B, T, D], reshape to [B, T, num_heads, d_k], then transpose to [B, num_heads, T, d_k].

I got it mixed up: after the transpose, I tried to view without calling contiguous() first! PyTorch immediately yelled at me with a RuntimeError saying the view size is not compatible with the input tensor’s size and stride. Another time, I swapped the order of num_heads and d_k in the reshape, and the output shape didn’t match, which was another big pitfall. Looking back, it seems basic, but who hasn’t stepped in a few beginner’s traps like that?
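Both pitfalls can be reproduced in a short sketch. The sizes are arbitrary, and attn_out is only a stand-in for the per-head attention output (here just a contiguous copy of the split heads), so the round trip can be checked exactly.

```python
import torch

B, T, num_heads, d_k = 2, 5, 8, 64
D = num_heads * d_k
x = torch.randn(B, T, D)

# Split: [B, T, D] -> [B, T, num_heads, d_k] -> [B, num_heads, T, d_k]
heads = x.view(B, T, num_heads, d_k).transpose(1, 2)

# Pitfall: swapping num_heads and d_k in the reshape gives the wrong layout
wrong = x.view(B, T, d_k, num_heads).transpose(1, 2)  # [B, d_k, T, num_heads]

# Stand-in for the attention output: a freshly laid-out [B, num_heads, T, d_k] tensor
attn_out = heads.contiguous()

# Merge: transpose back, which leaves the tensor non-contiguous in memory
merged = attn_out.transpose(1, 2)  # [B, T, num_heads, d_k]
try:
    merged.view(B, T, D)
    view_failed = False
except RuntimeError as err:
    view_failed = True
    print("view failed:", err)

# Fix: make the memory contiguous first (reshape() would also work)
restored = merged.contiguous().view(B, T, D)
print(heads.shape, wrong.shape, restored.shape)
```

Because the stand-in attention does nothing, restored should equal x element for element, which is a handy sanity check to keep around while wiring up the real module.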

But you know, multi-head attention is the brick that amazed me most. Different heads really do focus on different things. Some look at verbs, some at subjects, and some even catch that “only a feeling can tell” emotional nuance. After training, I printed out the attention heatmaps. Seeing those different “eyes” divide the work gave me the feeling of raising a bunch of little sprites.

Positional Encoding: Without This Ingredient, the Dish Is Bland

The Transformer doesn’t have innate sequential memory like an RNN. It’s like an amnesiac: it only sees the bag of words in front of it. So you have to inject it with a “position information” vaccine. I used the sinusoidal encoding from the paper, whose wavelengths form a geometric progression from 2π to 10000·2π.
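A minimal sketch of that sinusoidal encoding follows, assuming an even d_model and a batch-first input of shape [B, T, d_model]. It computes the table once up to max_len and stores it with register_buffer, so it travels with the module but is not a trainable parameter. The class name and the default max_len are made up for the example.

```python
import math
import torch
import torch.nn as nn


class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, d_model=512, max_len=5000):
        super().__init__()
        position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # [max_len, 1]
        # 1 / 10000^(2i / d_model) for each even index 2i, computed in log space
        div_term = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float32)
            * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        # Buffer: saved in state_dict and moved by .to(device), but never trained
        self.register_buffer("pe", pe.unsqueeze(0))  # [1, max_len, d_model]

    def forward(self, x):
        # Add the first T rows of the precomputed table to the embeddings
        return x + self.pe[:, : x.size(1)]


pos_enc = SinusoidalPositionalEncoding(d_model=16, max_len=50)
x = torch.zeros(2, 10, 16)
out = pos_enc(x)
print(out.shape)
```

Feeding zeros makes the output equal to the encoding itself, so position 0 shows sin(0) = 0 in the even slots and cos(0) = 1 in the odd ones.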

Here’s a small detail: precompute the encoding matrix up to the maximum length and register it as a buffer,


Cael Lee
