Professor Qiu Xipeng's Team at Fudan University: The Latest Transformer Survey
Generated: 2026-06-23 05:58:26
---
Did you know? The directions for improving Transformers sort neatly into just three dimensions.
Let me tell you something.
Last year a reader sent me a private message. He said he was interviewing for a large-model position, and when he was asked "What are the main directions for improving Transformers?" his mind went blank on the spot. He squeezed out one word: "Reformer." After the interview he looked it up. Damn! Back in 2021 Professor Qiu Xipeng's team had already published a Transformer survey that broke the whole field into three dimensions and covered hundreds of models, all spelled out clearly. The guy was totally crushed.
I saw his message and laughed. Not because I was mocking him—I totally felt his pain.
The day that survey came out, I downloaded it. Opened it, read for ten minutes—snap! Shut it back down. To be honest, back in 2021 the Transformer variants weren't as crazy as they are now, but even then that paper had already compiled over 200 references. You try to devour it in one go? You’ll injure yourself from the effort.
So what did I do? I read it on and off for six months. It took three passes before I could actually put it to use. Today I'm not going to talk about high-end theory. I just want to tell you how to read this survey, whether it's worth reading, and the pitfalls I hit along the way.
---
2017–2021: The Hundred Schools of Transformers – What a Ruckus
Let me set the scene for you. In 2017, “Attention Is All You Need” came out, and everyone started cramming things into it like they were on adrenaline. Some people tweaked the attention structure, some messed around with positional encoding, some played tricks with layer normalization. By 2020 there were already over 200 papers just on improvement directions. And each one had a fancy name—Performer, Reformer, Linformer, Longformer, BigBird… Good heavens, these X‑formers were popping up like they were rolling off the alphabet.
The core motivations in the research community were actually pretty simple, just three:
- Long sequences are painful – self-attention costs O(n²) time and memory in the sequence length n, so on long texts it just folds on the spot. How do you deal with it? Clever tricks like sparse attention and linear attention.
- Poor generalization on small data – Transformers are data gluttons. If you want them to work on small samples, you have to add regularization or do pre‑training.
- Hard to adapt to new domains – they work well for translation, but when you toss them into CV, audio, or even scientific computing, you have to modify them until their own mothers wouldn't recognize them.
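
To make the O(n²) point concrete, here is a rough sketch of my own (not code from the survey) that simply counts how many query-key scores get computed. The function names and the window size of 64 are assumptions for illustration, and the "local window" pattern is only one simple flavor of sparse attention, not any specific paper's method.

```python
def full_attention_pairs(n: int) -> int:
    """Full self-attention: every token scores against every token, so n * n scores."""
    return n * n


def local_window_pairs(n: int, window: int) -> int:
    """Simple sparse pattern (assumed for illustration): each token only
    scores against neighbours at most `window` positions away on each side."""
    total = 0
    for i in range(n):
        lo = max(0, i - window)       # leftmost key this query can see
        hi = min(n - 1, i + window)   # rightmost key this query can see
        total += hi - lo + 1
    return total


if __name__ == "__main__":
    for n in (512, 2048, 8192):
        print(n, full_attention_pairs(n), local_window_pairs(n, window=64))
```

Double the sequence length and the full count quadruples, while the windowed count only roughly doubles. That gap is the whole reason the efficient X-formers exist.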
But here’s the problem: with all these improvement directions intersecting and tangling with each other, just tracking the papers could give you a headache. At the time I was working on BERT distillation experiments, and I wanted to find some similar attention compression schemes. I spent two days flipping through papers and still couldn’t find a systematic comparison. Imagine that feeling—like fumbling in a dark room and never finding the light switch.
That's when the survey from Qiu Xipeng's team came along and painted a complete map of the field.
---
What exactly does this survey talk about? Three angles, all crystal clear.
First, modifications at the architecture level. The attention mechanism, positional encoding, FFN, layer normalization: each module can be taken apart and modified independently. The survey categorizes these variants by "which module was changed." It's like building with Legos: it tells you that lab A swapped out the positional encoding, lab B split the FFN into chunks, and lab C used hash-based attention. It's all laid out at a glance!
Second, improvements at the pre-training level. From BERT to GPT to various fancy pre-training objectives, the survey organizes the evolution of three paradigms: Masked Language Model, Auto-regressive, and Seq2Seq. What was my biggest takeaway after reading it? It turns out that early pre-training strategies were decoupled from the model architecture! With the same Transformer backbone, attaching a different pre-training objective gives the model a completely different personality, like the same person putting on different clothes and giving off an entirely different vibe.
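
Here is a toy sketch of what "same backbone, different objective" means in practice. It only shows how the training inputs and targets are built from the same tokens. The helper names, the `[MASK]` and `<s>` strings, and the example sentence are my own assumptions, not anything taken from the survey or from a specific library.

```python
MASK = "[MASK]"  # placeholder token string, assumed for illustration


def mlm_example(tokens, mask_positions):
    """Masked LM: hide the chosen positions; the model predicts only those tokens."""
    inputs = [MASK if i in mask_positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(mask_positions)}
    return inputs, targets


def autoregressive_example(tokens):
    """Auto-regressive LM: at each step, predict the next token from the ones before it."""
    return tokens[:-1], tokens[1:]


def seq2seq_example(source, target, bos="<s>"):
    """Seq2Seq: the encoder reads `source`; the decoder is fed the target
    shifted right by one and is trained to output `target`."""
    decoder_inputs = [bos] + target[:-1]
    return source, decoder_inputs, target


if __name__ == "__main__":
    tokens = ["the", "cat", "sat", "down"]
    print(mlm_example(tokens, {1, 3}))
    print(autoregressive_example(tokens))
    print(seq2seq_example(["le", "chat"], ["the", "cat"]))
```

Notice that nothing here touches the model itself. Only the data preparation changes, which is exactly why the objective and the architecture can be mixed and matched.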
Third, adaptation at the application level. NLP is the home turf, but CV, speech, multimodal tasks, even protein sequences are all getting the Transformer treatment. The survey goes through the typical adaptation methods for each domain: for example, how Vision Transformer splits an image into patches and treats them as tokens, and how Swin uses hierarchical, window-based attention to save resources. This is pure solid-gold material.
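
If "image patches" sounds abstract, here is a minimal sketch of the patch-splitting idea using plain nested lists. The function name and the tiny 4 x 4 "image" are my own assumptions. A real Vision Transformer works on multi-channel tensors and adds a learned linear projection plus position embeddings on top of this step.

```python
def image_to_patches(image, patch):
    """Split an H x W grid (a list of rows) into non-overlapping
    patch x patch tiles, each flattened into one list, in row-major order."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tile = [
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ]
            patches.append(tile)  # one flattened tile = one "token"
    return patches


if __name__ == "__main__":
    image = [[r * 4 + c for c in range(4)] for r in range(4)]  # values 0..15
    for p in image_to_patches(image, patch=2):
        print(p)
```

Each flattened tile plays the role of a word in a sentence. By the same arithmetic, a 224 x 224 image cut into 16 x 16 patches becomes a sequence of 196 tokens.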
The first time I read it, I made a mind map based on this classification framework. Later, when I needed to look up a paper, I could just toss it into this framework and quickly find similar work. It doubled my efficiency! Tell me that’s not worth it.
---
My pitfall record: Never try to plow through from the first chapter!
Let me give you the conclusion first: Don't try to plow through from the first chapter.
Why? That’s how I read it the first time. Abstract, introduction, background, classification—read it in order. Before I got five pages in, I was almost asleep. The paper is stuffed with mathematical symbols, activation function comparisons scattered everywhere, and the citation density is so high that every sentence has [1][2][3] hanging off it. After you finish, your mind is a blank page, like you never read it at all.
The second time I changed my strategy: I skipped to the end first to read the future directions, then went back to the chapter on attention improvements. Everything clicked instantly!
How exactly should you read it? Let me share three methods with you:
First, when you encounter a variant (like Sparse Transformer), immediately search for its code or someone else’s reproduction notes. Over on Zhihu, the series by “迷途小书僮” is written following the structure of this survey. Each section is broken down and explained carefully. Read them together and you’ll save yourself a ton of effort. Don’t try to tough it out alone.
Second, for the part on attention complexity
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.