CAS Zone 1 Top Journal | DilateFormer (English)

By CaelLee | 3 min read

Generated: 2026-06-21 07:05:22

---

After a Decade in AI, I Can Say This: 90% of the Compute You Spend Is Just to Fool Yourself

Last month, a friend called me in the middle of the night, his voice hoarse.

He'd been tuning an industrial defect detection model, YOLOv8, swapping in three different attention modules. The metrics he wanted to improve? They didn't budge. But his GPU memory blew up first. He slammed the table: "Is there something you can just plug in and use? No performance loss, and it doesn't burn up my card?"

I told him yes.

He absolutely didn't believe me.

It's this year's CAS Zone 1 paper, DilateFormer — the core module is MSDA, Multi-Scale Dilated Attention. Just from the name, it sounds like another Frankenstein thing, right? Honestly, I thought so too at first. Attention mechanisms are so played out now — what weird variant haven't we seen?

But when I tested it, mAP went up by 0.06.

Note: that's without any increase in computation. I thought I was seeing things, so I tested again — GPU memory actually dropped.

Later? I had the team refactor the entire project. Guess what? Results were solid, and the model could run on cheaper cards. That's when I really started to question something:

Do we really need such a huge attention range?

---

Your Model Isn't "Seeing" the World — It's Just Counting Pixels

When ViT first came out, everyone went crazy. Global attention! Every patch connected to every other patch! Sounds so impressive.

But at what cost? O(n²). A 224x224 image cut into 16x16 patches gives 196 patches, which is barely manageable. Throw in a segmentation task with 1024x1024 input and thousands of patches, and your GPU memory is dead on arrival.
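To put numbers on that, here is a minimal sketch that counts query-key pairs for one global attention layer. The 16x16 patch size is an assumption (it is the size that yields 196 patches at 224x224), and the helper names are made up for illustration.

```python
def patch_count(image_size: int, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches in a square image.
    patch_size=16 is an assumption, chosen because 224 / 16 = 14 and 14 * 14 = 196."""
    return (image_size // patch_size) ** 2


def global_attention_pairs(n: int) -> int:
    """Every patch attends to every patch, so one layer scores n * n pairs."""
    return n * n


for size in (224, 1024):
    n = patch_count(size)
    print(f"{size}x{size}: {n} patches, {global_attention_pairs(n):,} query-key pairs")
# 224x224: 196 patches, 38,416 query-key pairs
# 1024x1024: 4096 patches, 16,777,216 query-key pairs
```

The patch count grows about 21x between the two inputs, but the pair count grows by the square of that. This is why segmentation-sized inputs hurt so much.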

But think about it — is all that computation actually useful?

I pulled up the attention visualization from the DilateFormer paper and laughed out loud. The red square is the query, and the high-scoring keys are all crammed into a tight local cluster. The distant patches? Their scores are pitifully low. Plain and simple: most of the compute is spent weighing things you never need.

I ran the same visualization on a small model myself. Exactly the same.

Look at what low-level features are doing: they just need to know that "this cat ear" and "the fur next to it" are connected. They don't need to know what the background sofa has to do with the cat ear. Why force the model to compute all that distant background? Isn't that exhausting?

It's counterintuitive, right? Here we are in 2023, and someone's telling you "maybe don't compute all the attention"?

Do the math: a 12-layer network. The first six layers are all about "local + sparse" patterns. Swap those six layers to MSDA — each layer only computes keys sampled sparsely within a window. The compute you save is enough for the last six layers to do real global interaction.
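A rough back-of-the-envelope version of that math, counting only query-key score pairs. Everything here is an assumption for illustration: the same 196 tokens in every layer, 9 sampled keys per query in the sparse layers, and hypothetical helper names. Projections and MLPs are ignored.

```python
def network_pairs(n: int, sparse_layers: int, global_layers: int,
                  keys_per_query: int = 9) -> int:
    """Query-key pairs scored across the whole network.
    Sparse layers score n * keys_per_query pairs; global layers score n * n.
    keys_per_query=9 is an assumed value (a 3x3 sampling pattern)."""
    sparse = sparse_layers * n * keys_per_query
    dense = global_layers * n * n
    return sparse + dense


n = 196  # assumed token count, kept constant across layers for simplicity
all_global = network_pairs(n, sparse_layers=0, global_layers=12)
hybrid = network_pairs(n, sparse_layers=6, global_layers=6)

print(all_global)                     # 460992
print(hybrid)                         # 241080
print(round(hybrid / all_global, 3))  # 0.523
```

Under these assumptions the six sparse layers are almost free next to the six global ones, so the hybrid spends roughly half the attention budget of the all-global network.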

That's what I call saving your best steel for the blade's edge: spend the compute where it actually matters. Way smarter than brute-forcing global attention.

---

How MSDA Actually Saves Compute — Let Me Break It Down For You

MSDA's approach is actually dead simple. You'll get it on the first read.

Project the input X into Q, K, V. Split along channels into n heads. Give each head a different dilation rate — say 1, 2, 3 — and sample keys and values sparsely within a sliding window. Compute self-attention for each head separately, concatenate, and pass through a linear layer.
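The steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: Q, K, V are taken as already projected, the window is a 3x3 sampling pattern, out-of-bounds keys are simply skipped, and the final linear layer is left out. All function names are hypothetical.

```python
import math


def dilated_neighbors(i, j, H, W, kernel, dilation):
    """Flat indices of the kernel x kernel keys around (i, j), spaced `dilation` apart.
    Border handling is an assumption of this sketch: out-of-bounds positions are skipped."""
    half = kernel // 2
    return [
        (i + di * dilation) * W + (j + dj * dilation)
        for di in range(-half, half + 1)
        for dj in range(-half, half + 1)
        if 0 <= i + di * dilation < H and 0 <= j + dj * dilation < W
    ]


def head_attention(q, k, v, H, W, kernel, dilation):
    """Sliding-window attention for one head. q, k, v are lists of H*W vectors."""
    d = len(q[0])
    out = []
    for i in range(H):
        for j in range(W):
            query = q[i * W + j]
            idx = dilated_neighbors(i, j, H, W, kernel, dilation)
            # Scaled dot-product scores against the sampled keys only.
            scores = [sum(a * b for a, b in zip(query, k[n])) / math.sqrt(d) for n in idx]
            top = max(scores)  # subtract the max for a numerically stable softmax
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            out.append([sum(w / total * v[n][c] for w, n in zip(weights, idx))
                        for c in range(d)])
    return out


def msda(q, k, v, H, W, dilations=(1, 2, 3), kernel=3):
    """Split channels into one head per dilation rate, attend, then concatenate."""
    d = len(q[0]) // len(dilations)
    heads = []
    for h, rate in enumerate(dilations):
        def cut(x, h=h):
            return [vec[h * d:(h + 1) * d] for vec in x]
        heads.append(head_attention(cut(q), cut(k), cut(v), H, W, kernel, rate))
    # Concatenate the head outputs per position (final linear layer omitted).
    return [sum((head[n] for head in heads), []) for n in range(H * W)]
```

Usage on a 5x5 feature map with 6 channels, two per head:

```python
import random

random.seed(0)
H, W, C = 5, 5, 6
q = [[random.random() for _ in range(C)] for _ in range(H * W)]
k = [[random.random() for _ in range(C)] for _ in range(H * W)]
v = [[random.random() for _ in range(C)] for _ in range(H * W)]
out = msda(q, k, v, H, W)
print(len(out), len(out[0]))  # 25 6
```

A real implementation would batch this on the GPU rather than loop in Python; the loops are only there to make the sampling pattern visible.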

Two key points:

First, the window caps how many keys each query sees, so the cost drops from n² query-key pairs to n × w², where w is the window side and w² is far smaller than n. Second, different dilation rates give each head a different "field of view": some have dense small windows, others sparse large windows. You cover local details and slightly broader context all at once.
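To make the "field of view" point concrete, here is a small sketch assuming a 3x3 sampling pattern per head (an assumption, as is the helper name). The number of keys stays fixed while the area they are spread over grows with the dilation rate.

```python
def field_of_view(kernel: int, dilation: int) -> int:
    """Side length of the area covered by `kernel` samples spaced `dilation` apart."""
    return (kernel - 1) * dilation + 1


kernel = 3  # assumed window side; each head samples kernel * kernel keys
for rate in (1, 2, 3):
    span = field_of_view(kernel, rate)
    print(f"dilation {rate}: {kernel * kernel} keys spread over a {span}x{span} area")
# dilation 1: 9 keys spread over a 3x3 area
# dilation 2: 9 keys spread over a 5x5 area
# dilation 3: 9 keys spread over a 7x7 area
```

Each head pays for the same 9 keys per query, so widening the view costs nothing extra in attention scores.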

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
