Recommending an Interactive Attention Visualization Tool! My T (English)

Generated: 2026-06-21 18:00:08

---

I Found a Tool That Can "See Through" the Transformer's Brain! It Completely Wrecked Me 😭

Remember the first time you tried to learn Transformer and had a complete meltdown?

Let me start with mine.

I was on the toilet scrolling through the masterpiece Attention Is All You Need, and I dove in full of confidence. Three days later—the math I could kind of fake my way through, but when it came to the actual code? Man, I was totally lost.

How do Q, K, and V actually pair up? Twelve attention heads—they say each one is like a "reading comprehension officer" from a different angle? But who handles grammar and who handles semantics? Nobody tells you!

And the worst part? The online tutorials either show you a single dull static diagram of one head and nothing else, or they dress everything up as a galaxy star map with fancy 3D effects that look good in a screenshot for your WeChat Moments but fall apart the moment you try to debug.

Infuriating, right?

So I dreamed of having a tool—a real tool that can show you what's going on inside the model's brain.

Last month, I spent three days testing the main Transformer visualization tools I could find, and I'm combining that with all the pitfalls I ran into while building my own. Today, I'm giving you the full report.

---

So Which Tools Actually Deliver?

I tested three:

Transformer Explainer — built by Georgia Tech + IBM
DODRIO — also from Georgia Tech
My own little project

Let's start with the first one: Transformer Explainer. This thing really delivers. I'm genuinely impressed.

It runs a small GPT-2 model directly in your browser (124M parameters—not huge, but enough). You type in a sentence, and it shows you how data flows through every component in real time—from embedding to Transformer blocks to predicting the next token. You see the whole process crystal clear.

And the Sankey diagram? I literally screamed!

Picture this: you input the phrase "the cat sat on the" and it immediately shows you how each token's attention weight is distributed to the other tokens. Is "the" focusing on "cat" or "sat"? You can see it at a glance. No guessing.
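If you want to see where those attention weights come from, here's a minimal sketch in plain Python. It is not Transformer Explainer's code: the 2-dimensional query and key vectors are made-up numbers, and it skips the learned projections and the causal mask that GPT-2 actually uses. It only shows the core step: dot each query with every key, scale by the square root of the dimension, and softmax each row.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Return one row of weights per query: softmax(q . k / sqrt(d))."""
    d = len(keys[0])
    rows = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        rows.append(softmax(scores))
    return rows

# Hypothetical query/key vectors for three tokens (made-up numbers, not GPT-2's).
tokens = ["the", "cat", "sat"]
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]

weights = attention_weights(Q, K)
for token, row in zip(tokens, weights):
    # Each row sums to 1: it is how this token spreads its attention.
    print(token, [round(w, 3) for w in row])
```

Each printed row is exactly the kind of thing the tool draws: one token, and how its attention is split across the others.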

I tried a sentence: "The hungry cat caught a mouse in the garden."

Guess what?

When processing "caught," attention head 7 in layer 3 put an insanely high weight on "cat." That suggests something: this particular head, at this layer, seems to be tracking the relationship between the subject and the verb!

Think about it—being able to see this in real time while tuning the model? It's like having a God's-eye view.

DODRIO takes a different approach.

It draws attention heads from different layers as colored bubble dots. The deeper the color, the higher the semantic score; the bigger the bubble, the stronger the significance. One glance and you know which heads care about syntax (like the relationship between a preposition and the preceding noun) and which heads are tracking semantics (like whether "Xiao Ming" and "he" refer to the same person).

Honestly, the first time I used DODRIO to look at BERT's attention distribution, I almost slapped my thigh.

I used to think that when the model does classification, it looks at the whole sentence from start to finish. But no—it only fixates on a few key words! The weights everywhere else are practically zero! That's when it hit me: The model isn't reading the whole sentence—it's locking onto just two or three keywords.

See, a lot of the time we think the model is "understanding," but really it's just "locking on."
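One rough way to put a number on that "locking on" feeling is to ask how much of a token's attention goes to its top few targets. A small sketch, assuming you've already exported one row of attention weights; the `top_k_mass` helper and the row below are my own made-up illustration, not something DODRIO provides.

```python
def top_k_mass(weights, k):
    """Fraction of total attention held by the k largest weights in a row."""
    return sum(sorted(weights, reverse=True)[:k]) / sum(weights)

# Hypothetical attention row over an 8-token sentence (made-up numbers).
row = [0.02, 0.55, 0.01, 0.30, 0.03, 0.04, 0.02, 0.03]

# A value close to 1 means almost all attention sits on just k tokens.
print(round(top_k_mass(row, 2), 2))
```

If the top two tokens hold most of the mass, the head is fixating on a couple of words rather than reading evenly.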

---

Here's the Straight-up Comparison

| Dimension | Transformer Explainer | DODRIO | My Project |
| --- | --- | --- | --- |
| Real-time | Runs GPT-2 in the browser, real-time inference | Precomputed weights | Static, no movement |
| Setup | Open a webpage and go | Has a demo to play with | Requires local code |
| Interaction depth | Adjust temperature, expand math operations | Filter specific heads/layers | Only click fixed positions |
| Beginner friendly | Extremely: goes from macro to micro step by step | Requires some foundation | Okay for personal use |

Wait—I have to say a bit more about the temperature parameter.

Transformer Explainer lets you adjust temperature in real time. I turned it from 1.0 down to 0.1, and the output distribution went straight from "some exploration" to "almost completely deterministic." At low temperature, the model almost always picks the token with the highest probability; at high temperature, it can even throw out rare words.

I always thought temperature just controlled "randomness." But after playing with it interactively, I realized what it actually does: the logits are divided by the temperature before the softmax, so temperature controls how steep the resulting distribution is! Low temperature makes the distribution steeper, pushing the top probability even higher; high temperature flattens the distribution, giving rare words a chance.

I recorded the process:

At temperature=0.1, input "To be or not to," and it appends "be" with 99% probability.

At temperature=2.0, boom—out comes "banana."
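You can reproduce the shape of this effect without any model at all. Here's a minimal sketch of softmax with temperature in plain Python. The three candidate tokens and their logits are made-up numbers for illustration (not GPT-2's real outputs), and `softmax_with_temperature` is just a name I picked.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate next tokens (made-up numbers).
candidates = ["be", "go", "banana"]
logits = [3.0, 2.0, 0.0]

for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    print(t, {c: round(p, 3) for c, p in zip(candidates, probs)})
```

With these toy logits, the top token takes nearly all the probability at 0.1, while at 2.0 the unlikely "banana" gets a noticeably bigger share. Sampling from that flatter distribution is how a rare word can slip out.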

See how crucial that hands-on experience is for understanding the essence of generative models?

---

The Moments That Completely Broke Me

First moment.

I typed "苹果好吃" (apples are delicious) into Transformer Explainer and looked at GPT-2's attention allocation for "好吃."

I expected the model to focus on "苹果." But what happened?

The biggest weight was actually on "好"!

I froze for a second, looked into it, and discovered that GPT-2 uses byte-level BPE tokenization (byte-pair encoding), so "好吃" is not a single token: it gets split into smaller pieces, at least "好" and "吃." On top of that, GPT-2 is a causal model, so when it processes "好," the token "吃" is still in the future and is masked out! It can only rely on the context up to that point.
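The "can only rely on the context before it" part is the causal mask that GPT-style models apply in self-attention: position i may attend to positions 0 through i and nothing later. A minimal sketch, with made-up raw scores and a helper name (`causal_attention`) of my own choosing:

```python
import math

def causal_attention(scores):
    """Softmax each row over positions 0..i only; later positions get weight 0."""
    rows = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # the token itself plus everything before it
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        rows.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return rows

# Hypothetical raw scores for a 3-token sequence (made-up numbers).
# The large 9.0 entries sit on future positions and should be ignored.
scores = [
    [0.5, 9.0, 9.0],
    [1.0, 2.0, 9.0],
    [0.3, 0.1, 0.2],
]
masked = causal_attention(scores)
for row in masked:
    print([round(w, 3) for w in row])
```

Notice that the first token has nowhere to look but itself, no matter how high its scores for later tokens would have been.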

That example woke me up in a big way: The impact of tokenization strategy on model behavior is way bigger than you think! You'd never truly grasp these details just by reading papers.
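To get a feel for why Chinese text fragments so easily under a byte-level tokenizer, look at the raw UTF-8 bytes. This sketch does not run GPT-2's tokenizer or reproduce its merges; it only shows the byte sequence that a byte-level BPE starts from. The `utf8_byte_counts` helper is a name I made up.

```python
def utf8_byte_counts(text):
    """Return (character, number of UTF-8 bytes) for each character in text."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]

for ch, n in utf8_byte_counts("好吃"):
    print(ch, n)

for ch, n in utf8_byte_counts("cat"):
    print(ch, n)
```

Each of these Chinese characters takes several bytes where an ASCII letter takes one, so a vocabulary learned mostly on English text has far fewer ready-made merges for them.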

Second moment.

I used DODRIO to compare attention changes for the same sentence across different layers.

Layer 1: attention distribution is fairly sparse—each word mostly focuses on itself and its neighbors.

By layer 8: whoa—the attention distribution becomes extremely complex, with long-range dependencies.

That gradual shift from local to global: I'd read about it in papers a hundred times and never really understood it. But after one hands-on session, I got it in three seconds.
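If you want a number for "local versus global," one simple option is the mean attention distance: how far away, on average, the tokens a position attends to are. This is my own rough sketch, not a metric DODRIO reports, and the two 4-token attention maps are made-up numbers standing in for an early layer and a deeper one.

```python
def mean_attention_distance(weights):
    """Average |i - j| weighted by attention from query i to key j, averaged over queries."""
    per_query = []
    for i, row in enumerate(weights):
        per_query.append(sum(w * abs(i - j) for j, w in enumerate(row)))
    return sum(per_query) / len(per_query)

# Hypothetical attention maps (made-up numbers; each row sums to 1).
local_map = [
    [1.0, 0.0, 0.0, 0.0],
    [0.3, 0.7, 0.0, 0.0],
    [0.0, 0.4, 0.6, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
global_map = [
    [1.0, 0.0, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.6, 0.2, 0.2, 0.0],
    [0.5, 0.2, 0.2, 0.1],
]

print("local:", round(mean_attention_distance(local_map), 3))
print("global:", round(mean_attention_distance(global_map), 3))
```

A small value means tokens mostly look at themselves and their neighbors; a larger one means attention is reaching further across the sentence.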

---

Here's My Recommendation

If you're a beginner—just getting into Transformers, still not sure what Q, K, V are each for?

Go straight to Transformer Explainer.

Its Sankey diagram and multi-level abstraction design are practically made for beginners. No installation, just open the webpage and play. Start with the macro flow chart, then expand a Transformer block, then click into the math details. This gradual deepening experience is a hundred times more comfortable than the way I used to grind through papers.

If you want to deeply understand the attention mechanism, use both tools.

Use Transformer Explainer for the overall process, and DODRIO for differences between attention heads. Use DODRIO's overview to quickly locate those "interesting" attention heads, then verify them in Transformer Explainer.

If you're like me and have analysis paralysis—stop agonizing, just go with Transformer Explainer.

How good is this tool? It's not just a visualization: it lets you adjust temperature, change inputs, and run real-time inference. The paper was presented at IEEE VIS 2024 (a top visualization conference), and the code is open source. I now use it for both teaching and debugging.

GitHub: https://github.com/poloclub/transformer-explainer

Website: https://poloclub.github.io/transformer-explainer/

---

One Final Thought

I'm a straightforward person.

If something's good, I say so. If something's bad,

My rating: Transformer Explainer ⭐⭐⭐⭐⭐, DODRIO ⭐⭐⭐⭐, My Project ⭐⭐⭐


Cael Lee
