多模态:模型架构 (English)

Generated: 2026-06-22 08:36:18

---

Alright, here it is—using your original data but with a completely different soul.

---

Don’t Be Fooled by “Bigger Is Better”! 5 Counter‑Intuitive Truths About Multimodal Model Architectures, Paid for With Real Blood and Treasure

You know what? Every time I publish a piece on multimodal model deployment, my DMs blow up. Every single one asks, “Man, how do I choose the architecture? I’m going crazy over benchmark tables!”

I get it. I really do. The comparisons you see online come with dozens of parameter variants—DocVQA, ChartQA—scores so high they look unreal. But once you actually run them yourself, it’s a whole different story! After all the back and forth, it really comes down to these four questions:

Visual encoder: 300M vs. 6B, 20× the size—does it also give 20× the performance?
Connector: from Linear to Q‑Former to MLP—why did it loop all the way back?
Dynamic resolution sounds amazing, but my GPU is about to explode—what do I do?
From LLaVA to Qwen2‑VL—what was all that fuss in the last two years about?

See? Each one hits harder than the last. This time I’ve dug out all the internal test data and every pothole I hit along the way. No fluff, every conclusion backed by version numbers and real‑world scenarios.

Let me spoil three counter‑intuitive takeaways:

The visual encoder isn't always better when bigger. 1.8B is the sweet spot for cost‑effectiveness.
All the connectors ended up back at MLP—not because we regressed, but because dynamic resolution killed Q‑Former.
Dynamic resolution is amazing, but during inference the tokens can blow up your VRAM—I learned that the hard way.

Ready? Let’s dive.

---

1. How Big Should the Visual Encoder Be? I Tested It, and the Answer Surprised Me

Let me be real: the bigger the visual encoder, the better the fine‑grained capability, but the cost‑effectiveness inflection point is around 1.8B.

How did I get that conclusion? Last year I ran an internal comparison. We took the same batch of OCR data and tested CLIP ViT‑L (300M), CLIP ViT‑G (1.8B), and InternViT‑6B. Everything else was identical: LLaVA‑1.5’s MLP connector + Qwen‑7B LLM.

When the results came in, even I was shocked.

Going from 300M to 1.8B, document understanding (DocVQA) jumped from 72 to 81! Beautiful! A full 9‑point improvement.

But from 1.8B to 6B? Only up to 85. The gain was less than a third of the previous step.

And the price? For the 6B version, inference TTFT (time‑to‑first‑token) was 3× slower, and VRAM went from 8GB all the way to 28GB!

Think about it—in a production environment, is it worth paying several times more for just 4 extra points?

So when do you actually need the big model? I’ve been burned in two scenarios where it really pays off:

Fine‑grained visual understanding: e.g., a 2‑point font on an invoice, or pin labels on a circuit diagram. The 300M model often misreads characters; the 6B almost never makes a mistake.
Dense charts: line charts overlaid with bar charts and data labels—models below 1.8B can’t even read the axis labels correctly.

But for everyday tasks like “what objects are in this picture?” or “a cat on a sofa,” the 300M CLIP ViT‑L is more than enough. Some teams now blindly push “bigger visual encoder is always better,” and I think that’s lazy—they haven’t invested in data quality, connector design, or training strategies, so they just throw a huge model at the problem. It’s basically solving with brute force.

That makes me wonder: do you really need 6B? Or are you just trying to buy a “premium feel”?

---

2. A Brief History of Connectors: Linear → Q‑Former → MLP, and I Ended Up Choosing MLP

The connector story is an interesting one. My earliest experiments were with LLaVA (early 2023), which used a linear projection—it directly mapped the ViT output tokens into the LLM’s embedding space. It worked, but image‑text alignment? Pretty coarse. Ask the model to describe “a red car parked in front of a white car,” and it often said “two cars.”

Then BLIP‑2 came along with Q‑Former, using 32 learnable queries to compress visual features. Wow—the token count dropped from 256 to 32, cutting computation by 70%! I used it on a low‑compute project, and for coarse‑grained dialogue (like “what’s in the picture?”) the response time improved a lot. Pretty good.

But! Once I tried it on high‑resolution scenes, it immediately fell apart. 32 tokens are okay for a rough idea, but what about details? We ran an experiment: we split a 1000‑dpi engineering drawing into 4 sub‑images, compressed each with Q‑Former to 32 tokens, then concatenated them. The model couldn’t even see the dimension labels! Because the information was lost during compression—compression is an information bottleneck, and in practice that hits you hard.

That’s why later models like LLaVA‑1.5 and InternVL went back to MLP. It wasn’t a regression; it’s because dynamic resolution solved the token explosion problem, making Q‑Former’s compression advantage largely unnecessary. MLP has no information loss, and it’s dead simple—just two linear layers with an activation. Why not choose it?

My current default is: if the visual encoder is ≤1.8B, use MLP. If you really have to use Q‑Former, you must pair it with a high‑resolution strategy and keep at least 144 visual tokens (definitely don’t copy BLIP‑2’s 32!). I only settled on these ratios after countless tuning sessions.

---

3. Dynamic Resolution: Amazing, but If You Can’t Control the Tokens, It’s a Disaster

Dynamic resolution tackles the biggest headache since ViT was born—every image must be resized to 224×224. Think about it: a 4K image shrunk to 224—all small text is blurred, OCR becomes useless.

InternVL and Qwen2‑VL came up with a solution: tiling. I tested Qwen2‑VL on a 1024×768 document screenshot:

Without dynamic resolution: resized to 448×448, text shrunk to 4 pixels, OCR accuracy… only 31%.
With dynamic resolution: split into 4 448×448 sub‑images + 1 global thumbnail, each sub‑image encoded independently, OCR accuracy 92%!

That’s how dramatic the difference is. Good, right? The cost? Token count explodes. That single image produced 1500–2000 visual tokens. And video? Frames multiply that—8 frames, 12 000 tokens, VRAM instantly maxed out!

During production inference on 4×H100 with vLLM, I hit a nasty pitfall: with default settings, a 10‑second video had a prefill time of 45 seconds! Why? Too many visual tokens filled up the KV cache, forcing constant swapping.

What can you do? From my experience, these three tricks are the most effective:

First, request tiering. Build a lightweight router that checks the precision requirement of each request:

“Set the AC to 26°C”—just resize to 448×448, a few hundred tokens.
“Change the cell in the third row, fifth column of the blue table to red”—enable high‑resolution tiling, slower, but guaranteed accurate localization.

**Second,

多模态:模型架构 (English)

多模态:模型架构 (English)

Don’t Be Fooled by “Bigger Is Better”! 5 Counter‑Intuitive Truths About Multimodal Model Architectures, Paid for With Real Blood and Treasure

1. How Big Should the Visual Encoder Be? I Tested It, and the Answer Surprised Me

2. A Brief History of Connectors: Linear → Q‑Former → MLP, and I Ended Up Choosing MLP

3. Dynamic Resolution: Amazing, but If You Can’t Control the Tokens, It’s a Disaster

Cael Lee

Ready to get started?