Home / Blog / 多模态:模型架构 (English)

多模态:模型架构 (English)

By CaelLee | | 5 min read

多模态:模型架构 (English)

Generated: 2026-06-22 08:36:18

---

Alright, here it is—using your original data but with a completely different soul.

---

Don’t Be Fooled by “Bigger Is Better”! 5 Counter‑Intuitive Truths About Multimodal Model Architectures, Paid for With Real Blood and Treasure

You know what? Every time I publish a piece on multimodal model deployment, my DMs blow up. Every single one asks, “Man, how do I choose the architecture? I’m going crazy over benchmark tables!”

I get it. I really do. The comparisons you see online come with dozens of parameter variants—DocVQA, ChartQA—scores so high they look unreal. But once you actually run them yourself, it’s a whole different story! After all the back and forth, it really comes down to these four questions:

See? Each one hits harder than the last. This time I’ve dug out all the internal test data and every pothole I hit along the way. No fluff, every conclusion backed by version numbers and real‑world scenarios.

Let me spoil three counter‑intuitive takeaways:

Ready? Let’s dive.

---

1. How Big Should the Visual Encoder Be? I Tested It, and the Answer Surprised Me

Let me be real: the bigger the visual encoder, the better the fine‑grained capability, but the cost‑effectiveness inflection point is around 1.8B.

How did I get that conclusion? Last year I ran an internal comparison. We took the same batch of OCR data and tested CLIP ViT‑L (300M), CLIP ViT‑G (1.8B), and InternViT‑6B. Everything else was identical: LLaVA‑1.5’s MLP connector + Qwen‑7B LLM.

When the results came in, even I was shocked.

Going from 300M to 1.8B, document understanding (DocVQA) jumped from 72 to 81! Beautiful! A full 9‑point improvement.

But from 1.8B to 6B? Only up to 85. The gain was less than a third of the previous step.

And the price? For the 6B version, inference TTFT (time‑to‑first‑token) was 3× slower, and VRAM went from 8GB all the way to 28GB!

Think about it—in a production environment, is it worth paying several times more for just 4 extra points?

So when do you actually need the big model? I’ve been burned in two scenarios where it really pays off:

But for everyday tasks like “what objects are in this picture?” or “a cat on a sofa,” the 300M CLIP ViT‑L is more than enough. Some teams now blindly push “bigger visual encoder is always better,” and I think that’s lazy—they haven’t invested in data quality, connector design, or training strategies, so they just throw a huge model at the problem. It’s basically solving with brute force.

That makes me wonder: do you really need 6B? Or are you just trying to buy a “premium feel”?

---

2. A Brief History of Connectors: Linear → Q‑Former → MLP, and I Ended Up Choosing MLP

The connector story is an interesting one. My earliest experiments were with LLaVA (early 2023), which used a linear projection—it directly mapped the ViT output tokens into the LLM’s embedding space. It worked, but image‑text alignment? Pretty coarse. Ask the model to describe “a red car parked in front of a white car,” and it often said “two cars.”

Then BLIP‑2 came along with Q‑Former, using 32 learnable queries to compress visual features. Wow—the token count dropped from 256 to 32, cutting computation by 70%! I used it on a low‑compute project, and for coarse‑grained dialogue (like “what’s in the picture?”) the response time improved a lot. Pretty good.

But! Once I tried it on high‑resolution scenes, it immediately fell apart. 32 tokens are okay for a rough idea, but what about details? We ran an experiment: we split a 1000‑dpi engineering drawing into 4 sub‑images, compressed each with Q‑Former to 32 tokens, then concatenated them. The model couldn’t even see the dimension labels! Because the information was lost during compression—compression is an information bottleneck, and in practice that hits you hard.

That’s why later models like LLaVA‑1.5 and InternVL went back to MLP. It wasn’t a regression; it’s because dynamic resolution solved the token explosion problem, making Q‑Former’s compression advantage largely unnecessary. MLP has no information loss, and it’s dead simple—just two linear layers with an activation. Why not choose it?

My current default is: if the visual encoder is ≤1.8B, use MLP. If you really have to use Q‑Former, you must pair it with a high‑resolution strategy and keep at least 144 visual tokens (definitely don’t copy BLIP‑2’s 32!). I only settled on these ratios after countless tuning sessions.

---

3. Dynamic Resolution: Amazing, but If You Can’t Control the Tokens, It’s a Disaster

Dynamic resolution tackles the biggest headache since ViT was born—every image must be resized to 224×224. Think about it: a 4K image shrunk to 224—all small text is blurred, OCR becomes useless.

InternVL and Qwen2‑VL came up with a solution: tiling. I tested Qwen2‑VL on a 1024×768 document screenshot:

That’s how dramatic the difference is. Good, right? The cost? Token count explodes. That single image produced 1500–2000 visual tokens. And video? Frames multiply that—8 frames, 12 000 tokens, VRAM instantly maxed out!

During production inference on 4×H100 with vLLM, I hit a nasty pitfall: with default settings, a 10‑second video had a prefill time of 45 seconds! Why? Too many visual tokens filled up the KV cache, forcing constant swapping.

What can you do? From my experience, these three tricks are the most effective:

First, request tiering. Build a lightweight router that checks the precision requirement of each request:

**Second,

C

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.

Ready to get started?

Get your API key and start building with 180+ AI models.

Get API Key Free