An Introduction to Mainstream Multimodal LLM Architectures: From LLaVA to Qwen3

Generated: 2026-06-20 15:26:18

---


The Evolution of Multimodal Models Over the Past Three Years Really Comes Down to Understanding This One Step

At the beginning of 2023, I dove into multimodal LLMs with great enthusiasm, thinking: "Just feed image features to the language model, how hard can it be?"

And then? I took a high-resolution poster, asked it "What's in the top-left corner," and it made up something that wasn't even in the image. "A cat sitting on a roof"—when the poster clearly showed a dog.

I was so mad I almost smashed my GPU.

Later I realized: making a language model truly "see" an image, rather than just "glance" at it, is a rabbit hole deeper than the Mariana Trench.

You know what all the fuss about multimodal models over the past three years really boils down to?

Just one thing—how to get language models to truly see image details.

Not "roughly," not "maybe," but down to the pixel level.

Alright, I'm going to spill everything: the pitfalls I fell into, the electricity bills I racked up, the GPUs I burned through. From the most basic backbone to the two different paths taken by LLaVA and Qwen3-VL—read this and you might save yourself months of detours.

---

First, Remember This: Every Multimodal LLM, When You Strip It Down, Has Three Parts

Don't be fooled by all the fancy names. MLLM, vision-language models, image-text understanding… peel off the wrapper, and it's just three pieces:

Vision Encoder — turns images into feature sequences. Early on, it was all CLIP ViT-L/14. Now SigLIP, InternViT are fighting for territory. It's a lively scene.
Connector — "translates" visual features into embeddings the language model can understand. Could be an MLP, a Q-Former, a stack of cross-attention layers… anything works, but the differences are huge.
Large Language Model (LLM) — fuses information and generates text.

How you connect these three pieces directly determines how well your model performs and whether your wallet can handle it.
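
To make the three-part split concrete, here is a minimal sketch that tracks tensor shapes only. It assumes a ViT with 14-pixel patches and a 1024-wide feature (as in CLIP ViT-L/14), a 4096-wide LLM (a 7B LLaMA-class model), and an MLP-style connector that keeps the token count; the function names are hypothetical and exist only for illustration.

```python
def vision_encoder_shape(image_size: int, patch_size: int = 14, vision_dim: int = 1024):
    """Shape of the patch-feature sequence a ViT produces for a square image."""
    patches_per_side = image_size // patch_size
    return (patches_per_side * patches_per_side, vision_dim)


def connector_shape(vision_shape, llm_dim: int = 4096):
    """An MLP-style connector keeps the token count and changes only the width."""
    num_tokens, _ = vision_shape
    return (num_tokens, llm_dim)


def llm_input_shape(visual_shape, num_text_tokens: int):
    """Visual embeddings are concatenated with text embeddings along the sequence axis."""
    num_visual, llm_dim = visual_shape
    return (num_visual + num_text_tokens, llm_dim)


vision = vision_encoder_shape(336)          # (576, 1024)
projected = connector_shape(vision)         # (576, 4096)
sequence = llm_input_shape(projected, num_text_tokens=32)  # (608, 4096)
print(vision, projected, sequence)
```

Note that a Q-Former-style connector would behave differently here: it changes the token count as well as the width.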

Here's the counterintuitive part: The first trap I fell into was obsessing over LLM selection while ignoring the connector.

In 2023, I was working on an OCR project using a two-layer MLP connector with LLaMA-7B. If the font was even slightly blurred, the model completely failed to recognize the text. At first, I thought the LLM wasn't strong enough, so I swapped in a bigger model—no improvement. Then I switched to LLaVA-1.5's two-layer MLP (with GELU activation), and the results shot up dramatically! That's when it hit me: the connector isn't just a dumb courier; it determines how much visual information gets distorted.

So if you're working on multimodal models, don't just brag about your LLM. First check if your connector is the bottleneck.

---

The LLaVA Approach: So Simple It's Ridiculous—Why Does It Win?

LLaVA 1.0's architecture is so simple it makes you question reality—CLIP ViT-L/14 + a single linear projection layer + Vicuna (fine-tuned from LLaMA).

Not even a nonlinear activation! My first thought: "This counts as innovation? Must be a grant grab, right?"

But it works. And it works really well.

The secret? Not the architecture—it's the data.

LLaVA used GPT-4 to generate 158,000 image-text instruction pairs for fine-tuning, and the model suddenly became capable of visual dialogue. This showed the industry a clear path: you don't need to make the vision encoder and LLM super complex. As long as the alignment is good enough and the data is plentiful and accurate, you can get great results.

By LLaVA-1.5, the connector was upgraded to a two-layer MLP, and it could already go toe-to-toe with big company models.
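
The difference between the two connectors is small enough to write out by hand. The sketch below uses plain Python lists, toy dimensions, and random weights standing in for trained ones; the helper names (`make_linear`, `linear_connector`, `mlp_connector`) are hypothetical, and only the layer layout (Linear for LLaVA 1.0, Linear then GELU then Linear for LLaVA-1.5) comes from the text above.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def make_linear(in_dim: int, out_dim: int, rng: random.Random):
    """Random weight matrix (out_dim rows) and zero bias; a stand-in for trained weights."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, layer):
    """Apply y = W x + b to a single feature vector."""
    weight, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def linear_connector(x, proj):
    """LLaVA 1.0 style: one linear projection, no activation."""
    return linear(x, proj)


def mlp_connector(x, fc1, fc2):
    """LLaVA-1.5 style: Linear -> GELU -> Linear."""
    hidden = [gelu(h) for h in linear(x, fc1)]
    return linear(hidden, fc2)


rng = random.Random(0)
vision_dim, llm_dim = 8, 16  # toy sizes; real models use widths such as 1024 and 4096
token = [rng.gauss(0.0, 1.0) for _ in range(vision_dim)]

proj = make_linear(vision_dim, llm_dim, rng)
fc1 = make_linear(vision_dim, llm_dim, rng)
fc2 = make_linear(llm_dim, llm_dim, rng)

print(len(linear_connector(token, proj)), len(mlp_connector(token, fc1, fc2)))
```

Both connectors map one visual token to one LLM-width embedding; the MLP simply adds a nonlinearity between two projections.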

Then I hit my biggest pitfall: resolution.

Standard ViT training uses resolutions of 224×224 or 336×336. If you directly resize a high-resolution poster to 224, you lose all the details! OCR becomes basically useless.
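
A bit of arithmetic shows why resizing hurts. The sketch below assumes a poster 1920 pixels wide with 24-pixel-tall text and a ViT with 14-pixel patches; these numbers are illustrative assumptions, not measurements from the project described here.

```python
def glyph_height_after_resize(glyph_px: float, image_long_side: int, target_side: int) -> float:
    """Height of a glyph after the image's long side is scaled down to target_side."""
    return glyph_px * target_side / image_long_side


def patches_per_side(resolution: int, patch_size: int = 14) -> int:
    """Number of ViT patches along one side of a square input."""
    return resolution // patch_size


for target in (224, 336):
    height = glyph_height_after_resize(glyph_px=24, image_long_side=1920, target_side=target)
    # Columns: input resolution, total patch tokens, glyph height in pixels after resizing
    print(target, patches_per_side(target) ** 2, round(height, 1))
```

Under these assumptions the text shrinks to about 2.8 pixels at 224 and 4.2 pixels at 336, far smaller than a single 14-pixel patch, so several characters end up blended into one token.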

LLaVA NeXT (LLaVA-1.6) came up with a solution called AnyRes: slice the high-resolution image into several patches, resize each patch to the ViT's native resolution and process them individually, plus a low-resolution global image, then stitch everything together and feed it to the LLM.
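
The slicing step can be sketched as a list of crop boxes. This is a simplified illustration with a fixed grid; the function name `anyres_views` is hypothetical, and the actual LLaVA-NeXT preprocessing (which chooses a grid from candidate resolutions and handles padding) is not reproduced here.

```python
def anyres_views(width: int, height: int, grid_cols: int, grid_rows: int):
    """Return (name, (left, top, right, bottom)) for one global view plus grid tiles.

    Every view is later resized to the encoder's native resolution (e.g. 336x336).
    """
    views = [("global", (0, 0, width, height))]
    for row in range(grid_rows):
        for col in range(grid_cols):
            left = col * width // grid_cols
            top = row * height // grid_rows
            right = (col + 1) * width // grid_cols
            bottom = (row + 1) * height // grid_rows
            views.append((f"tile_{row}_{col}", (left, top, right, bottom)))
    return views


for name, box in anyres_views(1920, 1080, grid_cols=2, grid_rows=2):
    print(name, box)
```

A 2×2 grid yields five views in total: the four tiles carry the detail, and the global view keeps the overall layout.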

Sounds great in theory, but reality is harsh.

The first time I ran AnyRes, I took a 1080p image and cut it into 4 patches (each resized to 336), plus the global image. At 576 tokens per 336×336 view (a 24×24 grid of 14-pixel ViT patches), that gave me 5×576 = 2880 tokens. Not too bad, right? But cut it into 9 patches and the token count doubles to 10×576 = 5760, self-attention cost roughly quadruples, and VRAM explodes. Plus, all those tokens are fed into the LLM's input layer, and the resulting attention and KV cache overhead made my RTX 4090 literally cry.
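
The token count and the KV cache it implies are easy to estimate. The sketch below assumes 576 tokens per view and a 7B LLaMA-class model in fp16 (32 layers, 32 KV heads, head dimension 128, no grouped-query attention); the defaults are assumptions for illustration, and the estimate covers visual tokens only.

```python
def anyres_token_count(num_tiles: int, tokens_per_view: int = 576) -> int:
    """Tiles plus one global view, each contributing tokens_per_view tokens."""
    return (num_tiles + 1) * tokens_per_view


def kv_cache_bytes(num_tokens: int, num_layers: int = 32, num_kv_heads: int = 32,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Keys and values (factor 2) stored for every layer, KV head and token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


for tiles in (4, 9):
    tokens = anyres_token_count(tiles)
    # Columns: tiles, visual tokens, KV cache size in MiB
    print(tiles, tokens, kv_cache_bytes(tokens) / 2**20)
```

Under these assumptions each token costs 0.5 MiB of KV cache, so 2880 visual tokens take about 1.4 GiB and 5760 take about 2.8 GiB before any text is added.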

LLaVA's workaround is to "compress before input"—downsampling high-resolution features to fewer tokens via interpolation. But fundamentally, it's still the old approach of stuffing everything into the LLM at once.
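
Here is a minimal sketch of that compression step: bilinear interpolation over one channel of a feature grid, written in plain Python. The 24×24 to 12×12 sizes and the pixel-center alignment are assumptions for illustration; the exact target sizes and interpolation settings LLaVA uses are not specified in this article.

```python
def bilinear_resize(grid, out_h: int, out_w: int):
    """Resize a 2D grid of floats with bilinear interpolation (pixel-center alignment)."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        # Map the output cell center back into input coordinates, clamped to the grid.
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# One channel of a 24x24 feature grid (576 tokens), reduced to 12x12 (144 tokens).
features = [[float(r * 24 + c) for c in range(24)] for r in range(24)]
small = bilinear_resize(features, 12, 12)
print(len(features) * len(features[0]), "->", len(small) * len(small[0]))
```

Each output token is a blend of its neighbors, which is exactly why this saves tokens and also why very fine detail can still be lost.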

LLaVA's contribution is proving that input-side optimization + high-quality data can push an ultra-minimalist architecture to the top tier. But it doesn't resolve the core contradiction: to see finer details, you need to feed more tokens, and the LLM's attention complexity is O(n²). That problem requires a different path to solve.
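
The quadratic term is easy to see in a rough FLOP count. The sketch below counts only the two attention matmuls (scores and weighted sum) and ignores projections, the MLP, and causal masking; the 4096 width and 32 layers are assumed values for a 7B LLaMA-class model.

```python
def attention_flops(num_tokens: int, hidden_dim: int = 4096, num_layers: int = 32) -> int:
    """FLOPs for Q·K^T and scores·V only: two matmuls of 2*n*n*d FLOPs each, per layer."""
    return num_layers * 4 * num_tokens * num_tokens * hidden_dim


base = attention_flops(2880)
for tokens in (2880, 5760, 11520):
    # Columns: token count, attention cost relative to 2880 tokens
    print(tokens, attention_flops(tokens) / base)
```

Doubling the tokens quadruples this term, and doubling again makes it sixteen times larger, which is the contradiction in a nutshell.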

---

Enter Qwen3-VL: Finally, Someone Stopped Banging Their Head Against the Input Layer

While LLaVA was piling on data to climb higher, the Qwen-VL series took a different road.

In 2024, Qwen2-VL started investing in dynamic resolution. By 2025, Qwen3-VL pushed architectural innovation to the extreme.

When I first read the Qwen3-VL tech report, I said to myself: "Finally, someone gets it!"

Its core idea, which the Qwen3-VL materials call DeepStack: instead of cramming all high-resolution features into the input layer at once, inject them into different layers of the LLM.

How exactly?

First, a low-resolution image (336×336) goes through the ViT to get global visual tokens, which are fed into the LLM's input layer.
High-resolution image features are extracted separately, but they are not sent to the input layer. Instead, they are injected directly into the LLM's intermediate layers at the visual token positions, with the same token count, so the sequence gets no longer. The injection is repeated every few layers; three or four injection points are enough.

The advantage is obvious: you don't need to increase the input token count just to get higher resolution. Fine-grained features take a side path, and the added computation only happens at the injection layers—the entire LLM doesn't bloat.
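
The injection idea can be sketched in a few lines. This is a toy illustration of the description above, not Qwen3-VL's actual implementation: the function name, the residual add, the choice of layers, and the identity stand-in for a transformer block are all assumptions.

```python
def run_llm_with_injection(hidden, visual_slice, highres_features, inject_at,
                           num_layers, layer_fn):
    """Run num_layers layers; before each layer in inject_at, add high-res features
    to the hidden states at the visual token positions (a residual add).

    The sequence length never changes: fine detail is added in place.
    """
    start, stop = visual_slice
    hidden = [list(vec) for vec in hidden]
    for layer_idx in range(num_layers):
        if layer_idx in inject_at:
            feats = highres_features[layer_idx]
            for pos in range(start, stop):
                extra = feats[pos - start]
                hidden[pos] = [h + e for h, e in zip(hidden[pos], extra)]
        hidden = layer_fn(layer_idx, hidden)
    return hidden


def identity_layer(layer_idx, hidden):
    """Stand-in for a real transformer block (hypothetical, does nothing)."""
    return hidden


# Toy setup: 4 visual tokens followed by 2 text tokens, width 3.
hidden = [[0.0] * 3 for _ in range(6)]
highres = {layer: [[1.0] * 3 for _ in range(4)] for layer in (0, 2, 4)}

out = run_llm_with_injection(hidden, (0, 4), highres, inject_at={0, 2, 4},
                             num_layers=6, layer_fn=identity_layer)
print(len(out), out[0], out[5])
```

The sequence stays at 6 tokens after three injections. With a real transformer block in place of `identity_layer`, the text positions would pick up the injected detail through attention rather than staying untouched.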

I tested Qwen3-VL-32B on a very hard OCR scenario—a poster with thin gray text in a tiny font that LLaVA-1.6 completely misread. Qwen3-VL not only read it correctly but also gave context descriptions of the background region. Most importantly, its inference speed was nearly twice as fast as LLaVA using AnyRes with 9

Cael Lee
