Anatomy of the Multimodal Large Model Qwen2-VL (English)

Generated: 2026-06-21 13:15:31

---

Qwen2-VL Anatomized: When Multimodal Models Start to "See" Video

Guess what? I recently had a model completely blow my mind.

Last month, I was processing a batch of client contracts. The PDF scans were full of stamps, signatures, and blurry tables: total chaos. How did I use to handle it? First, I’d extract the images, run OCR to convert them to text, then toss it all to a large model for analysis. By the end of that pipeline, the document structure was completely wrecked. The correspondence between section numbers and charts? Basically lost.

Then, I just threw a screenshot of the contract into Qwen2-VL and asked: “What are the acceptance criteria in the table on page three?”

Not only did it pull out the table contents, it even restored the hierarchy for me.

I was stunned right there. This thing… it’s got some serious chops.

---

The "Translator" Is Gone, But Understanding Runs Deeper

Speaking of which, something came to mind.

Back in 2015, I went to see the Mona Lisa at the Louvre in Paris. After queuing for two hours, I found myself behind ten layers of people. I pulled out my phone to look up some background info, only to discover—the Wi‑Fi signal in that area was absolutely terrible. "Connecting" to the network took forever.

Previous multimodal models were kind of the same. They all followed the "ViT + Connector + LLM" recipe. The connector in the middle was that "translator", responsible for converting image information into something the language model could understand.

But Qwen2-VL? Look at its architecture—the connector is almost gone.

I dug into the source code. The visual encoder is Qwen2VisionTransformerPretrainedModel, the language model is Qwen2VLModel, and the only thing sandwiched between them is something called PatchMerger. What’s inside? Just a LayerNorm plus a two-layer MLP that merges each 2×2 group of neighboring patches into a single token. Next to the rest of the model, it’s so small it’s negligible.
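To put a number on "negligible", here is a back-of-the-envelope parameter count. It is a sketch, not the library's code: I'm assuming the merger is a LayerNorm followed by a two-layer MLP over 2×2 merged patches, with a ViT width of 1280 and an LLM hidden size of 3584 (roughly the 7B configuration). The function names are made up for illustration.

```python
# Rough parameter count for a PatchMerger-style connector.
# Assumptions (not stated in the article): ViT width 1280, 2x2 patch
# merging, LLM hidden size 3584 (roughly the 7B configuration).

def linear_params(in_features, out_features, bias=True):
    """Weights plus optional bias of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def patch_merger_params(vit_dim=1280, merge=2, llm_dim=3584):
    """LayerNorm + Linear -> GELU -> Linear over concatenated patches."""
    merged = vit_dim * merge * merge      # 4 neighboring patches concatenated
    layer_norm = 2 * vit_dim              # scale and shift vectors
    mlp = linear_params(merged, merged) + linear_params(merged, llm_dim)
    return layer_norm + mlp


total = patch_merger_params()
print(f"{total:,} parameters ({total / 7e9:.2%} of a 7B model)")
```

Under these assumptions the connector comes in well under one percent of the model, which is the whole point of the section.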

This reminds me of a counter‑intuitive insight: Real translation is mutual understanding, not a relay station.

It’s like two people falling in love—at first you need translation software, but later you just communicate with your eyes. The ViT and LLM in Qwen2-VL have learned to "talk" to each other directly, so the translator naturally becomes redundant.

---

The Triple Play of Positional Encoding—This Design Is a Masterstroke

Alright, here comes the big one.

When I read the source code for M‑RoPE, I literally slapped my thigh: “Wow, they can do that?!”

Traditional RoPE is 1D—it can only encode one‑dimensional token positions. But images naturally have two dimensions—height and width—and video adds a time dimension. How did previous approaches deal with that? Either they brute‑forced absolute positional encoding, or they layered on a 2D‑RoPE. But handling video? Hah, total crash and burn.

Qwen2-VL’s solution: Directly decompose positional encoding into three dimensions—time, height, and width.

For example, imagine an image divided into a 2×2 grid of patches. The encoding for each patch would have three numbers:


Visual time positions: [0, 0, 0, 0]
Visual height positions: [0, 0, 1, 1]
Visual width positions: [0, 1, 0, 1]

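Each of the three components is rotated independently on its own slice of the rotary dimensions, and the slices sit side by side. For plain text the three components are identical, so the whole thing falls back to ordinary 1D RoPE. Now tell me, isn’t that elegant?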
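The 2×2 example above can be reproduced in a few lines. This is a minimal sketch of the indexing idea only: mrope_grid_positions and its start argument are hypothetical names of mine, and the real model additionally offsets these ids by the text that precedes the image.

```python
def mrope_grid_positions(t, h, w, start=0):
    """Return (time, height, width) position ids for a t x h x w patch grid.

    Patches are enumerated frame by frame, row by row, column by column.
    """
    time_ids, height_ids, width_ids = [], [], []
    for ti in range(t):
        for hi in range(h):
            for wi in range(w):
                time_ids.append(start + ti)
                height_ids.append(start + hi)
                width_ids.append(start + wi)
    return time_ids, height_ids, width_ids


# A single image split into a 2x2 grid: time stays constant.
image = mrope_grid_positions(1, 2, 2)
print(image)

# A two-frame video with the same 2x2 grid: only the time ids change per frame.
video = mrope_grid_positions(2, 2, 2)
print(video[0])
```

For a video, the height and width ids simply repeat for every frame while the time id advances, which is what lets the model tell "same place, later moment" apart from "different place".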

Where it really shines is video processing. I tested a street‑view video:

“Which lane was this car in at the 8th second?”

Older models couldn’t handle that at all, because positional encoding couldn’t capture the spatiotemporal location corresponding to “the 8th second.” But Qwen2-VL answered correctly—it actually perceives the motion trajectory of the same object throughout the video.

---

Dynamic Resolution—Not Just Hype, Truly Useful

At this point, you might be thinking: “These are all technical details. How do they affect me?”

Fine, let me give you another example.

Earlier, when dealing with images of different sizes, multimodal models required a fixed input resolution—say 448×448 or 224×224. Images with different aspect ratios were simply squashed, warping the text beyond recognition.

How does Qwen2-VL solve this? It uses a technique called "naive dynamic resolution."

You heard that right—it’s called “naive,” like a kid stacking building blocks. Images of different sizes keep their original proportions, are directly cut into patches, and then reassembled.

I tested it with two sets of images:

One was a 1920×1080 widescreen webpage screenshot.

The other was a 1080×1920 tall document screenshot.

Before? I’d have to crop or scale them to a uniform size, losing a huge chunk of information in the process.

Now? I just feed them in directly—same idea as variable‑length training for text. And guess what? The model handles them with no difference at all.

The reasoning behind this is straightforward: ViT processes patches anyway. Gathering patches from different images into a single sequence is just like stacking Lego bricks—different shapes, but the stacking logic is the same.
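Here is the Lego idea as a rough sketch, using my two test images. I'm assuming 14-pixel patches and 2×2 merging, so each side is rounded to a multiple of 28, and I'm ignoring any pixel budget the real preprocessor applies to very large images. resized_grid and pack are hypothetical helper names.

```python
from itertools import accumulate

PATCH = 14   # patch side in pixels (assumed)
MERGE = 2    # patches are later merged 2x2 (assumed)


def resized_grid(width, height, patch=PATCH, merge=MERGE):
    """Round each side to a multiple of patch*merge, return (rows, cols) of patches."""
    unit = patch * merge
    new_w = max(unit, round(width / unit) * unit)
    new_h = max(unit, round(height / unit) * unit)
    return new_h // patch, new_w // patch


def pack(images):
    """Lay the patches of several images end to end in one sequence.

    Returns each image's patch grid plus the cumulative boundaries that
    tell attention where one image stops and the next begins.
    """
    grids = [resized_grid(w, h) for w, h in images]
    patch_counts = [rows * cols for rows, cols in grids]
    boundaries = [0, *accumulate(patch_counts)]
    return grids, boundaries


grids, boundaries = pack([(1920, 1080), (1080, 1920)])
print(grids)        # one wide grid, one tall grid
print(boundaries)   # where each image starts and ends in the packed sequence
```

Both images keep their own aspect ratio and contribute the same number of patches, just arranged differently; the boundaries list is the only bookkeeping needed to keep them apart.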

---

A Treasure Map Dug Up from the Source Code

Speaking of which, I’d suggest all developers—don’t just read the paper, go read the source code.

Before you run inference, first look at the model’s input parameter structure:


model_inputs = {
 "input_ids": input_ids,
 "position_ids": position_ids,
 "past_key_values": past_key_values,
 "attention_mask": attention_mask,
 "pixel_values": pixel_values,
 "pixel_values_videos": pixel_values_videos,
 "image_grid_thw": image_grid_thw,
 "video_grid_thw": video_grid_thw,
 "rope_deltas": rope_deltas
}

See that? There’s a hidden parameter: rope_deltas.

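What’s that for? With M‑RoPE, an image’s tokens share position indices, so the largest position id in a prompt ends up smaller than the number of tokens. rope_deltas stores that offset for each sequence, so that during generation the position ids of newly decoded tokens continue from the right place, whether the prompt held a large image, a small one, or none at all.

One caveat from digging through the code: rope_deltas is computed by the model from the input layout, so it is not a knob you tune. If you want to trade resolution for VRAM on low‑resolution inputs, the lever to reach for is the processor’s min_pixels and max_pixels, which bound how many visual tokens each image produces. Details like that are easy to miss if you only read the paper.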
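To see where a value like rope_deltas could come from, here is a toy sketch. It assumes rope_deltas is the gap between the largest M‑RoPE position (plus one) and the plain sequence length, and that each new segment starts one past the previous segment's largest position id. build_positions and rope_delta are hypothetical helpers, not library functions.

```python
def build_positions(segments):
    """segments: list of ("text", n_tokens) or ("image", (t, h, w)) in order."""
    t_ids, h_ids, w_ids = [], [], []
    next_pos = 0
    for kind, spec in segments:
        if kind == "text":
            # Text tokens use the same id on all three axes.
            for i in range(spec):
                t_ids.append(next_pos + i)
                h_ids.append(next_pos + i)
                w_ids.append(next_pos + i)
            next_pos += spec
        else:
            t, h, w = spec
            for ti in range(t):
                for hi in range(h):
                    for wi in range(w):
                        t_ids.append(next_pos + ti)
                        h_ids.append(next_pos + hi)
                        w_ids.append(next_pos + wi)
            # The next segment starts after the largest id used so far.
            next_pos += max(t, h, w)
    return t_ids, h_ids, w_ids


def rope_delta(positions):
    """Offset between M-RoPE positions and the plain token count."""
    seq_len = len(positions[0])
    max_pos = max(max(axis) for axis in positions)
    return max_pos + 1 - seq_len


# 3 text tokens, a 2x2 image (4 tokens), then 2 more text tokens.
positions = build_positions([("text", 3), ("image", (1, 2, 2)), ("text", 2)])
delta = rope_delta(positions)
next_token_position = len(positions[0]) + delta
print(delta, next_token_position)
```

The sequence holds nine tokens but the positions only reach 6, so the offset is negative; adding it to the token count gives the position the first generated token should use.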

Also, inside the ViT, the processing logic for images and videos is completely unified—an image is treated as a video with two identical frames, going through the same 3D convolution.

What are the parameters? Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14))
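Because kernel and stride are identical, that convolution is just a non-overlapping cut into 2×14×14 blocks. The arithmetic below is a sketch of what follows from the Conv3d line above; patch_grid and conv_weights are names I made up, and I'm assuming input sizes have already been resized to multiples of the kernel.

```python
import math

KERNEL = (2, 14, 14)      # (time, height, width), taken from the Conv3d above
IN_CH, OUT_CH = 3, 1280   # RGB in, ViT width out


def patch_grid(frames, height, width):
    """Return the (t, h, w) grid of patches the convolution produces."""
    if frames == 1:
        frames = 2  # a still image is repeated to fill the temporal kernel
    kt, kh, kw = KERNEL
    if frames % kt or height % kh or width % kw:
        raise ValueError("input must be a multiple of the kernel size")
    return frames // kt, height // kh, width // kw


def conv_weights():
    """Number of weights in the patch-embedding convolution (bias ignored)."""
    return IN_CH * OUT_CH * math.prod(KERNEL)


print(patch_grid(1, 28, 28))    # a tiny still image
print(patch_grid(4, 56, 28))    # a tiny 4-frame clip
print(f"{conv_weights():,} weights")
```

The returned triple is the same kind of (t, h, w) grid that image_grid_thw and video_grid_thw carry in the input dictionary shown earlier.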

You ask how I know? I found it by reading the code.

---

Training Strategy—Why Is It So Powerful?

Qwen2-VL’s training is divided into three stages, and I looked into them.

Stage one: train only the ViT, using 600B tokens of image‑text pair data.

600B! Let that number sink in: not 600 million, not 6 billion, but 600 billion!

The initialization strategy is interesting: the language model uses Qwen2 parameters, and the vision encoder is based on DFN’s ViT. The coolest part—they replaced the ViT’s fixed positional encoding with RoPE‑2D.

The clever thing about this stage is that only the ViT is trained, so the vision encoder learns to align with the semantic space of the language model. The 600B scale also explains why you don’t need a complex connector later.

The ViT and LM have already been mathematically aligned.

Just like two people who have lived together for a long time—they understand each other with just a look, no words needed.

Stage two: multi‑task training, with all parameters unfrozen.

Stage three: instruction fine‑tuning, with the ViT frozen and only the language model updated.

After these three steps, the model can both understand complex images and keep up with conversational flow.

What impressed me most was its performance on the DocVQA and MathVista benchmarks. The hardest part of document‑style QA is tables and mixed layouts, so Qwen2-VL reaching SOTA there supports the dynamic resolution plus 2D‑RoPE design.

---

Some Heartfelt Judgments

After testing it for two weeks, here are a few points I think are valuable for developers.

First, if you’re doing document understanding or video analysis—Qwen2-VL is the most cost‑effective choice among open‑source options.

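The 72B version, with fp16 inference, needs roughly 144GB of VRAM for the weights alone (72 billion parameters at 2 bytes each), plus headroom for activations and the KV cache, so plan for a multi‑GPU node. That’s not expensive in the cloud.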
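The VRAM figure is simple arithmetic, and it's worth being able to redo it for other sizes and precisions. A minimal sketch, using nominal parameter counts and counting only the weights (activations and KV cache come on top); weight_memory_gb is a hypothetical helper.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed just to hold the weights, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# Nominal sizes; real checkpoints differ slightly from the round numbers.
MODELS = {"2B": 2e9, "7B": 7e9, "72B": 72e9}
PRECISIONS = {"fp16": 2, "int8": 1, "int4": 0.5}

for name, params in MODELS.items():
    row = ", ".join(
        f"{prec}: {weight_memory_gb(params, size):.0f} GB"
        for prec, size in PRECISIONS.items()
    )
    print(f"{name} -> {row}")
```

Swap in the exact parameter count from a checkpoint's config if you need more than a ballpark.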

The 2B

Cael Lee
