How MLLMs Understand Multimodal Information Is More Complicated Than You Think

Generated: 2026-06-22 17:26:11

---

Isn't it strange? I stared at that embedding space visualization for an entire afternoon without blinking. On the left was the figure from the paper—red dots, text features; blue dots, image features. Two clusters doing their own thing, separated like the Milky Way. On the right was a multimodal model I'd trained myself that supported audio input—even more chaotic: text, image, and audio forming a three-way standoff, each modality huddled in its own cluster, not even bothering to say hello.

My first thought: I messed up the hyperparameters. I tried several learning rates, ran it three times, got the exact same plot. Later it hit me—this isn't a bug, it's a feature!

Most people assume a Multimodal Large Language Model (MLLM) works like this: "a visual encoder extracts features, a projector converts them into text embeddings, and the result is fed into an LLM." Sounds logical, right? But trace how gradients actually flow through the projector: its only training objective is the autoregressive loss on the LLM's output. There's no direct constraint that forces its output features to look anything like text embeddings. In other words, the projector is like someone crossing a river by stepping on stones. Do you expect them to leap to the opposite bank in one go? All it needs is for the LLM to produce the right answer from its output, no matter how awkward the posture.
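To make the "only training objective" point concrete, here is the loss written out. The notation is mine, not taken from any specific paper: E is the vision encoder, P_φ the projector with parameters φ, θ the LLM's parameters, and y_1..y_T the target answer tokens.

```latex
\mathcal{L}(\phi) \;=\; -\sum_{t=1}^{T} \log p_{\theta}\!\left( y_t \,\middle|\, y_{<t},\; x_{\text{text}},\; P_{\phi}\big(E(x_{\text{img}})\big) \right)
```

Notice what is missing: there is no term that compares P_φ(E(x_img)) against any text embedding. The gradient reaching φ comes entirely from how well the LLM predicts the next token.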

That leads to two painful problems: the modality gap is already ridiculously large, and the semantic information in the projected features is paper-thin. Put those two together and you've got a real headache. The visualization above already shows it: image and text live in their own worlds, having a blast. The projector forcibly shoves them into a space of the same dimensionality, but their relative positions in that space? Sorry, still a galaxy apart.
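If you want a number instead of a scatter plot, one simple proxy is the distance between the centroids of the two modalities after unit-normalizing each embedding. This is only a rough sketch, and it is not the MIR metric. The function names and the toy 2-D vectors are made up for illustration; in practice you would pass in text token embeddings and projected image tokens from your own model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def modality_gap(text_embs, image_embs):
    """Euclidean distance between the centroids of the unit-normalized embeddings."""
    c_text = centroid([l2_normalize(v) for v in text_embs])
    c_image = centroid([l2_normalize(v) for v in image_embs])
    return math.dist(c_text, c_image)


# Toy 2-D embeddings, invented for illustration only.
text_embs = [[1.0, 0.1], [0.9, 0.0], [1.0, -0.1]]
image_embs = [[0.1, 1.0], [0.0, 0.8], [-0.1, 1.0]]

gap = modality_gap(text_embs, image_embs)
same = modality_gap(text_embs, text_embs)
print(f"text vs image gap: {gap:.3f}")
print(f"text vs text gap:  {same:.3f}")
```

A gap near zero would mean the two clusters share a center; two clusters pointing in different directions, like the toy data here, give a clearly larger value.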

There's a study called "Cross-Modal Projection in Multimodal LLMs Doesn't Really Project Visual Attributes to Textual Space" that ran a clever experiment: they fine-tuned the projector on a specific domain (like medicine) and then looked at the MLLM's final outputs. Accuracy did improve. But guess what? The projected image embeddings did not become any richer in domain information; if anything, the richness decreased. Doesn't that make it clear? The projector isn't carrying those high-level semantics. Who handles them? The LLM itself, downstream of the projector. The projector is at best a messenger that runs in and shouts, "Hey, there's an image here, pay attention!"

At this point, someone might argue: "But doesn't BLIP-2's Q-Former take care of semantic alignment when it compresses the visual features?" I compared the two in my own project. With a Q-Former feeding LLaMA, the distribution of projected embeddings sat closer to the text embeddings than with a linear projection, but the modality gap was still clearly visible. Go ahead and measure it yourself with the MIR (Modality Integration Rate) metric: the gap is still there.

---

So what exactly does the projector encode? It gets even stranger.

If it encoded low-level image attributes such as edges and colors, then in theory you wouldn't need a projector at all; a visual tokenizer in an Encoder-Free architecture could do that. But the weird thing is that models like X-InstructBLIP, which rely on a text-only LLM that has had no multimodal training, can directly understand the embeddings the projector outputs.

I tried an extreme experiment: I replaced the projector output of LLaVA-1.5 with random noise, and the LLM's output completely crashed. But if I replaced it with raw features from a different vision encoder (not trained with the projector), the LLM could still produce somewhat relevant outputs. What does this tell us? The projector learns a feature format that is "LLM-readable" during training, but that format isn't semantic. It's more likely a syntactic or structural alignment: for example, an implicit grammar of token order, or a learned pattern that induces particular attention behavior.
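One detail matters for a noise swap like this: if the noise has a very different scale from real projector outputs, the collapse could come from out-of-range magnitudes rather than from missing structure. The sketch below is one way to build a fairer control. It assumes the image tokens are plain lists of floats, and the norm matching is my own suggestion for illustration, not a description of how the experiment above was run.

```python
import math
import random


def norm_matched_noise(tokens, seed=0):
    """Gaussian noise with the same shape and per-token L2 norm as `tokens`."""
    rng = random.Random(seed)
    noise = []
    for tok in tokens:
        target_norm = math.sqrt(sum(x * x for x in tok))
        raw = [rng.gauss(0.0, 1.0) for _ in tok]
        raw_norm = math.sqrt(sum(x * x for x in raw))
        # Rescale the random direction so its length matches the real token.
        noise.append([x * target_norm / raw_norm for x in raw])
    return noise


# Two toy "image tokens" with norms 5.0 and 2.0 (made-up values).
tokens = [[3.0, 4.0], [0.0, 2.0]]
noise = norm_matched_noise(tokens)
for real, fake in zip(tokens, noise):
    print(math.hypot(*real), round(math.hypot(*fake), 6))
```

With the scale held fixed, any remaining difference between noise and real features is down to direction and structure, which is the thing the experiment is trying to probe.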

Simply put, it's like putting a disguise on visual features and tricking the LLM into thinking: "Look, I'm a text token, process me!" But the disguise is shallow. It's just enough for the LLM to recognize in the early layers, "Oh, this weird token is related to vision." The real semantic parsing has to wait for the later layers.

---

The LLM performs two-stage fusion internally, and the more I look at it, the craftier it seems, even next to the human brain.

There's a CVPR 2025 paper that specifically analyzes cross-modal information flow. I think it's the most insightful dissection of MLLMs this year.

I ran their attention knockout experiment on LLaVA-1.5-7b and verified their conclusions—brace yourself, because this is deeply counterintuitive.

Let's start with the early layers (roughly 1–8): The model does something very brute-force right away—it broadcasts the global visual features of the entire image to every token of the language question. It's like saying at the very beginning: "Okay, I already know what this picture looks like."

In the middle layers (9–24), the model starts getting precise. It locks onto entities mentioned in the question—like "dog"—and then passes only the visual information from the corresponding region in the image to that entity token's position. It's as precise as a sniper.

Finally, in the high layers (25–32), all information converges on the last token of the sequence (usually the end of the question or a special prediction marker) and delivers the final answer.

I tried the same thing on LLaVA-1.5-13b and Llama3-LLaVA-NEXT-8b, and the results were quite consistent.
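For readers who haven't seen attention knockout before, the core idea is small: block specific query-to-key attention edges and watch what changes. Below is a toy, pure-Python sketch of the masking step only. The token layout (3 image tokens, 2 question tokens, 1 final token), the uniform scores, and the function names are all assumptions for illustration; the real experiment applies a mask like this inside a window of layers of the actual model and measures the drop in the probability of the correct answer.

```python
import math

NEG_INF = float("-inf")


def softmax(row):
    """Numerically stable softmax; -inf entries get exactly zero weight."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores, blocked=()):
    """Causal attention weights with optional (query, key) edges knocked out.

    Never block a token's edge to itself, or its row would have nothing
    left to attend to.
    """
    blocked = set(blocked)
    n = len(scores)
    weights = []
    for q in range(n):
        row = [
            scores[q][k] if k <= q and (q, k) not in blocked else NEG_INF
            for k in range(n)
        ]
        weights.append(softmax(row))
    return weights


# Hypothetical layout: image tokens first, then the question, then the last token.
image = [0, 1, 2]
question = [3, 4]
last = 5
n = 6
scores = [[0.0] * n for _ in range(n)]  # uniform scores keep the toy readable

baseline = attention_weights(scores)
# Knock out every question -> image edge.
knocked = attention_weights(scores, blocked=[(q, k) for q in question for k in image])

print("question token 3, baseline:", baseline[3])
print("question token 3, knocked: ", knocked[3])
print("last token unchanged:", knocked[last] == baseline[last])
```

Blocking question-to-image edges in the early layers, image-to-entity edges in the middle layers, and everything-to-last-token edges in the late layers is how you would probe each of the three stages described above.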

Interestingly, from the outside, the projector outputs a sequence of image tokens. But the first thing the LLM's early layers do isn't to ask "What do the image tokens tell me?" directly; instead, they treat all image tokens as a whole and use attention to pull the representations at all text positions toward the visual direction. The selective alignment only starts in the middle layers.

This also explains why the projector doesn't need precise semantics: the LLM's early layers will do a rough global fusion first anyway, and the middle layers refine it. If the projector had already conveyed detailed semantics, it might actually conflict with the LLM's fusion strategy.

---

Don't fall into the CLIP curation trap. Learn from the lessons I paid for the hard way.

Many teams, when building multimodal RAG or training MLLMs, start by filtering data using CLIP similarity. Sounds scientific, right? But CLIP itself is trained on biased web data. When you use it as a filter, it favors Western culture, and captions built from generic words like "image" or "picture" are systematically scored too high.

I filtered a batch of Chinese food data. CLIP gave "dumplings" a similarity score of 0.25, but gave "a photo of food"—that kind of filler—a 0.33. Think about it: what will a model learn from data like that? It will think: "Specific descriptions don't matter, as long as the image and text match on a coarse modality level." That's exactly why many MLLMs crash and burn on fine-grained VQA tasks.
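Here is what that does to a typical threshold filter. The two scores are the ones quoted above; the cutoff of 0.30 and the function name are hypothetical, chosen only to show the failure mode.

```python
def filter_by_similarity(pairs, threshold):
    """Keep captions whose image-text similarity is at or above the threshold."""
    return [caption for caption, score in pairs if score >= threshold]


# (caption, CLIP similarity) for the same food image.
scored = [("dumplings", 0.25), ("a photo of food", 0.33)]

# Hypothetical cutoff; any value between the two scores behaves the same way.
kept = filter_by_similarity(scored, threshold=0.30)
print(kept)
```

The specific, correct caption is thrown away and the filler survives, so the filtered dataset ends up teaching the model exactly the wrong thing.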

---

Practical pitfalls I've already fallen into, so you don't have to.

Don't expect the projector to do semantic alignment for you. Even with a Q-Former, you still need multi-stage training. In the first stage you train only the projector, but be careful with your data: make sure each image-text pair is more than a bare description of the image and includes a reasoning path. For example, pair an image with a Q&A dialogue, not just "This is a cat." This pushes the LLM's middle layers to establish cross-modal relationships earlier.

Once you understand the information flow stages, debug the attention! If you find the model is insensitive to certain details, try moving the question earlier in the prompt so that during the early global fusion, the LLM pays attention to the key tokens sooner. I ran an extreme test: I changed the question from "What color is the dog in the

Cael Lee
