Latest Advances in Multimodal Large Models: Modality Bridging (English)

Generated: 2026-06-20 16:03:41

---

Are you sick of hearing the word "multimodal" yet?

In the last few years, if you’ve been anywhere near the internet and haven’t been bombarded with "multimodal" and "large models," you must have been meditating in a cave. But honestly, articles that actually explain this stuff clearly are rarer than giant pandas.

I just crawled out of a rabbit hole myself, my head full of question marks. Today, let’s skip the rigid "first define, then classify, finally discuss significance" format—way too fake. Let’s jump straight into the real pain I’ve been through, the stuff that actually hurt, and the four soul-searching questions that can save you two months of detours.

You ready?

---

First up: Multimodal bridging – what exactly are we bridging? And why are those retrieval folks always yelling "we need a bridge"?

Think about it. Sounds pretty mystical, doesn’t it?

But it’s actually dead simple. A "bridge" is just a translator between different data formats. You have to "translate" the meaning of an image for a model that only understands text. And you have to "describe" the flavor of text for a model that only sees images.

At first, I thought exactly like you: how hard can this be? Just grab a human-annotated caption and feed it to a text model, done!

Well, I was naive to the point of embarrassment.

I took an off-the-shelf image captioning model, turned a picture into "a cat sits on a windowsill," and went to retrieve it. Guess what? The results were so bad my scalp went numb!

Why? Because the model’s version of "a cat sits on a windowsill" and the image you’re thinking of—"a ginger cat lounging on the left side of the windowsill, dappled light on the right"—are two completely different worlds! The information loss is total! That’s a classic translation error.

Then I came across the CIG (Composed Image Generation) method mentioned in the materials. Mind-blowing. Absolutely genius.

It did something deeply counterintuitive: instead of pulling images and text into the same vector space to compare, it said, "You’re looking for an image? Let me first use your reference image and incremental description to draw you an ideal pseudo-target image on the spot, then go find the real one in the gallery!"

Isn’t that wild? Traditional methods force everything into a single vector space. CIG just changes the game entirely: "Help the user visualize their vague, fuzzy ideas first!"

It starts with a textual inversion network to understand the texture and composition of your reference image. Then, combining your "add more sunlight" or "make it red" descriptions, it uses a diffusion model to directly generate a "rough draft" of what you’re looking for. Finally, it takes that draft plus your text description and goes searching in the real image database.
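To make that last step concrete, here is a minimal sketch of the final retrieval stage only. It assumes you already have an embedding for the generated draft, an embedding for the text, and embeddings for the gallery, all from the same encoder. The function name `retrieve_with_pseudo_target`, the weighted-sum fusion, and the `alpha` weight are my own illustrative assumptions, not CIG's actual fusion scheme, and the random vectors stand in for real encoder outputs.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def retrieve_with_pseudo_target(pseudo_img_emb, text_emb, gallery_embs,
                                alpha=0.5, top_k=3):
    """Rank gallery images against a fused (draft image + text) query.

    alpha weights the draft image against the text; this weighted sum is an
    assumed fusion rule chosen for illustration.
    """
    query = alpha * l2_normalize(pseudo_img_emb) + (1.0 - alpha) * l2_normalize(text_emb)
    query = l2_normalize(query)
    scores = l2_normalize(gallery_embs) @ query   # cosine similarity per gallery item
    order = np.argsort(-scores)[:top_k]           # best matches first
    return order, scores[order]

# Toy data: random vectors stand in for real encoder outputs.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 64))
# Pretend gallery item 42 is the true target and the draft landed close to it.
pseudo = gallery[42] + 0.1 * rng.normal(size=64)
text = gallery[42] + 0.5 * rng.normal(size=64)

top_idx, top_scores = retrieve_with_pseudo_target(pseudo, text, gallery)
print(top_idx, top_scores)
```

The sketch also shows why draft quality matters so much: if `pseudo` drifts far from the real target, the fused query drifts with it.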

What I love most about this approach: It doesn’t need those ridiculously expensive triplet datasets! Training only requires simple image-text pairs. The barrier to entry is so low it makes you emotional.

I ran it myself. Got about a 3 to 5 percentage point boost on CIRR and FashionIQ. But the pitfalls were real—the quality of the generated "draft" makes or breaks everything. If your image database is full of Monet-style blurry paintings, the generator falls apart immediately, and your retrieval goes down the drain.

Now take GME (General Multimodal Embedder). Even more ambitious. Its problem: your query isn’t one image and one sentence. It’s a Wikipedia page with pictures! And the target you want to retrieve is also rich media with text and images mixed together.

Classic methods hit a dead end here. They only know how to align single modalities. I fully support GME’s idea: use an MLLM for unified encoding. Shove text tokens, vision tokens, and fused features all into the same embedding space.

See? Whether it’s text or image, in the model’s eyes, it should be the same language!

They also built an LLM-based data synthesis pipeline to automatically generate this kind of fused-modal training data. Three steps: generate query → extract and rewrite entities → fetch images. I tried it. The quality was okay, but there was a huge catch—you have to fine-tune your LLM prompt template really well, or the generated text and the fetched image often end up completely mismatched. Like, you generate "night view of the Eiffel Tower in Paris," and the image is taken during the day. Awkward, right? So when I ran it, I forcibly added a CLIP similarity filter. Anything below 0.6 got tossed out. That barely made it usable.
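The filter I bolted on can be sketched roughly like this. It assumes the image and text embeddings already come from a CLIP-style encoder and that the score is plain cosine similarity. The function name `filter_by_similarity` and the toy 2-D vectors are made up for illustration, and whether 0.6 is a sensible cutoff depends on how your encoder scales its scores.

```python
import numpy as np

def filter_by_similarity(image_embs, text_embs, threshold=0.6):
    """Keep (image, text) pairs whose cosine similarity is at least `threshold`.

    Row i of image_embs is paired with row i of text_embs.
    """
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = np.sum(img * txt, axis=1)   # row-wise cosine similarity
    keep = sims >= threshold           # anything below the threshold is tossed
    return keep, sims

# Toy 2-D embeddings: a well-matched pair, a so-so pair, and a mismatched pair.
image_embs = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
text_embs = np.array([[1.0, 0.1], [1.0, 0.0], [1.0, 0.0]])

keep, sims = filter_by_similarity(image_embs, text_embs)
print(keep, np.round(sims, 3))
```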

---

Next up: Architecture choices—I walked through this minefield and stepped on every single one!

If you’re building a multimodal model, the first unavoidable question is: how do you feed visual features into the language model? The materials gave three paths, and I walked down every one and came out covered in dust.

Option 1: Linear Projection (LLaVA v1)

Don’t laugh. When I saw this plan, my first reaction was, "That’s it?" – one linear transformation, and done!

The visual tokens from the CLIP ViT go through a linear layer that aligns their dimensions to the LLM’s embedding space, and are then simply concatenated in front of the text tokens. I coded up an experiment and found that it works, but only just barely.
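In code, the whole connector really is just one matrix multiply plus a concatenation. Here is a toy NumPy sketch. The sizes are tiny, made-up numbers (real encoders and LLMs are far wider), and the random arrays stand in for actual CLIP ViT outputs and LLM text embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration.
num_patches, vision_dim = 16, 8       # visual tokens from the vision encoder
num_text_tokens, llm_dim = 5, 12      # text token embeddings inside the LLM

visual_tokens = rng.normal(size=(num_patches, vision_dim))
text_embeds = rng.normal(size=(num_text_tokens, llm_dim))

# The entire connector: one learnable weight matrix and a bias.
W = rng.normal(scale=0.02, size=(vision_dim, llm_dim))
b = np.zeros(llm_dim)

projected = visual_tokens @ W + b                              # (16, 12)
llm_input = np.concatenate([projected, text_embeds], axis=0)   # visual tokens go first

print(llm_input.shape)  # (21, 12)
```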

On ScienceQA, about 84% accuracy. But ask it something more complex, like "Who in this picture is smiling, and why?" and it starts hallucinating nonsense. In short, linear mapping is too coarse. It simply cannot capture the complex, non-linear relationship between the visual space and the language space.

Option 2: MLP Connector (LLaVA-1.5)

This is the upgrade, with real chops! LLaVA-1.5 swapped the linear layer for a two-layer MLP with GELU activation. The complexity went up just a tiny bit, but the performance was a visible leap.
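For comparison with the single linear layer, a two-layer MLP connector with GELU looks roughly like this in NumPy. The toy sizes, the random weights, and the choice of the tanh approximation of GELU are assumptions for illustration. A real connector is trained, not randomly initialized and left alone.

```python
import numpy as np

def gelu(x):
    """GELU activation, using the common tanh approximation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def mlp_connector(visual_tokens, W1, b1, W2, b2):
    """Two linear layers with a GELU in between: vision_dim -> llm_dim -> llm_dim."""
    hidden = gelu(visual_tokens @ W1 + b1)
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
num_patches, vision_dim, llm_dim = 16, 8, 12   # toy sizes

visual_tokens = rng.normal(size=(num_patches, vision_dim))
W1 = rng.normal(scale=0.02, size=(vision_dim, llm_dim))
b1 = np.zeros(llm_dim)
W2 = rng.normal(scale=0.02, size=(llm_dim, llm_dim))
b2 = np.zeros(llm_dim)

mlp_out = mlp_connector(visual_tokens, W1, b1, W2, b2)
print(mlp_out.shape)  # (16, 12)
```

The only structural difference from Option 1 is the extra layer and the non-linearity in the middle, which is why the training cost barely moves.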

My test data: ScienceQA jumped to 87%, and crucially, on dialog tasks like VisDial, the model was clearly sharper. The semantic richness of the visual tokens improved a lot.

My friend, I genuinely recommend this approach. Best bang for your buck! Training cost barely changed, but the results went up a full notch. If you, like me, have limited resources and want to validate an idea quickly, the MLP connector is the most solid and smartest choice. No contest.

Option 3: Q-Former (BLIP-2, InstructBLIP)

This one steps it up to another level. Q-Former works like a "journalist" sent to interview image features with 32 prepared questions. It has to distill the interview into 32 key pieces of information.

Basically, before the visual tokens enter the language model, it uses a bunch of learnable query vectors and cross-attention to brutally compress 256 visual tokens into just 32. When I first saw it, I thought it was overly complicated. But when I actually benchmarked it, the computational efficiency advantage was insane.
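The core compression trick can be sketched as a single cross-attention step: 32 query vectors attend over 256 visual tokens and come back as 32 summary tokens. This is a heavily simplified illustration, not the real Q-Former, which stacks multiple transformer layers with self-attention, feed-forward blocks, and multi-head attention. The function name `compress_with_queries`, the feature width of 16, and the random weights are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_with_queries(visual_tokens, queries, Wq, Wk, Wv):
    """Single-head cross-attention: each query summarizes all visual tokens."""
    Q = queries @ Wq          # (num_queries, dim)
    K = visual_tokens @ Wk    # (num_visual, dim)
    V = visual_tokens @ Wv    # (num_visual, dim)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)  # (num_queries, num_visual)
    return attn @ V, attn     # one output token per query

rng = np.random.default_rng(0)
dim = 16                                          # toy feature width
visual_tokens = rng.normal(size=(256, dim))       # 256 visual tokens in
queries = rng.normal(size=(32, dim))              # 32 learnable queries (random here)
Wq, Wk, Wv = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

compressed, attn = compress_with_queries(visual_tokens, queries, Wq, Wk, Wv)
print(compressed.shape)  # (32, 16)
```

Whatever the input length, the language model only ever sees as many visual tokens as there are queries, which is where the efficiency win comes from.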

I had a project dealing with high-resolution scanned documents. Without compression, the VRAM would have exploded. Q-Former saved me about 75% of the visual token overhead! But the cost? The training pipeline is a huge pain. You have to pre-train it first to teach it how to "ask questions," then tune it together with the LLM. When I first started training, I skipped the warm-up and went straight to full training. The loss refused to drop. I wasted three whole days of GPU time.

To sum it up, if you’re the one going into battle right now:

Linear Projection: Good for quick research. Don’t use it in any real product. You’re just asking for trouble.
MLP: Best bang for your buck. Suits the vast majority of business scenarios. Use it now.
Q-Former: If you, like me, need to handle high-resolution images, or if computing resources are tight, invest two weeks in tuning it. It can be your lifesaver.

---

And one thing that really got me excited – how do multimodal models actually learn to "reason"? Isn’t it just feeding them more data and they magically get it

Cael Lee
