
Three Months of Pitfalls: If Multimodal Alignment Is Wrong, Fusion Is Just Mush

By Cael Lee | 7 min read


Generated: 2026-06-22 03:14:04

---

One late night last summer, I ran an experiment.

I fed a video clip to three models: one looked at the visuals, one listened to the audio, and one read the subtitles. I asked them all the same question—"Is this person happy or sad right now?"

The visual model said: "Corners of the mouth turned up, deep crow’s feet around the eyes. Judging as happy, 94% confidence."

The audio model said: "Vocal cord frequency jitter increased, pitch dropped at the end with no upward curve. Judging as hidden sadness, 78% confidence."

The subtitle model said: "Text analysis found he said 'I'm really fine,' with six instances of semantic contradiction. Judging as emotional concealment, 88% confidence."

Same scene, three answers, each going its own way.

That’s the real trouble with multimodal alignment—it’s not that the models aren’t smart enough; it’s that what they see is fundamentally different. If you force them together, you just end up with a mishmash that looks like nothing.

Alright, that’s what I wanted to chat about today.

---

Alignment and Fusion Are Not the Same Thing

A lot of articles use these two terms interchangeably. I fell into that trap myself when I first started.

Alignment is about finding relationships. For example, at the third second of a video, someone says, "I'm so happy." You need to match it on the timeline—the exact frame where their mouth opens happens to be the moment the word "happy" comes out. That’s alignment.

Fusion is about blending the aligned information into a judgment. You know that at the third second, their mouth opened and they said "happy." Now you need to combine all the clues to decide if they’re genuinely happy.

If you get the order wrong, everything falls apart.

I once worked on an emotion analysis project with text and audio dual streams. At first, I was pretty dumb about it: I just concatenated the audio features and text features and threw them into the model.

What happened? Accuracy got stuck at 62%, no matter how I tuned it.

Later, I broke it down and found the problem was simple—the speech rate feature in the audio corresponded to the fifth second, while the text said "I'm so happy" at the third second. They were never aligned, so forcing them together was just muddling things up. No wonder the model was confused.

That mistake taught me one thing: Fusion without alignment is a terrible idea.
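The mismatch described above (a text feature at second 3 paired with an audio feature from second 5) can be avoided by matching on timestamps before concatenating. Here is a minimal sketch of that idea, not the pipeline from the project: the function name `align_by_time`, the `max_gap` tolerance, and the toy feature values are all assumptions made for illustration.

```python
from bisect import bisect_left

def align_by_time(text_events, audio_frames, max_gap=0.5):
    """Pair each text event with the audio frame nearest in time.

    text_events:  list of (timestamp_sec, text_feature), sorted by time
    audio_frames: list of (timestamp_sec, audio_feature), sorted by time
    max_gap:      pairs further apart than this many seconds are dropped
                  (hypothetical tolerance, tune for your data)
    """
    audio_times = [t for t, _ in audio_frames]
    pairs = []
    for t_text, text_feat in text_events:
        i = bisect_left(audio_times, t_text)
        # Candidates: the audio frame just before and just after t_text.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(audio_times)]
        if not candidates:
            continue
        j = min(candidates, key=lambda k: abs(audio_times[k] - t_text))
        if abs(audio_times[j] - t_text) <= max_gap:
            # Concatenate only once both features refer to the same moment.
            pairs.append((t_text, text_feat + audio_frames[j][1]))
    return pairs

# Toy data: "I'm so happy" at second 3, speech-rate features at 1.0 s, 3.1 s, 5.0 s.
text_events = [(3.0, [0.9, 0.1])]
audio_frames = [(1.0, [0.2]), (3.1, [0.7]), (5.0, [0.4])]
fused = align_by_time(text_events, audio_frames)
```

With this toy input, the text feature is paired with the audio frame at 3.1 s rather than the one at 5.0 s. Dropping pairs outside `max_gap` is a design choice: a missing pair is usually easier to debug than a wrong one.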

There are two ways to approach alignment: explicit (supervised by labels) and unsupervised.

Explicit alignment means using labels to teach the model step by step. For example, you annotate 1,000 samples, each one saying, "This sentence at 2.3 seconds corresponds to that frame." The results are good, really good. But have you ever experienced the pain of labeling data? I labeled 3,000 video clips, and it was excruciating. Plus, generalization is poor: switch to a different scenario and you have to start labeling all over again. Think about it: you label comedy movies for video QA, then suddenly switch to a news broadcast, and the model falls apart in seconds.

Unsupervised alignment, on the other hand, lets the model find correlations on its own, without labels. Sometimes it can uncover patterns you wouldn’t notice—like a drop in pitch always coinciding with a certain keyword. But the results can be volatile. I fell into a big pit once: running unsupervised alignment on a public dataset, accuracy jumped up and down. I spent two days debugging and finally found that one modality’s sampling rate was set wrong. Audio at 16kHz, video at 30fps—the sampling intervals didn’t match during alignment. How could the results be stable?
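One way to guard against this kind of rate mismatch is to derive the audio span of each video frame from the two rates instead of hard-coding a window size. The sketch below assumes the 16 kHz and 30 fps figures mentioned above and a shared start time for both streams. The helper name `audio_span_for_frame` is hypothetical.

```python
def audio_span_for_frame(frame_idx, fps=30, sample_rate=16_000):
    """Return the [start, end) audio sample range covered by one video frame.

    Assumes audio and video start at the same instant.
    """
    # 16000 / 30 is not a whole number, so a fixed window of 533 samples
    # would drift over time. Integer arithmetic on the frame boundaries
    # keeps every frame anchored to the true timeline.
    start = frame_idx * sample_rate // fps
    end = (frame_idx + 1) * sample_rate // fps
    return start, end

spans = [audio_span_for_frame(i) for i in range(3)]
```

The spans come out as 533, 533, and 534 samples long, and frame 30 starts exactly at sample 16,000, one second in. If either rate is configured wrongly, these boundaries shift, which is the kind of instability described above.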

So my conclusion is: Labels are expensive but stable; unsupervised is convenient but needs monitoring. Now I make it a habit to check the distribution of alignment loss after every training epoch. If the variance gets too large, I stop the machine immediately and inspect the data pipeline.
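The per-epoch check can be as small as a variance test on the alignment losses collected during that epoch. This is only a sketch of the habit described above: the function name `should_halt`, the `max_variance` threshold, and the loss values are made up for illustration, and a sensible threshold depends on the loss scale of the actual model.

```python
from statistics import pvariance

def should_halt(batch_losses, max_variance):
    """True when the spread of per-batch alignment losses in an epoch is too large."""
    # A single value has no spread, so never halt on fewer than two losses.
    return len(batch_losses) > 1 and pvariance(batch_losses) > max_variance

# Hypothetical per-batch alignment losses from two epochs.
stable_epoch = [0.52, 0.50, 0.49, 0.51]
noisy_epoch = [0.20, 1.40, 0.10, 1.10]

halt_on_stable = should_halt(stable_epoch, max_variance=0.05)
halt_on_noisy = should_halt(noisy_epoch, max_variance=0.05)
```

Here the stable epoch passes and the noisy one triggers a halt, which is the cue to go and inspect the data pipeline before training any further.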

---

Encoder-Decoder: Classic Approach, but Choose Wrong and You’re in Trouble

This framework is intuitive: each modality extracts features first, then aligns, then decodes the output. Sounds simple, right? But there are three levels of fusion, and picking the wrong one will get you into trouble.

First, data-level fusion.

It sounds nice—just concatenate the raw pixels, waveforms, and text and feed them into the model. I tried it once. The image was 224×224, the audio sampling rate was 16kHz—the data volumes differed by orders of magnitude. Guess what happened? The model learned a shortcut. It discovered that just by relying on audio features, it could guess most answers correctly, so it basically ignored the image features. It’s like only studying one subject to pass all your exams—do you really expect it to learn the others?

Data-level fusion only works when modalities are highly synchronized, like in autonomous driving, where LiDAR and camera timestamps are aligned to the millisecond. If you’re writing a paper on multi-sensor fusion, this approach is fine. But for video QA? Forget it.

Next, feature-level fusion.

This is what I use most often. Each stream extracts features independently—ResNet for images, BERT for text—then they’re concatenated or weighted-summed in feature space. It’s flexible, but there’s a fatal issue: the quality of alignment directly determines the model’s upper bound.

I stepped into another trap here. I used the second-to-last layer of ResNet for image features, which was 2048 dimensions, and the CLS token from BERT for text features, which was 768 dimensions. The difference was more than double. The fusion layer struggled to learn anything, and the data distribution was skewed. Later, I forced both features into the same dimension—512—and accuracy jumped by 3 points.

Isn’t that amazing? Sometimes the model isn’t the problem; it’s that you haven’t even aligned the dimensions.
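Bringing both streams to a common width before fusing can be sketched as follows. This is a dependency-free illustration, not the code from the project: the random matrices stand in for learned linear layers, the constant vectors stand in for real ResNet and BERT features, and the L2 normalization step is an extra assumption that the text above does not mention.

```python
import random

def make_projection(in_dim, out_dim, seed):
    """Random stand-in for a learned linear layer (weights would be trained in practice)."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]

def project(weights, vec):
    """Matrix-vector product: maps vec from in_dim to out_dim."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def l2_normalize(vec):
    """Scale vec to unit length so neither modality dominates by magnitude."""
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec] if norm else vec

COMMON_DIM = 512
image_proj = make_projection(2048, COMMON_DIM, seed=0)
text_proj = make_projection(768, COMMON_DIM, seed=1)

image_feat = [1.0] * 2048   # placeholder for a 2048-d ResNet feature
text_feat = [1.0] * 768     # placeholder for a 768-d BERT [CLS] feature

image_vec = l2_normalize(project(image_proj, image_feat))
text_vec = l2_normalize(project(text_proj, text_feat))
fused = image_vec + text_vec   # concatenation: 512 + 512 = 1024 values
```

After projection, each modality contributes the same number of dimensions to the fused vector, so the fusion layer no longer sees one stream that is 2048 wide next to another that is 768 wide.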

Finally, model-level fusion.

This one is more aggressive—train a separate classifier for each modality, then vote or take a weighted average. It’s suitable when the modalities are so different that they can’t be aligned in feature space—like touch plus voice. You feel a rough surface with your hand while hearing the sound of sandpaper rubbing. Can those two types of information be matched in feature space? Unlikely.

But the problem is obvious: training costs double, and the models might contradict each other. I once had a situation where the image model output "cat," and the audio model output "dog." How do you fuse that? It’s a mess.

If you use model-level fusion, at least ensure each sub-model has over 80% accuracy; otherwise, a low-quality modality will drag the whole thing down.
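A weighted vote with an accuracy gate is one simple way to handle the "cat" versus "dog" disagreement. The sketch below is an assumption about how such a vote could look, not the method used in the project: weighting each vote by confidence times validation accuracy is my choice, the function name `late_fusion` is hypothetical, and all numbers are invented.

```python
from collections import defaultdict

def late_fusion(predictions, min_accuracy=0.80):
    """Weighted vote across per-modality classifiers.

    predictions: list of (label, confidence, validation_accuracy)
    Sub-models at or below min_accuracy are excluded from the vote.
    Returns the winning label, or None if no sub-model qualifies.
    """
    scores = defaultdict(float)
    for label, confidence, accuracy in predictions:
        if accuracy <= min_accuracy:
            continue  # a weak modality would only drag the result down
        scores[label] += confidence * accuracy
    if not scores:
        return None
    return max(scores, key=scores.get)

result = late_fusion([
    ("cat", 0.90, 0.91),   # image model (hypothetical numbers)
    ("dog", 0.60, 0.84),   # audio model (hypothetical numbers)
])
```

In this toy case the image model wins because it is both more confident and more accurate. Returning `None` when every sub-model falls below the gate makes the failure visible instead of quietly trusting a weak modality.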

So how do I choose now? Here’s a reference:

If the data is well-aligned and can be trained end-to-end, I prioritize feature-level fusion. If the modalities are highly heterogeneous or I need to integrate existing models, I go with model-level. Data-level? Unless the sampling rates of the two streams are naturally consistent, don’t touch it. You’ll regret it.

---

Attention Mechanism: The Hottest Thing, but Don’t Think It’s Invincible

In recent years, attention mechanisms have dominated multimodal fusion. No one can avoid them.

What’s so good about them? They can dynamically focus on important parts of different modalities without requiring manually written alignment rules. In the past, you had to design templates by hand: "When the audio has a burst sound, focus on the edge regions of the image." Now, attention learns on its own.

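Take Q-Former (from BLIP-2), for example. It uses a set of learned query vectors to "ask" the image features, then compresses the results into a small number of tokens to feed to the large model. LLaVA is even more direct: it maps image features into the language model's embedding space with a simple projection (a single linear layer in the original LLaVA, a two-layer MLP in LLaVA-1.5). It is hard alignment, simple but effective.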
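The core idea of a few queries "asking" many image features can be shown with a single cross-attention step. This is a toy sketch only. The real Q-Former has learned queries, learned key/value projections, self-attention among the queries, and multiple layers, none of which appear here. The function names and the tiny sizes are assumptions for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, features):
    """Each query returns a softmax-weighted average of the feature vectors."""
    dim = len(features[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every feature.
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in features]
        weights = softmax(scores)
        outputs.append([sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)])
    return outputs

# Toy sizes: 6 image patch features (dim 4) compressed into 2 output tokens.
patches = [[float(i == j % 4) for i in range(4)] for j in range(6)]
queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
tokens = cross_attend(queries, patches)
```

However many patches go in, the number of output tokens equals the number of queries, which is what lets this style of module hand a short, fixed-length sequence to the language model.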

But here’s a counterintuitive fact: how well attention performs depends heavily on how much data you have.

On small datasets, attention is worse than handcrafted feature fusion. I tried it on a small project with 5,000 data points—attention was 2 points lower than simple concatenation. Why? Because it has a large number of parameters, and without enough data, it overfits.

But once the data volume increases (I fine-tuned LLaVA-1.5 on 300,000 image-text pairs, running on 8 V100 GPUs for three days), attention completely dominated. It outperformed feature concatenation by about 4.2 points in F1.

Think about it: what do those 4.2 points mean? The gap between a top conference paper and rejection is often just 2-3 points.

But attention also has its pitfalls. It has a large number of parameters and is slow for inference. Try


Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
