Training a Multimodal LLM from Scratch: Pre-training + Instruction Fine-Tuning + Alignment + Fusion…

By Cael Lee | 4 min read

Generated: 2026-06-21 03:49:50

---

You think training multimodal LLMs is all about connectors? I burned through half a year on GPUs, and finally figured things out.

Last winter, I was sitting alone in the server room, staring at a stalled loss curve on the screen. Next to me, 32 A100s were humming away—burning real money, every second of it.

At the time, I was stubbornly going all-in on Early Fusion with LLaMA-65B—trying to fuse images and text at the token level. Sounded so cutting-edge, right? But in the end, the model couldn't even answer basic questions like "how many cats are in this picture." It would count to three and start making things up.

I spent nights tossing and turning, and finally it clicked: What determines your success or failure isn't those fancy-sounding connectors—it's your understanding of the data and the training rhythm!

Let me be straight with you: too many people in the open-source community right now are going down the wrong path. They jump straight into flashy stuff like Q-Former and Perceiver Resampler, and end up with models that reek of "academic zombie."

---

Seriously, Late Fusion is the "right answer" for a small team like ours

Think about it—you've only got a handful of A100s. Are you really going to pre-train a 65B model from scratch?

I followed the crowd at first, too. I thought Early Fusion was the true path—concatenate image patch tokens and text tokens at the input layer, process everything through a unified transformer. Sounds beautiful, right? But let me tell you about two real problems:

First, you have to pre-train the entire set of parameters from scratch. A 65B model from the ground up? Have you actually done the math on that bill? I did. Then I silently closed that browser tab.

Second, you're trying to fuse representations from vision and language domains that haven't even been aligned yet. The model struggles like crazy. Come on—making a toddler run a marathon? That's just unreasonable.

So I went back to basics and switched to Late Fusion: CLIP-ViT as the vision encoder, a simple MLP projection layer in between, and then the LLM. Simple and effective. LLaVA does exactly this, and Qwen-VL follows the same encoder-adapter-LLM layout, only with a single cross-attention layer as the adapter instead of an MLP.

As for the connector part, I've been through two iterations. First I used BLIP-2's Q-Former, which can compress 576 visual tokens down to 32—sounds like a big save on compute. But training? The gradient propagation was a wild roller coaster, and the loss would get stuck all the time. Frustrating, right?
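To put rough numbers on that trade-off, here is a small back-of-the-envelope sketch. It assumes the 576 tokens come from a 336-pixel image cut into 14-pixel patches (the usual CLIP ViT-L/14 setting) and a 128-token text prompt. Both are my assumptions for illustration, not figures from the training run.

```python
def num_visual_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens a ViT emits for a square image (no CLS token)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def attention_pairs(num_visual: int, num_text: int) -> int:
    """Token pairs in one full self-attention map over the fused sequence."""
    seq_len = num_visual + num_text
    return seq_len * seq_len


ASSUMED_IMAGE_SIZE = 336   # assumption: CLIP ViT-L/14 at 336 px
ASSUMED_PATCH_SIZE = 14    # assumption
ASSUMED_TEXT_TOKENS = 128  # assumption: a short prompt
NUM_QUERIES = 32           # Q-Former output tokens, as quoted above

vit_tokens = num_visual_tokens(ASSUMED_IMAGE_SIZE, ASSUMED_PATCH_SIZE)
compression = vit_tokens / NUM_QUERIES
pairs_full = attention_pairs(vit_tokens, ASSUMED_TEXT_TOKENS)
pairs_compressed = attention_pairs(NUM_QUERIES, ASSUMED_TEXT_TOKENS)

print(vit_tokens, compression)        # 576 18.0
print(pairs_full / pairs_compressed)  # how much larger the attention map gets
```

The attention map grows with the square of the sequence length, which is why compressing visual tokens looks so attractive on paper.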

Later I switched to the LLaVA-1.5-style MLP projector: just two linear layers with an activation in between, as simple as a grade-school assignment. And it worked. Training was as steady as an old ox pulling a cart, step by step. Some people say "576 tokens is too many, inference will be slow," but don't worry too much. The 576 tokens do lengthen the sequence, but once you unfreeze the LLM during instruction fine-tuning, its attention learns to focus on the visual tokens that matter, and in practice the extra compute isn't that big a deal.
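For readers who want to see how little is going on inside such a connector, here is a dependency-free sketch of a two-layer MLP projector followed by late fusion. It is not the actual training code: the tiny dimensions, the random initialization, and names like `MLPProjector` and `build_llm_input` are all made up for illustration. In a real setup this would be two linear layers in your deep learning framework.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def init_linear(in_dim: int, out_dim: int, rng: random.Random):
    """Random weights (out_dim rows of in_dim values) and a zero bias."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def linear(x, weight, bias):
    """y = W x + b for a single token vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


class MLPProjector:
    """Linear -> GELU -> Linear, mapping vision features to the LLM width."""

    def __init__(self, vision_dim: int, llm_dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.fc1 = init_linear(vision_dim, llm_dim, rng)
        self.fc2 = init_linear(llm_dim, llm_dim, rng)

    def __call__(self, visual_tokens):
        projected = []
        for token in visual_tokens:
            hidden = [gelu(h) for h in linear(token, *self.fc1)]
            projected.append(linear(hidden, *self.fc2))
        return projected


def build_llm_input(projected_visual, text_embeddings):
    """Late fusion: projected image tokens are prepended to the text embeddings."""
    return projected_visual + text_embeddings


# Toy sizes so the example runs instantly (real models use far wider vectors).
VISION_DIM, LLM_DIM = 8, 16
rng = random.Random(1)
visual_tokens = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)]
                 for _ in range(576)]
text_embeddings = [[0.0] * LLM_DIM for _ in range(10)]

projector = MLPProjector(VISION_DIM, LLM_DIM)
sequence = build_llm_input(projector(visual_tokens), text_embeddings)
print(len(sequence), len(sequence[0]))  # 586 16
```

The point of the sketch is the shape bookkeeping: every one of the 576 visual tokens survives the projector and lands in the LLM's input sequence, so nothing is compressed.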

---

When it comes to data, that's where your GPUs really get burned

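Do you know how much data the Chinchilla scaling law says you need for a 65B model? At roughly 20 tokens per parameter, that's about 1.3 trillion tokens, and LLaMA-65B was actually trained on 1.4 trillion.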

1.4T! Think about what that scale actually means.
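Here is one way to make that scale concrete. The sketch uses the common approximations of about 20 training tokens per parameter and about 6 FLOPs per parameter per token. The A100 peak throughput is the published dense BF16 figure. The 40% hardware utilization is purely my assumption, and real runs vary a lot.

```python
TOKENS_PER_PARAM = 20      # Chinchilla rule of thumb (approximate)
A100_PEAK_FLOPS = 312e12   # dense BF16 peak per A100, in FLOP/s
ASSUMED_MFU = 0.40         # assumption: fraction of peak actually achieved


def chinchilla_tokens(n_params: float) -> float:
    """Approximate compute-optimal token budget for a model size."""
    return TOKENS_PER_PARAM * n_params


def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard estimate: about 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens


def wall_clock_days(flops: float, n_gpus: int, mfu: float = ASSUMED_MFU) -> float:
    """Days of training on n_gpus A100s at the assumed utilization."""
    seconds = flops / (n_gpus * A100_PEAK_FLOPS * mfu)
    return seconds / 86_400


params = 65e9
optimal_tokens = chinchilla_tokens(params)   # about 1.3e12
flops = training_flops(params, 1.4e12)       # about 5.5e23
days = wall_clock_days(flops, n_gpus=32)

print(f"{optimal_tokens:.2e} tokens, {flops:.2e} FLOPs, {days:.0f} days on 32 A100s")
```

Under these assumptions, 32 A100s would need well over four years for a single pass, which is why pre-training at this size is out of reach for a small team.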

And here's the big problem: Chinese data is especially hard to work with. I looked at the data source ratios from LLaMA and found something counter-intuitive: Common Crawl makes up about two-thirds of the mix, yet despite its massive volume it was seen only about once. Meanwhile, high-quality but much smaller datasets like Wikipedia and Books were repeated for more than two epochs.
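The relationship between sampling weight and epochs is simple arithmetic, sketched below. The corpus sizes and weights are hypothetical numbers I picked to show the effect. They are not LLaMA's actual figures.

```python
import random

# Hypothetical corpus sizes (in tokens) and sampling weights, for illustration only.
SOURCES = {
    "web_crawl": {"tokens": 900e9, "weight": 0.82},
    "wikipedia": {"tokens": 25e9, "weight": 0.06},
    "books": {"tokens": 30e9, "weight": 0.07},
    "code": {"tokens": 100e9, "weight": 0.05},
}
TRAIN_BUDGET_TOKENS = 1e12  # assumption: total tokens seen during training


def epochs_per_source(sources, budget_tokens):
    """Epochs over each source = (weight * budget) / source size."""
    return {name: spec["weight"] * budget_tokens / spec["tokens"]
            for name, spec in sources.items()}


def sample_source(sources, rng: random.Random) -> str:
    """Pick the source of the next training document by its sampling weight."""
    names = list(sources)
    weights = [sources[name]["weight"] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


epochs = epochs_per_source(SOURCES, TRAIN_BUDGET_TOKENS)
for name, value in epochs.items():
    print(f"{name:10s} {value:.2f} epochs")

rng = random.Random(0)
draws = [sample_source(SOURCES, rng) for _ in range(10_000)]
print(max(set(draws), key=draws.count))  # the most frequently sampled source
```

Even with most draws going to the web crawl, the small curated sources end up repeated more than twice while the crawl is not even fully consumed.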

What surprised me most: code data improves reasoning ability, and this isn't some myth. I ran a controlled experiment—removing code data during the SFT stage made the model noticeably dumber when answering slightly complex logic questions. Isn't that something?

Building datasets from Common Crawl? I've crawled out of that pit at least twice. The first time I just extracted plain text from WET files. The result? Navigation menus, ad templates, machine translation garbage—all mixed in. The trained model was pure zombie.

I eventually settled on a four-step pipeline modeled on RefinedWeb: URL blacklisting (adult content, landing pages) → fastText language identification → NSFW detection → MinHash LSH deduplication. After all that, only 15% of the original CC data was usable.
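The last step of that pipeline is the least obvious one, so here is a minimal stdlib-only sketch of MinHash with LSH banding. It is a toy under stated assumptions, not the pipeline's real code: word 3-gram shingles, 128 hash functions split into 32 bands of 4 rows, and a 0.8 similarity threshold are all illustrative choices, and a production run would use a tuned library.

```python
import hashlib

NUM_PERM = 128             # number of hash functions in a signature (assumption)
BANDS = 32                 # LSH bands (assumption)
ROWS = NUM_PERM // BANDS   # rows per band


def shingles(text: str, n: int = 3) -> set:
    """Set of word n-grams used to compare documents."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def _hash(seed: int, shingle: str) -> int:
    """Seeded 64-bit hash, standing in for one random permutation."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text: str) -> list:
    """For each seed, keep the minimum hash over all shingles."""
    doc_shingles = shingles(text)
    return [min(_hash(seed, s) for s in doc_shingles) for seed in range(NUM_PERM)]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedup(docs: list, threshold: float = 0.8) -> list:
    """Return indices of kept docs, dropping near-duplicates of earlier ones."""
    buckets, signatures, kept = {}, {}, []
    for idx, text in enumerate(docs):
        sig = minhash_signature(text)
        keys = [(b, tuple(sig[b * ROWS:(b + 1) * ROWS])) for b in range(BANDS)]
        # Only docs sharing at least one band bucket are compared in full.
        candidates = {c for key in keys for c in buckets.get(key, ())}
        if any(estimated_jaccard(sig, signatures[c]) >= threshold for c in candidates):
            continue
        kept.append(idx)
        signatures[idx] = sig
        for key in keys:
            buckets.setdefault(key, []).append(idx)
    return kept


base = ("the quick brown fox jumps over the lazy dog near the quiet river bank "
        "while the sun sets slowly behind the distant blue mountains and a cold "
        "wind carries the smell of rain across the empty fields")
docs = [
    base + " tonight",
    base + " today",  # near-duplicate: only the last word differs
    "training large language models requires careful data filtering and deduplication at scale",
]
kept = dedup(docs)
print(kept)
```

The banding trick is what keeps this tractable at crawl scale: full signature comparisons only happen between documents that already collide in some band.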

So let me be blunt: don't blindly trust open-source datasets. The Pile and C4 have wildly varying quality—some subsets are good, but a lot of Pile-CC is just poorly extracted from Common Crawl. Using that to train a model is like chewing someone else's half-chewed sugarcane. Think about it.

If you have the compute, run the RefinedWeb pipeline on CC yourself. The output might not match carefully curated collections, but it's more than enough for a baseline. That I'm confident about.

---

Training rhythm—you can't rush it and you can't slack off

I copied Qwen-VL's three-stage training approach, and the difference was real.

Stage 1: freeze the LLM, train only the ViT and cross-attention layers. Use 1.4 billion image-text pairs to teach the vision encoder to project into the LLM's space. This is like laying a foundation—you can't rush it.
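The freezing logic for this stage can be as simple as toggling a flag by parameter name. The sketch below is duck-typed: it works on any object exposing `named_parameters()` whose parameters have a `requires_grad` attribute, which is the convention PyTorch modules follow. The prefixes `visual.` and `adapter.` and the stand-in classes are hypothetical, so adjust them to however your model names its submodules.

```python
from dataclasses import dataclass

# Hypothetical name prefixes for the vision encoder and the cross-attention adapter.
STAGE1_TRAINABLE_PREFIXES = ("visual.", "adapter.")


def apply_stage1_freeze(model, trainable_prefixes=STAGE1_TRAINABLE_PREFIXES):
    """Freeze every parameter except those under the given prefixes.

    Returns the names of the parameters left trainable.
    """
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable_prefixes)
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Tiny stand-ins so the sketch runs without a deep learning framework.
@dataclass
class FakeParam:
    requires_grad: bool = True


class FakeModel:
    def __init__(self, names):
        self._params = {name: FakeParam() for name in names}

    def named_parameters(self):
        return self._params.items()


model = FakeModel([
    "visual.blocks.0.attn.qkv.weight",
    "adapter.cross_attn.q_proj.weight",
    "llm.layers.0.mlp.up_proj.weight",
    "llm.lm_head.weight",
])
trainable = apply_stage1_freeze(model)
print(trainable)
```

Printing the trainable names before launching a run is a cheap sanity check that the LLM really is frozen.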

Stage 2: a lot of people don't get why you need an extra stage after all those image-text pairs. Let me tell you why: after stage

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
