Multimodal in One Read: From Visual Representation to Multimodal Large Models
Generated: 2026-06-22 04:18:57
---
You Show ChatGPT a CT Scan? It Says: "I Can't See."
The other day, a friend came to me urgently: "Isn't ChatGPT supposed to be amazing? Get it to look at this CT scan for me and tell me if there's anything wrong."
I said, "Go ahead and try it. It can see images now."
He tried. Then he cursed: "It says it can't see the image. It's blind!"
Ha! Actually, he might have been using the free version, which doesn't support image upload yet. Even with GPT-4, its interpretation of medical images might not be reliable—after all, it's not a professional doctor. But this incident hides an interesting question: How do you make a language model "understand" images?
I spent months digging into multimodal large models from the inside out. I fell into pits, sweated, and even slammed the table. Today, I'm going to break it all down for you in plain, human language.
---
Part 1: Why Is Making a Model "See" So Hard? — You Think You're Feeding It Numbers, But It's Actually a Mountain
What do language models eat? Tokens! Each token is just an ID. Give the model the word "cat," it looks the ID up in its embedding table, gets a vector back, and done.
But an image? It's a pile of pixels, a two-dimensional matrix.
Guess how many pixel values a 224×224 color image has? 224×224×3 = 150,528! One hundred fifty thousand numbers! How do you stuff that into a language model's tiny little head?
When I first started learning, I did something stupid. I thought: "Simple! Just flatten the image into a long vector, right?" But the model couldn't learn a thing! The 2D structure was gone: two pixels sitting right on top of each other in the image ended up hundreds of positions apart in the vector. It was like mashing all the notes of a song into a blob and saying "This is the melody." You couldn't hear a damn thing.
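To put numbers on that mistake, here is a minimal sketch. It assumes a row-major, channels-last layout, which is what NumPy gives you by default; the helper `flat_index` is my own illustration, not from any library.

```python
import numpy as np

# An all-zero dummy image; only the shape matters here.
image = np.zeros((224, 224, 3), dtype=np.uint8)
num_values = image.size          # 224 * 224 * 3

# Flatten into one long vector (row-major order).
flat = image.reshape(-1)

def flat_index(row, col, channel=0, width=224, channels=3):
    """Position of image[row, col, channel] inside the flattened vector."""
    return (row * width + col) * channels + channel

# Two pixels stacked directly on top of each other in the image...
gap = flat_index(11, 10) - flat_index(10, 10)

print(num_values)  # 150528
print(gap)         # 672: vertical neighbours are 672 positions apart
```

The information is technically still in there, but nothing tells the model which entries used to be neighbours.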
So the core idea is only one: Turn images into tokens too. Let the image and text speak the same language.
---
Part 2: ViT: Reading an Image Like a Sentence — Google Nailed It
In 2020, Google came up with ViT (Vision Transformer). The idea is so straightforward it makes you want to slap your thigh: "Since transformers can process text sequences, why not chop an image into small pieces and treat each piece like a 'word'?"
How do you chop it? Take ViT-L/14 as an example: it cuts a 224×224 image into patches of 14×14 pixels. That gives 224 / 14 = 16 patches per side, or 16×16 = 256 patches in total. Each patch (14×14×3 = 588 raw values) goes through a linear layer and becomes a 1024-dimensional vector.
So, an image gets "translated" into 256 words. Each word represents a local region of the image—just like when you scan a photo, first you look at the person on the left, then the dog on the right.
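Here is roughly what that chopping looks like in NumPy. Treat it as a sketch under assumptions: the image is random noise, the weight matrix `W` is random instead of learned, and I leave out the class token and position embeddings that a real ViT adds.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3)).astype(np.float32)  # fake image

patch = 14
grid = 224 // patch  # 16 patches per side

# Split rows and columns into (grid, patch) blocks, then group each block:
# (224, 224, 3) -> (16, 14, 16, 14, 3) -> (16, 16, 14, 14, 3) -> (256, 588)
patches = (
    image.reshape(grid, patch, grid, patch, 3)
    .transpose(0, 2, 1, 3, 4)
    .reshape(grid * grid, patch * patch * 3)
)

# The "linear layer": random weights stand in for the learned projection.
embed_dim = 1024
W = rng.normal(0, 0.02, size=(patch * patch * 3, embed_dim)).astype(np.float32)
b = np.zeros(embed_dim, dtype=np.float32)
tokens = patches @ W + b

print(patches.shape)  # (256, 588)
print(tokens.shape)   # (256, 1024)
```

Each row of `tokens` is one "word" of the image, in reading order: left to right, top to bottom.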
Last year, in my own project, I tried replacing my original CNN with ViT for feature extraction. The results were really different! ViT can capture global relationships—for example, in a photo, the person on the left and the dog on the right: ViT's attention mechanism lets them "see" each other. But CNN? You need to pile on many, many layers to achieve that, like navigating a maze.
But ViT has a fatal flaw: It only understands vision, not language. It has no idea how the features it extracts from a cat photo relate to the word "cat." It's a pure visual feature extractor, like a mute who can only see but not speak.
---
Part 3: CLIP: Getting Images and Text to "Talk" — OpenAI's Cross-Language Dictionary
In 2021, OpenAI's CLIP solved this problem. The idea is also so simple it makes you question your sanity: simultaneously train an image encoder and a text encoder so that semantically similar image-text pairs become "close" in vector space, while dissimilar ones become "far apart."
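The "close vs. far apart" idea can be written as a symmetric contrastive loss over a batch. This is a minimal sketch, not CLIP's actual training code: the embeddings are random stand-ins for encoder outputs, the function name `clip_style_loss` is mine, and I fix the temperature at 0.07 where a real model would learn it.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Row i of image_emb is assumed to be paired with row i of text_emb."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature      # (N, N) scaled cosine similarities
    idx = np.arange(len(logits))
    # Each image should pick its own text, and each text its own image.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_i2t + loss_t2i) / 2

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
aligned = text + 0.01 * rng.normal(size=(4, 8))  # images that match their text
shuffled = aligned[[1, 2, 3, 0]]                 # same images, wrong pairing

loss_aligned = clip_style_loss(aligned, text)
loss_shuffled = clip_style_loss(shuffled, text)
print(loss_aligned < loss_shuffled)  # True
```

Training pushes the loss down, which is the same thing as pulling matching pairs together and pushing mismatched ones apart.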
I tested CLIP's zero-shot classification ability—show it a cat photo, no training data needed, and it can pick "cat" from categories like "cat, dog, car, airplane." The accuracy was shockingly high!
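Mechanically, zero-shot classification is just "which label embedding is closest to the image embedding." The sketch below uses hand-made 4-dimensional vectors as hypothetical stand-ins for the two encoders' outputs (real CLIP embeddings are much larger), and the scale factor of 100 is an assumption on my part.

```python
import numpy as np

labels = ["cat", "dog", "car", "airplane"]

# Hypothetical text embeddings, one row per label.
text_emb = np.array([
    [0.9, 0.1, 0.0, 0.1],
    [0.1, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.9, 0.2],
    [0.1, 0.0, 0.2, 0.9],
])
# Hypothetical image embedding; pretend it came from a cat photo.
image_emb = np.array([0.8, 0.3, 0.05, 0.1])

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarity between the image and every label.
sims = normalize(text_emb) @ normalize(image_emb)

# Softmax over scaled similarities gives label probabilities.
scaled = 100.0 * sims
probs = np.exp(scaled - scaled.max())
probs /= probs.sum()

predicted = labels[int(np.argmax(sims))]
print(predicted)  # cat
```

No classifier head, no fine-tuning: swap the label list and you have a new classifier.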
CLIP essentially builds a "cross-language dictionary" between images and language. From then on, machines can use a shared coordinate system to represent a photo of a cat and the sentence "a fluffy orange tabby."
But CLIP can only do matching; it can't generate text. If you ask, "What's in the picture?" it can't answer. It's like an album that can recognize every friend's photo but can't speak.
---
Part 4: LLaVA: The Simplest Multimodal Solution — I Ran It Myself, It's So Simple It Made Me Want to Cry
In 2023, LLaVA (Large Language and Vision Assistant) came out. Its architecture was so simple it made me doubt my existence:
- Use CLIP's visual encoder to turn an image into visual tokens.
- Use a single linear projection layer to align the visual token dimensions to the language model's embedding space.
- Concatenate the visual tokens with text tokens and feed them into the LLM.
That's it? At first I didn't believe it. Later I ran the LLaVA-1.5 code myself and found it really was that simple. (LLaVA-1.5 swaps the single linear layer for a small two-layer MLP, but the recipe is the same.)
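The three steps fit in a few lines of NumPy. This is a shape-level sketch only: the sizes (1024 for the vision side, 4096 for the LLM, 256 visual tokens), the tiny vocabulary, and the token IDs are all assumptions I picked for illustration, and every weight is random.

```python
import numpy as np

rng = np.random.default_rng(0)

vision_dim, llm_dim = 1024, 4096   # assumed encoder / LLM widths
num_visual_tokens = 256            # e.g. a 16x16 patch grid

# Step 1: pretend these came out of the visual encoder.
visual_features = rng.normal(size=(num_visual_tokens, vision_dim))

# Step 2: one linear projection into the LLM's embedding space.
W_proj = rng.normal(0, 0.02, size=(vision_dim, llm_dim))
b_proj = np.zeros(llm_dim)
visual_tokens = visual_features @ W_proj + b_proj

# Text side: look up token IDs in a (hypothetical) embedding table.
vocab_size = 1000
embedding_table = rng.normal(size=(vocab_size, llm_dim))
prompt_ids = np.array([12, 7, 431, 88, 5])   # made-up IDs for the question
text_tokens = embedding_table[prompt_ids]

# Step 3: concatenate along the sequence axis and hand it to the LLM.
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (261, 4096)
```

From the LLM's point of view, the image is just 256 extra "words" sitting in front of the question.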
I ran the llava-llama-3-8b-v1_1 model using llama.cpp. It needs two files: the language model itself and the mmproj file, which carries the vision side (the image encoder plus the multimodal projection layer). When running, the image goes through ViT to become tokens, then through the projection layer to align dimensions, and finally it's fed into the LLM along with the question.
The results were better than I expected. Give it a hand-drawn math problem, and it could understand it and provide the solving steps. But there are still issues with details—for example, if I ask it to count how many people are in an image, it occasionally miscounts. If it can't even count people right, the model still needs more practice.
---
Part 5: Qwen2-VL: A Breakthrough in Native Fusion — I Tested It, This Is a True Leap
This year, Qwen2-VL impressed me the most. It made two key improvements, each of which made me want to applaud:
Dynamic Resolution: Many earlier models required a fixed input size (e.g., 224×224), so images got stretched, squashed, and distorted. Think about it: a 4K high-definition map compressed to 224×224. You can't even read the text anymore, so how can you understand anything? Qwen2-VL handles images at varying resolutions by turning them into a variable number of visual tokens: bigger image, more tokens. I tested it with a 4K high-definition map; the earlier models I had tried couldn't handle it at all, but Qwen2-VL could read it! This is crucial in real-world scenarios.
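A back-of-the-envelope way to see what "dynamic" means is to count visual tokens per image. The sketch assumes 14-pixel patches with each 2×2 group of patches merged into one token, as I understand the Qwen2-VL paper. The function `visual_token_count` and its rounding rule are my own simplification, and it ignores both the special start/end tokens and the pixel budget that real preprocessing enforces.

```python
def visual_token_count(height, width, patch=14, merge=2):
    """Rough number of visual tokens for an image of the given size."""
    unit = patch * merge  # one merged token covers 28x28 pixels
    # Snap each side to the nearest multiple of 28 (at least one unit).
    h = max(unit, round(height / unit) * unit)
    w = max(unit, round(width / unit) * unit)
    return (h // unit) * (w // unit)

print(visual_token_count(224, 224))    # 64
print(visual_token_count(3840, 2160))  # 10549
```

Same model, same weights, but the 4K map gets over a hundred times more tokens than the thumbnail, which is why the small text survives.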
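M-RoPE (Multimodal Rotary Position Embedding): Instead of giving every token a single 1D position, it splits rotary position encoding (RoPE) into temporal, height, and width components. Text tokens use the same ID for all three, while visual tokens keep their 2D layout (3D for video). That way visual and language tokens share one position scheme inside the model itself, instead of the image being treated as just another flat run of tokens glued in front of the text.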
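To make the position scheme concrete, here is a sketch of how (temporal, height, width) IDs could be assigned to a sequence of text, one image, then more text. It follows my reading of the Qwen2-VL paper and is simplified to a single still image; the function `mrope_position_ids` is an illustration, not the library's implementation.

```python
def mrope_position_ids(num_text_before, grid_h, grid_w, num_text_after):
    """Return a list of (t, h, w) position IDs, one tuple per token."""
    ids = []
    pos = 0
    # Text tokens: all three components share the same 1D position.
    for _ in range(num_text_before):
        ids.append((pos, pos, pos))
        pos += 1
    # Image tokens: temporal ID stays constant, height/width follow the grid.
    start = pos
    for row in range(grid_h):
        for col in range(grid_w):
            ids.append((start, start + row, start + col))
    # Following text resumes one past the largest ID used by the image.
    pos = start + max(grid_h, grid_w)
    for _ in range(num_text_after):
        ids.append((pos, pos, pos))
        pos += 1
    return ids

example = mrope_position_ids(2, 2, 3, 2)
for triple in example:
    print(triple)
```

For two text tokens, a 2×3 image grid, and two more text tokens, the image block shares temporal ID 2 while its height and width IDs trace out the grid, and the trailing text picks up at 5.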
My assessment: Qwen2-VL represents a key step for multimodal large models moving from "concatenation" toward "fusion." Early approaches (including LLaVA) essentially "translated" visual features for the language model, while Qwen2-VL makes vision and language share the same representation space at the bottom level. It's like two people going from "talking past each other" to "being on the same wavelength."
---
Part 6: Training Strategy: Two-Phase Approach — Data Quality Is 100x More Important Than Model Architecture
I've studied the training pipelines of several mainstream multimodal models and found they all basically follow two phases:
Phase 1: Alignment Pre-training
Freeze the visual encoder and LLM, train only the connector. The goal is to "translate" visual features into a format the LLM can understand. The data volume is usually large (millions of image-text pairs), but the training is relatively simple.
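Here is a toy version of "freeze everything, train only the connector." It is heavily simplified and all of it is assumption: a random matrix plays the frozen visual encoder, random targets stand in for what the frozen LLM wants to see, and a mean-squared-error loss replaces the real next-token objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen part: a random matrix stands in for the visual encoder.
frozen_encoder = rng.normal(size=(32, 16))
frozen_snapshot = frozen_encoder.copy()

# Trainable part: the connector, the only weights we update.
connector = np.zeros((16, 8))

pixels = rng.normal(size=(64, 32))   # 64 fake "images"
target = rng.normal(size=(64, 8))    # fake LLM-space targets

def loss_fn(W):
    pred = (pixels @ frozen_encoder) @ W
    return float(np.mean((pred - target) ** 2))

initial_loss = loss_fn(connector)

# Encoder output is computed once; no gradient ever reaches the encoder.
features = pixels @ frozen_encoder
lr = 1e-3
for _ in range(200):
    err = features @ connector - target
    grad = 2 * features.T @ err / err.size   # d(MSE)/d(connector)
    connector -= lr * grad

final_loss = loss_fn(connector)
print(final_loss < initial_loss)                        # True
print(np.array_equal(frozen_encoder, frozen_snapshot))  # True
```

The loss drops while the encoder stays bit-for-bit identical. Because so few parameters move, this phase is cheap even with millions of pairs.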
Phase 2: Instruction Fine-tuning
Unfreeze some parameters and train with high-quality image-text dialogue data. This phase determines the model's actual performance. I tried fine-tuning LLaVA-1.5 with a few dozen manually labeled examples, and it did improve. But data quality is crucial—feed in garbage data, and the model learns garbage. If all you see is trash images, what can you learn?
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.