How Are GPT-4's Multimodal Capabilities Implemented?
Generated: 2026-06-20 12:59:42
---
Have you ever written a shopping list in handwriting so wild and scrawling that even you had to squint at it for ages to figure it out?
I have.
Then I snapped a photo and tossed it at GPT‑4V—the one with vision capabilities. And guess what? It not only read every single character without missing a beat, but also automatically sorted it into “Perishables” and “Household Items.” Finally, it casually added: “Suggestion: check the milk’s expiration date.”
I was floored.
This wasn’t simple text recognition—it understood. It knew this was a shopping list, knew milk goes bad, knew to remind you. A chill ran down my spine: this thing is more thoughtful than my mom.
But you know what? When I first started diving into multimodality, I was hammered by all those fancy terms—cross‑modal attention, unified semantic space, Q‑Former bridging… a whole barrage of buzzwords that left me dizzy. Then I played around with a few models, pored over a dozen technical reports, and realized: It’s not as mystical as you think, but it’s also way more mind‑blowing than you imagine!
---
You Think Multimodality Is Just Looking at Pictures and Reading Words? Naive!
A lot of people feel that multimodality is just making a model both see and read, then stitching the two together.
That’s about as accurate as saying “building a rocket is just strapping fuel and an engine together.”
The real core of multimodality boils down to one idea: throw images and text into the same Transformer and let them share a single attention mechanism. OpenAI has not published GPT‑4’s architecture, so treat this as the general recipe from public research rather than a confirmed blueprint. The point is that vision and text don’t live in separate worlds that only get bolted together at the very end. Once the image has been turned into tokens, everyone’s splashing around in the same pool!
How does it work?
Images come in and get chopped into patches (Patch Embedding). Text comes in and gets sliced into tokens (Token Embedding). Then everything is dumped into one shared sequence. In the world of attention, there’s no distinction between image and text; the only question is which tokens should attend to which. So every token mingles with its own kind and buddies up with the other kind too.
In other words, when it looks at your shopping list photo, it doesn’t just recognize a few words—it blends the shapes of the characters, the surrounding pixels, and the overall layout into one holistic understanding. It’s like when you look at a photo: you don’t just see the person; you take in the background, the lighting, the mood—it’s all about overall perception.
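To make the "same pool" idea concrete, here is a minimal NumPy sketch under loud assumptions: the sizes, the random weights, and the helper names `patchify` and `self_attention` are all made up for illustration, and nothing here is GPT‑4's actual code. It only shows image patches and text tokens landing in one sequence and sharing one attention step.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL = 16  # hypothetical embedding width, tiny on purpose


def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    rows, cols = h // patch, w // patch
    x = image.reshape(rows, patch, cols, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, c)
    return x.reshape(rows * cols, patch * patch * c)


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention over the whole sequence."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Toy inputs: an 8x8 RGB "photo" and three token ids from a 20-word vocabulary.
image = rng.random((8, 8, 3))
token_ids = np.array([5, 2, 9])

# Patch embedding: 4 patches of 4*4*3 = 48 values, projected to D_MODEL.
patches = patchify(image, patch=4)
patch_proj = rng.normal(size=(48, D_MODEL)) * 0.1
image_embeds = patches @ patch_proj

# Token embedding: a plain lookup table.
token_table = rng.normal(size=(20, D_MODEL)) * 0.1
text_embeds = token_table[token_ids]

# One shared sequence: attention no longer knows which rows were pixels.
sequence = np.concatenate([image_embeds, text_embeds], axis=0)
wq, wk, wv = (rng.normal(size=(D_MODEL, D_MODEL)) * 0.1 for _ in range(3))
out, weights = self_attention(sequence, wq, wk, wv)

print(sequence.shape, weights.shape)
```

Each row of `weights` spreads attention across all seven positions, four patches and three tokens alike, which is the whole trick in miniature.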
---
What Really Blew My Mind Was the “Minimalism” of Open‑Source Models
Last year I ran an open‑source model called LLaVA locally. It chases the same goal as GPT‑4V, but the architecture is so simple I had to question my sanity: a CLIP ViT for image encoding, a linear projection layer for dimension alignment, and then it’s straight up married to Vicuna, a LLaMA‑based chat model.
No Q‑Former. No complex cross‑modal attention. Just one linear mapping!
And the results? Frighteningly good.
Why? Because the language model itself is already a super‑context‑understanding machine! You just have to translate the visual information into something it can understand, and it figures out how to use it on its own. That linear projection layer is the interpreter—converting the visual coordinates into language coordinates.
Speaking of which, I have to say: Large language models have been seriously underestimated. You think they’re just talkative chatbots, but their real superpower is comprehension—no matter whether the input is text, images, or audio, as long as the format is right, they’ll chew it up and spit out logic.
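Here is roughly what that "interpreter" looks like, as a hedged NumPy sketch rather than LLaVA's real code. The dimensions, the random features standing in for the vision encoder's output, and the function name `project_visual_features` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

VISION_DIM = 32  # hypothetical width of the vision encoder's features
LLM_DIM = 64     # hypothetical width of the language model's embeddings


def project_visual_features(features, weight, bias):
    """One linear layer: map vision features into the LLM embedding space."""
    return features @ weight + bias


# Stand-in for the output of a frozen vision encoder: 9 patch features.
vision_features = rng.normal(size=(9, VISION_DIM))

# The only new parameters in this sketch: one weight matrix and one bias.
weight = rng.normal(size=(VISION_DIM, LLM_DIM)) * 0.02
bias = np.zeros(LLM_DIM)
visual_tokens = project_visual_features(vision_features, weight, bias)

# Stand-in for the embedded text prompt (5 tokens).
prompt_embeds = rng.normal(size=(5, LLM_DIM))

# The language model would simply receive image "words" followed by text words.
llm_input = np.concatenate([visual_tokens, prompt_embeds], axis=0)
print(llm_input.shape)
```

The design choice worth noticing is how little is being trained here: the heavy lifting stays with the pretrained encoder and the language model, and the projection only has to line up two coordinate systems.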
---
The Training Process—That’s Where the Real Magic Happens
GPT‑4’s multimodality wasn’t built in a day. OpenAI has published few training details, but the process can be pictured as roughly four stages, with tons of iterations.
Stage 1: Massive Pre‑training. Pure text, interleaved image‑text, image‑caption pairs—all thrown in. First, carpet‑bomb it with data to establish basic understanding. Sort of like my old BERT training days—first drown it in data, then fine‑tune.
Stage 2: Behavior Alignment. Teach the model to align with human values. This stage produced two key weapons: a learned reward model (RM) trained on human preference comparisons, and rule‑based reward models (RBRMs) that grade outputs against a written rubric.
Stage 3: RLHF. Use those two reward models as judges to score the main model, which then learns from the scores. You write, I edit, I score, you practice accordingly.
Stage 4, the slickest move: Self‑Improvement. GPT‑4 itself is reportedly used to help generate more training data, which then feeds back into further rounds of training. Basically, it becomes its own teacher!
Are you kidding me?
Back when I trained models, the most painful part was data annotation. Now the model can produce training data itself. That’s just how technology progresses—unreasonably, and it makes you both love and fear it!
---
On Safety, GPT‑4 Went All In
You might think: multimodality is just a beefed‑up search engine, right? What’s so hard?
Half right.
Multimodality does improve understanding, but it also brings new risks—like “multimodal hallucination”: the model looks at an image and insists it sees something that isn’t there. Plus, harmful content in images is harder to detect.
To tackle this, GPT‑4 introduced a set of “Rule‑Based Reward Models” (RBRMs). Essentially, these are zero‑shot GPT‑4 classifiers that grade a response against a human‑written rubric and provide additional reward signals during RLHF fine‑tuning. For example, teaching it: “When faced with a harmful content request, you must refuse, but refuse gracefully, with no evasiveness and no rambling.”
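As a rough sketch of how a rubric label could become a reward signal: the four labels below follow the categories described in OpenAI's technical report, but every number, the additive combination, and the names `rule_based_reward` and `total_reward` are my own assumptions. In a real setup the label would come from a zero‑shot classifier; here it is simply passed in.

```python
from enum import Enum


class Label(Enum):
    REFUSAL_DESIRED = "refusal in the desired style"
    REFUSAL_UNDESIRED = "refusal in an undesired style (evasive or rambling)"
    DISALLOWED_CONTENT = "contains disallowed content"
    SAFE_NON_REFUSAL = "safe non-refusal response"


# Hypothetical reward values, chosen only for illustration.
REWARD_IF_HARMFUL_REQUEST = {
    Label.REFUSAL_DESIRED: 1.0,
    Label.REFUSAL_UNDESIRED: -0.5,
    Label.DISALLOWED_CONTENT: -1.0,
    Label.SAFE_NON_REFUSAL: 0.0,
}
REWARD_IF_BENIGN_REQUEST = {
    Label.REFUSAL_DESIRED: -1.0,   # refusing a harmless request is also a failure
    Label.REFUSAL_UNDESIRED: -1.0,
    Label.DISALLOWED_CONTENT: -1.0,
    Label.SAFE_NON_REFUSAL: 1.0,
}


def rule_based_reward(request_is_harmful: bool, label: Label) -> float:
    """Look up the rubric reward for a classified response."""
    table = REWARD_IF_HARMFUL_REQUEST if request_is_harmful else REWARD_IF_BENIGN_REQUEST
    return table[label]


def total_reward(rm_score: float, request_is_harmful: bool, label: Label,
                 rule_weight: float = 1.0) -> float:
    """Assumed combination: learned RM score plus a weighted rubric reward."""
    return rm_score + rule_weight * rule_based_reward(request_is_harmful, label)


# A graceful refusal of a harmful request outscores a rambling one.
graceful = total_reward(0.5, True, Label.REFUSAL_DESIRED)
rambling = total_reward(0.5, True, Label.REFUSAL_UNDESIRED)
print(graceful, rambling)
```

Note the second table: a rubric like this can also penalize refusing harmless requests, so "just refuse everything" is not a winning strategy.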
And how effective is it? OpenAI’s own numbers speak:
- GPT‑4’s tendency to comply with requests for prohibited content dropped by 82% compared to GPT‑3.5!
- Toxic content generation rate on the RealToxicityPrompts dataset: only 0.73% for GPT‑4, compared to 6.48% for GPT‑3.5.
See? This isn’t a superficial upgrade—the underlying safety mechanisms are genuinely solid.
Numbers don’t lie.
---
But Don’t Think It’s Perfect—Reality Check
I have to be brutally honest.
GPT‑4’s multimodality still makes stuff up. Show it a blurry photo and it might say with confidence, “This is a Corgi!”—when it’s actually a cat.
The irony behind this is particularly biting: the goal of RLHF training is to make human annotators feel the model’s output is good. But human annotators usually prefer a “confident answer” over an “I’m not sure.” So the model learns: even if you don’t know, pretend you do!
Isn’t that like encouraging it to be confidently wrong?
That’s technology for you—always two sides. We gain unprecedented understanding, but we also have to put up with its “overconfidence.”
It reads the scrawled shopping list well because the photo still gives it enough to work with; it mistakes a blurry cat for a dog because nobody taught it that “I don’t know” is a valid answer.
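The "confidently wrong" incentive can be shown with a tiny worked example. Reward models in RLHF are commonly fit to pairwise comparisons with a Bradley‑Terry style logistic model; whether GPT‑4's reward model works exactly this way is not something OpenAI has detailed, and the 80% preference rate below is a made‑up number.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """P(annotator prefers A over B) under a Bradley-Terry model."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def fitted_reward_gap(preference_rate: float) -> float:
    """Reward gap (A minus B) that best explains an observed preference rate."""
    return math.log(preference_rate / (1.0 - preference_rate))


# Hypothetical: annotators pick the confident answer over "I'm not sure"
# in 80% of comparisons, regardless of which one is actually correct.
confident_preference_rate = 0.8
gap = fitted_reward_gap(confident_preference_rate)

print(round(gap, 3))                               # 1.386
print(round(preference_probability(gap, 0.0), 3))  # 0.8
```

A positive gap means the reward model pays extra for sounding sure, and a policy trained against that reward will learn to sound sure.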
---
In the End
From the shopping list to the Corgi, how does GPT‑4’s multimodality actually work?
Simply put, it throws images and text into the same big pool and lets them soak and learn from each other. Add four stages of intense training, layered safety mechanisms, and self‑iterative upgrades. It doesn’t have eyes, and it doesn’t need them. It relies on data, algorithms, and humanity’s hunger for understanding.
And today, maybe we all need to ask ourselves one question:
Now that machines are starting to truly “understand” us, can we give them a better world to understand?
Here’s a closing line for you, the reader who made it to the end:
The ultimate meaning of multimodality isn’t to let machines see the world—it’s to let the world, through the machines, truly see us for the first time.
(Save that one for a screenshot.)
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.