Chen Wei: GPT-4 Core Technology Analysis Report 2 (GPT)

Generated: 2026-06-20 17:16:16

---


To write this piece, I spent a whole week poring over Chen Wei's GPT-4 analysis report, which runs to tens of thousands of words, until my eyes nearly gave out. I also sank thousands of yuan of my own compute budget into hundreds of tests before I felt ready to sit down and give you my honest take.

First, let me tell you who I am. I've been splashing around in the NLP pool for nearly a decade, starting back in the earliest BERT era when it could barely even manage a proper "hello." I've been messing with pretrained models from the beginning. Later I felt that wrestling with just text was getting boring, so I dove headfirst into the world of chip architecture. But I never let go of the thread on large models.

Writing about GPT-4 this time is my way of laying it all out—the pits I fell into, the rivers I crossed, and those little-known tricks of the trade—so the brothers and sisters coming after me don't have to go down the same wrong paths.

Bottom line: GPT-4 and GPT-3.5 are not even the same species. 3.5 is like a blindfolded brute swinging wildly; 4 is a detective who sees the world with eyes wide open, noticing everything. A lot of people gloss over "multimodal" as just another buzzword, but let me tell you, that's the hardest-hitting, most explosive core of GPT-4.

From GPT-1 to GPT-4: A Story Told Backwards

A lot of people get the GPT family history mixed up, so let me lay it out straight for you.

In 2017, after the Transformer paper came out, the NLP world split into two factions. One was the BERT camp: they liked playing a "fill-in-the-blank game" (I cover up a word, you guess what it was). The other was our GPT camp; we only did one thing: generate left to right, one token after another, without stopping.
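To make the two camps concrete, here is a minimal sketch of how each objective turns a sentence into training examples. The function names and the toy sentence are my own illustration, not code from BERT or GPT.

```python
def masked_lm_example(tokens, mask_index, mask_token="[MASK]"):
    """BERT-style: hide one token; the model sees context on both sides."""
    inputs = list(tokens)
    target = inputs[mask_index]
    inputs[mask_index] = mask_token
    return inputs, target


def causal_lm_examples(tokens):
    """GPT-style: predict each token using only the tokens to its left."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Fill in the blank: one hidden word, context on both sides.
print(masked_lm_example(tokens, 2))

# Left to right: every prefix predicts the next word.
for context, target in causal_lm_examples(tokens):
    print(context, "->", target)
```

The masked example gets to peek at "on the mat" when guessing the hidden word; the causal examples never see anything to the right of the word being predicted, which is what makes open-ended generation natural for the GPT family.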

Speaking of which, back in the GPT-1 days, hardly anyone took it seriously. The recipe was unsupervised pretraining plus supervised fine-tuning, and the results were just so-so.

Then came GPT-2, and the parameter count jumped to 1.5 billion. And hey, a strange thing happened: suddenly you didn't even need to fine-tune it to get some work done. People in the field started to notice: this thing is kind of interesting.

GPT-3 straight up cranked the parameters to 175 billion. What does that number mean? It's astronomical. Just give it a few examples and it could figure out the rest on its own. By then my nerves were on edge: this is going to change everything.

And what happened? GPT-3.5 added instruction fine-tuning, and then ChatGPT exploded, throwing the whole world into chaos.

Right here, I want to throw out the most counterintuitive point: GPT-4's real breakthrough isn't about parameters at all. Its true battlefield is giving this genius brain a pair of eyes, so that it takes images as well as text as input.

Previous models were just nerds who could only read books. Now? You give it a picture, it understands. You give it a chart, it recognizes it. The craziest part: toss it a physics problem with a complex diagram, and it can walk you through the solution step by step, just like your high school physics teacher. Think about it—how can an AI that has never even seen what the world looks like possibly understand the world? Chen Wei put it well in his report: without multimodality, AI is chatting with everyone blindfolded.

Multimodal Architecture: The Hardest Part Isn't Just Slapping on a Camera

I tested it myself. I took a screenshot of the physics exam problem about an inclined plane and a block that gave me the biggest headache in high school, and threw it at GPT-4.

What did it do? Not only did it read that 47-degree angle in the diagram, it automatically linked that angle to the sinθ in the formula right next to it. Then, step by step, it calculated the answer for me. I was stunned right then and there. That ability is not something you achieve by just cramming an image into a text model.
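For reference, here is the kind of arithmetic involved, as a minimal sketch. The exact exam problem is not reproduced in this article, so I am assuming the simplest version: a block sliding down a frictionless incline, with g = 9.81 m/s^2. Only the 47-degree angle comes from my test.

```python
import math


def incline_acceleration(angle_deg, g=9.81):
    """Acceleration of a block down a frictionless incline: a = g * sin(theta)."""
    theta = math.radians(angle_deg)  # math.sin expects radians, not degrees
    return g * math.sin(theta)


a = incline_acceleration(47.0)
print(f"a = {a:.2f} m/s^2")  # prints: a = 7.17 m/s^2
```

The point is the link GPT-4 had to make: the angle exists only in the diagram, while sinθ exists only in the formula, and the answer needs both.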

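Inside GPT-4's brain, images and text travel down the same highway: the Transformer architecture. OpenAI has not published the architecture, so treat what follows as an informed reading rather than a confirmed design. The image first goes into a visual encoder and gets converted into a sequence of numeric vectors, which are then mixed with the text tokens that come after them, so that attention is computed jointly over both. This is often loosely called "cross-attention," although strictly that term means one sequence querying another; what is described here is joint attention over a single combined sequence. Either way, this fusion step is the true heart of multimodal capability.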
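The fusion step described above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not GPT-4's actual design, which OpenAI has not disclosed: the embeddings are made-up 4-dimensional vectors, there is a single attention head, and queries, keys and values are the embeddings themselves with no learned projections.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def joint_attention(image_tokens, text_tokens):
    """Single-head self-attention over the sequence [image tokens ; text tokens].

    Every position, image or text, attends to every other position.
    """
    seq = image_tokens + text_tokens  # one combined sequence
    scale = math.sqrt(len(seq[0]))    # standard 1/sqrt(d) scaling
    outputs, weights = [], []
    for q in seq:
        w = softmax([dot(q, k) / scale for k in seq])
        out = [sum(wi * v[d] for wi, v in zip(w, seq)) for d in range(len(q))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Hypothetical embeddings: two image patches, then two text tokens.
image_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
text_tokens = [[0.9, 0.1, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

outputs, weights = joint_attention(image_tokens, text_tokens)

# Row 2 belongs to the first text token: positions 0-1 are image patches,
# positions 2-3 are text tokens.
print([round(w, 3) for w in weights[2]])
```

In this toy, the first text token puts more weight on the first image patch than on the second because their embeddings point the same way. That is the mechanism, in miniature, by which an angle in a diagram could be tied to a sinθ in the text.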

What I'm saying is, multimodality doesn't add to the model—it multiplies it. Chen Wei's report introduced a concept called "multimodal emergent abilities"—cross-modal learning lets new, unexpected skills sprout up like bamboo shoots after rain. I've experienced this firsthand.

During testing, I found that GPT-4 can understand internet memes. Give it a comic where a tiny figure lifting weights represents "work stress," and it doesn't just tell you what's in the picture: it can turn that visual metaphor into a spot-on piece of sarcastic commentary. A pure text model, which never sees the image at all, cannot do that.

RLHF Isn't a Magic Bullet—Here's the Other Secret Weapon

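A lot of people think GPT-4 is so great all because of RLHF. But don't forget, OpenAI's technical report also describes something called a "rule-based reward model" (RBRM). According to the report, RBRMs are zero-shot GPT-4 classifiers that supply an extra reward signal during RLHF fine-tuning, for example rewarding the model for refusing harmful requests and for not refusing harmless ones. In simple terms, they added another ruler to reinforcement learning to make sure the model doesn't go off track while learning to behave (alignment).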
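Here is a minimal sketch of the "second ruler" idea: a rule-based score added on top of the usual preference-model score. Everything in it is an assumption for illustration. The function names, the keyword-based refusal check, the reward values and `rule_weight` are mine. In OpenAI's description the RBRM is a model-based classifier, not a keyword match, and the report does not give a formula for combining the signals.

```python
REFUSAL_PREFIXES = ("i can't", "i cannot", "sorry")  # crude stand-in for a classifier


def rule_based_reward(prompt_is_harmful, response):
    """Toy rule: reward refusing harmful prompts and not refusing safe ones."""
    refused = response.lower().startswith(REFUSAL_PREFIXES)
    if prompt_is_harmful:
        return 1.0 if refused else -1.0
    return -1.0 if refused else 1.0


def total_reward(preference_score, prompt_is_harmful, response, rule_weight=0.5):
    """Combine a (hypothetical) preference-model score with the rule-based score."""
    return preference_score + rule_weight * rule_based_reward(prompt_is_harmful, response)


# A fluent but harmful answer scores well on preference alone,
# yet ends up below a plain refusal once the rule is applied.
harmful_answer = total_reward(0.8, True, "Sure, here is how to do it...")
refusal = total_reward(0.2, True, "Sorry, I can't help with that.")
print(refusal > harmful_answer)  # prints: True
```

The design point is that the rule term corrects the preference model exactly where it is weakest: answers that sound helpful but should not have been given, and refusals of requests that were perfectly fine.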

I tested this myself as well. I took the same question and kept pestering GPT-3.5 and GPT-4 with it. After a few rounds, 3.5 would start rambling and making things up. But 4? It still made mistakes sometimes, but correcting it was surprisingly easy. It's like a kid who has just been scolded: especially "obedient." To me, that suggests the double insurance of RLHF plus rule-based rewards really works.

But don't expect it to be perfect. I asked it to explain a time-series chart that included


Cael Lee
