GPT-4 Uses an MoE Architecture: 16 Experts Share the Work, Training Cost Cut to One-Sixth

By Cael Lee | 6 min read

Generated: 2026-06-22 13:33:47

---

When AI Opened Its Eyes for the First Time, What Did It See?

A few days ago, out of sheer boredom, I threw a "programmer before fixing a bug vs. after fixing a bug" meme at ChatGPT. It was completely baffled, like a blind person staring into the void. But when I sent the same image to GPT-4, it immediately said: "On the left, confidently opening the code; on the right, realizing the bug was actually a mistake in the requirements document."

At that moment, a chill ran down my spine.

From GPT-3 to ChatGPT, we've been chatting with a "blind person." It memorized all the world's knowledge, can write poetry, spin stories, and even debate "which came first, the chicken or the egg." But it has never seen what the world looks like—doesn't know apples are red, doesn't know cats chase mice, and certainly doesn't know how much programmer blood and tears are hidden behind a meme.

And GPT-4? That's the first time AI opened its eyes.

---

1.8 Trillion Parameters? Don't Be Fooled by the Numbers

Let's get a bit technical first. GPT-4 is rumored to have about 1.8 trillion parameters; OpenAI has not published the figure. For comparison, GPT-3 had 175 billion, so that's roughly a tenfold increase. Hearing that, you might say, "Wow, amazing!" But hold on, OpenAI was pretty clever about it.

They didn't build one giant dense model. Instead, they reportedly built a "Mixture of Experts" (MoE) model. What does that mean? It's like running a company where you don't call every employee into the conference room for every meeting; you only bring in the people who know the topic. According to analysis, GPT-4 hides 16 "experts," each with about 111 billion parameters (16 × 111 billion ≈ 1.8 trillion). For every token it processes, a router activates only two of those experts.
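To make the "only bring in the people who know the topic" idea concrete, here is a minimal sketch of top-2 expert routing in plain Python. The router scores, the `top2_route` helper, and the toy experts are all my own illustrative assumptions; OpenAI has not published how GPT-4's router actually works.

```python
import math

NUM_EXPERTS = 16              # rumored expert count
PARAMS_PER_EXPERT = 111e9     # rumored size of each expert
ACTIVE_EXPERTS = 2            # experts consulted per token


def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(router_scores):
    """Return [(expert_index, weight), ...] for the two best-scoring experts.

    Weights are renormalized so the two selected experts sum to 1.
    """
    probs = softmax(router_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    best = ranked[:ACTIVE_EXPERTS]
    kept = sum(probs[i] for i in best)
    return [(i, probs[i] / kept) for i in best]


def moe_forward(token_value, router_scores, experts):
    """Blend the outputs of the selected experts; the other 14 never run."""
    return sum(w * experts[i](token_value) for i, w in top2_route(router_scores))


# Toy experts (hypothetical): expert i multiplies its input by (i + 1).
experts = [lambda x, k=i + 1: k * x for i in range(NUM_EXPERTS)]

# Hypothetical router scores for one token: experts 3 and 7 score highest.
scores = [0.0] * NUM_EXPERTS
scores[3] = 2.0
scores[7] = 2.0

routing = top2_route(scores)
output = moe_forward(1.0, scores, experts)

total_params = NUM_EXPERTS * PARAMS_PER_EXPERT
active_expert_params = ACTIVE_EXPERTS * PARAMS_PER_EXPERT
print(routing, output)
print(f"{total_params:.3e} total vs {active_expert_params:.3e} active expert parameters")
```

With the rumored figures, the experts add up to 16 × 111 billion ≈ 1.78 trillion parameters, but only 2 × 111 billion = 222 billion expert parameters are touched for any one token.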

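I did the math: if they had built a single dense model, each forward pass (one generated token) would require about 3,700 TFLOP. With MoE, it needs only about 560 TFLOP. These are operation counts, not TFLOPS speeds. That's roughly 15% of the compute, so the cost drops to somewhere between one-sixth and one-seventh. It's a brilliant move: it saves electricity, saves GPUs, and still gets the job done.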
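If you want to check the ratio yourself, the arithmetic fits in a few lines. The two compute figures are the ones quoted above; the variable names are just mine.

```python
DENSE_TFLOP_PER_TOKEN = 3700   # single dense model, figure quoted above
MOE_TFLOP_PER_TOKEN = 560      # MoE model, figure quoted above

fraction = MOE_TFLOP_PER_TOKEN / DENSE_TFLOP_PER_TOKEN
reduction = DENSE_TFLOP_PER_TOKEN / MOE_TFLOP_PER_TOKEN

print(f"MoE uses {fraction:.1%} of the dense compute")
print(f"That is a {reduction:.1f}x reduction")
```

This prints 15.1% and 6.6x, which is where the "one-sixth" shorthand comes from.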

Speaking of training costs, that's where it gets wild. According to some analysis reports, they used about 25,000 A100 GPUs running for 90 to 100 days, with utilization (the share of the GPUs' peak compute doing useful work) of only 32% to 36%. Why so low? Because it kept crashing! Training a large model is like driving on the edge of a cliff: you could flip over at any moment. When it crashes, you have to restore from a checkpoint, and that takes hours. I've been in this field, and I know how painful that is. Every time you see "training failed," your blood pressure shoots up to 180.
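Those training numbers are easier to feel once you multiply them out. The sketch below is a back-of-the-envelope estimate, not an official figure. It assumes an A100 peak of 312 TFLOPS (the dense BF16 number from NVIDIA's datasheet) and treats "utilization" as the fraction of that peak doing useful work; both are my assumptions.

```python
GPUS = 25_000
DAYS = (90, 100)              # reported training duration range
UTILIZATION = (0.32, 0.36)    # reported utilization range
A100_PEAK_FLOPS = 312e12      # assumption: A100 dense BF16 peak, in FLOP/s
SECONDS_PER_DAY = 86_400


def gpu_hours(days):
    """Total GPU-hours consumed by the whole cluster."""
    return GPUS * days * 24


def useful_flop(days, utilization):
    """Floating-point operations that actually went into training."""
    return GPUS * A100_PEAK_FLOPS * utilization * days * SECONDS_PER_DAY


low = useful_flop(DAYS[0], UTILIZATION[0])
high = useful_flop(DAYS[1], UTILIZATION[1])

print(f"GPU-hours: {gpu_hours(DAYS[0]):,} to {gpu_hours(DAYS[1]):,}")
print(f"Useful training compute: {low:.2e} to {high:.2e} FLOP")
```

Under these assumptions the run burns 54 to 60 million GPU-hours and lands at roughly 2 × 10^25 floating-point operations, with about two-thirds of the hardware's theoretical capacity lost to restarts and other overhead.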

But the thing that shocked me most was the batch size. OpenAI eventually pushed it to 60 million tokens. Don't be fooled by that number, though, because not every expert sees all the tokens. With each token routed to 2 of the 16 experts, each expert actually processes only about 7.5 million tokens per batch (60 million × 2/16). I've seen plenty of people bragging about this number in tech groups, and honestly, it's kind of cringey. It's like someone boasting about a million-dollar salary, only to find out it's in Vietnamese dong.
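Here is the per-expert figure worked out explicitly. It is a small sketch that assumes routing is perfectly balanced across experts, which real MoE training only approximates.

```python
BATCH_TOKENS = 60_000_000   # headline batch size, in tokens
NUM_EXPERTS = 16
EXPERTS_PER_TOKEN = 2


def tokens_per_expert(batch_tokens, num_experts, experts_per_token):
    """Average tokens each expert sees, assuming perfectly balanced routing."""
    return batch_tokens * experts_per_token / num_experts


per_expert = tokens_per_expert(BATCH_TOKENS, NUM_EXPERTS, EXPERTS_PER_TOKEN)
print(f"{per_expert:,.0f} tokens per expert")
```

So the effective batch each expert learns from is one-eighth of the headline number: 7,500,000 tokens.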

---

Multimodal: From Blind Men Touching an Elephant to Seeing the World

When I say ChatGPT was blind, I'm not joking. GPT-3.5 could only process text—like a genius with a blindfold on. You ask it, "What color is an apple?" and it answers, "Red or green." But if you show it a photo of a green apple and ask, "What's this?" it's lost—it has no idea what a "green apple" looks like.

GPT-4 is different. It can process images and text together, reportedly using a "cross-attention mechanism." Simply put, images and text each go through their own Transformer, and then attention layers let the text tokens look up the most relevant parts of the image, so the two sides "understand" each other. It's like having a painter and a poet collaborate: the painter looks at the picture, the poet reads the poem, and when they exchange ideas, the picture gains a story, and the poem gains an image.
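For the curious, here is what cross-attention boils down to, as a toy sketch in plain Python. It skips the learned projection matrices a real model would have, and the two-dimensional "image patch" features are made up for illustration; it is not GPT-4's actual architecture, which has not been published.

```python
import math


def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(text_queries, image_keys, image_values):
    """For each text token, return a weighted mix of image patch values.

    Weights come from scaled dot products between the text query
    and each image key (scaled dot-product attention).
    """
    scale = math.sqrt(len(image_keys[0]))
    dim = len(image_values[0])
    mixed, all_weights = [], []
    for q in text_queries:
        weights = softmax([dot(q, k) / scale for k in image_keys])
        mixed.append(
            [sum(w * v[d] for w, v in zip(weights, image_values)) for d in range(dim)]
        )
        all_weights.append(weights)
    return mixed, all_weights


# Two hypothetical image patches with made-up 2-d features.
image_keys = [[1.0, 0.0], [0.0, 1.0]]
image_values = [[10.0, 0.0], [0.0, 10.0]]

# One text token whose query points toward patch 0.
text_queries = [[4.0, 0.0]]

mixed, weights = cross_attention(text_queries, image_keys, image_values)
print(weights[0])   # attention paid to each patch
print(mixed[0])     # image information pulled into the text token
```

The text token puts most of its attention on the patch its query matches, and the output it receives is dominated by that patch's values. That is the "poet asking the painter" step in code form.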

I've tested it a few times, and what impressed me most was: GPT-4 can understand diagrams in physics problems. You snap a photo of a force analysis diagram, and it tells you: "This object is subject to three forces: gravity downward, normal force perpendicular to the inclined plane, and friction along the plane." That was unimaginable before. An AI that can understand a physics problem means it doesn't just memorize formulas—it understands how forces actually work.

But don't get too excited. GPT-4 still has no live connection to the internet, and its training data stops in September 2021. Ask it, "What's happening with the Russia-Ukraine war?" and it says, "I don't know." In the end, it's a super-smart student, not a real-time news anchor. Like a high schooler who aced all his exams: if you ask him, "How's the stock market today?" he just looks at you with an innocent face.

---

Safety and Hallucinations: OpenAI's Dilemma, and Our Fear

On the safety front, GPT-4 has indeed made efforts. OpenAI introduced "Rule-Based Reward Models" (RBRMs), which add an extra reward signal on top of RLHF. A GPT-4-based classifier grades each response against a human-written rubric, so the model is rewarded for refusing requests like "teach me how to make a bomb" or "generate hate speech," and also for not refusing harmless ones. It's like installing a "moral police officer" inside the AI, constantly watching its every move.
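Here is a rough sketch of how such a rubric-based reward could plug into training. The four labels follow the example rubric in OpenAI's GPT-4 technical report, but everything else is my own assumption: the keyword-matching `classify_stub` stands in for what is really a model-based classifier, the reward values are invented, and simply adding the rule reward to the preference reward is just one plausible way to combine them.

```python
# Rubric labels, following the four-way example in the GPT-4 technical report.
DESIRED_REFUSAL = "refusal_desired_style"
UNDESIRED_REFUSAL = "refusal_undesired_style"
DISALLOWED = "disallowed_content"
SAFE_ANSWER = "safe_non_refusal"

# Hypothetical reward values; the real magnitudes are not public.
REWARD_HARMFUL_PROMPT = {
    DESIRED_REFUSAL: 1.0,
    UNDESIRED_REFUSAL: 0.2,
    DISALLOWED: -1.0,
    SAFE_ANSWER: 0.0,
}
REWARD_SAFE_PROMPT = {
    DESIRED_REFUSAL: -0.5,
    UNDESIRED_REFUSAL: -0.5,
    DISALLOWED: -1.0,
    SAFE_ANSWER: 1.0,
}


def classify_stub(response):
    """Stand-in for the model-based classifier; keyword matching is demo only."""
    text = response.lower()
    if "[disallowed]" in text:
        return DISALLOWED
    if text.startswith("i can't help with that"):
        return DESIRED_REFUSAL
    if "can't" in text or "cannot" in text:
        return UNDESIRED_REFUSAL
    return SAFE_ANSWER


def rbrm_reward(prompt_is_harmful, response):
    """Look up the rule-based reward for this kind of prompt and response."""
    table = REWARD_HARMFUL_PROMPT if prompt_is_harmful else REWARD_SAFE_PROMPT
    return table[classify_stub(response)]


def total_reward(preference_reward, prompt_is_harmful, response, weight=1.0):
    """Assumed combination: preference reward plus weighted rule reward."""
    return preference_reward + weight * rbrm_reward(prompt_is_harmful, response)


print(rbrm_reward(True, "I can't help with that."))             # harmful prompt, clean refusal
print(rbrm_reward(False, "I can't help with that."))            # safe prompt, needless refusal
print(rbrm_reward(False, "An apple is usually red or green."))  # safe prompt, real answer
```

The point to notice is that the same refusal earns a positive reward on a harmful prompt and a negative one on a harmless prompt, so the model is pushed away from both bomb recipes and needless refusals.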

But honestly, I think this is just a band-aid. GPT-4 still suffers from hallucinations—it still confidently spouts nonsense. I tried asking it to explain a physics concept that doesn't exist, and it fabricated a whole story, complete with a citation to "some paper." That's scary—the more powerful the AI, the harder it is to detect its lies. Just like a smarter person is harder to catch lying.

OpenAI also implemented multimodal hallucination detection, but from what I've seen, the results are mediocre. For example, show it a map that doesn't exist, and it might say, "This is a fictional geographic location." But if you show it a real but manipulated image, it might fall for it. See? Even AI can be fooled—so what about us humans?

---

My Take: This Is Not the End, It's the Beginning

GPT-4 is indeed a milestone, but it's not the finish line. Multimodal capabilities are just getting started: right now it can only handle images and text, while video, audio, and 3D models are still missing. And the cost is too high. A single conversation round with GPT-4 costs about 1 yuan (based on API pricing, for a few hundred tokens each way), and ChatGPT's daily running cost has been estimated at nearly $1 million. It's like hiring a super butler for your home, but you have to spend the price of a BMW every day to keep him.
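To see how the per-round cost depends on token counts, here is a small calculator. The prices are the launch list prices for the GPT-4 API (8K and 32K context variants), and the exchange rate of 7 yuan per US dollar is a round number I picked; treat both as assumptions rather than current figures.

```python
# Assumed launch list prices, USD per 1,000 tokens: (prompt, completion).
PRICES_USD_PER_1K = {
    "8k": (0.03, 0.06),
    "32k": (0.06, 0.12),
}
CNY_PER_USD = 7.0   # assumed round exchange rate


def round_cost_cny(prompt_tokens, completion_tokens, model="8k"):
    """Cost in yuan of one request/response round."""
    prompt_price, completion_price = PRICES_USD_PER_1K[model]
    usd = (prompt_tokens / 1000 * prompt_price
           + completion_tokens / 1000 * completion_price)
    return usd * CNY_PER_USD


short_round_8k = round_cost_cny(500, 500, "8k")
long_round_8k = round_cost_cny(1500, 1500, "8k")
round_32k = round_cost_cny(800, 800, "32k")

print(f"500 in / 500 out on 8K:    {short_round_8k:.2f} yuan")
print(f"1500 in / 1500 out on 8K:  {long_round_8k:.2f} yuan")
print(f"800 in / 800 out on 32K:   {round_32k:.2f} yuan")
```

Under these assumptions, a round costs about 1 yuan at around 800 tokens each way on the 32K model, or around 1,500 each way on the 8K model. Keep in mind that the prompt side grows as conversation history is resent with every turn.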

I predict two directions in the next two years: first, model lightweighting to make it accessible to more people; second, expansion of multimodal capabilities—video understanding will be the next explosion point. Just imagine, when AI can understand videos, comprehend a snippet of your daily life, and even help you edit a vlog—how crazy would that be?

But what worries me most isn't the technology itself—it's this question: Is AI's "understanding" real understanding, or just advanced pattern matching? When GPT-4 can understand memes, solve physics problems, and write code, how should we define "intelligence"? This question, I think, is even more worth pondering than GPT-4 itself.

Speaking of which, I recall a story: Someone asked a philosopher, "How can you prove you're not living in a dream?" The philosopher replied, "You can never prove it, because you can't use the logic of a dream to deny the dream." It's the same with AI—it may never truly "understand" the world, but it can act as if it does. And isn't that just like us humans? Who among us can honestly say they truly "understand" the world?

In short, GPT-4 is a super tool—used well, it's a weapon; used poorly, it's a disaster. How we use it depends on our own wisdom as "humans." And the first thing AI saw when it opened its eyes wasn't the world—it was us.

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
