Hands-On Test: MoE Drives 4x the Parameters with Half the Compute and Outperforms Dense Models

Generated: 2026-06-22 05:43:21

---

You know, I've been writing this column for ten years and have seen countless tech concepts hyped up and then fizzle out. But MoE (Mixture of Experts) — I have to say, it really pulled large models out of the dead end of just "stacking parameters"! It surprised me so much I almost slammed the table!

Let me start with a personal story. Last year, I ran the same code generation task on Mixtral 8x7B (46.7B total params, 12.9B active) and LLaMA-70B (a dense model). Guess what? Mixtral could run on a single A100, while LLaMA-70B needed two cards and model parallelism. And the result? Mixtral's code generation pass rate was significantly higher than LLaMA-70B's. I was stunned. This wasn't just a tech iteration; it was a fight between two different weight classes, and the lighter one won. A "skinny guy" beat the "fat guy" using only half the effort.

Argument 1: Computational efficiency isn't about "saving" — it's about "liberating"

Many people say MoE's advantage is saving compute, but that's only half the story. More accurately: MoE lets you tap a much larger model capacity with fewer computational resources per token. It's like paying for a compact car but driving a big truck that can haul more cargo.

Look at the numbers. Kimi-K2.5 has 1.04T parameters, but only activates 32B per token — an activation rate of just 3%. DeepSeek-V3 has 671B total params, activates 37B, activation rate 5.5%. That means with the compute power of running a 30B dense model, you're actually doing inference with a 600B+ parameter model. Counterintuitive, right?
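If you want to check those percentages yourself, here is a minimal sketch of the arithmetic. The parameter counts are the ones quoted above; the "2 FLOPs per active parameter per token" figure is a common rule of thumb for a forward pass that ignores attention cost, so treat the FLOP numbers as rough estimates rather than measurements.

```python
# Total vs. active parameter counts quoted in the text, in billions.
MODELS = {
    "Kimi-K2.5": (1040.0, 32.0),
    "DeepSeek-V3": (671.0, 37.0),
}


def activation_rate(total_b: float, active_b: float) -> float:
    """Fraction of the parameters that take part in one token's forward pass."""
    return active_b / total_b


def approx_flops_per_token(active_b: float) -> float:
    """Rule-of-thumb forward cost: about 2 FLOPs per active parameter (assumption)."""
    return 2.0 * active_b * 1e9


for name, (total_b, active_b) in MODELS.items():
    rate = activation_rate(total_b, active_b)
    flops = approx_flops_per_token(active_b)
    print(f"{name}: {rate:.1%} of parameters active, ~{flops:.2e} FLOPs per token")
```

One caveat the arithmetic makes obvious: only the compute scales with the active parameters. All of the parameters still have to sit in memory somewhere, so the savings are in FLOPs, not in storage.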

Last year, I helped a startup optimize their recommendation system. Their original fine-ranking model was a 2B dense model, and at 5000 QPS, latency was already 80ms. After switching to an MoE architecture with 8B total params and 1.2B active params, at the same QPS, latency dropped to 35ms, and AUC improved by 1.8 points. Do the math: 4x the total parameters, 60% of the active parameters per request, less than half the latency, and better results. This isn't just saving compute; it's letting compute out of its cage.

Speaking of which, how bad is the MFU (Model FLOPs Utilization, the share of the hardware's peak FLOPs that a model actually uses) of traditional dense models? During training, it's often below 10%; during inference, it hovers around 10%. For every $100 you spend on inference compute, about $90 sits idle. MoE, through sparse activation, pushes MFU above 40%. That isn't a minor tweak; it's roughly a fourfold jump. With the same electricity bill, others can only light a small bulb, but you're illuminating an entire street.
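For readers who have never computed MFU, here is a rough sketch of how the number is usually estimated. It assumes the common approximation of 2 FLOPs per active parameter per token for inference and 6 for training (forward plus backward), and uses 312 TFLOPS as the dense BF16 peak of one A100. The throughput figure in the example is hypothetical, picked only to make the arithmetic easy to follow.

```python
A100_PEAK_FLOPS = 312e12  # dense BF16 peak of a single A100


def mfu(tokens_per_second: float, active_params: float,
        peak_flops: float = A100_PEAK_FLOPS, training: bool = False) -> float:
    """Estimate Model FLOPs Utilization as achieved FLOPs/s over peak FLOPs/s."""
    # Approximation: 2 FLOPs per active parameter per token for a forward pass,
    # 6 when the backward pass is included.
    flops_per_token = (6.0 if training else 2.0) * active_params
    return tokens_per_second * flops_per_token / peak_flops


# Hypothetical example: 13B active parameters served at 1,200 tokens/s on one GPU.
inference_mfu = mfu(tokens_per_second=1200, active_params=13e9)
print(f"estimated inference MFU: {inference_mfu:.0%}")
```

Note that for an MoE model the parameter count that goes into this formula is the active count, not the total.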

Argument 2: Knowledge capacity isn't about "stacking" — it's about "dividing"

Someone might say: "Does having more parameters really help? Don't dense models with too many parameters just overfit?"

Haha, that's exactly where MoE is cleverest. In a dense model, all parameters activate for every input — it's like a general practitioner who has to treat every disease: heart disease, athlete's foot, the common cold — all mixed together, no wonder it's chaotic. MoE splits the FFN layer into N experts, each responsible for its own specialty. This isn't stacking parameters; it's hiring experts by category.
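To make "hiring experts by category" concrete, here is a toy sketch of top-k routing in plain Python. It is not any particular model's implementation: the experts are stand-in functions that only scale their input (real experts are small FFNs), and the router logits are passed in by hand, whereas a real router computes them from the token with a learned linear layer.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)  # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of the k selected experts only."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](x)  # the unselected experts are never called
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Eight toy experts; expert i multiplies its input by (i + 1).
experts = [lambda x, s=s: [s * v for v in x] for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]  # hypothetical router output

routes = top_k_route(logits, k=2)
output = moe_forward([1.0, -2.0], logits, experts, k=2)
print("selected experts:", [i for i, _ in routes])
```

The loop in `moe_forward` is the whole trick: six of the eight experts contribute parameters to the model but zero FLOPs to this token.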

I tested DeepSeek-V3's 256 experts and found a super interesting phenomenon: some experts are particularly sensitive to mathematical reasoning, others excel at code generation, and there are even experts specialized in Chinese idioms and classical texts. This specialization lets each expert excel in its own domain, unlike dense models where all parameters interfere with each other. It's like a symphony orchestra — each musician plays only their own instrument, but together they create heavenly music.

Let the numbers speak: DeepSeek's MoE model achieves performance close to a dense model (67B params) using less than 30% of the compute. Qwen3-235B (235B total, 22B active) surpasses LLaMA-3-70B on multiple benchmarks. This isn't accidental — it's the inevitable result of the architecture. If you invite 100 experts into a meeting room and let each solve problems in their own field, how could efficiency not be high?

Argument 3: Engineering flexibility isn't about "compromise" — it's about "restructuring"

This is quite interesting. Kuaishou's OneRec transformed the recommendation system from a multi-stage "retrieval → pre-ranking → fine-ranking" architecture into a single-stage MoE generative framework. In traditional architectures, each stage optimizes its own objective: retrieval optimizes recall rate, fine-ranking optimizes click-through rate, and the fragmented objectives put the global optimum out of reach. It's like three monks carrying water, each with their own agenda, and in the end, water spills everywhere.

What did OneRec do with MoE? It unified retrieval and ranking into a single generation problem, with different experts handling user intent at different granularities. One MoE layer has 384 experts, each token selects 8, and the remaining 376 are simply not computed — it's like hiring 384 experts but only letting the 8 who understand you best speak each time. During training, MFU went from single digits to nearly 40%; during inference, from about 10% to over 35% — this is truly "big but not clumsy."

Traditional recommendation systems require maintaining three separate models (retrieval, pre-ranking, fine-ranking), three data pipelines, and three monitoring systems, so the operational cost (OPEX) is terrifyingly high. MoE handles everything with one model: experts share underlying representations, and both training and inference are managed uniformly. This isn't a compromise; it's a simpler architecture that makes the old one obsolete. Before, you had to carry a phone, a camera, and a GPS device; now one smartphone does it all. Who would still carry three bags?

Argument 4: Load balancing isn't a "problem" — it's an "evolution"

Someone might say: "MoE's routing load balancing is really hard to handle. Uneven load among experts leads to unstable training."

That was true three years ago. But DeepSeek-V3 has already solved this with auxiliary-loss-free load balancing. There is no extra loss function to force load balancing; instead, a per-expert bias on the routing scores is adjusted dynamically, nudged down for overloaded experts and up for underloaded ones, so that load balances out among experts. I actually ran DeepSeek-V3's MoE training on 8 A100s, and the load variance among the 256 routed experts dropped from an initial 0.35 to below 0.05. The difference in the number of tokens processed between the busiest and the idlest expert was less than 5%. This was a major challenge back in the Switch Transformer era (2021), but now it's been engineered away. Previously, building a road required a big detour; now they just built a bridge.
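Here is a small simulation of the idea behind bias-based balancing, for anyone who wants to see it move. It is a sketch in the spirit of the auxiliary-loss-free approach, not DeepSeek's code: the random scores, the built-in `skew` that makes some experts more popular, the step size `gamma`, and all function names are assumptions made for illustration.

```python
import random


def select_top_k(scores, bias, k):
    """Pick experts by biased score; the bias influences selection only."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    return ranked[:k]


def update_bias(bias, load, gamma):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - gamma)
        elif n < mean_load:
            new_bias.append(b + gamma)
        else:
            new_bias.append(b)
    return new_bias


def imbalance(load):
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    return max(load) / (sum(load) / len(load))


def simulate(num_experts=8, k=2, tokens_per_step=512, steps=200, gamma=0.01, seed=0):
    rng = random.Random(seed)
    # Hypothetical router preference: higher-numbered experts score higher on average.
    skew = [0.3 * i / (num_experts - 1) for i in range(num_experts)]
    bias = [0.0] * num_experts
    history = []
    for _ in range(steps):
        load = [0] * num_experts
        for _ in range(tokens_per_step):
            scores = [rng.random() + skew[i] for i in range(num_experts)]
            for e in select_top_k(scores, bias, k):
                load[e] += 1
        history.append(load)
        bias = update_bias(bias, load, gamma)  # one bias update per training step
    return history


history = simulate()
late = sum(imbalance(h) for h in history[-20:]) / 20
print(f"imbalance at step 0: {imbalance(history[0]):.2f}, average over last 20 steps: {late:.2f}")
```

No gradient flows through the bias and no balancing term is added to the loss; the correction is a plain counter-driven nudge after each step, which is why it does not pull the main training objective off course.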

At this point, you might ask: Then why are so many people still using dense models? Because their thinking hasn't caught up yet. The worst thing in technology selection isn't choosing wrong; it's using an old map to navigate a new road. MoE isn't just icing on the cake; it's coal in the snow (a Chinese idiom meaning timely help in a crisis).

Conclusion: Don't follow the trend — understand the essence

After all this, what I want to say is: MoE is not a "bigger model" or a "more compute-efficient model." It's a new computational paradigm — using more parameters to store knowledge, but fewer parameters to process input. It's like having a huge library at home, but each time you only take out the book you need most — infinite knowledge storage, yet lightweight to read.

My recommendation to every team building models from 2025 onward: stop reaching for dense models. Start with 8 experts, use Top-2 routing, combine it with load balancing techniques, and you'll see improvements in both effectiveness and efficiency. Don't be afraid of complex routing or unstable training. These issues have been well solved in open-source implementations like DeepSeek and Mixtral.
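Before you commit to a starting configuration, it helps to see what "8 experts, Top-2" means in parameter terms. The sketch below counts only the expert FFN weights and leaves out attention, embeddings, and the router. The layer sizes are illustrative assumptions on the scale of a 7B-class model with a three-matrix gated FFN, not a recommendation.

```python
def expert_ffn_params(d_model: int, d_ff: int, matrices_per_expert: int = 3) -> int:
    """Weights in one expert FFN (a gated FFN has three d_model x d_ff matrices)."""
    return matrices_per_expert * d_model * d_ff


def moe_ffn_budget(num_layers: int, num_experts: int, top_k: int,
                   d_model: int, d_ff: int) -> tuple[int, int]:
    """Return (total, active-per-token) expert FFN parameters across all layers."""
    per_expert = expert_ffn_params(d_model, d_ff)
    total = num_layers * num_experts * per_expert
    active = num_layers * top_k * per_expert
    return total, active


# Illustrative sizes (assumptions): 32 layers, d_model=4096, d_ff=14336.
total, active = moe_ffn_budget(num_layers=32, num_experts=8, top_k=2,
                               d_model=4096, d_ff=14336)
print(f"expert FFN params: {total / 1e9:.1f}B total, {active / 1e9:.1f}B active per token")
```

With Top-2 out of 8, each token touches a quarter of the expert weights. Raising the expert count grows capacity without changing the per-token cost, as long as `top_k` stays fixed.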

Finally, let me leave you with a phrase I often say to my team: The worst thing in technology selection isn't choosing wrong — it's using an old map to navigate a new road. MoE isn't just icing on the cake; it's coal in the snow. While others are still stacking parameters, you're already running farther with less compute and a smarter architecture — doesn't that feel great?

Cael Lee
