A Pandemonium of Experts: MoE Large Models Explained

By CaelLee | 6 min read

Generated: 2026-06-21 17:19:39

---

A Pandemonium of Experts: What the Hell Is an MoE Large Model?

To be honest, the impulse to write this came from the embarrassment of being stumped.

The other day, a friend suddenly asked me: "What's so great about MoE? Why is everyone jumping on it?"

I thought for a moment, then blurted out: "It's like... multiple experts with a gating mechanism."

And then I saw that look on his face — you know, the "what the hell are you talking about?" look. That's when I knew I needed to sort this out properly!

This article isn't going to be some dry academic paper. I'm going to break down, piece by piece, all the pitfalls I've hit, the tricks I've tried, and the papers I've dug through over the past year of tinkering with MoE models. I promise that by the time you finish reading, you'll know what MoE is all about, how to fine-tune it, and which traps to absolutely avoid.

Speaking of which, let me start with a question —

How Old Do You Think MoE Actually Is?

MoE isn't a new invention — it's been around since 1991! (You might not even have been born yet!)

But it's only in recent years that it really started to shine in large models.

Put simply, MoE replaces the feed-forward network in the Transformer with a "committee of experts." Then you add a gating network (I call it the "front desk receptionist") that decides which expert to assign a problem to, based on the characteristics of each input.

The most common approach is Top-K routing — for example, selecting only 2 experts to work on each task.
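To make the "front desk receptionist" concrete, here is a minimal sketch of Top-K routing in plain Python. The function names (`top_k_route`, `moe_forward`) and the toy scalar "experts" are made up for illustration, and renormalizing the selected weights so they sum to 1 is one common choice, not the only one.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: renormalize so the selected experts' weights sum to 1.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Toy experts: each one just scales its input (hypothetical stand-ins for FFNs).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
routes = top_k_route([0.1, 2.0, -1.0, 2.0], k=2)
output = moe_forward(1.0, experts, [0.1, 2.0, -1.0, 2.0], k=2)
```

In a real model the router logits come from a small learned linear layer applied to each token, and the experts are full feed-forward networks rather than scalar multipliers.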

So what's the biggest advantage?

You can have a terrifyingly large number of parameters, but the actual computation is surprisingly small!

Only a few experts are activated each time. Take DeepSeek's 16B MoE model: it reaches performance comparable to LLaMA 2 7B with only about 40% of the computation!

Is that a good deal? You do the math!
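If you do want to do the math, a rough sketch looks like this. The numbers below are invented round figures, not the configuration of any real model, and `shared_params` lumps together everything that every token passes through (attention, embeddings, norms).

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, top_k):
    """Return (total, active-per-token) parameter counts for a simplified MoE."""
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active

# Hypothetical model: 2B shared parameters, 16 experts of 1B each, 2 active per token.
total, active = moe_param_counts(
    shared_params=2e9, params_per_expert=1e9, num_experts=16, top_k=2
)
active_fraction = active / total
```

With these made-up numbers the model stores 18B parameters but each token only touches 4B of them, which is where the "big on paper, cheap to run" effect comes from.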

The First Trap: Why Your "Multiple Experts" Idea Might Actually Backfire

When I first started working with MoE, I just copied over my old training hyperparameters — and sure enough, the validation loss kept climbing higher, while the training loss happily descended.

A classic case of overfitting!

I'd read about this online but didn't believe it until I experienced it myself.

Think about it — a sparse model has a massive number of parameters, but each token only interacts with a limited combination of them. That makes it especially prone to memorizing noise in the training set.

What's that like? Imagine a student who has 100 teachers but only ever gets tutoring from 2 of them. He remembers every single word those two teachers say, but he can't generalize to save his life.

The fix is simple — add higher dropout rates to the sparse layers.

The dropout rates I used: 0.1 for dense layers, and I cranked it up to 0.2–0.3 for sparse layers!

Don't be stingy — try it and you'll see the difference.
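As a rough illustration of "different dropout for different layers", here is a plain-Python sketch of inverted dropout with a per-layer-kind rate. `DropoutConfig` and `rate_for` are hypothetical names, and 0.3 is simply the upper end of the range mentioned above; in a real framework you would set the dropout rate on the expert layers directly.

```python
import random
from dataclasses import dataclass

@dataclass
class DropoutConfig:
    dense_rate: float = 0.1   # dense (non-expert) layers
    expert_rate: float = 0.3  # sparse expert layers, upper end of 0.2-0.3

def rate_for(layer_kind, cfg):
    """Pick the dropout rate by layer kind ('expert' or anything else)."""
    return cfg.expert_rate if layer_kind == "expert" else cfg.dense_rate

def dropout(values, rate, rng, training=True):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

cfg = DropoutConfig()
rng = random.Random(0)
activations = [1.0] * 10000
expert_out = dropout(activations, rate_for("expert", cfg), rng)
eval_out = dropout(activations, rate_for("expert", cfg), rng, training=False)
```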

The Auxiliary Loss That Almost Led Me Astray

MoE training commonly uses an auxiliary loss to balance the load across experts — preventing everyone from crowding onto just one or two experts.
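For reference, here is a small sketch of one widely used form of this loss, in the style of the Switch Transformer formulation: the weight times the number of experts times the sum over experts of (fraction of tokens sent to the expert) times (mean router probability for the expert). It assumes top-1 assignment for simplicity, and the default weight `alpha` is only an illustrative value.

```python
def load_balancing_loss(router_probs, assignments, num_experts, alpha=0.01):
    """router_probs: one probability list per token; assignments: chosen expert per token."""
    n_tokens = len(router_probs)
    # f[e]: fraction of tokens actually dispatched to expert e.
    f = [0.0] * num_experts
    for e in assignments:
        f[e] += 1.0 / n_tokens
    # p[e]: mean router probability assigned to expert e across tokens.
    p = [sum(tok[e] for tok in router_probs) / n_tokens for e in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Perfectly balanced routing: every expert gets one token and uniform probability.
balanced = load_balancing_loss([[0.25] * 4] * 4, [0, 1, 2, 3], num_experts=4)

# Collapsed routing: every token crowds onto expert 0.
collapsed = load_balancing_loss([[0.97, 0.01, 0.01, 0.01]] * 4, [0, 0, 0, 0], num_experts=4)
```

The loss is smallest when tokens are spread evenly and grows as routing collapses onto a few experts, which is exactly the pressure it is meant to apply.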

Many tutorials say: "You absolutely must add this!"

But I fell into a trap once —

In one experiment, I accidentally set the auxiliary loss weight to 0.01. It barely had any effect, but the model quality didn't drop much either.

Later, I read the ST-MoE paper and discovered that the authors actually tried turning off the auxiliary loss entirely during fine-tuning! Even when up to 11% of tokens were dropped (load imbalance pushes some experts past their capacity, and the overflow tokens are discarded), model quality didn't degrade noticeably!

What does this tell us?

Token dropping itself might actually serve as a hidden form of regularization! It's like giving the model a "fasting training" regimen that actually helps prevent overfitting!

Of course, I'm not telling you to turn off the auxiliary loss — just don't treat it as gospel. Tune it for your specific task. In one QA task, I turned off the auxiliary loss and actually gained a few points!
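To show where dropped tokens come from, here is a minimal sketch of capacity-limited dispatch. It assumes each expert can take at most `ceil(tokens / experts * capacity_factor)` tokens and that tokens are served first come, first served; real implementations differ in the details, and `dispatch_with_capacity` is a name invented for this example.

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor):
    """Maximum number of tokens a single expert will accept in this batch."""
    return math.ceil(num_tokens / num_experts * capacity_factor)

def dispatch_with_capacity(assignments, num_experts, capacity_factor=1.0):
    """Return (kept_token_indices, dropped_token_indices)."""
    capacity = expert_capacity(len(assignments), num_experts, capacity_factor)
    load = [0] * num_experts
    kept, dropped = [], []
    for token_idx, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token_idx)
        else:
            # Overflow: the expert is full, so this token skips the expert layer.
            dropped.append(token_idx)
    return kept, dropped

# 8 tokens, 4 experts, but five of the tokens all want expert 0.
kept, dropped = dispatch_with_capacity([0, 0, 0, 0, 0, 1, 2, 3], num_experts=4)
drop_fraction = len(dropped) / 8
```

Without a balancing loss the router is free to pile tokens onto popular experts, so the drop fraction goes up. Raising `capacity_factor` trades extra compute and memory for fewer drops.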

The Most Counterintuitive Discovery: MoE Is Not a Silver Bullet

This one really goes against intuition. You'd think more parameters would mean across-the-board improvement, but that's not the case with MoE.

During my replication experiments, I observed —

On knowledge-intensive tasks like TriviaQA, with the same pre-training perplexity, sparse models far outperformed dense models of comparable size.

But on comprehension tasks like SuperGLUE, sparse models got absolutely crushed by dense models!

This overturns the notion that "more parameters is always better." Why does this happen?

My guess is: knowledge-intensive tasks rely more on the model "remembering" facts, and MoE's specialized experts can store different types of knowledge efficiently. Comprehension tasks, on the other hand, require global reasoning, and the discretization introduced by the gating mechanism hinders smooth information flow.

So if you're building a knowledge Q&A application — MoE is a godsend!

If you're working on fine-grained reasoning like reading comprehension — tread carefully!

I also noticed: during fine-tuning, the fewer experts you use, the better the performance on downstream tasks. So don't assume more experts is always better. I usually keep it between 4 and 8.

Fine-Tuning Strategies: I Tried Four Approaches, and Now I Only Use One

The biggest headache in fine-tuning MoE is GPU memory. With full-parameter fine-tuning, your A100 80G might not even handle a batch size of 4.

I tried:

1. Full fine-tuning: Great results, but not enough memory. You'd have to accumulate gradients until the cows come home.

2. Freezing all non-expert layers: Memory went down, but results were abysmal. The ST-MoE paper also reported that this approach leads to a significant performance drop.

3. Freezing only the MoE layer parameters: This is my go-to! Results are nearly identical to full fine-tuning, but memory requirements are much lower. The expert weights make up the bulk of the model's parameters, so freezing them means no gradients or optimizer states have to be stored for them, which saves a huge chunk.

4. Fine-tuning only the gating network: Doesn't work well — the gating layer is too thin.

Practical advice: During fine-tuning, unfreeze the attention layers, embedding layer, and layer norms. Only freeze the expert FFNs within the MoE.

This preserves MoE's structural advantages while drastically cutting memory needs. Plus it speeds things up, because you don't need to compute weight gradients for the experts!
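The freezing recipe above can be sketched as a simple name filter. The helper below works on any iterable of `(name, parameter)` pairs where the parameter has a `requires_grad` attribute, which is the shape PyTorch's `model.named_parameters()` yields. The `".experts."` marker and the demo parameter names are assumptions for illustration; check your own model's parameter names before relying on them.

```python
def freeze_expert_ffns(named_parameters, expert_marker=".experts."):
    """Freeze parameters whose name contains expert_marker; leave the rest trainable."""
    trainable, frozen = [], []
    for name, param in named_parameters:
        is_expert = expert_marker in name
        param.requires_grad = not is_expert
        (frozen if is_expert else trainable).append(name)
    return trainable, frozen

class DemoParam:
    """Stand-in for a framework parameter; only requires_grad matters here."""
    def __init__(self):
        self.requires_grad = True

# Hypothetical parameter names, loosely modeled on common MoE layouts.
demo_params = [
    ("embed_tokens.weight", DemoParam()),
    ("layers.0.self_attn.q_proj.weight", DemoParam()),
    ("layers.0.input_layernorm.weight", DemoParam()),
    ("layers.0.moe.gate.weight", DemoParam()),
    ("layers.0.moe.experts.0.w1.weight", DemoParam()),
    ("layers.0.moe.experts.1.w1.weight", DemoParam()),
]
trainable, frozen = freeze_expert_ffns(demo_params)
```

Note that the gating network stays trainable in this sketch, since only the expert FFNs match the marker.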

As for hyperparameters, sparse models prefer smaller batch sizes and higher learning rates.

I usually halve the batch size (e.g., from 32 to 16) and double the learning rate (from 1e-5 to 2e-5). My guess for why: smaller batches introduce more stochasticity in gradients, which helps the gating network explore better routing assignments.
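As a tiny config sketch of that rule of thumb (halve the batch size, double the learning rate), assuming a hypothetical `FineTuneConfig` with just these two fields:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class FineTuneConfig:
    batch_size: int
    learning_rate: float

def adapt_for_sparse(cfg):
    """Derive a sparse-model config from a dense one: half the batch, twice the LR."""
    return replace(
        cfg,
        batch_size=max(1, cfg.batch_size // 2),
        learning_rate=cfg.learning_rate * 2,
    )

dense_cfg = FineTuneConfig(batch_size=32, learning_rate=1e-5)
sparse_cfg = adapt_for_sparse(dense_cfg)
```

Treat the result as a starting point for a sweep, not a final setting.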

Let's Talk About Open-Source MoE Models — I've Tinkered with All of Them

There are quite a few open-source MoE models now —

DeepSeekMoE: From the DeepSeek team in China. 16B total parameters, about 2.8B activated parameters. I've run their fine-tuning code, and it's one command and you're done! I really like the "fine-grained experts" concept in their paper: they break experts into smaller pieces, activate more of them, and create richer combinations. It's more flexible than using eight powerful experts.

Qwen2 MoE: From Alibaba. Architecture similar to Qwen1.5-MoE. The biggest highlight is the shared expert + routed expert design. Some experts are always active for every token (shared experts), while others are dynamically routed. The results are impressive! I tested the inference speed of their 7B MoE version, and it's nearly twice as fast as a dense model with the same parameter count!

Mixtral 8x7B: From Mistral. 8 experts, each token activates 2. I've used this model for a long time — it's stable and well-documented. However, the expert granularity is relatively coarse; each expert is a complete FFN, unlike Qwen's fine-grained approach.

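Grok-1: From xAI, Musk's team. 314B total parameters, 8 experts with 2 active per token, 64 layers, and GQA attention with 48 query heads and 8 key/value heads. (Dividing 314B by 8 gives 39.25B per expert, but that is only a rough upper bound, since attention and embedding weights are shared rather than part of any expert.) The structure is pretty conventional, but the scale is huge. I downloaded it but couldn't run it (not enough GPUs), so for me it's currently not very practical.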
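The shared expert + routed expert design mentioned for Qwen2 MoE can be sketched in a few lines. This is a simplified illustration with toy scalar experts, not Qwen's actual implementation: shared experts run on every token, and a Top-K router picks among the rest. Using the raw softmax probabilities as gate weights (without renormalizing) is an assumption made for this sketch.

```python
import math

def shared_plus_routed_forward(x, shared_experts, routed_experts, router_logits, k=2):
    """Output = all shared experts + gate-weighted sum of the top-k routed experts."""
    # Shared experts are applied unconditionally to every token.
    shared_out = sum(expert(x) for expert in shared_experts)
    # Softmax over the routed experts' logits.
    m = max(router_logits)
    exps = [math.exp(logit - m) for logit in router_logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Only the k most probable routed experts are evaluated.
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    routed_out = sum(probs[i] * routed_experts[i](x) for i in top)
    return shared_out + routed_out

# Toy experts (hypothetical): one shared expert, three routed experts.
shared = [lambda x: x]
routed = [lambda x: 10 * x, lambda x: 20 * x, lambda x: 30 * x]
y = shared_plus_routed_forward(1.0, shared, routed, [0.0, 0.0, -100.0], k=2)
```

The idea is that the shared experts soak up knowledge every token needs, leaving the routed experts free to specialize.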

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
