Distributed Training Logic for Large Models, Explained in One Article: From GPU Communication Primitives to Megatr

Generated: 2026-06-22 03:40:02

---

It took me three months to finally understand what distributed training of large models is all about.

Let me tell you a story first.

A friend of mine, full of ambition, built a 10B model. He ran it on a single GPU—boom, memory exploded. Straight up OOM.

Fine, let's go multi-GPU! But the training speed didn't increase—it dropped. And the error messages? Each one more ridiculous than the last.

Guess what?

That was me, three months ago.

Every day I'd stare at NCCL error codes, watching GPU utilization bounce all over the place, with only one thought in my head: What the hell are these cards passing around to each other?

Today I'm going to spoon-feed you all those hard-won lessons. You don't need to read source code, and you don't need to decode the fancy terms. You just need to end up thinking: "Oh, so that's how it works."

Let's roll.

---

1. Three Parallelism Strategies, Plainly Three Cuts

When people talk about parallelism, many get dizzy. PP, TP, DP—acronyms everywhere, a headache.

Don't worry, let me paint you a picture.

First cut: Horizontal (Pipeline Parallelism, PP)

Think of a car factory assembly line.

GPU1 makes the wheels, GPU2 installs the doors, GPU3 paints the car. Each GPU only handles its own slice of the model's layers.

Sounds perfect, right?

But GPU2 has to wait for GPU1 to finish the wheels before it can start. So GPU1 works like crazy, while GPU3 sits there idle.

That's the "bubble" you've probably heard about. The GPU sits idle, and you're burning electricity. Doesn't that hurt?
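If you want to put a number on that bubble, here's a rough sketch. It assumes an idealized GPipe-style schedule: the batch is cut into micro-batches, every stage takes the same time per micro-batch, and communication is free. Under those assumptions the idle share is (p - 1) / (m + p - 1) for p stages and m micro-batches. The function name is mine, not from any framework.

```python
from fractions import Fraction


def bubble_fraction(num_stages: int, num_microbatches: int) -> Fraction:
    """Idle share of each GPU's timeline in an idealized GPipe-style schedule.

    Assumes every stage takes the same time per micro-batch and that
    communication is free, so the pipeline needs (m + p - 1) time slots
    in total, of which each stage is busy for only m.
    """
    p, m = num_stages, num_microbatches
    return Fraction(p - 1, m + p - 1)


# Three stages (wheels, doors, paint) with more and more micro-batches.
for m in (1, 4, 32):
    frac = bubble_fraction(3, m)
    print(f"3 stages, {m:>2} micro-batches: {float(frac):.0%} idle")
```

The takeaway: the more micro-batches you push through the line, the smaller the share of time each GPU spends waiting.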

Second cut: Vertical (Tensor Parallelism, TP)

This cut is more brutal.

You take a huge matrix, smack, split it in half. GPU1 computes the left part, GPU2 the right part, and finally, you merge them.

The benefit is each GPU only computes half the matrix, so memory pressure is much lower.

The cost? Communication frequency so high it makes you question life. After every computational step, they have to exchange data—endlessly.
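Here's the "split it in half, then merge" idea as a toy sketch. It's plain Python with two simulated GPUs and tiny made-up matrices, nothing like real Megatron code, but it shows that the column-wise split gives exactly the same answer as the full multiplication.

```python
def matmul(x, w):
    """Plain-Python matrix product of x (n x k) and w (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, parts):
    """Split w into `parts` equal blocks of columns, one per simulated GPU."""
    width = len(w[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(parts)]


x = [[1, 2], [3, 4]]                # input, replicated on every GPU
w = [[1, 0, 2, 1], [0, 1, 1, 3]]    # full weight matrix, 2 x 4

shards = split_columns(w, parts=2)  # GPU1 gets the left half, GPU2 the right half
partials = [matmul(x, shard) for shard in shards]  # local compute, no communication

# Merge step: concatenate the partial outputs column-wise.
# On real hardware this is where the communication happens.
merged = [sum((p[r] for p in partials), []) for r in range(len(x))]

assert merged == matmul(x, w)
```

Each simulated GPU only ever touches half of `w`, which is where the memory saving comes from. The merge at the end is the part you pay for in communication.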

Third cut: Data Parallelism (DP)

This is the simplest. Each GPU holds a full copy of the model but is fed different data. At the end of each step the GPUs average their gradients, so every copy applies the same update and all replicas stay identical.

Seems easy, right?

But each card stores a complete model. For a 70B model across 32 cards, every single card must store all 70B parameters. Is your memory enough?

At this point, you're probably wondering: So how do modern frameworks actually do it?

The answer is—use all of them.

Megatron-LM combines those three cuts into what's called "3D parallelism." It's like cutting a cake along three different directions until everyone gets a small piece.

The underlying logic behind all this? Just a few tiny communication primitives.

---

2. Communication Primitives: The "Pinyin" of Distributed Training

The first time I saw Megatron code, I was completely broken.

The screen was full of all_reduce, all_gather, reduce_scatter, and my head was spinning.

Later I realized: every parallel strategy ultimately builds on these little building blocks.

2.1 Communication Group — The Circle You Draw

This is the pit I fell into hardest.

You configure your TP group, PP group, DP group, thinking they're independent and won't bother each other.

Then what?

One tiny mistake, and you mix GPUs from different groups in a call. Either the program hangs, or the results are pure garbage.

A communication group is essentially a "boundary." Only GPUs in the same group can talk to each other.

The most common setup in Megatron is:

tp_group: for tensor parallelism
dp_group: for gradient synchronization
pp_group: for passing activations

They're like three WeChat groups—each chats among itself, never crossing.
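To make the "three circles" concrete, here's a minimal sketch that works out which global ranks land in which group. The layout is an assumption on my part (TP ranks adjacent, then DP, then PP), and `build_groups` is a hypothetical helper, not a Megatron function. In real PyTorch code, each rank list would typically be handed to `torch.distributed.new_group(ranks=...)`.

```python
def build_groups(world_size: int, tp: int, pp: int):
    """Return (tp_groups, dp_groups, pp_groups) as lists of global ranks.

    Assumed layout: rank = pp_rank * (dp * tp) + dp_rank * tp + tp_rank,
    i.e. TP ranks are adjacent, then DP, then PP.
    """
    assert world_size % (tp * pp) == 0, "world_size must be divisible by tp * pp"
    dp = world_size // (tp * pp)

    def rank(p, d, t):
        return p * dp * tp + d * tp + t

    tp_groups = [[rank(p, d, t) for t in range(tp)] for p in range(pp) for d in range(dp)]
    dp_groups = [[rank(p, d, t) for d in range(dp)] for p in range(pp) for t in range(tp)]
    pp_groups = [[rank(p, d, t) for p in range(pp)] for d in range(dp) for t in range(tp)]
    return tp_groups, dp_groups, pp_groups


tp_groups, dp_groups, pp_groups = build_groups(world_size=8, tp=2, pp=2)
print("tp:", tp_groups)
print("dp:", dp_groups)
print("pp:", pp_groups)
```

Every GPU shows up in exactly one TP group, one DP group, and one PP group. The sizes multiply out to the world size, which is a quick sanity check before you launch anything.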

I once did something stupid while debugging: I set world_size=4 for tp_group, but then also put those same 4 cards in dp_group. When all_reduce was called, the two groups waited on each other, and it deadlocked.

I stared at the screen for half an hour before I realized the groups weren't strictly separated.

2.2 Four Basic Primitives—That's Enough

To learn distributed training, you don't need too much. Just remember these four moves.

First: Broadcast

One person shouts, everyone hears.

Typical scenario: during model parameter initialization, the root card sends the parameters to all GPUs.

Second: Scatter

One person holds a deck of cards and deals a few to each person.

In pipeline parallelism, for example, the master node sends different micro-batches of data to the GPUs in the first stage.

Third: All-Gather

Each person holds a puzzle piece. After one communication, everyone gets the complete puzzle.

In tensor parallelism, you've split the weights column-wise. After each card computes its part, you need to merge the partial results from all cards to do the next step.

Fourth: All-Reduce

Add up the data from everyone, then give the sum to everyone.

In data parallelism, each card computes its gradient and then calls all_reduce once. The gradients are summed and divided by the number of cards to get the average, and every card applies that same average to its own copy of the parameters.

These four primitives are your foundation.
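If the four moves still feel abstract, here's a toy simulation. Each "rank" is just a Python list, and the functions only mimic what the real collectives leave on each GPU. None of this is NCCL or `torch.distributed` code, and the numbers are made up.

```python
def broadcast(bufs, root=0):
    """Every rank ends up with a copy of the root's data."""
    return [list(bufs[root]) for _ in bufs]


def scatter(bufs, root=0):
    """The root deals one equal chunk of its data to each rank."""
    n = len(bufs)
    chunk = len(bufs[root]) // n
    return [bufs[root][i * chunk:(i + 1) * chunk] for i in range(n)]


def all_gather(bufs):
    """Every rank ends up with all ranks' pieces, concatenated in rank order."""
    full = [x for buf in bufs for x in buf]
    return [list(full) for _ in bufs]


def all_reduce(bufs):
    """Every rank ends up with the element-wise sum over all ranks."""
    total = [sum(vals) for vals in zip(*bufs)]
    return [list(total) for _ in bufs]


# Four simulated ranks, each holding a 4-element gradient.
grads = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0],
         [0.0, 0.0, 4.0, 4.0], [4.0, 4.0, 0.0, 0.0]]
summed = all_reduce(grads)
# Data-parallel averaging: divide the sum by the number of ranks.
averaged = [[g / len(grads) for g in buf] for buf in summed]
print(averaged[0])
```

After the all-reduce and the division, every rank holds the same averaged gradient, which is why the model copies never drift apart.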

Advanced primitives like reduce_scatter are really just "reduce, then scatter" fused into one communication for better efficiency.

DeepSpeed's ZeRO does exactly this.
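A quick way to convince yourself: a reduce_scatter followed by an all_gather leaves every rank with the same thing a single all_reduce would. The sketch below simulates two ranks with plain lists. It's an illustration of the identity, not how any library implements it.

```python
def reduce_scatter(bufs):
    """Sum element-wise across ranks, then leave rank i holding only chunk i."""
    n = len(bufs)
    chunk = len(bufs[0]) // n
    total = [sum(vals) for vals in zip(*bufs)]
    return [total[i * chunk:(i + 1) * chunk] for i in range(n)]


def all_gather(bufs):
    """Every rank ends up with all ranks' pieces, concatenated in rank order."""
    full = [x for buf in bufs for x in buf]
    return [list(full) for _ in bufs]


def all_reduce(bufs):
    """Every rank ends up with the element-wise sum over all ranks."""
    total = [sum(vals) for vals in zip(*bufs)]
    return [list(total) for _ in bufs]


grads = [[1, 2, 3, 4], [10, 20, 30, 40]]  # two simulated ranks
shards = reduce_scatter(grads)             # each rank keeps one summed slice

assert all_gather(shards) == all_reduce(grads)
```

The interesting part is the middle state: after reduce_scatter, each rank holds only its own slice of the summed gradients. If a rank only needs to update its own slice, it can stop there and skip the gather.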

2.3 Communication Cost, Not What You'd Expect

You think communication just takes time?

Wrong.

Communication cost has two dimensions.

One is bandwidth cost: the more data you transmit, the longer it takes.

The other is latency cost: every communication call pays a fixed startup overhead that doesn't depend on data size, so the total grows with the number of calls.

Sending the same data in many small chunks is actually slower, because the startup overhead is paid again for every chunk.

There are only two optimization directions: either send more at once, or find a way to overlap computation with communication.
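The two dimensions fit into one tiny formula: time = (number of messages × latency) + (bytes / bandwidth). Here's a sketch with placeholder numbers. The 10 µs latency and 100 GB/s bandwidth are illustrative assumptions, not measurements from any real cluster.

```python
def transfer_time(total_bytes: float, num_messages: int,
                  latency_s: float = 10e-6, bandwidth_Bps: float = 100e9) -> float:
    """Time to move total_bytes split evenly across num_messages sends.

    latency_s and bandwidth_Bps are illustrative placeholders, not measurements.
    """
    return num_messages * latency_s + total_bytes / bandwidth_Bps


payload = 64 * 1024 * 1024  # 64 MiB of gradients
for n in (1, 64, 4096):
    print(f"{n:>5} messages: {transfer_time(payload, n) * 1e3:.2f} ms")
```

The bandwidth term is identical in all three cases. Only the latency term grows, and with enough small messages it ends up dominating the total.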

I learned this painfully when tuning sequence parallelism.

In my setup, splitting LayerNorm across multiple cards backfired: the computation itself is fast, but the number of communications doubled, the latency overhead ate all the gains, and I ended up slower.

So in my experience, sequence parallelism only pays off on compute-heavy parts like Attention.

---

3. How Does ZeRO Save Memory? Explained in One Sentence

Traditional data parallelism has a fatal flaw.

Every card stores the complete model, gradients, and optimizer states.

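Let's do the math: training a 70B model with the Adam optimizer in mixed precision takes about 16 bytes per parameter (2 for the fp16 weights, 2 for the fp16 gradients, and 12 for the fp32 master weights, momentum, and variance). 70B × 16 bytes = 1120GB.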

If you use 32 cards for data parallelism, each card still needs to store 1120GB.

Which card has that much memory?

Even the NVIDIA A100 tops out at 80GB.

So DeepSpeed came along with its ZeRO.

The core idea is super simple: since you're already doing all-reduce to synchronize gradients, why not spread the states across cards and gather them only when needed?

ZeRO-1: Only spread the optimizer states

ZeRO-2: Also spread the gradients

ZeRO-3: Also spread the parameters themselves

Imagine: instead of every GPU storing everything, each card stores a piece, and when needed, you assemble them with all_gather.
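Here's what the three stages do to the per-card number, as a back-of-the-envelope sketch. It assumes mixed-precision Adam (2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state per parameter) and ignores activations and temporary buffers. `per_gpu_gb` is a name I made up for illustration.

```python
def per_gpu_gb(params_billion: float, num_gpus: int, stage: int) -> float:
    """Approximate model-state memory per GPU in GB (1 GB = 1e9 bytes).

    Assumes mixed-precision Adam: 2 bytes of fp16 weights, 2 bytes of fp16
    gradients and 12 bytes of fp32 optimizer state per parameter.
    Activations and temporary buffers are ignored.
    """
    weights, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= num_gpus    # ZeRO-1: shard the optimizer states
    if stage >= 2:
        grads /= num_gpus    # ZeRO-2: also shard the gradients
    if stage >= 3:
        weights /= num_gpus  # ZeRO-3: also shard the parameters
    return params_billion * (weights + grads + optim)


# 70B model on 32 cards; stage 0 means plain data parallelism.
for stage in range(4):
    print(f"stage {stage}: {per_gpu_gb(70, 32, stage):.1f} GB per GPU")
```

Stage 0 reproduces the 1120GB figure. Only with everything sharded does a 70B model's state drop under what a single 80GB card can hold, and that's before activations.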

Memory is saved, but communication increases.

Compared to pure data parallelism, ZeRO-3 adds about 50% more communication, because on top of the reduce_scatter for gradients, the parameter shards have to be collected with all_gather for the forward pass and again for the backward pass.
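Where does the 50% come from? A rough count of how many elements each GPU moves per step, sketched below. It treats an all-reduce as a reduce-scatter plus an all-gather (about 2x the parameter count in total) and ignores the small correction for finite cluster size. The function name is hypothetical.

```python
def comm_volume(num_params: float, zero_stage: int) -> float:
    """Approximate elements each GPU sends per training step.

    Plain DP, ZeRO-1 and ZeRO-2: gradients are synchronized at a cost of
    about 2x the parameter count (a reduce-scatter plus an all-gather).
    ZeRO-3: reduce-scatter of gradients (1x), plus an all-gather of the
    parameters for the forward pass (1x) and again for the backward pass (1x).
    """
    return (3.0 if zero_stage == 3 else 2.0) * num_params


baseline = comm_volume(13e9, zero_stage=0)
zero3 = comm_volume(13e9, zero_stage=3)
print(f"ZeRO-3 / DP communication: {zero3 / baseline:.2f}x")
```

Three units of traffic instead of two is the extra 50%. Whether that actually shows up as a slowdown depends on how much of it you manage to overlap with computation.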

I tested a 13B model on an A100 cluster with 32 cards of data parallelism.

ZeRO-2 barely worked.

With ZeRO-3, I could double the batch size, but training speed dropped by 15%.

Then I combined ZeRO-3 with gradient accumulation and increased the micro-batch size, and finally got the throughput back up.

This balance—you have to tune it yourself.

---

4. Tensor Parallelism: Simple Idea, Many Pitfalls in Practice

Tensor parallelism, at its core, is just splitting two matrix multiplications.

Type 1: Column-wise (vertical) split of weight A.

Each GPU holds a block of columns A_i. The input X is replicated in full on every card. Each card computes X × A_i, and finally you concatenate the partial results.

Cael Lee
