Illustrated Large-Model Training: Megatron Source Code Walkthrough 2, Model Parallelism

Generated: 2026-06-22 02:39:37

---

Let me first tell you a story that gave me the "creeps."

What was your first reaction when you saw Megatron's model parallelism code?

Here’s mine: At that point I had been writing distributed training code for over half a year. I thought I had data parallelism and AllReduce down cold. Then one day someone asked me, "How exactly does Megatron's model parallelism work?" My first thought: Uh oh, I'm about to embarrass myself.

Not because it's hard. Honestly, you can explain "Tensor parallelism splits weights, pipeline parallelism splits layers" in three sentences. But actually digging into the code to see how it literally chops the model up, stuffs it into dozens of GPUs, and makes sure it's still correct? Man, I stumbled and fell multiple times before I crawled out of that pit.

So today, I'm not going to give you a start-to-finish tutorial. I'm going to directly answer the itchiest questions in your mind. Show up with your questions, and when you finish reading, you'll be able to go back to the source code with way fewer detours. Trust me, by the end you'll slap your thigh and say: "So that's how it works!"

---

First Gate: What exactly does model parallelism parallelize? You're probably half wrong!

Let me ask you something. How do you understand "model parallelism"?

Most people blurt out: "It's putting different layers on different GPUs." That's half right. But the other half is the real killer. Megatron's model parallelism is actually two levels of splitting tied together — you can't even separate them in the code.

Tensor Parallelism (TP): Cut the weight matrix inside a Transformer layer into two chunks, put them on two cards. Each card stores only half the weights, computes half of the forward and backward passes, and then stitches everything together via collective communication. It's like two people carrying a big table together — each lifts half, and they join it back at the destination.

Pipeline Parallelism (PP): Put different Transformer layers on different devices, split the input data into microbatches, and push them through like a factory assembly line.
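To make the assembly-line picture a bit more concrete, here is a toy forward-only schedule in plain Python. It is only a sketch and not Megatron's scheduler: the function name forward_schedule is made up for illustration, every stage is assumed to take exactly one time step per microbatch, and the backward pass is ignored.

```python
def forward_schedule(num_stages, num_microbatches):
    """Return, for each time step, the (stage, microbatch) pairs that are busy.

    Toy model: one time step per stage per microbatch, forward pass only.
    """
    total_steps = num_stages + num_microbatches - 1
    schedule = []
    for t in range(total_steps):
        # Microbatch m reaches stage s at time step s + m.
        active = [(s, t - s) for s in range(num_stages)
                  if 0 <= t - s < num_microbatches]
        schedule.append(active)
    return schedule


schedule = forward_schedule(num_stages=4, num_microbatches=3)
for t, active in enumerate(schedule):
    print(t, active)
```

In the first and last steps only one stage is busy. That idle time is why the batch gets cut into several microbatches in the first place.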

The problem is, in Megatron, these two are locked into the same process group system. They're not two separate modules, A and B, that you can study in isolation: they share the same process assignment infrastructure.

Go look at the initialize_megatron section. You'll find tensor_model_parallel_group, pipeline_model_parallel_group, data_parallel_group... seven or eight groups pop up out of nowhere. You know, the first time I ran pretrain_bert_distributed.sh and saw --tensor-model-parallel-size 8, I thought: Simple, just split the model into 8 layer chunks. But after initialization printed the parallel_state, I discovered my own process was participating in three different groups at the same time! It felt like entering a maze where every fork had three directions, and any choice was wrong.

Data parallelism only needs one AllReduce group. Model parallelism? You need to know which operator communicates in which group.

Counterintuitive, right? Most people think model parallelism is about slicing the model. But first, you have to slice the process groups.

---

Second Gate: How does the code actually slice the model? Why does a Linear layer have to be split into two?

This was the trap that held me up for three days. Can you believe it?

In megatron/core/tensor_parallel/layers.py, you'll find that every linear layer has been replaced with either ColumnParallelLinear or RowParallelLinear. The names already reveal the secret of splitting:

ColumnParallelLinear: Split the weight matrix by column. Each card gets only a subset of the columns. The full input goes in, each card computes its part of the output, and then things are stitched together via AllGather — or sometimes not stitched at all, leaving it for the next layer to handle. It's like a company splitting customer duties between two departments, each handling its own half of the business, and then consolidating later.

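RowParallelLinear: Split the weight matrix by row. Each card gets only a subset of the rows. The input has to be split to match along its last dimension: either it arrives already split (the usual case, straight from a ColumnParallelLinear) or it is scattered across the cards first. Each card then computes an output of the full shape that is only a partial sum, and finally AllReduce adds the partial results together. It's like two chefs each using half the ingredients to cook one dish, then mixing their results together.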
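A tiny numeric check of both splits, in plain Python with two simulated "cards". This is a sketch of the math only, not Megatron's code: the helper names are made up, and it uses the textbook convention Y = XW with W shaped (in, out), whereas PyTorch-style layers store the transpose.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def split_cols(m, parts):
    width = len(m[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in m] for p in range(parts)]

def split_rows(m, parts):
    height = len(m) // parts
    return [m[p * height:(p + 1) * height] for p in range(parts)]

def concat_cols(blocks):
    return [sum((blk[i] for blk in blocks), []) for i in range(len(blocks[0]))]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


X = [[1, 2, 3, 4], [5, 6, 7, 8]]                               # (batch=2, in=4)
W = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 0], [1, 1, 1, 1]]   # (in=4, out=4)
full = matmul(X, W)

# Column split: every card sees the full X and owns some output columns.
col_out = [matmul(X, w_shard) for w_shard in split_cols(W, 2)]
col_result = concat_cols(col_out)            # plays the role of AllGather

# Row split: every card owns some rows of W and the matching columns of X.
partials = [matmul(x_shard, w_shard)
            for x_shard, w_shard in zip(split_cols(X, 2), split_rows(W, 2))]
row_result = add(partials[0], partials[1])   # plays the role of AllReduce (sum)

print(col_result == full, row_result == full)
```

Note the difference in what each card holds before communication: a narrow slice of the correct output in the column case, a full-shaped but incomplete sum in the row case.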

Why go through all this trouble? Because a Transformer has two main types of computation: MLP expansion (a linear layer that blows up the hidden dim by 4x) and Attention projections. According to the paper's design:

First MLP linear layer: Use ColumnParallelLinear, output is split across cards but not immediately gathered. Why? Because the next operation is GELU, which is element-wise: each card can apply it to its own slice without any communication. That saves a collective (the AllGather) right there!

Second MLP linear layer: Use RowParallelLinear. The input is already scattered across cards (the split output from the previous layer). Each card multiplies by its own slice, then a single AllReduce produces the complete output. That's exactly one communication for the whole MLP in the forward pass.

You see, there's a design philosophy buried in here: Save communication wherever possible, but always keep the math equivalent.
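Here is a small sanity check of that "math equivalent" claim, again as a plain-Python sketch with two simulated cards. The matrices are arbitrary toy values, the names are illustrative, and GELU is the exact erf form; none of this is Megatron's actual code.

```python
import math

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gelu_mat(m):
    return [[gelu(v) for v in row] for row in m]


X = [[0.5, -1.0], [2.0, 1.5]]                             # (batch=2, hidden=2)
A = [[1.0, -0.5, 0.3, 2.0], [0.7, 1.2, -1.1, 0.4]]        # (hidden=2, ffn=4)
B = [[0.2, -0.3], [1.0, 0.5], [-0.7, 0.9], [0.1, 0.6]]    # (ffn=4, hidden=2)

# Single-device reference: Y = GELU(X A) B
reference = matmul(gelu_mat(matmul(X, A)), B)

tp = 2
half = len(A[0]) // tp
partials = []
for rank in range(tp):
    A_shard = [row[rank * half:(rank + 1) * half] for row in A]  # columns of A
    B_shard = B[rank * half:(rank + 1) * half]                   # rows of B
    hidden = gelu_mat(matmul(X, A_shard))    # purely local, no communication
    partials.append(matmul(hidden, B_shard))

# The only communication: sum the partial outputs (the AllReduce).
result = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

# Why the first layer is not split by rows: GELU is not additive, so
# applying it to partial sums would give the wrong answer.
nonlinear_gap = gelu(3.0) - (gelu(1.0) + gelu(2.0))
print(result, reference, nonlinear_gap)
```

The column-then-row order is what lets GELU sit between the two matmuls untouched.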

Here's a trap I fell into hard: When writing models by hand, we're used to defining weight matrices in their full shape. But in this codebase, weights are already sliced at initialization! If you print model.layers[0].mlp.fc.weight.shape for the first MLP layer, you'll see something like (4 * hidden_size // tp_size, hidden_size) in PyTorch's (out_features, in_features) layout, not the full (4 * hidden_size, hidden_size). When I first debugged, I stared at it for ages, convinced the model had been loaded wrong. Only later did it click: There never was a full weight matrix in the code. It was designed to store slices from the start.

Also, Embeddings are implemented through VocabParallelEmbedding, which slices along the vocabulary dimension. If the vocabulary size isn't a multiple of TP size, the program just throws an error. I've stepped on that landmine myself. My advice: pad your vocabulary to a multiple of TP size in advance, so you don't get blocked by that official assert.
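The padding advice boils down to a round-up. A minimal sketch, with hypothetical helper names (padded_vocab_size and vocab_range are not Megatron functions) and 50257, the GPT-2 BPE vocabulary size, used purely as an example input:

```python
def padded_vocab_size(vocab_size, tp_size):
    """Round vocab_size up to the next multiple of tp_size."""
    return ((vocab_size + tp_size - 1) // tp_size) * tp_size


def vocab_range(padded_size, tp_rank, tp_size):
    """Half-open [start, end) range of token ids owned by one TP rank."""
    assert padded_size % tp_size == 0, "pad the vocabulary first"
    per_rank = padded_size // tp_size
    start = tp_rank * per_rank
    return start, start + per_rank


padded = padded_vocab_size(50257, 8)
for rank in range(8):
    print(rank, vocab_range(padded, rank, 8))
```

The extra ids at the end are dummy tokens that never appear in the data. They exist only so every card gets an equally sized slice of the embedding table.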

---

Third Gate: How do TP and PP combine? How are process groups assigned to hardware?

This is the question I get asked most often, and also the one I most want to complain about — because most articles only talk about concepts, and never about how process groups map to hardware. It's like giving you the recipe but not the kitchen tools.

Take the CodeGeeX configuration (I keep coming back to this one): TP size = 8 (8 cards on one machine), DP size = 192 (across machines). Note that PP size = 1, so no pipeline. But if you look at Megatron's own example configs, many use combinations like TP=2, PP=4, DP=64. Once you combine them, the process allocation logic works like this:

Total processes = TP size × PP size × DP size.
First split by PP size into pipeline stages. Inside each stage, split again by TP size into tensor parallel groups.
The remaining dimension is the data parallel group. Note that a data parallel group collects the ranks that hold the same slice of the model (same pipeline stage, same TP position), one from each model replica.

In human language: you have a pile of GPUs. First divide them into pipeline stages. Inside each stage, divide further by TP size into TP groups, each of which does tensor parallelism on that stage's layers. GPUs that hold the same slice of the model (same stage, same position within their TP group) but belong to different replicas form a DP group. GPUs at the same position in successive stages form a pipeline group and pass activations down the line. For example, with 16 GPUs, TP=2 and PP=4 (so DP=2), ranks 0 and 1 form a TP group, ranks 0 and 2 form a DP group, and ranks 0, 4, 8, 12 form a pipeline group.
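The grouping can be enumerated in a few lines. This is a sketch under stated assumptions, not a copy of Megatron's function: build_groups is a made-up name, TP ranks are assumed to be adjacent (the innermost dimension) with DP next and PP outermost, and extra dimensions such as context or expert parallelism are ignored.

```python
def build_groups(tp_size, pp_size, dp_size):
    """Enumerate TP, PP and DP rank groups for a TP-innermost rank layout."""
    world_size = tp_size * pp_size * dp_size
    ranks_per_stage = world_size // pp_size       # = tp_size * dp_size

    # TP groups: consecutive ranks.
    tp_groups = [list(range(start, start + tp_size))
                 for start in range(0, world_size, tp_size)]

    # PP groups: same offset inside every pipeline stage.
    pp_groups = [list(range(offset, world_size, ranks_per_stage))
                 for offset in range(ranks_per_stage)]

    # DP groups: same stage, same TP position, one rank per replica.
    dp_groups = []
    for stage in range(pp_size):
        stage_start = stage * ranks_per_stage
        for tp_rank in range(tp_size):
            dp_groups.append(list(range(stage_start + tp_rank,
                                        stage_start + ranks_per_stage,
                                        tp_size)))
    return tp_groups, pp_groups, dp_groups


tp_groups, pp_groups, dp_groups = build_groups(tp_size=2, pp_size=4, dp_size=2)
print("TP:", tp_groups)
print("PP:", pp_groups)
print("DP:", dp_groups)
```

Every rank shows up in exactly one group of each kind, which is why a single process ends up holding three different group handles.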

Megatron's initialize_model_parallel function does exactly this. It creates a bunch of global variables like _MODEL_PARALLEL_GROUP, _TENSOR_MODEL_PARALLEL_GROUP, _PIPELINE_MODEL_PARALLEL_GROUP, etc. The devil is in the details: If a process belongs to both a TP group and a PP group, it must know which group's handle to use when communicating. For example, an AllReduce inside tensor parallelism should use mpu.get_tensor_model_parallel_group(), not mistakenly mpu.get_data_parallel_group().

I once made a rookie mistake: I wrote a custom operator that needed an AllReduce, but inside a TP context I used the default group (all processes globally).
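What that kind of mistake does to the numbers can be shown with a toy simulation. No real communication happens here: all_reduce_sum is a stand-in written for this example, and the 4-rank layout (TP=2, DP=2) and the values are invented. In real code the choice comes down to the group argument of torch.distributed.all_reduce, which falls back to the global default group when omitted.

```python
def all_reduce_sum(values, group):
    """Simulate AllReduce(sum): every rank in `group` receives the group's sum."""
    total = sum(values[r] for r in group)
    return {r: total for r in group}


# 4 ranks, TP=2, DP=2: ranks [0, 1] hold one model replica, [2, 3] the other.
tp_groups = [[0, 1], [2, 3]]
world = [0, 1, 2, 3]

# Partial outputs of a row-parallel layer. The two replicas process
# different data, so their partial sums have nothing to do with each other.
partial = {0: 1.0, 1: 2.0, 2: 10.0, 3: 20.0}

correct = {}
for group in tp_groups:
    correct.update(all_reduce_sum(partial, group))

wrong = all_reduce_sum(partial, world)   # default/global group by mistake

print("TP group:", correct)
print("world:   ", wrong)
```

With the global group, each replica's activations get polluted by the other replica's data. Nothing crashes, the tensors keep the right shape, and the result is simply wrong.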

Cael Lee
