Large Model Distributed Training Parallelism Techniques, Part 1: Overview

Generated: 2026-06-21 19:36:08

---

Large Model Distributed Training: First, Learn How to Get 10,000 GPUs to Work for You!

Okay, I'll admit it. I used to be a "single-GPU warrior."

Back when I saw people training hundred-billion-parameter models, I thought: "It's just stacking GPUs, right? What's the big deal? I've got an A100 too—80GB of memory. Should be enough."

Then I tried training a 13-billion-parameter model myself.

Guess what?

I'd barely loaded the model parameters when the GPU screamed OOM. I hadn't even gotten to set the batch size.

At that moment, I wanted to smash my computer.

Later, staring at the error message, a thought hit me: You think GPU memory is a house, but it's really a piggy bank. First you put in the model parameters, then the optimizer states, then the gradients... and it's full. The forward pass hasn't even finished, let alone the backward pass, and it's already blown up.

That's the real starting point for distributed training: It's not because we want to show off—it's because we're backed into a corner.

---

You'll Never Guess How Hungry a 10-Billion-Parameter Model Really Is

Let me break it down for you.

10 billion parameters. If they're all FP16 (2 bytes each), the model parameters alone take 20 GB. The gradients are another 20 GB.

But the real killer is the optimizer state. Adam stores two things per parameter: momentum and variance. These are normally kept in FP32, because FP16 doesn't have the precision to accumulate small updates reliably, and training quality suffers.

Each parameter needs two FP32 values, so 8 bytes per parameter. That's 80 GB total.

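Add it up: 20 (params) + 20 (gradients) + 80 (optimizer) = 120 GB. And that's the generous count: mixed-precision training usually also keeps an FP32 master copy of the weights, which would add another 40 GB.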
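Here's a minimal sketch of that tally in Python. The function name `training_state_gb` and its default byte sizes are made up for illustration; they simply encode the 2 + 2 + 8 bytes per parameter used above, and activations are not counted.

```python
GB = 1e9  # decimal gigabytes, matching the round numbers in the text


def training_state_gb(num_params, param_bytes=2, grad_bytes=2, optim_bytes=8):
    """Return (params, grads, optimizer, total) in GB for one full replica.

    Defaults follow the article's accounting: FP16 weights, FP16 gradients,
    and two FP32 Adam values per parameter. Activations are ignored.
    """
    params = num_params * param_bytes / GB
    grads = num_params * grad_bytes / GB
    optim = num_params * optim_bytes / GB
    return params, grads, optim, params + grads + optim


params, grads, optim, total = training_state_gb(10e9)
print(f"params={params} GB, grads={grads} GB, optimizer={optim} GB, total={total} GB")

# Does it fit on one 80 GB card? (80 is the A100 size mentioned in the text.)
print("fits on 80 GB:", total <= 80)
```

Changing `num_params` to your own model size gives a quick lower bound before you even think about batch size.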

Your single 80 GB A100 has room for the 20 GB of bare weights, but nowhere near enough for the full training state. And forget about the activation values from each layer during the forward pass. Not a chance.

So here's the cold truth: one card isn't enough to even stand in the same room as this model, much less train it.

At this point you're probably asking: So how do all those cards split up the work? Alright, let me lay it out for you.

---

The Art of Splitting Work: Four Parallelism Techniques in One Article

Nine out of ten online articles about parallelism are just posturing. Data parallelism, tensor parallelism, pipeline parallelism… you read through them and you recognize all the terms, but you still have no clue what they're actually splitting.

Let me put it differently—think of it as a cookbook.

Dish 1: Data Parallelism (DP/DDP/FSDP) — "The Human Wave Tactic"

Each GPU holds a complete copy of the model. Each card only processes the batch of data assigned to it. Everyone computes their gradients, then merges them via an AllReduce communication step, and updates the parameters.
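To see why merging gradients works, here's a toy simulation in plain Python (no GPUs, no PyTorch). Everything in it is an assumption for illustration: a one-weight model `y = w * x`, four simulated ranks, and an `all_reduce_mean` helper that stands in for the real AllReduce. With equal-sized shards, the averaged gradient matches the gradient computed on the whole batch.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    """Stand-in for AllReduce: every rank ends up with the same average."""
    avg = sum(values) / len(values)
    return [avg] * len(values)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]

world_size = 4  # number of simulated GPUs
# Each rank gets an equal-sized slice of the batch.
shards = [(xs[r::world_size], ys[r::world_size]) for r in range(world_size)]

# Every rank holds the same w and computes a gradient on its own slice only.
local_grads = [grad_mse(w, sx, sy) for sx, sy in shards]

# After the AllReduce, every rank holds the same averaged gradient...
synced = all_reduce_mean(local_grads)

# ...which matches what a single device would compute on the full batch.
full_batch_grad = grad_mse(w, xs, ys)
print(local_grads, synced[0], full_batch_grad)
```

Because every rank applies the same averaged gradient to the same weights, the replicas stay identical after each step.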

The advantage? In PyTorch, wrapping your model in DistributedDataParallel is basically one line of code, plus some process-group setup.

The downside? Think about it: every card has to store the entire model state. When the model gets too big to fit on a single card, plain data parallelism is useless. You can't parallelize what you can't even load.

So what did Microsoft do? They came up with something brilliant: ZeRO.

The core idea behind ZeRO is dead simple, and yet when I first read the paper I was honestly blown away: don't let every card store the whole thing. Have them share the load, and whoever needs a piece picks it up on demand.

Optimizer states, gradients, parameters—they're all chopped up and distributed across different GPUs. When you need a piece during computation, you AllGather it, use it, and throw it away.

PyTorch later shipped the same idea as FSDP (Fully Sharded Data Parallel).

Can you believe it? Just this one simple idea—"store it separately, get it when you need it"—turned models we couldn't train before into models we could.
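The "store it separately" idea can be put in numbers. The sketch below is a rough estimate, not DeepSpeed's or FSDP's actual accounting: the function name is hypothetical, the stage numbering follows the three stages described in the ZeRO paper, and the byte sizes reuse the 2 + 2 + 8 tally from earlier. Activations and temporary communication buffers are ignored.

```python
def zero_per_gpu_gb(num_params, num_gpus, stage,
                    param_bytes=2, grad_bytes=2, optim_bytes=8):
    """Per-GPU model-state memory in GB for a ZeRO-style sharding stage.

    stage 0: nothing sharded (plain data parallelism)
    stage 1: optimizer states sharded
    stage 2: optimizer states + gradients sharded
    stage 3: optimizer states + gradients + parameters sharded
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    optim = optim_bytes / (num_gpus if stage >= 1 else 1)
    grads = grad_bytes / (num_gpus if stage >= 2 else 1)
    params = param_bytes / (num_gpus if stage >= 3 else 1)
    return num_params * (params + grads + optim) / 1e9


# The 10-billion-parameter model from earlier, spread over 8 GPUs.
for stage in range(4):
    print(f"stage {stage}: {zero_per_gpu_gb(10e9, 8, stage):.1f} GB per GPU")
```

With everything sharded, the per-GPU share of the model state drops well under the 80 GB of a single card, which is the whole point.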

Dish 2: Tensor Parallelism (TP) — "Cutting the Cake"

The last dish stored a complete model on each card. This one is different: it splits the work inside a single layer.

Take the MLP in a Transformer, which has two linear layers. Normally, a single layer computes a giant matrix multiplication. Now you split the weight matrix column-wise into two chunks and put each on a different GPU.

Each GPU computes only half, then you stitch the results together.

You can do the same with the QKV parts of the Attention mechanism.
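Here's a toy version of the MLP split in plain Python, with small integer matrices so the result can be checked exactly. It follows the layout described in the Megatron-LM paper (first weight matrix split by columns, second by rows); the matrices, the ReLU activation, and the helper names are all made up for illustration. The final sum over partial results plays the role of the AllReduce.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def split_cols(m, parts):
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


x = [[1, -2, 3], [0, 4, -1]]                          # 2 tokens, hidden size 3
w1 = [[1, 0, -1, 2], [2, 1, 0, -1], [0, -2, 1, 1]]    # first linear: 3 -> 4
w2 = [[1, 0, 2], [0, 1, -1], [-1, 2, 0], [3, 0, 1]]   # second linear: 4 -> 3

# What a single GPU would compute.
reference = matmul(relu(matmul(x, w1)), w2)

# Two simulated GPUs: each gets half the columns of w1 and the matching rows of w2.
partials = []
for w1_shard, w2_shard in zip(split_cols(w1, 2), split_rows(w2, 2)):
    hidden = relu(matmul(x, w1_shard))         # elementwise, so no communication yet
    partials.append(matmul(hidden, w2_shard))  # partial output, full shape

# "AllReduce": sum the partial outputs elementwise.
output = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]
print(output == reference)
```

Pairing a column split with a row split is what lets the activation run locally, so the two linear layers need only one sum at the end of the forward pass.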

Sounds perfect, right?

But here's a trap—I fell into it once and almost derailed the whole project.

The biggest problem with TP? Communication is deadly. Every Transformer layer needs AllReduce calls in both the forward and the backward pass. If you try TP across machines (say, two servers connected via Ethernet or InfiniBand), latency goes through the roof.

Later I followed Megatron-LM's advice: TP should only be done within a single node—never across nodes.

Why? Because NVLink bandwidth inside a node is typically more than 10× what you get across nodes! Force TP across nodes and the communication cost can wipe out whatever you gained by splitting.
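A back-of-envelope estimate shows how the bandwidth gap turns into waiting time. This sketch uses the standard ring AllReduce traffic volume (each GPU sends about 2(N-1)/N times the message size) and ignores latency entirely. The tensor shape and both bandwidth figures are placeholders chosen only to mirror the 10× gap mentioned above, not measurements of any real hardware.

```python
def ring_allreduce_seconds(message_bytes, num_gpus, bytes_per_second):
    """Bandwidth-only cost of a ring AllReduce; latency is ignored."""
    traffic_per_gpu = 2 * (num_gpus - 1) / num_gpus * message_bytes
    return traffic_per_gpu / bytes_per_second


# Hypothetical activation tensor: batch 8 x seq 2048 x hidden 4096, FP16.
message = 8 * 2048 * 4096 * 2

# Assumed bandwidths (bytes/s), illustrative only.
intra_node = 200e9
inter_node = 20e9

fast = ring_allreduce_seconds(message, 2, intra_node)
slow = ring_allreduce_seconds(message, 2, inter_node)
print(f"intra-node: {fast * 1e3:.3f} ms, inter-node: {slow * 1e3:.3f} ms")
print(f"slowdown: {slow / fast:.1f}x per AllReduce, repeated in every layer")
```

One slow AllReduce is tolerable; the trouble is that TP pays this cost several times per layer, every step.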

Dish 3: Pipeline Parallelism (PP) — "The Relay Race"

This one is easy to understand: split by layers.

Imagine you have a 48-layer Transformer. Cut it into 4 pieces, 12 layers each, and put each piece on a different GPU. Input flows through GPU0, then GPU1, and so on.
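As a small sketch, the layer assignment looks like this. The helper `split_layers` is hypothetical; it just hands out contiguous ranges as evenly as it can, which gives exactly the 12-layers-per-GPU split from the example.

```python
def split_layers(num_layers, num_stages):
    """Assign contiguous layer ranges to pipeline stages as evenly as possible."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)  # early stages absorb any remainder
        stages.append(range(start, start + size))
        start += size
    return stages


for gpu, layers in enumerate(split_layers(48, 4)):
    print(f"GPU{gpu}: layers {layers.start}-{layers.stop - 1}")
```

Real frameworks often balance stages by compute or memory cost rather than by layer count, since the embedding and output layers are not the same size as a Transformer block.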

But the naive version has a fatal flaw: at any given moment, only one GPU is working; the rest are waiting.

Think about it: if only one card works while three sit idle, 4 cards are no faster than 1.

Later GPipe came up with a solution: split the large batch into micro-batches and feed them through like a pipeline.

As soon as one micro-batch finishes on GPU0, it gets passed to GPU1, freeing up GPU0 to take the next one. Now the GPUs stay busy most of the time.

But there's a new problem: if you have too few micro-batches, the fraction of idle device time (the bubble) is huge.

I tested this myself. You know how bad it can get? If you split the total batch into only 4 micro-batches, GPU utilization can be below 40%. I had to crank it up to 16 before I barely hit 70%.

At this point you start to see: it's like a factory assembly line—every workstation needs to work at about the same speed, and you need enough work pieces to keep the whole line from stopping.
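The effect of the micro-batch count can be sketched with the usual idealized bubble formula for a GPipe-style schedule: with p stages and m micro-batches, the idle fraction is (p - 1) / (m + p - 1). This assumes every stage takes the same time and communication is free, so it is an upper bound on utilization; real measurements land lower. The 4 stages below are an assumption borrowed from the 48-layer example, not the setup of the test described above.

```python
def bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of an idealized GPipe-style pipeline schedule."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


def ideal_utilization(num_stages, num_microbatches):
    return 1 - bubble_fraction(num_stages, num_microbatches)


for m in (1, 4, 16, 64):
    print(f"4 stages, {m:>2} micro-batches: "
          f"ideal utilization {ideal_utilization(4, m):.2f}")
```

With a single micro-batch the formula gives 1/4, which is the "one card works while three sit idle" case; more micro-batches shrink the bubble but never remove it.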

Dish 4: Sequence Parallelism (SP/CP) — "The Secret Weapon for Long Texts"

I haven't used this one in production at scale yet, but when I read the Megatron-LM source code, I saw they were already using it.

What's the problem? The Attention mechanism in Transformers has complexity O(seqlen²). As the sequence gets longer, memory explodes.

How do you fix it? Split the sequence into chunks, and have each GPU handle only one chunk.

The key technique is called Ring Attention. GPUs work in a relay: each round, every GPU computes attention between its own chunk of queries and the KV block it currently holds, then passes that KV block along to the next GPU.

In simple terms: Each GPU doesn't need to remember the whole book—just the pages it has seen. When it needs to look at other pages, it fetches them from another GPU.
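Here's a single-process simulation of that relay in plain Python. It is a sketch under simplifying assumptions, not Megatron-LM's implementation: attention is non-causal, the 1/sqrt(d) scaling is left out, the "GPUs" are just list indices, and all names and data are made up. Each simulated GPU keeps a running softmax state (max, denominator, numerator) so it never needs the full score matrix, and the result matches ordinary attention over the whole sequence.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(qs, ks, vs):
    """Reference: softmax(q . k) weighted sum of v, computed in one place."""
    out = []
    for q in qs:
        scores = [dot(q, k) for k in ks]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, vs)) / total
                    for j in range(len(vs[0]))])
    return out


def ring_attention(q_chunks, k_chunks, v_chunks):
    """Each simulated GPU keeps its queries; KV blocks travel around the ring."""
    n = len(q_chunks)
    dim = len(v_chunks[0][0])
    # Running softmax state per query: [max score, denominator, numerator].
    state = [[[-math.inf, 0.0, [0.0] * dim] for _ in chunk] for chunk in q_chunks]
    held = list(zip(k_chunks, v_chunks))  # held[g] = KV block on GPU g right now
    for _ in range(n):
        for g in range(n):  # on real hardware these run in parallel
            ks, vs = held[g]
            for q, st in zip(q_chunks[g], state[g]):
                scores = [dot(q, k) for k in ks]
                new_max = max(st[0], max(scores))
                rescale = math.exp(st[0] - new_max)  # 0.0 on the first round
                weights = [math.exp(s - new_max) for s in scores]
                st[1] = st[1] * rescale + sum(weights)
                st[2] = [old * rescale + sum(w * v[j] for w, v in zip(weights, vs))
                         for j, old in enumerate(st[2])]
                st[0] = new_max
        held = held[-1:] + held[:-1]  # GPU g hands its block to GPU g+1
    return [[[x / st[1] for x in st[2]] for st in chunk] for chunk in state]


def flatten(chunks):
    return [row for chunk in chunks for row in chunk]


q_chunks = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0], [-1.0, 0.3]]]
k_chunks = [[[0.2, 0.1], [1.0, -1.0]], [[0.3, 0.7], [-0.5, 0.4]]]
v_chunks = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]

ring_out = flatten(ring_attention(q_chunks, k_chunks, v_chunks))
full_out = full_attention(flatten(q_chunks), flatten(k_chunks), flatten(v_chunks))
max_diff = max(abs(a - b) for r, f in zip(ring_out, full_out) for a, b in zip(r, f))
print("max difference vs. full attention:", max_diff)
```

Each GPU only ever holds one KV block at a time, so per-GPU memory depends on the chunk length rather than the full sequence length.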

---

Why Isn't There a One-Size-Fits-All Solution?

You might be wondering: Can we use all these techniques together? Is there a magic formula?

The blunt answer: you can combine them, but there's no magic formula.

In practice, everyone mixes and matches.

Cael Lee
