Solid Foundations for Large Model Inference: Parallel Strategies Illustrated

Generated: 2026-06-20 20:01:37

---


Parallel Strategies for Large Models—Don't Be Intimidated! It's Really Just Three Cuts, and Every One Hits Home.

Hey, let me tell you, this whole thing really started the other night.

It was late, and I was just sitting there mulling something over when my phone pinged. A reader in the group sent out an SOS: "Hey, what's the deal with CP and EP? I set CP=2, and then I stared at the MoE communication graph in the logs for half an hour and still couldn't figure it out!"

He even attached a screenshot. I opened it and almost laughed out loud—right there in the graph was a variable called MOEDP, followed clearly by a group named ATTN_CP. And he had been staring at the parameter name for half an hour.

Tell me, isn't that how it goes? Parallel strategies always play these tricks on you. The name says "Li Kui," the real hero from Water Margin, but when it comes to work, it might as well be "Li Gui," his impostor.

I've been tinkering with multi-GPU inference for almost a year now, and I've stepped into so many pitfalls you could line them up from Beijing to Shanghai. So today, let me lay it all out like a story, draw a few dead-simple diagrams, and explain it in plain English—once and for all.

You're Not "Stacking Gold Bricks"—You're "Cutting a Cake"

First, let's get a dose of hardcore wisdom: multi-GPU inference isn't about having more GPUs; it's about what you cut.

Large models have two personalities: one is Prefill, where it reads through your whole essay in one go and computes the KV Cache; the other is Decode, where it spits out words one by one, slower than an old lady crossing the street.

These two personalities bottleneck you in completely different ways.

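During the Prefill phase, the GPU is computing like crazy, so raw compute is the bottleneck. On top of that, it fears a "small belly": if one card's memory cannot even hold the model, nothing starts at all. During the Decode phase, the compute units sit mostly idle and memory bandwidth becomes the bottleneck, because every new token means rereading the weights and the KV Cache. It fears being "too slow," leaving you waiting until the cows come home.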
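To put a rough number on those two personalities, here is a minimal sketch that estimates how many FLOPs a single Linear layer performs per byte it moves. The layer size and prompt length are hypothetical values picked only for illustration, and the model ignores caching and kernel details.

```python
def matmul_arithmetic_intensity(num_tokens, d_in, d_out, bytes_per_value=2):
    """FLOPs per byte moved for Y = X @ W, with X of shape (num_tokens, d_in)."""
    flops = 2 * num_tokens * d_in * d_out          # one multiply and one add per term
    weight_bytes = d_in * d_out * bytes_per_value  # W is read once
    activation_bytes = num_tokens * (d_in + d_out) * bytes_per_value  # read X, write Y
    return flops / (weight_bytes + activation_bytes)

# Hypothetical layer size and prompt length (assumptions, not from the article).
D_IN, D_OUT = 8192, 8192
prefill_intensity = matmul_arithmetic_intensity(num_tokens=2048, d_in=D_IN, d_out=D_OUT)
decode_intensity = matmul_arithmetic_intensity(num_tokens=1, d_in=D_IN, d_out=D_OUT)

print(f"prefill: {prefill_intensity:.1f} FLOPs per byte")
print(f"decode:  {decode_intensity:.2f} FLOPs per byte")
```

Under these assumptions, Prefill does over a thousand FLOPs for every byte it touches, while a single-token Decode step does less than one. That gap is why the same GPU feels compute-starved in one phase and bandwidth-starved in the other.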

So you see, when it comes to parallel strategies, it really boils down to three cuts, each hitting a different pain point:

Cut 1: Model Parameters — Tensor Parallelism (TP) + Pipeline Parallelism (PP)
Cut 2: Requests — Data Parallelism (DP)
Cut 3: Sequences — Context Parallelism (CP) + Sequence Parallelism (SP)

Remember, you don't pick just one—you combine them, like seasoning in cooking. None can be skipped.

Tensor Parallelism: The Most "Brutal" Butcher

TP is the flashiest of all strategies. Why? Because it demands the most communication—it's the real social butterfly.

TP takes the weight matrix of a Transformer layer and "chops" it like a watermelon, giving each GPU only a small slice. But the price is that for every layer you compute, you have to "sync up" twice.

Let me draw you a mental picture. Imagine a simple fully connected layer with weight W and input X:


Column-wise TP (vertical cut):
GPU0: I multiply the whole X by the left half of W's columns, and I get the left half of the output!
GPU1: I multiply the whole X by the right half of W's columns, and I get the right half of the output!
Then they put their heads together, AllGather: "Let's combine!" Stitch the two halves side by side, and now it's a "finished product."

Row-wise TP (horizontal cut):
GPU0: I only take the first half of X's columns, multiply it by the top half of W's rows, and I get a "semi-finished product"!
GPU1: I only take the second half of X's columns, multiply it by the bottom half of W's rows, and I get another "semi-finished product"!
Then they AllReduce: "Give me your half! Here's mine!" Add them up, and only then do they have the complete "final product."
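If you want to check the two cuts by hand, here is a tiny pure-Python sketch that simulates two "GPUs" with plain lists. The matrices are made-up toy values. The point is only that a column split produces output slices that get concatenated (the AllGather pattern), while a row split produces full-shape partial sums that get added (the AllReduce pattern).

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Toy values for illustration: 2 tokens, hidden size 4.
X = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
W = [[1, 0, 2, 1],
     [0, 1, 1, 2],
     [3, 1, 0, 1],
     [1, 2, 1, 0]]
reference = matmul(X, W)

# Column-wise split: each "GPU" holds half of W's columns and sees all of X.
w_cols = [[row[:2] for row in W], [row[2:] for row in W]]
col_parts = [matmul(X, w) for w in w_cols]
# Each part is a slice of the output columns, so gathering means concatenation.
col_result = [left + right for left, right in zip(*col_parts)]

# Row-wise split: each "GPU" holds half of W's rows and the matching half of X's columns.
w_rows = [W[:2], W[2:]]
x_cols = [[row[:2] for row in X], [row[2:] for row in X]]
row_parts = [matmul(x, w) for x, w in zip(x_cols, w_rows)]
# Each part is a full-shape partial sum, so reducing means element-wise addition.
row_result = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(*row_parts)]

print(col_result == reference, row_result == reference)
```

Both checks compare against the unsplit product, so any mix-up between "concatenate" and "add" shows up immediately.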

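See? Each Transformer layer has two blocks, Attention and FFN. In each block the first Linear is cut column-wise and the second one (the output projection in Attention, the down projection in FFN) is cut row-wise, so the first output feeds straight into the second with no communication in between, and each block ends with a single AllReduce. By the time you're done, you're looking at 2 AllReduce operations per layer on every forward pass, multiplied by however many dozens of layers the model has. How exhausting is that?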
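Here is a small sketch of why one sync per block can be enough, assuming the common Megatron-style pairing of a column cut followed by a row cut. It uses toy integer weights and ReLU as the activation (both are assumptions for illustration), and the "AllReduce" is simulated by adding the per-rank results.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Element-wise ReLU, so it can run on each shard independently."""
    return [[max(0, v) for v in row] for row in m]

# Toy FFN block: hidden size 2, intermediate size 4 (illustrative values).
X = [[1, -2], [3, 1]]
W1 = [[1, -1, 2, 0], [0, 1, -1, 1]]       # up projection, 2 x 4
W2 = [[1, 0], [0, 1], [1, 1], [-1, 2]]    # down projection, 4 x 2
reference = matmul(relu(matmul(X, W1)), W2)

TP = 2
shard = len(W2) // TP                      # intermediate columns owned by each rank
partials = []
for rank in range(TP):
    lo, hi = rank * shard, (rank + 1) * shard
    w1_shard = [row[lo:hi] for row in W1]  # column cut of the first Linear
    w2_shard = W2[lo:hi]                   # row cut of the second Linear
    hidden = relu(matmul(X, w1_shard))     # stays local, no communication needed
    partials.append(matmul(hidden, w2_shard))

# The only communication in the block: sum the partial outputs across ranks.
output = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]
print(output == reference)
```

The trick only works because the activation is element-wise: each rank can apply it to its own slice without ever seeing the other rank's columns.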

I've tested it myself—NVLink inside an H200 machine can hit 900 GB/s, but once you go across machines via InfiniBand, it drops straight to 50 GB/s. That's 18 times slower! So I always say: never do TP across machines. At most, keep it within an 8-GPU group inside the same chassis. Going outside is suicide.

One little pitfall to watch out for: TP's communication volume also depends on your batch size, because every AllReduce carries one hidden-size vector for each token in the batch. With small batches, the data in those AllReduces is barely noticeable. But once the batch gets big, you'll find yourself waiting forever for it to finish. Pure agony.
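For a back-of-the-envelope feel, the sketch below estimates the per-AllReduce message size and a bandwidth-only transfer time at the two speeds quoted above. The hidden size, the token counts, and the ring all-reduce cost model are assumptions for illustration. Real runs also pay latency and may overlap communication with compute.

```python
def allreduce_message_bytes(batch_tokens, hidden_size, bytes_per_value=2):
    """Bytes in one TP AllReduce: one hidden-size vector per token in the batch."""
    return batch_tokens * hidden_size * bytes_per_value

def ring_allreduce_seconds(message_bytes, num_gpus, bandwidth_gb_per_s):
    """Bandwidth-only estimate for a ring all-reduce (ignores latency and overlap)."""
    traffic = 2 * (num_gpus - 1) / num_gpus * message_bytes
    return traffic / (bandwidth_gb_per_s * 1e9)

HIDDEN = 8192   # hypothetical hidden size
TP = 8
for batch_tokens in (1, 64, 4096):
    size = allreduce_message_bytes(batch_tokens, HIDDEN)
    t_nvlink = ring_allreduce_seconds(size, TP, 900)   # intra-node figure from the text
    t_ib = ring_allreduce_seconds(size, TP, 50)        # cross-node figure from the text
    print(f"{batch_tokens:>5} tokens: {size / 1e6:8.3f} MB, "
          f"NVLink {t_nvlink * 1e6:9.2f} us, InfiniBand {t_ib * 1e6:9.2f} us")
```

The message grows linearly with the number of tokens in flight, and whatever the size, the slower link takes 18 times as long in this model.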

Data Parallelism: The Most "Old-School Strength in Numbers"

DP is the simplest concept: "Just replicate the model and give one copy to each GPU, right?"

Simple, but the price is that every card has to be the "star of the show"—carrying the entire model weights and KV Cache from start to finish.

I once saw a team running a 70B model on 8 H100s. One card couldn't hold it, so they set up a TP=8 group, just barely enough. Then the business side said, "We need higher throughput!" So those guys added DP=2, creating two TP=8 groups, which takes 16 cards in total.
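A layout like that can be sketched in a few lines. The function names and the round-robin routing policy below are hypothetical, chosen only to illustrate the idea. Real serving frameworks have their own rank mapping and usually schedule by load.

```python
def build_replicas(world_size, tp_size):
    """Group consecutive GPU ranks into TP groups; each group is one DP replica."""
    if world_size % tp_size != 0:
        raise ValueError("world_size must be a multiple of tp_size")
    return [list(range(start, start + tp_size))
            for start in range(0, world_size, tp_size)]

def route_round_robin(request_ids, num_replicas):
    """Assign each request to one replica in turn (a deliberately naive policy)."""
    return {req: idx % num_replicas for idx, req in enumerate(request_ids)}

replicas = build_replicas(world_size=16, tp_size=8)   # TP=8, DP=2
routing = route_round_robin(["req-a", "req-b", "req-c", "req-d"], len(replicas))

for idx, ranks in enumerate(replicas):
    print(f"replica {idx}: ranks {ranks}")
print(routing)
```

Each request lives entirely inside one replica, so the two groups never need to talk to each other while serving.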

The model parameters? Those can be shared: they used DeepSpeed ZeRO-3. But the KV Cache? As you know, it's born attached to the replica that serves the request. No sharing allowed! Each replica has to fork over its own memory for the KV Cache of every request it handles.
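To see how much memory that is, here is a rough KV Cache size estimate. The layer count, KV head count, and head dimension are assumed values roughly in the shape of a 70B model with grouped-query attention, and the sequence length and batch size are made up for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Total KV Cache size; the leading 2 covers one K and one V tensor per layer."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Assumed model shape and workload (not taken from the article).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=1, batch_size=1)
per_replica = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=4096, batch_size=16)

GIB = 1024 ** 3
print(f"per token:   {per_token / 1024:.0f} KiB")
print(f"per replica: {per_replica / GIB:.1f} GiB for 16 requests of 4096 tokens")
print(f"with DP=2:   {2 * per_replica / GIB:.1f} GiB across both replicas")
```

Doubling the replicas doubles how many requests can be served at once, but each replica still has to find room for its own cache on top of the weights.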

So the conclusion is simple: DP only solves "too many requests," not "too heavy a load." It handles the customers, but it doesn't care whether the model

Cael Lee
