LLM Training Made Readable: From GPU Memory Analysis to DeepSpeed

Generated: 2026-06-22 10:49:27

---


Last month, I was hunkered down in the server room, staring at the memory usage of eight A100s, and I almost smashed my keyboard.

Here's what happened. I wanted to fine-tune a 7B model. I figured, memory should be enough, right? But when I did the math… man, my face went green.

Let me break it down in the simplest way. Listen to this and tell me if I'm wrong.

Take Llama 7B as an example. Parameters stored in fp16: 7B × 2 bytes = 14GB. We're good, right?

Nope. Too naive.

Gradients also need to be in fp16, right? So, another 14GB.

You think that's it? The sneakiest part comes last. With mixed-precision training, that optimizer called AdamW is like a miserly landlord. It demands three separate fp32 chunks:

One for an fp32 master copy of the model parameters: 7B × 4 bytes = 28GB.

One for the first moment (the momentum), m: another 28GB.

And one for the second moment (the variance estimate), v: yet another 28GB.

Add it up: 14 + 14 + 28 + 28 + 28 = 112GB?!
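If you'd rather let Python do the adding, here is a minimal sketch of the same arithmetic. It assumes mixed-precision AdamW training as described above and decimal gigabytes (1GB = 1e9 bytes), which is what makes 7B × 2 bytes come out to 14GB. The function name is made up for illustration.

```python
GB = 1e9  # decimal gigabytes, matching "7B x 2 bytes = 14GB" in the text


def mixed_precision_adamw_memory_gb(num_params: float) -> dict:
    """Model-state memory for mixed-precision AdamW, in GB.

    Covers only parameters, gradients, and optimizer states.
    Activations and temporary buffers are NOT included.
    """
    return {
        "fp16_params": num_params * 2 / GB,         # 2 bytes per parameter
        "fp16_grads": num_params * 2 / GB,          # 2 bytes per gradient
        "fp32_master_params": num_params * 4 / GB,  # fp32 copy kept by the optimizer
        "fp32_adam_m": num_params * 4 / GB,         # first moment
        "fp32_adam_v": num_params * 4 / GB,         # second moment
    }


breakdown = mixed_precision_adamw_memory_gb(7e9)
total_gb = sum(breakdown.values())

for name, gb in breakdown.items():
    print(f"{name:>20}: {gb:6.1f} GB")
print(f"{'total':>20}: {total_gb:6.1f} GB")
```

That works out to 16 bytes per parameter, so the total scales linearly with model size: swap in 13e9 or 70e9 and watch the number get scarier.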

I was dumbfounded. A single A100 barely has 80GB, and a 7B model dares to ask for 112GB? That's like trying to fit an entire football team into a single room.

At this point, you might think the simplest solution is data parallelism — put a full copy of the model on each GPU and split the data.

Yes, it's the most straightforward. But also the most expensive.

Think about it: every GPU needs the full 112GB. An 80GB A100 can't hold that, let alone smaller cards. It's like buying a full encyclopedia set for each of eight people just so they can each read a book. Bulky, wasteful, and the warehouse explodes constantly.

So what happened next? Guess what? DeepSpeed's ZeRO stepped up.

Its core idea sounds almost like a joke, but it's brilliant: Everyone thought it was about stacking hardware, but it's really about eliminating redundancy.

You see, having every card store a complete copy of data is stupid. Why not split the storage and piece it together when needed?

I spent three days testing the three stages of ZeRO on an 8-card A100 cluster. The results were like watching a magic trick.

Stage 1: Shard the optimizer states.

Memory dropped from 112GB to about 40GB instantly! How? By taking that miserly landlord AdamW's fp32 assets (84GB in total) and splitting them across 8 cards, so each card holds only 1/8. On paper that's 14 + 14 + 84/8 = 38.5GB. Communication volume didn't increase at all; it's exactly the same as normal data parallelism.

See, it's like instead of eight people each buying a full encyclopedia, each person buys just one volume, and together they have a complete set. How much did we save? You don't need me to say it.

Stage 2: Shard gradients too!

Memory dropped directly to about 28GB! (On paper, the model states come to 14 + (14 + 84)/8 = 26.25GB.) This time, the gradients coming back during backprop are also stored in a sharded way. Communication becomes one "Reduce-Scatter" on the gradients followed by one "AllGather" on the updated parameters, but the total data volume doesn't increase.

This step is the most practical! I actually ran a 7B model on just 4 A100s! Think about that — the barrier to entry for ordinary people has been completely flattened! Everyone thought it was out of reach, but it's actually right there for the taking.
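To see why "Reduce-Scatter, then AllGather" costs no more than a regular all-reduce, here is a toy simulation in pure Python. It is only a sketch under loud assumptions: ranks are plain lists in a single process, the update rule is vanilla SGD rather than AdamW, and `reduce_scatter` / `all_gather` are hand-written stand-ins, not calls into DeepSpeed or torch.distributed.

```python
def reduce_scatter(per_rank_grads):
    """Toy reduce-scatter: rank r ends up with the summed chunk r only."""
    world = len(per_rank_grads)
    chunk = len(per_rank_grads[0]) // world
    return [
        [sum(g[r * chunk + i] for g in per_rank_grads) for i in range(chunk)]
        for r in range(world)
    ]


def all_gather(per_rank_shards):
    """Toy all-gather: every rank receives all shards, concatenated in rank order."""
    full = [x for shard in per_rank_shards for x in shard]
    return [list(full) for _ in per_rank_shards]


world_size = 4
lr = 0.1
params = [1.0] * 8  # identical starting weights on every rank
# Each rank computed different gradients from its own slice of the batch.
grads = [[float(rank + 1)] * len(params) for rank in range(world_size)]
chunk = len(params) // world_size

# Step 1: each rank keeps only its own shard of the summed gradients.
grad_shards = reduce_scatter(grads)

# Step 2: each rank updates only the parameters it owns (plain SGD here).
updated_shards = [
    [params[r * chunk + i] - lr * grad_shards[r][i] / world_size for i in range(chunk)]
    for r in range(world_size)
]

# Step 3: all-gather the updated shards so every rank has full weights again.
gathered = all_gather(updated_shards)

# Reference: classic data parallelism, all-reduce then a full update on every rank.
expected = [
    p - lr * sum(g[i] for g in grads) / world_size for i, p in enumerate(params)
]
print(gathered[0] == expected)
```

Every rank ends up with the same weights it would have had under an all-reduce, but between steps 1 and 3 it only ever held 1/4 of the reduced gradients.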

Stage 3: The ultimate magic — shard the parameters too!

How do you calculate the memory? (14 + 14 + 28 + 28 + 28) / 8 = 14GB!

14GB! Think about it: from 112GB to 14GB, an 8× reduction! But now there's a cost: communication volume increases by 50%, because every forward and backward pass needs an AllGather to reconstruct the full parameters. When I tested on 8 cards, throughput was about 15% lower than Stage 2.

See, this is trading communication for memory — a willing buyer, a willing seller.
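The three stages boil down to one small formula, sketched below. It counts model states only (fp16 parameters, fp16 gradients, 12 bytes per parameter of fp32 optimizer state), so real measurements will sit higher once activations, buffers, and fragmentation join the party. The communication figures follow the usual accounting in which a plain all-reduce moves about 2 units per parameter and Stage 3 about 3. Function names are made up for illustration.

```python
def zero_memory_per_gpu_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Per-GPU model-state memory in GB for ZeRO stage 0 (plain DP) through 3."""
    fp16_params = 2.0 * num_params
    fp16_grads = 2.0 * num_params
    optimizer_states = 12.0 * num_params  # fp32 master params + Adam m + Adam v

    if stage >= 1:  # Stage 1: shard optimizer states
        optimizer_states /= num_gpus
    if stage >= 2:  # Stage 2: shard gradients too
        fp16_grads /= num_gpus
    if stage >= 3:  # Stage 3: shard parameters too
        fp16_params /= num_gpus

    return (fp16_params + fp16_grads + optimizer_states) / 1e9


def comm_volume_per_param(stage: int) -> float:
    """Approximate data moved per parameter per step, in parameter-sized units."""
    # Stages 0-2: reduce-scatter + all-gather (equivalent to one all-reduce).
    # Stage 3: one extra all-gather of parameters, for 3 units instead of 2.
    return 3.0 if stage >= 3 else 2.0


for stage in range(4):
    mem = zero_memory_per_gpu_gb(7e9, 8, stage)
    rel_comm = comm_volume_per_param(stage) / comm_volume_per_param(0)
    print(f"stage {stage}: {mem:6.2f} GB per GPU, {rel_comm:.1f}x baseline communication")
```

Try changing `num_gpus` to 4: the savings in Stages 1 to 3 shrink, because everything that gets sharded is divided by the number of cards.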

So here's my honest advice:

If your single-card memory is enough, don't listen to any AI guru telling you to use ZeRO. Just go with data parallelism — it's the most stable.
If memory is a bit tight but bearable, Stage 2 is your perfect partner — unbeatable cost efficiency.
If memory is extremely tight and feeling like it's about to explode, go with Stage 3, but be prepared for a performance hit.
If you only have one card? Don't worry, there's ZeRO-Offload — throw the optimizer states onto CPU memory, and you can still play.
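In practice, picking a stage is mostly a matter of a few lines in the DeepSpeed config. Below is a hedged sketch that builds such a config as a Python dict. The keys (`zero_optimization`, `stage`, `offload_optimizer`, `fp16`, and so on) are standard DeepSpeed config fields, but the helper function and the batch-size values are placeholders of mine, not the settings used in the runs described here.

```python
import json


def build_zero_config(stage: int = 2, offload_optimizer_to_cpu: bool = False) -> dict:
    """Build a minimal DeepSpeed-style config dict (hypothetical helper)."""
    zero = {"stage": stage}
    if stage >= 2:
        # Common companions for gradient sharding; tune or drop as needed.
        zero["overlap_comm"] = True
        zero["contiguous_gradients"] = True
    if offload_optimizer_to_cpu:
        # ZeRO-Offload: keep optimizer states in CPU memory.
        zero["offload_optimizer"] = {"device": "cpu", "pin_memory": True}

    return {
        "train_micro_batch_size_per_gpu": 4,  # placeholder value
        "gradient_accumulation_steps": 1,     # placeholder value
        "fp16": {"enabled": True},
        "zero_optimization": zero,
    }


stage2_config = build_zero_config(stage=2)
offload_config = build_zero_config(stage=2, offload_optimizer_to_cpu=True)

# Typically this would be written to a JSON file and handed to DeepSpeed.
print(json.dumps(offload_config, indent=2))
```

Switching between the options above is then a one-argument change, which makes it cheap to start at Stage 2 and only escalate to Stage 3 or offloading when memory actually runs out.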

Right now, I use Stage 2 the most. Four A100s running a 7B model with batch size 16 — rock steady.

Finally, one personal take: ZeRO doesn't magically conjure up memory out of thin air. It's demonstrating a very simple truth: Most of the bottlenecks we think exist are, in essence, caused by redundancy.

Think about it: eight cards each storing a full set of optimizer states — it's like eight people each building their own library, yet they're all reading the same book. Why not store it separately and piece it together when needed? This is essentially replacing duplication with collaboration.

But don't get too carried away. If your model has hundreds of billions of parameters, ZeRO alone isn't enough. You'll need to bring in the duo of Tensor Parallelism and Pipeline Parallelism as well.

Like that NVIDIA paper says: under a thousand cards, ZeRO is king. Above a thousand, TP/PP is better. Because the AllGather communication overhead is like a road toll — too many cars and it becomes unbearable.

Oh, right. I almost forgot about activation memory. I left it out of the earlier calculations, which only cover model states. But that thing is the real "hidden killer", because it grows with batch size and sequence length!

Thankfully, FlashAttention avoids ever materializing the full O(S²) attention matrix in GPU memory, and Ring Attention can split long sequences across multiple cards. These technologies, working together with ZeRO, are what make large-scale model training actually feasible.
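To get a feel for that O(S²) term, here is a rough sketch of how big the attention score matrix gets if it is fully materialized. The assumptions are mine: fp16 scores (2 bytes each), one layer, batch size 1, and 32 attention heads as in Llama 7B. It ignores every other activation, so treat it as a lower bound on the pain.

```python
def attention_scores_gb(
    seq_len: int,
    num_heads: int,
    batch_size: int = 1,
    bytes_per_element: int = 2,
) -> float:
    """Memory for one layer's full S x S attention score matrix, in GB."""
    elements = batch_size * num_heads * seq_len * seq_len
    return elements * bytes_per_element / 1e9


for seq_len in (2048, 4096, 8192, 16384):
    gb = attention_scores_gb(seq_len, num_heads=32)
    print(f"S = {seq_len:>5}: {gb:7.2f} GB per layer")
```

Doubling the sequence length quadruples this number, which is exactly why a kernel that never stores the full matrix matters so much for long contexts.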

Training large models isn't about fighting for compute power — it's about fighting for efficiency against your own cognitive redundancy.

Cael Lee
