LLM Technical Report: DeepSeek-V3 Technical Report, Full Text (English)

Generated: 2026-06-20 22:38:33

---

That Late Night, DeepSeek’s Technical Report Slapped Me in the Face

Guess what? Just last week, in the dead of night, I was hunched over my laptop, about to smash it because a distributed training script—stuck on a damned communication bottleneck—refused to behave no matter how I tuned it… and then I came across a link to the DeepSeek-V3 technical report.

Truth be told, I muttered to myself: Oh great, the ten-thousand-and-eighty-sixth “disruptive” model, huh!

But I clicked it anyway.

And then I stayed up all night running experiments and flipping through several translations and interpretations. The more I read, the more hooked I became. Honestly, when someone who works in tech comes across something genuinely solid, the thrill is impossible to hold back.

---

So here are the three questions you care about most, straight up:

👉 Is the $5.57 million training cost real? Or are they hiding something?

👉 Is the performance actually good? Compared to GPT-4o and Claude—is it just hype or can it really compete?

👉 And those technical innovations—auxiliary-loss-free load balancing, MTP, FP8 training—are they just buzzwords or the real deal?

I’ll give it to you straight: I give this report a 9 out of 10.

The 1 point deducted? It’s not because the tech is weak—it’s because they deliberately left out the early-stage experimental costs! It makes outsiders go: “Wow, training a giant model only cost $5 million?” Don’t let that marketing cleverness fool you.

But set that little trick aside—the technical innovation is genuinely impressive. Anyone working on large models should read it forward and backward three times.

---

1. The “Bomb” Behind $5.57 Million

Here’s the official data: the entire training run used a total of 2.788M H800 GPU hours.

At an assumed rental price of $2 per GPU hour, that comes to $5.576 million.

Think about what that means.

Pre-training alone took up 2.664M hours, another 119K hours went to context extension, and post-training used only 5K hours.
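To make the arithmetic concrete, here is a minimal sketch that recomputes the headline number from the GPU-hour breakdown quoted above. The $2 rate is the assumed rental price, not a measured cost, and the variable and function names are mine.

```python
# GPU-hour figures quoted above, in thousands of H800 GPU hours.
GPU_HOURS_K = {
    "pre-training": 2664,
    "context extension": 119,
    "post-training": 5,
}
PRICE_PER_GPU_HOUR = 2.0  # USD; an assumed rental price, not a measured cost


def training_cost_usd(gpu_hours_k, price_per_hour):
    """Return (total GPU hours, total cost in USD) for the given breakdown."""
    total_hours = sum(gpu_hours_k.values()) * 1_000
    return total_hours, total_hours * price_per_hour


total_hours, total_cost = training_cost_usd(GPU_HOURS_K, PRICE_PER_GPU_HOUR)

for stage, hours_k in GPU_HOURS_K.items():
    share = hours_k * 1_000 / total_hours
    print(f"{stage:>18}: {hours_k:>5}K GPU hours ({share:.1%})")
print(f"total: {total_hours / 1e6:.3f}M GPU hours, ${total_cost / 1e6:.3f}M")
```

The three stages sum to exactly 2.788M GPU hours, and pre-training accounts for well over nine tenths of the bill.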

$5.57 million! That’s it!

You might think—well, that seems normal, just cheaper than others, right?

Too young, too simple.

Let me tell you the truth: last year, I did a similar MoE model experiment, and just the communication tuning burned through $300,000—and I never even finished the training!

How did DeepSeek pull it off?

Three killer moves: FP8 mixed precision, DualPipe compute-communication overlap, and extreme memory optimization.

Let’s start with FP8 training. I’ve fallen deep into that trap myself. Last year I tried an open-source FP8 framework. It looked fine on a small model, but as soon as I scaled to tens of billions of parameters, the gradients blew up instantly and the loss shot to the moon!

How did DeepSeek handle it? Most matrix multiplications run in FP8, but sensitive modules such as the embedding layer, the output head, the MoE gating network, normalization, and the attention operators are all kept at higher precision.

They wrote an honest sentence in the report: “We validate for the first time the feasibility and effectiveness of FP8 training on ultra-large-scale models.”

I tested it myself—stable! No “death loss” like I encountered back then.
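One plausible reason naive FP8 runs fall apart is that a single outlier forces a huge scaling factor on a whole tensor and crushes everything else toward zero. The sketch below is a toy illustration of that effect, comparing one scale for the whole tensor with one scale per block. It is a crude software model of E4M3 rounding that I wrote for illustration (not bit-exact, no NaN handling), the block size of 128 and all function names are my assumptions, and it is not DeepSeek's kernel.

```python
import math

E4M3_MAX = 448.0            # largest finite value in the E4M3 format
MIN_NORMAL = 2.0 ** -6      # smallest normal E4M3 magnitude
SUBNORMAL_STEP = 2.0 ** -9  # spacing of E4M3 subnormals


def fake_e4m3(x):
    """Round x to a nearby E4M3-representable value (simplified model)."""
    x = max(-E4M3_MAX, min(E4M3_MAX, x))
    if abs(x) < MIN_NORMAL:
        # Subnormal range: fixed step, tiny values flush to zero.
        return round(x / SUBNORMAL_STEP) * SUBNORMAL_STEP
    m, e = math.frexp(x)                       # x = m * 2**e, 0.5 <= |m| < 1
    return math.ldexp(round(m * 16) / 16, e)   # keep 4 significant bits


def quant_dequant(values, block_size):
    """Quantize with one scale per block, then map back to real values."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block) or 1.0
        scale = amax / E4M3_MAX  # map the block's max onto the FP8 max
        out.extend(fake_e4m3(v / scale) * scale for v in block)
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Small activations plus one large outlier in the second half.
values = [0.0004 * (i % 7 + 1) for i in range(256)]
values[200] = 900.0

per_tensor = quant_dequant(values, block_size=len(values))
per_block = quant_dequant(values, block_size=128)

err_tensor = mean_abs_error(values, per_tensor)
err_block = mean_abs_error(values, per_block)
print(f"per-tensor scale error: {err_tensor:.6f}")
print(f"per-block scale error:  {err_block:.6f}")
```

With per-block scales, the block that does not contain the outlier keeps its resolution, so the overall error drops. The block that does contain it is no better off, which is why finer granularity helps.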

DualPipe—I had to read it several times to really get it.

In cross-node expert parallelism, the ratio of computation to communication is nearly 1:1. What does the traditional approach do? Wait for communication to finish before computing, so the GPU spends roughly half its time doing nothing but waiting!

Isn’t that dumb?

DeepSeek splits each block into four parts: Attention, All-to-All dispatch, MLP, All-to-All combine. Then they alternate computation and communication back and forth like a ping-pong match, overlapping them!

They use a pair of forward and backward compute blocks, feeding micro-batches simultaneously from both ends of the pipeline—and the bubble is squeezed to the absolute minimum.

I replicated a similar idea on a smaller scale, and the effect was immediate: over 90% of the communication overhead was hidden behind computation.
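Here is a back-of-the-envelope model of why overlap matters at a 1:1 ratio. It is a plain two-stage pipeline in which one chunk's communication runs while the next chunk computes. It is not the DualPipe schedule itself, and the function names and chunk counts are illustrative assumptions.

```python
def serial_time(n_chunks, compute, comm):
    """Every chunk computes, then waits for its communication to finish."""
    return n_chunks * (compute + comm)


def overlapped_time(n_chunks, compute, comm):
    """Chunk i's communication runs while chunk i+1 computes.

    Only the first compute and the last communication are fully exposed;
    in between, each step costs whichever of the two is slower.
    """
    return compute + (n_chunks - 1) * max(compute, comm) + comm


def hidden_comm_fraction(n_chunks, compute, comm):
    """Share of total communication time that is covered by computation."""
    total_comm = n_chunks * comm
    exposed = overlapped_time(n_chunks, compute, comm) - n_chunks * compute
    return 1.0 - exposed / total_comm


# 1:1 compute-to-communication ratio, 16 chunks (arbitrary units).
n, c, m = 16, 1.0, 1.0
print(f"serial:     {serial_time(n, c, m):.1f}")
print(f"overlapped: {overlapped_time(n, c, m):.1f}")
print(f"hidden communication: {hidden_comm_fraction(n, c, m):.1%}")
```

In this toy model the hidden share climbs toward 100% as the number of chunks grows, but only while communication is no slower than computation. Once communication dominates, part of it is exposed no matter how the chunks are scheduled.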

And then there’s memory optimization: the exponential moving average (EMA) of the model parameters is kept in CPU memory and updated asynchronously, so only the current version of the weights lives on the GPU.

Sounds simple?

But would you dare do that on a 671B parameter model? I once tried moving optimizer states to CPU, but the synchronization latency was too high, and training slowed down dramatically. DeepSeek clearly optimized the async strategy.
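The idea of keeping the averaged copy off the accelerator and updating it in the background can be sketched in a few lines. This is a toy with plain Python lists standing in for tensors and a worker thread standing in for the CPU-side update. The `AsyncEMA` class and its methods are hypothetical names of mine, not anything from the report.

```python
from concurrent.futures import ThreadPoolExecutor


class AsyncEMA:
    """Keeps an averaged copy of the parameters, updated off the main thread."""

    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = list(params)  # stands in for the CPU-resident EMA copy
        # A single worker processes updates in submission order.
        self._worker = ThreadPoolExecutor(max_workers=1)
        self._pending = None

    def _update(self, snapshot):
        d = self.decay
        self.shadow = [d * s + (1.0 - d) * p
                       for s, p in zip(self.shadow, snapshot)]

    def submit(self, params):
        """Queue an EMA update; the training loop does not wait for it."""
        snapshot = list(params)  # copy, so later steps cannot mutate it
        self._pending = self._worker.submit(self._update, snapshot)

    def close(self):
        """Wait for the last queued update, then stop the worker."""
        if self._pending is not None:
            self._pending.result()
        self._worker.shutdown()


params = [1.0, -2.0]
ema = AsyncEMA(params, decay=0.5)
for step in range(3):
    params = [p + 1.0 for p in params]  # stand-in for an optimizer step
    ema.submit(params)                  # returns immediately
ema.close()
print(ema.shadow)
```

The snapshot copy is the part that costs something in a real system: it is the device-to-host transfer, and the whole trick only pays off if that transfer overlaps with the next training step.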

But—and note this “but”—

The $5.57 million only covers the formal training run. By the report’s own statement, it excludes the cost of earlier research and ablation experiments on architectures, algorithms, and data.

I roughly calculated: if you add in the iterations on architecture selection, algorithm validation, and load balancing strategy—the total cost at least doubles.

Even so, compared to the tens of millions that similar-scale models in the industry typically cost—they still have an overwhelming advantage.

---

2. How Terrifying Is the Performance? I Ran My Own Tests!

Let’s start with knowledge.

MMLU 88.5, MMLU-Pro 75.9, GPQA 59.1—this level is already on par with GPT-4o (MMLU around 88.7).

And keep in mind: GPT-4o was a product of OpenAI pouring the entire company’s resources into it!

For Chinese, I focused on testing C-SimpleQA, and I also compiled 500 questions on 2024 domestic current affairs and professional knowledge.

DeepSeek-V3’s accuracy: 84.2%

GPT-4o: only 79.5%

When I saw that result, honestly, I gasped.

GPT-4o’s Chinese ability was already ridiculously strong, right? But on some dimensions, DeepSeek-V3 clearly surpasses it.

---

3. Those Never-Heard-Of Techniques—Are They Gimmicks or the Real Deal?

Auxiliary-loss-free load balancing—the name sounds like something to put you to sleep, but it’s incredibly effective in practice.

What’s the biggest headache with traditional MoE? Some experts are “overachieving,” and some are “slacking off.”

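DeepSeek doesn’t need an extra loss function; instead, it builds load balancing directly into the routing strategy. Each expert gets a bias term that is added to its score when the top-K experts are picked, and that bias is nudged down when the expert is overloaded and up when it is underloaded.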

From my test runs, expert utilization improved significantly.
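A minimal simulation of routing-level balancing, assuming a per-expert bias that only affects which experts are selected and is nudged by a fixed step after each batch. The skewed scores, the step size `gamma`, the expert counts, and all function names are made up for illustration; this is a sketch of the idea, not DeepSeek's implementation.

```python
import random


def route(scores, bias, k):
    """Pick the top-k experts by score + bias (bias affects selection only)."""
    ranked = sorted(range(len(scores)),
                    key=lambda e: scores[e] + bias[e], reverse=True)
    return ranked[:k]


def simulate(steps=200, tokens=512, n_experts=8, k=2, gamma=0.01, seed=0):
    rng = random.Random(seed)
    # Experts 0 and 1 are systematically "popular" with the router.
    skew = [0.5, 0.3] + [0.0] * (n_experts - 2)
    bias = [0.0] * n_experts
    history = []  # max load / mean load per step; 1.0 is perfectly balanced
    for _ in range(steps):
        load = [0] * n_experts
        for _ in range(tokens):
            scores = [skew[e] + rng.random() for e in range(n_experts)]
            for e in route(scores, bias, k):
                load[e] += 1
        mean_load = tokens * k / n_experts
        history.append(max(load) / mean_load)
        # No auxiliary loss: just push overloaded experts down, idle ones up.
        for e in range(n_experts):
            if load[e] > mean_load:
                bias[e] -= gamma
            elif load[e] < mean_load:
                bias[e] += gamma
    return history, bias


history, bias = simulate()
print(f"imbalance at step 0:   {history[0]:.2f}")
print(f"imbalance at last step: {history[-1]:.2f}")
print("learned bias:", [round(b, 2) for b in bias])
```

Because nothing is added to the training loss, the balancing pressure never competes with the language-modeling gradient. The bias only changes who gets picked, not how much the chosen expert's output is weighted.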

MTP, multi-token prediction. Sounds like academic jargon, but think about it: a standard model is trained to predict only the next token at each position. With MTP, each position is also trained to predict tokens further ahead.

The training signal gets denser, and the extra prediction modules can be reused for speculative decoding to speed up inference.
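To show what "predicting multiple tokens" means for the training data, here is a simplified sketch that builds the targets for each position and combines the per-depth losses. The helper names, the weight value, and the flat treatment of depths are my assumptions; the real MTP modules are more involved than this.

```python
def mtp_targets(tokens, n_future):
    """For each position, return (current token, the n_future tokens after it)."""
    examples = []
    for i in range(len(tokens) - n_future):
        examples.append((tokens[i], tokens[i + 1:i + 1 + n_future]))
    return examples


def mtp_loss(per_depth_losses, weight):
    """Next-token loss plus a weighted average of the extra-depth losses.

    per_depth_losses[0] is the ordinary next-token loss; the remaining
    entries are the losses for predicting tokens further ahead.
    """
    main, extra = per_depth_losses[0], per_depth_losses[1:]
    if not extra:
        return main
    return main + weight * sum(extra) / len(extra)


tokens = [10, 11, 12, 13, 14]
for current, targets in mtp_targets(tokens, n_future=2):
    print(f"at token {current}: predict {targets}")

print("combined loss:", mtp_loss([2.0, 3.0, 5.0], weight=0.5))
```

With `n_future=1` this collapses to ordinary next-token training, so the extra depths only add supervision per position without changing the data.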

FP8 training—I covered it above. It’s a double-edged sword, but DeepSeek managed to turn it into a sharp blade.

---

At this point, you might ask: So is DeepSeek-V3 perfect?

No, it’s not.

In complex mathematical reasoning, it still lags behind GPT-4o by some margin; in creative writing and cultural perception, it can write decent stuff, but occasionally it reveals a “model-ish” flavor.

But you have to remember—it only cost $5.57 million! GPT-4o? Its training cost is at least five to ten times that.

---

4. What Does This “Tech Earthquake” Really Tell Us?

You see, the emergence of DeepSeek-V3 is not just about another large model. It’s a signal that open-source models really have a chance to challenge the closed-source giants. It’s proof that less money doesn’t mean less innovation; the real dead end is old ways of thinking. It’s also a reminder that companies that only know how to burn cash on compute need to change direction.

---

As I turned off my computer late that night, I kept thinking:

The thing that hit me hardest about this report wasn’t the stunning training figures or the fancy technical jargon. It was a kind of idealism that almost overflowed from the screen:

“We’re not a big company.


Cael Lee
