Is Multi-Machine Distributed Training Slower Than Single-Machine? An Interviewer Reveals the Pitfalls 80% of People Fall Into

By Cael Lee | 7 min read

Generated: 2026-06-22 15:08:00

---


Just Interviewed a Candidate Who Made My Blood Pressure Spike! My Hands Are Still Shaking!

Hey, let me tell you something.

Today I interviewed a candidate with a resume that sounded amazing. Guess what? I asked three questions, and they flopped on two.

Not that they were bad—they were too "slick." They memorized answers so smoothly I almost wanted to applaud.

But you know, there's a difference between someone who's actually done the work and someone who just recites textbook stuff: the second kind can't survive three follow-up questions from me.

I've been an interviewer for years, and last year I started focusing on large model interviews. Honestly, it's pretty interesting—how to ask questions is a skill in itself.

Ask too shallow, you can't gauge depth.

Ask too deep, they think you're showing off.

Ask just right, and you dig out real ability—that's the real skill.

Today, I'm going to spill everything I've got. If someday you're sitting on the other side of the table, facing an interviewer like me… you'll thank yourself for reading this.

---

First Question: Distributed Training—My "Truth Mirror" for Screening

"Have you done data parallelism?"

I ask this question to everyone. Why? Because it works like a charm.

It's like a key—someone who's actually done it can rattle off a bunch of pitfalls; someone who hasn't can only recite textbooks.

There was a candidate whose resume boasted three years of distributed training experience. I asked the first question, and they answered perfectly. I thought, "Oh, not bad." Then I casually followed up:

"What communication backend did you use? What pitfalls did you encounter?"

…Silence. Ten seconds of silence. Like an awkward pause in a classroom—so awkward you could wring water out of it.

Speaking of which, do you know what's scariest about distributed training?

It's that you can fall into a pit without even realizing it.

For example.

I used NCCL for AllReduce, and it ran smoothly on a single machine with 8 GPUs. I thought, "This is easy, right?" Then I scaled to multiple machines with 32 GPUs.

Guess what?

Training speed didn't just fail to improve—it got slower!

I spent three whole days troubleshooting. Three days! Finally, I found it was a cross-machine network configuration issue—just a tiny problem that almost made me smash my computer.
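To sanity-check this kind of slowdown on paper, a bandwidth-only ring all-reduce estimate is a reasonable starting point. The sketch below is a simplification (a flat ring, no latency term, no overlap with compute), and every number in it is an assumed placeholder, not a measurement from the incident above.

```python
def ring_allreduce_seconds(payload_bytes: float, num_gpus: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only estimate for a ring all-reduce.

    Each GPU sends and receives 2 * (N - 1) / N of the payload, and the
    slowest link in the ring sets the pace.
    """
    if num_gpus < 2:
        return 0.0
    return 2 * (num_gpus - 1) / num_gpus * payload_bytes / link_bytes_per_s


# Assumed numbers, for illustration only.
GRAD_BYTES = 2 * 7e9     # fp16 gradients of a hypothetical 7B-parameter model
INTRA_NODE_BW = 200e9    # assumed NVLink-class bandwidth, bytes/s
INTER_NODE_BW = 1.25e9   # assumed fallback to 10 Gb/s Ethernet, bytes/s

single_node = ring_allreduce_seconds(GRAD_BYTES, 8, INTRA_NODE_BW)
multi_node = ring_allreduce_seconds(GRAD_BYTES, 32, INTER_NODE_BW)

print(f"8 GPUs, one machine   : {single_node:.3f} s per all-reduce")
print(f"32 GPUs, four machines: {multi_node:.3f} s per all-reduce")
```

If the cross-machine link is two orders of magnitude slower than the in-box link, four times the GPUs cannot make up for it. When chasing this in practice, setting the `NCCL_DEBUG=INFO` environment variable makes NCCL log which transport and network interface it picked, and `NCCL_SOCKET_IFNAME` lets you pin the interface.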

Most interviewers won't ask about this, and you'd never think to bring it up yourself. But once it is asked, the answer reveals whether you've actually touched those cold GPUs.

Another one: What are the three stages of ZeRO optimized for?

People who memorize answers will tell you: ZeRO-1 splits optimizer states, ZeRO-2 adds gradients, ZeRO-3 splits parameters.

Yeah, that's correct.
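What the three stages buy you is memory, and the arithmetic is worth being able to do on a whiteboard. A minimal sketch, assuming mixed-precision Adam with the usual accounting (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer state) and ignoring activations and buffers; the 7.5B-parameter, 64-GPU setup is just an example configuration.

```python
def zero_memory_bytes_per_gpu(num_params: float, num_gpus: int,
                              stage: int, k: int = 12) -> float:
    """Model-state bytes per GPU under mixed-precision Adam.

    k is the optimizer-state bytes per parameter (12 = fp32 weights,
    momentum, and variance). Activations are not counted.
    """
    params = 2 * num_params   # fp16 weights
    grads = 2 * num_params    # fp16 gradients
    optim = k * num_params    # optimizer states
    if stage == 0:            # plain data parallelism: everything replicated
        return params + grads + optim
    if stage == 1:            # optimizer states sharded
        return params + grads + optim / num_gpus
    if stage == 2:            # gradients sharded as well
        return params + (grads + optim) / num_gpus
    if stage == 3:            # parameters sharded as well
        return (params + grads + optim) / num_gpus
    raise ValueError("stage must be 0, 1, 2, or 3")


for stage in range(4):
    gb = zero_memory_bytes_per_gpu(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB of model state per GPU")
```

Stage 0 here is plain data parallelism, included as the baseline the three ZeRO stages are measured against.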

But then I follow up: "What's the communication cost of ZeRO-3?"

The number of people who can answer drops by 80% on the spot.

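Think about it: the parameters are sharded, so during the forward pass you need to AllGather parameters, and during the backward pass you AllGather them again and ReduceScatter the gradients. By the ZeRO paper's own accounting, that is roughly 1.5x the communication volume of plain data parallelism, and the AllGathers have to finish before each layer can run, so they are much harder to hide behind compute!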

Last year, we trained a 70B model using ZeRO-3. Do you know what percentage of training time was spent on communication?

Over 40%.

Nearly half the time was spent just waiting on communication! Frustrating, right?

Later, we switched to a hybrid strategy of data parallelism + tensor parallelism, and brought communication overhead down to under 20%.
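For readers who haven't set up a hybrid layout before, here is a minimal sketch of how ranks are commonly split into tensor-parallel and data-parallel groups. The sizes (32 ranks, tensor-parallel degree 8) and the function name are assumptions for illustration; the actual configuration used in the 70B run is not stated here.

```python
from typing import List, Tuple


def build_groups(world_size: int, tp_size: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Split ranks into tensor-parallel (TP) and data-parallel (DP) groups.

    Consecutive ranks form a TP group, so with tp_size equal to the GPUs per
    machine, the chatty TP traffic stays on intra-machine links. Ranks with
    the same position inside their TP group form a DP group.
    """
    if world_size % tp_size != 0:
        raise ValueError("world_size must be divisible by tp_size")
    tp_groups = [list(range(start, start + tp_size))
                 for start in range(0, world_size, tp_size)]
    dp_groups = [list(range(offset, world_size, tp_size))
                 for offset in range(tp_size)]
    return tp_groups, dp_groups


tp_groups, dp_groups = build_groups(world_size=32, tp_size=8)
print("first TP group:", tp_groups[0])
print("first DP group:", dp_groups[0])
```

The design choice to notice: only the data-parallel gradient reduction crosses machines, and each DP group reduces just its own 1/8 slice of the model.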

See, this question isn't about whether you can recite—it's about whether you've done performance analysis, whether you're just bragging about configurations or actually dealing with pitfalls.

---

Speaking of Agents—the Hottest Direction in the Last Two Years, I've Added These New Questions

The position I get asked about most lately is Agent development. If you're interviewing for this, you can't escape these questions.

"What's the difference between an LLM and an Agent?"

This question looks simple, right?

But half the people I've interviewed can't hit the mark.

One candidate left a deep impression on me. He said:

"An LLM is like a top student with tons of knowledge but no hands or feet. An Agent gives that top student hands and feet so they can actually do things."

I laughed out loud. That analogy was spot on!

You see, an LLM is just a language model—it can understand, generate, and reason, but essentially it's a "brain." An Agent, on the other hand, adds perception, planning, and action capabilities on top of the LLM—it can see (multimodal), think (plan), and act (tool invocation).

The core difference boils down to two things: initiative and closed-loop capability.

An LLM answers when you ask.

An Agent can break down tasks, call tools, observe results, and adjust strategies on its own.
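That closed loop fits in a dozen lines. A minimal sketch, with the LLM replaced by a hard-coded stand-in; `run_agent`, `fake_decide`, and the `weather` tool are all hypothetical names made up for this example.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str]  # (action, argument, observation)


def run_agent(goal: str,
              decide: Callable[[str, List[Step]], Tuple[str, str]],
              tools: Dict[str, Callable[[str], str]],
              max_steps: int = 5) -> str:
    """Decide, act, observe, repeat until the policy says 'finish'."""
    history: List[Step] = []
    for _ in range(max_steps):
        action, arg = decide(goal, history)
        if action == "finish":
            return arg
        if action in tools:
            observation = tools[action](arg)
        else:
            observation = f"unknown tool: {action}"
        history.append((action, arg, observation))
    return "gave up: step budget exhausted"


def fake_decide(goal: str, history: List[Step]) -> Tuple[str, str]:
    """Stand-in for an LLM call: look up the weather once, then answer."""
    if not history:
        return "weather", "Beijing"
    return "finish", f"{goal} -> {history[-1][2]}"


tools = {"weather": lambda city: f"sunny in {city}"}
print(run_agent("Should I bring an umbrella?", fake_decide, tools))
```

A bare LLM is only the `decide` function. The loop, the tools, and the history that feeds observations back in are what make it an Agent.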

Another one: What's the difference between Function Call, MCP, and Skills?

These three concepts are often mixed up. But if you can distinguish them, the interviewer will be impressed.

Function Call is the model's capability—it lets the LLM output structured function call instructions. In short, the model itself understands "I need to call an API."

MCP (Model Context Protocol) is a protocol standard: it defines how tools should look, describe themselves, and be called. Without a standard, each company defines its own format, causing incompatibility. MCP solves that.

Skills are about orchestration—combining multiple tool calls into a reusable skill. For example, "check weather + book flight + book hotel" can be packaged as one skill.
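The three layers are easier to tell apart side by side. A minimal sketch with stubbed tools: the descriptor fields are only loosely modeled on how protocols like MCP describe a tool and are not copied from any spec, and all function names are hypothetical.

```python
import json
from typing import Callable, Dict, List

# Protocol layer: a tool describes itself with a name, a description,
# and a JSON Schema for its arguments (illustrative field names).
WEATHER_TOOL = {
    "name": "check_weather",
    "description": "Return the weather for a city.",
    "inputSchema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}


def check_weather(city: str) -> str:
    return f"sunny in {city}"


def book_flight(city: str) -> str:
    return f"flight to {city} booked"


def book_hotel(city: str) -> str:
    return f"hotel in {city} booked"


REGISTRY: Dict[str, Callable[..., str]] = {
    "check_weather": check_weather,
    "book_flight": book_flight,
    "book_hotel": book_hotel,
}


def dispatch(function_call_json: str) -> str:
    """Function-call layer: run the structured call a model would emit."""
    call = json.loads(function_call_json)
    return REGISTRY[call["name"]](**call["arguments"])


def plan_trip_skill(city: str) -> List[str]:
    """Skill layer: a fixed, reusable orchestration of several tool calls."""
    steps = ("check_weather", "book_flight", "book_hotel")
    return [dispatch(json.dumps({"name": step, "arguments": {"city": city}}))
            for step in steps]


print(plan_trip_skill("Paris"))
```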

This question tests your overall understanding of the tech stack.

"How many working modes of Agents do you know?"

ReAct mode, Plan-and-Execute mode, Reflection mode, Multi-Agent collaboration mode… you should be able to explain these.

But what I value more is—can you describe the applicable scenarios and limitations of each mode?

For example, ReAct mode is suitable for simple tool calls. But when tasks get complex, it's error-prone.

Plan-and-Execute? It's good for tasks that need decomposition. But if the plan is wrong, all subsequent execution is wasted.

Last year, we built a code generation Agent. We started with ReAct mode. Guess what? The model kept looping through "call API - see result - call API again," with terrible efficiency.

Later, we switched to plan-then-execute—pre-plan the code's module structure, then let the model implement each module one by one. The improvement was huge.
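The shape of plan-then-execute, as opposed to a step-by-step loop, can be sketched like this. The planner and executor are stand-ins for model calls, and the module names are invented; this is not the code of the Agent described above.

```python
from typing import Callable, Dict, List


def plan_and_execute(task: str,
                     planner: Callable[[str], List[str]],
                     executor: Callable[[str, str], str]) -> Dict[str, str]:
    """Make one planning call up front, then one execution call per step."""
    modules = planner(task)
    return {module: executor(task, module) for module in modules}


def fake_planner(task: str) -> List[str]:
    """Stand-in for an LLM call that returns the module structure."""
    return ["parser", "validator", "cli"]


def fake_executor(task: str, module: str) -> str:
    """Stand-in for an LLM call that implements one module."""
    return f"# {module}.py, part of: {task}\n"


result = plan_and_execute("CSV cleaning tool", fake_planner, fake_executor)
for name, code in result.items():
    print(name, "->", code.strip())
```

The number of model calls is fixed by the plan (one plus the number of modules), which is where the efficiency comes from. The cost is the one already mentioned: if `planner` gets the structure wrong, every executor call builds on that mistake.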

Last Agent question: Do you know the A2A protocol?

This one is relatively new. It's fine if you haven't seen it, but someone who's done Agent development should be able to say something.

A2A stands for Agent-to-Agent. It addresses how AI Agents collaborate with each other.

What's the difference from MCP? MCP is about Agents calling tools; A2A is about Agents asking other Agents for help.

I think this concept will become increasingly important—everyone is working on single Agents now, but truly complex tasks will definitely require multi-Agent collaboration.

---

Multimodal and Evaluation—These Underestimated "Silent Killers"

"Have you worked with vision-language models? How is training different from pure text LLMs?"

I use this question to gauge a candidate's breadth of ability.

What's the core challenge of VLM training? Two words: multimodal alignment.

You input a picture of a cat and the text "cat" into the model, and it must learn to map them into the same semantic space.
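One common way to train for a shared space is a CLIP-style contrastive objective: matching image and text vectors are pulled together, mismatched ones pushed apart. Which objective a given VLM uses varies, so treat this as an assumed example. The sketch uses plain Python lists, toy 2-D vectors, and only the image-to-text direction of the loss.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(image_vecs: List[Vector], text_vecs: List[Vector],
                     temperature: float = 0.07) -> float:
    """Mean image-to-text cross-entropy; pair i is (image i, text i)."""
    total = 0.0
    for i, image in enumerate(image_vecs):
        logits = [cosine(image, text) / temperature for text in text_vecs]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[i]
    return total / len(image_vecs)


# Toy embeddings: image 0 is the "cat" photo, text 0 is the word "cat".
images = [[1.0, 0.1], [0.1, 1.0]]
aligned_texts = [[0.9, 0.0], [0.0, 0.8]]
shuffled_texts = [aligned_texts[1], aligned_texts[0]]

print("aligned loss :", contrastive_loss(images, aligned_texts))
print("shuffled loss:", contrastive_loss(images, shuffled_texts))
```

The aligned pairing scores a much lower loss than the shuffled one, which is the signal the alignment stage trains on.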

The training strategy differs from pure text—usually three stages:

  1. Alignment pre-training (align image and text features)
  2. Multimodal instruction fine-tuning
  3. RLHF

Training resource overhead is also larger. The vision encoder and LLM might be distributed across different GPUs, so model parallelism strategies need careful design.

During interviews, I love to follow up: "The parameter sizes of the vision model and language model differ greatly. How would you design the parallelism strategy?"

This question can filter out most candidates who only memorize answers.

"Have you designed an evaluation plan?"

Many people think evaluation is unimportant.

Wrong. It's actually what I value most.

You train a model—how do you know it's good? Just looking at loss dropping isn't enough—that can be deceptive.

I usually ask candidates: Given a vertical domain model (e.g., medical consultation), how would you design an evaluation plan?

What answer am I expecting?

First, extract at least 500 questions from real business scenarios as the evaluation set. Don't just use public datasets—the distribution gap between public data and real deployment data is huge.

Second, evaluation should have multiple dimensions: correctness, relevance, completeness, safety.
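A multi-dimensional score sheet can be this simple to aggregate. A minimal sketch, assuming each answer has already been rated from 0 to 1 on the four dimensions by a reviewer or judge model; the record layout, the `summarize` helper, and the safety threshold are all hypothetical.

```python
from statistics import mean
from typing import Dict, List

DIMENSIONS = ("correctness", "relevance", "completeness", "safety")


def summarize(scored_answers: List[Dict], safety_floor: float = 1.0) -> Dict:
    """Average each dimension and list every answer below the safety floor."""
    per_dimension = {dim: mean(answer[dim] for answer in scored_answers)
                     for dim in DIMENSIONS}
    unsafe_ids = [answer["id"] for answer in scored_answers
                  if answer["safety"] < safety_floor]
    return {"per_dimension": per_dimension, "unsafe_ids": unsafe_ids}


# Two made-up records; a real evaluation set would hold hundreds.
scored = [
    {"id": "q1", "correctness": 1.0, "relevance": 1.0,
     "completeness": 0.5, "safety": 1.0},
    {"id": "q2", "correctness": 0.5, "relevance": 1.0,
     "completeness": 0.5, "safety": 0.0},
]

report = summarize(scored)
print(report["per_dimension"])
print("needs review:", report["unsafe_ids"])
```

Safety is reported as a list of offending questions, not folded into an average, because in a domain like medical consultation one unsafe answer matters more than the mean.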

Third,

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
