Your GPU Cluster Is Burning Cash: My Bloody Battle with MoE Communication on Heterogeneous Hardware
Last Tuesday, our cluster's All-to-All latency spiked to 800ms, and my boss's face went greener than an A100's cooling shroud. Here's what that actually means: on your £250,000 8-GPU rig, MoE models spend 40% of every training step just waiting for data to shuffle around. You've hired eight Michelin-starred chefs, and they're spending most of their time passing plates to each other.
The "Unified Compute" Fairy Tale You've Been Sold
Back in 2023, when I was doing distributed training at a FAANG company, the infrastructure team loved bragging about their "unified scheduling across heterogeneous hardware."
The truth? They just dumped different GPU models into the same Kubernetes cluster and prayed your model wouldn't trigger NUMA hell.
Quick primer, and yes, this matters: MoE's All-to-All communication is fundamentally about routing each token to its designated "expert" for computation. On a single node with 8 SXM A100s, this flies over NVLink and NVSwitch: up to 600GB/s per GPU, with latency low enough that you can basically ignore it.
But the moment you scale to a real-world heterogeneous setup like "4 A100s + 4 H800s + 2 Ascend 910Bs," things get properly mental.
I'll never forget a project I took on last March. The client said, "We've got 8 A100s ready to go—let's just run MoE on them." By day one, I knew something was off. All-to-All communication took three times longer than expected. I debugged until midnight before discovering their machines had PCIe-version A100s, not the SXM variant.
The critical difference? PCIe GPUs route inter-GPU communication through the CPU's PCIe switch—roughly 32GB/s bandwidth. SXM versions use NVSwitch for direct connections, hitting 600GB/s. That's nearly a 20x gap.
Worse, nvidia-smi topo -m shows "NODE" for PCIe versions instead of "NV12." Most engineers don't even know to check this—I've met people who trained for three months before realising they were on PCIe hardware. When I ran nccl-tests with alltoall_perf, I nearly spat coffee on my monitor—effective bandwidth was just 23% of theoretical peak.
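If you'd rather not eyeball the matrix, here's a minimal sketch of that check in Python. It assumes you've already captured the plain-text output of nvidia-smi topo -m (no colour codes) and that the GPU columns come first in the matrix; the function name classify_links and the sample text are my own illustrations, not part of any NVIDIA tool.

```python
def classify_links(topo_text):
    """Map each (row GPU, column GPU) pair to 'nvlink' or 'pcie'.

    Expects the plain-text matrix printed by `nvidia-smi topo -m`.
    Any cell starting with "NV" (e.g. NV12) counts as NVLink; every other
    non-self cell (NODE, SYS, PHB, PXB, PIX) is treated as a PCIe path.
    """
    lines = [ln for ln in topo_text.splitlines() if ln.strip()]
    gpu_names = [tok for tok in lines[0].split() if tok.startswith("GPU")]
    links = {}
    for line in lines[1:]:
        parts = line.split()
        if not parts[0].startswith("GPU"):
            continue  # skip NIC rows and the legend
        row, cells = parts[0], parts[1:1 + len(gpu_names)]
        for col, cell in zip(gpu_names, cells):
            if cell == "X":
                continue  # a GPU paired with itself
            links[(row, col)] = "nvlink" if cell.startswith("NV") else "pcie"
    return links


# Illustrative sample only, shaped like a two-GPU PCIe box.
SAMPLE_PCIE = (
    "      GPU0  GPU1  CPU Affinity  NUMA Affinity\n"
    "GPU0   X    NODE  0-31          0\n"
    "GPU1  NODE   X    0-31          0\n"
)

for pair, kind in classify_links(SAMPLE_PCIE).items():
    print(pair, kind)
```

Any pair that comes back as "pcie" on a box you were told is SXM is worth a closer look before you start training.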
Hang on, I need to correct myself here. The "32GB/s" I mentioned is actually PCIe 4.0 x16 unidirectional bandwidth. In real All-to-All scenarios with bidirectional traffic, you lose more. From my measurements, effective bandwidth typically lands around 26-28GB/s, depending on whether you're on Intel or AMD—AMD's PCIe switch performs slightly better, maybe 5% more. That's from testing with Xeon 8480+ and EPYC 9654 back in early 2024.
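To put those bandwidth figures side by side, here's a rough back-of-the-envelope sketch. The 2 GiB payload is a made-up number for illustration, and the formula ignores latency, contention and protocol overhead, so treat the output as a lower bound rather than a prediction.

```python
def transfer_ms(payload_bytes: float, bandwidth_gb_s: float) -> float:
    """Idealised transfer time in milliseconds (no latency, no contention)."""
    return payload_bytes / (bandwidth_gb_s * 1e9) * 1e3


PAYLOAD = 2 * 1024**3  # hypothetical: 2 GiB of routed activations per step

# Bandwidth figures are the ones quoted in the text above.
for label, gb_s in [
    ("NVSwitch peak", 600),
    ("PCIe 4.0 x16 nominal", 32),
    ("PCIe measured (mid-range)", 27),
]:
    print(f"{label:>26}: {transfer_ms(PAYLOAD, gb_s):7.1f} ms")
```

The ratio between the first and last rows is the gap your All-to-All operator has to absorb every step.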
The "Multi-Rail Communication" Scheme That Cost Me Three Sleepless Nights
Here's the problem: in a heterogeneous cluster, some nodes are NVLink-connected "rich neighbourhoods," others are PCIe "slums," and a few are "remote islands" linked by RoCEv2. Your All-to-All communication operator has to serve all three simultaneously.
I built a topology-aware communication scheduler. The core idea sounds simple enough, but the implementation—bloody hell, it was a minefield.
Layer 1: Intra-node over NVLink/PCIe
- Detect SXM GPUs → use ncclSend/ncclRecv directly over NVSwitch
- Detect PCIe variants → enable P2P access but force routing through the nearest NUMA node
Layer 2: Cross-node over RDMA
- Pre-stage expert weights into target GPU memory
- Use GDR (GPU Direct RDMA) to bypass CPU memory copies entirely
Layer 3: Cross-architecture via GDR + protocol translation
- Between Ascend and NVIDIA, bridge through CANN's dsd_copy and CUDA's cudaMemcpy
- This bit's the worst: you're manually managing memory allocations across two separate runtimes
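The three layers above boil down to a per-pair routing decision. Here's a minimal sketch of that decision logic only, not my actual scheduler: the Device fields, the pick_path function and the returned labels are all hypothetical names invented for illustration, and intra-node Ascend links aren't modelled beyond a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    vendor: str        # "nvidia" or "ascend" (hypothetical labels)
    node: int          # index of the host machine
    form_factor: str   # "sxm" or "pcie"; only meaningful for NVIDIA here


def pick_path(src: Device, dst: Device) -> str:
    """Choose a transport layer for one sender/receiver pair."""
    if src.vendor != dst.vendor:
        # Layer 3: two runtimes, so bridge the copy across them.
        return "layer3_cross_arch_bridge"
    if src.node != dst.node:
        # Layer 2: same vendor, different hosts, so go over RDMA.
        return "layer2_gpudirect_rdma"
    if src.vendor != "nvidia":
        # Assumption: leave non-NVIDIA intra-node traffic to the vendor stack.
        return "layer1_vendor_native"
    if src.form_factor == "sxm" and dst.form_factor == "sxm":
        return "layer1_nvswitch"
    # At least one side is a PCIe card: P2P pinned to the nearest NUMA node.
    return "layer1_pcie_p2p_numa"


a100_sxm = Device("nvidia", node=0, form_factor="sxm")
a100_pcie = Device("nvidia", node=0, form_factor="pcie")
ascend = Device("ascend", node=2, form_factor="pcie")

print(pick_path(a100_sxm, a100_pcie))
print(pick_path(a100_sxm, ascend))
```

Checking the vendor mismatch first is a deliberate choice in this sketch: a cross-architecture pair needs the bridge regardless of where the two devices sit.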
Speaking of which, I have to share an experience that made me question my career choices.
I once got hold of a machine labelled "A100-SXM4-80GB." I configured it for SXM, but communication performance topped out at 60% of expectations. I debugged for two days straight. At one point, I was convinced the hardware was faulty and had ops reseat the GPUs twice. Turns out, there's a BIOS setting called "SR-IOV Global Enable"—it was disabled, which reserved part of NVSwitch's bandwidth for virtualisation features.
This issue isn't documented anywhere official. I found the clue buried on page 17 of the NVIDIA Developer Forum, in a reply from 2021 with two upvotes. After changing that BIOS setting and rebooting, All-to-All latency dropped from 450ms to 180ms. One BIOS toggle.
What was it worth?
Well... it's complicated. Based on our cluster utilisation, that fix reclaimed roughly 40% of communication time per machine annually. In compute cost terms, one 8-GPU A100 node saves about £17,000-£22,000 a year. If you've got ten such machines...
Three Strategies That Actually Work
Here's what you can use right now. After a year of bashing my head against this, I've distilled three battle-tested tactics for optimising All-to-All on heterogeneous clusters. I'll be as specific as I can.
1. Dynamic Token Routing + Compute-Communication Overlap
Don't sit around waiting for all tokens to finish communicating before starting computation. Here's my approach:
while tokens_remaining > 0:
    # Take up to K tokens whose communication has already completed
    ready_tokens = get_completed_tokens(k=K)
    # Fire them off to experts immediately
    launch_expert_computation(ready_tokens)
    # The remaining transfers keep running in the background
    tokens_remaining -= len(ready_tokens)
On a Llama-MoE-16E model, this trick boosted end-to-end throughput from 127 tokens/s to 178 tokens/s. A 40% improvement.
The key implementation detail: use PyTorch CUDA streams to split communication and computation into two streams, synchronising with fine-grained events. The core logic is maybe 20 lines, but there's a gotcha: you must set NCCL_ALGO=Ring rather than letting NCCL pick Tree. Why? Ring pipelines overlap much better under heterogeneous bandwidth; Tree's connection establishment phase introduces blocking.
I confirmed this after profiling hundreds of runs, no exaggeration, with Nsight Systems (nsys) 2024.3.1. When I finally captured the timeline, Tree showed a clear global synchronisation point in the first 50ms. Ring didn't. I wouldn't have believed it myself without seeing that timeline.
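To make the overlap pattern concrete without needing a GPU, here's a CPU-only sketch. It swaps in a thread pool for the communication stream and a trivial function for the expert, so it shows the scheduling idea (start computing on whichever chunk lands first) rather than the CUDA-stream implementation itself. All function names and the delays are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def receive_chunk(chunk_id, tokens, delay_s):
    """Stand-in for one asynchronous All-to-All chunk arriving."""
    time.sleep(delay_s)
    return chunk_id, tokens


def expert_forward(tokens):
    """Stand-in for the expert computation: just doubles each value."""
    return [2 * t for t in tokens]


def overlapped_moe_step(chunks, delays):
    """Run the expert on each chunk as soon as it arrives, in arrival order."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        futures = [
            pool.submit(receive_chunk, i, chunk, delay)
            for i, (chunk, delay) in enumerate(zip(chunks, delays))
        ]
        # as_completed yields futures as they finish, so computation on early
        # chunks overlaps with the transfers still in flight.
        for future in as_completed(futures):
            chunk_id, tokens = future.result()
            results[chunk_id] = expert_forward(tokens)
    # Restore the original chunk order before returning.
    return [results[i] for i in range(len(chunks))]


print(overlapped_moe_step([[1, 2], [3], [4, 5, 6]], [0.03, 0.01, 0.02]))
```

On real hardware the equivalent pieces would be a dedicated communication stream, a compute stream, and an event per chunk; the arrival-order loop is the part that carries over.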
2. Expert Placement Strategy: Make Data Chase Compute
The traditional approach is "route tokens wherever their expert lives." In a heterogeneous cluster, that's a disaster. I tried something counterintuitive: replicate popular experts onto the most powerful nodes.
The details:
- Track each expert's activation frequency (EMA-smoothed, α=0.99)
- Keep copies of the top 3 hottest experts on every A100/H800 machine
- Banish unpopular experts to Ascend nodes (they're rarely called anyway, so the communication tax is acceptable)
The payoff? 80% of tokens complete expert computation locally, with only 20% needing cross-node communication. Overall All-to-All traffic drops by 60%.
The cost is roughly 15% more GPU memory usage. Honestly, I think that's fine—memory is the cheapest thing right now, with HBM3e stacked up to 141GB.
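The bookkeeping behind that placement rule is small. Here's a minimal sketch, assuming α is the weight kept on the old average (so 0.99 means slow-moving statistics) and that "unpopular" means the bottom n_cold experts by smoothed share. The function names, dictionary keys and the n_cold parameter are my own inventions for illustration.

```python
def update_ema(ema, token_counts, alpha=0.99):
    """Blend this step's routing share into each expert's running average."""
    total = sum(token_counts) or 1  # avoid dividing by zero on an empty step
    return [
        alpha * old + (1 - alpha) * (count / total)
        for old, count in zip(ema, token_counts)
    ]


def plan_placement(ema, top_k=3, n_cold=2):
    """Pick experts to replicate on fast nodes and experts to push to Ascend.

    Assumes top_k + n_cold <= len(ema) so the two groups never overlap.
    """
    ranked = sorted(range(len(ema)), key=lambda i: ema[i], reverse=True)
    return {
        "replicate_on_fast_nodes": sorted(ranked[:top_k]),
        "move_to_ascend": sorted(ranked[-n_cold:]) if n_cold else [],
    }


# Hypothetical per-expert token counts for a single step with 6 experts.
ema = update_ema([0.0] * 6, [50, 5, 30, 1, 10, 4])
print(plan_placement(ema))
```

In practice you'd call update_ema every step and only re-plan occasionally, since moving expert weights around has its own communication cost.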
3. Mixed-Precision Communication: Save Where You Can
MoE gate outputs don't actually need FP32 precision. I've tested dropping routing weight communication to FP16, even BF16, with negligible impact on model quality—perplexity difference under 0.3%, and yes, I ran that five times before believing it. But communication volume halves immediately.
Taking it further, for expert output gradient synchronisation, I use PowerSGD for low-rank compression before communicating. Tested on an 8-node heterogeneous cluster, communication time dropped from 38% to 17% of total, with virtually identical convergence.
One small detail: BF16 has bugs on certain Ascend 910B firmware versions. I'm on CANN 7.0.RC1—dsd_copy occasionally shows precision anomalies with BF16. From what I understand, this was fixed in 7.0.RC2, but I haven't upgraded to verify yet. To play it safe, I still use FP16 for cross-architecture communication.
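You can sanity-check the "volume halves" claim for FP16 with nothing but the standard library. This sketch packs a handful of made-up gate weights as FP32 and as FP16 and reports the payload size and worst-case rounding error. It covers FP16 only, since Python's struct module has no BF16 format, and the weights are illustrative rather than taken from a real router.

```python
import struct


def pack_fp32(values):
    """Serialise floats as little-endian 32-bit floats (4 bytes each)."""
    return struct.pack(f"<{len(values)}f", *values)


def pack_fp16(values):
    """Serialise floats as little-endian IEEE half precision (2 bytes each)."""
    return struct.pack(f"<{len(values)}e", *values)


def unpack_fp16(payload):
    """Decode a half-precision payload back into Python floats."""
    return list(struct.unpack(f"<{len(payload) // 2}e", payload))


gate_weights = [0.62, 0.25, 0.09, 0.04]  # hypothetical routing weights

wire_fp32 = pack_fp32(gate_weights)
wire_fp16 = pack_fp16(gate_weights)
restored = unpack_fp16(wire_fp16)
worst_error = max(abs(a - b) for a, b in zip(gate_weights, restored))

print(f"FP32 payload: {len(wire_fp32)} bytes")
print(f"FP16 payload: {len(wire_fp16)} bytes")
print(f"Worst absolute rounding error: {worst_error:.2e}")
```

For weights between 0 and 1 the rounding error stays well below the differences that decide routing, which fits the negligible quality impact I measured.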
The Hardware Vendors Are About to Undermine All Your Optimisations
Here's the really painful bit.
Many of the optimisations you're sweating over today will be completely irrelevant on next-gen hardware. Take NVIDIA's H100—they've extended NVSwitch beyond the node with the NVLink Switch System. In theory, cross-node All-to-All could run directly over NVLink, bypassing the whole RDMA stack.
But here's the question: can you actually afford an NVLink fabric for 8 H100s? The switching alone is a serious line item. I checked pricing last month: one Quantum-2 QM9700 switch (InfiniBand rather than NVLink, but a fair yardstick) runs about £13,000, and a complete fabric needs four of them. That's over £50,000 before you add the H100s themselves...
On the domestic chip front, Ascend's HCCS bus saw major improvements on the 910B. But the software stack—CANN's All-to-All operator implementation—is still in "technically functional" territory. I've benchmarked mindspore.ops.AlltoAll: on 4-card 910B, it reaches just 35% of theoretical peak. The hardware's not the problem.
The operator implementation is simply too conservative.
It doesn't do compute-communication overlap. I dug into MindSpore's source: the AlltoAll kernel launch is synchronous, waiting for all data transfers to complete before returning. NCCL, by contrast, has let you build an asynchronous All-to-All from grouped ncclSend/ncclRecv calls on a CUDA stream for years. I raised this with their team at the Ascend Developer Conference (October 2024). The response was "we're working on it, targeting Q1 2025."
We'll see.
Some Uncomfortable Truths to Close
Most of those "one-click MoE deployment" cloud platforms on the market—90% of them—have not properly handled communication optimisation for heterogeneous hardware. They're betting you'll never look at nvidia-smi utilisation curves. Your GPU compute utilisation hovers below 60%, with the rest of the time spent waiting on communication, and you're thinking "large model training is just this slow."
It's honestly absurd.
My advice? Next time your vendor boasts about "supporting heterogeneous cluster training," ask three questions directly:
- What's your All-to-All latency in mixed A100/H800 environments?
- For cross-architecture communication (e.g., NVIDIA to Ascend), do you use GDR or Host Memory relay?
- Can you provide real nccl-tests alltoall_perf measurements?
If they can't answer, they're basically selling you "fake heterogeneity"—dumping different machines into one cluster and pretending they collaborate efficiently.
What communication optimisation nightmares have you hit with MoE? Especially if you've worked with Ascend, Cambricon, Biren, or other domestic chips—drop your real-world benchmarks in the comments. I'm compiling a white paper on distributed training with domestic chips, and anyone who contributes genuine case studies gets the final version.
Also, if your A100 cluster shows All-to-All latencies above 200ms, DM me your topology. I'll help you check if you've fallen for the "fake SXM" trap too. In the last six months, I've reviewed about 40 topologies—12 had BIOS configuration issues, 8 were PCIe versions masquerading as SXM, and one—the most ridiculous—had a loose NVSwitch cable.
Key Takeaways (TL;DR)
- Check your GPU topology: run nvidia-smi topo -m. If you see "NODE" instead of "NV12," you're on PCIe, not SXM
- Use Ring, not Tree: NCCL_ALGO=Ring for heterogeneous clusters (avoids the global sync point)
- Replicate hot experts: Put copies on your fastest nodes; 15% memory cost for 60% less communication
- Mixed precision FTW: Gate outputs don't need FP32; drop to FP16/BF16 for half the communication cost
- Vet your vendors: Demand nccl-tests alltoall_perf results, not marketing slides
Related reads:
- "Stop Worshipping NCCL: Your Communication Bottleneck Might Be PCIe Topology"
- "Ascend 910B Distributed Training: What CANN's Documentation Won't Tell You"
- "Implementing an All-to-All Operator from Scratch: 300 Lines of CUDA Explained"
#MoE #DistributedTraining #AlltoAll #HeterogeneousComputing #GPUCluster #NCCLTuning #AscendDev
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.