The Biggest Pitfall in Resuming Training Isn't the Model Weights, It's the Data Loading State

Generated: 2026-06-22 01:24:56

---

Hey! When it comes to resuming interrupted multimodal training, I've got a ton to say!

You know, I've been doing multimodal training for almost three years. At first, I naively thought resuming training just meant saving the model and loading it back. Ha, I was way too green!

Then last year I watched training crash twice, and after recovery the metrics completely tanked. I was dumbfounded.

Let me tell you the most painful lesson—

Last year, we were training an image-text understanding model: eight A100 80GB cards, and a dataset of about 20T of web data. It ran for two weeks. We were about to see results when, mid-step, boom: node failure. I confidently loaded the checkpoint. Global step, optimizer state, everything accounted for. Cool, right?

Guess what happened? After resuming, the loss curve fell off a cliff. It took three more days to slowly crawl back to its original trajectory.

After digging, we finally found the culprit: the dataloader's shuffle buffer had not been restored, so the shuffled sample order was completely messed up! Some samples were trained multiple times, and some were skipped entirely.

Can you believe it? The model was restored perfectly, but the data was still spinning its wheels!
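To make that failure mode concrete, here's a toy sketch in plain Python (not our actual pipeline). The names `CountingSource`, `shuffle_buffer`, and `naive_resume` are made up for illustration, and I'm assuming the resume only knows how many samples were trained, so it restarts reading at that offset with an empty buffer.

```python
import random
from collections import Counter


class CountingSource:
    """Iterator over sample ids [start, stop) that remembers its position."""

    def __init__(self, start, stop):
        self.pos, self.stop = start, stop

    def __iter__(self):
        return self

    def __next__(self):
        if self.pos >= self.stop:
            raise StopIteration
        self.pos += 1
        return self.pos - 1


def shuffle_buffer(source, buffer_size, rng):
    """Yield items from `source` in pseudo-random order via a bounded buffer."""
    buf = []
    for item in source:
        buf.append(item)
        if len(buf) >= buffer_size:
            # Move a random element to the end and emit it.
            idx = rng.randrange(len(buf))
            buf[idx], buf[-1] = buf[-1], buf[idx]
            yield buf.pop()
    # Source exhausted: drain whatever is left.
    rng.shuffle(buf)
    yield from buf


def naive_resume(num_samples=100, buffer_size=8, crash_after=20, seed=0):
    """Return (repeated, skipped) sample ids for a resume that only knows the step count."""
    first = shuffle_buffer(CountingSource(0, num_samples), buffer_size, random.Random(seed))
    trained = [next(first) for _ in range(crash_after)]
    # "Resume": restart at offset == number of trained samples, buffer contents lost.
    second = shuffle_buffer(CountingSource(crash_after, num_samples), buffer_size, random.Random(seed))
    trained.extend(second)
    counts = Counter(trained)
    repeated = sorted(s for s, c in counts.items() if c > 1)
    skipped = sorted(s for s in range(num_samples) if s not in counts)
    return repeated, skipped


repeated, skipped = naive_resume()
print("trained twice:", repeated)
print("never trained:", skipped)
```

In this toy setup, every early sample still sitting in the buffer at crash time is skipped, and every read-ahead sample that was already trained gets trained again, so the two lists always have the same length.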

By now you get it: the real pitfall of resuming training isn't the model weights at all. It's that data loading state you thought was "unimportant."

I ended up trudging down both paths. No silver bullet, but at least I can help you pick the right direction.

---

Path One: DIY Dataloader Checkpoint

I call this the "band-aid" approach.

Back then we used DeepSpeed + Megatron-LM with a custom streaming Dataset. To support resumption, I gritted my teeth and added extra state tracking per rank: current file offset, sample index per worker, shuffle buffer contents. On every save_checkpoint call, I serialized them into a separate pickle file. On resume, I had to reassign those states based on the global step, figuring out which samples had already been fed into the model and which were still in the buffer.
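The idea can be sketched in a few lines of plain Python. This is a simplified single-process stand-in, not the patch we actually shipped: `ResumableShuffleStream` is a hypothetical name, samples are just integer ids, and there are no worker processes or prefetch queues.

```python
import pickle
import random


class ResumableShuffleStream:
    """Shuffle-buffer iterator over sample ids whose full state can be saved."""

    def __init__(self, num_samples, buffer_size, seed):
        self.num_samples = num_samples
        self.buffer_size = buffer_size
        self.offset = 0      # next sample id to read from the source
        self.buffer = []     # samples read but not yet yielded
        self.rng = random.Random(seed)

    def __iter__(self):
        return self

    def __next__(self):
        # Refill the buffer from the source before drawing from it.
        while len(self.buffer) < self.buffer_size and self.offset < self.num_samples:
            self.buffer.append(self.offset)
            self.offset += 1
        if not self.buffer:
            raise StopIteration
        idx = self.rng.randrange(len(self.buffer))
        self.buffer[idx], self.buffer[-1] = self.buffer[-1], self.buffer[idx]
        return self.buffer.pop()

    def state_dict(self):
        # Everything that decides future sample order: offset, buffer, RNG state.
        return {"offset": self.offset, "buffer": list(self.buffer), "rng": self.rng.getstate()}

    def load_state_dict(self, state):
        self.offset = state["offset"]
        self.buffer = list(state["buffer"])
        self.rng.setstate(state["rng"])


# Reference: one uninterrupted pass.
reference = list(ResumableShuffleStream(50, 8, seed=123))

# Interrupted pass: train on 17 samples, serialize the state, then "crash".
stream = ResumableShuffleStream(50, 8, seed=123)
head = [next(stream) for _ in range(17)]
blob = pickle.dumps(stream.state_dict())

# Resume in a fresh object; its own seed is overwritten by the restored RNG state.
resumed = ResumableShuffleStream(50, 8, seed=999)
resumed.load_state_dict(pickle.loads(blob))
tail = list(resumed)
print("matches continuous run:", head + tail == reference)
```

The single-process case is the easy part. With real worker processes you also have to account for samples that were already prefetched but not yet consumed by the training loop, which is where most of my pain came from.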

Test environment: 8x A100, PyTorch 1.13, DeepSpeed 0.8.3. Dataset was image-text pairs, packaged as WebDataset tar files, 512 samples per tar.

The patch took two days to write... then endless testing.

The most annoying part? Inconsistent worker counts! The previous run used 8 workers, but the resumed run used 4. How do you redistribute the shuffle buffer? I ended up adding a constraint: the dataloader configuration for save and resume must be identical.

Ironically, the whole point of resuming is to tolerate configuration changes, and here I was contradicting myself.
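If you do live with the "identical configuration" constraint, it's safer to enforce it in code than to remember it. Here's a minimal sketch, assuming the loader settings fit in a JSON-serializable dict. The function names and config keys are hypothetical, not part of DeepSpeed or Megatron-LM.

```python
import hashlib
import json


def loader_fingerprint(config):
    """Stable hash of the settings that determine sample order."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def snapshot_loader_config(config):
    """What to store next to the dataloader state at save time."""
    return {"config": dict(config), "fingerprint": loader_fingerprint(config)}


def check_resume_config(saved, current_config):
    """Raise if the loader settings changed between save and resume."""
    if saved["fingerprint"] == loader_fingerprint(current_config):
        return
    old = saved["config"]
    changed = sorted(
        key for key in set(old) | set(current_config)
        if old.get(key) != current_config.get(key)
    )
    raise ValueError(f"dataloader config changed since checkpoint: {changed}")


# Illustrative settings only.
saved = snapshot_loader_config({"num_workers": 8, "shuffle_buffer": 1000, "seed": 42})
check_resume_config(saved, {"num_workers": 8, "shuffle_buffer": 1000, "seed": 42})  # passes silently
```

Failing loudly at resume time beats silently training on a scrambled sample order.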

On resume speed: loading model weights and optimizer took about ten seconds, but restoring the dataloader state added another 30 seconds for rebuilding worker processes and reinitializing shuffle buffers. Compared with losing a few hours of progress, 30 seconds felt worth it.

But correctness was the real pain. I verified with random seeds three times; in two of the three runs, the sample order after resume didn't match a continuous run. Some edge cases, like workers dying early or random number generator state being duplicated when worker subprocesses fork, couldn't be handled properly.

Conclusion? It can solve the problem, but it's full of pitfalls, and maintenance costs keep popping up as the framework evolves.

---

Path Two: Go with a Professional Framework

I chose Energon (from the Megatron ecosystem). The core idea is standardization: convert the data from a business-specific Dataset to WebDataset format, and hand data loading, shard allocation, shuffling, and packing over to the framework. It provides a savable loader (SavableDataLoader, created with get_savable_loader) that can fully save and restore the loader state.

Migration cost? Not low.

I repackaged 10T of raw data into WebDataset format, each data entry as key-value pairs (image.png for binary, text.json for text). The conversion script took half a day, but ran only once. The real pain was adapting the task encoder—our original pipeline had complex image augmentations and text tokenization logic. I had to write that into Energon's task_encoder, ensuring random seeds were independent per worker and restorable.
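One way to get augmentation randomness that is independent per worker and restorable is to derive the RNG from the sample itself rather than from worker state. To be clear, this is a swapped-in technique for illustration, not a description of how Energon does it internally. The function names and the `crop=224` default are my own assumptions.

```python
import hashlib
import random


def sample_rng(base_seed, epoch, sample_key):
    """Build an RNG that depends only on (seed, epoch, sample), not on the worker."""
    digest = hashlib.sha256(f"{base_seed}:{epoch}:{sample_key}".encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def augment_params(base_seed, epoch, sample_key, width, height, crop=224):
    """Pick crop offsets and a flip flag for one image (parameters only, no pixels)."""
    rng = sample_rng(base_seed, epoch, sample_key)
    left = rng.randint(0, max(0, width - crop))
    top = rng.randint(0, max(0, height - crop))
    flip = rng.random() < 0.5
    return {"left": left, "top": top, "flip": flip}


# The same sample gets the same augmentation no matter which worker handles it
# or whether the run was restarted in between.
before_crash = augment_params(42, epoch=3, sample_key="shard-0007/000123", width=640, height=480)
after_resume = augment_params(42, epoch=3, sample_key="shard-0007/000123", width=640, height=480)
print(before_crash == after_resume)
```

With this design there is no augmentation RNG state to checkpoint at all: the sample key and epoch are enough to replay the exact crop and flip.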

It took me a week to refactor that logic. On the first run, memory was 5% higher than before—because Energon's packing mechanism needed extra caching of metadata. But throughput actually improved by 8%! Turns out WebDataset's IO optimization (sequential reading of large tar files) reduced IO wait.

Resume test? Stopped training, restarted, loaded Energon's saved state—the sample order was perfectly identical, even the random results of data augmentation (random cropping, flipping) aligned! After restoring the global step, the loss curve had no jump. The difference compared to a continuously trained version was down to the 6th decimal place.

I measured with one metric: the deviation of the loss after resume from the loss of a continuous training run. The patch approach averaged 0.03 deviation; Energon was below 0.001.

That's how big the gap is.
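For reference, the comparison can be computed along these lines. I'm assuming "averaged deviation" means the mean absolute difference over the same steps of both runs, and the loss values below are made-up placeholders, not our measurements.

```python
def mean_abs_deviation(resumed_losses, continuous_losses):
    """Mean absolute difference between two loss curves over the same steps."""
    if len(resumed_losses) != len(continuous_losses):
        raise ValueError("loss curves must cover the same steps")
    if not resumed_losses:
        raise ValueError("loss curves are empty")
    total = sum(abs(a - b) for a, b in zip(resumed_losses, continuous_losses))
    return total / len(resumed_losses)


# Placeholder values for illustration only.
continuous = [2.31, 2.30, 2.28, 2.27]
resumed = [2.35, 2.33, 2.30, 2.28]
print(f"mean |delta loss| = {mean_abs_deviation(resumed, continuous):.4f}")
```

Both runs need to log loss at the same global steps for the comparison to mean anything.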

---

Don't Just Focus on Resume Speed

| Dimension | Patch Approach (DIY dataloader state) | Energon Route |
| --- | --- | --- |
| Development time | 2-3 days (firefighting) | 1 week+ (data pipeline refactor) |
| Resume correctness | Prone to random seed & worker split inconsistencies | Native support, deterministic resume |
| Maintenance cost | Every Dataset change or framework upgrade needs rework | Unified data format & resume logic; later data source changes only require task encoder changes |
| Throughput impact | No extra overhead | Slightly lower initially, but in the long run IO optimization improves it |
| Extra resume time | 30 seconds - 1 minute | ~20 seconds (loading loader state faster than rebuilding workers) |

Honestly, if your boss is breathing down your neck about "training failures wasting hundreds of thousands each time," Path One can hold the line. A friend of mine (true story) worked in finance training multimodal models. After a few crashes, they baked a dataloader checkpoint module directly into the Dataset. It ran for six months without issues. Their dataset was stable, with only two modalities (text + table images), and the worker count never changed. Path One worked fine.

But if your team's data scale is still growing, or you're already considering supporting video, audio, 3D, and other modalities, or you need to interface with different training frameworks, Path Two is absolutely worth the investment. WebDataset+Energon encapsulates the complexity so you don't have to repeatedly step on the "shuffle restore" landmine in every business Dataset.

---

What You Really Need to Avoid is the "In-between" State

Think about it: model weights and optimizer fully restored, but the dataloader has no state. Isn't that like bookmarking the exact line you were reading, putting the book back on the shelf, and coming back to find its pages reshuffled?

In large-scale multimodal training, the dataloader itself is a complex state machine. It decides which samples have been trained, which are still in worker buffers, and what random processing has already happened. Saving only model, optimizer, and global step is nowhere near enough.
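One way to stay out of the in-between state is to write the loader state into the same checkpoint as everything else, atomically, and to refuse to load a checkpoint that lacks it. Here's a minimal single-file sketch using pickle. The function names and payload keys are hypothetical, and a real multi-rank job would store one loader state per rank.

```python
import os
import pickle
import tempfile

REQUIRED_KEYS = {"model", "optimizer", "global_step", "loader"}


def save_checkpoint(path, model_state, optimizer_state, global_step, loader_state):
    """Write all training state in one file so the pieces cannot disagree."""
    payload = {
        "model": model_state,
        "optimizer": optimizer_state,
        "global_step": global_step,
        "loader": loader_state,
    }
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())
        # Rename within the same directory: readers see the old or the new file, never half of one.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Load a checkpoint and fail loudly if any part of the state is missing."""
    with open(path, "rb") as f:
        payload = pickle.load(f)
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        raise KeyError(f"checkpoint is missing state: {sorted(missing)}")
    return payload
```

The point of the `REQUIRED_KEYS` check is that a checkpoint without dataloader state is treated as broken, not as "good enough".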

This is frustrating. Many engineering tutorials simplify resuming training to save_checkpoint/load_checkpoint, as if just saving the model is enough. Wait until you run a thousand-card training job that crashes and requires restoring eight hours of progress. Then you'll understand how much compute cost those "lost sample orders" really amount to!

---

Two Pieces of Advice

If your training task loses 1-2 hours after a crash because the dataloader has to rescan data and re-shuffle, prioritize dataloader checkpointing. Whatever the method, reduce resume time first. This is a lifesaver.

If your team is refactoring the multimodal data infrastructure, or planning to add new data sources (like video streams or audio streams) within the next six months, go straight to WebDataset+Energon. Once the data format and resume capability are standardized, future maintenance costs drop sharply. We later added an audio modality: we just wrote an AudioTaskEncoder. The data format didn't change, and the resume logic didn't change.

Use cases in short: the patch approach fits an emergency fix for an existing project that is unwilling to change its data format; the Energon route fits rebuilding a multimodal data platform with plans to add more data sources and modalities.


Cael Lee
