I Spent Two Weeks Running Every Mainstream Medical Imaging Model, and Not One Was Hassle-Free

Generated: 2026-06-21 07:55:39

---

Recently, I did something that really got me hooked—I spent two weeks running through every mainstream deep learning model that can be used for medical image analysis. Guess what happened?

The methods listed in those review articles, SAE, RBM, CNN, U-Net, FCNN, FRCNN… they looked comprehensive, but the moment I actually wrote the first line of code, I realized: not a single one of them was easy to work with! It’s like walking into a hardware store stocked with drills, chainsaws, and laser cutters, thinking you can just take them home and build a villa, only to find out they’re all knock-off brands—each one more frustrating to use than the last.

Come on, let me show you the pitfalls I fell into, the tears I shed, and the flashes of insight I had.

---

First, the models: every one has a “what you expect” and a “what you actually get”

CNNs are the golden child of deep learning. AlexNet and VGG16 crushed it on ImageNet, and I figured bringing them to medical images would feel like cheating. It backfired: they adapted badly. Think about it: most medical images are grayscale, with only one channel, and lesions often occupy just a few pixels. Dragging ImageNet-style convolution kernels over them is like trying to catch mosquitoes with a fishing net. I fine-tuned VGG16 for lung nodule detection, and the results were worse than a simple sliding window with handcrafted features. Later I switched to Inception v3 (a later generation of GoogLeNet), and it got a little better. What does that tell you? Medical images rely heavily on local features and multi-scale fusion, so you can’t just force-fit the natural-image approach.
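
The single-channel mismatch is easy to see in code. Below is a minimal sketch, not the preprocessing used in these experiments, of one common workaround: clip a CT slice to an intensity window, scale it to [0, 1], and repeat it across three channels so it fits an RGB input layer. The function names and the lung window (center -600 HU, width 1500 HU) are assumptions chosen for illustration.

```python
import numpy as np

def window_ct(hu, center=-600.0, width=1500.0):
    """Clip a CT slice in Hounsfield units to a window and scale it to [0, 1].

    The default center/width are an assumed lung window, not values from the post.
    """
    lo, hi = center - width / 2.0, center + width / 2.0
    return (np.clip(hu, lo, hi) - lo) / (hi - lo)

def to_three_channels(slice_2d):
    """Repeat a single-channel slice three times to match an RGB input layer."""
    return np.repeat(slice_2d[np.newaxis, ...], 3, axis=0)  # shape (3, H, W)

# Synthetic slice: air everywhere, one lung-tissue pixel, one bone-like pixel.
hu_slice = np.full((4, 4), -1000.0)
hu_slice[1, 1] = -600.0
hu_slice[2, 2] = 700.0

x = to_three_channels(window_ct(hu_slice))
print(x.shape)  # (3, 4, 4)
```

Copying the channel makes the tensor shapes line up, but it adds no information, which is part of why the pretrained filters transfer so poorly.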

U-Net is the absolute workhorse for segmentation, no question! I used it for liver segmentation, and the Dice coefficient shot straight to 0.9. But hold on: that was 2D U-Net on single slices. With 3D CT data, the problem hits instantly: VRAM explodes! Even my Titan X nearly choked on a 256×256×200 volume. And while 3D U-Net can capture spatial context, the performance gain isn’t proportional to the computational cost. I agonized over this myself. My practical advice: if, like me, you have a single machine with a single GPU, stick with 2D U-Net on axial slices and use post-processing to enforce 3D connectivity; only consider 3D if you have a cluster or can stand a week of training.
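
A back-of-the-envelope calculation shows why the 3D case hurts. This sketch assumes float32 activations, batch size 1, and 64 feature channels at full resolution (the channel count is an assumption borrowed from the first stage of the original 2D U-Net, not a number measured here). It counts a single feature map only; training also has to hold gradients and every other layer.

```python
def activation_gib(depth, height, width, channels, bytes_per_value=4):
    """Memory for one feature-map tensor in GiB, assuming batch size 1."""
    return depth * height * width * channels * bytes_per_value / 2**30

# Assumed: 64 channels at full resolution, float32 values.
full_volume = activation_gib(200, 256, 256, channels=64)
single_slice = activation_gib(1, 256, 256, channels=64)

print(f"3D volume, one feature map: {full_volume:.3f} GiB")        # 3.125 GiB
print(f"2D slice, one feature map: {single_slice * 1024:.1f} MiB")  # 16.0 MiB
```

One full-resolution feature map of the whole volume already takes about a quarter of a 12 GB card, which is why 3D pipelines usually fall back to patches or downsampling.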

I tried UNet++ later. Zhou’s MICCAI paper introduced nested dense skip connections, and it does outperform the original U-Net on blurry boundaries. For example, in pancreas segmentation, Dice improved by about 2-3 points. But the cost? Double the parameters and 50% longer training. Is it worth it? That depends on your accuracy requirements. Unless you need every last bit of precision, I think it’s a poor trade.
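
For reference, the Dice coefficient quoted throughout is 2|A∩B| / (|A| + |B|) for a predicted mask A and a ground-truth mask B. Here is a minimal NumPy sketch; the `eps` term is an assumption added so that two empty masks score 1.0 instead of dividing by zero, and reading one "point" as 0.01 Dice is also an interpretation rather than something the text states.

```python
import numpy as np

def dice(pred, target, eps=1e-7):
    """Dice overlap between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# Two 4-pixel masks that share 3 pixels: Dice = 2*3 / (4+4) = 0.75
pred = np.zeros((4, 4), dtype=bool)
target = np.zeros((4, 4), dtype=bool)
pred[0, 0:4] = True
target[0, 1:4] = True
target[1, 0] = True

score = dice(pred, target)
print(round(score, 4))  # 0.75
```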

Recurrent neural networks (RNNs): I was once taken in by the concept. I used an LSTM to treat consecutive CT slices as a time series, hoping to model continuous organ deformation. The result was underwhelming: basically the same as processing each slice with U-Net and then applying temporal smoothing across slices, but several times harder to train. I can only smile and reserve judgment on RNNs’ effectiveness in medical imaging.
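
The cheap alternative mentioned above, per-slice predictions followed by smoothing along the slice axis, fits in a few lines. This is a sketch under assumptions: the function name, the (0.25, 0.5, 0.25) kernel, and the edge padding are illustrative choices, not the settings used in the comparison.

```python
import numpy as np

def smooth_along_slices(prob, kernel=(0.25, 0.5, 0.25)):
    """Weighted moving average of per-slice probability maps along axis 0.

    prob has shape (n_slices, H, W); the kernel is an assumed example.
    """
    k = np.asarray(kernel, dtype=float)
    pad = len(k) // 2
    # Repeat the first and last slices so the output keeps the same depth.
    padded = np.pad(prob, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    out = np.zeros_like(prob, dtype=float)
    for i, w in enumerate(k):
        out += w * padded[i:i + prob.shape[0]]
    return out

# A false positive that appears in only one slice gets damped.
prob = np.zeros((5, 1, 1))
prob[2, 0, 0] = 1.0
smoothed = smooth_along_slices(prob)
print(smoothed[:, 0, 0])  # [0.   0.25 0.5  0.25 0.  ]
```

With a strict threshold of 0.5, the isolated single-slice detection no longer survives, while a structure present in several neighboring slices would.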

Faster R-CNN (FRCNN) works okay for object detection, such as locating lung nodules. But I stepped into a huge trap: the proposal boxes generated by the RPN on CT images often flagged high-density blood vessels as nodules. Why? Because the network was originally designed for natural images, where shape constraints are loose. I later added shape priors and reworked the feature pyramid along the lines of FPN, and that helped.
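
To give a feel for what a shape prior can look like, here is one hypothetical example, and explicitly not the prior used in this project: measure how elongated a candidate blob is from the principal-axis variances of its voxel coordinates. Nodules tend to be roughly round and vessels tube-like, so a high ratio is a hint to reject the candidate. The sketch assumes isotropic voxels; real CT spacing would need to be applied to the coordinates first.

```python
import numpy as np

def elongation(mask):
    """Ratio of largest to smallest principal-axis variance of a binary blob.

    Hypothetical shape feature: near 1 for compact blobs, large for tubes.
    """
    coords = np.argwhere(mask).astype(float)  # (n_voxels, 3) voxel indices
    cov = np.cov(coords, rowvar=False)        # 3x3 covariance of coordinates
    eigvals = np.linalg.eigvalsh(cov)         # eigenvalues in ascending order
    return eigvals[-1] / max(eigvals[0], 1e-6)

# Compact candidate: a 3x3x3 cube. Vessel-like candidate: a 3x3x15 tube.
cube = np.zeros((20, 20, 20), dtype=bool)
cube[5:8, 5:8, 5:8] = True
tube = np.zeros((20, 20, 20), dtype=bool)
tube[5:8, 5:8, 2:17] = True

print(elongation(cube) < 1.5, elongation(tube) > 10)  # True True
```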

---

Three big pitfalls you will definitely encounter

Pitfall one: too little labeled data, and transfer learning isn’t a cure-all.

Some studies used over 40,000 CTs to train a lung cancer diagnosis model on par with radiologists. Back then, I was stubborn and used only 2,000 chest X-rays from a single center for classification. The moment I tested it on another hospital’s data, accuracy dropped from 0.93 to 0.65! The two hospitals used different X‑ray machine brands—contrast and noise distribution were completely different—and the model was clueless.

How did I fix it? Data augmentation solved half the problem; the other half came from initializing with ImageNet pretrained weights. But frankly, the gap between ImageNet’s colorful natural images and grayscale medical images is huge. In the literature, Esteva and Gulshan succeeded with skin and eye conditions because dermatoscopy and retinal images share textural structure with natural images. For CT, MRI, and other grayscale volumetric data, pretrained models are far from the miracle they’re hyped to be.
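
Since the cross-hospital gap came from contrast and noise differences, intensity augmentation is the natural kind to reach for. The sketch below is an assumed recipe (random gamma plus Gaussian noise on an image scaled to [0, 1]); the ranges are placeholders, not the values used for the chest X-ray model.

```python
import numpy as np

def random_intensity_augment(img, rng, gamma_range=(0.7, 1.5), noise_std=0.02):
    """Random gamma (contrast) change plus Gaussian noise on a [0, 1] image.

    gamma_range and noise_std are assumed placeholder values.
    """
    gamma = rng.uniform(*gamma_range)
    out = np.power(img, gamma)                               # contrast shift
    out = out + rng.normal(0.0, noise_std, size=img.shape)   # detector-like noise
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
xray = rng.random((64, 64))  # stand-in for a normalized chest X-ray
augmented = random_intensity_augment(xray, rng)
print(augmented.shape, augmented.min() >= 0.0, augmented.max() <= 1.0)
```

Sampling a fresh gamma and noise pattern for every batch exposes the model to a spread of contrast profiles instead of one scanner's look.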

Pitfall two: black‑box models—you never know when they’ll freak out.

The first time I handed a trained nodule segmentation model to a doctor for evaluation, it behaved normally on 20 consecutive cases. On case 21, a CT scan with metal implant artifacts, the model segmented the entire artifact region as a nodule, far bigger than any real nodule! The doctor laughed and said, “Your algorithm does worse than I do with a two-minute look.”

At that moment I had no idea why it failed. Later, using Grad‑CAM visualization, I realized the model had learned to detect high-contrast boundaries in the image, not the morphological features of nodules. In short, interpretability in deep learning for medical imaging is still a hard problem. You can’t tell a doctor “My model learned a hidden feature you don’t know”—they only trust pathological evidence.
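
The core of Grad-CAM is small enough to show directly. Given the activations of one convolutional layer and the gradients of the target class score with respect to them, each channel is weighted by its spatially averaged gradient, the weighted maps are summed, and a ReLU keeps only positive evidence. In this sketch the arrays are hand-made stand-ins; in practice they would come from a framework's hooks, which are not shown here.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from one conv layer.

    activations, gradients: arrays of shape (K, H, W) for K channels.
    """
    weights = gradients.mean(axis=(1, 2))             # one weight per channel
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum -> (H, W)
    cam = np.maximum(cam, 0.0)                        # ReLU: positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize to [0, 1] for display
    return cam

# Toy example: channel 0 supports the class, channel 1 argues against it.
activations = np.array([[[1.0, 0.0], [0.0, 0.0]],
                        [[0.0, 0.0], [0.0, 1.0]]])
gradients = np.array([np.ones((2, 2)), -np.ones((2, 2))])

cam = grad_cam(activations, gradients)
print(cam)
```

Overlaying a heatmap like this on the CT slice is what makes it visible whether the model is attending to a nodule or just to a bright edge.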

Pitfall three: severe class imbalance—binary classifiers just aren’t enough.

A plain classifier is too simplistic for complex scenarios. For lesion grading (normal, benign, suspicious, malignant), training a multi-class softmax model directly on imbalanced data makes it predict the largest class almost every time. I later tried OHEM (Online Hard Example Mining) and Focal Loss, which raised minority-class recall from 30% to 70%. But the cost? Tuning hyperparameters took me two weeks!
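
Focal Loss scales the cross-entropy of each sample by (1 - p_t)^γ, where p_t is the predicted probability of the true class, so well-classified majority samples contribute little and hard samples dominate. Below is a minimal NumPy sketch of the multi-class form; γ = 2 and the optional per-class `alpha` weights are illustrative defaults, not the tuned values that took two weeks to find.

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0, alpha=None):
    """Mean focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).

    probs: (N, C) softmax outputs; labels: (N,) integer class ids.
    gamma=0 and alpha=None reduce this to ordinary cross-entropy.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    p_t = probs[np.arange(len(labels)), labels]   # probability of the true class
    weight = (1.0 - p_t) ** gamma                 # down-weight easy samples
    if alpha is not None:
        weight = weight * np.asarray(alpha, dtype=float)[labels]
    return float(np.mean(-weight * np.log(np.clip(p_t, 1e-12, 1.0))))

# One easy sample (p_t = 0.9) and one hard sample (p_t = 0.1), four classes.
probs = np.array([[0.9, 0.05, 0.03, 0.02],
                  [0.6, 0.2, 0.1, 0.1]])
labels = np.array([0, 3])

ce = focal_loss(probs, labels, gamma=0.0)
fl = focal_loss(probs, labels, gamma=2.0)
print(ce > fl)  # True
```

With γ = 2 the easy sample's term is multiplied by 0.01 and the hard one by 0.81, so nearly all of the remaining loss comes from the hard sample.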

---

Rays of hope: directions that caught my eye

VoxelMorph: a registration surprise. It uses deep learning for 3D image registration, and it’s 1,000 times faster than traditional methods! Classical registration takes ten minutes for a pair of brain MRIs; VoxelMorph does it in seconds. Its accuracy still falls short of the classic SyN method, but the speed advantage is massive, especially for large-scale cohort analysis. It has been covered in the literature, and it’s really worth attention.

Self‑supervised learning—my favorite direction in the last couple years. Zhou’s Models Genesis paper pre‑trains a 3D model using image inpainting, reconstruction, and other proxy tasks, then fine‑tunes on downstream tasks. I tried it on lung nodule classification and brain segmentation, and with only 10% of the labeled data, it matched the accuracy of training from scratch on 100% data! This approach is critical for medical fields where labeled data is scarce.
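
The appeal of proxy tasks is that the training pairs cost nothing to label. Here is a simplified sketch of an inpainting-style pair: corrupt one random block of a volume and ask the network to restore the original. It illustrates the idea only and is not the Models Genesis pipeline itself; the function name and block size are assumptions.

```python
import numpy as np

def make_inpainting_pair(volume, rng, block=(8, 8, 8)):
    """Return (corrupted, target): one random block is replaced with noise.

    The target is the untouched volume, so no manual labels are needed.
    """
    corrupted = volume.copy()
    # Pick a start index per axis so the block fits inside the volume.
    starts = [rng.integers(0, s - b + 1) for s, b in zip(volume.shape, block)]
    region = tuple(slice(st, st + b) for st, b in zip(starts, block))
    corrupted[region] = rng.random(block)
    return corrupted, volume

rng = np.random.default_rng(42)
volume = rng.random((32, 32, 32))  # stand-in for a normalized CT patch
corrupted, target = make_inpainting_pair(volume, rng)
changed = int((corrupted != target).sum())
print(corrupted.shape, changed <= 8 * 8 * 8)
```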

Getting high-quality datasets: this matters more than any model. CheXpert and a Stanford PhD thesis both emphasized the same point: instead of spending huge effort tuning models, first improve dataset quality. I later found a small rare-disease dataset with just a few hundred images, each labeled by three radiologists. The results were far better than with 5,000 images carrying coarse single-expert labels. Data isn’t about quantity, it’s about quality.
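
With three readers per image, the simplest way to use the extra labels is a majority vote plus a flag for cases where the readers disagree. This is a sketch of that idea under assumptions, since the text does not say how the three annotations were combined; ties (possible with more than two classes) resolve to the lowest class id here and are best sent back for review.

```python
import numpy as np

def majority_vote(labels):
    """Consensus label per image from several readers.

    labels: (n_readers, n_images) integer array of class ids.
    Returns (consensus, unanimous), both of length n_images.
    """
    labels = np.asarray(labels)
    n_classes = int(labels.max()) + 1
    # counts[c, i] = number of readers who gave image i the class c
    counts = np.stack([(labels == c).sum(axis=0) for c in range(n_classes)])
    consensus = counts.argmax(axis=0)
    unanimous = counts.max(axis=0) == labels.shape[0]
    return consensus, unanimous

# Three readers, three images (0 = normal, 1 = abnormal).
reads = np.array([[0, 1, 1],
                  [0, 1, 0],
                  [0, 0, 1]])
consensus, unanimous = majority_vote(reads)
print(consensus, unanimous)  # [0 1 1] [ True False False]
```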

---

One last honest word

Deep learning can actually do some work in medical image analysis, but it’s far from replacing doctors. My current playbook: use U‑Net for segmentation, Inception for classification, and VoxelMorph for registration—these three tools handle most problems. Don’t get led astray by fancy model architectures; what really brings results are the dirty, tough tasks like data cleaning, labeling quality, and post‑processing. Don’t look down on traditional methods—sometimes morphological filtering plus a decision tree can solve a problem just as well, without crashing.

When I take on a new task now, the first thing I do: start with the simplest handcrafted features plus a Random Forest, build a baseline; only then do I bring in deep learning. That way, at least

Cael Lee
