Applications of Deep Learning in Medical Image Processing
Generated: 2026-06-21 14:53:45
---
Five Years of Blood and Tears: 90% of People Get the First Step Wrong!
Last month, a grad student messaged me in a panic: “Senior, we just got a new microscope imager and want to do cell segmentation. Where do we start?”
I stared at the screen and laughed. I've heard this question at least fifty times! And every time, I have to pour cold water on it first—
Don’t rush into running a model! Get your validation set straight first!
Frustrating, right? So many people dive straight into the architecture diagrams of U‑Net or Mask R‑CNN, as if picking the right model will magically clean their data. Naive! Way too naive!
I’ve been there.
Five years ago, when I first started working on lung cancer pathology image classification, I used that classic Kaggle dataset—25,000 lung tissue histopathology images. I was so excited. It’s a public dataset, I thought, so the preprocessing must be fine, right?
Well?
I ran the model for two whole weeks. Then I discovered that a batch of squamous cell carcinoma and adenocarcinoma images had completely mismatched file names and folder labels. All that data, wasted. You know that feeling? Like chasing a girl for three months only to find out she mistook you for someone else? No—worse than that.
So today, I’m going to walk you through every single pitfall I’ve stumbled into over these five years.
I’m not bragging: this stuff really works. Every word here was bought with GPU lives.
---
Chapter 1: Don’t Touch the Model! First Ask: Can You Trust Your Data?
A lot of people jump straight into U‑Net and Mask R‑CNN. Please, first think about one question: Why do you believe your output means anything?
Medical image processing is not like autonomous driving!
In self‑driving, if the system mistakes a pedestrian for a traffic sign, anyone can see something’s wrong. But medical images? You ask the model to segment a tumor boundary, and a non‑expert can’t tell right from wrong. Even experts have to double‑check—what if the model is just fooling itself?
So now, before I start any project, I force myself to do three things. I don’t touch the model until they’re done.
1. Know what format your data is in.
Don’t laugh! I’ve seen someone treat a NIfTI file as a normal PNG, get a bunch of garbage, and still analyze it for a week. Isn’t that just asking for trouble?
2. Lock everything down with a fixed random seed.
I always use SEED=42. This isn’t mysticism! Without it, when you later do ablation studies or tune parameters, you won’t be able to tell whether a performance fluctuation is caused by the model or just randomness. Then you’ll have no one to cry to!
3. Compute Dice and IoU by hand first.
Pick 10 images as a small sample, label them yourself, and calculate the metrics manually. Then confirm that your code produces the same numbers.
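The three checks above can be sketched in a few lines of Python. Treat this as a minimal sketch rather than a full pipeline: `sniff_format`, `seed_everything`, and `dice_and_iou` are hypothetical helper names, only the PNG and NIfTI-1 signatures are recognized, and scoring two empty masks as 1.0 is an assumed convention that you may want to change.

```python
import gzip
import random

import numpy as np

SEED = 42
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
NIFTI1_MAGICS = (b"n+1\x00", b"ni1\x00")  # magic field at byte offset 344


def sniff_format(path):
    """Guess the file type from its leading bytes, not from its extension."""
    with open(path, "rb") as f:
        is_gzip = f.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open  # .nii.gz is a gzipped NIfTI
    with opener(path, "rb") as f:
        header = f.read(348)  # size of a NIfTI-1 header
    if header.startswith(PNG_SIGNATURE):
        return "png"
    if len(header) >= 348 and header[344:348] in NIFTI1_MAGICS:
        return "nifti1"
    return "unknown"


def seed_everything(seed=SEED):
    """Seed the Python and NumPy generators.

    Assumption: a deep learning framework needs its own call as well,
    for example torch.manual_seed(seed) when using PyTorch.
    """
    random.seed(seed)
    np.random.seed(seed)


def dice_and_iou(pred, truth):
    """Dice and IoU for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    if pred.shape != truth.shape:
        raise ValueError("masks must have the same shape")
    intersection = np.logical_and(pred, truth).sum()
    total = pred.sum() + truth.sum()
    if total == 0:  # both masks empty: assumed convention, perfect score
        return 1.0, 1.0
    union = total - intersection
    return float(2 * intersection / total), float(intersection / union)
```

A quick sanity check when comparing against hand calculations: on any pair of masks, Dice should equal 2·IoU / (1 + IoU). If your numbers break that identity, the bug is in the code, not in the model.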
Speaking of pitfalls—here’s one that still makes me grind my teeth.
I once computed Hausdorff distance in Python with a third‑party library function. The returned values looked fine. Later, a reviewer suddenly asked, “What’s your AVD?” and when I went back to check, I realized the distance computation was missing a dimension!
Don’t trust third‑party libraries blindly! Evaluation metrics for medical images aren’t as mature as those for classification tasks. You’d better hand‑code each metric and verify the logic.
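As an illustration of what hand-coding a metric can look like, here is a brute-force symmetric Hausdorff distance in NumPy. It is a sketch under assumptions: `hausdorff_distance` and its `spacing` parameter are hypothetical names, it compares all foreground voxels rather than extracted surfaces (so values can differ from surface-based library versions), and it is only practical for small masks. The explicit check that `spacing` has one entry per axis is the kind of guard that would catch a silently dropped dimension.

```python
import numpy as np


def hausdorff_distance(mask_a, mask_b, spacing=None):
    """Symmetric Hausdorff distance between two binary masks.

    spacing: physical voxel size per axis; defaults to 1.0 everywhere.
    """
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    if a.shape != b.shape:
        raise ValueError("masks must have the same shape")
    if spacing is None:
        spacing = (1.0,) * a.ndim
    if len(spacing) != a.ndim:
        raise ValueError("spacing needs one value per mask dimension")

    # Foreground coordinates scaled to physical units, shape (n_points, ndim).
    pts_a = np.argwhere(a) * np.asarray(spacing, dtype=float)
    pts_b = np.argwhere(b) * np.asarray(spacing, dtype=float)
    if len(pts_a) == 0 or len(pts_b) == 0:
        raise ValueError("Hausdorff distance is undefined for an empty mask")

    # Pairwise Euclidean distances, shape (n_a, n_b).
    diff = pts_a[:, None, :] - pts_b[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    a_to_b = dist.min(axis=1).max()  # farthest point of A from B
    b_to_a = dist.min(axis=0).max()  # farthest point of B from A
    return float(max(a_to_b, b_to_a))
```

To verify the logic, build two tiny 3D masks whose answer you can compute on paper, then compare that number with whatever the library returns.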
You see, medical image analysis can mean “segment tumor tissue from necrosis” or “measure oxygen saturation in blood vessels.” These are worlds apart! The former needs pixel‑level localization; the latter needs quantitative consistency. Use the same framework without adjusting your evaluation metric, and you’re bound to crash.
---
Chapter 2: Choosing a Network? U‑Net Isn’t Omnipotent!
U‑Net is good. I won’t deny that.
I started with it for cell segmentation. Back then, reviewers weren’t as strict, and you could publish with U‑Net plus an attention mechanism. But now? No way, really no way.
You must choose based on your specific problem!
Last year, I took on a live‑cell microscopy segmentation project. The data was nasty: low signal‑to‑noise ratio, uneven illumination, cells packed together, and extremely long time series.
Naturally, I started with U‑Net. Dice barely hit 0.6. A disaster!
Then I tried StarDist. Much better. StarDist represents cell shapes with star‑convex polygons, which works great for crowded scenes. That’s what I call the right tool for the job.
But StarDist isn’t magic either. When you get really irregular cell shapes—say, certain neuron types—its assumptions break down. It becomes clunky, slow, and prone to hanging.
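To make the star-convex idea concrete, here is a small NumPy sketch. It is an illustration of the representation only, not StarDist's own code: `ray_distances` and `polygon_from_rays` are hypothetical names, and the ray count and step size are arbitrary choices. Each object is described by a center point plus the distance to its boundary along a fixed set of rays.

```python
import numpy as np


def ray_distances(mask, center, n_rays=32, step=0.25):
    """Walk outward from `center` along evenly spaced rays and record
    how far each ray travels before it first leaves the mask."""
    mask = np.asarray(mask, dtype=bool)
    height, width = mask.shape
    cy, cx = center
    angles = np.linspace(0.0, 2.0 * np.pi, n_rays, endpoint=False)
    dists = np.zeros(n_rays)
    for i, angle in enumerate(angles):
        dy, dx = np.sin(angle), np.cos(angle)
        r = 0.0
        while True:
            y = int(round(cy + (r + step) * dy))
            x = int(round(cx + (r + step) * dx))
            inside = 0 <= y < height and 0 <= x < width
            if not inside or not mask[y, x]:
                break
            r += step
        dists[i] = r
    return angles, dists


def polygon_from_rays(center, angles, dists):
    """Rebuild the polygon vertices as (row, col) pairs from the rays."""
    cy, cx = center
    return np.stack(
        [cy + dists * np.sin(angles), cx + dists * np.cos(angles)], axis=1
    )


# A round "nucleus" of radius 10 is described well by its ray lengths.
yy, xx = np.mgrid[:41, :41]
disk = (yy - 20) ** 2 + (xx - 20) ** 2 <= 10 ** 2
angles, dists = ray_distances(disk, center=(20, 20))
vertices = polygon_from_rays((20, 20), angles, dists)
```

Because every ray stops at its first exit from the mask, any part of a shape that is not visible in a straight line from the center cannot be encoded. That is the assumption that breaks for branching shapes such as neurons.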
I also tried Cellpose. It’s strong because of its flow‑field representation, giving it great generalization across cell types and imaging modalities. But it has a weakness: it’s heavily dependent on resolution. Change the image size, and problems pop up.
You see, there’s no perfect network—only the right one for the job.
Here’s my current selection logic:
- Nuclei segmentation (e.g., DAPI staining): StarDist, fast and effective!
- Whole‑cell segmentation with highly varied morphology: Cellpose or Mesmer (choose Mesmer if you have enough data)
- Multi‑class tumor segmentation: Attention U‑Net or TransUNet (Transformer’s global self‑attention compensates for U‑Net’s local limitations)
- Lots of overlapping cells: Mask R‑CNN, slow but top‑tier for instance segmentation.
And that’s not all. How long does a single Mask R‑CNN training epoch take? On a Tesla V100 with 512x512 images, it’s 40 minutes! At that rate, 200 epochs come to roughly 133 hours, which is close to six days of nonstop training. Do the math: that’s enough time to binge‑watch a whole series!
So later, for classification tasks, I switched to MobileNetV2—a lightweight champion. Its inverted residual structure and linear bottlenecks give higher inference speed without sacrificing too much accuracy.
But don’t expect MobileNetV2 to handle segmentation! Segmentation needs far more detail than classification.
Speaking of which—I have to interrupt myself. I’ve seen too many people use MobileNetV2 for segmentation and then complain about poor results.
Bro! You’re trying to cut a steak with a screwdriver. Of course it won’t work!
---
Chapter 3: Loss Functions Aren’t Math Problems—They’re Psychology!
When I first started working on low‑count PET image denoising, I used MSE Loss. The resulting images came out over‑smoothed, with all texture gone! Lesion edges were completely blurred.
What good is that kind of model?
Then I read a paper called LeqMod. Its core idea: during denoising, maintain quantitative consistency for lesions. That is, not only should the image look clean, but the SUV values after denoising should match those of high‑count images.
That idea was a revelation.
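Here is one way to check that kind of quantitative consistency on your own outputs. To be clear, this is my own minimal sketch and not the formulation from the LeqMod paper: `lesion_bias` is a hypothetical helper, and `box_blur3` is only a stand-in for an over-smoothing denoiser. It measures how far the mean lesion value drifts after denoising, relative to the high-count reference.

```python
import numpy as np


def lesion_bias(denoised, reference, lesion_mask):
    """Relative error of the mean value inside the lesion.

    0.0 means the denoised lesion mean matches the reference;
    negative values mean the lesion signal was washed out.
    """
    mask = np.asarray(lesion_mask, dtype=bool)
    if not mask.any():
        raise ValueError("lesion mask is empty")
    denoised = np.asarray(denoised, dtype=float)
    reference = np.asarray(reference, dtype=float)
    ref_mean = reference[mask].mean()
    return float((denoised[mask].mean() - ref_mean) / ref_mean)


def box_blur3(image):
    """3x3 mean filter with edge padding (stand-in for heavy smoothing)."""
    image = np.asarray(image, dtype=float)
    padded = np.pad(image, 1, mode="edge")
    height, width = image.shape
    out = np.zeros_like(image)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + height, dx:dx + width]
    return out / 9.0


# Toy example: a small hot lesion (value 10) on a uniform background (value 1).
reference = np.ones((9, 9))
reference[3:6, 3:6] = 10.0
lesion = reference > 5.0
bias_after_blur = lesion_bias(box_blur3(reference), reference, lesion)
```

In this toy case the blurred image looks cleaner, yet the lesion mean drops substantially. A loss that only rewards smoothness never sees that.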
From then on, I never relied on a single loss function again.
Here’s what I do now:
1. Primary loss: Dice Loss or Focal Loss (for class imbalance)
2. Auxiliary loss: L1 Loss (higher weight on positive regions, lower on background)
3. Occasionally a perceptual loss: Extract feature layers from a pretrained network and compute feature‑level distances.
These losses aren’t just thrown together: they’re weighted! How do you tune the weights? Sweep them systematically and compare the results on your validation set. No superstition!
Don’t think I’m being too picky. This really matters!
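The weighted combination can be sketched in NumPy as follows. This shows the structure only: the function names are hypothetical, and the default weights (`w_dice=1.0`, `w_l1=0.5`, `pos_weight=5.0`) are placeholders I made up for the example, not tuned values. In real training you would write the same thing in your framework's tensor operations so that gradients flow.

```python
import numpy as np


def soft_dice_loss(prob, target, eps=1e-6):
    """1 - soft Dice between predicted probabilities and a binary target."""
    prob = np.asarray(prob, dtype=float)
    target = np.asarray(target, dtype=float)
    intersection = (prob * target).sum()
    return 1.0 - (2.0 * intersection + eps) / (prob.sum() + target.sum() + eps)


def weighted_l1_loss(prob, target, pos_weight=5.0, neg_weight=1.0):
    """L1 error with a higher weight on positive (foreground) pixels."""
    prob = np.asarray(prob, dtype=float)
    target = np.asarray(target, dtype=float)
    weights = np.where(target > 0.5, pos_weight, neg_weight)
    return (weights * np.abs(prob - target)).sum() / weights.sum()


def combined_loss(prob, target, w_dice=1.0, w_l1=0.5):
    """Weighted sum of the primary and auxiliary losses."""
    return (w_dice * soft_dice_loss(prob, target)
            + w_l1 * weighted_l1_loss(prob, target))
```

With these placeholder weights, missing a foreground pixel costs five times as much in the L1 term as a false alarm on the background. That asymmetry is the "higher weight on positive regions" from the list above.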
Last time I worked on MRI brain tumor segmentation, I used only Dice Loss as the primary loss. The model got stuck at Dice 0.78, no improvement. Then I added a boundary distance constraint, and it jumped straight to 0.83!
See? Just one extra step—a full 5 percentage point difference!
In short, your loss function determines what your model focuses on. Focus only on pixel overlap? It ignores boundaries. Focus only on boundaries? It ignores internal semantic consistency.
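For readers wondering what a boundary distance constraint can look like, one common way to express it is to weight the predicted probabilities by a signed distance map of the ground truth. The sketch below assumes that formulation and is not necessarily the exact term I used back then. `signed_distance_map` and `boundary_loss` are hypothetical names, and the distance computation is brute force for clarity (on real images, `scipy.ndimage.distance_transform_edt` would be the usual choice).

```python
import numpy as np


def signed_distance_map(mask):
    """Negative inside the object, positive outside. The magnitude is the
    Euclidean distance (in pixels) to the nearest pixel of the other class."""
    mask = np.asarray(mask, dtype=bool)
    foreground = np.argwhere(mask)
    background = np.argwhere(~mask)
    if len(foreground) == 0 or len(background) == 0:
        return np.zeros(mask.shape)

    # Coordinates of every pixel in row-major order, matching mask.ravel().
    coords = np.argwhere(np.ones(mask.shape, dtype=bool))

    def nearest(points):
        diff = coords[:, None, :] - points[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)

    signed = np.where(mask.ravel(), -nearest(background), nearest(foreground))
    return signed.reshape(mask.shape)


def boundary_loss(prob, truth_mask):
    """Mean of predicted probability times signed distance to the truth.

    Probability mass far outside the object is penalized heavily;
    mass deep inside the object is rewarded (negative contribution).
    """
    prob = np.asarray(prob, dtype=float)
    return float((prob * signed_distance_map(truth_mask)).mean())
```

Unlike Dice, this term grows with how far a wrong pixel sits from the true boundary, so a prediction that is shifted sideways scores worse than one that lines up. In practice it would be added to the Dice loss with its own weight.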
Recently, I’ve found a pretty interesting direction: CheXPO‑v2 shifts alignment from result supervision to process supervision, penalizing hallucinated generations using knowledge graph consistency rewards. This comes from the VLM field, but it could be adapted to medical image segmentation.
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.