A Roundup of Deep Learning Algorithms for Defect Detection

Generated: 2026-06-20 20:26:16

---

Let me tell you a true story.

A couple days ago, someone came up to me again: "Bro, is deep learning actually any good for defect detection?"

You have to understand—I've been in this field for almost ten years, and I've heard that question hundreds of times. Every time I hear it, it gives me a headache. Because there's no simple "yes" or "no" answer.

The real answer? It all depends on how you use it.

Which reminds me of when I first got into this back in 2016. The client wanted us to detect defects on fabric. We tried traditional machine vision—man, tweaking parameters until you question your existence. Every time we switched to a different fabric, we had to recalibrate everything from scratch. It drove us crazy.

Eventually we came up with an idea: chop the big image into small patches and feed them into a CNN for classification. The results were a little better, but the speed? So slow that the production line workers could stand there and doze off.

Today I'm going to walk you through the whole evolution of deep learning–based defect detection, from beginning to end. I'll tell you where the pitfalls are, what still works, and what you should never touch. All of it.

---

How did people first do it back in the day? In 2016 there was a paper titled "A fast and robust convolutional neural network-based defect detection model in product quality control." Fancy title, but the approach was actually pretty simple: cut the big image into small tiles, send each tile into a CNN to classify it as normal or defective, and use a sliding window during inference.

I actually tried this on that fabric project. The accuracy was okay—I mean, it was just image classification, you could grab VGG or ResNet and get a solid 80%. But for a 2000×2000 pixel image, with a stride of 32, you'd have to run thousands of inferences! One image took tens of seconds—the production line totally rejected it. This method only works for offline spot checks, or where defects are extremely sparse. If you want to use it on a conveyor belt? Better change your thinking.
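To put a rough number on "thousands of inferences," here is a minimal sketch of the window count. The 2000×2000 image and the stride of 32 come from the story above; the 64-pixel patch size and the 5 ms per-patch latency are assumptions picked only for illustration, not figures from the project.

```python
def sliding_window_count(image_size: int, patch_size: int, stride: int) -> int:
    """Number of window positions along one axis, without padding."""
    if patch_size > image_size:
        return 0
    return (image_size - patch_size) // stride + 1


def total_inferences(height: int, width: int, patch_size: int, stride: int) -> int:
    """Total CNN forward passes needed to cover the whole image."""
    rows = sliding_window_count(height, patch_size, stride)
    cols = sliding_window_count(width, patch_size, stride)
    return rows * cols


# patch_size=64 is an assumed value; image size and stride are from the text.
n_windows = total_inferences(2000, 2000, patch_size=64, stride=32)
print(n_windows)  # 3721

# Hypothetical per-patch latency, used only to show how the cost scales.
assumed_ms_per_patch = 5.0
print(f"{n_windows * assumed_ms_per_patch / 1000:.1f} s per image")
```

Batching patches helps, but the window count grows quadratically as the stride shrinks, which is why this approach stays in offline spot-check territory.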

Then in 2017, a paper on detecting fasteners in overhead line hardware used a three-stage SSD: first SSD to locate the main structural parts, then SSD within those parts to detect fasteners, and finally crop the fasteners and use a classification network to determine if they were missing. I saw a similar idea in a railway project: coarse-to-fine, filtering layer by layer. It sounded perfectly logical for industrial applications.

But in practice? Each stage introduced missed detections, and the miss rates stacked up. The final performance was terrible. Plus, it wasn't end-to-end—you had to tune three networks serially, and every time you switched scenarios you had to start over. From an engineering standpoint, two-stage detectors (Faster R-CNN) or single-stage ones (YOLO, SSD) are sufficient. A three-stage approach is more suited for padding a publication.
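Why do the misses stack up so badly? If you assume each stage misses independently (a simplification), the end-to-end recall is just the product of the per-stage recalls. The 0.95 values below are hypothetical, not measurements from that paper or from any project.

```python
from math import prod


def cascade_recall(stage_recalls):
    """End-to-end recall of a serial cascade, assuming independent misses."""
    for recall in stage_recalls:
        if not 0.0 <= recall <= 1.0:
            raise ValueError("each recall must be between 0 and 1")
    return prod(stage_recalls)


# Three stages that each look decent in isolation (hypothetical numbers).
three_stage = cascade_recall([0.95, 0.95, 0.95])
print(f"{three_stage:.3f}")  # 0.857
```

So three stages that each catch 95% of their targets already lose about one defect in seven overall, before any tuning headaches.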

After that, around 2018, people realized that classification can only label a patch and detection can only draw bounding boxes, but defects are often irregular in shape, so a box ends up containing a lot of normal area. So semantic segmentation came along. The most representative paper was "Automatic Fabric Defect Detection with a Multi-Scale Convolutional Denoising Autoencoder Network Model." It used a Gaussian pyramid combined with semantic segmentation to reconstruct defects, fusing multi-scale results during inference.

I tested similar frameworks. Segmentation indeed gives pixel-level contours, which is great for irregular defects like scratches and dents. But you need pixel-level annotations! And in industrial settings, annotations are the scarcest resource. Let me tell you a true story: to label 1,000 images of steel strip surface defects, our team had three interns working for two weeks, and they got into several fights. No matter how high the mIoU in the paper, the labeling cost on the ground can scare you away.

Starting around 2019, people got spooked by labeling costs and began chasing unsupervised methods. The main trick was to use autoencoders or GANs to reconstruct normal samples; defective areas, not seen in the training set, would have larger reconstruction errors and could be located that way. I remember one paper used cascaded autoencoders for scratch detection, and the results were okay. But the threshold was a nightmare to tune: set it too high and you miss defects, set it too low and false alarms explode.
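The thresholding step is easier to see in code. This is a minimal sketch, not any paper's method: the "reconstruction" is a hand-written stand-in for an autoencoder's output, images are tiny nested lists, and picking the threshold as a quantile of errors on defect-free images is one common heuristic that I'm assuming here.

```python
def error_map(image, reconstruction):
    """Per-pixel squared reconstruction error for 2D grayscale images."""
    return [
        [(a - b) ** 2 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(image, reconstruction)
    ]


def pick_threshold(normal_errors, quantile=0.99):
    """Use a high quantile of errors seen on defect-free images as the cutoff."""
    if not normal_errors:
        raise ValueError("need at least one error value from normal images")
    ordered = sorted(normal_errors)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]


def defect_mask(errors, threshold):
    """Flag pixels whose reconstruction error exceeds the threshold."""
    return [[e > threshold for e in row] for row in errors]


# Toy data: the center pixel is a "defect" the model fails to reconstruct.
image = [[0.1, 0.1, 0.1], [0.1, 0.9, 0.1], [0.1, 0.1, 0.1]]
reconstruction = [[0.1] * 3 for _ in range(3)]  # stand-in for autoencoder output
normal_errors = [0.0, 0.001, 0.002, 0.004, 0.01]  # hypothetical validation errors

threshold = pick_threshold(normal_errors, quantile=0.99)
mask = defect_mask(error_map(image, reconstruction), threshold)
print(mask[1][1])  # True
```

The `quantile` knob is the whole tuning nightmare in one parameter: push it up and subtle defects slip under the cutoff, pull it down and ordinary texture noise starts getting flagged.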

Then some people used GANs to generate defective samples for data augmentation. I tried DCGAN and WGAN. The quality of the generated images? Well, guess what? Sometimes the generated defects looked like special effects from a horror movie—the model could never learn real features. Recently diffusion models came out, and the generation quality took a big leap. I tried Stable Diffusion for data augmentation in 2024, and it was indeed better than GANs. But the computational cost was insane—tens of seconds per image, plus you need high-end GPUs. My advice: unsupervised detection can only serve as an auxiliary method right now—use it to prescreen possible defect areas, then have a human confirm. Fully automated deployment? I've never seen a successful case.

In recent years, the most popular thing in industry is definitely the YOLO series. From YOLOv3 to YOLOv8, v9, v10, each version improves in speed and accuracy. I've used YOLOv5 and YOLOv8 extensively for projects like chain defect detection and PCB defect detection. Let me give you some concrete numbers: in a chain defect detection project, we used YOLOv8n on a GTX 1660 and got 30 fps with mAP@0.5 around 0.85. 30 fps! That's more than enough for an industrial production line.
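If the metric is unfamiliar: mAP at an IoU threshold of 0.5 counts a predicted box as a hit when its intersection-over-union with a ground-truth box is at least 0.5. The sketch below shows only that matching step, not the full precision-recall averaging, and the box coordinates are made up.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, iou_threshold=0.5):
    """A prediction counts as correct if it overlaps the ground truth enough."""
    return iou(pred_box, gt_box) >= iou_threshold


ground_truth = (10, 10, 50, 50)
print(is_true_positive((12, 12, 52, 52), ground_truth))  # True
print(is_true_positive((20, 20, 60, 60), ground_truth))  # False
```

Note how a 10-pixel shift on a 40-pixel box already drops below the 0.5 cutoff, which is part of why small defects are hard to score well on.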

Comparing YOLOv8n and YOLOv10n? YOLOv10n has faster inference (thanks to the NMS-free design), but the lightweight model struggles a bit with high-resolution small targets. My advice: go for v10 if you want extreme speed, and v8 with attention mechanisms (SE, CBAM, etc.) if you prioritize accuracy—you can even squeeze out an extra 1-2 points.

Also, Transformers have joined the party. In 2025, a paper used a Self-Induction Vision Transformer to hit AUROC over 98% on the MVTec AD benchmark, quite a bit higher than CNN-based methods. But here's the problem again: Transformer inference is computationally heavy, and most industrial edge devices can't run it at production speed. For now, the most reliable choice is still YOLOv8 with attention.

---

Different industries have their own pitfalls. Let me tell you a few I've stepped in, so you don't repeat them.

First, PCB defect detection. I tried the Tiny-Defect-Detection-for-PCB project, which is specifically optimized for tiny defects and works well as an AOI replacement. But where's the trap? It's extremely sensitive to lighting—change the light source a bit and it fails. The solution is to deliberately vary the lighting angle during data collection; otherwise, after deployment, the alarm rate will explode.

Next, steel surface defects. On hot-rolled steel strips you get scale, patches, scratches—low inter-class variance, high intra-class variance. I tuned YOLOv8 plus SE attention, and added mosaic augmentation, just to pull recall from 72% to 85%. There's also a project called Severstal-Steel-Defect-Detection with a lot of public data—great for beginners to practice.

Textile inspection has its own headaches. Fabric wrinkles, color differences, broken yarns—the biggest pitfall I fell into was that my training set was all flat-laid under ideal lighting. When the fabric on the production line started fluttering, the false positive rate doubled. After adding random affine transforms and blur, it stabilized.

In short, you must do targeted data augmentation for industrial scenarios—no cutting corners.
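Here is what "targeted" can look like in its simplest form: a sketch that jitters lighting and sometimes blurs, to mimic a shifted light source and moving fabric. It works on a tiny grayscale image stored as nested lists so it runs without any image library; the gain and gamma ranges and the 50% blur probability are assumptions, and random affine transforms are left out for brevity.

```python
import random


def jitter_lighting(image, gain, gamma):
    """Apply gain and gamma to a grayscale image with values in [0, 1]."""
    return [
        [min(1.0, max(0.0, gain * (pixel ** gamma))) for pixel in row]
        for row in image
    ]


def box_blur(image):
    """3x3 mean blur; border pixels average over in-bounds neighbours only."""
    height, width = len(image), len(image[0])
    blurred = []
    for y in range(height):
        row = []
        for x in range(width):
            neighbours = [
                image[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(sum(neighbours) / len(neighbours))
        blurred.append(row)
    return blurred


def augment(image, rng):
    """Random lighting change, followed by blur half of the time."""
    gain = rng.uniform(0.7, 1.3)    # assumed brightness range
    gamma = rng.uniform(0.8, 1.25)  # assumed contrast-curve range
    out = jitter_lighting(image, gain, gamma)
    if rng.random() < 0.5:          # assumed blur probability
        out = box_blur(out)
    return out


rng = random.Random(0)  # fixed seed so runs are repeatable
sample = [[0.2, 0.4, 0.6, 0.8] for _ in range(4)]
augmented = augment(sample, rng)
```

In a real pipeline you would hand this job to an image augmentation library and tune the ranges against what the line camera actually sees; the point is only that the augmentations should mirror the failure you expect on site.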

---

For the next two or three years, I see three promising directions.

The first is few-shot learning. Right now, whether it's MAML or prototypical networks, the results on industrial defects are mediocre. But using generative models to synthesize defects and then training on the synthetic samples is becoming mainstream. In a semiconductor project with only 10 defect images, I used a diffusion model to generate 500 more, and the model accuracy went from 55% to 80%. Still not commercial grade, but clear progress.

The second is multimodal fusion. Pure vision sometimes can't distinguish real defects from dust or oil stains. Adding infrared or 3D point clouds can significantly improve robustness. I've seen some factories experimenting with RGB-D, but data alignment and cost

Cael Lee
