The Latest Direction in Object Detection: Overturning Fixed Conventions, No More One-Size-Fits-All Anchors

By CaelLee | 4 min read

Generated: 2026-06-20 21:04:03

---

I'm telling you, in 2020 I worked on a pedestrian detection project, and it nearly drove me crazy!

I swapped datasets over and over: COCO, CrowdHuman, MOT17. And every time I switched, the first thing I did wasn't tuning the model. It was rerunning K-Means to recalculate the anchor sizes! Then I'd find the aspect ratios were off, change the number of clusters, and recalculate again. The worst part? When I bumped the input resolution from 640 to 1280, all those anchors became useless. I slammed the table right then and there. Is this even deep learning? This is just manually tuning parameters! Like an old-school tailor who has to re-measure the whole body every time new fabric arrives: clunky, painful, and ready to blow up at any moment.
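For readers who never had to do this chore, here is a minimal sketch of the anchor clustering I mean. It assumes the common recipe of K-Means over box (width, height) pairs with 1 - IoU as the distance. The box list and the function names `wh_iou` and `kmeans_anchors` are made up for illustration, not taken from any particular repo.

```python
import random

def wh_iou(box, anchor):
    """IoU of two (w, h) boxes aligned at a shared corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs, treating the highest IoU as the nearest center."""
    rng = random.Random(seed)
    anchors = rng.sample(boxes, k)  # initial centers drawn from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: wh_iou(box, anchors[i]))
            clusters[best].append(box)
        new_anchors = []
        for i, members in enumerate(clusters):
            if not members:
                new_anchors.append(anchors[i])  # empty cluster: keep old center
                continue
            w = sum(m[0] for m in members) / len(members)
            h = sum(m[1] for m in members) / len(members)
            new_anchors.append((w, h))
        if new_anchors == anchors:
            break  # converged
        anchors = new_anchors
    return sorted(anchors, key=lambda a: a[0] * a[1])

# Hypothetical pedestrian-like boxes (width, height) in pixels at 640 input.
boxes = [(20, 60), (22, 64), (18, 55), (50, 150), (55, 160),
         (48, 140), (100, 300), (110, 310), (95, 290)]
anchors_640 = kmeans_anchors(boxes, k=3)

# Doubling the input resolution doubles every box, so the anchors must be
# recomputed (or rescaled) as well.
anchors_1280 = kmeans_anchors([(2 * w, 2 * h) for w, h in boxes], k=3)
```

Every new dataset, cluster count, or input resolution means running something like this again and pasting the result back into the config, which is exactly the manual step that kept annoying me.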

Think about it—back then everyone was convinced: anchors are the foundation of object detection. Without them—impossible!

And then? Later I tried YOLOv8, and that was a slap in the face. It just flipped the table: it threw away anchors and went anchor-free. Each location on the feature map acts as its own reference point and directly predicts the distances from that point to the four sides of the box. Clean and straightforward. I was hesitant at first, worried the accuracy would tank. But I ran it on my own five thousand business images, and mAP was 1.2 points higher than YOLOv5! And the training pipeline was way simpler. You know that feeling when something you've been killing yourself over suddenly becomes unnecessary? It's like you always thought the key to the door was super important, only to find out the door was never locked!
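To make "distances to the four sides" concrete, here is a rough sketch of the decoding step. It is a simplified illustration under my own assumptions (distances expressed in units of the feature-map stride, reference point at the cell center), and `decode_ltrb` is a hypothetical helper, not YOLOv8's actual code. Real heads add more machinery on top of this.

```python
def decode_ltrb(cx, cy, ltrb, stride):
    """Turn one cell's (left, top, right, bottom) distances into (x1, y1, x2, y2).

    cx, cy: integer cell indices on the feature map.
    ltrb:   predicted distances, assumed to be in units of `stride`.
    """
    # Map the cell center back to input-image pixel coordinates.
    px = (cx + 0.5) * stride
    py = (cy + 0.5) * stride
    l, t, r, b = (d * stride for d in ltrb)
    return (px - l, py - t, px + r, py + b)

box = decode_ltrb(cx=10, cy=5, ltrb=(1.5, 2.0, 1.5, 2.0), stride=8)
print(box)  # (72.0, 28.0, 96.0, 60.0)
```

Notice that nothing here depends on the dataset: no prior widths or heights, just a grid position and four numbers.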

Now you might say: "What about DETR? That's even crazier, right?"

Absolutely. When DETR first came out, I really didn't get it. It was bold, turning object detection into set prediction: feed in an image, run a hundred object queries through a Transformer decoder, and directly output a hundred boxes and class labels. No anchors! No NMS! During training it uses the Hungarian algorithm to match predictions to ground-truth objects one-to-one. My first reaction was: is this just an academic toy? Five hundred epochs to converge, who'd ever use that?
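Here is a toy sketch of what one-to-one matching means. To stay dependency-free it brute-forces the minimum-cost assignment instead of running the actual Hungarian algorithm, and it uses 1 - IoU as the only cost, which is my simplification (DETR's real matching cost also involves class scores and other box terms). The boxes and the name `match_one_to_one` are invented for the example.

```python
from itertools import permutations

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_one_to_one(pred_boxes, gt_boxes):
    """Assign each ground-truth box to a distinct prediction at minimum total cost.

    Brute force over permutations: fine for a toy, exponential in general.
    Assumes there are at least as many predictions as ground-truth boxes.
    """
    best_cost, best_assign = float("inf"), None
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(1.0 - iou(pred_boxes[p], gt_boxes[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_assign = cost, perm
    return {g: p for g, p in enumerate(best_assign)}

# Four "query" predictions, two real objects.
preds = [(0, 0, 10, 10), (50, 50, 60, 60), (9, 9, 20, 20), (100, 100, 110, 110)]
gts = [(51, 51, 61, 61), (1, 1, 11, 11)]

matches = match_one_to_one(preds, gts)  # {0: 1, 1: 0}
unmatched = [p for p in range(len(preds)) if p not in matches.values()]  # [2, 3]
```

The unmatched predictions are trained to say "no object", which is why the model can learn not to emit duplicates in the first place.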

But then? RT-DETR came out in 2023. Its hybrid encoder mixes Transformer and CNN: the AIFI module applies attention within a single scale, CNN-based fusion handles the cross-scale part, and IoU-aware query selection picks the initial object queries. I tested the official model on my T4 GPU with batch=1 on 1280×720 video: RT-DETR-L inference took around 40 ms and YOLOv8-L around 35-45 ms, so the speed difference is basically negligible. But look at accuracy: on dense crowd data like CrowdHuman, RT-DETR beats YOLOv8 by two or three mAP points! And because it drops NMS and its brute-force suppression of overlapping boxes, false positives are way lower. I've already put it into my security project for zone-crossing alerts, and it's really nice.
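If you want to see why NMS is awkward in crowds, here is a minimal greedy NMS sketch with made-up boxes. It is a plain-Python illustration of the general technique, not the implementation used by any specific YOLO release.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the best-scoring box, drop others that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep

# Two pedestrians standing close together (IoU 0.6) plus one isolated person.
boxes = [(100, 100, 140, 220), (110, 100, 150, 220), (300, 100, 340, 220)]
scores = [0.90, 0.85, 0.80]

kept_strict = nms(boxes, scores, iou_thresh=0.5)  # [0, 2]
kept_loose = nms(boxes, scores, iou_thresh=0.7)   # [0, 1, 2]
```

With the stricter threshold the second pedestrian disappears even though the detection was correct, and loosening the threshold lets real duplicates through instead. That one hand-tuned number is what the end-to-end detectors get rid of.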

But of course, someone immediately jumps up and says: "Anchor-free and DETR must be worse than well-crafted anchors for small targets, right?"

I tell you, I used to think that too. When YOLOX first went anchor-free, small targets did struggle. But it's different now! YOLOv8 has TAL (Task Alignment Learning) for dynamic positive sample assignment, and YOLOX has SimOTA for automatic matching. Both replace hand-crafted rules with assignment that adapts during training. The model develops its own judgment about which feature position should handle which target, so it ends up more flexible, not less! I took a drone aerial dataset (full of tiny targets, about two thousand images), and YOLOv8's mAP_small was 0.8 points higher than YOLOv5 with anchors. Think about that. How counterintuitive is that? Ditch the fixed templates, and you actually gain accuracy.
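As a rough picture of what task-aligned assignment does, here is a sketch of the alignment metric as I understand it from the TOOD paper: t = s^alpha * u^beta, where s is the predicted class score and u is the IoU with the ground-truth box, and the top-k candidates by t become positives. The numbers, the `topk`, `alpha`, and `beta` values, and the function name are illustrative assumptions, not the settings of any particular release.

```python
def task_aligned_topk(cls_scores, ious, topk=2, alpha=1.0, beta=6.0):
    """Rank candidate positions by s**alpha * u**beta and return the top-k indices."""
    metrics = [(s ** alpha) * (u ** beta) for s, u in zip(cls_scores, ious)]
    ranked = sorted(range(len(metrics)), key=lambda i: metrics[i], reverse=True)
    return ranked[:topk], metrics

# Four candidate positions for one ground-truth object (hypothetical values).
cls_scores = [0.9, 0.6, 0.8, 0.3]   # predicted score for the object's class
ious = [0.5, 0.9, 0.8, 0.95]        # IoU of each predicted box with the object

positives, metrics = task_aligned_topk(cls_scores, ious)
print(positives)  # [1, 3]
```

Candidate 0 has the highest class score but localizes poorly, so it is not picked. The positives change as the predictions change during training, which is the "dynamic" part, and no anchor shape is involved anywhere.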

Of course, I'm not saying you should blindly chase the new. If you have just three to five hundred images, a two-week project timeline, and a Jetson Nano as the deployment target, buddy, don't get fancy! Stick with YOLOv8n or YOLOv5s. In my experience the DETR series is data-hungry: the Transformer attention wants at least tens of thousands of images, and on small data it converges so slowly it'll make you question your existence. I fell into that trap: I forced RT-DETR through 300 epochs, and it still wasn't as good as YOLOv5s trained for 50 epochs. Honestly, choosing the right model depends on your resources. My current principle is very simple: if you have tens of thousands of images, heavy occlusion, and want an elegant end-to-end solution, go with RT-DETR; if you have limited data and need edge deployment and speed, go with YOLOv8 or YOLOv10 (the version that removes NMS; I'm testing it, and it's promising).

One more thing that's easy to overlook: maintenance cost.

Think about it: if you come back to an anchor-based project after six months, you might have forgotten which dataset those cluster-derived anchor ratios came from. But an anchor-free project is as clean as plain water in summer: Backbone + Neck + Decoupled Head, no extra priors. New team members can pick it up at a glance. My team migrated two old projects from YOLOv5 to YOLOv8 this year, not because of huge performance gains, but because the code is easier to read and debugging is way faster!

So, here are two pieces of advice for you, remember them—

First: If you're just starting to learn object detection, skip the anchor-based era entirely and start with YOLOv8 or RT-DETR. Understand the anchor theory, sure, but don't waste time tuning them.

Second: When you're choosing a model for engineering, don't get hung up on one approach. I've seen people force DETR just for "end-to-end" and then the deployment couldn't hold up. Use YOLO where it fits, mix and match where needed. In my own system, I use RT-DETR for pedestrian counting at the entrance and YOLOv8 for part detection in the workshop—each does its own job without interfering.

Tearing down established setups was never about destruction for its own sake—it's about having one less thing to worry about.

Anchors, NMS, manual label assignment… these things fade out not as a show of skill, but to let us focus our energy on data and the business itself. To me, that's the true direction of object detection.

Think about it—when a technology lets you forget it exists and you can just focus on solving the real problem—that's when it's at its best.

Right? 😉

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
