In 2020, which areas of computer vision do you think are worth researching?

Generated: 2026-06-21 16:46:22

---

That Spring of 2020, I Nearly Got Sick From Practicing for the Detection Competition

Guess what?

In the spring of 2020, I did something incredibly “dumb”—I spent half a year grinding mAP until I felt like throwing up every time I saw the COCO dataset.

Not bragging, but during that time I could recite the YOLOv3 anchor parameters with my eyes closed. But the longer I kept grinding, the more anxious I got: “Even if I can draw the boxes perfectly, then what? What the hell else can this field do?”

Honestly, that feeling is like spending a whole year learning how to cut vegetables beautifully, but never once thinking—what dish am I actually supposed to make?

So what happened next? A 3D reconstruction project came knocking, and I jumped in with both feet. Only then did I realize that my previous understanding was way too “flat.”

---

The 3D Direction: The Biggest Opportunity You Missed in 2020 Is Right Here

When it comes to 3D, before 2020 I was still a bit intimidated.

Think about it: annotation costs are ridiculously high, the data is a pain in the ass to get, and results often crash and burn—who would want to touch that?

But a few things happened that year that made me think: the time has come.

The first thing was called PointConv.

It was published at CVPR 2019, but what really got me excited was working with it in 2020. Earlier methods like PointNet++ were usable, but honestly, they still sampled and grouped the points and then squeezed each neighborhood through max pooling, like peeling a grape, squeezing it into juice, and losing all the nutrients.

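But PointConv was different. It applied convolution directly on point clouds: a small MLP acts as a weight function that maps each neighbor’s relative coordinates to kernel weights, and a density estimate re-weights the neighbors to compensate for non-uniform sampling.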

Now that's real point convolution!
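
To make the weight-function idea concrete, here is a minimal NumPy sketch of a PointConv-style layer. It is my own simplification, not the paper's code: the name `pointconv_like`, the two-layer weight MLP (`w1`, `w2`), the Gaussian density estimate, and the `bandwidth` value are all assumptions for illustration, and the real method also passes the inverse density through a learned MLP.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbors (the point itself included) for each point."""
    diff = points[:, None, :] - points[None, :, :]        # (N, N, 3)
    dist2 = np.sum(diff ** 2, axis=-1)                     # (N, N)
    return np.argsort(dist2, axis=1)[:, :k], dist2         # (N, k), (N, N)

def pointconv_like(points, feats, w1, w2, k=8, bandwidth=0.5):
    """
    points: (N, 3) coordinates, feats: (N, C_in) per-point features.
    w1: (3, H) and w2: (H, C_in * C_out) are hypothetical parameters of the
    weight function, i.e. the MLP that turns relative coordinates into kernel weights.
    Returns (N, C_out) features.
    """
    n, c_in = feats.shape
    c_out = w2.shape[1] // c_in
    idx, dist2 = knn_indices(points, k)

    # Weight function: relative coordinates -> one (C_in, C_out) kernel per neighbor.
    rel = points[idx] - points[:, None, :]                 # (N, k, 3)
    hidden = np.maximum(rel @ w1, 0.0)                     # (N, k, H), ReLU
    kernel = (hidden @ w2).reshape(n, k, c_in, c_out)      # (N, k, C_in, C_out)

    # Density: Gaussian kernel density estimate, used as an inverse scale so that
    # densely sampled regions do not dominate the sum (assumed, non-learned form).
    density = np.mean(np.exp(-dist2 / (2.0 * bandwidth ** 2)), axis=1)  # (N,)
    inv_density = 1.0 / density[idx]                       # (N, k)
    inv_density = inv_density / inv_density.max(axis=1, keepdims=True)

    # Convolution: sum over neighbors and input channels.
    neigh = feats[idx] * inv_density[..., None]            # (N, k, C_in)
    return np.einsum("nkc,nkco->no", neigh, kernel)        # (N, C_out)

rng = np.random.default_rng(0)
pts = rng.normal(size=(32, 3))
f = rng.normal(size=(32, 4))
out = pointconv_like(pts, f, rng.normal(size=(3, 16)), rng.normal(size=(16, 4 * 6)))
print(out.shape)
```

Notice the `kernel` tensor of shape `(N, k, C_in, C_out)`: materializing one kernel per neighbor like this is probably a big part of why memory balloons so quickly with a naive implementation.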

I tried it on indoor scene segmentation. The results were great, really great, but the GPU memory consumption was terrifying. A single V100 could barely fit a batch of 16 point cloud blocks. Staring at that memory usage, I could only think: “Dude, can you not be such a picky eater when you're full?”

The second thing was the unsupervised approach to stereo matching and depth estimation.

Back then, GANs were all the rage for domain adaptation. Works like SharinGAN and StereoGAN gave me hope: was it possible to train on synthetic data and then deploy directly on real scenes?

I fell into this pit, and I fell deep.

I adapted SharinGAN’s code and tried indoor depth estimation. The evaluation metrics did improve, but the edges were still a blurry mess. Later I realized that the key wasn't really the GAN itself; it was the loss design based on multi-view geometry constraints. Just training the GAN’s generator harder is brute force. The truly smart approach is to add self-consistency constraints, for example requiring that one view, warped by the predicted depth or disparity, reproduces the other view.
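
One common form of such a self-consistency constraint is a photometric reconstruction loss on a rectified stereo pair. The NumPy sketch below is only an illustration under my own assumptions (grayscale images, positive disparity, the hypothetical names `warp_right_to_left` and `photometric_loss`); it is not SharinGAN's or StereoGAN's actual loss, and a real training setup would use a differentiable version in a deep learning framework.

```python
import numpy as np

def warp_right_to_left(right, disparity):
    """
    Reconstruct the left view by sampling the right image at x - d.
    right: (H, W) grayscale image, disparity: (H, W) predicted left disparity in pixels.
    Uses linear interpolation along x and clamps at the image border.
    """
    h, w = disparity.shape
    src_x = np.arange(w, dtype=np.float64)[None, :] - disparity   # (H, W)
    src_x = np.clip(src_x, 0.0, w - 1.0)
    x0 = np.floor(src_x).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = src_x - x0
    rows = np.arange(h)[:, None]
    return (1.0 - frac) * right[rows, x0] + frac * right[rows, x1]

def photometric_loss(left, right, disparity):
    """Mean absolute error between the left image and its reconstruction from the right."""
    recon = warp_right_to_left(right, disparity)
    return float(np.mean(np.abs(left - recon)))

# Toy check: a left image that is the right image shifted by 3 pixels.
rng = np.random.default_rng(0)
right = rng.random((4, 10))
left = np.empty_like(right)
left[:, 3:] = right[:, :-3]
left[:, :3] = right[:, :1]          # border columns follow the clamping rule
print(photometric_loss(left, right, np.full((4, 10), 3.0)))   # correct disparity
print(photometric_loss(left, right, np.zeros((4, 10))))       # wrong disparity
```

The point is that no depth labels appear anywhere: the correct disparity is simply the one that makes the two views agree. In practice this term is usually combined with others, such as SSIM and a smoothness penalty.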

Speaking of which, I really want to share a counter-intuitive fact with you:

This direction still has huge room for growth even now. It’s especially suitable for teams that—you guessed it—don’t have the money to label 3D data.

Here’s the trap I fell into myself: If you were just starting out in 2020 and chose 3D detection or 3D reconstruction, stay away from point cloud segmentation. Why? The computational resource barrier was too high. Plus, most datasets back then were small indoor scenes. You could train a model, publish a paper, but never deploy it.

On the other hand, monocular depth estimation, while less accurate than LiDAR, wins on—

Cost.

Universality.

It’s really perfect for scenarios like industrial inspection.

---

Video Direction: Extracting Optical Flow and Using It as a Plugin Is a Total Game-Changer

Video understanding in 2020 also saw some eye-catching developments.

After object tracking had its run with the Siamese-network trackers (the Siam series), people started working on motion-aware semantic segmentation: segmenting objects while also determining whether they’re moving.

The one that impressed me the most was Google’s Looking Fast and Slow.

What a great name. The slow path extracts key frames for fine features, the fast path runs at dozens of frames per second for motion cues, and then they fuse together.

I was so excited that I immediately tried to reproduce it. And then I hit a huge trap: the fast path was too greedy for optical flow.

Guess how much of the total computation optical flow took in the official demo?

It was almost as much as the backbone network. Deployment couldn’t handle it at all.

So I changed my approach—I used optical flow as a plugin.

You see, the most beautiful thing about optical flow is this: you don’t have to tie it to the network. You can separate it out and use it as a preprocessing step. For example, compute optical flow externally with FlowNet2, and then feed the warped feature map into the network.

The downside? You can’t fine-tune it end-to-end.

But the upside is—speed. And you can swap it in and out easily.
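
Here is a rough sketch of what the “plugin” step looks like once the flow has been computed externally (by FlowNet2 or anything else). The function name `warp_features` and the flow convention (pixel offsets from the current frame to the key frame, stored as `(dx, dy)`) are my assumptions for illustration; check the convention of whichever flow estimator you actually use.

```python
import numpy as np

def warp_features(feat, flow):
    """
    Backward-warp key-frame features to the current frame with bilinear sampling.
    feat: (C, H, W) feature map of the key frame.
    flow: (2, H, W) offsets in pixels, flow[0] = dx and flow[1] = dy, pointing from
          each current-frame location to its source location in the key frame.
    """
    c, h, w = feat.shape
    ys, xs = np.meshgrid(np.arange(h, dtype=np.float64),
                         np.arange(w, dtype=np.float64), indexing="ij")
    src_x = np.clip(xs + flow[0], 0.0, w - 1.0)
    src_y = np.clip(ys + flow[1], 0.0, h - 1.0)

    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = src_x - x0
    wy = src_y - y0

    # Bilinear interpolation between the four surrounding feature vectors.
    top = (1.0 - wx) * feat[:, y0, x0] + wx * feat[:, y0, x1]
    bottom = (1.0 - wx) * feat[:, y1, x0] + wx * feat[:, y1, x1]
    return (1.0 - wy) * top + wy * bottom                  # (C, H, W)

rng = np.random.default_rng(0)
key_feat = rng.normal(size=(4, 5, 6))
flow = np.zeros((2, 5, 6))
flow[0] = 1.0                                   # every pixel looks one column to the right
warped = warp_features(key_feat, flow)
print(np.allclose(warped[:, :, :-1], key_feat[:, :, 1:]))
```

Because the warp only needs a flow array, the estimator behind it can be swapped freely. One thing to watch: if the feature map is smaller than the input frame, the flow has to be resized and its magnitudes scaled by the same factor.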

Of course, if you care about accuracy, you still need end-to-end optical flow. Then the computation explodes. I tried plugging LiteFlowNet into SlowFast for video semantic segmentation, dropped the batch size to 2, and still ran out of memory.

It nearly broke me mentally.

But there’s one more thing that made 2020 worth it.

The Temporal Shift Module.

In simple terms, it shifts a fraction of the feature channels along the temporal dimension, so each frame mixes in features from its neighboring frames without the heavy computation of 3D convolutions. The paper came out at the end of 2019, but it wasn’t until 2020 that a truly open-source version appeared.
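
The shift itself fits in a few lines. This is a minimal NumPy sketch of the idea, not the official module: the name `temporal_shift`, the `(T, C, H, W)` layout, and the default `fold_div=8` (one eighth of the channels shifted in each direction) are assumptions I picked for illustration.

```python
import numpy as np

def temporal_shift(x, fold_div=8):
    """
    x: (T, C, H, W) features for T frames of one clip.
    The first C // fold_div channels are shifted one step forward in time,
    the next C // fold_div channels one step backward, the rest are left alone.
    Positions with no neighbor are zero-padded.
    """
    fold = x.shape[1] // fold_div
    out = np.zeros_like(x)
    out[1:, :fold] = x[:-1, :fold]                    # frame t receives channels from t-1
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]    # frame t receives channels from t+1
    out[:, 2 * fold:] = x[:, 2 * fold:]               # untouched channels
    return out

clip = np.arange(4 * 8, dtype=np.float64).reshape(4, 8, 1, 1)
shifted = temporal_shift(clip, fold_div=4)
print(shifted[:, :, 0, 0])
```

The shift has no parameters and no multiplications. The ordinary 2D convolution that follows it does the actual mixing, which is where the temporal modeling comes from.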

I used it to replace the 3D convolutions in my video classification network—

Guess what happened?

Parameters were cut in half, and FPS improved by 60%!

That’s when I knew for sure: For temporal modeling, shifting is much smarter than stacking convolutions.

So if you’re asking for a clear recommendation, I’d say: if you’re working on video tasks, prioritize lightweight solutions like optical flow assistance or temporal shift modules in 2020. Don’t rush into 3D convolutions. Unless you have a distributed cluster, that stuff will burn a hole in your wallet.

---

Generative Models: Finally, Not Just Doodling Blindly in 3D

By 2020, GANs were starting to feel a bit stale to me, but the combination with 3D was truly attractive.

Take SynSin, for example.

It needs only a single RGB image to generate novel views, and it even supports continuous viewpoint changes. I used it for data augmentation: rendering object views that didn’t exist in the training set to train a detector, and it did improve generalization.

But the pitfall was clear: generation quality was limited by the texture of the source image, and objects near the edges tended to blur.

It felt like spending hours drawing a picture, only to smudge the edges with a shaky hand.

Another one worth mentioning is VIBE. It uses a temporal GAN to fix jitter in video human pose estimation. I tried integrating it into an action recognition pipeline, and

Cael Lee
