Pure-Vision Perception Reaches 56.9% NDS: BEVFormer Uses Temporal Information to Halve Velocity Error

Generated: 2026-06-22 05:29:37

---

Believe it or not, the first time I saw the title of the BEVFormer paper, I thought to myself, "Here we go again," just another Transformer knockoff, right? But then I ran the code, debugged it line by line, and was proven thoroughly wrong. This ECCV 2022 work hit 56.9% NDS on the nuScenes test set, a full 9 points higher than the previous best camera-based method. What does that mean? It was on par with LiDAR-based baselines at the time! Pure vision, folks!

I spent two whole weeks, from source code debugging to paper reproduction, and hit more potholes than speed bumps along the way. Today, I won't give you any fluff. Let's break down BEVFormer piece by piece, like chatting with a friend, and explain it clearly.

---

First Question: What Big Problem Did BEVFormer Actually Solve?

What's the biggest headache in pure vision for autonomous driving perception? Think about it: cameras capture 2D images, but we need the positions of objects in 3D space. How do you go from 2D to 3D? Traditional methods like LSS (Lift-Splat-Shoot) first guess the depth of each pixel, then "splat" the 2D features into 3D space. But depth prediction itself is inaccurate! One mistake leads to a cascade of errors.

BEVFormer took a different approach, and it's brilliant—it doesn't predict depth at all. Instead, it lets the model learn how to map image features to BEV space on its own. How? Using the attention mechanism of Transformers. In simple terms, each BEV grid cell "looks" at which areas of the image are relevant to it and then grabs those features. See? It's like teaching the model to "focus."

I noticed a particularly interesting phenomenon during testing: BEVFormer's ability to detect occluded objects was significantly better than LSS. Why? It incorporates temporal information. Think about it: an object blocked in the current frame might not have been blocked in the previous frame. The model uses temporal self-attention to bring in this historical information. It's like watching a movie—when a person is hidden behind a pillar, but a second ago they were next to it, your brain automatically fills in the gap. BEVFormer does the same thing!

Here's the counterintuitive insight: everyone thought the bottleneck in pure vision 3D perception was inaccurate depth prediction, but the real issue is the lack of temporal information. Guess what? In my tests, adding temporal information cut the velocity estimation error from 1.5 m/s down to 0.8 m/s, almost half!

---

Second Question: How Do BEVFormer's Three Core Modules Actually Work?

1. BEV Query—This Thing Is "Learnable"

BEVFormer initializes a learnable matrix in BEV space. During my tests, I used a 200×200 grid, with each cell corresponding to 0.5 meters in the real world. That means the BEV features cover a 100m × 100m area. Pretty big, right?

Each grid cell is a Query with 256 dimensions. These 200×200 = 40,000 Queries are the "questions" the model learns: each Query asks, "What's at this position?" Interestingly, the Queries aren't used bare. A learnable positional encoding is added to them so the model knows which BEV plane location each Query corresponds to. It's like sticking a coordinate label on each question: "Hey, you're asking about the grid cell 50 meters east!"
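To make the "coordinate label" idea concrete, here's a minimal sketch of how a grid index maps to a metric position. The helper name, the axis convention (ego vehicle at the grid center, columns along x, rows along y), and the constants are my own assumptions for illustration, using the 200×200 grid and 0.5 m cell size quoted above. It is not code from the BEVFormer repo.

```python
# Illustrative constants taken from the numbers quoted in the text.
BEV_H, BEV_W = 200, 200   # grid cells along y (rows) and x (columns)
CELL_M = 0.5              # metres covered by one cell


def cell_center_xy(row, col):
    """Return the (x, y) centre of a BEV cell in metres, ego vehicle at the origin.

    Hypothetical helper: assumes the grid is centred on the ego vehicle.
    """
    x = (col + 0.5) * CELL_M - BEV_W * CELL_M / 2
    y = (row + 0.5) * CELL_M - BEV_H * CELL_M / 2
    return x, y


num_queries = BEV_H * BEV_W                      # one query per cell
coverage_m = (BEV_W * CELL_M, BEV_H * CELL_M)    # metric extent of the grid
```

Each of the 40,000 Queries gets exactly one such position, which is what the positional encoding has to tell the model about.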

2. Spatial Cross-Attention (SCA)—How to "Fish" Features from Multi-View Images

This is BEVFormer's most core design, bar none.

For each point (x, y) on the BEV plane, the model samples 4 points along the z-axis. Why 4? The paper says 4 is enough. I tried 8, but the performance gain was minimal while the computation doubled—not worth it. See? That's engineering wisdom: getting the biggest payoff with the least cost.

These 4 reference points are projected onto each camera's image using the camera's intrinsic and extrinsic parameters. For points that project successfully, deformable attention is used to grab features. Here's a catch: during projection, many points fall outside the image. While debugging, I found that about 60% of reference points get filtered out! BEVFormer uses a mask mechanism to keep only valid projection points. It's like searching through a pile of photos and only looking at the ones that actually capture the target area.
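The project-then-mask step can be sketched with a plain pinhole camera model. Everything here is an assumption for illustration: the function names, the made-up intrinsics, and a front camera sitting at the ego origin (ego frame x forward, y left, z up; camera frame x right, y down, z forward). The real code works on batched tensors with the calibration matrices from the dataset.

```python
def project_point(point_xyz, cam_from_ego, intrinsics):
    """Project one ego-frame 3D point into an image with a pinhole model.

    cam_from_ego: 3x4 [R | t] rows taking ego coordinates to camera coordinates.
    intrinsics: (fx, fy, cx, cy) in pixels.
    Returns (u, v, depth), or None if the point is behind the camera.
    """
    x, y, z = point_xyz
    xc, yc, zc = (r[0] * x + r[1] * y + r[2] * z + r[3] for r in cam_from_ego)
    if zc <= 1e-5:
        return None
    fx, fy, cx, cy = intrinsics
    return (fx * xc / zc + cx, fy * yc / zc + cy, zc)


def valid_mask(points, cam_from_ego, intrinsics, width, height):
    """True for every point that lands inside the image, False otherwise."""
    mask = []
    for p in points:
        uvd = project_point(p, cam_from_ego, intrinsics)
        mask.append(uvd is not None and 0 <= uvd[0] < width and 0 <= uvd[1] < height)
    return mask


# Hypothetical front camera at the ego origin and made-up intrinsics.
FRONT_CAM_FROM_EGO = [[0, -1, 0, 0], [0, 0, -1, 0], [1, 0, 0, 0]]
INTRINSICS = (1000.0, 1000.0, 800.0, 450.0)

# One point straight ahead, one behind the car, one far off to the left.
points = [(10.0, 0.0, 0.0), (-10.0, 0.0, 0.0), (10.0, 20.0, 0.0)]
mask = valid_mask(points, FRONT_CAM_FROM_EGO, INTRINSICS, width=1600, height=900)
```

Only the point straight ahead survives for this camera. Repeat that for six cameras and it's easy to see why most reference points get thrown away for any single view.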

In implementation, each camera is handled separately. There's a loop in the code:


# bev_mask marks, per camera, which BEV queries have a reference point inside that image
for i, mask_per_img in enumerate(bev_mask):
    index_query_per_img = mask_per_img[0].sum(-1).nonzero().squeeze(-1)

This step finds the valid Query indices for each camera view. Then, for each Query, the results from the cameras that actually see it are averaged and passed through a linear layer to output the final BEV features. See? Multi-view information is fused together like this.
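In the paper's formulation, a Query's per-camera outputs are averaged over the views that hit it. Here's a toy version of that for a single Query. The function name and list-based data layout are my own, and the final linear layer is left out.

```python
def fuse_views(per_camera_features, per_camera_hit):
    """Average one query's features over the cameras whose mask entry is True.

    per_camera_features: one feature vector (list of floats) per camera.
    per_camera_hit: one bool per camera, True if the query projects into it.
    """
    dim = len(per_camera_features[0])
    total = [0.0] * dim
    hits = 0
    for feat, hit in zip(per_camera_features, per_camera_hit):
        if hit:
            hits += 1
            total = [t + f for t, f in zip(total, feat)]
    hits = max(hits, 1)  # queries no camera sees stay at zero instead of dividing by zero
    return [t / hits for t in total]


# Two cameras see this query, the third does not and is ignored.
fused = fuse_views([[2.0, 4.0], [4.0, 8.0], [100.0, 100.0]], [True, True, False])
```

Dividing by the number of hit cameras rather than by all six keeps a Query seen by one camera on the same scale as a Query in an overlap region seen by two.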

3. Temporal Self-Attention (TSA)—How to Use Historical Information

I didn't get this at first, but it clicked after looking at the code. TSA takes two inputs: the current frame's BEV Query (before spatial cross-attention) and the previous frame's BEV features.

But there's an alignment issue. The car is moving, so the previous frame's BEV features don't align with the current frame's BEV grid. What to do? Motion compensation. Specifically, based on the ego vehicle's pose change, a transformation matrix is computed. Then, using bilinear interpolation, the previous frame's BEV features are "shifted" into the current frame's coordinate system.

During testing, I found that this alignment operation has a huge impact on velocity estimation. Without temporal information, the velocity estimation error was 1.5 m/s; with it, it dropped to 0.8 m/s. See? That's the power of temporal information—it lets the model know objects are moving and how they're moving.
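Here's a stripped-down sketch of the alignment idea: resample the previous BEV map at shifted positions using bilinear interpolation. To keep it short I only handle translation (rotation is omitted), I express the ego motion directly in grid cells, and the function names are mine. Treat it as an illustration of the principle, not the repo's implementation.

```python
import math


def bilinear_sample(grid, x, y):
    """Sample a 2D grid (list of rows) at fractional column x and row y.

    Positions outside the grid contribute zero.
    """
    h, w = len(grid), len(grid[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yi, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yi < h and 0 <= xi < w:
                value += wy * wx * grid[yi][xi]
    return value


def align_prev_bev(prev_bev, shift_cols, shift_rows):
    """Resample prev_bev so static content lines up with the current grid.

    shift_cols / shift_rows: ego motion between the two frames, in cells.
    A static object that sat at column c before now sits at c - shift_cols,
    so the current cell c has to read the old map at c + shift_cols.
    """
    h, w = len(prev_bev), len(prev_bev[0])
    return [[bilinear_sample(prev_bev, c + shift_cols, r + shift_rows)
             for c in range(w)] for r in range(h)]


# A single static object in the previous frame; the ego car then moves one cell.
prev = [[0, 0, 0],
        [0, 0, 1],
        [0, 0, 0]]
aligned = align_prev_bev(prev, shift_cols=1, shift_rows=0)
```

After alignment the object shows up one column closer, which is exactly where a static object should be once the car has driven toward it. Anything that does not line up after this step is evidence of real object motion, and that's what the velocity head feeds on.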

---

Third Question: What Does the Input Data Look Like?

BEVFormer's input is a 6-dimensional tensor: (bs, queue, cam, C, H, W)

bs: batch size, I usually set it to 4
queue: number of consecutive frames, the paper uses 3 (current frame + two previous)
cam: 6 surround-view cameras
C, H, W: image channels, height, and width; I tested with 900×1600 images

Here's a detail: when queue > 1, only the current frame goes through the full forward and backward pass. Historical frames are run without gradients just to produce BEV features, and only those BEV features are kept. Otherwise, the GPU memory can't handle it. I tried setting queue to 5, and a single 3090 ran out of memory instantly! I had to go back to 3. See? That's a hard-learned lesson from real-world practice.
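For a rough feel of how queue scales things, here's a back-of-the-envelope calculation of the raw input tensor size. The helper name and the float32 assumption are mine. Note that this counts only the input images; backbone activations are much larger, so take it as a lower bound that grows linearly with queue, not as a prediction of when a GPU runs out of memory.

```python
def input_tensor_gib(bs, queue, cam, c, h, w, bytes_per_value=4):
    """Size in GiB of a (bs, queue, cam, C, H, W) tensor, float32 by default."""
    return bs * queue * cam * c * h * w * bytes_per_value / 2**30


# The settings quoted above, with queue = 3 and queue = 5 for comparison.
size_q3 = input_tensor_gib(bs=4, queue=3, cam=6, c=3, h=900, w=1600)
size_q5 = input_tensor_gib(bs=4, queue=5, cam=6, c=3, h=900, w=1600)
```

Going from 3 to 5 frames multiplies everything that depends on the frame count by 5/3, which is why keeping only BEV features for the historical frames matters so much.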

---

Fourth Question: What's Different Between Training and Inference?

This needs its own section. During training, the model uses a complete Encoder-Decoder structure. The Encoder generates BEV features, and the Decoder handles object detection. During inference, if all you need is the BEV features for another task head, you can keep only the Encoder part, because the BEV features themselves can be used for various downstream tasks (detection, segmentation, tracking).

I hit a snag: the training temporal length was 3, but during inference, there might not be enough historical frames. The code handles this—if historical frames are insufficient, the current frame's BEV Query does self-attention with itself. See? That's engineering fault tolerance. It's like driving: if your rearview mirror breaks, you can still look back.
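The fallback boils down to a few lines. This is a sketch with a hypothetical function name and placeholder values standing in for the feature tensors. The real code stacks tensors, but the branching logic is the point here.

```python
def temporal_inputs(bev_query, prev_bev=None):
    """Return the pair of feature maps temporal self-attention attends over.

    With no history available, the current queries stand in for the missing
    previous frame, so the layer degrades to plain self-attention.
    """
    if prev_bev is None:
        prev_bev = bev_query
    return [prev_bev, bev_query]


first_frame = temporal_inputs("current")                # no history yet
later_frame = temporal_inputs("current", "previous")    # history available
```

The nice part of this design is that the layer always sees two inputs, so the first frame of a sequence needs no special network path.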

---

Fifth Question: How Is the Loss Function Designed?

BEVFormer uses a two-stage matching process. In the first stage, the Hungarian algorithm is used for positive-negative sample matching. The matching criterion is minimizing the sum of Focal Loss (classification loss) + L1 Loss (regression loss). In the second stage, the final loss is computed for the matched positive samples.
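To show what the first stage does, here's a tiny brute-force matcher. Big caveats: it tries every assignment instead of running the Hungarian algorithm (fine for three predictions, hopeless at real scale), and the classification cost is simply the negative predicted probability, not a focal-style cost. All names and numbers are made up for illustration.

```python
from itertools import permutations


def l1(a, b):
    """L1 distance between two box parameter tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match(pred_boxes, pred_probs, gt_boxes, gt_labels):
    """Find the one-to-one assignment with the lowest total cost.

    Brute-force stand-in for the Hungarian algorithm.
    Cost per pair = -p(correct class) + L1 box distance (simplified).
    Returns a list of (pred_index, gt_index) pairs.
    """
    best_cost, best = float("inf"), None
    for preds in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(-pred_probs[p][gt_labels[g]] + l1(pred_boxes[p], gt_boxes[g])
                   for g, p in enumerate(preds))
        if cost < best_cost:
            best_cost, best = cost, preds
    return [(p, g) for g, p in enumerate(best)]


# Three predictions, two ground-truth objects; prediction 2 should stay unmatched.
pred_boxes = [(0, 0), (10, 10), (5, 5)]
pred_probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
gt_boxes = [(9, 10), (0, 1)]
gt_labels = [1, 0]
pairs = match(pred_boxes, pred_probs, gt_boxes, gt_labels)
```

Matched predictions become positives for the second stage, and every prediction left over is treated as a negative.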

Classification loss uses Focal Loss, and regression loss uses L1 Loss. This design is the same as Deformable DETR. I tried replacing Focal Loss with CrossEntropy Loss, and the performance dropped significantly. Focal Loss handles positive-negative sample imbalance well. See? Sometimes, the choice of loss function can make a difference of several points.
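Why does Focal Loss cope with imbalance so well? A scalar version makes it easy to see. This is a sketch of the standard binary focal loss with the commonly used defaults alpha = 0.25 and gamma = 2. I'm assuming those values for illustration; check the config you actually train with.

```python
import math


def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class, strictly between 0 and 1.
    target: 1 for a positive sample, 0 for a negative one.
    """
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_negative = focal_loss(0.1, target=0)   # background the model already gets right
hard_positive = focal_loss(0.1, target=1)   # object the model is missing
plain_ce_easy = -math.log(0.9)              # cross-entropy for the same easy negative
```

The (1 - p_t)^gamma factor shrinks the loss of the easy negative far below its cross-entropy value, so the thousands of empty Queries stop drowning out the handful of real objects.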

---

A Few Questions You Might Not Have Thought Of

Why Not Use Self-Attention for Spatial Feature Aggregation?

BEVFormer uses deformable attention, not regular self-attention. The reason is simple: 40,000 Queries doing self-attention would have a computational complexity of O(n²), which is unmanageable. Deformable attention only samples K points per Query (the paper sets it to 4), reducing complexity to O(nK). Think about it: n = 40,000, so n² = 1.6 billion while nK = 160,000, a 10,000x difference! This isn't optimization; it's a lifesaver.

How Does BEVFormer Compare to LSS?

Based on my tests: on the nuScenes validation set, BEVFormer's mAP is 5 points higher than LSS, and NDS is 7 points higher. The biggest improvements come from velocity estimation and occluded object detection. LSS has


Cael Lee
