Is Generative Recommendation a False Paradigm?

Generated: 2026-06-22 18:15:35

---

Generative Recommendation: My Hard-Earned Lessons and the Real Deal

Let me start with a true story. Brace yourself.

Last Wednesday at 11 PM, I nearly spilled coffee all over my keyboard! My team was going back and forth over an online experiment result—we swapped one recall channel for a generative approach, AUC went up 0.1%, and guess what? Online revenue dropped 0.3%! An operations guy came running over: "Does this thing even work?"

I told him, you know, people were asking me the exact same question ten years ago.

Back then I was a newbie working on two-tower recall. My boss stared at the experiment data, his face like a storm cloud: "All this deep learning you've been obsessing over, and it can't beat our 2016 FM with sparse LR?" I was furious, but the numbers were right there: it really wasn't as good. Deep learning at that point was clunky, risky, and prone to blowing up.

Now? Two-tower is the industry baseline. New grads treat it like it's as natural as breathing. History is a cycle. Think about it: the people who doubted deep learning back then weren't won over by arguments—they just gradually faded out of the industry. Like that old physics joke: the old guys who opposed relativity weren't convinced; they just died. Crude but true. And that's how it goes.

---

1. Is Generative Recommendation a False Paradigm or Not?

I'll give it to you straight: It's not a false paradigm, but it's also not a silver bullet.

Look at what everyone's publishing these days—the numbers are all pretty.

Baidu released GRAB this year: CTR up 3.49%, revenue up 3.05%. Xiaohongshu rolled out LASER: ADVV up 2.36%, revenue up 2.08%. LinkedIn's Feed-SR, with 1.2 billion members online, boosted engagement by 3.52%.

Impressive numbers, right? But look closer: not a single one claims to have fully replaced the traditional architecture. Kuaishou's OneRec is aggressive, and even they only deployed it in certain scenarios. Douyin's Mixtoken kept the discriminative objective and just added generative loss. In short, everyone is testing the waters. Nobody's going all in.
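To make "kept the discriminative objective and just added generative loss" concrete, here's a minimal sketch of what a hybrid objective of that shape could look like. This is not Douyin's actual implementation. The function names, the single-sample setup, and the `gen_weight` knob are all my own assumptions for illustration: a click-prediction loss stays as the main term, and a next-item (next-token) loss is added on top with a small weight.

```python
import math


def click_bce(p_click, label):
    """Discriminative term: binary cross-entropy on a predicted click probability."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(p_click, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def next_item_nll(logits, target_index):
    """Generative term: softmax cross-entropy over candidate next items."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def hybrid_loss(p_click, label, logits, target_index, gen_weight=0.1):
    """Keep the discriminative loss and add a weighted generative loss.

    gen_weight is a hypothetical hyperparameter, not a value from any paper.
    """
    return click_bce(p_click, label) + gen_weight * next_item_nll(logits, target_index)


# Toy example: one impression, three candidate next items (made-up numbers).
loss = hybrid_loss(p_click=0.8, label=1, logits=[2.0, 0.5, -1.0], target_index=0)
print(round(loss, 4))
```

The point of a setup like this is caution: with `gen_weight=0` you are back to the old discriminative model, so the generative part can be dialed in gradually instead of replacing anything.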

Why? Let me tell you from personal experience.

I tried reproducing HSTU. Following Meta's paper, I threw in all the features. Guess what? It performed worse than our model from three years ago! Turns out their side information fusion is different from ours. All the feature engineering we'd accumulated over three years? Had to be scrapped. Think about it—a team's three years of feature engineering experience, just thrown away? Of course the boss would flip out!

Even worse, some papers are simply impossible to reproduce. We tried ByteDance's HLLM, followed the paper as closely as we could, and it performed way worse than two-tower. We suspected the problem was in how the text samples were constructed, but with low-resource Southeast Asian languages, the cost of trial and error was just too high. Two months in, no results, and the boss pulled the plug. At that moment, I wanted to print the paper and burn it.

---

2. The Implementation Pitfalls Are Way More Than You'd Expect

Speaking of which, let me share a secret.

I've interviewed nearly two hundred people, asking about Transformer and GPU topics. About half of fresh grads can give a rough answer; fewer than 30% really get it. It's worse for experienced hires: fewer than 30% can explain it clearly, and maybe 10% give me the impression they truly understand.

What does that mean? The industry's knowledge base is still far from sufficient.

It's just like when deep learning first got hot—people who could write TensorFlow were rare. Later, that batch became the backbone of the industry, and the technology gradually became the baseline. Weng Jiayi once said something I completely agree with: "Every LLM infrastructure has bugs of varying degrees. The one that fixes the most bugs the fastest ends up training the best models."

The same goes for recommender systems. Once there's enough understanding of generative recommendation, and once the infrastructure bugs are mostly fixed, it will naturally become a strong baseline. But right now, the understanding isn't there, and the infrastructure isn't there either.

The most ridiculous thing I've seen (don't laugh): a major company's recommender system had an MFU (model FLOPs utilization) below 1%. You read that right: below 1%! All that compute wasted on a fragmented cascade architecture: recall, pre-ranking, ranking, re-ranking. Each layer has its own model, and each model is underfed, but together they gobble up resources.
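For anyone who hasn't run into the term: MFU (model FLOPs utilization) is the FLOPs your model actually performs per second divided by the hardware's theoretical peak. Here's a back-of-the-envelope sketch, assuming the common approximation of about 6 FLOPs per dense parameter per token for Transformer training. That approximation ignores attention FLOPs and embedding lookups, so for recommender models it is only a rough guide. The model size, throughput, and peak figure below are all made up for illustration.

```python
def approx_training_flops_per_sec(n_dense_params, tokens_per_sec):
    """Rough training FLOPs/s: ~6 FLOPs per dense parameter per token
    (forward + backward). An approximation, not a measurement."""
    return 6 * n_dense_params * tokens_per_sec


def model_flops_utilization(achieved_flops_per_sec, peak_flops_per_sec):
    """MFU = achieved FLOPs/s divided by the hardware's peak FLOPs/s."""
    if peak_flops_per_sec <= 0:
        raise ValueError("peak_flops_per_sec must be positive")
    return achieved_flops_per_sec / peak_flops_per_sec


# Hypothetical numbers: a 50M-dense-parameter ranking model processing
# 20,000 tokens/s on an accelerator with an assumed peak of 1e15 FLOPs/s.
achieved = approx_training_flops_per_sec(50_000_000, 20_000)
mfu = model_flops_utilization(achieved, 1e15)
print(f"MFU: {mfu:.2%}")  # prints "MFU: 0.60%"
```

Small models fed small batches land in exactly this range, which is how a cascade of underfed models can end up below 1% while still consuming a lot of hardware.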

That's the hard truth of the traditional architecture. But to change it? You'd have to touch the online inference system, the offline training system, the data system—a whole chain of middleware. In industry, technology decisions are made with extreme caution. Without solid business value, nobody dares to move.

---

3. Where Does Generative Recommendation Actually Excel?

To be honest, at this stage, generative recommendation doesn't blow traditional architectures out of the water in terms of effectiveness.

Traditional architectures, with their massive feature engineering and clever combination of pipelines, can achieve similar results. Add one generative recall channel, maybe you get a 0.5% lift. But the other side adds

Cael Lee
