Well-Known Large Models and Applications in China and Abroad: Model/Application Dimension, 2026/06/17 (English)

Generated: 2026-06-20 16:32:38

---

The AI Poker Table in 2026: I Finally See It Clearly

Let me start with a story.

Last week, I was having drinks with a friend who works in AI infrastructure. At some point, he suddenly asked me: "With all these large language models popping up one after another, which one should I actually use?"

I froze. Honestly, I've asked myself that same question a thousand times.

We're halfway through 2026, and you can name dozens of LLMs off the top of your head. China alone has over 1,500 different models. How's an ordinary person supposed to make sense of that kind of chaos?

So today, I'm not going to bother with any flashy rankings. I just want to share the real experience I've had over the past year of hands-on work.

China's Overachievers: They Took a Wrong Turn, Then Got It Right

First, something really interesting.

Last week, I gave the same piece of code to several models to complete. Guess what happened?

DeepSeek V4 Pro spat out code that was clean and sharp. The comments were more standardized than what my own intern writes, and it even attached anti-scraping tips out of courtesy. Honestly, it's so thoughtful it doesn't feel like an AI.

But when I tried Tongyi Qianwen Qwen3.6-27B, that thing actually asked me: "Do you need me to simulate the test environment for you?"

!!! Can you believe it? It proactively thought about the next step!

What I'd given it was the skeleton of a scraping script for it to flesh out. DeepSeek just filled in the code dutifully. Qwen3.6, though? It asked me if I needed a test environment and even came up with several exception-handling scenarios on its own.

In the past, only Claude could do that. Now domestic models have reached that level, too.

Later, I looked at Qwen3.6's technical report, and I almost dropped my phone: this small 27B model actually outperformed Qwen3.5, the family's previous flagship with over 400B parameters, on agentic programming! Over 400B, folks! A 27B going up against a 400B+? And winning?

See, parameters aren't everything. Architecture is where it's at.

But if you think all Chinese models are competing on the same track, you're dead wrong.

Take Zhipu GLM-5.2, for instance. It goes for a full-stack domestic solution—trained entirely on Huawei's Ascend chips. That makes it a hot commodity in the "Xinchuang" (domestic IT localization) space! It scores 77.8% on SWE-bench Verified, even higher than GPT-5.5, and the API price is just a fraction of GPT's. If you're working on government or state-owned enterprise projects, this is almost your only choice.

Speaking of which, I have to mention MiniMax. On June 1st, they released M3, which supports a full 1 million tokens of context. In programming benchmarks, it beats GPT-5.5 and Gemini 3.1 Pro.

You read that right. A Chinese open-source model, in American benchmarks, is wiping the floor with American flagships.

I tried it immediately. My feeling? Two words: big ambitions. A 1-million-token context window is no joke. I fed it the entire codebase of an open-source project, and it was able to refactor the whole architecture.

Who would have dared to imagine that before?

But to be honest, I'm not convinced it's stable in every scenario. Big players tend to "stack parameters first, fix bugs later" with these things. Once large numbers of users flood in, problems will start showing. Still, having this kind of ambition already earns my respect.

The International Players Are Playing a Different Game

Now, let's talk about the global giants. The landscape is pretty interesting.

Anthropic's Claude Fable 5 is basically synonymous with "the strongest model across the board" right now. Three first-place scores—Intelligence 60, Coding 62.0, Agentic 80.6! Look at that Agentic score: 80.6, almost 3 points higher than their previous generation Opus 4.8's 77.8.

I've been using it for two weeks, and my takeaway is simple: this model almost never produces code that won't compile. The safety controls are extremely strict—there's no way you're getting it involved in anything borderline—but for serious engineering projects, it's the ceiling.

But it's expensive, too. The API price for Opus 4.8 is a nightmare for your wallet. However, Sonnet 4.6 costs about 20% of Opus and is only 8 percentage points behind on SWE-bench—that's the real "sweet spot." I use Sonnet daily and only switch to Opus for really complex tasks.
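The "Sonnet by default, Opus only when needed" habit is easy to put into numbers. Here's a minimal sketch, assuming a made-up 0-to-1 complexity score and normalized prices (Opus = 1.0, Sonnet = 0.2, from the rough 20% figure above). The function names and the 0.8 threshold are my own illustration, not any vendor's API.

```python
# Relative per-task cost, normalized so the expensive tier is 1.0.
# The 0.2 comes from the "about 20% of Opus" figure; real prices will differ.
RELATIVE_COST = {"opus": 1.0, "sonnet": 0.2}


def pick_model(complexity: float, threshold: float = 0.8) -> str:
    """Route a task to a tier. `complexity` is a hypothetical 0-1 score."""
    return "opus" if complexity >= threshold else "sonnet"


def blended_cost(complexities: list[float], threshold: float = 0.8) -> float:
    """Average relative cost per task under this routing rule."""
    if not complexities:
        return 0.0
    total = sum(RELATIVE_COST[pick_model(c, threshold)] for c in complexities)
    return total / len(complexities)


if __name__ == "__main__":
    # Nine routine tasks and one genuinely hard one.
    tasks = [0.1, 0.2, 0.3, 0.3, 0.4, 0.5, 0.5, 0.6, 0.7, 0.95]
    print(f"{blended_cost(tasks):.2f}")
```

With nine routine tasks out of ten, the average cost comes out to about 28% of what you'd pay sending everything to the expensive tier.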

As for GPT-5.4, it has a particularly fun skill: Computer Use. Its OSWorld accuracy is 75%, surpassing the human baseline of 72.4%.

What does that mean? It means you can ask it to operate your computer—take screenshots, click buttons, fill out forms—and it can do it on its own. I asked it to help me batch-enter data into a SaaS system. I fed it 200 records, and it opened the page, logged in, filled in the forms, and submitted them. It even handled two CAPTCHA pop-ups along the way. The whole process took about 20 minutes with zero mistakes.

This is terrifying. While you're staying late fixing bugs, AI is already doing the grunt work for you. Frustrating, right?

But Google's Gemini 3.1 Pro has had some stumbles recently. Long document analysis is certainly its strong suit—it can chew through 1 million tokens easily, and its multimodality truly allows it to understand images. However, its stability is maddening. Sometimes, with the same prompt, the answer you get in the morning and the answer you get in the afternoon can be wildly different. Word is, they're in the middle of a massive internal restructuring, and the rapid iteration of model versions is causing instability.

If you're thinking of using it in production? I'd advise you to think twice.

Something else caused a stir in the community: xAI leased its Colossus 1 supercomputer to Anthropic. 220,000 Nvidia GPUs, $1.25 billion a month, leased until 2029.
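To get a feel for those numbers, here's some back-of-the-envelope arithmetic. It's only a sketch: it assumes the $1.25 billion is spread evenly across all 220,000 GPUs and that a month is about 730 hours, neither of which comes from the deal itself.

```python
GPUS = 220_000
MONTHLY_LEASE_USD = 1.25e9
HOURS_PER_MONTH = 730  # assumption: 365 * 24 / 12


def cost_per_gpu_month(monthly_usd: float = MONTHLY_LEASE_USD,
                       gpus: int = GPUS) -> float:
    """Monthly lease cost divided evenly across every GPU."""
    return monthly_usd / gpus


def cost_per_gpu_hour(monthly_usd: float = MONTHLY_LEASE_USD,
                      gpus: int = GPUS,
                      hours: float = HOURS_PER_MONTH) -> float:
    """Implied hourly rate per GPU, assuming round-the-clock use."""
    return cost_per_gpu_month(monthly_usd, gpus) / hours


if __name__ == "__main__":
    print(f"per GPU-month: ${cost_per_gpu_month():,.0f}")
    print(f"per GPU-hour:  ${cost_per_gpu_hour():.2f}")
```

That works out to roughly $5,700 per GPU per month, or a bit under $8 per GPU-hour.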

xAI rented its largest compute cluster to a direct competitor! What this means is that they've taken a step back in the frontier race. Grok 4.3 is indeed cheap—just $1.25 per million tokens for input, one of the lowest among flagship models—and its access to X's real-time data is unique. But the next generation, Grok 5, is still in training. After the restructuring and the compute leasing, can it still launch on time? Hard to say.

My Assessment and Some Practical Advice

After testing for most of the year and stepping into plenty of pitfalls, here are a few concrete conclusions.

First, don't fetishize parameters. Qwen3.6 at 27B beating Qwen3.5 at 400B+ tells you clearly: architectural efficiency and training data quality are what matter. Chasing parameter counts will just lead you into a numbers game.

Second, Chinese models are taking a different path: open source, low cost, fast scenario adaptation. On the SuperCLUE May ranking, DeepSeek V4 Pro scored 70.48 overall and 74.43 on reasoning, ranking first domestically. Doubao Seed-2.0-pro, meanwhile, scored 68.14 on application ability, with high marks in precise instruction following and hallucination control.

In short, Chinese models are transitioning from "can they fight?" to "are they easy to use?"

Third, the advantage of foreign flagships is narrowing. Claude Fable 5 and GPT-5.5 are indeed the ceiling, but their lead over everyone else is shrinking. In daily scenarios, the experience gap between paying $20 a month for ChatGPT and using free DeepSeek V4 is getting smaller. When I asked them to write the same scraper, DeepSeek was actually the more thoughtful one: it reminded me to watch out for anti-scraping strategies.

A year ago, would you have imagined that?

Fourth, my selection criteria have changed.

Cael Lee
