
I Tested LLM Coding Skills for 3 Months—Most Chinese Models Collapsed at Round Three

By Cael Lee | 9 min read


The winner? GPT-5.5 took top marks in my V3 programming test.

But honestly? That's not the interesting bit.

The interesting bit is that most Chinese large language models couldn't get past the third round of Project C. What's Project C? A macOS OpenGL renderer written in Swift. Not rocket science, but not trivial either. And no matter how many times you rewound, cleared context, or started fresh, they just couldn't finish.

Seriously.

Here's the thing—this all started back in September. I wrote a teaser post about a new V3 testing framework I was designing. The goal was dead simple: score LLMs on programming ability using criteria that actually resemble real engineering work.

Not LeetCode puzzles. Actual project building. From scratch.

And then? It took me until now to publish the first results.

The reason's almost comical—LLM companies are shipping new models faster than I can test them. Each model takes 3 to 4 days per project, and with three projects that's nearly a fortnight. By the time I'd finish testing one, three new models would drop.

I simply couldn't keep up.

Absolutely mental.

How the Tests Actually Work

For this first batch, I picked three projects that barely overlap.

Project C is a macOS OpenGL renderer in Swift—tests niche language handling, graphics programming, and heavy interactivity. Project D is a full-featured chat app built on Flutter with a Golang backend—covers mobile dev, databases, and network communication. Project E is a web-based video editor (tech stack chosen by the model)—probes frontend skills, audio/video processing, and complex state management.

Each project gets broken into 10 to 12 prompts, roughly 1,500 to 2,000 words each. That word count—sorry, I should call it information density—was something I obsessed over. In my earlier A and B test projects, my prompts were too sloppy. Ambiguous. Models kept misinterpreting what I wanted. This time I wrote every requirement like a product spec. Zero room for confusion.

The testing process simulates what I call "vibe coding."

I know exactly where the bug is. But I don't tell the model. I just describe the symptoms: the page is blank, the log shows this error, it freezes when you click here. Figure it out yourself. Just like real development—your colleague slacks you "this feature's broken" and you've got to diagnose it from scratch.

Three rounds to fix it. If it can't? Rewind. Clear the code. Clear the context. Start over. Three more rounds. Still broken? Test terminated. That model has officially failed this project.
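To make the rule concrete, here's a minimal sketch of that retry logic in Python. It is not my actual harness: `attempt_fix` is a hypothetical callback standing in for "describe the symptoms to the model and check whether the bug is gone", and the constant names are made up for illustration.

```python
from typing import Callable

MAX_FIX_ROUNDS = 3  # from the rule above: three rounds to fix a bug
MAX_ATTEMPTS = 2    # the original attempt plus one rewind


def run_step(attempt_fix: Callable[[int, int], bool]) -> str:
    """Apply the three-rounds, rewind, three-more-rounds rule to one bug.

    attempt_fix(attempt, round_no) is a hypothetical callback: it sends the
    symptom description to the model and returns True if the bug is fixed.
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        # Starting a new attempt means code and context were cleared (the rewind).
        for round_no in range(1, MAX_FIX_ROUNDS + 1):
            if attempt_fix(attempt, round_no):
                return f"fixed (attempt {attempt}, round {round_no})"
    # Six failed rounds in total: the model fails this project.
    return "terminated"
```

So a model gets at most six repair rounds per bug, split across two clean starts, before the project is marked as failed.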

This standard's stricter than real-world usage, if I'm honest. In practice, you'd probably try a different prompting strategy or just fix the damn thing yourself. But testing needs a consistent benchmark. And let's be real—if a model can't fix something in three rounds, you'd likely switch models anyway. Same outcome, basically.

Some Properly Interesting Findings

The V3 leaderboard is only one piece of my evaluation system.

In a separate logical reasoning test, I've been tracking something called autonomous self-repair. Since September, I've changed one rule: when a model's first attempt has syntax errors or runtime exceptions, I don't fix them manually anymore. Instead, I feed the error message back and give it one shot at self-repair.

This change matters.

In real agent setups, models work alongside linters and runtime tools—they can already catch these "hard" failures. My old single-round scores were a bit... what's the word? Artificial. Not realistic enough.
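Roughly, the one-shot repair rule looks like the sketch below. It assumes the candidate is Python source run in-process, which is a simplification: my real problems span several languages, and this sketch has no timeout handling. `repair` is a hypothetical callback that would send the source and the error text back to the model and return its revised code.

```python
import traceback
from typing import Callable, Optional


def run_once(source: str) -> Optional[str]:
    """Execute the candidate; return None on success, else the error text."""
    try:
        exec(compile(source, "<candidate>", "exec"), {})
        return None
    except Exception:
        # Covers both syntax errors (raised by compile) and runtime exceptions.
        return traceback.format_exc(limit=1)


def score_with_self_repair(source: str, repair: Callable[[str, str], str]) -> str:
    """Give the model exactly one chance to repair a hard failure."""
    error = run_once(source)
    if error is None:
        return "pass_first_try"
    repaired = repair(source, error)  # hypothetical call back to the model
    if run_once(repaired) is None:
        return "pass_after_repair"
    return "fail_after_repair"
```

The point of the design is that the model only ever sees what a linter or runtime would hand it anyway: the error text, nothing more.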

The results surprised me.

GPT-5, Sonnet 4, the Gemini series, Grok 4—they all self-repaired almost every basic exception. Sonnet 4 Think tripped up twice: once on a C# algorithm problem that timed out (still timed out after the fix), once on a C++ syntax error it couldn't resolve. This backs up what I've seen elsewhere—Sonnet 4's C++ skills are genuinely weaker.

Anyway, back to the Chinese models.

DeepSeek V3.1 Think and Doubao 1.6 Thinking performed best. Among non-reasoning models, Kimi K2 was the standout—though it had three repair failures, all Golang "declared but not used" variable errors. K2 would fix one, which would cascade into another unused variable elsewhere. Like whack-a-mole. In a real multi-round agent setup, these get caught and fixed pretty easily, so if I strip those out, K2's post-repair error rate drops to 2.78%.

That's genuinely competitive.

Qwen3 Coder, though? Bit grim. Initial error rate of 11% isn't terrible, but the repair round only handled about half of them, leaving 6.4%. And the remaining errors were nasty ones—timeouts, heap exceptions, the kind of thing extra rounds probably won't solve. It's not that the model's not clever enough. The foundations are just a bit shaky.
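For anyone who wants to check the arithmetic, here's a quick sketch using only the two figures quoted above (11% before repair, 6.4% after). Both are rounded, so treat the result as approximate; the function name is just for illustration.

```python
def repair_rate(initial_error_pct: float, remaining_error_pct: float) -> float:
    """Share of initial errors that the single repair round removed."""
    return (initial_error_pct - remaining_error_pct) / initial_error_pct


# Figures quoted in the post for Qwen3 Coder.
rate = repair_rate(11.0, 6.4)
print(f"{rate:.0%}")  # prints 42%
```

So "about half" is, strictly, a little under half: roughly 42% of the initial errors were repaired.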

Doubao 1.6 Thinking had one stat that really stands out: post-repair scores jumped 9.2 points. What does that tell you? The model's being held back by basic errors. Fix those, and usability shoots up immediately. Meanwhile, GPT-5, Kimi K2, and GLM 4.5 barely improved with repairs. With those, if the first output's not good enough, you're better off regenerating from scratch than patching.

Effective, yeah. But it burns through tokens like mad.

April's Logic Problems Got Harder

Starting in April, all new models get tested at their highest tier. I used to standardise on "high" because most models topped out there. Now every company's gone mad—xhigh, max, ultra tiers everywhere—and the old "high" tier's actually been nerfed. So the rules had to evolve.

This month I retired two old problems and introduced two new ones. Difficulty went through the roof.

Problem #61 tests insight. You get a chunk of compressed text and need to find the original. You can't brute-force this—you have to spot the pattern. Only GPT-5.4/5.5 and Opus 4.6 scored perfectly every time. Among Chinese models, DeepSeek V4 Pro sometimes got full marks. GPT and Opus averaged under 20K tokens. DS V4 Pro was less efficient in its reasoning—some local brute-forcing crept in—burning through 60K tokens.

Three times the tokens. Fast, but expensive.

Kimi K2.6, GLM-5.1, and DeepSeek V4 Flash solved about half the cases, averaging 50K tokens. The Qwen 3.6 series, Seed 2.0 Pro, Gemini 3 series? They mostly only caught the obvious fallback cases—and hallucinated badly along the way. Midway through reasoning, they'd lost track of the original problem entirely. Just dreaming up whatever.

Total collapse.

Problem #62 tests instruction following. Relatively simple. But I designed it with loads of "negative" constraints, so the model has to avoid all the "don't do this" traps. GPT-5.4/5.5, DeepSeek V4 Pro, Gemini 3.1 Pro, and Seed 2.0 Pro got full marks. GLM-5.1, Qwen 3.6-Max, Opus 4.6, and Kimi K2.6 were inconsistent, with different outputs across multiple passes. Hy3, which has had specific instruction-following training, did alright, scoring about half. Everyone else either couldn't parse the question, couldn't juggle all the constraints simultaneously, or hallucinated wildly.

This problem—this type of problem—isn't hard per se. But it absolutely filters the field.
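Since "inconsistent across passes" keeps coming up, here's one way multi-pass results could be summarised. This is a sketch, not the scoring I actually use: the 100-point scale, the function name, and the three summary fields are all assumptions for illustration.

```python
from statistics import mean, pstdev


def summarize_passes(scores: list[float], full_marks: float = 100.0) -> dict:
    """Summarise one model's scores over repeated passes of the same problem."""
    return {
        "mean": mean(scores),      # average score across passes
        "spread": pstdev(scores),  # 0 means every pass scored the same
        "perfect_rate": sum(s == full_marks for s in scores) / len(scores),
    }


# Hypothetical example: one perfect pass out of four.
lottery = summarize_passes([100, 0, 0, 0])
```

A model that gets full marks every time shows a spread of 0 and a perfect rate of 1.0; a model that occasionally nails it but mostly fails shows a low mean with a large spread.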

May: Qwen 3.7 Max Goes Absolutely Mental

I retired four problems in May.

Why? Because Qwen 3.7 Max was so dominant that the old problems stopped providing useful differentiation. I needed fresh challenges to find its ceiling.

Qwen 3.7 Max passed the new problems handily. Meanwhile, most other Chinese models saw their scores dip to varying degrees.

Problem #63 builds on my old Rubik's cube rotation concept. I designed it with a state machine—the model has to precisely track every intermediate step and the original rules across a long sequence. Five models scored perfectly: GPT-5.5, Opus 4.8, DeepSeek V4 Pro, Gemini 3.5 Flash, Qwen 3.7 Max. These are the ones with solid hallucination suppression.

Kimi K2.6 and GLM-5.1 kept making small errors, only managing half marks. Hy3 Preview surprised me—it scored similarly and stayed reasonably stable across multiple passes. Seed 2.0 Pro occasionally got close to perfect, but most runs were completely wrong. Like playing the lottery.

Problem #64 was the first time I used vibe coding to build the testing environment itself. Before this, everything was manual—hand-crafted problems, hand-validated answers. Painfully slow. Like working in a Victorian workshop. This time I used models to help generate the test, and efficiency shot through the roof. The problem tests spatial reasoning. Models with weak spatial intuition have to simulate step-by-step deduction, and the margin for error is tiny.

GPT-5.5: still flawless. Opus 4.8/4.6, DeepSeek V4, and Gemini 3.5 Flash/3.1 Pro occasionally got perfect scores but weren't consistent. Qwen 3.7 Max struggled with this type of problem: its inefficient simulated deduction produced small errors that cost it full marks. The rest? Burned tens of thousands of tokens and still couldn't work out the spatial relationships. Sleepwalking.

Back to the V3 Leaderboard

When you piece all these fragments together, the picture gets clear.

Logic problems test reasoning depth, hallucination control, and instruction following. Programming tests check engineering ability, multi-turn interaction, and autonomous repair. Two dimensions that validate each other. GPT-5.5 sits at the top of both. The Opus series crushes logic problems, but how does it perform on the programming tests? I don't know yet, because I haven't finished testing it. It's painfully slow. Each model, each project, takes at least 2 hours. Millions of input tokens, hundreds of thousands of output tokens. When the API's congested, it takes even longer.

I can't wait. Literally—I don't have the time.

Among Chinese models, DeepSeek V4 Pro and Qwen 3.7 Max can now touch the top tier on logic problems. But in V3 programming tests, most Chinese models can't even clear Project C's third round.

This honestly gives me pause.

Benchmark scores and real work are two different things entirely. I've been writing this column for nearly a decade. I've tested, I don't know, eighty, ninety models? Every time a new one drops with those gorgeous benchmark numbers, I think about the Chinese models stuck at Round 3 in my V3 tests. The higher the expectations, the harder the disappointment hits, I suppose.

But it's not all doom and gloom. Qwen 3.6-Plus is a proper example. Looking back across 2025, the Qwen team's performance was genuinely impressive. By 2026, they were shipping five models in three days, racing to catch up. The rat race—hyper-competition, whatever you want to call it—sometimes works. Qwen 3.6-Plus improved massively on programming. For routine frontend/backend work and web apps, it's already better than Sonnet 4.5 and GLM-5.0/MiniMax.

It's just that the agent revolution arrived too fast. The once-weaker competitors have already eaten their fill.

A brilliant student who writes poor code isn't stupid—nobody taught them properly. Now they've learned, but the landscape's already shifted.

What do you reckon?


What's been your experience with LLMs on multi-turn coding tasks? Have you hit the third-round wall? Drop a comment—I read every single one, even if it takes me three months to reply.


Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
