GPT-5 Mini Crushes Sonnet4.5 Across the Board: October LLM Coding Rankings, New Testing Method Reveals the Truth

Generated: 2026-06-22 13:58:45

---

Large Language Model Programming Capability Evaluation: October 2025 Rankings – I Changed My Testing Method

Last Friday night, I was coding in a café when a guy in a plaid shirt sitting next to me was losing his mind staring at his screen—error messages flooding the console, and he had that "I'm about to smash this laptop" look. I leaned over, glanced at it, and said with a smile, "Try throwing this error to an AI and let it fix itself." He was skeptical but copied and pasted it in. Guess what? Three minutes later, the code ran. His eyes went wide: "You can use it like that?"

See, a lot of people still treat AI like a one-shot code generator: throw in a requirement, wait for it to spit out code, then slog through the debugging themselves. But the real trick is to let it fix its own mess! That's exactly what inspired me to change my testing method this time.

---

Why Change the Testing Method?

I've been writing tech columns since 2015, and I've been evaluating LLMs for coding since the GPT-4 days. I update the rankings every month, and this is the tenth edition. But honestly, I've always felt something was off with my previous testing method.

Previously, I let the model output code in one shot, then manually fixed obvious syntax errors, like a missing colon in Python or an unused variable in Golang. After fixing, I'd run the tests.

What's the problem? Think about it—who actually uses AI like that in real development? You paste code directly into your IDE, run lint, hit run, and error messages pop up immediately. The model can totally fix itself through multi-turn conversations! You're not the AI's nanny, so why should you clean up after it?

So this month, I changed the game: If the first round of code has syntax errors or runtime exceptions, I don't touch it. I just throw the error message back and let the model fix itself. After each round of fixes, I re-score it until it passes or the model completely gives up.

This measures the performance you can actually get in real-world use. In other words, you play the hands-off boss and let the AI be the repairman.
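The loop described above can be sketched roughly like this. It is a minimal illustration, not my actual harness: `ask_model` and `run_tests` are hypothetical callables standing in for a model API and a test runner, and the 3-round cap is taken from the test configuration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RoundResult:
    round_no: int
    passed: bool
    error: Optional[str]  # error text fed back to the model, None if it passed


def run_repair_loop(
    task: str,
    ask_model: Callable[[List[str]], str],      # hypothetical: messages -> code
    run_tests: Callable[[str], Optional[str]],  # hypothetical: code -> error or None
    max_rounds: int = 3,
) -> List[RoundResult]:
    """Ask for code, run it, and feed any error back until it passes or rounds run out."""
    messages = [task]
    history: List[RoundResult] = []
    for round_no in range(1, max_rounds + 1):
        code = ask_model(messages)
        error = run_tests(code)  # None means every test case passed
        history.append(RoundResult(round_no, error is None, error))
        if error is None:
            break
        # No manual fixes: the raw error message is the only feedback.
        messages.append(code)
        messages.append(f"The code failed with:\n{error}\nPlease fix it.")
    return history


# Stand-ins for demonstration only: broken code first, fixed code after seeing an error.
def fake_model(messages: List[str]) -> str:
    return "fixed" if len(messages) > 1 else "broken"


def fake_tests(code: str) -> Optional[str]:
    return None if code == "fixed" else "SyntaxError: invalid syntax"


history = run_repair_loop("Write a function", fake_model, fake_tests)
print([(r.round_no, r.passed) for r in history])  # [(1, False), (2, True)]
```

The point of the design is that the human never edits the code between rounds; the only thing that changes from round to round is the error text appended to the conversation.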

---

Test Configuration

Test Period: October 5–7, 2025 (three days, pulled all-nighters)
Languages: TypeScript, Java, Golang, Python, C# (6 questions each, 30 total)
Models: GPT-5 Mini, Sonnet4.5, Gemini 2.5 Pro/Flash, DeepSeek V3.2 (reasoning/non-reasoning)
Context Window: 32K for every model, with 8K output
Scoring: Full marks if all test cases pass within 3 rounds; points deducted per round
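
To make the scoring rule concrete, here is a small sketch. The size of the per-round deduction and the handling of partial passes are not specified above, so `full_marks=100.0` and `penalty_per_extra_round=10.0` are placeholder assumptions, not the values used in the evaluation.

```python
from typing import Optional


def score_question(
    rounds_to_pass: Optional[int],
    full_marks: float = 100.0,               # assumed scale
    penalty_per_extra_round: float = 10.0,   # placeholder deduction, not the real value
    max_rounds: int = 3,
) -> float:
    """Score one question; rounds_to_pass is None if the tests never passed."""
    if rounds_to_pass is None or rounds_to_pass > max_rounds:
        return 0.0
    # Round 1 keeps full marks; each additional round costs one penalty.
    return full_marks - penalty_per_extra_round * (rounds_to_pass - 1)


print(score_question(1))     # 100.0
print(score_question(3))     # 80.0
print(score_question(None))  # 0.0
```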

---

Overall Results: Who's the Real King?

Let's get to the conclusion first. GPT-5 Mini leads across the board, Sonnet4.5 follows closely, and the Gemini series and DeepSeek V3.2 each have their own weaknesses. But what surprised me most isn't the scores. It's that some models get worse the more they fix, while others get more stable. Guess which is scarier?

Here's a table with the specific data (don't skip it, there's a story behind it):

Model	Overall Score	First-Round Pass Rate	Average Rounds	Fix Stability	Refactoring Performance
GPT-5 Mini	96.5	60%	1.4	Stable improvement	Conservative
Sonnet4.5 (non-reasoning)	93.2	43%	1.8	Stable	Excellent
Gemini 2.5 Pro	88.1	37%	2.1	Unstable (3 regressions)	First-round always fails

See that? GPT-5 Mini's 60% first-round pass rate is the highest in the table. Sonnet4.5 passes only 43% in the first round, yet its final score is just 3.3 points behind. Why? Because Sonnet is incredibly good at fixing bugs. It's like an experienced old doctor: the first diagnosis might be off, but the second one hits the nail on the head.

---

Breakdown of Each Model's Performance: Who's Slacking Off, Who's Going All Out?

GPT-5 Mini: Rock-Solid

Out of 30 questions, only 12 went to the second round, and 3 to the third. And every time it modified the code, the score went up, with none of that "making the code worse" nonsense. It's like a veteran driver who never overtakes but never crashes.

But there's a catch: too conservative on refactoring tasks. Question 16 was a 500-line Java class that needed refactoring. GPT-5 Mini basically made local tweaks, afraid to touch the structure. It ended up with 300 lines that ran, but it fell short of what "refactoring" implies—it was more like "fine-tuning" than "rewriting."

In short, it puts safety first. But if you ask it to race on a track, it'll just hit the brakes.
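
As a quick sanity check on the round counts quoted for GPT-5 Mini, the arithmetic can be worked out like this. It assumes that every question reaching round 3 also went through round 2, which is my reading of the numbers rather than something stated explicitly.

```python
# Counts taken from the GPT-5 Mini paragraph above.
total_questions = 30
reached_round_2 = 12  # questions that needed a second round
reached_round_3 = 3   # questions that also needed a third round

# Assumption: every question that reached round 3 also passed through round 2.
finished_in_1 = total_questions - reached_round_2
finished_in_2 = reached_round_2 - reached_round_3
finished_in_3 = reached_round_3

first_round_pass_rate = finished_in_1 / total_questions
average_rounds = (
    finished_in_1 * 1 + finished_in_2 * 2 + finished_in_3 * 3
) / total_questions

print(f"first-round pass rate: {first_round_pass_rate:.0%}")  # 60%
print(f"average rounds: {average_rounds:.1f}")                # 1.5
```

Under this reading the pass rate matches the table's 60%, while the average comes out at 1.5 rather than the listed 1.4, so the table may be counting or rounding rounds slightly differently.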

Sonnet4.5: The Efficiency King

What impressed me most about Sonnet wasn't the score—it was speed. In non-reasoning mode, it averaged 30 seconds per output, the third fastest among the tested models. Think about it: if a coding assistant makes you wait two minutes every time, who can stand that? By the time you finish your coffee, it's still not done, and your coffee's cold.

But Sonnet has a problem: it tends to stumble on the first round. On complex problems, the first-round pass rate is only 43%, and it often needs to go to the third round to get full marks. However, for refactoring and code migration tasks, it basically passes in one go, and its approach is more aggressive than GPT-5 Mini's: it actively looks for better architectures. It's like a young designer whose first proposal might be off track, but whose second one surprises you.

The non-reasoning version had one question where the final code would not compile, but I think one more round would probably have fixed it. It was just a bit unlucky.

Gemini 2.5 Pro/Flash: Decent Scores, Terrible Experience

The Gemini series didn't score low overall, but they have a bunch of issues. It's like that classmate who can score 85 on exams but always copies the wrong line in homework.

The most annoying thing is multi-round instability. Pro and Flash each had 3 questions where the score dropped in the third round compared to the first. Pro had 4 questions that went in circles, and Flash was even worse—7 questions did that. Imagine asking it to fix a bug, and it introduces bugs in places that were fine, making things worse—isn't that the infamous "computer repair syndrome"?
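
One way to spot these patterns from per-round scores is sketched below. The definitions are my own working assumptions for illustration: a "regression" means the last round scored lower than the first, and "circling" means a score returns to an earlier value after having changed in between. The sample scores are made up.

```python
from typing import List


def classify_trajectory(scores: List[float]) -> str:
    """Label a question's per-round scores (definitions are illustrative assumptions)."""
    if len(scores) < 2:
        return "single round"
    if scores[-1] < scores[0]:
        return "regression"  # ended lower than it started
    # Circling: a score comes back to an earlier value after changing in between.
    for i in range(len(scores)):
        for j in range(i + 2, len(scores)):
            if scores[i] == scores[j] and any(s != scores[i] for s in scores[i + 1:j]):
                return "circling"
    if scores[-1] > scores[0]:
        return "improved"
    return "unchanged"


# Made-up score trajectories, one list per question.
print(classify_trajectory([80, 80, 40]))  # regression
print(classify_trajectory([60, 80, 60]))  # circling
print(classify_trajectory([50, 70, 90]))  # improved
```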

Refactoring tasks were a disaster. Pro and Flash always made mistakes in the first round—breaking original logic, introducing basic syntax errors. It usually took 2-3 rounds to recover. It's like asking an intern to refactor code, and on day one they delete the database table.

The non-reasoning version of Flash repeatedly made syntax errors in Golang. I started to suspect it had a personal grudge against Golang. Three times in a row with the same error—I almost thought it was trying to annoy me on purpose.

DeepSeek V3.2: The Reasoning Version Is Actually Worse

V3.2 is also a heavy hitter in the "gets worse with fixes" category. The reasoning version is even more extreme: sometimes it overthinks, makes more errors, and improves its score less than the non-reasoning version does. It's like a top student who overcomplicates a simple problem during an exam and gets it wrong.

But on complex problems, the reasoning version does have a higher initial score—after all, it has the advantage of deep thinking. However, once it comes to detail-oriented scenarios like refactoring, V3.2's context hallucination starts acting up. The more it reasons, the more mistakes it makes, and it uses more rounds than the base version. For question 16's Java refactoring, the first two rounds scored steadily, and the third round got zero. Can you believe it? It's like it's fighting with itself, determined to get confused.

To be fair, V3.2 is well balanced among Chinese models: except for C# being a bit weak, it doesn't have obvious biases in other languages. But if you rely on it to fix bugs, better have some heart medication ready.

---

My Usage Recommendations: Choose

DeepSeek V3.2 (non-reasoning)	85.4	33%	2.3	Unstable	Severe hallucination

Cael Lee
