GPT-5 Mini Crushes Sonnet4.5 Across the Board: October LLM Coding Rankings, New Testing Method Reveals the Truth
Generated: 2026-06-22 13:58:45
---
Large Language Model Programming Capability Evaluation: October 2025 Rankings – I Changed My Testing Method
Last Friday night, I was coding in a café when a guy in a plaid shirt sitting next to me was losing his mind staring at his screen—error messages flooding the console, and he had that "I'm about to smash this laptop" look. I leaned over, glanced at it, and said with a smile, "Try throwing this error to an AI and let it fix itself." He was skeptical but copied and pasted it in. Guess what? Three minutes later, the code ran. His eyes went wide: "You can use it like that?"
See, a lot of people still treat AI like a one-shot code generator: throw in a requirement, wait for it to spit out code, then slog through the debugging themselves. But the real trick is to let it fix its own mess! That's exactly what inspired me to change my testing method this time.
---
Why Change the Testing Method?
I've been writing tech columns since 2015, and I've been evaluating LLMs for coding since the GPT-4 days. I update the rankings every month, and this is the tenth edition. But honestly, I've always felt something was off with my previous testing method.
Previously, I let the model output code in one shot, then manually fixed obvious syntax and compile errors, like a missing colon in Python or an unused variable in Golang. After fixing those, I'd run the tests.
What's the problem? Think about it—who actually uses AI like that in real development? You paste code directly into your IDE, run lint, hit run, and error messages pop up immediately. The model can totally fix itself through multi-turn conversations! You're not the AI's nanny, so why should you clean up after it?
So this month, I changed the game: If the first round of code has syntax errors or runtime exceptions, I don't touch it. I just throw the error message back and let the model fix itself. After each round of fixes, I re-score it until it passes or the model completely gives up.
This measures the performance you can actually expect in real-world use. In other words, you play the hands-off boss and let the AI be the repairman.
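To make the new flow concrete, here is a minimal Python sketch of the "throw the error back" loop. It is not my actual harness: `fix_loop`, `ask_model`, `run_tests`, and `syntax_check` are hypothetical names, the prompt wording is made up, and the 3-round cap is borrowed from the scoring rule in the test configuration.

```python
from typing import Callable, Optional, Tuple

MAX_ROUNDS = 3  # assumption: mirrors the "within 3 rounds" scoring rule


def fix_loop(
    task: str,
    ask_model: Callable[[str], str],               # hypothetical: prompt -> code
    run_tests: Callable[[str], Tuple[bool, str]],  # hypothetical: code -> (passed, error text)
    max_rounds: int = MAX_ROUNDS,
) -> Optional[int]:
    """Return the round in which the code passed, or None if it never did."""
    prompt = task
    for round_no in range(1, max_rounds + 1):
        code = ask_model(prompt)
        passed, error = run_tests(code)
        if passed:
            return round_no
        # No manual cleanup: the raw error goes straight back to the model.
        prompt = (
            f"{task}\n\nYour previous code:\n{code}\n\n"
            f"It failed with:\n{error}\nPlease fix it."
        )
    return None


def syntax_check(code: str) -> Tuple[bool, str]:
    """Toy stand-in for a real test runner: only checks that the code compiles."""
    try:
        compile(code, "<candidate>", "exec")
        return True, ""
    except SyntaxError as exc:
        return False, f"SyntaxError: {exc}"


if __name__ == "__main__":
    # Fake model: broken code first, corrected code after seeing the error.
    replies = iter(["print('hi'", "print('hi')"])
    print(fix_loop("Print hi", lambda prompt: next(replies), syntax_check))
```

In a real run, `run_tests` would compile and execute the candidate against the question's test cases in the target language. The point of the sketch is only that the human never edits the code between rounds.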
---
Test Configuration
- Test Period: October 5–7, 2025 (three days, pulled all-nighters)
- Languages: TypeScript, Java, Golang, Python, C# (6 questions each, 30 total)
- Models: GPT-5 Mini, Sonnet4.5, Gemini 2.5 Pro/Flash, DeepSeek V3.2 (reasoning/non-reasoning)
- Context Window: Unified 32K, output 8K
- Scoring: Full marks if all test cases pass within 3 rounds; points deducted per round
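
One way to read the scoring rule is sketched below in Python. The exact deduction is not spelled out above, so `penalty_per_round=10.0` and `full_marks=100.0` are placeholder assumptions, and `score_question` is a hypothetical helper. The sketch also assumes a first-round pass keeps full marks, each extra fix round costs a fixed amount, and a question that is still failing after round 3 scores zero.

```python
from typing import Optional


def score_question(
    passed_round: Optional[int],
    full_marks: float = 100.0,         # assumption: per-question maximum
    penalty_per_round: float = 10.0,   # assumption: placeholder deduction
    max_rounds: int = 3,               # from the test configuration
) -> float:
    """Score one question from the round in which it passed (None = never passed)."""
    if passed_round is None or passed_round > max_rounds:
        return 0.0
    if passed_round < 1:
        raise ValueError("passed_round must be >= 1")
    # Deduct once for every fix round beyond the first attempt.
    return max(0.0, full_marks - penalty_per_round * (passed_round - 1))


if __name__ == "__main__":
    for rounds in (1, 2, 3, None):
        print(rounds, score_question(rounds))
```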
---
Overall Results: Who's the Real King?
Let's start with the conclusion. GPT-5 Mini leads across the board, Sonnet4.5 follows closely, and the Gemini series and DeepSeek V3.2 each have their own weaknesses. But what surprised me most isn't the scores. It's that some models get worse the more they fix, while others get more stable. Guess which is scarier?
Here's a table with the specific data (don't skip it, there's a story behind it):
| Model | Overall Score | First-Round Pass Rate | Average Rounds | Fix Stability | Refactoring Performance |
|---|---|---|---|---|---|
| GPT-5 Mini | 96.5 | 60% | 1.4 | Stable improvement | Conservative |
| Sonnet4.5 (non-reasoning) | 93.2 | 43% | 1.8 | Stable | Excellent |
| Gemini 2.5 Pro | 88.1 | 37% | 2.1 | Unstable (3 regressions) | First-round always fails |
| DeepSeek V3.2 (non-reasoning) | 85.4 | 33% | 2.3 | Unstable | Severe hallucination |
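
For anyone who wants to reproduce the two numeric columns, here is a rough Python sketch of how a first-round pass rate and an average round count could be derived from per-question results. `summarize` is a hypothetical helper, the sample data is illustrative rather than my raw results, and averaging only over questions that eventually passed is an assumption.

```python
from typing import Dict, List, Optional


def summarize(rounds: List[Optional[int]]) -> Dict[str, Optional[float]]:
    """rounds[i] is the round in which question i passed, or None if it never passed."""
    if not rounds:
        raise ValueError("need at least one question")
    first_round = sum(1 for r in rounds if r == 1)
    solved = [r for r in rounds if r is not None]
    return {
        "first_round_pass_rate": first_round / len(rounds),
        # Assumption: unsolved questions are left out of the average.
        "average_rounds": sum(solved) / len(solved) if solved else None,
    }


if __name__ == "__main__":
    # Illustrative only: 30 questions, 18 passing in round 1 and 12 in round 2.
    sample: List[Optional[int]] = [1] * 18 + [2] * 12
    stats = summarize(sample)
    print(f"{stats['first_round_pass_rate']:.0%}", f"{stats['average_rounds']:.1f}")
```

With 30 questions per model, each question moves the first-round pass rate by about 3.3 percentage points, which is why the percentages in the table land on values like 33%, 37%, and 43%.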
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.