Big Three LLM Comparison: Claude, Gemini, GPT, and ... (English)
Generated: 2026-06-21 00:34:15
---
Stop Believing in a Single Big Model – You Need to Learn to "Raise Your Own Swarm"
A couple of days ago, I was refactoring an old module in a Python project.
Guess how I did it?
First, I had the latest GPT sketch out a bare-bones framework. Then I fed it to Claude 3.5 Sonnet to flesh out the core logic. Finally, I tossed it to Gemini 2.5 Pro for a code review — three models, tag-team style. A job that would've taken two days got done in three hours.
I sat there at my desk, staring at the screen for a solid three seconds.
Honestly, people who are still obsessing over "which model is the best" are probably asking the wrong question from the start.
The Big Three, Each with Their Own Persona
Let me level with you — here's my "war story."
I've been writing a newsletter with large language models for over two years. I've followed GPT from GPT-4 all the way to GPT-4o, Claude from version 3 to 3.5 Sonnet, and Gemini from 1.5 to 2.5 Pro. My subscriptions add up to nearly two thousand yuan a month: ChatGPT Plus, Claude Pro, Gemini Advanced, SuperGrok, and even a Kimi membership.
I know, it sounds ridiculous.
But here's the reality: No single model can do it all.
ChatGPT feels to me like an experienced product manager. Warm, attentive, covers every angle, and never misses a beat. Ask a question, and he'll break it down from five different perspectives, unpacking it all with information density so high it's almost overwhelming.
A while back, I was chewing over a user growth analysis framework, still just a fuzzy idea. I chatted with GPT for an afternoon. By the end, the framework just emerged naturally. GPT excels at this kind of scenario — pulling clarity out of chaos. You don't need a clear vision; he helps you carve one out.
But there's a downside: the information density is so high that sometimes it lacks focus. You have to know how to "steer" him — constantly pull the conversation back on track or set clear boundaries.
In short, GPT is a great teammate, but you've got to keep the reins in your own hands.
Claude 3.5 Sonnet is more like that senior engineer — technically brilliant, few words, but always hits the nail on the head.
Code generation, long-form writing, and complex logic unraveling — that's his arena. The SuperCLUE benchmarks back this up: Claude scored high on code generation, outperforming Gemini, with clear gains over its own previous version. It also topped the SWE software engineering sub-tasks.
My own experience: Claude writes code with a certain "clarity." He doesn't just mechanically piece functions together — he truly understands the business logic.
I once had this nasty bug combining a multi-threading deadlock with a memory leak. Several models took their shots, but it was Claude 3.5 Sonnet that pinpointed the root cause — a timing issue I'd checked three times and missed.
But he has weaknesses: on translation and structured tasks, he doesn't match GPT or Gemini.
My current workflow: GPT for the first draft, Claude for polishing the expression. Each model is irreplaceable in its own way.
Gemini 2.5 Pro — to be fair, the progress has been impressive.
Especially after the experimental version with chain-of-thought dropped in March 2025, the improvements in math and coding are clear. But Gemini has a split-personality problem: the AI Studio experience and the App experience are completely different.
I've compared them myself. With the same query, the AI Studio response quality is noticeably higher. The App, probably because of some underlying system prompt, often oversimplifies replies, and sometimes just gives you wrong answers.
And — you can't edit an intermediate message in the App! Only the last one.
That means you can't branch off in a conversation and return to the main thread. Too much context and things get "muddy." Every time I hit that design choice, I want to swear.
But AI Studio's 1M-token context window is genuinely amazing!
Upload a YouTube video link, set the frame rate to 0.1 fps and the resolution to the lowest setting, and a two-hour podcast episode fits in easily. Combined with the Deep Research feature, the research reports it generates are more solid than GPT's Deep Research, because Gemini first creates a search plan that covers more ground than I would have anticipated.
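To see why dropping the frame rate and resolution matters, here is a back-of-the-envelope token budget. It is only a sketch: the per-frame and per-second token costs below are assumed, illustrative values (not taken from this article), and the function name is hypothetical. Check the current Gemini documentation before relying on the numbers.

```python
# Rough token budget for feeding a long video into a 1M-token context window.
# ASSUMPTION: the three cost constants below are illustrative placeholders,
# not figures from the article; real costs depend on the model and settings.
CONTEXT_WINDOW = 1_000_000
TOKENS_PER_FRAME_DEFAULT = 258   # assumed cost of one frame at default resolution
TOKENS_PER_FRAME_LOW_RES = 66    # assumed cost of one frame at the lowest resolution
AUDIO_TOKENS_PER_SECOND = 32     # assumed cost of one second of audio


def video_token_estimate(duration_s: int, fps: float, tokens_per_frame: int) -> int:
    """Estimate tokens as (sampled frames * frame cost) + (seconds * audio cost)."""
    frames = round(duration_s * fps)
    return frames * tokens_per_frame + duration_s * AUDIO_TOKENS_PER_SECOND


two_hours = 2 * 60 * 60  # 7200 seconds

# 1 frame per second at default resolution versus 0.1 fps at the lowest resolution.
default_cost = video_token_estimate(two_hours, fps=1.0, tokens_per_frame=TOKENS_PER_FRAME_DEFAULT)
tuned_cost = video_token_estimate(two_hours, fps=0.1, tokens_per_frame=TOKENS_PER_FRAME_LOW_RES)

print(f"default settings: {default_cost:,} tokens, fits: {default_cost <= CONTEXT_WINDOW}")
print(f"0.1 fps, low res: {tuned_cost:,} tokens, fits: {tuned_cost <= CONTEXT_WINDOW}")
```

Under these assumed costs, a podcast is a good fit for aggressive frame sampling: almost all the information is in the audio track, so the frames you throw away cost you very little.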
The "Debate" Between Three Models Reveals Their True Characters
A friend on Zhihu put it well: Ask the same question, and each model paints a completely different picture.
I tried it with a product R&D scenario:
ChatGPT is like a product manager, enthusiastically discussing all the possibilities.
Claude is like a professional engineer, diving into technical details, logically rigorous but sometimes caught in his own monologue.
DeepSeek, on the other hand, feels like a business-savvy leader — gets straight to the core contradiction, concise but hits the bullseye.
These three styles? No absolute good or bad. It's all about the context.
When you need a brainstorming session, go to GPT;
When it's time to execute, go to Claude;
When you want a sharp diagnosis, go to DeepSeek.
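If you want to make that rule of thumb explicit in your own tooling, it boils down to a small routing table. This is a minimal sketch: `TaskKind`, `ROUTES`, and `pick_model` are hypothetical names, and the mapping only encodes my personal preferences above, not any benchmark result.

```python
from enum import Enum


class TaskKind(Enum):
    """Hypothetical task categories matching the three situations above."""
    BRAINSTORM = "brainstorm"
    EXECUTE = "execute"
    DIAGNOSE = "diagnose"


# Personal preference encoded as data, so it is easy to change later.
ROUTES = {
    TaskKind.BRAINSTORM: "GPT",
    TaskKind.EXECUTE: "Claude",
    TaskKind.DIAGNOSE: "DeepSeek",
}


def pick_model(kind: TaskKind, default: str = "GPT") -> str:
    """Return the preferred model family for a task kind, with a fallback."""
    return ROUTES.get(kind, default)


print(pick_model(TaskKind.DIAGNOSE))
```

Keeping the mapping as data rather than as if/else branches means swapping a model in or out is a one-line change.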
Take my recent factor mining experiment. On the AlphaMind platform, we pitted three models against the same task: optimize the turnover-relative-strength reversal factor to maximize the Pure Long Short Sharpe ratio.
But this time there was a key change: instead of using Claude Code as a unified execution client, we let each model use its own official native tool.
The result? The top-scoring run lifted the Sharpe ratio from the baseline all the way to 3.07.
The three models generated three completely different "worldviews."
What does this tell us? No matter how smart a model is, if its official tooling (tool calling, context management, long-task stability) can't keep up, the final performance takes a hit.
Conversely, a well-tuned official Agent can squeeze every last drop of capability out of the model.
So now, when you choose a model, you have to consider the entire "toolchain."
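For readers who haven't met the metric used to score the factor experiment above, here is a rough sketch of an annualized long-short Sharpe ratio. I'm assuming the long-short return is simply the long leg minus the short leg, that returns are daily with 252 trading days per year, and that the risk-free rate is zero. AlphaMind's exact "Pure Long Short" definition may differ, and the function names and sample returns are made up for illustration.

```python
import math
from statistics import mean, stdev


def long_short_returns(long_leg: list[float], short_leg: list[float]) -> list[float]:
    """Per-period return of going long one basket and short another (assumed definition)."""
    if len(long_leg) != len(short_leg):
        raise ValueError("both legs need the same number of periods")
    return [l - s for l, s in zip(long_leg, short_leg)]


def annualized_sharpe(returns: list[float], periods_per_year: int = 252) -> float:
    """Mean over sample standard deviation, scaled by sqrt(periods per year).

    Assumes a zero risk-free rate and independent per-period returns.
    """
    sd = stdev(returns)
    if sd == 0:
        raise ValueError("Sharpe ratio is undefined for constant returns")
    return mean(returns) / sd * math.sqrt(periods_per_year)


# Made-up daily returns, purely for illustration.
long_leg = [0.02, 0.01, 0.03, 0.00]
short_leg = [0.01, 0.01, 0.01, 0.01]

spread = long_short_returns(long_leg, short_leg)
print(f"long-short Sharpe: {annualized_sharpe(spread):.2f}")
```

Four data points give a meaningless number, of course. The point is only that "maximizing Sharpe" rewards a factor for steady spread returns, not just large ones.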
The Landscape Is Quietly Shifting
Looking at market share, ChatGPT is still the undisputed heavyweight — mobile MAU far ahead of competitors, web monthly visits 10 times Claude's and 2.7 times Gemini's. Enterprise market share sits around 45%, with annualized revenue growing steadily. No other player can shake them in the short term.
But Claude's growth curve is downright terrifying.
At the end of 2024, its annualized revenue was still under $1 billion. By mid-2025, it had more than doubled, with an explosion in enterprise customers as the core engine. Enterprise clients spending over $1 million annually doubled in half a year. After Claude Code and agentic features launched, mobile MAU surged month-over-month, and daily active users soared year-over-year. What's even crazier is the per-user value: Claude's average revenue per user tops the chart, at nearly 30 times that of ChatGPT.
That suggests Claude is hard to replace for "high-value tasks," so its paying users are willing to spend much more.
Gemini is in an awkward spot. The technology foundation isn't bad — 2.5 Pro even overtook others in the SuperCLUE Chinese benchmark. But its market share is shrinking. The split between the App and AI Studio, the chaotic decision-making at the top — all of it leaves users with a sense of friction.
Practical Experience: How to "Raise Your Swarm"
After months of use, I've developed my own combination:
Coding scenarios: GPT-4o for the framework, Claude 3.5 Sonnet for core logic, Gemini 2.5 Pro for code review. This process catches most bugs before deployment. GitHub Copilot now also incorporates GPT-4o
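The hand-off in the coding scenario can be sketched as a tiny pipeline in which each stage passes its output to the next. This is a sketch under stated assumptions, not my actual tooling: `Stage`, `run_pipeline`, and `fake_model` are hypothetical names, and the models are stubbed with plain functions so the example runs without any API keys. In practice you would replace each stub with a call to the provider's own SDK.

```python
from typing import Callable, NamedTuple

# Hypothetical interface: a "model" is any function from prompt text to reply text.
ModelFn = Callable[[str], str]


class Stage(NamedTuple):
    name: str
    instruction: str
    model: ModelFn


def run_pipeline(task: str, stages: list[Stage]) -> list[tuple[str, str]]:
    """Feed each stage's output into the next stage; return (stage name, output) pairs."""
    artifact = task
    history = []
    for stage in stages:
        prompt = f"{stage.instruction}\n\n{artifact}"
        artifact = stage.model(prompt)
        history.append((stage.name, artifact))
    return history


def fake_model(label: str) -> ModelFn:
    """Stub standing in for a real API call: tags the last line of the prompt."""
    def call(prompt: str) -> str:
        return f"[{label}] {prompt.splitlines()[-1]}"
    return call


stages = [
    Stage("framework", "Sketch a bare-bones framework for:", fake_model("gpt")),
    Stage("core logic", "Fill in the core logic of:", fake_model("claude")),
    Stage("review", "Review this code and list likely bugs:", fake_model("gemini")),
]

for name, output in run_pipeline("refactor legacy module", stages):
    print(f"{name}: {output}")
```

Keeping the full history instead of only the final output matters in practice: when the reviewer flags a problem, you want to see which stage introduced it.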
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.