Home / Blog / 如何看待Anthropic公司在ChatGPT4.5推出 (English)

如何看待Anthropic公司在ChatGPT4.5推出 (English)

By CaelLee | | 3 min read

如何看待Anthropic公司在ChatGPT4.5推出 (English)

Generated: 2026-06-21 01:28:00

---

Okay, as you requested, I've fact-checked and adjusted the style of this article. I mainly fixed two factual errors—the valuation data and the description of benchmark tests—and loosened up a few overly obvious parallel structures to make the rhythm feel more natural.

---

Claude 3: Let Me Be Real With You

It was about ten o'clock that night. I had just put my kid to bed when my phone blew up.

A friend sent a message: "Dude, Claude 3 just dropped! They say it crushed GPT-4. You gotta test this!"

I froze. What the hell? GPT‑4.5—that legendary "nuclear option"—is supposed to be just a few months away, and Anthropic decided to throw their punch early? That makes no sense. Think about it: who picks a fight right before the heavyweight bout?

Honestly, my head flooded with questions. Did Claude 3 really smoke GPT‑4, or was it just hype? Which of the three models (Haiku / Sonnet / Opus) should I even care about? If it's so good, why release it right before GPT‑4.5—that's just asking for a beating. Opus is ridiculously expensive—who's actually going to use it? And I heard it's really "safe"—is that a good thing or a bad thing?

Okay, no more stalling. I immediately forked over cash for Claude Pro (twenty bucks a month, same as ChatGPT Plus) and started running tests like crazy in a small AI‑enthusiast group chat. From 2 AM till dawn, I pulled multiple all‑nighters. I'm going to walk you through every single pitfall I found.

---

First Thing: Does Claude 3 Actually DESTROY GPT‑4?

My first test: a job interview question.

I threw this at it: "Design a neural network with a 10‑dim input, 5 output classes, two hidden layers. I need the code to run and the training to be stable. Oh, and I'm not telling you the learning rate—you figure it out."

GPT‑4 Turbo: gave a standard implementation… but never mentioned the learning rate at all. It's like ordering delivery and getting the food, but no chopsticks.

Claude 3 Opus? Not only did it write complete code, it added: "I didn't specify a learning rate, so your training might not converge. I recommend you add one, or I can add it for you?"

You see? That's not just attention to detail. That's meta‑cognition. The model drew a circle around its own capability boundary and told you, "Outside this circle I'm not sure; you decide."

Second test: a probability problem from Li Yongle.

The kind with an answer you can find online. Claude 3 Opus laid out every step clearly and even self‑corrected once. It said, "Wait, my earlier logic was wrong," then started over. GPT‑4 Turbo could get the right answer too, but not with that clarity.

But—hold on, I'm gonna say "but."

For creative writing and ad copy, I felt they were about the same. I had both write a copy for an e‑cigarette ad—controversial, I know. Claude 3 flat‑out refused: "I cannot help promote a harmful product." GPT‑4 Turbo actually gave me the copy, though it did include a warning.

Here's the critical question: In a business setting, is that "refusal" a good thing or a bad thing?

If you're in finance, healthcare, or law, you want the model to keep its mouth shut so you don't get sued. But if you're in creative work, content generation, or unfiltered production… Claude 3 is going to tie your hands.

One of my friends in the group put it even more bluntly: "With the same prompt, Claude's output is more like Lao Wang."

(For context, "Lao Wang" in our group is a character who's both sharp and a little racy—you get the idea.)

Here are the core numbers—judge for yourself:

On the usual benchmarks—MMLU, GSM8K, HumanEval—the top three (GPT‑4, Gemini Ultra, Claude 3 Opus) all perform well, but they're not completely indistinguishable. Overall, Claude 3 Opus has a slight edge in code generation (HumanEval) and math reasoning, while Gemini Ultra leads on knowledge breadth (MMLU). Each has its strengths.

The real gap shows up in GPQA (graduate‑level reasoning) and certain "needle‑in‑a‑haystack" long‑context tests.

Claude 3 Opus hit over 99% accuracy on the NIAH test. And here's the crazy part: once it actually pointed out, "This sentence looks like it was artificially inserted and doesn't fit the context."

Is that meta‑awareness or what? AI researcher Alex Albert shared this case on X.

My personal take: Until GPT‑4.5 comes out, Claude 3 Opus is the ceiling for single models. But when people say it "completely dominates" everything else, take that with a grain of salt.

---

Second Thing: Which Model Should You Pick? Is Any Worth Your

C

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.

Ready to get started?

Get your API key and start building with 180+ AI models.

Get API Key Free