What to Make of Anthropic's Claude 4 Release (English)

Generated: 2026-06-21 08:29:15

---

I Spent $100 on a Claude Max Subscription. Two Days of Testing Later, I'm Both Laughing and Cursing.

You might not believe this—

But before I hit "subscribe" on Claude Max, my finger literally hovered for three full seconds.

One hundred dollars. My wallet was bleeding.

But guess what? After the payment went through and I calmed down enough to do the math, I actually started laughing.

Why? Because over at OpenAI, the $200 Pro plan is sitting right there. And Anthropic? For $100, they're bundling Pro + Code + Research all together. You know how much OpenAI charges if you break those features out separately? This isn't a price hike. Anthropic is playing the price war game.

Still, don't pull out your wallet just yet.

Before you do, let me show you all the potholes I hit first.

---

Three Things That Made Me Shake My Head After Two Days of Testing

I started with a side-by-side comparison between Sonnet 4.5 and its predecessor, 3.7.

Task one — Code review.

I wrote a piece of code myself and deliberately planted five bugs, buried pretty deep. Want to know what happened?

Sonnet 4.5 found four. 3.7 found three and a half. That half was more of a near miss. So yeah—there's improvement. A leap? Not really.

Task two — Multilingual translation.

Chinese to English, Japanese to Chinese, Arabic to German—I tested all of them. Honestly? Gemini 2.5 Pro leaves it in the dust on this front. How bad was it?

Claude 4 tripped up three times on Japanese honorifics. That business-Japanese phrase 「お世話になっております」? It translated it straight as "Thank you for your care."

Read that again. Slowly.

All that politeness, gone. Say that in a company setting and your Japanese clients are silently crossing you off.

Task three — Tool calling.

This was the biggest headache. I wrote an MCP server to call an API for weather data. Sonnet 4.5 swapped the latitude and longitude parameters twice, and failed to parse the response format once. Did this bug exist before? Yes. 3.7 had the exact same problem. They didn't even fix it!
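Since the swap kept happening, one cheap guard is to validate the arguments on the server side before they ever reach the weather API. Here's a minimal sketch of that idea. The names `Coordinates` and `parse_weather_args` are hypothetical, and I'm assuming the tool call arrives as a plain dict with `latitude` and `longitude` keys; this is not my actual server code or any MCP SDK API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coordinates:
    """A latitude/longitude pair that rejects out-of-range values."""
    latitude: float
    longitude: float

    def __post_init__(self) -> None:
        # Latitude is limited to [-90, 90]; longitude to [-180, 180].
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError(f"latitude out of range: {self.latitude}")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError(f"longitude out of range: {self.longitude}")


def parse_weather_args(args: dict) -> Coordinates:
    """Validate tool-call arguments (hypothetical shape) before calling the API."""
    missing = {"latitude", "longitude"} - set(args)
    if missing:
        raise ValueError(f"missing arguments: {sorted(missing)}")
    return Coordinates(float(args["latitude"]), float(args["longitude"]))
```

A range check like this only catches a swap when the longitude falls outside the latitude range. Tokyo (35.68, 139.69) swapped gets rejected; Paris (48.85, 2.35) swapped sails straight through. So it narrows the problem, it doesn't solve it.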

So if you ask me honestly—

I don't think it's unfair to call Sonnet 4.5 a refresh of 3.7.

But the story doesn't end there.

---

Opus 4.8 — Now That One Actually Made My Eyes Light Up

I ran it through three scenarios so real they hurt.

First scenario — Code review.

And I don't mean one of those "find the bug" exercises. I threw an actual microservice refactoring plan I wrote last month at it. The real deal.

Opus 4.8 did something that made me sit up instantly.

It didn't just tell me what was wrong. It started with: "Are you sure you want to put the cache layer behind the API gateway? That's going to spike your write latency. I'd suggest a different approach."

I froze.

See, with every other model I'd used before, they just toss out suggestions. But this one? It questioned my decision first.

I didn't believe it, so I ran a performance test following its line of thinking.

It was right. That plan really did have a performance issue hiding in it.

The official docs call this ability "proactive uncertainty marking." In plain English: it's no longer a mindless task-execution tool. It's starting to feel like a programming partner.

Second scenario — Writing a due diligence report for an investment.

I handed Opus 4.8 three financial reports in PDF format and asked it to extract key metrics for comparison.

Back when I used 3.7, citation mismatches were common—you'd have to manually check each one. This time around, the citation accuracy on 4.8 was way better. Cut my verification time in half.

But that's not the part that blew me away.

On its own, it noticed that one of the three reports was prepared on a different accounting basis: that report used GAAP, while the other two used IFRS.

Think about it. If you don't adjust for that, your entire analysis is skewed.
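If you want to catch the GAAP vs. IFRS mismatch yourself before comparing numbers, the check is simple once you know each report's standard. A rough sketch, assuming you've already labeled every report by hand; the file names and function names below are made up for illustration.

```python
from collections import defaultdict


def group_by_standard(reports: dict[str, str]) -> dict[str, list[str]]:
    """Group report names by the accounting standard they declare."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name, standard in reports.items():
        groups[standard.strip().upper()].append(name)
    return dict(groups)


def is_mixed(reports: dict[str, str]) -> bool:
    """True when the reports do not all share one accounting standard."""
    return len(group_by_standard(reports)) > 1


# Hypothetical labels mirroring the situation described above.
reports = {
    "report_a.pdf": "GAAP",
    "report_b.pdf": "IFRS",
    "report_c.pdf": "IFRS",
}

if is_mixed(reports):
    print("Mixed accounting standards:", group_by_standard(reports))
```

The hard part, of course, is the labeling. That's exactly the step the model did without being asked.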

I've used AI for this kind of analysis many times. Not once has any model proactively raised this kind of question. Not a single time.

Third scenario — This one I found the most interesting.

I asked it a deliberately tricky question.

"On April 17, 2026, Google's AlphaFold team published a paper on a new method for protein folding. What's the DOI?"

It replied: "I can't be sure which specific paper this is, because my training data cuts off at February 2026. A paper published on April 17 might not be covered. I'd suggest checking Google Scholar or the AlphaFold team's latest updates."

You might think—that answer looks pretty ordinary, right?

But think about it. What would an older model most likely do? Make up a DOI. Even if it wasn't sure, it would fabricate one. That's been AI's old habit.

This improvement in "honesty" on 4.8? After testing it, I genuinely respect it.

---

The Feature That Made Me Love and Hate It at the Same Time: Interleaved Thinking

In simple terms, this feature breaks down a complex problem, thinks about one part, then the next, and lets you go back and adjust in between.

I gave it a LeetCode hard-level algorithm problem. It first broke the problem into three sub-problems, then solved each one. Midway through, it spotted a logical flaw in its own reasoning and corrected itself.

Sounds perfect in theory, right?

But here's the catch: you need API access to use it. The web version doesn't support it yet. And that judge model—the one that evaluates the quality of the thinking path—hasn't been released.

Right now, using it feels like: it runs, but you have no idea if it's running correctly.

---

One Million Tokens of Context — Do You Even Need It?

Opus 4.8 keeps the 1M token context window. I ran a stress test—

I threw in the entire English PDF of Designing Data-Intensive Applications, roughly 900 pages. Asked it to summarize, then compare several distributed consistency approaches mentioned in the book.

So what happened?

It could handle it. It did produce the summary.

But two problems: first, retrieval slowed down noticeably. Second, anything beyond 200K tokens is billed at the premium rate: $10 per million input tokens and $37.50 per million output tokens.

In other words, this feature is really for enterprise users who aren't pinching pennies.

Occasional use as an individual developer? Fine. Frequent use? Your wallet will give out first.
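To make the wallet math concrete, here's a back-of-the-envelope calculator using the two premium rates quoted above. The token counts in the example are hypothetical (I didn't log exact usage), and for simplicity it prices every token in the request at the premium rate, which may not match how the bill is actually computed.

```python
# Premium rates as quoted in this post, in USD per million tokens.
INPUT_RATE_PER_MTOK = 10.0
OUTPUT_RATE_PER_MTOK = 37.5


def premium_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request priced entirely at the premium rate."""
    total = input_tokens * INPUT_RATE_PER_MTOK + output_tokens * OUTPUT_RATE_PER_MTOK
    return total / 1_000_000


# Hypothetical long-context request: 500K tokens in, 2K tokens out.
one_request = premium_cost(500_000, 2_000)
print(f"One request: ${one_request:.3f}")
print(f"Twenty requests a day for 30 days: ${one_request * 20 * 30:.2f}")
```

Under those assumptions a single request is about five dollars, almost all of it input. One-off, that's fine. As a daily habit, it adds up fast.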

---

Three Things I Have to Say About the Industry Impact

First, Anthropic is ahead on the path to Agents.

The Dynamic Workflows feature, even though it's still in preview, has the

Cael Lee
