Prompt caching,一篇就够了 (English)

Generated: 2026-06-20 13:25:15

---

I read your article again. The core arguments and most of the experiences are solid—there are just a few factual details that could be more precise, and some phrasing that sounds too much like AI-generated text that needs to be toned down. Below is my edited version. Just match it against your original.

---

Can you believe it? It took me a full two years to figure out one thing—

Prompt Caching isn’t some “technology”—it’s an art form about “queuing.”

What are all those people online writing? Lines like “KV Cache reuses the Key-Value matrix computation results in the attention mechanism”—the statement isn’t wrong, but when you’re writing code, do you really have Softmax running through your mind? No, you don’t.

Last year I took on an optimization project for an Agent system. On day one, I fully integrated Anthropic’s caching API, marking every cache_control breakpoint clearly. Guess what happened next?

Hit rate: 12%.

12%! At that point I almost thought the API docs were messing with me.

Then I spent three weeks digging through the source code of two open-source solutions—Codex and Claude Code—and uncovered a truth that sent a chill down my spine—

Prompt Caching isn’t a technical problem. It’s a “sorting” problem.

You don’t need to understand the position encoding in Transformers. You just need to figure out one thing: what should go first, and what should go last.

---

Let me tell you a story.

Anthropic has a blog post with a dramatic title: “Prompt Caching is everything.” At first I thought it was just marketing fluff—until I hit so many pitfalls myself that I started questioning my own decisions.

How do they order things?

Static system prompts and tool definitions (shared by everyone)
CLAUDE.md file (shared by the project team)
Conversation context (for this specific dialogue)
The latest message (changes every round)

See the pattern?

The less likely something is to change, the further forward it goes.

That one principle alone—at least 80% of production systems don’t follow it.

Let me give you an example. The Codex team stepped into a huge trap. When they later added MCP tool support, the order of the tool enumeration turned out to be non-deterministic. Today, listfiles comes before readfile; tomorrow, readfile might be before listfiles. It seems like only a tiny change, right?

The entire cache was completely invalidated.

When I read about that case, a chill ran down my spine—because in my code, I also used a dict to define tools! Before Python 3.7, dict didn’t preserve insertion order. I’m using 3.11 now, but who knows what hidden traps the underlying libraries might have.

After that, I added sorting to the tool definitions before every request, combining a frozenset with a sorted list.

The hit rate jumped from 12% straight to 67%.

This might matter to you too. Check your code: are you putting timestamps inside a static prompt? Are you using sets or dicts to define your tools? If the answer is “yes”—then your cache might have been nothing more than a decoration.

---

Now here’s another pitfall, even trickier than the first.

After ChatGPT came out, people split into two camps. One group swears by OpenAI’s automatic caching; the other trusts Anthropic’s explicit control.

I tried both, and found both have their own pitfalls.

OpenAI’s automatic caching is convenient, but the problem is—you never know when it hits or when it gets evicted. I ran a test: sending the same prompt 15 times in a row. Three times I got a cache miss. Sometimes the prefix changed, sometimes system load was high and the cache was cleared.

You have zero visibility.

As for Anthropic’s explicit control—on the surface it gives you fine granularity, but in practice it’s full of landmines.

Take the placement of the cachecontrol breakpoint. The official docs say “the cache is written after meeting the minimum token requirement,” but they never tell you what that minimum token count is. I tried adding cachecontrol to a 1000-token prompt—it never hit the cache. After digging through tons of resources I finally found out—

Anthropic’s cache requires a minimum prefix length of at least 1024 tokens to take effect.

And there’s TTL. Default is something like 5 minutes. When you set it to an hour, you also have to worry about the cache being evicted if no one accesses it for a while.

A friend of mine works at a startup building an AI product using Claude Code as the underlying engine. One day he messaged me saying their costs had exploded by 120x. 120 times! After half a day of troubleshooting, they found that someone had manually changed a single punctuation mark in a tool’s description.

One punctuation mark—costs spiked 120x.

Something like this would be completely absurd in the world of backend caching. But for large model APIs? It’s just Tuesday.

So now I use a method that’s super simple but incredibly effective: separate static content and dynamic content into two independent objects. The static content gets a pre-created cache; the dynamic content is sent fresh each time.

Gemini’s explicit caching API does this the most cleanly. You can directly create a Cache object, set a TTL, and then reference it in your request. That way, the cache for static content is completely decoupled from the request, and no matter how much the dynamic content changes, it won’t affect the cache.

The code looks something like this:


cache = client.Caches.Create(
 contents=[system_prompt, tools_definition],
 ttl=3600
)
response = client.Models.GenerateContent(model_name, user_query, cached_content=cache.name)

I’ve been using this pattern for almost half a year—hit rate consistently above 85%.

---

You might be thinking: “It’s just caching. Hitting or missing only costs me a few extra cents.”

Let me tell you—that mindset will kill you.

Inside Anthropic, Prompt Cache hit rate is treated as an infrastructure-level metric, right up there with server uptime. If the hit rate drops, it triggers an oncall alert. If you told their engineers “caching is just a few cents,” they’d look at you like you were a fool.

Why? Because Agent-type products have one special characteristic—long conversations.

An AI coding assistant might go back and forth with you for dozens of rounds in a single session. Each round has to send the entire previous context back to the model. Without a cache, every round recomputes the entire history. The computation cost doesn’t grow linearly—it grows almost quadratically.

I ran a rough estimate: for a 50-round conversation, without caching, first-token latency is about 8 seconds. With caching? 1.2 seconds.

Nearly 7x difference.

And that’s the optimistic case. If you’re using DeepSeek V4, the input cost with a cache hit is only a tenth of the normal price. With the same budget, you can run 10x the number of requests.

Think about it—if your product is user-facing, what’s the difference in experience between waiting 8 seconds and waiting 1 second? What’s the difference in user retention?

That’s why I say: Prompt Caching isn’t a nice-to-have optimization. It’s a prerequisite for the system to even work.

Without caching, there would be no Claude Code. Anthropic itself admits this.

But have you noticed? So many people still treat it as an “optimization.” They think, “Let’s get the feature working first”—and then when the feature is done, the cost is too high to sustain. So they have to spend even more effort redoing the prompt structure and refactoring the code. I know that pain all too well. If you’ve been through it, you know.

---

There’s another hidden trap that almost nobody talks about.

While I was doing research, I came across a paper (titled “An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks”) that raised a problem I’d never considered before:

Position encoding offset from cached states.

Prompt caching,一篇就够了 (English)

Prompt caching,一篇就够了 (English)

Cael Lee

Ready to get started?