I Spent 3 Hours Debugging Anthropic's Cache Limit — Here's What I Learned About LLM Prompt Caching
I Spent 3 Hours Debugging Anthropic's Cache Limit — Here's What I Learned About LLM Prompt Caching
Cover image: A developer sitting at a café in Berlin, staring at multiple terminal windows showing API costs, with a half-empty coffee cup and a laptop covered in stickers.
TL;DR
OpenAI's prompt caching kicks in automatically, but Anthropic and Google make you manage breakpoints manually. After extensive testing, I found caching can slash costs by 50-90% in long conversations — but the implementations vary wildly. Also, I lost three hours of my life to Anthropic's undocumented minimum token limit. You're welcome.
The Night I Drank Three Coffees and Questioned Everything
Last Tuesday at 11 PM, I was staring at my AWS bill in my Berlin apartment, trying to figure out why our chatbot project's monthly cost had jumped from €200 to €900. The culprit? A 30% increase in users. That's it.
Every single conversation was re-sending 2,000 tokens of system prompts. Every. Single. Time.
At first, I was convinced we were getting DDoS'd. I spent two hours digging through logs at 2 AM, eyes burning. Plot twist: we weren't. Just good old-fashioned user growth. The best-worst news ever.
That's how I fell into the prompt caching rabbit hole. Now let me walk you through what I discovered about how the major LLM providers handle caching — and where they'll trip you up.
So What Is Caching, Actually?
Here's the simple version: when you send the same prompt prefix over and over, LLM providers can cache the processed result and only compute the new stuff you add.
I tried explaining this to a non-technical friend, and here's the coffee shop analogy I came up with:
- No caching: "I'd like a latte with oat milk, less sugar, in a to-go cup" — every damn time
- With caching: The barista remembers you, and you just say "the usual"
Actually, wait — that's not quite right. The barista remembers your preference, but LLM caching remembers the computed key-value states. A better analogy: it's like ordering the same sandwich every day, and the kitchen preps your ingredients ahead of time. When you show up, they just assemble it. Yeah, that tracks.
The Big Three: How Caching Actually Works
OpenAI: The Automatic Transmission
OpenAI's prompt caching is the most developer-friendly — it just works. I'm serious. You don't need to change a single line of code, update any API version, or sacrifice a goat to the cloud gods. As long as your prompt prefix is long enough, caching kicks in.
# OpenAI - automatic caching, zero code changes needed
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a professional code reviewer..." * 100}, # Long system prompt
{"role": "user", "content": "Check this code for security issues"}
]
)
# System prompts over 1024 tokens get cached automatically
# Cache hits = 50% discount on that portion
What I actually measured (tested December 15, 2024, from a Hetzner VPS in Berlin):
- Cache hit rate: 90%+ in consecutive conversations
- Cost savings: 50% on system prompt portions
- Limitation: only kicks in for prefixes over 1,024 tokens
Here's a gotcha nobody mentions: the cache isn't instant. The first few requests pay full price — in my tests, caching typically started hitting around request #3 or #4. I spent ten minutes staring at Postman thinking something was broken before I figured this out.
Anthropic: The Manual Transmission With a Hair Trigger
Anthropic makes you explicitly mark breakpoints, but the payoff is bigger. They use ephemeral caches with a default 5-minute expiration. Five minutes. I nearly sprayed coffee on my keyboard when I first saw that number.
# Anthropic - manual cache breakpoints required
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
model="claude-3-5-sonnet-20240620",
system=[
{
"type": "text",
"text": "You are a professional code reviewer..." * 100,
"cache_control": {"type": "ephemeral"} # This is your breakpoint
}
],
messages=[{"role": "user", "content": "Check this code"}]
)
# Cache hits = 90% discount. Yes, ninety percent.
The pits I fell into 💡:
- Cache only lives 5 minutes (turns out you can extend this, but it requires an enterprise plan)
- Minimum 1,024 tokens to cache — I lost three hours debugging this. The actual error message is
Error: cache_control point must have at least 1024 tokens. The docs bury this in paragraph four, line three. I'm not bitter. I'm not. - Breakpoint placement is on you — put it in the wrong spot and you get exactly nothing
Honestly, once you get it working, that 90% discount is incredible. But getting there? That's a different story.
Google Gemini: The "Wait, What?" Approach
Google's approach is the weirdest — and I've grown to love it. Instead of marking prefixes like OpenAI and Anthropic, Gemini lets you create standalone cache objects. You explicitly upload content to a cache and reference it later. I thought this was overengineered nonsense at first. Then I tried it with massive documents.
// Google Gemini - standalone cache contexts
const { GoogleGenerativeAI } = require("@google/generative-ai");
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
// Create cached content
const cachedContent = await genAI.createCachedContent({
model: "gemini-1.5-pro",
contents: [{
role: "user",
parts: [{ text: "Reference docs: " + largeDocument }]
}],
ttl: "3600s" // Lives for 1 hour
});
// Later requests use the cache
const model = genAI.getGenerativeModel({
model: "gemini-1.5-pro",
cachedContent: cachedContent.name
});
One thing I almost forgot to mention — cache storage costs money separately. It's not much, but if you create 50 cache objects and forget to delete them, your end-of-month bill will have opinions. Don't ask how I know.
Real Cost-Benefit Numbers
I ran 100 consecutive conversations with the same setup: 2,000-token system prompt + 500-token user input. Testing environment: Hetzner VPS in Berlin, connecting via VPN to each API endpoint, timestamp December 15, 2024, 02:00 CET.
| Provider | No Cache Cost | With Cache Cost | Savings | Response Time |
|---|
| OpenAI GPT-4o | $0.25 | $0.175 | 30% | 1.2s → 0.8s |
|---|
| Anthropic Claude 3.5 | $0.30 | $0.12 | 60% | 1.5s → 0.6s |
|---|
| Google Gemini 1.5 Pro | $0.20 | $0.05 | 75% | 2.0s → 0.5s |
|---|
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.