| Dynamic prefix with sliding window | 91% | 78% | 67% |
That 12% baseline was... humbling. We were basically throwing money away.
Implementation: Building a Prefix-Aware Cache Layer
Here's the architecture we deployed on AWS Lambda with ElastiCache for Redis. I'd show you our actual Terraform configs, but our infosec team would probably have opinions about that.
graph LR
A[Client Request] --> B{Prefix Extractor}
B --> C[Redis Cache Check]
C -->|Hit| D[Return Cached Response]
C -->|Miss| E[OpenAI API]
E --> F[Store in Redis]
F --> D
Pretty straightforward. The complexity is all in the prefix extraction logic.
Prerequisites
- Python 3.11+ (we're on 3.12.1 as of writing; had to bump from 3.10 for the @lru_cache performance improvements)
- openai>=1.54.0 (caching support added in 1.52.0, but 1.54.0 fixed a nasty race condition)
- redis>=5.0.0 (the async support in 5.x is worth the upgrade pain)
- tiktoken>=0.7.0
- AWS account with Lambda and ElastiCache access (and hopefully better budget alerts than we had)
Step 1: Token-Aware Prefix Hashing
This is the core of the whole thing. The @lru_cache on get_prefix_hash was a late addition. I realized we were re-encoding the same system prompts thousands of times.
import hashlib
import tiktoken
from typing import List, Optional
from functools import lru_cache


class PrefixCacheManager:
    def __init__(self, model: str = "gpt-4o", min_prefix_tokens: int = 20):
        self.encoder = tiktoken.encoding_for_model(model)
        self.model = model
        self.min_prefix_tokens = min_prefix_tokens

    # lru_cache on a method keys on (self, prompt, granularity) and keeps the
    # instance alive for as long as the cache holds one of its entries.
    @lru_cache(maxsize=1024)
    def get_prefix_hash(self, prompt: str, granularity: int = 50) -> str:
        """
        Generate hash from first N tokens of prompt.
        granularity: number of tokens to include in hash
        """
        tokens = self.encoder.encode(prompt)
        # Short prompts aren't worth prefix matching:
        # hash the whole prompt instead of a token prefix
        if len(tokens) < self.min_prefix_tokens:
            return hashlib.sha256(prompt.encode()).hexdigest()
        # Take first 'granularity' tokens for prefix matching
        prefix_tokens = tokens[:min(granularity, len(tokens))]
        prefix_text = self.encoder.decode(prefix_tokens)
        return hashlib.sha256(prefix_text.encode()).hexdigest()

    def find_cache_boundary(self, prompt1: str, prompt2: str) -> int:
        """Find token index where two prompts diverge"""
        tokens1 = self.encoder.encode(prompt1)
        tokens2 = self.encoder.encode(prompt2)
        for i, (t1, t2) in enumerate(zip(tokens1, tokens2)):
            if t1 != t2:
                return i
        return min(len(tokens1), len(tokens2))
You might wonder about the min_prefix_tokens=20 default. We started with 10 and got too many collisions. 50 was too aggressive, and the cache hit rate dropped. 20 seemed like the sweet spot, but honestly? Your mileage may vary. Test it.
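One way to test it is to replay a sample of real prompts and count how many would land on an already-seen prefix key at each granularity. This is a minimal sketch, not our production tooling: it assumes a whitespace tokenizer as a stand-in for tiktoken so it runs without dependencies, and `prefix_key` and `simulated_hit_rate` are hypothetical names.

```python
import hashlib
from typing import Callable, Iterable, List


def whitespace_tokens(text: str) -> List[str]:
    # Stand-in tokenizer (assumption); swap in a real encoder for real numbers.
    return text.split()


def prefix_key(text: str, granularity: int,
               tokenize: Callable[[str], List[str]] = whitespace_tokens) -> str:
    # Hash only the first `granularity` tokens of the prompt.
    prefix = tokenize(text)[:granularity]
    return hashlib.sha256(" ".join(prefix).encode()).hexdigest()


def simulated_hit_rate(prompts: Iterable[str], granularity: int) -> float:
    # Fraction of prompts whose prefix key was already seen earlier in the replay.
    seen = set()
    hits = 0
    total = 0
    for prompt in prompts:
        key = prefix_key(prompt, granularity)
        hits += key in seen
        seen.add(key)
        total += 1
    return hits / total if total else 0.0


sample = [
    "You are a document analyzer. Summarize contract A",
    "You are a document analyzer. Summarize contract B",
    "You are a code reviewer. Review module X",
]
for g in (3, 5, 8):
    print(g, round(simulated_hit_rate(sample, g), 2))
```

A shared key here is only a candidate hit. Whether the cached response is safe to serve still depends on the full-message check in Step 2, so treat these numbers as an upper bound.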
Step 2: Redis-Backed Cache with Prefix Strategy
Okay, so here's where I need to admit something. The first version of this code had a bug where we weren't checking full message hashes for collision detection. We got a cache hit, served the response, and it was... completely wrong. Different user message, same prefix hash. Oops.
import hashlib
import json
from datetime import timedelta, datetime, timezone
from typing import List

import redis
from openai import OpenAI


class OpenAIPrefixCache:
    def __init__(self, redis_url: str, ttl_hours: int = 24):
        self.redis = redis.from_url(redis_url)
        self.prefix_manager = PrefixCacheManager()
        self.client = OpenAI()
        self.ttl = timedelta(hours=ttl_hours)

    def chat_completion_with_cache(
        self,
        messages: List[dict],
        model: str = "gpt-4o",
        temperature: float = 0.7,
        prefix_granularity: int = 50
    ) -> dict:
        # Extract system + first user message for prefix
        prompt_prefix = self._extract_prefix(messages)
        prefix_hash = self.prefix_manager.get_prefix_hash(prompt_prefix, prefix_granularity)
        cache_key = f"openai:cache:{prefix_hash}"
        messages_hash = self._hash_messages(messages)

        # Check Redis cache
        cached = self.redis.get(cache_key)
        if cached:
            cache_data = json.loads(cached)
            # Verify full prompt matches (collision prevention)
            # This is the check I forgot in v1. Don't skip it.
            if cache_data['messages_hash'] == messages_hash:
                return {
                    **cache_data['response'],
                    '_cached': True,
                    '_cache_key': cache_key
                }

        # Cache miss - call OpenAI API
        response = self.client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature
        )

        # Store in Redis with prefix key
        self.redis.setex(
            cache_key,
            self.ttl,
            json.dumps({
                'response': response.model_dump(),
                'messages_hash': messages_hash,
                # datetime.utcnow() is deprecated as of Python 3.12
                'timestamp': datetime.now(timezone.utc).isoformat()
            })
        )
        return {**response.model_dump(), '_cached': False}

    def _extract_prefix(self, messages: List[dict]) -> str:
        """Extract cacheable prefix from messages"""
        prefix_parts = []
        for msg in messages:
            if msg['role'] in ['system', 'user']:
                prefix_parts.append(msg['content'])
            if len(prefix_parts) >= 2:  # System + first user message
                break
        return " ".join(prefix_parts)

    def _hash_messages(self, messages: List[dict]) -> str:
        """Full message hash for collision detection"""
        content = json.dumps(messages, sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()
The collision detection adds overhead. I debated removing it for performance. Then I remembered the $4,200 bill and kept it in.
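To make the failure mode concrete, here is a minimal sketch of the check in isolation. It assumes a plain dict in place of Redis, and `lookup`, `hash_messages`, and the `prefix:abc` key are hypothetical names for illustration.

```python
import hashlib
import json
from typing import Dict, List, Optional


def hash_messages(messages: List[dict]) -> str:
    # Hash the full, canonicalized message list.
    return hashlib.sha256(json.dumps(messages, sort_keys=True).encode()).hexdigest()


def lookup(store: Dict[str, dict], prefix_key: str, messages: List[dict]) -> Optional[str]:
    # A prefix-key hit is only a candidate; serve it only if the full hash matches.
    entry = store.get(prefix_key)
    if entry is None or entry["messages_hash"] != hash_messages(messages):
        return None
    return entry["response"]


store: Dict[str, dict] = {}
first = [{"role": "user", "content": "Summarize contract A"}]
second = [{"role": "user", "content": "Summarize contract B"}]

# Pretend both requests map to the same prefix key.
store["prefix:abc"] = {"messages_hash": hash_messages(first), "response": "summary of A"}

print(lookup(store, "prefix:abc", first))   # served from cache
print(lookup(store, "prefix:abc", second))  # rejected: same prefix, different messages
```

Without the hash comparison, the second request would have received the summary of contract A. One more thing to keep in mind: because the Step 2 code stores a single entry per prefix key, each miss overwrites whatever was cached under that key.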
Step 3: Production-Ready Usage Pattern
We've got 12 microservices using this pattern now. Each one has slightly different needs, so we ended up with strategy configs. It's not elegant, but it works.
# config/cache_strategies.py
CACHE_STRATEGIES = {
    "document_analysis": {
        "prefix_granularity": 100,  # Cache first 100 tokens
        "ttl_hours": 48,
        "template": """You are a document analyzer.
Analyze the following {doc_type} and extract:
1. Key entities
2. Main topics
3. Sentiment score (-1 to 1)
Document: {content}"""
    },
    "code_review": {
        "prefix_granularity": 75,
        "ttl_hours": 12,
        "template": """Review this {language} code for:
- Security vulnerabilities
- Performance issues
- Best practices violations
Code:
{code_snippet}"""
    }
}

# service.py
import os

from config.cache_strategies import CACHE_STRATEGIES
# OpenAIPrefixCache is the class from Step 2


class CachedLLMService:
    def __init__(self):
        self.cache = OpenAIPrefixCache(
            redis_url=os.getenv("REDIS_URL"),
            ttl_hours=24
        )

    def analyze_document(self, doc_type: str, content: str) -> dict:
        template = CACHE_STRATEGIES["document_analysis"]["template"]
        # Structure prompt for maximum prefix sharing
        # The split here is intentional: system prompt stays static
        system_msg = template.split("Document:")[0].format(doc_type=doc_type)
        user_msg = content
        messages = [
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_msg}
        ]
        return self.cache.chat_completion_with_cache(
            messages=messages,
            prefix_granularity=CACHE_STRATEGIES["document_analysis"]["prefix_granularity"]
        )
I'm not gonna lie—the template splitting with .split("Document:")[0] is hacky. We should probably use proper template inheritance or something. But it's been running in production for two months without issues, so... ship it?
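For what it's worth, a less hacky shape might keep the static system text as its own template, so nothing has to be split at runtime. This is a sketch of that idea, not what we run: `SYSTEM_TEMPLATE` and `build_messages` are hypothetical names, and it uses `string.Template` so that braces in document content can't trip up formatting.

```python
from string import Template
from typing import Dict, List

# Hypothetical config shape: only the static system text is templated.
SYSTEM_TEMPLATE = Template(
    "You are a document analyzer.\n"
    "Analyze the following $doc_type and extract:\n"
    "1. Key entities\n"
    "2. Main topics\n"
    "3. Sentiment score (-1 to 1)"
)


def build_messages(doc_type: str, content: str) -> List[Dict[str, str]]:
    # The system message is stable per doc_type; the user message carries the document.
    return [
        {"role": "system", "content": SYSTEM_TEMPLATE.substitute(doc_type=doc_type)},
        {"role": "user", "content": content},
    ]


messages = build_messages("contract", "Lorem {ipsum}")
print(messages[0]["content"].splitlines()[1])
```

The user message is passed through untouched, which matches what the split-based version does today.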
Benchmark Results: Before vs After
I ran these benchmarks using oha against our staging environment. 10,000 requests, 50 concurrent connections. The results were almost too good to believe.
# Before caching
oha -n 10000 -c 50 -m POST https://api.staging.company.com/v1/analyze \
  -H "Content-Type: application/json" \
  -d '{"doc_type":"contract","content":"Lorem ipsum..."}'
# Results:
# Latency: avg 1247ms, p99 3421ms
# Cost: $0.032/request average
# After prefix caching
# Latency: avg 183ms, p99 412ms
# Cost: $0.008/request average (75% reduction)
That p99 drop from 3.4 seconds to 412ms is... I mean, it's absurd. Our frontend team actually asked if we'd switched to a different model.
Well.
They asked after they stopped being mad about the $4,200 incident. Which took about a week.
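If you want to sanity-check the percentages, the arithmetic is short. A tiny sketch using only the figures quoted above; `pct_reduction` is a hypothetical helper name.

```python
def pct_reduction(before: float, after: float) -> float:
    # Relative drop from `before` to `after`, as a percentage.
    return (before - after) / before * 100


print(round(pct_reduction(0.032, 0.008)))   # cost per request
print(round(pct_reduction(1247, 183), 1))   # average latency
print(round(pct_reduction(3421, 412), 1))   # p99 latency
```

That works out to a 75% cost reduction, about 85% off average latency, and about 88% off p99.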
Common Pitfalls and Solutions
I made all of these mistakes. You don't have to.
Pitfall 1: Dynamic Content at Start of Prompt
This one seems obvious in hindsight. Wasn't obvious when I was instrumenting request IDs at 2 AM.
from uuid import uuid4

# ❌ Breaks caching
prompt = f"[RequestID: {uuid4()}] Translate to French: {text}"
# ✅ Move dynamic content to end
prompt = f"Translate to French: {text}\n[RequestID: {uuid4()}]"
The request ID still gets logged, still shows up in traces. It just doesn't murder your cache hit rate.
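You can see the effect by counting how many leading tokens two requests share. A minimal sketch, assuming a whitespace split as a stand-in for the real tokenizer; `shared_prefix_len` and the sample IDs are made up for illustration.

```python
from itertools import takewhile
from typing import List


def tokens(text: str) -> List[str]:
    # Whitespace split as a stand-in tokenizer (assumption).
    return text.split()


def shared_prefix_len(a: List[str], b: List[str]) -> int:
    # Count leading tokens that are identical in both sequences.
    return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], zip(a, b)))


id_first = [tokens(f"[RequestID: {rid}] Translate to French: hello") for rid in ("a1", "b2")]
id_last = [tokens(f"Translate to French: hello [RequestID: {rid}]") for rid in ("a1", "b2")]

print(shared_prefix_len(*id_first))  # 1
print(shared_prefix_len(*id_last))   # 5
```

With the ID up front, the two requests diverge after a single token. With the ID at the end, everything before it is shared.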
Pitfall 2: Inconsistent Whitespace
This bit us hard. Different services were normalizing whitespace differently, and nobody noticed because... well, it's whitespace. Who looks at whitespace?
import re

# These won't share cache due to tokenization differences
prompt1 = "Summarize:\n\n\n{text}"  # Extra newlines = different tokens
prompt2 = "Summarize:\n{text}"

# ✅ Normalize prefixes
def normalize_prefix(text: str) -> str:
    # Collapse every run of whitespace to a single space
    return re.sub(r'\s+', ' ', text).strip()
We added this normalization to our CI pipeline. If a PR changes whitespace in prompt templates, it flags for review. Overkill? Maybe. But it's saved us from at least three cache regression incidents.
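A CI check along those lines could look something like the sketch below. It is an illustration rather than our actual pipeline code: it assumes templates are available as name-to-text dicts for the base branch and the PR, and `whitespace_only_changes` is a hypothetical name.

```python
import re
from typing import Dict, List


def normalize_prefix(text: str) -> str:
    # Collapse every run of whitespace to a single space.
    return re.sub(r"\s+", " ", text).strip()


def whitespace_only_changes(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    # Templates whose raw text changed but whose normalized text did not.
    flagged = []
    for name, new_text in new.items():
        old_text = old.get(name)
        if old_text is None or old_text == new_text:
            continue
        if normalize_prefix(old_text) == normalize_prefix(new_text):
            flagged.append(name)
    return sorted(flagged)


base = {"summarize": "Summarize:\n{text}", "translate": "Translate: {text}"}
pr = {"summarize": "Summarize:\n\n\n{text}", "translate": "Translate to French: {text}"}
print(whitespace_only_changes(base, pr))  # ['summarize']
```

Real wording changes (like the `translate` template here) are left to normal code review. The check only surfaces edits a reviewer would otherwise scroll past.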
Pitfall 3: Ignoring Token Boundaries
I mentioned this earlier, but it's worth repeating because it's so counterintuitive. We lost 51 percentage points of cache hit rate—from 82% to 31%—by adding a single word to our system prompt. One word.
The fix was understanding that cache boundaries align with token boundaries, not logical sections. That word happened to shift token boundaries for the entire subsequent prompt. Fun times.
# Monitor cache boundaries
def analyze_cache_efficiency(prompts: List[str]) -> dict:
    encoder = tiktoken.encoding_for_model("gpt-4o")
    token_lengths = [len(encoder.encode(p)) for p in prompts]
    avg_tokens = sum(token_lengths) / len(token_lengths)
    return {
        "avg_tokens": avg_tokens,
        # Length of the shortest prompt: an upper bound on any shared prefix
        "min_shared_prefix": min(token_lengths),
        "optimal_granularity": int(avg_tokens * 0.6)
    }
We run this analyzer in our deployment pipeline now. If optimal granularity shifts by more than 10%, the deploy gets blocked. I sleep better.
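The gate itself can be tiny. A minimal sketch, assuming the baseline is the optimal granularity recorded at the last successful deploy; `granularity_shift` and `should_block_deploy` are hypothetical names.

```python
def granularity_shift(baseline: int, candidate: int) -> float:
    # Relative change in optimal granularity versus the last deployed value.
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return abs(candidate - baseline) / baseline


def should_block_deploy(baseline: int, candidate: int, threshold: float = 0.10) -> bool:
    # Block only when the shift exceeds the threshold.
    return granularity_shift(baseline, candidate) > threshold


print(should_block_deploy(60, 63))  # False: 5% shift
print(should_block_deploy(60, 72))  # True: 20% shift
```

In a pipeline, the candidate value would come from running the analyzer on the prompts rendered from the new templates.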
Monitoring and Observability
Because if you can't measure it, you can't fix it. And if you can't alert on it, you'll find out about problems from your finance team. 0/10, do not recommend.
from prometheus_client import Counter, Histogram

cache_hits = Counter('openai_cache_hits_total', 'Cache hit count')
cache_misses = Counter('openai_cache_misses_total', 'Cache miss count')
cache_latency = Histogram('openai_cache_latency_seconds', 'Cache operation latency')


class MonitoredCache(OpenAIPrefixCache):
    def chat_completion_with_cache(self, *args, **kwargs):
        with cache_latency.time():
            result = super().chat_completion_with_cache(*args, **kwargs)
        if result.get('_cached'):
            cache_hits.inc()
        else:
            cache_misses.inc()
        return result
Grafana dashboard alert: If cache_hit_ratio < 0.6 for 15 minutes → Slack notification to #llm-ops.
That alert has fired exactly twice in production. Both times, someone had "just made a small change" to a prompt template. Both times, I got to be the person who said "I told you so" in the postmortem. It's the little things.
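If you want to prototype the alert logic before wiring up dashboards, a rolling hit ratio is easy to compute in-process. This sketch approximates the rule as "hit ratio over the trailing 15 minutes is below 0.6", which is simpler than a sustained-for-15-minutes condition. `RollingHitRatio` is a hypothetical class, not part of prometheus_client.

```python
from collections import deque
from typing import Deque, Tuple


class RollingHitRatio:
    """Hit ratio over a trailing time window."""

    def __init__(self, window_seconds: float = 15 * 60):
        self.window = window_seconds
        self.events: Deque[Tuple[float, bool]] = deque()

    def record(self, timestamp: float, hit: bool) -> None:
        # Timestamps are assumed to arrive in non-decreasing order.
        self.events.append((timestamp, hit))

    def ratio(self, now: float) -> float:
        # Drop events older than the window, then compute hits / total.
        while self.events and self.events[0][0] < now - self.window:
            self.events.popleft()
        if not self.events:
            return 1.0  # no traffic: treat as healthy rather than alerting
        return sum(hit for _, hit in self.events) / len(self.events)

    def should_alert(self, now: float, threshold: float = 0.6) -> bool:
        return self.ratio(now) < threshold


tracker = RollingHitRatio()
tracker.record(0, True)
tracker.record(100, False)
tracker.record(200, False)
print(tracker.should_alert(now=300))  # True: 1 hit out of 3
```

Treating an empty window as healthy is a design choice. Flip it if silence from a service should page someone.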
Key Takeaways
Look, this got way longer than I planned. Here's what actually matters:
- Prefix matching is token-based, not character-based. Structure prompts accordingly or suffer.
- Granularity control lets you balance cache hit rate vs. specificity. We settled on 50-100 tokens, but test this for your use case.
- Always hash full messages for collision detection. I cannot stress this enough. I shipped a bug because I skipped this. Don't be me.
- Monitor cache boundaries. A small prompt change can absolutely destroy your cache effectiveness. Ask me how I know.
- Dynamic content goes at the end. Always. No exceptions. I will die on this hill.
GitHub Repository
Complete implementation with Docker Compose setup for local testing:
github.com/rajpatel/openai-prefix-cache
Includes:
- Full Redis cache layer
- Load testing scripts with k6
- Terraform configs for AWS ElastiCache
- Datadog dashboard JSON (Grafana version coming... eventually)
Quick note on the Terraform configs—they're set up for us-east-1 because that's where we run. Change the region or don't, I'm not your DevOps engineer.
Further Reading
What's your experience with LLM API caching? We're currently experimenting with predictive prefetching based on user behavior patterns and seeing 94% cache hit rates in early testing. It's probably over-engineered, but watching those Grafana graphs is addictive.
If you've tried similar approaches or have questions about implementing prefix matching in your stack, drop a comment. I'm genuinely curious what hit rates other teams are seeing. And if there's interest, I'll write up our prefetching architecture—it's even more ridiculous than this one.
Tags: #OpenAI #LLM #Caching #Python #DevOps #AWS #Redis #CostOptimization #APIDesign