How I Cut Our OpenAI Bill by 73% Using Semantic Caching (And Almost Killed Our Production Server)
Last November, during our Black Friday rush, I nearly did a spit-take with my morning coffee. Our AWS bill had arrived, and our OpenAI API costs had exploded 340% month-over-month. The kicker? 67% of those requests were basically the same questions, just worded differently.
My boss didn't even finish saying "Can we reduce this?" before I was already knee-deep in vector database documentation.
Honestly, I was skeptical about semantic caching at first. Traditional Redis caching is dead simple—exact key-value matching, low hit rate, but zero brain cells required. But when you watch users ask the same question 17 different ways, each one burning through tokens like there's no tomorrow... you realize exact matching is about as useful as a screen door on a submarine.
What Actually Is Semantic Caching?
Here's the thing—semantic caching doesn't care if strings match exactly. Instead, it calculates how similar two pieces of text are in meaning. If they're close enough, boom—cached response, zero tokens burned.
Let me show you what I mean. These three queries would be completely separate requests in a traditional cache:
"How do I optimize Python loop performance?"
"My Python loops are too slow, any way to speed them up?"
"What's the best way to improve for-loop execution speed in Python?"
Same. Exact. Answer. Semantic caching catches this. The second and third queries? Straight to cache. No API call. No token cost.
Game changer.
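To make "close enough" concrete, here's a minimal sketch of the comparison itself. Cosine similarity is one common way to score two embeddings (it's also the metric used in the Milvus search later on). The three-dimensional vectors are made-up stand-ins for real embeddings, and `is_cache_hit` is a hypothetical helper name, not something from our codebase.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_cache_hit(query_vec, cached_vec, threshold=0.92):
    """Treat the cached answer as reusable only above the similarity threshold."""
    return cosine_similarity(query_vec, cached_vec) >= threshold

# Toy vectors (invented): two loop-performance questions and one unrelated question
loops_slow = [0.9, 0.1, 0.2]
speed_up_loops = [0.85, 0.15, 0.25]
refund_policy = [0.1, 0.9, 0.3]

print(is_cache_hit(speed_up_loops, loops_slow))   # paraphrase: vectors point the same way
print(is_cache_hit(refund_policy, loops_slow))    # different topic: vectors diverge
```

Real embeddings have far more dimensions, but the decision is the same: one number compared against one threshold.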
The Architecture I'm Running in Production
After some trial and error (okay, a lot of error), here's what I landed on:
User Query → Embedding Model → Vector Similarity Search → Hit?
    Yes → Return Cached Response
    No → Call OpenAI → Store Vector + Response
Three core components:
- Embedding Model: text-embedding-3-small (cheap, solid, gets the job done)
- Vector Database: Milvus (been running it for 8 months now)
- Similarity Threshold: 0.92
Wait—scratch that last part. 0.92 wasn't my first choice. I'll get to that disaster story in a minute.
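If you want to poke at the flow before standing up Milvus, here's a toy in-memory sketch of the same three pieces. Everything in it is a stand-in I made up for illustration: `InMemorySemanticCache`, the keyword-counting `fake_embed`, and `fake_generate` replace the embedding model, the vector database, and the OpenAI call.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

class InMemorySemanticCache:
    def __init__(self, embed, generate, threshold=0.92):
        self.embed = embed        # text -> vector (stand-in for the embedding model)
        self.generate = generate  # text -> answer (stand-in for the OpenAI call)
        self.threshold = threshold
        self.entries = []         # (vector, answer) pairs; stand-in for the vector DB

    def ask(self, query):
        vec = self.embed(query)
        # Nearest cached entry by similarity, or None if the cache is empty
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1], True      # hit: reuse the stored answer
        answer = self.generate(query)  # miss: pay for generation once
        self.entries.append((vec, answer))
        return answer, False

# Fake embedding: counts words that start with each keyword (illustration only)
VOCAB = ("python", "loop", "speed", "refund")

def fake_embed(text):
    words = text.lower().split()
    return [float(sum(w.startswith(v) for w in words)) for v in VOCAB]

calls = []

def fake_generate(query):
    calls.append(query)
    return f"answer for: {query}"

cache = InMemorySemanticCache(fake_embed, fake_generate)
first = cache.ask("python loop speed")
second = cache.ask("speed up python loops")  # same keywords, different wording
third = cache.ask("refund please")
print(first[1], second[1], third[1], len(calls))
```

The linear scan over `entries` is fine for a demo and nothing else. That's the part a real vector index replaces.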
The Code (What Actually Works)
    from openai import OpenAI
    from pymilvus import Collection, connections

    # Written against the openai>=1.0 client. The older module-level calls
    # (openai.Embedding.create, openai.ChatCompletion.create) were removed in 1.0.
    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    class SemanticCache:
        def __init__(self, similarity_threshold=0.92):
            self.threshold = similarity_threshold
            connections.connect(host='localhost', port='19530')
            # Assumes the "openai_cache" collection and its index already exist
            self.collection = Collection("openai_cache")
            self.collection.load()  # a collection must be loaded before searching

        def get_embedding(self, text):
            response = client.embeddings.create(
                model="text-embedding-3-small",
                input=text
            )
            return response.data[0].embedding

        def search_similar(self, query_embedding):
            search_params = {"metric_type": "COSINE", "params": {"nprobe": 10}}
            results = self.collection.search(
                data=[query_embedding],
                anns_field="embedding",
                param=search_params,
                limit=1,
                output_fields=["response"]
            )
            # With the COSINE metric, `distance` holds the similarity (higher = closer)
            if len(results[0]) > 0 and results[0][0].distance >= self.threshold:
                return results[0][0].entity.get('response')
            return None

        def query_with_cache(self, user_query):
            # Generate query embedding
            query_embedding = self.get_embedding(user_query)
            # Search for similar cached responses
            cached_response = self.search_similar(query_embedding)
            if cached_response is not None:
                return cached_response, True  # Cache hit
            # Cache miss: call OpenAI
            response = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": user_query}]
            )
            answer = response.choices[0].message.content
            # Store in cache. Column order must match the collection schema
            # (assumed here: auto-ID primary key, then embedding, response, query text)
            self.collection.insert([
                [query_embedding],
                [answer],
                [user_query]
            ])
            return answer, False
Three Numbers That Made My CFO Smile
1. Token Usage Dropped 73%
We plugged this into our customer support system. 30-day results:
- Total queries: 1,247,893
- Cache hits: 912,456 (73.1%)
- Actual API calls: 335,437
- Tokens saved: ~187 million
- Monthly cost: $2,840 → $767
That's not a typo. We saved over two grand in one month.
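If you want to sanity-check those numbers yourself, the arithmetic is short. This only uses the figures listed above; the variable names are mine.

```python
total_queries = 1_247_893
cache_hits = 912_456
cost_before = 2840  # USD per month
cost_after = 767    # USD per month

api_calls = total_queries - cache_hits
hit_rate = cache_hits / total_queries
monthly_savings = cost_before - cost_after
cost_reduction = monthly_savings / cost_before

print(f"API calls: {api_calls:,}")                               # API calls: 335,437
print(f"Hit rate: {hit_rate:.1%}")                               # Hit rate: 73.1%
print(f"Savings: ${monthly_savings:,} ({cost_reduction:.0%})")   # Savings: $2,073 (73%)
```

The cost reduction lands almost exactly on the hit rate, which is what you'd expect if hits and misses cost about the same per request on average.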
2. Response Latency: 2.3 Seconds → 180 Milliseconds
This one hit different. Users noticed immediately. Before, every request waited for GPT-4 to generate a response. Now? For the roughly 73% of requests that hit the cache, the response comes back in about 180ms instead of 2.3 seconds.
Our support team told me the "bot is too slow" complaints just... stopped. Completely.
Well—here's the nuance. 180ms is pure cache hits. If there's a miss, you're back to 2 seconds. So the real experience is "blazing fast most of the time, occasional slow poke." But users seem fine with that pattern.
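Here's a rough sketch of what that mix looks like in numbers. It's a deliberately simplified model: I'm assuming every hit takes exactly 180ms and every miss takes the old 2.3 seconds, and ignoring the small embedding and search overhead that a miss also pays.

```python
hit_rate = 0.73
hit_latency_s = 0.18
miss_latency_s = 2.3   # assumption: a miss costs the same as an uncached request

# Mean latency is the hit/miss latencies weighted by how often each happens
mean_latency_s = hit_rate * hit_latency_s + (1 - hit_rate) * miss_latency_s

def percentile_latency(p):
    """Latency at percentile p (0 to 1) when every hit is faster than every miss."""
    return hit_latency_s if p <= hit_rate else miss_latency_s

print(f"mean: {mean_latency_s:.2f}s")
print(f"p50:  {percentile_latency(0.50)}s")
print(f"p99:  {percentile_latency(0.99)}s")
```

Under those assumptions the median request sees the fast path, while the tail percentiles are still misses. So if you report latency to stakeholders, say which number you mean.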
3. Why 0.92? (The Week I Lost to Threshold Tuning)
I spent an entire week on this number. No joke.
Started at 0.85. Big mistake. Users asking "How do I get a refund?" were hitting the cache for "How do I return an item?" Similar-ish? Sure. Same answer? Absolutely not. We had some very confused customers.
Cranked it to 0.95. Too strict. Hit rate plummeted from 73% to 41%. At that point, why bother?
0.92 was the sweet spot. Caught the semantic duplicates without crossing into "wrong answer" territory. Your mileage may vary—this depends heavily on your domain and query patterns.
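If I were doing that week again, I'd start from a small labeled set instead of production complaints. Here's a minimal sketch of the idea. The scores and labels below are invented for illustration (they are not our data), and `evaluate` is a hypothetical helper.

```python
# (similarity score, does the cached answer actually fit?) -- invented sample pairs
labeled_pairs = [
    (0.97, True), (0.95, True), (0.94, True), (0.93, True),
    (0.91, False), (0.90, True), (0.88, False), (0.86, False),
]

def evaluate(threshold, pairs):
    """Return (hit rate, number of hits that would serve a wrong answer)."""
    hits = [same_answer for score, same_answer in pairs if score >= threshold]
    hit_rate = len(hits) / len(pairs)
    wrong_hits = sum(1 for same_answer in hits if not same_answer)
    return hit_rate, wrong_hits

for threshold in (0.85, 0.92, 0.95):
    hit_rate, wrong_hits = evaluate(threshold, labeled_pairs)
    print(f"threshold={threshold}: hit rate {hit_rate:.0%}, wrong answers {wrong_hits}")
```

The trade-off shows up even on eight pairs: a lower threshold buys hit rate with wrong answers, and a higher one throws away safe hits. Label a few hundred pairs from your own logs and the right number for your domain should be much easier to spot.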
The Mistakes I Made (So You Don't Have To)
Mistake #1: FAISS Ate All My RAM
I started with FAISS for local vector indexing because I thought, "Hey, it's simple, it'll work."
Three days later, my server hit 32GB of RAM and OOM-killed the process. FAISS's IndexFlatIP does brute-force search and loads everything into memory. Production? Not a chance.
Switched to Milvus with IVF_FLAT indexing. Memory stabilized under 4GB, and queries got 3x faster. Sometimes the "easy" solution is just the wrong one.
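A back-of-the-envelope check would have warned me. A flat index keeps every vector in RAM as raw float32, so the floor is just count times dimension times 4 bytes. The vector count and dimension below are illustrative, not our production numbers (1536 is the default dimension of text-embedding-3-small), and real process memory will be higher than this floor.

```python
def flat_index_bytes(num_vectors, dim, bytes_per_float=4):
    """Lower bound on RAM for a flat index that stores raw float32 vectors."""
    return num_vectors * dim * bytes_per_float

estimate = flat_index_bytes(1_000_000, 1536)
print(f"{estimate / 2**30:.1f} GiB for 1M vectors at 1536 dimensions")
```

Run it with your own expected cache size before picking an index type.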
Mistake #2: The Wrong Embedding Model for Chinese
Tried to save money with the open-source all-MiniLM-L6-v2 model. Works great for English. For Chinese? Absolute garbage.
"笔记本电脑" (laptop) and "笔记本" (notebook) had a similarity score of 0.67. That's... not helpful. At all.
Switched to OpenAI's text-embedding-3-small. Night and day difference for multilingual support. Yes, generating embeddings costs tokens, but embedding tokens are priced at a tiny fraction of GPT-4 tokens, so it's basically noise compared to GPT-4 generation costs.
Mistake #3: The "Set It and Forget It" Expiry Policy
My first deployment had a global 24-hour TTL. Lazy? Absolutely. Functional? Until the day we updated our return policy.
Users started asking about the new policy. Cache kept serving the old one. Our support team nearly mutinied.
Now I use tiered TTLs:
- Factual content (addresses, phone numbers): 7 days
- Policy content: 1 hour
- Real-time data (inventory, pricing): no caching at all
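The tiers above can be sketched as a small lookup plus a freshness check. This is an illustration, not our production code: the category names, `TTL_SECONDS`, `is_cacheable`, and `is_fresh` are hypothetical, and how each query gets assigned a category is a separate problem I'm not showing here.

```python
import time

# Tiered TTLs in seconds; 0 means "never cache"
TTL_SECONDS = {
    "factual": 7 * 24 * 3600,  # addresses, phone numbers
    "policy": 3600,            # return policy and similar
    "realtime": 0,             # inventory, pricing
}

def is_cacheable(category):
    """Unknown categories are treated as non-cacheable, the conservative default."""
    return TTL_SECONDS.get(category, 0) > 0

def is_fresh(category, stored_at, now=None):
    """True if an entry stored at `stored_at` (epoch seconds) is still within its TTL."""
    now = time.time() if now is None else now
    ttl = TTL_SECONDS.get(category, 0)
    return ttl > 0 and (now - stored_at) < ttl

print(is_fresh("policy", stored_at=0, now=1800))   # 30 minutes old
print(is_fresh("policy", stored_at=0, now=7200))   # 2 hours old
print(is_cacheable("realtime"))
```

A stale entry should be treated like a miss: regenerate, then overwrite it.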
The Pro Move: Proactive Cache Warming
Passive caching is fine. But I wanted more.
I built a warming script that analyzes historical query logs, extracts high-frequency question patterns, and pre-generates the answers (and their embeddings) during off-peak hours (3 AM UTC).
    # High-frequency query warming script
    cache = SemanticCache()
    hot_queries = [
        "How do I reset my password?",
        "Where's my order?",
        "What's your return policy?",
        "How do I contact support?",
        # ... Top 100 questions from log analysis
    ]
    for query in hot_queries:
        # On a miss, query_with_cache calls GPT-4 and stores the answer;
        # on a hit, the entry is already warm and only an embedding call is spent
        cache.query_with_cache(query)
This bumped our hit rate another 8 percentage points. More importantly, users never feel that "cold start" lag during peak hours.
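The log-analysis half of that script is the easy part. Here's a minimal sketch, assuming the log is already a list of raw query strings. `normalize` and `top_queries` are hypothetical names, and the sample log is invented.

```python
from collections import Counter

def normalize(query):
    """Lowercase and collapse whitespace so trivial variants count together."""
    return " ".join(query.lower().split())

def top_queries(log_lines, n=100):
    """Return the n most frequent normalized queries, most common first."""
    counts = Counter(normalize(q) for q in log_lines if q.strip())
    return [query for query, _ in counts.most_common(n)]

sample_log = [
    "How do I reset my password?",
    "how do I reset my password? ",
    "Where's my order?",
    "How do I reset my   password?",
    "Where's my order?",
    "What's your return policy?",
]
hot = top_queries(sample_log, n=2)
print(hot)
```

Exact-match counting undercounts paraphrases, so treat this as a first pass. Clustering queries by embedding would group reworded questions, but that's more machinery than this sketch needs.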
The Bottom Line
| Metric | Before | After | Change |
|---|---|---|---|
| Monthly API Calls | 1.25M | 335K | -73% |
| Monthly Tokens | 256M | 69M | -73% |
| Monthly Cost | $2,840 | $767 | -73% |
| Response Time (cache hits) | 2.3s | 0.18s | -92% |
| Cache Hit Rate | 0% | 73% | n/a |
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.