How I Cut Our OpenAI Bill by 73% Using Semantic Caching (And Almost Killed Our Production Server)

Last November, during our Black Friday rush, I nearly did a spit-take with my morning coffee. Our AWS bill had arrived, and our OpenAI API costs had exploded 340% month-over-month. The kicker? 67% of those requests were basically the same questions, just worded differently.

My boss didn't even finish saying "Can we reduce this?" before I was already knee-deep in vector database documentation.

Honestly, I was skeptical about semantic caching at first. Traditional Redis caching is dead simple—exact key-value matching, low hit rate, but zero brain cells required. But when you watch users ask the same question 17 different ways, each one burning through tokens like there's no tomorrow... you realize exact matching is about as useful as a screen door on a submarine.

What Actually Is Semantic Caching?

Here's the thing—semantic caching doesn't care if strings match exactly. Instead, it calculates how similar two pieces of text are in meaning. If they're close enough, boom—cached response, zero tokens burned.

Let me show you what I mean. These three queries would be completely separate requests in a traditional cache:


"How do I optimize Python loop performance?"
"My Python loops are too slow, any way to speed them up?"
"What's the best way to improve for-loop execution speed in Python?"

Same. Exact. Answer. Semantic caching catches this. The second and third queries? Straight to cache. No API call. No token cost.

Game changer.
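To make "close enough" concrete, here is a minimal sketch of the comparison a semantic cache performs. The three-dimensional vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions), and the helper names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_cache_hit(query_vec, cached_vec, threshold=0.92):
    """A cached entry counts as a hit when similarity reaches the threshold."""
    return cosine_similarity(query_vec, cached_vec) >= threshold

# Toy "embeddings", invented for this example only.
loops_slow = [0.90, 0.10, 0.30]      # "My Python loops are too slow..."
speed_up_loops = [0.85, 0.15, 0.35]  # "How do I optimize Python loop performance?"
refund_policy = [0.10, 0.95, 0.20]   # an unrelated question

print(is_cache_hit(loops_slow, speed_up_loops))  # near-identical direction
print(is_cache_hit(loops_slow, refund_policy))   # very different direction
```

Two vectors pointing in nearly the same direction clear the threshold. An unrelated one does not, no matter how the strings compare character by character.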

The Architecture I'm Running in Production

After some trial and error (okay, a lot of error), here's what I landed on:


User Query → Embedding Model → Vector Similarity Search → Hit?
    Yes → Return Cached Response
    No  → Call OpenAI → Store Vector + Response

Three core components:

Embedding Model: text-embedding-3-small — cheap, solid, gets the job done
Vector Database: Milvus — been running it for 8 months now
Similarity Threshold: 0.92

Wait—scratch that last part. 0.92 wasn't my first choice. I'll get to that disaster story in a minute.
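The flow is easier to reason about without any infrastructure in the way. Below is a rough sketch that swaps Milvus for a plain Python list and takes the embedding and generation steps as injected functions. `InMemorySemanticCache`, `toy_embed`, and `fake_llm` are all hypothetical stand-ins, not part of any library.

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]

def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class InMemorySemanticCache:
    def __init__(self, embed: Callable[[str], Vector],
                 generate: Callable[[str], str], threshold: float = 0.92):
        self.embed = embed
        self.generate = generate
        self.threshold = threshold
        self.entries: List[Tuple[Vector, str]] = []

    def lookup(self, vec: Vector) -> Optional[str]:
        """Return the best-matching cached answer, or None below the threshold."""
        best_score, best_answer = -1.0, None
        for cached_vec, answer in self.entries:
            score = cosine(vec, cached_vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def ask(self, query: str) -> Tuple[str, bool]:
        vec = self.embed(query)
        cached = self.lookup(vec)
        if cached is not None:
            return cached, True           # hit: no generation call
        answer = self.generate(query)     # miss: call the model
        self.entries.append((vec, answer))
        return answer, False

# Stand-ins for the embedding model and the LLM (illustration only).
VOCAB = ["python", "loop", "speed", "refund"]

def toy_embed(text: str) -> Vector:
    lowered = text.lower()
    return [float(lowered.count(word)) for word in VOCAB]

calls: List[str] = []

def fake_llm(query: str) -> str:
    calls.append(query)
    return f"answer to: {query}"

cache = InMemorySemanticCache(toy_embed, fake_llm)
first = cache.ask("Speed up a Python loop")
second = cache.ask("Python loop speed tips")
third = cache.ask("How do I get a refund?")
print(first[1], second[1], third[1], len(calls))
```

The linear scan in `lookup` is only workable for small caches. A vector database replaces that loop with an approximate nearest-neighbor index.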

The Code (What Actually Works)


from openai import OpenAI
from pymilvus import Collection, connections

# Requires openai>=1.0; the client reads OPENAI_API_KEY from the environment.
client = OpenAI()


class SemanticCache:
    def __init__(self, similarity_threshold=0.92):
        self.threshold = similarity_threshold
        connections.connect(host='localhost', port='19530')
        # Assumes the "openai_cache" collection already exists with a
        # COSINE-metric index on its "embedding" field.
        self.collection = Collection("openai_cache")
        self.collection.load()  # a collection must be loaded before it can be searched

    def get_embedding(self, text):
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

    def search_similar(self, query_embedding):
        search_params = {"metric_type": "COSINE", "params": {"nprobe": 10}}
        results = self.collection.search(
            data=[query_embedding],
            anns_field="embedding",
            param=search_params,
            limit=1,
            output_fields=["response"]
        )
        # With the COSINE metric, Milvus reports similarity in `distance`
        # (higher means closer), so a hit is anything at or above the threshold.
        if len(results[0]) > 0 and results[0][0].distance >= self.threshold:
            return results[0][0].entity.get('response')
        return None

    def query_with_cache(self, user_query):
        # Generate query embedding
        query_embedding = self.get_embedding(user_query)

        # Search for similar cached responses
        cached_response = self.search_similar(query_embedding)
        if cached_response is not None:
            return cached_response, True  # Cache hit

        # Cache miss: call OpenAI
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": user_query}]
        )
        answer = response.choices[0].message.content

        # Store in cache. Column order must match the collection schema
        # (here: embedding, response, original query).
        self.collection.insert([
            [query_embedding],
            [answer],
            [user_query]
        ])

        return answer, False

Three Numbers That Made My CFO Smile

1. Token Usage Dropped 73%

We plugged this into our customer support system. 30-day results:

Total queries: 1,247,893
Cache hits: 912,456 (73.1%)
Actual API calls: 335,437
Tokens saved: ~187 million
Monthly cost: $2,840 → $767

That's not a typo. We saved over two grand in one month.
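The figures above are easy to sanity-check. This short sketch recomputes the derived numbers from the raw counts and monthly costs quoted in the list. The variable names are mine.

```python
total_queries = 1_247_893
cache_hits = 912_456
cost_before = 2840  # USD per month
cost_after = 767    # USD per month

api_calls = total_queries - cache_hits
hit_rate = cache_hits / total_queries
savings = cost_before - cost_after
cost_reduction = savings / cost_before

print(f"API calls: {api_calls:,}")
print(f"Hit rate: {hit_rate:.1%}")
print(f"Savings: ${savings:,} ({cost_reduction:.0%})")
```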

2. Response Latency: 2.3 Seconds → 180 Milliseconds

This one hit different. Users noticed immediately. Before, every request waited for GPT-4 to generate a response. Now? Cache hits return in about 180ms, down from 2.3 seconds for a full generation.

Our support team told me the "bot is too slow" complaints just... stopped. Completely.

Well—here's the nuance. 180ms is pure cache hits. If there's a miss, you're back to 2 seconds. So the real experience is "blazing fast most of the time, occasional slow poke." But users seem fine with that pattern.
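To put a number on that mixed experience, here is a back-of-the-envelope estimate of the blended mean latency. It assumes a miss costs the original 2.3 seconds and uses the 73.1% hit rate from the first result. Both inputs are simplifications, since a miss also pays for the embedding call and the vector search.

```python
def blended_latency(hit_rate, hit_latency_s, miss_latency_s):
    """Expected latency when a fraction `hit_rate` of requests are cache hits."""
    return hit_rate * hit_latency_s + (1 - hit_rate) * miss_latency_s

mean_latency = blended_latency(hit_rate=0.731, hit_latency_s=0.18, miss_latency_s=2.3)
print(f"Blended mean latency: {mean_latency:.2f}s")
```

Under those assumptions the average lands around three quarters of a second. Tail percentiles still reflect the miss path, because roughly one request in four is a miss.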

3. Why 0.92? (The Week I Lost to Threshold Tuning)

I spent an entire week on this number. No joke.

Started at 0.85. Big mistake. Users asking "How do I get a refund?" were hitting the cache for "How do I return an item?" Similar-ish? Sure. Same answer? Absolutely not. We had some very confused customers.

Cranked it to 0.95. Too strict. Hit rate plummeted from 73% to 41%. At that point, why bother?

0.92 was the sweet spot. Caught the semantic duplicates without crossing into "wrong answer" territory. Your mileage may vary—this depends heavily on your domain and query patterns.
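One way to shorten that week is to label a sample of query pairs by hand and sweep the threshold offline. The sketch below assumes you have, for each sampled query, the similarity score of its nearest cached entry and a human judgment on whether the cached answer was correct. The sample data is invented purely to show the mechanics.

```python
from typing import List, Tuple

# (similarity score, was the cached answer actually correct for this query?)
LabeledPair = Tuple[float, bool]

def sweep(pairs: List[LabeledPair], thresholds: List[float]):
    """For each threshold, report the hit rate and the share of hits that were wrong."""
    rows = []
    for t in thresholds:
        hits = [ok for score, ok in pairs if score >= t]
        hit_rate = len(hits) / len(pairs)
        wrong_rate = hits.count(False) / len(hits) if hits else 0.0
        rows.append((t, hit_rate, wrong_rate))
    return rows

# Hypothetical labeled sample, not real measurements.
sample = [
    (0.97, True), (0.95, True), (0.93, True), (0.92, True),
    (0.90, False), (0.88, True), (0.86, False), (0.80, False),
]

rows = sweep(sample, [0.85, 0.92, 0.95])
for threshold, hit_rate, wrong_rate in rows:
    print(f"{threshold:.2f}  hit rate {hit_rate:.0%}  wrong answers {wrong_rate:.0%}")
```

Picking the lowest threshold whose wrong-answer share is acceptable gives a starting point. It still needs validating against live traffic.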

The Mistakes I Made (So You Don't Have To)

Mistake #1: FAISS Ate All My RAM

I started with FAISS for local vector indexing because I thought, "Hey, it's simple, it'll work."

Three days later, my server hit 32GB of RAM and OOM-killed the process. FAISS's IndexFlatIP does brute-force search and loads everything into memory. Production? Not a chance.

Switched to Milvus with IVF_FLAT indexing. Memory stabilized under 4GB, and queries got 3x faster. Sometimes the "easy" solution is just the wrong one.
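A quick estimate shows why a flat in-memory index gets expensive. The sketch below counts only the raw float32 vectors, so treat it as a lower bound. The vector counts are hypothetical, and 1536 is the default output dimension of text-embedding-3-small.

```python
def flat_index_bytes(num_vectors: int, dim: int, bytes_per_float: int = 4) -> int:
    """Raw storage for a flat float32 index: one full vector per cached entry."""
    return num_vectors * dim * bytes_per_float

def to_gib(num_bytes: int) -> float:
    return num_bytes / 2**30

for n in (1_000_000, 5_000_000):  # hypothetical cache sizes
    print(f"{n:>9,} vectors x 1536 dims = {to_gib(flat_index_bytes(n, 1536)):.2f} GiB")
```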

Mistake #2: The Wrong Embedding Model for Chinese

Tried to save money with the open-source all-MiniLM-L6-v2 model. Works great for English. For Chinese? Absolute garbage.

"笔记本电脑" (laptop) and "笔记本" (notebook) had a similarity score of 0.67. That's... not helpful. At all.

Switched to OpenAI's text-embedding-3-small. Night and day difference for multilingual support. Yes, generating embeddings costs tokens, but embedding models are priced at a small fraction of GPT-4's per-token rate, so it's basically noise compared to GPT-4 generation costs.

Mistake #3: The "Set It and Forget It" Expiry Policy

My first deployment had a global 24-hour TTL. Lazy? Absolutely. Functional? Until the day we updated our return policy.

Users started asking about the new policy. Cache kept serving the old one. Our support team nearly mutinied.

Now I use tiered TTLs:

Factual content (addresses, phone numbers): 7 days
Policy content: 1 hour
Real-time data (inventory, pricing): no caching at all
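The tiers above can be sketched as a small lookup table plus a freshness check. How each cached entry gets its category (a classifier, routing rules, manual tags) is not covered here, so the category names and the choice to treat unknown categories as uncacheable are assumptions.

```python
import time
from typing import Optional

HOUR = 3600
DAY = 24 * HOUR

# TTL per content category, in seconds. Zero means "never cache".
TTL_SECONDS = {
    "factual": 7 * DAY,   # addresses, phone numbers
    "policy": 1 * HOUR,   # return policy and similar
    "realtime": 0,        # inventory, pricing
}

def is_cacheable(category: str) -> bool:
    # Unknown categories fall back to "do not cache" (an assumption).
    return TTL_SECONDS.get(category, 0) > 0

def is_fresh(category: str, stored_at: float, now: Optional[float] = None) -> bool:
    """True if an entry stored at `stored_at` (epoch seconds) is still within its TTL."""
    now = time.time() if now is None else now
    ttl = TTL_SECONDS.get(category, 0)
    return ttl > 0 and (now - stored_at) < ttl

print(is_fresh("policy", stored_at=0, now=30 * 60))   # 30 minutes old
print(is_fresh("policy", stored_at=0, now=2 * HOUR))  # 2 hours old
```

A TTL only bounds staleness. For changes like a policy update, explicitly deleting the affected entries at publish time closes the gap entirely.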

The Pro Move: Proactive Cache Warming

Passive caching is fine. But I wanted more.

I built a warming script that analyzes historical query logs, extracts high-frequency question patterns, and pre-generates embeddings and answers for them during off-peak hours (3 AM UTC).


# High-frequency query warming script
cache = SemanticCache()

hot_queries = [
    "How do I reset my password?",
    "Where's my order?",
    "What's your return policy?",
    "How do I contact support?",
    # ... Top 100 questions from log analysis
]

for query in hot_queries:
    # On a miss, query_with_cache generates the answer and stores it,
    # so each hot query is cached before peak traffic arrives.
    cache.query_with_cache(query)

This bumped our hit rate another 8 percentage points. More importantly, users never feel that "cold start" lag during peak hours.
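The log-analysis step that feeds the hot list can be approximated with a simple frequency count. This is a simplified sketch: it groups queries by exact match after light normalization, which misses paraphrases (clustering by embedding would catch those), and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List

def normalize(query: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing punctuation."""
    return re.sub(r"\s+", " ", query.strip().lower()).rstrip("?!. ")

def top_queries(log_lines: Iterable[str], n: int = 100) -> List[str]:
    """Return the n most frequent normalized queries from a query log."""
    counts = Counter(normalize(line) for line in log_lines if line.strip())
    return [query for query, _ in counts.most_common(n)]

# Invented log lines for illustration.
logs = [
    "Where's my order?",
    "where's my order",
    "How do I reset my password?",
    "Where's  my ORDER?!",
    "How do I reset my password?",
    "What's your return policy?",
]

print(top_queries(logs, n=2))
```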

The Bottom Line

Metric	Before	After	Change
Monthly API Calls	1.25M	335K	-73%
Monthly Tokens	256M	69M	-73%
Monthly Cost	$2,840	$767	-73%
Response Time (cache hits)	2.3s	0.18s	-92%
Cache Hit Rate	0%	73%	n/a

Infrastructure costs (Milvus server + embedding calls) run about $120/month. Against $2,073 in savings, the ROI is... let's just say my boss stopped asking about it.

When This Works (And When It Doesn't)

Perfect for:

Customer support chatbots
Documentation search
Knowledge base queries
High-volume, standardized Q&A

Skip it if you're doing:

Creative writing (every request is unique)
Code generation (context is hyper-personalized)
Real-time analytics (data changes constantly)

Here's my rule of thumb: if more than 40% of your queries are repeats, semantic caching is a no-brainer. Under 20%? Stick with Redis and call it a day.

TL;DR: Semantic caching uses vector similarity to catch duplicate queries even when they're worded differently. We cut our OpenAI bill by 73%, dropped latency from 2.3s to 180ms on cache hits, and only spent $120/month on infrastructure. The trick is tuning your similarity threshold (0.92 worked for us) and not cheaping out on your embedding model.

What's your experience with semantic caching? Anyone else accidentally serve wrong answers because of a loose threshold? I'm currently experimenting with RAG + semantic caching for better recall—would love to hear if anyone's tried that combo.

#semanticcaching #openai #tokenoptimization #vectordatabase #costoptimization #llmops



Cael Lee
