I Spent $3,600 Testing 7 Memory Strategies for AI Agents — Here's What Actually Works
I Spent $3,600 Testing 7 Memory Strategies for AI Agents — Here's What Actually Works
Last year I almost bankrupted my team building a customer service agent. First week in production, our token costs were 3x the budget, and my boss's face was greener than our error logs. I ended up testing seven different memory management approaches before realizing this isn't just "shove chat history into the prompt" territory.
Here's what I learned — and exactly how much each approach costs in the real world.
First, Let's Talk Token Math
A lot of people think tokens equal word count. They don't.
With GPT-4 class models, 1,000 Chinese characters eat up roughly 1,500-1,800 tokens on input. Output is separate. At the time I ran these tests (June 2024), GPT-4o pricing was $2.50/1M input tokens and $10/1M output tokens through most API providers.
In practical terms: every 1M input tokens cost me about $2.50, output was $10.
Wait — I should clarify something. Those were June 2024 prices. As of January 2025, OpenAI's dropped input to $1.50/1M. But all my calculations here use the old pricing since that's when I actually ran this project. Makes it easier to compare across approaches.
Alright, with that pricing in mind, let's look at the options.
Approach 1: Full Context Window
The idea: Dump the entire conversation history into every prompt. Simple.
The math:
Say you've got 20 turns per conversation, averaging 500 tokens each (user + assistant). That's 10,000 tokens of context per request. At 1,000 conversations per day, you're burning through 10M input tokens daily — about $25.
How it actually went:
This is exactly what I did at first. "The context window is 128k tokens," I thought. "What could go wrong?"
Three days later I got the API billing warning. We'd hit $55 in a single day. Worse, response times kept creeping up. Users started complaining that our bot was "contemplating the meaning of life" between replies.
I remember one user waited 8 seconds for a response and just typed "trash" before leaving.
Verdict: Fine for proof-of-concept. Terrible for production. Costs grow at O(n²) — the longer the conversation, the more absurd it gets.
Approach 2: Sliding Window
The idea: Only keep the last N turns. Old messages fall off naturally.
The math:
Set the window to 10 turns, and your context drops to 5,000 tokens. Same scenario — daily cost goes from $25 to about $12.50. Literally cut in half.
The problem:
One time a user said "that order I mentioned earlier," and the agent went completely blank. "Earlier" was turn 11. Already gone. The user typed "useless bot" and bounced.
This approach works for FAQ-style customer service where conversations are short and transactional. But if you're building something like financial advising or legal consultation with long-running threads? Don't even bother.
I eventually hacked together a "critical info pinning" mechanism — order numbers, phone numbers, anything explicitly stated gets stored separately and never slides out. That patch was painful to build.
Approach 3: Summarization
The idea: Use another LLM call to periodically compress conversation history into a summary, replacing the raw messages.
The math:
Trigger a summary every 10 turns. Each summary call burns ~2,000 tokens input (the raw conversation) + ~500 tokens output (the summary). At 1,000 conversations/day, that's 100 summary calls, adding about 250K tokens — roughly $0.60 extra. Combined with a sliding window for recent messages, total daily cost: around $13.
Where it went wrong:
One time the summarizer compressed "user said they don't like red" into "user has color preferences." The agent then enthusiastically recommended red sneakers. The user screenshotted it. Posted it on Twitter. 100K+ impressions.
That was March 2024. I was using GPT-3.5-turbo for summaries back then. Switching to GPT-4o-mini helped accuracy a lot, but information still gets lost.
The lesson:
Don't cheap out on your summarization model. And critical information — negations, numbers, dates — should be extracted structurally, not left to the LLM's creativity. I ended up writing regex to pull out amounts, dates, and phone numbers first, then letting the model summarize the rest.
Approach 4: Structured Memory
The idea: Store user information in discrete fields — preferences, behaviors, key facts — maintained as JSON. Only inject relevant fields into each prompt.
The math:
About 500 tokens of structured data per request, plus the last 5 turns (2,500 tokens). That's 3,000 tokens of context per call. Daily cost: roughly $7.50. And since the data is already structured, you don't need frequent re-summarization. Maintenance cost is near zero.
Real example:
The SaaS product I'm building now uses this approach. We store about 20 user profile fields — age, gender, purchase preferences, recently browsed categories, etc. Before each conversation, we query the database and assemble the prompt. Results have been surprisingly good — reply accuracy up 30%, costs down 40%.
We're using PostgreSQL with JSON fields for the profiles. Didn't even bother with Redis. Simple and effective.
But honestly? The upfront design cost is significant. You need to figure out what to store, how to update it, and how to handle conflicts. What happens when a user first says "I'm male" and later says "I'm female"? Do you overwrite? Flag a conflict? I added a confidence field — same info confirmed multiple times before overwriting — and it took me two all-nighters to get the conflict resolution logic right.
Approach 5: Vector-Based Retrieval
The idea: Vectorize all conversation history and store it in a vector database (Pinecone, Milvus, etc.). For each new query, retrieve the most semantically relevant historical snippets.
The math:
Embedding costs: roughly $0.08 per 1M tokens (using text-embedding-3-small). If you've accumulated 100M tokens of history, embedding everything once costs about $8. Per-query retrieval cost is negligible. Injecting relevant snippets adds ~1,000 tokens. Combined with the last 5 turns, that's 3,500 tokens per call. Daily cost: around $8.75.
The catch:
Retrieval precision is... let's call it "temperamental."
A user once asked "how's that blue jacket I bought last time?" The system retrieved a record from three weeks ago about them browsing red t-shirts. Why? Vector similarity scored "jacket" and "t-shirt" at 0.87 similarity.
Optimization tip:
Don't rely purely on vector search. Add a keyword filtering layer. I'm now using LlamaIndex with hybrid search (vector + BM25), and recall accuracy jumped from 70% to 95%. Configuration: Milvus for the vector store, BM25 weight at 0.3, vector weight at 0.7.
Took me an entire weekend to tune those weights.
Approach 6: Hierarchical Memory
The idea: Combine approaches 4 and 5 into three layers — working memory (last N turns), short-term memory (structured profile), long-term memory (vector-retrieved history).
The math:
I ran this setup for a month. Daily cost averaged about $9.70. That's $2.20 more than pure structured memory, but user satisfaction jumped from 3.8 to 4.5 stars.
Implementation:
- Working memory: Last 5 turns, injected in real-time
- Short-term memory: User profile JSON, updated before each conversation
- Long-term memory: Weekly "memory consolidation" — short-term memories get embedded and stored in the vector database
Unexpected win:
A user came back after three months. The agent said, "That phone case you bought last time — there's a matching red version now." They bought it immediately.
That's the business value of long-term memory.
But honestly? Scenarios like that happened maybe two or three times a month. Whether that's worth an extra $2.20/day depends on your numbers. We calculated the ROI — you'd need roughly 3,000+ daily active users to break even.
Approach 7: Hybrid Cache + Memory Network
The idea: Layer semantic caching on top of Approach 6. Similar questions get cached responses, no LLM call needed.
The math:
After adding caching, our hit rate was about 30% (customer service has lots of repeated questions). Daily cost dropped from $9.70 to $6.80. Cache stored in Redis with embeddings, similarity threshold set to 0.95. False hit rate stayed under 2%.
The hidden cost:
Cache maintenance is a nightmare. Every time you update the knowledge base, you need to clear related caches. One time I forgot, and a user asked "what promotions do you have right now?" The agent responded with a three-month-old expired offer.
Twitter found us again.
My advice:
This approach is for mature products. Don't touch it early on — the maintenance overhead eats more than the token savings. We had two people maintaining cache logic, constantly debugging "why didn't this cache clear?" issues.
Cost Comparison: All Seven Approaches
| Approach | Daily Tokens | Daily Cost | Accuracy | Best For |
|---|
| Full Context | 20M | $50 | 95% | POC only |
|---|
| Sliding Window | 10M | $25 | 85% | MVP |
|---|
| Summarization | 5.3M | $13 | 80% | Early stage |
|---|
| Structured Memory | 3M | $7.50 | 90% | Growth phase |
|---|
| Vector Retrieval | 3.5M | $8.75 | 88% | Growth phase |
|---|
| Hierarchical | 3.9M | $9.70 | 93% | Mature product |
|---|
| Hybrid Cache | 2.7M | $6.80 | 92% | At scale |
|---|
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.