Your AI Agent Has Goldfish Brain: 3 Ways to Fix Short-Term Memory (Without Burning Cash)
Your AI Agent Has Goldfish Brain: 3 Ways to Fix Short-Term Memory (Without Burning Cash)
Last week, my AI assistant hit me with this gem on conversation turn #12: "Wait, what framework was that bug in again?"
I stared at the screen. We'd been debugging for 20 minutes. Twenty. Minutes.
Look, maybe I was also running on fumes after 6 hours of coding. But still — I built this thing, and it forgot faster than I do after my third coffee wears off.
Here's the problem that's driving developers insane: how do you keep your AI Agent from forgetting everything in long conversations without torching your entire API budget? I've burned enough tokens and embarrassed myself enough times to figure out three approaches. They're not perfect, but they'll save you some pain.
TL;DR for the Impatient
- Sliding window: Dead simple, but drops context like a hot potato
- Summary compression: Cheap on tokens, but details get... fuzzy
- Vector retrieval: Pinpoint accurate, but the engineering complexity will make you question your life choices
- In production? I mix all three. More on that later.
Why Short-Term Memory Is Actually a Big Deal
Picture this.
You're using an AI Agent to debug a distributed system. Over 20 conversation turns, you've fed it: your architecture diagram, the error logs, three failed solutions, and why each one bombed. Then on turn 21, it starts hallucinating nonsense.
It's not stupid.
Its working memory just hit a wall.
GPT-4's 128K context window looks massive on paper, right? But here's the reality: a moderately complex Agent task — code review + multi-step reasoning + a few tool calls — eats 30K-50K tokens without breaking a sweat. I benchmarked this. GPT-4-turbo (April 2024 snapshot) averaged 42K tokens on a debugging task with tool calling.
Without memory management, you've got two options: drop information, or watch your budget go up in flames.
Last September, I shipped a customer support Agent using the then-latest gpt-4-0125-preview. Day one. First hour. A customer messages us: "Why does your AI have the memory of a goldfish? 7 seconds, tops." Their exact words. I wanted to crawl under my desk.
Approach 1: Sliding Window — Simple, Not Stupid
The idea is almost insultingly straightforward: keep only the last N conversation turns. Old messages? Gone.
class SlidingWindowMemory:
def __init__(self, max_turns=10):
self.max_turns = max_turns
self.messages = []
def add_message(self, role, content):
self.messages.append({"role": role, "content": content})
# Trim when we exceed the window
if len(self.messages) > self.max_turns * 2:
self.messages = self.messages[-(self.max_turns * 2):]
def get_context(self):
return self.messages
How It Actually Performs
I shipped this in an internal tool with an 8-turn window.
The results? Mixed bag.
- ✅ Token usage? Rock solid. Never blew the budget once.
- ✅ Implementation took 5 minutes. It's basically 10 lines of code.
- ❌ User mentions "that thing we talked about earlier" on turn 9? Agent goes full deer-in-headlights.
- ❌ Cross-window reasoning? Forget about it.
Actually, scratch that — I said "10 lines of code" but that's not quite honest. If you're handling system prompts properly and distinguishing between user/assistant role markers, it's more like 30 lines. The code above is the simplified version.
When does this make sense?
Tasks with clear boundaries. Single-shot code generation. Translation. Quick back-and-forth that doesn't need history. If your Agent needs to "know who the user is" across sessions, this alone won't cut it.
I mostly use it as a safety net now.
Approach 2: Summary Compression — Distilling the Conversation
Here's the play: use an LLM to squash your conversation history into a dense summary. Keep the signal, dump the noise.
class SummaryMemory:
def __init__(self, llm_client):
self.llm = llm_client
self.summary = ""
self.recent_messages = []
def compress_history(self, messages):
prompt = f"""
Compress the following conversation into a key summary. Preserve:
- User's core needs
- Decisions already made
- Critical technical details
Conversation:
{json.dumps(messages, ensure_ascii=False)}
"""
self.summary = self.llm.complete(prompt)
self.recent_messages = []
Where I Faceplanted
Last November, I tried building a "long-term learning companion" with summary compression. The vision was beautiful: compress key learning points after each session, inject the summary next time.
Reality? Train wreck. I went through three cups of coffee just diagnosing the problems:
- Detail loss is brutal. A user says "I use VS Code's Vim plugin with custom keybindings mapped to Ctrl+J." The summary becomes "User uses VS Code." Vim plugin? Gone. Custom keybindings? Evaporated. So when the Agent recommends shortcuts later, it completely ignores the user's muscle memory.
- Compression isn't free. Every compression call hits the LLM. I used Claude 3 Haiku for this (cheap), but with frequent compression, it still added $200+/month to the bill.
- Error accumulation is terrifying. This one kept me up. If the summary has a slight bias — say, compressing "user is considering Redis" into "user has adopted Redis" — subsequent conversations build on that error. I watched an Agent spend 5 turns discussing a Redis configuration that didn't exist. The user finally just typed "???"
An Accidental Fix
I later switched to layered summaries.
Separate "user profile" (long-term preferences, habits, tech stack) from "task context" (current goal, progress). User profile gets compressed once and reused. Task context gets compressed at task boundaries.
Performance improved. Engineering complexity doubled. I told a coworker: "Maintaining this is more complicated than my ex's emotional state."
He didn't laugh. Probably because he's met my ex.
Approach 3: Vector Retrieval — Find It When You Need It
No compression. No discarding. Store everything in a vector database and retrieve relevant chunks on demand.
class VectorMemory:
def __init__(self, vector_store):
self.store = vector_store
def add_interaction(self, user_msg, assistant_msg):
doc = f"User: {user_msg}\nAssistant: {assistant_msg}"
embedding = get_embedding(doc)
self.store.insert(embedding, doc)
def retrieve_context(self, query, top_k=5):
query_embedding = get_embedding(query)
results = self.store.search(query_embedding, top_k)
return "\n".join(results)
The 3 AM Wake-Up Call
Three months ago — December 2024 — I was building a code review Agent. Users might reference "that SQL injection issue from the last PR." Sliding window? Useless — that could be 50 turns ago. Summaries? They won't preserve specific code details.
Vector retrieval seemed perfect.
I deployed with ChromaDB + text-embedding-3-small. First week in production. 3:07 AM. PagerDuty screams me awake.
What went wrong?
User says "that SQL issue." Vector search returns every conversation fragment containing "SQL." Twelve historical chunks. Eight completely irrelevant — the user had previously discussed SQL optimization, SQL indexing, even a terrible SQL-related pun. The Agent had no idea which "SQL issue" they meant and returned a wall of unrelated information.
Deeper problem: retrieval quality depends heavily on your embedding model. text-embedding-3-small treats "SQL injection" and "SQL optimization" as 0.85 similarity. From what I understand, it's because the model doesn't distinguish technical subdomains well enough.
My salvage operation:
- Added timestamps and session IDs to every memory chunk. Retrieval now prioritizes recent and same-session results. Simple decay weighting: 24-hour recency gets 2x boost, same-session gets 1.5x.
- Hybrid retrieval: keyword matching (BM25) + vector similarity, weighted. BM25 nails exact term matching.
- Retrieved chunks now include metadata: "Mentioned 3 days ago in conversation turn #5."
The system's stable now. Retrieval accuracy hovers around 87%. But honestly? The parameter tuning made me consider quitting tech to open a coffee shop.
So What Should You Actually Use?
I run a combo setup now. Mix and match based on the scenario:
| Scenario | Recommendation | Why |
|---|
| Simple Q&A / single-shot tasks | Sliding window | Good enough, zero hassle |
|---|
| Long-term user profiles | Summary compression | High information density |
|---|
| Knowledge-intensive tasks | Vector retrieval | Precision search |
|---|
| Complex Agent systems | All three combined | Each covers the others' weaknesses |
|---|
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.