My AI Agent Has the Memory of a Goldfish — Here's How I Fixed It

By Cael Lee | 8 min read

Last Wednesday at 2 PM, I watched my customer service agent hit conversation turn 47 and completely blank on a return request the user mentioned just three turns earlier. Instead of processing the refund, it cheerfully recommended the same product again.

The user fired back: "Are you a goldfish?"

Ouch. But honestly? They weren't wrong.

Memory management in long-sequence decision-making — this is the Achilles' heel of AI agents. I've been hacking away at a solution since last November, and eight months later, I've finally got something that works: progressive state summarisation with a forgetting mechanism. No academic jargon here — just the trenches, the failures, and the accidental discoveries that actually made a difference.

Why Long Sequences Break Everything

Three war stories. All mine.

Case 1: The 7-Day Customer Service Nightmare

We built an e-commerce support agent using gpt-4-0125-preview. Average conversation length? 30-50 turns. Users constantly reference previous order numbers, return requests, voucher usage. The naive approach — shoving every message into the context window — meant GPT-4 started going selectively blind after about 15 turns.

Turn 8: "I've already returned this item."

Turn 23: Agent cheerfully checks the shipping status.

The logs were a graveyard of duplicate orderstatuscheck calls. My token bill? Let's not talk about it.

Case 2: The RPG NPC With Trust Issues

A text-based RPG I helped a mate build. The NPC needed to remember player choices across 200+ turns. Started with full memory + vector retrieval (Qdrant, similarity threshold 0.82). The NPC kept conflating "you once stole a potion" with "you once bought a potion". Two words different. Vector similarity through the roof.

Case 3: The Code Reviewer That Invented Function Signatures

My own code review tool, analysing a beefy PR with 3000 lines of diff. It needed to track a function's call chain. Halfway through, it "forgot" the signature and started hallucinating parameter types. Worst offence: it remembered def calculate_total(items: List[OrderItem]) -> Decimal as def calc(items: list) -> float, then ran an entire type check against that phantom signature.

Three different problems. One root cause: more information isn't better information. What matters is remembering the right things and forgetting the rest. Ruthlessly.

Three Pits I Fell Into (So You Don't Have To)

Pit 1: Treating Summaries as Compression Instead of Restructuring

My first attempt was dead simple — every 10 turns, ask the LLM to summarise. Condense 1000 words of history into 200. Looks sensible on paper, right?

Completely fell apart.

The agent turned into some kind of amnesiac — it could only remember what made it into the summary. A user would say "that red one I mentioned earlier" and the agent would freeze. The summary said "user inquired about products" but dropped "red" entirely. Whoops.

Here's what actually worked: layered summarisation.

Actually — let me correct myself. I later changed the immediate layer from a fixed 5 turns to a dynamic 3-7 range, depending on topic stability. Stable topic? Keep fewer turns. Topic jumping all over? Keep more. Fixed windows are almost always wrong.
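To make the layering concrete, here is a rough sketch rather than the production code. It assumes three layers (raw recent turns, a queue of evicted turns awaiting summarisation, and stored summaries) and a topic-stability score between 0 and 1 supplied by some upstream detector. The names LayeredMemory, immediate_window_size and topic_stability are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


def immediate_window_size(topic_stability: float, lo: int = 3, hi: int = 7) -> int:
    """Map topic stability in [0, 1] to a window size: stable topic -> fewer turns."""
    stability = min(max(topic_stability, 0.0), 1.0)
    return round(hi - stability * (hi - lo))


@dataclass
class LayeredMemory:
    immediate: deque = field(default_factory=deque)   # recent turns, kept verbatim
    to_summarise: list = field(default_factory=list)  # turns evicted from the immediate layer
    summaries: list = field(default_factory=list)     # older history as structured summaries

    def add_turn(self, turn: str, topic_stability: float) -> None:
        """Append a turn, then shrink the immediate layer to the current window size."""
        self.immediate.append(turn)
        limit = immediate_window_size(topic_stability)
        while len(self.immediate) > limit:
            # Evicted turns are queued for summarisation, never silently dropped.
            self.to_summarise.append(self.immediate.popleft())


memory = LayeredMemory()
for i in range(10):
    memory.add_turn(f"turn {i}", topic_stability=1.0)
print(list(memory.immediate))  # the three most recent turns
```

With a stable topic the window shrinks to 3 turns; with a jumpy conversation (stability near 0) it grows to 7, matching the 3-7 range above.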

The real insight: summarisation isn't compression — it's reorganising information structure. I went through four versions of the customer service summary format before landing on this:


{
  "current_intent": "return_request",
  "order_id": "ORD-20240115",
  "customer_sentiment": "frustrated_defused",
  "confirmed_details": [
    "return_reason:sizing_too_small",
    "return_method:pickup"
  ],
  "outstanding_items": [
    "refund_amount_needs_voucher_verification"
  ]
}

Structured summaries let the agent locate key information instantly. Before this, it was blindly searching through 500 words of unstructured text. Now it reads fields. Night and day.
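One way to get that "read fields, don't search text" behaviour is to parse the summary into a typed object as soon as the LLM returns it. The sketch below assumes the five-field format shown above; ConversationSummary and parse_summary are hypothetical names, and splitting each "key:value" entry into a dict is my own convenience, not something the format requires.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ConversationSummary:
    current_intent: str
    order_id: str
    customer_sentiment: str
    confirmed_details: dict = field(default_factory=dict)
    outstanding_items: list = field(default_factory=list)


def parse_summary(raw: str) -> ConversationSummary:
    """Parse LLM output into a typed summary; raises KeyError if a core field is missing."""
    data = json.loads(raw)
    details = {}
    for entry in data.get("confirmed_details", []):
        # "return_method:pickup" -> key "return_method", value "pickup"
        key, _, value = entry.partition(":")
        details[key] = value
    return ConversationSummary(
        current_intent=data["current_intent"],
        order_id=data["order_id"],
        customer_sentiment=data["customer_sentiment"],
        confirmed_details=details,
        outstanding_items=list(data.get("outstanding_items", [])),
    )
```

Failing loudly on a missing core field is deliberate here: a summary without an intent or order ID is better rejected and regenerated than quietly trusted.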

Pit 2: Forgetting Rules That Were Too Rigid

This one cost me. Actually cost me.

I designed very "engineer-brained" forgetting rules: discard anything older than 30 turns, immediately drop non-critical entities, resolve conflicts by favouring the latest info. Clean logic. Clear boundaries.

First week in production, a user mentioned "I'm allergic to peanuts" on turn 5. On turn 35, the agent recommended peanut butter. My rules had purged all "non-order-related information" past the 30-turn mark. The user emailed a complaint. My manager wanted a chat.

Enter importance scoring + soft forgetting:

The decay function uses exponential decay with a half-life around 15 turns. But here's the clever bit — there's a "wake-up mechanism". If a subsequent conversation touches on semantically similar content, the decay resets.

That half-life parameter took ages to tune. Too short and you haemorrhage information. Too long and your memory explodes. My 15 turns is empirical — your mileage will almost certainly vary.
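For reference, the decay itself is only a few lines. This is a minimal sketch under stated assumptions: importance is a score in [0, 1], the similarity value comes from whatever embedding comparison you already run, and WAKE_SIMILARITY is a made-up threshold. Only the 15-turn half-life comes from my setup.

```python
from dataclasses import dataclass

HALF_LIFE_TURNS = 15.0  # empirical value from this post; tune for your own traffic
WAKE_SIMILARITY = 0.8   # hypothetical threshold for the wake-up mechanism


@dataclass
class MemoryItem:
    text: str
    importance: float       # initial importance score in [0, 1]
    last_touched_turn: int  # turn when created or last woken up

    def score(self, current_turn: int) -> float:
        """Importance after exponential decay since the last touch."""
        age = max(0, current_turn - self.last_touched_turn)
        return self.importance * 0.5 ** (age / HALF_LIFE_TURNS)

    def maybe_wake(self, similarity: float, current_turn: int) -> bool:
        """Reset the decay clock if the new turn is semantically close enough."""
        if similarity >= WAKE_SIMILARITY:
            self.last_touched_turn = current_turn
            return True
        return False


allergy = MemoryItem("allergic to peanuts", importance=0.9, last_touched_turn=5)
print(allergy.score(20))      # one half-life later: half the original importance
allergy.maybe_wake(0.85, 35)  # user mentions snacks; decay clock resets
print(allergy.score(35))      # back to full importance
```

Soft forgetting then means acting on the score (demote, archive, drop from the prompt) rather than deleting anything at a hard turn cutoff.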

Pit 3: Updating Summaries on a Fixed Schedule

Every 10 turns. So logical. So clean.

So completely stupid.

Two problems: First, a user drops critical info on turn 9, summary triggers on turn 10, then the user adds related details on turn 11. Your summary is now incomplete. Second, a user blitzes through three unrelated topics in three turns, and your fixed window mashes them into one incoherent digest.

Switched to event-driven dynamic windows:

The information density improvement was dramatic. Before, a single summary might mix chit-chat, product enquiries, and complaints. Now each summary captures one coherent thread.

This one's fiddly though — topic detection itself is noisy. That 0.65 threshold? Tuned on my dataset. Yours will likely need adjustment.
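Here is a stripped-down sketch of the event-driven idea. It assumes the 0.65 threshold is applied to cosine similarity between consecutive turn embeddings, with anything below it treated as a topic shift; the embeddings are plain lists of floats from whichever model you use, and segment_by_topic is a hypothetical name.

```python
import math

TOPIC_SHIFT_THRESHOLD = 0.65  # tuned on one dataset; expect to adjust it


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def segment_by_topic(turns: list, embeddings: list,
                     threshold: float = TOPIC_SHIFT_THRESHOLD) -> list:
    """Split turns into topic-coherent segments; each segment gets its own summary."""
    segments, current, previous = [], [], None
    for turn, embedding in zip(turns, embeddings):
        # A drop in similarity to the previous turn closes the current segment.
        if previous is not None and cosine(previous, embedding) < threshold:
            segments.append(current)
            current = []
        current.append(turn)
        previous = embedding
    if current:
        segments.append(current)
    return segments
```

Because a summary is only triggered when a segment closes, the turn-9, turn-10, turn-11 details from the example above stay in the same digest as long as they stay on the same topic.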

The Accidental Discovery

Six months in, I stumbled on something counterintuitive: what the agent forgets is worth recording.

We added a "forgetting log" — a record of what the agent actively chose to discard and why. Originally just for debugging. It's since become more valuable than any monitoring dashboard we built.

Now our forgetting log is the core data source for tuning the importance scoring model. Accidentally became a feature. We internally call it "meta-forgetting", which sounds properly academic but it's really just... logging.
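If you want to try this, the log really can be that simple. A minimal sketch, assuming each discard is recorded with the turn, the item, its score at the time, and a short reason string; the field names and reason labels are hypothetical.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass
class ForgetEvent:
    turn: int     # turn at which the item was discarded
    item: str     # what was forgotten
    score: float  # importance score at the moment of forgetting
    reason: str   # e.g. "decayed_below_floor", "superseded"


class ForgettingLog:
    def __init__(self) -> None:
        self.events = []

    def record(self, turn: int, item: str, score: float, reason: str) -> None:
        self.events.append(ForgetEvent(turn, item, score, reason))

    def reason_counts(self) -> Counter:
        """How often each forgetting reason fired; useful when tuning the scorer."""
        return Counter(event.reason for event in self.events)

    def to_jsonl(self) -> str:
        """One JSON object per line, ready to append to a log file."""
        return "\n".join(json.dumps(asdict(event)) for event in self.events)
```

When a user complains that the agent forgot something, you can search this log for the item and see which rule discarded it and at what score.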

Practical Advice (The Stuff I Wish Someone Told Me)

1. Instrument First, Optimise Later

Before touching your memory mechanism, monitor your agent's "forgetting rate", "repeat question rate", and "information conflict rate". Otherwise you're flying blind — you'll never know if your changes made things better or worse. We use Prometheus + Grafana, three metrics running for six months. Looking back at those graphs, you can see exactly when each change landed.

2. Get Business Stakeholders to Sign Off on Summary Schemas

My first summary JSON had 20 fields. I thought I was being thorough. The business team looked at it and said "we just need to know what the user wants". Cut it to five core fields, and the agent performed markedly better. More fields mean more inconsistent LLM output quality.

3. Don't Worship Vector Retrieval

Lots of teams hear "long-term memory" and immediately spin up Pinecone, Weaviate, Milvus — the works. But in practice, structured queries ("what was the user's last order number") vastly outnumber semantic searches. I now use structured indexing as primary, vector retrieval as secondary. From what I've heard, plenty of customer service builders have reached the same conclusion.
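In code, "structured first, vectors second" can look something like the sketch below. It assumes the caller can often name the field it wants, and that semantic_search is whatever vector lookup you already have, passed in as a plain callable. AgentMemory and its method names are hypothetical.

```python
from typing import Callable, Optional


class AgentMemory:
    def __init__(self, semantic_search: Callable[[str], Optional[str]]) -> None:
        self.fields = {}                        # exact facts: order IDs, intents, allergies
        self.semantic_search = semantic_search  # fallback, e.g. a vector store query

    def remember(self, key: str, value: str) -> None:
        self.fields[key] = value

    def recall(self, query: str, key: Optional[str] = None) -> Optional[str]:
        """Answer from the structured index when possible, else fall back to vectors."""
        if key is not None and key in self.fields:
            return self.fields[key]
        return self.semantic_search(query)


memory = AgentMemory(semantic_search=lambda query: None)  # stub fallback for the demo
memory.remember("last_order_id", "ORD-20240115")
print(memory.recall("what was the user's last order number", key="last_order_id"))
```

The structured path is exact and cheap, so the "stole a potion" versus "bought a potion" kind of near-miss only becomes a risk on the fallback path.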

4. Give Forgetting an Undo Button

Let the agent mark information as "I might need this but I'm not sure". These items don't get immediately forgotten — they go into a limbo state with extended retention. This has saved my bacon countless times, especially with offhand user mentions that initially seem unimportant.
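A rough sketch of the limbo state, assuming a turn-based retention window. The 30-turn base echoes the rigid rule from Pit 2; the 60-turn extension and all names here are hypothetical placeholders.

```python
from dataclasses import dataclass
from enum import Enum

RETENTION_TURNS = 30    # base retention, as in the original rigid rule
LIMBO_EXTRA_TURNS = 60  # hypothetical extension for "might need this" items


class Status(Enum):
    ACTIVE = "active"
    LIMBO = "limbo"
    FORGOTTEN = "forgotten"


@dataclass
class Item:
    text: str
    last_used_turn: int
    uncertain: bool = False  # agent flagged "I might need this but I'm not sure"
    status: Status = Status.ACTIVE


def sweep(items: list, current_turn: int) -> None:
    """Update each item's status; uncertain items go to limbo instead of being forgotten."""
    for item in items:
        age = current_turn - item.last_used_turn
        if age <= RETENTION_TURNS:
            item.status = Status.ACTIVE
        elif item.uncertain and age <= RETENTION_TURNS + LIMBO_EXTRA_TURNS:
            item.status = Status.LIMBO
        else:
            item.status = Status.FORGOTTEN


def restore(item: Item, current_turn: int) -> None:
    """The undo button: bring a limbo item back and restart its retention clock."""
    item.last_used_turn = current_turn
    item.status = Status.ACTIVE
```

With these numbers, a peanut allergy flagged as uncertain on turn 5 is still recoverable on turn 35, where the old hard cutoff had already purged it.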

One Thing I'm Still Wrestling With

I'm currently experimenting with agent self-reflection to dynamically adjust memory strategy. The idea: after a session ends, the agent reviews its own decision chain and flags which memory points were pivotal and which were noise. Then use those annotations to train the importance scoring model.

Two problems: First, reflection itself burns tokens — roughly 30% more. Second, the agent sometimes "over-reflects", marking trivial details as crucial.

Still tuning. On prompt version 7 now. I'll update when there's progress.

TL;DR

What memory management disasters have you run into? Ever had an agent go selectively amnesiac at the worst possible moment? I'm particularly keen to hear from anyone building game AI or customer service systems — how do you handle the "user is being sarcastic" problem? Our agent keeps taking sarcasm literally and assigning it sky-high importance scores.

Drop your war stories in the comments.

#AI #AgentArchitecture #MemoryManagement #LLM #SoftwareEngineering #ProductionML

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
