I Built an AI Agent With "Memory" — It Gaslit Itself Into Thinking Our CTO Was a Cat
Last month I shipped an internal support bot that was supposed to remember user preferences across sessions. Instead, it developed what I can only describe as digital dementia mixed with conspiracy theories. The kind where it's confidently wrong, which is somehow so much worse.
I've been lurking on r/MachineLearning for years, and every other post is about RAG this, vector DB that. But nobody talks about what happens when your "long-term memory" system starts hallucinating with authority. So let me break down the three-tier memory architecture I ended up with, and why each tier exists because of a specific disaster that still makes me wince.
The Three Tiers (and Their Failure Modes)
Tier 1: Short-term / Working Memory
This is just your conversation history stuffed into the context window. Simple, right?
What I did: Sliding window of last 20 messages, summarised older ones with gpt-3.5-turbo-0125. This was back in late January. I think the list price was something like $0.50 per 1M input tokens, which sounds cheap until you multiply it across every conversation. Ouch.
How it failed: The summariser was too aggressive. A user mentioned they were "running late for a meeting with the VP of Engineering." The summariser condensed this to "user is late for meeting." The agent then confidently suggested rescheduling with "the person you're meeting" — completely losing the power dynamic context. User was not amused that our bot told them to casually reschedule with someone three levels above them.
Actually, wait — I should clarify that this wasn't a one-off. It happened four times before I caught it. Different users, same pattern. The summariser would strip titles and names about 30% of the time. I only noticed because one of the affected users DM'd me on Slack with "hey uh your bot just told me to ping our CTO directly about a printer issue."
That Slack message is now framed in my brain forever.
Lesson: Summarisation needs entity preservation. Names, titles, and relationships are not optional metadata. They're load-bearing.
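One way to enforce that lesson is a post-summarisation guard that checks which titles survived and re-attaches the ones that didn't. This is a minimal sketch, not the production fix: the regex is a crude stand-in for real entity recognition, and `TITLE_PATTERN`, `extract_entities`, and `preserve_entities` are hypothetical names made up for illustration.

```python
import re

# Hypothetical list of title phrases worth protecting; extend for your org.
TITLE_PATTERN = re.compile(
    r"\b(?:CTO|CEO|CFO|VP of [A-Z][a-z]+|Director of [A-Z][a-z]+|Head of [A-Z][a-z]+)\b"
)


def extract_entities(text: str) -> set[str]:
    """Return the title-like phrases found in text (a crude stand-in for NER)."""
    return set(TITLE_PATTERN.findall(text))


def preserve_entities(original: str, summary: str) -> str:
    """Append any entities the summariser dropped so they survive compression."""
    missing = extract_entities(original) - extract_entities(summary)
    if not missing:
        return summary
    return f"{summary} [entities mentioned: {', '.join(sorted(missing))}]"


if __name__ == "__main__":
    original = "I'm running late for a meeting with the VP of Engineering."
    print(preserve_entities(original, "user is late for meeting."))
```

The same check could just as easily trigger a re-summarisation instead of appending; appending is simply the cheapest option that never loses the title.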
"Just use a bigger context window" — every LLM API sales pitch ever
Yeah, until you're processing 50 concurrent users and your token costs look like a phone number. Also, there's solid research showing that attention degrades in the middle of long contexts anyway. Bigger isn't always better.
Tier 2: Episodic / Session Memory
This is where you store structured facts extracted from conversations — user preferences, past decisions, that kind of thing. Think of it as the agent's notes from previous sessions.
What I did: After each conversation, extract key-value pairs (username, preferred_language, last_ticket_id, etc.) and dump them in Postgres with a user_id foreign key. Simple relational DB, nothing fancy. I'm using Postgres 16.1 on a t3.medium RDS instance if anyone cares. It's boring. It works.
How it failed: Stale data poisoning. A user changed their deployment region from us-east-1 to eu-west-2. The agent "remembered" the old region from three weeks ago and generated a whole troubleshooting guide for the wrong AWS region. The user spent 20 minutes trying to find resources that didn't exist.
I still cringe thinking about this one. Twenty. Minutes.
The fix: Added a last_verified_at timestamp and a confidence score. If the fact is older than N days, the agent prefaces it with "Based on our last conversation on [date], you were using [X]. Is that still correct?" This tiny change reduced misdirected responses by roughly 40%.
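The staleness check can be sketched roughly like this. It's a sketch under assumptions: the `Fact` dataclass, `render_fact`, the 14-day value for N, and the idea of also asking for confirmation below a minimum confidence are all illustrative choices, not the actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MAX_AGE_DAYS = 14      # hypothetical value for "N"; tune per fact type
MIN_CONFIDENCE = 0.5   # assumption: low-confidence facts also get re-confirmed


@dataclass
class Fact:
    key: str
    value: str
    last_verified_at: datetime
    confidence: float  # 0.0 to 1.0


def render_fact(fact: Fact, now: datetime) -> str:
    """Return prompt text for a fact, asking for confirmation when it is stale."""
    too_old = now - fact.last_verified_at > timedelta(days=MAX_AGE_DAYS)
    if too_old or fact.confidence < MIN_CONFIDENCE:
        date = fact.last_verified_at.strftime("%Y-%m-%d")
        return (
            f"Based on our last conversation on {date}, "
            f"you were using {fact.value}. Is that still correct?"
        )
    return f"{fact.key}: {fact.value}"


if __name__ == "__main__":
    now = datetime(2024, 3, 1, tzinfo=timezone.utc)
    stale = Fact("deployment_region", "us-east-1", now - timedelta(days=21), 0.9)
    print(render_fact(stale, now))
```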
Also, I learnt the hard way that you need conflict resolution. What happens when the user says "I use Python" in session 1, then "I've switched to Rust" in session 2? Without explicit update logic, you end up with both facts coexisting and the agent picking between them more or less arbitrarily. Mine picked Python about 60% of the time, apparently because the older fact tended to come first in the retrieval order.
That was a fun bug to debug. Three hours of my life I'm not getting back.
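The explicit update logic boils down to "one value per user and key, newest statement wins". Here is a minimal in-memory sketch of that rule; `upsert_fact` and the dict-based store are hypothetical stand-ins for the real table.

```python
from datetime import datetime


def upsert_fact(store: dict, user_id: str, key: str, value: str, seen_at: datetime) -> None:
    """Keep exactly one value per (user_id, key); the most recent statement wins."""
    current = store.get((user_id, key))
    if current is None or seen_at >= current["seen_at"]:
        store[(user_id, key)] = {"value": value, "seen_at": seen_at}


if __name__ == "__main__":
    facts: dict = {}
    upsert_fact(facts, "u1", "preferred_language", "Python", datetime(2024, 1, 10))
    upsert_fact(facts, "u1", "preferred_language", "Rust", datetime(2024, 2, 2))
    print(facts[("u1", "preferred_language")]["value"])
```

In Postgres the equivalent would presumably be a unique constraint on the user and key columns plus an `INSERT ... ON CONFLICT ... DO UPDATE`, so the old row is overwritten rather than left around to be retrieved.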
Tier 3: Semantic / Vector Memory
This is the cool kid. Embed everything, store in a vector DB, retrieve by similarity. Pinecone, Weaviate, ChromaDB — pick your poison.
What I did: Embedded conversation snippets, documentation chunks, and past solutions using text-embedding-3-small. Stored in pgvector 0.6.0 (because I'm cheap and didn't want another service to manage). Cosine similarity search with a threshold of 0.75.
How it failed (spectacularly): The CTO Incident.
Our CTO made an offhand joke in a company-wide Slack thread: "I'm basically a cat herder at this point." The bot, in its infinite wisdom, embedded this. Weeks later, a new engineer asked the bot: "Who should I talk to about the backend architecture?" The bot retrieved the cat herder comment with high similarity (because "who to talk to" ≈ "what someone does"), and responded: "You should speak with [CTO's name], who describes their role as a cat herder."
The CTO found it hilarious. HR did not.
I stared at my screen for a solid minute when I saw the logs. Just... staring. Processing.
The actual fix: Vector similarity is necessary but not sufficient. I added:
- Metadata filtering: Every embedding now has source_type (slack_message, support_ticket, documentation, user_preference) and authority_level (1-5). Jokes from Slack get authority_level=1. Official docs get 5. The retrieval now filters by minimum authority before ranking by similarity.
- Cross-reference verification: Before using a retrieved memory, the agent does a quick sanity check: "Is this fact corroborated by another source?" If a "fact" only appears once in a joke Slack message, it gets flagged.
- Temporal decay: Older memories get their similarity scores multiplied by a decay factor. That cat joke from 6 months ago? Basically invisible now unless it's the only match.
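Put together, the authority filter and the temporal decay can be sketched in plain Python as below. The assumptions are worth stating loudly: the exponential decay shape (`decay_factor` applied once per 30-day period), the `Memory` dataclass, and the `retrieve` function are illustrative, and a real setup would push the filter and similarity search into the vector store rather than loop in Python.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    embedding: list[float]
    authority_level: int  # 1 (Slack joke) to 5 (official docs)
    age_days: float


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_emb, memories, authority_min=3, decay_factor=0.9,
             period_days=30.0, threshold=0.75):
    """Filter by authority and raw similarity, then rank by decayed similarity."""
    scored = []
    for m in memories:
        if m.authority_level < authority_min:
            continue  # low-authority sources never reach ranking
        similarity = cosine(query_emb, m.embedding)
        if similarity < threshold:
            continue  # threshold applies to the raw score
        # Assumed decay shape: multiply by decay_factor once per period_days.
        decayed = similarity * decay_factor ** (m.age_days / period_days)
        scored.append((decayed, m))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored]
```

Because the threshold is applied before the decay, an old memory is pushed down the ranking rather than dropped, so it can still surface when it is the only match.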
The Architecture I Actually Ship
After all these failures, here's what's running in production:
# Simplified version of the memory merge logic
def fetch_context(user_id: str, query: str) -> dict:
    tier1 = get_recent_messages(user_id, limit=20)  # raw, no summarisation
    tier2 = get_structured_facts(user_id)  # with confidence + staleness check
    tier3 = vector_search(query, authority_min=3, decay_factor=0.9)
    # Merge: Tier 2 facts override Tier 3 if conflict
    merged = merge_and_deduplicate(tier1, tier2, tier3)
    return inject_with_source_tags(merged)
The source tagging is crucial. Every piece of memory injected into the prompt looks like:
[SOURCE: support_ticket_2024_01_15 | AUTHORITY: 4 | AGE: 12 days]
User's deployment uses Kubernetes 1.28 with Istio service mesh
This lets the model weigh information appropriately instead of treating everything as equally true.
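Producing that tag format takes only a few lines. A minimal sketch, assuming each memory carries a source label, an authority level, and the date it was recorded; `format_memory` is a hypothetical helper, not the `inject_with_source_tags` from the snippet above.

```python
from datetime import date


def format_memory(source: str, authority: int, recorded_on: date, text: str, today: date) -> str:
    """Render one memory with a source/authority/age header for prompt injection."""
    age_days = (today - recorded_on).days
    return f"[SOURCE: {source} | AUTHORITY: {authority} | AGE: {age_days} days]\n{text}"


if __name__ == "__main__":
    print(format_memory(
        "support_ticket_2024_01_15",
        4,
        date(2024, 1, 15),
        "User's deployment uses Kubernetes 1.28 with Istio service mesh",
        today=date(2024, 1, 27),
    ))
```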
Well... that's the theory anyway. It still gets confused sometimes. But the failure rate dropped from "embarrassing" to "acceptable." I'll take it.
Stuff I Wish Someone Told Me
1. Your vector DB is not a database. It's a similarity engine. It will return something even if that something is completely irrelevant. Always apply post-retrieval filtering. I cannot stress this enough. Seriously.
2. Memory is a UX problem, not just an ML problem. Users need to see what the agent "remembers" and correct it. I added a /memory command that dumps the agent's current beliefs about the user. Adoption was immediate — turns out people want to verify what the bot thinks it knows. One engineer on our team even made it his Slack status: "Current bot beliefs about me: 3 correct, 2 outdated, 1 concerning." Legend.
3. Start with Tier 2 before Tier 3. Structured facts get you 80% of the value with 20% of the complexity. I spent three weeks fine-tuning vector retrieval only to realise most of the actually useful "memory" was just key-value pairs I could have stored in a JSONB column.
Three weeks. I could have been playing Baldur's Gate 3.
4. Your summariser is a lossy compression algorithm. Every summarisation step is an opportunity for information to degrade. I now log every summarisation with a diff so I can audit what got dropped. You'd be horrified at what these models consider "unimportant." Names, for one. Names are apparently unimportant.
5. Confidence without calibration is dangerous. When the agent says "I'm 90% sure you prefer Python," that 90% needs to actually mean something. I ended up implementing a dead-simple calibration: track how often the agent's "high confidence" memories turn out to be wrong, and adjust the confidence scores accordingly. It's not perfect — it's just basic frequentist probability — but it's better than raw model confidence, which is basically astrology.
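The calibration in point 5 can be sketched as a simple hit-rate tracker per confidence bucket. The bucket width, the `min_samples` cut-off, and the `ConfidenceCalibrator` class are assumptions made for illustration.

```python
from collections import defaultdict


class ConfidenceCalibrator:
    """Track how often memories in each stated-confidence bucket prove correct."""

    def __init__(self, min_samples: int = 20):
        self.min_samples = min_samples
        self.counts = defaultdict(lambda: [0, 0])  # bucket -> [correct, total]

    @staticmethod
    def _bucket(confidence: float) -> int:
        # Ten buckets: 0.0-0.1, ..., 0.9-1.0 (1.0 falls into the last one).
        return min(int(confidence * 10), 9)

    def record(self, stated: float, was_correct: bool) -> None:
        """Log one outcome, e.g. after a user confirms or corrects a memory."""
        entry = self.counts[self._bucket(stated)]
        entry[0] += int(was_correct)
        entry[1] += 1

    def calibrated(self, stated: float) -> float:
        """Return the observed hit rate for this bucket once there is enough data."""
        correct, total = self.counts[self._bucket(stated)]
        if total < self.min_samples:
            return stated  # not enough evidence yet; fall back to the raw score
        return correct / total
```

Corrections made through something like the /memory command would be a natural source of the `was_correct` signal.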
TL;DR
Built a three-tier memory system for an AI agent. Short-term memory works fine until you summarise away important context. Episodic memory gets poisoned by stale data unless you track confidence and staleness. Vector memory will retrieve jokes as facts unless you add authority filtering. Start simple, add complexity only when you've personally experienced the failure mode.
And yes, the CTO still signs off emails with "Chief Cat Herder" now. So I suppose the bot won in the end.
What memory architectures have you all tried? Anyone else had their agent develop... let's call them "creative interpretations" of user history? Drop your war stories below. I need to feel less alone in this.
Edit: A few people asked about the pgvector setup. It's literally just a Docker container with the pgvector extension. Nothing fancy. I'll post the schema in the comments if there's interest.
Edit 2: Thanks for the gold, kind stranger. And yes, I did have to explain to HR that no, the bot is not sentient and does not actually believe our CTO is a feline. That was an interesting meeting.
Edit 3: Someone DM'd me asking about the summarisation diff logging. I'm using a really hacky Python script that compares the before/after with difflib. It's not pretty but it works. Will clean it up and throw it on GitHub this weekend if I have time. Maybe. Don't hold me to that.
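For anyone who wants a starting point before that happens, a difflib-based diff logger can be very small. This is only a sketch of the general idea, not the script mentioned above; `summarisation_diff` and `dropped_words` are hypothetical names.

```python
import difflib


def summarisation_diff(original: str, summary: str) -> str:
    """Return a unified diff of the text before and after summarisation."""
    lines = difflib.unified_diff(
        original.splitlines(),
        summary.splitlines(),
        fromfile="original",
        tofile="summary",
        lineterm="",
    )
    return "\n".join(lines)


def dropped_words(original: str, summary: str) -> list[str]:
    """List words present in the original but absent from the summary."""
    kept = {w.strip(".,!?").lower() for w in summary.split()}
    return [
        w.strip(".,!?")
        for w in original.split()
        if w.strip(".,!?").lower() not in kept
    ]


if __name__ == "__main__":
    before = "late for a meeting with the VP of Engineering"
    after = "user is late for meeting"
    print(summarisation_diff(before, after))
    print(dropped_words(before, after))
```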
#ai #machinelearning #programming #devops #llm
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.