We Built a Memory-First Architecture for AI Agents—Here's What Actually Worked
Last quarter, our agents started forgetting things after three interactions. Not "oops, lost my train of thought" forgetting—proper amnesia. Task completion rates dropped 23%, and I found myself in a board meeting explaining why our "production-ready" AI wasn't remotely ready for production.
Awkward doesn't begin to cover it.
Here's what I've learnt about building memory architectures that don't crumble under real-world pressure. Some of it's obvious in hindsight. Most of it we discovered by breaking things spectacularly.
The Mistake I Keep Seeing (And Making)
When I took over our AI infrastructure team in March, I treated memory like a feature to tick off a list. Slap on a vector database, point it at agent outputs, job done. Pinecone was our hammer and everything looked like a nail.
Demo days were glorious. Production was... not.
Actually, let me back up. Vector databases aren't the problem. We just used them like absolute muppets. We were embedding entire conversation histories and crossing our fingers that semantic search would magically surface relevant context. It didn't. Retrieval latency sat around 800ms, and half the time our agents pulled conversations from completely different users. A customer asking about refund policies would get context from someone's billing dispute from three months ago.
Not ideal.
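One common guard against that kind of cross-user leakage is to filter candidates by owner before ranking them by similarity. The sketch below is illustrative only, not the production code: `MemoryRecord`, `retrieve_for_user`, and the toy two-dimensional embeddings are all hypothetical. Most vector stores expose metadata filters or namespaces that do the same job server-side.

```python
import math
from dataclasses import dataclass

@dataclass
class MemoryRecord:
    user_id: str
    text: str
    embedding: list[float]  # toy vectors; real embeddings have hundreds of dimensions

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_for_user(records, user_id, query_embedding, top_k=3):
    """Filter to the requesting user's records first, then rank by similarity."""
    candidates = [r for r in records if r.user_id == user_id]
    candidates.sort(key=lambda r: cosine(r.embedding, query_embedding), reverse=True)
    return candidates[:top_k]

records = [
    MemoryRecord("alice", "Asked about refund policy", [0.9, 0.1]),
    MemoryRecord("bob", "Billing dispute from March", [0.95, 0.05]),
    MemoryRecord("alice", "Changed shipping address", [0.1, 0.9]),
]
hits = retrieve_for_user(records, "alice", [1.0, 0.0])
print([h.text for h in hits])
```

Bob's billing dispute is the closest vector to the query, yet it never reaches Alice's agent because the filter runs before the ranking.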
The lightbulb moment came from an unexpected place. I spent a weekend in April reading Jeff Hawkins' "A Thousand Brains," and one idea rewired how I think about this stuff: biological intelligence doesn't separate memory from reasoning. Your cortex isn't a database with a CPU attached. It's a memory system that is the reasoning system.
That reframe changed everything we built afterwards.
The Three-Layer Architecture We Landed On
We went through three iterations. The first two were disasters—the second one actually made things worse by introducing a recursive retrieval loop that, well, let's just say our AWS bill that month prompted some uncomfortable questions from finance.
But the third iteration? It stuck. Task completion climbed from 67% to 91% over six weeks. Here's the breakdown.
1. Working Memory: The "Right Now" Layer
This is your agent's active context window. The naive approach—which we absolutely tried first—is cramming everything into the prompt. Conversation history, tool outputs, system instructions, the whole lot. It's like that colleague we all know who keeps 47 Chrome tabs open and complains their laptop is slow.
We built a sliding attention mechanism instead. It prioritises:
- Current task state and immediate subgoals
- Last 5 interaction turns, weighted by relevance rather than recency (this distinction took two weeks to get right)
- Active tool outputs and error states
The key insight? Working memory isn't about volume. It's about signal-to-noise ratio.
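To make the prioritisation concrete, here is a minimal sketch under stated assumptions: each turn already carries a relevance score against the current task, tokens are approximated by whitespace word counts, and the names `Turn` and `build_working_memory` are hypothetical. It shows the shape of the idea, not the actual sliding attention mechanism.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    text: str
    relevance: float  # assumed score in [0, 1] against the current task

def build_working_memory(task_state, turns, tool_outputs, window=5, token_budget=200):
    """Assemble context: task state first, then tool outputs, then the last
    `window` turns in descending relevance until the budget runs out."""
    def cost(text):
        return len(text.split())  # crude token estimate: whitespace word count

    context = [task_state] + list(tool_outputs)
    used = sum(cost(item) for item in context)

    recent = turns[-window:]  # only the last `window` turns are considered
    for turn in sorted(recent, key=lambda t: t.relevance, reverse=True):
        if used + cost(turn.text) > token_budget:
            continue  # skip turns that do not fit; a shorter one may still fit
        context.append(turn.text)
        used += cost(turn.text)
    return context

turns = [
    Turn("hello there", 0.1),
    Turn("my order 1234 never arrived", 0.9),
    Turn("thanks", 0.05),
    Turn("I want a refund for order 1234", 0.95),
    Turn("ok", 0.2),
    Turn("what is your name", 0.1),
]
ctx = build_working_memory(
    "task: process refund", turns, ["lookup_order: status=lost"], token_budget=20
)
print(ctx)
```

The most recent turn is dropped because it is both low-relevance and too long for the remaining budget, while older but relevant turns survive.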
We cut context size by 40% while improving decision accuracy by 28%. Our lead architect, Sarah, showed me the before-and-after traces during a late-night debugging session. The "before" looked like a hoarder's garage—stuff everywhere, no clear structure. The "after" was a minimalist workspace. She'd colour-coded the attention weights and the difference was almost comical.
Almost.
2. Episodic Memory: The "Experience" Layer
This is where most teams over-engineer. I know because we did exactly that.
Our first attempt was an elaborate Neo4j graph database storing every single interaction with full relationship mapping. Looked stunning in the architecture diagram. Complete nightmare in practice—agents took 2-3 seconds just to "remember" things. In conversational AI, that's an eternity.
What finally worked was surprisingly straightforward:
- Semantic search via embeddings for conceptual similarity (we switched from OpenAI's text-embedding-ada-002 to Cohere's embed-v3 after seeing better performance on technical content)
- Keyword-based BM25 for exact matches—critical for error codes and specific product names
- A lightweight temporal decay function that mimics human forgetting
That decay function was Sarah's idea. She pointed out that human memory doesn't work like a database query. Recent and frequently-accessed memories stick around. Unused ones fade. The full implementation runs to about 40 lines of Python; the core of it looks like this:
```python
import math

def temporal_decay(access_count, last_accessed, current_time, half_life_days=7):
    """Exponential decay with a frequency boost, loosely mimicking human forgetting."""
    elapsed_days = (current_time - last_accessed).total_seconds() / 86400  # fractional days
    time_factor = 0.5 ** (elapsed_days / half_life_days)  # halves every half_life_days
    frequency_boost = math.log(access_count + 1)  # diminishing returns; 0 if never accessed
    return time_factor * frequency_boost
```
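The article doesn't say how the three signals are combined, so the following is one plausible way to do it rather than the method actually used: reciprocal rank fusion (RRF) merges the embedding and BM25 rankings, and the fused score is then scaled by a decay weight such as the output of `temporal_decay`. The function names, the memory IDs, and the decay values are hypothetical; `k=60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of memory IDs into one score per ID.
    Each list contributes 1 / (k + rank), with rank starting at 1."""
    scores = {}
    for ranking in rankings:
        for rank, memory_id in enumerate(ranking, start=1):
            scores[memory_id] = scores.get(memory_id, 0.0) + 1.0 / (k + rank)
    return scores

def rank_memories(semantic_ranking, keyword_ranking, decay_by_id):
    """Combine both rankings, then scale each fused score by its decay weight.
    IDs missing from decay_by_id keep their fused score unchanged."""
    fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
    weighted = {m: s * decay_by_id.get(m, 1.0) for m, s in fused.items()}
    return sorted(weighted, key=weighted.get, reverse=True)

semantic = ["m1", "m2", "m3"]  # best-first, from embedding similarity
keyword = ["m3", "m1"]         # best-first, from BM25 (e.g. an exact error-code match)
decay = {"m1": 0.9, "m2": 0.2, "m3": 1.0}  # illustrative decay weights
print(rank_memories(semantic, keyword, decay))
```

Rank-based fusion sidesteps the problem that cosine similarities and BM25 scores live on different scales, which is why it is a common default for hybrid search.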
The business impact was immediate. Our customer support agent went from repeating solved issues 35% of the time to proactively referencing past solutions in 82% of cases. That translated to a 12-point NPS increase in our beta group.
Small sample size—200 users—but the trend held.
3. Semantic Memory: The "Knowledge" Layer
This is your agent's understanding of the world. Domain knowledge, user preferences, learned patterns. We structured it as a dynamic knowledge graph that updates through three channels:
- Explicit user feedback (thumbs up/down on agent actions)
- Implicit pattern extraction from successful task completions
- Weekly batch updates from subject matter expert reviews
Here's a counterintuitive lesson. We initially tried to make this layer fully automated.
Big mistake.
The agents started developing weird superstitions. One began always adding an unnecessary confirmation step—"Just to confirm, would you like me to proceed?"—because it picked up a correlation from a single anxious user's behaviour. Another started avoiding certain API endpoints at specific times of day because of a coincidental pattern with a rate limiter.
Seriously.
We now run human-in-the-loop validation for any new semantic connections above a 0.85 confidence threshold. It's slower. It doesn't scale perfectly. But it prevents the kind of silent degradation that's impossible to debug later.
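A gate like that can be sketched in a few lines. Everything here is an assumption except the 0.85 threshold: the `Connection` type, the function names, and the choice to simply discard proposals at or below the threshold (the article doesn't say what happens to them). The important property is that nothing reaches the graph without explicit approval.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.85  # threshold quoted in the article

@dataclass
class Connection:
    source: str
    target: str
    confidence: float

def triage(connections, threshold=REVIEW_THRESHOLD):
    """Split proposed connections into a human review queue and a discard pile.
    Nothing is written to the knowledge graph at this stage."""
    review_queue = [c for c in connections if c.confidence > threshold]
    discarded = [c for c in connections if c.confidence <= threshold]
    return review_queue, discarded

def apply_reviews(graph, review_queue, approved):
    """Commit only the connections a reviewer explicitly approved."""
    for c in review_queue:
        if (c.source, c.target) in approved:
            graph.setdefault(c.source, set()).add(c.target)
    return graph

proposed = [
    Connection("error_E42", "restart_sync_job", 0.93),  # hypothetical, plausible
    Connection("api_call", "avoid_at_3pm", 0.88),       # a "superstition"
    Connection("login_error", "reset_password", 0.60),  # below threshold
]
queue, dropped = triage(proposed)
graph = apply_reviews({}, queue, approved={("error_E42", "restart_sync_job")})
print(graph)
```

The superstition clears the confidence bar but still never lands in the graph, because the reviewer didn't approve it.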
The Reasoning Engine: Turning Memory Into Action
Memory without reasoning is just a database with extra steps. And honestly, that's what our first iteration was.
Our reasoning architecture uses a plan-execute-reflect cycle that I, uh, borrowed from a robotics paper I found on arXiv and adapted for software agents:
- Plan phase: The agent queries all three memory layers and generates a ranked list of possible actions. We use chain-of-thought prompting but with a specific requirement—each reasoning step must cite which memory source it's drawing from. Like footnotes in a paper. This traceability was a game-changer for debugging. You can actually see why the agent did something.
- Execute phase: Actions are dispatched with explicit confidence scores. Below 0.7, the agent escalates to a human. That's it. That single threshold reduced critical errors by 47% in our first month. We argued about the exact number for two weeks—0.6 vs 0.7 vs 0.75—and honestly I think we just needed to pick one and ship it.
- Reflect phase: Post-action, the agent updates its episodic memory with outcomes and adjusts semantic connections. Think of it as the agent's "lessons learned" journal.
Well... that's the theory. We try to log everything. Sometimes the reflection step fails silently and we're still working on making that more robust. Last week we found an agent that had been running with a corrupted episodic memory for three days.
Fun times.
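The plan-execute-reflect cycle described above can be sketched roughly as follows. This is a simplified illustration under assumptions, not the actual engine: `Step`, `Plan`, `run_cycle`, and the three source labels are hypothetical, and only the 0.7 escalation threshold comes from the article. Plans whose reasoning steps don't cite a known memory layer are rejected outright.

```python
from dataclasses import dataclass, field

ESCALATION_THRESHOLD = 0.7  # threshold quoted in the article
VALID_SOURCES = {"working", "episodic", "semantic"}

@dataclass
class Step:
    claim: str
    source: str  # which memory layer backs this reasoning step

@dataclass
class Plan:
    action: str
    confidence: float
    reasoning: list = field(default_factory=list)

def validate_citations(plan):
    """A plan is valid only if it has steps and every step cites a known layer."""
    return bool(plan.reasoning) and all(s.source in VALID_SOURCES for s in plan.reasoning)

def run_cycle(plans, execute, episodic_log):
    """Plan: keep cited plans and pick the most confident one.
    Execute: run it, or escalate to a human below the threshold.
    Reflect: append the outcome to the episodic log."""
    cited = [p for p in plans if validate_citations(p)]
    if not cited:
        return "escalated"
    best = max(cited, key=lambda p: p.confidence)
    if best.confidence < ESCALATION_THRESHOLD:
        outcome = "escalated"
    else:
        outcome = execute(best.action)
    # No try/except around the write, so a failing log surfaces as an exception
    # instead of passing silently.
    episodic_log.append({"action": best.action, "outcome": outcome})
    return outcome

log = []
plans = [
    Plan("issue_refund", 0.82, [Step("order marked lost", "working"),
                                Step("same fix worked before", "episodic")]),
    Plan("close_ticket", 0.95, [Step("gut feeling", "unknown")]),
]
result = run_cycle(plans, execute=lambda action: f"done:{action}", episodic_log=log)
print(result, log)
```

The higher-confidence plan loses here because its reasoning can't be traced to a memory source, which is the debugging property the citation requirement is meant to buy.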
What I'd Do Differently
I wish I'd started with observability. For the first two months, we were flying blind—no way to trace why an agent made a specific decision. I can't overstate how frustrating this was.
Now we log every memory retrieval with timestamps, relevance scores, and the resulting action. We use LangSmith for traces (switched from a homegrown solution in June, wish we'd done it sooner) and dump everything into Datadog for dashboards. The data has been invaluable for debugging.
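As a rough illustration of what one such log entry might contain, here is a sketch using only the Python standard library. The field names, the logger name, and `log_retrieval` itself are assumptions; this does not reproduce the LangSmith or Datadog integrations, it only emits one JSON line per retrieval that a log pipeline could ingest.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("memory.retrieval")  # hypothetical logger name

def log_retrieval(agent_id, layer, query, hits, action):
    """Emit one JSON line per retrieval: when it happened, what was asked,
    what came back with which relevance, and the action that followed."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent_id": agent_id,
        "layer": layer,  # "working", "episodic", or "semantic"
        "query": query,
        "hits": [{"memory_id": m, "relevance": round(s, 4)} for m, s in hits],
        "action": action,
    }
    logger.info(json.dumps(record))
    return record

logging.basicConfig(level=logging.INFO)
entry = log_retrieval(
    agent_id="support-7",
    layer="episodic",
    query="refund for order 1234",
    hits=[("m3", 0.91234), ("m1", 0.7)],
    action="issue_refund",
)
```

Keeping the retrieved IDs, their scores, and the resulting action in the same record is what makes it possible to answer "why did the agent do that?" from a single log line.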
I also completely underestimated the organisational challenge. Getting our product and domain expert teams to contribute to the semantic memory layer required a cultural shift. Engineers got it immediately. Product people? Not so much. They have other priorities.
We ended up gamifying it with a leaderboard for "most valuable knowledge contributions." It sounds gimmicky. It is gimmicky. But it drove a 3x increase in participation and I'll take what works over what looks sophisticated any day.
The Numbers That Actually Matter
After implementing this architecture (as of our September metrics review):
- Agent autonomy rate: 73% (up from 41%)
- Average task completion time: Down 38%
- Human escalation rate: Reduced by half
- User satisfaction: 4.2/5 (up from 3.1)
But here's the metric I actually care about. Our engineering team's velocity on AI features tripled. We're no longer rebuilding memory systems for every new agent. The platform team owns the memory layer, and the feature teams build on top of it. That's how it should have been from day one.
TL;DR
- Treat memory as architecture, not a feature. Vector databases alone won't save you.
- Three layers work: Working memory (context window), episodic memory (past experiences), semantic memory (world knowledge).
- Add temporal decay. Human-like forgetting patterns beat pure retrieval.
- Log everything. You can't debug what you can't trace.
- Human-in-the-loop for semantic updates. Agents develop superstitions otherwise.
- Pick a confidence threshold and ship. We spent two weeks debating 0.6 vs 0.7 vs 0.75. Just pick one.
I'm convinced the next wave of AI differentiation won't come from bigger models. It'll come from better memory architectures. Probably not a hot take at this point, but the companies that treat agent memory as a first-class engineering problem will be the ones delivering reliable, trustworthy AI experiences. The ones chasing benchmark scores on static datasets will keep wondering why their agents don't work in production.
I think.
What's your experience with agent memory systems? Have you found a sweet spot between context richness and latency? I'm particularly curious about teams using alternative approaches like MemGPT (now Letta). I saw some interesting stuff from their ICML workshop paper but haven't had time to properly evaluate it. Drop your insights in the comments, especially if you've tried the hierarchical memory approach and found it either brilliant or terrible.
Michael Torres is VP of Engineering at a Series B startup, where he leads platform engineering and AI infrastructure teams. He writes about scaling engineering organisations and building reliable AI systems. He is based in Austin and has been building in the agentic AI space since early 2023.
#AIAgents #EngineeringLeadership #MachineLearning #SoftwareArchitecture #AutonomousSystems
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.