| 128K tokens | 3.8s | 74% | $0.15 |
That's a 40% speed drop and a 15% accuracy hit. And the token cost? It tripled.
Why? Two reasons. First, you're making the model sift through noise to find signal—like searching for a stapler on a desk buried under 200 documents. Second, there's a well-documented phenomenon called "lost in the middle." LLMs naturally pay less attention to information in the middle of the context window. If your critical business logic sits between a deprecated README and some old commented-out code, the model might just... gloss over it.
The name comes from a 2023 paper by Liu et al. (Stanford, UC Berkeley, and Samaya AI), "Lost in the Middle: How Language Models Use Long Contexts." Worth a read if you're into the mechanics of attention.
Three Strategies I Actually Use
After that near-miss with the payment system, I got systematic about context management. Here are three approaches I've battle-tested.
Strategy 1: Summarization Compression
The simplest approach: don't let the AI remember everything. Make it remember the highlights.
Between conversation turns, use a lightweight model to compress history into a summary:
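```python
def compress_context(conversation_history, max_tokens=2000):
    # count_tokens and llm are placeholders for your tokenizer and model client
    if count_tokens(conversation_history) <= max_tokens:
        return conversation_history

    # Keep the last 3 turns verbatim
    recent = conversation_history[-3:]
    # Summarize everything older
    older = conversation_history[:-3]
    summary = llm.summarize(
        older,
        instruction="Extract key technical decisions and code changes only",
    )
    # Wrap the summary in a list so it concatenates with the list of recent turns
    # (assumes llm.summarize returns a single string or message)
    return [summary] + recent
```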
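If you want to try the idea without wiring up a real model, here's a minimal self-contained sketch. Everything in it is a stand-in I made up for illustration: `count_tokens` just counts words, and `fake_summarize` replaces the LLM call by keeping the first sentence of each older turn. Swap in your real tokenizer and summarizer.

```python
from typing import Callable, List


def count_tokens(turns: List[str]) -> int:
    """Rough stand-in for a tokenizer: counts whitespace-separated words."""
    return sum(len(turn.split()) for turn in turns)


def compress_context(
    history: List[str],
    summarize: Callable[[List[str]], str],
    max_tokens: int = 2000,
    keep_recent: int = 3,
) -> List[str]:
    """Return history unchanged if it fits, else [summary] + most recent turns."""
    if count_tokens(history) <= max_tokens:
        return history
    split = max(len(history) - keep_recent, 0)
    older, recent = history[:split], history[split:]
    if not older:
        return history  # nothing old enough to summarize
    return [summarize(older)] + recent


def fake_summarize(turns: List[str]) -> str:
    """Placeholder for an LLM call: keeps the first sentence of each turn."""
    firsts = [turn.split(".")[0].strip() for turn in turns]
    return "Summary of earlier turns: " + "; ".join(firsts)


history = [
    "We use compound components for the form library. Long discussion follows here.",
    "Props are camelCase. More back and forth about naming.",
    "Turn three about tests.",
    "Turn four about styling.",
    "Turn five about deployment.",
]
compressed = compress_context(history, fake_summarize, max_tokens=10)
for turn in compressed:
    print(turn)
```

The two oldest turns collapse into one summary line and the last three come through untouched. Notice that the summary is only as good as the summarizer: whatever it drops is gone for the rest of the session.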
I tested this on a React project where we'd agreed on specific component naming conventions. Without compression, by turn 20 the AI started "forgetting" our conventions and suggesting random patterns. With compression? It held strong through turn 50.
Actually—let me correct that. It held mostly strong. It remembered the big decisions (like "we're using compound components for the form library") but lost some granular stuff (like "this specific prop should be camelCase, not snake_case"). Honestly, that's fine. If a conversation goes 50 turns deep, you should probably re-establish your conventions anyway.
Pros: Dead simple to implement. Works for 80% of use cases.
Cons: Loses detail. Terrible for tasks requiring precise traceability.
Strategy 2: Structured Memory with Hot/Warm/Cold Tiers
This one's more advanced. The idea is to mimic how CPU caches work—keep frequently accessed data close, archive the rest.
Here's my current setup:
- Hot memory: The last 10 conversation turns, kept in full
- Warm memory: Current file + directly imported dependencies, retrieved in real time
- Cold memory: Project docs, architectural decisions, coding standards—all vectorized and searchable
I hit a snag early on, though. Originally, I used keyword matching for cold memory retrieval. "User login" wouldn't match "user authentication," which... yeah, obvious in hindsight. Switched to embedding-based search and the difference was night and day.
This approach—wait, I should call it an architecture—is complex to set up. You need a vector database. I'm using Qdrant (self-hosted, lightweight, gets the job done). Some folks on my team prefer Pinecone's managed service. It's pricier but zero maintenance.
```javascript
const projectMemory = {
  hotContext: [],  // Recent turns, full fidelity
  warmContext: [], // Current file context
  async retrieveColdMemory(query) {
    const queryEmbedding = await getEmbedding(query);
    return vectorDB.search(queryEmbedding, { topK: 5 });
  }
};
```
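To make the cold-memory lookup less of a black box, here's a rough Python sketch of what a top-K similarity search does under the hood. It is not how Qdrant or Pinecone are implemented (real vector databases use approximate indexes), and the three-dimensional vectors are hand-made for illustration rather than output from a real embedding model. The `search` function and `cold_memory` dict are hypothetical names.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(
    index: Dict[str, List[float]], query: List[float], top_k: int = 5
) -> List[Tuple[str, float]]:
    """Brute-force search: score every document, return the top_k best."""
    scored = [(doc, cosine_similarity(vec, query)) for doc, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hand-made "embeddings", purely illustrative.
cold_memory = {
    "user authentication flow": [0.9, 0.1, 0.0],
    "database migration policy": [0.0, 0.2, 0.9],
    "component naming conventions": [0.1, 0.9, 0.1],
}
query_user_login = [0.8, 0.2, 0.0]  # pretend embedding of "user login"

results = search(cold_memory, query_user_login, top_k=2)
for doc, similarity in results:
    print(f"{similarity:.3f}  {doc}")
```

The point of the toy vectors: "user login" and "user authentication flow" share no keywords beyond "user," but their vectors point in nearly the same direction, so the authentication doc ranks first.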
Pros: Memory persists across sessions. Team-wide consistency.
Cons: Setup overhead. Overkill for small projects—you're bringing a flamethrower to a candle problem.
Strategy 3: Sliding Window + Importance Scoring
This is my daily driver. It's the sweet spot between the first two approaches.
Every piece of information gets a score. High scores survive. Low scores get compressed or dropped:
- User explicitly says "remember this": +10 points
- Architectural decision: +8 points
- Code snippet: +5 points
- Casual chat or jokes: +1 point
- Not referenced in 10+ turns: -1 point per turn
When the context window fills up, low-scoring content gets the axe first.
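Here's a minimal sketch of that scoring and eviction loop. The point values come from the list above; everything else is my assumption for illustration: the `ContextItem` fields, the per-item token counts, and reading "-1 point per turn" as one point off for each idle turn starting at the tenth.

```python
from dataclasses import dataclass
from typing import List

BASE_SCORES = {
    "pinned": 10,       # user explicitly said "remember this"
    "architecture": 8,  # architectural decision
    "code": 5,          # code snippet
    "chat": 1,          # casual chat or jokes
}
STALE_AFTER_TURNS = 10


@dataclass
class ContextItem:
    text: str
    kind: str
    last_referenced_turn: int
    tokens: int


def score(item: ContextItem, current_turn: int) -> int:
    """Base score minus a staleness penalty (assumed: -1 per idle turn from the 10th on)."""
    idle = current_turn - item.last_referenced_turn
    penalty = max(0, idle - STALE_AFTER_TURNS + 1)
    return BASE_SCORES[item.kind] - penalty


def evict(items: List[ContextItem], current_turn: int, token_budget: int) -> List[ContextItem]:
    """Drop the lowest-scoring items until the total fits the budget."""
    ranked = sorted(items, key=lambda it: score(it, current_turn), reverse=True)
    while ranked and sum(it.tokens for it in ranked) > token_budget:
        ranked.pop()  # lowest score sits at the end
    kept_ids = {id(it) for it in ranked}
    return [it for it in items if id(it) in kept_ids]  # preserve original order


items = [
    ContextItem("orders table uses UUID primary keys", "architecture", 12, 40),
    ContextItem("joke about MySQL error messages", "chat", 11, 30),
    ContextItem("CREATE TABLE orders (id UUID PRIMARY KEY)", "code", 13, 60),
    ContextItem("remember this: never drop columns in migrations", "pinned", 2, 30),
]
kept = evict(items, current_turn=14, token_budget=130)
for item in kept:
    print(score(item, 14), item.text)
```

With a 130-token budget, the joke is the only thing that gets dropped. The pinned note has gone 12 turns without a mention, so it has decayed from 10 to 7, but it still outranks the code snippet.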
The beauty is how naturally it works. Last week I was deep in a database schema discussion with Cursor, and I made some offhand joke about MySQL's error messages being written by sadists. The scoring system quietly dropped that joke within a few turns—but kept every table structure decision intact.
From what I've read, Cursor uses something similar internally, just with more sophisticated scoring dimensions. They published some technical details in early 2025 on their engineering blog.
Pros: Flexible, intuitive, handles mixed-context conversations well.
Cons: You need to tune the scoring weights for your workflow. One size doesn't fit all.
My Current Toolkit
Here's what I'm actually using day-to-day:
- Daily coding: Cursor with custom sliding window rules (I'll share my config in a follow-up post)
- Complex refactors: Strategy 1 (summarization) to compress history, then start a fresh session
- Team collaboration: Strategy 2 (structured memory) for shared coding standards and project docs
Tool-wise, LangChain and LlamaIndex both have solid memory management components. If you're using OpenAI's Assistants API, they've got built-in thread management that handles some of this for you. Not perfect, but good enough to start.
Something Weird I'm Still Experimenting With
Here's a wild idea I've been tinkering with: what if the AI decides what to remember?
I gave my assistant a "notebook" tool. During conversations, it can proactively jot down key decisions. At the start of the next session, the notebook contents get injected into context.
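For the curious, the storage side of the notebook can be sketched in a few lines. This is a simplified illustration, not my exact setup: the `Notebook` class, the JSON file format, and the preamble wording are all hypothetical, and how you expose `jot` to the model as a tool depends on whichever API you're using, so that part is left out.

```python
import json
import tempfile
from pathlib import Path
from typing import List


class Notebook:
    """File-backed notes the assistant can append to via a tool call."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def load(self) -> List[str]:
        if not self.path.exists():
            return []
        return json.loads(self.path.read_text(encoding="utf-8"))

    def jot(self, note: str) -> None:
        notes = self.load()
        if note not in notes:  # skip exact duplicates
            notes.append(note)
            self.path.write_text(json.dumps(notes, indent=2), encoding="utf-8")

    def as_context(self) -> str:
        """Format saved notes as a preamble to inject at session start."""
        notes = self.load()
        if not notes:
            return ""
        lines = "\n".join(f"- {note}" for note in notes)
        return "Notes from previous sessions:\n" + lines


# Session 1: the model calls the tool mid-conversation.
notebook_path = Path(tempfile.mkdtemp()) / "notebook.json"
Notebook(notebook_path).jot("User prefers Option B for database migration strategy.")

# Session 2: a fresh object reads the same file and builds the preamble.
preamble = Notebook(notebook_path).as_context()
print(preamble)
```

Because the notes live in a plain file, they survive across sessions and you can open the file and prune anything the model should not have bothered writing down.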
Early results are surprisingly good. The AI will pause mid-discussion and note: "User prefers Option B for database migration strategy." In later sessions, it'll reference those notes naturally.
But it's still experimental. Sometimes it records the most random things. Last Tuesday, while I was debugging a CSS layout issue, it solemnly noted: "User has a strong preference for the color #ff6b6b."
I mean... it's not wrong. That's a nice red. But not exactly the architectural insight I was hoping for.
TL;DR
- More context ≠ better results. Dumping your entire project into an AI often backfires.
- Context windows have a "lost in the middle" problem. Critical info gets buried.
- Use compression, structured memory, or importance scoring to manage what the AI sees.
- Spend 5 minutes curating context before a complex task. It'll save you hours of debugging.
Context management is fundamentally about tradeoffs. Give the AI too much, and it drowns. Give it too little, and it hallucinates. Finding the balance requires actually understanding your project—there's no shortcut.
I've built a habit now: before any complex AI-assisted task, I spend five minutes asking myself, "What does the model actually need to know?" Those five minutes have saved me more debugging time than I want to admit.
What's your experience? Have you hit any "AI amnesia" bugs in production? Got a context management trick I haven't tried? Drop it in the comments—I'm always looking for better approaches.
#ai #programming #productivity #softwareengineering #devtools