I Stress-Tested DeepSeek's 128K Context Window — It Started Hallucinating at 50K
I Stress-Tested DeepSeek's 128K Context Window — It Started Hallucinating at 50K
Last Wednesday at 2 AM, I burst out laughing at my screen. After 97 rounds of conversation, DeepSeek had somehow convinced itself that I wasn't a human who likes apples — I was, apparently, an apple tree.
This wasn't a bug. This was AI dementia.
So I decided to run a proper test. How many conversation rounds before a model with a supposedly 128K context window loses its marbles? I used deepseek-chat-67b with temperature set to 0.3 and top_p at 0.9 — conservative settings to rule out random chaos. With 100 rounds of conversation planned and each round clocking in at roughly 800-1000 tokens, we're looking at about 80K-100K total tokens.
Well within the advertised limit.
Or so I thought.
The Test: Teaching an AI About a Cat
Here's the setup: I invented a cat called "Coder" (creative, I know), an orange tabby living on my balcony. I'd feed the model details about Coder's life, then every 10 rounds, I'd ask: "What's Coder been up to lately?"
Round 1: I told it, "Coder is a 3-year-old orange tabby who lives on my balcony. He loves salmon."
Round 10: I asked about Coder. The response? Spot on. "Coder is your 3-year-old orange tabby, living on the balcony, recently enjoying some salmon." It even added that Coder "seems quite happy." At this point, it's behaving like a reliable intern — asks the right questions, gives the right answers.
Round 30: I started flooding the conversation with noise — Kubernetes pod scheduling, the miserable weather in Hangzhou, what to order for lunch. Then at round 31, I suddenly asked: "How's Coder doing lately?"
It responded: "Coder doesn't seem to be eating much these days."
Excuse me?
I'd never said that. The model just... made it up. The memory was already cracking.
Round 50: Selective Amnesia
By round 50, we'd fed in roughly 45,000 tokens. That's only about 35% of the theoretical context window. But the model had already developed selective hearing.
At round 45, I explicitly stated: "I bought Coder a new cat tree — blue, three levels."
Round 50, I asked: "What new toys does Coder have?"
It answered: "You recently got Coder some new toys, including a feather wand and a few small balls."
Where's the cat tree? The blue three-level one? Vanished into the AI void.
The plot twist? It dug up the "feather wand" from round 12 and the "small balls" from round 23 — completely ignoring the more recent information.
This tells you something crucial: it's not forgetting chronologically. It's discarding information based on some bizarre priority system. Like a manager who remembers you were late once three years ago but forgets you pulled an all-nighter last week. That cat tree somehow landed on the lowest priority tier and got garbage-collected straight into oblivion.
Round 70: The Storytelling Begins
By round 70, things had gone properly sideways.
At round 65, I said: "Coder's put on weight. The vet says he needs a diet."
Round 70, I asked about Coder's health. The response:
"Coder's health has been somewhat poor lately. You took him to the vet and discovered he has dental issues requiring cleaning. You also mentioned he's on a new weight-loss cat food."
Dental issues? Never mentioned them. Weight-loss cat food? Pure fabrication.
At this stage, DeepSeek behaves like that overeager intern who can't remember your instructions but desperately wants to seem useful — so they start inventing plausible-sounding details. The dangerous part? The fabrications make sense. Unless you're meticulously cross-referencing, you'd miss them entirely.
I went back and checked every single message from rounds 50 through 70. "Dental issues" had never appeared. Pure hallucination. In technical terms, the hallucination rate was climbing — though "exponentially" isn't quite right. More accurately, from round 60 onwards, each query's accuracy dropped about 8-10 percentage points. By round 70, recall on key information hovered around 60%.
Actually, that's not entirely fair either. "Accuracy" is hard to define here. Some responses were half-right: it remembered the vet visit but invented the dental problem. Do we score that as correct or incorrect?
I'm not going to overthink this.
Round 90: Total Narrative Collapse
Round 90. I asked: "What's Coder been up to lately?"
It answered: "Coder is adjusting to his new home. You mentioned he was recently adopted and is still getting familiar with the environment. He's quite interested in the flowerpots on the balcony."
Recently adopted?
I said in round 1 that Coder was 3 years old and had always lived on my balcony. By round 90, it had rebooted this cat's entire life story.
Even stranger: I had mentioned in round 80 that "Coder knocked over a flowerpot." It remembered "flowerpot" but completely mangled the timeline, reframing it as a newly-adopted cat exploring his surroundings.
This is the fatal flaw of long-context models: they remember fragments but can't assemble the narrative.
My grandmother does the same thing. She remembers I loved sweets as a kid, but every time she sees me now, she asks if I've started secondary school. I'm 30.
Why Does This Happen?
Let me geek out for a moment. This isn't just a DeepSeek problem — from what I've seen, Claude and GPT-4 have similar quirks. It's a structural limitation of the Transformer architecture.
Attention Dilution
The attention mechanism in Transformers gets "blurry" in long contexts. Imagine trying to follow one person's voice in a pub with 100 people shouting — even with perfect hearing, the interference is real. DeepSeek's attention weights noticeably lose focus on early information beyond 50K tokens. I read a paper back in March 2024 (can't remember the exact title — something about attention entropy increasing linearly with context length) that backs this up.
Internal RAG Degradation
Long-context models essentially do internal RAG — retrieving relevant information from the context window to formulate responses. But when the context gets too long, retrieval accuracy falls off a cliff. In my testing, after round 70, retrieval accuracy was around 40%.
The "Lost in the Middle" Problem
Here's the weirdest bit: models don't forget the oldest information first, or the newest. They forget the middle.
Information from rounds 30-60 had the highest loss rate. It's like cramming for an exam — you remember the first and last chapters perfectly, but the middle section is a blurry mess. Researchers call this "lost in the middle," and papers have documented it since 2023. Over a year later, it's still basically unsolved from an engineering perspective.
My Production War Story
Right. Real talk.
Last December, I built a customer service bot for a SaaS product using the DeepSeek API. The scenario: users might chat continuously for 2-3 hours, with loads of context threaded throughout. Backend was LangChain v0.1.8 with a Memory module tacked on. Model was deepseek-chat-67b.
First version went live. Users immediately reported the bot was "spouting nonsense." I dug into the logs. After roughly 60 rounds of conversation, the bot started mixing User A's questions with User B's history — cross-session contamination. The most absurd log entry: User C asked "How do I get a refund?" and the bot responded, "Has your cat been eating less lately?" That was from User D's completely separate conversation about their pet.
I fixed three things:
- Forced summarisation: Every 30 rounds, automatically compress the conversation history into a structured summary. The prompt explicitly said "retain only key facts, discard emotional descriptions." I used LangChain's
ConversationSummaryMemorybut modified the source to cap summaries at 500 tokens.
- Critical info tagging: Explicitly mark user names, requirements, and other critical information with
`python,`tags. Picked up this trick from a GitHub issue — can't remember which repo exactly, probably some open-source RAG project.
- Proactive reset mechanism: When the model's response confidence drops, trigger a "let me verify the previous information" routine. How do you detect confidence drops? I set a simple rule: if the response contains more than 3 vague terms ("recently," "some," "a certain"), trigger a reset.
Second version went live. Accuracy jumped from 67% to 89%. But the solution was clunky, and it increased token consumption by 30%. Added roughly $280 USD to the monthly API bill.
What You Should Actually Do
If you're building long-conversation applications with DeepSeek (or frankly, any LLM), here's what I've learned:
Don't Trust the 128K Marketing Number
128K is theoretical. In practice, the usable window is probably 40K-60K tokens. Beyond that, expect hallucinations. Plan accordingly.
Repeatedly Feed Critical Information
Don't assume the model will remember a user's name from round 3 by the time you reach round 80. Either repeat important information periodically or store it externally. I now keep critical facts in Redis, pulling them via RAG before each query.
Design "Memory Anchors"
Every 20-30 rounds, insert a one-sentence state summary. Something like: "To summarise: Coder is a 3-year-old orange tabby, currently on a diet, who recently knocked over a flowerpot." This helps the model re-anchor to reality.
Monitor for Vagueness
If the model starts using fuzzy language, it's losing track. That's your cue to trigger summarisation or reset logic. I track generic terms — "recently," "various," "some things" — as early warning signals.
Here's the Uncomfortable Truth
Every major LLM provider is racing to inflate context window numbers. 128K, 256K, 1M tokens — bigger numbers every quarter.
But length isn't quality.
It's like smartphone cameras. Going from 12 megapixels to 100 million looks great on a spec sheet, but real-world photo quality? Barely improves. Long context windows follow the same pattern. Of that 128K, you might get stable performance from the first 40K.
Don't be fooled by the marketing.
Test it yourself.
Key Takeaways
- DeepSeek's 128K context window realistically delivers stable performance up to ~40K-60K tokens
- Beyond that, hallucination rates climb 8-10% per 10 conversation rounds
- The model forgets "middle" information more than early or recent data
- Mitigation: forced summarisation, critical info tagging, memory anchors every 20-30 rounds
- Expect ~30% token overhead if you implement proper memory management
- Never trust a vendor's context window claim — benchmark it with your own data
I'm genuinely curious: have you hit AI memory failures in production? Has anyone stress-tested beyond 100 rounds in a real application? If you've run similar endurance tests, drop your numbers in the comments. Let's figure out where these models actually break.
deepseek #ai #llm #contextwindow #engineering #hallucinations #apitesting
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.