Home / Blog / 解析Agent框架中的上下文管理策略 (English)

解析Agent框架中的上下文管理策略 (English)

By CaelLee | | 7 min read

解析Agent框架中的上下文管理策略 (English)

Generated: 2026-06-23 13:52:22

---

Okay, I've carefully reviewed the original text, corrected the factual contradictions and inconsistencies in the data, removed the rigid clichés, and broke up the overly neat parallel sentence structures to make the rhythm more natural. Here's the revised final version:

---

Why Does Your Agent Project Always Fail? 90% of People Trip on This One Step

Last month, a friend of mine went for an AI position at a big tech company.

On his resume, he wrote "worked on an Agent project." The interviewer's eyes lit up on the spot.

For the next fifteen minutes, almost every question hit the same sore spot—context management.

When should you compress? How do you continue the conversation after compression? What should the prompt for the compression summary look like? How do you draw the architecture diagram?

Lucky for him, he'd read through two of my code analysis articles before. He held his ground.

In the end, he actually passed.

Afterwards, during his debrief, he said the interviewer cared less about how well you memorized your knowledge base and more about whether you'd personally stepped into that pit.

Think about it—it makes sense. Anyone can draw an architecture diagram on paper. But only someone who's actually done it knows: the model suddenly loses memory because the truncation strategy is too crude; the Agent calls the same tool five times because the key instructions got eaten by the summary.

So I decided to break this topic down completely.

This isn't a conceptual overview.

I'm going to open up, one by one, how the mainstream frameworks handle context today.

Next time an interviewer asks, "How do you manage context in your Agent project?" you'll be able to confidently rattle off a well-structured answer.

Instead of choking out: "We… used truncation."

---

You Have No Idea What Kind of "Box" It's Dealing With

Let's start with the underlying logic.

Large language models have no memory.

Every time you send a question, it reads everything from scratch—the system prompt, the conversation history, the sentence you just asked—all crammed into a fixed-size box.

This capacity is called the context window.

Its unit is tokens. One Chinese character is roughly 1 to 2 tokens; one English word averages about 1 token.

Many models claim a 128K window, which sounds pretty big, right?

After just one or two rounds of conversation plus tool calls, a quarter of it is gone.

But here's the thing—an Agent isn't a chat.

A chatbot can go for dozens of rounds. The earlier "What's the weather like today?" is fine to discard.

An Agent is different. It has to execute tasks, call tools, read files, run tests. One single read_file, and hundreds of lines of code go in. One terminal, and thousands of lines of logs scroll by.

Add to that the tool schemas, error stacks, the user's original requirements, fixed constraints set earlier…

After just a few rounds, the box is full.

What's worse is: being full doesn't just affect the current round.

When the context is stuffed, the model's attention behaves like you in the middle of a long afternoon meeting—it starts losing information, ignoring instructions, even repeating operations it's already done.

What you see is: the Agent is still running, but it's already "losing its memory." It doesn't remember which file it just modified ten minutes ago. It doesn't recall that the user said, "Don't touch this config."

So, at first, I thought this was simple.

When it's full, just truncate. Chop off the front, keep the most recent rounds.

Then I got proven wrong.

Quite spectacularly, too.

---

The Early Truncation Disasters: Not Just One Pit, A Chain of Pits

A few years ago, almost all Agent frameworks used the same crude approach: set a token threshold—say, 80% of the window—and once that's exceeded, compression fires off automatically.

How did they compress? Either they dropped the earliest conversations entirely, or they simply kept the most recent few rounds.

You can imagine what that leads to.

First, cliff-edge triggering.

For the first 90%, the system has no reaction at all. The longer the conversation, the more sluggish the model gets. You think it's thinking? It's already swimming in information overload.

Then, suddenly, compression kicks in.

And in one go, it crushes dozens of rounds of history into a single summary.

The model, just overwhelmed by being overstuffed, now gets most of its context ripped away.

It's like someone is trying hard to digest a big meal, and you reach over and yank every plate off the table, leaving only a single vegetable stem.

Second, full-summary loses details.

Decades of messages compressed into a few hundred words. No matter how well you write your prompt, the variable names, function signatures, error stacks, the user's exact words—they're all gone.

And these are exactly the things the Agent needs most to do its job.

I tried a scenario once:

In the first round, the user said, "Don't modify that config.yml file."

Later, when the Agent was working, it completely forgot. Because it had been summarized into something vague like "keep certain configuration files unchanged."

Third, no priority distinction.

All history is treated equally. An explicit constraint from the user and a casual "that looks nice" are handled the same way.

Large chunks of logs from tool outputs and critical error messages get compressed indiscriminately.

Honestly, this isn't a flaw unique to any one framework. It's a common problem in almost all early implementations.

My very first Agent project used this strategy.

Later, I got sick of revising it.

---

Six Products, Six Philosophies: This Is the Key

Starting last year, many products started taking context compression seriously.

I spent a fair amount of time digging into all the mainstream approaches.

And I found something—

Their ideas diverge wildly.

Here's a table for you, the core secret weapons of each product:

ProductCore StrategyOne-Sentence Summary
Claude CodeFive-stage pipeline, ordered by increasing costCheap local operations first, LLM summary as a last resort
Codex CLIKeep recent user messages raw; replace the rest with handoff summariesWhat the user says is most accurate; what the model says can be rewritten
OpenCodeTimestamp hiding + structured summary + replay last user messageDoesn't truly delete; theoretically recoverable
ClineAuto + manual dual mode; /compact generates a summary and continues within the same taskGives the user a choice
CursorAuto-summary + prompt to start new conversation + searchable historyCan trace back to original history even after compression
AmpNo recursive compression; use /handoff to start a new thread with key pointsLong conversations themselves are the problem; changing threads is better than compressing

Do you see? Behind every choice, there's a trade-off.

For instance, Codex CLI believes the user's original message is the most accurate, so it keeps the raw user message and replaces model-generated content with summaries.

Logical reasoning—model replies might be wrong, but what the user said is definitely correct.

The cost? User messages don't reduce much in size, especially when the user pastes a large block of code.

Claude Code's approach is more layered.

It designs a five-stage pipeline: first discard tool outputs that are no longer needed, then do key-value pair compression (keep key fields), then structured summaries, then hierarchical summaries, and finally, only as a last resort, summon the LLM to generate a full summary.

Each level is more expensive but retains more information.

Try to solve it with cheap local operations first; only bring out the big model when you have to.

Amp's approach is the most extreme—it almost never compresses.

The team believes that long conversations themselves are the problem, and they should be solved by switching threads. Each thread stays short and focused, passing key state via /handoff. That way, no thread's context window gets blown out.

The cost is weaker continuity; global information across threads needs to be passed explicitly.

MemGPT/Letta's design is the most academic: treat context like memory management. The Agent itself decides what should stay in "RAM" and what should be swapped out to "disk" (long-term storage). It actively loads things back when needed.

Sounds

MemGPT/LettaContext = RAM, history = disk; Agent autonomously swaps in and outOperating system-level memory management
C

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.

Ready to get started?

Get your API key and start building with 180+ AI models.

Get API Key Free