I Spent 3 Months Learning Context Engineering the Hard Way — Here's What Actually Works
I Spent 3 Months Learning Context Engineering the Hard Way — Here's What Actually Works
Last fall, I built a customer service agent that nearly broke me.
Seriously. I'm not being dramatic.
GPT-4o, 27 prompt revisions (yes, I counted), 12 few-shot examples, 800+ documents in the knowledge base. It ran beautifully in staging — smooth as butter. Then we shipped it to production, and everything went sideways. A user said, "I want to return the blue one," and the agent froze. Started asking, "May I have your order number?"
The user had literally given their order number three messages earlier.
The model wasn't stupid.
We were.
I was still trapped in that "find the perfect prompt" mindset back then — convinced that if I just crafted the right incantation, the model would handle everything. Reality slapped me hard. It wasn't a prompt problem. It was a context problem. The user's information got lost. Tool outputs drowned critical data. The system prompt was bloated with noise that squeezed out actual reasoning space.
That's where context engineering starts — though I didn't know the term yet.
It took me three months to truly understand one thing: context isn't chat history. Context is the entire information set you feed the model. Way more complex than I initially thought. When I first saw the three-category framework — instructional context, informational context, actional context — something clicked. Sounds academic, but in plain English: tell the model what to do, what to know, and what it can act on. Your system prompt is the conductor's baton. RAG-retrieved knowledge is the ammunition. Tool calls and their results are the hands and feet. Miss any one of them, and you're limping.
But knowing the categories? That's just step one.
The real headache is managing all that context.
The 15K Token Wake-Up Call
Last December, I inherited a marketing agent project. Average call: 15K tokens burned. 15K. Nearly 40% of that was duplicate tool outputs and stale execution traces. A user would say, "Tweak that proposal from last time," and the agent had to dig through the entire history to figure out what "that proposal" even was. It failed. Hard.
That's when I realized context engineering has four core operations: writing, selecting, compressing, and isolating.
Writing: The Easy Part
Writing is the foundation — defining rules in system prompts, describing interfaces in tool definitions, storing user preferences in memory modules. You're "writing" all of it in. But getting it in is easy. Getting the right stuff out? That's where things get interesting.
Selecting: Where the Magic Happens
Selection is the real craft. My current approach: Agentic RAG with just-in-time context. Don't dump every relevant document in at once. Instead, dynamically retrieve only the tiny slice of information needed right before each reasoning step. Honestly? The results are dramatic.
Here's a real number for you: we were using Claude 3.5 Sonnet for a customer service agent, stuffing 5 knowledge base documents into every call. Average: 12K tokens, ~78% accuracy. Switched to the just-in-time strategy — 1-2 documents per retrieval, token count dropped to 6K, accuracy jumped to 89%. The reason's simple: less noise. The model stopped getting lost between "return policy" and "membership benefits." Think about that for a second.
But selection isn't a silver bullet.
Some things have to be in context, and they're just... long.
Compression: The Art of Being Cheap
This is where I fell into my deepest hole — assuming the model would just ignore redundant information on its own. The Chroma team has this concept called "context corruption." Noise doesn't just take up space; it actively destroys reasoning logic. When the signal-to-noise ratio drops, hallucinations skyrocket.
I once had an agent processing a refund dispute. The tool returned a complete payment log — 200 lines of JSON, with only three records actually mattering. The model "reasoned" a coupon into existence from that mess. A coupon that never existed. It confidently told the user, "Your $5 coupon will be refunded within 3 business days."
Brilliant. Just brilliant.
Here's what I do now: summarize long execution traces, keep only critical fields from tool outputs, strip out failed intermediate steps, leave only conclusions. Your context budget isn't infinite. You've got to be stingy — miserly, even. Every token needs to earn its place.
Isolation: The Thing Nobody Talks About
Isolation. Most people overlook this — no, scratch that, most people don't even realize it's a thing.
Context from different tasks needs to be walled off. Otherwise, your agent applies the last user's preferences to the next user. When you think about it, that's kind of terrifying. We use a sub-agent architecture now: each sub-task gets its own isolated context window, and the main agent just handles orchestration and summarization. The effect was immediate — cross-talk dropped from 12% to under 2%. From 12% to 2%. That gap was way bigger than I expected.
MCP: Standards That Actually Make Sense?
Speaking of all this, I should mention MCP.
At first, I figured it was just another organization inventing buzzwords. But after digging into Anthropic's design rationale — I mean, I kind of get the ambition now. MCP is essentially standardizing interfaces for actional context and some informational context. How tools are defined, how data gets exchanged, how results get formatted — all specified.
It's interesting. Before, connecting a tool meant writing a bunch of glue code, inconsistent formats, every error handler doing its own thing, debugging at 3 AM was practically a ritual. If MCP takes off, the "writing" and "selecting" parts of context engineering get a lot easier. Anthropic is genuinely ahead on the standards game — I've got to give them that.
From what I'm seeing though, selection is maturing fastest. Agentic RAG, just-in-time context — these are pretty solid now. But compression, writing, and isolation? Still early days. Everyone's figuring it out as they go. Especially compression. Nothing feels quite elegant yet.
What Andrej Karpathy Got Right
This reminds me of that Andrej Karpathy tweet — you know the one. "The new skill isn't prompt engineering, it's context engineering." He nailed it. That guy has a gift for hitting the bullseye in one sentence.
I sometimes wonder: once these areas mature, the bottleneck for long-running coding agents will shift from "how do I manage context" to "how do I decompose tasks and define goals." An agent's ceiling is the clarity of your problem description. But that's a story for another day.
Here's the reality right now.
Most AI agent failures have nothing to do with model capability. It's context engineering done poorly. You stuff 10K tokens of garbage into the context window and then complain the model is dumb.
Really.
I was reading Anthropic's piece on effective context engineering for AI agents recently, and they dropped this slogan: "Do the simple thing that works." I think that's the core philosophy right there — don't jump straight to complex memory layers. Get the minimal loop running first. Lock down your system prompt, define clear tool boundaries, curate a small set of high-quality examples, run through one real task trajectory end-to-end, then layer in RAG, summarization, caching. Step by step. Don't try to leap to the finish line.
These days, when I start a project, I make context observable and evaluable before touching any optimization. If you can't even see what your agent sees at each step or how many tokens it's burning, and you're already reaching for multi-agent architectures — you're digging your own grave. A deep one.
TL;DR / Key Takeaways
- Context ≠ chat history. It's the full information set: instructional (what to do), informational (what to know), actional (what it can act on).
- Four operations matter: writing, selecting, compressing, isolating. Selection is the most mature; compression is still the wild west.
- Real numbers: Just-in-time context retrieval cut tokens by 50% and boosted accuracy from 78% to 89% in one project.
- Noise kills. "Context corruption" is real — redundant information doesn't just waste tokens, it actively breaks reasoning.
- Isolate or suffer. Sub-agent architecture dropped cross-talk from 12% to 2%.
- Start simple. Lock down the basics before adding complexity. Observable context first, optimization second.
I feel like I've finally got a handle on this — replacing the habit of sloppily stuffing prompts with disciplined context assembly: budgeted, prioritized, evidence-backed.
This is going to keep me busy for the whole year.
What's your experience? Have you hit context management walls in production? Drop a comment — I'd love to hear what's worked (or failed spectacularly) for you.
ai #contextengineering #llm #promptengineering #agents #softwareengineering
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.