I Spent Three Years Chasing AI Models. Cursor Taught Me I Was Wrong

Let me tell you about last Wednesday night.

I sat there staring at Cursor's three technical blog posts for four straight hours. Not skimming—actually reading, line by line, like some kind of masochist. Halfway through, I had this flashback to November 2023. It was 2 AM, I was debugging context management for an agent framework, and I came this close to hurling my keyboard across the room.

At the time, I blamed the model.

Looking back now?

I was an idiot.

This Opinion Might Piss Some People Off

Everyone's obsessed with chasing new models. Cursor? They've been obsessively polishing the wrapper.

Here's a take that'll probably get me in trouble—the gap between SOTA models is way smaller than you think. What actually separates great products from mediocre ones is everything wrapped around the model.

I know, I know. The model builders want to fight me through the screen right now. The application folks are rolling their eyes thinking "tell us something we don't know." But hear me out.

Think about how absurd our industry has become. Claude drops a new version, everyone runs benchmarks. GPT gets an update, there's an immediate comparison thread with fancy charts. Two percentage points difference? People switch models like they're changing socks.

And what's Cursor doing?

They're polishing the harness.

Same Model, Different Wrapper, Wildly Different Results

Cursor mentioned something in their blog that I reread three times. Honestly.

Every time they get a new SOTA model, they don't just toss it to users. They spend weeks tuning the harness specifically for that model's quirks. After tuning, the same model is noticeably faster, smarter, and cheaper on tokens.

Weeks.

Let that sink in. Models are iterating monthly—sometimes weekly now. And they're spending weeks adapting to each one. If this didn't work, they'd have cut it ages ago, right? But they haven't. They've doubled down.

So what the hell is a harness?

It's everything wrapped around the model. System prompts, tool definitions, context management, tool call orchestration, error handling, model switching logic—all of it. This stuff determines whether a model runs smoothly in your product or stumbles around like it's had a few too many.

It looks boring. But—how do I put this—it's like dropping a Ferrari engine into a tractor's gearbox. Doesn't matter how good the engine is. The drive's going to be miserable.

I ran a comparison experiment late last year. Same Claude model, two different agent frameworks, same code refactoring task. One framework's harness was... let's say "rough." Tool descriptions were slapped together. Context management was basically "cram everything in." The other framework was clearly polished—tool descriptions had parameter-level constraints, context had clear prioritisation and pruning logic.

The results?

The rough one averaged 12 conversation turns to finish the task, burning about 80k tokens, and kept veering off course. I had to constantly correct it. The polished one? Six turns on average, 40k tokens, rarely needed my intervention.

Same. Model.

That's when it clicked. The model is the engine. The harness is the gearbox, suspension, steering. Slap a crap gearbox on, and your fancy engine means nothing.

From "Feed It Everything" to "Let It Find What It Needs"

Of the three blog posts, the one that really hit home was about context management.

Back in 2024, models were still a bit... thick. Not stupid exactly, just not great at figuring out what context they actually needed. So Cursor's strategy was all guardrails, treating them like toddlers:

Editing a file? Shove lint and type errors back to it proactively
Model not reading enough files? Auto-rewrite its request to read more
Strict limits on how many tools it could call per turn
At session start, cram directory structure, semantically matched code snippets, compressed user attachments—everything—into the system prompt

It backfired.

I remember using Cursor last year and feeling these overprotective vibes. I'd want to rename a function, and it'd go off reading seven files first. It felt like "I'm worried you won't have enough, so here's EVERYTHING." Once I asked it to fix a type error, and after reading a bunch of unrelated files, it started "helpfully" refactoring other modules. I had to yank the emergency brake, but thousands of tokens were already torched.

Now?

Those guardrails are mostly gone. Cursor's approach is painfully restrained now. Static context is minimal: OS, git status, currently open and recently viewed files. Everything else? The agent fetches it on demand.

They call this Dynamic Context Discovery.

I actually love that name. The essence is: don't dump everything on the table upfront. Let the model open the drawers itself.

This shift reflects a genuine leap in model capabilities. Old models needed to be spoon-fed. New models can hunt for what they need. So your harness design has to evolve—from "here's everything you might possibly need" to "here are legs, go walk."

I tried this approach in my own project. Previously, my agent initialised with a mountain of project docs and code indexes—roughly 15k tokens of static context. I slashed it to 3k and made the rest available via search tools for the agent to fetch as needed.

The result surprised me.

Token consumption dropped 40%. Task completion rates improved. Because the model wasn't drowning in potentially irrelevant information. It only grabbed what it actually needed.

Seriously.

Better Models Need More Restrained Harnesses

Here's the counterintuitive bit. Sit with this.

A lot of people assume stronger models mean simpler harnesses—the model's smart now, just give it some tools and it'll figure things out. That's not wrong exactly, just naive.

Cursor's experience suggests the opposite. Stronger models need more careful harness design. They're more sensitive to prompts, more aware of tool definition boundaries, less tolerant of contextual noise.

It's like working with a brilliant but opinionated colleague. Give them vague instructions, they'll interpret things their own way—and the result might not be what you wanted. You need to define boundaries, constraints, and expectations clearly for them to truly shine. A weaker model is easier—you dump everything on it, and it just obediently follows without overthinking.

This observation—no, lesson—came from stepping on every rake in the garden.

Last year I used an early agent framework with incredibly detailed prompts. I basically specified every step. Back then the models were weaker, and it worked fine. Later the models got upgraded. I got lazy and didn't update the prompts. The agent started going rogue, because it felt my detailed instructions were cramping its style.

Brilliant.

Eventually I shifted from "tell it how to do the thing" to "tell it what outcome I want and what constraints exist." The improvement was dramatic. This was summer 2024, I think—can't remember the exact month, but the feeling stuck with me.

Why I Think This Matters

There's this culture in our space right now of relentless model-chasing. Some model scores two points higher on a benchmark, everyone jumps ship. But almost nobody spends time polishing their harness.

I get why.

Because tuning a harness is tedious. No benchmarks to flex. No papers to publish. Just trial and error, over and over, digging through agent conversation logs looking for patterns. And the results are hard to quantify—how do you prove the harness improved things rather than the model just being better? It's thankless work, honestly.

But the Cursor team clearly doesn't care about any of that. In their blog they were refreshingly honest—it's just a "continuous polishing" process. No silver bullets. No magic tricks.

I respect that.

I've been writing about this stuff for nearly a decade, and I've watched too many products die from "great model, terrible engineering." Model changes, entire experience collapses because the harness wasn't adapted. Then the team doubts the model, chases the next one—death spiral. I've seen this movie at least five or six times now.

Here's the thing: the model is trained by someone else. The harness is yours. Models will keep getting stronger, but harnesses don't magically improve. You have to polish them, bit by bit.

I should acknowledge a limitation here. Everything I've described applies mainly to coding agents. Other domains—customer support, writing, data analysis—probably have completely different harness priorities. And Cursor has massive user data guiding their optimisation, which smaller teams can't replicate.

But the principle holds. The model is just the engine. The entire engineering system defines the experience.

This is probably my biggest mental model shift of the year.

You can chase models for three years. Eventually you'll realise: what actually separates you from everyone else is how much time you spend polishing the invisible layer nobody sees.

TL;DR / Key Takeaways

SOTA model differences are overrated. Claude, GPT-4, Gemini—they're converging. The real differentiator is the engineering around them.
Harness design makes or breaks the experience. Same model, different harness = dramatically different results (I've seen 50% fewer tokens and 2x faster task completion).
Stronger models need more restrained harnesses, not simpler ones. They're sensitive to noise and over-specification.
Dynamic Context Discovery beats "dump everything upfront." Let the model fetch what it needs. My token usage dropped 40% with better task completion.
Polishing a harness is tedious but irreplaceable. No shortcuts here. Just continuous iteration.

What's your experience been? Have you seen harness design make or break an AI product you've worked on? Drop a comment—I genuinely want to know.

ai #cursor #softwareengineering #devtools #llm

I Spent Three Years Chasing AI Models. Cursor Taught Me I Was Wrong

I Spent Three Years Chasing AI Models. Cursor Taught Me I Was Wrong

This Opinion Might Piss Some People Off

Same Model, Different Wrapper, Wildly Different Results

From "Feed It Everything" to "Let It Find What It Needs"

Better Models Need More Restrained Harnesses

Why I Think This Matters

TL;DR / Key Takeaways

ai #cursor #softwareengineering #devtools #llm

Cael Lee

Ready to get started?