I Pitted Two AI Coding Assistants Against Our Legacy Codebase—One Actually Understood It
Last weekend, I did something that felt vaguely unethical: I fed our most cursed legacy module to two different AI coding assistants and watched them fight it out. The results? Honestly surprising. And I'm still not sure whether to be excited or terrified about what they mean for our tooling budget next quarter.
Here's the backstory. My team hit that nightmare scenario every engineering leader secretly dreads: a 4-year-old payment processing service that nobody fully understood anymore. The original authors? Long gone. Documentation? Maybe three paragraphs in a Notion doc from 2022. Yet the business needed us to refactor it for a new pricing model. With zero downtime, naturally.
So we needed AI help. But which tool could actually handle this level of mess?
I decided to run a controlled experiment. OpenAI's Codex (via API, gpt-4-turbo) versus Cursor's integrated AI. Same refactoring job. Same 12,000-line legacy codebase. Same prompts. Same expectations.
Here's what happened.
The Test Setup
I isolated a payment calculation module. Here's what we were dealing with:
- 12,347 lines of Python
- 47 interconnected functions
- Deep inheritance chains—5+ levels in some places
- Database calls scattered across 18 files
- Zero type hints. Classic startup code.
The task: Extract pricing logic into a separate service without breaking existing integrations.
I measured three things:
- Context retention – How much of the codebase could the tool "remember" at once?
- Dependency awareness – Could it trace ripple effects of changes?
- Accuracy rate – How many AI suggestions actually compiled and passed tests?
Round 1: Context Retention
Codex (128K context window):
- Successfully ingested ~8,000 lines before responses got... weird
- Started hallucinating function signatures after the 15th file
- Lost track of variable scope when functions exceeded 200 lines
I remember staring at one suggestion where it confidently referenced a parameter that didn't exist anywhere in the codebase. Just... invented it. Like that one contractor we had in 2019 who nodded along in every meeting and then delivered something from a parallel universe.
Cursor (with codebase indexing):
- Indexed all 12,347 lines in under 3 minutes
- Maintained awareness of cross-file dependencies throughout
- But—and this matters—its suggestions became overly conservative after 20+ interactions
Actually, wait—I should clarify what I mean by "conservative." Cursor started refusing to suggest changes that touched more than 3 files at once. It would say things like "this modification may have unintended side effects" even when the change was straightforward. Helpful at first. Annoying by hour three.
Key insight: Cursor's indexing gave it a massive advantage in large codebases. Codex felt like working with a brilliant contractor who occasionally forgot what we discussed 10 minutes ago. Which, I mean... I've worked with that person. It's exhausting.
Round 2: Dependency Tracking
This is where things got interesting.
I asked both tools: "If I rename this calculate_tax method, show me every file that needs updating."
Codex correctly identified 14 of 18 affected files. It missed:
- 2 dynamic imports (understandable, honestly)
- 1 monkey-patched function in a test file
- 1 dependency hidden in a string-based getattr call
Cursor found 17 of 18 files. The miss? That same getattr call.
So neither tool caught it. That string-based metaprogramming trick—the kind of thing a senior dev writes at 11pm and forgets to document—remained invisible to both.
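To make that blind spot concrete, here's a minimal sketch of why a string-based getattr call slips past static analysis. Everything in it is hypothetical: the Invoice class, the direct_call and dynamic_call helpers, and the 0.2 tax rate are stand-ins I made up for illustration, not code from our module, and this is not how either tool works internally. It just shows that a syntax-tree search for calculate_tax finds the definition and the direct call, while the dynamic call only shows up if you go looking for non-literal getattr arguments.

```python
import ast

# Hypothetical stand-in for the legacy module (not the real code).
SOURCE = '''
class Invoice:
    def calculate_tax(self, amount):
        return amount * 0.2

def direct_call(invoice, amount):
    return invoice.calculate_tax(amount)

def dynamic_call(invoice, kind, amount):
    # The method name only exists at runtime, assembled from a string.
    return getattr(invoice, "calculate_" + kind)(amount)
'''


def find_static_references(source: str, name: str) -> list[int]:
    """Line numbers where `name` appears as a definition, attribute, or bare name."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name == name:
            hits.append(node.lineno)
        elif isinstance(node, ast.Attribute) and node.attr == name:
            hits.append(node.lineno)
        elif isinstance(node, ast.Name) and node.id == name:
            hits.append(node.lineno)
    return sorted(hits)


def find_dynamic_getattr(source: str) -> list[int]:
    """Line numbers of getattr calls whose attribute name is not a plain string literal."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "getattr"
            and len(node.args) >= 2
        ):
            name_arg = node.args[1]
            is_literal = isinstance(name_arg, ast.Constant) and isinstance(name_arg.value, str)
            if not is_literal:
                hits.append(node.lineno)
    return sorted(hits)


print(find_static_references(SOURCE, "calculate_tax"))  # [3, 7]
print(find_dynamic_getattr(SOURCE))                     # [11]
```

The second function can't tell you which method the getattr call resolves to. It only flags the places a human needs to check by hand before a rename, which is roughly what we ended up doing.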
The real differentiator: Cursor showed me a dependency graph visualization. I didn't even know I needed that. My team spent 15 minutes verifying Cursor's output versus 45 minutes tracking down Codex's misses.
For a team of 5 engineers billing at $150/hour average, that's 30 minutes saved per person... let me do the math... roughly $375 saved per refactoring session (0.5 hours × 5 engineers × $150). Not life-changing. But it adds up over a quarter.
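For anyone who wants to plug in their own numbers, the arithmetic can be sketched like this. The function name and parameters are mine, and it assumes all five engineers spend the full verification time, which is the only reading under which the figure above works out.

```python
def session_savings(minutes_slow: float, minutes_fast: float,
                    engineers: int, hourly_rate: float) -> float:
    """Dollar value of the time difference, assuming every engineer spends the full time."""
    hours_saved = (minutes_slow - minutes_fast) / 60
    return hours_saved * engineers * hourly_rate


# 45 minutes chasing Codex's misses vs. 15 minutes verifying Cursor's output.
print(session_savings(45, 15, engineers=5, hourly_rate=150))  # 375.0
```

If only one or two people do the verification, the savings shrink proportionally, so treat this as an upper bound for a team of this size.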
Round 3: Accuracy Under Load
I measured suggestion accuracy across 50 refactoring tasks. Here's the raw data:
| Tool | Compile Rate | Test Pass Rate | Production-Ready Rate |
|---|---|---|---|
| Codex | 78% | 64% | 52% |
| Cursor | 85% | 73% | 61% |
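If you want to run a similar comparison yourself, here's one way the three rates could be tallied. This is a sketch under assumptions: the TaskOutcome structure and the sample outcomes are invented for illustration, and it assumes each rate is counted against the full set of tasks rather than against the survivors of the previous stage.

```python
from dataclasses import dataclass


@dataclass
class TaskOutcome:
    """Result of one refactoring task (hypothetical record shape)."""
    compiled: bool
    tests_passed: bool
    production_ready: bool


def tally_rates(outcomes: list[TaskOutcome]) -> dict[str, float]:
    """Percentage of all tasks that cleared each bar."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("need at least one task outcome")
    return {
        "compile_rate": 100 * sum(o.compiled for o in outcomes) / total,
        "test_pass_rate": 100 * sum(o.tests_passed for o in outcomes) / total,
        "production_ready_rate": 100 * sum(o.production_ready for o in outcomes) / total,
    }


# Made-up sample data, not results from the experiment.
sample = [
    TaskOutcome(compiled=True, tests_passed=True, production_ready=True),
    TaskOutcome(compiled=True, tests_passed=True, production_ready=False),
    TaskOutcome(compiled=True, tests_passed=False, production_ready=False),
    TaskOutcome(compiled=False, tests_passed=False, production_ready=False),
]
print(tally_rates(sample))
```

Counting every stage against the full task list keeps the columns directly comparable across tools, at the cost of hiding how many compiled suggestions went on to fail tests.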
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.