I Let GPT-5.1 Fix 236 Real GitHub Bugs — Here's What Actually Happened

Last Tuesday night, I was debugging a payment module concurrency issue. Two and a half hours staring at logs. Nothing. You know what it was? A missing zero in the database connection pool config. poolsize=5 instead of poolsize=50. That single character cost me the whole stretch from 6pm to 8:30pm.

Then I did something slightly unhinged.

I threw the exact same error logs and code at GPT-5.1-Codex-Max. It pinpointed the connection pool issue in 47 seconds. Then — and this is the bit that got me — it also fixed a race condition I'd completely missed. Shared state in a callback function, no mutex. Ran 2,000 concurrent requests after the fix. Clean.

Honestly? My reaction wasn't excitement. It was this weird, uncomfortable feeling I still can't quite name.

Twitter's having its quarterly meltdown about "Will AI replace programmers?" again. I think that question's fundamentally wrong. The real question is: How reliable is autonomous debugging, actually? When should you trust it, and when absolutely shouldn't you?

I spent three weeks throwing 236 real-world bugs at GPT-5.1-Codex-Max. Not toy projects — proper bugs scraped from GitHub. Here's the data, the surprises, and the moments it properly failed.

How I Tested This

Let me be precise about methodology. Without context, the numbers are meaningless.

I pulled 236 bugs from 120 open-source projects. Every single one had clear reproduction steps and had already been fixed by human developers — so I had ground truth to compare against. Language breakdown: roughly 40% Python, 35% TypeScript/JavaScript, 10% each for Go and Rust, the rest Java and C++. All from March 2024 to January 2025. Recent stuff.

Each bug got three attempts:

Round one: zero hints. Just the error message and relevant code files. Let the model figure it out.

If that failed, round two: stack traces and log snippets.

Still broken? Round three: the human discussion summary from the GitHub issue.

After each fix attempt, I ran the project's test suite and manually reviewed every code change for logical soundness.
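
For anyone wanting to reproduce the three-round setup, the escalation can be sketched roughly as below. This is a minimal sketch under assumptions, not the harness used for these tests: Bug, build_context, run_rounds, propose_fix and tests_pass are all hypothetical names, and the manual review step is left out because it can't be automated.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Bug:
    """Everything known about one bug (hypothetical structure)."""
    error_message: str
    code_files: Sequence[str]
    stack_trace: str = ""
    logs: str = ""
    discussion_summary: str = ""


def build_context(bug: Bug, round_no: int) -> dict:
    """Return the material shown to the model in round 1, 2 or 3."""
    context = {"error": bug.error_message, "code": list(bug.code_files)}
    if round_no >= 2:  # round two adds stack traces and log snippets
        context["stack_trace"] = bug.stack_trace
        context["logs"] = bug.logs
    if round_no >= 3:  # round three adds the human discussion summary
        context["discussion"] = bug.discussion_summary
    return context


def run_rounds(
    bug: Bug,
    propose_fix: Callable[[dict], Optional[str]],
    tests_pass: Callable[[str], bool],
) -> Optional[int]:
    """Return the first round whose patch passes the tests, else None."""
    for round_no in (1, 2, 3):
        patch = propose_fix(build_context(bug, round_no))
        if patch is not None and tests_pass(patch):
            return round_no
    return None
```

Here propose_fix stands in for the model call and tests_pass for the project's test suite. Both are passed in as plain callables so the loop itself stays independent of any particular tool.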

One thing I need to clarify. GPT-5.1-Codex-Max has something previous versions lacked by default: a built-in execution feedback loop. It can run its fix in a sandbox, see the test results, and decide on its own whether to iterate. My GPT-4 runs in early 2024 were essentially "blind fixing": patch something, hope it works, no feedback. Now it's got its eyes open.

To be fair, GPT-4 could run code through a code interpreter, but that had to be triggered manually. It was not an automatic built-in loop. That distinction matters quite a bit.

The Numbers

Here's the headline: 68.6% overall fix rate across three attempts.

Not amazing. Not terrible. The breakdown is where it gets interesting:

Zero hints (just error + code): 41.2%. This surprised me. Last March I ran similar tests with GPT-4 — same conditions, 18%. In one year, it more than doubled.

That's a bit unsettling.

Added stack traces and logs: jumped to 57.8%. Python and TypeScript saw the biggest gains. I reckon it's because their error messages are naturally structured — the model can trace through stack frames effectively. Go's errors are... well, you know. if err != nil and then you're on your own with whatever logging you wrote.

Added human discussion summaries: final rate of 68.6%. Go and Rust leapt the most here, from roughly 45% to nearly 70%. My theory? These languages' bugs often involve ownership, lifetimes, and conceptual design decisions. You can't infer original intent from code alone, but human discussions fill in that missing context beautifully.
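
The three cumulative figures can also be read as "how much did each round add?". The short calculation below uses only the percentages quoted above. The names cumulative and round_gains are just illustrative.

```python
# Cumulative fix rates reported above, as a percentage of all 236 bugs.
cumulative = {
    "zero hints": 41.2,
    "+ stack traces and logs": 57.8,
    "+ issue discussion": 68.6,
}


def round_gains(rates: dict) -> list:
    """For each round, return (label, points gained, % of open bugs rescued)."""
    rows = []
    previous = 0.0
    for label, rate in rates.items():
        gain = rate - previous              # percentage points added this round
        still_open = 100.0 - previous       # share of bugs unsolved before the round
        rescued = 100.0 * gain / still_open  # share of those open bugs now fixed
        rows.append((label, round(gain, 1), round(rescued, 1)))
        previous = rate
    return rows


for label, gain, rescued in round_gains(cumulative):
    print(f"{label}: +{gain} points, {rescued}% of still-open bugs fixed")
```

Read this way, stack traces and logs added 16.6 points, which is about 28% of the bugs round one left unsolved. The discussion summaries added another 10.8 points, or about 26% of what was still open after round two.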

"With sufficient context, GPT-5.1-Codex-Max independently resolves roughly seven in ten real-world bugs, end to end. This isn't lab data — it's production projects."

But don't get too excited. That 70% figure hides a lot of sharp edges.

What It Fixes Well — and What It Doesn't

I categorised the bugs. The success rate gap is staggering:

Dependency conflicts, config errors, null pointers/undefined: 85% and above. Not shocking. These patterns are incredibly consistent, and the model's seen them millions of times in training data. At one point I deliberately changed "axios": "^1.6.0" to "axios": "^9.9.9" in a package.json — a version that doesn't exist. It not only spotted the issue, it queried the npm registry and substituted 1.7.9. Two years ago I'd have called that science fiction.

Logic errors and boundary conditions: 55-65%. Here's a fascinating pattern I noticed — when the buggy logic had clear intent expressed in comments or variable names, success rates shot up. In other words, the clearer your code, the better the AI fixes it. Sounds obvious, but it means something real: writing comments isn't just for your future colleagues anymore. It's effectively providing "debugging hints" to AI tools. A comment like // handles cross-timezone timestamp conversion versus radio silence — the difference in fix accuracy is measurable.
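
To make the "comments as debugging hints" point concrete, here is a small illustrative pair. It is not taken from any of the tested projects. Both functions are hypothetical and behave identically, but only the second one states its intent.

```python
from datetime import datetime, timedelta, timezone


def conv(ts, off):
    # No stated intent: is ts UTC or local? Is off hours or seconds?
    return datetime.fromtimestamp(ts, tz=timezone.utc).astimezone(
        timezone(timedelta(hours=off))
    )


def to_viewer_time(utc_epoch_seconds: float, viewer_utc_offset_hours: int) -> datetime:
    # Handles cross-timezone timestamp conversion: stored values are always
    # UTC epoch seconds; convert to the viewer's offset only at display time.
    utc_moment = datetime.fromtimestamp(utc_epoch_seconds, tz=timezone.utc)
    viewer_zone = timezone(timedelta(hours=viewer_utc_offset_hours))
    return utc_moment.astimezone(viewer_zone)
```

If a caller passes local time into the second function, the comment and parameter names make the mismatch visible to a reader or a model. In the first version, nothing in the code says which input is wrong.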

Concurrency issues and race conditions: drops to roughly 30%.

This is the clearest weakness right now.

I analysed dozens of failure cases. The model often senses the problem is concurrency-related, but its fixes either over-lock things into deadlocks or miss execution paths entirely. These bugs require a precise mental model of global system state at runtime. GPT-5.1-Codex-Max is better than its predecessors, but it still struggles here.

It's a tricky one. I think the fundamental issue is architectural — current LLMs are essentially doing pattern matching, but concurrency bugs demand tracking three or four execution paths simultaneously. That multi-dimensional tracing isn't their strength.
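
For reference, the simplest form of the pattern from the opening anecdote (shared state in a callback, no mutex) and its fix looks roughly like this. It is a minimal sketch, not the payment module's code. RequestStats and run_concurrent are names invented for illustration.

```python
import threading


class RequestStats:
    """Shared state updated from a completion callback on many threads."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.completed = 0

    def on_request_done(self) -> None:
        # Read-modify-write on shared state: without the lock, two threads
        # can read the same old value and one increment is lost.
        with self._lock:
            self.completed += 1


def run_concurrent(stats: RequestStats, threads: int = 8, calls: int = 250) -> int:
    """Fire the callback from several threads and return the final count."""

    def worker() -> None:
        for _ in range(calls):
            stats.on_request_done()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return stats.completed
```

The lock covers only the single increment. Keeping the critical section that small is the usual guard against the over-locking failure mode described above, where a fix holds several locks at once and ends in deadlock.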

A Story That Stuck With Me

Let me give you a concrete example.

One test involved a Django project. Symptom: "Users occasionally see someone else's shopping cart after login." This bug sat in the GitHub issues for three months. A senior engineer eventually spent two days fixing it. Root cause: in Redis Cluster mode, MGET across nodes returned incomplete data, and session deserialisation mixed in cache fragments from other users.

I fed the code and error logs to GPT-5.1-Codex-Max.

First attempt: "Add select_related to the view function to avoid lazy loading." Completely wrong. The problem had nothing to do with ORM queries. This felt like a very junior-engineer move: fix what you see, don't dig deeper.

Second attempt, with stack traces: it pinpointed the session middleware, but suggested "add locking to session reads and writes." That'd cause serious performance degradation. Under high concurrency, basically suicide.

Third attempt, I included the issue discussion summary. One person had commented: "This only happens in production with Redis cluster mode. Can't reproduce at all with local single-instance Redis."

That one sentence.

The model immediately pivoted. It traced the issue to MGET returning incomplete data across nodes and produced a fix nearly identical to what the human engineer eventually committed: switch to pipeline with individual gets, add consistency validation.
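
The shape of that fix, individual GETs batched in a pipeline plus a consistency check, might look something like the sketch below. The project's actual patch isn't reproduced here, so treat every detail as an assumption: the session:<user_id> key format, the user_id field in the payload, and the fetch_sessions name are all hypothetical. The client only needs a redis-py style pipeline() with get() and execute(), so a tiny in-memory fake stands in for Redis.

```python
import json


def fetch_sessions(client, user_ids):
    """Fetch sessions with one GET per key (pipelined) and validate ownership."""
    keys = [f"session:{uid}" for uid in user_ids]  # hypothetical key format
    pipe = client.pipeline()
    for key in keys:
        pipe.get(key)  # queue an individual GET instead of one MGET
    raw_values = pipe.execute()

    # Consistency check 1: exactly one reply per requested key.
    if len(raw_values) != len(keys):
        raise RuntimeError("incomplete reply from cache")

    sessions = {}
    for uid, raw in zip(user_ids, raw_values):
        if raw is None:
            continue  # no cached session for this user
        session = json.loads(raw)
        # Consistency check 2: the payload must belong to the user we asked for.
        if session.get("user_id") != uid:
            raise ValueError(f"session for {uid} contains another user's data")
        sessions[uid] = session
    return sessions


class FakePipeline:
    """In-memory stand-in for a pipeline, for demonstration only."""

    def __init__(self, store):
        self._store, self._queued = store, []

    def get(self, key):
        self._queued.append(key)
        return self

    def execute(self):
        return [self._store.get(key) for key in self._queued]


class FakeClient:
    def __init__(self, store):
        self._store = store

    def pipeline(self):
        return FakePipeline(self._store)
```

The ownership check is the part that turns "someone else's shopping cart" into a loud error instead of a silent data leak, whichever retrieval path is used.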

This taught me something crucial: GPT-5.1-Codex-Max's bottleneck often isn't reasoning capability — it's information access. Give it enough clues, and it makes remarkably precise judgements. Without enough clues, it guesses like a junior developer — and guesses with alarming confidence. That's actually the most dangerous part.

Where the Remaining 31.4% Went Wrong

68.6% sounds decent. But what about the other third?

I went through every failure case. They cluster into three scenarios:

First: cross-service distributed bugs. A microservice throws a timeout error, but the root cause is a config change in a different service. The model only sees the current service's code. It's not a capability problem — it's an information boundary problem.

Second: domain-specific business logic. A tax calculation bug in a financial system. The model doesn't know that "Brazil's ICMS tax has special exemption rules in São Paulo state." From pure code logic, the calculation looks correct. These problems need external knowledge, and the model's knowledge cutoff creates blind spots. Max's cutoff is November 2024, from what I understand.

Third: bugs requiring architectural refactoring. The model defaults to minimal changes — usually a good instinct. But some bugs exist because the architecture is fundamentally wrong. Patching just makes the system more fragile. The model currently can't judge "this design is broken at its core and needs rethinking."

"AI can fix bugs, but it can't judge when a bug shouldn't be fixed. That judgement is probably the human engineer's moat for a long while yet."

How I Actually Use This Now

After all that data, here's my real-world workflow.

When I hit a bug now: error message, relevant code snippets, recent commit history — all goes to Max for a first-pass analysis. About half the time, it gives me the correct location straight away, or at least narrows things to two or three files.

It saves a lot of the mechanical "grep logs, guess, set breakpoints" grind.

But I've got one hard rule: I must fully understand any fix before it touches the codebase.

It's not about trust.

It's that when I don't understand a fix, I've effectively surrendered control of the code. Last month it suggested disabling CSRF validation in an auth middleware to "fix a CORS issue." Technically? Tests pass. Practically? Security vulnerability. The model wasn't being malicious — it was optimising for "make tests green," and security wasn't in its objective function.

That's not its fault. It's ours, for how we've designed the objectives.
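
One cheap way to put security back into the objective is a pre-merge check that flags risky changes in a patch, whether or not the tests are green. The sketch below scans a unified diff for a handful of patterns. The pattern list is illustrative, not exhaustive, and flag_risky_changes is a hypothetical helper rather than part of any existing tool.

```python
import re

# (diff sign, pattern, reason). "+" matches added lines, "-" removed lines.
RISKY_PATTERNS = [
    ("+", re.compile(r"csrf_exempt"), "view exempted from CSRF checks"),
    ("-", re.compile(r"CsrfViewMiddleware"), "CSRF middleware removed"),
    ("+", re.compile(r"verify\s*=\s*False"), "TLS certificate verification disabled"),
]


def flag_risky_changes(diff_text: str) -> list:
    """Return (diff line number, reason) for each security-sensitive change."""
    findings = []
    for line_no, line in enumerate(diff_text.splitlines(), start=1):
        if line.startswith(("+++", "---")):
            continue  # file headers, not content changes
        for sign, pattern, reason in RISKY_PATTERNS:
            if line.startswith(sign) and pattern.search(line):
                findings.append((line_no, reason))
    return findings


example_diff = """\
--- a/settings.py
+++ b/settings.py
@@ -10,7 +10,6 @@
 MIDDLEWARE = [
     'django.middleware.security.SecurityMiddleware',
-    'django.middleware.csrf.CsrfViewMiddleware',
     'django.contrib.auth.middleware.AuthenticationMiddleware',
 ]
"""

for line_no, reason in flag_risky_changes(example_diff):
    print(f"diff line {line_no}: {reason}")
```

A check like this doesn't replace reading the fix. It only guarantees that a change like the CSRF one above gets a human's attention before it can merge.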

TL;DR

GPT-5.1-Codex-Max fixed 68.6% of the 236 real-world bugs once it had sufficient context
Zero-hint fix rate is 41.2%, more than double the 18% GPT-4 managed a year ago
It excels at config errors, dependency issues, null pointers (85%+)
It struggles with concurrency bugs (~30%) and distributed system failures
The bottleneck is information access, not reasoning — give it clues, it performs
Never merge a fix you don't understand. The model optimises for tests passing, not security or architectural soundness

Here's a question for you: if your team adopted a tool that automatically fixes 70% of bugs, would you loosen your code review process? Or would AI-generated fixes actually get more scrutiny? I'm genuinely curious how different teams handle this. Drop your thoughts in the comments.

#AI #Debugging #SoftwareEngineering #DeveloperTools #GPT5


Cael Lee
