I Watched GPT-5.1 Rewrite My 3-Night Algorithm Module in 47 Seconds. Here's What Happened Next

Last week at a hackathon in Berlin, I watched GPT-5.1-Codex-Max refactor an entire graph algorithm module in 47 seconds.

Forty-seven seconds.

Here's the thing—I'd spent three nights on that module. Late nights. The kind where you're questioning your career choices at 2 AM, staring at a bug that turns out to be a missing semicolon. And this thing did it in less time than it takes to make instant noodles.

The coffee in my hand suddenly tasted like defeat.

But here's the twist: after two weeks of actually using it, I've discovered it's both more impressive and more frustrating than I expected. Let me give you the unvarnished truth—the wins, the face-plants, and the 2 AM debugging session I still haven't emotionally recovered from.

TL;DR

GPT-5.1-Codex-Max is genuinely scary-good at dynamic programming and graph algorithms: first-try accuracy went from 59% to 78% in my tests, roughly 32% better than the last version
Inference speed is impressive but uneven: about 1.8 seconds for 2,000 lines of logic analysis, though it sometimes hits what I've started calling "hallucination interrupts"
Three real-world war stories ahead, including one that had me debugging until 2 AM (yes, I'm still bitter)

Let's Talk Numbers

I ran a proper comparison—50 LeetCode hard problems from the February 2025 update, tested through Copilot Chat's GPT-5.1 interface with temperature set to 0.2. Here's what fell out:

GPT-5.1-Codex-Max nailed 78% on the first try. No prompt engineering tricks, just clean problem descriptions with clear input/output specs. Back in November, GPT-4-Codex managed 59% on the exact same dataset. That's a proper jump.

Three areas where it really shines:

Dynamic programming: State transition derivation accuracy shot from 61% to 89%. I reckon this is because it now explicitly models the state space instead of pattern-matching solutions. You can almost see it thinking through the recurrence relations.
Graph algorithms: Dijkstra and Floyd-Warshall variants handle edge cases much more gracefully, especially pruning on DAGs. Fewer off-by-one errors in boundary conditions.
Concurrency models: Multi-thread synchronisation and deadlock avoidance code is noticeably more complete. Still makes stupid mistakes sometimes—more on that in a bit.

Actually, wait. I need to correct myself here.

That 78% figure? That's with standardised prompts and crystal-clear specifications. Real projects are messier. Much messier. Don't expect plug-and-play perfection—I learned this the hard way.

Case Study 1: 3D Rainwater Trapping

I threw it a classic with a twist: "Trapping Rain Water II", the 3D version of the rainwater problem. LeetCode hard. You need a min-heap maintaining the boundary while doing BFS expansion. The core concept isn't new, but there's enough variation in the wild that boundary conditions trip up most implementations.

My prompt was deliberately minimal:


Implement function trapRainWater(heightMap) that takes an m x n matrix 
of non-negative integers and returns total trapped rainwater. 
Use priority queue optimisation. Go implementation.

What came back surprised me.

Not only was the code immediately runnable, but the comments actually explained why it chose container/heap over simple sorting. It walked through the claimed complexity reduction, from O(mn log(mn)) down to O(mn log(m+n)); that tighter bound only holds while the heap stays about as small as the grid's perimeter. GPT-4-Codex rarely volunteered that kind of analysis unprompted. It'd just generate what you asked for.
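For readers who haven't met the technique, here is a minimal sketch of the boundary min-heap approach described above. It is an illustration written in Python with `heapq`, not the Go code the model returned; the function name `trap_rain_water` and the variable names are assumptions made for this example.

```python
import heapq
from typing import List


def trap_rain_water(height_map: List[List[int]]) -> int:
    """Return the total water trapped on a 2D elevation map."""
    if not height_map or not height_map[0]:
        return 0
    rows, cols = len(height_map), len(height_map[0])
    if rows < 3 or cols < 3:
        return 0  # no interior cells, so nothing can hold water

    visited = [[False] * cols for _ in range(rows)]
    heap = []  # entries are (wall height, row, col)

    # Seed the heap with every border cell: water always escapes there.
    for r in range(rows):
        for c in range(cols):
            if r in (0, rows - 1) or c in (0, cols - 1):
                heapq.heappush(heap, (height_map[r][c], r, c))
                visited[r][c] = True

    total = 0
    while heap:
        # The lowest wall on the current boundary caps the water level
        # of any unvisited neighbour it touches.
        level, r, c = heapq.heappop(heap)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and not visited[nr][nc]:
                visited[nr][nc] = True
                h = height_map[nr][nc]
                total += max(0, level - h)
                # The neighbour joins the boundary at its effective height.
                heapq.heappush(heap, (max(level, h), nr, nc))
    return total
```

The boundary condition that tends to bite is the last push: a flooded cell has to enter the heap at the water level, not at its own height, or later cells get undercounted.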

One quirk I noticed: when I didn't specify a language, it defaulted to Java in 8 out of 10 runs. Python came second. Rust and Go? Almost never. If you're working in something that isn't Java or Python, explicitly specify your language. Found that out the tedious way.

Case Study 2: The Deliberately Broken Deadlock Detector

This one was a trap I set on purpose.

I asked it to implement a deadlock detector—feed it a process-resource allocation graph, determine if there's a cycle indicating deadlock. Pretty standard stuff. Except I buried two contradictory constraints in the prompt:

"Use adjacency matrix for graph storage"
"Space complexity must be O(V+E), not O(V²)"

Anyone who's taken a data structures course knows these conflict on any sparse graph: an adjacency matrix takes O(V²) space no matter how few edges there are. Grade-A contradiction.

GPT-5.1-Codex-Max handled it in a way I didn't expect. Instead of blindly generating code, it first returned an analysis pointing out the conflict, then suggested switching to adjacency lists. Then it provided two implementations—one for each approach—with trade-offs clearly annotated.
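The adjacency-list route can be sketched roughly as follows. This is an illustrative Python version, not the model's output: `has_deadlock` and the graph shape (a dict mapping each node to the nodes it points at) are assumptions for the example. It also assumes every resource has a single instance, which is the case where a cycle in the allocation graph really does mean deadlock.

```python
from typing import Dict, Hashable, List

Graph = Dict[Hashable, List[Hashable]]


def has_deadlock(graph: Graph) -> bool:
    """Return True if the directed graph contains a cycle.

    Uses an iterative three-colour DFS over an adjacency list, so both
    time and extra space are O(V + E).
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS path, finished

    # Collect nodes that only ever appear as edge targets as well.
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    colour = {node: WHITE for node in nodes}

    for start in nodes:
        if colour[start] != WHITE:
            continue
        colour[start] = GREY
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, neighbours = stack[-1]
            advanced = False
            for nxt in neighbours:
                if colour[nxt] == GREY:
                    return True  # back edge to the current path: a cycle
                if colour[nxt] == WHITE:
                    colour[nxt] = GREY
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    advanced = True
                    break
            if not advanced:
                colour[node] = BLACK  # every outgoing edge explored
                stack.pop()
    return False
```

In this shape, an edge from a process to a resource reads as "is waiting for", and an edge from a resource to a process reads as "is held by".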

This "push back on bad requirements" capability? Honestly, it's more valuable than the code generation itself. I've watched entire sprints go up in flames because someone implemented contradictory requirements without questioning them. GPT-4-Codex would just... do what you asked. Garbage in, garbage out, no questions asked.

I suspect this is tied to the RLHF fine-tuning they did in late 2024. OpenAI hasn't published details, but the behavioural shift is pretty stark. In a good way.

Case Study 3: The 2 AM Face-Plant

Right. Story time. A humbling one.

March 14th, I'm building a sliding window statistics module for a quant trading system. The brief: calculate weighted standard deviation over the last N ticks, where each tick is a struct with price, volume, and timestamp. Relatively straightforward logic, but it involves floating-point accumulation errors—the sneaky kind that don't show up in unit tests.

I fed it an extremely detailed spec. Complete struct definitions, edge cases, numerical stability requirements. GPT-5.1-Codex-Max generated about 150 lines of Python.

The first 120 lines? Bloody brilliant. It used Welford's algorithm for online variance updates—proper numerical stability, not the naive approach that falls apart with cumulative floating-point error. Type annotations everywhere. Even the docstrings were better than what I'd write. I was impressed.
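For context, a weighted variant of Welford's update looks roughly like this. It is a minimal sketch, not the generated module: the class name `WeightedStats` is hypothetical, it only covers adding ticks, and both window eviction and the exponential-decay weighting from the spec are left out.

```python
import math
from dataclasses import dataclass


@dataclass
class WeightedStats:
    """Streaming weighted mean and standard deviation (Welford-style)."""

    weight_sum: float = 0.0
    mean: float = 0.0
    m2: float = 0.0  # running sum of w * (x - mean)^2

    def update(self, x: float, w: float) -> None:
        """Fold one observation x with positive weight w into the state."""
        if w <= 0:
            raise ValueError("weight must be positive")
        self.weight_sum += w
        delta = x - self.mean
        self.mean += (w / self.weight_sum) * delta
        # Uses the old and the new mean, which avoids subtracting two
        # large, nearly equal sums as the naive formula does.
        self.m2 += w * delta * (x - self.mean)

    def std(self) -> float:
        """Weighted population standard deviation; 0.0 when empty."""
        if self.weight_sum == 0:
            return 0.0
        return math.sqrt(self.m2 / self.weight_sum)
```

The point of the update is that it never forms a sum of squares minus a squared sum, which is where the naive approach loses precision.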

Then the last 30 lines happened.

When calculating the weighting coefficients, it suddenly contradicted its own normalisation logic from earlier. The first section properly used exponential decay weights. The last section? Switched to equal weights without warning. The result was off by about 0.3%.

I didn't catch it at first.

Stared at the screen for nearly an hour, checking and rechecking my input data. Tick data is inherently noisy—0.3% disappears into the noise floor. Completely invisible unless you know exactly what you're looking for.

At 2:17 AM, I finally tracked it down by comparing the code line-by-line against the spec. The model had experienced what I'm calling "logical drift"—it simply forgot constraints defined earlier in the generation. OpenAI's technical docs admit this: code coherence drops about 15% once you pass roughly 120 lines.

No error messages. Nothing crashed. The code ran fine, produced output, and that output was wrong. These are the absolute worst bugs to hunt.
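One cheap guard against this kind of silent drift is a differential check: run the generated function and a slow, obviously-correct reference on the same ticks and compare. The sketch below is only an illustration of that idea; every name in it (`exp_decay_weights`, `reference_weighted_std`, `agrees_with_reference`) and the `alpha` parameter are assumptions, not part of the original module.

```python
import math
from typing import Callable, List, Sequence


def exp_decay_weights(n: int, alpha: float) -> List[float]:
    """Weights for n ticks, oldest first; the newest tick gets 1.0."""
    return [(1.0 - alpha) ** (n - 1 - i) for i in range(n)]


def reference_weighted_std(values: Sequence[float],
                           weights: Sequence[float]) -> float:
    """Slow two-pass weighted population standard deviation."""
    total = math.fsum(weights)
    mean = math.fsum(w * x for x, w in zip(values, weights)) / total
    var = math.fsum(w * (x - mean) ** 2
                    for x, w in zip(values, weights)) / total
    return math.sqrt(var)


def agrees_with_reference(candidate: Callable[[Sequence[float]], float],
                          values: Sequence[float],
                          weights: Sequence[float],
                          rel_tol: float = 1e-9) -> bool:
    """True if candidate(values) matches the reference within rel_tol."""
    expected = reference_weighted_std(values, weights)
    return math.isclose(candidate(values), expected, rel_tol=rel_tol)
```

A candidate that quietly falls back to equal weights fails this check even on a handful of ticks, long before the difference has a chance to hide in market noise.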

The lesson: Break complex algorithms into modules, keep each one under 100 lines, and stitch them together with interfaces. Don't let it generate the whole thing in one go. Don't ask me how I know this. Actually, you already know—I spent three hours on that bug.

Inference Performance (With Asterisks)

Tested on a 2,000-line Go microservice: 47 API endpoints, three-tier service architecture. The task was analysing inter-endpoint dependencies and generating a call topology graph.

GPT-5.1-Codex-Max averaged 1.8 seconds response time. GPT-4-Codex? 3.2 seconds. Nearly twice as fast. Test environment: M3 Max MacBook Pro, 64GB RAM, hitting Azure OpenAI Service's Japan node with ~40ms network latency.

But there's a catch.

The speed boost mainly comes from inference framework optimisation, not deeper reasoning. What does that mean in practice? Simple dependencies (A calls B directly) are blazing fast. But indirect dependencies nesting three or four layers deep? Sometimes it misses transitive paths entirely. Out of 47 endpoints, it missed 3 indirect dependencies—all at the bottom of four-layer call chains.
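Transitive paths like these are cheap to compute mechanically once the direct calls are known, which makes them easy to cross-check. A minimal sketch, assuming the direct dependencies are already available as a dict from endpoint name to the endpoints it calls (the function names and the data shape are hypothetical):

```python
from collections import deque
from typing import Dict, List, Set


def transitive_dependencies(calls: Dict[str, List[str]], start: str) -> Set[str]:
    """Return everything reachable from start by following call edges."""
    seen: Set[str] = set()
    queue = deque(calls.get(start, ()))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(calls.get(node, ()))
    return seen


def indirect_dependencies(calls: Dict[str, List[str]], start: str) -> Set[str]:
    """Return dependencies reached only through at least one intermediary."""
    return transitive_dependencies(calls, start) - set(calls.get(start, ()))
```

Diffing the output of `indirect_dependencies` against the model's topology graph would flag missed paths at the bottom of a four-layer chain. Note that if `start` sits on a call cycle, it appears in its own result.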

So here's my actual workflow now: GPT-5.1 does the first pass and rapid prototyping. Gets things runnable. Then proper static analysis tools handle critical path dependencies. My pipeline is GPT-5.1 for the draft → SonarQube for static analysis → manual review for core logic. Each step catches what the others miss.

It saves serious time. But full autopilot? That's a disaster waiting to schedule itself.

Should You Switch?

If your stack looks like this, GPT-5.1-Codex-Max is worth a proper look:

You're regularly implementing mid-to-high complexity algorithms—recommendation engines, pathfinding, financial models
Your team maintains labyrinthine legacy codebases and needs to untangle logic dependencies quickly. Honestly, its code reading ability might be stronger than its code writing
You've got hard latency requirements, like auto-fixes in CI/CD pipelines

But if you're mostly doing CRUD development?

Stick with GPT-4-Codex. Seriously. The API cost difference is meaningful, and you won't notice the algorithmic upgrades. Spend the savings on actual good coffee ☕

I'm still poking at the edges of what this model can do. Next on my list: functional programming, especially monad transformers and type-level code generation. And formal verification—writing Coq proof scripts with AI assistance. That should be entertaining.

Have you tried using it for weird algorithms? Or hit any maddening bugs that had you questioning your life choices? Drop a comment. Your horror story might save me some hair follicles 🚀

#gpt5 #algorithms #aicoding #performancetesting #developertools

Cael Lee
