I Threw Our Cursed Auth System at GPT-5.1-Codex-Max and It Found a Bug I'd Been Chasing for Months

I've been burned by AI coding tools before. Remember that horror story last month about Copilot suggesting sudo rm -rf / in a bash script? Yeah, I've been in the trenches since the GPT-3 days, and my bullshit detector is usually pretty well-calibrated. But a friend who works at a FAANG-adjacent company (you know the one) wouldn't shut up about the new Codex-Max model, specifically its "multi-file reasoning" capabilities. So I did what any sane, burned-out senior dev would do on a slow sprint week: I threw our most cursed project at it.

For context, our main product is a SaaS platform that started in 2015 as a Rails monolith, got microservice-envy in 2019, and now exists in this quantum superposition state where it's neither monolith nor microservice. We call it the "distributed monolith" when management isn't in the room. The auth system alone spans 14 files across 4 services, with JWT handling, OAuth2 flows for enterprise SSO, and a custom role-based access control layer that someone (me, it was me) wrote in a caffeine-induced fugue state during a production incident in 2021. There are comments like "# TODO: Fix this before the next audit" that date from three audits ago.

I set up GPT-5.1-Codex-Max in a sandboxed environment—no way I'm giving it access to actual infra. I've seen too many horror stories about AI-generated Terraform plans. My plan was simple: ask it to map out the auth architecture across all files, then suggest a refactor to consolidate the token validation logic, which currently has slightly different implementations in our Node.js gateway, Python user service, and Go permissions service. You know, the kind of task that makes you question your career choices during sprint planning.

The Architecture Mapping Was Terrifyingly Accurate

I gave it read access to the repo and asked: "Map the authentication flow across all files and identify where token validation logic is duplicated."

Within maybe 30 seconds, it spit out a Mermaid diagram (in markdown, because of course it did) showing the exact request lifecycle. It identified 7 files I'd forgotten existed, including a Python utility that was still importing PyJWT==1.7.1—a version we were supposed to remove in 2022. It also found a race condition in the token refresh logic that I'd been chasing for months.

Actually, wait—I should clarify that. It wasn't just "a race condition." It was this specific thing where the refresh token gets invalidated in Redis before the new one is written, and if a request hits that 200ms window, the user gets a 401. Our error logs were full of TokenRevocationError: refreshtokeninvalidated with no clear pattern. I'd spent probably 15 hours on this across three debugging sessions. The model found it in 30 seconds.
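To make the bug concrete, here's a minimal Python sketch of the failure mode and one way to close the window. It is not our actual code: a dict behind a lock stands in for Redis, and `TokenStore`, `rotate_unsafe`, and `rotate_atomic` are hypothetical names made up for illustration.

```python
import threading


class TokenStore:
    """In-memory stand-in for the Redis-backed refresh token store (assumption)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = {}  # user_id -> current refresh token

    def set(self, user_id, token):
        with self._lock:
            self._active[user_id] = token

    def delete(self, user_id):
        with self._lock:
            self._active.pop(user_id, None)

    def is_valid(self, user_id, token):
        with self._lock:
            return self._active.get(user_id) == token

    def swap(self, user_id, old, new):
        """Replace old with new in one step; fail if old is no longer current."""
        with self._lock:
            if self._active.get(user_id) != old:
                return False
            self._active[user_id] = new
            return True


def rotate_unsafe(store, user_id, new_token, on_gap=None):
    """Buggy order: invalidate first, write second."""
    store.delete(user_id)
    if on_gap:
        on_gap()  # hook to observe the store while no token is valid
    store.set(user_id, new_token)


def rotate_atomic(store, user_id, old_token, new_token):
    """Fixed order: one compare-and-swap, so some token is always valid."""
    return store.swap(user_id, old_token, new_token)
```

With `rotate_unsafe`, anything that checks the store inside the `on_gap` window finds no valid token at all, which is the 401 described above. With a real Redis backend the same idea would probably be expressed as a transaction or a server-side script rather than a Python lock, but I haven't shown that here.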

I literally said "you've got to be fucking kidding me" out loud, and my cat looked at me with judgement.

Here's the kicker: it noticed that our Go service was using HS256 for JWT signing while the other services expected RS256. This wasn't causing errors because the gateway was doing something stupid with key fallback that I won't get into, but it meant our "signature verification" was effectively theatre. We'd passed a SOC 2 audit like this.
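For anyone wondering what "not doing key fallback" looks like, here's a rough stdlib-only Python sketch of pinning the accepted algorithm before any key is chosen. `require_algorithm` and `AlgorithmMismatch` are hypothetical names, and this only inspects the header; it does not verify the signature, which still has to happen afterwards.

```python
import base64
import json


class AlgorithmMismatch(Exception):
    """Raised when a token's header names an algorithm we don't accept."""


def _b64url_decode(segment: str) -> bytes:
    # JWT segments drop base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def require_algorithm(token: str, allowed=("RS256",)) -> str:
    """Check the (still unverified) header's alg against an allowlist."""
    header_segment = token.split(".", 1)[0]
    header = json.loads(_b64url_decode(header_segment))
    alg = header.get("alg")
    if alg not in allowed:
        # Reject outright instead of falling back to a different key type.
        raise AlgorithmMismatch(f"token alg {alg!r} not in {allowed!r}")
    return alg
```

The design choice is that a mismatch is a hard failure, never a cue to try another key. JWT libraries generally expose the same idea directly; PyJWT, for example, takes an `algorithms` list in `jwt.decode`.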

I poured myself a whiskey at 2 PM.

The Refactoring Proposal Made Me Feel Obsolete and Grateful Simultaneously

After the mapping, I asked for a refactoring plan to consolidate token validation into a shared library. This is where the "multi-file" part of the pitch actually meant something: previous AI tools I've used would happily generate a single-file solution that ignored all our existing dependencies and import structures.

GPT-5.1-Codex-Max generated 8 modified files and 2 new shared library files. It handled:

Extracting the token validation from the Node.js gateway while preserving its middleware pattern
Rewriting the Python service to use the same logic via a gRPC call (it proposed the gRPC approach itself, correctly noting our existing inter-service communication patterns)
Updating the Go service to import the shared Go library, fixing the HS256/RS256 issue in the process
Adding proper error handling that matched our existing error taxonomy (this is usually where AI tools fail miserably, generating generic Exception throws)
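The error-handling point is easier to see in code. Here's a minimal Python sketch of shared claim validation that raises typed errors instead of a generic exception. The class names and `code` strings are hypothetical stand-ins, not our real error taxonomy, and the function assumes the signature was already verified elsewhere.

```python
import time
from typing import Optional


class AuthError(Exception):
    """Base class so callers can catch every auth failure in one place."""
    code = "auth_error"


class TokenMalformed(AuthError):
    code = "token_malformed"


class TokenExpired(AuthError):
    code = "token_expired"


REQUIRED_CLAIMS = ("sub", "exp")


def validate_claims(claims: dict, now: Optional[float] = None) -> dict:
    """Validate claims from a token whose signature was already verified."""
    now = time.time() if now is None else now
    missing = [name for name in REQUIRED_CLAIMS if name not in claims]
    if missing:
        raise TokenMalformed(f"missing claims: {', '.join(missing)}")
    if claims["exp"] <= now:
        raise TokenExpired("token expired")
    return claims
```

Because every failure carries a stable `code`, each service can map it to its own response format without string-matching on messages.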

The code wasn't perfect. It hallucinated a function signature in one place, assuming our user model had a preferred_language field that doesn't exist. I think it picked that up from some i18n migration code in a completely different part of the repo. And it suggested using @company/logger@2.4.0, which we deprecated in January after that Log4j-adjacent panic. But these were 10-minute fixes, not "rewrite half the PR" situations.

The part that actually made me stare at the ceiling that night: it wrote tests with better coverage than I would have. It generated integration tests that mocked the exact failure scenarios from our PagerDuty history, testing edge cases like token expiry during a role update. I know this because the test descriptions literally referenced "scenario where user role changes between token issuance and validation", an edge case that caused a Sev-2 incident last March that I'd never documented.

Well... that's complicated. I half-documented it in a postmortem draft that's still sitting in my Google Docs. The model somehow surfaced the incident from our PagerDuty alert history. Not sure how I feel about that.
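For readers who haven't hit this edge case, here's roughly the shape of such a test in Python. This is my own reconstruction of the scenario, not the tests the model generated; `authorize`, `StaleRoleError`, and the in-memory role table are hypothetical.

```python
import unittest


class StaleRoleError(Exception):
    """The role embedded in the token no longer matches the user's current role."""


def authorize(claims: dict, current_roles: dict) -> str:
    """Compare the role baked into the token with the role on record right now."""
    current = current_roles.get(claims["sub"])
    if current != claims["role"]:
        raise StaleRoleError(
            f"token says {claims['role']!r}, current role is {current!r}"
        )
    return current


class RoleChangeBetweenIssuanceAndValidation(unittest.TestCase):
    def test_role_changed_after_issuance_is_rejected(self):
        roles = {"user-1": "admin"}
        claims = {"sub": "user-1", "role": "admin"}  # token issued while admin
        roles["user-1"] = "viewer"  # role changes before the token is validated
        with self.assertRaises(StaleRoleError):
            authorize(claims, roles)

    def test_unchanged_role_is_accepted(self):
        roles = {"user-1": "admin"}
        claims = {"sub": "user-1", "role": "admin"}
        self.assertEqual(authorize(claims, roles), "admin")
```

The point of the first test is the ordering: the token is minted, the role changes, and only then does validation run, so a validator that trusts the embedded role alone would pass when it shouldn't.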

The "Oh Shit" Moment

Here's where I got genuinely uncomfortable. I asked it to explain why it chose certain patterns, and it referenced a combination of our existing codebase conventions, the OWASP Authentication Cheat Sheet, and (this is the part that got me) a 2023 thread about auth refactoring from a public channel in our internal engineering Slack. We use the Slack-GPT integration for documentation search, and it had apparently indexed that conversation. The model cited it like "based on the team's previous discussion about preferring decorator patterns for middleware..."

I'm not saying it's sentient. It's not. But the context awareness across files, commit history, and even team communication patterns is a level of integration that feels qualitatively different from previous coding assistants. It's like pairing with a senior dev who joined the company last week and already read every document and Slack thread—unnerving but useful.

The Catch (Because There's Always a Catch)

YMMV significantly. A friend at a startup tried the same experiment with their NestJS app and said the model kept suggesting Express.js patterns because their codebase had inconsistent architectural decisions. From what I've seen, it seems to work best when your codebase already has clear conventions—which is ironic because those are exactly the codebases that need refactoring less.

Also, the "Max" variant is absurdly expensive if you're paying per token. Our experiment cost about $47 in API calls, which is fine for a one-off, but integrating this into CI/CD would require a serious budget conversation. And I'm still not convinced about the security implications of giving an AI read access to our entire codebase, even in a sandbox.

I'm also wary of the skill atrophy risk. I noticed myself getting lazy during the review process, wanting to just trust the output because the first 90% was so good. That's how bugs ship to production.

I caught myself doing it twice. TWICE. Had to physically step away from the keyboard.

Key Takeaways

What worked brilliantly:

Multi-file context awareness is genuinely impressive—it understands your architecture, not just individual files
It spotted, in 30 seconds, a race condition I'd been debugging for months
The refactoring proposal was 85% production-ready and respected our existing patterns
Test coverage was better than what I'd write manually (painful to admit)

What didn't:

Hallucinated a user field from unrelated code
Suggested a deprecated library version
The cost is eye-watering for regular use ($47 for one experiment)
The Slack integration reading team conversations is... a lot

The uncomfortable truth:

This isn't like previous AI coding tools. The context awareness across your entire codebase, commit history, and even team communication patterns represents a step change. I'm impressed, uncomfortable, and slightly worried about my job security in 5 years.

Anyone else tried the multi-file features yet? I'm especially curious if it handles monorepos with different languages as well as it claims. Drop your war stories in the comments—I need to know if my experience was a fluke or if we should actually be paying attention this time.

Edit: RIP my inbox. To answer the common question—yes, we had legal review the sandbox setup before I ran this. No, I don't think it's actually "thinking," calm down. And to the person who DM'd me asking if I'm hiring: we're always hiring, but fair warning, our auth system is still held together with duct tape and prayers. The AI just helped us find better duct tape.

Tags: #ai #refactoring #architecture #codereview #seniordevproblems #gpt5

Cael Lee
