I Fed a 47-File Python Disaster to GPT-5.1-Codex-Max at 3 AM—Here's What Happened
I Fed a 47-File Python Disaster to GPT-5.1-Codex-Max at 3 AM—Here's What Happened
Last Tuesday at 3 AM, I was staring at a codebase that made me question my career choices.
Not because it was complex. Because it was a 47-file Python monolith cosplaying as microservices. You know the type—import utils.py and suddenly your LSP crashes from circular dependencies before Docker even finishes building. My usual bag of tricks (AST grepping, pydeps graphs, truly irresponsible amounts of cold brew) had abandoned me entirely.
Then I remembered the GPT-5.1-Codex-Max preview. OpenAI quietly dropped it in their April 2025 release cycle, and I'd been meaning to properly stress-test it. So I pointed it at this exact codebase.
It didn't melt down.
The git diff stats made our senior architect physically walk over to my desk and ask what the hell I'd been doing all night. I'll walk you through everything—cross-file semantic analysis that actually worked, a full refactoring plan that didn't break anything, and it even generated the CDK infrastructure I'd been putting off for weeks. I'll paste the real terminal output below. And yes, there's a moment where it probably saved me from a 4 AM PagerDuty incident. Those are the worst kind.
If you're currently drowning in CloudFormation or untangling some legacy backend spaghetti, this is for you. I'm a DevOps person by trade, but I end up in backend code more often than I'd like.
TL;DR
- GPT-5.1-Codex-Max actually understands cross-file dependencies before generating code—it builds a dependency tree first
- It successfully broke circular imports in my 32-file FastAPI project on the first attempt (all 742 unit tests passed)
- Migrated a 12-file Flask app to FastAPI with zero manual fixes—GPT-4o and Claude 3.5 Sonnet both needed corrections
- Generated a complete AWS CDK TypeScript project that synthesised correctly on the first
cdk synth - It's not magic—it chokes on binary files, monorepos over 500 files, and merge conflicts with live changes
What You'll Need to Follow Along
Look, I don't want to waste your time. Here's the exact setup I used:
- Access to GPT-5.1-Codex-Max—I got mine through Azure OpenAI Service around 15 April 2025. You might have it directly on platform.openai.com by now
- Python 3.12+ with
openaiSDK v2.3.0 - A multi-file project. I'll reference my e-commerce microservices test repo (link in Further Reading if you want to clone it)
- AWS CDK v2.160.0 and TypeScript 5.6 for the infrastructure bits
pydepsandgraphvizif you want the pretty dependency graphs (optional, but genuinely helpful)
pip install openai==2.3.0 pydeps graphviz
Actually, wait—I should clarify. The openai SDK v2.3.0 specifically has the codex upload subcommand. Earlier versions don't, I think. I wasted 20 minutes fighting with v2.1 before I realised that. Learn from my mistakes.
How GPT-5.1-Codex-Max Actually Handles Multiple Files
GPT-4o had a 128K token limit, and honestly? It got a bit lost with multi-file projects. The new model does something fundamentally different. From what I've observed and what the docs hint at, it uses what they call a hierarchical context window. Basically, it works out which files matter based on your import graph before it generates anything.
Here's what I think is happening under the hood:
- Static analysis pass first: It reads all your
import/requirelines and builds a dependency tree. Not during generation—beforehand - Semantic chunking: This is the clever bit. Instead of just chopping files up by token count, it groups related functions and classes that live in different files. So
auth_service.pyandutils.pyget analysed together if they're tangled up - Refactoring-aware attention: When you ask for project-wide changes, it weights cross-file symbol definitions higher. That's the secret sauce, I reckon
I tested this with a 32-file FastAPI backend that had some genuinely cursed service layer imports. Here's what the dependency hell looked like:
graph TD
A[auth_service.py] --> B[utils.py]
C[order_service.py] --> B
D[payment_service.py] --> B
B --> A
B --> C
E[database.py] --> A
E --> C
E --> D
See that auth_service.py ↔ utils.py loop? Runtime import errors every single deployment. GPT-5.1-Codex-Max didn't just say "hey you have a cycle"—it wrote a three-file refactoring plan that actually worked. More on that next.
Example 1: Breaking Circular Dependencies Without Breaking Everything
Here's exactly what I did at 3 AM, complete with terminal output.
Step 1: Dump the Whole Project
New CLI command they added in March 2025. You can just upload your entire project tree:
openai codex upload ./ecommerce-backend --model gpt-5.1-codex-max --context-mode full-deps
That --context-mode full-deps flag is what triggers the static analysis. Terminal output:
Uploading 32 files (14,230 lines of Python)...
Building dependency graph... Done.
Resolved 47 cross-file symbols.
Context window utilisation: 68% (87,040/128,000 tokens)
Step 2: Actually Ask It to Fix Things
I kept the prompt dead simple on purpose:
Analyse the dependency graph for circular imports. Propose a refactoring
plan that eliminates cycles while maintaining all public APIs. Generate
the complete new file structure with code.
No fancy prompt engineering. Just... ask.
Step 3: What It Gave Me
It came back with a structured plan. Not just "here's the problem" but actual files:
- Spotted the cycle:
authservice.pyandutils.pywere importing from each other (specificallyhashpasswordfrom one andverify_tokenfrom the other) - Proposed a new file:
base_utils.pywith the shared stuff extracted - Generated diffs for 8 files that needed import updates
Here's a chunk of the base_utils.py it wrote:
# base_utils.py (extracted from utils.py and auth_service.py)
from __future__ import annotations
import hashlib
from typing import Optional
from .database import get_db_session # No circular dependency
def hash_password(plaintext: str, salt: Optional[str] = None) -> tuple[str, str]:
"""Extracted from utils.py; used by auth_service and user_service."""
if salt is None:
salt = hashlib.sha256(os.urandom(60)).hexdigest()
hashed = hashlib.pbkdf2_hmac('sha256', plaintext.encode(), salt.encode(), 100000)
return hashed.hex(), salt
def verify_token(token: str, db_session=None) -> dict:
"""Extracted from auth_service.py; breaks cycle with utils."""
# Implementation moved here
...
It also rewrote the imports in utils.py to pull from baseutils instead of authservice. Cycle gone. My 742 unit tests? All green. First attempt.
I actually laughed out loud. It was 3:15 AM and I'd been mentally preparing for a multi-hour slog.
Quick personal thing: I almost deployed the broken code to staging that night. Was too exhausted to refactor manually and thought "eh, the tests pass locally" (they didn't test the circular import scenario—rookie mistake). That would've caused 500 errors on /login right during APAC peak hours. The PagerDuty alert would've hit at 4 AM. I know because it's happened before. Twice. This model literally saved my sleep.
Well. That's a bit dramatic. But you get what I mean.
Example 2: Moving a Whole Flask App to FastAPI
After the circular import thing worked, I got ambitious. 12-file Flask REST API. Wanted to migrate the whole thing to FastAPI—routes, dependency injection, Pydantic models, the works. This isn't regex find-and-replace territory. You have to understand how request context flows across files.
What I Asked
Migrate this Flask project to FastAPI. Requirements:
1. Convert all @app.route decorators to FastAPI router syntax
2. Replace Flask-SQLAlchemy with SQLAlchemy 2.0 async sessions
3. Generate Pydantic v2 models for all request/response schemas
4. Maintain existing error handling patterns
5. Update requirements.txt and Dockerfile
How Different Models Performed
I ran the exact same prompt on three models. Here's how many files compiled without me touching them:
| Model | Files Migrated Correctly (out of 12) | Manual Fixes Needed |
|---|
| GPT-4o | 7 | 23 lines across 5 files |
|---|
| Claude 3.5 Sonnet | 9 | 11 lines across 3 files |
|---|
| GPT-5.1-Codex-Max | 12 | 0 |
|---|
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.