I Fed a 47-File Python Disaster to GPT-5.1-Codex-Max at 3 AM—Here's What Happened

By Cael Lee | 10 min read

Last Tuesday at 3 AM, I was staring at a codebase that made me question my career choices.

Not because it was complex. Because it was a 47-file Python monolith cosplaying as microservices. You know the type—import utils.py and suddenly your LSP crashes from circular dependencies before Docker even finishes building. My usual bag of tricks (AST grepping, pydeps graphs, truly irresponsible amounts of cold brew) had abandoned me entirely.

Then I remembered the GPT-5.1-Codex-Max preview. OpenAI quietly dropped it in their April 2025 release cycle, and I'd been meaning to properly stress-test it. So I pointed it at this exact codebase.

It didn't melt down.

The git diff stats made our senior architect physically walk over to my desk and ask what the hell I'd been doing all night. I'll walk you through everything—cross-file semantic analysis that actually worked, a full refactoring plan that didn't break anything, and it even generated the CDK infrastructure I'd been putting off for weeks. I'll paste the real terminal output below. And yes, there's a moment where it probably saved me from a 4 AM PagerDuty incident. Those are the worst kind.

If you're currently drowning in CloudFormation or untangling some legacy backend spaghetti, this is for you. I'm a DevOps person by trade, but I end up in backend code more often than I'd like.

What You'll Need to Follow Along

Look, I don't want to waste your time. Here's the exact setup I used:


pip install openai==2.3.0 pydeps graphviz

Actually, wait—I should clarify. The openai SDK v2.3.0 specifically has the codex upload subcommand. Earlier versions don't, I think. I wasted 20 minutes fighting with v2.1 before I realised that. Learn from my mistakes.

How GPT-5.1-Codex-Max Actually Handles Multiple Files

GPT-4o had a 128K token limit, and honestly? It got a bit lost with multi-file projects. The new model does something fundamentally different. From what I've observed and what the docs hint at, it uses what they call a hierarchical context window. Basically, it works out which files matter based on your import graph before it generates anything.

Here's what I think is happening under the hood:

  1. Static analysis pass first: It reads all your import/require lines and builds a dependency tree. Not during generation—beforehand
  2. Semantic chunking: This is the clever bit. Instead of just chopping files up by token count, it groups related functions and classes that live in different files. So auth_service.py and utils.py get analysed together if they're tangled up
  3. Refactoring-aware attention: When you ask for project-wide changes, it weights cross-file symbol definitions higher. That's the secret sauce, I reckon
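
I can't see inside the model, so treat this as my mental picture of step 1 rather than anything OpenAI has documented. It's a minimal sketch using Python's standard `ast` module, assuming a flat directory of modules (no nested packages). `build_import_graph` is a name I made up for illustration:

```python
import ast
from pathlib import Path


def build_import_graph(project_root: str) -> dict[str, set[str]]:
    """Map each top-level module in project_root to the local modules it imports."""
    root = Path(project_root)
    sources = {path.stem: path for path in root.glob("*.py")}
    graph: dict[str, set[str]] = {name: set() for name in sources}

    for name, path in sources.items():
        tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                targets = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom):
                if node.module:
                    targets = [node.module]
                else:
                    # "from . import utils" keeps the module names in node.names
                    targets = [alias.name for alias in node.names]
            else:
                continue
            for target in targets:
                local_name = target.split(".")[0]
                # Ignore third-party and stdlib imports; keep project-local edges
                if local_name in sources and local_name != name:
                    graph[name].add(local_name)
    return graph
```

Nothing gets executed here, only parsed, so it's safe to run on code that doesn't even import cleanly. That matters when the whole problem is that it doesn't import cleanly.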

I tested this with a 32-file FastAPI backend that had some genuinely cursed service layer imports. Here's what the dependency hell looked like:


graph TD
 A[auth_service.py] --> B[utils.py]
 C[order_service.py] --> B
 D[payment_service.py] --> B
 B --> A
 B --> C
 E[database.py] --> A
 E --> C
 E --> D

See that auth_service.py ↔ utils.py loop? Runtime import errors every single deployment. GPT-5.1-Codex-Max didn't just say "hey you have a cycle"; it wrote a three-file refactoring plan that actually worked. More on that next.
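
If you want to sanity-check a graph like this yourself, a depth-first search is enough. Here's a rough sketch, with the edges copied by hand from the graph above. `find_cycles` and `IMPORTS` are my own illustrative names, and this is not how the model does it as far as I know:

```python
def find_cycles(graph: dict[str, list[str]]) -> list[list[str]]:
    """Return each import cycle once, as a list of modules in import order."""
    cycles: list[list[str]] = []
    seen: set[frozenset[str]] = set()

    def visit(node: str, path: list[str]) -> None:
        if node in path:
            # We walked back onto the current path: everything from the
            # first occurrence of node onwards forms a cycle.
            cycle = path[path.index(node):]
            key = frozenset(cycle)
            if key not in seen:
                seen.add(key)
                cycles.append(cycle)
            return
        for dep in graph.get(node, []):
            visit(dep, path + [node])

    for start in graph:
        visit(start, [])
    return cycles


# Edges copied from the graph above: module -> modules it imports.
IMPORTS = {
    "auth_service": ["utils"],
    "order_service": ["utils"],
    "payment_service": ["utils"],
    "utils": ["auth_service", "order_service"],
    "database": ["auth_service", "order_service", "payment_service"],
}

for cycle in find_cycles(IMPORTS):
    print(" -> ".join(cycle + [cycle[0]]))
# prints:
# auth_service -> utils -> auth_service
# utils -> order_service -> utils
```

Going by the edges as drawn, order_service.py and utils.py form a second loop as well. The search is exponential in the worst case, which is fine for a few dozen modules but not something I'd point at a huge monorepo.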

Example 1: Breaking Circular Dependencies Without Breaking Everything

Here's exactly what I did at 3 AM, complete with terminal output.

Step 1: Dump the Whole Project

New CLI command they added in March 2025. You can just upload your entire project tree:


openai codex upload ./ecommerce-backend --model gpt-5.1-codex-max --context-mode full-deps

That --context-mode full-deps flag is what triggers the static analysis. Terminal output:


Uploading 32 files (14,230 lines of Python)...
Building dependency graph... Done.
Resolved 47 cross-file symbols.
Context window utilisation: 68% (87,040/128,000 tokens)
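
That utilisation line is worth a second look, because whatever is left over has to fit your prompt and the model's answer. The arithmetic, as a tiny sketch (`context_utilisation` is just my helper name):

```python
def context_utilisation(used_tokens: int, window_tokens: int) -> float:
    """Return the fraction of the context window already consumed."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return used_tokens / window_tokens


used, window = 87_040, 128_000  # figures from the upload output above
print(f"{context_utilisation(used, window):.0%} used, {window - used:,} tokens left")
# prints: 68% used, 40,960 tokens left
```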

Step 2: Actually Ask It to Fix Things

I kept the prompt dead simple on purpose:


Analyse the dependency graph for circular imports. Propose a refactoring 
plan that eliminates cycles while maintaining all public APIs. Generate 
the complete new file structure with code.

No fancy prompt engineering. Just... ask.

Step 3: What It Gave Me

It came back with a structured plan. Not just "here's the problem" but actual files:

  1. Spotted the cycle: auth_service.py and utils.py were importing from each other (specifically hash_password from one and verify_token from the other)
  2. Proposed a new file: base_utils.py with the shared stuff extracted
  3. Generated diffs for 8 files that needed import updates

Here's a chunk of the base_utils.py it wrote:


# base_utils.py (extracted from utils.py and auth_service.py)
from __future__ import annotations
import hashlib
import os
from typing import Optional
from .database import get_db_session  # No circular dependency

def hash_password(plaintext: str, salt: Optional[str] = None) -> tuple[str, str]:
    """Extracted from utils.py; used by auth_service and user_service."""
    if salt is None:
        # Derive a random hex salt from 60 bytes of OS entropy
        salt = hashlib.sha256(os.urandom(60)).hexdigest()
    hashed = hashlib.pbkdf2_hmac('sha256', plaintext.encode(), salt.encode(), 100000)
    return hashed.hex(), salt

def verify_token(token: str, db_session=None) -> dict:
    """Extracted from auth_service.py; breaks cycle with utils."""
    # Implementation moved here
    ...

It also rewrote the imports in utils.py to pull from base_utils instead of auth_service. Cycle gone. My 742 unit tests? All green. First attempt.

I actually laughed out loud. It was 3:15 AM and I'd been mentally preparing for a multi-hour slog.

Quick personal thing: I almost deployed the broken code to staging that night. Was too exhausted to refactor manually and thought "eh, the tests pass locally" (they didn't test the circular import scenario—rookie mistake). That would've caused 500 errors on /login right during APAC peak hours. The PagerDuty alert would've hit at 4 AM. I know because it's happened before. Twice. This model literally saved my sleep.
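
The test I should have had all along is embarrassingly small: import every module on its own, in a fresh interpreter, and see what falls over. A sketch of that idea, assuming a flat directory of modules (for a real package you'd run it from the parent directory and import `package.module` instead). `find_unimportable_modules` is a name I invented:

```python
import subprocess
import sys
from pathlib import Path


def find_unimportable_modules(source_dir: str) -> dict[str, str]:
    """Import every top-level module in a fresh interpreter; return failures.

    Each module gets its own subprocess so that an earlier successful import
    cannot hide a cycle that only bites when a module is loaded first.
    """
    failures: dict[str, str] = {}
    for path in sorted(Path(source_dir).glob("*.py")):
        result = subprocess.run(
            [sys.executable, "-c", f"import {path.stem}"],
            cwd=source_dir,  # with -c, the working directory is on sys.path
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            lines = result.stderr.strip().splitlines()
            # The last traceback line carries the exception type and message
            failures[path.stem] = lines[-1] if lines else "unknown error"
    return failures
```

Wire the result into a unit test that asserts the dictionary is empty and a circular import fails CI instead of failing /login.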

Well. That's a bit dramatic. But you get what I mean.

Example 2: Moving a Whole Flask App to FastAPI

After the circular import thing worked, I got ambitious. 12-file Flask REST API. Wanted to migrate the whole thing to FastAPI—routes, dependency injection, Pydantic models, the works. This isn't regex find-and-replace territory. You have to understand how request context flows across files.

What I Asked


Migrate this Flask project to FastAPI. Requirements:
1. Convert all @app.route decorators to FastAPI router syntax
2. Replace Flask-SQLAlchemy with SQLAlchemy 2.0 async sessions
3. Generate Pydantic v2 models for all request/response schemas
4. Maintain existing error handling patterns
5. Update requirements.txt and Dockerfile

How Different Models Performed

I ran the exact same prompt on three models. Here's how many files compiled without me touching them:

| Model | Files Migrated Correctly (out of 12) | Manual Fixes Needed |
|---|---|---|
| GPT-4o | 7 | 23 lines across 5 files |
| Claude 3.5 Sonnet | 9 | 11 lines across 3 files |

The thing that tripped up the others: auth_middleware.py was importing Flask's global request object. GPT-5.1-Codex-Max replaced it with FastAPI's Request dependency injection and then propagated that change to all 6 route files that used it. Claude missed two of those files. GPT-4o missed four.
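
If you're doing this migration by hand, it helps to know up front which files touch Flask's global request. Here's a rough sketch with the standard `ast` module. `files_importing_flask_request` is my own name, and it only catches the direct `from flask import request` form, so `flask.request` attribute access or a re-export through another module would slip past it:

```python
import ast
from pathlib import Path


def files_importing_flask_request(project_root: str) -> list[str]:
    """List files (as relative paths) containing `from flask import request`."""
    root = Path(project_root)
    hits = []
    for path in sorted(root.rglob("*.py")):
        tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        for node in ast.walk(tree):
            if (
                isinstance(node, ast.ImportFrom)
                and node.module == "flask"
                and any(alias.name == "request" for alias in node.names)
            ):
                hits.append(path.relative_to(root).as_posix())
                break  # one hit per file is enough
    return hits
```

Every path it returns is a file where the global has to become an explicit parameter after the move to FastAPI.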

Here's one of the migrated route files it generated:


# users.py (migrated from Flask to FastAPI)
from fastapi import APIRouter, Depends, HTTPException, status
from sqlalchemy.ext.asyncio import AsyncSession
from .schemas import UserCreate, UserResponse  # Pydantic v2 models
from .dependencies import get_db, get_current_user
from .crud import create_user, get_user_by_email

router = APIRouter(prefix="/users", tags=["users"])

@router.post("/", response_model=UserResponse, status_code=status.HTTP_201_CREATED)
async def create_new_user(
    user_data: UserCreate,
    db: AsyncSession = Depends(get_db)
):
    # Reject duplicate sign-ups by email (assumes crud.py exposes get_user_by_email)
    existing = await get_user_by_email(db, user_data.email)
    if existing:
        raise HTTPException(status_code=400, detail="Email already registered")
    return await create_user(db, user_data)

Clean. Actually idiomatic FastAPI. Not the weird half-Flask patterns I've seen from other models.

Example 3: Generating AWS CDK That Actually Synthesises

I'm AWS certified—Solutions Architect, DevOps Engineer. Yeah, I collected them. So I had to see if it could handle infrastructure. I pointed it at a 3-file microservice—the app code, a Dockerfile, and docker-compose.yml—and asked:


Generate AWS CDK v2.160.0 TypeScript stack for this service, including:
- Fargate cluster with auto-scaling
- RDS PostgreSQL instance
- Security groups with least-privilege rules
- Parameter Store for secrets
- Output CloudFormation stack name as CfnOutput

It didn't just dump a single CDK file. It scaffolded a whole project:


infra/
├── bin/
│ └── infra.ts # Entry point
├── lib/
│ ├── compute-stack.ts # Fargate service
│ ├── database-stack.ts # RDS instance
│ └── security-stack.ts # Security groups
├── package.json
├── cdk.json
└── tsconfig.json

The cross-stack references were actually correct. In compute-stack.ts, it pulled the security group from security-stack.ts properly:


// lib/compute-stack.ts (generated by GPT-5.1-Codex-Max)
import * as cdk from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { Construct } from 'constructs';

interface ComputeStackProps extends cdk.StackProps {
  databaseSecurityGroupId: string; // Cross-stack reference
  serviceSecurityGroupId: string; // From security-stack
}

export class ComputeStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props: ComputeStackProps) {
    super(scope, id, props);

    const cluster = new ecs.Cluster(this, 'ServiceCluster', {
      vpc: ec2.Vpc.fromLookup(this, 'Vpc', { isDefault: true }),
    });

    const taskDefinition = new ecs.FargateTaskDefinition(this, 'TaskDef', {
      memoryLimitMiB: 512,
      cpu: 256,
    });

    // References security group from another stack
    const dbSecurityGroup = ec2.SecurityGroup.fromSecurityGroupId(
      this, 'DbSG', props.databaseSecurityGroupId
    );

    taskDefinition.addContainer('AppContainer', {
      image: ecs.ContainerImage.fromAsset('../app'),
      memoryLimitMiB: 512,
      environment: {
        DB_HOST: cdk.Fn.importValue('DatabaseEndpoint'), // Cross-stack output
      },
    });
  }
}

It even set the CDK dependency version correctly in package.json (v2.160.0 exactly). I ran cdk synth and got a valid CloudFormation template. No manual fixes.

I'll be honest—I was slightly annoyed. I'd been planning to write all that CDK myself as a "learning exercise." The model did it in about 45 seconds.

How I Actually Use This Now

After that 3 AM session, I've worked it into my daily flow. VS Code task + the OpenAI CLI:


// .vscode/tasks.json
{
  "version": "2.0.0",
  "tasks": [
    {
      "label": "Codex: Analyse Project Dependencies",
      "type": "shell",
      "command": "openai codex upload ${workspaceFolder} --model gpt-5.1-codex-max --context-mode full-deps --output analysis.md",
      "group": "build",
      "presentation": {
        "reveal": "always",
        "panel": "dedicated"
      }
    }
  ]
}

I run this before any big refactoring session now. It generates an analysis.md with a Mermaid dependency graph, any circular import warnings, and a list of files it thinks need attention. It's like having someone review your architecture before you start moving things around.
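
The Mermaid part is easy to reproduce offline if you already have an import map and want the same kind of graph in a README or PR description. A minimal sketch; `to_mermaid` is my own helper, not something the CLI ships:

```python
def to_mermaid(graph: dict[str, list[str]]) -> str:
    """Render a module -> imports mapping as a Mermaid `graph TD` block."""
    ids: dict[str, str] = {}

    def node_id(module: str) -> str:
        # Mermaid node ids should be simple tokens, so assign n0, n1, ...
        if module not in ids:
            ids[module] = f"n{len(ids)}"
        return ids[module]

    lines = ["graph TD"]
    for module in sorted(graph):
        for dep in sorted(graph[module]):
            lines.append(
                f"    {node_id(module)}[{module}.py] --> {node_id(dep)}[{dep}.py]"
            )
    return "\n".join(lines)


print(to_mermaid({"auth_service": ["utils"], "utils": ["auth_service"]}))
# prints:
# graph TD
#     n0[auth_service.py] --> n1[utils.py]
#     n1[utils.py] --> n0[auth_service.py]
```

Modules with no local imports and no importers won't show up, since only edges are emitted.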

Not a replacement for actual code review. But a really solid first pass.

Where It Falls Over

It's not magic. Here's what I've bumped into:

  1. Binary files: It can't parse compiled stuff like .so or .dll files. If you've got C extensions, you need to feed it the headers separately. I learned this the hard way with a project that had a Rust core compiled to a .so
  2. Context window limits: 128K tokens sounds huge until you throw a 500+ file monorepo at it. It'll overflow. I've been working around this by analysing subdirectories one at a time. Clunky but works
  3. Merge conflicts with live changes: If your team is actively refactoring while you run analysis, the suggestions can clash with in-flight PRs. I now run it on a fresh branch from main. Probably obvious in retrospect
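
For the context window problem, my subdirectory workaround is basically greedy bin packing. Here's a sketch of how I'd plan the batches, assuming roughly 4 characters per token. That ratio is a crude heuristic of mine, not the model's real tokenizer, and `plan_batches` is an invented name:

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough heuristic, not the model's real tokenizer


def estimate_tokens(directory: Path) -> int:
    """Approximate the token count of all Python files under a directory."""
    chars = sum(
        len(path.read_text(encoding="utf-8", errors="ignore"))
        for path in directory.rglob("*.py")
    )
    return chars // CHARS_PER_TOKEN


def plan_batches(root: str, budget: int) -> list[list[str]]:
    """Greedily pack top-level subdirectories into batches under the budget."""
    sizes = {
        entry.name: estimate_tokens(entry)
        for entry in sorted(Path(root).iterdir())
        if entry.is_dir()
    }
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for name, size in sizes.items():
        # Start a new batch when adding this directory would blow the budget
        if current and used + size > budget:
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        batches.append(current)
    return batches
```

A single directory that is bigger than the budget still ends up in a batch of its own and will overflow, so you'd have to split that one by hand. Leave yourself headroom too: the budget should be well under the full window so the prompt and the response fit.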

There are probably more edge cases I haven't hit yet. Monorepos with mixed Python and TypeScript get weird. I'm still experimenting.

Your Turn

So I've shown you what happened when I threw my 47-file mess at GPT-5.1-Codex-Max, plus the FastAPI migration and CDK generation. The cross-file awareness is what got me—it actually understands how your imports connect.

But every codebase has its own weirdness.

Have any of you tried this on your own projects yet? Did it find something your team missed? Or did it confidently suggest a refactor that blew up your build? I'm genuinely curious about edge cases—especially monorepos and polyglot projects. Drop a comment. I read them all.

If this was useful, I've got a newsletter where I post this kind of deep-dive testing stuff. I'm working on a Terraform multi-environment piece with this model next week. It'll probably be messy. Those always are.

Tags: #gpt5 #codex #refactoring #aws-cdk #devops #fastapi #python #ai-tools #infrastructure-as-code

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
