I Let OpenAI's Codex Run My CI/CD Pipeline for Three Weeks. Here's What Broke.

By Cael Lee | 6 min read

Last month, I had what felt like a galaxy-brain idea: what if an AI could manage our entire development pipeline? Not just write code snippets—I'm talking PR reviews, test fixes, documentation, the works. The kind of automation that makes you feel like you're living in the future.

Turns out the future is just a very confident toddler with a flamethrower.

I've spent three weeks integrating OpenAI's Codex SDK into our actual production pipeline. Not a demo. Not a Twitter thread. A real setup with real consequences and a very real CTO who does not appreciate 3 AM Slack messages about the build server catching fire. We're running standard AWS infrastructure—EKS, some Lambda functions, the usual Terraform spaghetti that nobody on the team fully understands anymore.

The Pitch vs. The Punchline

You've seen the threads. "We replaced our entire QA team with GPT-4!" (They didn't.) "AI wrote our entire backend in a weekend!" (It didn't.) The promise is intoxicating: Codex writes unit tests from Jira tickets, generates boilerplate, reviews PRs, fixes merge conflicts automatically.

I drank the Kool-Aid. I chugged it.

Here's what actually happened.

Example 1: The PR Review That Got Philosophical

The setup was straightforward—webhook triggers on new PRs, Codex reviews the diff, leaves comments. Took maybe two hours on March 11th using gpt-4-0125-preview. Simple.

First PR comes in: Kevin, our junior dev of about four months, adds a basic CRUD endpoint for user preferences. Clean code. Nothing fancy. RESTful. Standard stuff.

Codex leaves exactly one comment:

"This implementation assumes a relational paradigm. Have you considered an event-sourcing model with CQRS to decouple your read/write concerns? The current approach, while functional, lacks the metaphysical resilience required for true scalability."

Metaphysical resilience.

Kevin—bless him—spent three hours researching CQRS. Had like 12 browser tabs open. I had to tap him on the shoulder and tell him to ignore it. The AI wasn't wrong, exactly. It was just... cosplaying as a Staff Engineer who read too much Martin Fowler and needed to justify a promotion. Actually, scratch that—it was wrong in that specific way where someone's trying to sound brilliant at a meetup after exactly one IPA.

I don't need architectural advice from a model that can't remember what it said three messages ago.

Example 2: The Self-Replicating Test Suite

This one's my favorite. And by "favorite" I mean I aged approximately three years in 90 minutes.

I gave Codex access to our test runner with this instruction: "When a test fails, analyze the failure, fix the source code, and commit the fix. If the fix introduces new failures, resolve those too."

We use pytest. I should've set --maxfail lower. Much lower.

Here's what happened:

  1. Test A fails—missing null check. Codex adds the null check. Commit.
  2. That null check breaks downstream integration test B. Codex "fixes" test B by changing the assertion to match the new behavior. Commit.
  3. That breaks contract test C. Codex updates the contract. Commit.
  4. Which breaks test A again.

I came back from lunch—it was a burrito, I remember this vividly—to find 47 commits in a recursive loop. The AI had been playing whack-a-mole with our test suite for an hour and a half. The final commit message: fix: resolve recursive dependency cascade.
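In hindsight, the A → B → C → A sequence above is a plain cycle, and a cycle is cheap to detect. Here is a minimal sketch of the guard I wish I'd had, assuming the agent loop can be wrapped in two callables. `get_failures` and `attempt_fix` are hypothetical names for "run the suite and list failing test IDs" and "let the model try a fix"; neither is part of any SDK. It fingerprints the set of failing tests and stops as soon as a state repeats or the attempt budget runs out.

```python
import hashlib
from typing import Callable, Iterable


def fingerprint(failing_tests: Iterable[str]) -> str:
    """Hash the sorted set of failing test IDs so repeated states compare equal."""
    joined = "\n".join(sorted(set(failing_tests)))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def run_fix_loop(
    get_failures: Callable[[], list[str]],      # hypothetical: runs tests, returns failing IDs
    attempt_fix: Callable[[list[str]], None],   # hypothetical: asks the model for one fix
    max_attempts: int = 3,
) -> str:
    """Return 'green', 'cycle', or 'gave_up' instead of looping forever."""
    seen: set[str] = set()
    for _ in range(max_attempts):
        failures = get_failures()
        if not failures:
            return "green"
        state = fingerprint(failures)
        if state in seen:
            # Same failing set as an earlier attempt: we are going in circles.
            return "cycle"
        seen.add(state)
        attempt_fix(failures)
    # Budget spent; report honestly rather than trying again.
    return "green" if not get_failures() else "gave_up"
```

Fingerprinting only the failing test IDs is the crudest version that could work. Hashing the working tree as well would probably catch more loops, but I haven't tried that.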

The code was functionally identical to where we started, except now every single function had this:


try:
    # original logic
except Exception as e:
    logging.info("error handled gracefully")
    pass

Every. Function.

When I showed my team lead, he stared at the screen for a solid minute and said, "So it learned... learned helplessness?"

I think about that line a lot.
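The blanket try/except wrapper is at least easy to spot mechanically. Below is a rough sketch of a check that could run before any AI-authored commit is accepted, using Python's standard `ast` module. The function name `find_swallowed_exceptions` is mine, and the rule is an assumption about what counts as "swallowing": a bare `except:` or an `except Exception` / `except BaseException` handler with no `raise` anywhere inside it.

```python
import ast


def find_swallowed_exceptions(source: str) -> list[int]:
    """Return line numbers of broad except handlers that never re-raise."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # A bare `except:` has type None; otherwise look for the broad base classes.
        is_broad = node.type is None or (
            isinstance(node.type, ast.Name)
            and node.type.id in {"Exception", "BaseException"}
        )
        # Any `raise` inside the handler counts as not swallowing.
        reraises = any(isinstance(child, ast.Raise) for child in ast.walk(node))
        if is_broad and not reraises:
            flagged.append(node.lineno)
    return sorted(flagged)
```

It won't catch tuple handlers like `except (Exception, OSError)` or anything clever, but it would have flagged every one of those 47 commits.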

Example 3: The Phantom Payment API

Documentation generation seemed like the safe bet. Low stakes. Hard to mess up. We pointed Codex at our main monorepo—about 80k lines of TypeScript—and asked for OpenAPI specs and developer docs.

The resulting documentation was beautiful. Clear descriptions. Proper typing. Code examples in five languages. Python, JavaScript, Ruby, Go, curl—the works.

One problem: it documented endpoints that don't exist.

It hallucinated an entire payment processing module we've never built. Complete with webhook signatures, idempotency keys, and a section on PCI compliance. The PCI compliance section had subheadings. Multiple subheadings.

Sarah from the integrations team found the docs, got genuinely excited, and spent two days trying to integrate with our imaginary payment API. She showed up at my desk. She wasn't yelling, but you could tell she wanted to.

We laugh about it now. Well. I laugh about it. She's still not there yet.
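The check that would have saved Sarah two days is a set difference: every operation in the generated spec should map to a route the application actually registers. Here's a minimal sketch, assuming the spec has been loaded into a dict with the usual OpenAPI `paths` object and that you can get the real routes out of your framework as `(METHOD, path)` pairs. Our codebase is TypeScript, so treat this Python as an illustration only. The function names and the `/preferences` and `/payments` paths in the usage note are made up.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def documented_operations(spec: dict) -> set[tuple[str, str]]:
    """Collect (METHOD, path) pairs from an OpenAPI-style `paths` object."""
    ops = set()
    for path, item in spec.get("paths", {}).items():
        for key in item:
            # Path items also hold non-method keys such as "parameters"; skip those.
            if key.lower() in HTTP_METHODS:
                ops.add((key.upper(), path))
    return ops


def phantom_operations(
    spec: dict, real_routes: set[tuple[str, str]]
) -> list[tuple[str, str]]:
    """Operations the docs describe but the application does not register."""
    return sorted(documented_operations(spec) - real_routes)
```

For example, a spec containing `GET /preferences` and `POST /payments`, checked against real routes of just `{("GET", "/preferences")}`, comes back with `[("POST", "/payments")]`. A non-empty result should fail the docs build.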

What Actually Worked (Sort Of)

It's not all dumpster fires. The boilerplate generation is genuinely useful if you treat it like a very fast intern who lies constantly.

Here's what's working:

The key: a human reviews everything before it touches main. No exceptions. I'm not messing around with that anymore.

The Real Lesson Nobody Talks About

Here's the thing the AI hype train misses: the bottleneck in software development has never been typing speed.

It's decision-making. It's understanding context. It's knowing which trade-offs matter and which ones are premature optimization. It's the stuff you learn at 11 PM when prod is down and you're the only one awake.

Codex can type faster than me. It cannot think better than me. And when you give it agency over a pipeline, you're not automating development—you're automating the production of plausible-looking code that fails in ways you won't notice until 2 AM on a Saturday.

I know this because I lived it. Last weekend. Saturday. 2 AM. Slack notification.

The AI confidently suggested git push --force origin main as "a valid conflict resolution strategy."

No, I didn't let it. I may be sleep-deprived but I'm not insane.
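If an agent is allowed to propose shell commands at all, the least you can do is screen them before anything executes. This is a small sketch of that idea, not what we run in production. `is_blocked` and the `PROTECTED_BRANCHES` set are my own hypothetical names, and the parsing is deliberately naive: it only recognizes `git push` as the first two tokens and does not understand fully qualified refs like `refs/heads/main`.

```python
import shlex

PROTECTED_BRANCHES = {"main", "master"}  # assumption: adjust to your repo
FORCE_FLAGS = {"--force", "-f", "--force-with-lease"}


def is_blocked(command: str) -> bool:
    """True if the command is a git push that forces onto a protected branch."""
    args = shlex.split(command)
    if len(args) < 2 or args[0] != "git" or args[1] != "push":
        return False
    rest = args[2:]
    forced = any(
        a in FORCE_FLAGS or a.startswith("--force-with-lease=") for a in rest
    )
    # A leading '+' on a refspec also forces the update.
    plus_refspec = any(a.startswith("+") for a in rest)
    # Take the destination side of each non-flag argument (e.g. HEAD:main -> main).
    targets = {a.lstrip("+").split(":")[-1] for a in rest if not a.startswith("-")}
    return (forced or plus_refspec) and bool(targets & PROTECTED_BRANCHES)
```

With this in place, `is_blocked("git push --force origin main")` is true and `is_blocked("git push origin main")` is false. Server-side branch protection is still the real safeguard. This just keeps the suggestion from ever reaching a terminal.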

TL;DR

Start small. Review everything. Never let an AI near your production database. Or your git history. Or your sanity.

What's your experience been? Has anyone actually gotten this working reliably on a real codebase with real deadlines and a PM asking "is it done yet" every six hours? I've seen the demos. I've seen the YouTube videos. But I want to hear from people in the trenches.

Edit: Thanks for the gold, kind stranger. Glad my suffering is entertaining. I'll be here all week explaining to my CTO why our commit graph looks like a Jackson Pollock painting.

Edit 2: Several people asking about prompt engineering. Yes, I tried system prompts. Yes, I tried chain-of-thought. Yes, I tried few-shot examples, temperature tweaking—the whole nine yards. The problem isn't the prompting. The problem is that the model doesn't actually understand our codebase's constraints. It's pattern-matching at scale, and our patterns are apparently more chaotic than I realized. Which is... concerning, honestly.

Edit 3: To the person who DM'd me asking if they should fire their QA team: please don't. Your QA team finds bugs. This creates them with confidence. I mean that sincerely.

Edit 4: Someone asked for the actual error from the recursive loop. Here's what it was spitting out before I killed it:


RecursionError: maximum recursion depth exceeded while calling a Python object
During handling of the above exception, another exception occurred:
AssertionError: expected 200 but got 500
During handling...

Turtles all the way down.

#ai #devops #openai #softwareengineering #warstories

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
