When Your AI Bot Ghosts You at 3 AM: A Production War Story

Last Thursday at 2:47 AM, our chatbot went silent. Not the graceful kind of silent—the kind where customers start tweeting things you don't want your investors to see.

The next morning, our customer support lead had a face that could curdle milk. I'd been debugging until 3 AM, staring at logs that contained exactly one useful line: openai.RateLimitError: Error code: 429. That's it. One rate limit error, and the entire service collapsed like a house of cards. No retries. No graceful degradation. Not even an alert. I sat there thinking, "What have I actually been doing for the past year?"

This brought back memories. When I first started integrating ChatGPT in early 2023, I genuinely thought that getting the API to respond meant we were production-ready. Adorably naive, I know. The potholes I've hit since then outnumber the ones on the road outside my flat—and that road's been under construction for three years.

Let's talk about OpenAI API error handling and retry strategies. This is all battle-tested stuff, none of that "you should gracefully handle exceptions" fluff from the official docs.

Know Your Enemy

I categorise OpenAI API errors into four types. Much more straightforward than the official documentation, if I do say so myself.

HTTP 429 - Rate Limit

The most common. Also the most infuriating. Your code isn't broken—you're just knocking on the door too aggressively. OpenAI enforces RPM (requests per minute) and TPM (tokens per minute) limits that vary by model. GPT-4's RPM sits around 200-500, depending on your account tier.

We had a content generation pipeline where peak QPS hit 30. Do the maths: 30 requests a second is 1,800 a minute, several times what GPT-4's RPM allowed. Initially, I did the dumb thing and threw money at the problem by upgrading our tier. Then we got smart: optimising prompts to reduce token consumption slashed our request volume by 40%.

Actually, let me correct myself—GPT-4 Turbo now defaults to 500 RPM, GPT-4 to 200, but if you climb to Tier 5, GPT-4 can reach 10,000 RPM. The catch? Upgrading tiers costs both money and time, and not every team has that luxury.
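To put numbers on that gap, here is a small sketch that converts peak QPS into RPM and compares it with the limits quoted above. The function names are made up for illustration, and the limits are just the figures from this post; your own will differ, so check your account's limits page.

```python
def required_rpm(peak_qps: float) -> float:
    """Convert peak queries per second into requests per minute."""
    return peak_qps * 60


def overload_factor(peak_qps: float, rpm_limit: float) -> float:
    """How many times over the limit the peak is (1.0 or less means it fits)."""
    return required_rpm(peak_qps) / rpm_limit


# Limits as quoted in this post, not authoritative values.
for label, limit in [("GPT-4", 200), ("GPT-4 Turbo", 500), ("GPT-4, Tier 5", 10_000)]:
    print(f"{label}: {overload_factor(30, limit):.2f}x of the limit")
# GPT-4: 9.00x of the limit
# GPT-4 Turbo: 3.60x of the limit
# GPT-4, Tier 5: 0.18x of the limit
```

Anything above 1.0x means the peak cannot fit without throttling, queueing, or a higher tier.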

HTTP 5xx - Server-Side Meltdown

OpenAI's infrastructure decided to take a nap. Remember the massive outage in November 2023? status.openai.com lit up like a Christmas tree, and our service was dead in the water for four hours because we had zero disaster recovery. June 2024 had another hiccup—only 20 minutes this time, but it coincided perfectly with our promotional campaign. Because of course it did.

We eventually wised up and added Azure OpenAI Service as a backup. Lesson learned the hard way.

Network Timeouts / Connection Errors

This one's... complicated. If you're connecting to OpenAI's API from regions with spotty international routing, timeouts can be brutal. Set your timeout too generously, and a single request can hang for 60 seconds, clogging your entire thread pool. We eventually settled on 30-second timeouts, though honestly, that number was a bit of a finger-in-the-air decision.
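A hard deadline can be sketched with nothing but asyncio. This is a minimal illustration rather than our production code: call_with_deadline and make_call are hypothetical names, and make_call stands in for whatever coroutine actually talks to the API. The official Python SDK also accepts a timeout argument on the client, which is usually the simpler route.

```python
import asyncio


async def call_with_deadline(make_call, timeout_s: float = 30.0):
    """Await make_call() but give up after timeout_s seconds.

    make_call is a hypothetical zero-argument function returning a coroutine,
    for example a lambda wrapping the real API call.
    """
    try:
        return await asyncio.wait_for(make_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the inner call, so the slot is freed right away
        # instead of hanging on a dead connection.
        raise TimeoutError(f"upstream call exceeded {timeout_s}s") from None
```

The 30-second default mirrors the number we settled on; pick yours from your own latency percentiles rather than from this post.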

Token Overflow / Content Moderation

These hide within 400 errors. Your context window explodes, or your prompt triggers the content filter. Retrying is pointless here—you need to fix the business logic. Once, our operations colleague accidentally fed a user's profanity-laced input directly into a prompt. The API rejected it five times in a row before we realised the content filter was the culprit.
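The four categories boil down to one question: is this failure worth retrying? A minimal sketch of that decision, assuming you can get at the HTTP status code (or None when no response arrived at all). The function name is hypothetical.

```python
from typing import Optional


def should_retry(status_code: Optional[int]) -> bool:
    """Decide whether a failed call is worth retrying.

    status_code is None for network-level failures (timeout, connection
    reset) where no HTTP response came back.
    """
    if status_code is None:
        return True   # transient network trouble
    if status_code == 429 or status_code >= 500:
        return True   # rate limit or server-side failure
    return False      # other 4xx: context overflow, content filter, bad request
```

Had we had this check in place, the profanity incident would have failed once and loudly instead of five times quietly.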

Retry Logic: It's Not Just a For Loop

I once saw a colleague wrap an API call in while True with a 5-second sleep. Infinite retries. When the 429 hit, it kept hammering the API until OpenAI banned our key. He called me at 3 AM to tell me the key was gone. I nearly had a coronary.

Proper retry logic needs layers:

Layer 1: Exponential Backoff + Jitter

When you get a 429, the response usually carries a Retry-After header telling you how long to wait. But don't wait precisely that long: add some randomness to avoid the thundering herd problem, where all your requests wake up simultaneously and immediately hammer the API again.


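import asyncio
import random

import openai
from openai import AsyncOpenAI

# AsyncOpenAI reads OPENAI_API_KEY from the environment. max_retries=0 switches
# off the SDK's built-in retries so this loop is the only retry layer.
client = AsyncOpenAI(max_retries=0)


async def call_with_retry(prompt, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
            )
        except openai.RateLimitError as e:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the 429 to the caller
            retry_after = e.response.headers.get("Retry-After")
            try:
                # Prefer the server's hint when it is a plain number of seconds.
                wait_time = float(retry_after)
            except (TypeError, ValueError):
                # Header missing or not numeric: exponential backoff (1s, 2s, 4s...).
                wait_time = 2 ** attempt
            # Jitter keeps concurrent callers from waking in lockstep.
            await asyncio.sleep(wait_time + random.uniform(0, 1))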

Prioritising Retry-After is genuinely useful—in our testing, it reduced wasted wait time by about 30%.

Layer 2: Circuit Breaker

This is crucial. When your error rate crosses a threshold, you stop sending requests entirely. Otherwise, your retry logic actively makes the problem worse. We use pybreaker with this configuration: if the failure rate exceeds 50% within one minute, the circuit opens for 30 seconds. During that window, we serve cached responses or a fallback message: "AI assistant is taking a brief break—please try again shortly."
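One caveat before any code: as far as I know, pybreaker's stock CircuitBreaker trips on a count of consecutive failures (fail_max) and recovers after reset_timeout, so a "failure rate over one minute" rule needs logic of your own on top. The sketch below is a hand-rolled, standard-library stand-in for that rule, not pybreaker's API and not our production code. The class name, the min_calls guard, and the injectable clock are all assumptions added for illustration, and it skips the half-open probing a real breaker would do.

```python
import time
from collections import deque


class RateCircuitBreaker:
    """Open when the failure rate over a sliding window crosses a threshold."""

    def __init__(self, threshold=0.5, window_s=60.0, open_s=30.0,
                 min_calls=10, clock=time.monotonic):
        self.threshold = threshold
        self.window_s = window_s
        self.open_s = open_s
        self.min_calls = min_calls  # assumption: don't trip on a tiny sample
        self.clock = clock
        self.events = deque()       # (timestamp, succeeded) pairs
        self.opened_at = None

    def allow(self) -> bool:
        """Return True if a request may be sent right now."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.open_s:
            # Cool-down over: close the circuit and start a clean window.
            self.opened_at = None
            self.events.clear()
            return True
        return False

    def record(self, succeeded: bool) -> None:
        """Record one outcome and open the circuit if the rate is too high."""
        now = self.clock()
        self.events.append((now, succeeded))
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()
        failures = sum(1 for _, ok in self.events if not ok)
        if (len(self.events) >= self.min_calls
                and failures / len(self.events) > self.threshold):
            self.opened_at = now


FALLBACK_MESSAGE = "AI assistant is taking a brief break. Please try again shortly."


def guarded_call(breaker, call):
    """Run call() unless the circuit is open; otherwise serve the fallback."""
    if not breaker.allow():
        return FALLBACK_MESSAGE
    try:
        result = call()
    except Exception:
        breaker.record(False)
        return FALLBACK_MESSAGE
    breaker.record(True)
    return result
```

The defaults match the numbers above: more than 50% failures within 60 seconds opens the circuit for 30 seconds. In real code you would return a cached response where one exists and fall back to the message only as a last resort.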

This February, OpenAI had another outage at 3 AM (why is it always 3 AM?). The circuit breaker kicked in automatically. I checked the monitoring later—47 minutes of circuit-breaking saved us over 20,000 doomed requests. User experience took a hit, sure, but the system didn't collapse.

Layer 3: Queue-Based Smoothing

Don't let requests hit the API directly. We added RabbitMQ as a buffer—requests queue up, and workers consume them at a steady rate. Even if upstream traffic spikes 10x, downstream calls stay smooth.
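The smoothing idea fits in a few lines if you swap RabbitMQ for an in-process asyncio.Queue. That swap is purely for illustration: a real broker adds durability and lets workers live in other processes. The names worker, submit, and demo are hypothetical, and fake_handle stands in for the actual API call.

```python
import asyncio
import time


async def worker(queue: asyncio.Queue, handle, interval_s: float):
    """Consume jobs one at a time, starting at most one per interval_s."""
    while True:
        prompt, future = await queue.get()
        started = time.monotonic()
        try:
            future.set_result(await handle(prompt))
        except Exception as exc:
            future.set_exception(exc)
        finally:
            queue.task_done()
        # Sleep off whatever is left of this job's time slot.
        await asyncio.sleep(max(0.0, interval_s - (time.monotonic() - started)))


async def submit(queue: asyncio.Queue, prompt: str):
    """Enqueue a prompt and wait for the worker to produce its result."""
    future = asyncio.get_running_loop().create_future()
    await queue.put((prompt, future))
    return await future


async def demo(n_requests=5, interval_s=0.05):
    async def fake_handle(prompt):   # hypothetical stand-in for the API call
        return prompt.upper()

    queue = asyncio.Queue(maxsize=1000)   # bounded, so a flood applies backpressure
    task = asyncio.create_task(worker(queue, fake_handle, interval_s))
    started = time.monotonic()
    results = await asyncio.gather(*(submit(queue, f"q{i}") for i in range(n_requests)))
    elapsed = time.monotonic() - started
    task.cancel()
    return results, elapsed
```

Calling asyncio.run(demo()) fires five requests at once, yet the worker releases them one per interval, so the burst never reaches the downstream call. To stay under an RPM quota, set interval_s to 60 divided by that quota and divide the budget across workers.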

This change came after our CEO did a live-stream on an e-commerce platform. The sudden flood of users pushed our concurrency to 10x normal levels, and our API quota evaporated instantly. With the queue in place, users waited a few seconds for responses, but the service stayed up. In a live-stream selling scenario, waiting 5 seconds versus getting an error can mean a 20x difference in conversion rates—at least according to our e-commerce team. That number feels slightly inflated to me, but the direction is spot-on.

Hard-Won Lessons

Don't Hardcode Your API Keys

This should go without saying, but I've genuinely seen someone commit a key to GitHub and wake up to a $3,000 OpenAI bill. A scraper found the key and used it for crypto mining. Now all our keys come from environment variables and HashiCorp Vault with dynamic injection, plus we set usage limits—$1,500 monthly cap, then automatic cutoff.
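The environment-variable half of that setup is easy to sketch. This minimal version assumes the key lives in OPENAI_API_KEY (the variable the official SDK reads by default) and leaves the Vault integration out entirely. Both function names are hypothetical.

```python
import os


def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Read the key from the environment and fail fast if it is missing."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; refusing to start")
    return key


def redact(key: str) -> str:
    """Shorten a key for log lines so the full secret never reaches the logs."""
    return key[:3] + "..." + key[-4:] if len(key) > 8 else "***"
```

Failing at startup beats discovering a missing key on the first customer request, and redacting means a stray debug log can't leak what you worked so hard to keep out of the repo.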

Monitoring Matters More Than Code

We use Prometheus + Grafana to track these metrics:

API call success rate (broken down by error code)
P50/P99 latency
Requests per minute (alert when approaching quota)
Retry ratio
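To show what those metrics mean in practice, here is an in-memory tally using only the standard library. It is a stand-in for illustration: in a real setup these would be Prometheus counters and histograms with Grafana doing the maths, and the class name and outcome labels are assumptions.

```python
from collections import Counter
from statistics import quantiles


class CallStats:
    """In-memory tally of outcomes, latencies and retries."""

    def __init__(self):
        self.outcomes = Counter()   # "ok", "429", "500", "timeout", ...
        self.latencies = []         # seconds, one entry per finished call
        self.retries = 0

    def record(self, outcome: str, latency_s: float, retries: int = 0) -> None:
        self.outcomes[outcome] += 1
        self.latencies.append(latency_s)
        self.retries += retries

    def summary(self) -> dict:
        """Summarise recorded calls; needs at least two of them."""
        total = sum(self.outcomes.values())
        # quantiles(n=100) returns the 99 cut points P1..P99.
        cuts = quantiles(self.latencies, n=100, method="inclusive")
        return {
            "success_rate": self.outcomes["ok"] / total,
            "errors_by_code": {k: v for k, v in self.outcomes.items() if k != "ok"},
            "p50_s": cuts[49],
            "p99_s": cuts[98],
            "retry_ratio": self.retries / total,
        }
```

Watching P50 and P99 side by side is the point: a handful of very slow calls barely moves the median but shows up in the tail straight away.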

Once, our P99 latency suddenly jumped from 2 seconds to 15 seconds. After some digging, we found a colleague had stuffed an entire product manual into the prompt as context. Token count went nuclear. Without monitoring, we'd never have caught this slow-motion disaster. He bought the team bubble tea as penance, so we called it even.

Optimise Costs Early

When we first started with GPT-4, I didn't pay attention to costs. The end-of-month bill was $800, and my manager looked ready to deduct it from my salary. We made some changes:

Use GPT-3.5 wherever GPT-4 isn't strictly necessary
Add Redis caching—similar queries return cached results (35% hit rate)
Trim prompts ruthlessly—stop stuffing unnecessary system messages

Costs dropped 60% with negligible quality difference.
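The caching idea can be sketched as follows, with a plain dict standing in for Redis. Treating "similar" as "identical after lowercasing and collapsing whitespace" is my simplification for the example, and the function names and key prefix are hypothetical. Anything fuzzier, such as embedding similarity, is a separate project.

```python
import hashlib


def cache_key(model: str, prompt: str) -> str:
    """Build a stable key; case and whitespace are normalised first."""
    normalised = " ".join(prompt.lower().split())
    digest = hashlib.sha256(f"{model}\n{normalised}".encode("utf-8")).hexdigest()
    return f"llm:{digest}"


def cached_completion(cache: dict, model: str, prompt: str, generate):
    """Return a cached answer if present, otherwise call generate() and store it."""
    key = cache_key(model, prompt)
    if key in cache:
        return cache[key]
    answer = generate(prompt)   # the expensive API call happens only on a miss
    cache[key] = answer
    return answer
```

Including the model name in the key stops a cheap model's answer being served for a request that was meant for the expensive one. With a real Redis you would also set an expiry so stale answers age out.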

Always Have a Plan B

Our current setup: OpenAI as primary, Azure OpenAI as fallback, and a locally deployed Llama 3 70B as the last resort. Llama's output quality is noticeably worse, but at least users don't see a 500 error. This three-tier degradation lifted our core service availability from 99.2% to 99.8%.

Those 0.6 percentage points might sound trivial, but for a service handling 500,000 daily requests they work out to 3,000 fewer failures a day, or roughly 90,000 a month.
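The degradation chain itself is simple enough to sketch. This is a minimal illustration, not our actual failover code: the function name is hypothetical, and each provider is just a callable you would write to wrap OpenAI, Azure OpenAI, or the local model.

```python
def complete_with_fallback(prompt, providers):
    """Try each (name, call) pair in order and return the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:   # in real code, catch the provider's error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the answer lets you tag the response in your metrics, so you can see how often users are actually being served by the last resort.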

Funny story—Azure OpenAI's API isn't identical to native OpenAI. During our first failover attempt, we got a parameter name wrong, and the degradation failed entirely. Our ops colleague called me at midnight to yell at me about my documentation. He wasn't wrong.

Key Takeaways

Rate limits will bite you. Optimise prompts before upgrading tiers.
Retry logic needs layers. Backoff, circuit breakers, and queues—not just a loop.
Monitoring is non-negotiable. You can't fix what you can't see.
Costs creep up silently. Cache aggressively and use cheaper models when possible.
Redundancy saves careers. Have at least one fallback provider.

Honestly, OpenAI's API stability has improved massively since 2023. But in production, it's rarely the API itself that causes headaches—it's the edge cases. Network hiccups, traffic surges, cost spirals, security gaps. Your retry mechanism is just the last line of defence. The architecture, monitoring, and cost controls you build around it are what actually keep you asleep at 3 AM.

What's your experience with OpenAI in production? Any horror stories involving 429 errors at ungodly hours? Drop a comment—I'd genuinely love to know I'm not the only one who's been woken up by rate limits.

#OpenAI #API #SRE #ProductionEngineering #ErrorHandling #DevOps


Cael Lee
