I Got Woken Up at 2:47 AM by PagerDuty — Here's What OpenAI's API Docs Don't Tell You About Error Ha

Last Wednesday at 2:47 AM, my phone started screaming. Not a gentle vibration. Full-on, someone-is-dying PagerDuty alert.

The service was down. Not the "oh, some endpoints are slow" kind of down. The entire conversation interface had timed out. Users staring at white screens. I cracked open Grafana half-asleep and the error rate was pegged at 100%. Checked the logs, and there it was: an error code from OpenAI I'd never seen before, context_length_exceeded, followed by some cryptic internal error string.

Wait — correction. I had seen that error code before. Just never in this scenario. We had token truncation logic in the frontend. There was literally no way a request should've exceeded the context window. Turns out OpenAI had silently tweaked the context window calculation for one of their models on March 14th. Not a single mention in the docs.

That night, sitting on the floor of our server room eating cold instant noodles and hot-patching code at 3 AM, I thought: "I need to write this down. No one else should have to learn this the hard way."

So here we are.

Honestly, OpenAI's API documentation reads like it was written by someone who's never actually deployed to production. The error handling section? Barely a paragraph. They give you HTTP status codes and basically say "figure it out yourself." I checked their November 2024 docs update — same sparse content from two years ago. After six months of taking punches in production, I've collected every scar.

Let me walk through what actually matters.

TL;DR / Key Takeaways

OpenAI's rate limiting has two dimensions (RPM and TPM), and TPM burns you when you least expect it
Retrying 5xx errors without backoff is basically a self-inflicted DDoS
Connection pools drain faster than you'd think under latency spikes
Default timeouts are absurd (600 seconds? Really?)
Always have a fallback model ready
Monitor your retry costs or your finance team will hunt you down

The Big Three Error Categories (Paid For in Real Money)

1. 429 Rate Limiting — It's Not Just "Too Many Requests"

If you think 429 just means "slow down and retry," you're in for a surprise.

OpenAI rate limits on two dimensions: RPM (requests per minute) and TPM (tokens per minute). We originally only monitored RPM. Then one day, during a batch job processing long documents, we started getting hammered with 429s at only 60% of our RPM limit.

Took us hours to figure out: TPM was the culprit. Those documents were chewing through tens of thousands of tokens per request, blowing right past the TPM ceiling.
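A quick back-of-the-envelope check shows how TPM can bind long before RPM does. The limits and the per-request token count below are made-up numbers for illustration, not an actual quota; plug in your own.

```python
def limit_utilization(requests_per_min, avg_tokens_per_request, rpm_limit, tpm_limit):
    """Return (rpm_utilization, tpm_utilization) as fractions of each limit."""
    rpm_util = requests_per_min / rpm_limit
    tpm_util = requests_per_min * avg_tokens_per_request / tpm_limit
    return rpm_util, tpm_util


# Hypothetical quota: 4,000 RPM and 40,000,000 TPM.
# Hypothetical workload: 2,400 requests/min at 20,000 tokens each.
rpm_util, tpm_util = limit_utilization(
    2400, 20_000, rpm_limit=4000, tpm_limit=40_000_000
)
print(f"RPM at {rpm_util:.0%}, TPM at {tpm_util:.0%}")
# prints "RPM at 60%, TPM at 120%"
```

Same request rate, two very different answers. If you only graph the first number, the 429s look like they come out of nowhere.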

The ugly part? Rate limiting has a lag. By the time you see the 429, you've probably been over the limit for several seconds already. Our fix was maintaining a sliding window counter that predicts token consumption and proactively throttles before hitting the wall. Here's what that looks like:


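import asyncio
import time
from collections import deque

class OpenAIRateLimiter:
    def __init__(self, rpm_limit=3500, tpm_limit=None):
        # rpm_limit keeps a buffer of 500 requests below the account limit
        self.rpm_limit = rpm_limit
        self.tpm_limit = tpm_limit  # set to your TPM limit minus a buffer
        self.request_window = deque()  # request timestamps
        self.token_window = deque()    # (timestamp, tokens) pairs

    async def acquire(self, estimated_tokens):
        while True:
            now = time.time()
            # Clean up entries older than 60 seconds
            while self.request_window and self.request_window[0] < now - 60:
                self.request_window.popleft()
            while self.token_window and self.token_window[0][0] < now - 60:
                self.token_window.popleft()

            used = sum(tokens for _, tokens in self.token_window)
            over_rpm = len(self.request_window) >= self.rpm_limit
            over_tpm = (self.tpm_limit is not None and self.token_window
                        and used + estimated_tokens > self.tpm_limit)
            if not (over_rpm or over_tpm):
                break
            # Sleep until the oldest entry in the full window expires
            oldest = self.request_window[0] if over_rpm else self.token_window[0][0]
            await asyncio.sleep(max(oldest + 60 - now, 0.01))

        self.request_window.append(now)
        self.token_window.append((now, estimated_tokens))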

We've been running this for about three months. It's decent. But I'll be honest — under sudden traffic spikes, it still occasionally faceplants.

2. 5xx Retries — Don't "Help" OpenAI by DDoSing Them

One night I got paged for a flood of 502 Bad Gateway errors. My half-asleep response? Infinite retries.

Plot twist: my retry logic made everything worse. OpenAI was already struggling under load, and I was effectively adding to their problem. When our DevOps engineer reviewed the incident, he said something that's stuck with me: "You weren't retrying. You were finishing the job."

Ouch. But he was right.

Here's our current strategy: max 3 retries for 5xx errors, always with exponential backoff and random jitter. The critical insight is knowing which errors deserve retries. 500, 502, 503 — yes. 504 Gateway Timeout? Think twice. If you've already waited 30 seconds with no response, another retry will probably just time out again.

This is the retry config we settled on after much trial and error:


retry_config = {
    'max_retries': 3,
    'backoff_factor': 2.0,
    'jitter': True,
    'retry_on_status': [429, 500, 502, 503],
    'max_delay': 60,
    'retry_on_timeout': True,
    'timeout': 30
}

From what I've seen, most teams land on similar parameters — but the exact numbers depend on your workload. Don't just copy-paste. Please.
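For a sense of how parameters like these turn into behavior, here is a minimal sketch of exponential backoff with jitter. Everything in it is an assumption for illustration: `APIStatusError` is a hypothetical stand-in for whatever exception your HTTP client raises, the 1-second `base` delay is not part of the config above, and timeout handling is left out to keep it short.

```python
import asyncio
import random

RETRYABLE_STATUS = {429, 500, 502, 503}


class APIStatusError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def backoff_delay(attempt, base=1.0, factor=2.0, max_delay=60.0, jitter=True):
    """Delay before retry number `attempt` (0-based): base * factor**attempt, capped."""
    delay = min(base * factor ** attempt, max_delay)
    # "Full jitter": pick uniformly in [0, delay] so clients don't retry in lockstep
    return random.uniform(0, delay) if jitter else delay


async def call_with_retries(func, max_retries=3, sleep=asyncio.sleep, **backoff):
    """Await func(); on a retryable status, back off and try again."""
    for attempt in range(max_retries + 1):
        try:
            return await func()
        except APIStatusError as err:
            # Give up on non-retryable statuses or once retries are exhausted
            if err.status not in RETRYABLE_STATUS or attempt == max_retries:
                raise
            await sleep(backoff_delay(attempt, **backoff))
```

Without jitter the delays run 1, 2, 4 seconds and stop at the cap. The `sleep` parameter is only there so tests can swap in a no-op instead of actually waiting.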

3. Connection Pool Exhaustion — My Personal Horror Story

December of last year. Our service hummed along at ~100 QPS, and I'd configured a 200-connection pool. Plenty of headroom, right?

Then one afternoon around 3 PM, OpenAI's response time suddenly jumped from 2 seconds to 15 seconds.

Avalanche.

The connection pool drained in seconds. New requests couldn't grab a connection and just threw exceptions. Imagine Black Friday doorbusters, but everyone's stuck at the entrance trying to get in.
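Little's Law makes the avalanche easy to see: the average number of in-flight requests is the arrival rate times the time each request spends in the system. The sketch below just plugs in the figures from this story, taken at face value.

```python
def connections_needed(qps, latency_seconds):
    """Little's Law: average in-flight requests = arrival rate * time in system."""
    return qps * latency_seconds


POOL_SIZE = 200
for latency in (2, 15):
    in_flight = connections_needed(100, latency)
    print(f"{latency}s latency -> {in_flight} in flight (pool: {POOL_SIZE})")
# prints "2s latency -> 200 in flight (pool: 200)"
# prints "15s latency -> 1500 in flight (pool: 200)"
```

If those round numbers are anywhere near accurate, a 200-connection pool was already close to fully used at 2 seconds, and a 15-second latency asks for roughly seven times more connections than exist.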

I learned my lesson. Three protection mechanisms now: semaphore-controlled concurrency, circuit breaker pattern, and sane timeouts. The circuit breaker is the star of the show:


import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold  # failures before opening
        self.timeout = timeout  # seconds to stay OPEN before a trial call
        self.failure_count = 0
        self.last_failure_time = None
        self.state = 'CLOSED'

    async def call(self, func, *args, **kwargs):
        if self.state == 'OPEN':
            if time.time() - self.last_failure_time > self.timeout:
                # Cool-down elapsed: let one trial request through
                self.state = 'HALF_OPEN'
            else:
                raise RuntimeError('Circuit breaker is OPEN')

        try:
            result = await func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.state = 'OPEN'
                self.last_failure_time = time.time()
            raise
        # Success: close the circuit and reset the failure count
        self.state = 'CLOSED'
        self.failure_count = 0
        return result

This thing has saved my sleep more times than I can count.
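The semaphore piece is simpler. Below is a minimal sketch of capping in-flight calls with `asyncio.Semaphore`, where a caller that can't get a slot quickly is rejected instead of queueing forever. The class name, the limit of 100, and the 1-second wait are all illustrative assumptions; size them from your own pool and latency numbers.

```python
import asyncio


class ConcurrencyLimiter:
    """Cap in-flight calls; fail fast if a slot doesn't free up in time."""

    def __init__(self, max_in_flight=100, acquire_timeout=1.0):
        self._sem = asyncio.Semaphore(max_in_flight)
        self._acquire_timeout = acquire_timeout

    async def run(self, func, *args, **kwargs):
        try:
            # Wait a bounded time for a slot instead of queueing forever
            await asyncio.wait_for(self._sem.acquire(), self._acquire_timeout)
        except asyncio.TimeoutError:
            raise RuntimeError("Too many in-flight requests") from None
        try:
            return await func(*args, **kwargs)
        finally:
            # Always hand the slot back, even if the call raised
            self._sem.release()
```

Create the limiter inside the running event loop (for example at application startup), since older Python versions bind asyncio primitives to the loop that exists at construction time. Rejecting early is the point: a fast error is something the caller can degrade around, while a request stuck waiting for a connection is not.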

Real Numbers From Production

Our service handles 5+ million API calls daily. I pulled three months of error distribution data:

429 Rate Limiting: 35% (of which TPM violations account for 60%, RPM for 40% — TPM is the silent killer)
5xx Server Errors: 28% (502 is most common, 503 close behind)
Network Timeouts: 20% (noticeably worse during peak hours, especially US Eastern mornings)
Other Errors: 17% (auth failures, malformed parameters, the usual suspects)

After implementing proper retry logic and circuit breaking, our service availability went from 99.2% to 99.7%. I know, 0.5 percentage points doesn't sound dramatic. But for paying customers, that's 3.6 fewer hours of downtime per month. My boss finally stopped scheduling "let's discuss reliability" meetings.
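The 3.6-hour figure checks out if you assume a 30-day month:

```python
HOURS_PER_MONTH = 30 * 24  # 720, assuming a 30-day month


def monthly_downtime_hours(availability_percent):
    """Hours of downtime per month implied by an availability percentage."""
    return (100 - availability_percent) / 100 * HOURS_PER_MONTH


before = monthly_downtime_hours(99.2)
after = monthly_downtime_hours(99.7)
print(f"{before:.2f}h -> {after:.2f}h, saved {before - after:.2f}h")
# prints "5.76h -> 2.16h, saved 3.60h"
```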

Hard-Won Wisdom From the Trenches

Set Your Own Damn Timeouts

OpenAI's Python SDK v1.6.0 has a default timeout of 600 seconds. I wish I was joking. Six hundred seconds. Your users have already opened a competitor's app, completed their task, and gone to lunch.

We set timeouts by endpoint type:

Chat completions: 30 seconds
Embeddings: 15 seconds
Fine-tuning jobs: 60 seconds
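One way to enforce budgets like these in async code is to wrap each call in `asyncio.wait_for`. This is a sketch, not a definitive setup: the dictionary keys, the `DEFAULT_TIMEOUT` fallback, and the `call_with_timeout` helper are names invented for the example.

```python
import asyncio

# Per-endpoint budgets in seconds, taken from the list above
TIMEOUTS = {"chat": 30, "embeddings": 15, "fine_tuning": 60}
DEFAULT_TIMEOUT = 30  # assumption: fallback for endpoints not listed


async def call_with_timeout(endpoint, func, *args, **kwargs):
    """Await func(...) but give up once the endpoint's budget is spent."""
    timeout = TIMEOUTS.get(endpoint, DEFAULT_TIMEOUT)
    # Raises asyncio.TimeoutError and cancels the call when the budget runs out
    return await asyncio.wait_for(func(*args, **kwargs), timeout=timeout)
```

The official Python SDK also accepts a `timeout` argument when you construct the client, which is the simpler option if a single value per client is enough.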

Log Everything (Because Error Codes Alone Are Useless)

When things break at 3 AM, seeing "502 error" in the logs tells you nothing. Our logs now capture: request_id, retry count, elapsed time, token consumption, and the raw error message. structlog has been a massive improvement over Python's standard logging — if you haven't tried it, do yourself a favor.
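To make the field list concrete, here is a minimal sketch of one structured log line per failed call. It uses only the standard library (`logging` plus `json`) so it runs anywhere; structlog would give you the same key-value shape with less ceremony. The function name and the exact field names are assumptions.

```python
import json
import logging
import time

logger = logging.getLogger("openai_client")  # logger name is arbitrary


def log_api_error(request_id, retry_count, started_at, tokens_used, error):
    """Emit one JSON line with the fields listed above and return the record."""
    record = {
        "event": "openai_api_error",
        "request_id": request_id,
        "retry_count": retry_count,
        "elapsed_seconds": round(time.time() - started_at, 3),
        "tokens_used": tokens_used,
        "error_type": type(error).__name__,
        "error_message": str(error),  # the raw message, not just the status code
    }
    logger.error(json.dumps(record))
    return record
```

One JSON object per line means the log aggregator can filter on `request_id` or chart `retry_count` without anyone writing a regex at 3 AM.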

Have a Fallback Plan (Even a Mediocre One)

We built a simple rule engine that switches to Anthropic or an open-source model when OpenAI becomes unavailable. The quality drops — I won't pretend it doesn't — but the service stays up. Took us about two sprints to stabilize this setup.
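The simplest possible version of that idea is an ordered list of providers tried one after another. The sketch below is a stand-in for illustration, not the rule engine itself: `complete_with_fallback` and the provider callables are hypothetical, and each callable would wrap whatever client you actually use.

```python
import asyncio


async def complete_with_fallback(prompt, providers):
    """Try each (name, async_callable) in order.

    Returns (name, result) from the first provider that succeeds.
    """
    errors = []
    for name, call in providers:
        try:
            return name, await call(prompt)
        except Exception as err:  # broad on purpose: any failure moves to the next provider
            errors.append(f"{name}: {err}")
    raise RuntimeError("All providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the result lets you log and meter how often the fallback path is taken, which is worth knowing given the quality drop.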

Watch Your Wallet (Retries Cost Real Money)

Here's a fun story. Last November, during OpenAI's massive 3-hour outage, our retry logic went into overdrive. We burned through one-third of our monthly API budget. In three hours.

When my boss asked why costs had exploded, I said, "The AI got too smart and learned how to spend money." He did not laugh. Finance now requires a written report within 4 hours if API costs fluctuate more than 50%.

We also had a week where daily costs spiked 300% because of a bug causing requests to retry 15+ times. Now every retry gets metered, and we get alerts when thresholds are crossed.
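Metering retries can be as small as a sliding-window counter. The sketch below assumes a hypothetical `RetryBudget` class with made-up thresholds; what you do when it trips (page someone, stop retrying, both) is up to you.

```python
import time
from collections import deque


class RetryBudget:
    """Track retries in a sliding window and flag when a threshold is crossed."""

    def __init__(self, max_retries_per_window=100, window_seconds=60):
        self.max_retries = max_retries_per_window
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, estimated_tokens) per retry

    def record_retry(self, estimated_tokens, now=None):
        """Record one retry; return True if the window is now over budget."""
        now = time.time() if now is None else now
        self._events.append((now, estimated_tokens))
        # Drop retries that have aged out of the window
        while self._events and self._events[0][0] < now - self.window_seconds:
            self._events.popleft()
        return len(self._events) > self.max_retries

    def tokens_spent_on_retries(self):
        """Estimated tokens consumed by retries still inside the window."""
        return sum(tokens for _, tokens in self._events)
```

Multiply `tokens_spent_on_retries()` by your per-token price and you have the number finance is going to ask about anyway.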

The Bottom Line

OpenAI API error handling isn't something you can solve with a try-catch block and a prayer. You need rate limiting awareness, intelligent retries, circuit breaking, and graceful degradation — all working together.

Most importantly: don't trust the official docs blindly. The edge cases? You'll discover those yourself at 2:47 AM. I've made it a habit to document every new error code in our team's "Lessons Learned" doc. Six months in, that document is basically a small book. New engineers read it during their first week. It's been way more useful than any official documentation.

What weird production errors have you hit with OpenAI's API? Especially the undocumented ones that make you question your career choices. I'm actually working on an open-source retry toolkit that codifies all these patterns — first release should be out next month. Drop a comment if you're interested, or share your horror stories. Misery loves company.

#OpenAI #APIErrorHandling #ProductionEngineering #RetryPatterns #BackendDevelopment #WarStories

Cael Lee
