I Built an AI API Gateway After Getting Woken Up at 2:47 AM (Here's What I Learned)

Last Tuesday at 2:47 AM, my phone exploded with PagerDuty alerts.

Our AI service had gone down. Three times. The alerts all said the same thing—"upstream model timeout." First, OpenAI's GPT-4 started acting up. I switched to Claude, and boom—529 errors. I'm lying in bed scrolling through the alert history when it hits me: we're connected to five AI providers, but all the routing logic is hardcoded. Every time one goes down, I'm manually switching like some kind of human patch panel. Twist this valve, close that one.

Here's the thing—we're not alone in this mess.

Last month I grabbed drinks with a few folks building AI apps, and we're all dealing with the same garbage. One team runs an e-commerce customer service bot. On Singles' Day 2024 (think Black Friday but bigger), OpenAI went down for 40 minutes. This was right after GPT-4 Turbo's big rollout—stability was, uh, let's call it "aspirational." They had to manually change DNS records to fail over to a backup provider. When I asked how many orders they lost, the guy just stared at his beer and said, "I don't want to know."

So today I want to talk about how to manage a bunch of AI provider APIs using an API Gateway pattern—and handle health checks and automatic failover while you're at it. I spent three months in the trenches on this thing, from October through January. Hopefully I can save you some pain.

Why You Need an AI API Gateway

Let me throw some numbers at you:

Our team connects to 5 providers (OpenAI, Anthropic, Google AI, Alibaba's Bailian, DeepSeek). We handle roughly 2 million model calls per day. Peak days hit 3.5 million.
Based on our internal monitoring, a single provider's monthly availability hovers around 99.5%. Sounds decent, right? But here's the kicker: once you depend on three providers, at least one of them is having issues 3-4 times per month. That's not theoretical math. That's our actual stats.
During OpenAI's massive outage in November 2024 (3 hours and 12 minutes), our auto-failover handled about 120,000 requests. Without it, I would've spent that night in the data center.
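
To put rough numbers on the availability point, here's a back-of-the-envelope calculation. It assumes a 30-day month and that provider outages are independent of each other, which real outages often aren't. The 99.5% figure is the one from our monitoring; the function names are made up for this example.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: 30-day month


def expected_downtime_hours(availability: float) -> float:
    """Expected hours per month that a single provider is unavailable."""
    return (1 - availability) * HOURS_PER_MONTH


def prob_any_down(availability: float, n_providers: int) -> float:
    """Chance that at least one of n independent providers is down at a given moment."""
    return 1 - availability ** n_providers


single = expected_downtime_hours(0.995)
any_of_three = prob_any_down(0.995, 3)

print(f"one provider: ~{single:.1f} h/month unavailable")
print(f"any of three degraded: {any_of_three:.2%} of the time, "
      f"~{any_of_three * HOURS_PER_MONTH:.1f} h/month")
```

A single provider at 99.5% is down roughly 3.6 hours a month. With three providers in the mix, the time during which at least one of them is misbehaving roughly triples, which is exactly the window where hardcoded routing hurts.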

Bottom line: relying on a single AI provider is gambling. But here's the problem: every provider has different API formats, authentication methods, and error codes. OpenAI returns `choices[0].message.content`, Anthropic uses `content[0].text`, and Google AI has its own thing entirely. How do you stitch all this together so your application doesn't care?

Core Architecture: The Unified Routing Layer

Let's look at the overall architecture. I used the standard Gateway pattern—no black magic here:


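    Client request
     │
     ▼
    ┌─────────────────────────┐
    │ Unified API interface   │ ← exposes a standard OpenAI-style format externally
    └─────────────────────────┘
     │
     ▼
    ┌─────────────────────────┐
    │ Routing decision engine │ ← picks a provider by model name, load, and health status
    └─────────────────────────┘
     │
     ├──────► OpenAI Adapter
     ├──────► Anthropic Adapter
     ├──────► Google AI Adapter
     ├──────► Alibaba Bailian Adapter
     └──────► DeepSeek Adapter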

The design principle is dead simple: unified externally, adapted internally.

Translation for humans: your business code calls one endpoint. The Gateway translates the request into each provider's format, then translates the response back. Your app doesn't need to know whether it's talking to GPT-4 or DeepSeek—it just sees a standardized response format.

Wait, I should clarify something. I said "standardized OpenAI format," but that's not quite right. OpenAI themselves changed their format three times in 2024. What I mean by "standard" is our internal spec, based on OpenAI's June 2024 version. All provider responses get normalized to that.

My first mistake, and this one cost me weeks, was trying to make every provider's response format identical. The adapter layer became this monstrosity of field mapping, type conversion, and default value injection. Spaghetti code doesn't begin to describe it. I eventually got smart and only standardized the critical fields (`content`, `finish_reason`, `token_usage`). Provider-specific stuff goes into a `metadata` field, and the application grabs what it needs.

Saved me an embarrassing amount of adapter code.
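
Here's a minimal sketch of what "standardize only the critical fields" can look like. The `NormalizedResponse` class, the adapter function names, and the `input`/`output` keys inside `token_usage` are assumptions for illustration, not our actual internal spec. The provider field names (`choices[0].message.content` and `usage.prompt_tokens` for OpenAI, `content[0].text` and `usage.input_tokens` for Anthropic) are the ones those APIs return.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class NormalizedResponse:
    content: str
    finish_reason: Optional[str]
    token_usage: Dict[str, int]
    # Everything provider-specific lands here instead of being force-mapped.
    metadata: Dict[str, Any] = field(default_factory=dict)


def from_openai(raw: Dict[str, Any]) -> NormalizedResponse:
    choice = raw["choices"][0]
    usage = raw.get("usage", {})
    return NormalizedResponse(
        content=choice["message"]["content"],
        finish_reason=choice.get("finish_reason"),
        token_usage={
            "input": usage.get("prompt_tokens", 0),
            "output": usage.get("completion_tokens", 0),
        },
        metadata={"provider": "openai", "id": raw.get("id")},
    )


def from_anthropic(raw: Dict[str, Any]) -> NormalizedResponse:
    usage = raw.get("usage", {})
    return NormalizedResponse(
        content=raw["content"][0]["text"],
        # Passed through as-is: Anthropic says "end_turn" where OpenAI says "stop".
        finish_reason=raw.get("stop_reason"),
        token_usage={
            "input": usage.get("input_tokens", 0),
            "output": usage.get("output_tokens", 0),
        },
        metadata={"provider": "anthropic", "id": raw.get("id")},
    )
```

Note that this sketch does not translate finish-reason values between providers. Whether to map them onto one vocabulary or leave the raw value for the caller is a design choice you'd have to make for your own spec.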

Routing Strategy: It's Not Just Round-Robin

Routing is where I burned the most brain cells. I started with simple round-robin because, hey, all these models are roughly equivalent, right? Just distribute evenly.

So naive.

Problem one: massive cost differences. GPT-4 costs over 20x what DeepSeek charges. You want to split traffic evenly between them? Your finance team will hunt you down. When our December bill landed, my CTO scheduled a special ten-minute chat with me.

Problem two: capability mismatches. I once routed a JSON structured-output request to a certain provider—I won't name names—and it returned JSON with camelCase one time, snake_case the next. Downstream parsing exploded. That incident took four hours to resolve.

So now our routing has three layers:

Layer 1: Model mapping. We maintain a capability matrix showing which providers support which models and what they're good at. Looks something like this:


    model_mapping:
      gpt-4:
        providers:
          - name: openai
            model: gpt-4-0125-preview
            priority: 1
            cost_weight: 1.0
          - name: deepseek
            model: deepseek-chat
            priority: 2
            cost_weight: 0.05  # cost is 1/20 of GPT-4
        capabilities:
          - json_mode
          - function_calling
          - vision
Layer 2: Cost-aware routing. For non-critical tasks, we prefer cheaper providers. We added a `cost_tolerance` parameter so the application can specify cost sensitivity. Summarization and draft generation go through the cheap lane. Core conversations and structured output take the premium path.

Layer 3: Health-based routing. The last line of defense. If a provider goes down or latency spikes, it gets booted from the candidate list automatically.

Here's what the actual code looks like (simplified—I stripped out our internal config loading):


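    class AllProvidersDownError(Exception):
        """Raised when no healthy provider can serve the requested model."""


    # get_candidates() and health_checker come from the config loading
    # that was stripped out of this simplified version.
    def route_request(request, model_name):
        candidates = get_candidates(model_name)

        # Filter out unhealthy providers
        healthy = [c for c in candidates if health_checker.is_healthy(c.provider)]

        if not healthy:
            raise AllProvidersDownError(f"All providers for {model_name} are unhealthy")

        # Sort by priority first, then by cost (lower values win)
        sorted_candidates = sorted(
            healthy,
            key=lambda c: (c.priority, c.cost_weight * request.cost_tolerance)
        )

        return sorted_candidates[0]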
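
One thing worth noticing in that sort key: `priority` comes first, so `cost_weight * request.cost_tolerance` only breaks ties between candidates that share a priority. If you want cost sensitivity to change which provider wins, one option is to treat `cost_tolerance` as a cap on `cost_weight`. The sketch below is an illustrative variant only. The `Candidate` dataclass, `pick_provider`, and the cap semantics are assumptions, not part of the gateway code above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    provider: str
    priority: int       # lower value wins
    cost_weight: float  # relative cost, 1.0 = the most expensive option


def pick_provider(healthy: List[Candidate], cost_tolerance: float) -> Candidate:
    """Pick the best-priority provider whose cost_weight fits under cost_tolerance."""
    if not healthy:
        raise ValueError("no healthy candidates")
    affordable = [c for c in healthy if c.cost_weight <= cost_tolerance]
    # If nothing fits under the cap, fall back to the cheapest healthy provider.
    pool = affordable or [min(healthy, key=lambda c: c.cost_weight)]
    return min(pool, key=lambda c: (c.priority, c.cost_weight))


candidates = [
    Candidate("openai", priority=1, cost_weight=1.0),
    Candidate("deepseek", priority=2, cost_weight=0.05),
]
premium = pick_provider(candidates, cost_tolerance=1.0)  # core conversations
cheap = pick_provider(candidates, cost_tolerance=0.1)    # summaries, drafts
print(premium.provider, cheap.provider)
```

With a cap of 1.0 both providers qualify and priority decides; with a cap of 0.1 only the cheap one qualifies, so the request goes down the cheap lane.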

Health Checks: Don't Just Ping It

Health checking is the most underestimated part of this whole system.

My initial approach was laughably naive—call the `/v1/models` endpoint every 30 seconds. If it returns 200, we're good. I felt pretty clever.

Reality humbled me fast.

In December 2024, OpenAI's API happily returned model lists, but actual chat completion calls were timing out like crazy. Our health check saw nothing wrong. Traffic kept flowing. Error rate hit 40%. By the time I noticed, 23 minutes had passed.

Now we do three levels of health checks:

L1 - Connectivity check (every 10 seconds): A lightweight call to verify the service is reachable. We only look at HTTP status codes and response time. Uses `/v1/models` with a 2-second timeout.

L2 - Functional check (every 60 seconds): Send a real inference request with a fixed prompt—"Reply with OK and nothing else"—to verify the model is actually working. This catches the "API is up but model is down" scenario. That OpenAI incident? This would've caught it.

L3 - Quality check (every 5 minutes): Run a set of standard test cases to detect response quality degradation. This one came from an incident where a provider silently swapped model versions. API format was identical, but output quality tanked. Users started complaining that "the AI got dumber." Took us two days to pinpoint.

Hmm... actually, L3 is tricky to implement well. Defining "quality degradation" is pretty subjective. Right now we compare output similarity against the previous version and alert if it crosses a threshold, but false positives are still high. I'm still tuning this.
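
For a sense of what a similarity-based L3 check can look like, here's a minimal sketch using Python's standard `difflib`. It's an assumption-heavy illustration: character-level similarity is a crude stand-in for whatever metric you actually use, the 0.6 threshold is arbitrary, and `drifted_cases` plus the test case names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()


def drifted_cases(
    baseline: Dict[str, str],
    current: Dict[str, str],
    min_similarity: float = 0.6,  # assumption: tune against your own false-positive rate
) -> List[str]:
    """Return the names of test cases whose output drifted too far from the baseline."""
    flagged = []
    for name, expected in baseline.items():
        actual = current.get(name, "")  # a missing answer counts as drift
        if similarity(expected, actual) < min_similarity:
            flagged.append(name)
    return flagged


baseline = {
    "greeting": "Hello, how can I help you today?",
    "arithmetic": "2 + 2 = 4",
}
current = {
    "greeting": "Hello, how can I help you today?",
    "arithmetic": "I cannot do math.",
}
print(drifted_cases(baseline, current))
```

Surface similarity flags harmless rewording just as readily as real regressions, which is one reason false positives stay high with this kind of check.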

Health status isn't a simple binary either. I use a sliding window scoring mechanism:


    import time


    class HealthScorer:
        def __init__(self, window_size=60, threshold=0.8):
            self.window_size = window_size  # sliding window length in seconds
            self.threshold = threshold
            self.checks = []  # check results from the last window_size seconds

        def record_check(self, success, latency_ms):
            now = time.time()
            self.checks.append({
                'success': success,
                'latency': latency_ms,
                'timestamp': now,
            })
            # Drop records that have aged out of the window
            self.checks = [c for c in self.checks
                           if now - c['timestamp'] < self.window_size]

        def get_score(self):
            if not self.checks:
                return 0  # no data yet counts as unhealthy

            recent = self.checks[-10:]  # the 10 most recent checks
            successes = [c for c in recent if c['success']]
            success_rate = len(successes) / len(recent)
            # Average latency over successful checks only, so failures don't dilute it
            avg_latency = sum(c['latency'] for c in successes) / max(1, len(successes))

            # Penalize average latency above 5 seconds, capped so the score stays in [0, 1]
            latency_penalty = min(1.0, max(0, (avg_latency - 5000) / 5000))

            return success_rate * (1 - latency_penalty * 0.3)

        def is_healthy(self):
            return self.get_score() >= self.threshold

The beauty of this approach: one timeout won't immediately mark a provider as unhealthy, but sustained issues will tank the score fast. We set the threshold at 0.8—anything below triggers a failover.

War Stories from the Trenches

Here are a few incidents that left scars. All paid for with real money.

Trap #1: The rate-limit ping-pong effect.

November 2024. We failed over from OpenAI to DeepSeek, but DeepSeek had rate limits (1,000 requests per minute). It got hammered instantly and started returning 429s. Our health check saw "unhealthy" and failed back to OpenAI—which was also rate-limiting at that point. Back and forth, back and forth. Ping-pong. That incident lasted 17 minutes, with over 8,000 combined 429 responses.

The fix: special handling for 429 in health checks. When we see a 429, we don't mark unhealthy—we degrade based on the `Retry-After` header and trigger an alert for human intervention. This logic has saved us at least three times since.
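
Here's a minimal sketch of the `Retry-After` part, assuming the header arrives in a plain mapping. `Retry-After` can carry either a number of seconds or an HTTP date, so both cases are handled. The function name and the 30-second fallback are assumptions for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional

DEFAULT_COOLDOWN_S = 30.0  # assumption: fallback when the header is missing or unparseable


def cooldown_seconds(headers: Mapping[str, str], now: Optional[datetime] = None) -> float:
    """How long to hold traffic back from a provider after it returns a 429."""
    # Plain dicts are case-sensitive; real HTTP clients usually give you a
    # case-insensitive mapping, so adjust the lookup to your client.
    value = headers.get("Retry-After")
    if value is None:
        return DEFAULT_COOLDOWN_S
    value = value.strip()
    if value.isdigit():  # delay-seconds form, e.g. "12"
        return float(value)
    try:  # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return DEFAULT_COOLDOWN_S
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (retry_at - now).total_seconds())
```

The router can then lower the provider's share of traffic for that many seconds instead of flipping it to unhealthy, which is what breaks the ping-pong loop.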

Trap #2: Streaming response adapter hell.

Every provider's SSE format is different. OpenAI uses `data: {"choices":[{"delta":{"content":"hello"}}]}`, Anthropic uses `data: {"type":"content_block_delta","delta":{"text":"hello"}}`. I spent three days just on streaming adapters.
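
To make the difference concrete, here's a minimal sketch that turns raw SSE lines from either provider into plain text chunks. It only covers text deltas and assumes the lines have already been read and decoded. The function names and the `EXTRACTORS` table are hypothetical, and real streams carry more event types (tool calls, usage, errors) than this handles.

```python
import json
from typing import Callable, Dict, Iterable, Iterator, Optional


def _payload(line: str) -> Optional[dict]:
    """Parse one SSE line; return its JSON payload, or None for non-data lines."""
    if not line.startswith("data:"):
        return None  # skips "event:" lines, comments, and blank separators
    body = line[len("data:"):].strip()
    if not body or body == "[DONE]":  # OpenAI ends its stream with "data: [DONE]"
        return None
    return json.loads(body)


def openai_text(payload: dict) -> str:
    choices = payload.get("choices") or []
    if not choices:
        return ""
    return choices[0].get("delta", {}).get("content") or ""


def anthropic_text(payload: dict) -> str:
    if payload.get("type") != "content_block_delta":
        return ""  # message_start, message_stop, ping, and so on
    return payload.get("delta", {}).get("text") or ""


EXTRACTORS: Dict[str, Callable[[dict], str]] = {
    "openai": openai_text,
    "anthropic": anthropic_text,
}


def text_deltas(provider: str, lines: Iterable[str]) -> Iterator[str]:
    """Yield text chunks from a provider's SSE lines in one common shape."""
    extract = EXTRACTORS[provider]
    for line in lines:
        payload = _payload(line)
        if payload is None:
            continue
        text = extract(payload)
        if text:
            yield text
```

Usage is the same for both providers: `"".join(text_deltas("openai", lines))` and `"".join(text_deltas("anthropic", lines))` produce the same string for equivalent streams.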

Oh, and here's a fun one: when a Google AI streaming response drops midway, the connection doesn't get actively closed. Our connection pool filled up completely. Had to add a 30-second forced timeout to fix it.

Honest advice: use an existing solution for streaming. LiteLLM's streaming handling is way more stable than what I wrote. I eventually refactored to use their approach.

Trap #3: Cost accounting doesn't add up.

After building the unified gateway, finance asked, "How much did we spend on AI last month?" I added up the provider bills and... they were 15% higher than what the gateway reported. Two days of investigation later, I found the culprit: requests that timed out at the gateway layer had already consumed tokens at the provider. We paid for results we never received.

Now we have strict timeout controls (30 seconds, shorter than the typical provider default of 60) and idempotency handling to minimize "paid but no response" situations.
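
Here's a minimal sketch of the timeout half of that, assuming an asyncio-based gateway. The `SpendLedger` class and `call_with_timeout` are hypothetical names, and idempotency handling isn't shown. The idea is just that a gateway-side timeout gets recorded as "possibly billed" so the numbers can be reconciled against the provider invoice later.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

GATEWAY_TIMEOUT_S = 30.0  # from the text: shorter than the typical provider default of 60


class SpendLedger:
    """Tracks requests that timed out at the gateway but may still be billed upstream."""

    def __init__(self) -> None:
        self.unconfirmed: List[str] = []

    def mark_unconfirmed(self, request_id: str) -> None:
        self.unconfirmed.append(request_id)


async def call_with_timeout(
    request_id: str,
    send: Callable[[], Awaitable[str]],
    ledger: SpendLedger,
    timeout_s: float = GATEWAY_TIMEOUT_S,
) -> Optional[str]:
    """Run one provider call under a hard timeout; log it if the result never arrives."""
    try:
        return await asyncio.wait_for(send(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The provider may already have consumed tokens for this request.
        ledger.mark_unconfirmed(request_id)
        return None
```

At month end, the unconfirmed list gives you a set of request IDs to check against the provider's usage export, which beats rediscovering the gap from the bill.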

Should You Build Your Own?

I get this question a lot. Here's my take:

If you're using 2-3 providers with moderate traffic, just use LiteLLM or One API. Seriously. Don't build this yourself. I mean it.
If you need deep customization (routing strategies, cost controls, audit logging), consider building in-house. But budget at least 2-3 person-months. Minimum.
If you're at scale (10M+ daily calls), you'll probably need to build your own. Open-source solutions struggle with performance and stability at that level. Not because they're bad—they're just not designed for that scale.

Our team went with a hybrid: custom routing and health checks, but we borrowed LiteLLM's adapter code for provider integration. Saved us a ton of time.

What I Want to Build Next

There are a few things I'm itching to optimize but haven't gotten to yet:

Smart pre-warming: When a provider's health score starts dropping, proactively send warm-up requests to backup providers to reduce cold-start latency during failover. Right now, the first call after switching is 200-300ms slower because the connection pool is cold.
Cost forecasting: Predict next month's costs based on historical data for budget planning. Finance keeps asking for estimates, and I'm basically guessing, "Uh, probably about the same as this month?"
Multi-region deployment: Deploy access points in different regions for the same provider. We only use OpenAI's US endpoint right now with ~180ms latency. Adding Europe and Singapore nodes could theoretically get us down to 60-80ms. Not a high priority though—current latency is acceptable for our use case.

If you're building something similar or have better approaches, drop a comment. I'm especially curious about cost control strategies—I feel like there's huge room for improvement there. Our current priority-plus-weight system feels too crude. Anyone using reinforcement learning for dynamic routing? How's that working out?

Tags: #AIgateway #APIGateway #MultiProvider #HealthChecks #LLMOps #CostOptimization

Cael Lee