I Killed Our API Gateway With a Coffee Cup and 47 Minutes of Downtime
Last Wednesday, 3:12 PM. Our API aggregation platform went dark. For 47 minutes straight.
The reason was embarrassingly stupid—one model provider's gpt-4-turbo-preview endpoint timed out and took down our entire scheduling thread pool. My boss dropped three question marks in our company chat. I was mid-sip of my flat white and nearly launched the cup into my keyboard.
I'm not exaggerating when I say I wanted the ground to swallow me whole.
That incident forced me to rethink everything I thought I knew about model scheduling and high availability. I've since discovered that while plenty of articles talk about API aggregation, almost none tackle the real question: what do you actually do when your upstream providers catch fire?
So here it is. Every pitfall I've fallen into, every tear I've shed over incident reports, and the architecture that's actually working for us now.
What We're Actually Scheduling
Let me set the scene first.
Our platform connects to seven model providers: OpenAI, Anthropic's Claude, plus a handful of Chinese LLMs you've probably heard of—ERNIE Bot, Tongyi Qianwen, DeepSeek, Zhipu, and MiniMax. When a request hits our API, the scheduling layer has about 300 milliseconds to answer four questions:
- Which provider's model do we use?
- What happens if that provider is down?
- How do we keep response times from tanking?
- How do we control costs?
Sounds straightforward, right?
The actual pitfalls could fill Wembley Stadium. I'm not joking.
Our initial design was painfully naive—just a simple round-robin. I figured all the models were roughly equivalent, so why not distribute requests evenly? Then OpenAI had one of its wobbles, and our entire platform went down with it. Every single request queued up waiting for connections that had already timed out. Thread pool? Completely saturated.
That was March 2024. I remember it vividly because I'd just told my partner how "rock-solid" our architecture was.
The universe has a wicked sense of timing.
The Scheduling Journey: From Round-Robin to Intelligent Routing
Version 1: Simple Round-Robin (Decommissioned)
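    # November 2023 code. Looking at it now makes me want to delete my GitHub account.
    current_index = 0  # module-level cursor shared by every call

    def select_provider(request):
        global current_index  # without this, the += below raises UnboundLocalError
        providers = get_available_providers()
        # Plain rotation: no awareness of health, latency, or cost.
        provider = providers[current_index % len(providers)]
        current_index += 1
        return provider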
The problem? This approach completely ignored real-time provider health. Once, Claude's API latency spiked to 13 seconds—and our round-robin kept cheerfully routing requests its way. User experience cratered. Our support queue exploded with complaints.
Twelve tickets, to be precise. Yes, I counted.
Version 2: Health Checks + Dynamic Weights
Next, we added health checks—pinging each provider's /v1/models endpoint every 30 seconds. I felt rather clever about this one.
Reality promptly educated me.
During a promotional event last November—don't ask why an international service was running Singles' Day promotions, that was marketing's idea—one provider suddenly rate-limited us between health check intervals. In that 30-second window, we accumulated over 2,300 failed requests.
Our Celery task queue memory went critical.
Wait, let me correct that. It wasn't the Redis queue itself that failed. It was Celery's worker memory that overflowed because we hadn't configured task_soft_time_limit. Tasks piled up in worker memory until everything ground to a halt. This detail matters because we later added task_soft_time_limit=45 and task_time_limit=60 specifically to prevent this scenario.
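For reference, those two limits could sit in a Celery configuration module along these lines. This is a minimal sketch: the file name celeryconfig.py and the loading call in the comment are assumptions about project layout, while the two setting names are Celery's standard ones and the values are the ones quoted above.

```python
# celeryconfig.py (hypothetical file name), loaded with something like
# app.config_from_object("celeryconfig")

# Soft limit: after 45 s Celery raises SoftTimeLimitExceeded inside the task,
# which gives the task a chance to clean up and return.
task_soft_time_limit = 45

# Hard limit: after 60 s the worker process running the task is killed and replaced.
task_time_limit = 60
```

Keeping the soft limit below the hard limit is the point: the task gets a window to give up gracefully before the worker is terminated.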
Current Approach: Multi-Dimensional Scoring + Circuit Breaking
We now use a real-time scoring system. After every request completes, we update the provider's score:
Score = Latency (40%) + Success Rate (35%) + Cost (15%) + Load (10%)
The implementation looks roughly like this:
    class ProviderScorer:
        def calculate_score(self, provider_stats):
            # Every component ends up on a 0-100 scale where higher is better.
            latency_score = self.normalize_latency(provider_stats.p99_latency)
            success_score = provider_stats.success_rate * 100  # success_rate is a 0-1 fraction
            cost_score = self.calculate_cost_efficiency(provider_stats.avg_cost)
            load_score = 100 - provider_stats.current_load  # current_load is a percentage
            return (
                latency_score * 0.4 +
                success_score * 0.35 +
                cost_score * 0.15 +
                load_score * 0.1
            )
But I'll be honest: we're still tuning those four weight values. The 0.4/0.35 combination was set in June 2024, and we tweaked it once after GPT-4o launched. GPT-4o was so much faster than GPT-4 that our original latency weighting turned out to be too high, causing the scheduler to over-prioritise OpenAI and blow our costs out.
We'll probably adjust again soon. DeepSeek V3 has completely rewritten the price-performance equation.
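To make the weighting concrete, here is a self-contained sketch of the same formula. The two normalisation helpers are assumptions (simple linear scales with made-up bounds of 15 seconds and $0.03 per 1K tokens), as are the ProviderStats fields' units and the two sample providers; only the four weights come from the formula above.

```python
from dataclasses import dataclass


@dataclass
class ProviderStats:
    p99_latency: float   # seconds
    success_rate: float  # fraction, 0.0 to 1.0
    avg_cost: float      # USD per 1K tokens (assumed unit)
    current_load: float  # percent of the provider pool in use, 0 to 100


def clamp(value: float, low: float = 0.0, high: float = 100.0) -> float:
    return max(low, min(high, value))


def normalize_latency(p99_latency: float, worst: float = 15.0) -> float:
    """Map 0 s to 100 and `worst` seconds or slower to 0 (assumed linear scale)."""
    return clamp(100.0 * (1.0 - p99_latency / worst))


def cost_efficiency(avg_cost: float, ceiling: float = 0.03) -> float:
    """Map free to 100 and `ceiling` or pricier to 0 (assumed linear scale)."""
    return clamp(100.0 * (1.0 - avg_cost / ceiling))


def calculate_score(stats: ProviderStats) -> float:
    return (
        normalize_latency(stats.p99_latency) * 0.40
        + stats.success_rate * 100 * 0.35
        + cost_efficiency(stats.avg_cost) * 0.15
        + (100 - stats.current_load) * 0.10
    )


# Two hypothetical providers to compare.
fast_but_pricey = ProviderStats(p99_latency=1.5, success_rate=0.99, avg_cost=0.03, current_load=40)
slow_but_cheap = ProviderStats(p99_latency=6.0, success_rate=0.98, avg_cost=0.003, current_load=20)

ranked = sorted(
    [("fast_but_pricey", fast_but_pricey), ("slow_but_cheap", slow_but_cheap)],
    key=lambda item: calculate_score(item[1]),
    reverse=True,
)
```

With these made-up numbers the cheap provider edges ahead (about 79.8 against 76.7) despite being four times slower, which shows how sensitive the ranking is to both the weights and the normalisation bounds.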
Scoring alone isn't enough, though.
The real magic is circuit breaking.
We use a sliding-window circuit breaker—not the simplistic "fail N times and break" approach. Our window is 60 seconds. If the error rate exceeds 50% within that window, the breaker trips automatically. After a 30-second cooldown, we let a few requests through as probes. It's inspired by Hystrix's design philosophy, but we built it ourselves since the Go ecosystem didn't have quite what we needed.
Speaking of which—I actually wanted to use Resilience4j, but our backend is written in Go. That decision sparked quite the debate at the time. Some colleagues argued for Java specifically because of its mature ecosystem. We compromised: custom circuit breaker, off-the-shelf everything else.
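The production breaker described here is written in Go and isn't shown in this post, so what follows is only a Python sketch of the same idea, using the numbers above (60-second window, 50% error rate, 30-second cooldown). The class name, the min_requests guard, and the injectable clock are assumptions added for illustration. It also lets every request through as a probe once the cooldown ends; a real implementation would cap that to a few.

```python
import time
from collections import deque


class SlidingWindowBreaker:
    """Trips when the error rate over the last `window` seconds exceeds `threshold`."""

    def __init__(self, window=60.0, threshold=0.5, cooldown=30.0,
                 min_requests=10, clock=time.monotonic):
        self.window = window
        self.threshold = threshold
        self.cooldown = cooldown
        self.min_requests = min_requests  # assumed guard: don't trip on tiny samples
        self.clock = clock
        self.events = deque()  # (timestamp, succeeded) pairs, oldest first
        self.opened_at = None  # None means the breaker is closed

    def allow(self):
        """Return True if a request may be sent to the provider right now."""
        if self.opened_at is None:
            return True
        # Open: block until the cooldown has elapsed, then let probes through.
        return self.clock() - self.opened_at >= self.cooldown

    def record(self, succeeded):
        """Record the outcome of a completed request."""
        now = self.clock()
        if self.opened_at is not None:
            # While open, a success closes the breaker and a failure restarts the cooldown.
            if succeeded:
                self.opened_at = None
                self.events.clear()
            else:
                self.opened_at = now
            return
        self.events.append((now, succeeded))
        # Drop outcomes that have slid out of the window.
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()
        failures = sum(1 for _, ok in self.events if not ok)
        if (len(self.events) >= self.min_requests
                and failures / len(self.events) > self.threshold):
            self.opened_at = now
```

The difference from "fail N times and break" is that old failures age out on their own: a provider that had a bad minute an hour ago starts from a clean slate.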
Three Critical Pillars of High Availability
1. Request-Level Timeout Control
This lesson cost me a production incident.
17 April 2024. I remember the date because it was the night before my birthday, and I spent it debugging a live outage.
We'd set a global 30-second timeout using Nginx's proxy_read_timeout. Then one provider's particular model (I won't name names) started responding painfully slowly, occupying every worker thread. New requests couldn't get in. Health checks got blocked too, so we'd effectively deadlocked ourselves.
Our current approach is layered timeouts:
- Connection timeout: 3 seconds (TCP handshake)
- First-byte timeout: 10 seconds (TTFB)
- Overall timeout: 15-60 seconds, dynamically adjusted per model
Each provider gets its own goroutine pool with backpressure control via Go channels. When a pool is full, we degrade gracefully rather than affecting other providers.
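The Go version uses goroutine pools and channels. As a rough Python analogue, the sketch below shows the two ideas together: a per-provider pool that rejects work when full instead of queueing it, and a first-byte deadline nested inside an overall deadline. All names here (PoolFull, ProviderPool, collect_stream) are hypothetical, and the 3-second connection timeout is left out because it belongs to whichever HTTP client is in use.

```python
import asyncio


class PoolFull(Exception):
    """Raised when a provider's pool has no free slot."""


class ProviderPool:
    """Bounded concurrency for one provider: a full pool fails fast."""

    def __init__(self, name, max_in_flight):
        self.name = name
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    async def run(self, call, overall_timeout):
        if self.in_flight >= self.max_in_flight:
            # Backpressure: reject immediately so the scheduler can pick another provider.
            raise PoolFull(self.name)
        self.in_flight += 1
        try:
            # Overall deadline for the whole request.
            return await asyncio.wait_for(call(), timeout=overall_timeout)
        finally:
            # The slot is released on success, error, and timeout alike.
            self.in_flight -= 1


async def collect_stream(open_stream, ttfb_timeout=10.0):
    """Read an async stream to the end, failing early if the first chunk is late."""
    stream = open_stream()
    # The first-byte deadline applies only to the first chunk.
    first = await asyncio.wait_for(stream.__anext__(), timeout=ttfb_timeout)
    chunks = [first]
    async for chunk in stream:
        chunks.append(chunk)
    return chunks
```

A call would then look like `await pool.run(lambda: collect_stream(open_stream), overall_timeout=60.0)`, with one pool per provider so a stalled upstream can only exhaust its own slots.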
This design saved us during OpenAI's global outage in August 2024. While the entire AI community was in meltdown mode, our users barely noticed—the scheduler automatically shifted traffic to Claude and DeepSeek.
2. Result Caching and Graceful Degradation
Not every request needs a live model call.
We noticed users repeatedly asking similar questions, especially for code generation and translation tasks. Someone asking "write quicksort in Python" might appear hundreds of times in a single day.
We added a semantic cache using pgvector (v0.6.0, the latest at the time) with OpenAI's text-embedding-3-small for vectorisation. Similarity threshold is set at 0.92. Hit rate hovers around 15-18%.
Doesn't sound impressive, does it?
But 18% at peak means 40 fewer API calls per second. At GPT-4 pricing, that's roughly $200 saved daily. Not life-changing money, but it covers our team's coffee budget for a month.
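The lookup logic behind a semantic cache is small enough to sketch. This version is an in-memory stand-in, not the pgvector setup described above: it scans a Python list instead of querying an index, and the two-dimensional vectors in the usage lines are toy values rather than real text-embedding-3-small output. Only the 0.92 threshold comes from the text.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.92):
        self.threshold = threshold
        self.entries = []  # (embedding, response) pairs

    def get(self, embedding):
        """Return the closest cached response, or None if nothing is similar enough."""
        best_score, best_response = 0.0, None
        for cached_embedding, response in self.entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, embedding, response):
        self.entries.append((embedding, response))


cache = SemanticCache()
cache.put([1.0, 0.0], "def quicksort(items): ...")
near_duplicate = cache.get([0.99, 0.05])  # almost the same direction: cache hit
unrelated = cache.get([0.5, 0.5])         # similarity around 0.71: cache miss
```

The threshold is the whole trade-off: lower it and the hit rate climbs, but so does the chance of answering a subtly different question with a cached response.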
Degradation strategy matters even more. When all premium models are unavailable, we automatically fall back to backup models. GPT-4o down? Switch to GPT-4o-mini. Claude Opus having issues? Hello, Sonnet. Quality dips slightly, but it beats serving 503 errors.
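One simple way to express that fallback is a static map from each premium model to its cheaper stand-ins. The sketch below assumes exactly that; the model identifiers are placeholders rather than real API model names, and is_available stands in for whatever health signal the scheduler exposes.

```python
# Hypothetical fallback chains, tried in order after the requested model.
FALLBACKS = {
    "gpt-4o": ["gpt-4o-mini"],
    "claude-opus": ["claude-sonnet"],
}


def pick_model(requested, is_available):
    """Return the first available model in the chain, or None if all are down."""
    for candidate in [requested, *FALLBACKS.get(requested, [])]:
        if is_available(candidate):
            return candidate
    return None
```

Returning None rather than raising leaves the caller free to decide whether a 503 or a cached answer is the right last resort.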
There's a gotcha here. Different models produce different output formats—especially for function calling response structures. We built an adaptation layer to normalise everything, but that layer itself has caused bugs. Last September, Claude updated its tool_use format, and our adaptation layer didn't catch up. Parse failure rates spiked. Users spotted it before we did. Embarrassing.
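To show what such an adaptation layer has to reconcile, here is a minimal sketch that maps an OpenAI-style tool call and an Anthropic-style tool_use block onto one internal shape. The internal shape (id, name, arguments) and both function names are assumptions, and the provider field names reflect the public formats as commonly documented, so check them against the current API references before relying on them.

```python
import json


def normalize_openai_tool_call(call):
    """OpenAI-style: arguments arrive as a JSON-encoded string under `function`."""
    return {
        "id": call["id"],
        "name": call["function"]["name"],
        "arguments": json.loads(call["function"]["arguments"]),
    }


def normalize_anthropic_tool_use(block):
    """Anthropic-style: a `tool_use` content block with arguments already parsed in `input`."""
    if block.get("type") != "tool_use":
        # Fail loudly on an unexpected shape instead of passing bad data downstream.
        raise ValueError(f"unexpected block type: {block.get('type')!r}")
    return {
        "id": block["id"],
        "name": block["name"],
        "arguments": block["input"],
    }
```

The explicit type check matters for the kind of incident described above: a format change then shows up as a clear, countable error rather than a parse failure buried deeper in the pipeline.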
3. Multi-Region Deployment
This one was forced upon us.
May 2024. A construction crew in northern China accidentally severed a backbone fibre cable. I wish I were making this up. Our entire North China region went dark for over three hours.
That day, we started planning multi-region deployment.
We now run scheduling nodes in three regions: Shanghai (Alibaba Cloud), Tokyo (AWS), and Singapore (Azure). Cloudflare handles Anycast routing for proximity-based access. Failover is automatic:
    if primary_region.availability < 99.0:  # availability expressed as a percentage
        redirect_traffic_to(backup_region)
        notify_oncall_team()
        log_incident_timeline()
Switchover takes 12-17 seconds. A handful of requests fail during the transition. But it's infinitely better than complete platform collapse.
I should mention—multi-region isn't cheap. Data synchronisation, cross-region latency, operational complexity... it all adds up. We only did it because of that severed cable. Otherwise, we'd probably still be single-region today.
The Cost Optimisation Playbook
Good scheduling genuinely saves money. Our current strategy:
- Simple tasks (classification, extraction, summarisation) → cheap models like GPT-4o-mini or DeepSeek
- Complex tasks (reasoning, creative writing, long-form content) → premium models
- Off-peak hours (2-6 AM) → automatically switch to pay-as-you-go instances
- Every API key has a daily spending cap to prevent leakage abuse
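Two of the rules above, task-based routing and the per-key daily cap, can be sketched in a few lines. The tier names, the choice to send unknown task types to the premium tier, and the in-memory ledger are all assumptions for illustration; a real deployment would keep the spend counters in shared storage.

```python
import datetime

# Task types cheap enough for the budget tier, taken from the list above.
CHEAP_TASKS = {"classification", "extraction", "summarisation"}


def model_tier(task_type):
    """Route simple tasks to the cheap tier; everything else goes premium (assumed default)."""
    return "cheap" if task_type in CHEAP_TASKS else "premium"


class DailySpendCap:
    """Tracks spend per API key per day and refuses charges past the limit."""

    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent = {}  # (api_key, date) -> USD spent so far

    def try_charge(self, api_key, cost_usd, today=None):
        today = today or datetime.date.today()
        key = (api_key, today)
        spent = self.spent.get(key, 0.0)
        if spent + cost_usd > self.limit_usd:
            return False  # over the cap: reject before calling any provider
        self.spent[key] = spent + cost_usd
        return True
```

Because the ledger is keyed by date, the cap resets on its own at midnight without a cleanup job, though old entries would need pruning eventually.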
Last month, optimising our scheduling strategy cut costs by 23% while improving response times by 15%.
My boss now sends thumbs-up emojis instead of question marks.
Though I should admit—half that improvement came from DeepSeek being absurdly cheap. Their V3 pricing is practically predatory. Our scheduler now heavily favours DeepSeek because the price-performance ratio is just that good.
TL;DR / Key Takeaways
- Don't use simple round-robin for model scheduling unless you enjoy platform-wide outages
- Health checks alone aren't enough—you need real-time scoring and circuit breaking
- Layer your timeouts: connection, first-byte, and overall, all with different values
- Semantic caching saves money: 15-18% hit rate = $200/day at scale
- Multi-region deployment hurts until the day a construction crew severs your fibre
- Cost optimisation is a scheduling problem: route simple tasks to cheap models automatically
Where We Go From Here
Building a model scheduling layer for an API aggregation platform is fundamentally about balancing availability, cost, and performance. There's no silver bullet—everything gets tuned incrementally based on your specific workload.
Our current architecture isn't perfect. Consistency guarantees during cross-region failover still need work. Merging streaming responses from multiple providers is particularly nasty—SSE format implementations vary wildly. Some use data: prefixes, some don't. Some leave the event: field empty. Combining them is a proper headache.
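As an illustration of the streaming problem, a tolerant line parser might look like the sketch below. It assumes each chunk is a single-line JSON payload, accepts lines with or without the data: prefix, and skips event:, id:, and retry: fields along with comment lines. The [DONE] sentinel is an OpenAI-style convention, not part of the SSE format itself, and multi-line data fields are deliberately not handled.

```python
import json


def parse_stream_line(line):
    """Return the decoded JSON payload of one stream line, or None if it carries no data."""
    line = line.strip()
    if not line or line.startswith(":"):
        return None  # blank separator or SSE comment / keep-alive
    if line.startswith(("event:", "id:", "retry:")):
        return None  # metadata fields, including an empty event:
    if line.startswith("data:"):
        line = line[len("data:"):].strip()
    if not line or line == "[DONE]":
        return None  # empty payload or end-of-stream sentinel
    return json.loads(line)


with_prefix = parse_stream_line('data: {"text": "hi"}')
without_prefix = parse_stream_line('{"text": "hi"}')
```

Both calls produce the same dictionary, which is the property a merging layer needs before it can treat chunks from different providers uniformly.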
If you're building something similar, or if you've got better approaches, I'd genuinely love to hear about them.
How do you handle provider failures? Any particularly bizarre production incidents you've survived? Drop a comment below.
Coffee's on me—spiritually speaking ☕
What's the worst upstream failure you've dealt with? Ever had a provider go down during a critical demo? Share your war stories—I promise I'll read every single one.
#apigateway #highavailability #llmops #backend #systemdesign #sitereliability
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.