I Fixed Our AI Scaling Nightmares (And Stopped Waking Up at 3 AM)
It was Wednesday. No wait, Thursday. Actually, it was definitely Wednesday because I remember the next morning's standup with painful clarity. 3:12 AM. My phone vibrated so violently it nearly walked off the nightstand. PagerDuty.
Our production GPU cluster was down. Again.
Third time that month.
The pattern was always the same: some model API would get hammered by traffic, cascade-fail into other nodes, and suddenly everything's on fire. And there I was, bleary-eyed, manually tweaking HPA configs and replica counts at stupid o'clock in the morning. Honestly? I'd rather delete my entire Kubernetes cluster than do that one more time.
The Mess of Heterogeneous Model Scheduling
Let me back up. Our team hosts seven AI models—LLaMA 3.1 70B for text generation, ResNet-152 for image recognition, Whisper large-v3 for speech-to-text, plus a handful of fine-tuned specialized models. The hardware's equally chaotic: four A100 80GB GPUs, a bunch of T4s, and two Ascend 910B chips that our company bought during that "support domestic silicon" push last year. Those Ascend chips? Yeah, we'll get to those.
Here's the core problem: each model consumes resources in radically different ways.
Take our Stable Diffusion XL model. Normal QPS hovers around 20. But the moment users upload images for inference, it spikes to 200+ for about 30 seconds, then drops off a cliff. The text model? Steady, moderate load most of the time—until someone submits a 20,000-token document for translation. Then VRAM usage doubles and OOM Killer starts swinging its axe like it's chopping firewood.
Traditional HPA looks at CPU and memory to trigger scaling decisions. For our workloads? Basically blind.
GPU utilization and actual business load have this weird, non-linear relationship. I've seen 90% utilization with perfectly fine inference latency. Other times, 60% utilization and requests are timing out in droves. This has something to do with model compute density—I think. Honestly, I didn't dig deep enough at the time. I was just trying to stop the bleeding.
My brilliant initial idea (read: terrible idea) was dumping all models into one giant GPU pool managed by Cluster Autoscaler. What happened? A100 nodes got hijacked by tiny models that could've run on T4s, while 40GB+ models sat queued behind them. Money burned. Performance tanked. I once caught an A100 running some dinky text classification model using less than 8GB VRAM while an image model needing 40GB waited 15 minutes in queue.
Waste. Pure, expensive waste.
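The mismatch boils down to a placement rule: put each model on the smallest GPU class whose memory fits it. Here is a minimal sketch of that rule, not the scheduler we actually ran. GPU_TIERS, pick_gpu_tier, and the 15% safety margin are hypothetical names and values made up for illustration; 16 GB is the T4's stock memory size, and the Ascend chips are left out on purpose. In a real cluster the result would typically end up as a node label used by a nodeSelector or a taint.

```python
from typing import Optional

# Hypothetical tier table: (node label value, VRAM in GB), smallest first.
GPU_TIERS = [("t4", 16), ("a100-80gb", 80)]


def pick_gpu_tier(required_vram_gb: float, headroom: float = 0.15) -> Optional[str]:
    """Return the smallest tier that fits the model plus a safety margin."""
    needed = required_vram_gb * (1 + headroom)  # headroom value is an assumption
    for name, capacity_gb in GPU_TIERS:
        if needed <= capacity_gb:
            return name
    return None  # nothing fits on a single card


print(pick_gpu_tier(8))   # the small text classifier
print(pick_gpu_tier(40))  # the large image model
```

With a rule like this, the 8GB classifier never lands on an A100 while a 40GB model is waiting for one.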
Building a Workload-Aware Scaling Strategy
After face-planting enough times, I realized: scaling decisions must come from business metrics, not infrastructure metrics. Staring at CPU graphs was like trying to drive by looking at your gas gauge instead of the road.
We settled on three monitoring dimensions:
- Request queue depth: Way more honest than QPS. When one model takes 30ms and another takes 15 seconds, raw request counts tell you nothing
- P99 latency: Not averages. P99 catches degradation early, while averages hide whatever the long tail is doing
- GPU VRAM headroom: Some models bottleneck on memory, not compute. Hit 85% and you'd better be spinning up capacity
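To make the three signals concrete, here is a rough sketch of how they could be combined into a single scale-up decision. ModelMetrics and scale_up_reasons are hypothetical names, and the 450 ms latency budget is a placeholder; only the queue target of 5 and the 85% VRAM line come from this post.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModelMetrics:
    queue_depth: int           # requests waiting, summed over all replicas
    replicas: int              # replicas currently serving traffic
    p99_latency_ms: float      # P99 over a recent window
    vram_used_fraction: float  # 0.0 to 1.0, taken from the fullest replica


def scale_up_reasons(m: ModelMetrics,
                     max_queue_per_replica: float = 5.0,
                     p99_budget_ms: float = 450.0,   # assumed budget
                     vram_limit: float = 0.85) -> List[str]:
    """Return the signals that currently argue for adding capacity."""
    reasons = []
    if m.queue_depth / max(m.replicas, 1) > max_queue_per_replica:
        reasons.append("queue_depth")
    if m.p99_latency_ms > p99_budget_ms:
        reasons.append("p99_latency")
    if m.vram_used_fraction >= vram_limit:
        reasons.append("vram_headroom")
    return reasons


burst = ModelMetrics(queue_depth=50, replicas=2, p99_latency_ms=300, vram_used_fraction=0.9)
print(scale_up_reasons(burst))
```

Returning the list of reasons rather than a bare boolean makes it easier to see afterwards which signal actually triggered a scale-up.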
We wired these into KEDA using ScaledObject configs. KEDA's honestly great—pulls custom metrics from Prometheus and drives scaling. We used version 2.14, latest at the time.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: image-gen-model-scaler
spec:
  scaleTargetRef:
    name: image-gen-deployment
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 120  # hold scale-down decisions for 2 minutes
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: request_queue_depth
        # Queue depth is a gauge, so it is summed directly rather than wrapped in rate()
        query: |
          sum(model_queue_depth{model="image-gen"})
        threshold: "5"  # target queue depth per replica
The config says: scale up once image-gen queue depth averages more than 5 per replica (KEDA treats the threshold as a per-replica target). I specifically added that 120-second stabilization window for scale-down. Before that? Pods were bouncing up and down like ping-pong balls. P99 actually got worse from all the scheduling churn. That lesson cost me about two days of overtime.
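If the stabilization window feels abstract, this toy simulation shows the idea: during scale-down the autoscaler acts on the highest replica recommendation it has seen inside the window, so a brief dip in load does not immediately kill pods. ScaleDownStabilizer is a hypothetical class written for this post, not KEDA or HPA code, and the timestamps are made up.

```python
from collections import deque


class ScaleDownStabilizer:
    """Hold the replica count at the highest recommendation seen in the window."""

    def __init__(self, window_seconds: float = 120.0):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp, recommended_replicas)

    def recommend(self, now: float, desired: int) -> int:
        self._history.append((now, desired))
        # Drop recommendations that have aged out of the window.
        while self._history and now - self._history[0][0] > self.window_seconds:
            self._history.popleft()
        return max(replicas for _, replicas in self._history)


stabilizer = ScaleDownStabilizer(window_seconds=120)
for t, desired in [(0, 8), (30, 2), (90, 3), (121, 2), (215, 2)]:
    print(t, desired, stabilizer.recommend(t, desired))
```

Scale-ups still go through right away, because a higher recommendation becomes the window maximum as soon as it arrives. Only the way down is damped.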
First week after deploying? Genuinely better than manual ops. But then we hit a nastier problem.
Cold starts.
The Incident That Still Gives Me Chills
December 18th last year. 3:07 PM. I remember the exact time because it was my mom's birthday and I had planned to leave early. Our text model suddenly got flooded with long document translation requests. Queue depth jumped from 3 to 50 in about 20 seconds.
KEDA triggered scaling immediately. New pods started pulling the model image.
The image was 15GB. Pull time: almost 4 minutes.
During those 4 minutes, the existing pods got obliterated. New pods weren't ready. Users started seeing 504 errors everywhere. Worse? Because all traffic routed through the same API gateway, healthy models got dragged down too. Grafana alerts kept popping up. My palms were sweating. One thought looped in my head: we're completely screwed.
Post-mortem revealed three fatal flaws:
- No warm-up mechanism: New pods came up "cold." On top of the image pull, loading the model into GPU memory took another 30-50 seconds. Total? Nearly 5 minutes of dead air
- Missing circuit breakers: When one model backend struggled, requests should have failed fast instead of piling up
- Single point of failure: All models shared one API gateway with zero fault isolation
Here's how I fixed it:
1. Warm Pod Pooling
We now maintain two "warm" pods per model: fully loaded, not receiving traffic. When scaling triggers, they join the Service backend right away while new pods spin up in the background to refill the pool. Scaling response time dropped from about 4 minutes to roughly 10 seconds.
# Warm pool manager: messy but functional
class WarmPoolManager:
    def maintain_warm_pool(self, model_name):
        """Provision one warm pod if the pool is below its minimum size."""
        warm_pods = self.get_warm_pods(model_name)
        if len(warm_pods) < self.min_warm_pods:
            pod = self.provision_pod(model_name)
            self.wait_model_loaded(pod)  # Block until the model reports ready
            self.mark_as_warm(pod)       # Mark warm, keep out of traffic
Actually, let me correct that. wait_model_loaded doesn't just poll. It hits the model's /health endpoint and checks for a 200 with model_loaded: true before marking the pod ready. My first version? time.sleep(60). Hard-coded. A colleague eviscerated me in code review, called it "wishful-thinking-driven development." Deserved that one.
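For the curious, a readiness check along those lines can be sketched with nothing but the standard library. This is an approximation rather than our production code: it takes a URL instead of a pod object, the deadline and poll interval are assumed values, and fetch_health is a hypothetical helper. The fetch, sleep, and clock parameters exist only so the loop can be exercised without a real server.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable, Tuple


def fetch_health(url: str, timeout: float = 2.0) -> Tuple[int, dict]:
    """GET the health URL and return (status_code, parsed JSON body)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = json.loads(resp.read() or b"{}")
            return resp.status, body if isinstance(body, dict) else {}
    except urllib.error.HTTPError as err:
        return err.code, {}  # server answered, but with an error status
    except (OSError, ValueError):
        return 0, {}         # unreachable, timed out, or body was not JSON


def wait_model_loaded(url: str,
                      deadline_s: float = 300.0,  # assumed upper bound
                      poll_s: float = 2.0,        # assumed poll interval
                      fetch: Callable[[str], Tuple[int, dict]] = fetch_health,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until /health returns 200 with model_loaded true, or give up."""
    start = clock()
    while clock() - start < deadline_s:
        status, body = fetch(url)
        if status == 200 and body.get("model_loaded") is True:
            return True
        sleep(poll_s)
    return False


# Usage (hypothetical address):
# ready = wait_model_loaded("http://10.0.0.12:8080/health")
```

The important difference from time.sleep(60) is that the function returns False on timeout, so the caller can refuse to mark a pod warm instead of hoping it is.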
2. Model-Level Circuit Breaking
We applied Istio DestinationRules with circuit breaker policies per model endpoint. Istio version 1.20.2:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: text-model-circuit-breaker
spec:
  host: text-model-service
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 10
        maxRequestsPerConnection: 5
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 60s
Massive improvement. When text model pods start returning 5xx errors (timeouts included), Istio ejects them from the load-balancing pool for at least 60 seconds, and the pending-request cap makes excess requests fail fast instead of piling up. Other models stay completely unaffected. Users might see one feature unavailable, but not a total platform outage. I brainstormed this approach on CNCF Slack. Someone warned me not to set baseEjectionTime too short, or pods would flap in and out constantly. Good call.
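The ejection rule itself is simple enough to model in a few lines. The sketch below is a toy, not how Envoy implements outlier detection: OutlierTracker is a hypothetical class that tracks a single backend pod, and it ignores both the interval sweep and the way the real ejection time grows with repeated ejections.

```python
class OutlierTracker:
    """Toy model of consecutive-5xx ejection for a single backend pod."""

    def __init__(self, consecutive_5xx: int = 3, base_ejection_s: float = 60.0):
        self.consecutive_5xx = consecutive_5xx
        self.base_ejection_s = base_ejection_s
        self._streak = 0
        self._ejected_until = 0.0

    def record(self, now: float, status: int) -> None:
        """Feed in the status code of one finished request."""
        if 500 <= status <= 599:
            self._streak += 1
            if self._streak >= self.consecutive_5xx:
                self._ejected_until = now + self.base_ejection_s
                self._streak = 0
        else:
            self._streak = 0  # any non-5xx response resets the streak

    def accepts_traffic(self, now: float) -> bool:
        return now >= self._ejected_until


pod = OutlierTracker()
for ts, status in [(0, 503), (1, 504), (2, 503)]:
    pod.record(ts, status)
print(pod.accepts_traffic(30), pod.accepts_traffic(62))
```

The model also shows why a short baseEjectionTime causes flapping: a pod that is still unhealthy rejoins the pool, fails three more requests, and gets ejected again.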
3. Multi-Zone Disaster Recovery
If I'm being honest, multi-zone deployment always felt like a "big tech" thing. After that full-site outage, though? I gritted my teeth and deployed core models across two availability zones. Costs jumped about 30%. But compared to user churn from one major incident? Worth every cent.
We use pod anti-affinity to spread model pods across zones:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: model
              operator: In
              values:
                - text-gen
        topologyKey: topology.kubernetes.io/zone
Small gotcha: requiredDuringSchedulingIgnoredDuringExecution is rigid. If one zone runs out of resources, pods just sit there Pending. I later switched to preferredDuringSchedulingIgnoredDuringExecution with weights. This is one of those tradeoffs I still wrestle with—strict anti-affinity vs. availability. No clean answer yet.
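For reference, the preferred form wraps the same selector in a weighted term. The sketch below builds that stanza as a plain Python dict and prints it as JSON, which Kubernetes accepts just like YAML. preferred_zone_spread is a hypothetical helper, and the weight of 100 is an illustrative choice rather than the value we settled on; Kubernetes allows weights from 1 to 100.

```python
import json


def preferred_zone_spread(model_label: str, weight: int = 100) -> dict:
    """Build a soft pod anti-affinity stanza that spreads a model across zones."""
    if not 1 <= weight <= 100:
        raise ValueError("weight must be between 1 and 100")
    return {
        "podAntiAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,
                    "podAffinityTerm": {
                        "labelSelector": {
                            "matchExpressions": [
                                {"key": "model", "operator": "In", "values": [model_label]}
                            ]
                        },
                        "topologyKey": "topology.kubernetes.io/zone",
                    },
                }
            ]
        }
    }


print(json.dumps({"affinity": preferred_zone_spread("text-gen")}, indent=2))
```

With the soft rule, the scheduler still tries to put replicas in different zones, but a pod can land next to its sibling when the other zone is full instead of sitting Pending.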
The Numbers Don't Lie
I pulled a comparison dataset before and after optimization. Honestly, even I was surprised:
| Metric | Before | After |
|---|---|---|
| Scaling response time | 3-5 min | 8-12 sec |
| P99 latency range | 200ms-3000ms | 180ms-450ms |
| Monthly incidents | 4.7 | 0.3 |
| GPU idle rate | 42% | 18% |
| Incident blast radius | Entire platform | Single model |
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.