I Fixed Our AI Scaling Nightmares (And Stopped Waking Up at 3 AM)

It was Wednesday. No wait, Thursday. Actually, it was definitely Wednesday because I remember the next morning's standup with painful clarity. 3:12 AM. My phone vibrated so violently it nearly walked off the nightstand. PagerDuty.

Our production GPU cluster was down. Again.

Third time that month.

The pattern was always the same: some model API would get hammered by traffic, cascade-fail into other nodes, and suddenly everything's on fire. And there I was, bleary-eyed, manually tweaking HPA configs and replica counts at stupid o'clock in the morning. Honestly? I'd rather delete my entire Kubernetes cluster than do that one more time.

The Mess of Heterogeneous Model Scheduling

Let me back up. Our team hosts seven AI models—LLaMA 3.1 70B for text generation, ResNet-152 for image recognition, Whisper large-v3 for speech-to-text, plus a handful of fine-tuned specialized models. The hardware's equally chaotic: four A100 80GB GPUs, a bunch of T4s, and two Ascend 910B chips that our company bought during that "support domestic silicon" push last year. Those Ascend chips? Yeah, we'll get to those.

Here's the core problem: each model consumes resources in radically different ways.

Take our Stable Diffusion XL model. Normal QPS hovers around 20. But the moment users upload images for inference, it spikes to 200+ for about 30 seconds, then drops off a cliff. The text model? Steady, moderate load most of the time, until someone submits a 20,000-token document for translation. Then VRAM usage doubles and GPU out-of-memory errors start felling pods like firewood.

Traditional HPA looks at CPU and memory to trigger scaling decisions. For our workloads? Basically blind.

GPU utilization and actual business load have this weird, non-linear relationship. I've seen 90% utilization with perfectly fine inference latency. Other times, 60% utilization and requests are timing out in droves. This has something to do with model compute density—I think. Honestly, I didn't dig deep enough at the time. I was just trying to stop the bleeding.

My brilliant initial idea (read: terrible idea) was dumping all models into one giant GPU pool managed by Cluster Autoscaler. What happened? A100 nodes got hijacked by tiny models that could've run on T4s, while 40GB+ models sat queued behind them. Money burned. Performance tanked. I once caught an A100 running some dinky text classification model using less than 8GB VRAM while an image model needing 40GB waited 15 minutes in queue.

Waste. Pure, expensive waste.
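To make the placement problem concrete, here is a minimal sketch of a "smallest GPU that fits" rule. It is an illustration only: the tier names, the 15% headroom, and the function name are assumptions, not something from our actual scheduler. The VRAM sizes are the standard ones for a T4 (16 GB) and the 80 GB A100.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GpuTier:
    name: str      # hypothetical label, e.g. usable as a nodeSelector value
    vram_gb: int


# Tiers mentioned in this post (Ascend left out on purpose).
TIERS = [GpuTier("t4", 16), GpuTier("a100-80gb", 80)]


def smallest_fitting_tier(required_vram_gb: float,
                          headroom: float = 0.15) -> Optional[GpuTier]:
    """Return the smallest tier that holds the model plus some headroom.

    `headroom` is an assumed safety margin, not a measured value.
    Returns None when no tier is large enough.
    """
    needed = required_vram_gb * (1 + headroom)
    for tier in sorted(TIERS, key=lambda t: t.vram_gb):
        if tier.vram_gb >= needed:
            return tier
    return None
```

With a rule like this, the 8GB text classifier would be steered to a T4 and the 40GB image model to an A100, instead of both competing for the same pool.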

Building a Workload-Aware Scaling Strategy

After face-planting enough times, I realized: scaling decisions must come from business metrics, not infrastructure metrics. Staring at CPU graphs was like trying to drive by looking at your gas gauge instead of the road.

We settled on three monitoring dimensions:

Request queue depth: Way more honest than QPS. When one model takes 30ms and another takes 15 seconds, raw request counts tell you nothing
P99 latency: Not averages. P99 catches degradation early. Averages get lied to by long-tail distributions
GPU VRAM headroom: Some models bottleneck on memory, not compute. Hit 85% and you'd better be spinning up capacity
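A small sketch of how the three signals above could be combined, and why P99 tells a different story than the average. The 5-request queue limit and the 85% VRAM limit come from this post; the 500 ms latency objective, the class, and the function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


def p99(latencies_ms: List[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty sample."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


@dataclass
class ModelSignals:
    queue_depth: int
    latencies_ms: List[float]
    vram_used_fraction: float   # 0.0 to 1.0


def needs_capacity(signals: ModelSignals,
                   queue_limit: int = 5,
                   p99_slo_ms: float = 500.0,     # assumed objective
                   vram_limit: float = 0.85) -> List[str]:
    """Return the names of the signals that are asking for more capacity."""
    reasons = []
    if signals.queue_depth > queue_limit:
        reasons.append("queue_depth")
    if p99(signals.latencies_ms) > p99_slo_ms:
        reasons.append("p99_latency")
    if signals.vram_used_fraction >= vram_limit:
        reasons.append("vram_headroom")
    return reasons


# 98 fast requests and 2 very slow ones: the mean looks tolerable, P99 does not.
samples = [30.0] * 98 + [15000.0] * 2
mean_ms = sum(samples) / len(samples)
```

In that sample the mean sits near 330 ms while the P99 is the full 15 seconds, which is the long-tail effect described above.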

We wired these into KEDA using ScaledObject configs. KEDA's honestly great—pulls custom metrics from Prometheus and drives scaling. We used version 2.14, latest at the time.


apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: image-gen-model-scaler
spec:
  scaleTargetRef:
    name: image-gen-deployment
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          # Hold scale-down decisions for 120s to stop pods from flapping
          stabilizationWindowSeconds: 120
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: request_queue_depth
        # Queue depth is a gauge, so sum it directly; rate() is for counters
        query: |
          sum(model_queue_depth{model="image-gen"})
        threshold: "5"

The config says: aim for a queue depth of about 5 per replica, and add replicas when image-gen climbs past that. I specifically added that 120-second stabilization window for scale-down. Before that? Pods were bouncing up and down like ping-pong balls. P99 actually got worse from all the scheduling churn. That lesson cost me about two days of overtime.
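To show what the threshold and the stabilization window do together, here is a simplified model of the replica math. It assumes the metric is treated as a per-replica average target (so the goal is roughly total queue depth divided by the threshold, rounded up) and that scale-down uses the highest recommendation seen inside the window. It ignores the autoscaler's tolerance band and rate limits, and the class and function names are made up for illustration.

```python
import math
from collections import deque


def desired_replicas(total_queue_depth: float, target_per_replica: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Replicas needed so each one holds about `target_per_replica` items."""
    raw = math.ceil(total_queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))


class ScaleDownStabilizer:
    """Applies the highest recommendation from the last `window_s` seconds.

    Scale-up passes through immediately (the newest value is the highest);
    scale-down only happens once the older, higher values have aged out.
    """

    def __init__(self, window_s: float = 120.0):
        self.window_s = window_s
        self._history = deque()  # (timestamp_s, recommendation)

    def apply(self, now_s: float, recommendation: int) -> int:
        self._history.append((now_s, recommendation))
        cutoff = now_s - self.window_s
        while self._history and self._history[0][0] < cutoff:
            self._history.popleft()
        return max(rec for _, rec in self._history)
```

Feeding it a spike to a queue depth of 50 followed by a quick drop to 3 keeps the replica count at the spike level until the window has passed, which is the ping-pong damping described above.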

First week after deploying? Genuinely better than manual ops. But then we hit a nastier problem.

Cold starts.

The Incident That Still Gives Me Chills

December 18th last year. 3:07 PM. I remember the exact time because it was my mom's birthday and I had planned to leave early. Our text model suddenly got flooded with long document translation requests. Queue depth jumped from 3 to 50 in about 20 seconds.

KEDA triggered scaling immediately. New pods started pulling the model image.

The image was 15GB. Pull time: almost 4 minutes.

During those 4 minutes, the existing pods got obliterated. New pods weren't ready. Users started seeing 504 errors everywhere. Worse? Because all traffic routed through the same API gateway, healthy models got dragged down too. Grafana alerts kept popping up. My palms were sweating. One thought looped in my head: we're completely screwed.

Post-mortem revealed three fatal flaws:

No warm-up mechanism: New pods came up "cold." Loading the model into GPU memory took another 30-50 seconds on top of the image pull. Total? Nearly 5 minutes of dead air
Missing circuit breakers: When one model backend struggled, requests should have failed fast instead of piling up
Single point of failure: All models shared one API gateway with zero fault isolation

Here's how I fixed it:

1. Warm Pod Pooling

We now maintain two "warm" pods per model: fully loaded, not receiving traffic. When scaling triggers, they instantly join the Service backend while new pods spin up in the background to refill the pool. Scaling response time dropped from about 4 minutes to roughly 10 seconds.


# Warm pool manager: messy but functional
class WarmPoolManager:
    def maintain_warm_pool(self, model_name):
        warm_pods = self.get_warm_pods(model_name)
        if len(warm_pods) < self.min_warm_pods:
            pod = self.provision_pod(model_name)
            self.wait_model_loaded(pod)  # Wait for model readiness
            self.mark_as_warm(pod)       # Mark warm, keep out of traffic

Actually, let me correct that. wait_model_loaded doesn't just wait blindly: it polls the model's /health endpoint and checks for a 200 with model_loaded: true before marking the pod ready. My first version? time.sleep(60). Hard-coded. A colleague eviscerated me in code review, called it "wishful-thinking-driven development." Deserved that one.
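For reference, a readiness wait along those lines could look roughly like the sketch below. This is a standalone illustration using only the standard library, not the actual method from the class above: the function names, the 120-second deadline, and the 2-second polling interval are all assumptions. Only the /health check for a 200 with model_loaded: true comes from the description.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable, Tuple


def fetch_health(url: str, timeout_s: float = 2.0) -> Tuple[int, object]:
    """GET the health URL; return (status, parsed JSON), or (0, {}) on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status, json.loads(resp.read() or b"{}")
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return 0, {}


def wait_model_loaded(url: str,
                      deadline_s: float = 120.0,
                      interval_s: float = 2.0,
                      fetch: Callable[[str], Tuple[int, object]] = fetch_health,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until the pod reports the model as loaded, or the deadline passes."""
    end = clock() + deadline_s
    while True:
        status, body = fetch(url)
        if status == 200 and isinstance(body, dict) and body.get("model_loaded") is True:
            return True
        if clock() >= end:
            return False
        sleep(interval_s)
```

The fetch, sleep, and clock parameters are injectable so the loop can be tested without a real pod or real waiting. Returning False on timeout leaves the decision (retry, replace the pod, alert) to the caller.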

2. Model-Level Circuit Breaking

We applied Istio DestinationRules with circuit breaker policies per model endpoint. Istio version 1.20.2:


apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: text-model-circuit-breaker
spec:
  host: text-model-service
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 10
        maxRequestsPerConnection: 5
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 60s

Massive improvement. When the text model backend times out, Istio trips the breaker. Requests stop flowing there entirely. Other models stay completely unaffected. Users might see one feature unavailable, but not a total platform outage. I brainstormed this approach on CNCF Slack—someone warned me not to set baseEjectionTime too short, or pods would flap in and out constantly. Good call.
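The ejection logic is easier to reason about as a tiny state machine. The sketch below is a deliberately simplified model of consecutive-5xx ejection using the same numbers as the rule above (3 errors, 60 seconds). It is not how the proxy is implemented: the real thing also sweeps on an interval and lengthens the ejection for repeat offenders. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    consecutive_5xx: int = 0
    ejected_until: float = 0.0   # timestamp in seconds; 0 means never ejected


class OutlierDetector:
    """Ejects a backend after N consecutive 5xx responses for a fixed time."""

    def __init__(self, consecutive_5xx_errors: int = 3,
                 base_ejection_s: float = 60.0):
        self.threshold = consecutive_5xx_errors
        self.base_ejection_s = base_ejection_s

    def is_available(self, backend: Backend, now_s: float) -> bool:
        return now_s >= backend.ejected_until

    def record(self, backend: Backend, status: int, now_s: float) -> bool:
        """Record a response; return True if this response caused an ejection."""
        if 500 <= status <= 599:
            backend.consecutive_5xx += 1
            if backend.consecutive_5xx >= self.threshold:
                backend.ejected_until = now_s + self.base_ejection_s
                backend.consecutive_5xx = 0
                return True
        else:
            # Any non-5xx response resets the streak.
            backend.consecutive_5xx = 0
        return False
```

The model also makes the flapping warning visible: with a very short ejection time, a still-broken backend rejoins, fails three more requests, and gets ejected again almost immediately.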

3. Multi-Zone Disaster Recovery

If I'm being honest, multi-zone deployment always felt like a "big tech" thing. After that full-site outage though? I gritted my teeth and deployed core models across two availability zones. Costs jumped about 30%. But compared to user churn from one major incident? Worth every cent.

We use pod anti-affinity to spread model pods across zones:


affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: model
              operator: In
              values:
                - text-gen
        topologyKey: topology.kubernetes.io/zone

Small gotcha: requiredDuringSchedulingIgnoredDuringExecution is rigid. If one zone runs out of resources, pods just sit there Pending. I later switched to preferredDuringSchedulingIgnoredDuringExecution with weights. This is one of those tradeoffs I still wrestle with—strict anti-affinity vs. availability. No clean answer yet.
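One thing that trips people up when switching is that the preferred form is not a drop-in rename: each entry gains a weight (1 to 100) and the selector moves under podAffinityTerm. The sketch below builds that stanza as a Python dict so the shape is explicit. The helper name and the default weight of 100 are assumptions, not the values from our manifests.

```python
import json


def preferred_zone_spread(model_label: str, weight: int = 100) -> dict:
    """Build a soft (preferred) zone anti-affinity stanza for one model label."""
    if not 1 <= weight <= 100:
        raise ValueError("weight must be between 1 and 100")
    return {
        "podAntiAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,
                    "podAffinityTerm": {
                        "labelSelector": {
                            "matchExpressions": [
                                {
                                    "key": "model",
                                    "operator": "In",
                                    "values": [model_label],
                                }
                            ]
                        },
                        "topologyKey": "topology.kubernetes.io/zone",
                    },
                }
            ]
        }
    }


# JSON is valid YAML, so this output can be pasted under `affinity:`.
stanza_json = json.dumps(preferred_zone_spread("text-gen"), indent=2)
```

With the soft form, the scheduler tries to spread pods across zones but will still place a pod in a crowded zone rather than leave it Pending.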

The Numbers Don't Lie

I pulled a comparison dataset before and after optimization. Honestly, even I was surprised:

Metric                   Before            After
Scaling response time    3-5 min           8-12 sec
P99 latency range        200-3000 ms       180-450 ms
Monthly incidents        4.7               0.3
GPU idle rate            42%               18%
Incident blast radius    Entire platform   Single model

That GPU idle rate dropping from 42% to 18%? That's what sold my boss on continued optimization investment. Before, the thinking was "just rent more GPUs." Now we know intelligent scheduling is the real answer. Our CTO saw the data and said, "The savings alone could fund another hire." We didn't get the hire, but hey—budget got approved.

Stuff I'm Still Stuck On

Some problems remain gloriously unsolved:

Cost attribution: With models sharing hardware, calculating per-model costs is a guessing game. Finance bugs me monthly about it. I estimate percentages. I feel guilty every time
Cross-zone data transfer fees: The bill for inter-AZ traffic was way higher than I expected, especially intermediate data from large model inference. Last month, cross-AZ traffic hit 12% of total costs. That stung
Ascend chip scheduling: Those two Ascend 910Bs still only run specific models. Scheduling policies are completely incompatible with NVIDIA's stack. Almost semi-manual management at this point. Even after upgrading CANN to 7.0.RC1, certain operators still fail. I'm seriously considering sending them back
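On the cost attribution point, one common starting place is to split the shared bill in proportion to the GPU time each model actually used. The sketch below shows that arithmetic only. The function name and the example figures in the note after it are invented, and it assumes per-model GPU-seconds are already being collected somewhere.

```python
from typing import Dict


def attribute_cost(total_cost: float,
                   gpu_seconds_by_model: Dict[str, float]) -> Dict[str, float]:
    """Split a shared GPU bill in proportion to each model's GPU-seconds."""
    total_seconds = sum(gpu_seconds_by_model.values())
    if total_seconds <= 0:
        raise ValueError("no GPU usage recorded")
    return {
        model: total_cost * seconds / total_seconds
        for model, seconds in gpu_seconds_by_model.items()
    }
```

For example, splitting a made-up bill of 1000 across usage of 600, 300, and 100 GPU-seconds assigns 600, 300, and 100 respectively. It does not account for idle capacity or for the price difference between GPU types, so it is a first approximation rather than an answer.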

I'm currently digging into KubeRay and Volcano, wondering if smarter batch scheduling could replace our KEDA setup. KubeRay v1.1.0 supposedly added model pre-warming features. Haven't tested yet. I'll write it up when I do.

TL;DR / Key Takeaways

Business metrics > infrastructure metrics for scaling AI workloads. Queue depth and P99 latency beat CPU graphs every time
Cold starts will ruin your day. Warm pod pools are worth the overhead—10 seconds vs. 4 minutes is night and day
Circuit breakers are non-negotiable. One misbehaving model shouldn't take down your whole platform
Multi-zone costs hurt, but incidents hurt more. Factor in user trust, not just infra bills
Hybrid hardware scheduling is still a nightmare. Those Ascend chips? Still gathering dust

What's your setup look like? Are you also waking up at 3 AM to manually edit YAML files? Anyone successfully wrangling heterogeneous GPU fleets—especially mixing NVIDIA with domestic Chinese silicon? I'm genuinely curious, because our Ascend chips are still sitting in the server room, silently judging me.

Drop your war stories in the comments. Let's trauma-bond over failed deployments.

#AI #Kubernetes #AutoScaling #GPUComputing #MLOps #SiteReliability #CloudNative


Cael Lee