How I Saved My API From Meltdown (And Hit $10k MRR) With Dynamic Route Weighting

By Cael Lee | 10 min read

Last Tuesday at 3:07 AM, my monitoring dashboard lit up like a Christmas tree. Latency spiked to 2.3 seconds. Error rate hit 12%. I was losing $47 per hour in failed API calls while I slept.

By the time I woke up, the system had already healed itself. Not because of magic — because of dynamic route weighting. That night alone saved me from what would've been a $340 support nightmare and probably 3 churned customers.

I've been sitting on this story for 6 months, but after chatting with @levelsio about his Nomad List infrastructure at that MicroConf afterparty in Barcelona, I realized indie hackers NEED to talk more about this stuff. We obsess over landing pages and pricing tiers, but when your API chokes at scale, none of that matters.

Product: APIGate.io

Revenue: $10,247 MRR (as of March 2024)

The Problem That Almost Killed My SaaS

When I launched APIGate.io two years ago, it was dead simple: one load balancer, three backend servers, round-robin routing. Classic setup. I learned it from a DigitalOcean tutorial and called it a day.

That worked beautifully until 500 paying customers showed up.

Here's what round-robin doesn't tell you:

I discovered this the hard way. A single customer's webhook flood took down my entire API for 47 minutes. Lost $1,200 in SLA credits. Three enterprise trials ghosted me.

The numbers from that quarter still sting:

I knew I needed something smarter. Something that could feel the health of each route and adjust in real-time.

What Is Dynamic Route Weighting (For Those Who Glazed Over at "Latency")

Imagine you're at a grocery store with three checkout lines. You naturally pick the shortest one. Now imagine the store has a digital sign above each line showing: "Current wait: 45 seconds, error rate: 0%" and "Current wait: 3 minutes, error rate: 12%."

You'd never pick line #2. That's dynamic route weighting.

In API terms, it's a load balancer that continuously measures each backend server's performance and adjusts traffic distribution based on real-time feedback. Fast, healthy servers get more traffic. Slow, error-prone servers get less (or none).

The "dynamic" part is key. This isn't a static config file. It's a living system that breathes with your infrastructure.

Actually, wait—I should clarify that "breathes" is probably overselling it. It's more like... it reacts. Sometimes poorly. More on that in a bit.
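A minimal sketch of the selection step described above, assuming the weights have already been computed and live in a plain dict. The function name `pick_backend` and the sample weights are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter

def pick_backend(weights, rng=random):
    """Pick one backend with probability proportional to its weight."""
    # Backends whose weight has dropped to zero get no traffic at all.
    live = {name: w for name, w in weights.items() if w > 0}
    if not live:
        raise RuntimeError("no healthy backends available")
    names = list(live)
    return rng.choices(names, weights=[live[n] for n in names], k=1)[0]

# Hypothetical weights: server-2 is struggling, server-3 is fully drained.
weights = {"server-1": 0.9, "server-2": 0.1, "server-3": 0.0}
rng = random.Random(42)  # seeded only so the demo is repeatable
counts = Counter(pick_backend(weights, rng) for _ in range(10_000))
print(counts)
```

Over many requests the healthy server ends up with most of the traffic, the struggling one with a trickle, and the drained one with none.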

How I Built It (Without a PhD in Distributed Systems)

I am not a Google SRE. I'm a solo founder who learned to code through Laracasts and sheer panic. So when I started researching this, I immediately hit walls:

Here's the dead-simple version of what I built:


# Simplified from what actually runs in production
BASELINE_LATENCY_MS = 50  # assumed value for illustration; use your own healthy p50

routes = {
    'server-1': {'weight': 1.0, 'latency_p50': 45, 'error_rate': 0.01},
    'server-2': {'weight': 1.0, 'latency_p50': 230, 'error_rate': 0.12},
    'server-3': {'weight': 1.0, 'latency_p50': 52, 'error_rate': 0.02},
}

def recalculate_weights():
    """Recompute every route's weight in place from its latest metrics."""
    for stats in routes.values():
        # Latency penalty: every 100ms above baseline reduces the score by 15%
        latency_score = max(0, 1 - (stats['latency_p50'] - BASELINE_LATENCY_MS) / 100 * 0.15)

        # Error penalty: every 1% of error rate reduces the score by 20%
        # (error_rate is a fraction, so 0.01 means 1%; 5% or more scores zero)
        error_score = max(0, 1 - stats['error_rate'] * 20)

        # Combined score (error rate hurts more than latency)
        stats['weight'] = latency_score * 0.4 + error_score * 0.6

The magic numbers (0.15, 0.4, 0.6) came from 3 weeks of A/B testing. I tried making latency more aggressive, but it caused oscillation — servers would get penalized, cool down, get flooded with traffic, spike again. The 60/40 split favoring error rate gave me stability.

Well... that's complicated. It gave me more stability. Not perfect stability. I still had issues.
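To make those constants concrete, here is a worked example that applies the same formula to the three sample servers. The 50 ms baseline is an assumption (the post never states one), and `score` is a hypothetical helper name.

```python
BASELINE_LATENCY_MS = 50  # assumption: the baseline is not given in the post

def score(latency_p50, error_rate):
    """Same formula as above: 40% latency score, 60% error score."""
    latency_score = max(0, 1 - (latency_p50 - BASELINE_LATENCY_MS) / 100 * 0.15)
    error_score = max(0, 1 - error_rate * 20)
    return latency_score * 0.4 + error_score * 0.6

samples = {
    "server-1": (45, 0.01),
    "server-2": (230, 0.12),
    "server-3": (52, 0.02),
}
weights = {name: score(*stats) for name, stats in samples.items()}
total = sum(weights.values())
shares = {name: w / total for name, w in weights.items()}

for name in samples:
    print(f"{name}: weight={weights[name]:.3f} share={shares[name]:.1%}")
```

Under that assumed baseline the traffic split comes out at roughly 46%, 15%, and 39%. Server-2's error score is clamped to zero, yet its latency score alone still earns it about 15% of requests, which is one reason a gradual weight reduction is not enough for a server that is failing hard.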

The Results: Numbers Don't Lie

I deployed this on a Thursday afternoon (terrible idea, I know). Here's what happened over the next 30 days:

Week 1 (Pre-deployment baseline):

Week 2-3 (Learning period — the system tuning itself):

Week 4 (Stabilized):

The craziest part? My infrastructure costs didn't change. Same three servers. Same $847/month DigitalOcean bill. I just stopped sending traffic to broken servers.

I think the real win here was that my support inbox went from "API is slow again" to "hey can you add webhook support for Shopify?" That's the kind of problem you want to have.

The "Oh Shit" Moment I Didn't See Coming

Here's where I get honest about the failure I didn't anticipate: the cascading isolation problem.

When Server #2's error rate spiked to 12% (remember the grocery store example?), my system correctly routed traffic away. Server #2's load dropped from 1,200 req/min to 80 req/min. It cooled down. Error rate dropped to 0.5%. The system said "great, it's healthy!" and dumped traffic back on.

Within 3 minutes, Server #2 was at 14% error rate again. This cycle repeated every 5-7 minutes for two hours before I noticed.

The issue? Server #2 had a memory leak that only manifested under load. Low traffic = healthy. High traffic = meltdown. My weighting system was essentially torturing this server with intermittent traffic spikes.

I fixed it by adding hysteresis — a fancy word for "don't trust rapid recovery":
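A minimal sketch of the idea, not the production code: once a server has tripped, keep its weight capped until it has stayed healthy for a full cooldown. The class name `HysteresisGate`, the 5% error threshold, and the 0.1 penalty weight are assumptions; the 5-minute cooldown matches the figure in the checklist later in this post.

```python
import time

class HysteresisGate:
    """Cap a backend's weight until it has stayed healthy for a cooldown."""

    def __init__(self, cooldown_s=300.0, error_threshold=0.05,
                 penalty_weight=0.1, clock=time.monotonic):
        self.cooldown_s = cooldown_s            # 5 minutes
        self.error_threshold = error_threshold  # assumed: 5% errors = unhealthy
        self.penalty_weight = penalty_weight    # assumed trickle of probe traffic
        self.clock = clock
        # backend -> time recovery started, or None while still unhealthy
        self._recovering_since = {}

    def effective_weight(self, backend, raw_weight, error_rate):
        now = self.clock()
        if error_rate >= self.error_threshold:
            # Unhealthy: reset the recovery timer and cap the weight.
            self._recovering_since[backend] = None
            return min(raw_weight, self.penalty_weight)
        if backend not in self._recovering_since:
            return raw_weight  # never tripped, so trust it fully
        started = self._recovering_since[backend]
        if started is None:
            # First healthy reading after a failure: start the timer.
            started = self._recovering_since[backend] = now
        if now - started >= self.cooldown_s:
            del self._recovering_since[backend]  # recovery confirmed
            return raw_weight
        return min(raw_weight, self.penalty_weight)
```

A server that looks healthy at 80 req/min keeps receiving only the capped trickle until the cooldown has passed, so a load-dependent fault no longer gets a full flood of traffic the moment its error rate dips.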

This one change eliminated 90% of the oscillation. I lost an entire Saturday debugging it.

Actually, that's not true. I lost Saturday AND most of Sunday morning. My girlfriend was pissed. She'd planned some brunch thing and I was sitting there in my underwear watching Grafana charts like they were the World Cup final.

What @levelsio Taught Me About Over-Engineering

I almost went down a rabbit hole building a full service mesh with distributed tracing and predictive ML-based routing. I had the architecture diagrams. I bought a domain name (RouteBrain.io — terrible, I know).

Then I saw Pieter Levels tweet: "My entire infrastructure is one $20/month server and some bash scripts. Stop overcomplicating shit."

He's right. For 99% of indie hackers, you don't need Istio or Linkerd or a Kubernetes operator. You need:

  1. Health check endpoints on your backends
  2. A metrics collector (I used Prometheus + 50 lines of Go)
  3. A weighted random selector with feedback
  4. Hysteresis to prevent flapping

That's it. 200 lines of code. No new infrastructure. No $500/month service mesh.

I actually DM'd him after that tweet and he responded with "lol exactly." Probably the highlight of my month. Sad, I know.

Real Talk: What I'd Do Differently

Looking back, there are three things I'd change:

1. Start with canary deployments

I flipped the switch globally. Dumb. I should've routed 10% of traffic through the new weighting system, compared it against the old round-robin, and scaled up gradually. My first deployment caused a 4-minute latency spike because of a config typo. Canary would've caught that with minimal blast radius.
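One way to get that 10% split is to hash a stable identifier into a bucket, so the same customer always lands on the same side of the experiment. This is a sketch under assumptions: `in_canary`, `choose_strategy`, and the use of a customer ID as the key are all hypothetical.

```python
import hashlib

def in_canary(customer_id: str, percent: int = 10) -> bool:
    """Deterministically place about `percent`% of customers in the canary."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket 0..99
    return bucket < percent

def choose_strategy(customer_id: str, percent: int = 10) -> str:
    # Canary customers use the new weighting; everyone else stays on round-robin.
    return "dynamic-weighting" if in_canary(customer_id, percent) else "round-robin"

print(choose_strategy("cust-1001"))
```

Raising `percent` in steps widens the rollout without reshuffling customers who are already in the canary group.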

2. Add circuit breakers from day one

Dynamic weighting handles "slow and sick" servers, but it doesn't handle "completely dead" servers fast enough. If a server returns 500 errors for 10 seconds straight, you need an immediate circuit break — not a gradual weight reduction. I added this later, but those 10-second windows cost me real money.
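A minimal circuit-breaker sketch for that case. The 10-second trip window comes from the paragraph above; the 30-second open period and the class name `CircuitBreaker` are assumptions for illustration.

```python
import time

class CircuitBreaker:
    """Cut a backend off after a sustained run of failures."""

    def __init__(self, trip_after_s=10.0, open_for_s=30.0, clock=time.monotonic):
        self.trip_after_s = trip_after_s  # failures must persist this long
        self.open_for_s = open_for_s      # assumed pause before probing again
        self.clock = clock
        self._failing_since = None        # start of the current failure streak
        self._opened_at = None            # when the breaker last tripped

    def record(self, success: bool) -> None:
        now = self.clock()
        if success:
            # Any success ends the streak and closes the breaker.
            self._failing_since = None
            self._opened_at = None
            return
        if self._failing_since is None:
            self._failing_since = now
        if now - self._failing_since >= self.trip_after_s:
            self._opened_at = now  # trip, or re-trip after a failed probe

    def allows_traffic(self) -> bool:
        if self._opened_at is None:
            return True
        # After the open period, let a probe request through (half-open).
        return self.clock() - self._opened_at >= self.open_for_s
```

The load balancer would call `allows_traffic()` before considering a backend at all, and only then apply the dynamic weights to the backends that remain.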

3. Build the observability dashboard FIRST

I spent 3 weeks tuning my weighting algorithm against metrics I couldn't visualize. Once I built a simple Grafana dashboard showing per-route latency, error rates, and current weights, I found optimization opportunities I'd completely missed. For example: I was penalizing servers for high P99 latency when P50 was fine — meaning 1% of slow requests were tanking the weight of an otherwise healthy server.
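A small illustration of why the choice of percentile matters, using made-up latency samples: 99 fast requests and a single slow one.

```python
from statistics import quantiles

# Hypothetical sample: 99 requests at 50 ms and one 3-second outlier.
latencies_ms = [50] * 99 + [3000]

cuts = quantiles(latencies_ms, n=100)  # 99 cut points: cuts[0]=P1 ... cuts[98]=P99
p50, p99 = cuts[49], cuts[98]
print(f"P50={p50:.0f} ms  P99={p99:.0f} ms")
```

The median stays at 50 ms while P99 is dominated by the one outlier, so a weight driven by P99 would punish a server that is serving almost every request quickly.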

Oh, and one more thing I forgot to mention earlier: I was running this on Python 3.11.2 with the asyncio event loop, and I ran into a weird issue where asyncio.gather() seemed to occasionally drop health check tasks when the event loop was under heavy load. Took me two weeks to figure that out. Switched to asyncio.create_task() with explicit error handling and the problem disappeared. Probably saved me from another 3 AM wake-up call.
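A sketch of what "create_task() with explicit error handling" can look like, not the production code. `check_health` is a hypothetical stand-in for a real HTTP probe. Two documented asyncio behaviours are worth knowing here: `gather()` with its default `return_exceptions=False` raises the first exception to the caller, so the other results never reach your code, and the event loop keeps only weak references to tasks, so each task should be held in a variable or collection until it finishes.

```python
import asyncio

async def check_health(name):
    # Stand-in for a real probe; "server-2" fails to exercise the error path.
    await asyncio.sleep(0)
    if name == "server-2":
        raise ConnectionError(f"{name} did not respond")
    return {"latency_p50": 45, "error_rate": 0.01}

async def poll_all(names):
    # The dict holds a strong reference to every task until it is awaited.
    tasks = {name: asyncio.create_task(check_health(name)) for name in names}
    results, failures = {}, {}
    for name, task in tasks.items():
        try:
            results[name] = await task
        except Exception as exc:  # one failed probe must not hide the others
            failures[name] = exc
    return results, failures

results, failures = asyncio.run(poll_all(["server-1", "server-2", "server-3"]))
print(sorted(results), sorted(failures))
```

Every probe gets an outcome recorded, either a result or the exception it raised, so a failing backend shows up in `failures` instead of silently vanishing from the poll.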

The Revenue Impact (Because That's What We're All Here For)

This isn't an infrastructure flex. This directly impacted my bottom line:

The math: dynamic routing took me ~40 hours to build and test. At my current $10k MRR, that's about $2,500 worth of my time (40 hours is roughly a quarter of a 160-hour month). It's already saved me $1,200 in SLA credits and prevented an estimated $7,000 in churn. ROI is absurd.

I'm not gonna lie though — the first month after deployment, I was checking my phone every 20 minutes convinced something was about to break. Imposter syndrome hits different when you've built the thing that could take down your entire business.

Your Turn: The Stupid-Simple Version

If you're running an API with more than one backend server, here's your weekend project:

  1. Add a /health endpoint to each backend that returns {"latency_p50": 45, "error_rate": 0.01}
  2. In your load balancer, poll this every 10 seconds
  3. Use weighted random selection where weight = 0.4 * latency_score + 0.6 * error_score (the same scores as in the code above, where higher means healthier)
  4. Add a 5-minute cooldown before restoring weight to a previously unhealthy server
  5. Monitor for 2 weeks and adjust the sensitivity knobs
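For step 1, the backend needs somewhere to get those two numbers from. A minimal sketch, assuming a rolling window of recent requests kept in memory; `HealthWindow` and its window size are hypothetical, and the field names follow the routes table earlier in the post.

```python
import json
from collections import deque
from statistics import median

class HealthWindow:
    """Track recent requests and summarise them for a /health response."""

    def __init__(self, size=1000):
        # Only the most recent `size` requests are kept: (latency_ms, ok).
        self.samples = deque(maxlen=size)

    def record(self, latency_ms, ok):
        self.samples.append((latency_ms, ok))

    def payload(self):
        if not self.samples:
            return {"latency_p50": 0, "error_rate": 0.0}
        latencies = [ms for ms, _ in self.samples]
        errors = sum(1 for _, ok in self.samples if not ok)
        return {
            "latency_p50": median(latencies),
            "error_rate": errors / len(self.samples),
        }

window = HealthWindow()
for latency_ms, ok in [(40, True), (45, True), (50, True), (500, False)]:
    window.record(latency_ms, ok)
print(json.dumps(window.payload()))
```

Your request handler calls `record()` once per request, and the /health route returns `payload()` as JSON for the load balancer to poll.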

That's it. Ship it. Thank me later.

From what I've seen in the IH community, most people are still running bare round-robin and praying. Which, honestly, works until it doesn't. And when it doesn't, it's always at 3 AM on a Saturday.

TL;DR / Key Takeaways

Product: APIGate.io — API gateway for indie SaaS founders

Revenue: $10,247 MRR | Churn: 2.9% | CAC: $42 | LTV: $1,847

How are you handling load balancing? Still on round-robin? Had a 3 AM outage story worse than mine? Drop it in the comments — I read every single one and I'm genuinely curious what's working for other bootstrappers. Especially if you've tried something weird like DNS-based failover or that new Cloudflare dynamic steering thing they launched in January.

#buildinpublic #infrastructure #saas #devops #indiehackers

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
