How Our "Clever" Multi-Model Router Burned $2,200 Without Anyone Noticing
Last Wednesday, 3pm-ish. I'm nursing a flat white, mentally checked out for the afternoon, when Slack pings.
Finance team. "Your AWS bill this month is 47% higher than last month. Care to explain?"
I stared at that number. Proper stared. For a good ten seconds.
Here's the thing — our traffic hadn't budged. Daily active users were actually down 3%. Where the hell was this money going?
Two days of digging later, I found the culprit: our "genius" multi-model routing strategy. Well — the strategy itself wasn't broken. We just trusted it to run on autopilot. Big mistake.
Here's the full post-mortem. Hopefully it saves you from making the same expensive assumptions.
Multi-model routing: the promise vs. the reality
Let me set the scene. Last November, we built a smart routing layer into our API gateway. The logic was straightforward enough:
- Simple queries (greetings, weather checks, basic chit-chat) → GPT-3.5-turbo. Cheap and cheerful.
- Medium complexity (code explanations, document summaries) → Claude 3 Haiku. At the time, the price-to-performance ratio was genuinely brilliant.
- Hard problems (complex reasoning, long-form generation) → GPT-4o. Expensive, but sometimes you just need the big guns.
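A minimal sketch of that three-tier mapping, for readers who prefer code to bullet points. This is not our production gateway: the `Tier` enum, the `MODEL_FOR_TIER` table, and the `route` function are illustrative names, and the model IDs are simply the ones mentioned in this post.

```python
from enum import Enum


class Tier(Enum):
    SIMPLE = "simple"
    MEDIUM = "medium_complexity"
    HARD = "hard"


# Hypothetical routing table: one model per complexity tier.
MODEL_FOR_TIER = {
    Tier.SIMPLE: "gpt-3.5-turbo-0125",
    Tier.MEDIUM: "claude-3-haiku-20240307",
    Tier.HARD: "gpt-4o",
}


def route(tier: Tier) -> str:
    """Return the model a request of the given tier is sent to."""
    return MODEL_FOR_TIER[tier]


print(route(Tier.MEDIUM))  # claude-3-haiku-20240307
```

The point to notice is how static this table is. Nothing in it knows what each model currently costs.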
On paper, this should've saved us 30-40% on API costs. And for the first two months, it did — we were down about 35%, proudly presenting the numbers at our weekly standup. Smug doesn't even cover it.
Then month three hit. The cost curve started creeping up. Silently. We didn't notice until Finance came knocking.
Failure mode #1: The classifier was bleeding us dry
We used a distilled BERT-base model to classify incoming requests. Each inference cost roughly $0.00003. Sounds negligible, right?
Here's the kicker: it runs on every. single. request. When your daily volume jumps from 100K to 800K, that "cheap" classifier suddenly costs over $720/month. I triple-checked the maths before I believed it.
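The scaling is easy to sanity-check. The sketch below works backwards from the $720/month figure, assuming a 30-day month and exactly one classifier call per request; `monthly_classifier_cost` is an illustrative helper, not something from our codebase.

```python
def monthly_classifier_cost(per_inference_usd: float, daily_requests: int, days: int = 30) -> float:
    """Classifier spend per month when every request is classified once."""
    return per_inference_usd * daily_requests * days


# Per-inference price implied by a $720 monthly bill at 800K requests/day.
implied_price = 720 / (800_000 * 30)

print(f"{implied_price:.5f}")                                  # 0.00003
print(round(monthly_classifier_cost(implied_price, 100_000)))  # 90
print(round(monthly_classifier_cost(implied_price, 800_000)))  # 720
```

Same per-call price, eight times the volume, eight times the bill. A fixed per-request overhead never amortises.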
But the real facepalm moment came when I dug into the logs. About 23% of requests classified as "medium complexity" were actually things like "hello" or "ok thanks bye". The classifier had learned a weird shortcut — during training, short texts were mostly imperative commands, so it started associating brevity with complexity.
# Actual log entries I pulled at 2am, questioning my life choices
2025-01-15 14:23:07 | user_input: "OK, got it, thanks"
2025-01-15 14:23:07 | classifier_output: medium_complexity (confidence: 0.78)
2025-01-15 14:23:07 | routed_to: claude-3-haiku-20240307
2025-01-15 14:23:07 | should_be: gpt-3.5-turbo-0125
# 5x cost difference. To reply "you're welcome."
Using a sledgehammer to crack a nut — thousands of times a day. It adds up fast.
Actually, let me correct myself. I said 23% misclassification, but that's not quite right. It was 23% overestimated complexity, plus another 8% underestimated (requests that should've hit GPT-4o went to Haiku, produced rubbish, and needed retries). Total routing deviation: roughly 31%. I'll come back to this number.
Failure mode #2: The silent model upgrade nobody told us about
17 March. I remember the date because it was a Monday, and Mondays are cursed.
Anthropic quietly bumped Claude 3 Haiku to a minor version (haiku-20240307 → haiku-20240317). Better reasoning, sure — but also 18% more expensive, jumping from $0.25/MTok to $0.295/MTok.
Our routing config? Hardcoded in a YAML file. It kept merrily forwarding requests to Haiku, completely oblivious that the cost equation had shifted. By the time Finance threw the spreadsheet at me, this "silent upgrade" had been running for three weeks.
Extra cost: about $2,200.
Not catastrophic, but enough to make my face go crimson during the next team sync. The lesson hit me like a brick: model versions aren't static, but our routing logic was frozen in amber. That's a ticking time bomb. How did I not see this coming?
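The check that would have caught this is small. Here is a hedged sketch of the comparison step only: it assumes you already have a dictionary of the prices you budgeted against and a dictionary of current prices from whatever source you trust. The function name, the dictionary shapes, and the 1% tolerance are all my own illustrative choices.

```python
def price_changes(known: dict[str, float], current: dict[str, float], tolerance: float = 0.01) -> list[str]:
    """Describe every model whose per-MTok price moved beyond `tolerance`, or that is new."""
    alerts = []
    for model, price in current.items():
        old = known.get(model)
        if old is None:
            alerts.append(f"{model}: new model at ${price}/MTok")
        elif abs(price - old) / old > tolerance:
            change = (price - old) / old * 100
            alerts.append(f"{model}: ${old} -> ${price}/MTok ({change:+.0f}%)")
    return alerts


known = {"claude-3-haiku": 0.25}
current = {"claude-3-haiku": 0.295}
print(price_changes(known, current))
# ['claude-3-haiku: $0.25 -> $0.295/MTok (+18%)']
```

Anything this function returns is worth a message to whoever owns the routing config.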
Failure mode #3: Our caching layer got completely bypassed
This one hurt the most. Physically hurt.
We had a Redis cache humming along at a respectable 35% hit rate. Nothing spectacular, but solid. Then multi-model routing entered the picture, and suddenly the same question could land on different models because of tiny classifier fluctuations.
Real example:
- Request #1: "How do Python list comprehensions work?" → classifier scores 0.81 → routes to Claude 3 Haiku
- Five minutes later, same exact question → classifier scores 0.79 → routes to GPT-3.5
Our cache key included the model name. So we ended up with two cached responses for identical queries. Cache hit rate plummeted from 35% to 18%. Redis costs stayed flat, but the share of requests that fell through to a paid API call jumped 17 percentage points, from 65% to 82%, out of nowhere.
I spent an entire afternoon tracing this. When I finally spotted the root cause, I genuinely wanted to slap myself. Who designs a cache key that's sensitive to routing jitter? Oh wait — I did.
To be precise, it wasn't really the classifier "fluctuating". BERT produces slightly different embeddings for semantically similar inputs, and we were using the L2 norm of those embeddings as our complexity score. That value wobbles by about ±0.03. The difference between 0.79 and 0.81? Just noise.
A quick diagnostic framework (that I now run every Friday)
After crawling out of this mess, I built a simple health check. Takes about 20 minutes, and I run it religiously every Friday afternoon.
1. Routing deviation rate
deviation_rate = (requests where actual_route ≠ ideal_route) / total_requests
Ours was 31%. The target is under 5%; from what I've seen, even well-run teams sit somewhere in the 3-8% range.
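A minimal sketch of that calculation, assuming you have a sample of log records where someone has filled in the ideal route by hand. The field names `routed_to` and `should_be` mirror the log entries earlier in this post; the function itself is illustrative.

```python
def deviation_rate(records: list[dict]) -> float:
    """Share of requests whose actual route differs from the ideal route."""
    if not records:
        return 0.0
    mismatches = sum(1 for r in records if r["routed_to"] != r["should_be"])
    return mismatches / len(records)


sample = [
    {"routed_to": "claude-3-haiku-20240307", "should_be": "gpt-3.5-turbo-0125"},
    {"routed_to": "gpt-3.5-turbo-0125", "should_be": "gpt-3.5-turbo-0125"},
    {"routed_to": "gpt-4o", "should_be": "gpt-4o"},
    {"routed_to": "gpt-4o", "should_be": "gpt-4o"},
]
print(deviation_rate(sample))  # 0.25
```

The expensive part is the labelling, not the arithmetic. A few hundred hand-labelled requests per week is likely enough to spot a trend.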
2. Model cost-efficiency curve
Weekly snapshot of each model's "tokens per dollar". I chart it in Grafana next to our Prometheus metrics; the panel itself runs SQL against the api_logs table, and the query looks something like:
SELECT model_name,
       SUM(token_count) / SUM(cost) AS tokens_per_dollar,
       DATE_TRUNC('week', timestamp) AS week
FROM api_logs
GROUP BY model_name, week
ORDER BY week DESC;
If any model's efficiency drops more than 10% week-over-week, it's time to check for version bumps.
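The week-over-week rule can be sketched like this, assuming you have exported the weekly tokens-per-dollar values per model, oldest first. The function name and data shape are illustrative. The Haiku row is derived from the $0.25 and $0.295 per-MTok prices mentioned earlier; the other row is made up.

```python
def efficiency_drops(weekly: dict[str, list[float]], threshold: float = 0.10) -> dict[str, float]:
    """Return models whose latest tokens-per-dollar fell more than `threshold` vs the prior week.

    `weekly` maps model name to tokens-per-dollar values, oldest first.
    """
    flagged = {}
    for model, series in weekly.items():
        if len(series) < 2 or series[-2] <= 0:
            continue  # not enough history to compare
        drop = (series[-2] - series[-1]) / series[-2]
        if drop > threshold:
            flagged[model] = drop
    return flagged


weekly = {
    "claude-3-haiku": [1_000_000 / 0.25, 1_000_000 / 0.295],
    "gpt-3.5-turbo": [2_000_000, 1_980_000],  # made-up numbers
}
for model, drop in efficiency_drops(weekly).items():
    print(f"{model}: down {drop:.1%}")  # claude-3-haiku: down 15.3%
```

Worth noticing: an 18% price rise shows up as roughly a 15% drop in tokens per dollar, which still trips a 10% threshold.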
3. Cache dilution factor
cache_dilution = total_cache_entries / distinct_semantic_contents
This is the average number of cached copies per distinct piece of content, so 1.0 means no duplicates. Anything above 1.2 is a warning sign. We hit 2.7. Absolute disaster.
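A rough sketch of the measurement, reading the metric as cache entries per distinct piece of content (the reading under which values like 1.2 and 2.7 make sense). It approximates "same semantic content" with a case- and whitespace-insensitive exact match, which is a simplifying assumption of mine; `normalize` and `cache_dilution` are illustrative names.

```python
def normalize(prompt: str) -> str:
    """Crude stand-in for semantic equality: lowercase and collapse whitespace."""
    return " ".join(prompt.lower().split())


def cache_dilution(entries: list[tuple[str, str]]) -> float:
    """Average number of cache entries per distinct prompt; 1.0 means no duplication.

    `entries` is a list of (model_name, prompt) pairs, one per cached response.
    """
    if not entries:
        return 1.0
    distinct = {normalize(prompt) for _, prompt in entries}
    return len(entries) / len(distinct)


entries = [
    ("claude-3-haiku", "How do Python list comprehensions work?"),
    ("gpt-3.5-turbo", "How do Python list comprehensions work?"),
    ("gpt-3.5-turbo", "hello"),
]
print(cache_dilution(entries))  # 1.5
```

Three entries covering two distinct prompts gives 1.5, already over the warning line.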
What we actually did to fix it
Three changes, two weeks to implement:
- Added a "trivial mode" to the classifier. If the input is under 10 characters, skip the classifier entirely and go straight to GPT-3.5-turbo. Cost to implement: basically zero. Monthly savings: $200+. Sometimes the simplest fixes are the best ones.
- Automated model version monitoring. A dead-simple cron job that runs at 2am UTC, pulling pricing and version info from provider APIs. If anything changes, it fires a Slack alert. Built with curl and jq, nothing fancy. Both Anthropic and OpenAI have model-listing endpoints you can poll with your API key; most people just don't use them.
- De-model-ified the cache key. Now it's just a SHA256 hash of the input content. Routing decisions don't affect cache hits. Hit rate bounced back to 34%, basically where we started.
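The first and third changes are small enough to sketch in a few lines. Treat this as an illustration under assumptions, not our actual code: the constant and function names are made up, and stripping whitespace before measuring length is my own choice.

```python
import hashlib

TRIVIAL_MAX_CHARS = 10  # inputs shorter than this skip the classifier


def is_trivial(user_input: str) -> bool:
    """True for very short inputs that go straight to the cheapest model."""
    return len(user_input.strip()) < TRIVIAL_MAX_CHARS


def cache_key(user_input: str) -> str:
    """Cache key derived from the input content only, never the routed model."""
    return hashlib.sha256(user_input.encode("utf-8")).hexdigest()


print(is_trivial("hello"))               # True
print(is_trivial("OK, got it, thanks"))  # False (18 characters)

question = "How do Python list comprehensions work?"
print(cache_key(question) == cache_key(question))  # True, whichever model answers
```

One caveat the sketch makes visible: a 10-character cutoff catches "hello" but not an 18-character message like "OK, got it, thanks", so the classifier still sees those. Raising the cutoff is a trade-off you would want to test against real traffic.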
Next month's bill? Back to normal. The team exhaled. I exhaled.
The uncomfortable truth: routing strategies need constant feeding
If I've learned one thing from this fiasco, it's that multi-model routing isn't a "set it and forget it" kind of deal.
It's more like a houseplant with an attitude problem. Every time a model provider ships an update or tweaks their pricing — and 2024-2025 has been absolutely relentless, with GPT-4o, Claude 3.5, and Gemini 2.0 all landing within months — your routing logic can silently go stale. The cost creep is gradual, like a frog in slowly boiling water. By the time you notice, you're thousands of dollars in the hole.
There's a saying I've heard floating around the infra community that nails it: "AI infrastructure isn't Lego. It's bonsai." You've got to tend it constantly, or it'll quietly die in the corner — or grow so wild you can't rein it back in.
If you're running multi-model routing, do yourself a favour: spend an hour this week running those three diagnostics. I'd bet actual money there's a surprise hiding in your bill somewhere.
Have you run into something similar? Or found a clever way to monitor API costs? I'm slightly obsessed with this topic right now — especially if anyone's using Langfuse or Helicone for cost observability. Drop your war stories in the comments. Misery loves company.
#AI #LLM #backend #costoptimisation #softwareengineering
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.