How an $11,000 API Bill Forced Me to Rebuild Our Entire AI Routing System (And What I Learned)
Last Thursday at 3 PM, our finance person messaged me on Slack. Just a screenshot of our API bill—$11,000 for the month—followed by a single knife emoji. 🔪
My back literally went cold.
Turns out, a batch processing job had a bug in its loop. Instead of finishing in 20 minutes, it kept calling Claude 3 Opus all night. This was a text summarization task. GPT-4o mini could've handled it for about $14. Instead, we burned roughly 800 times that amount.
I was borderline depressed for two days.
Here's the context: our team's been building an AI-assisted writing platform for about a year. We integrated six models—GPT-4o, GPT-4o mini, Claude 3.5 Sonnet, Claude 3 Haiku, DeepSeek-V3, and Qwen-Max. At first, I thought more models = better flexibility. Wrong. Different tasks have wildly different model requirements. Summarization? Cheap models work fine. Deep analysis? You need the heavy hitters. Multilingual translation? Some models completely fall apart with certain language pairs.
Our initial solution was embarrassingly simple: random assignment.
One month in, our bill exploded. Complex tasks kept landing on small models, producing garbage output that needed retries. Simple tasks hogged the expensive models. Classic "penny wise, pound foolish"—except we were being stupid in both directions.
Three Painful Lessons I Learned the Hard Way
Lesson 1: Token-based pricing isn't straightforward at all
My original plan was dead simple: track input and output tokens per request, multiply by the unit price, done.
First month's reconciliation made me feel like an idiot. The pricing logic varies so dramatically across models that a "multiply by a coefficient" approach is basically useless.
GPT models charge separately for input and output, with output costing 3-5x more. Claude also splits pricing, but it's roughly 30% more expensive than GPT overall. DeepSeek's input pricing is absurdly cheap, but its output pricing matches GPT-4o mini. And Qwen-Max? It charges extra for context beyond 8K tokens—documented, but buried so deep in the docs that I missed it entirely.
Here's a trap I almost didn't catch: some models charge for system prompts, others don't. We'd been calculating everything based on input tokens, which meant we were underestimating Claude's cost by roughly 15%. Actually—let me correct myself, I just pulled up the old records—it was 18.7%, not 15%. Claude charges for system prompts at the full token rate without the various optimizations applied to regular input, so the effective price is even higher.
We ended up rebuilding our pricing model from scratch. It looks something like this:
```javascript
// Notice all the edge cases: each one represents real money lost.
// Prices are in USD per 1K tokens.
const pricingModel = {
  'gpt-4o': {
    inputPrice: 0.0025,
    outputPrice: 0.01,
    cachedInputPrice: 0.00125, // prompt caching is half price
    freeSystemPrompt: true
  },
  'claude-3.5-sonnet': {
    inputPrice: 0.003,
    outputPrice: 0.015,
    cacheWritePrice: 0.00375, // cache writes cost extra
    cacheReadPrice: 0.0003,
    freeSystemPrompt: false // system prompts count toward billing
  },
  // ... other models
};
```
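To make that concrete, here is a minimal sketch of how a per-request estimate might be computed from entries like these. The `estimateCost` helper and the `usage` field names (`inputTokens`, `systemTokens`, `cachedTokens`, `outputTokens`) are assumptions for illustration, not any provider's API and not the exact code we run. Prices are taken as USD per 1K tokens.

```javascript
// Two entries copied from the pricing model above (USD per 1K tokens).
const prices = {
  'gpt-4o': {
    inputPrice: 0.0025,
    outputPrice: 0.01,
    cachedInputPrice: 0.00125,
    freeSystemPrompt: true,
  },
  'claude-3.5-sonnet': {
    inputPrice: 0.003,
    outputPrice: 0.015,
    cacheReadPrice: 0.0003,
    freeSystemPrompt: false,
  },
};

// Hypothetical helper. Assumes inputTokens excludes the system prompt,
// which is tracked separately in systemTokens.
function estimateCost(model, usage) {
  const p = prices[model];
  if (!p) throw new Error(`No pricing entry for model: ${model}`);

  const { inputTokens = 0, systemTokens = 0, cachedTokens = 0, outputTokens = 0 } = usage;

  // System prompt tokens count as input unless the entry marks them free.
  const billableInput = inputTokens + (p.freeSystemPrompt ? 0 : systemTokens);

  // Tokens served from cache are billed at the cache-read rate.
  const cachedRate = p.cachedInputPrice ?? p.cacheReadPrice ?? p.inputPrice;
  const cached = Math.min(cachedTokens, billableInput);
  const uncached = billableInput - cached;

  const total = uncached * p.inputPrice + cached * cachedRate + outputTokens * p.outputPrice;
  return total / 1000; // prices are per 1K tokens
}

const usage = { inputTokens: 2000, systemTokens: 500, outputTokens: 400 };
console.log(estimateCost('gpt-4o', usage).toFixed(4));            // 0.0090
console.log(estimateCost('claude-3.5-sonnet', usage).toFixed(4)); // 0.0135
```

Same request, same token counts, and the estimate differs by 50% once the system prompt flag and the per-model rates are applied. Cache write fees are left out of this sketch on purpose; they depend on how often a prefix is rewritten, which a single request can't tell you.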
One data point that really drives this home: after enabling prompt caching, Claude's cost dropped 42%. But for the first two weeks, we weren't accounting for cache write fees, so our actual savings were only 32%. I'd even bragged in our weekly report about "42% cost reduction." Had to sheepishly issue a correction later.
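A rough sketch of why leaving out cache writes inflates the number, using the Claude 3.5 Sonnet prices from the pricing model above. The `prefixSavings` helper and the token and read counts are made up for illustration; they are not our real traffic and won't reproduce the 42% and 32% figures exactly.

```javascript
// USD per 1K tokens, from the Claude 3.5 Sonnet entry above.
const INPUT = 0.003;
const CACHE_WRITE = 0.00375;
const CACHE_READ = 0.0003;

// Hypothetical helper: fraction saved on a shared prompt prefix that is
// written to the cache once and then read `reads` times, compared with
// sending it uncached on all (reads + 1) requests.
// With countWrites = false, the first request is billed at the plain input
// rate, which is the mistake of ignoring the cache write surcharge.
function prefixSavings(prefixTokens, reads, { countWrites = true } = {}) {
  const uncached = (reads + 1) * prefixTokens * INPUT;
  const writeCost = prefixTokens * (countWrites ? CACHE_WRITE : INPUT);
  const cached = writeCost + reads * prefixTokens * CACHE_READ;
  return 1 - cached / uncached;
}

const pct = (x) => (x * 100).toFixed(1) + '%';
console.log(pct(prefixSavings(1000, 1, { countWrites: false }))); // 45.0%
console.log(pct(prefixSavings(1000, 1)));                          // 32.5%
console.log(pct(prefixSavings(1000, 0)));                          // -25.0%
```

The last line is the one worth remembering: at these prices, a prefix that gets written and never read costs 25% more than not caching at all.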
Lesson 2: Choosing by price alone is spectacularly dumb
For a while, I became obsessed with DeepSeek. It's ridiculously cheap—input pricing is literally one-tenth of GPT-4o. I thought, why even think about this? Just switch everything over!
So I routed all summarization tasks to DeepSeek.
By day three, user feedback was pouring in like a waterfall. English summaries were... okay. Chinese summaries had this weird translation-like stiffness—the kind where "he made a decision" never just becomes "he decided." Japanese summaries were a disaster. The honorific system was completely broken. One user straight-up asked, "Are you using machine translation?"
We scrambled to do a spot check, sampling 500 outputs. Here's roughly what we found:
| Model | Chinese Summary Score | Japanese Summary Score | Avg Latency | Cost per 1K requests |
|---|---|---|---|---|
| GPT-4o mini | 4.2/5 | 4.0/5 | 1.2s | $0.39 |
| DeepSeek-V3 | 3.8/5 | 2.9/5 | 0.9s | $0.08 |
| Claude 3 Haiku | 4.3/5 | 4.1/5 | 1.5s | $0.49 |
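One way to turn a spot check like this into a routing rule is to pick the cheapest model that clears a quality floor for the target language, rather than the cheapest model outright. Here is a minimal sketch using the numbers from the table. The `pickModel` helper, the `zh`/`ja` keys, and the score floors are assumptions for illustration, not a recommendation.

```javascript
// Spot-check results from the table above
// (scores out of 5, cost in USD per 1K requests).
const spotCheck = [
  { model: 'GPT-4o mini', scores: { zh: 4.2, ja: 4.0 }, costPer1k: 0.39 },
  { model: 'DeepSeek-V3', scores: { zh: 3.8, ja: 2.9 }, costPer1k: 0.08 },
  { model: 'Claude 3 Haiku', scores: { zh: 4.3, ja: 4.1 }, costPer1k: 0.49 },
];

// Hypothetical helper: cheapest model whose score for `lang` is at least
// `minScore`. Returns null when nothing qualifies, so the caller decides
// how to fall back.
function pickModel(results, lang, minScore) {
  const eligible = results.filter((r) => (r.scores[lang] ?? 0) >= minScore);
  if (eligible.length === 0) return null;
  return eligible.reduce((best, r) => (r.costPer1k < best.costPer1k ? r : best)).model;
}

console.log(pickModel(spotCheck, 'ja', 4.0)); // GPT-4o mini
console.log(pickModel(spotCheck, 'zh', 3.5)); // DeepSeek-V3
console.log(pickModel(spotCheck, 'ja', 4.5)); // null
```

With a 4.0 floor, DeepSeek-V3 never gets Japanese summaries no matter how cheap it is, while a looser floor still lets it take the Chinese ones.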
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.