I Cut Our LLM API Bill by 62% — Here's the Mixed Billing Model We Actually Use
I thought my £2,400 monthly LLM API bill was ridiculous until the team next door casually mentioned they'd blown £376,000 last month. They hadn't even accounted for the intern who accidentally turned a GPT-4 loop into an infinite recursion nightmare.
Plot twist: they noticed when their monitoring dashboard literally crashed.
I've been in this game for 8 years now — from fine-tuning BERT back when that was cutting-edge to wrangling today's GPT-4-class models. Here's the thing: the hardest problem has never been "which model should we pick?" It's decoding your bloody bill. Last month, our finance director slapped my expense report onto the table and asked me to explain — in plain English — why identical workloads suddenly cost triple. No architecture changes. No traffic spikes. Just... triple.
That conversation sparked a full rebuild of our cost model across three business lines. I figured I'd share what we learned before someone else's CFO goes nuclear.
The Mess We Got Ourselves Into
Last November, we launched a customer service platform powered by LLMs. Three separate workloads, one shared API pipeline:
- Live chat — high concurrency, short exchanges, predictable daily rhythm
- Ticket analysis — low frequency, absurdly long context windows
- Marketing copy generation — unpredictable volume, massive token counts per call
We took the lazy route. Opened a pay-as-you-go account with one of the big providers — yes, the one that hiked their prices in March 2024. You know exactly which one I mean.
The first month's bill landed. Our CTO very nearly gave me a ceremonial burial under the server racks.
What went wrong? Pay-as-you-go looks flexible on paper, but peak-hour pricing is brutal. Our live chat system handles a flood between 10 AM and 3 PM — QPS jumping from 20 to 200+ in minutes. During those windows, on-demand rates hit nearly 3x what we'd pay with reserved capacity. Even worse? The marketing team kept running batch jobs at 2 AM, precisely when we had zero reserved resources allocated. Every single call got billed at maximum surge pricing.
Classic consequence of skipping hybrid billing planning.
So What Actually Is Mixed Billing?
ELI5 version: it's like your home water supply. You can pre-buy water at a bulk discount (pre-purchased bundles), pay a flat fee for a guaranteed flow rate whether you use it or not (reserved throughput), and when unexpected guests show up, you pay a premium for the overflow (on-demand).
For enterprise LLM APIs, the hybrid model has three layers:
1. Pre-Purchased Bundles (Monthly/Annual Commitments)
This works brilliantly for stable, predictable workloads. Our live chat handles roughly 8,000 sessions daily, averaging 900 tokens each — about 216 million tokens monthly. Buy the equivalent resource bundle upfront, and your per-token cost drops to 40-60% of on-demand pricing.
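The arithmetic behind that monthly figure can be sketched as below. The session numbers come from the paragraph above; the 30-day month and the on-demand price are assumptions added for illustration, not figures from our bill.

```python
# Session figures come from the paragraph above; the 30-day month and the
# on-demand price are assumptions for illustration only.
SESSIONS_PER_DAY = 8_000
TOKENS_PER_SESSION = 900
DAYS_PER_MONTH = 30
HYPOTHETICAL_ON_DEMAND_PRICE_PER_1K = 0.01  # USD, made up

def monthly_tokens(sessions_per_day, tokens_per_session, days=DAYS_PER_MONTH):
    """Total tokens consumed in one month."""
    return sessions_per_day * tokens_per_session * days

def bundle_cost(tokens, on_demand_price_per_1k, bundle_ratio):
    """Monthly cost when the bundle rate is bundle_ratio x the on-demand rate."""
    return tokens / 1000 * on_demand_price_per_1k * bundle_ratio

tokens = monthly_tokens(SESSIONS_PER_DAY, TOKENS_PER_SESSION)
print(f"{tokens:,} tokens/month")  # 216,000,000 tokens/month

# Compare the 40% and 60% bundle ratios against full on-demand pricing.
for ratio in (0.4, 0.6, 1.0):
    cost = bundle_cost(tokens, HYPOTHETICAL_ON_DEMAND_PRICE_PER_1K, ratio)
    print(f"at {ratio:.0%} of on-demand: ${cost:,.2f}")
```

Swap in your own provider's rate; the point is only that the bundle saving scales linearly with a volume you can predict.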
The catch? These bundles expire. We straight-up wasted around £2,400 in March because our marketing pipeline got restructured, usage got cut in half, and our leftover tokens simply... vanished. Nobody warned us about the expiry policy. Read the fine print.
2. Reserved Throughput (TPM Commitments)
This one flies under the radar. Some providers let you commit to a minimum TPM (Tokens Per Minute), and in exchange they'll slash your per-token rate by another 30%. Our ticket analysis pipeline uses this approach — we committed to 5,000 TPM minimum and got a genuinely comfortable discount.
The trade-off? You're paying for that capacity at 3 AM whether anyone's using it or not.
Actually, let me correct myself here. Not all providers treat reserved throughput the same way. Azure OpenAI's PTU (Provisioned Throughput Units) charges by the hour regardless of actual usage. Anthropic's TPM commitment model is slightly more forgiving, with some wiggle room on the floating margin. We evaluated three providers before committing, and I'll break down the comparison in a follow-up post.
3. On-Demand (Pay-As-You-Go)
This is your overflow valve. Expensive but flexible. We use it to absorb peak-hour spikes in live chat and the marketing team's unpredictable late-night sprints.
Here's the split we eventually landed on:
- 60% pre-purchased bundles (stable base)
- 20% reserved throughput (discount on committed minimum)
- 20% on-demand (surge absorption)
Real numbers from three months of actual billing data: The hybrid model saved 62% versus pure on-demand, and roughly 18% versus buying all pre-paid bundles (which would've wasted capacity). I pulled these figures from Excel, not a vendor's white paper. March: $12,400 (pure on-demand). April: dropped to $4,700 after switching. May: fine-tuned down to $4,200.
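For anyone who wants to check the percentage, here is the calculation as a small sketch. It uses the three monthly totals quoted above and assumes the workload was roughly comparable from month to month, which is the only way a month-on-month comparison like this is meaningful.

```python
# Monthly totals in USD, as quoted in the paragraph above.
monthly_bill = {"March": 12_400, "April": 4_700, "May": 4_200}

def savings_pct(baseline, current):
    """Percentage saved relative to the baseline bill."""
    return (baseline - current) / baseline * 100

for month in ("April", "May"):
    pct = savings_pct(monthly_bill["March"], monthly_bill[month])
    print(f"{month}: {pct:.1f}% below March")
# April: 62.1% below March
# May: 66.1% below March
```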
Three Real-World Patterns
Case 1: E-Commerce Customer Service (High Concurrency, Extreme Peaks)
Mid-sized online retailer, averaging 100,000 LLM calls daily, with volumes quintupling during sales events. They went with 60% annual commitment + 30% on-demand + 10% reserved.
Why so little reserved? Because e-commerce peaks are too extreme. Commit to high reserved capacity and you're haemorrhaging money during quiet weeks. Commit too little and you're dead during Black Friday.
Their strategy was brutally simple: buy a short-term bundle (some providers offer 7- or 30-day packages) right before major promotions, then lean on on-demand for the rest of the year. Last Singles' Day (11 November in China; think Black Friday but bigger), pure on-demand would've cost them $184,000. The hybrid approach landed at around $88,000. The $96,000 difference is roughly two junior engineers' annual salaries.
Case 2: SaaS Tool (Multi-Tenant, Long-Tail Usage)
An AI document analysis platform we collaborated with had completely unpredictable per-tenant behaviour. No clear peaks, just scattered usage throughout the day. They went with something counterintuitive: 80% reserved throughput + 15% pre-paid + 5% on-demand.
Their logic: baseline usage during office hours was remarkably stable (document processing during working hours), so reserved throughput could drive per-token costs into the floor. Pre-paid bundles? Barely touched — their token distribution was too fragmented, and bundles kept expiring.
I'll be honest — I still think this was slightly bonkers. But they made it work. Average cost per tenant dropped from $1.20 to $0.40. Took them two months of tuning though, including one catastrophic Monday morning when a major client batch-uploaded documents, hit their reserved limit, and triggered a 429 error cascade. The client nearly walked.
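A 429 cascade like that one usually gets worse when every client retries immediately. The sketch below shows one common client-side mitigation, exponential backoff with jitter. It is not what that team ran; `RateLimited`, `call_with_backoff`, and the `send` callable are hypothetical names standing in for whatever your API wrapper exposes.

```python
import random
import time

class RateLimited(Exception):
    """Stand-in for a provider's HTTP 429 response (hypothetical class)."""

def call_with_backoff(send, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call send(), retrying on RateLimited with exponential backoff.

    The wait before retry n is a random value between 0 and
    base_delay * 2**n seconds, so clients do not retry in lockstep.
    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            sleep(random.uniform(0, base_delay * 2 ** attempt))

# Usage with a fake request that is rejected twice before succeeding.
responses = iter([RateLimited(), RateLimited(), "ok"])

def fake_send():
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item

print(call_with_backoff(fake_send, base_delay=0.01))  # ok
```

Backoff only smooths short bursts. If the overflow is sustained, the request still needs somewhere to go, which is what the on-demand layer is for.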
Case 3: Content Platform (Low Frequency, Marathon-Length Texts)
An AI novel-writing platform: tiny call volumes, but each generation runs tens of thousands of tokens. On-demand pricing punishes long contexts mercilessly. Some providers charge a premium once a request crosses a context-length threshold, and the rule is often buried deep in the pricing page where nobody looks.
They negotiated a custom TPM commitment + volume discount tier directly with the provider — committing to at least $160,000 monthly consumption in exchange for a stepped discount table. Single novel generation (roughly 100K tokens) dropped from $4.70 to $1.80.
But here's the reality check: you need serious volume to get those custom deals. I'd say below $40,000 monthly spend, don't even bother picking up the phone.
The Mistakes That Cost Me Sleep
Looking back, I can count at least three craters I personally jumped into:
Mistake 1: Token Estimates Based on Vibes
When I first built our cost model, I asked each business unit for their usage estimates. They were off by 40%. Forty. Percent.
I ended up piping a month of full access logs through a Python script — nothing fancy, just pandas and matplotlib — to map actual token distributions and QPS curves. That's when I discovered the marketing team's night-time batch jobs were running at double the frequency I'd assumed.
Modelling without logs is like driving with your eyes closed.
Quick and dirty setup: we added a logging middleware at the API wrapper layer, dumped every call's model name, token count, and timestamp into BigQuery, then spun up a Grafana dashboard. Three weeks of data revealed a weird spike every Sunday at 3-5 AM. Turned out marketing had a cron job generating weekly report copy — 800 GPT-4 calls, 2,000 tokens each. Nobody had mentioned this to engineering. Of course.
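The slot analysis that surfaced the Sunday spike boils down to summing tokens per weekday and hour. Below is a standard-library sketch of that idea; our real script used pandas against BigQuery exports, and the record shape, function name, and sample rows here are invented for illustration.

```python
from collections import Counter
from datetime import datetime

def tokens_by_slot(calls):
    """Sum tokens per (weekday name, hour) slot.

    `calls` is an iterable of (iso_timestamp, model, tokens) tuples; this
    record shape is an assumption, not our production log schema.
    """
    totals = Counter()
    for iso_ts, _model, tokens in calls:
        ts = datetime.fromisoformat(iso_ts)
        totals[(ts.strftime("%A"), ts.hour)] += tokens
    return totals

# Synthetic sample rows: a Sunday-night batch job and Monday daytime chat.
sample_calls = [
    ("2024-04-07T03:10:00", "gpt-4", 2000),
    ("2024-04-07T03:40:00", "gpt-4", 2000),
    ("2024-04-07T04:05:00", "gpt-4", 2000),
    ("2024-04-08T11:00:00", "gpt-3.5-turbo", 900),
    ("2024-04-08T11:30:00", "gpt-3.5-turbo", 900),
]

totals = tokens_by_slot(sample_calls)
# Print the busiest slots first; unexpected entries here are the ones to chase.
for (weekday, hour), tokens in totals.most_common(3):
    print(f"{weekday} {hour:02d}:00  {tokens:>6} tokens")
```

With a month of real logs, any slot that ranks high but that nobody on the engineering side recognises is worth a conversation with the team that owns it.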
Mistake 2: Hidden Pricing Rules Between Models
I naively assumed one provider would use consistent pricing logic across all models. Wrong. Dead wrong.
Our provider — the one you're thinking of — applies extra coefficients for long-context GPT-4 class models, but not for GPT-3.5. When we switched ticket analysis to long-context mode, the bill jumped overnight. Took me three days of forensic accounting to find the explanation buried in paragraph three of an FAQ.
Specifically: on our provider's price list, GPT-4-32K carried a 1.5x multiplier beyond 16K tokens. GPT-4 Turbo didn't. We were on GPT-4-32K, and our average ticket analysis ran 18K tokens. Literally two thousand tokens past the threshold. Two thousand.
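To see why two thousand tokens mattered, here is a sketch of a threshold rule like the one described above. Two things are assumptions: the base price is made up, and the code treats the multiplier as applying to the whole request once it crosses the threshold. Check how your own provider words the rule, because a surcharge on only the excess tokens gives a much smaller jump.

```python
# Threshold and multiplier as described in the text; the base price is a
# made-up figure for illustration.
THRESHOLD_TOKENS = 16_000
LONG_CONTEXT_MULTIPLIER = 1.5
HYPOTHETICAL_PRICE_PER_1K = 0.06  # USD

def request_cost(tokens, price_per_1k=HYPOTHETICAL_PRICE_PER_1K,
                 threshold=THRESHOLD_TOKENS,
                 multiplier=LONG_CONTEXT_MULTIPLIER):
    """Cost of one request under a whole-request surcharge rule (assumed)."""
    base = tokens / 1000 * price_per_1k
    return base * multiplier if tokens > threshold else base

at_threshold = request_cost(16_000)
over_threshold = request_cost(18_000)
print(f"16K tokens: ${at_threshold:.2f}")
print(f"18K tokens: ${over_threshold:.2f}")
print(f"extra tokens: {18_000 / 16_000 - 1:.1%}, "
      f"extra cost: {over_threshold / at_threshold - 1:.1%}")
```

Under these assumptions, a request 12.5% longer costs roughly 69% more, which is the kind of step that shows up as an overnight jump in the bill.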
Mistake 3: The Devil in Reserved Throughput Details
Reserved throughput isn't a set-it-and-forget-it deal. Most providers' SLAs say "best effort" — exceed your committed TPM, and they might not throttle you, but they'll charge you at penalty rates that can exceed even on-demand pricing.
I learned this during load testing. Wrote a k6 script, pushed it hard, forgot to set an upper limit, and accidentally hit 120,000 TPM against our 30,000 commitment. The overflow got billed at $0.06 per 1K tokens. Our reserved rate was $0.02. Three times the price. That little stress test cost $800.
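A back-of-envelope sketch of how that bill accumulates, using the rates and TPM figures above. The function name is hypothetical, and the duration estimate assumes the test sat at its peak the whole time, which it almost certainly didn't.

```python
def overflow_cost_per_minute(used_tpm, committed_tpm, overflow_price_per_1k):
    """USD billed per minute for tokens above the committed TPM."""
    overflow_tokens = max(0, used_tpm - committed_tpm)
    return overflow_tokens / 1000 * overflow_price_per_1k

# 120,000 TPM against a 30,000 TPM commitment, overflow at $0.06 per 1K.
per_minute = overflow_cost_per_minute(120_000, 30_000, 0.06)
print(f"${per_minute:.2f} per minute of overflow")
# $5.40 per minute of overflow

# Minutes of sustained peak load needed to reach the $800 bill.
print(f"~{800 / per_minute:.0f} minutes to reach $800")
# ~148 minutes to reach $800
```

The practical lesson is to cap the load generator at or just above the committed TPM unless you have budgeted for the overflow in advance.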
Finance emailed me. Then called. Then visited my desk.
Practical Advice (Paid for in Scar Tissue)
If you're planning a hybrid billing strategy right now, here's what I wish someone had told me:
- Run a full month of real logs before modelling anything. Don't trust business teams' estimates. Don't trust the vendor's "average usage calculator." We used LangSmith for cost tracking plus a custom log parser. Cheap and effective.
- Model each business line separately. Different QPS curves and token distributions shouldn't share the same bucket — you'll end up with the expensive workloads subsidised by the cheap ones, and the numbers won't make sense.
- Set reserved throughput at 80% of your lowest trough. This way you're not wasting capacity most of the time, and peaks get absorbed by on-demand. Our ticket analysis pipeline bottoms out at ~2,000 TPM around 2-4 AM, so we committed to 1,600.
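- Read the bloody bundle expiry policy before buying. Some providers allow refunds or extensions and some don't. In our evaluation, AWS SageMaker resource bundles were refundable and Azure OpenAI's were not, but terms change, so confirm them in writing before you commit. Knowing this can save your budget.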
- Set up billing alerts with a daily threshold. We rigged a Slack notifier using AWS Lambda + CloudWatch — every morning at 9 AM, it pushes the previous day's API spend. If it exceeds 150% of the daily average, it @mentions me directly. It's triggered three times: twice for normal business growth, once for a bug that was silently hammering our endpoint.
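Two of the rules in the list above are simple enough to write down as code. This is a sketch of the logic only, not our Lambda function: the function names are hypothetical and the spend figures in the usage lines are invented.

```python
from statistics import mean

def reserved_commitment(trough_tpm, fraction=0.8):
    """Commit to a fraction (80% by default) of the lowest observed TPM."""
    return round(trough_tpm * fraction)

def should_alert(yesterday_spend, recent_daily_spend, threshold=1.5):
    """True when yesterday's spend exceeds threshold x the recent daily average."""
    return yesterday_spend > threshold * mean(recent_daily_spend)

# Ticket analysis bottoms out at about 2,000 TPM, as noted in the list above.
print(reserved_commitment(2_000))  # 1600

# Invented daily spend figures (USD) for illustration.
recent = [140, 150, 160]
print(should_alert(250, recent))  # True
print(should_alert(200, recent))  # False
```

How many days go into the average is a judgment call: a short window reacts to growth quickly but also makes a slow leak look normal sooner.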
Oh, and one more thing — and I cannot stress this enough — cost optimisation isn't just an ops or finance problem. It's an architectural decision. Which model you choose, how you cache, prompt length optimisation, whether you batch requests... these technical choices impact your bill far more than which billing tier you're on. We trimmed our average prompt from 1,200 tokens to 800 and saved roughly $1,600 monthly. That's faster and easier than restructuring your entire billing model.
TL;DR
- Mixed billing = pre-paid bundles + reserved throughput + on-demand. Start with a 60:20:20 split and tune from there.
- Model from real logs, not spreadsheets full of guesswork.
- Different models and context lengths have hidden pricing rules. Read the docs cover to cover.
- Don't over-commit on reserved throughput — overflow penalty pricing can be worse than on-demand.
- Set up monitoring alerts. Know about billing anomalies the moment they happen, not when finance forwards you an email with subject line "URGENT: Please Explain."
What's your monthly LLM bill looking like these days? Anyone else got a "configuration error that accidentally burned through five figures" story? Those are my favourite — equal parts horrifying and educational. Drop it in the comments so we can all feel better about our own mistakes.
Edit: Wow, wasn't expecting this many responses overnight. A few of you asked about the vendor comparison (Azure vs Anthropic vs the Chinese providers) — I'll write that up properly this weekend. Also, several people DM'd me about the Python log analysis script. I can't share our internal codebase directly (compliance would shred me), but the logic is straightforward: intercept API request/response logs, extract token counts and timestamps, run time-series analysis with pandas. Search GitHub for "llm-cost-analyzer" — the documentation is a mess, but the code logic is solid enough to adapt. There's also a decent thread on r/MachineLearning about this.
#LLM #APICostOptimisation #CloudBilling #HybridBilling #EnterpriseAI #TrueStoriesFromProduction
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.