I Cut My AI API Bill by 82% — Here's What Actually Worked
I Cut My AI API Bill by 82% — Here's What Actually Worked
Last year, I built a code review tool using GPT-4. The first month's bill nearly gave me a heart attack — $1,200. My manager looked at me the way investors look at founders who say "we'll be profitable next quarter, trust me."
Painful.
So I spent three months running Token optimisation experiments. Prompt compression, model selection, caching strategies, batch processing — the works. Got the costs down to 18% of what they were. Everything I'm about to share was paid for in actual money. If you're staring at AI bills that make you wince, this should help.
You're Not Buying AI — You're Buying Tokens
Let's address the elephant in the room: why does your AI bill always blow past the budget?
Because most of us treat AI like a colleague, but it bills like a telegraph service — per character. Every character you send, every character you receive, cha-ching. GPT-4 Turbo charges $0.01 per 1K input tokens and $0.03 per 1K output tokens. Claude 3 Opus is pricier — $0.075 per 1K output tokens. A token is roughly 0.75 of an English word, or about 0.5 Chinese characters.
Here's the maths. Say you ask AI to review 500 lines of code. That's roughly 8,000 tokens in, 3,000 tokens out for the review comments. Single call cost: (8,000/1,000 × $0.01) + (3,000/1,000 × $0.03) = $0.08 + $0.09 = $0.17.
Doesn't sound bad, right?
Now run that 500 times a day. That's $85. Per day. Monthly? $2,550.
And that's just code review. Add code generation, documentation, translation — the numbers double without breaking a sweat.
There's a stat I can't shake. Semianalysis published a report — Q2 2024, if I remember correctly — showing that enterprise AI applications waste 40-60% of tokens. The culprits? Redundant system prompts, repeated context transmission, unnecessarily verbose outputs. Translation: half your bill is paying for filler.
Actually, let me correct myself. That 40-60% figure came from Semianalysis's Q2 2024 report, sampling mostly North American SaaS companies. From what I've seen in other markets, it's often worse — some cultures tend towards longer prompts by default. My own project? 63% waste rate before optimisation. Sixty-three percent.
First Cut: Slash Prompts, Not Features
My biggest rookie mistake? Writing prompts like product requirement documents.
Here's what my original code review prompt looked like:
You are a senior full-stack engineer with 15 years of software development experience,
proficient in Python, JavaScript, Go, and multiple programming languages.
You have worked at top tech companies like Google and Meta,
specialising in code review, architecture design, performance optimisation...
(200 more words omitted)
Now, please review the following code and identify potential issues and improvements.
Please format your output as follows:
1. Critical issues (system crashes or security vulnerabilities)
2. Medium issues (performance or maintainability impact)
3. Minor issues (code style or naming conventions)
... (another 150 words omitted)
This prompt? 487 tokens. Every. Single. Call.
15,000 calls a month meant 7.3 million tokens burned on prompts alone — $73 down the drain. What did that $73 buy? A bunch of backstory the AI absolutely didn't need.
I did three things:
1. Compressed role descriptions by 80%
Changed "You are a senior engineer with 15 years of experience who worked at Google" to "You are an expert code reviewer." The AI doesn't need your CV — it needs task context.
We tested this internally with 200 annotated samples. After removing role backstories, GPT-4's accuracy on code review dropped by... 0.3%. Input tokens dropped by 65%. Trading 0.3% accuracy for 65% cost savings? I'll take that deal every time.
2. Used examples instead of rule descriptions
Instead of 200 words describing output format, just show an example. Few-shot prompting isn't just cheaper — it works better:
Review the following code. Output a list of issues.
Example output:
- [CRITICAL] SQL injection risk, line 12 missing parameterised query
- [MEDIUM] N+1 query problem, line 28 executing DB query inside loop
This prompt is 87 tokens — 82% smaller. And here's the unexpected bonus: with concrete examples, the model's output format became more consistent. Before, the parser would randomly break because the model decided to get creative with formatting.
This is actually counterintuitive. I assumed more rules would mean more consistent output. Nope. More rules, more ways for the model to "misinterpret" them. The format got less stable, not more.
3. Separated system and user prompts
Most APIs let you split system prompts from user prompts. System prompts can be cached (more on that in a bit), user prompts change each time. Fixed instructions go in system, variable code goes in user. Combined with caching, this saves a surprising amount.
Second Cut: Caching — The Most Underrated Money Saver
June 2024. Anthropic launched Prompt Caching. OpenAI followed in October.
The AI Twitter-sphere went nuts about it, but honestly? Few people I know actually implemented it. I did. The results blew past my expectations.
The principle is dead simple: if your prompt contains repeated content, the API caches it automatically. Subsequent calls only charge 10%. But the trigger conditions are strict — content must be byte-for-byte identical and exceed 1,024 tokens (both Claude and OpenAI use this threshold).
This means you need to deliberately structure your prompts for cacheability.
Here's my approach — split prompts into three parts:
[Fixed system instructions] → cacheable
[Fixed example outputs] → cacheable
[Variable user code] → not cacheable
The first two chunks total about 512 tokens. The third averages 3,000 tokens. With caching enabled, those first 512 tokens cost 90% less. Monthly, that chunk went from $76 to $7.60.
Ninety percent. Just from restructuring prompts.
For the truly aggressive: batch processing. If your task doesn't need real-time response — overnight code reviews, batch test generation — use Batch API. OpenAI's Batch API gives you 50% off. Anthropic's Message Batches, same deal. The trade-off? You might wait up to 24 hours.
I moved all non-urgent code reviews to batch processing. Another 40% saved.
Real numbers from our team: 6 developers, roughly 200 code review requests daily. 60 real-time (urgent PRs), 140 batched (routine scans). Pre-optimisation monthly cost: $850. Post-optimisation: $153. My manager finally stopped giving me that look.
Third Cut: Stop Using a Sledgehammer to Crack a Nut
Another classic mistake: using the most powerful model for everything.
GPT-4 writes brilliant code. But for classification tasks? Summarisation? Format conversion? GPT-4o mini is perfectly adequate — at one-twentieth the price. One. Twentieth.
I built a simple routing system:
- Code generation / complex reasoning: GPT-4o or Claude 3.5 Sonnet
- Code classification / language detection: GPT-4o mini
- Document summarisation / tag extraction: GPT-4o mini
- Code formatting / syntax checking: GPT-4o mini or even a local small model
We ran A/B tests on an internal tool: 1,000 code review tasks randomly assigned to the big model (GPT-4o) versus the small one (GPT-4o mini). For clear-cut tasks like "detect SQL injection risks," accuracy was 94.2% vs 92.8%. Less than 2 percentage points difference.
But the cost difference? 20x.
The truly ruthless approach: use local models for pre-screening. We run Llama 3.1 8B on our dev server — a single RTX 4090 handles it fine. It scans all code first, only sending "suspicious" segments to the cloud model for deep analysis. Result? 70% of code never generates an API call.
The core of cost optimisation isn't using cheaper models — it's matching each task to the *exactly* right-sized model. That's the biggest lesson from three months of optimisation work.
Output Control: Don't Let AI Write Novels
Here's a cost black hole nobody talks about: output length.
AI models are pathologically helpful. Ask "does this code have issues?" and it won't just say "yes." It'll explain why, how to fix it, show corrected code examples, share best practices, and end with a warm "happy coding!" message.
Those extra tokens? All money. Real money.
My rule of thumb: explicitly constrain output format and length in the prompt.
Bad prompt: Review this code and point out issues.
Good prompt: Review code. List only critical issues (crashes or security vulnerabilities).
Keep each issue under 30 words. If no critical issues, reply "None."
That "reply 'None' if no issues" line alone saved 30% on output costs. The AI's default behaviour is to always find something to say. Even with flawless code, it'll write "This code looks good, though you might consider..." followed by 200 words of nitpicking.
Another technique: hard-limit with max_tokens. I typically set it to 1.2x the expected output length — some buffer, not much. If code reviews usually need 500 tokens, I set 600. Even if the prompt fails to constrain output, you won't get a surprise £10 response.
I started with max_tokens=2000 because I was paranoid. Turns out 80% of calls produced under 500 tokens. That extra 1,500 was pure waste.
Monitoring: The First Step to Saving Is Actually Seeing the Money
Let's get meta for a moment: if you can't see where money's going, you can't stop it.
I set up cost monitoring using Langfuse's open-source version (free, took about half an hour to deploy). It tracks token consumption and cost per API call. The data was... illuminating.
One colleague had commented-out legacy code in their prompt — sent with every call — wasting $40 a month. An automation script had a bug causing duplicate requests, burning $200 before anyone noticed. The most absurd? A scheduled task running over bank holidays, reviewing an empty repository, consuming 2,000 tokens each time to output "No code changes detected."
Our monitoring dashboard now tracks three core metrics:
- Per-call cost distribution: flags anomalously expensive calls
- Token efficiency: useful output ÷ total tokens — alerts if below 30%
- Model usage ratio: if big models exceed 50% of calls, it's review time
These stats auto-post to our Slack channel weekly. Everyone's costs are transparent.
When spending becomes visible, saving becomes natural. Not my quote — that's basic behavioural economics. But damn if it doesn't work.
TL;DR
- Compress prompts — kill the CV backstory, use examples instead of rules (saves ~60% on input)
- Leverage Prompt Caching + Batch API (saves 50-90%)
- Right-size your models — simple tasks get small models, complex tasks get the big guns (saves ~80%)
- Cap output length — explicitly say "reply 'None' if no issues" (saves ~30% on output)
- Monitor everything — make every penny visible and trackable
What's your monthly AI API bill looking like? Ever had a bill that made you choke on your coffee? Drop a comment — I'll help you spot where to trim.
AI #TokenOptimisation #CostControl #PromptEngineering #LLM
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.