I Spent $5,400 on GPT-4 Last Month — Here's What I Should Have Used Instead
Last Wednesday, during our weekly retrospective, our tech lead threw a number on the screen that made the entire room go dead silent.
$5,400. That was our API bill for the last six months.
One NLP project alone burned through $3,200 on GPT-4. Meanwhile, the team next door ran nearly identical workloads on DeepSeek-V3 for about $420. Their QA team blind-tested 200 responses against ours — and the pass rate difference was under 3 percentage points.
I just stared at those numbers. $3,200 versus $420.
Honestly? It broke me a little.
So everything that follows is paid for with real money — mine, specifically, from a budget I had to defend in front of our CFO. If you're still trying to figure out which model API to use in 2025, I hope these numbers and the mistakes I made save you some serious cash.
TL;DR for the Impatient
- For 80% of tasks, you don't need GPT-4. In our tests, DeepSeek-V3 cost about 1/27th as much as GPT-4o per output token (1/36th per input token) and scored within 5% on our human ratings.
- Check concurrency limits before you pick a model. Cheap models often cap QPS hard — and you'll find out at the worst possible time.
- Benchmarks lie. Test on your own data. Always.
- Prompt migration costs are real. Switching models can tank performance by 30% if you don't rewrite prompts.
- My recommendation: DeepSeek-V3 for general tasks, Claude 3.5 Sonnet for code, Doubao Pro-256K for long documents, GPT-4o only when you really need multimodal.
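If you want that recommendation in code form, here's a minimal routing sketch. The task categories and the `pick_model` helper are made up for illustration, and the model names are just the labels from the list above, not the exact identifiers any provider's API expects.

```python
# Recommended model per task type, taken from the TL;DR above.
# The keys are hypothetical task categories; the values are display
# names, not provider-specific API model identifiers.
ROUTES = {
    "general": "DeepSeek-V3",
    "code": "Claude 3.5 Sonnet",
    "long_document": "Doubao Pro-256K",
    "multimodal": "GPT-4o",
}


def pick_model(task_type: str) -> str:
    """Return the recommended model, falling back to the general-purpose one."""
    return ROUTES.get(task_type, ROUTES["general"])


print(pick_model("code"))
print(pick_model("intent_detection"))  # unknown type, falls back to general
```

The fallback is the important part: anything you haven't explicitly classified goes to the cheap general model, so the expensive ones only get traffic you deliberately send them.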
First: Stop Looking at Leaderboards
Before you even glance at a benchmark, grab a piece of paper and answer three questions:
- What are you actually doing? Text classification? Customer support chats? Complex reasoning and code generation? The model that's best for each of these is wildly different.
- What's your daily call volume? A few hundred requests versus a few million — completely different selection logic.
- How long can users wait? 200ms versus 2 seconds is the difference between "this feels slick" and "is this thing broken?"
I've seen too many teams default to the strongest model available. Then end-of-month hits and they realize 80% of their requests were simple intent detection and keyword extraction. Tasks that GPT-3.5-level models handle fine. You don't need GPT-4 for that.
It's like using a Ferrari to buy groceries.
Can you? Sure. Is it expensive? Oh yeah.
What We Actually Measured (January 2025)
Over the last three months, my team built an automated evaluation pipeline and tested the major model APIs against our own business data — 1,000 real user queries across customer support, content generation, and code assistance. We measured four dimensions: text generation, code capability, reasoning, and multimodal.
Quick correction — for multimodal, we only tested image understanding. Video and audio? Too expensive, and honestly, we don't have a use case yet. Maybe later this year.
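If you want to run this kind of test on your own data, the core loop is small. What follows is a rough sketch of the idea, not our actual pipeline: `evaluate`, `fake_model`, and `fake_judge` are hypothetical names, and you would swap the two stand-ins for a real API call and a real scoring step (human rating or an automated judge).

```python
import time
from statistics import mean
from typing import Callable, Sequence


def evaluate(
    model_fn: Callable[[str], str],
    judge_fn: Callable[[str, str], float],
    queries: Sequence[str],
) -> dict:
    """Run every query through model_fn, time it, and score the answer."""
    scores, latencies = [], []
    for query in queries:
        start = time.perf_counter()
        answer = model_fn(query)
        latencies.append(time.perf_counter() - start)
        scores.append(judge_fn(query, answer))
    return {
        "n": len(queries),
        "avg_score": mean(scores),
        "avg_latency_s": mean(latencies),
    }


# Stand-ins so the sketch runs without any API key.
def fake_model(query: str) -> str:
    return query.upper()


def fake_judge(query: str, answer: str) -> float:
    return 5.0 if answer else 1.0


report = evaluate(fake_model, fake_judge, ["reset my password", "write a haiku"])
print(report)
```

Run the same query set through each candidate model and compare the reports side by side. The point is that the queries are yours, not a benchmark's.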
Here's the real data, collected between January 6–12, 2025:
1. Text Generation & General Conversation
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Avg Latency | Human Score (1-5) |
|---|---|---|---|---|
| DeepSeek-V3 | $0.28 | $1.12 | 1.2s | 4.5 |
| Qwen-Max | $2.80 | $8.40 | 0.8s | 4.3 |
| GPT-4o | $10.08 | $30.24 | 1.5s | 4.7 |
| Claude 3.5 Sonnet | $7.70 | $23.10 | 1.8s | 4.6 |
| Doubao Pro-256K | $0.70 | $2.10 | 0.6s | 4.2 |
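To turn those per-token prices into a monthly number, here's a small sketch. The prices are copied from the table above; the workload (50M input tokens and 10M output tokens per month) is a made-up example, so plug in your own volumes.

```python
# USD per 1M tokens as (input, output), copied from the table above.
PRICES = {
    "DeepSeek-V3": (0.28, 1.12),
    "Qwen-Max": (2.80, 8.40),
    "GPT-4o": (10.08, 30.24),
    "Claude 3.5 Sonnet": (7.70, 23.10),
    "Doubao Pro-256K": (0.70, 2.10),
}


def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated monthly spend in USD for the given token volumes."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical workload: 50M input tokens and 10M output tokens per month.
INPUT_TOKENS, OUTPUT_TOKENS = 50_000_000, 10_000_000

# Print models from cheapest to most expensive for this workload.
for model in sorted(PRICES, key=lambda m: monthly_cost(m, INPUT_TOKENS, OUTPUT_TOKENS)):
    print(f"{model:<18} ${monthly_cost(model, INPUT_TOKENS, OUTPUT_TOKENS):,.2f}")
```

The gap depends on your input/output mix: at these prices GPT-4o is 36x DeepSeek-V3 on input tokens and 27x on output tokens, so input-heavy workloads see the bigger difference.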
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.