I Compared 2025's Multimodal AI API Prices—Some Are 10x Cheaper Than Others
Last week, our team was building an image moderation pipeline and decided to benchmark a few multimodal large language models. We pulled the API pricing from half a dozen providers. Same task. Same images. Same prompts. The cost difference? Nearly 10x between the cheapest and most expensive options.
I stared at that spreadsheet for a solid minute. If I hadn't done this comparison, our end-of-month bill would've been brutal.
So here's the real-world pricing breakdown for multimodal AI APIs in 2025—not the marketing pages, not the "starting at" nonsense, but what developers actually pay. I've also included the mistakes I made along the way. Some of them were expensive. One cost me a very awkward conversation with my boss.
Why Token-Based Pricing Is a Minefield
Here's the thing about multimodal models: they don't just process text. They chew through images, audio, and video—all of which get chopped into tokens. And the way each provider calculates those tokens? Wildly different.
Take a 1024×1024 image. One provider might count it as 200 tokens. Another might clock it at 800. If you're building a budget around the lower estimate, you're in for a nasty surprise.
I learnt this the hard way last November. We were processing a batch of product photos, and I assumed the image token cost would fall into the cheapest tier. Plot twist: it didn't. Our bill tripled. After two hours digging through documentation, I discovered the model was calculating high-resolution images by "tiles"—each tile adding more tokens, and a large image eating up over a thousand tokens in one go. When my boss asked why we'd blown through the budget, I had to explain the entire tokenisation architecture of multimodal models. It was not my finest moment.
2025 Multimodal API Pricing: The Actual Numbers
I've pulled together the providers most developers I know are using. These prices are accurate as of mid-March 2025—but fair warning, these numbers shift faster than British weather. One provider quietly changed their pricing rules while I was writing this article (I spotted it during a routine check on Tuesday). Always verify against the official docs before committing.
1. OpenAI (GPT-4o / GPT-4o-mini)
OpenAI is the one provider on this list that isn't based in China, but plenty of teams still use it, so it's worth including as a reference point.
- GPT-4o: $2.50/million input tokens, $10.00/million output tokens
- GPT-4o-mini: $0.15/million input tokens, $0.60/million output tokens
- Image token calculation: 85 tokens flat in low-detail mode; high-detail mode costs 170 tokens per 512×512 tile on top of an 85-token base
For most small-to-medium teams, GPT-4o is steep. We tested 1,000 images through an audit task—GPT-4o cost roughly $18, while GPT-4o-mini came in at $1.20. The accuracy difference? Less than 5%. Unless you genuinely need bleeding-edge reasoning, the mini version is the sensible choice. The money we saved could fund a fairly impressive office snack budget.
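If you want to sanity-check an image bill before sending anything, the tile rule can be sketched in a few lines. This is a rough estimate based on OpenAI's published description for GPT-4o (85 base tokens plus 170 per 512px tile in high-detail mode). The function name is mine, and the "never upscale small images" step is an assumption worth checking against the current docs. Other models may use different constants.

```javascript
// Rough estimate of GPT-4o image input tokens, following OpenAI's published
// tile rule. The function name is illustrative, not part of any SDK.
function estimateOpenAIImageTokens(width, height, detail = 'high') {
  const BASE_TOKENS = 85;      // flat cost; the entire cost in low-detail mode
  const TOKENS_PER_TILE = 170; // per 512x512 tile in high-detail mode
  if (detail === 'low') return BASE_TOKENS;

  // Step 1: shrink to fit inside a 2048x2048 square (never enlarge).
  const fit = Math.min(1, 2048 / Math.max(width, height));
  let w = width * fit;
  let h = height * fit;

  // Step 2: shrink so the shortest side is at most 768px.
  // Assumption: images already smaller than that are left as they are.
  const shrink = Math.min(1, 768 / Math.min(w, h));
  w = Math.round(w * shrink);
  h = Math.round(h * shrink);

  // Step 3: count 512px tiles and add the base cost.
  const tiles = Math.ceil(w / 512) * Math.ceil(h / 512);
  return BASE_TOKENS + TOKENS_PER_TILE * tiles;
}

console.log(estimateOpenAIImageTokens(1024, 1024));        // 765
console.log(estimateOpenAIImageTokens(1024, 1024, 'low')); // 85
```

The gap between 85 and 765 tokens for the same image is the kind of difference that quietly multiplies a budget, so it's worth deciding per task whether you really need high-detail mode.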
2. Zhipu GLM-4V
Zhipu was one of the earlier players in China's multimodal space. Their GLM-4V now handles both images and video, and they had a massive update wave in early 2024 that got everyone talking.
- GLM-4V-Flash (free tier): Good for lightweight testing, but rate-limited
- GLM-4V-Plus: ¥0.01 per thousand input tokens, ¥0.01 per thousand output tokens
- Image tokens: Base 85 + resolution surcharge; a typical image lands around 300–500 tokens
I'm genuinely fond of their Flash tier—it costs nothing for prototyping. But for production, you'll need Plus. Flash has fairly strict concurrency limits, and during peak hours you'll queue. I once got hit with a 429 error at 3pm on a Wednesday and had to wait nearly two minutes for it to recover.
Wait—I should correct myself. That 429 wasn't actually Flash's fault. I hadn't properly handled retry logic, so my requests stacked up and triggered rate limiting. Adding exponential backoff fixed it. That one's on me, not Zhipu.
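For anyone who hasn't written one before, here's a minimal sketch of retry with exponential backoff and jitter. It isn't the exact code we run. `callApi` is a stand-in for whatever request function you use, and the check on `err.status === 429` assumes your HTTP client attaches the status code to the thrown error, so adapt that to your client.

```javascript
// Delay ceiling for a given attempt: base, 2x base, 4x base, ... capped at capMs.
function backoffDelayMs(attempt, baseMs = 500, capMs = 30000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retries `callApi` on HTTP 429 only. `wait` is injectable so tests can skip real delays.
async function withRetry(callApi, { maxRetries = 5, baseMs = 500, wait = sleep } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await callApi();
    } catch (err) {
      // Assumption: the client exposes the HTTP status as err.status.
      const retryable = Boolean(err) && err.status === 429;
      if (!retryable || attempt >= maxRetries) throw err;
      // Full jitter: a random delay up to the exponential ceiling, so that
      // parallel workers don't all retry at the same instant.
      await wait(Math.random() * backoffDelayMs(attempt, baseMs));
    }
  }
}
```

The jitter matters more than it looks. Without it, every stalled request retries on the same schedule and you trigger the rate limiter all over again.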
Real gotcha with Zhipu, though: video input tokens are calculated by frame extraction. A one-minute video can consume about 15,000 tokens. We didn't clock this initially and uploaded several five-minute videos for testing. Burned through 30% of our monthly budget in a single day. Watching that consumption graph spike was genuinely distressing.
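A cheap guard against this is estimating tokens from duration before uploading. The sketch below uses the rough 15,000-tokens-per-minute figure from our own usage, which is an observation rather than a documented rate, and both function names are hypothetical.

```javascript
// Rough token estimate for a video, assuming cost grows linearly with duration.
// 15,000 tokens per minute is an observed ballpark, not an official figure.
function estimateVideoTokens(durationSeconds, tokensPerMinute = 15000) {
  return Math.ceil((durationSeconds / 60) * tokensPerMinute);
}

// Returns the estimate plus whether it fits under a per-upload token ceiling.
function checkVideoBudget(durationSeconds, maxTokens) {
  const tokens = estimateVideoTokens(durationSeconds);
  return { tokens, withinBudget: tokens <= maxTokens };
}

console.log(estimateVideoTokens(5 * 60)); // 75000
```

Run the check before every upload and refuse anything over the ceiling. A five-minute clip is five times the one-minute cost, and a handful of them adds up quickly.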
3. Alibaba Tongyi Qianwen (Qwen-VL)
Alibaba's Qwen-VL series has been updating aggressively, and their pricing is competitive. Their Max model, released late 2024, reportedly trades blows with GPT-4o on certain benchmarks.
- Qwen-VL-Plus: ¥0.0015 per thousand input tokens, ¥0.006 per thousand output tokens
- Qwen-VL-Max: ¥0.003 per thousand input tokens, ¥0.012 per thousand output tokens
- Image tokens: roughly one token per 28×28-pixel patch of the image; a 1024×1024 image ≈ 1,200 tokens
Word of caution: Qwen's image token calculation is, shall we say, honest. The same image consistently yields more tokens here than with other providers. So while the per-token price looks low, your actual total cost won't be clear until you run real tests.
We benchmarked 100 e-commerce images across providers. Qwen-VL-Plus cost ¥0.47 in total, while GLM-4V-Plus came to ¥0.38. But Qwen's understanding of complex scenes was noticeably better. Price alone doesn't tell the full story: you need to weigh accuracy on your own use case, and benchmark numbers from someone else's test suite only get you so far.
4. Baidu Wenxin Yiyan (ERNIE-ViLG)
Baidu's multimodal offering performs well on Chinese-language tasks, but their pricing structure is a bit fiddly. I had to read their billing docs three times before it clicked.
- ERNIE-Bot-turbo: Free input, ¥0.008 per thousand output tokens
- ERNIE-Bot: ¥0.004 per thousand input tokens, ¥0.012 per thousand output tokens
- Image billing: Per image, not purely token-based; ¥0.02–0.05 per image fixed
This per-image approach is actually quite clever for certain workloads. If you're processing a consistent volume of moderate-resolution images, it makes budgeting straightforward. However—and this is buried deep in their docs—high-resolution images incur additional charges based on resolution. I only discovered this by inspecting our bill and then spending an afternoon confirming it with their technical support.
5. iFlytek Spark
iFlytek entered the multimodal game later than the others, but their pricing is genuinely cheap. They shipped a significant model update in the second half of 2024—capabilities improved, though they still trail the frontrunners by a margin.
- Spark-Lite: Free input, ¥0.005 per thousand output tokens
- Spark-Pro: ¥0.002 per thousand input tokens, ¥0.006 per thousand output tokens
- Image tokens: Flat 150 tokens per image, regardless of resolution
That flat 150-token policy is wonderfully simple. Zero guesswork when building budgets. The trade-off? The model isn't quite as capable. For complex scene understanding, it occasionally stumbles. We tested it on a product defect detection task, and accuracy was seven to eight percentage points lower than Qwen's.
Three Mistakes That Cost Me Real Money
Mistake 1: Ignoring Output Token Costs
Most developers obsess over input pricing. "I'm uploading images, but the output is just a short sentence—output costs are negligible." Nope. Absolutely wrong.
If you ask the model for detailed descriptions or structured JSON output, those output tokens can exceed your input tokens. We built a product description feature: input was ~300 tokens per image, but the JSON output was ~800 tokens. The output fees were several times higher than the input fees. The first time I saw the cost breakdown in our bill, I realised just how naive I'd been.
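To make the arithmetic concrete, here's a small sketch that splits a request's cost into input and output. I've plugged in the Qwen-VL-Plus prices listed earlier purely as an example. They aren't necessarily the rates that feature ran on, and the function and field names are made up for illustration.

```javascript
// Splits the cost of one request into input and output.
// Prices are per thousand tokens, matching how the lists above quote them.
function requestCost(usage, price) {
  const input = (usage.inputTokens / 1000) * price.inputPerK;
  const output = (usage.outputTokens / 1000) * price.outputPerK;
  const total = input + output;
  return { input, output, total, outputShare: total > 0 ? output / total : 0 };
}

// ~300 tokens in, ~800 tokens out, at example prices of ¥0.0015 / ¥0.006 per thousand.
const exampleCost = requestCost(
  { inputTokens: 300, outputTokens: 800 },
  { inputPerK: 0.0015, outputPerK: 0.006 }
);
console.log((exampleCost.outputShare * 100).toFixed(0) + '% of the spend is output'); // 91% of the spend is output
```

Output tokens usually cost more per token as well as outnumbering the input here, so the two effects compound. Capping the response length or trimming the JSON schema is often the cheapest optimisation available.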
Mistake 2: The Testing Environment Illusion
Every provider offers free or ultra-cheap testing tiers. But the concurrency characteristics of production APIs are completely different. I know a team that tested exclusively on a free tier, everything looked great, then they went live on the paid tier and immediately hit rate limits from the increased concurrency. They had to upgrade to a higher-priced instance, and their budget doubled overnight. The team lead later described testing environments as "the biggest fraud in API pricing."
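One way to avoid that surprise is to cap concurrency on your side and load-test with the same cap you plan to ship. Here's a minimal sketch of a concurrency-limited map. The helper name is hypothetical, and the right limit depends entirely on your provider's quota.

```javascript
// Runs `worker` over `items` with at most `limit` calls in flight at once.
// Results come back in the same order as `items`.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each lane pulls the next unclaimed index until none are left.
  async function runLane() {
    while (next < items.length) {
      const index = next++; // claimed synchronously, so lanes never collide
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

If any worker throws, the returned promise rejects with that error, so wrap the worker in your own retry logic if individual failures should not abort the batch.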
Mistake 3: Skipping Image Preprocessing (This One's a Money Saver)
I only fully grasped this recently. Image token counts generally scale with resolution. Feed a model a 4000×3000 original and it may scale the image down internally, yet on many providers you're still billed for far more tokens than a smaller image would cost. If you compress images client-side to 1024×1024 or smaller, token consumption can drop by 60% or more, with minimal accuracy impact.
Our current pipeline enforces client-side compression using Sharp, configured as resize(1024, 1024, {fit: 'inside'}), then converts to WebP at 85% quality. This one change saves roughly 30% on our monthly API bill. Actual money staying in our account.
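```javascript
// Our standard preprocessing pipeline
const sharp = require('sharp');

async function preprocessImage(inputPath) {
  return sharp(inputPath)
    // Fit within 1024×1024, keeping the aspect ratio. withoutEnlargement is added
    // here so images already smaller than that aren't upscaled (Sharp enlarges by default).
    .resize(1024, 1024, { fit: 'inside', withoutEnlargement: true })
    // Re-encode as WebP at 85% quality.
    .webp({ quality: 85 })
    .toBuffer();
}
```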
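To get a feel for how much a resize like this removes, you can estimate the pixel reduction without touching any image library. The sketch below mirrors the aspect-ratio maths of a "fit inside, never enlarge" resize. It measures pixels, not billed tokens, so treat it as an upper bound: providers that already downscale or cap images internally will show smaller savings. Both function names are mine.

```javascript
// Output dimensions for a "fit inside maxSide x maxSide, never enlarge" resize.
function fitInside(width, height, maxSide = 1024) {
  const scale = Math.min(1, maxSide / Math.max(width, height));
  return { width: Math.round(width * scale), height: Math.round(height * scale) };
}

// Fraction of pixels removed by the resize (0 means the image was left alone).
function pixelReduction(width, height, maxSide = 1024) {
  const resized = fitInside(width, height, maxSide);
  return 1 - (resized.width * resized.height) / (width * height);
}

console.log(fitInside(4000, 3000)); // { width: 1024, height: 768 }
console.log(pixelReduction(800, 600)); // 0
```

For the 4000×3000 example the resize drops well over 90% of the pixels. How much of that shows up on the bill depends on each provider's token rule, which is why we measure on real invoices rather than trusting the estimate.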
TL;DR and Recommendations
- Tight budget, simple tasks: iFlytek Spark-Pro or Zhipu GLM-4V-Flash. Cheap and capable enough.
- Strong Chinese-language understanding needed: Alibaba Qwen-VL-Plus. Slightly pricier, but the accuracy is solid.
- Cutting-edge reasoning required: GPT-4o. Just brace yourself for the bill—it genuinely burns through cash.
- Massive image volume, moderate resolution: Baidu Wenxin. Per-image billing makes the maths easy.
These prices were accurate when I wrote this, but multimodal API pricing changes absurdly fast. Always run your own benchmarks with real data before committing. Our team now pulls fresh pricing from each provider at the start of every month and runs a 200-image benchmark to track the cost-performance ratio shifts. This habit has saved us from several surprise price hikes.
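If you want to copy the habit, the metric we find most useful is cost per correct answer rather than raw cost. Here's a sketch of how the monthly numbers could be ranked. The field names are hypothetical, the sample figures are placeholders rather than real results, and all costs need converting to one currency first.

```javascript
// Ranks benchmark runs by cost per correct answer, cheapest first.
// Each run: { provider, correct, total, totalCost }, with costs in one currency.
function rankByCostPerCorrect(runs) {
  return runs
    .map((run) => ({
      provider: run.provider,
      accuracy: run.total > 0 ? run.correct / run.total : 0,
      // A provider with no correct answers is ranked last.
      costPerCorrect: run.correct > 0 ? run.totalCost / run.correct : Infinity,
    }))
    .sort((a, b) => {
      if (a.costPerCorrect === b.costPerCorrect) return 0;
      return a.costPerCorrect < b.costPerCorrect ? -1 : 1;
    });
}

// Placeholder numbers for illustration only, not measured results.
const ranking = rankByCostPerCorrect([
  { provider: 'provider-a', correct: 180, total: 200, totalCost: 1.0 },
  { provider: 'provider-b', correct: 190, total: 200, totalCost: 0.5 },
]);
console.log(ranking[0].provider); // provider-b
```

Tracking this month over month makes a price hike or an accuracy regression show up as a change in a single number.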
What multimodal API are you using? Ever had a bill that made you question your life choices? Drop a comment—your horror stories might save someone else from the same fate. ☕
#multimodalAI #APIpricing #tokeneconomics #devcosts #AI2025 #programming
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.