| 2025.03 | Zhipu GLM-4-Flash | Free | Free | Table flip |
Fourteen months. From ten dollars to literally zero. That price curve dropped faster than my company's last round of layoffs.
Actually, hold on—I need to correct myself here. That "free" GLM-4-Flash isn't quite the free lunch it appears to be. I tested it properly later, and the free tier has strict QPS limits: roughly 5 requests per second. Exceed that, and you're greeted with a wall of 429 errors. Want the limit lifted? That'll be the enterprise plan, thank you very much. So this "free" tier is more like a fishing rod with a complimentary worm.
And there's more mischief behind those headline numbers. I've actually load-tested several of these ultra-cheap models, and plenty of providers are playing the disguised throttling game. One model advertised at $0.014/million tokens? Push concurrency past 50, and suddenly you're queueing. Latency jumps from 200ms to 30 seconds. You think you've bagged a bargain, but you've actually bought a Ferrari with the handbrake permanently on.
The Tech Arms Race, Layer 1: Squeezing Inference Costs to the Bone
The pricing isn't pure marketing spin—there's genuine technical work making this possible. I ran an internal tech survey with my team a couple of months back, mapping out how each provider is attacking costs.
MoE Architectures Finally Shipping
DeepSeek V2 and V3 are the most aggressive players here. They've pushed Mixture of Experts activation down to under 5% of total parameters. What does that actually mean? The model technically has 671 billion parameters, but each inference call only wakes up a tiny subset of experts. Imagine your company employs 200 people, but any given project only needs 10 of them—the rest are scrolling Twitter at their desks. Your compute cost just got divided by 20.
# Pseudocode for MoE routing logic
def moe_forward(x, experts, router):
# Only activate top-k experts
weights, indices = router.top_k(x, k=2)
output = 0
for w, idx in zip(weights, indices):
output += w * experts[idx](x)
return output # Only 2 out of 160 experts activated
But there's a catch—one we discovered the hard way in production.
MoE models don't save you a single byte of VRAM. All those experts need to be loaded onto the GPU, waiting to be called up. A 671B-parameter model barely fits on 8 H100s. The inference deployment cost is actually higher than a dense model with equivalent performance. DeepSeek's API can be cheap because they're eating the hardware costs themselves, running hyperscale clusters at high utilisation. Smaller shops simply cannot afford to play this game.
FP8 Quantisation + Speculative Sampling
ByteDance's Doubao model has been particularly aggressive with quantisation. Their internal training and inference uses FP8 mixed precision, which slashes memory bandwidth requirements in half compared to FP16. Layer speculative sampling on top—using a small model to rapidly generate candidate tokens that the big model then verifies—and you can 2-3x your effective throughput.
I ran speculative sampling benchmarks in my own test environment:
# vLLM with speculative sampling
python -m vllm.entrypoints.openai.api_server \
--model doubao-pro-32k \
--speculative-model doubao-lite-32k \
--num-speculative-tokens 5 \
--dtype fp8
A single H100 went from 1,200 tokens/s to 3,100 tokens/s. This isn't new technology—speculative sampling papers have been around for a while—but it only became properly production-ready in the second half of 2024. Before that, framework support was experimental at best. I remember vLLM before version 0.5.0: enable speculative sampling, and you'd have about 20 minutes before everything OOMed, logs screaming CUDA out of memory.
The Scheduling War Nobody Talks About
This is the most overlooked piece, and honestly, it's where the biggest cost wins live.
Most LLM APIs run at under 50% GPU utilisation because traffic comes in waves. Whoever nails batching and scheduling can slash their costs dramatically. From what I've gathered, one major provider pushed their H100 cluster utilisation from 42% to 78% using dynamic batching plus request priority queues. Same hardware, nearly double the throughput. That's where the pricing confidence comes from.
I should admit—this bit gets complex, and I don't fully understand their scheduling algorithms either. The rough idea: mix latency-insensitive offline tasks (data labelling, evaluation runs) with live requests, using the offline work to fill GPU idle fragments. Sounds straightforward on a whiteboard. In practice? Memory management, VRAM fragmentation, scheduling jitter—each one of those will peel a layer off your sanity.
The Tech Arms Race, Layer 2: When Models All Look the Same
Here's the uncomfortable truth: by early 2025, base model capabilities have largely converged. I had my team run an internal evaluation—200 real prompts from our actual business scenarios, tested against GPT-4o, DeepSeek V3, Qwen-Long, and Doubao Pro. The results:
- Code generation: GPT-4o ≈ DeepSeek V3 > Qwen-Long > Doubao
- Long-context understanding: Qwen-Long > DeepSeek V3 > GPT-4o > Doubao
- Chinese creative writing: Doubao ≈ Qwen-Long > DeepSeek V3 > GPT-4o
- Instruction following: GPT-4o > DeepSeek V3 > Doubao > Qwen-Long
The gaps? Mostly within 5%. Which leads to a brutally practical conclusion: when models are this similar, price becomes the only decision factor.
But here's a scar I earned the hard way.
Last November, trying to save money, we swapped a customer service summarisation pipeline from GPT-4o to one of those bargain models. Our eval set showed only a 3-point drop. A week after launch, the business team was fuming: summaries had started hallucinating, inventing things customers never said. When I dug in, I realised our eval set only covered routine cases. But 15% of production traffic was long-tail stuff—multi-person conversations, transcription errors from accents, extremely long context windows—and on those, the cheap model's performance fell off a cliff.
So here's the rule I now enforce with my team: expensive models for core paths, cheap ones for edge cases, and absolutely bulletproof fallback mechanisms.
# Our model routing configuration
routes:
- name: customer_service_summary
primary:
model: qwen-long-32k
max_tokens: 2000
timeout: 15s
fallback:
model: gpt-4o
condition: "primary.error_rate > 0.05 OR primary.latency_p99 > 10s"
cost_limit:
daily: 500 # USD
action: "alert_and_throttle"
This config has saved me at least three times. Last month, Qwen-Long's API had a widespread timeout incident. The fallback switched to GPT-4o automatically—zero business impact. Sure, the bill jumped by $200, but compared to getting hunted down by stakeholders, that's money well spent.
What "Free" Models Actually Cost You
When Zhipu announced GLM-4-Flash was going free, my first reaction wasn't "awesome." It was "what's the angle?"
After chatting with a few infrastructure friends, we landed on the same conclusion: free models are a data flywheel entry point. The more you use them, the more real-world prompts they collect. Clean and label that data, and you've got training material for your next-generation model. You're using their API and paying them in data.
This isn't conspiracy theory. OpenAI's terms explicitly state they collect API data for training (enterprise plans can opt out, but the default is opt-in). I've read through several providers' privacy policies carefully—the language is consistently vague. For consumer-facing stuff, maybe it doesn't matter. But if you're processing sensitive enterprise data, I'd strongly suggest reading the agreement properly, or just going with private deployment.
Another overlooked detail: free models come with effectively zero SLA.
I monitored GLM-4-Flash availability from 17 February to 19 March. In that one-month window, there were three separate days with unavailability windows exceeding five minutes, with zero advance notice. The longest outage? Twenty-three minutes of solid 503 Service Temporarily Unavailable. If you're running that in production, you're braver than I am.
Where This Is All Heading: Three Trends I'm Watching
Based on recent observations and conversations across the industry, here's my read on late 2025 into 2026:
Inference costs will keep falling, but the drops will shrink. On the hardware side, H200 and B100 rollouts will boost single-card inference capacity by another 2-3x. But the algorithmic optimisation headroom is narrowing. MoE, quantisation, speculative sampling—we've played most of those cards already. I'm estimating that by end of 2025, top-tier model API pricing will stabilise around $0.007-0.014 per million tokens. Push much lower, and providers start bleeding cash.
Models will tier out, and the price war will shift from base models to applications. Right now, we're seeing base models race to the bottom on price. But the application layer—RAG, agents, workflows—still commands serious premium. I've been talking to several enterprises about AI integration lately, and they'll happily pay for "industry solutions that work out of the box" rather than raw API calls. I suspect the second half of 2025 will bring more packaged products: vertical-specific models plus toolchains, with pricing actually trending upward.
Open-source API-ification will squeeze closed-source providers. DeepSeek open-sourced the V3 weights. In theory, anyone can download and self-host, bringing inference costs down to about a third of the API price. What's missing right now is ease of use. Once vLLM, SGLang, and similar frameworks mature their MoE support—once one-click deployment scripts are everywhere—closed-source providers lose even more pricing power. This is probably why OpenAI and Google are frantically pushing agents and multimodality. The moat around plain text models? Gone.
What I Actually Recommend
As someone who wrestles with these APIs daily, my stance on the price war is: short-term good, long-term cautious.
The good part? Innovation costs have cratered. Last year, our internal hackathon burned through two grand in API fees over three days. Now, the same projects cost about fifty bucks. But the caution: depending on a single provider's dirt-cheap API means handing them the keys to your infrastructure. If they suddenly raise prices or change terms because they can't sustain the burn, your entire product could implode.
My advice is boring but battle-tested:
- Build a model abstraction layer. You should be able to swap providers without rewriting your application
- Keep private deployment capability for core workflows—even if it's just a smaller open-source model
- Monitor cost and quality with equal obsession. Don't stare at the unit price; stare at actual business outcomes
Oh, and a few days ago I noticed Anthropic released Claude 3.5 Haiku at half the previous price. This war isn't ending anytime soon.
What models are you running in production right now? Has this wave of price cuts hit your stack yet? Drop a comment—I'm currently compiling real load-test data across providers and I'd be happy to share what I've got.
llm #ai-infrastructure #pricing #deepseek #cost-optimisation #2025-trends #api-design