| Together AI | $3.00 | $12.00 | 20% |
But here's the twist—and this surprised me.
Aggregation platforms can actually save you money through intelligent routing. Portkey's "fallback to cheapest model" feature reduced our overall costs by 18% last month. It routed about 30% of traffic to Claude 3.5 Haiku when latency requirements allowed it, instead of always hitting GPT-4o.
# Portkey's config for cost-optimised routing
{
  "strategy": "cost-minimization",
  "targets": [
    {"model": "gpt-4o", "max_cost_per_1k": 0.015},
    {"model": "claude-3-5-sonnet", "max_cost_per_1k": 0.012},
    {"model": "claude-3-5-haiku", "max_cost_per_1k": 0.004}
  ],
  "fallback_order": ["gpt-4o", "claude-3-5-sonnet", "claude-3-5-haiku"]
}
So the 10% markup on paper translated to an 18% net saving. Funny how that works.
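If a markup turning into a saving sounds backwards, the arithmetic helps. Here's a minimal sketch, assuming costs scale linearly with traffic share; the function name and the 15% price ratio are my own illustrative assumptions, not numbers from our invoice.

```python
def relative_cost(markup: float, cheap_share: float, cheap_price_ratio: float) -> float:
    """Blended cost via the gateway, relative to sending all traffic direct
    to the expensive model (1.0 means break-even).

    markup            -- gateway markup as a fraction, e.g. 0.10 for 10%
    cheap_share       -- fraction of traffic routed to the cheaper model
    cheap_price_ratio -- cheaper model's price divided by the expensive model's
    """
    blended = (1 - cheap_share) + cheap_share * cheap_price_ratio
    return (1 + markup) * blended


# Hypothetical inputs: 10% markup, 30% of traffic on a model priced at
# 15% of the expensive one (the 15% ratio is an assumed value).
cost = relative_cost(markup=0.10, cheap_share=0.30, cheap_price_ratio=0.15)
print(f"relative cost: {cost:.3f}, net saving: {1 - cost:.1%}")
```

With those assumed inputs the net saving lands at roughly 18%, so the headline figure is at least plausible. Real bills depend on token mix and per-model input/output pricing, which this sketch ignores.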
4. The 3 AM Incident (Or: Why Health Check Intervals Matter)
Right. Let me tell you about that cold coffee.
We'd configured OpenRouter to route between GPT-4o and Claude 3.5 Sonnet. The logic was simple: use whichever was available, prioritise the faster one. At 2:47 AM UTC, Anthropic's API experienced a partial outage in us-east-1. Nothing catastrophic—just 4 minutes of degraded performance.
OpenRouter's health check didn't notice.
Why? They were polling every 60 seconds with a 30-second timeout, so an endpoint that died just after a successful probe could go unnoticed for up to 60 + 30 = 90 seconds. For those 90 agonising seconds, traffic kept flowing to a dead endpoint. 14,000 requests failed with 502 Bad Gateway. Our error budget for the entire quarter evaporated in under two minutes.
I was the on-call engineer. My coffee went cold while I manually switched traffic to the OpenAI-only endpoint at 3:04 AM.
The fix—after a very pointed email to OpenRouter's support team—was migrating to Portkey with proper health check configuration:
// Portkey gateway config with aggressive health checks
const gateway = new Portkey({
  healthCheckInterval: 5000, // every 5 seconds instead of 60
  circuitBreaker: {
    failureThreshold: 3,
    recoveryTimeout: 10000, // 10 seconds
    halfOpenMaxRequests: 5
  }
});
This config would've detected the Anthropic outage in about 5 seconds instead of 90. At the roughly 155 failed requests per second we were seeing (14,000 over 90 seconds), that works out to roughly 780 failures instead of 14,000. That's the difference between "minor blip" and "all-hands incident review."
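For anyone who wants to check the numbers, here's the back-of-envelope version. It's a sketch under two assumptions of mine: the request rate stayed constant during the outage, and the new detection window is exactly one 5-second interval. The function names are made up for illustration.

```python
def worst_case_detection_s(interval_s: float, timeout_s: float) -> float:
    """Longest time a dead endpoint can go unnoticed: it fails right after a
    successful probe, so we wait one full interval for the next probe and
    then the full timeout for that probe to give up."""
    return interval_s + timeout_s


def estimated_failures(observed_failures: int, observed_window_s: float,
                       new_window_s: float) -> float:
    """Scale the observed failure count to a shorter detection window,
    assuming a constant request rate during the outage."""
    failures_per_second = observed_failures / observed_window_s
    return failures_per_second * new_window_s


old_window = worst_case_detection_s(interval_s=60, timeout_s=30)
print(old_window)                                                     # 90
print(round(estimated_failures(14_000, old_window, new_window_s=5)))  # 778
```

If the breaker only trips after three consecutive failed probes, the window would be closer to 15 seconds and the estimate rises accordingly. That's still far better than 90 seconds, but worth knowing before you quote a number in a postmortem.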
Lesson learned. Health check intervals aren't a config detail—they're a reliability feature.
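If the three circuit breaker settings above feel abstract, here's roughly what they control. This is a generic sketch of the closed, open, half-open pattern in Python, not Portkey's implementation; the class and method names are hypothetical, and treating a single success as enough to close the breaker is my own simplification.

```python
import time


class CircuitBreaker:
    """Minimal closed -> open -> half-open breaker (illustrative only)."""

    def __init__(self, failure_threshold=3, recovery_timeout_s=10.0,
                 half_open_max_requests=5, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_s = recovery_timeout_s
        self.half_open_max_requests = half_open_max_requests
        self.clock = clock
        self.state = "closed"
        self.failures = 0          # consecutive failures while closed
        self.opened_at = 0.0       # when the breaker last tripped
        self.trial_requests = 0    # requests admitted while half-open

    def allow_request(self) -> bool:
        """Return True if a request may be sent to this target."""
        if self.state == "open":
            if self.clock() - self.opened_at < self.recovery_timeout_s:
                return False
            # Recovery timeout elapsed: start probing with real traffic.
            self.state = "half_open"
            self.trial_requests = 0
        if self.state == "half_open":
            if self.trial_requests >= self.half_open_max_requests:
                return False
            self.trial_requests += 1
        return True

    def record_success(self) -> None:
        # Simplification: any success closes the breaker again.
        self.state = "closed"
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
            self.failures = 0


# Drive it with a fake clock so the example runs instantly.
now = [0.0]
breaker = CircuitBreaker(clock=lambda: now[0])
for _ in range(3):
    breaker.record_failure()
print(breaker.state, breaker.allow_request())   # open False
now[0] = 10.0
print(breaker.allow_request(), breaker.state)   # True half_open
```

The point of the half-open state is to let a handful of real requests through after the recovery timeout instead of flooding a provider that may still be struggling.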
So Which Platform Should You Actually Choose?
Go with OpenRouter if:
- You want access to 200+ models through a single API key
- Occasional p99 latency spikes won't wake you up at night
- You're prototyping and value simplicity over fine-grained control
Go with Portkey if:
- You're running production workloads that need 99.9%+ uptime
- You want caching, rate limiting, and request queuing out of the box
- Observability matters to you—their OpenTelemetry integration is solid
Go with Anyscale if:
- You're serving fine-tuned Llama or Mistral models at scale
- You're already in the Ray ecosystem (and if you are, you know who you are)
- Cost optimisation for open-source models is your priority
Go with Together AI if:
- You're exclusively using open-source models
- You need fine-tuning capabilities alongside inference
- You've got robust retry logic to handle throughput variability
Skip aggregation platforms entirely if:
- You only use one model provider
- Your latency budget is under 100ms (go direct—seriously)
- You've got a dedicated MLOps team managing model deployments
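On the "robust retry logic" point in the Together AI list: a minimal sketch of what I mean is exponential backoff with full jitter. The helper name and default values below are assumptions for illustration, and in real code you'd catch only retryable errors (timeouts, 429s, 5xx) rather than every exception.

```python
import random
import time


def call_with_retries(fn, max_attempts=4, base_delay_s=0.5, max_delay_s=8.0,
                      sleep=time.sleep):
    """Call fn(); on an exception, wait with exponential backoff plus full
    jitter and try again, re-raising after max_attempts failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            # Cap the exponential delay, then pick a random point below it
            # so many clients do not retry in lockstep.
            ceiling = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


# Example: a stand-in for an API call that fails twice, then succeeds.
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated 503")
    return "ok"

print(call_with_retries(flaky_call, sleep=lambda s: None))  # ok
```

Passing `sleep` as a parameter keeps the helper easy to test without actually waiting.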
Deploy Your Own Benchmarks (Because Trust But Verify)
I've open-sourced everything—the Locust scripts, the analysis notebooks, the Terraform modules. Here's the quick start:
# main.tf
module "benchmark_infra" {
  source          = "github.com/rajpatel-ops/ai-gateway-benchmarks//terraform"
  regions         = ["us-east-1", "eu-west-1", "ap-southeast-1"]
  instance_type   = "c6i.xlarge"
  locust_users    = 200
  test_duration_m = 30
  platforms = {
    openrouter = { api_key = var.openrouter_key }
    portkey    = { api_key = var.portkey_key }
    together   = { api_key = var.together_key }
  }
}
# Deploy and run
terraform init && terraform apply -auto-approve
./scripts/run-benchmarks.sh
./scripts/generate-report.py --output report.html
Full results, raw CSV data, and the Grafana dashboard JSON are in the GitHub repo. If you spot something wrong with my methodology—and I'm sure there's something—open an issue. I'd rather be corrected than confident and wrong.
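If you'd rather sanity-check the raw CSV yourself before trusting my charts, something like the following is enough to get percentiles. It's a sketch using only the standard library, and it assumes a `latency_ms` column; the repo's actual column names may differ, and the sample data here is made up.

```python
import csv
import io
import statistics


def latency_percentiles(csv_text: str, column: str = "latency_ms") -> dict:
    """Return p50, p95 and p99 for one numeric column of a CSV file."""
    rows = csv.DictReader(io.StringIO(csv_text))
    samples = [float(row[column]) for row in rows]
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Made-up samples (100 ms to 200 ms), only to show the call shape.
sample_csv = "platform,latency_ms\n" + "\n".join(
    f"example,{ms}" for ms in range(100, 201)
)
print(latency_percentiles(sample_csv))
# {'p50': 150.0, 'p95': 195.0, 'p99': 199.0}
```

Comparing p99 rather than the mean is what surfaces the tail spikes that actually page people.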
What I'm Testing Next
This week, I'm adding benchmarks for:
- Groq's LPU inference: they claim 300+ tokens/sec for Llama 3.1 70B. I'll believe it when I see it.
- Cloudflare AI Gateway: recently went GA, edge-based routing. Could be interesting for globally distributed apps.
- AWS Bedrock's cross-region inference: us-east-1 + eu-west-1 + ap-northeast-1. If they've nailed the routing, this could be the enterprise play.
I'll publish the follow-up in March 2025. If you've benchmarked any of these platforms with different workloads—or if you've found something I missed—I'd genuinely love to compare notes.
What's your experience with AI API aggregation? Have you seen similar latency patterns, or did your tests reveal something completely different? Drop a comment below. I'm particularly curious about Azure AI Studio and Google Cloud's Model Garden. The cloud-native offerings might change the equation entirely.
Tags: #ai #api-gateway #benchmarking #openai #anthropic #portkey #openrouter #latency #throughput #devops #sre