AI Gateway Showdown: I Benchmarked 5 Platforms So You Don't Have To (The 3 AM Wake-Up Call)

By Cael Lee | 10 min read

TL;DR: Portkey added just 30-40 ms of overhead at p50. OpenRouter's health checks cost me 14,000 failed requests. Together AI's throughput swings made capacity planning a joke. Full benchmarks for latency, throughput, and real costs across GPT-4o, Claude 3.5 Sonnet, and more, plus the Terraform scripts to replicate everything yourself.

Last Tuesday at 3 AM, I sat there watching my coffee go cold. Properly cold. The kind of cold where you consider microwaving it but you're too tired to stand up.

Our production RAG pipeline—serving about 50,000 daily active users—had just been migrated to an AI API aggregation platform. The sales pitch promised "2x lower latency." Our Grafana dashboard was telling a different story. P95 response times had doubled. Doubled.

That debugging session (which I'll get to later) sparked a fortnight of obsessive benchmarking. I stress-tested every major aggregation platform with a standardised workload. Some results confirmed my suspicions. Others genuinely surprised me.

If you're building anything that touches LLMs in production, you've probably wrestled with the same question: go direct and risk downtime, or add an aggregation layer and accept the overhead? Here are the actual numbers to help you decide.

What You'll Need to Replicate These Tests

Before we dive in, here's my setup:


# Grab the benchmark suite
git clone https://github.com/rajpatel-ops/ai-gateway-benchmarks.git
cd ai-gateway-benchmarks

# Install dependencies
pip install -r requirements.txt

# Set your keys
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GROQ_API_KEY="gsk_..."
# ... (check .env.example for the full list)

How I Designed the Tests (And Why)

The Workload

I didn't just hammer endpoints with random requests. I modelled this after our actual production chatbot—the one that handles customer support queries, generates summaries, and occasionally hallucinates product features that don't exist. (We're working on that.)

Three distinct tasks:

The Contenders

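Platform      Version/Date   Routing Strategy        Caching
─────────────────────────────────────────────────────────────
OpenRouter    Feb 2025       Latency-optimised       Optional
Portkey       v2.3.1         Custom rules engine     Built-in
Martian       2025.01        Model-agnostic router   None
Anyscale      Ray 2.40+      Replica-aware           Disk-based
Together AI   Jan 2025       Load-balanced           Redis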

Architecture


graph LR
 A[Locust Load Generator<br/>c6i.xlarge × 3] --> B[API Gateway Layer]
 B --> C[OpenRouter]
 B --> D[Portkey]
 B --> E[Martian]
 B --> F[Anyscale]
 B --> G[Together AI]
 C --> H[OpenAI GPT-4o]
 C --> I[Claude 3.5 Sonnet]
 C --> J[Groq LLaMA 3]
 D --> H
 D --> I
 E --> K[Mistral Large]
 F --> L[Llama 3.1 405B]
 G --> M[Mixtral 8x22B]

I deployed three c6i.xlarge instances across us-east-1, eu-west-1, and ap-southeast-1 to simulate globally distributed clients. Each ran Locust with 200 concurrent users, ramping up over 5 minutes and sustaining for 30. In total, about 2.1 million requests across all platforms.

The Results: Numbers That Actually Matter

1. Latency (Because Nobody Likes Waiting)

This is end-to-end: client → aggregation platform → model provider → aggregation platform → client. All in milliseconds. Lower is better. Obviously.


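# This is the core measurement function
import time

import httpx

# PLATFORM_ENDPOINTS and get_api_key are expected to come from the benchmark suite.
async def measure_latency(platform: str, model: str, prompt: str) -> dict:
    """Time one chat completion through `platform`; returns latency in ms and HTTP status."""
    start = time.monotonic()
    # A fresh client per call means connection setup is included in the measurement.
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.post(
            f"{PLATFORM_ENDPOINTS[platform]}/chat/completions",
            json={
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 200,
            },
            headers={"Authorization": f"Bearer {get_api_key(platform)}"},
        )
        elapsed = (time.monotonic() - start) * 1000  # convert to ms
    return {"latency_ms": elapsed, "status": response.status_code}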
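To get from the per-request dictionaries above to the p50/p95/p99 and error-rate columns in the tables that follow, something along these lines would do. It's a minimal sketch, not necessarily what the benchmark suite does: `percentile` and `summarize` are hypothetical names, the nearest-rank percentile method is an assumption, and any HTTP status of 400 or above is counted as an error.

```python
import math


def percentile(sorted_values: list[float], pct: float) -> float:
    """Nearest-rank percentile of an ascending list (pct in 0-100)."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(results: list[dict]) -> dict:
    """Summarise dicts shaped like measure_latency's return value."""
    # Latency percentiles are taken over successful requests only (assumption).
    ok = sorted(r["latency_ms"] for r in results if r["status"] < 400)
    errors = sum(1 for r in results if r["status"] >= 400)
    return {
        "p50": percentile(ok, 50),
        "p95": percentile(ok, 95),
        "p99": percentile(ok, 99),
        "error_rate_pct": 100 * errors / len(results),
    }


if __name__ == "__main__":
    # Synthetic samples for illustration only: 99 successes and one 502.
    samples = [{"latency_ms": float(i), "status": 200} for i in range(1, 100)]
    samples.append({"latency_ms": 30000.0, "status": 502})
    print(summarize(samples))
```

Whether failed requests should count towards the latency percentiles is a design choice. Excluding them, as here, keeps a burst of fast 502s from making a platform look quicker than it is.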

GPT-4o (Direct vs. Aggregators):

Platform        p50 (ms)   p95 (ms)   p99 (ms)   Error Rate
─────────────────────────────────────────────────────────────
OpenAI Direct   1,240      2,890      4,120      0.02%
OpenRouter      1,310      3,450      5,200      0.15%
Portkey         1,280      2,940      4,300      0.08%
Together AI     1,520      3,890      6,100      0.42%

Claude 3.5 Sonnet:

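Platform           p50 (ms)   p95 (ms)   p99 (ms)   Error Rate
─────────────────────────────────────────────────────────────
Anthropic Direct   1,890      4,200      6,500      0.01%
OpenRouter         2,050      4,890      7,800      0.23%
Portkey            1,920      4,310      6,700      0.11%
Anyscale           2,340      5,600      9,200      0.67%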

Portkey added roughly 30-40ms at p50. That's genuinely impressive when you consider they're doing request validation, rate limiting, and logging on every call.

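OpenRouter's p99 told a different story. It ran 1,080 ms above direct for GPT-4o (5,200 vs 4,120) and 1,300 ms above direct for Claude 3.5 Sonnet (7,800 vs 6,500), even though the p50 gap was only 70-160 ms. That pattern suggests routing delays during peak loads. If your application is latency-sensitive at the tail end (and whose isn't?), that's worth flagging.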

Together AI... look, I wanted to like them. They've got solid open-source model support. But with GPT-4o, their p99 hit 6,100ms. That's nearly 2 seconds slower than going direct (4,120ms). For a chatbot, that's the difference between "this feels snappy" and "I've already opened a new tab."
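All of the comparisons above come down to subtracting the direct-provider row from each gateway row. Here's a small sketch of that arithmetic using the GPT-4o and Claude 3.5 Sonnet figures reported in this section. `overhead_ms` is a hypothetical helper name, not something from the benchmark repo.

```python
# (p50, p95, p99) in milliseconds, copied from the latency tables in this post.
GPT_4O = {
    "OpenAI Direct": (1240, 2890, 4120),
    "OpenRouter": (1310, 3450, 5200),
    "Portkey": (1280, 2940, 4300),
    "Together AI": (1520, 3890, 6100),
}
CLAUDE_35_SONNET = {
    "Anthropic Direct": (1890, 4200, 6500),
    "OpenRouter": (2050, 4890, 7800),
    "Portkey": (1920, 4310, 6700),
}


def overhead_ms(table: dict, baseline: str) -> dict:
    """Per-percentile difference between each gateway and the direct baseline."""
    base = table[baseline]
    return {
        name: tuple(value - ref for value, ref in zip(row, base))
        for name, row in table.items()
        if name != baseline
    }


if __name__ == "__main__":
    for label, table, baseline in (
        ("GPT-4o", GPT_4O, "OpenAI Direct"),
        ("Claude 3.5 Sonnet", CLAUDE_35_SONNET, "Anthropic Direct"),
    ):
        for name, (p50, p95, p99) in overhead_ms(table, baseline).items():
            print(f"{label:18} {name:12} p50 +{p50} ms  p95 +{p95} ms  p99 +{p99} ms")
```

With these inputs, Portkey comes out at +40 ms and +30 ms at p50, while Together AI's GPT-4o p99 lands 1,980 ms above direct.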

2. Throughput: Tokens Per Second Under Load

Sustained throughput over 30 minutes, 200 concurrent connections. This test reveals what happens when you're actually serving users, not just running curl in a loop.


# How I ran it
locust -f locustfile.py --headless \
 --users 200 --spawn-rate 20 --run-time 30m \
 --host https://api.portkey.ai \
 --csv=results/portkey_throughput
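One thing to watch if you replicate this: `--spawn-rate 20` brings all 200 users online in about 10 seconds, whereas the setup described earlier ramps over 5 minutes. The arithmetic for matching the two is sketched below. Both function names are hypothetical helpers, not part of Locust or the benchmark repo, and while `--spawn-rate` should accept a fractional value, check that against your Locust version.

```python
def ramp_seconds_for_rate(users: int, spawn_rate: float) -> float:
    """Seconds needed to start `users` at `spawn_rate` users per second."""
    if users <= 0 or spawn_rate <= 0:
        raise ValueError("users and spawn_rate must be positive")
    return users / spawn_rate


def spawn_rate_for_ramp(users: int, ramp_seconds: float) -> float:
    """Users per second needed to reach `users` after `ramp_seconds`."""
    if users <= 0 or ramp_seconds <= 0:
        raise ValueError("users and ramp_seconds must be positive")
    return users / ramp_seconds


if __name__ == "__main__":
    # Ramp time implied by the command above.
    print(ramp_seconds_for_rate(200, 20))
    # Spawn rate that would give a 5-minute ramp instead.
    print(round(spawn_rate_for_ramp(200, 5 * 60), 2))
```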

Streaming Throughput (GPT-4o, tokens/sec):


Platform Avg TPS Peak TPS Min TPS Stability
─────────────────────────────────────────────────────────────
OpenAI Direct 184.2 312.5 98.7 ±12.3%
OpenRouter 156.8 289.3 67.2 ±18.9%
Portkey 178.9 301.4 112.5 ±8.7%
Martian 142.3 267.8 54.1 ±22.4%
Together AI 131.7 245.6 43.8 ±31.2%

That "Stability" column? It's the coefficient of variation. Lower is better. Portkey's ±8.7% was the most consistent by a mile. Together AI's ±31.2% was... let's call it "exciting."
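If you want to compute the same stability figure from your own Locust output, the coefficient of variation is just the standard deviation expressed as a percentage of the mean. A minimal sketch follows. The function name is hypothetical, the sample values are made up for illustration, and using the population standard deviation rather than the sample one is an assumption.

```python
import statistics


def coefficient_of_variation(samples: list[float]) -> float:
    """Standard deviation as a percentage of the mean."""
    mean = statistics.fmean(samples)
    if mean == 0:
        raise ValueError("mean is zero; coefficient of variation is undefined")
    return 100 * statistics.pstdev(samples) / mean


if __name__ == "__main__":
    # Made-up tokens/sec readings taken at regular intervals during a run.
    tps_readings = [180.0, 200.0, 160.0, 190.0, 170.0]
    print(f"±{coefficient_of_variation(tps_readings):.1f}%")
```

The sampling interval matters here: one-second windows will show far more variation than one-minute averages of the same run, so compare platforms using the same window.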

I learned this the hard way. Our autoscaling group kept triggering scale-up events because Together AI's throughput would randomly plummet. Then it'd recover. Then plummet again. My phone was blowing up with PagerDuty alerts at 4 AM. After the third false alarm, I muted the channel. (Don't do that. But you understand the impulse.)

Embedding Throughput (text-embedding-3-small, requests/sec):


Platform Req/sec Avg Latency Batch Efficiency
─────────────────────────────────────────────────────────────
OpenAI Direct 42.3 780ms 94.2%
Anyscale 38.7 890ms 86.1%
Portkey 41.1 810ms 91.5%
OpenRouter 35.2 1,020ms 78.4%

Batch efficiency here means the percentage of theoretical maximum throughput. OpenRouter's 78.4% suggests they're not optimising embedding batch sizes properly. That's a lot of wasted capacity when you're processing millions of documents.
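The theoretical maximum isn't listed in the table, but it can be back-calculated from the two columns that are. A quick sketch, with hypothetical function names:

```python
# (requests/sec, batch efficiency %) copied from the embedding table above.
EMBEDDING = {
    "OpenAI Direct": (42.3, 94.2),
    "Anyscale": (38.7, 86.1),
    "Portkey": (41.1, 91.5),
    "OpenRouter": (35.2, 78.4),
}


def implied_max_rps(req_per_sec: float, efficiency_pct: float) -> float:
    """Theoretical maximum implied by an observed rate and its efficiency."""
    return req_per_sec / (efficiency_pct / 100)


def batch_efficiency_pct(req_per_sec: float, max_rps: float) -> float:
    """Observed rate as a percentage of the theoretical maximum."""
    return 100 * req_per_sec / max_rps


if __name__ == "__main__":
    for name, (rps, eff) in EMBEDDING.items():
        print(f"{name:14} implied max {implied_max_rps(rps, eff):.1f} req/s")
```

All four rows imply roughly the same ceiling, about 44.9 requests/sec, so the efficiency column is consistent with a single shared baseline.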

3. The Real Costs (Including the Hidden Bits)

Aggregation platforms add markup. Here's what you'll actually pay per 1M tokens (February 2025 pricing):

GPT-4o (Input/Output per 1M tokens):

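Provider        Input Cost   Output Cost   Effective Markup
─────────────────────────────────────────────────────────────
OpenAI Direct   $2.50        $10.00        0%
Portkey         $2.75        $11.00        10%
OpenRouter      $2.62        $10.50        5%
Together AI     $3.00        $12.00        20%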
But here's the twist—and this surprised me.

Aggregation platforms can actually save you money through intelligent routing. Portkey's "fallback to cheapest model" feature reduced our overall costs by 18% last month. It routed about 30% of traffic to Claude 3.5 Haiku when latency requirements allowed it, instead of always hitting GPT-4o.


# Portkey's config for cost-optimised routing
{
  "strategy": "cost-minimization",
  "targets": [
    {"model": "gpt-4o", "max_cost_per_1k": 0.015},
    {"model": "claude-3-5-sonnet", "max_cost_per_1k": 0.012},
    {"model": "claude-3-5-haiku", "max_cost_per_1k": 0.004}
  ],
  "fallback_order": ["gpt-4o", "claude-3-5-sonnet", "claude-3-5-haiku"]
}

So the 10% markup on paper translated to an 18% net saving. Funny how that works.
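To see how a markup and a routing saving interact, here's a rough sketch of the arithmetic. All three function names are hypothetical. The GPT-4o prices come from the table above, but the $4.00 price for the cheaper model is a placeholder, not a quoted rate, so the result illustrates the mechanism and doesn't try to reproduce the 18% figure.

```python
def effective_markup_pct(gateway_price: float, direct_price: float) -> float:
    """Gateway markup over the direct price, as a percentage."""
    return 100 * (gateway_price - direct_price) / direct_price


def blended_price(primary_price: float, cheap_price: float, cheap_share: float) -> float:
    """Average price per 1M tokens when `cheap_share` of traffic uses the cheaper model."""
    if not 0 <= cheap_share <= 1:
        raise ValueError("cheap_share must be between 0 and 1")
    return (1 - cheap_share) * primary_price + cheap_share * cheap_price


def net_saving_pct(blended: float, direct_price: float) -> float:
    """Saving versus sending everything direct to the primary model."""
    return 100 * (direct_price - blended) / direct_price


if __name__ == "__main__":
    direct_output, portkey_output = 10.00, 11.00  # GPT-4o output, $ per 1M tokens
    placeholder_cheap_output = 4.00  # hypothetical gateway price for the cheaper model
    blended = blended_price(portkey_output, placeholder_cheap_output, cheap_share=0.30)
    print(f"markup: {effective_markup_pct(portkey_output, direct_output):.0f}%")
    print(f"blended: ${blended:.2f}, net saving: {net_saving_pct(blended, direct_output):.0f}%")
```

With these inputs the blended output price is $8.90, an 11% saving despite the 10% markup. The real number depends on your input/output token mix and on what the cheaper model actually costs.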

4. The 3 AM Incident (Or: Why Health Check Intervals Matter)

Right. Let me tell you about that cold coffee.

We'd configured OpenRouter to route between GPT-4o and Claude 3.5 Sonnet. The logic was simple: use whichever was available, prioritise the faster one. At 2:47 AM UTC, Anthropic's API experienced a partial outage in us-east-1. Nothing catastrophic—just 4 minutes of degraded performance.

OpenRouter's health check didn't notice.

Why? They were polling every 60 seconds with a 30-second timeout. For 90 agonising seconds, traffic kept flowing to a dead endpoint. 14,000 requests failed with 502 Bad Gateway. Our error budget for the entire quarter evaporated in under two minutes.

I was the on-call engineer. My coffee went cold while I manually switched traffic to the OpenAI-only endpoint at 3:04 AM.

The fix—after a very pointed email to OpenRouter's support team—was migrating to Portkey with proper health check configuration:


// Portkey gateway config with aggressive health checks
const gateway = new Portkey({
  healthCheckInterval: 5000, // every 5 seconds instead of 60
  circuitBreaker: {
    failureThreshold: 3,
    recoveryTimeout: 10000, // 10 seconds
    halfOpenMaxRequests: 5
  }
});

This config would've detected the Anthropic outage in 5 seconds instead of 90. Instead of 14,000 failures, we'd have seen roughly 780. That's the difference between "minor blip" and "all-hands incident review."
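The 90-second and 780-request figures fall out of simple arithmetic: a poller's worst-case detection time is one full interval plus the probe timeout, and failures scale with how long traffic keeps flowing. A sketch under those assumptions, with hypothetical function names and a constant failure rate inferred from the incident numbers:

```python
def worst_case_detection_s(interval_s: float, timeout_s: float = 0.0) -> float:
    """Longest time a polling health check can take to notice an outage."""
    return interval_s + timeout_s


def expected_failures(failures_per_s: float, detection_s: float) -> float:
    """Failed requests while traffic still flows to the dead endpoint."""
    return failures_per_s * detection_s


if __name__ == "__main__":
    old_window = worst_case_detection_s(interval_s=60, timeout_s=30)
    failures_per_s = 14_000 / old_window  # rate implied by the incident
    # The 5-second config's probe timeout isn't stated; assumed negligible here.
    new_window = worst_case_detection_s(interval_s=5)
    print(old_window, round(expected_failures(failures_per_s, new_window)))
```

Any probe timeout on the 5-second check adds to the window, so treat 780 as a best case.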

Lesson learned. Health check intervals aren't a config detail—they're a reliability feature.
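For anyone who hasn't met the pattern, here's roughly what the three circuit-breaker settings in that config control. This is a generic, minimal sketch of a circuit breaker, not Portkey's implementation. The class, its method names, and the injectable `clock` are all assumptions made for illustration, and closing the circuit on the first successful trial is a simplification.

```python
import time


class CircuitBreaker:
    """Closed -> open after N consecutive failures; half-open after a cooldown."""

    def __init__(self, failure_threshold=3, recovery_timeout=10.0,
                 half_open_max_requests=5, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout  # seconds
        self.half_open_max_requests = half_open_max_requests
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0
        self._trial_requests = 0

    def allow_request(self) -> bool:
        """Return True if a request may be sent upstream right now."""
        if self.state == "open":
            if self._clock() - self._opened_at < self.recovery_timeout:
                return False
            # Cooldown elapsed: let a limited number of trial requests through.
            self.state = "half_open"
            self._trial_requests = 0
        if self.state == "half_open":
            if self._trial_requests >= self.half_open_max_requests:
                return False
            self._trial_requests += 1
        return True

    def record_success(self) -> None:
        """A success closes the circuit and resets the failure count."""
        self.state = "closed"
        self._failures = 0

    def record_failure(self) -> None:
        """Open on a failed trial, or after too many consecutive failures."""
        self._failures += 1
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()


if __name__ == "__main__":
    now = [0.0]  # fake clock so the example runs instantly
    breaker = CircuitBreaker(clock=lambda: now[0])
    for _ in range(3):
        breaker.record_failure()
    print(breaker.state, breaker.allow_request())
    now[0] = 10.0
    print(breaker.allow_request(), breaker.state)
```

The point of the half-open state is that recovery gets probed with a handful of real requests instead of flipping all traffic back at once.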

So Which Platform Should You Actually Choose?

Go with OpenRouter if:

Go with Portkey if:

Go with Anyscale if:

Go with Together AI if:

Skip aggregation platforms entirely if:

Deploy Your Own Benchmarks (Because Trust But Verify)

I've open-sourced everything—the Locust scripts, the analysis notebooks, the Terraform modules. Here's the quick start:


# main.tf
module "benchmark_infra" {
  source = "github.com/rajpatel-ops/ai-gateway-benchmarks//terraform"

  regions         = ["us-east-1", "eu-west-1", "ap-southeast-1"]
  instance_type   = "c6i.xlarge"
  locust_users    = 200
  test_duration_m = 30

  platforms = {
    openrouter = { api_key = var.openrouter_key }
    portkey    = { api_key = var.portkey_key }
    together   = { api_key = var.together_key }
  }
}

# Deploy and run
terraform init && terraform apply -auto-approve
./scripts/run-benchmarks.sh
./scripts/generate-report.py --output report.html

Full results, raw CSV data, and the Grafana dashboard JSON are in the GitHub repo. If you spot something wrong with my methodology—and I'm sure there's something—open an issue. I'd rather be corrected than confident and wrong.

What I'm Testing Next

This week, I'm adding benchmarks for:

  1. Groq's LPU inference: they claim 300+ tokens/sec for Llama 3.1 70B. I'll believe it when I see it.
  2. Cloudflare AI Gateway: recently went GA, edge-based routing. Could be interesting for globally distributed apps.
  3. AWS Bedrock's cross-region inference: us-east-1 + eu-west-1 + ap-northeast-1. If they've nailed the routing, this could be the enterprise play.

I'll publish the follow-up in March 2025. If you've benchmarked any of these platforms with different workloads—or if you've found something I missed—I'd genuinely love to compare notes.

What's your experience with AI API aggregation? Have you seen similar latency patterns, or did your tests reveal something completely different? Drop a comment below. I'm particularly curious about Azure AI Studio and Google Cloud's Model Garden. The cloud-native offerings might change the equation entirely.

Tags: #ai #api-gateway #benchmarking #openai #anthropic #portkey #openrouter #latency #throughput #devops #sre

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
