AI Gateway Showdown: I Benchmarked 5 Platforms So You Don't Have To (The 3 AM Wake-Up Call)

TL;DR: Portkey added just 30-40ms of overhead at p50. OpenRouter's slow health checks cost me 14,000 failed requests. Together AI's throughput swings made capacity planning a joke. Full benchmarks for latency, throughput, and real costs across GPT-4o, Claude 3.5 Sonnet, and more, plus the Terraform scripts to replicate everything yourself.

Last Tuesday at 3 AM, I sat there watching my coffee go cold. Properly cold. The kind of cold where you consider microwaving it but you're too tired to stand up.

Our production RAG pipeline—serving about 50,000 daily active users—had just been migrated to an AI API aggregation platform. The sales pitch promised "2x lower latency." Our Grafana dashboard was telling a different story. P95 response times had doubled. Doubled.

That debugging session (which I'll get to later) sparked a fortnight of obsessive benchmarking. I stress-tested every major aggregation platform with a standardised workload. Some results confirmed my suspicions. Others genuinely surprised me.

If you're building anything that touches LLMs in production, you've probably wrestled with the same question: go direct and risk downtime, or add an aggregation layer and accept the overhead? Here are the actual numbers to help you decide.

What You'll Need to Replicate These Tests

Before we dive in, here's my setup:

AWS Account with EC2 access (I used c6i.xlarge instances on Amazon Linux 2023)
Python 3.12+ with httpx==0.28.1, pandas==2.2.0, and locust==2.31.4
Terraform v1.10+ (if you want to provision infrastructure—completely optional)
API Keys for: OpenAI, Anthropic, Groq, OpenRouter, Portkey, Martian, Anyscale, and Together AI
GitHub Repo: github.com/rajpatel-ops/ai-gateway-benchmarks (all scripts, configs, and the Grafana dashboard JSON)


# Grab the benchmark suite
git clone https://github.com/rajpatel-ops/ai-gateway-benchmarks.git
cd ai-gateway-benchmarks

# Install dependencies
pip install -r requirements.txt

# Set your keys
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GROQ_API_KEY="gsk_..."
# ... (check .env.example for the full list)

How I Designed the Tests (And Why)

The Workload

I didn't just hammer endpoints with random requests. I modelled this after our actual production chatbot—the one that handles customer support queries, generates summaries, and occasionally hallucinates product features that don't exist. (We're working on that.)

Three distinct tasks:

Task 1 (Chat Completion): 500-token prompt requesting a 200-token response. This mirrors our customer query flow: think "What's your refund policy for digital products purchased in the EU?"
Task 2 (Embedding): Batch of 32 texts, 512 tokens each. This is our semantic search pipeline indexing help articles.
Task 3 (Streaming): 1000-token prompt with SSE streaming enabled. For when users want real-time explanations.
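
The three tasks above can be written down as request payloads. This is a minimal sketch, assuming OpenAI-compatible request bodies (`messages`, `max_tokens`, `stream`, `input`); the `Workload` class, `filler_text`, and `build_payload` are hypothetical names for illustration, and the one-word-per-token filler is a crude stand-in for sizing prompts with a real tokenizer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Workload:
    name: str
    prompt_tokens: int
    max_tokens: Optional[int] = None
    batch_size: int = 1
    stream: bool = False


# The three tasks as described in the list above.
WORKLOADS = [
    Workload("chat", prompt_tokens=500, max_tokens=200),
    Workload("embedding", prompt_tokens=512, batch_size=32),
    Workload("streaming", prompt_tokens=1000, stream=True),
]


def filler_text(approx_tokens: int) -> str:
    # Assumption: one short word per token. A real run would measure
    # prompt length with the target model's tokenizer.
    return " ".join(["refund"] * approx_tokens)


def build_payload(workload: Workload, model: str) -> dict:
    """Build an OpenAI-style request body for one workload."""
    if workload.name == "embedding":
        texts = [filler_text(workload.prompt_tokens)] * workload.batch_size
        return {"model": model, "input": texts}
    payload = {
        "model": model,
        "messages": [
            {"role": "user", "content": filler_text(workload.prompt_tokens)}
        ],
        "stream": workload.stream,
    }
    if workload.max_tokens is not None:
        payload["max_tokens"] = workload.max_tokens
    return payload
```

Keeping the workload definitions in one place means each platform receives the same request bodies, so differences in the results come from the gateway rather than from the request.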

The Contenders

Platform	Version/Date	Routing Strategy	Caching
OpenRouter	Feb 2025	Latency-optimised	Optional
Portkey	v2.3.1	Custom rules engine	Built-in
Martian	2025.01	Model-agnostic router	None
Anyscale	Ray 2.40+	Replica-aware	Disk-based
Together AI	Jan 2025	Load-balanced	Redis

Architecture


graph LR
 A[Locust Load Generator<br/>c6i.xlarge × 3] --> B[API Gateway Layer]
 B --> C[OpenRouter]
 B --> D[Portkey]
 B --> E[Martian]
 B --> F[Anyscale]
 B --> G[Together AI]
 C --> H[OpenAI GPT-4o]
 C --> I[Claude 3.5 Sonnet]
 C --> J[Groq LLaMA 3]
 D --> H
 D --> I
 E --> K[Mistral Large]
 F --> L[Llama 3.1 405B]
 G --> M[Mixtral 8x22B]

I deployed three c6i.xlarge instances across us-east-1, eu-west-1, and ap-southeast-1 to simulate globally distributed clients. Each ran Locust with 200 concurrent users, ramping up over 5 minutes and sustaining for 30. In total, about 2.1 million requests across all platforms.

The Results: Numbers That Actually Matter

1. Latency (Because Nobody Likes Waiting)

This is end-to-end: client → aggregation platform → model provider → aggregation platform → client. All in milliseconds. Lower is better. Obviously.


# This is the core measurement function
import time

import httpx

# PLATFORM_ENDPOINTS (platform -> base URL) and get_api_key() are assumed
# to be defined elsewhere in the benchmark suite.


async def measure_latency(platform: str, model: str, prompt: str) -> dict:
    start = time.monotonic()
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.post(
            f"{PLATFORM_ENDPOINTS[platform]}/chat/completions",
            json={
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 200,
            },
            headers={"Authorization": f"Bearer {get_api_key(platform)}"},
        )
    elapsed = (time.monotonic() - start) * 1000  # convert to ms
    return {"latency_ms": elapsed, "status": response.status_code}
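
The post doesn't say how the per-request results were rolled up into p50/p95/p99 and an error rate. Here is a minimal sketch of one way to do it, assuming nearest-rank percentiles over successful requests and treating any non-200 status as an error. `percentile` and `summarise` are hypothetical helpers, not functions from the repo.

```python
import math


def percentile(sorted_values: list[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarise(results: list[dict]) -> dict:
    """Collapse per-request results into p50/p95/p99 and an error rate.

    Each result is a dict shaped like measure_latency()'s return value:
    {"latency_ms": float, "status": int}.
    """
    ok = sorted(r["latency_ms"] for r in results if r["status"] == 200)
    errors = sum(1 for r in results if r["status"] != 200)
    return {
        "p50": percentile(ok, 50),
        "p95": percentile(ok, 95),
        "p99": percentile(ok, 99),
        "error_rate": errors / len(results),
    }


if __name__ == "__main__":
    # Synthetic input purely to exercise the function, not benchmark data.
    sample = [{"latency_ms": float(ms), "status": 200} for ms in range(1, 101)]
    sample.append({"latency_ms": 30000.0, "status": 502})
    print(summarise(sample))
```

Excluding failed requests from the percentiles keeps timeouts from distorting the latency figures; the error rate reports them separately.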

GPT-4o (Direct vs. Aggregators):


Platform	p50 (ms)	p95 (ms)	p99 (ms)	Error Rate
OpenAI Direct	1,240	2,890	4,120	0.02%
OpenRouter	1,310	3,450	5,200	0.15%
Portkey	1,280	2,940	4,300	0.08%
Together AI	1,520	3,890	6,100	0.42%

Claude 3.5 Sonnet:

Platform	p50 (ms)	p95 (ms)	p99 (ms)	Error Rate
Anthropic Direct	1,890	4,200	6,500	0.01%
OpenRouter	2,050	4,890	7,800	0.23%
Portkey	1,920	4,310	6,700	0.11%
Anyscale	2,340	5,600	9,200	0.67%

Portkey added roughly 30-40ms at p50. That's genuinely impressive when you consider they're doing request validation, rate limiting, and logging on every call.

OpenRouter's p99 told a different story. Its tail ran 1,080ms behind direct for GPT-4o and 1,300ms behind for Claude 3.5 Sonnet, which suggests routing delays during peak loads. If your application is latency-sensitive at the tail end (and whose isn't?), that's worth flagging.

Together AI... look, I wanted to like them. They've got solid open-source model support. But with GPT-4o, their p99 hit 6,100ms. That's nearly 2 seconds slower than going direct (4,120ms). For a chatbot, that's the difference between "this feels snappy" and "I've already opened a new tab."
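
To make the overhead claims easy to check, here is a small sketch that subtracts the direct-provider percentiles from each gateway's. The numbers are copied from the latency figures reported above; the `LATENCY_MS` layout and the `overhead` helper are just illustrative names.

```python
# Percentiles in ms, copied from the GPT-4o and Claude 3.5 Sonnet results above.
LATENCY_MS = {
    "gpt-4o": {
        "direct": {"p50": 1240, "p95": 2890, "p99": 4120},
        "OpenRouter": {"p50": 1310, "p95": 3450, "p99": 5200},
        "Portkey": {"p50": 1280, "p95": 2940, "p99": 4300},
        "Together AI": {"p50": 1520, "p95": 3890, "p99": 6100},
    },
    "claude-3.5-sonnet": {
        "direct": {"p50": 1890, "p95": 4200, "p99": 6500},
        "OpenRouter": {"p50": 2050, "p95": 4890, "p99": 7800},
        "Portkey": {"p50": 1920, "p95": 4310, "p99": 6700},
    },
}


def overhead(model: str, platform: str) -> dict[str, int]:
    """Added latency in ms at each percentile, relative to going direct."""
    direct = LATENCY_MS[model]["direct"]
    via = LATENCY_MS[model][platform]
    return {pct: via[pct] - direct[pct] for pct in direct}


if __name__ == "__main__":
    for model, platforms in LATENCY_MS.items():
        for platform in platforms:
            if platform != "direct":
                print(model, platform, overhead(model, platform))
```

On these figures Portkey's p50 overhead comes out at 40ms for GPT-4o and 30ms for Claude 3.5 Sonnet, and Together AI's GPT-4o p99 sits 1,980ms above direct.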

2. Throughput: Tokens Per Second Under Load

Sustained throughput over 30 minutes, 200 concurrent connections. This test reveals what happens when you're actually serving users, not just running curl in a loop.


# How I ran it
locust -f locustfile.py --headless \
 --users 200 --spawn-rate 20 --run-time 30m \
 --host https://api.portkey.ai \
 --csv=results/portkey_throughput

Streaming Throughput (GPT-4o, tokens/sec):


Platform        Avg TPS   Peak TPS   Min TPS   Stability
─────────────────────────────────────────────────────────────
OpenAI Direct   184.2     312.5      98.7      ±12.3%
OpenRouter      156.8     289.3      67.2      ±18.9%
Portkey         178.9     301.4      112.5     ±8.7%
Martian         142.3     267.8      54.1      ±22.4%
Together AI     131.7     245.6      43.8      ±31.2%

That "Stability" column? It's the coefficient of variation. Lower is better. Portkey's ±8.7% was the most consistent by a mile. Together AI's ±31.2% was... let's call it "exciting."
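
For anyone who hasn't met the coefficient of variation before: it's the standard deviation expressed as a percentage of the mean. A minimal sketch, assuming the population standard deviation over per-interval throughput samples (the post doesn't say which variant or sampling interval was used). The two sample lists are made-up illustrations, not measurements.

```python
import statistics


def coefficient_of_variation(samples: list[float]) -> float:
    """Population standard deviation as a percentage of the mean."""
    mean = statistics.fmean(samples)
    return statistics.pstdev(samples) / mean * 100


if __name__ == "__main__":
    # Illustrative tokens/sec samples only, chosen to show the contrast.
    steady = [175.0, 180.0, 185.0, 180.0]
    bursty = [60.0, 240.0, 90.0, 210.0]
    print(f"steady: ±{coefficient_of_variation(steady):.1f}%")
    print(f"bursty: ±{coefficient_of_variation(bursty):.1f}%")
```

Because the value is normalised by the mean, platforms with very different average throughput can be compared on the same scale.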

I learned this the hard way. Our autoscaling group kept triggering scale-up events because Together AI's throughput would randomly plummet. Then it'd recover. Then plummet again. My phone was blowing up with PagerDuty alerts at 4 AM. After the third false alarm, I muted the channel. (Don't do that. But you understand the impulse.)

Embedding Throughput (text-embedding-3-small, requests/sec):


Platform        Req/sec   Avg Latency   Batch Efficiency
─────────────────────────────────────────────────────────────
OpenAI Direct   42.3      780ms         94.2%
Anyscale        38.7      890ms         86.1%
Portkey         41.1      810ms         91.5%
OpenRouter      35.2      1,020ms       78.4%

Batch efficiency here means the percentage of theoretical maximum throughput. OpenRouter's 78.4% suggests they're not optimising embedding batch sizes properly. That's a lot of wasted capacity when you're processing millions of documents.
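
The theoretical maximum isn't stated, but it can be backed out from the reported numbers. A small sketch, assuming batch efficiency is simply measured requests/sec divided by that maximum; `implied_max_rps` is a hypothetical helper name.

```python
# (requests/sec, batch efficiency %) copied from the embedding results above.
EMBEDDING_RESULTS = {
    "OpenAI Direct": (42.3, 94.2),
    "Anyscale": (38.7, 86.1),
    "Portkey": (41.1, 91.5),
    "OpenRouter": (35.2, 78.4),
}


def implied_max_rps(req_per_sec: float, efficiency_pct: float) -> float:
    """Theoretical maximum implied by efficiency = measured / maximum."""
    return req_per_sec / (efficiency_pct / 100)


if __name__ == "__main__":
    for platform, (rps, eff) in EMBEDDING_RESULTS.items():
        print(f"{platform}: implied max {implied_max_rps(rps, eff):.1f} req/s")
```

All four rows imply a ceiling of roughly 44.9 requests/sec, so under this reading the efficiency column is each platform's throughput measured against a single shared maximum.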

3. The Real Costs (Including the Hidden Bits)

Aggregation platforms add markup. Here's what you'll actually pay per 1M tokens (February 2025 pricing):

GPT-4o (Input/Output per 1M tokens):


Provider	Input Cost	Output Cost	Effective Markup
OpenAI Direct	$2.50	$10.00	0%
Portkey	$2.75	$11.00	10%
OpenRouter	$2.62	$10.50	5%
Together AI	$3.00	$12.00	20%

But here's the twist—and this surprised me.

Aggregation platforms can actually save you money through intelligent routing. Portkey's "fallback to cheapest model" feature reduced our overall costs by 18% last month. It routed about 30% of traffic to Claude 3.5 Haiku when latency requirements allowed it, instead of always hitting GPT-4o.


# Portkey's config for cost-optimised routing
{
  "strategy": "cost-minimization",
  "targets": [
    {"model": "gpt-4o", "max_cost_per_1k": 0.015},
    {"model": "claude-3-5-sonnet", "max_cost_per_1k": 0.012},
    {"model": "claude-3-5-haiku", "max_cost_per_1k": 0.004}
  ],
  "fallback_order": ["gpt-4o", "claude-3-5-sonnet", "claude-3-5-haiku"]
}

So the 10% markup on paper translated to an 18% net saving. Funny how that works.
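
How a 10% markup turns into an 18% saving is easier to see as arithmetic. This is a back-of-the-envelope sketch under loud assumptions: the same markup applies to every model, requests are the same size on both routes, and nothing else (caching, retries) affects the bill. `cheap_ratio` is a hypothetical parameter for the cheaper model's per-request cost as a fraction of GPT-4o's.

```python
def relative_cost(markup: float, cheap_share: float, cheap_ratio: float) -> float:
    """Spend relative to sending everything direct to the expensive model.

    1.0 means no change; 0.82 means an 18% saving.
    """
    blended = (1 - cheap_share) + cheap_share * cheap_ratio
    return (1 + markup) * blended


def break_even_share(markup: float, cheap_ratio: float) -> float:
    """Share of traffic that must move to the cheaper model to cancel the markup."""
    return (markup / (1 + markup)) / (1 - cheap_ratio)


def implied_cheap_ratio(markup: float, cheap_share: float, net_saving: float) -> float:
    """Cheaper-model cost ratio that would produce the stated net saving."""
    target = (1 - net_saving) / (1 + markup)
    return (target - (1 - cheap_share)) / cheap_share


if __name__ == "__main__":
    # Figures from the text: 10% markup, 30% of traffic rerouted, 18% saved.
    ratio = implied_cheap_ratio(markup=0.10, cheap_share=0.30, net_saving=0.18)
    print(f"implied cheap-model cost ratio: {ratio:.3f}")
    print(f"check, relative cost: {relative_cost(0.10, 0.30, ratio):.2f}")
    print(f"break-even share at that ratio: {break_even_share(0.10, ratio):.1%}")
```

Under this simplified model, the figures quoted above only line up if the rerouted requests cost about 15% as much as the GPT-4o ones. If your cheaper route isn't that cheap, expect a smaller saving or look for other contributors such as cache hits.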

4. The 3 AM Incident (Or: Why Health Check Intervals Matter)

Right. Let me tell you about that cold coffee.

We'd configured OpenRouter to route between GPT-4o and Claude 3.5 Sonnet. The logic was simple: use whichever was available, prioritise the faster one. At 2:47 AM UTC, Anthropic's API experienced a partial outage in us-east-1. Nothing catastrophic—just 4 minutes of degraded performance.

OpenRouter's health check didn't notice.

Why? They were polling every 60 seconds with a 30-second timeout. For 90 agonising seconds, traffic kept flowing to a dead endpoint. 14,000 requests failed with 502 Bad Gateway. Our error budget for the entire quarter evaporated in under two minutes.

I was the on-call engineer. My coffee went cold while I manually switched traffic to the OpenAI-only endpoint at 3:04 AM.

The fix—after a very pointed email to OpenRouter's support team—was migrating to Portkey with proper health check configuration:


// Portkey gateway config with aggressive health checks
const gateway = new Portkey({
  healthCheckInterval: 5000, // every 5 seconds instead of 60
  circuitBreaker: {
    failureThreshold: 3,
    recoveryTimeout: 10000, // 10 seconds
    halfOpenMaxRequests: 5
  }
});

This config would've detected the Anthropic outage in 5 seconds instead of 90. Instead of 14,000 failures, we'd have seen roughly 780. That's the difference between "minor blip" and "all-hands incident review."
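
The failure estimate is simple rate-times-window arithmetic. A minimal sketch, assuming requests failed at a constant rate throughout the blind window; the function names and the `probes_to_trip` parameter are hypothetical, and whether the failure threshold counts health probes or live requests is not something the source states.

```python
# Figures from the incident described above.
FAILED_REQUESTS = 14_000
POLL_INTERVAL_S = 60
PROBE_TIMEOUT_S = 30


def worst_case_detection_s(
    interval_s: float, timeout_s: float = 0.0, probes_to_trip: int = 1
) -> float:
    """Longest an outage can go unnoticed: wait for the probes, then the timeout."""
    return interval_s * probes_to_trip + timeout_s


def expected_failures(window_s: float, failures_per_s: float) -> float:
    """Failed requests during a blind window at a constant failure rate."""
    return window_s * failures_per_s


blind_window = worst_case_detection_s(POLL_INTERVAL_S, PROBE_TIMEOUT_S)  # 90 s
failure_rate = FAILED_REQUESTS / blind_window  # failed requests per second

if __name__ == "__main__":
    scenarios = [
        ("60 s poll + 30 s timeout", blind_window),
        ("5 s detection", 5.0),
        ("three failed probes at 5 s", worst_case_detection_s(5, probes_to_trip=3)),
    ]
    for label, window in scenarios:
        failures = expected_failures(window, failure_rate)
        print(f"{label}: about {failures:,.0f} failed requests")
```

At the incident's failure rate, a 5-second window works out to about 778 failures, in line with the rough 780 quoted above. If the breaker only trips after three failed probes, the same arithmetic gives about 2,333, which is still a fraction of 14,000.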

Lesson learned. Health check intervals aren't a config detail—they're a reliability feature.
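
If circuit breakers are new to you, here is a minimal, gateway-agnostic sketch of the pattern those three settings describe. It is not Portkey's implementation; the class and its parameter names simply mirror the config above, and closing the breaker on the first half-open success is a simplifying assumption.

```python
import time


class CircuitBreaker:
    """Closed -> open after repeated failures -> half-open after a cool-down."""

    def __init__(self, failure_threshold=3, recovery_timeout=10.0,
                 half_open_max_requests=5, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_requests = half_open_max_requests
        self._clock = clock  # injectable so the breaker can be tested without sleeping
        self._state = "closed"
        self._failures = 0
        self._opened_at = 0.0
        self._trial_requests = 0

    @property
    def state(self) -> str:
        # An open breaker becomes half-open once the recovery timeout has passed.
        if (self._state == "open"
                and self._clock() - self._opened_at >= self.recovery_timeout):
            self._state = "half_open"
            self._trial_requests = 0
        return self._state

    def allow_request(self) -> bool:
        state = self.state
        if state == "closed":
            return True
        if state == "half_open" and self._trial_requests < self.half_open_max_requests:
            # Let a limited number of trial requests through to test recovery.
            self._trial_requests += 1
            return True
        return False

    def record_success(self) -> None:
        # Simplification: any success closes the breaker and clears the count.
        self._state = "closed"
        self._failures = 0

    def record_failure(self) -> None:
        self._failures += 1
        # A failure while half-open, or too many while closed, opens the breaker.
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self._state = "open"
            self._opened_at = self._clock()
            self._failures = 0


if __name__ == "__main__":
    breaker = CircuitBreaker()
    for _ in range(3):
        breaker.record_failure()
    print(breaker.state, breaker.allow_request())
```

A router would call `allow_request()` before sending traffic to a provider and fall back to another provider when it returns `False`, so a dead endpoint stops receiving requests after a handful of failures instead of after the next slow health poll.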

So Which Platform Should You Actually Choose?

Go with OpenRouter if:

You want access to 200+ models through a single API key
Occasional p99 latency spikes won't wake you up at night
You're prototyping and value simplicity over fine-grained control

Go with Portkey if:

You're running production workloads that need 99.9%+ uptime
You want caching, rate limiting, and request queuing out of the box
Observability matters to you—their OpenTelemetry integration is solid

Go with Anyscale if:

You're serving fine-tuned Llama or Mistral models at scale
You're already in the Ray ecosystem (and if you are, you know who you are)
Cost optimisation for open-source models is your priority

Go with Together AI if:

You're exclusively using open-source models
You need fine-tuning capabilities alongside inference
You've got robust retry logic to handle throughput variability

Skip aggregation platforms entirely if:

You only use one model provider
Your latency budget is under 100ms (go direct—seriously)
You've got a dedicated MLOps team managing model deployments

Deploy Your Own Benchmarks (Because Trust But Verify)

I've open-sourced everything—the Locust scripts, the analysis notebooks, the Terraform modules. Here's the quick start:


# main.tf
module "benchmark_infra" {
  source = "github.com/rajpatel-ops/ai-gateway-benchmarks//terraform"

  regions         = ["us-east-1", "eu-west-1", "ap-southeast-1"]
  instance_type   = "c6i.xlarge"
  locust_users    = 200
  test_duration_m = 30

  platforms = {
    openrouter = { api_key = var.openrouter_key }
    portkey    = { api_key = var.portkey_key }
    together   = { api_key = var.together_key }
  }
}


# Deploy and run
terraform init && terraform apply -auto-approve
./scripts/run-benchmarks.sh
./scripts/generate-report.py --output report.html

Full results, raw CSV data, and the Grafana dashboard JSON are in the GitHub repo. If you spot something wrong with my methodology—and I'm sure there's something—open an issue. I'd rather be corrected than confident and wrong.

What I'm Testing Next

This week, I'm adding benchmarks for:

Groq's LPU inference — they claim 300+ tokens/sec for Llama 3.1 70B. I'll believe it when I see it.
Cloudflare AI Gateway — recently went GA, edge-based routing. Could be interesting for globally distributed apps.
AWS Bedrock's cross-region inference — us-east-1 + eu-west-1 + ap-northeast-1. If they've nailed the routing, this could be the enterprise play.

I'll publish the follow-up in March 2025. If you've benchmarked any of these platforms with different workloads—or if you've found something I missed—I'd genuinely love to compare notes.

What's your experience with AI API aggregation? Have you seen similar latency patterns, or did your tests reveal something completely different? Drop a comment below. I'm particularly curious about Azure AI Studio and Google Cloud's Model Garden. The cloud-native offerings might change the equation entirely.

Tags: #ai #api-gateway #benchmarking #openai #anthropic #portkey #openrouter #latency #throughput #devops #sre



Cael Lee
