
I Turned Apache APISIX into an AI Gateway and It Tripled Our QPS While Saving Two Servers

By Cael Lee | 11 min read

Last November, I took on a gig that nearly broke me—rebuilding the API gateway for an AI startup that was haemorrhaging money. They'd just integrated GPT-4 and Claude, their daily API calls had exploded from 100K to 5 million, and their Nginx reverse proxy was falling over. The worst outage lasted four hours. They lost over ¥200,000 in customer refunds.

Their CTO slammed the table and said three words: "Cloud-native gateway. Now."

Here's the thing nobody tells you about AI infrastructure: the commercial API gateways are utterly useless for LLM workloads. Kong's AI plugin? $500/month per node, and it doesn't even support SSE streaming billing. Azure API Management locks you into their ecosystem. Tyk's SSE handling is, charitably speaking, a work in progress.

So I built our own. Four months later, we're handling 8 million requests per day. Here's exactly how we did it, including the bits that went spectacularly wrong.

TL;DR for the Impatient

Why APISIX? (And Why Not Kong or Envoy)

Their stack was a proper mess—Python FastAPI, Go Gin, Java Spring Boot, models spread across AWS SageMaker and a self-hosted vLLM cluster. The requirements were brutal:

  1. Multi-model routing: Same /v1/chat/completions endpoint, but forward to different backends based on the model field in the request body
  2. Token-level billing: Count tokens per request, enforce quotas per customer
  3. Streaming responses: SSE must be rock-solid, zero packet loss
  4. Intelligent failover: Model returns 503? Auto-switch to backup. GPT-4's down? Route to Claude

I tried Kong first. The community edition has decent features, but the plugins are Lua. Parsing SSE streams for token counting in Lua's regex engine? It was a nightmare. The JSON streaming support is—how do I put this politely—not great. And Kong 3.x has had a persistent WebSocket bug that kept killing our SSE long connections. Two days of debugging later, I gave up.

Nginx? Please. You can write Lua modules for it, but dynamic route updates require a reload. In 2024, nobody should be reloading Nginx in production. That's just asking for trouble.

APISIX won on three counts: hot-reloadable plugins (no restarts, ever), multi-language support (more on that later), and an etcd-based config centre that actually works at scale. Oh, and the route-matching performance is about 30% faster than Kong's—I'll show you the benchmarks later.

One thing that genuinely surprised me: the APISIX community is remarkably responsive. I filed a bug on a Saturday evening and got a reply within two hours. Kong's enterprise support isn't even that fast, and you pay through the nose for it.

The Three Big Modifications

1. Dynamic Model Routing (The "Body Parsing" Problem)

OpenAI's API format looks innocent enough:


{
 "model": "gpt-4",
 "messages": [{"role": "user", "content": "Hello"}]
}

But here's the catch: APISIX's native routing only works with headers, query params, and URIs. It can't parse request bodies for routing decisions.

My first attempt was embarrassingly naive. I tried using the serverless-pre-function plugin to read the body and modify the upstream variable. It failed spectacularly because—surprise—body reading is asynchronous, so the routing decision was made before we even got the model name. Classic race condition.

Wait, I should correct myself. APISIX 3.2+ does support radixtree_uri_with_parameter mode, but that route was too convoluted for this use case. I ended up writing a tiny plugin:


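-- Simplified version. Full code on GitHub
local ngx = ngx
local cjson = require("cjson.safe")

local plugin_name = "ai-model-router"

-- Plugin table (missing from the original snippet).
-- The priority and the empty schema are placeholder values.
local _M = {
    version = 0.1,
    priority = 1000,
    name = plugin_name,
    schema = { type = "object", properties = {} },
}

-- Model name -> upstream name
local model_upstreams = {
    ["gpt-4"] = "upstream_gpt4",
    ["claude-3"] = "upstream_claude3",
    ["llama-70b"] = "upstream_vllm_cluster"
}

function _M.access(conf, ctx)
    -- Force the body to be read before making the routing decision
    ngx.req.read_body()
    -- Note: get_body_data() returns nil if the body was spooled to a temp file
    local body = ngx.req.get_body_data()

    if not body then
        return 400, {error = "Empty body"}
    end

    -- cjson.safe returns nil instead of raising on malformed JSON
    local data = cjson.decode(body)
    if not data or not data.model then
        return 400, {error = "Model field required"}
    end

    local upstream_name = model_upstreams[data.model]
    if not upstream_name then
        return 404, {error = "Model not supported"}
    end

    ctx.matched_upstream = upstream_name
end

return _M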

Routing latency added: 0.3ms. Not bad.
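
If you want to unit-test the routing table outside the gateway, the same decision logic is easy to restate in Go. This is only a sketch mirroring the Lua plugin above, not part of the deployed gateway: the function name `routeByModel` and the use of HTTP status codes as return values are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// modelUpstreams mirrors the mapping in the Lua plugin.
var modelUpstreams = map[string]string{
	"gpt-4":     "upstream_gpt4",
	"claude-3":  "upstream_claude3",
	"llama-70b": "upstream_vllm_cluster",
}

// routeByModel returns the upstream name for a chat-completions body,
// or an HTTP status code describing why routing failed.
func routeByModel(body []byte) (upstream string, status int) {
	if len(body) == 0 {
		return "", http.StatusBadRequest
	}
	// Only the "model" field matters for routing; the rest is ignored.
	var req struct {
		Model string `json:"model"`
	}
	if err := json.Unmarshal(body, &req); err != nil || req.Model == "" {
		return "", http.StatusBadRequest
	}
	name, ok := modelUpstreams[req.Model]
	if !ok {
		return "", http.StatusNotFound
	}
	return name, http.StatusOK
}

func main() {
	body := []byte(`{"model":"gpt-4","messages":[{"role":"user","content":"Hello"}]}`)
	upstream, status := routeByModel(body)
	fmt.Println(upstream, status)
}
```

Keeping the mapping in one table means adding a model is a one-line change, which is the same property the Lua version relies on.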

The painful bit: APISIX 3.2 had a bug where ctx.matched_upstream would get randomly overwritten in certain conditions. Spent an entire afternoon on that one before realising it was a version issue. Upgraded to 3.4.1 and it vanished. I submitted PR #10234 for the fix—it's been backported to 3.3 as well.

2. Token Counting and Real-Time Billing (The Hardest Part)

This was the beast. The client wanted per-token billing, and they wanted it at the gateway layer. Why? Because they had Web, mobile app, and API clients—rewriting billing logic across all three would've been a nightmare.

The breakdown:

Input estimation was straightforward—just use the tiktoken library. But APISIX is Lua. There's no tokenizer in Lua's ecosystem. My solution? Write a Go sidecar:


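package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
	"github.com/pkoukk/tiktoken-go"
)

// TokenRequest is the payload sent by the gateway (field names assumed).
type TokenRequest struct {
	Model string `json:"model"`
	Text  string `json:"text"`
}

func countTokens(c *gin.Context) {
	var req TokenRequest
	if err := c.ShouldBindJSON(&req); err != nil {
		c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
		return
	}

	enc, err := tiktoken.EncodingForModel(req.Model)
	if err != nil {
		c.JSON(http.StatusBadRequest, gin.H{"error": "unknown model"})
		return
	}
	tokens := enc.Encode(req.Text, nil, nil)

	c.JSON(http.StatusOK, gin.H{"count": len(tokens)})
}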

APISIX calls this via ext-plugin-pre-req and ext-plugin-post-req. Latency overhead: under 5ms. The sidecar sits in the same K8s cluster, so network latency is basically nil.

But output token counting? That's where I nearly lost my mind.

SSE data streams look like this:


data: {"choices":[{"delta":{"content":"Hello"}}]}

data: {"choices":[{"delta":{"content":" world"}}]}

data: [DONE]

You need to parse each SSE event line-by-line, extract and concatenate content fields, then count tokens. Here's the kicker: the data: prefix can get split across TCP packets. One packet ends with da, the next starts with ta: {...}. Lua's string handling for these edge cases is, frankly, a minefield.

I spent two full days on this. Here's the buffer-based approach that finally worked:


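function _M.body_filter(conf, ctx)
    local chunk = ngx.arg[1]
    local eof = ngx.arg[2]

    -- Prepend whatever partial line was left over from the previous chunk
    local buffer = (ctx.buffer or "") .. (chunk or "")

    -- Consume only newline-terminated lines; a trailing fragment such as
    -- "da" (from a split "data: ") stays buffered until the next chunk
    local lines = {}
    local pos = 1
    while true do
        local nl = buffer:find("\n", pos, true)
        if not nl then
            break
        end
        local line = buffer:sub(pos, nl - 1):gsub("\r$", "")
        if line:match("^data: ") then
            table.insert(lines, line)
        end
        pos = nl + 1
    end

    -- Process complete SSE events
    -- ...

    -- Keep the unterminated tail; drop it once the stream has ended
    ctx.buffer = (not eof) and buffer:sub(pos) or nil
    -- ngx.arg[1] is left untouched, so the client sees the original chunk
end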
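
To make the split-packet problem concrete, here is the same buffering idea as a small, self-contained Go sketch that also extracts and concatenates the `content` deltas. It is not the plugin's code: the `sseAccumulator` type and its methods are hypothetical names, and it assumes events arrive in the OpenAI-style `data: {...}` format shown earlier.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// sseAccumulator buffers raw chunks and collects the streamed content.
type sseAccumulator struct {
	pending string          // bytes received after the last newline
	content strings.Builder // concatenated delta.content values
	done    bool            // true once "data: [DONE]" was seen
}

// Feed appends a chunk and processes every complete line it now holds.
func (a *sseAccumulator) Feed(chunk string) {
	a.pending += chunk
	for {
		nl := strings.IndexByte(a.pending, '\n')
		if nl < 0 {
			return // incomplete line: wait for the next chunk
		}
		line := strings.TrimSuffix(a.pending[:nl], "\r")
		a.pending = a.pending[nl+1:]
		a.handleLine(line)
	}
}

// handleLine decodes one complete SSE line and appends its content delta.
func (a *sseAccumulator) handleLine(line string) {
	const prefix = "data: "
	if !strings.HasPrefix(line, prefix) {
		return // blank separator lines, comments, other fields
	}
	payload := line[len(prefix):]
	if payload == "[DONE]" {
		a.done = true
		return
	}
	var event struct {
		Choices []struct {
			Delta struct {
				Content string `json:"content"`
			} `json:"delta"`
		} `json:"choices"`
	}
	if err := json.Unmarshal([]byte(payload), &event); err != nil {
		return // skip malformed payloads rather than failing the stream
	}
	for _, c := range event.Choices {
		a.content.WriteString(c.Delta.Content)
	}
}

func main() {
	var acc sseAccumulator
	// The "data: " prefix is deliberately split across two chunks.
	chunks := []string{
		"data: {\"choices\":[{\"delta\":{\"content\":\"Hello\"}}]}\n\nda",
		"ta: {\"choices\":[{\"delta\":{\"content\":\" world\"}}]}\n\n",
		"data: [DONE]\n\n",
	}
	for _, chunk := range chunks {
		acc.Feed(chunk)
	}
	fmt.Printf("%q done=%v\n", acc.content.String(), acc.done)
}
```

Because parsing only happens at newline boundaries, a chunk that ends mid-prefix or mid-character simply waits in `pending` until the rest arrives. The accumulated text can then be handed to the tokenizer once the stream finishes.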

After deployment, token accuracy hit 99.8%. Occasionally it's off by 1-2 tokens due to weird Unicode edge cases. The client was thrilled—it's miles better than their previous approach of reverse-engineering token counts from API pricing.

3. Intelligent Failover (The 3 AM Hero)

AI services are flaky. GPT-4 had a global 3-hour outage last November. Claude throws 529 Overloaded errors when traffic spikes. In January, Anthropic had a massive failure that lasted nearly two hours.

APISIX's built-in health checks only do TCP/HTTP probing. But AI model failures are more nuanced:

I built a circuit breaker plugin:


ai-circuit-breaker:
  rules:
    - match:
        model: "gpt-4"
      upstream: "upstream_gpt4"
      fallback:
        - upstream: "upstream_claude3"
          condition: "status_code >= 500 or latency > 30000"
        - upstream: "upstream_gpt35"
          condition: "status_code == 429"
      break_duration: 60
      max_failures: 5
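
One way to read that rule list is "first matching condition wins". The Go sketch below encodes the two conditions as plain predicates instead of parsing the condition strings, so treat it as an illustration of the selection order only. The types `outcome` and `fallbackRule` and the first-match semantics are assumptions, not something taken from the plugin.

```go
package main

import "fmt"

// outcome is what the gateway observed for one upstream call.
type outcome struct {
	StatusCode int
	LatencyMs  int
}

// fallbackRule pairs an upstream with the condition that selects it.
type fallbackRule struct {
	Upstream string
	Matches  func(outcome) bool
}

// Hypothetical Go encoding of the gpt-4 rule from the config.
var gpt4Fallbacks = []fallbackRule{
	{"upstream_claude3", func(o outcome) bool { return o.StatusCode >= 500 || o.LatencyMs > 30000 }},
	{"upstream_gpt35", func(o outcome) bool { return o.StatusCode == 429 }},
}

// pickFallback returns the first fallback whose condition matches,
// or "" when the primary upstream should keep serving.
func pickFallback(rules []fallbackRule, o outcome) string {
	for _, r := range rules {
		if r.Matches(o) {
			return r.Upstream
		}
	}
	return ""
}

func main() {
	fmt.Println(pickFallback(gpt4Fallbacks, outcome{StatusCode: 503, LatencyMs: 120}))
	fmt.Println(pickFallback(gpt4Fallbacks, outcome{StatusCode: 429, LatencyMs: 80}))
}
```

Under these assumptions a 503 or a response slower than 30 seconds goes to Claude, while a 429 rate limit goes to the cheaper GPT-3.5 upstream.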

The core logic:


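function _M.access(conf, ctx)
    local breaker_state = get_from_redis("breaker:" .. ctx.model)
    if breaker_state == "open" then
        -- Breaker is open: send traffic to the first fallback
        ctx.matched_upstream = conf.fallback[1].upstream
        return
    end

    ctx.matched_upstream = conf.upstream
end

function _M.log(conf, ctx)
    local status = ngx.status
    -- $upstream_response_time is a string in seconds; convert to a number in ms.
    -- conf.timeout is assumed to be in ms, like the 30000 in the config.
    local latency_ms = (tonumber(ngx.var.upstream_response_time) or 0) * 1000

    if status >= 500 or latency_ms > conf.timeout then
        -- Cosockets are unavailable in the log phase, so the Redis helpers
        -- have to do their work inside an ngx.timer.at callback
        local count = incr_redis("fail:" .. ctx.model)
        if count >= conf.max_failures then
            set_redis_with_ttl("breaker:" .. ctx.model, "open", conf.break_duration)
        end
    end
end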
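
The state machine is easier to see without Redis in the way. Below is a minimal in-memory sketch in Go, assuming the same `max_failures: 5` and `break_duration: 60` values from the config. The `breaker` type, the `simulate` helper, and the choice to reset the failure count on any success are all assumptions for illustration, since the Lua above does not show how the counter is reset.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// breaker is an in-memory stand-in for the Redis-backed state.
type breaker struct {
	mu            sync.Mutex
	maxFailures   int
	breakDuration time.Duration
	failures      int
	openUntil     time.Time
}

// Record notes one request outcome and opens the breaker once
// maxFailures failures have accumulated.
func (b *breaker) Record(now time.Time, failed bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if !failed {
		b.failures = 0 // assumption: any success resets the count
		return
	}
	b.failures++
	if b.failures >= b.maxFailures {
		b.openUntil = now.Add(b.breakDuration)
		b.failures = 0
	}
}

// Upstream picks the fallback while the breaker is open.
func (b *breaker) Upstream(now time.Time, primary, fallback string) string {
	b.mu.Lock()
	defer b.mu.Unlock()
	if now.Before(b.openUntil) {
		return fallback
	}
	return primary
}

// simulate records n failures at t=0 and reports the chosen upstream
// one second later and again after the break duration has passed.
func simulate(n int) (during, after string) {
	b := &breaker{maxFailures: 5, breakDuration: 60 * time.Second}
	start := time.Unix(0, 0)
	for i := 0; i < n; i++ {
		b.Record(start, true)
	}
	during = b.Upstream(start.Add(time.Second), "upstream_gpt4", "upstream_claude3")
	after = b.Upstream(start.Add(61*time.Second), "upstream_gpt4", "upstream_claude3")
	return during, after
}

func main() {
	during, after := simulate(5)
	fmt.Println(during, after)
}
```

Passing the clock in as a parameter keeps the breaker deterministic in tests. In the gateway itself, the Redis TTL plays the role of `openUntil`.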

This saved our bacon in the second week. At 3 AM, GPT-4 started returning 503s. The gateway detected it, opened the circuit breaker, and switched to Claude—all within 15 seconds. The business didn't even notice. The client only realised what happened when they checked the monitoring dashboards the next morning. They were so relieved, they sent hongbao (red envelopes with money—a Chinese tradition) to the team group chat.

Failover response time dropped from 45 seconds (manual intervention) to 8 seconds (automatic). Annual availability went from 99.5% to 99.95%. That 0.45-percentage-point difference works out to roughly 39 hours of downtime a year that simply don't happen anymore.
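
For anyone who wants to check the availability arithmetic, here is a quick Go sketch. It assumes a 365-day year and treats the availability figures as annual averages. The helper name `downtimeHours` is just for illustration.

```go
package main

import "fmt"

const hoursPerYear = 365 * 24 // 8760, ignoring leap years

// downtimeHours converts an availability percentage into expected
// hours of downtime per year.
func downtimeHours(availabilityPct float64) float64 {
	return (100 - availabilityPct) / 100 * hoursPerYear
}

func main() {
	before := downtimeHours(99.5)
	after := downtimeHours(99.95)
	fmt.Printf("before: %.1f h/year\n", before)
	fmt.Printf("after:  %.1f h/year\n", after)
	fmt.Printf("saved:  %.1f h/year\n", before-after)
}
```

That is about 43.8 hours of downtime a year at 99.5% versus about 4.4 hours at 99.95%.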

Performance Benchmarks (The Numbers Don't Lie)

I ran benchmarks on AWS c5.4xlarge instances (16 vCPUs, 32GB RAM):

| Metric | Before (Nginx) | After (APISIX) |
|---|---|---|
| Pure forwarding QPS | 12,000 | 18,500 |
| With token counting | Not supported | 14,200 |
| SSE streaming QPS | 8,000 (unstable) | 15,800 |
| Failover delay | 45s | 8s |

Honestly, I was surprised by these numbers.

The QPS improvement comes from APISIX's async non-blocking model plus etcd config caching. Nginx reads configuration from disk on every request (well, it caches, but the IO bottleneck at high concurrency is real). We went from needing 4 Nginx instances to just 2 APISIX nodes, saving $1,200/month on AWS.

One detail worth mentioning: etcd's responsiveness has a massive impact on overall performance. We eventually upgraded to etcd 3.5.11 with --auto-compaction-retention=1 before things truly stabilised. Without that, etcd's memory would grow unbounded and latency would creep up over days.

Cost Comparison

I looked at the commercial options too:

Our APISIX solution:

Commercial alternatives start at $2,000+/month. We broke even in under three months. But honestly, the real win isn't the money—it's that we own the code. When AI gateway requirements change (and they will, fast), we can adapt immediately. Commercial products simply can't move that quickly.

What Still Sucks (Keeping It Real)

Four months in, there are still rough edges:

  1. The tokenizer sidecar is a bottleneck: Go is fast, but cross-process calls add 2-3ms. From what I've seen, a Rust implementation would be twice as fast. I'm planning to rewrite it as a native APISIX plugin next month. Rust compiles to C-compatible libraries, so it should integrate directly.
  2. Multi-tenancy isn't granular enough: Right now we throttle by API key, but the client wants per-department, per-project billing. Redis key design needs a complete overhaul. I'm genuinely dreading this one.
  3. No admin dashboard: Currently we're stuck with Prometheus + Grafana queries. The boss wants a "point-and-click management console." That's... actually a fair request, but the scope is daunting. Building proper UIs isn't my strong suit.
  4. WebSocket support is still iffy: Not APISIX's fault per se, but when models start supporting WebSocket-based streaming (and they will), we'll need to rethink parts of the architecture.

Advice for Anyone Attempting This

Don't use the latest version in production. I learned this the hard way. APISIX iterates fast—I upgraded to 3.5.0 the day it came out and spent half a day fixing plugin compatibility issues. Stick with 3.4.1 LTS (released June 2024). It's stable and battle-tested.

etcd must be highly available. We run a 3-node cluster. Once, a network partition caused a split-brain scenario and the gateway went down completely. Three nodes, minimum. No exceptions. Bloody lesson learned.
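
The "three nodes, minimum" rule falls out of majority quorum: etcd only commits a write when more than half of the members agree. A tiny Go sketch of that arithmetic (the helper names `quorum` and `tolerated` are mine):

```go
package main

import "fmt"

// quorum is the number of members that must agree for a write to commit.
func quorum(members int) int { return members/2 + 1 }

// tolerated is how many members can fail while the cluster keeps quorum.
func tolerated(members int) int { return members - quorum(members) }

func main() {
	for _, n := range []int{1, 2, 3, 5} {
		fmt.Printf("%d members: quorum %d, tolerates %d failure(s)\n", n, quorum(n), tolerated(n))
	}
}
```

A two-node cluster tolerates no failures at all, which is why three is the smallest size that survives losing a member, and why even-sized clusters buy nothing over the odd size below them.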

Test SSE with real browsers, not just curl. Chrome's EventSource has a 5-second timeout. Firefox behaves differently. Mobile Safari is its own special flavour of weird. We tested everything with curl and it looked perfect—then browsers showed connection drops everywhere. Took ages to diagnose. Build a proper test harness that simulates real browser behaviour.

Monitor etcd disk usage religiously. If compaction isn't configured properly, etcd's data directory balloons and response times degrade. Set up alerts for this. Trust me.

Token counting will never be 100% accurate. Accept this now. Different tokenizers (tiktoken, HuggingFace, sentencepiece) give slightly different counts. 99.8% accuracy is good enough for billing. Don't chase the last 0.2%—it's not worth the engineering effort.

What's Next?

I'm exploring three directions:

Let's Talk

If you've done similar AI gateway work—especially with Envoy, Traefik, or service mesh approaches—I'd love to hear about it. What worked? What exploded? What would you do differently?

I'm particularly interested in how people are handling the emerging WebSocket streaming models. The landscape changes every few months, and I suspect we're all solving the same problems in slightly different ways.

Full code on GitHub: github.com/rajpatel/apisix-ai-gateway—includes all plugin source, Docker Compose configs, and the benchmarking scripts I used.

Drop a comment below or find me on Twitter. If nothing else, we can commiserate about SSE parsing together.

#APIGateway #ApacheAPISIX #AIInfrastructure #OpenSource #SSEStreaming #TechInProduction

P99 latency: 320ms before (Nginx), 87ms after (APISIX)

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
