We Built an LLM Gateway to Tame the Long-Tail Model Chaos (and Saved 60% on Maintenance)
We Built an LLM Gateway to Tame the Long-Tail Model Chaos (and Saved 60% on Maintenance)
TIL that 70% of our GPU costs were going to models nobody had heard of, and our infrastructure team was about to stage a mutiny. Fair warning: this is a war story about how we accidentally solved the long-tail model problem by shamelessly copying what OpenAI did with their API specification.
Here's the situation.
Our company runs an internal ML platform for various teams—marketing wants LLaMA for copy generation, legal needs Claude for contract review, and some PhD in R&D keeps requesting obscure models from HuggingFace that have, I'm not joking, 47 total downloads. For a while, we just spun up dedicated endpoints for each one. That worked brilliantly.
Until it really, really didn't.
Picture this: you're on pager duty at 2 AM because a niche embedding model from a Chinese university is OOM-ing on a single A100 that's otherwise sitting at 3% utilisation. I think the model was called something like bge-large-zh-v1.5—actually wait, no. It was specifically the v1.3 release from December 2023 that had the memory leak. v1.5 was fine. Nobody had bothered to update the deployment config because the original requester had left the company six months prior and apparently documentation was optional.
Meanwhile, our GPT-4 proxy is getting absolutely hammered and we can't route around it because every model has its own bespoke API format. Every. Single. One.
I distinctly remember our lead infrastructure engineer—this is Dave, who's been doing SRE since before Kubernetes was a thing and has the thousand-yard stare to prove it—saying "I don't care if it's GPT-5 from the bloody future, if it doesn't speak /v1/chat/completions it's dead to me."
That rant turned into our architecture's north star.
The ugly reality of long-tail models
Let's ground this with some numbers. We audited our model usage over 30 days (this was February 2024, if I remember correctly):
- Top 3 models (GPT-4, Claude-3, LLaMA-70B) accounted for 82% of requests
- The remaining 18% was spread across 37 different models
- Some of those models got literally 12 requests per day. Twelve.
- We were maintaining 41 unique deployment configs, auth patterns, and retry logic per model
Honestly, the long tail wasn't killing us on compute costs. It was killing us on operational overhead.
Every weird model meant another Docker container, another set of environment variables, another thing that could break during a deploy at 4 PM on a Friday. And don't get me started on streaming—some models used SSE, some used WebSockets, one even returned newline-delimited JSON without the data: prefix.
I wish I was making that up.
The commit message when we discovered that particular gem was six characters long: "why."
The OpenAI API as an unintentional standard
Here's the thing I've noticed after years of lurking on r/MachineLearning: whether you love or hate OpenAI, their API format has become the de facto standard. It's basically what REST became for web APIs—except for LLMs.
Every major inference engine (vLLM, TGI, TensorRT-LLM) now supports it natively. Even Anthropic's SDK can be configured to speak OpenAI-format with a translation layer. Funny enough, I think that happened partly out of spite and partly because the alternative was maintaining seventeen different client libraries forever.
So we thought: what if we just built a gateway that makes every model look like an OpenAI endpoint?
Then anything downstream that expects POST /v1/chat/completions with a standard payload just works™.
I know this sounds obvious in retrospect. I really do. But the key insight—and this is what made it click—was doing this at the aggregation layer rather than wrapping each model individually. Instead of building 37 adapters, we built one translation engine that maps between the OpenAI spec and whatever weird format a long-tail model expects. One ring to rule them all, basically.
How we actually built it (the non-marketing version)
We used LiteLLM as the translation core. Yes, it's open source. Yes, it has quirks—I think we hit a bug with their Gemini streaming in v1.28.3 that required a genuinely hacky workaround involving request header manipulation. But here's the thing: it handles 100+ model providers and you can extend it without forking.
The gateway sits behind a single URL: https://llm-gateway.internal/v1.
Every team hits that endpoint with standard OpenAI payloads. Under the hood:
- Request lands → Gateway inspects the
modelfield (e.g.,"bge-large-zh") - Routing table lookup → Maps to the actual backend config (HuggingFace TGI, vLLM, etc.)
- Payload translation → Converts OpenAI format to whatever the backend expects
- Load balancing → Distributes across our GPU pool based on model affinity and current load
- Response normalisation → Takes the weird response and makes it look like OpenAI's streaming or non-streaming format
The real magic is in the routing table. We built a simple config that lets anyone register a model:
models:
- name: bge-large-zh
backend: vllm
endpoint: 10.12.44.7:8000
max_batch_size: 32
cost_center: legal-dept
Suddenly that obscure embedding model is just another model parameter. The gateway handles batching, retries, and auth. The legal team doesn't know or care that it's running on a dedicated node in our data centre—they just see openai.Embedding.create(model="bge-large-zh", input=text).
Well... that's the theory. In practice, we had to add a max_retries field per model after the incident I'm about to describe.
The Pi Day outage (I'm still bitter about this)
I'm not going to pretend this was a flawless rollout.
We had a spectacular outage on March 14th—I remember because it was Pi Day and I was supposed to leave early to actually have a life—when someone registered a model with a typo in the endpoint. They wrote 10.12.44.7:800 instead of 8000.
The gateway's retry logic went exponential.
The error logs were just pages and pages of ConnectionRefusedError growing at 2x per second. It took us forty-five minutes to even find the offending config because the logging was so noisy. Pro tip: always set max_retries per model, not globally.
We learned that one the hard way. At 10 PM. On Pi Day.
The numbers after 3 months
Once we stabilised the thing:
- Infrastructure maintenance time dropped 60%
- GPU utilisation went from 31% to 78%
- New model onboarding: from ~2 days to ~15 minutes
- We decommissioned 12 dedicated endpoints running at <5% utilisation
But the real win was organisational. Teams stopped asking "can you deploy model X for us?" and started asking "can I get an API key for the gateway?"
It shifted us from being model janitors to platform builders.
Dave actually smiled once. I think. It might have been gas. I'm about 60% sure it was a smile.
The catch (there's always a catch)
YMMV significantly. This approach—wait, I should call it a strategy because "approach" feels too consulting-speak—works brilliantly because OpenAI's format is flexible enough to handle 90% of use cases.
But if you need model-specific features? Like Claude's tool use before OpenAI supported it, or Gemini's grounding capabilities? You'll hit the limits of translation pretty quickly.
We ended up with a hybrid approach: standard features go through the gateway, exotic stuff gets a dedicated endpoint. That's about 5% of traffic now. Down from 100%. I'll take it.
Also, cost attribution becomes absolutely critical. When everything looks like "gpt-4" to the client but it's actually running on your own hardware, you need solid tracking. We built a per-token cost model into the gateway that tracks actual GPU time and maps it to department budgets.
Without that, you'll get a nasty surprise at the end of the quarter when finance asks why the "free" internal models cost $40K USD in electricity and hardware depreciation. Ask me how I know. Actually don't. It's still too soon.
Why I'm posting this
I saw that thread last week about "are LLM gateways just API proxies" and the top comment was like:
"it's just nginx with extra steps lol"
That's technically true in the same way that Kubernetes is just Docker with extra steps. The value isn't in routing—it's in making heterogeneous infrastructure look homogeneous to consumers.
If you're dealing with more than 5 models, or if your team keeps asking for random HuggingFace checkpoints that someone's cousin's research group published, do yourself a favour and standardise on the OpenAI API format at the edge. It's not perfect. But it's the least bad option we've got right now.
Probably.
TL;DR
Long-tail models create disproportionate operational overhead. We built a gateway that makes every model speak OpenAI's /v1/chat/completions format. The result? Maintenance dropped 60%, GPU utilisation doubled. The trick is translating at the aggregation layer, not per-model. It's not a silver bullet—but it beats maintaining 37 bespoke deployments at 2 AM.
Anyone else doing something similar? I'm genuinely curious how you handle streaming edge cases—we still struggle with models that return tokens in weird chunk sizes. Had one model from a research lab in Singapore that would buffer exactly 7 tokens before flushing. Seven. Why seven? I still don't know. I've lost sleep over this.
Edit: Thanks for the gold, kind stranger. To the commenter asking about LiteLLM vs building from scratch—we tried building our own translation layer first. Spent three weeks on it before realising we were just badly reinventing LiteLLM. Don't be like us. The moment I knew we'd messed up was when I found myself writing a regex to parse Anthropic's streaming format at 11 PM on a Saturday. That's a special kind of rock bottom. I don't recommend it.
What's your long-tail model horror story? Drop a comment—I need to know I'm not alone here.
llm #mlops #gateway #openai-api #infrastructure #warstory #devops
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.