The State of Open-Source LLM Inference Engines and Common Inference Optimization Methods
Generated: 2026-06-23 04:36:16
---
I’m floored! Same model, same GPU, but a 20% difference in performance? The truth behind open-source inference engines: lessons it took me three years of painful experience to finally understand!
A few days ago, a friend sent me a screenshot and asked, “Bro, is this data legit?”
I looked—it was a leaderboard on Artificial Analysis. vLLM had hit number one with an Output Speed of 230 TPS! And TTFT at 10K input was under a second!
Holy cow.
First, I muttered to myself, “That’s damn fast.” Then I dug up some old history—a year ago, in the same setup, vLLM was still being slagged off as “all show, no substance.” How did it turn the tables so quickly?
As someone who’s been through the mud, I decided to run it myself: vLLM 0.6.3 versus SGLang 0.3.0, on the same model (a 397B-parameter MoE model) and the same GPU.
And the result?
A performance gap of over 20%!
Isn’t that mind‑blowing?
Now, you might think I’m about to dig into the technical principles. Hang on—let me paint a picture first, so you can feel the frustration.
Since late 2023, I’ve been maintaining a few inference services on and off. You can’t imagine how many pitfalls I’ve stumbled into; lined up end to end, they could circle Haidian Huangzhuang three times. Every time I thought I’d found the root cause of a performance bottleneck, I’d spend three days and nights wrestling with it, only to discover I was wrong.
Even worse, sometimes you actually find a real optimization, change a single line of code, and the entire inference pipeline collapses.
That feeling is like pulling one thread out of a tangled mess, only to have the whole ball unravel.
Today I’m sharing the lessons that made me slap my thigh in regret, all learned through hands-on practice. No fluff, no showing off.
---
The “True Faces” of Mainstream Engines: Who’s Swimming and Who’s Naked?
Let’s start with vLLM.
I’ve followed it from version 0.4 all the way to 0.6, watching it grow with every release. You know what it feels like?
It’s getting clearer about what it wants to do.
PagedAttention, Continuous Batching—those are already standard. What really launched it was the V1 architecture’s KV cache management. It cleaned up things like block tables and slot mappings, making prefix cache effectively “zero‑overhead”—you reuse the cache without recomputing, and multi‑turn conversations fly.
Don’t believe me? Go look at the vLLM source code. The scheduler logic was refactored from V0 to V1—much more complex, but throughput definitely improved!
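To make the block-table and prefix-reuse idea a bit more concrete, here is a minimal sketch in plain Python. It is not vLLM’s code: `BlockPool`, `BLOCK_SIZE`, `slot_for`, and the keying scheme are hypothetical names picked for illustration, and a real engine also tracks reference counts, eviction, and actual GPU memory, all of which this toy ignores.

```python
from typing import Dict, List, Optional, Tuple

BLOCK_SIZE = 4  # tokens per KV block (assumed value for the demo)


class BlockPool:
    """Toy pool: hands out physical block ids and reuses full blocks by content."""

    def __init__(self) -> None:
        self.next_block_id = 0
        # (parent block id, tokens in this block) -> physical block id
        self.cached: Dict[Tuple[Optional[int], Tuple[int, ...]], int] = {}

    def allocate(self, token_ids: List[int]) -> Tuple[List[int], int]:
        """Return (block_table, number of tokens whose KV was reused)."""
        block_table: List[int] = []
        reused_tokens = 0
        parent: Optional[int] = None  # physical id of the previous block
        for start in range(0, len(token_ids), BLOCK_SIZE):
            chunk = tuple(token_ids[start:start + BLOCK_SIZE])
            key = (parent, chunk)
            is_full = len(chunk) == BLOCK_SIZE
            if is_full and key in self.cached:
                # Same tokens after the same prefix: reuse, no recomputation.
                block_id = self.cached[key]
                reused_tokens += BLOCK_SIZE
            else:
                block_id = self.next_block_id
                self.next_block_id += 1
                if is_full:  # only full blocks are shareable in this sketch
                    self.cached[key] = block_id
            block_table.append(block_id)
            parent = block_id
        return block_table, reused_tokens


def slot_for(block_table: List[int], position: int) -> int:
    """Map a logical token position to a physical KV slot index."""
    return block_table[position // BLOCK_SIZE] * BLOCK_SIZE + position % BLOCK_SIZE


if __name__ == "__main__":
    pool = BlockPool()
    first_table, first_reused = pool.allocate([1, 2, 3, 4, 5, 6, 7, 8, 9])
    second_table, second_reused = pool.allocate([1, 2, 3, 4, 5, 6, 7, 8, 10, 11])
    print(first_table, first_reused)
    print(second_table, second_reused)
```

In this toy, the second request shares its first two blocks with the first one, which is the multi-turn case: the new turn only pays for the tokens after the shared history.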
I read the DigitalOcean Blackwell Ultra test from a couple of days ago and noticed: attention fusion for a certain model, EAGLE3 draft model for another, linear attention fusion for yet another—all merged into the main branch, not stuck in a private fork!
What does that mean?
Open‑source engines are already catching up to proprietary solutions in cutting‑edge optimizations—and sometimes even running faster!
And SGLang? Another project that made my eyes light up.
The RadixAttention design is brilliant. It stuffs the KV cache into a prefix tree and uses LRU for garbage collection. In multi‑turn conversations, identical prompts are reused directly without recomputation.
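Here is a rough sketch of the "prefix tree plus LRU" idea, under heavy simplification. It uses a plain token-level trie rather than a compressed radix tree, stores no real KV tensors, and `PrefixCache`, `match_prefix`, and `capacity` are hypothetical names for illustration, not SGLang’s API.

```python
import itertools
from typing import Dict, List, Optional


class Node:
    def __init__(self, parent: Optional["Node"], token: Optional[int]) -> None:
        self.parent = parent
        self.token = token
        self.children: Dict[int, "Node"] = {}
        self.last_used = 0  # logical timestamp for LRU


class PrefixCache:
    """Toy prefix tree: each non-root node stands for one token's cached KV."""

    def __init__(self, capacity: int) -> None:
        self.root = Node(None, None)
        self.capacity = capacity  # max number of cached tokens (assumed unit)
        self.size = 0
        self.clock = itertools.count(1)

    def match_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens already have cached KV."""
        now = next(self.clock)
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            child.last_used = now  # a hit refreshes the whole matched path
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: List[int]) -> None:
        """Record a sequence, then evict least recently used leaves if over capacity."""
        now = next(self.clock)
        node = self.root
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                child = Node(node, tok)
                node.children[tok] = child
                self.size += 1
            child.last_used = now
            node = child
        while self.size > self.capacity:
            self._evict_one_leaf()

    def _leaves(self) -> List[Node]:
        stack, leaves = [self.root], []
        while stack:
            current = stack.pop()
            if not current.children and current is not self.root:
                leaves.append(current)
            stack.extend(current.children.values())
        return leaves

    def _evict_one_leaf(self) -> None:
        # Only leaves are evicted, so a shared prefix outlives its branches.
        victim = min(self._leaves(), key=lambda n: n.last_used)
        del victim.parent.children[victim.token]
        self.size -= 1
```

Evicting from the leaves inward is the design point worth noticing: the system prompt that every conversation shares sits near the root and is the last thing to go.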
I built a chatbot test with that 397B model. You know what happened?
Under the same request pattern, SGLang’s time‑to‑first‑token was 40% lower than vLLM’s! Because its cache hit rate was way higher.
But SGLang has its own flaw: model support isn’t as comprehensive as vLLM’s yet. When you run into an obscure model, you have to write a custom layer yourself. For newcomers, that’s a wall.
Now, TGI…
Sigh.
TGI came from Hugging Face around the same time as vLLM, with perfect timing and a great starting hand. It’s a shame they’ve played it this way. It’s been two months since the last release, and it feels almost abandoned.
What’s funniest? TGI claims it uses PagedAttention, but when I looked at the source code—it just calls a kernel with the same name in the decode stage! What they say and what they do are two different things.
Not enough drive in the team, I guess.
LMDeploy is also interesting. I’ve run a few experiments with it, and stability is fine, but the community buzz isn’t as high as for the first two. As for Mooncake’s PD-disaggregation architecture, I haven’t used it in production yet, but I’ve studied the architecture diagram: Prefill and Decode are physically separated, each scaling independently. It’s similar to the CUPS (control and user plane separation) idea in telecom. Solid thinking.
---
The Pits I’ve Fallen Into Could Fill a “Bleeding History of Inference Optimization”
Looking back, there aren’t many genuinely “clever tricks” in inference optimization.
Most of the time, you’re wrestling with three devils: VRAM, scheduling, and kernel launch.
I’ve ranked five directions by importance, writing them down for future generations. Every single one is a lesson I’ve slapped my thigh over.
First: Raise the compute‑to‑memory‑access ratio.
In simple terms, you want the GPU to do more computation each time it loads data. The most direct way is to increase batch size, but what if VRAM isn’t enough? Quantize the KV cache (INT8/FP8) or quantize the weights (if bfloat16 weights still don’t fit, drop to INT4). All these techniques share one goal: free up memory so the batch can be as large as possible.
I’ve measured—going from B=1 to B=16 can lead to a hundred‑fold difference in throughput!
But batch size isn’t unbounded. Once you pass B=256 or so, the bottleneck shifts from memory bandwidth to compute, and gains level off. To go further, you need a new approach.
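A back-of-the-envelope roofline model shows why the curve flattens. This is only a sketch: it counts weight traffic for a dense layer and ignores KV-cache reads, activations, and MoE routing, and the hardware numbers (`PEAK_FLOPS`, `MEM_BANDWIDTH`) are assumed round figures for illustration, not measurements from my tests.

```python
def decode_arithmetic_intensity(batch_size: int, bytes_per_weight: float) -> float:
    """FLOPs per byte of weight traffic in one decode step.

    Each weight is loaded once per step and used for one multiply-add
    (2 FLOPs) per sequence in the batch.
    """
    return 2.0 * batch_size / bytes_per_weight


def attainable_flops(batch_size: int, peak_flops: float,
                     mem_bandwidth: float, bytes_per_weight: float) -> float:
    """Roofline: the lower of the compute ceiling and the bandwidth ceiling."""
    intensity = decode_arithmetic_intensity(batch_size, bytes_per_weight)
    return min(peak_flops, intensity * mem_bandwidth)


def crossover_batch(peak_flops: float, mem_bandwidth: float,
                    bytes_per_weight: float) -> float:
    """Batch size where the step stops being bandwidth-bound in this model."""
    machine_balance = peak_flops / mem_bandwidth  # FLOPs available per byte loaded
    return machine_balance * bytes_per_weight / 2.0


# Assumed, illustrative hardware numbers (hypothetical GPU).
PEAK_FLOPS = 1.0e15      # FLOP/s
MEM_BANDWIDTH = 3.2e12   # bytes/s

if __name__ == "__main__":
    for batch in (1, 16, 64, 256, 512):
        flops = attainable_flops(batch, PEAK_FLOPS, MEM_BANDWIDTH, 2.0)
        print(batch, f"{flops / PEAK_FLOPS:.1%} of peak")
    print("crossover (2-byte weights):",
          crossover_batch(PEAK_FLOPS, MEM_BANDWIDTH, 2.0))
```

With these assumed numbers and 2-byte weights, the crossover lands at a few hundred sequences, the same ballpark as the B=256 figure above. Below it, every extra sequence in the batch is close to free; above it, you are paying for compute.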
Speaking of which, speculative decoding is one such direction: a small draft model quickly proposes several tokens, and the big model verifies them in a single pass. But it rarely pays off under high concurrency. The batch is already large and the GPU already well utilized, so verification is no longer nearly free, and with an acceptance rate α<1 some of that extra compute is spent on tokens that get rejected, which can actually lower throughput.
I tried pairing a small draft model with a big target model: at low concurrency (B≤4), I got a 50% boost; at high concurrency (B≥64), practically zero gain.
So you see, a lot of “beautiful” techniques need to be thoroughly re‑evaluated when you put them in a real‑world scenario.
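The trade-off can be sketched with the usual simplified model of speculative decoding, assuming each drafted token is accepted independently with probability α. The cost parameters `draft_cost` and `verify_cost` are hypothetical knobs I made up for the illustration, and the numbers in the demo are not from my benchmark.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model pass with k drafted tokens.

    Assumes i.i.d. acceptance with probability alpha; the target model
    always contributes one token itself, hence the k + 1.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speedup(alpha: float, k: int, draft_cost: float, verify_cost: float) -> float:
    """Throughput relative to plain decoding (1.0 means break-even).

    draft_cost:  cost of one draft step, in units of one plain target step.
    verify_cost: cost of verifying k + 1 tokens in one target pass, in the
                 same units. Close to 1 when the GPU is memory-bound (small
                 batch), approaching k + 1 when it is compute-bound.
    """
    return expected_tokens_per_step(alpha, k) / (k * draft_cost + verify_cost)


if __name__ == "__main__":
    # Illustrative values only.
    print("low concurrency :", round(speedup(0.8, 4, 0.05, 1.0), 2))
    print("high concurrency:", round(speedup(0.8, 4, 0.05, 5.0), 2))
```

The only thing that changes between the two calls is `verify_cost`, which is exactly what a large batch does to you: the extra tokens being verified stop hiding behind memory latency.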
Second: Kernel optimization—the real hard battle.
FlashAttention uses tiling to boost the compute‑to‑memory‑access ratio of the attention operator. vLLM V1’s linear attention fusion for certain MoE models merges several kernels, cutting launch overhead and reducing kernel launches in the prefill stage by several hundred.
Don’t underestimate that number. Several hundred launches can be the difference between life and death for latency‑sensitive services.
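For anyone who hasn’t looked inside the tiling trick, the core of it is an online softmax: process keys and values one tile at a time and rescale the running result whenever a larger score shows up. Below is a single-query, pure-Python sketch of that idea. It runs on the CPU, says nothing about SRAM or real kernel launches, and the function names are my own, not FlashAttention’s API.

```python
import math
from typing import List


def naive_attention(q: List[float], keys: List[List[float]],
                    values: List[List[float]]) -> List[float]:
    """Reference: materialize all scores, then softmax, then weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention(q: List[float], keys: List[List[float]],
                    values: List[List[float]], tile_size: int) -> List[float]:
    """Same result, but only one tile of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile_size):
        k_tile = keys[start:start + tile_size]
        v_tile = values[start:start + tile_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max (0.0 on the first tile).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Because the full score row is never stored, a real kernel can keep each tile in fast on-chip memory, which is where the better compute-to-memory-access ratio comes from.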
In the MoE domain, Nvidia’s paper on communication optimization really caught my eye—MoE all‑to‑all communication often stalls for 30% of the time; RingReduce and locality‑aware scheduling can recover that. 30%! Just think: your GPU is idling for 30% of the time.
Third: Hardware utilization—keep the GPU busy.
Communication‑compute overlapping, CUDAGraph to eliminate Python‑side launch overhead, asynchronous scheduling… vLLM and SGLang have already done a lot.
But I once ran a profile trace and found that some services had 40% empty time on the timeline! The host was stuck waiting on a GPU sync point, so no new work was being queued.
40%! Your GPU is loafing around for almost half the time!
That gLLM commit log said the repetition penalty used to rebuild a full-history mask at every step: a [batch, vocab] mask reconstructed from all the tokens of every sequence, so the longer the output, the more expensive each step became. After switching to a worker-local persistent pool, each step only scatters the newly generated token into the existing mask, cutting per-step work from O(total_len) to O(batch) plus a bit of new input.
Optimizations like this seem small, but when the decode runs a million steps, the difference is enormous.
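The difference between the two approaches can be sketched in a few lines. This is not the gLLM code: `rebuild_mask` and `PersistentMask` are hypothetical names, the mask is a plain Python list of lists rather than a GPU tensor, and request arrival and removal are left out.

```python
from typing import List


def rebuild_mask(histories: List[List[int]], vocab_size: int) -> List[List[bool]]:
    """Old way: walk every token of every sequence at every step."""
    mask = [[False] * vocab_size for _ in histories]
    for row, tokens in zip(mask, histories):
        for tok in tokens:  # cost grows with total generated length
            row[tok] = True
    return mask


class PersistentMask:
    """New way: keep the mask alive across steps and only add new tokens."""

    def __init__(self, prompts: List[List[int]], vocab_size: int) -> None:
        # One-time cost when requests arrive.
        self.mask = rebuild_mask(prompts, vocab_size)

    def scatter_new_tokens(self, new_tokens: List[int]) -> None:
        """Per-step update: one write per sequence in the batch."""
        for row, tok in zip(self.mask, new_tokens):
            row[tok] = True


if __name__ == "__main__":
    vocab_size = 8
    histories = [[1, 2], [3]]
    persistent = PersistentMask(histories, vocab_size)
    for step_tokens in ([4, 5], [1, 6], [7, 3]):
        for history, tok in zip(histories, step_tokens):
            history.append(tok)
        persistent.scatter_new_tokens(step_tokens)
        # Both approaches must agree; only the per-step cost differs.
        assert persistent.mask == rebuild_mask(histories, vocab_size)
    print("masks match")
```

Both versions produce the same mask at every step. The persistent one just stops re-reading history it has already seen.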
Then
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.