LLM Performance Optimization: A Look at Directions for Long-Text Inference Optimization (English)
Generated: 2026-06-23 00:12:08
---
Long-Text Inference Performance Optimization: The Pitfalls That Almost Made Me Smash My Computer, and the Real Way Out
You know what?
Last month I took on a job doing long-document analysis performance tuning for a legal-tech team. They were running vLLM 0.4.2, serving a 70B model. Users would upload contracts tens of thousands of tokens long—you know the kind, dense text crammed together—and from the moment the request went out to the first token landing, it averaged over 40 seconds.
The business side just dropped one line: “That’s slower than reading it by hand.”
My reaction at the time? Half amused, half gut-punched. Not because the tech was hard—but because it was so counterintuitive. We go through all this trouble to deploy a model on GPUs, and it ends up slower than a human flipping through paper? That can’t be right!
The Real Nature of the Pain—Not What You Think
My first instinct was to profile it. Ran a trace with Nsight Systems, and when the results came out, I just sat there stunned—
The decode phase timeline was mostly gray waiting. Gray! Compute kernels took up less than 20%.
What does that mean? It means Mr. GPU was constantly waiting for KV Cache to be fetched from HBM while sitting idle.
You see, back then we were still on V100s, with only 900 GB/s of HBM bandwidth. For a GQA-based 70B model, a million-token KV Cache is about 160 GB, and a single pass over it takes roughly 0.18 seconds. For an MHA-based model (like the early 65B or 175B ones), the KV Cache can hit 1.5 TB, and one pass takes about 1.7 seconds. The 70B model we were serving turned out to be a GQA variant, so the lower figure applies. Even so, every decode step has to stream the whole cache again, so multiply that by the number of generation steps and the end-to-end latency was bound to blow up.
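Here is the back-of-the-envelope arithmetic behind those numbers, as a rough sketch. The model shape (80 layers, head dimension 128, 8 KV heads for GQA versus 64 for MHA) is my assumption, borrowed from a Llama-style 70B config; it is not the exact model from this job. Under those assumptions, the 160 GB figure lines up with one byte per cached element, and an FP16 cache would be twice that.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    """Total KV cache size: one K and one V vector per token, per layer, per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


def read_seconds(num_bytes, bandwidth_bytes_per_s):
    """Lower bound on the time to stream the cache once at peak bandwidth."""
    return num_bytes / bandwidth_bytes_per_s


# Assumed Llama-style 70B shape (hypothetical for this article's model).
LAYERS, HEAD_DIM = 80, 128
TOKENS = 1_000_000
V100_BW = 900e9  # bytes/s, peak HBM2 bandwidth quoted above

gqa_8bit = kv_cache_bytes(TOKENS, LAYERS, kv_heads=8, head_dim=HEAD_DIM, bytes_per_elem=1)
gqa_fp16 = kv_cache_bytes(TOKENS, LAYERS, kv_heads=8, head_dim=HEAD_DIM, bytes_per_elem=2)
mha_8bit = kv_cache_bytes(TOKENS, LAYERS, kv_heads=64, head_dim=HEAD_DIM, bytes_per_elem=1)

for name, size in [("GQA 8-bit", gqa_8bit), ("GQA FP16", gqa_fp16), ("MHA 8-bit", mha_8bit)]:
    print(f"{name}: {size / 1e9:.1f} GB, {read_seconds(size, V100_BW):.2f} s per pass")
```

The read time is a best case: it assumes peak bandwidth and ignores the fact that a cache this size has to be sharded across many cards.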
Now you might think: just swap in a GPU with more memory, right?
Wrong! Dead wrong!
This is the fundamental contradiction of long-text inference: it is cumbersome, fragile, and prone to blowing up. Models are getting bigger, contexts are getting longer, and KV Cache capacity and memory-access bandwidth have become the real physical bottlenecks. Sticking with that GQA 70B model: a million-token KV Cache needs about 160 GB of VRAM. An 8-card H100 node with 640 GB total can technically hold it, but with H100’s 3.35 TB/s bandwidth facing random-read cache misses, you’re lucky if the effective bandwidth is even half that.
Even if you upgrade to H200 (141 GB per card), that’s less than 80% more memory than the H100’s 80 GB, while context lengths could jump 8×, from 128K to 1M, within a year.
The real way out, which I learned after countless falls, lies in three directions: sparsity, smarter attention architectures, and hierarchical KV Cache management.
Speaking of Sparse Attention—That Was a Real Game Changer
The first time I encountered sparse attention was when I was reproducing and testing RTPurbo. Their paper mentioned a finding that shocked me as soon as I read it—most attention heads in LLMs actually handle local information, and only about 15% of “recall heads” genuinely care about distant content.
That discovery was a total paradigm shift!
I ran my own experiments to verify. Using Qwen3-Coder-30B on a 128K long text, I plotted attention activation heatmaps and got exactly what the paper said—the attention weights of most heads concentrated within the nearest 64 tokens.
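If you want to run the same check without plotting anything, a number per head is enough. The sketch below is my own simplification, not the paper's method: it takes the attention weights of one query over all earlier keys (ordered oldest to newest) and measures how much of the mass lands on the nearest 64. The 0.9 threshold and the function names are assumptions for illustration.

```python
def local_mass(attn_row, window=64):
    """Fraction of one query's attention that falls on its `window` most recent keys.

    `attn_row` is a list of non-negative weights ordered oldest key -> newest key.
    """
    total = sum(attn_row)
    if total == 0:
        return 0.0
    return sum(attn_row[-window:]) / total


def classify_heads(attn_rows_by_head, window=64, threshold=0.9):
    """Label each head 'local' or 'long-range' from a single representative query row."""
    return {
        head: "local" if local_mass(row, window) >= threshold else "long-range"
        for head, row in attn_rows_by_head.items()
    }


# Synthetic rows: one head concentrated on the last 64 keys, one spread evenly.
local_row = [0.05 / 936] * 936 + [0.95 / 64] * 64
uniform_row = [1 / 1000] * 1000
print(classify_heads({"h0": local_row, "h1": uniform_row}))
```

In practice you would average this over many query positions and prompts before trusting the label for any single head.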
RTPurbo’s idea is: don’t change the original model architecture; use about 600 steps of fine-tuning to make recall heads more “focused” on serving a sparse pattern, then skip computation for most heads during the forward pass.
I tested it on my own datasets. Prefill sped up about 8×, decode a bit over 2×. Accuracy barely dropped. Think about what that feels like: instead of waiting 40 seconds, you get the first token in 5!
But what’s the cost? An extra fine-tuning step, plus you have to score the recall heads yourself. I tried their needle-insertion method from the paper—pretty tricky. You need to adjust the needle’s distance and position to get stable scores, otherwise the scores float all over the place.
And here come the pitfalls.
Higher sparsity is not always better.
At first I got greedy—set sparsity to 90%, thinking more acceleration is better. Then tasks requiring long-range reasoning, like multi-hop QA, just collapsed. It felt like studying only the last two chapters for an exam and having everything before that tested.
I dialed it back to 85%, keeping about 15% of heads as recall heads, and finally things stabilized. So if your business is all short-context or pure RAG, sparsity doesn’t add much, because the KV Cache is small anyway. Sparse benefits become obvious only with ultra-long contexts (>64K).
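To make the trade-off concrete, here is a small sketch of how a sparsity setting translates into which heads stay dense and how much KV the model still touches. I am assuming that "sparsity" means the fraction of heads restricted to a local window, that the rest keep full context, and that the window is 64 tokens. The scores are made up, and both function names are hypothetical.

```python
def select_recall_heads(scores, sparsity):
    """Keep the top-scoring (1 - sparsity) share of heads dense; the rest go local.

    `scores` maps head id -> recall score (higher means more long-range).
    """
    n_keep = max(1, round(len(scores) * (1 - sparsity)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_keep])


def retained_kv_fraction(context_len, sparsity, window=64):
    """Share of KV entries still read when sparse heads only see `window` tokens."""
    local_share = min(window, context_len) / context_len
    return (1 - sparsity) + sparsity * local_share


# Made-up scores for 20 heads: head i gets score i / 20.
scores = {head: head / 20 for head in range(20)}
print(sorted(select_recall_heads(scores, sparsity=0.85)))

for ctx in (64, 4_096, 131_072):
    print(ctx, round(retained_kv_fraction(ctx, sparsity=0.85), 4))
```

The retained fraction flattens out quickly as context grows, so the absolute savings only become large once the cache itself is large.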
Another trap: not all models are suitable for post-training sparsification. I tried RTPurbo on LLaMA-2-7B, and the gain was much smaller compared to Qwen. Possibly because its original attention distribution was already fairly uniform. If you’re using MoE models like Mixtral, it’s even more complex—different experts may have different sparsity patterns, and you have to tune them one by one. That’ll break your spirit.
Don’t Overlook the Attention Architecture Itself—It Might Be Your Ace Card
Sparsity is patching the existing model. DeepSeek, on the other hand, directly rebuilt the foundation.
Starting from V2, they introduced MLA (Multi-head Latent Attention). The essence is compressing K and V into a low-dimensional latent vector, caching only that latent, and projecting it back up to full-size keys and values at attention time. It is not strictly equivalent to MHA, since the compression is low-rank, but it is designed to match MHA quality while the KV Cache is only 1/16 to 1/8 the size.
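A toy version of the compress-then-reconstruct idea looks roughly like this. It is a sketch under heavy assumptions: the sizes are tiny and chosen by me, the weights are random rather than trained, and the separate positional (RoPE) key that the real design also caches is left out. It only shows what gets stored per token and what gets recomputed.

```python
import random


def matvec(matrix, vec):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
HIDDEN, HEADS, HEAD_DIM, LATENT = 64, 4, 16, 8  # toy sizes, not a real config

W_down = random_matrix(LATENT, HIDDEN, rng)            # hidden state -> latent
W_up_k = random_matrix(HEADS * HEAD_DIM, LATENT, rng)  # latent -> all heads' keys
W_up_v = random_matrix(HEADS * HEAD_DIM, LATENT, rng)  # latent -> all heads' values

hidden_state = [rng.uniform(-1, 1) for _ in range(HIDDEN)]

latent = matvec(W_down, hidden_state)  # the only thing written to the cache
keys = matvec(W_up_k, latent)          # rebuilt on the fly at attention time
values = matvec(W_up_v, latent)

mha_cached = 2 * HEADS * HEAD_DIM      # elements per token if K and V were cached
mla_cached = len(latent)               # elements per token with the latent cache
print(f"cached per token: {mla_cached} vs {mha_cached} ({mha_cached // mla_cached}x smaller)")
```

The ratio you get depends entirely on the baseline you compare against: measured against GQA with a handful of KV heads, the gap is much smaller than against full MHA.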
I deployed DeepSeek V3 and verified: with the same 8 H100s running 128K context, VRAM usage was nearly half of LLaMA-3-70B, and both TTFT and TPOT were much more stable.
But—and here’s the but—in practice, MLA has compatibility issues with vLLM.
Early vLLM 0.4.x didn’t support MLA quantization, so I had to run FP16 natively, wasting H100’s FP8 capability. SGLang supported it natively, but at that time SGLang’s continuous batching wasn’t mature yet. So if you’re going with MLA, I recommend first checking your inference framework’s version and operator coverage—I nearly lost an entire weekend by not reading the release notes carefully. Didn’t even sleep well.
Also, DeepSeek V4 introduced a hybrid attention architecture built around CSA, which does token-level compression on top of sparsity. I haven’t deployed V4 myself, but judging from the FLOPs and bandwidth analysis in their paper, it seems to break the compute-bound barrier during prefill as well.
My personal take: within the next two years, these natively sparse + compressed attention designs will become mainstream. Because they’re optimized for long text from the ground up, without needing post-hoc patches—like laying a solid foundation before building the house, instead of discovering leaks after moving in and patching them.
The Dirty Work of KV Cache Management—No One Else Can Do It for You
If you don’t want to touch the model, you have to push at the system level.
Let me tell you about a pitfall I hit at the very beginning. I enabled prefix caching in vLLM, thinking I was all set. Then during load testing, even with cache hits, TTFT didn’t drop significantly.
After investigating, the reason left me not knowing whether to laugh or cry: requests were routed by the load balancer to different nodes, so caches for the same prefix never hit. The fix was simple: add consistent hashing or sticky sessions. But this wasn’t in the documentation! I spent hours digging through GitHub.
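For reference, the consistent-hashing version of that fix can be sketched in a few lines. This is a minimal illustration, not what any particular load balancer ships: the `PrefixRouter` class, the node names, and the choice to hash the first N characters of the prompt are all my assumptions. A real setup would more likely hash token blocks so the key lines up with how the serving engine slices its prefix cache.

```python
import bisect
import hashlib


class PrefixRouter:
    """Route prompts that share a prefix to the same node via a hash ring."""

    def __init__(self, nodes, vnodes=64, prefix_chars=2048):
        self.prefix_chars = prefix_chars
        # Each node gets `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._ring]

    @staticmethod
    def _hash(text):
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def route(self, prompt):
        """Pick the first ring point clockwise from the hash of the prompt prefix."""
        key = self._hash(prompt[: self.prefix_chars])
        idx = bisect.bisect_right(self._keys, key) % len(self._keys)
        return self._ring[idx][1]


router = PrefixRouter(["node-a", "node-b", "node-c"], prefix_chars=32)
contract = "C" * 32  # stands in for a long shared document prefix
print(router.route(contract + " summarize clause 4"))
print(router.route(contract + " list the termination terms"))
```

Both calls land on the same node because only the shared prefix feeds the hash. When a node is removed, only the prompts that mapped to it get reassigned, so the other nodes keep their warm caches.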
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.