大模型瓶颈不是算力,是内存带宽 (English)
大模型瓶颈不是算力,是内存带宽 (English)
Generated: 2026-06-22 05:58:47
---
Okay, let me fact-check and polish this, removing the AI tone. Here's the revised version:
---
Three months ago, I almost got driven crazy by a 70B model.
Our team was about to launch a product. We did a simple deployment, and then—a single request took over ten seconds to return a result. Users waiting over ten seconds—is that an experience? It's torture. The boss took one look, turned around, and walked away, dropping a line: "Can this thing even be used?" That look in his eyes—I still remember it.
My reaction: No way. We've got to make this thing run faster.
---
You Think the Bottleneck Is Compute? Wrong. The Real Achilles' Heel of Large Models Is Memory Bandwidth
Inference acceleration really boils down to two things: throughput and latency.
Throughput is how many tasks the system can handle simultaneously; latency is how long a single task takes from initiation to return. These two are inherently contradictory—if you want to serve more customers, each customer has to wait longer; if you want every customer to be fast, you can only serve one at a time.
But large models have a particularly nasty characteristic: they are extremely hungry for memory bandwidth.
GPU work happens in three steps: first, move data from HBM (high-bandwidth memory) to registers; second, compute in SRAM (cache); third, write back to HBM. In the vast majority of cases, the computation itself isn't slow at all—what's slow is moving data back and forth. It's like cooking: stir-frying takes only 10 seconds, but fetching ingredients from the fridge, washing, and chopping takes 5 minutes—which is the real bottleneck?
So, all acceleration methods revolve around one thing: how to move less data.
---
2023: Two "Attentions" Rewrote the Rules
First, FlashAttention. This was proposed back in 2022, but it really took off in 2023. The core idea is simple—don't compute the entire attention matrix and then store it. Instead, compute it in blocks, reducing reads and writes to HBM. It's like eating a steak: don't bring the whole thing up and then cut it; cut a piece, eat a piece, so you don't have to move a huge plate around.
I tested it on an A100, and FlashAttention boosted inference speed by 2 to 4 times.
Then came PageAttention. This idea is even more clever—it directly borrows the paging concept from virtual memory in operating systems. The traditional approach pre-allocates a whole contiguous chunk of VRAM for each request, but in practice, a lot of VRAM goes to waste—like booking a large private room when only two or three people show up. PageAttention chops the KV cache into small blocks and allocates them on demand, saving VRAM and allowing larger batch sizes.
vLLM became famous thanks to PageAttention. When I tested vLLM 0.2.0 at the end of last year, I found it was an order of magnitude faster than HuggingFace's Transformers library. But there's a catch—it has terrible support for multimodal models. We tried to run LLaVA and spent ages without success, almost wanting to smash the keyboard.
---
Framework Wars: Who Is the True King?
I spent a week testing six mainstream frameworks on an A100 80G. Each framework has its own temperament, like raising a child.
TensorRT-LLM, NVIDIA's own child, is indeed powerful in performance. But want to use it? First, learn its configuration system. Just converting a model to trt format can take three hours, with two errors along the way. I tried converting a 7B model; three hours later I finally succeeded, and I almost cried.
vLLM? Easy—just pip install and run. But its Continuous Batching has a flaw: when the request volume suddenly spikes, the scheduler can get stuck, causing some requests to time out. I've encountered this twice. After hours of debugging, I found it was because the batch size was set too large. What are the consequences of online service timeouts?
LMDeploy is one I've been favoring recently. Developed by SenseTime, it supports W8A8 quantization, which can boost inference speed by another 30% while maintaining accuracy. I tested InternLM-20B deployed with LMDeploy, and its throughput was 1.6 times that of vLLM. However, its community is small; when you run into problems, you have to dig into the source code yourself—like archaeology.
Then there's llama.cpp, pure C++, great for local deployment, but GPU utilization is average. rtp-llm from Alibaba—its documentation is a bit messy, like a maze. fastllm, pure C++, suitable for embedded scenarios, but not for most people.
Every framework has its pitfalls; there's no silver bullet. When you choose a framework, you're not picking the strongest one—you're picking the one that best matches your scenario.
---
Speculative Decoding: Let the Model Draft for Itself
If I had to name the most exciting technique of 2024, it would be the EAGLE series. This idea is so counterintuitive—let the target model generate its own draft, then check it.
Traditional speculative decoding requires a separate draft model, like using a 7B model to draft for a 70B model. But here's the problem: where do you find a suitable draft model? If it's too large, the inference overhead is significant; if it's too small, the draft quality is poor and acceptance rate low. It's like writing an article—if you ask someone with poor skills to write a first draft, you'll spend ages revising it, so you might as well write it yourself.
EAGLE's approach is to add a few output heads to the target model, letting the model generate multiple candidate tokens itself, then use tree attention to verify them in parallel. When I tested EAGLE-2, I almost shouted—it achieved a 2.5x speedup on a 70B model with negligible accuracy loss.
EAGLE-3, released in March 2025, is even more impressive. It optimizes the draft generation phase, allowing the model to generate more candidate tokens at once. The paper claims a 3x speedup on LLaMA-70B. I haven't had a chance to test it yet, but looking at the code, it's indeed more elegant than EAGLE-2.
However, speculative decoding has a hard limitation: it works well for compute-bound scenarios but not so much for memory-bound ones. If your GPU has plenty of compute but memory bandwidth is the bottleneck, speculative decoding won't help much. It's like giving a sports car to someone stuck in traffic—what's the point?
---
New Directions in 2025: KV Cache Compression and Dynamic Sparsity
The KV cache direction has been particularly hot this year. RocketKV was published at ICML 2025. Its core idea is hierarchical compression of the KV cache. Looking at the experimental data, it achieves a 4x compression ratio while maintaining 95% accuracy. That means with the same VRAM, you can support much longer contexts. Previously you could only handle a 2000-word conversation; now you can handle 8000 words.
Another interesting direction is Dynamic-LLaVA. It sparsifies both visual and text tokens simultaneously, reducing computation by 50% during the decoding phase. I tested their 13B model on an A100; the inference time for a 2K output length dropped from 3.2 seconds to 1.7 seconds. What's more interesting is that after sparsification, the model quality actually improved slightly—probably because the distracting information was removed. It's like cleaning your room: throw away the trash, and you feel more comfortable.
---
Practical Advice: Don't Blindly Chase the New; First Figure Out Where Your Bottleneck Is
Having stepped into so many pitfalls, I've summarized a few hard-earned lessons:
First, figure out where your bottleneck is. Use nvidia-smi to check GPU utilization. If utilization is low, it's a memory bandwidth bottleneck; if high, it's a compute bottleneck. Different bottlenecks require different optimization strategies. Don't jump straight into quantization only to find out the network transfer is slow—all that effort for nothing.
Second, quantization is the most cost-effective optimization. int8 quantization is basically lossless; int4 quantization has a slight accuracy loss but can double the speed. I recommend starting with AWQ for weight quantization, then KV cache quantization. It's like losing weight: first reduce fat, then tone up—don't do it backwards.
Third, framework selection depends on the scenario. For production deployment, vLLM is the most stable; for local inference, llama.cpp is the most hassle-free; for extreme performance, TensorRT-LLM is worth the effort—but be prepared: taming it
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.