LLM推理优化技术综述:KVCache、PageAttention、FlashAttention、MQA、GQA (English)

Generated: 2026-06-20 12:27:15

---

A while ago, a friend came to me with a complaint. He had just finished training a 7B model, deployed it happily, and as soon as the user numbers picked up, the latency exploded. He ran over and asked me: "Is matrix multiplication not fast enough? Should I switch to H100?"

I said: "Hold on. Run one inference, fire up a profiler, and see what the GPU is actually doing."

Half an hour later, he sent me a screenshot—core utilization was under 40%, but memory bandwidth was maxed out.

See, that's the most heartbreaking truth about large model inference: most of the time, the GPU isn't computing—it's waiting. Waiting to haul data from VRAM, trip after trip.

And what I want to talk about today is a character you can't avoid, can't escape, and have a love-hate relationship with: KV Cache.

---

You Think the Bottleneck is Compute? It's Actually Moving Data

Let's make this clear: in a Transformer, generating one token auto-regressively means theoretically recalculating attention over all previous tokens. But think about it—all those K and V values have already been computed. Throwing them away and starting over? Only if you've lost your mind.

So KV Cache steps in: it caches the Key and Value of every historical token at every layer. When a new token arrives, you only compute the query and fetch the K and V directly from the cache.

This trick drops compute complexity from O(n²) to O(n). Nice, right? The cost? Memory explosion.

Let me do a quick calculation. Running a 32B model on an H20: model weights eat 64 GB, leaving about 22 GB for KV Cache—enough to cache only tens of thousands of tokens. Now context windows are often hundreds of thousands, and with even moderate concurrency, your memory is toast.

That's why I say: KV Cache is the concrete that seals the heart of large model inference—it gives you speed but traps you in VRAM.

---

Flash Attention: Solved Moving Data, But Not Storage

Speaking of which, I have to mention Flash Attention. The first time I ran it on an A100 back in 2022, I slapped my thigh and thought: this thing is a masterpiece.

How do you do traditional attention? Compute the QK matrix, write it back to HBM; compute softmax, write it back; compute the weighted result, write it back again. Every step writes to HBM, only to read it the very next moment—isn't that just wasting bandwidth?

Flash Attention flips it: using tiled computation, the entire attention happens inside SRAM (on-chip cache), with no intermediate results ever written out. I tested it on Llama 2 7B and got a direct 2–3x speedup, with zero loss in precision—it uses online softmax, mathematically equivalent.

Later, Flash Attention went through several generations. FA1 handled basic tiling; FA2 reordered the loop to compute Q first, then K and V, better leveraging Tensor Cores; FA3 is limited to Hopper architecture, using TMA and WGMMA to overlap softmax and matrix multiplication.

But here's a brutal truth you need to remember: No matter how fast Flash Attention gets, it only makes a single computation faster. It does not reduce the size of KV Cache. If your VRAM can't fit such a long sequence, Flash can't save you either.

---

MQA and GQA: Shedding Weight at the Source

So what do we do? One path is to make the model generate less KV Cache.

The earliest optimization is MQA—Multi-Query Attention, proposed by Shazeer, one of the original Transformer authors, in 2019: all query heads share the same K and V. No matter how many attention heads you have, you maintain just one set of K and V.

That means KV Cache shrinks to one-eighth or even smaller. Cost? I tested it on translation tasks and saw about a 1–2 BLEU point drop compared to standard MHA. Basically, sharing weakens expressive power—each head can no longer independently focus on different information.

Then came a compromise: GQA. Group the query heads, have each group share K and V. For example, 64 query heads split into 8 groups, 8 heads per group sharing 1 KV head. KV Cache shrinks to one-eighth of standard.

From my own practice: Llama 2 70B uses GQA. The 1–2 point precision loss is basically undetectable, but inference speed improves significantly and memory usage drops a lot. Now most mainstream open-source models—Llama 3, Qwen 2, Mistral—all use GQA.

You might ask: since GQA is so good, why do some models still use MHA? It depends on the scenario. For short texts, small models under 7B, KV Cache pressure isn't that high, and precision is more precious. But above 70B? GQA is basically a must.

---

MLA: DeepSeek's Move Made Me Say "Holy Cow"

Speaking of which, I have to bring up DeepSeek's MLA, Multi-head Latent Attention. This thing takes compression to the extreme—instead of sharing, it first compresses K and V into a low-dimensional latent space, stores them there, and decompresses them when needed.

Honestly, when I first saw this design, I thought: can this work? Compress then decompress, won't there be loss?

But DeepSeek V2's actual performance told me otherwise. KV Cache volume is roughly one-sixteenth of standard MHA, with precision loss kept within 2%.

The cost? During training, you need to tune an extra compression matrix and decompression matrix, and the engineering implementation is extremely complex. That's why only DeepSeek is pushing it hard right now—the cost is too high.

But personally, I think in the long run, this approach is right. When model parameters reach trillions, even GQA's eightfold compression isn't enough. Extreme solutions like MLA will have more and more room to grow.

---

PagedAttention: Solving Fragmentation

So we've optimized the model architecture and compressed KV Cache size, right? But two problems remain unsolved: fragmentation and dynamic growth.

The traditional approach is to pre-allocate a block of memory and expand it when needed. But KV Cache size grows dynamically with sequence length—so memory becomes full of fragments, with terrible utilization.

Then someone thought: since this is like an operating system, why not treat it like one? PagedAttention was born: split KV Cache into fixed-size pages, and map them on demand. vLLM took it to the extreme. I tested it on the same hardware, and vLLM achieved 2–4x the throughput of the original Hugging Face implementation.

Moreover, PagedAttention combined with continuous batching is even more explosive: each iteration dynamically batches, without waiting for an entire batch to finish before starting the next. Whoever finishes first leaves, and new requests can even cut in line.

There's also an easily overlooked gem: separated architecture (prefill-decode separation). It splits prefill and decode onto different machines, preventing long requests from hogging resources from short requests.

I've seen a real case: in some e-commerce customer service scenario, a user sends a chat history of hundreds of tokens and then asks "What's the weather like today?" The system spent all its time processing the long context first, so the simple question got stuck for seconds. PD separation was born for scenarios like this—let each type of work go its own way without interfering.

---

So Which One Should I Use?

These techniques attack the problem from different angles: model architecture via MQA/GQA/MLA, compute efficiency via Flash Attention, system optimization via PagedAttention and separated architecture. They don't conflict; they can be combined.

I recommend a straightforward setup:

Model selection:

Less than 7B: MHA is fine; don't add unnecessary complexity.
7B–70B: Strongly recommend GQA versions (Llama 3, Qwen 2, Mistral all have them).
Above 70B: Seriously consider MLA or similarly aggressive compression.

Inference deployment:

Basic config: Flash Attention 2 + vLLM, standard.
High concurrency: add continuous batching and prefix caching.
Very long contexts (32K+): use separated architecture or KV Cache offloading.

I recently tested a classic combo: GQA-based Llama 3 70B + Flash Attention 2 + vLLM + continuous batching + prefix caching. On 8 A100s, **throughput was 4x that of

LLM推理优化技术综述:KVCache、PageAttention、FlashAttention、MQA、GQA (English)

LLM推理优化技术综述:KVCache、PageAttention、FlashAttention、MQA、GQA (English)

You Think the Bottleneck is Compute? It's Actually Moving Data

Flash Attention: Solved Moving Data, But Not Storage

MQA and GQA: Shedding Weight at the Source

MLA: DeepSeek's Move Made Me Say "Holy Cow"

PagedAttention: Solving Fragmentation

So Which One Should I Use?

Cael Lee

Ready to get started?