LLM Inference Acceleration: KV Cache and GQA

Generated: 2026-06-23 01:17:42

---

Can you believe it? Last year, I almost got driven to the brink by a 70B model!

I was so excited to deploy it. Right after I fed in a sequence that barely exceeded 2048 tokens, the GPU instantly went OOM. My first thought: "Damn, does this model have a memory leak?" Two whole days and nights I spent investigating, until stars were flashing before my eyes—and guess what? No leak! It was that sneaky little glutton, the KV Cache, that had gobbled up all the VRAM!

That accident forced me to fully understand KV Cache and GQA. Today, I'm going to break down all that hard-earned experience for you, piece by piece.

---

First, what exactly is Attention?

Think about it: how do we humans understand a sentence? For example, "That boy who was running in the park yesterday, he fell down." You need to know who "he" refers to, right?

That's exactly what Attention does. Every token asks: which tokens in the context carry information worth absorbing?

Let's break it down. In a Transformer, Attention has three roles:

Query (Q): What information does the current token want to find?
Key (K): What kind of Query can each token match?
Value (V): What "useful content" does each token actually provide?

The process is just three steps, dead simple:

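Each token takes its Q, compares it with the K of every token it is allowed to see (in a decoder, itself and the tokens before it), and calculates a match score: a dot product, scaled down by the square root of the head dimension.
The scores go through a softmax, turning into weights between 0 and 1 that sum to 1.
Use those weights to compute a weighted sum of those tokens' V.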
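To make the three steps concrete, here is a minimal sketch in plain Python for a single query vector. The function names (`softmax`, `attend`) and the toy numbers are made up for illustration; a real model does the same thing with batched matrix operations on the GPU.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Single-head attention for one query over a list of keys and values."""
    scale = math.sqrt(len(q))
    # Step 1: match score between q and every key (scaled dot product).
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
    # Step 2: softmax turns the scores into weights.
    weights = softmax(scores)
    # Step 3: weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights

# Toy example: the query lines up with the second key only.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
output, weights = attend([0.0, 4.0], keys, values)
print(weights)  # the second weight dominates
print(output)   # so the output sits close to the second value vector
```

The query matches the second key best, so most of the weight (and most of the output) comes from the second value.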

When I first learned this, I made a super dumb mistake: I thought Q, K, and V came from three different inputs. No way! They're actually the same entity looking at itself through three different "glasses." In simple terms, each token's hidden vector (at the first layer, its embedding) is multiplied by three separate learned matrices to project it into these three spaces.

Tell me, isn't that like looking at the same photo with reading glasses, sunglasses, and 3D glasses? Same source of information, but completely different perspectives.
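Here's a tiny sketch of the three "glasses": one token vector, three projection matrices. The vector `x` and the matrices `W_q`, `W_k`, `W_v` are hand-written placeholders, not values from any real model, where they would be learned and much larger.

```python
def matvec(x, matrix):
    """Row vector x (length d_in) times a matrix with d_in rows and d_out columns."""
    d_out = len(matrix[0])
    return [sum(x[i] * matrix[i][j] for i in range(len(x))) for j in range(d_out)]

# Hypothetical 3-dimensional token vector and three 3x2 projection matrices.
x = [1.0, 2.0, 3.0]
W_q = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
W_k = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
W_v = [[0.5, 0.5], [0.5, -0.5], [0.0, 1.0]]

q = matvec(x, W_q)  # what this token is looking for
k = matvec(x, W_k)  # what this token can be matched on
v = matvec(x, W_v)  # what this token hands over when matched
print(q, k, v)      # [1.0, 2.0] [2.0, 3.0] [1.5, 2.5]
```

Same input `x` every time; only the matrix changes.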

---

Why bother with multi-head? Isn't splitting your brain into parts exhausting?

A single Attention head can only learn one kind of relationship. But language is full of crazy stuff happening at the same time: subject-verb agreement, pronoun reference, local phrases, long-distance dependencies, positional patterns… one head simply can't handle it all!

So the Transformer runs multiple Attention heads in parallel, a kind of "group effort."

One massive pitfall I fell into: I thought "multi-head" meant splitting a 4096-dimensional vector into 32 pieces of 128 dimensions each. Wrong! Completely wrong!

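Actually, each head uses its own set of "filters" (its own projection matrices) to extract the 128-dimensional information it cares about from the full 4096-dimensional vector. What gets cut into 32 pieces of 128 dimensions is the projected output, never the input. It's like 32 different perspectives staring at the same token simultaneously.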
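A small sketch of that point, with toy sizes (8, 2, 4) standing in for 4096, 32, 128. The names `head_matrix` and `project` are made up for this example. It checks that "every head filters the full vector" and "project once, then cut the output into chunks" are the same computation, which is how frameworks usually implement it.

```python
import random

D_MODEL, N_HEADS, HEAD_DIM = 8, 2, 4  # toy stand-ins for 4096, 32, 128

random.seed(0)
x = [random.uniform(-1, 1) for _ in range(D_MODEL)]
# One big projection matrix: D_MODEL rows, N_HEADS * HEAD_DIM columns.
W_q = [[random.uniform(-1, 1) for _ in range(N_HEADS * HEAD_DIM)]
       for _ in range(D_MODEL)]

def project(x, matrix):
    """Row vector times matrix."""
    return [sum(x[i] * matrix[i][j] for i in range(len(x)))
            for j in range(len(matrix[0]))]

def head_matrix(matrix, head):
    """The columns of the big matrix that belong to one head."""
    start = head * HEAD_DIM
    return [row[start:start + HEAD_DIM] for row in matrix]

# View 1: each head applies its own filter to the FULL input vector.
per_head = [project(x, head_matrix(W_q, h)) for h in range(N_HEADS)]

# View 2: project once, then cut the OUTPUT into per-head chunks.
full = project(x, W_q)
chunks = [full[h * HEAD_DIM:(h + 1) * HEAD_DIM] for h in range(N_HEADS)]

print(per_head == chunks)  # True
```

Every head reads all `D_MODEL` input dimensions; only the output is divided up.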

What's even cooler is that these heads naturally specialize, with no need to supervise each other—some focus on syntax, others on pronouns, others on positional patterns. This emerges during training all by itself, without anyone telling them, "Hey, you handle grammatical relationships."

Isn't that amazing? Without hand-holding, they just learn it on their own.

---

KV Cache: Why is the first token always so slow?

Have you noticed that when using ChatGPT, the first word takes a moment to appear, but after that, the output flows almost continuously?

The culprit (or rather, the unsung hero) behind this is the KV Cache.

Let me explain how LLMs generate tokens: they're autoregressive, outputting one token after another, like squeezing toothpaste.

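Without a cache, every time a new token is generated, the model has to recompute the Key and Value of every token in the sequence. But think about it: the Key and Value from earlier tokens haven't changed at all! The historical tokens are fixed; they won't change their minds just because a new token appears.

So the idea of KV Cache is to compute the Key and Value of each token once and store them. When a new token arrives, you only compute its own Q, K, and V, append the new K and V to the cache, and let the Q attend over everything cached so far. That's also why the first token is the slow one: before it can appear, the model has to process the whole prompt and fill the cache (the prefill step). After that, each new token is a cheap incremental update.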
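Here's a toy sketch of the idea: a single attention head decoding step by step, once recomputing K and V for the whole prefix and once appending to a cache. Everything here is hypothetical (2x2 matrices, a fixed list of token vectors instead of sampled tokens, one layer, one head). It only shows that the outputs match while the projection work drops.

```python
import math

def project(x, matrix):
    return [sum(x[i] * matrix[i][j] for i in range(len(x)))
            for j in range(len(matrix[0]))]

def attend(q, keys, values):
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]

# Hypothetical projection matrices and a short sequence of token vectors.
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.5, 1.0], [1.0, -0.5]]
W_v = [[1.0, 1.0], [0.0, 2.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]

def decode_without_cache(tokens):
    outputs, kv_projections = [], 0
    for t in range(len(tokens)):
        prefix = tokens[:t + 1]
        keys = [project(x, W_k) for x in prefix]    # recomputed every step
        values = [project(x, W_v) for x in prefix]  # recomputed every step
        kv_projections += len(prefix)
        outputs.append(attend(project(tokens[t], W_q), keys, values))
    return outputs, kv_projections

def decode_with_cache(tokens):
    outputs, kv_projections = [], 0
    k_cache, v_cache = [], []
    for x in tokens:
        k_cache.append(project(x, W_k))  # only the new token's K
        v_cache.append(project(x, W_v))  # only the new token's V
        kv_projections += 1
        outputs.append(attend(project(x, W_q), k_cache, v_cache))
    return outputs, kv_projections

slow, slow_work = decode_without_cache(tokens)
fast, fast_work = decode_with_cache(tokens)
print(slow == fast, slow_work, fast_work)  # True 10 4
```

With 4 tokens the gap is 10 versus 4 K/V projections. The uncached count grows quadratically with sequence length and the cached count grows linearly, which is where the speedup comes from.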

I ran a comparison test with a 7B model, and even I was shocked by the results:

Without KV Cache: Generating 100 tokens took 1.8 seconds
With KV Cache: Generating 100 tokens took 0.15 seconds

That's a 12x difference!

12x! At that point, I went straight from "this model is unusable" to "hey, maybe it's not so bad?" Isn't that wild?

---

But don't get too excited! KV Cache is a VRAM-hungry beast

You see, trading space for time always has a price, right?

I calculated the cost for a 70B model, and after I saw the numbers, I went silent.

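How much VRAM does the KV Cache take for a single sequence? Here's the formula:

2 × Layers × KV heads × Head dimension × Sequence length × Bytes per value

The 2 is because we store both Key and Value. In standard multi-head attention, the number of KV heads is simply the number of attention heads.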

Assume a model with 80 layers, 40 heads, head dimension 128, sequence length 4096, stored in FP16 (2 bytes per value):

2 × 80 × 40 × 128 × 4096 × 2 = 6,710,886,400 bytes ≈ 6.7 GB

Just for one sequence, the KV Cache eats up nearly 7 GB of VRAM!

If you're serving online and handling multiple requests concurrently, you multiply that by the batch size. I hit the worst pitfall once: running four long sequences simultaneously on an A100, and the KV Cache alone consumed over a GB, leaving no room for the model weights! I was so angry I nearly threw my mouse.
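If you want to play with the numbers yourself, the formula fits in a few lines. The function name `kv_cache_bytes` and its parameters are my own naming. The batch of four at length 4096 is just the example configuration above scaled up to show the multiplication, not a reconstruction of that A100 run.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """KV cache size in bytes; the leading 2 counts Key and Value."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value * batch_size

# The example configuration from the text: 80 layers, 40 heads, FP16.
one = kv_cache_bytes(layers=80, kv_heads=40, head_dim=128, seq_len=4096)
four = kv_cache_bytes(layers=80, kv_heads=40, head_dim=128, seq_len=4096,
                      batch_size=4)

print(f"{one / 1e9:.1f} GB per sequence")      # 6.7 GB per sequence
print(f"{four / 1e9:.1f} GB for 4 sequences")  # 26.8 GB for 4 sequences
```

The cache grows linearly in both sequence length and batch size, so long contexts and concurrency multiply each other.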

---

MHA, MQA, GQA: Three options, three pitfalls

What we've been discussing so far is standard MHA (Multi-Head Attention), where each head has its own set of KV. The problem is the KV Cache is too big.

So what's the fix? Someone had a lightbulb moment: why not let multiple query heads share the same set of KV?

That's the idea behind MQA (Multi-Query Attention). It compresses the KV of all Q heads into a single set, reducing the KV Cache to 1/H of its original size (where H is the number of heads).

I tried MQA, and the VRAM savings were significant. But at what cost? The model's expressiveness took a noticeable hit. Especially when handling complex long-range dependencies, it felt inferior to MHA. After all, all Q heads share the same KV, so the information source is too constrained.

Enter GQA (Grouped-Query Attention)—the compromise.

GQA divides the Q heads into G groups. Within each group, the Q heads share a single set of KV. So the KV Cache size becomes G/H of MHA's.

Mathematically, it's easiest to understand like this:

When G equals the number of Q heads, it's equivalent to MHA (each head has independent KV).
When G equals 1, it's equivalent to MQA (all heads share one set of KV).
When G is somewhere in between, it's GQA.
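The three cases can be sketched as a simple head-to-group mapping. The helper names are made up, and grouping consecutive Q heads together is an assumption (a common convention, but not the only possible layout).

```python
def kv_head_for(q_head, num_q_heads, num_groups):
    """Index of the shared KV head that a given query head reads from."""
    if num_q_heads % num_groups != 0:
        raise ValueError("query heads must divide evenly into groups")
    heads_per_group = num_q_heads // num_groups
    return q_head // heads_per_group  # consecutive heads share one KV head

def cache_fraction(num_q_heads, num_groups):
    """KV cache size relative to MHA with the same number of query heads (G/H)."""
    return num_groups / num_q_heads

H = 32
for name, G in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    mapping = [kv_head_for(h, H, G) for h in range(H)]
    print(name, "distinct KV heads:", len(set(mapping)),
          "cache fraction:", cache_fraction(H, G))
```

With H = 32, G = 8 means Q heads 0 to 3 read KV head 0, heads 4 to 7 read KV head 1, and so on, and the cache shrinks to a quarter of the MHA size.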

Isn't this the perfect "have your cake and eat it too" solution?

---

Why did I ultimately choose GQA?

Nowadays, most mainstream models have adopted GQA.

Qwen3-8B, for example, has 32 Q heads and 8 KV heads, so the Q heads are divided into 8 groups of 4. In the LLaMA family, LLaMA 2 70B and all LLaMA 3 models use GQA as well.

My personal take: for most scenarios, GQA is a very solid choice.

Here's some test data from my own experiments (7B model, text generation, sequence length 4096, batch size 8):

Architecture	KV Cache Usage	Inference Speed	Quality Assessment
MHA	9.2 GB	1.0x (baseline)	baseline
MQA	0.9 GB	1.3x	slightly worse than baseline
GQA (G=8)	2.3 GB	1.2x	close to baseline

The numbers may not be 100% precise, as they vary across different models and tasks. But the trend is clear at a glance.

Here's my advice (hard-earned from my own missteps):

Training stage: Use MHA. VRAM isn't a big deal here, and expressiveness is most important.
Inference with extremely high quality requirements: Use MHA, combined with other optimization techniques (quantization, PagedAttention).
Inference, seeking cost-effectiveness: Use GQA, with group count G between 8 and 16.

Cael Lee
