100x Inference Speedup for Large Models: KV Cache Edition
Generated: 2026-06-22 17:12:40
---
You must have heard someone say: "KV cache? It's just a cache—what's there to talk about?"
I’d bet ten to one that whoever says that has never written an inference engine themselves. Either that, or they’ve skimmed a couple of explainers and decided they’ve got it all figured out.
But here’s the thing: KV cache is simple. So simple you can explain it in a single formula. But simple ≠ unimportant. On the contrary, it’s one of the most critical designs in large model inference—it determines whether you can run a long conversation on a single 24GB GPU, how long your users wait for the first token, and how ugly your company's cloud bill gets.
In this post, I’ll mix my own hard‑learned lessons, real measured data, and a few industry reports to talk about what KV cache really is. Along the way, I’ll tell you why it deserves the seemingly gimmicky claim of “100x inference speedup.”
---
I. Saying KV cache is just a cache only sees one corner of the picture
Let’s start with the core pain point.
Large models generate content token by token. You send in “What’s the weather like today?” The model first tokenizes the whole sentence and runs it through the Transformer once—all the input tokens are computed in parallel. That’s the Prefill phase. Once the input is done, the model starts producing the first output token, and here’s where the trouble begins:
Every time it generates a new token, it has to go back and compute attention with all previous tokens. And the next token has to wait for the current one to finish—that’s the Decode phase.
Think about it: if every new token forces you to recompute K and V for the entire history from scratch, the work per decode step grows as O(n²) in the sequence length n. Even if your input sequence is capped at 2048 tokens, it’s painfully slow. In the early days, inference engines without KV cache did exactly that. I’ve seen a single token take hundreds of milliseconds, and long sequences would blow up GPU memory instantly!
KV cache does something very simple: since the K and V of the first n-1 positions needed at step n are identical to the ones already computed in earlier steps, why recompute them? Just store them. Each time a new token comes in, you only compute the K and V for that position, append them to the cached history, and reuse the whole thing for attention. The attention cost per generated token drops from O(n²) to O(n).
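To make the idea concrete, here is a minimal NumPy sketch of a single attention head in a single layer, decoding with and without a cache. It is an illustration under simplifying assumptions (no batching, no multi-head, random weights), not how any particular inference engine implements it; the function names `decode_without_cache` and `decode_with_cache` are made up for this example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    # q: (d,), K and V: (t, d). One query attends over t positions.
    scores = K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

def decode_without_cache(X, Wq, Wk, Wv):
    # Every step re-projects K and V for the whole prefix.
    outputs = []
    for t in range(1, len(X) + 1):
        prefix = X[:t]
        K, V = prefix @ Wk, prefix @ Wv  # redundant work repeated each step
        q = X[t - 1] @ Wq
        outputs.append(attend(q, K, V))
    return np.stack(outputs)

def decode_with_cache(X, Wq, Wk, Wv):
    # Every step projects K and V for the newest token only, then appends.
    d = Wk.shape[1]
    K_cache = np.empty((0, d))
    V_cache = np.empty((0, d))
    outputs = []
    for x in X:
        K_cache = np.vstack([K_cache, x @ Wk])
        V_cache = np.vstack([V_cache, x @ Wv])
        q = x @ Wq  # Q is needed only for the current step, so it is not cached
        outputs.append(attend(q, K_cache, V_cache))
    return np.stack(outputs)

rng = np.random.default_rng(0)
d_model = 16
X = rng.standard_normal((10, d_model))  # stand-in for 10 token embeddings
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))

out_slow = decode_without_cache(X, Wq, Wk, Wv)
out_fast = decode_with_cache(X, Wq, Wk, Wv)
print(np.allclose(out_slow, out_fast))
```

Both paths produce the same outputs; the cached version simply skips the repeated projections. In a real multi-layer model the same argument holds per layer, because causal masking keeps each past position's layer input unchanged.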
I won’t paste the formula here; you can find it in any reference. But many people miss one point: KV cache isn’t a free trick that works on anything. It relies on the different roles Query, Key, and Value play in the attention mechanism: the Query changes at every step, but with causal (decoder-only) attention, the Key and Value of a past position never change once computed. You might ask, “Why cache only K and V, not Q?” Because Q is “who I’m looking for,” and that differs at every step; K and V are “what I contain,” and historical content doesn’t change.
That sounds like common sense, but I’ve actually seen someone stuff Q into the cache too—not only did it not speed up, it doubled the memory usage.
Speaking of which, here’s a vivid analogy: KV cache is like your shopping cart at the grocery store. As you walk down each aisle, you don’t go back and pick everything up again. The stuff already in the cart is your K and V; you just put new items in and keep going. But if you refilled the cart from scratch at every aisle… that’s just asking for trouble.
---
II. Think the problem is solved? Memory is the real battlefield
Okay, now you have KV cache, and decoding is tens of times faster. Happy?
Don’t celebrate too soon. You’ll quickly hit a new bottleneck: GPU memory.
A model’s parameters are fixed, but KV cache grows dynamically! Take a 7B model running in FP16. For each token in the context (prompt and generated alike), every layer and every attention head has to store two vectors (K and V). Suppose 32 layers, 32 heads, head_dim=128. Then the KV cache for one token is: 32 layers × 2 (K and V) × 32 heads × 128 dims × 2 bytes = 524,288 bytes ≈ 0.5 MB. Doesn’t sound big?
But at 2048 tokens, that’s 1 GB! At 8192 tokens, it’s 4 GB! And if your context stretches to 100K tokens… you do the math.
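If you would rather not do the math by hand, the arithmetic above fits in a few lines. This is a back-of-the-envelope helper (the name `kv_cache_bytes` and its defaults are just the example configuration from this section), and it ignores allocator overhead and fragmentation.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=32,
                   head_dim=128, bytes_per_value=2, batch_size=1):
    # The factor of 2 accounts for storing both K and V.
    return (batch_size * num_tokens * num_layers * 2
            * num_kv_heads * head_dim * bytes_per_value)

per_token = kv_cache_bytes(1)
print(f"per token: {per_token} bytes ({per_token / 2**20:.2f} MiB)")

for n in (2048, 8192, 100_000):
    print(f"{n:>7} tokens: {kv_cache_bytes(n) / 2**30:.1f} GiB")
```

With these defaults, a single 100K-token sequence comes out to roughly 48.8 GiB of KV cache, before counting model weights or any concurrent requests.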
Last year, I ran an inference service using vLLM (v0.4.2). Users uploaded long documents for Q&A, context length 32K. On a single A100 80G, with just a moderate number of concurrent requests, memory blew up—OOM instantly. Time to first token (TTFT) shot up to over ten seconds, and the user experience was as good as dead.
So when I saw the XSKY test report, I felt an instant connection. They tested DeepSeek‑R1 (inference engine vLLM) under different context lengths. Without KV cache offloading, TTFT at 8K was already high. After enabling MeshFusion (their shared‑storage KV cache offloading), TTFT dropped by 91%, 96%, 96%, and 94% (for 8K, 32K, 64K, 100K respectively). Throughput (TPS) improved by 13x to 28x.
91% and 28x—those aren’t random numbers. I haven’t run a cluster that large myself, but from the principles I can tell: this isn’t just plain offloading; it’s managing KV cache in a much more efficient way. They used a G3.5 shared‑storage pool to move the KV cache that doesn’t fit in GPU memory to high‑speed SSDs, while leveraging prefetching and locality to keep latency in check. In short, they swapped memory for storage, shifting the bottleneck from the GPU to something that scales horizontally.
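The general shape of offloading can be sketched as a two-tier store. To be clear, this is a toy illustration of the idea and has nothing to do with how MeshFusion or vLLM actually work internally: the class `TieredKVStore`, the LRU policy, and the plain dictionaries standing in for GPU memory and SSD are all assumptions made for the example.

```python
from collections import OrderedDict

class TieredKVStore:
    """Toy two-tier KV block store.

    'fast' stands in for GPU memory (small, bounded);
    'slow' stands in for SSD or shared storage (unbounded here).
    """

    def __init__(self, fast_capacity_blocks):
        self.fast_capacity = fast_capacity_blocks
        self.fast = OrderedDict()  # block_id -> data, ordered by recency
        self.slow = {}
        self.slow_reads = 0

    def put(self, block_id, data):
        self.fast[block_id] = data
        self.fast.move_to_end(block_id)
        self._evict_if_needed()

    def get(self, block_id):
        if block_id in self.fast:
            self.fast.move_to_end(block_id)
            return self.fast[block_id]
        # Fast-tier miss: load from the slow tier instead of redoing prefill.
        data = self.slow.pop(block_id)
        self.slow_reads += 1
        self.put(block_id, data)
        return data

    def _evict_if_needed(self):
        # Spill least recently used blocks to the slow tier.
        while len(self.fast) > self.fast_capacity:
            victim_id, victim = self.fast.popitem(last=False)
            self.slow[victim_id] = victim

store = TieredKVStore(fast_capacity_blocks=2)
for block_id in ("a", "b", "c"):
    store.put(block_id, f"kv-{block_id}")

print(sorted(store.slow))   # blocks that were spilled
print(store.get("a"))       # served from the slow tier
print(list(store.fast))     # blocks now resident in the fast tier
```

The point of the sketch is the trade: a slow-tier read costs far less than recomputing the prefill for that block, as long as the storage tier is fast enough and the access pattern has some locality.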
The test on Huawei Ascend 910C with PD (Prefill/Decode) separation is also interesting. By splitting Prefill and Decode onto different nodes, optimizing each separately, and adding KV cache offloading, TTFT dropped by 86%–92% and TPS improved by 271%–422%. The longer the context, the bigger the gains, which shows that in long-sequence scenarios, managing KV cache becomes more critical than the computation itself.
My rule of thumb: no matter which inference framework you use, if you want to run long contexts, KV cache is the first thing you optimize. Don’t think buying A100/H100 solves everything—concurrency and context length will eat up every byte of memory.
---
III. Don’t be fooled by fancy optimization names: real deployment has its pitfalls
In the last couple of years, KV cache optimization techniques have exploded: GQA, MLA, PagedAttention, quantization, PD separation… They all sound impressive, but their real‑world mileage varies.
Let’s start with GQA (Grouped Query Attention). It’s a variant of MHA (Multi-Head Attention) in which each group of Query heads shares a single Key/Value head, reducing the amount of KV cache needed. I tried replacing MHA with GQA (for instance, splitting 32 Query heads into 8 groups, so only 8 K/V heads are cached instead of 32): KV cache memory dropped by three-quarters and inference speed improved noticeably. The trade-off is a slight accuracy loss, especially on detail-sensitive tasks like long-document Q&A or code generation. LLaMA 2 70B already uses GQA, but many open-source small models still stick with MHA.
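Here is roughly what the head sharing looks like for one decode step, as a NumPy sketch. The function name `gqa_attention` and the tensor layout are assumptions for illustration; real implementations avoid materializing the repeated K/V and fuse this into a kernel.

```python
import numpy as np

def gqa_attention(q, K_cache, V_cache):
    # q:       (num_q_heads, head_dim) for the current token
    # K_cache: (num_kv_heads, seq_len, head_dim)
    # V_cache: (num_kv_heads, seq_len, head_dim)
    num_q_heads, head_dim = q.shape
    num_kv_heads = K_cache.shape[0]
    group_size = num_q_heads // num_kv_heads

    # Consecutive query heads in a group read the same cached K/V head.
    K = np.repeat(K_cache, group_size, axis=0)  # (num_q_heads, seq_len, head_dim)
    V = np.repeat(V_cache, group_size, axis=0)

    scores = np.einsum("hd,hsd->hs", q, K) / np.sqrt(head_dim)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("hs,hsd->hd", weights, V)

rng = np.random.default_rng(0)
num_q_heads, num_kv_heads, head_dim, seq_len = 32, 8, 128, 16

q = rng.standard_normal((num_q_heads, head_dim))
K_cache = rng.standard_normal((num_kv_heads, seq_len, head_dim))
V_cache = rng.standard_normal((num_kv_heads, seq_len, head_dim))

out = gqa_attention(q, K_cache, V_cache)
kv_ratio = num_kv_heads / num_q_heads  # cache size relative to MHA
print(out.shape, kv_ratio)
```

Only the 8 K/V heads live in the cache, so its size is a quarter of the MHA equivalent, which matches the three-quarters reduction described above.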
Quantization is another approach: compressing KV cache from FP16 to INT8 or even INT4. I tried 8-bit KV cache in transformers 4.35.0 with bitsandbytes. Memory halved, and speed stayed roughly the same in my runs. The catch is that quantization can add decoding latency (the cache has to be dequantized before attention), and INT4 precision loss is often unacceptable, especially in high-repetition scenarios where weird errors pop up.
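For intuition about where the memory saving and the extra dequantization step come from, here is a minimal sketch of symmetric INT8 quantization with one scale per (token, head) vector. This is a generic scheme written for illustration, not the transformers or bitsandbytes implementation; `quantize_int8` and `dequantize_int8` are hypothetical helper names.

```python
import numpy as np

def quantize_int8(x):
    # Symmetric quantization: one float32 scale per vector along the last axis.
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale).astype(np.float32)  # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    # This is the extra work paid at every decode step before attention.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
# Stand-in for cached K (or V): (seq_len, num_heads, head_dim) in FP16.
k_fp16 = rng.standard_normal((64, 8, 128)).astype(np.float16)
k_fp32 = k_fp16.astype(np.float32)

k_int8, k_scale = quantize_int8(k_fp32)
k_restored = dequantize_int8(k_int8, k_scale)

quantized_bytes = k_int8.nbytes + k_scale.nbytes
print(f"FP16 cache: {k_fp16.nbytes} bytes, INT8 cache + scales: {quantized_bytes} bytes")
print(f"max abs error: {np.abs(k_fp32 - k_restored).max():.4f}")
```

The rounding error per element is bounded by half a quantization step, which is usually tolerable at 8 bits. At 4 bits the step is roughly 16 times larger for the same range, which is one way to see why INT4 is so much riskier.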
The one that really caught my eye is MLA (Multi-head Latent Attention), which DeepSeek introduced in DeepSeek-V2 and carried into DeepSeek-V3. Its core idea is to compress K and V into a low-dimensional latent space, then decompress them when needed. Theoretically, KV cache can shrink by an order of magnitude. According to the material, DeepSeek-V4 with a 1M token context uses only 10% of the KV cache compared to a standard MHA model. I haven’t used it in production myself, but the thinking is elegant: instead of optimizing the cache
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.