On Quantization Algorithms for Large-Model Inference (English)
Generated: 2026-06-23 03:59:46
---
Quantization for Large Models? Don't Be Scared! After Three Days of Hands-On Testing, I've Laid All These Algorithms Bare
The other day, a friend complained to me that his RTX 3090 was struggling to run a 13B model. A few exchanges in, his VRAM exploded, and everything ground to a halt. I slapped my thigh and said, "Dude, haven't you tried quantization yet? This thing is a total game changer for the LLM era!"
Think about it – no need to modify the model architecture, no need to retrain. Just change a few lines of configuration, and your memory usage is cut in half, or even down to a quarter, while inference speed doubles. Who would have believed that before? When I first heard about it, I didn't believe it either. I was sure the accuracy would go straight to hell. But you know what? Three days of testing later, I had to eat my own words.
These quantization algorithms today are seriously impressive.
But here's the problem: there are so many of them! GPTQ, AWQ, SmoothQuant, LLM.int8(), QLoRA, LLM-QAT… you can find articles all over the place, but most of them are buried in math formulas – Hessian matrices, activation migration, that kind of stuff that makes your head spin. Hardly anyone tells you: What's it actually like to use them? Where are the pitfalls? Which one is the least hassle?
So I went all in. A100 80GB, two days to run through the mainstream solutions, third day to organize the data. What follows is all hands-on experience, with data, with gotchas, with attitude. You just need to copy my homework.
---
What Quantization Actually Does – The Most Vivid Explanation
You know how high-resolution photos get compressed to JPEG? The original is tens of megabytes, every detail crystal clear. After compression it's a few hundred kilobytes, good enough to post on WeChat Moments, but when you zoom in, the edges are blurry and artifacts show up.
Quantization is exactly the same idea.
Model parameters are originally stored using FP32 or FP16, each weight taking up 32 bits or 16 bits. After quantization? You switch to INT8 or INT4. An INT4 weight only takes up 4 bits. Compared to FP16, the space is cut to a quarter. The pressure on VRAM drops instantly. The trade-off is the same as with compressed photos: you lose information, and the output might be slightly off.
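To put numbers on that, here's a quick back-of-the-envelope sketch. It only counts the weights themselves; the KV cache, activations, and the extra scaling factors that quantized formats store are all ignored, so treat the results as a lower bound. The function name is just something I made up for illustration.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1024**3


params_13b = 13e9  # a 13B-parameter model
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gib(params_13b, bits):.1f} GiB")
```

At FP16 the weights of a 13B model alone come to roughly 24 GiB, which is why a 24 GB RTX 3090 has nothing left over for the conversation itself. At INT4 the same weights need about a quarter of that.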
So the core of quantization boils down to one sentence: How do you compress the precision while tricking the model into thinking everything is business as usual?
From my blood-and-tears experience: for models 7B and above, the accuracy loss from 4-bit quantization is practically imperceptible. With anything smaller, or if you go down to 2-bit or ternary, the results start getting unpredictable.
---
This Pitfall with Quantization Granularity – Don't Step in It
What's granularity? It's how many weights share a single scaling factor. Sharing one factor across more weights is simpler and cheaper to store, but accuracy goes up in smoke.
When I first started, I was lazy and used per-tensor – one scaling factor for the whole layer. I figured, "How convenient!" But the moment I quantized the activations, the model turned into a broken record – no matter what I said, it replied with the same sentence.
You see, the dynamic range of activation values is huge. In a single layer of activations, you might get an occasional big value like 80, while everything else is a tiny 0.1 or 0.2. With one shared scaling factor, that big value sets the step size for everyone, so the small values get rounded to zero and their information is lost. No wonder the model goes haywire.
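You can reproduce the effect in a few lines. This is a minimal sketch of symmetric INT8 quantization with one shared scale, using made-up activation values in the spirit of the example above. It is not how any particular library implements it.

```python
def quantize_symmetric(values, num_bits=8):
    """Symmetric quantization with a single scale shared by all values."""
    qmax = 2 ** (num_bits - 1) - 1           # 127 for INT8
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid a zero scale
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to floats."""
    return [c * scale for c in codes]


# One outlier (80.0) among small activations; illustrative values only.
activations = [0.1, 0.2, -0.15, 80.0, 0.12]
codes, scale = quantize_symmetric(activations)
restored = dequantize(codes, scale)

print(codes)     # the small values all collapse to code 0
print(restored)  # only the outlier survives the round trip
```

The outlier forces a step size of about 0.63, so everything smaller than roughly 0.3 lands on zero.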
So I switched to per-channel or per-group.
- Per-channel: One scaling factor per channel. Much better accuracy. It takes more work to implement efficiently on GPU, but it's worth it.
- Per-token + Per-channel: For activation quantization, one scale per row (per token), combined with per-channel for weights. This is the mainstream approach for W8A8. SmoothQuant is one example.
- Per-group: Used by GPTQ and AWQ, typically with 128 weights per group. Fine-grained control, at the cost of storing a bunch of extra scales.
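To see how much granularity matters, here's a toy comparison on a tiny weight matrix. Everything in it is an assumption for illustration: the values are made up, each row stands in for one channel, the group size is 4 instead of 128 so the example stays readable, and the quantizer is plain 4-bit symmetric round-to-nearest rather than GPTQ or AWQ.

```python
def fake_quantize(values, num_bits=4):
    """Quantize a list with one shared scale, then map back to floats."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def per_tensor(weights):
    """One scale for the whole matrix."""
    flat = [v for row in weights for v in row]
    restored = fake_quantize(flat)
    width = len(weights[0])
    return [restored[i:i + width] for i in range(0, len(restored), width)]


def per_channel(weights):
    """One scale per row (channel)."""
    return [fake_quantize(row) for row in weights]


def per_group(weights, group_size):
    """One scale per run of group_size consecutive weights within a row."""
    out = []
    for row in weights:
        restored = []
        for start in range(0, len(row), group_size):
            restored.extend(fake_quantize(row[start:start + group_size]))
        out.append(restored)
    return out


def mean_abs_error(original, restored):
    pairs = [(a, b) for ro, rr in zip(original, restored) for a, b in zip(ro, rr)]
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


weights = [
    [0.01, -0.02, 0.03, 0.015, 0.5, -0.4, 0.3, 0.45],
    [4.0, -3.5, 2.0, 3.0, 0.05, 0.02, -0.04, 0.03],
]
err_tensor = mean_abs_error(weights, per_tensor(weights))
err_channel = mean_abs_error(weights, per_channel(weights))
err_group = mean_abs_error(weights, per_group(weights, group_size=4))
print(f"per-tensor {err_tensor:.4f}, per-channel {err_channel:.4f}, per-group {err_group:.4f}")
```

On this toy matrix the error shrinks at each step from per-tensor to per-channel to per-group, and the price is the number of scales you store: 1, then 2, then 4.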
My advice: Don't try to cut corners here. If you can't stand the accuracy loss, you might as well not quantize at all.
---
Hands-On with the Five Musketeers: Which Ones are Worth It, Which Ones are Traps? Let Me Tell You One by One
GPTQ: Second-Order Compensation – Powerful as Hell, but Slow as Hell Too
GPTQ was the first algorithm I tried. Why? Because it had the flashiest claims: the paper said you could quantize a 175B model down to 3 or 4 bits with almost no accuracy loss, and that the 3-bit version could run on a single A100.
I tested it with OPT-13B. Quantization took about 20 minutes, which matched the paper's data. I was amazed at the time: 20 minutes to halve my memory usage? What a deal.
In terms of accuracy, at 4-bit on OPT-175B, the WikiText-2 perplexity was 8.37, compared to FP16's 8.34, almost no drop! At 3-bit it went to 8.68, still tolerable. At 2-bit it climbed to 9.58, and ternary quantization came in at 9.20… The compression potential of large models left me speechless.
But! There's a catch.
GPTQ's quantization works by weight reconstruction, and during inference it has to unpack and dequantize the 4-bit weights. This introduces extra overhead. On an A100 using vLLM to load a GPTQ model, the generation speed was noticeably slower than with AWQ. If you're after throughput, GPTQ isn't the first choice.
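Here's roughly what "unpack and dequantize" means, as a pure-Python sketch. The layout (two 4-bit codes per byte, low nibble first) and the scale and zero-point values are my own assumptions for illustration, not GPTQ's actual storage format. Real kernels typically pack into wider words and fuse this step into the matrix multiply.

```python
def pack_int4(codes):
    """Pack unsigned 4-bit codes (0..15) two per byte, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    return bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))


def unpack_int4(packed, count):
    """Recover the first `count` 4-bit codes from the packed bytes."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)  # low nibble
        codes.append(byte >> 4)    # high nibble
    return codes[:count]


def dequantize(codes, scale, zero_point):
    """Turn integer codes back into approximate float weights."""
    return [(c - zero_point) * scale for c in codes]


codes = [0, 15, 8, 3, 12]
packed = pack_int4(codes)  # 5 codes fit in 3 bytes
weights = dequantize(unpack_int4(packed, len(codes)), scale=0.05, zero_point=8)
print(len(packed), weights)
```

Every weight has to go through the unpack and dequantize steps before it can take part in a matmul, so how well the kernel hides that cost makes a real difference to generation speed.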
One-liner: Good for offline batched inference or single-card deployment of
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.