What quantization methods are currently used for large models?
Generated: 2026-06-20 13:32:46
---
Winter 2022. I’m sitting there staring at my RTX 3090, feeling like a million ants are crawling under my skin.
A colleague tosses a sentence over: “Deploy LLaMA-13B and run a benchmark.” I look at my card — 24 GB VRAM. It can barely run LLaMA-7B in FP16. And he wants me to do 13B with this? Why don’t you swap the card out first?
But back then I was broke. No new GPU, so what could I do? Go all-in on quantization.
At that time the whole community was hyping one solution: the bitsandbytes library’s load_in_8bit=True. Basically the LLM.int8() trick: take the abnormally large channels (outliers) in the activations, handle them separately in FP16, and smash everything else into INT8. I tried it, and the weights went from roughly 26 GB in FP16 straight down to about 13 GB! And the model actually ran!
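If you want to see the outlier trick without a GPU, here is a toy pure-Python sketch. It is not the bitsandbytes implementation: the function names are made up for illustration, and the 6.0 threshold is used here only as an assumed cutoff for what counts as an outlier. Feature dimensions containing an outlier go through a floating-point path; everything else is quantized to INT8, multiplied as integers, and rescaled.

```python
def absmax_quantize(values):
    """Symmetric INT8 quantization of a list of floats: returns (integers, scale)."""
    scale = max((abs(v) for v in values), default=0.0) / 127.0 or 1.0  # never a zero scale
    return [round(v / scale) for v in values], scale

def mixed_int8_matmul(X, W, threshold=6.0):
    """X: activations [n][k], W: weights [k][m]. Approximates X @ W."""
    k, m = len(W), len(W[0])
    # Feature dimensions with any large activation stay in floating point.
    outlier = [j for j in range(k) if any(abs(row[j]) >= threshold for row in X)]
    regular = [j for j in range(k) if j not in outlier]
    # Quantize each weight column once, over the regular dimensions only.
    w_cols = [absmax_quantize([W[j][c] for j in regular]) for c in range(m)]
    out = []
    for row in X:
        x_int, x_scale = absmax_quantize([row[j] for j in regular])
        out_row = []
        for c, (w_int, w_scale) in enumerate(w_cols):
            int_acc = sum(a * b for a, b in zip(x_int, w_int))   # integer accumulate
            fp_part = sum(row[j] * W[j][c] for j in outlier)     # high-precision outlier path
            out_row.append(int_acc * x_scale * w_scale + fp_part)
        out.append(out_row)
    return out

# Toy data: feature dimension 2 carries the outliers (20.0 and -18.0).
X = [[0.5, -1.0, 20.0, 0.25], [1.5, 0.75, -18.0, -0.5]]
W = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6], [0.7, -0.8]]
exact = [[sum(x * W[j][c] for j, x in enumerate(row)) for c in range(2)] for row in X]
approx = mixed_int8_matmul(X, W)
max_err = max(abs(a - e) for ra, re in zip(approx, exact) for a, e in zip(ra, re))
print("max abs error:", max_err)
```

Notice how much bookkeeping surrounds the actual integer multiply: splitting dimensions, quantizing, two separate accumulations, rescaling. On real hardware each of those steps costs something.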
But guess what?
Inference was so slow I wanted to throw the keyboard. Small batch sizes were okay, but as soon as concurrency increased, GPU utilization flatlined. I dug into the source code and figured it out: the mixed‑precision decomposition is the bottleneck. Every matmul gets split into an INT8 part and a separate FP16 part for the outlier columns, with quantize and dequantize steps wrapped around it, and all that extra kernel work eats whatever the INT8 math saves. And at the time, bitsandbytes had terrible support for 4‑bit, so I was stuck with 8‑bit, which gives only half the compression 4‑bit would. Nowhere near enough.
But that pitfall wasn’t wasted — it made me realize one thing: quantization for large models can’t just focus on weights; the outliers in activations are the real troublemakers. Think about it: weights are static — once quantized they’re fixed. Activations? They change on every single inference, with a dynamic range that’s terrifying. Whoever can clamp down on activations at runtime — that’s who’s got the real skill.
And right about there, a turning point — 2023, quantization started to fork.
First came GPTQ. The first time I read the paper I almost gave up because of the Cholesky decomposition, but once I actually tried it — awesome! I used AutoGPTQ with LLaMA-7B, calibrating on 128 samples from the Pile (2,048 tokens each), quantizing to 4‑bit — W4A16. The whole process took 15 minutes, surprisingly fast. Accuracy? I tested perplexity on Wikitext — it went from 5.68 (FP16) to 5.74, nearly lossless. I deployed it straight into a Flask service. With batch size = 1, VRAM dropped like a rock — running 13B on a 3090 was rock solid.
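A quick back-of-the-envelope check of why 4‑bit is what makes 13B comfortable on a 24 GB card. This is just arithmetic on nominal parameter counts; it counts weights only and ignores quantization scales, zero points, activations, and the KV cache, so real usage will be somewhat higher. The helper name is mine.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9

for name, n_params in [("7B", 7e9), ("13B", 13e9)]:
    for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        print(f"LLaMA-{name} {label}: ~{weight_memory_gb(n_params, bits):.1f} GB")
```

For 13B that works out to roughly 26 GB in FP16, 13 GB in INT8, and 6.5 GB in INT4, which leaves plenty of headroom on a 3090 at batch size 1.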
But GPTQ has an old problem: quantization is as slow as a turtle. For 7B it’s bearable, but 13B takes an hour. Try a 70B model? The tutorials online tell you to wait patiently — it’s a waste of life. And another pitfall: too much calibration data leads to overfitting; too little gives poor reconstruction. I once made a mistake — quantized with the full dataset, and the generated text actually reproduced sentences from the calibration set. That’s a dead giveaway. I battled for a while before settling on 128 samples — problem solved.
Right alongside GPTQ came SmoothQuant.
At that point I was going bald trying to quantize activations. I wanted full W8A8 quantization, but the outliers in activations can be tens of times larger than normal values, and regular INT8 just can’t handle it. SmoothQuant’s idea is elegantly brutal: if activations are hard to quantize, shift some of that difficulty over to the weights. Concretely, you calculate a per‑channel scaling factor, divide each activation channel by it, and multiply the corresponding weight rows by the same factor. That pushes the activation outliers down and makes the weights numerically larger, so the difficulty is split between two tensors that are both manageable in INT8. The smoothing itself is a mathematically equivalent transformation, so it adds no error of its own; the only loss left is the INT8 rounding, which is now much milder.
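Here is a minimal pure-Python sketch of that smoothing step, assuming the commonly cited per-channel factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) with alpha = 0.5. The function names and the toy matrices are mine, and the sketch assumes no channel is all zeros. It only shows the rescale and checks that the product is unchanged; the INT8 quantization that would follow is left out.

```python
def matmul(A, B):
    """Plain float matrix product of A [n][k] and B [k][m]."""
    return [[sum(a * B[j][c] for j, a in enumerate(row)) for c in range(len(B[0]))]
            for row in A]

def smooth(X, W, alpha=0.5):
    """Divide activation channel j by s_j and multiply weight row j by s_j."""
    k = len(W)
    s = []
    for j in range(k):
        act_max = max(abs(row[j]) for row in X)   # per-channel activation range
        w_max = max(abs(w) for w in W[j])         # per-channel weight range
        s.append(act_max ** alpha / w_max ** (1 - alpha))
    X_s = [[row[j] / s[j] for j in range(k)] for row in X]
    W_s = [[w * s[j] for w in W[j]] for j in range(k)]
    return X_s, W_s, s

# Toy data: activation channel 2 is the outlier channel.
X = [[0.5, -1.0, 20.0, 0.25], [1.5, 0.75, -18.0, -0.5]]
W = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6], [0.7, -0.8]]
X_s, W_s, s = smooth(X, W)

before = max(abs(v) for row in X for v in row)
after = max(abs(v) for row in X_s for v in row)
print("largest activation before / after smoothing:", before, after)
print("outputs match:", all(abs(a - b) < 1e-9
      for ra, rb in zip(matmul(X, W), matmul(X_s, W_s)) for a, b in zip(ra, rb)))
```

With alpha = 0.5, each channel’s activation range and weight range end up equal, which is the "meet in the middle" behavior. Pushing alpha higher moves more of the difficulty onto the weights.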
I tried SmoothQuant in TensorRT-LLM (which was still an internal version at the time) with W8A8 on Llama-2-7B. On a single A100, FP16 throughput was 1,000 tokens/s; SmoothQuant W8A8 hit 1,800, almost double! And perplexity got worse by only 0.3 points. Honestly, this benefit completely changed my mind about INT8 inference. But there’s a prerequisite: you need a card with INT8 tensor cores. NVIDIA has had those since Turing, though realistically you want Ampere or above; without them it’s useless.
Then in mid‑2023, AWQ came out and dropped a nuke on the table.
At that point the industry was fed up with the trade‑offs: GPTQ gives good accuracy but is painfully slow to run, and if you fall back to plain round‑to‑nearest (RTN), accuracy turns to dust. AWQ burst onto the scene and compressed the whole W4A16 process down to minutes, with accuracy matching GPTQ. Most importantly, it needs no backpropagation and no layer‑by‑layer reconstruction; a few forward passes over the calibration data are enough.
The core insight is incredibly simple: a weight channel’s importance isn’t determined by the weight itself, but by the magnitude of its corresponding activations. In other words, if a channel’s activation values are explosively large, even a tiny error in the weights multiplied into them will blow up the output. The conventional idea that “weights with large absolute values need protection” is an illusion. AWQ simply multiplies the weights for those high‑activation channels by a scaling factor (say 2.0), making the numbers larger so the relative quantization error becomes smaller. Then it divides the activations by the same factor, so before rounding the result is mathematically identical. No mixed‑precision, no retraining; it’s just a per‑channel rescale.
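To make that concrete, here is a toy pure-Python sketch with one output neuron and one 4‑bit quantization group. It is not the AWQ codebase: the function names, the numbers, and the fixed factor of 2.0 are illustrative assumptions (the real method searches for the scales from calibration activations). Channel 1 sees a large activation, so its weight error dominates the output error; scaling that weight up before rounding, and the activation down by the same factor, shrinks it.

```python
def rtn_quantize(weights, n_bits=4):
    """Symmetric round-to-nearest over one group; returns dequantized weights."""
    q_max = 2 ** (n_bits - 1) - 1                      # 7 for 4-bit
    step = max(abs(w) for w in weights) / q_max        # one shared step per group
    return [max(-q_max, min(q_max, round(w / step))) * step for w in weights]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def output_error(w, x, scales):
    """Absolute output error when weights are scaled by s, rounded, and x is divided by s."""
    w_scaled = [wi * si for wi, si in zip(w, scales)]
    x_scaled = [xi / si for xi, si in zip(x, scales)]
    return abs(dot(x, w) - dot(x_scaled, rtn_quantize(w_scaled)))

w = [0.7, 0.14, -0.33, 0.06]     # weights of one output neuron
x = [0.1, 10.0, 0.2, 0.1]        # channel 1 has the large activation

err_plain = output_error(w, x, [1.0, 1.0, 1.0, 1.0])    # plain RTN
err_scaled = output_error(w, x, [1.0, 2.0, 1.0, 1.0])   # protect channel 1 only
print("plain RTN error:", err_plain)
print("scaled error:   ", err_scaled)
```

On this toy input the output error falls from about 0.39 to about 0.11. The trick works because scaling one weight by 2 does not change the group’s maximum, so the rounding step stays the same while the rounding error on the important channel is effectively halved.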
I tried AWQ on vLLM 0.4.0, quantizing Mistral-7B‑v0
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.