
First Survey of LLM Model Compression from a Chinese Academy of Sciences Team

By CaelLee | 6 min read


Generated: 2026-06-22 16:26:37

---

To be honest, a few days ago I did something particularly foolish—I dug out my old GPU with only 8GB of VRAM and tried to run a local large language model. I mean, with that little memory, what can you really do?

First, I tried the FP16 version of Llama 3 8B. And the result? It crashed with an OOM (out-of-memory) error the moment I loaded it, which makes sense: 8 billion parameters at 2 bytes each is about 16GB of weights, twice what the card has. I wasn't about to give up, so I switched to the 4-bit quantized version, and it finally ran. But the generation speed? So painfully slow it made me question my life choices: one token every three seconds, like racing a tortoise.

I started thinking, is there any more reliable solution to this? Right around then, I came across that LLM model compression survey from the Chinese Academy of Sciences—it basically turned the whole field inside out. I read it in one sitting that night, and wow, my mind was blown.

It just clicked!

---

You know how massive GPT-3 (175B) is? 175 billion parameters. Just storing the weights in half precision takes roughly 320 to 350GB, depending on how you count a gigabyte (175 billion × 2 bytes is 350 billion bytes), and during inference you'd need at least five 80GB A100s. Can ordinary people afford that? No way!
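The arithmetic behind numbers like these is easy to check yourself. Here is a minimal sketch that counts only the weights (no activations, no KV cache); the helper names are my own, not from the survey.

```python
def weight_bytes(num_params: float, bits_per_param: int) -> float:
    """Bytes needed to store the weights alone (no activations, no KV cache)."""
    return num_params * bits_per_param / 8


def to_gb(num_bytes: float) -> float:
    """Decimal gigabytes: 1 GB = 10**9 bytes."""
    return num_bytes / 1e9


def to_gib(num_bytes: float) -> float:
    """Binary gibibytes: 1 GiB = 2**30 bytes."""
    return num_bytes / 2**30


# Print the weight footprint of two model sizes at three bit widths.
for label, params in [("175B", 175e9), ("7B", 7e9)]:
    for bits in (16, 8, 4):
        size = weight_bytes(params, bits)
        print(f"{label} at {bits}-bit: {to_gb(size):7.1f} GB = {to_gib(size):7.1f} GiB")
```

At 16 bits, 175 billion parameters come to 350 GB, or about 326 GiB, which is why the figure you see quoted varies with the unit. Either way it does not fit on four 80GB cards.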

So what do you do? You have to compress it—no other choice.

And have you noticed? Ilya Sutskever and his team dropped this powerful line early on: unsupervised learning is, at its core, compression. Yeah, let that sink in. It's not just a technical statement—it basically set the tone for the whole field: to truly understand something, you have to be able to compress it down to its essence. Isn't the same true for models? If you can make a model small and efficient, that means you really understand the task.

After digging through that survey and combining it with those nights of my own tinkering, let me break down the three main pillars for you today: quantization, pruning, and knowledge distillation. Each one has its pitfalls. But each one also holds real promise for deployment.

---

1. Quantization Is the Most Practical Path, But Don't Expect It to Be Lossless

Quantization, simply put, swaps out those fine-grained floating-point numbers in a model for integers, using fewer bits to store weights and activations. Swap FP32 for INT8, and the storage size shrinks by three-quarters!
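To make that concrete, here is a minimal sketch of the simplest scheme, symmetric per-tensor INT8 quantization, in NumPy. It is an illustration only: real toolchains use per-channel or per-group scales and calibration data, and the function names here are my own.

```python
import numpy as np


def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization: one float scale maps x onto [-127, 127]."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original floats."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)  # toy FP32 weights

q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = float(np.max(np.abs(weights - restored)))

print("FP32 bytes:", weights.nbytes)
print("INT8 bytes:", q.nbytes)
print("max round-trip error:", max_error, "(half a step is", scale / 2, ")")
```

The INT8 copy is exactly a quarter of the FP32 size, and the worst-case round-trip error is about half a quantization step. That step is set by the single largest value in the tensor, which is why outliers matter so much.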

The trade-off? Accuracy loss. But how much? That depends on the method you use.

The survey categorizes quantization into three types: Quantization-Aware Training (QAT), Quantization-Aware Fine-tuning (QAF), and Post-Training Quantization (PTQ). Right now, PTQ is the hottest. Why? Because it's simple and straightforward. GPTQ, AWQ, SpQR—they're all classic examples. I tried GPTQ 4-bit quantization on a 7B model myself, and the size dropped from about 13GB to 3.5GB—finally small enough to fit into my pitiful VRAM!

But! Did inference speed improve much? Not really. Fitting the weights into VRAM doesn't remove the other bottlenecks on the card: most 4-bit kernels dequantize the weights back to FP16 before the matrix multiply, so the arithmetic itself doesn't get any cheaper, and native INT4 acceleration isn't that widespread yet. That's the first trap: quantization saves memory and memory bandwidth, but it doesn't necessarily make generation faster. Sounds counterintuitive, right? Yeah.

Then I tried AWQ. Its idea is clever: not all weights are equally important. Use the activation statistics to find the roughly 1% of weights that matter most, protect them by scaling, and the quantization error drops dramatically. I compared GPTQ vs. AWQ quantization on CodeLlama 7B for a code completion task, and AWQ's pass@1 was nearly two percentage points higher! Two percentage points in a production environment is not something you can ignore, folks.
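Here is a toy sketch of that activation-aware idea, assuming a single linear layer and plain round-to-nearest 4-bit quantization with one scale per output row. It is not the real AWQ implementation, which uses group-wise quantization and fused kernels. The function names and the grid search over `alpha` are simplifications of my own.

```python
import numpy as np


def quantize_rows_int4(W: np.ndarray) -> np.ndarray:
    """Round-to-nearest 4-bit quantization, one scale per output row, returned dequantized."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(W / scale), -8, 7) * scale


def awq_style_search(W: np.ndarray, X: np.ndarray, grid: int = 11):
    """Pick per-input-channel scales s = act_mag**alpha that minimize output error.

    W: (out_features, in_features), X: (tokens, in_features).
    Channels with large activations get scaled up before quantization,
    and the inverse scale is folded into the activations.
    """
    act_mag = np.abs(X).mean(axis=0)      # average activation size per input channel
    reference = X @ W.T                   # full-precision output
    best_alpha, best_err = None, np.inf
    for alpha in np.linspace(0.0, 1.0, grid):
        s = act_mag ** alpha              # alpha = 0 gives s = 1, i.e. plain rounding
        s = s / np.sqrt(s.max() * s.min())
        out = (X / s) @ quantize_rows_int4(W * s).T
        err = float(np.mean((out - reference) ** 2))
        if err < best_err:
            best_alpha, best_err = float(alpha), err
    return best_alpha, best_err


rng = np.random.default_rng(0)
W = rng.normal(size=(32, 64))
X = rng.normal(size=(128, 64))
X[:, :2] *= 30.0                          # two "salient" input channels with large activations

baseline_err = float(np.mean((X @ quantize_rows_int4(W).T - X @ W.T) ** 2))
best_alpha, best_err = awq_style_search(W, X)
print("plain 4-bit error:", baseline_err)
print("best alpha:", best_alpha, "error:", best_err)
```

Because `alpha = 0` reproduces plain rounding, the search can never do worse than the baseline on the calibration data. How much it helps depends on how uneven the activation magnitudes are.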

See, choosing the right method is a hundred times more important than just chasing lower bit widths.

There's also a technique called SmoothQuant, which specifically handles outliers in activation values. Those outliers often wreck INT8 accuracy outright, but SmoothQuant uses per-channel scaling to shift the spikes from the activations into the weights, where they are easier to quantize, and suddenly the model behaves itself. I tested it myself: BERT-base for text classification, an INT8 model quantized with SmoothQuant, and the inference accuracy was within 0.3% of FP32. But the throughput? Doubled! That's real, tangible benefit in deployment.
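The smoothing step itself is only a few lines. Below is a minimal sketch, assuming the per-channel factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) with alpha = 0.5. The toy data and function name are my own, and the INT8 step that would follow is left out.

```python
import numpy as np


def smooth(X: np.ndarray, W: np.ndarray, alpha: float = 0.5):
    """Move activation outliers into the weights with a per-input-channel factor.

    X: (tokens, in_features), W: (in_features, out_features).
    Returns X / s and s * W, whose product equals X @ W.
    """
    act_max = np.abs(X).max(axis=0)
    w_max = np.maximum(np.abs(W).max(axis=1), 1e-8)
    s = np.maximum(act_max ** alpha / w_max ** (1.0 - alpha), 1e-5)
    return X / s, W * s[:, None], s


rng = np.random.default_rng(0)
X = rng.normal(size=(16, 8))
X[:, 3] *= 50.0                       # one channel with large activation outliers
W = rng.normal(size=(8, 4))

X_smooth, W_smooth, s = smooth(X, W)

print("activation max per channel before:", np.abs(X).max(axis=0).round(1))
print("activation max per channel after: ", np.abs(X_smooth).max(axis=0).round(1))
print("same layer output:", np.allclose(X @ W, X_smooth @ W_smooth))
```

The layer output is unchanged, but the outlier channel's range shrinks, so a single INT8 scale for the activations wastes far fewer levels on one spike.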

So here's my take on quantization: it's the most mature approach right now and offers the best cost-performance ratio. But you have to pick your method carefully. If your task is knowledge Q&A or writing assistance, 4-bit PTQ is generally fine. If it's code generation or mathematical reasoning, go with 8-bit or pair it with some QAT to recover the loss. And don't think you can jump straight down to 2-bit—at least current experiments show 4-bit is the sweet spot. I really agree with what Dettmers said: 4-bit is pretty much the optimal solution.

2. Pruning Is Like Surgery on the Model—Go Lightly

Now let's talk about pruning. There are two types: structured pruning, which removes entire rows or columns of weights, and unstructured pruning, which zeros out unimportant individual parameters.

Sounds great, right?

With unstructured pruning, the compression ratio can be extremely high in theory. But inference speed often doesn't improve, and it can actually get slower! Why? Because sparse matrix operations lack universal hardware acceleration support. Which GPU can efficiently handle random sparse tensors? Almost none.

I got an itch the other day and tried unstructured pruning on a 300M-parameter GPT-2. After pruning away 50% of the weights, the perplexity increased by less than 1, and I thought I was a genius! Then I ran it and checked the speed... wow, it was even slower than the original model! Because a masked weight matrix is still stored and multiplied as a dense tensor, and the generic sparse kernels I could reach from PyTorch don't pay off at 50% random sparsity.

That lesson hurt: the compression from pruning must be aligned with the hardware! Pruning that doesn't take hardware into account is just talk.
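The difference between the two kinds of pruning shows up as soon as you look at tensor shapes. Here is a minimal NumPy sketch using simple magnitude criteria. The function names are my own, and real LLM pruning methods use more careful importance scores.

```python
import numpy as np


def unstructured_prune(W: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero the smallest-magnitude entries; the matrix keeps its dense shape."""
    k = int(W.size * sparsity)
    if k == 0:
        return W.copy()
    threshold = np.partition(np.abs(W).ravel(), k - 1)[k - 1]  # k-th smallest magnitude
    return W * (np.abs(W) > threshold)


def structured_prune_rows(W: np.ndarray, keep_ratio: float):
    """Drop whole rows (output neurons) with the smallest L2 norm; the matrix shrinks."""
    n_keep = max(1, int(round(W.shape[0] * keep_ratio)))
    norms = np.linalg.norm(W, axis=1)
    keep = np.sort(np.argsort(norms)[-n_keep:])  # indices of surviving rows, in order
    return W[keep], keep


rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))

sparse_W = unstructured_prune(W, sparsity=0.5)
small_W, kept_rows = structured_prune_rows(W, keep_ratio=0.5)

print("unstructured:", sparse_W.shape, "nonzeros:", np.count_nonzero(sparse_W), "bytes:", sparse_W.nbytes)
print("structured:  ", small_W.shape, "bytes:", small_W.nbytes)
```

The unstructured result has half its entries zeroed but is still a 64×64 dense array, so a dense matmul does exactly the same work as before. The structured result really is a smaller matrix. In a real network, the matching input columns of the next layer would have to be removed too.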

So, for large models, structured pruning is the way to go. You directly reduce the model dimensions, and even if the accuracy loss is a bit larger, you can compensate later with distillation. But here's the problem: would you dare to deploy a pruned LLM in an online service? I sure wouldn't. If it suddenly spits out something nonsensical, users would curse me out.

The CAS survey covers pruning, but not extensively. Why? Because the engineering maturity of this field for LLMs is far behind quantization. The future trend? I'd guess combining structured pruning and quantization: first cut out redundant channels or attention heads, then compress the bit width. But actual deployment today is limited, and that's a fact.

3. Knowledge Distillation Passes the Secret Manual to Small Models, But It Comes at a Cost

The term "distillation" is so vivid. A large model acts as the teacher, teaching a small model the skills it has mastered.

Distillation is divided into white-box and black-box. White-box means the teacher provides intermediate representations or output distributions for the student to mimic. A typical example is MiniLLM, which uses reverse KL divergence to keep the student from overestimating the low-probability regions of the teacher's distribution. Pretty clever. Black-box is simpler and more direct: you use closed-source large models like ChatGPT or GPT-4 to generate data, then train an open-source small model with it. Alpaca and Vicuna became famous thanks to this.
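The reverse-KL point is easier to see with numbers. Here is a toy sketch with a four-token vocabulary. The distributions are made up for illustration and the helper name is my own.

```python
import numpy as np


def kl(p, q, eps: float = 1e-12) -> float:
    """KL(p || q) for discrete distributions given as sequences of probabilities."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


teacher = [0.49, 0.49, 0.01, 0.01]         # two good tokens plus a low-probability tail
student_focus = [0.96, 0.02, 0.01, 0.01]   # small student commits to one good token
student_spread = [0.25, 0.25, 0.25, 0.25]  # small student smears mass onto the tail

forward_focus = kl(teacher, student_focus)    # forward KL: KL(teacher || student)
forward_spread = kl(teacher, student_spread)
reverse_focus = kl(student_focus, teacher)    # reverse KL: KL(student || teacher)
reverse_spread = kl(student_spread, teacher)

print(f"forward KL  focus={forward_focus:.3f}  spread={forward_spread:.3f}")
print(f"reverse KL  focus={reverse_focus:.3f}  spread={reverse_spread:.3f}")
```

Forward KL scores the spread-out student as the better fit, even though half its probability sits on tokens the teacher almost never produces. Reverse KL prefers the focused student, because it punishes putting mass where the teacher has little.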

I personally lean toward black-box distillation. Why? Because it doesn't depend on the teacher's architecture. I can directly use ChatGPT's output to distill a T5—how convenient is that?

But I fell into a big trap! If the student model lacks capacity, it simply can't learn all the teacher's skills. A while back, I used GPT-4-generated instructions to distill a 500M-parameter Flan-T5. Result? Whenever it hit a complex reasoning instruction, the student would completely break down and give wildly off-base answers. I later switched to a 3B model, and it barely passed. The TED method mentioned in the survey specifically addresses this—aligning intermediate representations to narrow the gap between teacher and student.

Where does distillation shine best? When you have a massive model in the cloud but want to run a decent model on the edge. For example, distilling Llama 3 70B into a 7B version usually yields much better results than training a 7B


Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
