300M模型蒸成30M，延迟降90%精度仅掉0.5% (English)

Generated: 2026-06-22 07:20:36

---

Guess what? I "distilled" a 300M model down to 30M!

Let me tell you a true story. Last year, I took on a project, and my boss just threw me a line: "Deploy BERT-large onto a phone."

My heart sank—over 300MB! Online latency would easily hit 500ms, and users would have already bailed. My boss said, "Figure it out." My first thought was pruning, but after pruning, accuracy dropped by 3 points. My second thought was quantization, but that also hurt precision. Finally, I gritted my teeth and went with distillation. After two weeks of tinkering, guess what? The model was compressed to 30M, latency dropped to 50ms, and accuracy only fell by 0.5%!

At that moment, I almost cried. Not being dramatic—it's true—distillation isn't some mystical art; it has a clear methodology!

Today, I'll walk you through all the pitfalls I've encountered, the methods I've tried, and the code I've written over the past two years. Get ready—let's start distilling!

---

1. What exactly are we distilling? — Don't make the small model memorize textbooks

When Hinton proposed knowledge distillation in 2015, the core idea was simple, but you might not believe it: Don't make the small model memorize the training set; let it learn the "way of thinking" of the large model.

Think about it: the training set gives one-hot labels—"This is a cat." But when a large model sees a picture of a cat, its output probability distribution might be: cat 0.9, tiger 0.08, dog 0.02. This distribution contains the "knowledge" the large model has learned: cats and tigers are somewhat similar, but cats and dogs are far apart. The small model learning this distribution directly is far better than learning hard labels!

So how do we learn it? The key is temperature T.

Normal softmax is exp(zi)/sum(exp(zj)). With T added, it becomes exp(zi/T)/sum(exp(zj/T)). The larger T is, the smoother the output probabilities, allowing the small model to learn more about the similarities between classes. For example, with MNIST handwritten digit 2, the large model might assign 0.9 to 2, 1e-6 to 3, and 1e-9 to 7. After increasing T, the probability for 2 drops to 0.3, 3 becomes 0.2, and 7 becomes 0.1, so the small model learns that "2 and 3 look somewhat alike."

I've tested it: T between 1 and 20 works, but the optimal value varies greatly by task. For text classification, T=4 works best; for NER, T=2 is about right. Don't trust fixed values—tune it yourself!

The loss function looks like this:


loss = alpha * KL(teacher_soft, student_soft) + (1-alpha) * CE(student_hard, label)

alpha controls the weight between distillation and ground truth labels. I usually set it to 0.7. Note: when computing soft targets for the student model, use the same T, otherwise the distributions won't match.

---

2. Six ways to distill BERT—which one is the best?

There are six classic methods out there. I've tried them all. Let me give you the verdict—but don't rush; each one comes with hard-learned lessons.

1. Distilled BiLSTM (most aggressive, most prone to failure)

Distill a 12-layer BERT into a single-layer BiLSTM, reducing parameters by 100x. How does it perform? For text classification, it's okay, but for slightly more complex tasks, it falls apart. I tried it on SST-2, and the BiLSTM only lost 2 points, but on CoNLL NER, it dropped 8 points. Suitable for scenarios with extreme latency sensitivity and simple tasks; don't expect it to carry the load.

2. BERT-PKD (intermediate layer distillation, steady and reliable)

Instead of distilling only the last layer, select intermediate layers of the teacher model (e.g., layers 3, 6, 9) for the student to learn. The authors used two strategies: PKD-skip (skip layers) and PKD-last (select the last few layers). In my tests, PKD-skip was more stable because skipping layers covers features at different abstraction levels. If you want a balance between accuracy and speed, this is worth a try.

3. DistillBERT (pre-training distillation, lazy person's choice)

Distill during the pre-training phase, keeping a 6-layer structure. HuggingFace has a ready-made distilbert-base-uncased. I directly fine-tuned it on downstream tasks, and it performed 1-2 points better than training a 6-layer BERT from scratch. First choice for lazy people—plug and play.

4. TinyBERT (two-stage distillation, SOTA)

This was the SOTA in 2019. It works in two steps: first distill the pre-training phase, then distill the fine-tuning phase. Moreover, it designs loss functions for every layer—embedding layer, intermediate hidden layers, attention matrices, and output layer—all aligned. I reproduced it, and the results were indeed good, but the training time doubled. Go for it if you want extreme accuracy, but don't be stingy with your GPU.

5. MobileBERT (slimming down, Google's own)

Instead of reducing the number of layers, reduce the width of each layer (hidden size from 768 to 128). Google uses it for mobile deployment. I tested it on a phone, and it's 30% faster than TinyBERT, but slightly less accurate. Suitable for real-time inference on mobile devices where accuracy requirements are not too high.

6. MiniLM (distill attention, best value for money)

Only distill the attention matrix (value matrix) of the last layer, not the hidden layers. The code is extremely simple, but the results are surprisingly good. I used it to distill BERT-large into 6 layers, and the average GLUE drop was only 0.8 points. If your budget is limited, choose MiniLM—best value for money, no contest.

My choice: If budget is limited, MiniLM offers the best value; if you have time, TinyBERT gives better results; if you want to save effort, just use DistillBERT. Don't overthink it—just try it out!

---

3. Hands-on code: Step-by-step guide to distilling a BERT

No more chit-chat. Here's the code. I wrote a distillation script based on Transformers and TextBrewer. Just change the paths and you're good to go.

But don't copy-paste yet—I've already fallen into a few traps for you. Read through first!


# distill_bert.py
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset
from textbrewer import GeneralDistiller, TrainingConfig, DistillationConfig

# Configuration
teacher_name = "bert-large-uncased" # Teacher
student_name = "prajjwal1/bert-tiny" # Student
task = "sst2"
batch_size = 32
epochs = 3
lr = 2e-5
temperature = 4.0
alpha = 0.7

# Load data and models
tokenizer = AutoTokenizer.from_pretrained(student_name)
dataset = load_dataset("glue", task)
teacher = AutoModelForSequenceClassification.from_pretrained(teacher_name, num_labels=2)
student = AutoModelForSequenceClassification.from_pretrained(student_name, num_labels=2)

# Important: Teacher model must output hidden states and attention
teacher.config.output_hidden_states = True
teacher.config.output_attentions = True
student.config.output_hidden_states = True
student.config.output_attentions = True

# Configure distillation
distill_config = DistillationConfig(
 temperature=temperature,
 hard_label_weight=1-alpha,
 kd_loss_weight=alpha,
 intermediate_matches=[ # Intermediate layer matching: TinyBERT style
 {'layer_T': 0, 'layer_S': 0, 'feature': 'hidden', 'loss': 'mse'},
 {'layer_T': 6, 'layer_S': 2, 'feature': 'hidden', 'loss': 'mse'},
 {'layer_T': 12, 'layer_S': 4, 'feature': 'hidden

300M模型蒸成30M，延迟降90%精度仅掉0.5% (English)

300M模型蒸成30M，延迟降90%精度仅掉0.5% (English)

1. What exactly are we distilling? — Don't make the small model memorize textbooks

2. Six ways to distill BERT—which one is the best?

1. Distilled BiLSTM (most aggressive, most prone to failure)

2. BERT-PKD (intermediate layer distillation, steady and reliable)

3. DistillBERT (pre-training distillation, lazy person's choice)

4. TinyBERT (two-stage distillation, SOTA)

5. MobileBERT (slimming down, Google's own)

6. MiniLM (distill attention, best value for money)

3. Hands-on code: Step-by-step guide to distilling a BERT

Cael Lee

Ready to get started?