A 50,000-Character Survey! Prompt-Tuning: A New Fine-Tuning Paradigm

Generated: 2026-06-23 07:55:20

---

I stared at BERT's [MASK] for an afternoon, and then I got it

Let me tell you, one day last year I sat staring at BERT's output layer for two straight hours with only one thought running through my head: Does this thing even understand a word I said?

Here's the deal. I was doing sentiment classification, the usual routine: pull out BERT, slap a classification head on top, feed it a few thousand labeled examples, and fine-tune all parameters. The result? Decent, but something felt off. Think about it: what does BERT do during pre-training? It guesses which word [MASK] stands for (masked language modeling) and decides whether one sentence follows another (next sentence prediction). And now you're suddenly asking it to be a judge: positive or negative? It's bound to be confused!

So when I first saw those GPT-3 and PET papers, my gut reaction was: Add a template and you need less data? Yeah, right.

But then I ran it myself, and I was sold. Really sold.

---

One trick that makes you see BERT in a whole new light

You see, traditional fine-tuning is like forcing a concert pianist to fix a pipe. He's talented, sure, but he's not going to shift gears overnight. What does Prompt-Tuning do? It recasts the downstream task as the thing BERT already does best: fill-in-the-blank.

You heard that right. Cloze-style.

For example, you say "I love this movie." The traditional approach outputs a label. Prompt-Tuning? It turns the sentence into "I love this movie. It was [MASK]." Then it lets BERT guess whether the blank is "great" or "terrible." That's its bread and butter! The guessed word maps straight back to the sentiment: "great" means positive, "terrible" means negative.
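To make the cloze idea concrete, here is a minimal sketch in plain Python. The names `TEMPLATE`, `VERBALIZER`, `build_prompt`, and `classify` are made up for illustration, and the scores passed in are hypothetical stand-ins for what a masked language model would assign to each candidate word at the [MASK] position. No real model is called.

```python
# Minimal sketch of cloze-style classification; all names are illustrative.

TEMPLATE = "{text} It was [MASK]."  # the pattern from the example above
VERBALIZER = {"great": "positive", "terrible": "negative"}  # label word -> label


def build_prompt(text: str) -> str:
    """Wrap the raw input in the cloze template."""
    return TEMPLATE.format(text=text)


def classify(mask_scores: dict[str, float]) -> str:
    """Pick the label whose label word scores highest at the [MASK] position.

    `mask_scores` stands in for the masked-LM output restricted to the
    label words; in practice it would come from a real model.
    """
    best_word = max(VERBALIZER, key=lambda w: mask_scores.get(w, float("-inf")))
    return VERBALIZER[best_word]


prompt = build_prompt("I love this movie.")
# Hypothetical scores, NOT real BERT output.
label = classify({"great": 0.83, "terrible": 0.02})
print(prompt)
print(label)
```

The point of the design is that nothing new gets bolted onto the model: the template and the word-to-label mapping do all the task-specific work.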

Once that idea clicked, the next two years of my experiments completely changed direction.

---

Three pitfalls I personally stepped into (each one worth a barbecue)

First try: manual templates, truly effective but also truly exhausting

Early on, when I was learning PET, I designed my own templates. For sentiment classification I used "It was [MASK]." On IMDB, with only 10 labeled samples, accuracy hit 85%, less than 5 points below full-parameter fine-tuning. My jaw dropped: 10 data points and a gap of under 5 points, can you believe it?

But here's the problem: every new task meant inventing a new template from scratch, like writing a school essay each time. Even worse, the same template that worked like a charm on BERT-base turned out to be terrible on RoBERTa-large. I tried "It was [MASK]" on BERT and it was solid; on ALBERT it collapsed. My guess is that its parameter-sharing layers just couldn't handle that fixed pattern.

So yeah, templates are no silver bullet.

Second try: continuous prompts, fooled on small models

To save myself trouble, I started playing with "virtual tokens": no manual design, just let the model learn the embeddings itself. The two best-known versions of this idea are Prompt Tuning and Prefix Tuning.

First I tried Prompt Tuning: just add a few trainable virtual tokens at the beginning of the input and keep the model itself frozen. On a 175B GPT-3, the results were spectacular. But on BERT-base, at only 110M, it fell 5 points short of full fine-tuning. The paper The Power of Scale for Parameter-Efficient Prompt Tuning makes the same point: if your model has fewer than roughly 10B parameters, don't expect Prompt Tuning to save you parameters and still match full fine-tuning.

It's like giving a child a blackboard and saying "figure it out"—the blackboard is too small, they can't.
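Here is a toy sketch of what "add a few trainable virtual tokens at the beginning of the input" means mechanically. It uses plain Python lists instead of a real framework, and `HIDDEN`, `PROMPT_LEN`, `prepend_soft_prompt`, and `trainable_params` are illustrative names, not anything from the paper's code.

```python
import random

HIDDEN = 4       # toy hidden size; BERT-base uses 768
PROMPT_LEN = 3   # number of virtual tokens (hypothetical choice)

random.seed(0)
# The only trainable parameters: one vector per virtual token.
soft_prompt = [[random.gauss(0.0, 0.02) for _ in range(HIDDEN)]
               for _ in range(PROMPT_LEN)]


def prepend_soft_prompt(input_embeddings):
    """Return the sequence the frozen model sees: virtual tokens, then real tokens."""
    return soft_prompt + input_embeddings


def trainable_params(prompt_len: int, hidden: int) -> int:
    """Prompt Tuning trains prompt_len * hidden numbers and nothing else."""
    return prompt_len * hidden


# Two real tokens, already embedded by the frozen embedding layer (toy values).
tokens = [[0.1] * HIDDEN, [0.2] * HIDDEN]
sequence = prepend_soft_prompt(tokens)
print(len(sequence))              # 5 positions: 3 virtual + 2 real
print(trainable_params(20, 768))  # 15360
```

With 20 virtual tokens on a 768-dimensional model, that is 15,360 trainable numbers in total, which gives a feel for just how small the blackboard is.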

Next I tried Prefix Tuning: not only adding tokens at the input layer, but also prepending trainable virtual tokens to the Key and Value of every Transformer layer. The effect was better, but the GPU memory costs shot through the roof. I ran GPT-2 medium (355M) with batch size 1, sequence length 512, and a prefix length of only 10, and the memory usage hit 4 GB. Under the same conditions Prompt Tuning used only 1 GB. Do the math: prefix parameters = prefix length × 2 (K and V) × layers × hidden size. With 12 layers and hidden size 768, that is 10 × 2 × 12 × 768 = 184,320 floats; GPT-2 medium itself has 24 layers and hidden size 1024, which gives 10 × 2 × 24 × 1024 = 491,520. Add gradients and optimizer states on top, and an 8 GB card OOMs on moderately large models.

And the killer: during training, the K and V tensors for these virtual tokens are trainable parameters that carry gradients and optimizer states, unlike at inference, where they are read-only. One of my colleagues didn't notice that and spent the whole night debugging OOMs.
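The prefix arithmetic above is easy to check with a few lines of Python. This is a back-of-the-envelope sketch, not a profiler: `prefix_params` and `training_state_bytes` are hypothetical helpers, and the memory estimate assumes fp32 values and an Adam-style optimizer that keeps two extra states per parameter.

```python
def prefix_params(prefix_len: int, n_layers: int, hidden: int) -> int:
    """Trainable numbers in a prefix: one K and one V vector per layer per virtual token."""
    return prefix_len * 2 * n_layers * hidden


def training_state_bytes(n_params: int, bytes_per_value: int = 4,
                         optimizer_states: int = 2) -> int:
    """Weights + gradients + optimizer states (assumes fp32 and Adam's two moments)."""
    copies = 1 + 1 + optimizer_states
    return n_params * copies * bytes_per_value


small = prefix_params(10, n_layers=12, hidden=768)    # 12-layer, 768-dim shape from the text
medium = prefix_params(10, n_layers=24, hidden=1024)  # GPT-2 medium's shape
print(small, medium)                # 184320 491520
print(training_state_bytes(small))  # 2949120 bytes, about 2.8 MiB
```

By this count the prefix tensors and their optimizer states come to only a few megabytes, so most of the measured memory presumably comes from elsewhere, for example the activations that must be kept to backpropagate through the frozen model down to the prefix.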

Later I found out that Prefix Tuning's learning rate needs to be an order of magnitude smaller than Prompt Tuning's. I tried 1e-4 and the loss blew up; switching to 1e-5 stabilized it. What you think is a shortcut often hides the biggest trap.

Third try: P-Tuning v2, finally found the right fit

P-Tuning v1 first uses an LSTM to generate the virtual token embeddings, essentially a mapping layer. On BERT-large it performed a bit better than fixed templates, but training was as slow as a snail: an LSTM processes its inputs one step at a time, so it cannot be parallelized.

Then v2 came out. It looks a lot like Prefix Tuning, but with specific improvements for small models. I compared it on the SuperGLUE RTE task: P-Tuning v2 averaged 90.2 points, full fine-tuning 91.5, only 1.3 points apart! Meanwhile Prompt Tuning scored only 87. The key point: v2 uses only 0.1% of the parameters: 12 layers × 20 virtual tokens × 768 dimensions × 2 (K and V) ≈ 368k, compared to 330 million for full fine-tuning—hundreds of


Cael Lee
