FLAN Hands-On: Beating Zero-Shot GPT-3 on 19 of 25 Tasks, and Where the 7x Gain Came From
Generated: 2026-06-22 11:12:26
---
Stop Guessing, Just Teach It! A Discovery That Almost Made Me Give Up on Prompt Tuning
Have you ever tried shouting at something for ages, only to have it completely fail to understand what you're saying?
I have.
Last year, I was working on a sentiment classification project and used GPT-3 zero-shot, with no examples in the prompt. Guess what happened? I gave it a sentence, "This movie is boring, but the special effects are okay," and asked it to decide whether it was positive or negative.
It replied: "Neutral."
There wasn't even a "neutral" option in my choices!
It felt like complaining to a friend that you're exhausted and hearing, "Then drink more hot water." You wanted comfort, and they handed you plain boiled water. Technically a response, just not the one you needed.
I was baffled: Isn't this model supposed to understand language? How could it not even grasp basic sentiment?
Later, I dug into Brown et al.'s paper and realized—GPT-3's zero-shot performance was far worse than few-shot. Why? Because among the billions of texts it had seen, instructions like "Please determine the sentiment of this sentence" had almost never appeared. The model had no idea what you wanted it to do.
Frustrating, right?
Suddenly, the Light Went On
Just as I was about to give up, a Google team (Wei et al., with Quoc V. Le as senior author) dropped FLAN: a 137B-parameter model fine-tuned on instruction data from over 60 NLP tasks.
The results blew my mind.
FLAN outperformed zero-shot GPT-3 on 19 out of 25 evaluation tasks. Even more astonishing, on tasks like ANLI, BoolQ, and OpenbookQA, zero-shot FLAN beat few-shot GPT-3, which gets worked examples in its prompt.
My immediate thought: This isn't optimization—it's the difference between turning on the lights and fumbling in the dark.
At this point, you might think I succeeded right away.
Wrong. I fell into a huge trap.
First Hard Lesson: Too Few Tasks, Wasted Effort
I had an instruction dataset covering just 5 tasks. I thought: Eh, the number doesn't matter, let's just run it and see.
After training, I tested zero-shot performance—nothing moved.
My internal monologue: Who am I? Where am I? Did I just train for nothing?
Then I read a line in the paper that almost made me cry: "Large-scale multi-task learning is critical; benefits only become significant when the dataset exceeds 20 tasks."
Twenty! And I dared to go with 5? Was I just asking for trouble?
Fine. I went all in. I collected 30 tasks, designed at least 5 different instruction phrasings for each, and ended up with over 200 prompts. After retraining—
Guess what?
On several unseen NLI tasks, accuracy jumped from around 0.1 to 0.7. Yes, you read that right—nearly a 7x increase.
That moment I understood: Quantity really can lead to a qualitative leap.
But quantity alone wasn't enough. Another pitfall awaited me.
Second Trap: Loss Scaling—Skip It and You're Wasting Your Time
The FLAN paper mentioned a detail: scaling the loss for different tasks. The formula: L_scaled = L / log(n), where n is the number of possible outputs for the task.
Think about it: for binary classification, n=2; for generation tasks, n is the vocabulary size. A model guessing uniformly at random has a cross-entropy of log(n), so dividing by log(n) puts every task at roughly 1 at chance level. That makes things fair.
The first time I skipped this, the losses from different tasks sat on wildly different scales, and training was heavily biased toward generation tasks with large output spaces. Imagine one person choosing between 2 options and another choosing from tens of thousands of words. How could the former compete?
After adding scaling, everything balanced out. Performance took off.
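Here's a minimal sketch of that scaling, just to make the arithmetic concrete. It assumes the natural log and a cross-entropy loss; the function names and the 32,000-token vocabulary are my own illustrative choices, not something taken from the paper.

```python
import math


def chance_loss(n_outputs: int) -> float:
    """Cross-entropy (in nats) of a uniform guess over n_outputs choices."""
    return -math.log(1.0 / n_outputs)


def scaled_loss(loss: float, n_outputs: int) -> float:
    """Apply L_scaled = L / log(n), with n the number of possible outputs."""
    if n_outputs < 2:
        raise ValueError("n_outputs must be at least 2, otherwise log(n) <= 0")
    return loss / math.log(n_outputs)


# Hypothetical tasks: binary sentiment vs. generation over an assumed vocabulary.
for name, n in [("binary classification", 2), ("generation", 32_000)]:
    raw = chance_loss(n)
    print(f"{name}: raw={raw:.3f} scaled={scaled_loss(raw, n):.3f}")
```

At chance level the raw losses differ by more than a factor of ten, while the scaled ones both land at about 1, which is the "fairness" the formula is after.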
Sometimes, it's not the method that's wrong—it's the details that aren't in place.
Okay, Here's the Real Deal: Don't Confuse Instruction Tuning with Prompt Tuning
At this point, I need to make something clear.
Many people see "instruction fine-tuning" and think it's similar to prompt tuning. I admit they both involve "templates," but they are fundamentally different.
Think about it—
What is the essence of prompt tuning? It's a word-filling game. You give an incomplete sentence like "This restaurant is too __" and let the model fill in the blank. The model acts as a "writer," completing what it thinks is the most likely word.
What is the essence of instruction tuning? It's following instructions to answer questions. You give an instruction like "Determine the sentiment of this sentence," then provide options "A: good, B: bad," and ask the model to judge. Now the model acts as an "examiner," making choices according to your standards.
These two modes use completely different capabilities! One uses the skill of predicting the next word to guess your intention; the other uses comprehension to meet your requirements.
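To see the two input styles side by side, here's a small sketch. Both helper names and the exact wording of the templates are hypothetical; the only point is the shape of what the model receives.

```python
def cloze_prompt(stem: str) -> str:
    """Fill-in-the-blank style: the model is left to complete the sentence."""
    return f"{stem} __"


def instruction_prompt(sentence: str, options: list[str]) -> str:
    """Instruction style: state the task, list the options, ask for an answer."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    lines = [
        "Determine the sentiment of this sentence.",
        f"Sentence: {sentence}",
        "Options:",
    ]
    # Label each allowed answer so the model has to pick from a closed set.
    lines += [f"{letters[i]}: {option}" for i, option in enumerate(options)]
    lines.append("Answer:")
    return "\n".join(lines)


print(cloze_prompt("This restaurant is too"))
print(instruction_prompt("This restaurant is too noisy.", ["good", "bad"]))
```

The cloze version leaves the answer space open, which is how you end up with a "Neutral" nobody asked for. The instruction version spells out both the task and the allowed answers.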
What's the more critical difference? Generalization.
The current mainstream approach to prompt tuning fine-tunes a separate prompt for each task—once you've tuned a prompt for sentiment analysis, you can't use it for question answering. Moreover, many methods freeze model parameters, and soft prompts increasingly resemble adapters, straying far from the original goal of "activating pre-trained capabilities."
In my view, this path has gone a bit off track.
Instruction tuning, on the other hand, involves full model fine-tuning (or at least not freezing parameters) on dozens to thousands of instruction-formatted tasks. The model truly learns the ability to "follow instructions," not "guess what this format means."
One is passive fill-in-the-blank; the other does exactly what you point it at.
Which one do you think is more reliable?
"Isn't This Just Multi-Task Fine-Tuning?"—You Hit the Nail on the Head
You might say: Isn't this just multi-task fine-tuning? That's been around for a while.
Yes, but not exactly.
How does traditional multi-task fine-tuning work? Each task has its own classification head or regression head, sharing a bottom encoder, each with its own loss function. See, each task has its own "little tail."
What about instruction tuning? It converts all tasks into the same "generation" task, distinguishing tasks solely by the instruction text. This means—the model must decide its behavior pattern by understanding natural language instructions.
It doesn't learn "how to do task A, how to do task B." It learns "switch tasks based on instructions." This ability is the root of zero-shot generalization.
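A rough sketch of what "same generation task, different instruction" looks like in data terms. The TextPair class, the two converter functions, and the label names are all hypothetical; real instruction datasets use their own wording.

```python
from dataclasses import dataclass


@dataclass
class TextPair:
    source: str  # instruction plus input, as plain text
    target: str  # the answer, also plain text


def sentiment_to_text(sentence: str, label: int) -> TextPair:
    """Turn a binary sentiment example into a text-to-text pair."""
    labels = ["negative", "positive"]
    return TextPair(
        source=f"Is the sentiment of this sentence positive or negative?\n{sentence}",
        target=labels[label],
    )


def nli_to_text(premise: str, hypothesis: str, label: int) -> TextPair:
    """Turn a 3-way NLI example into a text-to-text pair."""
    labels = ["entailment", "neutral", "contradiction"]
    return TextPair(
        source=(
            f"Premise: {premise}\nHypothesis: {hypothesis}\n"
            "Does the premise entail the hypothesis?"
        ),
        target=labels[label],
    )


# Both tasks land in one list with one schema: no per-task heads anywhere.
mixed = [
    sentiment_to_text("The special effects are great.", 1),
    nli_to_text("A man is cooking.", "Nobody is cooking.", 2),
]
for pair in mixed:
    print(repr(pair.source), "->", pair.target)
```

Once everything is a (source, target) string pair, one generative model with one loss can train on the whole mix, and the instruction text is the only thing telling the tasks apart.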
Experimental results point the same way. T0 was trained on a mixture drawn from a prompt collection covering 177 datasets and 2,073 prompts, and on held-out tasks it often matched or beat much larger GPT-3 models. In my own small experiments, when the number of tasks increased from 5 to 30, zero-shot performance showed an "emergent" jump: not linear improvement, but a sudden spike.
Don't you feel like this is akin to a person suddenly becoming enlightened after learning many skills?
Where Does the Data Come From? I'll Give It Away for Free
Some might worry: Instruction tuning requires a lot of labeled data, right?
Actually, no. FLAN and T0 both use existing NLP datasets, simply rewriting them into instruction templates. Design 10 templates per task—the cost isn't high. Plus, open-source FLAN and T0 datasets are already available; you can use them directly.
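Rewriting an existing dataset into templates can be as simple as the sketch below. The three templates and the field names (premise, hypothesis) are made up for illustration; in practice you would write more phrasings per task and adapt the fields to your dataset.

```python
import random

# Hypothetical instruction phrasings for one NLI-style dataset.
TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail the hypothesis?",
    "Read this: {premise}\nCan we conclude that \"{hypothesis}\"? Answer yes, no, or maybe.",
    "{premise}\nBased on the sentence above, is it true that \"{hypothesis}\"?",
]


def render_all(example: dict) -> list[str]:
    """Render one labeled record under every template."""
    return [template.format(**example) for template in TEMPLATES]


def render_random(example: dict, rng: random.Random) -> str:
    """Pick one phrasing at random, e.g. once per training step."""
    return rng.choice(TEMPLATES).format(**example)


example = {"premise": "A man is cooking.", "hypothesis": "Someone is in a kitchen."}
for prompt in render_all(example):
    print(prompt, end="\n\n")
print(render_random(example, random.Random(0)))
```

The labels stay exactly as they were in the original dataset; only the wrapping text changes, which is why the labeling cost is close to zero.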
ChatGPT's success also suggests that instruction tuning combined with reinforcement learning from human feedback (RLHF) can align models with user intent. This path is not only effective but also proven in production.
Alright, enough talk. Here are some practical suggestions—
Three Actionable Tips, No Thanks Needed
First, don't just focus on prompt tuning. Spend time building a multi-task instruction dataset. Even with just 30 tasks, run instruction fine-tuning once, and the results might surprise you. I guarantee it's far more effective than tuning 10 hard prompts.
Second, make good use of existing models. FLAN-T5 and T0 are already open-source; use them directly for zero-shot scenarios. For reasoning tasks, these are my first choice—they outperform untuned GPT-3 and don't require complex prompt strategies.
Third, if you insist on running instruction tuning yourself, remember: at least 20 tasks, each with at least 5-10 differently phrased instructions. Use the log(n) loss scaling I mentioned. During training, simply sum up the losses from all tasks for gradient descent, with no fancy weighting. Simple and crude, but effective.
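For the second tip, a zero-shot call to an open instruction-tuned checkpoint might look roughly like this. It assumes the Hugging Face transformers library with PyTorch installed and uses the google/flan-t5-small checkpoint to keep the download small; the prompt wording and both function names are my own choices, and a larger checkpoint will likely give better answers.

```python
def sentiment_prompt(sentence: str, options: list[str]) -> str:
    """Build an instruction-style prompt with an explicit, closed option list."""
    option_lines = "\n".join(f"- {option}" for option in options)
    return (
        "Determine the sentiment of this sentence.\n"
        f"Sentence: {sentence}\n"
        f"Options:\n{option_lines}\n"
        "Answer:"
    )


def zero_shot(prompt: str, model_name: str = "google/flan-t5-small") -> str:
    """Generate a short answer from an instruction-tuned seq2seq checkpoint."""
    # Imported lazily so the prompt builder works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=5)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


prompt = sentiment_prompt(
    "This movie is boring, but the special effects are okay.",
    ["positive", "negative"],
)
print(prompt)
# Downloads the checkpoint on first use:
# print(zero_shot(prompt))
```

No few-shot examples and no prompt search: the instruction and the option list are the whole interface.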
One Last Thing
Think about it: Prompt tuning is like frantically winking at a friend, hoping they'll "guess" what you want to eat. Instruction tuning is like directly telling them—"I want hotpot."
Of course, the former has its value—it's been instrumental in exploring the boundaries of language model capabilities. But when we want a model that truly "understands
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.