One Article to Clearly Tell Apart LoRA, Prompt Tuning, P
Generated: 2026-06-21 12:39:28
---
Before anything else, I have to tell you how I made a fool of myself last year.
At the time, I had a task on my hands: getting a 7B base model to learn medical Q&A output in a specific format. I tore through several tutorials in one go—good grief, there was Prompt Tuning, Prefix Tuning, Adapter, LoRA… a whole bunch of English terms hitting me like incantations. Every article claimed their method worked best, but none of them told me which one I should actually pick.
Like a fool, I went with Prompt Tuning.
Why did I choose it? Because it looked the simplest—just modify the input layer, add a few trainable vectors, how easy was that? I thought, "What a steal!"
But when training was done and I ran the tests, my expression instantly froze: the model couldn't even remember "amoxicillin," and the output format was so off it was unusable. Can you imagine? A medical Q&A model outputting drug names with typos!
Later I switched to LoRA, same data, two hours of training, and the results turned around completely. The drug name accuracy shot up from 60% to 85%, and the output format was neat and tidy.
I sat there at my computer, both happy and frustrated: happy that the task finally succeeded, frustrated that I had made such a blind choice without understanding things first.
Later I realized—there simply is no "best" fine-tuning method; there's only "the one that fits your specific situation best."
Today, I'm going to share every lesson I learned from those pitfalls, and walk you through these sibling methods one by one.
---
Section 1: What's the Real Difference Among Those "Soft Prompt" Siblings?
Let's start with Prompt Tuning.
It sounds pretty fancy, but what it actually does is quite straightforward: freeze the entire large model, and only add a trainable sequence of virtual tokens at the very beginning of the input. It's like putting a pair of glasses on someone—you don't change their brain, just slightly alter how they see things.
The number of parameters is ridiculously small—less than 0.01% of the original model—and the resource consumption is absurdly low. Sounds sweet, right?
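To make that "ridiculously small" figure concrete, here is a minimal back-of-the-envelope sketch. The sizes are my own illustrative assumptions (a 7B model, an embedding width of 4096, 20 virtual tokens), and `prepend_soft_prompt` is a hypothetical helper that just shows where the trainable vectors go.

```python
# Rough sketch of Prompt Tuning: trainable vectors are prepended to the input
# embeddings, and nothing else in the model is trained.
# All sizes below are illustrative assumptions, not measurements.

def prepend_soft_prompt(soft_prompt, token_embeddings):
    """Put the trainable virtual-token vectors in front of the real token embeddings."""
    return soft_prompt + token_embeddings

def prompt_tuning_params(num_virtual_tokens, hidden_size):
    """Each virtual token is one trainable vector of length hidden_size."""
    return num_virtual_tokens * hidden_size

# Toy illustration of the prepend step (3 virtual tokens, 5 real tokens, width 4).
toy_soft_prompt = [[0.0] * 4 for _ in range(3)]
toy_tokens = [[1.0] * 4 for _ in range(5)]
toy_input = prepend_soft_prompt(toy_soft_prompt, toy_tokens)

# Parameter share for an assumed 7B model.
base_model_params = 7_000_000_000   # assumed "7B" model
hidden_size = 4096                  # assumed embedding width
num_virtual_tokens = 20             # assumed soft prompt length

trainable = prompt_tuning_params(num_virtual_tokens, hidden_size)
fraction_percent = 100 * trainable / base_model_params

print(f"sequence length after prepending: {len(toy_input)}")
print(f"trainable parameters: {trainable:,}")
print(f"share of base model: {fraction_percent:.5f}%")
```

Under these assumptions the trainable share lands well below 0.01%, which is exactly why the nudge is so gentle.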
But guess what? It has a fatal weakness.
Its effectiveness depends heavily on model size. When you use Prompt Tuning on a giant model above 200B, even a small nudge is enough for the model to figure things out on its own. But on a small or medium model below 10B, it's like whispering to someone who's half deaf: they can't hear you clearly.
Why? Because smaller models have limited semantic capacity; the tiny perturbation of modifying just the input layer can't drive the whole model. To use a metaphor: a large model is like a professor who already knows everything—you just write “Please check” on the blackboard and they get it; a small model is like a student who only half understands—even if you write “Please check carefully,” they might still make elementary mistakes.
What about P-Tuning?
This one isn't satisfied with a plain soft prompt at the input. The v1 version still works only at the embedding layer, but it produces the virtual token embeddings with a small trainable prompt encoder instead of learning them directly. The v2 version goes further and adds trainable vectors to every Transformer layer. The effect is a significant step up from Prompt Tuning, especially on NLU tasks (classification, reading comprehension), where it can even approach full fine-tuning.
I tried using P-Tuning v2 for medical Q&A. The model did retain a bit more than with Prompt Tuning, but key drug names still often dropped the ball. Later I crunched the numbers—drug name accuracy barely broke 60%. It felt like: I put in the effort, but it always fell just short.
Then there's Prefix Tuning.
Its idea is close to P-Tuning's, but the mechanism is spelled out differently. Instead of only modifying the input embeddings, it prepends trainable vectors to the Key and Value matrices of the attention in every Transformer layer. It's like adding an implicit prefix to each layer that guides the model along those markers.
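Here is a toy sketch of what "prepending to Key and Value" means for a single attention head. It is plain Python with made-up numbers, it skips the causal mask and multi-head details, and `attention_with_prefix` is a hypothetical name, so treat it as an illustration rather than how any library implements it.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_with_prefix(queries, keys, values, prefix_keys, prefix_values):
    """Single-head attention where trainable prefix vectors are prepended to K and V.

    The queries and the original keys/values come from the frozen model;
    in Prefix Tuning only prefix_keys and prefix_values would be trained.
    """
    all_keys = prefix_keys + keys
    all_values = prefix_values + values
    scale = math.sqrt(len(queries[0]))
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in all_keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, all_values))
               for j in range(len(all_values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up toy data: 3 real tokens, 2 prefix vectors, dimension 2.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
prefix_keys = [[0.1, -0.2], [0.3, 0.4]]
prefix_values = [[0.0, 1.0], [1.0, 0.0]]

outputs, weights = attention_with_prefix(queries, keys, values, prefix_keys, prefix_values)
print("outputs per query:", len(outputs))
print("positions each query attends to:", len(weights[0]))
```

Every query now attends to the prefix positions as well as the real tokens, which is how the prefix steers each layer without touching the frozen weights.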
It works well on generation tasks (NLG) like dialogue, summarization, and translation. But honestly, it never became mainstream either.
At this point, you might ask: why didn't these three siblings become mainstream later?
The reasons are pretty straightforward. First, they all reach into the model's forward pass (extra virtual tokens at the input, or extra key/value vectors in every layer), which makes deployment more of a hassle. Second, the effect varies too much with the task type, so they're not general enough. Most critically, having few parameters is an advantage but also a hard limitation: the change is too small to transfer complex domain knowledge effectively.
In plain terms: when you need to remember a long list of specialized terms, that little perturbation is like tickling an elephant—it doesn't feel a thing.
---
Section 2: Why Did LoRA Become the Default Choice?
Alright, now for the main part.
LoRA. This is the method I've used the most in the past year, and it's now the undisputed "default option" in the field.
Why did it take off? It's not some technology that appeared out of nowhere; it truly understood the model's "lazy" nature.
LoRA's core assumption is particularly interesting: pre-training has already taught the model language and knowledge; fine-tuning is just adding a "behavioral constraint" on top. And this constraint is naturally low-rank. To put it plainly, the correction a model needs is actually very small—like on a giant chessboard, you only need to move a few key pieces.
How does it work in practice?
You locate the Self-Attention module in the Transformer, pick out the two projection matrices WQ and WV, and add a low-rank update B×A to each. A is a randomly initialized r×d matrix, B is a d×r matrix initialized to zero (so the update starts out as exactly zero), and r is much smaller than d. During training, only A and B are updated; the original model weights stay frozen.
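A minimal sketch of that update, assuming the usual LoRA formulation h = W0·x + (α/r)·B·(A·x), with B starting at zero so the adapted layer initially behaves exactly like the frozen one. The tiny dimensions and the helper names `matvec` and `lora_forward` are my own, chosen only for illustration.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W0, A, B, x, alpha, r):
    """h = W0 x + (alpha / r) * B (A x). W0 is frozen; only A and B would be trained."""
    base = matvec(W0, x)
    low_rank = matvec(B, matvec(A, x))  # A (r x d) projects down, B (d x r) projects back up
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]

random.seed(0)
d, r, alpha = 6, 2, 4  # toy sizes; real layers have d in the thousands
W0 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]   # frozen pretrained weight
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # random init
B = [[0.0] * r for _ in range(d)]                                  # zero init
x = [random.gauss(0, 1) for _ in range(d)]

base_out = matvec(W0, x)
h_start = lora_forward(W0, A, B, x, alpha, r)  # identical to base_out because B is zero

B[0][0] = 1.0  # pretend training has moved B away from zero
h_after = lora_forward(W0, A, B, x, alpha, r)

# Parameter count for one realistic-sized matrix (assumed d = 4096, r = 8).
full_params = 4096 * 4096
lora_params = 2 * 4096 * 8
print("unchanged at start:", h_start == base_out)
print("changed after update:", h_after != base_out)
print(f"full matrix: {full_params:,} params, LoRA pair: {lora_params:,} params")
```

The last line is the whole appeal: for one 4096×4096 matrix, the A and B pair at r=8 is well under 1% of the full matrix.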
Imagine it like this: a master who has already mastered all martial arts—you only need to change the way he holds his sword, not make him train everything from scratch.
There's a key parameter here: r. How to choose r? My experience is: for a 7B model with a medium-difficulty task, r=8 or 16 is enough. If the task is particularly complex, like imitating multi-turn dialogue formats, you can go up to 32. But above 64, the returns diminish noticeably while the number of trainable parameters and the GPU memory they need keep growing. Then there's the scaling coefficient α/r, which multiplies the low-rank update. The larger α is, the stronger the fine-tuning force, but too big and the model easily goes off track. I usually set α=2r, and it almost never causes problems.
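To get a feel for how r changes the cost, here is a rough calculation. I'm assuming a 7B-class shape (32 layers, hidden size 4096), adapters only on WQ and WV, and 16-bit storage for the adapter; `lora_trainable_params` is a hypothetical helper, and your model's shape may differ.

```python
def lora_trainable_params(num_layers, hidden_size, r, matrices_per_layer=2):
    """Each adapted d x d matrix gets A (r x d) and B (d x r): 2 * d * r parameters."""
    return num_layers * matrices_per_layer * 2 * hidden_size * r

num_layers = 32      # assumed 7B-class depth
hidden_size = 4096   # assumed 7B-class width

rows = []
for r in (8, 16, 32, 64):
    alpha = 2 * r                      # the alpha = 2r rule of thumb from the text
    scale = alpha / r                  # effective multiplier on the low-rank update
    params = lora_trainable_params(num_layers, hidden_size, r)
    size_mb = params * 2 / 1e6         # 2 bytes per parameter, assuming 16-bit storage
    rows.append((r, alpha, scale, params, size_mb))
    print(f"r={r:<3} alpha={alpha:<4} scale={scale:.1f} "
          f"trainable={params:>12,} adapter~{size_mb:.1f} MB")
```

Two things stand out under these assumptions: the trainable parameter count grows linearly with r, and with α=2r the effective scale α/r stays fixed at 2 no matter which rank you pick, so changing r doesn't silently change the strength of the update.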
Why does LoRA choose WQ and WV specifically, and not others? I suspect it came from experiments: the FFN layer is responsible for memorizing knowledge—like the books in a library; the Self-Attention layer is responsible for attention retrieval—like the library catalog. Fine-tuning focuses more on adjusting the retrieval method rather than stuffing in new books, so tuning Self-Attention gives the highest return.
Now, QLoRA—this one is incredibly developer-friendly, especially for individuals.
First, quantize the base model to 4-bit. During fine-tuning, keep the quantized weights frozen and only train the high-precision A and B matrices. The forward pass dequantizes the weights on the fly, and in backpropagation the gradients still flow through the frozen layers, but only A and B receive updates, which drastically reduces GPU memory usage. I tried fine-tuning a 65B model on a single A100 80G, something I never dared to think about before, and it actually ran, with results within 5% of full fine-tuning.
Tell me, isn't that sweet?
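A quick estimate shows why the 4-bit step matters so much. This only counts the base model weights and ignores quantization constants, activations, the LoRA matrices, and optimizer state, so real usage is higher; `weight_memory_gb` is a hypothetical helper for the arithmetic.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 65_000_000_000  # the 65B model from the text

fp16_gb = weight_memory_gb(num_params, 16)
int4_gb = weight_memory_gb(num_params, 4)

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {int4_gb:.1f} GB")
print(f"fits on one 80 GB card at 16-bit? {fp16_gb < 80}")
print(f"fits on one 80 GB card at 4-bit?  {int4_gb < 80}")
```

At 16-bit the weights alone are far beyond a single 80 GB card; at 4-bit they leave room for everything else that training needs.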
What's even more impressive is its "hot-swap" capability. The incremental weights are just a few dozen MB. You can train different LoRA adapters for different tasks and dynamically switch them during inference. It's like changing the skin on your phone—swap it out anytime, the original system stays completely unchanged. In terms of
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.