大模型微调finetune方法 (English)

Generated: 2026-06-21 07:12:43

---

Don't Be Fooled by Those Fancy Names! When It Comes to Fine-Tuning LLMs, These Few Actually Deliver

When we talk about fine-tuning large language models, new methods have been popping up like mushrooms after rain over the past couple of years. Just the names containing "LoRA" alone could fill a list—AdaLORA, DyLORA, LongLoRA, DoRA, MaLoRA… When you first saw these, were you a bit overwhelmed too?

I was the same at the beginning! I went through paper after paper, ran test after test, and finally landed on a hard truth: Most of them are just old wine in new bottles. The ones that really hold up in production are still the same old classics.

Today, let me have a heart-to-heart with you about which methods are worth your time and which are just self-indulgent demos in academic papers.

---

First, LoRA Is the Real Pillar! All Other Variants Are, Honestly, Just Patchwork

Think about it—how simple is LoRA?

Freeze the original model, insert two low-rank matrices, done. Since it came out in 2021, nobody has been able to replace it. Why?

Because it solves a real problem, not one made up out of thin air.

Last year, I used LoRA to fine-tune a 7B model for customer service intent classification. The trainable parameters were only 0.1% of full fine-tuning, and the accuracy drop was less than 1%! Plus, inference had zero latency—you just merge B and A back into the original weights, no extra feed-forward layer like Adapter.

Let me break down the math for you:

W = W0 + BA

W0 stays fixed. The total parameters in A and B are 2×d×r, and r is usually between 8 and 64. Take GPT-3 175B as an example—full fine-tuning requires 175B parameters. With LoRA, you only need less than 0.01%. Even with a small dataset, you won't overfit.

There are people who say, "LoRA's performance isn't as good as full fine-tuning." Every time I hear that, I get annoyed—try full fine-tuning with 10,000 samples and you'll overfit so badly it'll make you question your life choices.

LoRA is the truly battle-tested method, bar none.

---

Second, AdaLORA and DyLORA Are Clever, but in Practice You Might Not Need Them

AdaLORA uses SVD to dynamically allocate ranks—the idea is impressive.

And DyLORA? According to the paper, it can speed things up by 4–7 times, which sounds really tempting.

But guess what? I benchmarked AdaLORA, and the training time was nearly double that of vanilla LoRA, while the performance gain on most tasks was less than 1%. Less than 1%!

Why? Because the low-rank property of large models is already quite strong. A fixed rank is sufficient for most scenarios. The extra cost of computing importance scores hasn't been lowered enough yet.

Someone might say, "But DyLORA accelerates training, right?"

Yes—it speeds things up by truncating redundant ranks, but that only works if you start with a very large rank. If you're just fine-tuning a dialogue model, r=16 is sufficient, and vanilla LoRA is already the fastest.

Let me share a pitfall I fell into:

Last year, I jumped on the AdaLORA bandwagon to fine-tune a Chinese legal QA model. The convergence was incredibly unstable, with the training curve looking like a roller coaster. In the end, I switched back to LoRA, and it was done in half a day.

It's not that the new techniques are bad—it's a question of cost-effectiveness.

---

Third, QLoRA Is a Lifesaver for Grinders! Must-Try for Single-GPU Users

If this had existed two years earlier, I wouldn't have had to rent four A100s.

Do you know what it does? It pushes quantization to the extreme.

NF4 data type—minimizes information loss for normally distributed weights.
Double quantization—quantifies the quantization constants themselves down to 8-bit, further compressing storage.
Paged optimization—uses CPU as a fallback when GPU memory runs out.

A single 48GB GPU can fine-tune a 65B model—I've replicated this! Using QLoRA to fine-tune LLaMA-65B, originally requiring 300+GB of VRAM for full fine-tuning, I got it down to just over 40GB. Training a Guanaco in 48 hours that achieved 99.3% of ChatGPT's performance on the Vicuna benchmark—I saw this run with my own eyes.

Here are some practical tips:

Use the latest versions of transformers and bitsandbytes; NF4 quantization runs fine on a 3090.
The first time I tried it, I forgot to disable torch.compile, which caused gradient explosion after quantization.
Double quantization really pays off: it takes groups of 256 quantization constants and re-quantizes them to 8-bit, compressing the original 32-bit constants to 8 bits, saving even more memory.
I personally don't use paged optimization that often because my VRAM is large enough. But the idea of swapping memory for VRAM is perfect for people in fine-tuning competitions.

---

Fourth, Don't Overlook the Basics: Adapter and Full Fine-Tuning

I'm a LoRA fan now, but I admit—Adapter still shines in certain scenarios.

The idea from that 2019 paper is to freeze the original model and add two feed-forward sublayers (reduce dimension, then increase) to each Transformer layer. The advantage is flexibility—you can choose to add Adapters only in some layers, like using lower layers for general features and higher layers for task-specific features.

But Adapter has a fatal flaw: during inference, it adds extra layers of computation, increasing latency.

I did a comparison test: with the same number of parameters, Adapter is 15%–20% slower than LoRA. That's why most people have gravitated toward LoRA.

As for full fine-tuning, I only consider it when I have over a million data points. Some people say, "The full fine-tuning paradigm is mature." Let me be blunt—any company still relying on full fine-tuning for PT in 2024 must have money to burn.

Incremental pre-training is the right way to inject domain knowledge, but that's a whole different story.

---

Someone Asks: "So, How Should I Choose?"

Here's a straightforward plan for you:

Single GPU with less than 16GB: Use QLoRA, r=8, BitsAndBytes 4-bit quantization.
Single GPU with 48GB or more: Run LoRA, r=16–32, bf16 mixed precision. If your dataset exceeds 100K samples, you can try AdaLORA for the top-layer adaptation.
You have a cluster but a limited budget: Use full fine-tuning only for incremental pre-training; for the SFT stage, LoRA offers better cost-effectiveness.
You're in academia or chasing the bleeding edge: You can try new methods like SFF, but don't get your hopes up too high.

Finally, I want to say—

Fine-tuning large models has now entered a phase of "methodology overload." What really determines the outcome is not how novel the method is, but your data quality, cleaning process, and hyperparameter search. I've seen people use the most basic LoRA, spend three days cleaning data, and end up with a fine-tuned model that outperforms a whole pile of paper reproductions stacked with AdaLORA and DoRA.

Don't chase novelty. Chase the essence.

What is the essence of fine-tuning large models? It's using a small amount of data to guide the model toward a task, not retraining it from scratch.

All you need is a stable base, clean data, and a mindset that doesn't overcomplicate things.

Got it? Don't let those flashy names fool you—the truly good stuff is often the simplest.

大模型微调finetune方法 (English)

大模型微调finetune方法 (English)

Don't Be Fooled by Those Fancy Names! When It Comes to Fine-Tuning LLMs, These Few Actually Deliver

First, LoRA Is the Real Pillar! All Other Variants Are, Honestly, Just Patchwork

Second, AdaLORA and DyLORA Are Clever, but in Practice You Might Not Need Them

Third, QLoRA Is a Lifesaver for Grinders! Must-Try for Single-GPU Users

Fourth, Don't Overlook the Basics: Adapter and Full Fine-Tuning

Someone Asks: "So, How Should I Choose?"

Cael Lee

Ready to get started?