Full-Parameter Fine-Tuning of a 7B Model Takes 80GB of VRAM, So How Does LoRA Get By on Just 5GB?

Generated: 2026-06-22 09:45:30

---


Brothers, fine-tuning large models? I'm totally sold on it!

Let me start with a story.

Last year, I bought a brand new RTX 3090 with 24GB of VRAM. I thought, "Hell yeah, now I can finally play with large models!" But guess what? Full-parameter fine-tuning of a 7B model? Forget it! The VRAM requirements were a joke.

You know how I felt? It's like you saved up half a year's salary to buy a sports car, only to find out the gas costs more than the car itself! That kind of despair, you feel me?

But then? LoRA came along!

It was like a lifesaver, letting a broke guy like me play with large models. Today, I'm going to break down the principle of this lifesaver for you, and also tell you about all the pitfalls that made me fall flat on my face.

---

The Nightmare for Casual Players: How Expensive Is Full-Parameter Fine-Tuning?

Think about it: for a 7B model, just loading the parameters into VRAM takes 14GB in half precision. Okay, that's bearable. But training? You also need to store gradients, optimizer states...

Let me do the math for you, don't blink:

In half precision, each parameter takes 2 bytes, so 1B parameters take 2GB. The 7B model itself is 14GB, the gradients are another 14GB, and the AdamW optimizer has to store momentum and variance for every parameter. If the optimizer keeps those in float32, that's 2 × 7B × 4 bytes = 56GB. Total: 84GB, and that's before counting activations! What does that mean? You'd barely manage with two A100 80GB cards. Single card? Dream on!
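If you want to redo this math for other model sizes, here's a rough sketch in Python. The function name and its arguments are my own for illustration, and it only counts weights, gradients, and the two AdamW states; activations and framework overhead are left out, just like in the numbers above.

```python
# Back-of-the-envelope VRAM estimate for full fine-tuning.
# Counts weights + gradients + AdamW states only (no activations, no overhead).

GB = 1e9  # decimal gigabytes, matching "1B parameters in half precision = 2GB"


def full_finetune_vram_gb(n_params, weight_bytes=2, grad_bytes=2, optim_state_bytes=4):
    """Return a per-component VRAM breakdown in GB (hypothetical helper)."""
    weights = n_params * weight_bytes / GB
    grads = n_params * grad_bytes / GB
    # AdamW keeps two extra values per parameter: momentum and variance.
    optimizer = 2 * n_params * optim_state_bytes / GB
    return {
        "weights": weights,
        "gradients": grads,
        "optimizer": optimizer,
        "total": weights + grads + optimizer,
    }


estimate = full_finetune_vram_gb(7e9)
for name, gb in estimate.items():
    print(f"{name:>10}: {gb:.0f} GB")
```

Passing `optim_state_bytes=2` gives the case where the optimizer states are also kept in half precision.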

When I saw that number, I almost smashed my keyboard. Over 80GB, brothers! Where am I supposed to get that much VRAM?

---

LoRA's Trick: Low-Rank Decomposition

Before going further, I need to mention a paper. Aghajanyan et al. found a very counterintuitive phenomenon: pretrained models actually have a low "intrinsic dimension."

What does that mean?

Think about it: GPT-3 with 175 billion parameters, why can it be fine-tuned effectively with just a few thousand samples? Because there's massive redundancy among the parameters! The actual degrees of freedom that need adjustment might only be a few thousand.

Isn't that like "you think you bought a mountain, but you only need to dig a hole"?

The authors of LoRA were inspired by this. They thought: since the parameter update space is essentially low-rank, why not use two small matrices to represent this update?

Specifically, for a pretrained weight matrix W₀ (say d×d), instead of directly training ΔW (also d×d), they decompose it into the product of two small matrices:

ΔW = B × A

where B is d×r, and A is r×d. r is much smaller than d, typically 4, 8, or 16.

The number of parameters drops from d² to 2dr.

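Take d=4096, r=8: the parameter count for that one matrix goes from 16,777,216 (about 16.77 million) to 65,536, a 256x reduction!

How much VRAM does that save? The frozen weights still take their 14GB, but gradients and optimizer states are now only needed for B and A, so most of the 70GB of training overhead from the math above disappears.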
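The d² versus 2dr comparison is easy to check yourself. A tiny sketch, where `lora_param_counts` is just a name I made up for illustration:

```python
def lora_param_counts(d, r):
    """Trainable parameters for one d x d matrix: full update vs. LoRA (B: d x r, A: r x d)."""
    full = d * d        # training delta_W directly
    lora = 2 * d * r    # training B and A instead
    return full, lora, full / lora


# Compare the typical ranks mentioned above for d = 4096.
for r in (4, 8, 16):
    full, lora, ratio = lora_param_counts(4096, r)
    print(f"r={r:>2}: full={full:,}  lora={lora:,}  reduction={ratio:.0f}x")
```

The ratio simplifies to d / (2r), so doubling r halves the savings.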

---

Pitfalls in Practice: Almost Drove Me Crazy

While we're here, I need to point out a pitfall I fell into, one that almost ruined my project.

The LoRA paper says that during initialization, B is initialized as a zero matrix, and A is randomly initialized. I was lazy and initialized both matrices randomly.

What happened? The training loss just wouldn't go down!

I struggled for two whole days, re-read the paper, and finally found the problem.

Why does one have to be zero?

Because if both are randomly initialized, ΔW is nonzero from the start, which means the model has already been modified before training even begins. That has a huge impact on the initial performance of the pretrained model, especially in the first few steps, where the loss spikes.

Remember: initialize B as zero and A randomly. Or the other way around; just make sure the product BA is zero at the start.
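Here's a small NumPy sketch of the two initializations side by side. The sizes, the 0.01 scale, and the random stand-in for W₀ are all my own assumptions for illustration, not values from the paper or from my project.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8  # small illustrative sizes

W0 = rng.normal(size=(d, d))  # stands in for a pretrained weight matrix
x = rng.normal(size=(d,))     # a dummy input

# Recommended init: A random, B zero, so delta_W = B @ A is exactly zero.
A = rng.normal(scale=0.01, size=(r, d))
B = np.zeros((d, r))
delta_good = B @ A

# The lazy variant: both random, so delta_W is already nonzero before training.
B_bad = rng.normal(scale=0.01, size=(d, r))
delta_bad = B_bad @ A

# With B = 0 the model output is untouched at step 0.
print(np.allclose((W0 + delta_good) @ x, W0 @ x))
# With both random, the pretrained output has already been perturbed.
print(np.abs((W0 + delta_bad) @ x - W0 @ x).max())
```

The first print confirms the output matches the pretrained model exactly; the second shows how far the lazy init has already pushed it.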

---

Training Process: Pure Bliss

During training, the pretrained weights W₀ are completely frozen and don't participate in gradient computation. Only the two small matrices B and A are updated.

In forward propagation, the computation becomes:

h = W₀x + BAx

In backpropagation, gradients only flow to B and A; gradients for W₀ are not computed at all.
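To make "only B and A get updated" concrete, here's a toy single training step in NumPy with the gradients written out by hand. Everything here is an assumption for illustration: the squared-error loss, the made-up target y, the sizes, and the step size (picked so that this toy step exactly halves the residual). It is not how a real training loop looks, just the same math in miniature.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 32, 4                     # illustrative sizes

W0 = rng.normal(size=(d, d))     # frozen pretrained weight
A = rng.normal(size=(r, d))      # trainable, random init
B = np.zeros((d, r))             # trainable, zero init
x = rng.normal(size=(d,))
y = rng.normal(size=(d,))        # made-up regression target


def forward(W0, A, B, x):
    return W0 @ x + B @ (A @ x)  # h = W0 x + B A x


def loss_fn(h, y):
    return 0.5 * np.sum((h - y) ** 2)


W0_before = W0.copy()
h = forward(W0, A, B, x)
loss_before = loss_fn(h, y)

g = h - y                        # dL/dh for the squared-error loss
grad_B = np.outer(g, A @ x)      # dL/dB = g (A x)^T
grad_A = np.outer(B.T @ g, x)    # dL/dA = B^T g x^T (all zeros on step one, since B = 0)

Ax = A @ x
lr = 0.5 / (Ax @ Ax)             # toy step size that halves the residual
B = B - lr * grad_B              # only B and A are updated;
A = A - lr * grad_A              # no gradient is ever computed for W0

loss_after = loss_fn(forward(W0, A, B, x), y)
print(loss_before, loss_after)
```

Notice that on the very first step only B actually moves: because B starts at zero, the gradient for A is zero too, and A starts learning once B is nonzero.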

Inference is even better:

You can directly merge the weights:

W_new = W₀ + BA

Then discard B and A and use W_new for inference. This adds zero inference latency!

This is way better than Adapter, which requires running two extra small networks during inference, adding real latency.
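The merge is easy to sanity-check numerically. A minimal NumPy sketch, assuming random matrices in place of real trained weights, and leaving out the scaling factor that the LoRA paper applies to BA since the formulas above don't use it:

```python
import numpy as np

rng = np.random.default_rng(42)
d, r = 64, 8                   # illustrative sizes

W0 = rng.normal(size=(d, d))   # frozen pretrained weight
A = rng.normal(size=(r, d))    # pretend A and B came out of training
B = rng.normal(size=(d, r))
x = rng.normal(size=(d,))

# Training-time path: base output plus the low-rank branch.
h_train = W0 @ x + B @ (A @ x)

# Inference-time path: fold the update into a single matrix once.
W_new = W0 + B @ A
h_merged = W_new @ x

print(np.allclose(h_train, h_merged))  # same output from both paths
print(W_new.shape == W0.shape)         # same shape, so same inference cost
```

Since W_new has exactly the same shape as W₀, the merged model runs one matrix multiply per layer, same as before fine-tuning.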

---

VRAM Usage Comparison: A Picture Is Worth a Thousand Words

I've compiled actual test data to give you a visual:

| Method | Minimum VRAM for 7B model |
| --- | --- |
| Full fine-tuning (AMP) | 84GB |
| Full fine-tuning (FP16) | 56GB |
| Freeze | 20GB |
| LoRA | 16GB |
| QLoRA (int4) | 6GB |

Note: "Minimum" in this table assumes batch size 1 and limited sequence length. If you increase batch size to 4 or 8, activation memory grows roughly linearly with it, so budget extra VRAM on top of these numbers.

My personal experience:

A single 16GB T4: can only run int4 QLoRA on 6B/7B models, with sequence length capped at 512
A single 32GB V100: can barely run LoRA on 6B/7B models, batch size 1
A single 80GB A800: Qwen1.5 72B int4 QLoRA works fine, Baichuan13B LoRA also works

---

Comparison with Other Methods: Why Does LoRA Win?

There are several other efficient fine-tuning methods out there. I've tried them all, here's my take:

Adapter Tuning: Adds two small adapter structures in each Transformer layer. Works okay, but adds inference latency, not widely used in industry.

Prefix Tuning: Adds virtual tokens at the beginning of the input. Problem: these virtual tokens eat into the input sequence length, and training is unstable.

P-tuning v2: Adds prompt tokens at every layer. Better than Prefix Tuning, but prone to catastrophic forgetting—after fine-tuning on a new task, performance on old tasks drops significantly.

In contrast, LoRA's advantages are clear:

No inference latency
Stable training
Controllable parameter count (by adjusting r)

I ran several discriminative tasks on BERT, and LoRA's performance was comparable to full fine-tuning, sometimes even better. My guess is that full fine-tuning overfits more easily, while LoRA's low-rank constraint acts as a built-in regularizer.

---

Final Ramblings: A Takeaway for You

This article is the first in the LoRA series, covering the core principle. Next time, I'll talk about QLoRA, which adds quantization on top of LoRA to further reduce VRAM requirements.

If you're just starting out, I recommend practicing with a 7B model: set r=8, use a learning rate of 2e-4, and make the batch size as large as your VRAM allows. Get the pipeline running first, then slowly tune the parameters.

Finally, a parting thought:

Fine-tuning isn't about reinventing the wheel; it's about putting your mark on it.

If you have questions, feel free to leave a comment. I'll reply when I see them. See you next time!


Cael Lee