实测对比：LoRA显存降至1/5，但知识注入F1低8个点 (English)

Generated: 2026-06-22 04:54:05

---

Interviewing for a Large Model Position? This Set of LoRA Questions Will Instantly Reveal Whether You Truly Know Your Stuff or Just Reciting Lines!

Have you ever met someone like this? Their resume says "Proficient in LoRA fine-tuning," but when you dig into it, all they can muster is "saves GPU memory." Push a little deeper, and they can't even explain how the A and B matrices are initialized… I’ve interviewed way too many such candidates, and every time it makes me want to flip the table.

Eventually, I figured it out: instead of getting angry, I turned LoRA into a structured set of interview questions. Starting from basic project experience all the way to the math and engineering implementation—it basically gives you a thorough read of where someone truly stands.

So today, I’ll break down my question design and grading standards for you. If you're preparing for a large model role, this article is far more useful than memorizing a hundred eight‑legged essays (that’s Chinese slang for rote standard answers).

---

Level 1: Don't Give Me Theory—First Tell Me What You've Actually Done

Question 1: In what scenarios have you used LoRA? Which models have you fine-tuned?

It’s like asking "What are your hobbies?" on a first date—a warm‑up question that still eliminates a huge chunk of people.

You know what? I’ve had candidates start reciting right away: "LoRA is a parameter‑efficient fine‑tuning method that uses low‑rank decomposition…" — Stop right there! That’s not what I asked. I’m asking about your real‑world project experience!

If you’ve only run a demo script from LLaMA‑Factory, tweaked a config file to get llama‑7B running, does that count as "experience"? That’s called "following the documentation and typing commands"—anyone can do that!

What kind of answer am I looking for? One that clearly describes the business scenario: Was it for vertical domain knowledge injection? Building a role‑playing customer service agent? Training a code model? Which base model did you use—Qwen, ChatGLM, or LLaMA? What training framework—accelerate with your own training script, DeepSpeed, or a polished wrapper like LLaMA‑Factory?

See? This one question immediately reveals whether you've been doing real, hands‑on work or just playing with toys.

Question 2: Comparing LoRA with full fine‑tuning, what are the actual differences in memory, speed, and results?

This question is specifically designed to test whether you’ve run comparative experiments yourself!

What annoys me most when hiring is someone who says without hesitation, "LoRA’s results are about the same as full fine‑tuning." About the same? How much the same? On what kind of task? Under what data size? Give me specifics!

Let me share my actual measurements:

Memory: I tested on LLaMA‑2 7B. With full fine‑tuning using DeepSpeed ZeRO‑3, it barely fits on 4 A100 80G cards. Switching to LoRA (r=8, fine‑tuning only Q and V), a single A100 can run it! Memory consumption dropped to one‑fifth!
Speed: LoRA training is much faster. Why? Because the number of parameters that need gradients is orders of magnitude smaller! But note: the speedup is mainly in backpropagation; forward inference is roughly the same.
Results: Now for something a bit controversial—on small datasets (say a few thousand instruction pairs), LoRA can indeed match full fine‑tuning, sometimes even doing better (because it acts as a regularizer). But if you want to inject new knowledge—for example, stuffing a large amount of private domain documents into the model—LoRA’s ceiling is noticeably lower. Last year I ran an experiment: feeding the model 500,000 medical QA pairs. LoRA’s F1 score was a full 8 points lower than full fine‑tuning! Eight points, my friend!

Question 3: Which hyperparameters did you tune when training LoRA? What does each one do?

This question is a true watershed—people who only run scripts get stuck here.

Better candidates can name r, alpha, dropout, target_modules, etc. But what does someone with real experience tell you? Let me walk you through the pitfalls I’ve fallen into:

Rank r: I made a classic mistake early on—thinking bigger r is always better, so I went straight to 64. Guess what? The result was worse than r=8! I later realized it was overfitting. My current rule of thumb: simple tasks, r=4 works; medium tasks, r=8 or 16; complex tasks (e.g., teaching the model a new programming language) go up to r=32 or 64. And! Beyond a certain threshold, the gains diminish—don’t blindly increase it.
Scaling factor alpha: I usually set it to 2 times r. The ratio alpha/r determines the amplitude of LoRA’s output. Too small, and fine‑tuning has little effect; too large, and training becomes unstable. Simple rule: double, stable!
Dropout: This was my biggest pitfall! One time, with not much data, I was rushing and set dropout to 0. Result? The model overfit the training set, and validation loss went through the roof! Now, whenever the dataset has fewer than 10k samples, I always set it to at least 0.05.
Target modules: Many people only know to add LoRA to Q and V. Actually, K and O, and even the up/down projections in the FFN, are worth experimenting with. I tested this myself: adding LoRA only to Q and V vs. all modules. In a code‑generation scenario, BLEU went up by 3 points! Of course, training time also increased, but it was worth it!

---

Level 2: Theory Questions Reveal Whether You Truly Understand or Are Just Reciting Scripts

Question 4: What is the core principle of LoRA?

This looks like a freebie, but few people answer it with real depth.

The minimum is to say: freeze the pretrained weights W₀, introduce two low‑rank matrices A and B, and the forward pass becomes h = W₀x + BAx.

But what I really value is when I ask a follow‑up:

“How are A and B initialized? And why?”

Answering “A is initialized randomly, B is initialized to zero” is just the surface. Being able to explain that “this makes ΔW zero at the start of training, so the model starts fine‑tuning from the pretrained state”—that’s a bit better. If you can go further and say “if A were also zero‑initialized, the gradient wouldn’t flow, and the parameters would never update”—that’s true understanding! That’s real skill!

Question 5: What is a low‑rank matrix? Why can we assume that the update to a large model’s weights is low‑rank?

You need to explain three things clearly.

First, what is rank? Put simply, it’s the amount of truly independent information in a matrix. A 2000×2000 matrix with rank 10 has almost all its information redundant. Counterintuitive, isn’t it?

Second, low‑rank decomposition—splitting a large matrix into the product of two smaller matrices. For example, a d×d matrix becomes d×r and r×d, reducing parameters from d² to 2dr. When r is much smaller than d, the savings are huge!

Third, why does this work? Think about it—according to Aghajanyan et al.’s 2020 research, pretrained large models have an extremely low “intrinsic dimension.” The parameter updates needed during fine‑tuning move only within a very small subspace. Let me give you an analogy: the pretrained model is an all‑knowing master; fine‑tuning is just adjusting its speaking

实测对比：LoRA显存降至1/5，但知识注入F1低8个点 (English)

实测对比：LoRA显存降至1/5，但知识注入F1低8个点 (English)

Interviewing for a Large Model Position? This Set of LoRA Questions Will Instantly Reveal Whether You Truly Know Your Stuff or Just Reciting Lines!

Level 1: Don't Give Me Theory—First Tell Me What You've Actually Done

Level 2: Theory Questions Reveal Whether You Truly Understand or Are Just Reciting Scripts

Cael Lee

Ready to get started?