A Walkthrough of SFT Data Selection Methods for Large Models

Generated: 2026-06-21 04:32:48

---

I Spent Over Half a Year Falling into Every Pitfall of SFT Data Selection. Today, I'm Telling You the Unvarnished Truth.

Have you ever had that feeling?

You painstakingly feed a model a million data points, and it ends up acting like a kid who aced a written test but completely fails when faced with a real-life problem—answers everything with perfect seriousness, but gets it all wrong.

I've been there.

Over half a year ago, I too believed the "more data is better" myth. I thought: feed it more, and it'll get smarter, right? Result? Either the model just memorized the answers or it would completely fold when faced with a new scenario.

Eventually, I realized that sheer data volume matters far less than I thought. What matters is "which data to feed and which data to throw away."

So now you're probably wondering: how do you pick?

Alright, today I'm going to walk you through the major methods one by one—IFD, Superfiltering, MoDS, CaR, Nuggets, LESS. This isn't a recap of the papers; this is the bloody history of me actually using them in the trenches.

---

Let's start with classification.

If you compare data selection to dating, these methods fall into two categories.

The first type: You don't have "the one" in mind, you just want a "good person." This is called a Non-Target Method. There's no specific optimization goal, and it's good for building general-purpose models, for instance a consumer-facing product like a chatbot for the general public. Representatives: IFD, Superfiltering, MoDS, CaR.

The second type: You have an "ideal type" in your head, and you're looking for exactly that. This is called a Target Method, designed for a specific scenario. For example, when a product manager comes and says: "Help me turn this into an expert for customer service intent recognition." That's when you'll be grateful for Target Methods, because they really deliver. Representatives: Nuggets and LESS.

Can you guess which one I prefer now?

Target Methods.

Why? Because in reality, most business needs are "help me get really good at this one thing," not "build me an almighty, omniscient god."

Alright, let's take them one by one.

---

IFD: Don't Be Fooled by the Surface. Data Has "Gold Content" Too.

Think about it: if you compare a model to a student, does it need more questions like "1+1=2", or more brain-teasers like "Xiao Ming chased Xiao Hong for three years, but in the end, Xiao Hong married a sweet potato seller"?

It's the latter, of course.

That's the core idea of IFD: The higher the "instruction-following difficulty" of a piece of data, the greater the improvement it brings to the model.

How do you calculate this difficulty? The logic is simple:

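You take a model and calculate two scores. (The original paper first fine-tunes the base model briefly on a small sample of the data before scoring; scoring directly with a raw base model that hasn't been instruction-tuned is the cheaper shortcut.)

s(A|Q): The difficulty for the model to generate the answer given the instruction, measured as the average cross-entropy loss over the answer tokens (that is, log-perplexity).
s(A): The difficulty for the model to generate the same answer on its own, without the instruction.

Divide the first by the second: IFD = s(A|Q) / s(A). The higher the ratio, the less the instruction helps the model produce the answer. In other words, the sample is "hard on the instruction," rather than the answer text simply being unusual. That's the data you keep.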
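To make the ratio concrete, here is a minimal sketch of the arithmetic. It assumes you have already extracted per-token log-probabilities for the answer from your scoring model, once with the instruction in the prompt and once without. The function names and the sample numbers are hypothetical, made up for illustration only.

```python
def mean_nll(token_logprobs):
    """Average negative log-likelihood (cross-entropy) over the answer tokens."""
    if not token_logprobs:
        raise ValueError("the answer must contain at least one token")
    return -sum(token_logprobs) / len(token_logprobs)


def ifd_score(logprobs_given_instruction, logprobs_answer_only):
    """IFD = s(A|Q) / s(A), both computed over the same answer tokens."""
    s_a_given_q = mean_nll(logprobs_given_instruction)
    s_a = mean_nll(logprobs_answer_only)
    return s_a_given_q / s_a


# Hypothetical log-probabilities for a 4-token answer.
# Case 1: the instruction makes the answer much easier to predict.
easy = ifd_score([-0.5, -1.0, -0.5, -1.0], [-2.0, -3.0, -2.0, -3.0])

# Case 2: the instruction barely helps, so the ratio moves toward 1.
hard = ifd_score([-1.8, -2.7, -1.8, -2.7], [-2.0, -3.0, -2.0, -3.0])

print(f"easy sample IFD: {easy:.2f}")
print(f"hard sample IFD: {hard:.2f}")
```

The second sample gets the higher score, so it is the one IFD would rank first. A ratio at or above 1 means the instruction did not help at all, which is worth inspecting by hand before you keep the sample.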

Sounds perfect, right?

But I fell into a trap here. What if the answer itself is wrong? For example, the question is "Why does the sun rise in the east?" and the answer is some famous quote from the Four Books and Five Classics: completely irrelevant, but very fluent. Because the quote is fluent, s(A) is very low, and because the instruction does nothing to help predict it, s(A|Q) doesn't drop. The IFD score comes out artificially high, and dirty data sneaks in.

So before you use IFD, you must first clean up the basic quality. It's like eating fruit; you have to pick out and throw away the rotten ones first.

---

Superfiltering: Let an Elementary School Student Grade a College Student's Exam

This method is particularly interesting.

In the original paper, they scored data with a small GPT-2 model (124M parameters) instead of a large model. The results were surprisingly good: I ran a small model for filtering, then verified with a 7B large model, and the consistency reached about 85% in my tests.
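If you want to run the same sanity check on your own data, one simple way to put a number on "consistency" is the overlap between the top-scoring subsets picked by the two models. This is only a sketch under that assumption; it is not necessarily how the figure above was computed, and the function names and scores are hypothetical.

```python
def top_fraction(scores, fraction):
    """Indices of the highest-scoring samples, keeping the given fraction."""
    keep = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:keep])


def selection_overlap(small_scores, large_scores, fraction=0.1):
    """Share of the large model's picks that the small model also picked."""
    small_pick = top_fraction(small_scores, fraction)
    large_pick = top_fraction(large_scores, fraction)
    return len(small_pick & large_pick) / len(large_pick)


# Hypothetical scores for the same 10 samples from a small and a large model.
small = [0.90, 0.80, 0.30, 0.70, 0.20, 0.10, 0.60, 0.40, 0.50, 0.95]
large = [0.85, 0.90, 0.20, 0.40, 0.30, 0.10, 0.70, 0.50, 0.60, 0.99]

overlap = selection_overlap(small, large, fraction=0.4)
print(f"overlap of the top 40%: {overlap:.0%}")
```

Run it on a few thousand samples before committing to the small model for the full dataset; if the overlap is low on your data, the savings aren't worth it.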

If you have a million data points and you use GPT to filter them? Your wallet will cry. With Superfiltering, the cost gets slashed.

But you have to understand: Using a weak model to screen data will always miss the very best and let in some of the worst. The upside is—when you're on a tight budget, this method is a lifesaver.

---

MoDS and CaR: Like a Beauty Pageant with Several Rounds of Interviews

I think these two methodologies are very similar to selecting people: you need both quality and coverage.

MoDS works in three steps:

Step 1: Quality filter—if looks aren't up to par, eliminated immediately.

Step 2: Diversity filter—you can't pick all the beauties from Jiangnan; you need some from the north, the south, the borderlands. Use BERT to extract features, then use the K-Center-Greedy algorithm to ensure your candidate pool covers all kinds of scenarios.

Step 3: Necessity filter—pick out the most indispensable ones.
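The diversity step is the easiest one to picture in code. Below is a minimal pure-Python sketch of the K-Center-Greedy idea: start from one sample, then repeatedly add the sample that is farthest from everything picked so far. The 2-D points stand in for BERT embeddings, and the function name and starting index are my own assumptions, not taken from the MoDS code.

```python
import math


def k_center_greedy(points, k, first=0):
    """Greedily pick k indices, each time taking the point farthest
    from its nearest already-selected center."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [first]
    # Distance from every point to its nearest selected center so far.
    nearest = [math.dist(p, points[first]) for p in points]
    while len(selected) < k:
        farthest = max(range(len(points)), key=lambda i: nearest[i])
        selected.append(farthest)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[farthest]))
    return selected


# Toy "embeddings": two tight pairs and one outlier.
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 9.0)]
picked = k_center_greedy(embeddings, k=3)
print(picked)
```

With k=3 the picks land in three different regions instead of doubling up inside one tight pair, which is exactly the "don't pick all the beauties from Jiangnan" effect. On real data you would run this over the quality-filtered pool with a vectorized distance computation, since this loop is O(n·k) distance calls.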

I tried it, and the results were good. But each step requires parameter tuning, which can drive you crazy. If you're impatient, I'd suggest skipping it.

CaR is much smarter.

It simplifies the process directly: use Sentence-BERT to extract instruction features, apply PCA for dimensionality reduction, then K-Means clustering (I usually set k around 178, but you have to adjust it based on your data distribution), and then score within each cluster and pick the top.
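The last step, picking the top samples inside each cluster, can be sketched as follows. This assumes the cluster labels (from the Sentence-BERT, PCA, and K-Means pipeline) and the quality scores already exist; neither is computed here, and the function name and toy values are hypothetical.

```python
from collections import defaultdict


def top_per_cluster(cluster_ids, scores, per_cluster=1):
    """Keep the best-scoring sample indices inside every cluster."""
    by_cluster = defaultdict(list)
    for idx, (cid, score) in enumerate(zip(cluster_ids, scores)):
        by_cluster[cid].append((score, idx))
    kept = []
    for cid in sorted(by_cluster):
        # Highest score first, then cut to the per-cluster budget.
        best = sorted(by_cluster[cid], reverse=True)[:per_cluster]
        kept.extend(idx for _, idx in best)
    return sorted(kept)


# Toy data: 6 samples spread over 3 clusters.
cluster_ids = [0, 0, 1, 1, 1, 2]
scores = [0.2, 0.9, 0.5, 0.4, 0.8, 0.1]

kept = top_per_cluster(cluster_ids, scores, per_cluster=1)
print(kept)
```

Note that sample 5 survives with a score of only 0.1 because it is the sole member of its cluster. That is the trade-off of this design: coverage is guaranteed, so a quality floor still has to be enforced separately.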

Think about it: what's that like? It's like you divide

Cael Lee
