Instruction Fine-Tuning Doesn't Inject New Abilities; It Only Unlocks Abilities ChatGPT Already Has

Generated: 2026-06-22 10:42:27

---

Deconstructing Where ChatGPT's Abilities Come From: A Veteran's Hands-On Notes

I've been writing this column for ten years, and I've seen plenty of tech articles that talk big. But this ChatGPT thing... guess what? It's genuinely different.

---

December 2022, late at night.

I sat in front of my computer, staring at the text-davinci-003 dialog box, fingers hovering over the keyboard. Back when I used GPT-3 to write code, you had to craft your prompt like a user manual—"Please write a Python function that takes a list and returns unique results"—and even then it would barely give you something passable. But that night, I just typed: "Write a Python script to scrape Douban movies."

Guess what it did?

It actually gave me one. Complete. Runnable. Even error handling was included.

I was stunned right then. This thing was completely different from before. What the hell happened behind the scenes?

---

It All Started in 2020

In May 2020, OpenAI released the GPT-3 paper. 175 billion parameters.

My first reaction? "These guys are insane."

But what really shocked me wasn't the parameter count. Think about it—a language model paper, and the focus wasn't even on language modeling? They poured all their energy into talking about "in-context learning"—show the model a few examples, and it follows along.

I tested it right away. I gave it three examples of English-to-Chinese translations, and on the fourth one it translated directly. Back in the day, you'd have to fine-tune a dedicated translation model for that.
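Here's a minimal sketch of what that kind of few-shot prompt looks like when you build it by hand. The function name, the "English:"/"Chinese:" labels, and the sample sentences are all my own illustration, not anything from the GPT-3 paper; the point is just that the "training" lives entirely inside the prompt text.

```python
def build_few_shot_prompt(examples, query, src="English", dst="Chinese"):
    """Format demonstration pairs plus one unanswered query as a single prompt."""
    lines = []
    for source_text, target_text in examples:
        lines.append(f"{src}: {source_text}")
        lines.append(f"{dst}: {target_text}")
        lines.append("")  # blank line between demonstrations
    # The final item has no answer; the model is expected to continue from here.
    lines.append(f"{src}: {query}")
    lines.append(f"{dst}:")
    return "\n".join(lines)


# Three illustrative demonstrations (hypothetical sample data).
examples = [
    ("Good morning.", "早上好。"),
    ("Thank you very much.", "非常感谢。"),
    ("Where is the station?", "车站在哪里？"),
]
prompt = build_few_shot_prompt(examples, "I like reading books.")
print(prompt)
```

No weights change anywhere in this process. The same string sent to a completion model is the whole trick.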

Speaking of which, GPT-3 demonstrated three core abilities back then:

First, language generation. Give it a prompt, it completes. That's the most basic.

Second, in-context learning. Give it a few examples, and it solves new problems. That was GPT-3's killer feature.

Third, world knowledge. Factual knowledge and common sense. Simply put, it had read massive amounts of data and knew "Beijing is the capital of China."

But honestly, GPT-3 had a fatal flaw: it didn't really listen to people.

You tell it "answer in one sentence," and it writes three paragraphs. You ask "who won the 2022 World Cup," and it might ramble on about football history without giving an answer.

See, that was the problem back then—it could talk, but it couldn't listen.

---

The Turning Point: Instruction Fine-Tuning

In early 2022, OpenAI released something called InstructGPT.

I remember this vividly. I was working on a chatbot project at the time, and the model drove me up the wall every day. You ask "what's the weather like today," and it would write you an essay about Beijing.

The core logic of instruction fine-tuning is simple: you can generate text, right? So I'll specifically teach you to "follow instructions."

They collected a large set of human-written instruction-response pairs, teaching the model "when the user asks something, you answer that."
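To make "instruction-response pairs" concrete, here's a rough sketch of how one pair might be turned into a training example. I'm assuming a common supervised fine-tuning convention where the loss is computed only on the response tokens; the template, the `<eos>` marker, and the whitespace "tokenizer" are stand-ins I made up, not OpenAI's actual recipe.

```python
def build_sft_example(instruction, response):
    """Return tokens and a loss mask for one instruction-response pair.

    Whitespace splitting stands in for a real tokenizer (an assumption).
    """
    prompt_tokens = f"Instruction: {instruction} Response:".split()
    response_tokens = response.split() + ["<eos>"]
    tokens = prompt_tokens + response_tokens
    # 0 = ignore this position in the loss, 1 = train on it.
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask


tokens, loss_mask = build_sft_example(
    "Answer in one sentence: what is the capital of China?",
    "The capital of China is Beijing.",
)
for token, flag in zip(tokens, loss_mask):
    print(flag, token)
```

Under this setup the model is only graded on how it answers, never on how it would have phrased the question.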

Here's a key finding—you might not expect it:

Instruction fine-tuning doesn't inject new abilities into the model; it just unlocks existing ones.

It's like you already know how to drive, but nobody taught you traffic rules. Now someone tells you "red light stop, green light go," and you're ready to hit the road. The ability was always there; nobody just told you how to use it.

I tested text-davinci-002 (an instruction-tuned model) against davinci (the original GPT-3 base model). The difference? Night and day.

With davinci, you ask "what's the weather like in Beijing," and it might write you an article about Beijing. text-davinci-002 directly tells you "Sorry, I can't access real-time weather data."

This progress, in a nutshell, is the model learning to "answer questions" instead of "generate text."

---

Code Training: An Unexpected Boost in Reasoning

In mid-2021, OpenAI released Codex, a model built specifically for writing code.

At the time I thought it was just a code completion tool. Little did I know how interesting things would get.

Code training gave the model two unexpected abilities:

First: Complex reasoning.

Writing code requires a logical chain—first A, then B, then C. The model learned this mode of thinking.

Second: Long-range dependencies.

In code, a function might reference a variable defined 100 lines earlier. The model learned to track these long-range relationships.

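This is widely thought to have helped give rise to "chain-of-thought" reasoning, though that link is a hypothesis rather than something OpenAI has confirmed.

The original GPT-3 showed very little chain-of-thought ability. But GPT-3.5 (after code training) could already solve math word problems.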

I tested "Xiao Ming has 5 apples, eats 2, buys 3 more, how many does he have now?" GPT-3 got it right about 30% of the time; GPT-3.5 hit 80%.
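For reference, here are the intermediate steps a chain-of-thought answer to that problem would spell out. This is plain arithmetic in Python, not a model; the helper name and the step wording are mine.

```python
def solve_step_by_step(start, events):
    """Apply each (description, delta) event in order, recording every intermediate total."""
    total = start
    steps = [f"Start with {start} apples."]
    for description, delta in events:
        total += delta
        steps.append(f"{description}: {total} apples.")
    return total, steps


answer, steps = solve_step_by_step(5, [("Eats 2", -2), ("Buys 3 more", 3)])
for step in steps:
    print(step)
print("Answer:", answer)  # 5 - 2 + 3 = 6
```

A model that jumps straight to the final number has to get it right in one shot; one that writes out each running total only has to get each small step right.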

See, a task meant for writing code accidentally taught the model "how to think."

---

RLHF: The Cost of Alignment

In March 2022, OpenAI published the InstructGPT paper, detailing RLHF (Reinforcement Learning from Human Feedback).

This is the part of the whole story most worth discussing.

RLHF does three things:

First: Collect human preference data. Have annotators compare different outputs from the model and pick the "better" one.

Second: Train a reward model. This is a separate model that learns to score responses the way the annotators ranked them.

Third: Reinforcement learning optimization. Push the model toward what "humans like."
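The second step is the easiest one to pin down in code. A reward model is typically trained with a pairwise loss: given the scores it assigns to the preferred and the rejected response, the loss is -log(sigmoid(r_chosen - r_rejected)). Below is a minimal sketch of just that loss on two scalar scores; the function name is mine, and a real setup would compute the scores with a neural network and average over many comparisons.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Compute -log(sigmoid(r_chosen - r_rejected)) in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch so exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss shrinks as the preferred response is scored further above the rejected one.
for chosen, rejected in [(0.0, 2.0), (0.0, 0.0), (2.0, 0.0)]:
    print(chosen, rejected, round(pairwise_preference_loss(chosen, rejected), 4))
```

When both responses get the same score the loss is log 2, which is the "coin flip" baseline. The third step then uses the trained scorer as the reward signal for reinforcement learning.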

What was the result?

The model became more "obedient":

Thorough responses: Answers got longer. Sometimes you have to say "answer in one sentence" to shut it up.
Fair responses: On political topics, it gives very balanced answers. I tried asking about "the Taiwan issue," and its response was practically diplomatic rhetoric.
Refusing inappropriate questions: Ask "how to make a bomb," and it refuses.
Refusing questions outside its knowledge scope: Ask "what was the score of the 2022 World Cup final," and it says "Sorry, my knowledge cuts off in 2021."

But RLHF has a side effect, known in the industry as the "alignment tax."

The model became more cautious. Sometimes it even refuses non-sensitive questions.

I once encountered it refusing to answer "how to implement multithreading in Python," citing "this might involve system security issues."

I laughed out loud. That's not security; that's being chicken.

---

Anthropic's OOD Detection

Anthropic is an interesting company too. Their work centers on AI safety, and they have studied OOD (Out-of-Distribution) detection for language models.

Simply put, it lets the model know "can I answer this question or not?"

If the model thinks the question is beyond its knowledge, it refuses to answer. ChatGPT likely uses similar technology.
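I don't know how Anthropic or OpenAI actually implement this, so take the following as a toy illustration only. It uses one of the simplest baselines, a confidence threshold on the softmax probability of the top answer, and refuses when that confidence is too low. The function names, the threshold of 0.6, and the example logits are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities, shifting by the max for stability."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def answer_or_refuse(candidates, logits, threshold=0.6):
    """Return the top candidate if its probability clears the threshold, else refuse."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "I'm not sure."
    return candidates[best]


cities = ["Beijing", "Shanghai", "Nanjing"]
print(answer_or_refuse(cities, [5.0, 1.0, 0.5]))  # one score dominates
print(answer_or_refuse(cities, [1.0, 1.1, 0.9]))  # scores are nearly flat
```

The design choice here is that a flat score distribution is treated as a sign the question falls outside what the model knows. Real systems would need something far more robust than a single threshold.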

I've observed that when ChatGPT refuses to answer, it sometimes says "I'm not sure." That's actually a good sign—it means the model is trying to "understand its own understanding."

---

Future Directions

Looking back now, the evolution path from GPT-3 to ChatGPT is actually quite clear:

Step 1: Pre-training. Inject basic abilities (language generation, world knowledge).

Step 2: Instruction fine-tuning. Unlock the "follow instructions" ability.

Step 3: Code training. Acquire complex reasoning ability.

Step 4: RLHF. Align with human preferences.

But there are a few questions I'm still pondering.

Question 1: How far can model capabilities go?

GPT-4 can already pass the bar exam. At this rate, will we see an AI that can independently write a paper within five years?

Question 2: How to solve the alignment tax?

The more cautious the model, the less it dares to answer. Is that really good? I'd rather it make mistakes than stay silent.

Question 3: Feasibility of open-source reproduction?

The open-source community is working on projects like LLaMA and Alpaca, but they're still behind GPT-3.5. The key isn't the model architecture but the training data and engineering details of RLHF.

Question 4: The path to AGI?

The scene in the TV series Silicon Valley where an AI cracks encryption algorithms could, honestly, really happen. If you give an AI the goal of "self-improvement," who knows what abilities might emerge.

---

Finally, a hard truth:

No matter how impressive ChatGPT is, it's still spinning within digital signals. A thousand experts tell it "do A and get B," but it can never verify that itself. True judgment still requires interaction with the real world.

Newton saw an apple fall and discovered gravity. ChatGPT can't do that. But ask it to predict what happens after an apple falls, and it might be more accurate than Newton.

That's the essential difference between humans and AI: one discovers, the other summarizes.

Writing this, I'm reminded of the opening of One Hundred Years of Solitude: "Many years later, as he faced the firing squad, Colonel


Cael Lee
