Large Language Models LLaMA, ChatGLM, BLOOM (English)
Generated: 2026-06-22 12:30:20
---
The First Time I Fine-Tuned a Large Model, I Almost Smashed the Machine
Last year, I was full of confidence—I got my hands on the original LLaMA-7B and wanted to play around with Chinese instructions.
And guess what? The sentences the model generated—even I couldn't understand them!
“男儿何不带吴钩”: it split that into four or five garbled tokens that read like alien script.
I stared at the screen for three seconds, and only one word came to mind: Damn!
Later I found out that LLaMA’s vocabulary is only 32k and was built mostly for Latin-alphabet languages. Chinese characters missing from that vocabulary fall back to raw UTF-8 bytes, so a single character can turn into as many as three tokens. No wonder the output was garbled.
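To put a rough number on that, here is a minimal sketch. It assumes the worst case, where every character is missing from the vocabulary and falls back to raw UTF-8 bytes. The helper name `byte_fallback_upper_bound` is made up for illustration, and a real tokenizer will produce fewer tokens wherever a character or merge is already in its vocabulary.

```python
def byte_fallback_upper_bound(text: str) -> int:
    """Worst-case token count if every character falls back to UTF-8 bytes."""
    return len(text.encode("utf-8"))


phrase = "男儿何不带吴钩"
print(len(phrase))                        # 7 characters
print(byte_fallback_upper_bound(phrase))  # 21 bytes: up to 3 tokens per character
```

Seven characters can cost up to 21 tokens, while an English word of the same length is usually one or two. That gap is what vocabulary expansion is meant to close.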
That’s when I realized: the first trap is waiting for you the moment you pick a base model.
So today, I’m going to lay out all those traps and facepalms one by one.
No beating around the bush, straight to the practical stuff—and I’ll throw in some complaints along the way.
---
You’re Probably Struggling with These Questions Too
- LLaMA, ChatGLM, BLOOM… which one should you pick for Chinese?
- Full fine-tuning vs. parameter-efficient fine-tuning—which one drives you crazier?
- LoRA, prompt tuning, prefix tuning… tons of methods, which one actually works in practice?
Don’t worry, each section below has my real test data and hard-learned lessons.
---
Choosing a Base Model? The First Trap is LLaMA
Back when I was choosing a base model, everyone online was hyping LLaMA, so I jumped on the bandwagon.
The Chinese output looked like encrypted gibberish—I was so mad I almost switched careers.
Then I switched to Chinese LLaMA, which expanded the vocabulary. For that same “男儿何不带吴钩”, the tokens dropped from 24 to 14.
The difference is obvious—those who know, know.
So my advice is simple:
Pure Chinese tasks? ChatGLM-6B works out of the box. Its vocabulary is 130k, it was trained natively on mixed Chinese and English text, and its replies sound like a normal person wrote them.
BLOOM has a 250k vocabulary, but its Chinese performance is just okay—it's a middle-of-the-road choice.
Original LLaMA? Forget it unless you expand the vocabulary.
But if you’re doing English or cross-lingual transfer, BLOOM’s multilingual ability is definitely worth it.
Speaking of architecture, ChatGLM is a Prefix Decoder—bidirectional attention on the input, unidirectional on the output.
LLaMA and BLOOM are pure Causal Decoders.
These two behave differently during generation—we’ll get back to that later.
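The difference between the two architectures comes down to the attention mask. Below is a small sketch in plain Python, not taken from any model's source code. The function names and the toy sequence length are assumptions for illustration. Row i marks which positions j the token at position i may attend to.

```python
def causal_mask(n):
    """Causal Decoder: every token sees only itself and earlier tokens."""
    return [[j <= i for j in range(n)] for i in range(n)]


def prefix_mask(n, prefix_len):
    """Prefix Decoder: the first prefix_len tokens (the input) see each other
    in both directions; the remaining tokens (the output) stay causal."""
    return [[j < prefix_len or j <= i for j in range(n)] for i in range(n)]


def show(mask):
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


print(show(causal_mask(4)))
# x...
# xx..
# xxx.
# xxxx
print()
print(show(prefix_mask(4, prefix_len=2)))
# xx..
# xx..
# xxx.
# xxxx
```

The only change is in the top-left block: with a prefix of length 2, the first input token can already see the second one.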
---
Full Fine-Tuning? You Think It’s Powerful, But It’s Easy to Crash
I was skeptical before: what’s all this fuss about parameter-efficient fine-tuning? Why not just go all-in with full fine-tuning?
So I took ChatGLM-6B and ran it on 20,000 Chinese instruction samples for 10 epochs.
Guess what? The training loss dropped beautifully, step after step—I almost jumped with joy.
But the validation loss started bouncing back after epoch 3, and by epoch 5 when I tested generation—
the model had memorized every formatting mistake in the training set and spewed out a bunch of special symbols.
That moment I understood: this is overfitting, and it’s a spectacular crash.
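One cheap guard against this kind of crash is early stopping on validation loss. The sketch below is a generic version of that idea, not what I actually ran. The loss values are made-up numbers shaped like the curve described above, and `early_stop_epoch` and `patience` are illustrative names.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once validation loss has failed to improve
    for `patience` epochs in a row.
    """
    best_loss = float("inf")
    best_epoch = 0
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)


# Hypothetical validation losses: improving until epoch 3, then rebounding.
val_losses = [1.90, 1.62, 1.51, 1.58, 1.66, 1.80]
print(early_stop_epoch(val_losses))  # (3, 5)
```

With these numbers the checkpoint from epoch 3 is the one worth keeping, and everything after epoch 5 is wasted compute.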
Later I switched to LoRA and prefix tuning, only adjusting the newly added parameters while freezing the pretrained weights.
The validation loss and training loss almost overlapped, no more bouncing.
See, unless you have millions of samples whose distribution differs sharply from the pretraining data, full fine-tuning on a small instruction set tends to overfit like this.
Now LoRA is my default: the original model parameters stay untouched, and I just swap adapter weights for different tasks. It’s so convenient it makes me cry.
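Here is a toy sketch of why swapping works, in plain Python with a 2x2 matrix. It follows the standard LoRA formulation, y = W x + (alpha / r) * B (A x), where W is frozen and only A and B are trained. The adapter names and all the numbers are invented for illustration.

```python
def matvec(m, x):
    return [sum(w * v for w, v in zip(row, x)) for row in m]


def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); W is frozen, only A and B train."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight (toy 2x2)
x = [3.0, 4.0]

# Two hypothetical rank-1 adapters sharing the same frozen W.
adapters = {
    "task_a": {"A": [[1.0, 1.0]], "B": [[0.0], [0.0]]},   # B = 0: untrained adapter
    "task_b": {"A": [[1.0, 1.0]], "B": [[0.5], [-0.5]]},
}

for name, ad in adapters.items():
    print(name, lora_forward(W, ad["A"], ad["B"], x, alpha=2, r=1))
# task_a [3.0, 4.0]
# task_b [10.0, -3.0]
```

An adapter with B set to zero leaves the base model's output unchanged, which is how LoRA training starts. Switching tasks means swapping the small A and B matrices, never W.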
---
Parameter-Efficient Methods, Tested One by One
LoRA: Stable as a Rock
I used the PEFT library for LoRA and ran it on all three base models.
Simple config: r=8, alpha=16, target modules Q and V.
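To see how small that config is, here is a back-of-the-envelope count of the trainable parameters. It assumes the LLaMA-7B shape (32 layers, hidden size 4096, square Q and V projections); the other two models have different shapes. In PEFT these settings correspond to the `r`, `lora_alpha`, and `target_modules` fields of `LoraConfig`, and the module names to target differ per model. The function name `lora_param_count` is just for this sketch.

```python
def lora_param_count(n_layers, d_in, d_out, r, n_target_modules):
    # Each adapted matrix gets A (r x d_in) and B (d_out x r).
    per_matrix = r * (d_in + d_out)
    return n_layers * n_target_modules * per_matrix


# Assumed LLaMA-7B shape; Q and V are the two target modules per layer.
count = lora_param_count(n_layers=32, d_in=4096, d_out=4096, r=8, n_target_modules=2)
scale = 16 / 8  # alpha / r, the factor applied to the low-rank update

print(count)  # 4194304
print(scale)  # 2.0
```

That is roughly 4.2 million trainable parameters against about 7 billion frozen ones, well under a tenth of a percent.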
Memory consumption dropped a lot compared to full fine-tuning. For a 7B model, full fine-tuning needs 14GB just for the weights in fp16, plus gradients and optimizer states, so a single 12GB card can’t hold it. LoRA ran on my 3080Ti 12GB.
The result? Generation fluency blew away my own hacked LLaMA-adapter.
One sentence: Close your eyes and pick LoRA—it almost never fails.
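The memory figures in this section can be checked with simple arithmetic. This sketch uses the common rule of thumb for mixed-precision Adam of about 16 bytes per trainable parameter (fp16 weights and gradients, plus an fp32 master copy and two fp32 optimizer moments), and it ignores activations. The 8-bit line is an assumption added for comparison, since the text above does not say which precision the frozen base model was loaded in.

```python
def gigabytes(n_params, bytes_per_param):
    return n_params * bytes_per_param / 10**9


N = 7 * 10**9  # 7B parameters

weights_fp16 = gigabytes(N, 2)    # weights alone in fp16
full_finetune = gigabytes(N, 16)  # rule of thumb: weights + grads + Adam states
weights_int8 = gigabytes(N, 1)    # assumption: base model loaded in 8-bit

print(weights_fp16)   # 14.0
print(full_finetune)  # 112.0
print(weights_int8)   # 7.0
```

Under these assumptions full fine-tuning is far beyond a single consumer card. A LoRA run pays the optimizer cost only on the few million adapter parameters, so the frozen base weights dominate the footprint.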
Prompt Tuning and Prefix Tuning: Depends on the Model
I tried prompt tuning on LLaMA—sentences were never smooth and often got stuck.
Prefix tuning was a little better, but still average.
But when I switched to BLOOM, both methods came alive, and prefix tuning was even better.
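For a sense of scale, the two methods train very different amounts of state. Prompt tuning learns virtual token embeddings at the input layer only, while prefix tuning learns key and value prefixes at every layer. The sketch below counts the parameters for a hypothetical 7B-scale shape (32 layers, hidden size 4096) with 20 virtual tokens. These numbers are assumptions, not my actual settings, and the count ignores the reparameterization MLP that some prefix tuning implementations add during training.

```python
def prompt_tuning_params(n_virtual_tokens, d_model):
    # One trainable embedding per virtual token, at the input layer only.
    return n_virtual_tokens * d_model


def prefix_tuning_params(n_virtual_tokens, d_model, n_layers):
    # A trainable key prefix and value prefix at every layer.
    return n_layers * 2 * n_virtual_tokens * d_model


print(prompt_tuning_params(20, 4096))      # 81920
print(prefix_tuning_params(20, 4096, 32))  # 5242880
```

Prefix tuning has 64 times more trainable parameters here and touches every layer.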
ChatGLM-6B also works, but there’s a quirk—replies tend to be short; even if you say “please answer in detail,” it
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.