Large Language Models (LLMs): Generate/Inference (English)
Generated: 2026-06-21 03:11:05
---
You may not believe it, but my hands were shaking when I was flipping through my notes — there were more than twenty records of model parameters going completely off the rails. Temperature set to 1.2? The model started babbling nonsense. Top‑P at 0.9? Still looping like a broken record. Beam Search enabled? It turned out even dumber than the baseline. I stepped into every single trap, and I stepped in them knowingly — I might as well have “Parameter Victim” tattooed on my forehead.
Today I’ll break it down for you: where these parameters come from, how to use them, and how they’ll trip you up. The code is on GitHub — run it once and you’ll get it.
---
First, a naked run: what happens when you add nothing?
I’ve always believed the fastest way to understand something is to watch it work with nothing on. So let’s start with the most primitive method — greedy decoding.
Take Llama as an example (same story for any other generative model). I feed in “say”, which gets tokenized into token 1827. The tokenizer puts a BOS token (id 1) in front of it, so what actually goes in is two tokens.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "say"
inputs = tokenizer(text, return_tensors="pt")
# What you see: {'input_ids': tensor([[ 1, 1827]]), 'attention_mask': tensor([[1, 1]])}
Then feed it straight to the model:
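with torch.no_grad():  # inference only, so skip gradient tracking
    # Call the model directly (not .forward) and keep the inputs on the model's device
    logits = model(**inputs.to(model.device)).logits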
print(logits.shape) # [1, 2, 32000] Whoa, 32,000 candidates!
The shape is [batch_size, sequence_length, vocab_size]. The logits at the last position are the model’s scores for the next token. Pick the highest‑scoring one:
next_token_id = torch.argmax(logits[:, -1, :], dim=-1)
print(next_token_id) # e.g. tensor([22172])
tokenizer.decode(next_token_id[0]) # 'hello'
See? The model thinks that after “say”, the most likely next token is “hello”. OK, append “hello”, predict the next token, and keep going until you hit EOS or the maximum length.
That’s greedy decoding — at each step you pick the token with the highest probability. Simple and brutal, but after a few uses you’ll notice: the output easily gets stuck in loops, like “I love to code. I love to code. I love to code…” — just like a scratched record.
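To make the loop concrete without downloading a 7B model, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the six-token vocabulary, the `BIGRAM_LOGITS` score table, and the helper names `toy_logits` and `greedy_decode` are assumptions standing in for a real model, not anything from Llama or transformers. The only thing it is meant to show is the append-and-repeat loop, and how always taking the argmax can trap you in a cycle.

```python
# Toy stand-in for a language model: a hypothetical table of next-token
# scores that depends only on the previous token (a bigram "model").
VOCAB = ["<eos>", "I", "love", "to", "code", "."]

# BIGRAM_LOGITS[prev] lists one made-up score per entry of VOCAB.
BIGRAM_LOGITS = {
    "I":    [0.0, 0.1, 3.0, 0.2, 0.1, 0.0],
    "love": [0.0, 0.1, 0.1, 2.5, 1.0, 0.0],
    "to":   [0.0, 0.1, 0.3, 0.1, 2.8, 0.0],
    "code": [0.5, 0.1, 0.1, 0.1, 0.1, 2.0],
    ".":    [1.5, 2.0, 0.1, 0.1, 0.1, 0.0],
}


def toy_logits(tokens):
    """Return next-token scores given the sequence so far (toy version)."""
    return BIGRAM_LOGITS[tokens[-1]]


def greedy_decode(prompt, max_new_tokens=12):
    """Repeatedly append the highest-scoring token until EOS or the length cap."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = toy_logits(tokens)
        # argmax over the vocabulary
        next_id = max(range(len(logits)), key=logits.__getitem__)
        next_token = VOCAB[next_id]
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens


print(" ".join(greedy_decode(["I"], max_new_tokens=9)))
# -> I love to code . I love to code .
```

In this toy table “<eos>” never has the top score, so greedy decoding just circles through the same five tokens until it runs out of budget. That is the scratched record in miniature.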
Back when I was building my first dialogue system, this thing wrecked me. After two turns it started repeating, and I thought the model was broken. Only later did I realize — I was missing the sampling step.
Temperature: making the model “wilder” or “tamer”
So how do you break out of that rigidity? The simplest way is to stop picking the token with the highest probability and instead sample from the probability distribution. But here’s the problem: if the original distribution is too extreme (one token at 0.99 and the rest adding up to 0.01), sampling behaves almost exactly like argmax. So we need something to “soften” the distribution: enter temperature.
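For reference, this is how temperature is usually written. The symbols are just labels picked for this note: z_i is the raw score (logit) of token i, T is the temperature, and p_i is the resulting probability.

```latex
p_i = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}
```

With T = 1 this is the plain softmax. As T shrinks toward 0 the probability mass piles onto the highest-scoring token, and as T grows the distribution drifts toward uniform.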
Look at the code and you’ll see it immediately:
logits = torch.tensor([[0.5, 1.2, -1.0, 0.1]]) # raw scores for four candidate tokens
# temperature = 1
probs = torch.softmax(logits, dim=-1)
# [0.2559, 0.5154, 0.0571, 0.1716]
# temperature = 0.5
probs_low = torch.softmax(logits / 0.5, dim=-1)
# [0.1800, 0.7301, 0.0090, 0.0809]
# temperature = 2
probs_high = torch.softmax(logits / 2, dim=-1)
# [0.2695, 0.3825, 0.1273, 0.2207]
See? The lower the temperature, the sharper the distribution (at 0.5 the second token jumps from 0.51 to 0.73); the higher the temperature, the flatter the distribution (at 2.0 the gap between the four tokens shrinks a lot).
The rule is clear: temperature near 0 → close to greedy, stable as a grandpa; temperature = 1.0 → original distribution, true to the model; temperature > 1.0 → randomness increases, the model starts to go wild.
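The softmax lines above only reshape the distribution; the sampling step itself still has to happen. Here is a minimal sketch of both together in plain Python, so it runs without torch. The function names `softmax_with_temperature` and `sample_token` are my own, chosen for illustration, and the logits are the same four made-up scores used above.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply a stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token index at random, weighted by the tempered probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [0.5, 1.2, -1.0, 0.1]  # same four candidate scores as above
rng = random.Random(0)          # fixed seed so repeated runs agree

for t in (0.5, 1.0, 2.0):
    draws = [sample_token(logits, t, rng) for _ in range(1000)]
    counts = [draws.count(i) for i in range(len(logits))]
    print(f"temperature={t}: counts per token = {counts}")
```

Run it and compare the counts across the three temperatures: at 0.5 the second token should dominate the draws, while at 2.0 the low-scoring tokens get picked noticeably more often.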
Speaking of which, I have to confess a pitfall I ran into. Once I was doing story generation and thought, “this needs more creativity,” so I cranked the temperature up to 1.5. Guess what? The model started describing an alien invasion of Earth, with French lyrics thrown in the middle… I was cringing so hard I wanted to
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.