The Initial Image: An Overlooked Control Mechanism in Diffusion Models
Generated: 2026-06-20 20:19:18
---
Lean in. Let me ask you something.
You’ve definitely tried this before—same prompt, same model, identical step count and CFG scale, not a single button touched differently. And what happened? Two images generated, and they turned out completely different.
One put the cat neatly on the chair. The other crammed the cat into a corner like it was going through some kind of existential crisis. One image was as cold as a freezer, the other warm enough to make you want to brew a cup of tea.
I used to treat this as something close to superstition: image generation was all about luck. If fortune smiled on you, you’d get a masterpiece in one shot. If it didn’t, you’d stay up till dawn and end up with nothing but rejects.
Then I spent an entire weekend chewing through a paper that almost nobody online had read. I ran the code over and over until I finally found where the problem was—
That initial noise everyone treats as an empty parking spot? It’s not empty at all. It’s a seasoned driver with both hands on the wheel, and the moment it gets in, it decides where the car goes. Fight against its nature and the image crashes like a pile-up on the highway. Nudge it in the direction it wants to go, and it follows your lead faster than anything.
Today, I’m going to pull back the curtain and show you every card on the table.
1. What’s really hiding inside that so‑called “random noise”?
2. Why does changing the seed feel like shifting to a different universe?
3. When an image breaks, can I “operate” on that noise?
4. What are those convoluted “inversion” techniques actually trying to defy?
Come on. Buckle up.
---
I. You Thought It Was Empty? It Already Contains a Complete “Sketch”
Let me give you the conclusion first, so you don’t feel lost: that noise is anything but empty. The amount of information packed inside it will blow your mind.
The first time I reproduced the paper’s experiment, my hands were trembling.
It was simple. I fixed a single random initial noise (just that latent tensor, 4 channels of 64×64 for a 512×512 Stable Diffusion image), and ran it with five different prompts. One said “a cat and a chair,” another “a dog and a sofa,” another “a bird and a table”… I kept swapping the nouns.
And guess what?
When I looked across the whole row of results, I was stunned.
With the same initial noise, no matter how the prompt changed, the overall composition and light distribution of the generated images stayed remarkably similar. The cat was in the upper left; swap to a dog, still upper left; swap to a bird, still squatting in the upper left. Even the shadowed and bright areas roughly matched.
Then I moved on to a second row with a different initial noise and ran the same five prompts again. This time, everything shifted to the lower right, and the color tone turned a dark green.
The only thought in my head was: this is what they mean when they say “random noise isn’t random.”
It fundamentally determines your composition, your lighting, the center of gravity of the image, and even where each object most wants to land. Put plainly, the initial noise is like an invisible sketch already filled with shape tendencies and positional preferences.
You think you’re painting? You’re really just coloring in a sketch.
Later I went back to the original paper. They call it the “generation tendency of the initial image”: certain content simply prefers to grow in certain regions. Even though the noise is drawn from a standard Gaussian, this tendency shows up consistently.
In everyday language, it’s like buying a rough stone for carving. Some stones are suited for carving a horse, with flowing mane; others are right for a dragon, with coiled claws. If you insist on carving a dragon from a horse‑stone, you can—but you’ll have to chisel away a lot of material, and the result will always look a bit off.
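The fixed-noise experiment from this section can be sketched roughly as follows. This is my own minimal reconstruction, not the paper's code. It assumes the Stable Diffusion 1.x latent layout (4 channels, 8x spatial downsampling) and the `latents` argument of the Diffusers `StableDiffusionPipeline`; `model_id` is whatever SD 1.5 checkpoint you have access to, and the function names are made up for illustration. The noise is drawn with NumPy, so the seed numbers will not match the seeds of any web UI.

```python
import numpy as np


def make_initial_latent(seed: int, height: int = 512, width: int = 512) -> np.ndarray:
    """Draw one standard-Gaussian latent for a height x width image.

    Assumes the Stable Diffusion 1.x layout: 4 channels, 8x downsampling.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((1, 4, height // 8, width // 8)).astype(np.float32)


def run_prompts_on_fixed_noise(model_id: str, prompts: list[str], seed: int):
    """Generate one image per prompt, all starting from the same latent.

    The heavy imports live inside the function so the helper above stays
    usable on a machine without torch or diffusers installed.
    """
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(model_id)
    pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

    noise = make_initial_latent(seed)
    images = []
    for prompt in prompts:
        # Rebuild the tensor every time so each prompt sees identical noise.
        latents = torch.from_numpy(noise).to(device=pipe.device, dtype=pipe.unet.dtype)
        # Passing `latents` replaces the pipeline's own random draw.
        images.append(pipe(prompt, latents=latents, num_inference_steps=30).images[0])
    return images
```

Calling `run_prompts_on_fixed_noise(model_id, ["a cat and a chair", "a dog and a sofa", "a bird and a table"], seed=7)` gives one row of the grid; changing `seed` gives the next row.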
---
II. The Truth About Broken Images: Your Prompt and the Initial Noise Are Fighting
When I think about this, my heart still aches.
In the old days, if I generated a hundred images, at least a dozen would come out bizarre. A cat floating in mid‑air, a figure blurred into a blob, or colors that looked poisoned—purple patches, green patches.
I always thought my prompts weren’t good enough, or the model was in a bad mood that day.
Then the paper told me: you’re wrong, dead wrong.
When the “preference” baked into the initial noise doesn’t match what you wrote in your prompt, the model is instantly caught in a dilemma. It’s like a car with two steering wheels. You turn left, it turns smoothly. You force it right, it either spins in place or flips into a ditch.
My own example makes this crystal clear.
The prompt was “a cat on a chair.” I imagined the cat curled up comfortably on the chair. But the generated cat was squeezed into the lower‑left corner, with the chair sitting lonely in the upper right.
The paper explained it exactly: in the initial noise, the region for the chair (call it region B) had no tendency to become a cat; the lower‑left corner (region A), on the other hand, was strongly inclined to produce a cat. The model was torn and finally made a “smart” decision—just put the cat in the corner. It’s the chair’s fault.
I used to dismiss such results as random crashes. After running the experiment, I went back and tested it. I fixed that initial noise and changed the prompt to “a dog on a sofa.” Guess what? The dog still shrank into the lower‑left corner, not budging.
The stone hadn’t changed, so the position of the carving couldn’t change either.
So what’s my iron rule now? If three consecutive images are broken, change the seed—change the initial noise outright. Don’t fight it. Wrestling with a stubborn driver for control of the wheel only ends badly for you.
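Why does a new seed feel like a new universe? A small NumPy check makes the point, under the assumption that the initial latent is just a seeded standard-Gaussian draw (the 4×64×64 shape is the Stable Diffusion 1.x layout; `latent_for_seed` and `correlation` are names I made up). The same seed reproduces the same sketch, while a different seed gives a latent that is statistically unrelated to the first one.

```python
import numpy as np


def latent_for_seed(seed: int, shape=(4, 64, 64)) -> np.ndarray:
    """Seeded standard-Gaussian draw standing in for the initial latent."""
    return np.random.default_rng(seed).standard_normal(shape)


def correlation(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation between two latents, flattened to vectors."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])


same = correlation(latent_for_seed(1), latent_for_seed(1))
different = correlation(latent_for_seed(1), latent_for_seed(2))

# The first number sits at 1, the second hovers around 0.
print(f"same seed: {same:.3f}  different seed: {different:.3f}")
```

Nothing of the old composition carries over, which is why swapping the seed is the cheapest way out of a stubborn layout.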
---
III. Believe It or Not? I Personally “Operated” on That Noise
Yes, you absolutely can. But there’s a condition: you have to be mentally prepared.
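Diffusion models are incredibly picky about the initial noise. They were trained to start from latents drawn from a standard Gaussian, so the latent you hand them has to look like one. It is not that any single value is sacred: redrawing values from the same Gaussian is fine. What breaks things is overwriting a region with hand-picked values, or scaling or shifting it, because the statistics then drift away from what the model saw in training. The model treats you like a stranger and produces either a fully black image or a complete mess.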
Don’t ask how I know. I ruined over a dozen images before I was finally convinced.
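Here is a rough way to see what “staying Gaussian” means in numbers. It is only a sketch with NumPy on a 4×64×64 array standing in for the latent; the region and the constant 3.0 are arbitrary choices of mine. Redrawing a region from a standard Gaussian leaves the overall mean and standard deviation where they were, while painting the same region with a hand-picked value pulls both away from 0 and 1.

```python
import numpy as np


def summarize(latent: np.ndarray) -> tuple[float, float]:
    """Return (mean, standard deviation) over every value in the latent."""
    return float(latent.mean()), float(latent.std())


rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 64, 64))

# Edit 1: redraw the lower-left quadrant from the same distribution.
resampled = latent.copy()
resampled[:, 32:, :32] = rng.standard_normal((4, 32, 32))

# Edit 2: paint the same quadrant with a hand-picked constant.
painted = latent.copy()
painted[:, 32:, :32] = 3.0

for name, z in [("original", latent), ("resampled", resampled), ("painted", painted)]:
    mean, std = summarize(z)
    print(f"{name:>9}: mean={mean:+.3f} std={std:.3f}")
```

The first two rows stay near mean 0 and std 1; the painted one does not, and that is the kind of latent that came back to me as a ruined image.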
So how did the paper actually do it? It’s almost absurdly simple: wherever the conflict happened, they redrew the lottery in that spot.
Remember the cat example? The cat was determined to go to the corner, and the chair region simply refused to accommodate it. The problem came down to the two regions being incompatible.
The solution was obvious: re‑sample the latent values of region A and region B from a standard Gaussian distribution. Keep everything else untouched, then re‑run the generation process.
In the paper, after a few rolls, they successfully produced an image with the cat sitting solidly on the chair.
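The re-sampling step can be written in a few lines. This is a sketch of the idea as described here, not the paper's implementation; it assumes a (channels, height, width) NumPy latent and a boolean mask marking region A and region B, and the coordinates below are invented for the cat example.

```python
import numpy as np


def resample_region(latent: np.ndarray, mask: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a copy of `latent` with the masked cells redrawn from N(0, 1).

    latent: array of shape (C, H, W)
    mask:   boolean array of shape (H, W), True where the noise is redrawn
    """
    fresh = rng.standard_normal(latent.shape).astype(latent.dtype)
    # Broadcast the mask over channels; unmasked cells keep their old values.
    return np.where(mask[None, :, :], fresh, latent)


rng = np.random.default_rng(42)
latent = rng.standard_normal((4, 64, 64)).astype(np.float32)

mask = np.zeros((64, 64), dtype=bool)
mask[40:64, 0:24] = True   # region A: lower-left corner, where the cat landed
mask[0:24, 40:64] = True   # region B: upper right, where the chair sits

edited = resample_region(latent, mask, rng)
```

Every call with the same `rng` is a new roll of the dice for the two regions, and everything outside the mask is bit-for-bit identical to the original latent.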
I couldn’t help myself. I immediately tried it in Stable Diffusion 1.5 with the Diffusers library, modifying the latent tensor by hand:
1. Generate an initial latent and run it for a few steps.
2. Locate the conflicting region, either from the attention map or, more crudely, by looking at the image and circling the problem area by hand.
3. Re-initialize that region with fresh random Gaussian values while freezing everything else.
4. Restart the generation from scratch with the edited latent.
Out of five attempts, three succeeded. In the other two, the conflict simply moved elsewhere—you chisel away one hard bump, and a new one pops up next to it.
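For the “circle it by hand” route, the workflow might look something like this. Treat it as a sketch under assumptions: boxes are given in pixels as (left, top, right, bottom), one latent cell covers an 8×8 pixel patch as in Stable Diffusion 1.x, and `pipe` is an already loaded Diffusers `StableDiffusionPipeline`. All function names are my own, and locating the region from attention maps is left out.

```python
import numpy as np

VAE_SCALE = 8  # assumption: SD 1.x, one latent cell per 8x8 pixel patch


def pixel_box_to_latent_box(box, scale: int = VAE_SCALE):
    """Convert (left, top, right, bottom) in pixels to latent cells, rounding outward."""
    left, top, right, bottom = box
    return (left // scale, top // scale, -(-right // scale), -(-bottom // scale))


def reroll_candidates(latent: np.ndarray, pixel_boxes, n_rolls: int, seed: int):
    """Yield `n_rolls` copies of `latent`, each with the boxes redrawn from N(0, 1)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_rolls):
        candidate = latent.copy()
        for box in pixel_boxes:
            left, top, right, bottom = pixel_box_to_latent_box(box)
            region = candidate[..., top:bottom, left:right]
            candidate[..., top:bottom, left:right] = rng.standard_normal(region.shape)
        yield candidate


def render(pipe, prompt: str, latent: np.ndarray):
    """Run a loaded StableDiffusionPipeline from a (1, 4, h, w) latent."""
    import torch

    z = torch.from_numpy(latent).to(device=pipe.device, dtype=pipe.unet.dtype)
    return pipe(prompt, latents=z).images[0]


base = np.random.default_rng(0).standard_normal((1, 4, 64, 64)).astype(np.float32)
boxes = [(0, 320, 192, 512), (320, 0, 512, 192)]  # lower-left and upper-right, in pixels
candidates = list(reroll_candidates(base, boxes, n_rolls=5, seed=1))
```

Each candidate then goes through `render(pipe, "a cat on a chair", candidate)`, and you pick the rolls where the conflict actually went away.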
What fascinates me most is: you’re not “editing the image”; you’re editing the noise. This is far more fundamental than inpainting or restoration. You’re changing the starting point of the generation process, not patching holes in the result.
Of course, don
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.