Everyone Assumes Weights Must Be Randomly Initialized, but ControlNet Gets Fine-Grained Control from All-Zero Convolutions

Generated: 2026-06-22 02:27:17

---

Have you ever seen AI drawing crash and burn?

I have. More than once.

Back in late 2022, when Stable Diffusion first blew up, I spent every day tweaking prompts like I was summoning spirits. Every image generation was a complete lottery — you'd ask for "standing," and it would give you lying flat; you'd emphasize "a tree on the left," and it would draw a dog on the right. Every time I hit "generate," my heart raced, followed by, "Damn, it's off again!"

This blind-box life went on for months.

Until one day, I discovered ControlNet.

The first time I tested it in Canny mode, I drew a few white lines as an edge map and fed it in — oh my god, it actually followed the contour, filling in textures and lighting, and the edges were locked in tight! At that moment, I literally yelled out loud: AI art had finally evolved from "jittery mode" to "hit exactly where you point"!

So, what exactly did ControlNet solve?

In one word: controllability.

Before, you were stuck with text-based lottery; ControlNet let you throw in a condition image and say, "Follow this!" Whether it was a stick-figure skeleton, a depth map, or a rough sketch, it obeyed.

How does it work? Don't let the architecture diagram in the paper scare you. Break it down, and it's actually quite simple —

It makes a trainable copy of the U-Net's encoder blocks and middle block, while freezing the original model and leaving its parameters unchanged. The copy is trained only on your condition data. The original retains its general painting ability and the copy learns the new condition signal, so they complement each other. The setup is stable enough that you can even run it on your personal computer.

The part that made me slap my thigh was zero convolution.

Guess what the initial weights of this convolutional layer are? All zeros.

I thought at the time, isn't that ridiculous? How do you learn from zeros? The trick is that the gradient of a zero-initialized weight depends on the layer's input, not on the weight itself, so the weights still receive nonzero updates. At the start of training, ControlNet's output is zero, so it has no effect on the original U-Net. As training progresses, the weights gradually move away from zero, and the condition signal sneaks in. It's like hiring a director who is gagged at first, and only after the actors (the main model) have stabilized their performance does the director start making small requests. If the paper hadn't analyzed it in detail, I never would have believed that this "shut up first, speak later" design is actually the most stable.
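To convince myself that zeros can learn, I find it helps to work one pixel by hand. The sketch below is only an illustration, not the paper's code: it treats a 1x1 convolution at a single spatial position as y = W x + b, and the feature values, upstream gradient, and learning rate are made-up numbers.

```python
# Pure-Python sketch of a 1x1 "zero convolution" at a single pixel.
# At one spatial position, a 1x1 conv is just y = W x + b over channels.

def zero_conv_forward(W, b, x):
    """Apply y = W x + b for one pixel; W has shape out_ch x in_ch."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def weight_grad(grad_y, x):
    """dL/dW[i][j] = grad_y[i] * x[j]: it depends on the input x, not on W."""
    return [[g * xi for xi in x] for g in grad_y]

def sgd_step(W, b, grad_y, x, lr=0.1):
    """One plain SGD update of the weights and bias."""
    gW = weight_grad(grad_y, x)
    new_W = [[w - lr * g for w, g in zip(row, grow)] for row, grow in zip(W, gW)]
    new_b = [bi - lr * g for bi, g in zip(b, grad_y)]
    return new_W, new_b

x = [0.5, -1.0, 2.0]                # hypothetical condition features (3 channels)
W = [[0.0] * 3 for _ in range(2)]   # zero-initialized weights (2 output channels)
b = [0.0, 0.0]                      # zero-initialized bias

y0 = zero_conv_forward(W, b, x)     # all zeros: the frozen U-Net is undisturbed
grad_y = [1.0, -0.5]                # hypothetical upstream gradient dL/dy
W, b = sgd_step(W, b, grad_y, x)
y1 = zero_conv_forward(W, b, x)     # no longer zero after a single update
```

The point is in `weight_grad`: as long as the input features are nonzero, the very first update already pushes the weights off zero, so the "gag" comes off gradually rather than never.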

As for how it's embedded: ControlNet only copies the Encoder and Middle Block. The copy's outputs pass through zero convolutions and are added to the original U-Net's middle block and to the skip connections that feed the Decoder. During inference, it first computes a set of residuals and then adds them to the corresponding features of the original U-Net. This is all visible in the StableDiffusionControlNetPipeline in diffusers; check out its `__call__` method if you're curious.
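The "compute residuals, then add them" step is easier to see in a toy form. This is a rough sketch with plain lists standing in for tensors, not the diffusers implementation. The function and variable names are mine; they loosely mirror the `down_block_additional_residuals` and `mid_block_additional_residual` arguments of the diffusers U-Net, and `scale` plays the role of `controlnet_conditioning_scale`.

```python
def add_control_residuals(skip_features, mid_feature,
                          down_residuals, mid_residual, scale=1.0):
    """Add scaled ControlNet residuals to the frozen U-Net's features.

    Plain lists of floats stand in for tensors in this sketch.
    """
    if len(skip_features) != len(down_residuals):
        raise ValueError("need one residual per skip connection")
    new_skips = [
        [f + scale * r for f, r in zip(feat, res)]
        for feat, res in zip(skip_features, down_residuals)
    ]
    new_mid = [f + scale * r for f, r in zip(mid_feature, mid_residual)]
    return new_skips, new_mid


# Hypothetical toy features: two encoder skip connections and one middle block.
skips = [[1.0, 2.0], [3.0, 4.0]]
mid = [5.0, 6.0]
down_res = [[0.5, -0.5], [1.0, 0.0]]
mid_res = [-1.0, 2.0]

new_skips, new_mid = add_control_residuals(skips, mid, down_res, mid_res, scale=0.5)
```

With `scale=0.0` the features come back unchanged, which is the same "no effect on the original U-Net" behavior you get from the zero convolutions at the start of training.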

---

But let me be honest — ControlNet isn't a cheat code; falling into pitfalls is inevitable.

Let me pour out some bitter experiences for you.

Pitfall #1: Not enough data but still charging ahead.

In my early days, I ambitiously wanted to train an OpenPose model with only 2,000 images. After training, the poses came out so twisted it was like there was no control at all. Later I checked the paper — the Canny model was trained on 3 million images! 2,000? Barely a snack. For small or medium datasets, LoRA is still the safer bet; don't try to take shortcuts.

Pitfall #2: Choosing the wrong preprocessor.

Once, when making an architectural image, I picked MLSD (line detection) as the preprocessor. But it treated people as line segments too, chopping off arms and legs. Later I switched to Depth extraction, which preserved lines while recognizing depth, and the building instantly stood upright. My current rule of thumb: MLSD for architecture, OpenPose for people, Canny or HED for detailed stylization, and Normal Map for wallpaper-level fine control. Choosing the right preprocessor is more effective than tweaking parameters.
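If you want that rule of thumb in a form you can paste into a script, here is a small sketch. The subject labels and the function name are hypothetical, and the "architecture plus people goes to Depth" branch just encodes my own mishap above, not an official recommendation.

```python
# Rule-of-thumb mapping from subject type to preprocessor, taken from the text.
# The keys are hypothetical labels chosen for this sketch.
PREPROCESSOR_BY_SUBJECT = {
    "architecture": "MLSD",
    "people": "OpenPose",
    "stylization": "Canny or HED",
    "fine_surface_detail": "Normal Map",
}

def suggest_preprocessor(subjects):
    """Return a preprocessor name for the given collection of subject labels."""
    subjects = set(subjects)
    # MLSD only keeps straight lines, so people in an architectural shot get
    # chopped up; fall back to depth when both appear together.
    if {"architecture", "people"} <= subjects:
        return "Depth"
    for subject, preprocessor in PREPROCESSOR_BY_SUBJECT.items():
        if subject in subjects:
            return preprocessor
    raise ValueError(f"no rule of thumb for subjects: {sorted(subjects)}")
```

For example, `suggest_preprocessor(["architecture"])` gives "MLSD", while `suggest_preprocessor(["architecture", "people"])` gives "Depth".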

Pitfall #3: Jumping on FLUX.1's ControlNet as soon as it came out.

I admit, I was a guinea pig. The moment FLUX.1's ControlNet was open-sourced, I jumped in. But the model was unstable, with detail control worse than the SDXL version. I had to wait several iterations before I could use it properly. My advice: let the community test the waters first, and only hop on once it's stable.

---

Now, with so many versions out there, which one should you choose?

Let me give you my real-world experience:

SD1.5: Like an old Nokia — you can mod it, smash walnuts with it, and the community models are abundant. QR codes, lighting, hand control — you can download anything. The downside is lower quality ceiling.
SDXL: A high-end flagship phone. High resolution and great quality, but it devours VRAM. Even on an A100, a single run makes me wince.
FLUX.1: The latest foldable. Cool tech (an MMDiT-style architecture that mixes double-stream and single-stream DiT blocks), but the ecosystem is still immature, documentation is scarce, and stability is inconsistent.
SD2.1: I basically never used it — the community is too cold.

My personal recommendation: For real-time or extreme detail, SD1.5 + ControlNet 1.1 is rock solid. For product demos, go with SDXL directly. For gimmicks (like QR codes), try SD1.5 first — plenty of models to copy from.

What's the difference between 1.0 and 1.1? The 1.1 upgrade is significant: it supports more conditions (Inpaint, Soft Edge, Lineart), reduces overfitting (before, heavy use would warp the style), halves the checkpoint size (the pruned version is only 689 MB, enough for daily use), and ships more stable training scripts. I now use the 1.1 pruned version exclusively, since it saves hard drive space and runs faster. On ComfyUI, inference speed is almost the same as the original.

---

How hard is it to train your own ControlNet?

I ran the official tutorial_train.py once. Here's the process:

Prepare condition images (e.g., Canny edges) + target images (the real images you want).
Write a JSON file that lists, for each training pair, the condition image path, the target image path, and a text description.
Run the training script with a small batch size, or you'll blow up VRAM.
Watch the loss curve; don't interrupt it too early.
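The JSON step is where I see most beginners get stuck, so here is a minimal sketch of building and re-reading such a file's contents. I'm assuming the one-object-per-line layout with `source`, `target`, and `prompt` keys that I remember from the official tutorial dataset; check the dataset class you actually use. The file paths and prompts are made up.

```python
import json

def build_prompt_lines(pairs):
    """Serialize (source, target, prompt) triples as one JSON object per line."""
    return "\n".join(
        json.dumps({"source": s, "target": t, "prompt": p}) for s, t, p in pairs
    )

def parse_prompt_lines(text):
    """Read the entries back, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Hypothetical training pairs: condition image, target image, description.
pairs = [
    ("source/0.png", "target/0.png", "a red circle on a white background"),
    ("source/1.png", "target/1.png", "a wooden chair in an empty room"),
]

text = build_prompt_lines(pairs)
entries = parse_prompt_lines(text)
```

Round-tripping the entries like this before training is a cheap way to catch a broken path or a missing prompt before you burn hours of GPU time.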

But I stumbled again: the preprocessor version wasn't aligned with the ControlNet version. Canny and depth maps were fine, but if MLSD parameters were off, the output was completely broken. Later, a community expert told me: most preprocessors should match the ControlNet version; otherwise, feature extraction goes wrong.

The hard requirements: at least 8GB VRAM (16GB+ recommended), tens of thousands of pairs of data, and training time in days. Most people should just use pre-trained models — there are hundreds to keep you busy for a year.

---

Commercial applications? I've tried a few scenarios. Here's the data.

E-commerce: Used depth maps to control product poses, turning flat images into 3D scenes. Client feedback said conversion rates increased by 15%.
QR code generation: SD1.5 specialized model + ControlNet Canny — never had a QR code scan fail. But you must leave blank space around the QR code area, or the recognition rate plummets.
Game assets: OpenPose for pose control, batch-generating NPCs with high consistency.
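About that blank space around the QR code: scanners expect a light border, usually called the quiet zone, and four modules wide is the commonly cited minimum. Below is a small sketch that pads a module matrix before you turn it into a condition image. The function name and the toy matrix are hypothetical, and a real QR code is far larger than 3x3.

```python
def add_quiet_zone(modules, width=4):
    """Pad a QR module matrix (1 = dark, 0 = light) with a light border."""
    if width < 0:
        raise ValueError("width must be non-negative")
    size = len(modules[0]) + 2 * width
    padded = [[0] * size for _ in range(width)]          # top border rows
    for row in modules:
        padded.append([0] * width + list(row) + [0] * width)
    padded.extend([0] * size for _ in range(width))      # bottom border rows
    return padded


# Toy 3x3 stand-in for a real QR module matrix.
toy = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
padded = add_quiet_zone(toy)
```

The padded matrix is 11x11 here (3 modules plus 4 on each side), and the original pattern sits untouched in the middle.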

Key insight: the cleaner your condition image, the better the result. Different scenarios have different requirements — game demos can be looser, but e-commerce clients will complain if a color is slightly off.

---

Finally, two often overlooked questions.

Question 1: What's the relationship between ControlNet and T2I-Adapter?

T2I-Adapter is lighter: it doesn't clone the U-Net, it just uses a lightweight network to inject condition features into the U-Net's encoder. It requires fewer training resources, but its control isn't as strong as ControlNet's (especially for

Cael Lee
