Everyone Assumes Weight Initialization Has to Be Random, Yet ControlNet Gets Fine-Grained Control from All-Zero Convolutions

By CaelLee | 6 min read

Generated: 2026-06-22 02:27:17

---

Have you ever watched an AI image generation crash and burn?

I have. More than once.

Back in late 2022, when Stable Diffusion first blew up, I spent every day tweaking prompts like I was summoning spirits. Every image generation was a complete lottery — you'd ask for "standing," and it would give you lying flat; you'd emphasize "a tree on the left," and it would draw a dog on the right. Every time I hit "generate," my heart raced, followed by, "Damn, it's off again!"

This blind-box life went on for months.

Until one day, I discovered ControlNet.

The first time I tested it in Canny mode, I drew a few white lines as an edge map and fed it in — oh my god, it actually followed the contour, filling in textures and lighting, and the edges were locked in tight! At that moment, I literally yelled out loud: AI art had finally evolved from "jittery mode" to "hit exactly where you point"!

So, what exactly did ControlNet solve?

In one word: controllability.

Before, you were stuck with text-based lottery; ControlNet let you throw in a condition image and say, "Follow this!" Whether it was a stick-figure skeleton, a depth map, or a rough sketch, it obeyed.

How does it work? Don't let the architecture diagram in the paper scare you. Break it down, and it's actually quite simple —

It clones the U-Net as a trainable copy, while freezing the original and keeping its parameters unchanged. The copy is trained only on your condition data. The original retains general painting ability, and the copy learns the new condition signal, so they complement each other. Because the original weights are never touched, training is stable enough that you can even run it on your personal computer.
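To make the clone-and-freeze idea concrete, here's a minimal sketch in plain Python. The weight names ("enc.w", "mid.w"), the gradient values, and both helper functions are made up for illustration; a real setup would do the same thing to a full U-Net inside a deep learning framework.

```python
import copy


def make_controlnet_pair(pretrained):
    """Return (locked, trainable): two independent copies of the same weights."""
    locked = copy.deepcopy(pretrained)     # frozen: never handed to the optimizer
    trainable = copy.deepcopy(pretrained)  # starts identical, then gets fine-tuned
    return locked, trainable


def sgd_update(params, grads, lr=0.01):
    """Plain SGD on a dict of scalar weights; only ever called on the trainable copy."""
    return {name: w - lr * grads[name] for name, w in params.items()}


# Hypothetical pretrained weights and gradients, just to show the bookkeeping.
pretrained = {"enc.w": 0.5, "mid.w": -1.2}
locked, trainable = make_controlnet_pair(pretrained)
trainable = sgd_update(trainable, {"enc.w": 1.0, "mid.w": -2.0})
```

After the update, `locked` still equals the pretrained weights while `trainable` has moved, which is the whole point: the general painting ability can't be damaged by the new training run.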

The part that made me slap my thigh was zero convolution.

Guess what the initial weights of this convolutional layer are? All zeros.

I thought at the time, isn't that ridiculous? How do you learn from zeros? At the start of training, the zero convolution outputs zero, so ControlNet has no effect on the original U-Net. But the gradient of a weight depends on the layer's input, not on the weight's current value, so the weights still receive nonzero updates. As training progresses, they drift away from zero, and the condition signal sneaks in. It's like hiring a director who is gagged at first, and only after the actors (the main model) have stabilized their performance does the director start making small requests. If the paper hadn't analyzed it in detail, I never would have believed that this "shut up first, speak later" design is actually the most stable.
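Here's a toy version of that argument, shrunk down to single numbers so you can follow it by hand. This is a sketch under heavy assumptions: `frozen_block` is a made-up stand-in for a frozen U-Net layer, the "convolution" is just `w * c + b` on one scalar, and the loss is a simple squared error. In the real ControlNet the zero convolution sits on the output of the trainable copy, but the gradient logic is the same.

```python
def frozen_block(x):
    """Stand-in for a frozen layer of the original U-Net (never updated)."""
    return 2.0 * x


def zero_conv(c, w, b):
    """A 1x1 'convolution' on a single scalar feature: y = w * c + b."""
    return w * c + b


def forward(x, c, w, b):
    """Frozen output plus the control branch's contribution."""
    return frozen_block(x) + zero_conv(c, w, b)


def sgd_step(x, c, target, w, b, lr=0.05):
    """One SGD step on L = 0.5 * (y - target)**2, updating only w and b."""
    err = forward(x, c, w, b) - target  # dL/dy
    grad_w = err * c                    # depends on the input c, not on w
    grad_b = err
    return w - lr * grad_w, b - lr * grad_b


w, b = 0.0, 0.0                  # zero initialization
x, c, target = 1.0, 3.0, 5.0     # hypothetical input, condition, and target
w_new, b_new = sgd_step(x, c, target, w, b)
```

With `w = b = 0` the forward pass returns exactly what the frozen block returns, yet `grad_w = err * c` is nonzero as long as the condition `c` and the error are nonzero. So the very first step already moves `w` off zero, and the output gets closer to the target.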

As for how it's embedded: ControlNet only copies the U-Net's Encoder blocks and Middle Block; it has no Decoder of its own. The output of each copied block passes through a zero convolution and is added to the matching features that feed the original U-Net's Decoder. During inference, it first calculates this set of residual values and then adds them to the original U-Net's layers. This is all clearly written in the StableDiffusionControlNetPipeline in diffusers; go check out the __call__ method if you're curious.
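The "compute residuals, then add them" step can be sketched in a few lines of plain Python. Everything here is illustrative: feature maps are flattened into short lists of floats, and `add_control_residuals` is a hypothetical helper, not a diffusers function. As far as I can tell, the corresponding pieces in diffusers are the pipeline's `controlnet_conditioning_scale` argument and the U-Net's `down_block_additional_residuals` and `mid_block_additional_residual` inputs.

```python
def add_control_residuals(skips, mid, down_residuals, mid_residual, scale=1.0):
    """Add scaled ControlNet residuals to the frozen U-Net's features.

    skips / down_residuals: one list of floats per encoder skip connection.
    mid / mid_residual: a single list of floats for the middle block.
    """
    if len(skips) != len(down_residuals):
        raise ValueError("need exactly one residual per skip connection")
    new_skips = [
        [s + scale * r for s, r in zip(skip, res)]
        for skip, res in zip(skips, down_residuals)
    ]
    new_mid = [m + scale * r for m, r in zip(mid, mid_residual)]
    return new_skips, new_mid


# Toy features from the frozen encoder and toy residuals from ControlNet.
skips = [[1.0, 2.0], [3.0]]
mid = [4.0]
new_skips, new_mid = add_control_residuals(
    skips, mid, down_residuals=[[0.5, 0.5], [1.0]], mid_residual=[2.0], scale=0.5
)
```

Setting `scale` to 0 gives back the untouched features, which is a handy sanity check when the control feels too strong or too weak.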

---

But let me be honest — ControlNet isn't a cheat code; falling into pitfalls is inevitable.

Let me pour out some bitter experiences for you.

Pitfall #1: Not enough data but still charging ahead.

In my early days, I ambitiously wanted to train an OpenPose model with only 2,000 images. After training, the poses came out so twisted it was like there was no control at all. Later I checked the paper — the Canny model was trained on 3 million images! 2,000? Barely a snack. For small or medium datasets, LoRA is still the safer bet; don't try to take shortcuts.

Pitfall #2: Choosing the wrong preprocessor.

Once, when making an architectural image that also had people in it, I picked MLSD (straight-line detection) as the preprocessor. But it treated people as line segments too, chopping off arms and legs. Later I switched to Depth extraction, which preserved the building's structure while recognizing depth, and the building instantly stood upright. My current rule of thumb: MLSD for architecture (as long as no people are in frame), OpenPose for people, Canny or HED for detailed stylization, and Normal Map for wallpaper-level fine control. Choosing the right preprocessor is more effective than tweaking parameters.
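If you like having your rules of thumb written down, the one above fits in a tiny lookup. This is only my personal heuristic turned into code: the subject labels, the returned strings, and `pick_preprocessor` itself are made-up names, not identifiers from any ControlNet tool.

```python
# Personal rule of thumb as a lookup table (labels are hypothetical).
RULE_OF_THUMB = {
    "architecture": "mlsd",
    "people": "openpose",
    "stylization": "canny",      # HED is the other option here
    "fine_detail": "normal_map",
}


def pick_preprocessor(subject, people_in_frame=False):
    """Return a suggested preprocessor label for a given subject."""
    # Lesson from the pitfall: MLSD turns people into line segments,
    # so fall back to depth when a building shot also contains people.
    if subject == "architecture" and people_in_frame:
        return "depth"
    if subject not in RULE_OF_THUMB:
        raise ValueError(f"no rule of thumb for subject: {subject!r}")
    return RULE_OF_THUMB[subject]
```

For example, `pick_preprocessor("architecture", people_in_frame=True)` suggests depth instead of MLSD.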

Pitfall #3: Jumping on FLUX.1's ControlNet as soon as it came out.

I admit, I was a guinea pig. The moment FLUX.1's ControlNet was open-sourced, I jumped in. But the model was unstable, with detail control worse than the SDXL version. I had to wait several iterations before I could use it properly. My advice: let the community test the waters first, and only hop on once it's stable.

---

Now, with so many versions out there, which one should you choose?

Let me give you my real-world experience:

My personal recommendation: For real-time or extreme detail, SD1.5 + ControlNet 1.1 is rock solid. For product demos, go with SDXL directly. For gimmicks (like QR codes), try SD1.5 first — plenty of models to copy from.

What's the difference between 1.0 and 1.1? The 1.1 upgrade is significant: it supports more conditions (Inpaint, Soft Edge, Lineart), reduces overfitting (before, using it too much would warp style), roughly halves the model file size (the pruned version is only 689 MB, enough for daily use), and training scripts are more stable. I now use the 1.1 pruned version exclusively, since it saves hard drive space and loads faster. On ComfyUI, inference speed is almost the same as the original.

---

How hard is it to train your own ControlNet?

I ran the official tutorial_train.py once. Here's the process:

  1. Prepare condition images (e.g., Canny edges) + target images (the real images you want).
  2. Write a JSON file with each entry's path and description.
  3. Run the script. Settings such as batch_size are plain variables near the top of the file rather than command-line flags; keep the batch size small or you'll blow up VRAM.
  4. Watch the loss curve; don't interrupt it too early.
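Step 2 is the one people ask about most, so here's a minimal sketch of writing and reading that file. I'm assuming the layout I remember from the official tutorial dataset: one JSON object per line with "source" (condition image), "target" (real image), and "prompt" keys. Double-check those key names against the dataset loader you actually train with; the file paths below are made up.

```python
import json


def write_prompt_file(pairs, out_path):
    """Write one JSON object per line: condition image, target image, caption."""
    with open(out_path, "w", encoding="utf-8") as f:
        for source, target, prompt in pairs:
            record = {"source": source, "target": target, "prompt": prompt}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_prompt_file(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical entries: (condition image path, target image path, description).
example_pairs = [
    ("source/0.png", "target/0.png", "a red brick house at dusk"),
    ("source/1.png", "target/1.png", "a white cat on a sofa"),
]
```

Calling `write_prompt_file(example_pairs, "prompt.json")` produces a two-line file, and `read_prompt_file` gives you back the same records for a quick check that every path is spelled correctly before you burn GPU hours.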

But I stumbled again: the preprocessor version wasn't aligned with the ControlNet version. Canny and depth maps were fine, but if MLSD parameters were off, the output was completely broken. Later, a community expert told me: most preprocessors should match the ControlNet version; otherwise, feature extraction goes wrong.

The hard requirements: at least 8GB VRAM (16GB+ recommended), tens of thousands of image pairs, and training time measured in days. Most people should just use pre-trained models; there are hundreds of them, enough to keep you busy for a year.

---

Commercial applications? I've tried a few scenarios. Here's the data.

Key insight: the cleaner your condition image, the better the result. Different scenarios have different requirements — game demos can be looser, but e-commerce clients will complain if a color is slightly off.

---

Finally, two often overlooked questions.

Question 1: What's the relationship between ControlNet and T2I-Adapter?

T2I-Adapter is lighter — it doesn't clone the U-Net, just uses a lightweight network to inject condition features into the decoder. It requires fewer training resources, but its control strength isn't as strong as ControlNet (especially for
