I Spent Ten Years Falling into Every GAN Pitfall, and Mode Collapse Is the Deadliest
Generated: 2026-06-22 17:37:06
---
"Can't Generate a Damn Thing" — My Ten-Year Blood, Sweat, and Tears with GANs, and How It All Ended
Believe it or not, the first time I read the GAN paper back in 2014, I almost smashed my computer.
To be honest, I'd just started working in image generation back then, and I figured I was at least a reasonably technical guy. But Goodfellow's original GAN paper? I had to read it three whole times before I roughly understood what he was talking about.
So what's the idea? Two networks going head-to-head: a generator that fakes it, and a discriminator that calls it out. The faker mimics like crazy, the critic nitpicks like crazy. And in the end? The faker gets better and better, the critic gets pickier and pickier, and the two push each other until the generator produces images so real you can't tell the difference.
Think about it: how brilliant is that? Clunky, risky, prone to training collapses, but the idea was unlike anything else at the time.
What happened next? I dove in headfirst, and I stayed there for ten years.
The Day I First Got Digits Out, I Almost Cried
No joke, the first time I ran a GAN, I used Keras.
The generator network was just three fully connected layers with LeakyReLU, and the discriminator was pretty much the same. Feed in a 100-dimensional random noise vector, run it through the neural network, and out comes a 28x28 handwritten digit image — that simple.
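A minimal sketch of that generator's forward pass, written in plain NumPy rather than Keras so it stays self-contained. The hidden widths (128 and 256), the LeakyReLU slope of 0.2, and the tanh output are assumptions for illustration; the original network's exact sizes aren't given here. The weights are random, so this only shows the shapes involved, not a trained model.

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    """LeakyReLU: pass positives through, scale negatives by alpha."""
    return np.where(x > 0, x, alpha * x)

def init_generator(rng, noise_dim=100, hidden=(128, 256), out_dim=28 * 28):
    """Create weights for three fully connected layers (sizes are assumed)."""
    sizes = [noise_dim, *hidden, out_dim]
    return [
        (rng.normal(0.0, 0.02, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def generate(params, z):
    """Map a batch of noise vectors to 28x28 images with values in [-1, 1]."""
    h = z
    for w, b in params[:-1]:
        h = leaky_relu(h @ w + b)          # hidden layers use LeakyReLU
    w, b = params[-1]
    return np.tanh(h @ w + b).reshape(-1, 28, 28)

rng = np.random.default_rng(0)
params = init_generator(rng)
z = rng.normal(size=(16, 100))             # 16 random 100-dimensional noise vectors
images = generate(params, z)
print(images.shape)                        # (16, 28, 28)
```

The discriminator would mirror this in reverse: a flattened 784-value image in, a single real-or-fake score out.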
Guess what?
After about 50 epochs of training, it worked.
That night I sat staring at the screen at those crooked, wobbly digits. Some looked like a "3" and an "8" at the same time, others were completely unrecognizable, but they weren't random noise anymore!
Can you imagine that feeling? A random vector that means nothing, run through a bunch of parameters, and it actually conjures up an image for you.
I couldn't sleep for days, I was so excited. I kept thinking how incredible this was.
Now, you might think GANs are simple. And yeah, they are.
But simple doesn't mean easy to train.
Mode Collapse: The Two Words I Hate Most
The training process for a GAN is basically two steps.
Step one: freeze the generator, update the discriminator. Teach the discriminator to give high scores to real samples and low scores to fake ones.
Step two: freeze the discriminator, update the generator. Teach the generator to fool the discriminator.
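The two steps above can be sketched on a toy problem small enough to differentiate by hand. Everything specific here is an assumption for illustration: the 1-D "real" data drawn from N(4, 1), the linear generator `a * z + b`, the logistic discriminator, the learning rate, and the use of the non-saturating generator loss `-log D(G(z))`. "Freezing" a network simply means its parameters are read but never updated in that step.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def d_step(g, d, x_real, z, lr=0.05):
    """Step one: generator frozen, discriminator updated on the BCE loss."""
    x_fake = g["a"] * z + g["b"]                # generator is used, never changed
    p_real = sigmoid(d["w"] * x_real + d["c"])
    p_fake = sigmoid(d["w"] * x_fake + d["c"])
    # Gradients of -log D(real) - log(1 - D(fake)) with respect to the logits
    g_real, g_fake = p_real - 1.0, p_fake
    grad_w = np.mean(g_real * x_real) + np.mean(g_fake * x_fake)
    grad_c = np.mean(g_real) + np.mean(g_fake)
    return {"w": d["w"] - lr * grad_w, "c": d["c"] - lr * grad_c}

def g_step(g, d, z, lr=0.05):
    """Step two: discriminator frozen, generator updated to fool it."""
    x_fake = g["a"] * z + g["b"]
    p_fake = sigmoid(d["w"] * x_fake + d["c"])
    # Loss -log D(G(z)); chain rule back through the frozen discriminator
    g_x = (p_fake - 1.0) * d["w"]
    return {"a": g["a"] - lr * np.mean(g_x * z), "b": g["b"] - lr * np.mean(g_x)}

rng = np.random.default_rng(0)
g = {"a": 1.0, "b": 0.0}
d = {"w": 0.1, "c": 0.0}
for _ in range(200):
    x_real = rng.normal(4.0, 1.0, size=64)      # a batch of "real" samples
    z = rng.normal(size=64)                     # a batch of noise
    d = d_step(g, d, x_real, z)                 # step one
    g = g_step(g, d, z)                         # step two
print(g, d)
```

In a real framework the same alternation is usually done with two optimizers, each holding only one network's parameters.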
But this isn't two people playing a friendly round of pool that wraps up smoothly. It's a game in the adversarial sense: a constant tug-of-war where each side improves only at the other's expense.
On my first real project, I trained for 200 epochs. Guess what happened?
The generator output the same digit every single time.
What does that mean? It found a "shortcut", one output that reliably fooled the discriminator, and then it got stuck there instead of covering the rest of the digits.
Mode collapse. Those two words, I muttered them to myself every single day back then.
Later, every time I saw the generator producing stable but identical results, I knew it had collapsed again. What did it feel like? Like spending a fortune on a cooking robot, only for it to make one dish: scrambled eggs. Same taste, same heat, every time. And you eat it for a month.
You want to laugh, and you want to cry.
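One rough way to catch the "scrambled eggs every day" situation without eyeballing samples is to label a batch of generated digits and check how many classes show up. The sketch below assumes the labels come from some pretrained digit classifier (hypothetical, not shown) and that there are ten classes; it only does the counting.

```python
import math
from collections import Counter

def mode_coverage(labels, num_classes=10):
    """Return (fraction of classes present, entropy in bits) for predicted labels."""
    counts = Counter(labels)
    total = len(labels)
    coverage = len(counts) / num_classes
    # Shannon entropy of the label histogram: 0 when every sample is the same class
    entropy = sum((c / total) * math.log2(total / c) for c in counts.values())
    return coverage, entropy

collapsed = [7] * 100                        # generator emits only sevens
healthy = [i % 10 for i in range(100)]       # all ten digits, evenly spread
print(mode_coverage(collapsed))              # low coverage, zero entropy
print(mode_coverage(healthy))                # full coverage, entropy near log2(10)
```

Low coverage or entropy sliding toward zero during training is a hint that the generator is collapsing onto a few outputs.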
The Evolution of Loss Functions: From "It's Bad" to "Where Is It Bad?"
The original GAN used binary cross-entropy loss.
But what's the problem? Vanishing gradients.
When the discriminator gets too strong, it rejects every fake with near-total confidence, and the generator receives almost no useful gradient. It's like you're just starting to learn painting, and the teacher only ever says, "Terrible, terrible," but never tells you what's wrong. Is the neck too long? Are the eyes crooked? Is the shadow on the wrong side?
You have no idea.
How are you supposed to learn?
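The "teacher who only says terrible" effect can be seen in a few lines of arithmetic. The sketch below assumes the discriminator ends in a sigmoid over a logit, and compares the gradient of the original minimax generator loss, log(1 - D(G(z))), with the non-saturating alternative, -log D(G(z)), which the original GAN paper also suggests. The function names are made up for illustration.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def saturating_grad(logit):
    """Derivative of log(1 - sigmoid(logit)) with respect to the logit."""
    return -sigmoid(logit)

def non_saturating_grad(logit):
    """Derivative of -log(sigmoid(logit)) with respect to the logit."""
    return sigmoid(logit) - 1.0

# A more negative logit means the discriminator is more sure the sample is fake
for logit in (0.0, -5.0, -10.0):
    print(logit, saturating_grad(logit), non_saturating_grad(logit))
```

As the logit drops, the first gradient shrinks toward zero (the generator hears nothing useful), while the second stays close to -1.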
Then LSGAN came along, swapping the cross-entropy loss for a least-squares one. Then came WGAN, which minimizes the Wasserstein distance instead of the JS divergence. WGAN-GP replaced WGAN's weight clipping with a gradient penalty term, making training much more stable.
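To make those three variants concrete, here is a hedged NumPy sketch of the discriminator-side losses. The 0/1 targets for the least-squares loss and the penalty weight of 10 follow the common conventions from the respective papers. The finite-difference gradient is a stand-in chosen so the example runs without an autodiff library; real training code would use the framework's automatic differentiation, and the linear `critic` here is purely a toy.

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """Least-squares discriminator loss with targets 1 (real) and 0 (fake)."""
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def wgan_critic_loss(c_real, c_fake):
    """Wasserstein critic loss: push real scores up and fake scores down."""
    return np.mean(c_fake) - np.mean(c_real)

def numerical_grad(critic, x, eps=1e-5):
    """Central-difference gradient of a scalar critic for each input row."""
    grad = np.zeros_like(x)
    for j in range(x.shape[1]):
        step = np.zeros(x.shape[1])
        step[j] = eps
        grad[:, j] = (critic(x + step) - critic(x - step)) / (2 * eps)
    return grad

def gradient_penalty(critic, x_real, x_fake, rng, lam=10.0):
    """WGAN-GP term: lam * mean((||grad critic(x_hat)||_2 - 1)^2)."""
    t = rng.uniform(size=(x_real.shape[0], 1))
    x_hat = t * x_real + (1.0 - t) * x_fake      # random points between real and fake
    norms = np.linalg.norm(numerical_grad(critic, x_hat), axis=1)
    return lam * np.mean((norms - 1.0) ** 2)

def critic(x):
    """Toy linear critic whose gradient norm is 3 everywhere."""
    return x @ np.array([3.0, 0.0])

rng = np.random.default_rng(0)
x_real = rng.normal(size=(8, 2))
x_fake = rng.normal(size=(8, 2))
gp = gradient_penalty(critic, x_real, x_fake, rng)
print(gp)    # close to 10 * (3 - 1)^2 = 40
```

The penalty pulls the critic's gradient norm toward 1 on points between real and fake samples, which is how WGAN-GP enforces the Lipschitz constraint softly.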
I even tested WGAN-GP on the CelebA dataset: 64x64 face generation, 100 epochs. Guess what?
The generated faces actually looked like faces. Back in 2017, that felt close to state-of-the-art.
Thinking about it now, I can't help but sigh: technology really is something you grind out step by step.
The Peak: By 2020, GANs Were Unstoppable
From 2018 to 2020, GAN progress took off like a rocket.
StyleGAN could generate 1024x1024 high-resolution faces. My god, 1024x1024! CycleGAN could do image-to-image translation without paired data, and Pix2Pix could do it with paired supervision.
The experiment I remember most was using CycleGAN for horse-to-zebra conversion.
The training data was just ordinary horse and zebra images, with no paired data. In other words, you have a pile of horse photos and a pile of zebra photos, but they're not shot from the same angle, or even in the same scene.
No pairing, and the GAN could still learn.
I trained it for about two days. The result? The model learned to turn horses into zebras and zebras into horses. Some details were a bit off, but the overall effect was already astonishing.
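What makes learning without pairs possible in CycleGAN is the cycle-consistency loss: translate a horse to a zebra and back, and you should get the original horse. A minimal sketch of just that term is below. The two "generators" are toy arithmetic functions standing in for real networks, and the full objective also includes an adversarial loss for each direction, which is omitted here.

```python
import numpy as np

def cycle_consistency_loss(G, F, x, y):
    """L1 cycle loss: x -> G -> F should return x, and y -> F -> G should return y."""
    forward = np.mean(np.abs(F(G(x)) - x))
    backward = np.mean(np.abs(G(F(y)) - y))
    return forward + backward

def G(x):
    """Toy 'horse to zebra' mapping."""
    return 2.0 * x + 1.0

def F_good(y):
    """Toy 'zebra to horse' mapping that exactly undoes G."""
    return (y - 1.0) / 2.0

def F_bad(y):
    """Toy mapping that ignores what G did."""
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))     # stand-in batch of "horse" images
y = rng.normal(size=(4, 8, 8))     # stand-in batch of "zebra" images
loss_good = cycle_consistency_loss(G, F_good, x, y)
loss_bad = cycle_consistency_loss(G, F_bad, x, y)
print(loss_good, loss_bad)         # near zero versus clearly positive
```

Note that `x` and `y` are never matched up with each other; each batch is only compared against its own round trip.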
Back then, I thought GANs were the future.
Then came 2022, and everything changed.
Diffusion Models Arrived: GANs Fell from Grace
After Stable Diffusion went open-source in 2022, the landscape shifted completely.
Diffusion models beat GANs on almost everything that mattered for image quality: more stable training, better mode coverage, higher fidelity.
Honestly, I felt a bit lost.
Think about it: you follow a technology from nothing, from rough to refined, watch it climb to its peak and then fall. How does that feel? It's like watching your own child grow up and grow old, and then discovering someone else's kid is the newer, better version.
It's not jealousy. It's a complicated, indescribable emotion.
But guess what?
GANs Aren't Dead, They Just Live Differently
Really, GANs aren't dead.
They've just gone from lead actor to scene-stealing supporting role.
Now, in Stable Diffusion workflows, you often see GANs popping up. For example, some of ControlNet's preprocessors rely on models trained adversarially. And where latency matters, as in video generation, GANs still have the edge on inference speed: one forward pass gives you a result, while diffusion models need dozens or even hundreds of denoising steps.
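The speed gap comes down to how many times the network has to run per sample. The toy below just counts calls: `toy_generator` and `toy_denoiser` are placeholder functions, and `iterative_sample` is a bare loop standing in for a diffusion sampler, not a real one. The 50 steps are an arbitrary example value.

```python
def gan_sample(generator, z):
    """A GAN needs one forward pass per sample."""
    return generator(z)

def iterative_sample(denoiser, x, num_steps=50):
    """Toy stand-in for a diffusion sampler: one network call per step."""
    for t in reversed(range(num_steps)):
        x = denoiser(x, t)
    return x

calls = {"gan": 0, "diffusion": 0}

def toy_generator(z):
    calls["gan"] += 1              # count each network evaluation
    return 0.5 * z

def toy_denoiser(x, t):
    calls["diffusion"] += 1        # count each network evaluation
    return 0.9 * x

gan_sample(toy_generator, 1.0)
iterative_sample(toy_denoiser, 1.0, num_steps=50)
print(calls)                       # {'gan': 1, 'diffusion': 50}
```

With networks of similar size, that call count translates roughly into the latency difference per image or frame.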
I'm working on a project right now: using GANs for image super-resolution, then feeding that into a diffusion model for fine-grained generation.
The results are surprisingly good.
GANs handle the fast generation of basic structure, diffusion models fill in the details. Together, both speed and quality improve.
Isn't that interesting?
What Does the Future Hold?
Honestly, I think the core idea of GANs — adversarial training — will stick around forever.
Whether it's diffusion models now or some new paradigm that comes later, adversarial thinking is a powerful tool.
Just like noise-contrastive estimation (NCE), variational autoencoders (VAEs), and score matching: each has its own mathematical foundation, but they're all doing the same thing, teaching the model to learn the data distribution.
GANs are just one way to achieve that, and to my mind the most intuitive one.
I've been experimenting lately with bringing GAN-style adversarial training into diffusion model training, letting a generator and a discriminator play off each other during the diffusion process. Early results are promising: in my runs so far, generation quality improved by about 15% and training time dropped by about 20%.
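One common way to combine the two ideas is to add an adversarial term to the usual denoising loss. The sketch below is a generic illustration under that assumption, not the exact setup described above: `hybrid_loss`, the `adv_weight` of 0.1, and the idea of scoring the denoised estimate with a discriminator are all hypothetical choices.

```python
import numpy as np

def softplus(s):
    """Numerically stable log(1 + exp(s))."""
    return np.logaddexp(0.0, s)

def hybrid_loss(eps_pred, eps_true, d_logits_on_denoised, adv_weight=0.1):
    """Denoising MSE plus a weighted non-saturating adversarial term."""
    denoise = np.mean((eps_pred - eps_true) ** 2)
    # -log(sigmoid(logit)) == softplus(-logit): small when the discriminator is fooled
    adversarial = np.mean(softplus(-d_logits_on_denoised))
    return denoise + adv_weight * adversarial, denoise, adversarial

rng = np.random.default_rng(0)
eps_true = rng.normal(size=(4, 16))                        # noise that was added
eps_pred = eps_true + 0.1 * rng.normal(size=(4, 16))       # model's noise prediction
d_logits = rng.normal(size=4)                              # discriminator scores
total, denoise, adversarial = hybrid_loss(eps_pred, eps_true, d_logits)
print(total, denoise, adversarial)
```

The weight controls how much the discriminator's opinion matters relative to plain denoising accuracy, and in practice it needs tuning.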
See? That's how technology works.
You think a direction has hit a dead end, then you shift your thinking and find new ways to play.
Let me be straight with you:
Don't worship any technology, and don't dismiss any technology lightly.
The most important lesson GANs taught me is — there's no silver bullet, only what fits. You think diffusion models are unbeatable now, but in a couple of years, something new might come along and knock them off their pedestal.
Stay open. Stay hands-on. Stay restless.
That's the attitude you need for AI.
Because, you know what? This world has never been about one thing replacing another. It's about who walks further together.
Adversarial isn't the goal. Coexistence is.
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.