From Autoencoders to Generative Adversarial Networks: An Overview of the Current State of Unsupervised Learning Research

Generated: 2026-06-21 06:20:44

---

2016, the first time I was blown away by unsupervised learning

You know that feeling? My lab lead tossed me a paper and said, "Reproduce this autoencoder for image denoising."

Back then I was using TensorFlow 0.8 or 0.9, and the dataset was MNIST. After training, I watched those blurry digits covered in noise turn into clean, crisp numbers—

Seriously, in that moment, I got goosebumps all over!

No labels needed. The model just learned on its own what "clean strokes" looked like. This was pure magic!

But guess what the deep learning crowd was obsessed with at the time? ImageNet competitions! Supervised learning! Who cared about some "niche" like unsupervised learning?

Yet it was these "unimportant" directions that quietly put down the roots of what later exploded across AI: VAE, GAN, contrastive learning… every single one hit the pain point of data dependency head-on.

Speaking of which, let me walk through it chronologically—from the most primitive autoencoders all the way to today's unsupervised few-shot learning. I'll tell you about the pitfalls I fell into, the surprises I encountered, and my take on where this path is headed.

---

The Early Days: Autoencoders and Energy Models

Believe it or not, the earliest unsupervised learning wasn't meant for "generating" images at all—it was for "dimensionality reduction."

In 2006, Hinton introduced Deep Belief Networks (DBNs), which were essentially stacked Restricted Boltzmann Machines (RBMs). The RBM idea: encode input data into binary hidden variables and maximize likelihood for reconstruction.

But training them was like walking a tightrope!

The partition function is intractable, so training leans on approximations such as contrastive divergence, and the whole thing would blow up numerically at the slightest provocation. I later tried using RBMs for feature extraction on CIFAR-10, let it run all night, and the result?

It wasn't even as good as K-means clustering plus a nonlinear mapping!

Adam Coates and Andrew Ng made it crystal clear in their paper: treat K-means as feature learning, throw on a pooling layer, and it's more stable than RBMs.
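
To make "K-means as feature learning" concrete, here is a minimal sketch. It runs plain Lloyd's K-means and then encodes a point with the soft "triangle" activation often used with this approach: each feature is how much closer the point is to a centroid than the average distance. The toy 2-D blobs, the function names, and the choice of two centroids are all assumptions for illustration; real pipelines work on whitened image patches and add pooling on top.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns k centroids as lists of floats."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster goes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def triangle_features(x, centroids):
    """Soft encoding: max(0, mean distance - distance to centroid k)."""
    dists = [math.dist(x, c) for c in centroids]
    mean_dist = sum(dists) / len(dists)
    return [max(0.0, mean_dist - d) for d in dists]

# Toy data (assumption): two tight blobs standing in for image patches.
rng = random.Random(1)
blob_a = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(50)]
blob_b = [(rng.gauss(5, 0.1), rng.gauss(5, 0.1)) for _ in range(50)]
centroids = kmeans(blob_a + blob_b, k=2)

# A point near blob A activates only the feature of the nearby centroid.
features = triangle_features((0.0, 0.0), centroids)
```

Far-away centroids contribute exactly zero, so the features come out sparse, which is part of why such a simple method works as an encoder.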

Of course, K-means has its own problems. Its clusters are split by linear boundaries, so it struggles with structure that isn't linearly separable. And you have to train each layer separately, with no global optimization. Stack two layers of K-means and the gains fall off quickly, which is kind of a pain.

But it's so simple. Even now, many beginners start with it when diving into unsupervised feature learning.

Around the same time, autoencoders started stepping into the spotlight.

A typical autoencoder works like this: the encoder compresses an image into a low-dimensional vector, the decoder reconstructs the image from that vector, and the loss function is the reconstruction error.

This was much cleaner than RBMs—no complex sampling of probabilistic models, just straightforward MSE optimization.
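
Here is about the smallest autoencoder I can think of, as a sketch rather than anything I actually trained back then. It assumes toy 2-D data lying on a line, a linear encoder down to one latent number, a linear decoder back to 2-D, and plain gradient descent on the MSE. All names and hyperparameters are made up for illustration.

```python
# Toy data (assumption): 2-D points on the line y = 2x, so a single latent
# dimension is enough to describe them.
data = [(t / 10, 2 * t / 10) for t in range(-10, 11)]

def reconstruct(x, enc, dec):
    """Encoder: 2-D input -> 1-D code z. Decoder: z -> 2-D reconstruction."""
    z = enc[0] * x[0] + enc[1] * x[1]
    return z, (dec[0] * z, dec[1] * z)

def mse(enc, dec):
    """Mean squared reconstruction error over the dataset."""
    total = 0.0
    for x in data:
        _, x_hat = reconstruct(x, enc, dec)
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total / len(data)

def train(steps=500, lr=0.05):
    enc, dec = [0.1, 0.1], [0.1, 0.1]  # small deterministic init
    for _ in range(steps):
        g_enc, g_dec = [0.0, 0.0], [0.0, 0.0]
        for x in data:
            z, x_hat = reconstruct(x, enc, dec)
            # Gradient of the mean squared error w.r.t. the reconstruction.
            err = [2 * (x_hat[i] - x[i]) / len(data) for i in range(2)]
            dz = err[0] * dec[0] + err[1] * dec[1]  # backprop into the code
            for i in range(2):
                g_dec[i] += err[i] * z
                g_enc[i] += dz * x[i]
        enc = [enc[i] - lr * g_enc[i] for i in range(2)]
        dec = [dec[i] - lr * g_dec[i] for i in range(2)]
    return enc, dec

loss_before = mse([0.1, 0.1], [0.1, 0.1])
enc, dec = train()
loss_after = mse(enc, dec)
```

No sampling and no partition function anywhere: just a forward pass, an error, and a gradient step.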

I did experiments: on MNIST, a three-layer autoencoder could achieve very low reconstruction loss. But when I tried interpolating between latent variables—disaster!

The new generated images were often monstrosities: halfway between two digits turned into meaningless scribbles.

Why? Because the latent space of a plain autoencoder is unregularized. Nothing forces the codes to fill the space continuously, so the regions between encoded points can decode to garbage.
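
For reference, "interpolating between latent variables" just means walking a straight line between two codes and decoding each stop. A minimal sketch, where the 2-D codes are hypothetical stand-ins for real encoder outputs:

```python
def interpolate(z_a, z_b, steps):
    """Return `steps` latent codes evenly spaced from z_a to z_b (inclusive)."""
    codes = []
    for i in range(steps):
        t = i / (steps - 1)
        codes.append([(1 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return codes

# Hypothetical 2-D codes for two digits; in practice these come from the encoder.
z_three, z_eight = [-1.0, 0.5], [2.0, -1.5]
path = interpolate(z_three, z_eight, steps=5)
# Each code in `path` would then go through the decoder to render one frame.
```

The arithmetic is trivial. Whether the in-between codes decode to anything sensible depends entirely on how the latent space was trained.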

That led to VAE.

---

VAE: Taming the Latent Variable with Probability

In 2013, Kingma and Welling introduced the Variational Autoencoder (VAE).

The core change was just one sentence: I don't want a deterministic latent vector; I want the latent vector to follow a prior distribution (usually a normal distribution).

The encoder outputs mean and variance, then you use the reparameterization trick to sample the latent variable, which goes to the decoder.

With the KL divergence regularization term, the latent space becomes continuous and structured.
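
The two pieces just described, the reparameterization trick and the KL term, fit in a few lines. This sketch assumes the encoder outputs a mean and a log-variance per latent dimension (a common convention, not something stated above) and a standard normal prior. The numbers are made up.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1).

    The randomness lives entirely in eps, so in an autodiff framework
    gradients could flow through mu and sigma.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over dimensions."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -2.0]  # made-up encoder outputs
z = reparameterize(mu, log_var, rng)    # this is what the decoder would see
kl = kl_to_standard_normal(mu, log_var)
```

The KL term is zero only when the encoder outputs exactly the prior (mean 0, variance 1), and it grows as codes drift away or collapse to points. That pull is what keeps the latent space packed and continuous.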

You could smoothly move along a direction—the generated handwritten digits would naturally morph! From a 3 slowly turning into an 8, from an 8 into a 0…

I reproduced VAE myself in 2018 with Keras 2.0, setting the latent dimension to 2. Directly on a 2D plane, I could see digits continuously change from one form to another.

That visualization was just beautiful!

But the price of VAE is that generated images always have that "plastic" feel.

In a 2018 paper comparing VAE and GAN, the VAE's final MSE on MNIST was 34.66, and the generated images were fuzzy. But note: it covered all ten digit classes without dropping any mode.

VAE is like the obedient but uncreative student in class. When you ask it to reconstruct, it makes sure the overall numbers are close, but it ignores high-frequency details.

Later I tried VAE for face generation on CelebA—every generated face looked like I was seeing it through frosted glass. The outlines were recognizable, but wrinkles, hair strands—all blurred together.

To fix this, I added various perceptual losses on top of the reconstruction loss, even swapped the decoder for a deeper ResNet…

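But no matter what I tuned, VAE's blurriness was structural. It's fundamentally optimizing a lower bound on likelihood, typically with a pixel-wise reconstruction term, so when several sharp outputs are plausible it hedges toward their average. Nothing in the objective tries to fool your eyes.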

---

GAN: The Sharpness Revolution Through an Adversarial Game

In 2014, Ian Goodfellow introduced Generative Adversarial Networks (GANs).

That was the real explosive method!

The generator fakes images from noise, the discriminator tells real from fake. They fight each other, and eventually the generator can fool the discriminator.
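
In loss terms, the fight looks roughly like this. The sketch takes discriminator outputs as plain probabilities (made-up numbers, no networks involved) and uses the non-saturating generator loss, which is the practical variant rather than the pure minimax form.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real images labeled 1, generated images labeled 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating generator loss: push D(G(z)) toward 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Made-up discriminator outputs on a batch of real and generated images.
confident_d = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
fooled_d = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

When the discriminator can do no better than a coin flip, its loss sits at 2·ln 2 (about 1.386). That is the equilibrium the generator is pushing toward.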

I first ran a GAN at the end of 2017, using PyTorch 0.3 to train a DCGAN on the LSUN bedroom dataset. About 4 hours of training…

The generated bedroom images—some were so real they looked like IKEA catalogs, others were completely deformed—nightstands with three legs.

And then there's mode collapse, a separate failure: you train hard all day, and the generator settles on a handful of outputs. Mine seemed to know only two modes: "nightstand" and "pillow."

DCGAN's discriminator design was simple: input an image, output a real/fake probability. The generator upsamples a 100-dimensional noise vector into a 64×64 image via transposed convolutions.
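
How do you get from a noise vector to 64×64? The sketch below tracks just the spatial size through a stack of transposed convolutions. The kernel, stride, and padding values are an assumed DCGAN-style configuration (similar to common tutorial implementations), not necessarily what I ran.

```python
def conv_transpose_out(size, kernel, stride, padding):
    """Output side length of a transposed convolution
    (no output padding, dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel

# (kernel, stride, padding) per layer: an assumed DCGAN-style stack.
layers = [(4, 1, 0), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1)]

size = 1  # the 100-d noise vector is treated as a 100-channel 1x1 feature map
sizes = []
for kernel, stride, padding in layers:
    size = conv_transpose_out(size, kernel, stride, padding)
    sizes.append(size)
# sizes == [4, 8, 16, 32, 64]
```

The first layer expands 1×1 to 4×4, and each stride-2 layer after that doubles the resolution.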

The structure wasn't complex, but stability was terrible!

Learning rate, optimizer, weight initialization: even a slight deviation and the loss would blow up. I tried changing Adam's beta1 from 0.5 to 0.9, and the training curve went from oscillating to exploding, so I had to go back to 0.5.

The reported MSE for DCGAN was 36.3, higher than VAE's 34.66. But to the human eye, it was much sharper!

What does this contradiction tell us? MSE has no absolute relationship with perceived visual quality!
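
A toy example makes the point. Take a crisp 1-D "edge" as the target, then compare a prediction that is equally crisp but one pixel off against one that smears the edge in the right place. The signals are invented for illustration.

```python
def mse(pred, target):
    """Mean squared error between two equal-length signals."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

target        = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]    # a crisp edge
sharp_shifted = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]    # equally crisp, one pixel off
blurry        = [0.0, 0.0, 0.25, 0.75, 1.0, 1.0]  # smeared edge, right location

mse_sharp = mse(sharp_shifted, target)
mse_blurry = mse(blurry, target)
```

The blurry prediction wins on MSE (about 0.021 versus about 0.167) even though the shifted one is the only one with a sharp edge. MSE rewards hedging, while your eyes reward crispness.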

GAN's role in unsupervised learning is subtle: it inherently doesn't need labels, because "real vs. fake" is a natural binary supervisory signal.

But many overlook that the original GAN gave you no control over its output: you couldn't choose which digit it generated. Control only arrived once you fed it class labels (Conditional GAN, cGAN), and at that point it's no longer purely unsupervised.

In 2016, Yann LeCun described GANs and their variants as "the most interesting idea in the last 10 years in ML."

I thought that was a bit of a stretch at the time, but later developments proved that assessment basically held.

After DCGAN, WGAN swapped in the Wasserstein distance to make training more stable; StyleGAN modulated styles in latent space, generating faces almost indistinguishable from real ones.

But GAN has one fundamental weakness…

It can only generate, not encode!

Give it an image, and it can't give you a compact feature vector. In other words, the representation it learns isn't meant for inference; it's meant to fool the discriminator.

That's the exact opposite of VAE.

---

Fusion and Showdown: Why VAE-GAN Matters

See—VAE is strong at encoding but blurry at generation; GAN is sharp but has no encoder.

Combining them seems like a natural direction, right?

I tried VAE-GAN: use the VAE structure, but replace

Cael Lee
