As Fast as Adam, as Accurate as SGD, and Stable for Training GANs

Generated: 2026-06-23 10:08:25

---

When I first saw the NeurIPS 2020 AdaBelief paper, my immediate reaction was: here we go again! I was sick of the whole "new optimizer beats Adam" routine. Think about it: at least a dozen papers with that claim come out every year, and in the end either nobody uses them or they only work on some toy dataset. But this one… I admit, it made me eat my words. It was a spotlight paper, with code open-sourced right away and nothing hidden. I spent a night reading it, then ran experiments for two days, and the result? One phrase popped into my head: there's something here.

---

First, the Stalemate: SGD and Adam, Both Unsatisfying

Back in 2015, when I started deep learning, there was effectively one optimizer option: SGD with momentum. Then the DCGAN paper recommended Adam with a learning rate of 0.0002 and the momentum term beta1 set to 0.5. I tried it, and hey, it was way more stable than SGD. But on classification tasks, Adam's generalization lagged behind SGD by a noticeable margin. Guess how much? With the same ResNet-50 on ImageNet, Adam converged two to three times faster, but the final top-1 accuracy was 1.5 to 2 percentage points lower. In 2018, Zhihu (China's Quora) was full of debates: adaptive methods are fast but generalize poorly, SGD is slow but effective. There was no silver bullet, so you had to switch manually: different tasks, different optimizers. Annoying? Extremely!

GAN training was a perfect example. It is a non-stationary optimization problem, and from 2015 to 2017 everyone defaulted to Adam (beta1=0.5), though some pushed RMSProp. I struggled with it too until I read a blog post (by Jason Brownlee) saying that adaptive learning rates handle differing gradient scales and momentum boosts stability. Following that, I lowered beta1 from 0.9 to 0.5, and the oscillations did decrease. But even then, GAN training would still collapse without warning, wasting a whole night's run and crushing your motivation.

So when AdaBelief came along and claimed "speed like Adam, generalization like SGD, and rock-solid stability for training GANs," the first thought in my head was: "Could you show the proof before talking big?"

---

October 2020: Breaking the Impossible Trinity

The team at Yale University posted the paper on arXiv in October 2020, and NeurIPS 2020 accepted it. One detail caught my eye: the paper didn't say "surpass"; it said "simultaneously achieves three advantages." That phrasing was clever. It didn't claim to beat anyone; it just turned what everyone thought was an impossible trinity into something possible.

What's the algorithmic difference? One sentence: Adam's denominator is the exponential moving average (EMA) of squared gradients; AdaBelief's denominator is the EMA of (current gradient minus first-order momentum) squared. Adam looks at gradient magnitude; AdaBelief looks at whether the gradient deviates from its expected trajectory.

Don't get confused; let me use an analogy. You're walking down a shaky staircase. Adam checks how big your step is: big step? Take a smaller one. AdaBelief, on the other hand, checks how much your step differs from what you predicted. If you miss a step (a sudden, unexpectedly large gradient), pull your foot back fast. If every step goes as expected, stride forward boldly. Trust leads to speed; doubt leads to caution. That's the "belief" mechanism.
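To make the difference concrete, here is a minimal scalar sketch of the two update rules. It is my own illustration in plain Python, not the authors' released code: the function names, the `run` helper, and the `eps=1e-8` default are assumptions, and as far as I understand the paper, AdaBelief also adds `eps` inside the second-moment update, which I mirror here.

```python
import math

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * g            # EMA of gradients
    v = beta2 * v + (1 - beta2) * g * g        # EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

def adabelief_step(theta, g, m, s, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaBelief update: only the denominator differs from Adam."""
    m = beta1 * m + (1 - beta1) * g
    # EMA of the squared deviation of the gradient from its prediction m
    s = beta2 * s + (1 - beta2) * (g - m) ** 2 + eps
    m_hat = m / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(s_hat) + eps)
    return theta, m, s

def run(step_fn, grads, theta=0.0):
    """Hypothetical helper: feed a gradient sequence through one optimizer."""
    m = second = 0.0
    for t, g in enumerate(grads, start=1):
        theta, m, second = step_fn(theta, g, m, second, t)
    return theta

steady = [1.0] * 50  # a gradient that behaves exactly as predicted
print("Adam     :", run(adam_step, steady))
print("AdaBelief:", run(adabelief_step, steady))
```

With a perfectly steady gradient, the deviation (g - m) shrinks toward zero, so AdaBelief's denominator stays small and it travels farther than Adam over the same number of steps. That is the "trust leads to speed" half of the analogy.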

The paper's authors showed a diagram: in a narrow valley, the gradient keeps flipping sign across the valley (the y-direction) but stays steady along it (the x-direction). Adam only looks at gradient magnitude, so it cannot tell the two situations apart and makes slow progress in x while bouncing in y. AdaBelief strides ahead in x while suppressing the y-direction oscillations. I tried it on my own CIFAR-10 classification model, and the convergence curve was smooth as silk, with a final accuracy 0.3 percentage points higher than Adam. Not a huge gap, but at least they weren't lying.
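The valley argument can be checked on a toy gradient sequence. The sketch below is an assumption-laden illustration rather than the paper's experiment: `mean_step_ratio` is a hypothetical helper that measures the average update size relative to the learning rate for one coordinate, and the two gradient sequences are made-up stand-ins for "along the valley" and "across the valley".

```python
import math

def mean_step_ratio(grads, belief, beta1=0.9, beta2=0.999, eps=1e-8):
    """Average |update| / lr for one coordinate over a gradient sequence.

    belief=False uses Adam's denominator (EMA of g^2);
    belief=True uses AdaBelief's (EMA of (g - m)^2).
    """
    m = second = 0.0
    total = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        signal = (g - m) if belief else g
        second = beta2 * second + (1 - beta2) * signal ** 2
        if belief:
            second += eps                      # AdaBelief adds eps inside the EMA
        m_hat = m / (1 - beta1 ** t)
        second_hat = second / (1 - beta2 ** t)
        total += abs(m_hat) / (math.sqrt(second_hat) + eps)
    return total / len(grads)

steps = 200
along_valley = [0.1] * steps                          # small but steady gradient (x)
across_valley = [(-1.0) ** t for t in range(steps)]   # large, sign-flipping gradient (y)

for name, belief in (("Adam", False), ("AdaBelief", True)):
    print(name,
          "along:", round(mean_step_ratio(along_valley, belief), 3),
          "across:", round(mean_step_ratio(across_valley, belief), 3))
```

In this toy setup, Adam's step along the valley stays at roughly one learning rate no matter how consistent the gradient is, while AdaBelief's is larger because the gradient keeps matching its prediction. Across the valley, both methods take small net steps, because the sign flips cancel out in the momentum term.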

---

The Data Convinced Me Halfway

The experimental section of the paper was solid. On ImageNet with ResNet-34, AdaBelief got 73.51% top-1 accuracy, SGD got 73.49%, and Adam only 72.07%. The key point: SGD took 90 epochs, while AdaBelief achieved that level in only 60 epochs. In language modeling (Transformer on PTB), AdaBelief's perplexity was 4 points lower than Adam and on par with SGD.

The GAN results were even more striking. On CIFAR-10, the FID score dropped from 19.06 with Adam to 15.07 with AdaBelief, and the quality of generated samples showed almost no wild fluctuations during training. I tried it with a DCGAN setup: previously, with Adam (beta1=0.5), mode collapse would set in after about 40 epochs. After switching to AdaBelief, it ran for 80 epochs and stayed stable! Note that I used the default hyperparameters (beta1=0.9, beta2=0.999, lr=0.001) with no extra tuning at all. That surprised me a bit: if you use Adam with beta1=0.9 on GANs, the oscillations drive you crazy. AdaBelief somehow held up.

The code was clean too: the PyTorch version mirrors the torch.optim.Adam interface, with identical memory usage. It ran less than 5% slower, since the only extra work is computing (g_t - m_t)^2, which is essentially free.

---

Community Voices: Cheers and Skepticism

The discussion on Zhihu was lively back then. Some analyzed it from a bilevel optimization perspective, saying RMSProp suits non-stationary optimization, and adding momentum to Adam introduces unnecessary historical influence. AdaBelief retains momentum while making the denominator more sensitive to sudden changes. This aligns with a 2018 article interpreting deep learning optimizers: an ideal optimizer should account for loss curvature, not just gradient magnitude. AdaBelief's "belief" mechanism is essentially an implicit curvature estimate.

But there was plenty of skepticism too—the paper didn't include experiments on large-scale Transformers (like BERT or GPT), which are extremely demanding on the optimizer. I only had a single GPU, so I couldn't test it myself. Later in 2021, someone applied AdaBelief to ViTs, and the results were similar to AdamW—no significant advantage. That probably explains why, as of 2024, large model training is still dominated by AdamW.

There was also a practical pitfall: weight decay. Adam has a known flaw: when weight decay is implemented as an L2 term added to the gradient, the adaptive denominator rescales it along with everything else, so the regularization no longer behaves as intended. AdamW fixed this by decoupling the two. AdaBelief's paper didn't specifically address weight decay, and when I used the same implementation as Adam, the results were unstable. Later, someone modified it to AdaBeliefW, but it was never officially released. If you want to try it, I recommend applying weight decay separately rather than mixing it into the gradients. It's a small trap, but don't step into it.
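Here is a rough sketch of what "separately" means, again for a single scalar parameter. It is my own illustration, not the AdaBeliefW variant mentioned above: `adabelief_step_wd` and its `decoupled` flag are hypothetical names, and the decoupled branch simply borrows the AdamW idea of shrinking the weight directly.

```python
import math

def adabelief_step_wd(theta, g, m, s, t, lr=1e-3, weight_decay=1e-2,
                      decoupled=True, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaBelief-style update with either decoupled or L2-style weight decay."""
    if decoupled:
        # AdamW-style: shrink the weight directly, outside the adaptive scaling
        theta = theta * (1 - lr * weight_decay)
    else:
        # L2-style: the decay term is mixed into the gradient and gets rescaled
        g = g + weight_decay * theta
    m = beta1 * m + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * (g - m) ** 2 + eps
    m_hat = m / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(s_hat) + eps)
    return theta, m, s

# With a zero loss gradient, only weight decay moves the parameter.
theta_dec, _, _ = adabelief_step_wd(1.0, 0.0, 0.0, 0.0, t=1, decoupled=True)
theta_l2, _, _ = adabelief_step_wd(1.0, 0.0, 0.0, 0.0, t=1, decoupled=False)
print("decoupled shrink:", 1.0 - theta_dec)
print("L2-style shrink :", 1.0 - theta_l2)
```

In the decoupled branch the shrinkage is exactly lr * weight_decay * theta, so the weight_decay value means what it says. In the L2 branch the decay term passes through the adaptive denominator, so the effective shrinkage is set by the optimizer state rather than by weight_decay itself.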

---

Four Years Later, Where Is AdaBelief?

It didn't replace Adam as some predicted, but it has firmly carved out its niche in specific scenarios. For image classification, it's easier to tune than SGD and generalizes better than Adam; for GAN training, it eliminates a bunch of tricks (tweaking beta1, gradient clipping, etc.). The old DCGAN wisdom—lower beta1 to 0.5—is unnecessary with AdaBelief; the default 0.9 works fine.

My personal habit: I start new projects with AdaBelief—if convergence isn't satisfactory, I switch back to SGD or AdamW. It adds no extra cost yet gives you a "what if it works" chance. Just think about it—what if?

But I have to admit, it hasn't broken into the large-model domain. AdamW with cosine annealing or linear decay is already the engineering standard answer; there's no incentive to switch. However, if you're working on small- to medium-scale models, GANs, or your own small project, I suggest giving AdaBelief a try. My own ResNet, medium Transformer, and GAN models have benefited—no reason not to recommend it.

As for the future, I doubt we will see another paradigm-shifting optimizer. It took years to go from SGD to Adam, and the move from Adam to AdamW felt more like a bug fix.

Cael Lee
