Optimization Methods in Machine Learning

Generated: 2026-06-22 17:41:52

---

Oh man, talking about machine learning optimization—my first attempt almost made me cry.

Back when I was running a simple linear regression, I thought to myself, "Gradient descent is just sliding downhill, how hard can it be?" Well, guess what? I set the learning rate a tiny bit too high, and the loss shot straight up into the sky like a firework! Set it too low, and after an hour of training, the loss was still wandering around in circles, miles away from the optimal solution! Tell me that's not frustrating!

Later it finally clicked—optimizers are just like driving a car: knowing you need to go downhill isn't enough; the key is how you go down.

---

First, let's clarify what we're actually trying to do

If you break gradient descent down, it's really just three questions: Which direction? How big a step? How long do we keep going?

Concepts like derivatives, partial derivatives, and gradients tripped me up for a while too. To put it bluntly: with one variable it's a derivative; with multiple variables, each one gets its own partial derivative; stack all the partial derivatives into a vector, and you've got a gradient. The gradient points in the direction of steepest increase of the function, so to make the loss smaller, you just charge full speed in the opposite direction.

That logic makes sense, right? But as soon as you try it, you realize—the road is anything but flat!
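To see how much the step size matters, here is a minimal sketch on a toy problem. The function f(w) = w**2, the starting point, and the three learning rates are my own illustrative choices, not anything from a real training run.

```python
def gradient_descent(lr, steps=50, w0=5.0):
    """Minimize the toy function f(w) = w**2 with a fixed learning rate."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w       # derivative of w**2
        w = w - lr * grad    # step in the direction opposite to the gradient
    return w

# Illustrative learning rates: too high, too low, and reasonable.
for lr in (1.1, 0.001, 0.3):
    w = gradient_descent(lr)
    print(f"lr={lr}: final w={w:.4g}, final loss={w * w:.4g}")
```

Each update multiplies w by (1 - 2 * lr). With lr=1.1 that factor is -1.2, so w grows in magnitude every step, which is the firework. With lr=0.001 the factor is 0.998, so after 50 steps w has barely moved.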

Today I want to walk through how these optimization methods evolved step by step, from "running naked down the hill" to "self-driving mode."

---

Step One: Gradient Descent, the original version

The earliest method was batch gradient descent—pull out the entire training set every time to compute the gradient.

When the dataset was small, it worked fine: one step at a time, stable. But then I trained a model on half a million samples! Every iteration, the server fans would take off like a helicopter! I'd practically fall asleep waiting for each gradient. Did I look like I was doing research? More like I was waiting for the day to end!

Then I switched to stochastic gradient descent (SGD)—pick just one sample at a time. That was fast, really fast! But the loss curve jumped around like a heart monitor with atrial fibrillation—never converged!

Then came mini-batch gradient descent, with the batch size set to 32 or 64. Now that was more like it! Computation was manageable, stability was decent. Looking back, this whole "compromise" mindset runs through the entire history of optimization methods: no perfect solution, just what's good enough.
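The three variants differ only in how many samples feed each gradient estimate. Here is a rough sketch on a made-up 1-D regression problem (y = 3x plus noise). The data, the learning rate, and the epoch count are all assumptions for illustration.

```python
import random

def make_data(n=200, seed=0):
    """Synthetic data: y = 3 * x + small Gaussian noise (illustrative only)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys

def train(xs, ys, batch_size, lr=0.1, epochs=20, seed=0):
    """Fit y = w * x by gradient descent on mean squared error."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(idx)  # new sample order every epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # Gradient of mean((w * x - y)**2) over the current batch
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w

xs, ys = make_data()
for name, bs in (("batch", len(xs)), ("sgd", 1), ("mini-batch", 32)):
    print(name, round(train(xs, ys, bs), 3))
```

With the same 20 epochs, the full-batch run makes only 20 updates in total, while batch size 1 makes 4,000 noisy ones. Batch size 32 sits in between, which is the compromise described above.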

---

The Arrival of Momentum: Giving the downhill traveler inertia

SGD was fast, but it had a problem: it would oscillate back and forth across valleys. I remember one case vividly—training a simple MLP, the loss kept bouncing around the optimum, unable to settle down.

Then I set momentum to 0.9, and guess what?

The effect was immediate! It was like giving the person running downhill a push, letting them carry inertia right over small pits, and even bounce out if they fell in. This "inertia" later became the core of almost all advanced optimizers. Tell me that's not important!
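The "inertia" is just a running velocity that accumulates past gradients. Here is a small sketch using the classic heavy-ball form on a narrow valley. The quadratic, the starting point, and the learning rate are hypothetical values I picked to make the effect visible.

```python
def loss(w):
    x, y = w
    return 0.5 * (x * x + 100.0 * y * y)   # narrow valley: steep in y, flat in x

def grad(w):
    x, y = w
    return (x, 100.0 * y)

def run(lr=0.018, momentum=0.0, steps=200):
    """Gradient descent with heavy-ball momentum; momentum=0.0 is plain GD."""
    w = [10.0, 1.0]
    v = [0.0, 0.0]                          # velocity, one entry per parameter
    for _ in range(steps):
        g = grad(w)
        for i in range(2):
            v[i] = momentum * v[i] - lr * g[i]   # keep part of the old velocity
            w[i] += v[i]
    return loss(w)

print("plain GD loss:     ", run(momentum=0.0))
print("momentum=0.9 loss: ", run(momentum=0.9))
```

Plain gradient descent has to keep the step small to survive the steep direction, so it crawls along the flat one. The velocity term lets progress build up along the flat direction without raising the learning rate.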

Many people might not know this: SGD + Momentum is still king in many CV competitions. I've personally tested it: training ResNet50 at ImageNet scale, SGD with momentum and step decay can really go toe-to-toe with Adam. The price you pay is more time tuning the learning rate. Adam's adaptivity saves you trouble, but everything that saves you trouble comes with some other cost, right?

---

Adaptive Learning Rates: From Manual to Automatic

Before Adam, there were two paving stones: AdaGrad and RMSProp.

AdaGrad had a clever idea: give each parameter its own learning rate, with smaller steps for frequently updated parameters and larger steps for rarely updated (sparse) ones. Sounds nice, but when you actually use it, you'll see: it works okay on convex problems, but in neural networks? It keeps adding up squared gradients, so the effective learning rate shrinks toward zero in later stages, and the model just stops learning! I gave up on it after my first try, like "What the heck is this?!"

RMSProp improved on that by replacing the running sum with an exponential moving average of squared gradients, so the learning rate doesn't monotonically decay. It worked well on tasks like RNNs, but it wasn't mind-blowing.
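The difference between the two shows up if you track the effective step size for a single parameter. In this sketch I assume the gradient magnitude stays constant at 1.0, and the base learning rate and decay factor are illustrative values.

```python
import math

def effective_lrs(steps=1000, lr=0.1, grad=1.0, rho=0.9, eps=1e-8):
    """Return (AdaGrad, RMSProp) effective learning rates after `steps` updates."""
    acc_sum = 0.0   # AdaGrad: running sum of squared gradients, only ever grows
    acc_avg = 0.0   # RMSProp: exponential moving average of squared gradients
    for _ in range(steps):
        acc_sum += grad * grad
        acc_avg = rho * acc_avg + (1.0 - rho) * grad * grad
    return lr / (math.sqrt(acc_sum) + eps), lr / (math.sqrt(acc_avg) + eps)

adagrad_lr, rmsprop_lr = effective_lrs()
print("AdaGrad effective lr:", adagrad_lr)
print("RMSProp effective lr:", rmsprop_lr)
```

The AdaGrad accumulator keeps growing, so its effective rate keeps falling like 1/sqrt(t). The moving average levels off, so the RMSProp rate stays near the base value.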

Then Adam came along and blended momentum with the RMSProp approach!
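Roughly, that blend looks like the sketch below: a momentum-style average of gradients divided by an RMSProp-style average of squared gradients, both with bias correction. The toy objective (w - 3)**2 is hypothetical, and I use lr=0.1 instead of the usual 1e-3 default only so the toy problem finishes in a few hundred steps.

```python
import math

def adam_minimize(grad_fn, w, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Single-parameter Adam loop (a sketch, not a library implementation)."""
    m = 0.0   # first moment: moving average of gradients (the momentum part)
    v = 0.0   # second moment: moving average of squared gradients (the RMSProp part)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)   # bias correction for the zero initialization
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy objective f(w) = (w - 3)**2, whose gradient is 2 * (w - 3).
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0, lr=0.1)
print("w after Adam:", w_final)
```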

The first time I used Adam, I was almost moved to tears—barely any learning rate tuning needed, the default 1e-3 gave decent results! For me at the time, it was revolutionary! No more crouching in front of the screen staring at the loss curve, frantically changing the learning rate every time I saw a plateau!

But later I discovered Adam has a pitfall: the way it handles weight decay is flawed. With standard L2 regularization, the penalty is folded into the gradient and then rescaled by Adam's per-parameter adaptive step, so the actual amount of decay no longer matches what you asked for, and generalization is sometimes worse than SGD. So the big brains came up with AdamW, which decouples weight decay from the adaptive learning rate.
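A tiny sketch makes the difference concrete. It computes only the very first Adam step for one weight, with the loss gradient set to zero so that weight decay is the only force acting. The function name and all values are hypothetical, and real libraries organize this differently.

```python
import math

def first_step(w, loss_grad, lr=1e-3, wd=1e-2, decoupled=True,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """Return w after one Adam step (t=1) with either style of weight decay."""
    # L2 style folds the penalty gradient wd * w into the gradient Adam sees.
    g = loss_grad if decoupled else loss_grad + wd * w
    m_hat = ((1.0 - beta1) * g) / (1.0 - beta1)       # bias-corrected first moment at t=1
    v_hat = ((1.0 - beta2) * g * g) / (1.0 - beta2)   # bias-corrected second moment at t=1
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        w_new -= lr * wd * w   # AdamW style: decay applied outside the adaptive scaling
    return w_new

for wd in (0.01, 0.1):
    l2 = 1.0 - first_step(1.0, 0.0, wd=wd, decoupled=False)
    dec = 1.0 - first_step(1.0, 0.0, wd=wd, decoupled=True)
    print(f"wd={wd}: L2-in-Adam shrink={l2:.3g}, decoupled shrink={dec:.3g}")
```

In the L2 variant the adaptive normalization divides the decay term by its own magnitude, so making wd ten times larger barely changes the step. In the decoupled variant the shrinkage scales directly with wd.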

What's the common practice now? Start a new task with AdamW, learning rate 1e-3, weight decay 1e-2, plus Warmup and Cosine Annealing. This combination has become the standard for the era of large models.

If you're a Transformer fan, you've definitely heard this sentence before:

In the early steps of training, you absolutely must add a warmup phase for the learning rate!

Why? Think about it: right after initialization, all parameters are random and the gradient directions are as chaotic as headless flies. If you floor the accelerator at that point, the model can fly straight into some bad region that is very hard to recover from afterward! Warmup lets the model take small, tentative steps to feel out the neighborhood first; once it has a sense of direction, then step on the gas.

---

Learning Rate Schedules: You Can't Drive at the Same Speed All the Time

Many tutorials stop after discussing optimizers, but let me tell you: the scheduling strategy is just as important! Pick the right optimizer but the wrong schedule, and you still won't train well.

The dumbest thing I ever did: use a fixed learning rate from start to finish. Training with AdamW for 100 epochs, the first 20 converged okay, but then? Stuck in place, loss didn't move a millimeter! So much wasted compute!

Then I switched to Step Decay, dividing by 10 every 30 epochs. Simple and brutal, right? But the problem is the decay step is hardcoded. If you haven't reached a plateau by epoch 30, the decay comes too early. I had a task where I set the step to 20, and right around epoch 15 the model was just about to settle—the learning rate drop sent it oscillating again. Frustrating, right?
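For reference, the "divide by 10 every 30 epochs" rule fits in one line. The base learning rate of 0.1 here is an assumed value, not the one from my runs.

```python
def step_decay(epoch, base_lr=0.1, step_size=30, gamma=0.1):
    """Multiply the learning rate by `gamma` once every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)

for epoch in (0, 29, 30, 59, 60, 90):
    print(epoch, step_decay(epoch))
```

The drop happens at fixed epochs no matter what the loss curve is doing, which is exactly why a badly placed step can knock the model around.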

Cosine Annealing is what I use most often now. The learning rate smoothly decreases from its maximum, no "cliff drops." In practice, on several classification tasks, the final accuracy was 0.5-1 point higher than Step Decay. And when paired with Warmup, it's basically the official default.

Speaking of a detailed Warmup configuration: I once trained a small BERT model with a total of 10,000 steps. I set warmup to 1,000 steps, about 10%. The learning rate went from 0 linearly up to 3e-4 over the first 1,000 steps, then decayed via cosine annealing down to 1e-6. If I had used a fixed learning rate of 3e-4 from the start? The loss shot up to NaN, no chance to recover!
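That configuration can be written as a single function of the step number. This is a minimal sketch using the numbers above (10,000 total steps, 1,000 warmup steps, peak 3e-4, floor 1e-6); the function name and exact formula are my own reconstruction, not the code from that run.

```python
import math

def lr_at(step, total_steps=10_000, warmup_steps=1_000, peak_lr=3e-4, min_lr=1e-6):
    """Linear warmup from 0 to peak_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, from 0.0 to 1.0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 500, 1_000, 5_500, 10_000):
    print(step, lr_at(step))
```

In a training loop you would call this once per step and write the result into the optimizer's learning rate before each update.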

Warmup is almost mandatory for large-batch training. In Google's original BERT paper, with batch size 256, they used 10,000 warmup steps (about 1% of the 1,000,000 total training steps), considered conservative at the time. Nowadays, many approaches set the warmup ratio to 5% to 10% of total steps.

Another one is OneCycleLR, which is more aggressive: the learning rate first increases then decreases, allowing for a larger learning rate in the middle phase. I tried it on a fine-grained classification task—it did converge faster, but only with careful parameter tuning, otherwise it'd blow up at the peak—like driving a car and realizing at the cliff's edge that the brakes don't work.
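The shape of a one-cycle schedule can be sketched like this. It is a simplified version with cosine ramps in both phases. The parameter names are borrowed from PyTorch's OneCycleLR, but this is not that implementation, and the values are illustrative.

```python
import math

def cos_interp(start, end, frac):
    """Cosine interpolation from `start` (frac=0) to `end` (frac=1)."""
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * frac))

def one_cycle_lr(step, total_steps, max_lr=1e-2, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Ramp up to max_lr over the first pct_start of training, then anneal down."""
    initial_lr = max_lr / div_factor            # where the cycle starts
    final_lr = initial_lr / final_div_factor    # where it ends, far below the start
    peak_step = pct_start * total_steps
    if step <= peak_step:
        return cos_interp(initial_lr, max_lr, step / peak_step)
    return cos_interp(max_lr, final_lr, (step - peak_step) / (total_steps - peak_step))

for step in (0, 150, 300, 650, 1000):
    print(step, one_cycle_lr(step, total_steps=1000))
```

The risky knob is max_lr: the whole point is to spend the middle of training at a rate higher than you would dare use as a constant, so a peak that is slightly too high is where things blow up.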

---

About Convex Optimization, Let Me Be Blunt

When you train neural networks, you're mostly running on non-convex terrain. So why bother with convex optimization?

Because many classic models—linear regression, logistic regression, SVM—are convex. The beauty of convex functions: a local optimum is the global optimum! You don't have to worry about falling into a pit and never getting out.
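A quick way to feel this is to run plain gradient descent from two different starting points. Both functions below are toy examples I made up: a convex parabola and a non-convex quartic with two separate valleys.

```python
def descend(grad_fn, w, lr=0.01, steps=500):
    """Plain gradient descent on a single parameter."""
    for _ in range(steps):
        w -= lr * grad_fn(w)
    return w

def convex_grad(w):
    return 2.0 * (w - 2.0)                 # gradient of f(w) = (w - 2)**2

def nonconvex_grad(w):
    return 4.0 * w ** 3 - 6.0 * w + 1.0    # gradient of f(w) = w**4 - 3*w**2 + w

for name, g in (("convex", convex_grad), ("non-convex", nonconvex_grad)):
    left, right = descend(g, -3.0), descend(g, 3.0)
    print(f"{name}: start=-3 ends at {left:.3f}, start=3 ends at {right:.3f}")
```

On the convex function both runs end at the same point. On the non-convex one each run settles into whichever valley is on its side, and which one you get depends entirely on where you started.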

My most

Cael Lee
