The Evolution of Deep Learning Optimizers: From SGD to AdamW
Training a neural network isn’t just about finding a solution; it’s about navigation. Imagine you are a hiker blindfolded on a foggy mountain, trying to find the lowest point in the valley (the minimum loss).
Over the years, the tools we use to navigate this landscape have evolved from simple walking sticks to intelligent, adaptive GPS systems. Here is the story of how that happened.

1. SGD: The Blind Step
“I feel a slope, I step down.”
The baseline approach, Stochastic Gradient Descent, is simple physics. You feel the ground under your feet (calculate $\nabla L(\theta_t)$), and you take a step downhill.
$$ \theta_{t+1} = \theta_t - \eta g_t $$

- $\theta_t$ (The Weights): The current position of the hiker (the parameters of the model).
- $\eta$ (Learning Rate): How big is your step?
- $g_t$ (Gradient): Defined as $\nabla L(\theta_t)$, the vector of partial derivatives. It points up, so we go the opposite way.
The Problem: It’s inconsistent. Sometimes the slope is steep, and you overshoot. Sometimes it’s shallow, and you barely move. Worse, if there’s a small ravine (local minimum), you get stuck because you have no “speed” to roll out of it.
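To make this concrete, here is a minimal sketch of one update in NumPy. The `grad_fn` callable is a hypothetical stand-in for whatever computes $\nabla L(\theta_t)$ in your setup; in practice you would use a framework optimizer such as `torch.optim.SGD`.

```python
import numpy as np

def sgd_step(theta, grad_fn, lr=0.01):
    """One vanilla SGD update: theta <- theta - lr * g."""
    g = grad_fn(theta)       # g_t = grad L(theta_t): the slope under our feet
    return theta - lr * g    # step downhill, scaled by the learning rate

# Toy example: minimize L(theta) = theta^2, whose gradient is 2*theta
theta = np.array([5.0])
for _ in range(100):
    theta = sgd_step(theta, lambda th: 2 * th, lr=0.1)
```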
2. Momentum: Adding Inertia
“I’m rolling now, I can’t just stop.”
To fix the “getting stuck” problem, we looked at physics again. A ball rolling down a hill doesn’t just stop the moment the slope flattens; it has velocity.
We introduce a new variable, Velocity ($v_t$), which is just a memory of where we were going.
$$ \begin{aligned} v_{t+1} &= \underset{\text{Friction}}{\underbrace{\mu v_t}} + \underset{\text{New Push}}{\underbrace{g_t}} \\\\ \theta_{t+1} &= \theta_t - \eta v_{t+1} \end{aligned} $$

- $\mu$ (Momentum Coefficient): This represents friction (or rather, how much we resist it). Usually set to 0.9, it means “keep 90% of your previous speed.” It prevents us from stopping instantly just because the gradient vanished.
- Now, even if the gradient ($g_t$) becomes zero (a flat spot), your old velocity ($v_t$) keeps you moving. This smooths out the zig-zags and helps you plow through small bumps.
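A rough sketch in the same style as the SGD snippet (with the hypothetical `grad_fn` from before): the velocity is just one extra array carried between steps.

```python
def momentum_step(theta, velocity, grad_fn, lr=0.01, mu=0.9):
    """One SGD-with-momentum update: velocity remembers where we were going."""
    g = grad_fn(theta)
    velocity = mu * velocity + g    # keep a fraction mu (90% by default) of the old speed, add the new push
    theta = theta - lr * velocity   # move along the smoothed direction
    return theta, velocity
```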
3. RMSProp: The All-Terrain Vehicle
“This terrain is too weird for a single speed.”
Momentum helped us move forward, but we still had a problem with scale.
- Some weights are very sensitive (steep slopes): a small step changes the loss hugely.
- Some weights are lazy (flat plains): you need massive steps to make progress.
Using the same Learning Rate ($\eta$) for both is inefficient. RMSProp introduced Adaptive Scaling. It looks at the recent history of the gradients’ magnitude.
$$ \begin{aligned} s_t &= \beta s_{t-1} + (1 - \beta) g_t^2 \\\\ \theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{s_t} + \epsilon} g_t \end{aligned} $$

- $s_t$ (Scale Meter): Think of this as a “volatility meter.” It keeps a running average of the squared gradients ($g_t^2$). If recent gradients have been huge, $s_t$ becomes large, which signals the algorithm to decrease the step size.
- $\beta$ (Decay Rate): This is the first appearance of the “memory dial” (e.g., 0.99). It decides how fast we forget the old terrain. It keeps a moving average so the scaling doesn’t jitter wildly with every single step.
- $\epsilon$ (Epsilon): A tiny safety buffer (e.g., $10^{-8}$). We add it to the denominator to prevent the universe from imploding (division by zero) if $s_t$ happens to be zero.
- The Magic: We divide by $\sqrt{s_t}$.
- If gradients are huge ($s_t$ is big), we divide by a big number $\to$ Small Step (Brakes).
- If gradients are tiny ($s_t$ is small), we divide by a small number $\to$ Big Step (Gas).
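Sketching it the same way (NumPy, hypothetical `grad_fn`), the volatility meter $s_t$ is another array carried between steps:

```python
import numpy as np

def rmsprop_step(theta, s, grad_fn, lr=0.001, beta=0.99, eps=1e-8):
    """One RMSProp update: per-parameter step sizes based on recent gradient magnitude."""
    g = grad_fn(theta)
    s = beta * s + (1 - beta) * g**2              # running average of squared gradients
    theta = theta - lr * g / (np.sqrt(s) + eps)   # big s -> brakes, small s -> gas
    return theta, s
```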
4. Adam: The Synthesis
“Why choose? Let’s do both.”
Adam (Adaptive Moment Estimation) didn’t strictly invent new math; it unified the two best ideas we had. It said: “Let’s track Momentum (to know where to go) AND Variance (to know how fast to go).”
Here, the symbols change slightly to fit statistical terms, but the concepts are identical:
- The Direction ($m_t$): Suddenly, a new symbol $m_t$ appears. Don’t let it confuse you—this is just Momentum wearing a different hat. Adam calls it the “1st Moment.”
It “popped up” because we need a variable to store the rolling average of direction.
- The Magnitude ($v_t$): This is just RMSProp from step 3. Adam calls it the “2nd Moment” (Uncentered Variance).
Note: In physics, $v$ was velocity. Here, $v$ is Variance. Don’t mix them up!
The “Memory” Dials ($\beta_1, \beta_2$): You see those $\beta$ symbols? They are the Decay Rates—essentially knobs that control how long we remember the past.
- $\beta_1$ (usually 0.9): Controls Momentum. It says “Keep 90% of the old velocity, and only add 10% new information.” It makes the turn radius wide and smooth.
- $\beta_2$ (usually 0.999): Controls Scaling. It averages the “scale” over a very long time, so a single freak spike in data doesn’t panic the model.
The Update:
We combine them. We take our smoothed direction ($m_t$) and divide it by our smoothed magnitude ($v_t$).
$$ \theta_{t+1} = \theta_t - \eta \frac{m_t}{\sqrt{v_t} + \epsilon} $$
Bias Correction:
The moving averages $m_t$ and $v_t$ are initialized at zero, so early updates shrink them toward zero. Adam rescales them as $\hat{m}_t = m_t / (1 - \beta_1^t)$ and $\hat{v}_t = v_t / (1 - \beta_2^t)$ to remove that startup bias.
As $t$ grows, the denominators approach 1, so the corrected moments quickly converge back to the plain moving averages—the update rule above is still the core intuition.
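Putting the pieces together, here is a bare-bones Adam step in the same sketch style, with the bias correction folded in (`t` is the step count, starting at 1). Production implementations such as `torch.optim.Adam` add more bookkeeping, but this is the core loop:

```python
import numpy as np

def adam_step(theta, m, v, t, grad_fn, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the step count (starting at 1), used for bias correction."""
    g = grad_fn(theta)
    m = beta1 * m + (1 - beta1) * g       # 1st moment: smoothed direction (momentum)
    v = beta2 * v + (1 - beta2) * g**2    # 2nd moment: smoothed magnitude (RMSProp-style)
    m_hat = m / (1 - beta1**t)            # undo the bias from initializing m at zero
    v_hat = v / (1 - beta2**t)            # same for v
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```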
5. The Twist: The “AdamW” Fix
“Wait, we broke the gravity.”
Everything looked perfect. Adam was fast, converged quickly, and became the default choice. But there was a hidden issue lurking in how we handled model complexity.
In almost all neural network training, we use a technique called L2 Regularization (or “Weight Decay”). Think of it as gravity: we want the model to learn, but we don’t want the weights to grow infinitely large and complex.
Mathematically, we achieve this by adding a penalty term to our Loss function:
$$ \text{Total Loss} = \text{Error} + \frac{1}{2}\lambda \|\theta\|^2 $$

When we take the derivative (gradient) of this new loss, that extra term becomes $\lambda \theta$. This is our “gravity” force—a vector that always points toward zero, pulling the weights back.
For decades, with SGD, we implemented this “gravity” in the laziest way possible: we just took that derivative $\lambda \theta$ and added it directly to the data gradient.
$$ g_{\text{total}} = g_{\text{data}} + \lambda \theta $$

It worked fine for SGD. But when we tried this same shortcut with Adam, something strange happened. The models weren’t generalizing as well as they should have.
The Problem:
Remember how Adam works? It scales the step size based on the variance of the gradient.
When we baked the “gravity” into the gradient ($g_{total}$), we accidentally fed the gravity into Adam’s variance calculator ($v_t$).
This caused a mathematical conflict:
- If the signal is loud (High Variance): Adam sees a huge $v_t$ and shrinks the update. This effectively shrinks the gravity, too. The model stops being regularized just when it’s most active.
- If the signal is quiet (Low Variance): Adam sees a tiny $v_t$ and boosts the update. This inadvertently boosts the gravity, crushing the weights aggressively.
We didn’t want the gravity to change based on the terrain! Gravity should be constant.
The Solution (AdamW):
AdamW stands for “Adam with decoupled Weight Decay.” It fixes the bug by decoupling the two forces:

1. It calculates the Adam step using only the original data gradient (ignoring gravity).
2. It applies the gravity step separately at the end.
$$ \theta_{t+1} = \underset{\text{Pure Adam}}{\underbrace{\left( \theta_t - \eta \frac{m_t}{\sqrt{v_t} + \epsilon} \right)}} - \underset{\text{Pure Decay}}{\underbrace{\eta \lambda \theta_t}} $$

This simple change restored the “constant gravity” behavior, allowing Adam to generalize as well as SGD while keeping its adaptive speed.
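Here is a sketch of the decoupled update in the same style as the snippets above (the `weight_decay` default is purely illustrative; in practice you would reach for a built-in such as `torch.optim.AdamW`):

```python
import numpy as np

def adamw_step(theta, m, v, t, grad_fn, lr=0.001, beta1=0.9,
               beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update: the adaptive machinery never sees the decay term."""
    g = grad_fn(theta)                    # data gradient only -- no lambda*theta baked in
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # pure Adam step
    theta = theta - lr * weight_decay * theta             # constant "gravity", applied separately
    return theta, m, v
```

Had we instead added `weight_decay * theta` to `g` before the moment updates, we would be right back to the coupled L2 behavior described above.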