ML Katas

Implementing the Adam Optimizer from Scratch

Difficulty: hard (>1 hr) · Tags: optimizer, adam, from scratch, gradient
This month · by E

Implement the Adam optimizer from scratch as a subclass of torch.optim.Optimizer. You'll need to maintain a first-moment estimate (an exponential moving average of the gradients) and a second-moment estimate (an exponential moving average of the squared gradients) for each parameter, and apply bias correction to both. The update rule for each parameter p at step t, with gradient g_t, learning rate α, decay rates β1 and β2, and small constant ϵ, is:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= m_t / (1 - \beta_1^t) \\
\hat{v}_t &= v_t / (1 - \beta_2^t) \\
p_{t+1} &= p_t - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)
\end{aligned}
$$
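A minimal sketch of one possible solution is below. The class name MyAdam and the hyperparameter defaults (lr=1e-3, betas=(0.9, 0.999), eps=1e-8) are illustrative choices, not part of the exercise statement.

```python
import torch
from torch.optim import Optimizer


class MyAdam(Optimizer):
    """From-scratch Adam following the update rule above (illustrative sketch)."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        defaults = dict(lr=lr, betas=betas, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            lr = group["lr"]
            beta1, beta2 = group["betas"]
            eps = group["eps"]

            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad

                state = self.state[p]
                # Lazy state initialization: one m and one v buffer per parameter.
                if len(state) == 0:
                    state["step"] = 0
                    state["exp_avg"] = torch.zeros_like(p)      # m_t
                    state["exp_avg_sq"] = torch.zeros_like(p)   # v_t

                state["step"] += 1
                t = state["step"]
                m, v = state["exp_avg"], state["exp_avg_sq"]

                # Biased first- and second-moment estimates.
                m.mul_(beta1).add_(grad, alpha=1 - beta1)
                v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

                # Bias correction.
                m_hat = m / (1 - beta1 ** t)
                v_hat = v / (1 - beta2 ** t)

                # Parameter update: p <- p - lr * m_hat / (sqrt(v_hat) + eps).
                p.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)

        return loss
```

The in-place tensor operations (mul_, addcmul_, addcdiv_) mirror how built-in optimizers avoid allocating new buffers on every step; a plain out-of-place version would be equally correct for this exercise.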

Verification: Train a small model (e.g., a linear layer) on a simple regression task using both your custom Adam optimizer and torch.optim.Adam. The final loss values and parameter weights should be very close after a few epochs.
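One way to run this check, assuming the MyAdam class from the sketch above; the synthetic regression target, step count, and tolerance are arbitrary choices:

```python
import torch
import torch.nn as nn


def train(optimizer_cls, seed=0, steps=200):
    # Identical data and initialization for both runs so results are comparable.
    torch.manual_seed(seed)
    x = torch.randn(256, 3)
    y = x @ torch.tensor([[2.0], [-1.0], [0.5]]) + 0.3

    torch.manual_seed(seed)
    model = nn.Linear(3, 1)
    opt = optimizer_cls(model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()

    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item(), [p.detach().clone() for p in model.parameters()]


loss_custom, params_custom = train(MyAdam)
loss_ref, params_ref = train(torch.optim.Adam)

print(f"custom: {loss_custom:.6f}  reference: {loss_ref:.6f}")
for pc, pr in zip(params_custom, params_ref):
    # Tolerance is a guess; small floating-point drift between the two
    # implementations is expected and may require loosening it.
    assert torch.allclose(pc, pr, atol=1e-4), "parameters diverged"
```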