ML Katas

Implementing the Adam Optimizer from Scratch

Difficulty: hard (>1 hr) · Tags: optimizer, adam, from scratch, gradient
This month · by E

Implement the Adam optimizer from scratch as a subclass of torch.optim.Optimizer. You'll need to maintain a first-moment estimate (an exponential moving average of the gradients) and a second-moment estimate (an exponential moving average of the squared gradients) for each parameter, and apply bias correction to both. The update rule for each parameter p at step t, with gradient g_t, learning rate α, decay rates β1 and β2, and small constant ϵ, is:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= m_t / (1 - \beta_1^t) \\
\hat{v}_t &= v_t / (1 - \beta_2^t) \\
p_{t+1} &= p_t - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)
\end{aligned}
$$
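A minimal sketch of one possible solution is below. The class name MyAdam and the hyperparameter defaults (lr=1e-3, betas=(0.9, 0.999), eps=1e-8) are illustrative choices, not part of the exercise statement.

```python
import torch
from torch.optim import Optimizer


class MyAdam(Optimizer):
    """From-scratch Adam following the update rule above (illustrative sketch)."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        defaults = dict(lr=lr, betas=betas, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            lr = group["lr"]
            beta1, beta2 = group["betas"]
            eps = group["eps"]

            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad

                state = self.state[p]
                # Lazy state initialization: one m and one v buffer per parameter.
                if len(state) == 0:
                    state["step"] = 0
                    state["exp_avg"] = torch.zeros_like(p)      # m_t
                    state["exp_avg_sq"] = torch.zeros_like(p)   # v_t

                state["step"] += 1
                t = state["step"]
                m, v = state["exp_avg"], state["exp_avg_sq"]

                # Biased first- and second-moment estimates.
                m.mul_(beta1).add_(grad, alpha=1 - beta1)
                v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

                # Bias correction.
                m_hat = m / (1 - beta1 ** t)
                v_hat = v / (1 - beta2 ** t)

                # Parameter update: p <- p - lr * m_hat / (sqrt(v_hat) + eps).
                p.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)

        return loss
```

The in-place tensor operations (mul_, addcmul_, addcdiv_) mirror how built-in optimizers avoid allocating new buffers on every step; a plain out-of-place version would be equally correct for this exercise.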

Verification: Train a small model (e.g., a linear layer) on a simple regression task using both your custom Adam optimizer and torch.optim.Adam. The final loss values and parameter weights should be very close after a few epochs.
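One way to run this check, assuming the MyAdam class from the sketch above; the synthetic regression target, step count, and tolerance are arbitrary choices:

```python
import torch
import torch.nn as nn


def train(optimizer_cls, seed=0, steps=200):
    # Identical data and initialization for both runs so results are comparable.
    torch.manual_seed(seed)
    x = torch.randn(256, 3)
    y = x @ torch.tensor([[2.0], [-1.0], [0.5]]) + 0.3

    torch.manual_seed(seed)
    model = nn.Linear(3, 1)
    opt = optimizer_cls(model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()

    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item(), [p.detach().clone() for p in model.parameters()]


loss_custom, params_custom = train(MyAdam)
loss_ref, params_ref = train(torch.optim.Adam)

print(f"custom: {loss_custom:.6f}  reference: {loss_ref:.6f}")
for pc, pr in zip(params_custom, params_ref):
    # Tolerance is a guess; small floating-point drift between the two
    # implementations is expected and may require loosening it.
    assert torch.allclose(pc, pr, atol=1e-4), "parameters diverged"
```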