Fixing Weight Decay Regularization in Adam: https://arxiv.org/abs/1711.05101
Following suggestions that adaptive gradient methods such as Adam might lead to worse generalization than SGD with momentum (Wilson et al., 2017), we identified one possible explanation for this phenomenon: the inequivalence of L2 regularization and weight decay that we expose. We empirically showed that our version of Adam with the original weight decay formulation yields substantially better generalization performance than the common implementation of Adam with L2 regularization (a minimal sketch contrasting the two update rules is given below).

We also proposed normalized weight decay and the use of cosine annealing and warm restarts for Adam, resulting in more robust hyperparameter selection, better final performance, and better anytime performance, respectively (see the schedule sketch below). Our results, obtained on image classification datasets, must be verified on a wider range of tasks, especially ones where the use of regularization is expected to be important. It would be interesting to integrate our findings on weight decay into other methods that attempt to improve Adam, e.g., normalized direction-preserving Adam (Zhang et al., 2017). While we focused our experimental analysis on Adam, we believe that similar results also hold for other adaptive gradient methods, such as AdaGrad (Duchi et al., 2011), RMSProp (Tieleman & Hinton, 2012), and AMSGrad (Reddi et al., 2018).

Advani & Saxe (2017) analytically showed that, in the limited-data regime of deep networks, zero eigenvalues form a frozen subspace in which no learning occurs, so smaller (e.g., zero) initial weight norms should be used to achieve the best generalization. We thus plan to consider adapting initial weight norms or weight norm constraints (Salimans & Kingma, 2016) at each warm restart.
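
To make the inequivalence concrete, here is a minimal sketch of a single Adam step that contrasts the two formulations. It is an illustration of the idea rather than the paper's reference implementation; the parameter names l2 and decoupled_wd are ours.

import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              l2=0.0, decoupled_wd=0.0):
    # l2 > 0 reproduces the common "Adam with L2 regularization": the penalty
    # term l2 * w is added to the gradient and therefore gets rescaled by the
    # adaptive denominator sqrt(v_hat) like any other gradient component.
    if l2 > 0.0:
        g = g + l2 * w

    m = beta1 * m + (1 - beta1) * g        # first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)           # bias corrections (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)

    # decoupled_wd > 0 sketches the original weight decay formulation:
    # the decay acts on the weights directly and bypasses the adaptive
    # rescaling, which is the behaviour the paper argues for.
    if decoupled_wd > 0.0:
        w = w - lr * decoupled_wd * w

    return w, m, v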
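
The normalized weight decay and the cosine annealing schedule can be sketched as follows. The formulas follow the paper (weight decay scaled by sqrt(b/(B*T)), cosine annealing as in SGDR); the function and argument names are illustrative, not the paper's code.

import math

def normalized_weight_decay(wd_norm, batch_size, num_train_points, num_epochs):
    # Scale a tuned "normalized" decay factor to a run-specific weight decay,
    # so that one value of wd_norm transfers across different batch sizes
    # and run lengths: wd = wd_norm * sqrt(b / (B * T)).
    return wd_norm * math.sqrt(batch_size / (num_train_points * num_epochs))

def cosine_annealing(t_cur, t_i, eta_min=0.0, eta_max=1.0):
    # Cosine annealing multiplier within one restart period of length t_i;
    # warm restarts reset t_cur to 0 and typically grow t_i after each restart.
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_i))

# Example: over a 10-epoch period the multiplier decays smoothly from 1.0 to 0.0:
# [cosine_annealing(t, 10) for t in range(11)]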