Optimizer

Gradient Descent / SGD

// TODO

see also: Linear Regression-Gradient Descent

Momentum

AdaGrad

adapts the learning rate per parameter during training; the scaling considers the accumulated sum of squared magnitudes of all previous gradients
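
A minimal NumPy sketch of the AdaGrad update under this idea (function and variable names are illustrative, not a library API):

import numpy as np

def adagrad_update(w, grad, accum, lr=0.01, eps=1e-8):
    # Accumulate the sum of squared gradients over the whole history
    accum = accum + grad ** 2
    # Per-parameter step: a large accumulated history shrinks the effective LR
    w = w - lr * grad / (np.sqrt(accum) + eps)
    return w, accum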

AdaDelta

RMSprop (Root Mean Square Prop)

considers an exponentially decaying average of squared gradients, so the magnitude of recent iterations dominates instead of the full accumulated history
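
A minimal sketch of the contrast with AdaGrad (illustrative names; beta is the decay rate of the moving average):

import numpy as np

def rmsprop_update(w, grad, avg_sq, lr=1e-3, beta=0.9, eps=1e-8):
    # Exponential moving average of squared gradients: recent steps dominate,
    # old history decays away instead of accumulating forever
    avg_sq = beta * avg_sq + (1 - beta) * grad ** 2
    w = w - lr * grad / (np.sqrt(avg_sq) + eps)
    return w, avg_sq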

Adam

Momentum + RMSProp
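
A minimal sketch showing how the two pieces combine (t is the 1-based step count; names are illustrative):

import numpy as np

def adam_update(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # momentum: moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # RMSprop: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v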

EVE

Adam, with the learning rate further adjusted by feedback from the relative change of the objective after each weight update
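
A loose sketch of the feedback idea only (an assumed simplification, not the paper's exact algorithm; clipping of the feedback term is omitted):

def eve_feedback(loss, prev_loss, d, beta3=0.999):
    # Relative change of the objective between consecutive updates
    rel_change = abs(loss - prev_loss) / min(loss, prev_loss)
    # Smoothed feedback term; Adam's step is then scaled as lr / d
    d = beta3 * d + (1 - beta3) * rel_change
    return d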

Cyclical Learning Rates

Cyclical Learning Rates for Training Neural Networks (WACV 2017)
Instead of monotonically decreasing the learning rate, this method lets the learning rate cyclically vary between reasonable boundary values
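
A minimal sketch of the triangular policy described in the paper (base_lr, max_lr, step_size are hyperparameters to choose):

import math

def triangular_clr(it, base_lr=1e-4, max_lr=1e-2, step_size=2000):
    # One full cycle = 2 * step_size iterations (ramp up, then ramp down)
    cycle = math.floor(1 + it / (2 * step_size))
    x = abs(it / step_size - 2 * cycle + 1)
    # Linearly vary between base_lr and max_lr
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)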

LR Range test

Practically eliminates the need to experimentally find the best values and schedule for the global learning rate
Note: even though some report that the gain from cyclical learning rates is not significant, the LR range test concept is reused by other cycle/restart-based optimizers
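
A minimal sketch of the idea (train_step is an assumed callback that runs one mini-batch update at the given LR and returns the loss; the bounds and ramp are illustrative):

def lr_range_test(train_step, lr_min=1e-7, lr_max=10.0, num_iters=100):
    # Exponentially increase the LR and record the loss at each step;
    # good base_lr / max_lr bounds are roughly where the loss starts to fall
    # and where it starts to diverge
    factor = (lr_max / lr_min) ** (1.0 / num_iters)
    lr, history = lr_min, []
    for _ in range(num_iters):
        history.append((lr, train_step(lr)))
        lr *= factor
    return history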

SGDR

SGDR: Stochastic Gradient Descent with Warm Restarts (ICLR 2017)
SGD with cyclical warm restarts of the learning rate
helps escape poor local minima; useful in many models, e.g. EDVR
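
A minimal sketch of the cosine-annealing schedule with warm restarts (T_0 is the first cycle length in epochs, T_mult the cycle-length multiplier):

import math

def sgdr_lr(epoch, lr_min=1e-5, lr_max=0.1, T_0=10, T_mult=2):
    # Locate the current cycle and the position within it
    T_cur, T_i = epoch, T_0
    while T_cur >= T_i:
        T_cur -= T_i
        T_i *= T_mult
    # Cosine decay from lr_max to lr_min, then jump back up (warm restart)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * T_cur / T_i))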