SGD
Implements stochastic gradient descent (optionally with momentum).
$$
\begin{aligned}
&\rule{110mm}{0.4pt} \\
&\textbf{input} : \gamma \text{ (lr)}, \: \theta_0 \text{ (params)}, \: f(\theta) \text{ (objective)}, \: \lambda \text{ (weight decay)}, \\
&\hspace{13mm} \: \mu \text{ (momentum)}, \: \tau \text{ (dampening)}, \: \textit{nesterov}, \: \textit{maximize} \\
&\rule{110mm}{0.4pt} \\
&\textbf{for} \: t = 1 \: \textbf{to} \: \ldots \: \textbf{do} \\
&\hspace{5mm} g_t \leftarrow \nabla_{\theta} f_t(\theta_{t-1}) \\
&\hspace{5mm} \textbf{if} \: \lambda \neq 0 \\
&\hspace{10mm} g_t \leftarrow g_t + \lambda \theta_{t-1} \\
&\hspace{5mm} \textbf{if} \: \mu \neq 0 \\
&\hspace{10mm} \textbf{if} \: t > 1 \\
&\hspace{15mm} \textbf{b}_t \leftarrow \mu \textbf{b}_{t-1} + (1 - \tau) g_t \\
&\hspace{10mm} \textbf{else} \\
&\hspace{15mm} \textbf{b}_t \leftarrow g_t \\
&\hspace{10mm} \textbf{if} \: \textit{nesterov} \\
&\hspace{15mm} g_t \leftarrow g_t + \mu \textbf{b}_t \\
&\hspace{10mm} \textbf{else} \\
&\hspace{15mm} g_t \leftarrow \textbf{b}_t \\
&\hspace{5mm} \textbf{if} \: \textit{maximize} \\
&\hspace{10mm} \theta_t \leftarrow \theta_{t-1} + \gamma g_t \\
&\hspace{5mm} \textbf{else} \\
&\hspace{10mm} \theta_t \leftarrow \theta_{t-1} - \gamma g_t \\
&\rule{110mm}{0.4pt} \\
&\textbf{return} \: \theta_t \\
&\rule{110mm}{0.4pt}
\end{aligned}
$$
Nesterov momentum is based on the formula from *On the importance of initialization and momentum in deep learning*.
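For reference, the listing below sketches the update rule above in plain Scala over `Array[Double]` values. It is a standalone illustration of the pseudocode, not this optimizer's actual API; the names `SgdSketch`, `SgdState`, and `sgdStep` are hypothetical.

```scala
// Minimal sketch of one SGD step on plain arrays (illustrative only).
object SgdSketch {

  // Momentum buffer b_t, carried across steps (empty before the first step).
  final case class SgdState(momentumBuffer: Option[Array[Double]] = None)

  def sgdStep(
      params: Array[Double],     // θ_{t-1}
      grads: Array[Double],      // g_t = ∇_θ f_t(θ_{t-1})
      state: SgdState,
      lr: Double,                // γ
      weightDecay: Double = 0.0, // λ
      momentum: Double = 0.0,    // μ
      dampening: Double = 0.0,   // τ
      nesterov: Boolean = false,
      maximize: Boolean = false
  ): (Array[Double], SgdState) = {
    // g_t ← g_t + λ θ_{t-1}
    var g = grads.zip(params).map { case (gi, pi) => gi + weightDecay * pi }

    var newState = state
    if (momentum != 0.0) {
      // b_t ← μ b_{t-1} + (1 − τ) g_t, or b_t ← g_t on the first step
      val b = state.momentumBuffer match {
        case Some(prev) =>
          prev.zip(g).map { case (bi, gi) => momentum * bi + (1 - dampening) * gi }
        case None => g.clone()
      }
      newState = SgdState(Some(b))
      // g_t ← g_t + μ b_t if nesterov, else g_t ← b_t
      g = if (nesterov) g.zip(b).map { case (gi, bi) => gi + momentum * bi } else b
    }

    // θ_t ← θ_{t-1} + γ g_t when maximizing, θ_{t-1} − γ g_t otherwise
    val sign = if (maximize) 1.0 else -1.0
    val updated = params.zip(g).map { case (pi, gi) => pi + sign * lr * gi }
    (updated, newState)
  }
}
```

The momentum buffer is threaded through explicitly as `SgdState`, mirroring how the algorithm keeps per-parameter state between iterations.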
Attributes
- Source: SGD.scala
- Supertypes