torch.optim
Type members
Classlikes
Implements the Adam algorithm.
$$ \begin{aligned}
&\rule{110mm}{0.4pt} \\
&\textbf{input} : \gamma \text{ (lr)}, \beta_1, \beta_2 \text{ (betas)}, \theta_0 \text{ (params)}, f(\theta) \text{ (objective)}, \\
&\hspace{13mm} \lambda \text{ (weight decay)}, \: \epsilon \text{ (epsilon)}, \: \textit{amsgrad}, \: \textit{maximize} \\
&\textbf{initialize} : m_0 \leftarrow 0 \text{ (first moment)}, \: v_0 \leftarrow 0 \text{ (second moment)}, \: \widehat{v_0}^{max} \leftarrow 0 \\[-1.ex]
&\rule{110mm}{0.4pt} \\
&\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do} \\
&\hspace{5mm}\textbf{if} \: \textit{maximize} \\
&\hspace{10mm}g_t \leftarrow -\nabla_{\theta} f_t (\theta_{t-1}) \\
&\hspace{5mm}\textbf{else} \\
&\hspace{10mm}g_t \leftarrow \nabla_{\theta} f_t (\theta_{t-1}) \\
&\hspace{5mm}\textbf{if} \: \lambda \neq 0 \\
&\hspace{10mm} g_t \leftarrow g_t + \lambda \theta_{t-1} \\
&\hspace{5mm}m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
&\hspace{5mm}v_t \leftarrow \beta_2 v_{t-1} + (1-\beta_2) g^2_t \\
&\hspace{5mm}\widehat{m_t} \leftarrow m_t/\big(1-\beta_1^t \big) \\
&\hspace{5mm}\widehat{v_t} \leftarrow v_t/\big(1-\beta_2^t \big) \\
&\hspace{5mm}\textbf{if} \: \textit{amsgrad} \\
&\hspace{10mm}\widehat{v_t}^{max} \leftarrow \mathrm{max}(\widehat{v_t}^{max}, \widehat{v_t}) \\
&\hspace{10mm}\theta_t \leftarrow \theta_{t-1} - \gamma \widehat{m_t}/\big(\sqrt{\widehat{v_t}^{max}} + \epsilon \big) \\
&\hspace{5mm}\textbf{else} \\
&\hspace{10mm}\theta_t \leftarrow \theta_{t-1} - \gamma \widehat{m_t}/\big(\sqrt{\widehat{v_t}} + \epsilon \big) \\
&\rule{110mm}{0.4pt} \\[-1.ex]
&\textbf{return} \: \theta_t \\[-1.ex]
&\rule{110mm}{0.4pt} \\[-1.ex]
\end{aligned} $$
For further details regarding the algorithm we refer to Adam: A Method for Stochastic Optimization.
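The update above reduces to a few lines of arithmetic per parameter. The following is a minimal sketch of one Adam step over a plain `Array[Double]` (`amsgrad` and `maximize` omitted for brevity); it is meant only to make the pseudocode concrete and is not the class's tensor-based API — all names in it are illustrative.

```scala
object AdamStepSketch:

  /** A single Adam update over one parameter vector, mirroring the pseudocode
    * above. Plain arrays stand in for tensors, so this illustrates the update
    * rule only.
    */
  def adamStep(
      params: Array[Double], // θ_{t-1}, updated in place
      grads: Array[Double],  // g_t = ∇_θ f_t(θ_{t-1})
      m: Array[Double],      // first-moment estimate m_{t-1}, updated in place
      v: Array[Double],      // second-moment estimate v_{t-1}, updated in place
      t: Int,                // step count, starting at 1
      lr: Double = 1e-3,
      beta1: Double = 0.9,
      beta2: Double = 0.999,
      eps: Double = 1e-8,
      weightDecay: Double = 0.0
  ): Unit =
    for i <- params.indices do
      var g = grads(i)
      if weightDecay != 0.0 then g += weightDecay * params(i) // g_t + λθ_{t-1}
      m(i) = beta1 * m(i) + (1 - beta1) * g
      v(i) = beta2 * v(i) + (1 - beta2) * g * g
      val mHat = m(i) / (1 - math.pow(beta1, t)) // bias-corrected first moment
      val vHat = v(i) / (1 - math.pow(beta2, t)) // bias-corrected second moment
      params(i) -= lr * mHat / (math.sqrt(vHat) + eps)
```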
Attributes
- Source
- Adam.scala
- Supertypes
Implements the AdamW algorithm.
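AdamW follows the Adam update but applies weight decay in a decoupled way, directly to the parameters rather than through the gradient (Loshchilov & Hutter, Decoupled Weight Decay Regularization). The sketch below isolates that difference for a single scalar parameter; it assumes bias-corrected moments computed exactly as in the Adam pseudocode above and is illustrative only.

```scala
object AdamWStepSketch:

  /** The difference to Adam is only where weight decay enters the update:
    * Adam adds λθ to the gradient before the moment updates, while AdamW
    * decays the parameter directly ("decoupled" weight decay).
    */
  def adamWScalarStep(
      theta: Double,  // θ_{t-1}
      mHat: Double,   // bias-corrected first moment, computed as in Adam
      vHat: Double,   // bias-corrected second moment, computed as in Adam
      lr: Double,
      weightDecay: Double,
      eps: Double = 1e-8
  ): Double =
    val decayed = theta - lr * weightDecay * theta // decoupled weight decay
    decayed - lr * mHat / (math.sqrt(vHat) + eps)  // Adam-style step
```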
Base class for all optimizers.
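To illustrate the role of the base class, the sketch below shows a hypothetical minimal optimizer contract and the training-loop shape it enables. The member names (`zeroGrad`, `step`) are assumptions modeled on the PyTorch convention and may not match the actual API.

```scala
// Hypothetical contract, for illustration only; the names below are
// assumptions and not necessarily this class's actual members.
trait OptimizerLike:
  def zeroGrad(): Unit // clear accumulated gradients before the next backward pass
  def step(): Unit     // apply one update to all registered parameters

// A training loop written only against the base contract, so any concrete
// optimizer (Adam, AdamW, SGD, ...) can be swapped in:
def trainEpoch(optimizer: OptimizerLike, batches: Seq[() => Unit]): Unit =
  for forwardAndBackward <- batches do
    optimizer.zeroGrad() // reset gradients
    forwardAndBackward() // compute loss and backpropagate
    optimizer.step()     // update parameters
```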
Attributes
- Source
- Optimizer.scala
- Supertypes
- class Object
- trait Matchable
- class Any
- Known subtypes
Implements stochastic gradient descent (optionally with momentum).
$$ \begin{aligned}
&\rule{110mm}{0.4pt} \\
&\textbf{input} : \gamma \text{ (lr)}, \: \theta_0 \text{ (params)}, \: f(\theta) \text{ (objective)}, \: \lambda \text{ (weight decay)}, \\
&\hspace{13mm} \mu \text{ (momentum)}, \: \tau \text{ (dampening)}, \: \textit{nesterov}, \: \textit{maximize} \\[-1.ex]
&\rule{110mm}{0.4pt} \\
&\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do} \\
&\hspace{5mm}g_t \leftarrow \nabla_{\theta} f_t (\theta_{t-1}) \\
&\hspace{5mm}\textbf{if} \: \lambda \neq 0 \\
&\hspace{10mm} g_t \leftarrow g_t + \lambda \theta_{t-1} \\
&\hspace{5mm}\textbf{if} \: \mu \neq 0 \\
&\hspace{10mm}\textbf{if} \: t > 1 \\
&\hspace{15mm} \textbf{b}_t \leftarrow \mu \textbf{b}_{t-1} + (1-\tau) g_t \\
&\hspace{10mm}\textbf{else} \\
&\hspace{15mm} \textbf{b}_t \leftarrow g_t \\
&\hspace{10mm}\textbf{if} \: \textit{nesterov} \\
&\hspace{15mm} g_t \leftarrow g_t + \mu \textbf{b}_t \\
&\hspace{10mm}\textbf{else} \\[-1.ex]
&\hspace{15mm} g_t \leftarrow \textbf{b}_t \\
&\hspace{5mm}\textbf{if} \: \textit{maximize} \\
&\hspace{10mm}\theta_t \leftarrow \theta_{t-1} + \gamma g_t \\[-1.ex]
&\hspace{5mm}\textbf{else} \\[-1.ex]
&\hspace{10mm}\theta_t \leftarrow \theta_{t-1} - \gamma g_t \\[-1.ex]
&\rule{110mm}{0.4pt} \\[-1.ex]
&\textbf{return} \: \theta_t \\[-1.ex]
&\rule{110mm}{0.4pt} \\[-1.ex]
\end{aligned} $$
Nesterov momentum is based on the formula from On the importance of initialization and momentum in deep learning.
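As with Adam above, the update can be made concrete with a short sketch. The following performs one SGD-with-momentum step over a plain `Array[Double]` (`nesterov` and `maximize` omitted for brevity); it only illustrates the pseudocode and is not the class's tensor-based API.

```scala
object SgdStepSketch:

  /** A single SGD-with-momentum update over one parameter vector, mirroring
    * the pseudocode above. Plain arrays stand in for tensors, so this
    * illustrates the update rule only.
    */
  def sgdStep(
      params: Array[Double], // θ_{t-1}, updated in place
      grads: Array[Double],  // g_t = ∇_θ f_t(θ_{t-1})
      buf: Array[Double],    // momentum buffer b_{t-1}, updated in place
      t: Int,                // step count, starting at 1
      lr: Double,
      momentum: Double = 0.0,
      dampening: Double = 0.0,
      weightDecay: Double = 0.0
  ): Unit =
    for i <- params.indices do
      var g = grads(i)
      if weightDecay != 0.0 then g += weightDecay * params(i) // g_t + λθ_{t-1}
      if momentum != 0.0 then
        buf(i) = if t > 1 then momentum * buf(i) + (1 - dampening) * g else g
        g = buf(i)
      params(i) -= lr * g
```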
Attributes
- Source
- SGD.scala
- Supertypes