torch.optim

package torch.optim

Type members

Classlikes

Implements the Adam algorithm.

$$ \begin{aligned}
&\rule{110mm}{0.4pt}                                                                 \\
&\textbf{input}      : \gamma \text{ (lr)}, \beta_1, \beta_2
    \text{ (betas)}, \theta_0 \text{ (params)}, f(\theta) \text{ (objective)}        \\
&\hspace{13mm}      \lambda \text{ (weight decay)}, \: \textit{amsgrad},
    \: \textit{maximize}                                                             \\
&\textbf{initialize} : m_0 \leftarrow 0 \text{ (first moment)},
    v_0 \leftarrow 0 \text{ (second moment)}, \: \widehat{v_0}^{max} \leftarrow 0 \\[-1.ex]
&\rule{110mm}{0.4pt}                                                                 \\
&\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do}                         \\
&\hspace{5mm}\textbf{if} \: \textit{maximize}:                                       \\
&\hspace{10mm}g_t           \leftarrow   -\nabla_{\theta} f_t (\theta_{t-1})         \\
&\hspace{5mm}\textbf{else}                                                           \\
&\hspace{10mm}g_t           \leftarrow   \nabla_{\theta} f_t (\theta_{t-1})          \\
&\hspace{5mm}\textbf{if} \: \lambda \neq 0                                           \\
&\hspace{10mm} g_t \leftarrow g_t + \lambda  \theta_{t-1}                            \\
&\hspace{5mm}m_t           \leftarrow   \beta_1 m_{t-1} + (1 - \beta_1) g_t          \\
&\hspace{5mm}v_t           \leftarrow   \beta_2 v_{t-1} + (1-\beta_2) g^2_t          \\
&\hspace{5mm}\widehat{m_t} \leftarrow   m_t/\big(1-\beta_1^t \big)                   \\
&\hspace{5mm}\widehat{v_t} \leftarrow   v_t/\big(1-\beta_2^t \big)                   \\
&\hspace{5mm}\textbf{if} \: amsgrad                                                  \\
&\hspace{10mm}\widehat{v_t}^{max} \leftarrow \mathrm{max}(\widehat{v_t}^{max},
    \widehat{v_t})                                                                   \\
&\hspace{10mm}\theta_t \leftarrow \theta_{t-1} - \gamma \widehat{m_t}/
    \big(\sqrt{\widehat{v_t}^{max}} + \epsilon \big)                                 \\
&\hspace{5mm}\textbf{else}                                                           \\
&\hspace{10mm}\theta_t \leftarrow \theta_{t-1} - \gamma \widehat{m_t}/
    \big(\sqrt{\widehat{v_t}} + \epsilon \big)                                       \\
&\rule{110mm}{0.4pt}                                                          \\[-1.ex]
&\textbf{return} \: \theta_t                                                  \\[-1.ex]
&\rule{110mm}{0.4pt}                                                          \\[-1.ex]
\end{aligned} $$

For further details regarding the algorithm we refer to Adam: A Method for Stochastic Optimization.
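
A minimal, self-contained sketch of a single Adam update on plain Scala arrays, mirroring the update rule above (minimizing, no weight decay, no AMSGrad). It is illustrative only and does not use the Tensor-based Adam class of this package; all names are local to the sketch.

```scala
// One Adam step for a parameter vector, following the update rule above.
def adamStep(
    theta: Array[Double],  // parameters θ_{t-1}, updated in place
    grad: Array[Double],   // gradient g_t = ∇θ f_t(θ_{t-1})
    m: Array[Double],      // first moment m_{t-1}, updated in place
    v: Array[Double],      // second moment v_{t-1}, updated in place
    t: Int,                // step count, starting at 1
    lr: Double = 1e-3,     // γ (lr)
    beta1: Double = 0.9,   // β1
    beta2: Double = 0.999, // β2
    eps: Double = 1e-8
): Unit =
  for i <- theta.indices do
    m(i) = beta1 * m(i) + (1 - beta1) * grad(i)
    v(i) = beta2 * v(i) + (1 - beta2) * grad(i) * grad(i)
    val mHat = m(i) / (1 - math.pow(beta1, t)) // bias-corrected first moment
    val vHat = v(i) / (1 - math.pow(beta2, t)) // bias-corrected second moment
    theta(i) = theta(i) - lr * mHat / (math.sqrt(vHat) + eps)

@main def adamSketch(): Unit =
  // Minimize f(θ) = θ², whose gradient is 2θ, starting from θ = 1.0.
  val theta = Array(1.0)
  val m = Array(0.0)
  val v = Array(0.0)
  for t <- 1 to 5 do
    adamStep(theta, theta.map(2 * _), m, v, t)
    println(f"step $t: theta = ${theta(0)}%.6f")
```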

Attributes

Source
Adam.scala
Supertypes
class Optimizer
class Object
trait Matchable
class Any

Implements the AdamW algorithm.

AdamW is a variant of Adam with decoupled weight decay: the weight-decay term is applied directly to the parameters rather than being added to the gradient. For further details regarding the algorithm we refer to Decoupled Weight Decay Regularization.
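
The sketch below (plain Scala, per coordinate, names local to the sketch and not part of this package's API) contrasts Adam-style L2 regularization, which folds the decay term λθ into the gradient before the moment estimates, with AdamW's decoupled decay, which enters only in the parameter update.

```scala
// Where the weight-decay term enters, given the bias-corrected moments
// mHat and vHat computed as in the Adam algorithm above.

// Adam with L2 regularization: λθ is added to the gradient before the moment
// updates, so it is also rescaled by the adaptive 1/(sqrt(vHat) + ε) factor.
def l2Gradient(grad: Double, theta: Double, weightDecay: Double): Double =
  grad + weightDecay * theta

// AdamW: decoupled weight decay is applied directly in the parameter update,
// independently of the adaptive scaling.
def adamWUpdate(theta: Double, mHat: Double, vHat: Double,
                lr: Double, weightDecay: Double, eps: Double = 1e-8): Double =
  theta - lr * (mHat / (math.sqrt(vHat) + eps) + weightDecay * theta)
```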

Attributes

Source
AdamW.scala
Supertypes
class Optimizer
class Object
trait Matchable
class Any
abstract class Optimizer

Base class for all optimizers.

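As an illustration of the pattern this base class captures, the following self-contained toy (plain Scala arrays, not this package's Tensor-based API) shows concrete optimizers sharing a single abstract update step so that training code can swap them interchangeably. ToyOptimizer and ToyGradientDescent are hypothetical names local to this sketch.

```scala
// A toy optimizer hierarchy: subclasses hold the parameters they update and
// implement one abstract update step.
abstract class ToyOptimizer:
  /** Apply one parameter update given the gradient of the objective. */
  def step(grad: Array[Double]): Unit

class ToyGradientDescent(params: Array[Double], lr: Double) extends ToyOptimizer:
  def step(grad: Array[Double]): Unit =
    for i <- params.indices do params(i) = params(i) - lr * grad(i)

@main def toyOptimizerSketch(): Unit =
  val theta = Array(1.0, -2.0)
  val opt: ToyOptimizer = ToyGradientDescent(theta, lr = 0.1)
  for _ <- 1 to 20 do
    opt.step(theta.map(2 * _)) // gradient of f(θ) = Σ θ_i²
  println(theta.mkString(", ")) // both coordinates approach 0
```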

Attributes

Source
Optimizer.scala
Supertypes
class Object
trait Matchable
class Any
Known subtypes
class Adam
class AdamW
class SGD
class SGD(params: Iterable[Tensor[_]], lr: Float) extends Optimizer

Implements stochastic gradient descent (optionally with momentum).

$$ \begin{aligned}
&\rule{110mm}{0.4pt}                                                                 \\
&\textbf{input}      : \gamma \text{ (lr)}, \: \theta_0 \text{ (params)}, \: f(\theta)
    \text{ (objective)}, \: \lambda \text{ (weight decay)},                          \\
&\hspace{13mm} \: \mu \text{ (momentum)}, \: \tau \text{ (dampening)},
    \: \textit{nesterov}, \: \textit{maximize}                                \\[-1.ex]
&\rule{110mm}{0.4pt}                                                                 \\
&\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do}                         \\
&\hspace{5mm}g_t \leftarrow \nabla_{\theta} f_t (\theta_{t-1})                       \\
&\hspace{5mm}\textbf{if} \: \lambda \neq 0                                           \\
&\hspace{10mm} g_t \leftarrow g_t + \lambda \theta_{t-1}                             \\
&\hspace{5mm}\textbf{if} \: \mu \neq 0                                               \\
&\hspace{10mm}\textbf{if} \: t > 1                                                   \\
&\hspace{15mm} \textbf{b}_t \leftarrow \mu \textbf{b}_{t-1} + (1-\tau) g_t           \\
&\hspace{10mm}\textbf{else}                                                          \\
&\hspace{15mm} \textbf{b}_t \leftarrow g_t                                           \\
&\hspace{10mm}\textbf{if} \: \textit{nesterov}                                       \\
&\hspace{15mm} g_t \leftarrow g_{t-1} + \mu \textbf{b}_t                             \\
&\hspace{10mm}\textbf{else}                                                   \\[-1.ex]
&\hspace{15mm} g_t \leftarrow \textbf{b}_t                                           \\
&\hspace{5mm}\textbf{if} \: \textit{maximize}                                        \\
&\hspace{10mm}\theta_t \leftarrow \theta_{t-1} + \gamma g_t                   \\[-1.ex]
&\hspace{5mm}\textbf{else}                                                    \\[-1.ex]
&\hspace{10mm}\theta_t \leftarrow \theta_{t-1} - \gamma g_t                   \\[-1.ex]
&\rule{110mm}{0.4pt}                                                          \\[-1.ex]
&\textbf{return} \: \theta_t                                                  \\[-1.ex]
&\rule{110mm}{0.4pt}                                                          \\[-1.ex]
\end{aligned} $$

Nesterov momentum is based on the formula from On the importance of initialization and momentum in deep learning.
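
A minimal, self-contained sketch of the SGD-with-momentum update above on plain Scala arrays (minimizing, no weight decay, no Nesterov). It is illustrative only and does not use the Tensor-based SGD class of this package.

```scala
// One SGD step with momentum for a parameter vector, following the rule above.
def sgdStep(
    theta: Array[Double],   // parameters θ_{t-1}, updated in place
    grad: Array[Double],    // gradient g_t
    buf: Array[Double],     // momentum buffer b_{t-1}, updated in place
    t: Int,                 // step count, starting at 1
    lr: Double = 0.01,      // γ (lr)
    momentum: Double = 0.9, // μ
    dampening: Double = 0.0 // τ
): Unit =
  for i <- theta.indices do
    // b_t = μ b_{t-1} + (1 - τ) g_t, with b_1 = g_1 on the first step
    buf(i) = if t > 1 then momentum * buf(i) + (1 - dampening) * grad(i) else grad(i)
    theta(i) = theta(i) - lr * buf(i)

@main def sgdSketch(): Unit =
  // Minimize f(θ) = θ², whose gradient is 2θ, starting from θ = 1.0.
  val theta = Array(1.0)
  val buf = Array(0.0)
  for t <- 1 to 5 do
    sgdStep(theta, theta.map(2 * _), buf, t)
    println(f"step $t: theta = ${theta(0)}%.6f")
```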

Attributes

Source
SGD.scala
Supertypes
class Optimizer
class Object
trait Matchable
class Any