Adam

torch.optim.Adam

Implements the Adam algorithm.

$$
\begin{aligned}
    &\rule{110mm}{0.4pt}                                                                 \\
    &\textbf{input}      : \gamma \text{ (lr)}, \beta_1, \beta_2
        \text{ (betas)}, \theta_0 \text{ (params)}, f(\theta) \text{ (objective)}        \\
    &\hspace{13mm}      \lambda \text{ (weight decay)}, \: \textit{amsgrad},
        \: \textit{maximize}, \: \epsilon \text{ (epsilon)}                              \\
    &\textbf{initialize} : m_0 \leftarrow 0 \text{ (first moment)},
        v_0 \leftarrow 0 \text{ (second moment)}, \: \widehat{v_0}^{max} \leftarrow 0 \\[-1.ex]
    &\rule{110mm}{0.4pt}                                                                 \\
    &\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do}                         \\
    &\hspace{5mm}\textbf{if} \: \textit{maximize}:                                       \\
    &\hspace{10mm}g_t           \leftarrow   -\nabla_{\theta} f_t (\theta_{t-1})         \\
    &\hspace{5mm}\textbf{else}                                                           \\
    &\hspace{10mm}g_t           \leftarrow   \nabla_{\theta} f_t (\theta_{t-1})          \\
    &\hspace{5mm}\textbf{if} \: \lambda \neq 0                                           \\
    &\hspace{10mm} g_t \leftarrow g_t + \lambda  \theta_{t-1}                            \\
    &\hspace{5mm}m_t           \leftarrow   \beta_1 m_{t-1} + (1 - \beta_1) g_t          \\
    &\hspace{5mm}v_t           \leftarrow   \beta_2 v_{t-1} + (1-\beta_2) g^2_t          \\
    &\hspace{5mm}\widehat{m_t} \leftarrow   m_t/\big(1-\beta_1^t \big)                   \\
    &\hspace{5mm}\widehat{v_t} \leftarrow   v_t/\big(1-\beta_2^t \big)                   \\
    &\hspace{5mm}\textbf{if} \: \textit{amsgrad}                                         \\
    &\hspace{10mm}\widehat{v_t}^{max} \leftarrow \mathrm{max}(\widehat{v_t}^{max},
        \widehat{v_t})                                                                   \\
    &\hspace{10mm}\theta_t \leftarrow \theta_{t-1} - \gamma \widehat{m_t}/
        \big(\sqrt{\widehat{v_t}^{max}} + \epsilon \big)                                 \\
    &\hspace{5mm}\textbf{else}                                                           \\
    &\hspace{10mm}\theta_t \leftarrow \theta_{t-1} - \gamma \widehat{m_t}/
        \big(\sqrt{\widehat{v_t}} + \epsilon \big)                                       \\
    &\rule{110mm}{0.4pt}                                                          \\[-1.ex]
    &\textbf{return} \: \theta_t                                                  \\[-1.ex]
    &\rule{110mm}{0.4pt}                                                          \\[-1.ex]
\end{aligned}
$$

For further details regarding the algorithm, we refer to Adam: A Method for Stochastic Optimization (Kingma & Ba, 2015).
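
To make the update rule above concrete, the following dependency-free Scala sketch performs one Adam step on a plain Array[Double] (minimization, no weight decay, no AMSGrad). It is illustrative only and does not use this library's Tensor API; AdamState and adamStep are hypothetical names.

```scala
// Illustrative only: one Adam update on plain arrays, mirroring the formulas above
// (minimize, lambda = 0, amsgrad = false). `AdamState` and `adamStep` are hypothetical.
final case class AdamState(m: Array[Double], v: Array[Double], t: Int)

def adamStep(
    theta: Array[Double],  // parameters theta_{t-1}
    grad: Array[Double],   // gradient g_t = nabla f_t(theta_{t-1})
    state: AdamState,
    lr: Double = 1e-3,     // gamma
    beta1: Double = 0.9,
    beta2: Double = 0.999,
    eps: Double = 1e-8
): (Array[Double], AdamState) =
  val t = state.t + 1
  // m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
  val m = Array.tabulate(theta.length)(i => beta1 * state.m(i) + (1 - beta1) * grad(i))
  // v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2
  val v = Array.tabulate(theta.length)(i => beta2 * state.v(i) + (1 - beta2) * grad(i) * grad(i))
  val updated = Array.tabulate(theta.length) { i =>
    val mHat = m(i) / (1 - math.pow(beta1, t)) // bias-corrected first moment
    val vHat = v(i) / (1 - math.pow(beta2, t)) // bias-corrected second moment
    theta(i) - lr * mHat / (math.sqrt(vHat) + eps)
  }
  (updated, AdamState(m, v, t))
```

Repeatedly applying adamStep to the gradient of, say, a simple quadratic drives theta toward its minimum, which is the behaviour the optimizer implements on Tensors.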

Attributes

Source
Adam.scala
Supertypes
class Optimizer
class Object
trait Matchable
class Any

Members list

Value members

Inherited methods

def step(): Unit

Performs a single optimization step (parameter update).

Attributes

Note

Unless otherwise specified, this function should not modify the .grad field of the parameters.

Inherited from:
Optimizer
Source
Optimizer.scala
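
In a typical training loop, step() is called once per iteration after gradients have been populated by backpropagation. The sketch below shows that pattern; model, dataLoader and lossFn are hypothetical placeholders, and the Adam constructor arguments shown are an assumption rather than a confirmed signature.

```scala
// Hypothetical usage sketch: `model`, `dataLoader` and `lossFn` are placeholders,
// and the Adam constructor arguments here are an assumption.
val optimizer = torch.optim.Adam(model.parameters, lr = 1e-3)

for (input, target) <- dataLoader do
  optimizer.zeroGrad()                    // clear gradients accumulated last iteration
  val loss = lossFn(model(input), target) // forward pass
  loss.backward()                         // populate .grad on the parameters
  optimizer.step()                        // apply the Adam update described above
```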

def zeroGrad(): Unit

Sets the gradients of all optimized Tensors to zero.

Attributes

Inherited from:
Optimizer
Source
Optimizer.scala
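
Assuming the same accumulate-into-.grad semantics as PyTorch (backward passes add to existing gradients rather than replacing them), zeroGrad() is normally called at the start of every iteration, as in the loop shown under step(). The plain-Scala sketch below illustrates the accumulation behaviour it guards against; accumulate is a hypothetical stand-in for a backward pass.

```scala
// Conceptual illustration only: an Array[Double] stands in for a parameter's .grad field.
@main def gradAccumulationDemo(): Unit =
  val grad = Array.fill(3)(0.0)

  def accumulate(contribution: Array[Double]): Unit =
    for i <- grad.indices do grad(i) += contribution(i) // a backward pass adds into .grad

  def zero(): Unit =
    for i <- grad.indices do grad(i) = 0.0 // what zeroGrad does for each optimized tensor

  accumulate(Array(0.1, 0.2, 0.3))
  accumulate(Array(0.1, 0.2, 0.3))
  println(grad.mkString(", ")) // roughly 0.2, 0.4, 0.6: contributions added, not replaced
  zero()
  println(grad.mkString(", ")) // 0.0, 0.0, 0.0: ready for the next backward pass
```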