prysm.x.optym#

Optimizers#

Various optimization algorithms.

prysm.x.optym.optimizers.runN(optimizer, N)#

Perform N iterations of optimization.

Parameters:
  • optimizer (Any) – any optimizer from this file, or any type which implements a step() method returning (xk, fk, gk)

  • N (int) – number of iterations to perform

Returns:

yielding (xk, fk, gk) at each iteration

Return type:

generator
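
A minimal usage sketch (the quadratic fg below is a stand-in for a real cost function, not part of the API):

    import numpy as np

    from prysm.x.optym.optimizers import GradientDescent, runN

    def fg(x):
        # toy quadratic cost; f is the scalar cost, g the gradient vector
        f = np.sum(x * x)
        g = 2 * x
        return f, g

    opt = GradientDescent(fg, x0=np.ones(4), alpha=0.1)
    for xk, fk, gk in runN(opt, 100):
        pass  # e.g., append fk to a history to monitor convergence

    print(fk)  # final cost; near zero for this toy problem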

class prysm.x.optym.optimizers.GradientDescent(fg, x0, alpha)#

Bases: object

Gradient Descent optimization routine.

Gradient Descent takes a step of constant size alpha along the negative of the gradient on each iteration. The update is:

\[x_{k+1} = x_k - α g_k\]

where g_k is the gradient vector and α is the step size.

The user may anneal alpha over the course of optimization if they wish. Neither the cost function value nor higher-order information is used.
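
As a sketch, one iteration amounts to the following (illustrative only, not the library's internal code):

    def gd_step(x, fg, alpha):
        f, g = fg(x)       # evaluate cost and gradient at x_k
        x = x - alpha * g  # x_{k+1} = x_k - α g_k
        return x, f, g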

Create a new GradientDescent optimizer.

Parameters:
  • fg (callable) – a function which returns (f, g) where f is the scalar cost, and g is the vector gradient.

  • x0 (numpy.ndarray) – the parameter vector immediately prior to optimization

  • alpha (float) – the step size. The user may mutate self.alpha over the course of optimization with no negative effects (except optimization blowing up from a bad choice of new alpha).

step()#

Perform one iteration of optimization.

class prysm.x.optym.optimizers.AdaGrad(fg, x0, alpha)#

Bases: object

Adaptive Gradient Descent optimization routine.

Gradient Descent has the same step size for each parameter. AdaGrad self-learns a unique step size for each parameter, based on the accumulation of the square of the gradient over the course of optimization. The update is:

\[\begin{split}s_k &= s_{k-1} + (g*g) \\ x_{k+1} &= x_k - α g_k / \sqrt{s_k \,}\end{split}\]

The purpose of the square and square root operations is essentially to destroy the sign of g in the denominator gain. An alternative would be to simply do s_k = s_(k-1) + abs(g), which would have fewer numerical precision issues.

Ref [1] describes a “fully connected” version of AdaGrad that is a cousin of sorts to BFGS, storing an NxN matrix. This is intractable for large N. The implementation here, like most implementations in the wild, is the “diagonal” variant, which stores no information about the relationship between “spatial” elements of the gradient vector; only the temporal relationship between the gradient and its past is stored.
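
A sketch of the diagonal update (illustrative, not the library's internals; the eps guard against division by zero is an assumption, not documented above):

    import numpy as np

    def adagrad_step(x, s, fg, alpha, eps=1e-10):
        f, g = fg(x)
        s = s + g * g                         # accumulate squared gradient
        x = x - alpha * g / np.sqrt(s + eps)  # per-parameter effective step size
        return x, s, f, g

s would be initialized to zeros of the same shape as x0.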

References

[1] Duchi, John, Hazan, Elad and Singer, Yoram. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” https://doi.org/10.5555/1953048.2021068

Create a new AdaGrad optimizer.

Parameters:
  • fg (callable) – a function which returns (f, g) where f is the scalar cost, and g is the vector gradient.

  • x0 (numpy.ndarray) – the parameter vector immediately prior to optimization

  • alpha (float) – the step size

step()#

Perform one iteration of optimization.

class prysm.x.optym.optimizers.RMSProp(fg, x0, alpha, gamma=0.9)#

Bases: object

RMSProp optimization routine.

RMSProp keeps a moving average of the squared gradient of each parameter.

It is very similar to AdaGrad, except that the decay built into it allows it to forget old gradients. This often makes it superior for non-convex problems, where navigation from one valley into another poisons AdaGrad, while RMSProp will eventually forget about the old valley.

The update is:

\[\begin{split}s_k &= γ * s_{k-1} + (1-γ)*(g*g) \\ x_{k+1} &= x_k - α g_k / \sqrt{s_k \,}\end{split}\]

The decay term gamma forms a “moving average” of the squared gradient; combined with the square root in the gain, this is a “root mean square,” hence the name.
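
A sketch of the update (illustrative, not the library's internals; eps is an assumed guard against division by zero):

    import numpy as np

    def rmsprop_step(x, s, fg, alpha, gamma=0.9, eps=1e-10):
        f, g = fg(x)
        s = gamma * s + (1 - gamma) * (g * g)  # decaying average of g^2
        x = x - alpha * g / np.sqrt(s + eps)
        return x, s, f, g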

RMSProp is an unpublished algorithm; its source is Ref [1].

References

[1] Geoffrey Hinton, Nitish Srivastava, Kevin Swersky. “Neural Networks for Machine Learning, Lecture 6a: Overview of mini-batch gradient descent.” U. Toronto, CSC 321. https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

Create a new RMSProp optimizer.

Parameters:
  • fg (callable) – a function which returns (f, g) where f is the scalar cost, and g is the vector gradient.

  • x0 (numpy.ndarray) – the parameter vector immediately prior to optimization

  • alpha (float) – the step size

  • gamma (float) – the decay rate of the accumulated squared gradient

step()#

Perform one iteration of optimization.

class prysm.x.optym.optimizers.Adam(fg, x0, alpha, beta1=0.9, beta2=0.999)#

Bases: object

ADAM optimization routine.

ADAM, or “Adaptive Moment Estimation,” uses moving-average estimates of the mean of the gradient and of its “uncentered variance.” This combines several properties of AdaGrad and RMSProp, and also performs a form of self-annealing, in which the step size naturally decays as the optimizer converges. This can allow ADAM to recover after diverging, if the divergence is not too extreme.

The update is:

\[\begin{split}m &\equiv \text{mean} \\ v &\equiv \text{variance} \\ m_k &= β_1 m_{k-1} + (1-β_1) * g \\ v_k &= β_2 v_{k-1} + (1-β_2) * (g*g) \\ \hat{m}_k &= m_k / (1 - β_1^k) \\ \hat{v}_k &= v_k / (1 - β_2^k) \\ x_{k+1} &= x_k - α * \hat{m}_k / \sqrt{\hat{v}_k \,} \\\end{split}\]
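
A sketch of the update for iteration k (1-indexed; illustrative, not the library's internals; eps is an assumed guard against division by zero):

    import numpy as np

    def adam_step(x, m, v, k, fg, alpha, beta1=0.9, beta2=0.999, eps=1e-10):
        f, g = fg(x)
        m = beta1 * m + (1 - beta1) * g        # first moment (mean)
        v = beta2 * v + (1 - beta2) * (g * g)  # second moment (uncentered variance)
        mhat = m / (1 - beta1**k)              # bias corrections
        vhat = v / (1 - beta2**k)
        x = x - alpha * mhat / np.sqrt(vhat + eps)
        return x, m, v, f, g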

References

[1] Kingma, Diederik and Ba, Jimmy. “Adam: A Method for Stochastic Optimization.” http://arxiv.org/abs/1412.6980

Create a new ADAM optimizer.

Parameters:
  • fg (callable) – a function which returns (f, g) where f is the scalar cost, and g is the vector gradient.

  • x0 (numpy.ndarray) – the parameter vector immediately prior to optimization

  • alpha (float) – the step size

  • beta1 (float) – the decay rate of the first moment (mean of gradient)

  • beta2 (float) – the decay rate of the second moment (uncentered variance)

step()#

Perform one iteration of optimization.

class prysm.x.optym.optimizers.RAdam(fg, x0, alpha, beta1=0.9, beta2=0.999)#

Bases: object

RADAM optimization routine.

RADAM, or Rectified ADAM, is a modification of ADAM which seeks to remove the need for a warmup period and to stabilize the variance (second moment) estimate that ADAM uses. These properties make RADAM less sensitive to the choice of hyperparameters, especially alpha, and avoid distorting the trajectory of optimization in early iterations.

The update is:

\[\begin{split}m &\equiv \text{mean} \\ v &\equiv \text{variance} \\ \rho_\infty &= \frac{2}{1-β_2} - 1 \\ m_k &= β_1 m_{k-1} + (1-β_1) * g \\ v_k &= β_2 v_{k-1} + (1-β_2) * (g*g) \\ \hat{m}_k &= m_k / (1 - β_1^k) \\ \rho_k &= \rho_\infty - \frac{2 k β_2^k}{1-β_2^k} \\ \text{if } \rho_k &> 5: \\ \qquad l_k &= \sqrt{\frac{1 - β_2^k}{v_k}} \\ \qquad r_k &= \sqrt{\frac{(\rho_k - 4)(\rho_k-2)\rho_\infty}{(\rho_\infty-4)(\rho_\infty-2)\rho_k}} \\ \qquad x_{k+1} &= x_k - α r_k \hat{m}_k l_k \\ \text{else}& \\ \qquad x_{k+1} &= x_k - α \hat{m}_k\end{split}\]
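
A sketch of the rectification logic (illustrative, not the library's internals; eps is an assumed guard against division by zero):

    import numpy as np

    def radam_step(x, m, v, k, fg, alpha, beta1=0.9, beta2=0.999, eps=1e-10):
        f, g = fg(x)
        rho_inf = 2 / (1 - beta2) - 1
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * (g * g)
        mhat = m / (1 - beta1**k)
        rho = rho_inf - 2 * k * beta2**k / (1 - beta2**k)
        if rho > 5:  # variance is tractable: rectified, adaptive step
            l = np.sqrt((1 - beta2**k) / (v + eps))
            r = np.sqrt((rho - 4) * (rho - 2) * rho_inf /
                        ((rho_inf - 4) * (rho_inf - 2) * rho))
            x = x - alpha * r * mhat * l
        else:        # too little history: un-adapted step
            x = x - alpha * mhat
        return x, m, v, f, g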

References

[1] Liu, Liyuan and Jiang, Haoming and He, Pengcheng and Chen, Weizhu and Liu, Xiaodong and Gao, Jianfeng and Han, Jiawei. “On the Variance of the Adaptive Learning Rate and Beyond.” https://arxiv.org/abs/1908.03265

Create a new RADAM optimizer.

Parameters:
  • fg (callable) – a function which returns (f, g) where f is the scalar cost, and g is the vector gradient.

  • x0 (numpy.ndarray) – the parameter vector immediately prior to optimization

  • alpha (float) – the step size

  • beta1 (float) – the decay rate of the first moment (mean of gradient)

  • beta2 (float) – the decay rate of the second moment (uncentered variance)

step()#

Perform one iteration of optimization.

class prysm.x.optym.optimizers.AdaMomentum(fg, x0, alpha, beta1=0.9, beta2=0.999)#

Bases: object

AdaMomentum optimization routine.

AdaMomentum is an algorithm that is extremely similar to Adam, differing only in the calculation of v. The idea is to reduce the variance of the correction, which can plausibly improve the generality of found solutions, i.e., enter wider local/global minima.

The update is:

\[\begin{split}m &\equiv \text{mean} \\ v &\equiv \text{variance} \\ m_k &= β_1 m_{k-1} + (1-β_1) * g \\ v_k &= β_2 v_{k-1} + (1-β_2) * m_k^2 \\ \hat{m}_k &= m_k / (1 - β_1^k) \\ \hat{v}_k &= v_k / (1 - β_2^k) \\ x_{k+1} &= x_k - α * \hat{m}_k / \sqrt{\hat{v}_k \,} \\\end{split}\]
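
In code, the difference from Adam is a single line; a sketch (illustrative, not the library's internals; eps is an assumed guard against division by zero):

    import numpy as np

    def adamomentum_step(x, m, v, k, fg, alpha, beta1=0.9, beta2=0.999, eps=1e-10):
        f, g = fg(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * (m * m)  # the only change from Adam: m*m, not g*g
        mhat = m / (1 - beta1**k)
        vhat = v / (1 - beta2**k)
        x = x - alpha * mhat / np.sqrt(vhat + eps)
        return x, m, v, f, g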

References

[1] Wang, Yizhou and Kang, Yue and Qin, Can and Wang, Huan and Xu, Yi and Zhang, Yulun and Fu, Yun. “Rethinking Adam: A Twofold Exponential Moving Average Approach.” https://arxiv.org/abs/2106.11514

Create a new AdaMomentum optimizer.

Parameters:
  • fg (callable) – a function which returns (f, g) where f is the scalar cost, and g is the vector gradient.

  • x0 (numpy.ndarray) – the parameter vector immediately prior to optimization

  • alpha (float) – the step size

  • beta1 (float) – the decay rate of the first moment (mean of gradient)

  • beta2 (float) – the decay rate of the second moment (uncentered variance)

step()#

Perform one iteration of optimization.

class prysm.x.optym.optimizers.Yogi(fg, x0, alpha, beta1=0.9, beta2=0.999)#

Bases: object

YOGI optimization routine.

YOGI is a modification of ADAM which replaces the multiplicative update to the exponentially-moving-averaged estimate of the gradient's variance with an additive one. The premise is that the multiplicative update causes ADAM to forget past gradients too quickly, tailoring the second-order momentum term to overly localized behavior. The additive update in YOGI effectively spreads the second-order momentum update over a larger space. This allows YOGI to outperform ADAM on some non-convex or otherwise tumultuous cost functions, which have many changes in local curvature.

The update is:

\[\begin{split}m &\equiv \text{mean} \\ v &\equiv \text{variance} \\ m_k &= β_1 m_{k-1} + (1-β_1) * g \\ v_k &= v_{k-1} - (1-β_2) * \text{sign}(v_{k-1} - (g^2))*(g^2) \\ x_{k+1} &= x_k - α * m_k / \sqrt{v_k \,} \\\end{split}\]
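
A sketch of the update (illustrative, not the library's internals; eps is an assumed guard against division by zero):

    import numpy as np

    def yogi_step(x, m, v, fg, alpha, beta1=0.9, beta2=0.999, eps=1e-10):
        f, g = fg(x)
        g2 = g * g
        m = beta1 * m + (1 - beta1) * g
        v = v - (1 - beta2) * np.sign(v - g2) * g2  # additive, sign-based v update
        x = x - alpha * m / np.sqrt(v + eps)
        return x, m, v, f, g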

References

[1] Zaheer, Manzil and Reddi, Sashank and Sachan, Devendra and Kale, Satyen and Kumar, Sanjiv. “Adaptive Methods for Nonconvex Optimization.” https://papers.nips.cc/paper_files/paper/2018/hash/90365351ccc7437a1309dc64e4db32a3-Abstract.html

Create a new YOGI optimizer.

Parameters:
  • fg (callable) – a function which returns (f, g) where f is the scalar cost, and g is the vector gradient.

  • x0 (numpy.ndarray) – the parameter vector immediately prior to optimization

  • alpha (float) – the step size

  • beta1 (float) – the decay rate of the first moment (mean of gradient)

  • beta2 (float) – the decay rate of the second moment (uncentered variance)

step()#

Perform one iteration of optimization.

class prysm.x.optym.optimizers.F77LBFGSB(fg, x0, memory=10, lower_bounds=None, upper_bounds=None)#

Bases: object

Limited-Memory Broyden-Fletcher-Goldfarb-Shanno optimizer, variant B (L-BFGS-B).

L-BFGS-B is a Quasi-Newton method which uses the previous m gradient vectors to perform the BFGS update, which itself is an approximation of Newton’s Method.

The “L” in L-BFGS stands for Limited Memory, a reference to the m*n storage requirement, where m is a small integer (say 10 to 30) and n is the number of variables.

At its core, L-BFGS solves the BFGS update using an adaptive line search, satisfying the strong Wolfe conditions, which guarantee that it does not move uphill.

Variant B (L-BFGS-B) adds bound constraints on the variables and incorporates subspace minimization, which further accelerates convergence.

Subspace minimization is the practice of forming a lower-dimensional subspace (a “manifold,” essentially an enclosed Euclidean geometry) for the problem at a given iteration, and then exactly solving for the minimum within that subspace.

The combination of subspace minimization and a quasi-Newton update gives L-BFGS-B very rapid convergence, where it may converge by an order of magnitude in cost or more on each iteration.

This wrapper around Jorge Nocedal’s Fortran code, made available through SciPy, attempts to defeat the built-in convergence tests of lbfgsb.f, but is not always successful due to the nature of floating point arithmetic. Unlike all other classes in this file, L-BFGS-B may refuse to step(), and may stop early in a runN or run_to call. A warning will be generated in such instances.
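
A minimal usage sketch with bound constraints (the quadratic fg is a stand-in for a real cost function):

    import numpy as np

    from prysm.x.optym.optimizers import F77LBFGSB, runN

    def fg(x):
        f = np.sum((x - 3) ** 2)  # toy bowl; unconstrained minimum at x=3
        g = 2 * (x - 3)
        return f, g

    x0 = np.zeros(5)
    lb = np.zeros(5)      # x >= 0
    ub = np.full(5, 2.0)  # x <= 2; the solution will sit on this bound
    opt = F77LBFGSB(fg, x0, memory=10, lower_bounds=lb, upper_bounds=ub)
    for xk, fk, gk in runN(opt, 50):  # may stop early, per the note above
        pass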

References

[1] Jorge Nocedal, “Updating Quasi-Newton Matrices with Limited Storage.” https://doi.org/10.2307/2006193

[2] Richard H. Byrd, Peihuang Lu, and Jorge Nocedal, “A Limited-Memory Algorithm for Bound-Constrained Optimization.” https://doi.org/10.1137/0916069

[3] Ciyou Zhu, Richard H. Byrd, Peihuang Lu, and Jorge Nocedal, “Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization.” https://doi.org/10.1145/279232.279236

[4] José Luis Morales and Jorge Nocedal, “Remark on ‘Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization.’” https://doi.org/10.1145/2049662.2049669

Create a new L-BFGS-B optimizer.

Parameters:
  • fg (callable) – a function which returns (f, g) where f is the scalar cost, and g is the vector gradient.

  • x0 (numpy.ndarray) – the parameter vector immediately prior to optimization

  • memory (int) – the number of recent gradient vectors to use in performing the approximate Newton’s step

  • lower_bounds (numpy.ndarray, optional) – vector of the same size as x0 containing the hard lower bounds for the variables; if None, the variables are unconstrained below

  • upper_bounds (numpy.ndarray, optional) – vector of the same size as x0 containing the hard upper bounds for the variables; if None, the variables are unconstrained above

step()#

Perform one iteration of optimization.

run_to(N)#

Run the optimizer until its iteration count equals N.

Activation Functions#

Activation functions and related nodes.

class prysm.x.optym.activation.Softmax#

Bases: object

Softmax activation function.

Softmax is a soft, differentiable alternative to argmax. It is used as a component of GumbelSoftmaxEncoder to encourage / softly force variables to take on one of K discrete states.

The arrays passed to forward() and backprop() may have any number of dimensions. The final dimension is understood to be the one softmaxed over; all preceding dimensions are treated as independent variables.

For example, to softmax a 256x256 array over 4 activation levels, the input should be 256x256x4.
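
A sketch of a softmax over the trailing axis (illustrative, not the class's internal code; subtracting the per-element max is a standard numerical-stability trick):

    import numpy as np

    def softmax_last_axis(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    logits = np.random.rand(256, 256, 4)   # 256x256 variables, K=4 levels each
    p = softmax_last_axis(logits)
    assert np.allclose(p.sum(axis=-1), 1)  # sums to 1 over the trailing axis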

Create a new Softmax node.

forward(x)#

Perform Softmax activation on logits.

Parameters:

x (numpy.ndarray, shape (A,B,C, ... K)) – any number of leading dimensions, required trailing dimension of size K, where K is the number of levels to be used with an encoder

Returns:

activated x of the same shape as x, where the sum over the trailing axis == 1

Return type:

numpy.ndarray

backprop(grad)#

Backpropagate grad through Softmax.

Parameters:

grad (numpy.ndarray) – gradient of scalar cost function w.r.t. following step in forward problem, of same shape as passed to forward()

Returns:

dcost/dsoftmax-input

Return type:

numpy.ndarray

class prysm.x.optym.activation.GumbelSoftmax(tau=1, eps=None)#

Bases: object

GumbelSoftmax combines the softmax activation function with stochastic Gumbel noise to encourage variables to fall into discrete categories.

See: https://arxiv.org/pdf/1611.01144.pdf and https://arxiv.org/pdf/1611.00712.pdf

You most likely want to use GumbelSoftmaxEncoder, not this class directly.

Create a new GumbelSoftmax estimator.

tau is the temperature parameter; as tau -> 0, the output becomes increasingly discrete. tau should be annealed toward, but never exactly to, zero over the course of design/optimization.
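
A sketch of one possible annealing schedule (the geometric decay and the mutation of the tau attribute are assumptions for illustration; the schedule is problem-dependent):

    import numpy as np

    taus = np.geomspace(1.0, 0.01, 500)  # decays toward, never to, zero
    # inside an optimization loop, one might then do:
    #     gs.tau = taus[k]
    # where gs is a GumbelSoftmax instance and k the iteration count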

forward(x)#

Gumbel-softmax process on x.

backprop(protograd)#

Adjoint of forward().

class prysm.x.optym.activation.DiscreteEncoder(estimator, levels)#

Bases: object

An encoder that embeds a continuous estimator, encouraging values to cluster at discrete states.

Create a new DiscreteEncoder.

Parameters:
  • estimator (an initialized estimator) – for example GumbelSoftmax()

  • levels (int or numpy.ndarray) – if int, arange(levels) is generated internally; else, expected to be K discrete, non-overlapping integer states

forward(x)#

Forward pass through the continuous proxy for optimization.

Use discretize() to view the current best-fit discrete realization.

backprop(grad)#

Backpropagation through the continuous proxy for optimization.

discretize(x)#

Perform discrete encoding of x.

Note that when the estimator weights are not yet converged or are non-sparse, the output of this function will not closely match the continuous proxy that is actually being optimized.
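
A usage sketch (the (..., K) logits shape follows the Softmax documentation above and is assumed to apply here as well):

    import numpy as np

    from prysm.x.optym.activation import DiscreteEncoder, GumbelSoftmax

    # encode a 256x256 design into 4 discrete levels {0, 1, 2, 3}
    enc = DiscreteEncoder(GumbelSoftmax(tau=1), levels=4)

    logits = np.random.randn(256, 256, 4)  # trailing dimension of size K=4
    proxy = enc.forward(logits)            # continuous proxy, used in optimization
    design = enc.discretize(logits)        # current best-fit discrete realization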

Cost (loss) Functions#

Cost functions, aka figures of merit for models.

class prysm.x.optym.cost.BiasAndGainInvariantError#

Bases: object

Bias and gain invariant error.

This cost function computes internal least-mean-squares estimates of the overall bias (DC pedestal) and gain between the model I and the data D. This objective is useful when the overall signal level is ambiguous, as in phase-retrieval-type problems, and can significantly help stabilize the optimization process.
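
A conceptual sketch of the idea (the closed-form least-squares fit below illustrates bias/gain invariance only; it is not the class's internal implementation):

    import numpy as np

    def bias_gain_invariant_mse(I, D, mask):
        i, d = I[mask], D[mask]
        ihat = i - i.mean()                # remove the means, then fit the
        dhat = d - d.mean()                # gain a and bias b by least squares
        a = (ihat @ dhat) / (ihat @ ihat)
        b = d.mean() - a * i.mean()
        r = (a * i + b) - d                # residual after removing bias/gain
        return np.mean(r * r)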

Create a new BiasAndGainInvariantError instance.

forward(I, D, mask)#

Forward cost evaluation.

Parameters:
  • I (numpy.ndarray) – ‘intensity’ or model data, any float dtype, any shape

  • D (numpy.ndarray) – ‘data’ or true measurement to be matched, any float dtype, any shape

  • mask (numpy.ndarray) – logical array with elements to keep (True) or exclude (False)

Returns:

scalar cost

Return type:

float

backprop()#

Returns the first step of gradient backpropagation, an array of the same shape as I.

prysm.x.optym.cost.mean_square_error(M, D, mask=None)#

Mean square error.

Parameters:
  • M (numpy.ndarray) – “model data”

  • D (numpy.ndarray) – “truth data”

  • mask (numpy.ndarray, optional) – True where M should contribute to the cost, False where it should not

Returns:

cost, dcost/dM

Return type:

float, numpy.ndarray
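
A sketch matching the documented (cost, dcost/dM) return signature (illustrative, not the library's source):

    import numpy as np

    def mse(M, D, mask=None):
        r = M - D
        if mask is not None:
            r = np.where(mask, r, 0)  # masked-out elements contribute nothing
            n = mask.sum()
        else:
            n = r.size
        cost = np.sum(r * r) / n
        dM = 2 * r / n                # analytic gradient; zero where masked
        return cost, dM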