Globally Convergent Modifications of Newton’s Method¶
The Model Trust Region Approach¶
[Gay81] showed that perturbing the Hessian to ensure positive semidefiniteness still yields a solution that minimizes (6.4.1).
The level set of \(f\) at value \(c\) is the set of all points \(\mathbf{x} \in \mathbb{R}^n\) such that \(f(\mathbf{x}) = c\). When \(n = 2\), a level set is called a level curve, contour line, or isoline. When \(n = 3\), it is known as a level surface or isosurface. In higher dimensions, it is called a level hypersurface.
[Joy] provides a clear derivation of why the gradient of a function is in the direction of steepest ascent (or greatest change). To see why it is normal to the surface, suppose that \(\mathbf{t} \in \mathbb{R}^n\) is tangent to the level surface of \(f\) through \(\mathbf{x}_0\). As you move in the \(\mathbf{t}\) direction, you are at that instant moving along the level surface, and hence the value of \(f\) does not change. This is another way of saying the directional derivative in this direction is zero. Therefore, the gradient is orthogonal to the tangent plane.
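As a quick numerical illustration (not part of [Joy]'s derivation), the sketch below uses \(f(x, y) = x^2 + y^2\), whose level curves are circles: at each sampled point of the unit circle, the inner product of the gradient with the tangent direction is zero up to floating-point error.

import numpy

# f(x, y) = x**2 + y**2 has circular level curves; at (cos t, sin t) on the
# unit circle the tangent direction is (-sin t, cos t)
def gradient(p):
    return numpy.asarray([2 * p[0], 2 * p[1]])

for t in numpy.linspace(0, 2 * numpy.pi, 5):
    point = numpy.asarray([numpy.cos(t), numpy.sin(t)])
    tangent = numpy.asarray([-numpy.sin(t), numpy.cos(t)])
    # the directional derivative along the tangent vanishes, so the gradient
    # is normal to the level curve
    print(numpy.dot(gradient(point), tangent))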
Exercise 1¶
A descent direction \(p\) of \(f\) at \(x_c\) satisfies \(\nabla f(x_c)^\top p < 0\).
The steepest-descent direction is the solution to \(\min_{p} \nabla f(x)^\top p\) subject to \(\left\Vert p \right\Vert_2 = 1\).
The \(l_2\)-norm restricts the solution to the unit sphere around \(x\), so \(p = -\frac{\nabla f(x)}{\left\Vert \nabla f(x) \right\Vert_2}\) is the steepest-descent direction because, by the Cauchy-Schwarz inequality, \(\nabla f(x)^\top p \geq -\left\Vert \nabla f(x) \right\Vert_2 \left\Vert p \right\Vert_2 = -\left\Vert \nabla f(x) \right\Vert_2\) with equality at this choice of \(p\).
Given \(f(x) = 3 x_1^2 + 2 x_1 x_2 + x_2^2\) and \(\nabla f(x) = \begin{bmatrix} 6 x_1 + 2 x_2\\ 2 x_1 + 2 x_2\end{bmatrix}\), the steepest-descent direction at \(x_0 = (1,1)^\top\) is \(-\frac{\nabla f(x_0)}{\left\Vert \nabla f(x_0) \right\Vert_2} = -\frac{1}{\sqrt{80}} (8, 4)^\top = -\frac{1}{\sqrt{5}} (2, 1)^\top\).
The vector \((1, -1)^\top\) is not a descent direction at \(x_0\) because
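A minimal numerical check of the above, using the function and point from the exercise:

import numpy

def gradient(x):
    return numpy.asarray([6 * x[0] + 2 * x[1], 2 * x[0] + 2 * x[1]])

x_0 = numpy.asarray([1.0, 1.0])
g = gradient(x_0)                             # (8, 4)
print('steepest descent:', -g / numpy.linalg.norm(g))
# (1, -1) is not a descent direction because the inner product is positive
print('g^T (1, -1) =', numpy.dot(g, numpy.asarray([1.0, -1.0])))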
Exercise 2¶
Given \(s = -\lambda_k \nabla f(x_k) = -\lambda_k g\),
and
By Theorem 4.3.2,
Exercise 3¶
Define \(\hat{c} \in (0, 1)\) and \(\nabla^2 f(x) = \begin{bmatrix} \lambda_0 & 0\\ 0 & \lambda_1 \end{bmatrix}\) where \(x \in \mathbb{R}^2\) and \(\lambda_0 \geq \lambda_1 > 0\).
By (6.2.4), \(\hat{c} \triangleq \frac{\lambda_0 - \lambda_1}{\lambda_0 + \lambda_1}\). This sets \(\nabla f(x) = \begin{bmatrix} \lambda_0 x_1\\ \lambda_1 x_2 \end{bmatrix}\) and \(f(x) = \frac{\lambda_0}{2} x_1^2 + \frac{\lambda_1}{2} x_2^2\). Clearly, \(x_* = (0,0)^\top\) still holds and
will converge as long as
Exercise 4¶
Given
\(\left\{ x_k \right\} = \left\{ \left( -1 \right)^k \left( 1 + 2^{-k} \right) \right\}\) for \(k \in \mathbb{N}_0\).
This infinite sequence is not permitted by (6.3.3) because
where the last inequality holds \(\forall k \in \mathbb{N}_0\) only when \(\alpha \leq 0\).
Exercise 5¶
Given
\(\left\{ x_k \right\} = \left\{ 1 + 2^{-k} \right\}\) for \(k \in \mathbb{N}_0\).
The infinite sequence is permitted by (6.3.3) when \(\alpha \leq \frac{1}{2}\) because
The infinite sequence is prohibited by (6.3.4) because
where the last inequality holds \(\forall k \in \mathbb{N}_0\) when \(\beta \geq 1\), but \(\beta \in (\alpha, 1)\).
Exercise 6¶
Let \(f(x)\) be a positive definite quadratic such that \(f(x) = \frac{1}{2} x^\top A x + b^\top x + c\)
with \(x \in \mathbb{R}^n\) and \(A \succ 0\).
Recall that a full Newton step at \(x_k \in \mathbb{R}^n\) means \(\lambda = 1\) and \(p = -H_k^{-1} \nabla f(x_k)\) where \(H_k = \nabla^2 f(x_k) + \mu_k I\) is positive definite with \(\mu_k = 0\) if \(\nabla^2 f(x_k)\) is positive definite. Applying a full Newton step to \(f(x)\) yields
where \(\alpha \leq \frac{1}{2}\), which satisfies condition (6.3.3).
Condition (6.3.4) is also satisfied for any \(\beta \in (0, 1)\) because the full Newton step lands on the minimizer of the quadratic, where \(\nabla f(x_{k + 1}) = 0\), so \(\nabla f(x_{k + 1})^\top (x_{k + 1} - x_k) = 0 \geq \beta \nabla f(x_k)^\top (x_{k + 1} - x_k)\).
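The following sketch (with an arbitrarily chosen symmetric positive definite \(A\), vector \(b\), and starting point, all assumptions for illustration) confirms that a full Newton step on such a quadratic lands on the minimizer and that \(f(x_+) - f(x_k) = \frac{1}{2} \nabla f(x_k)^\top s\), which is the equality behind the \(\alpha \leq \frac{1}{2}\) claim above.

import numpy

numpy.random.seed(0)
B = numpy.random.randn(3, 3)
A = B @ B.T + 3 * numpy.eye(3)                # symmetric positive definite
b = numpy.random.randn(3)

f = lambda x: 0.5 * x @ A @ x + b @ x
x_k = numpy.random.randn(3)
g = A @ x_k + b                               # gradient of the quadratic at x_k
s = -numpy.linalg.solve(A, g)                 # full Newton step
x_plus = x_k + s

print('x_+ is the minimizer -A^{-1} b:',
      numpy.allclose(x_plus, -numpy.linalg.solve(A, b)))
print('f(x_+) - f(x_k) =', f(x_plus) - f(x_k))
print('0.5 * g^T s     =', 0.5 * (g @ s))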
Exercise 7¶
Let (6.3.4) be replaced with \(f(x_{k + 1}) \geq f(x_k) + \beta \nabla f(x_k)^\top (x_{k + 1} - x_k)\) where \(\beta \in (\alpha, 1)\).
The proof for Theorem 6.3.2 still proceeds as in (6.3.5). Observe that
Thus, by the continuity of \(\nabla f\), Theorem 6.3.2 holds as long as \(\lambda_k \in (\beta \hat{\lambda}, \alpha \hat{\lambda})\).
The proof for Theorem 6.3.3 is the same except for the direct usage of condition (6.3.4). That condition needs to be derived as
This derivation can also be used as a transition into the part of Theorem 6.3.4’s proof that uses condition (6.3.4).
Exercise 8¶
Given
\(\nabla f(x) = \begin{bmatrix} 4 x_1^3\\ 2 x_2 \end{bmatrix}\) and \(\nabla^2 f(x) = \begin{bmatrix} 12 x_1^2 & 0\\ 0 & 2 \end{bmatrix}\).
Since the Hessian is positive definite, \(\mu_k = 0\) and \(H_k = \nabla^2 f(x_k)\) in the Newton step.
(a)¶
Performing a full Newton step (\(\lambda = 1\)) gives
(b)¶
Given \(\alpha = 10^{-4}\) and \(\lambda_k = 1\), the backtracking line-search gives the same \(x_+\) as in (a) because
shows that condition (6.3.3) is satisfied.
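For completeness, here is a minimal backtracking line-search sketch for the sufficient-decrease condition (6.3.3). The objective \(f(x) = x_1^4 + x_2^2\) is inferred from the gradient given above, and the starting point \(x_0 = (1, 1)^\top\) is an assumption for illustration; with these choices the full step \(\lambda = 1\) is accepted immediately.

import numpy

def backtrack(f, grad, x, p, alpha=1e-4, rho=0.5):
    # halve lambda until the sufficient-decrease condition (6.3.3) holds
    lam = 1.0
    while f(x + lam * p) > f(x) + alpha * lam * (grad(x) @ p):
        lam *= rho
    return lam

f = lambda x: x[0]**4 + x[1]**2               # inferred from the gradient above
grad = lambda x: numpy.asarray([4 * x[0]**3, 2 * x[1]])
hess = lambda x: numpy.diag([12 * x[0]**2, 2.0])

x_0 = numpy.asarray([1.0, 1.0])               # assumed starting point
p = -numpy.linalg.solve(hess(x_0), grad(x_0)) # Newton direction
lam = backtrack(f, grad, x_0, p)
print('accepted lambda =', lam, ' x_+ =', x_0 + lam * p)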
(c)¶
The only improvement I can think of is to use (6.3.17), which yields \(\lambda = 1.08871\) and makes \(x_+\) closer to the minimizer than in (b).
Exercise 9¶
Given \(x_c = (1, 1)^\top\) and \(f(x) = \frac{1}{2} x_1^2 + x_2^2\),
The unconstrained solution to (6.4.1) is the Newton step \(s = -\nabla^2 f(x_c)^{-1} \nabla f(x_c) = (-1, -1)^\top\), whose length is \(\left\Vert s \right\Vert_2 = \sqrt{2} \approx 1.414\).
When \(\delta = 2\), the trust region contains this step, so the unconstrained solution is valid.
When \(\delta = \frac{5}{6}\), (6.4.2) needs to be invoked: solving \(\left\Vert \left( \nabla^2 f(x_c) + \mu I \right)^{-1} \nabla f(x_c) \right\Vert_2 = \delta\) for \(\mu \geq 0\) gives \(\mu = 1\) and \(s(\mu) = -\left( \nabla^2 f(x_c) + \mu I \right)^{-1} \nabla f(x_c) = \left( -\frac{1}{2}, -\frac{2}{3} \right)^\top\).
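A minimal sketch of how the multiplier can be found numerically, assuming (6.4.2) takes the form \(s(\mu) = -\left( \nabla^2 f(x_c) + \mu I \right)^{-1} \nabla f(x_c)\) with \(\mu \geq 0\) chosen so that \(\left\Vert s(\mu) \right\Vert_2 = \delta\) (consistent with Exercise 11 and the code of Exercise 10); bisection suffices because \(\left\Vert s(\mu) \right\Vert_2\) is decreasing in \(\mu\) for positive definite Hessians (cf. Exercise 12).

import numpy

H = numpy.diag([1.0, 2.0])                    # Hessian of f at x_c = (1, 1)
g = numpy.asarray([1.0, 2.0])                 # gradient of f at x_c = (1, 1)
delta = 5.0 / 6.0

def step_norm(mu):
    return numpy.linalg.norm(numpy.linalg.solve(H + mu * numpy.eye(2), g))

mu_lo, mu_hi = 0.0, 1.0
while step_norm(mu_hi) > delta:               # bracket the solution
    mu_hi *= 2
for _ in range(60):                           # bisect on the bracket
    mu_mid = 0.5 * (mu_lo + mu_hi)
    if step_norm(mu_mid) > delta:
        mu_lo = mu_mid
    else:
        mu_hi = mu_mid

mu = 0.5 * (mu_lo + mu_hi)
s = -numpy.linalg.solve(H + mu * numpy.eye(2), g)
print('mu = {:.4f}, s = {}, ||s|| = {:.4f}'.format(mu, s, numpy.linalg.norm(s)))

Under that assumption the search lands at \(\mu = 1\) and \(s = \left( -\frac{1}{2}, -\frac{2}{3} \right)^\top\), matching the value stated above.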
Exercise 10¶
Given \(f(x) = x_1^2 + 2 x_2^2 + 3 x_3^2\),
When \(x_c = (1, 1, 1)^\top\), \(\nabla f(x_c) = (2, 4, 6)^\top\) and \(\nabla^2 f(x_c) = \begin{bmatrix} 2 & 0 & 0\\ 0 & 4 & 0\\ 0 & 0 & 6 \end{bmatrix}\).
Let \(\mu_0 = 1\), \(\mu_1 = 2\), and \(\mu_2 = 3\).
As shown below, the volume of the tetrahedron whose vertices are \(x_c\) and the three steps \(s(\mu_i) = -\left( \nabla^2 f(x_c) + \mu_i I \right)^{-1} \nabla f(x_c)\) is not zero, and none of the singular values are zero.
[1]:
import numpy
# rows of M: x_c and the steps s(mu_i) = -(H + mu_i I)^{-1} grad f(x_c),
# each with a trailing 1; the determinant vanishes iff the four points are coplanar
M = numpy.ones((4,4))
M[0,:3] = numpy.asfarray([1, 1, 1])
M[1,:3] = -numpy.asfarray([2/3, 4/5, 6/7])
M[2,:3] = -numpy.asfarray([2/4, 4/6, 6/8])
M[3,:3] = -numpy.asfarray([2/5, 4/7, 6/9])
print('Volume of Tetrahedron = {0}'.format(numpy.linalg.det(M)))
print('Singular Values = {0}'.format(numpy.linalg.svd(M[:,:3])[1]))
Volume of Tetrahedron = 0.0005139833711262165
Singular Values = [2.63921809 0.20381107 0.00683625]
Exercise 11¶
Given \(H \in \mathbb{R}^{n \times n}\) is symmetric and positive definite, it has an eigendecomposition of \(H = Q \Lambda Q^\top = \sum_{i = 1}^n \lambda_i v_i v_i^\top\).
Let \(g = \sum_{j = 1}^n \alpha_j v_j\) and \(\mu \geq 0\).
and
This relates to \(m_c(\mu)\) in (6.4.5) in the sense that \(g = -\nabla f(x_c)\).
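A quick numerical confirmation of the identity \((H + \mu I)^{-1} g = \sum_{j = 1}^n \frac{\alpha_j}{\lambda_j + \mu} v_j\), using a randomly generated \(H\) and \(g\) and an arbitrary \(\mu\) (all assumptions for illustration only):

import numpy

numpy.random.seed(0)
B = numpy.random.randn(4, 4)
H = B @ B.T + numpy.eye(4)                    # symmetric positive definite
g = numpy.random.randn(4)
mu = 0.7                                      # arbitrary mu >= 0

lam, Q = numpy.linalg.eigh(H)                 # columns of Q are the v_i
alpha = Q.T @ g                               # coefficients of g in the eigenbasis
s_eig = Q @ (alpha / (lam + mu))              # sum_j alpha_j / (lambda_j + mu) v_j
s_dir = numpy.linalg.solve(H + mu * numpy.eye(4), g)
print(numpy.allclose(s_eig, s_dir))           # True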
Exercise 12¶
Let \(s(\mu) = (H + \mu I)^{-1} g\) for \(\mu \geq 0\) and \(\eta(\mu) = \left\Vert s(\mu) \right\Vert\).
Exercise 13¶
Given \(f(x) = \frac{1}{2} x_1^2 + x_2^2\),
Let \(x_0 = (1,1)^\top\), \(g = \nabla f(x_0) = (1, 2)^\top\), and \(H = \nabla^2 f(x_0) = \begin{bmatrix} 1 & 0\\ 0 & 2 \end{bmatrix}\).
When \(\delta = 1\), the Cauchy point lies outside the trust region, so the step is simply the steepest-descent direction scaled to the boundary; when \(\delta = 1.25\), the Cauchy point is inside the region and the step along the parameterized dogleg line must be computed.
[2]:
import matplotlib.pyplot as plt
import numpy
def double_dogleg_step(x, g, H, delta):
    s_N = -numpy.linalg.inv(H) * g
    #return the Newton point via Newton step
    if (s_N.T * s_N).flat[0] <= delta**2:
        return x + s_N
    #compute Cauchy point
    lambda_star = (g.T * g / (g.T * H * g)).flat[0]
    C_P = x - lambda_star * g
    s_CP = -lambda_star * g
    #compute Newton point
    _ = (g.T * H * g).flat[0] * (g.T * numpy.linalg.inv(H) * g).flat[0]
    gamma = (g.T * g).flat[0]**2 / _
    eta = 0.8 * gamma + 0.2
    N = x + eta * s_N
    s_hat_N = eta * s_N
    a = ((s_hat_N - s_CP).T * (s_hat_N - s_CP)).flat[0]
    b = 2 * ((s_hat_N - s_CP).T * s_CP).flat[0]
    c = (s_CP.T * s_CP).flat[0] - delta**2
    if delta <= numpy.linalg.norm(g) * lambda_star:
        #C.P. exceeds delta-region
        _ = x - delta / numpy.linalg.norm(g) * g
    else:
        #C.P. is within delta-region, so solve for parameterized line
        lambda_c = max((-b + numpy.sqrt(b**2 - 4 * a * c)) / (2 * a),
                       (-b - numpy.sqrt(b**2 - 4 * a * c)) / (2 * a))
        _ = x + s_CP + lambda_c * (s_hat_N - s_CP)

    fig = plt.figure(figsize=[16,8])
    plt.plot([x.flat[0], C_P.flat[0]], [x.flat[1], C_P.flat[1]],
             marker='o', linestyle='--', color='r', label='CP')
    plt.plot([x.flat[0], N.flat[0]], [x.flat[1], N.flat[1]],
             marker=r'$\checkmark$', linestyle=':', color='b', label='N')
    plt.plot([C_P.flat[0], N.flat[0]], [C_P.flat[1], N.flat[1]],
             marker=r'$\bowtie$', linestyle='-', color='g')
    plt.plot([_.flat[0]], [_.flat[1]],
             marker=r'$\lambda$', color='k', label='x', markersize=10)
    circ = numpy.linspace(0, 2 * numpy.pi, 128)
    plt.plot(x.flat[0] + delta * numpy.cos(circ),
             x.flat[1] + delta * numpy.sin(circ))
    plt.axis('equal')
    plt.title(r'Double Dogleg Curve $\delta = {0}$'.format(delta))
    plt.legend()
    plt.show()
    return _

def fp(x):
    _ = x.flat
    return numpy.asmatrix([_[0], 2 * _[1]]).T

def fpp(x):
    return numpy.asmatrix([[1, 0], [0, 2]])

x_0 = numpy.asmatrix([1, 1]).T
g = fp(x_0)
H = fpp(x_0)
for delta in [1, 5 / 4]:
    _ = double_dogleg_step(x_0, g, H, delta)
(Two figures produced: the double dogleg curve and trust region for \(\delta = 1\) and for \(\delta = 1.25\).)
Exercise 14¶
Exercise 15¶
Let \(f(x_1, x_2) = x_1^4 + x_1^2 + x_2^2\), then \(\nabla f(x) = (4 x_1^3 + 2 x_1, 2 x_2)^\top\) and \(H = \nabla^2 f(x) = \begin{bmatrix} 12 x_1^2 + 2 & 0\\ 0 & 2 \end{bmatrix}\).
With the updated \(\delta_c = 1\) and \(x_c = (0.666, 0.665)^\top\), \(\nabla f(x_c) = (2.514, 1.33)^\top\) and \(H = \begin{bmatrix} 7.323 & 0\\ 0 & 2 \end{bmatrix}\).
Based on Lemma 6.4.1, performing a Newton step yields \(s_c^N = -H^{-1} \nabla f(x_c) = (-0.343, -0.665)^\top\).
Since \(\left\Vert s_c^N \right\Vert_2 \approx 0.748 \leq \delta_c\), the solution for (6.4.2) is \(s(0) = s_c^N\).
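A one-cell check of the Newton step and its length:

import numpy

x_c = numpy.asarray([0.666, 0.665])
g = numpy.asarray([4 * x_c[0]**3 + 2 * x_c[0], 2 * x_c[1]])
H = numpy.diag([12 * x_c[0]**2 + 2, 2.0])
s_N = -numpy.linalg.solve(H, g)
print('s_N = {}, ||s_N|| = {:.3f}'.format(s_N, numpy.linalg.norm(s_N)))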
Exercise 16¶
Let \(F(x) = (x_1, 2 x_2)^\top\) and \(x_0 = (1, 1)^\top\), then \(J(x_0) = \begin{bmatrix} 1 & 0\\ 0 & 2 \end{bmatrix}\).
The steepest-descent direction for (6.5.3) is \(-\frac{J(x)^\top F(x)}{\left\Vert J(x)^\top F(x) \right\Vert_2}\), since \(\nabla f(x) = J(x)^\top F(x)\).
Applying (6.2.4) shows that \(c \triangleq \frac{2 - 1}{2 + 1} = \frac{1}{3}\). Therefore, the convergence rate will be slower than linear.
[3]:
def F(x):
    x = x.flat
    return numpy.asmatrix([x[0], x[1] * 2]).T

def J(x):
    return numpy.asmatrix([[1, 0], [0, 2]])

x_c = numpy.asmatrix([1, 1]).T
for i in range(8):
    top = J(x_c).T * F(x_c)
    s = -top / numpy.sqrt(top.T * top)
    x_c = x_c + s
    print('Iteration {0}: {1}'.format(i, numpy.asarray(x_c).squeeze()))
Iteration 0: [0.75746437 0.0298575 ]
Iteration 1: [-0.23033265 -0.12588923]
Iteration 2: [0.18562906 0.7834929 ]
Iteration 3: [ 0.12650144 -0.21475753]
Iteration 4: [-0.01918811 0.77457283]
Iteration 5: [-0.0129951 -0.225408 ]
Iteration 6: [0.00141627 0.77448815]
Iteration 7: [ 0.00095911 -0.22551174]
Exercise 17¶
The following facts are useful in the proof that follows.
Given a matrix \(A \in \mathbb{R}^{m \times n}\) of rank \(r\), the nullspace \(\mathbf{N}(A)\) is the space of all vectors \(x\) such that \(Ax = 0\). Likewise, the left nullspace is the space of all vectors \(y\) such that \(A^\top y = 0\) or equivalently \(y^\top A = 0\). The dimension of the column space \(\mathbf{C}(A)\) is \(r\) and the dimension of the row space \(\mathbf{C}(A^\top)\) is also \(r\). The dimension of the nullspace \(\mathbf{N}(A)\) is \(n - r\) and the dimension of the left nullspace \(\mathbf{N}(A^\top)\) is \(m - r\).
Let \(f(x) = \frac{1}{2} F(x)^\top F(x)\) where \(F \colon \mathbb{R}^n \rightarrow \mathbb{R}^n\) and each component \(f_i\) for \(i = 1, \ldots, n\) is continuously differentiable.
Recall from definition 4.1.7 that
Since \(F(\hat{x}) \neq 0\) and \(\hat{x}\) is a local minimizer (Lemma 4.3.1), \(\nabla f(\hat{x}) = J(\hat{x})^\top F(\hat{x}) = 0\) implies that \(F(\hat{x})\) is a nonzero vector in the left nullspace \(\mathbf{N}\left( J(\hat{x})^\top \right)\) of \(J(\hat{x})\).
By the fundamental theorem of linear algebra, the corank of \(J(\hat{x})\) is not zero. Since \(J\) is a square matrix, its corank is equal to its nullity. Thus, \(J(\hat{x})\) is singular.
As shown in Figure 6.5.1, when \(J(\hat{x})\) is singular, \(\hat{x}\) could also be a global minimizer.
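A finite-difference check of the identity \(\nabla f(x) = J(x)^\top F(x)\) used above, with the same \(F\) that appears in the code for Exercise 20 (the test point is arbitrary):

import numpy

def F(x):
    return numpy.asarray([x[0]**2 + x[1]**2 - 2,
                          numpy.exp(x[0] - 1) + x[1]**3 - 2])

def J(x):
    return numpy.asarray([[2 * x[0], 2 * x[1]],
                          [numpy.exp(x[0] - 1), 3 * x[1]**2]])

f = lambda x: 0.5 * (F(x) @ F(x))

x = numpy.asarray([2.0, 0.5])                 # arbitrary test point
h = 1e-6
fd_grad = numpy.asarray([(f(x + h * e) - f(x - h * e)) / (2 * h)
                         for e in numpy.eye(2)])
print('finite differences:', fd_grad)
print('J(x)^T F(x)       :', J(x).T @ F(x))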
Exercise 18¶
A useful fact to observe is
To show that the model holds, we need to compute the gradient (see Exercise 17) and Hessian of the second-order Taylor series.
Compared to (6.5.4), this model requires the extra term \(\sum_{i = 1}^n f_i(x_c) \nabla^2 f_i(x_c)\). The Newton step for minimizing \(m_c(x)\) is \(s = -\left( J(x_c)^\top J(x_c) + \sum_{i = 1}^n f_i(x_c) \nabla^2 f_i(x_c) \right)^{-1} J(x_c)^\top F(x_c),\)
which is the same as the Newton step for \(F(x) = 0\) if \(J\) is nonsingular and the extra term evaluates to zero. Compared to (6.5.5) and the section after Example 6.5.1, the attractiveness of this approach is that it does not need to arbitrarily perturb the model when \(J\) is singular, as long as the overall sum is nonsingular.
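Likewise, a finite-difference check (with an arbitrary test point) that the Hessian of \(f\) is \(J(x)^\top J(x) + \sum_i f_i(x) \nabla^2 f_i(x)\); this is what the function fpp in the code for Exercise 20 implements.

import numpy

# same F, J, and component Hessians as the code for Exercise 20
def F(x):
    return numpy.asarray([x[0]**2 + x[1]**2 - 2,
                          numpy.exp(x[0] - 1) + x[1]**3 - 2])

def J(x):
    return numpy.asarray([[2 * x[0], 2 * x[1]],
                          [numpy.exp(x[0] - 1), 3 * x[1]**2]])

def Fpp(x, i):
    if i == 0:
        return numpy.asarray([[2.0, 0.0], [0.0, 2.0]])
    return numpy.asarray([[numpy.exp(x[0] - 1), 0.0], [0.0, 6 * x[1]]])

grad = lambda x: J(x).T @ F(x)

x = numpy.asarray([2.0, 0.5])                 # arbitrary test point
h = 1e-6
fd_hess = numpy.column_stack([(grad(x + h * e) - grad(x - h * e)) / (2 * h)
                              for e in numpy.eye(2)])
model = J(x).T @ J(x) + F(x)[0] * Fpp(x, 0) + F(x)[1] * Fpp(x, 1)
print(numpy.allclose(fd_hess, model, atol=1e-4))   # True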
Exercise 19¶
If the second component of the original \(x_0\) were exactly \(\frac{e}{6} \cong 0.45\), which it very nearly is, then \(J(x_0)\) would be singular: \(\det J(x) = 6 x_1 x_2^2 - 2 x_2 e^{x_1 - 1}\), which at \(x_1 = 2\) equals \(2 x_2 (6 x_2 - e)\) and vanishes at \(x_2 = \frac{e}{6}\).
[4]:
import numpy

def J(x):
    x = x.flat
    return numpy.asmatrix([[2 * x[0], 2 * x[1]],
                           [numpy.e**(x[0] - 1), 3 * x[1]**2]])

x = numpy.asmatrix([2, numpy.e / 6]).T
print('Singular Values of J(x_0): {0}'.format(numpy.linalg.svd(J(x))[1]))
Singular Values of J(x_0): [4.95875147e+00 1.18964640e-17]
Exercise 20¶
If the approximation is the one in Exercise 18, then the double dogleg strategy returns the Newton point.
If the approximation instead used (6.5.4), the strategy would compute the point along the line connecting the Cauchy point and the Newton point.
[5]:
def F(x, i=None):
    x = x.flat
    if i is not None:
        if 0 == i:
            return x[0]**2 + x[1]**2 - 2
        elif 1 == i:
            _ = numpy.e**(x[0] - 1)
            return _ + x[1]**3 - 2
    _ = numpy.e**(x[0] - 1)
    return numpy.asmatrix([x[0]**2 + x[1]**2 - 2,
                           _ + x[1]**3 - 2]).T

def Fpp(x, i):
    x = x.flat
    if 0 == i:
        return numpy.asmatrix([[2, 0], [0, 2]]).T
    elif 1 == i:
        _ = numpy.e**(x[0] - 1)
        return numpy.asmatrix([[_, 0], [0, 6 * x[1]]]).T
    raise Exception('Invalid dimension {0}'.format(i))

def J(x):
    x = x.flat
    _ = numpy.e**(x[0] - 1)
    return numpy.asmatrix([[2 * x[0], 2 * x[1]],
                           [_, 3 * x[1]**2]])

def f(x):
    return (0.5 * F(x).T * F(x)).flat[0]

def fp(x):
    return J(x).T * F(x)

def fpp(x):
    return J(x).T * J(x) + F(x, 0) * Fpp(x, 0) + F(x, 1) * Fpp(x, 1)

x_0 = numpy.asmatrix([2, 0.5]).T
g = fp(x_0)
H = fpp(x_0)
delta_c = 0.5
_ = double_dogleg_step(x_0, g, H, delta_c)
print('x_+ = {}'.format(numpy.asarray(_).squeeze()))
x_+ = [1.64273563 0.41561734]
Exercise 21¶
Recall that \(m_c(x_c + s) = f(x_c) + \nabla f(x_c)^\top s + \frac{1}{2} s^\top H_c s\). The residual in nonlinear conjugate gradient is \(-\nabla f(x_c)\), which points in the same direction as \(s^{\text{C.P.}}\). Both nonlinear CG and the dogleg step then compute the step size that minimizes \(f(x_c - \lambda \nabla f(x_c))\).
When
\(\eta = 1\) must be true by definition.
By definition of linear subspace, the parameterized line segment
occupies the convex set spanned by the steepest-descent and Newton directions. In nonlinear CG, the conjugate directions are sums of scaled residuals, which define a linear subspace. Q.E.D., because any linear subspace is a convex set.
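To complement this, linear conjugate gradient applied to the quadratic model with the \(H\) and \(g\) from Exercise 13 reproduces the two endpoints of the dogleg: its first iterate is the Cauchy point and, in two dimensions, it terminates at the Newton point. A minimal sketch, not part of the original solution:

import numpy

H = numpy.diag([1.0, 2.0])                    # Hessian from Exercise 13
g = numpy.asarray([1.0, 2.0])                 # gradient at x_c = (1, 1)

# linear CG on H s = -g, i.e. minimizing the quadratic model over s
s = numpy.zeros(2)
r = -g - H @ s                                # residual; initially -g
p = r.copy()
iterates = []
for _ in range(2):
    alpha = (r @ r) / (p @ H @ p)             # exact line search along p
    s = s + alpha * p
    iterates.append(s.copy())
    r_new = r - alpha * (H @ p)
    p = r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new

s_CP = -((g @ g) / (g @ H @ g)) * g           # Cauchy step
s_N = -numpy.linalg.solve(H, g)               # Newton step
print(numpy.allclose(iterates[0], s_CP))      # first CG iterate = Cauchy point
print(numpy.allclose(iterates[1], s_N))       # CG finishes at the Newton point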
Exercise 22¶
Notice that the analysis of Exercise 21 is applicable only when \(\hat{N} = x_+^N\). One way to alleviate this criticism is to use the conjugate direction as a solution, i.e. to combine dogleg with conjugate gradient. See [Ste83] for more details. At the time of reading this paper (2016), the steps taken in the proof and the accompanying algorithm are reasonable (i.e. no mysterious jumps in reasoning), but the reader needs to be willing to work out the details step by step.
Exercise 23¶
Let \(J \in \mathbb{R}^{n \times n}\) be singular. By Theorem 3.6.4, there exists a decomposition \(J = U D V^\top = \sum_{i = 1}^r \sigma_i u_i v_i^\top\) where \(r < n\) is the rank of \(J\) with \(U^\top U = I\), \(V^\top V = I\), and \(D^\top D = \Lambda\).
(a)¶
For finite square matrices, \(AB = I\) implies \(BA = I\); applied to \(U^\top U = I\) and \(V^\top V = I\), this shows \(U U^\top = I\) and \(V V^\top = I\), i.e. \(U\) and \(V\) are both orthogonal matrices.
\(\alpha = \left(n \epsilon \right)^{1/2} \left\Vert J^\top J \right\Vert_1 > 0\) unless \(J\) is the null matrix because
where \(\lambda_1 \geq \cdots \geq \lambda_r\) and \(\lambda_{r + 1} = \cdots = \lambda_{n} = 0\).
Applying Theorem 3.5.7 to \(A = J^\top J + \alpha I\) gives
which affirms that \(J^\top J + \alpha I\) is nonsingular while \(J^\top J\) is singular.
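A small numerical illustration with an arbitrary rank-deficient \(J\) (the machine epsilon is taken from numpy.finfo):

import numpy

J = numpy.asarray([[1.0, 2.0], [2.0, 4.0]])   # rank-deficient, so J^T J is singular
JTJ = J.T @ J
eps = numpy.finfo(float).eps
alpha = numpy.sqrt(J.shape[0] * eps) * numpy.linalg.norm(JTJ, 1)

print('alpha                 =', alpha)
print('cond(J^T J)           =', numpy.linalg.cond(JTJ))
print('cond(J^T J + alpha I) =', numpy.linalg.cond(JTJ + alpha * numpy.eye(2)))

For this \(J\), the perturbed condition number comes out on the order of \(\left( n \epsilon \right)^{-1/2}\).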
The condition number in the induced \(l_2\) matrix norm is
See Exercise 3.6 for a proof of (3.1.8b).
The foregoing derivations imply
(b)¶
The case \(\kappa_2(J) \geq \epsilon^{-1/2}\) implies that \(J\) is nonsingular because
Applying (3.1.13) yields
Exercise 24¶
Let \(F \in \mathbb{R}^n\) be nonzero and \(J \in \mathbb{R}^{n \times n}\) be singular.
By Theorem 3.6.4, there exists a decomposition \(J = U D V^\top = \sum_{i = 1}^r \sigma_i u_i v_i^\top\) where \(r < n\) is the rank of \(J\) with \(U\) and \(V\) being orthogonal matrices and \(D^\top D = \Lambda\).
(a)¶
(b)¶
Recall that the null space \(\mathbf{N}(J)\) is the space of all vectors \(w\) such that \(Jw = U D V^\top w = 0\). A useful observation is
where \(d_i = 1\) when \(i \leq r\) and zero otherwise.
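A short sketch (with an arbitrary singular \(J\)) extracting a basis of \(\mathbf{N}(J)\) from the right singular vectors associated with the zero singular values and verifying that \(J w = 0\) on that basis:

import numpy

J = numpy.asarray([[1.0, 2.0, 3.0],
                   [2.0, 4.0, 6.0],
                   [1.0, 0.0, 1.0]])          # arbitrary singular matrix
U, sigma, Vt = numpy.linalg.svd(J)
r = int(numpy.sum(sigma > 1e-12 * sigma[0]))  # numerical rank
null_basis = Vt[r:].T                         # columns span N(J)
print('rank =', r)
print('max |J w| over the basis =', numpy.abs(J @ null_basis).max())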
Exercise 25¶
See [SSB85] for a nice proof of the convergence of line search and trust region algorithms.
References
- Gay81
David M Gay. Computing optimal locally constrained steps. SIAM Journal on Scientific and Statistical Computing, 2(2):186–197, 1981.
- Joy
David Joyce. Directional derivatives, steepest ascent, tangent planes. http://aleph0.clarku.edu/~djoyce/ma131/directional.pdf. Accessed on 2017-04-20.
- SSB85
Gerald A Shultz, Robert B Schnabel, and Richard H Byrd. A family of trust-region-based algorithms for unconstrained minimization with strong global convergence properties. SIAM Journal on Numerical Analysis, 22(1):47–67, 1985.
- Ste83
Trond Steihaug. The conjugate gradient method and trust regions in large scale optimization. SIAM Journal on Numerical Analysis, 20(3):626–637, 1983.