Globally Convergent Modifications of Newton’s Method¶
The Model Trust Region Approach¶
[Gay81] showed that perturbing the Hessian to ensure positive semidefiniteness still yields a solution that minimizes (6.4.1).
The level set of \(f\) at value \(c\) is the set of all points \(\mathbf{x} \in \mathbb{R}^n\) such that \(f(\mathbf{x}) = c\). When \(n = 2\), a level set is called a level curve, contour line, or isoline. When \(n = 3\), it is known as a level surface or isosurface. In higher dimensions, it is called a level hypersurface.
[Joy] provides a clear derivation of why the gradient of a function is in the direction of steepest ascent (or greatest change). To see why it is normal to the surface, suppose that \(\mathbf{t} \in \mathbb{R}^n\) is tangent to the level surface of \(f\) through \(\mathbf{x}_0\). As you move in the \(\mathbf{t}\) direction, you are at that instant moving along the level surface, and hence the value of \(f\) does not change. This is another way of saying the directional derivative in this direction is zero. Therefore, the gradient is orthogonal to the tangent plane.
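As a quick numerical illustration (not part of [Joy]'s derivation), the sketch below uses \(f(x, y) = x^2 + y^2\), whose level curves are circles: at each sampled point of the unit circle, the inner product of the gradient with the tangent direction is zero up to floating-point error.

import numpy

# f(x, y) = x**2 + y**2 has circular level curves; at (cos t, sin t) on the
# unit circle the tangent direction is (-sin t, cos t)
def gradient(p):
    return numpy.asarray([2 * p[0], 2 * p[1]])

for t in numpy.linspace(0, 2 * numpy.pi, 5):
    point = numpy.asarray([numpy.cos(t), numpy.sin(t)])
    tangent = numpy.asarray([-numpy.sin(t), numpy.cos(t)])
    # the directional derivative along the tangent vanishes, so the gradient
    # is normal to the level curve
    print(numpy.dot(gradient(point), tangent))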
Exercise 1¶
A descent direction \(p\) of \(f\) at \(x_c\) satisfies \(\nabla f(x_c)^\top p < 0\).
The steepest-descent direction is the solution to \(\min_{p} \nabla f(x)^\top p\) subject to \(\left\Vert p \right\Vert_2 = 1\).
The \(l_2\)-norm restricts the solution to the unit sphere around \(x\), so \(p = -\frac{\nabla f(x)}{\left\Vert \nabla f(x) \right\Vert_2}\) is the steepest-descent direction because, by the Cauchy-Schwarz inequality, \(\nabla f(x)^\top p \geq -\left\Vert \nabla f(x) \right\Vert_2 \left\Vert p \right\Vert_2 = -\left\Vert \nabla f(x) \right\Vert_2\) with equality at this choice of \(p\).
Given \(f(x) = 3 x_1^2 + 2 x_1 x_2 + x_2^2\) and \(\nabla f(x) = \begin{bmatrix} 6 x_1 + 2 x_2\\ 2 x_1 + 2 x_2\end{bmatrix}\), the steepest-descent direction at \(x_0 = (1,1)^\top\) is \(-\frac{\nabla f(x_0)}{\left\Vert \nabla f(x_0) \right\Vert_2} = -\frac{1}{\sqrt{80}} (8, 4)^\top = -\frac{1}{\sqrt{5}} (2, 1)^\top\).
The vector \((1, -1)^\top\) is not a descent direction at \(x_0\) because
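A minimal numerical check of the above, using the function and point from the exercise:

import numpy

def gradient(x):
    return numpy.asarray([6 * x[0] + 2 * x[1], 2 * x[0] + 2 * x[1]])

x_0 = numpy.asarray([1.0, 1.0])
g = gradient(x_0)                             # (8, 4)
print('steepest descent:', -g / numpy.linalg.norm(g))
# (1, -1) is not a descent direction because the inner product is positive
print('g^T (1, -1) =', numpy.dot(g, numpy.asarray([1.0, -1.0])))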
Exercise 2¶
Given \(s = -\lambda_k \nabla f(x_k) = -\lambda_k g\),
and
By Theorem 4.3.2,
Exercise 3¶
Define \(\hat{c} \in (0, 1)\) and \(\nabla^2 f(x) = \begin{bmatrix} \lambda_0 & 0\\ 0 & \lambda_1 \end{bmatrix}\) where \(x \in \mathbb{R}^2\) and \(\lambda_0 \geq \lambda_1 > 0\).
By (6.2.4), \(\hat{c} \triangleq \frac{\lambda_0 - \lambda_1}{\lambda_0 + \lambda_1}\). This sets \(\nabla f(x) = \begin{bmatrix} \lambda_0 x_1\\ \lambda_1 x_2 \end{bmatrix}\) and \(f(x) = \frac{\lambda_0}{2} x_1^2 + \frac{\lambda_1}{2} x_2^2\). Clearly, \(x_* = (0,0)^\top\) still holds and
will converge as long as
Exercise 4¶
Given
\(\left\{ x_k \right\} = \left\{ \left( -1 \right)^k \left( 1 + 2^{-k} \right) \right\}\) for \(k \in \mathbb{N}_0\).
This infinite sequence is not permitted by (6.3.3) because
where the last inequality holds \(\forall k \in \mathbb{N}_0\) only when \(\alpha \leq 0\).
Exercise 5¶
Given
\(\left\{ x_k \right\} = \left\{ 1 + 2^{-k} \right\}\) for \(k \in \mathbb{N}_0\).
The infinite sequence is permitted by (6.3.3) when \(\alpha \leq \frac{1}{2}\) because
The infinite sequence is prohibited by (6.3.4) because
where the last inequality holds \(\forall k \in \mathbb{N}_0\) when \(\beta \geq 1\), but \(\beta \in (\alpha, 1)\).
Exercise 6¶
Let \(f(x)\) be a positive definite quadratic such that \(f(x) = \frac{1}{2} x^\top A x + b^\top x + c\)
with \(x \in \mathbb{R}^n\) and \(A \succ 0\).
Recall that a full Newton step at \(x_k \in \mathbb{R}^n\) means \(\lambda = 1\) and \(p = -H_k^{-1} \nabla f(x_k)\) where \(H_k = \nabla^2 f(x_k) + \mu_k I\) is positive definite with \(\mu_k = 0\) if \(\nabla^2 f(x_k)\) is positive definite. Applying a full Newton step to \(f(x)\) yields
where \(\alpha \leq \frac{1}{2}\), which satisfies condition (6.3.3).
Condition (6.3.4) is also satisfied for any \(\beta \in (0, 1)\) because the full Newton step lands on the minimizer of the quadratic, where \(\nabla f(x_{k + 1}) = 0\), so \(\nabla f(x_{k + 1})^\top (x_{k + 1} - x_k) = 0 \geq \beta \nabla f(x_k)^\top (x_{k + 1} - x_k)\).
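The following sketch (with an arbitrarily chosen symmetric positive definite \(A\), vector \(b\), and starting point, all assumptions for illustration) confirms that a full Newton step on such a quadratic lands on the minimizer and that \(f(x_+) - f(x_k) = \frac{1}{2} \nabla f(x_k)^\top s\), which is the equality behind the \(\alpha \leq \frac{1}{2}\) claim above.

import numpy

numpy.random.seed(0)
B = numpy.random.randn(3, 3)
A = B @ B.T + 3 * numpy.eye(3)                # symmetric positive definite
b = numpy.random.randn(3)

f = lambda x: 0.5 * x @ A @ x + b @ x
x_k = numpy.random.randn(3)
g = A @ x_k + b                               # gradient of the quadratic at x_k
s = -numpy.linalg.solve(A, g)                 # full Newton step
x_plus = x_k + s

print('x_+ is the minimizer -A^{-1} b:',
      numpy.allclose(x_plus, -numpy.linalg.solve(A, b)))
print('f(x_+) - f(x_k) =', f(x_plus) - f(x_k))
print('0.5 * g^T s     =', 0.5 * (g @ s))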
Exercise 7¶
Let (6.3.4) be replaced with \(f(x_{k + 1}) \geq f(x_k) + \beta \nabla f(x_k)^\top (x_{k + 1} - x_k)\) where \(\beta \in (\alpha, 1)\).
The proof for Theorem 6.3.2 still proceeds as in (6.3.5). Observe that
Thus, by the continuity of \(\nabla f\), Theorem 6.3.2 holds as long as \(\lambda_k \in (\beta \hat{\lambda}, \alpha \hat{\lambda})\).
The proof for Theorem 6.3.3 is the same except for the direct usage of condition (6.3.4). That condition needs to be derived as
This derivation can also be used as a transition into the part of Theorem 6.3.4’s proof that uses condition (6.3.4).
Exercise 8¶
Given
\(\nabla f(x) = \begin{bmatrix} 4 x_1^3\\ 2 x_2 \end{bmatrix}\) and \(\nabla^2 f(x) = \begin{bmatrix} 12 x_1^2 & 0\\ 0 & 2 \end{bmatrix}\).
Since the Hessian is positive definite, \(\mu_k = 0\) and \(H_k = \nabla^2 f(x_k)\) in the Newton step.
(a)¶
Performing a full Newton step (\(\lambda = 1\)) gives
(b)¶
Given \(\alpha = 10^{-4}\) and \(\lambda_k = 1\), the backtracking line-search gives the same \(x_+\) as in (a) because
shows that condition (6.3.3) is satisfied.
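For completeness, here is a minimal backtracking line-search sketch for the sufficient-decrease condition (6.3.3). The objective \(f(x) = x_1^4 + x_2^2\) is inferred from the gradient given above, and the starting point \(x_0 = (1, 1)^\top\) is an assumption for illustration; with these choices the full step \(\lambda = 1\) is accepted immediately.

import numpy

def backtrack(f, grad, x, p, alpha=1e-4, rho=0.5):
    # halve lambda until the sufficient-decrease condition (6.3.3) holds
    lam = 1.0
    while f(x + lam * p) > f(x) + alpha * lam * (grad(x) @ p):
        lam *= rho
    return lam

f = lambda x: x[0]**4 + x[1]**2               # inferred from the gradient above
grad = lambda x: numpy.asarray([4 * x[0]**3, 2 * x[1]])
hess = lambda x: numpy.diag([12 * x[0]**2, 2.0])

x_0 = numpy.asarray([1.0, 1.0])               # assumed starting point
p = -numpy.linalg.solve(hess(x_0), grad(x_0)) # Newton direction
lam = backtrack(f, grad, x_0, p)
print('accepted lambda =', lam, ' x_+ =', x_0 + lam * p)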
(c)¶
The only improvement I can think of is to use (6.3.17), which yields \(\lambda = 1.08871\) and makes \(x_+\) closer to the minimizer than in (b).
Exercise 9¶
Given \(x_c = (1, 1)^\top\) and \(f(x) = \frac{1}{2} x_1^2 + x_2^2\),
The unconstrained solution to (6.4.1) is the Newton step \(s = -\nabla^2 f(x_c)^{-1} \nabla f(x_c) = (-1, -1)^\top\), whose length is \(\left\Vert s \right\Vert_2 = \sqrt{2} \approx 1.414\).
When \(\delta = 2\), the trust region contains this step, so the unconstrained solution is valid.
When \(\delta = \frac{5}{6}\), (6.4.2) needs to be invoked: solving \(\left\Vert \left( \nabla^2 f(x_c) + \mu I \right)^{-1} \nabla f(x_c) \right\Vert_2 = \delta\) for \(\mu \geq 0\) gives \(\mu = 1\) and \(s(\mu) = -\left( \nabla^2 f(x_c) + \mu I \right)^{-1} \nabla f(x_c) = \left( -\frac{1}{2}, -\frac{2}{3} \right)^\top\).
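A minimal sketch of how the multiplier can be found numerically, assuming (6.4.2) takes the form \(s(\mu) = -\left( \nabla^2 f(x_c) + \mu I \right)^{-1} \nabla f(x_c)\) with \(\mu \geq 0\) chosen so that \(\left\Vert s(\mu) \right\Vert_2 = \delta\) (consistent with Exercise 11 and the code of Exercise 10); bisection suffices because \(\left\Vert s(\mu) \right\Vert_2\) is decreasing in \(\mu\) for positive definite Hessians (cf. Exercise 12).

import numpy

H = numpy.diag([1.0, 2.0])                    # Hessian of f at x_c = (1, 1)
g = numpy.asarray([1.0, 2.0])                 # gradient of f at x_c = (1, 1)
delta = 5.0 / 6.0

def step_norm(mu):
    return numpy.linalg.norm(numpy.linalg.solve(H + mu * numpy.eye(2), g))

mu_lo, mu_hi = 0.0, 1.0
while step_norm(mu_hi) > delta:               # bracket the solution
    mu_hi *= 2
for _ in range(60):                           # bisect on the bracket
    mu_mid = 0.5 * (mu_lo + mu_hi)
    if step_norm(mu_mid) > delta:
        mu_lo = mu_mid
    else:
        mu_hi = mu_mid

mu = 0.5 * (mu_lo + mu_hi)
s = -numpy.linalg.solve(H + mu * numpy.eye(2), g)
print('mu = {:.4f}, s = {}, ||s|| = {:.4f}'.format(mu, s, numpy.linalg.norm(s)))

Under that assumption the search lands at \(\mu = 1\) and \(s = \left( -\frac{1}{2}, -\frac{2}{3} \right)^\top\), matching the value stated above.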
Exercise 10¶
Given \(f(x) = x_1^2 + 2 x_2^2 + 3 x_3^2\),
When \(x_c = (1, 1, 1)^\top\), \(\nabla f(x_c) = (2, 4, 6)^\top\) and \(\nabla^2 f(x_c) = \begin{bmatrix} 2 & 0 & 0\\ 0 & 4 & 0\\ 0 & 0 & 6 \end{bmatrix}\).
Let \(\mu_0 = 1\), \(\mu_1 = 2\), and \(\mu_2 = 3\).
As shown below, the volume of the tetrahedron whose vertices are \(x_c\) and the three steps \(s(\mu_i) = -\left( \nabla^2 f(x_c) + \mu_i I \right)^{-1} \nabla f(x_c)\) is not zero, and none of the singular values are zero.
[1]:
import numpy
# rows of M: x_c and the steps s(mu_i) = -(H + mu_i I)^{-1} grad f(x_c),
# each with a trailing 1; the determinant vanishes iff the four points are coplanar
M = numpy.ones((4,4))
M[0,:3] = numpy.asfarray([1, 1, 1])
M[1,:3] = -numpy.asfarray([2/3, 4/5, 6/7])
M[2,:3] = -numpy.asfarray([2/4, 4/6, 6/8])
M[3,:3] = -numpy.asfarray([2/5, 4/7, 6/9])
print('Volume of Tetrahedron = {0}'.format(numpy.linalg.det(M)))
print('Singular Values = {0}'.format(numpy.linalg.svd(M[:,:3])[1]))
Volume of Tetrahedron = 0.0005139833711262165
Singular Values = [2.63921809 0.20381107 0.00683625]
Exercise 11¶
Given \(H \in \mathbb{R}^{n \times n}\) is symmetric and positive definite, it has an eigendecomposition of \(H = Q \Lambda Q^\top = \sum_{i = 1}^n \lambda_i v_i v_i^\top\).
Let \(g = \sum_{j = 1}^n \alpha_j v_j\) and \(\mu \geq 0\).
and
This relates to \(m_c(\mu)\) in (6.4.5) in the sense that \(g = -\nabla f(x_c)\).
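A quick numerical confirmation of the identity \((H + \mu I)^{-1} g = \sum_{j = 1}^n \frac{\alpha_j}{\lambda_j + \mu} v_j\), using a randomly generated \(H\) and \(g\) and an arbitrary \(\mu\) (all assumptions for illustration only):

import numpy

numpy.random.seed(0)
B = numpy.random.randn(4, 4)
H = B @ B.T + numpy.eye(4)                    # symmetric positive definite
g = numpy.random.randn(4)
mu = 0.7                                      # arbitrary mu >= 0

lam, Q = numpy.linalg.eigh(H)                 # columns of Q are the v_i
alpha = Q.T @ g                               # coefficients of g in the eigenbasis
s_eig = Q @ (alpha / (lam + mu))              # sum_j alpha_j / (lambda_j + mu) v_j
s_dir = numpy.linalg.solve(H + mu * numpy.eye(4), g)
print(numpy.allclose(s_eig, s_dir))           # True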
Exercise 12¶
Let \(s(\mu) = (H + \mu I)^{-1} g\) for \(\mu \geq 0\) and \(\eta(\mu) = \left\Vert s(\mu) \right\Vert\).
Exercise 13¶
Given \(f(x) = \frac{1}{2} x_1^2 + x_2^2\),
Let \(x_0 = (1,1)^\top\), \(g = \nabla f(x_0) = (1, 2)^\top\), and \(H = \nabla^2 f(x_0) = \begin{bmatrix} 1 & 0\\ 0 & 2 \end{bmatrix}\).
When \(\delta = 1\), the Cauchy point lies outside the trust region, so the step is simply the steepest-descent direction scaled to the boundary; when \(\delta = 1.25\), the Cauchy point is inside the region and the step along the parameterized dogleg line must be computed.
[2]:
import matplotlib.pyplot as plt
import numpy
def double_dogleg_step(x, g, H, delta):
    s_N = -numpy.linalg.inv(H) * g
    #return the Newton point via Newton step
    if (s_N.T * s_N).flat[0] <= delta**2:
        return x + s_N
    #compute Cauchy point
    lambda_star = (g.T * g / (g.T * H * g)).flat[0]
    C_P = x - lambda_star * g
    s_CP = -lambda_star * g
    #compute Newton point
    _ = (g.T * H * g).flat[0] * (g.T * numpy.linalg.inv(H) * g).flat[0]
    gamma = (g.T * g).flat[0]**2 / _
    eta = 0.8 * gamma + 0.2
    N = x + eta * s_N
    s_hat_N = eta * s_N
    a = ((s_hat_N - s_CP).T * (s_hat_N - s_CP)).flat[0]
    b = 2 * ((s_hat_N - s_CP).T * s_CP).flat[0]
    c = (s_CP.T * s_CP).flat[0] - delta**2
    if delta <= numpy.linalg.norm(g) * lambda_star:
        #C.P. exceeds delta-region
        _ = x - delta / numpy.linalg.norm(g) * g
    else:
        #C.P. is within delta-region, so solve for parameterized line
        lambda_c = max((-b + numpy.sqrt(b**2 - 4 * a * c)) / (2 * a),
                       (-b - numpy.sqrt(b**2 - 4 * a * c)) / (2 * a))
        _ = x + s_CP + lambda_c * (s_hat_N - s_CP)

    fig = plt.figure(figsize=[16,8])
    plt.plot([x.flat[0], C_P.flat[0]], [x.flat[1], C_P.flat[1]],
             marker='o', linestyle='--', color='r', label='CP')
    plt.plot([x.flat[0], N.flat[0]], [x.flat[1], N.flat[1]],
             marker=r'$\checkmark$', linestyle=':', color='b', label='N')
    plt.plot([C_P.flat[0], N.flat[0]], [C_P.flat[1], N.flat[1]],
             marker=r'$\bowtie$', linestyle='-', color='g')
    plt.plot([_.flat[0]], [_.flat[1]],
             marker=r'$\lambda$', color='k', label='x', markersize=10)
    circ = numpy.linspace(0, 2 * numpy.pi, 128)
    plt.plot(x.flat[0] + delta * numpy.cos(circ),
             x.flat[1] + delta * numpy.sin(circ))
    plt.axis('equal')
    plt.title(r'Double Dogleg Curve $\delta = {0}$'.format(delta))
    plt.legend()
    plt.show()
    return _

def fp(x):
    _ = x.flat
    return numpy.asmatrix([_[0], 2 * _[1]]).T

def fpp(x):
    return numpy.asmatrix([[1, 0], [0, 2]])

x_0 = numpy.asmatrix([1, 1]).T
g = fp(x_0)
H = fpp(x_0)
for delta in [1, 5 / 4]:
    _ = double_dogleg_step(x_0, g, H, delta)
(Two figures produced: the double dogleg curve and trust region for \(\delta = 1\) and for \(\delta = 1.25\).)
Exercise 14¶
Exercise 15¶
Let \(f(x_1, x_2) = x_1^4 + x_1^2 + x_2^2\), then \(\nabla f(x) = (4 x_1^3 + 2 x_1, 2 x_2)^\top\) and \(H = \nabla^2 f(x) = \begin{bmatrix} 12 x_1^2 + 2 & 0\\ 0 & 2 \end{bmatrix}\).
With the updated \(\delta_c = 1\) and \(x_c = (0.666, 0.665)^\top\), \(\nabla f(x_c) = (2.514, 1.33)^\top\) and \(H = \begin{bmatrix} 7.323 & 0\\ 0 & 2 \end{bmatrix}\).
Based on Lemma 6.4.1, performing a Newton step yields \(s_c^N = -H^{-1} \nabla f(x_c) = (-0.343, -0.665)^\top\).
Since \(\left\Vert s_c^N \right\Vert_2 \approx 0.748 \leq \delta_c\), the solution for (6.4.2) is \(s(0) = s_c^N\).
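A one-cell check of the Newton step and its length:

import numpy

x_c = numpy.asarray([0.666, 0.665])
g = numpy.asarray([4 * x_c[0]**3 + 2 * x_c[0], 2 * x_c[1]])
H = numpy.diag([12 * x_c[0]**2 + 2, 2.0])
s_N = -numpy.linalg.solve(H, g)
print('s_N = {}, ||s_N|| = {:.3f}'.format(s_N, numpy.linalg.norm(s_N)))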
Exercise 16¶
Let \(F(x) = (x_1, 2 x_2)^\top\) and \(x_0 = (1, 1)^\top\), then \(J(x_0) = \begin{bmatrix} 1 & 0\\ 0 & 2 \end{bmatrix}\).
The steepest-descent direction for (6.5.3) is \(-\frac{J(x)^\top F(x)}{\left\Vert J(x)^\top F(x) \right\Vert_2}\), since \(\nabla f(x) = J(x)^\top F(x)\).
Applying (6.2.4) shows that \(c \triangleq \frac{2 - 1}{2 + 1} = \frac{1}{3}\). Therefore, the convergence rate will be slower than linear.
[3]:
def F(x):
    x = x.flat
    return numpy.asmatrix([x[0], x[1] * 2]).T

def J(x):
    return numpy.asmatrix([[1, 0], [0, 2]])

x_c = numpy.asmatrix([1, 1]).T
for i in range(8):
    top = J(x_c).T * F(x_c)
    s = -top / numpy.sqrt(top.T * top)
    x_c = x_c + s
    print('Iteration {0}: {1}'.format(i, numpy.asarray(x_c).squeeze()))
Iteration 0: [0.75746437 0.0298575 ]
Iteration 1: [-0.23033265 -0.12588923]
Iteration 2: [0.18562906 0.7834929 ]
Iteration 3: [ 0.12650144 -0.21475753]
Iteration 4: [-0.01918811 0.77457283]
Iteration 5: [-0.0129951 -0.225408 ]
Iteration 6: [0.00141627 0.77448815]
Iteration 7: [ 0.00095911 -0.22551174]
Exercise 17¶
The following facts are useful in the proof that follows.
Given a matrix \(A \in \mathbb{R}^{m \times n}\) of rank \(r\), the nullspace \(\mathbf{N}(A)\) is the space of all vectors \(x\) such that \(Ax = 0\). Likewise, the left nullspace is the space of all vectors \(y\) such that \(A^\top y = 0\) or equivalently \(y^\top A = 0\). The dimension of the column space \(\mathbf{C}(A)\) is \(r\) and the dimension of the row space \(\mathbf{C}(A^\top)\) is also \(r\). The dimension of the nullspace \(\mathbf{N}(A)\) is \(n - r\) and the dimension of the left nullspace \(\mathbf{N}(A^\top)\) is \(m - r\).
Let \(f(x) = \frac{1}{2} F(x)^\top F(x)\) where \(F \colon \mathbb{R}^n \rightarrow \mathbb{R}^n\) and each component \(f_i\) for \(i = 1, \ldots, n\) is continuously differentiable.
Recall from definition 4.1.7 that
Since \(F(\hat{x}) \neq 0\) and \(\hat{x}\) is a local minimizer (Lemma 4.3.1), \(\nabla f(\hat{x}) = J(\hat{x})^\top F(\hat{x}) = 0\) implies that \(F(\hat{x})\) is a nonzero vector in the left nullspace \(\mathbf{N}\left( J(\hat{x})^\top \right)\) of \(J(\hat{x})\).
By the fundamental theorem of linear algebra, the corank of \(J(\hat{x})\) is not zero. Since \(J\) is a square matrix, its corank is equal to its nullity. Thus, \(J(\hat{x})\) is singular.
As shown in Figure 6.5.1, when \(J(\hat{x})\) is singular, \(\hat{x}\) could also be a global minimizer.
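A finite-difference check of the identity \(\nabla f(x) = J(x)^\top F(x)\) used above, with the same \(F\) that appears in the code for Exercise 20 (the test point is arbitrary):

import numpy

def F(x):
    return numpy.asarray([x[0]**2 + x[1]**2 - 2,
                          numpy.exp(x[0] - 1) + x[1]**3 - 2])

def J(x):
    return numpy.asarray([[2 * x[0], 2 * x[1]],
                          [numpy.exp(x[0] - 1), 3 * x[1]**2]])

f = lambda x: 0.5 * (F(x) @ F(x))

x = numpy.asarray([2.0, 0.5])                 # arbitrary test point
h = 1e-6
fd_grad = numpy.asarray([(f(x + h * e) - f(x - h * e)) / (2 * h)
                         for e in numpy.eye(2)])
print('finite differences:', fd_grad)
print('J(x)^T F(x)       :', J(x).T @ F(x))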
Exercise 18¶
A useful fact to observe is
To show that the model holds, we need to compute the gradient (see Exercise 17) and Hessian of the second-order Taylor series.
Compared to (6.5.4), this model requires the extra term \(\sum_{i = 1}^n f_i(x_c) \nabla^2 f_i(x_c)\). The Newton step for minimizing \(m_c(x)\) is \(s = -\left( J(x_c)^\top J(x_c) + \sum_{i = 1}^n f_i(x_c) \nabla^2 f_i(x_c) \right)^{-1} J(x_c)^\top F(x_c),\)
which is the same as the Newton step for \(F(x) = 0\) if \(J\) is nonsingular and the extra term evaluates to zero. Compared to (6.5.5) and the section after Example 6.5.1, the attractiveness of this approach is that it does not need to arbitrarily perturb the model when \(J\) is singular, as long as the overall sum is nonsingular.
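Likewise, a finite-difference check (with an arbitrary test point) that the Hessian of \(f\) is \(J(x)^\top J(x) + \sum_i f_i(x) \nabla^2 f_i(x)\); this is what the function fpp in the code for Exercise 20 implements.

import numpy

# same F, J, and component Hessians as the code for Exercise 20
def F(x):
    return numpy.asarray([x[0]**2 + x[1]**2 - 2,
                          numpy.exp(x[0] - 1) + x[1]**3 - 2])

def J(x):
    return numpy.asarray([[2 * x[0], 2 * x[1]],
                          [numpy.exp(x[0] - 1), 3 * x[1]**2]])

def Fpp(x, i):
    if i == 0:
        return numpy.asarray([[2.0, 0.0], [0.0, 2.0]])
    return numpy.asarray([[numpy.exp(x[0] - 1), 0.0], [0.0, 6 * x[1]]])

grad = lambda x: J(x).T @ F(x)

x = numpy.asarray([2.0, 0.5])                 # arbitrary test point
h = 1e-6
fd_hess = numpy.column_stack([(grad(x + h * e) - grad(x - h * e)) / (2 * h)
                              for e in numpy.eye(2)])
model = J(x).T @ J(x) + F(x)[0] * Fpp(x, 0) + F(x)[1] * Fpp(x, 1)
print(numpy.allclose(fd_hess, model, atol=1e-4))   # True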
Exercise 19¶
If the second component of the original \(x_0\) were exactly \(\frac{e}{6} \cong 0.45\), which it very nearly is, then \(J(x_0)\) would be singular: \(\det J(x) = 6 x_1 x_2^2 - 2 x_2 e^{x_1 - 1}\), which at \(x_1 = 2\) equals \(2 x_2 (6 x_2 - e)\) and vanishes at \(x_2 = \frac{e}{6}\).
[4]:
import numpy

def J(x):
    x = x.flat
    return numpy.asmatrix([[2 * x[0], 2 * x[1]],
                           [numpy.e**(x[0] - 1), 3 * x[1]**2]])

x = numpy.asmatrix([2, numpy.e / 6]).T
print('Singular Values of J(x_0): {0}'.format(numpy.linalg.svd(J(x))[1]))
Singular Values of J(x_0): [4.95875147e+00 1.18964640e-17]
Exercise 20¶
If the approximation is the one in Exercise 18, then the double dogleg strategy returns the Newton point.
If the approximation instead used (6.5.4), the strategy would compute the point along the line connecting the Cauchy point and the Newton point.
[5]:
def F(x, i=None):
    x = x.flat
    if i is not None:
        if 0 == i:
            return x[0]**2 + x[1]**2 - 2
        elif 1 == i:
            _ = numpy.e**(x[0] - 1)
            return _ + x[1]**3 - 2
    _ = numpy.e**(x[0] - 1)
    return numpy.asmatrix([x[0]**2 + x[1]**2 - 2,
                           _ + x[1]**3 - 2]).T

def Fpp(x, i):
    x = x.flat
    if 0 == i:
        return numpy.asmatrix([[2, 0], [0, 2]]).T
    elif 1 == i:
        _ = numpy.e**(x[0] - 1)
        return numpy.asmatrix([[_, 0], [0, 6 * x[1]]]).T
    raise Exception('Invalid dimension {0}'.format(i))

def J(x):
    x = x.flat
    _ = numpy.e**(x[0] - 1)
    return numpy.asmatrix([[2 * x[0], 2 * x[1]],
                           [_, 3 * x[1]**2]])

def f(x):
    return (0.5 * F(x).T * F(x)).flat[0]

def fp(x):
    return J(x).T * F(x)

def fpp(x):
    return J(x).T * J(x) + F(x, 0) * Fpp(x, 0) + F(x, 1) * Fpp(x, 1)

x_0 = numpy.asmatrix([2, 0.5]).T
g = fp(x_0)
H = fpp(x_0)
delta_c = 0.5
_ = double_dogleg_step(x_0, g, H, delta_c)
print('x_+ = {}'.format(numpy.asarray(_).squeeze()))
x_+ = [1.64273563 0.41561734]
Exercise 21¶
Recall that \(m_c(x_c + s) = f(x_c) + \nabla f(x_c)^\top s + \frac{1}{2} s^\top H_c s\). The residual in nonlinear conjugate gradient is \(-\nabla f(x_c)\), which points in the same direction as \(s^{\text{C.P.}}\). Both nonlinear CG and the dogleg step then compute the step size that minimizes \(f(x_c - \lambda \nabla f(x_c))\).
When
\(\eta = 1\) must be true by definition.
By definition of linear subspace, the parameterized line segment
occupies the convex set spanned by the steepest-descent and Newton directions. In nonlinear CG, the conjugate directions are sums of scaled residuals, which define a linear subspace. Q.E.D., because any linear subspace is a convex set.
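To complement this, linear conjugate gradient applied to the quadratic model with the \(H\) and \(g\) from Exercise 13 reproduces the two endpoints of the dogleg: its first iterate is the Cauchy point and, in two dimensions, it terminates at the Newton point. A minimal sketch, not part of the original solution:

import numpy

H = numpy.diag([1.0, 2.0])                    # Hessian from Exercise 13
g = numpy.asarray([1.0, 2.0])                 # gradient at x_c = (1, 1)

# linear CG on H s = -g, i.e. minimizing the quadratic model over s
s = numpy.zeros(2)
r = -g - H @ s                                # residual; initially -g
p = r.copy()
iterates = []
for _ in range(2):
    alpha = (r @ r) / (p @ H @ p)             # exact line search along p
    s = s + alpha * p
    iterates.append(s.copy())
    r_new = r - alpha * (H @ p)
    p = r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new

s_CP = -((g @ g) / (g @ H @ g)) * g           # Cauchy step
s_N = -numpy.linalg.solve(H, g)               # Newton step
print(numpy.allclose(iterates[0], s_CP))      # first CG iterate = Cauchy point
print(numpy.allclose(iterates[1], s_N))       # CG finishes at the Newton point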
Exercise 22¶
Notice that the analysis of Exercise 21 is applicable only when \(\hat{N} = x_+^N\). One way to alleviate this criticism is to use the conjugate direction as a solution, i.e. to combine dogleg with conjugate gradient. See [Ste83] for more details. At the time of reading this paper (2016), the steps taken in the proof and the accompanying algorithm are reasonable (i.e. no mysterious jumps in reasoning), but the reader needs to be willing to work out the details step by step.
Exercise 23¶
Let \(J \in \mathbb{R}^{n \times n}\) be singular. By Theorem 3.6.4, there exists a decomposition \(J = U D V^\top = \sum_{i = 1}^r \sigma_i u_i v_i^\top\) where \(r < n\) is the rank of \(J\) with \(U^\top U = I\), \(V^\top V = I\), and \(D^\top D = \Lambda\).
(a)¶
For finite square matrices, \(AB = I\) implies \(BA = I\); applied to \(U^\top U = I\) and \(V^\top V = I\), this shows \(U U^\top = I\) and \(V V^\top = I\), i.e. \(U\) and \(V\) are both orthogonal matrices.
\(\alpha = \left(n \epsilon \right)^{1/2} \left\Vert J^\top J \right\Vert_1 > 0\) unless \(J\) is the null matrix because
where \(\lambda_1 \geq \cdots \geq \lambda_r\) and \(\lambda_{r + 1} = \cdots = \lambda_{n} = 0\).
Applying Theorem 3.5.7 to \(A = J^\top J + \alpha I\) gives
which affirms that \(J^\top J + \alpha I\) is nonsingular while \(J^\top J\) is singular.
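A small numerical illustration with an arbitrary rank-deficient \(J\) (the machine epsilon is taken from numpy.finfo):

import numpy

J = numpy.asarray([[1.0, 2.0], [2.0, 4.0]])   # rank-deficient, so J^T J is singular
JTJ = J.T @ J
eps = numpy.finfo(float).eps
alpha = numpy.sqrt(J.shape[0] * eps) * numpy.linalg.norm(JTJ, 1)

print('alpha                 =', alpha)
print('cond(J^T J)           =', numpy.linalg.cond(JTJ))
print('cond(J^T J + alpha I) =', numpy.linalg.cond(JTJ + alpha * numpy.eye(2)))

For this \(J\), the perturbed condition number comes out on the order of \(\left( n \epsilon \right)^{-1/2}\).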
The condition number in the induced \(l_2\) matrix norm is
See Exercise 3.6 for a proof of (3.1.8b).
The foregoing derivations imply
(b)¶
The case \(\kappa_2(J) \geq \epsilon^{-1/2}\) implies that \(J\) is nonsingular because
Applying (3.1.13) yields
Exercise 24¶
Let \(F \in \mathbb{R}^n\) be nonzero and \(J \in \mathbb{R}^{n \times n}\) be singular.
By Theorem 3.6.4, there exists a decomposition \(J = U D V^\top = \sum_{i = 1}^r \sigma_i u_i v_i^\top\) where \(r < n\) is the rank of \(J\) with \(U\) and \(V\) being orthogonal matrices and \(D^\top D = \Lambda\).
(a)¶
(b)¶
Recall that the null space \(\mathbf{N}(J)\) is the space of all vectors \(w\) such that \(Jw = U D V^\top w = 0\). A useful observation is
where \(d_i = 1\) when \(i \leq r\) and zero otherwise.
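A short sketch (with an arbitrary singular \(J\)) extracting a basis of \(\mathbf{N}(J)\) from the right singular vectors associated with the zero singular values and verifying that \(J w = 0\) on that basis:

import numpy

J = numpy.asarray([[1.0, 2.0, 3.0],
                   [2.0, 4.0, 6.0],
                   [1.0, 0.0, 1.0]])          # arbitrary singular matrix
U, sigma, Vt = numpy.linalg.svd(J)
r = int(numpy.sum(sigma > 1e-12 * sigma[0]))  # numerical rank
null_basis = Vt[r:].T                         # columns span N(J)
print('rank =', r)
print('max |J w| over the basis =', numpy.abs(J @ null_basis).max())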
Exercise 25¶
See [SSB85] for a nice proof of the convergence of line search and trust region algorithms.
References
- Gay81
David M Gay. Computing optimal locally constrained steps. SIAM Journal on Scientific and Statistical Computing, 2(2):186–197, 1981.
- Joy
David Joyce. Directional derivatives, steepest ascent, tangent planes. http://aleph0.clarku.edu/~djoyce/ma131/directional.pdf. Accessed on 2017-04-20.
- SSB85
Gerald A Shultz, Robert B Schnabel, and Richard H Byrd. A family of trust-region-based algorithms for unconstrained minimization with strong global convergence properties. SIAM Journal on Numerical Analysis, 22(1):47–67, 1985.
- Ste83
Trond Steihaug. The conjugate gradient method and trust regions in large scale optimization. SIAM Journal on Numerical Analysis, 20(3):626–637, 1983.