Secant Methods for Unconstrained Minimization

Implementation of Quasi-Newton Algorithms Using the Positive Definite Secant Update

  • Details to Mind

    • Factored versus Unfactored

    • Condition Number

    • Positive Definite

      • Trust Region versus Line Search

    • Noise level of each component

  • Picking an Initial Jacobian

    • Set it to a finite difference approximation.

  • Picking an Initial Hessian

    \[\DeclareMathOperator{\typ}{typ} H_0 = \max \{ \lvert f(x_0) \rvert, \typ f \} \cdot D_x^2.\]
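
A minimal sketch of this heuristic, assuming \(D_x\) is the diagonal scaling matrix with entries \(1 / \text{typ}\, x_i\), and using hypothetical values for \(f(x_0)\), typ \(f\), and typ \(x\):

import numpy

def initial_hessian(f_x0, typ_f, typ_x):
    # H_0 = max{|f(x_0)|, typ f} * D_x^2 with D_x = diag(1 / typ x_i),
    # a positive definite diagonal starting matrix on the scale of f.
    D_x = numpy.diag(1.0 / numpy.asarray(typ_x, dtype=float))
    return max(abs(f_x0), typ_f) * D_x @ D_x

# hypothetical problem scales
H_0 = initial_hessian(f_x0=3.7, typ_f=1.0, typ_x=[1e2, 1e-3])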

Exercise 1

The Hessian of a twice continuously differentiable function is symmetric. See Exercise 4.12 for more reasons why a secant approximation to a Hessian should be symmetric.

Exercise 2

Suppose \(A, B \in \mathbb{R}^{n \times n}\) with \(B\) symmetric.

\[\begin{split}\DeclareMathOperator{\tr}{tr} \left\Vert B - A \right\Vert_F^2 &= \tr\left( \left( B - A \right)^\top \left( B - A \right) \right)\\ &= \tr\left(B^\top B\right) - \tr\left(B^\top A\right) - \tr\left(A^\top B\right) + \tr\left(A^\top A\right) & \quad & \text{trace operator is a linear mapping}\\ &= \tr\left( B^2 \right) - 2 \tr\left( AB \right) + \tr\left(A^\top A\right) & \quad & \text{trace cyclic/transpose property}\end{split}\]

By inspection, this norm is minimized when \(B = A\), which is only valid when \(A\) is also symmetric.

From the results in (a) and (b), the norm can be rewritten as

\[\begin{split}\left\Vert B - A \right\Vert_F^2 &= \tr\left( B^2 \right) - 2 \tr\left( AB \right) + \tr\left( A^\top A \right)\\ &= \tr\left( B^2 \right) - \tr\left( \left( A + A^\top \right) B \right) + \tr\left( A^\top A \right).\end{split}\]

Using matrix calculus and setting the derivative to zero gives

\[\begin{split}\frac{\partial \left\Vert B - A \right\Vert_F^2}{\partial B} &= \frac{\partial \tr\left( B^2 \right)}{\partial B} - \frac{\partial \tr\left( \left( A + A^\top \right) B \right)}{\partial B} + \frac{\partial \tr\left( A^\top A \right)}{\partial B} & \quad & \text{linear trace operator commutes with derivative}\\ 0 &= 2B - A^\top - A\\ \frac{A + A^\top}{2} &= B.\end{split}\]

This root is a minimum according to the second derivative test:

\[\begin{split}\frac{\partial^2 \left\Vert B - A \right\Vert_F^2}{\partial B^2} &= \frac{\partial}{\partial B} \left( 2B - A^\top - A \right)\\ &= 2I\\ &\succ 0.\end{split}\]

(a)

A useful identity to recall is that for any square matrix \(A\),

\[x^\top A x = x^\top \frac{A + A^\top}{2} x.\]

(b)

\[\begin{split}\tr\left( A B \right) &= \tr\left( A \sum_{i = 1}^n \lambda_i q_i q_i^\top \right) & \quad & \text{eigendecomposition}\\ &= \sum_{i = 1}^n \lambda_i \tr\left( A q_i q_i^\top \right) & \quad & \text{trace is a linear mapping}\\ &= \sum_{i = 1}^n \lambda_i \tr\left( q_i^\top A q_i \right) & \quad & \text{trace cyclic property}\\ &= \frac{1}{2} \sum_{i = 1}^n \lambda_i \tr\left( q_i^\top \left( A + A^\top \right) q_i \right) & \quad & \text{(a)}\\ &= \frac{1}{2} \tr\left( \left( A + A^\top \right) B \right)\end{split}\]
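
A quick numerical spot check of this exercise with random data; the symmetric part \(\frac{A + A^\top}{2}\) should never be beaten by another symmetric candidate in the Frobenius norm (a minimal sketch, not part of the proof):

import numpy

rng = numpy.random.default_rng(0)
A = rng.standard_normal((5, 5))
B_star = (A + A.T) / 2                       # claimed minimizer

M = rng.standard_normal((5, 5))
B = (M + M.T) / 2                            # an arbitrary symmetric competitor
assert numpy.linalg.norm(B_star - A) <= numpy.linalg.norm(B - A) + 1e-12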

Exercise 3

Suppose \(H_c \triangleq H\) is symmetric, \(s_c \triangleq s\) and \(y_c \triangleq y\).

(a)

\[\begin{split}\DeclareMathOperator{\det}{det} H_+ s &= \left( H + \frac{ \left( y - Hs \right) s^\top + s \left( y - Hs \right)^\top }{s^\top s} - \frac{\left( y - Hs \right)^\top s}{\left( s^\top s \right)^2} s s^\top \right) s & \quad & \text{(9.1.3)}\\ &= Hs + \frac{ys^\top - Hss^\top + sy^\top - ss^\top H^\top}{s^\top s} s - \frac{y^\top s - s^\top H^\top s}{\left( s^\top s \right)^2} s s^\top s\\ &= Hs + y - Hs + \frac{sy^\top s - ss^\top Hs - y^\top ss + s^\top H ss}{s^\top s}\\ &= y\end{split}\]

(b)

Define \(\Sigma \triangleq \begin{bmatrix} \sigma_1 & \sigma_2\\ \sigma_3 & \sigma_4 \end{bmatrix}\). According to [Rhe],

\[\begin{split}H_+ - H &= \begin{bmatrix} y - Hs & s \end{bmatrix} \begin{bmatrix} \sigma_1 & \sigma_2\\ \sigma_3 & \sigma_4 \end{bmatrix} \begin{bmatrix} \left( y - Hs \right)^\top\\ s^\top \end{bmatrix}\\ &= \begin{bmatrix} \sigma_1 \left( y - Hs \right) + \sigma_3 s & \sigma_2 \left( y - Hs \right) + \sigma_4 s \end{bmatrix} \begin{bmatrix} \left( y - Hs \right)^\top\\ s^\top \end{bmatrix}\\ &= \sigma_1 \left( y - Hs \right) \left( y - Hs \right)^\top + \sigma_3 s \left( y - Hs \right)^\top + \sigma_2 \left( y - Hs \right) s^\top + \sigma_4 s s^\top\end{split}\]

where \(\sigma_1 = 0\), \(\sigma_2 = \sigma_3 = \left( s^\top s \right)^{-1}\), and \(\sigma_4 = -\frac{\left( y - Hs \right)^\top s}{\left( s^\top s \right)^2}\).

By inspection, \(H_+ - H = \left( H_+ - H \right)^\top\).

When \(\det(\Sigma) = -\left( s^\top s \right)^{-2} \neq 0\), \(\Sigma\) is a full-rank matrix. Applying (b.1) and (b.2) shows that \(H_+ - H\) has rank at most two.

(b.1)

Recall the following rank property for a matrix \(A \in \mathbb{R}^{m \times n}\):

\[\DeclareMathOperator{\rank}{rank} \rank\left( A^\top A \right) = \rank\left( A A^\top \right) = \rank(A) = \rank\left( A^\top \right).\]

Define \(A \triangleq \begin{bmatrix} y - Hs & s \end{bmatrix}\).

\[\begin{split}A^\top A &= \begin{bmatrix} \left( y - Hs \right)^\top\\ s^\top \end{bmatrix} \begin{bmatrix} y - Hs & s \end{bmatrix}\\ &= \begin{bmatrix} \left( y - Hs \right)^\top \left( y - Hs \right) & \left( y - Hs \right)^\top s\\ s^\top \left( y - Hs \right) & s^\top s \end{bmatrix}\end{split}\]

When \(\det\left( A^\top A \right) \neq 0\), \(A\) is full-rank.

(b.2)

Recall the following rank property for a matrix \(B \in \mathbb{R}^{n \times k}\):

\[\rank(AB) \leq \min \left\{ \rank(A), \rank(B) \right\}.\]
\[\begin{split}\rank\left( A \Sigma A^\top \right) &\leq \min \left\{ \rank(A), \rank\left( \Sigma A^\top \right) \right\}\\ &\leq \min \left\{ \rank(A), \min \left\{ \rank(\Sigma), \rank\left( A^\top \right) \right\} \right\}\\ &= \min \left\{ \rank(A), \rank(\Sigma), \rank\left( A^\top \right) \right\}\end{split}\]
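
All three claims of this exercise, the secant equation, symmetry of \(H_+ - H\), and rank at most two, can be spot checked numerically; a small sketch with random data:

import numpy

rng = numpy.random.default_rng(1)
n = 6
M = rng.standard_normal((n, n))
H = (M + M.T) / 2                            # symmetric H_c
s = rng.standard_normal(n)
y = rng.standard_normal(n)

r = y - H @ s
H_plus = H + (numpy.outer(r, s) + numpy.outer(s, r)) / (s @ s) \
           - (r @ s) / (s @ s) ** 2 * numpy.outer(s, s)         # (9.1.3)

assert numpy.allclose(H_plus @ s, y)                            # secant equation
assert numpy.allclose(H_plus - H, (H_plus - H).T)               # symmetric change
assert numpy.linalg.matrix_rank(H_plus - H) <= 2                # rank at most two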

Exercise 4

Define a linearly independent set \(S = \left\{ v_1 = s_c, v_2, \ldots, v_n \right\}\) where \(v_i \in \mathbb{R}^n\) is randomly generated for \(2 \leq i \leq n\).

Perform Gram-Schmidt orthonormalization to generate an orthonormal set \(Q = \left\{ q_1, q_2, \ldots, q_n \right\}\) where \(q_i = t_i \left\Vert t_i \right\Vert^{-1}\) for \(1 \leq i \leq n\) with

\[t_1 = v_1 \quad \text{and} \quad t_i = v_i - \sum_{j = 1}^{i - 1} \text{proj}_{t_j}(v_i).\]

Suppose \(H \in \mathbb{R}^{n \times n}\) is any matrix such that \(H - H_c\) is symmetric and \(H s_c = y_c\), and let \(H_+\) be given by (9.1.3). Applying the results of Exercise 3.11 gives

\[\begin{split}\left\Vert H_+ - H_c \right\Vert_F^2 &= \sum_{i = 1}^n \left\Vert \left( H_+ - H_c \right) q_i \right\Vert_2^2\\ &\leq \sum_{i = 1}^n \left\Vert \left( H - H_c \right) q_i \right\Vert_2^2 & \quad & \text{(a) and (b)}\\ \left\Vert H_+ - H_c \right\Vert_F &\leq \left\Vert H - H_c \right\Vert_F.\end{split}\]

(a)

\[\begin{split}\left( H_+ - H_c \right) s_c &= \left( \frac{ \left( y_c - H_c s_c \right) s_c^\top + s_c \left( y_c - H_c s_c \right)^\top }{ s_c^\top s_c } - \frac{ \left( y_c - H_c s_c \right)^\top s_c }{ \left( s_c^\top s_c \right)^2 } s_c s_c^\top \right) s_c & \quad & \text{(9.1.3)}\\ &= \left( \frac{ \left( H - H_c \right) s_c s_c^\top + s_c s_c^\top \left( H - H_c \right) }{ s_c^\top s_c } - \frac{ s_c^\top \left( H - H_c \right) s_c }{ \left( s_c^\top s_c \right)^2 } s_c s_c^\top \right) s_c & \quad & \text{see initial assumptions}\\ &= \left( H - H_c \right) s_c + s_c \frac{s_c^\top \left( H - H_c \right) s_c}{s_c^\top s_c} - \frac{s_c^\top \left( H - H_c \right) s_c}{s_c^\top s_c} s_c\\ &= \left( H - H_c \right) s_c\end{split}\]

(b)

Let \(t \in \mathbb{R}^n\) satisfy \(t^\top s_c = 0\).

\[\begin{split}\left( H_+ - H_c \right) t &= \left( \frac{ \left( y_c - H_c s_c \right) s_c^\top + s_c \left( y_c - H_c s_c \right)^\top }{s_c^\top s_c} - \frac{ \left( y_c - H_c s_c \right)^\top s_c }{ \left( s_c^\top s_c \right)^2 } s_c s_c^\top \right) t & \quad & \text{(9.1.3)}\\ &= \frac{s_c \left( y_c - H_c s_c \right)^\top t}{s_c^\top s_c} & \quad & s_c^\top t = 0\\ &= \frac{s_c s_c^\top \left( H - H_c \right) t}{s_c^\top s_c} & \quad & \text{see initial assumptions}\\ \left\Vert \left( H_+ - H_c \right) t \right\Vert_2 &= \left\Vert \frac{s_c s_c^\top}{s_c^\top s_c} \left( H - H_c \right) t \right\Vert_2\\ &\leq \left\Vert \frac{s_c s_c^\top}{s_c^\top s_c} \right\Vert_F \left\Vert \left( H - H_c \right) t \right\Vert_2 & \quad & \text{(3.1.16)}\\ &\leq \left\Vert \left( H - H_c \right) t \right\Vert_2 & \quad & \text{(3.1.17)}\end{split}\]
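
A numerical sanity check of this least-change property: any other matrix \(H\) satisfying \(H s_c = y_c\) with \(H - H_c\) symmetric (here built by adding a symmetric perturbation annihilated by \(s_c\)) should be at least as far from \(H_c\) in the Frobenius norm. A sketch with random data:

import numpy

rng = numpy.random.default_rng(2)
n = 6
M = rng.standard_normal((n, n))
H_c = (M + M.T) / 2
s = rng.standard_normal(n)
y = rng.standard_normal(n)

def symmetric_secant(H, s, y):
    # symmetric secant update (9.1.3)
    r = y - H @ s
    return H + (numpy.outer(r, s) + numpy.outer(s, r)) / (s @ s) \
             - (r @ s) / (s @ s) ** 2 * numpy.outer(s, s)

H_plus = symmetric_secant(H_c, s, y)

# another feasible matrix: add a symmetric perturbation annihilated by s
P = numpy.eye(n) - numpy.outer(s, s) / (s @ s)
W = rng.standard_normal((n, n))
H_other = H_plus + P @ ((W + W.T) / 2) @ P
assert numpy.allclose(H_other @ s, y)                    # still satisfies the secant equation
assert numpy.allclose(H_other - H_c, (H_other - H_c).T)  # change is still symmetric

assert numpy.linalg.norm(H_plus - H_c) <= numpy.linalg.norm(H_other - H_c) + 1e-12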

Exercise 5

Let \(H_+\) be given by (9.1.3), \(H_* = \nabla^2 f(x_*)\), \(P_c = I - s_c s_c^\top / s_c^\top s_c\), \(e_k \triangleq x_k - x_*\), and the assumptions of Theorem 9.1.2 hold i.e. \(\left\Vert x_0 - x_* \right\Vert_2 < \epsilon\) and \(\left\Vert H_0 - H_* \right\Vert_F < \delta\).

The results of Exercise 3.8, Exercise 8.7, and Exercise 4.7 will be useful in the derivation that follows.

\[\begin{split}H_+ - H_* &= H_c - H_* + \frac{ \left( y_c - H_c s_c \right) s_c^\top + s_c \left( y_c - H_c s_c \right)^\top }{ s_c^\top s_c } - \frac{ \left( y_c - H_c s_c \right)^\top s_c }{ \left( s_c^\top s_c \right)^2 } s_c s_c^\top\\ \left\Vert H_+ - H_* \right\Vert_F &= \left\Vert P_c (H_c - H_*) P_c + \frac{(y_c - H_* s_c) s_c^\top}{s_c^\top s_c} + \frac{s_c (y_c - H_* s_c)^\top P_c}{s_c^\top s_c} \right\Vert_F & \quad & \text{(9.7.1)}\\ &\leq \left\Vert P_c (H_c - H_*) P_c \right\Vert_F + \left\Vert \frac{(y_c - H_* s_c) s_c^\top}{s_c^\top s_c} \right\Vert_F + \left\Vert \frac{s_c (y_c - H_* s_c)^\top P_c}{s_c^\top s_c} \right\Vert_F & \quad & \text{Exercise 3.8}\\ &\leq \left\Vert H_c - H_* \right\Vert_F \left\Vert P_c \right\Vert_2^2 + \left\Vert y_c - H_* s_c \right\Vert_2 \left\Vert \frac{s_c}{s_c^\top s_c} \right\Vert_2 + \left\Vert \frac{s_c}{s_c^\top s_c} \right\Vert_2 \left\Vert y_c - H_* s_c \right\Vert_2 \left\Vert P_c \right\Vert_2 & \quad & \text{(3.1.15) and (3.1.17)}\\ &\leq \left\Vert H_c - H_* \right\Vert_F + 2 \left\Vert y_c - H_* s_c \right\Vert_2 \left\Vert s_c \right\Vert_2^{-1} & \quad & \text{Exercise 8.7}\\ &\leq \left\Vert H_c - H_* \right\Vert_F + \frac{\gamma}{2} \left( \left\Vert x_+ - x_* \right\Vert_2 + \left\Vert x_c - x_* \right\Vert_2 \right) & \quad & \text{Lemma 4.1.12 and Exercise 4.7}.\end{split}\]

To simplify the derivations that follow, the constant \(2 \gamma\) is relabeled as \(\gamma\) while the \(\frac{1}{2}\) factor is kept. Recursively expanding (9.7.2) yields

\[\begin{split}\left\Vert H_k - H_* \right\Vert_F &\leq \left\Vert H_{k - 1} - H_* \right\Vert_F + \frac{\gamma}{2} \left( \left\Vert e_k \right\Vert_2 + \left\Vert e_{k - 1} \right\Vert_2 \right)\\ &\leq \left\Vert H_0 - H_* \right\Vert_F + \frac{\gamma}{2} \left\Vert e_k \right\Vert_2 + \frac{\gamma}{2} \left\Vert e_0 \right\Vert_2 + \gamma \sum_{j = 1}^{k - 1} \left\Vert e_j \right\Vert_2\\ &\leq \delta + \gamma \sum_{j = 0}^k \left\Vert e_j \right\Vert_2.\end{split}\]

If the summation is uniformly bounded above for all \(k\), which is the case when \(\left\Vert e_k \right\Vert_2\) converges \(q\)-linearly to zero, then the sequence \(\left\{ \left\Vert H_k - H_* \right\Vert_F \right\}\) is bounded above.

According to Theorem 3.1.4, the symmetric secant method is well defined if

\[\begin{split}\left\Vert H_*^{-1} \left( H_k - H_* \right) \right\Vert_F &\leq \left\Vert H_*^{-1} \right\Vert_F \left\Vert H_k - H_* \right\Vert_F & \quad & \text{(3.1.10)}\\ &\leq \beta \left\Vert H_k - H_* \right\Vert_F & \quad & \left\Vert H_*^{-1} \right\Vert_F \leq \beta\end{split}\]

is strictly less than \(1\). If this holds, then (3.1.20) states that

\[\left\Vert H_k^{-1} \right\Vert_F \leq \frac{ \left\Vert H_*^{-1} \right\Vert_F }{ 1 - \left\Vert H_*^{-1} \left( H_k - H_* \right) \right\Vert_F }\]

where

\[\left\Vert H_0^{-1} \right\Vert_F \leq \frac{\beta}{1 - \beta \delta}.\]

Suppose \(x_{k + 1}\) is well defined. Isolating \(e_{k + 1}\) according to the derivation of (8.2.1) gives

\[\begin{split}x_{k + 1} &= x_k - H_k^{-1} \nabla f(x_k) & \quad & \text{(9.1.4a)}\\ x_{k + 1} - x_* &= x_k - x_* - H_k^{-1} \left( \nabla f(x_k) - \nabla f(x_*) \right) & \quad & \nabla f(x_*) = 0\\ H_k e_{k + 1} &= H_k e_k - \nabla f(x_k) + \nabla f(x_*)\\ &= \left( H_k - H_* \right) e_k + \left( -\nabla f(x_k) + \nabla f(x_*) + H_* e_k \right)\\ \left\Vert e_{k + 1} \right\Vert_2 &\leq \left\Vert H_k^{-1} \right\Vert_F \left[ \left\Vert -\nabla f(x_k) + \nabla f(x_*) + H_* e_k \right\Vert_2 + \left\Vert H_k - H_* \right\Vert_F \left\Vert e_k \right\Vert_2 \right] & \quad & \text{(3.1.16) and Exercise 3.4}\\ &\leq \left\Vert H_k^{-1} \right\Vert_F \left[ \frac{\gamma}{2} \left\Vert e_k \right\Vert_2 + \left\Vert H_k - H_* \right\Vert_F \right] \left\Vert e_k \right\Vert_2 & \quad & \text{Lemma 4.1.12 and (3.1.1b)}.\end{split}\]

See Exercise 3.4 for more details. One (arbitrary yet simple) way to satisfy the definition of \(q\)-linear convergence (2.3.1) is to show by induction that

\[\begin{split}\left\Vert H_k - H_* \right\Vert_F &\leq (2 - 2^{-k}) \delta & \quad & \text{(8.2.7)}\\ \left\Vert e_{k + 1} \right\Vert_2 &\leq \frac{1}{2} \left\Vert e_k \right\Vert_2 & \quad & \text{(8.2.8).}\end{split}\]

From the initial assumptions, the base case \(k = 0\) for (8.2.7) is satisfied:

\[\left\Vert H_0 - H_* \right\Vert_F \leq \delta \leq (2 - 2^{-0}) \delta \leq 2 \delta.\]

The base case \(k = 0\) for (8.2.8) is also satisfied if \(\epsilon\) and \(\delta\) are chosen such that \(6 \beta \delta \leq 1\) and \(3 \gamma \epsilon \leq 2 \delta\) where

\[\begin{split}\left\Vert e_{0 + 1} \right\Vert_2 &\leq \left\Vert H_0^{-1} \right\Vert_F \left[ \frac{\gamma}{2} \left\Vert e_0 \right\Vert_2 + \left\Vert H_0 - H_* \right\Vert_F \right] \left\Vert e_0 \right\Vert_2\\ &\leq \frac{\beta}{1 - \beta \delta} \left[ \frac{\gamma}{2} \epsilon + \delta \right] \epsilon\\ &\leq \frac{6 \beta}{5} \left[ \frac{\gamma}{2} \epsilon + \delta \right] \epsilon\\ &\leq \left[ \frac{3 \beta \gamma \epsilon}{5} + \frac{6 \beta \delta}{5} \right] \epsilon\\ &\leq \left[ \frac{2 \beta \delta}{5} + \frac{1}{5} \right] \epsilon\\ &\leq \left[ \frac{1}{15} + \frac{1}{5} \right] \epsilon\\ &\leq \frac{1}{2} \epsilon.\end{split}\]

For the inductive step, assume (8.2.7) and (8.2.8) hold for \(k = 0, \ldots, i - 1\). When \(k = i\),

\[\begin{split}\left\Vert H_i - H_* \right\Vert_F &\leq \left\Vert H_{i - 1} - H_* \right\Vert_F + \frac{\gamma}{2} \left( \left\Vert e_i \right\Vert_2 + \left\Vert e_{i - 1} \right\Vert_2 \right)\\ &\leq (2 - 2^{-(i - 1)}) \delta + \frac{3 \gamma}{4} \left\Vert e_{i - 1} \right\Vert_2\\ &\leq (2 - 2^{-(i - 1)}) \delta + \frac{3 \gamma}{4} 2^{-(i - 1)} \epsilon\\ &\leq (2 - 2^{-(i - 1)}) \delta + 2^{-i} \delta\\ &\leq (2 - 2^{-i}) \delta.\end{split}\]

To complete the proof, it is necessary to show that the symmetric secant step remains well defined:

\[\begin{split}\left\Vert H_*^{-1} \left( H_i - H_* \right) \right\Vert_F &\leq \beta \left\Vert H_i - H_* \right\Vert_F & \quad & \text{see the foregoing derivation}\\ &\leq \beta (2 - 2^{-i}) \delta\\ &\leq 2 \beta \delta\\ &\leq \frac{1}{3}.\end{split}\]

Consequently, \(\left\Vert H_i^{-1} \right\Vert_F \leq \frac{3 \beta}{2}\). Thus,

\[\begin{split}\left\Vert e_{i + 1} \right\Vert_2 &\leq \left\Vert H_i^{-1} \right\Vert_F \left[ \frac{\gamma}{2} \left\Vert e_i \right\Vert_2 + \left\Vert H_i - H_* \right\Vert_F \right] \left\Vert e_i \right\Vert_2\\ &\leq \frac{3 \beta}{2} \left[ \frac{\gamma}{2} 2^{-i} \epsilon + (2 - 2^{-i}) \delta \right] \left\Vert e_i \right\Vert_2\\ &\leq \frac{3 \beta}{2} \left[ \frac{1}{3} 2^{-i} + (2 - 2^{-i}) \right] \delta \left\Vert e_i \right\Vert_2\\ &\leq 3 \beta \delta \left\Vert e_i \right\Vert_2\\ &\leq \frac{1}{2} \left\Vert e_i \right\Vert_2.\end{split}\]

Exercise 6

The proof of \(q\)-superlinear convergence is a combination of Exercise 8.5 and page 183.

Theorem 8.2.4 shows that a sufficient condition for \(\{ x_k \}\) to converge \(q\)-superlinearly to \(x_*\) is \(\lim_{k \rightarrow \infty} \frac{\left\Vert E_k s_k \right\Vert_2}{\left\Vert s_k \right\Vert_2} = 0\).

Applying the results of Exercise 8.7 and Exercise 9.5 gives

\[\begin{split}\left\Vert H_+ - H_* \right\Vert_F &\leq \left\Vert P_c E_c P_c \right\Vert_F + \frac{\gamma}{2} \left( \left\Vert x_+ - x_* \right\Vert_2 + \left\Vert x_c - x_* \right\Vert_2 \right)\\ &\leq \left\Vert P_c \right\Vert_2 \left\Vert E_c P_c \right\Vert_F + \frac{\gamma}{2} \left( \left\Vert x_+ - x_* \right\Vert_2 + \left\Vert x_c - x_* \right\Vert_2 \right) & \quad & \text{(3.1.15)}\\ &\leq \left\Vert E_c \right\Vert_F - \left( 2 \left\Vert E_c \right\Vert_F \right)^{-1} \left( \frac{ \left\Vert E_c s_c \right\Vert_2 }{ \left\Vert s_c \right\Vert_2 } \right)^2 + \frac{\gamma}{2} \left( \left\Vert x_+ - x_* \right\Vert_2 + \left\Vert x_c - x_* \right\Vert_2 \right) & \quad & \text{Exercise 8.7 and Lemma 8.2.5}\\ \left\Vert E_{k + 1} \right\Vert_F &\leq \left\Vert E_k \right\Vert_F - \left( 2 \left\Vert E_k \right\Vert_F \right)^{-1} \left( \frac{ \left\Vert E_k s_k \right\Vert_2 }{ \left\Vert s_k \right\Vert_2 } \right)^2 + \frac{\gamma}{2} \left( \left\Vert e_{k + 1} \right\Vert_2 + \left\Vert e_k \right\Vert_2 \right) & \quad & \text{replace subscripts}\\ \frac{\left\Vert E_k s_k \right\Vert_2^2}{\left\Vert s_k \right\Vert_2^2} &\leq 2 \left\Vert E_k \right\Vert_F \left( \left\Vert E_k \right\Vert_F - \left\Vert E_{k + 1} \right\Vert_F + \frac{3 \gamma}{4} \left\Vert e_k \right\Vert_2 \right) & \quad & \text{(8.2.8)}\\ &\leq 4 (2 - 2^{-k}) \delta \left( -2^{-(k + 1)} \delta + \frac{3 \gamma}{4} \left\Vert e_k \right\Vert_2 \right) & \quad & \text{(8.2.7)}\\ &\leq 4 (2 - 2^{-k}) \delta \left( -2^{-(k + 1)} \delta + \frac{3 \gamma}{4} 2^{-k} \epsilon \right) & \quad & \text{(8.2.8)}\\ &\leq 4 (2 - 2^{-k}) \delta \left( -2^{-(k + 1)} \delta + \frac{2 \delta}{4} 2^{-k} \right) & \quad & 3 \gamma \epsilon \leq 2 \delta\\ \frac{ \left\Vert E_k s_k \right\Vert_2 }{ \left\Vert s_k \right\Vert_2 } &\leq 0.\end{split}\]

Since each term is nonnegative, the bound forces \(\frac{\left\Vert E_k s_k \right\Vert_2}{\left\Vert s_k \right\Vert_2} = 0\). Therefore

\[\sum_{k = 0}^\infty \frac{\left\Vert E_k s_k \right\Vert_2}{\left\Vert s_k \right\Vert_2}\]

is finite, so \(\lim_{k \rightarrow \infty} \frac{\left\Vert E_k s_k \right\Vert_2}{\left\Vert s_k \right\Vert_2} = 0\) and the sufficient condition for \(q\)-superlinear convergence is satisfied.

Exercise 7

The following proofs are combinations of [Gei] and [Gra].

Recall that a matrix \(H \in \mathbb{R}^{n \times n}\) is symmetric positive definite if and only if it is symmetric and \(x^\top H x > 0\) for all nonzero vectors \(x \in \mathbb{R}^n\).

(1)

Let \(H = J J^\top\) for some nonsingular \(J \in \mathbb{R}^{n \times n}\) (i.e. its columns are linearly independent). Applying the results of (a) with \(B = J^\top\) shows that \(H\) is symmetric positive definite.

(2)

Let \(H \in \mathbb{R}^{n \times n}\) be symmetric positive definite. \(H = J J^\top\) for some nonsingular \(J \in \mathbb{R}^{n \times n}\) can be satisfied if there exists a Cholesky factorization such that \(H = L L^\top\). Invoking (e) shows that \(H\) has a Cholesky decomposition because it is symmetric positive definite.

(a)

Let \(B \in \mathbb{R}^{m \times n}\) have linearly independent columns (i.e. full column rank) and \(A = B^\top B\). \(A\) is symmetric because

\[A^\top = \left( B^\top B \right)^\top = B^\top B = A.\]

When \(x \neq \mathbf{0}\), the nullspace facts from Exercise 6.17 show that \(A\) is also positive definite because

\[v = Bx \neq \mathbf{0} \quad \text{and} \quad v^\top v = x^\top B^\top B x = x^\top A x > 0.\]

(b)

Let \(A \in \mathbb{R}^{n \times n}\) be symmetric positive definite. Define \(e_i\) as a unit \(n\)-vector whose entries are zero except for a one in the \(i^\text{th}\) entry. Applying the definition of positive definite shows that the diagonal elements of \(A\) are positive because

\[e_i^\top A e_i = \alpha_{ii} > 0 \text{ for } i = 1, 2, \ldots, n.\]

(c)

Let \(I_p\) denote a \(p \times p\) identity matrix, \(I_q\) denote a \(q \times q\) identity matrix, and \(M \in \mathbb{R}^{(p + q) \times (p + q)}\) such that \(M = \begin{bmatrix} A & B\\ C & D \end{bmatrix}\) where \(A \in \mathbb{R}^{p \times p}\) is invertible, \(B \in \mathbb{R}^{p \times q}\), \(C \in \mathbb{R}^{q \times p}\), and \(D \in \mathbb{R}^{q \times q}\).

\[\begin{split}\begin{bmatrix} A & B\\ C & D \end{bmatrix} \begin{bmatrix} I_p & -A^{-1}B\\ 0 & I_q \end{bmatrix} &= \begin{bmatrix} A & 0\\ C & D - C A^{-1} B \end{bmatrix}\\ \begin{bmatrix} A & B\\ C & D \end{bmatrix} &= \begin{bmatrix} A & 0\\ C & D - C A^{-1} B \end{bmatrix} \begin{bmatrix} I_p & A^{-1}B\\ 0 & I_q \end{bmatrix}\\ &= \begin{bmatrix} I_p & 0\\ C A^{-1} & I_q \end{bmatrix} \begin{bmatrix} A & 0\\ 0 & D - C A^{-1} B \end{bmatrix} \begin{bmatrix} I_p & A^{-1}B\\ 0 & I_q \end{bmatrix}\end{split}\]

(d)

Assume \(A \in \mathbb{R}^{(n + 1) \times (n + 1)}\) is symmetric positive definite. Partition

\[\begin{split}A = \begin{bmatrix} \alpha_{11} & a_{21}^\top\\ a_{21} & A_{22} \end{bmatrix}\end{split}\]

where \(\alpha_{11} \in \mathbb{R}^{1 \times 1}\), \(a_{21} \in \mathbb{R}^{n \times 1}\), and \(A_{22} \in \mathbb{R}^{n \times n}\). Applying (b) and (c) gives

\[\begin{split}A = \begin{bmatrix} \alpha_{11} & \mathbf{0}^\top\\ a_{21} & S \end{bmatrix} \begin{bmatrix} I_1 & \alpha_{11}^{-1} a_{21}^\top\\ \mathbf{0} & I_n \end{bmatrix}\end{split}\]

where \(S = A_{22} - a_{21} \alpha_{11}^{-1} a_{21}^\top\) is the Schur complement of \(\alpha_{11}\) in \(A\).

\(S\) is symmetric by inspection. For any non-zero vector \(x = \begin{bmatrix} \chi_0\\ x_n \end{bmatrix} \in \mathbb{R}^{n + 1}\),

\[\begin{split}0 < x^\top A x &= \begin{bmatrix} \chi_0 & x_n^\top \end{bmatrix} \begin{bmatrix} \alpha_{11} & \mathbf{0}^\top\\ a_{21} & S \end{bmatrix} \begin{bmatrix} I_1 & \alpha_{11}^{-1} a_{21}^\top\\ \mathbf{0} & I_n \end{bmatrix} \begin{bmatrix} \chi_0\\ x_n \end{bmatrix}\\ &= \begin{bmatrix} \chi_0 \alpha_{11} + x_n^\top a_{21} & x_n^\top S \end{bmatrix} \begin{bmatrix} \chi_0 + \alpha_{11}^{-1} a_{21}^\top x_n\\ x_n \end{bmatrix}\\ &= \chi_0^2 \alpha_{11} + 2 \chi_0 a_{21}^\top x_n + \alpha_{11}^{-1} \left( a_{21}^\top x_n \right)^2 + x_n^\top S x_n.\end{split}\]

With respect to proving \(S\) is positive definite, picking \(\chi_0 = -\alpha_{11}^{-1} a_{21}^\top x_n\) does not lose any generality because \(x_n\) is an arbitrary nonzero vector. With this choice the first three terms cancel, leaving \(0 < x^\top A x = x_n^\top S x_n\), so \(S\) is positive definite.

(e)

Let \(A \in \mathbb{R}^{(n + 1) \times (n + 1)}\) be symmetric positive definite. Since \(A\) is symmetric, let \(\star\) denote parts of \(A\) that are neither stored nor updated.

\[\begin{split}\DeclareMathOperator{\chol}{chol} A = \begin{bmatrix} \alpha_{11} & \star\\ a_{21} & A_{22} \end{bmatrix} &= L L^\top & \quad & \chol(A) = L\\ &= \begin{bmatrix} \lambda_{11} & \boldsymbol{0}^\top\\ l_{21} & L_{22} \end{bmatrix} \begin{bmatrix} \lambda_{11} & l_{21}^\top\\ \boldsymbol{0} & L_{22}^\top \end{bmatrix}\\ &= \begin{bmatrix} \lambda_{11}^2 & \star\\ \lambda_{11} l_{21} & l_{21} l_{21}^\top + L_{22} L_{22}^\top \end{bmatrix}\end{split}\]

Equating the block matrices gives

\[\begin{split}\lambda_{11} &= \sqrt{\alpha_{11}}\\\\ l_{21} &= a_{21} / \lambda_{11}\\\\ L_{22} &= \chol\left( A_{22} - l_{21} l_{21}^\top \right).\end{split}\]

The inductive proof starts with the base case \(n = 0\): Invoking (b) shows that \(\alpha_{11} > 0\) hence \(\lambda_{11} = \sqrt{\alpha_{11}}\) is well-defined.

The inductive step assumes the Cholesky factorization exists for every symmetric positive definite matrix in \(\mathbb{R}^{n \times n}\). Let \(A \in \mathbb{R}^{(n + 1) \times (n + 1)}\) be symmetric positive definite. Partitioning \(A\) as described and invoking (b) establishes that \(\lambda_{11}\) and \(l_{21}\) are well-defined. Invoking (d) reveals that \(A_{22} - l_{21} l_{21}^\top\) is symmetric positive definite, so the inductive hypothesis supplies its Cholesky factor \(L_{22}\), which enables the recursive application of the decomposition. Thus, \(L\) is the desired Cholesky factor of \(A\).
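
The constructive argument in (e) translates directly into code; a minimal recursive sketch (not a production implementation):

import numpy

def chol(A):
    # Recursive Cholesky factorization following the partition in (e):
    # lambda_11 = sqrt(alpha_11), l_21 = a_21 / lambda_11,
    # L_22 = chol(A_22 - l_21 l_21^T).
    A = numpy.asarray(A, dtype=float)
    n = A.shape[0]
    L = numpy.zeros_like(A)
    L[0, 0] = numpy.sqrt(A[0, 0])
    if n == 1:
        return L
    L[1:, 0] = A[1:, 0] / L[0, 0]
    L[1:, 1:] = chol(A[1:, 1:] - numpy.outer(L[1:, 0], L[1:, 0]))
    return L

M = numpy.random.default_rng(3).standard_normal((4, 4))
A = M @ M.T + 4 * numpy.eye(4)               # symmetric positive definite
L = chol(A)
assert numpy.allclose(L @ L.T, A)
assert numpy.allclose(L, numpy.tril(L))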

Exercise 8

\[\begin{split}H_+ &= J_+ J_+^\top\\ &= \left( L_c + \frac{(y_c - L_c v_c) v_c^\top}{v_c^\top v_c} \right) \left( L_c + \frac{(y_c - L_c v_c) v_c^\top}{v_c^\top v_c} \right)^\top & \quad & \text{(9.2.6)}\\ &= H_c + \frac{L_c v_c (y_c - L_c v_c)^\top}{v_c^\top v_c} + \frac{(y_c - L_c v_c) v_c^\top L_c^\top}{v_c^\top v_c} + \frac{(y_c - L_c v_c) (y_c - L_c v_c)^\top}{v_c^\top v_c}\\ &= H_c + \frac{y_c y_c^\top}{v_c^\top v_c} - \frac{L_c v_c v_c^\top L_c^\top}{v_c^\top v_c}\\ &= H_c + \frac{y_c y_c^\top}{\alpha^2 s_c^\top H_c s_c} - \frac{H_c s_c s_c^\top H_c}{s_c^\top H_c s_c} & \quad & \text{(9.2.8) and } H_c = L_c L_c^\top\\ &= H_c + \frac{y_c y_c^\top}{y_c^\top s_c} - \frac{H_c s_c s_c^\top H_c}{s_c^\top H_c s_c} & \quad & \text{(9.2.9)}\end{split}\]

According to Lemma 8.3.1 (see Exercise 8.13), \(J_+\) given by (9.2.6, 9.2.9) is nonsingular.
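
A numerical spot check that the factored update (9.2.6), with \(v_c = \alpha L_c^\top s_c\) and \(\alpha^2 = y_c^\top s_c / \left( s_c^\top H_c s_c \right)\) as used above, reproduces the unfactored formula; the sketch assumes \(y_c^\top s_c > 0\) so that \(\alpha\) is real:

import numpy

rng = numpy.random.default_rng(4)
n = 5
M = rng.standard_normal((n, n))
H_c = M @ M.T + n * numpy.eye(n)             # symmetric positive definite
L_c = numpy.linalg.cholesky(H_c)             # H_c = L_c L_c^T
s = rng.standard_normal(n)
B = rng.standard_normal((n, n))
y = (B @ B.T + n * numpy.eye(n)) @ s         # guarantees y^T s > 0

alpha = numpy.sqrt((y @ s) / (s @ H_c @ s))  # (9.2.9)
v = alpha * (L_c.T @ s)                      # (9.2.8)
J_plus = L_c + numpy.outer(y - L_c @ v, v) / (v @ v)            # (9.2.6)

H_plus = H_c + numpy.outer(y, y) / (y @ s) \
             - numpy.outer(H_c @ s, H_c @ s) / (s @ H_c @ s)    # (9.2.10)
assert numpy.allclose(J_plus @ J_plus.T, H_plus)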

Exercise 9

Suppose \(H_c s_c = y_c\). Then \(\alpha = \pm 1\) and

\[\begin{split}J_+ &= L_c + \frac{(y_c - L_c v_c) v_c^\top}{v_c^\top v_c} & \quad & \text{(9.2.6)}\\ &= L_c + \frac{y_c v_c^\top}{v_c^\top v_c} - \frac{L_c v_c v_c^\top}{v_c^\top v_c}\\ &= L_c + \frac{y_c s_c^\top L_c}{\alpha s_c^\top H_c s_c} - \frac{H_c s_c s_c^\top L_c}{s_c^\top H_c s_c} & \quad & \text{(9.2.9)}\\ &= L_c + \frac{y_c s_c^\top L_c}{\alpha y_c^\top s_c} - \frac{y_c s_c^\top L_c}{y_c^\top s_c} & \quad & \text{see initial supposition.}\end{split}\]

Thus choosing \(\alpha = 1\) makes \(J_+\) closer to \(L_c\) when \(y_c \cong H_c s_c\).

Exercise 10

\[\begin{split}H_+^{-1} &= J_+^{-\top} J_+^{-1}\\ &= \left( L_c^{-\top} + \frac{(s_c - L_c^{-\top} w_c) w_c^\top}{w_c^\top w_c} \right) \left( L_c^{-1} + \frac{w_c (s_c - L_c^{-\top} w_c)^\top}{w_c^\top w_c} \right) & \quad & \text{(9.2.12)}\\ &= H_c^{-1} + \frac{L_c^{-\top} w_c (s_c - L_c^{-\top} w_c)^\top}{w_c^\top w_c} + \frac{(s_c - L_c^{-\top} w_c) w_c^\top L_c^{-1}}{w_c^\top w_c} + \frac{ (s_c - L_c^{-\top} w_c) (s_c - L_c^{-\top} w_c)^\top }{ w_c^\top w_c }\\ &= H_c^{-1} + \frac{s_c s_c^\top}{w_c^\top w_c} - \frac{L_c^{-\top} w_c w_c^\top L_c^{-1}}{w_c^\top w_c}\\ &= H_c^{-1} + \frac{s_c s_c^\top}{\alpha^2 y_c^\top H_c^{-1} y_c} - \frac{H_c^{-1} y_c y_c^\top H_c^{-1}}{y_c^\top H_c^{-1} y_c}\\ &= H_c^{-1} + \frac{s_c s_c^\top}{s_c^\top y_c} - \frac{H_c^{-1} y_c y_c^\top H_c^{-1}}{y_c^\top H_c^{-1} y_c}\end{split}\]

where

\[w_c = \left( \frac{y_c^\top s_c}{y_c^\top H_c^{-1} y_c} \right)^{1 / 2} L_c^{-1} y_c = \alpha L_c^{-1} y_c.\]

Using Lemma 8.3.1 to invert (9.2.12) gives

\[\begin{split}J_+^{\top} &= \left( L_c^{-\top} + \frac{(s_c - L_c^{-\top} w_c) w_c^\top}{w_c^\top w_c} \right)^{-1}\\ &= L_c^\top - \sigma^{-1} L_c^\top \frac{(s_c - L_c^{-\top} w_c) w_c^\top}{w_c^\top w_c} L_c^\top\\ &= L_c^\top - L_c^\top \frac{(s_c - L_c^{-\top} w_c) w_c^\top}{w_c^\top L_c^\top s_c} L_c^\top\\ &= L_c^\top - \frac{ L_c^\top s_c w_c^\top L_c^\top - w_c w_c^\top L_c^\top }{w_c^\top L_c^\top s_c}\\ &= L_c^\top - \frac{ L_c^\top s_c y_c^\top - \alpha L_c^{-1} y_c y_c^\top }{y_c^\top s_c}\end{split}\]

where

\[\sigma = 1 + \frac{w_c^\top}{w_c^\top w_c} L_c^\top (s_c - L_c^{-\top} w_c) = \frac{w_c^\top L_c^\top s_c}{w_c^\top w_c}.\]

Consequently,

\[\begin{split}H_+ &= J_+ J_+^\top\\ &= \left( L_c - \frac{y_c s_c^\top L_c - \alpha y_c y_c^\top L_c^{-\top}}{y_c^\top s_c} \right) \left( L_c^\top - \frac{ L_c^\top s_c y_c^\top - \alpha L_c^{-1} y_c y_c^\top }{ y_c^\top s_c } \right)\\ &= H_c - \frac{H_c s_c y_c^\top - \alpha y_c y_c^\top}{y_c^\top s_c} - \frac{y_c s_c^\top H_c - \alpha y_c y_c^\top}{y_c^\top s_c} + \frac{ y_c s_c^\top H_c s_c y_c^\top - \alpha y_c s_c^\top y_c y_c^\top - \alpha y_c y_c^\top s_c y_c^\top + \alpha^2 y_c y_c^\top H_c^{-1} y_c y_c^\top }{ (y_c^\top s_c)^2 }\\ &= H_c + \frac{y_c y_c^\top - H_c s_c y_c^\top - y_c s_c^\top H_c}{y_c^\top s_c} + \frac{y_c s_c^\top H_c s_c y_c^\top}{(y_c^\top s_c)^2}\\ &= H_c + \frac{ (y_c - H_c s_c) y_c^\top + y_c (y_c - H_c s_c)^\top }{ y_c^\top s_c } - \frac{(y_c - H_c s_c)^\top s_c y_c y_c^\top}{(y_c^\top s_c)^2}.\end{split}\]
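
The same kind of check for this exercise: forming \(J_+^{-\top} J_+^{-1}\) from (9.2.12) should reproduce the inverse update derived above; a sketch assuming \(y_c^\top s_c > 0\):

import numpy

rng = numpy.random.default_rng(9)
n = 5
M = rng.standard_normal((n, n))
H_c = M @ M.T + n * numpy.eye(n)             # symmetric positive definite
L_c = numpy.linalg.cholesky(H_c)             # H_c = L_c L_c^T
s = rng.standard_normal(n)
B = rng.standard_normal((n, n))
y = (B @ B.T + n * numpy.eye(n)) @ s         # guarantees y^T s > 0

H_c_inv = numpy.linalg.inv(H_c)
alpha = numpy.sqrt((y @ s) / (y @ H_c_inv @ y))
w = alpha * numpy.linalg.solve(L_c, y)       # w_c = alpha L_c^{-1} y_c
L_c_inv_T = numpy.linalg.inv(L_c).T
J_plus_inv_T = L_c_inv_T + numpy.outer(s - L_c_inv_T @ w, w) / (w @ w)   # (9.2.12)

inverse_update = H_c_inv + numpy.outer(s, s) / (s @ y) \
                         - numpy.outer(H_c_inv @ y, H_c_inv @ y) / (y @ H_c_inv @ y)
assert numpy.allclose(J_plus_inv_T @ J_plus_inv_T.T, inverse_update)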

Exercise 11

Define \(H = T^\top T\).

\[\begin{split}\hat{H}_+ &= \hat{H}_c + \frac{ (\hat{y}_c - \hat{H}_c \hat{s}_c) \hat{s}_c^\top + \hat{s}_c (\hat{y}_c - \hat{H}_c \hat{s}_c)^\top }{ \hat{s}_c^\top \hat{s}_c } - \frac{ (\hat{y}_c - \hat{H}_c \hat{s}_c)^\top \hat{s}_c \hat{s}_c \hat{s}_c^\top }{ (\hat{s}_c^\top \hat{s}_c)^2 } & \quad & \text{(9.3.1)}\\ &= T^{-\top} H_c T^{-1} + \frac{ (T^{-\top} y_c - T^{-\top} H_c s_c) s_c^\top T^\top + T s_c (y_c^\top T^{-1} - s_c^\top H_c T^{-1}) }{s_c^\top H s_c} - \frac{ (y_c^\top - s_c^\top H_c) s_c T s_c s_c^\top T^\top }{(s_c^\top H s_c)^2}\\ H_+ &= H_c + \frac{ (y_c - H_c s_c) s_c^\top H + H s_c (y_c - H_c s_c)^\top }{s_c^\top H s_c} - \frac{ (y_c - H_c s_c)^\top s_c H s_c s_c^\top H }{(s_c^\top H s_c)^2} & \quad & H_+ = T^\top \hat{H}_+ T.\end{split}\]

Exercise 12

The section on Convergence of Double-Rank Methods in [BDMore73] is the proof for this exercise. Overall, the proofs in the paper are along the lines of the book's.

Exercise 13

Suppose the assumptions of (9.3.1) hold (see Exercise 7.2).

(a)

\[\begin{split}\hat{H}_+ &= \hat{H}_c + \frac{\hat{y}_c \hat{y}_c^\top}{\hat{y}_c^\top \hat{s}_c} - \frac{ \hat{H}_c \hat{s}_c \hat{s}_c^\top \hat{H}_c }{ \hat{s}_c^\top \hat{H}_c \hat{s}_c } & \quad & \text{(9.2.10)}\\ &= T^{-\top} H_c T^{-1} + \frac{T^{-\top} y_c y_c^\top T^{-1}}{y_c^\top s_c} - \frac{T^{-\top} H_c s_c s_c^\top H_c T^{-1}}{s_c^\top H_c s_c}\\ &= T^{-\top} H_+ T^{-1}\end{split}\]

Applying the foregoing result to (9.2.11a) gives

\[\begin{split}\hat{x}_{k + 1} &= \hat{x}_k - \hat{H}_k^{-1} \nabla \hat{f}(\hat{x}_k)\\ T x_{k + 1} &= T x_k - \left( T^{-\top} H_k T^{-1} \right)^{-1} T^{-\top} \nabla f(x_k)\\ T x_{k + 1} &= T x_k - T H_k^{-1} \nabla f(x_k)\\ x_{k + 1} &= x_k - H_k^{-1} \nabla f(x_k),\end{split}\]

which illustrates that the positive definite secant update is invariant under linear transformations of the variable space.
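
A numerical illustration of this invariance, assuming a random nonsingular \(T\) and the curvature condition \(y_c^\top s_c > 0\):

import numpy

rng = numpy.random.default_rng(10)
n = 4
M = rng.standard_normal((n, n))
H_c = M @ M.T + n * numpy.eye(n)
s = rng.standard_normal(n)
B = rng.standard_normal((n, n))
y = (B @ B.T + n * numpy.eye(n)) @ s                 # y^T s > 0
T = rng.standard_normal((n, n)) + n * numpy.eye(n)   # nonsingular transformation

def pd_secant(H, s, y):
    # positive definite secant update (9.2.10)
    return H + numpy.outer(y, y) / (y @ s) - numpy.outer(H @ s, H @ s) / (s @ H @ s)

T_inv = numpy.linalg.inv(T)
H_hat = T_inv.T @ H_c @ T_inv                        # transformed quantities
s_hat, y_hat = T @ s, T_inv.T @ y

assert numpy.allclose(pd_secant(H_hat, s_hat, y_hat),
                      T_inv.T @ pd_secant(H_c, s, y) @ T_inv)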

As shown in Exercise 4.11, \(\hat{H}_+\) given by the symmetric secant update cannot be factored as \(T^{-\top} H_+ T^{-1}\) where \(H_+\) is the symmetric secant update in the original variables.

(b)

Recall that Broyden's update, or simply the secant update (8.1.5), applied to Hessians (9.1.2) is not guaranteed to be symmetric. Its update has rank one because it is the outer product of two vectors.

The update

\[\begin{split}\hat{H}_+ &= \hat{H}_c + \frac{ (\hat{y}_c - \hat{H}_c \hat{s}_c) \hat{s}_c^\top }{ \hat{s}_c^\top \hat{s}_c }\\ &= T^{-\top} H_c T^{-1} + \frac{ (T^{-\top} y_c - T^{-\top} H_c s_c) s_c^\top T^\top }{ s_c^\top T^\top T s_c }\end{split}\]

can be modified to be invariant under linear transformations of the variable space as follows:

\[\begin{split}\hat{H}_+ &= \hat{H}_c + \frac{ (\hat{y}_c - \hat{H}_c \hat{s}_c) \hat{y}_c^\top }{ \hat{y}_c^\top \hat{s}_c }\\ &= T^{-\top} H_c T^{-1} + \frac{(T^{-\top} y_c - T^{-\top} H_c s_c) y_c^\top T^{-1}}{y_c^\top s_c}\\ &= T^{-\top} H_+ T^{-1}\end{split}\]

which is essentially (8.4.5) with \(v_k = y_c\).

Exercise 14

Let \(f(x_1, x_2) = x_1^2 + 2 x_1 x_2 + 2 x_2^2 + x_2^4\). Then

\[\begin{split}\nabla f &= \begin{bmatrix} 2 x_1 + 2 x_2\\ 2 x_1 + 4 x_2 + 4 x_2^3 \end{bmatrix}\\\\ \nabla^2 f &= \begin{bmatrix} 2 & 2\\ 2 & 4 + 12 x_2^2 \end{bmatrix}.\end{split}\]

Suppose \(x_0 = (1, -1)^\top\), then \(\nabla f(x_0) = \begin{bmatrix} 0\\ -6 \end{bmatrix}\) and \(H_0 = \nabla^2 f(x_0) = \begin{bmatrix} 2 & 2\\ 2 & 16 \end{bmatrix}\).

The positive definite secant and symmetric secant methods have in common the following updates:

\[\begin{split}x_{k + 1} &= x_k - H_k^{-1} \nabla f(x_k)\\\\ s_k &= x_{k + 1} - x_k\\\\ y_k &= \nabla f(x_{k + 1}) - \nabla f(x_k).\end{split}\]

(a)

Suppose \(f \colon \mathbb{R}^n \rightarrow \mathbb{R}\) and \(\nabla f(x) = \begin{bmatrix} A x + b\\ F(x) \end{bmatrix}\) where \(A \in \mathbb{R}^{m \times n}\), \(b \in \mathbb{R}^m\), and \(F \colon \mathbb{R}^n \rightarrow \mathbb{R}^{n - m}\) is nonlinear.

Let \(z_k = F(x_{k + 1}) - F(x_k)\) and \(s_k = \begin{bmatrix} s_{k1}\\ s_{k2} \end{bmatrix}\) where \(s_{k1} \in \mathbb{R}^m\) and \(s_{k2} \in \mathbb{R}^{n - m}\) in order to simplify notations.

\(H_0 = H(x_0)\) can be calculated using finite differences or analytically as

\[\begin{split}H = \nabla^2 f(x) &= \frac{\partial \nabla f(x)}{\partial x}\\ &= \begin{bmatrix} A\\ F'(x) \end{bmatrix} & \quad & \frac{\partial F(x)}{\partial x} = F'(x) \in \mathbb{R}^{(n - m) \times n}\\ &= H^\top & \quad & \text{the Hessian is symmetric (Exercise 1).}\end{split}\]

As shown in Exercise 8.10, the linear components of the gradient satisfy \(Ax_k + b = 0\) for all \(k \geq 1\).

Substituting the given values into (a.1) shows that the positive definite secant method converges to the correct Hessian. (a.2) illustrates that the symmetric secant method in this case fails to converge to the correct Hessian. Note that this way of showing convergence is hand-wavy; the more rigorous version is along the lines of Exercises 5, 6, and 12. See Exercise 8.10 for a general idea of how to prove this. Supposedly [Sch82], which could not be found on the internet, contains the proof.

(a.1)

\[\begin{split}H_1 &= H_0 + \frac{y_0 y_0^\top}{y_0^\top s_0} - \frac{H_0 s_0 s_0^\top H_0}{s_0^\top H_0 s_0}\\ &= \begin{bmatrix} A\\ F'(x_0) \end{bmatrix} + \frac{ \begin{bmatrix} A s_0\\ z_0 \end{bmatrix} \begin{bmatrix} A s_0\\ z_0 \end{bmatrix}^\top }{ \begin{bmatrix} A s_0\\ z_0 \end{bmatrix}^\top s_0 } - \frac{ \begin{bmatrix} A s_0\\ F'(x_0) s_0 \end{bmatrix} \begin{bmatrix} s_0^\top A^\top\\ s_0^\top F'(x_0)^\top \end{bmatrix} }{ \begin{bmatrix} A s_0\\ F'(x_0) s_0 \end{bmatrix}^\top s_0 }\\ &= \begin{bmatrix} A\\ F'(x_0) \end{bmatrix} + (s_0^\top A^\top s_{01} + z_0^\top s_{02})^{-1} \begin{bmatrix} A s_0 s_0^\top A^\top & A s_0 z_0^\top\\ z_0 s_0^\top A^\top & z_0 z_0^\top \end{bmatrix} - (s_0^\top A^\top s_{01} + s_0^\top F'(x_0)^\top s_{02})^{-1} \begin{bmatrix} A s_0 s_0^\top A^\top & A s_0 s_0^\top F'(x_0)^\top\\ F'(x_0) s_0 s_0^\top A^\top & F'(x_0) s_0 s_0^\top F'(x_0)^\top \end{bmatrix}\end{split}\]

Generalizing this update sequence gives

\[\begin{split}H_{k + 1} &= H_k + \frac{y_k y_k^\top}{y_k^\top s_k} - \frac{H_k s_k s_k^\top H_k}{s_k^\top H_k s_k}\\ &= H_k + (s_k^\top A^\top s_{k1} + z_k^\top s_{k2})^{-1} \begin{bmatrix} A s_k s_k^\top A^\top & A s_k z_k^\top\\ z_k s_k^\top A^\top & z_k z_k^\top \end{bmatrix} - (s_k^\top A^\top s_{k1} + s_k^\top F'(x_k)^\top s_{k2})^{-1} \begin{bmatrix} A s_k s_k^\top A^\top & A s_k s_k^\top F'(x_k)^\top\\ F'(x_k) s_k s_k^\top A^\top & F'(x_k) s_k s_k^\top F'(x_k)^\top \end{bmatrix}\\ &= H_k + (z_k^\top s_{k2})^{-1} \begin{bmatrix} \boldsymbol{0}_{m \times m} & \boldsymbol{0}_{m \times (n - m)}\\ \boldsymbol{0}_{(n - m) \times m} & z_k z_k^\top \end{bmatrix} - (s_k^\top F'(x_k)^\top s_{k2})^{-1} \begin{bmatrix} \boldsymbol{0}_{m \times m} & \boldsymbol{0}_{m \times (n - m)}\\ \boldsymbol{0}_{(n - m) \times m} & F'(x_k) s_k s_k^\top F'(x_k)^\top \end{bmatrix}.\end{split}\]

Notice how the remaining block matrices form another positive definite secant update, which still satisfies Theorem 9.3.1 according to the Schur complement.

(a.2)

\[\begin{split}H_1 &= H_0 + \frac{ (y_0 - H_0 s_0) s_0^\top + s_0 (y_0 - H_0 s_0)^\top }{ s_0^\top s_0 } - \frac{(y_0 - H_0 s_0)^\top s_0 s_0 s_0^\top}{(s_0^\top s_0)^2}\\ &= \begin{bmatrix} A\\ F'(x_0) \end{bmatrix} + \frac{ \begin{bmatrix} \boldsymbol{0}_{m \times 1}\\ z_0 - F'(x_0) s_0 \end{bmatrix} s_0^\top + s_0 \begin{bmatrix} \boldsymbol{0}_{1 \times m} & z_0^\top - s_0^\top F'(x_0)^\top \end{bmatrix} }{ s_0^\top s_0 } - \frac{ \left( z_0^\top - s_0^\top F'(x_0)^\top \right) s_{02} }{ (s_0^\top s_0)^2 } s_0 s_0^\top\\ &= \begin{bmatrix} A\\ F'(x_0) \end{bmatrix} + \frac{ \begin{bmatrix} \boldsymbol{0}_{m \times n}\\ z_0 s_0^\top - F'(x_0) s_0 s_0^\top \end{bmatrix} + \begin{bmatrix} \boldsymbol{0}_{n \times m} & s_0 z_0^\top - s_0 s_0^\top F'(x_0)^\top \end{bmatrix} }{ s_0^\top s_0 } - \frac{ z_0^\top s_{02} - s_0^\top F'(x_0)^\top s_{02} }{ (s_0^\top s_0)^2 } s_0 s_0^\top\end{split}\]

Generalizing this update sequence gives

\[\begin{split}H_{k + 1} &= H_k + \frac{ (y_k - H_k s_k) s_k^\top + s_k (y_k - H_k s_k)^\top }{ s_k^\top s_k } - \frac{(y_k - H_k s_k)^\top s_k s_k s_k^\top}{(s_k^\top s_k)^2}\\ &= H_k + \frac{ \begin{bmatrix} \boldsymbol{0}_{m \times n}\\ z_k s_k^\top - F'(x_k) s_k s_k^\top \end{bmatrix} + \begin{bmatrix} \boldsymbol{0}_{n \times m} & s_k z_k^\top - s_k s_k^\top F'(x_k)^\top \end{bmatrix} }{ s_k^\top s_k } - \frac{ z_k^\top s_{k2} - s_k^\top F'(x_k)^\top s_{k2} }{ (s_k^\top s_k)^2 } s_k s_k^\top.\end{split}\]

Unlike (a.1), the remaining block matrices cannot be isolated to form a symmetric secant update. This means even though Theorem 9.1.2 guarantees its convergence to \(x_*\), there is no guarantee on the Hessian.

(b)

In this case, the positive definite secant method exactly computes the linear portion of the Hessian in a single step. The symmetric secant method does not seem to be directly affected by \(\nabla f(x)\).

(c)

It is interesting that the symmetric secant method’s Hessian converges to \(\nabla^2 f(x_*)\) numerically if the stopping criterion is ignored.

When \(\nabla f(x_0)\) is not zero in the linear components of \(\nabla f(x)\), the symmetric secant method takes several times longer to converge to the correct Hessian compared to the positive definite secant method. However, it does converge to the correct Hessian.

[1]:
import numpy

def pdsu(x_c, H_c):
    # One step of the positive definite secant method: a Newton-like step
    # (9.1.4a) followed by the positive definite secant update (9.2.10).
    x_k = x_c - numpy.linalg.inv(H_c) * fp(x_c)
    s_k = x_k - x_c
    y_k = fp(x_k) - fp(x_c)

    _ = (y_k * y_k.T) / (y_k.T * s_k)
    _ -= (H_c * s_k * s_k.T * H_c) / (s_k.T * H_c * s_k)
    return x_k, H_c + _

def ssu(x_c, H_c):
    # One step of the symmetric secant method: the same Newton-like step
    # followed by the symmetric secant update (9.1.3).
    x_k = x_c - numpy.linalg.inv(H_c) * fp(x_c)
    s_k = x_k - x_c
    y_k = fp(x_k) - fp(x_c)

    _ = (y_k - H_c * s_k) * s_k.T + s_k * (y_k - H_c * s_k).T
    _ /= s_k.T * s_k
    dp = (y_k - H_c * s_k).T * s_k
    _ -= (dp.flat[0] * s_k * s_k.T) / (s_k.T * s_k)**2
    return x_k, H_c + _

def fp(x):
    # Gradient of f(x_1, x_2) = x_1^2 + 2 x_1 x_2 + 2 x_2^2 + x_2^4.
    x = x.flat
    return numpy.asmatrix([[2 * x[0] + 2 * x[1]],
                           [2 * x[0] + 4 * x[1] + 4 * x[1]**3]])

def fpp(x):
    # Analytic Hessian of the same f, used to initialize H_0.
    x = x.flat
    return numpy.asmatrix([[2, 2],
                           [2, 4 + 12 * x[1]**2]])

x_0 = numpy.asmatrix([[1],
                      [-1]])
[2]:
x_c = x_0
H_c = fpp(x_c)
_ = 'x_{0}: {1}\t\tH_{0}: {2}'
print(_.format(0, numpy.asarray(x_c).squeeze(),
               numpy.asarray(H_c.flatten()).squeeze()))

for i in range(8):
    x_k, H_k = pdsu(x_c, H_c)
    print(_.format(i + 1, numpy.asarray(x_k).squeeze(),
                   numpy.asarray(H_k.flatten()).squeeze()))
    x_c, H_c = x_k, H_k
x_0: [ 1 -1]            H_0: [ 2  2  2 16]
x_1: [ 0.57142857 -0.57142857]          H_1: [ 2.          2.          2.         11.59183673]
x_2: [ 0.37446809 -0.37446809]          H_2: [2.         2.         2.         6.72295489]
x_3: [ 0.17142203 -0.17142203]          H_3: [2.         2.         2.         4.93521577]
x_4: [ 0.04775366 -0.04775366]          H_4: [2.         2.         2.         4.15940782]
x_5: [ 0.00332346 -0.00332346]          H_5: [2.         2.         2.         4.00980066]
x_6: [ 1.61335925e-05 -1.61335925e-05]          H_6: [2.        2.        2.        4.0000444]
x_7: [ 3.58126625e-10 -3.58126625e-10]          H_7: [2. 2. 2. 4.]
x_8: [ 1.86439634e-19 -1.86439634e-19]          H_8: [2. 2. 2. 4.]
[3]:
x_c = x_0
H_c = fpp(x_c)
_ = 'x_{0}: {1}\t\tH_{0}: {2}'
print(_.format(0, numpy.asarray(x_c).squeeze(),
               numpy.asarray(H_c.flatten()).squeeze()))

for i in range(8):
    x_k, H_k = ssu(x_c, H_c)
    print(_.format(i + 1, numpy.asarray(x_k).squeeze(),
                   numpy.asarray(H_k.flatten()).squeeze()))
    x_c, H_c = x_k, H_k
x_0: [ 1 -1]            H_0: [ 2  2  2 16]
x_1: [ 0.57142857 -0.57142857]          H_1: [ 3.10204082  3.10204082  3.10204082 12.69387755]
x_2: [ 0.37446809 -0.37446809]          H_2: [4.31926128 4.31926128 4.31926128 9.04221617]
x_3: [ 0.17142203 -0.17142203]          H_3: [4.76619606 4.76619606 4.76619606 7.70141183]
x_4: [ 0.04775366 -0.04775366]          H_4: [4.96014804 4.96014804 4.96014804 7.11955587]
x_5: [ 0.00332346 -0.00332346]          H_5: [4.99754984 4.99754984 4.99754984 7.00735049]
x_6: [ 1.61335925e-05 -1.61335925e-05]          H_6: [4.9999889 4.9999889 4.9999889 7.0000333]
x_7: [ 3.58126646e-10 -3.58126625e-10]          H_7: [5. 5. 5. 7.]
x_8: [ 1.29405326e-17 -1.86439738e-19]          H_8: [4.99999993 5.         5.         7.00000007]

Exercise 15

There is a typo in the definition of \(f\). This exercise illustrates that the convergence of the Hessian using either method depends on the function and is not true in general.

From (a) and (b), it is not obvious that both methods converge to the same incorrect Hessian.

(a)

\[\begin{split}H_1 &= H_0 + \frac{y_0 y_0^\top}{y_0^\top s_0} - \frac{H_0 s_0 s_0^\top H_0}{s_0^\top H_0 s_0}\\ &= I + \frac{y_0 y_0^\top}{y_0^\top s_0} - \frac{s_0 s_0^\top}{s_0^\top s_0}\end{split}\]

(b)

\[\begin{split}H_1 &= H_0 + \frac{ (y_0 - H_0 s_0) s_0^\top + s_0 (y_0 - H_0 s_0)^\top }{ s_0^\top s_0 } - \frac{ (y_0 - H_0 s_0)^\top s_0 s_0 s_0^\top }{ (s_0^\top s_0)^2 }\\ &= I + \frac{(y_0 - s_0) s_0^\top + s_0 (y_0 - s_0)^\top}{s_0^\top s_0} - \frac{(y_0 - s_0)^\top s_0 s_0 s_0^\top}{(s_0^\top s_0)^2}\\ &= I + \frac{ y_0 s_0^\top - s_0 s_0^\top + s_0 y_0^\top - s_0 s_0^\top }{ s_0^\top s_0 } - \frac{(y_0^\top s_0 - s_0^\top s_0) s_0 s_0^\top}{(s_0^\top s_0)^2}\\ &= I + \frac{y_0 s_0^\top + s_0 y_0^\top - s_0 s_0^\top}{s_0^\top s_0} - \frac{y_0^\top s_0}{(s_0^\top s_0)^2} s_0 s_0^\top\\ &= I + \frac{y_0 s_0^\top + s_0 y_0^\top}{s_0^\top s_0} - (1 + \frac{y_0^\top s_0}{s_0^\top s_0}) \frac{s_0 s_0^\top}{s_0^\top s_0}\end{split}\]

Exercise 16, Exercise 17

These exercises are related to benchmarking different minimization algorithms. These tasks are already completed by existing solver packages such as Ceres and Eigen.

Exercise 18

Let \(f(x) = 2^{-1} x^\top H x\) be a positive definite quadratic, where the Hessian \(H \in \mathbb{R}^{n \times n}\) is symmetric positive definite, and let \(x_c, x_+ \in \mathbb{R}^n\) be distinct points.

Define \(s\) and \(y\) according to (9.2.11b) such that

\[\begin{split}\nabla f(x) &= Hx\\\\ \nabla^2 f(x) &= H\\\\ s &= x_+ - x_c\\\\ y &= \nabla f(x_+) - \nabla f(x_c) = H (x_+ - x_c) = Hs.\end{split}\]

The characteristic equation of \(\frac{s^\top y}{s^\top s} I\) is

\[\det\left( \frac{s^\top y}{s^\top s} I - \nu I \right) = 0,\]

which shows that the eigenvalue \(\nu = \frac{s^\top y}{s^\top s}\) has a multiplicity of \(n\).

Rewriting the eigenvalue gives

\[\nu = \frac{s^\top y}{s^\top s} = \frac{s^\top H s}{s^\top s} = v^\top H v\]

where \(v = s \left\Vert s \right\Vert_2^{-1}\).

Recall that an eigenvalue of \(H\) must satisfy the condition

\[\begin{split}Hz &= \lambda z\\ z^\top H z &= \lambda z^\top z.\end{split}\]

Since \(H\) is positive definite, its eigenvalues must all be positive. Restricting \(z\) to a unit vector, (a) shows that \(z^\top H z \leq \lambda_{\max}\); the analogous argument bounds it below by \(\lambda_{\min}\), so the range of eigenvalues of \(H\) includes \(\nu\).

(a)

\[\begin{split}z^\top H z &= z^\top Q \Lambda Q^\top z & \quad & \text{real symmetric matrix eigendecomposition}\\ &= w^\top \Lambda w & \quad & w = Q^\top z\\ &= \sum_{i = 1}^n \lambda_i w_i^2\\ &\leq \lambda_\max \sum_{i = 1}^n w_i^2\\ &= \lambda_\max w^\top w\\ z^\top H z &\leq \lambda_\max z^\top z\end{split}\]
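
A quick numerical illustration that \(\nu = \frac{s^\top y}{s^\top s}\) is a Rayleigh quotient of \(H\) and therefore lies between its extreme eigenvalues; a sketch with random data:

import numpy

rng = numpy.random.default_rng(5)
n = 6
M = rng.standard_normal((n, n))
H = M @ M.T + n * numpy.eye(n)               # symmetric positive definite Hessian
x_c = rng.standard_normal(n)
x_plus = rng.standard_normal(n)

s = x_plus - x_c
y = H @ s                                    # gradient difference of f(x) = x^T H x / 2
nu = (s @ y) / (s @ s)

eigenvalues = numpy.linalg.eigvalsh(H)       # ascending order
assert eigenvalues[0] <= nu <= eigenvalues[-1]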

Exercise 19

Even when a heuristic converged successfully to \(x_*\), it did not converge to the correct Hessian with a stopping criterion of \(10^{-8}\).

The Schnabel heuristic converged much more quickly than the identity heuristic. Moreover, the identity heuristic experienced overflow when \(x_0 = (-1, 3)^\top\).

[4]:
def f(x):
    x = x.flat
    return x[0]**4 + (x[0] + x[1])**2 + (numpy.exp(x[1]) - 1)**2

def fp(x):
    x = x.flat
    _ = 2 * (x[0] + x[1])
    return numpy.asmatrix([[4 * x[0] + _],
                           [_ + 2 * (numpy.exp(x[1]) - 1) * numpy.exp(x[1])]])

def fpp(x):
    x = x.flat
    return numpy.asmatrix([[6, 2],
                           [2, 4 * numpy.exp(2 * x[1]) - 2 * numpy.exp(x[1])]])
x_0 = numpy.asmatrix([[-1],
                      [3]])
[5]:
x_c = x_0
H_c = fpp(x_c)
_ = 'x_{0}: {1}\t\tH_{0}: {2}'
print(_.format(0, numpy.asarray(x_c).squeeze(),
               numpy.asarray(H_c.flatten()).squeeze()))

for i in range(16):
    x_k, H_k = pdsu(x_c, H_c)
    print(_.format(i + 1, numpy.asarray(x_k).squeeze(),
                   numpy.asarray(H_k.flatten()).squeeze()))
    x_c, H_c = x_k, H_k
x_0: [-1  3]            H_0: [   6.            2.            2.         1573.54410012]
x_1: [-0.83667163  2.5100149 ]          H_1: [  6.           2.           2.         998.89787784]
x_2: [-0.7426493   2.22794789]          H_2: [  6.           2.           2.         443.44283829]
x_3: [-0.62469991  1.87409972]          H_3: [  6.           2.           2.         233.30165316]
x_4: [-0.5181552   1.55446561]          H_4: [  6.           2.           2.         116.28755637]
x_5: [-0.41032666  1.23097999]          H_5: [ 6.          2.          2.         59.87797302]
x_6: [-0.30760028  0.92280085]          H_6: [ 6.          2.          2.         31.12339324]
x_7: [-0.21061501  0.63184504]          H_7: [ 6.          2.          2.         16.83526742]
x_8: [-0.12490953  0.37472859]          H_8: [6.         2.         2.         9.74841099]
x_9: [-0.05803004  0.17409012]          H_9: [6.         2.         2.         6.33545863]
x_10: [-0.01776458  0.05329373]         H_10: [6.         2.         2.         4.79129044]
x_11: [-0.00269008  0.00807023]         H_11: [6.         2.         2.         4.19210244]
x_12: [-0.00012799  0.00038398]         H_12: [6.         2.         2.         4.02552286]
x_13: [-9.28668114e-07  2.78600434e-06]         H_13: [6.         2.         2.         4.00116064]
x_14: [-3.20914944e-10  9.62744831e-10]         H_14: [6.         2.         2.         4.00000836]
x_15: [-8.16388618e-16  2.44916585e-15]         H_15: [6.         2.         2.         3.99999989]
x_16: [-1.33501404e-18  4.00504212e-18]         H_16: [6.         2.         2.         3.99781597]
[6]:
x_c = x_0
H_c = numpy.eye(len(x_c))
_ = 'x_{0}: {1}\t\tH_{0}: {2}'
print(_.format(0, numpy.asarray(x_c).squeeze(),
               numpy.asarray(H_c.flatten()).squeeze()))

for i in range(8):
    x_k, H_k = pdsu(x_c, H_c)
    print(_.format(i + 1, numpy.asarray(x_k).squeeze(),
                   numpy.asarray(H_k.flatten()).squeeze()))
    x_c, H_c = x_k, H_k
x_0: [-1  3]            H_0: [1. 0. 0. 1.]
x_1: [  -1.         -767.68651314]              H_1: [2.33564408 2.         2.         2.99480982]
x_2: [ 513.68143814 -598.0563262 ]              H_2: [5.96643161 2.10185113 2.10185113 1.6909696 ]
x_3: [-111.18935846  278.4438572 ]              H_3: [3.28480919 0.06429767 0.06429767        inf]
x_4: [-77.62624105 278.4438572 ]                H_4: [nan nan nan nan]
x_5: [nan nan]          H_5: [nan nan nan nan]
x_6: [nan nan]          H_6: [nan nan nan nan]
x_7: [nan nan]          H_7: [nan nan nan nan]
x_8: [nan nan]          H_8: [nan nan nan nan]
[7]:
x_c = x_0
H_c = numpy.eye(len(x_c)) * f(x_0)
_ = 'x_{0}: {1}\t\tH_{0}: {2}'
print(_.format(0, numpy.asarray(x_c).squeeze(),
               numpy.asarray(H_c.flatten()).squeeze()))

for i in range(16):
    x_k, H_k = pdsu(x_c, H_c)
    print(_.format(i + 1, numpy.asarray(x_k).squeeze(),
                   numpy.asarray(H_k.flatten()).squeeze()))
    x_c, H_c = x_k, H_k
x_0: [-1  3]            H_0: [369.25771965   0.           0.         369.25771965]
x_1: [-1.          0.91287637]          H_1: [369.26865517   2.           2.         365.78033718]
x_2: [-0.9885881   0.89297221]          H_2: [278.66012851 158.32756492 158.32756492 111.03763909]
x_3: [-0.72547548  0.45607157]          H_3: [274.64325718 163.78377168 163.78377168 111.38285765]
x_4: [-0.56781456  0.21270315]          H_4: [261.30722445 167.39520948 167.39520948 114.23128041]
x_5: [-0.39287129 -0.04257572]          H_5: [244.11671738 165.18200469 165.18200469 116.43905471]
x_6: [-0.28113629 -0.19291898]          H_6: [199.83411526 146.05736544 146.05736544 110.45644706]
x_7: [-0.21601988 -0.26781987]          H_7: [100.81326034  84.42759544  84.42759544  74.59667515]
x_8: [-0.15333452 -0.32097495]          H_8: [29.08639126 29.22559111 29.22559111 34.83780782]
x_9: [-0.05878757 -0.36162675]          H_9: [ 8.38034685  7.53615301  7.53615301 15.47582188]
x_10: [ 0.03900773 -0.32760138]         H_10: [6.38800315 0.88480616 0.88480616 5.79646148]
x_11: [ 0.08244197 -0.16519482]         H_11: [8.34915061 1.37173986 1.37173986 3.05750176]
x_12: [ 0.03662406 -0.00606959]         H_12: [8.01476872 2.58012474 2.58012474 3.71387439]
x_13: [8.73911509e-03 8.75585830e-05]           H_13: [5.97704698 1.89604859 1.89604859 3.51135524]
x_14: [ 6.13665268e-05 -3.04043648e-04]         H_14: [5.99917425 2.01829831 2.01829831 3.5938675 ]
x_15: [-1.55035930e-05  4.33011301e-05]         H_15: [5.9724562  1.99390434 1.99390434 3.99786894]
x_16: [ 3.31002606e-08 -1.73583456e-08]         H_16: [5.97279364 1.99024211 1.99024211 3.99663007]

Exercise 20

Suppose \(H_c \triangleq H\) is symmetric, \(s_c \triangleq s\), and \(y_c \triangleq y\).

(a)

\[\begin{split}H_+ s &= \left( H + \frac{(y - Hs) (y - Hs)^\top}{(y - Hs)^\top s} \right) s & \quad & \text{(9.6.1)}\\ &= Hs + \frac{ yy^\top - y s^\top H^\top - Hs y^\top + Hs s^\top H^\top }{ y^\top s - s^\top H^\top s } s\\ &= Hs + \frac{ (y^\top s - s^\top H^\top s) y - (y^\top s - s^\top H^\top s) Hs }{ y^\top s - s^\top H^\top s }\\ &= y\end{split}\]

(b)

As laid out in Exercise 3, define \(\Sigma \triangleq \begin{bmatrix} \sigma_1 & \sigma_2\\ \sigma_3 & \sigma_4 \end{bmatrix}\). \(H_+ - H\) can be decomposed into

\[\begin{split}H_+ - H &= \begin{bmatrix} y - Hs & s \end{bmatrix} \begin{bmatrix} \sigma_1 & \sigma_2\\ \sigma_3 & \sigma_4 \end{bmatrix} \begin{bmatrix} \left( y - Hs \right)^\top\\ s^\top \end{bmatrix}\\ &= \sigma_1 \left( y - Hs \right) \left( y - Hs \right)^\top + \sigma_3 s \left( y - Hs \right)^\top + \sigma_2 \left( y - Hs \right) s^\top + \sigma_4 s s^\top\end{split}\]

where \(\sigma_1 = \frac{1}{(y - Hs)^\top s}\) and \(\sigma_2 = \sigma_3 = \sigma_4 = 0\).

By inspection, \(H_+ - H = \left( H_+ - H \right)^\top\). Since \(\det(\Sigma) = 0\), \(\Sigma\) is not a full-rank matrix. Assuming \(\sigma_1 \neq 0\), \(H_+ - H\) has a rank of one.
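
The same style of numerical spot check for (9.6.1): the update satisfies the secant equation and changes \(H\) by a symmetric rank-one matrix, provided \(\left( y - Hs \right)^\top s \neq 0\); a sketch with random data:

import numpy

rng = numpy.random.default_rng(6)
n = 6
M = rng.standard_normal((n, n))
H = (M + M.T) / 2
s = rng.standard_normal(n)
y = rng.standard_normal(n)

r = y - H @ s
H_plus = H + numpy.outer(r, r) / (r @ s)     # (9.6.1), assuming r^T s != 0

assert numpy.allclose(H_plus @ s, y)                     # secant equation
assert numpy.allclose(H_plus - H, (H_plus - H).T)        # symmetric
assert numpy.linalg.matrix_rank(H_plus - H) == 1         # rank one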

Exercise 21

Let \(H_c \triangleq H\), \(s_c \triangleq s\), and \(y_c \triangleq y\). The second equality below substitutes \(\phi = \frac{s^\top H s}{s^\top \left( Hs - y \right)}\), so that \(-\phi \left( s^\top H s \right) = \frac{\left( s^\top H s \right)^2}{s^\top \left( y - Hs \right)}\).

\[\begin{split}& H_+(\phi)\\ &= H + \frac{(y - Hs) y^\top + y (y - Hs)^\top}{y^\top s} - \frac{(y - Hs)^\top s y y^\top}{(y^\top s)^2} - \phi (s^\top H s) \left[ \frac{y}{y^\top s} - \frac{Hs}{s^\top H s} \right] \left[ \frac{y}{y^\top s} - \frac{Hs}{s^\top H s} \right]^\top\\ &= H + \frac{(y - Hs) y^\top + y (y - Hs)^\top}{y^\top s} - \frac{(y - Hs)^\top s y y^\top}{(y^\top s)^2} + \frac{(s^\top H s)^2}{s^\top (y - Hs)} \left[ \frac{y}{y^\top s} - \frac{Hs}{s^\top H s} \right] \left[ \frac{y}{y^\top s} - \frac{Hs}{s^\top H s} \right]^\top\\ &= H + \frac{ y y^\top y^\top s - Hs y^\top y^\top s - y s^\top H^\top y^\top s + s^\top H^\top s y y^\top }{ (y^\top s)^2 } + \frac{(s^\top H s)^2}{s^\top (y - Hs)} \left[ \frac{y}{y^\top s} - \frac{Hs}{s^\top H s} \right] \left[ \frac{y}{y^\top s} - \frac{Hs}{s^\top H s} \right]^\top & \quad & \text{(a)}\\ &= H + \frac{ y y^\top (y^\top s)^2 - Hs y^\top (y^\top s)^2 - y s^\top H^\top (y^\top s)^2 }{ (y^\top s)^2 s^\top (y - Hs) } + \frac{Hs s^\top H^\top}{s^\top (y - Hs)} & \quad & \text{merge (b) and (c)}\\ &= H + \frac{(y - Hs) (y - Hs)^\top}{s^\top (y - Hs)}\end{split}\]

(a)

\[\begin{split}& \frac{(y - Hs) y^\top + y (y - Hs)^\top}{y^\top s} - \frac{(y - Hs)^\top s y y^\top}{(y^\top s)^2}\\ &= \frac{ y y^\top y^\top s - Hs y^\top y^\top s + y y^\top y^\top s - y s^\top H^\top y^\top s }{ (y^\top s)^2 } - \frac{y^\top s y y^\top - s^\top H^\top s y y^\top}{(y^\top s)^2}\\ &= \frac{ y y^\top y^\top s - Hs y^\top y^\top s - y s^\top H^\top y^\top s + s^\top H^\top s y y^\top }{ (y^\top s)^2 }\end{split}\]

(b)

\[\begin{split}& \frac{(s^\top H s)^2}{s^\top (y - Hs)} \left[ \frac{y}{y^\top s} - \frac{Hs}{s^\top H s} \right] \left[ \frac{y}{y^\top s} - \frac{Hs}{s^\top H s} \right]^\top\\ &= \frac{(s^\top H s)^2}{s^\top (y - Hs)} \left[ \frac{y y^\top}{(y^\top s)^2} - \frac{y s^\top H^\top}{y^\top s s^\top H s} - \frac{Hs y^\top}{s^\top H s y^\top s} + \frac{Hs s^\top H^\top}{(s^\top H s)^2} \right]\\ &= \frac{(s^\top H s)^2 y y^\top}{(y^\top s)^2 s^\top (y - Hs)} - \frac{s^\top Hs y s^\top H^\top}{y^\top s s^\top (y - Hs)} - \frac{s^\top Hs Hs y^\top}{y^\top s s^\top (y - Hs)} + \frac{Hs s^\top H^\top}{s^\top (y - Hs)}\\ &= \frac{s^\top H s y y^\top s^\top H s}{(y^\top s)^2 s^\top (y - Hs)} - \frac{y s^\top H^\top y^\top s s^\top Hs}{(y^\top s)^2 s^\top (y - Hs)} - \frac{Hs y^\top y^\top s s^\top Hs}{(y^\top s)^2 s^\top (y - Hs)} + \frac{Hs s^\top H^\top}{s^\top (y - Hs)}\end{split}\]

(c)

\[\begin{split}& \frac{ y y^\top y^\top s - Hs y^\top y^\top s - y s^\top H^\top y^\top s + s^\top H^\top s y y^\top }{ (y^\top s)^2 } \frac{s^\top (y - Hs)}{s^\top (y - Hs)}\\ &= \frac{ y y^\top y^\top s s^\top (y - Hs) - Hs y^\top y^\top s s^\top (y - Hs) - y s^\top H^\top y^\top s s^\top (y - Hs) + s^\top H^\top s y y^\top s^\top (y - Hs) }{ (y^\top s)^2 s^\top (y - Hs) }\\ &= \frac{ y y^\top (y^\top s)^2 - Hs y^\top y^\top s s^\top (y - Hs) - y s^\top H^\top y^\top s s^\top (y - Hs) - s^\top H^\top s y y^\top s^\top Hs }{ (y^\top s)^2 s^\top (y - Hs) }\end{split}\]
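
A numerical confirmation of the algebra above: substituting \(\phi = \frac{s^\top H s}{s^\top \left( Hs - y \right)}\) into the \(H_+(\phi)\) family reproduces (9.6.1); a sketch with random data:

import numpy

rng = numpy.random.default_rng(7)
n = 5
M = rng.standard_normal((n, n))
H = (M + M.T) / 2
s = rng.standard_normal(n)
y = rng.standard_normal(n)

r = y - H @ s
u = y / (y @ s) - (H @ s) / (s @ H @ s)
phi = (s @ H @ s) / (s @ (H @ s - y))        # value implied by the derivation

H_phi = H + (numpy.outer(r, y) + numpy.outer(y, r)) / (y @ s) \
          - (r @ s) * numpy.outer(y, y) / (y @ s) ** 2 \
          - phi * (s @ H @ s) * numpy.outer(u, u)
sr1 = H + numpy.outer(r, r) / (s @ r)        # (9.6.1)

assert numpy.allclose(H_phi, sr1)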

Exercise 22

Substitute in relations such as (9.6.1).

References

BDMore73

Charles George Broyden, John E. Dennis, and Jorge J. Moré. On the local and superlinear convergence of quasi-Newton methods. IMA Journal of Applied Mathematics, 12(3):223–245, 1973.

Gei

Robert A. van de Geijn. Notes on Cholesky factorization. http://www.cs.utexas.edu/users/flame/Notes/NotesOnChol.pdf. Accessed on 2017-05-23.

Gra

Markus Grasmair. On the existence of a Cholesky factorization. https://wiki.math.ntnu.no/_media/ma2501/2014v/cholesky.pdf. Accessed on 2017-05-23.

Rhe

Werner C. Rheinboldt. Quasi-Newton methods. https://www-m2.ma.tum.de/foswiki/pub/M2/Allgemeines/SemWs09/quasi-newt.pdf. Accessed on 2017-05-21.

Sch82

R. B. Schnabel. Convergence of quasi-Newton updates to correct derivative values. In preparation, 1982.