Secant Methods for Unconstrained Minimization
Implementation of Quasi-Newton Algorithms Using the Positive Definite Secant Update
Details to Mind
Picking an Initial Hessian
\[\DeclareMathOperator{\typ}{typ}
H_0 = \max \{ \lvert f(x_0) \rvert, \typ f \} \cdot D_x^2.\]
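A minimal NumPy sketch of this heuristic, assuming the scaling matrix \(D_x\) is diagonal with entries \(1 / \operatorname{typ} x_i\); the names initial_hessian, typ_f, and typ_x are illustrative placeholders:
import numpy as np

def initial_hessian(f_x0, typ_f, typ_x):
    """Diagonal initial Hessian H_0 = max{|f(x_0)|, typ f} * D_x^2."""
    D_x = np.diag(1.0 / np.asarray(typ_x, dtype=float))  # scaling matrix D_x
    return max(abs(f_x0), typ_f) * D_x @ D_x

# Example: |f(x_0)| = 6 with unit typical sizes gives H_0 = 6 I.
H0 = initial_hessian(-6.0, 1.0, [1.0, 1.0])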
Exercise 1
The Hessian is symmetric by definition: for twice continuously differentiable
\(f\), the mixed second partial derivatives commute, so
\(\nabla^2 f(x) = \nabla^2 f(x)^\top\).
See Exercise 4.12 for more
reasons why a secant approximation to a Hessian should be symmetric.
Exercise 2
Suppose \(A, B \in \mathbb{R}^{n \times n}\) with \(B\) symmetric.
\[\begin{split}\DeclareMathOperator{\tr}{tr}
\left\Vert B - A \right\Vert_F^2
&= \tr\left(
\left( B - A \right)^\top \left( B - A \right)
\right)\\
&= \tr\left(B^\top B\right) - \tr\left(B^\top A\right) -
\tr\left(A^\top B\right) +
\tr\left(A^\top A\right)
& \quad & \text{trace operator is a linear mapping}\\
&= \tr\left( B^2 \right) -
2 \tr\left( AB \right) +
\tr\left(A^\top A\right)
& \quad & \text{trace cyclic/transpose property}\end{split}\]
By inspection, this norm is minimized when \(B = A\), which is only valid
when \(A\) is also symmetric.
From the results in (a) and (b), the norm can be rewritten as
\[\begin{split}\left\Vert B - A \right\Vert_F^2
&= \tr\left( B^2 \right) -
2 \tr\left( AB \right) +
\tr\left( A^\top A \right)\\
&= \tr\left( B^2 \right) -
\tr\left( \left( A + A^\top \right) B \right) +
\tr\left( A^\top A \right).\end{split}\]
Using matrix calculus to find the roots of the derivative gives
\[\begin{split}\frac{\partial \left\Vert B - A \right\Vert_F^2}{\partial B}
&= \tr\left( \frac{\partial B^2}{\partial B} \right) -
\tr\left(
\frac{\partial \left( A + A^\top \right) B}{\partial B}
\right) +
\tr\left( \frac{\partial A^\top A}{\partial B} \right)
& \quad & \text{linear trace operator commutes with derivative}\\
0 &= 2B - A^\top - A\\
\frac{A + A^\top}{2} &= B.\end{split}\]
This root is a minimum according to the second derivative test:
\[\begin{split}\frac{\partial^2 \left\Vert B - A \right\Vert_F^2}{\partial B^2}
&= \frac{\partial}{\partial B} \left( 2B - A^\top - A \right)\\
&= 2I\\
&\succ 0.\end{split}\]
(a)
A useful identity to recall is that for any square matrix \(A\),
\[x^\top A x = x^\top \frac{A + A^\top}{2} x.\]
(b)
\[\begin{split}\tr\left( A B \right)
&= \tr\left( A \sum_{i = 1}^n \lambda_i q_i q_i^\top \right)
& \quad & \text{eigendecomposition}\\
&= \sum_{i = 1}^n \lambda_i \tr\left( A q_i q_i^\top \right)
& \quad & \text{trace is a linear mapping}\\
&= \sum_{i = 1}^n \lambda_i \tr\left( q_i^\top A q_i \right)
& \quad & \text{trace cyclic property}\\
&= \frac{1}{2} \sum_{i = 1}^n \lambda_i
\tr\left( q_i^\top \left( A + A^\top \right) q_i \right)
& \quad & \text{(a)}\\
&= \frac{1}{2} \tr\left( \left( A + A^\top \right) B \right)\end{split}\]
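The minimizer can be sanity-checked numerically; a small sketch (random data, names illustrative) compares \(\frac{A + A^\top}{2}\) against perturbed symmetric candidates:
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))    # arbitrary nonsymmetric matrix
B_star = (A + A.T) / 2             # claimed minimizer over symmetric B

# Any symmetric perturbation of B_star must be at least as far from A because
# B_star - A is antisymmetric, hence orthogonal to symmetric matrices under
# the Frobenius inner product.
for _ in range(1000):
    S = rng.standard_normal((4, 4))
    B = B_star + (S + S.T) / 2     # random symmetric candidate
    assert np.linalg.norm(B - A) >= np.linalg.norm(B_star - A) - 1e-12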
Exercise 3
Suppose \(H_c \triangleq H\) is symmetric, \(s_c \triangleq s\)
and \(y_c \triangleq y\).
(a)
\[\begin{split}H_+ s
&= \left(
H +
\frac{
\left( y - Hs \right) s^\top + s \left( y - Hs \right)^\top
}{s^\top s} -
\frac{\left( y - Hs \right)^\top s}{\left( s^\top s \right)^2} s s^\top
\right) s
& \quad & \text{(9.1.3)}\\
&= Hs +
\frac{ys^\top - Hss^\top + sy^\top - ss^\top H^\top}{s^\top s} s -
\frac{y^\top s - s^\top H^\top s}{\left( s^\top s \right)^2} s s^\top s\\
&= Hs + y - Hs +
\frac{sy^\top s - ss^\top Hs - y^\top ss + s^\top H ss}{s^\top s}\\
&= y\end{split}\]
(b)
Define \(\Sigma \triangleq \begin{bmatrix} \sigma_1 & \sigma_2\\
\sigma_3 & \sigma_4 \end{bmatrix}\). According to [Rhe],
\[\begin{split}H_+ - H
&= \begin{bmatrix} y - Hs & s \end{bmatrix}
\begin{bmatrix} \sigma_1 & \sigma_2\\ \sigma_3 & \sigma_4 \end{bmatrix}
\begin{bmatrix} \left( y - Hs \right)^\top\\ s^\top \end{bmatrix}\\
&= \begin{bmatrix}
\sigma_1 \left( y - Hs \right) + \sigma_3 s &
\sigma_2 \left( y - Hs \right) + \sigma_4 s
\end{bmatrix}
\begin{bmatrix} \left( y - Hs \right)^\top\\ s^\top \end{bmatrix}\\
&= \sigma_1 \left( y - Hs \right) \left( y - Hs \right)^\top +
\sigma_3 s \left( y - Hs \right)^\top +
\sigma_2 \left( y - Hs \right) s^\top + \sigma_4 s s^\top\end{split}\]
where \(\sigma_1 = 0\),
\(\sigma_2 = \sigma_3 = \left( s^\top s \right)^{-1}\), and
\(\sigma_4 =
-\frac{\left( y - Hs \right)^\top s}{\left( s^\top s \right)^2}\).
By inspection, \(H_+ - H = \left( H_+ - H \right)^\top\).
When \(\det(\Sigma) = -\left( s^\top s \right)^{-2} \neq 0\),
\(\Sigma\) is a full-rank matrix. Applying (b.1) and (b.2) shows that
\(H_+ - H\) has a maximum rank of \(2\).
(b.1)
Recall the following rank property for a matrix
\(A \in \mathbb{R}^{m \times n}\):
\[\DeclareMathOperator{\rank}{rank}
\rank\left( A^\top A \right) =
\rank\left( A A^\top \right) =
\rank(A) =
\rank\left( A^\top \right).\]
Define \(A \triangleq \begin{bmatrix} y - Hs & s \end{bmatrix}\).
\[\begin{split}A^\top A
&= \begin{bmatrix} \left( y - Hs \right)^\top\\ s^\top \end{bmatrix}
\begin{bmatrix} y - Hs & s \end{bmatrix}\\
&= \begin{bmatrix}
\left( y - Hs \right)^\top \left( y - Hs \right) &
\left( y - Hs \right)^\top s\\
s^\top \left( y - Hs \right) & s^\top s
\end{bmatrix}\end{split}\]
When \(\det\left( A^\top A \right) \neq 0\), \(A\) is full-rank.
(b.2)
Recall the following rank property for a matrix
\(B \in \mathbb{R}^{n \times k}\):
\[\rank(AB) \leq \min \left\{ \rank(A), \rank(B) \right\}.\]
\[\begin{split}\rank\left( A \Sigma A^\top \right)
&\leq \min \left\{ \rank(A), \rank\left( \Sigma A^\top \right) \right\}\\
&\leq \min \left\{
\rank(A),
\min \left\{ \rank(\Sigma), \rank\left( A^\top \right) \right\}
\right\}\\
&= \min \left\{
\rank(A), \rank(\Sigma), \rank\left( A^\top \right)
\right\}\end{split}\]
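Both parts are easy to confirm numerically; a minimal sketch of (9.1.3) with random data:
import numpy as np

rng = np.random.default_rng(1)
n = 5
H = rng.standard_normal((n, n)); H = (H + H.T) / 2   # symmetric H_c
s, y = rng.standard_normal(n), rng.standard_normal(n)

r = y - H @ s
H_plus = H + (np.outer(r, s) + np.outer(s, r)) / (s @ s) \
           - (r @ s) / (s @ s) ** 2 * np.outer(s, s)  # (9.1.3)

assert np.allclose(H_plus @ s, y)              # (a): secant equation
assert np.allclose(H_plus, H_plus.T)           # (b): symmetric
assert np.linalg.matrix_rank(H_plus - H) <= 2  # (b): rank at most two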
Exercise 4
Define a linearly independent set
\(S = \left\{ v_1 = s_c, v_2, \ldots, v_n \right\}\) where
\(v_i \in \mathbb{R}^n\) is randomly generated for \(2 \leq i \leq n\).
Perform Gram-Schmidt orthonormalization to generate an orthonormal set
\(Q = \left\{ q_1, q_2, \ldots, q_n \right\}\) where
\(q_i = t_i \left\Vert t_i \right\Vert^{-1}\) for \(1 \leq i \leq n\)
with
\[t_1 = v_1
\quad \text{and} \quad
t_i = v_i - \sum_{j = 1}^{i - 1} \text{proj}_{t_j}(v_i).\]
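A sketch of this construction using classical Gram-Schmidt on a random completion of \(s_c\) (the function name is illustrative):
import numpy as np

def orthonormal_basis_from(s_c, rng=np.random.default_rng(2)):
    """Orthonormal q_1, ..., q_n with q_1 parallel to s_c."""
    n = len(s_c)
    V = np.column_stack([s_c, rng.standard_normal((n, n - 1))])
    Q = np.zeros((n, n))
    for i in range(n):
        t = V[:, i] - Q[:, :i] @ (Q[:, :i].T @ V[:, i])  # subtract projections
        Q[:, i] = t / np.linalg.norm(t)
    return Q

Q = orthonormal_basis_from(np.array([3.0, 4.0, 0.0]))
assert np.allclose(Q.T @ Q, np.eye(3))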
Suppose \(H \in \mathbb{R}^{n \times n}\) satisfies \(H s_c = y_c\) with
\(H - H_c\) symmetric, and let \(H_+\) be given by (9.1.3).
Applying the results of Exercise 3.11 gives
\[\begin{split}\left\Vert H_+ - H_c \right\Vert_F^2
&= \sum_{i = 1}^n
\left\Vert \left( H_+ - H_c \right) q_i \right\Vert_2^2\\
&\leq \sum_{i = 1}^n \left\Vert \left( H - H_c \right) q_i \right\Vert_2^2
& \quad & \text{(a) and (b)}\\
\left\Vert H_+ - H_c \right\Vert_F
&\leq \left\Vert H - H_c \right\Vert_F.\end{split}\]
(a)
\[\begin{split}\left( H_+ - H_c \right) s_c
&= \left(
\frac{
\left( y_c - H_c s_c \right) s_c^\top +
s_c \left( y_c - H_c s_c \right)^\top
}{
s_c^\top s_c
} -
\frac{
\left( y_c - H_c s_c \right)^\top s_c
}{
\left( s_c^\top s_c \right)^2
} s_c s_c^\top
\right) s_c
& \quad & \text{(9.1.3)}\\
&= \left(
\frac{
\left( H - H_c \right) s_c s_c^\top +
s_c s_c^\top \left( H - H_c \right)
}{
s_c^\top s_c
} -
\frac{
s_c^\top \left( H - H_c \right) s_c
}{
\left( s_c^\top s_c \right)^2
} s_c s_c^\top
\right) s_c
& \quad & \text{see initial assumptions}\\
&= \left( H - H_c \right) s_c +
s_c \frac{s_c^\top \left( H - H_c \right) s_c}{s_c^\top s_c} -
\frac{s_c^\top \left( H - H_c \right) s_c}{s_c^\top s_c} s_c\\
&= \left( H - H_c \right) s_c\end{split}\]
(b)
Let \(t \in \mathbb{R}^n\) satisfy \(t^\top s_c = 0\).
\[\begin{split}\left( H_+ - H_c \right) t
&= \left(
\frac{
\left( y_c - H_c s_c \right) s_c^\top +
s_c \left( y_c - H_c s_c \right)^\top
}{s_c^\top s_c} -
\frac{
\left( y_c - H_c s_c \right)^\top s_c
}{
\left( s_c^\top s_c \right)^2
} s_c s_c^\top
\right) t
& \quad & \text{(9.1.3)}\\
&= \frac{s_c \left( y_c - H_c s_c \right)^\top t}{s_c^\top s_c}
& \quad & s_c^\top t = 0\\
&= \frac{s_c s_c^\top \left( H - H_c \right) t}{s_c^\top s_c}
& \quad & \text{see initial assumptions}\\
\left\Vert \left( H_+ - H_c \right) t \right\Vert_2
&= \left\Vert
\frac{s_c s_c^\top}{s_c^\top s_c} \left( H - H_c \right) t
\right\Vert_2\\
&\leq \left\Vert \frac{s_c s_c^\top}{s_c^\top s_c} \right\Vert_F
\left\Vert \left( H - H_c \right) t \right\Vert_2
& \quad & \text{(3.1.16)}\\
&\leq \left\Vert \left( H - H_c \right) t \right\Vert_2
& \quad & \text{(3.1.17)}\end{split}\]
Exercise 5
Let \(H_+\) be given by (9.1.3), \(H_* = \nabla^2 f(x_*)\),
\(P_c = I - s_c s_c^\top / s_c^\top s_c\), \(e_k \triangleq x_k - x_*\),
and the assumptions of Theorem 9.1.2 hold, i.e.,
\(\left\Vert x_0 - x_* \right\Vert_2 < \epsilon\) and
\(\left\Vert H_0 - H_* \right\Vert_F < \delta\).
The results of
Exercise 3.8,
Exercise 8.7, and
Exercise 4.7 will be useful in the
derivation that follows.
\[\begin{split}H_+ - H_*
&= H_c - H_* +
\frac{
\left( y_c - H_c s_c \right) s_c^\top +
s_c \left( y_c - H_c s_c \right)^\top
}{
s_c^\top s_c
} -
\frac{
\left( y_c - H_c s_c \right)^\top s_c
}{
\left( s_c^\top s_c \right)^2
} s_c s_c^\top\\
\left\Vert H_+ - H_* \right\Vert_F
&= \left\Vert
P_c (H_c - H_*) P_c +
\frac{(y_c - H_* s_c) s_c^\top}{s_c^\top s_c} +
\frac{s_c (y_c - H_* s_c)^\top P_c}{s_c^\top s_c}
\right\Vert_F
& \quad & \text{(9.7.1)}\\
&\leq \left\Vert P_c (H_c - H_*) P_c \right\Vert_F +
\left\Vert
\frac{(y_c - H_* s_c) s_c^\top}{s_c^\top s_c}
\right\Vert_F +
\left\Vert
\frac{s_c (y_c - H_* s_c)^\top P_c}{s_c^\top s_c}
\right\Vert_F
& \quad & \text{Exercise 3.8}\\
&\leq \left\Vert H_c - H_* \right\Vert_F
\left\Vert P_c \right\Vert_2^2 +
\left\Vert y_c - H_* s_c \right\Vert_2
\left\Vert \frac{s_c}{s_c^\top s_c} \right\Vert_2 +
\left\Vert \frac{s_c}{s_c^\top s_c} \right\Vert_2
\left\Vert y_c - H_* s_c \right\Vert_2
\left\Vert P_c \right\Vert_2
& \quad & \text{(3.1.15) and (3.1.17)}\\
&\leq \left\Vert H_c - H_* \right\Vert_F +
2 \left\Vert y_c - H_* s_c \right\Vert_2
\left\Vert s_c \right\Vert_2^{-1}
& \quad & \text{Exercise 8.7}\\
&\leq \left\Vert H_c - H_* \right\Vert_F +
\frac{\gamma}{2} \left(
\left\Vert x_+ - x_* \right\Vert_2 +
\left\Vert x_c - x_* \right\Vert_2
\right)
& \quad & \text{Lemma 4.1.12 and Exercise 4.7}.\end{split}\]
Notice that to make the derivations that follow easier, \(2 \gamma\) has
been renamed to \(\gamma\) while the \(\frac{1}{2}\) factor is kept explicit.
Recursively expanding (9.7.2) yields
\[\begin{split}\left\Vert H_k - H_* \right\Vert_F
&\leq \left\Vert H_{k - 1} - H_* \right\Vert_F +
\frac{\gamma}{2} \left(
\left\Vert e_k \right\Vert_2 + \left\Vert e_{k - 1} \right\Vert_2
\right)\\
&\leq \left\Vert H_0 - H_* \right\Vert_F +
\frac{\gamma}{2} \left\Vert e_k \right\Vert_2 +
\frac{\gamma}{2} \left\Vert e_0 \right\Vert_2 +
\gamma \sum_{j = 1}^{k - 1} \left\Vert e_j \right\Vert_2\\
&\leq \delta + \gamma \sum_{j = 0}^k \left\Vert e_j \right\Vert_2.\end{split}\]
If the summation is uniformly bounded above for all \(k\), which holds when
\(\left\Vert e_k \right\Vert_2 \rightarrow 0\) at least \(q\)-linearly, then
the sequence \(\left\{ \left\Vert H_k - H_* \right\Vert_F \right\}\) is
bounded above.
According to Theorem 3.1.4, the symmetric secant method is well defined if
\[\begin{split}\left\Vert H_*^{-1} \left( H_k - H_* \right) \right\Vert_F
&\leq \left\Vert H_*^{-1} \right\Vert_F \left\Vert H_k - H_* \right\Vert_F
& \quad & \text{(3.1.10)}\\
&\leq \beta \left\Vert H_k - H_* \right\Vert_F
& \quad & H_* \text{ is nonsingular}\end{split}\]
is strictly less than \(1\). If this bound holds, then (3.1.20) states that
\[\left\Vert H_k^{-1} \right\Vert_F \leq
\frac{
\left\Vert H_*^{-1} \right\Vert_F
}{
1 - \left\Vert H_*^{-1} \left( H_k - H_* \right) \right\Vert_F
}\]
where
\[\left\Vert H_0^{-1} \right\Vert_F \leq \frac{\beta}{1 - \beta \delta}.\]
Suppose \(x_{k + 1}\) is well defined. Isolating \(e_{k + 1}\)
according to the derivation of (8.2.1) gives
\[\begin{split}x_{k + 1}
&= x_k - H_k^{-1} \nabla f(x_k)
& \quad & \text{(9.1.4a)}\\
x_{k + 1} - x_*
&= x_k - x_* - H_k^{-1} \left( \nabla f(x_k) - \nabla f(x_*) \right)
& \quad & \nabla f(x_*) = 0\\
H_k e_{k + 1} &= H_k e_k - \nabla f(x_k) + \nabla f(x_*)\\
&= \left( H_k - H_* \right) e_k +
\left( -\nabla f(x_k) + \nabla f(x_*) + H_* e_k \right)\\
\left\Vert e_{k + 1} \right\Vert_2
&\leq \left\Vert H_k^{-1} \right\Vert_F
\left[
\left\Vert -\nabla f(x_k) + \nabla f(x_*) + H_* e_k \right\Vert_2 +
\left\Vert H_k - H_* \right\Vert_F \left\Vert e_k \right\Vert_2
\right]
& \quad & \text{(3.1.16) and Exercise 3.4}\\
&\leq \left\Vert H_k^{-1} \right\Vert_F
\left[
\frac{\gamma}{2} \left\Vert e_k \right\Vert_2 +
\left\Vert H_k - H_* \right\Vert_F
\right] \left\Vert e_k \right\Vert_2
& \quad & \text{Lemma 4.1.12 and (3.1.1b)}.\end{split}\]
See Exercise 3.4 for more details.
One (arbitrary yet simple) way to satisfy the definition of \(q\)-linear
convergence (2.3.1) is to establish by induction that
\[\begin{split}\left\Vert H_k - H_* \right\Vert_F
&\leq (2 - 2^{-k}) \delta
& \quad & \text{(8.2.7)}\\
\left\Vert e_{k + 1} \right\Vert_2
&\leq \frac{1}{2} \left\Vert e_k \right\Vert_2
& \quad & \text{(8.2.8).}\end{split}\]
From the initial assumptions, the base case \(k = 0\) for (8.2.7) is
satisfied:
\[\left\Vert H_0 - H_* \right\Vert_F \leq
\delta = (2 - 2^{-0}) \delta \leq 2 \delta.\]
The base case \(k = 0\) for (8.2.8) is also satisfied if \(\epsilon\)
and \(\delta\) are chosen such that \(6 \beta \delta \leq 1\)
and \(3 \gamma \epsilon \leq 2 \delta\) where
\[\begin{split}\left\Vert e_{0 + 1} \right\Vert_2
&\leq \left\Vert H_0^{-1} \right\Vert_F
\left[
\frac{\gamma}{2} \left\Vert e_0 \right\Vert_2 +
\left\Vert H_0 - H_* \right\Vert_F
\right] \left\Vert e_0 \right\Vert_2\\
&\leq \frac{\beta}{1 - \beta \delta}
\left[ \frac{\gamma}{2} \epsilon + \delta \right] \epsilon\\
&\leq \frac{6 \beta}{5}
\left[ \frac{\gamma}{2} \epsilon + \delta \right] \epsilon\\
&\leq \left[
\frac{3 \beta \gamma \epsilon}{5} + \frac{6 \beta \delta}{5}
\right] \epsilon\\
&\leq \left[ \frac{2 \beta \delta}{5} + \frac{1}{5} \right] \epsilon\\
&\leq \left[ \frac{1}{15} + \frac{1}{5} \right] \epsilon\\
&\leq \frac{1}{2} \epsilon.\end{split}\]
For the inductive step, assume (8.2.7) and (8.2.8) hold for
\(k = 0, \ldots, i - 1\). When \(k = i\),
\[\begin{split}\left\Vert H_i - H_* \right\Vert_F
&\leq \left\Vert H_{i - 1} - H_* \right\Vert_F +
\frac{\gamma}{2} \left(
\left\Vert e_i \right\Vert_2 + \left\Vert e_{i - 1} \right\Vert_2
\right)\\
&\leq (2 - 2^{-(i - 1)}) \delta +
\frac{3 \gamma}{4} \left\Vert e_{i - 1} \right\Vert_2\\
&\leq (2 - 2^{-(i - 1)}) \delta + \frac{3 \gamma}{4} 2^{-(i - 1)} \epsilon\\
&\leq (2 - 2^{-(i - 1)}) \delta + 2^{-i} \delta\\
&\leq (2 - 2^{-i}) \delta.\end{split}\]
To complete the proof, it is necessary to show that the symmetric secant is
well-defined:
\[\begin{split}\left\Vert H_*^{-1} \left( H_i - H_* \right) \right\Vert_F
&\leq \beta \left\Vert H_i - H_* \right\Vert_F
& \quad & \text{see the foregoing derivation}\\
&\leq \beta (2 - 2^{-i}) \delta\\
&\leq 2 \beta \delta\\
&\leq \frac{1}{3}.\end{split}\]
Consequently, \(\left\Vert H_i^{-1} \right\Vert_F \leq \frac{3 \beta}{2}\).
Thus,
\[\begin{split}\left\Vert e_{i + 1} \right\Vert_2
&\leq \left\Vert H_i^{-1} \right\Vert_F
\left[
\frac{\gamma}{2} \left\Vert e_i \right\Vert_2 +
\left\Vert H_i - H_* \right\Vert_F
\right] \left\Vert e_i \right\Vert_2\\
&\leq \frac{3 \beta}{2} \left[
\frac{\gamma}{2} 2^{-i} \epsilon +
(2 - 2^{-i}) \delta
\right] \left\Vert e_i \right\Vert_2\\
&\leq \frac{3 \beta}{2} \left[
\frac{1}{3} 2^{-i} + (2 - 2^{-i})
\right] \delta \left\Vert e_i \right\Vert_2\\
&\leq 3 \beta \delta \left\Vert e_i \right\Vert_2\\
&\leq \frac{1}{2} \left\Vert e_i \right\Vert_2.\end{split}\]
Exercise 6
The proof of \(q\)-superlinear convergence is a combination of
Exercise 8.5 and page 183.
Theorem 8.2.4 has shown that a sufficient condition for \(\{ x_k \}\) to
converge \(q\)-superlinearly to \(x_*\) is
\(\lim_{k \rightarrow \infty}
\frac{\left\Vert E_k s_k \right\Vert_2}{\left\Vert s_k \right\Vert_2} = 0\).
Applying the results of Exercise 8.7 and
Exercise 9.5 gives
\[\begin{split}\left\Vert H_+ - H_* \right\Vert_F
&\leq \left\Vert P_c E_c P_c \right\Vert_F +
\frac{\gamma}{2} \left(
\left\Vert x_+ - x_* \right\Vert_2 +
\left\Vert x_c - x_* \right\Vert_2
\right)\\
&\leq \left\Vert P_c \right\Vert_2 \left\Vert E_c P_c \right\Vert_F +
\frac{\gamma}{2} \left(
\left\Vert x_+ - x_* \right\Vert_2 +
\left\Vert x_c - x_* \right\Vert_2
\right)
& \quad & \text{(3.1.15)}\\
&\leq \left\Vert E_c \right\Vert_F -
\left( 2 \left\Vert E_c \right\Vert_F \right)^{-1}
\left(
\frac{
\left\Vert E_c s_c \right\Vert_2
}{
\left\Vert s_c \right\Vert_2
}
\right)^2 +
\frac{\gamma}{2} \left(
\left\Vert x_+ - x_* \right\Vert_2 +
\left\Vert x_c - x_* \right\Vert_2
\right)
& \quad & \text{Exercise 8.7 and Lemma 8.2.5}\\
\left\Vert E_{k + 1} \right\Vert_F
&\leq \left\Vert E_k \right\Vert_F -
\left( 2 \left\Vert E_k \right\Vert_F \right)^{-1}
\left(
\frac{
\left\Vert E_k s_k \right\Vert_2
}{
\left\Vert s_k \right\Vert_2
}
\right)^2 +
\frac{\gamma}{2} \left(
\left\Vert e_{k + 1} \right\Vert_2 + \left\Vert e_k \right\Vert_2
\right)
& \quad & \text{replace subscripts}\\
\frac{\left\Vert E_k s_k \right\Vert_2^2}{\left\Vert s_k \right\Vert_2^2}
&\leq 2 \left\Vert E_k \right\Vert_F \left(
\left\Vert E_k \right\Vert_F -
\left\Vert E_{k + 1} \right\Vert_F +
\frac{3 \gamma}{4} \left\Vert e_k \right\Vert_2
\right)
& \quad & \text{(8.2.8)}\\
&\leq 4 \delta \left(
\left\Vert E_k \right\Vert_F -
\left\Vert E_{k + 1} \right\Vert_F +
\frac{3 \gamma}{4} \left\Vert e_k \right\Vert_2
\right)
& \quad & \text{(8.2.7): } \left\Vert E_k \right\Vert_F \leq 2 \delta\end{split}\]
Summing over \(k\) telescopes the first two terms and, by (8.2.8), bounds the
error terms with a geometric series:
\[\sum_{k = 0}^\infty
\frac{\left\Vert E_k s_k \right\Vert_2^2}{\left\Vert s_k \right\Vert_2^2}
\leq 4 \delta \left(
\left\Vert E_0 \right\Vert_F +
\frac{3 \gamma}{4} \sum_{k = 0}^\infty 2^{-k} \epsilon
\right)
\leq 4 \delta \left( 2 \delta + \frac{3 \gamma \epsilon}{2} \right) <
\infty.\]
Therefore
\(\lim_{k \rightarrow \infty}
\frac{\left\Vert E_k s_k \right\Vert_2}{\left\Vert s_k \right\Vert_2} = 0\),
which satisfies the sufficient condition of Theorem 8.2.4 for
\(q\)-superlinear convergence.
Exercise 7
The following proofs combine [Gei] and [Gra].
Recall that a matrix \(H \in \mathbb{R}^{n \times n}\) is symmetric positive
definite if and only if it is symmetric and \(x^\top H x > 0\) for all
nonzero vectors \(x \in \mathbb{R}^n\).
(1)
Let \(H = J J^\top\) for some nonsingular
\(J \in \mathbb{R}^{n \times n}\) (i.e. columns are linearly independent).
Applying the results of (a) shows that \(H\) is symmetric positive definite.
(2)
Let \(H \in \mathbb{R}^{n \times n}\) be symmetric positive definite.
\(H = J J^\top\) for some nonsingular \(J \in \mathbb{R}^{n \times n}\)
can be satisfied if there exists a Cholesky factorization such that
\(H = L L^\top\). Invoking (e) shows that \(H\) has a Cholesky
decomposition because it is symmetric positive definite.
(a)
Let \(B \in \mathbb{R}^{m \times n}\) have linearly independent columns
(i.e. full column rank) and \(A = B^\top B\). \(A\) is symmetric
because
\[A^\top = \left( B^\top B \right)^\top = B^\top B = A.\]
When \(x \neq \mathbf{0}\), the nullspace facts from
Exercise 6.17 show that \(A\) is also
positive definite because
\[v = Bx \neq \mathbf{0}
\quad \text{and} \quad
v^\top v = x^\top B^\top B x = x^\top A x > 0.\]
(b)
Let \(A \in \mathbb{R}^{n \times n}\) be symmetric positive definite.
Define \(e_i\) as a unit \(n\)-vector whose entries are zero except for
a one in the \(i^\text{th}\) entry. Applying the definition of positive
definite shows that the diagonal elements of \(A\) are positive because
\[e_i^\top A e_i = \alpha_{ii} > 0 \text{ for } i = 1, 2, \ldots, n.\]
(c)
Let \(I_p\) denote a \(p \times p\) identity matrix,
\(I_q\) denote a \(q \times q\) identity matrix, and
\(M \in \mathbb{R}^{(p + q) \times (p + q)}\) such that
\(M = \begin{bmatrix} A & B\\ C & D \end{bmatrix}\) where
\(A \in \mathbb{R}^{p \times p}\) is invertible,
\(B \in \mathbb{R}^{p \times q}\), \(C \in \mathbb{R}^{q \times p}\),
and \(D \in \mathbb{R}^{q \times q}\).
\[\begin{split}\begin{bmatrix} A & B\\ C & D \end{bmatrix}
\begin{bmatrix} I_p & -A^{-1}B\\ 0 & I_q \end{bmatrix}
&= \begin{bmatrix} A & 0\\ C & D - C A^{-1} B \end{bmatrix}\\
\begin{bmatrix} A & B\\ C & D \end{bmatrix}
&= \begin{bmatrix} A & 0\\ C & D - C A^{-1} B \end{bmatrix}
\begin{bmatrix} I_p & A^{-1}B\\ 0 & I_q \end{bmatrix}\\
&= \begin{bmatrix} I_p & 0\\ C A^{-1} & I_q \end{bmatrix}
\begin{bmatrix} A & 0\\ 0 & D - C A^{-1} B \end{bmatrix}
\begin{bmatrix} I_p & A^{-1}B\\ 0 & I_q \end{bmatrix}\end{split}\]
(d)
Assume \(A \in \mathbb{R}^{(n + 1) \times (n + 1)}\) is symmetric positive
definite. Partition
\[\begin{split}A = \begin{bmatrix} \alpha_{11} & a_{21}^\top\\ a_{21} & A_{22} \end{bmatrix}\end{split}\]
where \(\alpha_{11} \in \mathbb{R}^{1 \times 1}\),
\(a_{21} \in \mathbb{R}^{n \times 1}\), and
\(A_{22} \in \mathbb{R}^{n \times n}\). Applying (b) and (c) gives
\[\begin{split}A =
\begin{bmatrix} \alpha_{11} & \mathbf{0}^\top\\ a_{21} & S \end{bmatrix}
\begin{bmatrix}
I_1 & \alpha_{11}^{-1} a_{21}^\top\\
\mathbf{0} & I_n
\end{bmatrix}\end{split}\]
where \(S = A_{22} - a_{21} \alpha_{11}^{-1} a_{21}^\top\) is the Schur
complement of \(\alpha_{11}\) in \(A\).
\(S\) is symmetric by inspection. For any non-zero vector
\(x = \begin{bmatrix} \chi_0\\ x_n \end{bmatrix} \in \mathbb{R}^{n + 1}\),
\[\begin{split}0 < x^\top A x
&= \begin{bmatrix} \chi_0 & x_n^\top \end{bmatrix}
\begin{bmatrix}
\alpha_{11} & \mathbf{0}^\top\\ a_{21} & S
\end{bmatrix}
\begin{bmatrix}
I_1 & \alpha_{11}^{-1} a_{21}^\top\\
\mathbf{0} & I_n
\end{bmatrix}
\begin{bmatrix} \chi_0\\ x_n \end{bmatrix}\\
&= \begin{bmatrix}
\chi_0 \alpha_{11} + x_n^\top a_{21} & x_n^\top S
\end{bmatrix}
\begin{bmatrix}
\chi_0 + \alpha_{11}^{-1} a_{21}^\top x_n\\
x_n
\end{bmatrix}\\
&= \chi_0^2 \alpha_{11} + 2 \chi_0 a_{21}^\top x_n +
\alpha_{11}^{-1} \left( a_{21}^\top x_n \right)^2 + x_n^\top S x_n.\end{split}\]
With respect to proving \(S\) is positive definite, picking
\(\chi_0 = -\alpha_{11}^{-1} a_{21}^\top x_n\) loses no generality because
\(x_n\) is an arbitrary nonzero vector. This choice cancels the first three
terms, leaving \(0 < x_n^\top S x_n\); hence \(S\) is positive definite.
(e)
Let \(A \in \mathbb{R}^{(n + 1) \times (n + 1)}\) be symmetric positive
definite. Since \(A\) is symmetric, let \(\star\) denote parts of
\(A\) that are neither stored nor updated.
\[\begin{split}\DeclareMathOperator{\chol}{chol}
A = \begin{bmatrix} \alpha_{11} & \star\\ a_{21} & A_{22} \end{bmatrix}
&= L L^\top
& \quad & \chol(A) = L\\
&= \begin{bmatrix}
\lambda_{11} & \boldsymbol{0}^\top\\
l_{21} & L_{22}
\end{bmatrix}
\begin{bmatrix}
\lambda_{11} & l_{21}^\top\\
\boldsymbol{0} & L_{22}^\top
\end{bmatrix}\\
&= \begin{bmatrix}
\lambda_{11}^2 & \star\\
\lambda_{11} l_{21} & l_{21} l_{21}^\top + L_{22} L_{22}^\top
\end{bmatrix}\end{split}\]
Equating the block matrices gives
\[\begin{split}\lambda_{11} &= \sqrt{\alpha_{11}}\\\\
l_{21} &= a_{21} / \lambda_{11}\\\\
L_{22} &= \chol\left( A_{22} - l_{21} l_{21}^\top \right).\end{split}\]
The inductive proof starts with the base case \(n = 0\): Invoking (b) shows
that \(\alpha_{11} > 0\) hence
\(\lambda_{11} = \sqrt{\alpha_{11}}\) is well-defined.
The inductive step assumes the Cholesky factorization exists for every
symmetric positive definite matrix \(A \in \mathbb{R}^{n \times n}\). Let
\(A \in \mathbb{R}^{(n + 1) \times (n + 1)}\) be symmetric positive
definite. Partitioning \(A\) as described and invoking (b) establishes that
\(\lambda_{11}\) and \(l_{21}\) are well-defined. Invoking (d) reveals that
\(A_{22} - l_{21} l_{21}^\top = A_{22} - a_{21} \alpha_{11}^{-1} a_{21}^\top\)
is symmetric positive definite, so the inductive hypothesis enables the
recursive application of the decomposition. Thus, \(L\) is the desired Cholesky
factor of \(A\).
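The recursion in (e) translates directly into code; a minimal sketch without pivoting, assuming the input is symmetric positive definite:
import numpy as np

def chol(A):
    """Lower triangular L with A = L L^T via the recursion in (e)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    if n == 0:
        return np.zeros((0, 0))
    L = np.zeros((n, n))
    L[0, 0] = np.sqrt(A[0, 0])                     # (b): alpha_11 > 0
    L[1:, 0] = A[1:, 0] / L[0, 0]                  # l_21 = a_21 / lambda_11
    S = A[1:, 1:] - np.outer(L[1:, 0], L[1:, 0])   # (d): Schur complement, SPD
    L[1:, 1:] = chol(S)
    return L

A = np.array([[4.0, 2.0], [2.0, 10.0]])
assert np.allclose(chol(A) @ chol(A).T, A)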
Exercise 8
\[\begin{split}H_+ &= J_+ J_+^\top\\
&= \left( L_c + \frac{(y_c - L_c v_c) v_c^\top}{v_c^\top v_c} \right)
\left( L_c + \frac{(y_c - L_c v_c) v_c^\top}{v_c^\top v_c} \right)^\top
& \quad & \text{(9.2.6)}\\
&= H_c + \frac{L_c v_c (y_c - L_c v_c)^\top}{v_c^\top v_c} +
\frac{(y_c - L_c v_c) v_c^\top L_c^\top}{v_c^\top v_c} +
\frac{(y_c - L_c v_c) (y_c - L_c v_c)^\top}{v_c^\top v_c}\\
&= H_c + \frac{y_c y_c^\top}{v_c^\top v_c} -
\frac{L_c v_c v_c^\top L_c^\top}{v_c^\top v_c}\\
&= H_c + \frac{y_c y_c^\top}{\alpha^2 s_c^\top H_c s_c} -
\frac{H_c s_c s_c^\top H_c}{s_c^\top H_c s_c}
& \quad & \text{(9.2.8) and } H_c = L_c L_c^\top\\
&= H_c + \frac{y_c y_c^\top}{y_c^\top s_c} -
\frac{H_c s_c s_c^\top H_c}{s_c^\top H_c s_c}
& \quad & \text{(9.2.9)}\end{split}\]
According to Lemma 8.3.1
(see Exercise 8.13), \(J_+\) given by
(9.2.6, 9.2.9) is nonsingular.
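The factored form can be checked against the derived update numerically; a sketch assuming \(y_c^\top s_c > 0\) so that \(\alpha\) in (9.2.9) is real:
import numpy as np

rng = np.random.default_rng(3)
n = 4
M = rng.standard_normal((n, n))
H = M @ M.T + n * np.eye(n)            # symmetric positive definite H_c
L = np.linalg.cholesky(H)              # H_c = L_c L_c^T
s, y = rng.standard_normal(n), rng.standard_normal(n)
if y @ s <= 0:
    y = -y                             # enforce y_c^T s_c > 0

alpha = np.sqrt((y @ s) / (s @ H @ s))              # (9.2.9)
v = alpha * (L.T @ s)                               # (9.2.8)
J_plus = L + np.outer(y - L @ v, v) / (v @ v)       # (9.2.6)

H_plus = H + np.outer(y, y) / (y @ s) \
           - np.outer(H @ s, H @ s) / (s @ H @ s)   # derived update above
assert np.allclose(J_plus @ J_plus.T, H_plus)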
Exercise 9
Suppose \(H_c s_c = y_c\). Then \(\alpha = \pm 1\) and
\[\begin{split}J_+ &= L_c + \frac{(y_c - L_c v_c) v_c^\top}{v_c^\top v_c}
& \quad & \text{(9.2.6)}\\
&= L_c + \frac{y_c v_c^\top}{v_c^\top v_c} -
\frac{L_c v_c v_c^\top}{v_c^\top v_c}\\
&= L_c + \frac{y_c s_c^\top L_c}{\alpha s_c^\top H_c s_c} -
\frac{H_c s_c s_c^\top L_c}{s_c^\top H_c s_c}
& \quad & \text{(9.2.9)}\\
&= L_c + \frac{y_c s_c^\top L_c}{\alpha y_c^\top s_c} -
\frac{y_c s_c^\top L_c}{y_c^\top s_c}
& \quad & \text{see initial supposition.}\end{split}\]
Thus choosing \(\alpha = 1\) makes \(J_+\) closer to \(L_c\) when
\(y_c \cong H_c s_c\).
Exercise 10
\[\begin{split}H_+^{-1} &= J_+^{-\top} J_+^{-1}\\
&= \left(
L_c^{-\top} + \frac{(s_c - L_c^{-\top} w_c) w_c^\top}{w_c^\top w_c}
\right)
\left(
L_c^{-1} + \frac{w_c (s_c - L_c^{-\top} w_c)^\top}{w_c^\top w_c}
\right)
& \quad & \text{(9.2.12)}\\
&= H_c^{-1} +
\frac{L_c^{-\top} w_c (s_c - L_c^{-\top} w_c)^\top}{w_c^\top w_c} +
\frac{(s_c - L_c^{-\top} w_c) w_c^\top L_c^{-1}}{w_c^\top w_c} +
\frac{
(s_c - L_c^{-\top} w_c) (s_c - L_c^{-\top} w_c)^\top
}{
w_c^\top w_c
}\\
&= H_c^{-1} + \frac{s_c s_c^\top}{w_c^\top w_c} -
\frac{L_c^{-\top} w_c w_c^\top L_c^{-1}}{w_c^\top w_c}\\
&= H_c^{-1} + \frac{s_c s_c^\top}{\alpha^2 y_c^\top H_c^{-1} y_c} -
\frac{H_c^{-1} y_c y_c^\top H_c^{-1}}{y_c^\top H_c^{-1} y_c}\\
&= H_c^{-1} + \frac{s_c s_c^\top}{s_c^\top y_c} -
\frac{H_c^{-1} y_c y_c^\top H_c^{-1}}{y_c^\top H_c^{-1} y_c}\end{split}\]
where
\[w_c =
\left( \frac{y_c^\top s_c}{y_c^\top H_c^{-1} y_c} \right)^{1 / 2}
L_c^{-1} y_c =
\alpha L_c^{-1} y_c.\]
Using Lemma 8.3.1 to invert (9.2.12) gives
\[\begin{split}J_+^{\top}
&= \left(
L_c^{-\top} + \frac{(s_c - L_c^{-\top} w_c) w_c^\top}{w_c^\top w_c}
\right)^{-1}\\
&= L_c^\top - \sigma^{-1} L_c^\top
\frac{(s_c - L_c^{-\top} w_c) w_c^\top}{w_c^\top w_c} L_c^\top\\
&= L_c^\top - L_c^\top
\frac{(s_c - L_c^{-\top} w_c) w_c^\top}{w_c^\top L_c^\top s_c} L_c^\top\\
&= L_c^\top -
\frac{
L_c^\top s_c w_c^\top L_c^\top - w_c w_c^\top L_c^\top
}{w_c^\top L_c^\top s_c}\\
&= L_c^\top -
\frac{
L_c^\top s_c y_c^\top - \alpha L_c^{-1} y_c y_c^\top
}{y_c^\top s_c}\end{split}\]
where
\[\sigma =
1 + \frac{w_c^\top}{w_c^\top w_c} L_c^\top (s_c - L_c^{-\top} w_c) =
\frac{w_c^\top L_c^\top s_c}{w_c^\top w_c}.\]
Consequently,
\[\begin{split}H_+ &= J_+ J_+^\top\\
&= \left(
L_c -
\frac{y_c s_c^\top L_c - \alpha y_c y_c^\top L_c^{-\top}}{y_c^\top s_c}
\right)
\left(
L_c^\top -
\frac{
L_c^\top s_c y_c^\top - \alpha L_c^{-1} y_c y_c^\top
}{
y_c^\top s_c
}
\right)\\
&= H_c -
\frac{H_c s_c y_c^\top - \alpha y_c y_c^\top}{y_c^\top s_c} -
\frac{y_c s_c^\top H_c - \alpha y_c y_c^\top}{y_c^\top s_c} +
\frac{
y_c s_c^\top H_c s_c y_c^\top - \alpha y_c s_c^\top y_c y_c^\top -
\alpha y_c y_c^\top s_c y_c^\top +
\alpha^2 y_c y_c^\top H_c^{-1} y_c y_c^\top
}{
(y_c^\top s_c)^2
}\\
&= H_c +
\frac{y_c y_c^\top - H_c s_c y_c^\top - y_c s_c^\top H_c}{y_c^\top s_c} +
\frac{y_c s_c^\top H_c s_c y_c^\top}{(y_c^\top s_c)^2}\\
&= H_c +
\frac{
(y_c - H_c s_c) y_c^\top + y_c (y_c - H_c s_c)^\top
}{
y_c^\top s_c
} -
\frac{(y_c - H_c s_c)^\top s_c y_c y_c^\top}{(y_c^\top s_c)^2}.\end{split}\]
Exercise 11
Define \(H = T^\top T\).
\[\begin{split}\hat{H}_+
&= \hat{H}_c +
\frac{
(\hat{y}_c - \hat{H}_c \hat{s}_c) \hat{s}_c^\top +
\hat{s}_c (\hat{y}_c - \hat{H}_c \hat{s}_c)^\top
}{
\hat{s}_c^\top \hat{s}_c
} -
\frac{
(\hat{y}_c - \hat{H}_c \hat{s}_c)^\top
\hat{s}_c \hat{s}_c \hat{s}_c^\top
}{
(\hat{s}_c^\top \hat{s}_c)^2
}
& \quad & \text{(9.3.1)}\\
&= T^{-\top} H_c T^{-1} +
\frac{
(T^{-\top} y_c - T^{-\top} H_c s_c) s_c^\top T^\top +
T s_c (y_c^\top T^{-1} - s_c^\top H_c T^{-1})
}{s_c^\top H s_c} -
\frac{
(y_c^\top - s_c^\top H_c) s_c T s_c s_c^\top T^\top
}{(s_c^\top H s_c)^2}\\
H_+
&= H_c +
\frac{
(y_c - H_c s_c) s_c^\top H +
H s_c (y_c - H_c s_c)^\top
}{s_c^\top H s_c} -
\frac{
(y_c - H_c s_c)^\top s_c H s_c s_c^\top H
}{(s_c^\top H s_c)^2}
& \quad & H_+ = T^\top \hat{H}_+ T.\end{split}\]
Exercise 12
The section on Convergence of Double-Rank Methods in [BDMore73] contains the
proof for this exercise. Overall, the proofs in the paper are along the lines
of the book's.
Exercise 13
Suppose the assumptions of (9.3.1) hold
(see Exercise 7.2).
(a)
\[\begin{split}\hat{H}_+
&= \hat{H}_c +
\frac{\hat{y}_c \hat{y}_c^\top}{\hat{y}_c^\top \hat{s}_c} -
\frac{
\hat{H}_c \hat{s}_c \hat{s}_c^\top \hat{H}_c
}{
\hat{s}_c^\top \hat{H}_c \hat{s}_c
}
& \quad & \text{(9.2.10)}\\
&= T^{-\top} H_c T^{-1} +
\frac{T^{-\top} y_c y_c^\top T^{-1}}{y_c^\top s_c} -
\frac{T^{-\top} H_c s_c s_c^\top H_c T^{-1}}{s_c^\top H_c s_c}\\
&= T^{-\top} H_+ T^{-1}\end{split}\]
Applying the foregoing result to (9.2.11a) gives
\[\begin{split}\hat{x}_{k + 1} &= \hat{x}_k - \hat{H}_k^{-1} \nabla \hat{f}(\hat{x}_k)\\
T x_{k + 1}
&= T x_k -
\left( T^{-\top} H_k T^{-1} \right)^{-1} T^{-\top} \nabla f(x_k)\\
T x_{k + 1} &= T x_k - T H_k^{-1} \nabla f(x_k)\\
x_{k + 1} &= x_k - H_k^{-1} \nabla f(x_k),\end{split}\]
which illustrates that the positive definite secant update is invariant under
linear transformations of the variable space.
As shown in Exercise 4.11,
\(\hat{H}_+\) cannot be factored such that \(T^{-\top} H_+ T^{-1}\)
recovers the original symmetric secant update \(H_+\) of (9.1.3).
(b)
Recall that Broyden’s update, or simply the secant update \((8.1.5)\) applied
to Hessians (9.1.2), is not guaranteed to be symmetric. The update has rank
one because it is an outer product.
The update
\[\begin{split}\hat{H}_+
&= \hat{H}_c +
\frac{
(\hat{y}_c - \hat{H}_c \hat{s}_c) \hat{s}_c^\top
}{
\hat{s}_c^\top \hat{s}_c
}\\
&= T^{-\top} H_c T^{-1} +
\frac{
(T^{-\top} y_c - T^{-\top} H_c s_c) s_c^\top T^\top
}{
s_c^\top T^\top T s_c
}\end{split}\]
can be modified to be invariant under linear transformations of the variable
space as follows:
\[\begin{split}\hat{H}_+
&= \hat{H}_c +
\frac{
(\hat{y}_c - \hat{H}_c \hat{s}_c) \hat{y}_c^\top
}{
\hat{y}_c^\top \hat{s}_c
}\\
&= T^{-\top} H_c T^{-1} +
\frac{(T^{-\top} y_c - T^{-\top} H_c s_c) y_c^\top T^{-1}}{y_c^\top s_c}\\
&= T^{-\top} H_+ T^{-1}\end{split}\]
which is essentially (8.4.5) with \(v_k = y_c\).
Exercise 14
Let \(f(x_1, x_2) = x_1^2 + 2 x_1 x_2 + 2 x_2^2 + x_2^4\). Then
\[\begin{split}\nabla f
&= \begin{bmatrix} 2 x_1 + 2 x_2\\ 2 x_1 + 4 x_2 + 4 x_2^3 \end{bmatrix}\\\\
\nabla^2 f
&= \begin{bmatrix} 2 & 2\\ 2 & 4 + 12 x_2^2 \end{bmatrix}.\end{split}\]
Suppose \(x_0 = (1, -1)^\top\), then
\(\nabla f(x_0) = \begin{bmatrix} 0\\ -6 \end{bmatrix}\) and
\(H_0 = \nabla^2 f(x_0) = \begin{bmatrix} 2 & 2\\ 2 & 16 \end{bmatrix}\).
The positive definite secant and symmetric secant methods have in common the
following updates:
\[\begin{split}x_{k + 1} &= x_k - H_k^{-1} \nabla f(x_k)\\\\
s_k &= x_{k + 1} - x_k\\\\
y_k &= \nabla f(x_{k + 1}) - \nabla f(x_k).\end{split}\]
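A sketch of the driver loop used to produce the iteration traces in (c), assuming full quasi-Newton steps with no line search; the first trace below corresponds to the positive definite secant (BFGS) update and the second to the symmetric secant (PSB) update:
import numpy as np

def grad(x):
    return np.array([2 * x[0] + 2 * x[1],
                     2 * x[0] + 4 * x[1] + 4 * x[1] ** 3])

def bfgs_update(H, s, y):      # positive definite secant update (9.2.10)
    return H + np.outer(y, y) / (y @ s) \
             - np.outer(H @ s, H @ s) / (s @ H @ s)

def psb_update(H, s, y):       # symmetric secant update (9.1.3)
    r = y - H @ s
    return H + (np.outer(r, s) + np.outer(s, r)) / (s @ s) \
             - (r @ s) / (s @ s) ** 2 * np.outer(s, s)

x = np.array([1.0, -1.0])
H = np.array([[2.0, 2.0], [2.0, 16.0]])          # H_0 = Hessian at x_0
for k in range(8):
    x_new = x - np.linalg.solve(H, grad(x))      # quasi-Newton step
    s, y = x_new - x, grad(x_new) - grad(x)
    H = bfgs_update(H, s, y)                     # or psb_update
    x = x_new
    print(f"x_{k + 1}: {x} H_{k + 1}: {H.ravel()}")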
(a)
Suppose \(f \colon \mathbb{R}^n \rightarrow \mathbb{R}\) and
\(\nabla f(x) = \begin{bmatrix} A x + b\\ F(x) \end{bmatrix}\) where
\(A \in \mathbb{R}^{m \times n}\), \(b \in \mathbb{R}^m\), and
\(F \colon \mathbb{R}^n \rightarrow \mathbb{R}^{n - m}\) is nonlinear.
Let \(z_k = F(x_{k + 1}) - F(x_k)\) and
\(s_k = \begin{bmatrix} s_{k1}\\ s_{k2} \end{bmatrix}\) where
\(s_{k1} \in \mathbb{R}^m\) and \(s_{k2} \in \mathbb{R}^{n - m}\)
in order to simplify notation.
\(H_0 = H(x_0)\) can be calculated using finite differences or analytically
as
\[\begin{split}H = \nabla^2 f(x) &= \frac{\partial \nabla f(x)}{\partial x}\\
&= \begin{bmatrix} A\\ F'(x) \end{bmatrix}
& \quad & \frac{\partial F(x)}{\partial x} = F'(x) \in
\mathbb{R}^{(n - m) \times n}\\
&= H^\top
& \quad & \text{the Hessian is symmetric and (9.2.10) preserves symmetry.}\end{split}\]
As shown in Exercise 8.10, the affine
components satisfy \(Ax_k + b = 0\) for all \(k \geq 1\).
Substituting the given values into (a.1) shows that the positive definite secant
method converges to the correct Hessian. (a.2) illustrates that the symmetric
secant method in this case fails to converge to the correct Hessian. Note that
this way of showing convergence is handwavy; the more rigorous version is along
the lines of Exercises 5,
6, and
12.
See Exercise 8.10 for a general idea of how
to prove this. Supposedly [Sch82], which could not be found on the
internet, contains the proof.
(a.1)
\[\begin{split}H_1 &= H_0 + \frac{y_0 y_0^\top}{y_0^\top s_0} -
\frac{H_0 s_0 s_0^\top H_0}{s_0^\top H_0 s_0}\\
&= \begin{bmatrix} A\\ F'(x_0) \end{bmatrix} +
\frac{
\begin{bmatrix} A s_0\\ z_0 \end{bmatrix}
\begin{bmatrix} A s_0\\ z_0 \end{bmatrix}^\top
}{
\begin{bmatrix} A s_0\\ z_0 \end{bmatrix}^\top s_0
} -
\frac{
\begin{bmatrix} A s_0\\ F'(x_0) s_0 \end{bmatrix}
\begin{bmatrix} s_0^\top A^\top\\ s_0^\top F'(x_0)^\top \end{bmatrix}
}{
\begin{bmatrix} A s_0\\ F'(x_0) s_0 \end{bmatrix}^\top s_0
}\\
&= \begin{bmatrix} A\\ F'(x_0) \end{bmatrix} +
(s_0^\top A^\top s_{01} + z_0^\top s_{02})^{-1}
\begin{bmatrix}
A s_0 s_0^\top A^\top & A s_0 z_0^\top\\
z_0 s_0^\top A^\top & z_0 z_0^\top
\end{bmatrix} -
(s_0^\top A^\top s_{01} + s_0^\top F'(x_0)^\top s_{02})^{-1}
\begin{bmatrix}
A s_0 s_0^\top A^\top & A s_0 s_0^\top F'(x_0)^\top\\
F'(x_0) s_0 s_0^\top A^\top & F'(x_0) s_0 s_0^\top F'(x_0)^\top
\end{bmatrix}\end{split}\]
Generalizing this update sequence gives
\[\begin{split}H_{k + 1}
&= H_k + \frac{y_k y_k^\top}{y_k^\top s_k} -
\frac{H_k s_k s_k^\top H_k}{s_k^\top H_k s_k}\\
&= H_k +
(s_k^\top A^\top s_{k1} + z_k^\top s_{k2})^{-1}
\begin{bmatrix}
A s_k s_k^\top A^\top & A s_k z_k^\top\\
z_k s_k^\top A^\top & z_k z_k^\top
\end{bmatrix} -
(s_k^\top A^\top s_{k1} + s_k^\top F'(x_k)^\top s_{k2})^{-1}
\begin{bmatrix}
A s_k s_k^\top A^\top & A s_k s_k^\top F'(x_k)^\top\\
F'(x_k) s_k s_k^\top A^\top & F'(x_k) s_k s_k^\top F'(x_k)^\top
\end{bmatrix}\\
&= H_k +
(z_k^\top s_{k2})^{-1}
\begin{bmatrix}
\boldsymbol{0}_{m \times m} & \boldsymbol{0}_{m \times (n - m)}\\
\boldsymbol{0}_{(n - m) \times m} & z_k z_k^\top
\end{bmatrix} -
(s_k^\top F'(x_k)^\top s_{k2})^{-1}
\begin{bmatrix}
\boldsymbol{0}_{m \times m} & \boldsymbol{0}_{m \times (n - m)}\\
\boldsymbol{0}_{(n - m) \times m} & F'(x_k) s_k s_k^\top F'(x_k)^\top
\end{bmatrix}.\end{split}\]
Notice how the remaining block matrices form another positive definite secant
update, which still satisfies Theorem 9.3.1 according to the Schur complement.
(a.2)
\[\begin{split}H_1
&= H_0 +
\frac{
(y_0 - H_0 s_0) s_0^\top + s_0 (y_0 - H_0 s_0)^\top
}{
s_0^\top s_0
} -
\frac{(y_0 - H_0 s_0)^\top s_0 s_0 s_0^\top}{(s_0^\top s_0)^2}\\
&= \begin{bmatrix} A\\ F'(x_0) \end{bmatrix} +
\frac{
\begin{bmatrix}
\boldsymbol{0}_{m \times 1}\\
z_0 - F'(x_0) s_0
\end{bmatrix} s_0^\top +
s_0 \begin{bmatrix}
\boldsymbol{0}_{1 \times m} & z_0^\top - s_0^\top F'(x_0)^\top
\end{bmatrix}
}{
s_0^\top s_0
} -
\frac{
\left( z_0^\top - s_0^\top F'(x_0)^\top \right) s_{02}
}{
(s_0^\top s_0)^2
} s_0 s_0^\top\\
&= \begin{bmatrix} A\\ F'(x_0) \end{bmatrix} +
\frac{
\begin{bmatrix}
\boldsymbol{0}_{m \times n}\\
z_0 s_0^\top - F'(x_0) s_0 s_0^\top
\end{bmatrix} +
\begin{bmatrix}
\boldsymbol{0}_{n \times m} &
s_0 z_0^\top - s_0 s_0^\top F'(x_0)^\top
\end{bmatrix}
}{
s_0^\top s_0
} -
\frac{
z_0^\top s_{02} - s_0^\top F'(x_0)^\top s_{02}
}{
(s_0^\top s_0)^2
} s_0 s_0^\top\end{split}\]
Generalizing this update sequence gives
\[\begin{split}H_{k + 1}
&= H_k +
\frac{
(y_k - H_k s_k) s_k^\top + s_k (y_k - H_k s_k)^\top
}{
s_k^\top s_k
} -
\frac{(y_k - H_k s_k)^\top s_k s_k s_k^\top}{(s_k^\top s_k)^2}\\
&= H_k +
\frac{
\begin{bmatrix}
\boldsymbol{0}_{m \times n}\\
z_k s_k^\top - F'(x_k) s_k s_k^\top
\end{bmatrix} +
\begin{bmatrix}
\boldsymbol{0}_{n \times m} &
s_k z_k^\top - s_k s_k^\top F'(x_k)^\top
\end{bmatrix}
}{
s_k^\top s_k
} -
\frac{
z_k^\top s_{k2} - s_k^\top F'(x_k)^\top s_{k2}
}{
(s_k^\top s_k)^2
} s_k s_k^\top.\end{split}\]
Unlike (a.1), the remaining block matrices cannot be isolated to form a
symmetric secant update. This means even though Theorem 9.1.2 guarantees
its convergence to \(x_*\), there is no guarantee on the Hessian.
(b)
In this case, the positive definite secant method exactly computes the linear
portion of the Hessian in a single step. The symmetric secant method does not
seem to be directly affected by \(\nabla f(x)\).
(c)
It is interesting that the symmetric secant method’s Hessian converges to
\(\nabla^2 f(x_*)\) numerically if the stopping criterion is ignored.
When \(\nabla f(x_0)\) is not zero in the linear components of
\(\nabla f(x)\), the symmetric secant method takes several times longer to
converge to the correct Hessian compared to the positive definite secant method.
However, it does converge to the correct Hessian.
x_0: [ 1 -1] H_0: [ 2 2 2 16]
x_1: [ 0.57142857 -0.57142857] H_1: [ 2. 2. 2. 11.59183673]
x_2: [ 0.37446809 -0.37446809] H_2: [2. 2. 2. 6.72295489]
x_3: [ 0.17142203 -0.17142203] H_3: [2. 2. 2. 4.93521577]
x_4: [ 0.04775366 -0.04775366] H_4: [2. 2. 2. 4.15940782]
x_5: [ 0.00332346 -0.00332346] H_5: [2. 2. 2. 4.00980066]
x_6: [ 1.61335925e-05 -1.61335925e-05] H_6: [2. 2. 2. 4.0000444]
x_7: [ 3.58126625e-10 -3.58126625e-10] H_7: [2. 2. 2. 4.]
x_8: [ 1.86439634e-19 -1.86439634e-19] H_8: [2. 2. 2. 4.]
x_0: [ 1 -1] H_0: [ 2 2 2 16]
x_1: [ 0.57142857 -0.57142857] H_1: [ 3.10204082 3.10204082 3.10204082 12.69387755]
x_2: [ 0.37446809 -0.37446809] H_2: [4.31926128 4.31926128 4.31926128 9.04221617]
x_3: [ 0.17142203 -0.17142203] H_3: [4.76619606 4.76619606 4.76619606 7.70141183]
x_4: [ 0.04775366 -0.04775366] H_4: [4.96014804 4.96014804 4.96014804 7.11955587]
x_5: [ 0.00332346 -0.00332346] H_5: [4.99754984 4.99754984 4.99754984 7.00735049]
x_6: [ 1.61335925e-05 -1.61335925e-05] H_6: [4.9999889 4.9999889 4.9999889 7.0000333]
x_7: [ 3.58126646e-10 -3.58126625e-10] H_7: [5. 5. 5. 7.]
x_8: [ 1.29405326e-17 -1.86439738e-19] H_8: [4.99999993 5. 5. 7.00000007]
Exercise 15
There is a typo in the definition of \(f\). This exercise illustrates that
the convergence of the Hessian using either method depends on the function and
is not true in general.
From (a) and (b), it is not obvious that both methods converge to the same
incorrect Hessian.
(a)
\[\begin{split}H_1 &= H_0 + \frac{y_0 y_0^\top}{y_0^\top s_0} -
\frac{H_0 s_0 s_0^\top H_0}{s_0^\top H_0 s_0}\\
&= I + \frac{y_0 y_0^\top}{y_0^\top s_0} - \frac{s_0 s_0^\top}{s_0^\top s_0}\end{split}\]
(b)
\[\begin{split}H_1 &= H_0 +
\frac{
(y_0 - H_0 s_0) s_0^\top + s_0 (y_0 - H_0 s_0)^\top
}{
s_0^\top s_0
} -
\frac{
(y_0 - H_0 s_0)^\top s_0 s_0 s_0^\top
}{
(s_0^\top s_0)^2
}\\
&= I + \frac{(y_0 - s_0) s_0^\top + s_0 (y_0 - s_0)^\top}{s_0^\top s_0} -
\frac{(y_0 - s_0)^\top s_0 s_0 s_0^\top}{(s_0^\top s_0)^2}\\
&= I +
\frac{
y_0 s_0^\top - s_0 s_0^\top + s_0 y_0^\top - s_0 s_0^\top
}{
s_0^\top s_0
} -
\frac{(y_0^\top s_0 - s_0^\top s_0) s_0 s_0^\top}{(s_0^\top s_0)^2}\\
&= I + \frac{y_0 s_0^\top + s_0 y_0^\top - s_0 s_0^\top}{s_0^\top s_0} -
\frac{y_0^\top s_0}{(s_0^\top s_0)^2} s_0 s_0^\top\\
&= I + \frac{y_0 s_0^\top + s_0 y_0^\top}{s_0^\top s_0} -
(1 + \frac{y_0^\top s_0}{s_0^\top s_0}) \frac{s_0 s_0^\top}{s_0^\top s_0}\end{split}\]
Exercise 16, Exercise 17
These exercises are related to benchmarking different minimization algorithms.
These tasks are already completed by existing solver packages such as
Ceres and Eigen.
Exercise 18
Let \(f(x) = 2^{-1} x^\top H x\) be a positive definite quadratic with
Hessian \(H \in \mathbb{R}^{n \times n}\), and let
\(x_c, x_+ \in \mathbb{R}^n\) be distinct points.
Define \(s\) and \(y\) according to (9.2.11b) such that
\[\begin{split}\nabla f(x) &= Hx\\\\
\nabla^2 f(x) &= H\\\\
s &= x_+ - x_c\\\\
y &= \nabla f(x_+) - \nabla f(x_c) = H (x_+ - x_c) = Hs.\end{split}\]
The characteristic equation of \(\frac{s^\top y}{s^\top s} I\) is
\[\det\left( \frac{s^\top y}{s^\top s} I - \nu I \right) = 0,\]
which shows that the eigenvalue \(\nu = \frac{s^\top y}{s^\top s}\) has a
multiplicity of \(n\).
Rewriting the eigenvalue gives
\[\nu = \frac{s^\top y}{s^\top s} = \frac{s^\top H s}{s^\top s} = v^\top H v\]
where \(v = s \left\Vert s \right\Vert_2^{-1}\).
Recall that an eigenvalue of \(H\) must satisfy the condition
\[\begin{split}Hz &= \lambda z\\
z^\top H z &= \lambda z^\top z.\end{split}\]
Since \(H\) is positive definite, its eigenvalues must all be positive.
Restricting \(z\) to a unit vector, (a) shows that
\(z^\top H z \leq \lambda_\max\); an analogous argument gives
\(z^\top H z \geq \lambda_\min\). Hence the range of eigenvalues of \(H\)
includes \(\nu\).
(a)
\[\begin{split}z^\top H z
&= z^\top Q \Lambda Q^\top z
& \quad & \text{real symmetric matrix eigendecomposition}\\
&= u^\top \Lambda u
& \quad & u = Q^\top z\\
&= \sum_{i = 1}^n \lambda_i u_i^2\\
&\leq \lambda_\max \sum_{i = 1}^n u_i^2\\
&= \lambda_\max u^\top u\\
z^\top H z &\leq \lambda_\max z^\top z\end{split}\]
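A numeric illustration that \(\nu\) lies within the spectrum of \(H\):
import numpy as np

rng = np.random.default_rng(4)
n = 5
M = rng.standard_normal((n, n))
H = M @ M.T + n * np.eye(n)      # positive definite quadratic's Hessian
x_c, x_p = rng.standard_normal(n), rng.standard_normal(n)

s = x_p - x_c
y = H @ s                        # gradient difference when f = x^T H x / 2
nu = (s @ y) / (s @ s)           # repeated eigenvalue of (s^T y / s^T s) I

lam = np.linalg.eigvalsh(H)      # ascending eigenvalues of H
assert lam[0] <= nu <= lam[-1]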
Exercise 19
Even though each heuristic converged successfully to \(x_*\), none of them
converged to the correct Hessian under a stopping criterion of \(10^{-8}\).
The Schnabel heuristic converged much more quickly than the identity heuristic.
Moreover, the identity heuristic experienced overflow when
\(x_0 = (-1, 3)^\top\).
x_0: [-1 3] H_0: [ 6. 2. 2. 1573.54410012]
x_1: [-0.83667163 2.5100149 ] H_1: [ 6. 2. 2. 998.89787784]
x_2: [-0.7426493 2.22794789] H_2: [ 6. 2. 2. 443.44283829]
x_3: [-0.62469991 1.87409972] H_3: [ 6. 2. 2. 233.30165316]
x_4: [-0.5181552 1.55446561] H_4: [ 6. 2. 2. 116.28755637]
x_5: [-0.41032666 1.23097999] H_5: [ 6. 2. 2. 59.87797302]
x_6: [-0.30760028 0.92280085] H_6: [ 6. 2. 2. 31.12339324]
x_7: [-0.21061501 0.63184504] H_7: [ 6. 2. 2. 16.83526742]
x_8: [-0.12490953 0.37472859] H_8: [6. 2. 2. 9.74841099]
x_9: [-0.05803004 0.17409012] H_9: [6. 2. 2. 6.33545863]
x_10: [-0.01776458 0.05329373] H_10: [6. 2. 2. 4.79129044]
x_11: [-0.00269008 0.00807023] H_11: [6. 2. 2. 4.19210244]
x_12: [-0.00012799 0.00038398] H_12: [6. 2. 2. 4.02552286]
x_13: [-9.28668114e-07 2.78600434e-06] H_13: [6. 2. 2. 4.00116064]
x_14: [-3.20914944e-10 9.62744831e-10] H_14: [6. 2. 2. 4.00000836]
x_15: [-8.16388618e-16 2.44916585e-15] H_15: [6. 2. 2. 3.99999989]
x_16: [-1.33501404e-18 4.00504212e-18] H_16: [6. 2. 2. 3.99781597]
x_0: [-1 3] H_0: [1. 0. 0. 1.]
x_1: [ -1. -767.68651314] H_1: [2.33564408 2. 2. 2.99480982]
x_2: [ 513.68143814 -598.0563262 ] H_2: [5.96643161 2.10185113 2.10185113 1.6909696 ]
x_3: [-111.18935846 278.4438572 ] H_3: [3.28480919 0.06429767 0.06429767 inf]
x_4: [-77.62624105 278.4438572 ] H_4: [nan nan nan nan]
x_5: [nan nan] H_5: [nan nan nan nan]
x_6: [nan nan] H_6: [nan nan nan nan]
x_7: [nan nan] H_7: [nan nan nan nan]
x_8: [nan nan] H_8: [nan nan nan nan]
x_0: [-1 3] H_0: [369.25771965 0. 0. 369.25771965]
x_1: [-1. 0.91287637] H_1: [369.26865517 2. 2. 365.78033718]
x_2: [-0.9885881 0.89297221] H_2: [278.66012851 158.32756492 158.32756492 111.03763909]
x_3: [-0.72547548 0.45607157] H_3: [274.64325718 163.78377168 163.78377168 111.38285765]
x_4: [-0.56781456 0.21270315] H_4: [261.30722445 167.39520948 167.39520948 114.23128041]
x_5: [-0.39287129 -0.04257572] H_5: [244.11671738 165.18200469 165.18200469 116.43905471]
x_6: [-0.28113629 -0.19291898] H_6: [199.83411526 146.05736544 146.05736544 110.45644706]
x_7: [-0.21601988 -0.26781987] H_7: [100.81326034 84.42759544 84.42759544 74.59667515]
x_8: [-0.15333452 -0.32097495] H_8: [29.08639126 29.22559111 29.22559111 34.83780782]
x_9: [-0.05878757 -0.36162675] H_9: [ 8.38034685 7.53615301 7.53615301 15.47582188]
x_10: [ 0.03900773 -0.32760138] H_10: [6.38800315 0.88480616 0.88480616 5.79646148]
x_11: [ 0.08244197 -0.16519482] H_11: [8.34915061 1.37173986 1.37173986 3.05750176]
x_12: [ 0.03662406 -0.00606959] H_12: [8.01476872 2.58012474 2.58012474 3.71387439]
x_13: [8.73911509e-03 8.75585830e-05] H_13: [5.97704698 1.89604859 1.89604859 3.51135524]
x_14: [ 6.13665268e-05 -3.04043648e-04] H_14: [5.99917425 2.01829831 2.01829831 3.5938675 ]
x_15: [-1.55035930e-05 4.33011301e-05] H_15: [5.9724562 1.99390434 1.99390434 3.99786894]
x_16: [ 3.31002606e-08 -1.73583456e-08] H_16: [5.97279364 1.99024211 1.99024211 3.99663007]
Exercise 20
Suppose \(H_c \triangleq H\) is symmetric, \(s_c \triangleq s\), and
\(y_c \triangleq y\).
(a)
\[\begin{split}H_+ s
&= \left(
H + \frac{(y - Hs) (y - Hs)^\top}{(y - Hs)^\top s}
\right) s
& \quad & \text{(9.6.1)}\\
&= Hs +
\frac{
yy^\top - y s^\top H^\top - Hs y^\top + Hs s^\top H^\top
}{
y^\top s - s^\top H^\top s
} s\\
&= Hs +
\frac{
(y^\top s - s^\top H^\top s) y - (y^\top s - s^\top H^\top s) Hs
}{
y^\top s - s^\top H^\top s
}\\
&= y\end{split}\]
(b)
As laid out in Exercise 3, define
\(\Sigma \triangleq \begin{bmatrix} \sigma_1 & \sigma_2\\
\sigma_3 & \sigma_4 \end{bmatrix}\). \(H_+ - H\) can be decomposed into
\[\begin{split}H_+ - H
&= \begin{bmatrix} y - Hs & s \end{bmatrix}
\begin{bmatrix} \sigma_1 & \sigma_2\\ \sigma_3 & \sigma_4 \end{bmatrix}
\begin{bmatrix} \left( y - Hs \right)^\top\\ s^\top \end{bmatrix}\\
&= \sigma_1 \left( y - Hs \right) \left( y - Hs \right)^\top +
\sigma_3 s \left( y - Hs \right)^\top +
\sigma_2 \left( y - Hs \right) s^\top + \sigma_4 s s^\top\end{split}\]
where \(\sigma_1 = \frac{1}{(y - Hs)^\top s}\) and
\(\sigma_2 = \sigma_3 = \sigma_4 = 0\).
By inspection, \(H_+ - H = \left( H_+ - H \right)^\top\). Since
\(\det(\Sigma) = 0\), \(\Sigma\) is not a full-rank matrix. Assuming
\(\sigma_1 \neq 0\), \(H_+ - H\) has a rank of one.
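As in Exercise 3, the claims are easy to confirm numerically; a sketch of (9.6.1) that guards against the \(\left( y - Hs \right)^\top s \approx 0\) breakdown case:
import numpy as np

rng = np.random.default_rng(5)
n = 5
H = rng.standard_normal((n, n)); H = (H + H.T) / 2   # symmetric H_c
s, y = rng.standard_normal(n), rng.standard_normal(n)

r = y - H @ s
assert abs(r @ s) > 1e-8 * np.linalg.norm(r) * np.linalg.norm(s)
H_plus = H + np.outer(r, r) / (r @ s)                # SR1 update (9.6.1)

assert np.allclose(H_plus @ s, y)                    # (a): secant equation
assert np.allclose(H_plus, H_plus.T)                 # (b): symmetric
assert np.linalg.matrix_rank(H_plus - H) == 1        # (b): rank one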
Exercise 21
Let \(H_c \triangleq H\), \(s_c \triangleq s\), and
\(y_c \triangleq y\).
\[\begin{split}& H_+(\phi)\\
&= H + \frac{(y - Hs) y^\top + y (y - Hs)^\top}{y^\top s} -
\frac{(y - Hs)^\top s y y^\top}{(y^\top s)^2} -
\phi (s^\top H s)
\left[ \frac{y}{y^\top s} - \frac{Hs}{s^\top H s} \right]
\left[ \frac{y}{y^\top s} - \frac{Hs}{s^\top H s} \right]^\top\\
&= H + \frac{(y - Hs) y^\top + y (y - Hs)^\top}{y^\top s} -
\frac{(y - Hs)^\top s y y^\top}{(y^\top s)^2} +
\frac{(s^\top H s)^2}{s^\top (y - Hs)}
\left[ \frac{y}{y^\top s} - \frac{Hs}{s^\top H s} \right]
\left[ \frac{y}{y^\top s} - \frac{Hs}{s^\top H s} \right]^\top\\
&= H +
\frac{
y y^\top y^\top s - Hs y^\top y^\top s - y s^\top H^\top y^\top s +
s^\top H^\top s y y^\top
}{
(y^\top s)^2
} +
\frac{(s^\top H s)^2}{s^\top (y - Hs)}
\left[ \frac{y}{y^\top s} - \frac{Hs}{s^\top H s} \right]
\left[ \frac{y}{y^\top s} - \frac{Hs}{s^\top H s} \right]^\top
& \quad & \text{(a)}\\
&= H +
\frac{
y y^\top (y^\top s)^2 - Hs y^\top (y^\top s)^2 -
y s^\top H^\top (y^\top s)^2
}{
(y^\top s)^2 s^\top (y - Hs)
} +
\frac{Hs s^\top H^\top}{s^\top (y - Hs)}
& \quad & \text{merge (b) and (c)}\\
&= H + \frac{(y - Hs) (y - Hs)^\top}{s^\top (y - Hs)}\end{split}\]
(a)
\[\begin{split}& \frac{(y - Hs) y^\top + y (y - Hs)^\top}{y^\top s} -
\frac{(y - Hs)^\top s y y^\top}{(y^\top s)^2}\\
&= \frac{
y y^\top y^\top s - Hs y^\top y^\top s + y y^\top y^\top s -
y s^\top H^\top y^\top s
}{
(y^\top s)^2
} -
\frac{y^\top s y y^\top - s^\top H^\top s y y^\top}{(y^\top s)^2}\\
&= \frac{
y y^\top y^\top s - Hs y^\top y^\top s - y s^\top H^\top y^\top s +
s^\top H^\top s y y^\top
}{
(y^\top s)^2
}\end{split}\]
(b)
\[\begin{split}& \frac{(s^\top H s)^2}{s^\top (y - Hs)}
\left[ \frac{y}{y^\top s} - \frac{Hs}{s^\top H s} \right]
\left[ \frac{y}{y^\top s} - \frac{Hs}{s^\top H s} \right]^\top\\
&= \frac{(s^\top H s)^2}{s^\top (y - Hs)}
\left[
\frac{y y^\top}{(y^\top s)^2} -
\frac{y s^\top H^\top}{y^\top s s^\top H s} -
\frac{Hs y^\top}{s^\top H s y^\top s} +
\frac{Hs s^\top H^\top}{(s^\top H s)^2}
\right]\\
&= \frac{(s^\top H s)^2 y y^\top}{(y^\top s)^2 s^\top (y - Hs)} -
\frac{s^\top Hs y s^\top H^\top}{y^\top s s^\top (y - Hs)} -
\frac{s^\top Hs Hs y^\top}{y^\top s s^\top (y - Hs)} +
\frac{Hs s^\top H^\top}{s^\top (y - Hs)}\\
&= \frac{s^\top H s y y^\top s^\top H s}{(y^\top s)^2 s^\top (y - Hs)} -
\frac{y s^\top H^\top y^\top s s^\top Hs}{(y^\top s)^2 s^\top (y - Hs)} -
\frac{Hs y^\top y^\top s s^\top Hs}{(y^\top s)^2 s^\top (y - Hs)} +
\frac{Hs s^\top H^\top}{s^\top (y - Hs)}\end{split}\]
(c)
\[\begin{split}& \frac{
y y^\top y^\top s - Hs y^\top y^\top s - y s^\top H^\top y^\top s +
s^\top H^\top s y y^\top
}{
(y^\top s)^2
}
\frac{s^\top (y - Hs)}{s^\top (y - Hs)}\\
&= \frac{
y y^\top y^\top s s^\top (y - Hs) -
Hs y^\top y^\top s s^\top (y - Hs) -
y s^\top H^\top y^\top s s^\top (y - Hs) +
s^\top H^\top s y y^\top s^\top (y - Hs)
}{
(y^\top s)^2 s^\top (y - Hs)
}\\
&= \frac{
y y^\top (y^\top s)^2 -
Hs y^\top y^\top s s^\top (y - Hs) -
y s^\top H^\top y^\top s s^\top (y - Hs) -
s^\top H^\top s y y^\top s^\top Hs
}{
(y^\top s)^2 s^\top (y - Hs)
}\end{split}\]
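The equivalence can be confirmed numerically; a sketch that evaluates the parameterized update at \(\phi = -s^\top H s / \left( s^\top (y - Hs) \right)\) and compares it with (9.6.1):
import numpy as np

rng = np.random.default_rng(6)
n = 4
H = rng.standard_normal((n, n)); H = (H + H.T) / 2   # symmetric H_c
s, y = rng.standard_normal(n), rng.standard_normal(n)

r = y - H @ s                                        # y - Hs
w = y / (y @ s) - (H @ s) / (s @ H @ s)
phi = -(s @ H @ s) / (s @ r)                         # the special phi above

H_phi = H + (np.outer(r, y) + np.outer(y, r)) / (y @ s) \
          - (r @ s) * np.outer(y, y) / (y @ s) ** 2 \
          - phi * (s @ H @ s) * np.outer(w, w)
H_sr1 = H + np.outer(r, r) / (s @ r)                 # (9.6.1)
assert np.allclose(H_phi, H_sr1)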
Exercise 22
Substitute in relations such as (9.6.1).
References
- BDMore73
Charles George Broyden, John E. Dennis, and Jorge J. Moré. On the local and superlinear convergence of quasi-Newton methods. IMA Journal of Applied Mathematics, 12(3):223–245, 1973.
- Gei
Robert A. van de Geijn. Notes on Cholesky factorization. http://www.cs.utexas.edu/users/flame/Notes/NotesOnChol.pdf. Accessed on 2017-05-23.
- Gra
Markus Grasmair. On the existence of a Cholesky factorization. https://wiki.math.ntnu.no/_media/ma2501/2014v/cholesky.pdf. Accessed on 2017-05-23.
- Rhe
Werner C. Rheinboldt. Quasi-Newton methods. https://www-m2.ma.tum.de/foswiki/pub/M2/Allgemeines/SemWs09/quasi-newt.pdf. Accessed on 2017-05-21.
- Sch82
R. B. Schnabel. Convergence of quasi-Newton updates to correct derivative values. In preparation, 1982.