Nonlinear Least Squares

Gauss-Newton-Type Methods

  • Advantages and disadvantages of Gauss-Newton method

    • See Table 10.2.3.

    • Each step is a descent direction.

  • Damped Gauss-Newton

    • Combine a line search with Gauss-Newton to obtain local convergence on most problems, including large-residual or very nonlinear ones.

    • Still requires the Jacobian to have full column rank.

  • Levenberg-Marquardt

    • Combine Gauss-Newton with a trust region so that the matrix \(J^\top J + \mu I\) being inverted is nonsingular even when the Jacobian does not have full column rank (a minimal sketch of one such iteration follows this list).

    • Proven to be globally convergent.

    • As the trust region shrinks, each step bends from the Gauss-Newton direction toward the steepest-descent direction.

    • Often superior to damped Gauss-Newton.

    • Theorem 10.2.6 suggests that this algorithm may still be slowly convergent on large residual or very nonlinear problems.

      • In practice, Moré's scaled trust-region implementation is very successful.
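The sketch below shows one way such a regularized iteration can look in code. It is a minimal illustration, not the algorithm from the text: R and J are assumed to be user-supplied callables returning the residual vector and Jacobian as NumPy arrays, and the simple halve/grow rule for \(\mu\) stands in for Moré's scaled trust region.

import numpy

def levenberg_marquardt(R, J, x, mu=1e-3, iters=20):
    # Minimal sketch: accept a step only if it reduces f(x) = 0.5 ||R(x)||^2,
    # shrinking mu after a success and growing it after a failure.
    f = lambda z: 0.5 * R(z) @ R(z)
    for _ in range(iters):
        J_c, R_c = J(x), R(x)
        n = J_c.shape[1]
        # J^T J + mu I is nonsingular for mu > 0 even if J is rank deficient.
        s = numpy.linalg.solve(J_c.T @ J_c + mu * numpy.eye(n), -J_c.T @ R_c)
        if f(x + s) < f(x):
            x, mu = x + s, mu / 2.0   # good step: trust the Gauss-Newton model more
        else:
            mu *= 10.0                # bad step: bend toward steepest descent
    return x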

Exercise 1

Define

\[\begin{split}\begin{gather*} R(x) \colon \mathbb{R}^{n = 4} \mapsto \mathbb{R}^{m = 20},\\ r_i(x) = x_1 + x_2 \exp[-(t_i + x_3)^2 / x_4] - y_i, \text{and}\\ f(x) = \frac{1}{2} R(x)^\top R(x) = \frac{1}{2} \sum_{i = 1}^m r_i(x)^2. \end{gather*}\end{split}\]

\(J(x)\)

See Exercise 6.17 for more details.

\[\begin{split}J(x) = R'(x) &= \begin{bmatrix} \frac{\partial r_1(x)}{\partial x_1} & \cdots & \frac{\partial r_1(x)}{\partial x_n}\\ \vdots & \ddots & \vdots\\ \frac{\partial r_m(x)}{\partial x_1} & \cdots & \frac{\partial r_m(x)}{\partial x_n}\\ \end{bmatrix} & \quad & \text{Exercise 6.17}\\ &= \begin{bmatrix} 1 & \exp[-(t_1 + x_3)^2 x_4^{-1}] & -2 x_2 (t_1 + x_3) x_4^{-1} \exp[-(t_1 + x_3)^2 x_4^{-1}] & x_2 (t_1 + x_3)^2 x_4^{-2} \exp[-(t_1 + x_3)^2 x_4^{-1}]\\ \vdots & \vdots & \vdots & \vdots\\ 1 & \exp[-(t_{20} + x_3)^2 x_4^{-1}] & -2 x_2 (t_{20} + x_3) x_4^{-1} \exp[-(t_{20} + x_3)^2 x_4^{-1}] & x_2 (t_{20} + x_3)^2 x_4^{-2} \exp[-(t_{20} + x_3)^2 x_4^{-1}] \end{bmatrix}\end{split}\]

\(\nabla f(x)\)

See Exercise 6.17 for more details.

\[\begin{split}\nabla f(x) = \sum_{i = 1}^m r_i(x) \nabla r_i(x) &= J(x)^\top R(x) \qquad \text{Exercise 6.17}\\ &= \begin{bmatrix} 1 & \cdots & 1\\ \exp[-(t_1 + x_3)^2 x_4^{-1}] & \cdots & \exp[-(t_{20} + x_3)^2 x_4^{-1}]\\ -2 x_2 (t_1 + x_3) x_4^{-1} \exp[-(t_1 + x_3)^2 x_4^{-1}] & \cdots & -2 x_2 (t_{20} + x_3) x_4^{-1} \exp[-(t_{20} + x_3)^2 x_4^{-1}]\\ x_2 (t_1 + x_3)^2 x_4^{-2} \exp[-(t_1 + x_3)^2 x_4^{-1}] & \cdots & x_2 (t_{20} + x_3)^2 x_4^{-2} \exp[-(t_{20} + x_3)^2 x_4^{-1}] \end{bmatrix} R(x)\\ &= \begin{bmatrix} \sum_{i = 1}^m r_i(x)\\ \sum_{i = 1}^m r_i(x) \exp[-(t_i + x_3)^2 x_4^{-1}]\\ -\sum_{i = 1}^m r_i(x) 2 x_2 (t_i + x_3) x_4^{-1} \exp[-(t_i + x_3)^2 x_4^{-1}]\\ \sum_{i = 1}^m r_i(x) x_2 (t_i + x_3)^2 x_4^{-2} \exp[-(t_i + x_3)^2 x_4^{-1}] \end{bmatrix}\end{split}\]

\(S(x)\)

See Exercise 6.18 for more details.

\[\begin{split}S(x) &= \sum_{i = 1}^m r_i(x) \nabla^2 r_i(x) & \quad & \text{Exercise 6.18}\\ &= \sum_{i = 1}^m r_i(x) \nabla \begin{bmatrix} 1\\ \exp[-(t_i + x_3)^2 x_4^{-1}]\\ -2 x_2 (t_i + x_3) x_4^{-1} \exp[-(t_i + x_3)^2 x_4^{-1}]\\ x_2 (t_i + x_3)^2 x_4^{-2} \exp[-(t_i + x_3)^2 x_4^{-1}] \end{bmatrix}\\ &= \sum_{i = 1}^m r_i(x) \begin{bmatrix} 0 & 0 & 0 & 0\\ \star & 0 & -2 (t_i + x_3) x_4^{-1} \exp[-(t_i + x_3)^2 x_4^{-1}] & (t_i + x_3)^2 x_4^{-2} \exp[-(t_i + x_3)^2 x_4^{-1}]\\ \star & \star & \ast & \ast\\ \star & \star & \star & \ast \end{bmatrix}\end{split}\]

where each \(\star\) equals the entry in its transposed location and the \(\ast\) entries are left for the reader to complete.

\(\nabla^2 f(x)\)

See Exercise 6.18 for more details.

\[\nabla^2 f(x) = J(x)^\top J(x) + S(x)\]
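As a sanity check on these formulas, the sketch below compares the analytic gradient \(\nabla f(x) = J(x)^\top R(x)\) against a forward-difference approximation for this model. The data \(t_i\), \(y_i\) and the test point are made up for illustration only; they are not the data of Exercise 6.17.

import numpy

t = numpy.linspace(-1.0, 1.0, 20)          # made-up abscissae (m = 20)
y = numpy.linspace(0.5, 2.0, 20)           # made-up data

def R(x):
    return x[0] + x[1] * numpy.exp(-(t + x[2])**2 / x[3]) - y

def J(x):
    e = numpy.exp(-(t + x[2])**2 / x[3])
    return numpy.column_stack([numpy.ones_like(t),
                               e,
                               -2.0 * x[1] * (t + x[2]) / x[3] * e,
                               x[1] * (t + x[2])**2 / x[3]**2 * e])

def f(x):
    return 0.5 * R(x) @ R(x)

def grad_f(x):
    return J(x).T @ R(x)                   # nabla f = J^T R

x = numpy.array([1.0, 2.0, 0.3, 1.5])      # arbitrary test point
h = 1e-6
fd_grad = numpy.array([(f(x + h * e) - f(x)) / h for e in numpy.eye(4)])
print(numpy.allclose(fd_grad, grad_f(x), atol=1e-3))   # expect True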

Exercise 2

Recall that \(\lambda > \sigma \geq 0\) and \(c \in (1, \lambda / \sigma)\).

The constant in (10.2.6) has a range of

\[\frac{c \sigma + \lambda}{2 \lambda} \in \left( \frac{\sigma + \lambda}{2 \lambda}, 1 \right).\]

That constant’s original form was

\[c \frac{\sigma}{\lambda} + \frac{\lambda - c \sigma}{2 \lambda},\]

which reduces to the proposed \(c \frac{\sigma}{\lambda} = 1\) when \(c = \lambda / \sigma\).

This is my reading of the problem description; otherwise, (10.2.6) would be violated because

\[c \frac{\sigma}{\lambda} \in \left( \frac{\sigma}{\lambda}, 1 \right)\]

unless \(\alpha = 0\) in (10.2.5).

Exercise 3

\[\begin{split}\left[ J(x) - J(x_*) \right]^\top R(x_*) &= J(x)^\top R(x_*) & \quad & \text{Theorem 10.2.1 and } J(x_*)^\top R(x_*) = 0\\ &= \sum_{i = 1}^m r_i(x_*) \nabla r_i(x) & \quad & \text{Exercise 1}\\ &= \sum_{i = 1}^m r_i(x_*) \left[ \nabla r_i(x_*) + \nabla^2 r_i(x_*) (x - x_*) + \mathcal{O}(\left\Vert x - x_* \right\Vert_2^2) \right] & \quad & \text{first-order Taylor expansion}\\ &= S(x_*) (x - x_*) + J(x_*)^\top R(x_*) + \sum_{i = 1}^m r_i(x_*) \mathcal{O}(\left\Vert x - x_* \right\Vert_2^2) & \quad & \text{Exercise 1}\\ &= S(x_*) (x - x_*) + \mathcal{O}(\left\Vert x - x_* \right\Vert_2^2)\end{split}\]

See Exercise 1 and [Bin][Fol] for more details.
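A quick numerical illustration of the final relation, using the one-dimensional model \(r_i(x) = e^{t_i x} - y_i\) from Exercise 4: since the quadratic decay of the remainder does not depend on \(x_*\) being a minimizer, the expansion point below is an arbitrary stand-in for \(x_*\).

import numpy

t = numpy.array([1.0, 2.0, 3.0])
y = numpy.array([2.0, 4.0, 8.0])           # data of Exercise 4

R = lambda x: numpy.exp(t * x) - y
J = lambda x: t * numpy.exp(t * x)                         # Jacobian (stored as a vector)
S = lambda x: numpy.sum(R(x) * t**2 * numpy.exp(t * x))    # sum_i r_i(x) r_i''(x)

x0 = 0.7                                   # arbitrary expansion point standing in for x_*
for d in [1e-1, 1e-2, 1e-3]:
    lhs = (J(x0 + d) - J(x0)) @ R(x0)      # [J(x) - J(x_*)]^T R(x_*) with x = x_* + d
    rhs = S(x0) * d                        # S(x_*)(x - x_*)
    print(d, abs(lhs - rhs))               # the gap shrinks like d**2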

Exercise 4

[1]:
import numpy

def newton(x):
    # Newton step for f = 0.5 R^T R: solve (J^T J + S) s = -J^T R.
    S_c = S(x)
    J_c = J(x)
    g = J_c.T * R(x)
    s = -numpy.linalg.inv(J_c.T * J_c + S_c) * g
    return x + s

def gauss_newton(x):
    # Gauss-Newton step: drop S(x) and use s = -(J^T J)^{-1} J^T R.
    J_c = J(x)
    g = J_c.T * R(x)
    s = -numpy.linalg.inv(J_c.T * J_c) * g
    return x + s

def secant_1d(x, a):
    # One-dimensional secant iteration on grad f(x) = J(x)^T R(x),
    # with a the current secant approximation of the second derivative.
    J_c = J(x)
    g = J_c.T * R(x)
    x_next = (x - numpy.linalg.inv(a) * g).item(0)
    a_next = (J(x_next).T * R(x_next) - g) / (x_next - x)
    return x_next, a_next

def secant_newton(x, A):
    # Gauss-Newton model augmented with a secant approximation A of S(x).
    J_c = J(x)
    g = J_c.T * R(x)
    H = J_c.T * J_c + A
    x_next = (x - numpy.linalg.inv(H) * g).item(0)
    A_next = ((J(x_next) - J_c).T * R(x_next)) / (x_next - x)
    return x_next, A_next

t = numpy.asarray([1.0, 2.0, 3.0])

# One-parameter model r_i(x) = exp(t_i x) - y_i used by the cells below.
def J(x):
    return numpy.asmatrix([[t[0] * numpy.exp(t[0] * x)],
                           [t[1] * numpy.exp(t[1] * x)],
                           [t[2] * numpy.exp(t[2] * x)]])

def R(x):
    return numpy.asmatrix([[numpy.exp(t[0] * x) - y[0]],
                           [numpy.exp(t[1] * x) - y[1]],
                           [numpy.exp(t[2] * x) - y[2]]])

def S(x):
    # S(x) = sum_i r_i(x) d^2 r_i / dx^2 = sum_i r_i(x) t_i^2 exp(t_i x).
    _ = 0.0
    for i, r_i in enumerate(R(x)):
        _ += r_i.item(0) * t[i]**2 * numpy.exp(t[i] * x)
    return numpy.asmatrix([[_]])

def f(x):
    return 0.5 * R(x).T * R(x)
[2]:
y = numpy.asarray([2.0, 4.0, 8.0])
x_0 = 1

print('Gauss-Newton')
x_c = x_0
for i in range(8):
    _ = gauss_newton(x_c)
    x_c = _.item(0)
    print(i + 1, x_c, f(x_c).item(0), (J(x_c).T * R(x_c)).item(0))

print('Newton')
x_c = x_0
for i in range(8):
    _ = newton(x_c)
    x_c = _.item(0)
    print(i + 1, x_c, f(x_c).item(0), (J(x_c).T * R(x_c)).item(0))

print('Secant-1D')
x_c = x_0
a_c = J(x_0).T * J(x_0) + S(x_0)
for i in range(8):
    _ = secant_1d(x_c, a_c)
    x_c = _[0]
    a_c = _[1]
    print(i + 1, x_c, f(x_c).item(0), (J(x_c).T * R(x_c)).item(0))

print('Secant-Newton')
x_c = x_0
A_c = J(x_0).T * J(x_0) * 0
for i in range(8):
    _ = secant_newton(x_c, A_c)
    x_c = _[0]
    A_c = _[1]
    print(i + 1, x_c, f(x_c).item(0), (J(x_c).T * R(x_c)).item(0))
Gauss-Newton
1 0.797681485268023 4.798572539663344 106.43337959215394
2 0.7075326297366793 0.06947285928496987 9.860940386410261
3 0.6934421945480853 2.804859826706154e-05 0.1902319942916944
4 0.6931473062113115 5.083823436164468e-12 8.091952383927143e-05
5 0.6931471805599682 1.684266350377305e-25 1.472866273388815e-11
6 0.6931471805599453 1.5777218104420236e-30 -4.2632564145606e-14
7 0.6931471805599454 1.9721522630525295e-30 4.9737991503207026e-14
8 0.6931471805599453 1.5777218104420236e-30 -4.2632564145606e-14
Newton
1 0.8729918237243539 17.945438055943427 256.32405187794905
2 0.7736000145002915 2.642378244000565 73.64363082415387
3 0.7142113601455673 0.15188433600754053 14.864643595252721
4 0.6949161568380523 0.001012790924586588 1.1479866632350721
5 0.6931606396245242 5.833141447658211e-08 0.008668143002826043
6 0.6931471813446821 1.9829141280503995e-16 5.053704985954406e-07
7 0.6931471805599453 1.5777218104420236e-30 -4.2632564145606e-14
8 0.6931471805599454 1.9721522630525295e-30 4.9737991503207026e-14
Secant-1D
1 0.8729918237243539 17.945438055943427 256.32405187794905
2 0.8108570547759058 6.334571483430388 127.06401085423846
3 0.7497779084785361 1.2191425191346346 46.68223221178341
4 0.7143058034952143 0.15329163936525653 14.937429607434115
5 0.6976145019987203 0.006509690750205783 2.933205196835438
6 0.6935360200729546 4.873987772145979e-05 0.25083485731468713
7 0.6931546314865848 1.7876635805453848e-08 0.00479855164903069
8 0.6931471931013226 5.0646139471428766e-14 8.076647346819262e-06
Secant-Newton
1 0.797681485268023 4.798572539663344 106.43337959215394
2 0.731289681799256 0.5235985265595988 28.997539172425046
3 0.6989136719544521 0.010887376118456476 3.807623768439826
4 0.6932923743798246 6.791008308303912e-06 0.09356365612181405
5 0.6931472723387299 2.7123179020933864e-12 5.910556073636461e-05
6 0.6931471805599818 4.288094052904902e-25 2.3501200985267795e-11
7 0.6931471805599453 1.5777218104420236e-30 -4.2632564145606e-14
8 0.6931471805599454 1.9721522630525295e-30 4.9737991503207026e-14

Exercise 5

[3]:
y = numpy.asarray([2.0, 4.0, -8.0])
x_0 = -0.7

print('Gauss-Newton')
x_c = x_0
for i in range(8):
    _ = gauss_newton(x_c)
    x_c = _.item(0)
    print(i + 1, x_c, f(x_c).item(0), (J(x_c).T * R(x_c)).item(0))

print('Newton')
x_c = x_0
for i in range(8):
    _ = newton(x_c)
    x_c = _.item(0)
    print(i + 1, x_c, f(x_c).item(0), (J(x_c).T * R(x_c)).item(0))

print('Secant-1D')
x_c = x_0
a_c = J(x_0).T * J(x_0) + S(x_0)
for i in range(8):
    _ = secant_1d(x_c, a_c)
    x_c = _[0]
    a_c = _[1]
    print(i + 1, x_c, f(x_c).item(0), (J(x_c).T * R(x_c)).item(0))

print('Secant-Newton')
x_c = x_0
A_c = J(x_0).T * J(x_0) * 0
for i in range(8):
    _ = secant_newton(x_c, A_c)
    x_c = _[0]
    A_c = _[1]
    print(i + 1, x_c, f(x_c).item(0), (J(x_c).T * R(x_c)).item(0))
Gauss-Newton
1 -1.3181392160204424 41.37014622797392 -0.5652177807736287
2 4.603063420929212 493776978174.8777 2962538999580.457
3 4.2697199740280904 66829586764.7546 400942681957.58496
4 3.936364890074363 9045570175.395155 54263341625.477844
5 3.6029836310886236 1224521251.1689384 7344143794.657102
6 3.269541898071428 165819808.15308744 994013600.4270636
7 2.935956749386091 22470974.235461347 134544297.35789526
8 2.6020232241347236 3050228.5712296856 18211744.7482201
Newton
1 -0.7753181925883713 41.14527907441983 0.05727237701094268
2 -0.7909037398531175 41.1448223711391 0.0019908002207311704
3 -0.7914855557238429 41.144821791482315 2.6663328043241563e-06
4 -0.7914863370578041 41.14482179148127 4.802380715318577e-12
5 -0.7914863370592115 41.14482179148128 4.440892098500626e-16
6 -0.7914863370592116 41.14482179148128 -8.881784197001252e-16
7 -0.7914863370592113 41.14482179148128 8.881784197001252e-16
8 -0.7914863370592116 41.14482179148128 -8.881784197001252e-16
Secant-1D
1 -0.7753181925883713 41.14527907441983 0.05727237701094268
2 -0.7884320120357908 41.144837783961584 0.010496620886760066
3 -0.7913747927094275 41.144821812714476 0.00038074597382831143
4 -0.7914855544480799 41.144821791482315 2.6706863969216954e-06
5 -0.7914863368580352 41.14482179148127 6.865192858640512e-10
6 -0.791486337059211 41.14482179148128 1.3322676295501878e-15
7 -0.7914863370592115 41.14482179148128 4.440892098500626e-16
8 -0.7914863370592117 41.14482179148128 -1.3322676295501878e-15
Secant-Newton
1 -1.3181392160204424 41.37014622797392 -0.5652177807736287
2 -0.8658644841836972 41.15325353487681 -0.21408957648183602
3 -0.651247739313924 41.18666389076415 0.6641431077724418
4 -0.8016261951479147 41.14499451637015 -0.03380400577673148
5 -0.7939862144878457 41.14483241371934 -0.008481899318829367
6 -0.7914338276553045 41.14482179618622 0.00017921144377242015
7 -0.7914865943296561 41.14482179148138 -8.779418063653566e-07
8 -0.791486337085712 41.14482179148128 -9.043343851544705e-11
[4]:
y = numpy.asarray([2.0, 4.0, 10.0])
x_0 = 1.0

print('Secant-Newton')
x_c = x_0
A_c = J(x_0).T * J(x_0) * 0
for i in range(7):
    _ = secant_newton(x_c, A_c)
    x_c = _[0]
    A_c = _[1]
    print(i + 1, x_c, f(x_c).item(0), (J(x_c).T * R(x_c)).item(0))
print(S(x_c).item(0))
Secant-Newton
1 0.8289297211970873 2.866613066659297 86.71740114325145
2 0.7789272577862364 0.3567726169698321 18.299700156353552
3 0.7624256408592597 0.19852242343622828 1.2817262639102118
4 0.7610898236380687 0.19766317205318565 0.007228509731195487
5 0.761082202873457 0.19766314450975261 9.251903421159113e-08
6 0.7610822027759129 0.19766314450975353 -1.4033219031261979e-13
7 0.761082202775913 0.19766314450975309 2.3092638912203256e-14
-5.936710010394714

Exercise 6

\[\begin{split}S_G - S_N &= (x_+^G - x_c) - (x_+^N - x_c)\\ &= -(J_c^\top J_c)^{-1} J_c^\top R_c + (J_c^\top J_c + S_c)^{-1} J_c^\top R_c\\ &= -(J_c^\top J_c)^{-1} \left[ I - J_c^\top J_c (J_c^\top J_c + S_c)^{-1} \right] J_c^\top R_c\\ &= -(J_c^\top J_c)^{-1} \left[ -(J_c^\top J_c + S_c) + J_c^\top J_c \right] S_N\\ &= (J_c^\top J_c)^{-1} S_c S_N\end{split}\]

Let \(\nabla f(x_c) = \nabla f_c = J_c^\top R_c\).

\[\begin{split}x_+^G - x_* &= x_c - x_* - (J_c^\top J_c)^{-1} J_c^\top R_c\\ &= -(J_c^\top J_c)^{-1} \left[ \nabla f_c + J_c^\top J_c (x_* - x_c) \right]\\ &= -(J_c^\top J_c)^{-1} \left[ \nabla f_c - S_c (x_* - x_c) + (J_c^\top J_c + S_c) (x_* - x_c) \right]\\ &= (J_c^\top J_c)^{-1} \left[ S_c (x_* - x_c) + \nabla f_* - \nabla f_c - (J_c^\top J_c + S_c) (x_* - x_c) \right] & \quad & \nabla f_* = J_*^\top R_* = 0 \text{ by Lemma 4.3.1}\\ \left\Vert x_+^G - x_* \right\Vert_2 &= \left\Vert (J_c^\top J_c)^{-1} \left[ S_c (x_* - x_c) + \nabla f_* - \nabla f_c - (J_c^\top J_c + S_c) (x_* - x_c) \right] \right\Vert_2\\ &\leq \left\Vert (J_c^\top J_c)^{-1} \right\Vert_2 \left\Vert S_c \right\Vert_2 \left\Vert x_c - x_* \right\Vert_2 + \left\Vert (J_c^\top J_c)^{-1} \right\Vert_2 \left\Vert \nabla f_* - \nabla f_c - (J_c^\top J_c + S_c) (x_* - x_c) \right\Vert_2 & \quad & \text{(3.1.10) and Exercise 3.4}\\ &\leq \left\Vert (J_c^\top J_c)^{-1} \right\Vert_2 \left\Vert S_c \right\Vert_2 \left\Vert x_c - x_* \right\Vert_2 + \frac{\gamma}{2} \left\Vert (J_c^\top J_c)^{-1} \right\Vert_2 \left\Vert x_* - x_c \right\Vert_2^2 & \quad & \text{Lemma 4.1.12}\\ &\leq \left\Vert (J_c^\top J_c)^{-1} \right\Vert_2 \left\Vert S_c \right\Vert_2 \left\Vert x_c - x_* \right\Vert_2 + \mathcal{O}\left( \left\Vert x_c - x_* \right\Vert_2^2 \right) & \quad & \text{Exercise 3.4}\end{split}\]

In the case of \(n = 1\), applying (3.1.2c) gives (10.5.1). See Exercise 3.4 for more details.
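The first identity above can be verified numerically with the one-dimensional model of Exercise 4; the current point x_c = 1 is arbitrary.

import numpy

t = numpy.array([1.0, 2.0, 3.0])
y = numpy.array([2.0, 4.0, 8.0])
x_c = 1.0                                  # arbitrary current point

R_c = numpy.exp(t * x_c) - y
J_c = t * numpy.exp(t * x_c)
S_c = numpy.sum(R_c * t**2 * numpy.exp(t * x_c))
JtJ = J_c @ J_c                            # scalar J_c^T J_c

s_G = -(J_c @ R_c) / JtJ                   # Gauss-Newton step
s_N = -(J_c @ R_c) / (JtJ + S_c)           # Newton step
print(s_G - s_N, S_c * s_N / JtJ)          # the two values agree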

Exercise 7

Let \(x_0 = x_* + \delta\) where \(\delta > 0\). When \(\lvert \delta \rvert\) is sufficiently small,

\[\begin{split}x_1^G &= x_0 - (J_0^\top J_0)^{-1} J_0^\top R_0 & \quad & \text{by definition of Gauss-Newton}\\ \left\Vert x_1^G - x_0 \right\Vert_2 &\leq \alpha \lvert \delta \rvert & \quad & \text{Lemma 4.2.1}.\end{split}\]

Hence \(x_1 \cong x_0 - \alpha \delta\). Invoking (10.5.1) yields

\[\begin{split}(x_+^G - x_*) - \left( S_c (J_c^\top J_c)^{-1} \right) (x_c - x_*) &= \mathcal{O}(\left\vert x_c - x_* \right\vert^2)\\ x_+^G - x_* &= \left( S_c (J_c^\top J_c)^{-1} \right) (x_c - x_*) + \mathcal{O}(\left\vert x_c - x_* \right\vert^2)\\ x_1 - x_* &= \delta \left[ \left( S_0 (J_0^\top J_0)^{-1} \right) + \mathcal{O}(\delta) \right]\\ \left\vert \frac{x_1 - x_*}{\delta} \right\vert &= \left\vert \left( S_0 (J_0^\top J_0)^{-1} \right) + \mathcal{O}(\delta) \right\vert.\end{split}\]
[5]:
delta = 0.0001

y = numpy.asarray([2.0, 4.0, -4.0])
x_star = -0.3719287
J_star = J(x_star)
S_star = S(x_star)
print(J_star.T * J_star, S_star)

x_0 = x_star + delta
e = abs((gauss_newton(x_0) - x_0) / delta)
print(abs(x_0 - e * delta - x_star) / (x_0 - x_star))


y = numpy.asarray([2.0, 4.0, -8.0])
x_star = -0.7914863
J_star = J(x_star)
S_star = S(x_star)
print(J_star.T * J_star, S_star)

x_0 = x_star + delta
e = abs((gauss_newton(x_0) - x_0) / delta)
print(abs(x_0 - e * delta - x_star) / (x_0 - x_star))
[[2.34506567]] [[5.15750003]]
[[2.19976901]]
[[0.45201033]] [[2.9605172]]
[[6.55160976]]

Exercise 8

The nonlinearity measure for the second configuration is twice that of the first. This could explain why both methods converge to the wrong result.

The Gauss-Newton method converges faster than Newton's method in both settings. In the second configuration, the Hessian used by Newton's method shrinks toward the zero matrix as the gradient magnitude goes to zero; Gauss-Newton does not exhibit this even though its gradient shrinks faster.

[6]:
def J(x):
    x = x.flat
    return numpy.asmatrix([[numpy.exp(x[0] + x[1] * t[0]), t[0] * numpy.exp(x[0] + x[1] * t[0])],
                           [numpy.exp(x[0] + x[1] * t[1]), t[1] * numpy.exp(x[0] + x[1] * t[1])],
                           [numpy.exp(x[0] + x[1] * t[2]), t[2] * numpy.exp(x[0] + x[1] * t[2])],
                           [numpy.exp(x[0] + x[1] * t[3]), t[3] * numpy.exp(x[0] + x[1] * t[3])]])

def R(x):
    x = x.flat
    return numpy.asmatrix([[numpy.exp(x[0] + x[1] * t[0]) - y[0]],
                           [numpy.exp(x[0] + x[1] * t[1]) - y[1]],
                           [numpy.exp(x[0] + x[1] * t[2]) - y[2]],
                           [numpy.exp(x[0] + x[1] * t[3]) - y[3]]])

def S(x):
    _ = numpy.zeros((len(x), len(x)))
    for i, r_i in enumerate(R(x)):
        a = numpy.exp(x[0] + x[1] * t[i]).item(0)
        b = t[i] * numpy.exp(x[0] + x[1] * t[i]).item(0)
        c = t[i]**2 * numpy.exp(x[0] + x[1] * t[i]).item(0)
        _ += r_i.item(0) * numpy.asmatrix([[a, b],
                                           [b, c]])
    return _

def f(x):
    return 0.5 * R(x).T * R(x)

def fp(x):
    return J(x).T * R(x)

def fpp(x):
    return J(x).T * J(x) + S(x)

t = numpy.asarray([-2.0, -1.0, 0.0, 1.0])
[7]:
y = numpy.asarray([0.5, 1.0, 2.0, 4.0])
x_0 = numpy.asmatrix([[1.0],
                      [1.0]])

print('Gauss-Newton')
x_c = x_0
for i in range(8):
    _ = gauss_newton(x_c)
    x_c = _
    print(i + 1, numpy.asarray(x_c).squeeze(),
          numpy.asarray(f(x_c)).squeeze(),
          numpy.linalg.norm(fp(x_c)), numpy.linalg.norm(fpp(x_c)))

print('Newton')
x_c = x_0
for i in range(8):
    _ = newton(x_c)
    x_c = _
    print(i + 1, numpy.asarray(x_c).squeeze(),
          numpy.asarray(f(x_c)).squeeze(),
          numpy.linalg.norm(fp(x_c)), numpy.linalg.norm(fpp(x_c)))
Gauss-Newton
1 [0.75406408 0.78581937] 0.230568840781627 4.588006739285215 52.501520036199445
2 [0.69782818 0.69931189] 0.0010036922558129178 0.2642056161400473 35.652589176231444
3 [0.6931729  0.69317775] 2.6841803927285428e-08 0.0013536418753972424 34.589627280948655
4 [0.69314718 0.69314718] 1.9726285056677948e-17 3.6695081563895345e-08 34.58413668570615
5 [0.69314718 0.69314718] 4.005934284325451e-31 5.0649172608481935e-15 34.58413653685745
6 [0.69314718 0.69314718] 4.005934284325451e-31 4.9864341922878095e-15 34.58413653685749
7 [0.69314718 0.69314718] 4.005934284325451e-31 5.0649172608481935e-15 34.58413653685745
8 [0.69314718 0.69314718] 4.005934284325451e-31 4.9864341922878095e-15 34.58413653685749
Newton
1 [0.81564456 0.86820864] 1.0013289027358931 11.00808913350165 76.13038140740468
2 [0.73026818 0.74718706] 0.07646606536852447 2.488684557651273 44.44978255981308
3 [0.69788478 0.69905288] 0.0009679391078418193 0.2595675234876606 35.6340302609343
4 [0.69322068 0.69323479] 2.1988614600444903e-07 0.0038746246180375037 34.5998525308189
5 [0.6931472 0.6931472] 1.1906844585914362e-14 9.015366143490777e-07 34.58414019381937
6 [0.69314718 0.69314718] 3.3719181466331725e-29 4.797709209280672e-14 34.58413653685767
7 [0.69314718 0.69314718] 4.005934284325451e-31 4.9864341922878095e-15 34.58413653685749
8 [0.69314718 0.69314718] 4.005934284325451e-31 5.0649172608481935e-15 34.58413653685745
[8]:
y = numpy.asarray([5.0, 1.0, 2.0, -4.0])
x_0 = numpy.asmatrix([[1.0],
                      [1.0]])

print('Gauss-Newton')
x_c = x_0
for i in range(16):
    _ = gauss_newton(x_c)
    x_c = _
    print(i + 1, numpy.asarray(x_c).squeeze(),
          numpy.asarray(f(x_c)).squeeze(),
          numpy.linalg.norm(fp(x_c)), numpy.linalg.norm(fpp(x_c)))

print('Newton')
x_c = x_0
for i in range(16):
    _ = newton(x_c)
    x_c = _
    print(i + 1, numpy.asarray(x_c).squeeze(),
          numpy.asarray(f(x_c)).squeeze(),
          numpy.linalg.norm(fp(x_c)), numpy.linalg.norm(fpp(x_c)))
Gauss-Newton
1 [ 0.90163112 -0.46628928] 20.56191776952204 34.58973081262516 289.6715725097342
2 [-0.36818859 -1.02409293] 10.376725641662436 6.06987964672671 163.84835201858382
3 [-1.35679809 -1.49393682] 9.767842221661171 1.1702074383497212 135.96800692867657
4 [-1.30903259 -1.45981383] 9.762473917870453 0.009666786819778306 128.18531867833067
5 [-1.33251806 -1.47178715] 9.762400496911953 0.007016785162575777 128.27103187578723
6 [-1.3241041  -1.46745693] 9.762391254263322 0.002433728927208294 128.20915596103677
7 [-1.32717445 -1.46903601] 9.762390012341301 0.000901156793237069 128.23088462217402
8 [-1.32605814 -1.46846175] 9.76238984871859 0.0003259007271765508 128.22287481308905
9 [-1.32646455 -1.4686708 ] 9.76238982700529 0.00011887909378551661 128.22577625980256
10 [-1.32631666 -1.46859473] 9.762389824131397 4.3228210833789006e-05 128.22471852324045
11 [-1.32637049 -1.46862242] 9.762389823750652 1.5737057403786923e-05 128.22510323276728
12 [-1.3263509  -1.46861234] 9.76238982370023 5.7266395893342314e-06 128.22496319179058
13 [-1.32635803 -1.46861601] 9.762389823693551 2.084210919093698e-06 128.2250141534847
14 [-1.32635543 -1.46861467] 9.762389823692667 7.58507091534126e-07 128.224995606165
15 [-1.32635638 -1.46861516] 9.76238982369255 2.760490723046993e-07 128.22500235611844
16 [-1.32635603 -1.46861498] 9.762389823692533 1.0046384255095575e-07 128.22499989956154
Newton
1 [-5.74923409  7.46802779] 60.860720234508165 75.54702133707183 169.06642212514757
2 [-6.75061755  7.83744677] 39.25228502035673 29.201597831652535 58.87923451951846
3 [-7.75108426  8.13649786] 29.96080533250285 11.373123185856448 20.40756245009949
4 [-8.75123711  8.34848473] 25.897025050859376 4.413215372409868 7.135111931295802
5 [-9.75128646  8.47378858] 24.153665247324113 1.686548330594559 2.540582864457937
6 [-10.75130275   8.53496468] 23.441928711923843 0.6334141741732487 0.9195696277964355
7 [-11.75130837   8.56081363] 23.165436990089965 0.235171827605604 0.3359779599874159
8 [-12.75131037   8.57089609] 23.06127981190756 0.08683219837152847 0.1232698720635962
9 [-13.7513111    8.57469085] 23.022602051371816 0.03198818502117416 0.04530296995548207
10 [-14.75131136   8.57609891] 23.008322834899026 0.011773881162521788 0.016659830516505058
11 [-15.75131146   8.57661856] 23.00306288784996 0.004332196260795135 0.006127967670570039
12 [-16.7513115    8.57680995] 23.001126920948433 0.0015938381140055543 0.002254239327183045
13 [-17.75131151   8.57688039] 23.000414591019343 0.0005863554654991157 0.0008292728707017905
14 [-18.75131152   8.57690631] 23.000152522215817 0.00021571017730657692 0.0003050703513048174
15 [-19.75131152   8.57691585] 23.000056110153405 7.935561780067604e-05 0.00011222882763368799
16 [-20.75131152   8.57691935] 23.000020641821397 2.9193337997770915e-05 4.128664013018487e-05

Exercise 9

There is little value in implementing this at the moment since optimization packages already provide it, and the previous exercises implemented Gauss-Newton and a line search independently.

The section after (10.2.12) mentioned that the damped Gauss-Newton took ten times more iterations than Newton’s method for Example 10.2.4. Since the symmetric secant method in Example 9.3.2 took twice as long as Newton’s method, I hypothesize the damped Gauss-Newton will be slower than the secant approximation.
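For reference, a minimal sketch of a damped Gauss-Newton iteration with a backtracking line search is given below. The Armijo constant, the halving rule, and the fixed iteration count are arbitrary choices rather than the book's algorithm, and R and J are assumed to return plain NumPy arrays (the notebook's R and J above return matrices and would need adapting).

import numpy

def damped_gauss_newton(R, J, x, iters=50, alpha=1e-4):
    # Full Gauss-Newton step, halved until the Armijo condition
    # f(x + lam*s) <= f(x) + alpha*lam*g^T s is satisfied.
    f = lambda z: 0.5 * R(z) @ R(z)
    for _ in range(iters):
        J_c, R_c = J(x), R(x)
        g = J_c.T @ R_c
        s = numpy.linalg.lstsq(J_c, -R_c, rcond=None)[0]   # Gauss-Newton direction
        lam = 1.0
        while f(x + lam * s) > f(x) + alpha * lam * (g @ s) and lam > 1e-10:
            lam *= 0.5
        x = x + lam * s
    return x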

Exercise 10

Let \(s = x_+ - x_c \in \mathbb{R}^n\).

\[\begin{split}\min_{x_+} \left\Vert R(x_c) + J(x_c) s \right\Vert_2 &\equiv \min_{x_+} \left( R(x_c) + J(x_c) s \right)^\top \left( R(x_c) + J(x_c) s \right) & \quad & \text{square root is monotonically increasing}\\ &\equiv \min_{x_+} \frac{1}{2} \left( R(x_c)^\top R(x_c) + 2 R(x_c)^\top J(x_c) s + s^\top J(x_c)^\top J(x_c) s \right) & \quad & \text{minimum point is not affected by positive scaling}\\ &= \min_{x_+} f(x_c) + \nabla f(x_c)^\top s + \frac{1}{2} s^\top J(x_c)^\top J(x_c) s & \quad & \text{see Exercise 1}\end{split}\]

subject to \(\left\Vert s \right\Vert_2 \leq \delta_c\). See Exercise 1 for more details.

(10.2.13) now matches the problem format of (6.4.1), so applying Lemma 6.4.1 gives (10.2.14).

Exercise 11

Let

\[\begin{split}\begin{gather*} x_+, x_c \in \mathbb{R}^n,\\ f_c = f(x_c) \in \mathbb{R},\\ J_c = J(x_c) \in \mathbb{R}^{m \times n},\\ R_c = R(x_c) \in \mathbb{R}^m, \text{ and}\\ s = x_+ - x_c = -\left( J_c^\top J_c + \mu_c I_n \right)^{-1} J_c^\top R_c. \end{gather*}\end{split}\]

Assume that either \(J_c^\top J_c\) is positive definite or \(\mu_c \geq 0\) is picked such that \(J_c^\top J_c + \mu_c I_n\) is positive definite.

\[\begin{split}\nabla f_c^\top s &= \left[ R_c^\top J_c \right] \left[ -\left( J_c^\top J_c + \mu_c I \right)^{-1} J_c^\top R_c \right] & \quad & \text{Exercise 1}\\ &= -\left( R_c^\top J_c \right) \left( J_c^\top J_c + \mu_c I \right)^{-1} \left( J_c^\top R_c \right)\\ &< 0\end{split}\]

satisfies the definition of a descent direction (6.2.3). See Exercise 1 for more details.
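A quick numerical illustration with a deliberately rank-deficient Jacobian (the numbers are made up): for any \(\mu_c > 0\) the step is still well defined and \(\nabla f_c^\top s < 0\).

import numpy

J_c = numpy.array([[1.0, 2.0],
                   [2.0, 4.0],
                   [3.0, 6.0]])            # rank 1, so J_c^T J_c is singular
R_c = numpy.array([1.0, -1.0, 0.5])
mu_c = 1e-2

g = J_c.T @ R_c
s = numpy.linalg.solve(J_c.T @ J_c + mu_c * numpy.eye(2), -g)
print(g @ s)                               # negative: s is a descent direction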

Exercise 12

Let \(R \in \mathbb{R}^m\), \(J \in \mathbb{R}^{m \times n}\), and \(s \in \mathbb{R}^n\).

Denote \(A = \begin{bmatrix} J^\top & \mu^{1 / 2} I \end{bmatrix}^\top \in \mathbb{R}^{(m + n) \times n}\) and \(b = \begin{bmatrix} R^\top & 0 \end{bmatrix}^\top \in \mathbb{R}^{m + n}\).

\[\begin{split}\min_s \left\Vert As + b \right\Vert &\equiv \min_s \frac{1}{2} \left( As + b \right)^\top \left( As + b \right)\\ &= \min_s \frac{1}{2} s^\top A^\top A s + b^\top As + \frac{1}{2} b^\top b\end{split}\]

The solution to this unconstrained minimization problem is

\[\begin{split}0 &= \frac{\partial}{\partial s} \frac{1}{2} s^\top A^\top As + b^\top As + \frac{1}{2} b^\top b\\ &= A^\top As + A^\top b\\ s &= -\left( A^\top A \right)^{-1} A^\top b\\ &= -\left( J^\top J + \mu I \right)^{-1} J^\top R.\end{split}\]
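This equivalence is easy to check numerically: solving the augmented least-squares problem with a QR-based routine gives the same step as the normal-equations formula (and is the numerically preferable way to compute it). The J, R, and \(\mu\) below are made up; since numpy.linalg.lstsq minimizes \(\left\Vert As - b' \right\Vert_2\), the right-hand side passed in is the negation of \(b = \begin{bmatrix} R^\top & 0 \end{bmatrix}^\top\).

import numpy

m, n, mu = 6, 3, 0.1
rng = numpy.random.default_rng(0)
J = rng.standard_normal((m, n))
R = rng.standard_normal(m)

A = numpy.vstack([J, numpy.sqrt(mu) * numpy.eye(n)])       # [J; sqrt(mu) I]
b = numpy.concatenate([R, numpy.zeros(n)])                 # [R; 0]
s_lstsq = numpy.linalg.lstsq(A, -b, rcond=None)[0]         # min_s ||A s + b||

s_normal = numpy.linalg.solve(J.T @ J + mu * numpy.eye(n), -J.T @ R)
print(numpy.allclose(s_lstsq, s_normal))                   # True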

Exercise 13

Let \(J_* \triangleq J(x_*)\), \(J_k \triangleq J(x_k)\), \(R_* \triangleq R(x_*)\), and \(R_k \triangleq R(x_k)\).

(a)

According to Theorem 3.1.4, the Levenberg-Marquardt method is well-defined if

\[\begin{split}\left\Vert \left( J_*^\top J_* + \mu_* I \right)^{-1} \left( J_k^\top J_k + \mu_k I - J_*^\top J_* \right) \right\Vert_2 &\leq \left\Vert \left( J_*^\top J_* + \mu_* I \right)^{-1} \right\Vert_2 \left\Vert J_k^\top J_k + \mu_k I - J_*^\top J_* \right\Vert_2 & \quad & \text{(3.1.10)}\\ &\leq \frac{ \left\Vert J_k^\top J_k + \mu_k I - J_*^\top J_* \right\Vert_2 }{\lambda + b} & \quad & \text{(a.1)}\\ &< 1,\end{split}\]

which implies that \(\left\Vert J_k^\top J_k + \mu_k I - J_*^\top J_* \right\Vert_2 < \lambda + b\).

Applying (3.2.20) yields

\[\begin{split}\left\Vert \left( J_k^\top J_k + \mu_k I \right)^{-1} \right\Vert_2 &\leq \frac{ \left\Vert \left( J_*^\top J_* + \mu_* I \right)^{-1} \right\Vert_2 }{ 1 - \left\Vert \left( J_*^\top J_* + \mu_* I \right)^{-1} \left( J_k^\top J_k + \mu_k I - J_*^\top J_* \right) \right\Vert_2 }\\ &\leq \frac{1}{ \lambda + b - \left\Vert J_k^\top J_k + \mu_k I - J_*^\top J_* \right\Vert_2 }.\end{split}\]

By inspection, this bound attains its smallest value, \(1 / (\lambda + b)\), when \(\left\Vert J_k^\top J_k + \mu_k I - J_*^\top J_* \right\Vert_2 = 0\).

In order to make subsequent derivations more elegant, it is beneficial to obtain an upper bound for the inverse using the assumption that \(\lambda > \sigma \geq 0\):

\[\left\Vert \left( J_k^\top J_k + \mu_k I \right)^{-1} \right\Vert_2 \leq \frac{c}{\lambda + b} \quad \text{where} \quad c \in \left( 1, \frac{\lambda + b}{\sigma + b} \right) \text{ for } x_k \in N(x_*, \epsilon_1).\]

Since the conditions of Theorem 10.2.1 are satisfied, \(\left\Vert J(x) \right\Vert_2 \leq \alpha \, \forall x \in D\). Applying the results of Exercise 3.4, Exercise 3.8, and the proof techniques of Theorem 10.2.1 gives

\[\begin{split}x_{k + 1} - x_* &= x_k - x_* - \left( J_k^\top J_k + \mu_k I \right)^{-1} J_k^\top R_k\\ &= -\left( J_k^\top J_k + \mu_k I \right)^{-1} \left[ J_k^\top R_k + (J_k^\top J_k + \mu_k I) (x_* - x_k) \right]\\ \left\Vert x_{k + 1} - x_* \right\Vert_2 &= \left\Vert -\left( J_k^\top J_k + \mu_k I \right)^{-1} \left[ J_k^\top R_* - J_k^\top \left( R_* - R_k - J_k (x_* - x_k) \right) + \mu_k (x_* - x_k) \right] \right\Vert_2\\ &\leq \left\Vert \left( J_k^\top J_k + \mu_k I \right)^{-1} \right\Vert_2 \left[ \left\Vert J_k^\top R_* \right\Vert_2 + \left\Vert J_k^\top \right\Vert_2 \left\Vert R_* - R_k - J_k (x_* - x_k) \right\Vert_2 + \left\Vert \mu_k (x_* - x_k) \right\Vert_2 \right] & \quad & \text{(3.1.10) and Exercise 3.4}\\ &\leq \frac{c}{\lambda + b} \left[ (\sigma + b) \left\Vert x_k - x_* \right\Vert_2 + \frac{\alpha \gamma}{2} \left\Vert x_k - x_* \right\Vert_2^2 \right] & \quad & \text{(a.1), (10.2.4), (4.1.12), and Exercise 3.8}.\end{split}\]

Picking

\[\epsilon = \min \left\{ \epsilon_1, \frac{(\lambda + b) - c(\sigma + b)}{c \alpha \gamma} \right\}\]

simplifies the preceding result to

\[\begin{split}\left\Vert x_{k + 1} - x_* \right\Vert_2 &\leq \left\Vert x_k - x_* \right\Vert_2 \left[ \frac{c (\sigma + b)}{\lambda + b} + \frac{c \alpha \gamma}{2 (\lambda + b)} \left\Vert x_k - x_* \right\Vert_2 \right]\\ &\leq \left\Vert x_k - x_* \right\Vert_2 \left[ \frac{c (\sigma + b)}{\lambda + b} + \frac{(\lambda + b) - c(\sigma + b)}{2 (\lambda + b)} \right]\\ &\leq \left\Vert x_k - x_* \right\Vert_2 \frac{c (\sigma + b) + (\lambda + b)}{2 (\lambda + b)}\\ &< \left\Vert x_k - x_* \right\Vert_2.\end{split}\]

(b)

When \(R_* = 0\), \(\sigma\) in (10.2.4) can be set to zero. Substituting this into the results of (a) shows that \(\{ x_k \}\) converges to \(x_*\).

When \(\mu_k = \mathcal{O}(\left\Vert J_k^\top R_k \right\Vert_2)\),

\[\begin{split}\left\Vert \left( J_*^\top J_* + \mu_* I \right)^{-1} \right\Vert_2 &= \left\Vert \left( J_*^\top J_* + \mathcal{O}(\left\Vert J_*^\top R_* \right\Vert_2) I \right)^{-1} \right\Vert_2\\ &= \left\Vert \left( J_*^\top J_* \right)^{-1} \right\Vert_2\\ &\leq \frac{1}{\lambda} & \quad & \text{(a.1)}\end{split}\]

and consequently

\[\left\Vert \left( J_k^\top J_k + \mu_k I \right)^{-1} \right\Vert_2 \leq \frac{c}{\lambda} \quad \text{where} \quad c \in \left( 1, \frac{\lambda}{\sigma} \right) \text{ for } x_k \in N(x_*, \epsilon_1).\]

Applying the same procedures as in (a) gives

\[\begin{split}\left\Vert x_{k + 1} - x_* \right\Vert_2 &\leq \left\Vert \left( J_k^\top J_k + \mu_k I \right)^{-1} \right\Vert_2 \left[ \left\Vert J_k^\top R_* \right\Vert_2 + \left\Vert J_k^\top \right\Vert_2 \left\Vert R_* - R_k - J_k (x_* - x_k) \right\Vert_2 + \left\Vert \mu_k (x_* - x_k) \right\Vert_2 \right]\\ &\leq \frac{c}{\lambda} \left[ \mathcal{O}(\left\Vert J_k^\top R_k \right\Vert_2) \left\Vert x_k - x_* \right\Vert_2 + \frac{\alpha \gamma}{2} \left\Vert x_k - x_* \right\Vert_2^2 \right].\end{split}\]

Picking \(\epsilon = \min \{ \epsilon_1, \epsilon_2 \}\) so that (b.1) is satisfied (i.e. \(\mathcal{O}(\left\Vert J_k^\top R_k \right\Vert_2) \leq 0\)) proves that the convergence rate is q-quadratic.

(a.1)

Assuming \(\lambda > 0\) is the smallest eigenvalue of \(J_*^\top J_*\) (i.e., its spectral decomposition exists),

\[\begin{split}\left\Vert \left( J_*^\top J_* + \mu_* I \right)^{-1} \right\Vert_2 &= \left( \min_{v \in \mathbb{R}^n \setminus 0} \frac{ \left\Vert (J_*^\top J_* + \mu_* I) v \right\Vert_2 }{ \left\Vert v \right\Vert_2 } \right)^{-1} & \quad & \text{(3.1.18)}\\ &\leq \left( \min_{v \in \mathbb{R}^n \setminus 0} \frac{ \left\Vert J_*^\top J_* v \right\Vert_2 + \mu_* \left\Vert v \right\Vert_2 }{ \left\Vert v \right\Vert_2 } \right)^{-1} & \quad & \text{(3.1.1c) and (3.1.1b)}\\ &\leq \left( \min_{v \in \mathbb{R}^n \setminus 0} \frac{ \left\Vert J_*^\top J_* v \right\Vert_2 }{ \left\Vert v \right\Vert_2 } + b \right)^{-1} & \quad & \{ \mu_k \} < b \text{ and } b > 0\\ &\leq \frac{1}{\lambda + b} & \quad & \text{Exercise 3.6}.\end{split}\]

See Exercise 3.6 for more details on the last inequality relation.

(b.1)

\[\begin{split}\left\Vert \left( J_*^\top J_* + \mu_* I \right)^{-1} \left( J_k^\top J_k + \mu_k I - J_*^\top J_* \right) \right\Vert_2 &\leq \left\Vert \left( J_*^\top J_* + \mu_* I \right)^{-1} \right\Vert_2 \left\Vert J_k^\top J_k + \mu_k I - J_*^\top J_* \right\Vert_2\\ &\leq \frac{ \left\Vert J_k^\top J_k + \mu_k I - J_*^\top J_* \right\Vert_2 }{\lambda}\\ &\leq \frac{ \left\Vert J_k^\top J_k - J_*^\top J_* \right\Vert_2 + \mathcal{O}(\left\Vert J_k^\top R_k \right\Vert_2) }{\lambda}\\ &\leq 1\end{split}\]

implies that

\[\mathcal{O}(\left\Vert J_k^\top R_k \right\Vert_2) \leq \lambda - \left\Vert J_k^\top J_k - J_*^\top J_* \right\Vert_2.\]

Exercise 14

See Exercise 4 and Exercise 5 for code. The suggested secant method converges slower than (10.3.4) in Exercise 4 but wins out in Exercise 5.

The suggested secant method did not beat Newton's method in terms of convergence rate; (10.3.4) beats Newton's method only in the setting where Gauss-Newton dominates.

Exercise 15

See Exercise 5 for code.

Exercise 16

Define \(S = \{ T T^\top \mid T T^\top \in Q(y_c, s_c) \}\).

\[\begin{split}A_+ - A_c &= \frac{ (y^\#_c - A_c s_c) {y^\#_c}^\top + y^\#_c (y^\#_c - A_c s_c)^\top }{ {y^\#_c}^\top s_c } - \frac{ (y^\#_c - A_c s_c)^\top s_c y^\#_c {y^\#_c}^\top }{ ({y^\#_c}^\top s_c)^2 } & \quad & \text{(9.3.4) applied to (10.3.5)}\\ &= \frac{ (y^\#_c - A_c s_c) s_c^\top A_+^\top + A_+ s_c (y^\#_c - A_c s_c)^\top }{ s_c^\top A_+^\top s_c } - \frac{ (A_+ s_c - A_c s_c)^\top s_c A_+ s_c s_c^\top A_+^\top }{ (s_c^\top A_+^\top s_c)^2 } & \quad & A_+ s_c = y^\#_c\\ &= \frac{ (y_c^\# - A_c s_c) y_c^\top + y_c (y_c^\# - A_c s_c)^\top }{ y_c^\top s_c } - \frac{ (y_c^\# - A_c s_c)^\top s_c y_c y_c^\top }{ (y_c^\top s_c)^2 } & \quad & A_+ \in Q(y^\#_c, s_c) \cap S\end{split}\]

Exercise 17

Let \(x \in \mathbb{R}^n\), \(R \in \mathbb{R}^m\), and \(J \in \mathbb{R}^{m \times n}\) have full column rank i.e. \(J^\top J\) is nonsingular.

Parts (a) and (b) show that \(J\) applied to the solution of the linear least-squares problem, \(J (J^\top J)^{-1} J^\top R\), is the Euclidean projection of \(R\) onto the linear subspace spanned by the columns of \(J\).

\[\begin{split}R^\top J (J^\top J)^{-1} J^\top R &= \left\Vert R \right\Vert_2 \left\Vert J (J^\top J)^{-1} J^\top R \right\Vert_2 \cos(\phi) & \quad & \text{definition of dot product}\\ \cos(\phi) &= \frac{ R^\top J (J^\top J)^{-1} J^\top R }{ \left\Vert R \right\Vert_2 \left\Vert J (J^\top J)^{-1} J^\top R \right\Vert_2 }\end{split}\]

(a)

The Euclidean projection of a point \(x_0 \in \mathbb{R}^n\) on a set \(\mathcal{S} \subseteq \mathbb{R}^n\) is a point that achieves the smallest Euclidean distance from \(x_0\) to the set. Symbolically this means

\[\min_{x \in \mathcal{S}} \left\Vert x - x_0 \right\Vert_2.\]

(b)

Given

\[\begin{split}\min_x \left\Vert Jx - R \right\Vert_2^2 &= \min_x (Jx - R)^\top (Jx - R) & \quad & \text{(3.1.2c)}\\ &= \min_x x^\top J^\top J x - 2 R^\top J x + R^\top R,\end{split}\]

the minimizer is

\[\begin{split}0 &= \frac{\partial}{\partial x} \left( x^\top J^\top J x - 2 R^\top J x + R^\top R \right)\\ &= 2 J^\top J x - 2 J^\top R & \quad & \text{Matrix Cookbook (81) and (69)}\\ x &= (J^\top J)^{-1} J^\top R.\end{split}\]
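Numerically, \(P = J (J^\top J)^{-1} J^\top\) is the orthogonal projector onto the column space of \(J\), and \(Jx\) with \(x\) from (b) equals \(PR\). The J and R below are random, made-up inputs.

import numpy

rng = numpy.random.default_rng(1)
J = rng.standard_normal((6, 3))            # full column rank with probability 1
R = rng.standard_normal(6)

P = J @ numpy.linalg.inv(J.T @ J) @ J.T    # orthogonal projector onto range(J)
proj = P @ R
x = numpy.linalg.lstsq(J, R, rcond=None)[0]                # least-squares solution from (b)
cos_phi = (R @ proj) / (numpy.linalg.norm(R) * numpy.linalg.norm(proj))

print(numpy.allclose(proj, J @ x), cos_phi)                # True, and cos(phi) in [0, 1]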

Exercise 18

Let \(\alpha \in \mathbb{R}\), \(s \in \mathbb{R}^n\), and \(M \in \mathbb{R}^{n \times n}\). Suppose \(\alpha > 0\) and \(M\) is positive definite with Cholesky factorization \(M \triangleq L L^\top\).

(a)

\[\begin{split}\max_v \phi(v) = \max_v \frac{\alpha v^\top s}{(v^\top M v)^{1 / 2}} &= \max_v \frac{\alpha v^\top s}{\left\Vert L^\top v \right\Vert_2}\\ &\leq \max_v \frac{ \alpha \left\Vert v \right\Vert_2 \left\Vert s \right\Vert_2 }{ \left\Vert L^\top v \right\Vert_2 } & \quad & \text{Cauchy-Schwarz Inequality}\\ &\leq \alpha \left\Vert s \right\Vert_2 \left\Vert L^{-\top} \right\Vert_2 & \quad & \text{(3.1.18)}\\ &\leq \alpha \left\Vert s \right\Vert_2 \left\Vert L^{-\top} \right\Vert_F & \quad & \text{(3.1.12)}\end{split}\]

Applying (a.1) shows that the maximum value is \(\alpha \left( s^\top M^{-1} s \right)^{1 / 2}\).

(a.1)

\[\begin{split}\DeclareMathOperator{\tr}{tr} \alpha \left( s^\top M^{-1} s \right)^{1 / 2} &= \alpha \left\Vert L^{-1} s \right\Vert_2 & \quad & \text{(3.1.2c)}\\ &\leq \alpha \left\Vert L^{-1} \right\Vert_F \left\Vert s \right\Vert_2 & \quad & \text{(3.1.16)}\\ &= \alpha \left[ \tr(L^{-\top} L^{-1}) \right]^{1 / 2} \left\Vert s \right\Vert_2 & \quad & \text{(3.1.14)}\\ &= \alpha \left[ \tr(L^{-1} L^{-\top}) \right]^{1 / 2} \left\Vert s \right\Vert_2 & \quad & \tr(AB) = \tr(BA)\\ &= \alpha \left\Vert L^{-\top} \right\Vert_F \left\Vert s \right\Vert_2\end{split}\]

(b)

If \(M = (J^\top J)^{-1}\) and \(s = -M J^\top R\), then

\[\begin{split}\max_v \phi(v) &\leq \alpha \left( s^\top M^{-1} s \right)^{1 / 2} & \quad & \text{(10.5.2) and (a)}\\ &\leq \alpha \left( R^\top J M^\top J^\top J M J^\top R \right)^{1 / 2}\\ &\leq \alpha \left\Vert J M J^\top R \right\Vert_2.\end{split}\]

Let \(H \in \mathbb{R}^{n \times n}\) be \(\nabla^2 f(x)\) or an approximation of it. If \(M = H^{-1}\) and \(s = -M J^\top R\), then

\[\begin{split}\max_v \phi(v) &\leq \alpha \left( s^\top M^{-1} s \right)^{1 / 2} & \quad & \text{(10.5.2) and (a)}\\ &\leq \alpha \left( R^\top J H^{-\top} J^\top R \right)^{1 / 2}.\end{split}\]

It is interesting to note that (10.4.2) can be written in the form of (10.5.2) where

\[\begin{split}\begin{gather*} M = (J^\top J)^{-1},\\ s = -M J^\top R,\\ v = -J^\top R = -\nabla f(x), \text{ and}\\ \alpha = \left\Vert R \right\Vert_2^{-1}. \end{gather*}\end{split}\]

Applying the results in (a) shows that (10.5.4) is an upper bound for (10.4.2).
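A small numerical check of the maximum value claimed in (a): with \(M\) positive definite, \(\phi(v)\) attains \(\alpha \left( s^\top M^{-1} s \right)^{1 / 2}\) at \(v = M^{-1} s\), and random directions never exceed it. The \(\alpha\), \(M\), and \(s\) below are made up.

import numpy

rng = numpy.random.default_rng(2)
alpha = 1.5
A = rng.standard_normal((3, 3))
M = A @ A.T + numpy.eye(3)                 # symmetric positive definite by construction
s = rng.standard_normal(3)

phi = lambda v: alpha * (v @ s) / numpy.sqrt(v @ M @ v)
bound = alpha * numpy.sqrt(s @ numpy.linalg.solve(M, s))

v_star = numpy.linalg.solve(M, s)          # maximizer v = M^{-1} s
print(numpy.isclose(phi(v_star), bound))                                       # True
print(max(phi(rng.standard_normal(3)) for _ in range(1000)) <= bound + 1e-12)  # True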

Exercise 19

Define \(R \colon \mathbb{R}^n \mapsto \mathbb{R}^m\) with \(m > n\). Let \(x = \begin{bmatrix} u\\ v \end{bmatrix}\), \(u \in \mathbb{R}^{n - p}\), \(v \in \mathbb{R}^{p}\) with \(0 < p < n\), and \(R(x) = N(v) u - b\) where \(N \colon \mathbb{R}^p \rightarrow \mathbb{R}^{m \times (n - p)}\) and \(b \in \mathbb{R}^m\).

\[\begin{split}f(x) &= \frac{1}{2} R(x)^\top R(x)\\ &= \frac{1}{2} (N(v) u - b)^\top (N(v) u - b)\\ &= \frac{1}{2} \left[ u^\top N(v)^\top N(v) u - u^\top N(v)^\top b - b^\top N(v) u + b^\top b \right]\end{split}\]

(a)

\[\begin{split}0 &= \frac{\partial }{\partial u} f(x)\\ &= N(v)^\top N(v) u - N(v)^\top b & \quad & \text{Matrix Cookbook (81) and (69)}\\ (V D^\top U^\top) (U D V^\top) u &= V D^\top U^\top b & \quad & \text{Theorem 3.6.4}\\ u &= V D^+ U^\top b & \quad & \text{(3.6.6)}\\ &= N(v)^+ b & \quad & \text{Theorem 3.6.5}\end{split}\]

(b)

\[\begin{split}\min_x f(x) &= \min_v \frac{1}{2} \left( b^\top {N(v)^+}^\top N(v)^\top N(v) N(v)^+ b - b^\top {N(v)^+}^\top N(v)^\top b - b^\top N(v) N(v)^+ b + b^\top b \right) & \quad & \text{(a)}\\ &= \min_v \left\Vert N(v) N(v)^+ b - b \right\Vert_2^2 & \quad & \text{(3.1.2c)}\end{split}\]
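The separable structure can be exploited directly in code: for each \(v\) the optimal \(u\) is \(N(v)^+ b\), so only \(v\) has to be searched over. The model \(N(v)\), the synthetic data \(b\) (generated with known parameters), and the crude grid search below are made up for illustration and are not the Golub-Pereyra algorithm of [GP73].

import numpy

t = numpy.linspace(0.0, 2.0, 10)
b = 2.0 * numpy.exp(-1.3 * t) + 0.5 * t * numpy.exp(-1.3 * t)   # synthetic data, v = 1.3, u = (2, 0.5)

def N(v):
    # linear part of the separable model R(x) = N(v) u - b
    return numpy.column_stack([numpy.exp(-v * t), t * numpy.exp(-v * t)])

def reduced_objective(v):
    # part (b): || N(v) N(v)^+ b - b ||^2 after eliminating u = N(v)^+ b
    Nv = N(v)
    return numpy.linalg.norm(Nv @ numpy.linalg.pinv(Nv) @ b - b)**2

vs = numpy.linspace(0.5, 2.0, 301)                              # crude grid over the nonlinear parameter
v_best = vs[numpy.argmin([reduced_objective(v) for v in vs])]
u_best = numpy.linalg.pinv(N(v_best)) @ b                       # part (a): u = N(v)^+ b
print(v_best, u_best)                                           # approximately 1.3 and (2.0, 0.5)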

Exercise 20

The answer to this problem may be located in the section after (3.5) of [GP73] and in this book's Chapter 5. Unfortunately, my current (2015) knowledge does not allow me to comprehend this 100%; however, I do feel that it is within my reach to fully dissect this succinct and lucid paper. The first item to understand is Fréchet derivatives.

Two things I learned from reading this are:

  • The rank of the matrix must be locally constant at the point where the derivative is calculated; otherwise the pseudo-inverse is not a continuous function (see the sketch after this list).

  • Based on the convergence results, the variable projection via modified Marquardt method should definitely be used when appropriate.
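A tiny, made-up example of the first point above: as \(\epsilon \to 0\) the matrix below loses rank, its pseudo-inverse blows up instead of converging, and the pseudo-inverse of the limit is something else entirely.

import numpy

for eps in [1e-1, 1e-3, 1e-6, 0.0]:
    A = numpy.array([[1.0, 0.0],
                     [0.0, eps]])
    # While rank(A) == 2 the pseudo-inverse has norm 1/eps; at eps == 0 it drops to norm 1.
    print(eps, numpy.linalg.matrix_rank(A), numpy.linalg.norm(numpy.linalg.pinv(A), 2))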

Exercise 21

There are existing Levenberg-Marquardt implementations (e.g. Ceres, Eigen), so a more interesting venture would be to implement the secant approximation. However, such a thing is not of immediate interest at the moment.

References

Bin

Birne Binegar. Higher order derivatives and Taylor expansions. https://math.okstate.edu/people/binegar/4013-U98/4013-l10.pdf. Accessed 2017-05-31.

Fol

G. A. Folland. Higher-order derivatives and Taylor's formula in several variables. https://sites.math.washington.edu/~folland/Math425/taylor2.pdf. Accessed 2017-05-31.

GP73

Gene H. Golub and Victor Pereyra. The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate. SIAM Journal on Numerical Analysis, 10(2):413–432, 1973.