Multivariable Calculus Background
Derivatives and Multivariable Models
Important results from single-variable calculus (e.g. the mean value theorem)
carry over to real-valued functions of several variables because they compare
the values of a real-valued function at two points
\(x, \bar{x} \in \mathbb{R}^n\), and the line segment connecting those points
can be parameterized by a single variable; the resulting multivariable
statements are expressed in terms of the gradient and the Hessian.
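For example, assuming \(f\) is continuously differentiable on an open set
containing the segment, applying the single-variable mean value theorem to
\(\phi(t) = f(x + t(\bar{x} - x))\) gives, for some \(\xi \in (0, 1)\),
\[f(\bar{x}) - f(x) = \nabla f\left(x + \xi (\bar{x} - x)\right)^\top (\bar{x} - x).\]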
There is, however, no analogous mean value theorem for continuously
differentiable vector-valued functions, whose derivative is the Jacobian.
Exercise 1
\[\begin{split}F(x) =
\begin{bmatrix}
x_1^2 + x_2^3 - x_3^4\\
x_1 x_2 x_3\\
2 x_1 x_2 - 3 x_2 x_3 + x_1 x_3\\
\end{bmatrix}\end{split}\]
\[\begin{split}J(x) =
\begin{bmatrix}
2 x_1 & 3 x_2^2 & - 4 x_3^3\\
x_2 x_3 & x_1 x_3 & x_1 x_2\\
2 x_2 + x_3 & 2 x_1 - 3 x_3 & -3 x_2 + x_1\\
\end{bmatrix}\end{split}\]
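A quick numerical sanity check of this Jacobian (a minimal sketch using NumPy;
the test point and step size are arbitrary choices, not part of the exercise):

```python
import numpy as np

def F(x):
    x1, x2, x3 = x
    return np.array([x1**2 + x2**3 - x3**4,
                     x1 * x2 * x3,
                     2*x1*x2 - 3*x2*x3 + x1*x3])

def J(x):
    x1, x2, x3 = x
    return np.array([[2*x1,      3*x2**2,     -4*x3**3],
                     [x2*x3,     x1*x3,        x1*x2],
                     [2*x2 + x3, 2*x1 - 3*x3, -3*x2 + x1]])

x = np.array([1.0, 2.0, 3.0])   # arbitrary test point
h = 1e-6
# forward-difference approximation, one column per coordinate direction
A = np.column_stack([(F(x + h*e) - F(x)) / h for e in np.eye(3)])
print(np.max(np.abs(A - J(x))))  # should be small (forward-difference error is O(h))
```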
Exercise 2
Let \(f(x) = x_1^3 x_2^4 + \frac{x_2}{x_1}\).
\[\begin{split}\nabla f(x) =
\begin{bmatrix}
\frac{\partial f}{\partial x_1}(x)\\
\frac{\partial f}{\partial x_2}(x)\\
\end{bmatrix} =
\begin{bmatrix}
3 x_1^2 x_2^4 - x_2 (x_1)^{-2}\\
4 x_1^3 x_2^3 + (x_1)^{-1}
\end{bmatrix}\end{split}\]
\[\begin{split}\nabla^2 f(x) =
\begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2}(x) &
\frac{\partial^2 f}{\partial x_1 \partial x_2}(x)\\
\frac{\partial^2 f}{\partial x_2 \partial x_1}(x) &
\frac{\partial^2 f}{\partial x_2^2}(x)\\
\end{bmatrix} =
\begin{bmatrix}
6 x_1 x_2^4 + 2 x_2 (x_1)^{-3} &
12 x_1^2 x_2^3 - (x_1)^{-2}\\
12 x_1^2 x_2^3 - (x_1)^{-2} &
12 x_1^3 x_2^2
\end{bmatrix}\end{split}\]
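The same gradient and Hessian can be reproduced symbolically (a minimal sketch
using SymPy, not part of the original exercise):

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
f = x1**3 * x2**4 + x2 / x1

grad = sp.Matrix([sp.diff(f, x1), sp.diff(f, x2)])  # gradient
hess = sp.hessian(f, (x1, x2))                      # Hessian
print(sp.simplify(grad))
print(sp.simplify(hess))
```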
Exercise 3
Define \(x(t) = x + tp\) and \(g(t) = f(x(t)) = f(x + tp)\) where
\(t \in [0, 1]\).
For \(0 \leq \alpha \leq 1\), applying the chain rule twice yields
\[\begin{split}\frac{d g(t)}{d t}[\alpha]
&= \sum_{j = 1}^n \frac{\partial f(x(t))}{\partial x(t)_j}[x(\alpha)]
\frac{d x(t)_j}{d t}[\alpha]
& \quad & \text{definition of total differential}\\
&= \sum_{j = 1}^n \frac{\partial}{\partial x_j} f[x(\alpha)] p_j\\
&= \nabla f[x + \alpha p] \cdot p\end{split}\]
and
\[\begin{split}\frac{d^2 g}{d t^2}[\alpha]
&= \frac{d}{dt}
\left(
\sum_{j = 1}^n \frac{\partial}{\partial x_j} f[x(\alpha)] p_j
\right)\\
&= \sum_{i = 1}^n \sum_{j = 1}^n
\frac{\partial^2}{\partial x_i \partial x_j} f[x(\alpha)] p_i p_j
& \quad & \text{definition of total differential}\\
&= p^\top \nabla^2 f[x(\alpha)] p.\end{split}\]
Substituting \(\alpha = 0\) yields
\(\frac{\partial^2 f}{\partial p^2} = p^\top \nabla^2 f(x) p\).
Applying the first fundamental theorem of calculus twice, followed by the mean
value theorem for integrals (\(g''\) is continuous and \(1 - t_2 \geq 0\)), yields
\[\begin{split}g(1) &= g(0) + \int_0^1 g'(t_1) dt_1\\
&= g(0) + \int_0^1 \left( g'(0) + \int_0^{t_1} g''(t_2) dt_2 \right) dt_1
& \quad & \text{since } g'(t_1) - g'(0) = \int_0^{t_1} g''(t_2) dt_2\\
&= g(0) + \int_0^1 g'(0) dt_1 + \int_0^1 dt_1 \int_0^{t_1} g''(t_2) dt_2\\
&= g(0) + g'(0) (1 - 0) + \int_0^1 g''(t_2) dt_2 \int_{t_2}^1 dt_1
& \quad & \text{changed order of integration in the double integral}\\
&= g(0) + g'(0) + \int_0^1 (1 - t_2) g''(t_2) dt_2\\
&= g(0) + g'(0) + g''(\xi) \int_0^1 (1 - t_2) dt_2
& \quad & \text{Mean Value Theorem for integrals with } \xi \in (0,1)\\
&= g(0) + g'(0) + \frac{1}{2} g''(\xi)\\
f(x + p)
&= f(x) + \nabla f(x)^\top p + \frac{1}{2} p^\top \nabla^2 f(x + \xi p) p.\end{split}\]
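As a numerical illustration of
\(\frac{d^2}{dt^2} f(x + tp) \big\vert_{t = 0} = p^\top \nabla^2 f(x) p\)
(a minimal sketch; the test function is borrowed from Exercise 2 and the point
and direction are arbitrary choices):

```python
import numpy as np

f = lambda x: x[0]**3 * x[1]**4 + x[1] / x[0]     # f from Exercise 2
def hess(x):
    x1, x2 = x
    return np.array([[6*x1*x2**4 + 2*x2/x1**3, 12*x1**2*x2**3 - 1/x1**2],
                     [12*x1**2*x2**3 - 1/x1**2, 12*x1**3*x2**2]])

x = np.array([1.0, 2.0])
p = np.array([0.3, -0.7])                # arbitrary direction
g = lambda t: f(x + t * p)
h = 1e-4
d2g = (g(h) - 2*g(0.0) + g(-h)) / h**2   # central difference for g''(0)
print(d2g, p @ hess(x) @ p)              # the two values should agree closely
```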
Exercise 4
Invoking Lemma 4.2.1 for each component function in
\(\left\{ f_i \right\}_{i = 1}^n\) yields
\[\begin{split}\lim_{t \rightarrow 0} \frac{f_i(x + tp) - f_i(x)}{t}
&= \nabla f_i(x)^\top p\\
\lim_{t \rightarrow 0} \frac{f_i(x + tp) - f_i(x)}{t} - \nabla f_i(x)^\top p
&= 0\\
\left\vert \lim_{t \rightarrow 0}
\frac{f_i(x + tp) - f_i(x)}{t} - \nabla f_i(x)^\top p
\right\vert &= 0\\
\lim_{t \rightarrow 0}
\left\vert
\frac{f_i(x + tp) - f_i(x)}{t} - \nabla f_i(x)^\top p
\right\vert
&= 0 & \quad & \lvert \cdot \rvert \text{ is continuous.}\end{split}\]
Since \(F\) is assumed to be continuously differentiable in an open set
\(D\) containing \(x\), its derivative satisfies
\(F'(x) = \nabla F(x)^\top = J(x)\), and the componentwise statements above can
be collected into vector form:
\[\begin{split}\lim_{t \rightarrow 0} \frac{F(x + tp) - F(x)}{t}
&= \nabla F(x)^\top p\\
&= J(x) p\\
\lim_{t \rightarrow 0} \frac{F(x + tp) - F(x)}{t} - J(x) p &= \vec{0}\\
\left\Vert
\lim_{t \rightarrow 0} \frac{F(x + tp) - F(x) - J(x)tp}{t}
\right\Vert &= \left\Vert 0 \right\Vert\\
\lim_{t \rightarrow 0} \frac{\lVert F(x + tp) - F(x) - J(x)tp \rVert}{t}
&= 0 & \quad & \lVert \cdot \rVert \text{ is continuous.}\end{split}\]
Thus \(J(x) = \hat{J}(x)\).
Exercise 5
A useful fact to note is that if \(I \subset \mathbb{R}\) is an interval and
\(f \colon I \mapsto \mathbb{R}\) is differentiable on \(I\), then
\(f\) is Lipschitz on \(I\) if and only if \(f'\) is bounded on
\(I\).
Clearly, if \(f\) is Lipschitz on \(I\) with constant \(\gamma\) (4.1.11), then
\(\lvert f'(x) \rvert =
\lim_{y \rightarrow x} \frac{\lvert f(y) - f(x) \rvert}{\lvert y - x \rvert}
\leq \gamma\) for every \(x \in I\).
Conversely, suppose \(\lvert f'(x) \rvert \leq M\) for all \(x \in I\) and some
\(M \in \mathbb{R}\). Let \(x\) and \(y\) be any two distinct
elements of \(I\). Applying the Mean Value Theorem gives
\[\begin{split}f'(w) &= \frac{f(x) - f(y)}{x - y}\\
\lvert f'(w) \rvert &= \left\vert \frac{f(x) - f(y)}{x - y} \right\vert\\
M \lvert x - y \rvert &\geq \left\vert f(x) - f(y) \right\vert\end{split}\]
where \(w \in (x, y)\). Thus, \(f\) is Lipschitz on \(I\) with
\(\gamma = M\).
In this exercise, \(f(x) = x \sin x^{-1}\) with \(f(0) = 0\),
\(I = (-\infty, \infty)\), and, for \(x \neq 0\),
\(f'(x) = \sin(x^{-1}) - \frac{\cos x^{-1}}{x}\). The second term is unbounded
as \(x \rightarrow 0\) (for example,
\(f'\left( \frac{1}{2 \pi n} \right) = -2 \pi n\)), so the bounded-derivative
criterion above only shows that \(f\) is Lipschitz continuous on intervals
bounded away from zero; on any interval containing zero the difference
quotients of \(f\) are unbounded, and \(f\) is not Lipschitz continuous there.
Moreover, \(f\) is not differentiable at \(x = 0\) because the defining limit
\[\begin{split}f'(0)
&= \lim_{\epsilon \rightarrow 0} \frac{f(0 + \epsilon) - f(0)}{\epsilon}\\
&= \lim_{\epsilon \rightarrow 0}
\frac{\epsilon \sin(\epsilon^{-1})}{\epsilon}\\
&= \lim_{\epsilon \rightarrow 0} \sin \frac{1}{\epsilon}\end{split}\]
does not exist: \(\sin(1/\epsilon)\) oscillates between \(-1\) and \(1\) as
\(\epsilon \rightarrow 0\).
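A short numerical illustration of the unbounded difference quotients near the
origin (a minimal sketch; the sample points are chosen so that \(f\) vanishes
at one point and equals its argument at the other):

```python
import numpy as np

f = lambda x: x * np.sin(1.0 / x) if x != 0 else 0.0

# difference quotients between nearby points approaching 0
for n in [1, 10, 100, 1000]:
    x = 1.0 / (2 * np.pi * n)              # f(x) = 0
    y = 1.0 / (2 * np.pi * n + np.pi / 2)  # f(y) = y
    print(abs(f(x) - f(y)) / abs(x - y))   # grows roughly like 4n
```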
Exercise 6
Let \(g(t) = f(x + tp)\) where \(t \in [0, 1]\) and
\(f \colon \mathbb{R}^n \mapsto \mathbb{R}\).
\[\begin{split}g(1) - g(0) &= \int_0^1 g'(t_1) dt_1\\
g(1) - g(0) - g'(0) &= \int_0^1 g'(t_1) - g'(0) dt_1\\
&= \int_0^1 \int_0^{t_1} g''(t_2) dt_2 dt_1
& \quad & g'(t_1) - g'(0) = \int_0^{t_1} g''(t_2) dt_2\\
g(1) - g(0) - g'(0) - \int_0^1 \int_0^{t_1} g''(0) dt_2 dt_1
&= \int_0^1 \int_0^{t_1} g''(t_2) dt_2 dt_1 -
\int_0^1 \int_0^{t_1} g''(0) dt_2 dt_1\\
g(1) - g(0) - g'(0) - \frac{1}{2} g''(0)
&= \int_0^1 \int_0^{t_1} g''(t_2) - g''(0) dt_2 dt_1\\
\left\vert g(1) - g(0) - g'(0) - \frac{1}{2} g''(0) \right\vert
&= \left\vert
\int_0^1 \int_0^{t_1} g''(t_2) - g''(0) dt_2 dt_1
\right\vert\\
&\leq \int_0^1 \int_0^{t_1}
\left\vert g''(t_2) - g''(0) \right\vert dt_2 dt_1
& \quad & \text{double integral application of Lemma 4.1.10}\\
\left\vert
f(x + p) - f(x) - \nabla f(x)^\top p - \frac{1}{2} p^\top \nabla^2 f(x) p
\right\vert
&\leq \int_0^1 \int_0^{t_1} \left\Vert
p^\top \left[ \nabla^2 f(x + t_2 p) - \nabla^2 f(x) \right] p
\right\Vert dt_2 dt_1
& \quad & g''(t_2) = p^\top \nabla^2 f(x + t_2 p) p\\
&= \int_0^1 \int_0^{t_1} \left\Vert
\sum_i \sigma_i p^\top u_i v_i^\top p
\right\Vert dt_2 dt_1
& \quad & \text{apply Theorem 3.6.4 to Hessian difference}\\
&= \int_0^1 \int_0^{t_1} \left\Vert
\sum_i \sigma_i \sum_j \sum_k p_j u_{ij} v_{ik} p_k
\right\Vert dt_2 dt_1\\
&\leq \left\Vert p \right\Vert^2_\infty \int_0^1 \int_0^{t_1}
\left\vert
\sum_i \sigma_i \sum_j \sum_k u_{ij} v_{ik}
\right\vert dt_2 dt_1\\
&\leq \left\Vert p \right\Vert^2_\infty \int_0^1 \int_0^{t_1} n
\left\Vert
\nabla^2 f(x + t_2 p) - \nabla^2 f(x)
\right\Vert_\infty dt_2 dt_1\\
&\leq \left\Vert p \right\Vert^2 \int_0^1 \int_0^{t_1} n
\left\Vert \nabla^2 f(x + t_2 p) - \nabla^2 f(x) \right\Vert dt_2 dt_1
& \quad & \text{(3.1.5)}\\
&\leq \left\Vert p \right\Vert^2 \int_0^1 \int_0^{t_1} \gamma
\left\Vert t_2 p \right\Vert dt_2 dt_1
& \quad & \text{(4.1.11)}\\
&= \gamma \left\Vert p \right\Vert^3 \int_0^1 \int_0^{t_1} t_2 dt_2 dt_1\\
&= \frac{\gamma}{6} \left\Vert p \right\Vert^3.\end{split}\]
The additional factor of \(n\) is absorbed into \(\gamma\). Since the
definition of Lipschitz continuity only requires a norm to be defined over
\(\mathbb{R}^{m \times n}\), there is no obstacle to using a matrix norm here;
the book focuses on \(l_p\)-norms where \(p \geq 1\).
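A numerical spot-check of the final bound
\(\frac{\gamma}{6} \left\Vert p \right\Vert^3\) (a minimal sketch; the cubic
test function is an assumption chosen because its Hessian
\(6 \, \mathrm{diag}(x)\) is Lipschitz continuous with \(\gamma = 6\) in the
2-norm):

```python
import numpy as np

f = lambda x: np.sum(x**3)          # Hessian 6*diag(x) is Lipschitz with gamma = 6
grad = lambda x: 3 * x**2
hess = lambda x: 6 * np.diag(x)

rng = np.random.default_rng(0)
x, p = rng.normal(size=4), rng.normal(size=4)
lhs = abs(f(x + p) - f(x) - grad(x) @ p - 0.5 * p @ hess(x) @ p)
rhs = (6 / 6) * np.linalg.norm(p)**3     # gamma/6 * ||p||^3 with gamma = 6
print(lhs <= rhs)                        # expected: True
```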
Exercise 7
\[\begin{split}F(v) - F(u) &= \int_u^v F'(z) dz & \quad & \text{(4.1.8)}\\
&= \int_0^1 J(u + t(v - u)) (v - u) dt
& \quad & \text{change of variables where }
z = u + t(v - u) \text{ and } dz = (v - u) dt\\
F(v) - F(u) - J(x) (v - u)
&= \int_0^1 J(u + t(v - u)) (v - u) dt - J(x) (v - u)\\
\left\Vert F(v) - F(u) - J(x) (v - u) \right\Vert
&= \left\Vert
\int_0^1 \left[ J(u + t(v - u)) - J(x) \right] (v - u) dt
\right\Vert\\
&\leq \int_0^1 \left\Vert
\left[ J(u + t(v - u)) - J(x) \right] (v - u)
\right\Vert dt
& \quad & \text{(4.1.9)}\\
&\leq \int_0^1 \left\Vert J(u + t(v - u)) - J(x) \right\Vert
\left\Vert v - u \right\Vert dt
& \quad & \text{Theorem 3.1.3}\\
&\leq \left\Vert v - u \right\Vert
\int_0^1 \gamma \left\Vert u + t(v - u) - x \right\Vert dt
& \quad & \text{(4.1.11)}\\
&= \gamma \left\Vert v - u \right\Vert \int_0^1
\left\Vert u + t(v - u) - \left[ tx + (1 - t)x \right] \right\Vert dt\\
&= \gamma \left\Vert v - u \right\Vert \int_0^1
\left\Vert tv - tx + (1 - t)u - (1 - t)x \right\Vert dt\\
&\leq \gamma \left\Vert v - u \right\Vert
\int_0^1 t \left\Vert v - x \right\Vert +
(1 - t) \left\Vert u - x \right\Vert dt
& \quad & \text{triangle inequality}\\
&= \gamma \left\Vert v - u \right\Vert
\frac{\left\Vert v - x \right\Vert + \left\Vert u - x \right\Vert}{2}\end{split}\]
Exercise 8
\[\begin{split}F(x + p) - F(x) &= \int_0^1 J(x + tp) p dt\\
F(x + p) - F(x) - J(x) p &= \int_0^1 J(x + tp) p dt - J(x) p\\
\left\Vert F(x + p) - F(x) - J(x) p \right\Vert
&= \left\Vert \int_0^1 \left[ J(x + tp) - J(x) \right] p dt \right\Vert\\
&\leq \int_0^1 \left\Vert \left[ J(x + tp) - J(x) \right] p \right\Vert dt
& \quad & \text{(4.1.9)}\\
&\leq \int_0^1 \left\Vert J(x + tp) - J(x) \right\Vert
\left\Vert p \right\Vert dt
& \quad & \text{Theorem 3.1.3}\\
&\leq \left\Vert p \right\Vert
\int_0^1 \gamma \left\Vert tp \right\Vert^\alpha dt
& \quad &
\text{Hölder continuity of } J \text{ with exponent } \alpha\\
&= \gamma \left\Vert p \right\Vert^{1 + \alpha} \int_0^1 t^\alpha dt\\
&= \left\Vert p \right\Vert^{1 + \alpha} \frac{\gamma}{1 + \alpha}\end{split}\]
Exercise 9
\[\begin{split}F(x) &= \begin{pmatrix} 3x_1^2 - 2x_2\\ x_2^3 - \frac{1}{x_1} \end{pmatrix}\\
J(x) &= \begin{pmatrix} 6x_1 & -2\\ \frac{1}{x_1^2} & 3x_2^2 \end{pmatrix}\end{split}\]
When \(x = \begin{pmatrix} 1 & 1 \end{pmatrix}^\top\),
\[\begin{split}J(x) = \begin{pmatrix} 6 & -2\\ 1 & 3 \end{pmatrix}.\end{split}\]
Using the forward difference approximation where \(h = 0.01\) yields
\[\begin{split}F(x) = \begin{pmatrix} 1\\ 0 \end{pmatrix}
\quad & \quad
F(x + h e_1) = \begin{pmatrix} 1.0603\\ 0.0099 \end{pmatrix}
\quad & \quad
F(x + h e_2) = \begin{pmatrix} 0.98\\ 0.0303 \end{pmatrix}\end{split}\]
resulting in \(A = \begin{pmatrix} 6.03 & -2\\ 0.99 & 3.03 \end{pmatrix}\).
The error of the approximation measured in the \(l_1\)-norm is
\(\left\Vert A - J(x) \right\Vert_1 = 0.04\); since the forward-difference
error in each column is bounded by \(\frac{\gamma h}{2}\), this implies
\(\gamma \geq 8\).
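The numbers above can be reproduced directly (a minimal sketch using NumPy):

```python
import numpy as np

F = lambda x: np.array([3*x[0]**2 - 2*x[1], x[1]**3 - 1/x[0]])
J = lambda x: np.array([[6*x[0], -2.0], [1/x[0]**2, 3*x[1]**2]])

x, h = np.array([1.0, 1.0]), 0.01
A = np.column_stack([(F(x + h*e) - F(x)) / h for e in np.eye(2)])
print(A)                            # approx [[6.03, -2.0], [0.99, 3.03]]
print(np.linalg.norm(A - J(x), 1))  # approx 0.04 (max absolute column sum)
```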
Exercise 10
Define
\[\alpha =
f(x + he_i) - f(x) - h \nabla f(x)_i - \frac{h^2}{2} \nabla^2 f(x)_{ii}\]
and
\[\beta =
f(x - he_i) - f(x) + h \nabla f(x)_i - \frac{h^2}{2} \nabla^2 f(x)_{ii},\]
where \(\left\vert \alpha \right\vert, \left\vert \beta \right\vert \leq
\frac{\gamma h^3}{6}\) by (4.1.13).
\[\begin{split}\alpha + \beta
&= f(x + he_i) - 2f(x) + f(x - he_i) - h^2 \nabla^2 f(x)_{ii}\\
\left\vert \alpha + \beta \right\vert
&= \left\vert
f(x + he_i) - 2f(x) + f(x - he_i) - h^2 \nabla^2 f(x)_{ii}
\right\vert\\
&\leq \frac{\gamma h^3}{3} \qquad \text{triangle inequality}\end{split}\]
Thus, \(\left| A_{ii} - \nabla^2 f(x)_{ii} \right| \leq \frac{\gamma h}{3}\)
where \(A_{ii} = \frac{f(x + he_i) - 2f(x) + f(x - he_i)}{h^2}\).
Exercise 11
Substituting \(p = he_i + he_j, p = he_i - he_j, p = -he_i + he_j,\) and
\(p = -he_i - he_j\) into Lemma 4.1.14 yields
\[\begin{split}\alpha
&= f(x + he_i + he_j) - f(x) - h \nabla f(x)_i - h \nabla f(x)_j -
\frac{h^2}{2} \left[
\nabla^2 f(x)_{ii} + 2 \nabla^2 f(x)_{ij} + \nabla^2 f(x)_{jj}
\right]\\
\beta
&= f(x + he_i - he_j) - f(x) - h \nabla f(x)_i + h \nabla f(x)_j -
\frac{h^2}{2} \left[
\nabla^2 f(x)_{ii} - 2 \nabla^2 f(x)_{ij} + \nabla^2 f(x)_{jj}
\right]\\
\eta
&= f(x - he_i + he_j) - f(x) + h \nabla f(x)_i - h \nabla f(x)_j -
\frac{h^2}{2} \left[
\nabla^2 f(x)_{ii} - 2 \nabla^2 f(x)_{ij} + \nabla^2 f(x)_{jj}
\right]\\
\nu
&= f(x - he_i - he_j) - f(x) + h \nabla f(x)_i + h \nabla f(x)_j -
\frac{h^2}{2} \left[
\nabla^2 f(x)_{ii} + 2 \nabla^2 f(x)_{ij} + \nabla^2 f(x)_{jj}
\right],\end{split}\]
where each of \(\left\vert \alpha \right\vert, \left\vert \beta \right\vert,
\left\vert \eta \right\vert, \left\vert \nu \right\vert\) is bounded by
\(\frac{\gamma}{6} \left\Vert p \right\Vert^3\) for the corresponding \(p\).
By the triangle inequality,
\[\begin{split}\left| \alpha - \beta - \eta + \nu \right|
&\leq \left| \alpha \right| + \left| \beta \right| +
\left| \eta \right| + \left| \nu \right|\\
&\leq \frac{\gamma h^3}{6} \left[
\left\Vert e_i + e_j \right\Vert^3 +
\left\Vert e_i - e_j \right\Vert^3 +
\left\Vert -e_i + e_j \right\Vert^3 +
\left\Vert -e_i - e_j \right\Vert^3
\right]\\
&\leq \frac{16}{3} \gamma h^3
& \quad & \left\Vert \pm e_i \pm e_j \right\Vert \leq 2.\end{split}\]
Solving for \(\nabla^2 f(x)_{ij}\) yields
\[\begin{split}\left| \alpha - \beta - \eta + \nu \right|
&= \left|
f(x + he_i + he_j) - f(x + he_i - he_j) - f(x - he_i + he_j) +
f(x - he_i - he_j) - 4h^2 \nabla^2 f(x)_{ij}
\right|\\
&\leq \frac{16}{3} \gamma h^3.\end{split}\]
Hence
\(\left| A_{ij} - \nabla^2 f(x)_{ij} \right| \leq \frac{4}{3} \gamma h\)
where
\[A_{ij} =
\frac{f(x + he_i + he_j) - f(x + he_i - he_j) -
f(x - he_i + he_j) + f(x - he_i - he_j)}{4h^2}.\]
The number of function evaluations is unchanged (unless \(f(x)\) has already
been computed), but this new approximation reduces the upper bound on the error
by \(20\%\).
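A sketch of both finite-difference Hessian formulas from Exercises 10 and 11
(the test function is borrowed from Exercise 2; the point and step size are
arbitrary choices):

```python
import numpy as np

f = lambda x: x[0]**3 * x[1]**4 + x[1] / x[0]   # f from Exercise 2
x, h, n = np.array([1.0, 2.0]), 1e-3, 2
e = np.eye(n)

A = np.empty((n, n))
for i in range(n):
    # diagonal entries: central second difference (Exercise 10)
    A[i, i] = (f(x + h*e[i]) - 2*f(x) + f(x - h*e[i])) / h**2
    for j in range(i):
        # off-diagonal entries: four-point formula (Exercise 11)
        A[i, j] = A[j, i] = (f(x + h*e[i] + h*e[j]) - f(x + h*e[i] - h*e[j])
                             - f(x - h*e[i] + h*e[j]) + f(x - h*e[i] - h*e[j])) / (4*h**2)
print(A)   # compare with the exact Hessian from Exercise 2: [[100, 95], [95, 48]]
```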
Exercise 12
Define \(A, M \in \mathbb{R}^{n \times n}\) where \(M = M^\top\) and
\(\hat{A} = \frac{A + A^\top}{2}\).
\[\begin{split}\left\Vert M - \hat{A} \right\Vert_F^2
&= \text{tr}\left(
\left( M - \hat{A} \right)^\top \left( M - \hat{A} \right)
\right)\\
&= \text{tr}\left(M^\top M\right) - \text{tr}\left(M^\top \hat{A}\right) -
\text{tr}\left(\hat{A}^\top M\right) +
\text{tr}\left(\hat{A}^\top \hat{A}\right)
& \quad & \text{trace operator is a linear mapping}\\
&= \text{tr}\left(M^\top M\right) - 2\text{tr}\left(M^\top \hat{A}\right) +
\text{tr}\left(\hat{A}^\top \hat{A}\right)
& \quad & \text{trace equivalence under transpose}\\
&= \text{tr}\left(M^\top M\right) - \text{tr}\left(M^\top A\right) -
\text{tr}\left(M^\top A^\top\right) +
\text{tr}\left(\hat{A}^\top \hat{A}\right)\\
&= \text{tr}\left(M^\top M\right) - \text{tr}\left(M^\top A\right) -
\text{tr}\left(A^\top M\right) +
\text{tr}\left(\hat{A}^\top \hat{A}\right)
& \quad & \text{trace is invariant under cyclic permutations}\\
&= \text{tr}\left(M^\top M\right) - 2 \text{tr}\left(M^\top A\right) +
\frac{1}{4} \left\Vert A + A^\top \right\Vert_F^2\\
&\leq \text{tr}\left(M^\top M\right) - 2 \text{tr}\left(M^\top A\right) +
\left\Vert A \right\Vert_F^2
& \quad & \text{triangle inequality applied to matrix norm}\\
&= \text{tr}\left(M^\top M\right) - 2 \text{tr}\left(M^\top A\right) +
\text{tr}\left(A^\top A\right)
& \quad & \left\Vert A \right\Vert_F^2 = \text{tr}\left(A^\top A\right)\\
&= \left\Vert M - A \right\Vert_F^2\\
\left\Vert M - \hat{A} \right\Vert_F &\leq \left\Vert M - A \right\Vert_F\end{split}\]
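A quick numerical check of the inequality (a minimal sketch with random
matrices; the dimensions and seed are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
M = rng.normal(size=(5, 5))
M = (M + M.T) / 2                 # arbitrary symmetric M
A_hat = (A + A.T) / 2             # symmetric part of A

print(np.linalg.norm(M - A_hat, 'fro') <= np.linalg.norm(M - A, 'fro'))  # expected: True
```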
Exercise 13
Proof by contradiction: assume that \(x\) is a local minimizer of \(f\) but
that \(\nabla^2 f(x)\) is not positive semidefinite, i.e. there exists some
\(p \in \mathbb{R}^n\) with \(p^\top \nabla^2 f(x) p < 0\).
Since the Hessian is a real symmetric square matrix, its eigendecomposition
exists as
\[\nabla^2 f(x) = Q \Lambda Q^\top = \sum_{i = 1}^n \lambda_i q_i q_i^\top.\]
Pick \(p = c q_i\), where \(q_i\) is an eigenvector whose eigenvalue
\(\lambda_i\) is negative and \(c \in \mathbb{R} \setminus \{0\}\) is small
enough that \(x + p \in D\).
Since \(x\) is a local minimizer, \(\nabla f(x) = 0\). Because \(D\) is open, a
sufficiently small \(c\) also keeps the segment
\(\left\{ x + tp : 0 \leq t \leq 1 \right\}\) inside an open convex subset of
\(D\) (e.g. a ball centered at \(x\)), so Lemma \(4.1.5\) applies: for some
\(z \in [x, x + p]\),
\[\begin{split}f(x + p) &= f(x) + \nabla f(x)^\top p + \frac{1}{2} p^\top \nabla^2 f(z) p\\
&= f(x) + \frac{c^2}{2} q_i^\top \nabla^2 f(z) q_i
& \quad & \nabla f(x) = 0\\
&< f(x)
& \quad & \text{for } c \text{ small enough, since }
q_i^\top \nabla^2 f(z) q_i \rightarrow \lambda_i < 0 \text{ as } z \rightarrow x,\end{split}\]
which contradicts \(x\) being a local minimizer of \(f\), since \(c\) (and
hence \(\left\Vert p \right\Vert\)) can be taken arbitrarily small.
Thus, \(\nabla^2 f(x)\) is positive semidefinite.
The same argument applies when \(\nabla^2 f \in \text{Lip}_\gamma(D)\) at
\(x\): Lipschitz continuity implies continuity, and there is still an open
convex set \(D \subset \mathbb{R}^n\) with \(x \in D\), which matches the
assumptions of Theorem 4.3.2.
Exercise 14
If \(x_*\) is a local minimizer of \(f\), Theorem 4.3.2 states that
\(\nabla^2 f(x_*)\) must be positive semidefinite. Since
\(\nabla^2 f(x_*)\) is nonsingular, none of its eigenvalues are zero, which
means each of them must be positive in this case. This is equivalent to being
positive definite.
Conversely, if \(\nabla f(x_*) = 0\) and \(\nabla^2 f(x_*)\) is positive
definite, Theorem 4.3.2 states that \(x_*\) is a local minimizer of \(f\).
Exercise 15
If \(f\) has a unique minimizer \(x_*\), then
\(\nabla f(x_*) = Ax_* + b = 0\), and by Theorem 4.3.2,
\(\nabla^2 f(x) = A\) must be positive semidefinite. If \(A\) were singular,
then any nonzero \(v\) with \(Av = 0\) would give
\(\nabla f(x_* + v) = A(x_* + v) + b = 0\) and
\(f(x_* + v) = f(x_*) + \nabla f(x_*)^\top v + \frac{1}{2} v^\top A v =
f(x_*)\), contradicting uniqueness. Hence \(A\) is nonsingular and positive
semidefinite, i.e. positive definite, and \(x_* = -A^{-1} b\).
Conversely, if \(A\) is positive definite, then it is invertible, and
\(x = -A^{-1} b\) is the only point satisfying
\(\nabla f(x) = Ax + b = 0\) because the nullspace of \(A\) is trivial. By
Theorem 4.3.2, this \(x\) is a local minimizer, and since every minimizer of
\(f\) must be a stationary point, it is the unique minimizer.
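For a concrete instance (a minimal sketch assuming
\(f(x) = \frac{1}{2} x^\top A x + b^\top x\), consistent with
\(\nabla f(x) = Ax + b\) above, and a randomly generated symmetric positive
definite \(A\)):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(3, 3))
A = B @ B.T + 3 * np.eye(3)        # symmetric positive definite by construction
b = rng.normal(size=3)

f = lambda x: 0.5 * x @ A @ x + b @ x
x_star = np.linalg.solve(A, -b)    # unique stationary point x* = -A^{-1} b
print(f(x_star) <= f(x_star + rng.normal(size=3)))   # expected: True
```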
Exercise 16
\[\begin{split}\DeclareMathOperator*{\argmin}{arg\,min}
\argmin_{x \in \mathbb{R}^n} f(x)
&= \argmin_{x \in \mathbb{R}^n} \left\Vert Ax - b \right\Vert_2\\
&= \argmin_{x \in \mathbb{R}^n} \left\Vert Ax - b \right\Vert_2^2\\
&= \argmin_{x \in \mathbb{R}^n} x^\top A^\top Ax - 2 b^\top Ax + b^\top b\end{split}\]
The necessary conditions for \(x_*\) to be a local minimizer of \(f\)
are \(\nabla f(x_*) = 0\) and \(\nabla^2 f(x_*)\) positive semidefinite; the
sufficient conditions are \(\nabla f(x_*) = 0\) together with
\(\nabla^2 f(x_*)\) positive definite. Solving for the gradient and the Hessian
yields
\[\nabla f(x) = 2 A^\top A x - 2 A^\top b
\quad \text{and} \quad
\nabla^2 f(x) = 2 A^\top A.\]
Setting \(\nabla f(x_*) = 0\) yields the normal equations
\(A^\top A x_* = A^\top b\). The Hessian \(2 A^\top A\) is always positive
semidefinite, so the necessary conditions of Theorem 4.3.2 hold at any
solution; when \(A^\top A\) is nonsingular it is positive definite and the
solution is the unique minimizer, and otherwise Theorem 3.6.5 is needed to
compute the pseudo-inverse of \(A\) and obtain a minimizer.
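A sketch in NumPy comparing the normal equations with library least-squares
routines and the pseudo-inverse (the random dimensions and data are arbitrary
assumptions; with full column rank all three agree):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 3))                # full column rank with probability one
b = rng.normal(size=6)

x_ne = np.linalg.solve(A.T @ A, A.T @ b)   # normal equations (A^T A nonsingular here)
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
x_pi = np.linalg.pinv(A) @ b               # pseudo-inverse, works even if rank deficient
print(np.allclose(x_ne, x_ls), np.allclose(x_ls, x_pi))  # expected: True True
```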