Multivariable Calculus Background

Derivatives and Multivariable Models

  • Important results from single-variable calculus (e.g. the mean value theorem) carry over to real-valued functions of several variables because they concern the values of a real-valued function at two points \(x, \bar{x} \in \mathbb{R}^n\), and the line segment connecting those points can be parameterized by a single variable; the first and second derivatives become the gradient and Hessian.

  • There is no mean value theorem for continuously differentiable vector-valued functions, whose derivative is the Jacobian.

Exercise 1

\[\begin{split}F(x) = \begin{bmatrix} x_1^2 + x_2^3 - x_3^4\\ x_1 x_2 x_3\\ 2 x_1 x_2 - 3 x_2 x_3 + x_1 x_3\\ \end{bmatrix}\end{split}\]
\[\begin{split}J(x) = \begin{bmatrix} 2 x_1 & 3 x_2^2 & - 4 x_3^3\\ x_2 x_3 & x_1 x_3 & x_1 x_2\\ 2 x_2 + x_3 & 2 x_1 - 3 x_3 & -3 x_2 + x_1\\ \end{bmatrix}\end{split}\]
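
As a quick sanity check, the analytic Jacobian can be compared against a forward difference approximation. The following is a minimal numpy sketch; the test point and step size are arbitrary choices.

```python
import numpy as np

def F(x):
    # F from Exercise 1
    x1, x2, x3 = x
    return np.array([x1**2 + x2**3 - x3**4,
                     x1 * x2 * x3,
                     2*x1*x2 - 3*x2*x3 + x1*x3])

def J(x):
    # analytic Jacobian derived above
    x1, x2, x3 = x
    return np.array([[2*x1,      3*x2**2,      -4*x3**3],
                     [x2*x3,     x1*x3,         x1*x2],
                     [2*x2 + x3, 2*x1 - 3*x3,  -3*x2 + x1]])

x = np.array([1.0, 2.0, 3.0])  # arbitrary test point
h = 1e-6                       # arbitrary small step
J_fd = np.column_stack([(F(x + h*e) - F(x)) / h for e in np.eye(3)])
print(np.max(np.abs(J_fd - J(x))))  # O(h); small only if J is correct
```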

Exercise 2

Let \(f(x) = x_1^3 x_2^4 + \frac{x_2}{x_1}\).

\[\begin{split}\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(x)\\ \frac{\partial f}{\partial x_2}(x)\\ \end{bmatrix} = \begin{bmatrix} 3 x_1^2 x_2^4 - x_2 (x_1)^{-2}\\ 4 x_1^3 x_2^3 + (x_1)^{-1} \end{bmatrix}\end{split}\]
\[\begin{split}\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2}(x) & \frac{\partial^2 f}{\partial x_1 \partial x_2}(x)\\ \frac{\partial^2 f}{\partial x_2 \partial x_1}(x) & \frac{\partial^2 f}{\partial x_2^2}(x)\\ \end{bmatrix} = \begin{bmatrix} 6 x_1 x_2^4 + 2 x_2 (x_1)^{-3} & 12 x_1^2 x_2^3 - (x_1)^{-2}\\ 12 x_1^2 x_2^3 - (x_1)^{-2} & 12 x_1^3 x_2^2 \end{bmatrix}\end{split}\]
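
The gradient and Hessian above can also be verified symbolically. A minimal sketch, assuming sympy is available:

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2', positive=True)
f = x1**3 * x2**4 + x2 / x1

grad = sp.Matrix([sp.diff(f, x1), sp.diff(f, x2)])
hess = sp.hessian(f, (x1, x2))
print(grad)  # Matrix([[3*x1**2*x2**4 - x2/x1**2], [4*x1**3*x2**3 + 1/x1]])
print(hess)  # matches the Hessian above and is symmetric
```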

Exercise 3

Define \(x(t) = x + tp\) and \(g(t) = f(x(t)) = f(x + tp)\) where \(t \in [0, 1]\).

For \(0 \leq \alpha \leq 1\), applying the chain rule twice yields

\[\begin{split}\frac{d g}{d t}(\alpha) &= \sum_{j = 1}^n \frac{\partial f}{\partial x_j}(x(\alpha)) \frac{d x_j}{d t}(\alpha) & \quad & \text{chain rule}\\ &= \sum_{j = 1}^n \frac{\partial f}{\partial x_j}(x + \alpha p) p_j\\ &= \nabla f(x + \alpha p) \cdot p\end{split}\]

and

\[\begin{split}\frac{d^2 g}{d t^2}(\alpha) &= \frac{d}{dt} \left( \sum_{j = 1}^n \frac{\partial f}{\partial x_j}(x(\alpha)) p_j \right)\\ &= \sum_{i = 1}^n \sum_{j = 1}^n \frac{\partial^2 f}{\partial x_i \partial x_j}(x(\alpha)) p_i p_j & \quad & \text{chain rule applied to each term}\\ &= p^\top \nabla^2 f(x(\alpha)) p.\end{split}\]

Substituting \(\alpha = 0\) yields the second directional derivative \(\frac{\partial^2 f}{\partial p^2}(x) = g''(0) = p^\top \nabla^2 f(x) p\).

Applying the first fundamental theorem of calculus twice followed by the mean value theorem for integrals yields

\[\begin{split}g(1) &= g(0) + \int_0^1 g'(t_1) dt_1\\ &= g(0) + \int_0^1 \left( g'(0) + \int_0^{t_1} g''(t_2) dt_2 \right) dt_1 & \quad & \text{since } g'(t_1) - g'(0) = \int_0^{t_1} g''(t_2) dt_2\\ &= g(0) + \int_0^1 g'(0) dt_1 + \int_0^1 \int_0^{t_1} g''(t_2) dt_2 dt_1\\ &= g(0) + g'(0) + \int_0^1 g''(t_2) \int_{t_2}^1 dt_1 dt_2 & \quad & \text{changed order of integration in the double integral}\\ &= g(0) + g'(0) + \int_0^1 (1 - t_2) g''(t_2) dt_2\\ &= g(0) + g'(0) + g''(\xi) \int_0^1 (1 - t_2) dt_2 & \quad & \text{mean value theorem for integrals with } \xi \in (0,1)\\ &= g(0) + g'(0) + \frac{1}{2} g''(\xi)\\ f(x + p) &= f(x) + \nabla f(x)^\top p + \frac{1}{2} p^\top \nabla^2 f(x + \xi p) p.\end{split}\]
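
These identities are easy to check numerically: central differences of \(g(t) = f(x + tp)\) should reproduce \(\nabla f(x)^\top p\) and \(p^\top \nabla^2 f(x) p\). A minimal sketch using \(f\) from Exercise 2; the point \(x\), direction \(p\), and step \(t\) are arbitrary choices.

```python
import numpy as np

# f, gradient, and Hessian from Exercise 2
f = lambda x: x[0]**3 * x[1]**4 + x[1] / x[0]
grad = lambda x: np.array([3*x[0]**2*x[1]**4 - x[1]/x[0]**2,
                           4*x[0]**3*x[1]**3 + 1/x[0]])
hess = lambda x: np.array([[6*x[0]*x[1]**4 + 2*x[1]/x[0]**3, 12*x[0]**2*x[1]**3 - 1/x[0]**2],
                           [12*x[0]**2*x[1]**3 - 1/x[0]**2,  12*x[0]**3*x[1]**2]])

x = np.array([1.0, 1.0]); p = np.array([0.3, -0.2]); t = 1e-5
g = lambda s: f(x + s*p)
print((g(t) - g(-t)) / (2*t),         grad(x) @ p)      # g'(0)  = grad f(x) . p
print((g(t) - 2*g(0) + g(-t)) / t**2, p @ hess(x) @ p)  # g''(0) = p^T H(x) p
```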

Exercise 4

Invoking Lemma 4.2.1 for each component function in \(\left\{ f_i \right\}_{i = 1}^n\) yields

\[\begin{split}\lim_{t \rightarrow 0} \frac{f_i(x + tp) - f_i(x)}{t} &= \nabla f_i(x)^\top p\\ \lim_{t \rightarrow 0} \frac{f_i(x + tp) - f_i(x)}{t} - \nabla f_i(x)^\top p &= 0\\ \left\vert \lim_{t \rightarrow 0} \frac{f_i(x + tp) - f_i(x)}{t} - \nabla f_i(x)^\top p \right\vert &= 0\\ \lim_{t \rightarrow 0} \left\vert \frac{f_i(x + tp) - f_i(x)}{t} - \nabla f_i(x)^\top p \right\vert &= 0 & \quad & \lvert \cdot \rvert \text{ is continuous.}\end{split}\]

Since \(F\) is continuously differentiable in an open set \(D\) containing \(x\), it follows that \(F'(x) = \nabla F(x)^\top = J(x)\), so the previous statement can be written more compactly as

\[\begin{split}\lim_{t \rightarrow 0} \frac{F(x + tp) - F(x)}{t} &= \nabla F(x)^\top p\\ &= J(x) p\\ \lim_{t \rightarrow 0} \frac{F(x + tp) - F(x)}{t} - J(x) p &= \vec{0}\\ \left\Vert \lim_{t \rightarrow 0} \frac{F(x + tp) - F(x) - J(x)tp}{t} \right\Vert &= \left\Vert \vec{0} \right\Vert\\ \lim_{t \rightarrow 0} \frac{\lVert F(x + tp) - F(x) - J(x)tp \rVert}{t} &= 0 & \quad & \lVert \cdot \rVert \text{ is continuous.}\end{split}\]

Thus \(J(x) = \hat{J}(x)\).

Exercise 5

A useful fact: if \(I \subset \mathbb{R}\) is an interval and \(f \colon I \to \mathbb{R}\) is differentiable on \(I\), then \(f\) is Lipschitz continuous on \(I\) if and only if \(f'\) is bounded on \(I\).

Clearly if \(f\) is Lipschitz continuous on \(I\) with constant \(\gamma\), then \(\lvert f'(x) \rvert \leq \gamma\) by (4.1.11).

Suppose \(\lvert f'(x) \rvert \leq M\) for all \(x \in I\), where \(M \in \mathbb{R}\). Let \(x\) and \(y\) be any two distinct elements of \(I\). Applying the mean value theorem gives

\[\begin{split}f'(w) &= \frac{f(x) - f(y)}{x - y}\\ \lvert f'(w) \rvert &= \left\vert \frac{f(x) - f(y)}{x - y} \right\vert\\ M \lvert x - y \rvert &\geq \left\vert f(x) - f(y) \right\vert\end{split}\]

where \(w\) lies between \(x\) and \(y\). Thus, \(f\) is Lipschitz continuous on \(I\) with \(\gamma = M\).

In this exercise, \(f(x) = x \sin x^{-1}\) with \(f(0) = 0\), so \(f'(x) = \sin(x^{-1}) - \frac{\cos x^{-1}}{x}\) for \(x \neq 0\). On any interval bounded away from zero, i.e. \(\lvert x \rvert \geq \delta > 0\), \(\lvert f'(x) \rvert \leq 1 + \delta^{-1}\), so \(f\) is Lipschitz continuous there. Near \(x = 0\), however, the term \(\frac{\cos x^{-1}}{x}\) is unbounded (e.g. \(f'\left(\frac{1}{2 \pi n}\right) = -2 \pi n\)), so by the fact above \(f\) cannot be Lipschitz continuous on any interval containing zero.

Moreover, \(f\) is not differentiable at \(x = 0\) because the defining limit does not exist:

\[\begin{split}f'(0) &= \lim_{\epsilon \rightarrow 0} \frac{f(0 + \epsilon) - f(0)}{\epsilon}\\ &= \lim_{\epsilon \rightarrow 0} \frac{\epsilon \sin(\epsilon^{-1})}{\epsilon}\\ &= \lim_{\epsilon \rightarrow 0} \sin \frac{1}{\epsilon},\end{split}\]

and \(\sin \frac{1}{\epsilon}\) oscillates between \(-1\) and \(1\) as \(\epsilon \rightarrow 0\), so the limit does not exist.
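
A short numeric check supports the claim about the Lipschitz constant: the slopes between the consecutive peaks \(x_n = \frac{1}{2\pi n + \pi/2}\) and \(y_n = \frac{1}{2\pi n + 3\pi/2}\) grow without bound. A minimal sketch:

```python
import numpy as np

# Difference quotients of f(x) = x sin(1/x) between successive extrema of
# sin(1/x); the ratio grows like 4n + 2, so no single Lipschitz constant
# works on an interval containing zero.
f = lambda x: x * np.sin(1.0 / x)
for n in [1, 10, 100, 1000]:
    xn = 1.0 / (2*np.pi*n + np.pi/2)
    yn = 1.0 / (2*np.pi*n + 3*np.pi/2)
    print(n, abs(f(xn) - f(yn)) / abs(xn - yn))  # approximately 4n + 2
```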

Exercise 6

Let \(g(t) = f(x + tp)\) where \(t \in [0, 1]\) and \(f \colon \mathbb{R}^n \to \mathbb{R}\).

\[\begin{split}g(1) - g(0) &= \int_0^1 g'(t_1) dt_1\\ g(1) - g(0) - g'(0) &= \int_0^1 \left[ g'(t_1) - g'(0) \right] dt_1\\ &= \int_0^1 \int_0^{t_1} g''(t_2) dt_2 dt_1 & \quad & g'(t_1) - g'(0) = \int_0^{t_1} g''(t_2) dt_2\\ g(1) - g(0) - g'(0) - \int_0^1 \int_0^{t_1} g''(0) dt_2 dt_1 &= \int_0^1 \int_0^{t_1} g''(t_2) dt_2 dt_1 - \int_0^1 \int_0^{t_1} g''(0) dt_2 dt_1\\ g(1) - g(0) - g'(0) - \frac{1}{2} g''(0) &= \int_0^1 \int_0^{t_1} \left[ g''(t_2) - g''(0) \right] dt_2 dt_1\\ \left\vert g(1) - g(0) - g'(0) - \frac{1}{2} g''(0) \right\vert &\leq \int_0^1 \int_0^{t_1} \left\vert g''(t_2) - g''(0) \right\vert dt_2 dt_1 & \quad & \text{double integral application of Lemma 4.1.10}\\ \left\vert f(x + p) - f(x) - \nabla f(x)^\top p - \frac{1}{2} p^\top \nabla^2 f(x) p \right\vert &\leq \int_0^1 \int_0^{t_1} \left\vert p^\top \left[ \nabla^2 f(x + t_2 p) - \nabla^2 f(x) \right] p \right\vert dt_2 dt_1 & \quad & \text{rewrite } g \text{ in terms of } f\\ &= \int_0^1 \int_0^{t_1} \left\vert \sum_i \sigma_i p^\top u_i v_i^\top p \right\vert dt_2 dt_1 & \quad & \text{apply Theorem 3.6.4 to the Hessian difference}\\ &= \int_0^1 \int_0^{t_1} \left\vert \sum_i \sigma_i \sum_j \sum_k p_j u_{ij} v_{ik} p_k \right\vert dt_2 dt_1\\ &\leq \left\Vert p \right\Vert^2_\infty \int_0^1 \int_0^{t_1} \left\vert \sum_i \sigma_i \sum_j \sum_k u_{ij} v_{ik} \right\vert dt_2 dt_1\\ &\leq \left\Vert p \right\Vert^2_\infty \int_0^1 \int_0^{t_1} n \left\Vert \nabla^2 f(x + t_2 p) - \nabla^2 f(x) \right\Vert_\infty dt_2 dt_1\\ &\leq \left\Vert p \right\Vert^2 \int_0^1 \int_0^{t_1} n \left\Vert \nabla^2 f(x + t_2 p) - \nabla^2 f(x) \right\Vert dt_2 dt_1 & \quad & \text{(3.1.5)}\\ &\leq \left\Vert p \right\Vert^2 \int_0^1 \int_0^{t_1} \gamma \left\Vert t_2 p \right\Vert dt_2 dt_1 & \quad & \text{(4.1.11)}\\ &= \gamma \left\Vert p \right\Vert^3 \int_0^1 \int_0^{t_1} t_2 dt_2 dt_1\\ &= \frac{\gamma}{6} \left\Vert p \right\Vert^3\end{split}\]

The additional factor of \(n\) is absorbed into \(\gamma\). Since the definition of Lipschitz continuity only requires a norm to be defined over \(\mathbb{R}^{m \times n}\), there is no reason a matrix norm cannot be used; the book focuses on \(p\)-norms where \(p \geq 1\).
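
To make the bound concrete, here is a minimal numeric check on \(f(x) = x_1^3 + x_2^3\), whose Hessian \(\text{diag}(6 x_1, 6 x_2)\) is Lipschitz continuous with \(\gamma = 6\) in the 2-norm; the test points are arbitrary.

```python
import numpy as np

# Check |f(x+p) - f(x) - grad.p - 0.5 p^T H p| <= (gamma/6) ||p||^3
f = lambda x: x[0]**3 + x[1]**3
grad = lambda x: 3 * x**2
hess = lambda x: np.diag(6 * x)

gamma = 6.0
rng = np.random.default_rng(0)
x = rng.standard_normal(2)
for _ in range(5):
    p = rng.standard_normal(2)
    err = abs(f(x + p) - f(x) - grad(x) @ p - 0.5 * p @ hess(x) @ p)
    print(err <= gamma / 6 * np.linalg.norm(p)**3 + 1e-12)  # always True
```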

Exercise 7

\[\begin{split}F(v) - F(u) &= \int_u^v F'(z) dz & \quad & \text{(4.1.8)}\\ &= \int_0^1 J(u + t(v - u)) (v - u) dt & \quad & \text{change of variables where } z = u + t(v - u) \text{ and } dz = (v - u) dt\\ F(v) - F(u) - J(x) (v - u) &= \int_0^1 J(u + t(v - u)) (v - u) dt - J(x) (v - u)\\ \left\Vert F(v) - F(u) - J(x) (v - u) \right\Vert &= \left\Vert \int_0^1 \left[ J(u + t(v - u)) - J(x) \right] (v - u) dt \right\Vert\\ &\leq \int_0^1 \left\Vert \left[ J(u + t(v - u)) - J(x) \right] (v - u) \right\Vert dt & \quad & \text{(4.1.9)}\\ &\leq \int_0^1 \left\Vert J(u + t(v - u)) - J(x) \right\Vert \left\Vert v - u \right\Vert dt\\ &\leq \left\Vert v - u \right\Vert \int_0^1 \gamma \left\Vert u + t(v - u) - x \right\Vert dt & \quad & \text{(4.1.11)}\\ &= \gamma \left\Vert v - u \right\Vert \int_0^1 \left\Vert u + t(v - u) - \left[ tx + (1 - t)x \right] \right\Vert dt\\ &= \gamma \left\Vert v - u \right\Vert \int_0^1 \left\Vert tv - tx + (1 - t)u - (1 - t)x \right\Vert dt\\ &\leq \gamma \left\Vert v - u \right\Vert \int_0^1 \left[ t \left\Vert v - x \right\Vert + (1 - t) \left\Vert u - x \right\Vert \right] dt\\ &= \gamma \left\Vert v - u \right\Vert \frac{\left\Vert v - x \right\Vert + \left\Vert u - x \right\Vert}{2}\end{split}\]

Exercise 8

\[\begin{split}F(x + p) - F(x) &= \int_0^1 J(x + tp) p dt\\ F(x + p) - F(x) - J(x) p &= \int_0^1 J(x + tp) p dt - J(x) p\\ \left\Vert F(x + p) - F(x) - J(x) p \right\Vert &= \left\Vert \int_0^1 \left[ J(x + tp) - J(x) \right] p dt \right\Vert\\ &\leq \int_0^1 \left\Vert \left[ J(x + tp) - J(x) \right] p \right\Vert dt & \quad & \text{(4.1.9)}\\ &\leq \int_0^1 \left\Vert J(x + tp) - J(x) \right\Vert \left\Vert p \right\Vert dt & \quad & \text{Theorem 3.1.3}\\ &\leq \left\Vert p \right\Vert \int_0^1 \gamma \left\Vert tp \right\Vert^\alpha dt & \quad & J \text{ is Hölder continuous with exponent } \alpha\\ &= \gamma \left\Vert p \right\Vert^{1 + \alpha} \int_0^1 t^\alpha dt\\ &= \left\Vert p \right\Vert^{1 + \alpha} \frac{\gamma}{1 + \alpha}\end{split}\]

Exercise 9

\[\begin{split}F(x) &= \begin{pmatrix} 3x_1^2 - 2x_2\\ x_2^3 - \frac{1}{x_1} \end{pmatrix}\\ J(x) &= \begin{pmatrix} 6x_1 & -2\\ \frac{1}{x_1^2} & 3x_2^2 \end{pmatrix}\end{split}\]

When \(x = \begin{pmatrix} 1 & 1 \end{pmatrix}^\top\),

\[\begin{split}J(x) = \begin{pmatrix} 6 & -2\\ 1 & 3 \end{pmatrix}.\end{split}\]

Using the forward difference approximation where \(h = 0.01\) yields

\[\begin{split}F(x) = \begin{pmatrix} 1\\ 0 \end{pmatrix} \quad & \quad F(x + h e_1) = \begin{pmatrix} 1.0603\\ 0.0099 \end{pmatrix} \quad & \quad F(x + h e_2) = \begin{pmatrix} 0.98\\ 0.0303 \end{pmatrix}\end{split}\]

resulting in \(A = \begin{pmatrix} 6.03 & -2\\ 0.99 & 3.03 \end{pmatrix}\).

The accuracy of the approximation measured in the \(\ell_1\)-norm is \(\left\Vert A - J(x) \right\Vert_1 = 0.04\); since the forward difference error per column is bounded by \(\frac{\gamma h}{2}\), this requires \(\gamma \geq 8\).
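
The numbers above come from the standard forward difference scheme; a minimal numpy sketch that reproduces them:

```python
import numpy as np

def F(x):
    return np.array([3*x[0]**2 - 2*x[1], x[1]**3 - 1/x[0]])

def J(x):
    return np.array([[6*x[0], -2.0], [1/x[0]**2, 3*x[1]**2]])

x = np.array([1.0, 1.0]); h = 0.01
A = np.column_stack([(F(x + h*e) - F(x)) / h for e in np.eye(2)])
print(A)                                   # [[6.03, -2.0], [0.9901, 3.0301]]
print(np.abs(A - J(x)).sum(axis=0).max())  # l1-norm error, about 0.04
```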

Exercise 10

Define

\[\alpha = f(x + he_i) - f(x) - h \nabla f(x)_i - \frac{h^2}{2} \nabla^2 f(x)_{ii}\]

and

\[\beta = f(x - he_i) - f(x) + h \nabla f(x)_i - \frac{h^2}{2} \nabla^2 f(x)_{ii}\]

where \(\lvert \alpha \rvert, \lvert \beta \rvert \leq \frac{\gamma h^3}{6}\) by (4.1.13).

\[\begin{split}\alpha + \beta &= f(x + he_i) - 2f(x) + f(x - he_i) - h^2 \nabla^2 f(x)_{ii}\\ \left\vert \alpha + \beta \right\vert &= \left\vert f(x + he_i) - 2f(x) + f(x - he_i) - h^2 \nabla^2 f(x)_{ii} \right\vert\\ &\leq \frac{\gamma h^3}{3} \qquad \text{triangle inequality}\end{split}\]

Thus, \(\left| A_{ii} - \nabla^2 f(x)_{ii} \right| \leq \frac{\gamma h}{3}\) where \(A_{ii} = \frac{f(x + he_i) - 2f(x) + f(x - he_i)}{h^2}\).
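
A minimal sketch of this central difference approximation, using \(f\) from Exercise 2 at \(x = (1, 1)\), where the Hessian diagonal is \((8, 12)\):

```python
import numpy as np

f = lambda x: x[0]**3 * x[1]**4 + x[1] / x[0]
x = np.array([1.0, 1.0]); h = 1e-4
for i, e in enumerate(np.eye(2)):
    A_ii = (f(x + h*e) - 2*f(x) + f(x - h*e)) / h**2
    print(i, A_ii)  # approximately 8 and 12
```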

Exercise 11

Substituting \(p = he_i + he_j, p = he_i - he_j, p = -he_i + he_j,\) and \(p = -he_i - he_j\) into Lemma 4.1.14 yields

\[\begin{split}\alpha &= f(x + he_i + he_j) - f(x) - h \nabla f(x)_i - h \nabla f(x)_j - \frac{h^2}{2} \left[ \nabla^2 f(x)_{ii} + 2 \nabla^2 f(x)_{ij} + \nabla^2 f(x)_{jj} \right]\\ \beta &= f(x + he_i - he_j) - f(x) - h \nabla f(x)_i + h \nabla f(x)_j - \frac{h^2}{2} \left[ \nabla^2 f(x)_{ii} - 2 \nabla^2 f(x)_{ij} + \nabla^2 f(x)_{jj} \right]\\ \eta &= f(x - he_i + he_j) - f(x) + h \nabla f(x)_i - h \nabla f(x)_j - \frac{h^2}{2} \left[ \nabla^2 f(x)_{ii} - 2 \nabla^2 f(x)_{ij} + \nabla^2 f(x)_{jj} \right]\\ \nu &= f(x - he_i - he_j) - f(x) + h \nabla f(x)_i + h \nabla f(x)_j - \frac{h^2}{2} \left[ \nabla^2 f(x)_{ii} + 2 \nabla^2 f(x)_{ij} + \nabla^2 f(x)_{jj} \right]\end{split}\]

By the triangle inequality,

\[\begin{split}\left| \alpha - \beta - \eta + \nu \right| &\leq \left| \alpha \right| + \left| \beta \right| + \left| \eta \right| + \left| \nu \right|\\ &\leq \frac{\gamma h^3}{6} \left[ \left\Vert e_i + e_j \right\Vert^3 + \left\Vert e_i - e_j \right\Vert^3 + \left\Vert -e_i + e_j \right\Vert^3 + \left\Vert -e_i - e_j \right\Vert^3 \right]\\ &\leq \frac{16}{3} \gamma h^3 & \quad & \left\Vert \pm e_i \pm e_j \right\Vert \leq 2 \text{ in any } p\text{-norm with } p \geq 1.\end{split}\]

Solving for \(\nabla^2 f(x)_{ij}\) yields

\[\begin{split}\left| \alpha - \beta - \eta + \nu \right| &= \left| f(x + he_i + he_j) - f(x + he_i - he_j) - f(x - he_i + he_j) + f(x - he_i - he_j) - 4h^2 \nabla^2 f(x)_{ij} \right|\\ &\leq \frac{16}{3} \gamma h^3.\end{split}\]

Hence \(\left| A_{ij} - \nabla^2 f(x)_{ij} \right| \leq \frac{4}{3} \gamma h\) where

\[A_{ij} = \frac{f(x + he_i + he_j) - f(x + he_i - he_j) - f(x - he_i + he_j) + f(x - he_i - he_j)}{4h^2}.\]

The number of function evaluations is the same (unless \(f(x)\) has already been computed), but this new approximation reduces the error bound by \(20\%\).
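
A minimal sketch of the four-point formula for \(A_{ij}\), again using \(f\) from Exercise 2 at \(x = (1, 1)\), where \(\nabla^2 f(x)_{12} = 11\):

```python
import numpy as np

f = lambda x: x[0]**3 * x[1]**4 + x[1] / x[0]
x = np.array([1.0, 1.0]); h = 1e-4
ei, ej = np.eye(2)
A_ij = (f(x + h*ei + h*ej) - f(x + h*ei - h*ej)
        - f(x - h*ei + h*ej) + f(x - h*ei - h*ej)) / (4 * h**2)
print(A_ij)  # approximately 11
```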

Exercise 12

Define \(A, M \in \mathbb{R}^{n \times n}\) where \(M = M^\top\) and \(\hat{A} = \frac{A + A^\top}{2}\).

\[\begin{split}\left\Vert M - \hat{A} \right\Vert_F^2 &= \text{tr}\left( \left( M - \hat{A} \right)^\top \left( M - \hat{A} \right) \right)\\ &= \text{tr}\left(M^\top M\right) - \text{tr}\left(M^\top \hat{A}\right) - \text{tr}\left(\hat{A}^\top M\right) + \text{tr}\left(\hat{A}^\top \hat{A}\right) & \quad & \text{trace operator is a linear mapping}\\ &= \text{tr}\left(M^\top M\right) - 2\text{tr}\left(M^\top \hat{A}\right) + \text{tr}\left(\hat{A}^\top \hat{A}\right) & \quad & \text{trace equivalence under transpose}\\ &= \text{tr}\left(M^\top M\right) - \text{tr}\left(M^\top A\right) - \text{tr}\left(M^\top A^\top\right) + \text{tr}\left(\hat{A}^\top \hat{A}\right)\\ &= \text{tr}\left(M^\top M\right) - \text{tr}\left(M^\top A\right) - \text{tr}\left(A^\top M\right) + \text{tr}\left(\hat{A}^\top \hat{A}\right) & \quad & \text{trace equivalence under transpose and } M = M^\top\\ &= \text{tr}\left(M^\top M\right) - 2 \text{tr}\left(M^\top A\right) + \frac{1}{4} \left\Vert A + A^\top \right\Vert_F^2\\ &\leq \text{tr}\left(M^\top M\right) - 2 \text{tr}\left(M^\top A\right) + \left\Vert A \right\Vert_F^2 & \quad & \text{triangle inequality applied to matrix norm}\\ &= \text{tr}\left(M^\top M\right) - 2 \text{tr}\left(M^\top A\right) + \text{tr}\left(A^\top A\right) & \quad & \left\Vert A \right\Vert_F^2 = \text{tr}\left(A^\top A\right)\\ &= \left\Vert M - A \right\Vert_F^2\\ \left\Vert M - \hat{A} \right\Vert_F &\leq \left\Vert M - A \right\Vert_F\end{split}\]
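
The inequality is easy to test numerically: every symmetric \(M\) is at least as close to \(\hat{A}\) as to \(A\) in the Frobenius norm. A minimal sketch with random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
A_hat = (A + A.T) / 2                  # symmetric part of A
for _ in range(5):
    B = rng.standard_normal((5, 5))
    M = (B + B.T) / 2                  # an arbitrary symmetric matrix
    print(np.linalg.norm(M - A_hat, 'fro') <= np.linalg.norm(M - A, 'fro'))
```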

Exercise 13

Proof by contradiction: assume that \(x\) is a local minimizer of \(f\) but \(\nabla^2 f(x)\) is not positive semidefinite, i.e. there exists \(p \in \mathbb{R}^n\) with \(p^\top \nabla^2 f(x) p < 0\); equivalently, \(\nabla^2 f(x)\) has at least one negative eigenvalue.

Since the Hessian is a real symmetric square matrix, its eigendecomposition exists as

\[\nabla^2 f(x) = Q \Lambda Q^\top = \sum_{i = 1}^n \lambda_i q_i q_i^\top.\]

Pick \(p = c q_i\) where \(q_i\) is any eigenvector with a corresponding negative eigenvalue \(\lambda_i < 0\) and \(c \in \mathbb{R} \setminus \{0\}\) is small enough that \(x + p \in D\).

Note that \(p^\top \nabla^2 f(x) p = c^2 \lambda_i < 0\). By the continuity of \(\nabla^2 f\), \(c\) can be chosen small enough that \(p^\top \nabla^2 f(z) p < 0\) for every \(z\) on the segment \([x, x + p] \subset D\). Applying Lemma 4.1.5 for some \(z \in (x, x + p)\), together with the first-order necessary condition \(\nabla f(x) = 0\), yields

\[\begin{split}f(x + p) &= f(x) + \nabla f(x)^\top p + \frac{1}{2} p^\top \nabla^2 f(z) p\\ &= f(x) + \frac{1}{2} p^\top \nabla^2 f(z) p & \quad & \nabla f(x) = 0\\ &< f(x),\end{split}\]

which is a contradiction because \(x\) is a local minimizer of \(f\). Thus, \(\nabla^2 f(x)\) is positive semidefinite.

This proof generalizes to Hessians that are Lipschitz continuous: if \(\nabla^2 f \in \text{Lip}_\gamma(D)\) on an open convex set \(D \subset \mathbb{R}^n\) with \(x \in D\), the assumptions of Theorem 4.3.2 are satisfied.

Exercise 14

If \(x_*\) is a local minimizer of \(f\), Theorem 4.3.2 states that \(\nabla^2 f(x_*)\) must be positive semidefinite. Since \(\nabla^2 f(x_*)\) is nonsingular, none of its eigenvalues are zero, which means each of them must be positive in this case. This is equivalent to being positive definite.

Conversely, if \(\nabla f(x_*) = 0\) and \(\nabla^2 f(x_*)\) is positive definite, Theorem 4.3.2 states that \(x_*\) is a local minimizer of \(f\).

Exercise 15

If \(f\) has a unique minimizer \(x_*\), then \(\nabla f(x_*) = Ax_* + b = 0\). Uniqueness of the solution of \(Ax_* = -b\) is equivalent to \(A\) having full rank, so \(A\) is invertible. According to Theorem 4.3.2, \(\nabla^2 f(x) = A\) must be positive semidefinite, and a nonsingular positive semidefinite matrix is positive definite. Hence when \(x_*\) is the unique minimizer, \(A\) is positive definite and \(x_* = -A^{-1} b\).

Conversely, if \(A\) is positive definite, Theorem 4.3.2 states that \(x\) is a local minimizer when \(\nabla f(x) = Ax + b = 0\), and \(x = -A^{-1} b\) satisfies that condition. The solution is unique because a positive definite \(A\) has a trivial nullspace.
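
A minimal numeric illustration, assuming the quadratic has the form \(f(x) = \frac{1}{2} x^\top A x + b^\top x\) so that \(\nabla f(x) = Ax + b\) as above; \(A\) and \(b\) here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4 * np.eye(4)      # symmetric positive definite by construction
b = rng.standard_normal(4)

x_star = np.linalg.solve(A, -b)  # solve A x = -b rather than forming A^{-1}
print(np.linalg.norm(A @ x_star + b))     # gradient at x_star: numerically zero
print(np.all(np.linalg.eigvalsh(A) > 0))  # all eigenvalues positive
```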

Exercise 16

\[\begin{split}\DeclareMathOperator*{\argmin}{arg\,min} \argmin_{x \in \mathbb{R}^n} f(x) &= \argmin_{x \in \mathbb{R}^n} \left\Vert Ax - b \right\Vert_2\\ &= \argmin_{x \in \mathbb{R}^n} \left\Vert Ax - b \right\Vert_2^2\\ &= \argmin_{x \in \mathbb{R}^n} x^\top A^\top Ax - 2 b^\top Ax + b^\top b\end{split}\]

The necessary conditions for \(x_*\) to be a local minimizer of \(f\) are \(\nabla f(x_*) = 0\) and \(\nabla^2 f(x_*)\) positive semidefinite; the sufficient condition is that \(\nabla^2 f(x_*)\) be positive definite. Solving for the gradient and the Hessian yields

\[\nabla f(x) = 2 A^\top A x - 2 A^\top b \quad \text{and} \quad \nabla^2 f(x) = 2 A^\top A.\]

Setting the gradient to zero shows that \(x_*\) satisfies the normal equations \(A^\top A x_* = A^\top b\). The Hessian \(2 A^\top A\) is positive semidefinite by construction, so the necessary conditions of Theorem 4.3.2 are met; when \(A^\top A\) is singular the minimizer is not unique, and Theorem 3.6.5 is needed to compute the pseudo-inverse of \(A\).
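
A minimal sketch comparing the normal equations with library solvers; when \(A\) has full column rank all three agree, and `lstsq`/`pinv` extend to the rank-deficient case via the SVD:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))   # full column rank with probability 1
b = rng.standard_normal(8)

x_ne = np.linalg.solve(A.T @ A, A.T @ b)      # normal equations
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)  # least squares solver
x_pi = np.linalg.pinv(A) @ b                  # pseudo-inverse solution
print(np.allclose(x_ne, x_ls), np.allclose(x_ls, x_pi))
```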