Multivariable Calculus Background

Derivatives and Multivariable Models

  • Important results from single-variable calculus (e.g. the mean value theorem) hold for real-valued functions of multiple variables (e.g. those characterized by the gradient and Hessian) because they involve the value of the function at two points x, \bar{x} \in \mathbb{R}^n, and the line connecting these points can be parameterized using one variable.

  • There is no analogous mean value theorem for continuously differentiable vector-valued functions (e.g. those characterized by the Jacobian).

Exercise 1

\begin{split}F(x) &= \begin{pmatrix} x_1^2 + x_2^3 - x_3^4\\ x_1 x_2 x_3\\ 2 x_1 x_2 - 3 x_2 x_3 + x_1 x_3 \end{pmatrix}\\ J(x) &= \begin{pmatrix} 2 x_1 & 3 x_2^2 & -4 x_3^3\\ x_2 x_3 & x_1 x_3 & x_1 x_2\\ 2 x_2 + x_3 & 2 x_1 - 3 x_3 & -3 x_2 + x_1 \end{pmatrix}\end{split}
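
A minimal NumPy sketch for checking this result (the test point below is an arbitrary choice): forward differences should reproduce the analytic Jacobian.

```python
import numpy as np

def F(x):
    x1, x2, x3 = x
    return np.array([x1**2 + x2**3 - x3**4,
                     x1 * x2 * x3,
                     2 * x1 * x2 - 3 * x2 * x3 + x1 * x3])

def J(x):
    x1, x2, x3 = x
    return np.array([[2 * x1, 3 * x2**2, -4 * x3**3],
                     [x2 * x3, x1 * x3, x1 * x2],
                     [2 * x2 + x3, 2 * x1 - 3 * x3, -3 * x2 + x1]])

x, h = np.array([1.0, 2.0, 3.0]), 1e-6
# Column j of the approximation is (F(x + h e_j) - F(x)) / h.
A = np.column_stack([(F(x + h * e) - F(x)) / h for e in np.eye(3)])
print(np.max(np.abs(A - J(x))))  # small, so the analytic Jacobian checks out
```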

Exercise 2

Let f(x) = x_1^3 x_2^4 + \frac{x_2}{x_1}.

\nabla f(x) = \begin{pmatrix} \frac{\partial f}{\partial x_1}(x)\\ \frac{\partial f}{\partial x_2}(x) \end{pmatrix} = \begin{pmatrix} 3 x_1^2 x_2^4 - x_2 x_1^{-2}\\ 4 x_1^3 x_2^3 + x_1^{-1} \end{pmatrix}
\nabla^2 f(x) = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2}(x) & \frac{\partial^2 f}{\partial x_1 \partial x_2}(x)\\ \frac{\partial^2 f}{\partial x_2 \partial x_1}(x) & \frac{\partial^2 f}{\partial x_2^2}(x) \end{pmatrix} = \begin{pmatrix} 6 x_1 x_2^4 + 2 x_2 x_1^{-3} & 12 x_1^2 x_2^3 - x_1^{-2}\\ 12 x_1^2 x_2^3 - x_1^{-2} & 12 x_1^3 x_2^2 \end{pmatrix}
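
The same kind of finite-difference check works here; a minimal sketch, assuming an arbitrary test point with x_1 \neq 0:

```python
import numpy as np

def f(x):
    x1, x2 = x
    return x1**3 * x2**4 + x2 / x1

def grad(x):
    x1, x2 = x
    return np.array([3 * x1**2 * x2**4 - x2 / x1**2,
                     4 * x1**3 * x2**3 + 1 / x1])

def hess(x):
    x1, x2 = x
    return np.array([[6 * x1 * x2**4 + 2 * x2 / x1**3, 12 * x1**2 * x2**3 - 1 / x1**2],
                     [12 * x1**2 * x2**3 - 1 / x1**2, 12 * x1**3 * x2**2]])

x, h = np.array([1.5, 0.5]), 1e-5
# Central differences of f reproduce the gradient; central differences of
# the gradient reproduce the Hessian (column j is the derivative along e_j).
g_fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(2)])
H_fd = np.column_stack([(grad(x + h * e) - grad(x - h * e)) / (2 * h) for e in np.eye(2)])
print(np.max(np.abs(g_fd - grad(x))), np.max(np.abs(H_fd - hess(x))))
```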

Exercise 3

Define x(t) = x + tp and g(t) = f(x(t)) = f(x + tp) where t \in [0, 1].

For 0 \leq \alpha \leq 1, applying the chain rule twice yields

\begin{split}\frac{dg(t)}{dt}[\alpha] &= \sum_{j = 1}^n \frac{\partial f(x(t))}{\partial x(t)_j}[x(\alpha)] \frac{dx(t)_j}{dt}[\alpha] & \quad & \text{definition of total differential}\\ &= \sum_{j = 1}^n \frac{\partial}{\partial x_j} f[x(\alpha)] p_j\\ &= \nabla f[x + \alpha p]^\top p\end{split}

and

\begin{split}\frac{d^2 g}{dt^2}[\alpha] &= \frac{d}{dt} \left( \sum_{j = 1}^n \frac{\partial}{\partial x_j} f[x(\alpha)] p_j \right)\\ &= \sum_{i = 1}^n \sum_{j = 1}^n \frac{\partial^2}{\partial x_i \partial x_j} f[x(\alpha)] p_i p_j & \quad & \text{definition of total differential}\\ &= p^\top \nabla^2 f[x(\alpha)] p.\end{split}

Substituting \alpha = 0 yields \frac{\partial^2 f}{\partial p^2} = p^\top \nabla^2 f(x) p.

Applying the first fundamental theorem of calculus twice, followed by the mean value theorem for integrals, yields

\begin{split}g(1) &= g(0) + \int_0^1 g'(t_1) dt_1\\ &= g(0) + \int_0^1 \left( g'(0) + \int_0^{t_1} g''(t_2) dt_2 \right) dt_1\\ &= g(0) + g'(0) + \int_0^1 \int_0^{t_1} g''(t_2) dt_2 dt_1\\ &= g(0) + g'(0) + \frac{1}{2} g''(\alpha) \quad \text{for some } \alpha \in (0, 1).\end{split}

Written in terms of f, this is f(x + p) = f(x) + \nabla f(x)^\top p + \frac{1}{2} p^\top \nabla^2 f(x + \alpha p) p.
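
A numerical illustration of the two derivative identities above, reusing f, its gradient, and its Hessian from Exercise 2 (the point x, direction p, and parameter t are arbitrary choices):

```python
import numpy as np

def f(x):
    return x[0]**3 * x[1]**4 + x[1] / x[0]

def grad(x):
    return np.array([3 * x[0]**2 * x[1]**4 - x[1] / x[0]**2,
                     4 * x[0]**3 * x[1]**3 + 1 / x[0]])

def hess(x):
    return np.array([[6 * x[0] * x[1]**4 + 2 * x[1] / x[0]**3,
                      12 * x[0]**2 * x[1]**3 - 1 / x[0]**2],
                     [12 * x[0]**2 * x[1]**3 - 1 / x[0]**2,
                      12 * x[0]**3 * x[1]**2]])

x, p, t, h = np.array([1.5, 0.5]), np.array([0.2, -0.1]), 0.3, 1e-5
g = lambda s: f(x + s * p)
dg = (g(t + h) - g(t - h)) / (2 * h)           # finite-difference g'(t)
d2g = (g(t + h) - 2 * g(t) + g(t - h)) / h**2  # finite-difference g''(t)
print(abs(dg - grad(x + t * p) @ p))           # matches grad(x + tp)^T p
print(abs(d2g - p @ hess(x + t * p) @ p))      # matches p^T hess(x + tp) p
```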

Exercise 4

Invoking Lemma 4.2.1 for each component function in \left\{ f_i \right\}_{i = 1}^n yields

\begin{split}\lim_{t \rightarrow 0} \frac{f_i(x + tp) - f_i(x)}{t} &= \nabla f_i(x)^\top p\\ \lim_{t \rightarrow 0} \frac{f_i(x + tp) - f_i(x)}{t} - \nabla f_i(x)^\top p &= 0\\ \left\vert \lim_{t \rightarrow 0} \frac{f_i(x + tp) - f_i(x)}{t} - \nabla f_i(x)^\top p \right\vert &= 0\\ \lim_{t \rightarrow 0} \left\vert \frac{f_i(x + tp) - f_i(x)}{t} - \nabla f_i(x)^\top p \right\vert &= 0 & \quad & \lvert \cdot \rvert \text{ is continuous.}\end{split}

Since F is assumed to be continuously differentiable in an open set D containing x, the consequent property F'(x) = \nabla F(x)^\top = J(x) allows the previous statement to be written more compactly as

\begin{split}\lim_{t \rightarrow 0} \frac{F(x + tp) - F(x)}{t} &= \nabla F(x)^\top p\\ &= J(x) p\\ \lim_{t \rightarrow 0} \frac{F(x + tp) - F(x)}{t} - J(x) p &= \vec{0}\\ \left\Vert \lim_{t \rightarrow 0} \frac{F(x + tp) - F(x) - J(x)tp}{t} \right\Vert &= \left\Vert \vec{0} \right\Vert\\ \lim_{t \rightarrow 0} \frac{\lVert F(x + tp) - F(x) - J(x)tp \rVert}{t} &= 0 & \quad & \lVert \cdot \rVert \text{ is continuous.}\end{split}

Thus J(x) = \hat{J}(x).
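
A short sketch of this limit in NumPy, reusing F and J from Exercise 1 (the point and direction are arbitrary); the residual shrinks roughly linearly in t:

```python
import numpy as np

def F(x):
    x1, x2, x3 = x
    return np.array([x1**2 + x2**3 - x3**4,
                     x1 * x2 * x3,
                     2 * x1 * x2 - 3 * x2 * x3 + x1 * x3])

def J(x):
    x1, x2, x3 = x
    return np.array([[2 * x1, 3 * x2**2, -4 * x3**3],
                     [x2 * x3, x1 * x3, x1 * x2],
                     [2 * x2 + x3, 2 * x1 - 3 * x3, -3 * x2 + x1]])

x, p = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 0.25])
for t in [1e-2, 1e-4, 1e-6]:
    # (F(x + tp) - F(x)) / t converges to J(x) p as t -> 0.
    print(t, np.linalg.norm((F(x + t * p) - F(x)) / t - J(x) @ p))
```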

Exercise 5

A useful fact to note is that if I \subset \mathbb{R} is an interval and f \colon I \mapsto \mathbb{R} is differentiable on I, then f is Lipschitz on I if and only if f' is bounded on I.

Clearly if f is Lipschitz on I, then \lvert f'(x) \rvert \leq \gamma (4.1.11).

Suppose \lvert f'(x) \rvert \leq M for all x \in I, for some M \in \mathbb{R}. Let x and y be any two distinct elements of I. Applying the Mean Value Theorem gives

\begin{split}f'(w) &= \frac{f(x) - f(y)}{x - y}\\ \lvert f'(w) \rvert &= \left\vert \frac{f(x) - f(y)}{x - y} \right\vert\\ M \lvert x - y \rvert &\geq \left\vert f(x) - f(y) \right\vert\end{split}

where w \in (x, y). Thus, f is Lipschitz on I with \gamma = M.

In this exercise, f(x) = x \sin x^{-1} with f(0) = 0, so f'(x) = \sin(x^{-1}) - \frac{\cos x^{-1}}{x} for x \neq 0 and I = (-\infty, \infty). On any interval bounded away from zero, x^{-1} is bounded, so f' is bounded there and f is Lipschitz on such intervals. Near zero, however, the term \frac{\cos x^{-1}}{x} is unbounded (take x = \frac{1}{2 \pi k} as k \rightarrow \infty), so f' is unbounded on (0, \delta) and f is not Lipschitz on any interval containing zero, even though f itself is continuous there because \lvert f(x) \rvert \leq \lvert x \rvert.

However, f is not differentiable at x = 0 because the limit does not exist:

\begin{split}f'(0) &= \lim_{\epsilon \rightarrow 0} \frac{f(0 + \epsilon) - f(0)}{\epsilon}\\ &= \lim_{\epsilon \rightarrow 0} \frac{\epsilon \sin(\epsilon^{-1})}{\epsilon}\\ &= \lim_{\epsilon \rightarrow 0} \sin \frac{1}{\epsilon},\end{split}

which does not exist because \sin \frac{1}{\epsilon} oscillates between -1 and 1 as \epsilon \rightarrow 0.
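
A numerical illustration of both observations (the sample points are chosen for convenience):

```python
import numpy as np

def f(x):
    return 0.0 if x == 0 else x * np.sin(1.0 / x)

# Difference quotients at 0 equal sin(1/eps): bounded by 1 but oscillating,
# so the limit defining f'(0) does not exist.
for eps in [1e-1, 1e-3, 1e-5]:
    print(eps, (f(eps) - f(0)) / eps)

# Pairwise difference quotients near 0 grow like 4k, so no single Lipschitz
# constant works on an interval containing 0.
for k in [1, 10, 100]:
    x_k, y_k = 1 / (2 * np.pi * k), 1 / (2 * np.pi * k + np.pi / 2)
    print(k, abs(f(x_k) - f(y_k)) / abs(x_k - y_k))
```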

Exercise 6

Let g(t) = f(x + tp) where t \in [0, 1] and f \colon \mathbb{R}^n \mapsto \mathbb{R}.

\begin{split}g(1) - g(0) &= \int_0^1 g'(t_1) dt_1\\ g(1) - g(0) - g'(0) &= \int_0^1 g'(t_1) - g'(0) dt_1\\ &= \int_0^1 \int_0^{t_1} g''(t_2) dt_2 dt_1 & \quad & g'(t_1) - g'(0) = \int_0^{t_1} g''(t_2) dt_2\\ g(1) - g(0) - g'(0) - \int_0^1 \int_0^{t_1} g''(0) dt_2 dt_1 &= \int_0^1 \int_0^{t_1} g''(t_2) dt_2 dt_1 - \int_0^1 \int_0^{t_1} g''(0) dt_2 dt_1\\ g(1) - g(0) - g'(0) - \frac{1}{2} g''(0) &= \int_0^1 \int_0^{t_1} g''(t_2) - g''(0) dt_2 dt_1\\ \left\vert g(1) - g(0) - g'(0) - \frac{1}{2} g''(0) \right\vert &= \left\vert \int_0^1 \int_0^{t_1} g''(t_2) - g''(0) dt_2 dt_1 \right\vert\\ &\leq \int_0^1 \int_0^{t_1} \left\vert g''(t_2) - g''(0) \right\vert dt_2 dt_1 & \quad & \text{double integral application of Lemma 4.1.10}\\ \left\vert f(x + p) - f(x) - \nabla f(x)^\top p - \frac{1}{2} p^\top \nabla^2 f(x) p \right\vert &\leq \int_0^1 \int_0^{t_1} \left\vert p^\top \left[ \nabla^2 f(x + t_2 p) - \nabla^2 f(x) \right] p \right\vert dt_2 dt_1\\ &= \int_0^1 \int_0^{t_1} \left\vert \sum_i \sigma_i p^\top u_i v_i^\top p \right\vert dt_2 dt_1 & \quad & \text{apply Theorem 3.6.4 to the Hessian difference}\\ &= \int_0^1 \int_0^{t_1} \left\vert \sum_i \sigma_i \sum_j \sum_k p_j u_{ij} v_{ik} p_k \right\vert dt_2 dt_1\\ &\leq \left\Vert p \right\Vert^2_\infty \int_0^1 \int_0^{t_1} \left\vert \sum_i \sigma_i \sum_j \sum_k u_{ij} v_{ik} \right\vert dt_2 dt_1\\ &\leq \left\Vert p \right\Vert^2_\infty \int_0^1 \int_0^{t_1} n \left\Vert \nabla^2 f(x + t_2 p) - \nabla^2 f(x) \right\Vert_\infty dt_2 dt_1\\ &\leq \left\Vert p \right\Vert^2 \int_0^1 \int_0^{t_1} n \left\Vert \nabla^2 f(x + t_2 p) - \nabla^2 f(x) \right\Vert dt_2 dt_1 & \quad & \text{(3.1.5)}\\ &\leq \left\Vert p \right\Vert^2 \int_0^1 \int_0^{t_1} \gamma \left\Vert t_2 p \right\Vert dt_2 dt_1 & \quad & \text{(4.1.11)}\\ &= \gamma \left\Vert p \right\Vert^3 \int_0^1 \int_0^{t_1} t_2 dt_2 dt_1\\ &= \frac{\gamma}{6} \left\Vert p \right\Vert^3\end{split}

The additional factor of n is rolled into \gamma. Since the definition of Lipschitz continuity only requires a norm to be defined over \mathbb{R}^{m \times n}, there does not seem to be any reason against using a matrix norm. The book focuses on p-norms where p \geq 1.
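
An empirical check of the final bound, reusing f, its gradient, and its Hessian from Exercise 2 (the base point and direction are arbitrary): the remainder scales like \lVert p \rVert^3, so the ratio below stays roughly constant as p shrinks.

```python
import numpy as np

def f(x):
    return x[0]**3 * x[1]**4 + x[1] / x[0]

def grad(x):
    return np.array([3 * x[0]**2 * x[1]**4 - x[1] / x[0]**2,
                     4 * x[0]**3 * x[1]**3 + 1 / x[0]])

def hess(x):
    return np.array([[6 * x[0] * x[1]**4 + 2 * x[1] / x[0]**3,
                      12 * x[0]**2 * x[1]**3 - 1 / x[0]**2],
                     [12 * x[0]**2 * x[1]**3 - 1 / x[0]**2,
                      12 * x[0]**3 * x[1]**2]])

x, d = np.array([1.5, 0.5]), np.array([1.0, -0.5])
for s in [1e-1, 1e-2, 1e-3]:
    p = s * d
    r = abs(f(x + p) - f(x) - grad(x) @ p - 0.5 * p @ hess(x) @ p)
    print(s, r / np.linalg.norm(p)**3)  # roughly constant, as the bound predicts
```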

Exercise 7

\begin{split}F(v) - F(u) &= \int_u^v F'(z) dz & \quad & \text{(4.1.8)}\\ &= \int_0^1 J(u + t(v - u)) (v - u) dt & \quad & \text{change of variables where } z = u + t(v - u) \text{ and } dz = (v - u) dt\\ F(v) - F(u) - J(x) (v - u) &= \int_0^1 J(u + t(v - u)) (v - u) dt - J(x) (v - u)\\ \left\Vert F(v) - F(u) - J(x) (v - u) \right\Vert &= \left\Vert \int_0^1 \left[ J(u + t(v - u)) - J(x) \right] (v - u) dt \right\Vert\\ &\leq \int_0^1 \left\Vert \left[ J(u + t(v - u)) - J(x) \right] (v - u) \right\Vert dt & \quad & \text{(4.1.9)}\\ &\leq \int_0^1 \left\Vert J(u + t(v - u)) - J(x) \right\Vert \left\Vert v - u \right\Vert dt\\ &\leq \left\Vert v - u \right\Vert \int_0^1 \gamma \left\Vert u + t(v - u) - x \right\Vert dt & \quad & \text{(4.1.11)}\\ &= \gamma \left\Vert v - u \right\Vert \int_0^1 \left\Vert u + t(v - u) - \left[ tx + (1 - t)x \right] \right\Vert dt\\ &= \gamma \left\Vert v - u \right\Vert \int_0^1 \left\Vert tv - tx + (1 - t)u - (1 - t)x \right\Vert dt\\ &\leq \gamma \left\Vert v - u \right\Vert \int_0^1 t \left\Vert v - x \right\Vert + (1 - t) \left\Vert u - x \right\Vert dt\\ &= \gamma \left\Vert v - u \right\Vert \frac{\left\Vert v - x \right\Vert + \left\Vert u - x \right\Vert}{2}\end{split}

Exercise 8

\begin{split}F(x + p) - F(x) &= \int_0^1 J(x + tp) p dt\\ F(x + p) - F(x) - J(x) p &= \int_0^1 J(x + tp) p dt - J(x) p\\ \left\Vert F(x + p) - F(x) - J(x) p \right\Vert &= \left\Vert \int_0^1 \left[ J(x + tp) - J(x) \right] p dt \right\Vert\\ &\leq \int_0^1 \left\Vert \left[ J(x + tp) - J(x) \right] p \right\Vert dt & \quad & \text{(4.1.9)}\\ &\leq \int_0^1 \left\Vert J(x + tp) - J(x) \right\Vert \left\Vert p \right\Vert dt & \quad & \text{Theorem 3.1.3}\\ &\leq \left\Vert p \right\Vert \int_0^1 \gamma \left\Vert tp \right\Vert^\alpha dt & \quad & \text{Hölder continuity with exponent } \alpha \text{ (Lipschitz is the case } \alpha = 1 \text{)}\\ &= \gamma \left\Vert p \right\Vert^{1 + \alpha} \int_0^1 t^\alpha dt\\ &= \left\Vert p \right\Vert^{1 + \alpha} \frac{\gamma}{1 + \alpha}\end{split}

Exercise 9

\begin{split}F(x) &= \begin{pmatrix} 3x_1^2 - 2x_2\\ x_2^3 - \frac{1}{x_1} \end{pmatrix}\\ J(x) &= \begin{pmatrix} 6x_1 & -2\\ \frac{1}{x_1^2} & 3x_2^2 \end{pmatrix}\end{split}

When x = \begin{pmatrix} 1 & 1 \end{pmatrix}^\top,

\begin{split}J(x) = \begin{pmatrix} 6 & -2\\ 1 & 3 \end{pmatrix}.\end{split}

Using the forward difference approximation where h = 0.01 yields

F(x) = \begin{pmatrix} 1\\ 0 \end{pmatrix}, \quad F(x + h e_1) = \begin{pmatrix} 1.0603\\ 0.0099 \end{pmatrix}, \quad F(x + h e_2) = \begin{pmatrix} 0.98\\ 0.0303 \end{pmatrix}

resulting in A = \begin{pmatrix} 6.03 & -2\\ 0.99 & 3.03 \end{pmatrix}.

The error of the approximation measured in the \mathcal{L}_1-norm is \lVert A - J(x) \rVert_1 \approx 0.04; combined with the forward-difference error bound \frac{\gamma h}{2}, this makes \gamma \geq 8.
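
The computation can be reproduced with a few lines of NumPy (a sketch of the same forward-difference approximation):

```python
import numpy as np

def F(x):
    x1, x2 = x
    return np.array([3 * x1**2 - 2 * x2, x2**3 - 1 / x1])

def J(x):
    x1, x2 = x
    return np.array([[6 * x1, -2.0], [1 / x1**2, 3 * x2**2]])

x, h = np.array([1.0, 1.0]), 0.01
A = np.column_stack([(F(x + h * e) - F(x)) / h for e in np.eye(2)])
print(A)                                   # approximately [[6.03, -2], [0.99, 3.0301]]
print(np.abs(A - J(x)).sum(axis=0).max())  # L1 operator-norm error, approximately 0.04
```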

Exercise 10

Define

\alpha = f(x + he_i) - f(x) - h \nabla f(x)_i - \frac{h^2}{2} \nabla^2 f(x)_{ii}

and

\beta = f(x - he_i) - f(x) + h \nabla f(x)_i - \frac{h^2}{2} \nabla^2 f(x)_{ii}

where \lvert \alpha \rvert, \lvert \beta \rvert \leq \frac{\gamma h^3}{6} by (4.1.13).

\begin{split}\alpha + \beta &= f(x + he_i) - 2f(x) + f(x - he_i) - h^2 \nabla^2 f(x)_{ii}\\ \left\vert \alpha + \beta \right\vert &= \left\vert f(x + he_i) - 2f(x) + f(x - he_i) - h^2 \nabla^2 f(x)_{ii} \right\vert\\ &\leq \frac{\gamma h^3}{3} \qquad \text{triangle inequality}\end{split}

Thus, \left| A_{ii} - \nabla^2 f(x)_{ii} \right| \leq \frac{\gamma h}{3} where A_{ii} = \frac{f(x + he_i) - 2f(x) + f(x - he_i)}{h^2}.
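
A sketch of this approximation, using the f from Exercise 2 as an arbitrary smooth test function:

```python
import numpy as np

def f(x):
    return x[0]**3 * x[1]**4 + x[1] / x[0]

def hess_diag(x):
    return np.array([6 * x[0] * x[1]**4 + 2 * x[1] / x[0]**3,
                     12 * x[0]**3 * x[1]**2])

x, h = np.array([1.5, 0.5]), 1e-4
A_ii = np.array([(f(x + h * e) - 2 * f(x) + f(x - h * e)) / h**2
                 for e in np.eye(2)])
print(np.abs(A_ii - hess_diag(x)))  # small, consistent with the gamma*h/3 bound
```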

Exercise 11

Substituting p = he_i + he_j, p = he_i - he_j, p = -he_i + he_j, and p = -he_i - he_j into Lemma 4.1.14 yields

\begin{split}\alpha &= f(x + he_i + he_j) - f(x) - h \nabla f(x)_i - h \nabla f(x)_j - \frac{h^2}{2} \left[ \nabla^2 f(x)_{ii} + 2 \nabla^2 f(x)_{ij} + \nabla^2 f(x)_{jj} \right]\\ \beta &= f(x + he_i - he_j) - f(x) - h \nabla f(x)_i + h \nabla f(x)_j - \frac{h^2}{2} \left[ \nabla^2 f(x)_{ii} - 2 \nabla^2 f(x)_{ij} + \nabla^2 f(x)_{jj} \right]\\ \eta &= f(x - he_i + he_j) - f(x) + h \nabla f(x)_i - h \nabla f(x)_j - \frac{h^2}{2} \left[ \nabla^2 f(x)_{ii} - 2 \nabla^2 f(x)_{ij} + \nabla^2 f(x)_{jj} \right]\\ \nu &= f(x - he_i - he_j) - f(x) + h \nabla f(x)_i + h \nabla f(x)_j - \frac{h^2}{2} \left[ \nabla^2 f(x)_{ii} + 2 \nabla^2 f(x)_{ij} + \nabla^2 f(x)_{jj} \right]\end{split}

By the triangle inequality,

\begin{split}\left| \alpha - \beta - \eta + \nu \right| &\leq \left| \alpha \right| + \left| \beta \right| + \left| \eta \right| + \left| \nu \right|\\ &\leq \frac{\gamma h^3}{6} \left[ \left\Vert e_i + e_j \right\Vert^3 + \left\Vert e_i - e_j \right\Vert^3 + \left\Vert -e_i + e_j \right\Vert^3 + \left\Vert -e_i - e_j \right\Vert^3 \right]\\ &\leq \frac{16}{3} \gamma h^3 & \quad & \left\Vert \pm e_i \pm e_j \right\Vert \leq 2 \text{ by the triangle inequality.}\end{split}

Expanding \alpha - \beta - \eta + \nu shows that the gradient terms and the diagonal Hessian terms cancel, leaving

\begin{split}\left| \alpha - \beta - \eta + \nu \right| &= \left| f(x + he_i + he_j) - f(x + he_i - he_j) - f(x - he_i + he_j) + f(x - he_i - he_j) - 4h^2 \nabla^2 f(x)_{ij} \right|\\ &\leq \frac{16}{3} \gamma h^3.\end{split}

Hence \left| A_{ij} - \nabla^2 f(x)_{ij} \right| \leq \frac{4}{3} \gamma h where

A_{ij} = \frac{f(x + he_i + he_j) - f(x + he_i - he_j) - f(x - he_i + he_j) + f(x - he_i - he_j)}{4h^2}.

The number of function evaluations is still the same (unless f(x) has already been computed), but this new approximation reduces the upper bound by 20\%.
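
The same test function illustrates the mixed-partial approximation, here with (i, j) = (1, 2):

```python
import numpy as np

def f(x):
    return x[0]**3 * x[1]**4 + x[1] / x[0]

def hess_12(x):
    return 12 * x[0]**2 * x[1]**3 - 1 / x[0]**2

x, h = np.array([1.5, 0.5]), 1e-4
e1, e2 = np.eye(2)
A_12 = (f(x + h * e1 + h * e2) - f(x + h * e1 - h * e2)
        - f(x - h * e1 + h * e2) + f(x - h * e1 - h * e2)) / (4 * h**2)
print(abs(A_12 - hess_12(x)))  # small, consistent with the (4/3) * gamma * h bound
```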

Exercise 12

Define A, M \in \mathbb{R}^{n \times n} where M = M^\top and \hat{A} = \frac{A + A^\top}{2}.

\begin{split}\left\Vert M - \hat{A} \right\Vert_F^2 &= \text{tr}\left( \left( M - \hat{A} \right)^\top \left( M - \hat{A} \right) \right)\\ &= \text{tr}\left(M^\top M\right) - \text{tr}\left(M^\top \hat{A}\right) - \text{tr}\left(\hat{A}^\top M\right) + \text{tr}\left(\hat{A}^\top \hat{A}\right) & \quad & \text{trace operator is a linear mapping}\\ &= \text{tr}\left(M^\top M\right) - 2\text{tr}\left(M^\top \hat{A}\right) + \text{tr}\left(\hat{A}^\top \hat{A}\right) & \quad & \text{trace equivalence under transpose}\\ &= \text{tr}\left(M^\top M\right) - \text{tr}\left(M^\top A\right) - \text{tr}\left(M^\top A^\top\right) + \text{tr}\left(\hat{A}^\top \hat{A}\right)\\ &= \text{tr}\left(M^\top M\right) - \text{tr}\left(M^\top A\right) - \text{tr}\left(A^\top M\right) + \text{tr}\left(\hat{A}^\top \hat{A}\right) & \quad & M = M^\top \text{ and trace is invariant under cyclic permutations}\\ &= \text{tr}\left(M^\top M\right) - 2 \text{tr}\left(M^\top A\right) + \frac{1}{4} \left\Vert A + A^\top \right\Vert_F^2\\ &\leq \text{tr}\left(M^\top M\right) - 2 \text{tr}\left(M^\top A\right) + \left\Vert A \right\Vert_F^2 & \quad & \text{triangle inequality and } \lVert A^\top \rVert_F = \lVert A \rVert_F\\ &= \text{tr}\left(M^\top M\right) - 2 \text{tr}\left(M^\top A\right) + \text{tr}\left(A^\top A\right)\\ &= \left\Vert M - A \right\Vert_F^2\\ \left\Vert M - \hat{A} \right\Vert_F &\leq \left\Vert M - A \right\Vert_F\end{split}
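
A numerical spot-check of the result with random matrices (the size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
A_hat = (A + A.T) / 2                    # symmetrization of A
for _ in range(100):
    M = rng.standard_normal((5, 5))
    M = (M + M.T) / 2                    # random symmetric M
    # The symmetrization is never farther from M than A is (Frobenius norm).
    assert np.linalg.norm(M - A_hat) <= np.linalg.norm(M - A)
```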

Exercise 13

Proof by contradiction: assume that x is a local minimizer of f but \nabla^2 f(x) is not positive semidefinite, i.e. there exists some p \in \mathbb{R}^n such that p^\top \nabla^2 f(x) p < 0.

Since the Hessian is a real symmetric square matrix, its eigendecomposition exists as

\nabla^2 f(x) = Q \Lambda Q^\top = \sum_{i = 1}^n \lambda_i q_i q_i^\top.

Pick p = c q_i where q_i is any eigenvector whose corresponding eigenvalue \lambda_i is negative and c \in \mathbb{R} \setminus \{0\} is small enough that x + p \in D.

By the continuity of \nabla^2 f, c can also be chosen small enough that q_i^\top \nabla^2 f(z) q_i < 0 for every z \in [x, x + p] \subset D. Applying Lemma 4.1.5 for some z \in [x, x + p] yields

\begin{split}f(x + p) &= f(x) + \nabla f(x)^\top p + \frac{1}{2} p^\top \nabla^2 f(z) p\\ &= f(x) + \frac{1}{2} p^\top \nabla^2 f(z) p & \quad & \nabla f(x) = 0 \text{ at a local minimizer}\\ &= f(x) + \frac{1}{2} c^2 q_i^\top \nabla^2 f(z) q_i\\ &< f(x)\end{split}

which is a contradiction because x is a local minimizer of f. Thus, \nabla^2 f(x) is positive semidefinite.

This argument extends to the setting of Theorem 4.3.2: if \nabla^2 f \in \text{Lip}_\gamma(D) for an open convex set D \subset \mathbb{R}^n containing x, then \nabla^2 f is in particular continuous on D, so the same construction applies.

Exercise 14

If x_* is a local minimizer of f, Theorem 4.3.2 states that \nabla^2 f(x_*) must be positive semidefinite. Since \nabla^2 f(x_*) is nonsingular, none of its eigenvalues are zero, which means each of them must be positive in this case. This is equivalent to being positive definite.

If \nabla f(x_*) = 0 and \nabla^2 f(x_*) is positive definite, Theorem 4.3.2 states that x_* is a local minimizer of f.

Exercise 15

If f has a unique minimizer x_*, then \nabla f(x_*) = 0 = Ax_* + b. Since x_* is the only critical point, Ax = -b has a unique solution, which is equivalent to saying that A has full rank and is therefore invertible. According to Theorem 4.3.2, \nabla^2 f(x) = A must be positive semidefinite. A positive semidefinite matrix that is also nonsingular is positive definite, so A must be positive definite and x_* = -A^{-1} b.

If A is positive definite, Theorem 4.3.2 states that x is a local minimizer when \nabla f(x) = Ax + b = 0. Setting x = -A^{-1} b satisfies that condition, and it is the unique solution because the nullspace of A is trivial.
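
A quick numerical confirmation, assuming the quadratic has the form f(x) = \frac{1}{2} x^\top A x + b^\top x so that \nabla f(x) = Ax + b as above (the construction of A below is just one convenient way to produce a symmetric positive definite matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4 * np.eye(4)              # symmetric positive definite
b = rng.standard_normal(4)
x_star = np.linalg.solve(A, -b)          # x* = -A^{-1} b
f = lambda x: 0.5 * x @ A @ x + b @ x
for _ in range(100):
    # Every perturbed point has a larger objective value.
    assert f(x_star) <= f(x_star + 0.1 * rng.standard_normal(4))
```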

Exercise 16

\begin{split}\DeclareMathOperator*{\argmin}{arg\,min} \argmin_{x \in \mathbb{R}^n} f(x) &= \argmin_{x \in \mathbb{R}^n} \left\Vert Ax - b \right\Vert_2\\ &= \argmin_{x \in \mathbb{R}^n} \left\Vert Ax - b \right\Vert_2^2\\ &= \argmin_{x \in \mathbb{R}^n} x^\top A^\top Ax - 2 b^\top Ax + b^\top b\end{split}

The necessary conditions for x_* to be a local minimizer of f are \nabla f(x_*) = 0 and \nabla^2 f(x_*) positive semidefinite. The sufficient conditions are \nabla f(x_*) = 0 together with \nabla^2 f(x_*) positive definite. Solving for the gradient and the Hessian yields

\nabla f(x) = 2 A^\top A x - 2 A^\top b \quad \text{and} \quad \nabla^2 f(x) = 2 A^\top A.

Setting the gradient to zero yields the normal equations A^\top A x_* = A^\top b. The Hessian 2 A^\top A is positive semidefinite for any A (and positive definite exactly when A has full column rank), so any solution of the normal equations minimizes f; when A^\top A is singular, the pseudo-inverse of A from Theorem 3.6.5 provides a solution.
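
A sketch comparing the normal equations with an SVD-based least-squares solve, which is where the pseudo-inverse enters (random data; full column rank holds almost surely here):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))
b = rng.standard_normal(8)
x_normal = np.linalg.solve(A.T @ A, A.T @ b)    # normal equations
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]  # SVD / pseudo-inverse route
print(np.max(np.abs(x_normal - x_lstsq)))       # agree up to roundoff
```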