Multivariable Calculus Background
Derivatives and Multivariable Models
Important results from single-variable calculus (e.g. the mean value theorem)
carry over to real-valued functions of multiple variables because such
statements only involve the values of a real-valued function at two points
x, \bar{x} \in \mathbb{R}^n, and the line segment connecting these points can
be parameterized using a single variable; this is why objects like the
gradient and Hessian behave so much like one-dimensional derivatives.
There is no mean value theorem for continuously differentiable vector-valued
functions (i.e. for the Jacobian).
Exercise 1
\begin{split}F(x) &= \begin{pmatrix}
x_1^2 + x_2^3 - x_3^4\\
x_1 x_2 x_3\\
2 x_1 x_2 - 3 x_2 x_3 + x_1 x_3
\end{pmatrix}\\
J(x) &= \begin{pmatrix}
2 x_1 & 3 x_2^2 & -4 x_3^3\\
x_2 x_3 & x_1 x_3 & x_1 x_2\\
2 x_2 + x_3 & 2 x_1 - 3 x_3 & -3 x_2 + x_1
\end{pmatrix}\end{split}
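As a sanity check, the analytic Jacobian can be compared against a
forward-difference approximation. A minimal NumPy sketch, with the test point
x and step size h chosen arbitrarily:

```python
import numpy as np

def F(x):
    """Vector-valued function from Exercise 1."""
    x1, x2, x3 = x
    return np.array([x1**2 + x2**3 - x3**4,
                     x1 * x2 * x3,
                     2 * x1 * x2 - 3 * x2 * x3 + x1 * x3])

def J(x):
    """Analytic Jacobian derived above."""
    x1, x2, x3 = x
    return np.array([[2 * x1, 3 * x2**2, -4 * x3**3],
                     [x2 * x3, x1 * x3, x1 * x2],
                     [2 * x2 + x3, 2 * x1 - 3 * x3, -3 * x2 + x1]])

x = np.array([1.0, 2.0, 3.0])  # arbitrary test point
h = 1e-6
# Column j of the Jacobian is approximated by (F(x + h e_j) - F(x)) / h.
J_fd = np.column_stack([(F(x + h * e) - F(x)) / h for e in np.eye(3)])
assert np.allclose(J_fd, J(x), atol=1e-3)
```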
Exercise 2
Let f(x) = x_1^3 x_2^4 + \frac{x_2}{x_1}.
\nabla f(x) = \begin{pmatrix}
\frac{\partial f}{\partial x_1}(x)\\
\frac{\partial f}{\partial x_2}(x)
\end{pmatrix} = \begin{pmatrix}
3 x_1^2 x_2^4 - x_2 x_1^{-2}\\
4 x_1^3 x_2^3 + x_1^{-1}
\end{pmatrix}
\nabla^2 f(x) = \begin{pmatrix}
\frac{\partial^2 f}{\partial x_1^2}(x) &
\frac{\partial^2 f}{\partial x_1 \partial x_2}(x)\\
\frac{\partial^2 f}{\partial x_2 \partial x_1}(x) &
\frac{\partial^2 f}{\partial x_2^2}(x)
\end{pmatrix} = \begin{pmatrix}
6 x_1 x_2^4 + 2 x_2 x_1^{-3} & 12 x_1^2 x_2^3 - x_1^{-2}\\
12 x_1^2 x_2^3 - x_1^{-2} & 12 x_1^3 x_2^2
\end{pmatrix}
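The gradient and Hessian above can be verified the same way with central
differences; a sketch assuming an arbitrary test point away from x_1 = 0:

```python
import numpy as np

def f(x):
    x1, x2 = x
    return x1**3 * x2**4 + x2 / x1

def grad(x):
    x1, x2 = x
    return np.array([3 * x1**2 * x2**4 - x2 / x1**2,
                     4 * x1**3 * x2**3 + 1 / x1])

def hess(x):
    x1, x2 = x
    return np.array([[6 * x1 * x2**4 + 2 * x2 / x1**3,
                      12 * x1**2 * x2**3 - 1 / x1**2],
                     [12 * x1**2 * x2**3 - 1 / x1**2,
                      12 * x1**3 * x2**2]])

x, h = np.array([1.5, 0.5]), 1e-6  # arbitrary test point and step size
# Central differences of f give the gradient; differencing the gradient
# column by column gives the Hessian.
g_fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(2)])
H_fd = np.column_stack([(grad(x + h * e) - grad(x - h * e)) / (2 * h)
                        for e in np.eye(2)])
assert np.allclose(g_fd, grad(x), atol=1e-6)
assert np.allclose(H_fd, hess(x), atol=1e-4)
```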
Exercise 3
Define x(t) = x + tp and g(t) = f(x(t)) = f(x + tp) where t \in (0, 1).
For 0 \leq \alpha \leq 1, applying the chain rule once yields
\begin{split}\frac{dg}{dt}(\alpha)
&= \sum_{j = 1}^n
\frac{\partial f}{\partial x_j}\left( x(\alpha) \right)
\frac{d x_j}{dt}(\alpha)
& \quad & \text{definition of total differential}\\
&= \sum_{j = 1}^n \frac{\partial f}{\partial x_j}\left( x(\alpha) \right) p_j\\
&= \nabla f(x + \alpha p)^\top p\end{split}
and applying it a second time yields
\begin{split}\frac{d^2 g}{dt^2}(\alpha)
&= \frac{d}{dt}\left(
\sum_{j = 1}^n \frac{\partial f}{\partial x_j}\left( x(\alpha) \right) p_j
\right)\\
&= \sum_{i = 1}^n \sum_{j = 1}^n
\frac{\partial^2 f}{\partial x_i \partial x_j}\left( x(\alpha) \right) p_i p_j
& \quad & \text{definition of total differential}\\
&= p^\top \nabla^2 f\left( x(\alpha) \right) p.\end{split}
Substituting \alpha = 0 yields
\frac{\partial^2 f}{\partial p^2} = p^\top \nabla^2 f(x) p.
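This identity is what justifies estimating curvature along a direction by
differencing g(t) = f(x + tp); a sketch reusing f from Exercise 2 (the point
x and direction p are arbitrary choices):

```python
import numpy as np

def f(x):
    x1, x2 = x
    return x1**3 * x2**4 + x2 / x1

def hess(x):
    x1, x2 = x
    return np.array([[6 * x1 * x2**4 + 2 * x2 / x1**3,
                      12 * x1**2 * x2**3 - 1 / x1**2],
                     [12 * x1**2 * x2**3 - 1 / x1**2,
                      12 * x1**3 * x2**2]])

x, p, h = np.array([1.5, 0.5]), np.array([0.3, -0.2]), 1e-4
g = lambda t: f(x + t * p)
# Central second difference of g at t = 0 approximates d^2 g / dt^2.
d2g = (g(h) - 2 * g(0.0) + g(-h)) / h**2
assert np.isclose(d2g, p @ hess(x) @ p, rtol=1e-3)
```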
Applying the first fundamental theorem of calculus twice, followed by the
mean value theorem for integrals, yields
\begin{split}g(1)
&= g(0) + \int_0^1 g'(t_1) dt_1\\
&= g(0) + \int_0^1 \left( g'(0) + \int_0^{t_1} g''(t_2) dt_2 \right) dt_1\\
&= g(0) + g'(0) + \int_0^1 \int_0^{t_1} g''(t_2) dt_2 dt_1\\
&= g(0) + g'(0) + \frac{1}{2} g''(\xi)
& \quad & \text{for some } \xi \in (0, 1).\end{split}
Rewriting in terms of f gives
f(x + p) = f(x) + \nabla f(x)^\top p + \frac{1}{2} p^\top \nabla^2 f(x + \xi p) p.
Exercise 4
Invoking Lemma 4.2.1 for each component function in
\left\{ f_i \right\}_{i = 1}^n yields
\begin{split}\lim_{t \rightarrow 0} \frac{f_i(x + tp) - f_i(x)}{t}
&= \nabla f_i(x)^\top p\\
\lim_{t \rightarrow 0} \frac{f_i(x + tp) - f_i(x)}{t} - \nabla f_i(x)^\top p
&= 0\\
\left\vert \lim_{t \rightarrow 0}
\frac{f_i(x + tp) - f_i(x)}{t} - \nabla f_i(x)^\top p
\right\vert &= 0\\
\lim_{t \rightarrow 0}
\left\vert
\frac{f_i(x + tp) - f_i(x)}{t} - \nabla f_i(x)^\top p
\right\vert
&= 0 & \quad & \lvert \cdot \rvert \text{ is continuous.}\end{split}
Since F is assumed to be continuously differentiable in an open set
D containing x, the consequent property
F'(x) = \nabla F(x)^\top = J(x) allows the previous statement to be
written more compactly as
\begin{split}\lim_{t \rightarrow 0} \frac{F(x + tp) - F(x)}{t}
&= \nabla F(x)^\top p\\
&= J(x) p\\
\lim_{t \rightarrow 0} \frac{F(x + tp) - F(x)}{t} - J(x) p &= \vec{0}\\
\left\Vert
\lim_{t \rightarrow 0} \frac{F(x + tp) - F(x) - J(x)tp}{t}
\right\Vert &= \left\Vert 0 \right\Vert\\
\lim_{t \rightarrow 0} \frac{\lVert F(x + tp) - F(x) - J(x)tp \rVert}{t}
&= 0 & \quad & \lVert \cdot \rVert \text{ is continuous.}\end{split}
Thus J(x) = \hat{J}(x).
Exercise 5
A useful fact to note is that if I \subset \mathbb{R} is an interval and
f \colon I \mapsto \mathbb{R} is differentiable on I, then
f is Lipschitz on I if and only if f' is bounded on
I.
Clearly if f is Lipschitz on I with constant \gamma, then
\lvert f'(x) \rvert \leq \gamma for all x \in I (4.1.11).
Suppose \lvert f'(x) \rvert \leq M for all x \in I and
M \in \mathbb{R}. Let x and y be any two distinct
elements of I. Applying the Mean Value Theorem gives
\begin{split}f'(w) &= \frac{f(x) - f(y)}{x - y}\\
\lvert f'(w) \rvert &= \left\vert \frac{f(x) - f(y)}{x - y} \right\vert\\
M \lvert x - y \rvert &\geq \left\vert f(x) - f(y) \right\vert\end{split}
where w lies strictly between x and y. Thus, f is Lipschitz on I with
\gamma = M.
In this exercise, f(x) = x \sin x^{-1} with f(0) = 0 and
I = (-\infty, \infty). The boundedness criterion does not apply near the
origin because f'(x) = \sin x^{-1} - \frac{\cos x^{-1}}{x} is unbounded as
x \rightarrow 0. Nevertheless, f is Lipschitz continuous at x = 0: since
\lvert \sin x^{-1} \rvert \leq 1,
\left\vert f(x) - f(0) \right\vert = \left\vert x \sin x^{-1} \right\vert
\leq \left\vert x \right\vert, i.e. \gamma = 1.
However, f is not differentiable at x = 0 because the limit does
not exist:
\begin{split}f'(0)
&= \lim_{\epsilon \rightarrow 0} \frac{f(0 + \epsilon) - f(0)}{\epsilon}\\
&= \lim_{\epsilon \rightarrow 0}
\frac{\epsilon \sin(\epsilon^{-1})}{\epsilon}\\
&= \lim_{\epsilon \rightarrow 0} \sin \frac{1}{\epsilon},\end{split}
and \sin \epsilon^{-1} oscillates between -1 and 1 infinitely often as
\epsilon \rightarrow 0.
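A quick numeric illustration of both halves of the argument, with the sample
points chosen to land on the extremes of the oscillation:

```python
import numpy as np

f = lambda x: x * np.sin(1 / x) if x != 0 else 0.0

# Pointwise Lipschitz bound at 0: |f(x) - f(0)| <= 1 * |x|.
xs = np.linspace(1e-6, 1.0, 10_000)
assert all(abs(f(x)) <= abs(x) for x in xs)

# The difference quotient (f(eps) - f(0)) / eps = sin(1 / eps) oscillates:
for k in [10, 100, 1000]:
    up = 1 / (2 * np.pi * k + np.pi / 2)    # quotient = sin(2*pi*k + pi/2) = +1
    down = 1 / (2 * np.pi * k - np.pi / 2)  # quotient = sin(2*pi*k - pi/2) = -1
    print(f(up) / up, f(down) / down)       # ~ +1.0, -1.0 no matter how small
```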
Exercise 6
Let g(t) = f(x + tp) where t \in (0, 1) and
f \colon \mathbb{R}^n \mapsto \mathbb{R}.
\begin{split}g(1) - g(0) &= \int_0^1 g'(t_1) dt_1\\
g(1) - g(0) - g'(0) &= \int_0^1 g'(t_1) - g'(0) dt_1\\
&= \int_0^1 \int_0^{t_1} g''(t_2) dt_2 dt_1
& \quad & g'(t_1) - g'(0) = \int_0^{t_1} g''(t_2) dt_2\\
g(1) - g(0) - g'(0) - \int_0^1 \int_0^{t_1} g''(0) dt_2 dt_1
&= \int_0^1 \int_0^{t_1} g''(t_2) dt_2 dt_1 -
\int_0^1 \int_0^{t_1} g''(0) dt_2 dt_1\\
g(1) - g(0) - g'(0) - \frac{1}{2} g''(0)
&= \int_0^1 \int_0^{t_1} g''(t_2) - g''(0) dt_2 dt_1\\
\left\vert g(1) - g(0) - g'(0) - \frac{1}{2} g''(0) \right\vert
&= \left\vert
\int_0^1 \int_0^{t_1} g''(t_2) - g''(0) dt_2 dt_1
\right\vert\\
&\leq \int_0^1 \int_0^{t_1}
\left\vert g''(t_2) - g''(0) \right\vert dt_2 dt_1
& \quad & \text{double integral application of Lemma 4.1.10}\\
\left\vert
f(x + p) - f(x) - \nabla f(x)^\top p - \frac{1}{2} p^\top \nabla^2 f(x) p
\right\vert
&\leq \int_0^1 \int_0^{t_1} \left\vert
p^\top \left[ \nabla^2 f(x + t_2 p) - \nabla^2 f(x) \right] p
\right\vert dt_2 dt_1\\
&= \int_0^1 \int_0^{t_1} \left\vert
\sum_i \sigma_i p^\top u_i v_i^\top p
\right\vert dt_2 dt_1
& \quad & \text{apply Theorem 3.6.4 to the Hessian difference}\\
&= \int_0^1 \int_0^{t_1} \left\vert
\sum_i \sigma_i \sum_j \sum_k p_j u_{ij} v_{ik} p_k
\right\vert dt_2 dt_1\\
&\leq \left\Vert p \right\Vert^2_\infty \int_0^1 \int_0^{t_1}
\left\vert
\sum_i \sigma_i \sum_j \sum_k u_{ij} v_{ik}
\right\vert dt_2 dt_1\\
&\leq \left\Vert p \right\Vert^2_\infty \int_0^1 \int_0^{t_1} n
\left\Vert
\nabla^2 f(x + t_2 p) - \nabla^2 f(x)
\right\Vert_\infty dt_2 dt_1\\
&\leq \left\Vert p \right\Vert^2 \int_0^1 \int_0^{t_1} n
\left\Vert \nabla^2 f(x + t_2 p) - \nabla^2 f(x) \right\Vert dt_2 dt_1
& \quad & \text{(3.1.5)}\\
&\leq \left\Vert p \right\Vert^2 \int_0^1 \int_0^{t_1} \gamma
\left\Vert t_2 p \right\Vert dt_2 dt_1
& \quad & \text{(4.1.11)}\\
&= \gamma \left\Vert p \right\Vert^3 \int_0^1 \int_0^{t_1} t_2 dt_2 dt_1\\
&= \frac{\gamma}{6} \left\Vert p \right\Vert^3\end{split}
The additional factor of n is absorbed into \gamma. Since the
definition of Lipschitz continuity only requires a norm to be defined over
\mathbb{R}^{m \times n}, there is no reason a matrix norm cannot be used;
the book focuses on p-norms where p \geq 1.
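The cubic bound predicts that halving \lVert p \rVert shrinks the left-hand
side by roughly a factor of eight. A sketch with an arbitrary smooth test
function (only the \mathcal{O}(\lVert p \rVert^3) rate is checked, not the
constant):

```python
import numpy as np

def f(x):
    return np.sin(x[0]) * np.exp(x[1])

def grad(x):
    return np.array([np.cos(x[0]) * np.exp(x[1]),
                     np.sin(x[0]) * np.exp(x[1])])

def hess(x):
    s, c, e = np.sin(x[0]), np.cos(x[0]), np.exp(x[1])
    return np.array([[-s * e, c * e], [c * e, s * e]])

x, d = np.array([0.4, -0.3]), np.array([1.0, 2.0])
for scale in [1e-1, 5e-2, 2.5e-2]:
    p = scale * d
    err = abs(f(x + p) - f(x) - grad(x) @ p - 0.5 * p @ hess(x) @ p)
    # The ratio err / ||p||^3 stays roughly constant as p shrinks.
    print(scale, err / np.linalg.norm(p)**3)
```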
Exercise 7
\begin{split}F(v) - F(u) &= \int_u^v F'(z) dz & \quad & \text{(4.1.8)}\\
&= \int_0^1 J(u + t(v - u)) (v - u) dt
& \quad & \text{change of variables where }
z = u + t(v - u) \text{ and } dz = (v - u) dt\\
F(v) - F(u) - J(x) (v - u)
&= \int_0^1 J(u + t(v - u)) (v - u) dt - J(x) (v - u)\\
\left\Vert F(v) - F(u) - J(x) (v - u) \right\Vert
&= \left\Vert
\int_0^1 \left[ J(u + t(v - u)) - J(x) \right] (v - u) dt
\right\Vert\\
&\leq \int_0^1 \left\Vert
\left[ J(u + t(v - u)) - J(x) \right] (v - u)
\right\Vert dt
& \quad & \text{(4.1.9)}\\
&\leq \int_0^1 \left\Vert J(u + t(v - u)) - J(x) \right\Vert
\left\Vert v - u \right\Vert dt
& \quad & \text{Theorem 3.1.3}\\
&\leq \left\Vert v - u \right\Vert
\int_0^1 \gamma \left\Vert u + t(v - u) - x \right\Vert dt
& \quad & \text{(4.1.11)}\\
&= \gamma \left\Vert v - u \right\Vert \int_0^1
\left\Vert u + t(v - u) - \left[ tx + (1 - t)x \right] \right\Vert dt\\
&= \gamma \left\Vert v - u \right\Vert \int_0^1
\left\Vert tv - tx + (1 - t)u - (1 - t)x \right\Vert dt\\
&\leq \gamma \left\Vert v - u \right\Vert
\int_0^1 t \left\Vert v - x \right\Vert +
(1 - t) \left\Vert u - x \right\Vert dt\\
&= \gamma \left\Vert v - u \right\Vert
\frac{\left\Vert v - x \right\Vert + \left\Vert u - x \right\Vert}{2}\end{split}
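The final inequality is easy to spot-check on a concrete F; the sketch below
uses F(x) = (x_1^2, x_1 x_2)^\top, for which \gamma = \sqrt{5} is a valid
Lipschitz constant for J in the 2-norm (the random test points are an
arbitrary choice):

```python
import numpy as np

def F(x):
    return np.array([x[0]**2, x[0] * x[1]])

def J(x):
    return np.array([[2 * x[0], 0.0], [x[1], x[0]]])

gamma = np.sqrt(5)  # ||J(a) - J(b)||_F = sqrt(5 d1^2 + d2^2) <= sqrt(5) ||a - b||_2
rng = np.random.default_rng(0)
for _ in range(1000):
    u, v, x = rng.standard_normal((3, 2))
    lhs = np.linalg.norm(F(v) - F(u) - J(x) @ (v - u))
    rhs = gamma * np.linalg.norm(v - u) * (np.linalg.norm(v - x)
                                           + np.linalg.norm(u - x)) / 2
    assert lhs <= rhs + 1e-12
```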
Exercise 8
\begin{split}F(x + p) - F(x) &= \int_0^1 J(x + tp) p dt\\
F(x + p) - F(x) - J(x) p &= \int_0^1 J(x + tp) p dt - J(x) p\\
\left\Vert F(x + p) - F(x) - J(x) p \right\Vert
&= \left\Vert \int_0^1 \left[ J(x + tp) - J(x) \right] p dt \right\Vert\\
&\leq \int_0^1 \left\Vert \left[ J(x + tp) - J(x) \right] p \right\Vert dt
& \quad & \text{(4.1.9)}\\
&\leq \int_0^1 \left\Vert J(x + tp) - J(x) \right\Vert
\left\Vert p \right\Vert dt
& \quad & \text{Theorem 3.1.3}\\
&\leq \left\Vert p \right\Vert
\int_0^1 \gamma \left\Vert tp \right\Vert^\alpha dt
& \quad &
\text{Hölder continuity with exponent } \alpha
\text{ (Lipschitz continuity is the } \alpha = 1 \text{ case)}\\
&= \gamma \left\Vert p \right\Vert^{1 + \alpha} \int_0^1 t^\alpha dt\\
&= \left\Vert p \right\Vert^{1 + \alpha} \frac{\gamma}{1 + \alpha}\end{split}
Exercise 9
\begin{split}F(x) &= \begin{pmatrix} 3x_1^2 - 2x_2\\ x_2^3 - \frac{1}{x_1} \end{pmatrix}\\
J(x) &= \begin{pmatrix} 6x_1 & -2\\ \frac{1}{x_1^2} & 3x_2^2 \end{pmatrix}\end{split}
When x = \begin{pmatrix} 1 & 1 \end{pmatrix}^\top,
\begin{split}J(x) = \begin{pmatrix} 6 & -2\\ 1 & 3 \end{pmatrix}.\end{split}
Using the forward difference approximation where h = 0.01 yields
\begin{split}F(x) = \begin{pmatrix} 1\\ 0 \end{pmatrix}
\quad & \quad
F(x + h e_1) = \begin{pmatrix} 1.0603\\ 0.0099 \end{pmatrix}
\quad & \quad
F(x + h e_2) = \begin{pmatrix} 0.98\\ 0.0303 \end{pmatrix}\end{split}
resulting in A = \begin{pmatrix} 6.03 & -2\\ 0.99 & 3.03 \end{pmatrix}.
Measured in the induced \mathcal{L}_1-norm, the approximation error is
\lVert A - J(x) \rVert_1 = 0.04; for this to be consistent with the
\frac{\gamma h}{2} error bound for forward differences, \gamma \geq 8.
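The hand computation above can be reproduced in a few lines of NumPy:

```python
import numpy as np

def F(x):
    return np.array([3 * x[0]**2 - 2 * x[1], x[1]**3 - 1 / x[0]])

x, h = np.array([1.0, 1.0]), 0.01
# Forward-difference Jacobian approximation, one column per coordinate.
A = np.column_stack([(F(x + h * e) - F(x)) / h for e in np.eye(2)])
print(A)  # [[6.03, -2.0], [0.9901, 3.0301]]

J = np.array([[6.0, -2.0], [1.0, 3.0]])
# Induced l1 norm is the maximum absolute column sum.
print(np.abs(A - J).sum(axis=0).max())  # ~ 0.04
```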
Exercise 10
Define
\alpha =
f(x + he_i) - f(x) - h \nabla f(x)_i - \frac{h^2}{2} \nabla^2 f(x)_{ii}
and
\beta =
f(x - he_i) - f(x) + h \nabla f(x)_i - \frac{h^2}{2} \nabla^2 f(x)_{ii}
where \left\vert \alpha \right\vert, \left\vert \beta \right\vert \leq
\frac{\gamma h^3}{6} (4.1.13).
\begin{split}\alpha + \beta
&= f(x + he_i) - 2f(x) + f(x - he_i) - h^2 \nabla^2 f(x)_{ii}\\
\left\vert \alpha + \beta \right\vert
&= \left\vert
f(x + he_i) - 2f(x) + f(x - he_i) - h^2 \nabla^2 f(x)_{ii}
\right\vert\\
&\leq \frac{\gamma h^3}{3} \qquad \text{triangle inequality}\end{split}
Thus, \left| A_{ii} - \nabla^2 f(x)_{ii} \right| \leq \frac{\gamma h}{3}
where A_{ii} = \frac{f(x + he_i) - 2f(x) + f(x - he_i)}{h^2}.
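A sketch of the resulting diagonal approximation, reusing f from Exercise 2
at an arbitrary test point and step size:

```python
import numpy as np

def f(x):
    return x[0]**3 * x[1]**4 + x[1] / x[0]  # Exercise 2's function

def hess_diag_fd(f, x, h=1e-3):
    """Central-difference approximation A_ii to the Hessian diagonal."""
    return np.array([(f(x + h * e) - 2 * f(x) + f(x - h * e)) / h**2
                     for e in np.eye(len(x))])

x = np.array([1.5, 0.5])
print(hess_diag_fd(f, x))  # compare with the analytic values [0.8588, 10.125]
```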
Exercise 11
Substituting p = he_i + he_j, p = he_i - he_j, p = -he_i + he_j, and
p = -he_i - he_j into Lemma 4.1.14 yields
\begin{split}\alpha
&= f(x + he_i + he_j) - f(x) - h \nabla f(x)_i - h \nabla f(x)_j -
\frac{h^2}{2} \left[
\nabla^2 f(x)_{ii} + 2 \nabla^2 f(x)_{ij} + \nabla^2 f(x)_{jj}
\right]\\
\beta
&= f(x + he_i - he_j) - f(x) - h \nabla f(x)_i + h \nabla f(x)_j -
\frac{h^2}{2} \left[
\nabla^2 f(x)_{ii} - 2 \nabla^2 f(x)_{ij} + \nabla^2 f(x)_{jj}
\right]\\
\eta
&= f(x - he_i + he_j) - f(x) + h \nabla f(x)_i - h \nabla f(x)_j -
\frac{h^2}{2} \left[
\nabla^2 f(x)_{ii} - 2 \nabla^2 f(x)_{ij} + \nabla^2 f(x)_{jj}
\right]\\
\nu
&= f(x - he_i - he_j) - f(x) + h \nabla f(x)_i + h \nabla f(x)_j -
\frac{h^2}{2} \left[
\nabla^2 f(x)_{ii} + 2 \nabla^2 f(x)_{ij} + \nabla^2 f(x)_{jj}
\right]\end{split}
By the triangle inequality,
\begin{split}\left| \alpha - \beta - \eta + \nu \right|
&\leq \left| \alpha \right| + \left| \beta \right| +
\left| \eta \right| + \left| \nu \right|\\
&\leq \frac{\gamma h^3}{6} \left[
\left\Vert e_i + e_j \right\Vert^3 +
\left\Vert e_i - e_j \right\Vert^3 +
\left\Vert -e_i + e_j \right\Vert^3 +
\left\Vert -e_i - e_j \right\Vert^3
\right]\\
&\leq \frac{16}{3} \gamma h^3.\end{split}
Solving for \nabla^2 f(x)_{ij} yields
\begin{split}\left| \alpha - \beta - \eta + \nu \right|
&= \left|
f(x + he_i + he_j) - f(x + he_i - he_j) - f(x - he_i + he_j) +
f(x - he_i - he_j) - 4h^2 \nabla^2 f(x)_{ij}
\right|\\
&\leq \frac{16}{3} \gamma h^3.\end{split}
Hence
\left| A_{ij} - \nabla^2 f(x)_{ij} \right| \leq \frac{4}{3} \gamma h
where
A_{ij} =
\frac{f(x + he_i + he_j) - f(x + he_i - he_j) -
f(x - he_i + he_j) + f(x - he_i - he_j)}{4h^2}.
The number of function evaluations is still the same (unless f(x) has
already been computed), but this new approximation reduces the upper bound on
the error by 20\%.
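A sketch of the four-point off-diagonal approximation, again reusing f from
Exercise 2:

```python
import numpy as np

def f(x):
    return x[0]**3 * x[1]**4 + x[1] / x[0]  # Exercise 2's function

def hess_offdiag_fd(f, x, i, j, h=1e-3):
    """Four-point approximation A_ij derived above."""
    ei, ej = np.eye(len(x))[i], np.eye(len(x))[j]
    return (f(x + h * ei + h * ej) - f(x + h * ei - h * ej)
            - f(x - h * ei + h * ej) + f(x - h * ei - h * ej)) / (4 * h**2)

x = np.array([1.5, 0.5])
# Compare with the analytic value 12 x1^2 x2^3 - x1^{-2} = 2.9306.
print(hess_offdiag_fd(f, x, 0, 1))
```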
Exercise 12
Define A, M \in \mathbb{R}^{n \times n} where M = M^\top and
\hat{A} = \frac{A + A^\top}{2}.
\begin{split}\left\Vert M - \hat{A} \right\Vert_F^2
&= \text{tr}\left(
\left( M - \hat{A} \right)^\top \left( M - \hat{A} \right)
\right)\\
&= \text{tr}\left(M^\top M\right) - \text{tr}\left(M^\top \hat{A}\right) -
\text{tr}\left(\hat{A}^\top M\right) +
\text{tr}\left(\hat{A}^\top \hat{A}\right)
& \quad & \text{trace operator is a linear mapping}\\
&= \text{tr}\left(M^\top M\right) - 2\text{tr}\left(M^\top \hat{A}\right) +
\text{tr}\left(\hat{A}^\top \hat{A}\right)
& \quad & \text{trace equivalence under transpose}\\
&= \text{tr}\left(M^\top M\right) - \text{tr}\left(M^\top A\right) -
\text{tr}\left(M^\top A^\top\right) +
\text{tr}\left(\hat{A}^\top \hat{A}\right)\\
&= \text{tr}\left(M^\top M\right) - \text{tr}\left(M^\top A\right) -
\text{tr}\left(A^\top M\right) +
\text{tr}\left(\hat{A}^\top \hat{A}\right)
& \quad & \text{trace is invariant under cyclic permutations}\\
&= \text{tr}\left(M^\top M\right) - 2 \text{tr}\left(M^\top A\right) +
\frac{1}{4} \left\Vert A + A^\top \right\Vert_F^2\\
&\leq \text{tr}\left(M^\top M\right) - 2 \text{tr}\left(M^\top A\right) +
\left\Vert A \right\Vert_F^2
& \quad & \text{triangle inequality applied to matrix norm}\\
&\leq \text{tr}\left(M^\top M\right) - 2 \text{tr}\left(M^\top A\right) +
\text{tr}\left(A^\top A\right)\\
&= \left\Vert M - A \right\Vert_F^2\\
\left\Vert M - \hat{A} \right\Vert_F &\leq \left\Vert M - A \right\Vert_F\end{split}
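The inequality is easy to spot-check on random matrices; a minimal sketch
(the dimension and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    A = rng.standard_normal((5, 5))
    S = rng.standard_normal((5, 5))
    M = (S + S.T) / 2          # arbitrary symmetric matrix
    A_hat = (A + A.T) / 2      # symmetrization of A
    # np.linalg.norm on a matrix defaults to the Frobenius norm.
    assert np.linalg.norm(M - A_hat) <= np.linalg.norm(M - A) + 1e-12
```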
Exercise 13
Proof by contradiction; assume that x is a local minimizer of f
but \nabla^2 f(x) is not positive semidefinite (i.e. there exists a
direction p \in \mathbb{R}^n with p^\top \nabla^2 f(x) p < 0).
Since the Hessian is a real symmetric square matrix, its eigendecomposition
exists as
\nabla^2 f(x) = Q \Lambda Q^\top = \sum_{i = 1}^n \lambda_i q_i q_i^\top.
Pick p = c q_i where q_i is any eigenvector whose eigenvalue \lambda_i is
negative and c \in \mathbb{R} \setminus \{0\} is chosen such that
x + p \in D.
Since D is open, c can be chosen small enough that x + tp \in D for every
0 \leq t \leq 1. Note that p^\top \nabla^2 f(x) p = c^2 \lambda_i < 0; by
the continuity of \nabla^2 f, shrinking c further ensures
p^\top \nabla^2 f(z) p < 0 for every z \in [x, x + p]. Applying Lemma 4.1.5
for some z \in [x, x + p], and recalling that \nabla f(x) = 0 because x is
a local minimizer, yields
\begin{split}f(x + p) &= f(x) + \nabla f(x)^\top p + \frac{1}{2} p^\top \nabla^2 f(z) p\\
&= f(x) + \frac{1}{2} p^\top \nabla^2 f(z) p\\
&< f(x)\end{split}
which is a contradiction because x is a local minimizer of f.
Thus, \nabla^2 f(x) is positive semidefinite.
The argument extends to the Lipschitz setting: if
\nabla^2 f \in \text{Lip}_\gamma(D) for an open convex set
D \subset \mathbb{R}^n containing x, then the assumptions of Theorem 4.3.2
are satisfied.
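The mechanism behind the contradiction is visible on a concrete saddle; a
sketch with f(x) = x_1^2 - x_2^2, whose Hessian at the origin has a negative
eigenvalue:

```python
import numpy as np

f = lambda x: x[0]**2 - x[1]**2   # saddle point at the origin
H = np.diag([2.0, -2.0])          # Hessian at x = 0
lam, Q = np.linalg.eigh(H)        # eigenvalues in ascending order
q = Q[:, 0]                       # eigenvector with lam[0] = -2 < 0
x = np.zeros(2)
for c in [1.0, 0.1, 0.01]:
    # f decreases along q for every scale c, so 0 is not a minimizer.
    assert f(x + c * q) < f(x)
```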
Exercise 14
If x_* is a local minimizer of f, Theorem 4.3.2 states that
\nabla^2 f(x_*) must be positive semidefinite. Since
\nabla^2 f(x_*) is nonsingular, none of its eigenvalues are zero, which
means each of them must be positive in this case. This is equivalent to being
positive definite.
If \nabla f(x_*) = 0 and \nabla^2 f(x_*) is positive definite, Theorem
4.3.2 states that x_* is a local minimizer of f.
Exercise 15
If f has a unique minimizer x_*, then
\nabla f(x_*) = 0 = Ax_* + b. Since Ax_* = -b has a unique solution,
A has full rank and is therefore invertible. According to Theorem 4.3.2,
\nabla^2 f(x) = A must also be positive semidefinite. A positive
semidefinite matrix with no zero eigenvalues is positive definite, so when
x_* = -A^{-1} b is the unique minimizer, A must be positive definite.
If A is positive definite, then A is invertible and x = -A^{-1} b
satisfies \nabla f(x) = Ax + b = 0, so Theorem 4.3.2 states that x is a
local minimizer. This solution is unique because the nullspace of A is
trivial.
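A small numeric confirmation, assuming the exercise's objective is
f(x) = \frac{1}{2} x^\top A x + b^\top x so that \nabla f(x) = Ax + b (the
A below is an arbitrary positive definite example):

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x + b @ x   # so that grad f(x) = A x + b

x_star = -np.linalg.solve(A, b)         # unique stationary point
rng = np.random.default_rng(0)
# Every perturbed point has strictly larger objective value.
assert all(f(x_star) < f(x_star + d) for d in rng.standard_normal((1000, 2)))
```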
Exercise 16
\begin{split}\DeclareMathOperator*{\argmin}{arg\,min}
\argmin_{x \in \mathbb{R}^n} f(x)
&= \argmin_{x \in \mathbb{R}^n} \left\Vert Ax - b \right\Vert_2\\
&= \argmin_{x \in \mathbb{R}^n} \left\Vert Ax - b \right\Vert_2^2\\
&= \argmin_{x \in \mathbb{R}^n} x^\top A^\top Ax - 2 b^\top Ax + b^\top b\end{split}
The necessary conditions for x_* to be a local minimizer of f
are \nabla f(x_*) = 0 and \nabla^2 f(x_*) positive semidefinite; the
sufficient condition additionally requires \nabla^2 f(x_*) to be positive
definite. Computing the gradient and the Hessian yields
\nabla f(x) = 2 A^\top A x - 2 A^\top b
\quad \text{and} \quad
\nabla^2 f(x) = 2 A^\top A.
Setting the gradient to zero shows that x_* indeed satisfies the normal
equations A^\top A x_* = A^\top b. The Hessian 2 A^\top A is positive
semidefinite, so the necessary condition of Theorem 4.3.2 holds; when
A^\top A is singular, Theorem 3.6.5 is needed to compute the pseudo-inverse
of A.
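When A has full column rank, the normal equations, the SVD-based
pseudo-inverse, and a library least-squares solver all recover the same
minimizer; a sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))  # arbitrary tall full-rank example
b = rng.standard_normal(10)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)   # normal equations
x_pinv = np.linalg.pinv(A) @ b                 # SVD-based pseudo-inverse
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0] # library least-squares solver
assert np.allclose(x_normal, x_pinv) and np.allclose(x_pinv, x_lstsq)
```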