Regression Models
Non-linear Regression
A radial basis function is any function that is spherically symmetric about its center.
The Bayesian approach to nonlinear regression (8.24) is rarely applied directly
because it requires the inner products \(z_i^\top z_j\), and the transformed
vectors \(z_i\) may be very high or even infinite dimensional.
Mercer’s Theorem avoids computing \(z\) explicitly.
A kernel function \(k[x_i, x_j]\) is valid when its arguments lie in a
measurable space and the kernel is symmetric and positive semi-definite.
See (8.26), (8.27), (8.28) for examples of a linear, polynomial, and RBF
kernel.
Sums and products of valid kernels are themselves valid kernels.
The kernel trick consists of choosing a kernel function \(k[x_i, x_j]\) to
stand in for the inner product \(f[x_i]^\top f[x_j] = z_i^\top z_j\) without
ever evaluating \(f[\bullet]\) explicitly.
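The kernel trick is easiest to sanity-check in a case where \(f[\bullet]\) is known explicitly. Below is a minimal NumPy sketch (not from the text; the degree-2 polynomial kernel and toy inputs are illustrative) verifying that \(k[x_i, x_j] = f[x_i]^\top f[x_j]\) for \(k[x_i, x_j] = (1 + x_i^\top x_j)^2\):

```python
import numpy as np

# Illustrative check: for the degree-2 polynomial kernel in 2-D the feature
# map f is known, so k[x_i, x_j] = f[x_i]^T f[x_j] can be verified directly.
def poly_kernel(xi, xj):
    return (1.0 + xi @ xj) ** 2

def explicit_features(x):
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2.0) * x1, np.sqrt(2.0) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2.0) * x1 * x2])

rng = np.random.default_rng(0)
xi, xj = rng.normal(size=2), rng.normal(size=2)
assert np.isclose(poly_kernel(xi, xj),
                  explicit_features(xi) @ explicit_features(xj))
```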
Exercise 8.1
Assuming a linear relation between the world state and the data, let
\(\boldsymbol{\theta} = \{ \alpha, \beta, \boldsymbol{\phi} \}\) and
\(\alpha = \beta = \boldsymbol{\phi}^\top \mathbf{x}\) where
\(\alpha, \beta > 0\) and \(\boldsymbol{\phi} \in \mathbb{R}^{D + 1}\).
\[\begin{split}\DeclareMathOperator{\GamDist}{Gam}
Pr(w \mid \mathbf{x}, \boldsymbol{\theta})
&= \GamDist_w[\alpha, \beta]\\
&= \frac{\beta^\alpha}{\Gamma[\alpha]} \exp[-\beta w] w^{\alpha - 1}
& \quad & \text{(7.23)}\end{split}\]
In the maximum likelihood approach, the goal is
\[\begin{split}\DeclareMathOperator*{\argmax}{arg\,max}
\hat{\boldsymbol{\theta}}
&= \argmax_{\boldsymbol{\theta}}
Pr(w \mid \mathbf{x}, \boldsymbol{\theta})\\
&= \argmax_{\boldsymbol{\theta}} L
& \quad & L = \log Pr(w \mid \mathbf{x}, \boldsymbol{\theta})\\
&= \argmax_{\boldsymbol{\theta}}
\alpha \log(\beta) - \log(\Gamma[\alpha]) -
\beta w + (\alpha - 1) \log(w)\end{split}\]
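A minimal sketch of this fit, assuming illustrative toy data and maximizing the summed log-likelihood numerically with SciPy (the guard enforces \(\alpha = \beta > 0\)):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

# Illustrative toy data; a row of ones is prepended so phi includes an offset.
rng = np.random.default_rng(1)
x = rng.uniform(1.0, 2.0, size=20)
X = np.vstack([np.ones_like(x), x])                   # (D + 1) x I
w = rng.gamma(shape=3.0, scale=1.0 / 3.0, size=20)    # toy world states w > 0

def neg_log_lik(phi):
    a = X.T @ phi                                     # alpha = beta = phi^T x
    if np.any(a <= 0):                                # Gam requires alpha, beta > 0
        return np.inf
    return -np.sum(a * np.log(a) - gammaln(a) - a * w + (a - 1.0) * np.log(w))

phi_hat = minimize(neg_log_lik, x0=np.array([1.0, 1.0]),
                   method="Nelder-Mead").x
```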
Exercise 8.2
Assuming a linear relation between the world state and the data, let
\(\boldsymbol{\theta} = \{ \mu, \sigma, \boldsymbol{\phi} \}\) and
\(\mu = \boldsymbol{\phi}^\top \mathbf{x}\) where
\(\sigma > 0\) and \(\boldsymbol{\phi} \in \mathbb{R}^{D + 1}\).
\[\begin{split}\DeclareMathOperator{\StudDist}{Stud}
\DeclareMathOperator{\NormDist}{Norm}
Pr(w \mid \mathbf{x}, \boldsymbol{\theta})
&= \StudDist_w\left[ \mu, \sigma^2, \nu \right]\\
&= \int \NormDist_w\left[ \mu, \sigma^2 / h \right]
   \GamDist_h[\nu / 2, \nu / 2] dh
& \quad & \text{(7.24)}\end{split}\]
Because (7.24) expresses the Student t as an infinite mixture of normals, the
EM algorithm can be used to fit \(\boldsymbol{\theta}\) in the maximum
likelihood approach, treating the hidden scaling variables \(h_i\) as the
missing data.
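A sketch of that EM scheme with \(\nu\) held fixed (the toy data and the `fit_t_regression` helper are illustrative; the updates are the standard ones for the Student t scale mixture, not taken from the text):

```python
import numpy as np

# A sketch of EM for the Student t regression model (nu held fixed).
# The hidden h_i makes each w_i conditionally Norm[phi^T x_i, sigma^2 / h_i],
# so the M-step reduces to weighted least squares.
def fit_t_regression(X, w, nu=4.0, iters=100):
    I = w.size
    phi = np.linalg.solve(X @ X.T, X @ w)         # least-squares start
    sig2 = np.mean((w - X.T @ phi) ** 2)
    for _ in range(iters):
        r2 = (w - X.T @ phi) ** 2 / sig2
        Eh = (nu + 1.0) / (nu + r2)               # E-step: E[h_i | w_i]
        XW = X * Eh                               # scale columns of X by E[h_i]
        phi = np.linalg.solve(XW @ X.T, XW @ w)   # M-step: weighted LS
        sig2 = np.sum(Eh * (w - X.T @ phi) ** 2) / I
    return phi, sig2

rng = np.random.default_rng(7)
X = np.vstack([np.ones(30), rng.normal(size=(1, 30))])
w = X.T @ np.array([0.5, -1.0]) + rng.standard_t(df=4.0, size=30)
phi_hat, sig2_hat = fit_t_regression(X, w)
```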
Exercise 8.3
\[\begin{split}L &= \log Pr(\mathbf{w} \mid \mathbf{X}, \boldsymbol{\theta})\\
&= \log \NormDist_\mathbf{w}\left[
\mathbf{X}^\top \boldsymbol{\phi}, \sigma^2 \mathbf{I}
\right]\\
&= -\frac{I}{2} \log(2 \pi) - \frac{I}{2} \log\left( \sigma^2 \right) -
\frac{
\left( \mathbf{w} - \mathbf{X}^\top \boldsymbol{\phi} \right)^\top
\left( \mathbf{w} - \mathbf{X}^\top \boldsymbol{\phi} \right)
}{
2 \sigma^2
}\end{split}\]
(a)
\[\begin{split}\frac{\partial L}{\partial \sigma}
&= -\frac{I}{\sigma} +
   \sigma^{-3}
   \left( \mathbf{w} - \mathbf{X}^\top \boldsymbol{\phi} \right)^\top
   \left( \mathbf{w} - \mathbf{X}^\top \boldsymbol{\phi} \right)\\
0 &= -I \sigma^2 +
   \left( \mathbf{w} - \mathbf{X}^\top \boldsymbol{\phi} \right)^\top
   \left( \mathbf{w} - \mathbf{X}^\top \boldsymbol{\phi} \right)\\
\sigma^2
&= I^{-1}
   \left( \mathbf{w} - \mathbf{X}^\top \boldsymbol{\phi} \right)^\top
   \left( \mathbf{w} - \mathbf{X}^\top \boldsymbol{\phi} \right)\end{split}\]
(b)
\[\begin{split}\frac{\partial L}{\partial \boldsymbol{\phi}}
&= -\frac{1}{2 \sigma^2} \left[
\mathbf{X}
\left( \mathbf{X}^\top \boldsymbol{\phi} - \mathbf{w} \right) +
\mathbf{X}
\left( \mathbf{X}^\top \boldsymbol{\phi} - \mathbf{w} \right)
\right]
& \quad & \text{(C.32)}\\
0 &= -\mathbf{X} \mathbf{X}^\top \boldsymbol{\phi} + \mathbf{X} \mathbf{w}\\
\boldsymbol{\phi} &= (\mathbf{X} \mathbf{X}^\top)^{-1} \mathbf{X} \mathbf{w}\end{split}\]
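The closed forms from (a) and (b) are easy to verify numerically; a sketch with illustrative toy data:

```python
import numpy as np

# Closed-form ML fit from (a) and (b); the toy data are illustrative.
rng = np.random.default_rng(2)
I, D = 50, 3
X = np.vstack([np.ones(I), rng.normal(size=(D, I))])   # (D + 1) x I design
phi_true = rng.normal(size=D + 1)
w = X.T @ phi_true + 0.1 * rng.normal(size=I)

phi_hat = np.linalg.solve(X @ X.T, X @ w)   # phi = (X X^T)^{-1} X w
resid = w - X.T @ phi_hat
sig2_hat = resid @ resid / I                # sigma^2 = (1/I)(w - X^T phi)^T (w - X^T phi)
```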
Exercise 8.4
\[\begin{split}Pr(\boldsymbol{\phi} \mid \mathbf{X}, \mathbf{w})
&= \frac{
Pr(\boldsymbol{\phi}, \mathbf{X}, \mathbf{w})
}{
Pr(\mathbf{X}, \mathbf{w})
}\\
&= \frac{
Pr(\mathbf{w} \mid \mathbf{X}, \boldsymbol{\phi})
Pr(\boldsymbol{\phi} \mid \mathbf{X})
Pr(\mathbf{X})
}{
Pr(\mathbf{w} \mid \mathbf{X}) Pr(\mathbf{X})
}\\
&= \frac{
Pr(\mathbf{w} \mid \mathbf{X}, \boldsymbol{\phi})
Pr(\boldsymbol{\phi})
}{
Pr(\mathbf{w} \mid \mathbf{X})
}\end{split}\]
The foregoing implies that (8.8) assumes
\(Pr(\boldsymbol{\phi} \mid \mathbf{X}) = Pr(\boldsymbol{\phi})\), i.e., the
data do not affect the prior on the parameters (8.7).
Furthermore, (8.9) assumes that \(\sigma^2\) is known, so
\(Pr(\mathbf{w} \mid \mathbf{X},
\boldsymbol{\theta} = \left\{ \boldsymbol{\phi}, \sigma^2 \right\}) =
Pr(\mathbf{w} \mid \mathbf{X}, \boldsymbol{\phi})\).
\[\begin{split}Pr(\boldsymbol{\phi} \mid \mathbf{X}, \mathbf{w})
&= \frac{1}{Pr(\mathbf{w} \mid \mathbf{X})}
\NormDist_{\mathbf{w}}\left[
\mathbf{X}^\top \boldsymbol{\phi}, \sigma^2 \mathbf{I}_I
\right]
\NormDist_{\boldsymbol{\phi}}\left[
\boldsymbol{0}, \sigma_p^2 \mathbf{I}_{D + 1}
\right]\\
&= \frac{\kappa_1}{Pr(\mathbf{w} \mid \mathbf{X})}
\NormDist_{\boldsymbol{\phi}}\left[
\left( \mathbf{X} \mathbf{X}^\top \right)^{-1} \mathbf{X} \mathbf{w},
\sigma^2 \left( \mathbf{X} \mathbf{X}^\top \right)^{-1}
\right]
\NormDist_{\boldsymbol{\phi}}\left[
\boldsymbol{0}, \sigma_p^2 \mathbf{I}_{D + 1}
\right]
& \quad & \text{(a)}\\
&= \frac{\kappa_1 \kappa_2}{Pr(\mathbf{w} \mid \mathbf{X})}
\NormDist_{\boldsymbol{\phi}}\left[
\frac{1}{\sigma^2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w},
\mathbf{A}^{-1}
\right]
& \quad & \text{(b)}\\
&= \NormDist_{\boldsymbol{\phi}}\left[
\frac{1}{\sigma^2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w},
\mathbf{A}^{-1}
\right]
& \quad & \text{(c)}\end{split}\]
where
\[\mathbf{A} =
\frac{1}{\sigma^2} \mathbf{X} \mathbf{X}^\top +
\frac{1}{\sigma_p^2} \mathbf{I}_{D + 1}.\]
(a)
See Exercise 5.10 for more details.
\[\begin{split}\NormDist_{\mathbf{w}}\left[
\mathbf{X}^\top \boldsymbol{\phi}, \sigma^2 \mathbf{I}_I
\right]
&= \kappa_1 \NormDist_{\boldsymbol{\phi}}\left[
\left( \mathbf{X} \mathbf{X}^\top \sigma^{-2} \right)^{-1}
\sigma^{-2} \mathbf{X} \mathbf{w},
\left( \mathbf{X} \mathbf{X}^\top \sigma^{-2} \right)^{-1}
\right]\\
&= \kappa_1 \NormDist_{\boldsymbol{\phi}}\left[
\left( \mathbf{X} \mathbf{X}^\top \right)^{-1} \mathbf{X} \mathbf{w},
\sigma^2 \left( \mathbf{X} \mathbf{X}^\top \right)^{-1}
\right]\end{split}\]
where
\[\begin{split}\kappa_1
&= (2 \pi)^{(D + 1 - I) / 2} \frac{
    \left\vert
      \sigma^2 \left( \mathbf{X} \mathbf{X}^\top \right)^{-1}
    \right\vert^{1 / 2}
  }{
    \left\vert \sigma^2 \mathbf{I}_I \right\vert^{1 / 2}
  }
  \exp\left[
    \mathbf{w}^\top
    \left(
      \sigma^{-2} \mathbf{I}_I -
      \sigma^{-2} \mathbf{I}_I \mathbf{X}^\top
      \sigma^2 \left( \mathbf{X} \mathbf{X}^\top \right)^{-1} \mathbf{X}
      \sigma^{-2} \mathbf{I}_I
    \right)
    \mathbf{w}
  \right]^{-0.5}\\
&= \frac{
    (2 \pi)^{(D + 1 - I) / 2} \sigma^{D + 1 - I}
  }{
    \left\vert \mathbf{X} \mathbf{X}^\top \right\vert^{1/2}
  }
  \exp\left[
    \mathbf{w}^\top
    \left(
      \sigma^{-2} \mathbf{I}_I -
      \sigma^{-2} \mathbf{X}^\top
      \left( \mathbf{X} \mathbf{X}^\top \right)^{-1} \mathbf{X}
    \right)
    \mathbf{w}
  \right]^{-0.5}\end{split}\]
(b)
See Exercise 5.7 and
Exercise 5.9 for more details.
\[\begin{split}& \NormDist_{\boldsymbol{\phi}}\left[
\left( \mathbf{X} \mathbf{X}^\top \right)^{-1} \mathbf{X} \mathbf{w},
\sigma^2 \left( \mathbf{X} \mathbf{X}^\top \right)^{-1}
\right]
\NormDist_{\boldsymbol{\phi}}\left[
\boldsymbol{0}, \sigma_p^2 \mathbf{I}_{D + 1}
\right]\\
&= \kappa_2 \NormDist_{\boldsymbol{\phi}}\left[
\left(
\frac{1}{\sigma^2} \mathbf{X} \mathbf{X}^\top +
\frac{1}{\sigma_p^2} \mathbf{I}_{D + 1}
\right)^{-1}
\sigma^{-2} \mathbf{X} \mathbf{X}^\top
\left( \mathbf{X} \mathbf{X}^\top \right)^{-1} \mathbf{X} \mathbf{w},
\left(
\frac{1}{\sigma^2} \mathbf{X} \mathbf{X}^\top +
\frac{1}{\sigma_p^2} \mathbf{I}_{D + 1}
\right)^{-1}
\right]\\
&= \kappa_2 \NormDist_{\boldsymbol{\phi}}\left[
\frac{1}{\sigma^2}
\left(
\frac{1}{\sigma^2} \mathbf{X} \mathbf{X}^\top +
\frac{1}{\sigma_p^2} \mathbf{I}_{D + 1}
\right)^{-1} \mathbf{X} \mathbf{w},
\left(
\frac{1}{\sigma^2} \mathbf{X} \mathbf{X}^\top +
\frac{1}{\sigma_p^2} \mathbf{I}_{D + 1}
\right)^{-1}
\right]\end{split}\]
where
\[\begin{split}\kappa_2
&= \NormDist_{\boldsymbol{0}}\left[
\left( \mathbf{X} \mathbf{X}^\top \right)^{-1} \mathbf{X} \mathbf{w},
\sigma^2 \left( \mathbf{X} \mathbf{X}^\top \right)^{-1} +
\sigma_p^2 \mathbf{I}_{D + 1}
\right]\\
&= \frac{1}{
(2 \pi)^{(D + 1) / 2}
\left\vert
\sigma^2 \left( \mathbf{X} \mathbf{X}^\top \right)^{-1} +
\sigma_p^2 \mathbf{I}_{D + 1}
\right\vert^{1 / 2}
}
\exp\left[
\left(
\left( \mathbf{X} \mathbf{X}^\top \right)^{-1} \mathbf{X} \mathbf{w}
\right)^\top
\left(
\sigma^2 \left( \mathbf{X} \mathbf{X}^\top \right)^{-1} +
\sigma_p^2 \mathbf{I}_{D + 1}
\right)^{-1}
\left(
\left( \mathbf{X} \mathbf{X}^\top \right)^{-1} \mathbf{X} \mathbf{w}
\right)
\right]^{-0.5}\end{split}\]
(c)
\[\begin{split}Pr(\mathbf{w} \mid \mathbf{X})
&= \int Pr(\boldsymbol{\phi}, \mathbf{w} \mid \mathbf{X})
d\boldsymbol{\phi}\\
&= \int Pr(\mathbf{w} \mid \mathbf{X}, \boldsymbol{\phi})
Pr(\boldsymbol{\phi} \mid \mathbf{X}) d\boldsymbol{\phi}\\
&= \int Pr(\mathbf{w} \mid \mathbf{X}, \boldsymbol{\phi})
Pr(\boldsymbol{\phi}) d\boldsymbol{\phi}\\
&= \int \kappa_1 \kappa_2
\NormDist_{\boldsymbol{\phi}}\left[
\frac{1}{\sigma^2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w},
\mathbf{A}^{-1}
\right] d\boldsymbol{\phi}\\
&= \kappa_1 \kappa_2\end{split}\]
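A minimal numerical companion to this result (toy data as in the Exercise 8.3 sketch; the noise and prior variances are assumed known):

```python
import numpy as np

# Posterior over phi: mean (1/sigma^2) A^{-1} X w, covariance A^{-1}.
rng = np.random.default_rng(2)
I, D = 50, 3
X = np.vstack([np.ones(I), rng.normal(size=(D, I))])
w = X.T @ rng.normal(size=D + 1) + 0.1 * rng.normal(size=I)

sig2, sig2_p = 0.01, 1.0                    # assumed known variances
A = X @ X.T / sig2 + np.eye(D + 1) / sig2_p
phi_cov = np.linalg.inv(A)                  # posterior covariance A^{-1}
phi_mean = phi_cov @ X @ w / sig2           # posterior mean (1/sigma^2) A^{-1} X w
```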
Exercise 8.5
Let \(w^\ast, \sigma \in \mathbb{R}\);
\(\mathbf{x}^\ast, \boldsymbol{\phi} \in \mathbb{R}^{D + 1}\);
\(\mathbf{A} \in \mathbb{R}^{(D + 1) \times (D + 1)}\);
\(\mathbf{X} \in \mathbb{R}^{(D + 1) \times I}\);
\(\mathbf{w} \in \mathbb{R}^I\).
\[\begin{split}Pr(w^\ast \mid \mathbf{x}^\ast, \mathbf{X}, \mathbf{w})
&= \int Pr(\boldsymbol{\phi}, w^\ast \mid
\mathbf{x}^\ast, \mathbf{X}, \mathbf{w}) d\boldsymbol{\phi}\\
&= \int Pr(w^\ast \mid \mathbf{x}^\ast, \boldsymbol{\phi})
Pr(\boldsymbol{\phi} \mid \mathbf{X}, \mathbf{w}) d\boldsymbol{\phi}
& \quad & \text{conditional independence}\\
&= \int
\NormDist_{w^\ast}\left[
\boldsymbol{\phi}^\top \mathbf{x}^\ast, \sigma^2
\right]
\NormDist_{\boldsymbol{\phi}}\left[
\sigma^{-2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w}, \mathbf{A}^{-1}
\right] d\boldsymbol{\phi}
& \quad & \text{(8.2) and (8.10)}\\
&= \int \kappa_1
\NormDist_{\boldsymbol{\phi}}\left[
\left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1}
\mathbf{x}^\ast w^\ast,
\sigma^2 \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1}
\right]
\NormDist_{\boldsymbol{\phi}}\left[
\sigma^{-2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w}, \mathbf{A}^{-1}
\right] d\boldsymbol{\phi}
& \quad & \text{(a)}\\
&= \int \kappa_1 \kappa_2
\NormDist_{\boldsymbol{\phi}}\left[
\boldsymbol{\Sigma} \left(
\sigma^{-2} \mathbf{x}^\ast w^\ast +
\sigma^{-2} \mathbf{X} \mathbf{w}
\right),
\boldsymbol{\Sigma}
\right] d\boldsymbol{\phi}
& \quad & \text{(b)}\\
&= \kappa_1 \kappa_2\\
&= \NormDist_{w^\ast}\left[
\sigma^{-2} {\mathbf{x}^\ast}^\top
\mathbf{A}^{-1} \mathbf{X} \mathbf{w},
\sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast
\right]
& \quad & \text{(c)}\end{split}\]
(a)
See Exercise 5.10 for more details.
\[\begin{split}\NormDist_{w^\ast}\left[
    {\mathbf{x}^\ast}^\top \boldsymbol{\phi}, \sigma^2
  \right]
&= \kappa_1 \NormDist_{\boldsymbol{\phi}}\left[
    \left( \mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top \right)^{-1}
    \mathbf{x}^\ast \sigma^{-2} w^\ast,
    \left( \mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top \right)^{-1}
  \right]\\
&= \kappa_1 \NormDist_{\boldsymbol{\phi}}\left[
\left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1}
\mathbf{x}^\ast w^\ast,
\sigma^2 \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1}
\right]\end{split}\]
where
\[\begin{split}\kappa_1
&= (2 \pi)^{D / 2} \frac{
\left\vert
\left(
\mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top
\right)^{-1}
\right\vert^{1 / 2}
}{
\left\vert \sigma^2 \right\vert^{1 / 2}
}
\exp\left[
w^\ast
\left(
\sigma^{-2} -
\sigma^{-2} {\mathbf{x}^\ast}^\top
\left(
\mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top
\right)^{-1} \mathbf{x}^\ast \sigma^{-2}
\right)
w^\ast
\right]^{-0.5}\\
&= (2 \pi)^{D / 2} \sigma^{-1} \left\vert
\sigma^2 \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1}
\right\vert^{1 / 2}
\exp\left[
w^\ast
\left(
\sigma^{-2} -
\sigma^{-2} {\mathbf{x}^\ast}^\top
\left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1}
\mathbf{x}^\ast
\right)
w^\ast
\right]^{-0.5}\end{split}\]
(b)
See Exercise 5.7 and
Exercise 5.9 for more details.
\[\begin{split}& \NormDist_{\boldsymbol{\phi}}\left[
\left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1}
\mathbf{x}^\ast w^\ast,
\sigma^2 \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1}
\right]
\NormDist_{\boldsymbol{\phi}}\left[
\sigma^{-2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w}, \mathbf{A}^{-1}
\right]\\
&= \kappa_2 \NormDist_{\boldsymbol{\phi}}\left[
\left(
\sigma^{-2} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top + \mathbf{A}
\right)^{-1}
\left(
\sigma^{-2} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top
\left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1}
\mathbf{x}^\ast w^\ast +
\mathbf{A} \sigma^{-2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w}
\right),
\left(
\sigma^{-2} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top + \mathbf{A}
\right)^{-1}
\right]\\
&= \kappa_2 \NormDist_{\boldsymbol{\phi}}\left[
\left(
\sigma^{-2} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top + \mathbf{A}
\right)^{-1}
\left(
\sigma^{-2} \mathbf{x}^\ast w^\ast +
\sigma^{-2} \mathbf{X} \mathbf{w}
\right),
\left(
\sigma^{-2} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top + \mathbf{A}
\right)^{-1}
\right]\end{split}\]
where
\[\begin{split}\kappa_2
&= \NormDist_{\sigma^{-2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w}}\left[
\left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1}
\mathbf{x}^\ast w^\ast,
\left(
\mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top
\right)^{-1} + \mathbf{A}^{-1}
\right]\\
&= \frac{
\exp\left[
\left(
\sigma^{-2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w} -
\left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1}
\mathbf{x}^\ast w^\ast
\right)^\top
\boldsymbol{\Sigma}'
\left(
\sigma^{-2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w} -
\left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1}
\mathbf{x}^\ast w^\ast
\right)
\right]^{-1 / 2}
}{
\left\vert
2 \pi \left(
\sigma^2
\left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1} +
\mathbf{A}^{-1}
\right)
\right\vert^{1 / 2}
}
& \quad & \text{(b.1)}\end{split}\]
(b.1)
According to the Sherman-Morrison identity in the Matrix Cookbook (160),
\[\begin{split}\boldsymbol{\Sigma}
&= \left(
\sigma^{-2} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top + \mathbf{A}
\right)^{-1}\\
&= \mathbf{A}^{-1} -
\frac{
\sigma^{-2} \mathbf{A}^{-1}
\mathbf{x}^\ast {\mathbf{x}^\ast}^\top \mathbf{A}^{-1}
}{
1 + \sigma^{-2} {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast
}\\
&= \mathbf{A}^{-1} -
\frac{
\mathbf{A}^{-1} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \mathbf{A}^{-1}
}{
\sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast
}.\end{split}\]
According to the Searle set of identities in the Matrix Cookbook (163),
\[\begin{split}\left( \mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top \right)^{-1} +
\mathbf{A}^{-1}
&= \left( \mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top \right)^{-1}
\boldsymbol{\Sigma}^{-1} \mathbf{A}^{-1}\\
&= \mathbf{A}^{-1}
\boldsymbol{\Sigma}^{-1}
\left( \mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top \right)^{-1}.\end{split}\]
Furthermore, applying (165) gives
\[\begin{split}\boldsymbol{\Sigma}'
&= \mathbf{A} \boldsymbol{\Sigma}
\left( \mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top \right)\\
&= \left( \mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top \right)
\boldsymbol{\Sigma} \mathbf{A}\\
&= \mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top -
\frac{
\sigma^{-2} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \mathbf{A}^{-1}
\mathbf{x}^\ast {\mathbf{x}^\ast}^\top
}{
\sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast
}.\end{split}\]
(c)
\[\kappa_1 \kappa_2 =
\frac{
\exp\left[
\left(
w^\ast -
\sigma^{-2} {\mathbf{x}^\ast}^\top \mathbf{A}^{-1}
\mathbf{X} \mathbf{w}
\right)^\top
\left(
{\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast + \sigma^2
\right)^{-1}
\left(
w^\ast -
\sigma^{-2} {\mathbf{x}^\ast}^\top \mathbf{A}^{-1}
\mathbf{X} \mathbf{w}
\right)
\right]^{-0.5}
}{
\left\vert
2 \pi \left(
{\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast + \sigma^2
\right)
\right\vert^{1 / 2}
}
\qquad \text{(c.1) and (c.2)}\]
(c.1)
\[\begin{split}& \frac{
(2 \pi)^{D / 2}
\left\vert
\left( \mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top \right)^{-1}
\right\vert^{1 / 2}
}{
\left\vert \sigma^2 \right\vert^{1 / 2}
}
\frac{1}{
(2 \pi)^{(D + 1) / 2}
\left\vert
\sigma^2 \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1} +
\mathbf{A}^{-1}
\right\vert^{1 / 2}
}\\
&= \left[
(2 \pi)^{1 / 2}
\left\vert \sigma^2 \right\vert^{1 / 2}
\left\vert
\mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top
\right\vert^{1 / 2}
\left\vert
\sigma^2 \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1} +
\mathbf{A}^{-1}
\right\vert^{1 / 2}
\right]^{-1}\\
&= \left[
(2 \pi)
\left\vert \sigma^2 \right\vert
\left\vert
\mathbf{I}_{D + 1} +
\mathbf{x}^\ast {\mathbf{x}^\ast}^\top \sigma^{-2} \mathbf{A}^{-1}
\right\vert
\right]^{-1 / 2}\\
&= \left[
(2 \pi)
\left\vert \sigma^2 \right\vert
\left(
1 +
{\mathbf{x}^\ast}^\top \sigma^{-2} \mathbf{A}^{-1} \mathbf{x}^\ast
\right)
\right]^{-1 / 2}
& \quad & \text{Matrix determinant lemma and Matrix Cookbook (24)}\\
&= \frac{1}{
(2 \pi)^{1 / 2}
\left(
\sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast
\right)^{1 / 2}
}\end{split}\]
(c.2)
\[\begin{split}& \exp\left[
{w^\ast}^2
\left(
\sigma^{-2} -
\sigma^{-2}
{\mathbf{x}^\ast}^\top
\left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1}
\mathbf{x}^\ast
\right)
\right]^{-1 / 2}
\exp\left[
\left(
\sigma^{-2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w} -
\left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1}
\mathbf{x}^\ast w^\ast
\right)^\top
\boldsymbol{\Sigma}'
\left(
\sigma^{-2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w} -
\left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1}
\mathbf{x}^\ast w^\ast
\right)
\right]^{-1 / 2}\\
&= \exp\left[
\left(
w^\ast -
\sigma^{-2} {\mathbf{x}^\ast}^\top \mathbf{A}^{-1}
\mathbf{X} \mathbf{w}
\right)^\top
\left(
{\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast + \sigma^2
\right)^{-1}
\left(
w^\ast -
\sigma^{-2} {\mathbf{x}^\ast}^\top \mathbf{A}^{-1}
\mathbf{X} \mathbf{w}
\right)
\right]^{-0.5}\end{split}\]
\[\begin{split}\text{(1)} & \quad &
{w^\ast}^2 \sigma^{-2} -
{w^\ast}^2 \sigma^{-2} {\mathbf{x}^\ast}^\top
\left(
\mathbf{x}^\ast {\mathbf{x}^\ast}^\top
\right)^{-1} \mathbf{x}^\ast\\\\
\text{(2)} & \quad &
\sigma^{-4} \mathbf{w}^\top \mathbf{X}^\top \mathbf{A}^{-1}
\left( \mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top \right)
\mathbf{A}^{-1} \mathbf{X} \mathbf{w} -
\sigma^{-4} \mathbf{w}^\top \mathbf{X}^\top \mathbf{A}^{-1}
\frac{
\sigma^{-2} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \mathbf{A}^{-1}
\mathbf{x}^\ast {\mathbf{x}^\ast}^\top
}{
\sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast
}
\mathbf{A}^{-1} \mathbf{X} \mathbf{w}\\\\
\text{(3)} & \quad &
-\left(
\sigma^{-4} \mathbf{w}^\top \mathbf{X}^\top \mathbf{A}^{-1}
\mathbf{x}^\ast w^\ast -
\sigma^{-2} \mathbf{w}^\top \mathbf{X}^\top \mathbf{A}^{-1}
\frac{
\sigma^{-2} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \mathbf{A}^{-1}
}{
\sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast
}
\mathbf{x}^\ast w^\ast
\right)\\\\
\text{(4)} & \quad &
-\left(
w^\ast {\mathbf{x}^\ast}^\top
\sigma^{-4} \mathbf{A}^{-1} \mathbf{X} \mathbf{w} -
w^\ast {\mathbf{x}^\ast}^\top
\frac{
\sigma^{-2} \mathbf{A}^{-1} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top
}{
\sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast
}
\sigma^{-2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w}
\right)\\\\
\text{(5)} & \quad &
w^\ast {\mathbf{x}^\ast}^\top
\sigma^{-2} \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1}
\mathbf{x}^\ast w^\ast -
w^\ast {\mathbf{x}^\ast}^\top
\frac{
\sigma^{-2} \mathbf{A}^{-1}
}{
\sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast
}
\mathbf{x}^\ast w^\ast\end{split}\]
\[\begin{split}\text{(1) + (5)}
&= \frac{
{w^\ast}^2 \sigma^{-2}
\left(
\sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast
\right) -
{w^\ast}^2 \sigma^{-2}
{\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast
}{
\sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast
}\\
&= \frac{
{w^\ast}^2
}{
\sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast
}\\\\\\
\text{(2)}
&= \frac{
\sigma^{-6} \mathbf{w}^\top \mathbf{X}^\top \mathbf{A}^{-1}
\left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)
\mathbf{A}^{-1} \mathbf{X} \mathbf{w}
\left(
\sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast
\right) -
\sigma^{-6} \mathbf{w}^\top \mathbf{X}^\top \mathbf{A}^{-1}
\mathbf{x}^\ast {\mathbf{x}^\ast}^\top
\mathbf{A}^{-1}
\mathbf{x}^\ast {\mathbf{x}^\ast}^\top
\mathbf{A}^{-1} \mathbf{X} \mathbf{w}
}{
\sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast
}\\
&= \frac{
\sigma^{-4} \mathbf{w}^\top \mathbf{X}^\top \mathbf{A}^{-1}
\mathbf{x}^\ast {\mathbf{x}^\ast}^\top
\mathbf{A}^{-1} \mathbf{X} \mathbf{w}
}{
\sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast
}\\\\\\
\text{(4)}
&= \frac{
-\sigma^{-4} w^\ast {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{X}
\mathbf{w}
\left(
\sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast
\right) +
\sigma^{-4} w^\ast {\mathbf{x}^\ast}^\top
\mathbf{A}^{-1} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top
\mathbf{A}^{-1} \mathbf{X} \mathbf{w}
}{
\sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast
}\\
&= \frac{
-\sigma^{-2} w^\ast
{\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{X} \mathbf{w}
}{
\sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast
}\\
&= \frac{
-\sigma^{-2} w^\ast
\mathbf{w}^\top \mathbf{X}^\top \mathbf{A}^{-1} \mathbf{x}^\ast
}{
\sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast
}\\
&= \text{(3)}\end{split}\]
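A sketch of the predictive computation under the same assumed toy setup (repeated so the snippet runs on its own):

```python
import numpy as np

# Predictive mean and variance at a new x*:
#   mean sigma^{-2} x*^T A^{-1} X w, variance sigma^2 + x*^T A^{-1} x*.
rng = np.random.default_rng(2)
I, D = 50, 3
X = np.vstack([np.ones(I), rng.normal(size=(D, I))])
w = X.T @ rng.normal(size=D + 1) + 0.1 * rng.normal(size=I)
sig2, sig2_p = 0.01, 1.0
A = X @ X.T / sig2 + np.eye(D + 1) / sig2_p

x_star = np.concatenate([[1.0], rng.normal(size=D)])
w_mean = x_star @ np.linalg.solve(A, X @ w) / sig2   # sigma^{-2} x*^T A^{-1} X w
w_var = sig2 + x_star @ np.linalg.solve(A, x_star)   # sigma^2 + x*^T A^{-1} x*
```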
Exercise 8.6
\[\begin{split}\mathbf{A}^{-1}
&= \left(
     \sigma^{-2} \mathbf{X} \mathbf{X}^\top + \sigma_p^{-2} \mathbf{I}_{D + 1}
   \right)^{-1}\\
&= \left(
     \sigma_p^{-2} \mathbf{I}_{D + 1} +
     \mathbf{X} \left( \sigma^{-2} \mathbf{I}_I \right) \mathbf{X}^\top
   \right)^{-1}\\
&= \sigma_p^2 \mathbf{I}_{D + 1} -
   \sigma_p^2 \mathbf{I}_{D + 1} \mathbf{X}
   \left(
     \mathbf{X}^\top \sigma_p^2 \mathbf{I}_{D + 1} \mathbf{X} +
     \sigma^2 \mathbf{I}_I
   \right)^{-1}
   \mathbf{X}^\top \sigma_p^2 \mathbf{I}_{D + 1}\\
&= \sigma_p^2 \mathbf{I}_{D + 1} -
   \sigma_p^4 \mathbf{X} \sigma_p^{-2}
   \left(
     \mathbf{X}^\top \mathbf{X} +
     \frac{\sigma^2}{\sigma_p^2} \mathbf{I}_I
   \right)^{-1}
   \mathbf{X}^\top\\
&= \sigma_p^2 \mathbf{I}_{D + 1} -
   \sigma_p^2 \mathbf{X}
   \left(
     \mathbf{X}^\top \mathbf{X} +
     \frac{\sigma^2}{\sigma_p^2} \mathbf{I}_I
   \right)^{-1}
   \mathbf{X}^\top\end{split}\]
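A quick numerical check of the identity (dimensions and variances are illustrative):

```python
import numpy as np

# Verify the rewritten inverse numerically on random data.
rng = np.random.default_rng(3)
D1, I = 4, 7                                  # D1 = D + 1
X = rng.normal(size=(D1, I))
sig2, sig2_p = 0.5, 2.0

direct = np.linalg.inv(X @ X.T / sig2 + np.eye(D1) / sig2_p)
woodbury = sig2_p * np.eye(D1) - sig2_p * X @ np.linalg.solve(
    X.T @ X + (sig2 / sig2_p) * np.eye(I), X.T)
assert np.allclose(direct, woodbury)
```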
Exercise 8.7
Recall that
\[\frac{\partial}{\partial \sigma} \left(
\sigma_p^2 \mathbf{X}^\top \mathbf{X} + \sigma^2 \mathbf{I}
\right) = 2 \sigma \mathbf{I}.\]
\[\begin{split}\DeclareMathOperator{\tr}{\mathrm{tr}}
0 &= \frac{\partial}{\partial \sigma}
\log Pr(\mathbf{w} \mid \mathbf{X}, \sigma^2)\\
&= \frac{\partial}{\partial \sigma} \log
\NormDist_{\mathbf{w}}\left[
\boldsymbol{0},
\sigma_p^2 \mathbf{X}^\top \mathbf{X} + \sigma^2 \mathbf{I}_I
\right]\\
&= \frac{\partial}{\partial \sigma} \left(
-\frac{I}{2} \log[2 \pi] -
\frac{1}{2} \log\left\vert
\sigma_p^2 \mathbf{X}^\top \mathbf{X} + \sigma^2 \mathbf{I}
\right\vert -
\frac{1}{2} \mathbf{w}^\top
\left(
\sigma_p^2 \mathbf{X}^\top \mathbf{X} + \sigma^2 \mathbf{I}
\right)^{-1} \mathbf{w}
\right)\\
&= -\frac{1}{2} \tr\left[
\left(
\sigma_p^2 \mathbf{X}^\top \mathbf{X} + \sigma^2 \mathbf{I}
\right)^{-1}
2 \sigma \mathbf{I}
\right] -
\frac{1}{2} \mathbf{w}^\top
\left[
-\left(
\sigma_p^2 \mathbf{X}^\top \mathbf{X} + \sigma^2 \mathbf{I}
\right)^{-1}
2 \sigma \mathbf{I}
\left(
\sigma_p^2 \mathbf{X}^\top \mathbf{X} + \sigma^2 \mathbf{I}
\right)^{-1}
\right] \mathbf{w}
& \quad & \text{(C.36) and (C.39)}\\
&= \mathbf{w}^\top
   \left(
     \sigma_p^2 \mathbf{X}^\top \mathbf{X} + \sigma^2 \mathbf{I}
   \right)^{-2} \mathbf{w} -
   \tr\left[
     \left(
       \sigma_p^2 \mathbf{X}^\top \mathbf{X} + \sigma^2 \mathbf{I}
     \right)^{-1}
   \right]
   & \quad & \text{after dividing through by } \sigma\end{split}\]
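A sketch that maximizes the marginal likelihood over \(\sigma\) numerically and checks the stationarity condition \(\mathbf{w}^\top \mathbf{C}^{-2} \mathbf{w} = \mathrm{tr}[\mathbf{C}^{-1}]\) with \(\mathbf{C} = \sigma_p^2 \mathbf{X}^\top \mathbf{X} + \sigma^2 \mathbf{I}\) at the optimum (toy data; the log-parameterization keeps \(\sigma > 0\)):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative toy data.
rng = np.random.default_rng(4)
I = 30
X = np.vstack([np.ones(I), rng.normal(size=(2, I))])
sig2_p = 1.0
w = X.T @ rng.normal(size=3) + 0.3 * rng.normal(size=I)

def neg_log_marginal(log_sig):
    C = sig2_p * X.T @ X + np.exp(2.0 * log_sig) * np.eye(I)
    _, logdet = np.linalg.slogdet(C)
    return 0.5 * (logdet + w @ np.linalg.solve(C, w))

log_sig = minimize_scalar(neg_log_marginal, bounds=(-5.0, 2.0),
                          method="bounded").x
C = sig2_p * X.T @ X + np.exp(2.0 * log_sig) * np.eye(I)
Cw = np.linalg.solve(C, w)
# Stationarity holds at the optimum (to optimizer tolerance).
assert np.isclose(Cw @ Cw, np.trace(np.linalg.inv(C)), rtol=1e-2)
```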
Exercise 8.8
\[q(\phi) = \max_h L =
\max_h \NormDist_{\phi}\left[ 0, h^{-1} \right] \GamDist_h[\nu / 2, \nu / 2]\]
is an unconstrained optimization problem where
\[\begin{split}\frac{\partial L}{\partial h}
&= \frac{\partial}{\partial h} \log\left[
\frac{1}{\sqrt{2 \pi h^{-1}}}
\exp\left( -0.5 \phi^2 h\right)
\frac{\left( \frac{\nu}{2} \right)^{\nu / 2}}{\Gamma[\nu / 2]}
\exp\left( -\frac{\nu}{2} h \right) h^{\nu / 2 - 1}
\right]\\
0 &= \frac{\partial}{\partial h} \left[
\frac{1}{2} \log(h) - \frac{1}{2} \log(2 \pi) +
\left( -0.5 \phi^2 h \right) +
\log\left(
\frac{\left( \frac{\nu}{2} \right)^{\nu / 2}}{\Gamma[\nu / 2]}
\right) +
\left( -0.5 \nu h \right) +
\left( \frac{\nu}{2} - 1 \right) \log(h)
\right]\\
&= \frac{1}{2h} - \frac{1}{2} \phi^2 - \frac{1}{2} \nu +
\left( \frac{\nu}{2} - 1 \right) h^{-1}\\
h^{-1} \frac{\nu - 1}{2} &= \frac{\phi^2 + \nu}{2}\\
h &= \frac{\nu - 1}{\phi^2 + \nu}.\end{split}\]
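The closed form can be checked by maximizing the integrand numerically (the values of \(\nu\) and \(\phi\) are illustrative):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import gamma, norm

# Maximize Norm_phi[0, 1/h] Gam_h[nu/2, nu/2] over h and compare with the
# closed form h = (nu - 1) / (phi^2 + nu).
nu, phi = 5.0, 1.5

def neg_log_integrand(h):
    return -(norm.logpdf(phi, loc=0.0, scale=np.sqrt(1.0 / h))
             + gamma.logpdf(h, a=nu / 2.0, scale=2.0 / nu))  # rate nu/2

h_numeric = minimize_scalar(neg_log_integrand,
                            bounds=(1e-6, 10.0), method="bounded").x
h_closed = (nu - 1.0) / (phi ** 2 + nu)
assert np.isclose(h_numeric, h_closed, rtol=1e-3)
```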
Exercise 8.9
Pass each input \(x\) through the nonlinear transformation
\(z = \begin{bmatrix} 1 & x & x^2 & x^3 \end{bmatrix}^\top\) as in (8.19).
This enables reuse of the linear model since (8.17) is linear in \(z\).
The maximum likelihood learning algorithm (8.6) can then be applied with
\(X\) replaced by \(Z\), where the columns of \(Z\) contain the transformed
vectors \(\left\{ z_i \right\}_{i = 1}^I\).
The Bayesian linear regression inference algorithm (8.14) can likewise be
reused by replacing \(x\) and \(X\) with \(z\) and \(Z\) respectively.
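A sketch of the cubic case (toy data; the sinusoidal target is illustrative):

```python
import numpy as np

# Cubic-basis fit: map each x_i to z_i = [1, x_i, x_i^2, x_i^3]^T and
# reuse the linear ML solution with Z in place of X.
rng = np.random.default_rng(5)
x = rng.uniform(-1.0, 1.0, size=40)
w = np.sin(3.0 * x) + 0.1 * rng.normal(size=40)   # toy nonlinear data

Z = np.vander(x, N=4, increasing=True).T          # columns are the z_i
phi = np.linalg.solve(Z @ Z.T, Z @ w)             # (8.6) with Z for X
w_pred = Z.T @ phi                                # fitted values
```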
Exercise 8.10
Dual linear regression should only be used when the number of training
examples \(I\) is smaller than the parameter dimension \(D + 1\): the dual
solution inverts the \(I \times I\) matrix \(X^\top X\) instead of the
\((D + 1) \times (D + 1)\) matrix \(X X^\top\), so otherwise the computation
only becomes more expensive.
Exercise 8.11
\[\begin{split}\frac{\partial L}{\partial \boldsymbol{\psi}}
&= -\frac{1}{2 \sigma^2} \frac{\partial}{\partial \boldsymbol{\psi}}
   \left(
     \mathbf{w}^\top \mathbf{w} -
     \mathbf{w}^\top \mathbf{X}^\top \mathbf{X} \boldsymbol{\psi} -
     \boldsymbol{\psi}^\top \mathbf{X}^\top \mathbf{X} \mathbf{w} +
     \boldsymbol{\psi}^\top \mathbf{X}^\top \mathbf{X}
     \mathbf{X}^\top \mathbf{X} \boldsymbol{\psi}
   \right)\\
0 &= -\frac{1}{2 \sigma^2} \left(
     -2 \mathbf{X}^\top \mathbf{X} \mathbf{w} +
     2 \mathbf{X}^\top \mathbf{X} \mathbf{X}^\top \mathbf{X}
     \boldsymbol{\psi}
   \right)
   & \quad & \text{(C.27), (C.28), (C.33)}\\
\mathbf{X}^\top \mathbf{X} \mathbf{X}^\top \mathbf{X} \boldsymbol{\psi}
  &= \mathbf{X}^\top \mathbf{X} \mathbf{w}\\
\boldsymbol{\psi}
  &= \left( \mathbf{X}^\top \mathbf{X} \right)^{-1} \mathbf{w}\end{split}\]
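A sketch checking the dual solution on illustrative toy data with \(I < D + 1\), where \(X^\top X\) is invertible and the dual fit interpolates the training data:

```python
import numpy as np

# Dual fit psi = (X^T X)^{-1} w and the corresponding primal phi = X psi.
rng = np.random.default_rng(6)
D1, I = 10, 6                                   # D1 = D + 1, with I < D1
X = rng.normal(size=(D1, I))
w = rng.normal(size=I)

psi = np.linalg.solve(X.T @ X, w)               # dual parameters
phi = X @ psi                                   # primal parameters phi = X psi
assert np.allclose(X.T @ phi, w)                # exact fit since I < D + 1
```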