Regression Models

Non-linear Regression

  • A radial basis function is any spherically symmetric function, i.e., one whose value depends only on the distance from a fixed center.

  • The Bayesian approach to nonlinear regression (8.24) is rarely used directly in practice because it requires computing the inner products \(z_i^\top z_j\), and the transformed vectors \(z\) may be very high dimensional or even infinite dimensional.

  • Mercer’s Theorem makes it possible to evaluate these inner products without computing \(z\) explicitly.

    • A kernel function \(k[x_i, x_j]\) is valid when the kernel’s arguments are in a measurable space and the kernel is positive semi-definite.

      • See (8.26), (8.27), (8.28) for examples of a linear, polynomial, and RBF kernel.

    • The sums and products of valid kernels are themselves valid kernels (see the sketch after this list).

  • The kernel trick consists of choosing a kernel function \(k[x_i, x_j]\) to replace \(f[x_i]^\top f[x_j]\) without knowing what \(f[\bullet]\) does.
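For concreteness, here is a minimal numpy sketch of the three kernels together with a numerical check of the closure property; the polynomial offset and the length scale lam are illustrative assumptions rather than the book’s exact parameterizations.

import numpy

def k_linear(x_i, x_j):
    # Linear kernel, cf. (8.26).
    return x_i @ x_j

def k_poly(x_i, x_j, degree=2):
    # Polynomial kernel, cf. (8.27); offset and degree chosen for illustration.
    return (1.0 + x_i @ x_j) ** degree

def k_rbf(x_i, x_j, lam=1.0):
    # RBF kernel, cf. (8.28); lam is an assumed length-scale parameter.
    return numpy.exp(-0.5 * numpy.sum((x_i - x_j) ** 2) / lam**2)

# Closure check: a sum/product of valid kernels yields a positive
# semi-definite Gram matrix.
rng = numpy.random.default_rng(0)
X = rng.normal(size=(5, 3))
K = numpy.array([[k_poly(a, b) + k_linear(a, b) * k_rbf(a, b) for b in X] for a in X])
print(numpy.linalg.eigvalsh(K).min() >= -1e-9)  # True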

Exercise 8.1

Assuming a linear relation between the world state and the data, let \(\boldsymbol{\theta} = \{ \alpha, \beta, \boldsymbol{\phi} \}\) and \(\alpha = \beta = \boldsymbol{\phi}^\top \mathbf{x}\) where \(\alpha, \beta > 0\) and \(\boldsymbol{\phi} \in \mathbb{R}^{D + 1}\).

\[\begin{split}\DeclareMathOperator{\GamDist}{Gam} Pr(w \mid \mathbf{x}, \boldsymbol{\theta}) &= \GamDist_w[\alpha, \beta]\\ &= \frac{\beta^\alpha}{\Gamma[\alpha]} \exp[-\beta w] w^{\alpha - 1} & \quad & \text{(7.23)}\end{split}\]

In the maximum likelihood approach, the goal is

\[\begin{split}\DeclareMathOperator*{\argmax}{arg\,max} \hat{\boldsymbol{\theta}} &= \argmax_{\boldsymbol{\theta}} Pr(w \mid \mathbf{x}, \boldsymbol{\theta})\\ &= \argmax_{\boldsymbol{\theta}} L & \quad & L = \log Pr(w \mid \mathbf{x}, \boldsymbol{\theta})\\ &= \argmax_{\boldsymbol{\theta}} \alpha \log(\beta) - \log(\Gamma[\alpha]) - \beta w + (\alpha - 1) \log(w)\end{split}\]
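This objective has no closed-form maximum, so it can be maximized numerically. A minimal sketch on synthetic data under the modeling choice above (\(\alpha = \beta = \boldsymbol{\phi}^\top \mathbf{x}\), assumed positive over the search region), summing the single-pair log likelihood over \(I\) pairs:

import numpy
from scipy.optimize import minimize
from scipy.special import gammaln

rng = numpy.random.default_rng(0)
I = 50
X = numpy.vstack([numpy.ones(I), rng.uniform(1, 2, I)])  # (D+1) x I with a bias row
w = rng.gamma(shape=3.0, scale=1 / 3.0, size=I)          # synthetic world states

def neg_log_lik(phi):
    a = phi @ X                    # alpha_i = beta_i = phi^T x_i
    if numpy.any(a <= 0):
        return numpy.inf           # outside the valid parameter region
    return -numpy.sum(a * numpy.log(a) - gammaln(a) - a * w + (a - 1) * numpy.log(w))

phi_hat = minimize(neg_log_lik, x0=numpy.ones(2), method='Nelder-Mead').x
print(phi_hat)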

Exercise 8.2

Assuming a linear relation between the world state and the data, let \(\boldsymbol{\theta} = \{ \mu, \sigma, \boldsymbol{\phi} \}\) and \(\mu = \boldsymbol{\phi}^\top \mathbf{x}\) where \(\sigma > 0\) and \(\boldsymbol{\phi} \in \mathbb{R}^{D + 1}\).

\[\begin{split}\DeclareMathOperator{\StudDist}{Stud} \DeclareMathOperator{\NormDist}{Norm} Pr(w \mid \mathbf{x}, \boldsymbol{\theta}) &= \StudDist_w\left[ \mu, \sigma^2, \nu \right]\\ &= \int \NormDist_w\left[ \mu, \sigma^2 / h \right] \GamDist_h[\nu / 2, \nu / 2] dh & \quad & \text{(7.24)}\end{split}\]

The EM algorithm can be used to fit \(\boldsymbol{\theta}\) in the maximum likelihood approach.
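A minimal sketch of that EM procedure, assuming \(\nu\) is held fixed: the E-step expectation \(E[h_i] = (\nu + 1) / \left( \nu + (w_i - \boldsymbol{\phi}^\top \mathbf{x}_i)^2 / \sigma^2 \right)\) follows from the gamma posterior over each hidden variable, and the M-step reduces to weighted least squares.

import numpy

def fit_t_regression(X, w, nu=4.0, iters=100):
    # X is (D+1) x I with a bias row; w has length I; nu is held fixed.
    phi = numpy.linalg.solve(X @ X.T, X @ w)        # least-squares initialization
    sigma2 = numpy.mean((w - X.T @ phi) ** 2)
    for _ in range(iters):
        h = (nu + 1) / (nu + (w - X.T @ phi) ** 2 / sigma2)  # E-step: E[h_i]
        XH = X * h                                  # scale column i by h_i
        phi = numpy.linalg.solve(XH @ X.T, XH @ w)  # M-step: weighted least squares
        sigma2 = numpy.mean(h * (w - X.T @ phi) ** 2)
    return phi, sigma2

rng = numpy.random.default_rng(0)
I = 200
X = numpy.vstack([numpy.ones(I), rng.normal(size=I)])
w = X.T @ numpy.array([1.0, 2.0]) + rng.standard_t(df=4, size=I)
print(fit_t_regression(X, w))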

Exercise 8.3

\[\begin{split}L &= \log Pr(\mathbf{w} \mid \mathbf{X}, \boldsymbol{\theta})\\ &= \log \NormDist_\mathbf{w}\left[ \mathbf{X}^\top \boldsymbol{\phi}, \sigma^2 \mathbf{I} \right]\\ &= -\frac{I}{2} \log(2 \pi) - \frac{I}{2} \log\left( \sigma^2 \right) - \frac{ \left( \mathbf{w} - \mathbf{X}^\top \boldsymbol{\phi} \right)^\top \left( \mathbf{w} - \mathbf{X}^\top \boldsymbol{\phi} \right) }{ 2 \sigma^2 }\end{split}\]

(a)

\[\begin{split}\frac{\partial L}{\partial \sigma} &= -\frac{I}{\sigma} + \sigma^{-3} \left( \mathbf{w} - \mathbf{X}^\top \boldsymbol{\phi} \right)^\top \left( \mathbf{w} - \mathbf{X}^\top \boldsymbol{\phi} \right)\\ 0 &= -I \sigma^2 + \left( \mathbf{w} - \mathbf{X}^\top \boldsymbol{\phi} \right)^\top \left( \mathbf{w} - \mathbf{X}^\top \boldsymbol{\phi} \right)\\ \sigma^2 &= I^{-1} \left( \mathbf{w} - \mathbf{X}^\top \boldsymbol{\phi} \right)^\top \left( \mathbf{w} - \mathbf{X}^\top \boldsymbol{\phi} \right)\end{split}\]

(b)

\[\begin{split}\frac{\partial L}{\partial \boldsymbol{\phi}} &= -\frac{1}{2 \sigma^2} \left[ \mathbf{X} \left( \mathbf{X}^\top \boldsymbol{\phi} - \mathbf{w} \right) + \mathbf{X} \left( \mathbf{X}^\top \boldsymbol{\phi} - \mathbf{w} \right) \right] & \quad & \text{(C.32)}\\ 0 &= -\mathbf{X} \mathbf{X}^\top \boldsymbol{\phi} + \mathbf{X} \mathbf{w}\\ \boldsymbol{\phi} &= (\mathbf{X} \mathbf{X}^\top)^{-1} \mathbf{X} \mathbf{w}\end{split}\]
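A quick numerical check of these closed forms on synthetic data (a minimal sketch; shapes follow the convention \(\mathbf{X} \in \mathbb{R}^{(D + 1) \times I}\), \(\mathbf{w} \in \mathbb{R}^I\)):

import numpy

rng = numpy.random.default_rng(1)
I, D = 200, 3
X = numpy.vstack([numpy.ones(I), rng.normal(size=(D, I))])  # (D+1) x I with a bias row
phi_true = rng.normal(size=D + 1)
w = X.T @ phi_true + 0.5 * rng.normal(size=I)

phi = numpy.linalg.solve(X @ X.T, X @ w)    # ML weights from (b)
resid = w - X.T @ phi
sigma2 = resid @ resid / I                  # ML variance from (a)
print(phi, sigma2)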

Exercise 8.4

\[\begin{split}Pr(\boldsymbol{\phi} \mid \mathbf{X}, \mathbf{w}) &= \frac{ Pr(\boldsymbol{\phi}, \mathbf{X}, \mathbf{w}) }{ Pr(\mathbf{X}, \mathbf{w}) }\\ &= \frac{ Pr(\mathbf{w} \mid \mathbf{X}, \boldsymbol{\phi}) Pr(\boldsymbol{\phi} \mid \mathbf{X}) Pr(\mathbf{X}) }{ Pr(\mathbf{w} \mid \mathbf{X}) Pr(\mathbf{X}) }\\ &= \frac{ Pr(\mathbf{w} \mid \mathbf{X}, \boldsymbol{\phi}) Pr(\boldsymbol{\phi}) }{ Pr(\mathbf{w} \mid \mathbf{X}) }\end{split}\]

The foregoing implies that (8.8) assumes \(Pr(\boldsymbol{\phi} \mid \mathbf{X}) = Pr(\boldsymbol{\phi})\), i.e., the data do not affect the prior on the parameters (8.7).

Furthermore, (8.9) assumes that \(\sigma^2\) is known, so \(Pr(\mathbf{w} \mid \mathbf{X}, \boldsymbol{\theta} = \left\{ \boldsymbol{\phi}, \sigma^2 \right\}) = Pr(\mathbf{w} \mid \mathbf{X}, \boldsymbol{\phi})\).

\[\begin{split}Pr(\boldsymbol{\phi} \mid \mathbf{X}, \mathbf{w}) &= \frac{1}{Pr(\mathbf{w} \mid \mathbf{X})} \NormDist_{\mathbf{w}}\left[ \mathbf{X}^\top \boldsymbol{\phi}, \sigma^2 \mathbf{I}_I \right] \NormDist_{\boldsymbol{\phi}}\left[ \boldsymbol{0}, \sigma_p^2 \mathbf{I}_{D + 1} \right]\\ &= \frac{\kappa_1}{Pr(\mathbf{w} \mid \mathbf{X})} \NormDist_{\boldsymbol{\phi}}\left[ \left( \mathbf{X} \mathbf{X}^\top \right)^{-1} \mathbf{X} \mathbf{w}, \sigma^2 \left( \mathbf{X} \mathbf{X}^\top \right)^{-1} \right] \NormDist_{\boldsymbol{\phi}}\left[ \boldsymbol{0}, \sigma_p^2 \mathbf{I}_{D + 1} \right] & \quad & \text{(a)}\\ &= \frac{\kappa_1 \kappa_2}{Pr(\mathbf{w} \mid \mathbf{X})} \NormDist_{\boldsymbol{\phi}}\left[ \frac{1}{\sigma^2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w}, \mathbf{A}^{-1} \right] & \quad & \text{(b)}\\ &= \NormDist_{\boldsymbol{\phi}}\left[ \frac{1}{\sigma^2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w}, \mathbf{A}^{-1} \right] & \quad & \text{(c)}\end{split}\]

where

\[\mathbf{A} = \frac{1}{\sigma^2} \mathbf{X} \mathbf{X}^\top + \frac{1}{\sigma_p^2} \mathbf{I}_{D + 1}.\]

(a)

See Exercise 5.10 for more details.

\[\begin{split}\NormDist_{\mathbf{w}}\left[ \mathbf{X}^\top \boldsymbol{\phi}, \sigma^2 \mathbf{I}_I \right] &= \kappa_1 \NormDist_{\boldsymbol{\phi}}\left[ \left( \mathbf{X} \mathbf{X}^\top \sigma^{-2} \right)^{-1} \sigma^{-2} \mathbf{X} \mathbf{w}, \left( \mathbf{X} \mathbf{X}^\top \sigma^{-2} \right)^{-1} \right]\\ &= \kappa_1 \NormDist_{\boldsymbol{\phi}}\left[ \left( \mathbf{X} \mathbf{X}^\top \right)^{-1} \mathbf{X} \mathbf{w}, \sigma^2 \left( \mathbf{X} \mathbf{X}^\top \right)^{-1} \right]\end{split}\]

where

\[\begin{split}\kappa_1 &= \frac{ (2 \pi)^{(D + 1 - I) / 2} \left\vert \sigma^2 \left( \mathbf{X} \mathbf{X}^\top \right)^{-1} \right\vert^{1 / 2} }{ \left\vert \sigma^2 \mathbf{I}_I \right\vert^{1 / 2} } \exp\left[ \mathbf{w}^\top \left( \sigma^{-2} \mathbf{I}_I - \sigma^{-2} \mathbf{I}_I \mathbf{X}^\top \sigma^2 \left( \mathbf{X} \mathbf{X}^\top \right)^{-1} \mathbf{X} \sigma^{-2} \mathbf{I}_I \right) \mathbf{w} \right]^{-0.5}\\ &= \frac{ (2 \pi)^{(D + 1 - I) / 2} \sigma^{D + 1 - I} }{ \left\vert \mathbf{X} \mathbf{X}^\top \right\vert^{1/2} } \exp\left[ \mathbf{w}^\top \left( \sigma^{-2} \mathbf{I}_I - \sigma^{-2} \mathbf{X}^\top \left( \mathbf{X} \mathbf{X}^\top \right)^{-1} \mathbf{X} \right) \mathbf{w} \right]^{-0.5}\end{split}\]

(b)

See Exercise 5.7 and Exercise 5.9 for more details.

\[\begin{split}& \NormDist_{\boldsymbol{\phi}}\left[ \left( \mathbf{X} \mathbf{X}^\top \right)^{-1} \mathbf{X} \mathbf{w}, \sigma^2 \left( \mathbf{X} \mathbf{X}^\top \right)^{-1} \right] \NormDist_{\boldsymbol{\phi}}\left[ \boldsymbol{0}, \sigma_p^2 \mathbf{I}_{D + 1} \right]\\ &= \kappa_2 \NormDist_{\boldsymbol{\phi}}\left[ \left( \frac{1}{\sigma^2} \mathbf{X} \mathbf{X}^\top + \frac{1}{\sigma_p^2} \mathbf{I}_{D + 1} \right)^{-1} \sigma^{-2} \mathbf{X} \mathbf{X}^\top \left( \mathbf{X} \mathbf{X}^\top \right)^{-1} \mathbf{X} \mathbf{w}, \left( \frac{1}{\sigma^2} \mathbf{X} \mathbf{X}^\top + \frac{1}{\sigma_p^2} \mathbf{I}_{D + 1} \right)^{-1} \right]\\ &= \kappa_2 \NormDist_{\boldsymbol{\phi}}\left[ \frac{1}{\sigma^2} \left( \frac{1}{\sigma^2} \mathbf{X} \mathbf{X}^\top + \frac{1}{\sigma_p^2} \mathbf{I}_{D + 1} \right)^{-1} \mathbf{X} \mathbf{w}, \left( \frac{1}{\sigma^2} \mathbf{X} \mathbf{X}^\top + \frac{1}{\sigma_p^2} \mathbf{I}_{D + 1} \right)^{-1} \right]\end{split}\]

where

\[\begin{split}\kappa_2 &= \NormDist_{\boldsymbol{0}}\left[ \left( \mathbf{X} \mathbf{X}^\top \right)^{-1} \mathbf{X} \mathbf{w}, \sigma^2 \left( \mathbf{X} \mathbf{X}^\top \right)^{-1} + \sigma_p^2 \mathbf{I}_{D + 1} \right]\\ &= \frac{1}{ (2 \pi)^{(D + 1) / 2} \left\vert \sigma^2 \left( \mathbf{X} \mathbf{X}^\top \right)^{-1} + \sigma_p^2 \mathbf{I}_{D + 1} \right\vert^{1 / 2} } \exp\left[ \left( \left( \mathbf{X} \mathbf{X}^\top \right)^{-1} \mathbf{X} \mathbf{w} \right)^\top \left( \sigma^2 \left( \mathbf{X} \mathbf{X}^\top \right)^{-1} + \sigma_p^2 \mathbf{I}_{D + 1} \right)^{-1} \left( \left( \mathbf{X} \mathbf{X}^\top \right)^{-1} \mathbf{X} \mathbf{w} \right) \right]^{-0.5}\end{split}\]

(c)

\[\begin{split}Pr(\mathbf{w} \mid \mathbf{X}) &= \int Pr(\boldsymbol{\phi}, \mathbf{w} \mid \mathbf{X}) d\boldsymbol{\phi}\\ &= \int Pr(\mathbf{w} \mid \mathbf{X}, \boldsymbol{\phi}) Pr(\boldsymbol{\phi} \mid \mathbf{X}) d\boldsymbol{\phi}\\ &= \int Pr(\mathbf{w} \mid \mathbf{X}, \boldsymbol{\phi}) Pr(\boldsymbol{\phi}) d\boldsymbol{\phi}\\ &= \int \kappa_1 \kappa_2 \NormDist_{\boldsymbol{\phi}}\left[ \frac{1}{\sigma^2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w}, \mathbf{A}^{-1} \right] d\boldsymbol{\phi}\\ &= \kappa_1 \kappa_2\end{split}\]
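A minimal sketch computing this posterior on synthetic data; \(\sigma^2\) is treated as known, matching the assumption above.

import numpy

rng = numpy.random.default_rng(2)
I, D = 50, 3
sigma2, sigma2_p = 0.25, 1.0
X = numpy.vstack([numpy.ones(I), rng.normal(size=(D, I))])
w = X.T @ rng.normal(size=D + 1) + numpy.sqrt(sigma2) * rng.normal(size=I)

A = X @ X.T / sigma2 + numpy.eye(D + 1) / sigma2_p
phi_cov = numpy.linalg.inv(A)               # posterior covariance A^{-1}
phi_mean = phi_cov @ X @ w / sigma2         # posterior mean A^{-1} X w / sigma^2
print(phi_mean)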

Exercise 8.5

Let \(w^\ast, \sigma \in \mathbb{R}\); \(\mathbf{x}^\ast, \boldsymbol{\phi} \in \mathbb{R}^{D + 1}\); \(\mathbf{A} \in \mathbb{R}^{(D + 1) \times (D + 1)}\); \(\mathbf{X} \in \mathbb{R}^{(D + 1) \times I}\); \(\mathbf{w} \in \mathbb{R}^I\).

\[\begin{split}Pr(w^\ast \mid \mathbf{x}^\ast, \mathbf{X}, \mathbf{w}) &= \int Pr(\boldsymbol{\phi}, w^\ast \mid \mathbf{x}^\ast, \mathbf{X}, \mathbf{w}) d\boldsymbol{\phi}\\ &= \int Pr(w^\ast \mid \mathbf{x}^\ast, \boldsymbol{\phi}) Pr(\boldsymbol{\phi} \mid \mathbf{X}, \mathbf{w}) d\boldsymbol{\phi} & \quad & \text{conditional independence}\\ &= \int \NormDist_{w^\ast}\left[ \boldsymbol{\phi}^\top \mathbf{x}^\ast, \sigma^2 \right] \NormDist_{\boldsymbol{\phi}}\left[ \sigma^{-2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w}, \mathbf{A}^{-1} \right] d\boldsymbol{\phi} & \quad & \text{(8.2) and (8.10)}\\ &= \int \kappa_1 \NormDist_{\boldsymbol{\phi}}\left[ \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1} \mathbf{x}^\ast w^\ast, \sigma^2 \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1} \right] \NormDist_{\boldsymbol{\phi}}\left[ \sigma^{-2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w}, \mathbf{A}^{-1} \right] d\boldsymbol{\phi} & \quad & \text{(a)}\\ &= \int \kappa_1 \kappa_2 \NormDist_{\boldsymbol{\phi}}\left[ \boldsymbol{\Sigma} \left( \sigma^{-2} \mathbf{x}^\ast w^\ast + \sigma^{-2} \mathbf{X} \mathbf{w} \right), \boldsymbol{\Sigma} \right] d\boldsymbol{\phi} & \quad & \text{(b)}\\ &= \kappa_1 \kappa_2\\ &= \NormDist_{w^\ast}\left[ \sigma^{-2} {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{X} \mathbf{w}, \sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast \right] & \quad & \text{(c)}\end{split}\]

(a)

See Exercise 5.10 for more details.

\[\begin{split}\NormDist_{w^\ast}\left[ {\mathbf{x}^\ast}^\top \boldsymbol{\phi}, \sigma^2 \right] &= \kappa_1 \NormDist_{\boldsymbol{\phi}}\left[ \left( \mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top \right)^{-1} \mathbf{x}^\ast \sigma^{-2} w^\ast, \left( \mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top \right)^{-1} \right]\\ &= \kappa_1 \NormDist_{\boldsymbol{\phi}}\left[ \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1} \mathbf{x}^\ast w^\ast, \sigma^2 \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1} \right]\end{split}\]

where

\[\begin{split}\kappa_1 &= (2 \pi)^{D / 2} \frac{ \left\vert \left( \mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top \right)^{-1} \right\vert^{1 / 2} }{ \left\vert \sigma^2 \right\vert^{1 / 2} } \exp\left[ w^\ast \left( \sigma^{-2} - \sigma^{-2} {\mathbf{x}^\ast}^\top \left( \mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top \right)^{-1} \mathbf{x}^\ast \sigma^{-2} \right) w^\ast \right]^{-0.5}\\ &= (2 \pi)^{D / 2} \sigma^{-1} \left\vert \sigma^2 \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1} \right\vert^{1 / 2} \exp\left[ w^\ast \left( \sigma^{-2} - \sigma^{-2} {\mathbf{x}^\ast}^\top \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1} \mathbf{x}^\ast \right) w^\ast \right]^{-0.5}\end{split}\]

(b)

See Exercise 5.7 and Exercise 5.9 for more details.

\[\begin{split}& \NormDist_{\boldsymbol{\phi}}\left[ \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1} \mathbf{x}^\ast w^\ast, \sigma^2 \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1} \right] \NormDist_{\boldsymbol{\phi}}\left[ \sigma^{-2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w}, \mathbf{A}^{-1} \right]\\ &= \kappa_2 \NormDist_{\boldsymbol{\phi}}\left[ \left( \sigma^{-2} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top + \mathbf{A} \right)^{-1} \left( \sigma^{-2} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1} \mathbf{x}^\ast w^\ast + \mathbf{A} \sigma^{-2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w} \right), \left( \sigma^{-2} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top + \mathbf{A} \right)^{-1} \right]\\ &= \kappa_2 \NormDist_{\boldsymbol{\phi}}\left[ \left( \sigma^{-2} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top + \mathbf{A} \right)^{-1} \left( \sigma^{-2} \mathbf{x}^\ast w^\ast + \sigma^{-2} \mathbf{X} \mathbf{w} \right), \left( \sigma^{-2} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top + \mathbf{A} \right)^{-1} \right]\end{split}\]

where

\[\begin{split}\kappa_2 &= \NormDist_{\sigma^{-2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w}}\left[ \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1} \mathbf{x}^\ast w^\ast, \left( \mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top \right)^{-1} + \mathbf{A}^{-1} \right]\\ &= \frac{ \exp\left[ \left( \sigma^{-2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w} - \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1} \mathbf{x}^\ast w^\ast \right)^\top \boldsymbol{\Sigma}' \left( \sigma^{-2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w} - \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1} \mathbf{x}^\ast w^\ast \right) \right]^{-1 / 2} }{ \left\vert 2 \pi \left( \sigma^2 \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1} + \mathbf{A}^{-1} \right) \right\vert^{1 / 2} } & \quad & \text{(b.1)}\end{split}\]

(b.1)

According to the Sherman–Morrison identity in the Matrix Cookbook (160),

\[\begin{split}\boldsymbol{\Sigma} &= \left( \sigma^{-2} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top + \mathbf{A} \right)^{-1}\\ &= \mathbf{A}^{-1} - \frac{ \sigma^{-2} \mathbf{A}^{-1} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} }{ 1 + \sigma^{-2} {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast }\\ &= \mathbf{A}^{-1} - \frac{ \mathbf{A}^{-1} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} }{ \sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast }.\end{split}\]

According to the Searle set of identities in the Matrix Cookbook (163),

\[\begin{split}\left( \mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top \right)^{-1} + \mathbf{A}^{-1} &= \left( \mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top \right)^{-1} \boldsymbol{\Sigma}^{-1} \mathbf{A}^{-1}\\ &= \mathbf{A}^{-1} \boldsymbol{\Sigma}^{-1} \left( \mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top \right)^{-1}.\end{split}\]

Furthermore, applying Matrix Cookbook (165) gives

\[\begin{split}\boldsymbol{\Sigma}' &= \mathbf{A} \boldsymbol{\Sigma} \left( \mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top \right)\\ &= \left( \mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top \right) \boldsymbol{\Sigma} \mathbf{A}\\ &= \mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top - \frac{ \sigma^{-2} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top }{ \sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast }.\end{split}\]

(c)

\[\kappa_1 \kappa_2 = \frac{ \exp\left[ \left( w^\ast - \sigma^{-2} {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{X} \mathbf{w} \right)^\top \left( {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast + \sigma^2 \right)^{-1} \left( w^\ast - \sigma^{-2} {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{X} \mathbf{w} \right) \right]^{-0.5} }{ \left\vert 2 \pi \left( {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast + \sigma^2 \right) \right\vert^{1 / 2} } \qquad \text{(c.1) and (c.2)}\]

(c.1)

\[\begin{split}& \frac{ (2 \pi)^{D / 2} \left\vert \left( \mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top \right)^{-1} \right\vert^{1 / 2} }{ \left\vert \sigma^2 \right\vert^{1 / 2} } \frac{1}{ (2 \pi)^{(D + 1) / 2} \left\vert \sigma^2 \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1} + \mathbf{A}^{-1} \right\vert^{1 / 2} }\\ &= \left[ (2 \pi)^{1 / 2} \left\vert \sigma^2 \right\vert^{1 / 2} \left\vert \mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top \right\vert^{1 / 2} \left\vert \sigma^2 \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1} + \mathbf{A}^{-1} \right\vert^{1 / 2} \right]^{-1}\\ &= \left[ (2 \pi) \left\vert \sigma^2 \right\vert \left\vert \mathbf{I}_{D + 1} + \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \sigma^{-2} \mathbf{A}^{-1} \right\vert \right]^{-1 / 2}\\ &= \left[ (2 \pi) \left\vert \sigma^2 \right\vert \left( 1 + {\mathbf{x}^\ast}^\top \sigma^{-2} \mathbf{A}^{-1} \mathbf{x}^\ast \right) \right]^{-1 / 2} & \quad & \text{Matrix determinant lemma and Matrix Cookbook (24)}\\ &= \frac{1}{ (2 \pi)^{1 / 2} \left( \sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast \right)^{1 / 2} }\end{split}\]

(c.2)

\[\begin{split}& \exp\left[ {w^\ast}^2 \left( \sigma^{-2} - \sigma^{-2} {\mathbf{x}^\ast}^\top \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1} \mathbf{x}^\ast \right) \right]^{-1 / 2} \exp\left[ \left( \sigma^{-2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w} - \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1} \mathbf{x}^\ast w^\ast \right)^\top \boldsymbol{\Sigma}' \left( \sigma^{-2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w} - \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1} \mathbf{x}^\ast w^\ast \right) \right]^{-1 / 2}\\ &= \exp\left[ \left( w^\ast - \sigma^{-2} {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{X} \mathbf{w} \right)^\top \left( {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast + \sigma^2 \right)^{-1} \left( w^\ast - \sigma^{-2} {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{X} \mathbf{w} \right) \right]^{-0.5}\end{split}\]
\[\begin{split}\text{(1)} & \quad & {w^\ast}^2 \sigma^{-2} - {w^\ast}^2 \sigma^{-2} {\mathbf{x}^\ast}^\top \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1} \mathbf{x}^\ast\\\\ \text{(2)} & \quad & \sigma^{-4} \mathbf{w}^\top \mathbf{X}^\top \mathbf{A}^{-1} \left( \mathbf{x}^\ast \sigma^{-2} {\mathbf{x}^\ast}^\top \right) \mathbf{A}^{-1} \mathbf{X} \mathbf{w} - \sigma^{-4} \mathbf{w}^\top \mathbf{X}^\top \mathbf{A}^{-1} \frac{ \sigma^{-2} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top }{ \sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast } \mathbf{A}^{-1} \mathbf{X} \mathbf{w}\\\\ \text{(3)} & \quad & -\left( \sigma^{-4} \mathbf{w}^\top \mathbf{X}^\top \mathbf{A}^{-1} \mathbf{x}^\ast w^\ast - \sigma^{-2} \mathbf{w}^\top \mathbf{X}^\top \mathbf{A}^{-1} \frac{ \sigma^{-2} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} }{ \sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast } \mathbf{x}^\ast w^\ast \right)\\\\ \text{(4)} & \quad & -\left( w^\ast {\mathbf{x}^\ast}^\top \sigma^{-4} \mathbf{A}^{-1} \mathbf{X} \mathbf{w} - w^\ast {\mathbf{x}^\ast}^\top \frac{ \sigma^{-2} \mathbf{A}^{-1} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top }{ \sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast } \sigma^{-2} \mathbf{A}^{-1} \mathbf{X} \mathbf{w} \right)\\\\ \text{(5)} & \quad & w^\ast {\mathbf{x}^\ast}^\top \sigma^{-2} \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right)^{-1} \mathbf{x}^\ast w^\ast - w^\ast {\mathbf{x}^\ast}^\top \frac{ \sigma^{-2} \mathbf{A}^{-1} }{ \sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast } \mathbf{x}^\ast w^\ast\end{split}\]
\[\begin{split}\text{(1) + (5)} &= \frac{ {w^\ast}^2 \sigma^{-2} \left( \sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast \right) - {w^\ast}^2 \sigma^{-2} {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast }{ \sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast }\\ &= \frac{ {w^\ast}^2 }{ \sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast }\\\\\\ \text{(2)} &= \frac{ \sigma^{-6} \mathbf{w}^\top \mathbf{X}^\top \mathbf{A}^{-1} \left( \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \right) \mathbf{A}^{-1} \mathbf{X} \mathbf{w} \left( \sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast \right) - \sigma^{-6} \mathbf{w}^\top \mathbf{X}^\top \mathbf{A}^{-1} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{X} \mathbf{w} }{ \sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast }\\ &= \frac{ \sigma^{-4} \mathbf{w}^\top \mathbf{X}^\top \mathbf{A}^{-1} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{X} \mathbf{w} }{ \sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast }\\\\\\ \text{(4)} &= \frac{ -\sigma^{-4} w^\ast {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{X} \mathbf{w} \left( \sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast \right) + \sigma^{-4} w^\ast {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{X} \mathbf{w} }{ \sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast }\\ &= \frac{ -\sigma^{-2} w^\ast {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{X} \mathbf{w} }{ \sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast }\\ &= \frac{ -\sigma^{-2} w^\ast \mathbf{w}^\top \mathbf{X}^\top \mathbf{A}^{-1} \mathbf{x}^\ast }{ \sigma^2 + {\mathbf{x}^\ast}^\top \mathbf{A}^{-1} \mathbf{x}^\ast }\\ &= \text{(3)}\end{split}\]
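The resulting predictive density is cheap to evaluate once \(\mathbf{A}\) is formed. A minimal sketch on synthetic data, reusing the definition of \(\mathbf{A}\) from Exercise 8.4:

import numpy

rng = numpy.random.default_rng(3)
I, D = 50, 3
sigma2, sigma2_p = 0.25, 1.0
X = numpy.vstack([numpy.ones(I), rng.normal(size=(D, I))])
w = X.T @ rng.normal(size=D + 1) + numpy.sqrt(sigma2) * rng.normal(size=I)
A = X @ X.T / sigma2 + numpy.eye(D + 1) / sigma2_p

x_star = numpy.concatenate([[1.0], rng.normal(size=D)])
pred_mean = x_star @ numpy.linalg.solve(A, X @ w) / sigma2  # x*^T A^{-1} X w / sigma^2
pred_var = sigma2 + x_star @ numpy.linalg.solve(A, x_star)  # sigma^2 + x*^T A^{-1} x*
print(pred_mean, pred_var)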

Exercise 8.6

\[\begin{split}\mathbf{A}^{-1} &= \left( \sigma^{-2} \mathbf{X} \mathbf{X}^\top + \sigma_p^{-2} \mathbf{I}_D \right)^{-1}\\ &= \left( \sigma_p^{-2} \mathbf{I}_D + \mathbf{X} \left( \sigma^{-2} \mathbf{I}_I \right) \mathbf{X}^\top \right)^{-1}\\ &= \sigma_p^2 \mathbf{I}_D - \sigma_p^2 \mathbf{I}_D \mathbf{X} \left( \mathbf{X}^\top \sigma_p^2 \mathbf{I}_D \mathbf{X} + \sigma^2 \mathbf{I}_I \right)^{-1} \mathbf{X}^\top \sigma_p^2 \mathbf{I}_D\\ &= \sigma_p^2 \mathbf{I}_D - \sigma_p^4 \mathbf{X} \sigma_p^{-2} \left( \mathbf{X}^\top \mathbf{I}_D \mathbf{X} + \frac{\sigma^2}{\sigma_p^2} \mathbf{I}_I \right)^{-1} \mathbf{X}^\top\\ &= \sigma_p^2 \mathbf{I}_D - \sigma_p^2 \mathbf{X} \left( \mathbf{X}^\top \mathbf{I}_D \mathbf{X} + \frac{\sigma^2}{\sigma_p^2} \mathbf{I}_I \right)^{-1} \mathbf{X}^\top\end{split}\]
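A numerical sanity check of this identity (a minimal sketch with arbitrary dimensions):

import numpy

rng = numpy.random.default_rng(4)
D, I = 4, 6
sigma2, sigma2_p = 0.5, 2.0
X = rng.normal(size=(D, I))

lhs = numpy.linalg.inv(X @ X.T / sigma2 + numpy.eye(D) / sigma2_p)
rhs = sigma2_p * numpy.eye(D) - sigma2_p * X @ numpy.linalg.inv(
    X.T @ X + (sigma2 / sigma2_p) * numpy.eye(I)) @ X.T
print(numpy.allclose(lhs, rhs))  # True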

Exercise 8.7

Recall that

\[\frac{\partial}{\partial \sigma} \left( \sigma_p^2 \mathbf{X}^\top \mathbf{X} + \sigma^2 \mathbf{I} \right) = 2 \sigma \mathbf{I}.\]
\[\begin{split}\DeclareMathOperator{\tr}{\mathrm{tr}} 0 &= \frac{\partial}{\partial \sigma} \log Pr(\mathbf{w} \mid \mathbf{X}, \sigma^2)\\ &= \frac{\partial}{\partial \sigma} \log \NormDist_{\mathbf{w}}\left[ \boldsymbol{0}, \sigma_p^2 \mathbf{X}^\top \mathbf{X} + \sigma^2 \mathbf{I}_I \right]\\ &= \frac{\partial}{\partial \sigma} \left( -\frac{I}{2} \log[2 \pi] - \frac{1}{2} \log\left\vert \sigma_p^2 \mathbf{X}^\top \mathbf{X} + \sigma^2 \mathbf{I} \right\vert - \frac{1}{2} \mathbf{w}^\top \left( \sigma_p^2 \mathbf{X}^\top \mathbf{X} + \sigma^2 \mathbf{I} \right)^{-1} \mathbf{w} \right)\\ &= -\frac{1}{2} \tr\left[ \left( \sigma_p^2 \mathbf{X}^\top \mathbf{X} + \sigma^2 \mathbf{I} \right)^{-1} 2 \sigma \mathbf{I} \right] - \frac{1}{2} \mathbf{w}^\top \left[ -\left( \sigma_p^2 \mathbf{X}^\top \mathbf{X} + \sigma^2 \mathbf{I} \right)^{-1} 2 \sigma \mathbf{I} \left( \sigma_p^2 \mathbf{X}^\top \mathbf{X} + \sigma^2 \mathbf{I} \right)^{-1} \right] \mathbf{w} & \quad & \text{(C.36) and (C.39)}\\ &= \mathbf{w}^\top \left( \sigma_p^2 \mathbf{X}^\top \mathbf{X} + \sigma^2 \mathbf{I} \right)^{-2} \mathbf{w} - \tr\left[ \left( \sigma_p^2 \mathbf{X}^\top \mathbf{X} + \sigma^2 \mathbf{I} \right)^{-1} \right]\end{split}\]
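The stationarity condition can be checked against a finite-difference derivative of the marginal likelihood (a minimal sketch; the analytic gradient below is \(\sigma\) times the final expression):

import numpy

rng = numpy.random.default_rng(5)
D, I = 3, 8
sigma_p2 = 1.5
X = rng.normal(size=(D, I))
w = rng.normal(size=I)

def log_lik(sigma):
    S = sigma_p2 * X.T @ X + sigma**2 * numpy.eye(I)
    _, logdet = numpy.linalg.slogdet(S)
    return -0.5 * (I * numpy.log(2 * numpy.pi) + logdet + w @ numpy.linalg.solve(S, w))

sigma = 0.7
Sinv = numpy.linalg.inv(sigma_p2 * X.T @ X + sigma**2 * numpy.eye(I))
analytic = sigma * (w @ Sinv @ Sinv @ w - numpy.trace(Sinv))
numeric = (log_lik(sigma + 1e-5) - log_lik(sigma - 1e-5)) / 2e-5
print(numpy.allclose(analytic, numeric))  # True up to finite-difference error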

Exercise 8.8

\[q(\phi) = \max_h \NormDist_{\phi}\left[ 0, h^{-1} \right] \GamDist_h[\nu / 2, \nu / 2]\]

is an unconstrained optimization problem. Writing \(L\) for the logarithm of the objective,

\[\begin{split}\frac{\partial L}{\partial h} &= \frac{\partial}{\partial h} \log\left[ \frac{1}{\sqrt{2 \pi h^{-1}}} \exp\left( -0.5 \phi^2 h\right) \frac{\left( \frac{\nu}{2} \right)^{\nu / 2}}{\Gamma[\nu / 2]} \exp\left( -\frac{\nu}{2} h \right) h^{\nu / 2 - 1} \right]\\ 0 &= \frac{\partial}{\partial h} \left[ \frac{1}{2} \log(h) - \frac{1}{2} \log(2 \pi) - \frac{1}{2} \phi^2 h + \log\left( \frac{\left( \frac{\nu}{2} \right)^{\nu / 2}}{\Gamma[\nu / 2]} \right) - \frac{1}{2} \nu h + \left( \frac{\nu}{2} - 1 \right) \log(h) \right]\\ &= \frac{1}{2h} - \frac{1}{2} \phi^2 - \frac{1}{2} \nu + \left( \frac{\nu}{2} - 1 \right) h^{-1}\\ \frac{\nu - 1}{2} h^{-1} &= \frac{\phi^2 + \nu}{2}\\ h &= \frac{\nu - 1}{\phi^2 + \nu}.\end{split}\]
[1]:
import matplotlib.pyplot as plt
import numpy
import scipy.special

def q(phi):
    # Evaluate the objective at the maximizing h derived above.
    h = (nu - 1) / (phi**2 + nu)
    d_normal = 1 / numpy.sqrt(2 * numpy.pi / h) * numpy.exp(-.5 * phi**2 * h)
    coef = numpy.power(nu / 2, nu / 2) / scipy.special.gamma(nu / 2)
    d_gamma = coef * numpy.exp(-.5 * nu * h) * numpy.power(h, nu / 2 - 1)
    return d_normal * d_gamma

nu = 2.0
phi = numpy.linspace(-10, 10, 100)
plt.plot(phi, q(phi))
plt.title(r'$q(\phi)$ where $\nu = 2$')
plt.xlabel(r'$\phi$')
plt.show()

grid = numpy.linspace(-10, 10, 100)
phi_1, phi_2 = numpy.meshgrid(grid, grid)
plt.imshow(q(phi_1) * q(phi_2), extent=[-10, 10, -10, 10], origin='lower')
plt.title(r'$q(\phi_1) q(\phi_2)$ where $\nu = 2$')
plt.xlabel(r'$\phi_1$')
plt.ylabel(r'$\phi_2$')
plt.show()

Exercise 8.9

Transform each input \(x\) into \(z = \begin{bmatrix} 1 & x & x^2 & x^3 \end{bmatrix}^\top\) as in (8.19).

This enables reuse of the linear model since (8.17) is linear in \(z\).

The maximum likelihood learning algorithm (8.6) can then be reused by replacing \(\mathbf{X}\) with \(\mathbf{Z}\), where the columns of \(\mathbf{Z}\) contain the transformed vectors \(\left\{ z_i \right\}_{i = 1}^I\).

The Bayesian linear regression inference algorithm (8.14) can likewise be reused by replacing \(x\) and \(\mathbf{X}\) with \(z\) and \(\mathbf{Z}\) respectively; the sketch below demonstrates the maximum likelihood case.
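A minimal sketch on synthetic cubic data, applying (8.6) with \(\mathbf{Z}\) in place of \(\mathbf{X}\):

import numpy

rng = numpy.random.default_rng(6)
I = 40
x = rng.uniform(-2, 2, I)
w = 1 - x + 0.5 * x**3 + 0.1 * rng.normal(size=I)

Z = numpy.vstack([numpy.ones(I), x, x**2, x**3])  # columns are the z_i of (8.19)
phi = numpy.linalg.solve(Z @ Z.T, Z @ w)          # (8.6) with X replaced by Z
print(phi)                                        # close to [1, -1, 0, 0.5]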

Exercise 8.10

Dual linear regression should only be used when \(I < D\): the dual solution requires inverting the \(I \times I\) matrix \(\mathbf{X}^\top \mathbf{X}\) rather than the \((D + 1) \times (D + 1)\) matrix \(\mathbf{X} \mathbf{X}^\top\), so when \(I \geq D\) the computation becomes unnecessarily expensive.

Exercise 8.11

\[\begin{split}\frac{\partial L}{\partial \boldsymbol{\psi}} &= -\frac{1}{2 \sigma^2} \frac{\partial}{\partial \boldsymbol{\psi}} \left( \mathbf{w}^\top \mathbf{w} - \mathbf{w}^\top \mathbf{X}^\top \mathbf{X} \boldsymbol{\psi} - \boldsymbol{\psi}^\top \mathbf{X}^\top \mathbf{X} \mathbf{w} + \boldsymbol{\psi}^\top \mathbf{X}^\top \mathbf{X} \mathbf{X}^\top \mathbf{X} \boldsymbol{\psi} \right)\\ 0 &= -\frac{1}{2 \sigma^2} \left( -2 \mathbf{X}^\top \mathbf{X} \mathbf{w} + 2 \mathbf{X}^\top \mathbf{X} \mathbf{X}^\top \mathbf{X} \boldsymbol{\psi} \right) & \quad & \text{(C.27), (C.28), (C.33)}\\ \mathbf{X}^\top \mathbf{X} \mathbf{X}^\top \mathbf{X} \boldsymbol{\psi} &= \mathbf{X}^\top \mathbf{X} \mathbf{w}\\ \boldsymbol{\psi} &= \left( \mathbf{X}^\top \mathbf{X} \right)^{-1} \mathbf{w}\end{split}\]
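A minimal sketch in the regime where the dual form pays off (\(I < D\)): the \(I\) dual parameters determine \(\boldsymbol{\phi} = \mathbf{X} \boldsymbol{\psi}\), and the resulting fit interpolates the training targets.

import numpy

rng = numpy.random.default_rng(7)
D, I = 10, 5                           # more dimensions than examples, I < D
X = rng.normal(size=(D + 1, I))
w = rng.normal(size=I)

psi = numpy.linalg.solve(X.T @ X, w)   # I dual parameters
phi = X @ psi                          # implied primal parameters phi = X psi
print(numpy.allclose(X.T @ phi, w))    # True: the fit interpolates the data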