The Normal Distribution

[SchonL11] provides a very nice exposition of this topic, though its step (29) is not obvious. [Wan] offers another treatment that is arguably clearer and more concise.

Exercise 5.1

The following facts are useful in this proof:

\[\begin{split}\newcommand{\E}[1]{\operatorname{E}\left[#1\right]} \newcommand{\Cov}[1]{\operatorname{cov}\left(#1\right)} \begin{gather*} \boldsymbol{\mu} = \E{\mathbf{x}}\\\\ \E{\mathbf{A} \mathbf{x} + \mathbf{b}} = \mathbf{A} \E{\mathbf{x}} + \mathbf{b}\\\\ \boldsymbol{\Sigma} = \Cov{\mathbf{x}} = \E{ \left( \mathbf{x} - \E{\mathbf{x}} \right) \left( \mathbf{x} - \E{\mathbf{x}} \right)^\top } = \E{ \mathbf{x} \mathbf{x}^\top - \mathbf{x} \E{\mathbf{x}}^\top - \E{\mathbf{x}} \mathbf{x}^\top + \E{\mathbf{x}} \E{\mathbf{x}}^\top }\\\\ \Cov{\mathbf{A} \mathbf{x} + \mathbf{b}} = \mathbf{A} \Cov{\mathbf{x}} \mathbf{A}^\top \end{gather*}\end{split}\]

Let \(\mathbf{y} = \mathbf{A} \mathbf{x} + \mathbf{b}\) where \(\mathbf{A}\) is nonsingular so that \(\mathbf{x} = \mathbf{A}^{-1} (\mathbf{y} - \mathbf{b})\). The mean and covariance are derived as

\[\begin{split}\boldsymbol{\mu} &= \E{\mathbf{x}}\\ &= \E{\mathbf{A}^{-1} (\mathbf{y} - \mathbf{b})}\\ &= \mathbf{A}^{-1} \E{\mathbf{y}} - \mathbf{A}^{-1} \mathbf{b}\\ \mathbf{A} \boldsymbol{\mu} + \mathbf{b} &= \E{\mathbf{y}}\\ &= \tilde{\boldsymbol{\mu}}\end{split}\]

and

\[\begin{split}\boldsymbol{\Sigma} &= \Cov{\mathbf{x}}\\ &= \Cov{\mathbf{A}^{-1} (\mathbf{y} - \mathbf{b})}\\ &= \mathbf{A}^{-1} \Cov{\mathbf{y} - \mathbf{b}} \mathbf{A}^{-\top}\\ &= \mathbf{A}^{-1} \Cov{\mathbf{y}} \mathbf{A}^{-\top}\\ \mathbf{A} \boldsymbol{\Sigma} \mathbf{A}^\top &= \Cov{\mathbf{y}}\\ &= \tilde{\boldsymbol{\Sigma}}.\end{split}\]

Thus

\[\DeclareMathOperator{\NormDist}{Norm} Pr(\mathbf{y}) = \NormDist_{\mathbf{y}}\left[ \mathbf{A} \boldsymbol{\mu} + \mathbf{b}, \mathbf{A} \boldsymbol{\Sigma} \mathbf{A}^\top \right].\]
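This result can be spot-checked numerically. The sketch below (the matrices, vectors, and seed are arbitrary choices; numpy and scipy are assumed available) uses the change-of-variables formula: the pdf \(\NormDist_{\mathbf{y}}[\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^\top]\) evaluated at \(\mathbf{y}\) must equal the pdf of \(\mathbf{x}\) at \(\mathbf{A}^{-1}(\mathbf{y} - \mathbf{b})\) divided by \(\lvert \det \mathbf{A} \rvert\).

```python
# Check Exercise 5.1: if y = A x + b with A nonsingular, then
# Pr(y) = Norm_y[A mu + b, A Sigma A^T].  By change of variables this pdf,
# evaluated at y, equals Pr(x) / |det A| at x = A^{-1}(y - b).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
L = rng.standard_normal((2, 2))
Sigma = L @ L.T + 2.0 * np.eye(2)        # arbitrary symmetric positive definite
A = np.array([[2.0, 1.0], [0.5, 3.0]])   # arbitrary nonsingular matrix
b = np.array([0.3, -0.7])

y = rng.standard_normal(2)               # arbitrary test point
x = np.linalg.solve(A, y - b)

lhs = multivariate_normal(A @ mu + b, A @ Sigma @ A.T).pdf(y)
rhs = multivariate_normal(mu, Sigma).pdf(x) / abs(np.linalg.det(A))
print(np.isclose(lhs, rhs))  # True
```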

Exercise 5.2

See Exercise 5.1 for the derivations of the following terms.

A solution to

\[\begin{split}\begin{aligned} \mathbf{I} &= \tilde{\boldsymbol{\Sigma}}\\ &= \mathbf{A} \boldsymbol{\Sigma} \mathbf{A}^{\top}\\ \mathbf{A}^{-1} \mathbf{A}^{-\top} &= \boldsymbol{\Sigma} \end{aligned} \quad \text{and} \quad \begin{aligned} \boldsymbol{0} &= \tilde{\boldsymbol{\mu}}\\ &= \mathbf{A} \boldsymbol{\mu} + \mathbf{b}\\ \mathbf{b} &= -\mathbf{A} \boldsymbol{\mu} \end{aligned}\end{split}\]

is to set \(\mathbf{A} = \boldsymbol{\Sigma}^{-1 / 2}\) resulting in \(\mathbf{b} = -\boldsymbol{\Sigma}^{-1 / 2} \boldsymbol{\mu}\).
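The whitening transform can be verified numerically; a minimal sketch (the mean, covariance, and seed are arbitrary choices), using the symmetric inverse square root obtained from an eigendecomposition:

```python
# Check Exercise 5.2: A = Sigma^{-1/2} and b = -Sigma^{-1/2} mu map x to a
# standard normal (zero mean, identity covariance).
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([2.0, -1.0, 0.5])
L = rng.standard_normal((3, 3))
Sigma = L @ L.T + 3.0 * np.eye(3)        # arbitrary symmetric positive definite

# Symmetric inverse square root via the eigendecomposition of Sigma.
w, V = np.linalg.eigh(Sigma)
A = V @ np.diag(w ** -0.5) @ V.T         # Sigma^{-1/2}
b = -A @ mu

mu_new = A @ mu + b                      # transformed mean (Exercise 5.1)
Sigma_new = A @ Sigma @ A.T              # transformed covariance
print(np.allclose(mu_new, 0.0), np.allclose(Sigma_new, np.eye(3)))  # True True
```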

Exercise 5.3

Recall that

\[\begin{split}Pr(\mathbf{x} = \begin{bmatrix} \mathbf{x}_1\\ \mathbf{x}_2 \end{bmatrix}) &= Pr(\mathbf{x}_1, \mathbf{x}_2)\\ &= \NormDist_{\mathbf{x}}\left[ \boldsymbol{\mu} = \begin{bmatrix} \boldsymbol{\mu}_1\\ \boldsymbol{\mu}_2 \end{bmatrix}, \boldsymbol{\Sigma} = \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{21}^\top\\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{bmatrix} \right]\\ &= \frac{1}{ (2 \pi)^{D / 2} \left\vert \boldsymbol{\Sigma} \right\vert^{1 / 2} } \exp\left[ -0.5 (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right]\end{split}\]

where \(\boldsymbol{\Sigma}_{11} \in \mathbb{R}^{p \times p}\), \(\boldsymbol{\Sigma}_{21} \in \mathbb{R}^{q \times p}\), \(\boldsymbol{\Sigma}_{22} \in \mathbb{R}^{q \times q}\), and \(p + q = D\).

The Schur complement \(\mathbf{S}\) of \(\boldsymbol{\Sigma}_{11}\) in \(\boldsymbol{\Sigma}\) is defined as

\[\mathbf{S} = \boldsymbol{\Sigma}_{22} - \boldsymbol{\Sigma}_{21} \boldsymbol{\Sigma}_{11}^{-1} \boldsymbol{\Sigma}_{21}^\top.\]

It is symmetric positive definite because \(\boldsymbol{\Sigma}\) is positive definite according to (5.7). This quantity is useful for deriving a closed-form expression for the inverse of the full covariance matrix:

\[\begin{split}\boldsymbol{\Sigma}^{-1} &= \left( \begin{bmatrix} \mathbf{I}_p & \boldsymbol{0}\\ \boldsymbol{\Sigma}_{21}^\top \boldsymbol{\Sigma}_{11}^{-1} & \mathbf{I}_q \end{bmatrix} \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{0}\\ \boldsymbol{0} & \mathbf{S} \end{bmatrix} \begin{bmatrix} \mathbf{I}_p & \boldsymbol{\Sigma}_{11}^{-1} \boldsymbol{\Sigma}_{21}^\top\\ \boldsymbol{0} & \mathbf{I}_q \end{bmatrix} \right)^{-1}\\ &= \begin{bmatrix} \mathbf{I}_p & -\boldsymbol{\Sigma}_{11}^{-1} \boldsymbol{\Sigma}_{21}^\top\\ \boldsymbol{0} & \mathbf{I}_q \end{bmatrix} \begin{bmatrix} \boldsymbol{\Sigma}_{11}^{-1} & \boldsymbol{0}\\ \boldsymbol{0} & \mathbf{S}^{-1} \end{bmatrix} \begin{bmatrix} \mathbf{I}_p & \boldsymbol{0}\\ -\boldsymbol{\Sigma}_{21}^\top \boldsymbol{\Sigma}_{11}^{-1} & \mathbf{I}_q \end{bmatrix}\\ &= \begin{bmatrix} \boldsymbol{\Sigma}_{11}^{-1} + \boldsymbol{\Sigma}_{11}^{-1} \boldsymbol{\Sigma}_{21}^\top \mathbf{S}^{-1} \boldsymbol{\Sigma}_{21} \boldsymbol{\Sigma}_{11}^{-1} & -\boldsymbol{\Sigma}_{11}^{-1} \boldsymbol{\Sigma}_{21}^\top \mathbf{S}^{-1}\\ -\mathbf{S}^{-1} \boldsymbol{\Sigma}_{21} \boldsymbol{\Sigma}_{11}^{-1} & \mathbf{S}^{-1} \end{bmatrix}.\end{split}\]

The foregoing expression simplifies the determinant of \(\boldsymbol{\Sigma}\) to

\[\begin{split}\left\vert \boldsymbol{\Sigma} \right\vert &= \left\vert \begin{bmatrix} \mathbf{I}_p & \boldsymbol{0}\\ \boldsymbol{\Sigma}_{21}^\top \boldsymbol{\Sigma}_{11}^{-1} & \mathbf{I}_q\\ \end{bmatrix} \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{0}\\ \boldsymbol{0} & \mathbf{S} \end{bmatrix} \begin{bmatrix} \mathbf{I}_p & \boldsymbol{\Sigma}_{11}^{-1} \boldsymbol{\Sigma}_{21}^\top\\ \boldsymbol{0} & \mathbf{I}_q \end{bmatrix} \right\vert\\ &= \left\vert \begin{bmatrix} \mathbf{I}_p & \boldsymbol{0}\\ \boldsymbol{\Sigma}_{21}^\top \boldsymbol{\Sigma}_{11}^{-1} & \mathbf{I}_q\\ \end{bmatrix} \right\vert \left\vert \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{0}\\ \boldsymbol{0} & \mathbf{S} \end{bmatrix} \right\vert \left\vert \begin{bmatrix} \mathbf{I}_p & \boldsymbol{\Sigma}_{11}^{-1} \boldsymbol{\Sigma}_{21}^\top\\ \boldsymbol{0} & \mathbf{I}_q \end{bmatrix} \right\vert & \quad & \det(AB) = \det(A) \det(B)\\ &= \left\vert \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{0}\\ \boldsymbol{0} & \mathbf{S} \end{bmatrix} \right\vert & \quad & \det\left( \mathbf{T}_n \right) = \prod_{k = 1}^n a_{kk}\\ &= \left\vert \boldsymbol{\Sigma}_{11} \right\vert \left\vert \mathbf{S} \right\vert & \quad & \text{block matrix determinant property.}\end{split}\]
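Both the block inverse and the determinant factorization \(\lvert \boldsymbol{\Sigma} \rvert = \lvert \boldsymbol{\Sigma}_{11} \rvert \lvert \mathbf{S} \rvert\) can be spot-checked numerically on a random positive definite matrix (a sketch; the sizes and seed are arbitrary choices):

```python
# Check the Schur-complement block inverse and |Sigma| = |Sigma_11| |S|.
import numpy as np

rng = np.random.default_rng(2)
p, q = 2, 3
L = rng.standard_normal((p + q, p + q))
Sigma = L @ L.T + (p + q) * np.eye(p + q)   # random positive definite
S11, S21, S22 = Sigma[:p, :p], Sigma[p:, :p], Sigma[p:, p:]

S = S22 - S21 @ np.linalg.inv(S11) @ S21.T  # Schur complement of Sigma_11
S11inv, Sinv = np.linalg.inv(S11), np.linalg.inv(S)
block_inv = np.block([
    [S11inv + S11inv @ S21.T @ Sinv @ S21 @ S11inv, -S11inv @ S21.T @ Sinv],
    [-Sinv @ S21 @ S11inv, Sinv],
])
print(np.allclose(block_inv, np.linalg.inv(Sigma)))  # True
print(np.isclose(np.linalg.det(Sigma),
                 np.linalg.det(S11) * np.linalg.det(S)))  # True
```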
\[\begin{split}& Pr(\mathbf{x}_1)\\ &= \int Pr(\mathbf{x}_1, \mathbf{x}_2) d\mathbf{x}_2\\ &= \int \frac{1}{ (2 \pi)^{D / 2} \left\vert \boldsymbol{\Sigma} \right\vert^{1 / 2} } \exp\left[ -0.5 \begin{bmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1\\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{bmatrix}^\top \begin{bmatrix} \boldsymbol{\Lambda}_{11} & \boldsymbol{\Lambda}_{21}^\top\\ \boldsymbol{\Lambda}_{21} & \boldsymbol{\Lambda}_{22} \end{bmatrix} \begin{bmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1\\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{bmatrix} \right] d\mathbf{x}_2 & \quad & \Lambda = \Sigma^{-1} = \begin{bmatrix} \boldsymbol{\Lambda}_{11} & \boldsymbol{\Lambda}_{21}^\top\\ \boldsymbol{\Lambda}_{21} & \boldsymbol{\Lambda}_{22} \end{bmatrix}\\ &= \int \frac{1}{ (2 \pi)^{(p + q) / 2} \left\vert \boldsymbol{\Sigma}_{11} \right\vert^{1 / 2} \left\vert \mathbf{S} \right\vert^{1 / 2} } \exp\left[ -0.5 \left( (\mathbf{x}_1 - \boldsymbol{\mu}_1)^\top \boldsymbol{\Lambda}_{11} (\mathbf{x}_1 - \boldsymbol{\mu}_1) + 2 (\mathbf{x}_1 - \boldsymbol{\mu}_1)^\top \boldsymbol{\Lambda}_{21}^\top (\mathbf{x}_2 - \boldsymbol{\mu}_2) + (\mathbf{x}_2 - \boldsymbol{\mu}_2)^\top \boldsymbol{\Lambda}_{22} (\mathbf{x}_2 - \boldsymbol{\mu}_2) \right) \right] d\mathbf{x}_2\\ &= \int \NormDist_{\mathbf{x}_1}\left[ \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11} \right] \frac{1}{ (2 \pi)^{q / 2} \left\vert \mathbf{S} \right\vert^{1 / 2} } \exp\left[ -0.5 \left[ (\mathbf{x}_2 - \boldsymbol{\mu}_2) - \boldsymbol{\Sigma}_{21} \boldsymbol{\Sigma}_{11}^{-1} (\mathbf{x}_1 - \boldsymbol{\mu}_1) \right]^\top \mathbf{S}^{-1} \left[ (\mathbf{x}_2 - \boldsymbol{\mu}_2) - \boldsymbol{\Sigma}_{21} \boldsymbol{\Sigma}_{11}^{-1} (\mathbf{x}_1 - \boldsymbol{\mu}_1) \right] \right] d\mathbf{x}_2\\ &= \NormDist_{\mathbf{x}_1}\left[ \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11} \right] \int \NormDist_{\mathbf{x}_2}\left[ \boldsymbol{\mu}_2 + \boldsymbol{\Sigma}_{21} \boldsymbol{\Sigma}_{11}^{-1} (\mathbf{x}_1 - 
\boldsymbol{\mu}_1), \mathbf{S} \right] d\mathbf{x}_2\\ &= \NormDist_{\mathbf{x}_1}\left[ \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11} \right]\end{split}\]
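The marginalization result can be verified numerically in the smallest case \(p = q = 1\) by integrating the joint density over \(x_2\) (a sketch; the parameters and test point are arbitrary choices):

```python
# Check Exercise 5.3 for p = q = 1: integrating the joint density over x_2
# recovers Norm_{x_1}[mu_1, Sigma_11].
import numpy as np
from scipy.integrate import quad
from scipy.stats import multivariate_normal, norm

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.5]])   # arbitrary positive definite
joint = multivariate_normal(mu, Sigma)

x1 = 0.4                                     # arbitrary test point
integral, _ = quad(lambda x2: joint.pdf([x1, x2]), -np.inf, np.inf)
marginal = norm(mu[0], np.sqrt(Sigma[0, 0])).pdf(x1)
print(np.isclose(integral, marginal))  # True
```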

Exercise 5.4

This is true if and only if it satisfies the definition of matrix inverse:

\[M M^{-1} = M^{-1} M = I.\]

A simple way to show this is to decompose the block matrix \(M\) using the Schur complement of \(D\) in \(M\). The upper triangular, block-diagonal, and lower triangular factors cancel out.
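This argument can be sketched numerically (the blocks and seed below are arbitrary choices; the decomposition shown is the standard UDL factorization via the Schur complement of \(D\)):

```python
# Decompose M = U * diag(S, D) * L with S = A - B D^{-1} C the Schur
# complement of D in M, then invert each factor to obtain M^{-1}.
import numpy as np

rng = np.random.default_rng(3)
p, q = 2, 2
A = rng.standard_normal((p, p)) + 3 * np.eye(p)
B = rng.standard_normal((p, q))
C = rng.standard_normal((q, p))
D = rng.standard_normal((q, q)) + 3 * np.eye(q)
M = np.block([[A, B], [C, D]])

Dinv = np.linalg.inv(D)
S = A - B @ Dinv @ C                        # Schur complement of D in M
U = np.block([[np.eye(p), B @ Dinv], [np.zeros((q, p)), np.eye(q)]])
Mid = np.block([[S, np.zeros((p, q))], [np.zeros((q, p)), D]])
Low = np.block([[np.eye(p), np.zeros((p, q))], [Dinv @ C, np.eye(q)]])
print(np.allclose(U @ Mid @ Low, M))        # the factorization reproduces M

# M^{-1} is the factor inverses multiplied in reverse order; in the product
# M M^{-1} the triangular factors cancel, leaving the identity.
Minv = np.linalg.inv(Low) @ np.linalg.inv(Mid) @ np.linalg.inv(U)
print(np.allclose(M @ Minv, np.eye(p + q)))
```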

Exercise 5.5

Another expression for \(\boldsymbol{\Sigma}^{-1}\) in Exercise 5.3 is

\[\begin{split}\boldsymbol{\Sigma}^{-1} &= \left( \begin{bmatrix} \mathbf{I}_p & \boldsymbol{\Sigma}_{21}^\top \boldsymbol{\Sigma}_{22}^{-1}\\ \boldsymbol{0} & \mathbf{I}_q\\ \end{bmatrix} \begin{bmatrix} \mathbf{S} & \boldsymbol{0}\\ \boldsymbol{0} & \boldsymbol{\Sigma}_{22} \end{bmatrix} \begin{bmatrix} \mathbf{I}_p & \boldsymbol{0}\\ \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21} & \mathbf{I}_q \end{bmatrix} \right)^{-1}\\ &= \begin{bmatrix} \mathbf{I}_p & \boldsymbol{0}\\ -\boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21} & \mathbf{I}_q \end{bmatrix} \begin{bmatrix} \mathbf{S}^{-1} & \boldsymbol{0}\\ \boldsymbol{0} & \boldsymbol{\Sigma}_{22}^{-1} \end{bmatrix} \begin{bmatrix} \mathbf{I}_p & -\boldsymbol{\Sigma}_{21}^\top \boldsymbol{\Sigma}_{22}^{-1}\\ \boldsymbol{0} & \mathbf{I}_q\\ \end{bmatrix}\\ &= \begin{bmatrix} \mathbf{S}^{-1} & -\mathbf{S}^{-1} \boldsymbol{\Sigma}_{21}^\top \boldsymbol{\Sigma}_{22}^{-1}\\ -\boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21} \mathbf{S}^{-1} & \boldsymbol{\Sigma}_{22}^{-1} + \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21} \mathbf{S}^{-1} \boldsymbol{\Sigma}_{21}^\top \boldsymbol{\Sigma}_{22}^{-1} \end{bmatrix}\end{split}\]

where

\[\mathbf{S} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{21}^\top \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}\]

is the Schur complement of \(\boldsymbol{\Sigma}_{22}\) in \(\boldsymbol{\Sigma}\). The determinant of \(\boldsymbol{\Sigma}\) is simplified to

\[\left\vert \boldsymbol{\Sigma} \right\vert = \left\vert \mathbf{S} \right\vert \left\vert \boldsymbol{\Sigma}_{22} \right\vert.\]

Going through the same motions gives

\[\begin{split}& Pr(\mathbf{x}_1, \mathbf{x}_2)\\ &= \frac{1}{ (2 \pi)^{(p + q) / 2} \left\vert \mathbf{S} \right\vert^{1 / 2} \left\vert \boldsymbol{\Sigma}_{22} \right\vert^{1 / 2} } \exp\left[ \left( \left( \mathbf{x}_1 - \boldsymbol{\mu}_1 \right)^\top \boldsymbol{\Lambda}_{11} \left( \mathbf{x}_1 - \boldsymbol{\mu}_1 \right) + 2 \left( \mathbf{x}_1 - \boldsymbol{\mu}_1 \right)^\top \boldsymbol{\Lambda}_{21}^\top \left( \mathbf{x}_2 - \boldsymbol{\mu}_2 \right) + \left( \mathbf{x}_2 - \boldsymbol{\mu}_2 \right)^\top \boldsymbol{\Lambda}_{22} \left( \mathbf{x}_2 - \boldsymbol{\mu}_2 \right) \right) \right]^{-0.5}\\ &= \NormDist_{\mathbf{x}_2}\left[ \boldsymbol{\mu}_2, \boldsymbol{\Sigma}_{22} \right] \frac{1}{(2 \pi)^{p / 2} \left\vert \mathbf{S} \right\vert^{1 / 2}} \exp\left[ \left( \left( \mathbf{x}_1 - \boldsymbol{\mu}_1 \right) - \boldsymbol{\Sigma}_{21}^\top \boldsymbol{\Sigma}_{22}^{-1} \left( \mathbf{x}_2 - \boldsymbol{\mu}_2 \right) \right)^\top \mathbf{S}^{-1} \left( \left( \mathbf{x}_1 - \boldsymbol{\mu}_1 \right) - \boldsymbol{\Sigma}_{21}^\top \boldsymbol{\Sigma}_{22}^{-1} \left( \mathbf{x}_2 - \boldsymbol{\mu}_2 \right) \right) \right]^{-0.5}\\ &= \NormDist_{\mathbf{x}_2}\left[ \boldsymbol{\mu}_2, \boldsymbol{\Sigma}_{22} \right] \NormDist_{\mathbf{x}_1}\left[ \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{21}^\top \boldsymbol{\Sigma}_{22}^{-1} (\mathbf{x}_2 - \boldsymbol{\mu}_2), \mathbf{S} \right].\end{split}\]

Rearranging the equations using conditional probability (2.4) results in

\[Pr(\mathbf{x}_1 \mid \mathbf{x}_2) = \frac{Pr(\mathbf{x}_1, \mathbf{x}_2)}{Pr(\mathbf{x}_2)} = \NormDist_{\mathbf{x}_1}\left[ \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{21}^\top \boldsymbol{\Sigma}_{22}^{-1} (\mathbf{x}_2 - \boldsymbol{\mu}_2), \mathbf{S} \right].\]
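The conditional result can be checked numerically in the case \(p = q = 1\) (a sketch; the parameters and test point are arbitrary choices), by comparing joint/marginal against the closed form:

```python
# Check Exercise 5.5 for p = q = 1: Pr(x1 | x2) = joint / marginal matches
# Norm[mu_1 + S21/S22 (x2 - mu_2), S11 - S21^2/S22].
import numpy as np
from scipy.stats import multivariate_normal, norm

mu = np.array([0.5, -1.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.2]])   # arbitrary positive definite
x1, x2 = 0.2, 1.5                            # arbitrary test point

cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2 - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]   # Schur complement

lhs = (multivariate_normal(mu, Sigma).pdf([x1, x2])
       / norm(mu[1], np.sqrt(Sigma[1, 1])).pdf(x2))
rhs = norm(cond_mean, np.sqrt(cond_var)).pdf(x1)
print(np.isclose(lhs, rhs))  # True
```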

Exercise 5.6

When the covariance is diagonal (i.e., the individual variables are independent), the off-diagonal blocks (e.g. \(\boldsymbol{\Sigma}_{21}\)) in Exercise 5.5 are zero. Thus

\[\begin{split}Pr(\mathbf{x}_1 \mid \mathbf{x}_2) &= \NormDist_{\mathbf{x}_1}\left[ \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{21}^\top \boldsymbol{\Sigma}_{22}^{-1} (\mathbf{x}_2 - \boldsymbol{\mu}_2), \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{21}^\top \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21} \right]\\ &= \NormDist_{\mathbf{x}_1}\left[ \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11} \right]\\ &= Pr(\mathbf{x}_1).\end{split}\]

Exercise 5.7

Let \(x, a, b \in \mathbb{R}^D\) and \(A, B \in \mathbb{R}^{D \times D}\).

\[\begin{split}& \NormDist_{x}[a, A] \NormDist_{x}[b, B]\\ &= \frac{1}{\left\vert 2 \pi A \right\vert^{1 / 2}} \exp\left[ (x - a)^\top A^{-1} (x - a) \right]^{-0.5} \frac{1}{\left\vert 2 \pi B \right\vert^{1 / 2}} \exp\left[ (x - b)^\top B^{-1} (x - b) \right]^{-0.5}\\ &= \frac{1}{(2 \pi)^{D} \left\vert AB \right\vert^{1 / 2}} \exp\left[ x^\top A^{-1} x - 2 x^\top A^{-1} a + a^\top A^{-1} a + x^\top B^{-1} x - 2 x^\top B^{-1} b + b^\top B^{-1} b \right]^{-0.5}\\ &= \frac{1}{(2 \pi)^{D} \left\vert AB \right\vert^{1 / 2}} \exp\left[ x^\top (A^{-1} + B^{-1}) x - 2 x^\top (A^{-1} a + B^{-1} b) + a^\top A^{-1} a + b^\top B^{-1} b \right]^{-0.5} & \quad & \text{rearrange terms to expose pattern}\\ &= \frac{1}{(2 \pi)^{D} \left\vert AB \right\vert^{1 / 2}} \exp\left[ (x - \boldsymbol{\mu})^\top (A^{-1} + B^{-1}) (x - \boldsymbol{\mu}) - \boldsymbol{\mu}^\top (A^{-1} + B^{-1}) \boldsymbol{\mu} + a^\top A^{-1} a + b^\top B^{-1} b \right]^{-0.5} & \quad & \text{completing the square}\\ &= \frac{ \left\vert \boldsymbol{\Sigma} \right\vert^{1 / 2} }{ (2 \pi)^{D / 2} \left\vert AB \right\vert^{1 / 2} } \exp\left[ a^\top A^{-1} a + b^\top B^{-1} b - \boldsymbol{\mu}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} \right]^{-0.5} \NormDist_{x}[\boldsymbol{\mu}, \boldsymbol{\Sigma}]\\ &\propto \NormDist_{x}[\boldsymbol{\mu}, \boldsymbol{\Sigma}]\end{split}\]

where \(\boldsymbol{\mu} = \boldsymbol{\Sigma} (A^{-1} a + B^{-1} b)\) and \(\boldsymbol{\Sigma} = (A^{-1} + B^{-1})^{-1}\).
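The proportionality claim can be checked numerically: with \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\) as above, the ratio of the product of the two pdfs to \(\NormDist_x[\boldsymbol{\mu}, \boldsymbol{\Sigma}]\) should be the same constant at any test point (a sketch; the dimensions, matrices, and seed are arbitrary choices):

```python
# Check Exercise 5.7: Norm_x[a, A] Norm_x[b, B] is proportional to
# Norm_x[mu, Sigma]; the ratio is a constant, independent of x.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)
D = 3
a, b = rng.standard_normal(D), rng.standard_normal(D)
La, Lb = rng.standard_normal((D, D)), rng.standard_normal((D, D))
A, B = La @ La.T + np.eye(D), Lb @ Lb.T + np.eye(D)

Ainv, Binv = np.linalg.inv(A), np.linalg.inv(B)
Sigma = np.linalg.inv(Ainv + Binv)
mu = Sigma @ (Ainv @ a + Binv @ b)

def ratio(x):
    prod = multivariate_normal(a, A).pdf(x) * multivariate_normal(b, B).pdf(x)
    return prod / multivariate_normal(mu, Sigma).pdf(x)

x1, x2 = rng.standard_normal(D), rng.standard_normal(D)
print(np.isclose(ratio(x1), ratio(x2)))  # True: a constant kappa
```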

Exercise 5.8

The results of Exercise 5.7 illustrate that the new mean and variance are respectively

\[\mu = \frac{ \sigma_1^{-2} \mu_1 + \sigma_2^{-2} \mu_2 }{ \sigma_1^{-2} + \sigma_2^{-2} } = a \mu_1 + b \mu_2 \quad \text{and} \quad \sigma^2 = \frac{1}{\sigma_1^{-2} + \sigma_2^{-2}}\]

where \(a, b > 0\) and \(a + b = 1\).

Assuming \(\sigma_1^2, \sigma_2^2 > 0\), the following argument shows that the new variance is smaller than \(\sigma_1^2\); swapping the roles of the two variances gives the same bound against \(\sigma_2^2\):

\[\begin{split}\sigma_1^{-2} + \sigma_2^{-2} &> \sigma_1^{-2}\\ \sigma_1^2 &> (\sigma_1^{-2} + \sigma_2^{-2})^{-1}\\ &= \sigma^2.\end{split}\]

The variance proof works backwards: assume what you want to show (\(\sigma^2 < \sigma_1^2\)), manipulate it into an obviously true proposition (\(\sigma_1^{-2} + \sigma_2^{-2} > \sigma_1^{-2}\)) under the stated assumptions, and then present the steps in reverse order.
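A one-dimensional illustration (the means and variances are arbitrary choices): fusing two Gaussians yields a precision-weighted mean lying between the two, and a variance smaller than either.

```python
# Fuse two 1-D Gaussians per Exercise 5.8: precision-weighted mean,
# harmonic-style combination of variances.
mu1, var1 = 2.0, 1.5
mu2, var2 = -1.0, 0.4

prec = 1 / var1 + 1 / var2                 # combined precision
var_new = 1 / prec
mu_new = (mu1 / var1 + mu2 / var2) / prec  # convex combination of the means

print(var_new < var1 and var_new < var2)       # True
print(min(mu1, mu2) < mu_new < max(mu1, mu2))  # True
```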

Exercise 5.9

Exercise 5.7 states that

\[\kappa = \frac{ \left\vert \boldsymbol{\Sigma} \right\vert^{1 / 2} }{ (2 \pi)^{D / 2} \left\vert AB \right\vert^{1 / 2} } \exp\left[ \left( a^\top A^{-1} a + b^\top B^{-1} b - \boldsymbol{\mu}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} \right) \right]^{-0.5}.\]

Notice that

\[\begin{split}\frac{ \left\vert \boldsymbol{\Sigma} \right\vert^{1 / 2} }{ \left\vert AB \right\vert^{1 / 2} } &= \left( \left\vert AB \right\vert \left\vert A^{-1} + B^{-1} \right\vert \right)^{-1 / 2}\\ &= \left( \left\vert A \right\vert \left\vert A^{-1} + B^{-1} \right\vert \left\vert B \right\vert \right)^{-1 / 2}\\ &= \left( \left\vert A (A^{-1} + B^{-1}) B \right\vert \right)^{-1 / 2}\\ &= \left( \left\vert A + B \right\vert \right)^{-1 / 2}\end{split}\]

and

\[\begin{split}& \exp\left[ a^\top A^{-1} a + b^\top B^{-1} b - \boldsymbol{\mu}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} \right]^{-0.5}\\ &= \exp\left[ a^\top A^{-1} a + b^\top B^{-1} b - a^\top A^{-1} \boldsymbol{\Sigma} A^{-1} a - b^\top B^{-1} \boldsymbol{\Sigma} B^{-1} b - 2 a^\top A^{-1} \boldsymbol{\Sigma} B^{-1} b \right]^{-0.5}\\ &= \exp\left[ a^\top A^{-1} a + b^\top B^{-1} b - a^\top (A \boldsymbol{\Sigma}^{-1} A)^{-1} a - b^\top (B \boldsymbol{\Sigma}^{-1} B)^{-1} b - 2 a^\top (B \boldsymbol{\Sigma}^{-1} A)^{-1} b \right]^{-0.5}\\ &= \exp\left[ a^\top A^{-1} a + b^\top B^{-1} b - a^\top (A + B)^{-1} B A^{-1} a - b^\top (A + B)^{-1} A B^{-1} b - 2 a^\top (A + B)^{-1} b \right]^{-0.5}\\ &= \exp\left[ a^\top \left( A^{-1} - (A + B)^{-1} B A^{-1} \right) a + b^\top \left( B^{-1} - (A + B)^{-1} A B^{-1} \right) b - 2 a^\top (A + B)^{-1} b \right]^{-0.5}\\ &= \exp\left[ a^\top (A + B)^{-1} a - 2 a^\top (A + B)^{-1} b + b^\top (A + B)^{-1} b \right]^{-0.5} & \quad & \text{(a)}\\ &= \exp\left[ (a - b)^\top (A + B)^{-1} (a - b) \right]^{-0.5}.\end{split}\]

Thus \(\kappa = \NormDist_{a}[b, A + B]\).
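This identity can be verified numerically (a sketch; the vectors, matrices, and seed are arbitrary choices): the product of the two pdfs at any point \(x\) should equal \(\NormDist_a[b, A + B]\) times \(\NormDist_x[\boldsymbol{\mu}, \boldsymbol{\Sigma}]\).

```python
# Check Exercise 5.9: Norm_x[a, A] Norm_x[b, B] = Norm_a[b, A+B] Norm_x[mu, Sigma].
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)
D = 2
a, b = rng.standard_normal(D), rng.standard_normal(D)
La, Lb = rng.standard_normal((D, D)), rng.standard_normal((D, D))
A, B = La @ La.T + np.eye(D), Lb @ Lb.T + np.eye(D)

Sigma = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
mu = Sigma @ (np.linalg.solve(A, a) + np.linalg.solve(B, b))

x = rng.standard_normal(D)                   # arbitrary test point
lhs = multivariate_normal(a, A).pdf(x) * multivariate_normal(b, B).pdf(x)
rhs = multivariate_normal(b, A + B).pdf(a) * multivariate_normal(mu, Sigma).pdf(x)
print(np.isclose(lhs, rhs))  # True
```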

(a)

One approach to this solution is to assume the desired identities

\[\begin{split}A^{-1} - (A + B)^{-1} B A^{-1} &= (A + B)^{-1}\\ B^{-1} - (A + B)^{-1} A B^{-1} &= (A + B)^{-1}\end{split}\]

hold and try to solve for \((A + B)^{-1}\). This leads to the following identities:

\[\begin{split}(A + B) A^{-1} &= I + B A^{-1}\\ A^{-1} &= (A + B)^{-1} (I + B A^{-1})\\\\ (A + B) B^{-1} &= A B^{-1} + I\\ B^{-1} &= (A + B)^{-1} (A B^{-1} + I).\end{split}\]

The purpose of the assumption is to derive some kind of obviously true proposition and then work backwards. The solution in the book made use of the clever observation that

\[a^\top A^{-1} a = a^\top (A + B)^{-1} (A + B) A^{-1} a = a^\top (A + B)^{-1} a + a^\top (A + B)^{-1} B A^{-1} a\]

and

\[b^\top B^{-1} b = b^\top (A + B)^{-1} (A + B) B^{-1} b = b^\top (A + B)^{-1} b + b^\top (A + B)^{-1} A B^{-1} b.\]
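The two matrix identities behind step (a) can themselves be spot-checked on random invertible matrices (a sketch; the sizes and seed are arbitrary choices):

```python
# Check A^{-1} - (A+B)^{-1} B A^{-1} = (A+B)^{-1} and the symmetric-role
# identity B^{-1} - (A+B)^{-1} A B^{-1} = (A+B)^{-1}.
import numpy as np

rng = np.random.default_rng(6)
D = 3
A = rng.standard_normal((D, D)) + D * np.eye(D)  # generically invertible
B = rng.standard_normal((D, D)) + D * np.eye(D)

Ainv, Binv, ABinv = np.linalg.inv(A), np.linalg.inv(B), np.linalg.inv(A + B)
print(np.allclose(Ainv - ABinv @ B @ Ainv, ABinv))  # True
print(np.allclose(Binv - ABinv @ A @ Binv, ABinv))  # True
```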

Exercise 5.10

Suppose \(x \in \mathbb{R}^n\), \(A \in \mathbb{R}^{n \times m}\), \(y \in \mathbb{R}^m\), \(b \in \mathbb{R}^n\), and \(\Sigma \in \mathbb{R}^{n \times n}\).

\[\begin{split}& \NormDist_x[Ay + b, \Sigma]\\ &= \frac{1}{\left\vert 2 \pi \Sigma \right\vert^{1 / 2}} \exp\left[ (x - Ay - b)^\top \Sigma^{-1} (x - Ay - b) \right]^{-0.5}\\ &= \frac{1}{(2 \pi)^{n / 2} \left\vert \Sigma \right\vert^{1 / 2}} \exp\left[ x^\top \Sigma^{-1} x - 2 x^\top \Sigma^{-1} A y - 2 x^\top \Sigma^{-1} b + y^\top A^\top \Sigma^{-1} Ay + 2 y^\top A^\top \Sigma^{-1} b + b^\top \Sigma^{-1} b \right]^{-0.5}\\ &= \frac{1}{(2 \pi)^{n / 2} \left\vert \Sigma \right\vert^{1 / 2}} \exp\left[ x^\top \Sigma^{-1} x - 2 x^\top \Sigma^{-1} b + b^\top \Sigma^{-1} b \right]^{-0.5} \exp\left[ y^\top A^\top \Sigma^{-1} A y - 2 y^\top A^\top \Sigma^{-1} (x - b) \right]^{-0.5}\\ &= \kappa_1 \exp\left[ y^\top A^\top \Sigma^{-1} A y - 2 y^\top A^\top \Sigma^{-1} (x - b) \right]^{-0.5}\\ &= \kappa_1 \exp\left[ \left( y - \Sigma' A^\top \Sigma^{-1} (x - b) \right)^\top \Sigma'^{-1} \left( y - \Sigma' A^\top \Sigma^{-1} (x - b) \right) - (x - b)^\top \Sigma^{-1} A \Sigma' A^\top \Sigma^{-1} (x - b) \right]^{-0.5}\\ &= \kappa_1 \exp\left[ (A' x + b')^\top \Sigma'^{-1} (A' x + b') \right]^{0.5} \exp\left[ \left( y - (A' x + b') \right)^\top \Sigma'^{-1} \left( y - (A' x + b') \right) \right]^{-0.5}\\ &= \kappa_2 \left\vert 2 \pi \Sigma' \right\vert^{1 / 2} \NormDist_y[A' x + b', \Sigma']\\ &= \kappa \NormDist_y[A' x + b', \Sigma']\end{split}\]

where

\[\begin{split}\Sigma' &= (A^\top \Sigma^{-1} A)^{-1} \in \mathbb{R}^{m \times m}\\\\ A' &= \Sigma' A^\top \Sigma^{-1} \in \mathbb{R}^{m \times n}\\\\ b' &= -\Sigma' A^\top \Sigma^{-1} b \in \mathbb{R}^{m}\end{split}\]
\[\begin{split}\kappa &= (2 \pi)^{(m - n) / 2} \frac{ \left\vert \Sigma' \right\vert^{1 / 2} }{ \left\vert \Sigma \right\vert^{1 / 2} } \exp\left[ x^\top \Sigma^{-1} x - 2 x^\top \Sigma^{-1} b + b^\top \Sigma^{-1} b \right]^{-0.5} \exp\left[ (A' x + b')^\top \Sigma'^{-1} (A' x + b') \right]^{0.5}\\ &= (2 \pi)^{(m - n) / 2} \frac{ \left\vert \Sigma' \right\vert^{1 / 2} }{ \left\vert \Sigma \right\vert^{1 / 2} } \exp\left[ (x - b)^\top \left( \Sigma^{-1} - \Sigma^{-1} A \Sigma' A^\top \Sigma^{-1} \right) (x - b) \right]^{-0.5}.\end{split}\]
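The whole Exercise 5.10 identity, constants included, can be checked numerically for a non-square \(A\) of full column rank (a sketch; the dimensions, matrices, and seed are arbitrary choices):

```python
# Check Exercise 5.10 for n = 3, m = 2: Norm_x[A y + b, Sigma], viewed as a
# function of y, equals kappa * Norm_y[A' x + b', Sigma'].
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(7)
n, m = 3, 2
A = rng.standard_normal((n, m))          # full column rank (generic)
b = rng.standard_normal(n)
L = rng.standard_normal((n, n))
Sigma = L @ L.T + n * np.eye(n)          # arbitrary positive definite
Sinv = np.linalg.inv(Sigma)

Sp = np.linalg.inv(A.T @ Sinv @ A)       # Sigma'
Ap = Sp @ A.T @ Sinv                     # A'
bp = -Sp @ A.T @ Sinv @ b                # b'

x, y = rng.standard_normal(n), rng.standard_normal(m)
r = x - b
kappa = ((2 * np.pi) ** ((m - n) / 2)
         * np.sqrt(np.linalg.det(Sp) / np.linalg.det(Sigma))
         * np.exp(-0.5 * r @ (Sinv - Sinv @ A @ Sp @ A.T @ Sinv) @ r))

lhs = multivariate_normal(A @ y + b, Sigma).pdf(x)
rhs = kappa * multivariate_normal(Ap @ x + bp, Sp).pdf(y)
print(np.isclose(lhs, rhs))  # True
```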

References

SchonL11

Thomas B. Schön and Fredrik Lindsten. Manipulating the Multivariate Gaussian Density. Technical report, Division of Automatic Control, Linköping University, Sweden, 2011.

Wan

Ruye Wang. Marginal and Conditional Distributions of Multivariate Normal Distribution. http://fourier.eng.hmc.edu/e161/lectures/gaussianprocess/node7.html. Accessed 2017-06-11.