Modeling Complex Data Densities
Hidden Variables
Hidden (latent) variables describe the target density \(Pr(x)\) as the
marginalization of \(Pr(x, h)\) such that
\[Pr(x \mid \theta) = \int Pr(x, h \mid \theta) dh.\]
The joint density \(Pr(x, h)\) can be chosen to be much simpler than the
target density \(Pr(x)\) while producing complex marginal distributions
when integrated.
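As a minimal numerical sketch of this idea (the component parameters below are arbitrary, chosen only for illustration), marginalizing a simple joint \(Pr(x, h)\) over a discrete hidden variable already produces a multimodal marginal:

```python
import numpy as np
from scipy.stats import norm

# Joint Pr(x, h) = Pr(x | h) Pr(h) with a discrete hidden variable h in {0, 1}.
# Each conditional Pr(x | h) is a plain univariate normal.
lam = np.array([0.3, 0.7])        # Pr(h)
mu = np.array([-2.0, 1.5])        # means of Pr(x | h)
sigma = np.array([0.5, 1.0])      # standard deviations of Pr(x | h)

x = np.linspace(-5, 5, 1000)
# Marginalization: Pr(x) = sum_h Pr(x | h) Pr(h) is bimodal even though
# every term in the sum is unimodal.
px = sum(lam[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(2))
print(px.sum() * (x[1] - x[0]))   # ~1.0, sanity check that Pr(x) normalizes
```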
Maximizing the log marginal likelihood directly can be posed as a constrained
nonlinear optimization problem (7.7). Working with the joint density instead, the
likelihood can be maximized using EM, whose updates often have closed-form
solutions (7.8).
Expectation Maximization
Figure 7.5 is a beautiful illustration of the EM algorithm.
Factor Analysis
The factor analyzer combines a linear subspace, spanned by the columns of
\(\boldsymbol{\Phi}\), with a diagonal noise covariance \(\boldsymbol{\Sigma}\),
producing the full covariance model
\[\DeclareMathOperator{\NormDist}{Norm}
Pr(\mathbf{x}) =
\NormDist_{\mathbf{x}}\left[
\boldsymbol{\mu},
\boldsymbol{\Phi} \boldsymbol{\Phi}^\top + \boldsymbol{\Sigma}
\right].\]
Probabilistic principal component analysis is an instance of the factor
analyzer in which the noise covariance is spherical,
\(\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}\). It has slightly fewer parameters
and can be fit in closed form.
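A short numpy sketch (with \(\boldsymbol{\Phi}\), \(\boldsymbol{\Sigma}\), and \(\boldsymbol{\mu}\) chosen arbitrarily for illustration) of how a few factors plus diagonal noise induce a full covariance:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
D, K = 5, 2                                  # data dimension, number of factors

mu = np.zeros(D)                             # mean
Phi = rng.normal(size=(D, K))                # factor loading matrix (D x K)
Sigma = np.diag(rng.uniform(0.1, 0.5, D))    # diagonal noise covariance

# Marginal density of the factor analyzer: Norm[mu, Phi Phi^T + Sigma].
cov = Phi @ Phi.T + Sigma
x = rng.normal(size=D)
print(multivariate_normal.pdf(x, mean=mu, cov=cov))

# Probabilistic PCA is the special case Sigma = sigma^2 * I.
cov_ppca = Phi @ Phi.T + 0.2 * np.eye(D)
```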
Exercise 7.1
The world state \(w \in \{0, 1\}\) is a discrete label indicating whether
the orange is ripe or not.
The observed data \(\mathbf{x} \in \mathbb{R}^3\) describes the averaged RGB
color of a segmented orange.
Suppose the given labeled training data
\(\left\{ \mathbf{x}_i, w_i \right\}_{i = 1}^I\) describes the multimodal
colorspace of ripe and unripe oranges. Since the pixels have been
post-processed, outliers can be ignored. The input dimension is low, so there is
no need for factor analysis.
One way to model this is to separately fit a mixture of Gaussians for
each discrete label as
\[Pr(\mathbf{x} \mid w) =
\sum_{k = 1}^K \lambda_{wk}
\NormDist_\mathbf{x}\left[
\boldsymbol{\mu}_{wk}, \boldsymbol{\Sigma}_{wk}
\right].\]
There is no prior knowledge about whether oranges are ripe or unripe
when the system is used, hence \(Pr(w = 0) = Pr(w = 1) = 0.5\).
Applying Bayes’ rule to compute the posterior over \(w\) gives
\[Pr\left( w^\ast = 1 \mid \mathbf{x}^\ast \right) =
\frac{
Pr\left( w^\ast = 1, \mathbf{x}^\ast \right)
}{
Pr\left( \mathbf{x}^\ast \right)
}.\]
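One possible implementation sketch of this classifier, using scikit-learn's GaussianMixture as a stand-in for the per-class mixture fits; the training arrays here are synthetic RGB data invented for illustration, not the real labeled set:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_ripe = rng.normal([200, 120, 40], 15, size=(300, 3))    # synthetic ripe RGB
X_unripe = rng.normal([100, 160, 60], 15, size=(300, 3))  # synthetic unripe RGB

K = 3  # number of mixture components per class (a modeling choice)

# Fit one mixture of Gaussians per label: Pr(x | w = 1) and Pr(x | w = 0).
ripe_model = GaussianMixture(n_components=K).fit(X_ripe)
unripe_model = GaussianMixture(n_components=K).fit(X_unripe)

def posterior_ripe(x_new):
    """Pr(w = 1 | x) with equal priors Pr(w = 0) = Pr(w = 1) = 0.5."""
    x_new = np.atleast_2d(x_new)
    lik_ripe = np.exp(ripe_model.score_samples(x_new))      # Pr(x | w = 1)
    lik_unripe = np.exp(unripe_model.score_samples(x_new))  # Pr(x | w = 0)
    return 0.5 * lik_ripe / (0.5 * lik_ripe + 0.5 * lik_unripe)

print(posterior_ripe([190, 125, 45]))   # close to 1 for a ripe-looking color
```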
Exercise 7.2
Erroneous training labels can be viewed as outliers, so a mixture of multivariate
t-distributions can be used instead.
Exercise 7.3
Rewriting (7.18) with Lagrange multipliers gives
\[\begin{split}L &= \sum_{i = 1}^I \sum_{k = 1}^K r_{ik} \log\left(
\lambda_k \NormDist_{\mathbf{x}_i}\left[
\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k
\right]
\right) +
\nu \left( \sum_{k = 1}^K \lambda_k - 1 \right)\\
&= \sum_{i = 1}^I \sum_{k = 1}^K r_{ik}
\left(
\log \lambda_k -
\frac{D}{2} \log 2 \pi -
\frac{1}{2} \log \lvert \boldsymbol{\Sigma}_k \rvert -
\frac{1}{2}
(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top
\boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_k)
\right) +
\nu \left( \sum_{k = 1}^K \lambda_k - 1 \right).\end{split}\]
(a)
\[\begin{split}\frac{\partial L}{\partial \lambda_k}
&= \sum_{i = 1}^I
\frac{\partial}{\partial \lambda_k} r_{ik} \log \lambda_k +
\nu\\
0 &= \sum_{i = 1}^I \frac{r_{ik}}{\lambda_k} + \nu
& \quad & r_{ik} = q_i(h_i) \text{ is a constant in the M-step}\\
\lambda_k &= -\frac{\sum_{i = 1}^I r_{ik}}{\nu}\\
&= \frac{\sum_{i = 1}^I r_{ik}}{\sum_{j = 1}^K \sum_{i = 1}^I r_{ij}}\end{split}\]
since
\[\begin{split}0 &= \sum_{i = 1}^I r_{ik} + \nu \lambda_k\\
-\nu \sum_{j = 1}^K \lambda_j &= \sum_{j = 1}^K \sum_{i = 1}^I r_{ij}\\
\nu &= -\sum_{j = 1}^K \sum_{i = 1}^I r_{ij}
& \quad & \sum_k \lambda_k = 1.\end{split}\]
(b)
\[\begin{split}\frac{\partial L}{\partial \boldsymbol{\mu}_k}
&= \sum_{i = 1}^I
\frac{\partial}{\partial \boldsymbol{\mu}_k}
\frac{-r_{ik}}{2}
\left(
\mathbf{x}_i^\top \boldsymbol{\Sigma}_k^{-1} \mathbf{x}_i -
2 \mathbf{x}_i^\top \boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k +
\boldsymbol{\mu}_k^\top \boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k
\right)\\
\boldsymbol{0}
&= \sum_{i = 1}^I
\frac{r_{ik}}{2}
\left(
2 \boldsymbol{\Sigma}_k^{-1} \mathbf{x}_i -
2 \boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k
\right)
& \quad & \text{(C.28) and (C.33)}\\
\boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k \sum_{i = 1}^I r_{ik}
&= \boldsymbol{\Sigma}_k^{-1} \sum_{i = 1}^I r_{ik} \mathbf{x}_i\\
\boldsymbol{\mu}_k
&= \frac{
\sum_{i = 1}^I r_{ik} \mathbf{x}_i
}{
\sum_{i = 1}^I r_{ik}
}\end{split}\]
(c)
\[\begin{split}\frac{\partial L}{\partial \boldsymbol{\Sigma}_k}
&= \sum_{i = 1}^I
\frac{\partial}{\partial \boldsymbol{\Sigma}_k} \frac{-r_{ik}}{2}
\left(
\log \left\vert \boldsymbol{\Sigma}_k \right\vert +
(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top
\boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_k)
\right)\\
\boldsymbol{0}
&= \sum_{i = 1}^I
\frac{r_{ik}}{2}
\left[
-\boldsymbol{\Sigma}_k^{-\top} +
\boldsymbol{\Sigma}_k^{-\top}
(\mathbf{x}_i - \boldsymbol{\mu}_k)
(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top
\boldsymbol{\Sigma}_k^{-\top}
\right]
& \quad & \text{(C.38) and Matrix Cookbook Section 2.2}\\
\boldsymbol{\Sigma}_k^{-1} \sum_{i = 1}^I r_{ik}
&= \boldsymbol{\Sigma}_k^{-1}
\left(
\sum_{i = 1}^I r_{ik}
(\mathbf{x}_i - \boldsymbol{\mu}_k)
(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top
\right)
\boldsymbol{\Sigma}_k^{-1}\\
\boldsymbol{\Sigma}_k
&= \frac{
\sum_{i = 1}^I r_{ik}
(\mathbf{x}_i - \boldsymbol{\mu}_k)
(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top
}{
\sum_{i = 1}^I r_{ik}
}\end{split}\]
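A short numpy sketch of one EM iteration that applies exactly the updates derived in (a)-(c); the function interface and variable names are my own, not from the text:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, lam, mus, Sigmas):
    """One EM iteration for a mixture of Gaussians.

    X: (I, D) data; lam: (K,) weights; mus: (K, D) means; Sigmas: (K, D, D) covariances.
    """
    K = len(lam)

    # E-step: responsibilities r_ik = q_i(h_i = k).
    r = np.column_stack([
        lam[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k]) for k in range(K)
    ])
    r /= r.sum(axis=1, keepdims=True)

    # M-step: the closed-form maximizers from parts (a)-(c).
    Nk = r.sum(axis=0)                      # sum_i r_ik
    lam = Nk / Nk.sum()                     # lambda_k
    mus = (r.T @ X) / Nk[:, None]           # mu_k
    Sigmas = np.stack([
        (r[:, k, None] * (X - mus[k])).T @ (X - mus[k]) / Nk[k] for k in range(K)
    ])
    return lam, mus, Sigmas
```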
Exercise 7.4
\[\DeclareMathOperator{\BetaDist}{Beta}
Pr(x \mid \boldsymbol{\theta}) =
\sum_{k = 1}^K \lambda_k \BetaDist_x[\alpha_k, \beta_k]\]
is a mixture of beta distributions where \(x \in [0, 1]\) is univariate and
\(\boldsymbol{\theta} = \{ \alpha_k, \beta_k, \lambda_k \}_{k = 1}^K\).
Define a discrete hidden variable \(h \in \{1 \ldots K\}\) such that
\[\begin{split}\DeclareMathOperator{\CatDist}{Cat}
Pr(x \mid h, \boldsymbol{\theta})
&= \BetaDist_x[\alpha_h, \beta_h]\\
Pr(h \mid \boldsymbol{\theta}) &= \CatDist_h[\boldsymbol{\lambda}]\end{split}\]
where \(\boldsymbol{\lambda} = \{ \lambda_1, \ldots, \lambda_K\}\) are the
parameters of the categorical distribution.
The original density can be expressed as the marginalization of the previous
terms:
\[\begin{split}Pr(x \mid \boldsymbol{\theta})
&= \sum_{h} Pr(x, h \mid \boldsymbol{\theta})\\
&= \sum_{h}
Pr(x \mid h, \boldsymbol{\theta})
Pr(h \mid \boldsymbol{\theta})\\
&= \sum_{k = 1}^K \lambda_k \BetaDist_x[\alpha_k, \beta_k].\end{split}\]
In the E-step, the goal is to compute the probability that the
\(k\text{th}\) beta distribution was responsible for the
\(i\text{th}\) data point:
\[\begin{split}q_i(h_i = k)
&= Pr(h_i \mid x_i, \boldsymbol{\theta})
& \quad & \text{(7.11)}\\
&= \frac{
Pr(h_i, x_i \mid \boldsymbol{\theta})
}{
Pr(x_i \mid \boldsymbol{\theta})
}\\
&= \frac{
Pr(x_i \mid h_i, \boldsymbol{\theta}) Pr(h_i \mid \boldsymbol{\theta})
}{
Pr(x_i \mid \boldsymbol{\theta})
}\\
&= \frac{
\lambda_k \BetaDist_{x_i}[\alpha_k, \beta_k]
}{
\sum_{j = 1}^K \lambda_j \BetaDist_{x_i}[\alpha_j, \beta_j]
}\\
&= r_{ik}.\end{split}\]
In the M-step, the goal is to maximize the lower bound approximation with
respect to \(\boldsymbol{\theta}\) such that
\[\begin{split}\DeclareMathOperator*{\argmax}{arg\,max}
\hat{\boldsymbol{\theta}}
&= \argmax_{\boldsymbol{\theta}}
\sum_{i = 1}^I \sum_{k = 1}^K
q_i(h_i = k)
\log Pr(x_i, h_i = k \mid \boldsymbol{\theta})
& \quad & \text{(7.12)}\\
&= \argmax_{\boldsymbol{\theta}}
\sum_{i = 1}^I \sum_{k = 1}^K
r_{ik} \log\left( \lambda_k \BetaDist_{x_i}[\alpha_k, \beta_k] \right).\end{split}\]
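A sketch of one EM iteration for the beta mixture. The \(\lambda_k\) update is closed form exactly as in Exercise 7.3; the responsibility-weighted beta likelihood for \((\alpha_k, \beta_k)\) has no closed-form maximum, so it is maximized numerically here, which is a judgment call rather than something stated in the text:

```python
import numpy as np
from scipy.stats import beta as beta_dist
from scipy.optimize import minimize

def em_step_beta(x, lam, alpha, betas):
    """One EM iteration for a mixture of beta distributions.

    x: (I,) data in (0, 1); lam, alpha, betas: float arrays of length K.
    """
    K = len(lam)

    # E-step: r_ik proportional to lambda_k * Beta(x_i; alpha_k, beta_k).
    r = np.column_stack([
        lam[k] * beta_dist.pdf(x, alpha[k], betas[k]) for k in range(K)
    ])
    r /= r.sum(axis=1, keepdims=True)

    # M-step: lambda_k in closed form; (alpha_k, beta_k) by numerical
    # maximization of the responsibility-weighted log likelihood.
    lam = r.sum(axis=0) / r.sum()
    for k in range(K):
        neg_ll = lambda p: -np.sum(r[:, k] * beta_dist.logpdf(x, p[0], p[1]))
        res = minimize(neg_ll, x0=[alpha[k], betas[k]],
                       bounds=[(1e-3, None), (1e-3, None)])
        alpha[k], betas[k] = res.x
    return lam, alpha, betas
```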
Exercise 7.5
\[\begin{split}\DeclareMathOperator{\GamDist}{Gam}
\DeclareMathOperator{\StudDist}{Stud}
Pr(x) &= \int Pr(x, h) dh\\
&= \int Pr(x \mid h) Pr(h) dh\\
&= \int_0^\infty
\NormDist_x\left[ \mu, \sigma^2 / h \right]
\GamDist_h[\nu / 2, \nu / 2] dh
& \quad & \text{Gamma distribution is defined on } (0, \infty)\\
&= \int_0^\infty
\frac{
h^{1/2}
}{
\sqrt{2 \pi \sigma^2}
}
\exp\left[
-\frac{(x - \mu)^2}{2 \sigma^2} h
\right]
\frac{
\left( \frac{\nu}{2} \right)^{\nu / 2}
}{
\Gamma\left[ \frac{\nu}{2} \right]
}
\exp\left[ -\frac{\nu}{2} h \right] h^{\nu / 2 - 1} dh\\
&= \frac{
\nu^{\nu / 2}
}{
2^{\nu / 2} \sqrt{2 \pi \sigma^2} \Gamma\left[ \frac{\nu}{2} \right]
}
\int_0^\infty \exp\left[
-\left(
\frac{\nu}{2} + \frac{(x - \mu)^2}{2 \sigma^2}
\right) h
\right] h^{(\nu - 1) / 2} dh\\
&= \frac{
\nu^{\nu / 2}
}{
2^{\nu / 2} \sqrt{2 \pi \sigma^2} \Gamma[\frac{\nu}{2}]
}
\left(
\frac{\nu}{2} + \frac{(x - \mu)^2}{2 \sigma^2}
\right)^{-(\nu + 1) / 2}
\Gamma\left[ \frac{\nu + 1}{2} \right]
& \quad & \int_0^\infty x^n \exp\left[ -a x^b \right] dx =
b^{-1} a^{-(n + 1) / b} \Gamma\left[ \frac{n + 1}{b} \right]\\
&= \frac{
\nu^{\nu / 2}
}{
\sqrt{\pi \sigma^2} \Gamma\left[ \frac{\nu}{2} \right]
}
\left(
\nu + \frac{(x - \mu)^2}{\sigma^2}
\right)^{-(\nu + 1) / 2}
\Gamma\left[ \frac{\nu + 1}{2} \right]\\
&= \frac{1}{
\sqrt{\nu \pi \sigma^2} \Gamma\left[ \frac{\nu}{2} \right]
}
\left(
1 + \frac{(x - \mu)^2}{\nu \sigma^2}
\right)^{-(\nu + 1) / 2}
\Gamma\left[ \frac{\nu + 1}{2} \right]\\
&= \frac{
\Gamma\left[ \frac{\nu + 1}{2} \right]
}{
\sqrt{\nu \pi \sigma^2}
\Gamma\left[ \frac{\nu}{2} \right]
}
\left(
1 + \frac{(x - \mu)^2}{\nu \sigma^2}
\right)^{-\frac{\nu + 1}{2}}\\
&= \StudDist_x\left[ \mu, \sigma^2, \nu \right].\end{split}\]
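A quick numerical check of this marginalization with arbitrarily chosen \(\mu\), \(\sigma\), \(\nu\); scipy's t distribution with loc \(\mu\) and scale \(\sigma\) is taken to correspond to \(\StudDist_x[\mu, \sigma^2, \nu]\):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

mu, sigma, nu = 1.0, 2.0, 5.0

def integrand(h, x):
    # Norm_x[mu, sigma^2 / h] * Gam_h[nu / 2, nu / 2]  (rate beta -> scale 2 / nu)
    return (stats.norm.pdf(x, mu, np.sqrt(sigma**2 / h)) *
            stats.gamma.pdf(h, a=nu / 2, scale=2 / nu))

for x in [-2.0, 0.0, 1.0, 3.5]:
    marginal, _ = quad(integrand, 0, np.inf, args=(x,))
    student = stats.t.pdf(x, df=nu, loc=mu, scale=sigma)
    print(f"{marginal:.6f}  {student:.6f}")   # the two columns agree
```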
Exercise 7.6
\[\begin{split}\frac{\partial}{\partial z} \GamDist_z[\alpha, \beta]
&= \frac{\partial}{\partial z}
\frac{\beta^\alpha}{\Gamma[\alpha]}
\exp[-\beta z] z^{\alpha - 1}\\
0 &= \frac{\beta^\alpha}{\Gamma[\alpha]}
\left(
-\beta \exp[-\beta z] z^{\alpha - 1} +
(\alpha - 1) \exp[-\beta z] z^{\alpha - 2}
\right)\\
\beta z^{\alpha - 1} &= (\alpha - 1) z^{\alpha - 2}\\
z &= \frac{\alpha - 1}{\beta}\end{split}\]
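A short numerical sanity check with arbitrary \(\alpha\), \(\beta\); scipy's gamma takes a shape \(a = \alpha\) and scale \(1 / \beta\):

```python
import numpy as np
from scipy.stats import gamma

alpha, beta = 3.0, 2.0
z = np.linspace(1e-6, 10, 100_000)
# The grid argmax of the density should sit at the derived mode (alpha - 1) / beta.
mode_numeric = z[np.argmax(gamma.pdf(z, a=alpha, scale=1 / beta))]
print(mode_numeric, (alpha - 1) / beta)   # both approximately 1.0
```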
Exercise 7.7
\[\begin{split}& \NormDist_{\mathbf{x}}[\boldsymbol{\mu}, \boldsymbol{\Sigma} / h]
\GamDist_{h}[\nu / 2, \nu / 2]\\
&= \frac{
h^{D / 2}
}{
(2 \pi)^{D / 2} \lvert \boldsymbol{\Sigma} \rvert^{1 / 2}
}
\exp\left[
-\frac{
(\mathbf{x} - \boldsymbol{\mu})^\top
\boldsymbol{\Sigma}^{-1}
(\mathbf{x} - \boldsymbol{\mu})
}{2} h
\right]
\frac{\nu^{\nu / 2}}{2^{\nu / 2} \Gamma[\nu / 2]}
\exp\left[ -\frac{\nu}{2} h \right] h^{\nu / 2 - 1}\\
&= \frac{
\nu^{\nu / 2}
}{
(2 \pi)^{D / 2}
\lvert \boldsymbol{\Sigma} \rvert^{1 / 2}
\Gamma[\nu / 2]
2^{\nu / 2}
}
\exp\left[
-\frac{
(\mathbf{x} - \boldsymbol{\mu})^\top
\boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) +
\nu
}{2} h
\right]
h^{(\nu + D) / 2 - 1}\\
&= \frac{
\nu^{\nu / 2}
}{
(2 \pi)^{D / 2}
\lvert \boldsymbol{\Sigma} \rvert^{1 / 2}
\Gamma[\nu / 2]
2^{\nu / 2}
}
\exp[-\beta h] h^{\alpha - 1}\\
&= \kappa \GamDist_h[\alpha, \beta]\end{split}\]
where
\[\begin{split}\alpha &= \frac{\nu + D}{2}\\\\
\beta &= \frac{
(\mathbf{x} - \boldsymbol{\mu})^\top
\boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) +
\nu
}{2}\\\\
\kappa
&= \frac{
\nu^{\nu / 2} \Gamma[\alpha]
}{
(2 \pi)^{D / 2} \lvert \boldsymbol{\Sigma} \rvert^{1 / 2}
\Gamma[\nu / 2] 2^{\nu / 2} \beta^\alpha
}\\
&= \frac{
\nu^{\nu / 2} \Gamma[\alpha]
}{
(2 \pi)^{D / 2} \lvert \boldsymbol{\Sigma} \rvert^{1 / 2}
\Gamma[\nu / 2] 2^{\nu / 2}
}
\left( \frac{\nu}{2} \right)^{-\alpha}
\left(
\frac{
(\mathbf{x} - \boldsymbol{\mu})^\top
\boldsymbol{\Sigma}^{-1}
(\mathbf{x} - \boldsymbol{\mu})
}{\nu} + 1
\right)^{-\alpha}\\
&= \StudDist_{\mathbf{x}}[\boldsymbol{\mu}, \boldsymbol{\Sigma}, \nu].\end{split}\]
Exercise 7.8
Let \(\mathbf{x}_i =
\boldsymbol{\mu} + \boldsymbol{\Phi} \mathbf{h}_i + \boldsymbol{\epsilon}_i\)
where
\[\begin{split}\DeclareMathOperator{\E}{\mathrm{E}}
\DeclareMathOperator{\Cov}{\mathrm{Cov}}
\begin{gather*}
\E[\mathbf{h}_i] = \E[\boldsymbol{\epsilon}_i] = \boldsymbol{0}\\
\Cov(\mathbf{h}_i, \mathbf{h}_i) =
\E\left[
(\mathbf{h}_i - \E[\mathbf{h}_i])
(\mathbf{h}_i - \E[\mathbf{h}_i])^\top
\right] =
\E\left[ \mathbf{h}_i \mathbf{h}_i^\top \right] = \mathbf{I}\\
\Cov(\boldsymbol{\epsilon}_i, \boldsymbol{\epsilon}_i) =
\E\left[
(\boldsymbol{\epsilon}_i - \E[\boldsymbol{\epsilon}_i])
(\boldsymbol{\epsilon}_i - \E[\boldsymbol{\epsilon}_i])^\top
\right] =
\E\left[ \boldsymbol{\epsilon}_i \boldsymbol{\epsilon}_i^\top \right] =
\boldsymbol{\Sigma}\\
\Cov(\mathbf{h}_i, \boldsymbol{\epsilon}_i) =
\E\left[
(\mathbf{h}_i - \E[\mathbf{h}_i])
(\boldsymbol{\epsilon}_i - \E[\boldsymbol{\epsilon}_i])^\top
\right] =
\E\left[ \mathbf{h}_i \boldsymbol{\epsilon}_i^\top \right] =
\boldsymbol{0}\\
\Cov(\boldsymbol{\epsilon}_i, \mathbf{h}_i) =
\Cov(\mathbf{h}_i, \boldsymbol{\epsilon}_i)^\top = \boldsymbol{0}
\end{gather*}\end{split}\]
(1)
\[\begin{split}\E[\mathbf{x}_i]
&= \E[\boldsymbol{\mu}] +
\E[\boldsymbol{\Phi} \mathbf{h}_i] +
\E[\boldsymbol{\epsilon}_i]
& \quad & \text{(2.16)}\\
&= \boldsymbol{\mu} +
\boldsymbol{\Phi} \E[\mathbf{h}_i] +
\boldsymbol{0}
& \quad & \text{(2.14), (2.15), (2.16)}\\
&= \boldsymbol{\mu} + \boldsymbol{\Phi} \boldsymbol{0}\\
&= \boldsymbol{\mu}\end{split}\]
(2)
\[\begin{split}\E\left[
(\mathbf{x}_i - \E[\mathbf{x}_i])
(\mathbf{x}_i - \E[\mathbf{x}_i])^\top
\right]
&= \E\left[
(\mathbf{x}_i - \boldsymbol{\mu})
(\mathbf{x}_i - \boldsymbol{\mu})^\top
\right]\\
&= \E\left[
(\boldsymbol{\Phi} \mathbf{h}_i + \boldsymbol{\epsilon}_i)
(\boldsymbol{\Phi} \mathbf{h}_i + \boldsymbol{\epsilon}_i)^\top
\right]\\
&= \E\left[
\boldsymbol{\Phi} \mathbf{h}_i \mathbf{h}_i^\top \boldsymbol{\Phi}^\top
\right] +
\E\left[
\boldsymbol{\Phi} \mathbf{h}_i \boldsymbol{\epsilon}_i^\top
\right] +
\E\left[
\boldsymbol{\epsilon}_i \mathbf{h}_i^\top \boldsymbol{\Phi}^\top
\right] +
\E\left[ \boldsymbol{\epsilon}_i \boldsymbol{\epsilon}_i^\top \right]
& \quad & \text{(2.16)}\\
&= \boldsymbol{\Phi} \boldsymbol{\Phi}^\top +
\boldsymbol{\Phi} \E\left[
\mathbf{h}_i \boldsymbol{\epsilon}_i^\top
\right] +
\E\left[ \boldsymbol{\epsilon}_i \mathbf{h}_i^\top \right]
\boldsymbol{\Phi}^\top +
\boldsymbol{\Sigma}
& \quad & \text{(2.15), (2.16)}\\
&= \boldsymbol{\Phi} \boldsymbol{\Phi}^\top + \boldsymbol{\Sigma}\end{split}\]
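A Monte Carlo sketch confirming these two moments; \(\boldsymbol{\mu}\), \(\boldsymbol{\Phi}\), and \(\boldsymbol{\Sigma}\) below are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 4, 2, 200_000

mu = np.array([1.0, -2.0, 0.5, 3.0])
Phi = rng.normal(size=(D, K))
Sigma = np.diag(rng.uniform(0.1, 0.5, D))   # diagonal noise covariance

# Generate x_i = mu + Phi h_i + eps_i with h_i ~ Norm[0, I], eps_i ~ Norm[0, Sigma].
H = rng.standard_normal((N, K))
eps = rng.multivariate_normal(np.zeros(D), Sigma, size=N)
X = mu + H @ Phi.T + eps

print(np.allclose(X.mean(axis=0), mu, atol=0.02))                # E[x] = mu
print(np.allclose(np.cov(X, rowvar=False), Phi @ Phi.T + Sigma,
                  atol=0.05))                                    # Cov[x] = Phi Phi^T + Sigma
```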
Exercise 7.9
\[\begin{split}q_i(\mathbf{h}_i)
&= Pr(\mathbf{h}_i \mid \mathbf{x}_i, \boldsymbol{\theta})
& \quad & \text{(7.11)}\\
&= \frac{
Pr(\mathbf{x}_i \mid \mathbf{h}_i, \boldsymbol{\theta})
Pr(\mathbf{h}_i \mid \boldsymbol{\theta})
}{
Pr(\mathbf{x}_i \mid \boldsymbol{\theta})
}\\
&= \frac{
\NormDist_{\mathbf{x}_i}\left[
\boldsymbol{\mu} + \boldsymbol{\Phi} \mathbf{h}_i,
\boldsymbol{\Sigma}
\right]
\NormDist_{\mathbf{h}_i}\left[ \boldsymbol{0}, \mathbf{I} \right]
}{
\NormDist_{\mathbf{x}_i}\left[
\boldsymbol{\mu},
\boldsymbol{\Phi} \boldsymbol{\Phi}^\top + \boldsymbol{\Sigma}
\right]
}\\
&= \frac{
\kappa_1 \NormDist_{\mathbf{h}_i}\left[
\boldsymbol{\Phi}' \mathbf{x}_i + \boldsymbol{\mu}',
\boldsymbol{\Sigma}'
\right]
\NormDist_{\mathbf{h}_i}\left[ \boldsymbol{0}, \mathbf{I} \right]
}{
\NormDist_{\mathbf{x}_i}\left[
\boldsymbol{\mu},
\boldsymbol{\Phi} \boldsymbol{\Phi}^\top + \boldsymbol{\Sigma}
\right]
}
& \quad & \text{(a), (5.17), and Exercise 5.10}\\
&= \frac{
\kappa_1 \kappa_2 \NormDist_{\mathbf{h}_i}\left[
\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}}
\right]
}{
\NormDist_{\mathbf{x}_i}\left[
\boldsymbol{\mu},
\boldsymbol{\Phi} \boldsymbol{\Phi}^\top + \boldsymbol{\Sigma}
\right]
}
& \quad & \text{(b), (5.14), (5.15), and Exercises 5.7 and 5.9}\\
&= \NormDist_{\mathbf{h}_i}\left[
\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}}
\right]
& \quad & \text{(c)}\end{split}\]
See Exercise 5.7,
Exercise 5.9, and
Exercise 5.10 for details.
(a)
\[\begin{split}\boldsymbol{\Sigma}' &=
\left(
\boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi}
\right)^{-1}\\\\
\boldsymbol{\Phi}' &=
\boldsymbol{\Sigma}' \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1}\\\\
\boldsymbol{\mu}' &=
-\boldsymbol{\Sigma}'
\boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}\\\\
\kappa_1 &=
\frac{
\left\vert \boldsymbol{\Sigma}' \right\vert^{1 / 2}
}{
\left\vert \boldsymbol{\Sigma} \right\vert^{1 / 2}
}
\exp\left[
(\mathbf{x}_i - \boldsymbol{\mu})^\top
\left(
\boldsymbol{\Sigma}^{-1} -
\boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi}
\boldsymbol{\Sigma}' \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1}
\right)
(\mathbf{x}_i - \boldsymbol{\mu})
\right]^{-0.5}\end{split}\]
(b)
\[\begin{split}\hat{\boldsymbol{\Sigma}}
&= \left( \boldsymbol{\Sigma}'^{-1} + \mathbf{I}^{-1} \right)^{-1}\\
&= \left(
\boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} +
\mathbf{I}
\right)^{-1}\\\\
\hat{\boldsymbol{\mu}}
&= \hat{\boldsymbol{\Sigma}}
\left(
\boldsymbol{\Sigma}'^{-1} \left(
\boldsymbol{\Phi}' \mathbf{x}_i + \boldsymbol{\mu}'
\right) +
\mathbf{I}^{-1} \boldsymbol{0}
\right)\\
&= \hat{\boldsymbol{\Sigma}}
\boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1}
(\mathbf{x}_i - \boldsymbol{\mu})\\\\
\kappa_2
&= \NormDist_{\boldsymbol{\Phi}' \mathbf{x}_i + \boldsymbol{\mu}'}\left[
\boldsymbol{0}, \boldsymbol{\Sigma}' + \mathbf{I}
\right]\\
&= \NormDist_{\boldsymbol{0}}\left[
\boldsymbol{\Phi}' \mathbf{x}_i + \boldsymbol{\mu}',
\boldsymbol{\Sigma}' + \mathbf{I}
\right]\\
&= \frac{1}{
(2 \pi)^{D / 2} \lvert \boldsymbol{\Sigma}' + \mathbf{I} \rvert^{1 / 2}
}
\exp\left[
\left( \boldsymbol{\Phi}' \mathbf{x}_i + \boldsymbol{\mu}' \right)^\top
\left( \boldsymbol{\Sigma}' + \mathbf{I} \right)^{-1}
\left( \boldsymbol{\Phi}' \mathbf{x}_i + \boldsymbol{\mu}' \right)
\right]^{-0.5}\\
&= \frac{1}{
(2 \pi)^{D / 2} \lvert \boldsymbol{\Sigma}' + \mathbf{I} \rvert^{1 / 2}
}
\exp\left[
(\mathbf{x}_i - \boldsymbol{\mu})^\top
\boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \boldsymbol{\Sigma}'
\left( \boldsymbol{\Sigma}' + \mathbf{I} \right)^{-1}
\boldsymbol{\Sigma}' \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1}
(\mathbf{x}_i - \boldsymbol{\mu})
\right]^{-0.5}\end{split}\]
(c)
\[\begin{split}\kappa_1 \kappa_2
&= \frac{
\left\vert \boldsymbol{\Sigma}' \right\vert^{1 / 2}
}{
\left\vert \boldsymbol{\Sigma} \right\vert^{1 / 2}
}
\exp\left[
(\mathbf{x}_i - \boldsymbol{\mu})^\top
\left(
\boldsymbol{\Sigma}^{-1} -
\boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \boldsymbol{\Sigma}'
\boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1}
\right)
(\mathbf{x}_i - \boldsymbol{\mu})
\right]^{-0.5}
\NormDist_{\boldsymbol{0}}\left[
\boldsymbol{\Phi}' \mathbf{x}_i + \boldsymbol{\mu}',
\boldsymbol{\Sigma}' + \mathbf{I}
\right]\\
&= \frac{
\left\vert \boldsymbol{\Sigma}' \right\vert^{1 / 2}
}{
(2 \pi)^{D / 2} \left\vert \boldsymbol{\Sigma} \right\vert^{1 / 2}
\left\vert \boldsymbol{\Sigma}' + \mathbf{I} \right\vert^{1 / 2}
}
\exp\left[
(\mathbf{x}_i - \boldsymbol{\mu})^\top
\left(
\boldsymbol{\Sigma}^{-1} -
\boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \boldsymbol{\Sigma}'
\boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} +
\boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \boldsymbol{\Sigma}'
\left( \boldsymbol{\Sigma}' + \mathbf{I} \right)^{-1}
\boldsymbol{\Sigma}' \boldsymbol{\Phi}^\top
\boldsymbol{\Sigma}^{-1}
\right)
(\mathbf{x}_i - \boldsymbol{\mu})
\right]^{-0.5}\\
&= \frac{1}{
(2 \pi)^{D / 2} \left\vert \boldsymbol{\Sigma} \right\vert^{1 / 2}
\left\vert \boldsymbol{\Sigma}'^{-1} + \mathbf{I} \right\vert^{1 / 2}
}
\exp\left[
(\mathbf{x} - \boldsymbol{\mu})^\top
\left(
\boldsymbol{\Sigma}^{-1} -
\boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi}
\left(
\boldsymbol{\Sigma}' -
\boldsymbol{\Sigma}'
\left( \boldsymbol{\Sigma}' + \mathbf{I} \right)^{-1}
\boldsymbol{\Sigma}'
\right)
\boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1}
\right)
(\mathbf{x} - \boldsymbol{\mu})
\right]^{-0.5}\\
&= \frac{1}{
(2 \pi)^{D / 2}
\left\vert
\boldsymbol{\Sigma} + \boldsymbol{\Phi} \boldsymbol{\Phi}^\top
\right\vert^{1 / 2}
}
\exp\left[
(\mathbf{x} - \boldsymbol{\mu})^\top
\left(
\boldsymbol{\Sigma}^{-1} -
\boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi}
\left( \mathbf{I} + \boldsymbol{\Sigma}'^{-1} \right)^{-1}
\boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1}
\right)
(\mathbf{x} - \boldsymbol{\mu})
\right]^{-0.5}
& \quad & \text{(c.1) and (d)}\\
&= \frac{1}{
(2 \pi)^{D / 2}
\left\vert
\boldsymbol{\Sigma} + \boldsymbol{\Phi} \boldsymbol{\Phi}^\top
\right\vert^{1 / 2}
}
\exp\left[
(\mathbf{x} - \boldsymbol{\mu})^\top
\left(
\boldsymbol{\Sigma} + \boldsymbol{\Phi} \boldsymbol{\Phi}^\top
\right)^{-1}
(\mathbf{x} - \boldsymbol{\mu})
\right]^{-0.5}
& \quad & \text{Sherman–Morrison–Woodbury formula}\\
&= \NormDist_{\mathbf{x}_i}\left[
\boldsymbol{\mu},
\boldsymbol{\Phi} \boldsymbol{\Phi}^\top + \boldsymbol{\Sigma}
\right]\end{split}\]
(c.1)
\[\begin{split}\left(
\boldsymbol{\Sigma}' -
\boldsymbol{\Sigma}'
\left( \mathbf{I} + \boldsymbol{\Sigma}'\right)^{-1} \boldsymbol{\Sigma}'
\right)
\left( \mathbf{I} + \boldsymbol{\Sigma}'^{-1} \right)
&= \boldsymbol{\Sigma}' + \mathbf{I} -
\boldsymbol{\Sigma}'
\left( \mathbf{I} + \boldsymbol{\Sigma}' \right)^{-1}
\boldsymbol{\Sigma}' -
\boldsymbol{\Sigma}'
\left( \mathbf{I} + \boldsymbol{\Sigma}' \right)^{-1}\\
&= \mathbf{I} +
\boldsymbol{\Sigma}' \left[
\mathbf{I} - \left( \mathbf{I} + \boldsymbol{\Sigma}' \right)^{-1}
\right] -
\boldsymbol{\Sigma}'
\left( \mathbf{I} + \boldsymbol{\Sigma}' \right)^{-1}
\boldsymbol{\Sigma}'\\
&= \mathbf{I}
& \quad & \text{(c.2)}\\
\left(
\boldsymbol{\Sigma}' -
\boldsymbol{\Sigma}'
\left( \mathbf{I} + \boldsymbol{\Sigma}' \right)^{-1}
\boldsymbol{\Sigma}'
\right)
&= \left( \mathbf{I} + \boldsymbol{\Sigma}'^{-1} \right)^{-1}\end{split}\]
(c.2)
\[\begin{split}\left( \mathbf{I} + \boldsymbol{\Sigma}' \right) \boldsymbol{\Sigma}'^{-1}
&= \boldsymbol{\Sigma}'^{-1} + \mathbf{I}
& \quad & \text{Exercise 5.9}\\
\boldsymbol{\Sigma}'^{-1}
&= \left( \mathbf{I} + \boldsymbol{\Sigma}' \right)^{-1}
\left( \boldsymbol{\Sigma}'^{-1} + \mathbf{I} \right)\\
\boldsymbol{\Sigma}'^{-1} -
\left( \mathbf{I} + \boldsymbol{\Sigma}' \right)^{-1}
\boldsymbol{\Sigma}'^{-1}
&= \left( \mathbf{I} + \boldsymbol{\Sigma}' \right)^{-1}\\
\mathbf{I} - \left( \mathbf{I} + \boldsymbol{\Sigma}' \right)^{-1}
&= \left( \mathbf{I} + \boldsymbol{\Sigma}' \right)^{-1}
\boldsymbol{\Sigma}'\\
\boldsymbol{\Sigma}' \left[
\mathbf{I} - \left( \mathbf{I} + \boldsymbol{\Sigma}' \right)^{-1}
\right]
&= \boldsymbol{\Sigma}'
\left( \mathbf{I} + \boldsymbol{\Sigma}' \right)^{-1}
\boldsymbol{\Sigma}'\end{split}\]
(d)
\[\begin{split}\left(
\left\vert \boldsymbol{\Sigma} \right\vert
\left\vert \boldsymbol{\Sigma}'^{-1} + \mathbf{I} \right\vert
\right)^{-1 / 2}
&= \left(
\left\vert \boldsymbol{\Sigma} \right\vert
\left\vert
\mathbf{I} +
\boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi}
\right\vert
\right)^{-1 / 2}\\
&= \left\vert
\boldsymbol{\Sigma} + \boldsymbol{\Phi} \boldsymbol{\Phi}^\top
\right\vert^{-1 / 2}
& \quad & \text{Sylvester's determinant theorem}\end{split}\]
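A numpy sketch of the resulting E-step quantities \(\hat{\boldsymbol{\mu}}\) and \(\hat{\boldsymbol{\Sigma}}\) (7.35), cross-checked against the standard Gaussian conditioning formulas applied to the joint over \((\mathbf{x}, \mathbf{h})\); all parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
D, K = 4, 2
mu = rng.normal(size=D)
Phi = rng.normal(size=(D, K))
Sigma = np.diag(rng.uniform(0.1, 0.5, D))
x = rng.normal(size=D)

# Posterior q(h) = Norm[mu_hat, Sigma_hat] as derived above.
Sigma_inv = np.linalg.inv(Sigma)
Sigma_hat = np.linalg.inv(Phi.T @ Sigma_inv @ Phi + np.eye(K))
mu_hat = Sigma_hat @ Phi.T @ Sigma_inv @ (x - mu)

# Cross-check: condition the joint Gaussian over (x, h) directly.
S_xx = Phi @ Phi.T + Sigma
mu_hat2 = Phi.T @ np.linalg.solve(S_xx, x - mu)
Sigma_hat2 = np.eye(K) - Phi.T @ np.linalg.solve(S_xx, Phi)

print(np.allclose(mu_hat, mu_hat2), np.allclose(Sigma_hat, Sigma_hat2))  # True True
```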
Exercise 7.10
\[\begin{split}\hat{\boldsymbol{\theta}}
&= \argmax_\boldsymbol{\theta}
\sum_{i = 1}^I \int q_i(\mathbf{h}_i)
\log Pr(\mathbf{x}_i, \mathbf{h}_i \mid \boldsymbol{\theta}) d\mathbf{h}_i
& \quad & \text{(7.12)}\\
&= \argmax_\boldsymbol{\theta} \sum_{i = 1}^I
\E\left[
\log Pr(\mathbf{x}_i, \mathbf{h}_i \mid \boldsymbol{\theta})
\right]
& \quad & \text{(7.36)}\\
&= \argmax_\boldsymbol{\theta} L(\boldsymbol{\theta})\end{split}\]
where \(L\) is the expected complete-data log likelihood given by
\[\begin{split}L
&= -\frac{1}{2} \sum_{i = 1}^I \E\left[
D \log(2\pi) + \log \lvert \boldsymbol{\Sigma} \rvert +
(\mathbf{x}_i - \boldsymbol{\mu} - \boldsymbol{\Phi} \mathbf{h}_i)^\top
\boldsymbol{\Sigma}^{-1}
(\mathbf{x}_i - \boldsymbol{\mu} - \boldsymbol{\Phi} \mathbf{h}_i)
\right]
& \quad & \text{(7.37)}\\
&= -\frac{1}{2} \sum_{i = 1}^I \E\left[
D \log(2\pi) + \log \lvert \boldsymbol{\Sigma} \rvert +
\mathbf{x}_i^\top \boldsymbol{\Sigma}^{-1} \mathbf{x}_i +
\boldsymbol{\mu}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} +
\mathbf{h}_i^\top \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1}
\boldsymbol{\Phi} \mathbf{h}_i +
2 \boldsymbol{\mu}^\top \boldsymbol{\Sigma}^{-1}
\boldsymbol{\Phi} \mathbf{h}_i -
2 \mathbf{x}_i^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} -
2 \mathbf{x}_i^\top \boldsymbol{\Sigma}^{-1}
\boldsymbol{\Phi} \mathbf{h}_i
\right].\end{split}\]
(a)
\[\begin{split}\frac{\partial L}{\partial \boldsymbol{\mu}}
&= -\frac{1}{2} \sum_{i = 1}^I \E\left[
2 \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} +
2 \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \mathbf{h}_i -
2 \boldsymbol{\Sigma}^{-1} \mathbf{x}_i
\right]
& \quad & \text{(C.33), (C.27), (C.28)}\\
0
&= \sum_{i = 1}^I
-\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} -
\boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \E[\mathbf{h}_i] +
\boldsymbol{\Sigma}^{-1} \mathbf{x}_i
& \quad & \E \text{ is a linear operator}\\
&= \sum_{i = 1}^I
-\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} -
\boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \left(
\left(
\boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} +
\mathbf{I}
\right)^{-1}
\boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1}
\left( \mathbf{x}_i - \boldsymbol{\mu} \right)
\right) +
\boldsymbol{\Sigma}^{-1} \mathbf{x}_i
& \quad & \text{(7.35)}\\
&= \sum_{i = 1}^I
\left(
\boldsymbol{\Sigma}^{-1} -
\boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \left(
\boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} +
\mathbf{I}
\right)^{-1}
\boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1}
\right) \mathbf{x}_i -
\left(
\boldsymbol{\Sigma}^{-1} -
\boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \left(
\boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} +
\mathbf{I}
\right)^{-1}
\boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1}
\right) \boldsymbol{\mu}\\
&= \sum_{i = 1}^I
\left(
\boldsymbol{\Sigma} + \boldsymbol{\Phi} \boldsymbol{\Phi}^\top
\right)^{-1} \mathbf{x}_i -
\left(
\boldsymbol{\Sigma} + \boldsymbol{\Phi} \boldsymbol{\Phi}^\top
\right)^{-1} \boldsymbol{\mu}
& \quad & \text{Sherman-Morrison-Woodbury formula}\\
\boldsymbol{\mu} &= \frac{1}{I} \sum_{i = 1}^I \mathbf{x}_i\end{split}\]
(b)
\[\begin{split}\frac{\partial L}{\partial \boldsymbol{\Phi}}
&= -\frac{1}{2} \sum_{i = 1}^I
\E\left[
2 \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi}
\mathbf{h}_i \mathbf{h}_i^\top +
2 \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} \mathbf{h}_i^\top -
2 \boldsymbol{\Sigma}^{-1} \mathbf{x}_i \mathbf{h}_i^\top
\right]
& \quad & \text{(C.34), (C.29)}\\
0
&= \sum_{i = 1}^I
-\boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi}
\E\left[ \mathbf{h}_i \mathbf{h}_i^\top \right] -
\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}
\E\left[ \mathbf{h}_i^\top \right] +
\boldsymbol{\Sigma}^{-1} \mathbf{x}_i
\E\left[ \mathbf{h}_i^\top \right]
& \quad & \E \text{ is a linear operator}\\
\boldsymbol{\Phi}
&= \left(
\sum_{i = 1}^I
(\mathbf{x}_i - \boldsymbol{\mu}) \E\left[ \mathbf{h}_i^\top \right]
\right)
\left(
\sum_{i = 1}^I \E\left[ \mathbf{h}_i \mathbf{h}_i^\top \right]
\right)^{-1}\end{split}\]
(c)
\[\begin{split}\DeclareMathOperator{\diag}{\mathrm{diag}}
\frac{\partial L}{\partial \boldsymbol{\Sigma}}
&= -\frac{1}{2} \sum_{i = 1}^I
\E\left[
\boldsymbol{\Sigma}^{-\top} -
\boldsymbol{\Sigma}^{-\top}
\left(
\mathbf{x}_i - \boldsymbol{\mu} - \boldsymbol{\Phi} \mathbf{h}_i
\right)
\left(
\mathbf{x}_i - \boldsymbol{\mu} - \boldsymbol{\Phi} \mathbf{h}_i
\right)^\top
\boldsymbol{\Sigma}^{-\top}
\right]
& \quad & \text{(C.38) and Matrix Cookbook (61)}\\
\boldsymbol{\Sigma}
&= \frac{1}{I} \sum_{i = 1}^I
\E\left[
\left(
\mathbf{x}_i - \boldsymbol{\mu} - \boldsymbol{\Phi} \mathbf{h}_i
\right)
\left(
\mathbf{x}_i - \boldsymbol{\mu} - \boldsymbol{\Phi} \mathbf{h}_i
\right)^\top
\right]
& \quad & \E \text{ is a linear operator}\\
&= \frac{1}{I} \sum_{i = 1}^I
(\mathbf{x}_i - \boldsymbol{\mu})
(\mathbf{x}_i - \boldsymbol{\mu})^\top +
\E\left[
2 \boldsymbol{\Phi} \mathbf{h}_i \boldsymbol{\mu}^\top -
2 \boldsymbol{\Phi} \mathbf{h}_i \mathbf{x}_i^\top +
\boldsymbol{\Phi} \mathbf{h}_i
\mathbf{h}_i^\top \boldsymbol{\Phi}^\top
\right]
& \quad & \text{expanded version of (7.37)}\\
&= \frac{1}{I} \sum_{i = 1}^I
(\mathbf{x}_i - \boldsymbol{\mu})
(\mathbf{x}_i - \boldsymbol{\mu})^\top -
2 \boldsymbol{\Phi} \E[\mathbf{h}_i]
\left( \mathbf{x}_i - \boldsymbol{\mu}\right)^\top +
\boldsymbol{\Phi} \E\left[ \mathbf{h}_i \mathbf{h}_i^\top \right]
\boldsymbol{\Phi}^\top\\
&= \frac{1}{I} \sum_{i = 1}^I
(\mathbf{x}_i - \boldsymbol{\mu})
(\mathbf{x}_i - \boldsymbol{\mu})^\top -
\boldsymbol{\Phi} \E[\mathbf{h}_i]
\left( \mathbf{x}_i - \boldsymbol{\mu} \right)^\top
& \quad & \text{(b)}\\
&= \frac{1}{I} \sum_{i = 1}^I \diag\left[
(\mathbf{x}_i - \boldsymbol{\mu})
(\mathbf{x}_i - \boldsymbol{\mu})^\top -
\boldsymbol{\Phi} \E[\mathbf{h}_i]
\left( \mathbf{x}_i - \boldsymbol{\mu}\right)^\top
\right]
& \quad & \boldsymbol{\Sigma} \text{ diagonal constraint.}\end{split}\]
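Putting Exercises 7.9 and 7.10 together, a sketch of one EM iteration for the factor analyzer; the function interface is my own, and the data matrix X is a placeholder of shape (I, D):

```python
import numpy as np

def factor_analysis_em_step(X, mu, Phi, Sigma_diag):
    """One EM iteration for factor analysis.

    X: (I, D) data; mu: (D,) mean; Phi: (D, K) loadings;
    Sigma_diag: (D,) diagonal of the noise covariance.
    """
    I, D = X.shape
    K = Phi.shape[1]
    Xc = X - mu                                      # mu is the data mean (part a)

    # E-step (Exercise 7.9): E[h_i] and sum_i E[h_i h_i^T] under q_i(h_i).
    Sigma_inv = np.diag(1.0 / Sigma_diag)
    Sigma_hat = np.linalg.inv(Phi.T @ Sigma_inv @ Phi + np.eye(K))
    E_h = Xc @ Sigma_inv @ Phi @ Sigma_hat           # (I, K), rows are E[h_i]^T
    E_hh = I * Sigma_hat + E_h.T @ E_h               # sum_i E[h_i h_i^T]

    # M-step (Exercise 7.10, parts b and c).
    Phi = (Xc.T @ E_h) @ np.linalg.inv(E_hh)         # loading matrix update
    # diag[(x_i - mu)(x_i - mu)^T - Phi E[h_i](x_i - mu)^T], averaged over i
    Sigma_diag = np.mean(Xc**2 - (E_h @ Phi.T) * Xc, axis=0)
    return Phi, Sigma_diag
```

A typical driver would set \(\boldsymbol{\mu}\) to the data mean once (part a) and iterate this step until the log likelihood stops improving.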