Modeling Complex Data Densities

Hidden Variables

Hidden (latent) variables express the target density \(Pr(x)\) as the marginalization of a joint density \(Pr(x, h)\) such that

\[Pr(x \mid \theta) = \int Pr(x, h \mid \theta) dh.\]

The joint density \(Pr(x, h)\) can be chosen to be much simpler than the target density \(Pr(x)\) while producing complex marginal distributions when integrated.

Maximizing the log marginal likelihood directly (7.7) is a constrained nonlinear optimization problem. Maximizing it indirectly through the EM formulation (7.8) alternates steps that often have closed-form solutions.

Expectation Maximization

Figure 7.5 is a beautiful illustration of the EM algorithm.

Factor Analysis

The factor analyzer describes a linear subspace with a full covariance model as

\[\DeclareMathOperator{\NormDist}{Norm} Pr(\mathbf{x}) = \NormDist_{\mathbf{x}}\left[ \boldsymbol{\mu}, \boldsymbol{\Phi} \boldsymbol{\Phi}^\top + \boldsymbol{\Sigma} \right].\]

Probabilistic principal component analysis is an instance of the factor analyzer in which \(\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}\) is a spherical covariance. It has slightly fewer parameters and can be fit in closed form.
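
As a quick illustration (a minimal sketch, assuming scikit-learn is available; its FactorAnalysis exposes the loadings as components_ and the diagonal noise as noise_variance_), the fitted parameters can be recombined to check that \(\boldsymbol{\Phi} \boldsymbol{\Phi}^\top + \boldsymbol{\Sigma}\) approximates the sample covariance:

```python
# Minimal sketch: fit a factor analyzer and check Phi Phi^T + Sigma vs. sample covariance.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # toy correlated data

fa = FactorAnalysis(n_components=2).fit(X)
Phi = fa.components_.T                       # D x K factor loadings
Sigma = np.diag(fa.noise_variance_)          # diagonal noise covariance
model_cov = Phi @ Phi.T + Sigma              # Phi Phi^T + Sigma
print(np.round(model_cov - np.cov(X, rowvar=False), 2))    # residual should be small
```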

Exercise 7.1

The world state \(w \in \{0, 1\}\) is a discrete label indicating whether the orange is ripe or not.

The observed data \(\mathbf{x} \in \mathbb{R}^3\) describes the averaged RGB color of a segmented orange.

Suppose the given labeled training data \(\left\{ \mathbf{x}_i, w_i \right\}_{i = 1}^I\) describes the multimodal color space of ripe and unripe oranges. Since the pixels have been post-processed, outliers can be ignored. The input dimension is low, so there is no need for factor analysis.

One way to model this is to separately fit a mixture of Gaussians for each discrete label as

\[Pr(\mathbf{x} \mid w) = \sum_{k = 1}^K \lambda_{wk} \NormDist_\mathbf{x}\left[ \boldsymbol{\mu}_{wk}, \boldsymbol{\Sigma}_{wk} \right].\]

There is no prior knowledge about whether an orange is ripe or unripe when the system is used, hence \(Pr(w = 0) = Pr(w = 1) = 0.5\).

Applying Bayes’ rule to compute the posterior over \(w\) gives

\[Pr\left( w^\ast = 1 \mid \mathbf{x}^\ast \right) = \frac{ Pr\left( w^\ast = 1, \mathbf{x}^\ast \right) }{ Pr\left( \mathbf{x}^\ast \right) } = \frac{ Pr\left( \mathbf{x}^\ast \mid w^\ast = 1 \right) Pr\left( w^\ast = 1 \right) }{ \sum_{w = 0}^{1} Pr\left( \mathbf{x}^\ast \mid w^\ast = w \right) Pr\left( w^\ast = w \right) },\]

where each class-conditional term \(Pr(\mathbf{x}^\ast \mid w^\ast = w)\) is evaluated under the corresponding mixture of Gaussians.
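
A minimal sketch of this classifier, assuming scikit-learn's GaussianMixture is available and picking an arbitrary \(K = 3\) components per class:

```python
# One Gaussian mixture per label and a uniform prior Pr(w); K = 3 is an arbitrary choice.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_mixtures(X, w, K=3):
    """Fit Pr(x | w) = sum_k lambda_wk Norm[mu_wk, Sigma_wk] for w in {0, 1}."""
    return {label: GaussianMixture(n_components=K).fit(X[w == label])
            for label in (0, 1)}

def posterior_ripe(models, x_star):
    """Pr(w* = 1 | x*) with Pr(w = 0) = Pr(w = 1) = 0.5."""
    lik = np.array([np.exp(models[label].score_samples(x_star[None]))[0]
                    for label in (0, 1)])
    return 0.5 * lik[1] / (0.5 * lik[0] + 0.5 * lik[1])
```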

Exercise 7.2

Erroneous training labels can be viewed as outliers, so a mixture of multivariate t-distributions, whose heavier tails are more robust to outliers, can be used instead.

Exercise 7.3

Rewriting (7.18) with a Lagrange multiplier \(\nu\) enforcing \(\sum_k \lambda_k = 1\) gives

\[\begin{split}L &= \sum_{i = 1}^I \sum_{k = 1}^K r_{ik} \log\left( \lambda_k \NormDist_{\mathbf{x}_i}\left[ \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k \right] \right) + \nu \left( \sum_{k = 1}^K \lambda_k - 1 \right)\\ &= \sum_{i = 1}^I \sum_{k = 1}^K r_{ik} \left( \log \lambda_k - \frac{D}{2} \log 2 \pi - \frac{1}{2} \log \lvert \boldsymbol{\Sigma}_k \rvert - \frac{1}{2} (\mathbf{x}_i - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_k) \right) + \nu \left( \sum_{k = 1}^K \lambda_k - 1 \right).\end{split}\]

(a)

\[\begin{split}\frac{\partial L}{\partial \lambda_k} &= \sum_{i = 1}^I \frac{\partial}{\partial \lambda_k} r_{ik} \log \lambda_k + \nu\\ 0 &= \sum_{i = 1}^I \frac{r_{ik}}{\lambda_k} + \nu & \quad & r_{ik} = q_i(h_i) \text{ is a constant in the M-step}\\ \lambda_k &= -\frac{\sum_{i = 1}^I r_{ik}}{\nu}\\ &= \frac{\sum_{i = 1}^I r_{ik}}{\sum_{j = 1}^K \sum_{i = 1}^I r_{ij}}\end{split}\]

since

\[\begin{split}0 &= \sum_{i = 1}^I r_{ik} + \nu \lambda_k\\ -\nu \sum_{j = 1}^K \lambda_j &= \sum_{j = 1}^K \sum_{i = 1}^I r_{ij}\\ \nu &= -\sum_{j = 1}^K \sum_{i = 1}^I r_{ij} & \quad & \sum_k \lambda_k = 1.\end{split}\]

(b)

\[\begin{split}\frac{\partial L}{\partial \boldsymbol{\mu}_k} &= \sum_{i = 1}^I \frac{\partial}{\partial \boldsymbol{\mu}_k} \frac{-r_{ik}}{2} \left( \mathbf{x}_i^\top \boldsymbol{\Sigma}_k^{-1} \mathbf{x}_i - 2 \mathbf{x}_i^\top \boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k + \boldsymbol{\mu}_k^\top \boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k \right)\\ \boldsymbol{0} &= \sum_{i = 1}^I \frac{r_{ik}}{2} \left( 2 \boldsymbol{\Sigma}_k^{-1} \mathbf{x}_i - 2 \boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k \right) & \quad & \text{(C.28) and (C.33)}\\ \boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k \sum_{i = 1}^I r_{ik} &= \boldsymbol{\Sigma}_k^{-1} \sum_{i = 1}^I r_{ik} \mathbf{x}_i\\ \boldsymbol{\mu}_k &= \frac{ \sum_{i = 1}^I r_{ik} \mathbf{x}_i }{ \sum_{i = 1}^I r_{ik} }\end{split}\]

(c)

\[\begin{split}\frac{\partial L}{\partial \boldsymbol{\Sigma}_k} &= \sum_{i = 1}^I \frac{\partial}{\partial \boldsymbol{\Sigma}_k} \frac{-r_{ik}}{2} \left( \log \left\vert \boldsymbol{\Sigma}_k \right\vert + (\mathbf{x}_i - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_k) \right)\\ \boldsymbol{0} &= \sum_{i = 1}^I \frac{r_{ik}}{2} \left[ -\boldsymbol{\Sigma}_k^{-\top} + \boldsymbol{\Sigma}_k^{-\top} (\mathbf{x}_i - \boldsymbol{\mu}_k) (\mathbf{x}_i - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-\top} \right] & \quad & \text{(C.38) and Matrix Cookbook Section 2.2}\\ \boldsymbol{\Sigma}_k^{-1} \sum_{i = 1}^I r_{ik} &= \boldsymbol{\Sigma}_k^{-1} \left( \sum_{i = 1}^I r_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k) (\mathbf{x}_i - \boldsymbol{\mu}_k)^\top \right) \boldsymbol{\Sigma}_k^{-1}\\ \boldsymbol{\Sigma}_k &= \frac{ \sum_{i = 1}^I r_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k) (\mathbf{x}_i - \boldsymbol{\mu}_k)^\top }{ \sum_{i = 1}^I r_{ik} }\end{split}\]
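
The E-step responsibilities \(r_{ik}\) together with the three closed-form M-step updates above give the full EM algorithm for a mixture of Gaussians. A minimal sketch, assuming numpy and scipy, with no safeguards for degenerate covariances or convergence checks:

```python
# Minimal EM for a mixture of Gaussians implementing the updates from (a)-(c).
import numpy as np
from scipy.stats import multivariate_normal

def em_mog(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    I, D = X.shape
    lam = np.full(K, 1.0 / K)
    mu = X[rng.choice(I, K, replace=False)]
    Sigma = np.stack([np.cov(X, rowvar=False) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities r_ik
        r = np.stack([lam[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                      for k in range(K)], axis=1)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form updates from (a), (b), (c)
        Nk = r.sum(axis=0)
        lam = Nk / I
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            d = X - mu[k]
            Sigma[k] = (r[:, k, None] * d).T @ d / Nk[k]
    return lam, mu, Sigma
```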

Exercise 7.4

\[\DeclareMathOperator{\BetaDist}{Beta} Pr(x \mid \boldsymbol{\theta}) = \sum_{k = 1}^K \lambda_k \BetaDist_x[\alpha_k, \beta_k]\]

is a mixture of beta distributions where \(x \in [0, 1]\) is univariate and \(\boldsymbol{\theta} = \{ \alpha_k, \beta_k, \lambda_k \}_{k = 1}^K\).

Define a discrete hidden variable \(h \in \{1 \ldots K\}\) such that

\[\begin{split}\DeclareMathOperator{\CatDist}{Cat} Pr(x \mid h, \boldsymbol{\theta}) &= \BetaDist_x[\alpha_h, \beta_h]\\ Pr(h \mid \boldsymbol{\theta}) &= \CatDist_h[\boldsymbol{\lambda}]\end{split}\]

where \(\boldsymbol{\lambda} = \{ \lambda_1, \ldots, \lambda_K\}\) are the parameters of the categorical distribution.

The original density can be expressed as the marginalization of the previous terms:

\[\begin{split}Pr(x \mid \boldsymbol{\theta}) &= \sum_{h} Pr(x, h \mid \boldsymbol{\theta})\\ &= \sum_{h} Pr(x \mid h, \boldsymbol{\theta}) Pr(h \mid \boldsymbol{\theta})\\ &= \sum_{k = 1}^K \lambda_k \BetaDist_x[\alpha_k, \beta_k].\end{split}\]

In the E-step, the goal is to compute the probability that the \(k\text{th}\) beta distribution was responsible for the \(i\text{th}\) data point:

\[\begin{split}q_i(h_i = k) &= Pr(h_i = k \mid x_i, \boldsymbol{\theta}) & \quad & \text{(7.11)}\\ &= \frac{ Pr(h_i = k, x_i \mid \boldsymbol{\theta}) }{ Pr(x_i \mid \boldsymbol{\theta}) }\\ &= \frac{ Pr(x_i \mid h_i = k, \boldsymbol{\theta}) Pr(h_i = k \mid \boldsymbol{\theta}) }{ Pr(x_i \mid \boldsymbol{\theta}) }\\ &= \frac{ \lambda_k \BetaDist_{x_i}[\alpha_k, \beta_k] }{ \sum_{j = 1}^K \lambda_j \BetaDist_{x_i}[\alpha_j, \beta_j] }\\ &= r_{ik}.\end{split}\]

In the M-step, the goal is to maximize the lower bound approximation with respect to \(\boldsymbol{\theta}\) such that

\[\begin{split}\DeclareMathOperator*{\argmax}{arg\,max} \hat{\boldsymbol{\theta}} &= \argmax_{\boldsymbol{\theta}} \sum_{i = 1}^I \sum_{k = 1}^K q_i(h_i = k) \log Pr(x_i, h_i = k \mid \boldsymbol{\theta}) & \quad & \text{(7.12)}\\ &= \argmax_{\boldsymbol{\theta}} \sum_{i = 1}^I \sum_{k = 1}^K r_{ik} \log\left( \lambda_k \BetaDist_{x_i}[\alpha_k, \beta_k] \right).\end{split}\]
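
Unlike the Gaussian mixture, the weighted beta log-likelihood has no closed-form maximizer in \((\alpha_k, \beta_k)\), so the M-step must be carried out numerically. A minimal sketch of both steps, assuming numpy/scipy (the log-parameterization is just one way of keeping the parameters positive):

```python
# E-step responsibilities and a numerical M-step for the beta mixture.
import numpy as np
from scipy.stats import beta as beta_dist
from scipy.optimize import minimize

def e_step(x, lam, a, b):
    r = lam * np.stack([beta_dist.pdf(x, a[k], b[k]) for k in range(len(lam))], axis=1)
    return r / r.sum(axis=1, keepdims=True)

def m_step(x, r):
    lam = r.mean(axis=0)                       # closed form, as for the Gaussian mixture
    a, b = np.empty(r.shape[1]), np.empty(r.shape[1])
    for k in range(r.shape[1]):
        # maximize the responsibility-weighted beta log-likelihood over (alpha_k, beta_k)
        nll = lambda p: -(r[:, k] * beta_dist.logpdf(x, np.exp(p[0]), np.exp(p[1]))).sum()
        res = minimize(nll, x0=np.zeros(2))    # optimize log alpha, log beta
        a[k], b[k] = np.exp(res.x)
    return lam, a, b
```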

Exercise 7.5

\[\begin{split}\DeclareMathOperator{\GamDist}{Gam} \DeclareMathOperator{\StudDist}{Stud} Pr(x) &= \int Pr(x, h) dh\\ &= \int Pr(x \mid h) Pr(h) dh\\ &= \int_0^\infty \NormDist_x\left[ \mu, \sigma^2 / h \right] \GamDist_h[\nu / 2, \nu / 2] dh & \quad & \text{Gamma distribution is defined on } (0, \infty)\\ &= \int_0^\infty \frac{ h^{1/2} }{ \sqrt{2 \pi \sigma^2} } \exp\left[ -\frac{(x - \mu)^2}{2 \sigma^2} h \right] \frac{ \left( \frac{\nu}{2} \right)^{\nu / 2} }{ \Gamma\left[ \frac{\nu}{2} \right] } \exp\left[ -\frac{\nu}{2} h \right] h^{\nu / 2 - 1} dh\\ &= \frac{ \nu^{\nu / 2} }{ 2^{\nu / 2} \sqrt{2 \pi \sigma^2} \Gamma\left[ \frac{\nu}{2} \right] } \int_0^\infty \exp\left[ -\left( \frac{\nu}{2} + \frac{(x - \mu)^2}{2 \sigma^2} \right) h \right] h^{(\nu - 1) / 2} dh\\ &= \frac{ \nu^{\nu / 2} }{ 2^{\nu / 2} \sqrt{2 \pi \sigma^2} \Gamma\left[ \frac{\nu}{2} \right] } \left( \frac{\nu}{2} + \frac{(x - \mu)^2}{2 \sigma^2} \right)^{-(\nu + 1) / 2} \Gamma\left[ \frac{\nu + 1}{2} \right] & \quad & \int_0^\infty x^n \exp\left[ -a x^b \right] dx = b^{-1} a^{-(n + 1) / b} \Gamma\left[ \frac{n + 1}{b} \right]\\ &= \frac{ \nu^{\nu / 2} }{ \sqrt{\pi \sigma^2} \Gamma\left[ \frac{\nu}{2} \right] } \left( \nu + \frac{(x - \mu)^2}{\sigma^2} \right)^{-(\nu + 1) / 2} \Gamma\left[ \frac{\nu + 1}{2} \right]\\ &= \frac{1}{ \sqrt{\nu \pi \sigma^2} \Gamma\left[ \frac{\nu}{2} \right] } \left( 1 + \frac{(x - \mu)^2}{\nu \sigma^2} \right)^{-(\nu + 1) / 2} \Gamma\left[ \frac{\nu + 1}{2} \right]\\ &= \frac{ \Gamma\left[ \frac{\nu + 1}{2} \right] }{ \sqrt{\nu \pi \sigma^2} \Gamma\left[ \frac{\nu}{2} \right] } \left( 1 + \frac{(x - \mu)^2}{\nu \sigma^2} \right)^{-\frac{\nu + 1}{2}}\\ &= \StudDist_x\left[ \mu, \sigma^2, \nu \right].\end{split}\]
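
A quick numerical sanity check of this marginalization, assuming scipy (note that scipy's gamma is parameterized by a scale, i.e. \(1 / \beta = 2 / \nu\)):

```python
# Integrate Norm[mu, sigma^2/h] Gam[nu/2, nu/2] over h and compare with the Student t pdf.
import numpy as np
from scipy import integrate, stats

mu, sigma, nu, x = 1.0, 2.0, 5.0, 0.7

def integrand(h):
    return (stats.norm.pdf(x, mu, sigma / np.sqrt(h))
            * stats.gamma.pdf(h, a=nu / 2, scale=2 / nu))

marginal, _ = integrate.quad(integrand, 0, np.inf)
student = stats.t.pdf(x, df=nu, loc=mu, scale=sigma)
print(marginal, student)   # the two values should agree
```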

Exercise 7.6

\[\begin{split}\frac{\partial}{\partial z} \GamDist_z[\alpha, \beta] &= \frac{\partial}{\partial z} \frac{\beta^\alpha}{\Gamma[\alpha]} \exp[-\beta z] z^{\alpha - 1}\\ 0 &= \frac{\beta^\alpha}{\Gamma[\alpha]} \left( -\beta \exp[-\beta z] z^{\alpha - 1} + (\alpha - 1) \exp[-\beta z] z^{\alpha - 2} \right)\\ \beta z^{\alpha - 1} &= (\alpha - 1) z^{\alpha - 2}\\ z &= \frac{\alpha - 1}{\beta}\end{split}\]
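
This stationary point is the mode of the gamma distribution (for \(\alpha \geq 1\)). A quick numerical check, assuming scipy:

```python
# Verify that (alpha - 1) / beta maximizes Gam[alpha, beta]; scipy's gamma uses scale = 1/beta.
import numpy as np
from scipy import stats, optimize

alpha, beta = 3.0, 2.0
neg_pdf = lambda z: -stats.gamma.pdf(z, a=alpha, scale=1 / beta)
numeric_mode = optimize.minimize_scalar(neg_pdf, bounds=(1e-6, 10), method="bounded").x
print(numeric_mode, (alpha - 1) / beta)   # both approximately 1.0
```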

Exercise 7.7

\[\begin{split}& \NormDist_{\mathbf{x}}[\boldsymbol{\mu}, \boldsymbol{\Sigma} / h] \GamDist_{h}[\nu / 2, \nu / 2]\\ &= \frac{ h^{D / 2} }{ (2 \pi)^{D / 2} \lvert \boldsymbol{\Sigma} \rvert^{1 / 2} } \exp\left[ -\frac{ (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) }{2} h \right] \frac{\nu^{\nu / 2}}{2^{\nu / 2} \Gamma[\nu / 2]} \exp\left[ -\frac{\nu}{2} h \right] h^{\nu / 2 - 1}\\ &= \frac{ \nu^{\nu / 2} }{ (2 \pi)^{D / 2} \lvert \boldsymbol{\Sigma} \rvert^{1 / 2} \Gamma[\nu / 2] 2^{\nu / 2} } \exp\left[ -\frac{ (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) + \nu }{2} h \right] h^{(\nu + D) / 2 - 1}\\ &= \frac{ \nu^{\nu / 2} }{ (2 \pi)^{D / 2} \lvert \boldsymbol{\Sigma} \rvert^{1 / 2} \Gamma[\nu / 2] 2^{\nu / 2} } \exp[-\beta h] h^{\alpha - 1}\\ &= \kappa \GamDist_h[\alpha, \beta]\end{split}\]

where

\[\begin{split}\alpha &= \frac{\nu + D}{2}\\\\ \beta &= \frac{ (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) + \nu }{2}\\\\ \kappa &= \frac{ \nu^{\nu / 2} \Gamma[\alpha] }{ (2 \pi)^{D / 2} \lvert \boldsymbol{\Sigma} \rvert^{1 / 2} \Gamma[\nu / 2] 2^{\nu / 2} \beta^\alpha }\\ &= \frac{ \nu^{\nu / 2} \Gamma[\alpha] }{ (2 \pi)^{D / 2} \lvert \boldsymbol{\Sigma} \rvert^{1 / 2} \Gamma[\nu / 2] 2^{\nu / 2} } \left( \frac{\nu}{2} \right)^{-\alpha} \left( \frac{ (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) }{\nu} + 1 \right)^{-\alpha}\\ &= \StudDist_{\mathbf{x}}[\boldsymbol{\mu}, \boldsymbol{\Sigma}, \nu].\end{split}\]
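
A numerical check of this rearrangement and of the identification of \(\kappa\) with the multivariate t density, assuming scipy (version 1.6 or later for multivariate_t):

```python
# Check Norm_x[mu, Sigma/h] Gam_h[nu/2, nu/2] = kappa * Gam_h[alpha, beta]
# with kappa equal to the multivariate t density at x.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
D, nu = 3, 4.0
mu = rng.normal(size=D)
A = rng.normal(size=(D, D)); Sigma = A @ A.T + np.eye(D)
x, h = rng.normal(size=D), 1.3

delta = (x - mu) @ np.linalg.solve(Sigma, x - mu)
alpha, beta = (nu + D) / 2, (delta + nu) / 2

lhs = (stats.multivariate_normal.pdf(x, mu, Sigma / h)
       * stats.gamma.pdf(h, a=nu / 2, scale=2 / nu))
kappa = stats.multivariate_t.pdf(x, loc=mu, shape=Sigma, df=nu)
rhs = kappa * stats.gamma.pdf(h, a=alpha, scale=1 / beta)
print(lhs, rhs)   # should match
```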

Exercise 7.8

Let \(\mathbf{x}_i = \boldsymbol{\mu} + \boldsymbol{\Phi} \mathbf{h}_i + \boldsymbol{\epsilon}_i\) where

\[\begin{split}\DeclareMathOperator{\E}{\mathrm{E}} \DeclareMathOperator{\Cov}{\mathrm{Cov}} \begin{gather*} \E[\mathbf{h}_i] = \E[\boldsymbol{\epsilon}_i] = \boldsymbol{0}\\ \Cov(\mathbf{h}_i, \mathbf{h}_i) = \E\left[ (\mathbf{h}_i - \E[\mathbf{h}_i]) (\mathbf{h}_i - \E[\mathbf{h}_i])^\top \right] = \E\left[ \mathbf{h}_i \mathbf{h}_i^\top \right] = I\\ \Cov(\boldsymbol{\epsilon}_i, \boldsymbol{\epsilon}_i) = \E\left[ (\boldsymbol{\epsilon}_i - \E[\boldsymbol{\epsilon}_i]) (\boldsymbol{\epsilon}_i - \E[\boldsymbol{\epsilon}_i])^\top \right] = \E\left[ \boldsymbol{\epsilon}_i \boldsymbol{\epsilon}_i^\top \right] = \boldsymbol{\Sigma}\\ \Cov(\mathbf{h}_i, \boldsymbol{\epsilon}_i) = \E\left[ (\mathbf{h}_i - \E[\mathbf{h}_i]) (\boldsymbol{\epsilon}_i - \E[\boldsymbol{\epsilon}_i])^\top \right] = \E\left[ \mathbf{h}_i \boldsymbol{\epsilon}_i^\top \right] = \boldsymbol{0}\\ \Cov(\boldsymbol{\epsilon}_i, \mathbf{h}_i) = \Cov(\mathbf{h}_i, \boldsymbol{\epsilon}_i)^\top = \boldsymbol{0} \end{gather*}\end{split}\]

(1)

\[\begin{split}\E[\mathbf{x}_i] &= \E[\boldsymbol{\mu}] + \E[\boldsymbol{\Phi} \mathbf{h}_i] + \E[\boldsymbol{\epsilon}_i] & \quad & \text{(2.16)}\\ &= \boldsymbol{\mu} + \boldsymbol{\Phi} \E[\mathbf{h}_i] + \boldsymbol{0} & \quad & \text{(2.14), (2.15), (2.16)}\\ &= \boldsymbol{\mu} + \boldsymbol{\Phi} \boldsymbol{0}\\ &= \boldsymbol{\mu}\end{split}\]

(2)

\[\begin{split}\E\left[ (\mathbf{x}_i - \E[\mathbf{x}_i]) (\mathbf{x}_i - \E[\mathbf{x}_i])^\top \right] &= \E\left[ (\mathbf{x}_i - \boldsymbol{\mu}) (\mathbf{x}_i - \boldsymbol{\mu})^\top \right]\\ &= \E\left[ (\boldsymbol{\Phi} \mathbf{h}_i + \boldsymbol{\epsilon}_i) (\boldsymbol{\Phi} \mathbf{h}_i + \boldsymbol{\epsilon}_i)^\top \right]\\ &= \E\left[ \boldsymbol{\Phi} \mathbf{h}_i \mathbf{h}_i^\top \boldsymbol{\Phi}^\top \right] + \E\left[ \boldsymbol{\Phi} \mathbf{h}_i \boldsymbol{\epsilon}_i^\top \right] + \E\left[ \boldsymbol{\epsilon}_i \mathbf{h}_i^\top \boldsymbol{\Phi}^\top \right] + \E\left[ \boldsymbol{\epsilon}_i \boldsymbol{\epsilon}_i^\top \right] & \quad & \text{(2.16)}\\ &= \boldsymbol{\Phi} \boldsymbol{\Phi}^\top + \boldsymbol{\Phi} \E\left[ \mathbf{h}_i \boldsymbol{\epsilon}_i^\top \right] + \E\left[ \boldsymbol{\epsilon}_i \mathbf{h}_i^\top \right] \boldsymbol{\Phi}^\top + \boldsymbol{\Sigma} & \quad & \text{(2.15), (2.16)}\\ &= \boldsymbol{\Phi} \boldsymbol{\Phi}^\top + \boldsymbol{\Sigma}\end{split}\]
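
A Monte Carlo sketch of these two results, assuming numpy: samples drawn from the generative model should have sample mean close to \(\boldsymbol{\mu}\) and sample covariance close to \(\boldsymbol{\Phi} \boldsymbol{\Phi}^\top + \boldsymbol{\Sigma}\).

```python
# Draw x = mu + Phi h + eps with h ~ Norm[0, I], eps ~ Norm[0, Sigma] and check the moments.
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 4, 2, 200_000
mu = rng.normal(size=D)
Phi = rng.normal(size=(D, K))
Sigma = np.diag(rng.uniform(0.5, 1.5, size=D))      # diagonal noise covariance

h = rng.normal(size=(N, K))
eps = rng.multivariate_normal(np.zeros(D), Sigma, size=N)
X = mu + h @ Phi.T + eps

print(np.max(np.abs(X.mean(axis=0) - mu)))                              # ~ 0
print(np.max(np.abs(np.cov(X, rowvar=False) - (Phi @ Phi.T + Sigma))))  # ~ 0
```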

Exercise 7.9

\[\begin{split}q_i(\mathbf{h}_i) &= Pr(\mathbf{h}_i \mid \mathbf{x}_i, \boldsymbol{\theta}) & \quad & \text{(7.11)}\\ &= \frac{ Pr(\mathbf{x}_i \mid \mathbf{h}_i, \boldsymbol{\theta}) Pr(\mathbf{h}_i \mid \boldsymbol{\theta}) }{ Pr(\mathbf{x}_i \mid \boldsymbol{\theta}) }\\ &= \frac{ \NormDist_{\mathbf{x}_i}\left[ \boldsymbol{\mu} + \boldsymbol{\Phi} \mathbf{h}_i, \boldsymbol{\Sigma} \right] \NormDist_{\mathbf{h}_i}\left[ \boldsymbol{0}, \mathbf{I} \right] }{ \NormDist_{\mathbf{x}_i}\left[ \boldsymbol{\mu}, \boldsymbol{\Phi} \boldsymbol{\Phi}^\top + \boldsymbol{\Sigma} \right] }\\ &= \frac{ \kappa_1 \NormDist_{\mathbf{h}_i}\left[ \boldsymbol{\Phi}' \mathbf{x}_i + \boldsymbol{\mu}', \boldsymbol{\Sigma}' \right] \NormDist_{\mathbf{h}_i}\left[ \boldsymbol{0}, \mathbf{I} \right] }{ \NormDist_{\mathbf{x}_i}\left[ \boldsymbol{\mu}, \boldsymbol{\Phi} \boldsymbol{\Phi}^\top + \boldsymbol{\Sigma} \right] } & \quad & \text{(a), (5.17), and Exercise 5.10}\\ &= \frac{ \kappa_1 \kappa_2 \NormDist_{\mathbf{h}_i}\left[ \hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}} \right] }{ \NormDist_{\mathbf{x}_i}\left[ \boldsymbol{\mu}, \boldsymbol{\Phi} \boldsymbol{\Phi}^\top + \boldsymbol{\Sigma} \right] } & \quad & \text{(b), (5.14), (5.15), and Exercises 5.7 and 5.9}\\ &= \NormDist_{\mathbf{h}_i}\left[ \hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}} \right] & \quad & \text{(c)}\end{split}\]

See Exercise 5.7, Exercise 5.9, and Exercise 5.10 for details.

(a)

\[\begin{split}\boldsymbol{\Sigma}' &= \left( \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \right)^{-1}\\\\ \boldsymbol{\Phi}' &= \boldsymbol{\Sigma}' \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1}\\\\ \boldsymbol{\mu}' &= -\boldsymbol{\Sigma}' \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}\\\\ \kappa_1 &= \frac{ \left\vert \boldsymbol{\Sigma}' \right\vert^{1 / 2} }{ \left\vert \boldsymbol{\Sigma} \right\vert^{1 / 2} } \exp\left[ (\mathbf{x}_i - \boldsymbol{\mu})^\top \left( \boldsymbol{\Sigma}^{-1} - \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \boldsymbol{\Sigma}' \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \right) (\mathbf{x}_i - \boldsymbol{\mu}) \right]^{-0.5}\end{split}\]

(b)

\[\begin{split}\hat{\boldsymbol{\Sigma}} &= \left( \boldsymbol{\Sigma}'^{-1} + \mathbf{I}^{-1} \right)^{-1}\\ &= \left( \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} + \mathbf{I} \right)^{-1}\\\\ \hat{\boldsymbol{\mu}} &= \hat{\boldsymbol{\Sigma}} \left( \boldsymbol{\Sigma}'^{-1} \left( \boldsymbol{\Phi}' \mathbf{x}_i + \boldsymbol{\mu}' \right) + \mathbf{I}^{-1} \boldsymbol{0} \right)\\ &= \hat{\boldsymbol{\Sigma}} \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu})\\\\ \kappa_2 &= \NormDist_{\boldsymbol{\Phi}' \mathbf{x}_i + \boldsymbol{\mu}'}\left[ \boldsymbol{0}, \boldsymbol{\Sigma}' + \mathbf{I} \right]\\ &= \NormDist_{\boldsymbol{0}}\left[ \boldsymbol{\Phi}' \mathbf{x}_i + \boldsymbol{\mu}', \boldsymbol{\Sigma}' + \mathbf{I} \right]\\ &= \frac{1}{ (2 \pi)^{D / 2} \lvert \boldsymbol{\Sigma}' + \mathbf{I} \rvert^{1 / 2} } \exp\left[ \left( \boldsymbol{\Phi}' \mathbf{x}_i + \boldsymbol{\mu}' \right)^\top \left( \boldsymbol{\Sigma}' + \mathbf{I} \right)^{-1} \left( \boldsymbol{\Phi}' \mathbf{x}_i + \boldsymbol{\mu}' \right) \right]^{-0.5}\\ &= \frac{1}{ (2 \pi)^{D / 2} \lvert \boldsymbol{\Sigma}' + \mathbf{I} \rvert^{1 / 2} } \exp\left[ (\mathbf{x}_i - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \boldsymbol{\Sigma}' \left( \boldsymbol{\Sigma}' + \mathbf{I} \right)^{-1} \boldsymbol{\Sigma}' \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \right]^{-0.5}\end{split}\]

(c)

\[\begin{split}\kappa_1 \kappa_2 &= \frac{ \left\vert \boldsymbol{\Sigma}' \right\vert^{1 / 2} }{ \left\vert \boldsymbol{\Sigma} \right\vert^{1 / 2} } \exp\left[ (\mathbf{x}_i - \boldsymbol{\mu})^\top \left( \boldsymbol{\Sigma}^{-1} - \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \boldsymbol{\Sigma}' \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \right) (\mathbf{x}_i - \boldsymbol{\mu}) \right]^{-0.5} \NormDist_{\boldsymbol{0}}\left[ \boldsymbol{\Phi}' \mathbf{x}_i + \boldsymbol{\mu}', \boldsymbol{\Sigma}' + \mathbf{I} \right]\\ &= \frac{ \left\vert \boldsymbol{\Sigma}' \right\vert^{1 / 2} }{ (2 \pi)^{D / 2} \left\vert \boldsymbol{\Sigma} \right\vert^{1 / 2} \left\vert \boldsymbol{\Sigma}' + \mathbf{I} \right\vert^{1 / 2} } \exp\left[ (\mathbf{x}_i - \boldsymbol{\mu})^\top \left( \boldsymbol{\Sigma}^{-1} - \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \boldsymbol{\Sigma}' \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} + \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \boldsymbol{\Sigma}' \left( \boldsymbol{\Sigma}' + \mathbf{I} \right)^{-1} \boldsymbol{\Sigma}' \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \right) (\mathbf{x}_i - \boldsymbol{\mu}) \right]^{-0.5}\\ &= \frac{1}{ (2 \pi)^{D / 2} \left\vert \boldsymbol{\Sigma} \right\vert^{1 / 2} \left\vert \boldsymbol{\Sigma}'^{-1} + \mathbf{I} \right\vert^{1 / 2} } \exp\left[ (\mathbf{x} - \boldsymbol{\mu})^\top \left( \boldsymbol{\Sigma}^{-1} - \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \left( \boldsymbol{\Sigma}' - \boldsymbol{\Sigma}' \left( \boldsymbol{\Sigma}' + \mathbf{I} \right)^{-1} \boldsymbol{\Sigma}' \right) \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \right) (\mathbf{x} - \boldsymbol{\mu}) \right]^{-0.5}\\ &= \frac{1}{ (2 \pi)^{D / 2} \left\vert \boldsymbol{\Sigma} + \boldsymbol{\Phi} \boldsymbol{\Phi}^\top \right\vert^{1 / 2} } \exp\left[ (\mathbf{x} - \boldsymbol{\mu})^\top \left( \boldsymbol{\Sigma}^{-1} - \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \left( \mathbf{I} + \boldsymbol{\Sigma}'^{-1} \right)^{-1} \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \right) (\mathbf{x} - \boldsymbol{\mu}) \right]^{-0.5} & \quad & \text{(c.1) and (d)}\\ &= \frac{1}{ (2 \pi)^{D / 2} \left\vert \boldsymbol{\Sigma} + \boldsymbol{\Phi} \boldsymbol{\Phi}^\top \right\vert^{1 / 2} } \exp\left[ (\mathbf{x} - \boldsymbol{\mu})^\top \left( \boldsymbol{\Sigma} + \boldsymbol{\Phi} \boldsymbol{\Phi}^\top \right)^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right]^{-0.5} & \quad & \text{Sherman–Morrison–Woodbury formula}\\ &= \NormDist_{\mathbf{x}_i}\left[ \boldsymbol{\mu}, \boldsymbol{\Phi} \boldsymbol{\Phi}^\top + \boldsymbol{\Sigma} \right]\end{split}\]

(c.1)

\[\begin{split}\left( \boldsymbol{\Sigma}' - \boldsymbol{\Sigma}' \left( \mathbf{I} + \boldsymbol{\Sigma}'\right)^{-1} \boldsymbol{\Sigma}' \right) \left( \mathbf{I} + \boldsymbol{\Sigma}'^{-1} \right) &= \boldsymbol{\Sigma}' + \mathbf{I} - \boldsymbol{\Sigma}' \left( \mathbf{I} + \boldsymbol{\Sigma}' \right)^{-1} \boldsymbol{\Sigma}' - \boldsymbol{\Sigma}' \left( \mathbf{I} + \boldsymbol{\Sigma}' \right)^{-1}\\ &= \mathbf{I} + \boldsymbol{\Sigma}' \left[ \mathbf{I} - \left( \mathbf{I} + \boldsymbol{\Sigma}' \right)^{-1} \right] - \boldsymbol{\Sigma}' \left( \mathbf{I} + \boldsymbol{\Sigma}' \right)^{-1} \boldsymbol{\Sigma}'\\ &= \mathbf{I} & \quad & \text{(c.2)}\\ \left( \boldsymbol{\Sigma}' - \boldsymbol{\Sigma}' \left( \mathbf{I} + \boldsymbol{\Sigma}' \right)^{-1} \boldsymbol{\Sigma}' \right) &= \left( \mathbf{I} + \boldsymbol{\Sigma}'^{-1} \right)^{-1}\end{split}\]

(c.2)

\[\begin{split}\left( \mathbf{I} + \boldsymbol{\Sigma}' \right) \boldsymbol{\Sigma}'^{-1} &= \boldsymbol{\Sigma}'^{-1} + \mathbf{I} & \quad & \text{Exercise 5.9}\\ \boldsymbol{\Sigma}'^{-1} &= \left( \mathbf{I} + \boldsymbol{\Sigma}' \right)^{-1} \left( \boldsymbol{\Sigma}'^{-1} + \mathbf{I} \right)\\ \boldsymbol{\Sigma}'^{-1} - \left( \mathbf{I} + \boldsymbol{\Sigma}' \right)^{-1} \boldsymbol{\Sigma}'^{-1} &= \left( \mathbf{I} + \boldsymbol{\Sigma}' \right)^{-1}\\ \mathbf{I} - \left( \mathbf{I} + \boldsymbol{\Sigma}' \right)^{-1} &= \left( \mathbf{I} + \boldsymbol{\Sigma}' \right)^{-1} \boldsymbol{\Sigma}'\\ \boldsymbol{\Sigma}' \left[ \mathbf{I} - \left( \mathbf{I} + \boldsymbol{\Sigma}' \right)^{-1} \right] &= \boldsymbol{\Sigma}' \left( \mathbf{I} + \boldsymbol{\Sigma}' \right)^{-1} \boldsymbol{\Sigma}'\end{split}\]

(d)

\[\begin{split}\left( \left\vert \boldsymbol{\Sigma} \right\vert \left\vert \boldsymbol{\Sigma}'^{-1} + \mathbf{I} \right\vert \right)^{-1 / 2} &= \left( \left\vert \boldsymbol{\Sigma} \right\vert \left\vert \mathbf{I} + \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \right\vert \right)^{-1 / 2}\\ &= \left\vert \boldsymbol{\Sigma} + \boldsymbol{\Phi} \boldsymbol{\Phi}^\top \right\vert^{-1 / 2} & \quad & \text{Sylvester's determinant theorem}\end{split}\]
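
The same posterior moments follow from conditioning the joint Gaussian over \((\mathbf{h}, \mathbf{x})\), which gives a convenient numerical cross-check of \(\hat{\boldsymbol{\mu}}\) and \(\hat{\boldsymbol{\Sigma}}\). A minimal sketch, assuming numpy:

```python
# Compare the derived posterior moments with those from joint Gaussian conditioning.
import numpy as np

rng = np.random.default_rng(1)
D, K = 5, 2
mu = rng.normal(size=D)
Phi = rng.normal(size=(D, K))
Sigma = np.diag(rng.uniform(0.5, 2.0, size=D))
x = rng.normal(size=D)

# Joint over (h, x): h ~ Norm[0, I], x | h ~ Norm[mu + Phi h, Sigma].
cov_hh, cov_hx = np.eye(K), Phi.T
cov_xx = Phi @ Phi.T + Sigma
cond_mean = cov_hx @ np.linalg.solve(cov_xx, x - mu)
cond_cov = cov_hh - cov_hx @ np.linalg.solve(cov_xx, cov_hx.T)

# Derived E-step moments: Sig_hat = (Phi^T Sigma^-1 Phi + I)^-1, mu_hat = Sig_hat Phi^T Sigma^-1 (x - mu)
Sig_hat = np.linalg.inv(Phi.T @ np.linalg.solve(Sigma, Phi) + np.eye(K))
mu_hat = Sig_hat @ Phi.T @ np.linalg.solve(Sigma, x - mu)
print(np.allclose(cond_mean, mu_hat), np.allclose(cond_cov, Sig_hat))   # True True
```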

Exercise 7.10

\[\begin{split}\hat{\boldsymbol{\theta}} &= \argmax_\boldsymbol{\theta} \sum_{i = 1}^I \int q_i(\mathbf{h}_i) \log Pr(\mathbf{x}_i, \mathbf{h}_i \mid \boldsymbol{\theta}) d\mathbf{h}_i & \quad & \text{(7.12)}\\ &= \argmax_\boldsymbol{\theta} \sum_{i = 1}^I \E\left[ \log Pr(\mathbf{x}_i, \mathbf{h}_i \mid \boldsymbol{\theta}) \right] & \quad & \text{(7.36)}\\ &= \argmax_\boldsymbol{\theta} L(\boldsymbol{\theta})\end{split}\]

where \(L\) is the expected complete-data log-likelihood given by

\[\begin{split}L &= -\frac{1}{2} \sum_{i = 1}^I \E\left[ D \log(2\pi) + \log \lvert \boldsymbol{\Sigma} \rvert + (\mathbf{x}_i - \boldsymbol{\mu} - \boldsymbol{\Phi} \mathbf{h}_i)^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu} - \boldsymbol{\Phi} \mathbf{h}_i) \right] & \quad & \text{(7.37)}\\ &= -\frac{1}{2} \sum_{i = 1}^I \E\left[ D \log(2\pi) + \log \lvert \boldsymbol{\Sigma} \rvert + \mathbf{x}_i^\top \boldsymbol{\Sigma}^{-1} \mathbf{x}_i + \boldsymbol{\mu}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} + \mathbf{h}_i^\top \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \mathbf{h}_i + 2 \boldsymbol{\mu}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \mathbf{h}_i - 2 \mathbf{x}_i^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} - 2 \mathbf{x}_i^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \mathbf{h}_i \right].\end{split}\]

(a)

\[\begin{split}\frac{\partial L}{\partial \boldsymbol{\mu}} &= -\frac{1}{2} \sum_{i = 1}^I \E\left[ 2 \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} + 2 \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \mathbf{h}_i - 2 \boldsymbol{\Sigma}^{-1} \mathbf{x}_i \right] & \quad & \text{(C.33), (C.27), (C.28)}\\ 0 &= \sum_{i = 1}^I -\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} - \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \E[\mathbf{h}_i] + \boldsymbol{\Sigma}^{-1} \mathbf{x}_i & \quad & \E \text{ is a linear operator}\\ &= \sum_{i = 1}^I -\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} - \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \left( \left( \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} + \mathbf{I} \right)^{-1} \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \left( \mathbf{x}_i - \boldsymbol{\mu} \right) \right) + \boldsymbol{\Sigma}^{-1} \mathbf{x}_i & \quad & \text{(7.35)}\\ &= \sum_{i = 1}^I \left( \boldsymbol{\Sigma}^{-1} - \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \left( \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} + \mathbf{I} \right)^{-1} \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \right) \mathbf{x}_i - \left( \boldsymbol{\Sigma}^{-1} - \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \left( \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} + \mathbf{I} \right)^{-1} \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \right) \boldsymbol{\mu}\\ &= \sum_{i = 1}^I \left( \boldsymbol{\Sigma} + \boldsymbol{\Phi} \boldsymbol{\Phi}^\top \right)^{-1} \mathbf{x}_i - \left( \boldsymbol{\Sigma} + \boldsymbol{\Phi} \boldsymbol{\Phi}^\top \right)^{-1} \boldsymbol{\mu} & \quad & \text{Sherman-Morrison-Woodbury formula}\\ \boldsymbol{\mu} &= \frac{1}{I} \sum_{i = 1}^I \mathbf{x}_i\end{split}\]

(b)

\[\begin{split}\frac{\partial L}{\partial \boldsymbol{\Phi}} &= -\frac{1}{2} \sum_{i = 1}^I \E\left[ 2 \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \mathbf{h}_i \mathbf{h}_i^\top + 2 \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} \mathbf{h}_i^\top - 2 \boldsymbol{\Sigma}^{-1} \mathbf{x}_i \mathbf{h}_i^\top \right] & \quad & \text{(C.34), (C.29)}\\ 0 &= \sum_{i = 1}^I -\boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \E\left[ \mathbf{h}_i \mathbf{h}_i^\top \right] - \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} \E\left[ \mathbf{h}_i^\top \right] + \boldsymbol{\Sigma}^{-1} \mathbf{x}_i \E\left[ \mathbf{h}_i^\top \right] & \quad & \E \text{ is a linear operator}\\ \boldsymbol{\Phi} &= \left( \sum_{i = 1}^I (\mathbf{x}_i - \boldsymbol{\mu}) \E\left[ \mathbf{h}_i^\top \right] \right) \left( \sum_{i = 1}^I \E\left[ \mathbf{h}_i \mathbf{h}_i^\top \right] \right)^{-1}\end{split}\]

(c)

\[\begin{split}\DeclareMathOperator{\diag}{\mathrm{diag}} \frac{\partial L}{\partial \boldsymbol{\Sigma}} &= -\frac{1}{2} \sum_{i = 1}^I \E\left[ \boldsymbol{\Sigma}^{-\top} - \boldsymbol{\Sigma}^{-\top} \left( \mathbf{x}_i - \boldsymbol{\mu} - \boldsymbol{\Phi} \mathbf{h}_i \right) \left( \mathbf{x}_i - \boldsymbol{\mu} - \boldsymbol{\Phi} \mathbf{h}_i \right)^\top \boldsymbol{\Sigma}^{-\top} \right] & \quad & \text{(C.38) and Matrix Cookbook (61)}\\ \boldsymbol{\Sigma} &= \frac{1}{I} \sum_{i = 1}^I \E\left[ \left( \mathbf{x}_i - \boldsymbol{\mu} - \boldsymbol{\Phi} \mathbf{h}_i \right) \left( \mathbf{x}_i - \boldsymbol{\mu} - \boldsymbol{\Phi} \mathbf{h}_i \right)^\top \right] & \quad & \E \text{ is a linear operator}\\ &= \frac{1}{I} \sum_{i = 1}^I (\mathbf{x}_i - \boldsymbol{\mu}) (\mathbf{x}_i - \boldsymbol{\mu})^\top + \E\left[ 2 \boldsymbol{\Phi} \mathbf{h}_i \boldsymbol{\mu}^\top - 2 \boldsymbol{\Phi} \mathbf{h}_i \mathbf{x}_i^\top + \boldsymbol{\Phi} \mathbf{h}_i \mathbf{h}_i^\top \boldsymbol{\Phi}^\top \right] & \quad & \text{expanded version of (7.37)}\\ &= \frac{1}{I} \sum_{i = 1}^I (\mathbf{x}_i - \boldsymbol{\mu}) (\mathbf{x}_i - \boldsymbol{\mu})^\top - 2 \boldsymbol{\Phi} \E[\mathbf{h}_i] \left( \mathbf{x}_i - \boldsymbol{\mu}\right)^\top + \boldsymbol{\Phi} \E\left[ \mathbf{h}_i \mathbf{h}_i^\top \right] \boldsymbol{\Phi}^\top\\ &= \frac{1}{I} \sum_{i = 1}^I (\mathbf{x}_i - \boldsymbol{\mu}) (\mathbf{x}_i - \boldsymbol{\mu})^\top - \boldsymbol{\Phi} \E[\mathbf{h}_i] \left( \mathbf{x}_i - \boldsymbol{\mu} \right)^\top & \quad & \text{(b)}\\ &= \frac{1}{I} \sum_{i = 1}^I \diag\left[ (\mathbf{x}_i - \boldsymbol{\mu}) (\mathbf{x}_i - \boldsymbol{\mu})^\top - \boldsymbol{\Phi} \E[\mathbf{h}_i] \left( \mathbf{x}_i - \boldsymbol{\mu}\right)^\top \right] & \quad & \boldsymbol{\Sigma} \text{ diagonal constraint.}\end{split}\]
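
Combining the E-step moments \(\E[\mathbf{h}_i]\) and \(\E[\mathbf{h}_i \mathbf{h}_i^\top]\) (Exercise 7.9) with the M-step updates (a)-(c) gives the complete factor analysis EM algorithm. A minimal sketch, assuming numpy, with a fixed number of iterations and no convergence check:

```python
# Factor analysis EM: E-step posterior moments plus the closed-form M-step updates.
import numpy as np

def fa_em(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    I, D = X.shape
    mu = X.mean(axis=0)                          # (a): closed form, independent of the E-step
    Phi = rng.normal(size=(D, K))
    Sigma = np.diag(np.var(X, axis=0))
    for _ in range(n_iter):
        # E-step: E[h_i] and sum_i E[h_i h_i^T]
        Sig_hat = np.linalg.inv(Phi.T @ np.linalg.solve(Sigma, Phi) + np.eye(K))
        Eh = (X - mu) @ np.linalg.solve(Sigma, Phi) @ Sig_hat      # I x K, rows are E[h_i]
        Ehh = I * Sig_hat + Eh.T @ Eh                              # sum_i E[h_i h_i^T]
        # M-step: (b) and (c)
        Phi = (X - mu).T @ Eh @ np.linalg.inv(Ehh)
        resid = (X - mu).T @ (X - mu) - Phi @ Eh.T @ (X - mu)
        Sigma = np.diag(np.diag(resid) / I)                        # keep Sigma diagonal
    return mu, Phi, Sigma
```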