Models for Style and Identity

Bayesian model selection is a valid way to compare models with different numbers of parameters as long as those parameters are marginalized out of the final solution.

Exercise 18.1

\[\begin{split}\DeclareMathOperator{\NormDist}{Norm} Pr(\mathbf{h}_i \mid \mathbf{x}_{i \cdot}) &= \frac{ Pr(\mathbf{h}_i, \mathbf{x}_{i1}, \mathbf{x}_{i2}, \ldots, \mathbf{x}_{iJ}) }{ \int Pr(\mathbf{x}_{i1}, \mathbf{x}_{i2}, \ldots, \mathbf{x}_{iJ}, \mathbf{h}_i) d\mathbf{h}_i } & \quad & \text{(2.1), (2.4) and } \mathbf{x}_{i \cdot} = \{ \mathbf{x}_{ij} \}_{j = 1}^J\\ &= \frac{ \prod_{j = 1}^J Pr(\mathbf{x}_{ij} \mid \mathbf{h}_i) \cdot Pr(\mathbf{h}_i) }{ \int Pr(\mathbf{h}_i) \prod_{j = 1}^J Pr(\mathbf{x}_{ij} \mid \mathbf{h}_i) d\mathbf{h}_i } & \quad & \text{(2.6), (2.10)}\\ &= \frac{ \kappa \NormDist_{\mathbf{h}_i}\left[ \boldsymbol{\mu}', \boldsymbol{\Sigma}' \right] }{ \kappa \int \NormDist_{\mathbf{h}_i}\left[ \boldsymbol{\mu}', \boldsymbol{\Sigma}' \right] d\mathbf{h}_i } & \quad & \text{(a)}\\ &= \NormDist_{\mathbf{h}_i}\left[ \boldsymbol{\mu}', \boldsymbol{\Sigma}' \right] & \quad & \text{multivariate normal distribution integrates to one}\end{split}\]

(a)

\[\begin{split}& \prod_{j = 1}^J Pr(\mathbf{x}_{ij} \mid \mathbf{h}_i) \cdot Pr(\mathbf{h}_i)\\ &= \prod_{j = 1}^J \NormDist_{\mathbf{x}_{ij}}\left[ \boldsymbol{\mu} + \boldsymbol{\Phi} \mathbf{h}_i, \boldsymbol{\Sigma} \right] \cdot \NormDist_{\mathbf{h}_i}\left[ \boldsymbol{0}, \mathbf{I} \right] & \quad & \text{(18.7)}\\ &= \frac{1}{ (2\pi)^{JD / 2} \left\vert \boldsymbol{\Sigma} \right\vert^{J / 2} } \exp\left[ \sum_{j = 1}^J \left( \mathbf{x}_{ij} - \boldsymbol{\mu} - \boldsymbol{\Phi} \mathbf{h}_i \right)^\top \boldsymbol{\Sigma}^{-1} \left( \mathbf{x}_{ij} - \boldsymbol{\mu} - \boldsymbol{\Phi} \mathbf{h}_i \right) \right]^{-0.5} \frac{1}{(2\pi)^{k / 2}} \exp\left[ \mathbf{h}_i^\top \mathbf{h}_i \right]^{-0.5}\\ &= \frac{ \exp\left[ \mathbf{h}_i^\top \mathbf{h}_i + \sum_{j = 1}^J \mathbf{x}_{ij}^\top \boldsymbol{\Sigma}^{-1} \mathbf{x}_{ij} - \mathbf{x}_{ij}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} - \mathbf{x}_{ij}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \mathbf{h}_i - \boldsymbol{\mu}^\top \boldsymbol{\Sigma}^{-1} \mathbf{x}_{ij} + \boldsymbol{\mu}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} + \boldsymbol{\mu}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \mathbf{h}_i - \mathbf{h}_i^\top \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \mathbf{x}_{ij} + \mathbf{h}_i^\top \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} + \mathbf{h}_i^\top \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \mathbf{h}_i \right]^{-0.5} }{ (2\pi)^{(JD + k) / 2} \left\vert \boldsymbol{\Sigma} \right\vert^{J / 2} }\\ &= \frac{ \exp\left[ \mathbf{h}_i^\top \mathbf{h}_i + J \mathbf{h}_i^\top \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \mathbf{h}_i - \left( 2 \mathbf{h}_i^\top \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \sum_{j = 1}^J \left( \mathbf{x}_{ij} - \boldsymbol{\mu} \right) \right) + \sum_{j = 1}^J \left( \mathbf{x}_{ij} - \boldsymbol{\mu} \right)^\top \boldsymbol{\Sigma}^{-1} \left( \mathbf{x}_{ij} - \boldsymbol{\mu} \right) \right]^{-0.5} }{ (2\pi)^{(JD + k) / 2} \left\vert \boldsymbol{\Sigma} \right\vert^{J / 2} }\\ &= \frac{ \exp\left[ \mathbf{h}_i^\top \boldsymbol{\Sigma}'^{-1} \mathbf{h}_i - \left( 2 \mathbf{h}_i^\top \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \sum_{j = 1}^J \left( \mathbf{x}_{ij} - \boldsymbol{\mu} \right) \right) + \sum_{j = 1}^J \left( \mathbf{x}_{ij} - \boldsymbol{\mu} \right)^\top \boldsymbol{\Sigma}^{-1} \left( \mathbf{x}_{ij} - \boldsymbol{\mu} \right) \right]^{-0.5} }{ (2\pi)^{(JD + k) / 2} \left\vert \boldsymbol{\Sigma} \right\vert^{J / 2} } & \quad & \boldsymbol{\Sigma}' = \left( J \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} + \mathbf{I} \right)^{-1}\\ &= \frac{ \exp\left[ \left( \mathbf{h}_i - \boldsymbol{\mu}' \right)^\top \boldsymbol{\Sigma}'^{-1} \left( \mathbf{h}_i - \boldsymbol{\mu}' \right) - \boldsymbol{\mu}'^\top \boldsymbol{\Sigma}'^{-1} \boldsymbol{\mu}' + \sum_{j = 1}^J \left( \mathbf{x}_{ij} - \boldsymbol{\mu} \right)^\top \boldsymbol{\Sigma}^{-1} \left( \mathbf{x}_{ij} - \boldsymbol{\mu} \right) \right]^{-0.5} }{ (2\pi)^{(JD + k) / 2} \left\vert \boldsymbol{\Sigma} \right\vert^{J / 2} } & \quad & \boldsymbol{\mu}' = \boldsymbol{\Sigma}' \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \sum_{j = 1}^J \left( \mathbf{x}_{ij} - \boldsymbol{\mu} \right)\\ &= \kappa \NormDist_{\mathbf{h}_i}\left[ \boldsymbol{\mu}', \boldsymbol{\Sigma}' \right]\end{split}\]

where

\[\kappa = \frac{ \left\vert \boldsymbol{\Sigma}' \right\vert^{1 / 2} }{ (2\pi)^{JD / 2} \left\vert \boldsymbol{\Sigma} \right\vert^{J / 2} } \exp\left[ -\boldsymbol{\mu}'^\top \boldsymbol{\Sigma}'^{-1} \boldsymbol{\mu}' + \sum_{j = 1}^J \left( \mathbf{x}_{ij} - \boldsymbol{\mu} \right)^\top \boldsymbol{\Sigma}^{-1} \left( \mathbf{x}_{ij} - \boldsymbol{\mu} \right) \right]^{-0.5}.\]
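Below is a minimal numeric sanity check of the posterior moments derived above, written with NumPy; the dimensions, parameter values, and variable names are illustrative choices rather than anything from the text. It compares the closed-form \(\boldsymbol{\mu}'\) and \(\boldsymbol{\Sigma}'\) against direct conditioning of the joint Gaussian over \((\mathbf{h}_i, \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iJ})\).

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, J = 5, 2, 3                                # data dim, identity dim, images per identity (arbitrary)

mu = rng.normal(size=D)
Phi = rng.normal(size=(D, K))
Sigma = np.diag(rng.uniform(0.5, 2.0, size=D))   # diagonal noise covariance, as in the model
x = rng.normal(size=(J, D))                      # stand-in observations x_{i1}, ..., x_{iJ}

# Closed form from the derivation above.
Si = np.linalg.inv(Sigma)
Sigma_post = np.linalg.inv(J * Phi.T @ Si @ Phi + np.eye(K))        # Sigma'
mu_post = Sigma_post @ Phi.T @ Si @ (x - mu).sum(axis=0)            # mu'

# Reference: condition the joint Gaussian over (h_i, x_{i1}, ..., x_{iJ}) directly.
# Cov[h] = I, Cov[x_j, h] = Phi, Cov[x_j, x_k] = Phi Phi^T + delta_{jk} Sigma.
C_xh = np.vstack([Phi] * J)
C_xx = np.kron(np.ones((J, J)), Phi @ Phi.T) + np.kron(np.eye(J), Sigma)
gain = C_xh.T @ np.linalg.inv(C_xx)              # conditioning gain C_hx C_xx^{-1}
mu_ref = gain @ (x.reshape(-1) - np.tile(mu, J))
Sigma_ref = np.eye(K) - gain @ C_xh

print(np.allclose(mu_post, mu_ref), np.allclose(Sigma_post, Sigma_ref))   # True True
```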

Exercise 18.2

\[\begin{split}\DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator{\E}{\mathrm{E}} \hat{\boldsymbol{\theta}} &= \argmax_{\boldsymbol{\theta}} \sum_{i = 1}^I \int q_i(\mathbf{h}_i) \log Pr(\mathbf{x}_{i\cdot}, \mathbf{h}_i \mid \boldsymbol{\theta}) d\mathbf{h}_i & \quad & \text{(7.51)}\\ &= \argmax_{\boldsymbol{\theta}} \sum_{i = 1}^I \int q_i(\mathbf{h}_i) \left[ \log Pr(\mathbf{x}_{i\cdot} \mid \mathbf{h}_i, \boldsymbol{\theta}) + \log Pr(\mathbf{h}_i \mid \boldsymbol{\theta}) \right] d\mathbf{h}_i\\ &= \argmax_{\boldsymbol{\theta}} \sum_{i = 1}^I \E\left[ \log Pr(\mathbf{x}_{i\cdot} \mid \mathbf{h}_i, \boldsymbol{\theta}) \right] + \E\left[ \log Pr(\mathbf{h}_i \mid \boldsymbol{\theta}) \right]\\ &= \argmax_{\boldsymbol{\theta}} \sum_{i = 1}^I \sum_{j = 1}^J \E\left[ \log Pr(\mathbf{x}_{ij} \mid \mathbf{h}_i, \boldsymbol{\theta}) \right] & \quad & Pr(\mathbf{h}_i) \text{ does not depend on } \boldsymbol{\theta}\\ &= \argmax_{\boldsymbol{\theta}} -\frac{1}{2} \sum_{i = 1}^I \sum_{j = 1}^J D \log(2\pi) + \log \left\vert \boldsymbol{\Sigma} \right\vert + \E\left[ \left( \mathbf{x}_{ij} - \boldsymbol{\mu} - \boldsymbol{\Phi} \mathbf{h}_i \right)^\top \boldsymbol{\Sigma}^{-1} \left( \mathbf{x}_{ij} - \boldsymbol{\mu} - \boldsymbol{\Phi} \mathbf{h}_i \right) \right]\\ &= \argmax_{\boldsymbol{\theta}} -\frac{1}{2} \sum_{i = 1}^I \sum_{j = 1}^J D \log(2\pi) + \log\left\vert \boldsymbol{\Sigma} \right\vert + \E\left[ \mathbf{x}_{ij}^\top \boldsymbol{\Sigma}^{-1} \mathbf{x}_{ij} + \boldsymbol{\mu}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} + \mathbf{h}_i^\top \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \mathbf{h}_i + 2 \boldsymbol{\mu}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \mathbf{h}_i - 2 \mathbf{x}_{ij}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} - 2 \mathbf{x}_{ij}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \mathbf{h}_i \right]\end{split}\]

(a)

\[\begin{split}\frac{\partial L}{\partial \boldsymbol{\mu}} &= -\frac{1}{2} \sum_{i = 1}^I \sum_{j = 1}^J \E\left[ 2 \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} + 2 \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \mathbf{h}_i - 2 \boldsymbol{\Sigma}^{-1} \mathbf{x}_{ij} \right] & \quad & \text{(C.33), (C.27), (C.28)}\\ 0 &= \sum_{i = 1}^I \sum_{j = 1}^J -\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} - \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \E[\mathbf{h}_i] + \boldsymbol{\Sigma}^{-1} \mathbf{x}_{ij} & \quad & \E[] \text{ is a linear operator}\\ &= \sum_{i = 1}^I \sum_{j = 1}^J -\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} - \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \left( \left( J \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} + \mathbf{I} \right)^{-1} \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \sum_{k = 1}^J \left( \mathbf{x}_{ik} - \boldsymbol{\mu} \right) \right) + \boldsymbol{\Sigma}^{-1} \mathbf{x}_{ij} & \quad & \text{(18.10)}\\ &= \sum_{i = 1}^I \sum_{j = 1}^J \left( \boldsymbol{\Sigma}^{-1} - J \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \left( J \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} + \mathbf{I} \right)^{-1} \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \right) \mathbf{x}_{ij} - \left( \boldsymbol{\Sigma}^{-1} - J \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \left( J \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} + \mathbf{I} \right)^{-1} \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \right) \boldsymbol{\mu} & \quad & \text{(a.1)}\\ &= \sum_{i = 1}^I \sum_{j = 1}^J \left( \boldsymbol{\Sigma} + J \boldsymbol{\Phi} \boldsymbol{\Phi}^\top \right)^{-1} \mathbf{x}_{ij} - \left( \boldsymbol{\Sigma} + J \boldsymbol{\Phi} \boldsymbol{\Phi}^\top \right)^{-1} \boldsymbol{\mu} & \quad & \text{Sherman-Morrison-Woodbury formula}\\ \boldsymbol{\mu} &= \frac{1}{IJ} \sum_{i = 1}^I \sum_{j = 1}^J \mathbf{x}_{ij}\end{split}\]

(a.1)

Notice that

\[\sum_{i = 1}^I \sum_{j = 1}^J -\boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \left( J \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} + \mathbf{I} \right)^{-1} \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \sum_{k = 1}^J \mathbf{x}_{ik} = \sum_{i = 1}^I \sum_{j = 1}^J -\boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \left( J \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} + \mathbf{I} \right)^{-1} \boldsymbol{\Phi}^\top \boldsymbol{\Sigma}^{-1} \left( J \mathbf{x}_{ij} - J \mathbf{x}_{ij} + \sum_{k = 1}^J \mathbf{x}_{ik} \right)\]

and

\[\begin{split}\sum_{i = 1}^I \sum_{j = 1}^J \left( J \mathbf{x}_{ij} - \sum_{k = 1}^J \mathbf{x}_{ik} \right) &= \sum_{i = 1}^I \sum_{j = 1}^J \left( J \mathbf{x}_{ij} - \left( \mathbf{x}_{i1} + \mathbf{x}_{i2} + \cdots + \mathbf{x}_{iJ} \right) \right)\\ &= \sum_{i = 1}^I \left( J \sum_{j = 1}^J \mathbf{x}_{ij} - J \left( \mathbf{x}_{i1} + \mathbf{x}_{i2} + \cdots + \mathbf{x}_{iJ} \right) \right)\\ &= 0.\end{split}\]
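The Sherman-Morrison-Woodbury step used in (a) can also be confirmed numerically; the NumPy fragment below uses arbitrary shapes and values and is only a sanity check.

```python
import numpy as np

rng = np.random.default_rng(1)
D, K, J = 6, 2, 4                                 # arbitrary sizes
Phi = rng.normal(size=(D, K))
Sigma = np.diag(rng.uniform(0.5, 2.0, size=D))    # diagonal, as in the model
Si = np.linalg.inv(Sigma)

# Sigma^{-1} - J Sigma^{-1} Phi (J Phi^T Sigma^{-1} Phi + I)^{-1} Phi^T Sigma^{-1}
lhs = Si - J * Si @ Phi @ np.linalg.inv(J * Phi.T @ Si @ Phi + np.eye(K)) @ Phi.T @ Si
# (Sigma + J Phi Phi^T)^{-1}
rhs = np.linalg.inv(Sigma + J * Phi @ Phi.T)
print(np.allclose(lhs, rhs))                      # True
```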

(b)

\[\begin{split}\frac{\partial L}{\partial \boldsymbol{\Phi}} &= -\frac{1}{2} \sum_{i = 1}^I \sum_{j = 1}^J \E\left[ 2 \boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \mathbf{h}_i \mathbf{h}_i^\top + 2 \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} \mathbf{h}_i^\top - 2 \boldsymbol{\Sigma}^{-1} \mathbf{x}_{ij} \mathbf{h}_i^\top \right] & \quad & \text{(C.34), (C.29)}\\ 0 &= \sum_{i = 1}^I \sum_{j = 1}^J -\boldsymbol{\Sigma}^{-1} \boldsymbol{\Phi} \E\left[ \mathbf{h}_i \mathbf{h}_i^\top \right] - \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} \E\left[ \mathbf{h}_i^\top \right] + \boldsymbol{\Sigma}^{-1} \mathbf{x}_{ij} \E\left[ \mathbf{h}_i^\top \right] & \quad & \E[] \text{ is a linear operator}\\ \boldsymbol{\Phi} \sum_{i = 1}^I \sum_{j = 1}^J \E\left[ \mathbf{h}_i \mathbf{h}_i^\top \right] &= \sum_{i = 1}^I \sum_{j = 1}^J \left( \mathbf{x}_{ij} - \boldsymbol{\mu} \right) \E\left[ \mathbf{h}_i^\top \right]\\ \boldsymbol{\Phi} &= \left( \sum_{i = 1}^I \sum_{j = 1}^J (\mathbf{x}_{ij} - \boldsymbol{\mu}) \E\left[ \mathbf{h}_i^\top \right] \right) \left( \sum_{i = 1}^I J \E\left[ \mathbf{h}_i \mathbf{h}_i^\top \right] \right)^{-1}\end{split}\]

(c)

\[\begin{split}\DeclareMathOperator{\diag}{\mathrm{diag}} \frac{\partial L}{\partial \boldsymbol{\Sigma}} &= -\frac{1}{2} \sum_{i = 1}^I \sum_{j = 1}^J \E\left[ \boldsymbol{\Sigma}^{-\top} - \boldsymbol{\Sigma}^{-\top} \left( \mathbf{x}_{ij} - \boldsymbol{\mu} - \boldsymbol{\Phi} \mathbf{h}_i \right) \left( \mathbf{x}_{ij} - \boldsymbol{\mu} - \boldsymbol{\Phi} \mathbf{h}_i \right)^\top \boldsymbol{\Sigma}^{-\top} \right] & \quad & \text{(C.38) and Matrix Cookbook (61)}\\ \boldsymbol{\Sigma} &= \frac{1}{IJ} \sum_{i = 1}^I \sum_{j = 1}^J \E\left[ \left( \mathbf{x}_{ij} - \boldsymbol{\mu} - \boldsymbol{\Phi} \mathbf{h}_i \right) \left( \mathbf{x}_{ij} - \boldsymbol{\mu} - \boldsymbol{\Phi} \mathbf{h}_i \right)^\top \right] & \quad & \E[] \text{ is a linear operator}\\ &= \frac{1}{IJ} \sum_{i = 1}^I \sum_{j = 1}^J (\mathbf{x}_{ij} - \boldsymbol{\mu}) (\mathbf{x}_{ij} - \boldsymbol{\mu})^\top + \E\left[ 2 \boldsymbol{\Phi} \mathbf{h}_i \boldsymbol{\mu}^\top - 2 \boldsymbol{\Phi} \mathbf{h}_i \mathbf{x}_{ij}^\top + \boldsymbol{\Phi} \mathbf{h}_i \mathbf{h}_i^\top \boldsymbol{\Phi}^\top \right]\\ &= \frac{1}{IJ} \sum_{i = 1}^I \sum_{j = 1}^J (\mathbf{x}_{ij} - \boldsymbol{\mu}) (\mathbf{x}_{ij} - \boldsymbol{\mu})^\top - 2 \boldsymbol{\Phi} \E[\mathbf{h}_i] \left( \mathbf{x}_{ij} - \boldsymbol{\mu}\right)^\top + \boldsymbol{\Phi} \E\left[ \mathbf{h}_i \mathbf{h}_i^\top \right] \boldsymbol{\Phi}^\top\\ &= \frac{1}{IJ} \sum_{i = 1}^I \sum_{j = 1}^J (\mathbf{x}_{ij} - \boldsymbol{\mu}) (\mathbf{x}_{ij} - \boldsymbol{\mu})^\top - \boldsymbol{\Phi} \E[\mathbf{h}_i] \left( \mathbf{x}_{ij} - \boldsymbol{\mu}\right)^\top & \quad & \text{substitute in results from (b)}\\ &= \frac{1}{IJ} \sum_{i = 1}^I \sum_{j = 1}^J \diag\left[ (\mathbf{x}_{ij} - \boldsymbol{\mu}) (\mathbf{x}_{ij} - \boldsymbol{\mu})^\top - \boldsymbol{\Phi} \E[\mathbf{h}_i] \left( \mathbf{x}_{ij} - \boldsymbol{\mu}\right)^\top \right] & \quad & \boldsymbol{\Sigma} \text{ diagonal constraint}\end{split}\]
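The three updates can be assembled into one EM iteration. The NumPy sketch below is an assumption-laden illustration, not the book's algorithm: the array layout (identities along the first axis), the function name `em_step`, and the random data are placeholders, and the noise update is written via the posterior residual, whose diagonal coincides with the form in (c) once \(\boldsymbol{\Phi}\) satisfies (b).

```python
import numpy as np

def em_step(X, mu, Phi, Sigma):
    """One EM iteration. X has shape (I, J, D): I identities, J images each."""
    I, J, D = X.shape
    K = Phi.shape[1]
    Si = np.linalg.inv(Sigma)

    # E-step: posterior moments of h_i (Exercise 18.1), computed with the old parameters.
    Sigma_post = np.linalg.inv(J * Phi.T @ Si @ Phi + np.eye(K))          # Sigma'
    Eh = np.stack([Sigma_post @ Phi.T @ Si @ (X[i] - mu).sum(axis=0)      # E[h_i] = mu'
                   for i in range(I)])
    Ehh = np.stack([Sigma_post + np.outer(Eh[i], Eh[i])                   # E[h_i h_i^T]
                    for i in range(I)])

    # M-step (a): the mean is the average of all observations.
    mu_new = X.reshape(-1, D).mean(axis=0)

    # M-step (b): solve Phi * sum E[h h^T] = sum (x - mu) E[h]^T.
    num = sum((X[i, j] - mu_new)[:, None] @ Eh[i][None, :]
              for i in range(I) for j in range(J))
    Phi_new = num @ np.linalg.inv(J * Ehh.sum(axis=0))

    # M-step (c): diagonal noise covariance, via the expected residual outer product.
    S = sum(np.outer(X[i, j] - mu_new - Phi_new @ Eh[i],
                     X[i, j] - mu_new - Phi_new @ Eh[i])
            for i in range(I) for j in range(J)) + I * J * Phi_new @ Sigma_post @ Phi_new.T
    Sigma_new = np.diag(np.diag(S)) / (I * J)

    return mu_new, Phi_new, Sigma_new

# Illustrative usage with random data and a random initialization.
rng = np.random.default_rng(42)
I, J, D, K = 20, 4, 6, 2
X = rng.normal(size=(I, J, D))
mu, Phi, Sigma = X.reshape(-1, D).mean(axis=0), rng.normal(size=(D, K)), np.eye(D)
for _ in range(5):
    mu, Phi, Sigma = em_step(X, mu, Phi, Sigma)
```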

Exercise 18.3

As shown in (17.25),

\[\begin{split}Pr(\mathbf{x}_{ij}) &= \int Pr(\mathbf{x}_{ij}, \mathbf{h}_i) d\mathbf{h}_i\\ &= \int Pr(\mathbf{x}_{ij} \mid \mathbf{h}_i) Pr(\mathbf{h}_i) d\mathbf{h}_i\\ &= \NormDist_{\mathbf{x}_{ij}}\left[ \boldsymbol{\mu}, \boldsymbol{\Phi} \boldsymbol{\Phi}^\top + \sigma^2 \mathbf{I} \right]\end{split}\]

where

\[\begin{split}Pr(\mathbf{h}_i) &= \NormDist_{\mathbf{h}_i}\left[\boldsymbol{0}, \mathbf{I} \right],\\ Pr(\mathbf{x}_{ij} \mid \mathbf{h}_i) &= \NormDist_{\mathbf{x}_{ij}}\left[ \boldsymbol{\mu} + \boldsymbol{\Phi} \mathbf{h}_i, \sigma^2 \mathbf{I} \right].\end{split}\]

This model is very similar to PPCA. Instead of stacking landmark points, the subspace identity model uses face images captured under varying lighting and pose. Since the dimensionality of the data is very high compared to the number of training examples, (17.29) is a more efficient way of estimating \(\boldsymbol{\Phi}\) and \(\sigma^2\). Furthermore, \(\boldsymbol{\mu}\) can still be estimated using (17.26).
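A small simulation can illustrate the closed-form marginal; the shapes, the noise level, and the use of scipy.stats.multivariate_normal are arbitrary choices for this sketch.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
D, K, N = 8, 3, 20_000
mu = rng.normal(size=D)
Phi = rng.normal(size=(D, K))
sigma2 = 0.5

# Sample from the generative model x = mu + Phi h + eps with isotropic noise.
h = rng.normal(size=(N, K))
x = mu + h @ Phi.T + rng.normal(scale=np.sqrt(sigma2), size=(N, D))

# The empirical covariance approaches Phi Phi^T + sigma^2 I, and the marginal
# likelihood of a new example is available in closed form.
C = Phi @ Phi.T + sigma2 * np.eye(D)
print(np.abs(np.cov(x, rowvar=False) - C).max())          # small for large N
print(multivariate_normal(mean=mu, cov=C).logpdf(x[0]))   # closed-form marginal log-likelihood
```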

Exercise 18.4

Face clustering can be viewed as a set partition problem. The number of ways a set of \(n\) elements can be partitioned into non-empty subsets is the Bell number \(B_n\), which satisfies the recurrence

\[B_n = \sum_{k = 0}^{n - 1} {n - 1 \choose k} B_k\]

where \(B_0 = B_1 = 1\).
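A direct implementation of this recurrence (Python shown here purely as an illustration) makes the rapid growth of the number of partitions obvious:

```python
from math import comb

def bell(n: int) -> int:
    """Bell number B_n via B_m = sum_{k=0}^{m-1} C(m-1, k) B_k with B_0 = 1."""
    B = [1]                                # B_0
    for m in range(1, n + 1):
        B.append(sum(comb(m - 1, k) * B[k] for k in range(m)))
    return B[n]

print([bell(n) for n in range(8)])         # [1, 1, 2, 5, 15, 52, 203, 877]
```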

Exercise 18.5

Suppose \(Pr(\mathbf{x}_1, \mathbf{x}_2 \mid w)\) is defined as in (18.17) and (18.18); see Exercise 7.8 for the derivation.

The marginal and conditional distributions can be derived using (5.11) and (5.13) respectively. Applying these two identities to each of the two variables in turn yields exactly \(\{ Pr(\mathbf{x}_1), Pr(\mathbf{x}_2), Pr(\mathbf{x}_1 \mid \mathbf{x}_2), Pr(\mathbf{x}_2 \mid \mathbf{x}_1) \}\).

When the prior is uniform over the world state, (18.12) can be evaluated using the foregoing expressions.
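As a rough sketch of how (18.12) can be evaluated, the fragment below assumes the matching model shares the identity variable between the two faces (cross-covariance \(\boldsymbol{\Phi}\boldsymbol{\Phi}^\top\)) and the non-matching model does not; all parameter values are placeholders, not the book's.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
D, K = 4, 2
mu = rng.normal(size=D)
Phi = rng.normal(size=(D, K))
Sigma = np.diag(rng.uniform(0.5, 1.5, size=D))

within = Phi @ Phi.T + Sigma
cross = Phi @ Phi.T
mean12 = np.concatenate([mu, mu])
cov_same = np.block([[within, cross], [cross, within]])                    # identities match
cov_diff = np.block([[within, np.zeros((D, D))], [np.zeros((D, D)), within]])

def posterior_same(x1, x2, prior_same=0.5):
    """Posterior that the two faces match; uniform prior when prior_same = 0.5."""
    x = np.concatenate([x1, x2])
    l_same = multivariate_normal(mean12, cov_same).pdf(x) * prior_same
    l_diff = multivariate_normal(mean12, cov_diff).pdf(x) * (1 - prior_same)
    return l_same / (l_same + l_diff)

# Two simulated images of the same person should usually score higher than a mismatched pair.
h = rng.normal(size=K)
x1 = mu + Phi @ h + rng.multivariate_normal(np.zeros(D), Sigma)
x2 = mu + Phi @ h + rng.multivariate_normal(np.zeros(D), Sigma)
x3 = mu + Phi @ rng.normal(size=K) + rng.multivariate_normal(np.zeros(D), Sigma)
print(posterior_same(x1, x2), posterior_same(x1, x3))
```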

Exercise 18.6

The marginalization technique used to combine the t-distribution with factor analyzers in [KD04] could be applied to the subspace identity model to make it robust to outliers:

\[\begin{split}Pr(\mathbf{x}_{ij}) &= \iint Pr(\mathbf{x}_{ij}, \mathbf{h}_i, u_i) d\mathbf{h}_i du_i\\ &= \iint Pr(\mathbf{x}_{ij} \mid \mathbf{h}_i, u_i) Pr(\mathbf{h}_i \mid u_i) Pr(u_i) d\mathbf{h}_i du_i\end{split}\]

where

\[\begin{split}\DeclareMathOperator{\GamDist}{Gam} Pr(u_i) &= \GamDist_{u_i}[\nu / 2, \nu / 2],\\ Pr(\mathbf{h}_i \mid u_i) &= \NormDist_{\mathbf{h}_i}\left[ \boldsymbol{0}, \frac{\mathbf{I}}{u_i} \right],\\ Pr(\mathbf{x}_{ij} \mid \mathbf{h}_i, u_i) &= \NormDist_{\mathbf{x}_{ij}}\left[ \boldsymbol{\mu} + \boldsymbol{\Phi} \mathbf{h}_i, \frac{\boldsymbol{\Sigma}}{u_i} \right].\end{split}\]
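A sampling sketch of this robust model (placeholder parameters) shows the heavier tails of the marginal, which is a multivariate t-distribution with \(\nu\) degrees of freedom:

```python
import numpy as np

rng = np.random.default_rng(4)
D, K, N, nu = 3, 2, 50_000, 8.0
mu = np.zeros(D)
Phi = rng.normal(size=(D, K))
Sigma = 0.5 * np.eye(D)

u = rng.gamma(shape=nu / 2, scale=2 / nu, size=N)      # Gam(nu/2, nu/2), i.e. rate nu/2
h = rng.normal(size=(N, K)) / np.sqrt(u)[:, None]      # h | u ~ Norm(0, I / u)
eps = rng.multivariate_normal(np.zeros(D), Sigma, size=N) / np.sqrt(u)[:, None]
x = mu + h @ Phi.T + eps

# Excess kurtosis per dimension: ~0 for a Gaussian, about 6/(nu - 4) = 1.5 for this t marginal.
z = (x - x.mean(axis=0)) / x.std(axis=0)
print((z**4).mean(axis=0) - 3)
```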

Exercise 18.7

Imitating (18.17), define

\[Pr(\mathbf{x}_\delta \mid w) = \NormDist_{\mathbf{x}_\delta}\left[ \boldsymbol{\mu}_w, \boldsymbol{\Phi}_w \boldsymbol{\Phi}^\top_w + \boldsymbol{\Sigma}_w \right]\]

and

\[Pr(w \mid \mathbf{x}_\delta) = \frac{Pr(\mathbf{x}_\delta \mid w) Pr(w)}{Pr(\mathbf{x}_\delta)}.\]

See [MJP00] for a discussion of this model’s disadvantages.
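A minimal sketch of this difference-vector classifier, assuming \(w\) ranges over two classes (match and non-match) and using placeholder parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)
D, K = 4, 2
classes = {}
for w in (0, 1):                                           # 0: non-match, 1: match
    mu_w = rng.normal(size=D)
    Phi_w = rng.normal(size=(D, K))
    Sigma_w = np.diag(rng.uniform(0.5, 1.5, size=D))
    classes[w] = multivariate_normal(mu_w, Phi_w @ Phi_w.T + Sigma_w)   # Pr(x_delta | w)

def posterior_w(x_delta, prior=(0.5, 0.5)):
    """Bayes' rule over the two hypotheses for a difference vector x_delta."""
    lik = np.array([classes[w].pdf(x_delta) * prior[w] for w in (0, 1)])
    return lik / lik.sum()

print(posterior_w(rng.normal(size=D)))
```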

Exercise 18.8

Combining (18.20) and (18.30) yields

\[\begin{split}\mathbf{x}_{ijs} &= \boldsymbol{\mu}_s + \boldsymbol{\Phi}_s \mathbf{h}_i + \boldsymbol{\Psi}_s \mathbf{s}_{ij} + \boldsymbol{\epsilon}_{ijs}\\ &= \boldsymbol{\mu}_s + \begin{bmatrix} \boldsymbol{\Phi}_s & \boldsymbol{\Psi}_s \end{bmatrix} \begin{bmatrix} \mathbf{h}_i\\ \mathbf{s}_{ij} \end{bmatrix} + \boldsymbol{\epsilon}_{ijs}\\ &= \boldsymbol{\mu}_s + \boldsymbol{\Phi}'_s \mathbf{h}_{ij} + \boldsymbol{\epsilon}_{ijs}\end{split}\]

where \(\boldsymbol{\mu}_s\) is the mean vector associated with the \(s\text{th}\) style, \(\boldsymbol{\Phi}_s\) describes the between-individual variation associated with the \(s\text{th}\) style, \(\boldsymbol{\Psi}_s\) describes the within-individual variation associated with the \(s\text{th}\) style, and the noise term \(\boldsymbol{\epsilon}_{ijs}\), whose covariance \(\boldsymbol{\Sigma}_s\) is diagonal, explains any remaining variation associated with the \(s\text{th}\) style.

The generative model can be written in probabilistic terms as a combination of (18.21) and (18.31):

\[\begin{split}\DeclareMathOperator{\CatDist}{Cat} Pr(s) &= \CatDist_s\left[\boldsymbol{\lambda}\right]\\ Pr(\mathbf{h}_i) &= \NormDist_{\mathbf{h}_i}\left[ \boldsymbol{0}, \mathbf{I}_{D_{\mathbf{h}_i}} \right]\\ Pr(\mathbf{s}_{ij}) &= \NormDist_{\mathbf{s}_{ij}}\left[ \boldsymbol{0}, \mathbf{I}_{D_{\mathbf{s}_{ij}}} \right]\\ Pr(\mathbf{x}_{ijs} \mid \mathbf{h}_i, \mathbf{s}_{ij}, s) &= \NormDist_{\mathbf{x}_{ijs}}\left[ \boldsymbol{\mu}_s + \boldsymbol{\Phi}_s \mathbf{h}_i + \boldsymbol{\Psi}_s \mathbf{s}_{ij}, \boldsymbol{\Sigma}_s \right]\end{split}\]

where \(\boldsymbol{\lambda}\) describes the probability of observing data in each style.

The entire model can be reshaped into the standard factor analysis model

\[\begin{split}Pr(\mathbf{h}_{ij}) &= \NormDist_{\mathbf{h}_{ij}}\left[ \boldsymbol{0}, \mathbf{I}_{D_{\mathbf{h}_i} + D_{\mathbf{s}_{ij}}} \right]\\ Pr(\mathbf{x}_{ijs} \mid \mathbf{h}_{ij}, s) &= \NormDist_{\mathbf{x}_{ijs}}\left[ \boldsymbol{\mu}_s + \boldsymbol{\Phi}'_s \mathbf{h}_{ij}, \boldsymbol{\Sigma}_s \right].\end{split}\]

Marginalizing over the hidden variables gives

\[\begin{split}Pr(\mathbf{x}_{ijs}) &= \sum_{s = 1}^S \iint Pr(\mathbf{x}_{ijs}, \mathbf{h}_i, \mathbf{s}_{ij}, s) d\mathbf{h}_i d\mathbf{s}_{ij}\\ &= \sum_{s = 1}^S \iint Pr(\mathbf{x}_{ijs} \mid \mathbf{h}_i, \mathbf{s}_{ij}, s) Pr(\mathbf{h}_i) Pr(\mathbf{s}_{ij}) Pr(s) d\mathbf{h}_i d\mathbf{s}_{ij} & \quad & \text{the priors are independent of each other}\\ &= \sum_{s = 1}^S \int Pr(\mathbf{x}_{ijs} \mid \mathbf{h}_{ij}, s) Pr(\mathbf{h}_{ij}) Pr(s) d\mathbf{h}_{ij}\\ &= \sum_{s = 1}^S \lambda_s \NormDist_{\mathbf{x}_{ijs}}\left[ \boldsymbol{\mu}_s, {\boldsymbol{\Phi}'}_s {\boldsymbol{\Phi}'}_s^\top + \boldsymbol{\Sigma}_s \right] & \quad & \text{Exercise 7.8}\\ &= \sum_{s = 1}^S \lambda_s \NormDist_{\mathbf{x}_{ijs}}\left[ \boldsymbol{\mu}_s, \boldsymbol{\Phi}_s \boldsymbol{\Phi}_s^\top + \boldsymbol{\Psi}_s \boldsymbol{\Psi}_s^\top + \boldsymbol{\Sigma}_s \right]\end{split}\]

which is essentially the non-linear identity model (18.27). See Exercise 7.8 for more derivation details. The compound generative equation for \(\mathbf{x}_{i\cdot\cdot}\) is

\[\begin{split}\begin{bmatrix} \mathbf{x}_{i11}\\ \vdots\\ \mathbf{x}_{i1S}\\ \vdots\\ \mathbf{x}_{iJ1}\\ \vdots\\ \mathbf{x}_{iJS} \end{bmatrix} &= \begin{bmatrix} \boldsymbol{\mu}_1\\ \vdots\\ \boldsymbol{\mu}_S\\ \vdots\\ \boldsymbol{\mu}_1\\ \vdots\\ \boldsymbol{\mu}_S \end{bmatrix} + \begin{bmatrix} \boldsymbol{\Phi}_1 & \boldsymbol{\Psi}_1 & \boldsymbol{0} & \cdots & \boldsymbol{0}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ \boldsymbol{\Phi}_S & \boldsymbol{\Psi}_S & \boldsymbol{0} & \cdots & \boldsymbol{0}\\ \vdots & \vdots & \vdots & \vdots & \vdots\\ \boldsymbol{\Phi}_1 & \boldsymbol{0} & \boldsymbol{0} & \cdots & \boldsymbol{\Psi}_1\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ \boldsymbol{\Phi}_S & \boldsymbol{0} & \boldsymbol{0} & \cdots & \boldsymbol{\Psi}_S\\ \end{bmatrix} \begin{bmatrix} \mathbf{h}_{i}\\ \mathbf{s}_{i1}\\ \vdots\\ \mathbf{s}_{iJ} \end{bmatrix} + \begin{bmatrix} \boldsymbol{\epsilon}_{i11}\\ \vdots\\ \boldsymbol{\epsilon}_{i1S}\\ \vdots\\ \boldsymbol{\epsilon}_{iJ1}\\ \vdots\\ \boldsymbol{\epsilon}_{iJS} \end{bmatrix}\end{split}\]
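For concreteness, the stacked loading matrix in the compound equation can be assembled block by block; the sketch below uses placeholder shapes and randomly generated per-style factors.

```python
import numpy as np

rng = np.random.default_rng(7)
D, Kh, Ks, S, J = 3, 2, 2, 2, 3                  # arbitrary sizes
Phis = [rng.normal(size=(D, Kh)) for _ in range(S)]
Psis = [rng.normal(size=(D, Ks)) for _ in range(S)]

rows = []
for j in range(J):                               # image index
    for s in range(S):                           # style index
        # Row block for x_{ijs}: Phi_s in the h_i column, Psi_s in the s_{ij} column, zeros elsewhere.
        blocks = [Phis[s]] + [Psis[s] if k == j else np.zeros((D, Ks)) for k in range(J)]
        rows.append(np.hstack(blocks))
A = np.vstack(rows)
print(A.shape)                                   # (J * S * D, Kh + J * Ks)
```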

(18.9) and (18.10) can be reused to compute the E-step; the M-step can be updated according to (18.35). If \(\boldsymbol{\lambda}\) is unknown and needs to be estimated, one can proceed as in [GH+96]. The inference procedures in Sections 18.2.2 and 18.4.2 are still applicable.
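The marginal mixture density derived above can also be evaluated directly; the fragment below uses placeholder per-style parameters and a uniform \(\boldsymbol{\lambda}\).

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(6)
D, Kh, Ks, S = 4, 2, 2, 3
lam = np.full(S, 1.0 / S)                        # uniform style probabilities (assumption)
params = []
for s in range(S):
    mu_s = rng.normal(size=D)
    Phi_s = rng.normal(size=(D, Kh))
    Psi_s = rng.normal(size=(D, Ks))
    Sigma_s = np.diag(rng.uniform(0.5, 1.5, size=D))
    # Per-style marginal covariance Phi_s Phi_s^T + Psi_s Psi_s^T + Sigma_s.
    params.append((mu_s, Phi_s @ Phi_s.T + Psi_s @ Psi_s.T + Sigma_s))

def marginal_density(x):
    return sum(lam[s] * multivariate_normal(m, C).pdf(x) for s, (m, C) in enumerate(params))

print(marginal_density(rng.normal(size=D)))
```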

Exercise 18.9

The posterior distribution over the style \(s\) of some observed data \(\mathbf{x}\) can be computed using

\[Pr(s \mid \mathbf{x}) = \frac{Pr(\mathbf{x} \mid s) Pr(s)}{Pr(\mathbf{x})}\]

where the numerator and denominator are given by (18.32), (18.36), and (18.37).

Now that each of the two examples has its own posterior distribution over styles, standard dissimilarity measures (e.g. the \(L_\infty\)-norm of the difference or the KL divergence) can be applied to determine whether the styles match, as sketched below.
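A toy sketch of this comparison, with placeholder per-style densities standing in for the full style marginals of Exercise 18.8:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(8)
D, S = 4, 3
lam = np.full(S, 1.0 / S)
styles = [multivariate_normal(rng.normal(size=D), np.diag(rng.uniform(0.5, 1.5, size=D)))
          for _ in range(S)]                       # stand-ins for Pr(x | s)

def style_posterior(x):
    lik = np.array([lam[s] * styles[s].pdf(x) for s in range(S)])
    return lik / lik.sum()                         # Pr(s | x)

def kl(p, q, eps=1e-12):
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

p1 = style_posterior(rng.normal(size=D))
p2 = style_posterior(rng.normal(size=D))
print(kl(p1, p2), np.abs(p1 - p2).max())           # KL divergence and L-infinity distance
```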