Latent Dirichlet allocation is a density model for the data in a corpus. It
does not involve a “world” term that we wish to infer.
Exercise 20.2
The following is the generative process corresponding to the plate notation in
Figure 20.6 (a simulation sketch follows the list).
For each image indexed by \(i \in \{1, \ldots, I\}\) in a corpus:

1. Choose \(\DeclareMathOperator{\PoissonDist}{Pois}
   J_i \sim \PoissonDist(\xi)\).
2. Choose an \(M\)-dimensional part probability vector
   \(\boldsymbol{\pi}_i\) from the distribution
   \(\DeclareMathOperator{\DirDist}{Dir}
   Pr(\boldsymbol{\pi}_i) = \DirDist_{\boldsymbol{\pi}_i}[\boldsymbol{\alpha}]\).
3. For each visual word indexed by \(j \in \{1, \ldots, J_i\}\) in image
   \(i\):

   a. Choose a part \(p_{ij} \in \{1, \ldots, M\}\) from the distribution
      \(\DeclareMathOperator{\CatDist}{Cat}
      Pr(p_{ij}) = \CatDist_{p_{ij}}[\boldsymbol{\pi}_i]\).
   b. Choose a visual word \(f_{ij} \in \{1, \ldots, K\}\) from the
      distribution
      \(Pr(f_{ij} \mid p_{ij}) =
      \CatDist_{f_{ij}}[\boldsymbol{\lambda}_{p_{ij}}]\), where
      \(Pr(\boldsymbol{\lambda}_m) =
      \DirDist_{\boldsymbol{\lambda}_m}[\boldsymbol{\beta}]\).
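This process is straightforward to simulate. Below is a minimal sketch using
NumPy; the corpus sizes and the symmetric hyperparameter values are
illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (assumptions, not from the text).
I, M, K = 4, 3, 10      # images, parts, visual-word vocabulary size
xi = 50.0               # Poisson rate for the number of words per image
alpha, beta = 0.5, 0.1  # symmetric Dirichlet concentrations

# One categorical distribution over the K visual words for each part m.
lam = rng.dirichlet(np.full(K, beta), size=M)    # lam[m] ~ Dir(beta, ..., beta)

corpus = []
for i in range(I):
    J_i = rng.poisson(xi)                        # J_i ~ Pois(xi)
    pi_i = rng.dirichlet(np.full(M, alpha))      # pi_i ~ Dir(alpha, ..., alpha)
    p_i = rng.choice(M, size=J_i, p=pi_i)        # p_ij ~ Cat(pi_i)
    f_i = np.array([rng.choice(K, p=lam[m])      # f_ij ~ Cat(lam[p_ij])
                    for m in p_i], dtype=int)
    corpus.append((pi_i, p_i, f_i))
```

The counts \(F_{mk}\) and \(P_{im}\) used in the derivations below are simply
histograms over the sampled \((p_{ij}, f_{ij})\) pairs.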
The joint distribution for the graphical model of this generative process is
\[\begin{split}Pr(\boldsymbol{\pi}, \boldsymbol{\lambda}, \mathbf{p}, \mathbf{f} \mid
\boldsymbol{\alpha}, \boldsymbol{\beta})
&= Pr(\boldsymbol{\lambda} \mid \boldsymbol{\beta})
Pr(\boldsymbol{\pi} \mid \boldsymbol{\alpha})
Pr(\mathbf{p} \mid \boldsymbol{\pi})
Pr(\mathbf{f} \mid \mathbf{p}, \boldsymbol{\lambda})
& \quad & \text{(10.3)}\\
&= \prod_{m = 1}^M Pr(\boldsymbol{\lambda}_m \mid \boldsymbol{\beta})
\cdot
\prod_{i = 1}^I Pr(\boldsymbol{\pi}_i \mid \boldsymbol{\alpha})
\cdot
\prod_{i = 1}^I \prod_{j = 1}^{J_i} Pr(p_{ij} \mid \boldsymbol{\pi}_i)
\cdot
\prod_{i = 1}^I \prod_{j = 1}^{J_i}
Pr(f_{ij} \mid \boldsymbol{\lambda}_{p_{ij}})
& \quad & \text{d-Separation from generative process}\\
&= \prod_{m = 1}^M Pr(\boldsymbol{\lambda}_m \mid \boldsymbol{\beta}) \cdot
\prod_{i = 1}^I Pr(\boldsymbol{\pi}_i \mid \boldsymbol{\alpha})
\prod_{j = 1}^{J_i}
Pr(p_{ij} \mid \boldsymbol{\pi}_i)
Pr(f_{ij} \mid \boldsymbol{\lambda}_{p_{ij}})\end{split}\]
where \(\mathbf{p} = \{ p_{ij} \}_{i = 1, j = 1}^{I, J_i}\) and
\(\mathbf{f} = \{ f_{ij} \}_{i = 1, j = 1}^{I, J_i}\) are the hidden part
labels and the observed visual word labels, respectively. This joint
distribution is also known as the complete-data likelihood of the corpus.
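Given a sample from the process above, the factorized joint can be evaluated
term by term. A minimal sketch, assuming the `corpus`, `lam`, `alpha`, and
`beta` from the previous block; the function name is illustrative.

```python
import numpy as np
from scipy.stats import dirichlet

def complete_data_log_likelihood(corpus, lam, alpha, beta):
    """log Pr(pi, lambda, p, f | alpha, beta) via the factorization above."""
    M, K = lam.shape
    ll = sum(dirichlet.logpdf(lam[m], np.full(K, beta))   # Pr(lambda_m | beta)
             for m in range(M))
    for pi_i, p_i, f_i in corpus:
        ll += dirichlet.logpdf(pi_i, np.full(M, alpha))   # Pr(pi_i | alpha)
        ll += np.log(pi_i[p_i]).sum()                     # prod_j Pr(p_ij | pi_i)
        ll += np.log(lam[p_i, f_i]).sum()                 # prod_j Pr(f_ij | lambda)
    return ll
```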
When the part and visual word labels are known, the Bayesian approach (or
another technique such as ML or MAP) can be applied to fit the parameters
\(\boldsymbol{\theta} =
\left\{
\{ \boldsymbol{\pi}_i \}_{i = 1}^I,
\{ \boldsymbol{\lambda}_m \}_{m = 1}^M
\right\}\) to the data. The Bayesian approach requires calculating a posterior
over the parameters as
\[\begin{split}Pr(\boldsymbol{\theta} \mid
\mathbf{p}, \mathbf{f}, \boldsymbol{\alpha}, \boldsymbol{\beta})
&= \frac{
Pr(\boldsymbol{\pi}, \boldsymbol{\lambda}, \mathbf{p}, \mathbf{f} \mid
\boldsymbol{\alpha}, \boldsymbol{\beta})
}{
Pr(\mathbf{p}, \mathbf{f} \mid \boldsymbol{\alpha}, \boldsymbol{\beta})
}\\
&= \frac{
Pr(\boldsymbol{\lambda} \mid \boldsymbol{\beta})
Pr(\boldsymbol{\pi} \mid \boldsymbol{\alpha})
Pr(\mathbf{p} \mid \boldsymbol{\pi})
Pr(\mathbf{f} \mid \mathbf{p}, \boldsymbol{\lambda})
}{
Pr(\mathbf{f} \mid \mathbf{p}, \boldsymbol{\alpha}, \boldsymbol{\beta})
Pr(\mathbf{p} \mid \boldsymbol{\alpha}, \boldsymbol{\beta})
}\\
&= \frac{
Pr(\boldsymbol{\lambda} \mid \boldsymbol{\beta})
Pr(\mathbf{f} \mid \mathbf{p}, \boldsymbol{\lambda})
Pr(\boldsymbol{\pi} \mid \boldsymbol{\alpha})
Pr(\mathbf{p} \mid \boldsymbol{\pi})
}{
\left(
\int Pr(\boldsymbol{\lambda}, \mathbf{f} \mid
\mathbf{p}, \boldsymbol{\alpha}, \boldsymbol{\beta})
d\boldsymbol{\lambda}
\right)
\left(
\prod_{i = 1}^I
\int Pr(\boldsymbol{\pi}_i, \mathbf{p}_{i \cdot} \mid
\boldsymbol{\alpha}, \boldsymbol{\beta})
d\boldsymbol{\pi}_i
\right)
}
& \quad & \text{marginalization and independence from generative process}\\
&= \frac{
\prod_{m = 1}^M Pr(\boldsymbol{\lambda}_m \mid \boldsymbol{\beta})
\cdot
\prod_{i = 1}^I \prod_{j = 1}^{J_i}
Pr(f_{ij} \mid \boldsymbol{\lambda}_{p_{ij}}) \cdot
\prod_{i = 1}^I Pr(\boldsymbol{\pi}_i \mid \boldsymbol{\alpha})
\prod_{j = 1}^{J_i} Pr(p_{ij} \mid \boldsymbol{\pi}_i)
}{
\left(
\int
\prod_{m = 1}^M Pr(\boldsymbol{\lambda}_m \mid \boldsymbol{\beta})
\cdot
\prod_{i = 1}^I \prod_{j = 1}^{J_i}
Pr(f_{ij} \mid \boldsymbol{\lambda}_{p_{ij}})
d\boldsymbol{\lambda}
\right)
\left(
\prod_{i = 1}^I
\int Pr(\boldsymbol{\pi}_i \mid \boldsymbol{\alpha})
\prod_{j = 1}^{J_i}
Pr(p_{ij} \mid \boldsymbol{\pi}_i)
d\boldsymbol{\pi}_i
\right)
}
& \quad & \text{d-Separation from generative process}\\
&= \prod_{m = 1}^M
\DirDist_{\boldsymbol{\lambda}_m}\left[
\beta + F_{m1}, \ldots, \beta + F_{mK}
\right]
\cdot
\prod_{i = 1}^I
\DirDist_{\boldsymbol{\pi}_i}\left[
\alpha + P_{i1}, \ldots, \alpha + P_{iM}
\right]
& \quad & \text{Exercise 20.3 (a) and (b)}\end{split}\]
See Exercise 20.3 for more details.
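Because the posterior factorizes into independent Dirichlet distributions,
Bayesian fitting with known labels reduces to accumulating the counts
\(F_{mk}\) and \(P_{im}\) (defined in the derivations below) and adding them
to the concentrations. A minimal sketch continuing the simulation above; the
function name is illustrative.

```python
import numpy as np

def posterior_params(corpus, M, K, alpha, beta):
    """Dirichlet posterior parameters over lambda_m and pi_i from counts."""
    I = len(corpus)
    F = np.zeros((M, K), dtype=int)   # F[m, k]: times part m emitted word k
    P = np.zeros((I, M), dtype=int)   # P[i, m]: times part m occurs in image i
    for i, (_, p_i, f_i) in enumerate(corpus):
        np.add.at(F, (p_i, f_i), 1)   # accumulate word-given-part counts
        np.add.at(P[i], p_i, 1)       # accumulate part counts for image i
    return beta + F, alpha + P        # row m: Dir over lambda_m; row i: Dir over pi_i
```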
The predictive density (the probability of a new data point
\(\{p^*, f^*\}\) under the fitted model) is
\[\begin{split}& Pr(p^*, f^* \mid
\mathbf{p}, \mathbf{f}, \boldsymbol{\alpha}, \boldsymbol{\beta})\\
&= \int Pr(p^*, f^* \mid \boldsymbol{\theta})
Pr(\boldsymbol{\theta} \mid
\mathbf{p}, \mathbf{f}, \boldsymbol{\alpha}, \boldsymbol{\beta})
d\boldsymbol{\theta}\\
&= \int Pr(f^* \mid \boldsymbol{\lambda}_{p^*})
Pr(p^* \mid \boldsymbol{\pi}_*)
Pr(\boldsymbol{\theta} \mid
\mathbf{p}, \mathbf{f}, \boldsymbol{\alpha}, \boldsymbol{\beta})
d\boldsymbol{\theta}
& \quad & \text{d-Separation from generative process}\\
&= \int \CatDist_{f^*}[\boldsymbol{\lambda}_{p^*}]
\CatDist_{p^*}[\boldsymbol{\pi}_*]
\cdot
\prod_{m = 1}^M
\DirDist_{\boldsymbol{\lambda}_m}\left[
\beta + F_{m1}, \ldots, \beta + F_{mK}
\right]
\cdot
\prod_{i = 1}^I
\DirDist_{\boldsymbol{\pi}_i}\left[
\alpha + P_{i1}, \ldots, \alpha + P_{iM}
\right]
d\boldsymbol{\lambda} d\boldsymbol{\pi}
& \quad & \text{(20.5)}\\
&= \frac{\beta + F_{p^* f^*}}{K \beta + \sum_{k = 1}^K F_{p^* k}}
\frac{\alpha + P_{i p^*}}{M \alpha + \sum_{m = 1}^M P_{im}}
& \quad & \text{(a), (b), and }
\Gamma[1 + n] = n \Gamma[n] \text{ for } n > 0,\end{split}\]
where \(i\) denotes the image containing the new data point.
[Tu14], [Gri02], and [Gre05] are helpful resources for proving these
relations.
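Numerically, the closed form is just a product of two count ratios. A minimal
sketch, assuming `F` and `P` are the raw count matrices built in the sketch
above and the new point falls in image `i`:

```python
def predictive(F, P, i, p_star, f_star, alpha, beta):
    """Pr(p*, f* | p, f, alpha, beta) for a new visual word in image i."""
    M, K = F.shape
    word_term = (beta + F[p_star, f_star]) / (K * beta + F[p_star].sum())
    part_term = (alpha + P[i, p_star]) / (M * alpha + P[i].sum())
    return word_term * part_term
```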
(a)
\[\begin{split}& \prod_{m = 1}^M
\DirDist_{\boldsymbol{\lambda}_m}\left[
\beta + F_{m1}, \ldots, \beta + F_{mK}
\right]
\cdot
\CatDist_{f^*}[\boldsymbol{\lambda}_{p^*}]\\
&= \left(
\prod_{m = 1}^M
\frac{
\Gamma\left[ K \beta + \sum_{k = 1}^K F_{mk} \right]
}{
\prod_{k = 1}^K \Gamma\left[ \beta + F_{mk} \right]
}
\prod_{k = 1}^K \boldsymbol{\lambda}_{mk}^{\beta - 1 + F_{mk}}
\right)
\cdot
\prod_{k = 1}^K \boldsymbol{\lambda}_{p^* k}^{\delta[f^* - k]}
& \quad & \text{(3.9), (3.8)}\\
&= \left(
\prod_{m = 1}^M
\frac{
\Gamma\left[ K \beta + \sum_{k = 1}^K F_{mk} \right]
}{
\prod_{k = 1}^K \Gamma\left[ \beta + F_{mk} \right]
}
\prod_{k = 1}^K \boldsymbol{\lambda}_{mk}^{\beta - 1 + F_{mk}}
\right)
\cdot
\prod_{m = 1}^M
\prod_{k = 1}^K \boldsymbol{\lambda}_{mk}^{\tilde{F}_{mk}}
& \quad & \tilde{F}_{mk} =
\delta\left[ f^* - k \right] \delta\left[ p^* - m \right]\\
&= \prod_{m = 1}^M
\frac{
\Gamma\left[ K \beta + \sum_{k = 1}^K F_{mk} \right]
}{
\prod_{k = 1}^K \Gamma\left[ \beta + F_{mk} \right]
}
\prod_{k = 1}^K
\boldsymbol{\lambda}_{mk}^{\beta - 1 + F_{mk} + \tilde{F}_{mk}}\\
&= \prod_{m = 1}^M
\frac{
\Gamma\left[ K \beta + \sum_{k = 1}^K F_{mk} \right]
}{
\prod_{k = 1}^K \Gamma\left[ \beta + F_{mk} \right]
}
\frac{
\prod_{k = 1}^K \Gamma\left[ \beta + F_{mk} + \tilde{F}_{mk} \right]
}{
\Gamma\left[ K \beta + \sum_{k = 1}^K F_{mk} + \tilde{F}_{mk} \right]
}
\DirDist_{\boldsymbol{\lambda}_m}\left[
\beta + F_{m1} + \tilde{F}_{m1},
\ldots,
\beta + F_{mK} + \tilde{F}_{mK}
\right]\end{split}\]
(b)
\[\begin{split}& \prod_{i = 1}^I \DirDist_{\boldsymbol{\pi}_i}\left[
\alpha + P_{i1}, \ldots, \alpha + P_{iM}
\right]
\cdot
\CatDist_{p^*}[\boldsymbol{\pi}_*]\\
&= \left(
\prod_{i = 1}^I
\frac{
\Gamma\left[ M \alpha + \sum_{m = 1}^M P_{im} \right]
}{
\prod_{m = 1}^M \Gamma\left[ \alpha + P_{im} \right]
}
\prod_{m = 1}^M \boldsymbol{\pi}_{im}^{\alpha - 1 + P_{im}}
\right)
\cdot
\prod_{m = 1}^M \boldsymbol{\pi}_{*m}^{\delta\left[ p^* - m \right]}
& \quad & \text{(3.9), (3.8)}\\
&= \left(
\prod_{i = 1}^I
\frac{
\Gamma\left[ M \alpha + \sum_{m = 1}^M P_{im} \right]
}{
\prod_{m = 1}^M \Gamma\left[ \alpha + P_{im} \right]
}
\prod_{m = 1}^M \boldsymbol{\pi}_{im}^{\alpha - 1 + P_{im}}
\right)
\cdot
\prod_{i = 1}^I \prod_{m = 1}^M \boldsymbol{\pi}_{im}^{\tilde{P}_{im}}
& \quad & \tilde{P}_{im} =
\delta\left[ p^* - m \right] \delta\left[ * - i \right]\\
&= \prod_{i = 1}^I
\frac{
\Gamma\left[ M \alpha + \sum_{m = 1}^M P_{im} \right]
}{
\prod_{m = 1}^M \Gamma\left[ \alpha + P_{im} \right]
}
\prod_{m = 1}^M
\boldsymbol{\pi}_{im}^{\alpha - 1 + P_{im} + \tilde{P}_{im}}\\
&= \prod_{i = 1}^I
\frac{
\Gamma\left[ M \alpha + \sum_{m = 1}^M P_{im} \right]
}{
\prod_{m = 1}^M \Gamma\left[ \alpha + P_{im} \right]
}
\frac{
\prod_{m = 1}^M \Gamma\left[ \alpha + P_{im} + \tilde{P}_{im} \right]
}{
\Gamma\left[
M \alpha + \sum_{m = 1}^M P_{im} + \tilde{P}_{im}
\right]
}
\DirDist_{\boldsymbol{\pi}_i}\left[
\alpha + P_{i1} + \tilde{P}_{i1},
\ldots,
\alpha + P_{iM} + \tilde{P}_{iM}
\right]\end{split}\]
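Identities (a) and (b) have identical structure (swap \(\beta\), \(F\),
\(K\) for \(\alpha\), \(P\), \(M\)), so a single numeric spot-check suffices.
The sketch below verifies (a) for one part \(m\) at a random point on the
simplex; the specific sizes and counts are arbitrary.

```python
import numpy as np
from scipy.stats import dirichlet
from scipy.special import gammaln

rng = np.random.default_rng(1)
K, beta = 5, 0.7                            # illustrative values
F_m = rng.integers(0, 10, size=K)           # counts F_{m1}, ..., F_{mK} for part m
lam_m = rng.dirichlet(np.ones(K))           # a point lambda_m on the simplex
f_star = 2                                  # the new word, assigned to part m

# Left-hand side: Dir pdf at lambda_m times the extra Cat factor lambda_{m f*}.
lhs = dirichlet.pdf(lam_m, beta + F_m) * lam_m[f_star]

# Right-hand side: Gamma-ratio constant times the updated Dir pdf.
F_new = F_m + np.eye(K, dtype=int)[f_star]  # F_{mk} + tilde{F}_{mk}
log_const = (gammaln(K * beta + F_m.sum()) - gammaln(K * beta + F_new.sum())
             + gammaln(beta + F_new).sum() - gammaln(beta + F_m).sum())
rhs = np.exp(log_const) * dirichlet.pdf(lam_m, beta + F_new)

assert np.isclose(lhs, rhs)
```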
Exercise 20.3
The following is based on the derivations in
Exercise 3.10.
The likelihood term in LDA is given by
\[\begin{split}& \int \prod_{m = 1}^M Pr(\boldsymbol{\lambda}_m \mid \boldsymbol{\beta})
\cdot
\prod_{i = 1}^I
\prod_{j = 1}^{J_i} Pr(f_{ij} \mid \boldsymbol{\lambda}_{p_{ij}})
d\boldsymbol{\lambda}_{1 \ldots M}
& \quad & \text{(20.10)}\\
&= \left( \frac{\Gamma\left[K \beta\right]}{\Gamma[\beta]^K} \right)^M
\prod_{m = 1}^M
\frac{
\prod_{k = 1}^K \Gamma\left[ \beta + F_{mk} \right]
}{
\Gamma\left[ K \beta + \sum_{k = 1}^K F_{mk} \right]
}
\cdot
\int \prod_{m = 1}^M
\DirDist_{\boldsymbol{\lambda}_m}\left[
\beta + F_{m1}, \ldots, \beta + F_{mK}
\right]
d\boldsymbol{\lambda}_{1 \ldots M}
& \quad & \text{(a)}\\
&= \left( \frac{\Gamma\left[K \beta\right]}{\Gamma[\beta]^K} \right)^M
\prod_{m = 1}^M
\frac{
\prod_{k = 1}^K \Gamma[\beta + F_{mk}]
}{
\Gamma\left[ K \beta + \sum_{k = 1}^K F_{mk} \right]
}.\end{split}\]
The prior term in LDA is given by
\[\begin{split}& \prod_{i = 1}^I \int Pr(\boldsymbol{\pi}_i \mid \boldsymbol{\alpha})
\prod_{j = 1}^{J_i}
Pr(p_{ij} \mid \boldsymbol{\pi}_i) d\boldsymbol{\pi}_{i}
& \quad & \text{(20.11)}\\
&= \prod_{i = 1}^I
\frac{\Gamma[M \alpha]}{\Gamma[\alpha]^M}
\frac{
\prod_{m = 1}^M \Gamma\left[ \alpha + P_{im} \right]
}{
\Gamma\left[ M \alpha + \sum_{m = 1}^M P_{im} \right]
}
\int \DirDist_{\boldsymbol{\pi}_i}\left[
\alpha + P_{i1}, \ldots, \alpha + P_{iM}
\right]
d\boldsymbol{\pi}_i
& \quad & \text{(b)}\\
&= \left( \frac{\Gamma[M \alpha]}{\Gamma[\alpha]^M} \right)^I
\prod_{i = 1}^I
\frac{
\prod_{m = 1}^M \Gamma\left[ \alpha + P_{im} \right]
}{
\Gamma\left[ M \alpha + \sum_{m = 1}^M P_{im} \right]
}.\end{split}\]
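Both closed forms are products of Gamma-function ratios, so in practice they
are evaluated in log space. A minimal sketch using `scipy.special.gammaln`,
with `F` and `P` the count matrices defined in (a) and (b) below; the
function names are illustrative.

```python
import numpy as np
from scipy.special import gammaln

def log_likelihood_term(F, beta):
    """Log of the closed form for the likelihood term (20.10)."""
    M, K = F.shape
    return (M * (gammaln(K * beta) - K * gammaln(beta))
            + gammaln(beta + F).sum()
            - gammaln(K * beta + F.sum(axis=1)).sum())

def log_prior_term(P, alpha):
    """Log of the closed form for the prior term (20.11)."""
    I, M = P.shape
    return (I * (gammaln(M * alpha) - M * gammaln(alpha))
            + gammaln(alpha + P).sum()
            - gammaln(M * alpha + P.sum(axis=1)).sum())
```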
(a)
\[\begin{split}\prod_{m = 1}^M Pr(\boldsymbol{\lambda}_m \mid \boldsymbol{\beta})
\cdot
\prod_{i = 1}^I \prod_{j = 1}^{J_i}
Pr(f_{ij} \mid \boldsymbol{\lambda}_{p_{ij}})
&= \prod_{m = 1}^M \DirDist_{\boldsymbol{\lambda}_m}[\boldsymbol{\beta}]
\cdot
\prod_{i = 1}^I \prod_{j = 1}^{J_i}
\CatDist_{f_{ij}}[\boldsymbol{\lambda}_{p_{ij}}]
& \quad & \text{(20.5), (20.7)}\\
&= \left(
\prod_{m = 1}^M
\frac{
\Gamma\left[ \sum_{k = 1}^K \boldsymbol{\beta}_k \right]
}{
\prod_{k = 1}^K \Gamma\left[ \boldsymbol{\beta}_k \right]
}
\prod_{k = 1}^K \boldsymbol{\lambda}_{mk}^{\boldsymbol{\beta}_k - 1}
\right)
\cdot
\prod_{i = 1}^I \prod_{j = 1}^{J_i} \prod_{k = 1}^K
\boldsymbol{\lambda}_{p_{ij} k}^{\delta\left[ f_{ij} - k \right]}
& \quad & \text{(3.9), (3.8)}\\
&= \left(
\prod_{m = 1}^M
\frac{\Gamma\left[K \beta\right]}{\Gamma[\beta]^K}
\prod_{k = 1}^K \boldsymbol{\lambda}_{mk}^{\beta - 1}
\right)
\cdot
\prod_{m = 1}^M \prod_{k = 1}^K \boldsymbol{\lambda}_{mk}^{F_{mk}}
& \quad & \boldsymbol{\beta} = \beta \mathbf{1}; \;
F_{mk} =
\sum_{i = 1}^I \sum_{j = 1}^{J_i}
\delta\left[ f_{ij} - k \right]
\delta\left[ p_{ij} - m \right]\\
&= \prod_{m = 1}^M
\frac{\Gamma\left[K \beta\right]}{\Gamma[\beta]^K}
\prod_{k = 1}^K \boldsymbol{\lambda}_{mk}^{\beta - 1 + F_{mk}}\\
&= \prod_{m = 1}^M
\frac{\Gamma\left[K \beta\right]}{\Gamma[\beta]^K}
\frac{
\prod_{k = 1}^K \Gamma[\beta + F_{mk}]
}{
\Gamma\left[ K \beta + \sum_{k = 1}^K F_{mk} \right]
}
\DirDist_{\boldsymbol{\lambda}_m}\left[
\beta + F_{m1}, \ldots, \beta + F_{mK}
\right]\end{split}\]
(b)
\[\begin{split}\prod_{i = 1}^I Pr(\boldsymbol{\pi}_i \mid \boldsymbol{\alpha})
\prod_{j = 1}^{J_i} Pr(p_{ij} \mid \boldsymbol{\pi}_i)
&= \prod_{i = 1}^I \DirDist_{\boldsymbol{\pi}_i}[\boldsymbol{\alpha}]
\prod_{j = 1}^{J_i} \CatDist_{p_{ij}}[\boldsymbol{\pi}_i]
& \quad & \text{(20.5), (20.7)}\\
&= \prod_{i = 1}^I \left(
\frac{
\Gamma\left[ \sum_{m = 1}^M \boldsymbol{\alpha}_m \right]
}{
\prod_{m = 1}^M \Gamma\left[ \boldsymbol{\alpha}_m \right]
}
\prod_{m = 1}^M \boldsymbol{\pi}_{im}^{\boldsymbol{\alpha}_m - 1}
\cdot
\prod_{j = 1}^{J_i} \prod_{m = 1}^M
\boldsymbol{\pi}_{im}^{\delta\left[ p_{ij} - m \right]}
\right)
& \quad & \text{(3.9), (3.8)}\\
&= \prod_{i = 1}^I \left(
\frac{\Gamma[M \alpha]}{\Gamma[\alpha]^M}
\prod_{m = 1}^M \boldsymbol{\pi}_{im}^{\alpha - 1}
\cdot
\prod_{m = 1}^M \boldsymbol{\pi}_{im}^{P_{im}}
\right)
& \quad & \boldsymbol{\alpha} = \alpha \mathbf{1}; \;
P_{im} = \sum_{j = 1}^{J_i} \delta[p_{ij} - m]\\
&= \prod_{i = 1}^I
\frac{\Gamma\left[M \alpha \right]}{\Gamma[\alpha]^M}
\prod_{m = 1}^M \boldsymbol{\pi}_{im}^{\alpha - 1 + P_{im}}\\
&= \prod_{i = 1}^I
\frac{\Gamma[M \alpha]}{\Gamma[\alpha]^M}
\frac{
\prod_{m = 1}^M \Gamma\left[ \alpha + P_{im} \right]
}{
\Gamma\left[ M \alpha + \sum_{m = 1}^M P_{im} \right]
}
\DirDist_{\boldsymbol{\pi}_i}\left[
\alpha + P_{i1}, \ldots, \alpha + P_{iM}
\right]\end{split}\]
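As a sanity check on the derivation, the closed form for the likelihood term
can be compared against a brute-force Monte Carlo estimate of the integral
over \(\boldsymbol{\lambda}\). A rough sketch, reusing `log_likelihood_term`
from the block above and small sizes so the estimate is stable:

```python
import numpy as np

rng = np.random.default_rng(3)
M, K, beta = 2, 3, 0.5
F = rng.integers(0, 4, size=(M, K))                  # arbitrary small counts F_{mk}

# Monte Carlo estimate of E_{lambda ~ Dir(beta)}[prod_{m,k} lambda_{mk}^F_{mk}],
# which equals the integral computed in closed form above.
S = 200_000
lam = rng.dirichlet(np.full(K, beta), size=(S, M))   # prior draws, shape (S, M, K)
mc = np.log(np.mean(np.prod(lam ** F, axis=(1, 2))))

print(log_likelihood_term(F, beta), mc)              # the two values should roughly agree
```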
Exercise 20.7
The graphical model can be constructed from the following generative process
(a simulation sketch appears at the end of this exercise).
For each image indexed by \(i \in \{1, \ldots, I\}\) in a corpus:

1. Choose a scene \(s_i \in \{1, \ldots, C\}\) from the distribution
   \(Pr(s_i) = \CatDist_{s_i}[\boldsymbol{\nu}]\).
2. Choose \(J_i \sim \PoissonDist(\xi)\).
3. For each visual word indexed by \(j \in \{1, \ldots, J_i\}\) in image
   \(i\):

   a. Choose an object \(w_{ij} \in \{1, \ldots, L\}\) from the distribution
      \(Pr(w_{ij} \mid s_i = c) =
      \CatDist_{w_{ij}}[\boldsymbol{\phi}_c]\).
   b. Choose a part \(p_{ij} \in \{1, \ldots, M\}\) from the distribution
      \(Pr(p_{ij} \mid w_{ij} = n) =
      \CatDist_{p_{ij}}[\boldsymbol{\pi}_n]\).
   c. Choose a visual word \(f_{ij} \in \{1, \ldots, K\}\) from the
      distribution
      \(Pr(f_{ij} \mid p_{ij} = m) =
      \CatDist_{f_{ij}}[\boldsymbol{\lambda}_m]\).
   d. Choose a feature position \(\mathbf{x}_{ij}\) from the distribution
      \(\DeclareMathOperator{\NormDist}{Norm}
      Pr(\mathbf{x}_{ij} \mid p_{ij} = m, w_{ij} = n) =
      \NormDist_{\mathbf{x}_{ij}}\left[
      \boldsymbol{\mu}_n^{(w)} + \boldsymbol{\mu}_m^{(p)},
      \boldsymbol{\Sigma}_n^{(w)} + \boldsymbol{\Sigma}_m^{(p)}
      \right]\).

\(\{ s_i \}_{i = 1}^I\), \(\{ f_{ij} \}_{i = 1, j = 1}^{I, J_i}\), and
\(\{ \mathbf{x}_{ij} \}_{i = 1, j = 1}^{I, J_i}\) are observable (manifest)
variables. \(\{ w_{ij} \}_{i = 1, j = 1}^{I, J_i}\) and
\(\{ p_{ij} \}_{i = 1, j = 1}^{I, J_i}\) are unobservable (latent) variables.
The remaining quantities are parameters to be estimated, with fixed
hyperparameters.
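As with Exercise 20.2, the process can be simulated directly. The sketch
below fixes illustrative sizes and draws the parameters from arbitrary
distributions purely to make it runnable; in the model itself these are
quantities to be estimated.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative sizes (assumptions): scenes, objects, parts, vocabulary, rate.
C, L, M, K, xi = 3, 4, 5, 20, 30.0

# Arbitrary parameter values so the sketch runs; the model estimates these.
nu = rng.dirichlet(np.ones(C))             # scene probabilities
phi = rng.dirichlet(np.ones(L), size=C)    # object distribution per scene
pi = rng.dirichlet(np.ones(M), size=L)     # part distribution per object
lam = rng.dirichlet(np.ones(K), size=M)    # word distribution per part
mu_w = rng.normal(size=(L, 2))             # per-object mean positions
mu_p = rng.normal(size=(M, 2))             # per-part mean offsets
Sig_w = np.stack([np.eye(2)] * L)          # per-object position covariances
Sig_p = np.stack([0.1 * np.eye(2)] * M)    # per-part position covariances

def sample_image():
    s = rng.choice(C, p=nu)                          # s_i ~ Cat(nu)
    J = rng.poisson(xi)                              # J_i ~ Pois(xi)
    w = rng.choice(L, size=J, p=phi[s])              # w_ij ~ Cat(phi_s)
    p = np.array([rng.choice(M, p=pi[n]) for n in w], dtype=int)
    f = np.array([rng.choice(K, p=lam[m]) for m in p], dtype=int)
    x = np.array([rng.multivariate_normal(mu_w[n] + mu_p[m],
                                          Sig_w[n] + Sig_p[m])
                  for n, m in zip(w, p)])
    return s, w, p, f, x
```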