Fitting Probability Models
Exercise 4.1
\[\begin{split}\frac{\partial L}{\partial \sigma}
&= \frac{\partial}{\partial \sigma} \left[
-0.5I \log[2 \pi] - 0.5I \log \sigma^2 -
0.5 \sum_{i = 1}^I \frac{(x_i - \mu)^2}{\sigma^2}
\right]\\
0 &= -\frac{I}{\sigma} + \sum_{i = 1}^I \frac{(x_i - \mu)^2}{\sigma^3}\\
\sigma^2 &= \sum_{i = 1}^I \frac{(x_i - \mu)^2}{I}\end{split}\]
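As a numerical check, the closed-form ML estimates can be sketched in Python; the data vector below is just an illustrative example (the set \(\mathcal{S}_1\) from Exercise 4.6):

```python
def normal_ml(xs):
    """Maximum-likelihood estimates of a univariate normal.

    mu is the sample mean; sigma^2 is the *biased* estimator
    sum((x - mu)^2) / I derived above (divisor I, not I - 1).
    """
    I = len(xs)
    mu = sum(xs) / I
    var = sum((x - mu) ** 2 for x in xs) / I
    return mu, var

mu, var = normal_ml([0.1, -0.5, 0.2, 0.7])
```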
Exercise 4.2
The log-likelihood is
\[\begin{split}\DeclareMathOperator{\NormDist}{Norm}
\DeclareMathOperator{\NormInvGamDist}{NormInvGam}
L &= \sum_{i = 1}^I
\log\left(
\NormDist_{x_i}[\mu, \sigma^2]
\right) +
\log\left(
\NormInvGamDist_{\mu, \sigma^2}[\alpha, \beta, \gamma, \delta]
\right)\\
&= -0.5 (I + 1) \log(2 \pi) -
0.5 (I + 1) \log \sigma^2 -
0.5 \sum_{i = 1}^I \frac{(x_i - \mu)^2}{\sigma^2} +
0.5 \log \gamma +
\alpha \log \beta - \log \Gamma[\alpha] -
(\alpha + 1) \log \sigma^2 -
\frac{2 \beta + \gamma (\delta - \mu)^2}{2 \sigma^2}.\end{split}\]
The mean is derived as
\[\begin{split}0 = \frac{\partial L}{\partial \mu}
&= \sum_{i = 1}^I
\frac{x_i - \mu}{\sigma^2} +
\frac{\gamma (\delta - \mu)}{\sigma^2}\\
\mu (I + \gamma) &= \sum_{i = 1}^I x_i + \gamma \delta\\
\mu &= \frac{\sum_{i = 1}^I x_i + \gamma \delta}{I + \gamma}.\end{split}\]
The variance is derived as
\[\begin{split}0 = \frac{\partial L}{\partial \sigma}
&= -\frac{I + 1}{\sigma} +
\sum_{i = 1}^I
\frac{(x_i - \mu)^2}{\sigma^3} -
\frac{2 (\alpha + 1)}{\sigma} +
\frac{2 \beta + \gamma (\delta - \mu)^2}{\sigma^3}\\
\sigma^2 \left( I + 1 + 2 \alpha + 2 \right)
&= \sum_{i = 1}^I (x_i - \mu)^2 + 2 \beta + \gamma (\delta - \mu)^2\\
\sigma^2
&= \frac{
\sum_{i = 1}^I (x_i - \mu)^2 + 2 \beta + \gamma (\delta - \mu)^2
}{
I + 2 \alpha + 3
}.\end{split}\]
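The two MAP formulas can be evaluated together; a minimal sketch, using the prior parameters \(\alpha = \beta = \gamma = 1, \delta = 0\) from Exercise 4.6 and the set \(\mathcal{S}_1\) there as illustrative data:

```python
def normal_map(xs, alpha, beta, gamma, delta):
    """MAP estimates of mu and sigma^2 under a normal-inverse-gamma
    prior, using the two formulas derived above."""
    I = len(xs)
    mu = (sum(xs) + gamma * delta) / (I + gamma)
    var = (sum((x - mu) ** 2 for x in xs)
           + 2 * beta + gamma * (delta - mu) ** 2) / (I + 2 * alpha + 3)
    return mu, var

mu_map, var_map = normal_map([0.1, -0.5, 0.2, 0.7], 1, 1, 1, 0)
```

Note how the prior pulls the mean toward \(\delta\) and inflates the variance relative to the ML estimates.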
Exercise 4.3
\[\begin{split}\frac{\partial L}{\partial \lambda_k}
&= \frac{\partial}{\partial \lambda_k} \left[
\sum_k N_k \log \lambda_k + \nu \left( \sum_k \lambda_k - 1 \right)
\right]\\
0 &= \frac{N_k}{\lambda_k} + \nu\\
\lambda_k &= \frac{N_k}{\nu}
& \quad & \text{the negative factor is absorbed into } \nu\end{split}\]
From the constraints on the categorical distribution,
\[\begin{split}0 = \frac{\partial L}{\partial \nu} &= \sum_k \lambda_k - 1\\
1 &= \sum_k \lambda_k\\
1 &= \sum_k \frac{N_k}{\nu}
& \quad & \text{substitute in optimal parameter for } \lambda_k\\
\nu &= \sum_k N_k.\end{split}\]
Thus \(\lambda_k = \frac{N_k}{\sum_m N_m}\).
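The result is just the observed frequency of each category; a minimal sketch with hypothetical observations encoded as category indices:

```python
from collections import Counter

def categorical_ml(xs, K):
    """ML estimate: lambda_k is the observed fraction of category k."""
    counts = Counter(xs)  # missing categories count as zero
    I = len(xs)
    return [counts[k] / I for k in range(K)]

lam = categorical_ml([0, 1, 1, 2], 3)  # hypothetical data
```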
Exercise 4.4
The log-likelihood is
\[L =
\sum_k (N_k + \alpha_k - 1) \log \lambda_k +
\nu \left( \sum_k \lambda_k - 1 \right).\]
Solving for the roots of \(L\) gives
\[\begin{split}0 = \frac{\partial L}{\partial \lambda_k}
&= \frac{N_k + \alpha_k - 1}{\lambda_k} + \nu\\
\lambda_k &= \frac{N_k + \alpha_k - 1}{\nu}
& \quad & \text{the negative factor is absorbed into } \nu\end{split}\]
and
\[\begin{split}0 = \frac{\partial L}{\partial \nu} &= \sum_k \lambda_k - 1\\
1 &= \sum_k \lambda_k\\
1 &= \sum_k \frac{N_k + \alpha_k - 1}{\nu}
& \quad & \text{substitute in optimal parameter for } \lambda_k\\
\nu &= \sum_k (N_k + \alpha_k - 1).\end{split}\]
Thus \(\lambda_k = \frac{N_k + \alpha_k - 1}{\sum_m (N_m + \alpha_m - 1)}\).
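The MAP estimate simply adds the Dirichlet pseudo-counts \(\alpha_k - 1\) before normalizing; a minimal sketch with hypothetical data and prior:

```python
from collections import Counter

def categorical_map(xs, alphas):
    """MAP estimate under a Dirichlet prior, per the formula above."""
    counts = Counter(xs)
    num = [counts[k] + alphas[k] - 1 for k in range(len(alphas))]
    total = sum(num)
    return [n / total for n in num]

lam = categorical_map([0, 1, 1, 2], [2, 2, 2])  # hypothetical data and prior
```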
Exercise 4.5
\[\begin{split}Pr(x_{1 \ldots I})
&= \int Pr(x_{1 \ldots I}, \theta) d\theta
& \quad & \text{marginalization}\\
&= \int \prod_i Pr(x_i \mid \theta) Pr(\theta) d\theta
& \quad & \text{samples are independent and identically distributed}\end{split}\]
(i)
Assume samples come from a univariate normal distribution with the
corresponding conjugate prior.
\[\begin{split}\int \prod_i Pr(x_i \mid \theta) Pr(\theta) d\theta
&= \int \prod_i \NormDist_{x_i}\left[ \mu, \sigma^2 \right] \cdot
\NormInvGamDist_{\mu, \sigma^2}[\alpha, \beta, \gamma, \delta] d\theta\\
&= \int \kappa \NormInvGamDist_{\mu, \sigma^2}\left[
\tilde{\alpha}, \tilde{\beta}, \tilde{\gamma}, \tilde{\delta}
\right] d\theta
& \quad & \text{Exercise 3.11}\\
&= \kappa
& \quad & \text{the conjugate prior is a valid probability distribution}\end{split}\]
See Exercise 3.11 for more details.
(ii)
Assume samples come from a categorical distribution with the corresponding
conjugate prior.
\[\begin{split}\DeclareMathOperator{\CatDist}{Cat}
\DeclareMathOperator{\DirDist}{Dir}
\int \prod_i Pr(x_i \mid \theta) Pr(\theta) d\theta
&= \int \prod_i \CatDist_{\mathbf{x}_i}[\lambda_{1 \ldots K}] \cdot
\DirDist_{\lambda_{1 \ldots K}}[\alpha_{1 \ldots K}] d\theta\\
&= \int \kappa \DirDist_{\lambda_{1 \ldots K}}\left[
\tilde{\alpha}_{1 \ldots K}
\right] d\theta
& \quad & \text{Exercise 3.10}\\
&= \kappa
& \quad & \text{the conjugate prior is a valid probability distribution}\end{split}\]
See Exercise 3.10 for more details.
Exercise 4.6
From Exercise 4.5,
\[Pr(x_{1 \ldots I}) = \kappa =
\frac{1}{(2 \pi)^{I / 2}}
\frac{
\sqrt{\gamma} \beta^\alpha
}{
\sqrt{\tilde{\gamma}} \tilde{\beta}^{\tilde{\alpha}}
}\]
where
\[\begin{split}\tilde{\alpha} &= \alpha + 0.5 I\\
\tilde{\beta}
&= 0.5 \sum_i x_i^2 +
\beta + 0.5 \gamma \delta^2 -
0.5 \frac{(\gamma \delta + \sum_i x_i)^2}{\gamma + I}\\
\tilde{\gamma} &= \gamma + I.\end{split}\]
The following solutions use the conjugate prior with parameters
\(\alpha = 1, \beta = 1, \gamma = 1, \delta = 0\).
\(Pr(\mathcal{S}_1 = \{ 0.1, -0.5, 0.2, 0.7 \})\)
\[\begin{split}\begin{gather*}
\tilde{\alpha} = 3\\
\tilde{\beta} = 1.37\\
\tilde{\gamma} = 5\\
Pr(\mathcal{S}_1) = \kappa = 4.4 \times 10^{-3}
\end{gather*}\end{split}\]
\(Pr(\mathcal{S}_2 = \{ 1.1, 2.0, 1.4, 2.3 \})\)
\[\begin{split}\begin{gather*}
\tilde{\alpha} = 3\\
\tilde{\beta} = 2.606\\
\tilde{\gamma} = 5\\
Pr(\mathcal{S}_2) = \kappa = 6.4 \times 10^{-4}
\end{gather*}\end{split}\]
\(Pr(\mathcal{S}_1 \cup \mathcal{S}_2 \mid M_1)\)
\[\begin{split}\begin{gather*}
\tilde{\alpha} = 5\\
\tilde{\beta} = 4.664\\
\tilde{\gamma} = 9\\
Pr(\mathcal{S}_1 \cup \mathcal{S}_2 \mid M_1) = \kappa = 9.7 \times 10^{-8}
\end{gather*}\end{split}\]
\(Pr(\mathcal{S}_1 \cup \mathcal{S}_2 \mid M_2)\)
\[Pr(\mathcal{S}_1 \cup \mathcal{S}_2 \mid M_2) =
Pr(\mathcal{S}_1) Pr(\mathcal{S}_2) = 2.8 \times 10^{-6}\]
Priors on \(M_1\) and \(M_2\)
Suppose the priors on \(M_1\) and \(M_2\) are uniform. The posterior
probability simplifies to
\[Pr(M_1 \mid \mathcal{S}_1 \cup \mathcal{S}_2) =
\frac{
Pr(\mathcal{S}_1 \cup \mathcal{S}_2 \mid M_1) Pr(M_1)
}{
\sum_i Pr(\mathcal{S}_1 \cup \mathcal{S}_2 \mid M_i) Pr(M_i)
} = 0.0335\]
and
\[Pr(M_2 \mid \mathcal{S}_1 \cup \mathcal{S}_2) =
\frac{
Pr(\mathcal{S}_1 \cup \mathcal{S}_2 \mid M_2) Pr(M_2)
}{
\sum_i Pr(\mathcal{S}_1 \cup \mathcal{S}_2 \mid M_i) Pr(M_i)
} = 0.967.\]
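The model comparison above can be reproduced numerically. The sketch below implements the \(\kappa\) expression and tilde parameters as stated at the start of this exercise, with \(\alpha = \beta = \gamma = 1, \delta = 0\):

```python
import math

def evidence(xs, alpha=1.0, beta=1.0, gamma=1.0, delta=0.0):
    """Marginal likelihood kappa, using the expression for
    Pr(x_1..I) and the tilde parameters given above."""
    I = len(xs)
    a = alpha + 0.5 * I                        # alpha-tilde
    g = gamma + I                              # gamma-tilde
    b = (0.5 * sum(x * x for x in xs) + beta   # beta-tilde
         + 0.5 * gamma * delta ** 2
         - 0.5 * (gamma * delta + sum(xs)) ** 2 / g)
    return (math.sqrt(gamma) * beta ** alpha
            / ((2 * math.pi) ** (0.5 * I) * math.sqrt(g) * b ** a))

s1 = [0.1, -0.5, 0.2, 0.7]
s2 = [1.1, 2.0, 1.4, 2.3]
p1, p2, p12 = evidence(s1), evidence(s2), evidence(s1 + s2)

# Posterior over models under uniform priors on M1 and M2.
post_m1 = p12 / (p12 + p1 * p2)   # ~0.033: one shared normal is unlikely
post_m2 = 1 - post_m1
```

The slight difference from the 0.0335 quoted above comes from rounding the intermediate \(\kappa\) values there.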
Exercise 4.7
Suppose samples are independent and identically distributed. The likelihood
function
\[\DeclareMathOperator*{\argmax}{arg\,max}
\argmax_\lambda Pr(x_{1 \ldots I} \mid \lambda) =
\argmax_\lambda \prod_{i = 1}^I \lambda^{x_i} (1 - \lambda)^{1 - x_i}\]
is difficult to optimize analytically because differentiating the product
requires repeated applications of the product rule. Since \(\log\) is a
monotonic function, maximizing the logarithm instead turns the product into
a sum whose derivative follows from simple chain rules:
\[\begin{split}L = \log Pr(x_{1 \ldots I} \mid \lambda)
&= \sum_{i = 1}^I x_i \log \lambda + (1 - x_i) \log(1 - \lambda)\\
&= I \log(1 - \lambda) +
\sum_{i = 1}^I x_i \log \lambda - x_i \log (1 - \lambda).\end{split}\]
An optimum of \(L\) can be found by applying Fermat’s theorem:
\[\begin{split}0 = \frac{\partial L}{\partial \lambda}
&= -\frac{I}{1 - \lambda} +
\sum_{i = 1}^I
x_i \left( \frac{1}{\lambda} + \frac{1}{1 - \lambda} \right)\\
I &= \sum_{i = 1}^I x_i \left( \frac{1 - \lambda}{\lambda} + 1 \right)\\
\lambda &= I^{-1} \sum_{i = 1}^I x_i.\end{split}\]
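The ML estimate is simply the sample mean of the binary observations; a one-line sketch with hypothetical coin-flip data:

```python
def bernoulli_ml(xs):
    """ML estimate of the Bernoulli parameter: the sample mean."""
    return sum(xs) / len(xs)

lam = bernoulli_ml([1, 0, 1, 1])  # hypothetical data
```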
Exercise 4.8
The posterior function
\[\begin{split}\DeclareMathOperator{\BetaDist}{Beta}
\DeclareMathOperator{\BernDist}{Bern}
\argmax_\lambda Pr(\lambda \mid x_{1 \ldots I})
&= \argmax_\lambda
\frac{
Pr(x_{1 \ldots I} \mid \lambda) Pr(\lambda)
}{
Pr(x_{1 \ldots I})
}\\
&= \argmax_\lambda
\frac{
\prod_{i = 1}^I Pr(x_i \mid \lambda) \cdot Pr(\lambda)
}{
Pr(x_{1 \ldots I})
}
& \quad & \text{assuming samples are I.I.D.}\\
&= \argmax_\lambda \prod_{i = 1}^I Pr(x_i \mid \lambda) \cdot Pr(\lambda)
& \quad & Pr(x_{1 \ldots I}) \text{ is a constant}\\
&= \argmax_\lambda \prod_{i = 1}^I \BernDist_{x_i}[\lambda] \cdot
\BetaDist_\lambda[\alpha, \beta]\end{split}\]
is difficult to optimize analytically because differentiating the product
requires repeated applications of the product rule. Since \(\log\) is a
monotonic function, maximizing the logarithm instead turns the product into
a sum whose derivative follows from simple chain rules:
\[\begin{split}L &= \sum_{i = 1}^I
\log \BernDist_{x_i}[\lambda] +
\log \BetaDist_\lambda[\alpha, \beta]\\
&= \left[
\sum_{i = 1}^I x_i \log \lambda + (1 - x_i) \log(1 - \lambda)
\right] +
(\alpha - 1) \log \lambda + (\beta - 1) \log (1 - \lambda)\\
&= I \log(1 - \lambda) +
\sum_{i = 1}^I
x_i \left[ \log \lambda - \log(1 - \lambda) \right] +
(\alpha - 1) \log \lambda + (\beta - 1) \log (1 - \lambda).\end{split}\]
The optimum of \(L\) can be found by applying Fermat’s theorem:
\[\begin{split}0 = \frac{\partial L}{\partial \lambda}
&= -\frac{I}{1 - \lambda} +
\sum_{i = 1}^I
x_i \left( \frac{1}{\lambda} + \frac{1}{1 - \lambda} \right) +
\frac{\alpha - 1}{\lambda} - \frac{\beta - 1}{1 - \lambda}\\
I + \beta - 1
&= \sum_{i = 1}^I x_i \left( \frac{1 - \lambda}{\lambda} + 1 \right) +
\frac{\alpha - 1}{\lambda} (1 - \lambda)\\
I + \alpha + \beta - 2
&= \sum_{i = 1}^I x_i \frac{1}{\lambda} +
\frac{\alpha - 1}{\lambda}\\
\lambda &= \frac{\alpha - 1 + \sum_{i = 1}^I x_i}{I + \alpha + \beta - 2}.\end{split}\]
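The MAP formula adds the Beta pseudo-counts \(\alpha - 1\) and \(\beta - 1\) to the observed successes and failures; a minimal sketch with hypothetical data and prior:

```python
def bernoulli_map(xs, alpha, beta):
    """MAP estimate of the Bernoulli parameter under a Beta prior,
    per the formula above."""
    I = len(xs)
    return (alpha - 1 + sum(xs)) / (I + alpha + beta - 2)

lam = bernoulli_map([1, 0, 1, 1], 2, 2)  # hypothetical data, Beta(2, 2) prior
```

With \(\alpha = \beta = 1\) the pseudo-counts vanish and the estimate reduces to the ML result of Exercise 4.7.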
Exercise 4.9
(i)
\[\begin{split}Pr(\lambda \mid x_{1 \ldots I})
&= \frac{Pr(x_{1 \ldots I} \mid \lambda) Pr(\lambda)}{Pr(x_{1 \ldots I})}\\
&= \frac{
\prod_{i = 1}^I Pr(x_i \mid \lambda) \cdot Pr(\lambda)
}{
Pr(x_{1 \ldots I})
}
& \quad & \text{assuming samples are I.I.D.}\\
&= \frac{
\prod_{i = 1}^I \BernDist_{x_i}[\lambda] \cdot
\BetaDist_\lambda[\alpha, \beta]
}{
Pr(x_{1 \ldots I})
}\\
&= \BetaDist_\lambda\left[ \tilde{\alpha}, \tilde{\beta} \right]\end{split}\]
where \(\tilde{\alpha} = \alpha + \sum_{i = 1}^I x_i\) and
\(\tilde{\beta} = \beta + I - \sum_{i = 1}^I x_i\).
See Exercise 3.9 and Exercise 4.5 for more details.
(ii)
\[\begin{split}Pr(x^* \mid x_{1 \ldots I})
&= \frac{Pr(x^*, x_{1 \ldots I})}{Pr(x_{1 \ldots I})}\\
&= \int \frac{Pr(x^*, x_{1 \ldots I}, \lambda)}{Pr(x_{1 \ldots I})} d\lambda
& \quad & \text{marginalization}\\
&= \int
\frac{
Pr(x^* \mid \lambda) Pr(x_{1 \ldots I} \mid \lambda) Pr(\lambda)
}{
Pr(x_{1 \ldots I})
} d\lambda\\
&= \int Pr(x^* \mid \lambda) Pr(\lambda \mid x_{1 \ldots I}) d\lambda
& \quad & \text{see (i)}\\
&= \int
\BernDist_{x^*}[\lambda]
\BetaDist_\lambda[\tilde{\alpha}, \tilde{\beta}] d\lambda\\
&= \kappa(x^*, \hat{\alpha}, \hat{\beta})\end{split}\]
where
\[\begin{split}\begin{gather*}
\kappa =
\frac{
\Gamma[\tilde{\alpha} + \tilde{\beta}]
\Gamma[\hat{\alpha}] \Gamma[\hat{\beta}]
}{
\Gamma[\tilde{\alpha}] \Gamma[\tilde{\beta}]
\Gamma[\tilde{\alpha} + \tilde{\beta} + 1]
}\\
\hat{\alpha} = \tilde{\alpha} + x^*\\
\hat{\beta} = \tilde{\beta} + 1 - x^*.
\end{gather*}\end{split}\]
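The predictive \(\kappa\) can be evaluated directly with Gamma functions; a minimal sketch, with hypothetical posterior parameters \(\tilde{\alpha} = 2, \tilde{\beta} = 3\):

```python
import math

def predictive(x_star, a_tilde, b_tilde):
    """Pr(x* | x_1..I) from the Beta posterior, via the
    Gamma-function expression for kappa given above."""
    a_hat = a_tilde + x_star
    b_hat = b_tilde + 1 - x_star
    return (math.gamma(a_tilde + b_tilde)
            * math.gamma(a_hat) * math.gamma(b_hat)
            / (math.gamma(a_tilde) * math.gamma(b_tilde)
               * math.gamma(a_tilde + b_tilde + 1)))

p1 = predictive(1, 2, 3)
p0 = predictive(0, 2, 3)
```

The expression simplifies to \(Pr(x^* = 1) = \tilde{\alpha} / (\tilde{\alpha} + \tilde{\beta})\), so the two probabilities always sum to one.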
Exercise 4.10
The ML estimate confidently predicts that all future data will be
\(x = 0\). A uniform beta prior \((\alpha = 1, \beta = 1)\) reduces MAP to
ML. Only the Bayesian approach assigns nonzero probability to both
\(x = 0\) and \(x = 1\).
ML: lambda = 0.0
x = 0: 1.0
x = 1: 0.0
MAP: lambda = 0.0, alpha = 1, beta = 1
x = 0: 1.0
x = 1: 0.0
Bayesian
(alpha = 1, beta = 6) x = 0: 0.8333333333333334
(alpha = 2, beta = 5) x = 1: 0.16666666666666666
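The figures above can be reproduced with a short script. The dataset is not stated in the exercise output, so the sketch assumes a hypothetical sample of four observations, all zero, which is consistent with the reported numbers:

```python
import math

def predictive(x_star, a, b):
    """Posterior-predictive probability from a Beta(a, b) posterior."""
    return (math.gamma(a + b)
            * math.gamma(a + x_star) * math.gamma(b + 1 - x_star)
            / (math.gamma(a) * math.gamma(b) * math.gamma(a + b + 1)))

xs = [0, 0, 0, 0]   # hypothetical data: four observations, all zero
I, s = len(xs), sum(xs)

lam_ml = s / I                           # ML estimate: 0.0
lam_map = (1 - 1 + s) / (I + 1 + 1 - 2)  # uniform Beta(1, 1) prior: same as ML
a_t, b_t = 1 + s, 1 + I - s              # Beta posterior parameters
p0 = predictive(0, a_t, b_t)             # Bayesian Pr(x* = 0)
p1 = predictive(1, a_t, b_t)             # Bayesian Pr(x* = 1)
```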