Fitting Probability Models

Exercise 4.1

\[\begin{split}\frac{\partial L}{\partial \sigma} &= \frac{\partial}{\partial \sigma} \left[ -0.5I \log[2 \pi] - 0.5I \log \sigma^2 - 0.5 \sum_{i = 1}^I \frac{(x_i - \mu)^2}{\sigma^2} \right]\\ 0 &= -\frac{I}{\sigma} + \sum_{i = 1}^I \frac{(x_i - \mu)^2}{\sigma^3}\\ \sigma^2 &= \sum_{i = 1}^I \frac{(x_i - \mu)^2}{I}\end{split}\]
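
As a quick numerical check (a minimal sketch; the sample below is hypothetical), the closed-form ML estimates agree with numpy's built-in mean and biased variance:

import numpy

x = numpy.asarray([0.1, -0.5, 0.2, 0.7])  # hypothetical sample
I = len(x)

# Closed-form ML estimates derived above.
mu = numpy.sum(x) / I
var = numpy.sum((x - mu)**2) / I

# numpy.var uses the same biased (divide-by-I) estimator by default.
assert numpy.isclose(mu, numpy.mean(x)) and numpy.isclose(var, numpy.var(x))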

Exercise 4.2

The log-likelihood is

\[\begin{split}\DeclareMathOperator{\NormDist}{Norm} \DeclareMathOperator{\NormInvGamDist}{NormInvGam} L &= \sum_{i = 1}^I \log\left( \NormDist_{x_i}[\mu, \sigma^2] \right) + \log\left( \NormInvGamDist_{\mu, \sigma^2}[\alpha, \beta, \gamma, \delta] \right)\\ &= -0.5 (I + 1) \log(2 \pi) - 0.5 (I + 1) \log \sigma^2 - 0.5 \sum_{i = 1}^I \frac{(x_i - \mu)^2}{\sigma^2} + 0.5 \log \gamma + \alpha \log \beta - \log \Gamma[\alpha] - (\alpha + 1) \log \sigma^2 - \frac{2 \beta + \gamma (\delta - \mu)^2}{2 \sigma^2}.\end{split}\]

The mean is derived as

\[\begin{split}0 = \frac{\partial L}{\partial \mu} &= \sum_{i = 1}^I \frac{x_i - \mu}{\sigma^2} + \frac{\gamma (\delta - \mu)}{\sigma^2}\\ \mu (I + \gamma) &= \sum_{i = 1}^I x_i + \gamma \delta\\ \mu &= \frac{\sum_{i = 1}^I x_i + \gamma \delta}{I + \gamma}.\end{split}\]

The variance is derived as

\[\begin{split}0 = \frac{\partial L}{\partial \sigma} &= -\frac{I + 1}{\sigma} + \sum_{i = 1}^I \frac{(x_i - \mu)^2}{\sigma^3} - \frac{2 (\alpha + 1)}{\sigma} + \frac{2 \beta + \gamma (\delta - \mu)^2}{\sigma^3}\\ \sigma^2 \left( I + 1 + 2 \alpha + 2 \right) &= \sum_{i = 1}^I (x_i - \mu)^2 + 2 \beta + \gamma (\delta - \mu)^2\\ \sigma^2 &= \frac{ \sum_{i = 1}^I (x_i - \mu)^2 + 2 \beta + \gamma (\delta - \mu)^2 }{ I + 2 \alpha + 3 }.\end{split}\]
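
A minimal sketch of these MAP formulas, reusing a hypothetical sample and the prior parameters \(\alpha = \beta = \gamma = 1, \delta = 0\) from Exercise 4.6:

import numpy

alpha, beta, gamma, delta = 1.0, 1.0, 1.0, 0.0  # NormInvGam prior parameters
x = numpy.asarray([0.1, -0.5, 0.2, 0.7])        # hypothetical sample
I = len(x)

# MAP estimates derived above; the prior pulls mu toward delta.
mu = (numpy.sum(x) + gamma * delta) / (I + gamma)
var = (numpy.sum((x - mu)**2) + 2 * beta + gamma * (delta - mu)**2) / (I + 2 * alpha + 3)
print('mu = {0}, sigma^2 = {1}'.format(mu, var))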

Exercise 4.3

\[\begin{split}\frac{\partial L}{\partial \lambda_k} &= \frac{\partial}{\partial \lambda_k} \left[ \sum_k N_k \log \lambda_k + \nu \left( \sum_k \lambda_k - 1 \right) \right]\\ 0 &= \frac{N_k}{\lambda_k} + \nu\\ \lambda_k &= \frac{N_k}{\nu} & \quad & \text{the negative factor is absorbed into } \nu\end{split}\]

From the constraints on the categorical distribution,

\[\begin{split}0 = \frac{\partial L}{\partial \nu} &= \sum_k \lambda_k - 1\\ 1 &= \sum_k \lambda_k\\ 1 &= \sum_k \frac{N_k}{\nu} & \quad & \text{substitute in the optimal parameter for } \lambda_k\\ \nu &= \sum_k N_k.\end{split}\]

Thus \(\lambda_k = \frac{N_k}{\sum_m N_m}\).
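
As a sketch (with hypothetical observations), the ML estimate is just the normalized category counts:

import numpy

K = 3
x = numpy.asarray([0, 2, 1, 0, 0, 2])  # hypothetical observations in {0, ..., K - 1}
N = numpy.bincount(x, minlength=K)     # N_k: the number of observations in category k
lam = N / numpy.sum(N)                 # lambda_k = N_k / sum_m N_m
print(lam)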

Exercise 4.4

Up to an additive constant, the log-posterior augmented with the Lagrange multiplier term is

\[L = \sum_k (N_k + \alpha_k - 1) \log \lambda_k + \nu \left( \sum_k \lambda_k - 1 \right).\]

Setting the partial derivative with respect to \(\lambda_k\) to zero gives

\[\begin{split}0 = \frac{\partial L}{\partial \lambda_k} &= \frac{N_k + \alpha_k - 1}{\lambda_k} + \nu\\ \lambda_k &= \frac{N_k + \alpha_k - 1}{\nu} & \quad & \text{the negative factor is absorbed into } \nu\end{split}\]

and

\[\begin{split}0 = \frac{\partial L}{\partial \nu} &= \sum_k \lambda_k - 1\\ 1 &= \sum_k \lambda_k\\ 1 &= \sum_k \frac{N_k + \alpha_k - 1}{\nu} & \quad & \text{substitute in the optimal parameter for } \lambda_k\\ \nu &= \sum_k (N_k + \alpha_k - 1).\end{split}\]

Thus \(\lambda_k = \frac{N_k + \alpha_k - 1}{\sum_m (N_m + \alpha_m - 1)}\).
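
The corresponding sketch for the MAP estimate, with hypothetical counts and a hypothetical Dirichlet prior:

import numpy

alpha = numpy.asarray([2.0, 2.0, 2.0])  # hypothetical Dirichlet prior parameters
N = numpy.asarray([3, 1, 2])            # hypothetical category counts N_k
lam = (N + alpha - 1) / numpy.sum(N + alpha - 1)
print(lam)                              # a uniform prior (alpha_k = 1) recovers the ML estimate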

Exercise 4.5

\[\begin{split}Pr(x_{1 \ldots I}) &= \int Pr(x_{1 \ldots I}, \theta) d\theta & \quad & \text{marginalization}\\ &= \int \prod_i Pr(x_i \mid \theta) Pr(\theta) d\theta & \quad & \text{samples are independent and identically distributed}\end{split}\]

(i)

Assume samples come from a univariate normal distribution with the corresponding conjugate prior.

\[\begin{split}\int \prod_i Pr(x_i \mid \theta) Pr(\theta) d\theta &= \int \prod_i \NormDist_{x_i}\left[ \mu, \sigma^2 \right] \cdot \NormInvGamDist_{\mu, \sigma^2}[\alpha, \beta, \gamma, \delta] d\theta\\ &= \int \kappa \NormInvGamDist_{\mu, \sigma^2}\left[ \tilde{\alpha}, \tilde{\beta}, \tilde{\gamma}, \tilde{\delta} \right] d\theta & \quad & \text{Exercise 3.11}\\ &= \kappa & \quad & \text{the conjugate prior is a valid probability distribution}\end{split}\]

See Exercise 3.11 for more details.

(ii)

Assume samples come from a categorical distribution with the corresponding conjugate prior.

\[\begin{split}\DeclareMathOperator{\CatDist}{Cat} \DeclareMathOperator{\DirDist}{Dir} \int \prod_i Pr(x_i \mid \theta) Pr(\theta) d\theta &= \int \prod_i \CatDist_{\mathbf{x}_i}[\lambda_{1 \ldots K}] \cdot \DirDist_{\lambda_{1 \ldots K}}[\alpha_{1 \ldots K}] d\theta\\ &= \int \kappa \DirDist_{\lambda_{1 \ldots K}}\left[ \tilde{\alpha}_{1 \ldots K} \right] d\theta & \quad & \text{Exercise 3.10}\\ &= \kappa & \quad & \text{the conjugate prior is a valid probability distribution}\end{split}\]

See Exercise 3.10 for more details.

Exercise 4.6

From Exercise 4.5,

\[Pr(x_{1 \ldots I}) = \kappa = \frac{1}{(2 \pi)^{I / 2}} \frac{ \sqrt{\gamma} \beta^\alpha \Gamma[\tilde{\alpha}] }{ \sqrt{\tilde{\gamma}} \tilde{\beta}^{\tilde{\alpha}} \Gamma[\alpha] }\]

where

\[\begin{split}\tilde{\alpha} &= \alpha + 0.5 I\\\\ \tilde{\beta} &= 0.5 \sum_i x_i^2 + \beta + 0.5 \gamma \delta^2 - 0.5 \frac{(\gamma \delta + \sum_i x_i)^2}{\gamma + I}\\\\ \tilde{\gamma} &= \gamma + I.\end{split}\]

The following solutions use the conjugate prior with parameters \(\alpha = 1, \beta = 1, \gamma = 1, \delta = 0\).

\(Pr(\mathcal{S}_1 = \{ 0.1, -0.5, 0.2, 0.7 \})\)

\[\begin{split}\begin{gather*} \tilde{\alpha} = 3\\ \tilde{\beta} = 1.37\\ \tilde{\gamma} = 5\\ Pr(\mathcal{S}_1) = \kappa = 8.8 \times 10^{-3} \end{gather*}\end{split}\]

\(Pr(\mathcal{S}_2 = \{ 1.1, 2.0, 1.4, 2.3 \})\)

\[\begin{split}\begin{gather*} \tilde{\alpha} = 3\\ \tilde{\beta} = 2.606\\ \tilde{\gamma} = 5\\ Pr(\mathcal{S}_2) = \kappa = 1.3 \times 10^{-3} \end{gather*}\end{split}\]

\(Pr(\mathcal{S}_1 \cup \mathcal{S}_2 \mid M_1)\)

\[\begin{split}\begin{gather*} \tilde{\alpha} = 5\\ \tilde{\beta} = 4.664\\ \tilde{\gamma} = 9\\ Pr(\mathcal{S}_1 \cup \mathcal{S}_2 \mid M_1) = \kappa = 2.3 \times 10^{-6} \end{gather*}\end{split}\]

\(Pr(\mathcal{S}_1 \cup \mathcal{S}_2 \mid M_2)\)

\[Pr(\mathcal{S}_1 \cup \mathcal{S}_2 \mid M_2) = Pr(\mathcal{S}_1) Pr(\mathcal{S}_2) = 1.1 \times 10^{-5}\]

Priors on \(M_1\) and \(M_2\)

Suppose the priors on \(M_1\) and \(M_2\) are uniform. The posterior probability simplifies to

\[Pr(M_1 \mid \mathcal{S}_1 \cup \mathcal{S}_2) = \frac{ Pr(\mathcal{S}_1 \cup \mathcal{S}_2 \mid M_1) Pr(M_1) }{ \sum_i Pr(\mathcal{S}_1 \cup \mathcal{S}_2 \mid M_i) Pr(M_i) } = 0.171\]

and

\[Pr(M_2 \mid \mathcal{S}_1 \cup \mathcal{S}_2) = \frac{ Pr(\mathcal{S}_1 \cup \mathcal{S}_2 \mid M_2) Pr(M_2) }{ \sum_i Pr(\mathcal{S}_1 \cup \mathcal{S}_2 \mid M_i) Pr(M_i) } = 0.829.\]
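
The numbers above can be reproduced with a short script. This is a minimal sketch of the evidence formula from the start of this exercise; the helper name evidence is just for illustration.

import numpy
import scipy.special

def evidence(x, alpha=1.0, beta=1.0, gamma=1.0, delta=0.0):
    # Pr(x_1...I) for a normal likelihood under a NormInvGam prior.
    I = len(x)
    t_alpha = alpha + 0.5 * I
    t_gamma = gamma + I
    t_beta = (0.5 * numpy.sum(x**2) + beta + 0.5 * gamma * delta**2
              - 0.5 * (gamma * delta + numpy.sum(x))**2 / t_gamma)
    return ((2 * numpy.pi)**(-0.5 * I)
            * numpy.sqrt(gamma / t_gamma)
            * beta**alpha / t_beta**t_alpha
            * scipy.special.gamma(t_alpha) / scipy.special.gamma(alpha))

s1 = numpy.asarray([0.1, -0.5, 0.2, 0.7])
s2 = numpy.asarray([1.1, 2.0, 1.4, 2.3])
m1 = evidence(numpy.concatenate([s1, s2]))  # M_1: one shared model
m2 = evidence(s1) * evidence(s2)            # M_2: two independent models
print(m1 / (m1 + m2), m2 / (m1 + m2))       # ~0.171, ~0.829 under uniform priors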

Exercise 4.7

Suppose samples are independent and identically distributed. The likelihood function

\[\DeclareMathOperator*{\argmax}{arg\,max} \argmax_\lambda Pr(x_{1 \ldots I} \mid \lambda) = \argmax_\lambda \prod_{i = 1}^I \lambda^{x_i} (1 - \lambda)^{1 - x_i}\]

is difficult to optimize analytically because differentiating a product of \(I\) terms requires repeated use of the product rule. Since \(\log\) is a monotonic function, maximizing the log-likelihood instead preserves the optimum while turning the product into a sum:

\[\begin{split}L = \log Pr(x_{1 \ldots I} \mid \lambda) &= \sum_{i = 1}^I x_i \log \lambda + (1 - x_i) \log(1 - \lambda)\\ &= I \log(1 - \lambda) + \sum_{i = 1}^I x_i \log \lambda - x_i \log (1 - \lambda).\end{split}\]

A stationary point of \(L\) can be found by applying Fermat’s theorem, setting the derivative to zero:

\[\begin{split}0 = \frac{\partial L}{\partial \lambda} &= -\frac{I}{1 - \lambda} + \sum_{i = 1}^I x_i \left( \frac{1}{\lambda} + \frac{1}{1 - \lambda} \right)\\ I &= \sum_{i = 1}^I x_i \left( \frac{1 - \lambda}{\lambda} + 1 \right)\\ \lambda &= I^{-1} \sum_{i = 1}^I x_i.\end{split}\]
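
As a sanity check (a sketch with hypothetical coin flips), this stationary point coincides with the maximum of the log-likelihood over a grid of \(\lambda\) values:

import numpy

x = numpy.asarray([1, 0, 1, 1])        # hypothetical Bernoulli observations
lam = numpy.linspace(0.01, 0.99, 99)   # grid over the open interval (0, 1)
L = numpy.sum(x) * numpy.log(lam) + numpy.sum(1 - x) * numpy.log(1 - lam)
print(lam[numpy.argmax(L)], numpy.mean(x))  # both ~0.75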

Exercise 4.8

The posterior function

\[\begin{split}\DeclareMathOperator{\BetaDist}{Beta} \DeclareMathOperator{\BernDist}{Bern} \argmax_\lambda Pr(\lambda \mid x_{1 \ldots I}) &= \argmax_\lambda \frac{ Pr(x_{1 \ldots I} \mid \lambda) Pr(\lambda) }{ Pr(x_{1 \ldots I}) }\\ &= \argmax_\lambda \frac{ \prod_{i = 1}^I Pr(x_i \mid \lambda) \cdot Pr(\lambda) }{ Pr(x_{1 \ldots I}) } & \quad & \text{assuming samples are I.I.D.}\\ &= \argmax_\lambda \prod_{i = 1}^I Pr(x_i \mid \lambda) \cdot Pr(\lambda) & \quad & Pr(x_{1 \ldots I}) \text{ is a constant}\\ &= \argmax_\lambda \prod_{i = 1}^I \BernDist_{x_i}[\lambda] \cdot \BetaDist_\lambda[\alpha, \beta]\end{split}\]

is difficult to optimize analytically for the same reason as in Exercise 4.7. Since \(\log\) is a monotonic function, maximizing the log-posterior instead preserves the optimum while turning the product into a sum:

\[\begin{split}L &= \sum_{i = 1}^I \log \BernDist_{x_i}[\lambda] + \log \BetaDist_\lambda[\alpha, \beta]\\ &= \left[ \sum_{i = 1}^I x_i \log \lambda + (1 - x_i) \log(1 - \lambda) \right] + (\alpha - 1) \log \lambda + (\beta - 1) \log (1 - \lambda)\\ &= I \log(1 - \lambda) + \sum_{i = 1}^I x_i \left[ \log \lambda - \log(1 - \lambda) \right] + (\alpha - 1) \log \lambda + (\beta - 1) \log (1 - \lambda).\end{split}\]

A stationary point of \(L\) can be found by applying Fermat’s theorem:

\[\begin{split}0 = \frac{\partial L}{\partial \lambda} &= -\frac{I}{1 - \lambda} + \sum_{i = 1}^I x_i \left( \frac{1}{\lambda} + \frac{1}{1 - \lambda} \right) + \frac{\alpha - 1}{\lambda} - \frac{\beta - 1}{1 - \lambda}\\ I + \beta - 1 &= \sum_{i = 1}^I x_i \left( \frac{1 - \lambda}{\lambda} + 1 \right) + \frac{\alpha - 1}{\lambda} (1 - \lambda)\\ I + \alpha + \beta - 2 &= \sum_{i = 1}^I x_i \frac{1}{\lambda} + \frac{\alpha - 1}{\lambda}\\ \lambda &= \frac{\alpha - 1 + \sum_{i = 1}^I x_i}{I + \alpha + \beta - 2}.\end{split}\]

Exercise 4.9

(i)

\[\begin{split}Pr(\lambda \mid x_{1 \ldots I}) &= \frac{Pr(x_{1 \ldots I} \mid \lambda) Pr(\lambda)}{Pr(x_{1 \ldots I})}\\ &= \frac{ \prod_{i = 1}^I Pr(x_i \mid \lambda) \cdot Pr(\lambda) }{ Pr(x_{1 \ldots I}) } & \quad & \text{assuming samples are I.I.D.}\\ &= \frac{ \prod_{i = 1}^I \BernDist_{x_i}[\lambda] \cdot \BetaDist_\lambda[\alpha, \beta] }{ Pr(x_{1 \ldots I}) }\\ &= \BetaDist_\lambda\left[ \tilde{\alpha}, \tilde{\beta} \right]\end{split}\]

where \(\tilde{\alpha} = \alpha + \sum_{i = 1}^I x_i\) and \(\tilde{\beta} = \beta + \sum_{i = 1}^I (1 - x_i)\). See Exercise 3.9 and Exercise 4.5 for more details.

(ii)

\[\begin{split}Pr(x^* \mid x_{1 \ldots I}) &= \frac{Pr(x^*, x_{1 \ldots I})}{Pr(x_{1 \ldots I})}\\ &= \int \frac{Pr(x^*, x_{1 \ldots I}, \lambda)}{Pr(x_{1 \ldots I})} d\lambda & \quad & \text{marginalization}\\ &= \int \frac{ Pr(x^* \mid \lambda) Pr(x_{1 \ldots I} \mid \lambda) Pr(\lambda) }{ Pr(x_{1 \ldots I}) } d\lambda\\ &= \int Pr(x^* \mid \lambda) Pr(\lambda \mid x_{1 \ldots I}) d\lambda & \quad & \text{see (i)}\\ &= \int \BernDist_{x^*}[\lambda] \BetaDist_\lambda[\tilde{\alpha}, \tilde{\beta}] d\lambda\\ &= \kappa(x^*, \hat{\alpha}, \hat{\beta})\end{split}\]

where

\[\begin{split}\begin{gather*} \kappa = \frac{ \Gamma[\tilde{\alpha} + \tilde{\beta}] \Gamma[\hat{\alpha}] \Gamma[\hat{\beta}] }{ \Gamma[\tilde{\alpha}] \Gamma[\tilde{\beta}] \Gamma[\tilde{\alpha} + \tilde{\beta} + 1] }\\ \hat{\alpha} = \tilde{\alpha} + x^*\\ \hat{\beta} = \tilde{\beta} + 1 - x^*. \end{gather*}\end{split}\]

Exercise 4.10

ML confidently predicts that future data will only take the value \(x = 0\).

Using a uniform beta prior \((\alpha = 1, \beta = 1)\) reduces MAP to ML.

Only the Bayesian approach gives a proper weighting to both \(x = 0\) and \(x = 1\).

[1]:
import numpy
import scipy.special

alpha = 1
beta = 1
data = numpy.asarray([0, 0, 0, 0])
I = len(data)
bern = lambda x, theta: theta**x * (1 - theta)**(1 - x)

# Predictive distribution for maximum likelihood (Exercise 4.7).
ml_lambda = numpy.sum(data) / I
print('ML: lambda = {0}'.format(ml_lambda))
for x in [0, 1]:
    print('x = {0}: {1}'.format(x, bern(x, ml_lambda)))

# Predictive distribution for maximum a posteriori (Exercise 4.8).
map_lambda = (alpha - 1 + numpy.sum(data)) / (I + alpha + beta - 2)
_ = '\nMAP: lambda = {0}, alpha = {1}, beta = {2}'
print(_.format(map_lambda, alpha, beta))
for x in [0, 1]:
    print('x = {0}: {1}'.format(x, bern(x, map_lambda)))

# Predictive distribution for the Bayesian approach (Exercise 4.9).
t_alpha = alpha + numpy.sum(data)
t_beta = beta + numpy.sum(1 - data)
G = lambda z: scipy.special.gamma(z)  # the gamma function

print('\nBayesian')
for x in [0, 1]:
    h_alpha = t_alpha + x
    h_beta = t_beta + 1 - x
    _ = G(t_alpha + t_beta) * G(h_alpha) * G(h_beta)
    kappa = _ / (G(t_alpha) * G(t_beta) * G(t_alpha + t_beta + 1))
    _ = '(alpha = {0}, beta = {1}) x = {2}: {3}'
    print(_.format(h_alpha, h_beta, x, kappa))
ML: lambda = 0.0
x = 0: 1.0
x = 1: 0.0

MAP: lambda = 0.0, alpha = 1, beta = 1
x = 0: 1.0
x = 1: 0.0

Bayesian
(alpha = 1, beta = 6) x = 0: 0.8333333333333334
(alpha = 2, beta = 5) x = 1: 0.16666666666666666