Classification Models¶

[Tip04] should be read after completing this chapter. Concepts and techniques from previous chapters are used to illustrate the difference between frequentist versus Bayesian approaches. Be aware that the generative equation in section 2.3 builds upon the exercises in the previous chapters.

Parameterized model \(P(B \mid A) = f(A; w)\) may over-specialize to the observed data resulting in a poor model of the true underlying distribution.
- The Bayesian inference paradigm is to treat parameters such as \(w\) as random variables.
The common convention of “probability” versus “likelihood”.
- Probability is interpreted as a function of some random variable.
- Likelihood is interpreted as a function of the parameters.
Writing \(p(t \mid x, w, \sigma^2)\) as \(p(t | w, \sigma^2)\) is purely for notational convenience.
- This signifies that the input data \(x\) is not modeled: any effects \(x\) might have on the overall distribution are ignored.
Specifying a Bayesian Prior
- When you specify a Gaussian prior on the parameters, it is essentially giving small weights to large parameter values.
Ockham’s Razor is automatically implemented during marginalization.
- Instead of estimating all nuisance model variables, integrate them out.
The goal of the Bayesian framework is to compute the posterior distribution over all unknowns (possibly via marginalization).

[Wela] presents supervised learning alongside unsupervised to illustrate the similarities between the two methods. The information after equation (41) requires a knowledge base beyond the completion of this chapter.

The Bayesian approach allows one to ask how the prior of the random variable \(\theta\) changes in the light of new observations \(d\).
- The data will move the modes of the distributions to the most probable values of \(\theta\) and determine a spread around those values.
Given sufficient data, there are no significant differences between ML, MAP, and Bayesian.
- When the number of parameters (model complexity) becomes too large with respect to the amount of data samples, MAP and Bayesian are necessary.
Robustness implies the estimate of \(\theta\) is not influenced too much by deviations from the assumptions (e.g. outliers, wrong priors/models).
Bias-Variance Tradeoff
Minimum Description Length
- Minimizing the following costs will generate the best generalization:
  - Specifics of the model.
  - Activities of the model when applied to the data.
  - The reconstruction errors.
- Jorma Rissanen also proposed (54) as a way to gauge how many parameters to introduce, assuming the data is large.

[Welb][Welc] should only be read after completing this chapter. It’s clearly written, and the derivations are useful when independently deriving the update equations. However, the approach is not as elegant compared to the book’s explanations. Hence reading these notes are not essential. One interesting insight from [Welb] is that PCA is not probabilistic so it is hard to apply PCA to MAP estimation.

Bayesian Logistic Regression¶

[Cev] contains a completely understandable derivation of the Laplace approximation.

[Jor] has a section about Laplace approximation. The derivation emphasizes the Taylor expansion of higher order terms. This enables one to derive more accurate Laplace approximations. It also has useful asides on the multivariate case and conditional expectation.

[Tok] should be read after the previous two references. This exposition contains motivational examples and presents a cool Bernstein-von Mises theorem.

[CSK11] is a beautifully written survey on applications of random forests. This contains a lot of useful information (especially the references), so one should just read it in its entirety.

See [BBB+07] for a prior-based approach of combining generative and discriminative models.

Exercise 9.1¶

(i)¶

\[\DeclareMathOperator{\sigmoid}{sig} \lim_{a \rightarrow -\infty} \sigmoid[a] = \lim_{a \rightarrow -\infty} \frac{1}{1 + \exp[-a]} = \frac{1}{1 + \infty} = 0\]

(ii)¶

\[\sigmoid[0] = \frac{1}{1 + \exp[-0]} = \frac{1}{1 + 2} = 0.5\]

(iii)¶

\[\lim_{a \rightarrow \infty} \sigmoid[a] = \lim_{a \rightarrow \infty} \frac{1}{1 + \exp[-a]} = \frac{1}{1 + 0} = 1\]

Exercise 9.2¶

\[\begin{split}L &= \sum_{i = 1}^I w_i \log \frac{1}{ 1 + \exp\left[ -\boldsymbol{\phi}^\top \mathbf{x}_i \right] } + \sum_{i = 1}^I (1 - w_i) \log \frac{ \exp\left[ -\boldsymbol{\phi}^\top \mathbf{x}_i \right] }{ 1 + \exp\left[ -\boldsymbol{\phi}^\top \mathbf{x}_i \right] }\\ &= \sum_{i = 1}^I -w_i \log\left( 1 + \exp\left[ -\boldsymbol{\phi}^\top \mathbf{x}_i \right] \right) + (1 - w_i) (-\boldsymbol{\phi}^\top \mathbf{x}_i) - (1 - w_i) \log\left( 1 + \exp\left[ -\boldsymbol{\phi}^\top \mathbf{x}_i \right] \right)\\ &= \sum_{i = 1}^I -(1 - w_i) \boldsymbol{\phi}^\top \mathbf{x}_i - \log\left( 1 + \exp\left[ -\boldsymbol{\phi}^\top \mathbf{x}_i \right] \right)\\\\\\ \frac{\partial L}{\partial \boldsymbol{\phi}} &= \sum_{i = 1}^I -(1 - w_i) \mathbf{x}_i - \frac{ \exp\left[ -\boldsymbol{\phi}^\top \mathbf{x}_i \right] }{ 1 + \exp\left[ -\boldsymbol{\phi}^\top \mathbf{x}_i \right] } (-\mathbf{x}_i)\\ &= -\sum_{i = 1}^I (1 - w_i) \mathbf{x}_i - \left( 1 - \sigmoid[a_i] \right) \mathbf{x}_i & \quad & a_i = \boldsymbol{\phi}^\top \mathbf{x}_i\\ &= -\sum_{i = 1}^I \left( \sigmoid[a_i] - w_i \right) \mathbf{x}_i\end{split}\]

Exercise 9.3¶

\[\begin{split}\frac{\partial^2 L}{\partial \boldsymbol{\phi}^2} &= \frac{\partial}{\partial \boldsymbol{\phi}^\top} \frac{\partial L}{\partial \boldsymbol{\phi}}\\ &= -\frac{\partial}{\partial \boldsymbol{\phi}^\top} \sum_{i = 1}^I \left( \sigmoid[a_i] - w_i \right) \mathbf{x}_i\\ &= -\sum_{i = 1}^I (-1) \frac{1}{\left( 1 + \exp[-a_i] \right)^2} \exp[-a_i] \mathbf{x}_i \left( -\mathbf{x}_i^\top \right)\\ &= -\sum_{i = 1}^I \sigmoid[a_i] \frac{\exp[-a_i]}{1 + \exp[-a_i]} \mathbf{x}_i \mathbf{x}_i^\top\\ &= -\sum_{i = 1}^I \sigmoid[a_i] \left( 1 - \sigmoid[a_i] \right) \mathbf{x}_i \mathbf{x}_i^\top\end{split}\]

Exercise 9.4¶

Maximizing the concave log-likelihood caused the parameters \(\boldsymbol{\phi}\) to grow exponentially fast resulting in a singular Hessian with a gradient vector that is not even close to tangent. Minimizing the negative log-likelihood rectified this issue.

[1]:

import numpy

def sig(a):
    return 1 / (1 + numpy.exp(-a))

def g(phi, X, w):
    gradient = numpy.zeros_like(phi)
    for i, w_i in enumerate(w):
        x_i = X[:, i]
        a_i = (phi.T * x_i).item(0)
        gradient += (sig(a_i) - w_i) * x_i
    return -gradient

def H(phi, X):
    D, I = X.shape
    hessian = numpy.zeros((D, D))
    for i in range(I):
        x_i = X[:, i]
        a_i = (phi.T * x_i).item(0)
        _ = sig(a_i)
        hessian += _ * (1 - _) * x_i * x_i.T
    return -hessian

I = 20
x_0 = numpy.random.rand(I // 2) - 1.0
x_1 = numpy.random.rand(I // 2) + 1.0
_ = numpy.hstack((x_0, x_1))
X = numpy.asmatrix(numpy.vstack((numpy.ones(_.shape[0]), _)))
w = numpy.hstack((numpy.zeros(I // 2), numpy.ones(I // 2)))
phi = numpy.asmatrix(numpy.random.rand(2)).T

phi_c = phi
alpha = 1.0

for t in range(16):
    g_c = g(phi_c, X, w)
    H_c = H(phi_c, X)
    phi_hat = phi_c - alpha * numpy.linalg.inv(H_c) * g_c
    phi_c = phi_hat
    print('norm: {:.7f}\tphi: {}'.format(numpy.linalg.norm(g_c),
                                           numpy.asarray(phi_c).flatten()))

norm: 7.8555601 phi: [-0.70921938  2.12974601]
norm: 1.8291913 phi: [-1.21585321  3.17649198]
norm: 0.6617972 phi: [-1.6801012   4.19974688]
norm: 0.2444221 phi: [-2.1357043   5.23395222]
norm: 0.0907930 phi: [-2.5935056   6.28910869]
norm: 0.0337695 phi: [-3.05796092  7.36765439]
norm: 0.0125557 phi: [-3.53106369  8.46905946]
norm: 0.0046637 phi: [-4.01357669  9.5915571 ]
norm: 0.0017303 phi: [-4.50551199 10.73291224]
norm: 0.0006412 phi: [-5.00638957 11.89078601]
norm: 0.0002374 phi: [-5.51542399 13.06291367]
norm: 0.0000878 phi: [-6.03167349 14.24719431]
norm: 0.0000324 phi: [-6.55415456 15.4417363 ]
norm: 0.0000120 phi: [-7.08192205 16.64487701]
norm: 0.0000044 phi: [-7.61411754 17.85518464]
norm: 0.0000016 phi: [-8.1499925  19.07144724]

Exercise 9.5¶

Let \(\alpha = \beta = 1.0\). Applying (a) yields

\[\lambda_\max = \frac{\alpha - 1}{\alpha + \beta - 2} = \frac{\alpha - 1}{2 (\alpha - 1)} = \frac{1}{2}.\]

The variance of the Laplace approximation can be computed as

\[\begin{split}\DeclareMathOperator{\BetaDist}{Beta} \sigma^2 &= -\frac{1}{\BetaDist''_{\lambda_\max}[\alpha, \beta]}\\ &= -\frac{B(\alpha, \beta)}{ \lambda^{\alpha - 1} (1 - \lambda)^{\beta - 1} \left[ (\alpha - 1) (\alpha - 2) \lambda^{-2} - 2 (\alpha - 1) (\beta - 1) \lambda^{-1} (1 - \lambda)^{-1} + (\beta - 1) (\beta - 2) (1 - \lambda)^{-2} \right] }\\ &= -\frac{1}{\lambda_\max^{\alpha + \beta - 4}} \frac{B(\alpha, \beta)}{ (\alpha - 1) (\alpha - 2) - 2 (\alpha - 1) (\beta - 1) + (\beta - 1) (\beta - 2) } & \quad & \lambda \mapsto \lambda_\max = \frac{1}{2}\\ &= -\frac{1}{2 \lambda_\max^{\alpha + \beta - 4}} \frac{B(\alpha, \alpha)}{ (\alpha - 1) (\alpha - 2) - (\alpha - 1) (\alpha - 1) } & \quad & \alpha = \beta\\ &= \frac{ B(\alpha, \alpha) }{ 2 \lambda_\max^{\alpha + \beta - 4} } (\alpha - 1)^{-1}\end{split}\]

By inspection, \(\lim_{\alpha \rightarrow 1^+} \sigma^2 = \infty\). This makes sense because a beta distribution with this configuration of parameters is a uniform distribution. In order to approximate this with a normal distribution, the variance needs to be infinite.

(a)¶

From Exercise 3.2, the peak of the beta distribution

\[\begin{split}\BetaDist_\lambda[\alpha, \beta] &= B(\alpha, \beta)^{-1} \lambda^{\alpha - 1} (1 - \lambda)^{\beta - 1}\\ &= \frac{ \lambda^{\alpha - 1} (1 - \lambda)^{\beta - 1} }{ \int_0^1 t^{\alpha - 1} (1 - t)^{\beta - 1} dt }\\ &= \frac{\Gamma[\alpha + \beta]}{\Gamma[\alpha] \Gamma[\beta]} \lambda^{\alpha - 1} (1 - \lambda)^{\beta - 1}\end{split}\]

is \(\lambda_\max = \frac{\alpha - 1}{\alpha + \beta - 2}\).

(b)¶

The second derivative of the beta distribution is

\[\begin{split}\frac{\partial^2}{\partial \lambda^2} \BetaDist_\lambda[\alpha, \beta] &= \frac{\partial}{\partial \lambda} B(\alpha, \beta)^{-1} \left[ (\alpha - 1) \lambda^{\alpha - 2} (1 - \lambda)^{\beta - 1} - (\beta - 1) \lambda^{\alpha - 1} (1 - \lambda)^{\beta - 2} \right]\\ &= B(\alpha, \beta)^{-1} \lambda^{\alpha - 1} (1 - \lambda)^{\beta - 1} \left[ (\alpha - 1) (\alpha - 2) \lambda^{-2} - 2 (\alpha - 1) (\beta - 1) \lambda^{-1} (1 - \lambda)^{-1} + (\beta - 1) (\beta - 2) (1 - \lambda)^{-2} \right]\end{split}\]

Exercise 9.6¶

Recall that

\[\DeclareMathOperator{\NormDist}{Norm} \NormDist_x\left[ \mu, \sigma^2 \right] = \frac{1}{\sigma \sqrt{2 \pi}} \exp\left[ -\frac{(x - \mu)^2}{2 \sigma^2} \right]\]

Clearly the mean, median, and mode of the distribution is at \(\mu\).

The second derivative of the normal distribution is

\[\begin{split}\frac{\partial^2}{\partial x^2} \NormDist_x\left[ \mu, \sigma^2 \right] &= \frac{\partial}{\partial x} \left( \frac{1}{\sigma \sqrt{2 \pi}} \exp\left[ -\frac{(x - \mu)^2}{2 \sigma^2} \right] \frac{-2 (x - \mu)}{2 \sigma^2} (1) \right)\\ &= \frac{\partial}{\partial x} \left( -\frac{x - \mu}{\sigma^2} \NormDist_x\left[ \mu, \sigma^2 \right] \right)\\ &= -\frac{1}{\sigma^2} \NormDist_x\left[ \mu, \sigma^2 \right] + \frac{(x - \mu)^2}{\sigma^4} \NormDist_x\left[ \mu, \sigma^2 \right].\end{split}\]

The variance of the Laplace approximation can be computed as

\[\hat{\sigma}^2 = -\frac{1}{\NormDist''_{\mu}\left[ \mu, \sigma^2 \right]} = \sigma^3 \sqrt{2 \pi}.\]

This illustrates that the Laplace approximation is another univariate normal with a scaled variance.

Exercise 9.7¶

\[\begin{split}\frac{\partial}{\partial \boldsymbol{\phi}} \log \NormDist_{\boldsymbol{\phi}}[\boldsymbol{\mu}, \boldsymbol{\Sigma}] &= \frac{\partial}{\partial \boldsymbol{\phi}} \left[ -\frac{1}{2} \log \left\vert 2 \pi \boldsymbol{\Sigma} \right\vert - \frac{1}{2} (\boldsymbol{\phi} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\boldsymbol{\phi} - \boldsymbol{\mu}) \right]\\ &= -\frac{1}{2} \left[ \boldsymbol{\Sigma}^{-1} (\boldsymbol{\phi} - \boldsymbol{\mu}) + \boldsymbol{\Sigma}^{-\top} (\boldsymbol{\phi} - \boldsymbol{\mu}) \right] & \quad & \text{(C.32)}\\ &= -\boldsymbol{\Sigma}^{-1} (\boldsymbol{\phi} - \boldsymbol{\mu})\\\\\\ \frac{\partial^2 L}{\partial \boldsymbol{\phi}^2} &= \frac{\partial}{\partial \boldsymbol{\phi}^\top} \frac{\partial L}{\partial \boldsymbol{\phi}}\\ &= \frac{\partial}{\partial \boldsymbol{\phi}^\top} \left[ -\boldsymbol{\Sigma}^{-1} (\boldsymbol{\phi} - \boldsymbol{\mu}) \right]\\ &= -\boldsymbol{\Sigma}^{-1}\end{split}\]

Since the second derivative of \(L\) is independent of \(\boldsymbol{\phi}\), evaluating \(\boldsymbol{\phi}\) at \(\boldsymbol{\mu}\) is irrelevant.

Exercise 9.8¶

As shown in (8.30), one approach is to maximize the marginal likelihood using some nonlinear optimization procedure.

Exercise 9.9¶

The branching logistic regression model’s prediction is the result of a single logistic regression model whose activation is a linear weighted sum of experts (e.g. linear functions of the data).

The mixture of experts model’s prediction is a weighted linear sum of logistic regression models. This could be generalized so that the weights and/or the experts themselves contain a nonlinear activation term. Moreover, this could be built hierarchically to form a tree structure.

The mixture of experts model’s parameters can be estimated via direct optimization of the log posterior probability or the EM algorithm.

Exercise 9.10¶

Let

\[\DeclareMathOperator{\softmax}{softmax} s_k = \softmax_k\left[ a_1, a_2, \ldots, a_K \right] = \frac{\exp a_k}{\sum_{j = 1}^K \exp a_j}.\]

(a)¶

Recall that \(\lim_{x \rightarrow -\infty} \exp x = 0\). Assuming \(a_k \neq -\infty\) for \(1 \leq k < K\),

\[\begin{split}0 < \exp a_k &< \sum_{j = 1}^K \exp a_j\\ \frac{\exp a_k}{\sum_{j = 1}^K \exp a_j} &< 1\\ s_k &< 1.\end{split}\]

(b)¶

\[\begin{split}\sum_{k = 1}^K s_k &= \sum_{k = 1}^K \frac{\exp a_k}{\sum_{j = 1}^K \exp a_j}\\ &= \frac{\sum_k \exp a_k}{\sum_j \exp a_j}\\ &= 1\end{split}\]

Exercise 9.11¶

The cost function is

\[\begin{split}\DeclareMathOperator{\CatDist}{Cat} L &= \sum_{i = 1}^I \log \CatDist_{w_i} \softmax[a_1, \ldots, a_N]\\ &= \sum_{i = 1}^I \sum_{n = 1}^N \delta[w_i - n] \log \lambda_n & \quad & \text{(3.8)}\\ &= \sum_{i = 1}^I \left( -\log\left[ \sum_{m = 1}^N \exp a_m \right] + \sum_{n = 1}^N \delta[w_i - n] \log \exp a_n \right) & \quad & \text{(9.59)}\\ &= \sum_{i = 1}^I \left( -\log\left[ \sum_{m = 1}^N \exp \boldsymbol{\phi}_m^\top \mathbf{x}_i \right] + \sum_{n = 1}^N \delta[w_i - n] \boldsymbol{\phi}_n^\top \mathbf{x}_i \right) & \quad & \text{(9.58)}.\end{split}\]

The first derivative is

\[\begin{split}\frac{\partial L}{\partial \boldsymbol{\phi}_n} &= \sum_{i = 1}^I -\frac{ \exp \boldsymbol{\phi}_n^\top \mathbf{x}_i }{ \sum_{m = 1}^N \exp \boldsymbol{\phi}_m^\top \mathbf{x}_i } \mathbf{x}_i + \delta[w_i - n] \mathbf{x}_i & \quad & \text{(C.28)}\\ &= -\sum_{i = 1}^I \left( y_{in} - \delta[w_i - n] \right) \mathbf{x}_i.\end{split}\]

Exercise 9.12¶

This is essentially multi-class logistic regression (section 9.9) and Exercise 6.2. In order to exploit the data’s discrete nature, (random) classification trees (section 9.8 and 9.10) could be used.

References

Cev: Volkan Cevher. Laplace approximation. http://www.ece.rice.edu/ vc3/elec633/graphical_models_notes_091108.pdf. Accessed on 2017-07-03.
CSK11: A Criminisi, J Shotton, and E Konukoglu. Decision forests for classification, regression, density estimation, manifold learning and semi-supervised learning. Microsoft Research Cambridge, Tech. Rep. MSRTR-2011-114, 5(6):12, 2011.
Jor: Michael I. Jordan. Bayesian modeling and inference: lecture 15. http://www.cs.berkeley.edu/ jordan/courses/260-spring10/lectures/lecture15.pdf. Accessed on 2017-07-03.
Tip04: Michael E Tipping. Bayesian inference: an introduction to principles and practice in machine learning. Lecture notes in computer science, 3176:41–62, 2004.
Tok: Surya T. Tokdar. Laplace approximation to the posterior. https://stat.duke.edu/ st118/sta250/laplace.pdf. Accessed on 2017-07-03.
Wela: Max Welling. Estimation. http://www.ics.uci.edu/ welling/classnotes/papers_class/StatEst.ps.gz. Accessed on 2017-07-04.
Welb(1,2): Max Welling. Linear models. http://www.ics.uci.edu/ welling/classnotes/papers_class/LinMod.ps.gz. Accessed on 2017-07-04.
Welc: Max Welling. Mixture of factor analysers. http://www.ics.uci.edu/ welling/classnotes/papers_class/MoFA.ps.gz. Accessed on 2017-07-04.