{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "*********************\n",
    "Classification Models\n",
    "*********************\n",
    "\n",
    ":cite:`tipping2004bayesian` should be read after completing this chapter.\n",
    "Concepts and techniques from previous chapters are used to illustrate the\n",
    "difference between frequentist versus Bayesian approaches.  Be aware that the\n",
    "generative equation in section 2.3 builds upon the exercises in the previous\n",
    "chapters.\n",
    "\n",
    "- Parameterized model :math:`P(B \\mid A) = f(A; w)` may over-specialize to the\n",
    "  observed data resulting in a poor model of the true underlying distribution.\n",
    "\n",
    "  - The Bayesian inference paradigm is to treat parameters such as :math:`w` as\n",
    "    random variables.\n",
    "\n",
    "- The common convention of \"probability\" versus \"likelihood\".\n",
    "\n",
    "  - Probability is interpreted as a function of some random variable.\n",
    "  - Likelihood is interpreted as a function of the parameters.\n",
    "\n",
    "- Writing :math:`p(t \\mid x, w, \\sigma^2)` as :math:`p(t | w, \\sigma^2)` is\n",
    "  purely for notational convenience.\n",
    "\n",
    "  - This signifies that the input data :math:`x` is not modeled: any effects\n",
    "    :math:`x` might have on the overall distribution are ignored.\n",
    "\n",
    "- Specifying a Bayesian Prior\n",
    "\n",
    "  - When you specify a Gaussian prior on the parameters, it is essentially\n",
    "    giving small weights to large parameter values.\n",
    "\n",
    "- Ockham's Razor is automatically implemented during marginalization.\n",
    "\n",
    "  - Instead of estimating all nuisance model variables, integrate them out.\n",
    "\n",
    "- The goal of the Bayesian framework is to compute the posterior distribution\n",
    "  over all unknowns (possibly via marginalization).\n",
    "\n",
    ":cite:`wellingme` presents supervised learning alongside unsupervised to\n",
    "illustrate the similarities between the two methods.  The information after\n",
    "equation (41) requires a knowledge base beyond the completion of this chapter.\n",
    "\n",
    "- The Bayesian approach allows one to ask how the prior of the random variable\n",
    "  :math:`\\theta` changes in the light of new observations :math:`d`.\n",
    "\n",
    "  - The data will move the modes of the distributions to the most probable\n",
    "    values of :math:`\\theta` and determine a spread around those values.\n",
    "\n",
    "- Given sufficient data, there are no significant differences between ML, MAP,\n",
    "  and Bayesian.\n",
    "\n",
    "  - When the number of parameters (model complexity) becomes too large with\n",
    "    respect to the amount of data samples, MAP and Bayesian are necessary.\n",
    "\n",
    "- Robustness implies the estimate of :math:`\\theta` is not influenced too much\n",
    "  by deviations from the assumptions (e.g. outliers, wrong priors/models).\n",
    "\n",
    "- :doc:`Bias-Variance Tradeoff </blog/2016/11/23/training-products-of-experts-by-minimizing-contrastive-divergence>`\n",
    "\n",
    "- Minimum Description Length\n",
    "\n",
    "  - Minimizing the following costs will generate the best generalization:\n",
    "\n",
    "    - Specifics of the model.\n",
    "    - Activities of the model when applied to the data.\n",
    "    - The reconstruction errors.\n",
    "\n",
    "  - Jorma Rissanen also proposed (54) as a way to gauge how many parameters to\n",
    "    introduce, assuming the data is large.\n",
    "\n",
    ":cite:`wellingmlm,wellingmmofa` should only be read after completing this\n",
    "chapter.  It's clearly written, and the derivations are useful when\n",
    "independently deriving the update equations.  However, the approach is not as\n",
    "elegant compared to the book's explanations.  Hence reading these notes are not\n",
    "essential.  One interesting insight from :cite:`wellingmlm` is that PCA is not\n",
    "probabilistic so it is hard to apply PCA to MAP estimation.\n",
    "\n",
    "Bayesian Logistic Regression\n",
    "============================\n",
    "\n",
    ":cite:`cevhervla` contains a completely understandable derivation of the Laplace\n",
    "approximation.\n",
    "\n",
    ":cite:`jordanmibmil15` has a section about Laplace approximation.  The\n",
    "derivation emphasizes the Taylor expansion of higher order terms.  This enables\n",
    "one to derive more accurate Laplace approximations.  It also has useful asides\n",
    "on the multivariate case and conditional expectation.\n",
    "\n",
    ":cite:`tokdarstlap` should be read after the previous two references.  This\n",
    "exposition contains motivational examples and presents a cool\n",
    "Bernstein-von Mises theorem.\n",
    "\n",
    ":cite:`criminisi2011decision` is a beautifully written survey on applications of\n",
    "random forests.  This contains a lot of useful information (especially the\n",
    "references), so one should just read it in its entirety.\n",
    "\n",
    "See :cite:`bernardo2007generative` for a prior-based approach of combining\n",
    "generative and discriminative models."
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "Exercise 9.1\n",
    "============\n",
    "\n",
    "(i)\n",
    "---\n",
    "\n",
    ".. math::\n",
    "\n",
    "   \\DeclareMathOperator{\\sigmoid}{sig}\n",
    "   \\lim_{a \\rightarrow -\\infty} \\sigmoid[a] =\n",
    "   \\lim_{a \\rightarrow -\\infty} \\frac{1}{1 + \\exp[-a]} =\n",
    "   \\frac{1}{1 + \\infty} = 0\n",
    "\n",
    "(ii)\n",
    "----\n",
    "\n",
    ".. math::\n",
    "\n",
    "   \\sigmoid[0] =\n",
    "   \\frac{1}{1 + \\exp[-0]} =\n",
    "   \\frac{1}{1 + 2} = 0.5\n",
    "\n",
    "(iii)\n",
    "-----\n",
    "\n",
    ".. math::\n",
    "\n",
    "   \\lim_{a \\rightarrow \\infty} \\sigmoid[a] =\n",
    "   \\lim_{a \\rightarrow \\infty} \\frac{1}{1 + \\exp[-a]} =\n",
    "   \\frac{1}{1 + 0} = 1"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "Exercise 9.2\n",
    "============\n",
    "\n",
    ".. math::\n",
    "\n",
    "   L &= \\sum_{i = 1}^I\n",
    "          w_i \\log \\frac{1}{\n",
    "            1 + \\exp\\left[ -\\boldsymbol{\\phi}^\\top \\mathbf{x}_i \\right]\n",
    "          } +\n",
    "        \\sum_{i = 1}^I\n",
    "          (1 - w_i) \\log \\frac{\n",
    "            \\exp\\left[ -\\boldsymbol{\\phi}^\\top \\mathbf{x}_i \\right]\n",
    "          }{\n",
    "            1 + \\exp\\left[ -\\boldsymbol{\\phi}^\\top \\mathbf{x}_i \\right]\n",
    "          }\\\\\n",
    "    &= \\sum_{i = 1}^I\n",
    "         -w_i \\log\\left(\n",
    "           1 + \\exp\\left[ -\\boldsymbol{\\phi}^\\top \\mathbf{x}_i \\right]\n",
    "         \\right) +\n",
    "         (1 - w_i) (-\\boldsymbol{\\phi}^\\top \\mathbf{x}_i) -\n",
    "         (1 - w_i) \\log\\left(\n",
    "           1 + \\exp\\left[ -\\boldsymbol{\\phi}^\\top \\mathbf{x}_i \\right]\n",
    "         \\right)\\\\\n",
    "    &= \\sum_{i = 1}^I\n",
    "         -(1 - w_i) \\boldsymbol{\\phi}^\\top \\mathbf{x}_i -\n",
    "         \\log\\left(\n",
    "           1 + \\exp\\left[ -\\boldsymbol{\\phi}^\\top \\mathbf{x}_i \\right]\n",
    "         \\right)\\\\\\\\\\\\\n",
    "   \\frac{\\partial L}{\\partial \\boldsymbol{\\phi}}\n",
    "    &= \\sum_{i = 1}^I\n",
    "         -(1 - w_i) \\mathbf{x}_i -\n",
    "         \\frac{\n",
    "           \\exp\\left[ -\\boldsymbol{\\phi}^\\top \\mathbf{x}_i \\right]\n",
    "         }{\n",
    "           1 + \\exp\\left[ -\\boldsymbol{\\phi}^\\top \\mathbf{x}_i \\right]\n",
    "         } (-\\mathbf{x}_i)\\\\\n",
    "    &= -\\sum_{i = 1}^I\n",
    "         (1 - w_i) \\mathbf{x}_i -\n",
    "         \\left( 1 - \\sigmoid[a_i] \\right) \\mathbf{x}_i\n",
    "       & \\quad & a_i = \\boldsymbol{\\phi}^\\top \\mathbf{x}_i\\\\\n",
    "    &= -\\sum_{i = 1}^I \\left( \\sigmoid[a_i] - w_i \\right) \\mathbf{x}_i"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "Exercise 9.3\n",
    "============\n",
    "\n",
    ".. math::\n",
    "\n",
    "   \\frac{\\partial^2 L}{\\partial \\boldsymbol{\\phi}^2}\n",
    "    &= \\frac{\\partial}{\\partial \\boldsymbol{\\phi}^\\top}\n",
    "       \\frac{\\partial L}{\\partial \\boldsymbol{\\phi}}\\\\\n",
    "    &= -\\frac{\\partial}{\\partial \\boldsymbol{\\phi}^\\top}\n",
    "       \\sum_{i = 1}^I \\left( \\sigmoid[a_i] - w_i \\right) \\mathbf{x}_i\\\\\n",
    "    &= -\\sum_{i = 1}^I\n",
    "         (-1) \\frac{1}{\\left( 1 + \\exp[-a_i] \\right)^2} \\exp[-a_i]\n",
    "         \\mathbf{x}_i \\left( -\\mathbf{x}_i^\\top \\right)\\\\\n",
    "    &= -\\sum_{i = 1}^I\n",
    "         \\sigmoid[a_i]\n",
    "         \\frac{\\exp[-a_i]}{1 + \\exp[-a_i]} \\mathbf{x}_i \\mathbf{x}_i^\\top\\\\\n",
    "    &= -\\sum_{i = 1}^I\n",
    "         \\sigmoid[a_i]\n",
    "         \\left( 1 - \\sigmoid[a_i] \\right)\n",
    "         \\mathbf{x}_i \\mathbf{x}_i^\\top"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "Exercise 9.4\n",
    "============\n",
    "\n",
    "Maximizing the concave log-likelihood caused the parameters\n",
    ":math:`\\boldsymbol{\\phi}` to grow exponentially fast resulting in a singular\n",
    "Hessian with a gradient vector that is not even close to tangent.  Minimizing\n",
    "the negative log-likelihood rectified this issue."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy\n",
    "\n",
    "def sig(a):\n",
    "    return 1 / (1 + numpy.exp(-a))\n",
    "\n",
    "def g(phi, X, w):\n",
    "    gradient = numpy.zeros_like(phi)\n",
    "    for i, w_i in enumerate(w):\n",
    "        x_i = X[:, i]\n",
    "        a_i = (phi.T * x_i).item(0)\n",
    "        gradient += (sig(a_i) - w_i) * x_i\n",
    "    return -gradient\n",
    "\n",
    "def H(phi, X):\n",
    "    D, I = X.shape\n",
    "    hessian = numpy.zeros((D, D))\n",
    "    for i in range(I):\n",
    "        x_i = X[:, i]\n",
    "        a_i = (phi.T * x_i).item(0)\n",
    "        _ = sig(a_i)\n",
    "        hessian += _ * (1 - _) * x_i * x_i.T\n",
    "    return -hessian\n",
    "\n",
    "I = 20\n",
    "x_0 = numpy.random.rand(I // 2) - 1.0\n",
    "x_1 = numpy.random.rand(I // 2) + 1.0\n",
    "_ = numpy.hstack((x_0, x_1))\n",
    "X = numpy.asmatrix(numpy.vstack((numpy.ones(_.shape[0]), _)))\n",
    "w = numpy.hstack((numpy.zeros(I // 2), numpy.ones(I // 2)))\n",
    "phi = numpy.asmatrix(numpy.random.rand(2)).T\n",
    "\n",
    "phi_c = phi\n",
    "alpha = 1.0\n",
    "\n",
    "for t in range(16):\n",
    "    g_c = g(phi_c, X, w)\n",
    "    H_c = H(phi_c, X)\n",
    "    phi_hat = phi_c - alpha * numpy.linalg.inv(H_c) * g_c\n",
    "    phi_c = phi_hat\n",
    "    print('norm: {:.7f}\\tphi: {}'.format(numpy.linalg.norm(g_c),\n",
    "                                           numpy.asarray(phi_c).flatten()))"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "Exercise 9.5\n",
    "============\n",
    "\n",
    "Let :math:`\\alpha = \\beta = 1.0`.  Applying (a) yields\n",
    "\n",
    ".. math::\n",
    "\n",
    "   \\lambda_\\max = \\frac{\\alpha - 1}{\\alpha + \\beta - 2} =\n",
    "   \\frac{\\alpha - 1}{2 (\\alpha - 1)} =\n",
    "   \\frac{1}{2}.\n",
    "\n",
    "The variance of the Laplace approximation can be computed as\n",
    "\n",
    ".. math::\n",
    "\n",
    "   \\DeclareMathOperator{\\BetaDist}{Beta}\n",
    "   \\sigma^2 &= -\\frac{1}{\\BetaDist''_{\\lambda_\\max}[\\alpha, \\beta]}\\\\\n",
    "    &= -\\frac{B(\\alpha, \\beta)}{\n",
    "         \\lambda^{\\alpha - 1} (1 - \\lambda)^{\\beta - 1}\n",
    "         \\left[\n",
    "           (\\alpha - 1) (\\alpha - 2) \\lambda^{-2} -\n",
    "           2 (\\alpha - 1) (\\beta - 1) \\lambda^{-1} (1 - \\lambda)^{-1} +\n",
    "           (\\beta - 1) (\\beta - 2) (1 - \\lambda)^{-2}\n",
    "         \\right]\n",
    "       }\\\\\n",
    "    &= -\\frac{1}{\\lambda_\\max^{\\alpha + \\beta - 4}}\n",
    "       \\frac{B(\\alpha, \\beta)}{\n",
    "         (\\alpha - 1) (\\alpha - 2) -\n",
    "         2 (\\alpha - 1) (\\beta - 1) +\n",
    "         (\\beta - 1) (\\beta - 2)\n",
    "       }\n",
    "       & \\quad & \\lambda \\mapsto \\lambda_\\max = \\frac{1}{2}\\\\\n",
    "    &= -\\frac{1}{2 \\lambda_\\max^{\\alpha + \\beta - 4}}\n",
    "       \\frac{B(\\alpha, \\alpha)}{\n",
    "         (\\alpha - 1) (\\alpha - 2) - (\\alpha - 1) (\\alpha - 1)\n",
    "       }\n",
    "       & \\quad & \\alpha = \\beta\\\\\n",
    "    &= \\frac{\n",
    "         B(\\alpha, \\alpha)\n",
    "       }{\n",
    "         2 \\lambda_\\max^{\\alpha + \\beta - 4}\n",
    "       }\n",
    "       (\\alpha - 1)^{-1}\n",
    "\n",
    "By inspection, :math:`\\lim_{\\alpha \\rightarrow 1^+} \\sigma^2 = \\infty`.  This\n",
    "makes sense because a beta distribution with this configuration of parameters\n",
    "is a uniform distribution.  In order to approximate this with a normal\n",
    "distribution, the variance needs to be infinite.\n",
    "\n",
    "(a)\n",
    "---\n",
    "\n",
    "From :ref:`Exercise 3.2 <prince2012computer-ex-3.2>`, the peak of the beta\n",
    "distribution\n",
    "\n",
    ".. math::\n",
    "\n",
    "   \\BetaDist_\\lambda[\\alpha, \\beta]\n",
    "    &= B(\\alpha, \\beta)^{-1} \\lambda^{\\alpha - 1} (1 - \\lambda)^{\\beta - 1}\\\\\n",
    "    &= \\frac{\n",
    "         \\lambda^{\\alpha - 1} (1 - \\lambda)^{\\beta - 1}\n",
    "       }{\n",
    "         \\int_0^1 t^{\\alpha - 1} (1 - t)^{\\beta - 1} dt\n",
    "       }\\\\\n",
    "    &= \\frac{\\Gamma[\\alpha + \\beta]}{\\Gamma[\\alpha] \\Gamma[\\beta]}\n",
    "       \\lambda^{\\alpha - 1} (1 - \\lambda)^{\\beta - 1}\n",
    "\n",
    "is :math:`\\lambda_\\max = \\frac{\\alpha - 1}{\\alpha + \\beta - 2}`.\n",
    "\n",
    "(b)\n",
    "---\n",
    "\n",
    "The second derivative of the beta distribution is\n",
    "\n",
    ".. math::\n",
    "\n",
    "   \\frac{\\partial^2}{\\partial \\lambda^2} \\BetaDist_\\lambda[\\alpha, \\beta]\n",
    "    &= \\frac{\\partial}{\\partial \\lambda}\n",
    "       B(\\alpha, \\beta)^{-1}\n",
    "       \\left[\n",
    "         (\\alpha - 1) \\lambda^{\\alpha - 2} (1 - \\lambda)^{\\beta - 1} -\n",
    "         (\\beta - 1) \\lambda^{\\alpha - 1} (1 - \\lambda)^{\\beta - 2}\n",
    "       \\right]\\\\\n",
    "    &= B(\\alpha, \\beta)^{-1}\n",
    "       \\lambda^{\\alpha - 1} (1 - \\lambda)^{\\beta - 1}\n",
    "       \\left[\n",
    "         (\\alpha - 1) (\\alpha - 2) \\lambda^{-2} -\n",
    "         2 (\\alpha - 1) (\\beta - 1) \\lambda^{-1} (1 - \\lambda)^{-1} +\n",
    "         (\\beta - 1) (\\beta - 2) (1 - \\lambda)^{-2}\n",
    "       \\right]"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "Exercise 9.6\n",
    "============\n",
    "\n",
    "Recall that\n",
    "\n",
    ".. math::\n",
    "\n",
    "   \\DeclareMathOperator{\\NormDist}{Norm}\n",
    "   \\NormDist_x\\left[ \\mu, \\sigma^2 \\right] =\n",
    "   \\frac{1}{\\sigma \\sqrt{2 \\pi}}\n",
    "       \\exp\\left[ -\\frac{(x - \\mu)^2}{2 \\sigma^2} \\right]\n",
    "\n",
    "Clearly the mean, median, and mode of the distribution is at :math:`\\mu`.\n",
    "\n",
    "The second derivative of the normal distribution is\n",
    "\n",
    ".. math::\n",
    "\n",
    "   \\frac{\\partial^2}{\\partial x^2} \\NormDist_x\\left[ \\mu, \\sigma^2 \\right]\n",
    "    &= \\frac{\\partial}{\\partial x} \\left(\n",
    "         \\frac{1}{\\sigma \\sqrt{2 \\pi}}\n",
    "         \\exp\\left[ -\\frac{(x - \\mu)^2}{2 \\sigma^2} \\right]\n",
    "         \\frac{-2 (x - \\mu)}{2 \\sigma^2} (1)\n",
    "       \\right)\\\\\n",
    "    &= \\frac{\\partial}{\\partial x} \\left(\n",
    "         -\\frac{x - \\mu}{\\sigma^2} \\NormDist_x\\left[ \\mu, \\sigma^2 \\right]\n",
    "       \\right)\\\\\n",
    "    &= -\\frac{1}{\\sigma^2} \\NormDist_x\\left[ \\mu, \\sigma^2 \\right] +\n",
    "       \\frac{(x - \\mu)^2}{\\sigma^4} \\NormDist_x\\left[ \\mu, \\sigma^2 \\right].\n",
    "\n",
    "The variance of the Laplace approximation can be computed as\n",
    "\n",
    ".. math::\n",
    "\n",
    "   \\hat{\\sigma}^2 =\n",
    "   -\\frac{1}{\\NormDist''_{\\mu}\\left[ \\mu, \\sigma^2 \\right]} =\n",
    "   \\sigma^3 \\sqrt{2 \\pi}.\n",
    "\n",
    "This illustrates that the Laplace approximation is another univariate normal\n",
    "with a scaled variance."
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "Exercise 9.7\n",
    "============\n",
    "\n",
    ".. math::\n",
    "\n",
    "   \\frac{\\partial}{\\partial \\boldsymbol{\\phi}}\n",
    "       \\log \\NormDist_{\\boldsymbol{\\phi}}[\\boldsymbol{\\mu}, \\boldsymbol{\\Sigma}]\n",
    "    &= \\frac{\\partial}{\\partial \\boldsymbol{\\phi}} \\left[\n",
    "         -\\frac{1}{2} \\log \\left\\vert 2 \\pi \\boldsymbol{\\Sigma} \\right\\vert -\n",
    "         \\frac{1}{2} (\\boldsymbol{\\phi} - \\boldsymbol{\\mu})^\\top\n",
    "           \\boldsymbol{\\Sigma}^{-1} (\\boldsymbol{\\phi} - \\boldsymbol{\\mu})\n",
    "       \\right]\\\\\n",
    "    &= -\\frac{1}{2} \\left[\n",
    "         \\boldsymbol{\\Sigma}^{-1} (\\boldsymbol{\\phi} - \\boldsymbol{\\mu}) +\n",
    "         \\boldsymbol{\\Sigma}^{-\\top} (\\boldsymbol{\\phi} - \\boldsymbol{\\mu})\n",
    "       \\right]\n",
    "       & \\quad & \\text{(C.32)}\\\\\n",
    "    &= -\\boldsymbol{\\Sigma}^{-1} (\\boldsymbol{\\phi} - \\boldsymbol{\\mu})\\\\\\\\\\\\\n",
    "   \\frac{\\partial^2 L}{\\partial \\boldsymbol{\\phi}^2}\n",
    "    &= \\frac{\\partial}{\\partial \\boldsymbol{\\phi}^\\top}\n",
    "       \\frac{\\partial L}{\\partial \\boldsymbol{\\phi}}\\\\\n",
    "    &= \\frac{\\partial}{\\partial \\boldsymbol{\\phi}^\\top} \\left[\n",
    "         -\\boldsymbol{\\Sigma}^{-1} (\\boldsymbol{\\phi} - \\boldsymbol{\\mu})\n",
    "       \\right]\\\\\n",
    "    &= -\\boldsymbol{\\Sigma}^{-1}\n",
    "\n",
    "Since the second derivative of :math:`L` is independent of\n",
    ":math:`\\boldsymbol{\\phi}`, evaluating :math:`\\boldsymbol{\\phi}` at\n",
    ":math:`\\boldsymbol{\\mu}` is irrelevant."
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "Exercise 9.8\n",
    "============\n",
    "\n",
    "As shown in (8.30), one approach is to maximize the marginal likelihood using\n",
    "some nonlinear optimization procedure."
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "Exercise 9.9\n",
    "============\n",
    "\n",
    "The branching logistic regression model's prediction is the result of a single\n",
    "logistic regression model whose activation is a linear weighted sum of experts\n",
    "(e.g. linear functions of the data).\n",
    "\n",
    "The mixture of experts model's prediction is a weighted linear sum of logistic\n",
    "regression models.  This could be generalized so that the weights and/or the\n",
    "experts themselves contain a nonlinear activation term.  Moreover, this could\n",
    "be built hierarchically to form a tree structure.\n",
    "\n",
    "The mixture of experts model's parameters can be estimated via direct\n",
    "optimization of the log posterior probability or the EM algorithm."
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "Exercise 9.10\n",
    "=============\n",
    "\n",
    "Let\n",
    "\n",
    ".. math::\n",
    "\n",
    "   \\DeclareMathOperator{\\softmax}{softmax}\n",
    "   s_k =\n",
    "   \\softmax_k\\left[ a_1, a_2, \\ldots, a_K \\right] =\n",
    "   \\frac{\\exp a_k}{\\sum_{j = 1}^K \\exp a_j}.\n",
    "\n",
    "(a)\n",
    "---\n",
    "\n",
    "Recall that :math:`\\lim_{x \\rightarrow -\\infty} \\exp x = 0`.  Assuming\n",
    ":math:`a_k \\neq -\\infty` for :math:`1 \\leq k < K`,\n",
    "\n",
    ".. math::\n",
    "\n",
    "   0 < \\exp a_k &< \\sum_{j = 1}^K \\exp a_j\\\\\n",
    "   \\frac{\\exp a_k}{\\sum_{j = 1}^K \\exp a_j} &< 1\\\\\n",
    "   s_k &< 1.\n",
    "\n",
    "(b)\n",
    "---\n",
    "\n",
    ".. math::\n",
    "\n",
    "   \\sum_{k = 1}^K s_k\n",
    "    &= \\sum_{k = 1}^K \\frac{\\exp a_k}{\\sum_{j = 1}^K \\exp a_j}\\\\\n",
    "    &= \\frac{\\sum_k \\exp a_k}{\\sum_j \\exp a_j}\\\\\n",
    "    &= 1"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "Exercise 9.11\n",
    "=============\n",
    "\n",
    "The cost function is\n",
    "\n",
    ".. math::\n",
    "\n",
    "   \\DeclareMathOperator{\\CatDist}{Cat}\n",
    "   L &= \\sum_{i = 1}^I \\log \\CatDist_{w_i} \\softmax[a_1, \\ldots, a_N]\\\\\n",
    "    &= \\sum_{i = 1}^I \\sum_{n = 1}^N \\delta[w_i - n] \\log \\lambda_n\n",
    "       & \\quad & \\text{(3.8)}\\\\\n",
    "    &= \\sum_{i = 1}^I \\left(\n",
    "         -\\log\\left[ \\sum_{m = 1}^N \\exp a_m \\right] +\n",
    "         \\sum_{n = 1}^N \\delta[w_i - n] \\log \\exp a_n\n",
    "       \\right)\n",
    "       & \\quad & \\text{(9.59)}\\\\\n",
    "    &= \\sum_{i = 1}^I \\left(\n",
    "         -\\log\\left[\n",
    "           \\sum_{m = 1}^N \\exp \\boldsymbol{\\phi}_m^\\top \\mathbf{x}_i\n",
    "         \\right] +\n",
    "         \\sum_{n = 1}^N \\delta[w_i - n] \\boldsymbol{\\phi}_n^\\top \\mathbf{x}_i\n",
    "       \\right)\n",
    "       & \\quad & \\text{(9.58)}.\n",
    "\n",
    "The first derivative is\n",
    "\n",
    ".. math::\n",
    "\n",
    "   \\frac{\\partial L}{\\partial \\boldsymbol{\\phi}_n}\n",
    "    &= \\sum_{i = 1}^I\n",
    "         -\\frac{\n",
    "           \\exp \\boldsymbol{\\phi}_n^\\top \\mathbf{x}_i\n",
    "         }{\n",
    "           \\sum_{m = 1}^N \\exp \\boldsymbol{\\phi}_m^\\top \\mathbf{x}_i\n",
    "         } \\mathbf{x}_i +\n",
    "         \\delta[w_i - n] \\mathbf{x}_i\n",
    "       & \\quad & \\text{(C.28)}\\\\\n",
    "    &= -\\sum_{i = 1}^I \\left( y_{in} - \\delta[w_i - n] \\right) \\mathbf{x}_i."
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "Exercise 9.12\n",
    "=============\n",
    "\n",
    "This is essentially multi-class logistic regression (section 9.9) and\n",
    ":ref:`Exercise 6.2 <prince2012computer-ex-6.2>`.  In order to exploit the\n",
    "data's discrete nature, (random) classification trees (section 9.8 and 9.10)\n",
    "could be used."
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. rubric:: References\n",
    "\n",
    ".. bibliography:: chapter-09.bib"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}