Learning and Inference in Vision
Regression
Suppose the data is independent of the model parameters.
\[\begin{split}Pr(\boldsymbol{\theta} \mid w, x)
&= \frac{Pr(\boldsymbol{\theta}, w, x)}{Pr(w, x)}\\
&= \frac{
Pr(w \mid x, \boldsymbol{\theta})
Pr(x \mid \boldsymbol{\theta}) Pr(\boldsymbol{\theta})
}{
Pr(w \mid x) Pr(x)
}\\
&= \frac{
Pr(w \mid x, \boldsymbol{\theta}) Pr(\boldsymbol{\theta})
}{
Pr(w \mid x)
}
& \quad & Pr(x \mid \boldsymbol{\theta}) = Pr(x).\end{split}\]
Applications
Suppose the world state is independent of the model parameters [Eng].
\[\begin{split}Pr(w \mid x, \boldsymbol{\theta})
&= \frac{Pr(w, x, \boldsymbol{\theta})}{Pr(x, \boldsymbol{\theta})}\\
&= \frac{
Pr(x \mid w, \boldsymbol{\theta})
Pr(w \mid \boldsymbol{\theta}) Pr(\boldsymbol{\theta})
}{
Pr(x \mid \boldsymbol{\theta}) Pr(\boldsymbol{\theta})
}\\
&= \frac{
Pr(x \mid w, \boldsymbol{\theta}) Pr(w)
}{
Pr(x \mid \boldsymbol{\theta})
}
& \quad & Pr(w \mid \boldsymbol{\theta}) = Pr(w).\end{split}\]
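As a quick numerical illustration of the inference rule above, the following sketch (a toy discrete example of my own, not from the text) builds an arbitrary prior \(Pr(w)\) and likelihood \(Pr(x \mid w, \boldsymbol{\theta})\) and evaluates the posterior \(Pr(w \mid x, \boldsymbol{\theta})\).

```python
# Toy discrete illustration (my own example, not from the text) of the
# inference rule Pr(w | x, theta) = Pr(x | w, theta) Pr(w) / Pr(x | theta)
# when the world state w is independent of the model parameters theta.
import numpy as np

rng = np.random.default_rng(0)

W, X = 3, 4                                     # sizes of the w and x alphabets
prior_w = rng.dirichlet(np.ones(W))             # Pr(w), independent of theta
likelihood = rng.dirichlet(np.ones(X), size=W)  # Pr(x | w, theta), one row per w

evidence = prior_w @ likelihood                 # Pr(x | theta), marginalizing over w
posterior = likelihood * prior_w[:, None] / evidence  # Pr(w | x, theta), one column per x

assert np.allclose(posterior.sum(axis=0), 1.0)  # each column is a distribution over w
print(posterior[:, 0])                          # posterior over w given x = 0
```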
Exercise 6.1
(i), (iii), and (iv) are classification problems while (ii) and (v) are
regression problems.
(i)
\(\mathbf{w}\) represents a discrete state describing whether a face is male
or female.
\(\mathbf{x}\) is an image of a face that has been discretized into pixels
spanning some color space.
(ii)
\(\mathbf{w}\) represents a continuous state describing the 3D pose of a
human body, which covers all physically possible orientations and positions.
\(\mathbf{x}\) is an image of a body that has been discretized into pixels
spanning some color space.
(iii)
\(\mathbf{w}\) represents a discrete state spanning the four suits
(hearts, diamonds, clubs, spades) of a playing card.
\(\mathbf{x}\) is an image of a playing card that has been discretized into
pixels spanning some color space.
(iv)
\(\mathbf{w}\) represents a discrete binary state describing whether a
face image matches another face image.
\(\mathbf{x}\) consists of a pair of face images where each image
has been discretized into pixels spanning some color space.
(v)
\(\mathbf{w}\) represents a continuous state describing the 3D position of a
point.
\(\mathbf{x}\) consists of the images produced by a set of cameras and their
correspondences; all of which have been discretized into pixels spanning
arbitrary color spaces.
Exercise 6.2
Discriminative
According to [Brub][Brua], this is known as
multinomial logistic regression.
Use a categorical distribution to model the univariate discrete multi-valued
world state \(\mathbf{w}\) as \(Pr(\mathbf{w})\).
Let \(L_m(x) = \phi_{m, 0} + \phi_{m, 1} x\) denote the linear function of
the data \(x\) for \(m = 1, 2, \ldots, M - 1\).
Define the probability of observing one of the \(M\) possible outcomes as
\[\lambda_M(x) =
\left( 1 + \sum_{i = 1}^{M - 1} \exp L_i(x) \right)^{-1}
\quad \land \quad
\lambda_m(x) = \lambda_M(x) \exp L_m(x)\]
where \(\sum_m \lambda_m(x) = 1\) for \(x = 1, 2, \ldots, K\).
Applying the same notation as (3.8) gives
\[\DeclareMathOperator{\CatDist}{Cat}
Pr(\mathbf{w} \mid x, \boldsymbol{\theta}) =
\CatDist_{\mathbf{w}}\left[ \boldsymbol{\lambda}(x) \right]\]
where \(\boldsymbol{\theta} = \left\{ \phi_{m, 0}, \phi_{m, 1} \right\}_{m = 1}^{M - 1}\),
\(\mathbf{w} = \mathbf{e}_m\), and
\(\boldsymbol{\lambda} = \left( \lambda_1, \ldots, \lambda_M \right)^\top\).
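As a rough sketch of this parameterization (with made-up values for the \(\phi\) parameters and \(M = 3\)), the following computes \(\boldsymbol{\lambda}(x)\) from the linear activations and confirms it sums to one.

```python
# Sketch of the parameterization above with hypothetical parameter values:
# lambda_M(x) normalizes the M - 1 linear activations L_m(x) = phi_{m,0} + phi_{m,1} x.
import numpy as np

def lambda_vec(x, phi):
    """phi has shape (M - 1, 2); returns the length-M categorical parameter vector."""
    activations = phi[:, 0] + phi[:, 1] * x          # L_1(x), ..., L_{M-1}(x)
    lam_M = 1.0 / (1.0 + np.exp(activations).sum())  # last entry of lambda(x)
    return np.append(lam_M * np.exp(activations), lam_M)

phi = np.array([[0.5, -1.0],   # hypothetical phi_{1,0}, phi_{1,1}
                [0.2,  0.3]])  # hypothetical phi_{2,0}, phi_{2,1}   (M = 3)
lam = lambda_vec(2.0, phi)
print(lam, lam.sum())          # a valid categorical parameter vector; the sum is 1
```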
Generative
Since the world state is a discrete multi-valued univariate, define a prior
distribution over the world state as
\[Pr(\mathbf{w}) = \CatDist_{\mathbf{w}}\left[ \boldsymbol{\lambda}' \right]\]
where \(\mathbf{w} = \mathbf{e}_m\) and
\(\boldsymbol{\lambda}' = \left( \lambda'_1, \ldots, \lambda'_M \right)^\top\).
Use a categorical distribution to model the discrete multi-valued univariate
data \(\mathbf{x}\) as \(Pr(\mathbf{x})\).
Let \(L_k(w) = \phi_{k, 0} + \phi_{k, 1} w\) denote the linear function of
the world state \(w\) for \(k = 1, 2, \ldots, K - 1\).
Define the probability of observing one of the \(K\) possible outcomes as
\[\lambda_K(w) =
\left( 1 + \sum_{i = 1}^{K - 1} \exp L_i(w) \right)^{-1}
\quad \land \quad
\lambda_k(w) = \lambda_K(w) \exp L_k(w)\]
where \(\sum_k \lambda_k(w) = 1\) for all \(w = 1, 2, \ldots, M\).
Applying the same notation as (3.8) yields
\[Pr(\mathbf{x} \mid w, \boldsymbol{\theta}) =
\CatDist_{\mathbf{x}}\left[ \boldsymbol{\lambda}(w) \right]\]
where \(\boldsymbol{\theta} =
\left\{ \boldsymbol{\lambda}', \left\{ \phi_{k, 0}, \phi_{k, 1} \right\}_{k = 1}^{K - 1} \right\}\),
\(\mathbf{x} = \mathbf{e}_k\), and
\(\boldsymbol{\lambda} = \left( \lambda_1, \ldots, \lambda_K \right)^\top\).
Exercise 6.3
Since the world state is univariate and continuous, define a prior
distribution over the world state as
\[\DeclareMathOperator{\NormDist}{Norm}
Pr(w) = \NormDist_w\left[ \mu_p, \sigma_p^2 \right].\]
Use a Bernoulli distribution to model the univariate binary discrete data
\(x\) as \(Pr(x)\).
Let \(\lambda(w) = \phi_0 + \phi_1 w\) denote a linear function of the world
state \(w\). The generative regression model is then
\[\DeclareMathOperator{\BernDist}{Bern}
\DeclareMathOperator{\sigmoid}{sig}
Pr(x \mid w, \boldsymbol{\theta}) =
\BernDist_x\left[ \sigmoid\left( \lambda(w) \right) \right] =
\BernDist_x\left[ \frac{1}{1 + \exp\left[ -\phi_0 - \phi_1 w \right]} \right]\]
where \(\boldsymbol{\theta} = \{ \mu_p, \sigma_p^2, \phi_0, \phi_1 \}\).
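Because this prior and likelihood are not conjugate, inference over \(w\) has no closed form; the sketch below (with made-up parameter values) evaluates the posterior \(Pr(w \mid x)\) on a grid.

```python
# Grid-based posterior for the generative model above, using made-up values
# for the prior (mu_p, sigma_p^2) and the likelihood parameters (phi_0, phi_1).
import numpy as np
from scipy.stats import norm

mu_p, sigma_p = 0.0, 1.0        # prior mean and standard deviation (hypothetical)
phi0, phi1 = -0.5, 2.0          # hypothetical likelihood parameters

w = np.linspace(-5.0, 5.0, 2001)
prior = norm.pdf(w, loc=mu_p, scale=sigma_p)
lam = 1.0 / (1.0 + np.exp(-(phi0 + phi1 * w)))      # sig(phi_0 + phi_1 w)

x = 1                                               # the observed binary datum
likelihood = lam if x == 1 else 1.0 - lam           # Bern_x[sig(phi_0 + phi_1 w)]
posterior = prior * likelihood
posterior /= posterior.sum() * (w[1] - w[0])        # normalize on the grid

print(w[np.argmax(posterior)])                      # MAP estimate of w
```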
Exercise 6.4
Use a beta distribution to model the univariate continuous world state
\(w \in [0, 1]\) as \(Pr(w)\).
Since the data \(x\) is univariate and continuous, one arbitrary choice is to
model it as \(Pr(x) = \NormDist_x\left[ \mu, \sigma^2 \right]\) and express the
parameters of the beta distribution in those terms (see
Exercise 3.3):
\[\alpha = \mu \left( \frac{\mu (1 - \mu)}{\sigma^2} - 1 \right)
\quad \text{and} \quad
\beta = (1 - \mu) \left( \frac{\mu (1 - \mu)}{\sigma^2} - 1 \right).\]
The discriminative regression model is then
\[\DeclareMathOperator{\BetaDist}{Beta}
Pr(w \mid x, \boldsymbol{\theta}) = \BetaDist_w[\alpha, \beta]\]
where \(\boldsymbol{\theta} = \left\{ \mu, \sigma^2 \right\}\).
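A small numerical check of the moment matching above, with made-up values for \(\mu\) and \(\sigma^2\) (any choice with \(0 < \mu < 1\) and \(\sigma^2 < \mu(1 - \mu)\) keeps \(\alpha, \beta > 0\)):

```python
# Recover alpha and beta from a hypothetical mean and variance, then confirm
# the resulting beta distribution reproduces those moments.
from scipy.stats import beta as beta_dist

mu, var = 0.3, 0.01                      # hypothetical mu and sigma^2
common = mu * (1 - mu) / var - 1
alpha, b = mu * common, (1 - mu) * common

mean, variance = beta_dist.stats(alpha, b, moments="mv")
print(alpha, b)                          # 6.0 and 14.0 for the values above
print(mean, variance)                    # recovers 0.3 and 0.01
```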
Exercise 6.5
\[\begin{split}L
&= \sum_{i = 1}^I
\log \NormDist_{w_i}\left[ \phi_0 + \phi_1 x_i, \sigma^2 \right]\\
&= \sum_{i = 1}^I \log
\frac{1}{\sqrt{2 \pi \sigma^2}}
\exp\left[
-\frac{(w_i - \phi_0 - \phi_1 x_i)^2}{2 \sigma^2}
\right]\\
&= -\frac{I}{2} \log 2 \pi - \frac{I}{2} \log \sigma^2 -
\frac{1}{2 \sigma^2} \sum_{i = 1}^I (w_i - \phi_0 - \phi_1 x_i)^2\end{split}\]
(a)
\[\begin{split}\frac{\partial L}{\partial \phi_0}
&= -\frac{1}{2 \sigma^2}
\sum_{i = 1}^I 2 (w_i - \phi_0 - \phi_1 x_i) (-1)\\
0 &= \frac{1}{\sigma^2} \sum_{i = 1}^I w_i - \phi_0 - \phi_1 x_i\\
\phi_0 &= \frac{1}{I} \sum_{i = 1}^I w_i - \phi_1 x_i\end{split}\]
(b)
\[\begin{split}\frac{\partial L}{\partial \phi_1}
&= -\frac{1}{2 \sigma^2}
\sum_{i = 1}^I 2 (w_i - \phi_0 - \phi_1 x_i) (-x_i)\\
0 &= \frac{1}{\sigma^2}
\sum_{i = 1}^I w_i x_i - \phi_0 x_i - \phi_1 x_i^2\\
\phi_1 &= \frac{\sum_{i = 1}^I x_i (w_i - \phi_0)}{\sum_{i = 1}^I x_i^2}\end{split}\]
(c)
\[\begin{split}\frac{\partial L}{\partial \sigma}
&= -\frac{I}{2 \sigma^2} 2 \sigma -
\frac{1}{2 \sigma^3} (-2) \sum_{i = 1}^I (w_i - \phi_0 - \phi_1 x_i)^2\\
\frac{I}{\sigma}
&= \frac{1}{\sigma^3} \sum_{i = 1}^I (w_i - \phi_0 - \phi_1 x_i)^2\\
\sigma^2 &= \frac{1}{I} \sum_{i = 1}^I (w_i - \phi_0 - \phi_1 x_i)^2\end{split}\]
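The following sketch checks these maximum likelihood estimates on synthetic data (arbitrary ground-truth parameters of my own). Since (a) and (b) are coupled in \(\phi_0\) and \(\phi_1\), they are solved as a \(2 \times 2\) linear system.

```python
# Verify the closed-form ML estimates above on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
I = 10_000
x = rng.normal(size=I)
w = 1.5 + 0.8 * x + rng.normal(scale=0.5, size=I)   # phi_0 = 1.5, phi_1 = 0.8, sigma = 0.5

# (a): I phi_0 + (sum x_i) phi_1 = sum w_i
# (b): (sum x_i) phi_0 + (sum x_i^2) phi_1 = sum x_i w_i
A = np.array([[I, x.sum()], [x.sum(), (x ** 2).sum()]])
rhs = np.array([w.sum(), (x * w).sum()])
phi0, phi1 = np.linalg.solve(A, rhs)

# (c): the ML variance is the mean squared residual.
sigma2 = np.mean((w - phi0 - phi1 * x) ** 2)
print(phi0, phi1, np.sqrt(sigma2))                  # close to 1.5, 0.8, 0.5
```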
Exercise 6.6
\[\begin{split}Pr(w_i \mid x_i)
&= \frac{Pr(w_i, x_i)}{Pr(x_i)}\\
&= \NormDist_{w_i}\left[
\mu_w + \sigma_{xw}^2 \sigma_{xx}^{-1} (x_i - \mu_x),
\sigma_{ww}^2 - \sigma_{xw}^2 \sigma_{xx}^{-1} \sigma_{xw}^2
\right]
& \quad & \text{(5.13) and Exercise 5.5}\\
&= \NormDist_{w_i}\left[ \phi_0 + \phi_1 x_i, \sigma^2 \right]
& \quad & \text{Exercise 6.5, (a), (b), (c)}\end{split}\]
where
\[\begin{split}\phi_0 &= \mu_w - \sigma_{xw}^2 \sigma_{xx}^{-1} \mu_x\\\\
\phi_1 &= \sigma_{xw}^2 \sigma_{xx}^{-1}\\\\
\sigma^2 &= \sigma_{ww}^2 - \sigma_{xw}^2 \sigma_{xx}^{-1} \sigma_{xw}^2.\end{split}\]
See Exercise 5.5 and
Exercise 6.5 for more details.
(a)
To simplify notation, rewrite the MLE of \(\phi_0\) as
\[\phi_0 = \frac{1}{I} \sum_{i = 1}^I w_i - \phi_1 x_i = \mu_w - \phi_1 \mu_x\]
where \(\mu_w = I^{-1} \sum_{i = 1}^I w_i\) and
\(\mu_x = I^{-1} \sum_{i = 1}^I x_i\).
(b)
To simplify notation, rewrite the MLE of \(\phi_1\) as
\[\begin{split}\phi_1 &= \frac{\sum_{i = 1}^I x_i (w_i - \phi_0)}{\sum_{i = 1}^I x_i^2}\\
\phi_1 \sum_{i = 1}^I x_i^2
&= \sum_{i = 1}^I x_i w_i - x_i (\mu_w - \phi_1 \mu_x)
& \quad & \text{(a)}\\
\phi_1
&= \frac{
\sum_{i = 1}^I x_i w_i - x_i \mu_w
}{
\sum_{i = 1}^I x_i^2 - x_i \mu_x
}\\
&= \frac{
I^{-1} \sum_{i = 1}^I x_i w_i - x_i \mu_w
}{
I^{-1} \sum_{i = 1}^I x_i^2 - x_i \mu_x
}\\
&= \left(
\frac{\sum_{i = 1}^I x_i w_i}{I} - \mu_x \mu_w
\right)
\left(
\frac{\sum_{i = 1}^I x_i^2}{I} - \mu_x^2
\right)^{-1}\\
&= \frac{
\sum_{i = 1}^I x_i w_i - \mu_x \mu_w
}{
\sum_{i = 1}^I x_i^2 - \mu_x^2
}.\end{split}\]
(c)
Substituting in the MLE of \(\phi_0\) and \(\phi_1\) into
\(\sigma^2\) gives
\[\begin{split}\sigma^2 &= I^{-1} \sum_{i = 1}^I (w_i - \phi_0 - \phi_1 x_i)^2\\
&= I^{-1} \sum_{i = 1}^I
w_i^2 - 2 w_i (\phi_0 + \phi_1 x_i) + (\phi_0 + \phi_1 x_i)^2\\
&= I^{-1} \sum_{i = 1}^I
w_i^2 - 2 w_i \phi_0 - 2 w_i x_i \phi_1 +
\phi_0^2 + 2 \phi_0 \phi_1 x_i + \phi_1^2 x_i^2\\
&= I^{-1} \sum_{i = 1}^I
w_i^2 +
\phi_1^2 \left(
x_i^2 - 2 x_i \mu_x + \mu_x^2
\right) +
\phi_1 \left(
2 \mu_w x_i - 2 \mu_x \mu_w - 2 x_i w_i + 2 \mu_x w_i
\right) +
\left(
\mu_w^2 - 2 \mu_w w_i
\right)\\
&= I^{-1} \sum_{i = 1}^I
w_i^2 +
\phi_1^2 \left( x_i^2 - \mu_x^2 \right) -
2 \phi_1 \left( x_i w_i - \mu_x \mu_w \right) -
\mu_w^2\\
&= \frac{\sum_{i = 1}^I w_i^2 - \mu_w^2}{I} +
\frac{\phi_1^2}{I} \left( \sum_{i = 1}^I x_i^2 - \mu_x^2 \right) -
\frac{2 \phi_1}{I} \left( \sum_{i = 1}^I x_i w_i - \mu_x \mu_w \right)\\
&= \frac{\sum_{i = 1}^I w_i^2 - \mu_w^2}{I} -
\left(
\frac{\sum_{i = 1}^I x_i w_i}{I} - \mu_x \mu_w
\right)^2
\left(
\frac{\sum_{i = 1}^I x_i^2}{I} - \mu_x^2
\right)^{-1}\\
&= \frac{\sum_{i = 1}^I (w_i - \mu_w)^2}{I} -
\left(
I^{-1} \sum_{i = 1}^I (x_i - \mu_x) (w_i - \mu_w)
\right)^2
\left(
I^{-1} \sum_{i = 1}^I (x_i - \mu_x)^2
\right)^{-1}\\
&= \sigma_{ww}^2 - \sigma_{xw}^2 \sigma_{xx}^{-1} \sigma_{xw}^2
& \quad & \text{definition of covariance with uniform probability.}\end{split}\]
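A quick numerical check (synthetic data, my own example) that the moment-based expressions above agree with an independent least-squares fit:

```python
# phi_0, phi_1, sigma^2 from the biased ("uniform probability") sample moments,
# compared against np.polyfit on the same synthetic data.
import numpy as np

rng = np.random.default_rng(2)
I = 10_000
x = rng.normal(size=I)
w = -0.3 + 1.2 * x + rng.normal(scale=0.7, size=I)

mu_x, mu_w = x.mean(), w.mean()
var_xx = np.mean((x - mu_x) ** 2)
cov_xw = np.mean((x - mu_x) * (w - mu_w))
var_ww = np.mean((w - mu_w) ** 2)

phi1 = cov_xw / var_xx                    # sigma_xw^2 sigma_xx^{-1}
phi0 = mu_w - phi1 * mu_x
sigma2 = var_ww - cov_xw ** 2 / var_xx

phi1_ls, phi0_ls = np.polyfit(x, w, deg=1)
print(np.allclose([phi0, phi1], [phi0_ls, phi1_ls]))            # True
print(np.isclose(sigma2, np.mean((w - phi0 - phi1 * x) ** 2)))  # True
```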
Exercise 6.7
(1)
Assuming a uniform prior \(Pr(w)\) simplifies (6.11) to
\[Pr(w \mid x) =
\frac{
Pr(x \mid w) Pr(w)
}{
\sum_{w \in \{ 0, 1 \}} Pr(x \mid w) Pr(w)
} =
\frac{
Pr(x \mid w)
}{
Pr(x \mid w = 1) + Pr(x \mid w = 0)
}.\]
The points on the decision boundary obey
\[\begin{split}Pr(w = 0 \mid x) &= Pr(w = 1 \mid x)\\
Pr(x \mid w = 0) &= Pr(x \mid w = 1)\\
\NormDist_x\left[ \mu_0, \sigma_0^2 \right]
&= \NormDist_x\left[ \mu_1, \sigma_1^2 \right]\\
-\frac{1}{2} \log 2 \pi - \frac{1}{2} \log \sigma_0^2 -
\frac{(x - \mu_0)^2}{2 \sigma_0^2}
&= -\frac{1}{2} \log 2 \pi - \frac{1}{2} \log \sigma_1^2 -
\frac{(x - \mu_1)^2}{2 \sigma_1^2}
& \quad & \text{rearrange into a quadratic equation using log normals}\\
\log \sigma_0^2 + \frac{(x - \mu_0)^2}{\sigma_0^2}
&= \log \sigma_1^2 + \frac{(x - \mu_1)^2}{\sigma_1^2}\\
a x^2 + bx + c &= 0\end{split}\]
where
\[\begin{split}a &= \sigma_0^{-2} - \sigma_1^{-2}\\\\
b &= 2 \left( \mu_1 \sigma_1^{-2} - \mu_0 \sigma_0^{-2} \right)\\\\
c &= \mu_0^2 \sigma_0^{-2} - \mu_1^2 \sigma_1^{-2} +
\log \sigma_0^2 - \log \sigma_1^2.\end{split}\]
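As a sanity check (with made-up class parameters), the roots of \(ax^2 + bx + c = 0\) should be exactly the points where the two class likelihoods are equal:

```python
# Solve for the decision boundary of two univariate normals and confirm that
# both class-conditional densities agree at the roots. Parameter values are made up.
import numpy as np
from scipy.stats import norm

mu0, s0 = 0.0, 1.0              # class w = 0: mean and standard deviation
mu1, s1 = 2.0, 1.5              # class w = 1

a = s0 ** -2 - s1 ** -2
b = 2 * (mu1 * s1 ** -2 - mu0 * s0 ** -2)
c = mu0 ** 2 * s0 ** -2 - mu1 ** 2 * s1 ** -2 + np.log(s0 ** 2) - np.log(s1 ** 2)

roots = np.roots([a, b, c]).real   # both roots are real for these parameters
print(roots)
print(norm.pdf(roots, mu0, s0))    # identical to the next line at both roots
print(norm.pdf(roots, mu1, s1))
```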
(2)
The decision boundary for the logistic regression model has
the form
\[\begin{split}Pr(w = 0 \mid x) &= Pr(w = 1 \mid x)\\
\BernDist_{w = 0}\left[ \sigmoid\left( \phi_0 + \phi_1 x \right) \right]
&= \BernDist_{w = 1}\left[
\sigmoid\left( \phi_0 + \phi_1 x \right)
\right]\\
1 - \sigmoid\left(\phi_0 + \phi_1 x \right)
&= \sigmoid\left( \phi_0 + \phi_1 x \right)\\
1 + \exp \left( -\phi_0 - \phi_1 x \right) &= 2\\
\phi_1 x + \phi_0 &= 0.\end{split}\]
Exercise 6.8
The following uses the results of
Exercise 6.7.
(1)
Suppose \(Pr(w)\) is uniform, \(\mu_0 = 0\), \(\sigma_0^2 = \sigma^2\),
\(\mu_1 = 0\), and \(\sigma_1^2 = 1.5 \sigma^2\). Then
\[\begin{split}a &= \sigma_0^{-2} - \sigma_1^{-2} = \frac{1}{3 \sigma^2}\\\\
b &= 2 \left( \mu_1 \sigma_1^{-2} - \mu_0 \sigma_0^{-2} \right) = 0\\\\
c &= \mu_0^2 \sigma_0^{-2} - \mu_1^2 \sigma_1^{-2} +
\log \sigma_0^2 - \log \sigma_1^2
= -\log 1.5.\end{split}\]
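Solving \(ax^2 + bx + c = 0\) with these coefficients gives the two boundary points \(x = \pm\sqrt{3 \sigma^2 \log 1.5}\); a quick check (assuming \(\sigma = 1\)):

```python
# Confirm the boundary points x = +/- sqrt(3 sigma^2 log 1.5) and that the two
# class densities are equal there (sigma = 1 assumed for the check).
import numpy as np
from scipy.stats import norm

sigma = 1.0
a, b, c = 1 / (3 * sigma ** 2), 0.0, -np.log(1.5)
x_star = np.sqrt(3 * sigma ** 2 * np.log(1.5))

assert np.allclose(np.sort(np.roots([a, b, c])), [-x_star, x_star])
print(norm.pdf(x_star, 0.0, sigma))                  # equal to the line below
print(norm.pdf(x_star, 0.0, np.sqrt(1.5) * sigma))
```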
(2)
For the discriminative classifier to have the same decision boundary, the
argument of the logistic sigmoid must be the quadratic function
\[\phi_2 x^2 + \phi_1 x + \phi_0\]
where
\[\begin{split}\begin{gather*}
\phi_2 = a\\
\phi_1 = 0\\
\phi_0 = c.
\end{gather*}\end{split}\]
Exercise 6.9
Let \(G(n)\) and \(D(n)\) denote the number of parameters the generative and
discriminative models have as a function of the dimensionality \(n\) of the
data \(\mathbf{x} \in \mathbb{R}^n\).
Generative Model
Suppose the prior is uniform and the model parameters are
\(\boldsymbol{\theta} = \left\{
\boldsymbol{\mu}_0, \boldsymbol{\mu}_1,
\boldsymbol{\Sigma}_0, \boldsymbol{\Sigma}_1 \right\}\).
Recall that a symmetric matrix (e.g. a covariance matrix) has
\(\frac{n (n + 1)}{2}\) free parameters.
\[G(n) = 2n + 2 \frac{n (n + 1)}{2} = n^2 + 3n.\]
Discriminative Model
The model parameters consist of
\(\boldsymbol{\theta} = \{ \phi_0, \boldsymbol{\phi} \}\).
\[D(n) = n + 1.\]
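The gap between the two counts grows quadratically with the data dimensionality; a quick tabulation:

```python
# Parameter counts as a function of the data dimensionality n, assuming a
# uniform prior for the generative model as above.
def G(n):
    # two mean vectors plus two symmetric covariance matrices
    return 2 * n + 2 * (n * (n + 1) // 2)

def D(n):
    # the offset phi_0 plus one weight per data dimension
    return n + 1

for n in (1, 10, 100):
    print(n, G(n), D(n))   # e.g. n = 100: 10300 generative vs 101 discriminative
```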
Exercise 6.10
The goal now is to infer a multi-valued label \(w_n \in \{0, 1, 2\}\) that
indicates whether the \(n\text{th}\) pixel in the image is part of a known
background \((w = 0)\), foreground \((w = 1)\), or shadow
\((w = 2)\).
The prior \(Pr(w)\) would be a categorical distribution.
Since the background is known and there is lighting in the scene, shadows
will make the pixels “dimmer”. In addition to Equations (6.16) and (6.17), the
class conditional distribution of the shadows could be modeled as
\[Pr(\mathbf{x}_n \mid w = 2) =
\NormDist_{\mathbf{x}_n}\left[
\boldsymbol{\mu}_{n2}, \boldsymbol{\Sigma}_{n2}
\right].\]
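A minimal per-pixel sketch of this three-class model, with made-up means, covariances, and prior (in practice these would be learned from training data for each pixel \(n\)):

```python
# Posterior over {background, foreground, shadow} for a single pixel. All
# numerical values below are hypothetical; the shadow mean is a dimmed copy
# of the background mean, as suggested above.
import numpy as np
from scipy.stats import multivariate_normal

prior = np.array([0.6, 0.3, 0.1])           # Pr(w = 0), Pr(w = 1), Pr(w = 2)
means = [np.array([120.0, 110.0, 100.0]),   # background RGB mean
         np.array([60.0, 80.0, 200.0]),     # foreground RGB mean
         np.array([70.0, 65.0, 60.0])]      # shadow RGB mean ("dimmer" background)
covs = [40.0 * np.eye(3), 400.0 * np.eye(3), 80.0 * np.eye(3)]

x = np.array([75.0, 70.0, 62.0])            # observed pixel value

likelihoods = np.array([multivariate_normal.pdf(x, mean=m, cov=c)
                        for m, c in zip(means, covs)])
posterior = likelihoods * prior
posterior /= posterior.sum()
print(posterior)                            # the shadow class (w = 2) dominates here
```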
References
- Brua
Jerry Brunner. Logistic regression. http://www.utstat.toronto.edu/~brunner/oldclass/312f12/lectures/312f12LogisticRegression1.pdf. Accessed on 2017-06-19.
- Brub
Jerry Brunner. Multinomial logit models. http://www.utstat.toronto.edu/~brunner/oldclass/312f12/lectures/312f12MultinomialLogit.pdf. Accessed on 2017-06-19.
- Eng
Barbara Engelhardt. Introduction: MLE, MAP, Bayesian reasoning. https://web.archive.org/web/20150225224855/https://genome.duke.edu/labs/engelhardt/courses/scribe/lec_08_28_2013.pdf. Accessed on 2017-06-17.