(i)
A change of coordinates in the parameter space is also known as a
reparameterization, and the density in the new coordinates is obtained through
the change-of-variable technique.
Let \(X\) be a continuous random variable with probability density function
\(f_X(x)\) defined over the support \(c_1 \leq x \leq c_2\), and let
\(Y = u(X)\) be a strictly increasing (and hence invertible) function of
\(X\) with inverse function \(X = v(Y)\). The probability density
function of \(Y\) is
\[f_Y(y) = f_X(v(y)) \left\vert v'(y) \right\vert\]
defined over the support \(u(c_1) \leq y \leq u(c_2)\).
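As a quick sanity check of this formula, the sketch below (assuming NumPy is available; the choice \(X \sim \mathrm{Uniform}(0, 1)\) with \(u(x) = \sqrt{x}\) is arbitrary) compares a histogram of \(Y = \sqrt{X}\) against the predicted density \(f_Y(y) = f_X(y^2) \left\vert 2y \right\vert = 2y\).

```python
import numpy as np

rng = np.random.default_rng(0)                       # arbitrary seed
x = rng.uniform(0.0, 1.0, size=1_000_000)            # X ~ Uniform(0, 1), so f_X = 1
y = np.sqrt(x)                                       # Y = u(X) = sqrt(X)

# Empirical density of Y from a histogram.
hist, edges = np.histogram(y, bins=50, range=(0.0, 1.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# Predicted density from the change-of-variable formula: f_Y(y) = 2 y.
print(np.max(np.abs(hist - 2.0 * centers)))          # small, up to Monte Carlo noise
```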
Consider the Bernoulli distribution with a uniform prior such that
\[\begin{split}p(x_i \mid \mu)
&= \mu^{x_i} (1 - \mu)^{1 - x_i}
& \quad & x_i \in \{0, 1\},\\
p_X(\mu)
&= 1
& \quad & c_1 = 0, c_2 = 1, \mu \in [0, 1].\end{split}\]
The maximum a posteriori (MAP) estimation is
\[\begin{split}\DeclareMathOperator*{\argmax}{arg\,max}
\hat{\mu}_{\mathrm{MAP}}
&= \argmax_\mu p(\mu \mid X)\\
&= \argmax_\mu \log p_X(\mu) + \sum_{x_i \in X} \log p(x_i \mid \mu)
& \quad & p(X) = \mathrm{const}\\
&= \argmax_\mu \sum_{i = 1}^n x_i \log \mu + (1 - x_i) \log(1 - \mu)\\
&= \frac{1}{n} \sum_i x_i
& \quad & \text{by setting the derivative with respect to } \mu \text{ to zero}\\
&= \hat{\mu}_{\mathrm{ML}}.\end{split}\]
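A minimal numerical sketch of this result (assuming NumPy; the true parameter, sample size, and seed are arbitrary): maximizing the log-posterior on a grid of \(\mu\) values recovers the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)                          # arbitrary seed
x = rng.binomial(1, 0.3, size=200)                      # Bernoulli observations

mu = np.linspace(1e-6, 1 - 1e-6, 100_001)               # grid over (0, 1)
log_prior = np.zeros_like(mu)                           # uniform prior: log 1 = 0
log_lik = x.sum() * np.log(mu) + (len(x) - x.sum()) * np.log(1 - mu)

print(mu[np.argmax(log_prior + log_lik)], x.mean())     # agree up to grid spacing
```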
The following derivations demonstrate that the mode (MAP estimate) and mean of
the posterior distribution are not invariant to reparameterization.
Parameterization I
Now define \(\theta = u(\mu) = \sqrt{\mu}\) and
\(\mu = v(\theta) = \theta^2\). The support happens to be the same for this
parameterization. The new prior is
\[\begin{split}p_Y(\theta)
&= p_X(v(\theta)) \left\vert v'(\theta) \right\vert\\
&= p_X(\theta^2) \left\vert 2 \theta \right\vert
& \quad & v'(\theta) = \frac{\partial}{\partial \theta} v(\theta)\\
&= 2 \theta
& \quad & \theta \in [0, 1].\end{split}\]
The reparameterized MAP estimation is
\[\begin{split}\hat{\theta}_{\mathrm{MAP}}
&= \argmax_\theta p(\theta \mid X)\\
&= \argmax_\theta \frac{p(X \mid \theta) p_Y(\theta)}{p(X)}\\
&= \argmax_\theta \log p_Y(\theta) +
\sum_{x_i \in X} \log p(x_i \mid \theta)
& \quad & p(X) = \mathrm{const}\\
&= \argmax_\theta \log (2 \theta) +
\sum_{i = 1}^n x_i \log \theta + (1 - x_i) \log(1 - \theta)\\
&= \frac{1 + \sum_i x_i}{n + 1}\\
&\neq u\left( \hat{\mu}_{\mathrm{MAP}} \right).\end{split}\]
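The same grid search applied to the objective in the final arg max line above (again a sketch assuming NumPy) recovers \((1 + \sum_i x_i) / (n + 1)\) and shows that it differs from \(u\left( \hat{\mu}_{\mathrm{MAP}} \right)\).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=200)                 # same kind of data as above
n, k = len(x), x.sum()

theta = np.linspace(1e-6, 1 - 1e-6, 100_001)
objective = np.log(2 * theta) + k * np.log(theta) + (n - k) * np.log(1 - theta)

print(theta[np.argmax(objective)], (1 + k) / (n + 1))   # grid maximizer vs. closed form
print(np.sqrt(x.mean()))                                # u(mu_MAP) is a different value
```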
The mean of the posterior distribution is
\[\begin{split}\DeclareMathOperator*{\argmin}{arg\,min}
\hat{\theta}(x = 1)
&= \left\langle \theta \right\rangle_{p(\theta \mid x = 1)}
& \quad & \text{(ii)}\\
&= \int \theta p(\theta \mid x = 1) d\theta\\
&= \int \theta
\frac{
p(x = 1 \mid \theta) p_Y(\theta)
}{
\int p(x = 1 \mid \theta') p_Y(\theta') d\theta'
}
d\theta\\
&= \int \frac{
\theta \cdot \theta \cdot (2 \theta)
}{
\int \theta' (2 \theta') d\theta'
}
d\theta\\
&= \frac{3}{2} \int 2 \theta^3 d\theta
& \quad & \int \theta' (2 \theta') d\theta' = \frac{2}{3}\\
&= \frac{3}{4}
& \quad & \int 2 \theta^3 d\theta = \frac{1}{2}.\end{split}\]
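The integrals above can be checked with a crude Riemann sum (a sketch assuming NumPy); the ratio comes out close to \(3/4\).

```python
import numpy as np

theta = np.linspace(0.0, 1.0, 1_000_001)
d = theta[1] - theta[0]

numerator = np.sum(theta * theta * (2 * theta)) * d     # ∫ θ · θ · 2θ dθ
denominator = np.sum(theta * (2 * theta)) * d           # ∫ θ' · 2θ' dθ' ≈ 2/3
print(numerator / denominator)                          # ≈ 0.75
```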
Parameterization II
Now define \(\phi = u(\mu) = 1 - \sqrt{1 - \mu}\) and
\(\mu = v(\phi) = 1 - (1 - \phi)^2\). The support happens to be the same
for this parameterization. The new prior is
\[\begin{split}p_Y(\phi)
&= p_X(v(\phi)) \left\vert v'(\phi) \right\vert\\
&= p_X\left( 1 - (1 - \phi)^2 \right) \left\vert 2 (1 - \phi) \right\vert
& \quad & v'(\phi) = \frac{\partial}{\partial \phi} v(\phi)\\
&= 2 - 2 \phi
& \quad & \phi \in [0, 1].\end{split}\]
The reparameterized MAP estimation is
\[\begin{split}\hat{\phi}_{\mathrm{MAP}}
&= \argmax_\phi p(\phi \mid X)\\
&= \argmax_\phi \frac{p(X \mid \phi) p_Y(\phi)}{p(X)}\\
&= \argmax_\phi \log p_Y(\phi) +
\sum_{x_i \in X} \log p(x_i \mid \phi)
& \quad & p(X) = \mathrm{const}\\
&= \argmax_\phi \log \left( 2 (1 - \phi) \right) +
\sum_{i = 1}^n x_i \log \phi + (1 - x_i) \log(1 - \phi)\\
&= \frac{1}{n + 1} \sum_i x_i\\
&\neq u\left( \hat{\mu}_{\mathrm{MAP}} \right).\end{split}\]
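The analogous grid search (a sketch assuming NumPy) recovers \(\frac{1}{n + 1} \sum_i x_i\) and again differs from \(u\left( \hat{\mu}_{\mathrm{MAP}} \right)\).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=200)
n, k = len(x), x.sum()

phi = np.linspace(1e-6, 1 - 1e-6, 100_001)
objective = np.log(2 * (1 - phi)) + k * np.log(phi) + (n - k) * np.log(1 - phi)

print(phi[np.argmax(objective)], k / (n + 1))      # grid maximizer vs. closed form
print(1 - np.sqrt(1 - x.mean()))                   # u(mu_MAP) is a different value
```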
The mean of the posterior distribution is
\[\begin{split}\hat{\phi}(x = 1)
&= \left\langle \phi \right\rangle_{p(\phi \mid x = 1)}
& \quad & \text{(ii)}\\
&= \int \phi p(\phi \mid x = 1) d\phi\\
&= \int \phi
\frac{
p(x = 1 \mid \phi) p_Y(\phi)
}{
\int p(x = 1 \mid \phi') p_Y(\phi') d\phi'
}
d\phi\\
&= \int \frac{
\phi \cdot \phi \cdot (2 - 2 \phi)
}{
\int \phi' (2 - 2 \phi') d\phi'
}
d\phi\\
&= 3 \int \left( 2 \phi^2 - 2 \phi^3 \right) d\phi
& \quad & \int \phi' (2 - 2 \phi') d\phi' = \frac{1}{3}\\
&= \frac{1}{2}
& \quad & \int \left( 2 \phi^2 - 2 \phi^3 \right) d\phi = \frac{1}{6}.\end{split}\]
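The same Riemann-sum check (assuming NumPy) confirms the value \(1/2\).

```python
import numpy as np

phi = np.linspace(0.0, 1.0, 1_000_001)
d = phi[1] - phi[0]

numerator = np.sum(phi * phi * (2 - 2 * phi)) * d       # ∫ φ · φ · (2 − 2φ) dφ
denominator = np.sum(phi * (2 - 2 * phi)) * d           # ∫ φ' (2 − 2φ') dφ' ≈ 1/3
print(numerator / denominator)                          # ≈ 0.5
```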
(ii)
\[\begin{split}\argmin_Y Q
&= \argmin_Y \int \left\Vert Y - \theta \right\Vert^2_2 p(\theta) d\theta\\
&= \argmin_Y \int (Y - \theta)^\top (Y - \theta) p(\theta) d\theta
& \quad & \theta \in \mathbb{R}^n\\
0 &= \int \frac{\partial}{\partial Y} (Y - \theta)^\top (Y - \theta)
p(\theta) d\theta
& \quad & \text{differentiation under the integral sign}\\
&= \int 2 (Y - \theta) p(\theta) d\theta
& \quad & \text{Matrix Cookbook (85)}\\
&= 2 \left( Y \int p(\theta) d\theta - \int \theta p(\theta) d\theta \right)\\
Y &= \left\langle \theta \right\rangle_{p(\theta)}
& \quad & \int p(\theta) d\theta = 1.\end{split}\]
To see that this is the minimum, notice that
\[\begin{split}\frac{\partial^2 Q}{\partial Y^2}
&= \frac{\partial}{\partial Y} \int 2 (Y - \theta) p(\theta) d\theta\\
&= 2 \mathbf{I}
& \quad & \text{Leibniz integral rule}\\
&\succ \boldsymbol{0}.\end{split}\]
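A one-dimensional Monte Carlo sketch of this result (assuming NumPy; the Beta distribution is an arbitrary stand-in for \(p(\theta)\)): the grid minimizer of the expected squared loss coincides with the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.beta(2.0, 5.0, size=50_000)          # arbitrary stand-in for p(θ)

y_grid = np.linspace(0.0, 1.0, 1_001)
q = np.array([np.mean((y - theta) ** 2) for y in y_grid])   # Q(Y) = E[(Y − θ)²]

print(y_grid[np.argmin(q)], theta.mean())        # grid minimizer ≈ sample mean
```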
(iii)
Given
\[\begin{split}\argmin_Y Q
&= \argmin_Y \int \left\vert Y - \theta \right\vert p(\theta) d\theta
& \quad & \theta \in \mathbb{R}\\
&= \argmin_Y
\int_L^Y (Y - \theta) p(\theta) d\theta +
\int_Y^U (\theta - Y) p(\theta) d\theta,\end{split}\]
the minimum is reached when
\[\begin{split}\frac{\partial Q}{\partial Y}
&= \left(
(Y - Y) p(Y) \frac{\partial}{\partial Y} Y -
(Y - L) p(L) \frac{\partial}{\partial Y} L +
\int_L^Y \frac{\partial}{\partial Y} (Y - \theta) p(\theta) d\theta
\right) +\\
&\qquad
\left(
(U - Y) p(U) \frac{\partial}{\partial Y} U -
(Y - Y) p(Y) \frac{\partial}{\partial Y} Y +
\int_Y^U \frac{\partial}{\partial Y} (\theta - Y) p(\theta) d\theta
\right)
& \quad & \text{differentiation under the integral sign}\\
0 &= \int_L^Y p(\theta) d\theta - \int_Y^U p(\theta) d\theta.\end{split}\]
When \(L \rightarrow -\infty\) and \(U \rightarrow \infty\),
\[\begin{split}\int_{-\infty}^Y p(\theta) d\theta
&= \int_Y^{\infty} p(\theta) d\theta\\
P(\theta \leq Y)
&= P(\theta \geq Y)
& \quad & \text{by definition of the median, CDF, and CCDF.}\end{split}\]
To verify that this is the minimum, notice that
\[\begin{split}\frac{\partial^2 Q}{\partial Y^2}
&= \frac{\partial}{\partial Y}
\left(
\int_L^Y p(\theta) d\theta - \int_Y^U p(\theta) d\theta
\right)\\
&= \left(
p(Y) \frac{\partial}{\partial Y} Y -
p(L) \frac{\partial}{\partial Y} L +
\int_L^Y \frac{\partial}{\partial Y} p(\theta) d\theta
\right) -\\
&\qquad
\left(
p(U) \frac{\partial}{\partial Y} U -
p(Y) \frac{\partial}{\partial Y} Y +
\int_Y^U \frac{\partial}{\partial Y} p(\theta) d\theta
\right)
& \quad & \text{Leibniz integral rule}\\
&= 2 p(Y)\\
&> 0.\end{split}\]
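The same kind of sketch (assuming NumPy; the Beta stand-in for \(p(\theta)\) is arbitrary) confirms that the minimizer of the expected absolute loss is the sample median.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.beta(2.0, 5.0, size=50_000)          # arbitrary stand-in for p(θ)

y_grid = np.linspace(0.0, 1.0, 1_001)
q = np.array([np.mean(np.abs(y - theta)) for y in y_grid])  # Q(Y) = E|Y − θ|

print(y_grid[np.argmin(q)], np.median(theta))    # grid minimizer ≈ sample median
```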
Higher Dimensions
\[\begin{split}Q\left( t X + (1 - t) Y \right)
&= \int \left\Vert t X + (1 - t) Y - \theta \right\Vert_1 p(\theta) d\theta
& \quad & t \in [0, 1], \theta \in \mathbb{R}^n\\
&= \int \left(
\sum_i \left\vert t X_i + (1 - t) Y_i - \theta_i \right\vert
\right)
p(\theta) d\theta\\
&= \int \left(
\sum_i \left\vert
t X_i + (1 - t) Y_i - t \theta_i - (1 - t) \theta_i
\right\vert
\right)
p(\theta) d\theta\\
&\leq \int \left(
\sum_i t \left\vert X_i - \theta_i \right\vert +
(1 - t) \left\vert Y_i - \theta_i \right\vert
\right)
p(\theta) d\theta
& \quad & \text{triangle inequality}\\
&= t Q(X) + (1 - t) Q(Y).\end{split}\]
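The convexity bound can be spot-checked by Monte Carlo (a sketch assuming NumPy; the Gaussian cloud, the two test points, and the values of \(t\) are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=(100_000, 3))            # samples standing in for p(θ) on R³

def q(y):
    """Monte Carlo estimate of Q(Y) = E‖Y − θ‖₁."""
    return np.abs(y - theta).sum(axis=1).mean()

x_pt, y_pt = rng.normal(size=3), rng.normal(size=3)
for t in (0.25, 0.5, 0.75):
    lhs = q(t * x_pt + (1 - t) * y_pt)
    rhs = t * q(x_pt) + (1 - t) * q(y_pt)
    print(lhs <= rhs)                            # True for each t
```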
There are many definitions of a multivariate median. The vector of marginal
medians is a direct generalization of the median of a 1D distribution because
the minimization decouples across coordinates:
\[\begin{split}\min_Y Q
&= \min_Y \int \left\Vert Y - \theta \right\Vert_1 p(\theta) d\theta\\
&= \min_Y
\sum_i \int \left\vert Y_i - \theta_i \right\vert p(\theta) d\theta\\
&= \sum_i \min_{Y_i}
\int \left\vert Y_i - \theta_i \right\vert p(\theta) d\theta.\end{split}\]
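A sketch of the decoupled minimization (assuming NumPy; the correlated Gaussian cloud is an arbitrary stand-in for \(p(\theta)\)): minimizing each coordinate's expected absolute loss on a grid lands on the corresponding marginal median.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=(50_000, 2)) @ np.array([[1.0, 0.7],
                                                 [0.0, 1.0]])   # correlated samples

y_grid = np.linspace(-4.0, 4.0, 801)
for i in range(theta.shape[1]):
    # min over Y_i of E|Y_i − θ_i|, one coordinate at a time
    q_i = np.array([np.mean(np.abs(y - theta[:, i])) for y in y_grid])
    print(y_grid[np.argmin(q_i)], np.median(theta[:, i]))       # ≈ equal per coordinate
```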
(iv)
The median of a PDF defined on \(\mathbb{R}\) is invariant to monotonic
reparameterization of \(\mathbb{R}\). By inspection of the minimization, the
additional scaling factor \(\left\vert v'(\theta) \right\vert\) introduced by
the reparameterization is absorbed by the corresponding change of variables in
the integral, so it cancels out. However, this is not true in higher
dimensions. A specific example is the multivariate normal distribution: its
marginal medians and mode are equal to its mean, but a nonlinear change of
variables generally yields a different quantity.
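The one-dimensional claim is easy to check numerically (a sketch assuming NumPy; the Gaussian sample and the choice \(u = \exp\) are arbitrary): the median commutes with a monotonic \(u\), while the mean does not.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=0.5, size=1_000_000)

u = np.exp                                     # an arbitrary monotonically increasing map
print(np.median(u(x)), u(np.median(x)))        # agree: the median is invariant
print(np.mean(u(x)), u(np.mean(x)))            # differ: the mean is not
```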
[Han96] has an excellent geometric interpretation of the
determinant of a matrix. A matrix \(A \in \mathbb{R}^{n \times n}\) can be
viewed as a linear transformation from
\(\mathbb{R}^n\) to \(\mathbb{R}^n\) such that a unit \(n\)-cube in
\(\mathbb{R}^n\) is mapped onto the \(n\)-dimensional parallelepiped
determined by the columns of \(A\). The magnification factor
\[\frac{
\text{area (or volume) of image region}
}{
\text{area (or volume) of original region}
} =
\left\vert \det A \right\vert\]
is always the same regardless of the starting hypercube.
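This magnification factor can be spot-checked by Monte Carlo (a sketch assuming NumPy; the matrix \(A\) is an arbitrary example): sample a box containing the image of the unit cube, estimate the image volume from the fraction of points that land inside it, and compare with \(\left\vert \det A \right\vert\).

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 1.0, 0.0],
              [0.5, 1.5, 0.3],
              [0.0, 0.2, 1.0]])                 # arbitrary example matrix

# Image of the unit cube: {A x : x in [0, 1]^3}. Bound it by mapping the corners.
corners = np.array([[i, j, k] for i in (0, 1) for j in (0, 1) for k in (0, 1)], float)
lo = (corners @ A.T).min(axis=0)
hi = (corners @ A.T).max(axis=0)

# Sample the bounding box; p lies in the image iff inv(A) p has all coordinates in [0, 1].
p = rng.uniform(lo, hi, size=(2_000_000, 3))
x = p @ np.linalg.inv(A).T
inside = np.all((x >= 0.0) & (x <= 1.0), axis=1)

print(inside.mean() * np.prod(hi - lo), abs(np.linalg.det(A)))   # ≈ equal
```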
In computing an integral over a volume in any coordinate system, you break the
volume up into little pieces and sum the value of the integrand at a point in
each piece, times the volume of the piece. In changing variables from any given
set of variables of integration to any other, the Jacobian tells us how to
express the volume element (e.g. a parallelepiped) for the old variables in
terms of the volume element for the new set. The key idea is that locally any
differentiable map is linear (and takes rectangles to parallelograms); the
contributions are then pieced together over the entire region. The absolute
value of the Jacobian determinant gives the exchange rate between the two
volume elements [Mil][Sch54].
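A concrete instance of this exchange rate (a sketch assuming NumPy; the integrand \(f(x, y) = x^2 + y^2\) and the unit disk are arbitrary choices): the same integral is computed once in Cartesian coordinates and once in polar coordinates, where the volume element picks up the Jacobian factor \(r\).

```python
import numpy as np

# Cartesian: Riemann sum of f over the unit disk.
xs = np.linspace(-1.0, 1.0, 2_001)
dx = xs[1] - xs[0]
X, Y = np.meshgrid(xs, xs)
f = X**2 + Y**2
cartesian = np.sum(f[X**2 + Y**2 <= 1.0]) * dx * dx

# Polar: same integral with the Jacobian |det ∂(x, y)/∂(r, φ)| = r in the volume element.
rs = np.linspace(0.0, 1.0, 2_001)
phis = np.linspace(0.0, 2 * np.pi, 2_001)
dr, dphi = rs[1] - rs[0], phis[1] - phis[0]
R, _ = np.meshgrid(rs, phis)
polar = np.sum(R**2 * R) * dr * dphi            # f(r cos φ, r sin φ) = r²

print(cartesian, polar, np.pi / 2)              # both ≈ π/2
```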
References
- Han96
John Hannah. A geometric approach to determinants. The American Mathematical Monthly, 103(5):401–409, 1996.
- Mil
Steven Miller. The change of variable theorem. https://web.williams.edu/Mathematics/sjmiller/public_html/150/handouts/ChangeVariableThm.pdf. Accessed on 2017-08-09.
- Sch54
J. Schwartz. The formula for change in variables in a multiple integral. The American Mathematical Monthly, 61(2):81–85, 1954.