Parameter Estimation

(i)

A change of coordinates in the parameter space is also known as reparameterization, which is usually realized through the change-of-variable technique.

Let \(X\) be a continuous random variable with generic probability density function \(f(x)\) defined over the support \(c_1 \leq x \leq c_2\), and let \(Y = u(X)\) be an invertible (here, strictly increasing) function of \(X\) with inverse function \(X = v(Y)\). The probability density function of \(Y\) is

\[f_Y(y) = f_X(v(y)) \left\vert v'(y) \right\vert\]

defined over the support \(u(c_1) \leq y \leq u(c_2)\).
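
The formula can be spot-checked numerically. The sketch below (not part of the original derivation) assumes \(X \sim \mathrm{Exponential}(1)\) and \(Y = \sqrt{X}\), and compares a histogram of transformed samples against the predicted density \(f_Y(y) = 2 y e^{-y^2}\).

```python
# A minimal numerical check of the change-of-variable formula, assuming
# X ~ Exponential(1) and the strictly increasing map Y = u(X) = sqrt(X),
# so that X = v(Y) = Y^2 and f_Y(y) = f_X(y^2) |2y| = 2 y exp(-y^2).
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)
y = np.sqrt(x)

# Histogram estimate of f_Y versus the closed form at the bin centers.
edges = np.linspace(0.0, 3.0, 61)
hist, _ = np.histogram(y, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
predicted = 2.0 * centers * np.exp(-centers**2)

print(np.max(np.abs(hist - predicted)))  # small; only Monte Carlo noise remains
```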

Consider the Bernoulli distribution with a uniform prior such that

\[\begin{split}p(x_i \mid \mu) &= \mu^{x_i} (1 - \mu)^{1 - x_i} & \quad & x_i \in \{0, 1\},\\\\ p_X(\mu) &= 1 & \quad & c_1 = 0, c_2 = 1, \mu \in [0, 1].\end{split}\]

The maximum a posteriori (MAP) estimate is

\[\begin{split}\DeclareMathOperator*{\argmax}{arg\,max} \hat{\mu}_{\mathrm{MAP}} &= \argmax_\mu p(\mu \mid X)\\ &= \argmax_\mu \log p_X(\mu) + \sum_{x_i \in X} \log p(x_i \mid \mu)\\ &= \argmax_\mu \sum_{i = 1}^n x_i \log \mu + (1 - x_i) \log(1 - \mu)\\ &= \frac{1}{n} \sum_i x_i\\ &= \hat{\mu}_{\mathrm{ML}}.\end{split}\]
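
As a sanity check, the following sketch (with made-up data, not from the original text) maximizes the log-posterior over a grid of \(\mu\) values and confirms that the MAP estimate under the uniform prior coincides with the sample mean.

```python
# Grid-search MAP under the uniform prior; the data below are hypothetical.
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 0, 1])            # made-up Bernoulli draws
mu_grid = np.linspace(1e-6, 1 - 1e-6, 100_001)

log_prior = np.zeros_like(mu_grid)                # log p_X(mu) = log 1 = 0
log_lik = (x.sum() * np.log(mu_grid)
           + (len(x) - x.sum()) * np.log(1 - mu_grid))
mu_map = mu_grid[np.argmax(log_prior + log_lik)]

print(mu_map, x.mean())                           # both ~0.625
```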

The following derivations demonstrate that the mode (MAP estimate) and mean of the posterior distribution are not invariant to reparameterization.

Parameterization I

Now define \(\theta = u(\mu) = \sqrt{\mu}\) and \(\mu = v(\theta) = \theta^2\). The support happens to be the same for this parameterization. The new prior is

\[\begin{split}p_Y(\theta) &= p_X(v(\theta)) \left\vert v'(\theta) \right\vert\\ &= p_X(\theta^2) \left\vert 2 \theta \right\vert & \quad & v'(\theta) = \frac{\partial}{\partial \theta} v(\theta)\\ &= 2 \theta & \quad & \theta \in [0, 1].\end{split}\]
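
A quick Monte Carlo check of the induced prior, assuming \(\mu \sim \mathrm{Uniform}(0, 1)\): the histogram of \(\theta = \sqrt{\mu}\) should match \(2\theta\).

```python
# Sampling check of p_Y(theta) = 2 * theta on [0, 1].
import numpy as np

rng = np.random.default_rng(0)
theta = np.sqrt(rng.uniform(size=1_000_000))

edges = np.linspace(0.0, 1.0, 51)
hist, _ = np.histogram(theta, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - 2.0 * centers)))       # ~1e-2 (Monte Carlo noise)
```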

The reparameterized MAP estimate is

\[\begin{split}\hat{\theta}_{\mathrm{MAP}} &= \argmax_\theta p(\theta \mid X)\\ &= \argmax_\theta \frac{p(X \mid \theta) p_Y(\theta)}{p(X)}\\ &= \argmax_\theta \log p_Y(\theta) + \sum_{x_i \in X} \log p(x_i \mid \theta) & \quad & p(X) = \mathrm{const}\\ &= \argmax_\theta \log 2 \theta + \sum_{i = 1}^n x_i \log \theta + (1 - x_i) \log(1 - \theta)\\ &= \frac{1 + \sum_i x_i}{n + 1}\\ &\neq u\left( \hat{\mu}_{\mathrm{MAP}} \right).\end{split}\]
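
The closed form can be checked against a grid search. The sketch below reuses the hypothetical data from before; note that the result \((1 + \sum_i x_i)/(n + 1)\) differs from \(u\left(\hat{\mu}_{\mathrm{MAP}}\right) = \sqrt{\bar{x}}\).

```python
# Grid-search MAP under the induced prior p_Y(theta) = 2 * theta.
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 0, 1])            # made-up Bernoulli draws
theta_grid = np.linspace(1e-6, 1 - 1e-6, 100_001)

log_post = (np.log(2 * theta_grid)
            + x.sum() * np.log(theta_grid)
            + (len(x) - x.sum()) * np.log(1 - theta_grid))

print(theta_grid[np.argmax(log_post)])            # ~(1 + 5) / (8 + 1) = 0.667
print((1 + x.sum()) / (len(x) + 1))               # closed form above
print(np.sqrt(x.mean()))                          # u(mu_MAP) ~ 0.791, different
```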

Given a single observation \(x = 1\), the mean of the posterior distribution is

\[\begin{split}\DeclareMathOperator*{\argmin}{arg\,min} \hat{\theta}(x = 1) &= \left\langle \theta \right\rangle_{p(\theta \mid x = 1)} & \quad & \text{(ii)}\\ &= \int \theta p(\theta \mid x = 1) d\theta\\ &= \int \theta \frac{ p(x = 1 \mid \theta) p_Y(\theta) }{ \int p(x = 1 \mid \theta') p_Y(\theta') d\theta' } d\theta\\ &= \int \frac{ \theta (\theta) (2 \theta) }{ \int \theta' (2 \theta') d\theta' } d\theta\\ &= \frac{3}{2} \int 2 \theta^3 d\theta & \quad & \int \theta' (2 \theta') d\theta' = \frac{2}{3}\\ &= \frac{3}{4} & \quad & \int 2 \theta^3 d\theta = \frac{1}{2}.\end{split}\]
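
The integrals can be verified with numerical quadrature, following the same convention as in the derivation (\(p(x = 1 \mid \theta) = \theta\) with prior \(2\theta\)).

```python
# Quadrature check of the posterior mean under Parameterization I.
from scipy.integrate import quad

numerator, _ = quad(lambda t: t * (t * 2 * t), 0.0, 1.0)   # int 2 t^3 dt = 1/2
denominator, _ = quad(lambda t: t * 2 * t, 0.0, 1.0)       # int 2 t^2 dt = 2/3
print(numerator / denominator)                             # 0.75
```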

Parameterization II

Now define \(\phi = u(\mu) = 1 - \sqrt{1 - \mu}\) and \(\mu = v(\phi) = 1 - (1 - \phi)^2\). The support happens to be the same for this parameterization. The new prior is

\[\begin{split}p_Y(\phi) &= p_X(v(\phi)) \left\vert v'(\phi) \right\vert\\ &= p_X\left( 1 - (1 - \phi)^2 \right) \left\vert 2 (1 - \phi) \right\vert & \quad & v'(\phi) = \frac{\partial}{\partial \phi} v(\phi)\\ &= 2 - 2 \phi & \quad & \phi \in [0, 1].\end{split}\]

The reparameterized MAP estimate is

\[\begin{split}\hat{\phi}_{\mathrm{MAP}} &= \argmax_\phi p(\phi \mid X)\\ &= \argmax_\phi \frac{p(X \mid \phi) p_Y(\phi)}{p(X)}\\ &= \argmax_\phi \log p_Y(\phi) + \sum_{x_i \in X} \log p(x_i \mid \phi) & \quad & p(X) = \mathrm{const}\\ &= \argmax_\phi \log 2 (1 - \phi) + \sum_{i = 1}^n x_i \log \phi + (1 - x_i) \log(1 - \phi)\\ &= \frac{1}{n + 1} \sum_i x_i\\ &\neq u\left( \hat{\mu}_{\mathrm{MAP}} \right).\end{split}\]

Given a single observation \(x = 1\), the mean of the posterior distribution is

\[\begin{split}\hat{\phi}(x = 1) &= \left\langle \phi \right\rangle_{p(\phi \mid x = 1)} & \quad & \text{(ii)}\\ &= \int \phi p(\phi \mid x = 1) d\phi\\ &= \int \phi \frac{ p(x = 1 \mid \phi) p_Y(\phi) }{ \int p(x = 1 \mid \phi') p_Y(\phi') d\phi' } d\phi\\ &= \int \frac{ \phi (\phi) (2 - 2 \phi) }{ \int \phi' (2 - 2 \phi') d\phi' } d\phi\\ &= 3 \int 2 \phi^2 - 2 \phi^3 d\phi & \quad & \int \phi' (2 - 2 \phi') d\phi' = \frac{1}{3}\\ &= \frac{1}{2} & \quad & \int 2 \phi^2 - 2 \phi^3 d\phi = \frac{1}{6}.\end{split}\]
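
The same kind of numerical check works for Parameterization II, again with the made-up data used earlier and the text's convention \(p(x = 1 \mid \phi) = \phi\): a grid search recovers \(\sum_i x_i / (n + 1)\) for the MAP estimate, and quadrature recovers the posterior mean of \(1/2\).

```python
# Grid-search MAP and quadrature posterior mean under Parameterization II.
import numpy as np
from scipy.integrate import quad

x = np.array([1, 0, 1, 1, 0, 1, 0, 1])            # made-up Bernoulli draws
phi_grid = np.linspace(1e-6, 1 - 1e-6, 100_001)
log_post = (np.log(2 * (1 - phi_grid))
            + x.sum() * np.log(phi_grid)
            + (len(x) - x.sum()) * np.log(1 - phi_grid))
print(phi_grid[np.argmax(log_post)], x.sum() / (len(x) + 1))    # both ~0.556

numerator, _ = quad(lambda p: p * (p * (2 - 2 * p)), 0.0, 1.0)  # = 1/6
denominator, _ = quad(lambda p: p * (2 - 2 * p), 0.0, 1.0)      # = 1/3
print(numerator / denominator)                                  # 0.5
```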

(ii)

The mean of a distribution \(p(\theta)\) minimizes the expected squared loss \(Q\):

\[\begin{split}\argmin_Y Q &= \argmin_Y \int \left\Vert Y - \theta \right\Vert^2_2 p(\theta) d\theta\\ &= \argmin_Y \int (Y - \theta)^\top (Y - \theta) p(\theta) d\theta & \quad & \theta \in \mathbb{R}^n\\ 0 &= \int \frac{\partial}{\partial Y} (Y - \theta)^\top (Y - \theta) p(\theta) d\theta & \quad & \text{differentiation under the integral sign}\\ &= \int 2 (Y - \theta) p(\theta) d\theta & \quad & \text{Matrix Cookbook (85)}\\ &= Y \int p(\theta) d\theta - \int \theta p(\theta) d\theta\\ Y &= \left\langle \theta \right\rangle_{p(\theta)} & \quad & \int p(\theta) d\theta = 1.\end{split}\]
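
A small numerical illustration, using a made-up discrete distribution over \(\theta\): minimizing the expected squared loss over \(Y\) recovers the mean.

```python
# Minimize the expected squared loss over Y; the minimizer is the mean.
import numpy as np
from scipy.optimize import minimize_scalar

theta = np.array([0.1, 0.4, 0.7, 0.9])            # hypothetical support points
p = np.array([0.2, 0.3, 0.4, 0.1])                # hypothetical probabilities

q = lambda y: np.sum((y - theta) ** 2 * p)        # expected squared loss
y_star = minimize_scalar(q, bounds=(0.0, 1.0), method="bounded").x
print(y_star, np.sum(theta * p))                  # both ~0.51
```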

To see that this is the minimum, notice that

\[\begin{split}\frac{\partial^2 Q}{\partial Y^2} &= \frac{\partial}{\partial Y} \int 2 (Y - \theta) p(\theta) d\theta\\ &= 2 \mathbf{I} & \quad & \text{Leibniz integral rule}\\ &\succ \boldsymbol{0}.\end{split}\]

(iii)

The median of \(p(\theta)\) minimizes the expected absolute loss. Given

\[\begin{split}\argmin_Y Q &= \argmin_Y \int \left\vert Y - \theta \right\vert p(\theta) d\theta & \quad & \theta \in \mathbb{R}\\ &= \argmin_Y \int_L^Y (Y - \theta) p(\theta) d\theta + \int_Y^U (\theta - Y) p(\theta) d\theta,\end{split}\]

the minimum is reached when

\[\begin{split}\frac{\partial Q}{\partial Y} &= \left( (Y - Y) p(Y) \frac{\partial}{\partial Y} Y - (Y - L) p(L) \frac{\partial}{\partial Y} L + \int_L^Y \frac{\partial}{\partial Y} (Y - \theta) p(\theta) d\theta \right) +\\ &\qquad \left( (U - Y) p(U) \frac{\partial}{\partial Y} U - (Y - Y) p(Y) \frac{\partial}{\partial Y} Y + \int_Y^U \frac{\partial}{\partial Y} (\theta - Y) p(\theta) d\theta \right) & \quad & \text{differentiation under the integral sign}\\ 0 &= \int_L^Y p(\theta) d\theta - \int_Y^U p(\theta) d\theta.\end{split}\]

When \(L \rightarrow -\infty\) and \(U \rightarrow \infty\),

\[\begin{split}\int_{-\infty}^Y p(\theta) d\theta &= \int_Y^{\infty} p(\theta) d\theta\\ P(\theta \leq Y) &= P(\theta \geq Y) & \quad & \text{so } Y \text{ is the median, by definition of the median, CDF, and CCDF.}\end{split}\]
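
The result can be illustrated numerically for a skewed density, here an \(\mathrm{Exponential}(1)\), whose median \(\log 2\) differs from its mean \(1\); the objective is split at \(Y\) exactly as in the derivation above.

```python
# Minimize the expected absolute loss over Y; the minimizer is the median.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

pdf = lambda t: np.exp(-t)                        # Exponential(1) density on [0, inf)
q = lambda y: (quad(lambda t: (y - t) * pdf(t), 0.0, y)[0]
               + quad(lambda t: (t - y) * pdf(t), y, 50.0)[0])  # truncated tail

y_star = minimize_scalar(q, bounds=(0.0, 10.0), method="bounded").x
print(y_star, np.log(2.0))                        # both ~0.693
```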

To verify that this is the minimum, notice that

\[\begin{split}\frac{\partial^2 Q}{\partial Y^2} &= \frac{\partial}{\partial Y} \left( \int_L^Y p(\theta) d\theta - \int_Y^U p(\theta) d\theta \right)\\ &= \left( p(Y) \frac{\partial}{\partial Y} Y - p(L) \frac{\partial}{\partial Y} L + \int_L^Y \frac{\partial}{\partial Y} p(\theta) d\theta \right) -\\ &\qquad \left( p(U) \frac{\partial}{\partial Y} U - p(Y) \frac{\partial}{\partial Y} Y + \int_Y^U \frac{\partial}{\partial Y} p(\theta) d\theta \right) & \quad & \text{Leibniz integral rule}\\ &= 2 p(Y)\\ &> 0 & \quad & \text{assuming } p(Y) > 0.\end{split}\]

Higher Dimensions

In higher dimensions, the expected \(L_1\) loss is still convex in \(Y\) because

\[\begin{split}Q\left( t X + (1 - t) Y \right) &= \int \left\Vert t X + (1 - t) Y - \theta \right\Vert_1 p(\theta) d\theta & \quad & t \in [0, 1], \theta \in \mathbb{R}^n\\ &= \int \left( \sum_i \left\vert t X_i + (1 - t) Y_i - \theta_i \right\vert \right) p(\theta) d\theta\\ &= \int \left( \sum_i \left\vert t X_i + (1 - t) Y_i - t \theta_i - (1 - t) \theta_i \right\vert \right) p(\theta) d\theta\\ &\leq \int \left( \sum_i t \left\vert X_i - \theta_i \right\vert + (1 - t) \left\vert Y_i - \theta_i \right\vert \right) p(\theta) d\theta & \quad & \text{triangle inequality}\\ &= t Q(X) + (1 - t) Q(Y).\end{split}\]
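
A quick numerical spot check of this convexity bound, with an arbitrary sample standing in for \(p(\theta)\) and random \(X\), \(Y\), \(t\):

```python
# Empirical check of Q(t X + (1 - t) Y) <= t Q(X) + (1 - t) Q(Y).
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=(1_000, 3))               # hypothetical samples of theta
Q = lambda y: np.mean(np.abs(y - theta).sum(axis=1))

X, Y, t = rng.normal(size=3), rng.normal(size=3), rng.uniform()
print(Q(t * X + (1 - t) * Y) <= t * Q(X) + (1 - t) * Q(Y))      # True
```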

There are many definitions for a multivariate median. The vector of marginal medians is a direct generalization of the median of a 1D distribution because

\[\begin{split}\min_Y Q &= \min_Y \int \left\Vert Y - \theta \right\Vert_1 p(\theta) d\theta\\ &= \min_Y \sum_i \int \left\vert Y_i - \theta_i \right\vert p(\theta) d\theta\\ &= \sum_i \min_{Y_i} \int \left\vert Y_i - \theta_i \right\vert p(\theta) d\theta.\end{split}\]
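
A Monte Carlo sketch of this decomposition: for samples from a skewed two-dimensional distribution (a made-up gamma example), minimizing the average \(L_1\) loss lands approximately on the vector of per-coordinate medians.

```python
# The minimizer of the empirical L1 loss is close to the marginal medians.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
theta = rng.gamma(shape=2.0, scale=1.0, size=(20_000, 2))   # skewed 2-D samples

q = lambda y: np.mean(np.abs(theta - y).sum(axis=1))        # empirical L1 loss
y_star = minimize(q, x0=theta.mean(axis=0), method="Nelder-Mead").x

print(y_star)                         # close to ...
print(np.median(theta, axis=0))       # ... the marginal medians (~1.68, 1.68)
```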

(iv)

The median of a PDF defined on \(\mathbb{R}\) is invariant to a monotonic reparameterization of \(\mathbb{R}\): by inspection of the minimization, the additional scaling factor \(\left\vert v'(\theta) \right\vert\) introduced by the change of variables cancels out. However, this does not hold in higher dimensions. A specific example is the multivariate normal distribution: its median and mode are equal to its mean, but a change of variables generally gives a different quantity.
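
A one-line Monte Carlo check of the one-dimensional claim: for a monotonic map such as \(u(\theta) = \sqrt{\theta}\), the median of \(u(\theta)\) equals \(u\) applied to the median of \(\theta\).

```python
# The median commutes with a monotonic transform in one dimension.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.exponential(size=1_000_000)
print(np.median(np.sqrt(theta)), np.sqrt(np.median(theta)))  # both ~sqrt(log 2)
```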

[Han96] gives an excellent geometric interpretation of the determinant of a matrix. A matrix \(A \in \mathbb{R}^{n \times n}\) can be viewed as a linear transformation from \(\mathbb{R}^n\) to \(\mathbb{R}^n\) that maps the unit \(n\)-cube in \(\mathbb{R}^n\) onto the \(n\)-dimensional parallelepiped determined by the columns of \(A\). The magnification factor

\[\frac{ \text{area (or volume) of image region} }{ \text{area (or volume) of original region} } = \left\vert \det A \right\vert\]

is always the same regardless of the original region.
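
The magnification factor can be estimated directly. The sketch below maps the unit square through a made-up matrix \(A\), estimates the area of the image parallelogram by rejection sampling in a bounding box, and compares it with \(\left\vert \det A \right\vert\).

```python
# Monte Carlo estimate of the area magnification of a linear map.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 1.0],
              [0.5, 1.5]])                        # hypothetical linear map

# Bounding box of the image of the unit square.
corners = np.array([[0, 0], [1, 0], [0, 1], [1, 1]]) @ A.T
lo, hi = corners.min(axis=0), corners.max(axis=0)

# Fraction of the box whose preimage lies in the unit square, times box area.
pts = rng.uniform(lo, hi, size=(500_000, 2))
pre = pts @ np.linalg.inv(A).T
inside = np.all((pre >= 0) & (pre <= 1), axis=1)
print(inside.mean() * np.prod(hi - lo))           # ~2.5
print(abs(np.linalg.det(A)))                      # 2.5, the magnification factor
```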

In computing an integral over a volume in any coordinate system, you break the volume up into little pieces and sum the value of the integrand at a point in each piece, times the volume of the piece. In changing variables from one set of variables of integration to another, the Jacobian tells us how to express the volume element (e.g., a parallelepiped) for the old variables in terms of the volume element for the new set. The key idea is that locally any differentiable map is linear (and takes rectangles to parallelograms), and the contributions are then pieced together over the entire region. The absolute value of the Jacobian determinant gives the exchange rate between the two volume elements [Mil][Sch54].
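
The same bookkeeping appears in an ordinary change of variables in a double integral. The sketch below integrates \(e^{-(x^2 + y^2)}\) over the unit disk in Cartesian and in polar coordinates, where the Jacobian determinant contributes the factor \(r\) to the volume element.

```python
# Change of variables in a double integral: dx dy = r dr dtheta.
import numpy as np
from scipy.integrate import dblquad

# Cartesian: outer variable x in [-1, 1], inner variable y over the chord.
direct, _ = dblquad(lambda y, x: np.exp(-(x**2 + y**2)),
                    -1.0, 1.0,
                    lambda x: -np.sqrt(1 - x**2),
                    lambda x: np.sqrt(1 - x**2))

# Polar: outer variable theta in [0, 2*pi], inner variable r in [0, 1].
polar, _ = dblquad(lambda r, t: np.exp(-r**2) * r,
                   0.0, 2.0 * np.pi,
                   lambda t: 0.0,
                   lambda t: 1.0)

print(direct, polar)                              # both ~pi * (1 - exp(-1)) ~ 1.986
```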

References

Han96

John Hannah. A geometric approach to determinants. The American Mathematical Monthly, 103(5):401–409, 1996.

Mil

Steven Miller. The change of variable theorem. https://web.williams.edu/Mathematics/sjmiller/public_html/150/handouts/ChangeVariableThm.pdf. Accessed on 2017-08-09.

Sch54

J. Schwartz. The formula for change in variables in a multiple integral. The American Mathematical Monthly, 61(2):81–85, 1954.