Notes
General Exponential Families
Suppose that \(X\) is a random variable taking values in
\(\mathcal{S}\), and that the distribution of \(X\) depends on an
unspecified parameter \(\theta\) taking values in a parameter space
\(\Theta\). The distribution of \(X\) is a \(k\)-parameter
exponential family if it can be expressed as
\[f_\theta(x) =
\alpha(\theta) g(x) \exp\left(
\sum_{i = 1}^k \beta_i(\theta) h_i(x)
\right);
\qquad
x \in \mathcal{S}, \theta \in \Theta, \alpha(\theta) \geq 0, g(x) \geq 0\]
where \(\alpha\) and \(\left( \beta_1, \ldots, \beta_k \right)\) are
real-valued functions on \(\Theta\), and where \(g\) and
\(\left( h_1, \ldots, h_k \right)\) are real-valued functions on
\(\mathcal{S}\). Furthermore, \(k\) is assumed to be the smallest
possible integer. The parameters \(\{ \beta_i(\theta) \}\) are called the
natural parameters of the distribution, and the random variables
\(\{ h_i(X) \}\) are called the natural statistics of the distribution.
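As a concrete instance of this definition, the Bernoulli distribution is a one-parameter (\(k = 1\)) exponential family with \(\alpha(\theta) = 1 - \lambda\), \(g(x) = 1\), natural parameter \(\beta(\theta) = \log\frac{\lambda}{1 - \lambda}\), and natural statistic \(h(x) = x\). The short check below (plain Python, arbitrary test values) confirms the factored form reproduces the pmf.

```python
import math

# Bernoulli(lam) pmf written in the k = 1 exponential-family form
# alpha(theta) * g(x) * exp(beta(theta) * h(x)), with alpha = 1 - lam,
# g(x) = 1, natural parameter beta = log(lam / (1 - lam)), and h(x) = x.
def bern_pmf(x, lam):
    return lam ** x * (1 - lam) ** (1 - x)

def bern_exp_family(x, lam):
    alpha = 1 - lam                        # alpha(theta)
    beta = math.log(lam / (1 - lam))       # natural parameter beta(theta)
    return alpha * 1 * math.exp(beta * x)  # g(x) = 1, h(x) = x

for lam in (0.2, 0.5, 0.9):
    for x in (0, 1):
        assert abs(bern_pmf(x, lam) - bern_exp_family(x, lam)) < 1e-12
```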
Exponential Family Harmonium
Let \(\mathbf{x} = \left\{ x_i \right\}_{i = 1}^{M_v}\) be the set of
observed random variables and
\(\mathbf{h} = \left\{ h_j \right\}_{j = 1}^{M_h}\) be the set of hidden
(latent) variables. Both \(\mathbf{x}\) and \(\mathbf{h}\) may be either
continuous or discrete. The distribution of
each random variable is chosen independently from the exponential family as
\[\begin{split}p_i(x_i; \boldsymbol{\theta}_i)
&= \frac{1}{A'_i\left( \boldsymbol{\theta}_i \right)}
r_i(x_i) \exp\left[ \sum_{a'} \theta_{ia'} f_{ia'}(x_i) \right]\\
&= \frac{1}{A'_i\left( \boldsymbol{\theta}_i \right)}
\exp\left[ \sum_a \theta_{ia} f_{ia}(x_i) \right]
& \quad & \theta_{ia} = 1 \text{ and } f_{ia}(x_i) = \log r_i(x_i)
\text{ when } a = 0\\
&= \exp\left[
\sum_a \theta_{ia} f_{ia}(x_i) -
A_i\left( \boldsymbol{\theta}_i \right)
\right]
& \quad & A_i\left( \boldsymbol{\theta}_i \right) =
\log \sum_{x'_i} \exp \sum_a \theta_{ia} f_{ia}(x'_i)\end{split}\]
and
\[\begin{split}p_j(h_j; \boldsymbol{\lambda}_j)
&= \frac{1}{B'_j\left( \boldsymbol{\lambda}_j \right)}
s_j(h_j) \exp\left[ \sum_{b'} \lambda_{jb'} g_{jb'}(h_j) \right]\\
&= \frac{1}{B'_j\left( \boldsymbol{\lambda}_j \right)}
\exp\left[ \sum_b \lambda_{jb} g_{jb}(h_j) \right]
& \quad & \lambda_{jb} = 1 \text{ and } g_{jb}(h_j) = \log s_j(h_j)
\text{ when } b = 0\\
&= \exp\left[
\sum_b \lambda_{jb} g_{jb}(h_j) -
B_j\left( \boldsymbol{\lambda}_j \right)
\right]
& \quad & B_j\left( \boldsymbol{\lambda}_j \right) =
\log \sum_{h'_j} \exp \sum_b \lambda_{jb} g_{jb}(h'_j)\end{split}\]
where \(\left\{ f_{ia}(x_i), g_{jb}(h_j) \right\}\) are sufficient
statistics for the models,
\(\left\{ \boldsymbol{\theta}_i, \boldsymbol{\lambda}_j \right\}\) the
canonical parameters of the models, and
\(\left\{ A_i, B_j \right\}\) the log-partition functions.
To model pairwise interactions between the random variables \(x_i\) and
\(h_j\), the authors of [WRZH05] introduce a quadratic interaction term into
the joint distribution
\[\begin{split}p(\mathbf{x}, \mathbf{h})
&= \frac{
\exp\left[
\sum_{i, a} \theta_{ia} f_{ia}(x_i) +
\sum_{j, b} \lambda_{jb} g_{jb}(h_j) +
\sum_{i, j, a, b} W_{ia}^{jb} f_{ia}(x_i) g_{jb}(h_j)
\right]
}{
\sum_{\mathbf{x}', \mathbf{h}'} \exp\left[
\sum_{i, a} \theta_{ia} f_{ia}(x'_i) +
\sum_{j, b} \lambda_{jb} g_{jb}(h'_j) +
\sum_{i, j, a, b} W_{ia}^{jb} f_{ia}(x'_i) g_{jb}(h'_j)
\right]
}\\
&= Z^{-1} \exp E(\mathbf{x}, \mathbf{h}; \phi)
& \quad & \phi =
\left\{
\boldsymbol{\theta}_i, \boldsymbol{\lambda}_j, \mathbf{W}_i^j
\right\}.\end{split}\]
As a consequence of this construction,
\[\begin{split}p(\mathbf{x} \mid \mathbf{h})
&= \frac{
p(\mathbf{x}, \mathbf{h})
}{
\sum_{\mathbf{x}'} p(\mathbf{x}', \mathbf{h})
}\\
&= \frac{
\exp\left[
\sum_{i, a} \theta_{ia} f_{ia}(x_i) +
\sum_{i, j, a, b} W_{ia}^{jb} f_{ia}(x_i) g_{jb}(h_j)
\right]
}{
\sum_{\mathbf{x}'} \exp\left[
\sum_{i, a} \theta_{ia} f_{ia}(x'_i) +
\sum_{i, j, a, b} W_{ia}^{jb} f_{ia}(x'_i) g_{jb}(h_j)
\right]
}
& \quad & \text{cancel out } \lambda_{jb}\\
&= \frac{
\prod_i \exp \sum_a \hat{\theta}_{ia} f_{ia}(x_i)
}{
\left(
\sum_{x'_1} \exp \sum_a \hat{\theta}_{1a} f_{1a}(x'_1)
\right)
\cdots
\left(
\sum_{x'_{M_v}}
\exp \sum_a \hat{\theta}_{{M_v}a} f_{{M_v}a}(x'_{M_v})
\right)
}
& \quad & \hat{\theta}_{ia} =
\theta_{ia} + \sum_{j, b} W_{ia}^{jb} g_{jb}(h_j)\\
&= \prod_{i = 1}^{M_v}
\exp\left[
\sum_a \hat{\theta}_{ia} f_{ia}(x_i) -
A_i\left( \hat{\boldsymbol{\theta}}_i \right)
\right]\\
&= \prod_i p(x_i \mid \mathbf{h})\\
&= \prod_i p_i(x_i; \hat{\boldsymbol{\theta}}_i)\end{split}\]
and
\[\begin{split}p(\mathbf{h} \mid \mathbf{x})
&= \frac{
p(\mathbf{h}, \mathbf{x})
}{
\sum_{\mathbf{h}} p(\mathbf{h}, \mathbf{x})
}\\
&= \frac{
\exp\left[
\sum_{j, b} \lambda_{jb} g_{jb}(h_j) +
\sum_{i, j, a, b} W_{ia}^{jb} f_{ia}(x_i) g_{jb}(h_j)
\right]
}{
\sum_{\mathbf{h}'} \exp\left[
\sum_{j, b} \lambda_{jb} g_{jb}(h'_j) +
\sum_{i, j, a, b} W_{ia}^{jb} f_{ia}(x_i) g_{jb}(h'_j)
\right]
}
& \quad & \text{cancel out } \theta_{ia}\\
&= \frac{
\prod_j \exp \sum_b \hat{\lambda}_{jb} g_{jb}(h_j)
}{
\left(
\sum_{h'_1} \exp \sum_b \hat{\lambda}_{1b} g_{1b}(h'_1)
\right)
\cdots
\left(
\sum_{h'_{M_h}}
\exp \sum_b \hat{\lambda}_{{M_h}b} g_{{M_h}b}(h'_{M_h})
\right)
}
& \quad & \hat{\lambda}_{jb} =
\lambda_{jb} + \sum_{i, a} W_{ia}^{jb} f_{ia}(x_i)\\
&= \prod_{j = 1}^{M_h}
\exp\left[
\sum_b \hat{\lambda}_{jb} g_{jb}(h_j) -
B_j\left( \hat{\boldsymbol{\lambda}}_j \right)
\right]\\
&= \prod_j p(h_j \mid \mathbf{x})\\
&= \prod_j p_j(h_j; \hat{\boldsymbol{\lambda}}_j).\end{split}\]
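The factorized form of \(p(\mathbf{h} \mid \mathbf{x})\) can be verified by brute-force enumeration. The sketch below assumes a tiny Bernoulli-Bernoulli harmonium with one sufficient statistic per unit (\(f_i(x_i) = x_i\), \(g_j(h_j) = h_j\)) and arbitrary parameter values.

```python
import math
from itertools import product

# Numerical check of the factorized conditional p(h | x) for a tiny
# Bernoulli-Bernoulli harmonium. Sizes and parameter values are arbitrary.
Mv, Mh = 2, 3
theta = [0.3, -0.5]
lam = [0.1, 0.7, -0.2]
W = [[0.4, -0.6, 0.2], [0.5, 0.1, -0.3]]  # W[i][j]

def energy(x, h):
    s = sum(theta[i] * x[i] for i in range(Mv))
    s += sum(lam[j] * h[j] for j in range(Mh))
    s += sum(W[i][j] * x[i] * h[j] for i in range(Mv) for j in range(Mh))
    return s

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

x = (1, 0)
# Brute-force conditional from the joint distribution.
Zx = sum(math.exp(energy(x, h)) for h in product((0, 1), repeat=Mh))
for h in product((0, 1), repeat=Mh):
    p_brute = math.exp(energy(x, h)) / Zx
    # Factorized form: each h_j is Bernoulli with shifted parameter
    # lambda_hat_j = lambda_j + sum_i W_ij f_i(x_i).
    p_fact = 1.0
    for j in range(Mh):
        lam_hat = lam[j] + sum(W[i][j] * x[i] for i in range(Mv))
        pj = sigmoid(lam_hat)
        p_fact *= pj if h[j] == 1 else (1 - pj)
    assert abs(p_brute - p_fact) < 1e-12
```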
The marginal distributions of the observed and latent variables are
\[\begin{split}p(\mathbf{x})
&= \frac{
p(\mathbf{h}, \mathbf{x})
}{
p(\mathbf{h} \mid \mathbf{x})
}
& \quad & \forall \mathbf{h}\\
&= \frac{
Z^{-1} \exp E(\mathbf{x}, \mathbf{h}; \phi)
}{
\prod_{j = 1}^{M_h} \exp\left[
\sum_b \hat{\lambda}_{jb} g_{jb}(h_j) -
B_j\left( \hat{\boldsymbol{\lambda}}_j \right)
\right]
}\\
&= Z^{-1} \exp\left\{
\sum_{i, a} \theta_{ia} f_{ia}(x_i) +
\left[
\sum_{j, b} \lambda_{jb} g_{jb}(h_j) +
\sum_{i, j, a, b} W_{ia}^{jb} f_{ia}(x_i) g_{jb}(h_j)
\right] +
\sum_j
B_j\left( \hat{\boldsymbol{\lambda}}_j \right) -
\sum_{j, b} \hat{\lambda}_{jb} g_{jb}(h_j)
\right\}\\
&= Z^{-1} \exp\left[
\sum_{i, a} \theta_{ia} f_{ia}(x_i) +
\sum_j B_j\left( \hat{\boldsymbol{\lambda}}_j \right)
\right]
& \quad & \text{cancel out } \hat{\lambda}_{jb}\end{split}\]
and
\[\begin{split}p(\mathbf{h})
&= \frac{
p(\mathbf{x}, \mathbf{h})
}{
p(\mathbf{x} \mid \mathbf{h})
}
& \quad & \forall \mathbf{x}\\
&= \frac{
Z^{-1} \exp E(\mathbf{x}, \mathbf{h}; \phi)
}{
\prod_{i = 1}^{M_v}
\exp\left[
\sum_a \hat{\theta}_{ia} f_{ia}(x_i) -
A_i\left( \hat{\boldsymbol{\theta}}_i \right)
\right]
}\\
&= Z^{-1} \exp\left\{
\sum_{j, b} \lambda_{jb} g_{jb}(h_j) +
\left[
\sum_{i, a} \theta_{ia} f_{ia}(x_i) +
\sum_{i, j, a, b} W_{ia}^{jb} f_{ia}(x_i) g_{jb}(h_j)
\right] +
\sum_i
A_i\left( \hat{\boldsymbol{\theta}}_i \right) -
\sum_{i, a} \hat{\theta}_{ia} f_{ia}(x_i)
\right\}\\
&= Z^{-1} \exp\left[
\sum_{j, b} \lambda_{jb} g_{jb}(h_j) +
\sum_i
A_i\left( \hat{\boldsymbol{\theta}}_i \right)
\right]
& \quad & \text{cancel out } \hat{\theta}_{ia}.\end{split}\]
Note that
\[\begin{split}\begin{aligned}
Z &= \sum_{\mathbf{x}'}
\exp\left[
\sum_{i, a} \theta_{ia} f_{ia}(x'_i)
\right]
\sum_{\mathbf{h}'}
\exp\left[
\sum_{j, b} \lambda_{jb} g_{jb}(h'_j) +
\sum_{i, j, a, b} W_{ia}^{jb} f_{ia}(x'_i) g_{jb}(h'_j)
\right]\\
&= \sum_{\mathbf{x}'}
\exp\left[
\sum_{i, a} \theta_{ia} f_{ia}(x'_i)
\right]
\prod_j \sum_{h'_j} \exp \sum_b \hat{\lambda}_{jb} g_{jb}(h'_j)\\
&= \sum_{\mathbf{x}'}
\exp\left[
\sum_{i, a} \theta_{ia} f_{ia}(x'_i) +
\sum_j B_j\left( \hat{\boldsymbol{\lambda}}_j \right)
\right]
\end{aligned}
\quad \iff \quad
\begin{aligned}
Z &= \sum_{\mathbf{h}'}
\exp\left[
\sum_{j, b} \lambda_{jb} g_{jb}(h'_j)
\right]
\sum_{\mathbf{x}'}
\exp\left[
\sum_{i, a} \theta_{ia} f_{ia}(x'_i) +
\sum_{i, j, a, b} W_{ia}^{jb} f_{ia}(x'_i) g_{jb}(h'_j)
\right]\\
&= \sum_{\mathbf{h}'}
\exp\left[
\sum_{j, b} \lambda_{jb} g_{jb}(h'_j)
\right]
\prod_i \sum_{x'_i} \exp \sum_a \hat{\theta}_{ia} f_{ia}(x'_i)\\
&= \sum_{\mathbf{h}'}
\exp\left[
\sum_{j, b} \lambda_{jb} g_{jb}(h'_j) +
\sum_i A_i\left( \hat{\boldsymbol{\theta}}_i \right)
\right]
\end{aligned}\end{split}\]
and
\[\begin{split}\frac{\partial}{\partial \phi} \log Z
&= Z^{-1} \sum_{\mathbf{x}', \mathbf{h}'}
\exp\left( E' \right)
\frac{\partial E'}{\partial \phi}
&= \sum_{\mathbf{x}', \mathbf{h}'}
p(\mathbf{x}', \mathbf{h}') \frac{\partial E'}{\partial \phi}
&= \left\langle
\frac{\partial E'}{\partial \phi}
\right\rangle_{p(\mathbf{x}', \mathbf{h}')}
& \quad & E' = E(\mathbf{x}', \mathbf{h}'; \phi)\\
&= Z^{-1} \sum_{\mathbf{x}'}
\exp\left( E'_\mathbf{x} \right)
\frac{\partial E'_\mathbf{x}}{\partial \phi}
&= \sum_{\mathbf{x}'}
p(\mathbf{x}') \frac{\partial E'_\mathbf{x}}{\partial \phi}
&= \left\langle
\frac{\partial E'_\mathbf{x}}{\partial \phi}
\right\rangle_{p(\mathbf{x}')}
& \quad & E'_\mathbf{x} =
\sum_{i, a} \theta_{ia} f_{ia}(x'_i) +
\sum_j B_j\left( \hat{\boldsymbol{\lambda}}_j \right).\end{split}\]
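Both identities above can be spot-checked by exhaustive enumeration on a small model. The sketch below assumes Bernoulli units (so each \(A_i\) and \(B_j\) is a softplus) and arbitrary parameters; it verifies the two equivalent collapses of \(Z\) and the derivative identity \(\partial \log Z / \partial \theta_{i} = \langle f_{i}(x'_i) \rangle_{p(\mathbf{x}', \mathbf{h}')}\) by finite differences.

```python
import math
from itertools import product

# Exhaustive numerical check, on a tiny Bernoulli-Bernoulli model with
# arbitrary parameters, of (i) the two equivalent ways of collapsing the
# partition function and (ii) d log Z / d theta_i = <f_i(x'_i)>_{p(x',h')}.
Mv, Mh = 2, 2
theta = [0.2, -0.6]
lam = [0.3, 0.1]
W = [[0.4, -0.3], [0.2, 0.5]]  # W[i][j]

def softplus(u):  # log(1 + e^u): log-partition of one Bernoulli unit
    return math.log1p(math.exp(u))

def E(x, h, th):
    return (sum(th[i] * x[i] for i in range(Mv))
            + sum(lam[j] * h[j] for j in range(Mh))
            + sum(W[i][j] * x[i] * h[j]
                  for i in range(Mv) for j in range(Mh)))

def logZ(th):
    return math.log(sum(math.exp(E(x, h, th))
                        for x in product((0, 1), repeat=Mv)
                        for h in product((0, 1), repeat=Mh)))

# (i) Collapse the hidden sum through B_j, or the visible sum through A_i.
Z_x = sum(math.exp(sum(theta[i] * x[i] for i in range(Mv))
                   + sum(softplus(lam[j] + sum(W[i][j] * x[i]
                                               for i in range(Mv)))
                         for j in range(Mh)))
          for x in product((0, 1), repeat=Mv))
Z_h = sum(math.exp(sum(lam[j] * h[j] for j in range(Mh))
                   + sum(softplus(theta[i] + sum(W[i][j] * h[j]
                                                 for j in range(Mh)))
                         for i in range(Mv)))
          for h in product((0, 1), repeat=Mh))
Z = math.exp(logZ(theta))
assert abs(Z - Z_x) < 1e-9 and abs(Z - Z_h) < 1e-9

# (ii) Finite-difference check of the derivative identity for theta_0.
i, eps = 0, 1e-6
mean_f = sum(x[i] * math.exp(E(x, h, theta)) / Z
             for x in product((0, 1), repeat=Mv)
             for h in product((0, 1), repeat=Mh))
tp = list(theta); tp[i] += eps
tm = list(theta); tm[i] -= eps
assert abs(mean_f - (logZ(tp) - logZ(tm)) / (2 * eps)) < 1e-6
```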
The derivation of the learning rules under the maximum likelihood objective
proceeds just as for the original harmonium.
The log-likelihood gradient for a single sample is
\[\begin{split}\frac{\partial}{\partial \phi} \log p(\mathbf{x})
&= \frac{\partial}{\partial \phi} \left(
\sum_{i, a} \theta_{ia} f_{ia}(x_i) +
\sum_j B_j\left( \hat{\boldsymbol{\lambda}}_j \right) -
\log Z
\right)\\
&= \frac{\partial}{\partial \phi} \left(
\sum_{i, a} \theta_{ia} f_{ia}(x_i) +
\sum_j \log \sum_{h_j} \exp\left[
\sum_b \hat{\lambda}_{jb} g_{jb}(h_j)
\right]
\right) -
\left\langle
\frac{\partial E'}{\partial \phi}
\right\rangle_{p(\mathbf{x}', \mathbf{h}')}\\
&= \frac{\partial}{\partial \phi} \left(
\sum_{i, a} \theta_{ia} f_{ia}(x_i) +
\sum_j \log \sum_{h_j} \exp\left[
\sum_b \lambda_{jb} g_{jb}(h_j) +
\sum_{i, a, b} W_{ia}^{jb} f_{ia}(x_i) g_{jb}(h_j)
\right]
\right) -
\left\langle
\frac{\partial E'_\mathbf{x}}{\partial \phi}
\right\rangle_{p(\mathbf{x}')}.\end{split}\]
The learning rules themselves are
\[\begin{split}\frac{\partial}{\partial \theta_{ia}} \log p(\mathbf{x})
&= f_{ia}(x_i) -
\left\langle
f_{ia}(x'_i)
\right\rangle_{p(\mathbf{x}', \mathbf{h}')}\\
&= f_{ia}(x_i) -
\left\langle
f_{ia}(x'_i)
\right\rangle_{p(\mathbf{x}')},\\\\\\
\frac{\partial}{\partial \lambda_{jb}} \log p(\mathbf{x})
&= \frac{
\sum_{h_j} \exp\left[
\sum_{b'} \lambda_{jb'} g_{jb'}(h_j) +
\sum_{i, a, b'} W_{ia}^{jb'} f_{ia}(x_i) g_{jb'}(h_j)
\right]
g_{jb}(h_j)
}{
\sum_{h_j} \exp\left[
\sum_{b'} \lambda_{jb'} g_{jb'}(h_j) +
\sum_{i, a, b'} W_{ia}^{jb'} f_{ia}(x_i) g_{jb'}(h_j)
\right]
} -
\left\langle
\frac{\partial E'}{\partial \lambda_{jb}}
\right\rangle_{p(\mathbf{x}', \mathbf{h}')}\\
&= \frac{
\partial B_j\left( \hat{\boldsymbol{\lambda}}_j \right)
}{
\partial \lambda_{jb}
} -
\left\langle
g_{jb}(h'_j)
\right\rangle_{p(\mathbf{x}', \mathbf{h}')}\\
&= \frac{
\partial B_j\left( \hat{\boldsymbol{\lambda}}_j \right)
}{
\partial \lambda_{jb}
} -
\left\langle
\frac{
\partial B_j\left( \hat{\boldsymbol{\lambda}}_j \right)
}{
\partial \lambda_{jb}
}
\right\rangle_{p(\mathbf{x}')},\\\\\\
\frac{\partial}{\partial W_{ia}^{jb}} \log p(\mathbf{x})
&= \frac{
\sum_{h_j} \exp\left[
\sum_{b'} \lambda_{jb'} g_{jb'}(h_j) +
\sum_{i', a', b'} W_{i'a'}^{jb'} f_{i'a'}(x_{i'}) g_{jb'}(h_j)
\right]
f_{ia}(x_i) g_{jb}(h_j)
}{
\sum_{h_j} \exp\left[
\sum_{b'} \lambda_{jb'} g_{jb'}(h_j) +
\sum_{i', a', b'} W_{i'a'}^{jb'} f_{i'a'}(x_{i'}) g_{jb'}(h_j)
\right]
} -
\left\langle
\frac{\partial E'}{\partial W_{ia}^{jb}}
\right\rangle_{p(\mathbf{x}', \mathbf{h}')}\\
&= f_{ia}(x_i) \frac{
\partial B_j\left( \hat{\boldsymbol{\lambda}}_j \right)
}{
\partial \lambda_{jb}
} -
\left\langle
f_{ia}(x'_i) g_{jb}(h'_j)
\right\rangle_{p(\mathbf{x}', \mathbf{h}')}\\
&= f_{ia}(x_i) \frac{
\partial B_j\left( \hat{\boldsymbol{\lambda}}_j \right)
}{
\partial \lambda_{jb}
} -
\left\langle
f_{ia}(x'_i) \frac{
\partial B_j\left( \hat{\boldsymbol{\lambda}}_j \right)
}{
\partial \lambda_{jb}
}
\right\rangle_{p(\mathbf{x}')}.\end{split}\]
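The \(\theta_{ia}\) learning rule can be validated against a finite-difference gradient of \(\log p(\mathbf{x})\) on a model small enough to enumerate exactly; all parameter values below are arbitrary, and Bernoulli units are assumed so that \(B_j\) is a softplus.

```python
import math
from itertools import product

# Finite-difference check of the theta learning rule
# d log p(x) / d theta_i = f_i(x_i) - <f_i(x'_i)>_{p(x')}
# on a tiny, exactly enumerable Bernoulli-Bernoulli model.
Mv, Mh = 2, 2
theta = [0.2, -0.3]
lam = [0.4, -0.5]
W = [[0.1, -0.2], [0.3, 0.25]]  # W[i][j]

def log_unnorm(x, th):
    # log of the unnormalized marginal: sum_i theta_i x_i + sum_j B_j.
    s = sum(th[i] * x[i] for i in range(Mv))
    for j in range(Mh):
        lam_hat = lam[j] + sum(W[i][j] * x[i] for i in range(Mv))
        s += math.log1p(math.exp(lam_hat))  # B_j for a Bernoulli unit
    return s

def log_px(x, th):
    logZ = math.log(sum(math.exp(log_unnorm(xp, th))
                        for xp in product((0, 1), repeat=Mv)))
    return log_unnorm(x, th) - logZ

x, i = (1, 0), 0
# Analytic gradient from the learning rule.
Z = sum(math.exp(log_unnorm(xp, theta)) for xp in product((0, 1), repeat=Mv))
mean_f = sum(xp[i] * math.exp(log_unnorm(xp, theta)) / Z
             for xp in product((0, 1), repeat=Mv))
grad_rule = x[i] - mean_f
# Central finite difference of log p(x).
eps = 1e-6
tp = list(theta); tp[i] += eps
tm = list(theta); tm[i] -= eps
grad_fd = (log_px(x, tp) - log_px(x, tm)) / (2 * eps)
assert abs(grad_rule - grad_fd) < 1e-6
```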
Bernoulli-Bernoulli
Recall that a Bernoulli distribution is defined as
\[\begin{split}\DeclareMathOperator{\BernDist}{Bern}
p(x \in \{0, 1\} \mid \lambda)
&= \BernDist_x\left[ \lambda \right]\\
&= \lambda^x (1 - \lambda)^{(1 - x)}\\
&= \exp \log \lambda^x (1 - \lambda)^{(1 - x)}\\
&= \exp\left\{
x \log \lambda + (1 - x) \log(1 - \lambda)
\right\}\\
&= \exp\left\{
x \log \frac{\lambda}{1 - \lambda} + \log(1 - \lambda)
\right\}\\
&= \exp\left\{ \eta x - \log\left( 1 + e^\eta \right) \right\}
& \quad & \eta = \log \frac{\lambda}{1 - \lambda}.\end{split}\]
By inspection, \(\lambda = \left( 1 + \exp\{-\eta\} \right)^{-1}\).
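This inverse relationship is easy to confirm numerically:

```python
import math

# Check that lam = (1 + exp(-eta))**-1 inverts eta = log(lam / (1 - lam)).
for lam in (0.01, 0.25, 0.5, 0.99):
    eta = math.log(lam / (1 - lam))
    assert abs(1 / (1 + math.exp(-eta)) - lam) < 1e-12
```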
An EFH with Bernoulli observed variables and Bernoulli latent variables defines
\[\begin{split}p(x_i \mid \mathbf{h})
&= p_i(x_i; \hat{\boldsymbol{\theta}}_i)\\
&= \exp\left[
\sum_a \hat{\theta}_{ia} f_{ia}(x_i) -
A_i\left( \hat{\boldsymbol{\theta}}_i \right)
\right]\\
&= \exp\left[
\sum_a \hat{\theta}_{ia} x_i -
\log \sum_{x'_i} \exp \sum_a \hat{\theta}_{ia} x'_i
\right]
& \quad & f_{ia}(x_i) = x_i\\
&= \exp\left[
\sum_a \hat{\theta}_{ia} x_i -
\log\left( 1 + \exp \sum_a \hat{\theta}_{ia} \right)
\right]
& \quad & x'_i \in \{0, 1\}\end{split}\]
and
\[\begin{split}p(h_j \mid \mathbf{x})
&= p_j(h_j; \hat{\boldsymbol{\lambda}}_j)\\
&= \exp\left[
\sum_b \hat{\lambda}_{jb} g_{jb}(h_j) -
B_j\left( \hat{\boldsymbol{\lambda}}_j \right)
\right]\\
&= \exp\left[
\sum_b \hat{\lambda}_{jb} h_j -
\log \sum_{h'_j} \exp \sum_b \hat{\lambda}_{jb} h'_j
\right]
& \quad & g_{jb}(h_j) = h_j\\
&= \exp\left[
\sum_b \hat{\lambda}_{jb} h_j -
\log\left( 1 + \exp \sum_b \hat{\lambda}_{jb} \right)
\right]
& \quad & h'_j \in \{0, 1\}.\end{split}\]
Note that this formulation is equivalent to the
Bernoulli-Bernoulli RBM.
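Since both conditionals factorize into independent sigmoids, block Gibbs sampling alternates between the two layers exactly as in an RBM. A minimal sketch, with hypothetical sizes and parameter values:

```python
import math
import random

# One block-Gibbs sweep for the Bernoulli-Bernoulli case, using the
# factorized conditionals; sizes and parameters below are made up.
def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def gibbs_sweep(x, theta, lam, W, rng):
    Mv, Mh = len(theta), len(lam)
    # Sample h | x: each h_j ~ Bern(sigmoid(lambda_hat_j)).
    h = [1 if rng.random() < sigmoid(lam[j] + sum(W[i][j] * x[i]
                                                  for i in range(Mv)))
         else 0 for j in range(Mh)]
    # Sample x | h: each x_i ~ Bern(sigmoid(theta_hat_i)).
    x_new = [1 if rng.random() < sigmoid(theta[i] + sum(W[i][j] * h[j]
                                                        for j in range(Mh)))
             else 0 for i in range(Mv)]
    return x_new, h

rng = random.Random(0)
x, h = gibbs_sweep([1, 0, 1], [0.1, -0.2, 0.3], [0.4, -0.5],
                   [[0.2, -0.1], [0.3, 0.0], [-0.4, 0.5]], rng)
assert all(v in (0, 1) for v in x + h)
```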
Gaussian-Bernoulli
Recall that a Gaussian distribution is defined as
\[\begin{split}\DeclareMathOperator{\NormDist}{Norm}
p(x \mid \mu, \sigma)
&= \NormDist_{x}\left[ \mu, \sigma^2 \right]\\
&= \frac{1}{\sqrt{2 \pi \sigma^2}}
\exp\left( -\frac{(x - \mu)^2}{2 \sigma^2} \right)\\
&= \exp\left( -\frac{\mu^2}{2 \sigma^2} - \log \sigma \sqrt{2 \pi} \right)
\exp\left( \frac{\mu x}{\sigma^2} - \frac{x^2}{2 \sigma^2} \right).\end{split}\]
An EFH with Gaussian observed variables and Bernoulli latent variables defines
\[\begin{split}\theta_{i1} &= -\frac{\mu_i^2}{2 \sigma_i^2} - \log \sigma_i \sqrt{2 \pi},
& \qquad &
f_{i1} &= 1,
& \qquad &
W_{i1}^{\cdot \cdot} &= 0;\\
\theta_{i2} &= \frac{\mu_i}{\sigma_i^2},
& \qquad &
f_{i2} &= x_i,
& \qquad &
W_{i2}^{\cdot \cdot} &\in \mathbb{R};\\
\theta_{i3} &= -\frac{1}{2 \sigma_i^2},
& \qquad &
f_{i3} &= x_i^2,
& \qquad &
W_{i3}^{\cdot \cdot} &= 0\end{split}\]
where
\[\begin{split}p(x_i \mid \mathbf{h})
&= p_i(x_i; \hat{\boldsymbol{\theta}}_i)\\
&= \exp\left[
\sum_a
\left( \theta_{ia} + \sum_{j, b} W_{ia}^{jb} g_{jb}(h_j) \right)
f_{ia}(x_i) -
A_i\left( \hat{\boldsymbol{\theta}}_i \right)
\right]\\
&= \frac{1}{\exp A_i\left( \hat{\boldsymbol{\theta}}_i \right)}
\exp\left[
-\frac{\mu_i^2}{2 \sigma_i^2} - \log \sigma_i \sqrt{2 \pi}
\right]
\exp\left[
\frac{\mu_i x_i}{\sigma_i^2} - \frac{x_i^2}{2 \sigma_i^2} +
x_i \sum_{j, b} W_{i2}^{jb} h_j
\right]
& \quad & g_{jb}(h_j) = h_j\\
&= \frac{
\exp\left[ \log \sigma_i \sqrt{2 \pi} \right]^{-1}
}{
\exp A_i\left( \hat{\boldsymbol{\theta}}_i \right)
}
\exp\left[
\frac{x_i \sum_j w_{ij} h_j}{\sigma_i^2} -
\frac{\left( x_i - \mu_i \right)^2}{2 \sigma_i^2}
\right]
& \quad & W_{i2}^{jb} = \frac{w_{ij}}{\sigma_i^2}\\
&= \frac{
\exp\left[ \log \sigma_i \sqrt{2 \pi} \right]^{-1}
}{
\exp A_i\left( \hat{\boldsymbol{\theta}}_i \right)
}
\exp \frac{
-\left( x_i - \mu_i - \sum_j w_{ij} h_j \right)^2 +
\left( \sum_j w_{ij} h_j \right)^2 +
2 \mu_i \sum_j w_{ij} h_j
}{2 \sigma_i^2}
& \quad & \frac{ax}{\sigma^2} -
\frac{\left( x - b \right)^2}{2 \sigma^2} =
\frac{-\left( x - a - b \right)^2 + a^2 + 2ab}{2 \sigma^2}\\
&= \frac{
\exp\left[
-\log \sigma_i \sqrt{2 \pi} +
\left( \sum_j w_{ij} h_j \right)^2 +
2 \mu_i \sum_j w_{ij} h_j
\right]
}{
\exp A_i\left( \hat{\boldsymbol{\theta}}_i \right)
}
\exp \frac{
-\left( x_i - \mu_i - \sum_j w_{ij} h_j \right)^2
}{2 \sigma_i^2}\\
&= \frac{1}{
\int
\exp -\left( x'_i - \mu_i - \sum_j w_{ij} h_j \right)^2
\left( 2 \sigma_i^2 \right)^{-1}
dx'_i
}
\exp \frac{
-\left( x_i - \mu_i - \sum_j w_{ij} h_j \right)^2
}{2 \sigma_i^2}\\
&= \frac{1}{\sigma_i \sqrt{2 \pi}}
\exp \frac{
-\left(
x_i - \mu_i - \sum_j w_{ij} h_j
\right)^2
}{2 \sigma_i^2}
& \quad & \int_{-\infty}^\infty e^{-a (x + b)^2} dx = \sqrt{a^{-1} \pi}\\
&= \NormDist_{x_i}\left[ \mu_i + \sum_j w_{ij} h_j, \sigma_i^2 \right]\end{split}\]
and
\[\begin{split}\DeclareMathOperator{\sigmoid}{sigmoid}
p(h_j = 1 \mid \mathbf{x})
&= p_j(h_j = 1; \hat{\boldsymbol{\lambda}}_j)\\
&= \exp\left[
\sum_b
\left(
\lambda_{jb} + \sum_{i, a} W_{ia}^{jb} f_{ia}(x_i)
\right) h_j -
\log\left( 1 + \exp \sum_b \hat{\lambda}_{jb} \right)
\right]\\
&= \frac{
\exp\left( \lambda_j + \sum_i \frac{w_{ij}}{\sigma_i^2} x_i \right)
}{
1 + \exp \hat{\lambda}_j
}
& \quad & \lambda_{jb} = \lambda_j\\
&= \sigmoid\left(
\lambda_j + \sum_i \frac{w_{ij}}{\sigma_i^2} x_i
\right)
& \quad & \sigmoid(u) = \frac{1}{1 + \exp(-u)}.\end{split}\]
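Both steps of the derivation above lend themselves to quick numerical spot checks: the completing-the-square identity used for \(p(x_i \mid \mathbf{h})\), and the sigmoid form of \(p(h_j = 1 \mid \mathbf{x})\) obtained by enumerating \(\mathbf{h}\) with \(\mathbf{x}\) fixed. All test values below are arbitrary.

```python
import math
from itertools import product

# (i) Completing-the-square identity:
# a*x/s2 - (x - b)**2 / (2*s2) == (-(x - a - b)**2 + a**2 + 2*a*b) / (2*s2).
for a, b, xv, s2 in [(0.7, -1.2, 0.3, 2.0), (-0.5, 0.4, 1.1, 0.5)]:
    lhs = a * xv / s2 - (xv - b) ** 2 / (2 * s2)
    rhs = (-(xv - a - b) ** 2 + a ** 2 + 2 * a * b) / (2 * s2)
    assert abs(lhs - rhs) < 1e-12

# (ii) With x fixed, enumerate h and compare the marginal p(h_j = 1 | x)
# against sigmoid(lambda_j + sum_i w_ij x_i / sigma_i^2).
Mv, Mh = 2, 2
lam = [0.3, -0.4]
sigma2 = [0.5, 2.0]            # sigma_i^2 for the visible units
w = [[0.6, -0.2], [0.1, 0.7]]  # w[i][j]
x = [1.3, -0.8]

def unnorm(h):
    s = sum(lam[j] * h[j] for j in range(Mh))
    s += sum(w[i][j] / sigma2[i] * x[i] * h[j]
             for i in range(Mv) for j in range(Mh))
    return math.exp(s)

Z = sum(unnorm(h) for h in product((0, 1), repeat=Mh))
for j in range(Mh):
    p_brute = sum(unnorm(h) for h in product((0, 1), repeat=Mh)
                  if h[j] == 1) / Z
    u = lam[j] + sum(w[i][j] / sigma2[i] * x[i] for i in range(Mv))
    assert abs(p_brute - 1 / (1 + math.exp(-u))) < 1e-12
```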
By inspection,
\[\begin{split}p(\mathbf{x}, \mathbf{h})
&= Z^{-1}
\exp\left[
\sum_{i, a} \theta_{ia} f_{ia}(x_i) +
\sum_{j, b} \lambda_{jb} g_{jb}(h_j) +
\sum_{i, j, a, b} W_{ia}^{jb} f_{ia}(x_i) g_{jb}(h_j)
\right]\\
&= Z^{-1}
\exp\left[
\left(
\sum_i
-\frac{\mu_i^2}{2 \sigma_i^2} - \log \sigma_i \sqrt{2 \pi} +
\frac{\mu_i}{\sigma_i^2} x_i -
\frac{1}{2 \sigma_i^2} x_i^2
\right) +
\sum_j \lambda_j h_j +
\sum_{i, j} \frac{w_{ij}}{\sigma_i^2} x_i h_j
\right]\\
&= \frac{\exp\left( -\sum_i \log \sigma_i \sqrt{2 \pi} \right)}{Z}
\exp\left[
\sum_i -\frac{\left( x_i - \mu_i \right)^2}{2 \sigma_i^2} +
\sum_j \lambda_j h_j +
\sum_{i, j} \frac{w_{ij}}{\sigma_i^2} x_i h_j
\right]\\
&\propto
\exp\left[
\sum_i -\frac{\left( x_i - \mu_i \right)^2}{2 \sigma_i^2} +
\sum_j \lambda_j h_j +
\sum_{i, j} \frac{w_{ij}}{\sigma_i^2} x_i h_j
\right].\end{split}\]
Bernoulli-Gaussian
An EFH with Bernoulli observed variables and Gaussian latent variables defines
\[\begin{split}\lambda_{j1} &= -\frac{\mu_j^2}{2 \sigma_j^2} - \log \sigma_j \sqrt{2 \pi},
& \qquad &
g_{j1} &= 1,
& \qquad &
W_{\cdot \cdot}^{j1} &= 0;\\
\lambda_{j2} &= \frac{\mu_j}{\sigma_j^2},
& \qquad &
g_{j2} &= h_j,
& \qquad &
W_{\cdot \cdot}^{j2} &\in \mathbb{R};\\
\lambda_{j3} &= -\frac{1}{2 \sigma_j^2},
& \qquad &
g_{j3} &= h_j^2,
& \qquad &
W_{\cdot \cdot}^{j3} &= 0\end{split}\]
where
\[\begin{split}p(h_j \mid \mathbf{x})
&= p_j(h_j; \hat{\boldsymbol{\lambda}}_j)\\
&= \exp\left[
\sum_b
\left(
\lambda_{jb} + \sum_{i, a} W_{ia}^{jb} f_{ia}(x_i)
\right) g_{jb}(h_j) -
B_j\left( \hat{\boldsymbol{\lambda}}_j \right)
\right]\\
&= \frac{1}{\exp B_j\left( \hat{\boldsymbol{\lambda}}_j \right)}
\exp\left[
-\frac{\mu_j^2}{2 \sigma_j^2} - \log \sigma_j \sqrt{2 \pi}
\right]
\exp\left[
\frac{\mu_j h_j}{\sigma_j^2} - \frac{h_j^2}{2 \sigma_j^2} +
h_j \sum_{i, a} W_{ia}^{j2} x_i
\right]
& \quad & f_{ia}(x_i) = x_i\\
&= \NormDist_{h_j}\left[ \mu_j + \sum_i w_{ij} x_i, \sigma_j^2 \right]
& \quad & W_{ia}^{j2} = \frac{w_{ij}}{\sigma_j^2}\end{split}\]
and
\[\begin{split}p(x_i = 1 \mid \mathbf{h})
&= p_i(x_i = 1; \hat{\boldsymbol{\theta}}_i)\\
&= \exp\left[
\sum_a
\left(
\theta_{ia} + \sum_{j, b} W_{ia}^{jb} g_{jb}(h_j)
\right) x_i -
\log\left( 1 + \exp \sum_a \hat{\theta}_{ia} \right)
\right]\\
&= \frac{
\exp\left( \theta_i + \sum_j \frac{w_{ij}}{\sigma_j^2} h_j \right)
}{
1 + \exp \hat{\theta}_i
}
& \quad & \theta_{ia} = \theta_i\\
&= \sigmoid\left( \theta_i + \sum_j \frac{w_{ij}}{\sigma_j^2} h_j \right).\end{split}\]
By inspection,
\[\begin{split}p(\mathbf{x}, \mathbf{h})
&= Z^{-1}
\exp\left[
\sum_i \theta_i x_i +
\left(
\sum_j
-\frac{\mu_j^2}{2 \sigma_j^2} - \log \sigma_j \sqrt{2 \pi} +
\frac{\mu_j}{\sigma_j^2} h_j -
\frac{1}{2 \sigma_j^2} h_j^2
\right) +
\sum_{i, j} \frac{w_{ij}}{\sigma_j^2} x_i h_j
\right]\\
&\propto
\exp\left[
\sum_i \theta_i x_i -
\sum_j \frac{\left( h_j - \mu_j \right)^2}{2 \sigma_j^2} +
\sum_{i, j} \frac{w_{ij}}{\sigma_j^2} x_i h_j
\right].\end{split}\]
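The Bernoulli-Gaussian visible conditional admits the same kind of check as before: fix \(\mathbf{h}\), enumerate the binary \(\mathbf{x}\), and compare against the sigmoid expression \(\sigmoid(\theta_i + \sum_j w_{ij} h_j / \sigma_j^2)\). Test values below are arbitrary.

```python
import math
from itertools import product

# Spot check of the Bernoulli-Gaussian visible conditional: with h fixed,
# enumerate the binary x and compare the marginal p(x_i = 1 | h) against
# sigmoid(theta_i + sum_j w_ij h_j / sigma_j^2).
Mv, Mh = 2, 2
theta = [0.2, -0.7]
sigma2 = [1.5, 0.8]            # sigma_j^2 for the hidden units
w = [[0.5, -0.3], [0.2, 0.9]]  # w[i][j]
h = [0.4, -1.1]

def unnorm(x):
    s = sum(theta[i] * x[i] for i in range(Mv))
    s += sum(w[i][j] / sigma2[j] * x[i] * h[j]
             for i in range(Mv) for j in range(Mh))
    return math.exp(s)

Z = sum(unnorm(x) for x in product((0, 1), repeat=Mv))
for i in range(Mv):
    p_brute = sum(unnorm(x) for x in product((0, 1), repeat=Mv)
                  if x[i] == 1) / Z
    u = theta[i] + sum(w[i][j] / sigma2[j] * h[j] for j in range(Mh))
    assert abs(p_brute - 1 / (1 + math.exp(-u))) < 1e-12
```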
Gaussian-Gaussian
An EFH with Gaussian observed variables and Gaussian latent variables defines
\[\begin{split}\theta_{i1} &= -\frac{\mu_i^2}{2 \sigma_i^2} - \log \sigma_i \sqrt{2 \pi},
& \qquad &
f_{i1} &= 1,
& \qquad &
g_{j1} &= 1,
& \qquad &
\lambda_{j1} &= -\frac{\mu_j^2}{2 \sigma_j^2} - \log \sigma_j \sqrt{2 \pi},
& \qquad &
W_{i2}^{j2} &\in \mathbb{R};\\
\theta_{i2} &= \frac{\mu_i}{\sigma_i^2},
& \qquad &
f_{i2} &= x_i,
& \qquad &
g_{j2} &= h_j,
& \qquad &
\lambda_{j2} &= \frac{\mu_j}{\sigma_j^2};\\
\theta_{i3} &= -\frac{1}{2 \sigma_i^2},
& \qquad &
f_{i3} &= x_i^2,
& \qquad &
g_{j3} &= h_j^2,
& \qquad &
\lambda_{j3} &= -\frac{1}{2 \sigma_j^2},
& \qquad &
W_{ia}^{jb} &= 0 \quad \forall (a, b) \neq (2, 2)\end{split}\]
where
\[\begin{split}p(x_i \mid \mathbf{h})
&= p_i(x_i; \hat{\boldsymbol{\theta}}_i)\\
&= \exp\left[
\sum_a
\left( \theta_{ia} + \sum_{j, b} W_{ia}^{jb} g_{jb}(h_j) \right)
f_{ia}(x_i) -
A_i\left( \hat{\boldsymbol{\theta}}_i \right)
\right]\\
&= \frac{1}{\exp A_i\left( \hat{\boldsymbol{\theta}}_i \right)}
\exp\left[
-\frac{\mu_i^2}{2 \sigma_i^2} - \log \sigma_i \sqrt{2 \pi}
\right]
\exp\left[
\frac{\mu_i x_i}{\sigma_i^2} - \frac{x_i^2}{2 \sigma_i^2} +
x_i \sum_j W_{i2}^{j2} h_j
\right]\\
&= \frac{
\exp\left[ \log \sigma_i \sqrt{2 \pi} \right]^{-1}
}{
\exp A_i\left( \hat{\boldsymbol{\theta}}_i \right)
}
\exp\left[
\frac{x_i \sigma_i^2 \sum_j W_{i2}^{j2} h_j}{\sigma_i^2} -
\frac{\left( x_i - \mu_i \right)^2}{2 \sigma_i^2}
\right]\\
&= \NormDist_{x_i}\left[
\mu_i + \sigma_i^2 \sum_j W_{i2}^{j2} h_j, \sigma_i^2
\right]\\
&= \NormDist_{x_i}\left[
\mu_i + \sum_j \frac{w_{ij}}{\sigma_j^2} h_j, \sigma_i^2
\right]
& \quad & W_{i2}^{j2} = \frac{w_{ij}}{\sigma_i^2 \sigma_j^2}\end{split}\]
and
\[\begin{split}p(h_j \mid \mathbf{x})
&= p_j(h_j; \hat{\boldsymbol{\lambda}}_j)\\
&= \exp\left[
\sum_b
\left(
\lambda_{jb} + \sum_{i, a} W_{ia}^{jb} f_{ia}(x_i)
\right) g_{jb}(h_j) -
B_j\left( \hat{\boldsymbol{\lambda}}_j \right)
\right]\\
&= \frac{1}{\exp B_j\left( \hat{\boldsymbol{\lambda}}_j \right)}
\exp\left[
-\frac{\mu_j^2}{2 \sigma_j^2} - \log \sigma_j \sqrt{2 \pi}
\right]
\exp\left[
\frac{\mu_j h_j}{\sigma_j^2} - \frac{h_j^2}{2 \sigma_j^2} +
h_j \sum_i W_{i2}^{j2} x_i
\right]\\
&= \frac{
\exp\left[ \log \sigma_j \sqrt{2 \pi} \right]^{-1}
}{
\exp B_j\left( \hat{\boldsymbol{\lambda}}_j \right)
}
\exp\left[
\frac{h_j \sigma_j^2 \sum_i W_{i2}^{j2} x_i}{\sigma_j^2} -
\frac{\left( h_j - \mu_j \right)^2}{2 \sigma_j^2}
\right]\\
&= \NormDist_{h_j}\left[
\mu_j + \sigma_j^2 \sum_i W_{i2}^{j2} x_i, \sigma_j^2
\right]\\
&= \NormDist_{h_j}\left[
\mu_j + \sum_i \frac{w_{ij}}{\sigma_i^2} x_i, \sigma_j^2
\right].\end{split}\]
By inspection,
\[p(\mathbf{x}, \mathbf{h}) \propto
\exp\left[
-\sum_i \frac{\left( x_i - \mu_i \right)^2}{2 \sigma_i^2} -
\sum_j \frac{\left( h_j - \mu_j \right)^2}{2 \sigma_j^2} +
\sum_{i, j} \frac{w_{ij}}{\sigma_i^2 \sigma_j^2} x_i h_j
\right].\]
References
- WRZH05
Max Welling, Michal Rosen-Zvi, and Geoffrey E. Hinton. Exponential family harmoniums with an application to information retrieval. In Advances in Neural Information Processing Systems, 1481–1488. 2005.