A Fast Learning Algorithm for Deep Belief Nets

Motivation(s)

Learning is difficult in densely connected, directed, multilayer belief nets (DBNs) because it is hard to draw samples from the conditional distribution of the hidden activities given a data vector. The difficulty manifests as explaining away: hidden states are independent in the prior but dependent in the posterior. Furthermore, even though belief networks support straightforward ancestral sampling, learning the model parameters of the first hidden layer requires knowing the model parameters of all the higher layers.

The posterior distribution over the hidden variables \(p(\mathbf{h} \mid \mathbf{v})\) is intractable except for mixture models or linear models with additive Gaussian noise. Markov chain Monte Carlo methods can be used to sample from the posterior if time is not a constraint. Variational methods approximate the true posterior with a more tractable distribution, and learning is guaranteed to improve a variational bound even when the inference of the hidden states is done incorrectly. However, the approximations may be poor, and variational learning requires all the parameters to be learned together, which scales poorly as the number of parameters increases.

One greedy procedure described in [Fre] to train a deep belief network layer-by-layer is as follows:

  1. Given the visible data layer \(\mathbf{v}\) that is connected to the hidden layer \(\mathbf{h}_1\), maximize the likelihood of generating the training data

    \[p(\mathbf{v} \mid \mathbf{W}_1) = \int p(\mathbf{v} \mid \mathbf{h}_1, \mathbf{W}_1) p(\mathbf{h}_1) d\mathbf{h}_1.\]
    • \(\mathbf{h}_1\) is driven by an ephemeral hidden layer \(\mathbf{b}_1\) that will serve as the bias layer.

    • The EM algorithm can be used to optimize the weights \(\{ \mathbf{b}_1, \mathbf{W}_1 \}\) to maximize the probability of generating the data patterns.

      • When the posterior is not available analytically, one must resort to drawing samples from \(p(\mathbf{h}_1 \mid \mathbf{v}, \mathbf{W}_1)\).

    • Averaging over the \(N\) training examples gives the aggregate posterior \(\frac{1}{N} \sum_{\mathbf{v} \in \mathcal{D}} p(\mathbf{h}_1 \mid \mathbf{v}, \mathbf{W}_1)\).

      • The weights \(\mathbf{W}_1\) end up at values that maximize the likelihood of the data, given \(\mathbf{h}_1\) sampled from the aggregated posterior distribution.

      • The bias weights \(\mathbf{b}_1\) end up at values that approximate the aggregate posterior distribution.

  2. Freeze the weights \(\mathbf{W}_1\) and replace \(\mathbf{h}_1\)’s bias inputs with a second layer of weights \(\mathbf{W}_2\).

    • This stage discards the existing bias layer and its weights \(\mathbf{b}_1\) to introduce the next hidden layer \(\mathbf{h}_2\).

    • A new ephemeral hidden bias layer \(\mathbf{b}_2\) will drive \(\mathbf{h}_2\).

  3. Given \(\mathbf{W}_1\) and treating \(\mathbf{h}_1\) as the data (visible) layer, proceed to maximize the likelihood

    \[p(\mathbf{h}_1 \mid \mathbf{W}_2) = \int p(\mathbf{h}_1 \mid \mathbf{h}_2, \mathbf{W}_2) p(\mathbf{h}_2) d\mathbf{h}_2.\]
    • For each visible pattern \(\mathbf{v}\), the frozen weights are used to sample from \(p(\mathbf{h}_1 \mid \mathbf{v}, \mathbf{W}_1)\).

    • This will make \(\mathbf{h}_2\) learn the aggregated posterior distribution over \(\mathbf{h}_1\).

  4. Proceed recursively until all the layers are learned.

This greedy procedure does not work well. Recall that if \(p(\mathbf{v})\) is factorial, i.e., \(p(\mathbf{v}) = \prod_i p(v_i)\), there is no point in having hidden units in the generative model: a model with hidden units can do no better than a model with just bias inputs to the visible units.

In a directed belief network with one hidden layer connected by weights \(\mathbf{W}\), the prior \(p(\mathbf{h})\) is factorial, while the posterior \(p(\mathbf{h} \mid \mathbf{v})\) is non-factorial due to explaining away. Because the learned prior must remain factorial, learning the weights \(\mathbf{W}\) with EM tends to drive the aggregated posterior to be as factorial as possible, which wastes the representational capacity of the hidden units.

Proposed Solution(s)

The authors observe that when trying to improve

\[p(\mathbf{v}) = \int p(\mathbf{h}) p(\mathbf{v} \mid \mathbf{h}) d\mathbf{h},\]

solely improving the prior \(p(\mathbf{h})\) indirectly refines \(p(\mathbf{v})\). To improve the prior, a better model of the aggregated posterior distribution is desirable.

One way to enhance the model of the aggregated posterior is to make drawing samples easy by removing the effects of explaining away. This can be done by restricting the likelihood and the prior to the exponential family. As shown in the appendix, every exponential-family likelihood has a (complementary) prior that makes the posterior exactly factorize.
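Concretely, for logistic units the complementary prior is what an RBM provides: conditioned on the visible vector, the posterior factorizes into independent Bernoulli terms (a standard result, stated here for reference),

\[p(\mathbf{h} \mid \mathbf{v}) = \prod_j p(h_j \mid \mathbf{v}), \qquad p(h_j = 1 \mid \mathbf{v}) = \sigma\left( b_j + \sum_i v_i w_{ij} \right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}},\]

so sampling the hidden layer requires no iteration at all.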

In order to create a complementary prior for each greedily learned layer, first recall how the first layer of a directed belief network is learned. Let \(\mathbf{v} = \mathbf{x}^{(0)}, \mathbf{h}_1 = \mathbf{y}^{(0)}, \mathbf{x}^{(1)}, \mathbf{y}^{(1)}, \ldots, \mathbf{x}^{(\infty)}, \mathbf{y}^{(\infty)}\) be an infinite sequence (stack) of variables where \(p( \mathbf{x}^{(0)} \mid \mathbf{y}^{(0)})\) and \(p( \mathbf{y}^{(0)} \mid \mathbf{x}^{(0)})\) factorize. \(\mathbf{x}^{(i)}\) and \(\mathbf{y}^{(i)}\) represent a sequence of successively deeper layers with tied weights, i.e., the parameters defining the conditional distributions between layers are shared. This sequence can be interpreted as unrolling the Gibbs sampling scheme in space such that each parallel update of the variables defines the states of a separate layer in the graph. The infinite stack of directed graphs with tied weights is proven (in the appendix) to produce a factorial posterior but a non-factorial prior, assuming first-order Markovian dependency and detailed balance hold. The Markov chain’s detailed balance property reveals that a top-down pass of the directed belief network is equivalent to letting an RBM settle to equilibrium [Fau]. Since an RBM has a factorial posterior but a non-factorial prior, the learning process does not force the aggregated posterior to be factorial.
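Below is a minimal sketch of this equivalence for binary logistic units; the helper names and the use of NumPy are mine, not the paper’s. A top-down ancestral pass through the tied-weight stack is literally a run of alternating (block) Gibbs updates:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli(p):
    return (rng.random(p.shape) < p).astype(float)

def ancestral_pass(W, b, c, y, n_layers=100):
    """Descend the infinite tied-weight stack: every x/y pair of layers
    shares the same W, so each step is one parallel (block) Gibbs update."""
    for _ in range(n_layers):
        x = bernoulli(sigmoid(y @ W.T + b))  # p(x | y) is factorial
        y = bernoulli(sigmoid(x @ W + c))    # p(y | x) is factorial
    return x, y
```

Truncating the pass after many layers corresponds to letting the equivalent RBM settle to its stationary distribution.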

The model above \(\mathbf{y}^{(0)}\) implements a complementary prior. However, once the next hidden layer changes the tied weights, the priors for the already learned lower layers cease to be complementary. Consequently, the posterior distributions in the lower layers are no longer factorial. This approximation, which assumes higher layers exist but have tied weights initially, is more reasonable than one that ignores the higher layers.

The authors assert that if image-label pairs were generated as stuff -> image -> label, it would make sense to learn a mapping from images to labels. However, if image-label pairs are generated as image <- stuff -> label, it makes economic sense to do unsupervised learning on the image followed by supervised learning with the labels [Hinb].

Unsupervised Pre-training

Unlabeled data can be used to discover good features as follows:

  1. Given the visible data layer \(\mathbf{v}\) that is connected to the hidden layer \(\mathbf{h}_1\), learn the weights \(\mathbf{W}\) by treating the pair of layers as an RBM.

    • Contrastive divergence can be used instead of alternating Gibbs sampling.

    • The typical application of contrastive divergence fails for deep, multilayer networks with different weights at each layer because these networks take far too long even to reach conditional equilibrium.

  2. Freeze (i.e. make a copy of) \(\mathbf{W}\) and denote it as \(\mathbf{W}_1\).

    • \(\mathbf{W}_1\) is typically described as being untied from the tied weights \(\mathbf{W}\).

  3. Given \(\mathbf{W}_1\) and treating \(\mathbf{h}_1 \sim p(\mathbf{h}_1 \mid \mathbf{v}, \mathbf{W}_1)\) as the data (visible) layer, train the next hidden layer \(\mathbf{h}_2\) as an RBM.

    • This lets the second RBM refine a copy of \(\mathbf{W}\) to better model the aggregated posterior distribution over \(\mathbf{h}_1\).

  4. Proceed recursively until all the layers are learned; a minimal sketch of the whole procedure follows this list.
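Here is the promised sketch of the stacking procedure, assuming binary units and CD-1 updates (the function names, learning rate, and NumPy implementation are illustrative assumptions, not the paper’s code):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
bernoulli = lambda p: (rng.random(p.shape) < p).astype(float)

def train_rbm(V, n_hidden, lr=0.1, epochs=10):
    """Train one RBM with CD-1; V has shape (n_samples, n_visible)."""
    n_visible = V.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b, c = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        ph = sigmoid(V @ W + c)               # positive phase: p(h | v)
        h = bernoulli(ph)
        pv = sigmoid(h @ W.T + b)             # one step of Gibbs sampling
        ph_recon = sigmoid(pv @ W + c)
        W += lr * (V.T @ ph - pv.T @ ph_recon) / len(V)  # <vh>_0 - <vh>_1
        b += lr * (V - pv).mean(axis=0)
        c += lr * (ph - ph_recon).mean(axis=0)
    return W, b, c

def greedy_pretrain(V, layer_sizes):
    """Freeze each trained layer, then treat samples from its posterior
    as the data (visible) layer for the next RBM."""
    layers = []
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(V, n_hidden)
        layers.append((W, b, c))
        V = bernoulli(sigmoid(V @ W + c))     # h ~ p(h | v, W)
    return layers
```

For example, `greedy_pretrain(images, [500, 500, 2000])` would roughly mimic the layer sizes used in the paper’s MNIST experiments.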

Supervised Fine-tuning

The pre-training stage uses the frozen weights as both recognition weights and generative weights. As a result, neither the weights nor the inference procedure is optimal for the lower layers; the model underfits.

To refine the weights, the contrastive wake-sleep algorithm is one way to untie the recognition weights from the generative ones. The wake phase is unchanged: perform a bottom-up pass, starting with a pattern from the training set, and use the delta rule to increase the likelihood of the generative model. The goal of the sleep phase is also unchanged: perform a top-down pass and use the delta rule to increase the likelihood of the recognition model. However, instead of starting from an equilibrium sample from the top-level associative memory, the associative memory is initialized by a bottom-up pass followed by contrastive divergence. This ensures that the recognition weights capture representations that resemble real data and eliminates the problem of mode averaging.
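A sketch of the untied delta-rule updates for a single directed layer below the associative memory; binary units, with \(\mathbf{R}\) as recognition weights and \(\mathbf{G}\) as generative weights. The top-level contrastive-divergence step is omitted, and all names here are my assumptions, not the paper’s code:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
bernoulli = lambda p: (rng.random(p.shape) < p).astype(float)

def wake_sleep_update(v, R, G, lr=0.01):
    # Wake phase: recognize h from the real datum v, then nudge the
    # generative weights so that h reconstructs v better (delta rule).
    h = bernoulli(sigmoid(v @ R))
    G += lr * np.outer(h, v - sigmoid(h @ G))

    # Sleep phase: fantasize a v from h, then nudge the recognition
    # weights so that the fantasy recovers h better (delta rule).
    v_fantasy = bernoulli(sigmoid(h @ G))
    R += lr * np.outer(v_fantasy, h - sigmoid(v_fantasy @ R))
    return R, G
```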

Note that a DBN is not a multilayer Boltzmann machine [salakhutdinov2009deep]: it has undirected connections between its top two layers and downward directed connections between all its lower layers. After the weights are untied, the whole probabilistic framework can be discarded. Only the recognition weights in the bottom-up direction are used to initialize all the feature-detecting layers of a deterministic feedforward deep neural network (DNN). The network can then be augmented with the desired final output layer (e.g. softmax) and trained discriminatively via backpropagation.

The fine-tuning phase that follows pre-training only modifies the features slightly to get the category boundaries right; it does not need to discover new features. The resulting network is called a DBN-DNN to differentiate its training regime from that of typical DNNs [HDY+12].
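A sketch of the handoff from pre-training to discriminative training; the layer tuple layout follows the pre-training sketch above and the names (e.g. `W_out`) are illustrative:

```python
import numpy as np

def dbn_to_dnn_predict(x, layers, W_out):
    """Deterministic forward pass for a single example x through a DNN
    initialized from the DBN's bottom-up weights; the generative biases
    b are simply discarded."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for W, _b, c in layers:
        x = sigmoid(x @ W + c)         # real-valued activations, no sampling
    logits = x @ W_out                 # appended softmax output layer
    p = np.exp(logits - logits.max())  # numerically stable softmax
    return p / p.sum()
```

Ordinary backpropagation of the cross-entropy loss through this network then fine-tunes every layer at once.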

Evaluation(s)

The authors reasoned at a high level that the RBM learning rule is the same as the maximum likelihood learning rule for the infinite logistic belief network with tied weights.

Recall that each pair of layers in the infinite belief network with tied weights takes the form of

\[p(\mathbf{x}, \mathbf{y}) = Z^{-1} \exp\left( \sum_{i, j} \Psi_{i, j}(x_i, y_j) + \sum_i \gamma_i(x_i) + \sum_j \alpha_j(y_j) \right) = Z^{-1} \exp E(\mathbf{x}, \mathbf{y})\]

where

\[Z = \sum_{\mathbf{x}', \mathbf{y}'} \exp E(\mathbf{x}', \mathbf{y}').\]

To simplify notation from here on, define \(E^{(i, j)} = \mathop{E}\left( \mathbf{x}^{(i)}, \mathbf{y}^{(j)} \right)\).

The derivatives of the log-probability of each layer are given by

\[\begin{split}\frac{\partial}{\partial w_{ij}} \log p(\mathbf{x}^{(i)}) &= \frac{\partial}{\partial w_{ij}} \left( \log \sum_{\mathbf{y}^{(i)}} \exp E^{(i, i)} - \log Z \right)\\ &= \left( \sum_{\mathbf{y}^{(i)}} \exp E^{(i, i)} \right)^{-1} \sum_{\mathbf{y}^{(i)}} \exp \left\{ E^{(i, i)} \right\} \frac{\partial E^{(i, i)}}{\partial w_{ij}} - Z^{-1} \sum_{\mathbf{x}', \mathbf{y}^{(i)}} \exp \left\{ E^{(', i)} \right\} \frac{\partial E^{(', i)}}{\partial w_{ij}}\\ &= \sum_{\mathbf{y}^{(i)}} p(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}) \frac{\partial E^{(i, i)}}{\partial w_{ij}} - \sum_{\mathbf{x}', \mathbf{y}^{(i)}} p(\mathbf{x}', \mathbf{y}^{(i)}) \frac{\partial E^{(', i)}}{\partial w_{ij}}\\ &= \left\langle \frac{\partial E^{(i, i)}}{\partial w_{ij}} \right\rangle_{p(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)})} - \left\langle \frac{\partial E^{(', i)}}{\partial w_{ij}} \right\rangle_{p(\mathbf{x}', \mathbf{y}^{(i)})} & \quad & p(\mathbf{x}', \mathbf{y}^{(i)}) = p(\mathbf{x}' \mid \mathbf{y}^{(i)}) p(\mathbf{y}^{(i)})\end{split}\]

and

\[\begin{split}\frac{\partial}{\partial w_{ij}} \log p(\mathbf{y}^{(i)}) &= \frac{\partial}{\partial w_{ij}} \left( \log \sum_{\mathbf{x}^{(i + 1)}} \exp E^{(i + 1, i)} - \log Z \right)\\ &= \left\langle \frac{\partial E^{(i + 1, i)}}{\partial w_{ij}} \right\rangle_{p(\mathbf{x}^{(i + 1)} \mid \mathbf{y}^{(i)})} - \left\langle \frac{\partial E^{(i + 1, ')}}{\partial w_{ij}} \right\rangle_{p(\mathbf{x}^{(i + 1)}, \mathbf{y}')} & \quad & p(\mathbf{x}^{(i + 1)}, \mathbf{y}') = p(\mathbf{y}' \mid \mathbf{x}^{(i + 1)}) p(\mathbf{x}^{(i + 1)}).\end{split}\]

The quantity \(\langle \cdot \rangle\) can be estimated via Gibbs sampling. Notice how \(\mathbf{x}'\) can be either \(\mathbf{x}^{(i)}\) or \(\mathbf{x}^{(i + 1)}\) because the weights are tied (see Figure 3) and the generative process converges to the stationary distribution of the Markov chain [Hina]. Likewise, \(\mathbf{y}'\) can be either \(\mathbf{y}^{(i)}\) or \(\mathbf{y}^{(i + 1)}\). Note that the bias terms in the conditional probability distributions can be rolled into \(\mathbf{W}\). Since the weights are tied, the full derivative for a generative weight is obtained by

\[\sum_{i = 0}^{\infty} \frac{\partial}{\partial w_{ij}} \left( \log p(\mathbf{x}^{(i)}) + \log p(\mathbf{y}^{(i)}) \right) = \left\langle \frac{\partial E^{(0, 0)}}{\partial w_{ij}} \right\rangle_{p(\mathbf{y}^{(0)} \mid \mathbf{x}^{(0)})} - \left\langle \frac{\partial E^{(\infty, \infty)}}{\partial w_{ij}} \right\rangle_{p(\mathbf{x}^{(\infty)}, \mathbf{y}^{(\infty)})}.\]
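To connect this with the RBM learning rule: for logistic units \(\Psi_{i, j}(x_i, y_j) = w_{ij} x_i y_j\), so \(\partial E^{(k, k)} / \partial w_{ij} = x_i^{(k)} y_j^{(k)}\) and the full derivative reduces to

\[\left\langle x_i^{(0)} y_j^{(0)} \right\rangle_{\text{data}} - \left\langle x_i^{(\infty)} y_j^{(\infty)} \right\rangle_{\text{model}},\]

which is exactly the maximum likelihood gradient \(\langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty\) of an RBM.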

Another justification for the greedy learning algorithm is that the incorrect inference procedure gives a variational lower bound on the log probability of the data. The higher layers learn a prior that is closer to the aggregated posterior distribution of the first hidden layer.

The greedy layer-by-layer training enabled the network to find a solution that achieved a competitively low error rate on MNIST without data augmentation or domain-specific tricks such as weight-sharing and subsampling.

Unlike existing discriminative models, the generative model after learning \(n\) layers can generate data in the same manner as a Helmholtz machine:

  1. Get an equilibrium sample from the top-level RBM by performing alternating Gibbs sampling between the top-level layer and the penultimate layer.

  2. Perform a top-down pass to get states for all the other layers.

This additional insight into the network makes it possible to interpret the non-linear, distributed representations of the hidden layers. The samples drawn from the learned generative model were representative of the real data.
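A sketch of this two-step generative pass, reusing the layer tuple layout from the pre-training sketch (the names and the number of Gibbs iterations are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
bernoulli = lambda p: (rng.random(p.shape) < p).astype(float)

def generate(top_W, top_b, top_c, lower_layers, n_gibbs=200):
    """Step 1: equilibrate the top-level RBM with alternating Gibbs
    sampling. Step 2: one directed top-down (ancestral) pass."""
    pen = bernoulli(np.full(top_W.shape[0], 0.5))       # random start
    for _ in range(n_gibbs):
        top = bernoulli(sigmoid(pen @ top_W + top_c))   # penultimate -> top
        pen = bernoulli(sigmoid(top @ top_W.T + top_b)) # top -> penultimate
    x = pen
    for W, b, _c in reversed(lower_layers):             # directed connections
        x = bernoulli(sigmoid(x @ W.T + b))
    return x
```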

Future Direction(s)

  • How would one interpret the non-linear, distributed representations when dealing with abstract concepts such as mathematics?

  • How to quantify the suboptimality of assuming the likelihood factorizes?

  • Since a single-hidden-layer feedforward neural network is already a universal approximator, would greedy learning lose information? Would it make recovering the original information much more difficult?

  • How to validate the learned concepts of each layer?

Question(s)

  • The test cases that the network got wrong were pretty easy. Were these scenarios not covered by the training set?

Analysis

Supervised learning of a mapping between data and labels is inefficient due to the limited supply of labels. Unsupervised feature learning is one way to make use of unlabeled data.

A brief introductory overview can be found in [HS06], but reading it is unnecessary if one has read or is going to read [HOT06]. A thorough understanding of EFHs (exponential family harmoniums) will make this paper easier to understand.

The idea of fine-tuning appeared previously as graph transformer networks [BBLC97]. The main takeaway is to connect pipelined components together as a feed-forward network of differentiable modules and perform backpropagation to tune the parameters of each component simultaneously.

The proof techniques and the concept of complementary priors look to be very useful for future directions. The paper would be even better if not for the confusing summary of RBM contrastive divergence. The most interesting insight is that ancestral sampling on an infinitely deep belief network with tied weights is equivalent to block Gibbs sampling on an RBM.

As an aside, do not bother reading [BLPL07]. It offers zero new insights compared to the other papers. Similarly, [EBC+10] is too verbose. In essence, the paper goes over a set of experiments suggesting unsupervised pre-training may be classified as an initialization-as-regularization strategy. The experiments’ results can be summarized as verifying the consistent usefulness of unsupervised pre-training.

References

BLPL07

Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems, 153–160. 2007.

BBLC97

Léon Bottou, Yoshua Bengio, and Yann Le Cun. Global training of document processing systems using graph transformer networks. In Proceedings of the 1997 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 489–494. IEEE, 1997.

EBC+10

Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.

Fau

Ryan Faulkner. Deep belief nets in reinforcement learning. http://www.cs.mcgill.ca/~rfaulk/presentationDBN_final.pdf. Accessed slides 16 and 17: 2017-09-01.

Fre

Marcus Frean. ACISS’09 tutorial on deep belief nets. http://goanna.cs.rmit.edu.au/~xiaodong/aciss09/tute-slides/ACISS-frean-tutorial-2x2.pdf. Accessed: 2017-09-06.

Hina

Geoffrey Hinton. 2007 NIPS tutorial on deep belief nets. https://www.cs.toronto.edu/~hinton/nipstutorial/nipstut3.pdf. Accessed slides 48 and 49: 2017-09-06.

Hinb

Geoffrey Hinton. UCL tutorial on deep belief nets. https://www.cs.toronto.edu/~hinton/ucltutorial.pdf. Accessed: 2017-09-06.

HDY+12

Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, and others. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

HOT06

Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.

HS06

Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.