############################################
Deep Residual Learning for Image Recognition
############################################

Motivation(s)
=============

Recent experiments reveal that network depth is crucial to attaining high accuracy.
However, as the network depth increases, accuracy saturates and then degrades rapidly.
Such degradation is not caused by overfitting because adding more layers to a suitably
deep model leads to higher training error.

Consider a shallow architecture and its deeper counterpart that adds more layers onto
it.  Suppose the added layers are identity mappings and the other layers are copied
from the learned shallower model.  By inspection, the deeper model should produce no
higher training error than its shallower counterpart.  Counterintuitively, experiments
show that current solvers on hand are unable to find solutions that are comparably
good or better than this constructed solution in a comparable amount of time.  The
degradation problem suggests that the solvers might have difficulties approximating
identity mappings with multiple nonlinear layers.

Proposed Solution(s)
====================

The authors propose a deep residual learning framework to address the degradation
problem.  Consider :math:`\mathcal{H}(\mathbf{x})` as an underlying mapping to be fit
by a few stacked layers (not necessarily the entire net) with :math:`\mathbf{x}`
denoting the inputs to the first of these layers.  The goal is to have the stacked
layers approximate a residual function
:math:`\mathcal{F}(\mathbf{x}) \doteq \mathcal{H}(\mathbf{x}) - \mathbf{x}` by
explicitly defining a building block

.. math::

   \mathbf{y}_l = h(\mathbf{x}_l) + \mathcal{F}(\mathbf{x}_l, \mathcal{W}_l)
   \quad \text{and} \quad
   \mathbf{x}_{l + 1} = f(\mathbf{y}_l)

where :math:`f` is the ReLU activation function,
:math:`\mathcal{W}_l = \left\{ W_{l, k} \right\}_{k = 1}^K`, and :math:`K` is the
number of layers in a Residual Unit.  The shortcut connection
:math:`h(\mathbf{x}_l) = \mathbf{x}_l` introduces zero additional parameters and
negligible computational cost.  When the dimensions of :math:`\mathbf{x}` and
:math:`\mathcal{F}` are not equal, the building block needs to learn an additional
linear projection :math:`W_s` such that :math:`h(\mathbf{x}_l) = W_s \mathbf{x}_l`.
There are no guarantees that identity mappings are optimal, but the hope is that the
reformulation acts as a preconditioner.  This concept of residual representations is
widely used in multigrid methods.  The theory of shortcut connections is well studied,
and shortcuts have been used to address vanishing/exploding gradients.

Evaluation(s)
=============

As a baseline, the authors constructed a plain model that has fewer filters and lower
complexity than VGG nets.  The plain network augmented with shortcut connections every
two layers is dubbed ResNet.  Both networks applied batch normalization (BN) with no
dropout.

The experiments on ImageNet 2012 demonstrate that vanishing gradients are an unlikely
cause of the plain network's degradation.  The backward propagated gradients exhibit
healthy norms because BN ensures forward propagated signals have non-zero variances.
Furthermore, the authors still observed the degradation problem even with more
training iterations.  The degradation problem is mitigated with ResNet's shortcut
connections, which also boosted the overall convergence rate.

To handle dimension mismatch in ResNet, the authors explored identity shortcuts and
projection shortcuts.  The former performs an identity mapping with extra zero entries
padded for the increased dimensions, while the latter uses :math:`1 \times 1`
convolutions to match dimensions.
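As a concrete illustration of the building block and the two shortcut options just
described, below is a minimal sketch of a two-layer Residual Unit.  It assumes a
PyTorch-style implementation; the framework, class name, and argument names are
illustrative choices rather than details taken from the paper.

.. code-block:: python

   import torch.nn as nn


   class BasicResidualUnit(nn.Module):
       """Sketch of y = F(x) + h(x) followed by ReLU."""

       def __init__(self, in_channels, out_channels, stride=1):
           super().__init__()
           # F: two 3x3 convolutions, each followed by BN.
           self.f = nn.Sequential(
               nn.Conv2d(in_channels, out_channels, 3, stride, padding=1, bias=False),
               nn.BatchNorm2d(out_channels),
               nn.ReLU(inplace=True),
               nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False),
               nn.BatchNorm2d(out_channels),
           )
           if stride != 1 or in_channels != out_channels:
               # Projection shortcut h(x) = W_s x to match dimensions.
               self.h = nn.Sequential(
                   nn.Conv2d(in_channels, out_channels, 1, stride, bias=False),
                   nn.BatchNorm2d(out_channels),
               )
           else:
               # Identity shortcut h(x) = x: no extra parameters.
               self.h = nn.Identity()
           self.relu = nn.ReLU(inplace=True)

       def forward(self, x):
           # The last ReLU of the unit is applied after the element-wise addition.
           return self.relu(self.f(x) + self.h(x))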
The results indicate that projection shortcuts give only marginal gains over identity
shortcuts.  This is important for exploring bottleneck building blocks, which
outperform the state of the art despite having lower complexity than VGG.  Experiments
on CIFAR-10, PASCAL VOC, and COCO also signal the consistent benefit of deep residual
networks.

When there are more layers, the analysis of layer responses shows that an individual
layer of a ResNet tends to modify the signal less.  Compared to their plain
counterparts, ResNets generally have smaller response magnitudes.  Even though the
proposed method succeeded in training ResNet-1202 to convergence, the accuracy of
ResNet-1202 is inferior to that of ResNet-110.

Future Direction(s)
===================

- What is the performance of residual networks without batch normalization?  How does
  dropout improve it?
- :cite:`he2016identity` only looked at dropout as a shortcut connection.  How well
  does dropout perform when used on the indirect flow path, i.e., within the residual
  unit?
- The proposed deep residual framework eschews any kind of
  :doc:`unsupervised pre-training `.  How much does pre-training help ResNets?

Question(s)
===========

- Can multiple nonlinear layers asymptotically approximate complicated functions when
  the :doc:`activation function is continuous, bounded and nonconstant `?

Analysis
========

When designing a new building block for neural networks, ensure there is a direct path
for propagating information throughout the entire network.

One flaw in the analysis is the authors' assertion that ResNet-1202's low accuracy is
due to overfitting because the dataset is too small and a strong regularizer like
dropout is needed.  :cite:`he2016identity` is a follow-up work that invalidates this
claim.  A questionable result that :cite:`he2016identity` presents is the practicality
of a network one thousand layers deep.  For a ten-fold increase in complexity, one is
able to reduce the error by one percent, which is certainly inefficient.
Nevertheless, the formulation, derivations, and illustrations of the Residual Unit are
very useful in the design of future network architectures.

The predecessors of ResNet are LeNet, AlexNet, VGGNet, and GoogLeNet.  The ideas
presented in LeNet and AlexNet have become so widespread in the deep learning
literature that reading the original works
:cite:`krizhevsky2012imagenet,lecun1998gradient` may not be fruitful.  VGGNet is still
worthwhile to read to understand the motivation behind the current design of
convolutional layers :cite:`simonyan2014very`, which can be summarized as

(i) For the same output feature map size, the layers have the same number of filters.
(#) If the feature map size is halved, the number of filters is doubled to preserve
    the time complexity per layer.

Likewise, GoogLeNet's Inception (network within a network) module popularized the use
of :math:`1 \times 1` convolutional layers to reduce the computations of stackable
local network topologies :cite:`szegedy2015going`.  One issue with the Inception
architecture is the handcrafted stage-by-stage network topology designs.  To resolve
this issue, ResNeXt homogenized each block's topology :cite:`xie2017aggregated`.
However, ResNeXt did not investigate the channel relationship aspect of neural network
design.

Since convolutional filters are expected to produce informative combinations by fusing
spatial and channel-wise information within local receptive fields,
:cite:`hu2018squeeze` proposes SENet: explicitly model the interdependencies between
the channels of convolutional features via a Squeeze-and-Excitation (SE) block.  The
authors argue that the mechanism allows the network to perform feature recalibration:
early layers learn to excite informative features in a class-agnostic manner, and
later layers learn increasingly specialised class-specific features.  Their
experiments defined the SE block as
:menuselection:`Global Pooling --> FC --> ReLU --> FC --> Sigmoid --> Scale`, where
the Scale step rescales each channel of the original input.
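The following is a minimal sketch of such an SE block, assuming a PyTorch-style
implementation; the class name, argument names, and the default reduction ratio
``r`` are illustrative choices rather than details taken from the paper.

.. code-block:: python

   import torch.nn as nn


   class SEBlock(nn.Module):
       """Sketch of Global Pooling -> FC -> ReLU -> FC -> Sigmoid -> Scale."""

       def __init__(self, channels, r=16):
           super().__init__()
           self.pool = nn.AdaptiveAvgPool2d(1)   # squeeze: global average pooling
           self.fc = nn.Sequential(              # excitation: bottleneck of ratio r
               nn.Linear(channels, channels // r),
               nn.ReLU(inplace=True),
               nn.Linear(channels // r, channels),
               nn.Sigmoid(),
           )

       def forward(self, x):
           n, c, _, _ = x.shape
           s = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)
           return x * s                          # scale each channel of the input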
SENet achieves 2.251% error versus the previous 2.991% error in the ILSVRC 2017
classification task at the cost of roughly 10% additional overhead in computation and
parameters.  Furthermore, the designer of the network now has to choose the reduction
ratio, which controls the capacity and computational cost of each SE block in the
model.

All of the neural network architectures described thus far are designed by hand.  A
growing trend is to have the network design itself for the desired task.  Neural
architecture search is a recent technique that capitalizes on the repeated motifs CNNs
are built upon (e.g. convolutional filter banks, nonlinearities, selection of
connections).  NASNet is an example neural architecture that achieved state-of-the-art
performance :cite:`zoph2018learning`.  NASNet consists of two motifs: normal cells and
reduction cells.  The former are convolutional cells that return a feature map of the
same dimension, while the latter return a feature map whose height and width are
reduced by a factor of two.  Although these cells are designed via search space
optimization, the optimization only explores a predefined set of operations and
heuristics (e.g. double the number of filters in the output to maintain a roughly
constant hidden state dimension).

The authors found that reinforcement learning is slightly more efficient than random
search in the NASNet search space.  The idea is to use a controller RNN that samples
child networks with different architectures.  Each network is trained to convergence
to obtain some accuracy on a held-out validation set.  The resulting accuracies are
used to update the controller weights (via policy gradient) so that the controller
will generate better architectures over time.  Even though NASNet slightly beats the
state of the art in terms of accuracy with half the number of parameters and
multiply-add operations, it requires over four days using 500 P100 GPUs to find this
instance of NASNet trained on CIFAR-10.  However, the only hyperparameters to tune are
the number of cell repeats and the number of filters in the initial convolutional
cell.

Notes
=====

Identity Mappings
-----------------

:cite:`he2016identity` investigates the propagation formulations behind the connection
mechanisms of deep residual networks, focusing on creating a “direct” path for
propagating information through the entire network.  The original Residual Unit takes
the form weight -> BN -> ReLU -> weight -> BN -> addition -> ReLU: BN is used after
each weight layer, and ReLU is applied after each BN except that the last ReLU in a
Residual Unit comes after the element-wise addition.  Although BN normalizes the
signal, that signal is soon added to the shortcut, so the merged signal is not
normalized.  This unnormalized signal is then used as the input of the next weight
layer.  The authors observed that using an asymmetric after-addition activation is
equivalent to constructing a pre-activation Residual Unit.  Hence, they rearranged the
layers to BN -> ReLU -> weight -> BN -> ReLU -> weight -> addition.
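For illustration, here is a minimal sketch of such a full pre-activation Residual
Unit, again assuming a PyTorch-style implementation (the framework and the names are
illustrative, not from the paper).

.. code-block:: python

   import torch.nn as nn


   class PreActResidualUnit(nn.Module):
       """Sketch of BN -> ReLU -> weight -> BN -> ReLU -> weight -> addition."""

       def __init__(self, channels):
           super().__init__()
           self.f = nn.Sequential(
               nn.BatchNorm2d(channels),
               nn.ReLU(inplace=True),
               nn.Conv2d(channels, channels, 3, padding=1, bias=False),
               nn.BatchNorm2d(channels),
               nn.ReLU(inplace=True),
               nn.Conv2d(channels, channels, 3, padding=1, bias=False),
           )

       def forward(self, x):
           # The shortcut and the addition are left untouched, so the unit
           # computes x_{l+1} = x_l + F(x_l) with no activation after the sum.
           return x + self.f(x)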
Mathematically, this rearrangement defines an identity mapping
:math:`f(\mathbf{y}_l) = \mathbf{y}_l` such that

.. math::

   \mathbf{x}_L = \mathbf{x}_l + \sum_{i = l}^{L - 1} \mathcal{F}(\mathbf{x}_i, \mathcal{W}_i)

for any deeper unit :math:`L` and any shallower unit :math:`l`.  Analyzing the
gradients of the proposed formulation (i.e.
:math:`\frac{\partial \mathcal{E}}{\partial \mathbf{x}_l}`) reveals that information
is directly propagated back to any shallower unit :math:`l`.  This implies that the
gradient of a layer does not vanish even when the weights are arbitrarily small.

The full pre-activation Residual Unit enabled ResNet-1001 to outperform all previous
models on CIFAR-10, CIFAR-100, and ILSVRC 2012.  Furthermore, ResNet-200 achieved a
better result than Inception v3 when using scale and aspect ratio augmentation.  Even
though non-identity shortcuts have stronger representational abilities, they often
impede information propagation, which causes optimization issues.  The CIFAR-10
experiments show that skip connections of scaling, gating, dropout, and
:math:`1 \times 1` convolutions lead to higher training loss and error.  The earlier
results with :math:`1 \times 1` convolutions in :cite:`he2016deep` are misleading
because that network is only 34 layers deep.

An extreme variation of ResNet's shortcut connections has been explored in DenseNet
:cite:`huang2016densely`.  DenseNet achieved higher accuracy and efficiency than
ResNet with pre-activation residual units by defining the residual function to be

.. math::

   \mathcal{F}(\mathbf{x}_l, \mathcal{W}_l) =
     -h(\mathbf{x}_l) +
     \mathop{\mathcal{F}}\left(
       \begin{matrix}
         \mathbf{x}_0, \mathcal{W}_0\\
         \vdots\\
         \mathbf{x}_l, \mathcal{W}_l
       \end{matrix}
     \right).

Instead of combining features through summation before they are passed into a layer,
the proposed dense block combines features by concatenating them.  Each application of
the residual function now produces :math:`k` feature maps.  The hyperparameter
:math:`k` describes the growth rate of the network.  Within a dense block is a stack
of bottleneck layers, each performing
BN-ReLU-Conv(:math:`1 \times 1`)-BN-ReLU-Conv(:math:`3 \times 3`).  The :math:`n`
layers in the stack are connected in a manner analogous to a Markov chain of order
:math:`n`.  Between each pair of dense blocks is a transition layer that does the
typical BN-ReLU-Conv(:math:`1 \times 1`)-Pooling.

Convolutional Layer
-------------------

A convolutional (conv) layer accepts a volume of size
:math:`W_i \times H_i \times D_i`.  The hyperparameters to tune are

- number of filters :math:`K`,
- :math:`F \times F` spatial extent,
- stride :math:`S`, and
- amount of zero padding :math:`P`.

The conv layer produces a volume of size
:math:`W_{i + 1} \times H_{i + 1} \times D_{i + 1}` where

- :math:`W_{i + 1} = \frac{W_i - F + 2P}{S} + 1`,
- :math:`H_{i + 1} = \frac{H_i - F + 2P}{S} + 1`, and
- :math:`D_{i + 1} = K`.

With parameter sharing, a conv layer introduces :math:`F^2 D_i` weights plus a bias
per filter.  The total number of parameters is :math:`F^2 D_i K` weights and :math:`K`
biases.
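As a quick sanity check of these formulas, the snippet below (purely illustrative; the
function name and the example configuration are not from the paper) computes the
output volume and the parameter count for a hypothetical conv layer.

.. code-block:: python

   def conv_output_and_params(W, H, D, K, F, S, P):
       """Apply the formulas above: output volume size and parameter count."""
       W_out = (W - F + 2 * P) // S + 1
       H_out = (H - F + 2 * P) // S + 1
       D_out = K
       params = F * F * D * K + K   # F^2 D_i weights per filter plus one bias each
       return (W_out, H_out, D_out), params

   # Example: a 224x224x3 input with K=64 filters of spatial extent F=3,
   # stride S=1, and zero padding P=1.
   print(conv_output_and_params(224, 224, 3, K=64, F=3, S=1, P=1))
   # ((224, 224, 64), 1792)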
Convolutional to Fully-Connected
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Given the preceding configuration, the conv layer can be implemented by a
fully-connected layer as follows.

1) Reshape the input volume to a vector
   :math:`\mathbf{x}_i \in \mathbb{R}^{W_i H_i D_i + 1}`.  The last component is
   always set to one to incorporate the filter kernel's bias term.
#) Transform filter :math:`k` with weights
   :math:`\mathbf{H}_{i, k} \in \mathbb{R}^{F \times F \times D_i}` and bias
   :math:`b_{i, k} \in \mathbb{R}` to

   .. math::

      \mathcal{H}_{i, k} =
        \left\{
          \mathbf{h}_{i, k, j} \in \mathbb{R}^{W_i H_i D_i + 1}
        \right\}_{j = 1}^{W_{i + 1} H_{i + 1}}.

   Each vector :math:`\mathbf{h}_{i, k, j}` is sparse and encodes the discrete
   convolution operation in one region of the image.  Many of the entries are equal
   due to parameter sharing, and the last entry contains the bias term.  By
   inspection, the structure of :math:`\mathcal{H}_{i, k}` is akin to a Toeplitz
   matrix.

Stacking :math:`\mathcal{H}_i = \left\{ \mathcal{H}_{i, k} \right\}_{k = 1}^K` into a
sparse block matrix gives the desired fully-connected layer.

Fully-Connected to Convolutional
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Consider a fully-connected layer with :math:`N` hidden units looking at some input
volume of size :math:`W_i \times H_i \times D_i`.  Each hidden unit will have
:math:`W_i H_i D_i` weights.  If the hidden units are treated as separate filters, the
fully-connected layer can be implemented by a conv layer as follows: :math:`K = N`,
:math:`S = 1`, :math:`P = 0`, and the spatial extent is :math:`W_i \times H_i`.

Pooling Layer
-------------

A pooling layer accepts a volume of size :math:`W_i \times H_i \times D_i`.  The
hyperparameters to tune are

- :math:`F \times F` spatial extent, and
- stride :math:`S`.

The pooling layer produces a volume of size
:math:`W_{i + 1} \times H_{i + 1} \times D_{i + 1}` where

- :math:`W_{i + 1} = \frac{W_i - F}{S} + 1`,
- :math:`H_{i + 1} = \frac{H_i - F}{S} + 1`, and
- :math:`D_{i + 1} = D_i`.

Since it computes a fixed function of the input, this layer introduces zero
parameters.

VGGNet
------

The experiments and analysis of :cite:`simonyan2014very` standardized the
hyperparameters of a conv layer.  They deliberately restricted the spatial extent of
each filter to :math:`3 \times 3` because that is the smallest size able to capture
the notion of left, right, up, down, and center.  The stride is fixed to one, and the
zero padding is set to one to preserve the spatial resolution after the convolution.
When stacking conv layers, each layer has the same number of filters.  Downsampling is
handled via max pooling over a :math:`2 \times 2` pixel window with a stride of two.

Their analysis of receptive fields and nonlinearity indicates that larger filters are
undesirable.  Since the input volume's spatial resolution is preserved when stacking
conv layers, each additional :math:`3 \times 3` conv layer extends the effective
receptive field by two.  A stack of two :math:`3 \times 3` conv layers gives each
neuron an effective receptive field of :math:`5 \times 5`.  Likewise, three such
layers have a :math:`7 \times 7` effective receptive field.  Assuming that both the
input and the output of a three-layer :math:`3 \times 3` convolution stack have
:math:`C` channels, the stack is parameterized by :math:`3 (3^2 C^2) = 27 C^2` weights
instead of the :math:`7^2 C^2 = 49 C^2` weights that a single :math:`7 \times 7` conv
layer would require.  Furthermore, between consecutive conv layers is a nonlinear
activation function that makes the decision function more discriminative.
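This trade-off is easy to check numerically.  The snippet below (illustrative only;
the helper names are made up) compares the weight counts and effective receptive
fields just discussed.

.. code-block:: python

   def stack_weights(num_layers, kernel, channels):
       """Weights in a stack of conv layers, each with C input and C output channels."""
       return num_layers * kernel ** 2 * channels ** 2

   def effective_receptive_field(num_layers, kernel):
       """Effective receptive field of a stride-one stack of kernel x kernel convs."""
       return 1 + num_layers * (kernel - 1)

   C = 256
   print(stack_weights(3, 3, C), effective_receptive_field(3, 3))  # 1769472 (27 C^2), 7
   print(stack_weights(1, 7, C), effective_receptive_field(1, 7))  # 3211264 (49 C^2), 7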
:cite:`simonyan2014very` also explored :math:`1 \times 1` conv filters as a way to
increase the nonlinearity of the decision function without affecting the receptive
fields of the conv layers.  However, these filters did not give as good a result as
spatial conv filters.

GoogLeNet
---------

The Inception architecture focuses on how an optimal local sparse structure of a
convolutional vision network can be approximated efficiently by readily available
hardware :cite:`szegedy2015going`.  It assumes that if the probability distribution of
the dataset is representable by a large, very sparse deep neural network, then the
optimal network topology can be constructed layer after layer by analyzing the
correlation statistics of the preceding layer's activations and clustering neurons
with highly correlated outputs.  In the lower layers that are closer to the input,
correlated units would concentrate in local regions.  Thus, the clusters concentrated
in a single region can be covered by a layer of :math:`1 \times 1` convolutions in the
next layer.  However, there will also be a smaller number of more spatially spread out
clusters that can be covered by convolutions over larger patches, in addition to a
decreasing number of patches over larger and larger regions.

In order to avoid patch-alignment issues, the Inception architecture uses only
convolutions of size :math:`1 \times 1`, :math:`3 \times 3`, :math:`5 \times 5`, and a
:math:`3 \times 3` max pooling.  Their output filter banks are concatenated into a
single output stack forming the input of the next stage.  Together, these components
form the Inception module.  Unfortunately, stacking these modules is computationally
expensive because the channel depth of the merged output stack can only grow from one
module to the next.  To reduce the computational bottleneck, :cite:`szegedy2015going`
performs dimensionality reduction using :math:`1 \times 1` convolutions, assuming low
dimensional embeddings still contain a lot of information about a relatively large
image patch.  Furthermore, they used ReLUs to impose sparsity on the embedding.  In
order to train this stack of modules, the intermediate Inception layers required
auxiliary classifiers to backpropagate gradients to the lower stages.

Inception-v3
^^^^^^^^^^^^

The second and third generations of the Inception architecture focused on scaling up
convolutional networks to meet the demands of big data :cite:`szegedy2016rethinking`.
The architectural modifications can be summarized as:

1) Avoid representational bottlenecks, especially early in the network.  The
   representation size should gently decrease from the inputs to the outputs before
   reaching the final representation used for the task at hand.
#) Increasing the activations per tile in a convolutional network allows for more
   disentangled features, which results in faster training.
#) Spatial aggregation can be done over lower dimensional embeddings with little loss
   in representational power due to the strong correlation between adjacent units.
#) Balance the width and depth of the network.  The optimal performance of the network
   can be reached by balancing the number of filters per stage and the depth of the
   network.

The following optimizations are necessary to increase the depth of the network while
keeping the computational complexity acceptable:

- When dealing with lower resolution input at constant network complexity, one can
  reduce the receptive field after the first layer and still achieve roughly the same
  accuracy.
- Instead of the typical max pooling layer, one can modify the Inception module to use
  a stride of two for both the pooling layer and the spatial convolutions.  On a
  related note, :cite:`springenberg2014striving` presents some empirical evidence that
  when a pooling layer is replaced with a strided convolution layer, performance
  stabilizes and even improves on the base model.  However, removing the pooling layer
  and just increasing the stride of the previous layer results in diminished
  performance in all settings.
- Instead of replacing higher order convolutions with stacks of :math:`3 \times 3`
  convolutions, consider the computationally cheaper asymmetric convolutions: replace
  an :math:`n \times n` convolution with an :math:`n \times 1` convolution followed by
  a :math:`1 \times n` convolution.  The experiments indicate that factorizing
  convolutions works well on feature maps whose sizes range from twelve to twenty.
- Removing the lower auxiliary classifiers did not have any adverse effect on the
  final quality of the network.  The virtually identical training progression
  demonstrates that :cite:`szegedy2015going` incorrectly hypothesized that these
  branches help evolve the low-level features.

Inception-v4
^^^^^^^^^^^^

The fourth generation of the Inception architecture focused on simplifying the
architecture to take advantage of residual connections :cite:`szegedy2017inception`.
The Inception-Residual block is essentially the Residual block with the indirect path
flowing through an Inception block followed by an activation scaling module.  The
scaling module was manually configured to scale the residuals before adding them to
the accumulated layer activations.  Without this hack, the residual variants started
to exhibit instabilities and died early in training when the number of filters
exceeded one thousand.  In this case, dying means the last layer before the average
pooling started to produce only zeros after a few tens of thousands of iterations.
The experiments reveal that the residual connections lead to dramatically improved
training speed for the Inception architecture.  However, the accuracy of the hybrid
network is only fractionally higher than that of the Inception-v4 architecture without
residual connections.  Furthermore, each network's gains in single-frame performance
do not translate into large gains on ensembles: an ensemble of ResNet-151 achieves
comparable accuracy.

ResNeXt
-------

ResNeXt combines VGG's strategy of repeating layers with Inception's customized
stage-by-stage split-transform-merge strategy under the residual learning framework
:cite:`xie2017aggregated`.  The underlying concept is Network-in-Neuron: define the
residual function such that

.. math::

   \mathcal{F}(\mathbf{x}) = \sum_{i = 1}^C \mathcal{T}_i(\mathbf{x})

where :math:`\left\{ \mathcal{T}_i(\mathbf{x}) \right\}_{i = 1}^C` is a set of
arbitrary (heterogeneous) functions.  Here the cardinality :math:`C` refers to the
size of the set of transformations.  ResNeXt is fundamentally a ResNet whose residual
bottleneck block template
:math:`\begin{bmatrix} 1 \times 1, d\\ 3 \times 3, d\\ 1 \times 1, 4d \end{bmatrix}`
is replaced with grouped convolutions
:math:`\begin{bmatrix} 1 \times 1, 2d\\ 3 \times 3, 2d, C = 32\\ 1 \times 1, 4d \end{bmatrix}`.
In a grouped convolutional layer, the :math:`2d` input and output channels are divided
into :math:`C` groups, and convolutions are separately performed within each group
:cite:`krizhevsky2012imagenet`.  Each group's activation maps are then concatenated
into a single output tensor.
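For reference, below is a minimal sketch of this grouped-convolution bottleneck,
assuming a PyTorch-style implementation; the class name, the argument defaults, and
the identity shortcut are illustrative choices rather than details from the paper.

.. code-block:: python

   import torch.nn as nn


   class ResNeXtBottleneck(nn.Module):
       """Sketch of the [1x1, 2d; 3x3, 2d with C groups; 1x1, 4d] template."""

       def __init__(self, d=64, cardinality=32):
           super().__init__()
           width = 2 * d
           self.f = nn.Sequential(
               nn.Conv2d(4 * d, width, 1, bias=False),
               nn.BatchNorm2d(width),
               nn.ReLU(inplace=True),
               # groups=cardinality splits the width channels into C groups and
               # performs the 3x3 convolution separately within each group.
               nn.Conv2d(width, width, 3, padding=1, groups=cardinality, bias=False),
               nn.BatchNorm2d(width),
               nn.ReLU(inplace=True),
               nn.Conv2d(width, 4 * d, 1, bias=False),
               nn.BatchNorm2d(4 * d),
           )
           self.relu = nn.ReLU(inplace=True)

       def forward(self, x):
           # Identity shortcut; a projection would be needed if dimensions changed.
           return self.relu(x + self.f(x))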
Two more explicit, albeit slower, reformulations of this block are

.. math::

   \begin{bmatrix}
     1 \times 1, 2d\\
     3 \times 3, 2d, C = 32\\
     1 \times 1, 4d
   \end{bmatrix}
   =
   \begin{bmatrix}
     1 \times 1, \frac{2d}{C} & \cdots & 1 \times 1, \frac{2d}{C}\\
     3 \times 3, \frac{2d}{C} & \cdots & 3 \times 3, \frac{2d}{C}\\
     & \text{concatenation} &\\
     & 1 \times 1, 4d &
   \end{bmatrix}
   =
   \begin{bmatrix}
     1 \times 1, \frac{2d}{C} & \cdots & 1 \times 1, \frac{2d}{C}\\
     3 \times 3, \frac{2d}{C} & \cdots & 3 \times 3, \frac{2d}{C}\\
     1 \times 1, 4d & \cdots & 1 \times 1, 4d\\
     & \text{aggregated transformation} &
   \end{bmatrix}.

The experiments indicate that increasing cardinality while keeping the model size and
the number of parameters constant is a more effective way of gaining accuracy,
especially for shallow networks, than going deeper or wider.  Here width refers to the
number of channels in a layer.  When tuning :math:`C`, it is often not worthwhile to
make :math:`\frac{2d}{C} < 4`.  Moreover, a ResNeXt block must be at least three
layers deep to produce nontrivial topologies.  When the block's depth is two, the
aggregated transformation leads to a trivially wide, dense module.

.. rubric:: References

.. bibliography:: refs.bib
   :all: