############################################
Deep Residual Learning for Image Recognition
############################################

Motivation(s)
=============

Recent experiments reveal that network depth is crucial to attaining high accuracy.
However, as the network depth increases, accuracy saturates and then degrades rapidly.
Such degradation is not caused by overfitting because adding more layers to a suitably
deep model leads to higher training error.

Consider a shallow architecture and its deeper counterpart that adds more layers onto
it.  Suppose the added layers are identity mappings and the other layers are copied
from the learned shallower model.  By inspection, the deeper model should produce no
higher training error than its shallower counterpart.  Counterintuitively, experiments
show that current solvers on hand are unable to find solutions that are comparably
good or better than this constructed solution in a comparable amount of time.  The
degradation problem suggests that the solvers might have difficulties approximating
identity mappings with multiple nonlinear layers.

Proposed Solution(s)
====================

The authors propose a deep residual learning framework to address the degradation
problem.  Consider :math:`\mathcal{H}(\mathbf{x})` as an underlying mapping to be fit
by a few stacked layers (not necessarily the entire net) with :math:`\mathbf{x}`
denoting the inputs to the first of these layers.  The goal is to have the stacked
layers approximate a residual function
:math:`\mathcal{F}(\mathbf{x}) \doteq \mathcal{H}(\mathbf{x}) - \mathbf{x}` by
explicitly defining a building block

.. math::

   \mathbf{y}_l = h(\mathbf{x}_l) + \mathcal{F}(\mathbf{x}_l, \mathcal{W}_l)
   \quad \text{and} \quad
   \mathbf{x}_{l + 1} = f(\mathbf{y}_l)

where :math:`f` is the ReLU activation function,
:math:`\mathcal{W}_l = \left\{ W_{l, k} \right\}_{k = 1}^K`, and :math:`K` is the
number of layers in a Residual Unit.  The shortcut connection
:math:`h(\mathbf{x}_l) = \mathbf{x}_l` introduces zero additional parameters and
negligible computational cost.  When the dimensions of :math:`\mathbf{x}` and
:math:`\mathcal{F}` are not equal, the building block needs to learn an additional
linear projection :math:`W_s` such that :math:`h(\mathbf{x}_l) = W_s \mathbf{x}_l`.
There are no guarantees that identity mappings are optimal, but the hope is that the
reformulation acts as a preconditioner.  This concept of residual representations is
widely used in multigrid methods.  The theory of shortcut connections is well studied,
and shortcuts have been used to address vanishing/exploding gradients.

Evaluation(s)
=============

As a baseline, the authors constructed a plain model that has fewer filters and lower
complexity than VGG nets.  The plain network augmented with shortcut connections every
two layers is dubbed ResNet.  Both networks applied batch normalization (BN) with no
dropout.

The experiments on ImageNet 2012 demonstrate that vanishing gradients are an unlikely
cause of the plain network's degradation.  The backward propagated gradients exhibit
healthy norms because BN ensures forward propagated signals have non-zero variances.
Furthermore, the authors still observed the degradation problem even with more
training iterations.  The degradation problem is mitigated with ResNet's shortcut
connections, which also boosted the overall convergence rate.

To handle dimension mismatch in ResNet, the authors explored identity shortcuts and
projection shortcuts.  The former performs an identity mapping with extra zero entries
padded for the increased dimensions, while the latter uses :math:`1 \times 1`
convolutions to match dimensions.
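As a concrete illustration of the building block and the two shortcut options just
described, below is a minimal sketch of a two-layer Residual Unit.  It assumes a
PyTorch-style implementation; the framework, class name, and argument names are
illustrative choices rather than details taken from the paper.

.. code-block:: python

   import torch.nn as nn


   class BasicResidualUnit(nn.Module):
       """Sketch of y = F(x) + h(x) followed by ReLU."""

       def __init__(self, in_channels, out_channels, stride=1):
           super().__init__()
           # F: two 3x3 convolutions, each followed by BN.
           self.f = nn.Sequential(
               nn.Conv2d(in_channels, out_channels, 3, stride, padding=1, bias=False),
               nn.BatchNorm2d(out_channels),
               nn.ReLU(inplace=True),
               nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False),
               nn.BatchNorm2d(out_channels),
           )
           if stride != 1 or in_channels != out_channels:
               # Projection shortcut h(x) = W_s x to match dimensions.
               self.h = nn.Sequential(
                   nn.Conv2d(in_channels, out_channels, 1, stride, bias=False),
                   nn.BatchNorm2d(out_channels),
               )
           else:
               # Identity shortcut h(x) = x: no extra parameters.
               self.h = nn.Identity()
           self.relu = nn.ReLU(inplace=True)

       def forward(self, x):
           # The last ReLU of the unit is applied after the element-wise addition.
           return self.relu(self.f(x) + self.h(x))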
The results indicate that projection shortcuts give only marginal gains over identity
shortcuts.  This is important for exploring bottleneck building blocks, which
outperform the state of the art despite having lower complexity than VGG.  Experiments
on CIFAR-10, PASCAL VOC, and COCO also signal the consistent benefit of deep residual
networks.

When there are more layers, the analysis of layer responses shows that an individual
layer of a ResNet tends to modify the signal less.  Compared to their plain
counterparts, ResNets generally have smaller response magnitudes.  Even though the
proposed method succeeded in training ResNet-1202 to convergence, the accuracy of
ResNet-1202 is inferior to that of ResNet-110.

Future Direction(s)
===================

- What is the performance of residual networks without batch normalization?  How does
  dropout improve it?
- :cite:`he2016identity` only looked at dropout as a shortcut connection.  How well
  does dropout perform when used on the indirect flow path, i.e., within the residual
  unit?
- The proposed deep residual framework eschews any kind of
  :doc:`unsupervised pre-training `.  How much does pre-training help ResNets?

Question(s)
===========

- Can multiple nonlinear layers asymptotically approximate complicated functions when
  the :doc:`activation function is continuous, bounded and nonconstant `?

Analysis
========

When designing a new building block for neural networks, ensure there is a direct path
for propagating information throughout the entire network.

One flaw in the analysis is the authors' assertion that ResNet-1202's low accuracy is
due to overfitting because the dataset is too small and a strong regularizer like
dropout is needed.  :cite:`he2016identity` is a follow-up work that invalidates this
claim.  A questionable result that :cite:`he2016identity` presents is the practicality
of a network one thousand layers deep.  For a ten-fold increase in complexity, one is
able to reduce the error by one percent, which is certainly inefficient.
Nevertheless, the formulation, derivations, and illustrations of the Residual Unit are
very useful in the design of future network architectures.

The predecessors of ResNet are LeNet, AlexNet, VGGNet, and GoogLeNet.  The ideas
presented in LeNet and AlexNet have become so widespread in the deep learning
literature that reading the original works
:cite:`krizhevsky2012imagenet,lecun1998gradient` may not be fruitful.  VGGNet is still
worthwhile to read to understand the motivation behind the current design of
convolutional layers :cite:`simonyan2014very`, which can be summarized as

(i) For the same output feature map size, the layers have the same number of filters.
(#) If the feature map size is halved, the number of filters is doubled to preserve
    the time complexity per layer.

Likewise, GoogLeNet's Inception (network within a network) module popularized the use
of :math:`1 \times 1` convolutional layers to reduce the computations of stackable
local network topologies :cite:`szegedy2015going`.  One issue with the Inception
architecture is the handcrafted stage-by-stage network topology designs.  To resolve
this issue, ResNeXt homogenized each block's topology :cite:`xie2017aggregated`.
However, ResNeXt did not investigate the channel relationship aspect of neural network
design.

Since convolutional filters are expected to produce informative combinations by fusing
spatial and channel-wise information within local receptive fields,
:cite:`hu2018squeeze` proposes SENet: explicitly model the interdependencies between
the channels of convolutional features via a Squeeze-and-Excitation (SE) block.  The
authors argue that the mechanism allows the network to perform feature recalibration:
early layers learn to excite informative features in a class-agnostic manner, and
later layers learn increasingly specialised class-specific features.  Their
experiments defined the SE block as
:menuselection:`Global Pooling --> FC --> ReLU --> FC --> Sigmoid --> Scale`, where
the Scale step rescales each channel of the original input.
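The following is a minimal sketch of such an SE block, assuming a PyTorch-style
implementation; the class name, argument names, and the default reduction ratio
``r`` are illustrative choices rather than details taken from the paper.

.. code-block:: python

   import torch.nn as nn


   class SEBlock(nn.Module):
       """Sketch of Global Pooling -> FC -> ReLU -> FC -> Sigmoid -> Scale."""

       def __init__(self, channels, r=16):
           super().__init__()
           self.pool = nn.AdaptiveAvgPool2d(1)   # squeeze: global average pooling
           self.fc = nn.Sequential(              # excitation: bottleneck of ratio r
               nn.Linear(channels, channels // r),
               nn.ReLU(inplace=True),
               nn.Linear(channels // r, channels),
               nn.Sigmoid(),
           )

       def forward(self, x):
           n, c, _, _ = x.shape
           s = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)
           return x * s                          # scale each channel of the input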
SENet achieves 2.251% error versus the previous 2.991% error in the ILSVRC 2017
classification task at the cost of roughly 10% additional overhead in computation and
parameters.  Furthermore, the designer of the network now has to choose the reduction
ratio, which controls the capacity and computational cost of each SE block in the
model.

All of the neural network architectures described thus far are designed by hand.  A
growing trend is to have the network design itself for the desired task.  Neural
architecture search is a recent technique that capitalizes on the repeated motifs CNNs
are built upon (e.g. convolutional filter banks, nonlinearities, selection of
connections).  NASNet is an example neural architecture that achieved state-of-the-art
performance :cite:`zoph2018learning`.  NASNet consists of two motifs: normal cells and
reduction cells.  The former are convolutional cells that return a feature map of the
same dimension, while the latter return a feature map whose height and width are
reduced by a factor of two.  Although these cells are designed via search space
optimization, the optimization only explores a predefined set of operations and
heuristics (e.g. double the number of filters in the output to maintain a roughly
constant hidden state dimension).

The authors found that reinforcement learning is slightly more efficient than random
search in the NASNet search space.  The idea is to use a controller RNN that samples
child networks with different architectures.  Each network is trained to convergence
to obtain some accuracy on a held-out validation set.  The resulting accuracies are
used to update the controller weights (via policy gradient) so that the controller
will generate better architectures over time.  Even though NASNet slightly beats the
state of the art in terms of accuracy with half the number of parameters and
multiply-add operations, it requires over four days using 500 P100 GPUs to find this
instance of NASNet trained on CIFAR-10.  However, the only hyperparameters to tune are
the number of cell repeats and the number of filters in the initial convolutional
cell.

Notes
=====

Identity Mappings
-----------------

:cite:`he2016identity` investigates the propagation formulations behind the connection
mechanisms of deep residual networks, focusing on creating a “direct” path for
propagating information through the entire network.  The original Residual Unit takes
the form weight -> BN -> ReLU -> weight -> BN -> addition -> ReLU: BN is used after
each weight layer, and ReLU is applied after each BN except that the last ReLU in a
Residual Unit comes after the element-wise addition.  Although BN normalizes the
signal, that signal is soon added to the shortcut, so the merged signal is not
normalized.  This unnormalized signal is then used as the input of the next weight
layer.  The authors observed that using an asymmetric after-addition activation is
equivalent to constructing a pre-activation Residual Unit.  Hence, they rearranged the
layers to BN -> ReLU -> weight -> BN -> ReLU -> weight -> addition.
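For illustration, here is a minimal sketch of such a full pre-activation Residual
Unit, again assuming a PyTorch-style implementation (the framework and the names are
illustrative, not from the paper).

.. code-block:: python

   import torch.nn as nn


   class PreActResidualUnit(nn.Module):
       """Sketch of BN -> ReLU -> weight -> BN -> ReLU -> weight -> addition."""

       def __init__(self, channels):
           super().__init__()
           self.f = nn.Sequential(
               nn.BatchNorm2d(channels),
               nn.ReLU(inplace=True),
               nn.Conv2d(channels, channels, 3, padding=1, bias=False),
               nn.BatchNorm2d(channels),
               nn.ReLU(inplace=True),
               nn.Conv2d(channels, channels, 3, padding=1, bias=False),
           )

       def forward(self, x):
           # The shortcut and the addition are left untouched, so the unit
           # computes x_{l+1} = x_l + F(x_l) with no activation after the sum.
           return x + self.f(x)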
Mathematically, this rearrangement defines an identity mapping
:math:`f(\mathbf{y}_l) = \mathbf{y}_l` such that

.. math::

   \mathbf{x}_L = \mathbf{x}_l + \sum_{i = l}^{L - 1} \mathcal{F}(\mathbf{x}_i, \mathcal{W}_i)

for any deeper unit :math:`L` and any shallower unit :math:`l`.  Analyzing the
gradients of the proposed formulation (i.e.
:math:`\frac{\partial \mathcal{E}}{\partial \mathbf{x}_l}`) reveals that information
is directly propagated back to any shallower unit :math:`l`.  This implies that the
gradient of a layer does not vanish even when the weights are arbitrarily small.

The full pre-activation Residual Unit enabled ResNet-1001 to outperform all previous
models on CIFAR-10, CIFAR-100, and ILSVRC 2012.  Furthermore, ResNet-200 achieved a
better result than Inception v3 when using scale and aspect ratio augmentation.  Even
though non-identity shortcuts have stronger representational abilities, they often
impede information propagation, which causes optimization issues.  The CIFAR-10
experiments show that skip connections of scaling, gating, dropout, and
:math:`1 \times 1` convolutions lead to higher training loss and error.  The earlier
results with :math:`1 \times 1` convolutions in :cite:`he2016deep` are misleading
because that network is only 34 layers deep.

An extreme variation of ResNet's shortcut connections has been explored in DenseNet
:cite:`huang2016densely`.  DenseNet achieved higher accuracy and efficiency than
ResNet with pre-activation residual units by defining the residual function to be

.. math::

   \mathcal{F}(\mathbf{x}_l, \mathcal{W}_l) =
     -h(\mathbf{x}_l) +
     \mathop{\mathcal{F}}\left(
       \begin{matrix}
         \mathbf{x}_0, \mathcal{W}_0\\
         \vdots\\
         \mathbf{x}_l, \mathcal{W}_l
       \end{matrix}
     \right).

Instead of combining features through summation before they are passed into a layer,
the proposed dense block combines features by concatenating them.  Each application of
the residual function now produces :math:`k` feature maps.  The hyperparameter
:math:`k` describes the growth rate of the network.  Within a dense block is a stack
of bottleneck layers, each performing
BN-ReLU-Conv(:math:`1 \times 1`)-BN-ReLU-Conv(:math:`3 \times 3`).  The :math:`n`
layers in the stack are connected in a manner analogous to a Markov chain of order
:math:`n`.  Between each pair of dense blocks is a transition layer that does the
typical BN-ReLU-Conv(:math:`1 \times 1`)-Pooling.

Convolutional Layer
-------------------

A convolutional (conv) layer accepts a volume of size
:math:`W_i \times H_i \times D_i`.  The hyperparameters to tune are

- number of filters :math:`K`,
- :math:`F \times F` spatial extent,
- stride :math:`S`, and
- amount of zero padding :math:`P`.

The conv layer produces a volume of size
:math:`W_{i + 1} \times H_{i + 1} \times D_{i + 1}` where

- :math:`W_{i + 1} = \frac{W_i - F + 2P}{S} + 1`,
- :math:`H_{i + 1} = \frac{H_i - F + 2P}{S} + 1`, and
- :math:`D_{i + 1} = K`.

With parameter sharing, a conv layer introduces :math:`F^2 D_i` weights plus a bias
per filter.  The total number of parameters is :math:`F^2 D_i K` weights and :math:`K`
biases.
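As a quick sanity check of these formulas, the snippet below (purely illustrative; the
function name and the example configuration are not from the paper) computes the
output volume and the parameter count for a hypothetical conv layer.

.. code-block:: python

   def conv_output_and_params(W, H, D, K, F, S, P):
       """Apply the formulas above: output volume size and parameter count."""
       W_out = (W - F + 2 * P) // S + 1
       H_out = (H - F + 2 * P) // S + 1
       D_out = K
       params = F * F * D * K + K   # F^2 D_i weights per filter plus one bias each
       return (W_out, H_out, D_out), params

   # Example: a 224x224x3 input with K=64 filters of spatial extent F=3,
   # stride S=1, and zero padding P=1.
   print(conv_output_and_params(224, 224, 3, K=64, F=3, S=1, P=1))
   # ((224, 224, 64), 1792)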
Convolutional to Fully-Connected
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Given the preceding configuration, the conv layer can be implemented by a
fully-connected layer as follows.

1) Reshape the input volume to a vector
   :math:`\mathbf{x}_i \in \mathbb{R}^{W_i H_i D_i + 1}`.  The last component is
   always set to one to incorporate the filter kernel's bias term.
#) Transform filter :math:`k` with weights
   :math:`\mathbf{H}_{i, k} \in \mathbb{R}^{F \times F \times D_i}` and bias
   :math:`b_{i, k} \in \mathbb{R}` to

   .. math::

      \mathcal{H}_{i, k} =
        \left\{
          \mathbf{h}_{i, k, j} \in \mathbb{R}^{W_i H_i D_i + 1}
        \right\}_{j = 1}^{W_{i + 1} H_{i + 1}}.

   Each vector :math:`\mathbf{h}_{i, k, j}` is sparse and encodes the discrete
   convolution operation in one region of the image.  Many of the entries are equal
   due to parameter sharing, and the last entry contains the bias term.  By
   inspection, the structure of :math:`\mathcal{H}_{i, k}` is akin to a Toeplitz
   matrix.

Stacking :math:`\mathcal{H}_i = \left\{ \mathcal{H}_{i, k} \right\}_{k = 1}^K` into a
sparse block matrix gives the desired fully-connected layer.

Fully-Connected to Convolutional
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Consider a fully-connected layer with :math:`N` hidden units looking at some input
volume of size :math:`W_i \times H_i \times D_i`.  Each hidden unit will have
:math:`W_i H_i D_i` weights.  If the hidden units are treated as separate filters, the
fully-connected layer can be implemented by a conv layer as follows: :math:`K = N`,
:math:`S = 1`, :math:`P = 0`, and the spatial extent is :math:`W_i \times H_i`.

Pooling Layer
-------------

A pooling layer accepts a volume of size :math:`W_i \times H_i \times D_i`.  The
hyperparameters to tune are

- :math:`F \times F` spatial extent, and
- stride :math:`S`.

The pooling layer produces a volume of size
:math:`W_{i + 1} \times H_{i + 1} \times D_{i + 1}` where

- :math:`W_{i + 1} = \frac{W_i - F}{S} + 1`,
- :math:`H_{i + 1} = \frac{H_i - F}{S} + 1`, and
- :math:`D_{i + 1} = D_i`.

Since it computes a fixed function of the input, this layer introduces zero
parameters.

VGGNet
------

The experiments and analysis of :cite:`simonyan2014very` standardized the
hyperparameters of a conv layer.  They deliberately restricted the spatial extent of
each filter to :math:`3 \times 3` because that is the smallest size able to capture
the notion of left, right, up, down, and center.  The stride is fixed to one, and the
zero padding is set to one to preserve the spatial resolution after the convolution.
When stacking conv layers, each layer has the same number of filters.  Downsampling is
handled via max pooling over a :math:`2 \times 2` pixel window with a stride of two.

Their analysis of receptive fields and nonlinearity indicates that larger filters are
undesirable.  Since the input volume's spatial resolution is preserved when stacking
conv layers, each additional :math:`3 \times 3` conv layer extends the effective
receptive field by two.  A stack of two :math:`3 \times 3` conv layers gives each
neuron an effective receptive field of :math:`5 \times 5`.  Likewise, three such
layers have a :math:`7 \times 7` effective receptive field.  Assuming that both the
input and the output of a three-layer :math:`3 \times 3` convolution stack have
:math:`C` channels, the stack is parameterized by :math:`3 (3^2 C^2) = 27 C^2` weights
instead of the :math:`7^2 C^2 = 49 C^2` weights that a single :math:`7 \times 7` conv
layer would require.  Furthermore, between consecutive conv layers is a nonlinear
activation function that makes the decision function more discriminative.
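This trade-off is easy to check numerically.  The snippet below (illustrative only;
the helper names are made up) compares the weight counts and effective receptive
fields just discussed.

.. code-block:: python

   def stack_weights(num_layers, kernel, channels):
       """Weights in a stack of conv layers, each with C input and C output channels."""
       return num_layers * kernel ** 2 * channels ** 2

   def effective_receptive_field(num_layers, kernel):
       """Effective receptive field of a stride-one stack of kernel x kernel convs."""
       return 1 + num_layers * (kernel - 1)

   C = 256
   print(stack_weights(3, 3, C), effective_receptive_field(3, 3))  # 1769472 (27 C^2), 7
   print(stack_weights(1, 7, C), effective_receptive_field(1, 7))  # 3211264 (49 C^2), 7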
:cite:`simonyan2014very` also explored :math:`1 \times 1` conv filters as a way to
increase the nonlinearity of the decision function without affecting the receptive
fields of the conv layers.  However, these filters did not give as good a result as
spatial conv filters.

GoogLeNet
---------

The Inception architecture focuses on how an optimal local sparse structure of a
convolutional vision network can be approximated efficiently by readily available
hardware :cite:`szegedy2015going`.  It assumes that if the probability distribution of
the dataset is representable by a large, very sparse deep neural network, then the
optimal network topology can be constructed layer after layer by analyzing the
correlation statistics of the preceding layer's activations and clustering neurons
with highly correlated outputs.  In the lower layers that are closer to the input,
correlated units would concentrate in local regions.  Thus, the clusters concentrated
in a single region can be covered by a layer of :math:`1 \times 1` convolutions in the
next layer.  However, there will also be a smaller number of more spatially spread out
clusters that can be covered by convolutions over larger patches, in addition to a
decreasing number of patches over larger and larger regions.

In order to avoid patch-alignment issues, the Inception architecture uses only
convolutions of size :math:`1 \times 1`, :math:`3 \times 3`, :math:`5 \times 5`, and a
:math:`3 \times 3` max pooling.  Their output filter banks are concatenated into a
single output stack forming the input of the next stage.  Together, these components
form the Inception module.  Unfortunately, stacking these modules is computationally
expensive because the channel depth of the merged output stack can only grow from one
module to the next.  To reduce the computational bottleneck, :cite:`szegedy2015going`
performs dimensionality reduction using :math:`1 \times 1` convolutions, assuming low
dimensional embeddings still contain a lot of information about a relatively large
image patch.  Furthermore, they used ReLUs to impose sparsity on the embedding.  In
order to train this stack of modules, the intermediate Inception layers required
auxiliary classifiers to backpropagate gradients to the lower stages.

Inception-v3
^^^^^^^^^^^^

The second and third generations of the Inception architecture focused on scaling up
convolutional networks to meet the demands of big data :cite:`szegedy2016rethinking`.
The architectural modifications can be summarized as:

1) Avoid representational bottlenecks, especially early in the network.  The
   representation size should gently decrease from the inputs to the outputs before
   reaching the final representation used for the task at hand.
#) Increasing the activations per tile in a convolutional network allows for more
   disentangled features, which results in faster training.
#) Spatial aggregation can be done over lower dimensional embeddings with little loss
   in representational power due to the strong correlation between adjacent units.
#) Balance the width and depth of the network.  The optimal performance of the network
   can be reached by balancing the number of filters per stage and the depth of the
   network.

The following optimizations are necessary to increase the depth of the network while
keeping the computational complexity acceptable:

- When dealing with lower resolution input at constant network complexity, one can
  reduce the receptive field after the first layer and still achieve roughly the same
  accuracy.
- Instead of the typical max pooling layer, one can modify the Inception module to use
  a stride of two for both the pooling layer and the spatial convolutions.  On a
  related note, :cite:`springenberg2014striving` presents some empirical evidence that
  when a pooling layer is replaced with a strided convolution layer, performance
  stabilizes and even improves on the base model.  However, removing the pooling layer
  and just increasing the stride of the previous layer results in diminished
  performance in all settings.
- Instead of replacing higher order convolutions with stacks of :math:`3 \times 3`
  convolutions, consider the computationally cheaper asymmetric convolutions: replace
  an :math:`n \times n` convolution with an :math:`n \times 1` convolution followed by
  a :math:`1 \times n` convolution.  The experiments indicate that factorizing
  convolutions works well on feature maps whose sizes range from twelve to twenty.
- Removing the lower auxiliary classifiers did not have any adverse effect on the
  final quality of the network.  The virtually identical training progression
  demonstrates that :cite:`szegedy2015going` incorrectly hypothesized that these
  branches help evolve the low-level features.

Inception-v4
^^^^^^^^^^^^

The fourth generation of the Inception architecture focused on simplifying the
architecture to take advantage of residual connections :cite:`szegedy2017inception`.
The Inception-Residual block is essentially the Residual block with the indirect path
flowing through an Inception block followed by an activation scaling module.  The
scaling module was manually configured to scale the residuals before adding them to
the accumulated layer activations.  Without this hack, the residual variants started
to exhibit instabilities and died early in training when the number of filters
exceeded one thousand.  In this case, dying means the last layer before the average
pooling started to produce only zeros after a few tens of thousands of iterations.
The experiments reveal that the residual connections lead to dramatically improved
training speed for the Inception architecture.  However, the accuracy of the hybrid
network is only fractionally higher than that of the Inception-v4 architecture without
residual connections.  Furthermore, each network's gains in single-frame performance
do not translate into large gains on ensembles: an ensemble of ResNet-151 achieves
comparable accuracy.

ResNeXt
-------

ResNeXt combines VGG's strategy of repeating layers with Inception's customized
stage-by-stage split-transform-merge strategy under the residual learning framework
:cite:`xie2017aggregated`.  The underlying concept is Network-in-Neuron: define the
residual function such that

.. math::

   \mathcal{F}(\mathbf{x}) = \sum_{i = 1}^C \mathcal{T}_i(\mathbf{x})

where :math:`\left\{ \mathcal{T}_i(\mathbf{x}) \right\}_{i = 1}^C` is a set of
arbitrary (heterogeneous) functions.  Here the cardinality :math:`C` refers to the
size of the set of transformations.  ResNeXt is fundamentally a ResNet whose residual
bottleneck block template
:math:`\begin{bmatrix} 1 \times 1, d\\ 3 \times 3, d\\ 1 \times 1, 4d \end{bmatrix}`
is replaced with grouped convolutions
:math:`\begin{bmatrix} 1 \times 1, 2d\\ 3 \times 3, 2d, C = 32\\ 1 \times 1, 4d \end{bmatrix}`.
In a grouped convolutional layer, the :math:`2d` input and output channels are divided
into :math:`C` groups, and convolutions are separately performed within each group
:cite:`krizhevsky2012imagenet`.  Each group's activation maps are then concatenated
into a single output tensor.
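For reference, below is a minimal sketch of this grouped-convolution bottleneck,
assuming a PyTorch-style implementation; the class name, the argument defaults, and
the identity shortcut are illustrative choices rather than details from the paper.

.. code-block:: python

   import torch.nn as nn


   class ResNeXtBottleneck(nn.Module):
       """Sketch of the [1x1, 2d; 3x3, 2d with C groups; 1x1, 4d] template."""

       def __init__(self, d=64, cardinality=32):
           super().__init__()
           width = 2 * d
           self.f = nn.Sequential(
               nn.Conv2d(4 * d, width, 1, bias=False),
               nn.BatchNorm2d(width),
               nn.ReLU(inplace=True),
               # groups=cardinality splits the width channels into C groups and
               # performs the 3x3 convolution separately within each group.
               nn.Conv2d(width, width, 3, padding=1, groups=cardinality, bias=False),
               nn.BatchNorm2d(width),
               nn.ReLU(inplace=True),
               nn.Conv2d(width, 4 * d, 1, bias=False),
               nn.BatchNorm2d(4 * d),
           )
           self.relu = nn.ReLU(inplace=True)

       def forward(self, x):
           # Identity shortcut; a projection would be needed if dimensions changed.
           return self.relu(x + self.f(x))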
Two more explicit, albeit slower, reformulations of this block are

.. math::

   \begin{bmatrix}
     1 \times 1, 2d\\
     3 \times 3, 2d, C = 32\\
     1 \times 1, 4d
   \end{bmatrix}
   =
   \begin{bmatrix}
     1 \times 1, \frac{2d}{C} & \cdots & 1 \times 1, \frac{2d}{C}\\
     3 \times 3, \frac{2d}{C} & \cdots & 3 \times 3, \frac{2d}{C}\\
     & \text{concatenation} &\\
     & 1 \times 1, 4d &
   \end{bmatrix}
   =
   \begin{bmatrix}
     1 \times 1, \frac{2d}{C} & \cdots & 1 \times 1, \frac{2d}{C}\\
     3 \times 3, \frac{2d}{C} & \cdots & 3 \times 3, \frac{2d}{C}\\
     1 \times 1, 4d & \cdots & 1 \times 1, 4d\\
     & \text{aggregated transformation} &
   \end{bmatrix}.

The experiments indicate that increasing cardinality while keeping the model size and
the number of parameters constant is a more effective way of gaining accuracy,
especially for shallow networks, than going deeper or wider.  Here width refers to the
number of channels in a layer.  When tuning :math:`C`, it is often not worthwhile to
make :math:`\frac{2d}{C} < 4`.  Moreover, a ResNeXt block must be at least three
layers deep to produce nontrivial topologies.  When the block's depth is two, the
aggregated transformation leads to a trivially wide, dense module.

.. rubric:: References

.. bibliography:: refs.bib
   :all: