Wednesday, March 9, 2016

Fri, Mar 11 - ResNet

Deep Residual Learning for Image Recognition. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. MS COCO detection challenge winner 2015.

arXiv

Comments:

  1. summary:
    This paper presents the deep residual network architecture that won the ImageNet detection and localization tasks and the COCO detection and segmentation tasks. The reason for introducing residual nets is the vanishing-gradient problem during training.
    The key part is that an identity shortcut is added to the output of a plain stack of operations (conv -> norm -> relu -> conv -> norm). In the evaluation, they tried their residual nets at different depths; the best results were achieved at around 100 layers, and although a 1202-layer net was also tried, they argue that unnecessarily large depth results in over-fitting.

    questions:
    I don't clearly understand Figure 3 and its description:

    "The identity shortcuts (Eqn.(1)) can be directly used when the input and output are of the same dimensions (solid line shortcuts in Fig. 3)."
    "When the dimensions increase (dotted line shortcuts in Fig. 3), we consider two options: (A) The shortcut still performs identity mapping, with extra zero entries padded for increasing dimensions. This option introduces no extra parameter; (B) The projection shortcut in Eqn.(2) is used to match dimensions (done by 1×1 convolutions)."

    According to Figure 2, I would think that since convolution is applied, the output size will decrease.
    1. When does it happen that the input and output are of the same dimensions?
    2. When do the output dimensions increase, so that the identity mapping needs to be padded?
    3. When is Eqn. (2) used to match dimensions? (A sketch illustrating these cases follows below.)
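
    On those three cases, as I understand the paper: the 3x3 convolutions inside a stage use stride 1 and padding 1, so they preserve the spatial size and channel count and the identity shortcut of Eqn. (1) applies directly; at a stage transition the first convolution has stride 2 and doubles the number of channels, so the shortcut must either zero-pad the extra channels (option A) or use the 1x1, stride-2 projection of Eqn. (2) (option B). A minimal sketch in PyTorch, my own illustration rather than the authors' code:

        # Sketch of a basic residual block with the two shortcut options.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class BasicBlock(nn.Module):
            def __init__(self, in_ch, out_ch, stride=1, option="B"):
                super().__init__()
                self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
                self.bn1 = nn.BatchNorm2d(out_ch)
                self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
                self.bn2 = nn.BatchNorm2d(out_ch)
                if stride == 1 and in_ch == out_ch:
                    self.shortcut = lambda x: x              # Eqn. (1): plain identity
                elif option == "B":
                    self.shortcut = nn.Sequential(           # Eqn. (2): 1x1 projection
                        nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                        nn.BatchNorm2d(out_ch))
                else:                                        # option A: no extra parameters
                    pad = out_ch - in_ch                     # subsample, zero-pad channels
                    self.shortcut = lambda x: F.pad(x[:, :, ::stride, ::stride],
                                                    (0, 0, 0, 0, 0, pad))

            def forward(self, x):
                out = F.relu(self.bn1(self.conv1(x)))        # conv -> norm -> relu
                out = self.bn2(self.conv2(out))              # conv -> norm
                return F.relu(out + self.shortcut(x))        # add shortcut, then relu

    So the solid shortcuts in Fig. 3 are the stride-1, same-channel blocks, and the dotted ones are the stride-2, channel-doubling blocks.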

  2. This paper introduces a framework to facilitate the training of very deep neural networks and to address the problems such architectures encounter, such as degradation, where beyond a certain depth very deep architectures begin to rapidly lose accuracy. In the paper this is addressed by feed-forward shortcuts (skipping layers and passing identity mappings forward) that let the network optimize (minimize) the residual, which proves easier than fitting some unrestricted mapping. It is thought that the solvers involved have a hard time fitting identity mappings, and so this feed-forward of identity mappings helps the solvers converge without crossing a threshold where accuracy severely degrades.

    Questions/Comments :
    1) I have experience with numerical analysis and optimization algorithms, so I understand what is meant by minimizing a residual, but in this context I am a little unclear on the particulars. Could the specifics of the algorithm be reviewed?
    2) They suggest that an alternative to identity shortcuts is projection shortcuts, and it seems that these projection shortcuts perform even better in some cases. Would other affine transformations work (like rotations)?
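
    For reference, the two shortcut forms from the paper (Eqns. (1) and (2)):

        y = F(x, {W_i}) + x          (identity shortcut, Eqn. 1)
        y = F(x, {W_i}) + W_s x      (projection shortcut, Eqn. 2)

    Since W_s is just a learned linear map, realized as a 1x1 convolution, a fixed rotation would be one particular choice of W_s; my guess is there is little reason to constrain it when a freely learned W_s costs the same number of parameters.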

  3. In this paper, the authors propose learning a residual mapping F(x) = H(x) - x, instead of the unreferenced, underlying mapping H(x), in order to combat the degradation problem experienced by very deep networks. They propose a simple modification, allowing a shortcut connection across stacks of at least 2 layers, which realizes the residual mapping without adding any more parameters. For the deeper variants, they introduce a bottleneck architecture where a block is a 3x3 convolution prepended and appended with 1x1 convolutions, so that the middle 3x3 layer operates on a bottleneck. They discuss their architecture changes in designing networks of depths 34, 50, 101, and 152 layers and show results on ImageNet (ILSVRC 2015) and CIFAR-10, with the baseline being similar nets without the shortcut connections.

    Shortcut Connections:
    As in VGG16/19, the 3x3 convolutions use padding of size 1, so convolution itself maintains the input-output spatial dimensionality; dimensions only change when the feature map is downsampled (by 2x2 max pooling in VGG, mostly by stride-2 convolutions in ResNet), at which point the number of channels doubles, the "increase in dimensions". Thus where no downsampling occurs, we can directly apply the identity connection; where it does occur, we apply the zero padding or a projection of the input.

    I also managed to find this very cool visualization of the ResNet 152 on the github page of the 1st author: http://ethereon.github.io/netscope/#/gist/d38f3e6091952b45198b
    The website also has visualizations of other models.

    Discussion:
    1. When the authors mention padding zeros for the increased dimensions, where are they padding the zeros? I assume they are adding feature maps of zeros for the increased number of channels, e.g. if the input has 64 channels and the output has 128 channels, they just stack 64 channels of HxW zeros onto the 64 input channels and do the addition (see the sketch after these questions). However, I am not certain this is right, because then the function being learned is not really a residual: H(x) = F(x) + 0 instead of F(x) + x for the channels beyond 64.

    2. What is the motivation behind the design of the bottleneck architecture other than the need to reduce training time?
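
    That reading of option A seems consistent with the paper's "identity mapping, with extra zero entries padded"; a tiny shape-level sketch (hypothetical sizes, PyTorch used only for the tensor ops):

        # Option-A shortcut at a 64 -> 128 channel stage transition.
        import torch

        x   = torch.randn(1, 64, 56, 56)            # input to the block
        F_x = torch.randn(1, 128, 28, 28)           # stand-in for the residual branch output

        x_down   = x[:, :, ::2, ::2]                # subsample spatially to 28x28
        zeros    = torch.zeros(1, 64, 28, 28)       # the new channels are all zeros
        shortcut = torch.cat([x_down, zeros], 1)    # 128 channels, no extra parameters

        H_x = F_x + shortcut                        # for the padded channels, H(x) = F(x) + 0

    So yes, for the extra channels the block is effectively learning an unreferenced mapping, which the paper gives as the reason projections (option B) do slightly better than option A.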

  4. The authors present a way of training very deep networks. The key insight is to learn the residual function F(x) = H(x) - x instead of H(x) directly. The residuals are, allegedly, easier to optimize.

    1. Why does the bottleneck architecture reduce training time?
    2. It's still not clear to me why the residual function is easier to train.
    3. Why would a deep network have a hard time learning the identity function? (See the sketch below.)
    4. Why wouldn't minimizing the residual help with only a few layers?
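
    On question 3, one way to see the paper's argument (my sketch, batch norm omitted for brevity): with a shortcut, driving the residual branch's weights to zero makes the block an exact identity, whereas a plain stack with zero weights outputs all zeros and must instead fit the identity with its nonlinear layers.

        # A residual block whose branch weights are zero is exactly the identity.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        conv1 = nn.Conv2d(64, 64, 3, padding=1, bias=False)
        conv2 = nn.Conv2d(64, 64, 3, padding=1, bias=False)
        nn.init.zeros_(conv1.weight)                     # residual branch F(x) collapses to 0
        nn.init.zeros_(conv2.weight)

        x = F.relu(torch.randn(1, 64, 8, 8))             # activations after a ReLU are non-negative
        residual = F.relu(conv2(F.relu(conv1(x))) + x)   # H(x) = relu(F(x) + x) = x
        plain    = F.relu(conv2(F.relu(conv1(x))))       # plain block outputs all zeros
        assert torch.allclose(residual, x)               # the shortcut passes x through unchanged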

  5. Increasing the depth of deep neural networks can degrade the accuracy (increase training and testing error), and the authors propose deep residual networks as a solution to this problem. The authors first note that if identity layers were simply added to a shallow network, the deeper counterpart should do no worse, yet in practice accuracy degrades. This might be because it is difficult for the network to learn identity mappings. Hence the authors purposefully add a "shortcut" along with some residual function and train the network. The results are better than traditional deep networks, and the accuracy improves when stacking more layers.

    Discussion:
    1. The motivation is not very clear to me. In Section 3.1 the authors say, "if the added layers can be constructed as identity mappings, a deeper model should have training error no greater than its shallower counterpart. The degradation problem suggests that the solvers might have difficulties in approximating identity mappings by multiple nonlinear layers." But how did the authors know that approximating identity mappings is important for accuracy before experimentation?
    2. We can consider a shortcut connection as a convolution too. Looking at the 34-layer residual network in Figure 3, how does adding a shortcut connection make a difference, i.e. why can't the conv layers themselves incorporate the shortcut connection as a filter? That said, the shortcut connections span 2 conv layers rather than 1, which might make my question invalid.

  6. Abstract:

    The paper presents a new architecture that makes it possible to train very deep neural networks. Until now, training such large networks has been a very hard task because the convergence rate is very slow, and people found that deeper networks in fact had higher training error. This paper introduces the concept of learning the residual function rather than the original function. The motivation is that if the deeper network is a superset of the shallower network, then learning the identity mapping might be hard for a plain network, whereas the residual network just has to learn a zero residual in that case. They show that with this it is possible to train very deep networks successfully, and they attained state-of-the-art results on various tasks.


    Discussion:

    1) How exactly would projection work in this framework? If we do projections, then the shortcut would itself need to learn the identity mapping that was hard in the first place.

  7. This paper introduces the deep residual networks that won the recent ILSVRC classification challenge. The authors propose to learn the residual mapping F(x) = H(x) - x instead of the mapping function H(x), stating that the former is easier to learn. They convert a plain network to a residual network by adding either identity shortcuts or projection shortcuts.
    Discussion:
    1) What is the intuition behind deeper plain networks not performing as well, as described in Fig. 1? The authors state that they found this out experimentally.
    2) Why is residual learning easier than learning the normal mapping?
    3) How do you still get an identity function when adding an input from a layer of a different size? Don't we end up learning parameters for such a mapping?

  8. The paper presents the concept of deep residual networks, which won ILSVRC 2015. It shows that residual networks can be used to combat degradation in the training of deep networks. This is done by tweaking a plain network, adding shortcuts between the inputs and outputs of stacked layers. This means the residual model to be learnt is F(x) = H(x) - x, for a mapping function H(x).

    Discussion:
    The authors claim that adding more stacked layers in a single group, thereby increasing the depth of the network, results in lower training and test errors, up to a limit, after which test errors actually increase. When this limit is reached, will we still be able to improve performance by increasing the number of groups of stacked layers, to get a deeper network, or will this just result in overfitting again?

  9. As very deep convolutional networks begin to converge, accuracy begins to degrade. Although accuracy decreases, we know that there exists a solution that is at least as accurate as a shallower network (by using identity mappings). Unfortunately, our current methods of training convolutional networks are not able to find solutions that are as good for very deep networks. The authors propose a solution to this problem: approximate an easier-to-learn residual function F(x) = H(x) - x, rather than H(x) directly.

    Questions

    What makes it easier to approximate the residual function?

    One of the questions that the authors were trying to answer was, "Is learning better networks as easy as stacking more layers?" At the end of the paper, it seemed that the question goes unanswered. Their deep model with over 1000 layers starts to overfit. The authors propose a few techniques to combat overfitting. Have they published any updates exploring these techniques?

  10. This paper presents a significantly deeper neural network architecture, with models ranging from tens of layers up to over a thousand. The major contribution is a residual learning building block which helps the architecture avoid degradation. A common problem with naively increasing the layers in a standard architecture is that as you increase the layers, the training error also increases. By adding residual building blocks, you avoid this problem, using feed-forward shortcuts that perform identity mapping between layers. This architecture was the winner of the MS COCO 2015 segmentation challenge (that model had 152 layers).

    1. Why is identity mapping hard for Neural Networks to learn?
    2. Can we use the residual layers to build shallower networks as well?

  11. The paper discusses a residual learning framework that eases the training of deeper neural networks. Residual networks are easier to optimize and gain accuracy from increased depth. This is done by reformulating the layers as learning residual functions with reference to the layer inputs, i.e. the residual model to be learnt becomes F(x) = H(x) - x. The formulation F(x) + x is realized by feedforward neural networks with shortcut connections. The proposed architecture won the ILSVRC 2015 classification task.

    Question: I still don't understand how learning residual functions makes training easier. Also, could you please go over how the bottleneck architecture helps reduce training time?
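
    Not an authoritative answer, but the parameter arithmetic behind Fig. 5 may help: the 1x1 convolutions shrink the 256-d input to 64-d before the expensive 3x3 and then restore it, so a 3-layer bottleneck block on 256-d features costs about the same as the 2-layer basic block on 64-d features, and far less than applying 3x3 convolutions directly at 256-d. A quick sketch (my numbers; biases and batch norm ignored):

        # Rough per-block weight counts for the designs compared in Fig. 5.
        def conv_params(k, c_in, c_out):
            return k * k * c_in * c_out                       # kxk kernel, no bias

        basic_64   = 2 * conv_params(3, 64, 64)               # basic block on 64-d features
        naive_256  = 2 * conv_params(3, 256, 256)             # two 3x3 convs directly on 256-d
        bottleneck = (conv_params(1, 256, 64)                 # 1x1 reduces 256 -> 64
                      + conv_params(3, 64, 64)                # 3x3 on the 64-d "bottleneck"
                      + conv_params(1, 64, 256))              # 1x1 restores 64 -> 256
        print(basic_64, naive_256, bottleneck)                # 73728 1179648 69632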

  12. This paper presents the first ultra-deep model for image recognition. The model consists of 152 layers, 8x deeper than the well-known VGG nets. Stacking more layers does not by itself result in better performance; in fact, experiments show that it causes the accuracy to drop rapidly, a problem known as the 'degradation' problem. To overcome this, the paper introduces the residual framework, where the layers in the network are connected with extra shortcut connections that allow them to learn from the previous layer's input as well as from the data. Using this identity mapping, they were able to design this very deep network and overcome the degradation of the training accuracy. The proposed model was the winning entry for the ImageNet and MS COCO 2015 challenges.

    1. Can we discuss the difference between projection shortcuts and identity mapping, and why identity mapping performed better?

  13. This paper introduces residual networks, a technique used to win ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. The residual aspect attempts to address the vanishing-gradient problem: typically with really deep networks, the changes at the bottom layers induced by small changes at the top become vanishingly small. They address this by adding extra connections into the network that allow each layer to train from more than just the layer directly above it. They test their technique with a 152-layer network applied to ImageNet (the deepest ever) and show that much deeper networks (1200+ layers) actually overfit.

    Questions:
    The underlying assumption:
    F(x) := H(x) − x
    Won't the gradients for H(x) still vanish even if the gradients for x do not?
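
    One informal way to look at this (my reading, not something stated in the paper): the block computes H(x) = F(x) + x, so its input-output Jacobian is

        dH/dx = dF/dx + I

    and the gradient arriving at the block's output reaches its input through two paths, one through F and one straight through the identity. Even if the F path shrinks the gradient, the identity path passes it along unchanged, so the signal reaching lower layers is not forced through every weight layer; the gradients of F itself can still be small, but then F simply stays near zero and the block behaves like an identity.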

  14. This paper talks about ultra-deep models for image recognition. It attempts to address the vanishing-gradient problem by adding shortcut connections from one layer to a layer further ahead. Using an identity mapping, we can reinvigorate the gradient later in the network. This net won the ImageNet and MS COCO 2015 challenges.

  15. Summary
    This paper proposes neural networks, called Residual Neural Networks, which are deeper than the state-of-the-art; their layers learn residual functions with reference to the layer inputs instead of unreferenced functions. These networks have been evaluated on classification, detection, localization, and segmentation tasks and have shown significant improvements over the existing state-of-the-art. An ensemble of residual nets with a depth of 152 layers was evaluated on the ImageNet dataset and an error of 3.57% was achieved on its test set.
    Questions
    1) Could you please clarify how exactly the residual mapping helps in avoiding the degradation problem in deeper networks?
    2) I did not understand how identity mapping of layers can contribute to improving accuracy.

  16. This paper presents a deep residual learning framework. By allowing the layers to learn a residual mapping, F(x) = H(x) - x, the network becomes easier to optimize and greater depth can be achieved.

    Clarification:

    I still don't quite understand how the residual mapping allows for increased depth.

  17. This paper presents a residual learning framework for training deep neural networks. The authors identify the degradation of network accuracy not as a result of overfitting, but of the difficulty of optimizing very deep networks. They propose to use a residual function in place of the original mapping function in order to combat this.

    How would these residual networks perform with a shallower depth/less layers?

  18. This paper proposes a residual learning framework to ease the training of networks that are far deeper than those seen before (on the order of 50 to 150 layers, and beyond). Merely increasing the depth of a plain network causes the error rates to go up and the convergence time to increase by a huge factor. To prevent this, the authors start from a plain network and create shortcut connections across stacks of at least 2 layers, which realize the residual function's mapping.

    Discussion:
    How does the bottleneck architecture make it easier to train and reduce training time?
