Friday, February 12, 2016

Mon, Feb 15 - Going Deeper with Convolutions

Going Deeper with Convolutions. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich. 2014.

arXiv

20 comments:

  1. In this paper, a convolutional network architecture is presented that set the state of the art in the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC 2014). The primary contribution of the paper is the design philosophy behind this network, one of efficiency and optimization: the "need to go deeper" in order to improve performance is met with only a minimal increase in parameters and computational cost by approximating a sparse architecture (even in the convolutional layers) through "Inception modules", parallel convolutional/pooling layers tied directly to the units of the previous convolutional stage (which would normally be densely connected to subsequent layers). The authors also elaborate on their training methodology, hyperparameter selection, and results in the ILSVRC 2014 classification and detection challenges.

    Discussion/Question: The authors caution in the paper that "although the proposed architecture has become a success for computer vision, it is still questionable whether its quality can be attributed to the guiding principles that have led to its construction" and go on to discuss ways in which their design paradigm could be validated. Does this ambiguity not invalidate the architecture as a design choice? (i.e., even if another similarly designed but differently configured architecture performed well, could that not also be just "due to chance"?) In other words, how would one provide theoretical proof of the kind the authors caution does not yet exist for their design?

  2. The authors present a new architecture named GoogLeNet, which uses a sparsely connected architecture to avoid computational bottlenecks and to improve computational efficiency over the entire network as it grows deeper and wider. The sparsely connected building blocks (called Inception modules) are based on the work of Arora et al., which suggests constructing the network layer by layer: the correlation statistics of the previous layer are analyzed and highly correlated units are clustered together to define the units of the next layer. Dimension reduction, using 1x1 convolutional filters and projections, is applied wherever the computational requirements would otherwise spike. The authors provide details on their best performing model as well as the design considerations behind their submission to ILSVRC 2014, which GoogLeNet won.

    Discussion:
    1. Rather than a question, one criticism I have of this paper is that the authors make many design decisions out of convenience rather than from sound scientific reasoning. Even when they do justify a choice, the justification is either empirical or theoretical, never both. I wonder if the authors have published a follow-up paper discussing why their other models and decisions failed.

    2. The use of the 1x1 projections is not elaborated upon. I believe this refers to the 1x1 convolution applied to the output of the 3x3 max-pooling branch. Can some light be shed on what is happening in this part of the Inception module? (See the sketch after these questions.)

    3. The authors used a deeper and wider network, which gave poorer results on its own, as part of their ensemble; AlexNet did something similar. How important is using an ensemble of deep networks for object classification? Has there been any study on this?
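
    Since question 2 asks about the 1x1 projections, here is a minimal PyTorch-style sketch of a single Inception module with dimension reduction (Figure 2(b) of the paper). It is an illustrative reconstruction, not the authors' code; the channel counts follow the inception (3a) row of Table 1, and the ReLUs after the final convolutions are omitted for brevity. Branch 4 shows the pooling-path "projection": a plain 1x1 convolution that shrinks the pooled map's channel count before concatenation.

      import torch
      import torch.nn as nn

      class InceptionSketch(nn.Module):
          # Channel counts follow the inception (3a) row of Table 1 (input: 28x28x192).
          def __init__(self, in_ch=192):
              super().__init__()
              self.b1 = nn.Conv2d(in_ch, 64, kernel_size=1)                  # 1x1 branch
              self.b2 = nn.Sequential(nn.Conv2d(in_ch, 96, 1), nn.ReLU(),    # 1x1 reduce,
                                      nn.Conv2d(96, 128, 3, padding=1))      # then 3x3
              self.b3 = nn.Sequential(nn.Conv2d(in_ch, 16, 1), nn.ReLU(),    # 1x1 reduce,
                                      nn.Conv2d(16, 32, 5, padding=2))       # then 5x5
              self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),  # 3x3 max pool,
                                      nn.Conv2d(in_ch, 32, 1))               # then 1x1 projection

          def forward(self, x):
              # Every branch preserves the 28x28 grid, so the outputs can be stacked
              # along the channel axis: 64 + 128 + 32 + 32 = 256 output channels.
              return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

      out = InceptionSketch()(torch.randn(1, 192, 28, 28))
      print(out.shape)  # torch.Size([1, 256, 28, 28])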

  3. The paper attempts to use a sparsely connected architecture while still relying on computations over dense matrices (which computers are really good at). The authors also use a lot of dimension-reducing projections and add auxiliary softmax classifiers to boost the gradient signal.

    1. I would like to go over the 1x1 convolution. How does this work and why is it useful for dimension reduction?

    2. The paper says that the Inception architecture is supposed to "approximate a sparse structure". I don't see anything about the architecture that suggests this would occur.

  4. This paper describes an approach to convolutional nets that extracts better recognition performance from the technique. One of the insights of this approach is that convolutional features at several filter sizes are combined in the network, so the network can abstract features from different scales simultaneously. At test time they crop each input image 144 different times (4 scales x 3 squares x 6 crops per square x 2 mirror reflections) to address translation and scale invariance. They did not use bounding box regression, yet still outperformed other models, which shows the strength of their CNN approach (dubbed "Inception"). They hypothesize that bounding box regression would further improve results.

    Discussion: I’m not sure I understand 1x1 convolutional filters and the advantages they offer. Can you elaborate?

  5. This paper presents a new CNN architecture capable of going deeper and wider. Dimensionality reduction is used judiciously to keep the growth in computational resources in check. As a result, GoogLeNet is able to provide significant quality gains at only a small increase in computational cost.

    Discussion:
    I apologize if the answer to this question is obvious, but how are 1x1 convolutions able to reduce computational cost? I don't fully understand how a 1x1 convolution is different from some pointwise scaling. I would appreciate it if someone could explain this use case more carefully.
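
    A minimal numpy sketch of what a 1x1 convolution computes (the sizes here are made up, not from the paper): at each spatial position it takes a weighted sum over all input channels, i.e. a small matrix multiply per pixel, rather than a per-channel scalar scaling. Choosing fewer filters than input channels is what shrinks the channel dimension.

      import numpy as np

      c_in, c_out, h, w = 256, 64, 28, 28        # illustrative sizes: 256 channels in, 64 out
      x = np.random.randn(c_in, h, w)            # input feature map
      filters = np.random.randn(c_out, c_in)     # each 1x1 filter holds one weight per input channel

      # Channel mixing at every position: output channel o at pixel (i, j) is
      # sum_c filters[o, c] * x[c, i, j], i.e. not a scaling of a single channel.
      y = np.einsum('oc,chw->ohw', filters, x)
      print(y.shape)                             # (64, 28, 28): same grid, 4x fewer channels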

  6. Summary:
    This paper presents GoogLeNet, which won the ILSVRC 2014 classification and detection competitions. Increasing the depth of a traditional CNN leads to overfitting (because of the larger number of parameters) and to computational inefficiency. The Inception architecture was introduced in GoogLeNet to cluster "neurons" with highly correlated outputs, under the assumption that the optimal deep network is sparse. During training, auxiliary classifiers were introduced to tackle the diminishing-gradient issue of such a deep network.
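
    On the auxiliary classifiers, here is a toy sketch (dummy tensors, not the authors' code) of how their losses are typically folded into the training objective; the paper weights each auxiliary loss by 0.3 and discards the auxiliary heads at inference time.

      import torch
      import torch.nn.functional as F

      # Dummy logits from the main head and the two auxiliary heads (batch of 8, 1000 classes).
      main_logits = torch.randn(8, 1000)
      aux1_logits = torch.randn(8, 1000)
      aux2_logits = torch.randn(8, 1000)
      labels = torch.randint(0, 1000, (8,))

      # Extra gradient signal is injected at intermediate layers by adding the
      # auxiliary losses, each discounted by 0.3, to the main classification loss.
      loss = (F.cross_entropy(main_logits, labels)
              + 0.3 * F.cross_entropy(aux1_logits, labels)
              + 0.3 * F.cross_entropy(aux2_logits, labels))
      print(loss.item())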

    Question:
    Regarding Figure 2(b), how is the filter concatenation performed? Can you give a concrete example?
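
    A small shape-only sketch of the concatenation (the channel counts are the inception (3a) values from Table 1; the tensors here are just zero placeholders): each branch produces a map over the same 28x28 grid, and "filter concatenation" simply stacks the branch outputs along the channel axis.

      import numpy as np

      b_1x1  = np.zeros((64,  28, 28))   # 1x1 branch
      b_3x3  = np.zeros((128, 28, 28))   # 1x1 reduce -> 3x3 branch
      b_5x5  = np.zeros((32,  28, 28))   # 1x1 reduce -> 5x5 branch
      b_pool = np.zeros((32,  28, 28))   # 3x3 max pool -> 1x1 projection branch

      out = np.concatenate([b_1x1, b_3x3, b_5x5, b_pool], axis=0)
      print(out.shape)                   # (256, 28, 28): module output depth is 64+128+32+32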

  7. Extra point: The architecture and design of GoogLeNet reminds me of TensorFlow - https://www.tensorflow.org/
    Some of the people in the DistBelief team are in the introductory video, so it is highly likely GoogLeNet was the inspiration for TensorFlow.

  8. Abstract:
    The paper presents a new neural network architecture, the GoogLeNet (Inception 5) network, for the ImageNet classification task. It is inspired mainly by 'Network in Network', which focuses on replacing the convolution filter with a universal function approximator, a multilayer perceptron (hence that paper's name). That paper also shows that, theoretically, such a multilayer perceptron can be implemented with 1x1 convolutions. The GoogLeNet paper builds upon this and uses 1x1 convolutions as a dimension reduction to cut the number of parameters in the network. GoogLeNet thus builds a deeper and wider neural network architecture while still keeping the number of parameters lower than AlexNet's. One additional idea GoogLeNet introduces is reinforcing the middle layers of the architecture with added softmax (auxiliary) classifiers, which helps with the vanishing gradient problem. The final results of the paper were quite impressive; in fact Google followed this work with the Inception 6 and 7 architectures, which are even deeper.


    Discussion:

    1) In the 'Network in Network' paper, it is shown that a multilayer perceptron layer is theoretically equivalent to 1x1 convolution layers. In a reddit thread I followed (https://www.reddit.com/r/MachineLearning/comments/3oln72/1x1_convolutions_why_use_them/), it is instead said that a 1x1 convolution is equivalent to a single-layer perceptron. I couldn't work out which statement is correct, and why.
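
    A small numpy check of the relationship, under made-up sizes (this is my reading, not the paper's code): a single 1x1 convolution is exactly one fully connected layer applied to each pixel's channel vector, with the weights shared across positions; the "multilayer" perceptron of Network in Network comes from stacking several such 1x1 layers with nonlinearities in between.

      import numpy as np

      c_in, c_out, h, w = 192, 64, 7, 7
      x = np.random.randn(c_in, h, w)
      weight = np.random.randn(c_out, c_in)
      bias = np.random.randn(c_out)

      # (a) As a 1x1 convolution: mix channels independently at every position.
      conv_out = np.einsum('oc,chw->ohw', weight, x) + bias[:, None, None]

      # (b) As a fully connected layer applied to every pixel: treat the h*w
      # positions as a batch of 192-dimensional feature vectors.
      pixels = x.reshape(c_in, h * w).T                      # (49, 192)
      fc_out = (pixels @ weight.T + bias).T.reshape(c_out, h, w)

      print(np.allclose(conv_out, fc_out))                   # True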

  9. The paper introduces a new, deeper and wider CNN architecture called GoogLeNet. This sparsely connected architecture aims at avoiding computational bottlenecks, and goes deeper to increase performance without increasing the number of parameters. This is achieved by approximating sparse architectures through the use of Inception layers, which cluster neurons following the Hebbian 'neurons that fire together, wire together' principle. GoogLeNet is therefore able to achieve far better recognition results with only a minimal increase in computational cost.

    Discussion:
    How 1x1 convolutions are used in dimensionality reduction is unclear.

    Seeing that the authors use a dropout layer with a 70% dropout ratio along with photometric distortions to combat overfitting, I wonder how much the training setup would change for an even larger training set.

  10. The paper presents a new CNN architecture that optimises the use of computing resources inside the network by using a new level of organisation called the “Inception module”. The Inception module is based on the idea that highly correlated outputs of a layer can be clustered together to form the inputs to the next layer. The authors employ 1x1 convolutional filters to reduce the dimension of the data before computing the 3x3 and 5x5 convolutions.

    Questions:
    1. The paper says: “The main idea of the Inception architecture is based on finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components.” I am not clear on what that means.
    2. I am also not clear on how some of the dimensions given in Table 1 were obtained.

  11. In this paper, the authors present a new CNN architecture, GoogLeNet, whose name pays homage to LeNet. They use what is called an Inception module to find an optimal local construction and repeat it spatially. The authors employ 1x1 convolutional filters to reduce the dimension of the data before the expensive convolutions inside each Inception module.

    Questions:
    1. How does a 1x1 convolution decrease dimensionality?
    2. What is Polyak averaging?
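
    On Polyak averaging: as I understand it (this is a general description, not a detail spelled out in the paper), the weights used at inference time are an average of the weight iterates visited during training rather than the final iterate. A toy sketch with a made-up 4-dimensional "parameter vector":

      import numpy as np

      theta = np.zeros(4)          # current parameters, updated by the optimizer
      theta_avg = np.zeros(4)      # running (Polyak) average of all iterates

      for t in range(1, 101):
          theta += 0.01 * np.random.randn(4)      # stand-in for an SGD update
          theta_avg += (theta - theta_avg) / t    # incremental mean over iterates

      # theta_avg, not the last theta, is what the averaged model would use at test time.
      print(theta, theta_avg)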

  12. The approach in this paper attempts to improve on state-of-the-art deep network architectures by increasing the depth of the network. The challenges associated with doing so are that deeper networks are more prone to over-fitting and that adding more layers increases the usage of computational resources. The proposed architecture uses dense building blocks known as Inception modules. These modules, while dense, attempt to approximate a locally sparse structure (dense components are used because computers have a hard time dealing with non-uniform sparse data structures). The modules also include components that reduce the dimensionality of the data to prevent computational blow-up. The result was a winning 22-layer network that outperformed all previous models.

    Questions:
    1. This has already been asked, but how exactly do 1x1 convolutions reduce the dimensions? Why would this work over other DR algorithms like SVD?
    2. How does this structure compare to the current hyper-deep model with 152 layers?

  13. This paper discusses a convolutional net architecture created for the ILSVRC 2014 challenges. The authors try to approximate a sparse network in order to solve two problems that accompany increasing the depth and width of convolutional nets: large numbers of parameters and an increase in computational resources used. To approximate a sparse network, the authors group together units that activate on similar patches, and use "inception modules." Inception modules use 1x1, 3x3 and 5x5 convolutions. The 1x1 convolution serves as a dimensionality reduction before the more expensive 3x3 and 5x5 convolutions.

    Question:
    I thought that the use of the 1x1 filter wasn't explained very well. I'm not sure how this reduces the dimensionality.

  14. This paper introduces a novel concept in designing efficient deep learning architectures: Inception modules inserted at several depths in the network. The paper describes how sparsity can be exploited for faster, more resource-efficient computation. 1x1 convolution filters are used for dimensionality reduction, and multi-scale processing with 3x3 and 5x5 filters also helps increase accuracy. The authors go on to describe the 22-layer GoogLeNet, which employs these Inception layers, and how it outperforms the other deep architectures in ILSVRC 2014, both for object classification and for detection.

    Questions:

    1. How 1x1 convolutions result in dimensionality reduction is not yet clear to me.

  15. Summary
    This paper describes GoogLeNet, a 22-layer deep convolutional neural network which won the ILSVRC14 object detection challenge with a mean average precision of 43.9%. It focuses on techniques for approximating sparse network structure, making the network deeper and wider while keeping computational resource usage reasonable. In this process, the authors use Inception modules, 1x1 convolutional layers for dimensionality reduction with rectified linear activation, a dropout layer with a 70% dropout ratio, and a linear layer with softmax as the classifier.
    Questions
    1) How are filters of 1x1 dimensions responsible for a dimensionality reduction?

  16. This paper presents a method for building deep convolutional nets guided by the Hebbian principle. The authors build a 22-layer network with the aim of going deeper and wider to increase its capacity. They build on the result of Arora et al. that if the probability distribution of the data-set is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs. They go on to explain the architecture, where they draw inspiration from the Network-in-Network model between layers to increase capacity.

    They show state-of-the-art results on both the detection and classification tracks of the ILSVRC challenge.

    Questions:

    1 "The main idea of the Inception architecture is based on finding out how an optimal local sparse
    structure in a convolutional vision network can be approximated and covered by readily available
    dense components". I dont understand how has it been implemented , as what sparse representation are they able to achieve through this.

    2. They talk about adding a Network-in-Network architecture between layers, but there is no explanation of the pooling layers or of how the dimension reduction is done.

  17. Summary:
    This paper explains the architecture design of GoogLeNet, the winning model of the ILSVRC 2014 competition. Inspired by the paper "Network in Network", GoogLeNet introduces the "Inception module", which constructs a sparser representation of the convolutional network by clustering the neurons with the highest correlation, and uses extra 1x1 convolutional layers for dimensionality reduction. Though much deeper (22 layers) than the previous winning model, AlexNet, it has 12x fewer parameters.

    Research Questions:

    1. The main new component of the Inception module is the use of the 1x1 convolutional layer. But it is not very clear how they find the neurons that "fire together" and therefore should be connected through this new 1x1 layer.
    2. Table 1: Can you explain the numbers in this table, especially the #1x1 column?

  18. The paper proposes a new neural network architecture for ImageNet classification that makes use of dense building blocks called Inception modules. The proposed architecture reduces computational bottlenecks and increases performance by representing a multilayer perceptron as a 1x1 convolution, thereby reducing the number of parameters in the network. GoogLeNet, which won the ILSVRC14 detection challenge, was based on this architecture and achieved much better recognition results with only a small increase in computational requirements compared to shallower and narrower networks.
    Question:
    How do 1x1 convolutions help in dimensionality reduction?

  19. This paper presents a deep CNN architecture with improved computational efficiency that allows it to go wider and deeper. One of the main challenges faced is that CNNs tend to overfit with increasing depth. The authors use dense components called "Inception modules" to approximate a locally sparse network. These modules combine convolutions of several sizes and use 1x1 convolutions to reduce the dimensionality of the data in order to minimize computational expense. With this architecture, GoogLeNet performs better while only modestly increasing computational requirements.

    Regarding the authors' skepticism about whether the architecture's guiding principles really explain its performance, did they ever follow up with an analysis or verification of this?

  20. This paper discusses a new ConvNet architecture created for the ILSVRC 2014 challenge, called GoogLeNet (named in homage to Yann LeCun's popular LeNet). One of the interesting things about this paper is the move from fully connected to sparsely connected architectures, even inside the convolutions, to address the huge amount of computation required. The main idea in the architecture is to find out how an optimal sparse structure in a CNN can be approximated by readily available dense components.

    Questions:
    The authors say that judiciously applying dimensionality reduction would reduce the computation time by a lot. How does the 1x1 before 3x3 or 5x5 achieve this?
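
    A back-of-the-envelope multiply count, with made-up channel sizes (not taken from the paper), showing where the saving comes from: after the 1x1 reduction, the expensive 5x5 filters only have to look at the reduced number of channels.

      # Illustrative sizes: 256 input channels, reduced to 64, then 128 5x5 outputs on a 28x28 grid.
      c_in, c_mid, c_out, h, w = 256, 64, 128, 28, 28

      # Direct 5x5 convolution over all 256 input channels:
      direct = (5 * 5 * c_in) * c_out * h * w

      # 1x1 reduction to 64 channels, followed by the 5x5 convolution on the reduced map:
      reduced = (1 * 1 * c_in) * c_mid * h * w + (5 * 5 * c_mid) * c_out * h * w

      print(direct, reduced, round(direct / reduced, 1))   # roughly 3.7x fewer multiplies here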
