This paper presents what was, at the time, one of the largest convolutional nets trained on a subset of the ImageNet data, on which it excelled, achieving the best results reported. The authors have made their code available, and the paper itself gives many excellent practical insights into designing and optimizing a convolutional net.
Questions:
1) Why do overlapping pooling layers help to minimize overfitting?
2) I understand how dropout helps to combat overfitting, but would dropout also help to counter the back-prop problem discussed briefly on Wednesday, where the initial layers of the network may end up getting only a very diluted benefit from the error adjustments?
Discussion: Could we briefly discuss the particulars of auto-encoders, specifically what kinds of tasks they are best suited for and how they are implemented?
Abstract: The paper describes the details of AlexNet, a large convolutional neural network designed for the object recognition task on the ImageNet dataset. The paper explains the architecture of the network in detail, along with the tricks that the authors thought helped the most: using rectified linear units instead of a squashing function, using multiple GPUs to train faster, response normalization across feature maps, and overlapped pooling. Data augmentation and dropout are used as regularization and seemed to help a lot with the overfitting problem. The results of the paper were a breakthrough in the CV field and opened paths to many such neural networks.
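For reference, the response normalization mentioned above is defined in the paper roughly as follows, where a^i_{x,y} is the activity of kernel i at position (x, y), N is the number of kernels in the layer, and the sum runs over n adjacent kernel maps (the hyper-parameter values below are the ones I recall the paper using, so treat this as a best-effort transcription):

```latex
b^{i}_{x,y} = \frac{a^{i}_{x,y}}{\left(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left(a^{j}_{x,y}\right)^{2}\right)^{\beta}},
\qquad k = 2,\; n = 5,\; \alpha = 10^{-4},\; \beta = 0.75
```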
Discussion:
1) Why not do normalization over a spatial neighborhood within a particular feature map as well? In the referenced paper (Jarrett et al.), they seemed to normalize across feature maps as well as over spatial neighborhoods.
2) The paper doesn't give substantial theoretical or empirical evidence for why overlapped pooling is better. Might it be better just because it has more parameters in the next layer due to the smaller stride? Are there any experiments where people analyze the stride parameter more carefully?
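To make the stride/kernel-size comparison concrete, here is a small Python sketch of the pooled output width for a few (z, s) settings on a 55-wide feature map (55 being the width after the first convolution). Note that the paper's own comparison, s = 2, z = 3 versus s = 2, z = 2, keeps the output dimensions equal, so, if I read it correctly, the downstream parameter count is not what changes:

```python
def pooled_size(n, z, s):
    """Number of pooling windows along one dimension of an n-wide input
    with window size z and stride s (no padding)."""
    return (n - z) // s + 1

for z, s in [(2, 2), (3, 2), (3, 3)]:
    print(f"z={z}, s={s}: 55 -> {pooled_size(55, z, s)}")
# z=2, s=2: 55 -> 27   (non-overlapping)
# z=3, s=2: 55 -> 27   (overlapping, same output size)
# z=3, s=3: 55 -> 18
```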
The paper discusses the design decisions made to create a Convolutional Neural Network, with the objective of minimizing overfitting, that achieved the best reported results on ImageNet LSVRC-2010 and won ILSVRC-2012. The main decisions, in order of importance, are: Rectified Linear Units, which help speed up training; a 2-GPU implementation with the innovative step of allowing only certain layers (every layer except the 2nd, 4th and 5th) to communicate across GPUs; local response normalization, which basically causes neurons to compete for large activations; and overlapping pooling, in contrast to the usual non-overlapping pooling. To reduce overfitting, the training set was augmented by extracting random 224x224 patches (and their horizontal reflections) from each training image, and by applying PCA to the RGB values over the entire training set and adding to each image the principal components scaled by their eigenvalues times a random factor. Dropout regularization, which involves randomly turning off neurons so as to simulate different architectures, was also used. Results of their submissions are provided.
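A minimal NumPy sketch of the PCA colour augmentation described above, assuming images come in as (H, W, 3) float arrays; the function names and the 0.1 standard deviation are my own reading of the paper, not code from the authors:

```python
import numpy as np

def rgb_pca(images):
    """Eigen-decomposition of the 3x3 covariance of RGB values over the training set."""
    pixels = images.reshape(-1, 3)            # stack all pixels of all images
    cov = np.cov(pixels, rowvar=False)        # 3x3 RGB covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # symmetric matrix, so eigh is fine
    return eigvecs, eigvals

def pca_color_augment(image, eigvecs, eigvals, sigma=0.1):
    """Add random multiples of the RGB principal components to every pixel."""
    alphas = np.random.normal(0.0, sigma, size=3)   # one random scale per component
    shift = eigvecs @ (alphas * eigvals)            # (3,) offset added to every pixel
    return image + shift
```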
Discussion:
1. Can the process of Local Response Normalization be elaborated upon? I don't seem to understand it completely.
2. What was the guiding decision behind which layers would be able to communicate across GPUs? That does not seem to have been given enough attention.
The paper presents the architecture and techniques used in the construction of AlexNet, an 8-layer convolutional neural network. At the time of publication, the net was one of the largest around. The authors list the techniques that they believe were the most important in creating a successful architecture: using ReLUs to increase training speed, training on multiple GPUs, local response normalization, and overlapping pooling. The paper also emphasizes the importance of artificially increasing the amount of training data with simple transformations to help prevent overfitting.
Questions:
1) The authors state that they preprocess the training images by subtracting the mean of the entire training set from each image. I've noticed that this is commonly done, but have never understood the intuition behind it. Can someone explain why this is beneficial?
The paper is one of the papers that put deep learning on the map. They have a large (at the time) network of 5 convolutional layers and 3 FC layers. Techniques include ReLU, LRN, pooling, multiple GPUs, dropout, and weight decay.
Q:
I've seen different normalizations (normalizing an image and batch normalization are two that come to mind), but I've never heard a good explanation of WHY you want to do some sort of normalization.
Another question: is it helpful to do (i) max(0,x^n) for some n>1, (ii) max(-i,x), where i is some real number < 0, or (iii) max(-i,x^n)? It would seem that by using something bigger than x, you might increase convergence time.
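A tiny NumPy sketch just to make the proposed variants concrete; the exponent n = 2 and the constant i = -0.5 are arbitrary choices for illustration (note that with i < 0 the floor -i is positive, so the function is clipped from below at a positive value):

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 9)

relu        = np.maximum(0.0, x)        # the paper's choice: max(0, x)
n, i = 2, -0.5
variant_i   = np.maximum(0.0, x**n)     # (i)   max(0, x^n)
variant_ii  = np.maximum(-i, x)         # (ii)  max(-i, x), floored at -i = 0.5
variant_iii = np.maximum(-i, x**n)      # (iii) max(-i, x^n)
```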
This paper discusses deep learning as applied to image recognition. What was different about this paper was its focus on the physical implementation. The authors trained their network on two GPUs and put a lot of thought into how to maximize training efficiency. At some points in the network, distinct sections of layers reside on only one GPU (or the other). They use this to minimize the cross-talk needed between GPUs (as cross-GPU communication is more time-consuming than same-GPU communication). They discuss their impressive results.
Discussion: If I show a neural network a picture of a dog, and then I show it the same picture of a dog rotated 90 degrees, will I get the same output (assuming the network is set up to accept square pictures and I don't have to scale or crop anything)? In theory, I don't think there is anything that gives rotational invariance. In practice... I'm not sure. Is orientation one of the things a neural net learns? Will a network trained on everyday images of people have trouble picking up an upside-down girl on monkey bars? If yes, why?
When training a neural network for object detection or classification in images, using an MLP suffers from the "curse of dimensionality". Even a simple image with limited visual features can have a million pixels going into the MLP as inputs. This requires the network to have a tremendous number of perceptrons, and hence a lot of weights to be adjusted, in turn slowing the training process and also causing overfitting. Convolutional Neural Networks (CNNs) come to the rescue, as they are designed to take the spatially local correlation in an image into account. They also have additional features compared to MLPs, like shared weights, a 3D volume of perceptrons, and non-global (local) connectivity, which make them better and faster image classifiers.
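A back-of-the-envelope sketch of the parameter-count argument above, comparing a single fully connected layer on a modest image against a single convolutional layer with shared weights (the layer sizes here are made up for illustration):

```python
# Fully connected: every one of the H*W*C inputs connects to every hidden unit.
H, W, C = 224, 224, 3
hidden_units = 4096
fc_weights = H * W * C * hidden_units     # 616,562,688 weights

# Convolutional: a bank of small shared filters, independent of image size.
num_filters, kh, kw = 96, 11, 11
conv_weights = num_filters * kh * kw * C  # 34,848 weights

print(fc_weights, conv_weights)
```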
This paper talks about how a deep CNN was implemented for image classification on the ImageNet dataset, and presents its results.
The CNN has 5 convolutional layers, some followed by pooling layers, and 3 fully connected layers, with a softmax at the end. The non-linearity used for each unit is the ReLU (Rectified Linear Unit), which is faster to train than the standard sigmoid/tanh functions. Since the network is still big, it was implemented on 2 parallel GPUs, with the network layers carefully placed so as to minimize communication between the GPUs. Other features used for better accuracy are overlapping pooling, local response normalization, and data augmentation (creating synthetic images from the given dataset to increase the training data and in turn reduce overfitting).
As a result of these novel features, the CNN showed the best error performance compared to other classification algorithms on the same dataset.
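For reference, a single-stream PyTorch sketch of the layer structure described above. This merges the paper's two GPU streams into one, so the channel counts are the combined ones; treat it as an approximation of the published architecture rather than the authors' exact two-GPU model:

```python
import torch.nn as nn

# 5 conv layers (some followed by LRN and overlapping max-pooling) + 3 FC layers.
# Assumes 227x227 input; the paper says 224x224, but 227 makes the arithmetic
# come out to the 55x55 first-layer feature maps reported in the paper.
alexnet_like = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    nn.MaxPool2d(kernel_size=3, stride=2),                      # overlapping: z=3 > s=2
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),                                      # class scores; softmax follows
)
```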
Discussion:
1) What metrics (maybe based on the given dataset or task) can we use to come up with a "good" initial network architecture, from which we can start fine-tuning?
2) While comparing results with others (sparse coding and SIFT+FVs), were the total resources used (memory) and the time to train also compared?
3) From Table 2 we can see that going from 1 CNN to 5 CNNs gave a ~2% improvement. What is the effect of increasing the number of models being averaged? How much does it improve as that number grows?
AlexNet blew every ILSVRC submission out of the water at the time, and this is the paper describing it. The contributions of the paper are, first, the network itself, and second, a method to train large CNNs using heavily optimized convolutions running on a GPU.
Questions:
1) Very little information is given as to the rationale behind the overall network architecture (choosing which layers communicate across GPUs, the number of layers, the pooling layers, etc.). Are there any talks or further work by the authors that provide more details?
2) Were any attempts made to compress the network after the fact?
3) Were other architectures tried and found to do worse? If so what were they and why did they fail?
This paper presents an eight-layer convolutional neural network trained on a subset of the ImageNet dataset. The authors describe its architecture, its training process, and how it surpassed all previous results. It was one of the largest networks of its kind and could potentially have been larger were it not for the memory limitations at the time.
Are there any other advantages (other than minimizing the amount of computation) to having only certain layers communicate across GPUs?
The authors mentioned that faster GPUs and larger datasets would improve their performance. Has there been a follow-up on this?
Summary: This paper is mainly a description of the architecture and performance of AlexNet, which achieved state-of-the-art recognition results in the ImageNet competition in 2012. Rectified Linear Units (ReLUs) were used as a non-saturating activation function to achieve faster training. Even though ReLUs don't require input normalization to prevent saturation, local normalization of each pixel across kernels in the same layer helps with generalization. Overlapping pooling is used to summarize local regions of the image and extract higher-level features. To reduce overfitting, given an already large amount of training data, data augmentation was used (extracting random patches of the image and their horizontal reflections, and changing image intensities based on PCA), plus dropout with a rate of 0.5, which roughly doubles the training time.
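A minimal sketch of the dropout scheme mentioned above, as I understand it from the paper: each hidden unit's output is zeroed with probability 0.5 during training, and at test time all units are kept but their outputs are halved (the function below is illustrative, not the authors' code):

```python
import numpy as np

def dropout(activations, p=0.5, train=True):
    """AlexNet-style dropout: zero each unit with probability p while training,
    and scale all outputs by (1 - p) at test time."""
    if train:
        mask = np.random.rand(*activations.shape) >= p   # keep with probability 1 - p
        return activations * mask
    return activations * (1.0 - p)
```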
Questions: This paper reads more like a description of the architecture of the CNN they built to reach state-of-the-art results in the ImageNet competition. One complaint about deep learning methods is their lack of theoretical backing, at least mathematically, if perhaps not from the neurobiological side.
It works pretty well, but it would be really good if, given the parameters of the CNN architecture and the computer vision task (supposing the quality of the data is good), we could come up with an upper bound on the amount of data we need to achieve a certain error rate within a small deviation epsilon. Or do you think it is feasible to come up with a PAC learning framework that precisely describes CNNs?
AlexNet is a very influential paper which broke all the previous benchmarks on the ILSVRC-2010/12 datasets and introduced a way to train a CNN for the recognition task with reduced training time and overfitting. The paper describes the various approaches the authors took to reduce training time, namely using ReLUs and cross-GPU parallelization. They also describe the use of dropout layers, which force neurons to learn more robust features and thereby greatly help to minimize overfitting. Finally, they report their top-1 and top-5 test set error rates, which are much better than those reported earlier.
Discussion: 1. Since the authors repeatedly mention the problem of overfitting, why did they not employ any regularization techniques and simply choose to rely on a dropout layer?
2. While discussing the training on multiple GPUs, a trick to reduce computation which employs layer-wise communication between GPUs is mentioned. Why does this layer-wise connectivity between GPUs cause problems for cross-validation?
3. Why did they "subtract the mean activity over the training set from each pixel" in each image in the preprocessing stage?
The paper describes the convolutional neural net used for the ImageNet task. The architecture contains five convolutional layers followed by three fully connected layers. Other techniques, such as ReLUs, training on multiple GPUs, local response normalization, and overlapping pooling, are discussed, which improve performance in terms of both accuracy and time. Data augmentation and dropout layers help reduce overfitting, as mentioned in the paper. Finally, the results are reported, which were the best at the time of publication.
Question: How does local response normalization work and help with generalization? The paper mentions improved error rates from this technique.
The paper presents the results of training an 8-layer Convolutional Neural Network on a subset of the ImageNet dataset using a highly optimized GPU implementation. The paper proposes some novel elements in the CNN architecture, including using Rectified Linear Units to introduce non-linearity, training on multiple GPUs to speed up the training process, and local response normalization. The authors use the dropout technique and perform data augmentation to reduce over-fitting.
Question:
1) The paper mentions that CNNs with ReLUs train faster than those using |tanh|. I am not clear why this is the case. Also, is there any advantage to using |tanh| or the sigmoid function to introduce non-linearity?
Summary: This paper talks about building and training a large, deep convolutional neural network (CNN) for classifying ImageNet data. The notable success of this network came in the LSVRC-2010 and ILSVRC-2012 contests, where it achieved error rates better than the state of the art at the time. The authors tackled various problems: 1) making training faster by using ReLU instead of the common tanh function, along with two parallelized GPUs, and 2) overfitting, by using dropout regularization.
Questions:
1) Why is the split of neurons in one layer across GPUs a problem for cross-validation? The computation is parallelized on the 2 GPUs, right? How does it affect training, if at all?
2) Why are models with overlapping pooling more difficult to overfit?
The paper describes how a deep, large convolutional neural network was trained to classify images in ImageNet. The network comprises 5 convolutional layers and 3 fully connected layers. Despite the large size of the network, over-fitting was tackled using dropout and data augmentation techniques. ReLU was used to speed up training.
Questions:
1. The network implemented in the paper consists of 5 convolutional layers and 3 fully connected layers. What factors other than the amount of memory on the GPU and the training time pose a constraint on the network size? Assuming we have no constraints on memory or training time, would the error rate keep decreasing as we increase the number of layers, or would it saturate at some point?
2. The paper mentions that overlapping pooling helps tackle over-fitting. How does this work?
3. I don't quite understand the specialization exhibited by the 2 GPUs, where kernels on GPU 1 are largely color-agnostic while kernels on GPU 2 are largely color-specific. The paper mentions that this kind of specialization is independent of any particular random weight initialization. How/why does this happen?
This paper presents AlexNet. It won the ILSVRC prize in 2012 and was the first CNN model to do so. It comprises 5 convolutional layers and 3 fully connected ones. It beat out the competition by achieving a ~15% top-5 error rate and set the stage for GoogLeNet and MSRA. To tackle the problem of overfitting, data augmentation and dropout were used.
Questions:
1. I'm confused about the kernel size of the second layer, 5 x 5 x 48. As far as I understood, doesn't at least one of the dimensions of the kernel have to be at most the number of channels in the image (3)? How can all of the dimensions be greater than 3?
2. Also, why do the researchers subtract the mean activity from each pixel, and how does it help?
This paper describes the AlexNet deep architecture. One of the major contributions of this paper is an architecture that can be trained on more than one GPU, which increases the network's capacity to learn. This made AlexNet, at that time, the biggest deep network with its 8 layers (a far cry from the size of modern deep architectures with their 152 layers). Over-fitting was addressed by using data augmentation techniques and dropout.
1. I'd like to see a system that tracks the information flow in the network (perhaps similar to the models that attempt to track information flow in the human brain). It would be great to see, in a visualization, how individual pixels affect the network flow and change the system.
2. I'd also like to know the intuition behind the architecture choice. It seems certain layers were chosen for specific reasons, but beyond hacking together random layers, it would be interesting to know the intuition behind the layer choices and the number of layers. Essentially, I'm curious why these layers work and why a different set wouldn't.
The paper presents AlexNet, a deep convolutional neural network trained to classify images using the ImageNet database. It has 8 layers: 5 convolutional layers and 3 fully connected layers. A variant of this won ILSVRC-2012 with a top-5 test error rate of 15.3%, compared to 26.2% for the second-best entry.
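For anyone unsure what top-1 versus top-5 means here, a small sketch of how the two error rates are computed from a model's class scores (the names are illustrative):

```python
import numpy as np

def topk_error(scores, labels, k):
    """scores: (N, num_classes) array of class scores; labels: (N,) true class indices.
    An example counts as correct if its true label is among its k highest scores."""
    topk = np.argsort(scores, axis=1)[:, -k:]          # indices of the k largest scores
    correct = (topk == labels[:, None]).any(axis=1)
    return 1.0 - correct.mean()

# top1 = topk_error(scores, labels, 1); top5 = topk_error(scores, labels, 5)
```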
Discussions: 1) The issue of overfitting comes up time and again. Transforming the images and changing pixel intensities are discussed as ways to synthetically enlarge the dataset. How about tackling the insufficient-data/overfitting issue by perturbing the image with some kind of random vector field that moves the pixels about but still preserves edges?
2) Since training on two GPUs has the limitation that the GPUs communicate only in certain layers, won't this always be an issue when training on GPUs as training sets become larger and larger?
3) The last line of section 3.1 states that faster learning has a great influence on the performance of large models trained on large datasets. Why?
This is a seminal paper on CNNs that began the deep learning surge. It outperformed the next-best competitor in the ILSVRC challenge by about 10%. The paper makes the following contributions:
1. It outlines the first known large-scale working architecture with 8 layers, of which 5 are convolutional and 3 are fully connected.
2. It demonstrates the importance of methods like dropout and data augmentation in reducing overfitting.
3. They show quantitative results on the ILSVRC challenge with a top-5 error of ~16% and also release the code for public use.
Questions:
1. They say dropout reduces overfitting, but they apply it only on the fully connected layers. Why is dropout not used for the convolutional layers?
2. Is local response normalization the same as normalization across feature maps, which was asked about in the last class?
3. Is there an intuition behind the architecture of each layer, defined by the size of the filter and the number of filters per layer?
This paper introduces AlexNet, a convolutional neural network architecture for image recognition. The proposed model consists of 8 main layers, 5 convolutional and 3 fully connected, making a total of about 650K neurons and 60 million free parameters. The model was trained on a subset of the ImageNet database for the ILSVRC challenge, where it won first place on the top-1 and top-5 error rates. AlexNet is one of the most important and earliest deep learning models because it brought convolutional networks back onto the map.
The question is about the architecture design: what is the intuition behind all the kernel sizes? And the max-pool filter sizes?