Friday, March 11, 2016

Mon, Mar 14 - Unsupervised Visual Representation Learning by Context Prediction

Unsupervised Visual Representation Learning by Context Prediction. Carl Doersch, Abhinav Gupta, Alexei A. Efros. ICCV 2015.

project page

19 comments:

  1. summary:
    This paper shows that a CNN can learn a 'better representation' without supervision by predicting the relative position of patches in an image. With the help of batch normalization, two AlexNets with tied weights are trained to predict a patch's position within a 3-by-3 grid. The kNN results show that an AlexNet trained on this task is better at grouping semantically similar images.
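
    To make the setup concrete, here is a minimal sketch (my own toy code, not the authors') of how the eight relative-position labels could be assigned, assuming a 3-by-3 grid of patches around a center patch:

    # Hypothetical helper: map the grid offset of a neighboring patch to one of
    # the 8 relative-position classes used as the prediction target.
    OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
               ( 0, -1),          ( 0, 1),
               ( 1, -1), ( 1, 0), ( 1, 1)]

    def relative_position_label(center_rc, neighbor_rc):
        """Return the class index (0-7) of neighbor_rc relative to center_rc."""
        dr = neighbor_rc[0] - center_rc[0]
        dc = neighbor_rc[1] - center_rc[1]
        return OFFSETS.index((dr, dc))  # raises ValueError if not an 8-neighbor

    # e.g. the patch directly above the center patch gets label 1
    assert relative_position_label((1, 1), (0, 1)) == 1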

    questions:
    The author discusses the learnability of chromatic aberration in Section 4.2 (and around 9:00 in his presentation).
    He mentions that CNNs like to take 'shortcuts' to make the prediction.
    My questions are:
    1) Why, and how, does he conclude that it is chromatic aberration the CNN is learning?
    2) Beyond this paper, what other possible 'shortcuts' can a CNN learn, and what tricks have people come up with to avoid those 'shortcuts' and force the CNN to learn the intended representation?

  2. Abstract:

    The paper presents an unsupervised learning approach that can learn a semantic understanding of objects in images. It can leverage very large collections of unlabeled images to improve accuracy on tasks like object detection. The concept is borrowed from text, where predicting a word from its context has led to very successful word vector representations. Here they train a conv net to take two patches from an image as input and predict their relative spatial location. They show that solving this task on unlabeled images captures a semantic understanding of objects. Finally, they show that a network pretrained with this approach helps on the object detection task.


    Discussion:

    1) They only use 96x96 patches; would taking patches of multiple sizes and then upscaling/downscaling them help this task?

    2) If we take this idea a little further and randomly shuffle patches from an image, then train a convnet to recover the patch order, could that work? Sort of a convnet trying to solve a jigsaw puzzle? Would it then focus only on low-level details instead of semantic understanding?

  3. The paper presents the idea of training on a pretext task in order to learn a representation for other, related tasks. The pretext task in this case is learning the relative spatial configuration of 2 patches from an image, which allows a self-supervised formulation for learning parts of objects. The authors use a late-fusion architecture of 2 ConvNets and employ techniques to correct for trivial solutions due to chromatic aberration. The key idea is to use one side of the network as a pretrained network and fine-tune it for object detection. Empirical results on Pascal VOC 2007 show state-of-the-art performance with this pretrained formulation. The authors also show applicability across datasets and to the task of unsupervised visual data mining.
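
    To make the late-fusion idea concrete, a rough PyTorch sketch (my own simplification, not the authors' exact AlexNet-based architecture) of two weight-shared towers whose features are concatenated and fed to an 8-way classifier:

    import torch
    import torch.nn as nn

    class PatchPairNet(nn.Module):
        """Toy late-fusion siamese net: one shared conv tower per patch,
        features concatenated, then an 8-way relative-position classifier."""
        def __init__(self):
            super().__init__()
            # Stand-in for the AlexNet-style tower (much smaller here).
            self.tower = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            self.classifier = nn.Sequential(
                nn.Linear(2 * 64, 128), nn.ReLU(),
                nn.Linear(128, 8),        # 8 relative positions
            )

        def forward(self, patch_a, patch_b):
            f_a = self.tower(patch_a)     # shared weights: the same tower
            f_b = self.tower(patch_b)     # processes both patches
            return self.classifier(torch.cat([f_a, f_b], dim=1))

    # usage: logits for a batch of 96x96 patch pairs
    net = PatchPairNet()
    logits = net(torch.randn(4, 3, 96, 96), torch.randn(4, 3, 96, 96))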

    Discussion:
    1. How much impact might the architectural changes needed to adapt the network to R-CNN have had on performance? The changes threw away a lot of information, which could have helped or hindered the network, so seeing a pure vanilla version trained on 227x227 patches would be an interesting comparison.

    2. The Fast R-CNN pipeline supplies both the entire image and the various proposals to maintain global understanding. Would a similar strategy while training the pretext task be helpful? At first glance, you would think that providing the whole image would lead to shortcut mappings since now the ConvNet is seeing the actual configuration of various patches (albeit downsampled), but an empirical verification of this would have been a nice-to-have.

  4. This paper is about learning visual representations by context prediction. The authors argue that performing well on their task signifies a good ability of the network to recognize objects and their parts. The task is to randomly sample two patches from an image and use them to predict their spatial relationship in terms of 8-connectivity. The authors train a combined (siamese-style) CNN to predict the relationship between the two patches, taking care to remove cases that yield trivial solutions. They present state-of-the-art results for their method.

    Question -
    The authors mention image borders and chromatic aberration as sources of trivial shortcuts for the network. Are there other cases that could also lead to trivial solutions?

  5. This paper presents a methodology that uses the location and spatial context of randomly sampled patches to teach an unsupervised algorithm about objects. It works by picking two patches within an image and learning their position relative to one another, the hypothesis being that doing this is only possible if the algorithm learns a good object representation. They accomplish this using a pair of AlexNet-like convolutional nets that share weights up to the 2nd fully connected layer, after which the two streams are processed together.

    Questions/Discussion :
    Would this algorithm also work on non-iconic images like MS-COCO?

  6. The paper presents an interesting unsupervised technique for extracting features which capture the semantics of an image. The proposed method is inspired by the skip-gram technique used to generate word representations. In order to capture context in an image, the authors divide the image into patches and formulate the problem as determining the relative positions of any two given patches. The authors train a pair of AlexNet-style networks that compute the representation of each patch separately until the last layers, where the representations are fused and the output is fed to a softmax layer.
    The authors use this unsupervised feature representation for object detection and surface normal estimation and achieve performance comparable to that achieved using labeled ImageNet data.

    Questions:
    Could you explain a little bit more about chromatic aberration and how it helped the network to learn a lazy representation of the image patches?
    I did not clearly understand why the network was getting stuck at a saddle point.

  7. This paper introduces an unsupervised model that predicts the relative spatial location of two patches from the same image. To predict such high-level information, the model needs to reason about objects and their parts. To avoid trivial solutions, the authors introduce a set of preprocessing steps that force the system to learn high-level semantic features rather than taking shortcut solutions based on boundary patterns or texture. The proposed model accepts two patches as input, selected randomly from the same image, feeds each patch to an AlexNet-style CNN, and finally, using late fusion, combines the learned features from both networks and forwards them to a softmax that produces a probability for each of the 8 relative positions. The system is also used as a pre-training step for R-CNN on the object detection task.
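
    For reference, a toy version of the 'color dropping' preprocessing (my guess at the spirit of it, not the paper's exact recipe): keep one randomly chosen color channel and replace the others with low-amplitude noise, so the net cannot rely on cross-channel cues such as chromatic aberration.

    import numpy as np

    def drop_colors(patch, rng=np.random):
        """Keep one randomly chosen color channel of an HxWx3 float patch and
        replace the other two with low-amplitude noise. Illustrative only."""
        keep = rng.randint(3)
        out = rng.normal(scale=0.01, size=patch.shape).astype(patch.dtype)
        out[..., keep] = patch[..., keep]
        return out

    print(drop_colors(np.random.rand(96, 96, 3).astype(np.float32)).shape)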

    I understand that the model can learn whether the image patches belong to the same object or not, but it is not clear to me what kind of features the model learns to determine the relative locations.

    Can we discuss the chromatic aberration problem and whether it shows up in other visual representation-learning tasks?

  8. This paper talks about learning with deep convolutional networks without labels. The “training” data is the location of image patches relative to each other. It works quite well when looking at nearest neighbors for patches, but their networks don’t outperform networks trained with ImageNet labels.
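
    As a side note, a minimal sketch of the kind of nearest-neighbor check they run, assuming we already have one feature vector per patch (hypothetical feature matrix, not the paper's actual fc features):

    import numpy as np

    def nearest_neighbors(features, query_idx, k=5):
        """Return indices of the k patches whose L2-normalized features are
        closest to the query by cosine similarity. `features` is N x D."""
        f = features / np.linalg.norm(features, axis=1, keepdims=True)
        sims = f @ f[query_idx]
        order = np.argsort(-sims)
        return [i for i in order if i != query_idx][:k]

    feats = np.random.randn(100, 64)
    print(nearest_neighbors(feats, query_idx=0))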

    The learning of chromatic aberration was one of the coolest parts of this paper to me. Are there other examples in computer vision where deep networks find a shortcut to solve a problem that is quite different than the one the network was designed for?

  9. This paper uses spatial context to guide the unsupervised learning of objects. The approach takes two random patches from an image and has a neural net architecture (actually two networks joined by late fusion) predict where the patches are relative to each other in the image. This allows the algorithm to scalably learn something semantically meaningful about unseen objects, like “wheels often occur near each other” or “pointy ears occur above eyes”, without requiring a human-fed rule set.

    1. Could we use this in a lightly supervised way, perhaps by fusing three networks, where the third takes into account context such as tags, comments, GPS, etc.?
    2. What about using a pyramid representation of the patches? Would that improve performance?

  10. This paper learns the relative positions of patches of an image using an AlexNet-like network and unsupervised learning. The work is motivated by how words in a text can be understood from context. Training is done on the ImageNet database (without annotations). The network architecture is like two AlexNets in parallel, each taking one patch as input, up to fc6, followed by a common fc7, fc8, and softmax; some of the layers share weights. During training, the first patch is sampled uniformly from the image and the second patch is taken randomly from the 8 neighbors of the first. To avoid trivial solutions, neighboring patches are separated by a 48-pixel gap, patches are jittered by up to 7 pixels, and chromatic aberration is countered by color projection or color dropping.
    Testing is done on PASCAL VOC 2007. First the authors analyze nearest-neighbor retrieval performance by comparing against an AlexNet trained on ImageNet. Then they use the unsupervised pre-trained network to boost object detection performance by converting the network into an R-CNN pipeline. The authors also propose using this system for unsupervised data discovery, i.e., visual data mining without human annotators.
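
    As a rough illustration of the gap-and-jitter sampling described above, a small sketch with hypothetical coordinates (not the authors' code):

    import random

    PATCH = 96    # patch side length
    GAP = 48      # gap between neighboring patches
    JITTER = 7    # maximum random offset applied to the second patch

    # offsets of the 8 neighbors, in units of (patch + gap)
    NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1),
                 ( 0, -1),          ( 0, 1),
                 ( 1, -1), ( 1, 0), ( 1, 1)]

    def sample_pair(img_h, img_w):
        """Return top-left corners of (center, neighbor) patches and the label 0-7."""
        step = PATCH + GAP
        margin = step + JITTER                 # keep the jittered neighbor inside the image
        cy = random.randint(margin, img_h - margin - PATCH)
        cx = random.randint(margin, img_w - margin - PATCH)
        label = random.randrange(8)
        dy, dx = NEIGHBORS[label]
        jy, jx = (random.randint(-JITTER, JITTER) for _ in range(2))
        return (cy, cx), (cy + dy * step + jy, cx + dx * step + jx), label

    print(sample_pair(500, 500))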

  11. Summary
    This paper proposes an algorithm that takes patch pairs from each image and learns their relative positions with a convolutional neural net, which requires the net to recognize objects and their parts. The learned representation is used to perform unsupervised object discovery on PASCAL VOC. They find that the resulting network, fine-tuned with R-CNN, gives the best results among state-of-the-art algorithms that use only the Pascal-provided training set annotations.
    Questions
    What do they mean by “shared weights” represented by dotted lines in the architecture in Figure 3?
    How do downsampling and upsampling improve robustness to pixelation?
    How does batch normalization force the network to vary activations across examples?
    Can this kind of unsupervised learning be applied to scene understanding and similarity between scenes?

  12. This paper proposes a novel way to train convolutional nets in an unsupervised manner by using a pretext task on images. The authors aim to leverage the power of spatial coherence in objects and images to learn an embedding without supervision.

    The main contributions of this paper are:
    1. They train a net using a patch and one of its 8 neighbors, with the 8 relative positions as categories encoding the spatial relationship between them.
    2. They develop a siamese architecture modelled on AlexNet to learn this relationship.
    3. They test extensively on a variety of tasks, from object detection to visual data mining.

    They show that pretraining with their method, without any supervision, gives a gain of 7% over a net trained from scratch for object detection. They also show results on visual data mining and on clustering semantically similar images that outperform the state of the art.
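
    As a loose sketch of what such pretraining-then-fine-tuning could look like in code (hypothetical checkpoint name and layer sizes, not the authors' setup), reuse one pretrained patch tower as a feature extractor and train a fresh task head on top:

    import torch
    import torch.nn as nn

    tower = nn.Sequential(                       # stand-in for the pretrained conv tower
        nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )
    # tower.load_state_dict(torch.load("pretext_tower.pt"))  # hypothetical checkpoint
    head = nn.Linear(64, 20)                     # new head, e.g. 20 VOC classes, trained from scratch
    optimizer = torch.optim.SGD(
        [{"params": tower.parameters(), "lr": 1e-4},   # small lr for pretrained weights
         {"params": head.parameters(), "lr": 1e-2}],   # larger lr for the new head
        lr=1e-3, momentum=0.9,
    )
    logits = head(tower(torch.randn(2, 3, 96, 96)))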

    Questions:
    I did not understand the reason for the 8-way neighborhood based on spatial relationships; why encode top-right and top-left as separate categories?

    They talk about a method of rescaling their variances; could you elaborate on this?

  13. The paper trains a deep convolutional net to determine the relative locations of image patches. One of the interesting aspects is that the net is trained on data that requires no annotation. This is important because it means that the amount of available training data is huge. The authors find that even though the net wasn't trained for object recognition, it still seems to develop features that detect objects. They show that the net can be fine-tuned to work well for a variety of tasks.

    Question:

    Has there been any work to train a similar net on an even larger dataset? Since labels aren't required, it seems like webscraping and filtering duplicates could provide a massive unlabeled dataset.

  14. This paper argues that their CNN learns better representations by predicting the relative position of patches in an image. They leverage the idea of spatial coherence in an unsupervised manner.

    1) How do you detect chromatic aberration?

  15. This paper presents a method of unsupervised learning by predicting relative positions of patches in an image. The task requires the model to learn object representations. The authors use a pair of AlexNet-like architectures that are trained on the patches independently until the second fully connected layer, where the networks and the patch representations fuse. This can also be used to pre-train R-CNNs for object detection.

    Can you explain how batch normalization combats the saddle point issue?

  16. By predicting the relative position of patches in an image, the idea of this paper is to learn better representations of the image using a CNN in an unsupervised way. The idea is analogous to the one used in text prediction tasks, where having context makes things easier. They pick two image patches and run them through a pair of AlexNet-based CNNs that are combined using late fusion.

    Question:
    Have they explored various patch sizes or handling different resolutions? Also, how would this algorithm work on the COCO dataset, so that it could promote better scene understanding?

  17. This paper presents a way to learn better image representations by using spatial context: a deep network is trained to recover the relative positions of image patches.

    Questions:
    Can this work be adapted for structural image editing, like in PatchMatch?

  18. The authors present a CNN that learns representations by exploiting spatial relationships in an unsupervised way. The idea is to build a network that predicts the relative position of patches in an image.

    Discussion:
    How were the chromatic aberrations detected?
