Friday, February 19, 2016

Mon, Feb 22 - Places database

Learning Deep Features for Scene Recognition using Places Database. B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. NIPS 2014.

project page, pdf, demo

23 comments:

  1. Hey folks, check this link out:

    http://people.csail.mit.edu/torralba/research/drawCNN/drawNet.html

    This app shows the configuration of the Places-CNN, the connections between any particular unit and those in other layers, and the four images that produce the strongest activation for that unit.

  2. This paper presents a new dataset for scene recognition, consisting of over 7 million images. The paper describes how the images were labeled through Mechanical Turk. The authors present a way to measure the relative density and diversity of datasets and, again through Mechanical Turk, compute these measures for SUN, ImageNet, and Places. All datasets had similar densities, but the Places database was the most diverse. The authors then created Places-CNN based on the Caffe reference network. Places-CNN shows significant improvements in classification accuracy on Places 205 and SUN 205 over a baseline of ImageNet-CNN features with an SVM classifier. Finally, Places-CNN features and ImageNet-CNN features are compared on a collection of different datasets; the CNN features are passed into an SVM classifier for both nets (a minimal sketch of this pipeline follows the question below). The authors found that Places-CNN produces higher accuracy on scene-centric databases, while ImageNet-CNN performs better on object-centric databases.

    Question:

    Have the density and diversity measures from the paper been used to guide the creation of any recent datasets?
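
    A minimal sketch of the features-plus-SVM pipeline described above, assuming the CNN features (e.g., 4096-d fc7 activations) have already been extracted; the random arrays below are placeholders for real features and labels:

    ```python
    # Hypothetical stand-ins for CNN features extracted from a pretrained
    # network; in the paper these would be fc7 activations.
    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    feats_train = rng.normal(size=(1000, 4096))   # placeholder fc7 features
    y_train = rng.integers(0, 205, size=1000)     # placeholder scene labels
    feats_test = rng.normal(size=(200, 4096))
    y_test = rng.integers(0, 205, size=200)

    clf = LinearSVC()                 # linear SVM on frozen CNN features
    clf.fit(feats_train, y_train)
    print("accuracy:", clf.score(feats_test, y_test))
    ```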

  3. This paper gives a new scene-centric dataset, Places, and describes how AMT was used to label it. A good database should be dense and diverse. The relative densities of ImageNet, SUN, and Places were calculated: on average the densities are similar, but Places was more diverse, as measured with a variant of the Simpson index of diversity. Places-CNN shows improvement on scene-centric databases, while ImageNet-CNN does better on object-centric databases.

    Q1 (potential research question): Could the two networks' features be combined using something like knowledge distillation?

    Q2: Why use the Simpson index of diversity? Would different notions of similarity change the result? (A small worked example of the classical index follows below.)
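
    For reference, a toy sketch of the classical Simpson index of diversity, D = 1 - sum_i p_i^2; the paper's measure is a variant defined over AMT similarity judgments between image pairs rather than discrete class counts, so this only illustrates the underlying idea:

    ```python
    # Classical Simpson index: D = 1 - sum(p_i^2), where p_i is the fraction
    # of items in class i. Higher D means a random pair of items is more
    # likely to differ, i.e., the collection is more diverse.
    from collections import Counter

    def simpson_diversity(labels):
        counts = Counter(labels)
        total = sum(counts.values())
        return 1.0 - sum((c / total) ** 2 for c in counts.values())

    print(simpson_diversity(["a", "a", "b", "c"]))  # 0.625
    ```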

  4. Summary:
    This paper presents the Places dataset, a counterpart to ImageNet built specifically for scene recognition. The dataset was constructed with high density and diversity in mind, and the authors introduce corresponding metrics. The key result is that feeding different data (for a different task) into a CNN with the same architecture boosts performance by around 10% on the 205 scene categories, because a different representation is learned at the higher layers from the data. The authors also trained the same CNN on hybrid data from Places and ImageNet and achieved some state-of-the-art results at the time.

    Question:
    This paper treats relative density and diversity as important metrics for dataset quality. What are some other good metrics for recognition tasks? What about metrics for other computer vision tasks we haven't covered this semester?

  5. The paper introduces “Places”, a new comprehensive database of scene-centric images. The database consists of more than 7 million images of 476 scene categories. To create the database, the scene taxonomy from the SUN database was used to query images on three different search engines, and manual annotation was performed using AMT. The paper presents an extensive comparison between Places and other benchmark databases (such as ImageNet and SUN) in terms of density and diversity. Training a deep architecture (Places-CNN) on the new database and testing the trained model on different datasets yields higher accuracy than the same CNN model trained on ImageNet.

    Research Question:

    Table 3 presents the classification accuracy on different datasets using Hybrid-CNN features, and on all object-centric datasets the accuracy is lower than what Table 2 reports using only ImageNet-CNN features. Any thoughts on why this is the case?

  6. This comment has been removed by the author.

  7. Abstract:

    The paper presents a new image dataset (Places) that focuses on scene recognition rather than the object recognition that ImageNet targets. The authors show that scene recognition and object recognition have different characteristics and that ImageNet is not sufficient for scene recognition tasks. The Places dataset has around 7 million images in 476 scene categories, making it unique in having such a huge number of scene images. The authors show that Places is similarly dense and more diverse than currently available datasets. Finally, they train a CNN that achieves state-of-the-art scene recognition performance on various benchmarks, and they visualize the layers to demonstrate that the receptive fields learned for the two problems are different. This new dataset thus provides a baseline for scene recognition problems.

    Discussion:

    1) What is the current state-of-the-art architecture on the Places dataset?

  8. This paper outlines a new dataset, called Places, that contains 7 million labeled images of scenes falling into 476 categories. The authors determined categories by combining the SUN categories with adjectives. They pull images from search engines like Google, and labels are validated using Mechanical Turk. They make some claims about diversity and density to justify their dataset, again verified through Mechanical Turk. They plugged their dataset into the Caffe reference (“ImageNet”) CNN. Some of the features from various layers of the CNN are visualized; I don’t see anything particularly interesting there. They report their results, and interestingly enough, CNNs trained on their dataset perform worse on object datasets (like ImageNet) than CNNs trained on ImageNet.

    Discussion:
    They list the performance of ImageNet-CNN features + an SVM classifier on the Places dataset (and compare against just the Places-CNN). How did the ImageNet CNN without the SVM do? Where is the room for improvement? How much variability is there between scenes? Are some scenes really easy and some hard? Or is poor (indistinguishable) labeling an issue?

  9. We've seen before, with MS COCO, that iconic image representations tend not to be very useful when trying to learn about images in context. This paper first shows that such images are also not very useful for training classifiers for places (the context an image might be in), and then provides a very large and diverse dataset of images that are more representative of the different contexts an image might be found in.

    Questions:

    1) We've seen a lot of papers presenting datasets of varying quality now. Other than the number of images per class and the number of classes, what metrics can be used to measure how diverse a dataset is, or how hard classification on that dataset is?

    -Stefano

  10. This paper introduces the Places dataset, created for scene recognition, and its performance compared to other available datasets like ImageNet and SUN. The dataset is built by mining images from Google Images, Flickr, and Bing Images using category names paired with adjectives, then filtering the results with AMT. The dataset is compared to ImageNet and SUN using density and diversity as metrics: for a random image, a smaller distance to its nearest neighbor in the dataset indicates higher density, while smaller distances between random pairs of images indicate lower diversity (a toy sketch of this nearest-neighbor intuition follows the discussion below). The authors again use AMT to judge images and their nearest neighbors in all three datasets and estimate density and diversity empirically. Per the results, all three datasets have similar densities, but Places has the highest diversity.
    But the most important way to compare datasets is to run the classification algorithms they were created for. A CNN is trained for scene recognition using all three datasets (only the scene categories from ImageNet). Since Places has more images per category than SUN, and is more scene-centric than ImageNet, it performs best in classification accuracy.

    Discussion:
    1) During dataset building (Section 2.1), I couldn't understand the importance of having default-Yes and default-No settings for the HITs. Does it reduce annotation time?
    2) In Section 2.1 there is a reasoning behind the improved diversity: "Adding adjectives ... to increase the diversity of visual appearances". This seems like a great way to improve both available and future datasets.
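
    A toy sketch of the nearest-neighbor intuition above, assuming Euclidean distances in some feature space; the paper itself estimates density from AMT similarity judgments rather than feature distances, so this is only an illustration:

    ```python
    # Denser sample sets have a smaller expected distance from a random
    # probe to its nearest neighbor.
    import numpy as np

    def mean_nn_distance(X, n_probes=25, seed=0):
        rng = np.random.default_rng(seed)
        probes = rng.choice(len(X), size=n_probes, replace=False)
        dists = []
        for i in probes:
            d = np.linalg.norm(X - X[i], axis=1)  # distance to every point
            d[i] = np.inf                         # exclude the probe itself
            dists.append(d.min())
        return float(np.mean(dists))

    dense = np.random.default_rng(1).normal(scale=0.5, size=(500, 2))
    sparse = np.random.default_rng(2).normal(scale=2.0, size=(500, 2))
    print(mean_nn_distance(dense) < mean_nn_distance(sparse))  # True
    ```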

  11. Summary:
    This paper achieves two purposes:
    1) builds a relatively diverse and dense scene-centric dataset with 7 million images and 476 place categories, called the “Places” dataset;
    2) shows experimental evidence that training a standard CNN architecture on the Places dataset beats state-of-the-art performance on various scene-centric datasets like SUN397.
    This paper is the first of its kind to train a deep CNN on a scene-centric dataset. The motivation lies in the fact that, although deep networks like ImageNet-CNN are good at object classification, their high-level features (which are ultimately used for classification) don’t capture the richness of detail in scene images.

    Questions:
    1) The experiments with Hybrid-CNN show improvements mostly on the scene-centric datasets. Is there an intuitive reason why similar improvements don't appear on the object-centric datasets?

  12. The paper introduces a new benchmark image dataset, Places, designed to incorporate all the scene categories from the SUN397 dataset plus many more, while being competitive with the ImageNet dataset in terms of images per category and total number of images. The authors introduce two novel measures for comparing datasets, relative density and relative diversity, and demonstrate that Places, while comparable in density, is more diverse than ImageNet and SUN397. They then show experimentally that a scene-centric deep network performs better at scene recognition than an object-centric one, by training their own CNN, Places-CNN, and visualizing the differences between units in the higher layers of Places-CNN and ImageNet-CNN.

    Discussion:
    1. The authors sample just 25 times for the density experiment in order to avoid duplicate pairs. Isn't this counter-intuitive?
    2. Can the mean-image method be explained further? I particularly don't understand how the images are ranked based on the unit activations of a layer. (A toy sketch of one plausible reading follows below.)
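
    A toy sketch of one plausible reading of that ranking step, assuming a hypothetical get_activation helper that returns a chosen unit's (spatially pooled) response to a single image; the paper itself used Caffe-era tooling:

    ```python
    import numpy as np

    def mean_image_for_unit(images, get_activation, k=100):
        # Score every image by the chosen unit's activation, then average
        # the k images that activate it most strongly.
        acts = np.array([get_activation(img) for img in images])
        top = np.argsort(acts)[::-1][:k]   # indices of the k highest scores
        return np.mean([images[i] for i in top], axis=0)

    # Toy demo: "images" are random arrays and the fake unit simply
    # responds to overall brightness.
    imgs = [np.random.default_rng(i).random((8, 8, 3)) for i in range(500)]
    mean_img = mean_image_for_unit(imgs, get_activation=lambda im: im.mean())
    print(mean_img.shape)  # (8, 8, 3)
    ```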

  13. This comment has been removed by the author.

  14. The Places database was developed to support and extend state-of-the-art computer vision methods for scene classification. Traditional convolutional neural networks require a large amount of training data, and this paper describes the authors' attempt to create a rich and diverse dataset of scene-centric images, consisting of 7 million images distributed into 476 categories. The dataset is compared in depth with the current well-known datasets, and the differences are explained in terms of density and diversity. Results of applying the ImageNet architecture to a scene-centric database are discussed, as is a Hybrid-CNN trained on combined Places and ImageNet data.

    Questions-

    1. In which novel applications, if any, has this dataset been used recently?
    2. Has this been tried with different architectures, say GoogLeNet or other state-of-the-art networks?

  15. This paper presents a new database, Places, for the task of scene recognition. The authors argue that although deep learning has been getting better at object recognition, the performance of deep networks on scene recognition is still not that great. They show that the higher-level features learned by object recognition networks and scene recognition networks are different. The authors compare the Places dataset, which contains about seven million images, with SUN and ImageNet using AMT and found that Places is comparably dense and more diverse than the other datasets. They then train a CNN on Places and report state-of-the-art performance for scene recognition.

    Question-
    Have any other networks performed better since? What new techniques have been tried?

  16. The paper builds on the golden rule that a CNN's performance is directly linked to the data it is trained on. The authors present a new scene-centric database for scene recognition and compare it with the object-centric ImageNet. They compare the two by selecting a network architecture, training it on each dataset separately, and then comparing the results and the filter responses at different stages. As expected, the ImageNet-CNN performs better at object recognition while the Places-CNN performs better at scene recognition.

    Discussion:
    While they introduce the Hybrid-CNN, they don't discuss its performance in detail. Table 3 shows that it performs better than its more focused parent CNNs on 4 of the 8 datasets tested, but they do not discuss its performance on ImageNet or Places. I wonder how its performance there compares with its parents'.

    I'm also curious what the RFs of the Hybrid-CNN's pooling layers would look like: more object blobs, more landscape-like, or a combination across layers?

  17. The paper describes the creation and application of a large scene-centric database consisting of 7 million images organised into 476 place categories. The paper also presents metrics to measure the density and diversity of databases, concluding that while the SUN, Places, and ImageNet datasets have similar density, the Places database is the most diverse. Finally, the paper trains a CNN on the proposed database and tests its performance on various scene-centric and object-centric databases.
    Discussion:
    1.The concept of “similarity” that is used to evaluate the density and diversity of the database is pretty subjective. Why wasn’t a mathematical definition of similarity used as opposed to human perception?
    2. I am not clear on what the intersection of the two black lines in Fig. 3c indicates.

  18. The authors introduce the Places dataset, a strong attempt at a comprehensive scene-centric dataset to improve the state of the art in scene recognition. The most noteworthy thing about Places is its size (7 million images across 476 categories), which makes it the largest scene-centric dataset and a huge advantage for training CNNs. Another interesting part of the paper is the comparison of Places with the SUN and ImageNet databases, which the authors carry out with metrics like relative density and diversity; Places turns out to be the most diverse of the three. When a CNN is trained on the Places dataset, its accuracy improves on scene-centric datasets.

    Questions:
    Has the metric of diversity been widely used to describe the characteristics of other datasets? Why is there such a huge drop in performance of the Places-CNN on Caltech-256 compared to Caltech-101?

  19. The paper introduces a new scene-centric database called Places, which has over 7 million labeled pictures of scenes spanning 476 place categories. The paper also describes a new method to measure the density and diversity of databases, which helps estimate dataset biases. The performance of CNNs trained on different object-centric and scene-centric databases, including the proposed one, is compared: Places-CNN has higher classification accuracy at scene recognition, while ImageNet-CNN works better for object recognition.
    Question:
    Why does the Hybrid-CNN have lower classification accuracy on object-centric databases?

  20. This paper attempts to further work on recognition of indoor and outdoor scenes. It introduces the Places dataset, whose Places 205 benchmark subset covers 205 scene categories with roughly 2.5 million images; labels were generated using AMT workers. An important contribution of the paper is a pair of metrics for measuring the density and diversity of datasets, which are used to compare the new Places dataset to the existing benchmark sets and to motivate the creation of larger, more diverse datasets. These metrics indicate that Places has high diversity but average density in comparison to SUN and ImageNet. Perhaps this diversity explains the performance of Places-CNN (a generic deep CNN architecture trained on the Places dataset), which outperformed existing work on the scene recognition tasks evaluated.

    1. The top-1 accuracy on this dataset is fairly low at around 60%, but the top-5 accuracy is much higher at >80% (see the top-k sketch after these questions). Maybe this is caused by the categories being too similar to each other? For example, you would expect the dome category to be part of other categories like church and palace rather than a category of its own, so it would be interesting to organize the categories hierarchically as ImageNet did.

    2. How well would this work on images where the scene isn't the prominent part of the image but is instead the background? For example, how well would this do on typical travel photos where there is a family in the foreground but the salient scene is in the background?
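
    For concreteness, a minimal sketch of top-k accuracy, the metric behind the top-1 vs. top-5 gap in question 1 (the scores and labels here are made up):

    ```python
    import numpy as np

    def top_k_accuracy(scores, labels, k=5):
        # scores: (n_samples, n_classes); a prediction counts as correct
        # if the true label is among the k highest-scoring classes.
        top_k = np.argsort(scores, axis=1)[:, -k:]
        hits = [labels[i] in top_k[i] for i in range(len(labels))]
        return float(np.mean(hits))

    scores = np.array([[0.1, 0.5, 0.4], [0.7, 0.2, 0.1]])
    labels = np.array([2, 0])
    print(top_k_accuracy(scores, labels, k=1))  # 0.5
    print(top_k_accuracy(scores, labels, k=2))  # 1.0
    ```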

  21. The paper introduces a new dataset for scene recognition with millions of images, enabling efficient learning with deep networks. The database contains more than 7 million annotated images, an order of magnitude larger than previous work.

    They emphasize the diversity and density of the dataset and show that Places holds up on both counts. The data is labelled through Mechanical Turk with two levels of control to maintain quality.

    They train a deep network on this dataset and compare with standard ImageNet-based networks, claiming and showing that pretraining on Places improves accuracy across the board for scene recognition tasks. They also visualize features from the activation units at each layer, confirming their belief that the net extracts more spatial, landscape-like features rather than object-centric ones.

    Questions:

    The paper talks about the visualization of the features. Could you explain how that works?

    They also seem to take a purely classification-based approach to generic rooms; would this work in more fine-grained settings?

  22. This paper presents a new image dataset 60 times larger than the SUN dataset. Additionally, the paper shows that, although size is increased, data density is preserved. Finally, a new CNN, Places-CNN, is trained on the new dataset, and an increase in classification accuracy is shown.

    Question:
    How valid is the density/diversity metric in determining quality of similar datasets? Is it widely utilized by other researchers?

  23. This paper presents a new database, Places, with 7M+ labeled images from 476 scene categories. The images were queried from three search engines and then annotated in two rounds on AMT. The authors compare Places with SUN and ImageNet in terms of relative density, diversity, and classification performance. By training the same CNN on Places and on ImageNet, they found that the Places-CNN performed better for scene recognition while the ImageNet-CNN performed better for object recognition.

    Why do you think the Hybrid-CNN performs worse than just the Places-CNN? Is there a better way to measure the density and diversity of datasets?
