ImageNet: A Large-Scale Hierarchical Image Database. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei. IEEE Computer
Vision and Pattern Recognition (CVPR), 2009
This paper presents ImageNet, a database of images arranged hierarchically and partitioned into synsets, the conceptually synonymous categories defined by an earlier work, WordNet. It currently stands at nearly 22k synsets with over 14 million images, over 4x the size of the dataset when the paper was published in 2009. The database was built by querying various search engines with specific WordNet synsets (nouns), including translations into other languages. The candidate images were then labeled and verified through Mechanical Turk crowdsourcing. Example applications of the dataset are also presented.
Question:
How accurate is their diversity algorithm (level and extent of "gray"ness of the average image as a measure of diversity in the image set)?
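For readers unfamiliar with the measure being questioned here: the paper averages all images in a synset and uses the lossless JPG file size of that average image as a proxy, on the idea that a diverse synset averages out to a blurry, nearly gray image that compresses to a small file. The snippet below is only a toy sketch of that idea, not the authors' code. It assumes synthetic random-texture "images" stored as flat lists of grayscale values, and it uses `zlib` as a stand-in for lossless JPG; the helper names are hypothetical.

```python
import random
import zlib


def average_image(images):
    """Pixel-wise mean of equally sized grayscale images (flat lists of 0-255 ints)."""
    n = len(images)
    return [round(sum(pixels) / n) for pixels in zip(*images)]


def compressed_size(image):
    """Bytes after lossless compression (zlib here, standing in for lossless JPG)."""
    return len(zlib.compress(bytes(image), 9))


rng = random.Random(0)
NUM_PIXELS = 32 * 32
NUM_IMAGES = 200

# Low-diversity toy synset: every image is the same texture plus slight noise.
base = [rng.randrange(256) for _ in range(NUM_PIXELS)]
uniform_set = [
    [min(255, max(0, p + rng.randint(-10, 10))) for p in base]
    for _ in range(NUM_IMAGES)
]

# High-diversity toy synset: every image is an unrelated random texture.
diverse_set = [
    [rng.randrange(256) for _ in range(NUM_PIXELS)]
    for _ in range(NUM_IMAGES)
]

size_uniform = compressed_size(average_image(uniform_set))
size_diverse = compressed_size(average_image(diverse_set))

# The diverse set should average out to near-uniform gray and compress smaller.
print("low-diversity average image: ", size_uniform, "bytes")
print("high-diversity average image:", size_diverse, "bytes")
```

Note that this toy only shows the direction of the effect on synthetic data. It says nothing about how well the measure tracks pose, viewpoint, or occlusion on real photographs, which is what the question is really asking.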
The paper describes the creation as well as the application of a large database of images organized hierarchically according to the WordNet structure. The unique features of this database are as follows:
Scale and Hierarchy. The ImageNet database as of now consists of 14 million images classified into over 20K categories corresponding to the synsets given by WordNet. The categories are densely populated, with 500-1000 images per class.
Accuracy. The database consists of clean images. Moreover, the accuracy of the classes is ensured by having multiple users annotate the images.
Diversity. The database consists of objects in varying poses and against varying backgrounds to ensure high within-class variance.
Discussion:
1. Why is such a roundabout method used to measure intra-class variance? Why aren’t standard statistical techniques like variance, correlation or entropy used instead?
2. The database contains many fine-grained classes such as “Burmese cats” vs “Siamese cats”. However, for an average person such detailed classes are not relevant. What was the intuition behind creating these detailed classes?
The paper describes the creation of ImageNet, a large image dataset with over 3 million images that are labeled semantically with WordNet categories. ImageNet would go on to become the benchmark for various computer vision tasks. The authors also show that ImageNet is considerably larger than existing datasets in terms of total categories and images per category. The precision of ImageNet is also very high, so it provides a dataset of clean images.
Discussion:
ImageNet is an image dataset over the noun categories of WordNet. Are there any datasets which have explored verb categories to detect activities?
The paper describes the ImageNet dataset, a massive dataset of images based on the hierarchical WordNet structure. There are over 3.2 million images, with a variety of poses, occlusions, background clutter, and viewpoints. The authors also describe the data collection process and how to extend ImageNet to do localization.
Q: Is there a better way to test the diversity of images than computing the average image of each synset?
The paper discusses the creation and objectives of ImageNet, a large scale, diverse image dataset which corresponds to the synsets of WordNet. The authors use Image Search Engine crawling along with Crowd Sourcing in conjunction with innovative algorithms to create the dataset, with 500-1000 images per synset. They run multiple experiments to validate why the large scale, clean and hierarchical structure of ImageNet proves useful for pushing the state of the art in terms of computer vision research.
Discussion: 1. The only dataset quantitatively compared is the TinyImages dataset. Would the results still hold on other more elaborate datasets such as MS COCO?
2. Would having the hierarchical structure actually help detection? One would intuitively imagine that subcategories would be harder to classify (the paper even mentions that it is harder for humans to classify sub- and super-categories). I would imagine that attribute tagging would be a much better application and research area for this structure of ImageNet, and that has not been touched upon.
The paper introduces a large-scale database called ImageNet, organized into a taxonomy based on the WordNet hierarchy. Each node includes a large set of images. The first step in creating such a database is to collect candidate images from the internet. For each synset, the queries are the set of WordNet synonyms. The accuracy of Internet image search was about 10%, which is why the next step was to have humans clean up the candidate images, i.e. verify each candidate image collected for a given synset. The paper claims that the database has high accuracy since it consists of clean images at all levels. It is also diverse because it consists of images of objects in variable appearances, positions, viewpoints, poses, background clutter and occlusions.
Question: 1. Every image has just one label. Is this the label of the most prominent object in the image? Would having more than a single label per image help?
Summary: This paper presents the ImageNet image database, which provides the most comprehensive and diverse coverage of images, labeled with an average accuracy of 99.7% with semantic concepts based on the hierarchical structure provided by WordNet. Compared to related image datasets, ImageNet contains high-quality synsets and full-resolution images, and maintains a much more balanced distribution of images across the semantic hierarchy, with less ambiguity, a large number of categories, and a good number of images per category. With a clean set of full-resolution images, non-parametric object recognition can be more accurate. Exploiting the ImageNet hierarchy with the tree-max classifier can provide substantial improvement on the classification task.
Questions: Now that the ImageNet recognition task has reached near-human results on a decent number of categories, more and more datasets (Microsoft COCO, Visual Genome [visualgenome.org]) have been released to tackle different kinds of computer vision tasks. My question is twofold: 1. Has any new, smarter pipeline been adopted to make some of the steps in building a dataset easier? 2. Have researchers really moved on from classification problems? CNNs are still pretty easily fooled, given the results of this paper: Nguyen, Anh, Jason Yosinski, and Jeff Clune. "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images." Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. IEEE, 2015.
This is the introductory paper for the ImageNet database. The database offered higher accuracy than competing databases of the time. The architects used Amazon Mechanical Turk to label the data. Labeling was done in WordNet style, with each category either belonging to a parent category or having subcategories of its own. For example, the label dog implies canine -> carnivore -> placental -> mammal. This is useful for classifiers, as things like fur or hair have distinct appearances that aren't unique to one of the lowest subclasses but are relatively unique to mammals. So if a model can't get all the way down to the unique lowest subclass, it can stop at mammal or some other higher-level synset if the model designer thinks that is relevant. The hierarchy can also be used by some specific classifiers, such as the "tree-max classifier" that was implemented on ImageNet and discussed in the paper.
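As a rough illustration of the tree-max idea mentioned above: the response for a synset is taken as the maximum of the per-synset classifier responses over that synset and everything below it in the hierarchy. The sketch below is a minimal version under stated assumptions, not the paper's implementation. The `feline` and `cat` nodes and all of the scores are made up for the example, and the per-synset classifiers are replaced by a fixed dictionary of hypothetical responses for a single image.

```python
def tree_max(node, children, raw_scores):
    """Return the max raw score over `node` and all of its descendants."""
    best = raw_scores[node]
    for child in children.get(node, []):
        best = max(best, tree_max(child, children, raw_scores))
    return best


# Toy "is-a" hierarchy following the dog example; feline/cat are added
# here purely for illustration.
children = {
    "mammal": ["placental"],
    "placental": ["carnivore"],
    "carnivore": ["canine", "feline"],
    "canine": ["dog"],
    "feline": ["cat"],
}

# Hypothetical per-synset classifier responses for one image of a dog.
raw_scores = {
    "mammal": 0.2,
    "placental": 0.1,
    "carnivore": 0.3,
    "canine": 0.4,
    "feline": 0.05,
    "dog": 0.9,
    "cat": 0.1,
}

for synset in ("mammal", "canine", "feline"):
    print(synset, tree_max(synset, children, raw_scores))
```

Here the weak raw response for "mammal" is lifted to 0.9 because the image scores highly for a descendant ("dog"), while "feline" stays low. This is how a confident fine-grained prediction can support its ancestors.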
Discussion: Can the hierarchical structure be used to seed a convolutional network? In class we talked about how a convolutional network could identify lower-level things like wheels and handlebars and use them later in the network to distinguish trucks and cars from bicycles and planes. If the long-term goal is to get a neural network to learn something from a small image sample (e.g. 30 images), won’t something like this have to be done?
"a convolutional network could identify lower-level things like wheels and handlebars" <-- this is in theory, as we know very little about what the network is actually doing once it's trained.
The ImageNet database has become one of the driving forces of today's deep-learning-driven computer vision applications. This paper provides an insightful explanation of the creation of this rich database and its potential applications. It describes how, using synsets from WordNet, the authors create a dataset that is more diverse and semantically closer to human perception. At the time the paper was written, there were 5247 synsets, each with roughly 600 images, for a total of 3.2 million images, which is still a small fraction of their 50 million image target with more synsets. The organization of these categories into a hierarchical tree structure is described. The paper goes on to report improved performance when well-known object recognition techniques are applied to this dataset. The same is done for tree-based classification and object localization.
Questions- 1. The ImageNet database consists of images labelled with the type of object in mind, for example "German shepherd" in the dog category. But the labels don't take into account the pose of the object. Can the database be made to contain information about poses or activity as well? Or would you need a different network altogether just to estimate the pose, using a different database, if available?
Large datasets with accurate semantic labels are useful both as learning tools and as performance benchmarks. ImageNet is one such dataset, and has been the base dataset for a number of image recognition challenges. It consists of a large database of images mapped to WordNet synsets.
Questions: 1) Have similar attempts been made to create benchmark datasets of this scale for videos?
2) Is the labeling of objects according to existing WordNet synsets necessarily accurate? What if existing synsets don't accurately capture the distinction between two object classes?
Summary: This paper shows how one of the most popular datasets in the computer vision community, ImageNet, has been built. ImageNet is rich in properties like diversity, accuracy and a hierarchical structure of high-resolution images. With 12 main categories and 5247 subcategories, ImageNet has 3.2 million images in total, and it is still growing. The usefulness of ImageNet is demonstrated by experiments including object recognition and image classification.
Questions:
1) I could not understand how confidence scores were calculated for categories and subcategories, using user votes to calculate the probability of an image being good.
2) As we move to deeper levels in the tree, the detail in the images increases and it becomes difficult for humans to identify the synset exactly; I assume it would be similar for a classifier too, either because of a lack of data at this level or just because the image is too detailed. However, the tree-max classifier is shown to perform better at deeper levels than at higher levels. How could this be possible?
3) In the object localization application, it has been shown that k-means clustering resulted in different clusters for various poses of tusker and various views of aircraft. If the training set contained a good number of different facial poses, would k-means give good cluster separation for these poses too?
This paper presents ImageNet, a large database based on the hierarchical structure of WordNet, as well as its creation and applications in recognition, classification, and clustering. The authors collected candidate photos by querying multiple search engines and used Amazon Mechanical Turk to verify the quality. With the wide variety of categories, poses, and viewpoints, the authors highlighted ImageNet's usefulness in future research.
Has there been a different approach to measuring the diversity of ImageNet (e.g. similarity/dissimilarity between images in each category through AMT)?
What's the reasoning or ultimate goal behind building the depth of the tree other than following WordNet's structure? At some point, I feel like it would be more useful to expand the width of the tree rather than have such specific subcategories.
This paper presents a very large-scale image database. ImageNet is the largest hierarchical ontology of images today. The structure of the database is inspired by the semantic hierarchy of WordNet. It consists of millions of full-resolution images organized in main synonym sets ("synsets"), each of which consists of further sub-synsets. Images were collected by querying several search engines. The annotation was done by crowdsourcing. The paper compares ImageNet with other image databases and describes some applications that ImageNet can be useful for. Nowadays, ImageNet is a very important benchmark database for different computer vision problems such as recognition and localization.
Research Question:
ImageNet is built using the structure of the noun synsets of WordNet, which resulted in a dataset of isolated iconic objects that don't help in understanding scenes and the interactions between different entities. Would a dataset built on the other synset types (verbs, adverbs, or adjectives) be useful for more diverse computer vision problems such as scene understanding and semantic segmentation?
The paper presents a large-scale, diverse dataset which would aid the growth of computer vision research. The dataset aims to build an ontology of images upon the backbone of the WordNet structure. It can be used for tasks ranging from diverse classification to fine-grained category recognition.
The dataset has 12 subtrees with 5247 synsets and 3.2 million images in total. Each synset has 500-1000 high-resolution images. The paper shows applications of the dataset in object recognition, image classification and object localization.
Questions:
In this context, how are object recognition and image classification different?
Models trained for the ImageNet challenge, a large driver of deep learning, have been shown to overfit the dataset and learn its biases. How would they be able to generalize to more realistic images?
The paper presents the creation and applications of ImageNet, a huge dataset with a hierarchical structure based on WordNet. It has over 3.2 million images, with a huge variance in pose, viewpoint, occlusion and clutter. ImageNet has become one of the leading datasets for computer vision research. The authors also describe the data collection process and discuss the results obtained with it.
Discussion: Can this hierarchical structure be exploited while training a deep neural net/CNN?
They test the diversity of images using jpeg compression. Not sure how valid a test this is for pose, viewpoints, occlusions etc.
This paper describes the creation of the ImageNet database, one of the largest and most comprehensive image datasets created to aid computer vision research. ImageNet is based on the ontological structure of the lexical database WordNet. The paper also enumerates the properties of ImageNet: its hierarchical structure, the diversity in the categories and viewpoints of its images, and its sheer scale (3.2 million images in over 5000 categories at the time the paper was written). Finally, the authors discuss the tasks of object recognition and classification that can be aided by ImageNet, and further steps to complete the database.
Database: Why did the authors decide to spread the database 'thin', choosing an average of only 600 image instances per category and focusing on increasing the number of categories? Does this have to do with the heavy-tailed nature of vision data, so that covering as many categories as possible is always more beneficial than having more instances per category?
This paper presents the creation and application of the ImageNet dataset, consisting of 3.2 million images in 12 sub-trees and over 5247 categories. The data collection process using AMT and the verification process for the collected images are explained. Finally, the authors provide some possible applications of the dataset in tasks such as object recognition and localization.
How did the authors come up with the 12 sub-trees? Was the availability of images on the internet for each sub-tree the only criterion? Why are there generally no images of groceries, buildings, highways and other such categories in datasets like this?