Wednesday, January 13, 2016

Fri, Jan 15: MS COCO

Microsoft COCO: Common Objects in Context. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. ECCV 2014.

23 comments:

  1. Since objects tend to appear in dense settings, and context can be used to gain insight into the nature of objects from their relations to other things, MSR took it upon themselves to generate a _really_ large dataset of object-rich scenes with per-instance semantic segmentations.

    The contributions of this paper are threefold:

    1) It provided the MS COCO dataset to the larger computer science community. This dataset consists of 328,000 images containing 2,500,000 instances of 91 object classes. While this has fewer labeled classes than, say, ImageNet, it has significantly more per-class instances, and instead of simply providing bounding boxes for each object it provides a semantic segmentation of each image. This allows us to train classifiers for more general scene labeling tasks (see the sketch after this list).

    2) A detailed look at how to construct a cost-efficient pipeline for image annotation using Mechanical Turk.

    3) An initial comparison of object detections across PASCAL VOC and MS COCO using different models, providing initial benchmark results and showing that at first blush the tasks presented alongside COCO are considerably harder than those in PASCAL VOC.
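
    As a small illustration of what the per-instance annotations enable, here is a minimal sketch using the COCO Python API (pycocotools); the annotation file path is a placeholder for a local copy of the dataset.

    ```python
    from pycocotools.coco import COCO

    # Placeholder path; point this at a local copy of the annotations.
    coco = COCO("annotations/instances_train2014.json")

    # Find images that contain both a person and a dog.
    cat_ids = coco.getCatIds(catNms=["person", "dog"])
    img_ids = coco.getImgIds(catIds=cat_ids)

    # Load every non-crowd instance annotation (bounding box plus
    # polygon segmentation) for the first such image.
    ann_ids = coco.getAnnIds(imgIds=img_ids[:1], iscrowd=False)
    for ann in coco.loadAnns(ann_ids):
        name = coco.loadCats(ann["category_id"])[0]["name"]
        print(name, ann["bbox"], len(ann["segmentation"]))
    ```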

    Some questions that arise from the readings:

    1) What meaningful exploration has been done into comparing the accuracy of classifiers trained on iconic images, those trained on synthetic images composed of multiple iconic images, and those trained on non-iconic images?

    2) To what extent will providing further semantic labeling help with classification (e.g., labeling all "stuff" as well as all the labeled "things")? What considerations need to be taken into account when retrofitting a dataset of this scale with additional semantic data?

    -Stefano

  2. This dataset places the question of object recognition in the context of scene understanding. It addresses three research problems in scene understanding: non-iconic views, contextual reasoning between objects, and precise two-dimensional localization of objects. The dataset uses non-iconic images (defined in the paper) and images depicting scenes rather than isolated objects.

    To help with labeling and segmenting the images, Amazon's Mechanical Turk is used, and the paper presents a detailed look at this pipeline.

    For bounding box detection, the results suggest that MS COCO is, in fact, harder than PASCAL.

    Questions:

    1. How do you determine what would constitute a "representative set of all object categories"?

    2. Are workers timed, and has anyone used that signal? Some things are really easy to identify and others much less so; timing could be a useful heuristic for determining (a) whether an object is easy or hard to spot and (b) what a human notices first. (A toy sketch follows.)
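
    For what it's worth, Mechanical Turk does record per-assignment submission times, so something like this heuristic could be computed; a toy sketch over a hypothetical log format:

    ```python
    from collections import defaultdict
    from statistics import median

    # Hypothetical log of (category, seconds a worker took to label it).
    log = [("person", 3.1), ("person", 2.8), ("sheep", 11.5), ("sheep", 9.0)]

    times = defaultdict(list)
    for cat, secs in log:
        times[cat].append(secs)

    # Median labeling time as a crude per-category difficulty score.
    difficulty = {cat: median(ts) for cat, ts in times.items()}
    print(difficulty)  # {'person': 2.95, 'sheep': 10.25}
    ```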

  3. The paper describes the process of creating the new MS COCO dataset generated by MSR, its characteristics, and the baseline performance of some algorithms. The main goal of creating MS COCO was to present objects in their natural context so that semantic information can be learned about them. Amazon Mechanical Turk was used to crowdsource the labeling of the dataset. MS COCO focuses more on non-iconic images and thus has more categories per image than the ImageNet and PASCAL datasets.

    Discussion:

    1) Generally there are multiple objects in an image. It would be interesting to know in what order a human views those objects and how much time they take to identify them.

    2) Could a similar dataset easily be extended to encode the background 'stuff', by removing the segmented objects and showing workers just the background? (A rough sketch of this idea follows.)
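
    A rough sketch of that extension using the released annotations, assuming pycocotools, numpy, and PIL, with placeholder paths: gray out every labeled 'thing' pixel so only unlabeled background remains.

    ```python
    import numpy as np
    from PIL import Image
    from pycocotools.coco import COCO

    coco = COCO("annotations/instances_train2014.json")  # placeholder path

    def background_only(img_id, image_dir="train2014"):
        """Gray out every labeled 'thing', leaving only unlabeled 'stuff'."""
        info = coco.loadImgs(img_id)[0]
        img = np.array(Image.open(f"{image_dir}/{info['file_name']}").convert("RGB"))
        mask = np.zeros(img.shape[:2], dtype=bool)
        for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id)):
            mask |= coco.annToMask(ann).astype(bool)  # union of instance masks
        img[mask] = 128  # flat gray where the labeled objects were
        return Image.fromarray(img)
    ```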

  4. The entire process of creating a high-quality, high-volume dataset is discussed, with emphasis on a large number of categories and instances per category and a focus on non-iconic images, so that object detection and scene understanding can progress in more natural settings. The paper discusses three types of computer vision tasks, Image Classification, Object Detection, and Semantic Scene Labeling, and uses them as guidelines to design the dataset. In-depth discussions of the creation process via crowdsourcing and statistics compared to other popular datasets are provided.

    Questions:
    1. What was the rough timeline and funding required to complete this dataset? Suppose I decide to create a dataset today for a particular vision task; how should I plan for time and costs?
    2. There are some very good ideas regarding crowd worker approval and analyzing annotation performance, but how do the authors know how much is good enough? There is a direct comparison with ground truth, but, as mentioned, there is inherent ambiguity in object categories and instances, and we don't know how well these workers would have performed on other datasets or on similar tasks (though the evidence suggests they would do an equally good job, this is an interesting thought given MS COCO's different objectives).
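
    On question 2: one simple, concrete agreement measure (not necessarily the authors' exact procedure) is per-image precision/recall of a worker's category labels against an expert's:

    ```python
    def label_agreement(worker_labels, expert_labels):
        """Precision and recall of a worker's category labels for one image,
        measured against expert ground truth (both are sets of names)."""
        worker, expert = set(worker_labels), set(expert_labels)
        tp = len(worker & expert)
        precision = tp / len(worker) if worker else 1.0
        recall = tp / len(expert) if expert else 1.0
        return precision, recall

    # Toy example: the worker missed 'cup' -> perfect precision, recall 2/3.
    print(label_agreement({"person", "dog"}, {"person", "dog", "cup"}))
    ```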

  5. Summary
    The paper presents a new dataset, Microsoft COCO. The dataset addresses three major problems in scene understanding: 1. detecting non-iconic views of objects; 2. contextual reasoning between objects; 3. precise 2D localization of objects.
    Creating it has two stages: image collection and image annotation.
    One important feature of image collection is that the dataset uses non-iconic images, which contain rich contextual information.
    There are three steps in annotation: category labeling, instance spotting, and instance segmentation, which together enable precise object localization. This part is done through Amazon Mechanical Turk crowdsourcing.

    Question and discussion:

    1. In the instance segmentation step, "If other instances have already been segmented in the image, those segmentations are shown to the worker." But an object may overlap with another object; if its segmentation covers the other one, the worker is not able to label it.

    2. Is there any research already using the contextual information from MS COCO? How did they figure out the relationships between objects?

  6. The MS COCO dataset is an attempt to solve the problems that plague vision datasets of iconic, single-category images. The dataset contains 91 common object categories, 82 of them with more than 5,000 labeled instances; in total it has 2,500,000 labeled instances in 328,000 images.

    The contributions of the dataset are:
    1) It is a comprehensive dataset of outdoor and indoor scenes capturing non-iconic images of its categories.

    2) It has a large number of objects of each category per image, leading to an instance segmentation task. The images are also rich in context, allowing for contextual reasoning.

    3) It provides instance-level, pixel-wise segmentation of over 2.5 million object instances, as well as descriptions of the images, allowing for accurate 2D object localization.

    All three contributions aid real-world applications of vision in systems like robots; for example, instance segmentation is an essential problem to solve for robotic manipulation tasks.

    The dataset statistics back up these claims: on average an image contains 3.5 categories and 7.7 instances, and over 90% of the images contain more than one category. This makes the dataset an important driving force toward solving vision.
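
    Those per-image averages are easy to re-derive from the released annotations; a sketch with pycocotools (placeholder path, and the exact numbers will depend on which release/split is loaded):

    ```python
    from pycocotools.coco import COCO

    coco = COCO("annotations/instances_train2014.json")  # placeholder path

    cats_per_img, insts_per_img = [], []
    for img_id in coco.getImgIds():
        anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
        cats_per_img.append(len({a["category_id"] for a in anns}))
        insts_per_img.append(len(anns))

    print("avg categories/image:", sum(cats_per_img) / len(cats_per_img))
    print("avg instances/image:", sum(insts_per_img) / len(insts_per_img))
    ```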

    Discussion:

    1. The dataset inherits human biases: for example, a person or an object seen in a mirror is not labeled, while a detector would detect it. Is it right to penalize the algorithm for human biases?

    2. The paper claims that a large number of objects in the dataset are small and have multiple instances in the same image, but that does not mean all instances are labeled, as is clear when browsing the crowded scenes in the dataset. This would again penalize an algorithm for detecting all the instances. Would this be fair?


  7. Review:
    The MS COCO dataset is a crowd-sourced collection of locally tagged images. The images carry labels from 91 object categories, 82 of which have more than 5,000 labeled instances.

    One interesting facet of the database is that a majority of its images are non-iconic; the authors attempt to make the images as generic as possible.

    Discussion
    The argument the authors advance for the failures of DPMv5 seems arbitrary.
    Also, using workers through Mechanical Turk does not necessarily yield the best output. They mention adding a stage to verify the instance segmentations but do not discuss it in more detail.

    Carl Saldanha

  8. This paper presents a massive dataset geared toward assisting object recognition tasks by giving very accurate instance-level segmentation (pixel-level isolation of individual instances of any of 91 object categories) across over 320,000 images. Accomplishing this herculean task was made possible by crowd-sourced workers, particularly through Amazon Mechanical Turk, performing multiple steps of the process: non-iconic image validation, category labeling, instance identification, instance segmentation, segmentation verification, and crowd labeling.

    Discussion/Questions :

    1. A graphical model linking the probabilities of certain objects appearing in scenes with other specified objects could easily be built from this dataset (a sketch follows this list). This could help convey deeper scene understanding when using the database to train for recognition tasks.

    2. Were there many images that came up "blank" (i.e., had no segmentable objects visible, or at least none chosen by the workers)? What about image quality: were workers able to discard an image that was blurry or otherwise of poor quality?
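
    On point 1 above: the simplest version of such a model is an empirical co-occurrence table over category pairs; a sketch (pycocotools assumed, path a placeholder):

    ```python
    from collections import Counter
    from itertools import combinations
    from pycocotools.coco import COCO

    coco = COCO("annotations/instances_train2014.json")  # placeholder path

    cat_counts, pair_counts = Counter(), Counter()
    for img_id in coco.getImgIds():
        anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
        cats = sorted({a["category_id"] for a in anns})
        cat_counts.update(cats)
        pair_counts.update(combinations(cats, 2))  # sorted category pairs

    def p_given(a, b):
        """Empirical P(category a in image | category b in image)."""
        joint = pair_counts[tuple(sorted((a, b)))]
        return joint / cat_counts[b] if cat_counts[b] else 0.0
    ```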

  9. This paper provides an in-depth explanation of a comprehensive approach undertaken to develop a database that would further fuel the explosive growth in deep-learning-based computer vision, especially areas like scene understanding and object detection based on local context, which are particularly difficult problems. The paper focuses on how the MS COCO database compares with other well-known datasets, the statistics that make this database attractive for this specific domain, and the methodology of how and why a specific set of data with specific categories was captured. It highlights how non-iconic images capture natural scenarios better than iconic images and help in local-context-based scene understanding. Unique methods adopted for making the crowdsourcing efficient and effective are discussed thoroughly.

    Questions and discussions-

    1. What would be the complexity of including stuff? If you include the presence of stuff, for example a street, wouldn't that help in better understanding, say, the presence of a large crowd in the image?

    2. I believe including stuff would help scene annotation/captioning algorithms. If a hierarchical scene understanding method first detects the stuff and then the object instances present in the image, it could provide a more intuitive scene caption, for example, "dogs in the field". Just a thought.

  10. Summary
    The paper describes the process of constructing MS COCO, a new object detection and segmentation dataset containing about 328K non-iconic images. MS COCO will eventually contain 91 object categories. The paper shows that models trained on COCO perform better at detecting and segmenting objects in their natural context than models trained on older datasets.

    This paper also mentions that MS COCO is much more difficult than PASCAL VOC. Thus, with MS COCO we can obtain models that generalize better to easier datasets; MS COCO contains information closer to nature and reality.

    Discussion
    1. Were the funds and worker hours spent on MS COCO much higher than on previous datasets?
    2. In selecting the common object categories, why was a free-recall experiment with young children needed?

  11. This paper presents the process behind creating the MS COCO data set. The data set contains annotations for image classification, object detection, and image segmentation. MS COCO does not have as many training examples as ImageNet, but has more examples per category. While constructing MS COCO, an emphasis was placed on mostly non-iconic images for each category. The hope is that non-iconic images can provide useful context that is missing from iconic images.

    Discussion:

    Since the publication of this paper, has there been any research on how including non-iconic images and more context affects detection performance?


    -Robert Allen

  12. The MS COCO dataset is a large-scale collection aimed at solving issues with non-iconic scene recognition, contextual reasoning, and 2D localization. Additionally, Microsoft details its approach for gathering such a large dataset via Mechanical Turk. The COCO dataset differentiates itself most in the number of instances per image, which aids the goal of contextual scene understanding.

    Discussion:

    One of the primary aspects of this paper is how Microsoft maintained dataset quality with Mechanical Turk. For example, a hierarchical labeling process was used to identify categories efficiently, and peer review was used to determine whether segmentations were accurate. Have any other research groups improved upon these techniques in a significant way? If so, how?

  13. The paper describes the method of constructing an image dataset for the detection and segmentation of everyday objects, with emphasis on finding non-iconic images of objects in natural settings. The dataset currently labels "things", which include prominent individual objects.

    Discussion
    1. How would appropriately labeling the "stuff" in an image help in the tasks described in the paper?
    2. The dataset statistics section mentions the effect of object size. How would algorithms perform if partial or zoomed-in objects are present in an image?

    Shantanu

  14. The paper introduces a new large-scale dataset for detecting and segmenting objects and describes the methodology adopted to create it. The MS COCO dataset contains 91 object categories and 2,500,000 labelled instances in 328,000 images. The dataset is designed to support detecting non-iconic views, locating objects precisely, and contextual reasoning between objects.

    Questions:

    1. The paper mentions that each image in MS COCO has five written caption descriptions. It would be interesting to know what these descriptions look like and the complexities involved in collecting them.

    2. Figure 9 shows the average segmentation overlap measured on MS COCO for the 20 PASCAL VOC categories, to demonstrate the difficulty of going from detection to segmentation. Out of the 20 categories, sofa has the lowest average segmentation overlap. Why is this so?
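
    For reference, segmentation overlap here is (as I read it) the standard intersection-over-union between two masks; a numpy sketch:

    ```python
    import numpy as np

    def mask_iou(a, b):
        """Intersection-over-union of two boolean segmentation masks."""
        a, b = a.astype(bool), b.astype(bool)
        union = np.logical_or(a, b).sum()
        return float(np.logical_and(a, b).sum() / union) if union else 0.0

    # Toy example: 4 'on' pixels in each mask, 2 shared -> IoU = 2 / 6.
    a = np.zeros((4, 4), bool); a[0, :] = True
    b = np.zeros((4, 4), bool); b[:2, 2:] = True
    print(mask_iou(a, b))  # 0.333...
    ```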

    ~ Anusha

  15. Summary:
    The paper describes the motivation and procedure behind the creation of the MS COCO database. The main intention of the database is to capture objects in context, to provide better semantic understanding during object recognition.
    Emphasis is placed on collecting non-iconic views of objects, as training on these is assumed to generalize well to natural scenes.
    The paper also delineates the pipeline of image annotation through Amazon Mechanical Turk, which involves category labelling, instance spotting, instance segmentation, and validation.
    It also reports the performance of a deformable parts model trained on the PASCAL dataset and tested on MS COCO, and vice versa, and concludes that MS COCO is significantly harder.

    Questions :
    1. Some categories were removed because a sufficient number of instances could not be found. How representative are the 91 categories compared to ImageNet's 1000?
    2. How scalable is this type of image collection and annotation with respect to the number of categories? For instance, what if I want to extend the present dataset's labels to breeds of dogs, in addition to just the label 'dog'?

  16. Summary:
    This paper presents a new dataset (MS COCO) of complex images of scenes with common objects (like cars, chairs, and people) labelled in their natural context. The new dataset focuses on detecting multiple instances (even of the same category) in non-iconic images. After a set of categories to look for was chosen, category labelling, instance spotting, and segmentation were performed by crowd workers to build a dataset with 2.5 million labelled instances in 328k images across 91 object types.
    Questions:
    1) I think the paper only briefly touched upon how the segmentation masks are learnt. Is there one mask per category? And for any given category, is its mask found by averaging those of the training instances? Since some categories can have many different shapes, like a human in different poses, how can we be sure that an average of masks over all these poses is good enough to detect all kinds of instances of the category?
    2) What was the reason behind leaving out labelling “stuff” in images?

  17. Summary:
    This paper justifies a CV dataset that contains many images with objects in non-canonical layouts ("canonical" roughly meaning "expected" or "picture perfect"). The justification is that training on a dataset of images with unorthodox points of view will yield better model performance on real-world images. The paper then walks through how the data was picked and labeled (using Amazon Mechanical Turk and a complex labeling scheme with many workers). State-of-the-art classifiers trained on older datasets are applied to the new one, and the performance drop is significant (indicating an increase in difficulty).

    Discussion:
    If I see the butt of a sheep (like in the image on the first page), my first response is that I have no idea what animal it is. Is it correct to train a single model for all sheep? Has anyone tried to train a model that segregates by pose? Why not include pose in the dataset? "Butt of sheep" would be an apt tag.

  18. Summary:
    This paper presents a new kind of dataset aimed at segmenting individual object instances. Non-iconic images, shown to generalize better, are used, with only 91 common entry-level categories but more instances within each image. Labeling was carefully designed at each step: category labeling first groups objects into 11 super-categories to speed up the process (a sketch of this idea follows), and crowd worker quality was weighted against other workers and the paper's authors.
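
    The two-stage category labeling might be mimicked along these lines (super-category names taken from the paper's hierarchy; the question flow itself is my guess at the UI logic):

    ```python
    # A slice of COCO's category hierarchy (the paper uses 11 super-categories).
    HIERARCHY = {
        "animal": ["dog", "cat", "horse", "sheep"],
        "vehicle": ["car", "bus", "train", "bicycle"],
        "kitchen": ["fork", "knife", "spoon", "cup"],
    }

    def leaves_to_verify(super_cats_present):
        """Stage 1: the worker flags which super-categories appear in the image.
        Stage 2: only those super-categories' leaf categories need checking."""
        return [leaf for sc in super_cats_present for leaf in HIERARCHY[sc]]

    # A worker who saw only animals and vehicles never considers kitchenware.
    print(leaves_to_verify(["animal", "vehicle"]))
    ```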

    Question:
    From the paper, I got the sense that different datasets aim at solving different computer vision research tasks, and that dataset quality matters for removing noise.
    But nowadays we need ever more training data to learn better CNN-based features. How many steps of the process can be automated while still keeping good quality control of the data, and how?

  19. Summary:

    MS COCO is a large-scale dataset addressing three core research problems in scene understanding: detecting non-iconic views of objects, contextual reasoning between objects, and precise 2D localization of objects.
    The paper explains the novel pipeline for building the dataset and what sets it apart from other datasets: the larger number of labeled instances per image, which aids precise 2D localization and the learning of contextual information.

    Question:

    - About the segmentation benchmark: can all rejected segmentations be filtered out automatically, or is human intervention, such as a visual check by an expert, still needed?
    - Since MS COCO has more instances per image than other datasets, would the cost be significantly higher?

  20. The paper introduces the MS COCO dataset, consisting of 91 object categories and 328,000 images. The key features of this dataset are the inclusion of non-iconic images for the object categories, a large number of instances per category, and objects present in their natural environment to allow for context-based search. The paper describes how the images were collected and annotated. Finally, it analyses the performance of the DPMv5 algorithm when trained on the PASCAL VOC 2012 dataset versus when trained on MS COCO.

    Discussion:
    1. The paper mentions that the images were collected by searching for pairs of words. Were these pairs made randomly, or did they have some likelihood of occurring together? For example, searching for "fork+cup" vs. "fork+cow". (See the sketch after this list.)
    2. One of the aims of this dataset is to encourage context based search. Are there any papers which attempt that?
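
    On point 1: mechanically, pair queries are just unordered category combinations; presumably, unlikely pairs like "fork+cow" simply return few Flickr results, which biases collection toward plausible contexts. A toy sketch:

    ```python
    from itertools import combinations

    categories = ["fork", "cup", "cow", "dog", "car"]

    # Every unordered category pair becomes one candidate search query.
    queries = [f"{a}+{b}" for a, b in combinations(categories, 2)]
    print(queries[:3])  # ['fork+cup', 'fork+cow', 'fork+dog']
    ```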

  21. Summary:

    This paper introduces MS COCO, a new comprehensive database of natural scene images. The images contain several objects in non-iconic views. Through crowdsourcing, 328,000 images are fully segmented at the instance level and annotated, yielding 91 object categories and more than 2.5 million instances.


    Question:

    1. Can this database be updated or further annotated to include segmentation and labeling for "stuff" along with "things"?

  22. The MS COCO dataset was introduced by MSR to push the state of the art in object detection and image segmentation, thereby promoting better scene understanding. The dataset focuses on common objects in complex images, so that the majority of the images are non-iconic and the dataset itself is rich in contextual information, with several objects present per image. Algorithmic analysis shows that MS COCO is a more challenging dataset than, say, PASCAL VOC for object detection.

    Discussion:
    One major insight of this paper, I think, is the observation that a dataset like MS COCO, with its more complex images, can be used for training and could then help learning algorithms generalize better on datasets that contain more iconic images. In the algorithmic analysis, however, it would be interesting to see how the dataset fares in comparison to the SUN dataset, which is a very comprehensive dataset for scene understanding. With regard to annotation and captioning: what experiments were conducted, and what metrics were used to judge the quality of the captions, so that automated captioning algorithms can be evaluated against them in the future?

  23. The paper presents a new dataset with the goal of improving object recognition by combining it with scene understanding. The dataset emphasizes detecting non-iconic views of objects, contextual reasoning between objects, and precise 2D localization of those objects. It also describes the annotation process using Amazon's Mechanical Turk, mainly involving category labeling, instance spotting, and segmenting each instance.
    The paper goes on to report performance using the MS COCO and PASCAL VOC datasets interchangeably for training and testing.

    Discussion:
    1) Why were only a maximum of 10 instances per category labeled? Was this to keep the total number of instances per category close to each other? And how does one select the best 10 instances?
    2) Instead of segmenting a few people and marking a crowd as a crowd, wouldn't segmenting at least a fraction of the people in a crowd provide more contextual information? It could also help with detecting/segmenting occluded people, or occluded objects in general.
