Summary:
This paper presents a system that recognizes human sketches of 250 object categories with real-time performance. 20,000 sketches were collected through AMT, with instructional examples used to encourage simple, non-fine-grained drawings; the category list was built by augmenting labels from LabelMe, the Princeton Shape Benchmark, and Caltech-256. Sketches were first normalized by translation and isotropic rescaling to achieve global scale and translation invariance; SIFT-like features storing only local orientation (SHoG) were extracted; the FFT was used to speed up filling the histogram bins during feature extraction; and a soft-margin SVM was used to classify the resulting features.
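To make the descriptor step concrete, here is a minimal sketch of a SHoG-style local feature under my own simplifying assumptions (the 4×4 grid and 4 orientation bins are illustrative choices, not necessarily the paper's parameters, and the FFT speed-up the summary mentions is omitted):

```python
import numpy as np

def shog_descriptor(patch, n_bins=4, grid=4):
    """Toy SHoG-style descriptor: a grid of orientation histograms
    over a grayscale sketch patch (orientation only, modulo pi)."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)                   # used only as a vote weight
    ori = np.mod(np.arctan2(gy, gx), np.pi)  # undirected line orientation
    h, w = patch.shape
    desc = []
    for i in range(grid):
        for j in range(grid):
            sl = (slice(i * h // grid, (i + 1) * h // grid),
                  slice(j * w // grid, (j + 1) * w // grid))
            hist, _ = np.histogram(ori[sl], bins=n_bins,
                                   range=(0, np.pi), weights=mag[sl])
            desc.append(hist)
    v = np.concatenate(desc)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v
```

Storing orientation modulo pi treats a stroke and its reverse as the same line, which is why only orientation (and not gradient direction) matters for binary sketches.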
Questions/Discussion:
As mentioned in Section 9 of the paper, the authors single out the spatial layout and temporal order of strokes as worth further research. My question is how spatial/temporal features have been used to address such problems. Do any of them work well, and when do they fail?
Summary:
This paper attempts to solve the problem of recognizing human sketches of objects – which tend to be not only iconic but visually very different from their photo-realistic counterparts. The first contribution towards this goal is a huge dataset of labeled objects drawn by a large number of different people (through crowd-sourcing). The dataset contains 250 object labels and 20,000 sketches. The second contribution is a human evaluation benchmark on the dataset: humans could recognize the sketches around 73% of the time. Finally, the last contribution is an automatic sketch classifier with a comparison to other classifiers. The SVM with an RBF kernel performed the best at recognizing sketches (even sketches from artists not in the training set, such as cave drawings).
Questions:
1. Would including context with the object improve the human recognition accuracy? Why was context removed and only the objects considered?
2. Was the age of the artist taken into account? Children tend to draw differently (and even children of different ages draw drastically differently) than adults – would this system recognize a child's drawing?
This paper presents a mechanism by which simple human line drawings, drawn from one of 250 different categories, can be classified with nearly the accuracy that humans exhibit on the same task. Using Amazon Mechanical Turk, human participants are given a tool and instructions on how to draw simple sketches of examples of any of the 250 given categories. These sketches are encoded into features through the use of bag-of-features representations converted to histograms of visual words, using a process very similar to the scene recognition algorithms we implemented in the Computer Vision class (albeit with a kernelized distance function that encodes weighted distances to all visual words), and they are classified in a similar manner, using both kNN and 1-vs-all SVM classifiers.
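As a rough illustration of that "weighted distances to all visual words" step, here is a toy soft-assignment histogram; the vocabulary size and Gaussian bandwidth are my own placeholder choices, not values from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, n_words=500):
    # k-means over local descriptors pooled from many sketches
    return KMeans(n_clusters=n_words, n_init=4).fit(descriptors).cluster_centers_

def soft_histogram(descriptors, centers, sigma=0.1):
    # each descriptor votes for every visual word, weighted by a
    # Gaussian of its distance to that word's center
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)
    h = w.sum(axis=0)
    return h / np.linalg.norm(h)
```

Compared to hard assignment, the soft version is less sensitive to descriptors that fall near a boundary between two visual words.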
Discussion/Question: this seems like a task that would be well suited to a Convolutional Net approach - has this been tried?
Abstract:
The paper presents a new dataset of non-expert human sketches organized by category, and builds a classifier that assigns a sketch to a category. The larger goal of the paper is to open up the direction of understanding how people sketch and to promote sketch-based querying, since with a sketch a user can easily convey pose and orientation information. The paper also presents a local feature descriptor and a multiclass SVM classifier that achieve good accuracy.
Discussion:
I guess the 'hardness' of guessing the category for humans comes from overlap between categories. The 'seagull' and 'standing bird' categories are very close, and a sketch could plausibly belong to both.
This paper tries to classify human-drawn sketches into many different categories. First, a sketch dataset of 20,000 drawings in 250 different categories is collected using Mechanical Turk. Following a bag-of-features approach, sketches are sampled from the dataset to create a visual vocabulary using SIFT-like features and clustering. Both kNN and 1-vs-all SVMs are used together with the visual vocabulary to classify sketches. Accuracy with SVMs is 56%, which is much better than chance, but still significantly lower than human performance.
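For reference, a minimal version of the kNN-vs-SVM comparison could look like the following; the data here is a random stand-in, and the hyperparameters are guesses rather than the paper's tuned values:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_score

# stand-in data: replace with real bag-of-features histograms and labels
rng = np.random.default_rng(0)
X = rng.random((500, 500))          # 500 sketches x 500 visual words
y = rng.integers(0, 10, size=500)   # 10 fake categories

for name, clf in [('kNN', KNeighborsClassifier(n_neighbors=5)),
                  ('1-vs-all SVM (RBF)',
                   OneVsRestClassifier(SVC(kernel='rbf', C=10)))]:
    print(name, cross_val_score(clf, X, y, cv=3).mean())
```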
Question:
I thought Figure 11 was confusing. The description says that "red entries mean humans are better", but how does that work for categories off the diagonal? For instance, cup and teacup have a red square. Does this mean humans are better at mislabeling a cup as a teacup, or that they are better at making the distinction between the two?
Hi,
Request for everyone. Please do not reveal the answer to the riddle in Fig 1. It would be nice to compare answers in class tomorrow.
In this paper, the authors collect 20,000 unique sketches for 250 object categories. Humans can identify the correct category 73 percent of the time. Their resulting recognition algorithm got 54% correct using an SVM. They describe how they gathered the dataset via AMT.
Q: How did they choose the categories? It seems that the accuracy is largely influenced by the specificity of the categories.
Sketches tend to be simultaneously simplified and exaggerated representations of iconic views of certain objects, and can be a natural way to communicate the appearance of objects. This paper attempts to have an SVM learn a sketch representation for several different classes of objects. Using a BoW representation of SIFT-like features and an SVM with an RBF kernel trained on some 20,000 images across 250 classes, the system learned to correctly identify object classes 54% of the time, compared to humans' 73%.
Question:
1) When humans sketch they have an iconic representation of an object in mind. In this paper the agent is presented with examples of sketches in the absence of such an image. How would the results look if along with each sketch the learning agent had been presented with an image of the object being sketched?
2) The paper states that incorrect drawings were removed but low quality ones were not. How could they tell which ones were bad and which ones were incorrect (belonging to the wrong class) if humans are only right at this task 73% of the time?
The paper discusses a first-of-its-kind study of human sketch recognition, comparing computational models for classification against human ability as a baseline. The major contributions of this paper are the large and varied sketch dataset generated through crowdsourcing, the SIFT-like feature descriptor for sketch images, which performs remarkably well given its simplicity, and an analysis of different models for classifying sketches. Applications and drawbacks of the research are also provided.
Discussion:
1. Would using a more gamified crowdsourcing approach, such as the ESP game, lead to less confusion among humans for semantically similar categories?
2. While temporal order of sketches seems useful, how would you account for the fact that different users may have different starting points and have different hand preferences?
3. Previous papers use sketch gradient comparison. Was that considered as a possible feature representation?
Summary:
The paper analyses how humans sketch and recognize objects, as well as the possibility of automating this task. The following are its most significant contributions:
1. It presents a large database of 20,000 human-drawn sketches organized into 250 categories.
2. It analyses human accuracy in recognizing these sketches.
3. Finally, the paper presents a method to represent the sketches in a robust feature space and cluster them into different categories. The authors also examine how well an SVM classifier and a k-nearest-neighbor classifier predict the category of a given sketch.
Discussion:
1. I did not clearly understand how the described descriptor differs from SIFT. Moreover, what was the intuition behind designing a new descriptor as opposed to using SIFT?
2. In my opinion, contextual information would certainly improve the performance of humans in the recognition task. For example, a “tire” would be less likely to be categorized as a “donut” if some context were present. Would contextual information increase the accuracy of machines as well?
This project was developed to amass non-expert human sketches of simple objects (ranging over 250 categories) and then perform sketch recognition. The sketches were obtained by crowd-sourcing, i.e. using HITs on Amazon Mechanical Turk. Around 20,000 sketches were collected, each with both stroke data and a bitmap conversion of the image.
For sketch recognition, HITs were first used to classify the sketches into the 250 categories. Then classifiers based on conventional computer vision techniques, such as k-NN and SVM, were built. As per the results, human recognition performs better.
Discussion:
1. I am sure better classifiers can be built for this dataset, including but not restricted to CNNs. Was there any follow-up on this project for improving sketch recognition?
2. Sketching with pencil and paper feels different from drawing on a touchscreen, and the strokes sometimes also depend on which device is being used to draw. Did all the HIT workers use the same device to draw, or were they allowed to draw on any of their touchscreen devices?
The project amasses 20,000 sketches in 250 categories for use in recognition tasks. The researchers then applied classic CV techniques such as k-NN and SVM classification and compared the results to human recognition.
Discussion:
1. Would CNNs or more modern classifiers work better?
2. What is the effect of the pose of objects on the outcome?
This paper presents the first database of sketched objects. First, a taxonomy of 250 objects was defined; then each participant drew a small number of objects, yielding in total 20,000 images sketched by 1,350 participants. The paper introduces a new descriptor for this database, using a bag-of-words model. Different sets of experiments and unsupervised analyses were conducted, and different supervised classification models were developed. The results were compared to human classification, showing that humans do a far better job of recognizing the object in even a badly drawn sketch. Finally, the paper presents different applications that can leverage this work and this database, such as interactive sketch recognition.
Research Question:
In Section 5.1, it is mentioned that a local patch size of 28 × 28 is the best choice, since larger patches give a better representation. Would smaller local patches help improve recognition results on the difficult classes?
In Section 9.1, the paper mentions that using a spatial pyramid representation did not improve the results, and suggests that a better spatial representation could help but would have to be distinct from the developed features. Any intuition about what kind of spatial representation could help?
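For readers unfamiliar with the spatial pyramid the question refers to, here is a toy version; the grid levels, canvas size, and the per-feature `word_ids`/`positions` arrays are hypothetical names and choices of mine, not the paper's:

```python
import numpy as np

def spatial_pyramid(word_ids, positions, n_words, side=256, levels=(1, 2, 4)):
    """Concatenate bag-of-features histograms computed over
    increasingly fine grids (1x1, 2x2, 4x4) of the canvas."""
    feats = []
    for g in levels:
        cell = side / g
        idx = np.clip((positions // cell).astype(int), 0, g - 1)
        for i in range(g):
            for j in range(g):
                in_cell = (idx[:, 0] == j) & (idx[:, 1] == i)
                feats.append(np.bincount(word_ids[in_cell], minlength=n_words))
    v = np.concatenate(feats).astype(float)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v
```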
The paper proposes an approach for large-scale human sketch recognition. 20,000 human sketches distributed over 250 object categories were collected for training and testing. These sketches are encoded into features using a bag-of-features approach and clustered into different categories. The paper compares the performance of different classification models such as k-NN and SVM, and concludes that SVM performs better, with an accuracy of 56%, which is still lower than human accuracy (73%).
Question/Discussion:
The features extracted are translation and scale invariant. Why wasn't rotational invariance taken into account? During the sketch collection process, people were not given instructions on how to draw a specific object, so they were free to choose any view of the object to draw. Wouldn't making the features rotationally invariant improve recognition accuracy?
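One standard way to get rotation invariance, offered here only as a hedged illustration (it is not what the paper does): circularly shift each orientation histogram so its dominant bin comes first, analogous to SIFT's dominant-orientation normalization.

```python
import numpy as np

def rotation_normalize(hist):
    # shift so the strongest orientation bin is first; two sketches
    # differing only by rotation then map to the same vector
    return np.roll(hist, -int(np.argmax(hist)))

print(rotation_normalize(np.array([1, 5, 2, 0])))  # -> [5 2 0 1]
```

Whether this would actually help is unclear, since sketch orientation is often canonical and therefore informative in itself.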
This paper analyses 20,000 unique sketches spread across 250 image categories to test computer vision algorithms' recognition accuracy against that of humans, which is about 73%.
Features are extracted from these sketches using a bag-of-features approach and then fed to a standard classifier like a one-vs-all SVM. This pipeline achieved an accuracy of about 56%, which is much better than chance (0.4%) but not as good as human-level recognition.
Questions:
1. The obvious question is how a deep learning architecture like a CNN would perform on this task.
2. It seems that humans perform better at the recognition task on some object categories than others. Is there any correlation between the top-performing categories for humans and for the CV algorithms?
Summary:
This paper explores the field of human object sketches, for the first time, on a large scale. Using sketches collected from various users on Amazon Mechanical Turk, it builds a dataset of 20,000 unique sketches uniformly distributed over 250 of the most commonly used categories. This dataset is used in experiments on how humans perceive sketches and whether they can identify sketches drawn by other humans. The paper also demonstrates the performance of a computational model, built on a bag-of-features-like representation of sketches, in applications such as interactive sketch recognition and semantic sketch-based retrieval.
Questions:
1) My question is regarding the extension of sketches to scenes – what more information would need to be captured by the features in order for a dataset of scene sketches (still no text!) to perform well on a scene recognition task? One piece of information I can think of is the spatial layout of the various high-level categories in a sketch.
This paper presents a large, crowdsourced dataset of 20,000 human-drawn sketches across 250 categories, as well as a method to classify these sketches. Using Amazon Mechanical Turk, participants drew iconic views of the object they were instructed to sketch. The authors use SIFT-like features and an SVM to get a best classification performance of 56% accuracy, compared to the human baseline of 73% accuracy.
Was there any way of testing the diversity of the dataset used other than viewing the layouts (maybe like ImageNet's average image)?
This paper explores the uncharted area of how humans sketch objects. A novel database containing 20,000 sketches of 250 commonly used categories was developed through crowd-sourcing, both for this purpose and to support future research. A bag-of-features model followed by a multi-class SVM classifier is used in the pipeline. Feature vectors are binned and normalized to form a representative vector h for each image. The paper explains how clustering is carried out to group categories and how a sketch is elected to represent each category. Results from using kNN and multi-class SVM classifiers for various sizes of the data are reported; recognition rates of 56% are achieved, compared to humans' 73%. It is also described how the machine performs better for some categories, while humans are very effective at generalizing from a small dataset.
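As a concrete (and simplified, hard-assignment) reading of "binned and normalized to form a representative vector h", one might write something like the following; `codebook` is assumed to be the k-means cluster centers of the visual vocabulary:

```python
import numpy as np
from scipy.cluster.vq import vq

def hard_histogram(descriptors, codebook):
    # assign each local descriptor to its nearest visual word,
    # count the assignments, and L2-normalize the counts into h
    word_ids, _ = vq(descriptors, codebook)
    h = np.bincount(word_ids, minlength=len(codebook)).astype(float)
    n = np.linalg.norm(h)
    return h / n if n > 0 else h
```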
Discussions-
1. How important is rotational invariance here? From what I understand, humans generally draw sketches from a flat landscape perspective, orthogonal to the ground, or, for flying objects, orthogonal to the horizon.
Questions-
1. Was the feature representation good enough to keep the various categories well separated in the high-dimensional space, so that they can be easily classified using binary classifiers?
2. What type of kernel was used?
The paper presents a new method for large-scale human sketch recognition. It discusses the building of a new database of 20,000 human sketches distributed over 250 object categories, collected for training and testing. SIFT-like features are extracted from these sketches, a dictionary is built using a bag-of-features approach, and the features are clustered into different categories. The authors also compare the performance of different classification models such as k-NN and SVM.
Discussion:
1) How about combining orientation-histogram-based features with lower-level shape detectors or spatial features?
2) What about CNN extensions, since CNNs have worked well for recognition tasks like handwriting?