The paper presents a new visual question answering dataset aimed at setting up a task that is more AI-complete than other available tasks like image captioning. The authors took subsets of MS COCO images and created various Mechanical Turk tasks to collect questions and generate answers for them. They made sure the questions require the visual input (the image) to be answered properly. Finally, they show various statistics about the dataset and some baseline performance on it. The large gap between baseline and human performance suggests that a lot can be improved toward solving this task.
Discussion:
1) Table 1 says humans can answer 40.81% without looking at the image, while Table 2 says that BOW Q gets 48.09%. Isn't that odd? Is it because Table 1 is measured on just a subset?
2) What are some current state-of-the-art methods on this dataset?
This paper introduces tools and data to facilitate the development of methods that automate the task of providing natural language answers to natural language questions about images. To this end, the authors provide a dataset comprising 250k images, 760k questions, and approximately 10 million answers. The images were mostly drawn from the MS COCO dataset, supplemented by an abstract clipart-like scene dataset, while the questions and answers were collected via crowdsourcing with specific instructions: ask questions that would stump "a toddler" or "a robot", that require the image to answer, and whose answers are phrases rather than complete sentences. The data collection mechanisms are elaborated upon and also provided. Baseline methods for evaluating the dataset are provided as well, consisting of two channels: an image channel built on VGGNet, and a question channel whose best-performing variant is a recurrent net based on multiple Long Short-Term Memory (LSTM) layers. The two channels merge into a fully connected MLP layer fed into a softmax classifier.
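To make the two-channel design concrete, here is a minimal sketch in PyTorch-style code. The layer sizes, the 1000-answer output space, and the element-wise fusion follow how the reviewers above describe the paper's best baseline, but the class and variable names are mine and this is not the authors' released implementation:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class VQABaseline(nn.Module):
    """Illustrative two-channel VQA baseline: VGGNet image features plus an LSTM
    question encoding, fused element-wise and classified over the K most frequent
    answers (the paper uses K = 1000)."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512,
                 fused_dim=1024, num_answers=1000):
        super().__init__()
        vgg = vgg16()  # pretrained ImageNet weights would be loaded in practice
        self.cnn, self.pool = vgg.features, vgg.avgpool
        self.fc7 = nn.Sequential(*list(vgg.classifier.children())[:-1])  # 4096-d "fc7"
        self.img_fc = nn.Linear(4096, fused_dim)
        # Question channel: word embeddings -> 2-layer LSTM -> final hidden state.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.q_fc = nn.Linear(hidden_dim, fused_dim)
        # Fused representation -> MLP -> softmax (via cross-entropy) over answers.
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim, fused_dim), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(fused_dim, num_answers))

    def forward(self, image, question_tokens):
        x = torch.flatten(self.pool(self.cnn(image)), 1)
        img = nn.functional.normalize(self.img_fc(self.fc7(x)), dim=1)  # L2-normalised
        emb = self.embed(question_tokens)                 # (B, T, embed_dim)
        _, (h, _) = self.lstm(emb)
        q = torch.tanh(self.q_fc(h[-1]))                  # top layer's last hidden state
        fused = img * q                                   # element-wise fusion
        return self.classifier(fused)                     # logits over candidate answers
```

Training would then treat answer prediction as classification with a cross-entropy loss over the most frequent answers, which is roughly the MLP/softmax stage described above.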
Questions/Discussion: 1) (more of a theoretical/discussion point) Would this approach be a suitable template for sequence/video assessment and Q&A?
Summary: This paper presents a visual question answering challenge that partially stems from solving real-world AI problems combining computer vision, natural language processing, and knowledge representation and reasoning. MS COCO images and abstract scenes were used to build the dataset, with the questions and answers collected from AMT in both open-ended and multiple-choice formats. The authors also propose a deep network featuring a VGG net and an LSTM in parallel, taking the image and the question separately; the two same-sized FC layers are multiplied element-wise and then passed to an MLP to predict the answer.
Question: Can you explain more about how the answer is generated? What if the answer needs to be a sentence to give a good explanation for an open-ended question, rather than just a word or phrase?
This paper introduces Visual Question Answering and a dataset to help develop algorithms that attack AI-complete tasks. Basically, there is a dataset, largely drawn from MS COCO, along with open-ended questions about each image.
Q: Questions whose answers are just answerable by "common sense" are desired. Is that because they are too easy or too hard to answer?
D: (1) A way of adding in context? E.g., experts talking about their area of expertise presumably make much more fine-grained distinctions than non-experts (think artists talking about color vs. non-artists). Maybe tell the computer whether it should produce an answer (what color is X?) for a subject expert (artist) vs. an answer for a non-expert (non-artist). (2) Should we add in context? Is it worthwhile to do this?
The paper proposes Visual Question Answering, a new type of AI task that combines different sub-branches of AI. The idea is that, given an image and a set of questions about that image, can a machine generate correct or plausible answers? The authors state that this task is significantly more challenging than image captioning, since it requires more high-level and more fine-grained knowledge about the image as well as better natural language understanding to generate the answers; at the same time, the task is easier to evaluate than image captioning, because most answers are 1-4 words long. The authors describe the collection of a new dataset based on MS COCO images and synthetic, abstract images that capture high-level information. Much emphasis is placed on collecting as diverse a set of questions and answers as possible, and various analyses are performed to evaluate metrics such as diversity of words, types of questions and answers, and the importance of the image and common sense when answering questions. Baseline techniques using deep learning are provided.
Discussion: 1. Could we have a small discussion on computer vision techniques for understanding abstract images?
2. The authors have not mentioned any sort of validation done on the questions and answers; instead, they have explored attributes and aspects of the collected Q&A. Won't validation be important, especially when using the various metrics to categorize questions and answers? E.g., they mention that synonym comparison is a hard problem in NLP, but this could be relatively trivial with another round of crowd-sourced validation.
This paper proposes a dataset for visual question answering. The authors take an image and an open-ended question related to the image, for which a deep net provides an answer. The dataset is made up of real images from MS COCO and abstract scene images, both of which come with captions. The question collection part is pretty interesting: the questions are generated such that a smart bot would be unable to answer them. The answers to the questions are rated in different categories, and the authors found that answers to real and abstract images are generally similar. Another interesting feature is that the network is tuned to answer common sense questions as well.
Questions - The task at hand is really interesting. Could you explain in detail the baseline approach that the authors took?
This paper presents a dataset for visual questions and answers. It is intended to spark new research in the field of multimodal AI. The dataset is split into real and abstract images, and the questions are diverse. All of them are built on the premise that they are easier for humans than for algorithms to answer. The questions are also split into open-ended and multiple-choice. The researchers have also run some baseline tests on the dataset to see how well modern algorithms do against it (in terms of answering the questions).
Discussions: How does this compare with other datasets like FM-IQA?
The paper introduces Visual Question Answering, a pipeline that uses image information along with a learned vocabulary to answer questions relating to the image. The visual dataset used is a combination of MS COCO and sketched abstract images.
Discussion: The authors say that if 3 users choose the same answer, then the accuracy is 100%. Was there any previous literature that similarly used 3 users as a representative set? Also, what if 3 other users choose a contradicting answer?
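For context, the consensus metric described in the paper counts an answer as fully correct when at least three of the ten collected human answers match it, and gives partial credit otherwise. A tiny illustrative version (string normalisation of answers is omitted, and the function name is mine, not the official evaluation code):

```python
def vqa_accuracy(predicted, human_answers):
    """VQA consensus accuracy: min(#matching humans / 3, 1)."""
    matches = sum(ans == predicted for ans in human_answers)
    return min(matches / 3.0, 1.0)

# Example: 3 of 10 annotators said "yes", 2 said "no", 5 said "maybe".
answers = ["yes"] * 3 + ["no"] * 2 + ["maybe"] * 5
print(vqa_accuracy("yes", answers))    # 1.0   -- three agreeing humans suffice
print(vqa_accuracy("no", answers))     # ~0.67 -- partial credit for two matches
```

Note that under this metric two contradictory answers can both score well if each was given by enough annotators, which is part of what the question above is probing.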
The paper presents a dataset of images annotated with question-answer pairs. The images are context-rich and largely drawn from MS COCO. The questions are provided by Turkers, and are intended to require the image for a correct answer. A great deal of analysis of the data distributions is provided in the attached appendices, and a baseline classifier is trained on the question answering task to some degree of success.
Discussion:
1) Table 1 shows that there is some amount of information in the captions that can be used to answer questions without looking at the matching image. This suggests a new dimension along which to evaluate how good a caption is: the less necessary the image is for answering questions given its caption, the better the captioning system that generated it. Has anyone tried to use this or a similar approach as a metric for evaluating systems trained to generate image captions?
In this paper, the authors propose the task of Visual Question Answering: a system that takes an image and a free-form question as input and outputs a natural language answer. The authors create a dataset made up of scenes from MS COCO and an abstract scene dataset. For each image, three questions and their answers are provided. All questions and answers are created by Mechanical Turk workers. The authors show the distribution of question types and answer types. A classifier is trained on the dataset to perform the VQA task.
Questions:
1) What is the advantage of the abstract scene portion of the dataset? Can you explain what kind of tasks would use the abstract scenes over the MSCOCO scenes?
These guys made a new dataset. Taking 200k images from MS COCO and 50k scene images they created, they generated 3 questions for each image. These questions have a range of complexities, from "what color are her eyes" to "is this person expecting company." They gathered human answers for each question, along with the confidence level the human had at the time of responding. The dataset is designed to test Visual Question Answering algorithms, and they use some metrics to determine the diversity of their dataset. They test the dataset against simple algorithms (always guess the most common answer, and kNN), and then run their own model. Their model uses the last layer of VGG net as image features and bag-of-words (BOW) features for the question; the question features are also passed through an LSTM (a recurrent neural network) and finally combined with the image features. They train a multilayer perceptron to make the final decision. They get relatively good results (48% accuracy) even when ignoring the image, which suggests that many image-based questions have relatively strong priors.. Their algorithm performs about as well as a 5-year-old child (they estimate).
Discussion: do we think coding knowledge directly into a database structure is necessary? Or do we think deep networks will continue to outperform models that try to directly encode knowledge?
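To make the "strong priors" observation concrete, here is a hedged sketch of the kind of image-blind baselines mentioned above: always guessing the single most common training answer, or guessing the most common answer among questions that start the same way. The grouping by the question's first two words is my simplification; the paper groups by annotated question type:

```python
from collections import Counter, defaultdict

def fit_blind_baselines(questions, answers):
    """Fit two baselines from (question, answer) training pairs, never using the image."""
    overall = Counter(answers).most_common(1)[0][0]        # single most frequent answer
    by_prefix = defaultdict(Counter)
    for q, a in zip(questions, answers):
        prefix = " ".join(q.lower().split()[:2])           # e.g. "what color", "is the"
        by_prefix[prefix][a] += 1
    per_type = {p: c.most_common(1)[0][0] for p, c in by_prefix.items()}
    return overall, per_type

def predict_blind(question, overall, per_type):
    """Answer a question without ever looking at the image."""
    prefix = " ".join(question.lower().split()[:2])
    return per_type.get(prefix, overall)
```

That baselines like these already reach sizeable accuracy is why the no-image result above is less surprising than it first looks.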
This paper presents a system that takes as input an image and a free-form/open-ended question and produces a natural-language answer. The authors create a dataset using real and abstract images, and collect both open-ended and multiple-choice questions and answers for the images. They provide an in-depth analysis of this new dataset and perform baseline experiments to compare with human performance.
Can you go over the authors' model for the VQA task?
This paper presents a novel computer vision challenge and an associated dataset. The research question is based on the idea that if we show an algorithm a picture and ask a question about it, the algorithm should be able to reason about the image and provide an answer. The dataset is built partially from MS COCO and partially from abstract scenes (cartoon-like depictions of real-world scenarios). Each image then has a series of questions that can be asked about it. These questions can't be answered by low-level techniques, so as to force the algorithm to learn visual understanding and reasoning.
1. Could we use this dataset to tell stories? We know what sort of questions a person might ask for each image - can we then use that to tell a short story about the scenario that led to such a scene? 2. Are these questions likely ones that a human would ask? For example, I am more likely to ask why that woman taped bananas to her face than to ask what is on her face. Can we model more human-like questions using this dataset?
This paper proposes a system for automatic question answering from images. This work aims toward high-level understanding of visual content and common sense knowledge. Given an input image and an open-ended question, the system provides an open-ended answer. First, the paper introduces a new dataset of images with open-ended, interesting questions (3 questions per image), where 10 Turkers answer each question. In total, the VQA dataset has 760K questions with 10M answers. Different experiments are done to analyze the types of questions and answers using text and image features.
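For readers who want to picture the per-question structure described here, a hedged sketch of one annotation record (this is an illustrative layout I made up for clarity, not the exact released JSON schema):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class VQAEntry:
    """Illustrative record layout: each image carries 3 questions, and each
    question carries 10 crowd-sourced answers plus multiple-choice candidates."""
    image_id: int
    question: str
    answers: List[str]                                          # 10 human answers from AMT
    multiple_choices: List[str] = field(default_factory=list)   # candidate answers
    is_abstract: bool = False                                    # real MS COCO photo vs. abstract scene
```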
Q: What is the advantage of having the abstract scene dataset since all the evaluation was done using the dataset from MS COCO?
In this paper the authors propose the novel problem of Visual Question Answering, which aims to answer questions that humans could answer by looking at a given image. They do an in-depth analysis of this problem and release two datasets: MS COCO VQA and abstract scenes VQA. They use Mechanical Turk to generate two types of questions: multiple-choice questions and answers for specificity, and open-ended questions for generalization.
They discuss the care taken in generating the dataset so that the questions are non-trivial. They also explore the role of photorealism in this task by using abstract scenes made from clip art, and show that photorealism is not an essential aspect. The datasets were part of a challenge and are publicly available.
The method they propose as a solution to this problem uses a combination of an LSTM and a conv net to combine language and image features and perform classification over the answers. They show that they can achieve an accuracy of up to 62% on this task.
Questions:
Can you explain the LSTM and the word embeddings? Why are they learning an embedding of the question?
The authors claim the task requires several sub-tasks like object detection, but isn't the model just using generic image features and then learning an embedding of the question? Basically it does not have any notion of the concept of an object.
The paper presents the VQA dataset, a collection of images paired with a set of questions that can be answered from the image, together with their corresponding answers. The authors build the dataset using 200,000 images from the MS COCO dataset and around 50,000 abstract images. They then use AMT to generate questions answerable from the image, along with their corresponding answers. Finally, the authors use an MLP model as well as LSTMs to present baseline performance on their dataset.
Questions: -- While evaluating the accuracy of the answers, the authors mention that they look at the number of humans that gave the same answer. However, is the similarity between answers measured? -- How does including abstract images help the model perform better on natural images? Are the image features learnt for abstract images similar to the ones learnt for a natural image representing the same scene?
The paper presents a Visual QA system that allows free-form and open-ended questions to be asked about an image, with the aim of providing a natural answer. The authors combine MS COCO and abstract images to create a dataset, with the training answers collected from AMT workers. Their best model uses a two-layer LSTM to encode the questions along with the last hidden layer of VGGNet to encode the images. The network performs well on questions that depend on scene-level information but poorly on questions that require more reasoning.
Discussion: How does this work compare to Karpathy and Fei-Fei's work on generating image descriptions using multimodal RNNs? They don't use LSTMs initially but use them later and get better results. I can't quite see why LSTMs are so well suited to this task - can someone explain?
The paper proposes a large dataset for visual question answering. The dataset contains over 760K questions with approximately 10M answers. The paper compares the accuracy of the proposed method on open-ended and multiple-choice tasks on the VQA test-dev split for real images. Accuracy on multiple-choice turns out to be better than open-ended, as expected. The best model proposed in the paper (LSTM Q+norm I) outperforms the other baselines.