Monday, April 4, 2016

Wed, April 6 - Visual Madlibs

Visual Madlibs: Fill in the blank Description Generation and Question Answering. Licheng Yu, Eunbyung Park, Alexander C. Berg, Tamara L. Berg. ICCV, 2015.

project page, pdf

Another note -- attendance hasn't been great and many people are arriving to class late. It's vital to have people present for discussions. I do take attendance and, as the syllabus says, it is part of your grade.

22 comments:

  1. Abstract:

    The paper presents the alternative tasks of guided captioning and multiple-choice question answering as a way of combining the natural language and computer vision modalities. The authors use MS COCO images as their base and use Mechanical Turk to generate fill-in-the-blank style captions. They also provide an analysis showing that the coverage of their descriptions is higher than that of the MS COCO captions. Some of their questions ask about the past and future of the image scene, which requires a deeper understanding to answer. Finally, they show baseline results for the tasks.


    Discussion:

    1) How does this dataset compare to VQA in terms of its ability to capture scene understanding?

    2) Is there any work where people have tried to combine the natural language and computer vision modalities with a knowledge base? It seems like many of these questions require real-world knowledge that can only come from KBs. This is being explored in NLP QA tasks, but I am not sure about visual QA tasks.

  2. This paper presents a captioned dataset built from 10k human-centered images from MS-COCO using a fill-in-the-blank strategy. The approach leverages existing captions to construct candidate sentences that are missing important descriptive terms relating to the image, and a Turk worker either provides the missing text or selects from a list of choices. Multiple algorithms were tested on the tasks, both easy and hard multiple choice (determined by the source of the choices presented) and fill in the blank, and their results were presented.

    Questions/Discussion :
    1) How can inherently subjective questions provide any kind of salient information about an image? (i.e. "This image makes me feel _________").

    2) Can you describe the canonical correlation analysis methods more? Why did the nCCA variants perform so well?

  3. This paper introduces another image captioning dataset. It consists of around 10,000 images of humans from MS-COCO, captioned using a Madlibs-style approach (blanking out part of a sentence and having Turkers fill in the blank). The idea is to capture fine-grained relationships between objects: descriptive terms, actions, or relationships. Several different approaches were taken to produce baseline results for the fill-in-the-blank task, with some degree of success.

    Questions:

    1) Under the Experiments section, they initially represent sentences by "[averaging] the Word2Vec scores of all the words in a sentence to get a 300 dimensional representation". Doesn't this discard the positional relationships the words have to each other? If you're trying to determine relationships between nouns in a sentence, it seems like word position matters a lot (a small sketch of this averaging appears at the end of this comment).

    2) What sort of accuracy does the system attain when training without visual cues? Do the sentence structure of the captions or the co-occurrence of certain words influence the predictions?

    -Stefano
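
    Regarding question 1 above, a minimal sketch (my own illustration, not the authors' code) of what averaging Word2Vec vectors into a sentence representation might look like; `word_vecs` is a hypothetical lookup of pretrained 300-d embeddings:

    ```python
    import numpy as np

    # word_vecs: hypothetical dict mapping word -> pretrained 300-d Word2Vec vector
    def sentence_vector(sentence, word_vecs, dim=300):
        """Average the vectors of all known words; word order is discarded entirely."""
        vecs = [word_vecs[w] for w in sentence.lower().split() if w in word_vecs]
        if not vecs:
            return np.zeros(dim)
        return np.mean(vecs, axis=0)

    # "the dog chases the cat" and "the cat chases the dog" get identical vectors,
    # which is exactly the loss of positional information question 1 points at.
    ```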

  4. This paper presents a new type of VQA dataset where, instead of free-form questions and answers, there are targeted, specific prompts for fill-in-the-blank style answers addressing various aspects of the image. The key idea is to provide a mechanism for more descriptive captions that target specific objects and attributes of the image, and that are more amenable to evaluation via multiple-choice answers. The authors use template-based madlib generation over MS COCO images and then use AMT to collect answers for each question prompt type. Only human-centric images with accompanying objects are chosen to create the dataset. The authors analyze the madlib responses, compare their comprehensiveness and precision with MS COCO captions, and evaluate metrics on the two tasks: visual madlib description generation and multiple-choice question answering.

    Discussion:
    1. Is there some reasoning for why the authors chose Word2Vec for similarity mapping when a previous paper uses BLEU and METEOR? In general, are some NLP techniques more favorable for certain CV+NLP tasks? (A toy comparison sketch follows this comment.)

    2. The dataset uses mainly human-centric images. Could this be a potential error in hindsight, since over half the question prompts are human-based, making it harder to collect data and evaluate techniques on images that don't contain humans? For images without humans, one would have to come up with a new (and partially overlapping) set of question prompts to collect associated madlibs (which, in my very naive opinion, is more overhead).
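
    On question 1, a toy comparison (my own illustration, not from the paper) of an n-gram metric versus an embedding-similarity score for judging a candidate answer against a reference; `word_vecs` is a hypothetical pretrained-embedding lookup:

    ```python
    import numpy as np
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    def avg_vec(tokens, word_vecs, dim=300):
        vecs = [word_vecs[t] for t in tokens if t in word_vecs]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    def compare(reference, candidate, word_vecs):
        ref, cand = reference.split(), candidate.split()
        # BLEU rewards exact n-gram overlap only ...
        bleu = sentence_bleu([ref], cand, weights=(0.5, 0.5),
                             smoothing_function=SmoothingFunction().method1)
        # ... while cosine similarity of averaged Word2Vec vectors also rewards
        # near-synonyms ("sofa" vs. "couch") that BLEU treats as plain mismatches.
        a, b = avg_vec(ref, word_vecs), avg_vec(cand, word_vecs)
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        return bleu, cos
    ```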

  5. Summary: This paper presents a new dataset, Visual Madlibs, for VQA, with an emphasis on more detailed, focused question answering in contrast to generic image descriptions. The dataset builds on MS COCO, formatting each example as Image + Instruction + Prompt + Blank, given the descriptions from MS COCO. Similar to the VQA dataset from Virginia Tech, Visual Madlibs supports both automatic description generation for the blank and multiple choices to fill in the blank. The authors also experiment with a variety of models on this task: nCCA shows good results on the easy task; CNN+LSTM performs well on fine-grained questions such as object attributes; and a Places-CNN helps with scene, person location, and image emotion question answering.

    Question:
    Can you please give more details on how each model (nCCA, CNN+LSTM) is trained on this dataset, for the automatic description generation problem as well as for the multiple-choice problem? (A rough CNN+LSTM sketch follows this comment.)
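
    Not the authors' code, but a rough sketch of how a CNN+LSTM model for this task might be wired up (PyTorch; the dimensions and the image-conditioning scheme are my assumptions):

    ```python
    import torch
    import torch.nn as nn

    class MadlibLSTM(nn.Module):
        """Score or generate the words of a madlib answer, conditioned on an image."""
        def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, img_dim=4096):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.img_proj = nn.Linear(img_dim, hidden_dim)  # image feature -> initial hidden state
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, img_feat, tokens):
            # img_feat: (B, img_dim), e.g. a precomputed VGG fc7 vector
            # tokens:   (B, T) word indices of the prompt plus candidate answer
            h0 = torch.tanh(self.img_proj(img_feat)).unsqueeze(0)  # (1, B, hidden_dim)
            c0 = torch.zeros_like(h0)
            hidden, _ = self.lstm(self.embed(tokens), (h0, c0))
            return self.out(hidden)  # (B, T, vocab_size) next-word logits

    # Training would minimize cross-entropy on the next word; at test time the
    # model either generates the blank or scores each multiple-choice candidate
    # by its total log-likelihood and picks the highest.
    ```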

  6. This paper introduces a new dataset with around 10,000 images and 360,000 natural language descriptions, with questions directed at specific objects in each image. The images are human-centric images from the MS-COCO dataset, with AMT workers providing the missing text. The authors experiment with different models and show that the networks are also capable of answering temporal (past/future) questions about an image.

    Discussion:
    Can we discuss CCA, nCCA, and CNN+LSTM in more detail?


  7. Visual Madlibs is a dataset of natural language descriptions for images. The dataset is created by taking already-labeled images and posing standard fill-in-the-blank or multiple-choice questions based on a subset of the object labels. The labeled images are a subset of the MS-COCO dataset, and AMT workers are used to answer these questions. The authors analyze the dataset by quantifying the Madlibs responses based on length, structure, consistency, and phrase chunking. They also compare the Madlibs responses to MS-COCO image descriptions. Using this dataset, they train CCA and CNN+LSTM models and evaluate the results.

    Discussion:
    1. When generating questions for each category (scene, emotion, past, future, etc.) does each image get one or more questions from each category? If not, how are the question categories chosen for an image just by knowing its object labels?

  8. Visual Madlibs is a dataset that pairs natural language descriptions with images. The dataset is created using "Madlib" style fill-in-the-blank questions as well as multiple choice.

    Q/Discussion
    1. Can one adjust the difficulty of the negative options in the multiple-choice answers?
    2. Why did CNN+LSTM do so poorly in Table 3 (filtered questions from Hard Task)?

  9. This paper talks about Visual Madlibs, a new kind of dataset for image captioning tasks similar to the VQA paper. The dataset is created by taking a subset of MSCOCO (~10k images) and generating fill-in-the-blank style template descriptions for each image. These are then provided to AMT workers, who either fill in the missing part or select from a list of provided options. The authors then compare the Visual Madlibs dataset with MSCOCO on answering multiple-choice questions and show that Madlibs outperforms MSCOCO. They compare the accuracy of answering easy and hard multiple-choice questions using different models such as CCA (with the Places dataset), nCCA (box), and CNN+LSTM trained on Madlibs. One interesting observation from the experiments is that Madlibs performs well on tasks that require lots of visual information.

    Discussion:
    1. What is the intuition behind using Word2Vec and then taking an average over all words in a sentence to get the representation?
    2. Could you please talk more about CCA and nCCA? (A rough CCA sketch follows this comment.)
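
    A rough sketch (my assumptions, not the paper's implementation) of a CCA-style baseline: fit a joint embedding between CNN image features and averaged Word2Vec answer features, then pick the multiple-choice option most correlated with the image. File names and dimensions are hypothetical.

    ```python
    import numpy as np
    from sklearn.cross_decomposition import CCA

    # X: (N, 4096) CNN image features; Y: (N, 300) averaged Word2Vec vectors of
    # the ground-truth madlib answers (both hypothetical, precomputed elsewhere).
    X = np.load("image_feats.npy")
    Y = np.load("answer_feats.npy")

    cca = CCA(n_components=128, max_iter=1000)
    cca.fit(X, Y)

    def pick_answer(img_feat, candidate_feats):
        """Project the image and the candidate answers into the shared space
        and return the index of the most similar candidate."""
        img_c, cand_c = cca.transform(img_feat[None, :], candidate_feats)
        img_c = img_c / np.linalg.norm(img_c)
        cand_c = cand_c / np.linalg.norm(cand_c, axis=1, keepdims=True)
        return int(np.argmax(cand_c @ img_c.ravel()))
    ```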

  10. The authors introduce a new fill-in-the-blank strategy for collecting targeted natural language descriptions for images. These descriptions are shown to be more detailed than generic whole-image descriptions. They also discuss dataset generation, which uses a subset of MS-COCO and fill-in-the-blank style template descriptions. They compare this dataset with MS-COCO on a multiple-choice question answering task, then train and evaluate joint-embedding and generation models.

    Discussion:

    1. Another option for generating hard examples could be using the CCA/nCCA models to find embeddings close to the answer.

    2. I'm a little unclear on why averaging words (Word2Vec) is used as a good representation.

  11. The paper presents the Visual Madlibs dataset, which consists of 10,738 images taken from MS COCO along with 360,001 descriptions that capture different aspects of each image. The authors generate these descriptions using a strategy called Visual Madlibs, where humans are asked to fill in a template that describes the given image, either freely or from a list of options. The authors evaluate the descriptions by comparing how well the Madlibs descriptions, the MS COCO descriptions, and descriptions automatically generated by a CNN+LSTM network can select the correct multiple-choice answer.
    Questions:
    The Madlibs and the VQA datasets seem to be performing complementary tasks. While VQA proposes questions that can be answered by looking at an image, Madlibs presents descriptions that could possibly answer these questions. Have any attempts been made to train a neural net on one of these datasets and evaluate it on the other?
    Could you please talk a little bit about the Word2Vec representation mentioned in the paper?

  12. This paper presents the process behind creating a new dataset, Visual Madlibs. The dataset consists of 10,738 human-centered images from MS COCO. For each image, a set of 12 template questions is answered by Mechanical Turk workers. Two tasks are proposed: targeted generation and multiple-choice answering. Two sets of multiple-choice answers are provided, a hard set (distractors sampled from similar images) and an easy set (distractors sampled from random images); a sketch of one way to sample such distractors follows this comment. The descriptions in the Madlibs dataset are compared to the general image descriptions provided in MS COCO. Baseline results for many approaches are given for the fill-in-the-blank task.

    Questions:

    1. Why did the authors choose to use only human-centered images?

    Discussion:

    I wonder if there has been any work put towards generating unique madlib questions on a per image basis. It would be more interesting than answering similar questions for every image.
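
    A hypothetical sketch of how the easy vs. hard multiple-choice distractors described above could be sampled (my own illustration, not the authors' code; `feats` and `answers` are assumed to be precomputed):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def sample_distractors(idx, feats, answers, k=3, hard=True):
        """Return k distractor answers for image `idx`.

        feats   : (N, D) L2-normalized image features
        answers : list of N ground-truth answer strings for the same prompt type
        """
        if hard:
            sims = feats @ feats[idx]      # cosine similarity to the query image
            sims[idx] = -np.inf            # exclude the image itself
            pool = np.argsort(-sims)[:50]  # distractors come from nearest-neighbor images
        else:
            pool = np.setdiff1d(np.arange(len(answers)), [idx])  # any other image
        picks = rng.choice(pool, size=k, replace=False)
        return [answers[i] for i in picks]
    ```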

  13. This comment has been removed by the author.

  14. The paper proposes a new dataset of focused, targeted descriptions. Automatically produced fill-in-the-blank templates are designed to collect a range of different descriptions of the visual content in an image. The paper also introduces a multiple-choice question answering task for images. Upon comparing the descriptions in the Visual Madlibs dataset to the general image descriptions in the MS COCO dataset, it was observed that the proposed dataset has more coverage than MS COCO. Finally, the paper presents the evaluation results of several algorithms on the targeted natural language generation and multiple-choice question answering tasks.
    Question:
    I am unclear about why CNN+LSTM trained on Madlibs is not as accurate as nCCA for selecting correct multiple-choice answers. Could you please go over CCA, nCCA, and CNN+LSTM?

  15. This paper presents a rich dataset with fill-in-the-blank style descriptions for each image. The purpose of this dataset is to learn more natural and detailed descriptions of an image than the typical generic captions, such as those in MSCOCO. The descriptions were generated using a madlibs (i.e., fill-in-the-blank) approach that specifically targets object interactions, feelings about a scene, and the scene context around the moment the image was taken. This is a novel approach because it targets a more narrative style of description, and the descriptions are necessarily more detailed and interesting.

    1. I saw a recent paper about generating explanations for results (http://arxiv.org/pdf/1602.04938v1.pdf). I am really curious to see how it would perform on this dataset. Perhaps the explanations could provide more insight into some of the wrongly generated results.
    2. It would also be interesting for the dataset to start including more "why" questions. For example: this picture makes me feel weird, because ________. This would incorporate more reasoning, which would be interesting.

  16. They introduce a new dataset with 360k descriptions for 10,738 images from MS COCO. Generating madlibs-esque questions for MS COCO is easier because most objects are labeled, so an automated system can propose queries like “the frisbee is…” for a human to complete. They chose images from the subset of images that have captions and try to align the captions with the labels in the image; if there isn't a 1:1 alignment, they generate queries for each frisbee, and so on. Upon inspection of the dataset, 22.45% of the Madlibs words were present in the MS COCO descriptions, while 52.38% of the MS COCO words were present in Madlibs, so Madlibs has the larger vocabulary. They run some generative models to provide baselines on their dataset.

    Discussion: Why evaluate using multiple-choice questions? I understand that it might be easier, but what is the reasoning behind multiple-choice questions vs. generated answers?

  17. This paper introduces a new dataset influenced by the madlibs style. The authors collect 360k focused natural language descriptions for 10k images. Descriptions are collected in a novel way using automatically generated fill-in-the-blank templates and multiple-choice questions, with question instantiation based on the type and number of objects in the image. They use the MS-COCO dataset for their image data, only human-centric images to be precise. Evaluation is carried out with CCA, nCCA, and CNN+LSTM, and a VGG CNN is used for the deep image descriptors.

    Questions-

    1. Why do all words in a sentence need to be averaged? Is it to weight them equally? Is some other method used for this purpose?

    2. Is this method still constrained by the type of data available for training, as we discussed in the case of 'Exploring Nearest Neighbor Approaches for Image Captioning'? Since they find the most similar embedding point in the latent space, this should be limited by the type of images and questions they have.

  18. Summary
    This paper presents a new dataset gathered using a "Visual Madlibs" strategy. The primary goal of this strategy is to collect targeted natural language descriptions for images using fill-in-the-blank templates. In doing so, a dataset of 360,001 descriptions was collected for 10,738 human-centric images sourced from MS COCO. The authors then compare the coverage of their descriptions favorably with MS COCO.

    Questions/Clarifications:
    1. Why 360,001? Why not stop at 360,000?
    2. Could their experimental process be covered in more detail?

  19. This paper presents an image captioning dataset created by using a mad libs approach of filling in a blank on images from MS COCO. This allows the authors to collect more descriptive terms about specific objects or things happening in the images. The authors analyzed this new Visual Madlibs dataset in comparison to MS COCO on tasks including multiple-choice question-answering and caption generation.

    Have representations other than Word2Vec been used to analyze the dataset?

  20. This paper presents a new dataset, Visual Madlibs, for the task of visual fill in the blank. The aim is to move away from generic image descriptions toward more complex relationships between the objects in an image. The authors do this by collecting two kinds of answers: hard multiple-choice selections and open fill-in-the-blank responses. They gather around 360,000 answers over roughly 10,000 images drawn from the MS COCO dataset.

    They show how Visual Madlibs is able to capture meaning through a comparison with human-generated descriptions. The paper proposes baseline methods using CCA and nCCA embeddings to select answers, combining CNN image embeddings with Word2Vec text embeddings. They report accuracy in comparison to the human-generated answers.

    Question:

    Could you talk about CCA and nCCA, and how they are trained?

    What is Fig. 3 trying to explain? I could not get a clear picture of what the top row means.

  21. This paper introduces a dataset called "Visual Madlibs", which is a collection of images and a set of descriptions associated with those images, regarding each image's scene, emotion, past, future, etc. The authors used AMT to collect these descriptions. For evaluation, they compared these descriptions to those of MS COCO and also evaluated a few CNN-based models.

    Why did the authors take the mean of the Word2Vec representations of the words? How is that different from, say, a concatenation of those representations?

    Could you talk about joint embedding methods - CCA and nCCA?

  22. This is a human-created dataset, built to give researchers richer data. The idea is to create a set of captions for each image based on a fill-in-the-blank system where the worker fills in the blanks. The authors are attempting to make a more interesting dataset: it has a bigger vocabulary than MSCOCO and can, in theory, support more diverse models.
