Exploring Nearest Neighbor Approaches for Image Captioning. Jacob Devlin, Saurabh Gupta, Ross Girshick, Margaret Mitchell, C Lawrence Zitnick. arXiv, 2015.
Naman is also going to discuss a non-baseline version of image captioning: Deep Visual-Semantic Alignments for Generating Image Descriptions.
This paper presents a strategy to caption images by finding an existing caption that is most relevant to the image, instead of generating a novel caption, as is the goal of numerous other papers on the topic. This is accomplished by clustering the image with other "similar" images that have pregenerated captions, and using the pre-existing caption from the image that is closest to the target image in cluster space. The clustering is accomplished by comparing different kinds of features, including GIST descriptors and deep features from a pretrained CNN (VGG16). The authors found that this method worked well "on paper" when captioning MSCOCO but that human validators preferred novel captions to recycled ones.
Questions:
1) Can we discuss the mechanisms for caption evaluation? (BLEU, CIDEr, METEOR)
2) Given the potential fallibility of these caption evaluation metrics, how can the authors be confident that their method outperforms generative methods? A recycled, human-written caption has the benefit of being grammatically correct even if it is not an accurate description of the image, whereas a generated caption might have a slight flaw in construction and therefore be scored lower by an automated metric even though its intended message is more relevant to the image.
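For the question about what these metrics actually measure, here is a minimal Python sketch of clipped n-gram precision, the quantity at the core of BLEU. It is a simplification and not the full metric: real BLEU combines several n-gram orders with a geometric mean, applies a brevity penalty, and supports multiple references, while CIDEr additionally weights n-grams by TF-IDF. The function names and the example captions are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it appears
    in the reference (the "clipping" used by BLEU).
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total
```

Against the reference "a man riding a horse", the fluent but wrong caption "a man riding a bike" gets a bigram precision of 0.75, while the scrambled but on-topic "man a riding horse" gets 0.0 despite a unigram precision of 1.0. That is the kind of fluency bias the second question is worried about.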
summary:
This paper uses KNN over GIST and CNN FC-layer features to generate image captions. The BLEU and CIDEr similarity metrics are used to find the caption with the highest average lexical similarity to the other captions in the pool selected by KNN. Based on the human evaluation, KNN achieved results comparable to the state of the art.
question:
In 3.1, the authors mention using GIST, fc7, and fc7-fine as nearest neighbor features. In recent related work, has anyone tried combining CNN features from different levels for captioning, so that low-level features are taken into account?
The authors provide a baseline, exploratory analysis of nearest neighbor approaches for the task of image captioning on the MS COCO dataset. They use 3 types of features: the handcrafted GIST descriptor, deep features from VGG-16 FC7, and deep features from VGG-16 FC7 fine-tuned to classify the 1,000 most common words in the captions. The novelty of their approach is the "consensus caption", which is the caption that best describes most of the k nearest neighbor images. To quantify "best", they use similarity metrics such as CIDEr, BLEU and METEOR. Quantitative results show that NN approaches using deep features perform as well as state-of-the-art generative approaches. However, human evaluation via crowdsourcing gives contrasting results, highlighting the need for better automated evaluation metrics.
Discussion:
1. The authors bin the test images based on their mean distance to their 50 nearest neighbors in order to capture the degrees of visual similarity. Will doing this when performing human evaluation provide better insight? E.g. if a particular image stands out as very diverse from the training set, a human would still be easily able to accurately caption the image, given enough knowledge of the world, but the NN approach would fail simply because none of the NNs of that query image have suitable captions it can borrow from.
This is alluded to in the discussion section but not directly touched upon.
2. I am assuming that when finding the optimal K and M, the authors do NN using all 3 feature spaces and evaluate the best K and M across the feature spaces (though they don't elaborate on their approach, which is somewhat confusing). Would different K and M values perform differently for different feature spaces? If the evaluation of optimal K and M could be explained, that would really help in answering this question.
The paper discusses generating image captions for the MS COCO dataset. The authors argue that reusing captions of similar images gives almost state-of-the-art results. They suggest three different feature vectors for each image (GIST, and the fc7 layer of a deep net before and after fine-tuning) and use these feature vectors to find images similar to the query image. Captions are then generated from the most similar images and evaluated using BLEU and CIDEr. The authors also show that features generated by the deep net retrieve better similar images.
How would hybrid systems that combine nearest neighbors and generative models work? Is there some hard decision boundary for selecting one of them?
The paper provides an approach to captioning based on nearest-neighbors analysis. To caption an image, a caption is selected that best describes the nearest neighbors of the new image. The results across the board are better than LRCN based approaches but remain worse than ME+DMSM.
Discussion:
1) What would the results look like if the average for the consensus captions were weighted by their distance from the new image?
2) It seems like the volume of data used, along with the relatively small n chosen for the comparison, makes the similarity function the most important component of the system. What other image embeddings or similarity functions would be worth trying?
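As a way to make the first question concrete, here is a hypothetical Python sketch of a distance-weighted consensus. This variant is not something the paper evaluates. Each candidate caption carries a weight, assumed here to be the similarity of its source image to the query image, and a candidate's score is the weighted mean of its similarity to the other candidates. `token_overlap` (a Jaccard score over words) is an assumed stand-in for BLEU or CIDEr, and all names and example captions are invented.

```python
def token_overlap(a, b):
    """Jaccard overlap of word sets; a stand-in for BLEU/CIDEr similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def weighted_consensus(candidates, weights, sim=token_overlap):
    """Return the caption with the highest weighted mean similarity to the rest.

    weights[j] is the (hypothetical) weight of candidate j, for example the
    image similarity between its source image and the query image.
    """
    best, best_score = None, float("-inf")
    for i, caption in enumerate(candidates):
        numerator = 0.0
        denominator = 0.0
        for j, other in enumerate(candidates):
            if j == i:
                continue  # a caption does not vote for itself
            numerator += weights[j] * sim(caption, other)
            denominator += weights[j]
        score = numerator / denominator if denominator > 0 else 0.0
        if score > best_score:
            best, best_score = caption, score
    return best


captions = [
    "a dog on grass",
    "a dog in grass",
    "a cat on a sofa",
    "a cat on the sofa",
    "a cat on a couch",
]
uniform = weighted_consensus(captions, [1.0] * 5)
weighted = weighted_consensus(captions, [0.9, 0.9, 0.2, 0.2, 0.2])
```

In this toy example the unweighted consensus picks "a cat on a sofa" because three of the five captions mention a cat, while weighting by image similarity shifts the choice to "a dog on grass", the caption backed by the two closest neighbors.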
This paper explores KNN for captioning. They use features like GIST and outputs of VGG net to find nearest neighbor images in COCO. The subset of COCO images they are using have 5 captions each, so they need to pick the best one from one of the best images. They do this by finding which caption is most similar to the captions of the other images. This gives them a nice generic-specific trade-off: if the only thing the nearest neighbors have in common is a human, the caption might simply read “a human”; if the other images all show a human at a grill, the output is more likely to be “a human grilling.” They find that a K value of more than 20 is sufficient. They use humans to score their results, as well as other non-human techniques…
Can we talk about CIDEr and BLEU?
They find the k nearest images, then use the captions from those images to select the one that scores highest with respect to the other candidate captions.
Q/Discussion
1. Would like to discuss the scoring mechanisms (BLEU, etc.).
2. Why do human annotations get lower scores? Is this a problem with the scores themselves?
This paper discusses k-NN based image captioning. The authors use MSCOCO as the training data, and the features are GIST and the outputs of the fc7 and fc7-fine layers of VGGNet (trained on ImageNet). Cosine similarity is used to measure the distance between image features. After determining the k-NN images, all their captions are combined and the caption that has the maximum similarity to the other captions is chosen.
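The retrieval step described in this comment can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the authors' code: the feature vectors are assumed to be precomputed (GIST or fc7 in the paper), the function names are invented, and a brute-force scan stands in for whatever search the authors actually used.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def k_nearest(query, train_features, k):
    """Indices of the k training images most similar to the query, best first."""
    ranked = sorted(
        range(len(train_features)),
        key=lambda i: cosine_similarity(query, train_features[i]),
        reverse=True,
    )
    return ranked[:k]


def pool_captions(neighbor_ids, captions_by_image):
    """Combine the captions of all retrieved neighbors into one candidate list."""
    return [caption for i in neighbor_ids for caption in captions_by_image[i]]
```

The pooled list is what the consensus step then ranks. With 5 captions per image, k neighbors give 5k candidates.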
This paper presents an approach for generating image captions using captions from similar images in the dataset. The similarity between images is based on K nearest neighbors. Given an input image, the method retrieves the K nearest neighbor images, and the associated captions of these images are ranked to select the final output caption.
Experiments in generating captions for images from MS COCO show that this simple method outperformed some advanced methods that generate more novel captions. But human studies show that novel captions are preferable.
When choosing the number of candidate captions used to select the consensus (m), how is it ensured that the outliers are not part of the selected subset?
The paper proposes a novel method for image caption generation. The authors find the k nearest neighbours of the query image and create a set containing all the captions of these images. They then pick the caption which is most similar to the other captions in the set. The similarity of the captions is evaluated using BLEU and CIDEr. The authors experiment with different feature representations of the images and conclude that deep features fine tuned for image caption generation work the best.
Questions:
1. Why does GIST not yield impressive results for this task?
2. I did not clearly understand how M (the smaller subset of captions) was determined.
3. Could you talk a little bit about the generation-based approach that is mentioned in the paper?
This paper presents an approach to image captioning using nearest neighbors. They use MS COCO as the training dataset, and find captions for a query image from its k nearest neighbors in the dataset. The neighbors are found using GIST descriptors, features from FC7 of the VGG16 net, and features from FC7 after fine-tuning the network for the image captioning task.
I did not understand exactly how the consensus captioning is done: how does taking a subset of captions from a set that also contains outliers yield the caption that is most suitable for a given image?
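On how the subset M deals with outliers: one reading of the consensus step, sketched below in Python, is that each candidate caption is scored only against the m other candidates most similar to it, so captions from poorly matched neighbors never enter a good candidate's subset. This is an interpretation rather than the authors' code. `token_overlap` (a Jaccard score over words) is an assumed stand-in for the BLEU or CIDEr similarity, and excluding a caption from its own subset is also an assumption.

```python
def token_overlap(a, b):
    """Jaccard overlap of word sets; a stand-in for BLEU/CIDEr similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def consensus_caption(candidates, m, sim=token_overlap):
    """Pick the candidate with the highest mean similarity to its m closest peers.

    Only the m most similar other captions count toward a candidate's score,
    so outlier captions are ignored without any explicit filtering step.
    """
    best, best_score = None, -1.0
    for i, caption in enumerate(candidates):
        sims = sorted(
            (sim(caption, other) for j, other in enumerate(candidates) if j != i),
            reverse=True,
        )
        top = sims[:m]
        score = sum(top) / len(top) if top else 0.0
        if score > best_score:
            best, best_score = caption, score
    return best


pool = [
    "a man riding a horse",
    "a man on a horse",
    "a person riding a horse",
    "a plate of food on a table",
]
chosen = consensus_caption(pool, m=2)
```

Here the food caption is an outlier. It never appears among the two closest peers of the horse captions, so it has no influence on which of them wins, and "a man riding a horse" is selected.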
The paper discusses KNN as an approach for providing images with captions using MSCOCO. First, K nearest neighbors are found from the dataset. GIST, deep features, and deep features fine-tuned for image captioning are all tested as image representations. Next, a consensus caption is selected by finding the caption that has the highest lexical similarity to the K neighbors' captions. Caption similarity is measured using CIDEr and BLEU, and deep features + KNN performs well. However, in tests of human caption preference, the KNN-provided captions do not perform well.
Question:
Has there been work on creating new evaluation methods that more closely represent human evaluation?
This paper presents an approach that uses a simple machine learning process, KNN, to caption images from the MSCOCO dataset. A high-level overview of the approach is as follows: the images are mapped into the GIST, FC7, and FC7-fine feature spaces, then the top k images are selected alongside their captions, and finally a caption is generated from the consensus of the k images’ captions. The results are pretty good despite the captions being recycled, so to speak. However, humans seem to prefer novel captions over the ones generated by this method.
1. Is there a way to improve the captions after this result? For example, maybe a written “style” could be learned and then used on the recycled caption to generate a more pleasing result? Could generative models be used to achieve such a result?
2. Should better captions be recorded? For example it seems humans prefer more nuanced captions - would it be better to capture more poetic captions in the dataset?
This paper provides a novel yet simpler approach to image captioning using the classical Nearest Neighbors (NN) method. They find a set of k nearest neighbors using different feature spaces: GIST, pre-trained deep features, and deep features from a network specifically fine-tuned for the task. The initial step is to find the nearest neighbors in the training dataset and generate a consensus caption for the cluster. Here, they use the MS COCO dataset, which has 82,783 images, each with 5 captions. The cosine similarity metric is used to find the nearest neighbors.
Questions:
1. Can you explain more about the similarity functions explored in the paper? BLEU, CIDEr etc.
This paper presents a nearest neighbor approach to captioning images. KNN is used to cluster images from the MS COCO database using GIST features and deep features from VGG16. After a set of candidate captions is found by clustering, the best-scoring captions are chosen.
Question:
Why is the fc7 layer specifically chosen for the NN features?
In this paper, a nearest neighbor approach is presented to caption images. They show how using different metrics like CIDEr and BLEU affects NN, and that the type of features being used is important. They explore the choice of k for KNN and show that deep learned features outperform handcrafted features like GIST.
They do an extensive evaluation on MSCOCO and argue that the scale of the dataset is important for nearest neighbor approaches. They talk about the differences between objectivity and descriptivity using the two metrics and do an evaluation against humans.
Questions:
1. Could you explain more about the two metrics being used, CIDEr and BLEU?
2. In the human evaluation, ME+DMSM does better than the other 2 methods. Does this show that the other 2 methods are learning the dataset bias and are not easily generalizable?
The paper proposes a nearest neighbor approach for image captioning. For any image that has to be captioned, a set of K nearest neighbors is found, and the captions describing these images are combined into a set of captions from which the final caption is selected. The best caption is the one that scores the highest with respect to the other captions. The MSCOCO dataset is used for training, and features like GIST and outputs of VGG net are used to find the nearest neighbor images in the dataset.
Question:
1. Could you please go over the caption evaluation metrics (BLEU, CIDEr)?
This paper presents image caption generation. The authors use KNN on GIST and CNN fc7 features to find visually similar images. The union of the captions of these images is taken to create a set of potential captions for the test image. They use similarity metrics, BLEU and CIDEr, to find the caption with the highest average lexical similarity to the ones in the potential captions set.
Did the authors try using features from a layer other than fc7? Also, I'm not very familiar with the similarity metrics used by the authors.
This paper proposes using K-Nearest Neighbors to find image captions from the MS COCO dataset. They cluster the images and then, out of 5 captions, pick the best one, which is determined by choosing the caption from the closest image in the cluster. They make use of the GIST descriptor and deep features from the VGG-16 ConvNet to improve their clustering.
Discussion:
What is the significance of BLEU, METEOR and CIDEr scores?