This topic is delayed until Monday because of inclement weather.
Paper assignments are up, along with a tentative early schedule.
Learning to predict where humans look. T. Judd, K. Ehinger, F. Durand, and A. Torralba. IEEE International Conference on Computer Vision (ICCV), 2009.
Project Page
The authors of this paper collected a database of eye-tracking data from 15 different observers on 1003 images from Flickr and LabelMe, to serve as a data source for training models geared toward semantic-based saliency. They accomplished this by tracking the viewers' fixation locations, blurring the resulting map of all viewers' fixations, and thresholding the result to produce a mask of the salient regions in each image. They also propose various feature-set configurations for building a model from this dataset.
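To make the ground-truth step concrete, here is a rough sketch (my own, not the authors' code) of turning pooled fixation points into a binary saliency mask by blurring and thresholding; the blur width and the top-percent cutoff are illustrative guesses, not the paper's exact values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixations_to_mask(fixations, shape=(200, 200), sigma=10, top_percent=20):
    """Pool (row, col) fixations from all viewers into a binary saliency mask."""
    fixation_map = np.zeros(shape, dtype=float)
    for r, c in fixations:                                 # accumulate every viewer's fixations
        fixation_map[int(r), int(c)] += 1.0
    blurred = gaussian_filter(fixation_map, sigma=sigma)   # smooth into a continuous map
    thresh = np.percentile(blurred, 100 - top_percent)     # keep the most-fixated top N% of pixels
    return blurred >= thresh
```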
Questions/Discussion:
Given the bias toward the center of the image from the center prior, would these data be sufficient to train a model for use on video, where the most important parts of the frame might not necessarily remain near the center, even if they start there?
Summary
The goal of the research was to predict where humans would look in a particular image, using eye-tracking data from human subjects. The authors use the eye-tracking data to create a model of saliency for images by training an SVM on image features, with data from 15 viewers. The features come from the images themselves: low-level features such as color and intensity, mid-level features such as horizon height, and high-level features such as people's faces. The SVM is then tested against humans as well as against other feature sets.
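As a rough illustration of that learning setup (a sketch under my own assumptions, not the authors' code), each sampled pixel becomes a feature vector and is labeled positive if it falls inside the thresholded fixation mask; a linear SVM is then fit to those pairs. The LinearSVC settings here are placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_saliency_svm(X, y):
    """X: one row per sampled pixel (its low/mid/high-level feature responses).
    y: 1 if that pixel lies inside the fixation mask, 0 otherwise."""
    clf = LinearSVC(C=1.0)          # linear SVM, loosely following the paper's setup
    clf.fit(X, y)
    return clf

def predict_saliency_map(clf, pixel_features, shape):
    # the continuous decision value, reshaped per pixel, serves as the saliency map
    return clf.decision_function(pixel_features).reshape(shape)
```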
Questions
- Shouldn't the claim that people are likely to look at other people be borne out by the research rather than built in as a high-level feature? For example, if only low- and mid-level features were used, would the same result hold?
- What about images where the people are in the background, looking around a particular thing?
- When every image is converted to 200x200 pixels, does the eye-tracking information get scaled as well? (See the sketch below.)
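To illustrate what that scaling would look like (my own example, not anything from the paper), the fixation coordinates would simply be rescaled by the same ratios as the image dimensions:

```python
def rescale_fixation(x, y, orig_w, orig_h, new_w=200, new_h=200):
    """Map a fixation recorded on the original image onto the resized image."""
    return x * new_w / orig_w, y * new_h / orig_h

# e.g. a fixation at (512, 384) on a 1024x768 image lands at (100.0, 100.0)
print(rescale_fixation(512, 384, 1024, 768))
```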
Attention-based image classification models are showing a lot of promise lately, and in general, having good models of how humans look at images can give hints about how our internal scene-understanding pipeline works. This paper contributes two things in aid of that:
1) A dataset tracking gaze as a person is presented with an image
2) An SVM that estimates human attention to an image from features at different scales (with relatively high accuracy, as tested by comparing the results to human-generated data)
There was a noticeable attention bias towards large, bright regions and the center of an image. It is unclear whether these are artefacts of the experimental methodology, brought on by the scale of the image, or a genuine property of how human visual attention works.
Questions:
- It seems to me that part of the bias towards the center of the images may stem from the data-collection methodology (each subject's chin rested on a block for ease of eye tracking, making the normal head movements associated with looking at different parts of a screen less likely). Have there been similar gaze-tracking studies done with free-standing subjects?
- Has any investigation been done into how image scale and distance affect the pattern of human attention?
This paper attempts to solve the problem of predicting interesting and salient regions in images. The results produced two major contributions: a labeled image dataset where labels were generated using human experiments and a trained model which predicts salient regions within an image. The analysis of the dataset revealed that users tended to favor the center of the image, faces, and text.
Research Questions:
1) To combat the central-focus bias, it would have been interesting to include more images where the prominent subject is not centered, or to manipulate photographs so that the foreground is moved to a corner.
2) The hand-crafted features seem problematic since they don't directly come from the data – it would be interesting to compare the results to a data-driven approach.
The paper presents a method for predicting where a human will look when presented with an image. The researchers first collected eye-tracking data from 15 participants viewing around 1000 images gathered randomly from Flickr and LabelMe. The training data is publicly available for future research. Previous research uses biologically plausible filters to approximate saliency; this paper instead computes a combination of low-, mid-, and high-level features plus a center prior to train an SVM. The SVM trained on all features performs better than any individual feature.
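As a side note on the center prior: one simple way to realize it (an assumption on my part; the paper describes a distance-to-center feature but not this exact normalization) is a channel that scores each pixel by how close it is to the image center.

```python
import numpy as np

def center_prior(shape=(200, 200)):
    """Feature channel that is highest at the image center and falls off with distance."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    dist = np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2)
    return 1.0 - dist / dist.max()   # normalized to [0, 1], center scores 1
```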
Discussion:
Could using a learned representation instead of handcrafted features produce better results?
Abstract:
The paper presents a methodology for the broad problem of understanding where and how people direct their attention. The authors first describe the collection of a large dataset of 1003 images viewed by 15 different people while their eyes were tracked. They also train a linear SVM to predict saliency maps from low-level and high-level features, and they show that it achieves better performance than other models.
Discussion:
1) The paper doesn't give the performance of the model trained on all features minus the high-level features; that would have better emphasized the importance of the high-level features.
2) It seems to me that where people look in a photograph would change as time goes on. At the start, people would probably notice the semantically important things, but after that they would move their gaze to other things. Although 3 seconds might be a good duration for the applications mentioned in the paper, it would be interesting to have a dataset that captures people's gaze every second for a minute or so. The temporal relation would be very interesting.
3) Although the paper mentions the ROI as future work, I would guess that the radius of the ROI has a direct relationship with whether the image is of an 'object' or of 'stuff'.
This paper dives into how humans look at an image: which parts of an image, and the presence of which objects, lead the human visual system to look where it does within the scene. It tries to eliminate the need for eye-tracking hardware by taking a data-driven approach, using eye-tracking data to train a linear SVM. It describes how a saliency map built from low-, mid-, and high-level features improves performance compared to previous work in this area, and it gives a detailed description of creating the training database from Flickr and LabelMe images. Faces, text, and the central areas where the object of interest tends to be are identified as some of the primary regions noticed by humans.
Questions-
1. Does reducing the resolution somehow affect the eye tracking results?
2. Could a finer representation of the thresholded saliency map produce better results?
Discussion-
You could use deep learning instead of hand crafted features for this purpose. Has this been done before? Is a larger database available?
This paper tries to predict where humans will look. The authors use an SVM along with several hand-crafted features (spanning low-, mid-, and high-level features). Also, one of the biggest contributions of this paper is the dataset of eye movements the authors released.
Discussion/Q:
1. How has Deep Learning improved upon this (if any)?
2. What were the images like? Were they more ImageNet-like or more MS COCO-like?
3. Does this have applications for things like VR? Knowing where a human's eyes might go could be useful (maybe?).
In this paper, the authors introduce a new database of eye-tracking data for a set of 1000 images. They label this data using a variety of features and create a saliency model that can be used to predict which parts of an image stand out to a human observer.
Question:
In their future work section, the authors propose determining how the framing / cropping of an image affects what people look at. Has there been any work culminating in an image cropper that lets the user pick which parts of an image they want an audience to focus on and then modifies the image to maximize saliency in those regions?
I'm also curious about the age of the subjects. Do people of different ages focus on different things? For example, do babies focus on faces?
In examining human fixations and saliency in images, the authors propose a learning-based approach to compute a model of saliency. The main contributions are 1. a database of eye-tracking data for 1003 natural images, which forms the ground truth for saliency detection, and 2. a model constructed with both bottom-up and top-down approaches. The creation of the human eye-tracking dataset is given in good detail and is straightforward. The more interesting part is the design of the saliency model and the choice of features. The authors use low-level (steerable pyramid features, Torralba features, intensity, color, orientation, etc.), mid-level (the location of the horizon calculated using gist descriptors), and high-level (features from object detectors for cars, faces, people, etc.) features. This combined model outperforms models that use only a subset of the features in various tests. An example of using their model of saliency in a graphics application is provided.
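For intuition about the multi-scale low-level channels (a simplified stand-in of my own; the paper actually uses steerable pyramid subbands and Itti/Torralba-style features), a few levels of an image pyramid can be upsampled back to full resolution so every pixel gets a small feature vector:

```python
import cv2
import numpy as np

def multiscale_intensity_features(img_bgr, levels=3):
    """Intensity at several pyramid scales, aligned per pixel (a crude low-level feature stack)."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    h, w = gray.shape
    channels, level = [], gray
    for _ in range(levels):
        channels.append(cv2.resize(level, (w, h)))   # upsample so all channels align per pixel
        level = cv2.pyrDown(level)                    # move to the next, coarser scale
    return np.stack(channels, axis=-1)                # H x W x levels
```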
Questions:
1. Did the authors consider using completely random scene images where there was no particular focus on any item in order to test the degree of the effect of the center bias?
2. There is no dimensionality reduction being done. Would dimensionality reduction help and how much, considering we are looking at a high dimension of features (especially if we consider more mid and high level features as object detectors get better)?
3. The human viewers were all allowed free gazing. What would be the effect if they were given some initial direction on where to look and what to look for? Suppose, as a third set, they were instructed to look for birds in an image; would the fixations still have been consistent for faces and people? How would major distractions in an image (such as something unusual like a man in a gorilla suit) affect the fixations and gaze paths? There are certain images on the internet that challenge human perception by pointing out unusual objects that humans usually miss finding on their own.
Summary:
This paper predicts human attention on images with a supervised learning model of saliency, trained on a large database of eye-tracking results with labels and analysis. Unlike previous methods that model saliency purely from mathematical derivation, this paper's model combines bottom-up features at three levels, which describe the images more comprehensively, with top-down semantic understanding of the objects. State-of-the-art results are achieved, and performance is evaluated with the ROC curve at different percent-salient levels, i.e., by thresholding the continuous saliency map computed from the feature vectors.
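A minimal sketch of that evaluation idea (my own reading of "performance at different percent salient", not necessarily the authors' exact protocol): threshold the continuous map to keep its top p% of pixels and measure what fraction of held-out human fixations fall inside.

```python
import numpy as np

def fixation_hit_rate(saliency_map, fixation_points, percent_salient):
    """Fraction of human fixations landing inside the top-p% most salient pixels."""
    thresh = np.percentile(saliency_map, 100 - percent_salient)
    mask = saliency_map >= thresh
    hits = sum(mask[int(r), int(c)] for r, c in fixation_points)
    return hits / float(len(fixation_points))

# sweeping percent_salient (e.g. 1, 3, 5, 10, 20, 30) traces out the ROC-style curve
```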
Discussion:
This paper learns a model of saliency that combines a bottom-up approach based on features with top-down semantic understanding of some specific objects. But this top-down understanding of the image or objects still involves detectors that are learned separately before being applied. How did the authors handle the noise introduced by this top-down component, and how, in more detail, are the top-down and bottom-up approaches combined?
Summary:
This paper proposes a novel way of predicting where humans look in a scene by using a large number of high-, mid-, and low-level features, such as pyramid subbands, color channels, a horizon-line detector, people's faces, animal faces, and objects near the center of the image, to train a saliency model on ground-truth eye-tracking data collected from 15 viewers and 1003 images. It also shows how this method performs in an application of non-photorealistic rendering of a photograph, using the trained saliency model's predictions of human fixations on the image.
Questions:
1) What is the rationale behind resizing the training images to 200x200? Won't this type of scaling be lossy in terms of the training features?
2) The paper considers random images from Flickr and LabelMe. If these have to be extended also to other specific kinds of images like art paintings, advertisements, etc. what extra features might be helpful in training the system better? How feasible is it to use deep learning techniques to come up with these features (given training data with each instance containing an image and its corresponding ground truth eye-tracking data)?
The paper discusses the new eye-tracking dataset the researchers collected on various images. Test subjects are asked to observe different images from Flickr and LabelMe while a camera tracks their eyes. To measure the usefulness of the collected data, a linear SVM is trained on various features from the images. Low-level features include intensity, color contrast, local energy, etc. Since humans also tend to look at horizons, horizon-detection results are included in the features. Since humans tend to look for faces in an image, the Viola-Jones algorithm is also run on the images for face detection, and the results are added as high-level features for training.
As per the results, the performance of the trained model was similar to the data from eye tracking.
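Since the Viola-Jones step came up, here is a rough sketch (my own assumption about how one might wire it in, not the paper's code) of turning OpenCV's cascade detections into a high-level feature channel:

```python
import cv2
import numpy as np

def face_feature_channel(img_bgr):
    """Binary channel marking Viola-Jones face detections, usable as a high-level feature."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    channel = np.zeros(gray.shape, dtype=np.float32)
    for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
        channel[y:y + h, x:x + w] = 1.0    # mark each detected face region
    return channel
```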
Discussion:
1) In order to remove the bias towards the center:
a. Iconic images with objects not at the center can be added to the set.
b. Because the setup itself is centered in front of the subject's eyes, regardless of the image, every subject would initially look at the center. One solution would be to not place the image at the center, but this might also add constraints on the tracking camera. Another solution would be to start tracking only after a short delay, once the subjects are actually looking at interesting things in the image.
2) Did the researchers label the images used for testing? If yes, then all images labelled as containing human faces could be treated as a special case: the eye-tracking data obtained from these images could serve as an additional feature set for Viola-Jones to improve face detection.
The paper aims to provide a definitive direction for the saliency mapping problem in vision by utilizing human gaze. The authors provide a dataset and show empirically that high level semantic information is important for solving this problem.
The contributions of the paper are:
1. They provide a dataset of eye-gaze tracking from 15 humans over 1000 images. The dataset contains the points where the humans look during the first 3 seconds of viewing an image. This dataset makes it possible to infer what humans consider salient.
2. They contribute a saliency prediction method utilizing high-, mid-, and low-level features which outperforms the state-of-the-art predictors. They show that the combination of features, especially the high- and mid-level representations, is very important for higher accuracy. This is backed by intuition from human understanding.
The paper shows impressive results, achieving 75% accuracy at the 20%-salient threshold, where humans achieve 85%. This result emphasizes the importance of high-level features for saliency prediction.
Discussion :
1. Most of the images are iconic and have a single clear salient region. It would be interesting to see results on non-iconic datasets like MS COCO.
2. They also compare their results at the setting where 20% of the image is counted as salient, which I feel is an easy ask. Comparing at 10% salient regions shows how poorly the algorithm performs in comparison to humans.
3. I wonder whether the problem could be inverted, using these saliency maps to predict objects in the scene; this could allow us to localize objects without going through a process like object proposals.
This paper attempts to develop a method to predict salient locations in images. The authors build a large database from eye-tracking experiments. They also train an SVM model of saliency which combines both bottom-up, image-based saliency cues and top-down, semantically dependent cues. They evaluate the performance of the trained model using the ROC curve; its predictions are closer to human fixations than those of other models.
Questions:
The image viewers in the experiments had only 3 seconds to look at each image, and the eye-tracking data doesn't consider viewing time. What if we take time into consideration and extend the viewing time? Would it improve the model's performance? (I would guess so.)
The paper presents a novel method to predict where humans focus when looking at an image. The contributions of the paper are two-fold:
It presents a database of 1003 images labeled with the results of an eye-tracking experiment run on 15 viewers. This database serves as the ground truth for the next part.
It presents a supervised learning approach to predicting where humans look, taking into account low-level features (steerable pyramid filters), mid-level features (gist features), and high-level features (face and person detectors).
Discussion:
1. The paper argues that the human gaze is focused around the center of the image because the object of interest is often located there. It would be interesting to see the results on a database of images that do not have their object of interest located at the center. Would the human gaze still be focused on the center? Would the center-prior feature be helpful in predicting gaze?
2. To what extent can deep learning methods be used to learn features that are effective in predicting gaze as opposed to hand designing these features?
The paper proposes a model of saliency based on low-, mid-, and high-level image features to help predict where humans look in a picture. The proposed model is a supervised, learning-based model of saliency that combines both bottom-up, image-based saliency cues and top-down, semantic image cues. The low-level features include intensity, orientation, and color contrast; the mid-level features include a horizon-line detector; and a face detector is used for the high-level features.
Questions:
1. In the paper, the users sat at a distance of 2ft from the computer in a dark room. Would the focus points on the image differ if ambient lighting condition was different or if the users sat closer/farther away from the screen? Are there any other external parameters other than lighting, distance and size of screen that affect what the user looks at?
2. What kind of images were used for training? The paper states that 903 images were used to train the model. However, isn't that too small a database for training a model of saliency for a free viewing context?
3. What is the impact of color distribution on fixations? Any particular color that humans tend to fixate on more than the others?
~ Anusha
Learning to Predict where Humans Look
Summary:
This paper presents a new approach for highlighting the salient regions in static images. A combined top-down, bottom-up model is trained using a combination of high-level, mid-level, and low-level features to predict where humans fixate. The paper introduces a new eye-tracking database covering 15 people over 1003 natural images. It shows that general semantic understanding of the images is very important for predicting where humans look, and that the combination of these features with the low-level features outperforms the previous methods.
Research Questions:
• Given that larger image databases are now available, can we apply deep learning methods to learn the features instead?
Summary:
This paper presents a method to determine where humans look in a scene. The researchers have collected a database of 1003 images and eye-tracking data from 15 viewers viewing these images. They learn a saliency model based on low-, mid-, and high-level features. They present results indicating that humans tend to look prominently at other humans and at written text, and in their absence look at the center of the image.
Discussion -
1) Can the features be learned by deep learning instead of manually finding them?
2) What are the properties of the images? How do the results vary for images containing highly detailed scenes?
This paper presents a method of predicting where humans look in a scene using high, medium, and low level features. This model combines bottom-up saliency cues and top-down semantic cues to outperform previous methods. Also, the authors provide a database of eye tracking data which is used to form ground truth for the saliency detection.
Referring to the conclusions, was there any follow-up on how the same image at different sizes affects the results? Also, it would be interesting to see if there are any patterns in the types of objects that were still being skipped over even with an extended viewing time.
In this paper, the authors collect a database of 1003 natural images and run a study with 15 participants to collect data about their eye movements as they browse through these images. They create a saliency model to estimate where people look by using the data to train SVM models on bottom-up, image-based saliency cues and top-down, semantic image cues. They then present their analysis, in which they conclude that humans focus on image centers and look for faces and text in images.
Discussion:
1. Why was it necessary to stabilize the participants' heads by giving them a chin rest? Did this affect the results and introduce a bias towards people mainly keeping their gaze centered?
2. I think that if the image dataset were not so biased towards having the most interesting thing in the center of the image, the results on where exactly in the image people look most might have been different.
The paper attempts to predict where humans tend to look in an image. A database consisting of 1003 images is collected, labeled with eye-tracking data obtained from an experiment set up with 15 users. An SVM model of saliency that combines both bottom-up, image-based saliency cues and top-down, semantically dependent cues is trained. The features used are divided into low-level (intensity, orientation, color contrast), mid-level (a horizon detector using gist features), and high-level features (face and person detectors).
Discussion:
1) Would adding an additional texture-based search help with landscape images that don't have people?
2) Since humans tend to look at animals, how about training a detector for that?