Thursday, February 11, 2016

Fri, Feb 12 - What Makes Paris Look Like Paris?

What makes Paris look like Paris? Carl Doersch, Saurabh Singh, Abhinav Gupta, Josef Sivic, and Alexei A. Efros. SIGGRAPH 2012.

project page

18 comments:

  1. Summary:
    This paper addresses the problem of automatically finding the subtle visual elements that are representative of a certain geo-spatial area, with Paris as the initial subject. Sampling square image patches from Google Street View, the authors use a discriminative clustering approach aimed at finding patterns that both occur frequently within an area and are geographically discriminative: they first extract HOG+color descriptors and use KNN to find patterns whose matches are concentrated in the positive set, then, using discriminative learning, iteratively train a linear SVM to obtain a weight vector for each visual element. The technique has broad applications: visual elements can be mapped geographically at different scales, visual correspondences can be found across cities, and image retrieval can work in a geographically informed way. (A rough sketch of the patch-mining step follows the question below.)

    Question:
    This paper uses an iterative SVM with KNN to find the subtle visual elements that are representative of a geo-location; how would its performance compare to a CNN? (I have seen CNNs in the ImageNet challenge perform well at differentiating subtle features of species, even though that task is not geo-related.)
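
    As a concrete reference for the summary above, here is a minimal sketch of the patch-mining step: sample high-contrast patches and describe each with HOG plus a coarse Lab color layout. The library choice (scikit-image), patch size, HOG settings, and contrast threshold are my assumptions, not the authors' exact settings; images are assumed to be float RGB in [0, 1].

    ```python
    # Minimal sketch of patch sampling + HOG+color description (assumed settings).
    import numpy as np
    from skimage.feature import hog
    from skimage.color import rgb2lab
    from skimage.transform import resize

    def patch_descriptor(patch):
        """HOG over the grayscale patch, concatenated with an 8x8 Lab thumbnail."""
        gray = patch.mean(axis=2)
        h = hog(gray, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        color = resize(rgb2lab(patch), (8, 8, 3), anti_aliasing=True).ravel()
        return np.concatenate([h, color])

    def sample_high_contrast_patches(image, n, size=80, rng=np.random):
        """Randomly sample square patches, keeping only high-contrast ones."""
        H, W, _ = image.shape
        patches = []
        while len(patches) < n:
            y, x = rng.randint(0, H - size), rng.randint(0, W - size)
            p = image[y:y + size, x:x + size]
            if p.mean(axis=2).std() > 0.1:  # crude contrast test; threshold assumed
                patches.append(p)
        return patches
    ```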

  2. This paper argues that what makes Paris look like Paris are the stylistic elements, not famous landmarks. They use KNN and SVMs to find clusters of similar items (items that are from Paris are mapped together).

    Q:

    1. What is the geographical constraint added to the SVM?
    2. I would like to go over what the training procedure in the paper is.

  3. The paper presents a technique to find visual elements in geo-tagged images that are both geographically discriminative and recurring. The authors use Google Street View images since they satisfy all their requirements but are not biased towards landmarks (the Eiffel Tower occurs only once in the 1,000 images of Paris in their dataset). Their algorithm randomly samples 25,000 high-contrast patches from a positive set (all images belonging to a particular city), models them with HOG+color descriptors, and applies nearest neighbours to these patches to find candidate visual elements that occur frequently and can be discriminative. They filter out poor visual elements by calculating the proportion of nearest neighbours that are from the positive set, eliminating those below a certain threshold as well as those that have high similarity to other good patches. They then iteratively train a linear SVM for each visual element, dividing the training data into l (l=3) subsets and running the SVM on each unseen subset to find good patches that are then used for training, with geolocation as a weak constraint, a process termed iterative discriminative learning. The SVM is initialized using the initial patch and its nearest neighbors. They pick the SVMs that give the best firing accuracies, which reduces the number of visual elements to a few hundred. They run tests to answer various questions about the discriminative nature of the elements and how well they recognize and classify new images. Finally, they present applications that find geo-spatial distributions of elements, correspondences of elements across cities, and geographically informed image retrieval. (A sketch of the iterative learning loop follows the questions below.)

    Discussion:
    1. Why is k set to 5? Isn’t that a very small number and prone to overfitting?
    2. American cities have trouble with patches of cars. Wouldn’t an already established HOG descriptor of a car prove useful to eliminate such patches using a similarity measure such as SSD or Normalized Cross Correlation?
    3. From afar, this method seems similar to boosting, except that they use smart classifiers in the form of SVMs. What are the salient distinctions of discriminative learning? I don’t think I fully understand its benefits and power.
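
    Regarding the iterative discriminative learning asked about above, here is a rough sketch of my reading of the loop: the positive set is split into l=3 subsets, and each round retrains the per-element SVM and fires it on an unseen subset to harvest new positives. Since new positives are drawn only from the target city, this also approximates the weak geolocation constraint. The C value and the top-k of 5 are illustrative assumptions, not the paper's exact settings.

    ```python
    # Rough sketch of iterative discriminative learning (assumed hyperparameters).
    import numpy as np
    from sklearn.svm import LinearSVC

    def refine_element(seed_desc, nn_descs, pos_splits, neg_descs, k=5):
        """pos_splits: list of l=3 descriptor arrays from the positive (same-city) set."""
        pos = np.vstack([seed_desc[None, :], nn_descs])  # initialize with seed + its NNs
        svm = None
        for split in pos_splits:                   # each round sees an unseen subset
            X = np.vstack([pos, neg_descs])
            y = np.r_[np.ones(len(pos)), np.zeros(len(neg_descs))]
            svm = LinearSVC(C=0.1).fit(X, y)       # retrain the per-element detector
            scores = svm.decision_function(split)  # fire it on the unseen positive subset
            pos = split[np.argsort(scores)[-k:]]   # top detections become new positives
        return svm
    ```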

  4. This paper determines the visual elements that make cities unique. The authors do this by first collecting images from Google Street View. From these images, patches are represented with HOG and color descriptors. The best candidate patches are those that both appear frequently and are discriminative. First, KNN is used to find candidate seeds for clusters. Next, SVMs are used iteratively to refine each visual element. One cool way the authors verified their results was by comparing them to a 19th-century book on Paris architecture; they found that many of their visual elements were mentioned in the book.

    Discussion:

    In U.S. cities, types of cars were sometimes among the chosen elements. Are these actually geo-informative, or were they just the best the algorithm could come up with from a set of non-discriminative choices?

  5. This paper used Google Street View images of 12 cities around the world to create a model of what each city looks like. The authors strove to find patterns/features that are both frequently occurring within a city and geographically discriminative (appearing often in city X and not elsewhere). They train several one-vs-all classifiers. They start feature selection by randomly sampling high-contrast patches, then use a HOG+color descriptor to group each feature with its nearest neighbors. They filter out features that aren’t discriminative (fewer than 30% of the nearest neighbors match the geo-location). They then use an adaptive distance metric with SVM classification to group “detectors” based on accuracy, keeping a few hundred detectors. They verify these detectors are relevant by showing that humans can classify these features at rates far better than chance, and better than humans identifying random images of Paris or Prague. (A sketch of the nearest-neighbor filter follows the questions below.)

    Discussion:
    Does Chinatown look like China? Does the French Quarter in New Orleans look like France? Do deep-learned features outperform these hand-crafted ones?
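
    For reference, here is a small sketch of the filtering step mentioned above: a candidate patch is kept only if enough of its nearest neighbors come from the positive (same-city) set. The 30% threshold and k=20 come from the summaries in this thread; the use of sklearn's NearestNeighbors is my assumption.

    ```python
    # Sketch of the geo-discriminativeness filter (library choice assumed).
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def filter_candidates(cand_descs, all_descs, is_positive, k=20, thresh=0.3):
        """Keep candidates whose k-NN are mostly drawn from the positive set."""
        nn = NearestNeighbors(n_neighbors=k).fit(all_descs)
        _, idx = nn.kneighbors(cand_descs)         # indices of each candidate's k-NN
        pos_ratio = is_positive[idx].mean(axis=1)  # fraction of NNs from positive set
        return cand_descs[pos_ratio >= thresh]
    ```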

  6. Summary
    This paper attempts to find visual elements that are both discriminative of and recurring within a particular geographical region. The authors do this by collecting a large number of images from Google Street View and extracting HOG+color features from them. They perform clustering to come up with discriminative clusters called visual elements. As per the paper, these visual elements must be representative of the geographical location they are tagged with.

    Questions
    1) The paper briefly mentions the possibility of applying the algorithm to data on specific products, like Apple’s. Could this be applied to a domain like faces too, using eigenvectors as features, maybe?

  7. Style as a concept is fairly difficult to capture. This paper tries to use HOG features and color descriptors to describe the visual style of specific cities, and then determine whether the visual styles of individual cities are distinguishable by an SVM.

    Questions:

    Have any attempts been made to use the distance functions learned for these sorts of architectural styles in generative models, for example to generate a facade design that matches the Parisian style, or to procedurally generate groups of objects in film or games that have distinct styles from each other?

  8. The paper describes which features make a particular location unique and different from others. It uses HOG descriptors along with an iterative, SVM-based clustering method to map similar items together. Possible applications of this research in image-based information retrieval and pattern matching are also discussed.

    Questions:
    1) How exactly does the approach discussed by the authors work?
    2) What are the advantages and possibilities of using deep nets to find the features instead?

  9. The paper presents an algorithm that can use a large dataset to find the characteristic details of a place. The authors gathered a large dataset of around 10,000 images of 12 cities via Google Street View. They then present an iterative algorithm that finds image patches that are both frequent and geographically informative. The features need to be frequent; otherwise, instead of finding characteristic qualities, the algorithm would just find the most unique thing about the place, like the Eiffel Tower for Paris. In the end, they show various practical applications of their algorithm.

    Discussion:

    1) Can you please explain the training procedure from the paper again? I couldn't understand it completely.

    2) The paper mentions that the algorithm is highly parallelizable, but it is not obvious how.

  10. In this paper, the authors propose a method for discovering geographically discriminative visual elements. Given an input image, the method finds the important patches in the image that are repeated across a certain geo-spatial area. For training, the authors used a subset of the Google Street View database: 120,000 images of 12 cities.
    To discover these patches, they randomly sample patches from the entire feature space and use them as seeds for clustering. These patches are represented using a HOG+color descriptor. First they find the top 20 nearest neighbors using normalized cross-correlation. Then they train an SVM detector for each patch, updating the distance metric in every iteration (discriminative learning). (A sketch of the cross-correlation step follows the question below.)


    Q: For training the linear SVM detector, a weak geographical constraint was added. Can you explain what this weak constraint is?
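
    On the normalized cross-correlation step mentioned above, here is a minimal sketch of how NCC can serve as the initial similarity for the top-20 NN search: with zero-mean, unit-norm descriptors, NCC reduces to a dot product. This is my reading of the step, not the authors' code.

    ```python
    # Sketch: top-k nearest neighbors under normalized cross-correlation.
    import numpy as np

    def ncc_top_k(query, descs, k=20):
        """Indices of the k descriptors most correlated with the query descriptor."""
        def normalize(x):
            x = x - x.mean(axis=-1, keepdims=True)                         # zero mean
            return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)  # unit norm
        sims = normalize(descs) @ normalize(query)  # NCC as a dot product
        return np.argsort(sims)[-k:][::-1]          # best matches first
    ```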

  11. The paper presents a method to identify the discriminative visual features of a geo-spatial location such as Paris. The authors identify architectural elements of a city that are both frequent and discriminative, such as balconies, street lamps, etc. The paper uses the enormous geo-tagged Google Street View data to create a database consisting of positive and negative example images of Paris. These images are sampled to generate 25,000 high-contrast patches, which are represented using a HOG+color histogram feature vector. The authors find the top 20 nearest neighbors for each image patch in the entire dataset. Patches which have matches in both the positive and negative sets are disregarded as non-discriminative. Similarly, patches which have very few matches in the positive set are disregarded as infrequent. To eliminate the problems of using normalized correlation as the similarity metric, the authors iteratively train a linear SVM for each visual patch. Cross-validation is performed to avoid a bias towards the initial image patches.
    Question:
    1. The paper briefly mentions the disadvantage of using discriminative clustering methods. However, I am not clear on why partitioning the entire feature space is unwise for this purpose.

  12. This paper suggests that, rather than focusing on particular landmarks, a location can be inferred from an image via the stylistic elements of the constructions shown, such as balconies, window frames, etc., rationalizing that each distinct location favors certain discernible styles. Using Google Street View, 10,000 images of 12 different cities were accumulated. Next, randomly sampled patches from the positive set of images (those from the target city) were described via HOG+color descriptors. These are ranked for geographic discriminativeness via NN search against all patches, in both the positive and negative sets, keeping those patches whose presence is most common among the positive-set NN matches while also rejecting duplicates, where feature distance is measured by an iterative linear SVM algorithm.

    Discussion/Question: Wouldn't this method be prone to fail on unique landmarks, which might have their own style (like the Washington Monument or, indeed, the Eiffel Tower)? Would an alternate method, or an augmentation to this one, make the algorithm resilient to such images as well?

  13. In this project the authors try to capture certain geo-informative features of a given location, something humans seem to do quite well, especially when they are familiar with the location. This work is an attempt to extract those discriminative features for a specific location, such as the city of Paris, given a large dataset of Google Street View images from that location. In doing so, they try to pick features which strike a balance between occurring frequently and being discriminative. They use a nearest-neighbours clustering approach, picking random image patches from a set of about 25,000 high-contrast patches from the location of interest to cluster together features that appear frequently and can yet be called discriminative. They use a linear SVM to update the NN distance metric iteratively and discard features appearing in both the positive and negative sets.

    Questions:
    They only seem to be focusing on architectural elements, not other aspects such as people, cars, landscape, etc., and they mention this as a limitation. Is there any follow-up work that incorporates these?

  14. This paper aims to find mid-level discriminative representations at a city scale, particularly comparing European cities like Paris, Prague, etc. The authors show results on the Google Street View dataset and demonstrate real-world applications of their approach, like finding artistic influences in architecture.

    The main contributions of the paper are :

    1. Creating a large database of city scale images.
    2. Finding geo-discriminative patches in images and classifying them into cities.
    3. Novel applications, like a geo-location finder and an artistic style detector.

    The results they show are impressive, and the applications of the approach make it a highly interesting and influential paper.

    That said, I did not find much technical novelty in this paper: it uses the same nearest-neighbor approach, just with a learned metric rather than the Euclidean one.

    Questions:

    1. The paper uses HOG as the feature descriptor; current deep descriptors should ideally do better.

    2. Would learning a metric, as in metric-learning approaches, be better than using a similarity measure and training an SVM on top of it?

  15. The paper presents a technique to automatically find geo-informative features from a large database of photographs of a particular place. The visual elements detected are features that are both recurring and geographically discriminative. This is achieved by extracting HOG and color descriptors from Google Street View images of a city (approximately 10,000 images per city). An iterative clustering method with an SVM is used to group similar visual elements together.
    Question/Discussion:
    1. I am still unclear about the weak geographical constraint that was added to the SVM classifier.
    2. The paper mentions that only fronto-parallel views of building facades were included, to avoid variations in camera viewpoint. Why weren't images from different camera viewpoints taken into account?

  16. This paper describes a novel approach to finding visual features that discriminate between cities by making use of the large amount of geo-tagged data available. It explores how visual cues like architectural styles and street signs help humans discriminate among cities. HOG+color descriptors are used as 2112-dimensional (8x8x33) feature vectors for the SVMs, which are then applied iteratively to separate positive and negative samples over three iterations. The paper also describes how this method can find neighborhoods bearing similar cues and how similar cues can be found in different cities. (A quick check of the descriptor dimensions follows the discussion below.)

    Questions:-

    1. It seems to me that this could be a task where deep features come in handy. Has there been an attempt in that area?

    Discussion:-
    I think the method described for mining visual cues could be developed into a database of visual cues for each city, essentially acting as training data for experiments with deep nets.
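
    As a quick check of the 2112-dimensional figure quoted above: an 8x8 grid of cells with 33 values per cell gives 2112. One plausible split of the 33 (my assumption, not stated in the comment) is a 31-bin HOG per cell plus 2 color values.

    ```python
    # Sanity check of the descriptor dimensionality quoted above.
    cells = 8 * 8       # 8x8 grid of cells
    per_cell = 31 + 2   # assumed: 31 HOG dims + 2 color values = 33 per cell
    assert cells * per_cell == 2112
    ```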

  17. This paper presents a method for using geotagged photos to identify visual elements that can distinguish certain geo-spatial areas. First, they collected a dataset of images from Google Street View from 12 cities around the world. Then they represent high-contrast image patches using HOG+color descriptors and use the top nearest neighbors to find discriminative patches. Finally, they iteratively train a linear SVM on each visual patch. With this, they ran the top 100 element detectors from Paris and Prague on an unseen dataset and were able to achieve 83% and 92% accuracy, respectively.

    Has this work been extended to more natural views or to non-geographic data? Since these patches were generated from a pretty consistent head-on, street perspective, would it be useful to include images from varying camera viewpoints?

  18. The paper aims at identifying features in an image that are unique to a certain geographical location. The authors used Google Street View to build a dataset of 10,000 images of 12 different cities. They look for repeating and geographically discriminative visual elements to achieve this. The patches obtained are represented by an 8x8x33 feature descriptor that combines HOG features and an 8x8 Lab color image. First, KNN is used to find candidate seeds for clusters, followed by an iterative clustering method using SVMs to refine each visual element.

    Discussions:
    1) Can’t geographically unique information like signposts (language?), car license plates, etc. be exploited when geo-tags are not available?

    2) Are there CNN-based extensions?
