Scene Completion Using Millions of Photographs. James Hays, Alexei A. Efros. ACM Transactions on Graphics (SIGGRAPH 2007), August 2007, Vol. 26, No. 3.
project page.
This is the first paper for which you'll post reading summaries. Here is the description of these summaries from the class website: Students will be expected to read one paper for each class. For each assigned paper, students must write a two or three sentence summary and identify at least one question or topic of interest for class discussion. Interesting topics for discussion could relate to strengths and weaknesses of the paper, possible future directions, connections to other research, uncertainty about the conclusions of the experiments, etc. Reading summaries must be posted to the class blog http://cs7476.blogspot.com/ by 11:59pm the day before each class. Feel free to reply to other comments on the blog and help each other understand confusing aspects of the papers. The blog discussion will be the starting point for the class discussion. If you are presenting, you don't need to post a summary to the blog.
Simply click on the comment link below this to post your short summary and one or more questions / discussion topics.
Summary:
This paper solves the scene completion problem with a mostly data-driven approach. Semantic scene matching using the gist descriptor over a large amount of data dramatically increases speed and reduces artifacts. Graph cut seam finding and Poisson blending are used to match the local context of the complementary image parts, with an emphasis on minimizing the gradient of the image difference along the seam rather than intensity differences.
Question:
Why use the L*a*b* color space in the SSD?
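For reference, a minimal sketch of what an SSD in L*a*b* space could look like, assuming skimage for the color conversion (the paper's exact context mask and weighting are not reproduced). A common motivation for L*a*b* is that Euclidean distances there track perceived color differences more closely than distances in RGB.

```python
import numpy as np
from skimage.color import rgb2lab  # CIE L*a*b* conversion

def context_ssd(img_a, img_b, mask):
    """Sum of squared differences in L*a*b* space over the local
    context (mask == True). img_a, img_b: float RGB arrays in [0, 1].
    A sketch only; the paper's exact weighting is not reproduced."""
    diff = (rgb2lab(img_a) - rgb2lab(img_b)) ** 2
    return diff[mask].sum()
```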
Question:
I don't fully understand how the seam is found by minimizing the cost function C(L).
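For reference: the seam is a binary pixel labeling L (each pixel comes either from the existing image or from the scene match), found with a graph cut. A reconstruction of the standard form of such an energy (the exact terms and constants are specified in the paper):

```latex
C(L) \;=\; \sum_{p} C_d\bigl(p, L(p)\bigr)
       \;+\; \sum_{(p,q)\ \mathrm{adjacent}} C_i\bigl(p, q, L(p), L(q)\bigr)
```

Here the unary term C_d pins pixels that must come from a particular source, and the pairwise term C_i is zero when neighboring pixels share a label and otherwise penalizes the gradient of the image difference between the two sources, so the optimal seam runs where the sources' difference changes most smoothly.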
Discussion:
In Section 5, where previous work outperforms this paper, it is mentioned that the previous work used local texture propagation from the input image instead of texture from the Flickr data. What if we combined the approach described in this paper with texture generated from the non-excluded part of the original picture using the technique of "A Neural Algorithm of Artistic Style" (Leon A. Gatys, Alexander S. Ecker, Matthias Bethge, 2015)?
Discussion:
Completing the scene with the technique in the paper is one approach. What if, based on a CNN, we first recognized the scene and then let the CNN generate a context-based patch for the cut input image?
(Learning to Generate Chairs, Tables and Cars with Convolutional Networks. Alexey Dosovitskiy, Jost Tobias Springenberg, Maxim Tatarchenko, Thomas Brox. CVPR 2015.)
This paper presents a method that leverages a 2 million image database to fill in a missing section of an image. The algorithm is completely data driven; no labeling or annotations are required. It differs from other completion algorithms by forgoing the use of local image data, instead searching for images with similar gist descriptors and then using graph cut and Poisson blending to patch the hole in the original image with the most similar image in the database.
Discussion: How well does this algorithm perform on indoor scenes? The SIGGRAPH paper mentions that scenes related to landscape, travel, and city photography were chosen. Is this just because these categories were more available on Flickr, or is there some property of indoor scenes that impacts the algorithm's performance?
Summary:
This paper solves scene completion in two parts.
The first part is semantic scene matching. They compute the gist descriptor with the missing regions excluded to search for images depicting semantically similar scenes.
The second part is local context matching. They use graph cut seam finding and standard Poisson blending. For graph cut seam finding, they minimize the gradient of the image difference along the seam rather than the intensity differences. For Poisson blending, it is applied over the entire image domain instead of just the composited region.
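A minimal sketch of the first stage as described above, assuming the gist descriptors have already been computed (computing gist itself is not shown):

```python
import numpy as np

def scene_matches(query_gist, db_gists, k=200):
    """Return indices of the k database scenes whose gist descriptors
    are closest (L2) to the query's. The paper also augments gist with
    color information and excludes the missing region; omitted here."""
    dists = np.linalg.norm(db_gists - query_gist, axis=1)
    return np.argsort(dists)[:k]
```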
Question or discussion:
In the paper, they only used the gist descriptor to find images. Would it be better to use multiple descriptors? For example, use gist for the entire image and SIFT for local regions: first use gist to retrieve 200 images (fast) and then use SIFT to rerank them (slow).
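A sketch of this proposed two-stage retrieval; gist_fn and sift_score_fn are hypothetical helpers, not part of the paper:

```python
import numpy as np

def two_stage_retrieval(query, database, gist_fn, sift_score_fn, k=200, m=20):
    """Fast gist filtering down to k candidates, then a slower
    SIFT-based rerank of those candidates (hypothetical pipeline)."""
    q = gist_fn(query)
    coarse = sorted(database, key=lambda im: np.linalg.norm(gist_fn(im) - q))[:k]
    # Higher SIFT match score is better, so rerank in descending order.
    return sorted(coarse, key=lambda im: sift_score_fn(query, im), reverse=True)[:m]
```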
Summary:
This paper proposes a novel data-driven approach to scene completion: it first searches a large database of images for candidate scenes that semantically match the input image, retrieving the top 200 scenes. For this part, the paper uses the gist descriptor to group similar scenes. After that, the matched scenes are searched for the local context that best matches the region around the missing part of the input image. Finally, Poisson blending and graph cut segmentation are applied to blend the matched region with the input image.
Research Question:
1. How can we adapt the proposed method to perform on more constrained scenes, such as indoor scenes, images of paintings, or portraits?
2. For a given image with missing parts, can we decide in advance which scene completion algorithm would perform better: the proposed method from the paper, or other approaches that search the input image itself for usable texture?
Summary:
This paper improves the quality of image repair and replacement over prior algorithms. The proposed method uses a huge database of images taken from Flickr (2 million), scene-matches them against the input, and employs Poisson blending together with graph cut segmentation and seam finding to fill in the missing portion. It addresses the three main difficulties that arise when these portions are taken from other images: computational complexity, semantically incorrect image fragments, and mismatched lighting and color. The result is an image whose missing portion is more convincingly replaced.
Discussion:
As the number of images increases, the time taken will also increase. While this algorithm already speeds up certain steps, could recognizing the possible locations of the input image's scene help speed up the process further?
For example, the process would be as follows (sketched in code below):
1) Pre-compute scene recognition for all 2 million images and tag each image with a set of possible locations (those whose predicted probability exceeds a certain threshold).
2) For every input image that comes in, run the same scene recognition and take the tags with the highest few probabilities as that image's tags.
3) For each of these tags, compare the input image with all database images carrying that tag (if they have not been compared before), using the gist and SSD method described in the paper.
With 2 million pictures that could grow over time, this method could possibly (although maybe not probably) reduce the number of comparisons required for scene matching, and hence the time taken for this stage, while possibly still retaining a certain percentage of feasible images.
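A minimal sketch of this proposal; tag_fn (the scene classifier) and the tag vocabulary are hypothetical:

```python
import numpy as np

def prefilter_by_tags(query, tagged_db, tag_fn, gist_fn):
    """tagged_db: list of (image, tags) pairs with tags precomputed
    offline by a hypothetical scene classifier (step 1). Only images
    sharing a tag with the query are compared via gist (steps 2-3)."""
    query_tags = tag_fn(query)                         # step 2
    q_gist = gist_fn(query)
    candidates = [im for im, tags in tagged_db if tags & query_tags]
    return sorted(candidates, key=lambda im: np.linalg.norm(gist_fn(im) - q_gist))
```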
By: Wei Yang Quek
This paper proposes a data-driven image completion method which not only fills holes seamlessly but also fills them with semantically valid content. The two-step method first searches through the database for the most semantically matching images, then ranks the candidate completions by a roughly equally weighted combination of the scene matching distance, the context matching distance, the local texture similarity, and the cost of the graph cut. This method, together with the gist scene descriptor and Poisson blending, works much better than Criminisi's method in most cases, though it has some typical failure cases. In sum, this paper proposes an image completion method that fills in holes with semantically valid content and requires only a reasonably large number of images.
Question:
Why "all four components of the score are scaled to contribute roughly equally"? Is this an empirical result?
Summary
This paper proposes a new image completion algorithm based on a large database of images. First, the algorithm performs scene matching to find images with semantically similar scenes. Second, it performs local context matching with the selected candidates, trying to minimize the gradient of the image difference along the seam. The algorithm aims to present the user with the roughly 20 composites that have the lowest scores.
Discussion
Is it possible to combine this algorithm with the older approaches that fill in the hole with content from the input source image? For example, if the database cannot provide images with a highly semantically similar scene, we could fall back to the older algorithms.
This paper proposes an algorithm for image completion that first searches a huge database for images which match the source image semantically (i.e., similar scenes/environments) and then finds patches within the closest matches that have the best local contextual similarity. These patches are then merged with the source image by finding an appropriate seam and matching the color of the source and replacement patches via Poisson blending.
Questions: Was the poor unlimited-time performance due mainly to differences in color tones? (Were black-and-white images tested?) If so, could this be improved by a better color blending algorithm?
Was expanding the removed region in the source image to the nearest edges or to the nearest regions of increased gradient considered? Would this have made the matching easier?
Summary: The paper uses a large unlabeled dataset of images to improve on the image completion task. The main idea is to compute a gist descriptor for the given image and for the dataset, and to find the most similar images in the dataset using a scene matching algorithm. Then, among these semantically similar images, local context is matched using graph cut seam finding that minimizes the gradient of the image difference, followed by Poisson blending.
Discussion: Could we build a reranker over the top 20 images provided by the system, using the qualitative metric provided in the paper? It might improve the accuracy.
Discussion: How can we solve the problem of objects being cut in half, as in Figure 8? Can we somehow detect other objects in the image and leverage the semantic context?
Question: What is the most time-consuming part of the computation on the 2 million image dataset: finding the similar scenes or local context matching?
The paper discusses a data-driven, "what could have been there" approach to image completion or inpainting. The three main challenges of computational cost, semantic validity, and texture/color/illumination mismatch are tackled by (1) nearest-neighbor search for matching scenes using gist descriptors augmented with color information, (2) the use of a very large dataset of images (2.3M), and (3) plausible compositing via graph cut seam finding over the gradient of the image difference and Poisson blending of the valid local contexts. Both qualitative and quantitative results are provided.
While many good questions have already been asked, I propose the following:
1. Was part-based matching examined? In cases where the algorithm fails due to a lack of high-level semantic understanding, perhaps an approach that completes parts of the hole separately could provide better solutions?
2. Why is an 80 pixel radius used for the local context? Is that an empirical finding, or is there some justification from a computational standpoint?
The paper discusses an entirely data-driven algorithm for scene completion. It proposes a two-stage search: first filter for images that are semantically similar to the given incomplete image using the gist scene descriptor, then find the most satisfying patch from these filtered images.
Question:
1. Why augment the feature with color information?
2. For local context matching, why use SSD and not something like normalized cross-correlation?
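For reference, a sketch of the two measures side by side: SSD is sensitive to absolute brightness and contrast, while NCC normalizes them away. One possible reason SSD suffices here is that Poisson blending later absorbs smooth color and exposure shifts anyway (my speculation, not the paper's stated rationale).

```python
import numpy as np

def ssd(a, b):
    """Sum of squared differences: sensitive to brightness/contrast offsets."""
    return ((a - b) ** 2).sum()

def ncc(a, b):
    """Normalized cross-correlation: invariant to affine intensity changes."""
    a0, b0 = a - a.mean(), b - b.mean()
    return (a0 * b0).sum() / (np.linalg.norm(a0) * np.linalg.norm(b0) + 1e-12)
```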
The paper solves the problem of completing a scene in an image. Previous works tried to solve this problem using either a small collection of images or labelled images. The novelty of the approach discussed in the paper lies in the fact that it uses a purely data-driven approach on a large collection of images taken from online sources like Flickr, sometimes even allowing the algorithm to come up with multiple ways of completing the scene. The main steps of the algorithm are finding images semantically similar to the input image, finding context matches between them, and generating new images with Poisson blending, using a seam optimization that minimizes the gradient of the image difference.
Questions/Discussion:
I think the discussion section of the paper raises some interesting possibilities. I have the following propositions for discussion:
1) In this section, the paper raises the question of whether the whole visual world could be represented by a set of semantically differentiable images. Is that actually possible?
2) Also, it is claimed that the paper's results make this seem possible; however, it is unclear how this conclusion is reached. A more elaborate explanation would help.
3) What other vision and graphics problems might benefit from having an exhaustive set of semantically differentiable scenes?
The paper proposes a method of scene completion which leverages the immense potential of a large database of images. The authors first find images semantically similar to the query image by computing the SSD between the gist descriptor of the query image and those of all the images in the dataset.
The 200 closest images are retrieved, and local context matching is then performed on patches of these images by minimizing the SSD error in the L*a*b* color space.
Discussion:
The paper solves the problem of semantically meaningful completion by using a large database that is assumed to contain images semantically similar to the query image. For example, it is mentioned that the database consists of outdoor images: cities, forests, etc. However, if a query image of, say, an art gallery scene were provided, the algorithm would not work as well. In that scenario, can we design an algorithm that would be aware of this shortcoming and return, say, "not valid"?
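A sketch of such a rejection test (the threshold is a hypothetical, dataset-dependent cutoff):

```python
import numpy as np

def best_match_or_reject(query_gist, db_gists, threshold):
    """Return the index of the best scene match, or None when even the
    closest match is too far away to be semantically plausible."""
    dists = np.linalg.norm(db_gists - query_gist, axis=1)
    best = int(np.argmin(dists))
    return best if dists[best] < threshold else None
```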
By: Aditi Gupta
Summary:
The paper attempts to solve the problem of completing a scene in an image where part of the scene has been removed. To do this it leverages a large database of images and uses a gist descriptor to find similar images in the database. After this, the matched image region is inserted into the scene using a combination of Poisson blending and graph cut seam finding. They then assign each composite a score using the scene matching distance, local context matching distance, texture similarity, and the cost of the graph cut, and present the user with the lowest-scoring composites.
Questions:
- Does the user pick the final composite image after being presented with 20 options? This isn't made clear towards the end.
Discussion:
This has been brought up earlier: I concur that the images seem to be tested only in outdoor environments like landscapes or cityscapes. Whether this will fare well indoors or on other types of images is an open question.
Carl Saldanha
Summary:
The authors attempt to solve the problem of replacing missing parts of a scene in an image. Their algorithm first uses scene matching to find semantically similar scenes from a large collection of over two million images. Local context matching is then used, followed by Poisson blending and graph cut seam finding with gradient-domain fusion. Each insertion of new texture or deletion of pixels from the original image is weighted with a cost function to ensure lower error, producing a scene matching score from the matching distance, texture comparison, etc. The user is then given a choice of several scenes that match the missing region, and the selected image is used to complete the scene.
Question:
- Would constraining the two million image database to images taken in the same location, or near the same GPS coordinates, make a difference to the results by producing cleaner graph cut seams?
- Using newer social media metadata such as location information, tags, and descriptions could decrease the semantic search time.
- Instead of informing the 20 users that they were searching for real/altered images, asking them to choose the best image completion might change the result analysis and the comparison with Criminisi et al., since it would provide a more qualitative evaluation of image completion rather than of the blending.
Summary:
The problem addressed in this paper is twofold:
1) How can one go about selecting good candidate image patches to fill in a gap in an image?
2) Given good candidate patches to fill in a gap in an image, how does one blend those patches in order to fill in the gap?
Part 1 is addressed by searching the space of semantically similar scenes, from which patches corresponding to the location of the gap are selected. Part 2 is addressed by Poisson blending the new patch with the original image (into a region determined by a graph cut).
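For intuition about Part 2, OpenCV's seamlessClone implements the same Poisson-blending idea (a minimal sketch with placeholder file names; note the paper blends over the entire image domain, whereas this standard call blends only the cloned region):

```python
import cv2
import numpy as np

src = cv2.imread("patch.jpg")    # region taken from the matched scene
dst = cv2.imread("scene.jpg")    # image with the gap
mask = np.full(src.shape[:2], 255, dtype=np.uint8)  # blend the whole patch
center = (dst.shape[1] // 2, dst.shape[0] // 2)     # where the patch lands
result = cv2.seamlessClone(src, dst, mask, center, cv2.NORMAL_CLONE)
cv2.imwrite("composite.jpg", result)
```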
Question:
- It _is_ very difficult to quantify the results of this problem, but it would have been interesting if the results had been tested at a larger scale, or if, instead of human judges, supervised classifiers had been trained to recognize natural vs. synthetic images and used for the evaluation.
-Stefano
The paper describes a novel approach to scene completion that finds other images with a 'similar semantic skeleton' (extremely similar scenes) to that of the input image. This involves creating scene descriptors (gist, in this case) for the input image and for every image in the huge Flickr dataset used for this task. These gist descriptors are then used to retrieve the closest neighbours of the input image in the dataset. Matching scenes from these images are then composited into the input image using seam finding and Poisson blending.
A lot of good questions have been asked in this discussion thread already. I'd like to add two more:
1. Since gist descriptors are insensitive to variation in image dimensions, will the differing dimensions of the actual images make low-level artifacts more prominent during compositing, even if the images were initially down-sampled?
2. Are there any quantitative (or less subjective) methods developed since then that give a better evaluation of what constitutes a successful completion of a given image?
- Siddharth R
The paper tackles the issue of completing missing portions of an image using matching scenes from a large database of images. The algorithm first constrains its context matching search to semantically valid scenes. A matching scene is then inserted using a combination of graph cut seam finding and Poisson blending. Each composite is assigned a score that combines, in roughly equal proportion, the scene matching distance, the local context matching distance, the local texture similarity distance, and the cost of the graph cut. The user is finally presented with the 20 best composites.
Questions:
This has been touched on, but why equally weight scene matching, context matching, texture similarity, and the graph cut cost? Since semantic scene matching has already narrowed down the candidates, couldn't we reduce its weight in the final score?
Discussion:
Couldn't an additional object recognition step be used? This could help retain semantic content in the missing region, apart from reducing half-cut objects.
Avinash Bhaskaran