Hi Class, I'm going to go ahead and lead another discussion on Wednesday because I don't think I would be giving enough notice for someone else to present.
The image generation topics seemed popular but nobody selected this paper. It's a good paper to read before we get to the later papers.
Photo Clip Art. Jean-Francois Lalonde, Derek Hoiem, Alexei A. Efros, Carsten Rother, John Winn and Antonio Criminisi. ACM Transactions on Graphics (SIGGRAPH 2007).
project page
Question: I'm not able to try the photo clip art live demo; the server is down. Is anyone able to do so?
+1 I had problems with that too. I sent them an email, but they haven't put it back up yet.
This paper presents a system and UI for adding new objects into existing photographs. It queries a large database of roughly segmented images to find a palette of suitable candidates within the desired object class, ordered by how well their lighting, camera orientation, resolution, and context match the location where the user wants to place the object; all of these properties are precalculated for each image in the library. The user's chosen object is then composited into the background image at the desired location through an improved segmentation algorithm (extending the Grab-Cut algorithm with a term intended to keep thin, important features from "shrinking bias"), improved Poisson blending to preserve color, and convincing shadows built by reusing the source object's shadow and then modifying it based on the target background's statistics.
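Since the summary mentions the Grab-Cut extension, here is a minimal sketch of the standard GrabCut step using OpenCV, just to ground the discussion. This is the vanilla algorithm only (the paper's extra term against shrinking bias is not implemented), and the file name and rectangle are made up:

# Vanilla GrabCut via OpenCV; the paper's anti-shrinking term is NOT
# implemented here. File name and rect are hypothetical.
import cv2
import numpy as np

img = cv2.imread("object_source.jpg")      # image containing the object
mask = np.zeros(img.shape[:2], np.uint8)
rect = (50, 50, 200, 300)                  # rough LabelMe-style box (x, y, w, h)

bgd_model = np.zeros((1, 65), np.float64)  # internal GMM parameters
fgd_model = np.zeros((1, 65), np.float64)
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Definite + probable foreground pixels form the object matte.
matte = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
cut_out = img * matte[:, :, None]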
Discussion:
1) Could the modified Grab-Cut algorithm from the paper (or one like it) be used to clean up the segmentations in COCO and other crowd-segmented databases online (as the users were drawing them)?
2) I noticed a pronounced specular reflection in the window of the car used as an inserted object that, while small, gives a very specific cue as to where the sun/light was in the car's original image. Is the algorithm's lighting determination (either the global or local components) precise enough to present this car as an option only in appropriate situations? I'm thinking specifically of a scene where the lighting angle varies by only ~30 degrees from the car's original scene, but where that variance lies in the direction of the camera (so perhaps directly over the cameraman's head, or possibly even to the left). In other words, were specular reflections considered in any way when determining lighting?
Summary
The researchers present a system by which users can insert novel objects into a scene. The users select these objects from a large database of segmented images (LabelMe). They use the software to specify the horizon in the new image and then choose from a series of similar objects to insert. The authors rank these candidates by camera orientation, resolution, local context, and segmentation quality. Once an object is chosen, it is inserted in real time using object segmentation and blending (graph cut and Poisson), and then a shadow is added.
Questions
1. A lot of the focus seems to be on adding people, cars, or other small objects. I wonder what happens when enormous artifacts such as buildings or large areas are added to the scene; I don't think the result would look as realistic.
The paper proposes a new method to effectively insert objects into an existing picture. In the proposed method, the object to be inserted is found in the LabelMe database, which is built using a web-based annotation tool with people all around the globe labeling objects in different images. Once the user selects the object to be inserted, the object segmentation has to be fine-tuned, since the segmentation in the LabelMe database consists of crude polygonal boundaries. After segmentation, the object has to blend into the existing picture; this is done using improved Poisson blending. The user interface helps the user navigate the photo clip art library and paste an object of choice into the photograph. The objects that best match the existing scene appear at the top of the list, and compositing with these objects yields good results. The ordering is based on several matching criteria: camera orientation, global lighting conditions, local context, resolution, and segmentation quality.
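As a point of reference for the blending step, here is a minimal sketch of ordinary Poisson blending using OpenCV's seamlessClone. Note this is the standard formulation, not the paper's improved variant that better preserves the object's colors; file names and the paste location are hypothetical:

# Standard Poisson blending (not the paper's improved version).
import cv2

obj = cv2.imread("car_cutout.jpg")          # segmented object patch
scene = cv2.imread("street_scene.jpg")      # target photograph
mask = cv2.imread("car_mask.png", cv2.IMREAD_GRAYSCALE)

center = (420, 310)                         # hypothetical click location (x, y)
composite = cv2.seamlessClone(obj, scene, mask, center, cv2.NORMAL_CLONE)
cv2.imwrite("composite.jpg", composite)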
Discussion:
1. As the paper points out, the proposed system fails when the input scene has unusual illumination that doesn't match that of any object in the LabelMe database. This problem will diminish as the database grows in size. Until then, a workaround could be to manipulate the color distribution of the object that has the highest matching score for the given scene.
Composing digital images convincingly can be difficult without expert knowledge or a great deal of experience with image manipulation tools. This work presents a new tool for inserting objects into an image. The approach allows for the insertion of an object by selecting, from a database of objects, one that matches the lighting and coloration of the scene, so as to appear a natural part of the original image. The results presented are good, and would doubtless improve as more data is used, but seem limited to the addition of small objects (it is unclear whether adding large objects that might alter the scene lighting or cast shadows onto other objects would work well, and I am so far unable to run their provided code to determine this). The paper raises the following questions:
1) Exactly how good are the generated synthetic images? It would be interesting to see a more comprehensive evaluation.
2) How does search performance scale with database size?
Abstract:
The paper provides a new system for inserting new objects into existing photographs by querying a large image dataset. The system provides an easy user interface that helps the user find similar objects and insert them into the photograph. The proposed system does image manipulation in 3D space instead of 2D, which gives the user better control over object placement. An innovative iterative scheme estimates 3D heights and camera pose priors using known object height distributions. A modified GrabCut is used with a blending mask to insert the closest-matching object into the image.
Discussion:
1) What would be the minimum parameters required for size and orientation estimation for objects not on the ground, such as a flying airplane? Can those also be estimated just from the dataset?
2) The dataset seems very small compared to MS-COCO, which is a much larger segmented dataset and also has more categories per image. Has anyone tried this system with the MS-COCO dataset?
This paper attempts to solve automatic image composition using a data-driven (and object retrieval) approach. The system outlined in this paper consists of two major stages: 1) object sorting and 2) object insertion. For the first stage, the system uses the LabelMe dataset, which consists of a variety of real-world Internet images with a large variety of objects segmented and labeled, and sorts the objects within this dataset according to which would be most appropriate to add to the user's image. Objects are sorted based on the user-selected object label and cluster, as well as the similarity of the images' camera parameters, illumination, local context, and object size. The second stage, object insertion, uses graph cuts to better segment the object and Poisson blending to add the object realistically to the image.
Questions:
1) It would be interesting if the object candidates were also ranked by how often they were chosen. Perhaps certain images that provide a better narrative scene (rather than just being photo-realistic) would be interesting to consider. There might be something besides photorealism that drives individuals to choose certain objects over others.
2) Similarly, it would be interesting if the system could be scaled (perhaps with the inclusion of the segmented sections of ImageNet and MS-COCO) to allow the user to ask for more specific traits beyond a generic category and pose. For example, say I'd like to add a red car to the scene versus just any car, or a person with brown hair.
For some reason my name isn't showing up - this is Julia Deeb
The paper presents a process and accompanying UI for compositing images. The UI is meant to be intuitive enough that non-artistic users can easily create combined images. The user specifies an object category to be placed into the background image. The system searches through images from the LabelMe database and finds objects from that category that best fit the background image, selected after considering camera orientation, global lighting conditions, and local context. The possible images are presented to the user in order of match quality. After selecting an image, the user can place it into the background with a mouse click. The image is composited into the background via segmentation, Poisson blending, and shadow transfer.
Discussion:
1) How can we measure the effectiveness of the system? Would an approach similar to Scene Completion Using Millions of Photographs be applicable?
Summary:
This paper presented a method for inserting objects into pictures more naturally, based on two key ideas: finding instances of an object class from a large dataset that fit well at the chosen location in the image, and performing the manipulation with an understanding of the 3D space of the image rather than in 2D. Object size and orientation were estimated along with the camera pose by an iterative process, much like expectation maximization: with object sizes and orientations fixed, the camera pose is optimized to fit the image, and vice versa with the camera pose fixed. Lighting conditions were estimated based on previous work, but with the local appearance context voting as well; unsuitable objects were filtered out, and objects were then clustered based on shape. Finally, blending is supported by the local appearance context.
Question:
On estimating object size and orientation together with camera pose: how is this iterative process conducted mathematically? What if the iteration does not converge? Were any of the bad results in the paper caused by the iteration failing to converge?
This paper outlined an application that can be used to insert objects synthetically into an image. By creating a database of objects from existing images, the software and operator can select the desired object that best matches the scene's lighting conditions and camera point of view. Camera point of view and lighting conditions were not labeled in the dataset, so they were estimated using what sounds like expectation maximization: after assuming height distributions for several common classes of objects, the authors estimated the most likely camera orientation, and with that new camera orientation they could re-estimate height profiles for certain objects, continuing to alternate the E and M steps. The camera pose, lighting profile, and other features were combined using a weighted average to score how well every instance fits a scene. Operators could then select the object they wanted and easily drag it into the photo, with the application auto-scaling it to match the depth along the ground plane. The authors also presented a slightly modified energy function for the Poisson blending PDE.
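To make the alternation concrete, here is a toy sketch of such an E/M-style loop. It rests on the standard single-view relation H = h * y_c / (v0 - b) for an object of true height H with pixel height h and bottom row b (v increasing upward), horizon row v0, and camera height y_c; whether the paper's exact updates look like this is my assumption, and all numbers are made up:

# Toy alternating estimation of camera (v0, y_c) and object heights H.
import numpy as np

# (pixel_height h, bottom row b) for a few "people" in one image.
objects = np.array([(120.0, 40.0), (80.0, 110.0), (150.0, 10.0)])
H = np.full(len(objects), 1.7)     # start all heights at the 1.7 m prior mean

for _ in range(10):
    # Camera step: least-squares fit of (v0, y_c) from
    # H_i = h_i * y_c / (v0 - b_i), i.e. H_i * v0 - h_i * y_c = H_i * b_i.
    A = np.stack([H, -objects[:, 0]], axis=1)
    rhs = H * objects[:, 1]
    (v0, y_c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    # Height step: re-estimate heights, blended toward the prior mean.
    H = 0.5 * objects[:, 0] * y_c / (v0 - objects[:, 1]) + 0.5 * 1.7
print(v0, y_c, H)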
Question:
The authors talk a lot about solving the shadow problem, but in most of the synthetic images in the paper the inserted people do not have shadows. Inserted cars, on the other hand, usually do. Are shadows visible for smaller objects, or only for larger ones?
The paper proposes a system to allow image compositing with a key focus on "photo-realism". The authors use a data-driven approach to find objects from an image clip art library that best fit the pose and illumination of the target scene. The two main components of the paper are the library creation and the object matching and blending. Library creation involves taking an available dataset of objects and their segmentations and calculating the camera pose for each image, using either two object instances or a pose prior distribution and one object instance (instances of an object class with known 3D height distributions). Global and local illumination contexts are also calculated to better match illumination in the target scene. For object matching, a weighted linear combination of camera orientation, global lighting, local context, instance resolution, and segmentation quality is used. Methods are proposed for better segmenting the objects and for doing context-aware blending to achieve the goal of photo-realism.
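Since the ranking is described as a weighted linear combination, here is a minimal sketch of that step; the weights and the individual cost values are placeholders, and the paper defines its own cost for each criterion:

# Rank candidate objects by a weighted sum of matching costs (lower = better).
WEIGHTS = {"camera": 1.0, "lighting": 1.0, "context": 1.0,
           "resolution": 0.5, "segmentation": 0.5}    # made-up weights

def match_cost(costs):
    # costs: dict mapping criterion name -> normalized cost in [0, 1]
    return sum(WEIGHTS[k] * costs[k] for k in WEIGHTS)

candidates = [  # (object_id, per-criterion costs), hypothetical values
    ("car_017", {"camera": 0.2, "lighting": 0.4, "context": 0.1,
                 "resolution": 0.0, "segmentation": 0.3}),
    ("car_233", {"camera": 0.6, "lighting": 0.1, "context": 0.5,
                 "resolution": 0.2, "segmentation": 0.1}),
]
ranked = sorted(candidates, key=lambda c: match_cost(c[1]))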
Q: Is the step involving the creation of subclasses really necessary? Despite the high variability among instances of certain object classes, using the matching criteria (especially orientation and local context) should help alleviate the need for finding subclasses within the object class, since the final goal is to provide a list of recommended objects that best fit the input scene.
Summary:
This paper presents an easy-to-use system for compositing images with objects: the user just picks a 3D location in the image and a kind of object (possibly with different sub-orientations), and the system recommends object segments (taken from a large dataset of images collected from the web) with matching camera pose, lighting, and resolution. Automatic algorithms are used for estimating object size with respect to depth, object segmentation and blending, shadows, lighting around the object, etc.
Questions:
I could think of a few scenarios that look challenging for the existing system and could be good starting points for discussion.
1) How can the system be made to work even for object segments that are occluded in their original scenes (these occlusions might make the object look out of place even if the illumination, color blending, etc. are proper)? Are there any additional processes that can deal with this issue, or are these segments considered partially or discarded altogether?
2) If a selected object segment has a transparent surface, like a person wearing a transparent raincoat, the original image from which the segment was taken might show a different background through the coat, which might not match the new scene.
3) What if the scene doesn't contain any objects with shadows, but there is a light source? How can this case be handled, given that the system has no reference shadow? Maybe the location of the light source helps?
4) If a smaller object is placed in the shadow of a bigger object, the side of the smaller object facing the bigger one should appear darker due to the shadow falling on it from that side.
Summary:
This paper presents a novel way to populate environments with photorealistic clip art. The user specifies a location and is then presented with a selection of objects that fit it. This provides a user-friendly way to enhance images, even for those who are otherwise unskilled.
Discussion:
Has any more work built upon the results of this paper? Perhaps by utilizing more recent datasets to increase the size of the object library?
This paper presents a system that lets users insert new items into an image. The authors do not try to manipulate the image in order to fit the object into it; they find an object that fits the image. This requires a fairly large dataset with rich image segmentation, so they used LabelMe. The approach requires estimating the true size and orientation of the object in 3D space (not 2D). One of the most important contributions of this paper is the observation that often it is not the particular instance of an object pasted into the image that matters, but the qualities of a kind of object (a car with correct lighting, pose, size, etc.).
Qs
1. I'm not sure I'm totally getting section 3.1, the section about estimating the true size of an object and the camera pose. They say they "first compute the most likely pose for images that contain at least 2 known objects (instances of an object class with a known height distribution)". Where are they getting these known height distributions? Who is labeling all these images? Does their method only work for people?
2. How do they know the true values (heights) of the objects?
3. It would be interesting if a generative deep network could take an image and learn to generate an object with appropriate lighting, pose, color, and size for that image.
This paper discusses how data-driven approaches apply to novel applications such as photo insertion, and how they make such applications easy for the end user to operate. It goes in depth on how the data was collected, how parameters were estimated, and the problems the authors faced that led to novel algorithms being developed. It describes the difficulty of obtaining 3D scene information, such as camera orientation and height, from a 2D image, and how probabilistic models can be used to build prior distributions with estimated means and variances, with draws from those distributions helping achieve a better result. The difficult task of maintaining context illumination is solved using local color information stored in 3 histograms, obtaining a good mask, and modifying Poisson blending to account for local illumination. Finally, it describes the user-friendly UI that was developed, along with the steps for using the system to generate a modified image.
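As a rough illustration of the histogram-based context matching mentioned above, the following sketch compares joint RGB histograms for three bands around the insertion point using histogram intersection; the exact regions and distance measure the paper uses are an assumption here, and the pixel data is random stand-in:

# Compare local color context via normalized joint RGB histograms.
import numpy as np

def region_hist(pixels, bins=8):
    # pixels: (N, 3) array of RGB values in [0, 255]
    h, _ = np.histogramdd(pixels.reshape(-1, 3), bins=bins, range=[(0, 256)] * 3)
    return h / h.sum()

def intersection(h1, h2):
    return np.minimum(h1, h2).sum()    # 1.0 means identical distributions

bands = ("ground", "object", "sky")    # hypothetical band split
scene = {b: np.random.randint(0, 256, (500, 3)) for b in bands}
candidate = {b: np.random.randint(0, 256, (500, 3)) for b in bands}
score = np.mean([intersection(region_hist(scene[b]), region_hist(candidate[b]))
                 for b in bands])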
Discussion-
A similar point to one mentioned above: the MS COCO database would be highly useful for this purpose. Not only does it account for semantic information, it also answers the underlying question of the availability of a larger database of segmented images.
Question-
Could the scene completion method discussed before make this viable for larger scene modifications, rather than being restricted to smaller objects? It would be interesting to know whether this works.
This paper presents a data-driven way to blend objects/clip art into a scene. The objects are taken from existing images in the LabelMe library, which are already tagged and segmented manually. For the blending to be as photorealistic as possible, the paper suggests matching the camera position and orientation, illumination, and resolution of the scene and the object to be inserted. Since a large object library is already available, matching is done by simply choosing the right (suitable) objects from the library. After the set of objects suitable for the scene is filtered, the user is asked to choose an object and place it in the scene using a simple Java-based GUI. Based on where the object is placed, its height, orientation, and shadow are corrected in the scene.
Discussion:
1. The example scenes are all outdoor. Is this because the position of the ground and sky plays an important role in estimating the camera position and orientation?
2. Instead of just using the segments given by the LabelMe library, could additional automatic segmentation be performed for a finer boundary and to remove holes in the object?
This paper presents a system for inserting new object sprites into an existing photograph. The objects to be inserted are pre-computed by processing Internet images. A UI is also presented which lets users easily select an object and drop it at the desired position in the photograph. The object insertion task does not modify the object to match the brightness and orientation of the photograph, but rather chooses a similar object that already matches those requirements. The paper also presents new algorithms for better object segmentation and blending and for estimating object size and orientation in three dimensions.
Discussion-
1) Can an algorithm be used to further adjust the orientation and size of an object to match the photograph and produce visually better images? With a finite database, it is not possible to have sprites at every orientation and size that exactly match the photograph.
This paper aims to support photo editing by the insertion of objects, using a large-scale data-driven approach. The authors develop a simple user interface that lets a human select among candidate objects and place the best one in the image.
The contributions of the paper are:
1. It uses a large-scale data-driven approach over a standard dataset to find instances of an object category that match the camera pose, ground plane, and scene illumination.
2. They develop an intuitive user interface where the user only needs to do simple tasks: choosing the category of the object and the shape they want the object to have.
3. They also suggest using a shape prior in the GrabCut formulation to blend the objects into the background. The shape prior allows irregular shapes to be composited smoothly rather than being segmented into parts.
They show impressive results on the dataset while also discussing failure cases that occur due to unusual illumination and complex object shapes like trees.
Discussion:
1. Most of the techniques assumed in this paper could be improved using state-of-the-art deep learning.
2. I particularly did not understand how the shape prior was incorporated into GrabCut.
3. I also wonder whether the L2 norm is a good metric for shape clustering (a sketch of what such clustering might look like follows).
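Regarding point 3, here is a minimal sketch of what L2 shape clustering might look like: rasterize each segmentation mask to a fixed grid and run k-means (which minimizes squared L2 distance) in that vector space. The normalization is my assumption, and the masks are random stand-in data:

# L2 shape clustering: fixed-size mask vectors + k-means.
import numpy as np
from sklearn.cluster import KMeans

def to_vector(mask, size=32):
    # Nearest-neighbor resample of a binary mask to size x size, flattened.
    ys = np.linspace(0, mask.shape[0] - 1, size).astype(int)
    xs = np.linspace(0, mask.shape[1] - 1, size).astype(int)
    return mask[np.ix_(ys, xs)].astype(float).ravel()

masks = [np.random.rand(120, 80) > 0.5 for _ in range(50)]   # stand-in masks
X = np.stack([to_vector(m) for m in masks])
labels = KMeans(n_clusters=5, random_state=0, n_init=10).fit_predict(X)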
This paper presents a data-driven system for inserting new objects into existing photographs, without requiring artistic skill or the laborious work of image compositing. The major contributions of this paper include:
- setup of a rich object library
- development of an object segmentation and blending method
- estimation of object size and orientation
- estimation of lighting conditions
- intuitive user interface
Discussion:
- The philosophy of this paper is quite similar to earlier data-driven work: finding the best matches from an existing image database. The system in this paper has a rather small dataset (13,000 instances) compared to the scale of millions of images, and we can see from the results that the synthesized images are not very realistic. I believe the performance could be improved dramatically by increasing the scale of the dataset.
- What is the cost of building the dataset? Is there any way to improve its efficiency?
The paper proposes a data-driven approach to adding new objects to an image. The key contributions of the paper are:
-- Creating a database of objects drawn from the LabelMe library. The true object size and orientation, as well as the camera pose and height corresponding to each object, are also stored.
-- Estimating global lighting conditions for the main surfaces: sky, ground and vertical plane.
-- Object segmentation with a shape prior and modified Poisson blending to prevent discoloration.
-- Transferring shadows from the original image to the target image (a toy sketch follows).
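A toy sketch of the shadow-transfer idea as summarized above: darken the destination ground under the shadow matte that came with the source object. The paper additionally reshapes the shadow using target-scene statistics, which is omitted here; strength is a made-up knob:

# Naive shadow transfer: attenuate the scene under the source's shadow matte.
import numpy as np

def transfer_shadow(scene, shadow_matte, strength=0.5):
    # scene: HxWx3 float image in [0, 1]; shadow_matte: HxW in [0, 1]
    attenuation = 1.0 - strength * shadow_matte[..., None]
    return scene * attenuation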
Discussion:
1. I am not clear as to how the prior distribution of the camera pose was estimated using two known objects.
2. Can we think in terms of eliminating the user completely from this task? For example, can we have the option to “populate” the source image or “make it look busy”? An option that adds contextually correct objects automatically to the image and makes it look busy?
The photo clip art tool is meant to provide an easy way for common users to insert any object into an image. Candidate objects are queried from the LabelMe dataset, and the ones that best match the original image based on lighting, coloring, orientation, etc. are shown first to the user. The selected object is then extracted and inserted into the original image using Poisson blending.
Discussion:
1. Since the estimation of the true size of the object requires that the object be resting on the ground plane, can the user introduce an object that isn't grounded, such as a bird, into the scene?
2. Has this been tried with any other datasets? One interesting dataset for this task could be the IKEA 3D dataset by Torralba's group for indoor scenes.
3. The authors mention that most object classes are of outdoor scenes due to the dataset being biased towards outdoor scenes and as they mention "....also due to ground-based objects mostly appearing outdoors." Are they referring here again to the dataset or outdoor scenes in general? It doesn't seem right that indoor scenes in general don't contain ground-based objects.
This paper introduces an interactive tool for object insertion. It allows the user to insert an object into any photograph without the need for the user to do any image composition. First, the user uploads a picture, chooses a specific 3D location at which the object should be inserted, and selects a class of object to insert. Using a photo clip art library, the system retrieves the set of instances of the selected object class that best match the scene's characteristics and conditions. Automatic segmentation and blending are finally performed.
Research Question:
The paper presents a system that inserts complete objects into the scene. It would be interesting to see how this work could be extended to insert objects partially instead of as a whole, so that new objects can be occluded by or merged with other parts of the scene.
This paper presents a system for inserting objects from a photo clip art library into an already existing photograph. The user picks the location in the scene for the object, and then selects the actual object from a menu containing possible candidates from the LabelMe data set. The paper suggests retrieving and sorting objects based on similarity in camera pose, lighting, and resolution to the scene instead of having to manipulate the object itself. The selected object is then inserted into the scene using graph-cuts and Poisson blending.
Given some other data set that contains the appropriate objects, what additional knowledge is needed in order to extend the task to inserting non-grounded objects or into indoor scenes?
The paper presents a method of inserting segmented objects into an image. The objects to be inserted are found using the LabelMe database, which lets researchers submit, share, and label images. The user selects the object to be inserted from a ranked list, ordered by a weighted linear combination of camera orientation, global lighting conditions, local context, resolution, and segmentation quality. The object is then inserted into the image by computing a blending mask and using Poisson blending.
Discussion:
How valid is the assumption that the ground plane is orthogonal to the image plane? Wouldn't errors introduced by this assumption propagate over several images when calculating the prior height distribution and camera pose? Or are images in which the ground plane is visibly tilted with respect to the camera axes discarded?
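For intuition about what that assumption buys, here is a tiny numeric sketch of the standard level-ground relation h = H * (v0 - b) / y_c for an object's pixel height (v increasing upward); the numbers are made up:

# Pixel height of a 1.7 m person under the level-ground assumption.
v0, y_c, H = 250.0, 1.6, 1.7          # horizon row, camera height (m), true height (m)
for b in (10.0, 110.0, 200.0):        # bottom rows; closer to v0 = farther away
    h = H * (v0 - b) / y_c
    print(f"bottom row {b:.0f}: pixel height {h:.0f}")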
While inserting an object in the UI, the paper states that the objects are rendered back to front to account for occlusions. Does this mean that the occluded parts are filled in synthetically?
Could the pose of the camera with respect to the object itself be used as another matching criterion? For example, if an object captured and annotated on the right side of an image is positioned on the left side of another image, the result could look odd. Or is it safer to always leave this to the user?