Friday, April 8, 2016

Mon, April 11 -- Learning to Generate Chairs

Learning to Generate Chairs, Tables and Cars with Convolutional Networks. Alexey Dosovitskiy, Jost Tobias Springenberg, Maxim Tatarchenko, Thomas Brox. CVPR 2015.

arXiv

22 comments:

  1. This paper presents a mechanism by which images of objects can be generated from high-level descriptions: a desired style, orientation with respect to the viewer, and color. Given that the process is conceptually the inverse of a standard recognition/classification task, it is no surprise that the architectures they explored with success resemble inverted convolutional nets, consisting of several fully connected layers feeding unpooling+convolution layers, which model both the final images and segmentation masks. The generalizability of these trained networks is demonstrated with examples of model transformations, interpolation between viewpoints, styles and even different objects, and high-level "arithmetic" whereby simple changes made in feature space at the early FC layers lead to legitimate new images reflecting these conceptual descriptor changes. The fact that this CNN can interpolate between objects is shown to positively impact its ability to form correspondences between similar images, which it does very well, better in most cases than SIFT flow in terms of pixel error.

    Questions/Discussion

    1) What is the purpose of the segmentation mask?
    2) Why did they only use images of rendered objects?
    3) If the values filled in during unpooling were learned (instead of just assumed to be 0), could the results have improved (e.g., finer details with the larger training sets)? Would that kind of approach have led to any problems? (A sketch of the fixed unpooling is below.)
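    To make question 3 concrete, here is a minimal sketch (NumPy; names are mine, not from the paper) of the fixed 2x2 unpooling the paper describes, where each value goes to one corner of a 2x2 block and the other three positions are simply filled with zeros rather than learned values:

      import numpy as np

      def unpool2x2(feat):
          # feat: (channels, h, w) feature map from the previous layer
          c, h, w = feat.shape
          out = np.zeros((c, 2 * h, 2 * w), dtype=feat.dtype)
          out[:, ::2, ::2] = feat   # one corner of each 2x2 block; the rest stays 0
          return out

      x = np.random.rand(256, 8, 8).astype(np.float32)
      print(unpool2x2(x).shape)     # (256, 16, 16)

    The convolution that follows each such unpooling is then what has to "fill in" those zeroed positions.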

  2. These authors trained a CNN to generate images of chairs, tables, and cars. Their CNN looks a lot like a classification CNN flipped upside down, with unpooling + convolution layers in place of convolution + pooling, and a split pipeline (for the RGB image and the segmentation image). The training set for chairs consisted of computer-generated models rendered with a standard lighting/physics-based pipeline; the same goes for cars. Using computer-generated models made it easy to obtain correctly tagged orientation values. They tried several different network architectures, with variations in convolutional levels, channels, and streams. The chair network handles zooming out well, even though it was never trained to do so. It struggles with armrests and other fine details. The chair network can generate new viewpoints not seen in the training data, and can generate new chairs by interpolating between two or more different models (implying structural knowledge within the network).


    Discussion: Has any work been done using CNNs to generate images of objects using training sets with non-computer-generated images?

  3. In this paper, the authors propose using a CNN to generate images given high-level abstractions such as class, transformation parameters and viewpoint. They define a network of fully connected, un-pooling and convolution ("up-convolution") layers to achieve this objective, and generate meaningful images using rendered images as a training set.
    The more interesting part of the paper, though, is the experiments and analyses performed. The authors not only show how to generate images from high-level information, but also demonstrate new-viewpoint generation, morphing between different object classes, interpolating new styles for a given class, performing feature-space arithmetic with corresponding effects in the image, and generating new images at random. The coolest results were those of the activation-layer analyses, which showed that different neurons handle different aspects of the transformations, something explored in a previous paper. Both single-neuron and neuron-group activation analyses shed light on the inner workings of the network.

    Discussion:
    1. For interpolating between styles, how do the authors know the network isn't simply picking and choosing various parts based on the style parameters, rather than generating each image almost from scratch? (See the interpolation sketch below.)

    2. I realize this must be the question on everyone's minds, but how would the dense-correspondence feature perform on real images? I imagine generating natural renderings of elephants is a much more involved process than doing it for inanimate objects.
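    Regarding question 1, my understanding (a sketch with hypothetical names; only the idea of blending the one-hot style inputs comes from the paper) is that interpolation just means feeding a convex combination of two style vectors, so every intermediate image is produced by the full generator rather than by copying parts:

      import numpy as np

      num_styles = 1393                    # rendered chair models in the Aubry et al. dataset

      def one_hot(index, length):
          v = np.zeros(length, dtype=np.float32)
          v[index] = 1.0
          return v

      style_a = one_hot(12, num_styles)    # two arbitrary chair styles (indices are made up)
      style_b = one_hot(541, num_styles)

      for alpha in np.linspace(0.0, 1.0, 5):
          style_mix = (1.0 - alpha) * style_a + alpha * style_b
          # image = generator(style_mix, view, transform)   # trained network; hypothetical API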

  4. Summary:
    This paper presents a supervised generative CNN architecture, showing that a 3D representation of objects is learnable given a dataset of 3D object models. As shown in Figure 1 of the paper, the architecture takes three vectors representing style, view and artificial transformation (in-plane rotation, zoom, etc.), encodes them through FC layers into a single vector, and feeds that to separate up-convolutional streams that learn the segmentation mask and the reconstruction of the object. The network also shows the ability to find dense correspondences.


    Question/Discussion:
    So far we've seen a lot of "deconv"/"upconv" layers; can we get a summary of the differences between all these "deconv" variants and why those differences suit different tasks? (A small sketch contrasting two of them is below.)
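    Not a full answer, but a small PyTorch sketch (my own, not from the paper) contrasting two of the variants: the paper's up-convolution, i.e. fixed zero-fill unpooling followed by an ordinary convolution, versus a strided transposed convolution ("deconvolution"), which learns the upsampling and the filtering jointly:

      import torch
      import torch.nn as nn

      def zero_unpool2x(x):
          # Fixed unpooling: each value goes to one corner of a 2x2 block, zeros elsewhere.
          n, c, h, w = x.shape
          out = x.new_zeros(n, c, 2 * h, 2 * w)
          out[:, :, ::2, ::2] = x
          return out

      x = torch.randn(1, 256, 8, 8)

      # (a) Up-convolution as in this paper: zero-fill unpooling, then a normal 5x5 conv.
      conv = nn.Conv2d(256, 128, kernel_size=5, padding=2)
      y_upconv = conv(zero_unpool2x(x))                    # -> (1, 128, 16, 16)

      # (b) Transposed convolution: a single learned layer does the upsampling.
      deconv = nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1)
      y_deconv = deconv(x)                                 # -> (1, 128, 16, 16)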

  5. Abstract:

    The paper presents a generative neural network architecture that generates images given viewpoint and transformation parameters. They use a reversed convolutional architecture in which conv+pool layers are replaced with unpool+conv layers. They feed this network class, viewpoint and transformation information, and it generates a segmentation mask and an RGB image. The network is trained using a Euclidean loss on both of these generated outputs. The results show that the network is able to capture the 3D geometry of objects and does well on various tasks such as viewpoint and transformation manipulation and new image generation.
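    A minimal sketch of that Euclidean training objective as I read it (PyTorch; the variable names and the relative weighting are my assumptions, not from the paper): a squared reconstruction loss on the RGB output plus one on the segmentation mask, summed and backpropagated together.

      import torch
      import torch.nn.functional as F

      # pred_rgb, pred_mask stand in for the two output streams of the generator;
      # target_rgb, target_mask come from the rendered training image.
      pred_rgb    = torch.rand(8, 3, 128, 128, requires_grad=True)
      pred_mask   = torch.rand(8, 1, 128, 128, requires_grad=True)
      target_rgb  = torch.rand(8, 3, 128, 128)
      target_mask = (torch.rand(8, 1, 128, 128) > 0.5).float()

      lambda_mask = 0.1   # relative weight of the mask term -- my assumption
      loss = F.mse_loss(pred_rgb, target_rgb) + lambda_mask * F.mse_loss(pred_mask, target_mask)
      loss.backward()     # in the real network, gradients from both terms reach the shared FC layers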

    Discussion:

    1) Why would having a separate segmentation mask output be helpful?

    2) In Section 3.3, how did they feed an image into FC-2?

  6. This paper describes a generative neural network that takes a 3D object description and creates a 2D image of it. It takes in the object style, a rotation angle, and a camera position, and from these the network generates a new picture of the object. To do this it uses fully connected layers followed by "up-convolutional" layers, which first unpool (upsample the feature map, filling the new positions with zeros) and then convolve with a given kernel size and stride. They also use the network to generate a segmentation mask alongside the image.

    Questions:
    1. During random probabilistic image generation (Section 3.3), what is the input to the network?

  7. This paper presents a method of generating 2D images from 3D object models using an upside-down CNN, which begins with FC layers and then continues with a series of unpooling + convolution layers. The network not only learns to generate images of the models it was trained on, but also mixtures of two or three models, or new models of the same object class.
    Questions
    How did adding a convolution after every unpooling+convolution layer improve the quality of the image?
    In Section 3.3, how do random samples drawn from FC-2 show that the network has learned a good latent structure?

  8. This paper proposes the use of up-convolutional generative networks to render images of objects given the object name, viewpoint, color, brightness, etc. This is in essence the inverse of CNNs used for recognition, so the network architecture looks like a CNN turned upside down, with 5 FC layers first, followed by 4 up-conv layers. The network is trained with 3D models of objects like chairs, tables and cars. The network is able to generalize over the input parameters and generate 2D images from them, and the authors show that it does so not by overfitting the training data, but by learning the relationship between the input parameters and the output image.

  9. In this paper, a novel generative neural network model is presented. It can generate objects in unseen poses, or morph one object into another, based on the learned encoding. The model is built using up-convolution layers, which learn a mapping from the high-level encoding to the generated image and the segmentation mask. They experiment with various architectures and analyze a variety of tasks such as correspondence matching, image generation from unseen poses, and interpolation between objects within a class and between classes.

    Questions:

    Could you explain how the one-hot encoding is used?

    I am also unsure what is given as input when they want to interpolate between classes. Could you explain Fig. 6? (A guess at the input encoding is sketched below.)
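    For the first question, my reading (a sketch; the exact layout and viewpoint parameterization are my assumptions) is that the style/class input c is a one-hot vector over the training models, while the view and transformation parameters are small real-valued vectors; interpolation then just blends the one-hot entries of two models.

      import numpy as np

      def one_hot(index, length):
          v = np.zeros(length, dtype=np.float32)
          v[index] = 1.0
          return v

      num_models = 1393                                      # models in the chair dataset
      c = one_hot(42, num_models)                            # which training model to generate
      v = np.array([0.8, 0.6, 0.3], dtype=np.float32)        # viewpoint parameters (illustrative)
      t = np.array([1.0, 1.0, 1.0, 0.8], dtype=np.float32)   # color/brightness/saturation/zoom (illustrative)

      net_input = np.concatenate([c, v, t])
      # (in the paper, each of c, v, t first passes through its own FC layers
      #  before being merged into one hidden vector)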

  10. The authors train an up-convolutional neural network that generates images of chairs, tables and cars based on a collection of attributes. The attributes include object style, viewpoint and optional transformation parameters: color, brightness, saturation and zoom. The main idea is that the net first creates a 1024-unit representation of the desired image attributes, and then performs a series of up-convolutional layers (unpooling + convolution layers). At each up-convolutional layer, the spatial span of the feature maps increases. The end of the net produces a 3-channel RGB image and a segmentation mask. Two architectures were examined: one where both the image and the mask are created by the same up-convolutional stream, and one where each has its own stream of up-convolutional layers. The authors show that the 1024-unit feature space before the convolutional layers can be manipulated to transition between attributes, and also to perform attribute arithmetic (sketched below).
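    A sketch of how I picture that attribute arithmetic (function names are hypothetical; the real encoder is the trained FC stack, not a random matrix): encode different attribute combinations into the hidden space, add and subtract the codes, and decode the result.

      import numpy as np

      rng = np.random.default_rng(0)
      DIM_STYLE, DIM_VIEW, DIM_HIDDEN = 1393, 4, 1024
      W = rng.standard_normal((DIM_HIDDEN, DIM_STYLE + DIM_VIEW)).astype(np.float32)

      def one_hot(i, n):
          v = np.zeros(n, dtype=np.float32)
          v[i] = 1.0
          return v

      def encode(style, view):
          # Stand-in for the FC layers that map attributes to the 1024-unit code.
          return W @ np.concatenate([style, view])

      front = np.array([1, 0, 0, 0], dtype=np.float32)
      side  = np.array([0, 1, 0, 0], dtype=np.float32)
      chair_a, chair_b = one_hot(3, DIM_STYLE), one_hot(77, DIM_STYLE)

      # "chair_a from the front" - "chair_b from the front" + "chair_b from the side";
      # the hope is that decoding h renders chair_a from the side. (With this linear
      # stand-in the identity is exact; in the real network it only holds approximately.)
      h = encode(chair_a, front) - encode(chair_b, front) + encode(chair_b, side)
      # image = decoder(h)   # remaining FC + up-convolutional layers (hypothetical)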

    Question:

    Can we get a detailed description of the differences between an up-convolutional layer and a deconvolutional layer?

  11. They train an up-convolutional neural network that produces (images of) chairs, tables and cars based on some attributes. The network also generates a segmentation mask.

    Q.

    Does Deconvolutional = up-conv?

    Replies
    1. Is generating objects easier to do with 3D models? Is that why they used them?

  12. This paper proposes the use of a CNN for generating object images given parameters such as style, color, and viewpoint. The approach is especially novel because past uses of deep networks were for discriminative tasks such as classification, whereas this model is able to generate new images. Given the generative nature of the task, this approach requires the model to learn implicit representations of object classes and a high-level understanding of abstract concepts such as style.

    1. Can we extend this work to generate scenes?
    2. Can this algorithm generate hybrid objects? For example, can it generate a table that is also a chair?

  13. This paper discusses a very interesting and novel idea: generating 2D images from high-level information about 3D renderings, including style, orientation, pose, transformation parameters, etc. Here they use a CNN as a generative model, rather than in the discriminative way it is used in almost all applications. The authors have essentially turned the CNN upside down, with inputs provided to the FC layers and the output taken from the lowest layer. Unpooling + convolution ("up-convolution") layers are included, which essentially scale images up rather than down. They make use of the dataset provided by Aubry et al., which includes 1393 rendered chair models. The authors explain in depth how the CNN not only learns the 3D-to-2D mapping, but can also interpolate among views and angles it was not trained on, as well as between different styles. An in-depth analysis of the network is carried out in the later sections as well.

    Questions-
    1. Could they have implemented a different un-pooling scheme instead of the one they use? Does their current scheme limit the level of detail in the outputs?

  14. This paper presents a CNN architecture for generating images given style, viewpoint and color. The authors experimented with training various network architectures on 3D object models and show that CNNs can learn these types of representations. The CNN uses FC layers to encode the input and then uses unpooling+conv layers to generate the image and predict the segmentation mask.

    Are there any particular advantages of directly outputting the segmentation mask?

  15. In this paper the authors use a two-stream CNN that generates 128 x 128 pixel images of objects. They train an 'up-convolutional' network which acts as a generator, using a set of images with high-level descriptions as the training set. The use of up-convolution is interesting as it runs in the opposite direction of a normal CNN used for recognition tasks: whereas the pooling layers of such a CNN cause significant downsampling, here the unpooling causes upsampling. The network performs well on the generation task and even generates images unseen in the training data, such as new viewpoints and object styles.

    What is the purpose of generating an object segmentation mask along with an image?

  16. The paper presents a novel method to generate natural images of chairs, tables and cars given the desired style, viewpoint and other transformation parameters such as color, brightness, etc. The training images are rendered from 3D object models, and the model description (style, viewpoint, transformation parameters) forms the input to the generative model, which is formulated as an upside-down CNN. The first layers are fully connected layers that map the input parameters into a higher-dimensional representation. The subsequent layers consist of un-pooling and convolutional layers that generate the desired image. The experiments show that this model is capable of capturing the 3D geometry of the object and is effective at generating new viewpoints and poses not seen in the training set.

    Questions:
    How do we obtain the 8x8x256 feature map from the FC-5 layer in Figure 1?
    Could you please explain the 2x2 unpooling and 5x5 convolution in Figure 2? (A sketch of both steps follows.)
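    My understanding of those two steps, as a small PyTorch sketch (the 8x8x256 size is from Figure 1; the other layer sizes here are illustrative): FC-5 outputs 8*8*256 = 16384 values, which are simply reshaped into an 8x8 map with 256 channels; each 2x2 unpooling then doubles the spatial size by inserting zeros, and the following 5x5 convolution (with padding) fills in and smooths the enlarged map.

      import torch
      import torch.nn as nn

      fc5  = nn.Linear(1024, 8 * 8 * 256)                    # FC-5: 16384 output values (input size illustrative)
      conv = nn.Conv2d(256, 128, kernel_size=5, padding=2)   # 5x5 convolution (channel count illustrative)

      def unpool2x2(x):
          # 2x2 unpooling: each value goes to one corner of a 2x2 block, zeros elsewhere.
          n, c, h, w = x.shape
          out = x.new_zeros(n, c, 2 * h, 2 * w)
          out[:, :, ::2, ::2] = x
          return out

      h = torch.randn(1, 1024)               # output of the previous FC layer (size illustrative)
      x = fc5(h).view(1, 256, 8, 8)           # reshape the flat FC-5 output into an 8x8x256 map
      x = conv(unpool2x2(x))                  # unpool to 16x16, then 5x5 conv keeps it at 16x16
      print(x.shape)                          # torch.Size([1, 128, 16, 16])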

  17. This paper details a deep-learning method using CNNs to generate 2D projections of chairs, tables, or cars. Trained on 3D models, their CNN relies on "unpooling" layers to map the output of a multi-stream network to a higher-resolution image. They show that their network is capable of transferring knowledge between classes to improve the output for another class (e.g., chairs learning from tables).

    Question:
    Is there any difference between their "unpooling" layers and the deconvolutional layers in previous work? They mention that it is similar, but I didn't see where they described how it was different.

    If they have shown knowledge transfer between classes, could they benefit from training on a larger set of classes? At what point would this benefit start having diminishing or negative returns? They only train on three classes, which seems counterintuitive if multiple classes help the network.

  18. The paper describes a generative neural network that is trained on a set of 3D models and is capable of generating 2D projections of the models. The proposed network takes viewpoint, color, brightness, saturation and other parameters as inputs, and produces an RGB image along with a segmentation mask. The network is trained with backpropagation to minimize the Euclidean reconstruction error on both outputs. The paper also discusses the generalization abilities of the network and its performance when applied to the practical task of finding correspondences between different objects.
    Question: Why are segmentation masks being used here? The network outputs the segmentation mask directly; why is this better than inferring the mask from the RGB image?

  19. This comment has been removed by the author.

  20. The paper presents a generative model capable of producing images. The proposed model is inspired by ConvNets, is named 'up-convolutional', and is trained using standard backpropagation. The model takes as input a set of high-level information, such as style, pose and orientation, and transforms it into an RGB image. The authors effectively reversed a ConvNet: the high-level description is fed to fully connected layers first, and the image pixels are produced by the final (up-convolutional) layers, with the whole network trained end to end by backpropagation.

    Q: Any follow-up work to use the model to generate more realistic images?

  21. The paper presents a generative network that produces RGB images from high-level descriptors as input. It is basically an inverted ConvNet that takes as input high-level information like style, pose, viewpoint, color, brightness, etc. The network is trained using backpropagation, and a variety of loss functions were experimented with. The network shows interesting generalization properties, within the same class and across different classes, producing images it was not previously trained on.

    Discussion:
    Has there been any work on producing more detailed images, like human faces?

    How exactly do they describe style etc. in the input? Is it a one-hot encoding?
