Wednesday, February 24, 2016

Fri, Feb 26 - Understanding Deep Image Representations by Inverting Them

Understanding Deep Image Representations by Inverting Them. Aravindh Mahendran, Andrea Vedaldi. CVPR 2015.

arXiv

21 comments:

  1. Abstract:

    The paper describes a technique to visualize various image representations, both shallow and deep. The method uses only the representation itself and a generic image prior, not information about the activations of individual units in the net. It works by gradient descent on a regularized objective function: the loss is the normalized Euclidean distance between the representation of the reconstruction and the target representation, and the regularizer incorporates image priors by keeping the final image within a bounded intensity range and distributing gradients evenly over the image. Finally, they show that HOG and SIFT can be expressed as CNNs, perform reconstruction from all three representations, and compare the reconstructed images.
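    A minimal NumPy sketch of that objective (my own simplification, not the authors' code): a normalized Euclidean loss on the representation, plus an α-norm prior keeping pixels bounded and a total-variation prior spreading gradients evenly. The function names and weights are illustrative assumptions.

```python
import numpy as np

def inversion_objective(x, phi, phi0, lam_alpha=1e-5, lam_tv=1e-3, alpha=6, beta=2):
    """Sketch of the regularized inversion objective.
    x: candidate image (H, W); phi: representation function; phi0: target code."""
    # Normalized Euclidean loss between the representations
    loss = np.sum((phi(x) - phi0) ** 2) / np.sum(phi0 ** 2)
    # Alpha-norm prior: discourages pixel values outside a bounded range
    r_alpha = np.sum(np.abs(x) ** alpha)
    # Total-variation prior: encourages piecewise-smooth images
    dx = np.diff(x, axis=1)[:-1, :]   # horizontal finite differences
    dy = np.diff(x, axis=0)[:, :-1]   # vertical finite differences
    r_tv = np.sum((dx ** 2 + dy ** 2) ** (beta / 2))
    return loss + lam_alpha * r_alpha + lam_tv * r_tv
```

    With a perfect reconstruction the loss term vanishes and only the small prior terms remain.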

    Discussion:

    1) I guess the reconstruction looks qualitatively good for hand-crafted features because the deep CNN is trained for classification on a large dataset with various poses and orientations of the same objects. How would features generated by a sparse autoencoder do on the reconstruction task, where the goal is to preserve the features?

  2. summary:
    This paper introduces an extra total variation regularizer (which keeps piecewise-smooth patches) used with gradient descent to invert an image encoded by a CNN. The key observation is that CNNs tend to encode the "sketch" or abstract aspects of the image, so multiple similar images invert to the same code. Another contribution of the paper is the implementation of DSIFT and HOG as CNNs, with a design based on directional filtering and gating in the network.

    question:
    As mentioned by the paper ([26] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013), a small adversarial perturbation added to a correctly classified image can sometimes lead to a misclassification. I'm really curious how the inverted images look at different layers of CNN-A: since the CNN gradually summarizes the input image as a "sketch", will the inverted image at deeper layers look more like the misclassified result, while the shallower layers still keep the input information?

  3. This paper attempts to visualize CNNs. More specifically: given an encoding of an image, can you reconstruct the image? They use gradient descent, starting from random noise, to optimize an image whose encoding matches the target.

    Q.

    1. Why minimize the difference in features, and not the image reconstruction error?

  4. Inspection of the intermediate representations of image classifiers can be a very useful debugging tool to figure out exactly what is being learned in these representations. This paper attempts to reproduce initial images using only information from the image representation and an image prior.

    Questions:

    1) Has any evaluation been done of other loss functions (e.g. reprojection error) that might work for this task?

    2) Has any work been done to explore different options for regularization?

  5. The authors aim to better understand the information contained in the different layers of CNNs by proposing a general method to reconstruct an image given its feature representation. They provide a regularized objective function whose optimization gives the best possible reconstruction. For regularization, a natural image prior forces the objective to be minimized within the space of natural images. They discuss the form of the objective function and the considerations behind it. To optimize it, they use an extended version of gradient descent, which requires the image representations composed into the loss function to be differentiable. This leads to the interesting part of the paper, where they show how DSIFT and HOG, classic shallow representations, can be modeled faithfully by CNNs, conveniently making their derivatives computable. Finally, they run their method on the three CNNs and visualize the outputs to compare the amount of information in each.
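    A toy sketch of that optimization (my own simplification, not the authors' code): invert a code phi0 = A @ x produced by a stand-in linear "representation" using momentum gradient descent. The real method uses a CNN for the representation and adds the image priors; here the linear map makes the gradient available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 16))   # stand-in "representation": phi(x) = A @ x
x_true = rng.standard_normal(16)
phi0 = A @ x_true                  # target code to invert

x = np.zeros(16)                   # start from a blank "image"
v = np.zeros_like(x)               # momentum buffer
lr, momentum = 1e-3, 0.9
for _ in range(3000):
    grad = 2 * A.T @ (A @ x - phi0)   # gradient of the squared representation error
    v = momentum * v - lr * grad
    x = x + v

# Normalized reconstruction error in representation space
recon_err = np.sum((A @ x - phi0) ** 2) / np.sum(phi0 ** 2)
```

    Because the system is underdetermined (8 code dimensions, 16 "pixels"), many images share this code; the priors in the paper are what select a natural-looking one.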

    Discussion:
    For the image prior, they use the alpha norm and the total variation of the reconstructed image. How would choosing a different or a related image as the prior affect the reconstruction?
    Would the sensitivity and bias of the hidden units help overcome these differences?

  6. CNNs aren’t generative models. But you can use the variables from within a network (after an image has been passed through it for classification) to recover what the original image was, essentially making them generative with a little extra help. This paper shows images generated this way not only from CNNs but from other visual representations as well (HOG, bag of visual words). The resulting generated images give insight into what information is being passed through the algorithms and ultimately what they base their decisions on.

    Discussion:
    I'd love to see a figure showing rebuilt images vs. number of training samples.

    Replies
    1. This might provide insight when CNNs don't train.

  7. The paper introduces a technique to invert shallow and deep representations by simple gradient descent over a regularized objective function. The authors show that using image priors helps recover fine image statistics that are lost in the compressed representations. The visualization at each layer also gives an idea of the representation learned by the CNN.

    Question -
    As mentioned in the paper, the optimization algorithm requires derivatives. The authors implement DSIFT and HOG in the CNN framework. Can we discuss this in detail?
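    A hypothetical miniature of the directional-filtering-plus-gating idea (sizes, filters, and names are my own, not the paper's): project the image gradient onto K orientations with directional filters, then gate so each location contributes only to its dominant orientation bin.

```python
import numpy as np

K = 4                                          # number of orientation bins
img = np.zeros((6, 6)); img[:, 3:] = 1.0       # toy image: a vertical edge

# Gradient via centered-difference "convolution" filters [-1, 0, 1] / 2
gx = np.zeros_like(img); gy = np.zeros_like(img)
gx[:, 1:-1] = (img[:, 2:] - img[:, :-2]) / 2   # horizontal derivative
gy[1:-1, :] = (img[2:, :] - img[:-2, :]) / 2   # vertical derivative

angles = np.arange(K) * np.pi / K
# Directional filtering: project the gradient onto each orientation...
proj = np.stack([gx * np.cos(a) + gy * np.sin(a) for a in angles])
# ...and gating: a bin responds only where its projection is maximal.
binned = np.where(proj >= proj.max(axis=0, keepdims=True), proj, 0.0)
```

    On this vertical edge the gradient is purely horizontal, so only the 0-degree bin wins at the edge pixels; all operations are differentiable or piecewise so, which is what lets the paper backpropagate through HOG/DSIFT.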

  8. This paper dives into the task of understanding how different feature representations encode information: essentially, what information is captured by these feature detectors, and how does the reconstructed image compare to the original? They minimize a regularized objective constrained by a specifically defined image prior; the loss is the Euclidean distance between the representation of the reconstruction and the target representation. Momentum-based gradient descent is used for optimization. Shallow representations like DSIFT and HOG are presented alongside deeper CNN representations, and the results are showcased for all of them. The outputs give good insight into what the various features primarily focus on, especially for CNNs, where invariance increases as you go deeper and the final representation is a sketch containing the distinguishing information.

    Questions-

    1. Need further explanation about the regularization terms.

  9. The authors of the paper explore the problem of reconstructing an image, given its encoding, by inverting its representation. They do so by finding the inverse of a modelled function whose input is the image, converting this to a regularised regression problem that minimizes an objective function. The objective is based on the Euclidean norm between the target and generated representations and has a regularization term called the TV regulariser (TV stands for total variation). The optimization step performs gradient descent on the objective. Finally, they show that HOG and SIFT can be implemented as CNNs and perform inversion analysis on HOG, SIFT, and CNN representations to obtain and compare reconstructed images. The visualizations of the reconstructions help better understand the representations learnt at each layer.

    Discussion:
    1. Why is the second total variation regulariser 'richer' or better than the previous one? If the TV regularizer were not used, would the reconstruction be unable to recover the low-level image statistics that the TV regularizer claims to recover?
    2. One cool thing about this paper is the implementation of HOG and DSIFT as CNNs using approximated bilinear binning. This, they claim, reduces computation time, presumably because the optimization can then be moved to GPU code.
    3. They show that CNNs are not much harder to invert than shallow representations even though they are much deeper. But they also mention "that different representations can be easier or harder to invert". What makes one easier or harder? Just the time to optimize?

  10. This paper attempts to reconstruct images from both shallow and deep image descriptors. It also shows that shallow descriptors can be implemented as CNNs in order to perform gradient descent. The different descriptors are then used to reconstruct images, and performance is compared.

    Discussion:
    How does the choice of architecture for the network used to form deep representations affect the quality of the reconstruction at various layers? Would a different architecture have similar results or would they vary dramatically?

  11. The paper presents a novel method to reconstruct an image from its HOG, SIFT, or CNN-based representation. The authors formulate the problem as minimising the error between the given image representation and the representation of the reconstructed image. A regularisation term is added to avoid spikes in the reconstructed image. The authors then conduct several experiments to study the CNN representation of the image at the different layers.

    Discussion:
    -- The paper concludes that the deeper layers of the network capture just a sketch of the objects, evident from the blurry, sketch-like reconstructions obtained at the higher layers. I am curious what we could expect if a CNN were trained on the reconstructed images. Would this CNN be capable of classifying normal images?

  12. The authors present a method for inverting deep and shallow image representations. The method uses a generic natural image prior and information from the image representation. An objective function and gradient descent are used to create a visualization that represents what information is being captured by the image representation.

    Question:

    What's the reasoning behind using Euclidean distance as the loss function?

  13. This paper introduces a mechanism to reconstruct the possible images behind a particular image representation, whether produced by encoding algorithms such as HOG and SIFT or by the deeper layers of convolutional nets. An optimization process attempts to reverse the encoding, mapping a code back to natural image space, in order to gain insight into the encoding process itself. The hypothesis is that by determining which elements the encoding process considers important (i.e., does not filter away), we can understand why these representation algorithms work the way they do.

    Question/Discussion :
    1) A review of the derivation of the two natural image priors would be helpful, as would a review of what is meant, mathematically, by the "natural image" vector space.
    2) Could an image that activates a very similar RF (as discussed on Wednesday) as the one generating a representation be successfully used as an image prior? Would this be appropriate (i.e. a reasonable choice)?
    3) Could this mechanism be modified, or used in conjunction with the Places visualization methodologies discussed on Wednesday, to generate "photo-realistic" images? Is there any research interest in this topic?

  14. This paper presents an approach for visualizing the layers in a deep network; the major contribution is a visualization that gives an intuition for how deep networks “see” images at each layer. The approach computes the inverse of an image representation by minimizing a loss function (between the representation of the candidate image and the target representation) plus a regularizing term. This inversion is optimized using gradient descent.

    Questions:
    I am not sure how to meaningfully interpret the visualization in Fig. 6. Can we go over what is going on there? Why are the images getting "fuzzier"?

  15. This paper presents a generalized method to invert learned image representations such as HOG, DSIFT, and CNNs. It shows how CNNs retain the structures of the image that are useful for the task. They propose a loss function whose minimizer reconstructs the image and use gradient descent to optimize it. They validate their hypothesis with qualitative results.

    Questions: How would a 1000-dimensional vector scale back up to an image? I did not fully understand the loss function.

    Would this inversion be applicable to any layer, and would intermediate layers be able to reconstruct the image as well?

  16. This paper proposes an optimisation method to invert shallow and deep representations based on optimizing an objective function with gradient descent. A key point is the use of image priors such as the Vβ norm that can recover the low-level image statistics removed by the representation. This technique is applied to CNNs, and the visualizations shed light on the information represented at each layer.

    Discussion:
    The choice of the value of beta in the TV regularizer term is unclear. Couldn't an adaptive beta be used, based on the gradients of the region?
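    One way to see the role of beta (a toy illustration of my own, not from the paper): with beta = 1 the regularizer tolerates isolated spikes, while beta = 2 penalizes a one-pixel spike more than a smooth ramp with the same total intensity change.

```python
import numpy as np

def tv_beta(x, beta):
    """Total-variation-style prior: sum over pixels of |gradient|^beta."""
    dx = np.diff(x, axis=1)[:-1, :]   # horizontal finite differences
    dy = np.diff(x, axis=0)[:, :-1]   # vertical finite differences
    return np.sum((dx ** 2 + dy ** 2) ** (beta / 2))

spike = np.zeros((8, 8)); spike[4, 4] = 1.0        # single bright pixel
ramp = np.tile(np.linspace(0, 1, 8), (8, 1))        # smooth horizontal ramp
```

    Here tv_beta(spike, 1) is smaller than tv_beta(ramp, 1), but the ordering flips at beta = 2, which is consistent with using beta > 1 to avoid spike artifacts in the reconstructions.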

  17. This paper proposes a method for inverting CNN representations. They use a regulariser with a natural image prior and try to minimise a loss function, to get an inversion of what the net thought the image looked like. The loss function chosen is the Euclidean distance. They use a TV regulariser to encourage images to consist of piecewise consistent patches.

    Question:
    I was confused about how the regularizer and the loss function are chosen, given that any layer's representation is a high-dimensional vector.

  18. This paper presents a method of visualizing image representations at different layers. The method computes the inverse image representation by using gradient descent on a regularized objective function. The authors show the use of shallow HOG and SIFT features as CNNs and reconstruct images for comparison. The reconstruction visualizations provide insight on the information learned at each layer.

    Can this method be modified somehow for generative tasks?

  19. The paper introduces an optimization method to invert shallow and deep representations by optimizing an objective function with gradient descent. The proposed method uses image priors to help recover low-level image statistics. The paper describes how HOG and DSIFT can be implemented as CNNs, simplifying the computation of their derivatives. The information represented at each layer is visualized in the paper.

    Discussion: What's the rationale behind choosing Euclidean distance for the loss function? Also, gradient descent is used for optimization; what other optimization techniques could be used, considering the non-linearities of the representations?

  20. This paper presents a novel approach to better understand different image representations, shallow and deep, by inverting them and trying to reconstruct the original images. To invert these representations and compare reconstruction errors, DSIFT and HOG were implemented in a standard CNN framework so that derivatives could be computed. The inversion is achieved by minimizing an objective function; different choices of the loss and regulariser are discussed and reported. Gradient descent was used to optimize the reconstruction.

    From the objective function, it is not clear how the regulariser captures a natural image prior.
