Monday, February 22, 2016

Wed, Feb 24 - Object Detectors Emerge in Deep Scene CNNs

Object Detectors Emerge in Deep Scene CNNs. Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, Antonio Torralba. ICLR, 2015.

project page, arXiv

24 comments:

  1. Hi everyone,
    Found this on the internet. Might be useful to everyone who's experimenting with deep learning.
    http://videolectures.net/deeplearning2015_montreal/

  2. This paper is from most of the same authors as the paper we discussed on Monday, and it builds on the topics we discussed, particularly understanding the nature of deep features from networks trained on object-centric datasets vs. those trained on scene-centric datasets. It was found that object detectors emerged in the deeper layers of the CNN trained on the scenes dataset, implying that scene-trained CNNs might be capable of acting as object detectors as well, without ever having been trained on object-oriented datasets.

    Questions/Discussion:
    Could a scene-centric dataset be used in conjunction with an object-centric one to generate a scene-centric dataset with a combinatorial explosion in size? In the paper, they were able to simplify particular scenes down to a small number of constituent objects responsible for the same RF results, which seems to imply that new scene images could be generated by removing a constituent object from the original scene image and adding a different object in its place.

  3. This paper expands on the CNN visualization methods outlined in the last paper. After running many images through a CNN, the 100 images that produce the largest output at a specific unit in the network are collected and visualized. The representations become more meaningful at deeper levels of the network. The authors crowd-sourced labeling of these representations and found they differed considerably depending on the dataset the CNN was trained on; ImageNet produced a lot of animals as representations. They also perform image simplification, removing as much of the image as possible while the CNN still returns the correct output, to determine what the CNN is actually relying on.

    Discussion:
    I like this paper. It makes the insides of CNNs a lot less scary. Beyond just looking at it, how has the information about what's going on inside the network been used?
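
    Here is a minimal sketch (not the authors' code) of the top-activation step described above, assuming a PyTorch AlexNet-style model as a stand-in for Places-CNN; the unit index and K=100 are illustrative:

    import heapq
    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    model = models.alexnet(pretrained=True).eval()   # stand-in for Places-CNN
    preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])

    def unit_activation(img_path, unit):
        # Forward one image and read the chosen pool5 unit's strongest response.
        x = preprocess(Image.open(img_path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            fmap = model.features(x)          # pool5 feature maps: (1, 256, 6, 6)
        return fmap[0, unit].max().item()

    def top_k_images(image_paths, unit, k=100):
        # Keep the k images that drive this unit the hardest.
        scored = ((unit_activation(p, unit), p) for p in image_paths)
        return heapq.nlargest(k, scored)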

  4. Abstract:

    The paper builds upon the Places dataset paper we discussed earlier and shows that object detectors emerge inside a deep neural architecture trained for scene recognition. The authors show various visualizations to support this claim. The first simplifies the input image by removing segments one at a time, showing that the network's notion of a particular scene rests on some semantic understanding of its objects. They also show RFs for various layers of networks trained on ImageNet vs. Places: in the higher layers, ImageNet-CNN focuses more on simple elements and object parts, while Places-CNN focuses more on whole objects. In this way, object localization can be built from the scene recognition task alone.

    Discussion:

    1) Since this is done on a very large dataset and we have only limited segment-annotated data, could this be used along with segment-annotated data to push the state-of-the-art networks for object localization?

  5. In this paper, the researchers looked at what activates different layers in a CNN. They note that when describing a scene as a collection of object classes, it is not clear how the data should be segmented before being passed to the CNN. They argue that CNNs trained on the Places dataset tend to perform better on scene-related recognition than those trained on the ImageNet dataset (which has more iconic imagery). To demonstrate this, they check which images produce the largest activation at a given unit. They estimate the RF by passing images through the network and observing which units are activated the most. They then present the most-activating images to a series of AMT reviewers, who look at the data and decide whether there is an underlying theme to them.

    Questions:
    ImageNet and Places have a lot of images: they must have selected a subset of the images, but I don't think they mention how the subset was chosen. Choosing particular subsets could influence the results.

  6. The paper presents an exploration of the intermediate representations that a network learns while training on the Places dataset. As it turns out, the earlier layers are shown to learn simple color/gradient features, while the deeper layers learn object detectors.

    Questions:

    While it's clear from the conclusion and some of the intermediate results (e.g., Figure 13) that the intermediately learned object detectors can be used for localization, I'd be interested to see whether they perform significantly worse than networks trained specifically for object localization.

    -Stefano

  7. This paper focuses on understanding the nature of deep features from networks trained on object-centric vs. scene-centric datasets. A network trained to do scene classification can also develop object detectors, even when no notion of an object is provided.

    Q: Why are all the visualized RFs near the center?

  8. Summary:
    This paper, continuing from the Places-CNN result, illustrates that object detectors are learned in the later layers of the CNN, so a single forward pass through Places-CNN can yield both scene recognition and object localization. Image-simplification techniques are used to probe the CNN's recognition by iteratively removing the visual information that causes the smallest decrease in the correct classification score; the observation is that most scenes retain a few representative objects that best classify the scene. RFs are visualized and sent to AMT to be labeled: average precision in the later layers of Places-CNN is higher than in ImageNet-CNN because more semantically meaningful objects emerge.

    Question:
    How exactly are the pool5 units used to detect objects in a single forward pass?
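
    One plausible reading (my guess, not the paper's released code) is that a unit's pool5 activation map from that same forward pass is upsampled to image size and thresholded into a mask; a rough PyTorch sketch, with the threshold fraction as an assumption:

    import torch
    import torch.nn.functional as F

    def localize_with_unit(model, image, unit, thresh=0.5):
        # image: (1, 3, H, W); model.features is assumed to end at pool5.
        with torch.no_grad():
            fmap = model.features(image)[0, unit]          # (h, w) activation map
        heat = F.interpolate(fmap[None, None], size=image.shape[-2:],
                             mode="bilinear", align_corners=False)[0, 0]
        mask = heat >= thresh * heat.max()                 # binary object mask
        ys, xs = torch.nonzero(mask, as_tuple=True)
        if len(xs) == 0:
            return None
        # Bounding box of the high-activation region in image coordinates.
        return xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item()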

  9. The authors of this paper provide an in-depth analysis of the different representations learned at each layer of CNNs trained on the ImageNet and Places databases, respectively. They run two experiments in particular, minimal image representations and empirical receptive fields, for each unit in each layer to better understand what causes a unit to activate and how much a segment influences that activation. They then plot the various "themes" and abstractions across layers (obtained by our good friend, AMT) and show how different layers identify different abstractions.

    Discussion:
    1. The AMT workers are not given any hints to guide them in the process. We know from previous papers that free-form labelling can be hard and cause ambiguities. How do you think this was dealt with?

    2. This paper sort of validates the intuition I had about detecting objects in scenes for scene labelling, given the project I am tackling. Another intuition that is touched upon is the spatially distributed coding of the objects in a scene. Would an analysis of the various mixtures of objects in a scene prove useful? Something like topic modelling, but for objects in scenes rather than words in documents (especially given how successful BoW was before deep learning came along)?

  10. The paper expands on the previous Places dataset paper and shows that the deeper layers learn object detectors. The authors show visualizations of the layers; since the networks are trained for scene recognition, they implicitly learn object detection as well.

    Question -
    How does the object detection performance of networks trained on scene recognition compare with that of networks trained for object recognition?

  11. This paper builds on last week's paper regarding the Places dataset. It shows that, as the network deepens, object detectors emerge even though the network is trained on scene data. This means that reliable classifiers can be extracted from the intermediate layers of the network, not just the final output.

    Question:
    Do these intermediate layers compete with networks designed solely for object recognition?

  12. This paper examines the features learned by a CNN from object-centric and scene-centric datasets. The authors first iteratively removed segments from an input image for as long as the image could still be correctly classified. They then visualized the receptive fields of various layers of the Places-CNN and ImageNet-CNN and asked AMT workers to find an underlying theme in the highest-activating images. With this, the authors found that object detectors emerge in the deeper layers of the CNN, so the Places-CNN can perform both scene recognition and object localization.

    Since the AMT workers were free to type any label, did the authors just group similar labels together? How does Places-CNN perform compared to networks trained specifically for object localization?

  13. In the previous paper about the Places database, we learned that scene-recognition CNNs work better when trained on datasets with dedicated scene annotations (Places and SUN) than on datasets with object annotations (ImageNet). In this paper we see that the reverse limitation does not hold: object detection can still be done with a CNN trained for scene recognition. The paper proposes a data-driven approach to understand how the CNN learns scene recognition, and thereby shows that the same network can be used for object detection. This includes simplifying the images that showed strong scene detection, observing the receptive fields of different units, and examining the semantics learned by each unit. From the results it can be concluded that the inner layers of Places-CNN perform object localization and hence can be used for object detection in parallel.

    Question: Let's say we extract information from the inner units of Places-CNN and build an object classifier. When we give it an image of a 'street', the scene detector detects 'street' and the object classifier detects the 'cars' in it. But if the input is a 'bedroom' scene with a picture/painting of a 'car', does the object detector still detect the 'car'?

  14. This comment has been removed by the author.

    Replies
    1. The main point of this paper is that a CNN trained to perform scene recognition learns to detect objects. This is even more interesting since the paper from the last class showed how CNNs trained on object-centric data performed poorly on scene-centric data. The authors then go on to discuss different approaches to simplifying the input image and the various receptive fields and activation patterns of the CNNs. Finally, the authors discuss how the objects that emerge while training on these scenes are discriminative enough for scene recognition.

      Questions:
      1. The receptive fields and activation patterns seem to tell the authors a lot about the precise semantics learned at each layer depth. How is this different for networks trained specifically for object detection? Also, the objects that emerge are ones likely to be favored by Zipf's law; is this also true for object-detection-specific CNNs?
      2. The authors also mention that about 115 units of pool5 do not detect objects. Is there any subsequent work showing other types of representation learning emerging from a net trained for scene recognition, as they discuss (e.g., texture-based)?
      3. I don't quite understand whether they give a good explanation for why Places-CNN detects more objects than ImageNet-CNN in the higher layers.

  15. Summary:
    This paper analyses the layers of a deep network trained to perform scene recognition and claims that the same network can also perform object recognition in a single forward pass, owing to the kinds of activations of units in the various layers of the network. The authors also describe experiments performed with ImageNet-CNN and Places-CNN, along with results such as the most common object categories that activate each network.
    Questions/Comments:
    1) They say in the paper that they expect empirical RFs to perform better for classification than theoretical ones. What exactly do they mean by this? No additional supporting data are given.
    2) How exactly did they measure the correlation between object frequencies in different sets of units?
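
    For reference, a simplified sketch of how an empirical RF could be measured as I understand Section 3.2: slide an occluder over the image and record how much the unit's activation drops at each position (the paper uses randomized occluders; the zero patch, its size, and the stride here are simplifications):

    import torch

    def discrepancy_map(model, image, unit, patch=11, stride=4):
        # Baseline response of the unit on the unoccluded image.
        _, _, H, W = image.shape
        with torch.no_grad():
            base = model.features(image)[0, unit].max()
        rows = (H - patch) // stride + 1
        cols = (W - patch) // stride + 1
        dmap = torch.zeros(rows, cols)
        for i, y in enumerate(range(0, H - patch + 1, stride)):
            for j, x in enumerate(range(0, W - patch + 1, stride)):
                occluded = image.clone()
                occluded[:, :, y:y+patch, x:x+patch] = 0   # simplistic occluder
                with torch.no_grad():
                    act = model.features(occluded)[0, unit].max()
                dmap[i, j] = (base - act).clamp(min=0)     # activation drop
        return dmap   # large values outline the unit's empirical RF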

  16. This paper is a continuation of the work done in the last paper on the Places dataset and Places-CNN. The authors present a crucial discovery: the inner layers of a CNN trained on a scene-centric dataset provide object detection capabilities. They go on to analyze the RFs of the units in various layers to figure out what they represent. They show that these units are activated by entire objects, as opposed to the object parts or regions seen in object-centric CNNs. They provide two possible explanations for the emergence of these objects at the inner layers: object frequencies in scene images, and discriminative objects that define scenes.

    Discussion:

    Can this be applied to achieve geographical understanding of a scene, as in the 'What makes Paris look like Paris' paper? E.g., if a kangaroo is detected in the pool5 layer, we should be able to assume that the scene is from Australia, unless the scene is recognized as a zoo.

    Have there been results comparing the object recognition performance against CNNs trained on objects? I'm assuming the localization might not be as accurate.

  17. The paper visualises the outputs of the inner layers of CNNs trained on the Places dataset as well as ImageNet. Using various visualisation techniques, the authors conclude that the inner layers of a neural net trained to recognise scenes automatically discover object detectors.

    Discussion:
    I believe a natural extension of this work would be to use the output of the inner layers of a CNN trained on Places for object detection and compare the results with a CNN trained to detect objects.

  18. This paper continues to explore the differences between the learned features of Places-CNN and ImageNet-CNN. The authors create visualizations of simplified images showing the least amount of information required for an image to be classified correctly. They then estimate each unit's receptive field with a data-driven approach. The most important contribution of the paper is the finding that even Places-CNN, trained to classify scenes, has many units acting as object detectors.

    Question:

    When AMT workers were labeling unit concepts, they were not given a dictionary of labels to choose from. Were similar responses grouped, or did labels need to be identical?

  19. This paper studies the use of deep networks trained for scene classification on the Places dataset to perform object detection. The paper illustrates that the inner layers of a CNN trained for scene classification (Places-CNN) automatically discover important object detectors.
    The paper examines this using two approaches: learning the minimal image representation, and visualizing the receptive fields (RFs) and calculating their empirical size.

    Q: In the first approach to simplifying the input image (Section 3.1), the input image is segmented into edges and regions, and in each iteration the segment whose removal produces the smallest decrease of the correct score is removed. I'm not clear on how they calculate the classification score for each candidate segment.
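
    My understanding (a hedged sketch, not the authors' code) is that the score is simply the network's output probability for the ground-truth class, recomputed with a full forward pass after zeroing out each candidate segment:

    import torch

    def simplify(model, image, true_class, segments):
        # segments: list of (H, W) boolean masks from any off-the-shelf segmenter.
        current, remaining = image.clone(), list(segments)
        while remaining:
            def prob_after_removal(k):
                # Zero out one segment and re-read the correct-class probability.
                trial = current.clone()
                trial[:, :, remaining[k]] = 0
                with torch.no_grad():
                    return torch.softmax(model(trial), 1)[0, true_class].item()
            # Pick the segment whose removal hurts the correct score the least.
            best = max(range(len(remaining)), key=prob_after_removal)
            trial = current.clone()
            trial[:, :, remaining[best]] = 0
            with torch.no_grad():
                still_correct = torch.softmax(model(trial), 1)[0].argmax() == true_class
            if not still_correct:
                break            # stop before the classification flips
            current = trial
            remaining.pop(best)
        return current           # the minimal image representation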

  20. This paper is an extension of Monday's paper. Its major contribution is showing that a network trained to recognize scenes is capable of finding sophisticated patterns in an unsupervised way; more specifically, the network localizes objects without ever being told what an object is. This paper really highlights the impressive expressive power of deep networks.

    1. Can we see how this works in other networks - what patterns might emerge in a network trained on VQA? Would “functional” patterns emerge?

    2. Besides objects, could hierarchical relationships be explored between layers in the network? Perhaps there is an interesting pattern between activations that could be exploited for other recognition purposes.

  21. The paper builds on the Places dataset paper that we discussed in the previous class. It demonstrates that reliable object detectors can be extracted from the inner layers of a CNN trained to recognize scenes. The authors visualize the receptive fields of different layers of the ImageNet- and Places-CNNs to understand what they actually represent.

    Discussion: It would be interesting to see a comparison of object detection accuracy of the proposed CNN against a CNN specifically trained for object detection.

  22. This paper asserts that deep networks trained for scene recognition can be used for unsupervised object localization. The authors describe a method by which they discover the most important parts of the image for classification. They show qualitative and quantitative results on the ImageNet- and Places-CNNs.

    The contributions of the paper are:

    1. Image simplification, or minimal-image-representation-based classification, which allows them to find the most important part of the scene for classification.

    2. A visualization technique that finds the actual receptive field of units in each convnet layer, and thus allows objects to be localized in the scene.

    Their results validate their hypothesis, showing that networks trained for scene recognition also learn objects that can be used for localization.

    Questions:

    1. How does the segmentation or localization work, given that the receptive fields in the images look non-descriptive?

    2. How would the object localizer work on an actual object dataset? Would it still be able to detect the objects in a new dataset?

  23. This paper provides interesting intuition about, and visualization of, how CNNs perform. What each layer sees, and how it responds to a specific set of objects within the scene for classification, is represented quite well. It is an extension of the Places dataset paper that goes even deeper and compares how the model performs at each layer against ImageNet-CNN. One of the interesting findings is that a model learned from a scene-centric database detects more objects than ImageNet-CNN. The authors go on to describe how the receptive fields can be used as a mask for object segmentation and then further trained as an object detector within the scene recognition system itself. Performance metrics are compared against ImageNet-CNN.

    Questions-
    1. How does it perform compared to systems trained specifically for object detection, like R-CNN?
    2. It would be interesting to know what their visualizations would look like.
