Friday, March 4, 2016

Mon, Mar 7 - Fast R-CNN

Fast R-CNN. Ross Girshick. ICCV 2015.

arXiv, code

19 comments:

  1. Region-based convolutional networks perform well in object-detection tasks but have some notable drawbacks: training is expensive; the pipeline is complicated, requiring multiple stages to fine-tune the network on proposals, train the object detectors, and then refine the bounding boxes around those objects; and detection is slow, because it requires a forward pass for each object proposal. SPPnets speed up the process by performing a single forward pass per image and then classifying proposals within the image, but they share some of R-CNN's shortcomings, specifically multi-stage, expensive training.

    This paper proposes a training regimen that addresses the shortcomings of R-CNN/SPPnet while improving on their performance through several adjustments, including an RoI pooling layer that maps each RoI to a fixed-size feature map, which is fed to the fully connected layers, and two output layers that give per-RoI class probabilities and per-class bounding-box offsets.


    Questions/Discussion:
    1) Could you elaborate on the mechanism behind the PCA/SVD dimensionality reduction of fc6 and fc7, and why it works? Is this providing bounding-box tuning?
    2) Could you elaborate on the bounding-box regression? And the backprop through the RoI layers?
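
    On the bounding-box regression in question 2, here is a minimal sketch of the box parameterization, assuming the R-CNN convention that Fast R-CNN reuses: offsets of the box center are scaled by the proposal's width and height, and width/height changes are predicted in log space. The function names are hypothetical, and the code ignores details such as target normalization.

```python
import math

def to_center(box):
    """Convert (x1, y1, x2, y2) corners to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + 0.5 * w, y1 + 0.5 * h, w, h

def encode(proposal, gt):
    """Regression target (tx, ty, tw, th) that moves `proposal` onto `gt`."""
    px, py, pw, ph = to_center(proposal)
    gx, gy, gw, gh = to_center(gt)
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))

def decode(proposal, t):
    """Apply predicted offsets `t` to `proposal`; inverse of encode."""
    px, py, pw, ph = to_center(proposal)
    tx, ty, tw, th = t
    cx, cy = px + tx * pw, py + ty * ph
    w, h = pw * math.exp(tw), ph * math.exp(th)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)
```

    The network predicts one such 4-tuple per class for every RoI, and at test time the tuple for the predicted class is decoded against the proposal.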

  2. Object detection is hard, especially because it involves localization. The previous work, R-CNN, is slow because it performs a ConvNet forward pass for each object proposal without sharing computation. Fast R-CNN gets higher mAP while being substantially faster.

    The network runs the full image through several conv and max pooling layers to produce a conv feature map. Then, for each ROI (region of interest), the ROI pooling layer "pulls out" a fixed-dimension feature vector from the proposal. This ROI feature vector is fed through fully connected layers into a multi-task loss, part of which deals with learning the bounding boxes.

    Q.
    1. Not a question, but it might be helpful to briefly go over Fast R-CNN and Faster R-CNN.

    2. ROI projection in Figure 1: how do you figure out what the ROI projection should be? That is, how do you figure out which regions of the conv feature map should be part of the ROI projection?

    3. Sampling "R/N" RoIs: why use R/N?

    4. Are the ROIs fed into the CNN? If so, what size are they?

    5. Is there any work that uses "smart sampling" of ROIs (sampling that takes advantage of some prior knowledge)?
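
    On question 2, a rough sketch of how an image-space proposal can be projected onto the conv feature map, assuming a total stride of 16 between the image and the last conv layer (as for VGG16) and a 7 x 7 output grid. The rounding and bin-splitting rules follow common implementations and are assumptions here; the function names are hypothetical.

```python
import math

def project_roi(box, stride=16):
    """Map an image-space box (x1, y1, x2, y2) to inclusive feature-map cell indices."""
    return tuple(int(math.floor(v / stride + 0.5)) for v in box)

def bin_edges(start, end, bins=7):
    """Split the inclusive cell range [start, end] into `bins` sub-windows.

    Returns half-open (lo, hi) pairs. Neighbouring bins may overlap, and
    every bin contains at least one cell even when the RoI is small.
    """
    size = max(end - start + 1, 1)
    edges = []
    for i in range(bins):
        lo = start + (i * size) // bins               # floor(i * size / bins)
        hi = start - (-((i + 1) * size) // bins)      # ceil((i + 1) * size / bins)
        edges.append((lo, hi))
    return edges

x1, y1, x2, y2 = project_roi((32, 48, 255, 175))
col_bins = bin_edges(x1, x2)   # 7 column ranges on the feature map
row_bins = bin_edges(y1, y2)   # 7 row ranges on the feature map
```

    Max pooling inside each (row range, column range) pair then yields the fixed 7 x 7 output regardless of the proposal's size.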

  3. Abstract:

    The paper provides a fast region-based CNN method for object detection that tries to address specific problems with R-CNN and SPPnets. The main changes in Fast R-CNN are single-stage training and taking the input proposals from a few sampled images, to make training and testing faster. They initialize their net with various pre-trained networks and fine-tune all the layers by sampling the proposals from 2 images at a time. They also use a multi-task loss which has two components, one for the classification loss and the other for the bounding-box loss. The results show that the proposed method is orders of magnitude faster and able to achieve state-of-the-art results on various detection tasks.

    Discussion:

    1) It seems like the choice of R and N is driven by the need to reduce computation time. Do they affect the mAP as well?

    2) Could you talk a little bit about Faster R-CNN (NIPS 2015)?
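
    For question 1, a small sketch of the hierarchical sampling, assuming the setting reported in the paper: N = 2 images and R = 128 RoIs per mini-batch (so R/N = 64 RoIs per image), with 25% of them taken from foreground proposals. The data layout and the function name are hypothetical.

```python
import random

def sample_minibatch(dataset, n_images=2, n_rois=128, fg_fraction=0.25, rng=random):
    """Sample N images first, then R/N RoIs from each sampled image.

    `dataset` is a list of dicts with 'fg' and 'bg' lists of RoIs
    (a hypothetical layout). Returns a list of (image_index, roi) pairs.
    """
    per_image = n_rois // n_images
    batch = []
    for idx in rng.sample(range(len(dataset)), n_images):
        fg, bg = dataset[idx]["fg"], dataset[idx]["bg"]
        n_fg = min(int(per_image * fg_fraction), len(fg))
        n_bg = min(per_image - n_fg, len(bg))
        batch += [(idx, roi) for roi in rng.sample(fg, n_fg)]
        batch += [(idx, roi) for roi in rng.sample(bg, n_bg)]
    return batch
```

    Because all RoIs from one image share a single conv forward and backward pass, a small N keeps the per-batch cost low, which is the computational argument for R/N.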

  4. This paper presents the Fast R-CNN network for object detection. Compared to the baseline R-CNN, they train the network 9 times faster and have a faster testing time as well. The Fast R-CNN network takes an input image and a set of object proposals and feeds two sibling output layers: one produces softmax probabilities over the K object classes plus background, and the other produces the coordinates of the bounding boxes. Testing consists of just a forward pass with the test image. The authors also examine which design choices help their network, in terms of multi-task training, classifiers, and scale invariance.

    Discussion -
    How did they decide which layers to fine-tune?
    Faster R-CNN?

  5. The author proposes Fast R-CNN, a deep network that has better performance and efficiency than its predecessors, R-CNN and Spatial Pyramid Pooling nets (SPPnets). This is accomplished by leveraging the shared computation of SPPnets along with the ROI layer in a novel training (or should I say fine-tuning?) approach which jointly ranks object proposals and generates posterior distributions in a single pass, rather than in separate passes as was done earlier. This leads to much faster training and testing than previous networks used for object detection. Furthermore, the use of a softmax layer in place of SVMs, and truncated SVD of the fully connected layers, help improve both performance and efficiency. Various experiments are run on the VOC 2007 and VOC 2012 datasets to validate the results.

    Discussion:
    1. I'm still not quite clear about the backpropagation algorithm employed here. Can that be elaborated upon?

    2. Data augmentation does not seem to matter much for this network to perform well, unlike previous networks such as GoogLeNet. How does the mini-batch sampling affect the need (or lack thereof) for data augmentation?

  6. This paper introduces an improved CNN architecture called Fast Region-based CNN. It improves on R-CNN and SPPnet to give faster training and testing and better object detection results. As in SPPnet, the whole image passes through the convolution layers once, instead of once per object proposal (unlike R-CNN), which makes it faster. But unlike SPPnet, Fast R-CNN performs ROI pooling in a way that allows back propagation (hence faster training). There are two outputs: one from a softmax layer, which gives the class scores for each detection, and another which outputs a bounding box for each object detection. Detection is sped up by converting the weight matrix of each fully connected layer to a truncated SVD form.

    Discussion:
    1. How is efficient back propagation through the ROI pooling layer achieved, compared to SPPnet?
    2. How do you decide on 't' (section 3.1) for truncated SVD? In 4.4 the authors say truncated SVD drops only 0.3% mAP, but they don't mention how they chose 't'.

  7. This paper proposes a method to detect objects in a scene by streamlining the R-CNN approach. They combine R-CNN and SPPnet to leverage the power of a shared representation per image while speeding up the process as a whole. They also train using a multi-task loss that integrates bounding-box regression and classification.

    They show state-of-the-art results on the Pascal VOC dataset and do a thorough analysis of the method. Some of the key findings of the analysis are: SVMs are about as good as softmax for final-layer classification, more data is always helpful, and detection quality plateaus after 2000 proposals per image.

    Questions:
    1. The paper talks about dense object proposals with hard negative mining through SVMs performing worse than just sparse object proposals. Is there any intuition behind this?

    2. The paper also seems to use a fixed ROI pooling size from the convolution layers to the fully connected layers. With varying object sizes, as in MS COCO, this can lead to very large receptive fields for small objects.

  8. This paper presents a deep convolutional network method for object detection, Fast R-CNN. Previous convolutional network detection algorithms required slow multi-stage pipelines that separately handle object proposals, detections, and bounding boxes. Fast R-CNN combines and modifies aspects of both R-CNN and SPPnet to create a training process that allows for much faster training speed, testing speed, and better performance. The base architecture is borrowed from existing nets and then modified. RoI pooling layers are used to create fixed-size feature maps that correspond to a specific region in the input image. The base net is modified to include two output layers: one that provides a probability distribution across classes per RoI, and another that predicts bounding-box positions.

    Question:

    Can you walk us through how backprop works on the RoI layers?

  9. The Fast R-CNN paper improves upon the previous object detection techniques of R-CNN and SPPnet mainly by cutting down on required training resources, reducing the number of stages in the detection pipeline, and processing images much faster at runtime.
    Some of the main upgrades in Fast R-CNN are replacing the last max pooling layer with a ROI pooling layer, replacing the final fully connected layer and softmax with two sibling output layers, and finally modifying the network to take two data inputs. These transformations happen when a pre-trained network initializes a Fast R-CNN network. One major improvement of Fast R-CNN over SPPnet (which in turn improves over R-CNN) is a more efficient backprop through the ROI pooling layer. The results on VOC 2012 show that FRCN has a higher mAP than SPPnet and R-CNN. Finally, the paper also compares using SVMs vs. softmax as the detection classifier, which shows that softmax is slightly better than SVMs.

    Discussion:
    1. What do the argmax switches in the ROI backprop equation do?
    2. The R-CNN paper introduces discriminative supervised pre-training and domain-specific fine-tuning to address data scarcity. Does this extend to FRCN? Also, when the R-CNN paper discusses its error modes, it talks a lot about poor localization as a cause of failure. Could the multi-task loss from FRCN, which has an L1 loss term, have greatly improved this?
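
    As a reference point for the loss mentioned in question 2, here is a minimal sketch of the per-RoI multi-task loss as the paper describes it: a log loss on the true class plus a smooth L1 loss on the box offsets, with the box term switched off for background RoIs (label 0). The argument layout and function names are hypothetical.

```python
import math

def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1, so large errors are not squared."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def multitask_loss(probs, label, box_pred, box_target, lam=1.0):
    """Loss for a single RoI.

    probs      : softmax output over K + 1 classes (index 0 is background)
    label      : ground-truth class index
    box_pred   : per-class predicted offsets, box_pred[k] = (tx, ty, tw, th)
    box_target : regression target for the ground-truth class
    lam        : weight balancing the two terms
    """
    cls_loss = -math.log(probs[label])
    if label == 0:
        # Background RoIs have no ground-truth box, so only classification counts.
        return cls_loss
    loc_loss = sum(smooth_l1(p - t) for p, t in zip(box_pred[label], box_target))
    return cls_loss + lam * loc_loss
```

    Only the box predicted for the true class enters the localization term, and the linear tail of smooth L1 bounds the gradient for badly localized proposals.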

  10. The original R-CNN algorithm starts by generating regions of interest (RoIs), around 2000 regions, from a proposal method. After warping the image regions, each is forwarded through a ConvNet to extract features. The third stage is classifying each region with class-specific SVMs and applying bounding-box regressors for object detection. While R-CNN reported state-of-the-art mAP on Pascal VOC 2007/12, the multi-stage training pipeline is very slow (~84h) and takes a lot of disk space. The paper proposes a new method, "Fast R-CNN", that overcomes the drawbacks of R-CNN. In Fast R-CNN the training is a one-stage process, so it's much faster, and at the same time it produces higher mean average precision than slow R-CNN and SPPnet. The training pipeline starts by forwarding the whole image through one ConvNet, which generates the conv5 feature map of the entire image; at the same time, RoIs are generated using an external method. The conv5 features and the RoIs are forwarded to a single RoI pooling layer, and the output is forwarded to a set of fully connected layers. Finally, the feature vector from the final FC layers is forwarded to a softmax classifier (for class recognition) and to bounding-box regressors (for the object boxes).


    To train Fast R-CNN, the RoI pooling layer must be differentiable. Can we discuss this part a bit more (the RoI pooling layer's backward pass)?
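
    A rough single-channel sketch of that backward pass, assuming max pooling over each sub-window: the forward pass records which input cell won each output bin (the "argmax switch"), and the backward pass routes each output gradient to that cell, accumulating when several bins or several RoIs select the same cell. Pure Python on nested lists for clarity; the function names and the bin-splitting rule are assumptions.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)

def roi_pool_forward(feat, roi, out_h=2, out_w=2):
    """Max-pool the inclusive cell window roi = (r0, c0, r1, c1) of `feat` to out_h x out_w.

    Returns (pooled, argmax), where argmax[i][j] is the (row, col) of the winning cell.
    """
    r0, c0, r1, c1 = roi
    h, w = r1 - r0 + 1, c1 - c0 + 1
    pooled = [[0.0] * out_w for _ in range(out_h)]
    argmax = [[None] * out_w for _ in range(out_h)]
    for i in range(out_h):
        rows = range(r0 + (i * h) // out_h, r0 + ceil_div((i + 1) * h, out_h))
        for j in range(out_w):
            cols = range(c0 + (j * w) // out_w, c0 + ceil_div((j + 1) * w, out_w))
            best = max(((r, c) for r in rows for c in cols),
                       key=lambda rc: feat[rc[0]][rc[1]])
            argmax[i][j] = best
            pooled[i][j] = feat[best[0]][best[1]]
    return pooled, argmax

def roi_pool_backward(grad_out, argmax, feat_shape):
    """Route each output gradient back to the input cell that produced the max."""
    n_rows, n_cols = feat_shape
    grad_in = [[0.0] * n_cols for _ in range(n_rows)]
    for i, row in enumerate(argmax):
        for j, (r, c) in enumerate(row):
            grad_in[r][c] += grad_out[i][j]
    return grad_in
```

    With several RoIs on one image, the per-RoI input gradients are summed cell by cell before being passed to the conv layers, which is why overlapping RoIs can share one backward pass through the shared feature map.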

  11. This paper produces a region-based CNN (known as R-CNN) that is roughly 10x faster than the previous state-of-the-art R-CNNs and SPPnets. The proposed network takes in an image and associated region proposals and outputs object detection and localization results. This network speeds up training and slightly improves accuracy by using RoI pooling layers (an idea taken from SPPnets), using softmax layers instead of SVMs for classification, and using SVD on the FC layers to improve efficiency.

    1) How could this be changed to output segmented regions rather than bounding boxes?

    2) I'm not sure I fully understand how back propagation happens through the RoI layers and the FC layers.

  12. This paper presents a method for overcoming the drawbacks of R-CNN, specifically those related to speed (of training and testing) and accuracy (by allowing for fine-tuning). Fast R-CNN uses several convolutional and max pooling layers to produce a feature map. Then, for each RoI, the RoI pooling layer extracts a fixed-length feature vector from the feature map.

    Question:
    How does the truncated SVD work?
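
    A small NumPy sketch of the idea, assuming the factorization described in the paper: a u x v weight matrix W is approximated as U Sigma_t V^T using its top t singular values, and the single FC layer is replaced by two thinner ones. The random matrix below is a hypothetical stand-in for a trained layer, and the sizes are chosen only for illustration.

```python
import numpy as np

def truncate_fc(W, t):
    """Split a (u x v) FC weight matrix into two layers via its top-t singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    first = s[:t, None] * Vt[:t]   # (t x v): Sigma_t V^T, applied first, no bias
    second = U[:, :t]              # (u x t): U, applied second, keeps the original bias
    return first, second

rng = np.random.default_rng(0)
# Hypothetical weights that are close to rank 8, plus a little noise.
W = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 128))
W += 0.01 * rng.standard_normal((64, 128))
x = rng.standard_normal(128)

first, second = truncate_fc(W, t=8)
y_full = W @ x                   # original layer: u * v multiply-adds
y_fast = second @ (first @ x)    # compressed layers: t * (u + v) multiply-adds
rel_err = np.linalg.norm(y_full - y_fast) / np.linalg.norm(y_full)
```

    The parameter count drops from u * v to t * (u + v), which is a large saving when t is much smaller than min(u, v). This matters at detection time because the FC layers run once per RoI while the conv layers run once per image.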

  13. This paper builds on previous work on object detectors using convolutional neural networks. In the past R-CNNs and SPPnets were used to detect and localize objects, but they were both very inefficient. Fast R-CNN is able to improve the accuracy of previous efforts while training 9x faster than R-CNNs and 3x faster than SPPnets and testing 213x faster than R-CNNs and 10x faster than SPPnets.

    Clarification:
    I would also like more elaboration on truncated SVD.

  14. Summary
    This paper proposes a method, Fast R-CNN, for object detection that improves training and testing speeds while also increasing accuracy. Comparisons of performance on Pascal VOC are provided for Fast R-CNN with respect to the deep VGG16 network and SPPnet. Their experiments show that sparse object proposals improve detector quality.
    Doubts
    I would like to know how exactly the backpropagation through the ROI pooling layers works. Also, how exactly does truncated SVD help in faster detection?

  15. This paper presents a fast region-based CNN for object detection with better efficiency and performance than R-CNN and SPPnet. The network first produces a convolutional feature map, which then passes through an ROI pooling layer so that object proposals are ranked in a single pass. This results in faster training and testing. Finally, there are two output layers that give the softmax probabilities and the bounding-box positions for each object class.

    How does back-prop work in the RoI pooling layers?

  16. The paper introduces a faster training algorithm that overcomes the disadvantages of R-CNN and SPPnet while improving on their speed and accuracy. The proposed algorithm has a higher detection quality than R-CNN and SPPnet. Also, training is done in a single stage using a multi-task loss. Training is made faster and more efficient by using ROI pooling layers and softmax layers for classification, and detection is sped up by SVD on the FC layers. The paper also presents a detailed analysis of the results on the PASCAL VOC dataset and demonstrates how softmax slightly outperforms SVMs.

    Question: I didn't understand how back propagation works on ROI layers. Could you please go over that?

  17. The authors present a Fast R-CNN method for object detection. They show that Fast R-CNN has improved training and testing speeds while also improving accuracy. The network takes in an input image and pre-computed object proposals and outputs detection and localization results. Truncated SVD improves speed with a very small drop in accuracy.

    Discussion:

    Could you please explain how exactly the backprop on the ROI layers works?

  18. The paper proposes a Fast R-CNN based method for object detection that builds on the existing R-CNN and SPPnet methods. The paper argues that the main drawback of these methods is that they are slow, due to the multi-stage training pipeline and the fact that the features are written to disk.
    The Fast R-CNN proposed in the paper takes the whole image as input and creates a conv feature map by passing it through multiple convolutional and max pooling layers. This feature map is used to extract fixed-length features corresponding to the regions of interest. These features form the input to the loss layer, which learns the bounding boxes corresponding to the objects.

    Question:
    I am not clear as to how backpropagation happens through the RoI layers.

  19. This paper introduces an improved CNN architecture called Fast Region-based CNN. It improves on R-CNN and SPPnet to give faster training and testing and better object detection results. As in SPPnet, the whole image passes through the convolution layers once, instead of once per object proposal (unlike R-CNN), which makes it faster. But unlike SPPnet, Fast R-CNN performs ROI pooling in a way that allows back propagation (hence faster training). There are two outputs: one from a softmax layer, which gives the class scores for each detection, and another which outputs a bounding box for each object detection. Detection is sped up by converting the weight matrix of each fully connected layer to a truncated SVD form.

    Discussion:
    1. How is efficient back propagation through the ROI pooling layer achieved, compared to SPPnet?
    2. How do you decide on 't' (section 3.1) for truncated SVD? In 4.4 the authors say truncated SVD drops only 0.3% mAP, but they don't mention how they chose 't'.
