Summary: This paper proposes a four-layer CNN trained to re-rank object proposals from bottom-up methods, achieving better object detection results (even on object categories the network has not seen before) than more complex networks while running faster. Experimental results compare object detection by DeepBox on PASCAL and COCO, and DeepBox against Edge Boxes on unseen categories.
Questions: 1) I have a fundamental question about how this network is trained. In section 3.3, the authors describe how the training examples are built: the positives, the negatives, and the hard negatives. Even though the alpha, beta+, and beta- values differ for this network compared to more complicated networks, these examples are essentially a subset of the bottom-up proposals. If that is true, it is intuitively hard for me to believe they would perform better on detection problems. Also, this section doesn't show any "re-ranking" of the Edge Boxes bounding boxes themselves.
2) In section 3.3.2, how exactly are "perturbing ground truth bounding boxes" and thresholding the positive examples equivalent to differentiating between the background of an image and an object in the image?
3) Is there a reason why Edge Boxes bounding boxes were picked as the base on which features are built for DeepBox? How are these features consistently better than those of "any other" object proposal method in general, as the paper claims? Is there something fundamentally different being done by DeepBox?
Summary: This paper presents DeepBox, a compact, fast network that was as good as the 2015 state of the art for object proposals with bounding boxes. The key point is that, even with a shrunken network, it still leverages the CNN to figure out the hierarchy of visual cues that are discriminative of objects. Another important result is that DeepBox generalizes beyond its training categories.
Question: 1. In section 3.2, I don't understand how using a fixed spatial pyramid grid to max-pool features for each box results in a fixed feature vector for each box.
2. In section 3.1, the authors mention the AUC drop when certain layers are removed: conv5 + conv4 (drop: 10.6 points) and conv5 + conv4 + conv3 (drop: 6.7 points). Did they retrain the network after removing the layers? (If they didn't retrain, the drop is understandable.)
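Regarding question 1: a minimal sketch of why a fixed pyramid grid yields a fixed-length vector. The number of pooled cells depends only on the grid levels, never on the box size, so every box maps to the same output length (the grid sizes below are assumed for illustration, not taken from the paper):

```python
import math
import numpy as np

def spatial_pyramid_pool(feature_map, box, levels=(1, 2, 4)):
    """Max-pool the feature-map region under `box` over a pyramid of
    fixed grids (1x1, 2x2, 4x4 here). Every box yields a vector of
    length C * (1 + 4 + 16), regardless of the box's size."""
    x0, y0, x1, y1 = box  # box in feature-map coordinates
    region = feature_map[:, y0:y1, x0:x1]
    _, h, w = region.shape
    pooled = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                # cell boundaries; the max() keeps each cell non-empty
                ys = (i * h) // n
                ye = max(math.ceil((i + 1) * h / n), ys + 1)
                xs = (j * w) // n
                xe = max(math.ceil((j + 1) * w / n), xs + 1)
                pooled.append(region[:, ys:ye, xs:xe].max(axis=(1, 2)))
    return np.concatenate(pooled)
```

Two boxes of very different sizes produce vectors of identical length, which is what lets a fixed fully-connected head consume them.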
DeepBox is a fast, compact network for generating object proposals. Experimental results presented here show that it is possible to get accurate object proposals on unseen object classes with a very small network, and that these proposals can be used to increase the accuracy of classifiers.
Questions:
1) Architectural experimentation for the neural net was limited to variations of the ImageNet network (removing conv layers, reducing input size, etc.). Has any attempt been made to use entirely different architectures for objectness proposals (for example, modifying the segmentation network here: http://www.cs.berkeley.edu/~jonlong/long_shelhamer_fcn.pdf ), or to start from any other network intended for pixel-level prediction rather than classification?
The paper presents an approach to rerank object proposals obtained by bottom-up approaches using a 4-layer CNN architecture. They keep the CNN architecture minimal to keep the object proposal stage fast. They train this network in two phases. The first phase gets positive samples from the ground truth and its small perturbations, and negatives from a raster scan. The second phase gets positive examples from bottom-up proposals that overlap heavily with the ground truth, and negatives from proposals with less overlap. The network is thus trained as a binary classification CNN. The results and visualizations show that they consistently achieve high AUC.
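The second-phase overlap labeling described above can be sketched as follows. The 0.7/0.3 thresholds are placeholders standing in for the paper's alpha/beta values, which are not reproduced here:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def label_proposals(proposals, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Label each bottom-up proposal by its best overlap with any
    ground-truth box: positive above pos_thresh, negative below
    neg_thresh, ignored in between. (Illustrative sketch.)"""
    labels = []
    for p in proposals:
        best = max((iou(p, g) for g in gt_boxes), default=0.0)
        if best >= pos_thresh:
            labels.append(1)
        elif best < neg_thresh:
            labels.append(0)
        else:
            labels.append(None)  # ambiguous overlap: skipped in training
    return labels
```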
Discussion:
1) They use a CNN to rerank the proposals; have any other papers tried to rerank these proposals using some other methodology?
2) I don't get the fast DeepBox 'b' calculation from the convolution layer output. Can you please elaborate on this part?
This paper re-ranks object proposals using a 4-layer CNN called DeepBox. It is smaller and faster than most object proposal methods, while maintaining accuracy. Experimental results suggest that there is a generic notion of objectness.
Q/Discussion:
1. I don't have a good grip on the notion of objectness. 2. I would like to go over section 3.2.
This paper introduces a refinement of bottom-up object proposal methods: a convolutional net, trained on a large database of annotated images, reranks proposals originally generated by "vanilla" bottom-up methods such as Edge Boxes. The CN reranks these proposals, many of which don't coincide with the ground truth annotations in the image, by cropping and evaluating each one individually, a relatively slow mechanism since proposal boxes can be contained in other proposal boxes (a faster alternative is also suggested). The CN is relatively small, comparable in proposal effectiveness to state-of-the-art CN methods while being much faster, and comparable in speed to the fastest bottom-up methods while outperforming them, thus seemingly providing the best of both worlds. Further, it apparently generalizes well to categories it has never seen, giving rise to the hypothesis of some kind of category-agnostic "objectness" shared by objects, which this kind of lightweight CN is capable of teasing out of images.
Discussion/Questions
1) The training method seems a little arcane, in particular the ground-truth perturbation used to generate positive training examples. It seems to me they are only potentially excluding part of a known object or including extra background, neither of which seems desirable for learning true positives. What am I missing here?
2) How did they initialize the first two layers using the ImageNet model? Did they just copy the weights?
3) Were both steps in their training really necessary? It seems like they could have accomplished what they were trying for with a combination: perturb bounding boxes on annotated ground-truth images for positives, and pick random locations and faulty bottom-up proposals for negatives.
4) They seem to be making a case for a CN's "objectness sensitivity" lying in the earlier layers, since their smaller net performs almost as well as the larger state-of-the-art nets. This seems counter to the "object-scene" comparison/dichotomy discussions we were having last week, where the Places-CN and the ImageNet-CN had very similar RFs earlier in the network hierarchy (closer to the input), while the true "object" detectors were at the later layers.
This paper uses a lightweight deep network to fine-tune and rerank object proposals obtained from bottom-up proposal techniques. The idea is for the network to gain an understanding of the notion of "objectness". The network is only 4 layers deep and was obtained by selectively removing parameters while trying to keep the performance of the network approximately the same. The training procedure for DeepBox seems to be the more important part: the authors train the network twice, once on samples from sliding-window scans, using overlap measures to label positive and negative samples, and a second time on actual proposals from the Edge Boxes technique, in order to learn tighter bounding boxes and to differentiate between whole objects and parts. Multiple experiments were run, showing three features of DeepBox: 1) it improves the performance of the naive Edge Boxes technique, 2) it performs well even on categories never seen during training, and 3) it can be adapted to use other object proposal techniques such as Selective Search.
Discussion: 1. As asked above, I am also having trouble understanding how the convolutions are shared across object proposals.
2. How do the authors decide measures of relative improvement? For DeepBox with AlexNet and VGG, they claim there is not much improvement for an AUC difference of 0.04 when using VGG, but claim "large gains" of 0.05 when using the Selective Search proposal technique. This seems contradictory.
3. When evaluating on unseen categories, did they take into account correlated categories that occur together more often? Did that help the network, since the network now learns parts of these other categories, which pushes the weights in the general direction of the unseen categories?
This paper discusses an algorithm to rerank object proposals. It gives a novel 4-layer CNN architecture to rerank proposals from bottom-up methods. The network achieves better results than complex networks while being relatively computationally inexpensive, and it can also detect objects it has not seen before.
For section 3.1, could some other variation have given a better final result, either in terms of size or AUC?
The paper uses CNNs for object proposals to reduce computation time and increase the classification accuracy of other algorithms.
It seems non-intuitive that daisy-chaining multiple CNNs (one to do object proposals and one to do classification) is somehow more than the sum of its parts. Is the limitation the size of the network? That is, to get the same performance from a single CNN, would it need to be unfeasibly deep? Do we see daisy-chained CNNs in other areas of CV? For example, using a CNN to do object proposals, then a CNN to do mid-level features (wheels, grass, cloth), then a CNN to do high-level labels?
This paper presents DeepBox, a four-layer CNN for reranking object proposals constructed from bottom-up cues. The architecture is smaller than existing, more complex networks yet performs better. The proposal boxes are cropped and fed into the CNN, which evaluates scores and reranks the proposals.
What is the advantage of randomly perturbing the ground truth bounding boxes?
DeepBox is a novel four-layer network used to recognize "objectness" in images. Its compactness makes it easy to train while retaining results comparable to much larger networks. It works by taking in bottom-up object proposals from a generic object detector and reranking them, drastically improving results.
Question: Why not just use a deep learning architecture to detect objects in the first place, rather than ranking existing object proposals?
This paper proposes DeepBox, a four-layer CNN that improves upon bottom-up ranking proposals for object detection. The DeepBox architecture learns a semantic notion of 'objectness' that also extends to unseen object categories. They use Girshick's implementation of R-CNN. They show a 4.5% improvement in mean average precision with their method compared to bounding-box proposals at 500 proposals. Trying deeper networks like VGG-Net or AlexNet improves the results, but not by much, suggesting that their 4-layer network is quite good.
Discussion: 1. They don't explain why much deeper networks like AlexNet fail to give substantially better results than their 4-layer CNN. 2. In 3.2, I don't quite understand why one would need both single and multi-scale. In the multi-scale case too, aren't they picking the feature vector from the box closest to the actual size of the object?
This paper presents a CNN method for ranking object proposals. The net accepts object proposals from bottom-up approaches, and ranks them according to their objectness. The authors show that even with a smaller 4 layer architecture, results are on par with what a deeper net can achieve. The authors then show that the concept of objectness learned by the net is general by using it on categories held out from training.
1) As many others have mentioned, section 3.2 was not clear.
2) Is perturbing a good way to generate new positive examples? It's similar to jittering when training object classification nets, but it seems strange to me to change the corners of the bounding box. Every new bounding box you make will fit less well than the original, and it might cause worse proposals to be ranked higher.
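For reference, the kind of jittering in question typically looks something like this (the scheme and magnitudes are illustrative, not the paper's exact perturbation procedure):

```python
import random

def perturb_box(box, max_shift=0.1, seed=None):
    """Jitter a ground-truth box (x0, y0, x1, y1) by shifting each
    coordinate by up to max_shift of the box's width/height, producing
    an extra near-positive example. (Hypothetical helper; the paper's
    actual perturbation parameters are not reproduced here.)"""
    rng = random.Random(seed)
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    jitter = lambda scale: rng.uniform(-max_shift * scale, max_shift * scale)
    return (x0 + jitter(w), y0 + jitter(h), x1 + jitter(w), y1 + jitter(h))
```

Since each coordinate moves by at most a small fraction of the box size, the jittered box still overlaps the object heavily, which is why it can serve as a positive.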
This paper proposes "DeepBox", a new object detection model using a CNN. The model takes a set of proposed object regions as input, then evaluates and ranks them based on the objectness of the boxes. The notion of "objectness" measures how tightly the bounding box fits around the object and whether it contains a lot of "stuff". Objectness is a semantic, high-level constraint, but it's also data-driven. The proposed CNN model is a lightweight one, consisting of only 4 layers. Experiments show that DeepBox is agnostic to the original proposal method (Selective Search or Edge Boxes) and performs well on PASCAL and COCO. Experiments on unseen object classes also show that the model is class-agnostic, as it learns the notion of an object without any training on those sets.
Q: Section 3.1 describes the network architecture; it is mentioned that the problem is binary classification ("region or not"). How does the model rank the proposed regions then?
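One plausible reading, sketched below: the classifier's positive-class score is continuous, so it can serve directly as the ranking key even though training is binary (the helper name and this exact scheme are assumptions, not taken from the paper):

```python
def rerank(proposals, objectness_scores):
    """Order proposals by descending objectness score, so the binary
    classifier's confidence doubles as the ranking criterion.
    (Illustrative sketch; `rerank` is a hypothetical helper.)"""
    order = sorted(range(len(proposals)),
                   key=lambda i: objectness_scores[i], reverse=True)
    return [proposals[i] for i in order]
```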
This paper proposes a new shallow CNN-based object detector. Their model takes a series of regions as input and outputs the "objectness" of each region. To generate these original regions, they use Selective Search and Edge Boxes, and they found that their model was independent of the proposal system. They also found that their model could generalize to unseen categories.
The paper presents a method to improve object detection by using bottom-up cues to rank object proposals. The object proposals are re-ranked using the scores of a four-layer neural net. The authors run several experiments on the PASCAL and COCO datasets and conclude that their algorithm is effective at detecting unseen categories as well. Discussion: I am not clear on how the object proposals are re-ranked. Moreover, how does re-ranking ensure that the object proposals are semantically meaningful?
The authors present a CNN-based approach to ranking object proposals. The model takes object proposals obtained from a bottom-up approach and ranks them based on objectness, which is a measure of how well the bounding box fits the object. They show that even with only a 4-layer architecture, the results are comparable to a deeper network's, while being lighter.
Discussion:
1) Are the bounding boxes perturbed for positive examples only to fight overfitting?
2) I'm not super clear on the sharing of computations for faster ranking (section 3.2).
The paper introduces an efficient 4-layer CNN architecture to rerank object proposals. The architecture learns a semantic notion of objectness that generalizes to unseen categories. Though the proposed architecture is relatively small, the paper demonstrates that its outputs are comparable with those of deeper nets. Question: The paper mentions that faster reranking is achieved by sharing the convolutional part of the network among all the proposals. However, I don't understand how this is done. Can you please go over this?
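A rough sketch of the sharing idea as I understand it: the conv layers run once over the whole image, and each proposal then only pays for a cheap pooling step plus the small head. All names here are hypothetical, and pooling is simplified to a single global max per box rather than the paper's scheme:

```python
import numpy as np

def shared_conv_rerank(image_features, boxes, head):
    """Run the conv stack ONCE to get `image_features` (C, H, W), then
    score each box by pooling its region and applying the small
    fully-connected `head` (a hypothetical callable). No per-box
    convolution pass is needed."""
    scores = []
    for (x0, y0, x1, y1) in boxes:
        region = image_features[:, y0:y1, x0:x1]
        pooled = region.max(axis=(1, 2))  # crude global max-pool per box
        scores.append(head(pooled))
    order = np.argsort(scores)[::-1]      # highest score first
    return [boxes[i] for i in order]
```

The cost of the expensive convolutions is thus amortized over all proposals instead of being paid once per cropped box.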
The contribution of this paper is a deep learning approach to ranking object proposals. The approach begins by generating object bounding-box proposals and then ranks them using scores from DeepBox. It should be noted that the DeepBox architecture is relatively shallow at 4 layers, yet it still generalizes well and performs better than non-data-driven approaches.
1. Have there been any attempts to generate object proposals using deep networks? 2. I am curious why the shallow net worked nearly as well as the deeper AlexNet. Is there any intuition as to why this is the case?