This paper improves upon existing work using convolutional networks for semantic segmentation by making pixel-wise predictions (via a fully convolutional architecture) and combining coarse, high-layer information with finer-grained, low-layer information. The fully convolutional version of the net produces a heat-map version of classification, which can be read as a pixel-wise segmentation prediction. This prediction is refined by summing the second-to-last pooling layer's predictions with a 2x up-sampling of the final layer's output, then up-sampling the fused result to input size; the up-sampling is initialized to bilinear interpolation but subsequently learned. The process is repeated with the third-to-last pooling layer, summed with a 2x up-sampling of the previously fused prediction and again up-sampled to input size.
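A minimal PyTorch sketch (mine, not the authors' code; layer names and sizes are illustrative) of the fusion just described: pool4 class scores are summed with 2x-upsampled final-layer scores, and the fused map is then upsampled to input resolution.

```python
import torch
import torch.nn as nn

num_classes = 21  # PASCAL VOC: 20 classes + background

# Illustrative score maps: final-layer scores at stride 32, pool4 scores at stride 16
score_final = torch.randn(1, num_classes, 10, 10)   # stride-32 grid
score_pool4 = torch.randn(1, num_classes, 20, 20)   # stride-16 grid

# Learned 2x upsampling (the paper initializes such filters to bilinear interpolation)
up2 = nn.ConvTranspose2d(num_classes, num_classes, kernel_size=4, stride=2,
                         padding=1, bias=False)
fused = score_pool4 + up2(score_final)              # fuse predictions at stride 16

# Upsample the fused prediction back to input resolution (16x)
up16 = nn.ConvTranspose2d(num_classes, num_classes, kernel_size=32, stride=16,
                          padding=8, bias=False)
dense_scores = up16(fused)
print(dense_scores.shape)                           # torch.Size([1, 21, 320, 320])
```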
Questions/Discussion:
1) Are there cases where the coarse predictions from the higher-stride, coarser-grained prediction layers perform better than the finer-detailed predictions? Could this be indicative of errors in the original segmentation ground-truth data?
2) Considering the statement "Training from scratch is not feasible considering the time required to learn the base classification nets." - would this mechanism work if it were trained from scratch? Has anyone tried? It would be interesting to train this from scratch and then compare the results to this implementation - it would expose whether the pixel-level capability of the architecture is more a hidden consequence of the fully connected layers of the original nets, or whether it is inherent in the earlier net layers.
The paper presents a 'fully convolutional' network architecture for dense pixel-wise prediction. Pretrained networks from former ILSVRC submissions are converted into fully convolutional models and fine-tuned for image segmentation. The results yield state-of-the-art segmentation on a number of different benchmark datasets.
Questions:
1) The paper (reasonably) claims that training from scratch is not feasible, but it would be very interesting to try training these models without a pretrained component, just to see whether it's possible to get comparable segmentation results. Has this been attempted?
Abstract:
The paper presents a fully convolutional network architecture for image segmentation via dense pixel-wise predictions. They take a deep net trained for classification and convert it to a fully convolutional network. Finally, they upsample the output and use a pixel-wise loss function to train the whole network jointly. To join fine and coarse information for better segmentation, they combine the output of conv7 with upsampled outputs of various pooling layers. This gives the net the ability to capture larger-scale semantic information as well as fine boundary information. The system gives state-of-the-art results on various datasets for the pixel-wise class prediction task.
Discussion:
1) Can you please explain how the fully connected layers are transformed into convolution layers? (See the sketch after these questions.)
2) Is the upsampling process a bunch of deconv layers? Isn't it then similar to 'Learning Deconvolution Network for Semantic Segmentation' (http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Noh_Learning_Deconvolution_Network_ICCV_2015_paper.pdf), which is deconv layers followed by conv layers?
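On question 1, a minimal sketch (mine, not the authors' code; VGG-style sizes assumed) of the conversion: an FC layer that sees a 512x7x7 feature map is exactly a 7x7 convolution with 4096 filters, with the weight matrix reshaped into filter form.

```python
import torch
import torch.nn as nn

# An FC layer expecting a flattened 512x7x7 feature map (like VGG's fc6)
fc = nn.Linear(512 * 7 * 7, 4096)

# The equivalent convolution: 4096 filters, each of size 512x7x7
conv = nn.Conv2d(512, 4096, kernel_size=7)
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(4096, 512, 7, 7))
    conv.bias.copy_(fc.bias)

x = torch.randn(1, 512, 7, 7)
print(torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-5))  # True

# On a larger input, the same conv now slides over the feature map and yields
# a spatial grid of "FC" outputs, i.e., a heat map instead of a single vector.
print(conv(torch.randn(1, 512, 14, 14)).shape)   # torch.Size([1, 4096, 8, 8])
```

The final classifier is handled the same way, becoming a 1x1 convolution that outputs one score map per class.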
Summary:
This paper presents the fully convolutional network for image segmentation (pixel-by-pixel classification).
The network can take images of arbitrary size because the end-to-end training minimizes a pixel-wise loss, with the help of upsampling (deconvolution) after the last (convolutionalized) layer of the network. To get better results, they introduce 'skips', which fuse lower- and higher-level conv features to make predictions.
Question:
The 'skip' technique improves fine-grained segmentation a lot according to Table 2 and Figure 4. Since the input image is not of fixed size, to get better segmentation (not only on PASCAL 2011), shouldn't we consider even lower-level 'skips'?
This paper introduces exactly what its title says. Rather than taking 256x256 images and producing a fixed-length output, these networks can take images of any size and produce an output of corresponding size. Applied densely, classification then becomes pixel-by-pixel labeling. They need to start from pretrained networks, though.
Do they train over the pretrained component, or do they just leave it as is?
The paper introduces Fully Convolutional Networks (FCNs) as machinery for end-to-end semantic segmentation of arbitrarily sized images. This involves convolutionalizing the fully connected layers of existing networks and fine-tuning them for segmentation. Dense predictions are obtained by adding a deconvolution layer to upsample the output, where the upsampling filter can be learned. Dense, pixel-wise predictions are further improved by a novel skip architecture that connects lower-level predictions with higher-level predictions.
Discussion:
1. I may be missing something, but how does converting only the fully connected layers to convolutional layers allow for arbitrarily sized images? Is any change made to the other layers? I am imagining the convolutionalizing step as simply rearranging the previous layer into a grid of feature maps and applying a kernel of the same size, with the number of filters equal to the number of pixels in that grid.
2. The Hoiem paper discussed last week talks about techniques being discarded before their true potential is realized, simply because the overall improvement is small. Could that be said of the authors' decision not to fuse lower layers?
This paper presents fully convolutional networks for semantic segmentation. They upsample the final feature maps back to input resolution so that they can use a pixel-wise loss function, and they use a skip architecture to help improve results.
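A minimal sketch (mine, not from the paper) of what the pixel-wise loss amounts to: the upsampled score map is a set of class logits at every pixel, so an ordinary cross-entropy applied per pixel trains the whole net end-to-end.

```python
import torch
import torch.nn as nn

num_classes = 21
scores = torch.randn(2, num_classes, 64, 64, requires_grad=True)  # upsampled logits
labels = torch.randint(0, num_classes, (2, 64, 64))               # per-pixel ground truth

# Cross-entropy averaged over every pixel of every image in the batch
loss = nn.CrossEntropyLoss()(scores, labels)
loss.backward()   # gradients flow back through the upsampling and conv layers
print(loss.item())
```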
Q:
1. What is the naive approach mentioned in sec 3.1?
2. It seems you could come up with a lot of ways to upsample. Are there reasons to prefer one over the others?
This paper provides a method for doing semantic segmentation with "fully" convolutional networks. This involves converting current networks and fine-tuning them to do pixel-wise segmentation. They achieve this by combining lower layers, which have finer strides, with higher layers, which have larger strides.
How do they know to what density to upsample without knowing the original size of the image?
This paper shows that a fully convolutional deep neural network (FCN) can achieve semantic segmentation (pixel-wise labeling). The authors take nets pretrained on the ImageNet challenge, like AlexNet, GoogLeNet, etc., and convert them into FCNs, mainly by treating the fully connected layers as convolutional layers. The output classifier is replaced with a 1x1xN conv layer (N is the number of classes). Since every layer is convolutional, computation is shared across overlapping receptive fields, so both inference and backpropagation are faster. To get a pixel-wise output image, the paper considers both shift-and-stitch and upsampling by interpolation, and implements the latter. The network is then trained on PASCAL VOC 2011, with mean pixel IoU as the metric.
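For reference, a small sketch (mine) of the mean IoU metric: per-class intersection over union of the predicted and ground-truth masks, averaged over classes.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    # Per-class IoU, averaged over classes present in prediction or ground truth
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:            # skip classes absent from both label maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.random.randint(0, 21, (64, 64))   # predicted label map
gt = np.random.randint(0, 21, (64, 64))     # ground-truth label map
print(mean_iou(pred, gt, num_classes=21))
```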
Discussion:
1. The idea of treating FC layers as conv layers so that we can work with any image size is neat. But can we really use ANY image size? For example (Section 3.1), the input size in AlexNet is 227x227 and in FCN they use 500x500 (greater than AlexNet's). But can we give the FCN an image smaller than 227x227?
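A toy sketch (layer sizes mine, much smaller than AlexNet) suggesting the answer: any size works as long as every intermediate feature map stays at least as large as the next kernel; below that minimum, the convolutionalized "FC" layer has nothing to slide over.

```python
import torch
import torch.nn as nn

# Toy fully convolutional net: two stride-2 pools, then a 7x7 "fc"-style conv
net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=7),   # the convolutionalized "FC" layer
)

for size in (28, 64, 500):
    out = net(torch.randn(1, 3, size, size))
    print(size, '->', tuple(out.shape))
# 28 -> (1, 64, 1, 1): this toy net's minimum usable input size
# 64 -> (1, 64, 10, 10), 500 -> (1, 64, 119, 119): larger inputs give score grids
```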
The paper presents Fully Convolutional Networks (FCNs) and how they are used for semantic segmentation. They start with CNNs used for classification and fine-tune them for segmentation. A deconvolution layer is used to upsample the coarse output to form dense predictions. These upsampling filters are initialized as bilinear interpolation filters and are then learned.
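A sketch (mine, not the authors' code) of that initialization: build the standard bilinear kernel and load it into a learnable transposed convolution.

```python
import torch
import torch.nn as nn

def bilinear_kernel(channels, kernel_size):
    # Bilinear upsampling weights: one independent filter per (class) channel
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = torch.arange(kernel_size, dtype=torch.float32)
    filt = 1 - (og - center).abs() / factor
    kernel2d = filt[:, None] * filt[None, :]
    weight = torch.zeros(channels, channels, kernel_size, kernel_size)
    for c in range(channels):
        weight[c, c] = kernel2d
    return weight

num_classes = 21
up = nn.ConvTranspose2d(num_classes, num_classes, kernel_size=4, stride=2,
                        padding=1, bias=False)
with torch.no_grad():
    up.weight.copy_(bilinear_kernel(num_classes, 4))
# up now performs exact 2x bilinear interpolation, and gradients can refine it.
```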
Discussion:
I'm still not too clear on the shift-and-stitch technique, in particular how predictions are made to correspond to the pixels at the centers of their receptive fields.
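A minimal numpy sketch (mine; shift conventions simplified) of shift-and-stitch for a single stride-s operation: run it on every (dy, dx)-shifted copy of the input and interlace the s*s coarse outputs into one full-resolution map.

```python
import numpy as np

def coarse_op(x, s=2):
    # Stand-in for the net's stride-s path: here, plain s x s max pooling
    h, w = x.shape
    return x.reshape(h // s, s, w // s, s).max(axis=(1, 3))

def shift_and_stitch(x, s=2):
    # Dense output from a stride-s op: process all s*s shifted inputs,
    # then interlace the coarse outputs (assumes h, w divisible by s)
    h, w = x.shape
    dense = np.empty((h, w))
    for dy in range(s):
        for dx in range(s):
            shifted = np.pad(x, ((dy, 0), (dx, 0)))[:h, :w]  # shift right/down
            dense[dy::s, dx::s] = coarse_op(shifted, s)
    return dense

x = np.arange(36, dtype=float).reshape(6, 6)
print(shift_and_stitch(x, s=2))
# Cost: s*s full forward passes per dense map, one reason the authors
# ultimately prefer learned upsampling over shift-and-stitch.
```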
The paper discusses a method for semantic segmentation of images using a convolutional network trained end-to-end, pixels-to-pixels, which provides state-of-the-art results. The authors also propose a novel deep architecture that combines semantic information from deeper layers with appearance information from shallower layers to produce segmentation results. The results the authors provide on PASCAL VOC are state of the art, with relatively fast inference time.
Question:
Can you elaborate more on the shift-and-stitch trick?
Summary:
This paper defines a novel fully convolutional architecture by adapting AlexNet, GoogLeNet, and VGGNet, fine-tuning them to perform state-of-the-art segmentation using end-to-end, pixels-to-pixels training. Dense predictions are achieved by upsampling from pooling layers, decreasing the pooling size, etc.
Questions:
How might upsampling from shallower pool layers like pool4 or lower, and combining the outputs with those of the following pooling layers, affect performance? Also, would it be beneficial to upsample from multiple pooling layers?
This paper proposes a method for semantic segmentation of images by learning upsampling filters. They show that simply upsampling using fixed interpolation is not as effective as learning the upsampling filter. They call this a deconvolutional net.
The paper shows how a fully connected layer can be treated as a 1x1 kernel, whose output can then be upsampled with learned filters to the original size of the image. This helps find finer details of the image without losing much information. They show a relative improvement of 20% over the state of the art on the PASCAL VOC segmentation dataset.
Questions:
How does the coarse upsampling to 32s work? I thought they kept the pooling weights and combined them with the learned filters to upsample the image.
They talk about minibatches where they take overlapping crops of the same image, and then go on to discuss rejection sampling. Can you explain this in more detail?
This also treats segmentation as purely a classification task and neglects the boundaries in the images. Are there papers that address this?
This paper proposes using CNNs to perform image segmentation. The major contribution is improving on the state of the art by using fully convolutional layers - an architecture that allows for dense predictions and thus better accuracy. The output of the FCN is a heat map of predictions, which is in turn converted into a semantically meaningful segmentation.
1. Would a similar approach work for edge detection or contour detection?
2. How well would this work for segmenting very small objects?
This paper presents a fully convolutional network architecture for image segmentation. The networks are flexible to arbitrary input image sizes via pixel-wise mapping, and they combine deep, higher-level information with shallow, low-level information. The authors converted popular deep networks into this fully convolutional architecture and achieve improved segmentation results on the VOC dataset.
Can you explain the shift-and-stitch trick?
This paper presents dense pixel-wise segmentation via fully convolutional networks (FCNs). This is achieved by converting previous state-of-the-art classification networks into fully convolutional networks. Then, by training them with a pixel-wise loss function, they are able to get fine segmentations.
Clarifications:
Can the upsampling process be explained thoroughly?
Also, I would like an explanation of how the networks are extended to fully convolutional networks.
The paper proposes a technique for semantic segmentation of images using convolutional networks that are trained end-to-end, pixels-to-pixels. The proposed method takes CNNs trained for classification and fine-tunes them for segmentation. A deconvolution layer is used to bilinearly upsample the output layers. A skip architecture is introduced in which higher- and lower-level features are combined to improve prediction.
I didn't understand how changing only the filters and the layer strides of a convnet can produce the same output as the shift-and-stitch trick. Could you please go over the shift-and-stitch trick and also the trade-offs associated with it?
The paper presents a fully convolutional network that takes an arbitrarily sized image as input and predicts the corresponding pixel-wise labeling. The paper discusses how an FCN can be viewed as convolutional kernels applied over the entire image; this is used to get coarse pixel-wise predictions. They consider upsampling and the 'shift and stitch' method to go from coarse outputs to dense predictions.
Questions:
I am not sure how transforming fully connected layers into convolutional layers helps reduce computation time.
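A toy sketch (layer sizes mine) of why this saves time: evaluating the convolutionalized classifier in one dense pass gives the same numbers as running it on every overlapping patch, but the patchwise version recomputes the shared convolutional features for each patch.

```python
import torch
import torch.nn as nn

features = nn.Conv2d(3, 8, 3)              # shared feature extractor
classifier = nn.Conv2d(8, 5, 5)            # convolutionalized "FC" classifier
net = nn.Sequential(features, classifier)  # effective patch size: 7x7

img = torch.randn(1, 3, 32, 32)
dense = net(img)                           # one pass over the image: (1, 5, 26, 26)

# Patchwise evaluation gives the same output, but redoes the feature conv per patch
patch_out = net(img[:, :, 10:17, 10:17])   # the 7x7 patch at offset (10, 10)
print(torch.allclose(dense[:, :, 10, 10], patch_out[:, :, 0, 0], atol=1e-5))  # True
```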
This paper introduces a new deep learning model, the "Fully Convolutional Network" (FCN), for pixel-wise prediction. An FCN consists of only convolutional layers, initialized from popular ConvNet classification models. It accepts an input image of any size and produces a correspondingly sized output map. The FCN is trained end-to-end and introduces a novel "skip" architecture that combines local, shallow "where" information with global, semantic "what" information from different layers of the network. FCNs report state-of-the-art results on PASCAL VOC 2011/2012, NYUDv2, and SIFT Flow.
The paper mentioned that the GoogLeNet FCN did not produce the same segmentation results as the VGG FCN; any intuition behind that?
Can anyone explain the details of shift-and-stitch?