Wednesday, February 17, 2016

Fri, Feb 19 - Deep Neural Decision Forests

Deep Neural Decision Forests. Peter Kontschieder, Madalina Fiterau, Antonio Criminisi, and Samuel Rota Bulo. ICCV 2015.

Project page

21 comments:

  1. This paper presents a network architecture that combines stochastic decision forests with deep convolutional networks. By using sigmoid-driven classifiers as the decision nodes, the trees become differentiable, and by making those decision nodes probabilistic, the model can benefit from stochastic backpropagation, which updates the common parameters theta behind each node's "left-or-right" routing probability. The leaves of the tree are then class predictors, whose distributions are derived by minimizing a risk function while holding fixed the tree parameters involved in routing to the leaf. When coupled with a convolutional network, each decision node is tied to a single output of a fully connected layer, which drives that node's decision function (a small sketch of one such node is included below).

    Discussion/Questions:
    1) What is theta (beyond "decision node parameters") and how is it determined?

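    A minimal sketch of one such decision node (my own illustration, not the authors' code; the only assumptions are that f_n is a single fc-layer output and that its sigmoid gives the left-routing probability, as described above):

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    # One stochastic decision node: f_n is a single output unit of the fc
    # layer (so theta is simply the network weights that produce it); d_n is
    # the probability of routing the sample to the left child, 1 - d_n to
    # the right.
    def decision_node(f_n):
        d_n = sigmoid(f_n)                   # split probability in (0, 1)
        go_left = np.random.rand() < d_n     # stochastic routing during training
        return d_n, go_left

    print(decision_node(0.4))                # e.g. (0.5986..., True) or (0.5986..., False)
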
  2. Summary:
    This paper combines the representation learning of CNNs with decision-forest classification into deep neural decision forests, and beats the GoogLeNet results on classification. The key contribution is the network design: fully connected layer activations are passed through a sigmoid to form the decision nodes, and routing through the tree is stochastic, so the whole model can be trained with stochastic gradient descent.

    Questions:
    In Figure 2, the fully connected layer activations are fed into the decision tree to make the decisions described by equation (3).
    My question is:
    (1) In the description under Figure 2 on page 5 they mention that "the order of the assignments of output units to decision nodes can be arbitrary", but in a traditional decision tree, strategically choosing which feature to split on is important to the result. Why is that not the case for deep neural decision trees?
    (2) Why is a sigmoid function used in the decision nodes instead of a rectified linear unit?

  3. Abstract:

    The paper presents a new architecture that combines the divide-and-conquer properties of decision trees with representation learning. It provides a differentiable random-forest architecture in which a neural network acts as the decision function; since the architecture is differentiable, all the weights can be learned by back propagation. The final section reports experiments with both shallow and deep neural decision forests and shows improvements over the GoogLeNet architecture on ImageNet image classification.

    Discussion:

    1) How many parameters does the proposed architecture have in total? Is that higher compared to GoogLeNet?

  4. This paper tries to unite CNNs and decision forests. It has a differentiable tree: in short, there is a decision tree whose non-terminal nodes take their input from an fc layer and whose output is a prediction made by one of the leaf nodes.

    Q:
    How do you interpret the nodes? In a normal binary tree you go to the root, then left or right, and so forth. Here the fully connected layer's output is fed into all the nodes at once.

    Basically, it seems to me that, say, d8 -> d9 -> d12 might be the best path, but d8 is such that it will almost always route you right, to d10.

    2. Is there a way to make this work with structured output, like letters? A tree just seems like it is begging for structured data: certain letters and words are more probable given other words/letters, and you could capture that in a binary tree.

    3. I must be missing something. How do they enforce the constraint that the probabilities that a sample x reaches each leaf sum to 1? (A quick numeric check of this is included below.)

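    As a quick numeric check of question 3 (a toy sketch of my own, assuming, as in the paper, that the routing probability mu of a leaf is the product of the sigmoid split probabilities d and 1 - d along its root-to-leaf path):

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    depth = 3                                    # toy tree: 7 split nodes, 8 leaves
    d = sigmoid(np.random.randn(2 ** depth - 1)) # P(go left) at each split node

    # mu[l] multiplies d (left turns) and 1 - d (right turns) along the
    # root-to-leaf path; summed over all leaves the products telescope to 1,
    # so the constraint holds by construction rather than being enforced.
    mu = np.ones(2 ** depth)
    for leaf in range(2 ** depth):
        node, lo, hi = 0, 0, 2 ** depth          # split nodes stored breadth-first
        for _ in range(depth):
            mid = (lo + hi) // 2
            if leaf < mid:                       # leaf lies in the left subtree
                mu[leaf] *= d[node]
                node, hi = 2 * node + 1, mid
            else:                                # leaf lies in the right subtree
                mu[leaf] *= 1.0 - d[node]
                node, lo = 2 * node + 2, mid
    print(mu.sum())                              # -> 1.0 up to float rounding
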
  5. Summary
    This paper presents a novel classification model which is essentially a combination of deep neural networks and decision trees. The combination is non-trivial in the sense that the output of the model comes from a decision forest whereas the input is taken in by the deep network. The final layers of the deep network are used to drive the split functions of the decision forests – a process the paper calls "representation learning", because the features on which to split are learnt using the deep network. This approach was tested by plugging the decision forests into state-of-the-art deep networks like GoogLeNet, and without any additional training data beyond ImageNet, the error rate is observed to come down by 0.29%.
    Questions
    1) Decision trees are prone to overfitting if the depth is not controlled. How exactly does the algorithm tackle this problem?
    2) I did not completely understand the stochastic nature of finding the best split at any node of the tree, or the back-propagation techniques used.
    3) No information is provided about the run-time of this algorithm. Will training the deep network still be the bottleneck in this type of model, or would training the decision trees also add a considerable amount of time?

  6. This paper introduces a new way to end CNNs – random forests! They introduce a tree-training method based on a global loss function that allows them to take the last few layers of a CNN and replace them with an RF. By doing this, they squeeze an extra few tenths of a percent out of GoogLeNet when applied to ImageNet. I think the idea is that CNNs are a great way to learn mid-level features, but other (much better understood) machine learning techniques might be better suited to the end of the pipeline (the predictor).

    Discussion: I would be interested in seeing the results without the RF training back-propagating into the CNN. If you just put a random forest at the end of GoogLeNet (before the softmax), what sort of accuracy do you get? I.e., is the back-propagation-based RF training relevant?

  7. This paper discusses deep neural decision forests, a fusion of classification trees with deep convolutional networks. Instead of a softmax layer, the output is a prediction layer (the leaf nodes) behind which sits a decision tree. Each decision node is a binary classifier which steers the data to its right or left child based on a probabilistic function. In the learning stage, the parameters of this function and the probabilities at the prediction nodes are learnt through back-propagation: for a given number of iterations, the decision-node parameters and the prediction-node probabilities are learnt using the training data in batches (a rough sketch of this alternating loop is included below). To combine this with a deep neural network, one can connect a fully connected layer to the decision tree/forest.

    Discussion:
    1. What is the advantage of training in batches? Does it improve training time?
    2. What is "Top-5 Error"?

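    A rough, runnable sketch of that alternating batch scheme on a toy problem (one decision node, two leaves, two classes; the multiplicative pi update and the gradient are my reconstruction from the paper's definitions, not the authors' code):

    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    # Toy data: 2 classes, 2-D features standing in for fc-layer activations.
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    w = rng.normal(size=2)                    # "theta": one decision node
    pi = np.full((2, 2), 0.5)                 # 2 leaves x 2 classes, uniform start

    for epoch in range(50):
        # (1) refresh the leaf distributions pi with theta held fixed
        d = sigmoid(X @ w)                    # P(route left) per sample
        mu = np.stack([d, 1.0 - d], axis=1)   # routing probability of each leaf
        p = (mu @ pi)[np.arange(len(y)), y]   # P(true class | x)
        new_pi = np.zeros_like(pi)
        for leaf in range(2):
            for c in range(2):
                new_pi[leaf, c] = np.sum((y == c) * mu[:, leaf] * pi[leaf, c] / p)
        pi = new_pi / new_pi.sum(axis=1, keepdims=True)
        # (2) gradient step on theta with pi held fixed (full batch for brevity)
        d = sigmoid(X @ w)
        p_true = (np.stack([d, 1.0 - d], axis=1) @ pi)[np.arange(len(y)), y]
        grad_d = -(pi[0, y] - pi[1, y]) / p_true      # d(-log p)/dd
        w -= 0.5 * ((grad_d * d * (1 - d))[:, None] * X).mean(axis=0)

    d = sigmoid(X @ w)
    pred = (np.stack([d, 1.0 - d], axis=1) @ pi).argmax(axis=1)
    print("train accuracy:", (pred == y).mean())
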
  8. This paper presents a combination of decision forests with deep neural networks. They demonstrate a differentiable node for the decision trees which enables back-propagation training of the network. As opposed to conventional decision trees, stochastic routing enables the split-node parameters to be learned.

    Questions -
    What is the difference in terms of the number of parameters to be tuned?

  9. The authors present a method of using deep networks together with random forests. Basically, the random forest is connected to the last fully connected layer of the net and acts as a classifier. Two important characteristics of the random forest are that its decision nodes are stochastic and differentiable. Being differentiable allows back propagation to be used to update the forest's parameters. Results of using deep neural decision forests were shown to be on par with or better than conventional deep network approaches at the time.

    Questions:

    The top-5 error of a single dNDF.NET against GoogLeNet is 7.84% vs. 10.02%, but when an ensemble of 7 dNDF.NETs is compared against 7 GoogLeNets, the dNDF.NETs only provide an error reduction of 0.29%. What causes the improvement to drop so much between single-model and ensemble classification?

  10. This paper presents a novel design of random forest that is able to learn the input data representation and also make classification decisions in a divide-and-conquer fashion. In this design, each decision tree has probabilistic split nodes – instead of the usual deterministic ones – where the routing decision is made by a binary classifier that uses a sigmoid function. Therefore, the tree is differentiable and can use stochastic back-propagation to update the weights associated with each decision node. The network was tested on different benchmarks and outperformed a state-of-the-art deep learning model (GoogLeNet).

    Q: The figure shows the final random forest fully connected to the CNN. Which layer of the CNN is connected here?

  11. This paper proposes an algorithm to enrich random forests with deep learning and vice versa. One nice property of random forests is that they can learn hierarchical rules and are by nature quite modular, which when combined with the already powerful deep learning architecture becomes even more powerful. From what I can tell, the final layers in the deep network are used to create the decision nodes within the tree structure (each node gets a routing probability which corresponds to the probability of going left or right within the tree). The trees are iteratively trained for a user-specified number of epochs and updated using stochastic gradient descent. The final output is the sum over the leaves of their routing probabilities times their class-probability vectors (a small numeric sketch of this is included below).

    1. I am not sure I understand how this works. Can you go over the learning process of this approach?
    2. When is this the best approach, and when would it make no sense to use?

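    A tiny numeric sketch of that final output (toy values; forest_predict is a made-up helper name): each tree contributes the sum over its leaves of routing probability times class distribution, and the forest averages over trees.

    import numpy as np

    def forest_predict(mus, pis):
        # mus: list of (n_leaves,) routing-probability arrays, one per tree
        # pis: list of (n_leaves, n_classes) leaf class distributions
        return np.mean([mu @ pi for mu, pi in zip(mus, pis)], axis=0)

    mus = [np.array([0.7, 0.3]), np.array([0.2, 0.8])]
    pis = [np.array([[0.9, 0.1], [0.4, 0.6]]),
           np.array([[0.8, 0.2], [0.1, 0.9]])]
    print(forest_predict(mus, pis))   # averaged class posterior of two toy trees
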
  12. This paper presents a method to combine random forests with deep neural networks in order to exploit their representational ability. They show state-of-the-art results on the ImageNet challenge.

    They show how to learn deep networks combined with random forests. Each decision node is tied to a node (output unit) of the fully connected layer of the deep network, and the decision nodes are thereby conditioned on the neural network. The leaf nodes are learnt offline in between minibatches.

    They experiment with GoogLeNet and AlexNet and compare with shallower nets.

    Questions:

    Is the supervision still from the leaf nodes?
    Could you explain how GoogLeNet was utilized here?

  13. Most approaches to decision tree learning in the literature are greedy and only leverage local information. The paper presents a differentiable formulation of decision trees that can be trained via back-propagation. The new model works well both as a stand-alone classifier and when taking its input from the fully connected layers of a convolutional network.

    Questions:
    1) Wouldn't a stochastic routing provide strictly worse classification scores than a deterministic one?

  14. The paper presents a stochastic and differentiable decision tree model that guides the representation learning that happens in the lower layers of a deep neural network. The authors define a decision function for every node which decides whether the input should be routed to the left child or the right child. The parameters of the decision function are incrementally learnt so as to minimize the risk of wrong classification.
    Questions:
    How is regularization achieved? The paper mentions randomly picking a tree and updating its learnt parameters per iteration. Does that amount to regularization?

  15. The paper proposes an architecture that combines the representation learning features of CNNs with the divide-and-conquer principle of decision forests. It does this by replacing the softmax layers in a tweaked version of the GoogLeNet architecture with decision forests. The higher layers of the deep network are used to drive the split functions of the decision forests: each decision node is a binary classifier which routes the input data either to its right or left child based on a probabilistic split function, the parameters of which are learnt through back-propagation (a sketch of this wiring is included below). This architecture is shown to perform better than the GoogLeNet architecture.


    Questions:

    How many parameters need to be learnt for this architecture vs GoogLeNet?

    How exactly is theta learned?

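    A sketch of that wiring (sizes and names here are illustrative, not taken from the paper): the tweaked fc layer emits one scalar f_n(x) per decision node of each tree, and the sigmoid of that scalar becomes the node's split probability.

    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    n_trees, depth, n_features = 4, 5, 1024          # illustrative sizes
    n_split_nodes = 2 ** depth - 1                   # decision nodes per tree
    W = 0.01 * rng.normal(size=(n_features, n_trees * n_split_nodes))

    features = rng.normal(size=n_features)           # output of the CNN trunk
    d = sigmoid(features @ W).reshape(n_trees, n_split_nodes)
    print(d.shape)   # (4, 31): a split probability for every node of every tree

    Presumably this is also why the assignment of output units to decision nodes can be arbitrary: the weights feeding each unit are learned jointly with its node, so any fixed assignment works.
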
  16. This paper presents a method for classifying images that did better than GoogLeNet in the ImageNet challenge. They use a decision function at every node and then define back-propagation for that function.

    Questions:
    1. I'm getting really confused about why they're using stochastic routing instead of just a deterministic one. Is it because they want it to replicate an NN node?
    2. Why a sigmoid instead of a ReLU?

  17. The paper introduces a novel classification model that combines deep neural networks and decision trees. The proposed model is a differentiable random-forest architecture with a neural network as the decision function, where the weights are learnt by back-propagation. The paper evaluates the performance of shallow neural decision forests as standalone classifiers, as well as their effect when used as the classifier in a deep convolutional neural network. The decision forests were integrated with the GoogLeNet architecture, which yields improved performance on ImageNet.

    Questions/Discussion:
    1. How does the algorithm take care of overfitting?
    2. It would be great if you could go over the learning process.

  18. This paper describes how random decision forests can be attached as the classifier of a deep architecture. They go on to describe how they optimize the two parameter sets, 'theta' and 'pi', using SGD and online learning methods. They also explain their optimization strategy in depth, and they report performance against previous architectures that use softmax as the classifier in different competitions.

    Questions-

    1. Why use stochastic rather than deterministic decision making?
    2. How much is the overhead?

  19. This paper is novel in that it presents a technique to train decision trees using the backpropagation algorithm. It introduces the concept of stochastic routing at the decision (split) nodes of the tree rather than the deterministic splitting that is done traditionally. The backpropagation procedure aims to reduce a global error function, where the optimization is carried out in two stages: the first involves learning the parameters theta, which correspond to the inner decision nodes, and the second is learning the parameters pi, which correspond to the prediction (leaf) nodes (a reconstruction of the decision-node gradient and traversal is sketched below).
    They use this technique to build a standalone neural decision forest or to combine it with ConvNet layers.

    Questions:
    1. Why is the routing direction chosen to be the output of a Bernoulli random variable?
    2. They say that the backpropagation can be carried out in a reverse breadth-first traversal; shouldn't it be reverse depth-first?
    3. What is the guarantee that this stochastic routing will not run into a pathological state and actually produce worse classification performance than deterministic routing?

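    My attempt to reconstruct the decision-node gradient and that reverse breadth-first sweep from the paper's definitions (the heap layout and names are my own; treat this as a sketch, not the authors' algorithm):

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    # Gradient of L = -log p(y|x) w.r.t. each decision node's input f_n, for a
    # complete tree stored breadth-first (children of split node n are 2n+1 and
    # 2n+2; the leaves follow the last split node).  A[m] accumulates
    # pi_l[y] * mu_l / p(y|x) over the leaves below m, which is why the sweep
    # is reverse breadth-first: both children are finished before their parent.
    def decision_node_grads(f, pi, y):
        n_dec = len(f)                       # number of split nodes in the tree
        d = sigmoid(f)
        mu = np.ones(2 * n_dec + 1)          # routing probability of every node
        for n in range(n_dec):               # forward pass, breadth-first
            mu[2 * n + 1] = mu[n] * d[n]
            mu[2 * n + 2] = mu[n] * (1.0 - d[n])
        mu_leaf = mu[n_dec:]
        p = mu_leaf @ pi[:, y]               # p(y | x)
        A = np.zeros(2 * n_dec + 1)
        A[n_dec:] = mu_leaf * pi[:, y] / p
        grads = np.zeros(n_dec)
        for n in reversed(range(n_dec)):     # reverse breadth-first sweep
            A[n] = A[2 * n + 1] + A[2 * n + 2]
            grads[n] = d[n] * A[2 * n + 2] - (1.0 - d[n]) * A[2 * n + 1]
        return grads                         # dL/df_n for every split node

    f = np.array([0.3, -1.2, 0.8])           # depth-2 toy tree: 3 splits, 4 leaves
    pi = np.array([[0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.1, 0.9]])
    print(decision_node_grads(f, pi, y=1))
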
  20. This paper presents a new, differentiable architecture that combines random forests with deep networks. The last layers of the CNN are replaced with a random forest which acts as a classifier. Each decision tree has a probabilistic split function whose parameters are learned through back-propagation. The representation learning aspect of the deep network is used to determine the features on which to split. With this model, there was an improvement over GoogLeNet on classification.

    Can you explain the learning aspect of this architecture?

  21. Hi everyone.
    Please find the slides of my presentation for your reference here: http://1drv.ms/1KWtfyp

    I will get back to the class regarding the questions we have about this paper as soon as I get a response from the authors.
