Thanks for your feedback, Sean; I agree with you. I have logged a JIRA issue (https://issues.apache.org/jira/browse/SPARK-23409) and will take a closer look at the code to see if I can come up with a PR to handle this.
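To make the idea concrete, here is a minimal sketch of the bottom-up pass I have in mind, written in plain Scala against simplified stand-in node types (SimpleLeaf and SimpleInternal are hypothetical; as far as I can tell, the constructors of Spark's actual LeafNode and InternalNode are private[ml], so a real implementation would have to live inside the ml.tree package):

// Simplified stand-ins for org.apache.spark.ml.tree.{LeafNode, InternalNode};
// illustrative types only, not the actual Spark classes.
sealed trait SimpleNode { def prediction: Double }
case class SimpleLeaf(prediction: Double) extends SimpleNode
case class SimpleInternal(prediction: Double,
                          left: SimpleNode,
                          right: SimpleNode) extends SimpleNode

// Bottom-up "syntactic" pruning: prune both children first, then collapse
// this node into a leaf if both pruned children are leaves that agree on
// the prediction.
def prune(node: SimpleNode): SimpleNode = node match {
  case leaf: SimpleLeaf => leaf
  case SimpleInternal(pred, left, right) =>
    (prune(left), prune(right)) match {
      case (SimpleLeaf(p1), SimpleLeaf(p2)) if p1 == p2 => SimpleLeaf(p1)
      case (prunedLeft, prunedRight) => SimpleInternal(pred, prunedLeft, prunedRight)
    }
}

// The leftmost subtree from the example quoted below then collapses to a
// single leaf, which is exactly the transformation I described:
val subtree = SimpleInternal(0.0,
  SimpleInternal(0.0, SimpleLeaf(0.0), SimpleLeaf(0.0)),
  SimpleInternal(0.0, SimpleLeaf(0.0), SimpleLeaf(0.0)))
assert(prune(subtree) == SimpleLeaf(0.0))
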
On 13 February 2018 at 12:00, Sean Owen <sro...@gmail.com> wrote:

> I think the simple pruning you have in mind was just never implemented.
>
> That sort of pruning wouldn't help much if the nodes maintained a
> distribution over classes, as those are rarely identical, but they just
> maintain a single class prediction. After training, I see no value in
> keeping those nodes. Whatever impurity gain the split managed on the
> training data is 'lost' when the prediction is collapsed to a single
> class anyway.
>
> Whether it's easy to implement in the code I don't know, but it's
> straightforward conceptually.
>
> On Tue, Feb 13, 2018 at 4:21 AM Alessandro Solimando <
> alessandro.solima...@gmail.com> wrote:
>
>> Hello Nick,
>> thanks for the pointer, that's interesting.
>>
>> However, there seems to be a major difference with what I was
>> discussing.
>>
>> The JIRA issue relates to overfitting and considerations on information
>> gain, while what I propose is a much simpler "syntactic" pruning.
>>
>> Consider a fragment of the example above, the leftmost subtree in
>> particular:
>>
>>> If (feature 1 <= 0.5)
>>>  If (feature 2 <= 0.5)
>>>   If (feature 0 <= 0.5)
>>>    Predict: 0.0
>>>   Else (feature 0 > 0.5)
>>>    Predict: 0.0
>>>  Else (feature 2 > 0.5)
>>>   If (feature 0 <= 0.5)
>>>    Predict: 0.0
>>>   Else (feature 0 > 0.5)
>>>    Predict: 0.0
>>
>> which corresponds to the following "objects":
>>
>>> -InternalNode(prediction = 0.0, impurity = 0.48753462603878117, split =
>>> org.apache.spark.ml.tree.ContinuousSplit@fdf00000)
>>> --InternalNode(prediction = 0.0, impurity = 0.345679012345679, split =
>>> org.apache.spark.ml.tree.ContinuousSplit@ffe00000)
>>> ---InternalNode(prediction = 0.0, impurity = 0.4444444444444445, split =
>>> org.apache.spark.ml.tree.ContinuousSplit@3fe00000)
>>> ----LeafNode(prediction = 0.0, impurity = -1.0)
>>> ----LeafNode(prediction = 0.0, impurity = 0.0)
>>> ---InternalNode(prediction = 0.0, impurity = 0.2777777777777777, split =
>>> org.apache.spark.ml.tree.ContinuousSplit@3fe00000)
>>> ----LeafNode(prediction = 0.0, impurity = 0.0)
>>> ----LeafNode(prediction = 0.0, impurity = -1.0)
>>
>> A more comprehensive impurity-based node-splitting policy might well
>> prevent this situation (by splitting node "ffe00000" you get an
>> impurity gain on one child and a loss on the other), but independently
>> of that, once the tree is built, I would cut the redundant subtree and
>> obtain the following:
>>
>>> -InternalNode(prediction = 0.0, impurity = 0.48753462603878117, split =
>>> org.apache.spark.ml.tree.ContinuousSplit@fdf00000)
>>> --LeafNode(prediction = 0.0, impurity = ...)
>>
>> I cannot say whether this is relevant for all the tree ensemble
>> methods, but it certainly is for RF, even more so than for DT, as the
>> leverage effect is even higher (and the code generating them is the
>> same: DT calls RF with numTrees = 1, from what I can see).
>>
>> Since this is an optimization aimed at reducing the model's memory
>> footprint and invocation time, it is independent of any statistical
>> consideration about overfitting, which your reply seems to imply.
>>
>> Am I missing something?
>>
>> Best regards,
>> Alessandro
>>
>> On 13 February 2018 at 10:57, Nick Pentreath <nick.pentre...@gmail.com>
>> wrote:
>>
>>> There is a long outstanding JIRA issue about it:
>>> https://issues.apache.org/jira/browse/SPARK-3155.
>>>
>>> It is probably still a useful feature to have for trees, but the
>>> priority is not that high, since it may not be that useful for the
>>> tree ensemble models.
>>>
>>> On Tue, 13 Feb 2018 at 11:52 Alessandro Solimando <
>>> alessandro.solima...@gmail.com> wrote:
>>>
>>>> Hello community,
>>>> I have recently manually inspected some decision trees computed with
>>>> Spark (2.2.1, but the behavior is the same with the latest code on
>>>> the repo).
>>>>
>>>> I have observed that the trees are always complete, even if an entire
>>>> subtree leads to the same prediction in its different leaves.
>>>>
>>>> In such a case, the root of the subtree, instead of being an
>>>> InternalNode, could simply be a LeafNode with the (shared) prediction.
>>>>
>>>> I know that decision trees computed by scikit-learn show the same
>>>> behavior; I understand that this happens by construction, because the
>>>> redundancy only becomes apparent once the tree is fully built.
>>>>
>>>> So my question is: why is this "post-pruning" missing?
>>>>
>>>> Three hypotheses:
>>>>
>>>> 1) It is not suitable (for a reason I fail to see)
>>>> 2) Such an addition is considered not worth it (in terms of added
>>>> code complexity, perhaps)
>>>> 3) It has been overlooked, but could be a favorable addition
>>>>
>>>> For clarity, I have isolated a small case that reproduces this, shown
>>>> in what follows.
>>>>
>>>> This is the dataset:
>>>>
>>>>> +-----+-------------+
>>>>> |label|features     |
>>>>> +-----+-------------+
>>>>> |1.0  |[1.0,0.0,1.0]|
>>>>> |1.0  |[0.0,1.0,0.0]|
>>>>> |1.0  |[1.0,1.0,0.0]|
>>>>> |0.0  |[0.0,0.0,0.0]|
>>>>> |1.0  |[1.0,1.0,0.0]|
>>>>> |0.0  |[0.0,1.0,1.0]|
>>>>> |1.0  |[0.0,0.0,0.0]|
>>>>> |0.0  |[0.0,1.0,1.0]|
>>>>> |1.0  |[0.0,1.0,1.0]|
>>>>> |0.0  |[1.0,0.0,0.0]|
>>>>> |0.0  |[1.0,0.0,1.0]|
>>>>> |1.0  |[0.0,1.0,1.0]|
>>>>> |0.0  |[0.0,0.0,1.0]|
>>>>> |0.0  |[1.0,0.0,1.0]|
>>>>> |0.0  |[0.0,0.0,1.0]|
>>>>> |0.0  |[1.0,1.0,1.0]|
>>>>> |0.0  |[1.0,1.0,0.0]|
>>>>> |1.0  |[1.0,1.0,1.0]|
>>>>> |0.0  |[1.0,0.0,1.0]|
>>>>> +-----+-------------+
>>>>
>>>> It generates the following model:
>>>>
>>>>> DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3
>>>>> with 15 nodes
>>>>>   If (feature 1 <= 0.5)
>>>>>    If (feature 2 <= 0.5)
>>>>>     If (feature 0 <= 0.5)
>>>>>      Predict: 0.0
>>>>>     Else (feature 0 > 0.5)
>>>>>      Predict: 0.0
>>>>>    Else (feature 2 > 0.5)
>>>>>     If (feature 0 <= 0.5)
>>>>>      Predict: 0.0
>>>>>     Else (feature 0 > 0.5)
>>>>>      Predict: 0.0
>>>>>   Else (feature 1 > 0.5)
>>>>>    If (feature 2 <= 0.5)
>>>>>     If (feature 0 <= 0.5)
>>>>>      Predict: 1.0
>>>>>     Else (feature 0 > 0.5)
>>>>>      Predict: 1.0
>>>>>    Else (feature 2 > 0.5)
>>>>>     If (feature 0 <= 0.5)
>>>>>      Predict: 0.0
>>>>>     Else (feature 0 > 0.5)
>>>>>      Predict: 0.0
>>>>
>>>> As you can see, the following model would be equivalent, but smaller
>>>> and shallower:
>>>>
>>>>> DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 2
>>>>> with 5 nodes
>>>>>   If (feature 1 <= 0.5)
>>>>>    Predict: 0.0
>>>>>   Else (feature 1 > 0.5)
>>>>>    If (feature 2 <= 0.5)
>>>>>     Predict: 1.0
>>>>>    Else (feature 2 > 0.5)
>>>>>     Predict: 0.0
>>>>
>>>> This happens pretty often in real cases, and although the gain per
>>>> model invocation of the "optimized" version is small, it can become
>>>> non-negligible when the number of calls is massive, as one can expect
>>>> in a Big Data context.
>>>>
>>>> I would appreciate your opinion on this matter (whether it is
>>>> relevant for a PR or not, pros/cons, etc.).
>>>>
>>>> Best regards,
>>>> Alessandro
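
P.S. For anyone who wants to reproduce this locally, here is a minimal, self-contained snippet. It assumes an existing SparkSession named spark and default DecisionTreeClassifier parameters; the uid (and possibly the split ordering) in the printed output will differ from the example above.

import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.linalg.Vectors

// needed for toDF; assumes an existing SparkSession named `spark`
import spark.implicits._

// the dataset from the original message
val data = Seq(
  (1.0, Vectors.dense(1.0, 0.0, 1.0)),
  (1.0, Vectors.dense(0.0, 1.0, 0.0)),
  (1.0, Vectors.dense(1.0, 1.0, 0.0)),
  (0.0, Vectors.dense(0.0, 0.0, 0.0)),
  (1.0, Vectors.dense(1.0, 1.0, 0.0)),
  (0.0, Vectors.dense(0.0, 1.0, 1.0)),
  (1.0, Vectors.dense(0.0, 0.0, 0.0)),
  (0.0, Vectors.dense(0.0, 1.0, 1.0)),
  (1.0, Vectors.dense(0.0, 1.0, 1.0)),
  (0.0, Vectors.dense(1.0, 0.0, 0.0)),
  (0.0, Vectors.dense(1.0, 0.0, 1.0)),
  (1.0, Vectors.dense(0.0, 1.0, 1.0)),
  (0.0, Vectors.dense(0.0, 0.0, 1.0)),
  (0.0, Vectors.dense(1.0, 0.0, 1.0)),
  (0.0, Vectors.dense(0.0, 0.0, 1.0)),
  (0.0, Vectors.dense(1.0, 1.0, 1.0)),
  (0.0, Vectors.dense(1.0, 1.0, 0.0)),
  (1.0, Vectors.dense(1.0, 1.0, 1.0)),
  (0.0, Vectors.dense(1.0, 0.0, 1.0))
).toDF("label", "features")

// train with default parameters and print the complete (unpruned) tree
val model = new DecisionTreeClassifier().fit(data)
println(model.toDebugString)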