Hello community,
I have recently manually inspected some decision trees computed with Spark
(2.2.1, but the behavior is the same with the latest code on the repo).

I have observed that the trees are always fully expanded, even when an entire
subtree leads to the same prediction in all of its leaves.

In such a case, the root of the subtree, instead of being an InternalNode,
could simply be a LeafNode with the (shared) prediction.
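
To make the idea concrete, a bottom-up collapse could look roughly like the
sketch below. It uses simplified stand-in node classes rather than Spark's own
InternalNode/LeafNode (which, as far as I can tell, are not meant to be built
from user code), so it only illustrates the transformation:

    // Simplified stand-ins for Spark's tree nodes, only to illustrate the idea.
    sealed trait SimpleNode
    case class Leaf(prediction: Double) extends SimpleNode
    case class Internal(featureIndex: Int, threshold: Double,
                        left: SimpleNode, right: SimpleNode) extends SimpleNode

    // Bottom-up post-pruning: if both pruned children are leaves with the same
    // prediction, replace the internal node with a single leaf.
    def collapse(node: SimpleNode): SimpleNode = node match {
      case leaf: Leaf => leaf
      case Internal(f, t, l, r) =>
        (collapse(l), collapse(r)) match {
          case (Leaf(p1), Leaf(p2)) if p1 == p2 => Leaf(p1)
          case (cl, cr)                         => Internal(f, t, cl, cr)
        }
    }

Applied recursively, a subtree whose leaves all carry the same prediction folds
into a single leaf, exactly as in the example further below.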

I know that decision trees computed by scikit-learn show the same behavior,
and I understand that this happens by construction, since the redundancy only
becomes apparent once the tree has been fully grown.

So my question is: why is this "post-pruning" step missing?

Three hypotheses:

1) It is not suitable (for a reason I fail to see)
2) Such an addition is not considered worth it (in terms of added code
complexity, maybe)
3) It has been overlooked, but would be a welcome addition

For clarity, I have isolated a small case that reproduces this, shown in
what follows.
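
For reference, a minimal sketch along the lines of what I ran (the column
names, the use of toDF, and the default DecisionTreeClassifier parameters are
my assumptions here):

    import org.apache.spark.ml.classification.DecisionTreeClassifier
    import org.apache.spark.ml.linalg.Vectors

    // assuming a spark-shell session, where a SparkSession named `spark` exists
    import spark.implicits._

    // same (label, features) rows as in the table below
    val df = Seq(
      (1.0, Vectors.dense(1.0, 0.0, 1.0)),
      (1.0, Vectors.dense(0.0, 1.0, 0.0)),
      (1.0, Vectors.dense(1.0, 1.0, 0.0)),
      (0.0, Vectors.dense(0.0, 0.0, 0.0)),
      (1.0, Vectors.dense(1.0, 1.0, 0.0)),
      (0.0, Vectors.dense(0.0, 1.0, 1.0)),
      (1.0, Vectors.dense(0.0, 0.0, 0.0)),
      (0.0, Vectors.dense(0.0, 1.0, 1.0)),
      (1.0, Vectors.dense(0.0, 1.0, 1.0)),
      (0.0, Vectors.dense(1.0, 0.0, 0.0)),
      (0.0, Vectors.dense(1.0, 0.0, 1.0)),
      (1.0, Vectors.dense(0.0, 1.0, 1.0)),
      (0.0, Vectors.dense(0.0, 0.0, 1.0)),
      (0.0, Vectors.dense(1.0, 0.0, 1.0)),
      (0.0, Vectors.dense(0.0, 0.0, 1.0)),
      (0.0, Vectors.dense(1.0, 1.0, 1.0)),
      (0.0, Vectors.dense(1.0, 1.0, 0.0)),
      (1.0, Vectors.dense(1.0, 1.0, 1.0)),
      (0.0, Vectors.dense(1.0, 0.0, 1.0))
    ).toDF("label", "features")

    val model = new DecisionTreeClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .fit(df)

    println(model.toDebugString)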

This is the dataset:

> +-----+-------------+
> |label|features     |
> +-----+-------------+
> |1.0  |[1.0,0.0,1.0]|
> |1.0  |[0.0,1.0,0.0]|
> |1.0  |[1.0,1.0,0.0]|
> |0.0  |[0.0,0.0,0.0]|
> |1.0  |[1.0,1.0,0.0]|
> |0.0  |[0.0,1.0,1.0]|
> |1.0  |[0.0,0.0,0.0]|
> |0.0  |[0.0,1.0,1.0]|
> |1.0  |[0.0,1.0,1.0]|
> |0.0  |[1.0,0.0,0.0]|
> |0.0  |[1.0,0.0,1.0]|
> |1.0  |[0.0,1.0,1.0]|
> |0.0  |[0.0,0.0,1.0]|
> |0.0  |[1.0,0.0,1.0]|
> |0.0  |[0.0,0.0,1.0]|
> |0.0  |[1.0,1.0,1.0]|
> |0.0  |[1.0,1.0,0.0]|
> |1.0  |[1.0,1.0,1.0]|
> |0.0  |[1.0,0.0,1.0]|
> +-----+-------------+


Which generates the following model:

> DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3 with 15 nodes
>   If (feature 1 <= 0.5)
>    If (feature 2 <= 0.5)
>     If (feature 0 <= 0.5)
>      Predict: 0.0
>     Else (feature 0 > 0.5)
>      Predict: 0.0
>    Else (feature 2 > 0.5)
>     If (feature 0 <= 0.5)
>      Predict: 0.0
>     Else (feature 0 > 0.5)
>      Predict: 0.0
>   Else (feature 1 > 0.5)
>    If (feature 2 <= 0.5)
>     If (feature 0 <= 0.5)
>      Predict: 1.0
>     Else (feature 0 > 0.5)
>      Predict: 1.0
>    Else (feature 2 > 0.5)
>     If (feature 0 <= 0.5)
>      Predict: 0.0
>     Else (feature 0 > 0.5)
>      Predict: 0.0


As you can see, the following model would be equivalent, but smaller and simpler:

> DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 2 with 5 nodes
>   If (feature 1 <= 0.5)
>    Predict: 0.0
>   Else (feature 1 > 0.5)
>    If (feature 2 <= 0.5)
>     Predict: 1.0
>    Else (feature 2 > 0.5)
>     Predict: 0.0
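
Incidentally, how often this happens on a trained model can be quantified by
walking the tree and counting internal nodes whose whole subtree yields a
single prediction. A rough sketch, assuming the public rootNode, InternalNode
and LeafNode accessors exposed by the ml.tree package:

    import org.apache.spark.ml.tree.{InternalNode, LeafNode, Node}

    // Returns the set of predictions reachable from `node`, together with the
    // number of internal nodes whose entire subtree yields a single prediction
    // (i.e. nodes that could be collapsed into a leaf).
    def collapsible(node: Node): (Set[Double], Int) = node match {
      case leaf: LeafNode => (Set(leaf.prediction), 0)
      case in: InternalNode =>
        val (lp, lc) = collapsible(in.leftChild)
        val (rp, rc) = collapsible(in.rightChild)
        val preds = lp ++ rp
        (preds, lc + rc + (if (preds.size == 1) 1 else 0))
    }

    // e.g. collapsible(model.rootNode)._2

For the model above, unless I am misreading the dump, this counts 5 collapsible
internal nodes: the four bottom splits on feature 0, plus the whole
(feature 1 <= 0.5) subtree.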


This happens pretty often in real cases. The gain of the "optimized" version
on a single model invocation is small, but it can become non-negligible when
the number of calls is massive, as one can expect in a Big Data context.

I would appreciate your opinion on this matter (whether it would be relevant
for a PR or not, pros/cons, etc.).

Best regards,
Alessandro
