Perhaps the best way to find out is to read the code. The decision tree is implemented as a random forest with a single tree, whose entry point is the `run` method: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L88
I'm not familiar with the named decision tree algorithms, such as ID4 or CART. However, I believe that sklearn's decision tree implementation is quite similar to Spark's. Some differences are listed below:

1. Continuous features. sklearn considers all candidate values when searching for the best split, while Spark groups the candidate values into a fixed number of bins.
2. Tree building. sklearn provides two methods, depth-first and best-first, while Spark has only one: depth-first.
3. Number of splits. sklearn creates one split per iteration, while Spark can perform splits in parallel.

If I'm wrong, please let me know.

On Sat, Oct 1, 2016 at 10:34 AM, janardhan shetty <janardhan...@gmail.com> wrote:

> It would be good to know which paper has inspired to implement the version
> which we use in spark 2.0 decision trees ?
>
> On Fri, Sep 30, 2016 at 4:44 PM, Peter Figliozzi <pete.figlio...@gmail.com> wrote:
>
>> It's a good question. People have been publishing papers on decision
>> trees and various methods of constructing and pruning them for over 30
>> years. I think it's rather a question for a historian at this point.
>>
>> On Fri, Sep 30, 2016 at 5:08 PM, janardhan shetty <janardhan...@gmail.com> wrote:
>>
>>> Read this explanation but wondering if this algorithm has the base from
>>> a research paper for detail understanding.
>>>
>>> On Fri, Sep 30, 2016 at 1:36 PM, Kevin Mellott <kevin.r.mell...@gmail.com> wrote:
>>>
>>>> The documentation details the algorithm being used at
>>>> http://spark.apache.org/docs/latest/mllib-decision-tree.html
>>>>
>>>> Thanks,
>>>> Kevin
>>>>
>>>> On Fri, Sep 30, 2016 at 1:14 AM, janardhan shetty <janardhan...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Any help here is appreciated ..
>>>>>
>>>>> On Wed, Sep 28, 2016 at 11:34 AM, janardhan shetty <janardhan...@gmail.com> wrote:
>>>>>
>>>>>> Is there a reference to the research paper which is implemented in
>>>>>> spark 2.0 ?
>>>>>>
>>>>>> On Wed, Sep 28, 2016 at 9:52 AM, janardhan shetty <janardhan...@gmail.com> wrote:
>>>>>>
>>>>>>> Which algorithm is used under the covers while doing decision trees
>>>>>>> FOR SPARK ?
>>>>>>> for example: scikit-learn (python) uses an optimised version of the
>>>>>>> CART algorithm.
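
P.S. To make the first difference in my reply (all candidate values versus fixed bins) a bit more concrete, here is a minimal Python sketch. It is not code from sklearn or Spark: the function names `exact_thresholds` and `binned_thresholds` are made up for illustration, and `max_bins` is only a stand-in loosely modeled on Spark's `maxBins` setting. The binning here uses simple equal-frequency positions, which is an assumption on my part and may differ from what either library actually does.

```python
def exact_thresholds(values):
    """Candidate thresholds at every midpoint between consecutive
    distinct values (the 'use all candidate values' approach)."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def binned_thresholds(values, max_bins):
    """At most max_bins - 1 candidate thresholds, taken at roughly
    equal-frequency positions (the 'fixed bins' approach)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return []
    thresholds = []
    for i in range(1, max_bins):
        candidate = ordered[i * n // max_bins]
        # Skip duplicates so that every threshold separates distinct bins.
        if not thresholds or candidate > thresholds[-1]:
            thresholds.append(candidate)
    return thresholds


# Hypothetical feature column with 100 distinct values: 1.0 .. 100.0
values = [float(v) for v in range(1, 101)]

exact = exact_thresholds(values)
binned = binned_thresholds(values, max_bins=4)

print(len(exact), "exact candidates")
print(len(binned), "binned candidates:", binned)
```

The point is only the trade-off: the number of exact candidates grows with the number of distinct values, while the binned list stays bounded by `max_bins`, at the cost of possibly missing the best threshold.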