GitHub user manishamde opened a pull request: https://github.com/apache/spark/pull/79

MLI-1 Decision Trees

Joint work with @hirakendu, @etrain, @atalwalkar and @harsha2010.

Key features:
+ Supports binary classification and regression
+ Supports gini, entropy and variance for information gain calculation
+ Supports both continuous and categorical features

The algorithm has gone through several development iterations over the last few months leading to a highly optimized implementation. Optimizations include:

1. Level-wise training to reduce passes over the entire dataset.
2. Bin-wise split calculation to reduce computation overhead.
3. Aggregation over partitions before combining to reduce communication overhead.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/manishamde/spark tree

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/79.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #79

----
commit cd53eae11313fd30f71f5ec94b20fe8d4427b8cd
Author: Manish Amde <manish...@gmail.com>
Date:   2013-11-28T10:20:27Z

    skeletal framework

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit 92cedce2eb5055e0164c90842d6613c618bfed94
Author: Manish Amde <manish...@gmail.com>
Date:   2013-12-02T06:52:29Z

    basic building blocks for intermediate RDD calculation. untested.

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit 8bca1e20b703fd90bc6fcdbed5d36b42a0bdf66e
Author: Manish Amde <manish...@gmail.com>
Date:   2013-12-09T03:48:39Z

    additional code for creating intermediate RDD

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit 0012a77eb02e0a6627b7e3e68ac4d0f29d0885e0
Author: Manish Amde <manish...@gmail.com>
Date:   2013-12-10T05:08:44Z

    basic stump working

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit 03f534c2f9a8dd739945f92b98a58e93fa5b716a
Author: Manish Amde <manish...@gmail.com>
Date:   2013-12-10T06:10:46Z

    some more tests

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit dad0afc85aea64c06b4dd64504b3112c881ae4e6
Author: Manish Amde <manish...@gmail.com>
Date:   2013-12-15T08:25:58Z

    decison stump functionality working

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit 4798aae63e898fed71e6240462a163ad81ccd64b
Author: Manish Amde <manish...@gmail.com>
Date:   2013-12-15T08:45:23Z

    added gain stats class

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit 80e8c66dd25ad03c706f4993b10ba4caafa54c18
Author: Manish Amde <manish...@gmail.com>
Date:   2013-12-16T01:41:59Z

    working version of multi-level split calculation

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit b0eb866cfd2d98a9281127e02e0c159668ca01f4
Author: Manish Amde <manish...@gmail.com>
Date:   2013-12-16T04:42:52Z

    added logic to handle leaf nodes

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit 98ec8d57a0a0897b093ced7e3284228ee21ce5f4
Author: Manish Amde <manish...@gmail.com>
Date:   2013-12-22T06:39:29Z

    tree building and prediction logic

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit 02c595c65f784061b1a78d4cbd5cac5990d1881d
Author: Manish Amde <manish...@gmail.com>
Date:   2013-12-22T20:00:17Z

    added command line parsing

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit 733d6ddf51ddf440efb1a17c818da6d7fd027c4b
Author: Manish Amde <manish...@gmail.com>
Date:   2013-12-22T20:20:50Z

    fixed tests

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit 154aa77c925e44a92e8bbf2f55e43cab06e75006
Author: Manish Amde <manish...@gmail.com>
Date:   2013-12-23T06:51:17Z

    enums for configurations

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit b0e3e76c47b1b449c91832aee2a6e94cee0a7c6b
Author: Manish Amde <manish...@gmail.com>
Date:   2014-01-12T19:45:47Z

    adding enum for feature type

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit c8f6d60c45ec7ec8cfac94b43fb22d8c294221db
Author: Manish Amde <manish...@gmail.com>
Date:   2014-01-12T19:46:55Z

    adding enum for feature type

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit e23c2e5089a2bf2a50c5d3f52e5799bf76ca3a16
Author: Manish Amde <manish...@gmail.com>
Date:   2014-01-19T21:23:45Z

    added regression support

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit 53108ed6ad241765757c1e4c68189035505b370f
Author: Manish Amde <manish...@gmail.com>
Date:   2014-01-20T00:56:15Z

    fixing index for highest bin

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit 6df35b9e70701528b13b33820b687f295bcfb3a4
Author: Manish Amde <manish...@gmail.com>
Date:   2014-01-21T04:33:52Z

    regression predict logic

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit dbb7ac13d28fba0848062a7bea40c617cb5f2c80
Author: Manish Amde <manish...@gmail.com>
Date:   2014-01-23T04:44:23Z

    categorical feature support

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit d504eb1f8a3f7f06226448d42b709f2f7ec6e91c
Author: Manish Amde <manish...@gmail.com>
Date:   2014-01-23T05:59:15Z

    more tests for categorical features

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit 6b7de78e3a59bef8cbb8aff8b2aeed0cd91ab4a1
Author: Manish Amde <manish...@gmail.com>
Date:   2014-01-26T01:53:41Z

    minor refactoring and tests

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit b09dc983f4f05da61479c87617526064b0e3dde8
Author: Manish Amde <manish...@gmail.com>
Date:   2014-01-26T22:54:43Z

    minor refactoring

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit c0e522b7d1f5e27c81d682e5c8c97543fb4242be
Author: Manish Amde <manish...@gmail.com>
Date:   2014-01-27T03:11:43Z

    updated predict and split threshold logic

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit f067d68f0d951e7f0f089419c506fbd5ce2c2fc1
Author: Manish Amde <manish...@gmail.com>
Date:   2014-01-27T03:36:21Z

    minor cleanup

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit 5841c2838e6834fc8c767f3c83dba7ef99375fa4
Author: Manish Amde <manish...@gmail.com>
Date:   2014-01-27T06:34:49Z

    unit tests for categorical features

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit 0dd7659055879be9fbb3280964f87b14c735f225
Author: manishamde <manish...@gmail.com>
Date:   2014-01-27T06:42:06Z

    basic doc

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit dd0c0d799d42c94da3f930065a6c2973143bfd75
Author: Manish Amde <manish...@gmail.com>
Date:   2014-01-27T08:01:43Z

    minor: some docs

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit 937277990e80f9a97070c63d39552579f0320fd7
Author: Manish Amde <manish...@gmail.com>
Date:   2014-02-17T03:42:48Z

    code style: max line lenght <= 100

    Signed-off-by: Manish Amde <manish...@gmail.com>

commit 84f85d6d0a1fe7ed60149cc6b29a9ff76ef09abd
Author: Manish Amde <manish...@gmail.com>
Date:   2014-02-28T04:57:56Z

    code documentation

    Signed-off-by: Manish Amde <manish...@gmail.com>

----

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
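
Illustrative note on the pull request description: the description names gini and entropy impurity, bin-wise split calculation, and aggregation over partitions before combining, but this message carries no code. The block below is a minimal Python sketch of how those three ideas could fit together for one continuous feature and integer class labels. It is not the PR's Scala implementation; the function names (`partition_histogram`, `merge_histograms`, `best_split`), the plain-list "partitions", and the fixed candidate thresholds are all assumptions made for illustration.

```python
import bisect
import math


def gini(counts):
    """Gini impurity computed from per-class counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)


def entropy(counts):
    """Entropy (base 2) computed from per-class counts; empty classes are skipped."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)


def partition_histogram(rows, thresholds, num_classes=2):
    """Count labels per bin for one partition (hypothetical helper).

    Bin i holds values x with thresholds[i-1] < x <= thresholds[i];
    the last bin holds everything above the largest threshold.
    """
    hist = [[0] * num_classes for _ in range(len(thresholds) + 1)]
    for x, label in rows:
        hist[bisect.bisect_left(thresholds, x)][label] += 1
    return hist


def merge_histograms(h1, h2):
    """Combine two per-partition histograms by element-wise addition."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(h1, h2)]


def best_split(hist, thresholds, impurity=gini):
    """Scan bin boundaries once and return (threshold, gain) with the highest gain."""
    total = [sum(col) for col in zip(*hist)]
    n = sum(total)
    parent = impurity(total)
    left = [0] * len(total)
    best = (None, 0.0)
    for i, t in enumerate(thresholds):
        # Rows with x <= t go left: accumulate the counts of bins 0..i.
        left = [l + c for l, c in zip(left, hist[i])]
        right = [tot - l for tot, l in zip(total, left)]
        n_left, n_right = sum(left), sum(right)
        if n_left == 0 or n_right == 0:
            continue
        gain = parent - (n_left / n) * impurity(left) - (n_right / n) * impurity(right)
        if gain > best[1]:
            best = (t, gain)
    return best


# Two toy "partitions" of (feature value, label) rows.
partitions = [[(1.0, 0), (2.0, 0)], [(3.0, 1), (4.0, 1)]]
thresholds = [1.5, 2.5, 3.5]

merged = None
for part in partitions:
    local = partition_histogram(part, thresholds)
    merged = local if merged is None else merge_histograms(merged, local)

threshold, gain = best_split(merged, thresholds)
```

In this sketch each partition is reduced to a small table of counts before anything is combined, so only the histograms, not the rows, would need to cross partition boundaries, and split candidates are evaluated at bin boundaries rather than at every distinct feature value.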