GitHub user manishamde opened a pull request:
https://github.com/apache/spark/pull/2607
[MLLIB] [WIP] SPARK-1547: Adding Gradient Boosting to MLlib
Given the popular demand for gradient boosting and AdaBoost in MLlib, I am
creating a WIP branch for early feedback on gradient boosting; AdaBoost will
follow soon after this PR is accepted. This is based on work done along with
@hirakendu that was pending due to the decision tree optimizations and random
forests work.
Ideally, boosting algorithms should work with any base learners. This will
soon be possible once the MLlib API is finalized -- we want to ensure we use a
consistent interface for the underlying base learners. In the meantime, this PR
uses decision trees as base learners for the gradient boosting algorithm. The
current PR allows "pluggable" loss functions and provides least squares error
and least absolute error by default.
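To make the "pluggable" loss idea concrete, here is a minimal, self-contained sketch of what such an interface could look like. The names (`Loss`, `SquaredError`, `AbsoluteError`) and signatures are illustrative assumptions, not the actual API proposed in this PR; each loss exposes the negative gradient (the pseudo-residual the next tree is fit against) and the pointwise error.

```scala
// Illustrative sketch of a pluggable loss interface for gradient boosting.
// NOTE: Loss, SquaredError, and AbsoluteError are hypothetical names, not
// the API in this PR.

trait Loss {
  /** Negative gradient (pseudo-residual) of the loss at a single point. */
  def gradient(prediction: Double, label: Double): Double
  /** Loss value at a single point. */
  def error(prediction: Double, label: Double): Double
}

object SquaredError extends Loss {
  // L(y, F) = (y - F)^2 / 2, so the negative gradient is y - F.
  def gradient(prediction: Double, label: Double): Double = label - prediction
  def error(prediction: Double, label: Double): Double = {
    val d = label - prediction
    d * d / 2.0
  }
}

object AbsoluteError extends Loss {
  // L(y, F) = |y - F|, so the negative gradient is sign(y - F).
  def gradient(prediction: Double, label: Double): Double =
    if (label > prediction) 1.0 else if (label < prediction) -1.0 else 0.0
  def error(prediction: Double, label: Double): Double =
    math.abs(label - prediction)
}
```

With this shape, each boosting iteration would fit the next decision tree to `gradient(currentPrediction, label)` for every training point, so swapping the loss requires no change to the boosting loop itself.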
Here is the remaining task list:
- [ ] Stochastic gradient boosting support -- Re-use the BaggedPoint
approach used for RandomForest.
- [ ] BaggedRDD caching -- Avoid repeating the feature-to-bin mapping for each
tree estimator. Will require minor refactoring of the RandomForest code.
- [ ] Checkpointing -- This approach will avoid long lineage chains. Need
to conduct experiments to verify good default settings.
- [ ] Unit tests -- I have performed some basic tests but I need to add
them as unit tests.
- [ ] Create public APIs
- [ ] Tests on multiple cluster sizes and datasets -- require help from
the community on this front.
Note: Classification is not currently supported by this PR, since it
requires discussion on the best way to support "deviance" as a loss function.
cc: @jkbradley @hirakendu @mengxr @etrain @atalwalkar @chouqin
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/manishamde/spark gbt
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2607.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2607
----
commit 0ae1c0a77c9de22dd1ff50ad1e4c7b8a691aac38
Author: Manish Amde <[email protected]>
Date: 2014-09-28T03:32:22Z
basic gradient boosting code from earlier branches
commit 55385216ff2d0a470ae783017d434d850762441f
Author: Manish Amde <[email protected]>
Date: 2014-09-28T04:32:31Z
disable checkpointing for now
commit 6251fd56388703d9b9450980a27cf9a9a98e750d
Author: Manish Amde <[email protected]>
Date: 2014-10-01T00:22:26Z
modified method name
commit cdceeef09822145af2620921a94c37384d3f64c7
Author: Manish Amde <[email protected]>
Date: 2014-10-01T01:04:02Z
added documentation
----