Re: Spark Implementation of XGBoost

2015-11-16 Thread Joseph Bradley
One comment about """ 1) I agree the sorting method you suggested is a very efficient way to handle unordered categorical variables in binary classification and regression. I propose we have a Spark ML Transformer to do the sorting and encoding, bringing the benefits to many tree-based methods.

Re: Spark Implementation of XGBoost

2015-10-27 Thread Meihua Wu
Hi DB Tsai, Thank you again for your insightful comments! 1) I agree the sorting method you suggested is a very efficient way to handle unordered categorical variables in binary classification and regression. I propose we have a Spark ML Transformer to do the sorting and encoding, bringing th

Re: Spark Implementation of XGBoost

2015-10-27 Thread DB Tsai
Hi Meihua, For categorical features, the ordinal issue can be solved by trying all 2^(q-1) - 1 possible partitions of the q values into two groups. However, this is computationally expensive. In Hastie's book, in 9.2.4, the trees can be trained by sorting the residuals and being learnt as if they
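
To make the cost argument above concrete, here is a minimal Python sketch (not code from this thread or from Spark). It counts the exhaustive two-group partitions of q unordered categories, then shows the ordering shortcut for a quantitative response: sort the categories by their mean target value and evaluate only the q - 1 prefix splits of that ordering. The function names and the toy data are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def num_binary_partitions(q):
    """Number of ways to split q unordered categories into two non-empty groups."""
    return 2 ** (q - 1) - 1


def order_categories_by_mean(categories, targets):
    """Return the distinct categories sorted by mean target value (ascending)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    return sorted(sums, key=lambda c: sums[c] / counts[c])


def candidate_splits(ordered):
    """Once categories are ordered, only the q - 1 prefix splits are candidates."""
    return [(ordered[:i], ordered[i:]) for i in range(1, len(ordered))]


# Toy data (hypothetical): one categorical feature and a numeric response.
cats = ["a", "a", "b", "b", "c", "d"]
ys = [3.0, 5.0, 1.0, 2.0, 10.0, 0.5]

ordered = order_categories_by_mean(cats, ys)
splits = candidate_splits(ordered)

print(num_binary_partitions(len(ordered)))  # exhaustive search: 7 partitions
print(len(splits))                          # ordered search: 3 candidate splits
```

With four categories the saving is small (7 versus 3), but the exhaustive count grows exponentially in q while the ordered search grows linearly.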

Re: Spark Implementation of XGBoost

2015-10-26 Thread Meihua Wu
Hi YiZhi, Thank you for mentioning the JIRA. I will add a note to it. Meihua On Mon, Oct 26, 2015 at 6:16 PM, YiZhi Liu wrote: > There's an xgboost exploration jira SPARK-8547. Can it be a good start? > > 2015-10-27 7:07 GMT+08:00 DB Tsai : >> Also, does it support categorical feature? >>

Re: Spark Implementation of XGBoost

2015-10-26 Thread Meihua Wu
Hi DB Tsai, Thank you very much for your interest and comments. 1) Feature sub-sampling is per-node, as in random forests. 2) The current code heavily exploits the tree structure to speed up the learning (such as processing multiple learning nodes in one pass over the training data). So a generic GBM is
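
The per-node versus per-tree distinction discussed here can be sketched as follows. This is an illustrative Python sketch, not the implementation described in the thread: per-tree sub-sampling draws one feature subset and reuses it at every node, while per-node sub-sampling (the random-forest style mentioned above) draws a fresh subset at each node. All names and parameters (`sample_features`, `fraction`, the node counts) are hypothetical.

```python
import random


def sample_features(num_features, fraction, rng):
    """Draw a random subset of feature indices, keeping at least one feature."""
    k = max(1, int(num_features * fraction))
    return sorted(rng.sample(range(num_features), k))


def per_tree_subsets(num_nodes, num_features, fraction, rng):
    """Per-tree: sample once, then every node of the tree sees the same subset."""
    subset = sample_features(num_features, fraction, rng)
    return [subset for _ in range(num_nodes)]


def per_node_subsets(num_nodes, num_features, fraction, rng):
    """Per-node: each node draws its own subset independently."""
    return [sample_features(num_features, fraction, rng) for _ in range(num_nodes)]


rng = random.Random(0)  # fixed seed so repeated runs draw the same subsets
tree_level = per_tree_subsets(num_nodes=4, num_features=10, fraction=0.3, rng=rng)
node_level = per_node_subsets(num_nodes=4, num_features=10, fraction=0.3, rng=rng)

print(tree_level)
print(node_level)
```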

Re: Spark Implementation of XGBoost

2015-10-26 Thread YiZhi Liu
There's an xgboost exploration JIRA, SPARK-8547. Could it be a good start? 2015-10-27 7:07 GMT+08:00 DB Tsai : > Also, does it support categorical feature? > > Sincerely, > > DB Tsai > -- > Web: https://www.dbtsai.com > PGP Key ID: 0xAF08DF8D >

Re: Spark Implementation of XGBoost

2015-10-26 Thread DB Tsai
Also, does it support categorical features? Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai wrote: > Interesting. For feature sub-sampling, is it per-node or per-tree? Do >

Re: Spark Implementation of XGBoost

2015-10-26 Thread DB Tsai
Interesting. For feature sub-sampling, is it per-node or per-tree? Do you think you can implement a generic GBM and have it merged into the Spark codebase? Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Mon, Oc