Dear Spark Users and Developers,

We (the Distributed (Deep) Machine Learning Community, http://dmlc.ml/) are happy to announce the release of XGBoost4J (http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html), a portable distributed XGBoost in Spark, Flink and Dataflow.
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way. It has been the winning solution in many machine learning scenarios, ranging from machine learning challenges (https://github.com/dmlc/xgboost/tree/master/demo#machine-learning-challenge-winning-solutions) to industrial use cases (https://github.com/dmlc/xgboost/tree/master/demo#usecases).

XGBoost4J is a new package in XGBoost that aims to provide clean Scala/Java APIs and seamless integration with mainstream data processing platforms such as Apache Spark. With XGBoost4J, users can run XGBoost as a stage of a Spark job and build a unified pipeline from ETL to model training to data product serving within Spark, instead of jumping between two different systems, i.e. XGBoost and Spark. (Example: https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-example/src/main/scala/ml/dmlc/xgboost4j/scala/example/spark/DistTrainWithSpark.scala; a minimal sketch is also included after the signature.)

Today, we release the first version of XGBoost4J to bring more choices to Spark users who are looking to build highly efficient data analytics platforms, and to enrich the Spark ecosystem. We will keep moving forward to integrate with more features of Spark. Of course, you are more than welcome to join us and contribute to the project!

For more details on distributed XGBoost, please refer to the recently published paper: http://arxiv.org/abs/1603.02754

Best,

--
Nan Zhu
http://codingcat.me
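P.S. For those curious what the Spark integration looks like, here is a minimal sketch of distributed training with XGBoost4J-Spark, adapted from the linked DistTrainWithSpark.scala example. Parameter names, paths and method signatures below are illustrative assumptions; please treat the example code in the repository as authoritative.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.mllib.util.MLUtils
  import ml.dmlc.xgboost4j.scala.spark.XGBoost

  object DistTrainWithSparkSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("xgboost4j-spark-sketch"))

      // Load LibSVM-formatted training data as an RDD[LabeledPoint],
      // the input that XGBoost4J-Spark training consumes.
      // (hypothetical path, replace with your own data)
      val trainRDD = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/train.libsvm")

      // Booster parameters; names follow the standard XGBoost parameter set.
      val paramMap = Map(
        "eta" -> 0.1,
        "max_depth" -> 3,
        "objective" -> "binary:logistic")

      // Distributed training as a Spark stage: each of the nWorkers tasks
      // hosts an XGBoost worker (signature as in the linked example at the
      // time of writing; check the repository for the current API).
      val numRound = 10
      val model = XGBoost.train(trainRDD, paramMap, numRound, nWorkers = 4)

      // Prediction runs as another Spark stage over the feature vectors.
      val predictions = model.predict(trainRDD.map(_.features))
      println(s"produced ${predictions.count()} predictions")

      sc.stop()
    }
  }

The point of the sketch is the shape of the pipeline: data loading, training and prediction are all ordinary Spark stages on the same SparkContext, so no data needs to leave the cluster to train an XGBoost model.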