This is a really interesting approach. Building an ML library on top of DataFlow is probably a winning move, and I hope it will stop the proliferation of needless reimplementations that is taking place in the big data world. Do you think DataFlow posed specific problems for your work? Is it missing anything that you had to fill in yourself?
Here at RadicalBit we are interested in both DataFlow/Apache Beam and distributed ML, and your approach looks like the best one to us. I hope more and more teams follow your example, perhaps by integrating existing libraries like H2O with DataFlow. Keep us updated if you plan to develop other algorithms.

2016-03-11 21:32 GMT+01:00 Tianqi Chen <tqc...@cs.washington.edu>:

> Hi Flink Developers
> I am sending this email to let you know about XGBoost4J, a package that
> we are planning to announce next week. Here is the draft version of the
> post
> https://github.com/dmlc/xgboost/blob/master/doc/jvm/xgboost4j-intro.md
>
> In short, XGBoost is a machine learning package that is used by more
> than half of the machine challenge winning solutions and is already widely
> used in industry. The distributed version scale to billion examples (10x
> faster than spark.mllib in the experiment) with fewer resources (see
> http://arxiv.org/abs/1603.02754)
>
> We are interested in putting distributed XGBoost into all Dataflow
> platforms include Flink. This does not mean we re-implement it on Flink.
> But instead we build a portable API that has a communication library, and
> being able to run on different DataFlow programs.
>
> We hope this can benefit the Flink users, to enable them to get access
> to one of the state-of-art machine learning algorithm. I am sending this
> email to the mail-list to let you know about it, and hoping to get some
> contributors to help improving the XGBoost Flink API to be more compatible
> with current FlinkML stack. We also hope to get some support from the
> system side, to enable some abstraction needed in XGBoost for using
> multiple threads within even one slot for maximum performance.
>
>
> Let us know about your thoughts.
>
> Cheers
>
> Tianqi
>