Thanks for the insight; what you're doing is really interesting. I will
definitely spend some time looking at DMLC and MXNet.
2016-03-12 18:35 GMT+01:00 Tianqi Chen <tqc...@cs.washington.edu>:

> Thanks for the reply. I am writing a longer email to answer Simone's
> questions and clarify what we do.
>
> I want to mention that *you can use the library already in Flink*. See
> the Flink example here:
> https://github.com/dmlc/xgboost/tree/master/jvm-packages#xgboost-flink
>
> I have not run a pressure test on top of Flink, but we did run one on
> Spark: on a 100M-example dataset the results are consistent with our
> standalone version, giving roughly a 10x speedup over mllib's version. I
> assume the same holds for Flink as well.
> So if you are interested, please try it out. I imagine a Flink demo that
> directly gives a competitive result on, say, a Kaggle competition could
> be very useful, and could attract more users to the Flink community as
> well.
>
> *The Internal Details*
> Here at dmlc.ml, we build libraries where we dive deep and aim for the
> best performance and flexibility. We build our own abstractions when
> needed: for example, XGBoost relies on an Allreduce abstraction, and
> MXNet, another well-known deep learning project, relies on a parameter
> server abstraction. We tried to make these abstractions portable, so
> they are not stand-alone C++ programs but can be used as libraries from
> other languages, e.g. Scala. So essentially, here is what is needed:
> - Start a tracker on the driver side that connects the workers together
> via our communication library (this part can be swapped out, depending
> on the level of integration, if the platform natively provides a
> communication API).
> - An API to start concurrent jobs (containers) that can execute a
> function (either worker or server).
> - Access to a partition of the data in each worker.
>
> Take XGBoost for example: what we do in Flink is run training as a
> MapPartition stage and treat each slot as a worker. The workers then
> collaboratively solve the machine learning problem and return the
> trained model. (A rough sketch of this pattern follows below.)
>
> *What is needed from DataFlow*
> Dataflow is a nice abstraction for data processing. As you can see, the
> approach we take is somewhat lower level; I would call it a developer
> API, since the requirement is basically to start workers and other types
> of containers that run a Scala function from the driver side.
> MapPartition works well for the XGBoost case, but here is what could be
> improved:
> - Being able to specify the resources of a slot at the ML stage. For
> example, XGBoost as well as deep learning programs can benefit from
> using multiple cores in each worker, while currently mapPartition uses
> one core per partition.
> - Being able to launch containers that do not take data, for example
> parameter server instances. This is mainly needed for deep learning.
>
> *Why not implement them using DataFlow?*
> One thing I expect people to ask is why we do not directly use
> (multiple) dataflow stages to implement these algorithms. That is a
> possible approach; here are the reasons we did not take it:
> - Most of the work in an advanced ML algorithm is the machine learning
> part itself, with a bit of communication added on top. Calling the
> communication library directly from the ML code allows easier migration
> from an optimized single-machine version to a distributed one.
> - Not all dataflow executors are alike: machine learning usually
> benefits from persistent program state (which Flink has but Spark does
> not), and we want to be invariant to such differences.
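>
> To make the MapPartition-as-worker pattern concrete, here is a rough
> Scala sketch. The Comm trait is a hypothetical stand-in for the
> communication library (Rabit in XGBoost's case); only the shape of the
> integration is meant to be accurate, not the names or signatures:
>
>   import org.apache.flink.api.scala._
>   import org.apache.flink.util.Collector
>
>   // Hypothetical communicator: the driver-side tracker hands out
>   // connection info, and each worker gets an allreduce primitive.
>   trait Comm extends Serializable {
>     def init(trackerEnv: Map[String, String]): Unit
>     def allreduceSum(xs: Array[Double]): Array[Double]
>     def shutdown(): Unit
>   }
>
>   // Serializable so the closure below can reference localGradient.
>   object MapPartitionWorkerSketch extends Serializable {
>     def train(data: DataSet[Array[Double]], comm: Comm,
>               trackerEnv: Map[String, String], dim: Int,
>               rounds: Int): DataSet[Array[Double]] =
>       data.mapPartition {
>         (part: Iterator[Array[Double]], out: Collector[Array[Double]]) =>
>           val local = part.toArray // this slot's share of the data
>           comm.init(trackerEnv)    // join the worker group via the tracker
>           var model = Array.fill(dim)(0.0)
>           for (_ <- 0 until rounds) {
>             val grad = localGradient(local, model) // the ML part: local code
>             val sum  = comm.allreduceSum(grad)     // the only distributed step
>             model = model.zip(sum).map { case (w, g) => w - 0.01 * g }
>           }
>           comm.shutdown()
>           out.collect(model)       // every worker ends with the same model
>       }
>
>     // Placeholder so the sketch compiles; a real learner goes here.
>     private def localGradient(data: Array[Array[Double]],
>                               w: Array[Double]): Array[Double] =
>       Array.fill(w.length)(0.0)
>   }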
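>
> For the user-facing side, the Flink example linked at the top boils down
> to a few lines, roughly like this (adapted from the README; the paths
> are placeholders and exact signatures may drift between versions):
>
>   import ml.dmlc.xgboost4j.scala.flink.XGBoost
>   import org.apache.flink.api.scala._
>   import org.apache.flink.ml.MLUtils
>
>   object DistTrainWithFlink {
>     def main(args: Array[String]): Unit = {
>       val env = ExecutionEnvironment.getExecutionEnvironment
>       // read training data in LibSVM format
>       val trainData = MLUtils.readLibSVM(env, "/path/to/agaricus.txt.train")
>       val paramMap = Map(
>         "eta" -> 0.1,
>         "max_depth" -> 2,
>         "objective" -> "binary:logistic")
>       val round = 2 // number of boosting iterations
>       // each MapPartition slot becomes one distributed worker
>       val model = XGBoost.train(trainData, paramMap, round)
>       // score the training features and export the model
>       val predTrain = model.predict(trainData.map(_.vector))
>       model.saveModelAsHadoopFile("file:///path/to/xgboost.model")
>     }
>   }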
>
> Dataflow was originally designed for data processing, and I do feel that
> other abstractions sometimes fit machine learning better. The idea of
> embedding ML's own abstraction into one stage of a dataflow lets us
> benefit from the flexible data processing phases and still use the best
> learning algorithm.
>
> *Fault Tolerance?*
> Most of our algorithms assume a fail-restart scheme from the host
> platform: we rely on the system to restart failed jobs somewhere and
> provide the same input data. Internally, the communication library then
> kicks in and tries to recover, usually from some checkpoint. Of course,
> if the host provides a checkpointing feature, that can be used as well.
>
> *More Machine Learning Algorithms?*
> XGBoost is part of the DMLC project (http://dmlc.ml). Our goal is *not*
> to develop a general library that covers all algorithms, like FlinkML.
> Instead, we pick the most important ones used in production pipelines
> and build a *deeply optimized package for each specific one*. There are
> also shared components, like the communication library, so effort that
> would otherwise be duplicated is shared among the libraries. I believe
> we cover most of what people need; the remaining simple ones (k-means,
> linear models) can be implemented directly in FlinkML.
>
> One thing that could be interesting to try next is MXNet
> (https://github.com/dmlc/mxnet/tree/master/scala-package), a
> full-fledged deep learning library that comes with all the features you
> need as well as a Scala binding. However, it does need a bit more of
> what I mentioned in the requirements section.
>
> *What help we can get from the Flink community*
> I will list the points that are clear and actionable here:
> - Improve the xgboost-Flink API so that it is consistent with the
> current FlinkML pipeline.
> - Provide a "developer API" that allows the performance improvements I
> mentioned in "What is needed from DataFlow".
> - Support the abstractions needed for MXNet, and enable *streaming,
> GPU-enabled distributed deep learning on Flink*.
> - The main obstacle will be the "developer API".
>
> While some of this may seem like a lot of effort just to port one
> specific machine learning library, enabling these points basically
> enables porting every machine learning library we have built, and will
> build, on these abstractions.
>
> Tianqi
>
> On Sat, Mar 12, 2016 at 4:51 AM, Theodore Vasiloudis <
> theodoros.vasilou...@gmail.com> wrote:
>
> > Hello Tianqi,
> >
> > Yes, that definitely sounds interesting for us, and we are looking
> > forward to helping out with the implementation.
> >
> > Regards,
> > Theodore
> > --
> > Sent from a mobile device. May contain autocorrect errors.
> >
> > On Mar 12, 2016 11:29 AM, "Simone Robutti"
> > <simone.robu...@radicalbit.io> wrote:
> >
> > > This is a really interesting approach. The idea of an ML library
> > > over DataFlow is probably a winning move, and I hope it will stop
> > > the proliferation of worthless reimplementations taking place in
> > > the big data world. Do you think DataFlow posed specific problems
> > > for your work? Is it missing something that you had to fill in
> > > yourself?
> > >
> > > Here at RadicalBit we are interested both in DataFlow/Apache Beam
> > > and in distributed ML. Your approach looks like the best one to us,
> > > and I hope more and more teams follow your example, perhaps
> > > integrating existing libraries like H2O with DataFlow.
> > >
> > > Keep us updated if you plan to develop other algorithms.
> > >
> > > 2016-03-11 21:32 GMT+01:00 Tianqi Chen <tqc...@cs.washington.edu>:
> > >
> > > > Hi Flink Developers,
> > > > I am sending this email to let you know about XGBoost4J, a package
> > > > that we are planning to announce next week. Here is the draft
> > > > version of the post:
> > > > https://github.com/dmlc/xgboost/blob/master/doc/jvm/xgboost4j-intro.md
> > > >
> > > > In short, XGBoost is a machine learning package that is used by
> > > > more than half of the winning solutions in machine learning
> > > > challenges and is already widely used in industry. The distributed
> > > > version scales to billions of examples (10x faster than
> > > > spark.mllib in our experiments) while using fewer resources (see
> > > > http://arxiv.org/abs/1603.02754).
> > > >
> > > > We are interested in bringing distributed XGBoost to all Dataflow
> > > > platforms, including Flink. This does not mean we re-implement it
> > > > on Flink. Instead, we build a portable API with its own
> > > > communication library, able to run on different DataFlow
> > > > platforms.
> > > >
> > > > We hope this can benefit Flink users by giving them access to one
> > > > of the state-of-the-art machine learning algorithms. I am sending
> > > > this email to the mailing list to let you know about it, hoping to
> > > > find contributors to help improve the XGBoost Flink API to be more
> > > > compatible with the current FlinkML stack. We also hope to get
> > > > some support from the system side, to enable the abstractions
> > > > XGBoost needs, such as using multiple threads within a single
> > > > slot, for maximum performance.
> > > >
> > > > Let us know about your thoughts.
> > > >
> > > > Cheers
> > > >
> > > > Tianqi
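> > > >
> > > > P.S. To give a flavour of what "more compatible with the current
> > > > FlinkML stack" could mean, here is a rough sketch of a fit/predict
> > > > facade over the existing entry points. This is only a sketch: the
> > > > real FlinkML Predictor trait goes through implicit fit/predict
> > > > operations (omitted here), and the train/predict signatures below
> > > > are assumed from the example post and may differ:
> > > >
> > > >   import ml.dmlc.xgboost4j.scala.flink.{XGBoost, XGBoostModel}
> > > >   import org.apache.flink.api.scala._
> > > >   import org.apache.flink.ml.common.LabeledVector
> > > >   import org.apache.flink.ml.math.Vector
> > > >
> > > >   // FlinkML-flavoured facade: fit() trains a distributed model,
> > > >   // predict() scores a DataSet of feature vectors.
> > > >   class XGBoostPredictor(params: Map[String, Any], rounds: Int) {
> > > >     private var model: Option[XGBoostModel] = None
> > > >
> > > >     def fit(training: DataSet[LabeledVector]): this.type = {
> > > >       model = Some(XGBoost.train(training, params, rounds))
> > > >       this
> > > >     }
> > > >
> > > >     def predict(input: DataSet[Vector]): DataSet[Array[Float]] =
> > > >       model match {
> > > >         case Some(m) => m.predict(input)
> > > >         case None =>
> > > >           throw new IllegalStateException("call fit() first")
> > > >       }
> > > >   }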