Thanks for the insight; what you're doing is really interesting. I will
definitely spend some time looking at DMLC and MXNet.
2016-03-12 18:35 GMT+01:00 Tianqi Chen <tqc...@cs.washington.edu>:

> Thanks for the reply. I am writing a longer email to answer Simone's
> questions and clarify what we do.
>
> I want to mention that *you can use the library already in Flink*. See
> the Flink example here:
> https://github.com/dmlc/xgboost/tree/master/jvm-packages#xgboost-flink
>
> I have not run a pressure test on top of Flink, but we did run one on
> Spark: on a 100M-example dataset the results are consistent with our
> standalone version, giving roughly a 10x speedup over mllib's version. I
> assume the same holds for Flink as well.
> So if you are interested, please try it out. I imagine a Flink demo that
> directly gives a competitive result on, say, a Kaggle competition could
> be very useful, and could attract more users to the Flink community as
> well.
>
> *The Internal Details*
> Here at dmlc.ml, we build libraries where we dive deep and aim for the
> best performance and flexibility. We build our own abstractions when
> needed: for example, XGBoost relies on an Allreduce abstraction, and
> MXNet, another well-known deep learning project, relies on a parameter
> server abstraction. We tried to make these abstractions portable, so
> they are not stand-alone C++ programs but can be used as libraries from
> other languages, e.g. Scala. So essentially, here is what is needed:
> - Start a tracker on the driver side that connects the workers together
> via our communication library (this part can be swapped out, depending
> on the level of integration, if the platform natively provides a
> communication API).
> - An API to start concurrent jobs (containers) that can execute a
> function (either worker or server).
> - Access to a partition of the data in each worker.
>
> Take XGBoost for example: what we do in Flink is run training as a
> MapPartition stage and treat each slot as a worker. The workers then
> collaboratively solve the machine learning problem and return the
> trained model. (A rough sketch of this pattern follows below.)
>
> *What is needed from DataFlow*
> Dataflow is a nice abstraction for data processing. As you can see, the
> approach we take is somewhat lower level; I would call it a developer
> API, since the requirement is basically to start workers and other types
> of containers that run a Scala function from the driver side.
> MapPartition works well for the XGBoost case, but here is what could be
> improved:
> - Being able to specify the resources of a slot at the ML stage. For
> example, XGBoost as well as deep learning programs can benefit from
> using multiple cores in each worker, while currently mapPartition uses
> one core per partition.
> - Being able to launch containers that do not take data, for example
> parameter server instances. This is mainly needed for deep learning.
>
> *Why not implement them using DataFlow?*
> One thing I expect people to ask is why we do not directly use
> (multiple) dataflow stages to implement these algorithms. That is a
> possible approach; here are the reasons we did not take it:
> - Most of the work in an advanced ML algorithm is the machine learning
> part itself, with a bit of communication added on top. Calling the
> communication library directly from the ML code allows easier migration
> from an optimized single-machine version to a distributed one.
> - Not all dataflow executors are alike: machine learning usually
> benefits from persistent program state (which Flink has but Spark does
> not), and we want to be invariant to such differences.
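>
> To make the MapPartition-as-worker pattern concrete, here is a rough
> Scala sketch. The Comm trait is a hypothetical stand-in for the
> communication library (Rabit in XGBoost's case); only the shape of the
> integration is meant to be accurate, not the names or signatures:
>
>   import org.apache.flink.api.scala._
>   import org.apache.flink.util.Collector
>
>   // Hypothetical communicator: the driver-side tracker hands out
>   // connection info, and each worker gets an allreduce primitive.
>   trait Comm extends Serializable {
>     def init(trackerEnv: Map[String, String]): Unit
>     def allreduceSum(xs: Array[Double]): Array[Double]
>     def shutdown(): Unit
>   }
>
>   // Serializable so the closure below can reference localGradient.
>   object MapPartitionWorkerSketch extends Serializable {
>     def train(data: DataSet[Array[Double]], comm: Comm,
>               trackerEnv: Map[String, String], dim: Int,
>               rounds: Int): DataSet[Array[Double]] =
>       data.mapPartition {
>         (part: Iterator[Array[Double]], out: Collector[Array[Double]]) =>
>           val local = part.toArray // this slot's share of the data
>           comm.init(trackerEnv)    // join the worker group via the tracker
>           var model = Array.fill(dim)(0.0)
>           for (_ <- 0 until rounds) {
>             val grad = localGradient(local, model) // the ML part: local code
>             val sum  = comm.allreduceSum(grad)     // the only distributed step
>             model = model.zip(sum).map { case (w, g) => w - 0.01 * g }
>           }
>           comm.shutdown()
>           out.collect(model)       // every worker ends with the same model
>       }
>
>     // Placeholder so the sketch compiles; a real learner goes here.
>     private def localGradient(data: Array[Array[Double]],
>                               w: Array[Double]): Array[Double] =
>       Array.fill(w.length)(0.0)
>   }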
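>
> For the user-facing side, the Flink example linked at the top boils down
> to a few lines, roughly like this (adapted from the README; the paths
> are placeholders and exact signatures may drift between versions):
>
>   import ml.dmlc.xgboost4j.scala.flink.XGBoost
>   import org.apache.flink.api.scala._
>   import org.apache.flink.ml.MLUtils
>
>   object DistTrainWithFlink {
>     def main(args: Array[String]): Unit = {
>       val env = ExecutionEnvironment.getExecutionEnvironment
>       // read training data in LibSVM format
>       val trainData = MLUtils.readLibSVM(env, "/path/to/agaricus.txt.train")
>       val paramMap = Map(
>         "eta" -> 0.1,
>         "max_depth" -> 2,
>         "objective" -> "binary:logistic")
>       val round = 2 // number of boosting iterations
>       // each MapPartition slot becomes one distributed worker
>       val model = XGBoost.train(trainData, paramMap, round)
>       // score the training features and export the model
>       val predTrain = model.predict(trainData.map(_.vector))
>       model.saveModelAsHadoopFile("file:///path/to/xgboost.model")
>     }
>   }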
>
> Dataflow was originally designed for data processing, and I do feel that
> other abstractions sometimes fit machine learning better. The idea of
> embedding ML's own abstraction into one stage of a dataflow lets us
> benefit from the flexible data processing phases and still use the best
> learning algorithm.
>
> *Fault Tolerance?*
> Most of our algorithms assume a fail-restart scheme from the host
> platform: we rely on the system to restart failed jobs somewhere and
> provide the same input data. Internally, the communication library then
> kicks in and tries to recover, usually from some checkpoint. Of course,
> if the host provides a checkpointing feature, that can be used as well.
>
> *More Machine Learning Algorithms?*
> XGBoost is part of the DMLC project (http://dmlc.ml). Our goal is *not*
> to develop a general library that covers all algorithms, like FlinkML.
> Instead, we pick the most important ones used in production pipelines
> and build a *deeply optimized package for each specific one*. There are
> also shared components, like the communication library, so effort that
> would otherwise be duplicated is shared among the libraries. I believe
> we cover most of what people need; the remaining simple ones (k-means,
> linear models) can be implemented directly in FlinkML.
>
> One thing that could be interesting to try next is MXNet
> (https://github.com/dmlc/mxnet/tree/master/scala-package), a
> full-fledged deep learning library that comes with all the features you
> need as well as a Scala binding. However, it does need a bit more of
> what I mentioned in the requirements section.
>
> *What help we can get from the Flink community*
> I will list the points that are clear and actionable here:
> - Improve the xgboost-Flink API so that it is consistent with the
> current FlinkML pipeline.
> - Provide a "developer API" that allows the performance improvements I
> mentioned in "What is needed from DataFlow".
> - Support the abstractions needed for MXNet, and enable *streaming,
> GPU-enabled distributed deep learning on Flink*.
> - The main obstacle will be the "developer API".
>
> While some of this may seem like a lot of effort just to port one
> specific machine learning library, enabling these points basically
> enables porting every machine learning library we have built, and will
> build, on these abstractions.
>
> Tianqi
>
> On Sat, Mar 12, 2016 at 4:51 AM, Theodore Vasiloudis <
> theodoros.vasilou...@gmail.com> wrote:
>
> > Hello Tianqi,
> >
> > Yes, that definitely sounds interesting for us, and we are looking
> > forward to helping out with the implementation.
> >
> > Regards,
> > Theodore
> > --
> > Sent from a mobile device. May contain autocorrect errors.
> >
> > On Mar 12, 2016 11:29 AM, "Simone Robutti"
> > <simone.robu...@radicalbit.io> wrote:
> >
> > > This is a really interesting approach. The idea of an ML library
> > > over DataFlow is probably a winning move, and I hope it will stop
> > > the proliferation of worthless reimplementations taking place in
> > > the big data world. Do you think DataFlow posed specific problems
> > > for your work? Is it missing something that you had to fill in
> > > yourself?
> > >
> > > Here at RadicalBit we are interested both in DataFlow/Apache Beam
> > > and in distributed ML. Your approach looks like the best one to us,
> > > and I hope more and more teams follow your example, perhaps
> > > integrating existing libraries like H2O with DataFlow.
> > >
> > > Keep us updated if you plan to develop other algorithms.
> > >
> > > 2016-03-11 21:32 GMT+01:00 Tianqi Chen <tqc...@cs.washington.edu>:
> > >
> > > > Hi Flink Developers,
> > > > I am sending this email to let you know about XGBoost4J, a package
> > > > that we are planning to announce next week. Here is the draft
> > > > version of the post:
> > > > https://github.com/dmlc/xgboost/blob/master/doc/jvm/xgboost4j-intro.md
> > > >
> > > > In short, XGBoost is a machine learning package that is used by
> > > > more than half of the winning solutions in machine learning
> > > > challenges and is already widely used in industry. The distributed
> > > > version scales to billions of examples (10x faster than
> > > > spark.mllib in our experiments) while using fewer resources (see
> > > > http://arxiv.org/abs/1603.02754).
> > > >
> > > > We are interested in bringing distributed XGBoost to all Dataflow
> > > > platforms, including Flink. This does not mean we re-implement it
> > > > on Flink. Instead, we build a portable API with its own
> > > > communication library, able to run on different DataFlow
> > > > platforms.
> > > >
> > > > We hope this can benefit Flink users by giving them access to one
> > > > of the state-of-the-art machine learning algorithms. I am sending
> > > > this email to the mailing list to let you know about it, hoping to
> > > > find contributors to help improve the XGBoost Flink API to be more
> > > > compatible with the current FlinkML stack. We also hope to get
> > > > some support from the system side, to enable the abstractions
> > > > XGBoost needs, such as using multiple threads within a single
> > > > slot, for maximum performance.
> > > >
> > > > Let us know about your thoughts.
> > > >
> > > > Cheers
> > > >
> > > > Tianqi
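> > > >
> > > > P.S. To give a flavour of what "more compatible with the current
> > > > FlinkML stack" could mean, here is a rough sketch of a fit/predict
> > > > facade over the existing entry points. This is only a sketch: the
> > > > real FlinkML Predictor trait goes through implicit fit/predict
> > > > operations (omitted here), and the train/predict signatures below
> > > > are assumed from the example post and may differ:
> > > >
> > > >   import ml.dmlc.xgboost4j.scala.flink.{XGBoost, XGBoostModel}
> > > >   import org.apache.flink.api.scala._
> > > >   import org.apache.flink.ml.common.LabeledVector
> > > >   import org.apache.flink.ml.math.Vector
> > > >
> > > >   // FlinkML-flavoured facade: fit() trains a distributed model,
> > > >   // predict() scores a DataSet of feature vectors.
> > > >   class XGBoostPredictor(params: Map[String, Any], rounds: Int) {
> > > >     private var model: Option[XGBoostModel] = None
> > > >
> > > >     def fit(training: DataSet[LabeledVector]): this.type = {
> > > >       model = Some(XGBoost.train(training, params, rounds))
> > > >       this
> > > >     }
> > > >
> > > >     def predict(input: DataSet[Vector]): DataSet[Array[Float]] =
> > > >       model match {
> > > >         case Some(m) => m.predict(input)
> > > >         case None =>
> > > >           throw new IllegalStateException("call fit() first")
> > > >       }
> > > >   }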