Re: XGBoost on DataFlow and Flink

Tianqi Chen Sat, 12 Mar 2016 09:35:53 -0800

Thanks for the replies. I am writing a long email to answer Simone's
questions and clarify what we do.


I want to mention that *you can use the library already in Flink*. See
Flink example here:
https://github.com/dmlc/xgboost/tree/master/jvm-packages#xgboost-flink

I have not run a stress test on top of Flink, but we did run one on Spark,
and the result is consistent with our standalone version on a 100M example
dataset: around 10x faster than MLlib's version. I assume the same holds
for Flink as well.
So if you are interested, please try it out. I imagine it would be very
useful to have a Flink demo that directly gives a competitive result on,
say, a Kaggle competition; it could attract more users to the Flink
community as well.

*The Internal Details*
   Here at dmlc.ml, we build libraries in which we dive deep and aim for
the best performance and flexibility. We build our own abstractions when
needed: for example, XGBoost relies on Allreduce, and MXNet, another
well-known DMLC project for deep learning, relies on a parameter server
abstraction. We have tried to make these abstractions portable, so they are
not stand-alone C++ programs but can be used as libraries from other
languages, e.g. Scala.
   So essentially, here is what is needed:
- A tracker started on the driver side to connect the workers together so
that they can use our version of the communication library (this can be
swapped out depending on the level of integration, if a communication API
is provided natively by the platform).
- An API to start concurrent jobs (containers) that can execute a function
(either worker or server).
- Access to a partition of the data in each worker.

Take XGBoost for example: what we do in Flink is run it as a MapPartition
stage and treat each slot as a worker. The workers then collaboratively
solve the machine learning problem and return the trained model.
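
To make the worker picture concrete, here is a minimal Python sketch. It is
not the actual XGBoost or Flink code: threads stand in for slots, and the
names `Tracker`, `allreduce_sum`, `worker`, and `run_job` are hypothetical,
invented only for this illustration. Each worker computes statistics on its
own partition, and an Allreduce-style sum gives every worker the same
global result.

```python
import threading


class Tracker:
    """Hypothetical stand-in for the driver-side tracker that connects workers."""

    def __init__(self, num_workers):
        self._lock = threading.Lock()
        self._barrier = threading.Barrier(num_workers)
        self._buffer = None

    def allreduce_sum(self, local):
        # Phase 1: every worker adds its local vector into the shared buffer.
        with self._lock:
            if self._buffer is None:
                self._buffer = [0.0] * len(local)
            for i, value in enumerate(local):
                self._buffer[i] += value
        self._barrier.wait()           # all contributions have arrived
        result = list(self._buffer)    # every worker reads the same totals
        # Phase 2: once everyone has read, exactly one worker resets the buffer.
        if self._barrier.wait() == 0:
            self._buffer = None
        self._barrier.wait()           # reset is done before the next call
        return result


def worker(tracker, partition, results, rank):
    # Local statistics from this worker's partition only: [sum, count].
    local = [float(sum(partition)), float(len(partition))]
    total, count = tracker.allreduce_sum(local)
    results[rank] = total / count      # global mean, identical on every worker


def run_job(partitions):
    # One worker per partition, as in a MapPartition stage (assumption).
    tracker = Tracker(len(partitions))
    results = [None] * len(partitions)
    threads = [
        threading.Thread(target=worker, args=(tracker, part, results, rank))
        for rank, part in enumerate(partitions)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_job([[1, 2, 3], [4, 5], [6]]))  # [3.5, 3.5, 3.5]
```

The point of the design is that the ML code only calls `allreduce_sum`; how
the workers are launched and connected is left to the host platform.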

*What is needed from DataFlow*
Dataflow is a nice abstraction for data processing. As you can see, the
approach we take is somewhat more low-level; I would call it a developer
API, since the requirement is basically to start workers or other types of
containers that run a Scala function from the driver side. MapPartition
works well for the XGBoost case, but here is what can be improved:
- Being able to specify the resources of a slot at the ML stage. For
example, XGBoost as well as deep learning programs can benefit from using
multiple cores in each worker, while currently mapPartition uses one core
for each partition.
- Being able to launch containers that do not take data, for example
parameter server instances. This is mainly needed for deep learning
programs.
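
The two requests above could be expressed as a small job description. This
is only a sketch of the information such a developer API might carry;
`ContainerSpec` and `plan_job` are hypothetical names and do not exist in
Flink or XGBoost.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ContainerSpec:
    """Hypothetical description of one container to launch."""
    role: str                  # "worker" or "server"
    cores: int                 # cores the container may use
    partition: Optional[int]   # data partition index, None if it takes no data


def plan_job(num_partitions: int, cores_per_worker: int,
             num_servers: int = 0, cores_per_server: int = 1) -> List[ContainerSpec]:
    # One worker per data partition, each allowed several cores.
    workers = [ContainerSpec("worker", cores_per_worker, p)
               for p in range(num_partitions)]
    # Servers (e.g. parameter servers) are launched without any data.
    servers = [ContainerSpec("server", cores_per_server, None)
               for _ in range(num_servers)]
    return workers + servers


for spec in plan_job(num_partitions=2, cores_per_worker=4, num_servers=1):
    print(spec)
```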


*Why not implement them using DataFlow?*
One thing I expect people to ask is why we do not directly use (multiple)
dataflow stages to implement these algorithms. This is a possible approach,
but here are the reasons why we do not take it:
- Most of the work in an advanced ML algorithm is actually the machine
learning part, with a bit of communication added to it. So directly using a
communication library inside the ML code allows easier migration from an
optimized single-machine version to a distributed one.
- Not all dataflow executors are alike. Machine learning usually benefits
from persistent program state (which Flink has but Spark does not), and we
want to be invariant to such differences.
Dataflow was originally designed for data processing, and I do feel that
sometimes other abstractions fit machine learning better. The idea of
embedding the ML abstraction into one stage of a dataflow allows us to
benefit from the flexible data processing phase and also use the best
learning algorithm.

*Fault Tolerance?*
Most of the algorithms we have assume a fail-restart scheme from the host
platform, which means we rely on the system to restart failed jobs
somewhere and provide them with the same input data. Then the communication
library kicks in internally and tries to recover, usually via some
checkpoint. Of course, if the host offers a checkpoint feature, it can also
be used.
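
As a rough illustration of fail-restart with checkpoint recovery, here is a
minimal single-process Python sketch. It is not how the communication
library is implemented: `train`, `run_with_restart`, and `WorkerFailure`
are hypothetical names, the "model" is a placeholder number, and a plain
dict stands in for checkpoint storage that survives the restart.

```python
class WorkerFailure(Exception):
    """Simulated crash of a worker process."""


def train(data, num_rounds, checkpoint, log, fail_at=None):
    # Resume from the last checkpoint if one exists, else start fresh.
    state = dict(checkpoint.get("state", {"round": 0, "model": 0.0}))
    while state["round"] < num_rounds:
        if state["round"] == fail_at:
            raise WorkerFailure(f"crashed before round {state['round']}")
        # Placeholder for one training round over the same input data.
        state["model"] += sum(data) / len(data)
        log.append(state["round"])         # record which rounds really ran
        state["round"] += 1
        checkpoint["state"] = dict(state)  # checkpoint after every round
    return state["model"]


def run_with_restart(data, num_rounds, fail_at):
    # Plays the role of the host platform: restart with the same input.
    checkpoint, log, attempts = {}, [], 0
    while True:
        attempts += 1
        try:
            # The injected failure happens only on the first attempt.
            model = train(data, num_rounds, checkpoint, log,
                          fail_at if attempts == 1 else None)
            return model, attempts, log
        except WorkerFailure:
            continue  # restart the job; training resumes from the checkpoint


model, attempts, log = run_with_restart([1.0, 2.0, 3.0], num_rounds=5, fail_at=3)
print(model, attempts, log)  # 10.0 2 [0, 1, 2, 3, 4]
```

Each round appears in the log exactly once, so the restarted job did not
redo the rounds completed before the failure.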



*More Machine Learning Algorithms?*
XGBoost is part of the DMLC project (http://dmlc.ml). Our goal is *not* to
develop a general library that covers all algorithms, like FlinkML.
Instead, we pick the most important ones that are used in production
pipelines and build *a deeply optimized package for each specific one.* Of
course there are also shared components, like the communication library, so
that effort which would otherwise be duplicated among the libraries is
shared. I believe we have covered most things people need; the remaining
simple ones (K-means, linear models) can be implemented directly in
FlinkML.

One thing that could be interesting to try next is MXNet
https://github.com/dmlc/mxnet/tree/master/scala-package, which is a
full-fledged deep learning library that comes with all the features you
need as well as a Scala binding. However, it does need a few more of the
things I mentioned in the requirements section.


*What Help We Can Get from the Flink Community*
I will list the points that are clear and actionable here:

- Improve the xgboost-Flink API so that it is consistent with the current
FlinkML pipeline
- Provide some "developer API" that allows the performance improvements I
mentioned in "What is needed from DataFlow"
- Support the abstractions needed for MXNet, and enable *streaming,
GPU-enabled distributed deep learning on Flink*
    - The main obstacle will be the "developer API"

While some of this may seem like a lot of effort just to port one specific
machine learning library, enabling these abstractions basically enables
porting all the machine learning libraries we have built, and will build,
on top of them.

Tianqi


On Sat, Mar 12, 2016 at 4:51 AM, Theodore Vasiloudis <
[email protected]> wrote:

> Hello Tianqi,
>
> Yes, that definitely sounds interesting for us and we are looking forward
> to helping out with the implementation.
>
> Regards,
> Theodore
> --
> Sent from a mobile device. May contain autocorrect errors.
> On Mar 12, 2016 11:29 AM, "Simone Robutti" <[email protected]>
> wrote:
>
> > This is a really interesting approach. The idea of an ML library over
> > DataFlow is probably a winning move and I hope it will stop the
> > proliferation of worthless reimplementations that is taking place in the
> > big data world. Do you think that DataFlow posed specific problems to
> > your work? Is it missing something that you had to fill in with your
> > work?
> >
> > Here at RadicalBit we are interested both in DataFlow/Apache Beam and in
> > distributed ML. Your approach looks the best to us, and I hope more and
> > more teams follow your example, maybe integrating existing libraries like
> > H2O with DataFlow.
> >
> > Keep us updated if you plan to develop other algorithms.
> >
> > 2016-03-11 21:32 GMT+01:00 Tianqi Chen <[email protected]>:
> >
> > > Hi Flink Developers,
> > >     I am sending this email to let you know about XGBoost4J, a package
> > > that we are planning to announce next week. Here is the draft version
> > > of the post:
> > > https://github.com/dmlc/xgboost/blob/master/doc/jvm/xgboost4j-intro.md
> > >
> > >     In short, XGBoost is a machine learning package that is used by
> > > more than half of the machine learning challenge winning solutions and
> > > is already widely used in industry. The distributed version scales to
> > > billions of examples (10x faster than spark.mllib in the experiment)
> > > with fewer resources (see http://arxiv.org/abs/1603.02754).
> > >
> > >     We are interested in putting distributed XGBoost into all Dataflow
> > > platforms, including Flink. This does not mean we re-implement it on
> > > Flink. Instead, we build a portable API that has a communication
> > > library and is able to run on different DataFlow programs.
> > >
> > >     We hope this can benefit Flink users by giving them access to one
> > > of the state-of-the-art machine learning algorithms. I am sending this
> > > email to the mailing list to let you know about it, hoping to get some
> > > contributors to help improve the XGBoost Flink API so that it is more
> > > compatible with the current FlinkML stack. We also hope to get some
> > > support from the system side, to enable an abstraction needed in
> > > XGBoost for using multiple threads within even one slot for maximum
> > > performance.
> > >
> > >
> > > Let us know about your thoughts.
> > >
> > > Cheers
> > >
> > > Tianqi
> > >
> >
>
