Thanks! I was not aware of SourceOperator before. Then yes, these things
can be done.

About the multi-threading issue, I am looking for a more principled API to
specify resource requirements, e.g. "the slots in this stage need 4 CPU
cores and 1 GPU", so that the resource allocator can be aware of it.

We have published the release announcement:
http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html

And you can try the Flink version out :)




On Mon, Mar 14, 2016 at 10:00 AM, Till Rohrmann <trohrm...@apache.org>
wrote:

> Hi Tianqi,
>
> dmlc looks really cool and it would be great to integrate it with Flink. As
> far as I understood your requirements, I think that you can already
> implement most of it on Flink.
>
> For example, starting a special container which does not receive any input
> could be a specialized SourceOperator. In this SourceOperator you could
> then start a parameter server which receives the partial updates of the
> processing tasks.
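>
> As a rough sketch of what I mean on the streaming side (the
> ParameterServer class below is just a placeholder for whatever dmlc
> provides, not an existing API):
>
>   import org.apache.flink.streaming.api.functions.source.SourceFunction
>
>   // Emits no records; its only job is to keep a parameter server
>   // running for as long as the Flink job is alive.
>   class ParameterServerSource extends SourceFunction[String] {
>     @volatile private var running = true
>
>     override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
>       val server = new ParameterServer()  // placeholder for the dmlc PS
>       server.start()
>       while (running) Thread.sleep(1000)  // stay alive, emit nothing
>       server.shutdown()
>     }
>
>     override def cancel(): Unit = { running = false }
>   }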
>
> You're right that each task is executed by its own thread. However, you
> can always spawn new threads from within your user function. This should
> allow you to run the ML code multi-threaded. But then it might be
> advisable to lower the number of slots a bit so that you have more cores
> available than slots defined.
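>
> For example, roughly like this (only a sketch; LabeledPoint, Model and
> trainMultiThreaded are placeholders, not existing APIs):
>
>   import java.util.concurrent.{Executors, TimeUnit}
>   import org.apache.flink.api.common.functions.RichMapPartitionFunction
>   import org.apache.flink.util.Collector
>   import scala.collection.JavaConverters._
>
>   class MultiThreadedTrainer(numThreads: Int)
>       extends RichMapPartitionFunction[LabeledPoint, Model] {
>
>     override def mapPartition(values: java.lang.Iterable[LabeledPoint],
>                               out: Collector[Model]): Unit = {
>       // do the heavy ML work on our own pool instead of the single
>       // task thread Flink gives us
>       val pool = Executors.newFixedThreadPool(numThreads)
>       try {
>         out.collect(trainMultiThreaded(values.asScala, pool))
>       } finally {
>         pool.shutdown()
>         pool.awaitTermination(1, TimeUnit.MINUTES)
>       }
>     }
>   }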
>
> The integration of the dmlc API with the existing ML pipelines should not
> be too hard. As long as one has access to the resulting data set it should
> be easy to plug it into another predictor/estimator instance. I guess we
> would mainly need some tooling around it.
>
> Looking forward to running xgboost and mxnet with Flink :-)
>
> Cheers,
> Till
>
> On Sat, Mar 12, 2016 at 7:17 PM, Simone Robutti <
> simone.robu...@radicalbit.io> wrote:
>
> > Thanks for the insight; what you're doing is really interesting. I will
> > definitely spend some time looking at DMLC and MXNet.
> >
> > 2016-03-12 18:35 GMT+01:00 Tianqi Chen <tqc...@cs.washington.edu>:
> >
> > > Thanks for the reply. I am writing a long email to give the answers
> > > to Simone and clarify what we do.
> > >
> > > I want to mention that *you can already use the library in Flink*.
> > > See the Flink example here:
> > > https://github.com/dmlc/xgboost/tree/master/jvm-packages#xgboost-flink
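> > >
> > > Roughly, using it looks like the following (just a sketch; please
> > > check the linked example for the exact, current API and types, and
> > > the data path is a placeholder):
> > >
> > >   import ml.dmlc.xgboost4j.scala.flink.XGBoost
> > >   import org.apache.flink.api.scala._
> > >   import org.apache.flink.ml.MLUtils
> > >
> > >   val env = ExecutionEnvironment.getExecutionEnvironment
> > >   // read a LibSVM-format training file as DataSet[LabeledVector]
> > >   val trainData = MLUtils.readLibSVM(env, "/path/to/train.libsvm")
> > >   val params = Map(
> > >     "eta" -> 0.1, "max_depth" -> 6, "objective" -> "binary:logistic")
> > >   // train distributed XGBoost for 100 rounds, one worker per slot
> > >   val model = XGBoost.train(trainData, params, 100)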
> > >
> > > I have not run a stress test on top of Flink, but we did stress-test
> > > the Spark version, and it is consistent with our standalone version:
> > > on a 100M-example dataset it gives around a 10x speedup over mllib's
> > > version. I assume the same holds for Flink as well. So if you are
> > > interested, please try it out. I imagine it could be very useful to
> > > have a Flink demo that directly gives a competitive result on, say,
> > > a Kaggle competition, and it could attract more users to the Flink
> > > community as well.
> > >
> > > *The Internal Details*
> > >    Here at dmlc.ml, we are building libraries where we dive deep and
> > > aim for the best performance and flexibility. We build our own
> > > abstractions when needed; for example, XGBoost relies on Allreduce,
> > > and MXNet, another well-known deep learning project, relies on a
> > > parameter server abstraction. We tried to make these abstractions
> > > portable, so they are not stand-alone C++ programs but can be used
> > > as libraries in other languages, e.g. Scala.
> > >    So essentially, here is what is needed:
> > > - Start a tracker on the driver side to connect the workers together
> > > through our communication library (this can be swapped out,
> > > depending on the level of integration, if a communication API is
> > > provided natively by the platform).
> > > - An API to start concurrent jobs (containers) that can execute a
> > > function (either worker or server).
> > > - Getting access to a partition of the data in each worker.
> > >
> > > Take XGBoost for example: what we do in Flink is run it as a
> > > MapPartition stage and treat each slot as a worker. The workers then
> > > collaboratively solve the machine learning problem and return the
> > > trained model.
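> > >
> > > In pseudo-Scala, the pattern is roughly this (not the actual
> > > xgboost4j-flink code; Rabit.init/shutdown stand in for the
> > > communication setup, and workerEnvs/trainLocal/Booster are
> > > placeholders):
> > >
> > >   // one parallel instance of this runs per slot; each becomes a
> > >   // worker in the allreduce ring set up by the driver-side tracker
> > >   val models = trainData.mapPartition {
> > >     (rows: Iterator[LabeledVector], out: Collector[Booster]) =>
> > >       Rabit.init(workerEnvs)          // join the tracker's ring
> > >       try {
> > >         // ordinary single-machine training loop; the allreduce
> > >         // calls happen inside the library
> > >         out.collect(trainLocal(rows))
> > >       } finally {
> > >         Rabit.shutdown()
> > >       }
> > >   }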
> > >
> > > *What is needed from DataFlow*
> > > Dataflow is a nice abstraction for data processing. As you can see,
> > > the approach we take is somewhat lower level; I would call it a
> > > developer API, since the requirement is basically to start worker or
> > > other types of containers that run a Scala function from the driver
> > > side. MapPartition works well for the XGBoost case, but here is what
> > > could be improved:
> > > - Being able to specify the resources of the slots at the ML stage;
> > > for example, xgboost as well as deep learning programs can benefit
> > > from using multiple cores in each worker, while currently
> > > mapPartition uses one core for each partition (see the hypothetical
> > > sketch after this list).
> > > - Being able to launch a container that does not take data, for
> > > example a parameter server instance. This is mainly needed for the
> > > deep learning programs.
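> > >
> > > To make the first point concrete, something along these lines would
> > > be enough (a purely hypothetical API; nothing like this exists in
> > > Flink today):
> > >
> > >   trainData
> > >     .mapPartition(new XGBoostWorker(params))
> > >     .withResources(cpuCores = 4, gpus = 1)  // hypothetical per-stage hint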
> > >
> > >
> > > *Why not implement them using DataFlow?*
> > > One thing I can expect people to argue is: why not directly use
> > > (multiple) dataflow stages to implement these algorithms? This is a
> > > possible approach; here are the reasons why we did not take it:
> > > - Most of the work in an advanced ML algorithm is actually the
> > > machine learning part, with a bit of communication added to it. So
> > > directly using a communication library inside the ML code allows
> > > easier migration from an optimized single-machine version to a
> > > distributed one.
> > > - Not all dataflow executors are alike; machine learning usually
> > > benefits from persistent program state (which Flink has but Spark
> > > does not), and we want to be invariant to such differences.
> > > Dataflow was originally designed for data processing, and I do feel
> > > that sometimes other abstractions fit machine learning better. The
> > > idea of embedding the ML abstraction into one stage of a dataflow
> > > allows us to benefit from the flexible data processing phase and
> > > also use the best learning algorithm.
> > >
> > > *Fault Tolerance?*
> > > Most algorithms we have assume a fail-restart scheme from the host
> > > platform, which means we rely on the system to restart the failed
> > > jobs somewhere and provide the same input data. Then, internally,
> > > the communication library will kick in and try to recover, usually
> > > via some checkpoint. Of course, if the host provides a checkpoint
> > > feature, it can also be used.
> > >
> > >
> > >
> > > *More Machine Learning Algorithms?*
> > > XGBoost is part of the DMLC project (http://dmlc.ml). Our goal is
> > > *not* to develop a general library that covers all algorithms, like
> > > FlinkML. Instead, we pick the most important ones that are used in
> > > production pipelines and build a *deeply optimized package for each
> > > specific one*. Of course there are also shared components, like the
> > > communication library, so duplicated effort among the libraries is
> > > shared. I believe we cover most of the things people need, plus
> > > there are some simple ones that can be directly implemented in
> > > FlinkML (k-means, linear models).
> > >
> > > One thing that could be interesting to try next is MXNet
> > > (https://github.com/dmlc/mxnet/tree/master/scala-package), which is
> > > a full-fledged deep learning library that comes with all the
> > > features you need as well as a Scala binding. However, it does need
> > > a bit more of the things I mentioned in the requirements section.
> > >
> > >
> > > *What Help We Can Get from the Flink Community*
> > > I will list the points that are clear and actionable here:
> > >
> > > - Improve the xgboost-Flink API so that it is consistent with the
> > > current FlinkML pipeline
> > > - Provide some "developer API" that allows the performance
> > > improvements I mentioned in "What is needed from DataFlow"
> > > - Support the abstractions needed for MXNet, and enable *streaming,
> > > GPU-enabled distributed deep learning on Flink*
> > >     - The main obstacle will be the "developer API"
> > >
> > > While some of this effort may seem like a lot just to port one
> > > specific machine learning library, enabling it basically enables
> > > porting all the machine learning libraries we have built, and will
> > > be building, on top of these abstractions.
> > >
> > > Tianqi
> > >
> > >
> > > On Sat, Mar 12, 2016 at 4:51 AM, Theodore Vasiloudis <
> > > theodoros.vasilou...@gmail.com> wrote:
> > >
> > > > Hello Tianqi,
> > > >
> > > > Yes, that definitely sounds interesting for us, and we are looking
> > > > forward to helping out with the implementation.
> > > >
> > > > Regards,
> > > > Theodore
> > > > --
> > > > Sent from a mobile device. May contain autocorrect errors.
> > > > On Mar 12, 2016 11:29 AM, "Simone Robutti" <
> > simone.robu...@radicalbit.io
> > > >
> > > > wrote:
> > > >
> > > > > This is a really interesting approach. The idea of an ML library
> > > > > over DataFlow is probably a winning move, and I hope it will
> > > > > stop the proliferation of worthless reimplementations that is
> > > > > taking place in the big data world. Do you think that DataFlow
> > > > > posed specific problems for your work? Is it missing something
> > > > > that you had to fill in with your own work?
> > > > >
> > > > > Here at RadicalBit we are interested both in DataFlow/Apache
> > > > > Beam and in distributed ML, and your approach looks the best to
> > > > > us. I hope more and more teams follow your example, maybe
> > > > > integrating existing libraries like H2O with DataFlow.
> > > > >
> > > > > Keep us updated if you plan to develop other algorithms.
> > > > >
> > > > > 2016-03-11 21:32 GMT+01:00 Tianqi Chen <tqc...@cs.washington.edu>:
> > > > >
> > > > > > Hi Flink Developers,
> > > > > >     I am sending this email to let you know about XGBoost4J, a
> > > > > > package that we are planning to announce next week. Here is
> > > > > > the draft version of the post:
> > > > > > https://github.com/dmlc/xgboost/blob/master/doc/jvm/xgboost4j-intro.md
> > > > > >
> > > > > >     In short, XGBoost is a machine learning package that is
> > > > > > used by more than half of the machine learning challenge
> > > > > > winning solutions and is already widely used in industry. The
> > > > > > distributed version scales to billions of examples (10x faster
> > > > > > than spark.mllib in our experiments) with fewer resources (see
> > > > > > http://arxiv.org/abs/1603.02754).
> > > > > >
> > > > > >     We are interested in putting distributed XGBoost onto all
> > > > > > Dataflow platforms, including Flink. This does not mean we
> > > > > > re-implement it on Flink; instead, we build a portable API,
> > > > > > with a communication library, that is able to run on different
> > > > > > DataFlow programs.
> > > > > >
> > > > > >     We hope this can benefit Flink users by enabling them to
> > > > > > get access to one of the state-of-the-art machine learning
> > > > > > algorithms. I am sending this email to the mailing list to let
> > > > > > you know about it, hoping to get some contributors to help
> > > > > > improve the XGBoost Flink API to be more compatible with the
> > > > > > current FlinkML stack. We also hope to get some support from
> > > > > > the system side, to enable some abstractions needed in XGBoost
> > > > > > for using multiple threads within even one slot for maximum
> > > > > > performance.
> > > > > >
> > > > > >
> > > > > > Let us know your thoughts.
> > > > > >
> > > > > > Cheers
> > > > > >
> > > > > > Tianqi
> > > > > >
> > > > >
> > > >
> > >
> >
>
