Re: Kicking off the Machine Learning Library

Stephan Ewen Wed, 14 Jan 2015 11:30:39 -0800

Thank you Dmitriy!

That was quite a bit food for thought - I think I will need a bit more time
to digest that ;-)
Especially the part about the algebraic data structures and how they allow
you to pipeline and combine algorithms is exactly along the lines of what I
was thinking - thank you for sharing your observations there.


Concerning features of Flink: We recently had a large piece of code going
in that is exactly the underpinning for running multiple programs on cached
data sets.
Some of the primitives to bring back counters or data sets to the driver
exist already in pull requests, others will be added soon. So I think we
will be good there.
I would go for an integration between Flink and Mahout along the same lines
as the integration with Spark, rather than in the way it is done with H2O.

Greetings,
Stephan


On Tue, Jan 13, 2015 at 11:53 PM, Dmitriy Lyubimov <dlie...@gmail.com>
wrote:

> In terms of Mahout DSL it means implementing a bunch of physical operators
> such as transpose, A'B or B'A on large row or column partitioned matrices.
>
> Mahout optimizer takes care of simplifying algebraic expressions such as 1+
> exp(drm) => drm.apply-unary(1+exp(x)) and tracking things like identical
> partitioning of datasets where applicable.
>
> Adding a back  turned out to be pretty trivial for h20 back. I don't think
> there's a need to bridge flink operations to any of existing backs of DSL
> for support of Flink. instead i was hoping flink could have its own native
> back. Also keep in mind that I am not aware of any actual applications with
> Mahout + h2o back whereas i have built dozens on Spark in past 2 years
> (albeit on a quite intensely hacked version of the bindings).
>
> With Flink it is a bit less trivial i guess because mahout optimizer
> sometimes tries to run quick summary things such as matrix geometry
> detection as a part of optimizer action itself. And Flink, last time i
> checked, did not support running multiple computational actions on a
> cache-pinned dataset. So that was the main difficulty last time. Perhaps no
> more. Or perhaps there's a clever way to work around this. not sure.
>
> At the same time Flink back for algebraic optimizer also may be more
> trivial than e.g H20 back was since H20 insisted on marrying their own
> in-core Matrix api so they created a bridge between Mahout (former Colt)
> Matrix api and one of their own. Whereas if distributed system just takes
> care of serializing Writables of Matrices and Vectors (such as it was done
> for Spark backend) then there's practically nothing to do. Which, my
> understanding is, Flink is doing good.
>
> Now, on the part of current shortcomings, currently there is a fairly long
> list of performance problems, many of which i fixed internally but not
> publicly.  But hopefully some or all of that will eventually be pushed
> sooner rather than later.
>
> The biggest sticking issue is in-core performance of the Colt apis. Even
> jvm-based matrices could do much better if they took a bit of cost-based
> approach doing so, and perhaps integrating a bit more sophisticated
> Winograd type of gemm approaches. Not to mention integration of things like
> Magma, jcuda or even simply netlib.org native dense blas stuff (as in
> Breeze). Good thing though it does not affect backend work since once it is
> fixed for in-core algebra, it is fixed elsewhere and this is hidden behind
> scalabindings (in-core algebra DSL).
>
> These in-core and out-of-core performance optimization issues are probably
> the only thing between code base and a good well rounded release. Out of
> core bad stuff is mostly fixed internally though.
>
> Anyone interested in working on any of the issues I mentioned please throw
> a note to Mahout list to me or Suneel Marthi.
>
> On the issue of optimized ML stuff vs. general algebra, here's food for
> thought.
>
> I found that 70% of algorithms are not purely algebraic in R-like operation
> set sense.
>
> But close to 95% could contain significant elements of simplification thru
> algebra. Even probabilistic things that just use stuff like MVN or Wishart
> sampling, or Gaussian processes. A lot easier to read and maintain. Even
> LLoyd iteration has a simple algebraic expression (turns out).
>
> Close to 80% could not avoid using some algebra.
>
> Only 5% could get away with not doing algebra at all.
>
> Thus, 95% of things i ever worked on are either purely or quasi algebraic.
> Even when they are probabilisitic at core. What it means is that any ML
> library would benefit enormously if it acts in terms of common algebraic
> data structures. It will create opportunity for pipelines to seamlessly
> connect elements of learning such as standardization, dimensionality
> reduction, MDS/visualization methods, recommender methods as well as
> clustering methods. My general criticism for MLLib has been that until
> recently, they did not work on creating such common math structure
> standards, algebraic data in particular, and every new method's
> input/output came in its own form an shape. Still does.
>
> So dilemma of separating efforts on ones  using and not using algebra is a
> bit false. Most methods are quasi-algebraic (meaning they have at least
> some need for R-like matrix manipulations). Of course there's need for
> specific distributed primitives from time to time, there's no arguing about
> it (like i said, 70% of all methods cannot be purely algebraic, only about
> 25% are). There's an issue of algebra physical ops performance on spark and
> in-core, but there's no reason (for me) to believe that effort to fix it in
> POJO based implementations would require any less effort.
>
> There were also some discussion whether and how it is possible to create
> quasi algebraic methods such that their non-algebraic part is easy to port
> to another platform (provided algebraic part is already compatible), but
> that's completely another topic. But what i am saying additional benefit
> might be that moving ML methodology to platform-agnostic package in Mahout
> (if such interest ever appears) would also be much easier even for
> quasi-algebraic approaches if the solutions gravitated to using algebra as
> opposed to not using it .
>
> just some food for thought.
>
> thanks.
>
> -D
>
>
>
>
> On Sun, Jan 4, 2015 at 10:14 AM, Till Rohrmann <trohrm...@apache.org>
> wrote:
>
> > The idea to work with H2O sounds really interesting.
> >
> > In terms of the Mahout DSL this would mean that we have to translate a
> > Flink dataset into H2O's basic abstraction of distributed data and vice
> > versa. Everything other than writing to disk with one system and reading
> > from there with the other is probably non-trivial and hard to realize.
> > On Jan 4, 2015 9:18 AM, "Henry Saputra" <henry.sapu...@gmail.com> wrote:
> >
> > > Happy new year all!
> > >
> > > Like the idea to add ML module with Flink.
> > >
> > > As I have mentioned to Kostas, Stephan, and Robert before, I would
> > > love to see if we could work with H20 project [1], and it seemed like
> > > the community has added support for it for Apache Mahout backend
> > > binding [2].
> > >
> > > So we might get some additional scale ML algos like deep learning.
> > >
> > > Definitely would love to help with this initiative =)
> > >
> > > - Henry
> > >
> > > [1] https://github.com/h2oai/h2o-dev
> > > [2] https://issues.apache.org/jira/browse/MAHOUT-1500
> > >
> > > On Fri, Jan 2, 2015 at 6:46 AM, Stephan Ewen <se...@apache.org> wrote:
> > > > Hi everyone!
> > > >
> > > > Happy new year, first of all and I hope you had a nice
> end-of-the-year
> > > > season.
> > > >
> > > > I thought that it is a good time now to officially kick off the
> > creation
> > > of
> > > > a library of machine learning algorithms. There are a lot of
> individual
> > > > artifacts and algorithms floating around which we should consolidate.
> > > >
> > > > The machine-learning library in Flink would stand on two legs:
> > > >
> > > >  - A collection of efficient implementations for common problems and
> > > > algorithms, e.g., Regression (logistic), clustering (k-Means,
> Canopy),
> > > > Matrix Factorization (ALS), ...
> > > >
> > > >  - An adapter to the linear algebra DSL in Apache Mahout.
> > > >
> > > > In the long run, it would be the goal to be able to mix and match
> code
> > > from
> > > > both parts.
> > > > The linear algebra DSL is very convenient when it comes to quickly
> > > > composing an algorithm, or some custom pre- and post-processing
> steps.
> > > > For some complex algorithms, however, a low level system specific
> > > > implementation is necessary to make the algorithm efficient.
> > > > Being able to call the tailored algorithms from the DSL would combine
> > the
> > > > benefits.
> > > >
> > > >
> > > > As a concrete initial step, I suggest to do the following:
> > > >
> > > > 1) We create a dedicated maven sub-project for that ML library
> > > > (flink-lib-ml). The project gets two sub-projects, one for the
> > collection
> > > > of specialized algorithms, one for the Mahout DSL
> > > >
> > > > 2) We add the code for the existing specialized algorithms. As
> followup
> > > > work, we need to consolidate data types between those algorithms, to
> > > ensure
> > > > that they can easily be combined/chained.
> > > >
> > > > 3) The code for the Flink bindings to the Mahout DSL will actually
> > reside
> > > > in the Mahout project, which we need to add as a dependency to
> > > flink-lib-ml.
> > > >
> > > > 4) We add some examples of Mahout DSL algorithms, and a template how
> to
> > > use
> > > > them within Flink programs.
> > > >
> > > > 5) Create a good introductory readme.md, outlining this structure.
> The
> > > > readme can also track the implemented algorithms and the ones we put
> on
> > > the
> > > > roadmap.
> > > >
> > > >
> > > > Comments welcome :-)
> > > >
> > > >
> > > > Greetings,
> > > > Stephan
> > >
> >
>

Re: Kicking off the Machine Learning Library

Reply via email to