Hi all! Sorry for joining this discussion late (I have already missed some of the deadlines set in this thread).
*Here are some thoughts about what we can do immediately*

(1) Grow the ML community by adding committers with a dedicated ML focus. Irrespective of any direction decision, this is a must. I know that the PMC is actively working on this, so stay tuned for some updates.

(2) I think a repository split helps to make library committer additions easier, even if it does not go hand in hand with a community split. I believe that we can trust committers who were appointed mainly for their library work to commit directly to the library repository and to go through pull requests in the engine/api/connector repository. In some sense we have the same thing already: we trust committers to only commit when they are confident in the touched component and to submit a pull request if in doubt. Having separate repositories makes this rule of thumb even simpler.

*On the Roadmap Discussion*

- Thanks for the collection and discussion already, these are super nice thoughts. Kudos!
- My personal take is that "model evaluation" over streams will happen in any case - there is genuine interest in it, and various users have built it themselves already.
- Model evaluation as one step of a streaming pipeline (classifying events), followed by CEP (pattern detection) or anomaly detection, is a valuable use case on top of what pure model serving systems usually do (a rough sketch of such a pipeline is at the end of this mail).
- An "ML training library" is certainly interesting, if the community can pull it off. More details below.
- A question I do not yet have a good intuition on is whether "model evaluation" and the training part are so different (once a good abstraction for model evaluation has been built) that little cross coordination is needed, or whether there is potential in integrating them.

*Thoughts on the ML training library*

- Especially now, there seems to be a big trend towards deep learning (is it just temporary, or will this be the future?), and in that space little works without GPU acceleration.
- It is always easier to do something new than to be the n-th version of something existing (sorry for the generic truism). The latter admittedly gives the "all in one integrated framework" advantage (which can be a very strong argument indeed), but the former attracts completely new communities and can often make more noise with less effort.
- The "new" is not required to be "online learning", where Theo has described well that it does not look like it is taking off. It can also be traditional ML re-imagined for "continuous applications", as "continuous / incremental re-training" or so (the second sketch at the end of this mail shows one pattern for that). Even on the "model evaluation" side, there is a lot of interesting stuff as mentioned already, like ensembles, multi-armed bandits, ...
- It may well be worth tapping into the work of an existing library (like TensorFlow) for an easy fix to some hard problems (pre-existing hardware integration, pre-existing optimized linear algebra solvers, etc.) and thinking about what such use cases would look like in the context of typical Flink applications (see the third sketch at the end of this mail).

*A bit of engine background information that may help in the planning:*

- The DataStream API will in the future also support bounded data computations explicitly (I say this not as a fact, but as a strong believer that this is the right direction).
- Batch runtime execution has seen less focus recently, but seems to be getting more community attention, because some organizations that contribute a lot want to use the batch side as well. For example, the effort on fine-grained recovery will already strengthen batch a lot.
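To make the "classify events, then detect patterns" use case concrete, here is a minimal Scala sketch against a 1.3-era DataStream + flink-cep-scala API. The event schema, the source, the hard-coded linear model, and the threshold pattern are all made up for illustration; a real pipeline would plug in its own model and patterns.

import org.apache.flink.streaming.api.scala._
import org.apache.flink.cep.scala.CEP
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.streaming.api.windowing.time.Time

case class Event(userId: String, features: Array[Double])
case class Scored(userId: String, score: Double)

object ScoreThenDetect {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // events arriving as "userId,f1,f2,f3" lines (source chosen only for the example)
    val events: DataStream[Event] = env
      .socketTextStream("localhost", 9999)
      .map { line =>
        val parts = line.split(',')
        Event(parts.head, parts.tail.map(_.toDouble))
      }

    // step 1: model evaluation - a hard-coded linear model stands in for a real one
    val weights = Array(0.4, -1.2, 0.7)
    val scored: DataStream[Scored] = events.map { e =>
      Scored(e.userId, e.features.zip(weights).map { case (x, w) => x * w }.sum)
    }

    // step 2: CEP on top of the scores - two high scores for one user within a minute
    val pattern = Pattern
      .begin[Scored]("first").where(_.score > 0.8)
      .next("second").where(_.score > 0.8)
      .within(Time.minutes(1))

    CEP.pattern(scored.keyBy(_.userId), pattern)
      .select(m => s"alert for user ${m("first").head.userId}")
      .print()

    env.execute("score-then-detect")
  }
}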
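For "continuous / incremental re-training", one pattern that works with the engine as it is today is to treat the model itself as a second stream: a CoFlatMapFunction holds the latest weights and swaps them whenever the output of a periodic re-training job arrives. Again just a sketch with a made-up linear model, not a finished design:

import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.util.Collector

// holds the latest model weights and applies them to incoming feature vectors;
// flatMap2 receives freshly re-trained weights and swaps the model in place
class ModelApplier extends CoFlatMapFunction[Array[Double], Array[Double], Double] {
  private var weights: Array[Double] = Array.empty

  override def flatMap1(features: Array[Double], out: Collector[Double]): Unit =
    if (weights.nonEmpty)
      out.collect(features.zip(weights).map { case (x, w) => x * w }.sum)

  override def flatMap2(newWeights: Array[Double], out: Collector[Double]): Unit =
    weights = newWeights // new model arrived: swap it in, emit nothing
}

// usage: featureStream.connect(modelUpdateStream).flatMap(new ModelApplier)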
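And to illustrate tapping an existing library, here is roughly what scoring with a TensorFlow SavedModel inside a Flink RichMapFunction could look like, via TensorFlow's Java API. The model path and the tensor names ("input", "output") are assumptions that depend entirely on how the model was exported; this shows the shape of the integration, not a tested implementation.

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.tensorflow.{SavedModelBundle, Tensor}

// scores records with a TensorFlow SavedModel; one model instance per parallel task
class TfScorer(modelPath: String) extends RichMapFunction[Array[Float], Float] {
  @transient private var bundle: SavedModelBundle = _

  override def open(parameters: Configuration): Unit = {
    // "serve" is the standard SavedModel tag; the path is a placeholder
    bundle = SavedModelBundle.load(modelPath, "serve")
  }

  override def map(features: Array[Float]): Float = {
    val input = Tensor.create(Array(features)) // shape [1, numFeatures]
    val output = bundle.session().runner()
      .feed("input", input)   // tensor names depend on the exported graph
      .fetch("output")
      .run().get(0)
    try {
      val buf = Array.ofDim[Float](1, 1)
      output.copyTo(buf)
      buf(0)(0)
    } finally {
      input.close()
      output.close()
    }
  }

  override def close(): Unit = if (bundle != null) bundle.close()
}

// usage: featureStream.map(new TfScorer("/path/to/saved_model"))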
Stephan

On Fri, Mar 10, 2017 at 2:38 PM, Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Roberto,
>
> jpmml looks quite promising and this could be a first step towards the
> model serving story. Thus, really looking forward to seeing it open
> sourced by you guys :-)
>
> @Katherin, I'm not saying that there is no interest in the community to
> work on batch features. However, there is simply not much capacity left to
> mentor such an effort at the moment. I fear that without mentoring from an
> experienced contributor who has worked on the batch part, it will be
> extremely hard to get such a change into the code base. But this will
> hopefully change in the future.
>
> I think the discussion from this thread moved over to [1] and I will
> continue discussing there.
>
> [1]
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Machine-Learning-on-Flink-Next-steps-td16334.html#none
>
> Cheers,
> Till