Thanks Theo. Just wrote some comments on the other thread, but it looks like you got it covered already.
Let me re-post what I think may help as input:

*Concerning Model Evaluation / Serving*

  - My personal take is that "model evaluation" over streams will happen in
    any case - there is genuine interest in it, and various users have built
    it themselves already. It would be a cool way to do something that has a
    very high chance of being productionized by users soon.
  - Model evaluation as one step of a streaming pipeline (classifying
    events), followed by CEP (pattern detection) or anomaly detection, is a
    valuable use case on top of what pure model serving systems usually do
    (see the sketch after this message).
  - A question I do not yet have a good intuition on is whether "model
    evaluation" and the training part are so different (once a good
    abstraction for model evaluation has been built) that little
    cross-coordination is needed, or whether there is potential in
    integrating them.

*Thoughts on the ML training library (DataSet API or DataStream API)*

  - I honestly don't quite understand what the big difference will be in
    targeting the batch or streaming API. You can use the DataSet API in a
    quite low-level fashion (missing async iterations).
  - There seems to be a big trend towards deep learning right now (is it just
    temporary, or is this the future?), and in that space little works
    without GPU acceleration.
  - It is always easier to do something new than to be the n-th version of
    something existing (sorry for the generic truism). The latter admittedly
    gives the "all-in-one integrated framework" advantage (which can be a
    very strong argument indeed), but the former attracts completely new
    communities and can often make more impact with less effort.
  - The "new" is not required to be "online learning", where Theo has
    described some concerns well. It can also be traditional ML re-imagined
    for "continuous applications", as "continuous / incremental re-training"
    or so. Even on the model evaluation side there is a lot of interesting
    stuff, as mentioned already: ensembles, multi-armed bandits, ...
  - It may well be worth tapping into the work of an existing library (like
    TensorFlow) for an easy fix to some hard problems (pre-existing hardware
    integration, pre-existing optimized linear algebra solvers, etc.) and
    thinking about how such use cases would look in the context of typical
    Flink applications.

*A bit of engine background information that may help in the planning:*

  - The DataStream API will in the future also support bounded data
    computations explicitly (I say this not as a fact, but as a strong
    believer that this is the right direction).
  - The batch runtime has seen less focus recently, but seems to be getting
    more community attention, because some organizations that contribute a
    lot want to use the batch side as well. For example, the effort on
    fine-grained recovery will already strengthen batch a lot.

Stephan
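To make the second bullet above concrete - score each event with a
pre-trained model, then run CEP over the scores - here is a minimal sketch
against the Scala DataStream and CEP APIs (roughly Flink 1.3-era, with
flink-cep-scala on the classpath). The Transaction and Scored types and the
score() function are invented placeholders; a real job would load a trained
model instead of hard-coding a threshold.

```scala
import org.apache.flink.cep.scala.CEP
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

// Hypothetical event types -- stand-ins, not part of any existing Flink API.
case class Transaction(userId: Long, amount: Double)
case class Scored(userId: Long, amount: Double, fraudScore: Double)

object ScoreThenDetect {

  // Placeholder for a real trained model.
  def score(t: Transaction): Double = if (t.amount > 10000) 0.95 else 0.1

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val transactions: DataStream[Transaction] = env.fromElements(
      Transaction(1L, 12000.0), Transaction(1L, 15000.0), Transaction(2L, 40.0))

    // Step 1: "model evaluation" -- score every event with the model.
    val scored: DataStream[Scored] =
      transactions.map(t => Scored(t.userId, t.amount, score(t)))

    // Step 2: CEP on top of the scored stream -- e.g. two highly
    // suspicious transactions from the same user within one minute.
    val suspicious = Pattern
      .begin[Scored]("first").where(_.fraudScore > 0.9)
      .next("second").where(_.fraudScore > 0.9)
      .within(Time.minutes(1))

    val alerts: DataStream[String] =
      CEP.pattern(scored.keyBy(_.userId), suspicious)
        .select(m => s"possible fraud: user ${m("first").head.userId}")

    alerts.print()
    env.execute("score-then-detect")
  }
}
```

The point of the sketch is the composition: the scoring step is just another
operator in the pipeline, so pattern detection, windowing, and so on come for
free - which is exactly what pure model serving systems do not give you.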
On Tue, Mar 14, 2017 at 1:38 PM, Theodore Vasiloudis <
theodoros.vasilou...@gmail.com> wrote:

> Hello all,
>
> ## Executive summary:
>
>    - Offline-on-streaming most popular, then online and model serving.
>    - Need shepherds to lead development/coordination of each task.
>    - I can shepherd online learning, need shepherds for the other two.
>
> So, from the people sharing their opinion it seems most would like to try
> out offline learning with the streaming API. I also think this is an
> interesting option, but probably the riskiest of the bunch.
>
> After that, online learning and model serving seem to have around the same
> amount of interest.
>
> Given that, and the discussions we had in the Gdoc, here's what I
> recommend as next actions:
>
>    - *Offline on streaming:* Start by creating a design document, with an
>    MVP specification of what we imagine such a library to look like and
>    what we think should be possible to do. It should state clear goals and
>    limitations; scoping the amount of work is more important at this point
>    than specific engineering choices.
>    - *Online learning:* If someone would like to work on online learning
>    instead, I can help out there. I have one student working on such a
>    library right now, and I'm sure people at TU Berlin (Felix?) have
>    similar efforts. Ideally we would like to communicate with them. Since
>    this is a much more explored space, we could jump straight into a
>    technical design document (with scoping included, of course) discussing
>    abstractions and comparing with existing frameworks.
>    - *Model serving:* There will be a presentation at Flink Forward SF on
>    such a framework (Flink TensorFlow) by Eron Wright [1]. My
>    recommendation would be to communicate with the author and see if he
>    would be interested in working together to generalize and extend the
>    framework. For more research and resources on the topic see [2] or this
>    presentation [3], particularly the Clipper system.
>
> In order to have some activity on each project, I recommend we set a
> minimum of 2 people willing to contribute to each project.
>
> If we "assign" people by top choice, that should be possible to do,
> although my original plan was to only work on two of the above, to avoid
> fragmentation. But given that online learning will have work being done by
> students as well, it should be possible to keep it running.
>
> Next, *I would like us to assign a "shepherd" for each of these tasks.* If
> you are willing to coordinate the development on one of these options, let
> us know here and you can take up the task of coordinating with the rest of
> the people working on the task.
>
> I would like to volunteer to coordinate the *online learning* effort,
> since I'm already supervising a student working on this, and I'm currently
> developing such algorithms. I plan to contribute to the
> offline-on-streaming task as well, but not coordinate it.
>
> So if someone would like to take the lead on offline on streaming or model
> serving, let us know and we can take it from there.
>
> Regards,
> Theodore
>
> [1] http://sf.flink-forward.org/kb_sessions/introducing-flink-tensorflow/
> [2] https://ucbrise.github.io/cs294-rise-fa16/prediction_serving.html
> [3] https://ucbrise.github.io/cs294-rise-fa16/assets/slides/prediction-serving-systems-cs294-RISE_seminar.pdf
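Since low-latency model serving keeps coming up in this thread, here is one
hedged sketch of the pattern most such frameworks build on: a control stream
of model updates connected to the event stream, with the newest model kept in
the operator and hot-swapped at runtime. ModelUpdate, Event, and the
dot-product scoring are invented placeholders - this is not the
flink-tensorflow API, just a Scala DataStream sketch.

```scala
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// Invented placeholder types: a model's weights and the events it scores.
case class ModelUpdate(weights: Array[Double])
case class Event(features: Array[Double])

// Keep the latest model in the operator and score events against it.
// Note: the model is not checkpointed here; a real implementation would
// keep it in operator state so it survives failures.
class ServingFunction extends CoFlatMapFunction[Event, ModelUpdate, Double] {
  private var model: Option[Array[Double]] = None

  // Event stream: score with the current model, or drop until one arrives.
  override def flatMap1(e: Event, out: Collector[Double]): Unit =
    model.foreach { w =>
      out.collect(w.zip(e.features).map { case (a, b) => a * b }.sum)
    }

  // Model stream: hot-swap the model without restarting the job.
  override def flatMap2(m: ModelUpdate, out: Collector[Double]): Unit =
    model = Some(m.weights)
}

object ModelServingSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val events = env.fromElements(Event(Array(1.0, 2.0)))
    // Broadcast model updates so every parallel instance sees them.
    val models = env.fromElements(ModelUpdate(Array(0.5, 0.5))).broadcast

    events.connect(models).flatMap(new ServingFunction).print()
    env.execute("model-serving-sketch")
  }
}
```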
> On Fri, Mar 10, 2017 at 6:55 PM, Stavros Kontopoulos <
> st.kontopou...@gmail.com> wrote:
>
> > Thanks Theodore,
> >
> > I'd vote for
> >
> > - Offline learning with Streaming API
> > - Low-latency prediction serving
> >
> > Some comments...
> >
> > Online learning:
> >
> > Good to have, but my feeling is that it is not a strong requirement (if
> > a requirement at all) across the industry right now. It may become hot
> > in the future.
> >
> > Offline learning with Streaming API:
> >
> > Although it requires engine changes or extensions (feasibility is an
> > issue here), my understanding is that it reflects common industry
> > practice (train every few minutes at most), and it would be great if
> > that were supported out of the box with a friendly API for the
> > developer.
> >
> > Offline learning with the batch API:
> >
> > I would love to have a limited set of algorithms, so that someone does
> > not have to leave Flink for another tool to work on some initial
> > dataset. In other words, let's reach a mature state with some basic
> > algorithms merged. There is a lot of work pending; let's not waste it.
> >
> > Low-latency prediction serving:
> >
> > Model serving is a long-standing problem; we could definitely help with
> > that.
> >
> > Regards,
> > Stavros
> >
> > On Fri, Mar 10, 2017 at 4:08 PM, Till Rohrmann <trohrm...@apache.org>
> > wrote:
> >
> > > Thanks Theo for steering Flink's ML effort here :-)
> > >
> > > I'd vote to concentrate on
> > >
> > > - Online learning
> > > - Low-latency prediction serving
> > >
> > > because of the following reasons:
> > >
> > > Online learning:
> > >
> > > I agree that this topic is highly researchy and it's not even clear
> > > whether it will ever be of any interest outside of academia. However,
> > > it was the same for other things as well. Adoption in industry is
> > > usually slow, and sometimes one has to dare to explore something new.
> > >
> > > Low-latency prediction serving:
> > >
> > > Flink with its streaming engine seems to be the natural fit for such a
> > > task, and it is rather low-hanging fruit. Furthermore, I think that
> > > users would directly benefit from such a feature.
> > >
> > > Offline learning with Streaming API:
> > >
> > > I'm not fully convinced yet that the streaming API is powerful enough
> > > (mainly due to the lack of proper iteration support and spilling
> > > capabilities) to support a wide range of offline ML algorithms. And
> > > even then, it will only support rather small problem sizes, because
> > > streaming cannot gracefully spill data to disk. There are still too
> > > many open issues with the streaming API for this use case, imo.
> > >
> > > Offline learning with the batch API:
> > >
> > > For offline learning the batch API is imo still better suited than the
> > > streaming API. I think it will only make sense to port the algorithms
> > > to the streaming API once batch and streaming are properly unified.
> > > The highly efficient implementations of joining and sorting that can
> > > go out of memory are alone an important asset for supporting large ML
> > > problems. In general, I think it might make sense to offer a basic set
> > > of ML primitives. However, even offering this basic set is a
> > > considerable amount of work.
> > >
> > > Concerning the independent organization for the development: I think
> > > it would be great if the development could still happen under the
> > > umbrella of Flink's ML library, because otherwise we risk some kind of
> > > fragmentation. In order for people to collaborate, one can also open
> > > PRs against a branch of a forked repo.
> > >
> > > I'm currently wrapping up the project re-organization discussion. The
> > > general position was that it would be best to have an incremental
> > > build and keep everything in the same repo. If this is not possible,
> > > then we want to look into creating a sub-repository for the libraries
> > > (maybe other components will follow later). I hope to make some
> > > progress on this front in the next couple of days/weeks. I'll keep you
> > > updated.
> > >
> > > As a general remark on the discussions in the Google doc:
> > > I think it would be great if we could at least mirror the discussions
> > > happening in the Google doc back to the mailing list, or ideally
> > > conduct the discussions directly on the mailing list. That's at least
> > > what the ASF encourages.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Fri, Mar 10, 2017 at 10:52 AM, Gábor Hermann <m...@gaborhermann.com>
> > > wrote:
> > >
> > > > Hey all,
> > > >
> > > > Sorry for the somewhat late response.
> > > >
> > > > I'd like to work on
> > > > - Offline learning with Streaming API
> > > > - Low-latency prediction serving
> > > >
> > > > I would drop the batch API ML because of past experience with lack
> > > > of support, and online learning because of the lack of use cases.
> > > >
> > > > I completely agree with Kate that offline learning should be
> > > > supported, but given Flink's resources I prefer using the streaming
> > > > API, as Roberto suggested. Also, the full model lifecycle (or
> > > > end-to-end ML) could be more easily supported in one system (one
> > > > API). Connecting Flink Batch with Flink Streaming is currently
> > > > cumbersome (although side inputs [1] might help). In my opinion, a
> > > > crucial part of end-to-end ML is low-latency predictions.
> > > >
> > > > As another direction, we could integrate the Flink Streaming API
> > > > with other projects (such as PredictionIO). However, I believe it's
> > > > better to first evaluate the capabilities and drawbacks of the
> > > > streaming API with some prototype of using Flink Streaming for an ML
> > > > task. Otherwise we could run into critical issues, just as the
> > > > SystemML integration did with e.g. caching. Such issues make the
> > > > integration of the batch API with other ML projects practically
> > > > infeasible.
> > > >
> > > > I've already been experimenting with offline learning on the
> > > > Streaming API. Hopefully I can share some initial performance
> > > > results on matrix factorization next week. Naturally, I've run into
> > > > issues. E.g. I could only mark the end of input with some hacks,
> > > > because an end-of-input signal is not needed for a streaming job
> > > > that consumes input forever. AFAIK, this would be resolved by side
> > > > inputs [1].
> > > >
> > > > @Theodore:
> > > > +1 for doing the prototype project(s) separately from the main Flink
> > > > repository, although I would strongly suggest following the Flink
> > > > development guidelines as closely as possible. As another note,
> > > > there is already a GitHub organization for Flink-related projects
> > > > [2], but it seems it has not been used much.
> > > >
> > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API
> > > > [2] https://github.com/project-flink
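Gábor's end-of-input problem above is the first thing anyone hits when
running an offline algorithm on the DataStream API, so here is one version of
the kind of hack he describes, under stated assumptions: parallelism 1, an
Option-wrapped element type, and a None sentinel marking the end of the
bounded input. Purely illustrative until something like side inputs exists.

```scala
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// A bounded source that appends a None sentinel after its finite data,
// so downstream operators can tell when the input is exhausted.
class BoundedSource(data: Seq[Double]) extends SourceFunction[Option[Double]] {
  override def run(ctx: SourceFunction.SourceContext[Option[Double]]): Unit = {
    data.foreach(x => ctx.collect(Some(x)))
    ctx.collect(None) // the "end of input" marker
  }
  override def cancel(): Unit = ()
}

// Accumulate until the sentinel arrives, then emit the final result once --
// standing in for "emit the trained model when the data set is done".
class FinalizingSum extends RichFlatMapFunction[Option[Double], String] {
  private var sum = 0.0
  override def flatMap(in: Option[Double], out: Collector[String]): Unit =
    in match {
      case Some(v) => sum += v
      case None    => out.collect(s"final result: $sum")
    }
}

object EndOfInputSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1) // one sentinel per partition; keep it simple here

    env.addSource(new BoundedSource(Seq(1.0, 2.0, 3.0)))
      .flatMap(new FinalizingSum)
      .print()

    env.execute("end-of-input-sketch")
  }
}
```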
> > > > On 2017-03-04 08:44, Roberto Bentivoglio wrote:
> > > >
> > > >> Hi All,
> > > >>
> > > >> I'd like to start working on:
> > > >> - Offline learning with Streaming API
> > > >> - Online learning
> > > >>
> > > >> I also think that using a new organisation on GitHub, as Theodore
> > > >> proposed, to keep some initial independence and speed up the
> > > >> prototyping and development phases, is really interesting.
> > > >>
> > > >> I totally agree with Katherin that we need offline learning, but my
> > > >> opinion is that it will be more straightforward to fix the streaming
> > > >> issues than the batch issues, because we will have more support on
> > > >> that from the Flink community.
> > > >>
> > > >> Thanks and have a nice weekend,
> > > >> Roberto
> > > >>
> > > >> On 3 March 2017 at 20:20, amir bahmanyari <amirto...@yahoo.com.invalid>
> > > >> wrote:
> > > >>
> > > >>> Great points to start:
> > > >>> - Online learning
> > > >>> - Offline learning with the streaming API
> > > >>>
> > > >>> Thanks + have a great weekend.
> > > >>>
> > > >>> From: Katherin Eri <katherinm...@gmail.com>
> > > >>> To: dev@flink.apache.org
> > > >>> Sent: Friday, March 3, 2017 7:41 AM
> > > >>> Subject: Re: Machine Learning on Flink - Next steps
> > > >>>
> > > >>> Thank you, Theodore.
> > > >>>
> > > >>> Shortly speaking, I vote for:
> > > >>> 1) Online learning
> > > >>> 2) Low-latency prediction serving -> Offline learning with the
> > > >>> batch API
> > > >>>
> > > >>> In detail:
> > > >>> 1) If streaming is the strong side of Flink, let's use it and try
> > > >>> to support some online learning or lightweight in-memory learning
> > > >>> algorithms, and try to build a pipeline for them.
> > > >>>
> > > >>> 2) I think that Flink should be part of the production ecosystem,
> > > >>> and if production systems now require ML support, multiple-model
> > > >>> deployment and so on, we should serve this. But in my opinion we
> > > >>> shouldn't compete with projects like PredictionIO; we should serve
> > > >>> them, as an execution core. But that means a lot:
> > > >>>
> > > >>> a. Offline training should be supported, because most ML
> > > >>> algorithms are for offline training.
> > > >>> b. The model lifecycle should be supported:
> > > >>> ETL + transformation + training + scoring + exploitation quality
> > > >>> monitoring.
> > > >>>
> > > >>> I understand that the batch world is full of competitors, but for
> > > >>> me that doesn't mean that batch should be ignored. I think that
> > > >>> separate streaming/batch applications cause additional deployment
> > > >>> and exploitation overhead, which one typically tries to avoid.
> > > >>> That means we should attract the community's attention to this
> > > >>> problem, in my opinion.
> > > >>>
> > > >>> On Fri, 3 Mar 2017 at 15:34, Theodore Vasiloudis <
> > > >>> theodoros.vasilou...@gmail.com> wrote:
> > > >>>
> > > >>> Hello all,
> > > >>>
> > > >>> From our previous discussion started by Stavros, we decided to
> > > >>> start a planning document [1] to figure out possible next steps
> > > >>> for ML on Flink.
> > > >>>
> > > >>> Our concerns were mainly ensuring active development while
> > > >>> satisfying the needs of the community.
> > > >>>
> > > >>> We have listed a number of proposals for future work in the
> > > >>> document. In short they are:
> > > >>>
> > > >>> - Offline learning with the batch API
> > > >>> - Online learning
> > > >>> - Offline learning with the streaming API
> > > >>> - Low-latency prediction serving
> > > >>>
> > > >>> I saw there are a number of people willing to work on ML for
> > > >>> Flink, but the truth is that we cannot cover all of these
> > > >>> suggestions without fragmenting the development too much.
> > > >>>
> > > >>> So my recommendation is to pick 2 of these options, create design
> > > >>> documents, and build prototypes for each library. We can then
> > > >>> assess their viability and, together with the community, decide if
> > > >>> we should try to include one (or both) of them in the main Flink
> > > >>> distribution.
> > > >>> So I invite people to express their opinion about which task they
> > > >>> would be willing to contribute to, and hopefully we can settle on
> > > >>> two of these options.
> > > >>>
> > > >>> Once that is done, we can decide how we do the actual work. Since
> > > >>> this is highly experimental, I would suggest we work on
> > > >>> repositories where we have complete control.
> > > >>>
> > > >>> For that purpose I have created an organization [2] on GitHub,
> > > >>> which we can use to create repositories and teams that work on
> > > >>> them in an organized manner. Once enough work has accumulated, we
> > > >>> can start discussing contributing the code to the main
> > > >>> distribution.
> > > >>>
> > > >>> Regards,
> > > >>> Theodore
> > > >>>
> > > >>> [1] https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/
> > > >>> [2] https://github.com/flinkml
> > > >>>
> > > >>> --
> > > >>>
> > > >>> *Yours faithfully,*
> > > >>>
> > > >>> *Kate Eri.*