Re: [DISCUSS] Flink ML roadmap

Katherin Eri Thu, 23 Feb 2017 04:23:51 -0800

Yes, OK.
Let's start a design document and write down there the ideas already
mentioned: the parameter server, Clipper, and others. It would be nice if
we also mapped these approaches to use cases.
We can work on it collaboratively, topic by topic; maybe in the end we will
form a picture that can be agreed on with the committers.
@Gabor, could you please start such a shared doc, as you have already
proposed several ideas?


Thu, 23 Feb 2017, 15:06 Gábor Hermann <m...@gaborhermann.com>:

> I agree that it's better to go in one direction first, but I think
> online and offline with the streaming API can proceed somewhat in parallel
> later. We could set a short-term goal, concentrate initially on one
> direction, and showcase that direction (e.g. in a blog post). But first, we
> should at least list the pros/cons in a design doc. Then we can decide
> which direction to take. Would that be feasible?
>
> On 2017-02-23 12:34, Katherin Eri wrote:
>
> > I'm not sure that this is feasible; doing everything at the same time
> > could mean doing nothing((((
> > I'm just afraid that the words "we will work on streaming, not on batch;
> > we have no committer time for this" mean that yes, we started work on
> > FLINK-1730, but nobody will commit this work in the end, as already
> > happened with this ticket.
> >
> > On 23 Feb 2017 at 14:26, "Gábor Hermann" <m...@gaborhermann.com> wrote:
> >
> >> @Theodore: Great to hear you think the "batch on streaming" approach is
> >> possible! Of course, we need to pay attention to all the pitfalls there,
> >> if we go that way.
> >>
> >> +1 for a design doc!
> >>
> >> I would add that it's possible to make efforts in all three directions
> >> (i.e. batch, online, batch on streaming) at the same time, although it
> >> might be worth concentrating on one. E.g. it would not be so useful to
> >> have the same batch algorithms with both the batch API and the streaming
> >> API. We can decide later.
> >>
> >> The design doc could be partitioned into these 3 directions, and we can
> >> collect the pros/cons there too. What do you think?
> >>
> >> Cheers,
> >> Gabor
> >>
> >>
> >> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
> >>
> >>> Hello all,
> >>>
> >>>
> >>> @Gabor, we have discussed the idea of using the streaming API to write
> >>> all of our ML algorithms with a couple of people offline, and I think it
> >>> might be possible and is generally worth a shot. The approach we would
> >>> take would be close to Vowpal Wabbit, not exactly "online", but rather
> >>> "fast-batch".
> >>>
> >>> There will be problems popping up again, even for very simple algos like
> >>> online linear regression with SGD [1], but hopefully fixing those will be
> >>> more aligned with the priorities of the community.
> >>>
> >>> @Katherin, my understanding is that given the limited resources, there is
> >>> no development effort focused on batch processing right now.
> >>>
> >>> So to summarize, it seems like there are people willing to work on ML on
> >>> Flink, but nobody is sure how to do it.
> >>> There are many directions we could take (batch, online, batch on
> >>> streaming), each with its own merits and downsides.
> >>>
> >>> If you want, we can start a design doc and move the conversation there,
> >>> come up with a roadmap and start implementing.
> >>>
> >>> Regards,
> >>> Theodore
> >>>
> >>> [1]
> >>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Understanding-connected-streams-use-without-timestamps-td10241.html
> >>>
> >>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <m...@gaborhermann.com>
> >>> wrote:
> >>>
> >>>> It's great to see so much activity in this discussion :)
> >>>> I'll try to add my thoughts.
> >>>>
> >>>> I think building a developer community (Till's 2nd point) can be
> >>>> slightly separated from what features we should aim for (1st point) and
> >>>> showcasing (3rd point). Thanks, Till, for bringing up the ideas for
> >>>> restructuring; I'm sure we'll find a way to make the development process
> >>>> more dynamic. I'll try to address the rest here.
> >>>>
> >>>> It's hard to choose a direction between streaming and batch ML. As Theo
> >>>> has indicated, not much online ML is used in production, but Flink
> >>>> concentrates on streaming, so online ML would be a better fit for Flink.
> >>>> However, as most of you argued, there's a definite need for batch ML.
> >>>> But batch ML seems hard to achieve because there are blocking issues
> >>>> with persisting, iteration paths etc. So it's no good either way.
> >>>>
> >>>> I propose a seemingly crazy solution: what if we developed batch
> >>>> algorithms also with the streaming API? The batch API would clearly seem
> >>>> more suitable for ML algorithms, but there are a lot of benefits to this
> >>>> approach too, so it's clearly worth considering. Flink also has the
> >>>> high-level vision of "streaming for everything" that would clearly fit
> >>>> this case. What do you all think about this? Do you think this solution
> >>>> would be feasible? I would be happy to make a more elaborate proposal,
> >>>> but I'll put my main ideas here:
> >>>>
> >>>> 1) Simplifying by using one system
> >>>> It could simplify the work of both the users and the developers. One
> >>>> could execute training once, or could execute it periodically e.g. by
> >>>> using windows. Low-latency serving and training could be done in the
> >>>> same system. We could implement incremental algorithms, without any side
> >>>> inputs for combining online learning (or predictions) with batch
> >>>> learning. Of course, all the logic describing these must be somehow
> >>>> implemented (e.g. synchronizing predictions with training), but it
> >>>> should be easier to do so in one system than by combining e.g. the batch
> >>>> and streaming APIs.
> >>>>
> >>>> 2) Batch ML with the streaming API is not harder
> >>>> Despite these benefits, it could seem harder to implement batch ML with
> >>>> the streaming API, but in my opinion it's not. There are more flexible,
> >>>> lower-level optimization potentials with the streaming API. Most
> >>>> distributed ML algorithms use a lower-level model than the batch API
> >>>> anyway, so sometimes it feels like forcing the algorithm logic into the
> >>>> training API and tweaking it. Although we could not use the batch
> >>>> primitives like join, we would have the E.g. in my experience with
> >>>> implementing a distributed matrix factorization algorithm [1], I
> >>>> couldn't do a simple optimization because of the limitations of the
> >>>> iteration API [2]. Even if we pushed all the development effort to make
> >>>> the batch API more suitable for ML there would be things we couldn't do.
> >>>> E.g. there are approaches for updating a model iteratively without locks
> >>>> [3,4] (i.e. somewhat asynchronously), and I don't see a clear way to
> >>>> implement such algorithms with the batch API.
> >>>>
> >>>> 3) Streaming community (users and devs) benefit
> >>>> The Flink streaming community in general would also benefit from this
> >>>> direction. There are many features needed in the streaming API for ML to
> >>>> work, but this is also true for the batch API. One really important one
> >>>> is the loops API (a.k.a. iterative DataStreams) [5]. There has been a
> >>>> lot of effort (mostly from Paris) for making it mature enough [6]. Kate
> >>>> mentioned using GPUs, and I'm sure they have uses in streaming generally
> >>>> [7]. Thus, by improving the streaming API to allow ML algorithms, the
> >>>> streaming API benefits too (which is important as it has a lot more
> >>>> production users than the batch API).
> >>>>
> >>>> 4) Performance can be at least as good
> >>>> I believe the same performance could be achieved with the streaming API
> >>>> as with the batch API. The streaming API is much closer to the runtime
> >>>> than the batch API. For corner cases with runtime-layer optimizations of
> >>>> the batch API, we could find a way to do the same (or a similar)
> >>>> optimization for the streaming API (see my previous point). One such
> >>>> case could be using managed memory (and spilling to disk). There are
> >>>> also benefits by default, e.g. we would have finer-grained fault
> >>>> tolerance with the streaming API.
> >>>>
> >>>> 5) We could keep batch ML API
> >>>> For the shorter term, we should not throw away all the algorithms
> >>>> implemented with the batch API. By pushing forward the development with
> >>>> side inputs we could make them usable with the streaming API. Then, if
> >>>> the library gains some popularity, we could replace the algorithms in
> >>>> the batch API with streaming ones, to avoid the performance costs of
> >>>> e.g. not being able to persist.
> >>>>
> >>>> 6) General tools for implementing ML algorithms
> >>>> Besides implementing algorithms one by one, we could give more general
> >>>> tools for making it easier to implement algorithms, e.g. a parameter
> >>>> server [8,9]. Theo also mentioned in another thread that TensorFlow has
> >>>> a similar model to Flink streaming; we could look into that too. I think
> >>>> often when deploying a production ML system, much more configuration and
> >>>> tweaking should be done than e.g. Spark MLlib allows. Why not allow
> >>>> that?
> >>>>
> >>>> 7) Showcasing
> >>>> Showcasing this could be easier. We could say that we're doing batch ML
> >>>> with a streaming API. That's interesting on its own. IMHO this
> >>>> integration is also a more approachable way towards end-to-end ML.
> >>>>
> >>>>
> >>>> Thanks for reading so far :)
> >>>>
> >>>> [1] https://github.com/apache/flink/pull/2819
> >>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
> >>>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
> >>>> [4] https://www.usenix.org/system/files/conference/hotos13/hotos13-final77.pdf
> >>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+Scoped+Loops+and+Job+Termination
> >>>> [6] https://github.com/apache/flink/pull/1668
> >>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.pdf
> >>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
> >>>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Using-QueryableState-inside-Flink-jobs-and-Parameter-Server-implementation-td15880.html
> >>>>
> >>>> Cheers,
> >>>> Gabor
> >>>>
> >>>>
>
> --

*Yours faithfully, *

*Kate Eri.*
