+100 for a design doc. Could we also set a roadmap after some time-boxed investigation captured in that document? We need action.
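To seed the design doc with something concrete, below is a very rough sketch of the "train periodically over windows, serve with low latency in the same job" idea that Gábor describes further down (his point 1). It is only an illustration of the shape such a job could take with the plain DataStream API; the Sample/Query/Prediction/Model types, the sources, and fitLinearModel/ServingFunction are placeholders I made up, not existing Flink or FlinkML code.

// Hypothetical sketch only: assumes nothing beyond the plain DataStream API.
// Sample/Query/Prediction/Model, the sources, and fitLinearModel are made-up placeholders.
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

case class Sample(features: Array[Double], label: Double)
case class Query(id: Long, features: Array[Double])
case class Prediction(id: Long, score: Double)
case class Model(weights: Array[Double])

// Keeps the latest model and scores queries against it (the "serving" side of the job).
// Plain instance state, not checkpointed here; a production version would need that.
class ServingFunction extends CoFlatMapFunction[Query, Model, Prediction] {
  private var current: Option[Model] = None

  override def flatMap1(q: Query, out: Collector[Prediction]): Unit =
    current.foreach { m =>
      val score = m.weights.zip(q.features).map { case (w, x) => w * x }.sum
      out.collect(Prediction(q.id, score))
    }

  override def flatMap2(m: Model, out: Collector[Prediction]): Unit =
    current = Some(m)
}

object FastBatchSketch {
  // Placeholder "trainer": a real implementation would run e.g. SGD over the window.
  def fitLinearModel(samples: Iterable[Sample]): Model = {
    val dim = samples.headOption.map(_.features.length).getOrElse(0)
    Model(Array.fill(dim)(0.0))
  }

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Stand-in sources; in practice these would be Kafka/file connectors.
    val samples: DataStream[Sample] = env.fromElements(Sample(Array(1.0, 2.0), 1.0))
    val queries: DataStream[Query]  = env.fromElements(Query(1L, Array(1.0, 2.0)))

    // "Fast-batch" training: re-fit the model over every 10-minute window of samples.
    val models: DataStream[Model] = samples
      .timeWindowAll(Time.minutes(10))
      .apply { (_: TimeWindow, window: Iterable[Sample], out: Collector[Model]) =>
        out.collect(fitLinearModel(window))
      }

    // Low-latency serving in the same job: broadcast each new model to all query tasks.
    val predictions: DataStream[Prediction] =
      queries.connect(models.broadcast).flatMap(new ServingFunction)

    predictions.print()
    env.execute("fast-batch training + serving sketch")
  }
}

Even a strawman like this surfaces the questions the doc should answer: how the serving state gets checkpointed, how window size trades off against model staleness, and whether the model should be broadcast or partitioned.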
Looking forward to working on this (whatever that might be) ;) Also, are there any data supporting one direction or the other from a customer perspective? It would help us make more informed decisions.

On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri <katherinm...@gmail.com> wrote:

> Yes, ok.
> Let's start a design document and write down there the ideas already mentioned: parameter server, Clipper, and others. It would be nice if we also mapped these approaches to use cases.
> We will work on each topic collaboratively; maybe we will finally form some picture that the committers can agree on.
> @Gabor, could you please start such a shared doc, as you have already proposed several ideas?
>
> On Thu, 23 Feb 2017 at 15:06, Gábor Hermann <m...@gaborhermann.com> wrote:
>
>> I agree that it's better to go in one direction first, but I think online and offline ML with the streaming API can proceed somewhat in parallel later. We could set a short-term goal, concentrate initially on one direction, and showcase that direction (e.g. in a blog post). But first, we should at least list the pros/cons in a design doc, then decide which direction to go. Would that be feasible?
>>
>> On 2017-02-23 12:34, Katherin Eri wrote:
>>
>>> I'm not sure that this is feasible; doing everything at the same time could mean doing nothing.
>>> I'm just afraid that the words "we will work on streaming, not on batching; we have no committer time for this" mean that yes, we start work on FLINK-1730, but nobody will commit this work in the end, as already happened with this ticket.
>>>
>>> On 23 Feb 2017 at 14:26, "Gábor Hermann" <m...@gaborhermann.com> wrote:
>>>
>>>> @Theodore: Great to hear you think the "batch on streaming" approach is possible! Of course, we need to pay attention to all the pitfalls there if we go that way.
>>>>
>>>> +1 for a design doc!
>>>>
>>>> I would add that it's possible to make efforts in all three directions (i.e. batch, online, batch on streaming) at the same time, although it might be worth concentrating on one. E.g. it would not be so useful to have the same batch algorithms in both the batch API and the streaming API. We can decide later.
>>>>
>>>> The design doc could be partitioned into these 3 directions, and we can collect the pros/cons there too. What do you think?
>>>>
>>>> Cheers,
>>>> Gabor
>>>>
>>>> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> @Gabor, we have discussed the idea of using the streaming API to write all of our ML algorithms with a couple of people offline, and I think it might be possible and is generally worth a shot. The approach we would take would be close to Vowpal Wabbit: not exactly "online", but rather "fast-batch".
>>>>>
>>>>> There will be problems popping up again, even for very simple algos like online linear regression with SGD [1], but hopefully fixing those will be more aligned with the priorities of the community.
>>>>>
>>>>> @Katherin, my understanding is that, given the limited resources, there is no development effort focused on batch processing right now.
>>>>>
>>>>> So to summarize, it seems like there are people willing to work on ML on Flink, but nobody is sure how to do it.
>>>>> There are many directions we could take (batch, online, batch on streaming), each with its own merits and downsides.
>>>>>
>>>>> If you want, we can start a design doc and move the conversation there, come up with a roadmap, and start implementing.
>>>>>
>>>>> Regards,
>>>>> Theodore
>>>>>
>>>>> [1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Understanding-connected-streams-use-without-timestamps-td10241.html
>>>>>
>>>>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <m...@gaborhermann.com> wrote:
>>>>>
>>>>>> It's great to see so much activity in this discussion :)
>>>>>> I'll try to add my thoughts.
>>>>>>
>>>>>> I think building a developer community (Till's point 2) can be slightly separated from what features we should aim for (point 1) and showcasing (point 3). Thanks Till for bringing up the ideas for restructuring; I'm sure we'll find a way to make the development process more dynamic. I'll try to address the rest here.
>>>>>>
>>>>>> It's hard to choose a direction between streaming and batch ML. As Theo has indicated, not much online ML is used in production, but Flink concentrates on streaming, so online ML would be a better fit for Flink. However, as most of you argued, there's a definite need for batch ML. But batch ML seems hard to achieve because there are blocking issues with persisting, iteration paths, etc. So neither direction is easy.
>>>>>>
>>>>>> I propose a seemingly crazy solution: what if we developed batch algorithms with the streaming API as well? The batch API would clearly seem more suitable for ML algorithms, but there are a lot of benefits to this approach too, so it's clearly worth considering. Flink also has the high-level vision of "streaming for everything", which would clearly fit this case. What do you all think about this? Do you think this solution would be feasible? I would be happy to make a more elaborate proposal, but I'll put my main ideas here:
>>>>>>
>>>>>> 1) Simplifying by using one system
>>>>>> It could simplify the work of both users and developers. One could execute training once, or execute it periodically, e.g. by using windows. Low-latency serving and training could be done in the same system. We could implement incremental algorithms without any side inputs for combining online learning (or predictions) with batch learning. Of course, all the logic describing this must somehow be implemented (e.g. synchronizing predictions with training), but it should be easier to do so in one system than by combining e.g. the batch and streaming APIs.
>>>>>>
>>>>>> 2) Batch ML with the streaming API is not harder
>>>>>> Despite these benefits, it could seem harder to implement batch ML with the streaming API, but in my opinion it's not. There are more flexible, lower-level optimization opportunities with the streaming API. Most distributed ML algorithms use a lower-level model than the batch API anyway, so sometimes it feels like forcing the algorithm logic into the batch API and tweaking it.
>>>>>> Although we could not use batch primitives like join, we would have more low-level flexibility. E.g. in my experience with implementing a distributed matrix factorization algorithm [1], I couldn't do a simple optimization because of the limitations of the iteration API [2]. Even if we pushed all the development effort into making the batch API more suitable for ML, there would be things we couldn't do. E.g. there are approaches for updating a model iteratively without locks [3,4] (i.e. somewhat asynchronously), and I don't see a clear way to implement such algorithms with the batch API.
>>>>>>
>>>>>> 3) Streaming community (users and devs) benefit
>>>>>> The Flink streaming community in general would also benefit from this direction. There are many features needed in the streaming API for ML to work, but this is also true for the batch API. One really important one is the loops API (a.k.a. iterative DataStreams) [5]. There has been a lot of effort (mostly from Paris) to make it mature enough [6]. Kate mentioned using GPUs, and I'm sure they have uses in streaming generally [7]. Thus, by improving the streaming API to allow ML algorithms, the streaming API benefits too (which is important, as it has a lot more production users than the batch API).
>>>>>>
>>>>>> 4) Performance can be at least as good
>>>>>> I believe the same performance could be achieved with the streaming API as with the batch API. The streaming API is much closer to the runtime than the batch API. For corner cases covered by runtime-layer optimizations of the batch API, we could find a way to do the same (or similar) optimization for the streaming API (see my previous point). One such case could be using managed memory (and spilling to disk). There are also benefits by default, e.g. we would have finer-grained fault tolerance with the streaming API.
>>>>>>
>>>>>> 5) We could keep the batch ML API
>>>>>> For the shorter term, we should not throw away all the algorithms implemented with the batch API. By pushing forward the development of side inputs we could make them usable with the streaming API. Then, if the library gains some popularity, we could replace the batch API algorithms with streaming ones, to avoid the performance costs of e.g. not being able to persist.
>>>>>>
>>>>>> 6) General tools for implementing ML algorithms
>>>>>> Besides implementing algorithms one by one, we could provide more general tools that make it easier to implement algorithms, e.g. a parameter server [8,9]. Theo also mentioned in another thread that TensorFlow has a model similar to Flink streaming; we could look into that too. I think that when deploying a production ML system, much more configuration and tweaking is often needed than e.g. Spark MLlib allows. Why not allow that?
>>>>>>
>>>>>> 7) Showcasing
>>>>>> Showcasing this could be easier. We could say that we're doing batch ML with a streaming API. That's interesting in its own right. IMHO this integration is also a more approachable way towards end-to-end ML.
>>>>>>
>>>>>> Thanks for reading so far :)
>>>>>>
>>>>>> [1] https://github.com/apache/flink/pull/2819
>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
>>>>>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
>>>>>> [4] https://www.usenix.org/system/files/conference/hotos13/hotos13-final77.pdf
>>>>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+Scoped+Loops+and+Job+Termination
>>>>>> [6] https://github.com/apache/flink/pull/1668
>>>>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.pdf
>>>>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
>>>>>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Using-QueryableState-inside-Flink-jobs-and-Parameter-Server-implementation-td15880.html
>>>>>>
>>>>>> Cheers,
>>>>>> Gabor
>>
>> --
>> Yours faithfully,
>> Kate Eri.
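P.S. One more strawman, placed below the quoted thread to keep the code out of the way: Gábor's point 6 mentions a parameter server as a general building block. A minimal version of the "server" side could already be expressed as keyed state behind a connected stream, roughly as sketched here. ParamUpdate/ParamPull/ParamValue, the learning rate, and the wiring are all my own assumptions, not an existing implementation, and real push/pull semantics (consistency, staleness bounds) would need proper treatment in the design doc.

// Hypothetical sketch: parameter-server-style sharded model kept in Flink keyed state.
// ParamUpdate/ParamPull/ParamValue and the update rule are made-up placeholders.
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
import org.apache.flink.util.Collector

case class ParamUpdate(blockId: Int, gradient: Array[Double]) // push from a worker
case class ParamPull(blockId: Int, workerId: Int)             // pull request from a worker
case class ParamValue(blockId: Int, workerId: Int, weights: Array[Double])

class ParameterBlockServer(learningRate: Double)
    extends RichCoFlatMapFunction[ParamUpdate, ParamPull, ParamValue] {

  @transient private var block: ValueState[Array[Double]] = _

  override def open(parameters: Configuration): Unit = {
    block = getRuntimeContext.getState(
      new ValueStateDescriptor[Array[Double]]("param-block", classOf[Array[Double]]))
  }

  // Push: apply a (possibly stale, lock-free) gradient to this key's parameter block.
  override def flatMap1(u: ParamUpdate, out: Collector[ParamValue]): Unit = {
    val current = Option(block.value()).getOrElse(Array.fill(u.gradient.length)(0.0))
    val updated = current.zip(u.gradient).map { case (w, g) => w - learningRate * g }
    block.update(updated)
  }

  // Pull: answer a worker's request with the current block, if it exists yet.
  override def flatMap2(p: ParamPull, out: Collector[ParamValue]): Unit =
    Option(block.value()).foreach(w => out.collect(ParamValue(p.blockId, p.workerId, w)))
}

// Wiring (inside some job): key both streams by block id so each parallel
// instance owns a disjoint shard of the model.
// val served: DataStream[ParamValue] =
//   updates.connect(pulls)
//     .keyBy(_.blockId, _.blockId)
//     .flatMap(new ParameterBlockServer(0.01))

The interesting design question for the doc is whether plain keyed state like this is enough, or whether we also need the lower-level loops/iteration work [5,6] and QueryableState [9] that Gábor references above.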