+100 for a design doc. Could we also set a roadmap after some time-boxed investigation captured in that document? We need action.
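To seed the design doc with something concrete, below is a very rough sketch of the "train periodically over windows, serve with low latency in the same job" idea that Gábor describes further down (his point 1). It is only an illustration of the shape such a job could take with the plain DataStream API; the Sample/Query/Prediction/Model types, the sources, and fitLinearModel/ServingFunction are placeholders I made up, not existing Flink or FlinkML code.

// Hypothetical sketch only: assumes nothing beyond the plain DataStream API.
// Sample/Query/Prediction/Model, the sources, and fitLinearModel are made-up placeholders.
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

case class Sample(features: Array[Double], label: Double)
case class Query(id: Long, features: Array[Double])
case class Prediction(id: Long, score: Double)
case class Model(weights: Array[Double])

// Keeps the latest model and scores queries against it (the "serving" side of the job).
// Plain instance state, not checkpointed here; a production version would need that.
class ServingFunction extends CoFlatMapFunction[Query, Model, Prediction] {
  private var current: Option[Model] = None

  override def flatMap1(q: Query, out: Collector[Prediction]): Unit =
    current.foreach { m =>
      val score = m.weights.zip(q.features).map { case (w, x) => w * x }.sum
      out.collect(Prediction(q.id, score))
    }

  override def flatMap2(m: Model, out: Collector[Prediction]): Unit =
    current = Some(m)
}

object FastBatchSketch {
  // Placeholder "trainer": a real implementation would run e.g. SGD over the window.
  def fitLinearModel(samples: Iterable[Sample]): Model = {
    val dim = samples.headOption.map(_.features.length).getOrElse(0)
    Model(Array.fill(dim)(0.0))
  }

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Stand-in sources; in practice these would be Kafka/file connectors.
    val samples: DataStream[Sample] = env.fromElements(Sample(Array(1.0, 2.0), 1.0))
    val queries: DataStream[Query]  = env.fromElements(Query(1L, Array(1.0, 2.0)))

    // "Fast-batch" training: re-fit the model over every 10-minute window of samples.
    val models: DataStream[Model] = samples
      .timeWindowAll(Time.minutes(10))
      .apply { (_: TimeWindow, window: Iterable[Sample], out: Collector[Model]) =>
        out.collect(fitLinearModel(window))
      }

    // Low-latency serving in the same job: broadcast each new model to all query tasks.
    val predictions: DataStream[Prediction] =
      queries.connect(models.broadcast).flatMap(new ServingFunction)

    predictions.print()
    env.execute("fast-batch training + serving sketch")
  }
}

Even a strawman like this surfaces the questions the doc should answer: how the serving state gets checkpointed, how window size trades off against model staleness, and whether the model should be broadcast or partitioned.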
Looking forward to working on this (whatever that might be) ;) Also, are there any data supporting one direction or the other from a customer perspective? It would help us make more informed decisions.

On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri <katherinm...@gmail.com> wrote:

> Yes, ok.
> Let's start a design document and write down there the ideas already mentioned: parameter server, Clipper, and others. It would be nice if we also mapped these approaches to use cases.
> We will work on each topic collaboratively; maybe we will finally form some picture that the committers can agree on.
> @Gabor, could you please start such a shared doc, as you have already proposed several ideas?
>
> On Thu, 23 Feb 2017 at 15:06, Gábor Hermann <m...@gaborhermann.com> wrote:
>
>> I agree that it's better to go in one direction first, but I think online and offline ML with the streaming API can proceed somewhat in parallel later. We could set a short-term goal, concentrate initially on one direction, and showcase that direction (e.g. in a blog post). But first, we should at least list the pros/cons in a design doc, then decide which direction to go. Would that be feasible?
>>
>> On 2017-02-23 12:34, Katherin Eri wrote:
>>
>>> I'm not sure that this is feasible; doing everything at the same time could mean doing nothing.
>>> I'm just afraid that the words "we will work on streaming, not on batching; we have no committer time for this" mean that yes, we start work on FLINK-1730, but nobody will commit this work in the end, as already happened with this ticket.
>>>
>>> On 23 Feb 2017 at 14:26, "Gábor Hermann" <m...@gaborhermann.com> wrote:
>>>
>>>> @Theodore: Great to hear you think the "batch on streaming" approach is possible! Of course, we need to pay attention to all the pitfalls there if we go that way.
>>>>
>>>> +1 for a design doc!
>>>>
>>>> I would add that it's possible to make efforts in all three directions (i.e. batch, online, batch on streaming) at the same time, although it might be worth concentrating on one. E.g. it would not be so useful to have the same batch algorithms in both the batch API and the streaming API. We can decide later.
>>>>
>>>> The design doc could be partitioned into these 3 directions, and we can collect the pros/cons there too. What do you think?
>>>>
>>>> Cheers,
>>>> Gabor
>>>>
>>>> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> @Gabor, we have discussed the idea of using the streaming API to write all of our ML algorithms with a couple of people offline, and I think it might be possible and is generally worth a shot. The approach we would take would be close to Vowpal Wabbit: not exactly "online", but rather "fast-batch".
>>>>>
>>>>> There will be problems popping up again, even for very simple algos like online linear regression with SGD [1], but hopefully fixing those will be more aligned with the priorities of the community.
>>>>>
>>>>> @Katherin, my understanding is that, given the limited resources, there is no development effort focused on batch processing right now.
>>>>>
>>>>> So to summarize, it seems like there are people willing to work on ML on Flink, but nobody is sure how to do it.
>>>>> There are many directions we could take (batch, online, batch on streaming), each with its own merits and downsides.
>>>>>
>>>>> If you want, we can start a design doc and move the conversation there, come up with a roadmap, and start implementing.
>>>>>
>>>>> Regards,
>>>>> Theodore
>>>>>
>>>>> [1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Understanding-connected-streams-use-without-timestamps-td10241.html
>>>>>
>>>>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <m...@gaborhermann.com> wrote:
>>>>>
>>>>>> It's great to see so much activity in this discussion :)
>>>>>> I'll try to add my thoughts.
>>>>>>
>>>>>> I think building a developer community (Till's point 2) can be slightly separated from what features we should aim for (point 1) and showcasing (point 3). Thanks Till for bringing up the ideas for restructuring; I'm sure we'll find a way to make the development process more dynamic. I'll try to address the rest here.
>>>>>>
>>>>>> It's hard to choose a direction between streaming and batch ML. As Theo has indicated, not much online ML is used in production, but Flink concentrates on streaming, so online ML would be a better fit for Flink. However, as most of you argued, there's a definite need for batch ML. But batch ML seems hard to achieve because there are blocking issues with persisting, iteration paths, etc. So neither direction is easy.
>>>>>>
>>>>>> I propose a seemingly crazy solution: what if we developed batch algorithms with the streaming API as well? The batch API would clearly seem more suitable for ML algorithms, but there are a lot of benefits to this approach too, so it's clearly worth considering. Flink also has the high-level vision of "streaming for everything", which would clearly fit this case. What do you all think about this? Do you think this solution would be feasible? I would be happy to make a more elaborate proposal, but I'll put my main ideas here:
>>>>>>
>>>>>> 1) Simplifying by using one system
>>>>>> It could simplify the work of both users and developers. One could execute training once, or execute it periodically, e.g. by using windows. Low-latency serving and training could be done in the same system. We could implement incremental algorithms without any side inputs for combining online learning (or predictions) with batch learning. Of course, all the logic describing this must somehow be implemented (e.g. synchronizing predictions with training), but it should be easier to do so in one system than by combining e.g. the batch and streaming APIs.
>>>>>>
>>>>>> 2) Batch ML with the streaming API is not harder
>>>>>> Despite these benefits, it could seem harder to implement batch ML with the streaming API, but in my opinion it's not. There are more flexible, lower-level optimization opportunities with the streaming API. Most distributed ML algorithms use a lower-level model than the batch API anyway, so sometimes it feels like forcing the algorithm logic into the batch API and tweaking it.
>>>>>> Although we could not use batch primitives like join, we would have more low-level flexibility. E.g. in my experience with implementing a distributed matrix factorization algorithm [1], I couldn't do a simple optimization because of the limitations of the iteration API [2]. Even if we pushed all the development effort into making the batch API more suitable for ML, there would be things we couldn't do. E.g. there are approaches for updating a model iteratively without locks [3,4] (i.e. somewhat asynchronously), and I don't see a clear way to implement such algorithms with the batch API.
>>>>>>
>>>>>> 3) Streaming community (users and devs) benefit
>>>>>> The Flink streaming community in general would also benefit from this direction. There are many features needed in the streaming API for ML to work, but this is also true for the batch API. One really important one is the loops API (a.k.a. iterative DataStreams) [5]. There has been a lot of effort (mostly from Paris) to make it mature enough [6]. Kate mentioned using GPUs, and I'm sure they have uses in streaming generally [7]. Thus, by improving the streaming API to allow ML algorithms, the streaming API benefits too (which is important, as it has a lot more production users than the batch API).
>>>>>>
>>>>>> 4) Performance can be at least as good
>>>>>> I believe the same performance could be achieved with the streaming API as with the batch API. The streaming API is much closer to the runtime than the batch API. For corner cases covered by runtime-layer optimizations of the batch API, we could find a way to do the same (or similar) optimization for the streaming API (see my previous point). One such case could be using managed memory (and spilling to disk). There are also benefits by default, e.g. we would have finer-grained fault tolerance with the streaming API.
>>>>>>
>>>>>> 5) We could keep the batch ML API
>>>>>> For the shorter term, we should not throw away all the algorithms implemented with the batch API. By pushing forward the development of side inputs we could make them usable with the streaming API. Then, if the library gains some popularity, we could replace the batch API algorithms with streaming ones, to avoid the performance costs of e.g. not being able to persist.
>>>>>>
>>>>>> 6) General tools for implementing ML algorithms
>>>>>> Besides implementing algorithms one by one, we could provide more general tools that make it easier to implement algorithms, e.g. a parameter server [8,9]. Theo also mentioned in another thread that TensorFlow has a model similar to Flink streaming; we could look into that too. I think that when deploying a production ML system, much more configuration and tweaking is often needed than e.g. Spark MLlib allows. Why not allow that?
>>>>>>
>>>>>> 7) Showcasing
>>>>>> Showcasing this could be easier. We could say that we're doing batch ML with a streaming API. That's interesting in its own right. IMHO this integration is also a more approachable way towards end-to-end ML.
>>>>>>
>>>>>> Thanks for reading so far :)
>>>>>>
>>>>>> [1] https://github.com/apache/flink/pull/2819
>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
>>>>>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
>>>>>> [4] https://www.usenix.org/system/files/conference/hotos13/hotos13-final77.pdf
>>>>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+Scoped+Loops+and+Job+Termination
>>>>>> [6] https://github.com/apache/flink/pull/1668
>>>>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.pdf
>>>>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
>>>>>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Using-QueryableState-inside-Flink-jobs-and-Parameter-Server-implementation-td15880.html
>>>>>>
>>>>>> Cheers,
>>>>>> Gabor
>>
>> --
>> Yours faithfully,
>> Kate Eri.
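P.S. One more strawman, placed below the quoted thread to keep the code out of the way: Gábor's point 6 mentions a parameter server as a general building block. A minimal version of the "server" side could already be expressed as keyed state behind a connected stream, roughly as sketched here. ParamUpdate/ParamPull/ParamValue, the learning rate, and the wiring are all my own assumptions, not an existing implementation, and real push/pull semantics (consistency, staleness bounds) would need proper treatment in the design doc.

// Hypothetical sketch: parameter-server-style sharded model kept in Flink keyed state.
// ParamUpdate/ParamPull/ParamValue and the update rule are made-up placeholders.
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
import org.apache.flink.util.Collector

case class ParamUpdate(blockId: Int, gradient: Array[Double]) // push from a worker
case class ParamPull(blockId: Int, workerId: Int)             // pull request from a worker
case class ParamValue(blockId: Int, workerId: Int, weights: Array[Double])

class ParameterBlockServer(learningRate: Double)
    extends RichCoFlatMapFunction[ParamUpdate, ParamPull, ParamValue] {

  @transient private var block: ValueState[Array[Double]] = _

  override def open(parameters: Configuration): Unit = {
    block = getRuntimeContext.getState(
      new ValueStateDescriptor[Array[Double]]("param-block", classOf[Array[Double]]))
  }

  // Push: apply a (possibly stale, lock-free) gradient to this key's parameter block.
  override def flatMap1(u: ParamUpdate, out: Collector[ParamValue]): Unit = {
    val current = Option(block.value()).getOrElse(Array.fill(u.gradient.length)(0.0))
    val updated = current.zip(u.gradient).map { case (w, g) => w - learningRate * g }
    block.update(updated)
  }

  // Pull: answer a worker's request with the current block, if it exists yet.
  override def flatMap2(p: ParamPull, out: Collector[ParamValue]): Unit =
    Option(block.value()).foreach(w => out.collect(ParamValue(p.blockId, p.workerId, w)))
}

// Wiring (inside some job): key both streams by block id so each parallel
// instance owns a disjoint shard of the model.
// val served: DataStream[ParamValue] =
//   updates.connect(pulls)
//     .keyBy(_.blockId, _.blockId)
//     .flatMap(new ParameterBlockServer(0.01))

The interesting design question for the doc is whether plain keyed state like this is enough, or whether we also need the lower-level loops/iteration work [5,6] and QueryableState [9] that Gábor references above.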