Thanks Theo. Just wrote some comments on the other thread, but it looks like you got it covered already.
Let me re-post what I think may help as input:

*Concerning Model Evaluation / Serving*

  - My personal take is that "model evaluation" over streams will happen in
    any case - there is genuine interest in it, and various users have built
    it themselves already. It would be a cool way to do something that has a
    very high chance of being productionized by users soon.
  - Model evaluation as one step of a streaming pipeline (classifying
    events), followed by CEP (pattern detection) or anomaly detection, is a
    valuable use case on top of what pure model serving systems usually do
    (see the sketch after this message).
  - A question I do not yet have a good intuition on is whether "model
    evaluation" and the training part are so different (once a good
    abstraction for model evaluation has been built) that little
    cross-coordination is needed, or whether there is potential in
    integrating them.

*Thoughts on the ML training library (DataSet API or DataStream API)*

  - I honestly don't quite understand what the big difference will be in
    targeting the batch or streaming API. You can use the DataSet API in a
    quite low-level fashion (missing async iterations).
  - There seems to be a big trend towards deep learning right now (is it just
    temporary, or is this the future?), and in that space little works
    without GPU acceleration.
  - It is always easier to do something new than to be the n-th version of
    something existing (sorry for the generic truism). The latter admittedly
    gives the "all-in-one integrated framework" advantage (which can be a
    very strong argument indeed), but the former attracts completely new
    communities and can often make more impact with less effort.
  - The "new" is not required to be "online learning", where Theo has
    described some concerns well. It can also be traditional ML re-imagined
    for "continuous applications", as "continuous / incremental re-training"
    or so. Even on the model evaluation side there is a lot of interesting
    stuff, as mentioned already: ensembles, multi-armed bandits, ...
  - It may well be worth tapping into the work of an existing library (like
    TensorFlow) for an easy fix to some hard problems (pre-existing hardware
    integration, pre-existing optimized linear algebra solvers, etc.) and
    thinking about how such use cases would look in the context of typical
    Flink applications.

*A bit of engine background information that may help in the planning:*

  - The DataStream API will in the future also support bounded data
    computations explicitly (I say this not as a fact, but as a strong
    believer that this is the right direction).
  - The batch runtime has seen less focus recently, but seems to be getting
    more community attention, because some organizations that contribute a
    lot want to use the batch side as well. For example, the effort on
    fine-grained recovery will already strengthen batch a lot.

Stephan
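To make the second bullet above concrete - score each event with a
pre-trained model, then run CEP over the scores - here is a minimal sketch
against the Scala DataStream and CEP APIs (roughly Flink 1.3-era, with
flink-cep-scala on the classpath). The Transaction and Scored types and the
score() function are invented placeholders; a real job would load a trained
model instead of hard-coding a threshold.

```scala
import org.apache.flink.cep.scala.CEP
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

// Hypothetical event types -- stand-ins, not part of any existing Flink API.
case class Transaction(userId: Long, amount: Double)
case class Scored(userId: Long, amount: Double, fraudScore: Double)

object ScoreThenDetect {

  // Placeholder for a real trained model.
  def score(t: Transaction): Double = if (t.amount > 10000) 0.95 else 0.1

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val transactions: DataStream[Transaction] = env.fromElements(
      Transaction(1L, 12000.0), Transaction(1L, 15000.0), Transaction(2L, 40.0))

    // Step 1: "model evaluation" -- score every event with the model.
    val scored: DataStream[Scored] =
      transactions.map(t => Scored(t.userId, t.amount, score(t)))

    // Step 2: CEP on top of the scored stream -- e.g. two highly
    // suspicious transactions from the same user within one minute.
    val suspicious = Pattern
      .begin[Scored]("first").where(_.fraudScore > 0.9)
      .next("second").where(_.fraudScore > 0.9)
      .within(Time.minutes(1))

    val alerts: DataStream[String] =
      CEP.pattern(scored.keyBy(_.userId), suspicious)
        .select(m => s"possible fraud: user ${m("first").head.userId}")

    alerts.print()
    env.execute("score-then-detect")
  }
}
```

The point of the sketch is the composition: the scoring step is just another
operator in the pipeline, so pattern detection, windowing, and so on come for
free - which is exactly what pure model serving systems do not give you.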
On Tue, Mar 14, 2017 at 1:38 PM, Theodore Vasiloudis <
theodoros.vasilou...@gmail.com> wrote:

> Hello all,
>
> ## Executive summary:
>
>    - Offline-on-streaming most popular, then online and model serving.
>    - Need shepherds to lead development/coordination of each task.
>    - I can shepherd online learning, need shepherds for the other two.
>
> So, from the people sharing their opinion it seems most would like to try
> out offline learning with the streaming API. I also think this is an
> interesting option, but probably the riskiest of the bunch.
>
> After that, online learning and model serving seem to have around the same
> amount of interest.
>
> Given that, and the discussions we had in the Gdoc, here's what I
> recommend as next actions:
>
>    - *Offline on streaming:* Start by creating a design document, with an
>    MVP specification of what we imagine such a library to look like and
>    what we think should be possible to do. It should state clear goals and
>    limitations; scoping the amount of work is more important at this point
>    than specific engineering choices.
>    - *Online learning:* If someone would like to work on online learning
>    instead, I can help out there. I have one student working on such a
>    library right now, and I'm sure people at TU Berlin (Felix?) have
>    similar efforts. Ideally we would like to communicate with them. Since
>    this is a much more explored space, we could jump straight into a
>    technical design document (with scoping included, of course) discussing
>    abstractions and comparing with existing frameworks.
>    - *Model serving:* There will be a presentation at Flink Forward SF on
>    such a framework (Flink TensorFlow) by Eron Wright [1]. My
>    recommendation would be to communicate with the author and see if he
>    would be interested in working together to generalize and extend the
>    framework. For more research and resources on the topic see [2] or this
>    presentation [3], particularly the Clipper system.
>
> In order to have some activity on each project, I recommend we set a
> minimum of 2 people willing to contribute to each project.
>
> If we "assign" people by top choice, that should be possible to do,
> although my original plan was to only work on two of the above, to avoid
> fragmentation. But given that online learning will have work being done by
> students as well, it should be possible to keep it running.
>
> Next, *I would like us to assign a "shepherd" for each of these tasks.* If
> you are willing to coordinate the development on one of these options, let
> us know here and you can take up the task of coordinating with the rest of
> the people working on the task.
>
> I would like to volunteer to coordinate the *online learning* effort,
> since I'm already supervising a student working on this, and I'm currently
> developing such algorithms. I plan to contribute to the
> offline-on-streaming task as well, but not coordinate it.
>
> So if someone would like to take the lead on offline on streaming or model
> serving, let us know and we can take it from there.
>
> Regards,
> Theodore
>
> [1] http://sf.flink-forward.org/kb_sessions/introducing-flink-tensorflow/
> [2] https://ucbrise.github.io/cs294-rise-fa16/prediction_serving.html
> [3] https://ucbrise.github.io/cs294-rise-fa16/assets/slides/prediction-serving-systems-cs294-RISE_seminar.pdf
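Since low-latency model serving keeps coming up in this thread, here is one
hedged sketch of the pattern most such frameworks build on: a control stream
of model updates connected to the event stream, with the newest model kept in
the operator and hot-swapped at runtime. ModelUpdate, Event, and the
dot-product scoring are invented placeholders - this is not the
flink-tensorflow API, just a Scala DataStream sketch.

```scala
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// Invented placeholder types: a model's weights and the events it scores.
case class ModelUpdate(weights: Array[Double])
case class Event(features: Array[Double])

// Keep the latest model in the operator and score events against it.
// Note: the model is not checkpointed here; a real implementation would
// keep it in operator state so it survives failures.
class ServingFunction extends CoFlatMapFunction[Event, ModelUpdate, Double] {
  private var model: Option[Array[Double]] = None

  // Event stream: score with the current model, or drop until one arrives.
  override def flatMap1(e: Event, out: Collector[Double]): Unit =
    model.foreach { w =>
      out.collect(w.zip(e.features).map { case (a, b) => a * b }.sum)
    }

  // Model stream: hot-swap the model without restarting the job.
  override def flatMap2(m: ModelUpdate, out: Collector[Double]): Unit =
    model = Some(m.weights)
}

object ModelServingSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val events = env.fromElements(Event(Array(1.0, 2.0)))
    // Broadcast model updates so every parallel instance sees them.
    val models = env.fromElements(ModelUpdate(Array(0.5, 0.5))).broadcast

    events.connect(models).flatMap(new ServingFunction).print()
    env.execute("model-serving-sketch")
  }
}
```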
> On Fri, Mar 10, 2017 at 6:55 PM, Stavros Kontopoulos <
> st.kontopou...@gmail.com> wrote:
>
> > Thanks Theodore,
> >
> > I'd vote for
> >
> > - Offline learning with Streaming API
> > - Low-latency prediction serving
> >
> > Some comments...
> >
> > Online learning:
> >
> > Good to have, but my feeling is that it is not a strong requirement (if
> > a requirement at all) across the industry right now. It may become hot
> > in the future.
> >
> > Offline learning with Streaming API:
> >
> > Although it requires engine changes or extensions (feasibility is an
> > issue here), my understanding is that it reflects common industry
> > practice (train every few minutes at most), and it would be great if
> > that were supported out of the box with a friendly API for the
> > developer.
> >
> > Offline learning with the batch API:
> >
> > I would love to have a limited set of algorithms, so that someone does
> > not have to leave Flink for another tool to work on some initial
> > dataset. In other words, let's reach a mature state with some basic
> > algorithms merged. There is a lot of work pending; let's not waste it.
> >
> > Low-latency prediction serving:
> >
> > Model serving is a long-standing problem; we could definitely help with
> > that.
> >
> > Regards,
> > Stavros
> >
> > On Fri, Mar 10, 2017 at 4:08 PM, Till Rohrmann <trohrm...@apache.org>
> > wrote:
> >
> > > Thanks Theo for steering Flink's ML effort here :-)
> > >
> > > I'd vote to concentrate on
> > >
> > > - Online learning
> > > - Low-latency prediction serving
> > >
> > > because of the following reasons:
> > >
> > > Online learning:
> > >
> > > I agree that this topic is highly researchy and it's not even clear
> > > whether it will ever be of any interest outside of academia. However,
> > > it was the same for other things as well. Adoption in industry is
> > > usually slow, and sometimes one has to dare to explore something new.
> > >
> > > Low-latency prediction serving:
> > >
> > > Flink with its streaming engine seems to be the natural fit for such a
> > > task, and it is rather low-hanging fruit. Furthermore, I think that
> > > users would directly benefit from such a feature.
> > >
> > > Offline learning with Streaming API:
> > >
> > > I'm not fully convinced yet that the streaming API is powerful enough
> > > (mainly due to the lack of proper iteration support and spilling
> > > capabilities) to support a wide range of offline ML algorithms. And
> > > even then, it will only support rather small problem sizes, because
> > > streaming cannot gracefully spill data to disk. There are still too
> > > many open issues with the streaming API for this use case, imo.
> > >
> > > Offline learning with the batch API:
> > >
> > > For offline learning the batch API is imo still better suited than the
> > > streaming API. I think it will only make sense to port the algorithms
> > > to the streaming API once batch and streaming are properly unified.
> > > The highly efficient implementations of joining and sorting that can
> > > go out of memory are alone an important asset for supporting large ML
> > > problems. In general, I think it might make sense to offer a basic set
> > > of ML primitives. However, even offering this basic set is a
> > > considerable amount of work.
> > >
> > > Concerning the independent organization for the development: I think
> > > it would be great if the development could still happen under the
> > > umbrella of Flink's ML library, because otherwise we risk some kind of
> > > fragmentation. In order for people to collaborate, one can also open
> > > PRs against a branch of a forked repo.
> > >
> > > I'm currently wrapping up the project re-organization discussion. The
> > > general position was that it would be best to have an incremental
> > > build and keep everything in the same repo. If this is not possible,
> > > then we want to look into creating a sub-repository for the libraries
> > > (maybe other components will follow later). I hope to make some
> > > progress on this front in the next couple of days/weeks. I'll keep you
> > > updated.
> > >
> > > As a general remark on the discussions in the Google doc:
> > > I think it would be great if we could at least mirror the discussions
> > > happening in the Google doc back to the mailing list, or ideally
> > > conduct the discussions directly on the mailing list. That's at least
> > > what the ASF encourages.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Fri, Mar 10, 2017 at 10:52 AM, Gábor Hermann <m...@gaborhermann.com>
> > > wrote:
> > >
> > > > Hey all,
> > > >
> > > > Sorry for the somewhat late response.
> > > >
> > > > I'd like to work on
> > > > - Offline learning with Streaming API
> > > > - Low-latency prediction serving
> > > >
> > > > I would drop the batch API ML because of past experience with lack
> > > > of support, and online learning because of the lack of use cases.
> > > >
> > > > I completely agree with Kate that offline learning should be
> > > > supported, but given Flink's resources I prefer using the streaming
> > > > API, as Roberto suggested. Also, the full model lifecycle (or
> > > > end-to-end ML) could be more easily supported in one system (one
> > > > API). Connecting Flink Batch with Flink Streaming is currently
> > > > cumbersome (although side inputs [1] might help). In my opinion, a
> > > > crucial part of end-to-end ML is low-latency predictions.
> > > >
> > > > As another direction, we could integrate the Flink Streaming API
> > > > with other projects (such as PredictionIO). However, I believe it's
> > > > better to first evaluate the capabilities and drawbacks of the
> > > > streaming API with some prototype of using Flink Streaming for an ML
> > > > task. Otherwise we could run into critical issues, just as the
> > > > SystemML integration did with e.g. caching. Such issues make the
> > > > integration of the batch API with other ML projects practically
> > > > infeasible.
> > > >
> > > > I've already been experimenting with offline learning on the
> > > > Streaming API. Hopefully I can share some initial performance
> > > > results on matrix factorization next week. Naturally, I've run into
> > > > issues. E.g. I could only mark the end of input with some hacks,
> > > > because an end-of-input signal is not needed for a streaming job
> > > > that consumes input forever. AFAIK, this would be resolved by side
> > > > inputs [1].
> > > >
> > > > @Theodore:
> > > > +1 for doing the prototype project(s) separately from the main Flink
> > > > repository, although I would strongly suggest following the Flink
> > > > development guidelines as closely as possible. As another note,
> > > > there is already a GitHub organization for Flink-related projects
> > > > [2], but it seems it has not been used much.
> > > >
> > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API
> > > > [2] https://github.com/project-flink
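Gábor's end-of-input problem above is the first thing anyone hits when
running an offline algorithm on the DataStream API, so here is one version of
the kind of hack he describes, under stated assumptions: parallelism 1, an
Option-wrapped element type, and a None sentinel marking the end of the
bounded input. Purely illustrative until something like side inputs exists.

```scala
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// A bounded source that appends a None sentinel after its finite data,
// so downstream operators can tell when the input is exhausted.
class BoundedSource(data: Seq[Double]) extends SourceFunction[Option[Double]] {
  override def run(ctx: SourceFunction.SourceContext[Option[Double]]): Unit = {
    data.foreach(x => ctx.collect(Some(x)))
    ctx.collect(None) // the "end of input" marker
  }
  override def cancel(): Unit = ()
}

// Accumulate until the sentinel arrives, then emit the final result once --
// standing in for "emit the trained model when the data set is done".
class FinalizingSum extends RichFlatMapFunction[Option[Double], String] {
  private var sum = 0.0
  override def flatMap(in: Option[Double], out: Collector[String]): Unit =
    in match {
      case Some(v) => sum += v
      case None    => out.collect(s"final result: $sum")
    }
}

object EndOfInputSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1) // one sentinel per partition; keep it simple here

    env.addSource(new BoundedSource(Seq(1.0, 2.0, 3.0)))
      .flatMap(new FinalizingSum)
      .print()

    env.execute("end-of-input-sketch")
  }
}
```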
> > > > On 2017-03-04 08:44, Roberto Bentivoglio wrote:
> > > >
> > > >> Hi All,
> > > >>
> > > >> I'd like to start working on:
> > > >> - Offline learning with Streaming API
> > > >> - Online learning
> > > >>
> > > >> I also think that using a new organisation on GitHub, as Theodore
> > > >> proposed, to keep some initial independence and speed up the
> > > >> prototyping and development phases, is really interesting.
> > > >>
> > > >> I totally agree with Katherin that we need offline learning, but my
> > > >> opinion is that it will be more straightforward to fix the streaming
> > > >> issues than the batch issues, because we will have more support on
> > > >> that from the Flink community.
> > > >>
> > > >> Thanks and have a nice weekend,
> > > >> Roberto
> > > >>
> > > >> On 3 March 2017 at 20:20, amir bahmanyari <amirto...@yahoo.com.invalid>
> > > >> wrote:
> > > >>
> > > >>> Great points to start:
> > > >>> - Online learning
> > > >>> - Offline learning with the streaming API
> > > >>>
> > > >>> Thanks + have a great weekend.
> > > >>>
> > > >>> From: Katherin Eri <katherinm...@gmail.com>
> > > >>> To: dev@flink.apache.org
> > > >>> Sent: Friday, March 3, 2017 7:41 AM
> > > >>> Subject: Re: Machine Learning on Flink - Next steps
> > > >>>
> > > >>> Thank you, Theodore.
> > > >>>
> > > >>> Shortly speaking, I vote for:
> > > >>> 1) Online learning
> > > >>> 2) Low-latency prediction serving -> Offline learning with the
> > > >>> batch API
> > > >>>
> > > >>> In detail:
> > > >>> 1) If streaming is the strong side of Flink, let's use it and try
> > > >>> to support some online learning or lightweight in-memory learning
> > > >>> algorithms, and try to build a pipeline for them.
> > > >>>
> > > >>> 2) I think that Flink should be part of the production ecosystem,
> > > >>> and if production systems now require ML support, multiple-model
> > > >>> deployment and so on, we should serve this. But in my opinion we
> > > >>> shouldn't compete with projects like PredictionIO; we should serve
> > > >>> them, as an execution core. But that means a lot:
> > > >>>
> > > >>> a. Offline training should be supported, because most ML
> > > >>> algorithms are for offline training.
> > > >>> b. The model lifecycle should be supported:
> > > >>> ETL + transformation + training + scoring + exploitation quality
> > > >>> monitoring.
> > > >>>
> > > >>> I understand that the batch world is full of competitors, but for
> > > >>> me that doesn't mean that batch should be ignored. I think that
> > > >>> separate streaming/batch applications cause additional deployment
> > > >>> and exploitation overhead, which one typically tries to avoid.
> > > >>> That means we should attract the community's attention to this
> > > >>> problem, in my opinion.
> > > >>>
> > > >>> On Fri, 3 Mar 2017 at 15:34, Theodore Vasiloudis <
> > > >>> theodoros.vasilou...@gmail.com> wrote:
> > > >>>
> > > >>> Hello all,
> > > >>>
> > > >>> From our previous discussion started by Stavros, we decided to
> > > >>> start a planning document [1] to figure out possible next steps
> > > >>> for ML on Flink.
> > > >>>
> > > >>> Our concerns were mainly ensuring active development while
> > > >>> satisfying the needs of the community.
> > > >>>
> > > >>> We have listed a number of proposals for future work in the
> > > >>> document. In short they are:
> > > >>>
> > > >>> - Offline learning with the batch API
> > > >>> - Online learning
> > > >>> - Offline learning with the streaming API
> > > >>> - Low-latency prediction serving
> > > >>>
> > > >>> I saw there are a number of people willing to work on ML for
> > > >>> Flink, but the truth is that we cannot cover all of these
> > > >>> suggestions without fragmenting the development too much.
> > > >>>
> > > >>> So my recommendation is to pick 2 of these options, create design
> > > >>> documents, and build prototypes for each library. We can then
> > > >>> assess their viability and, together with the community, decide if
> > > >>> we should try to include one (or both) of them in the main Flink
> > > >>> distribution.
> > > >>> So I invite people to express their opinion about which task they
> > > >>> would be willing to contribute to, and hopefully we can settle on
> > > >>> two of these options.
> > > >>>
> > > >>> Once that is done, we can decide how we do the actual work. Since
> > > >>> this is highly experimental, I would suggest we work on
> > > >>> repositories where we have complete control.
> > > >>>
> > > >>> For that purpose I have created an organization [2] on GitHub,
> > > >>> which we can use to create repositories and teams that work on
> > > >>> them in an organized manner. Once enough work has accumulated, we
> > > >>> can start discussing contributing the code to the main
> > > >>> distribution.
> > > >>>
> > > >>> Regards,
> > > >>> Theodore
> > > >>>
> > > >>> [1] https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/
> > > >>> [2] https://github.com/flinkml
> > > >>>
> > > >>> --
> > > >>>
> > > >>> *Yours faithfully,*
> > > >>>
> > > >>> *Kate Eri.*