Hello all,

## Executive summary:

- Offline-on-streaming is the most popular option, then online learning and
  model serving.
- We need shepherds to lead the development/coordination of each task.
- I can shepherd online learning; we need shepherds for the other two.

So, from the people sharing their opinion, it seems most would like to try
out offline learning with the streaming API. I also think this is an
interesting option, but probably the riskiest of the bunch. After that,
online learning and model serving seem to have around the same amount of
interest.

Given that, and the discussions we had in the Gdoc, here's what I recommend
as next actions:

- *Offline on streaming:* Start by creating a design document with an MVP
  specification of what we imagine such a library to look like and what we
  think should be possible to do. It should state clear goals and
  limitations; scoping the amount of work is more important at this point
  than specific engineering choices.

- *Online learning:* If someone would like to work on online learning
  instead, I can help out there: I have one student working on such a
  library right now, and I'm sure people at TU Berlin (Felix?) have similar
  efforts. Ideally we would like to communicate with them. Since this is a
  much better explored space, we could jump straight into a technical design
  document (with scoping included, of course) discussing abstractions and
  comparing with existing frameworks.

- *Model serving:* There will be a presentation at Flink Forward SF on such
  a framework (Flink Tensorflow) by Eron Wright [1]. My recommendation would
  be to contact the author and see if he would be interested in working
  together to generalize and extend the framework. For more research and
  resources on the topic see [2] or this presentation [3], particularly the
  Clipper system.

In order to have some activity on each project, I recommend we set a minimum
of two people willing to contribute to each one. If we "assign" people by
top choice, that should be possible, although my original plan was to work
on only two of the above, to avoid fragmentation. But given that online
learning will also have work being done by students, it should be possible
to keep it running.

Next, *I would like us to assign a "shepherd" for each of these tasks.* If
you are willing to coordinate the development of one of these options, let
us know here and you can take up the task of coordinating with the rest of
the people working on it.

I would like to volunteer to coordinate the *Online learning* effort, since
I'm already supervising a student working on this and am currently
developing such algorithms myself. I plan to contribute to the
offline-on-streaming task as well, but not to coordinate it. So if someone
would like to take the lead on Offline on streaming or Model serving, let us
know and we can take it from there.

Regards,
Theodore

[1] http://sf.flink-forward.org/kb_sessions/introducing-flink-tensorflow/
[2] https://ucbrise.github.io/cs294-rise-fa16/prediction_serving.html
[3] https://ucbrise.github.io/cs294-rise-fa16/assets/slides/prediction-serving-systems-cs294-RISE_seminar.pdf

On Fri, Mar 10, 2017 at 6:55 PM, Stavros Kontopoulos
<st.kontopou...@gmail.com> wrote:

> Thanks Theodore,
>
> I'd vote for:
>
> - Offline learning with Streaming API
> - Low-latency prediction serving
>
> Some comments...
>
> Online learning
>
> Good to have, but my feeling is that it is not a strong requirement (if a
> requirement at all) across the industry right now. It may become hot in
> the future.
>
> Offline learning with Streaming API:
>
> Although it requires engine changes or extensions (feasibility is an issue
> here), my understanding is that it reflects common industry practice
> (train every few minutes at most), and it would be great if that was
> supported out of the box with a friendly API for the developer.
>
> Offline learning with the batch API:
>
> I would love to have a limited set of algorithms, so that someone does not
> have to leave Flink for another tool to work on some initial dataset if
> they want to. In other words, let's reach a mature state with some basic
> algorithms merged. There is a lot of work pending; let's not waste it.
>
> Low-latency prediction serving:
>
> Model serving is a long-standing problem; we could definitely help with
> that.
>
> Regards,
> Stavros
>
> On Fri, Mar 10, 2017 at 4:08 PM, Till Rohrmann <trohrm...@apache.org>
> wrote:
>
> > Thanks Theo for steering Flink's ML effort here :-)
> >
> > I'd vote to concentrate on
> >
> > - Online learning
> > - Low-latency prediction serving
> >
> > for the following reasons:
> >
> > Online learning:
> >
> > I agree that this topic is highly research-oriented and it's not even
> > clear whether it will ever be of any interest outside of academia.
> > However, it was the same for other things as well. Adoption in industry
> > is usually slow, and sometimes one has to dare to explore something new.
> >
> > Low-latency prediction serving:
> >
> > Flink with its streaming engine seems to be the natural fit for such a
> > task, and it is a rather low-hanging fruit. Furthermore, I think that
> > users would directly benefit from such a feature.
> >
> > Offline learning with Streaming API:
> >
> > I'm not fully convinced yet that the streaming API is powerful enough
> > (mainly due to the lack of proper iteration support and spilling
> > capabilities) to support a wide range of offline ML algorithms. And even
> > then it would only support rather small problem sizes, because streaming
> > cannot gracefully spill the data to disk. There are still too many open
> > issues with the streaming API for it to be applicable to this use case,
> > imo.
> >
> > Offline learning with the batch API:
> >
> > For offline learning the batch API is imo still better suited than the
> > streaming API. I think it will only make sense to port the algorithms to
> > the streaming API once batch and streaming are properly unified. The
> > highly efficient implementations of joining and sorting data that can go
> > out of memory alone are important for supporting large ML problems. In
> > general, I think it might make sense to offer a basic set of ML
> > primitives. However, even offering this basic set is a considerable
> > amount of work.
> >
> > Concerning the independent organization for the development: I think it
> > would be great if the development could still happen under the umbrella
> > of Flink's ML library, because otherwise we might risk some kind of
> > fragmentation. In order for people to collaborate, one can also open PRs
> > against a branch of a forked repo.
> >
> > I'm currently working on wrapping up the project re-organization
> > discussion. The general position was that it would be best to have an
> > incremental build and keep everything in the same repo. If this is not
> > possible, then we want to look into creating a sub-repository for the
> > libraries (maybe other components will follow later). I hope to make
> > some progress on this front in the next couple of days/weeks. I'll keep
> > you updated.
> >
> > As a general remark on the discussions in the Google doc: I think it
> > would be great if we could at least mirror the discussions happening in
> > the Google doc back to the mailing list, or ideally conduct the
> > discussions directly on the mailing list. That's at least what the ASF
> > encourages.
> >
> > Cheers,
> > Till
> >
> > On Fri, Mar 10, 2017 at 10:52 AM, Gábor Hermann <m...@gaborhermann.com>
> > wrote:
> >
> > > Hey all,
> > >
> > > Sorry for the somewhat late response.
> > >
> > > I'd like to work on:
> > > - Offline learning with Streaming API
> > > - Low-latency prediction serving
> > >
> > > I would drop the batch API ML because of past experience with the lack
> > > of support, and online learning because of the lack of use cases.
> > >
> > > I completely agree with Kate that offline learning should be
> > > supported, but given Flink's resources I prefer using the streaming
> > > API, as Roberto suggested. Also, the full model lifecycle (or
> > > end-to-end ML) could be more easily supported in one system (one API).
> > > Connecting Flink Batch with Flink Streaming is currently cumbersome
> > > (although side inputs [1] might help). In my opinion, a crucial part
> > > of end-to-end ML is low-latency predictions.
> > >
> > > As another direction, we could integrate the Flink Streaming API with
> > > other projects (such as PredictionIO). However, I believe it's better
> > > to first evaluate the capabilities and drawbacks of the streaming API
> > > with some prototype that uses Flink Streaming for an ML task.
> > > Otherwise we could run into critical issues, just as the System ML
> > > integration did with e.g. caching. These issues make the integration
> > > of the Batch API with other ML projects practically infeasible.
> > >
> > > I've already been experimenting with offline learning with the
> > > Streaming API. Hopefully, I can share some initial performance results
> > > on matrix factorization next week. Naturally, I've run into issues.
> > > E.g. I could only mark the end of input with some hacks, because this
> > > is not needed for a streaming job that consumes input forever. AFAIK,
> > > this would be resolved by side inputs [1].
> > >
> > > @Theodore:
> > > +1 for doing the prototype project(s) separately from the main Flink
> > > repository. However, I would strongly suggest following the Flink
> > > development guidelines as closely as possible. As another note, there
> > > is already a GitHub organization for Flink-related projects [2], but
> > > it seems like it has not been used much.
> > >
> > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API
> > > [2] https://github.com/project-flink
> > >
> > >
> > > On 2017-03-04 08:44, Roberto Bentivoglio wrote:
> > >
> > >> Hi All,
> > >>
> > >> I'd like to start working on:
> > >> - Offline learning with Streaming API
> > >> - Online learning
> > >>
> > >> I also think that using a new organisation on GitHub, as Theodore
> > >> proposed, to keep some initial independence and speed up the
> > >> prototyping and development phases is really interesting.
> > >>
> > >> I totally agree with Katherin that we need offline learning, but my
> > >> opinion is that it will be more straightforward to fix the streaming
> > >> issues than the batch issues, because we will have more support on
> > >> that from the Flink community.
> > >>
> > >> Thanks and have a nice weekend,
> > >> Roberto
> > >>
> > >> On 3 March 2017 at 20:20, amir bahmanyari <amirto...@yahoo.com.invalid>
> > >> wrote:
> > >>
> > >>> Great points to start:
> > >>> - Online learning
> > >>> - Offline learning with the streaming API
> > >>>
> > >>> Thanks + have a great weekend.
> > >>>
> > >>> From: Katherin Eri <katherinm...@gmail.com>
> > >>> To: dev@flink.apache.org
> > >>> Sent: Friday, March 3, 2017 7:41 AM
> > >>> Subject: Re: Machine Learning on Flink - Next steps
> > >>>
> > >>> Thank you, Theodore.
> > >>>
> > >>> In short, I vote for:
> > >>> 1) Online learning
> > >>> 2) Low-latency prediction serving -> Offline learning with the batch
> > >>> API
> > >>>
> > >>> In detail:
> > >>> 1) If streaming is the strong side of Flink, let's use it and try to
> > >>> support some online learning or lightweight in-memory learning
> > >>> algorithms, and try to build a pipeline for them.
> > >>>
> > >>> 2) I think that Flink should be part of the production ecosystem,
> > >>> and if production systems now require ML support, deployment of
> > >>> multiple models and so on, we should serve that. But in my opinion
> > >>> we shouldn't compete with projects like PredictionIO; we should
> > >>> serve them and be an execution core. That, however, means a lot:
> > >>>
> > >>> a. Offline training should be supported, because most ML algorithms
> > >>> are typically for offline training.
> > >>> b. The model lifecycle should be supported:
> > >>> ETL + transformation + training + scoring + quality monitoring in
> > >>> production.
> > >>>
> > >>> I understand that the batch world is full of competitors, but for me
> > >>> that doesn't mean that batch should be ignored. I think that
> > >>> separate streaming/batch applications cause additional deployment
> > >>> and operational overhead, which people typically try to avoid. That
> > >>> means we should attract the community to this problem, in my
> > >>> opinion.
> > >>>
> > >>>
> > >>> On Fri, 3 Mar 2017 at 15:34, Theodore Vasiloudis
> > >>> <theodoros.vasilou...@gmail.com> wrote:
> > >>>
> > >>> Hello all,
> > >>>
> > >>> From our previous discussion started by Stavros, we decided to start
> > >>> a planning document [1] to figure out possible next steps for ML on
> > >>> Flink.
> > >>>
> > >>> Our concerns were mainly ensuring active development while
> > >>> satisfying the needs of the community.
> > >>>
> > >>> We have listed a number of proposals for future work in the
> > >>> document. In short they are:
> > >>>
> > >>> - Offline learning with the batch API
> > >>> - Online learning
> > >>> - Offline learning with the streaming API
> > >>> - Low-latency prediction serving
> > >>>
> > >>> I saw there are a number of people willing to work on ML for Flink,
> > >>> but the truth is that we cannot cover all of these suggestions
> > >>> without fragmenting the development too much.
> > >>>
> > >>> So my recommendation is to pick two of these options, create design
> > >>> documents, and build prototypes for each library. We can then assess
> > >>> their viability and, together with the community, decide if we
> > >>> should try to include one (or both) of them in the main Flink
> > >>> distribution.
> > >>>
> > >>> So I invite people to express their opinion about which task they
> > >>> would be willing to contribute to, and hopefully we can settle on
> > >>> two of these options.
> > >>>
> > >>> Once that is done we can decide how we do the actual work.
> > >>> Since this is highly experimental, I would suggest we work on
> > >>> repositories where we have complete control.
> > >>>
> > >>> For that purpose I have created an organization [2] on GitHub, which
> > >>> we can use to create repositories and teams that work on them in an
> > >>> organized manner. Once enough work has accumulated, we can start
> > >>> discussing contributing the code to the main distribution.
> > >>>
> > >>> Regards,
> > >>> Theodore
> > >>>
> > >>> [1] https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/
> > >>> [2] https://github.com/flinkml
> > >>>
> > >>> --
> > >>>
> > >>> *Yours faithfully, *
> > >>>
> > >>> *Kate Eri.*
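
---

To make the offline-on-streaming discussion above a bit more concrete, here is
a minimal sketch (Scala, assuming the 2017-era DataStream API) of the
end-of-input workaround Gábor describes: a sentinel record marks the end of an
otherwise unbounded stream, so that a "model" can be emitted once the bounded
training data is exhausted. The `Sample` and `CollectOnSentinel` names and the
per-feature-sum "model" are hypothetical illustrations, not FlinkML or FLIP-17
code.

```scala
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// Hypothetical sample type: `endOfInput = true` marks the sentinel record.
case class Sample(label: Double, features: Array[Double], endOfInput: Boolean = false)

// Accumulates a trivial "model" (per-feature sums) and emits it only when the
// sentinel arrives, i.e. when the bounded training data is exhausted.
class CollectOnSentinel extends RichFlatMapFunction[Sample, Array[Double]] {
  private var sums: Array[Double] = _

  override def flatMap(s: Sample, out: Collector[Array[Double]]): Unit = {
    if (s.endOfInput) {
      if (sums != null) out.collect(sums) // "training" finished: emit the model
    } else {
      if (sums == null) sums = new Array[Double](s.features.length)
      var i = 0
      while (i < s.features.length) { sums(i) += s.features(i); i += 1 }
    }
  }
}

object OfflineOnStreamingSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Parallelism 1 so the single sentinel reaches the one operator instance.
    env.setParallelism(1)

    // A bounded "offline" data set pushed through the streaming API,
    // terminated by an explicit sentinel record.
    val samples = env.fromElements(
      Sample(1.0, Array(0.5, 1.5)),
      Sample(0.0, Array(2.0, 0.1)),
      Sample(0.0, Array.empty[Double], endOfInput = true))

    samples
      .flatMap(new CollectOnSentinel)
      .map(m => m.mkString("model[", ", ", "]"))
      .print()

    env.execute("offline-on-streaming end-of-input sketch")
  }
}
```

Proper bounded-stream support or side inputs (FLIP-17) would make such a
sentinel unnecessary, which is exactly the limitation pointed out in the thread.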