Hi all! Sorry for joining this discussion late (I have already missed some of the deadlines set in this thread).
*Here are some thoughts about what we can do immediately*

(1) Grow the ML community by adding committers with a dedicated ML focus. Irrespective of any direction decision, this is a must. I know that the PMC is actively working on this, so stay tuned for some updates.

(2) I think a repository split helps to make library committer additions easier, even if it does not go hand in hand with a community split. I believe that we can trust committers who were appointed mainly for their library work to commit directly to the library repository and to go through pull requests in the engine/api/connector repository. In some sense we have the same thing already: we trust committers to only commit when they are confident in the touched component and to submit a pull request if in doubt. Having separate repositories makes this rule of thumb even simpler.

*On the Roadmap Discussion*

- Thanks for the collection and discussion already, these are super nice thoughts. Kudos!
- My personal take is that "model evaluation" over streams will happen in any case - there is genuine interest in it, and various users have built it themselves already.
- Model evaluation as one step of a streaming pipeline (classifying events), followed by CEP (pattern detection) or anomaly detection, is a valuable use case on top of what pure model serving systems usually do (a rough sketch of such a pipeline is at the end of this mail).
- An "ML training library" is certainly interesting, if the community can pull it off. More details below.
- A question I do not yet have a good intuition on is whether "model evaluation" and the training part are so different (once a good abstraction for model evaluation has been built) that little cross coordination is needed, or whether there is potential in integrating them.

*Thoughts on the ML training library*

- Especially now, there seems to be a big trend towards deep learning (is it just temporary, or will this be the future?), and in that space little works without GPU acceleration.
- It is always easier to do something new than to be the n-th version of something existing (sorry for the generic truism). The latter admittedly gives the "all in one integrated framework" advantage (which can be a very strong argument indeed), but the former attracts completely new communities and can often make more noise with less effort.
- The "new" is not required to be "online learning", where Theo has described well that it does not look like it is taking off. It can also be traditional ML re-imagined for "continuous applications", as "continuous / incremental re-training" or so (the second sketch at the end of this mail shows one pattern for that). Even on the "model evaluation" side, there is a lot of interesting stuff as mentioned already, like ensembles, multi-armed bandits, ...
- It may well be worth tapping into the work of an existing library (like TensorFlow) for an easy fix to some hard problems (pre-existing hardware integration, pre-existing optimized linear algebra solvers, etc.) and thinking about what such use cases would look like in the context of typical Flink applications (see the third sketch at the end of this mail).

*A bit of engine background information that may help in the planning:*

- The DataStream API will in the future also support bounded data computations explicitly (I say this not as a fact, but as a strong believer that this is the right direction).
- Batch runtime execution has seen less focus recently, but seems to be getting more community attention, because some organizations that contribute a lot want to use the batch side as well. For example, the effort on fine-grained recovery will already strengthen batch a lot.
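To make the "classify events, then detect patterns" use case concrete, here is a minimal Scala sketch against a 1.3-era DataStream + flink-cep-scala API. The event schema, the source, the hard-coded linear model, and the threshold pattern are all made up for illustration; a real pipeline would plug in its own model and patterns.

import org.apache.flink.streaming.api.scala._
import org.apache.flink.cep.scala.CEP
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.streaming.api.windowing.time.Time

case class Event(userId: String, features: Array[Double])
case class Scored(userId: String, score: Double)

object ScoreThenDetect {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // events arriving as "userId,f1,f2,f3" lines (source chosen only for the example)
    val events: DataStream[Event] = env
      .socketTextStream("localhost", 9999)
      .map { line =>
        val parts = line.split(',')
        Event(parts.head, parts.tail.map(_.toDouble))
      }

    // step 1: model evaluation - a hard-coded linear model stands in for a real one
    val weights = Array(0.4, -1.2, 0.7)
    val scored: DataStream[Scored] = events.map { e =>
      Scored(e.userId, e.features.zip(weights).map { case (x, w) => x * w }.sum)
    }

    // step 2: CEP on top of the scores - two high scores for one user within a minute
    val pattern = Pattern
      .begin[Scored]("first").where(_.score > 0.8)
      .next("second").where(_.score > 0.8)
      .within(Time.minutes(1))

    CEP.pattern(scored.keyBy(_.userId), pattern)
      .select(m => s"alert for user ${m("first").head.userId}")
      .print()

    env.execute("score-then-detect")
  }
}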
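For "continuous / incremental re-training", one pattern that works with the engine as it is today is to treat the model itself as a second stream: a CoFlatMapFunction holds the latest weights and swaps them whenever the output of a periodic re-training job arrives. Again just a sketch with a made-up linear model, not a finished design:

import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.util.Collector

// holds the latest model weights and applies them to incoming feature vectors;
// flatMap2 receives freshly re-trained weights and swaps the model in place
class ModelApplier extends CoFlatMapFunction[Array[Double], Array[Double], Double] {
  private var weights: Array[Double] = Array.empty

  override def flatMap1(features: Array[Double], out: Collector[Double]): Unit =
    if (weights.nonEmpty)
      out.collect(features.zip(weights).map { case (x, w) => x * w }.sum)

  override def flatMap2(newWeights: Array[Double], out: Collector[Double]): Unit =
    weights = newWeights // new model arrived: swap it in, emit nothing
}

// usage: featureStream.connect(modelUpdateStream).flatMap(new ModelApplier)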
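And to illustrate tapping an existing library, here is roughly what scoring with a TensorFlow SavedModel inside a Flink RichMapFunction could look like, via TensorFlow's Java API. The model path and the tensor names ("input", "output") are assumptions that depend entirely on how the model was exported; this shows the shape of the integration, not a tested implementation.

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.tensorflow.{SavedModelBundle, Tensor}

// scores records with a TensorFlow SavedModel; one model instance per parallel task
class TfScorer(modelPath: String) extends RichMapFunction[Array[Float], Float] {
  @transient private var bundle: SavedModelBundle = _

  override def open(parameters: Configuration): Unit = {
    // "serve" is the standard SavedModel tag; the path is a placeholder
    bundle = SavedModelBundle.load(modelPath, "serve")
  }

  override def map(features: Array[Float]): Float = {
    val input = Tensor.create(Array(features)) // shape [1, numFeatures]
    val output = bundle.session().runner()
      .feed("input", input)   // tensor names depend on the exported graph
      .fetch("output")
      .run().get(0)
    try {
      val buf = Array.ofDim[Float](1, 1)
      output.copyTo(buf)
      buf(0)(0)
    } finally {
      input.close()
      output.close()
    }
  }

  override def close(): Unit = if (bundle != null) bundle.close()
}

// usage: featureStream.map(new TfScorer("/path/to/saved_model"))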
Stephan

On Fri, Mar 10, 2017 at 2:38 PM, Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Roberto,
>
> jpmml looks quite promising and this could be a first step towards the
> model serving story. Thus, really looking forward to seeing it open
> sourced by you guys :-)
>
> @Katherin, I'm not saying that there is no interest in the community to
> work on batch features. However, there is simply not much capacity left to
> mentor such an effort at the moment. I fear that without mentoring from an
> experienced contributor who has worked on the batch part, it will be
> extremely hard to get such a change into the code base. But this will
> hopefully change in the future.
>
> I think the discussion from this thread moved over to [1] and I will
> continue discussing there.
>
> [1]
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Machine-Learning-on-Flink-Next-steps-td16334.html#none
>
> Cheers,
> Till