Hi all,

Thanks Stavros for pushing this discussion forward; I feel it is really relevant.
Since I'm only now becoming active in the community and don't yet have much experience or visibility into the Flink community, I'll limit myself to sharing an opinion as a Flink user.

I've been using Flink for almost a year, across two different projects, and in both cases I bumped into the question "how do I handle ML workloads while keeping Flink as the main engine?". The first point that comes to mind: why should I have to adopt an extra system purely for ML? How great would it be to use the Flink engine as an ML feature provider and avoid the effort of maintaining an additional engine? This ties into @Timur's opinion: I believe users would much prefer a unified architecture here. Even if a user wants to use an external tool/library - perhaps one providing additional language support (e.g. R) - that user should be able to run it on top of Flink.

In my work with Flink I have needed to implement some ML algorithms on both Flink and Spark, and I often struggled with Flink's performance. With the bigger picture in mind, I think we should first focus our effort on solving some well-known Flink limitations, as @theodore pinpointed. I'd like to highlight [1] and [2], which I find particularly relevant. If the community decides to go ahead with FlinkML, I believe fixing the issues above would be a good starting point. That would also definitely push forward some important integrations, such as Apache SystemML.

Given all these points, I'm increasingly convinced that online machine learning is the real end goal and the more suitable objective, since we're talking about a real-time streaming engine; from a very high-level point of view, I believe Flink fits this topic more naturally than the batch case. We have a connector for Apache SAMOA, but IMHO it is at an early stage of development and not very active.
If we want to build something within Flink instead, we need to speed up the design of some features (e.g. side inputs [3]). I really hope we can define a new roadmap that finally pushes this topic forward. I will do my best to help with that.

Sincerely,
Andrea

[1] Add a FlinkTools.persist style method to the Data Set
https://issues.apache.org/jira/browse/FLINK-1730
[2] Only send data to each taskmanager once for broadcasts
https://cwiki.apache.org/confluence/display/FLINK/FLIP-5%3A+Only+send+data+to+each+taskmanager+once+for+broadcasts
[3] Side inputs - Evolving or static Filter/Enriching
https://docs.google.com/document/d/1hIgxi2Zchww_5fWUHLoYiXwSBXjv-M5eOv-MKQYN3m4/edit#
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Add-Side-Input-Broadcast-Set-For-Streaming-API-td11529.html