Hi all,

Thanks, Stavros, for pushing this discussion forward; I find it very
relevant.

Since I've only just started engaging actively with the community and don't
yet have much experience or visibility within it, I'll limit myself to
sharing an opinion as a Flink user.

I've been using Flink for almost a year, across two different projects, and
in both cases I ran into the question "how do we handle ML workloads while
keeping Flink as the main engine?" That raises an obvious first point: why
should I need to adopt an extra system purely for ML? How great would it be
to use the Flink engine as the ML feature provider and avoid the effort of
maintaining an additional engine? This also connects to @Timur's opinion: I
believe users would much prefer a unified architecture in this case. Even if
a user wants to use an external tool/library - perhaps one providing
additional language support (e.g. R) - they should be able to run it on top
of Flink.

In my work with Flink I have needed to implement some ML algorithms on both
Flink and Spark, and I often struggled with Flink's performance. So, with
the bigger picture in mind, I think we should first focus our effort on
solving some well-known Flink limitations, as @theodore pointed out. I'd
like to highlight [1] and [2], which I find particularly relevant. If the
community decides to go ahead with FlinkML, I believe fixing the issues
above would be a good starting point. It would also definitely push forward
some important integrations, such as Apache SystemML.

Given all these points, I'm increasingly convinced that online machine
learning is the real end goal and the most suitable objective, since we're
talking about a real-time streaming engine; from a high-level point of view,
I believe Flink fits this topic more naturally than the batch case. We do
have a connector for Apache SAMOA, but IMHO it is at an early stage of
development and not very active. If we want to build something within Flink
instead, we need to speed up the design of some features (e.g. side inputs
[3]).
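To make the online-learning direction a bit more concrete, here is a
minimal, framework-free sketch (plain Java, no Flink dependencies; all names
are my own, purely for illustration) of the kind of per-event model update -
online SGD for linear regression - that a streaming operator could apply to
each incoming element. Distributing and refreshing such a model across
parallel operators is exactly where features like side inputs would help.

```java
import java.util.Arrays;

// Illustrative online SGD for linear regression: the model is updated
// incrementally for each arriving (features, label) event, which matches
// the one-element-at-a-time access pattern of a streaming operator.
public class OnlineSgd {
    private final double[] weights;
    private final double learningRate;

    public OnlineSgd(int dim, double learningRate) {
        this.weights = new double[dim];
        this.learningRate = learningRate;
    }

    public double predict(double[] x) {
        double y = 0.0;
        for (int i = 0; i < x.length; i++) {
            y += weights[i] * x[i];
        }
        return y;
    }

    // One gradient step per event: w <- w - lr * (w . x - y) * x
    public void update(double[] x, double y) {
        double err = predict(x) - y;
        for (int i = 0; i < x.length; i++) {
            weights[i] -= learningRate * err * x[i];
        }
    }

    public static void main(String[] args) {
        OnlineSgd model = new OnlineSgd(2, 0.1);
        // Simulated event stream for the target y = 2*x1 + 1 (x[0] is a bias term).
        for (int n = 0; n < 1000; n++) {
            double x1 = (n % 10) / 10.0;
            model.update(new double[]{1.0, x1}, 2.0 * x1 + 1.0);
        }
        // Weights should approach [1.0, 2.0] as events keep arriving.
        System.out.println(Arrays.toString(model.weights));
    }
}
```

In a real Flink job the `update` call would live inside a stateful operator
(the weights kept as operator state), but the point is that the learning
step itself is naturally incremental - no batch pass over the data is
needed.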

I really hope we can define a new roadmap that finally pushes this topic
forward. I will do my best to help.

Sincerely, 
Andrea

[1] Add a FlinkTools.persist style method to the Data Set
https://issues.apache.org/jira/browse/FLINK-1730
[2] Only send data to each taskmanager once for broadcasts
https://cwiki.apache.org/confluence/display/FLINK/FLIP-5%3A+Only+send+data+to+each+taskmanager+once+for+broadcasts
[3] Side inputs - Evolving or static Filter/Enriching
https://docs.google.com/document/d/1hIgxi2Zchww_5fWUHLoYiXwSBXjv-M5eOv-MKQYN3m4/edit#
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Add-Side-Input-Broadcast-Set-For-Streaming-API-td11529.html


