Hi Stavros,
Thanks for bringing this up.
There have been past [1] and recent [2, 3] discussions about the Flink
libraries, because there are some stalling PRs and overloaded
committers. (Actually, Till is the only committer shepherding both the
CEP and ML libraries, and AFAIK he has a ton of other responsibilities
and work to do.) Thus it's hard to get code reviewed and merged, and
without merged code it's hard to get a committer status, so there are
not many committers who can review e.g. ML algorithm implementations,
and the cycle goes on. Until this is resolved somehow, we should help
the committers by reviewing each other's PRs.
I think prioritizing features (b) is a good way to start. We could
identify the most blocking features and concentrate on reviewing and
merging them before moving forward. E.g. the evaluation framework is
quite important for an ML library in my opinion, and its PR has been
stalled for a long time [4].
Regarding c), there are general style guides for contributing to
Flink, so we should follow those. Is there something more ML-specific you
think we could follow? We should definitely declare that we follow
scikit-learn and make sure contributions comply with it.
In terms of features (a, d), I think we should first see the bigger
picture. That is, it would be nice to discuss a clearer direction for
Flink ML. I've seen a lot of interest in contributing to Flink ML
lately. I believe we should rethink our goals, so that the contribution
efforts go into making a usable and useful library. Are we trying to
implement as many useful algorithms as possible to create a scalable ML
library? That would seem ambitious, and of course there are a lot of
frameworks and libraries that already have something like this as a goal
(e.g. Spark MLlib, Mahout). Should we rather create connectors to
existing libraries? Then we cannot really do Flink-specific
optimizations. Should we go for online machine learning (as Flink is
concentrating on streaming)? We already have a connector to SAMOA. We
could go on with questions like this. Maybe I'm missing something, but I
haven't seen such directions declared.
Cheers,
Gabor
[1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Opening-a-discussion-on-FlinkML-td10265.html
[2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Flink-CEP-development-is-stalling-td15237.html#a15341
[3] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/New-Flink-team-member-Kate-Eri-td15349.html
[4] https://github.com/apache/flink/pull/1849
On 2017-02-20 11:43, Stavros Kontopoulos wrote:
(Resending with the appropriate topic)
Hi,
I would like to start a discussion about next steps for Flink ML.
Currently there is a lot of work going on, but it needs a push forward.
Some topics to discuss:
a) How several features should be planned and get aligned with Flink
releases.
b) Priorities of what should be done.
c) Basic guidelines for code: styleguides, scikit-learn compliance etc
d) Missing features important for the success of the library, next steps
etc...
Thoughts?
Best,
Stavros