Hi,

Meanwhile, until FlinkML matures, it might be worth having Flink as the engine powering H2O, in a similar way to what Spark does with Sparkling Water. Any thoughts?
Thanks
Slim Baltagi

On Feb 12, 2016, at 7:25 AM, Theodore Vasiloudis <theodoros.vasilou...@gmail.com> wrote:

> I think Simone raises some good points here.
>
> The truth is that FlinkML is still in its infancy and it will be hard to
> compete with MLlib, H2O and GraphLab in terms of features
> and algorithm "coverage".
>
> My hope has always been that the library will focus on what Flink does
> well and implement algorithms that are
> built around the inherent advantages Flink provides over other platforms.
>
> This is an open source project, so of course it's not up to one person to
> decide what makes it into the library and what doesn't,
> and for me it's been really hard to gauge what the community "wants" from
> the library in terms of algorithms.
>
> The "basics" (sklearn-like predictors, evaluators, CV and pipelines) I
> think are necessary and are largely in place already.
> Making sure that they provide a good user experience is paramount, of course,
> before we settle on the design.
>
> But this is less a discussion of where we take FlinkML than of *how* we do
> it.
> I do believe there is a need for an integrated ML library for Flink; the
> question for me is how we can ensure its continued development.
>
> On Fri, Feb 12, 2016 at 12:59 PM, Chiwan Park <chiwanp...@apache.org> wrote:
>
>> Hi,
>>
>> I agree with what Theo said. Currently, only a few committers spend time
>> reviewing FlinkML PRs. But I also agree with Fabian’s opinion. I would like
>> to keep FlinkML under the main Flink repository. I hope new committers
>> will spend time on FlinkML.
>>
>> About Simone’s opinion: yes, FlinkML is still an immature ML library. Many
>> useful features are missing, and some of them are pending in
>> pull requests.
>>
>> Integration with other libraries such as Mahout, H2O, or Weka would
>> also be good. There are already some attempts to use Flink or other distributed
>> data processing frameworks as a backend for other libraries [1] [2] [3].
>> But I think, as you can see from the links, we would have to re-implement many
>> algorithms even if we integrate another library with Flink. I doubt that
>> integration brings a big development advantage.
>>
>> [1]: https://issues.apache.org/jira/browse/MAHOUT-1570
>> [2]: http://mahout.apache.org/users/basics/algorithms.html
>> [3]: https://github.com/ariskk/distributedWekaSpark
>>
>> Regards,
>> Chiwan Park
>>
>>> On Feb 12, 2016, at 7:04 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>>>
>>> Hi Theo,
>>>
>>> thanks for starting this discussion. You are certainly right that the
>>> development of FlinkML is stalling. On the other hand, we regularly see
>>> people on the mailing list asking for features.
>>>
>>> Regarding your proposed ways to proceed:
>>>
>>> 1) I am not sure how much it would help to move FlinkML to a separate
>>> repository.
>>> We have discussed moving connectors (and libraries) to separate
>>> repositories before, but the thread fell asleep [1].
>>> We would still need committers to spend time reviewing, merging, and
>>> contributing.
>>> So IMO, this is orthogonal to having more committer involvement.
>>>
>>> 2) Having committers (current / new ones) spend time on FlinkML is the
>>> requirement for keeping it alive within the Flink project.
>>> Adding new committers is kind of a bootstrap problem here because it is
>>> hard for contributors to get involved with FlinkML if very little committer
>>> time is spent on code reviews and merging. Nonetheless, I see this as the
>>> best option.
>>>
>>> 3) Forking a project on GitHub is certainly possible (even without the
>>> endorsement of the Flink community). However, merging changes back into
>>> Flink would again require a committer to review and merge (probably a much
>>> larger chunk of code) and also require the permission of all contributors.
>>>
>>> Best,
>>> Fabian
>>>
>>> [1]
>>> https://mail-archives.apache.org/mod_mbox/flink-dev/201512.mbox/%3CCAGco--aZhZhrrSzzPROwXwmtYmD5CkoGKe7xNCWG1Vw7V-D%2BaA%40mail.gmail.com%3E
>>>
>>> 2016-02-12 10:23 GMT+01:00 Theodore Vasiloudis <
>>> theodoros.vasilou...@gmail.com>:
>>>
>>>> Hello all,
>>>>
>>>> I would like to get a conversation started on how we plan to move forward
>>>> with FlinkML.
>>>>
>>>> Development on the library has been mostly dormant for the past 6
>>>> months, mainly, I believe, because of the lack of available committers to
>>>> review PRs.
>>>>
>>>> Last month we got together with Till and Marton and talked about how we
>>>> could try to solve this and ensure continued development of the library.
>>>>
>>>> We see 3 possible paths we could take:
>>>>
>>>> 1.
>>>> Externalize the library, creating a new repository under the Apache
>>>> Flink project. This decouples the development of FlinkML from the Flink
>>>> release cycle, allowing us to move faster and incorporate new features as
>>>> they become available. As FlinkML is a library under development, tying it
>>>> to specific versions does not make much sense anyway. The library would
>>>> depend on the latest snapshot version of Flink. It would then be possible
>>>> for the Flink distribution to cherry-pick parts of the library to be
>>>> included with the core distribution.
>>>> 2.
>>>> Keep the development under the main Flink project but bring in new
>>>> committers. This would mean that the development remains as is and is tied
>>>> to core Flink releases, but new work should get merged at much more
>>>> regular intervals through the help of committers other than Till. Marton
>>>> Balassi has volunteered for that role and I hope that more might take it
>>>> up.
>>>> 3.
>>>> A third option is to fork FlinkML into a repository to which we are
>>>> able to commit freely (again through PRs and reviews, of course) and merge
>>>> good parts back into the main repo once in a while. This allows for faster
>>>> progress and more experimental work but obviously creates fragmentation.
>>>>
>>>> I would like to hear your thoughts on these three options, as well as
>>>> discuss other alternatives that could help move FlinkML forward.
>>>>
>>>> Cheers,
>>>> Theodore