I think Simone raises some good points here.

The truth is that FlinkML is still in its infancy, and it will be hard to
compete with MLlib, H2O and GraphLab in terms of features
and algorithm "coverage".

My hope has always been that the library will be focused on what Flink does
well and implement algorithms that are
built around the inherent advantages Flink provides over other platforms.

This is an open source project, so of course it's not up to one person to
decide what makes it into the library and what doesn't,
and for me it's been really hard to gauge what the community "wants" from
the library in terms of algorithms.

The "basics" (sklearn-like predictors, evaluators, CV and pipelines) I
think are necessary and are largely in place already.
Making sure that they provide a good user experience is paramount of course
before we settle on the design.

But this is less a discussion of where we take FlinkML than of *how* we do
it.
I do believe there is a need for an integrated ML library for Flink; the
question for me is how we can ensure its continued development.



On Fri, Feb 12, 2016 at 12:59 PM, Chiwan Park <chiwanp...@apache.org> wrote:

> Hi,
>
> I agree with what Theo said. Currently, only a few committers spend time
> reviewing FlinkML PRs. But I also agree with Fabian’s opinion: I would like
> to keep FlinkML in the main Flink repository. I hope new committers will
> spend time on FlinkML.
>
> About Simone’s opinion: yes, FlinkML is still an immature ML library. Many
> useful features are missing, and some features are still pending in
> pull requests.
>
> Integration with other libraries such as Mahout, H2O, or Weka would
> also be good. There are already some attempts to use Flink or another
> distributed data processing framework as the backend of another library
> [1] [2] [3]. But I think, as you can see from the links, we would have to
> re-implement many algorithms even if we integrated another library with
> Flink. I doubt there is a big development advantage to integration.
>
> [1]: https://issues.apache.org/jira/browse/MAHOUT-1570
> [2]: http://mahout.apache.org/users/basics/algorithms.html
> [3]: https://github.com/ariskk/distributedWekaSpark
>
> Regards,
> Chiwan Park
>
> > On Feb 12, 2016, at 7:04 PM, Fabian Hueske <fhue...@gmail.com> wrote:
> >
> > Hi Theo,
> >
> > thanks for starting this discussion. You are certainly right that the
> > development of FlinkML is stalling. On the other hand, we regularly see
> > people on the mailing list asking for features.
> >
> > Regarding your proposed ways to proceed:
> >
> > 1) I am not sure how much it would help to move FlinkML to a separate
> > repository.
> > We have discussed moving connectors (and libraries) to separate
> > repositories before, but the thread fell asleep [1].
> > We would still need committers to spend time reviewing, merging, and
> > contributing.
> > So IMO, this is orthogonal to having more committer involvement.
> >
> > 2) Having committers (current or new ones) spend time on FlinkML is the
> > requirement for keeping it alive within the Flink project.
> > Adding new committers is kind of a bootstrap problem here because it is
> > hard for contributors to get involved with FlinkML if very little
> > committer time is spent on code reviews and merging. Nonetheless, I see
> > this as the best option.
> >
> > 3) Forking a project on GitHub is certainly possible (even without the
> > endorsement of the Flink community). However, merging changes back into
> > Flink would again require a committer to review and merge (probably a
> > much larger chunk of code) and would also require the permission of all
> > contributors.
> >
> > Best,
> > Fabian
> >
> > [1] https://mail-archives.apache.org/mod_mbox/flink-dev/201512.mbox/%3CCAGco--aZhZhrrSzzPROwXwmtYmD5CkoGKe7xNCWG1Vw7V-D%2BaA%40mail.gmail.com%3E
> >
> > 2016-02-12 10:23 GMT+01:00 Theodore Vasiloudis <
> > theodoros.vasilou...@gmail.com>:
> >
> >> Hello all,
> >>
> >> I would like to get a conversation started on how we plan to move
> >> forward with FlinkML.
> >>
> >> Development on the library has been mostly dormant for the past 6
> >> months, mainly, I believe, because of the lack of available committers
> >> to review PRs.
> >>
> >> Last month we got together with Till and Marton and talked about how we
> >> could try to solve this and ensure continued development of the library.
> >>
> >> We see 3 possible paths we could take:
> >>
> >>   1. Externalize the library, creating a new repository under the Apache
> >>   Flink project. This decouples the development of FlinkML from the
> >>   Flink release cycle, allowing us to move faster and incorporate new
> >>   features as they become available. As FlinkML is a library under
> >>   development, tying it to specific versions does not make much sense
> >>   anyway. The library would depend on the latest snapshot version of
> >>   Flink. It would then be possible for the Flink distribution to
> >>   cherry-pick parts of the library to be included with the core
> >>   distribution.
> >>   2. Keep the development under the main Flink project but bring in new
> >>   committers. This would mean that the development remains as is and is
> >>   tied to core Flink releases, but new work should get merged at much
> >>   more regular intervals with the help of committers other than Till.
> >>   Marton Balassi has volunteered for that role and I hope that more
> >>   might take it up.
> >>   3. A third option is to fork FlinkML into a repository to which we are
> >>   able to commit freely (again through PRs and reviews of course) and
> >>   merge good parts back into the main repo once in a while. This allows
> >>   for faster progress and more experimental work but obviously creates
> >>   fragmentation.
> >>
> >>
> >> I would like to hear your thoughts on these three options, as well as
> >> discuss other alternatives that could help move FlinkML forward.
> >>
> >> Cheers,
> >> Theodore
> >>
>
>
