Re: [DISCUSS] Move Flink ML pipeline API and library code to a separate repository named flink-ml

Dong Lin Thu, 18 Mar 2021 19:03:15 -0700

Thank you Becket and Till for your comments!

Since the discussion has been open for about a week and no concerns have
been raised about this proposal, I have started the voting thread. Please
vote when you get time.


Cheers,
Dong

On Mon, Mar 15, 2021 at 6:00 PM Till Rohrmann <[email protected]> wrote:

> +1 for moving Flink ML to a separate repository. Thanks for driving this
> discussion and effort Dong!
>
> Cheers,
> Till
>
> On Fri, Mar 12, 2021 at 1:19 PM Becket Qin <[email protected]> wrote:
>
> > Thanks for raising the discussion, Dong. +1 on moving Flink ML to a
> > separate repository.
> >
> > Machine learning is a big area which deserves a separate project so the
> > development can be decoupled from Flink core. In the meantime, it gives us
> > the flexibility of evolving Flink without breaking the existing ML users.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Fri, Mar 12, 2021 at 6:16 PM Dong Lin <[email protected]> wrote:
> >
> > > Hi everyone,
> > >
> > > I am opening this thread to discuss the idea of moving Flink ML pipeline
> > > API and library code to a separate repository in Flink (similar to what we
> > > did for flink-statefun <https://github.com/apache/flink-statefun>).
> > >
> > > The Flink ML pipeline API was proposed by FLIP-39: Flink ML pipeline and ML
> > > libs
> > > <https://cwiki.apache.org/confluence/display/FLINK/FLIP-39+Flink+ML+pipeline+and+ML+libs>.
> > > It allows MLlib developers and users to develop ML pipelines on top of
> > > Flink.
> > >
> > > According to the discussion in this
> > > <http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Deprecation-and-removal-of-the-legacy-SQL-planner-td48988.html>
> > > thread, we plan to remove the legacy SQL planner in Flink 1.14. However,
> > > there exist ML libraries which currently use Flink's DataSet API together
> > > with Table API. Those libraries will either stop working or suffer
> > > considerable performance regression if they bump up dependency to Flink
> > > 1.14. As a result, if we keep the ML pipeline API in Flink, then those ML
> > > libraries cannot use the latest ML pipeline API/lib in Flink until Flink
> > > compensates for the missing functionality with new DataStream APIs, which
> > > is supposed to happen about 1 year from now in e.g. Flink 1.15.
> > >
> > > In order to allow us to remove the legacy SQL planner in Flink 1.14 while
> > > still allowing ML pipeline API/lib development in the coming year, we
> > > propose to move Flink ML pipeline API and library code to a separate
> > > repository. More specifically, the new repo will have the following setup:
> > > - The repo will be created at https://github.com/apache/flink-ml. This repo
> > > will depend on the core Flink repo.
> > > - The flink-ml documentation will be linked from the existing main Flink
> > > docs similar to
> > > https://ci.apache.org/projects/flink/flink-statefun-docs-master.
> > > - The new repo will be under namespace org.apache.flink.
> > > - We can revisit whether we should put it back to the core Flink repo after
> > > the above issue is resolved and if there is good reason to make the change.
> > >
> > > Here is the proposed plan if we agree to make this change:
> > > - We will create the flink-ml repo and move Flink ML pipeline related code
> > > to this repo before Flink 1.13 code release (3/31/2021).
> > > - Then we update flink-ml repo to depend on Flink 1.13 after Flink 1.13 is
> > > released.
> > > - Then we update core Flink with new DataStream API (e.g. DataStream
> > > iteration) such that core Flink can support the same (or better) ML lib
> > > performance as it does now with the SQL planner. This is supposed to happen
> > > in about 1 year.
> > > - Then we update flink-ml repo to depend on the latest Flink version once
> > > Flink has the new DataStream API.
> > >
> > > Besides the main motivation described above, this change also shares
> > > similar pros/cons of creating a separate repo for flink-statefun
> > > <https://github.com/apache/flink-statefun> (see this
> > > <http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Stateful-Functions-in-which-form-to-contribute-same-or-different-repository-td34034.html>
> > > and this
> > > <http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/PROPOSAL-Contribute-Stateful-Functions-to-Apache-Flink-td33913.html>
> > > for prior discussion).
> > >
> > > Pros:
> > > - A separate repo allows faster development for an early-stage project
> > > like the Flink ML pipeline (both API and libs).
> > > - The Flink repo is already very large, and it is good not to bloat its
> > > size (and the number of tests).
> > > - Fewer tests to run when we make code changes in each repo.
> > >
> > > Cons:
> > > - Code changes in core Flink might break tests or cause performance
> > > regressions in flink-ml, since the two live in different repos. So more
> > > effort is needed when we bump up flink-ml's Flink dependency.
> > >
> > > Overall it seems that the pros outweigh the cons. Looking forward to
> > > hearing what you think!
> > >
> > >
> > > Regards,
> > > Dong
> > >
> >
>
