+1 for moving Flink ML to a separate repository. Thanks for driving this
discussion and effort, Dong!

Cheers,
Till

On Fri, Mar 12, 2021 at 1:19 PM Becket Qin <becket....@gmail.com> wrote:

> Thanks for raising the discussion, Dong. +1 on moving Flink ML to a
> separate repository.
>
> Machine learning is a big area which deserves a separate project so the
> development can be decoupled from Flink core. In the meantime, it gives us
> the flexibility of evolving Flink without breaking the existing ML users.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Fri, Mar 12, 2021 at 6:16 PM Dong Lin <lindon...@gmail.com> wrote:
>
> > Hi everyone,
> >
> > I am opening this thread to discuss the idea of moving the Flink ML pipeline
> > API and library code to a separate repository in Flink (similar to what we
> > did for flink-statefun <https://github.com/apache/flink-statefun>).
> >
> > The Flink ML pipeline API was proposed by FLIP-39: Flink ML pipeline and ML
> > libs
> > <https://cwiki.apache.org/confluence/display/FLINK/FLIP-39+Flink+ML+pipeline+and+ML+libs>.
> > It allows MLlib developers and users to develop ML pipelines on top of
> > Flink.
> >
> > According to the discussion in this thread
> > <http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Deprecation-and-removal-of-the-legacy-SQL-planner-td48988.html>,
> > we plan to remove the legacy SQL planner in Flink 1.14. However,
> > there exist ML libraries which currently use Flink's DataSet API together
> > with the Table API. Those libraries will either stop working or suffer
> > considerable performance regression if they bump their dependency to Flink
> > 1.14. As a result, if we keep the ML pipeline API in Flink, then those ML
> > libraries cannot use the latest ML pipeline API/lib in Flink until Flink
> > compensates for the missing functionality with new DataStream APIs, which is
> > supposed to happen about 1 year from now in e.g. Flink 1.15.
> >
> > In order to allow us to remove the SQL planner in Flink 1.14 while still
> > allowing ML pipeline API/lib development in the coming year, we propose to
> > move the Flink ML pipeline API and library code to a separate repository. More
> > specifically, the new repo will have the following setup:
> > - The repo will be created at https://github.com/apache/flink-ml. This repo
> > will depend on the core Flink repo.
> > - The flink-ml documentation will be linked from the existing main Flink
> > docs similar to
> > https://ci.apache.org/projects/flink/flink-statefun-docs-master.
> > - The new repo will be under namespace org.apache.flink.
> > - We can revisit whether we should put it back to the core Flink repo after
> > the above issue is resolved and if there is good reason to make the change.
> >
> > Here is the proposed plan if we agree to make this change:
> > - We will create the flink-ml repo and move Flink ML pipeline related code
> > to this repo before the Flink 1.13 code release (3/31/2021).
> > - Then we update the flink-ml repo to depend on Flink 1.13 after Flink 1.13 is
> > released.
> > - Then we update core Flink with new DataStream API (e.g. DataStream
> > iteration) such that core Flink can support the same (or better) ML lib
> > performance as it does now with the SQL planner. This is supposed to happen
> > in about 1 year.
> > - Then we update flink-ml repo to depend on the latest Flink version once
> > Flink has the new DataStream API.
> >
> > Besides the main motivation described above, this change also shares
> > similar pros/cons with creating a separate repo for flink-statefun
> > <https://github.com/apache/flink-statefun> (see this
> > <http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Stateful-Functions-in-which-form-to-contribute-same-or-different-repository-td34034.html>
> > and this
> > <http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/PROPOSAL-Contribute-Stateful-Functions-to-Apache-Flink-td33913.html>
> > for prior discussion).
> >
> > Pros:
> > - A separate repo allows faster development for an early-stage project
> > like the Flink ML pipeline (both API and libs).
> > - The Flink repo is already super large and it is good not to bloat its size
> > (and the number of tests).
> > - Fewer tests to run when we make code changes in each repo.
> >
> > Cons:
> > - Code changes in core Flink might break tests or cause performance
> > regressions in flink-ml since they are in different repos. So more effort
> > is needed when we bump up flink-ml's Flink dependency.
> >
> > Overall it seems that the pros outweigh the cons. Looking forward to
> > hearing what you think!
> >
> >
> > Regards,
> > Dong
> >
>
