+1 for moving Flink ML to a separate repository. Thanks for driving this discussion and effort, Dong!

Cheers,
Till

On Fri, Mar 12, 2021 at 1:19 PM Becket Qin <becket....@gmail.com> wrote:

> Thanks for raising the discussion, Dong. +1 on moving Flink ML to a separate repository.
>
> Machine learning is a big area which deserves a separate project so the development can be decoupled from Flink core. In the meantime, it gives us the flexibility of evolving Flink without breaking the existing ML users.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Fri, Mar 12, 2021 at 6:16 PM Dong Lin <lindon...@gmail.com> wrote:
>
> > Hi everyone,
> >
> > I am opening this thread to discuss the idea of moving the Flink ML pipeline API and library code to a separate repository in Flink (similar to what we did for flink-statefun <https://github.com/apache/flink-statefun>).
> >
> > The Flink ML pipeline API was proposed by FLIP-39: Flink ML pipeline and ML libs
> > <https://cwiki.apache.org/confluence/display/FLINK/FLIP-39+Flink+ML+pipeline+and+ML+libs>.
> > It allows MLlib developers and users to develop ML pipelines on top of Flink.
> >
> > According to the discussion in this
> > <http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Deprecation-and-removal-of-the-legacy-SQL-planner-td48988.html>
> > thread, we plan to remove the legacy SQL planner in Flink 1.14. However, there exist ML libraries which currently use Flink's DataSet API together with the Table API. Those libraries will either stop working or suffer considerable performance regression if they bump their dependency to Flink 1.14. As a result, if we keep the ML pipeline API in Flink, then those ML libraries cannot use the latest ML pipeline API/lib in Flink until Flink compensates for the missing functionality with new DataStream APIs, which is supposed to happen about 1 year from now in e.g. Flink 1.15.
> >
> > In order to allow us to remove the SQL planner in Flink 1.14 while still allowing ML pipeline API/lib development in the coming year, we propose to move the Flink ML pipeline API and library code to a separate repository. More specifically, the new repo will have the following setup:
> > - The repo will be created at https://github.com/apache/flink-ml. This repo will depend on the core Flink repo.
> > - The flink-ml documentation will be linked from the existing main Flink docs, similar to https://ci.apache.org/projects/flink/flink-statefun-docs-master.
> > - The new repo will be under the namespace org.apache.flink.
> > - We can revisit whether we should put it back into the core Flink repo after the above issue is resolved and if there is good reason to make the change.
> >
> > Here is the proposed plan if we agree to make this change:
> > - We will create the flink-ml repo and move the Flink ML pipeline related code to this repo before the Flink 1.13 code release (3/31/2021).
> > - Then we update the flink-ml repo to depend on Flink 1.13 after Flink 1.13 is released.
> > - Then we update core Flink with new DataStream APIs (e.g. DataStream iteration) such that core Flink can support the same (or better) ML lib performance as it does now with the SQL planner. This is supposed to happen in about 1 year.
> > - Then we update the flink-ml repo to depend on the latest Flink version once Flink has the new DataStream API.
> >
> > Besides the main motivation described above, this change also shares similar pros/cons with creating a separate repo for flink-statefun <https://github.com/apache/flink-statefun> (see this
> > <http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Stateful-Functions-in-which-form-to-contribute-same-or-different-repository-td34034.html>
> > and this
> > <http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/PROPOSAL-Contribute-Stateful-Functions-to-Apache-Flink-td33913.html>
> > for prior discussion).
> >
> > Pros:
> > - A separate repo allows faster development for an early-stage project like the Flink ML pipeline (both API and libs).
> > - The Flink repo is already super large and it is good not to bloat its size (and the number of tests).
> > - Fewer tests to run when we make code changes in each repo.
> >
> > Cons:
> > - Code changes in core Flink might break tests or cause performance regressions in flink-ml, since the two live in different repos. So more effort is needed when we bump up flink-ml's Flink dependency.
> >
> > Overall it seems that the pros outweigh the cons. Looking forward to hearing what you think!
> >
> >
> > Regards,
> > Dong