Hi Rong,

That's great! Looking forward to your feedback.
Thanks,
Hequn

On Tue, Feb 11, 2020 at 1:06 AM Rong Rong <walter...@gmail.com> wrote:

> Yes. I think the argument is fairly valid: we can always adjust the API
> in the future. In fact, most of the APIs are labeled PublicEvolving at
> this moment.
> I was only trying to point out, for others when voting, that the
> interfaces in flink-ml-api might change in the near future.
>
> In fact, I am always +1 on moving flink-ml-api to /opt :-)
> Regarding the Python ML API: sorry for not noticing it earlier, as I
> haven't given it a deep look yet. Will do very soon!
>
> --
> Rong
>
> On Sun, Feb 9, 2020 at 7:33 PM Hequn Cheng <he...@apache.org> wrote:
>
>> Hi Rong,
>>
>> Thanks a lot for joining the discussion!
>>
>> It would be great if we can have a long term plan. My intention is to
>> provide a way for users to add dependencies of Flink ML, either through
>> the opt folder or the download page. This will become more and more
>> critical as Flink ML improves: as you said, there are multiple PRs
>> under review, and I'm also going to support the Python Pipeline API
>> soon [1].
>>
>> Meanwhile, it also makes sense to include the API in opt, so it
>> would probably not break the long term plan.
>> However, even if we find something wrong in the future, we can revisit
>> this easily instead of blocking the improvement for users. What do you
>> think?
>>
>> Best,
>> Hequn
>>
>> [1]
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Support-Python-ML-Pipeline-API-td37291.html
>>
>> On Sat, Feb 8, 2020 at 1:57 AM Rong Rong <walter...@gmail.com> wrote:
>>
>>> CC @Xu Yang <xuyang1...@gmail.com>
>>>
>>> Thanks for starting the discussion @Hequn Cheng <chenghe...@gmail.com>
>>> and sorry for joining the discussion late.
>>>
>>> I've mainly helped merge the code in flink-ml-api and flink-ml-lib in
>>> the past several months.
>>> IMO flink-ml-api is an extension on top of the Table API, and I agree
>>> that it should be treated as a part of the "core" core.
>>>
>>> However, given that there are multiple PRs still under review [1],
>>> would it be a better idea to come up with a long term plan first,
>>> before making the decision to move it to /opt now?
>>>
>>> --
>>> Rong
>>>
>>> [1]
>>> https://github.com/apache/flink/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aopen+label%3Acomponent%3DLibrary%2FMachineLearning+
>>>
>>> On Fri, Feb 7, 2020 at 5:54 AM Hequn Cheng <he...@apache.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> @Till Rohrmann <trohrm...@apache.org> Thanks for the great inputs. I
>>>> agree with you that we should have a long term plan for this. It
>>>> definitely deserves another discussion.
>>>> @Jeff Zhang <zjf...@gmail.com> Thanks for your reports and ideas. It's
>>>> a good idea to improve the error messages. Do we have any JIRAs for
>>>> it, or maybe we can create one?
>>>>
>>>> Thank you again for your feedback and suggestions. I will go on with
>>>> the PR. Thanks!
>>>>
>>>> Best,
>>>> Hequn
>>>>
>>>> On Thu, Feb 6, 2020 at 11:51 PM Jeff Zhang <zjf...@gmail.com> wrote:
>>>>
>>>> > I have another concern which may not be closely related to this
>>>> > thread. Since Flink doesn't include all the necessary jars, I think
>>>> > it is critical for Flink to display a meaningful error message when
>>>> > any class is missing. E.g., here's the error message when I use
>>>> > Kafka but forget to include flink-json. To be honest, this kind of
>>>> > error message is hard for new users to understand.
>>>> >
>>>> > Reason: No factory implements
>>>> > 'org.apache.flink.table.factories.DeserializationSchemaFactory'.
>>>> > The following properties are requested:
>>>> > connector.properties.bootstrap.servers=localhost:9092
>>>> > connector.properties.group.id=testGroup
>>>> > connector.properties.zookeeper.connect=localhost:2181
>>>> > connector.startup-mode=earliest-offset
>>>> > connector.topic=generated.events
>>>> > connector.type=kafka
>>>> > connector.version=universal
>>>> > format.type=json
>>>> > schema.0.data-type=VARCHAR(2147483647)
>>>> > schema.0.name=status
>>>> > schema.1.data-type=VARCHAR(2147483647)
>>>> > schema.1.name=direction
>>>> > schema.2.data-type=BIGINT
>>>> > schema.2.name=event_ts
>>>> > update-mode=append
>>>> >
>>>> > The following factories have been considered:
>>>> > org.apache.flink.table.catalog.hive.factories.HiveCatalogFactory
>>>> > org.apache.flink.table.module.hive.HiveModuleFactory
>>>> > org.apache.flink.table.module.CoreModuleFactory
>>>> > org.apache.flink.table.catalog.GenericInMemoryCatalogFactory
>>>> > org.apache.flink.table.sources.CsvBatchTableSourceFactory
>>>> > org.apache.flink.table.sources.CsvAppendTableSourceFactory
>>>> > org.apache.flink.table.sinks.CsvBatchTableSinkFactory
>>>> > org.apache.flink.table.sinks.CsvAppendTableSinkFactory
>>>> > org.apache.flink.table.planner.delegation.BlinkPlannerFactory
>>>> > org.apache.flink.table.planner.delegation.BlinkExecutorFactory
>>>> > org.apache.flink.table.planner.StreamPlannerFactory
>>>> > org.apache.flink.table.executor.StreamExecutorFactory
>>>> > org.apache.flink.streaming.connectors.kafka.KafkaTableSourceSinkFactory
>>>> >   at org.apache.flink.table.factories.TableFactoryService.filterByFactoryClass(TableFactoryService.java:238)
>>>> >   at org.apache.flink.table.factories.TableFactoryService.filter(TableFactoryService.java:185)
>>>> >   at org.apache.flink.table.factories.TableFactoryService.findSingleInternal(TableFactoryService.java:143)
>>>> >   at org.apache.flink.table.factories.TableFactoryService.find(TableFactoryService.java:113)
>>>> >   at org.apache.flink.streaming.connectors.kafka.KafkaTableSourceSinkFactoryBase.getDeserializationSchema(KafkaTableSourceSinkFactoryBase.java:277)
>>>> >   at org.apache.flink.streaming.connectors.kafka.KafkaTableSourceSinkFactoryBase.createStreamTableSource(KafkaTableSourceSinkFactoryBase.java:161)
>>>> >   at org.apache.flink.table.factories.StreamTableSourceFactory.createTableSource(StreamTableSourceFactory.java:49)
>>>> >   at org.apache.flink.table.factories.TableFactoryUtil.findAndCreateTableSource(TableFactoryUtil.java:53)
>>>> >   ... 36 more
>>>> >
>>>> > On Thu, Feb 6, 2020 at 11:30 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>> >
>>>> > > I would not object given that it is rather small at the moment.
>>>> > > However, I also think that we should have a plan for how to handle
>>>> > > the ever growing Flink ecosystem and how to make it easily
>>>> > > accessible to our users. E.g. one far-fetched idea could be
>>>> > > something like a configuration script which downloads the required
>>>> > > components for the user. But this definitely deserves a separate
>>>> > > discussion and does not really belong here.
>>>> > >
>>>> > > Cheers,
>>>> > > Till
>>>> > >
>>>> > > On Thu, Feb 6, 2020 at 3:35 PM Hequn Cheng <he...@apache.org> wrote:
>>>> > >
>>>> > > > Hi everyone,
>>>> > > >
>>>> > > > Thank you all for the great inputs!
>>>> > > >
>>>> > > > I think what we probably all agree on is that we should try to
>>>> > > > make a leaner flink-dist. However, we may also need to make some
>>>> > > > compromises for the sake of user experience, so that users don't
>>>> > > > need to download the dependencies from different places.
>>>> > > > Otherwise, we can move all the jars in the current opt folder to
>>>> > > > the download page.
>>>> > > >
>>>> > > > The lack of clear rules for guiding such compromises makes things
>>>> > > > more complicated now. I would agree that the decisive factor for
>>>> > > > what goes into Flink's binary distribution should be how core it
>>>> > > > is to Flink. Meanwhile, it's better to treat the Flink API as a
>>>> > > > (core) core to Flink. Not only is it a very clear rule that is
>>>> > > > easy to follow, but in most cases an API is also very significant
>>>> > > > and deserves to be included in the dist.
>>>> > > >
>>>> > > > Given this, it might make sense to put flink-ml-api and
>>>> > > > flink-ml-lib into opt.
>>>> > > > What do you think?
>>>> > > >
>>>> > > > Best,
>>>> > > > Hequn
>>>> > > >
>>>> > > > On Wed, Feb 5, 2020 at 12:39 AM Chesnay Schepler <ches...@apache.org> wrote:
>>>> > > >
>>>> > > >> Around a year ago I started a discussion
>>>> > > >> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/DISCUSS-Towards-a-leaner-flink-dist-tp25615.html>
>>>> > > >> on reducing the number of jars we ship with the distribution.
>>>> > > >>
>>>> > > >> While there was no definitive conclusion, there was a shared
>>>> > > >> sentiment that APIs should be shipped with the distribution.
>>>> > > >>
>>>> > > >> On 04/02/2020 17:25, Till Rohrmann wrote:
>>>> > > >>
>>>> > > >> I think there is no such rule that APIs go automatically into
>>>> > > >> opt/ and "libraries" do not. The contents of opt/ have mainly
>>>> > > >> grown over time w/o following a strict rule.
>>>> > > >>
>>>> > > >> I think the decisive factor for what goes into Flink's binary
>>>> > > >> distribution should be how core it is to Flink.
>>>> > > >> Of course, another important consideration is which use cases
>>>> > > >> Flink should promote "out of the box" (not sure whether this is
>>>> > > >> actually true for content shipped in opt/, because you also have
>>>> > > >> to move it to lib).
>>>> > > >>
>>>> > > >> For example, Gelly is something I would rather see as an optional
>>>> > > >> component than ship with every Flink binary distribution.
>>>> > > >>
>>>> > > >> Cheers,
>>>> > > >> Till
>>>> > > >>
>>>> > > >> On Tue, Feb 4, 2020 at 11:24 AM Becket Qin <becket....@gmail.com> wrote:
>>>> > > >>
>>>> > > >> Thanks for the suggestion, Till.
>>>> > > >>
>>>> > > >> I am curious about how we usually decide when to put jars into
>>>> > > >> the opt folder.
>>>> > > >>
>>>> > > >> Technically speaking, it seems that `flink-ml-api` should be put
>>>> > > >> into the opt directory because it is actually an API rather than
>>>> > > >> a library, just like CEP and Table.
>>>> > > >>
>>>> > > >> `flink-ml-lib` seems to be on the border. On one hand, it is a
>>>> > > >> library. On the other hand, unlike the SQL formats and Hadoop,
>>>> > > >> whose major code is outside of Flink, the algorithm code is in
>>>> > > >> Flink. So `flink-ml-lib` is more like the built-in SQL UDFs. So
>>>> > > >> it seems fine to put it either in the opt folder or on the
>>>> > > >> downloads page.
>>>> > > >>
>>>> > > >> From the user experience perspective, it might be better to have
>>>> > > >> both `flink-ml-lib` and `flink-ml-api` in the opt folder so users
>>>> > > >> needn't go to two places for the required dependencies.
>>>> > > >>
>>>> > > >> Thanks,
>>>> > > >>
>>>> > > >> Jiangjie (Becket) Qin
>>>> > > >>
>>>> > > >> On Tue, Feb 4, 2020 at 2:32 PM Hequn Cheng <he...@apache.org> wrote:
>>>> > > >>
>>>> > > >> Hi Till,
>>>> > > >>
>>>> > > >> Thanks a lot for your suggestion. It's a good idea to offer the
>>>> > > >> flink-ml libraries as optional dependencies on the download page,
>>>> > > >> which can make the dist smaller.
>>>> > > >>
>>>> > > >> But I also have some concerns about it, e.g., the download page
>>>> > > >> now only includes the latest 3 releases. We may need to find ways
>>>> > > >> to support more versions.
>>>> > > >> On the other hand, the size of the flink-ml libraries is now very
>>>> > > >> small (about 246K), so it would not have much impact on the size
>>>> > > >> of the dist.
>>>> > > >>
>>>> > > >> What do you think?
>>>> > > >>
>>>> > > >> Best,
>>>> > > >> Hequn
>>>> > > >>
>>>> > > >> On Mon, Feb 3, 2020 at 6:24 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>> > > >>
>>>> > > >> An alternative solution would be to offer the flink-ml libraries
>>>> > > >> as optional dependencies on the download page, similar to how we
>>>> > > >> offer the different SQL formats and Hadoop releases [1].
>>>> > > >>
>>>> > > >> [1] https://flink.apache.org/downloads.html
>>>> > > >>
>>>> > > >> Cheers,
>>>> > > >> Till
>>>> > > >>
>>>> > > >> On Mon, Feb 3, 2020 at 10:19 AM Hequn Cheng <he...@apache.org> wrote:
>>>> > > >>
>>>> > > >> Thank you all for your feedback and suggestions!
>>>> > > >>
>>>> > > >> Best, Hequn
>>>> > > >>
>>>> > > >> On Mon, Feb 3, 2020 at 5:07 PM Becket Qin <becket....@gmail.com> wrote:
>>>> > > >>
>>>> > > >> Thanks for bringing up the discussion, Hequn.
>>>> > > >>
>>>> > > >> +1 on adding `flink-ml-api` and `flink-ml-lib` into opt. This
>>>> > > >> would make it much easier for users to try out some simple ML
>>>> > > >> tasks.
>>>> > > >>
>>>> > > >> Thanks,
>>>> > > >>
>>>> > > >> Jiangjie (Becket) Qin
>>>> > > >>
>>>> > > >> On Mon, Feb 3, 2020 at 4:34 PM jincheng sun <sunjincheng...@gmail.com> wrote:
>>>> > > >>
>>>> > > >> Thank you for pushing forward @Hequn Cheng <he...@apache.org>!
>>>> > > >>
>>>> > > >> Hi @Becket Qin <becket....@gmail.com>, do you have any concerns
>>>> > > >> on this?
>>>> > > >>
>>>> > > >> Best,
>>>> > > >> Jincheng
>>>> > > >>
>>>> > > >> On Mon, Feb 3, 2020 at 2:09 PM Hequn Cheng <he...@apache.org> wrote:
>>>> > > >>
>>>> > > >> Hi everyone,
>>>> > > >>
>>>> > > >> Thanks for the feedback. As there are no objections, I've opened
>>>> > > >> a JIRA issue (FLINK-15847 [1]) to address this.
>>>> > > >> The implementation details can be discussed in the issue or in
>>>> > > >> the following PR.
>>>> > > >>
>>>> > > >> Best,
>>>> > > >> Hequn
>>>> > > >>
>>>> > > >> [1] https://issues.apache.org/jira/browse/FLINK-15847
>>>> > > >>
>>>> > > >> On Wed, Jan 8, 2020 at 9:15 PM Hequn Cheng <chenghe...@gmail.com> wrote:
>>>> > > >>
>>>> > > >> Hi Jincheng,
>>>> > > >>
>>>> > > >> Thanks a lot for your feedback!
>>>> > > >> Yes, I agree with you. There are cases where multiple jars need
>>>> > > >> to be uploaded. I will prepare another discussion later, maybe
>>>> > > >> with a simple design doc.
>>>> > > >>
>>>> > > >> Best, Hequn
>>>> > > >>
>>>> > > >> On Wed, Jan 8, 2020 at 3:06 PM jincheng sun <sunjincheng...@gmail.com> wrote:
>>>> > > >>
>>>> > > >> Thanks for bringing up this discussion, Hequn!
>>>> > > >>
>>>> > > >> +1 for including `flink-ml-api` and `flink-ml-lib` in opt.
>>>> > > >>
>>>> > > >> BTW: I think it would be great to bring up a discussion on
>>>> > > >> uploading multiple jars at the same time, as PyFlink jobs can
>>>> > > >> also benefit if we make that improvement.
>>>> > > >>
>>>> > > >> Best,
>>>> > > >> Jincheng
>>>> > > >>
>>>> > > >> On Wed, Jan 8, 2020 at 11:50 AM Hequn Cheng <chenghe...@gmail.com> wrote:
>>>> > > >>
>>>> > > >> Hi everyone,
>>>> > > >>
>>>> > > >> FLIP-39 [1] rebuilds the Flink ML pipeline on top of the Table
>>>> > > >> API, which moves Flink ML a step further. Based on it, users can
>>>> > > >> develop their ML jobs, and more and more machine learning
>>>> > > >> platforms are providing ML services.
>>>> > > >>
>>>> > > >> However, the problem now is that the jars of flink-ml-api and
>>>> > > >> flink-ml-lib only exist in the Maven repo. Whenever users want to
>>>> > > >> submit ML jobs, they can only depend on the ml modules and
>>>> > > >> package a fat jar.
>>>> > > >> This would be inconvenient, especially for the machine learning
>>>> > > >> platforms on which nearly all jobs depend on Flink ML modules and
>>>> > > >> have to package a fat jar.
>>>> > > >>
>>>> > > >> Given this, it would be better to include the jars of
>>>> > > >> flink-ml-api and flink-ml-lib in the `opt` folder, so that users
>>>> > > >> can directly use the jars with the binary release. For example,
>>>> > > >> users can move the jars into the `lib` folder or use -j to upload
>>>> > > >> the jars. (Currently, -j only supports uploading one jar.
>>>> > > >> Supporting multiple jars for -j can be discussed in another
>>>> > > >> thread.)
>>>> > > >>
>>>> > > >> The jars go in the `opt` folder instead of the `lib` folder
>>>> > > >> because currently the ml jars are still optional for the Flink
>>>> > > >> project by default.
>>>> > > >>
>>>> > > >> What do you think? Any feedback is welcome!
>>>> > > >>
>>>> > > >> Best,
>>>> > > >>
>>>> > > >> Hequn
>>>> > > >>
>>>> > > >> [1]
>>>> > > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-39+Flink+ML+pipeline+and+ML+libs
>>>> >
>>>> > --
>>>> > Best Regards
>>>> >
>>>> > Jeff Zhang