CC @Xu Yang <xuyang1...@gmail.com>

Thanks for starting the discussion @Hequn Cheng <chenghe...@gmail.com>, and sorry for joining late.

I've mainly helped merge the code in flink-ml-api and flink-ml-lib over the past several months. IMO flink-ml-api is an extension on top of the Table API, and I agree that it should be treated as part of the "core" core. However, given that there are multiple PRs still under review [1], would it be better to come up with a long-term plan first, before deciding to move it to /opt now?

--
Rong

[1] https://github.com/apache/flink/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aopen+label%3Acomponent%3DLibrary%2FMachineLearning+

On Fri, Feb 7, 2020 at 5:54 AM Hequn Cheng <he...@apache.org> wrote:

> Hi,
>
> @Till Rohrmann <trohrm...@apache.org> Thanks for the great inputs. I agree with you that we should have a long-term plan for this. It definitely deserves another discussion.
> @Jeff Zhang <zjf...@gmail.com> Thanks for your reports and ideas. It's a good idea to improve the error messages. Do we have any JIRAs for it, or should we create one?
>
> Thank you again for your feedback and suggestions. I will go on with the PR. Thanks!
>
> Best,
> Hequn
>
> On Thu, Feb 6, 2020 at 11:51 PM Jeff Zhang <zjf...@gmail.com> wrote:
>
> > I have another concern which may not be closely related to this thread. Since Flink doesn't include all the necessary jars, I think it is critical for Flink to display a meaningful error message when any class is missing. E.g., here's the error message when I use Kafka but forget to include flink-json. To be honest, this kind of error message is hard for new users to understand.
> >
> > Reason: No factory implements 'org.apache.flink.table.factories.DeserializationSchemaFactory'.
> >
> > The following properties are requested:
> > connector.properties.bootstrap.servers=localhost:9092
> > connector.properties.group.id=testGroup
> > connector.properties.zookeeper.connect=localhost:2181
> > connector.startup-mode=earliest-offset
> > connector.topic=generated.events
> > connector.type=kafka
> > connector.version=universal
> > format.type=json
> > schema.0.data-type=VARCHAR(2147483647)
> > schema.0.name=status
> > schema.1.data-type=VARCHAR(2147483647)
> > schema.1.name=direction
> > schema.2.data-type=BIGINT
> > schema.2.name=event_ts
> > update-mode=append
> >
> > The following factories have been considered:
> > org.apache.flink.table.catalog.hive.factories.HiveCatalogFactory
> > org.apache.flink.table.module.hive.HiveModuleFactory
> > org.apache.flink.table.module.CoreModuleFactory
> > org.apache.flink.table.catalog.GenericInMemoryCatalogFactory
> > org.apache.flink.table.sources.CsvBatchTableSourceFactory
> > org.apache.flink.table.sources.CsvAppendTableSourceFactory
> > org.apache.flink.table.sinks.CsvBatchTableSinkFactory
> > org.apache.flink.table.sinks.CsvAppendTableSinkFactory
> > org.apache.flink.table.planner.delegation.BlinkPlannerFactory
> > org.apache.flink.table.planner.delegation.BlinkExecutorFactory
> > org.apache.flink.table.planner.StreamPlannerFactory
> > org.apache.flink.table.executor.StreamExecutorFactory
> > org.apache.flink.streaming.connectors.kafka.KafkaTableSourceSinkFactory
> >     at org.apache.flink.table.factories.TableFactoryService.filterByFactoryClass(TableFactoryService.java:238)
> >     at org.apache.flink.table.factories.TableFactoryService.filter(TableFactoryService.java:185)
> >     at org.apache.flink.table.factories.TableFactoryService.findSingleInternal(TableFactoryService.java:143)
> >     at org.apache.flink.table.factories.TableFactoryService.find(TableFactoryService.java:113)
> >     at org.apache.flink.streaming.connectors.kafka.KafkaTableSourceSinkFactoryBase.getDeserializationSchema(KafkaTableSourceSinkFactoryBase.java:277)
> >     at org.apache.flink.streaming.connectors.kafka.KafkaTableSourceSinkFactoryBase.createStreamTableSource(KafkaTableSourceSinkFactoryBase.java:161)
> >     at org.apache.flink.table.factories.StreamTableSourceFactory.createTableSource(StreamTableSourceFactory.java:49)
> >     at org.apache.flink.table.factories.TableFactoryUtil.findAndCreateTableSource(TableFactoryUtil.java:53)
> >     ... 36 more
> >
> > Till Rohrmann <trohrm...@apache.org> wrote on Thu, Feb 6, 2020 at 11:30 PM:
> >
> > > I would not object given that it is rather small at the moment. However, I also think that we should have a plan for how to handle the ever-growing Flink ecosystem and how to make it easily accessible to our users. E.g., one far-fetched idea could be something like a configuration script which downloads the required components for the user. But this definitely deserves a separate discussion and does not really belong here.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Thu, Feb 6, 2020 at 3:35 PM Hequn Cheng <he...@apache.org> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > Thank you all for the great inputs!
> > > >
> > > > I think what we probably all agree on is that we should try to make a leaner flink-dist. However, we may also need to make some compromises for the sake of user experience, so that users don't need to download the dependencies from different places. Otherwise, we can move all the jars in the current opt folder to the download page.
> > > >
> > > > The lack of clear rules for guiding such compromises makes things more complicated now. I would agree that the decisive factor for what goes into Flink's binary distribution should be how core it is to Flink.
> > > > Meanwhile, it's better to treat the Flink API as a (core) core of Flink. Not only is this a very clear rule that is easy to follow, but in most cases the API is also significant enough to deserve inclusion in the dist.
> > > >
> > > > Given this, it might make sense to put flink-ml-api and flink-ml-lib into the opt.
> > > > What do you think?
> > > >
> > > > Best,
> > > > Hequn
> > > >
> > > > On Wed, Feb 5, 2020 at 12:39 AM Chesnay Schepler <ches...@apache.org> wrote:
> > > >
> > > >> Around a year ago I started a discussion
> > > >> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/DISCUSS-Towards-a-leaner-flink-dist-tp25615.html>
> > > >> on reducing the number of jars we ship with the distribution.
> > > >>
> > > >> While there was no definitive conclusion, there was a shared sentiment that APIs should be shipped with the distribution.
> > > >>
> > > >> On 04/02/2020 17:25, Till Rohrmann wrote:
> > > >>
> > > >> I think there is no such rule that APIs go automatically into opt/ and "libraries" not. The contents of opt/ have mainly grown over time w/o following a strict rule.
> > > >>
> > > >> I think the decisive factor for what goes into Flink's binary distribution should be how core it is to Flink. Of course another important consideration is which use cases Flink should promote "out of the box" (not sure whether this is actually true for content shipped in opt/ because you also have to move it to lib).
> > > >>
> > > >> For example, Gelly would be an example which I would rather see as an optional component than shipping it with every Flink binary distribution.
> > > >>
> > > >> Cheers,
> > > >> Till
> > > >>
> > > >> On Tue, Feb 4, 2020 at 11:24 AM Becket Qin <becket....@gmail.com> wrote:
> > > >>
> > > >> Thanks for the suggestion, Till.
> > > >>
> > > >> I am curious: how do we usually decide when to put jars into the opt folder?
> > > >>
> > > >> Technically speaking, it seems that `flink-ml-api` should be put into the opt directory because it is actually an API rather than a library, just like CEP and Table.
> > > >>
> > > >> `flink-ml-lib` seems to be on the border. On one hand, it is a library. On the other hand, unlike SQL formats and Hadoop, whose major code is outside of Flink, the algorithm code is in Flink. So `flink-ml-lib` is more like the built-in SQL UDFs. So it seems fine to put it either in the opt folder or on the downloads page.
> > > >>
> > > >> From the user experience perspective, it might be better to have both `flink-ml-lib` and `flink-ml-api` in the opt folder so users needn't go to two places for the required dependencies.
> > > >>
> > > >> Thanks,
> > > >>
> > > >> Jiangjie (Becket) Qin
> > > >>
> > > >> On Tue, Feb 4, 2020 at 2:32 PM Hequn Cheng <he...@apache.org> wrote:
> > > >>
> > > >> Hi Till,
> > > >>
> > > >> Thanks a lot for your suggestion. It's a good idea to offer the flink-ml libraries as optional dependencies on the download page, which can make the dist smaller.
> > > >>
> > > >> But I also have some concerns about it, e.g., the download page now only includes the latest 3 releases. We may need to find ways to support more versions.
> > > >> On the other hand, the size of the flink-ml libraries now is very small (about 246K), so it would not bring much impact on the size of dist.
> > > >>
> > > >> What do you think?
> > > >>
> > > >> Best,
> > > >> Hequn
> > > >>
> > > >> On Mon, Feb 3, 2020 at 6:24 PM Till Rohrmann <trohrm...@apache.org> wrote:
> > > >>
> > > >> An alternative solution would be to offer the flink-ml libraries as optional dependencies on the download page. Similar to how we offer the different SQL formats and Hadoop releases [1].
> > > >>
> > > >> [1] https://flink.apache.org/downloads.html
> > > >>
> > > >> Cheers,
> > > >> Till
> > > >>
> > > >> On Mon, Feb 3, 2020 at 10:19 AM Hequn Cheng <he...@apache.org> wrote:
> > > >>
> > > >> Thank you all for your feedback and suggestions!
> > > >>
> > > >> Best, Hequn
> > > >>
> > > >> On Mon, Feb 3, 2020 at 5:07 PM Becket Qin <becket....@gmail.com> wrote:
> > > >>
> > > >> Thanks for bringing up the discussion, Hequn.
> > > >>
> > > >> +1 on adding `flink-ml-api` and `flink-ml-lib` into opt. This would make it much easier for the users to try out some simple ml tasks.
> > > >>
> > > >> Thanks,
> > > >>
> > > >> Jiangjie (Becket) Qin
> > > >>
> > > >> On Mon, Feb 3, 2020 at 4:34 PM jincheng sun <sunjincheng...@gmail.com> wrote:
> > > >>
> > > >> Thank you for pushing forward @Hequn Cheng <he...@apache.org>!
> > > >>
> > > >> Hi @Becket Qin <becket....@gmail.com>, do you have any concerns on this?
> > > >>
> > > >> Best,
> > > >> Jincheng
> > > >>
> > > >> Hequn Cheng <he...@apache.org> wrote on Mon, Feb 3, 2020 at 2:09 PM:
> > > >>
> > > >> Hi everyone,
> > > >>
> > > >> Thanks for the feedback. As there are no objections, I've opened a JIRA issue (FLINK-15847 [1]) to address this.
> > > >> The implementation details can be discussed in the issue or in the following PR.
> > > >>
> > > >> Best,
> > > >> Hequn
> > > >>
> > > >> [1] https://issues.apache.org/jira/browse/FLINK-15847
> > > >>
> > > >> On Wed, Jan 8, 2020 at 9:15 PM Hequn Cheng <chenghe...@gmail.com> wrote:
> > > >>
> > > >> Hi Jincheng,
> > > >>
> > > >> Thanks a lot for your feedback!
> > > >> Yes, I agree with you. There are cases where multiple jars need to be uploaded. I will prepare another discussion later, maybe with a simple design doc.
> > > >>
> > > >> Best, Hequn
> > > >>
> > > >> On Wed, Jan 8, 2020 at 3:06 PM jincheng sun <sunjincheng...@gmail.com> wrote:
> > > >>
> > > >> Thanks for bringing up this discussion, Hequn!
> > > >>
> > > >> +1 for including `flink-ml-api` and `flink-ml-lib` in opt.
> > > >>
> > > >> BTW: I think it would be great to bring up a discussion on uploading multiple jars at the same time, as PyFlink jobs can also benefit if we make that improvement.
> > > >>
> > > >> Best,
> > > >> Jincheng
> > > >>
> > > >> Hequn Cheng <chenghe...@gmail.com> wrote on Wed, Jan 8, 2020 at 11:50 AM:
> > > >>
> > > >> Hi everyone,
> > > >>
> > > >> FLIP-39 [1] rebuilds the Flink ML pipeline on top of the Table API, which moves Flink ML a step further.
> > > >> Based on it, users can develop their ML jobs, and more and more machine learning platforms are providing ML services.
> > > >>
> > > >> However, the problem now is that the jars of flink-ml-api and flink-ml-lib only exist in the Maven repo. Whenever users want to submit ML jobs, they can only depend on the ml modules and package a fat jar. This is inconvenient, especially for machine learning platforms on which nearly all jobs depend on the Flink ML modules and have to package a fat jar.
> > > >>
> > > >> Given this, it would be better to include the jars of flink-ml-api and flink-ml-lib in the `opt` folder, so that users can directly use the jars with the binary release. For example, users can move the jars into the `lib` folder or use -j to upload the jars. (Currently, -j only supports uploading one jar. Supporting multiple jars for -j can be discussed in another discussion.)
> > > >>
> > > >> The jars go in the `opt` folder instead of the `lib` folder because currently the ml jars are still optional for the Flink project by default.
> > > >>
> > > >> What do you think? Any feedback is welcome!
> > > >>
> > > >> Best,
> > > >>
> > > >> Hequn
> > > >>
> > > >> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-39+Flink+ML+pipeline+and+ML+libs
> >
> > --
> > Best Regards
> >
> > Jeff Zhang
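
On Jeff's point about the "No factory implements ..." message: a rough sketch of what a friendlier hint could look like is below. `MissingFactoryHint` and its `describe` method are hypothetical names invented for illustration and are not part of Flink. The only format-to-jar pairing taken from this thread is json -> flink-json; the lookup table is an assumption about how such a hint might be organised, not how TableFactoryService works.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not a Flink class): builds an actionable hint for a failed format lookup. */
final class MissingFactoryHint {

    // Illustrative mapping from a 'format.type' value to the jar that provides it.
    // Only json -> flink-json comes from the thread; anything else would be an assumption.
    private static final Map<String, String> JAR_BY_FORMAT = new LinkedHashMap<>();

    static {
        JAR_BY_FORMAT.put("json", "flink-json");
    }

    private MissingFactoryHint() {}

    /** Returns a one-line message naming the likely missing jar, if one is known. */
    static String describe(String requestedFormat, List<String> consideredFactories) {
        StringBuilder sb = new StringBuilder();
        sb.append("No format factory found for format.type=")
          .append(requestedFormat)
          .append('.');

        String jar = JAR_BY_FORMAT.get(requestedFormat);
        if (jar != null) {
            sb.append(" The '").append(jar).append("' jar is probably missing from the classpath.");
        } else {
            sb.append(" Check that the jar providing this format is on the classpath.");
        }

        // Summarise instead of dumping every factory class name.
        sb.append(" Factories considered: ").append(consideredFactories.size()).append('.');
        return sb.toString();
    }
}
```

Calling `MissingFactoryHint.describe("json", factories)` would name flink-json directly instead of leaving the user to infer it from the list of considered factories. Whether such a table belongs in Flink itself or in the connector modules is a design question for the separate JIRA mentioned above.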