Re: [DISCUSS] Include flink-ml-api and flink-ml-lib in opt

Rong Rong Fri, 07 Feb 2020 09:57:32 -0800

CC @Xu Yang <[email protected]>

Thanks for starting the discussion @Hequn Cheng <[email protected]> and
sorry for joining the discussion late.


I've mainly helped merge the code in flink-ml-api and flink-ml-lib over the
past several months.
IMO flink-ml-api is an extension on top of the Table API, and I agree
that it should be treated as part of the "core" core.

However, given that there are multiple PRs still under review [1], would it
be better to come up with a long-term plan first, before deciding to move
it to /opt now?


--
Rong

[1]
https://github.com/apache/flink/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aopen+label%3Acomponent%3DLibrary%2FMachineLearning+

On Fri, Feb 7, 2020 at 5:54 AM Hequn Cheng <[email protected]> wrote:

> Hi,
>
> @Till Rohrmann <[email protected]> Thanks for the great inputs. I agree
> with you that we should have a long term plan for this. It definitely
> deserves another discussion.
> @Jeff Zhang <[email protected]> Thanks for your reports and ideas. It's a
> good idea to improve the error messages. Do we have any JIRAs for it, or
> should we create one?
>
> Thank you again for your feedback and suggestions. I will go on with the
> PR. Thanks!
>
> Best,
> Hequn
>
> On Thu, Feb 6, 2020 at 11:51 PM Jeff Zhang <[email protected]> wrote:
>
> > I have another concern which may not be closely related to this thread.
> > Since Flink doesn't include all the necessary jars, I think it is critical
> > for Flink to display a meaningful error message when any class is missing.
> > E.g., here's the error message when I use Kafka but forget to include
> > flink-json. To be honest, this kind of error message is hard for new users
> > to understand.
> >
> >
> > Reason: No factory implements
> > 'org.apache.flink.table.factories.DeserializationSchemaFactory'.
> >
> > The following properties are requested:
> > connector.properties.bootstrap.servers=localhost:9092
> > connector.properties.group.id=testGroup
> > connector.properties.zookeeper.connect=localhost:2181
> > connector.startup-mode=earliest-offset
> > connector.topic=generated.events
> > connector.type=kafka
> > connector.version=universal
> > format.type=json
> > schema.0.data-type=VARCHAR(2147483647)
> > schema.0.name=status
> > schema.1.data-type=VARCHAR(2147483647)
> > schema.1.name=direction
> > schema.2.data-type=BIGINT
> > schema.2.name=event_ts
> > update-mode=append
> >
> > The following factories have been considered:
> > org.apache.flink.table.catalog.hive.factories.HiveCatalogFactory
> > org.apache.flink.table.module.hive.HiveModuleFactory
> > org.apache.flink.table.module.CoreModuleFactory
> > org.apache.flink.table.catalog.GenericInMemoryCatalogFactory
> > org.apache.flink.table.sources.CsvBatchTableSourceFactory
> > org.apache.flink.table.sources.CsvAppendTableSourceFactory
> > org.apache.flink.table.sinks.CsvBatchTableSinkFactory
> > org.apache.flink.table.sinks.CsvAppendTableSinkFactory
> > org.apache.flink.table.planner.delegation.BlinkPlannerFactory
> > org.apache.flink.table.planner.delegation.BlinkExecutorFactory
> > org.apache.flink.table.planner.StreamPlannerFactory
> > org.apache.flink.table.executor.StreamExecutorFactory
> > org.apache.flink.streaming.connectors.kafka.KafkaTableSourceSinkFactory
> > at org.apache.flink.table.factories.TableFactoryService.filterByFactoryClass(TableFactoryService.java:238)
> > at org.apache.flink.table.factories.TableFactoryService.filter(TableFactoryService.java:185)
> > at org.apache.flink.table.factories.TableFactoryService.findSingleInternal(TableFactoryService.java:143)
> > at org.apache.flink.table.factories.TableFactoryService.find(TableFactoryService.java:113)
> > at org.apache.flink.streaming.connectors.kafka.KafkaTableSourceSinkFactoryBase.getDeserializationSchema(KafkaTableSourceSinkFactoryBase.java:277)
> > at org.apache.flink.streaming.connectors.kafka.KafkaTableSourceSinkFactoryBase.createStreamTableSource(KafkaTableSourceSinkFactoryBase.java:161)
> > at org.apache.flink.table.factories.StreamTableSourceFactory.createTableSource(StreamTableSourceFactory.java:49)
> > at org.apache.flink.table.factories.TableFactoryUtil.findAndCreateTableSource(TableFactoryUtil.java:53)
> > ... 36 more
> >
> >
> >
> > On Thu, Feb 6, 2020 at 11:30 PM Till Rohrmann <[email protected]> wrote:
> >
> > > I would not object given that it is rather small at the moment. However,
> > > I also think that we should have a plan for how to handle the ever-growing
> > > Flink ecosystem and how to make it easily accessible to our users. E.g.,
> > > one far-fetched idea could be something like a configuration script which
> > > downloads the required components for the user. But this definitely
> > > deserves a separate discussion and does not really belong here.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Thu, Feb 6, 2020 at 3:35 PM Hequn Cheng <[email protected]> wrote:
> > >
> > > >
> > > > Hi everyone,
> > > >
> > > > Thank you all for the great inputs!
> > > >
> > > > > I think what we probably all agree on is that we should try to make
> > > > > flink-dist leaner. However, we may also need to make some compromises
> > > > > for the sake of user experience, so that users don't need to download
> > > > > dependencies from different places. Otherwise, we could move all the
> > > > > jars in the current opt folder to the download page.
> > > > >
> > > > > The lack of clear rules to guide such compromises makes things more
> > > > > complicated now. I agree that the decisive factor for what goes into
> > > > > Flink's binary distribution should be how core it is to Flink.
> > > > > Meanwhile, it's better to treat the Flink APIs as a (core) core of
> > > > > Flink. Not only is this a very clear rule that is easy to follow, but
> > > > > in most cases the API is also very significant and deserves to be
> > > > > included in the dist.
> > > > >
> > > > > Given this, it might make sense to put flink-ml-api and flink-ml-lib
> > > > > into opt.
> > > > > What do you think?
> > > >
> > > > Best,
> > > > Hequn
> > > >
> > > > > On Wed, Feb 5, 2020 at 12:39 AM Chesnay Schepler <[email protected]>
> > > > > wrote:
> > > > >
> > > > >> Around a year ago I started a discussion
> > > > >> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/DISCUSS-Towards-a-leaner-flink-dist-tp25615.html>
> > > > >> on reducing the number of jars we ship with the distribution.
> > > > >>
> > > > >> While there was no definitive conclusion, there was a shared sentiment
> > > > >> that APIs should be shipped with the distribution.
> > > >>
> > > >> On 04/02/2020 17:25, Till Rohrmann wrote:
> > > >>
> > > > >> I think there is no such rule that APIs go automatically into opt/
> > > > >> and "libraries" not. The contents of opt/ have mainly grown over time
> > > > >> w/o following a strict rule.
> > > > >>
> > > > >> I think the decisive factor for what goes into Flink's binary
> > > > >> distribution should be how core it is to Flink. Of course another
> > > > >> important consideration is which use cases Flink should promote "out
> > > > >> of the box" (not sure whether this is actually true for content
> > > > >> shipped in opt/ because you also have to move it to lib).
> > > > >>
> > > > >> For example, Gelly would be an example which I would rather see as an
> > > > >> optional component than shipping it with every Flink binary
> > > > >> distribution.
> > > >> Cheers,
> > > >> Till
> > > >>
> > > >> On Tue, Feb 4, 2020 at 11:24 AM Becket Qin <[email protected]> <
> > > [email protected]> wrote:
> > > >>
> > > >>
> > > >> Thanks for the suggestion, Till.
> > > >>
> > > > >> I am curious about how we usually decide when to put jars into the
> > > > >> opt folder.
> > > > >>
> > > > >> Technically speaking, it seems that `flink-ml-api` should be put into
> > > > >> the opt directory because it is actually an API rather than a library,
> > > > >> just like CEP and Table.
> > > > >>
> > > > >> `flink-ml-lib` seems to be on the border. On one hand, it is a
> > > > >> library. On the other hand, unlike the SQL formats and Hadoop, whose
> > > > >> main code lives outside of Flink, the algorithm code is in Flink. So
> > > > >> `flink-ml-lib` is more like the built-in SQL UDFs. So it seems fine
> > > > >> to put it either in the opt folder or on the downloads page.
> > > > >>
> > > > >> From the user experience perspective, it might be better to have both
> > > > >> `flink-ml-lib` and `flink-ml-api` in the opt folder so users needn't
> > > > >> go to two places for the required dependencies.
> > > >>
> > > >> Thanks,
> > > >>
> > > >> Jiangjie (Becket) Qin
> > > >>
> > > >> On Tue, Feb 4, 2020 at 2:32 PM Hequn Cheng <[email protected]> <
> > > [email protected]> wrote:
> > > >>
> > > >>
> > > >> Hi Till,
> > > >>
> > > > >> Thanks a lot for your suggestion. It's a good idea to offer the
> > > > >> flink-ml libraries as optional dependencies on the download page,
> > > > >> which can make the dist smaller.
> > > > >>
> > > > >> But I also have some concerns about it, e.g., the download page now
> > > > >> only includes the latest 3 releases. We may need to find ways to
> > > > >> support more versions.
> > > > >> On the other hand, the size of the flink-ml libraries is now very
> > > > >> small (about 246K), so it would not have much impact on the size of
> > > > >> the dist.
> > > >>
> > > >> What do you think?
> > > >>
> > > >> Best,
> > > >> Hequn
> > > >>
> > > >> On Mon, Feb 3, 2020 at 6:24 PM Till Rohrmann <[email protected]>
> <
> > > [email protected]>
> > > >>
> > > >> wrote:
> > > >>
> > > > >> An alternative solution would be to offer the flink-ml libraries as
> > > > >> optional dependencies on the download page, similar to how we offer
> > > > >> the different SQL formats and Hadoop releases [1].
> > > >>
> > > >> [1] https://flink.apache.org/downloads.html
> > > >>
> > > >> Cheers,
> > > >> Till
> > > >>
> > > >> On Mon, Feb 3, 2020 at 10:19 AM Hequn Cheng <[email protected]> <
> > > [email protected]> wrote:
> > > >>
> > > >>
> > > >> Thank you all for your feedback and suggestions!
> > > >>
> > > >> Best, Hequn
> > > >>
> > > >> On Mon, Feb 3, 2020 at 5:07 PM Becket Qin <[email protected]> <
> > > [email protected]>
> > > >>
> > > >> wrote:
> > > >>
> > > >> Thanks for bringing up the discussion, Hequn.
> > > >>
> > > > >> +1 on adding `flink-ml-api` and `flink-ml-lib` into opt. This would
> > > > >> make it much easier for users to try out some simple ML tasks.
> > > >>
> > > >> Thanks,
> > > >>
> > > >> Jiangjie (Becket) Qin
> > > >>
> > > > >> On Mon, Feb 3, 2020 at 4:34 PM jincheng sun <[email protected]> wrote:
> > > > >>
> > > > >>
> > > > >> Thank you for pushing this forward, @Hequn Cheng <[email protected]>!
> > > > >>
> > > > >> Hi @Becket Qin <[email protected]>, do you have any concerns about
> > > > >> this?
> > > >>
> > > >> Best,
> > > >> Jincheng
> > > >>
> > > > >> On Mon, Feb 3, 2020 at 2:09 PM Hequn Cheng <[email protected]> <
> > > > [email protected]> wrote:
> > > >>
> > > >>
> > > >> Hi everyone,
> > > >>
> > > > >> Thanks for the feedback. As there are no objections, I've opened a
> > > > >> JIRA issue (FLINK-15847 [1]) to address this.
> > > > >> The implementation details can be discussed in the issue or in the
> > > > >> following PR.
> > > >>
> > > >> Best,
> > > >> Hequn
> > > >>
> > > >> [1] https://issues.apache.org/jira/browse/FLINK-15847
> > > >>
> > > >> On Wed, Jan 8, 2020 at 9:15 PM Hequn Cheng <[email protected]> <
> > > [email protected]>
> > > >>
> > > >> wrote:
> > > >>
> > > >> Hi Jincheng,
> > > >>
> > > >> Thanks a lot for your feedback!
> > > > >> Yes, I agree with you. There are cases where multiple jars need to
> > > > >> be uploaded. I will prepare another discussion later, maybe with a
> > > > >> simple design doc.
> > > >>
> > > >> Best, Hequn
> > > >>
> > > > >> On Wed, Jan 8, 2020 at 3:06 PM jincheng sun <[email protected]>
> > > > >> wrote:
> > > > >>
> > > > >>
> > > > >> Thanks for bringing up this discussion, Hequn!
> > > > >>
> > > > >> +1 for including `flink-ml-api` and `flink-ml-lib` in opt.
> > > > >>
> > > > >> BTW: I think it would be great to start a discussion on uploading
> > > > >> multiple jars at the same time, as PyFlink jobs could also benefit
> > > > >> from that improvement.
> > > >>
> > > >> Best,
> > > >> Jincheng
> > > >>
> > > >>
> > > > >> On Wed, Jan 8, 2020 at 11:50 AM Hequn Cheng <[email protected]> <
> > > > [email protected]> wrote:
> > > >>
> > > >>
> > > >> Hi everyone,
> > > >>
> > > > >> FLIP-39 [1] rebuilds the Flink ML pipeline on top of the Table API,
> > > > >> which moves Flink ML a step further. Based on it, users can develop
> > > > >> their ML jobs, and more and more machine learning platforms are
> > > > >> providing ML services.
> > > > >>
> > > > >> However, the problem now is that the jars of flink-ml-api and
> > > > >> flink-ml-lib only exist in the Maven repo. Whenever users want to
> > > > >> submit ML jobs, they can only depend on the ML modules and package a
> > > > >> fat jar. This is inconvenient, especially for machine learning
> > > > >> platforms on which nearly all jobs depend on the Flink ML modules and
> > > > >> have to package a fat jar.
> > > > >>
> > > > >> Given this, it would be better to include the jars of flink-ml-api
> > > > >> and flink-ml-lib in the `opt` folder, so that users can directly use
> > > > >> the jars with the binary release. For example, users can move the
> > > > >> jars into the `lib` folder or use -j to upload the jars. (Currently,
> > > > >> -j only supports uploading one jar. Supporting multiple jars for -j
> > > > >> can be discussed in another thread.)
> > > > >>
> > > > >> The jars go in the `opt` folder instead of the `lib` folder because,
> > > > >> currently, the ML jars are still optional for the Flink project by
> > > > >> default.
> > > > >>
> > > > >> What do you think? Any feedback is welcome!
> > > >>
> > > >> Best,
> > > >>
> > > >> Hequn
> > > >>
> > > >> [1]
> > > >>
> > > >>
> > > >>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-39+Flink+ML+pipeline+and+ML+libs
> > > >>
> > > >>
> > > >>
> > >
> >
> >
> > --
> > Best Regards
> >
> > Jeff Zhang
> >
>
