Re: [DISCUSS] Include flink-ml-api and flink-ml-lib in opt

Jeff Zhang Thu, 06 Feb 2020 07:51:34 -0800

I have another concern, which may not be closely related to this thread.
Since Flink doesn't include all the necessary jars, I think it is critical
for Flink to display a meaningful error message when a class is missing.
For example, here's the error message I get when I use Kafka but forget to
include flink-json. To be honest, this kind of error message is hard for
new users to understand.



Reason: No factory implements
'org.apache.flink.table.factories.DeserializationSchemaFactory'.

The following properties are requested:
connector.properties.bootstrap.servers=localhost:9092
connector.properties.group.id=testGroup
connector.properties.zookeeper.connect=localhost:2181
connector.startup-mode=earliest-offset
connector.topic=generated.events
connector.type=kafka
connector.version=universal
format.type=json
schema.0.data-type=VARCHAR(2147483647)
schema.0.name=status
schema.1.data-type=VARCHAR(2147483647)
schema.1.name=direction
schema.2.data-type=BIGINT
schema.2.name=event_ts
update-mode=append

The following factories have been considered:
org.apache.flink.table.catalog.hive.factories.HiveCatalogFactory
org.apache.flink.table.module.hive.HiveModuleFactory
org.apache.flink.table.module.CoreModuleFactory
org.apache.flink.table.catalog.GenericInMemoryCatalogFactory
org.apache.flink.table.sources.CsvBatchTableSourceFactory
org.apache.flink.table.sources.CsvAppendTableSourceFactory
org.apache.flink.table.sinks.CsvBatchTableSinkFactory
org.apache.flink.table.sinks.CsvAppendTableSinkFactory
org.apache.flink.table.planner.delegation.BlinkPlannerFactory
org.apache.flink.table.planner.delegation.BlinkExecutorFactory
org.apache.flink.table.planner.StreamPlannerFactory
org.apache.flink.table.executor.StreamExecutorFactory
org.apache.flink.streaming.connectors.kafka.KafkaTableSourceSinkFactory
    at org.apache.flink.table.factories.TableFactoryService.filterByFactoryClass(TableFactoryService.java:238)
    at org.apache.flink.table.factories.TableFactoryService.filter(TableFactoryService.java:185)
    at org.apache.flink.table.factories.TableFactoryService.findSingleInternal(TableFactoryService.java:143)
    at org.apache.flink.table.factories.TableFactoryService.find(TableFactoryService.java:113)
    at org.apache.flink.streaming.connectors.kafka.KafkaTableSourceSinkFactoryBase.getDeserializationSchema(KafkaTableSourceSinkFactoryBase.java:277)
    at org.apache.flink.streaming.connectors.kafka.KafkaTableSourceSinkFactoryBase.createStreamTableSource(KafkaTableSourceSinkFactoryBase.java:161)
    at org.apache.flink.table.factories.StreamTableSourceFactory.createTableSource(StreamTableSourceFactory.java:49)
    at org.apache.flink.table.factories.TableFactoryUtil.findAndCreateTableSource(TableFactoryUtil.java:53)
    ... 36 more
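
A rough sketch of the kind of hint that would be easier to act on, written in
Python purely for illustration. It is not Flink code: the function name
`missing_factory_hint` and the `FORMAT_MODULES` table are hypothetical, and the
mapping from `format.type` values to module names is an assumption made for
this example. The idea is only that the requested `format.type` property
already tells us which jar is probably missing.

```python
# Hypothetical mapping from 'format.type' values to the Flink module that
# provides the matching format factory. Illustrative only, not Flink's API.
FORMAT_MODULES = {
    "json": "flink-json",
    "csv": "flink-csv",
    "avro": "flink-avro",
}


def missing_factory_hint(factory_class, properties):
    """Build a short, actionable message for a failed factory lookup."""
    # Show only the simple class name; the full name is noise for new users.
    simple_name = factory_class.rsplit(".", 1)[-1]
    lines = [f"No factory implements '{simple_name}'."]

    format_type = properties.get("format.type")
    if format_type is not None:
        module = FORMAT_MODULES.get(format_type)
        if module is not None:
            lines.append(
                f"The table uses format.type={format_type}, which is provided "
                f"by '{module}'. Is that jar on the classpath?"
            )
        else:
            lines.append(f"No known module provides format.type={format_type}.")
    return "\n".join(lines)


hint = missing_factory_hint(
    "org.apache.flink.table.factories.DeserializationSchemaFactory",
    {"connector.type": "kafka", "format.type": "json"},
)
print(hint)
```

For the failure above, such a hint would point at flink-json directly instead
of leaving the user to infer it from the list of considered factories.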



On Thu, Feb 6, 2020 at 11:30 PM Till Rohrmann <[email protected]> wrote:

> I would not object given that it is rather small at the moment. However, I
> also think that we should have a plan for how to handle the ever-growing
> Flink ecosystem and how to make it easily accessible to our users. E.g. one
> far-fetched idea could be something like a configuration script which
> downloads the required components for the user. But this definitely deserves
> a separate discussion and does not really belong here.
>
> Cheers,
> Till
>
> On Thu, Feb 6, 2020 at 3:35 PM Hequn Cheng <[email protected]> wrote:
>
> >
> > Hi everyone,
> >
> > Thank you all for the great inputs!
> >
> > I think what we probably all agree on is that we should try to make
> > flink-dist leaner. However, we may also need to make some compromises for
> > the sake of user experience, so that users don't need to download
> > dependencies from different places. Otherwise, we could move all the jars
> > in the current opt folder to the download page.
> >
> > The lack of clear rules to guide such compromises makes things more
> > complicated now. I would agree that the decisive factor for what goes into
> > Flink's binary distribution should be how core it is to Flink. Meanwhile,
> > it is better to treat the Flink APIs as part of Flink's core. Not only is
> > this a very clear rule that is easy to follow, but in most cases an API is
> > also significant enough to deserve inclusion in the dist.
> >
> > Given this, it might make sense to put flink-ml-api and flink-ml-lib into
> > the opt.
> > What do you think?
> >
> > Best,
> > Hequn
> >
> > On Wed, Feb 5, 2020 at 12:39 AM Chesnay Schepler <[email protected]>
> > wrote:
> >
> >> Around a year ago I started a discussion
> >> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/DISCUSS-Towards-a-leaner-flink-dist-tp25615.html>
> >> on reducing the number of jars we ship with the distribution.
> >>
> >> While there was no definitive conclusion, there was a shared sentiment
> >> that APIs should be shipped with the distribution.
> >>
> >> On 04/02/2020 17:25, Till Rohrmann wrote:
> >>
> >> I think there is no such rule that APIs go automatically into opt/ and
> >> "libraries" do not. The contents of opt/ have mainly grown over time w/o
> >> following a strict rule.
> >>
> >> I think the decisive factor for what goes into Flink's binary
> >> distribution should be how core it is to Flink. Of course another
> >> important consideration is which use cases Flink should promote "out of
> >> the box" (not sure whether this is actually true for content shipped in
> >> opt/ because you also have to move it to lib).
> >>
> >> For example, Gelly is something I would rather see as an optional
> >> component than ship with every Flink binary distribution.
> >>
> >> Cheers,
> >> Till
> >>
> >> On Tue, Feb 4, 2020 at 11:24 AM Becket Qin <[email protected]> wrote:
> >>
> >>
> >> Thanks for the suggestion, Till.
> >>
> >> I am curious about how we usually decide when to put jars into the opt
> >> folder.
> >>
> >> Technically speaking, it seems that `flink-ml-api` should be put into the
> >> opt directory because it is actually an API rather than a library, just
> >> like CEP and Table.
> >>
> >> `flink-ml-lib` seems to be on the border. On one hand, it is a library.
> >> On the other hand, unlike SQL formats and Hadoop, whose code mostly lives
> >> outside of Flink, the algorithm code is in Flink. So `flink-ml-lib` is
> >> more like the built-in SQL UDFs. So it seems fine to put it either in the
> >> opt folder or on the downloads page.
> >>
> >> From the user experience perspective, it might be better to have both
> >> `flink-ml-lib` and `flink-ml-api` in the opt folder so users needn't go
> >> to two places for the required dependencies.
> >>
> >> Thanks,
> >>
> >> Jiangjie (Becket) Qin
> >>
> >> On Tue, Feb 4, 2020 at 2:32 PM Hequn Cheng <[email protected]> wrote:
> >>
> >>
> >> Hi Till,
> >>
> >> Thanks a lot for your suggestion. It's a good idea to offer the flink-ml
> >> libraries as optional dependencies on the download page, which can make
> >> the dist smaller.
> >>
> >> But I also have some concerns about it, e.g., the download page now only
> >> includes the latest 3 releases. We may need to find ways to support more
> >> versions.
> >> On the other hand, the size of the flink-ml libraries is currently very
> >> small (about 246K), so it would not have much impact on the size of the
> >> dist.
> >>
> >> What do you think?
> >>
> >> Best,
> >> Hequn
> >>
> >> On Mon, Feb 3, 2020 at 6:24 PM Till Rohrmann <[email protected]> wrote:
> >>
> >> An alternative solution would be to offer the flink-ml libraries as
> >> optional dependencies on the download page. Similar to how we offer the
> >> different SQL formats and Hadoop releases [1].
> >>
> >> [1] https://flink.apache.org/downloads.html
> >>
> >> Cheers,
> >> Till
> >>
> >> On Mon, Feb 3, 2020 at 10:19 AM Hequn Cheng <[email protected]> wrote:
> >>
> >>
> >> Thank you all for your feedback and suggestions!
> >>
> >> Best, Hequn
> >>
> >> On Mon, Feb 3, 2020 at 5:07 PM Becket Qin <[email protected]> wrote:
> >>
> >> Thanks for bringing up the discussion, Hequn.
> >>
> >> +1 on adding `flink-ml-api` and `flink-ml-lib` into opt. This would make
> >> it much easier for users to try out some simple ML tasks.
> >>
> >> Thanks,
> >>
> >> Jiangjie (Becket) Qin
> >>
> >> On Mon, Feb 3, 2020 at 4:34 PM jincheng sun <[email protected]> wrote:
> >>
> >>
> >> Thank you for pushing this forward, @Hequn Cheng <[email protected]>!
> >>
> >> Hi @Becket Qin <[email protected]>, do you have any concerns about this?
> >>
> >> Best,
> >> Jincheng
> >>
> >> On Mon, Feb 3, 2020 at 2:09 PM Hequn Cheng <[email protected]> wrote:
> >>
> >>
> >> Hi everyone,
> >>
> >> Thanks for the feedback. As there are no objections, I've opened a JIRA
> >> issue (FLINK-15847 [1]) to address this.
> >> The implementation details can be discussed in the issue or in the
> >> following PR.
> >>
> >> Best,
> >> Hequn
> >>
> >> [1] https://issues.apache.org/jira/browse/FLINK-15847
> >>
> >> On Wed, Jan 8, 2020 at 9:15 PM Hequn Cheng <[email protected]> wrote:
> >>
> >> Hi Jincheng,
> >>
> >> Thanks a lot for your feedback!
> >> Yes, I agree with you. There are cases where multiple jars need to be
> >> uploaded. I will prepare another discussion later, maybe with a simple
> >> design doc.
> >>
> >> Best, Hequn
> >>
> >> On Wed, Jan 8, 2020 at 3:06 PM jincheng sun <[email protected]> wrote:
> >>
> >>
> >> Thanks for bringing up this discussion, Hequn!
> >>
> >> +1 for including `flink-ml-api` and `flink-ml-lib` in opt.
> >>
> >> BTW: I think it would be great to bring up a discussion on uploading
> >> multiple jars at the same time, as PyFlink jobs could also benefit from
> >> that improvement.
> >>
> >> Best,
> >> Jincheng
> >>
> >>
> >> On Wed, Jan 8, 2020 at 11:50 AM Hequn Cheng <[email protected]> wrote:
> >>
> >>
> >> Hi everyone,
> >>
> >> FLIP-39 [1] rebuilds the Flink ML pipeline on top of the Table API, which
> >> moves Flink ML a step further. Based on it, users can develop their ML
> >> jobs, and more and more machine learning platforms are providing ML
> >> services.
> >>
> >> However, the problem now is that the jars of flink-ml-api and
> >> flink-ml-lib only exist in the Maven repo. Whenever users want to submit
> >> ML jobs, they can only depend on the ML modules and package a fat jar.
> >> This is inconvenient, especially for machine learning platforms on which
> >> nearly all jobs depend on the Flink ML modules and have to package a fat
> >> jar.
> >>
> >> Given this, it would be better to include the jars of flink-ml-api and
> >> flink-ml-lib in the `opt` folder, so that users can directly use the jars
> >> with the binary release. For example, users can move the jars into the
> >> `lib` folder or use -j to upload the jars. (Currently, -j only supports
> >> uploading one jar. Supporting multiple jars for -j can be discussed in
> >> another discussion.)
> >>
> >> The jars go in the `opt` folder instead of the `lib` folder because,
> >> currently, the ML jars are still optional for the Flink project by
> >> default.
> >>
> >> What do you think? Any feedback is welcome!
> >>
> >> Best,
> >>
> >> Hequn
> >>
> >> [1]
> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-39+Flink+ML+pipeline+and+ML+libs
>


-- 
Best Regards

Jeff Zhang
