Re: [DISCUSS] Include flink-ml-api and flink-ml-lib in opt

Hequn Cheng Mon, 10 Feb 2020 17:40:41 -0800

Hi Rong,

That's great! Looking forward to your feedback.


Thanks,
Hequn


On Tue, Feb 11, 2020 at 1:06 AM Rong Rong <walter...@gmail.com> wrote:

> Yes. I think the argument is fairly valid: we can always adjust the API
> in the future. In fact, most of the APIs are labeled @PublicEvolving at
> this moment.
> I was only trying to let others know, when voting, that the interfaces in
> flink-ml-api might change in the near future.
>
> In fact, I am always +1 on moving flink-ml-api to /opt :-)
> Regarding the Python ML API, sorry for not noticing it earlier; I haven't
> given it a deep look yet. Will do very soon!
>
> --
> Rong
>
> On Sun, Feb 9, 2020 at 7:33 PM Hequn Cheng <he...@apache.org> wrote:
>
>> Hi Rong,
>>
>> Thanks a lot for joining the discussion!
>>
>> It would be great if we can have a long-term plan. My intention is to
>> provide a way for users to add the Flink ML dependencies, either through
>> opt or through the download page. This will become more and more critical
>> as Flink ML improves: as you said, there are multiple PRs under review,
>> and I'm also going to work on supporting the Python Pipeline API soon [1].
>>
>> Meanwhile, it also makes sense to include the API in opt, so it would
>> probably not break the long-term plan.
>> And even if we find something wrong in the future, we can revisit this
>> easily instead of blocking the improvement for users. What do you think?
>>
>> Best,
>> Hequn
>>
>> [1]
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Support-Python-ML-Pipeline-API-td37291.html
>>
>> On Sat, Feb 8, 2020 at 1:57 AM Rong Rong <walter...@gmail.com> wrote:
>>
>>> CC @Xu Yang <xuyang1...@gmail.com>
>>>
>>> Thanks for starting the discussion @Hequn Cheng <chenghe...@gmail.com>, and
>>> sorry for joining the discussion late.
>>>
>>> I've mainly helped merge the code in flink-ml-api and flink-ml-lib over
>>> the past several months.
>>> IMO flink-ml-api is an extension on top of the Table API, and I agree
>>> that it should be treated as part of the "core" core.
>>>
>>> However, given that there are multiple PRs still under review [1], would
>>> it be better to come up with a long-term plan first, before deciding to
>>> move it to /opt now?
>>>
>>>
>>> --
>>> Rong
>>>
>>> [1]
>>> https://github.com/apache/flink/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aopen+label%3Acomponent%3DLibrary%2FMachineLearning+
>>>
>>> On Fri, Feb 7, 2020 at 5:54 AM Hequn Cheng <he...@apache.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> @Till Rohrmann <trohrm...@apache.org> Thanks for the great input. I agree
>>>> with you that we should have a long-term plan for this. It definitely
>>>> deserves another discussion.
>>>> @Jeff Zhang <zjf...@gmail.com> Thanks for your report and ideas. It's a
>>>> good idea to improve the error messages. Do we have any JIRA issues for
>>>> it? If not, maybe we can create one.
>>>>
>>>> Thank you again for your feedback and suggestions. I will go on with the
>>>> PR. Thanks!
>>>>
>>>> Best,
>>>> Hequn
>>>>
>>>> On Thu, Feb 6, 2020 at 11:51 PM Jeff Zhang <zjf...@gmail.com> wrote:
>>>>
>>>> > I have another concern, which may not be closely related to this thread.
>>>> > Since Flink doesn't include all the necessary jars, I think it is
>>>> > critical for Flink to display a meaningful error message when a class
>>>> > is missing. For example, here's the error message I get when I use
>>>> > Kafka but forget to include flink-json. To be honest, this kind of
>>>> > error message is hard for new users to understand.
>>>> >
>>>> >
>>>> > Reason: No factory implements
>>>> > 'org.apache.flink.table.factories.DeserializationSchemaFactory'.
>>>> >
>>>> > The following properties are requested:
>>>> > connector.properties.bootstrap.servers=localhost:9092
>>>> > connector.properties.group.id=testGroup
>>>> > connector.properties.zookeeper.connect=localhost:2181
>>>> > connector.startup-mode=earliest-offset
>>>> > connector.topic=generated.events
>>>> > connector.type=kafka
>>>> > connector.version=universal
>>>> > format.type=json
>>>> > schema.0.data-type=VARCHAR(2147483647)
>>>> > schema.0.name=status
>>>> > schema.1.data-type=VARCHAR(2147483647)
>>>> > schema.1.name=direction
>>>> > schema.2.data-type=BIGINT
>>>> > schema.2.name=event_ts
>>>> > update-mode=append
>>>> >
>>>> > The following factories have been considered:
>>>> > org.apache.flink.table.catalog.hive.factories.HiveCatalogFactory
>>>> > org.apache.flink.table.module.hive.HiveModuleFactory
>>>> > org.apache.flink.table.module.CoreModuleFactory
>>>> > org.apache.flink.table.catalog.GenericInMemoryCatalogFactory
>>>> > org.apache.flink.table.sources.CsvBatchTableSourceFactory
>>>> > org.apache.flink.table.sources.CsvAppendTableSourceFactory
>>>> > org.apache.flink.table.sinks.CsvBatchTableSinkFactory
>>>> > org.apache.flink.table.sinks.CsvAppendTableSinkFactory
>>>> > org.apache.flink.table.planner.delegation.BlinkPlannerFactory
>>>> > org.apache.flink.table.planner.delegation.BlinkExecutorFactory
>>>> > org.apache.flink.table.planner.StreamPlannerFactory
>>>> > org.apache.flink.table.executor.StreamExecutorFactory
>>>> > org.apache.flink.streaming.connectors.kafka.KafkaTableSourceSinkFactory
>>>> >   at org.apache.flink.table.factories.TableFactoryService.filterByFactoryClass(TableFactoryService.java:238)
>>>> >   at org.apache.flink.table.factories.TableFactoryService.filter(TableFactoryService.java:185)
>>>> >   at org.apache.flink.table.factories.TableFactoryService.findSingleInternal(TableFactoryService.java:143)
>>>> >   at org.apache.flink.table.factories.TableFactoryService.find(TableFactoryService.java:113)
>>>> >   at org.apache.flink.streaming.connectors.kafka.KafkaTableSourceSinkFactoryBase.getDeserializationSchema(KafkaTableSourceSinkFactoryBase.java:277)
>>>> >   at org.apache.flink.streaming.connectors.kafka.KafkaTableSourceSinkFactoryBase.createStreamTableSource(KafkaTableSourceSinkFactoryBase.java:161)
>>>> >   at org.apache.flink.table.factories.StreamTableSourceFactory.createTableSource(StreamTableSourceFactory.java:49)
>>>> >   at org.apache.flink.table.factories.TableFactoryUtil.findAndCreateTableSource(TableFactoryUtil.java:53)
>>>> >   ... 36 more
>>>> >
>>>> >
>>>> >
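
To make Jeff's point concrete, here is a minimal sketch of what a more actionable message could look like. It is not how Flink builds its errors: the function name, the hint table, and the mapping from `format.type=json` to `flink-json` are assumptions for illustration, taken only from the example in this thread.

```python
# Hypothetical sketch: none of these names are part of Flink.
# The hint table maps a requested property to the module that is
# probably missing from the classpath (an assumption for illustration).
LIKELY_MODULES = {
    ("format.type", "json"): "flink-json",
}


def explain_missing_factory(factory_class, properties):
    """Build a short, actionable message for a failed factory lookup."""
    hints = [
        f"'{key}={value}' usually requires the '{module}' jar on the classpath"
        for (key, value), module in LIKELY_MODULES.items()
        if properties.get(key) == value
    ]
    lines = [f"No factory implements '{factory_class}'."]
    if hints:
        lines.append("Possible cause:")
        lines.extend(f"  - {hint}" for hint in hints)
    else:
        lines.append("No known hint matches the requested properties.")
    return "\n".join(lines)


# Properties taken from the error message quoted in this thread.
requested = {"connector.type": "kafka", "format.type": "json"}
message = explain_missing_factory(
    "org.apache.flink.table.factories.DeserializationSchemaFactory", requested
)
print(message)
```

The idea is only that the message names a likely missing jar first, instead of leaving the user to infer it from the list of considered factories.
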
>>>> > On Thu, Feb 6, 2020 at 11:30 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>> >
>>>> > > I would not object given that it is rather small at the moment.
>>>> > > However, I also think that we should have a plan for how to handle
>>>> > > the ever-growing Flink ecosystem and how to make it easily accessible
>>>> > > to our users. E.g., one far-fetched idea could be something like a
>>>> > > configuration script which downloads the required components for the
>>>> > > user. But this definitely deserves a separate discussion and does not
>>>> > > really belong here.
>>>> > >
>>>> > > Cheers,
>>>> > > Till
>>>> > >
>>>> > > On Thu, Feb 6, 2020 at 3:35 PM Hequn Cheng <he...@apache.org> wrote:
>>>> > >
>>>> > > >
>>>> > > > Hi everyone,
>>>> > > >
>>>> > > > Thank you all for the great input!
>>>> > > >
>>>> > > > I think what we probably all agree on is that we should try to make
>>>> > > > a leaner flink-dist. However, we may also need to make some
>>>> > > > compromises for the sake of user experience, so that users don't
>>>> > > > need to download the dependencies from different places. Otherwise,
>>>> > > > we could move all the jars in the current opt folder to the download
>>>> > > > page.
>>>> > > >
>>>> > > > The lack of clear rules for guiding such compromises makes things
>>>> > > > more complicated now. I would agree that the decisive factor for
>>>> > > > what goes into Flink's binary distribution should be how core it is
>>>> > > > to Flink. Meanwhile, it's better to treat the Flink API as a (core)
>>>> > > > core of Flink. Not only is this a very clear rule that is easy to
>>>> > > > follow, but in most cases the API is also very significant and
>>>> > > > deserves to be included in the dist.
>>>> > > >
>>>> > > > Given this, it might make sense to put flink-ml-api and flink-ml-lib
>>>> > > > into opt.
>>>> > > > What do you think?
>>>> > > >
>>>> > > > Best,
>>>> > > > Hequn
>>>> > > >
>>>> > > > On Wed, Feb 5, 2020 at 12:39 AM Chesnay Schepler <ches...@apache.org> wrote:
>>>> > > >
>>>> > > >> Around a year ago I started a discussion
>>>> > > >> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/DISCUSS-Towards-a-leaner-flink-dist-tp25615.html>
>>>> > > >> on reducing the number of jars we ship with the distribution.
>>>> > > >>
>>>> > > >> While there was no definitive conclusion, there was a shared
>>>> > > >> sentiment that APIs should be shipped with the distribution.
>>>> > > >>
>>>> > > >> On 04/02/2020 17:25, Till Rohrmann wrote:
>>>> > > >>
>>>> > > >> I think there is no such rule that APIs go automatically into opt/
>>>> > > >> and "libraries" do not. The contents of opt/ have mainly grown over
>>>> > > >> time w/o following a strict rule.
>>>> > > >>
>>>> > > >> I think the decisive factor for what goes into Flink's binary
>>>> > > >> distribution should be how core it is to Flink. Of course another
>>>> > > >> important consideration is which use cases Flink should promote
>>>> > > >> "out of the box" (not sure whether this is actually true for content
>>>> > > >> shipped in opt/, because you also have to move it to lib).
>>>> > > >>
>>>> > > >> For example, Gelly is something I would rather see as an optional
>>>> > > >> component than ship with every Flink binary distribution.
>>>> > > >>
>>>> > > >> Cheers,
>>>> > > >> Till
>>>> > > >>
>>>> > > >> On Tue, Feb 4, 2020 at 11:24 AM Becket Qin <becket....@gmail.com> wrote:
>>>> > > >>
>>>> > > >>
>>>> > > >> Thanks for the suggestion, Till.
>>>> > > >>
>>>> > > >> I am curious about how we usually decide when to put jars into the
>>>> > > >> opt folder.
>>>> > > >>
>>>> > > >> Technically speaking, it seems that `flink-ml-api` should be put into
>>>> > > >> the opt directory because it is actually an API rather than a
>>>> > > >> library, just like CEP and Table.
>>>> > > >>
>>>> > > >> `flink-ml-lib` seems to be on the border. On one hand, it is a
>>>> > > >> library. On the other hand, unlike the SQL formats and Hadoop, whose
>>>> > > >> major code is outside of Flink, the algorithm code is in Flink. So
>>>> > > >> `flink-ml-lib` is more like the built-in SQL UDFs. So it seems fine
>>>> > > >> to put it either in the opt folder or on the downloads page.
>>>> > > >>
>>>> > > >> From the user experience perspective, it might be better to have both
>>>> > > >> `flink-ml-lib` and `flink-ml-api` in the opt folder so users needn't
>>>> > > >> go to two places for the required dependencies.
>>>> > > >>
>>>> > > >> Thanks,
>>>> > > >>
>>>> > > >> Jiangjie (Becket) Qin
>>>> > > >>
>>>> > > >> On Tue, Feb 4, 2020 at 2:32 PM Hequn Cheng <he...@apache.org> wrote:
>>>> > > >>
>>>> > > >>
>>>> > > >> Hi Till,
>>>> > > >>
>>>> > > >> Thanks a lot for your suggestion. It's a good idea to offer the
>>>> > > >> flink-ml libraries as optional dependencies on the download page,
>>>> > > >> which can make the dist smaller.
>>>> > > >>
>>>> > > >> But I also have some concerns about it, e.g., the download page now
>>>> > > >> only includes the latest 3 releases. We may need to find ways to
>>>> > > >> support more versions.
>>>> > > >> On the other hand, the flink-ml libraries are currently very small
>>>> > > >> (about 246K), so they would not have much impact on the size of the
>>>> > > >> dist.
>>>> > > >>
>>>> > > >> What do you think?
>>>> > > >>
>>>> > > >> Best,
>>>> > > >> Hequn
>>>> > > >>
>>>> > > >> On Mon, Feb 3, 2020 at 6:24 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>> > > >>
>>>> > > >> An alternative solution would be to offer the flink-ml libraries as
>>>> > > >> optional dependencies on the download page, similar to how we offer
>>>> > > >> the different SQL formats and Hadoop releases [1].
>>>> > > >>
>>>> > > >> [1] https://flink.apache.org/downloads.html
>>>> > > >>
>>>> > > >> Cheers,
>>>> > > >> Till
>>>> > > >>
>>>> > > >> On Mon, Feb 3, 2020 at 10:19 AM Hequn Cheng <he...@apache.org> wrote:
>>>> > > >>
>>>> > > >>
>>>> > > >> Thank you all for your feedback and suggestions!
>>>> > > >>
>>>> > > >> Best, Hequn
>>>> > > >>
>>>> > > >> On Mon, Feb 3, 2020 at 5:07 PM Becket Qin <becket....@gmail.com> wrote:
>>>> > > >>
>>>> > > >> Thanks for bringing up the discussion, Hequn.
>>>> > > >>
>>>> > > >> +1 on adding `flink-ml-api` and `flink-ml-lib` to opt. This would make
>>>> > > >> it much easier for users to try out some simple ML tasks.
>>>> > > >>
>>>> > > >> Thanks,
>>>> > > >>
>>>> > > >> Jiangjie (Becket) Qin
>>>> > > >>
>>>> > > >> On Mon, Feb 3, 2020 at 4:34 PM jincheng sun <sunjincheng...@gmail.com> wrote:
>>>> > > >>
>>>> > > >>
>>>> > > >> Thank you for pushing this forward, @Hequn Cheng <he...@apache.org>!
>>>> > > >>
>>>> > > >> Hi @Becket Qin <becket....@gmail.com>, do you have any concerns about
>>>> > > >> this?
>>>> > > >>
>>>> > > >> Best,
>>>> > > >> Jincheng
>>>> > > >>
>>>> > > >> On Mon, Feb 3, 2020 at 2:09 PM Hequn Cheng <he...@apache.org> wrote:
>>>> > > >>
>>>> > > >>
>>>> > > >> Hi everyone,
>>>> > > >>
>>>> > > >> Thanks for the feedback. As there are no objections, I've opened a
>>>> > > >> JIRA issue (FLINK-15847 [1]) to address this.
>>>> > > >> The implementation details can be discussed in the issue or in the
>>>> > > >> following PR.
>>>> > > >>
>>>> > > >> Best,
>>>> > > >> Hequn
>>>> > > >>
>>>> > > >> [1] https://issues.apache.org/jira/browse/FLINK-15847
>>>> > > >>
>>>> > > >> On Wed, Jan 8, 2020 at 9:15 PM Hequn Cheng <chenghe...@gmail.com> wrote:
>>>> > > >>
>>>> > > >> Hi Jincheng,
>>>> > > >>
>>>> > > >> Thanks a lot for your feedback!
>>>> > > >> Yes, I agree with you. There are cases where multiple jars need to be
>>>> > > >> uploaded. I will prepare another discussion later, maybe with a simple
>>>> > > >> design doc.
>>>> > > >>
>>>> > > >> Best, Hequn
>>>> > > >>
>>>> > > >> On Wed, Jan 8, 2020 at 3:06 PM jincheng sun <sunjincheng...@gmail.com> wrote:
>>>> > > >>
>>>> > > >>
>>>> > > >> Thanks for bringing up this discussion, Hequn!
>>>> > > >>
>>>> > > >> +1 for including `flink-ml-api` and `flink-ml-lib` in opt.
>>>> > > >>
>>>> > > >> BTW: I think it would be great to start a discussion on uploading
>>>> > > >> multiple jars at the same time, as PyFlink jobs can also benefit from
>>>> > > >> that improvement.
>>>> > > >>
>>>> > > >> Best,
>>>> > > >> Jincheng
>>>> > > >>
>>>> > > >>
>>>> > > >> On Wed, Jan 8, 2020 at 11:50 AM Hequn Cheng <chenghe...@gmail.com> wrote:
>>>> > > >>
>>>> > > >>
>>>> > > >> Hi everyone,
>>>> > > >>
>>>> > > >> FLIP-39 [1] rebuilds the Flink ML pipeline on top of the Table API,
>>>> > > >> which moves Flink ML a step further. Based on it, users can develop
>>>> > > >> their ML jobs, and more and more machine learning platforms are
>>>> > > >> providing ML services.
>>>> > > >>
>>>> > > >> However, the problem now is that the jars of flink-ml-api and
>>>> > > >> flink-ml-lib only exist in the Maven repo. Whenever users want to
>>>> > > >> submit ML jobs, they can only depend on the ml modules and package a
>>>> > > >> fat jar. This is inconvenient, especially for machine learning
>>>> > > >> platforms on which nearly all jobs depend on the Flink ML modules and
>>>> > > >> have to package a fat jar.
>>>> > > >>
>>>> > > >> Given this, it would be better to include the jars of flink-ml-api
>>>> > > >> and flink-ml-lib in the `opt` folder, so that users can directly use
>>>> > > >> the jars with the binary release. For example, users can move the
>>>> > > >> jars into the `lib` folder or use -j to upload the jars. (Currently,
>>>> > > >> -j only supports uploading one jar. Supporting multiple jars for -j
>>>> > > >> can be discussed in another thread.)
>>>> > > >>
>>>> > > >> The reason for putting the jars in the `opt` folder instead of the
>>>> > > >> `lib` folder is that, currently, the ml jars are still optional for
>>>> > > >> the Flink project by default.
>>>> > > >>
>>>> > > >> What do you think? Any feedback is welcome!
>>>> > > >>
>>>> > > >> Best,
>>>> > > >>
>>>> > > >> Hequn
>>>> > > >>
>>>> > > >> [1]
>>>> > > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-39+Flink+ML+pipeline+and+ML+libs
>>>> > > >>
>>>> >
>>>> >
>>>> > --
>>>> > Best Regards
>>>> >
>>>> > Jeff Zhang
>>>> >
>>>>
>>>
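
The original proposal in this thread mentions that, once the ML jars ship in `opt`, users can put them into `lib` to use them with the binary release. A minimal sketch of that step follows. The jar name pattern `flink-ml-*.jar` and the helper name are assumptions, since the thread does not give the exact file names; the sketch copies rather than moves, so that `opt` stays intact.

```python
import shutil
from pathlib import Path


def copy_ml_jars(flink_home, pattern="flink-ml-*.jar"):
    """Copy matching jars from <flink_home>/opt to <flink_home>/lib.

    The pattern is a hypothetical guess at the jar names. Returns the
    names of the copied jars, sorted, so the caller can log or check them.
    """
    opt_dir = Path(flink_home) / "opt"
    lib_dir = Path(flink_home) / "lib"
    copied = []
    for jar in sorted(opt_dir.glob(pattern)):
        shutil.copy2(jar, lib_dir / jar.name)
        copied.append(jar.name)
    return copied
```

Calling `copy_ml_jars("/path/to/flink")` against an unpacked distribution would return the list of jar names that were copied; an empty list means nothing in `opt` matched the pattern.
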
