Hi Stephan,

As Jingsong stated, the recommended way in our documentation to add Hive deps
is to use exactly what users have installed. It's just that we ask users to
manually add those jars, instead of automatically finding them based on env
variables. I'd prefer to keep it this way for a while and see whether there are
real concerns/complaints in user feedback.
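To illustrate what the manual setup buys the user: with the required jars added
to flink/lib, registering the Hive catalog from the Table API looks roughly
like the sketch below (a sketch only, assuming the 1.10-era Table API; the
database name, conf dir and Hive version are placeholders).

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;
    import org.apache.flink.table.catalog.hive.HiveCatalog;

    public class HiveCatalogSetup {
        public static void main(String[] args) {
            // Blink planner in batch mode; assumes flink-connector-hive,
            // the Hive jars and the Hadoop jars are already on the classpath.
            EnvironmentSettings settings = EnvironmentSettings.newInstance()
                    .useBlinkPlanner()
                    .inBatchMode()
                    .build();
            TableEnvironment tableEnv = TableEnvironment.create(settings);

            // Placeholder values; point these at your own Hive deployment.
            String name            = "myhive";
            String defaultDatabase = "default";
            String hiveConfDir     = "/opt/hive-conf";
            String hiveVersion     = "2.3.4";

            HiveCatalog hive =
                    new HiveCatalog(name, defaultDatabase, hiveConfDir, hiveVersion);
            tableEnv.registerCatalog(name, hive);
            tableEnv.useCatalog(name);
        }
    }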
Please also note that the Hive jars are not the only ones needed to integrate
with Hive: users have to make sure flink-connector-hive and the Hadoop jars are
on the classpath too. So I'm afraid a single "HIVE" env variable wouldn't save
all the manual work for our users.

On Tue, Feb 4, 2020 at 5:54 PM Jingsong Li <jingsongl...@gmail.com> wrote:

> Hi all,
>
> For your information, we have documented the detailed dependency
> information [1]. I think it's a lot clearer than before, but it's still
> worse than Presto and Spark (they either avoid the Hive dependency or
> have it built in).
>
> I thought about Stephan's suggestion:
> - hive/lib has 200+ jars, but we only need hive-exec.jar plus two or
> three others; if that many jars are introduced, there may be big
> conflicts.
> - And hive/lib is not available on every machine, so we would need to
> upload many jars.
> - A separate classloader may be hard to make work too: our
> flink-connector-hive needs the Hive jars, so we may need to handle the
> flink-connector-hive jar specially as well.
> CC: Rui Li
>
> I think the best system integrating with Hive is Presto, which only
> connects to the Hive metastore through the thrift protocol. But I
> understand that it would cost a lot to rewrite the code.
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies
>
> Best,
> Jingsong Lee
>
> On Tue, Feb 4, 2020 at 1:44 AM Stephan Ewen <se...@apache.org> wrote:
>
>> We have had much trouble in the past from "too deep, too custom"
>> integrations that everyone got out of the box, i.e., Hadoop.
>> Flink has such a broad spectrum of use cases that if we had a custom
>> build for every other framework in that spectrum, we'd be in trouble.
>>
>> So I would also be -1 for custom builds.
>>
>> Couldn't we do something similar to what we started doing for Hadoop?
>> Moving away from convenience downloads to allowing users to "export"
>> their setup for Flink?
>>
>> - We can have a "hive module (loader)" in flink/lib by default.
>> - The module loader would look for an environment variable like
>> "HIVE_CLASSPATH" and load these classes (ideally in a separate
>> classloader).
>> - The loader can search for certain classes and, when it finds them,
>> instantiate the catalog / functions / etc. and the hive module
>> referencing them.
>> - That way, we use exactly what users have installed, without needing
>> to build our own bundles.
>>
>> Could that work?
>>
>> Best,
>> Stephan
>>
>>
>> On Wed, Dec 18, 2019 at 9:43 AM Till Rohrmann <trohrm...@apache.org>
>> wrote:
>>
>> > Couldn't we simply document which jars are in the pre-built
>> > convenience jars that can be downloaded from the website? Then people
>> > who need a custom version would know which jars they need to provide
>> > to Flink.
>> >
>> > Cheers,
>> > Till
>> >
>> > On Tue, Dec 17, 2019 at 6:49 PM Bowen Li <bowenl...@gmail.com> wrote:
>> >
>> > > I'm not sure providing an uber jar would be possible.
>> > >
>> > > Unlike the kafka and elasticsearch connectors, which have
>> > > dependencies for a specific kafka/elastic version, or the universal
>> > > kafka connector that provides good compatibility, the hive connector
>> > > needs to deal with hive jars across all 1.x, 2.x and 3.x versions
>> > > (let alone all the HDP/CDH distributions), with incompatibilities
>> > > even between minor versions, plus differently versioned hadoop and
>> > > other extra dependency jars for each hive version.
>> > >
>> > > Besides, users usually need to be able to easily see which
>> > > individual jars are required, which is invisible with an uber jar.
>> > > Hive users already have their own hive deployments, and they usually
>> > > have to use their own hive jars because, unlike the hive jars on
>> > > mvn, their jars contain in-house or vendor changes. They need to
>> > > easily tell which jars Flink requires for the corresponding open
>> > > source hive version, map those to their own hive deployment, and
>> > > copy the in-house jars over from the hive deployment as
>> > > replacements.
>> > >
>> > > Providing a script that downloads all the individual jars for a
>> > > specified hive version could be an alternative.
>> > >
>> > > The goal is that we need to provide a *product*, not a technology,
>> > > to make it less of a hassle for Hive users. After all, it's Flink
>> > > embracing the Hive community and ecosystem, not the other way
>> > > around. I'd argue the Hive connector can be treated differently
>> > > because its community/ecosystem/user base is much larger than those
>> > > of the other connectors, and it matters far more than the other
>> > > connectors for Flink's mission of becoming a unified batch/streaming
>> > > engine and getting more widely adopted.
>> > >
>> > >
>> > > On Sun, Dec 15, 2019 at 10:03 PM Danny Chan <yuzhao....@gmail.com>
>> > > wrote:
>> > >
>> > > > Also -1 on separate builds.
>> > > >
>> > > > After looking at how some other big data engines handle their
>> > > > distributions [1], I didn't find a strong need to publish a
>> > > > separate build just for a separate Hive version; there are,
>> > > > however, builds for different Hadoop versions.
>> > > >
>> > > > Just like Seth and Aljoscha said, we could publish a
>> > > > flink-hive-version-uber.jar to use as a lib of the SQL CLI or
>> > > > other use cases.
>> > > >
>> > > > [1] https://spark.apache.org/downloads.html
>> > > > [2]
>> > > > https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html
>> > > >
>> > > > Best,
>> > > > Danny Chan
>> > > > On Dec 14, 2019, 3:03 AM +0800, dev@flink.apache.org wrote:
>> > > > >
>> > > > > https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/connect.html#dependencies
>> > > > >
>> > > >
>> > >
>> >
>>
>
> --
> Best,
> Jingsong Lee
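PS: to make the "hive module (loader)" idea Stephan describes above a bit more
concrete, here is a rough sketch of what such a loader could do. This is not an
existing Flink API; the environment variable name, the probe class and the
registration hook are all assumptions for illustration only.

    import java.io.File;
    import java.net.URL;
    import java.net.URLClassLoader;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch of a "hive module (loader)": pick up whatever the user has
    // installed via an environment variable instead of shipping Hive jars.
    public final class HiveModuleLoader {

        public static void main(String[] args) throws Exception {
            // Hypothetical variable holding the user's Hive classpath entries.
            String hiveClasspath = System.getenv("HIVE_CLASSPATH");
            if (hiveClasspath == null || hiveClasspath.isEmpty()) {
                return; // no Hive installation exported; do nothing
            }

            // Turn the classpath entries into URLs for a dedicated classloader.
            List<URL> urls = new ArrayList<>();
            for (String entry : hiveClasspath.split(File.pathSeparator)) {
                urls.add(Paths.get(entry).toUri().toURL());
            }
            ClassLoader hiveLoader = new URLClassLoader(
                    urls.toArray(new URL[0]), HiveModuleLoader.class.getClassLoader());

            // Probe for a well-known Hive class and only wire up the catalog
            // and functions if it is actually present.
            try {
                Class.forName("org.apache.hadoop.hive.conf.HiveConf", false, hiveLoader);
                // registerHiveCatalog(hiveLoader);  // hypothetical hook, not a real API
            } catch (ClassNotFoundException e) {
                // Hive classes not found; skip the module.
            }
        }
    }

As noted earlier in the thread, this would only cover the Hive jars; the
flink-connector-hive and Hadoop dependencies would still need to be resolved
separately.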