Hi Stephan,

The hive/lib/ directory contains many jars; they cover execution, the metastore, the Hive client, and everything else. What we really depend on is hive-exec.jar (for older Hive versions, hive-metastore.jar is also required). And hive-exec.jar is an uber jar: we only need about half of its classes. Those classes are not very clean, but it is OK to have them.
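To make "depend on" concrete, below is a rough sketch of the kind of Hive API the connector has to call. It is only an illustration: the classes come from the metastore client bundled inside hive-exec.jar (or the separate hive-metastore.jar in older versions), packages and signatures move between Hive releases, and the class name MetastoreSmokeTest is made up here.

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.IMetaStoreClient;

    public class MetastoreSmokeTest {
        public static void main(String[] args) throws Exception {
            HiveConf conf = new HiveConf();
            // Placeholder URI; point it at a real metastore to test compatibility.
            conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://localhost:9083");
            IMetaStoreClient client = new HiveMetaStoreClient(conf);
            // Listing databases is enough to show the jars and the thrift protocol line up.
            System.out.println(client.getAllDatabases());
            client.close();
        }
    }

Most of the version pain discussed in this thread comes from exactly these classes (and their transitive dependencies) changing between Hive releases.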
Our current solution:
- Exclude the Hive jars from the build.
- Provide eight sets of dependencies; users pick the one matching their Hive version. [1]

Spark's solution:
- Build in Hive 1.2.1 dependencies to support Hive 0.12.0 through 2.3.3. [2]
- Its hive-exec.jar is actually hive-exec.spark.jar: Spark modified the hive-exec build pom to exclude unnecessary classes, including ORC and Parquet.
- Build in its own ORC and Parquet dependencies to optimize performance.
- Support Hive 2.3.3 and above via "mvn install -Phive-2.3", which builds in hive-exec-2.3.6.jar. It seems that starting from this version, Hive's API has become seriously incompatible. Since most users run Hive 0.12.0 through 2.3.3, the default Spark build works for most of them.

Presto's solution:
- Build in Presto's own copy of Hive. [3] It shades the Hive classes rather than the thrift classes.
- Rewrite some client-related code to work around various issues. This approach is the heaviest, but also the cleanest: it can support all Hive versions with a single build.

So I think we can do the following:
- The eight dependency sets we maintain now are too many. We can move in the direction of Presto/Spark and try to reduce the number of dependency versions.
- As you said, regarding fat/uber jars versus a helper script: I prefer uber jars, so that users can download a single jar into their setup, just like Kafka; see the sketch right below for the user experience this would give.
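With one downloaded jar in flink/lib, using Hive from the Table API should be no more than the snippet below. This is a sketch based on the documentation in [1]; the catalog name, the hive-site.xml directory and the version string are placeholders, and the exact builder methods depend on the Flink version.

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;
    import org.apache.flink.table.catalog.hive.HiveCatalog;

    public class HiveQuickstart {
        public static void main(String[] args) {
            EnvironmentSettings settings =
                    EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build();
            TableEnvironment tableEnv = TableEnvironment.create(settings);

            // All values below are placeholders: catalog name, default database,
            // directory containing hive-site.xml, and the Hive version.
            HiveCatalog hive = new HiveCatalog("myhive", "default", "/opt/hive-conf", "2.3.4");
            tableEnv.registerCatalog("myhive", hive);
            tableEnv.useCatalog("myhive");
            // From here on, existing Hive tables are visible to Flink SQL queries.
        }
    }

Everything outside that snippet, i.e. collecting the right hive-exec, Hadoop and flink-connector-hive jars, is the part we want to shrink to a single download.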
[1] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies
[2] https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore
[3] https://github.com/prestodb/presto-hive-apache

Best,
Jingsong Lee

On Wed, Feb 5, 2020 at 10:15 PM Stephan Ewen <se...@apache.org> wrote:

> Some thoughts about other options we have:
>
> - Put fat/shaded jars for the common versions into "flink-shaded" and offer them for download on the website, similar to the pre-bundled Hadoop versions.
>
> - Look at the Presto code (metastore protocol) and see if we can reuse that.
>
> - Have a setup helper script that takes the versions and pulls the required dependencies.
>
> Can you share how a "built-in" dependency could work, if there are so many different conflicting versions?
>
> Thanks,
> Stephan
>
> On Tue, Feb 4, 2020 at 12:59 PM Rui Li <li...@apache.org> wrote:
>
>> Hi Stephan,
>>
>> As Jingsong stated, in our documentation the recommended way to add Hive deps is to use exactly what users have installed. It's just that we ask users to manually add those jars, instead of automatically finding them based on env variables. I prefer to keep it this way for a while, and see if there are real concerns/complaints from user feedback.
>>
>> Please also note the Hive jars are not the only ones needed to integrate with Hive; users have to make sure flink-connector-hive and the Hadoop jars are on the classpath too. So I'm afraid a single "HIVE" env variable wouldn't save all the manual work for our users.
>>
>> On Tue, Feb 4, 2020 at 5:54 PM Jingsong Li <jingsongl...@gmail.com> wrote:
>>
>> > Hi all,
>> >
>> > For your information, we have documented the detailed dependency information [1]. I think it's a lot clearer than before, but it's still worse than Presto and Spark (they either avoid the Hive dependency or build it in).
>> >
>> > I thought about Stephan's suggestion:
>> > - hive/lib has 200+ jars, but we only need hive-exec.jar plus perhaps two or three more; if so many jars are introduced, there may be big conflicts.
>> > - And hive/lib is not available on every machine, so we would need to upload many jars.
>> > - A separate classloader may be hard to make work too: our flink-connector-hive needs the Hive jars, so we would have to treat the flink-connector-hive jar specially as well.
>> > CC: Rui Li
>> >
>> > I think the system that integrates best with Hive is Presto, which only connects to the Hive metastore through the thrift protocol. But I understand that it costs a lot to rewrite the code that way.
>> >
>> > [1] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies
>> >
>> > Best,
>> > Jingsong Lee
>> >
>> > On Tue, Feb 4, 2020 at 1:44 AM Stephan Ewen <se...@apache.org> wrote:
>> >
>> >> We have had much trouble in the past from "too deep, too custom" integrations that everyone got out of the box, i.e., Hadoop. Flink has such a broad spectrum of use cases; if we have a custom build for every other framework in that spectrum, we'll be in trouble.
>> >>
>> >> So I would also be -1 for custom builds.
>> >>
>> >> Couldn't we do something similar to what we started doing for Hadoop? Moving away from convenience downloads to allowing users to "export" their setup for Flink?
>> >>
>> >> - We can have a "hive module (loader)" in flink/lib by default.
>> >> - The module loader would look for an environment variable like "HIVE_CLASSPATH" and load these classes (ideally in a separate classloader).
>> >> - The loader can search for certain classes and, when it finds them, instantiate the catalog / functions / etc., i.e., instantiate the hive module referencing them.
>> >> - That way, we use exactly what users have installed, without needing to build our own bundles.
>> >>
>> >> Could that work?
>> >>
>> >> Best,
>> >> Stephan
>> >>
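To make the module-loader idea above concrete, I imagine something like the sketch below. It is purely illustrative: the class name HiveModuleLoader and the probing logic are made up, nothing like it exists in Flink today, and it glosses over the flink-connector-hive problem from my last mail (the connector itself compiles against Hive classes, so it would have to live inside this classloader rather than in flink/lib).

    import java.io.File;
    import java.net.URL;
    import java.net.URLClassLoader;
    import java.util.ArrayList;
    import java.util.List;

    /** Illustration of the proposed loader; not an existing Flink class. */
    public final class HiveModuleLoader {

        /** Builds a classloader over the jars listed in HIVE_CLASSPATH (path-separated). */
        public static ClassLoader createHiveClassLoader() throws Exception {
            String hiveClasspath = System.getenv("HIVE_CLASSPATH");
            if (hiveClasspath == null) {
                throw new IllegalStateException("HIVE_CLASSPATH is not set");
            }
            List<URL> urls = new ArrayList<>();
            for (String path : hiveClasspath.split(File.pathSeparator)) {
                urls.add(new File(path).toURI().toURL());
            }
            // Parent is the Flink classloader. If flink-connector-hive stayed in flink/lib
            // (the parent), it could not see these Hive classes, which is why the connector
            // jar would need special treatment here as well.
            return new URLClassLoader(urls.toArray(new URL[0]), HiveModuleLoader.class.getClassLoader());
        }

        /** Probes for a marker class before instantiating any catalog / functions. */
        public static boolean hiveAvailable(ClassLoader cl) {
            try {
                Class.forName("org.apache.hadoop.hive.conf.HiveConf", false, cl);
                return true;
            } catch (ClassNotFoundException e) {
                return false;
            }
        }
    }

The catalog and functions would then be instantiated reflectively through this loader once the probe succeeds.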
>> >> On Wed, Dec 18, 2019 at 9:43 AM Till Rohrmann <trohrm...@apache.org> wrote:
>> >>
>> >> > Couldn't it simply be documented which jars are in the pre-built convenience jars that can be downloaded from the website? Then people who need a custom version know which jars they need to provide to Flink.
>> >> >
>> >> > Cheers,
>> >> > Till
>> >> >
>> >> > On Tue, Dec 17, 2019 at 6:49 PM Bowen Li <bowenl...@gmail.com> wrote:
>> >> >
>> >> > > I'm not sure providing an uber jar would be possible.
>> >> > >
>> >> > > Different from the Kafka and Elasticsearch connectors, which have dependencies for a specific Kafka/Elasticsearch version, or the universal Kafka connector that provides good compatibility, the Hive connector needs to deal with Hive jars across all 1.x, 2.x and 3.x versions (let alone all the HDP/CDH distributions), with incompatibilities even between minor versions, plus differently versioned Hadoop and other extra dependency jars for each Hive version.
>> >> > >
>> >> > > Besides, users usually need to be able to easily see which individual jars are required, which is invisible in an uber jar. Hive users already have their Hive deployments. They usually have to use their own Hive jars because, unlike the Hive jars on mvn, their own jars contain changes made in-house or by vendors. They need to easily tell which jars Flink requires for the corresponding open-source Hive version, map that to their own Hive deployment, and copy the in-house jars over from their Hive deployments as replacements.
>> >> > >
>> >> > > Providing a script to download all the individual jars for a specified Hive version can be an alternative.
>> >> > >
>> >> > > The goal is that we need to provide a *product*, not a technology, to make it less of a hassle for Hive users. After all, it's Flink embracing the Hive community and ecosystem, not the other way around. I'd argue the Hive connector can be treated differently because its community/ecosystem/user base is much larger than those of the other connectors, and it's way more important than other connectors to Flink's mission of becoming a unified batch/streaming engine and getting Flink more widely adopted.
>> >> > >
>> >> > > On Sun, Dec 15, 2019 at 10:03 PM Danny Chan <yuzhao....@gmail.com> wrote:
>> >> > >
>> >> > > > Also -1 on separate builds.
>> >> > > >
>> >> > > > After looking at how some other big data engines handle distributions [1], I didn't find strong needs to publish a separate build for each Hive version; indeed there are builds for different Hadoop versions.
>> >> > > >
>> >> > > > Just like Seth and Aljoscha said, we could publish a flink-hive-version-uber.jar to use as a lib for the SQL CLI or other use cases.
>> >> > > >
>> >> > > > [1] https://spark.apache.org/downloads.html
>> >> > > > [2] https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html
>> >> > > >
>> >> > > > Best,
>> >> > > > Danny Chan
>> >> > > > On Dec 14, 2019, 3:03 AM +0800, dev@flink.apache.org, wrote:
>> >> > > > >
>> >> > > > > https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/connect.html#dependencies
>> >> > > >
>> > --
>> > Best, Jingsong Lee
>>
--
Best, Jingsong Lee