IIRC, Guowei wants to work on supporting Table API connectors in Plugins. With that, we could have the Hive dependency as a plugin, avoiding dependency conflicts.
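The appeal of the plugin mechanism is classloader isolation. A rough sketch of the idea (not Flink's actual plugin implementation; paths and names below are illustrative): each plugin directory gets its own child classloader whose parent is the platform loader, so the plugin's dependencies never collide with what is on the main classpath.

    import java.io.File;
    import java.net.URL;
    import java.net.URLClassLoader;
    import java.util.ArrayList;
    import java.util.List;

    public class PluginIsolationSketch {
        /** Creates one isolated classloader per plugin directory under plugins/. */
        public static ClassLoader loadPlugin(File pluginDir) throws Exception {
            List<URL> jarUrls = new ArrayList<>();
            File[] jars = pluginDir.listFiles((dir, name) -> name.endsWith(".jar"));
            if (jars != null) {
                for (File jar : jars) {
                    jarUrls.add(jar.toURI().toURL());
                }
            }
            // Parent is the platform loader, not the application loader, so the
            // plugin cannot see (or conflict with) application-level classes.
            return new URLClassLoader(
                    jarUrls.toArray(new URL[0]), ClassLoader.getPlatformClassLoader());
        }

        public static void main(String[] args) throws Exception {
            ClassLoader hivePlugin = loadPlugin(new File("plugins/hive")); // illustrative path
            System.out.println("Hive plugin classloader: " + hivePlugin);
        }
    }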
On Thu, Feb 6, 2020 at 1:11 PM Jingsong Li <jingsongl...@gmail.com> wrote:

> Hi Stephan,
>
> Good idea. Just like Hadoop, we can have a flink-shaded-hive-uber. The
> startup of the Hive integration then becomes very simple with one or two
> pre-bundled jars; users just add these dependencies:
> - flink-connector-hive.jar
> - flink-shaded-hive-uber-<version>.jar
>
> Some changes are needed, but I think it should work.
>
> Another question is whether we can put flink-connector-hive.jar into
> flink/lib; it should be clean and have no dependencies.
>
> Best,
> Jingsong Lee
>
> On Thu, Feb 6, 2020 at 7:13 PM Stephan Ewen <se...@apache.org> wrote:
>
>> Hi Jingsong!
>>
>> This sounds like two pre-bundled versions (Hive 1.2.1 and Hive 2.3.6)
>> would cover a lot of versions.
>>
>> Would it make sense to add these to flink-shaded (with proper exclusion
>> of unnecessary dependencies) and offer them as a download, similar to
>> how we offer pre-shaded Hadoop downloads?
>>
>> Best,
>> Stephan
>>
>> On Thu, Feb 6, 2020 at 10:26 AM Jingsong Li <jingsongl...@gmail.com>
>> wrote:
>>
>>> Hi Stephan,
>>>
>>> The hive/lib/ directory has many jars; that lib covers execution, the
>>> metastore, the Hive client, and everything else. What we really depend
>>> on is hive-exec.jar (hive-metastore.jar is also required for the lower
>>> Hive versions). And hive-exec.jar is an uber jar; we only want about
>>> half of its classes. Those classes are not so clean, but it is OK to
>>> have them.
>>>
>>> Our solution now:
>>> - Exclude the Hive jars from the build.
>>> - Document dependency sets for 8 versions; users choose according to
>>> their Hive version. [1]
>>>
>>> Spark's solution:
>>> - Build in Hive 1.2.1 dependencies to support Hive 0.12.0 through
>>> 2.3.3. [2]
>>> - Its hive-exec.jar is hive-exec.spark.jar: Spark has modified the
>>> hive-exec build pom to exclude unnecessary classes, including ORC and
>>> Parquet.
>>> - Build in ORC and Parquet dependencies to optimize performance.
>>> - Support Hive versions above 2.3.3 via "mvn install -Phive-2.3", which
>>> builds in hive-exec-2.3.6.jar. It seems that starting with this
>>> version, Hive's API has become seriously incompatible.
>>> Most users are on Hive 0.12.0 through 2.3.3, so the default Spark build
>>> is good for most of them.
>>>
>>> Presto's solution:
>>> - Build in Presto's own fork of Hive. [3] It shades the Hive classes
>>> instead of the Thrift classes.
>>> - Rewrite some client-related code to solve various issues.
>>> This approach is the heaviest, but also the cleanest: it can support
>>> all Hive versions with one build.
>>>
>>> So I think we can do the following:
>>>
>>> - The eight versions we now maintain are too many. We can move in the
>>> direction of Presto/Spark and try to reduce the number of dependency
>>> versions.
>>>
>>> - As you said, between fat/uber jars and a helper script, I prefer uber
>>> jars: users can download one jar for their startup, just like Kafka.
>>>
>>> [1]
>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies
>>> [2]
>>> https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore
>>> [3] https://github.com/prestodb/presto-hive-apache
>>>
>>> Best,
>>> Jingsong Lee
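To make the end-user experience concrete: once flink-connector-hive and the Hive jars are on the classpath, wiring up the catalog is only a few lines. A minimal sketch against the Flink 1.10-era Table API (the catalog name, database, conf dir, and Hive version below are illustrative):

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;
    import org.apache.flink.table.catalog.hive.HiveCatalog;

    public class HiveCatalogExample {
        public static void main(String[] args) {
            TableEnvironment tableEnv = TableEnvironment.create(
                    EnvironmentSettings.newInstance()
                            .useBlinkPlanner().inBatchMode().build());

            // Points Flink at an existing Hive metastore via the hive-site.xml
            // found in the given conf directory.
            HiveCatalog hive = new HiveCatalog(
                    "myhive",         // catalog name (illustrative)
                    "default",        // default database
                    "/opt/hive-conf", // directory containing hive-site.xml
                    "2.3.6");         // Hive version
            tableEnv.registerCatalog("myhive", hive);
            tableEnv.useCatalog("myhive");
        }
    }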
>>> On Wed, Feb 5, 2020 at 10:15 PM Stephan Ewen <se...@apache.org> wrote:
>>>
>>>> Some thoughts about other options we have:
>>>>
>>>> - Put fat/shaded jars for the common versions into "flink-shaded" and
>>>> offer them for download on the website, similar to the pre-bundled
>>>> Hadoop versions.
>>>>
>>>> - Look at the Presto code (metastore protocol) and see if we can
>>>> reuse that.
>>>>
>>>> - Have a setup helper script that takes the versions and pulls the
>>>> required dependencies.
>>>>
>>>> Can you share how a "built-in" dependency could work, if there are so
>>>> many different conflicting versions?
>>>>
>>>> Thanks,
>>>> Stephan
>>>>
>>>> On Tue, Feb 4, 2020 at 12:59 PM Rui Li <li...@apache.org> wrote:
>>>>
>>>>> Hi Stephan,
>>>>>
>>>>> As Jingsong stated, in our documentation the recommended way to add
>>>>> Hive deps is to use exactly what users have installed. We just ask
>>>>> users to add those jars manually, instead of automatically finding
>>>>> them based on env variables. I prefer to keep it this way for a
>>>>> while, and see if there are real concerns/complaints in user
>>>>> feedback.
>>>>>
>>>>> Please also note the Hive jars are not the only ones needed to
>>>>> integrate with Hive; users have to make sure flink-connector-hive
>>>>> and the Hadoop jars are on the classpath too. So I'm afraid a single
>>>>> "HIVE" env variable wouldn't save all the manual work for our users.
>>>>>
>>>>> On Tue, Feb 4, 2020 at 5:54 PM Jingsong Li <jingsongl...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> > Hi all,
>>>>> >
>>>>> > For your information, we have documented the detailed dependency
>>>>> > information. [1] I think it's a lot clearer than before, but still
>>>>> > worse than Presto and Spark (they avoid or build in the Hive
>>>>> > dependency).
>>>>> >
>>>>> > I thought about Stephan's suggestion:
>>>>> > - hive/lib has 200+ jars, but we only need hive-exec.jar plus two
>>>>> > or three others; if that many jars are introduced, there may be
>>>>> > big conflicts.
>>>>> > - hive/lib is not available on every machine, so we would need to
>>>>> > upload a lot of jars.
>>>>> > - A separate classloader may be hard to make work too: our
>>>>> > flink-connector-hive needs the Hive jars, so we may need to treat
>>>>> > the flink-connector-hive jar specially as well.
>>>>> > CC: Rui Li
>>>>> >
>>>>> > I think the system that integrates best with Hive is Presto, which
>>>>> > only connects to the Hive metastore through the Thrift protocol.
>>>>> > But I understand that rewriting the code would cost a lot.
>>>>> >
>>>>> > [1]
>>>>> > https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies
>>>>> >
>>>>> > Best,
>>>>> > Jingsong Lee
>>>>> >
>>>>> > On Tue, Feb 4, 2020 at 1:44 AM Stephan Ewen <se...@apache.org>
>>>>> > wrote:
>>>>> >
>>>>> >> We have had much trouble in the past from "too deep too custom"
>>>>> >> integrations that everyone got out of the box, i.e., Hadoop.
>>>>> >> Flink has such a broad spectrum of use cases that if we have a
>>>>> >> custom build for every other framework in that spectrum, we'll
>>>>> >> be in trouble.
>>>>> >>
>>>>> >> So I would also be -1 for custom builds.
>>>>> >>
>>>>> >> Couldn't we do something similar to what we started doing for
>>>>> >> Hadoop? Moving away from convenience downloads to letting users
>>>>> >> "export" their setup for Flink?
>>>>> >>
>>>>> >> - We can have a "hive module (loader)" in flink/lib by default.
>>>>> >> - The module loader would look for an environment variable like
>>>>> >> "HIVE_CLASSPATH" and load those classes (ideally in a separate
>>>>> >> classloader).
>>>>> >> - The loader can search for certain classes and instantiate the
>>>>> >> catalog / functions / etc.; when it finds them, it instantiates
>>>>> >> the hive module referencing them.
>>>>> >> - That way, we use exactly what users have installed, without
>>>>> >> needing to build our own bundles.
>>>>> >>
>>>>> >> Could that work?
>>>>> >>
>>>>> >> Best,
>>>>> >> Stephan
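A rough sketch of what such an env-var-driven loader could look like, assuming a HIVE_CLASSPATH variable pointing at the user's Hive jars (this is not an actual Flink API; class and variable names are illustrative):

    import java.io.File;
    import java.net.URL;
    import java.net.URLClassLoader;
    import java.util.ArrayList;
    import java.util.List;

    public final class HiveModuleLoader {

        /** Loads the user's Hive jars from HIVE_CLASSPATH in a separate classloader. */
        public static ClassLoader loadHiveClasspath() throws Exception {
            String hiveClasspath = System.getenv("HIVE_CLASSPATH");
            if (hiveClasspath == null) {
                return null; // Hive integration not configured; skip module setup.
            }
            List<URL> urls = new ArrayList<>();
            for (String entry : hiveClasspath.split(File.pathSeparator)) {
                urls.add(new File(entry).toURI().toURL());
            }
            // A child classloader keeps Hive's transitive dependencies off
            // Flink's main classpath.
            return new URLClassLoader(urls.toArray(new URL[0]),
                    HiveModuleLoader.class.getClassLoader());
        }

        /** Probes for a known class and instantiates the catalog reflectively. */
        public static Object tryCreateCatalog(ClassLoader hiveLoader) throws Exception {
            Class<?> catalogClass = Class.forName(
                    "org.apache.flink.table.catalog.hive.HiveCatalog", false, hiveLoader);
            // Constructor arguments are illustrative placeholders.
            return catalogClass
                    .getConstructor(String.class, String.class, String.class, String.class)
                    .newInstance("myhive", "default", "/opt/hive-conf", "2.3.6");
        }
    }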
>>>>> >> On Wed, Dec 18, 2019 at 9:43 AM Till Rohrmann
>>>>> >> <trohrm...@apache.org> wrote:
>>>>> >>
>>>>> >> > Couldn't it simply be documented which jars are in the pre-built
>>>>> >> > convenience jars that can be downloaded from the website? Then
>>>>> >> > people who need a custom version know which jars they need to
>>>>> >> > provide to Flink.
>>>>> >> >
>>>>> >> > Cheers,
>>>>> >> > Till
>>>>> >> >
>>>>> >> > On Tue, Dec 17, 2019 at 6:49 PM Bowen Li <bowenl...@gmail.com>
>>>>> >> > wrote:
>>>>> >> >
>>>>> >> > > I'm not sure providing an uber jar would be possible.
>>>>> >> > >
>>>>> >> > > Unlike the Kafka and Elasticsearch connectors, which depend on
>>>>> >> > > a specific Kafka/Elastic version, or the universal Kafka
>>>>> >> > > connector, which provides good compatibility, the Hive
>>>>> >> > > connector needs to deal with Hive jars across all 1.x, 2.x,
>>>>> >> > > and 3.x versions (let alone all the HDP/CDH distributions),
>>>>> >> > > with incompatibilities even between minor versions, plus
>>>>> >> > > differently versioned Hadoop and other extra dependency jars
>>>>> >> > > for each Hive version.
>>>>> >> > >
>>>>> >> > > Besides, users usually need to be able to easily see which
>>>>> >> > > individual jars are required, which is invisible in an uber
>>>>> >> > > jar. Hive users already have their own Hive deployments. They
>>>>> >> > > usually have to use their own Hive jars because, unlike the
>>>>> >> > > Hive jars on Maven, their jars contain in-house or vendor
>>>>> >> > > changes. They need to easily tell which jars Flink requires
>>>>> >> > > for the corresponding open source Hive version, and copy the
>>>>> >> > > in-house jars over from their Hive deployments as
>>>>> >> > > replacements.
>>>>> >> > >
>>>>> >> > > Providing a script that downloads all the individual jars for
>>>>> >> > > a specified Hive version could be an alternative.
>>>>> >> > >
>>>>> >> > > The goal is that we need to provide a *product*, not a
>>>>> >> > > technology, to make things less of a hassle for Hive users.
>>>>> >> > > After all, it's Flink embracing the Hive community and
>>>>> >> > > ecosystem, not the other way around. I'd argue the Hive
>>>>> >> > > connector can be treated differently because its
>>>>> >> > > community/ecosystem/user base is much larger than those of
>>>>> >> > > the other connectors, and it is far more important than other
>>>>> >> > > connectors to Flink's mission of becoming a unified
>>>>> >> > > batch/streaming engine and getting Flink more widely adopted.
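For a sense of what such a per-version download helper might look like, a small sketch that maps a Hive version to the artifacts a user would fetch (the version-to-jar mapping and URL here are illustrative only; the authoritative lists live in the Flink docs):

    import java.util.List;
    import java.util.Map;

    public class HiveDependencyHelper {
        // Illustrative mapping only; hive-metastore is listed for the lower
        // Hive versions, per the dependency discussion above.
        private static final Map<String, List<String>> REQUIRED_JARS = Map.of(
                "1.2.1", List.of("hive-exec-1.2.1.jar", "hive-metastore-1.2.1.jar"),
                "2.3.6", List.of("hive-exec-2.3.6.jar"));

        public static void main(String[] args) {
            String hiveVersion = args.length > 0 ? args[0] : "2.3.6";
            List<String> jars = REQUIRED_JARS.get(hiveVersion);
            if (jars == null) {
                System.err.println("Unsupported Hive version: " + hiveVersion);
                return;
            }
            // A real helper would download these from a Maven repository
            // into flink/lib; here we only print what would be fetched.
            jars.forEach(jar -> System.out.println("would fetch: " + jar));
        }
    }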
>>>>> >> > > On Sun, Dec 15, 2019 at 10:03 PM Danny Chan
>>>>> >> > > <yuzhao....@gmail.com> wrote:
>>>>> >> > >
>>>>> >> > > > Also -1 on separate builds.
>>>>> >> > > >
>>>>> >> > > > After looking at how some other big data engines handle
>>>>> >> > > > distribution [1], I didn't find a strong need to publish a
>>>>> >> > > > separate build just for a specific Hive version; there are
>>>>> >> > > > indeed builds for different Hadoop versions.
>>>>> >> > > >
>>>>> >> > > > Just like Seth and Aljoscha said, we could publish a
>>>>> >> > > > flink-hive-version-uber.jar to use as a lib for the SQL CLI
>>>>> >> > > > or other use cases.
>>>>> >> > > >
>>>>> >> > > > [1] https://spark.apache.org/downloads.html
>>>>> >> > > > [2]
>>>>> >> > > > https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html
>>>>> >> > > >
>>>>> >> > > > Best,
>>>>> >> > > > Danny Chan
>>>>> >> > > > On Dec 14, 2019, 3:03 AM +0800, dev@flink.apache.org wrote:
>>>>> >> > > > >
>>>>> >> > > > > https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/connect.html#dependencies
>>>>> >
>>>>> > --
>>>>> > Best, Jingsong Lee
>>>
>>> --
>>> Best, Jingsong Lee
>
> --
> Best, Jingsong Lee