Some thoughts about other options we have:

- Put fat/shaded jars for the common versions into "flink-shaded" and offer
  them for download on the website, similar to the pre-bundled Hadoop
  versions.
- Look at the Presto code (Metastore protocol) and see if we can reuse that
  (see the sketch below).
- Have a setup helper script that takes the versions and pulls the required
  dependencies.

Can you share how a "built-in" dependency could work, if there are so many
different conflicting versions?

Thanks,
Stephan
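For context on the Presto option: that integration boils down to speaking
only the metastore thrift protocol, with no dependency on hive-exec or the
rest of hive/lib. A minimal sketch of what that could look like, assuming
the standard Hive metastore client API and an illustrative thrift URI (a
sketch, not a definitive implementation):

    // Sketch only: metastore-only integration in the spirit of Presto,
    // using the standard Hive metastore thrift client. The URI below is
    // illustrative.
    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.Table;

    public class MetastoreOnlyExample {
        public static void main(String[] args) throws Exception {
            HiveConf conf = new HiveConf();
            // Talk to the metastore over thrift; no hive/lib jars are
            // needed beyond the metastore client itself.
            conf.setVar(HiveConf.ConfVars.METASTOREURIS,
                    "thrift://metastore-host:9083");

            HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
            try {
                // List databases and tables purely via the thrift API.
                for (String db : client.getAllDatabases()) {
                    for (String tableName : client.getAllTables(db)) {
                        Table t = client.getTable(db, tableName);
                        System.out.println(db + "." + tableName
                                + " -> " + t.getSd().getLocation());
                    }
                }
            } finally {
                client.close();
            }
        }
    }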
On Tue, Feb 4, 2020 at 12:59 PM Rui Li <li...@apache.org> wrote:

> Hi Stephan,
>
> As Jingsong stated, in our documentation the recommended way to add Hive
> deps is to use exactly what users have installed. It's just that we ask
> users to manually add those jars, instead of automatically finding them
> based on env variables. I prefer to keep it this way for a while, and see
> if there are real concerns/complaints in user feedback.
>
> Please also note the Hive jars are not the only ones needed to integrate
> with Hive; users have to make sure flink-connector-hive and the Hadoop
> jars are in the classpath too. So I'm afraid a single "HIVE" env variable
> wouldn't save all the manual work for our users.
>
> On Tue, Feb 4, 2020 at 5:54 PM Jingsong Li <jingsongl...@gmail.com> wrote:
>
> > Hi all,
> >
> > For your information, we have documented the detailed dependency
> > information [1]. I think it's a lot clearer than before, but it's still
> > worse than Presto and Spark (they avoid, or have built in, the Hive
> > dependency).
> >
> > I thought about Stephan's suggestion:
> > - hive/lib has 200+ jars, but we only need hive-exec.jar, or at most
> > two or three more; if so many jars are introduced, there may be big
> > conflicts.
> > - And hive/lib is not available on every machine, so we would need to
> > upload that many jars.
> > - A separate classloader may be hard to make work too: our
> > flink-connector-hive needs the hive jars, so we may need to treat the
> > flink-connector-hive jar specially as well.
> > CC: Rui Li
> >
> > I think the system that integrates best with Hive is Presto, which only
> > connects to the Hive metastore through the thrift protocol. But I
> > understand that it would cost a lot to rewrite the code.
> >
> > [1]
> > https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies
> >
> > Best,
> > Jingsong Lee
> >
> > On Tue, Feb 4, 2020 at 1:44 AM Stephan Ewen <se...@apache.org> wrote:
> >
> >> We have had much trouble in the past from "too deep, too custom"
> >> integrations that everyone got out of the box, i.e., Hadoop.
> >> Flink has such a broad spectrum of use cases; if we have a custom
> >> build for every other framework in that spectrum, we'll be in trouble.
> >>
> >> So I would also be -1 for custom builds.
> >>
> >> Couldn't we do something similar to what we started doing for Hadoop?
> >> Moving away from convenience downloads to allowing users to "export"
> >> their setup for Flink?
> >>
> >> - We can have a "hive module (loader)" in flink/lib by default.
> >> - The module loader would look for an environment variable like
> >> "HIVE_CLASSPATH" and load these classes (ideally in a separate
> >> classloader).
> >> - The loader can search for certain classes and, when it finds them,
> >> instantiate the catalog / functions / etc. and the hive module
> >> referencing them.
> >> - That way, we use exactly what users have installed, without needing
> >> to build our own bundles.
> >>
> >> Could that work?
> >>
> >> Best,
> >> Stephan
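To make the module-loader idea quoted above concrete: a rough sketch of how
such a loader could probe a HIVE_CLASSPATH variable and set up Hive support
in a separate classloader. The environment variable name and the probed
class are assumptions for illustration, not an existing Flink API:

    // Sketch of the "hive module (loader)" idea quoted above.
    // HIVE_CLASSPATH and the probed marker class are assumptions.
    import java.io.File;
    import java.net.URL;
    import java.net.URLClassLoader;
    import java.util.ArrayList;
    import java.util.List;

    public class HiveModuleLoader {

        public static ClassLoader loadHiveClasspath() throws Exception {
            String hiveClasspath = System.getenv("HIVE_CLASSPATH");
            if (hiveClasspath == null || hiveClasspath.isEmpty()) {
                return null; // Hive integration not configured.
            }
            // Turn the path entries into URLs for an isolated child
            // classloader, so Hive's dependencies don't leak into
            // Flink's own classpath.
            List<URL> urls = new ArrayList<>();
            for (String entry : hiveClasspath.split(File.pathSeparator)) {
                urls.add(new File(entry).toURI().toURL());
            }
            return new URLClassLoader(
                    urls.toArray(new URL[0]),
                    HiveModuleLoader.class.getClassLoader());
        }

        public static void main(String[] args) throws Exception {
            ClassLoader hiveLoader = loadHiveClasspath();
            if (hiveLoader == null) {
                System.out.println("HIVE_CLASSPATH not set, Hive module disabled.");
                return;
            }
            // Probe for a marker class; if present, the loader would go
            // on to instantiate the catalog / functions reflectively.
            try {
                Class<?> marker = Class.forName(
                        "org.apache.hadoop.hive.metastore.api.Table",
                        false, hiveLoader);
                System.out.println("Found Hive classes: " + marker.getName()
                        + ", instantiating hive module...");
            } catch (ClassNotFoundException e) {
                System.out.println("HIVE_CLASSPATH set, but Hive classes not found.");
            }
        }
    }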
> >>
> >> On Wed, Dec 18, 2019 at 9:43 AM Till Rohrmann <trohrm...@apache.org>
> >> wrote:
> >>
> >> > Couldn't it simply be documented which jars are in the convenience
> >> > jars which are pre-built and can be downloaded from the website?
> >> > Then people who need a custom version know which jars they need to
> >> > provide to Flink?
> >> >
> >> > Cheers,
> >> > Till
> >> >
> >> > On Tue, Dec 17, 2019 at 6:49 PM Bowen Li <bowenl...@gmail.com> wrote:
> >> >
> >> > > I'm not sure providing an uber jar would be possible.
> >> > >
> >> > > Unlike the kafka and elasticsearch connectors, which have
> >> > > dependencies on a specific kafka/elastic version, or the universal
> >> > > kafka connector, which provides good compatibility, the hive
> >> > > connector needs to deal with hive jars across all 1.x, 2.x, and
> >> > > 3.x versions (let alone all the HDP/CDH distributions), with
> >> > > incompatibilities even between minor versions, and with
> >> > > differently versioned hadoop and other extra dependency jars for
> >> > > each hive version.
> >> > >
> >> > > Besides, users usually need to be able to easily see which
> >> > > individual jars are required, which is invisible in an uber jar.
> >> > > Hive users already have their hive deployments. They usually have
> >> > > to use their own hive jars because, unlike the hive jars on mvn,
> >> > > their own jars contain changes made in-house or by vendors. They
> >> > > need to easily tell which jars Flink requires for the open-source
> >> > > hive version corresponding to their own hive deployment, and copy
> >> > > the in-house jars over from their hive deployments as
> >> > > replacements.
> >> > >
> >> > > Providing a script to download all the individual jars for a
> >> > > specified hive version can be an alternative.
> >> > >
> >> > > The goal is that we need to provide a *product*, not a technology,
> >> > > to make things less of a hassle for Hive users. After all, it's
> >> > > Flink embracing the Hive community and ecosystem, not the other
> >> > > way around. I'd argue the Hive connector can be treated
> >> > > differently because its community/ecosystem/user base is much
> >> > > larger than those of the other connectors, and it's way more
> >> > > important than other connectors to Flink's mission of becoming a
> >> > > unified batch/streaming engine and getting Flink more widely
> >> > > adopted.
> >> > >
> >> > > On Sun, Dec 15, 2019 at 10:03 PM Danny Chan <yuzhao....@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > Also -1 on separate builds.
> >> > > >
> >> > > > After looking at how some other big data engines handle
> >> > > > distribution [1], I didn't find a strong need to publish a
> >> > > > separate build just for a specific Hive version; there are,
> >> > > > however, builds for different Hadoop versions.
> >> > > >
> >> > > > Just like Seth and Aljoscha said, we could publish a
> >> > > > flink-hive-version-uber.jar to use as a lib for the SQL CLI or
> >> > > > other use cases.
> >> > > >
> >> > > > [1] https://spark.apache.org/downloads.html
> >> > > > [2] https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html
> >> > > >
> >> > > > Best,
> >> > > > Danny Chan
> >> > > > On Dec 14, 2019 at 3:03 AM +0800, dev@flink.apache.org wrote:
> >> > > > >
> >> > > > > https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/connect.html#dependencies
> >
> > --
> > Best, Jingsong Lee
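As an illustration of the setup-helper option from the top of this thread
(also raised by Bowen above as a download script): a minimal sketch of a
helper that pulls the jars for a chosen Hive version from Maven Central.
The version-to-artifact mapping below is a placeholder; a real helper would
ship a vetted jar list per Hive version:

    // Sketch of a setup helper that downloads the individual jars for a
    // chosen Hive version. The artifact lists are illustrative
    // placeholders, not a vetted dependency set.
    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;
    import java.util.List;
    import java.util.Map;

    public class HiveDepsDownloader {

        private static final String MAVEN_CENTRAL =
                "https://repo1.maven.org/maven2/";

        // Hypothetical mapping: Hive version -> "groupId:artifactId:version".
        private static final Map<String, List<String>> DEPS = Map.of(
                "2.3.6", List.of(
                        "org.apache.hive:hive-exec:2.3.6",
                        "org.apache.hive:hive-metastore:2.3.6"),
                "3.1.2", List.of(
                        "org.apache.hive:hive-exec:3.1.2"));

        public static void main(String[] args) throws Exception {
            String hiveVersion = args.length > 0 ? args[0] : "2.3.6";
            Path libDir = Paths.get("lib");
            Files.createDirectories(libDir);

            for (String coord : DEPS.get(hiveVersion)) {
                String[] gav = coord.split(":");
                // Maven repo layout:
                // group/with/slashes/artifact/version/artifact-version.jar
                String url = MAVEN_CENTRAL + gav[0].replace('.', '/') + "/"
                        + gav[1] + "/" + gav[2] + "/"
                        + gav[1] + "-" + gav[2] + ".jar";
                Path target = libDir.resolve(gav[1] + "-" + gav[2] + ".jar");
                System.out.println("Downloading " + url);
                try (InputStream in = new URL(url).openStream()) {
                    Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }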