Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

Jingsong Li Fri, 13 Dec 2019 01:31:10 -0800

Hi Bowen,

Thanks for driving this.
+1 for this proposal.


Because we support multiple Hive versions, users have to rely on different
dependencies for each version, which breaks the "out of the box" experience.
Now that the client resolves classes child-first by default, the
requirements on user dependencies are stricter, which has already led to a
related bug when running a Hive job with the dependencies listed in the
documentation [1].
It is really hard to use.

I have a few more thoughts:
- I think we can make the user's jar as thin as possible by providing the
appropriate excludes. Shipping large jars can consume a lot of resources
and time.
- Why not also add one for Hive version 3?

[1] https://issues.apache.org/jira/browse/FLINK-14849

Best,
Jingsong Lee

On Fri, Dec 13, 2019 at 5:12 PM Terry Wang <[email protected]> wrote:

> Hi Bowen~
>
> Thanks for driving this. I tried using the SQL Client with the Hive
> connector about two weeks ago, and in my experience it was painful to set
> up the environment.
> +1 for this proposal.
>
> Best,
> Terry Wang
>
>
>
> > On Dec 13, 2019, at 16:44, Bowen Li <[email protected]> wrote:
> >
> > Hi all,
> >
> > I would like to propose a couple of separate Flink distributions with Hive
> > dependencies for specific Hive versions (2.3.4 and 1.2.1). The
> > distributions would be provided to users on the Flink download page [1].
> >
> > A few reasons to do this:
> >
> > 1) Flink-Hive integration is important to many Flink and Hive users in
> > two dimensions:
> >     a) for Flink metadata: HiveCatalog is the only persistent catalog to
> > manage Flink tables. With Flink 1.10 supporting more DDL, the persistent
> > catalog will play an even more critical role in users' workflows
> >     b) for Flink data: the Hive data connector (source/sink) helps both
> > Flink and Hive users unlock new use cases in streaming,
> > near-realtime/realtime data warehouses, backfill, etc.
> >
> > 2) currently users have to go through a *really* tedious process to get
> > started, because it requires lots of extra jars (see [2]) that are absent
> > from Flink's lean distribution. We've had many users on the public mailing
> > list, in private email, and in DingTalk groups who got frustrated after
> > spending lots of time figuring out the jars themselves. They would rather
> > have a more "right out of the box" quickstart experience, and play with
> > the catalog and source/sink without hassle.
> >
> > 3) it's easier for users to replace those Hive dependencies for their own
> > Hive versions - just replace those jars with the right versions, with no
> > need to look up the doc.
> >
> > * Hive 2.3.4 and 1.2.1 are two versions that represent a large user base
> > out there, and that's why we are using them as examples for dependencies
> > in [1], even though we now support almost all Hive versions [3].
> >
> > I want to hear what the community thinks about this, and how to achieve
> > it if we believe that's the way to go.
> >
> > Cheers,
> > Bowen
> >
> > [1] https://flink.apache.org/downloads.html
> > [2] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies
> > [3] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#supported-hive-versions
>
>

-- 
Best, Jingsong Lee
