Hi all,

I'd like to propose providing a couple of separate Flink distributions that
bundle Hive dependencies for specific Hive versions (2.3.4 and 1.2.1). The
distributions would be offered to users on the Flink download page [1].

A few reasons to do this:

1) Flink-Hive integration is important to many Flink and Hive users in two
dimensions:
     a) for Flink metadata: HiveCatalog is the only persistent catalog to
manage Flink tables. With Flink 1.10 supporting more DDL, the persistent
catalog will play an even more critical role in users' workflows.
     b) for Flink data: the Hive data connector (source/sink) helps both
Flink and Hive users unlock new use cases in streaming, near-realtime/realtime
data warehousing, backfill, etc.

2) Currently users have to go through a *really* tedious process to get
started, because it requires lots of extra jars (see [2]) that are absent
from Flink's lean distribution. We've heard from many users on the public
mailing list, in private emails, and in DingTalk groups who got frustrated
after spending lots of time figuring out the jars themselves. They would
rather have a more "out of the box" quickstart experience, and play with
the catalog and source/sink without hassle.

3) It's easier for users to swap in the Hive dependencies for their own
Hive version: they just replace the bundled jars with the right versions,
with no need to hunt through the docs.

* Hive 2.3.4 and 1.2.1 are two versions that represent a large user base
out there, and that's why we are using them as examples for dependencies in
[1] even though we now support almost all Hive versions [3].

I'd like to hear what the community thinks about this, and how to achieve
it if we believe that's the way to go.

Cheers,
Bowen

[1] https://flink.apache.org/downloads.html
[2] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies
[3] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#supported-hive-versions
