-1
We shouldn't need to deploy additional binaries for a feature to be
remotely usable.
This usually points to something else being done incorrectly.
If it is indeed such a hassle to set up Hive on Flink, then my conclusion
would be that either
a) the documentation needs to be improved
b) the architecture needs to be improved
or, if all else fails, c) we provide a utility script that makes setup easier.
We spent a lot of time reducing the number of binaries back in the Hadoop
days, and also went to extra lengths to avoid a separate Java 11 binary, and
I see no reason why Hive should get special treatment on this matter.
Regards,
Chesnay
On 13/12/2019 09:44, Bowen Li wrote:
Hi all,
I want to propose to have a couple of separate Flink distributions bundled
with Hive dependencies for specific Hive versions (2.3.4 and 1.2.1). The
distributions would be provided to users on the Flink download page [1].
A few reasons to do this:
1) Flink-Hive integration is important to many, many Flink and Hive users,
in two dimensions:
    a) for Flink metadata: HiveCatalog is the only persistent catalog for
managing Flink tables. With Flink 1.10 supporting more DDL, a persistent
catalog will play an even more critical role in users' workflows (a small
usage sketch follows after this list)
    b) for Flink data: the Hive data connector (source/sink) helps both Flink
and Hive users unlock new use cases in streaming, near-realtime/realtime
data warehousing, backfill, etc.
2) currently users have to go through a *really* tedious process to get
started, because the integration requires lots of extra jars (see [2]) that
are absent from Flink's lean distribution. We've had many users on the public
mailing list, in private email, and in DingTalk groups who got frustrated
after spending lots of time figuring out the jars themselves. They would
rather have a "right out of the box" quickstart experience, and play with the
catalog and source/sink without hassle.
3) it makes it easier for users to swap in dependencies for their own Hive
version - they just replace those jars with the right versions, with no need
to dig through the docs.
* Hive 2.3.4 and 1.2.1 are two versions that represent a large portion of the
user base out there, and that's why we are using them as examples for
dependencies in [1], even though we now support almost all Hive versions [3].
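
To make the "out of the box" experience above concrete, here is a minimal
sketch of what playing with the catalog and source looks like once the jars
are in lib/. It assumes the Flink 1.10-era blink planner Table API; the
catalog name, default database, hive-site.xml directory ("/opt/hive-conf"),
and table name are placeholders for illustration, not part of the proposal:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class HiveQuickstart {
    public static void main(String[] args) {
        // Blink planner in batch mode (Flink 1.10-era settings API).
        EnvironmentSettings settings = EnvironmentSettings.newInstance()
                .useBlinkPlanner()
                .inBatchMode()
                .build();
        TableEnvironment tableEnv = TableEnvironment.create(settings);

        // Placeholder values: catalog name, default database, directory
        // containing hive-site.xml, and Hive version matching the jars.
        HiveCatalog hive = new HiveCatalog(
                "myhive", "default", "/opt/hive-conf", "2.3.4");

        // Register the HiveCatalog so Flink table metadata is persisted
        // in the Hive Metastore, then make it the current catalog.
        tableEnv.registerCatalog("myhive", hive);
        tableEnv.useCatalog("myhive");

        // Query an existing Hive table through the Hive source connector;
        // "some_hive_table" is a placeholder.
        Table result = tableEnv.sqlQuery("SELECT * FROM some_hive_table");
    }
}

None of this runs until the extra jars from [2] are manually placed in lib/,
which is exactly the friction the pre-bundled distributions would remove.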
I want to hear what the community thinks about this, and how to achieve it
if we believe that's the way to go.
Cheers,
Bowen
[1] https://flink.apache.org/downloads.html
[2]
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies
[3]
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#supported-hive-versions