Hi,

I want to follow up on the earlier discussion about Hive support, where it
was decided to remove the Hive runtime support from the Iceberg repo, and
leave the Hive metastore support (where the HiveCatalog is implemented) on
Hive 2 for the time being.

In that earlier thread, Peter Vary proposed that we support Hive 2.3.10,
3.1.3 and 4.0.1 for the hive-metastore module and ensure that it gets built
and tested against those versions.

I have implemented this in https://github.com/apache/iceberg/pull/12681. I
have left the existing hive-metastore module depending on Hive 2.3.10, and
added new hive3-metastore and hive4-metastore modules that depend on Hive
3.1.3 and 4.0.1 respectively. I have followed the approach previously used
by the mr and hive3 modules, keeping all common code in one directory (the
existing hive-metastore directory) to avoid code duplication. To work
around https://issues.apache.org/jira/browse/HIVE-27925, which
introduced a backward incompatibility in Hive 4, I have avoided the use of
HiveConf.ConfVars enums and used the conf property names (which have not
changed) instead. (This is also the approach used by Spark in
https://issues.apache.org/jira/browse/SPARK-47679.) Please see the PR for
more details.
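
To illustrate the workaround, here is a rough sketch (not the actual code
from the PR; the class, constant, and method names are made up for
illustration):

import org.apache.hadoop.conf.Configuration;

// Sketch: read metastore settings via the stable property-name strings
// instead of HiveConf.ConfVars, whose enum constants were renamed in
// Hive 4 (HIVE-27925).
public class HiveConfCompat {
  // "hive.metastore.uris" is spelled the same in Hive 2.x, 3.x and 4.x.
  private static final String METASTORE_URIS = "hive.metastore.uris";

  public static String metastoreUris(Configuration conf) {
    // Replaces a lookup through HiveConf.ConfVars.METASTOREURIS, which does
    // not compile against all of the supported Hive versions.
    return conf.get(METASTORE_URIS, "");
  }
}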

The Flink and Spark modules (along with the delta-lake module) have test
code that depends on the hive-metastore module as well as on Hive metastore
jars. Having those modules test against Hive 3 and Hive 4 metastore
versions is not in the scope of the above PR. I plan to work on that
separately as a follow-up, and I want to hear opinions on the approach. As a
proof of concept, I have put up https://github.com/apache/iceberg/pull/12693
with follow on changes to test the Flink modules against hive4-metastore.
This is straightforward and there are no issues.

For Spark though, as I have mentioned in the earlier thread, Spark uses a
built-in version of the Hive metastore (currently 2.3.10), but it can be
configured to use a different version and be pointed to a path containing
the Hive metastore jars for that version. However, the highest Hive
version that can be configured for Spark 3.5 is 3.1.3 (Spark 4 will support
4.0.x), as changes in Spark code are needed to work around HIVE-27925.
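
For reference, pointing Spark at different metastore jars is done through
the spark.sql.hive.metastore.* settings, roughly like this (the path and
setup below are illustrative, not taken from the PRs):

import org.apache.spark.sql.SparkSession;

public class SparkHiveMetastoreVersionExample {
  public static void main(String[] args) {
    // Sketch: have Spark 3.5 load a Hive 3.1.3 metastore client from a local
    // path instead of using its built-in 2.3.10 client. The jar path is a
    // placeholder.
    SparkSession spark = SparkSession.builder()
        .enableHiveSupport()
        .config("spark.sql.hive.metastore.version", "3.1.3")
        .config("spark.sql.hive.metastore.jars", "path")
        .config("spark.sql.hive.metastore.jars.path",
            "file:///path/to/hive-3.1.3/lib/*.jar")
        .getOrCreate();

    spark.sql("SHOW DATABASES").show();
    spark.stop();
  }
}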

What I'm interested in hearing is: For testing Flink and Spark against Hive
versions, do we want to test against
(1) just one version, e.g., the highest version supportable by that
Flink/Spark version (or alternatively just 2.3.10), or
(2) multiple versions from 2.3.10, 3.1.3 and 4.0.1, as long as they are
supportable by that Flink/Spark version?
And if (2), how do we want to do that, e.g., a full matrix or some kind of
sampling?

Thanks,
Wing Yew
