Hey Wing,

I have a few concerns with the current PR:
1) We need to update LICENSE/NOTICE for the hive3/hive4 dependencies, because I believe we only reviewed the currently referenced version.
2) We're producing artifacts for the hive3 and hive4 modules, which I think we want to exclude (we shouldn't triplicate artifacts that are cross-compatible).
3) I don't think we should be copying the Hive DDL files into the Iceberg project.

I'd really like to find a workaround for #3. I've added comments to the PR as well.

Thanks,
Dan

On Fri, Apr 11, 2025 at 9:28 AM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:

> Hi,
>
> I created https://github.com/apache/iceberg/pull/12721 as an alternative
> to my earlier PR, https://github.com/apache/iceberg/pull/12681. With very
> helpful pointers from Peter Vary, and with his persistent prodding, we
> have an implementation where a single source set can be used to build and
> test with Hive 2, 3, and 4.
>
> In the new PR, the existing hive-metastore module is still built with
> Hive 2, and additional modules, hive3-metastore and hive4-metastore, are
> introduced. Currently no other modules depend on the latter two. All
> three are built and tested in the existing java-ci workflow (without any
> change).
>
> The (non-test) classes in the hive-metastore module are packaged in the
> Flink and Spark runtime JARs. As a single source is used to support Hive
> 2, 3, and 4, they can be used with a HiveCatalog backed by a Hive 2, 3,
> or 4 HMS. In the PR, we do not make any changes to the Flink or Spark
> modules. Testing them with a different version of Hive is the next step.
> I'd like to hear your opinion on how to approach that.
>
> Thanks,
> Wing Yew
>
> On Thu, Apr 3, 2025 at 5:31 PM Wing Yew Poon <wyp...@cloudera.com> wrote:
>
>> Hi Peter,
>>
>> Thank you for your thoughts on this.
>>
>> My PR, https://github.com/apache/iceberg/pull/12681, achieves the
>> following objectives:
>> (1) Build and test the hive-metastore module against Hive 2, 3, and 4.
>> (As you suggested in the earlier thread, we build and test against
>> 2.3.10, 3.1.3, and 4.0.1.)
>> (2) Keep a single code source for the hive-metastore implementation (the
>> test code needed to be different, unfortunately).
>>
>> In order to achieve (2), code changes were in fact needed (due to
>> HIVE-27925 <https://issues.apache.org/jira/browse/HIVE-27925> breaking
>> backward compatibility). So, even if the HMS API is backward compatible
>> (and HiveConf is not considered to be part of the HMS API), this shows
>> the importance of actually doing (1).
>> As a result of (2), even though in this PR the Flink and Spark modules
>> continue to be built with Hive 2 dependencies, the hive-metastore
>> classes packaged in the Flink and Spark runtime jars are usable with a
>> Hive 2, 3, or 4 HMS (provided the engine supports that version of Hive).
>>
>> The use of separate Gradle modules (hive3-metastore, hive4-metastore) is
>> an implementation detail. I don't think we disagree on the objectives.
>> If you have suggestions for implementing the objectives differently, I'm
>> happy to hear them. To avoid getting into the weeds, we can discuss them
>> in the PR.
>>
>> I understand the desire to have a single version of `TestHiveMetastore`,
>> a test class that is used by the Flink and Spark modules as well.
>> Unfortunately, I did not see a way of doing this even with
>> `DynConstructors` and `DynMethods`. In Hive 4, `HMSHandler` became a
>> top-level class instead of being a nested class in `HiveMetaStore` (and
>> I don't see how to import one class from one version of Hive and a
>> different class from another version of Hive). Even with
>> `DynConstructors` you have to give it the instance of the class. Again,
>> if you have suggestions to overcome this, we can discuss them in the PR.
>> However, I feel that keeping a single version of test code that works
>> for all Hive versions, while desirable, should not be a blocker.
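[Editor's note: the class-layout constraint described above can be made concrete with a small sketch. A dynamic lookup can find whichever `HMSHandler` class is present at runtime, but it cannot give you a single compile-time import. This is plain reflection rather than Iceberg's actual `DynConstructors`/`DynMethods` utilities, and the final `java.lang.Object` candidate is a stand-in fallback so the snippet runs without Hive on the classpath; the two Hive class names are the pre- and post-Hive-4 locations of `HMSHandler` as described in the email.]

```java
// Sketch of version-dependent class lookup, in the spirit of Iceberg's
// DynConstructors/DynMethods. Each candidate name is tried in order; the
// first one present on the classpath wins. Without Hive on the classpath,
// the JDK stand-in fallback is the one that resolves.
public class HandlerLookup {
  static Class<?> firstPresent(String... classNames) {
    for (String name : classNames) {
      try {
        return Class.forName(name);
      } catch (ClassNotFoundException e) {
        // not on the classpath; try the next candidate
      }
    }
    throw new IllegalStateException("no candidate class found");
  }

  public static void main(String[] args) {
    Class<?> handler = firstPresent(
        "org.apache.hadoop.hive.metastore.HMSHandler",               // Hive 4: top-level
        "org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler", // Hive 2/3: nested
        "java.lang.Object");                                         // stand-in fallback
    System.out.println("resolved: " + handler.getName());
  }
}
```

This resolves a class at runtime, but, as Wing Yew notes, it does not help a test class that needs to reference one version's type statically.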
>>
>> The scope of the PR is deliberately limited to the hive-metastore
>> module. I feel that which versions of Hive to use for testing the Flink
>> and Spark modules against the HiveCatalog, and how to do so, should be a
>> second (and possibly even third) step. As an experiment/proof of
>> concept, in https://github.com/apache/iceberg/pull/12693 I show that the
>> Flink modules run successfully with hive-metastore built with Hive 4.
>> Additionally, the Spark modules run successfully with hive-metastore
>> built with Hive 3 (they fail with Hive 4). I would say, though, that
>> while Spark 4.0 will support using a Hive 4 HMS, I do not expect we will
>> see Spark 3.4 or 3.5 support using a Hive 4 HMS. Thus, I do not think we
>> should block this work on what Spark supports.
>>
>> Thanks,
>> Wing Yew
>>
>> On Thu, Apr 3, 2025 at 3:01 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>
>>> Hi Wing Yew,
>>>
>>> Thanks for taking a look at this.
>>> After the removal of the Hive runtime code, we only depend on HMS in
>>> the HiveCatalog module in the production code. Since the HMS API is
>>> supposed to be backward compatible, I would prefer to keep a single
>>> hive-metastore module and a single source dir for accessing HMS. When
>>> there is a need, we can use DynMethods as we do now.
>>>
>>> I understand that for testing we need to start a HiveMetastore. It is
>>> used by Flink and Spark as well. Here things become harder, as we need
>>> to instantiate a test Hive Metastore where we don't have backward
>>> compatibility guarantees. If possible, I would still prefer to keep a
>>> single codebase for testing purposes, where we use DynMethods and
>>> HiveVersion to decide which version of HMS we instantiate. This could
>>> easily become a mess, but I would try to check if we can get away with
>>> it somehow.
>>>
>>> I would definitely run the hive-metastore tests with all of the
>>> supported Hive versions.
>>> OTOH, I would defer to the Spark/Flink maintainers to decide how
>>> confident they are in their products' ability to abstract away the
>>> differences between the Hive versions. For Flink, I would suggest
>>> running the Flink tests only with the latest supported Hive version.
>>>
>>> Thanks,
>>> Peter
>>>
>>> On Mon, Mar 31, 2025 at 4:27 AM Wing Yew Poon
>>> <wyp...@cloudera.com.invalid> wrote:
>>>
>>>> Hi,
>>>>
>>>> I want to follow up on the earlier discussion about Hive support,
>>>> where it was decided to remove the Hive runtime support from the
>>>> Iceberg repo and leave the Hive metastore support (where the
>>>> HiveCatalog is implemented) on Hive 2 for the time being.
>>>>
>>>> In that earlier thread, Peter Vary proposed that we support Hive
>>>> 2.3.10, 3.1.3, and 4.0.1 for the hive-metastore module and ensure that
>>>> it gets built and tested against those versions.
>>>>
>>>> I have implemented this in
>>>> https://github.com/apache/iceberg/pull/12681. I have left the existing
>>>> hive-metastore module depending on Hive 2.3.10, and added new
>>>> hive3-metastore and hive4-metastore modules that depend on Hive 3.1.3
>>>> and 4.0.1 respectively. I have followed the approach previously used
>>>> by the mr and hive3 modules and kept all common code in one directory
>>>> (the existing hive-metastore directory) to avoid code duplication. In
>>>> order to work around
>>>> https://issues.apache.org/jira/browse/HIVE-27925, which introduced a
>>>> backward incompatibility in Hive 4, I have avoided the use of
>>>> HiveConf.ConfVars enums and used the conf property names (which have
>>>> not changed) instead. (This is also the approach used by Spark in
>>>> https://issues.apache.org/jira/browse/SPARK-47679.) Please see the PR
>>>> for more details.
>>>>
>>>> The Flink and Spark modules (along with the delta-lake module) have
>>>> test code that depends on the hive-metastore module as well as on Hive
>>>> metastore jars.
>>>> Having those modules test against Hive 3 and Hive 4 metastore
>>>> versions is not in the scope of the above PR. I plan to work on that
>>>> separately as a follow-up, and I want to hear opinions on the
>>>> approach. As a proof of concept, I have put up
>>>> https://github.com/apache/iceberg/pull/12693 with follow-on changes to
>>>> test the Flink modules against hive4-metastore. This is
>>>> straightforward and there are no issues.
>>>>
>>>> For Spark, though, as I mentioned in the earlier thread, Spark uses a
>>>> built-in version of the Hive metastore (currently 2.3.10), but it can
>>>> be configured to use a different version and pointed to a path
>>>> containing the Hive metastore jars for that version. However, the
>>>> highest Hive version that can be configured for Spark 3.5 is 3.1.3
>>>> (Spark 4 will support 4.0.x), as changes in Spark code are needed to
>>>> work around HIVE-27925.
>>>>
>>>> What I'm interested in hearing is: for testing Flink and Spark against
>>>> Hive versions, do we want to test against
>>>> (1) just one version, e.g., the highest version supportable by that
>>>> Flink/Spark version (or alternatively just 2.3.10), or
>>>> (2) multiple versions from among 2.3.10, 3.1.3, and 4.0.1, as long as
>>>> they are supportable by that Flink/Spark version?
>>>> And if (2), how do we want to do that, e.g., a full matrix, or some
>>>> kind of sampling?
>>>>
>>>> Thanks,
>>>> Wing Yew
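[Editor's note: the HIVE-27925 workaround Wing Yew describes in the Mar 31 mail (spell out the stable property-name strings instead of referencing `HiveConf.ConfVars` enum constants) can be sketched as follows. `HiveConf` is modeled here as a plain `Map` so the snippet is self-contained and runs without Hive dependencies; the point is that the literal key `"hive.metastore.uris"` is the same across Hive 2, 3, and 4, while the enum constant naming it was renamed by HIVE-27925, so code compiled against the enum breaks on one side of the rename or the other.]

```java
import java.util.Map;

// Sketch of the HIVE-27925 workaround: reference stable property-name
// strings rather than HiveConf.ConfVars enum constants. A real HiveConf is
// replaced by a Map here so the example is self-contained.
public class ConfNames {
  // The string key is stable across Hive 2/3/4, unlike the ConfVars enum
  // constant that names it (renamed by HIVE-27925 in Hive 4).
  static final String METASTORE_URIS = "hive.metastore.uris";

  static String metastoreUris(Map<String, String> conf) {
    return conf.getOrDefault(METASTORE_URIS, "");
  }

  public static void main(String[] args) {
    Map<String, String> conf = Map.of(METASTORE_URIS, "thrift://localhost:9083");
    System.out.println(metastoreUris(conf)); // prints thrift://localhost:9083
  }
}
```

The trade-off is losing the enum's compile-time checking of property names, in exchange for one set of sources that compiles against all three Hive versions.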