Hi Peter,

Thank you for your thoughts on this.

My PR, https://github.com/apache/iceberg/pull/12681, achieves the following objectives:
(1) Build and test the hive-metastore module against Hive 2, 3, and 4. (As you suggested in the earlier thread, we build and test against 2.3.10, 3.1.3, and 4.0.1.)
(2) Keep a single code source for the hive-metastore implementation (the test code needed to be different, unfortunately).

In order to achieve (2), code changes were in fact needed, due to HIVE-27925 <https://issues.apache.org/jira/browse/HIVE-27925> breaking backward compatibility. So even if the HMS API is backward compatible (and HiveConf is not considered to be a part of the HMS API), this shows the importance of actually doing (1).
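To make the workaround concrete: it boils down to replacing `HiveConf.ConfVars` lookups with the property-name strings, along the lines of the sketch below (illustrative only, not the exact PR code; the property names are the real, unchanged keys, while the class name and values are placeholders).

    import org.apache.hadoop.conf.Configuration;

    // Illustrative sketch of the HIVE-27925 workaround: the HiveConf.ConfVars
    // enum constants changed incompatibly in Hive 4, but the underlying
    // property-name strings did not, so we read and write the String keys
    // directly instead of going through the enum.
    public class ConfVarsWorkaroundSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Instead of conf.getVar(HiveConf.ConfVars.METASTOREURIS), which only
        // compiles against a specific Hive version, use the stable property name:
        String uris = conf.get("hive.metastore.uris", "");

        // Same idea when setting values (the path here is just a placeholder):
        conf.set("hive.metastore.warehouse.dir", "/tmp/warehouse");

        System.out.println("metastore uris = " + uris);
      }
    }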
As a result of (2), even though the Flink and Spark modules in this PR continue to be built with Hive 2 dependencies, the hive-metastore classes packaged in the Flink and Spark runtime jars are usable with a Hive 2, 3, or 4 HMS (provided the engine supports that version of Hive). The use of separate Gradle modules (hive3-metastore, hive4-metastore) is an implementation detail. I don't think we disagree on the objectives. If you have suggestions for implementing the objectives differently, I'm happy to hear them. To avoid getting into the weeds, we can discuss them in the PR.

I understand the desire to have a single version of `TestHiveMetastore`, a test class that is used by the Flink and Spark modules as well. Unfortunately, I did not see a way of doing this even with `DynConstructors` and `DynMethods`. In Hive 4, `HMSHandler` became a top-level class instead of being a nested class in `HiveMetaStore`, and I don't see how to import one class from one version of Hive and a different class from another version of Hive. Even with `DynConstructors`, you have to give it the class to instantiate. Again, if you have suggestions to overcome this, we can discuss them in the PR. However, I feel that keeping a single version of test code that works for all Hive versions, while desirable, should not be a blocker.
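To illustrate the difficulty: a string-based `DynConstructors` lookup can name both locations of the class, roughly as sketched below, but the constructor argument types shown are assumptions for illustration, and the real signatures differ across Hive versions, which is part of the problem.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.common.DynConstructors;

    // Sketch of a string-based lookup that tries both locations of HMSHandler:
    // nested inside HiveMetaStore in Hive 2/3, top-level in Hive 4. The
    // constructor argument types are assumptions for illustration; the real
    // signatures are not identical across versions.
    public class HmsHandlerLookupSketch {
      public static Object newHandler(Configuration conf) {
        DynConstructors.Ctor<Object> ctor =
            DynConstructors.builder()
                // Hive 2/3: HMSHandler is a nested class of HiveMetaStore
                .impl("org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler",
                    String.class, Configuration.class)
                // Hive 4: HMSHandler was promoted to a top-level class
                .impl("org.apache.hadoop.hive.metastore.HMSHandler",
                    String.class, Configuration.class)
                .build();
        return ctor.newInstance("test-hms", conf);
      }
    }

And even if the lookup resolved, the surrounding wiring of the test metastore still differs per version, so at best this moves the problem around.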
The scope of the PR is deliberately limited to the hive-metastore module. I feel that deciding which versions of Hive to test the Flink and Spark modules against the HiveCatalog with, and how to do so, should be a second (and possibly even a third) step.

As an experiment/proof of concept, in https://github.com/apache/iceberg/pull/12693, I show that the Flink modules run successfully with hive-metastore built against Hive 4. Additionally, the Spark modules run successfully with hive-metastore built against Hive 3 (they fail with Hive 4). I would note, though, that while Spark 4.0 will support using a Hive 4 HMS, I do not expect Spark 3.4 or 3.5 to ever support one. Thus, I do not think we should block this work on what Spark supports.
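For anyone who wants to reproduce this: pointing Spark at a non-built-in metastore client version uses the standard Spark settings, roughly as below (a sketch; the jars path is a placeholder, and per my original message quoted below, 3.1.3 is the highest version Spark 3.5 accepts).

    import org.apache.spark.sql.SparkSession;

    // Sketch of configuring Spark to use a non-built-in Hive metastore client
    // version. The jars path is a placeholder and must point at the Hive 3.1.3
    // metastore jars.
    public class SparkHmsVersionSketch {
      public static void main(String[] args) {
        SparkSession spark =
            SparkSession.builder()
                .master("local[2]")
                .config("spark.sql.hive.metastore.version", "3.1.3")
                .config("spark.sql.hive.metastore.jars", "path")
                .config("spark.sql.hive.metastore.jars.path",
                    "/path/to/hive-3.1.3-jars/*")
                .enableHiveSupport()
                .getOrCreate();

        spark.sql("SHOW DATABASES").show();
        spark.stop();
      }
    }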
Thanks,
Wing Yew

On Thu, Apr 3, 2025 at 3:01 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:

> Hi Wing Yew,
>
> Thanks for taking a look at this.
> After the removal of the Hive runtime code, we only depend on HMS in the
> HiveCatalog module in the production code. Since the HMS API is supposed
> to be backward compatible, I would prefer to keep a single hive-metastore
> module and a single source dir for accessing HMS. When there is a need, we
> can use DynMethods as we do now.
>
> I understand that for testing we need to start a Hive Metastore. It is
> used by Flink and Spark as well. Here things become harder, as we need to
> instantiate a test Hive Metastore where we don't have backward
> compatibility guarantees. If possible, I would still prefer to keep a
> single codebase for testing purposes, where we use DynMethods and
> HiveVersion to decide which version of HMS we instantiate. This could
> easily become a mess, but I would try to check if we can get away with it
> somehow.
>
> I would definitely run the hive-metastore tests with all of the supported
> Hive versions. OTOH, I would defer to the Spark/Flink maintainers to
> decide how confident they are in their products' ability to abstract away
> the differences between the Hive versions. For Flink, I would suggest
> running the Flink tests only with the latest supported Hive version.
>
> Thanks,
> Peter
>
>
> Wing Yew Poon <wyp...@cloudera.com.invalid> ezt írta (időpont: 2025.
> márc. 31., H, 4:27):
>
>> Hi,
>>
>> I want to follow up on the earlier discussion about Hive support, where
>> it was decided to remove the Hive runtime support from the Iceberg repo
>> and leave the Hive metastore support (where the HiveCatalog is
>> implemented) on Hive 2 for the time being.
>>
>> In that earlier thread, Peter Vary proposed that we support Hive 2.3.10,
>> 3.1.3, and 4.0.1 for the hive-metastore module and ensure that it gets
>> built and tested against those versions.
>>
>> I have implemented this in https://github.com/apache/iceberg/pull/12681.
>> I have left the existing hive-metastore module depending on Hive 2.3.10,
>> and added new hive3-metastore and hive4-metastore modules that depend on
>> Hive 3.1.3 and 4.0.1 respectively. I have followed the approach
>> previously used by the mr and hive3 modules, keeping all common code in
>> one directory (the existing hive-metastore directory) to avoid code
>> duplication. To work around
>> https://issues.apache.org/jira/browse/HIVE-27925, which introduced a
>> backward incompatibility in Hive 4, I have avoided the use of
>> HiveConf.ConfVars enums and used the conf property names (which have not
>> changed) instead. (This is also the approach used by Spark in
>> https://issues.apache.org/jira/browse/SPARK-47679.) Please see the PR
>> for more details.
>>
>> The Flink and Spark modules (along with the delta-lake module) have test
>> code that depends on the hive-metastore module as well as on Hive
>> metastore jars. Having those modules test against Hive 3 and Hive 4
>> metastore versions is not in the scope of the above PR. I plan to work
>> on that separately as a follow-up, and I want to hear opinions on the
>> approach. As a proof of concept, I have put up
>> https://github.com/apache/iceberg/pull/12693 with follow-on changes to
>> test the Flink modules against hive4-metastore. This is straightforward,
>> and there are no issues.
>>
>> For Spark, though, as I mentioned in the earlier thread, Spark uses a
>> built-in version of the Hive metastore (currently 2.3.10), but can be
>> configured to use a different version and be pointed to a path
>> containing Hive metastore jars for that version. However, the highest
>> Hive version that can be configured for Spark 3.5 is 3.1.3 (Spark 4 will
>> support 4.0.x), as changes in Spark code are needed to work around
>> HIVE-27925.
>>
>> What I'm interested in hearing is: for testing Flink and Spark against
>> Hive versions, do we want to test against
>> (1) just one version, e.g., the highest version supportable by that
>> Flink/Spark version (or alternatively just 2.3.10), or
>> (2) multiple versions from 2.3.10, 3.1.3, and 4.0.1, as long as they are
>> supportable by that Flink/Spark version?
>> And if (2), how do we want to do that, e.g., full matrix or some kind of
>> sampling?
>>
>> Thanks,
>> Wing Yew