Thanks for working on this @Wing Yew Poon <wyp...@gmail.com>! As I have mentioned on the PR, I would prefer unbundling the hive-metastore dependency from the runtime jars. This would help in multiple ways:
- Decrease the size of the runtime jars - nowadays more and more people use Iceberg against a REST catalog, so bundling the hive-metastore jar is unnecessary for them
- Hive 4 is the only officially supported Hive version - with the current solution we don't support it
- Testing against a specific Hive version will be clearer
Thanks, Peter

Wing Yew Poon <wyp...@cloudera.com.invalid> wrote (on Fri, May 16, 2025, 23:52):

> Hi,
>
> I'd like to bring attention to this again. As you know, the hive-metastore module (which provides the HiveCatalog) is built against Hive 2. I have a PR, https://github.com/apache/iceberg/pull/12721, for building and testing the hive-metastore module against Hive 2, 3 and 4. Daniel Weeks looked into building the module against a single version of Hive and testing that against all three Hive versions, but this approach turned out to be infeasible because of binary incompatibility introduced by Hive 4 (e.g., a hive-metastore jar built against Hive 2 would be incompatible if we use Hive 4 jars for our Hive dependencies). Note that Hive 4 also breaks Thrift compatibility, so talking to a Hive 4 HMS requires using a Hive 4 client.
>
> Given this, I propose that we have three versions of the module, hive-metastore, hive3-metastore and hive4-metastore, built against Hive 2, 3 and 4 respectively. (This is currently what is implemented in the PR.) This means three versions of the artifact. First, I'd like to get agreement from the community on doing this.
> Second, the hive-metastore classes are currently bundled in runtime jars, e.g., the Spark runtime jar and the Flink runtime jar. If we have agreement on the first point, I'd like to hear opinions on options for this second point:
> a. Remove the hive-metastore classes from these runtime jars. Users will pick the hive-metastore jar appropriate to the Hive version they are using and add it to the classpath of their engine.
> b. Keep the runtime jars unchanged, which means bundling the hive-metastore classes built against Hive 2.
> c. Build multiple versions of the runtime jars, for the Hive versions supported by those engines. (This does not mean all combinations, as e.g., Spark 3 does not support Hive 4.)
>
> a. provides the most flexibility at the expense of inconvenience to the user (having to add another jar to the classpath).
> c. is the most complex.
>
> There is a separate question of testing engines such as Spark and Flink against different versions of HMS. At this time, I do not wish to get into this question until the above is resolved. Regardless of the option chosen for the second point, I think it is possible to test engines against different versions of HMS.
>
> Thanks,
> Wing Yew
>
> On Fri, Apr 11, 2025 at 12:03 PM Daniel Weeks <dwe...@apache.org> wrote:
>
>> Hey Wing,
>>
>> There are a few issues I am concerned about with the current PR:
>>
>> 1) We need to update LICENSE/NOTICE for the hive3/hive4 dependencies, because I believe we only looked at the referenced version.
>> 2) We're producing artifacts for the hive3 and hive4 modules, which I think we want to exclude (we shouldn't triplicate artifacts that are cross-compatible).
>> 3) I don't think we should be copying the Hive DDL files into the Iceberg project.
>>
>> I'd really like to find a workaround for #3. I've added comments to the PR as well.
>>
>> Thanks,
>> Dan
>>
>> On Fri, Apr 11, 2025 at 9:28 AM Wing Yew Poon <wyp...@cloudera.com.invalid> wrote:
>>
>>> Hi,
>>>
>>> I created https://github.com/apache/iceberg/pull/12721 as an alternative to my earlier PR, https://github.com/apache/iceberg/pull/12681. With very helpful pointers from Peter Vary and his persistent prodding, we have an implementation where a single source set can be used to build and test with Hive 2, 3 and 4.
>>>
>>> In the new PR, the existing hive-metastore module is still built with Hive 2, and additional modules, hive3-metastore and hive4-metastore, are introduced. Currently no other modules depend on the latter two. All three are built and tested in the existing java-ci workflow (without any change).
>>>
>>> The (non-test) classes in the hive-metastore module are packaged in the Flink and Spark runtime JARs. As a single source is used to support Hive 2, 3 and 4, they can be used with a HiveCatalog backed by a Hive 2, 3 or 4 HMS. In the PR, we do not make any changes to the Flink or Spark modules. Testing them with a different version of Hive is the next step. I'd like to hear your opinion on how to approach that.
>>>
>>> Thanks,
>>> Wing Yew
>>>
>>> On Thu, Apr 3, 2025 at 5:31 PM Wing Yew Poon <wyp...@cloudera.com> wrote:
>>>
>>>> Hi Peter,
>>>>
>>>> Thank you for your thoughts on this.
>>>>
>>>> My PR, https://github.com/apache/iceberg/pull/12681, achieves the following objectives:
>>>> (1) Build and test the hive-metastore module against Hive 2, 3, and 4. (As you suggested in the earlier thread, we build and test against 2.3.10, 3.1.3, and 4.0.1.)
>>>> (2) Keep a single source for the hive-metastore implementation (the test code needed to be different, unfortunately).
>>>>
>>>> In order to achieve (2), code changes were in fact needed (due to HIVE-27925 <https://issues.apache.org/jira/browse/HIVE-27925> breaking backward compatibility). So, even if the HMS API is backward compatible (and HiveConf is not considered to be a part of the HMS API), this shows the importance of actually doing (1).
>>>> As a result of (2), even though in this PR the Flink and Spark modules continue to be built with Hive 2 dependencies, the hive-metastore classes packaged in the Flink and Spark runtime jars are usable with a Hive 2, 3, or 4 HMS (provided the engine supports that version of Hive).
>>>>
>>>> The use of separate Gradle modules (hive3-metastore, hive4-metastore) is an implementation detail. I don't think we disagree on the objectives. If you have suggestions for implementing the objectives differently, I'm happy to hear them. To avoid getting into the weeds, we can discuss them in the PR.
>>>>
>>>> I understand the desire to have a single version of `TestHiveMetastore`, a test class that is used by the Flink and Spark modules as well. Unfortunately, I did not see a way of doing this even with `DynConstructors` and `DynMethods`. In Hive 4, `HMSHandler` became a top-level class instead of being a nested class in `HiveMetaStore` (and I don't see how to import one class from one version of Hive and a different class from another version of Hive). Even with `DynConstructors` you have to give it the instance of the class. Again, if you have suggestions to overcome this, we can discuss them in the PR. However, I feel that keeping a single version of the test code that works for all Hive versions, while desirable, should not be a blocker.
>>>>
>>>> The scope of the PR is deliberately limited to the hive-metastore module. I feel that which versions of Hive to use for testing the Flink and Spark modules against the HiveCatalog, and how to do so, should be a second (and possibly even third) step. As an experiment/proof of concept, in https://github.com/apache/iceberg/pull/12693, I show that the Flink modules run successfully with hive-metastore built with Hive 4. Additionally, the Spark modules run successfully with hive-metastore built with Hive 3 (they fail with Hive 4). I would say though that while Spark 4.0 will support using a Hive 4 HMS, I do not expect we will see Spark 3.4 or 3.5 supporting a Hive 4 HMS. Thus, I do not think we should block this work on what Spark supports.
>>>>
>>>> Thanks,
>>>> Wing Yew
>>>>
>>>> On Thu, Apr 3, 2025 at 3:01 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>
>>>>> Hi Wing Yew,
>>>>>
>>>>> Thanks for taking a look at this.
>>>>> After the removal of the Hive runtime code, we only depend on HMS in the HiveCatalog module in the production code. Since the HMS API is supposed to be backward compatible, I would prefer to keep a single hive-metastore module and a single source dir for accessing HMS. When there is a need, we can use DynMethods as we do now.
>>>>>
>>>>> I understand that for testing we need to start a HiveMetastore. It is used by Flink and Spark as well. Here things become harder, as we need to instantiate a test Hive Metastore where we don't have backward compatibility guarantees. If possible, I would still prefer to keep a single codebase for testing purposes, where we use DynMethods and HiveVersion to decide which version of HMS we instantiate. This could easily become a mess, but I would try to check if we can get away with it somehow.
>>>>>
>>>>> I would definitely run the hive-metastore tests with all of the supported Hive versions. OTOH, I would defer to the Spark/Flink maintainers to decide how confident they are in their products' ability to abstract away the differences between the Hive versions. For Flink, I would suggest running the Flink tests only with the latest supported Hive version.
>>>>>
>>>>> Thanks,
>>>>> Peter
>>>>>
>>>>> Wing Yew Poon <wyp...@cloudera.com.invalid> wrote (on Mon, Mar 31, 2025, 4:27):
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I want to follow up on the earlier discussion about Hive support, where it was decided to remove the Hive runtime support from the Iceberg repo, and leave the Hive metastore support (where the HiveCatalog is implemented) on Hive 2 for the time being.
>>>>>>
>>>>>> In that earlier thread, Peter Vary proposed that we support Hive 2.3.10, 3.1.3 and 4.0.1 for the hive-metastore module and ensure that it gets built and tested against those versions.
>>>>>>
>>>>>> I have implemented this in https://github.com/apache/iceberg/pull/12681. I have left the existing hive-metastore module depending on Hive 2.3.10, and added new hive3-metastore and hive4-metastore modules that depend on Hive 3.1.3 and 4.0.1 respectively. I have followed the approach used by the mr and hive3 modules previously, keeping all common code in one directory (the existing hive-metastore directory) and avoiding code duplication. In order to work around https://issues.apache.org/jira/browse/HIVE-27925, which introduced a backward incompatibility in Hive 4, I have avoided the use of the HiveConf.ConfVars enums and used the conf property names (which have not changed) instead. (This is also the approach used by Spark in https://issues.apache.org/jira/browse/SPARK-47679.) Please see the PR for more details.
>>>>>>
>>>>>> The Flink and Spark modules (along with the delta-lake module) have test code that depends on the hive-metastore module as well as on Hive metastore jars. Having those modules test against Hive 3 and Hive 4 metastore versions is not in the scope of the above PR. I plan to work on that separately as a follow-up, and I want to hear opinions on the approach. As a proof of concept, I have put up https://github.com/apache/iceberg/pull/12693 with follow-on changes to test the Flink modules against hive4-metastore. This is straightforward and there are no issues.
>>>>>>
>>>>>> For Spark though, as I have mentioned in the earlier thread, Spark uses a built-in version of the Hive metastore (currently 2.3.10), but it can be configured to use a different version and be pointed to a path containing the Hive metastore jars for that version. However, the highest Hive version that can be configured for Spark 3.5 is 3.1.3 (Spark 4 will support 4.0.x), as changes in Spark code are needed to work around HIVE-27925.
>>>>>>
>>>>>> What I'm interested in hearing is: for testing Flink and Spark against Hive versions, do we want to test against
>>>>>> (1) just one version, e.g., the highest version supportable by that Flink/Spark version (or alternatively just 2.3.10), or
>>>>>> (2) multiple versions from 2.3.10, 3.1.3 and 4.0.1, as long as they are supportable by that Flink/Spark version?
>>>>>> And if (2), how do we want to do that, e.g., a full matrix, or some kind of sampling?
>>>>>>
>>>>>> Thanks,
>>>>>> Wing Yew