Hi,

I created https://github.com/apache/iceberg/pull/12721 as an alternative to
my earlier PR, https://github.com/apache/iceberg/pull/12681. With very
helpful pointers from Peter Vary and his persistent prodding, we have an
implementation where a single source set can be used to build and test with
Hive 2, 3 and 4.

In the new PR, the existing hive-metastore module is still built with Hive
2, and two additional modules, hive3-metastore and hive4-metastore, are
introduced. Currently no other modules depend on the latter two. All three
are built and tested in the existing java-ci workflow (without any change).
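
As a rough sketch of how one source set can serve multiple modules (the
module name matches the PR, but the build wiring below is my assumption,
not copied from it):

```groovy
// Hypothetical build snippet: compile the existing hive-metastore sources
// against Hive 3 in a separate module. Paths, property names, and version
// strings are illustrative, not taken from the PR.
project(':iceberg-hive3-metastore') {
  sourceSets {
    main {
      // reuse the single shared source directory from hive-metastore
      java.srcDirs = ["${rootDir}/hive-metastore/src/main/java"]
    }
  }
  dependencies {
    // Hive 3 in place of the Hive 2 dependency used by hive-metastore
    compileOnly 'org.apache.hive:hive-metastore:3.1.3'
  }
}
```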

The (non-test) classes in the hive-metastore module are packaged in the
Flink and Spark runtime JARs. As a single source set is used to support
Hive 2, 3, and 4, these classes can be used with a HiveCatalog backed by a
Hive 2, 3, or 4 HMS. In this PR, we do not make any changes to the Flink or
Spark modules.
Testing them with a different version of Hive is the next step. I'd like to
hear your opinion on how to approach that.

Thanks,
Wing Yew




On Thu, Apr 3, 2025 at 5:31 PM Wing Yew Poon <wyp...@cloudera.com> wrote:

> Hi Peter,
>
> Thank you for your thoughts on this.
>
> My PR, https://github.com/apache/iceberg/pull/12681, achieves the
> following objectives:
> (1) Build and test the hive-metastore module against Hive 2, 3, and 4. (As
> you suggested in the earlier thread, we build and test against 2.3.10,
> 3.1.3, and 4.0.1.)
> (2) Keep a single code source for the hive-metastore implementation (the
> test code needed to be different, unfortunately).
>
> In order to achieve (2), code changes were in fact needed (due to
> HIVE-27925 <https://issues.apache.org/jira/browse/HIVE-27925> breaking
> backward compatibility). So, even if the HMS API is backward compatible (and
> HiveConf is not considered part of the HMS API), this shows the
> importance of actually doing (1).
> As a result of (2), even though in this PR the Flink and Spark modules
> continue to be built with Hive 2 dependencies, the hive-metastore classes
> packaged in the Flink and Spark runtime jars are usable with a Hive 2, 3,
> or 4 HMS (provided the engine supports that version of Hive).
>
> The use of separate Gradle modules (hive3-metastore, hive4-metastore) is
> an implementation detail. I don't think we disagree on the objectives. If
> you have suggestions for implementing the objectives differently, I'm happy
> to hear them. To avoid getting into the weeds, we can discuss them in the
> PR.
>
> I understand the desire to have a single version of `TestHiveMetastore`, a
> test class that is used by the Flink and Spark modules as well.
> Unfortunately, I did not see a way of doing this even with
> `DynConstructors` and `DynMethods`. In Hive 4, `HMSHandler` became a
> top-level class instead of being a nested class in `HiveMetaStore` (and I
> don't see how to import one class from one version of Hive and a different
> class from another version of Hive). Even with `DynConstructors` you have
> to give it the instance of the class. Again, if you have suggestions to
> overcome this, we can discuss them in the PR. However, I feel that keeping
> a single version of test code that works for all Hive versions, while
> desirable, should not be a blocker.
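
[The relocation described above can be illustrated with a small sketch.
HMSHandler is the nested class HiveMetaStore$HMSHandler in Hive 2/3 but a
top-level class in Hive 4, so no single import covers both; the class has
to be resolved by name at runtime before DynConstructors could even be
given a target. The candidate class names are the real Hive ones; since
Hive is not on the classpath here, the lookup falls through to null.]

```java
// Sketch of the version-dispatch problem: resolve HMSHandler by trying
// each known fully-qualified name. Without Hive jars on the classpath,
// both lookups fail and the method returns null.
public class HmsHandlerLookup {
    static Class<?> hmsHandlerClass() {
        String[] candidates = {
            // Hive 4: top-level class
            "org.apache.hadoop.hive.metastore.HMSHandler",
            // Hive 2/3: nested in HiveMetaStore
            "org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler"
        };
        for (String name : candidates) {
            try {
                return Class.forName(name);
            } catch (ClassNotFoundException e) {
                // not this version; try the next candidate
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Class<?> c = hmsHandlerClass();
        System.out.println(c == null ? "HMSHandler not on classpath" : c.getName());
    }
}
```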
>
> The scope of the PR is deliberately limited to the hive-metastore module.
> I feel that what versions of Hive to use for testing Flink and Spark
> modules against the HiveCatalog and how to do so should be a second (and
> possibly even third) step. As an experiment/proof of concept, in
> https://github.com/apache/iceberg/pull/12693, I show that Flink modules
> run successfully with hive-metastore built with Hive 4. Additionally, Spark
> modules run successfully with hive-metastore built with Hive 3 (they fail
> with Hive 4). I would say though that while Spark 4.0 will support using a
> Hive 4 HMS, I do not expect we will see Spark 3.4 or 3.5 supporting using a
> Hive 4 HMS. Thus, I do not think we should block this work on what Spark
> supports.
>
> Thanks,
> Wing Yew
>
>
>
>
> On Thu, Apr 3, 2025 at 3:01 AM Péter Váry <peter.vary.apa...@gmail.com>
> wrote:
>
>> Hi Wing Yew,
>>
>> Thanks for taking a look at this.
>> After the removal of the Hive runtime code, we only depend on HMS in the
>> HiveCatalog module in the production code. Since the HMS API is supposed
>> to be backward compatible, I would prefer to keep a single hive-metastore
>> module and a single source dir for accessing HMS. When there is a need, we
>> can use DynMethods as we do now.
>>
>> I understand that for testing we need to start a Hive Metastore, which is
>> used by Flink and Spark as well. Here things become harder, as we need to
>> instantiate a test Hive Metastore where we don't have backward
>> compatibility guarantees. If possible, I would still prefer to keep a
>> single codebase for testing purposes where we use DynMethods and
>> HiveVersion to decide which version of HMS we instantiate. This could
>> easily become a mess, but I would try to check if we can get away with it
>> somehow.
>>
>> I would definitely run the hive-metastore tests with all of the supported
>> Hive versions. OTOH, I would defer to the Spark/Flink maintainers to decide
>> how confident they are in their product's ability to abstract away the
>> differences between the Hive versions. For Flink, I would suggest running
>> the tests only with the latest supported Hive version.
>>
>> Thanks,
>> Peter
>>
>>
>> Wing Yew Poon <wyp...@cloudera.com.invalid> wrote (on Mon, Mar 31, 2025,
>> at 4:27):
>>
>>> Hi,
>>>
>>> I want to follow up on the earlier discussion about Hive support, where
>>> it was decided to remove the Hive runtime support from the Iceberg repo,
>>> and leave the Hive metastore support (where the HiveCatalog is implemented)
>>> on Hive 2 for the time being.
>>>
>>> In that earlier thread, Peter Vary proposed that we support Hive 2.3.10,
>>> 3.1.3 and 4.0.1 for the hive-metastore module and ensure that it gets built
>>> and tested against those versions.
>>>
>>> I have implemented this in https://github.com/apache/iceberg/pull/12681.
>>> I have left the existing hive-metastore module depending on Hive 2.3.10,
>>> and added new hive3-metastore and hive4-metastore modules that depend on
>>> Hive 3.1.3 and 4.0.1 respectively. I have followed the approach used by the
>>> mr and hive3 modules previously and kept all common code in one directory
>>> (the existing hive-metastore directory) and avoided code duplication. In
>>> order to work around https://issues.apache.org/jira/browse/HIVE-27925,
>>> which introduced a backward incompatibility in Hive 4, I have avoided the
>>> use of HiveConf.ConfVars enums and used the conf property names (which have
>>> not changed) instead. (This is also the approach used by Spark in
>>> https://issues.apache.org/jira/browse/SPARK-47679.) Please see the PR
>>> for more details.
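
[A minimal sketch of the workaround described above: instead of the
HiveConf.ConfVars enum constants, which HIVE-27925 changed incompatibly in
Hive 4, reference configuration properties by their string names, which are
unchanged across Hive 2, 3, and 4. A plain Properties object stands in for
HiveConf here so the sketch runs without Hive jars; "hive.metastore.uris"
is a real Hive property name.]

```java
// Version-independent config access by string key, avoiding the
// HiveConf.ConfVars enum that broke compatibility in Hive 4.
import java.util.Properties;

public class ConfKeyWorkaround {
    // real Hive property name, stable across Hive versions
    static final String METASTORE_URIS = "hive.metastore.uris";

    public static void main(String[] args) {
        Properties conf = new Properties(); // stand-in for HiveConf
        conf.setProperty(METASTORE_URIS, "thrift://localhost:9083");
        // lookup by key name works the same against Hive 2, 3, or 4
        System.out.println(conf.getProperty(METASTORE_URIS));
    }
}
```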
>>>
>>> The Flink and Spark modules (along with the delta-lake module) have test
>>> code that depends on the hive-metastore module as well as on Hive metastore
>>> jars. Having those modules test against Hive 3 and Hive 4 metastore
>>> versions is not in the scope of the above PR. I plan to work on that
>>> separately as follow up, and I want to hear opinions on the approach. As a
>>> proof of concept, I have put up
>>> https://github.com/apache/iceberg/pull/12693 with follow on changes to
>>> test the Flink modules against hive4-metastore. This is straightforward and
>>> there are no issues.
>>>
>>> For Spark though, as I have mentioned in the earlier thread, Spark uses
>>> a built-in version of the Hive metastore (currently 2.3.10), but can be
>>> configured to use a different version and be pointed to a path containing
>>> Hive metastore jars for the different version. However, the highest Hive
>>> version that can be configured for Spark 3.5 is 3.1.3 (Spark 4 will support
>>> 4.0.x), as changes in Spark code are needed to work around
>>> HIVE-27925.
>>>
>>> What I'm interested in hearing is: For testing Flink and Spark against
>>> Hive versions, do we want to test against
>>> (1) just one version, e.g., the highest version supportable by that
>>> Flink/Spark version (or alternatively just 2.3.10), or
>>> (2) multiple versions from 2.3.10, 3.1.3, and 4.0.1, as long as they are
>>> supportable by that Flink/Spark version?
>>> And if (2), how do we want to do that, e.g., a full matrix, or some kind
>>> of sampling?
>>>
>>> Thanks,
>>> Wing Yew
>>>
>>>
