Hey Wing,

There are a few concerning issues I have with the current PR:

1) We need to update LICENSE/NOTICE for the hive3/hive4 dependencies
because I believe we only looked at the referenced version
2) We're producing artifacts for the hive3 and hive4 modules, which I think
we want to exclude (we shouldn't triplicate artifacts that are
cross-compatible)
3) I don't think we should be copying the hive DDL files into the Iceberg
project

I'd really like to find a workaround for #3.  I've added comments to the PR
as well.

Thanks,
Dan

On Fri, Apr 11, 2025 at 9:28 AM Wing Yew Poon <wyp...@cloudera.com.invalid>
wrote:

> Hi,
>
> I created https://github.com/apache/iceberg/pull/12721 as an alternative
> to my earlier PR, https://github.com/apache/iceberg/pull/12681. With very
> helpful pointers from Peter Vary and his persistent prodding, we have an
> implementation where a single source set can be used to build and test with
> Hive 2, 3 and 4.
>
> In the new PR, the existing hive-metastore module is still built with Hive
> 2, and additional modules, hive3-metastore and hive4-metastore are
> introduced. Currently no other modules depend on the latter two. All three
> are built and tested in the existing java-ci workflow (without any change).
>
> The (non-test) classes in the hive-metastore module are packaged in the
> Flink and Spark runtime JARs. As a single source is used to support Hive 2,
> 3 and 4, they can be used with a HiveCatalog backed by a Hive 2, 3 or 4
> HMS. In the PR, we do not make any changes to Flink or Spark modules.
> Testing them with a different version of Hive is the next step. I'd like to
> hear your opinion on how to approach that.
>
> Thanks,
> Wing Yew
>
>
>
>
> On Thu, Apr 3, 2025 at 5:31 PM Wing Yew Poon <wyp...@cloudera.com> wrote:
>
>> Hi Peter,
>>
>> Thank you for your thoughts on this.
>>
>> My PR, https://github.com/apache/iceberg/pull/12681, achieves the
>> following objectives:
>> (1) Build and test the hive-metastore module against Hive 2, 3, and 4.
>> (As you suggested in the earlier thread, we build and test against 2.3.10,
>> 3.1.3, and 4.0.1.)
>> (2) Keep a single code source for the hive-metastore implementation (the
>> test code needed to be different, unfortunately).
>>
>> In order to achieve (2), code changes were in fact needed (due to
>> HIVE-27925 <https://issues.apache.org/jira/browse/HIVE-27925> breaking
>> backward compatibility). So, even if the HMS API is backward compatible
>> (and HiveConf is not considered part of the HMS API), this shows the
>> importance of actually doing (1).
>> As a result of (2), even though in this PR the Flink and Spark modules
>> continue to be built with Hive 2 dependencies, the hive-metastore classes
>> packaged in the Flink and Spark runtime jars are usable with a Hive 2, 3,
>> or 4 HMS (provided the engine supports that version of Hive).
>>
>> The use of separate Gradle modules (hive3-metastore, hive4-metastore) is
>> an implementation detail. I don't think we disagree on the objectives. If
>> you have suggestions for implementing the objectives differently, I'm happy
>> to hear them. To avoid getting into the weeds, we can discuss them in the
>> PR.
>>
>> I understand the desire to have a single version of `TestHiveMetastore`,
>> a test class that is used by the Flink and Spark modules as well.
>> Unfortunately, I did not see a way of doing this even with
>> `DynConstructors` and `DynMethods`. In Hive 4, `HMSHandler` became a
>> top-level class instead of being a nested class in `HiveMetaStore` (and I
>> don't see how to import one class from one version of Hive and a different
>> class from another version of Hive). Even with `DynConstructors` you still
>> have to tell it which class to instantiate. Again, if you have suggestions to
>> overcome this, we can discuss them in the PR. However, I feel that keeping
>> a single version of test code that works for all Hive versions, while
>> desirable, should not be a blocker.
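To make the constraint above concrete: a reflection helper like `DynConstructors` still has to be handed a class name, so supporting both layouts means probing candidate names at runtime. The following is only a self-contained sketch of that probing pattern using plain `java.lang.reflect`, not the PR's code; the Hive class names reflect the move described in this thread (nested `HiveMetaStore$HMSHandler` in Hive 2/3, top-level `HMSHandler` in Hive 4), and the `java.util.ArrayList` fallback exists purely so the sketch runs without Hive on the classpath.

```java
// Sketch of the "try candidate class names" pattern that DynConstructors
// wraps. Each name is probed in order; the first one present on the
// classpath wins.
public class HandlerResolver {

  static Class<?> resolveFirst(String... candidates) {
    for (String name : candidates) {
      try {
        return Class.forName(name);
      } catch (ClassNotFoundException e) {
        // Not this Hive version; try the next candidate.
      }
    }
    throw new IllegalStateException("No candidate class found on the classpath");
  }

  public static void main(String[] args) throws Exception {
    Class<?> handlerClass =
        resolveFirst(
            "org.apache.hadoop.hive.metastore.HMSHandler", // Hive 4: top-level
            "org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler", // Hive 2/3: nested
            "java.util.ArrayList"); // stand-in so the sketch runs without Hive
    // With no Hive on the classpath, this resolves to the ArrayList stand-in.
    Object handler = handlerClass.getDeclaredConstructor().newInstance();
    System.out.println(handler.getClass().getName());
  }
}
```

Note that even here the caller must supply the class names explicitly, which is exactly Wing Yew's point about `DynConstructors`.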
>>
>> The scope of the PR is deliberately limited to the hive-metastore module.
>> I feel that what versions of Hive to use for testing Flink and Spark
>> modules against the HiveCatalog and how to do so should be a second (and
>> possibly even third) step. As an experiment/proof of concept, in
>> https://github.com/apache/iceberg/pull/12693, I show that Flink modules
>> run successfully with hive-metastore built with Hive 4. Additionally, Spark
>> modules run successfully with hive-metastore built with Hive 3 (they fail
>> with Hive 4). I would say though that while Spark 4.0 will support using a
>> Hive 4 HMS, I do not expect we will see Spark 3.4 or 3.5 supporting using a
>> Hive 4 HMS. Thus, I do not think we should block this work on what Spark
>> supports.
>>
>> Thanks,
>> Wing Yew
>>
>>
>>
>>
>> On Thu, Apr 3, 2025 at 3:01 AM Péter Váry <peter.vary.apa...@gmail.com>
>> wrote:
>>
>>> Hi Wing Yew,
>>>
>>> Thanks for taking a look at this.
>>> After the removal of the Hive runtime code, the only production-code
>>> dependency on HMS is in the HiveCatalog module. Since the HMS API is
>>> supposed to be backward compatible, I would prefer to keep a single
>>> hive-metastore module and a single source dir for accessing HMS. When
>>> there is a need, we can use DynMethods as we do now.
>>>
>>> I understand that for testing we need to start a HiveMetastore, which is
>>> used by Flink and Spark as well. Here things become harder, as we need to
>>> instantiate a test Hive Metastore, where we don't have backward
>>> compatibility guarantees. If possible, I would still prefer to keep a
>>> single codebase for testing purposes, where we use DynMethods and
>>> HiveVersion to decide which version of HMS we instantiate. This could
>>> easily become a mess, but I would try to check if we can get away with it
>>> somehow.
>>>
>>> I would definitely run the hive-metastore tests with all of the
>>> supported Hive versions. OTOH, I would defer to the Spark/Flink maintainers
>>> to decide how confident they are in their products' ability to abstract away
>>> the differences between the Hive versions. For Flink, I would suggest
>>> running the Flink tests only with the latest supported Hive version.
>>>
>>> Thanks,
>>> Peter
>>>
>>>
>>> Wing Yew Poon <wyp...@cloudera.com.invalid> ezt írta (időpont: 2025.
>>> márc. 31., H, 4:27):
>>>
>>>> Hi,
>>>>
>>>> I want to follow up on the earlier discussion about Hive support, where
>>>> it was decided to remove the Hive runtime support from the Iceberg repo,
>>>> and leave the Hive metastore support (where the HiveCatalog is implemented)
>>>> on Hive 2 for the time being.
>>>>
>>>> In that earlier thread, Peter Vary proposed that we support Hive
>>>> 2.3.10, 3.1.3 and 4.0.1 for the hive-metastore module and ensure that it
>>>> gets built and tested against those versions.
>>>>
>>>> I have implemented this in https://github.com/apache/iceberg/pull/12681.
>>>> I have left the existing hive-metastore module depending on Hive 2.3.10,
>>>> and added new hive3-metastore and hive4-metastore modules that depend on
>>>> Hive 3.1.3 and 4.0.1 respectively. I have followed the approach used by the
>>>> mr and hive3 modules previously and kept all common code in one directory
>>>> (the existing hive-metastore directory) and avoided code duplication. In
>>>> order to work around https://issues.apache.org/jira/browse/HIVE-27925,
>>>> which introduced a backward incompatibility in Hive 4, I have avoided the
>>>> use of HiveConf.ConfVars enums and used the conf property names (which have
>>>> not changed) instead. (This is also the approach used by Spark in
>>>> https://issues.apache.org/jira/browse/SPARK-47679.) Please see the PR
>>>> for more details.
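For readers following the thread, the workaround described above amounts to addressing Hive configuration properties by their string names, which are stable across Hive 2/3/4, instead of through the `HiveConf.ConfVars` enum, which changed incompatibly in Hive 4. The following is a hedged, minimal sketch of that idea only; a plain `Map` stands in for the Hive/Hadoop configuration object, and the helper and class names are illustrative, not the PR's actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the HIVE-27925 workaround: refer to configuration properties by
// string name rather than via HiveConf.ConfVars, so the code has no
// compile-time dependency on an enum whose shape differs across Hive versions.
public class ConfVarsWorkaround {

  // The property name itself is unchanged across Hive 2/3/4.
  static final String METASTORE_URIS = "hive.metastore.uris";

  static void setMetastoreUris(Map<String, String> conf, String uris) {
    // In real code this would be conf.set(...) on a Configuration object;
    // the point is that no ConfVars constant is referenced.
    conf.put(METASTORE_URIS, uris);
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    setMetastoreUris(conf, "thrift://localhost:9083");
    System.out.println(conf.get("hive.metastore.uris")); // prints thrift://localhost:9083
  }
}
```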
>>>>
>>>> The Flink and Spark modules (along with the delta-lake module) have
>>>> test code that depends on the hive-metastore module as well as on Hive
>>>> metastore jars. Having those modules test against Hive 3 and Hive 4
>>>> metastore versions is not in the scope of the above PR. I plan to work on
>>>> that separately as follow up, and I want to hear opinions on the approach.
>>>> As a proof of concept, I have put up
>>>> https://github.com/apache/iceberg/pull/12693 with follow on changes to
>>>> test the Flink modules against hive4-metastore. This is straightforward and
>>>> there are no issues.
>>>>
>>>> For Spark though, as I have mentioned in the earlier thread, Spark uses
>>>> a built-in version of the Hive metastore (currently 2.3.10), but can be
>>>> configured to use a different version and be pointed to a path containing
>>>> Hive metastore jars for the different version. However, the highest Hive
>>>> version that can be configured for Spark 3.5 is 3.1.3 (Spark 4 will support
>>>> 4.0.x), as changes in Spark code are needed to work around
>>>> HIVE-27925.
>>>>
>>>> What I'm interested in hearing is: for testing Flink and Spark against
>>>> Hive versions, do we want to test against
>>>> (1) just one version, e.g., the highest version supportable by that
>>>> Flink/Spark version (or alternatively just 2.3.10), or
>>>> (2) multiple versions from 2.3.10, 3.1.3, and 4.0.1, as long as they are
>>>> supportable by that Flink/Spark version?
>>>> And if (2), how do we want to do that, e.g., a full matrix or some kind
>>>> of sampling?
>>>>
>>>> Thanks,
>>>> Wing Yew
>>>>
>>>>
