Thanks for working on this, @Wing Yew Poon <wyp...@gmail.com>!

As I have mentioned on the PR, I would prefer unbundling the hive-metastore
dependency from the runtime jars.
This would help in multiple ways:
- It decreases the size of the runtime jars - nowadays more and more people
use Iceberg against a REST catalog, so bundling the hive-metastore jar is
unnecessary for them
- Hive 4 is the only officially supported Hive version, and the current
solution doesn't support it
- Testing against a specific Hive version becomes clearer

Thanks,
Peter

Wing Yew Poon <wyp...@cloudera.com.invalid> wrote (on Fri, May 16, 2025 at
23:52):

> Hi,
>
> I'd like to bring attention to this again.
> As you know, the hive-metastore module (which provides the HiveCatalog) is
> built against Hive 2.
> I have a PR, https://github.com/apache/iceberg/pull/12721, for building
> and testing the hive-metastore module against Hive 2, 3 and 4. Daniel Weeks
> looked into building the module against a single version of Hive and
> testing that against all three Hive versions, but this approach turned out
> to be infeasible because of binary incompatibility introduced by Hive 4
> (e.g., a hive-metastore jar built against Hive 2 would be incompatible with
> Hive 4 jars used for our Hive dependencies). Note that Hive 4 also breaks
> Thrift compatibility, so talking to a Hive 4 HMS requires using a Hive 4
> client.
>
> Given this, I propose that we have three versions of the module,
> hive-metastore, hive3-metastore and hive4-metastore, built against Hive 2,
> 3 and 4 respectively. (This is currently what is implemented in the PR.)
> This means three versions of the artifact. First, I'd like to get agreement
> from the community on doing this.
> Second, the hive-metastore classes are currently bundled in runtime jars,
> e.g., the Spark runtime jar and the Flink runtime jar. If we have agreement
> on the first point, I'd like to hear opinions on options for this second
> point:
> a. Remove hive-metastore classes from these runtime jars. Users will pick
> the hive-metastore jar appropriate to the Hive version they are using and
> add it to the classpath of their engine.
> b. Keep the runtime jars unchanged, which means bundling the
> hive-metastore classes built against Hive 2.
> c. Build multiple versions of the runtime jars, for Hive versions
> supported by those engines. (This does not mean all combinations, as e.g.,
> Spark 3 does not support Hive 4.)
>
> Option a. provides the most flexibility, at the expense of some
> inconvenience to the user (having to add another jar to the classpath; see
> the sketch below). Option c. is the most complex.
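>
> As a minimal sketch of option a. for Spark (the jar paths, catalog name,
> and metastore URI below are placeholders, and iceberg-hive4-metastore is
> the proposed artifact name, not an existing one):
>
>     import org.apache.spark.sql.SparkSession;
>
>     public class OptionASketch {
>       public static void main(String[] args) {
>         // The runtime jar would no longer bundle the catalog classes; the
>         // user supplies the metastore jar matching their Hive version.
>         SparkSession spark = SparkSession.builder()
>             .master("local[*]") // for a local smoke test
>             .config("spark.jars",
>                 "/jars/iceberg-spark-runtime-3.5_2.12.jar,"
>                     + "/jars/iceberg-hive4-metastore.jar")
>             .config("spark.sql.catalog.hive_cat",
>                 "org.apache.iceberg.spark.SparkCatalog")
>             .config("spark.sql.catalog.hive_cat.type", "hive")
>             .config("spark.sql.catalog.hive_cat.uri",
>                 "thrift://metastore-host:9083")
>             .getOrCreate();
>         spark.sql("SHOW NAMESPACES IN hive_cat").show();
>       }
>     }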
>
> There is a separate question of testing engines such as Spark and Flink
> against different versions of HMS. At this time, I do not wish to get into
> this question until the above is resolved. Regardless of the option chosen
> for the second point, I think it is possible to test engines against
> different versions of HMS.
>
> Thanks,
> Wing Yew
>
>
> On Fri, Apr 11, 2025 at 12:03 PM Daniel Weeks <dwe...@apache.org> wrote:
>
>> Hey Wing,
>>
>> I have a few concerns with the current PR:
>>
>> 1) We need to update LICENSE/NOTICE for the hive3/hive4 dependencies
>> because I believe we only looked at the referenced version
>> 2) We're producing artifacts for the hive3 and hive4 modules, which I think
>> we want to exclude (we shouldn't triplicate artifacts that are
>> cross-compatible)
>> 3) I don't think we should be copying the Hive DDL files into the Iceberg
>> project
>>
>> I'd really like to find a workaround for #3.  I've added comments to the
>> PR as well.
>>
>> Thanks,
>> Dan
>>
>> On Fri, Apr 11, 2025 at 9:28 AM Wing Yew Poon <wyp...@cloudera.com.invalid>
>> wrote:
>>
>>> Hi,
>>>
>>> I created https://github.com/apache/iceberg/pull/12721 as an
>>> alternative to my earlier PR,
>>> https://github.com/apache/iceberg/pull/12681. With very helpful
>>> pointers from Peter Vary and his persistent prodding, we have an
>>> implementation where a single source set can be used to build and test with
>>> Hive 2, 3 and 4.
>>>
>>> In the new PR, the existing hive-metastore module is still built with
>>> Hive 2, and additional modules, hive3-metastore and hive4-metastore, are
>>> introduced. Currently no other modules depend on the latter two. All three
>>> are built and tested in the existing java-ci workflow (without any change).
>>>
>>> The (non-test) classes in the hive-metastore module are packaged in the
>>> Flink and Spark runtime JARs. As a single source is used to support Hive
>>> 2, 3 and 4, these JARs can be used with a HiveCatalog backed by a Hive 2,
>>> 3 or 4
>>> HMS. In the PR, we do not make any changes to Flink or Spark modules.
>>> Testing them with a different version of Hive is the next step. I'd like to
>>> hear your opinion on how to approach that.
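>>>
>>> For reference, a minimal sketch of what that looks like through the
>>> HiveCatalog API (the thrift URI and warehouse location are placeholders):
>>>
>>>     import java.util.Map;
>>>     import org.apache.hadoop.conf.Configuration;
>>>     import org.apache.iceberg.hive.HiveCatalog;
>>>
>>>     public class HiveCatalogSketch {
>>>       public static void main(String[] args) {
>>>         // The same catalog classes are used no matter which HMS version
>>>         // serves the thrift URI.
>>>         HiveCatalog catalog = new HiveCatalog();
>>>         catalog.setConf(new Configuration());
>>>         catalog.initialize("hive", Map.of(
>>>             "uri", "thrift://metastore-host:9083",
>>>             "warehouse", "s3://bucket/warehouse"));
>>>         catalog.listNamespaces().forEach(System.out::println);
>>>       }
>>>     }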
>>>
>>> Thanks,
>>> Wing Yew
>>>
>>>
>>>
>>>
>>> On Thu, Apr 3, 2025 at 5:31 PM Wing Yew Poon <wyp...@cloudera.com>
>>> wrote:
>>>
>>>> Hi Peter,
>>>>
>>>> Thank you for your thoughts on this.
>>>>
>>>> My PR, https://github.com/apache/iceberg/pull/12681, achieves the
>>>> following objectives:
>>>> (1) Build and test the hive-metastore module against Hive 2, 3, and 4.
>>>> (As you suggested in the earlier thread, we build and test against 2.3.10,
>>>> 3.1.3, and 4.0.1.)
>>>> (2) Keep a single source of code for the hive-metastore implementation
>>>> (the test code needed to be different, unfortunately).
>>>>
>>>> In order to achieve (2), code changes were in fact needed (due to
>>>> HIVE-27925 <https://issues.apache.org/jira/browse/HIVE-27925> breaking
>>>> backward compatibility). So, even if the HMS API is backward compatible
>>>> (and HiveConf is not considered part of the HMS API), this shows the
>>>> importance of actually doing (1).
>>>> As a result of (2), even though in this PR, Flink and Spark modules
>>>> continue to be built with Hive 2 dependencies, the hive-metastore classes
>>>> packaged in the Flink and Spark runtime jars are usable with Hive 2, 3, or
>>>> 4 HMS (provided the engine supports that version of Hive).
>>>>
>>>> The use of separate Gradle modules (hive3-metastore, hive4-metastore)
>>>> is an implementation detail. I don't think we disagree on the objectives.
>>>> If you have suggestions for implementing the objectives differently, I'm
>>>> happy to hear them. To avoid getting into the weeds, we can discuss them in
>>>> the PR.
>>>>
>>>> I understand the desire to have a single version of
>>>> `TestHiveMetastore`, a test class that is used by the Flink and Spark
>>>> modules as well. Unfortunately, I did not see a way of doing this even with
>>>> `DynConstructors` and `DynMethods`. In Hive 4, `HMSHandler` became a
>>>> top-level class instead of being a nested class in `HiveMetaStore` (and I
>>>> don't see how to import one class from one version of Hive and a different
>>>> class from another version of Hive). Even with `DynConstructors`, you
>>>> have to give it the class instance. Again, if you have suggestions to
>>>> overcome this, we can discuss them in the PR. However, I feel that keeping
>>>> a single version of test code that works for all Hive versions, while
>>>> desirable, should not be a blocker.
>>>>
>>>> The scope of the PR is deliberately limited to the hive-metastore
>>>> module. I feel that what versions of Hive to use for testing Flink and
>>>> Spark modules against the HiveCatalog and how to do so should be a second
>>>> (and possibly even third) step. As an experiment/proof of concept, in
>>>> https://github.com/apache/iceberg/pull/12693, I show that Flink
>>>> modules run successfully with hive-metastore built with Hive 4.
>>>> Additionally, Spark modules run successfully with hive-metastore built with
>>>> Hive 3 (they fail with Hive 4). I would say though that while Spark 4.0
>>>> will support using a Hive 4 HMS, I do not expect we will see Spark 3.4 or
>>>> 3.5 supporting a Hive 4 HMS. Thus, I do not think we should block
>>>> this work on what Spark supports.
>>>>
>>>> Thanks,
>>>> Wing Yew
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Apr 3, 2025 at 3:01 AM Péter Váry <peter.vary.apa...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Wing Yew,
>>>>>
>>>>> Thanks for taking a look at this.
>>>>> After the removal of the Hive runtime code, the only production code
>>>>> that depends on HMS is in the HiveCatalog module. Since the HMS API is
>>>>> supposed to be backward compatible, I would prefer to keep a single
>>>>> hive-metastore module and a single source dir for accessing HMS. When
>>>>> there is a need, we can use DynMethods as we do now.
>>>>>
>>>>> I understand that for testing we need to start a HiveMetastore. It is
>>>>> used by Flink and Spark as well. Here things become harder as we need to
>>>>> instantiate a test Hive Metastore where we don't have backward
>>>>> compatibility guarantees. If possible, I would still prefer to keep a
>>>>> single codebase for testing purposes where we use DynMethods and
>>>>> HiveVersion to decide which version of HMS we instantiate. This could
>>>>> easily become a mess, but I would try to check if we can get away with it
>>>>> somehow.
>>>>>
>>>>> I would definitely run the hive-metastore tests with all of the
>>>>> supported Hive versions. OTOH, I would defer to the Spark/Flink
>>>>> maintainers to decide how confident they are in their products' ability
>>>>> to abstract away the differences between the Hive versions. For Flink, I
>>>>> would suggest running the Flink tests only with the latest supported
>>>>> Hive version.
>>>>>
>>>>> Thanks,
>>>>> Peter
>>>>>
>>>>>
>>>>> Wing Yew Poon <wyp...@cloudera.com.invalid> wrote (on Mon, Mar 31,
>>>>> 2025 at 4:27):
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I want to follow up on the earlier discussion about Hive support,
>>>>>> where it was decided to remove the Hive runtime support from the Iceberg
>>>>>> repo, and leave the Hive metastore support (where the HiveCatalog is
>>>>>> implemented) on Hive 2 for the time being.
>>>>>>
>>>>>> In that earlier thread, Peter Vary proposed that we support Hive
>>>>>> 2.3.10, 3.1.3 and 4.0.1 for the hive-metastore module and ensure that it
>>>>>> gets built and tested against those versions.
>>>>>>
>>>>>> I have implemented this in
>>>>>> https://github.com/apache/iceberg/pull/12681. I have left the
>>>>>> existing hive-metastore module depending on Hive 2.3.10, and added new
>>>>>> hive3-metastore and hive4-metastore modules that depend on Hive 3.1.3 and
>>>>>> 4.0.1 respectively. I have followed the approach used by the mr and hive3
>>>>>> modules previously and kept all common code in one directory (the 
>>>>>> existing
>>>>>> hive-metastore directory) and avoided code duplication. In order to
>>>>>> work around https://issues.apache.org/jira/browse/HIVE-27925, which
>>>>>> introduced a backward incompatibility in Hive 4, I have avoided the use 
>>>>>> of
>>>>>> HiveConf.ConfVars enums and used the conf property names (which have not
>>>>>> changed) instead. (This is also the approach used by Spark in
>>>>>> https://issues.apache.org/jira/browse/SPARK-47679.) Please see the
>>>>>> PR for more details.
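>>>>>>
>>>>>> To illustrate the kind of change involved (a sketch, not the actual
>>>>>> diff from the PR):
>>>>>>
>>>>>>     import org.apache.hadoop.conf.Configuration;
>>>>>>
>>>>>>     public class ConfVarsWorkaroundSketch {
>>>>>>       public static void main(String[] args) {
>>>>>>         Configuration conf = new Configuration();
>>>>>>         // Before (ties the build to one Hive major version, since
>>>>>>         // HIVE-27925 renamed the ConfVars enum constants in Hive 4):
>>>>>>         //   conf.get(HiveConf.ConfVars.METASTOREURIS.varname)
>>>>>>         // After: the property name itself is unchanged across
>>>>>>         // Hive 2, 3, and 4.
>>>>>>         String uris = conf.get("hive.metastore.uris", "");
>>>>>>         System.out.println("hive.metastore.uris = " + uris);
>>>>>>       }
>>>>>>     }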
>>>>>>
>>>>>> The Flink and Spark modules (along with the delta-lake module) have
>>>>>> test code that depends on the hive-metastore module as well as on Hive
>>>>>> metastore jars. Having those modules test against Hive 3 and Hive 4
>>>>>> metastore versions is not in the scope of the above PR. I plan to work
>>>>>> on that separately as a follow-up, and I want to hear opinions on the
>>>>>> approach.
>>>>>> As a proof of concept, I have put up
>>>>>> https://github.com/apache/iceberg/pull/12693 with follow-on changes
>>>>>> to test the Flink modules against hive4-metastore. This is
>>>>>> straightforward, and there are no issues.
>>>>>>
>>>>>> For Spark though, as I have mentioned in the earlier thread, Spark
>>>>>> uses a built-in version of the Hive metastore (currently 2.3.10), but
>>>>>> can be configured to use a different version and be pointed to a path
>>>>>> containing Hive metastore jars for that version. However, the highest
>>>>>> Hive version that can be configured for Spark 3.5 is 3.1.3 (Spark 4
>>>>>> will support 4.0.x), as changes in Spark code are needed to work around
>>>>>> HIVE-27925.
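>>>>>>
>>>>>> For illustration, a sketch of how that configuration looks (the jar
>>>>>> path is a placeholder):
>>>>>>
>>>>>>     import org.apache.spark.sql.SparkSession;
>>>>>>
>>>>>>     public class ExternalMetastoreSketch {
>>>>>>       public static void main(String[] args) {
>>>>>>         // Point Spark's built-in Hive support at Hive 3.1.3 jars on
>>>>>>         // the local filesystem instead of the bundled 2.3.10.
>>>>>>         SparkSession spark = SparkSession.builder()
>>>>>>             .master("local[*]") // for a local smoke test
>>>>>>             .config("spark.sql.hive.metastore.version", "3.1.3")
>>>>>>             .config("spark.sql.hive.metastore.jars", "path")
>>>>>>             .config("spark.sql.hive.metastore.jars.path",
>>>>>>                 "file:///opt/hive-3.1.3/lib/*.jar")
>>>>>>             .enableHiveSupport()
>>>>>>             .getOrCreate();
>>>>>>       }
>>>>>>     }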
>>>>>>
>>>>>> What I'm interested in hearing is: for testing Flink and Spark
>>>>>> against Hive versions, do we want to test against
>>>>>> (1) just one version, e.g., the highest version supportable by that
>>>>>> Flink/Spark version (or alternatively just 2.3.10), or
>>>>>> (2) multiple versions from 2.3.10, 3.1.3 and 4.0.1, as long as they
>>>>>> are supportable by that Flink/Spark version?
>>>>>> And if (2), how do we want to do that, e.g., a full matrix or some
>>>>>> kind of sampling?
>>>>>>
>>>>>> Thanks,
>>>>>> Wing Yew
>>>>>>
>>>>>>
