Hi Manu,

> Spark has only added hive 4.0 metastore support recently for Spark 4.0[1]
> and there will be conflicts

Does this mean that Spark 4.0 will always use Hive 4 code? Or will it use
Hive 2 when that is present on the classpath, and fall back to the embedded
Hive 4 code only when no older Hive version is on the classpath?

> Firstly, upgrading from Hive 2 to Hive 4 is a huge change

Is this a huge change even after we remove the Hive runtime module?

After removing the Hive runtime module, we have 2 remaining Hive
dependencies:

   - HMS Client
      - Thrift API: it should not change between Hive versions, so unless
      we start to use Hive 4 specific features, whatever version of Hive
      we use should work here
      - Java API changes: we found that the HMSClient classes use different
      constructors in Hive 2 and Hive 3, so we ended up using DynMethods to
      call the appropriate constructor. If we use one fixed Hive version
      here, we won't need DynMethods anymore
      - Based on our experience, even though Hive 3 itself doesn't support
      Java 11, the HMS Client for Hive 3 has no issues when used with Java 11
   - Testing infrastructure
      - TestHiveMetastore creates and starts an HMS instance. This could be
      highly dependent on the version of Hive we are using. Since this is
      only test code, I expect that only our own tests interact with it
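
To illustrate the constructor point above, here is a minimal sketch of the
"try each known constructor signature" idea. It uses plain java.lang.reflect
rather than Iceberg's DynMethods helpers, and FakeClientV3 with its
constructor signatures is a made-up stand-in, not a real Hive class, so treat
it as an illustration under those assumptions rather than the actual code.

```java
import java.lang.reflect.Constructor;
import java.util.Arrays;
import java.util.List;

public class ConstructorFallback {

  // Hypothetical stand-in for a client class whose constructor signature
  // differs between library versions; it is NOT a real Hive class.
  public static class FakeClientV3 {
    public final String description;

    public FakeClientV3(String conf, Boolean allowEmbedded) {
      this.description = "conf=" + conf + ", allowEmbedded=" + allowEmbedded;
    }
  }

  /** One candidate: the parameter types to look up and the arguments to pass. */
  public static final class Candidate {
    final Class<?>[] paramTypes;
    final Object[] args;

    public Candidate(Class<?>[] paramTypes, Object[] args) {
      this.paramTypes = paramTypes;
      this.args = args;
    }
  }

  /** Builds an instance with the first candidate constructor that exists on the class. */
  public static <T> T newInstance(Class<T> type, List<Candidate> candidates) {
    for (Candidate candidate : candidates) {
      try {
        Constructor<T> ctor = type.getConstructor(candidate.paramTypes);
        return ctor.newInstance(candidate.args);
      } catch (NoSuchMethodException e) {
        // This signature is not present in the loaded version; try the next one.
      } catch (ReflectiveOperationException e) {
        throw new IllegalStateException("Constructor failed for " + type.getName(), e);
      }
    }
    throw new IllegalStateException("No matching constructor on " + type.getName());
  }

  public static void main(String[] args) {
    // Assumed signatures: an "older" one-argument form and a "newer" two-argument form.
    List<Candidate> candidates = Arrays.asList(
        new Candidate(new Class<?>[] {String.class}, new Object[] {"thrift://host:9083"}),
        new Candidate(
            new Class<?>[] {String.class, Boolean.class},
            new Object[] {"thrift://host:9083", true}));

    FakeClientV3 client = newInstance(FakeClientV3.class, candidates);
    System.out.println(client.description);
  }
}
```

With a single fixed Hive version the candidate list collapses to one entry,
and a direct constructor call can replace the reflection entirely.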

*@Manu*: You know more of the details here. Do we have HMSClient issues
when we use Hive 4 code? If I missed something in the list above, please
correct me.

Based on this, in an ideal world:

   - Hive would provide an HMS client jar which contains only the Java code
   needed to connect to and communicate with an HMS instance over Thrift
   (no internal HMS server code etc.). We could use it as a dependency for our
   iceberg-hive-metastore module, either by setting a minimum version or by
   using a shaded embedded version. *@Hive* folks - is this a valid option?
   What are the reasons that no metastore-client jar is provided currently?
   Would it be possible to generate one in a future Hive release? It seems
   like a worthwhile feature to me.
   - We would create our own version-dependent HMS infrastructure if we want
   to support Spark versions which support older Hive versions.

As a result of this, we could have:

   - Clean definition of which Hive version is supported
   - Testing for the supported Hive versions
   - Java 11 support

As an alternative, we can create a testing matrix where some tests are run
with both Hive 3 and Hive 4, and some tests are run with only Hive 3 (for
older Spark versions which do not support Hive 4).

Thanks Manu for driving this!
Peter

On Sun, Jan 5, 2025 at 5:18, Manu Zhang <owenzhang1...@gmail.com> wrote:

>> This basically means that we need to support every exact Hive version
>> which is used by Spark, and we need to exclude our own Hive version from
>> the Spark runtime.
>
>
> Firstly, upgrading from Hive 2 to Hive 4 is a huge change, and I expect
> compatibility to be much better once Iceberg and Spark are both on Hive 4.
>
> Secondly, the coupling can be loosened if we move toward the REST
> catalog.
>
> On Fri, Jan 3, 2025 at 7:26 PM Péter Váry <peter.vary.apa...@gmail.com>
> wrote:
>
>> That sounds really interesting in a bad way :) :(
>>
>> This basically means that we need to support every exact Hive version
>> which is used by Spark, and we need to exclude our own Hive version from
>> the Spark runtime.
>>
>> On Thu, Dec 19, 2024, 04:00 Manu Zhang <owenzhang1...@gmail.com> wrote:
>>
>>> Hi Peter,
>>>
>>>> I think we should make sure that the Iceberg Hive version is
>>>> independent from the version used by Spark
>>>
>>> I'm afraid that is not how it works currently. When Spark is deployed
>>> with Hive libraries (I suppose this is common), the iceberg-spark runtime
>>> must be compatible with them.
>>> Otherwise, we need to ask users to exclude the Hive libraries from Spark
>>> and ship the iceberg-spark runtime with Iceberg's Hive dependencies.
>>>
>>> Regards,
>>> Manu
>>>
>>> On Wed, Dec 18, 2024 at 9:08 PM Péter Váry <peter.vary.apa...@gmail.com>
>>> wrote:
>>>
>>>> @Manu: What will be the end result? Do we have to use the same Hive
>>>> version in Iceberg as the one defined by Spark? I think we should make
>>>> sure that the Iceberg Hive version is independent from the version used
>>>> by Spark.
>>>>
>>>> On Mon, Dec 16, 2024, 21:58 rdb...@gmail.com <rdb...@gmail.com> wrote:
>>>>
>>>>> > I'm not sure there's an upgrade path before Spark 4.0. Any ideas?
>>>>>
>>>>> We can at least separate the concerns. We can remove the runtime
>>>>> modules that are the main issue. If we compile against an older version of
>>>>> the Hive metastore module (leaving it unchanged) that at least has a
>>>>> dramatically reduced surface area for Java version issues. As long as the
>>>>> API is compatible (and we haven't heard complaints that it is not) then I
>>>>> think users can override the version in their environments.
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Sun, Dec 15, 2024 at 5:55 PM Manu Zhang <owenzhang1...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Daniel,
>>>>>> I'll start a vote once I get the PR ready.
>>>>>>
>>>>>> Hi Ryan,
>>>>>> Sorry, I wasn't clear in the last email that the consensus is to
>>>>>> upgrade Hive metastore support.
>>>>>>
>>>>>> Well, I was too optimistic about the upgrade. Spark has only added
>>>>>> hive 4.0 metastore support recently for Spark 4.0[1] and there will be
>>>>>> conflicts
>>>>>> between Spark's hive 2.3.9 and our hive 4.0 dependencies.
>>>>>> I'm not sure there's an upgrade path before Spark 4.0. Any ideas?
>>>>>>
>>>>>> 1. https://issues.apache.org/jira/browse/SPARK-45265
>>>>>>
>>>>>> Thanks,
>>>>>> Manu
>>>>>>
>>>>>>
>>>>>> On Sat, Dec 14, 2024 at 4:31 AM rdb...@gmail.com <rdb...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Oh, I think I see. The upgrade to Hive 4 is just for the Hive
>>>>>>> metastore support? When I read the thread, I thought that we weren't
>>>>>>> going to change the metastore. That seems reasonable to me. Sorry for
>>>>>>> the confusion.
>>>>>>>
>>>>>>> On Fri, Dec 13, 2024 at 10:24 AM rdb...@gmail.com <rdb...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Sorry, I must have missed something. I don't think that we should
>>>>>>>> upgrade anything in Iceberg to Hive 4. Why not simply remove the Hive
>>>>>>>> support entirely? Why would anyone need Hive 4 support from Iceberg
>>>>>>>> when it is built into Hive 4?
>>>>>>>>
>>>>>>>> On Thu, Dec 12, 2024 at 11:03 AM Daniel Weeks <dwe...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey Manu,
>>>>>>>>>
>>>>>>>>> I agree with the direction here, but we should probably hold a
>>>>>>>>> quick procedural vote just to confirm since this is a significant
>>>>>>>>> change in support for Hive.
>>>>>>>>>
>>>>>>>>> -Dan
>>>>>>>>>
>>>>>>>>> On Wed, Dec 11, 2024 at 5:19 PM Manu Zhang <
>>>>>>>>> owenzhang1...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks all for sharing your thoughts. It looks like there's a
>>>>>>>>>> consensus on upgrading to Hive 4 and dropping hive-runtime.
>>>>>>>>>> I've submitted a PR[1] as the first step. Please help review.
>>>>>>>>>>
>>>>>>>>>> 1. https://github.com/apache/iceberg/pull/11750
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Manu
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 28, 2024 at 11:26 PM Shohei Okumiya <
>>>>>>>>>> oku...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> I also prefer option 1. I have some initiatives[1] to improve
>>>>>>>>>>> integrations between Hive and Iceberg. The current style allows
>>>>>>>>>>> us to develop both Hive's core and HiveIcebergStorageHandler
>>>>>>>>>>> simultaneously. That would help us enhance integrations.
>>>>>>>>>>>
>>>>>>>>>>> - [1] https://issues.apache.org/jira/browse/HIVE-28410
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Okumin
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Nov 28, 2024 at 4:17 AM Fokko Driesprong <
>>>>>>>>>>> fo...@apache.org> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > Hey Cheng,
>>>>>>>>>>> >
>>>>>>>>>>> > Thanks for the suggestion. The nightly snapshots are available:
>>>>>>>>>>> > https://repository.apache.org/content/groups/snapshots/org/apache/iceberg/iceberg-core/,
>>>>>>>>>>> > which might help when working on features that are not released yet
>>>>>>>>>>> > (e.g. nanosecond timestamps). Besides that, we should run RCs against
>>>>>>>>>>> > Hive to check if everything works as expected.
>>>>>>>>>>> >
>>>>>>>>>>> > I'm leaning toward removing Hive 2 and 3 as well.
>>>>>>>>>>> >
>>>>>>>>>>> > Kind regards,
>>>>>>>>>>> > Fokko
>>>>>>>>>>> >
>>>>>>>>>>> > On Wed, Nov 27, 2024 at 20:05, rdb...@gmail.com <rdb...@gmail.com> wrote:
>>>>>>>>>>> >>
>>>>>>>>>>> >> I think that we should remove Hive 2 and Hive 3. We already
>>>>>>>>>>> >> agreed to remove Hive 2, but Hive 3 is not compatible with the
>>>>>>>>>>> >> project anymore and is already EOL and will not see a release to
>>>>>>>>>>> >> update it so that it can be compatible. Anyone using the existing
>>>>>>>>>>> >> Hive 3 support should be able to continue using older releases.
>>>>>>>>>>> >>
>>>>>>>>>>> >> In general, I think it's a good idea to let people use older
>>>>>>>>>>> >> releases when these situations happen. It is difficult for the
>>>>>>>>>>> >> project to continue to support libraries that are EOL and I don't
>>>>>>>>>>> >> think there's a great justification for it, considering Iceberg
>>>>>>>>>>> >> support in Hive 4 is native and much better!
>>>>>>>>>>> >>
>>>>>>>>>>> >> On Wed, Nov 27, 2024 at 7:12 AM Cheng Pan <pan3...@gmail.com> wrote:
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> That said, it would be helpful if they continue running
>>>>>>>>>>> >>> tests against the latest stable Hive releases to ensure that
>>>>>>>>>>> >>> any changes don't unintentionally break something for Hive,
>>>>>>>>>>> >>> which would be beyond our control.
>>>>>>>>>>> >>>
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> I believe we should continue maintaining a Hive Iceberg
>>>>>>>>>>> >>> runtime test suite with the latest version of Hive in the
>>>>>>>>>>> >>> Iceberg repository.
>>>>>>>>>>> >>>
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> I think we can keep some basic Hive 4 tests in the Iceberg repo
>>>>>>>>>>> >>>
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> Instead of running basic tests in the Iceberg repo, would it
>>>>>>>>>>> >>> make more sense to let Iceberg publish daily snapshot jars to
>>>>>>>>>>> >>> Nexus, and have a daily CI in Hive consume those jars and run
>>>>>>>>>>> >>> the full Iceberg tests?
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> Thanks,
>>>>>>>>>>> >>> Cheng Pan
>>>>>>>>>>> >>>
>>>>>>>>>>>
>>>>>>>>>>>
