Hi Manu,

> Spark has only added hive 4.0 metastore support recently for Spark 4.0[1] and there will be conflicts

Does this mean that Spark 4.0 will always use the Hive 4 code? Or will it use Hive 2 when that is present on the classpath, and fall back to the embedded Hive 4 code only when no older Hive version is on the classpath?
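To make my question concrete, here is a minimal sketch (my own illustration, not something from this thread) of how a Spark 4.0 job could be pointed at a Hive 4.x metastore client through Spark's spark.sql.hive.metastore.* options. The exact version string ("4.0.0" below) is an assumption that should be checked against the Spark 4.0 documentation for SPARK-45265:

```java
// A minimal sketch, assuming Spark 4.0's documented spark.sql.hive.metastore.* options
// (SPARK-45265); the "4.0.0" version string is an assumption to verify against the docs.
import org.apache.spark.sql.SparkSession;

public class Hive4MetastoreClientExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hive4-metastore-client")
        .enableHiveSupport()
        // ask Spark to use a Hive 4.x client for HMS calls instead of the built-in 2.3.x client
        .config("spark.sql.hive.metastore.version", "4.0.0")
        // resolve the matching Hive client jars from Maven (a local path also works)
        .config("spark.sql.hive.metastore.jars", "maven")
        .getOrCreate();

    spark.sql("SHOW DATABASES").show();
    spark.stop();
  }
}
```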
> Firstly, upgrading from Hive 2 to Hive 4 is a huge change

Is this still a huge change after we remove the Hive runtime module? After removing the Hive runtime module, we have 2 remaining Hive dependencies:

- HMS Client
  - The Thrift API should not change between Hive versions, so unless we start to use Hive 4-specific features we should be fine here - whatever version of Hive we use should work.
  - Java API changes: we found that in Hive 2 and Hive 3 the HMSClient classes used different constructors, so we ended up using DynMethods to pick the appropriate constructor (see the sketch below) - if we pin a single Hive version here, we won't need the DynMethods anymore.
  - Based on our experience, even though Hive 3 itself doesn't support Java 11, the HMS Client for Hive 3 has no issues when used with Java 11.
- Testing infrastructure - TestHiveMetastore creates and starts an HMS instance. This could be highly dependent on the version of Hive we are using, but since this is test-only code, I expect that only our tests interact with it.

*@Manu*: You know more of the details here. Do we have HMSClient issues when we use the Hive 4 code? If I missed something in the list above, please correct me.
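For illustration, this is roughly the constructor juggling I mean - a sketch using plain reflection rather than the actual Iceberg code (which wraps this pattern in its DynMethods/DynConstructors helpers), and the two signatures shown are examples of the kind of difference between versions, not an exhaustive list:

```java
// Rough illustration of "pick whichever HiveMetaStoreClient constructor exists at runtime".
// Not the actual Iceberg code; Iceberg wraps this pattern in its DynMethods/DynConstructors helpers.
import java.lang.reflect.Constructor;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.IMetaStoreClient;

public class HmsClientFactory {

  static IMetaStoreClient newClient(HiveConf conf) throws Exception {
    Class<?> clientClass = Class.forName("org.apache.hadoop.hive.metastore.HiveMetaStoreClient");
    try {
      // some Hive versions expose a Configuration-based constructor
      Constructor<?> ctor = clientClass.getConstructor(Configuration.class);
      return (IMetaStoreClient) ctor.newInstance(conf);
    } catch (NoSuchMethodException e) {
      // others only have the HiveConf-based constructor
      Constructor<?> ctor = clientClass.getConstructor(HiveConf.class);
      return (IMetaStoreClient) ctor.newInstance(conf);
    }
  }
}
```

If we pin a single Hive version, this collapses to a plain new HiveMetaStoreClient(conf) call.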
Based on this, in an ideal world:

- Hive would provide an HMS client jar which only contains the Java code needed to connect to and communicate with an HMS instance using Thrift (no internal HMS server code, etc.). We could use it as a dependency for our iceberg-hive-metastore module, either setting a minimal version or using a shaded embedded version. *@Hive* folks - is this a valid option? What are the reasons that there is no metastore-client jar provided currently? Would it be possible to generate one in a future Hive release? It seems like a worthwhile feature to me.
- We would create our own version-dependent HMS test infrastructure if we want to support Spark versions which support older Hive versions.

As a result of this, we could have:

- A clean definition of which Hive versions are supported
- Testing for the supported Hive versions
- Java 11 support

As an alternative, we can create a testing matrix where some tests are run with both Hive 3 and Hive 4, and some tests are run with only Hive 3 (for older Spark versions which do not support Hive 4). A rough sketch of such a matrix is shown below.
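This is only a sketch; the version-aware metastore helper referenced in the comments is hypothetical and would be part of the version-dependent test infrastructure mentioned above:

```java
// Sketch of a per-Hive-version test matrix with JUnit 5; the helper named in the comments is
// hypothetical and stands in for version-dependent test infrastructure we would have to build.
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

class HiveVersionMatrixTest {

  @ParameterizedTest
  @ValueSource(strings = {"3.1.3", "4.0.0"}) // Hive 3 line for older Spark, Hive 4 line for Spark 4.0+
  void catalogRoundTrip(String hiveVersion) throws Exception {
    // hypothetical helper: start an embedded HMS built against the requested Hive version
    // HiveVersionedMetastore metastore = HiveVersionedMetastore.startFor(hiveVersion);
    // ... create an Iceberg table through HiveCatalog, load it back, assert on the schema ...
    // metastore.stop();
  }
}
```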
Thanks Manu for driving this!
Peter

On Sun, Jan 5, 2025 at 5:18 AM Manu Zhang <owenzhang1...@gmail.com> wrote:

>> This basically means that we need to support every exact Hive version which is used by Spark, and we need to exclude our own Hive version from the Spark runtime.
>
> Firstly, upgrading from Hive 2 to Hive 4 is a huge change, and I expect compatibility to be much better once Iceberg and Spark are both on Hive 4.
>
> Secondly, the coupling can be loosened if we are moving toward the REST catalog.
>
> On Fri, Jan 3, 2025 at 7:26 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>
>> That sounds really interesting in a bad way :) :(
>>
>> This basically means that we need to support every exact Hive version which is used by Spark, and we need to exclude our own Hive version from the Spark runtime.
>>
>> On Thu, Dec 19, 2024, 04:00 Manu Zhang <owenzhang1...@gmail.com> wrote:
>>
>>> Hi Peter,
>>>
>>>> I think we should make sure that the Iceberg Hive version is independent from the version used by Spark
>>>
>>> I'm afraid that is not how it works currently. When Spark is deployed with hive libraries (I suppose this is common), the iceberg-spark runtime must be compatible with them. Otherwise, we need to ask users to exclude the hive libraries from Spark and ship the iceberg-spark runtime with Iceberg's hive dependencies.
>>>
>>> Regards,
>>> Manu
>>>
>>> On Wed, Dec 18, 2024 at 9:08 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>
>>>> @Manu: What will be the end result? Do we have to use the same Hive version in Iceberg as it is defined by Spark? I think we should make sure that the Iceberg Hive version is independent from the version used by Spark.
>>>>
>>>> On Mon, Dec 16, 2024, 21:58 rdb...@gmail.com <rdb...@gmail.com> wrote:
>>>>
>>>>> > I'm not sure there's an upgrade path before Spark 4.0. Any ideas?
>>>>>
>>>>> We can at least separate the concerns. We can remove the runtime modules that are the main issue. If we compile against an older version of the Hive metastore module (leaving it unchanged), that at least has a dramatically reduced surface area for Java version issues. As long as the API is compatible (and we haven't heard complaints that it is not), then I think users can override the version in their environments.
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Sun, Dec 15, 2024 at 5:55 PM Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>>>
>>>>>> Hi Daniel,
>>>>>> I'll start a vote once I get the PR ready.
>>>>>>
>>>>>> Hi Ryan,
>>>>>> Sorry, I wasn't clear in the last email that the consensus is to upgrade Hive metastore support.
>>>>>>
>>>>>> Well, I was too optimistic about the upgrade. Spark has only added Hive 4.0 metastore support recently for Spark 4.0[1], and there will be conflicts between Spark's Hive 2.3.9 and our Hive 4.0 dependencies. I'm not sure there's an upgrade path before Spark 4.0. Any ideas?
>>>>>>
>>>>>> 1. https://issues.apache.org/jira/browse/SPARK-45265
>>>>>>
>>>>>> Thanks,
>>>>>> Manu
>>>>>>
>>>>>> On Sat, Dec 14, 2024 at 4:31 AM rdb...@gmail.com <rdb...@gmail.com> wrote:
>>>>>>
>>>>>>> Oh, I think I see. The upgrade to Hive 4 is just for the Hive metastore support? When I read the thread, I thought that we weren't going to change the metastore. That seems reasonable to me. Sorry for the confusion.
>>>>>>>
>>>>>>> On Fri, Dec 13, 2024 at 10:24 AM rdb...@gmail.com <rdb...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Sorry, I must have missed something. I don't think that we should upgrade anything in Iceberg to Hive 4. Why not simply remove the Hive support entirely? Why would anyone need Hive 4 support from Iceberg when it is built into Hive 4?
>>>>>>>>
>>>>>>>> On Thu, Dec 12, 2024 at 11:03 AM Daniel Weeks <dwe...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Hey Manu,
>>>>>>>>>
>>>>>>>>> I agree with the direction here, but we should probably hold a quick procedural vote just to confirm, since this is a significant change in support for Hive.
>>>>>>>>>
>>>>>>>>> -Dan
>>>>>>>>>
>>>>>>>>> On Wed, Dec 11, 2024 at 5:19 PM Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks all for sharing your thoughts. It looks like there's a consensus on upgrading to Hive 4 and dropping hive-runtime. I've submitted a PR[1] as the first step. Please help review.
>>>>>>>>>>
>>>>>>>>>> 1. https://github.com/apache/iceberg/pull/11750
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Manu
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 28, 2024 at 11:26 PM Shohei Okumiya <oku...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> I also prefer option 1. I have some initiatives[1] to improve integrations between Hive and Iceberg. The current style allows us to develop both Hive's core and HiveIcebergStorageHandler simultaneously. That would help us enhance the integration.
>>>>>>>>>>>
>>>>>>>>>>> - [1] https://issues.apache.org/jira/browse/HIVE-28410
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Okumin
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Nov 28, 2024 at 4:17 AM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > Hey Cheng,
>>>>>>>>>>> >
>>>>>>>>>>> > Thanks for the suggestion. The nightly snapshots are available: https://repository.apache.org/content/groups/snapshots/org/apache/iceberg/iceberg-core/, which might help when working on features that are not released yet (e.g. nanosecond timestamps). Besides that, we should run RCs against Hive to check if everything works as expected.
>>>>>>>>>>> >
>>>>>>>>>>> > I'm leaning toward removing Hive 2 and 3 as well.
>>>>>>>>>>> >
>>>>>>>>>>> > Kind regards,
>>>>>>>>>>> > Fokko
>>>>>>>>>>> >
>>>>>>>>>>> > On Wed, Nov 27, 2024 at 8:05 PM rdb...@gmail.com <rdb...@gmail.com> wrote:
>>>>>>>>>>> >>
>>>>>>>>>>> >> I think that we should remove Hive 2 and Hive 3. We already agreed to remove Hive 2, but Hive 3 is not compatible with the project anymore, is already EOL, and will not see a release to update it so that it can be compatible. Anyone using the existing Hive 3 support should be able to continue using older releases.
>>>>>>>>>>> >>
>>>>>>>>>>> >> In general, I think it's a good idea to let people use older releases when these situations happen. It is difficult for the project to continue to support libraries that are EOL, and I don't think there's a great justification for it, considering Iceberg support in Hive 4 is native and much better!
>>>>>>>>>>> >>
>>>>>>>>>>> >> On Wed, Nov 27, 2024 at 7:12 AM Cheng Pan <pan3...@gmail.com> wrote:
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> That said, it would be helpful if they continue running tests against the latest stable Hive releases to ensure that any changes don't unintentionally break something for Hive, which would be beyond our control.
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> I believe we should continue maintaining a Hive Iceberg runtime test suite with the latest version of Hive in the Iceberg repository.
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> I think we can keep some basic Hive 4 tests in the Iceberg repo.
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> Instead of running basic tests in the Iceberg repo, maybe it makes more sense to let Iceberg publish daily snapshot jars to Nexus, and have a daily CI in Hive consume those jars and run the full Iceberg tests?
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> Thanks,
>>>>>>>>>>> >>> Cheng Pan