Hi Peter,

In Spark, you can specify the Hive version of the metastore that you want to use. The configuration spark.sql.hive.metastore.version currently (as of Spark 3.5) defaults to 2.3.9, and the jars supporting this default version are shipped with Spark as built-in. To use a different version, set spark.sql.hive.metastore.jars=path (the default is builtin) and point spark.sql.hive.metastore.jars.path at the jars for the Hive metastore version you want. What https://issues.apache.org/jira/browse/SPARK-45265 does is allow 4.0.x to be a supported value of spark.sql.hive.metastore.version. I haven't been following Spark 4, but I suspect that the built-in version is not changing to Hive 4.0. The built-in version is also used for other things that Spark may use from Hive (aside from interaction with HMS), such as Hive SerDes. See https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html.

- Wing Yew
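[The configuration Wing Yew describes can also be set programmatically when building the Spark session. A minimal sketch, assuming a hypothetical directory /opt/hive-3.1.3/lib that holds the Hive 3.1.3 jars (the path and app name are illustrative, not from the thread):]

```java
// Sketch: pointing Spark at a non-built-in Hive metastore client version.
// These configs must be set before the Hive-enabled session is created.
import org.apache.spark.sql.SparkSession;

public class MetastoreVersionExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hms-version-demo")
        .enableHiveSupport()
        // Hive metastore client version to use instead of the built-in 2.3.9
        .config("spark.sql.hive.metastore.version", "3.1.3")
        // "path" tells Spark to load metastore jars from jars.path below
        .config("spark.sql.hive.metastore.jars", "path")
        // Hypothetical location of the matching Hive client jars
        .config("spark.sql.hive.metastore.jars.path", "file:///opt/hive-3.1.3/lib/*.jar")
        .getOrCreate();

    spark.sql("SHOW DATABASES").show();
  }
}
```

[This is a config fragment and needs a Spark deployment with a reachable metastore to actually run; it only illustrates where the three options plug in.]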
On Mon, Jan 6, 2025 at 2:04 AM Péter Váry <peter.vary.apa...@gmail.com> wrote: > Hi Manu, > > > Spark has only added hive 4.0 metastore support recently for Spark > 4.0[1] and there will be conflicts > > Does this mean that Spark 4.0 will always use Hive 4 code? Or will it use > Hive 2 when it is present on the classpath, and fall back to the embedded Hive 4 code only when older Hive versions are > not on the classpath? > > > Firstly, upgrading from Hive 2 to Hive 4 is a huge change > > Is this a huge change even after we remove the Hive runtime module? > > After removing the Hive runtime module, we have 2 remaining Hive > dependencies: > > - HMS Client > - The Thrift API should not change between Hive versions, > so unless we start to use specific Hive 4 features we should be fine > here - > whatever version of Hive we use should work > - Java API changes. We found that in Hive 2 and Hive 3 the > HMSClient classes used different constructors, so we ended up using > DynMethods to call the appropriate constructor - if we pin a single Hive > version here, we won't need the DynMethods anymore > - Based on our experience, even though Hive 3 itself doesn't support > Java 11, the HMS Client for Hive 3 doesn't have any issues when used > with > Java 11 > - Testing infrastructure > - TestHiveMetastore creates and starts an HMS instance. This could > be highly dependent on the version of Hive we are using. Since this is > only > testing code, I expect that only our tests interact with it > > *@Manu*: You know more of the details here. Do we have HMSClient issues > when we use Hive 4 code? If I missed something in the listing above, please > correct me. > > Based on this, in an ideal world: > > - Hive would provide an HMS client jar which contains only the Java code > needed to connect to and communicate with an HMS instance over Thrift > (no internal HMS server code etc). We could use it as a dependency for our > iceberg-hive-metastore module.
Either setting a minimal version, or using a > shaded embedded version. *@Hive* folks - is this a valid option? What > are the reasons that there is no metastore-client jar provided currently? > Would it be possible to generate one in some future Hive release? > It seems like a worthy feature to me. > - We would create our version-dependent HMS infrastructure if we want > to support Spark versions which support older Hive versions. > > As a result of this, we could have: > > - A clean definition of which Hive version is supported > - Testing for the supported Hive versions > - Java 11 support > > As an alternative, we can create a testing matrix where some tests are run > with both Hive 3 and Hive 4, and some tests are run with only Hive 3 (for older > Spark versions, which do not support Hive 4) > > Thanks Manu for driving this! > Peter > > Manu Zhang <owenzhang1...@gmail.com> wrote on Sun, Jan 5, 2025 at > 5:18: > >> This basically means that we need to support every exact Hive version >>> which is used by Spark, and we need to exclude our own Hive version from >>> the Spark runtime. >> >> >> Firstly, upgrading from Hive 2 to Hive 4 is a huge change, and I expect >> compatibility to be much better once Iceberg and Spark are both on Hive 4. >> >> Secondly, the coupling can be loosened if we are moving toward the REST >> catalog. >> >> On Fri, Jan 3, 2025 at 7:26 PM Péter Váry <peter.vary.apa...@gmail.com> >> wrote: >> >>> That sounds really interesting in a bad way :) :( >>> >>> This basically means that we need to support every exact Hive version >>> which is used by Spark, and we need to exclude our own Hive version from >>> the Spark runtime. >>> >>> On Thu, Dec 19, 2024, 04:00 Manu Zhang <owenzhang1...@gmail.com> wrote: >>> >>>> Hi Peter, >>>> >>>>> I think we should make sure that the Iceberg Hive version is >>>>> independent from the version used by Spark >>>> >>>> I'm afraid that is not how it works currently.
When Spark is deployed >>>> with hive libraries (I suppose this is common), the iceberg-spark runtime must >>>> be compatible with them. >>>> Otherwise, we need to ask users to exclude hive libraries from Spark >>>> and ship the iceberg-spark runtime with Iceberg's hive dependencies. >>>> >>>> Regards, >>>> Manu >>>> >>>> On Wed, Dec 18, 2024 at 9:08 PM Péter Váry <peter.vary.apa...@gmail.com> >>>> wrote: >>>> >>>>> @Manu: What will be the end result? Do we have to use the same Hive >>>>> version in Iceberg as the one defined by Spark? I think we should make sure >>>>> that the Iceberg Hive version is independent from the version used by >>>>> Spark >>>>> >>>>> On Mon, Dec 16, 2024, 21:58 rdb...@gmail.com <rdb...@gmail.com> wrote: >>>>> >>>>>> > I'm not sure there's an upgrade path before Spark 4.0. Any ideas? >>>>>> >>>>>> We can at least separate the concerns. We can remove the runtime >>>>>> modules, which are the main issue. If we compile against an older version >>>>>> of >>>>>> the Hive metastore module (leaving it unchanged), that at least gives a >>>>>> dramatically reduced surface area for Java version issues. As long as the >>>>>> API is compatible (and we haven't heard complaints that it is not), I >>>>>> think users can override the version in their environments. >>>>>> >>>>>> Ryan >>>>>> >>>>>> On Sun, Dec 15, 2024 at 5:55 PM Manu Zhang <owenzhang1...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> Hi Daniel, >>>>>>> I'll start a vote once I get the PR ready. >>>>>>> >>>>>>> Hi Ryan, >>>>>>> Sorry, I wasn't clear in the last email that the consensus is to >>>>>>> upgrade the Hive metastore support. >>>>>>> >>>>>>> Well, I was too optimistic about the upgrade. Spark has only recently added >>>>>>> Hive 4.0 metastore support, for Spark 4.0[1], and there will be conflicts >>>>>>> between Spark's hive 2.3.9 and our hive 4.0 dependencies. >>>>>>> I'm not sure there's an upgrade path before Spark 4.0. Any ideas? >>>>>>> >>>>>>> 1. 
https://issues.apache.org/jira/browse/SPARK-45265 >>>>>>> >>>>>>> Thanks, >>>>>>> Manu >>>>>>> >>>>>>> >>>>>>> On Sat, Dec 14, 2024 at 4:31 AM rdb...@gmail.com <rdb...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Oh, I think I see. The upgrade to Hive 4 is just for the Hive >>>>>>>> metastore support? When I read the thread, I thought that we weren't >>>>>>>> going >>>>>>>> to change the metastore. That seems reasonable to me. Sorry for >>>>>>>> the confusion. >>>>>>>> >>>>>>>> On Fri, Dec 13, 2024 at 10:24 AM rdb...@gmail.com <rdb...@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Sorry, I must have missed something. I don't think that we should >>>>>>>>> upgrade anything in Iceberg to Hive 4. Why not simply remove the Hive >>>>>>>>> support entirely? Why would anyone need Hive 4 support from Iceberg >>>>>>>>> when it >>>>>>>>> is built into Hive 4? >>>>>>>>> >>>>>>>>> On Thu, Dec 12, 2024 at 11:03 AM Daniel Weeks <dwe...@apache.org> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hey Manu, >>>>>>>>>> >>>>>>>>>> I agree with the direction here, but we should probably hold a >>>>>>>>>> quick procedural vote just to confirm since this is a significant >>>>>>>>>> change in >>>>>>>>>> support for Hive. >>>>>>>>>> >>>>>>>>>> -Dan >>>>>>>>>> >>>>>>>>>> On Wed, Dec 11, 2024 at 5:19 PM Manu Zhang < >>>>>>>>>> owenzhang1...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> Thanks all for sharing your thoughts. It looks like there's a >>>>>>>>>>> consensus on upgrading to Hive 4 and dropping hive-runtime. >>>>>>>>>>> I've submitted a PR[1] as the first step. Please help review. >>>>>>>>>>> >>>>>>>>>>> 1. https://github.com/apache/iceberg/pull/11750 >>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> Manu >>>>>>>>>>> >>>>>>>>>>> On Thu, Nov 28, 2024 at 11:26 PM Shohei Okumiya < >>>>>>>>>>> oku...@apache.org> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi all, >>>>>>>>>>>> >>>>>>>>>>>> I also prefer option 1. I have some initiatives[1] to improve >>>>>>>>>>>> integrations between Hive and Iceberg. 
The current style allows >>>>>>>>>>> us to >>>>>>>>>>> develop both Hive's core and HiveIcebergStorageHandler >>>>>>>>>>> simultaneously. >>>>>>>>>>> That would help us enhance integrations. >>>>>>>>>>> >>>>>>>>>>> - [1] https://issues.apache.org/jira/browse/HIVE-28410 >>>>>>>>>>> >>>>>>>>>>> Regards, >>>>>>>>>>> Okumin >>>>>>>>>>> >>>>>>>>>>> On Thu, Nov 28, 2024 at 4:17 AM Fokko Driesprong < >>>>>>>>>>> fo...@apache.org> wrote: >>>>>>>>>>> > >>>>>>>>>>> > Hey Cheng, >>>>>>>>>>> > >>>>>>>>>>> > Thanks for the suggestion. The nightly snapshots are >>>>>>>>>>> available: >>>>>>>>>>> https://repository.apache.org/content/groups/snapshots/org/apache/iceberg/iceberg-core/, >>>>>>>>>>> which might help when working on features that are not released >>>>>>>>>>> yet (e.g. nanosecond timestamps). Besides that, we should run RCs against >>>>>>>>>>> Hive to >>>>>>>>>>> check if everything works as expected. >>>>>>>>>>> > >>>>>>>>>>> > I'm leaning toward removing Hive 2 and 3 as well. >>>>>>>>>>> > >>>>>>>>>>> > Kind regards, >>>>>>>>>>> > Fokko >>>>>>>>>>> > >>>>>>>>>>> > Op wo 27 nov 2024 om 20:05 schreef rdb...@gmail.com < >>>>>>>>>>> rdb...@gmail.com>: >>>>>>>>>>> >> >>>>>>>>>>> >> I think that we should remove Hive 2 and Hive 3. We already >>>>>>>>>>> agreed to remove Hive 2, and Hive 3 is no longer compatible with the >>>>>>>>>>> project; it is already EOL and will not see a release that restores >>>>>>>>>>> compatibility. Anyone using the existing Hive 3 support >>>>>>>>>>> should be >>>>>>>>>>> able to continue using older releases. >>>>>>>>>>> >> >>>>>>>>>>> >> In general, I think it's a good idea to let people use older >>>>>>>>>>> releases when these situations happen. 
It is difficult for the >>>>>>>>>>>> project to >>>>>>>>>>>> continue to support libraries that are EOL and I don't think >>>>>>>>>>>> there's a >>>>>>>>>>>> great justification for it, considering Iceberg support in Hive 4 >>>>>>>>>>>> is native >>>>>>>>>>>> and much better! >>>>>>>>>>>> >> >>>>>>>>>>>> >> On Wed, Nov 27, 2024 at 7:12 AM Cheng Pan <pan3...@gmail.com> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> That said, it would be helpful if they continue running >>>>>>>>>>>> >>> tests against the latest stable Hive releases to ensure >>>>>>>>>>>> that any >>>>>>>>>>>> >>> changes don’t unintentionally break something for Hive, >>>>>>>>>>>> which would be >>>>>>>>>>>> >>> beyond our control. >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> I believe we should continue maintaining a Hive Iceberg >>>>>>>>>>>> runtime test suite with the latest version of Hive in the Iceberg >>>>>>>>>>>> repository. >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> I think we can keep some basic Hive 4 tests in the Iceberg repo >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> Instead of running basic tests in the Iceberg repo, maybe >>>>>>>>>>>> it makes more sense to let Iceberg publish daily snapshot jars to Nexus, and have a daily >>>>>>>>>>>> CI in >>>>>>>>>>>> Hive consume those jars and run the full Iceberg tests? >>>>>>>>>>>> >>> >>>>>>>>>>>> >>> Thanks, >>>>>>>>>>>> >>> Cheng Pan >>>>>>>>>>>> >>> >>>>>>>>>>>>
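[Editor's note on the DynMethods point raised upthread: Peter describes resolving HMSClient constructors at runtime because Hive 2 and Hive 3 expose different constructor signatures. The sketch below shows that technique with plain java.lang.reflect; Iceberg's actual code uses its own DynMethods/DynConstructors helpers, and StringBuilder here is just an illustrative stand-in for a client class whose constructor changed between versions.]

```java
// Sketch of runtime constructor resolution: probe candidate signatures in
// order and use the first one the loaded library version actually provides.
import java.lang.reflect.Constructor;
import java.util.List;

public class DynCtorSketch {

  // Returns the first constructor of clazz matching any candidate signature,
  // so one code path works across library versions with differing constructors.
  static Constructor<?> firstMatching(Class<?> clazz, List<Class<?>[]> candidates) {
    for (Class<?>[] signature : candidates) {
      try {
        return clazz.getConstructor(signature);
      } catch (NoSuchMethodException e) {
        // This signature doesn't exist in the loaded version; try the next one.
      }
    }
    throw new IllegalStateException("No supported constructor on " + clazz.getName());
  }

  public static void main(String[] args) throws Exception {
    // Probe an absent signature first (skipped), then a present one (selected).
    Constructor<?> ctor = firstMatching(
        StringBuilder.class,
        List.of(new Class<?>[] {java.io.File.class},  // not present: skipped
                new Class<?>[] {String.class}));      // present: selected
    StringBuilder client = (StringBuilder) ctor.newInstance("connected");
    System.out.println(client);  // prints "connected"
  }
}
```

[Pinning a single Hive version, as Peter notes, would make this probing unnecessary because the one valid signature would be known at compile time.]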