FYI -- It looks like the built-in Hive version in the master branch of Apache Spark is 2.3.10 (https://issues.apache.org/jira/browse/SPARK-47018), and https://issues.apache.org/jira/browse/SPARK-44114 (upgrade built-in Hive to 3+) is an open issue.
On Mon, Jan 6, 2025 at 1:07 PM Wing Yew Poon <wyp...@cloudera.com> wrote:

> Hi Peter,
>
> In Spark, you can specify the Hive version of the metastore that you want to use. There is a configuration, spark.sql.hive.metastore.version, which currently (as of Spark 3.5) defaults to 2.3.9, and the jars supporting this default version are shipped with Spark as built-in. You can specify a different version and then set spark.sql.hive.metastore.jars=path (the default is built-in) and spark.sql.hive.metastore.jars.path to point to the jars for the Hive metastore version you want to use.
> What https://issues.apache.org/jira/browse/SPARK-45265 does is allow 4.0.x to be supported as a spark.sql.hive.metastore.version. I haven't been following Spark 4, but I suspect that the built-in version is not changing to Hive 4.0. The built-in version is also used for other things that Spark may use from Hive (aside from interaction with HMS), such as Hive SerDes.
> See https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html.
>
> - Wing Yew
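For illustration, here is a minimal sketch of the settings Wing Yew describes, expressed through the Java SparkSession API; the Hive version and jar path below are placeholders rather than a recommended setup:

    import org.apache.spark.sql.SparkSession;

    public class MetastoreVersionExample {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("hms-version-example")
            // Ask Spark to talk to a different Hive metastore version than the built-in 2.3.x client.
            .config("spark.sql.hive.metastore.version", "3.1.3")
            // "path" makes Spark load the metastore jars from the location below; the default is "builtin".
            .config("spark.sql.hive.metastore.jars", "path")
            .config("spark.sql.hive.metastore.jars.path", "file:///opt/hive/lib/*.jar")
            .enableHiveSupport()
            .getOrCreate();

        spark.sql("SHOW DATABASES").show();
      }
    }

Note that these are Spark session settings: they only affect the client that Spark itself uses to reach HMS, not which Hive classes the iceberg-spark runtime was compiled against.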
> On Mon, Jan 6, 2025 at 2:04 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>
>> Hi Manu,
>>
>> > Spark has only added hive 4.0 metastore support recently for Spark 4.0[1] and there will be conflicts
>>
>> Does this mean that Spark 4.0 will always use Hive 4 code? Or will it use Hive 2 when it is present on the classpath, and only fall back to the embedded Hive 4 code when older Hive versions are not on the classpath?
>>
>> > Firstly, upgrading from Hive 2 to Hive 4 is a huge change
>>
>> Is this still a huge change after we remove the Hive runtime module?
>>
>> After removing the Hive runtime module, we have two remaining Hive dependencies:
>>
>> - HMS Client
>>    - The Thrift API should not change between Hive versions, so unless we start to use specific Hive 4 features we should be fine here - whatever version of Hive we use, it should work.
>>    - Java API changes. We found that the HMS client classes in Hive 2 and Hive 3 use different constructors, so we ended up using DynMethods to pick the appropriate one (a sketch of the pattern follows below) - if we pin a single Hive version here, we won't need the DynMethods anymore.
>>    - Based on our experience, even though Hive 3 itself doesn't support Java 11, the HMS client for Hive 3 has no issues when used with Java 11.
>> - Testing infrastructure
>>    - TestHiveMetastore creates and starts an HMS instance. This could be highly dependent on the version of Hive we are using. Since this is test-only code, I expect that only our tests interact with it.
>>
>> *@Manu*: You know more of the details here. Do we have HMS client issues when we use Hive 4 code? If I missed something in the listing above, please correct me.
>>
>> Based on this, in an ideal world:
>>
>> - Hive would provide an HMS client jar which only contains the Java code needed to connect to and communicate with an HMS instance over Thrift (no internal HMS server code, etc.). We could use it as a dependency for our iceberg-hive-metastore module, either setting a minimal version or using a shaded embedded version. *@Hive* folks - is this a valid option? What are the reasons that there is no metastore-client jar provided currently? Would it be possible to generate one in a future Hive release? It seems like a worthy feature to me.
>> - We would create our own version-dependent HMS infrastructure if we want to support Spark versions that only work with older Hive versions.
>>
>> As a result of this, we could have:
>>
>> - A clean definition of which Hive version is supported
>> - Testing for the supported Hive versions
>> - Java 11 support
>>
>> As an alternative, we can create a testing matrix where some tests run with both Hive 3 and Hive 4, and some tests run only with Hive 3 (for older Spark versions which do not support Hive 4).
>>
>> Thanks Manu for driving this!
>> Peter
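The DynMethods pattern Péter refers to looks roughly like the sketch below. This is illustrative only: the factory class and method are hypothetical, and the Hive signatures are written from memory rather than copied from the current iceberg-hive-metastore code. The idea is that DynMethods binds to whichever client factory signature the Hive version on the classpath actually provides.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaHookLoader;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.IMetaStoreClient;
    import org.apache.hadoop.hive.metastore.RetryingMetaStoreClient;
    import org.apache.iceberg.common.DynMethods;

    class HmsClientFactory {
      // Try the (assumed) Hive 2 signature first, then the Hive 3+ one; DynMethods
      // uses the first implementation that resolves on the current classpath.
      private static final DynMethods.StaticMethod GET_PROXY =
          DynMethods.builder("getProxy")
              .impl(RetryingMetaStoreClient.class, HiveConf.class, HiveMetaHookLoader.class, String.class)
              .impl(RetryingMetaStoreClient.class, Configuration.class, HiveMetaHookLoader.class, String.class)
              .buildStatic();

      static IMetaStoreClient newClient(HiveConf conf) {
        // No-op hook loader for the sketch; real code would supply proper hooks.
        HiveMetaHookLoader hookLoader = tbl -> null;
        return GET_PROXY.invoke(conf, hookLoader, HiveMetaStoreClient.class.getName());
      }
    }

If the project pins a single Hive client version, this reflective indirection can be replaced with a direct call, which is the simplification Péter is pointing at.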
>> On Sun, Jan 5, 2025 at 5:18 AM Manu Zhang <owenzhang1...@gmail.com> wrote:
>>
>>>> This basically means that we need to support every exact Hive version which is used by Spark, and we need to exclude our own Hive version from the Spark runtime.
>>>
>>> Firstly, upgrading from Hive 2 to Hive 4 is a huge change, and I expect compatibility to be much better once Iceberg and Spark are both on Hive 4.
>>>
>>> Secondly, the coupling can be loosened if we are moving toward the REST catalog.
>>>
>>> On Fri, Jan 3, 2025 at 7:26 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>
>>>> That sounds really interesting, in a bad way :) :(
>>>>
>>>> This basically means that we need to support every exact Hive version which is used by Spark, and we need to exclude our own Hive version from the Spark runtime.
>>>>
>>>> On Thu, Dec 19, 2024, 04:00 Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>>
>>>>> Hi Peter,
>>>>>
>>>>>> I think we should make sure that the Iceberg Hive version is independent from the version used by Spark
>>>>>
>>>>> I'm afraid that is not how it works currently. When Spark is deployed with Hive libraries (I suppose this is common), the iceberg-spark runtime must be compatible with them.
>>>>> Otherwise, we need to ask users to exclude the Hive libraries from Spark and ship the iceberg-spark runtime with Iceberg's Hive dependencies.
>>>>>
>>>>> Regards,
>>>>> Manu
>>>>>
>>>>> On Wed, Dec 18, 2024 at 9:08 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>
>>>>>> @Manu: What will be the end result? Do we have to use the same Hive version in Iceberg as the one defined by Spark? I think we should make sure that the Iceberg Hive version is independent from the version used by Spark.
>>>>>>
>>>>>> On Mon, Dec 16, 2024, 21:58 rdb...@gmail.com <rdb...@gmail.com> wrote:
>>>>>>
>>>>>>> > I'm not sure there's an upgrade path before Spark 4.0. Any ideas?
>>>>>>>
>>>>>>> We can at least separate the concerns. We can remove the runtime modules that are the main issue. If we compile against an older version of the Hive metastore module (leaving it unchanged), that at least dramatically reduces the surface area for Java version issues. As long as the API is compatible (and we haven't heard complaints that it is not), I think users can override the version in their environments.
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> On Sun, Dec 15, 2024 at 5:55 PM Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Daniel,
>>>>>>>> I'll start a vote once I get the PR ready.
>>>>>>>>
>>>>>>>> Hi Ryan,
>>>>>>>> Sorry, I wasn't clear in the last email that the consensus is to upgrade Hive metastore support.
>>>>>>>>
>>>>>>>> Well, I was too optimistic about the upgrade. Spark has only recently added Hive 4.0 metastore support, for Spark 4.0 [1], and there will be conflicts between Spark's Hive 2.3.9 and our Hive 4.0 dependencies.
>>>>>>>> I'm not sure there's an upgrade path before Spark 4.0. Any ideas?
>>>>>>>>
>>>>>>>> 1. https://issues.apache.org/jira/browse/SPARK-45265
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Manu
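For context on where this classpath coupling shows up: an Iceberg Hive catalog is typically registered in the same Spark session with settings like the sketch below (the catalog name and metastore URI are placeholders, and an iceberg-spark-runtime jar is assumed to be on the classpath). Iceberg's HiveCatalog then talks to HMS through whichever Hive client classes it finds on that classpath - today, usually Spark's built-in ones.

    import org.apache.spark.sql.SparkSession;

    public class IcebergHiveCatalogExample {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("iceberg-hive-catalog-example")
            // Register an Iceberg catalog backed by a Hive Metastore.
            .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
            .config("spark.sql.catalog.demo.type", "hive")
            .config("spark.sql.catalog.demo.uri", "thrift://metastore-host:9083")
            .getOrCreate();

        // The Iceberg HiveCatalog, not Spark's own Hive integration, serves this call,
        // using the Hive client classes available on the classpath.
        spark.sql("SHOW NAMESPACES IN demo").show();
      }
    }

With a REST catalog (type "rest" and an HTTP URI) the HMS client dependency drops out entirely, which is the decoupling mentioned in the REST catalog comment above.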
>>>>>>>> On Sat, Dec 14, 2024 at 4:31 AM rdb...@gmail.com <rdb...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Oh, I think I see. The upgrade to Hive 4 is just for the Hive metastore support? When I read the thread, I thought that we weren't going to change the metastore. That seems reasonable to me. Sorry for the confusion.
>>>>>>>>>
>>>>>>>>> On Fri, Dec 13, 2024 at 10:24 AM rdb...@gmail.com <rdb...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Sorry, I must have missed something. I don't think that we should upgrade anything in Iceberg to Hive 4. Why not simply remove the Hive support entirely? Why would anyone need Hive 4 support from Iceberg when it is built into Hive 4?
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 12, 2024 at 11:03 AM Daniel Weeks <dwe...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Manu,
>>>>>>>>>>>
>>>>>>>>>>> I agree with the direction here, but we should probably hold a quick procedural vote just to confirm, since this is a significant change in support for Hive.
>>>>>>>>>>>
>>>>>>>>>>> -Dan
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Dec 11, 2024 at 5:19 PM Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks all for sharing your thoughts. It looks like there's consensus on upgrading to Hive 4 and dropping hive-runtime.
>>>>>>>>>>>> I've submitted a PR[1] as the first step. Please help review.
>>>>>>>>>>>>
>>>>>>>>>>>> 1. https://github.com/apache/iceberg/pull/11750
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Manu
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Nov 28, 2024 at 11:26 PM Shohei Okumiya <oku...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I also prefer option 1. I have some initiatives[1] to improve integrations between Hive and Iceberg. The current style allows us to develop both Hive's core and HiveIcebergStorageHandler simultaneously. That would help us enhance integrations.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - [1] https://issues.apache.org/jira/browse/HIVE-28410
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Okumin
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Nov 28, 2024 at 4:17 AM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Hey Cheng,
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Thanks for the suggestion. The nightly snapshots are available at https://repository.apache.org/content/groups/snapshots/org/apache/iceberg/iceberg-core/, which might help when working on features that are not released yet (e.g. nanosecond timestamps). Besides that, we should run RCs against Hive to check that everything works as expected.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > I'm leaning toward removing Hive 2 and 3 as well.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Kind regards,
>>>>>>>>>>>>> > Fokko
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > On Wed, Nov 27, 2024 at 8:05 PM rdb...@gmail.com <rdb...@gmail.com> wrote:
>>>>>>>>>>>>> >>
>>>>>>>>>>>>> >> I think that we should remove Hive 2 and Hive 3. We already agreed to remove Hive 2, but Hive 3 is no longer compatible with the project, is already EOL, and will not see a release that updates it to be compatible. Anyone using the existing Hive 3 support should be able to continue using older releases.
>>>>>>>>>>>>> >>
>>>>>>>>>>>>> >> In general, I think it's a good idea to let people use older releases when these situations happen. It is difficult for the project to continue to support libraries that are EOL, and I don't think there's a great justification for it, considering Iceberg support in Hive 4 is native and much better!
>>>>>>>>>>>>> >>
>>>>>>>>>>>>> >> On Wed, Nov 27, 2024 at 7:12 AM Cheng Pan <pan3...@gmail.com> wrote:
>>>>>>>>>>>>> >>>
>>>>>>>>>>>>> >>> That said, it would be helpful if they continue running tests against the latest stable Hive releases to ensure that any changes don't unintentionally break something for Hive, which would be beyond our control.
>>>>>>>>>>>>> >>>
>>>>>>>>>>>>> >>> I believe we should continue maintaining a Hive Iceberg runtime test suite with the latest version of Hive in the Iceberg repository.
>>>>>>>>>>>>> >>>
>>>>>>>>>>>>> >>> I think we can keep some basic Hive 4 tests in the Iceberg repo.
>>>>>>>>>>>>> >>>
>>>>>>>>>>>>> >>> Instead of running basic tests in the Iceberg repo, maybe it makes more sense to let Iceberg publish daily snapshot jars to Nexus and have a daily CI job in Hive consume those jars and run the full Iceberg tests?
>>>>>>>>>>>>> >>>
>>>>>>>>>>>>> >>> Thanks,
>>>>>>>>>>>>> >>> Cheng Pan