Thanks Wing Yew,

tl;dr: We should remove the Iceberg Hive Runtime module, but make sure that the Iceberg Hive Metastore module tests run against the supported(?) Hive 2.3.10/3.1.3/4.0.1 versions. Other tests could run against whatever Hive version they prefer.
In detail:
--------------
Let me recap what I understand here:
- The Iceberg Hive metastore module is working with Hive 2, Hive 3 and Java 11 - since neither the tests nor the users are complaining about it
- The Iceberg Hive runtime tests use features from Hive which do not support Java 11 - as we have seen broken tests when we upgraded the Java version
- Even Spark 4 uses an embedded Hive 2.3.10
- This means that the features used by Spark and Iceberg from Hive 2.3.10 work with Java 11, since neither the tests nor the users were complaining about it
- Iceberg Hive Runtime tests run against Hive 2.3.9 and Hive 3.1.3
- Iceberg Hive Metastore tests run against Hive 2.3.9
- Spark tests run against Hive 2.3.10

We already decided that we would like to remove the Hive runtime support from the Iceberg code in the 1.8.0 release. We should decide which Hive versions we would like to support for the Iceberg Hive Metastore module. Based on my understanding above:
- Hive 2.3.10 should be mandatory, as Spark uses it as a default
- Hive 3.1.3 is probably what most of our users are using
- Hive 4.0.1 is the current Hive version

Tell me if you think otherwise.

Since the Iceberg Hive Metastore module uses very specific Hive 3 related code (the DynMethods loader for the HMS Client proxy), I don't think we can claim support without at least some tests running against the appropriate Hive versions. I am not even sure that the metastore module works with Hive 4 - maybe @Manu has more knowledge here.

Thanks,
Peter

Wing Yew Poon <wyp...@cloudera.com.invalid> wrote (on Tue, Jan 7, 2025, 1:18):

> FYI --
> It looks like the built-in Hive version in the master branch of Apache
> Spark is 2.3.10 (https://issues.apache.org/jira/browse/SPARK-47018), and
> https://issues.apache.org/jira/browse/SPARK-44114 (upgrade built-in Hive
> to 3+) is an open issue.
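For reference, the built-in metastore client version discussed in this thread can be overridden per Spark deployment. A minimal spark-defaults.conf sketch, where the 3.1.3 version and the jar path are hypothetical examples, not recommendations:

```
# Use a Hive 3.1.3 metastore client instead of the built-in one
spark.sql.hive.metastore.version    3.1.3
# "path" tells Spark to load the metastore client jars from the
# location below (the default value is "builtin")
spark.sql.hive.metastore.jars       path
spark.sql.hive.metastore.jars.path  /opt/hive-3.1.3/lib/*.jar
```

Note that, as Wing Yew points out below, this only swaps the HMS client; the built-in version is still used for other Hive functionality such as SerDes.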
>
> On Mon, Jan 6, 2025 at 1:07 PM Wing Yew Poon <wyp...@cloudera.com> wrote:
>
>> Hi Peter,
>> In Spark, you can specify the Hive version of the metastore that you want to use. There is a configuration, spark.sql.hive.metastore.version, which currently (as of Spark 3.5) defaults to 2.3.9, and the jars supporting this default version are shipped with Spark as built-in. You can specify a different version, and then set spark.sql.hive.metastore.jars=path (the default is builtin) and spark.sql.hive.metastore.jars.path to point to jars for the Hive metastore version you want to use. What https://issues.apache.org/jira/browse/SPARK-45265 does is allow 4.0.x to be supported as a spark.sql.hive.metastore.version. I haven't been following Spark 4, but I suspect that the built-in version is not changing to Hive 4.0. The built-in version is also used for other things that Spark may use from Hive (aside from interaction with HMS), such as Hive SerDes. See https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html.
>> - Wing Yew
>>
>> On Mon, Jan 6, 2025 at 2:04 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>
>>> Hi Manu,
>>>
>>>> Spark has only added hive 4.0 metastore support recently for Spark 4.0[1] and there will be conflicts
>>>
>>> Does this mean that Spark 4.0 will always use Hive 4 code? Or will it use Hive 2 when it is present on the classpath, and the embedded Hive 4 code only when older Hive versions are not on the classpath?
>>>
>>>> Firstly, upgrading from Hive 2 to Hive 4 is a huge change
>>>
>>> Is this a huge change even after we remove the Hive runtime module?
>>>
>>> After removing the Hive runtime module, we have 2 remaining Hive dependencies:
>>>
>>> - HMS Client
>>>   - The Thrift API should not change between Hive versions, so unless we start to use specific Hive 4 features we should be fine here - whatever version of Hive we use, it should work
>>>   - Java API changes. We found that in Hive 2 and Hive 3 the HMSClient classes used different constructors, so we ended up using DynMethods to pick the appropriate constructor - if we use a strict Hive version here, then we won't need the DynMethods anymore
>>>   - Based on our experience, even if Hive 3 itself doesn't support Java 11, the HMS Client for Hive 3 doesn't have any issues when used with Java 11
>>> - Testing infrastructure
>>>   - TestHiveMetastore creates and starts an HMS instance. This could be highly dependent on the version of Hive we are using. Since this is only testing code, I expect that only our tests interact with it
>>>
>>> @Manu: You know more of the details here. Do we have HMSClient issues when we use Hive 4 code? If I missed something in the listing above, please correct me.
>>>
>>> Based on this, in an ideal world:
>>>
>>> - Hive would provide an HMS client jar which only contains the Java code needed to connect to and communicate with an HMS instance using Thrift (no internal HMS server code etc.). We could use it as a dependency for our iceberg-hive-metastore module, either setting a minimal version, or using a shaded embedded version. @Hive folks - is this a valid option? What are the reasons that there is no metastore-client jar provided currently? Would it be possible to generate one in some future Hive release? Seems like a worthy feature to me.
>>> - We would create our version-dependent HMS testing infrastructure if we want to support Spark versions which support older Hive versions.
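The DynMethods-based constructor selection described above boils down to probing candidate signatures via reflection and binding to whichever one the classes on the classpath actually declare. A hypothetical, Hive-free sketch of that idea; the class and method names here are illustrative and are not Iceberg's actual API (StringBuilder stands in for the HMS client class):

```java
import java.lang.reflect.Constructor;

public class Main {

  // Try each candidate signature in order and return the first public
  // constructor that the loaded class actually declares, or null if none
  // match. This lets one binary adapt to whichever library version is
  // present at runtime.
  static Constructor<?> firstMatching(Class<?> target, Class<?>[]... candidates) {
    for (Class<?>[] signature : candidates) {
      try {
        return target.getConstructor(signature);
      } catch (NoSuchMethodException e) {
        // Signature absent in the version on the classpath; try the next one.
      }
    }
    return null;
  }

  public static void main(String[] args) {
    // StringBuilder declares a (CharSequence) constructor but no (Long)
    // constructor, so the probe falls through to the second candidate.
    Constructor<?> ctor =
        firstMatching(
            StringBuilder.class,
            new Class<?>[] {Long.class},          // pretend "old Hive" signature, absent
            new Class<?>[] {CharSequence.class}); // pretend "new Hive" signature, present
    System.out.println(ctor == null ? "none" : ctor.getParameterTypes()[0].getSimpleName());
    // prints "CharSequence"
  }
}
```

Iceberg's real DynMethods/DynConstructors helpers in iceberg-common are more general than this sketch, but the version-probing idea is the same; pinning a single supported Hive version would make the probing unnecessary.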
>>>
>>> As a result of this, we could have:
>>>
>>> - A clean definition of which Hive version is supported
>>> - Testing for the supported Hive versions
>>> - Java 11 support
>>>
>>> As an alternative, we can create a testing matrix where some tests are run with both Hive 3 and Hive 4, and some tests are run with only Hive 3 (for older Spark versions which do not support Hive 4).
>>>
>>> Thanks Manu for driving this!
>>> Peter
>>>
>>> Manu Zhang <owenzhang1...@gmail.com> wrote (on Sun, Jan 5, 2025, 5:18):
>>>
>>>>> This basically means that we need to support every exact Hive version which is used by Spark, and we need to exclude our own Hive version from the Spark runtime.
>>>>
>>>> Firstly, upgrading from Hive 2 to Hive 4 is a huge change, and I expect compatibility to be much better once Iceberg and Spark are both on Hive 4.
>>>>
>>>> Secondly, the coupling can be loosened if we are moving toward the REST catalog.
>>>>
>>>> On Fri, Jan 3, 2025 at 7:26 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>
>>>>> That sounds really interesting in a bad way :) :(
>>>>>
>>>>> This basically means that we need to support every exact Hive version which is used by Spark, and we need to exclude our own Hive version from the Spark runtime.
>>>>>
>>>>> On Thu, Dec 19, 2024, 04:00 Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>>>
>>>>>> Hi Peter,
>>>>>>
>>>>>>> I think we should make sure that the Iceberg Hive version is independent from the version used by Spark
>>>>>>
>>>>>> I'm afraid that is not how it works currently. When Spark is deployed with Hive libraries (I suppose this is common), the iceberg-spark runtime must be compatible with them.
>>>>>> Otherwise, we need to ask users to exclude the Hive libraries from Spark and ship the iceberg-spark runtime with Iceberg's Hive dependencies.
>>>>>>
>>>>>> Regards,
>>>>>> Manu
>>>>>>
>>>>>> On Wed, Dec 18, 2024 at 9:08 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>
>>>>>>> @Manu: What will be the end result? Do we have to use the same Hive version in Iceberg as is defined by Spark? I think we should make sure that the Iceberg Hive version is independent from the version used by Spark.
>>>>>>>
>>>>>>> On Mon, Dec 16, 2024, 21:58 rdb...@gmail.com <rdb...@gmail.com> wrote:
>>>>>>>
>>>>>>>> > I'm not sure there's an upgrade path before Spark 4.0. Any ideas?
>>>>>>>>
>>>>>>>> We can at least separate the concerns. We can remove the runtime modules that are the main issue. If we compile against an older version of the Hive metastore module (leaving it unchanged), that at least has a dramatically reduced surface area for Java version issues. As long as the API is compatible (and we haven't heard complaints that it is not), then I think users can override the version in their environments.
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> On Sun, Dec 15, 2024 at 5:55 PM Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Daniel,
>>>>>>>>> I'll start a vote once I get the PR ready.
>>>>>>>>>
>>>>>>>>> Hi Ryan,
>>>>>>>>> Sorry, I wasn't clear in the last email that the consensus is to upgrade Hive metastore support.
>>>>>>>>>
>>>>>>>>> Well, I was too optimistic about the upgrade. Spark has only added Hive 4.0 metastore support recently, for Spark 4.0[1], and there will be conflicts between Spark's Hive 2.3.9 and our Hive 4.0 dependencies.
>>>>>>>>> I'm not sure there's an upgrade path before Spark 4.0. Any ideas?
>>>>>>>>>
>>>>>>>>> 1.
>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-45265
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Manu
>>>>>>>>>
>>>>>>>>> On Sat, Dec 14, 2024 at 4:31 AM rdb...@gmail.com <rdb...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Oh, I think I see. The upgrade to Hive 4 is just for the Hive metastore support? When I read the thread, I thought that we weren't going to change the metastore. That seems reasonable to me. Sorry for the confusion.
>>>>>>>>>>
>>>>>>>>>> On Fri, Dec 13, 2024 at 10:24 AM rdb...@gmail.com <rdb...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sorry, I must have missed something. I don't think that we should upgrade anything in Iceberg to Hive 4. Why not simply remove the Hive support entirely? Why would anyone need Hive 4 support from Iceberg when it is built into Hive 4?
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 12, 2024 at 11:03 AM Daniel Weeks <dwe...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Manu,
>>>>>>>>>>>>
>>>>>>>>>>>> I agree with the direction here, but we should probably hold a quick procedural vote just to confirm, since this is a significant change in support for Hive.
>>>>>>>>>>>>
>>>>>>>>>>>> -Dan
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Dec 11, 2024 at 5:19 PM Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks all for sharing your thoughts. It looks like there's a consensus on upgrading to Hive 4 and dropping hive-runtime.
>>>>>>>>>>>>> I've submitted a PR[1] as the first step. Please help review.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1.
>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/11750
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Manu
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Nov 28, 2024 at 11:26 PM Shohei Okumiya <oku...@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I also prefer option 1. I have some initiatives[1] to improve integrations between Hive and Iceberg. The current style allows us to develop both Hive's core and HiveIcebergStorageHandler simultaneously. That would help us enhance integrations.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - [1] https://issues.apache.org/jira/browse/HIVE-28410
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Okumin
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Nov 28, 2024 at 4:17 AM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Hey Cheng,
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Thanks for the suggestion. The nightly snapshots are available: https://repository.apache.org/content/groups/snapshots/org/apache/iceberg/iceberg-core/, which might help when working on features that are not released yet (e.g. nanosecond timestamps). Besides that, we should run RCs against Hive to check if everything works as expected.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > I'm leaning toward removing Hive 2 and 3 as well.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Kind regards,
>>>>>>>>>>>>>> > Fokko
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > On Wed, Nov 27, 2024 at 20:05, rdb...@gmail.com <rdb...@gmail.com> wrote:
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> I think that we should remove Hive 2 and Hive 3.
>>>>>>>>>>>>>> >> We already agreed to remove Hive 2, but Hive 3 is not compatible with the project anymore, is already EOL, and will not see a release to update it so that it can be compatible. Anyone using the existing Hive 3 support should be able to continue using older releases.
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> In general, I think it's a good idea to let people use older releases when these situations happen. It is difficult for the project to continue to support libraries that are EOL, and I don't think there's a great justification for it, considering Iceberg support in Hive 4 is native and much better!
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> On Wed, Nov 27, 2024 at 7:12 AM Cheng Pan <pan3...@gmail.com> wrote:
>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>> >>> That said, it would be helpful if they continue running tests against the latest stable Hive releases to ensure that any changes don't unintentionally break something for Hive, which would be beyond our control.
>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>> >>> I believe we should continue maintaining a Hive Iceberg runtime test suite with the latest version of Hive in the Iceberg repository.
>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>> >>> I think we can keep some basic Hive 4 tests in the Iceberg repo.
>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>> >>> Instead of running basic tests in the Iceberg repo, maybe it makes more sense to let Iceberg publish daily snapshot jars to Nexus, and have a daily CI in Hive consume those jars and run the full Iceberg tests?
>>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>> >>> Thanks,
>>>>>>>>>>>>>> >>> Cheng Pan