FYI --
It looks like the built-in Hive version in the master branch of Apache
Spark is 2.3.10 (https://issues.apache.org/jira/browse/SPARK-47018), and
https://issues.apache.org/jira/browse/SPARK-44114 (upgrade built-in Hive to
3+) is an open issue.


On Mon, Jan 6, 2025 at 1:07 PM Wing Yew Poon <wyp...@cloudera.com> wrote:

> Hi Peter,
> In Spark, you can specify the Hive version of the metastore that you want
> to use. There is a configuration, spark.sql.hive.metastore.version, which
> currently (as of Spark 3.5) defaults to 2.3.9, and the jars supporting this
> default version are shipped with Spark as built-in. You can specify a
> different version and then specify spark.sql.hive.metastore.jars=path (the
> default is built-in) and spark.sql.hive.metastore.jars.path to point to
> jars for the Hive metastore version you want to use. What
> https://issues.apache.org/jira/browse/SPARK-45265 does is to allow 4.0.x
> to be supported as a spark.sql.hive.metastore.version. I haven't been
> following Spark 4, but I suspect that the built-in version is not changing
> to Hive 4.0. The built-in version is also used for other things that Spark
> may use from Hive (aside from interaction with HMS), such as Hive SerDes.
> See https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
> .
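As an illustration, pointing Spark at a non-built-in metastore version along the lines Wing Yew describes might look like the following sketch. The version `3.1.3` and the jars path are hypothetical examples, not recommendations:

```java
// Sketch: configuring a Spark session to talk to a Hive 3.1.3 metastore
// instead of the built-in 2.3.9 client. Version and path are illustrative.
import org.apache.spark.sql.SparkSession;

public class MetastoreVersionExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hms-version-demo")
        .config("spark.sql.hive.metastore.version", "3.1.3")
        // "path" tells Spark to load the HMS client jars from
        // spark.sql.hive.metastore.jars.path instead of the built-in jars.
        .config("spark.sql.hive.metastore.jars", "path")
        .config("spark.sql.hive.metastore.jars.path", "/opt/hive-3.1.3/lib/*")
        .enableHiveSupport()
        .getOrCreate();
    spark.stop();
  }
}
```

The same three properties can equally be passed as `--conf` flags to spark-submit; the point is only that the metastore client version is decoupled from the jars Spark ships with.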
> - Wing Yew
>
>
> On Mon, Jan 6, 2025 at 2:04 AM Péter Váry <peter.vary.apa...@gmail.com>
> wrote:
>
>> Hi Manu,
>>
>> > Spark has only added hive 4.0 metastore support recently for Spark
>> 4.0[1] and there will be conflicts
>>
>> Does this mean that Spark 4.0 will always use Hive 4 code? Or will it use
>> Hive 2 when it is present on the classpath, and fall back to the embedded
>> Hive 4 code when older Hive versions are not on the classpath?

>>
>> > Firstly, upgrading from Hive 2 to Hive 4 is a huge change
>>
>> Is this a huge change even after we remove the Hive runtime module?
>>
>> After removing the Hive runtime module, we have 2 remaining Hive
>> dependencies:
>>
>>    - HMS Client
>>       - The Thrift API should not change between Hive versions, so
>>       unless we start to use Hive 4-specific features we should be fine
>>       here - whatever Hive version we use should work
>>       - Java API changes. We found that in Hive 2 and Hive 3 the
>>       HMSClient classes used different constructors, so we ended up using
>>       DynMethods to pick the appropriate constructor - if we pin a single
>>       Hive version here, then we won't need the DynMethods anymore
>>       - Based on our experience, even though Hive 3 itself doesn't
>>       support Java 11, the Hive 3 HMS client has no issues when used with
>>       Java 11
>>    - Testing infrastructure
>>       - TestHiveMetastore creates and starts an HMS instance. This could
>>       be highly dependent on the version of Hive we are using. Since this
>>       is test-only code, I expect that only our tests interact with it
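The constructor-selection point above can be sketched with plain reflection. This is a simplified stand-in for what Iceberg's DynMethods-style helpers do, not the actual implementation, and the factory class here is hypothetical:

```java
import java.lang.reflect.Constructor;

// Simplified sketch of DynMethods-style constructor selection: at runtime,
// scan the constructors available on the client class and pick one whose
// single parameter accepts the given configuration object. This lets the
// same code work whether the Hive 2 or Hive 3 constructor shape is present.
public class HmsClientFactory {
  public static Object newClient(Class<?> clientClass, Object conf) throws Exception {
    for (Constructor<?> ctor : clientClass.getConstructors()) {
      Class<?>[] params = ctor.getParameterTypes();
      if (params.length == 1 && params[0].isInstance(conf)) {
        return ctor.newInstance(conf);
      }
    }
    throw new NoSuchMethodException(
        "No single-argument constructor on " + clientClass.getName()
            + " accepts " + conf.getClass().getName());
  }
}
```

With a single pinned Hive version, this kind of runtime lookup could be replaced by a direct constructor call.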
>>
>> *@Manu*: You know more of the details here. Do we have HMSClient issues
>> when we use Hive 4 code? If I miss something in the listing above, please
>> correct me.
>>
>> Based on this, in an ideal world:
>>
>>    - Hive would provide an HMS client jar containing only the Java code
>>    needed to connect to and communicate with an HMS instance over Thrift
>>    (no internal HMS server code, etc.). We could use it as a dependency
>>    for our iceberg-hive-metastore module, either setting a minimal version
>>    or using a shaded embedded version. *@Hive* folks - is this a valid
>>    option? What are the reasons that there is no metastore-client jar
>>    provided currently? Would it be possible to generate one in some future
>>    Hive release? It seems like a worthwhile feature to me.
>>    - We would create our own version-dependent HMS test infrastructure if
>>    we want to support Spark versions which support older Hive versions.
>>
>> As a result of this, we could have:
>>
>>    - Clean definition of which Hive version is supported
>>    - Testing for the supported Hive versions
>>    - Java 11 support
>>
>> As an alternative, we can create a testing matrix where some tests are run
>> with both Hive 3 and Hive 4, and some tests are run with only Hive 3 (for
>> older Spark versions which do not support Hive 4).
>>
>> Thanks Manu for driving this!
>> Peter
>>
>> Manu Zhang <owenzhang1...@gmail.com> ezt írta (időpont: 2025. jan. 5.,
>> V, 5:18):
>>
>>> This basically means that we need to support every exact Hive version
>>>> which is used by Spark, and we need to exclude our own Hive version from
>>>> the Spark runtime.
>>>
>>>
>>> Firstly, upgrading from Hive 2 to Hive 4 is a huge change, and I expect
>>> compatibility to be much better once Iceberg and Spark are both on Hive 4.
>>>
>>> Secondly, the coupling can be loosened if we are moving toward the REST
>>> catalog.
>>>
>>> On Fri, Jan 3, 2025 at 7:26 PM Péter Váry <peter.vary.apa...@gmail.com>
>>> wrote:
>>>
>>>> That sounds really interesting in a bad way :) :(
>>>>
>>>> This basically means that we need to support every exact Hive version
>>>> which is used by Spark, and we need to exclude our own Hive version from
>>>> the Spark runtime.
>>>>
>>>> On Thu, Dec 19, 2024, 04:00 Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>>
>>>>> Hi Peter,
>>>>>
>>>>>> I think we should make sure that the Iceberg Hive version is
>>>>>> independent from the version used by Spark
>>>>>
>>>>>  I'm afraid that is not how it works currently. When Spark is deployed
>>>>> with Hive libraries (I suppose this is common), the iceberg-spark
>>>>> runtime must be compatible with them.
>>>>> Otherwise, we need to ask users to exclude Hive libraries from Spark
>>>>> and ship the iceberg-spark runtime with Iceberg's Hive dependencies.
>>>>>
>>>>> Regards,
>>>>> Manu
>>>>>
>>>>> On Wed, Dec 18, 2024 at 9:08 PM Péter Váry <
>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>
>>>>>> @Manu: What will be the end result? Do we have to use the same Hive
>>>>>> version in Iceberg as the one defined by Spark? I think we should make
>>>>>> sure that the Iceberg Hive version is independent of the version used
>>>>>> by Spark.
>>>>>>
>>>>>> On Mon, Dec 16, 2024, 21:58 rdb...@gmail.com <rdb...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> > I'm not sure there's an upgrade path before Spark 4.0. Any ideas?
>>>>>>>
>>>>>>> We can at least separate the concerns. We can remove the runtime
>>>>>>> modules that are the main issue. If we compile against an older
>>>>>>> version of the Hive metastore module (leaving it unchanged), that at
>>>>>>> least dramatically reduces the surface area for Java version issues.
>>>>>>> As long as the API is compatible (and we haven't heard complaints
>>>>>>> that it is not), I think users can override the version in their
>>>>>>> environments.
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> On Sun, Dec 15, 2024 at 5:55 PM Manu Zhang <owenzhang1...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Daniel,
>>>>>>>> I'll start a vote once I get the PR ready.
>>>>>>>>
>>>>>>>> Hi Ryan,
>>>>>>>> Sorry, I wasn't clear in the last email that the consensus is to
>>>>>>>> upgrade Hive metastore support.
>>>>>>>>
>>>>>>>> Well, I was too optimistic about the upgrade. Spark has only
>>>>>>>> recently added Hive 4.0 metastore support, for Spark 4.0[1], and
>>>>>>>> there will be conflicts between Spark's Hive 2.3.9 and our Hive 4.0
>>>>>>>> dependencies.
>>>>>>>> I'm not sure there's an upgrade path before Spark 4.0. Any ideas?
>>>>>>>>
>>>>>>>> 1. https://issues.apache.org/jira/browse/SPARK-45265
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Manu
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Dec 14, 2024 at 4:31 AM rdb...@gmail.com <rdb...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Oh, I think I see. The upgrade to Hive 4 is just for the Hive
>>>>>>>>> metastore support? When I read the thread, I thought that we weren't 
>>>>>>>>> going
>>>>>>>>> to change the metastore. That seems reasonable to me. Sorry for
>>>>>>>>> the confusion.
>>>>>>>>>
>>>>>>>>> On Fri, Dec 13, 2024 at 10:24 AM rdb...@gmail.com <
>>>>>>>>> rdb...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Sorry, I must have missed something. I don't think that we should
>>>>>>>>>> upgrade anything in Iceberg to Hive 4. Why not simply remove the Hive
>>>>>>>>>> support entirely? Why would anyone need Hive 4 support from Iceberg 
>>>>>>>>>> when it
>>>>>>>>>> is built into Hive 4?
>>>>>>>>>>
>>>>>>>>>> On Thu, Dec 12, 2024 at 11:03 AM Daniel Weeks <dwe...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Manu,
>>>>>>>>>>>
>>>>>>>>>>> I agree with the direction here, but we should probably hold a
>>>>>>>>>>> quick procedural vote just to confirm since this is a significant 
>>>>>>>>>>> change in
>>>>>>>>>>> support for Hive.
>>>>>>>>>>>
>>>>>>>>>>> -Dan
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Dec 11, 2024 at 5:19 PM Manu Zhang <
>>>>>>>>>>> owenzhang1...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks all for sharing your thoughts. It looks like there's a
>>>>>>>>>>>> consensus on upgrading to Hive 4 and dropping hive-runtime.
>>>>>>>>>>>> I've submitted a PR[1] as the first step. Please help review.
>>>>>>>>>>>>
>>>>>>>>>>>> 1. https://github.com/apache/iceberg/pull/11750
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Manu
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Nov 28, 2024 at 11:26 PM Shohei Okumiya <
>>>>>>>>>>>> oku...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I also prefer option 1. I have some initiatives[1] to improve
>>>>>>>>>>>>> integrations between Hive and Iceberg. The current style
>>>>>>>>>>>>> allows us to
>>>>>>>>>>>>> develop both Hive's core and HiveIcebergStorageHandler
>>>>>>>>>>>>> simultaneously.
>>>>>>>>>>>>> That would help us enhance integrations.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - [1] https://issues.apache.org/jira/browse/HIVE-28410
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Okumin
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Nov 28, 2024 at 4:17 AM Fokko Driesprong <
>>>>>>>>>>>>> fo...@apache.org> wrote:
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Hey Cheng,
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Thanks for the suggestion. The nightly snapshots are
>>>>>>>>>>>>> available:
>>>>>>>>>>>>> https://repository.apache.org/content/groups/snapshots/org/apache/iceberg/iceberg-core/,
>>>>>>>>>>>>> which might help when working on features that are not released 
>>>>>>>>>>>>> yet (eg
>>>>>>>>>>>>> Nanosecond timestamps). Besides that, we should run RCs against 
>>>>>>>>>>>>> Hive to
>>>>>>>>>>>>> check if everything works as expected.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > I'm leaning toward removing Hive 2 and 3 as well.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Kind regards,
>>>>>>>>>>>>> > Fokko
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Op wo 27 nov 2024 om 20:05 schreef rdb...@gmail.com <
>>>>>>>>>>>>> rdb...@gmail.com>:
>>>>>>>>>>>>> >>
>>>>>>>>>>>>> >> I think that we should remove Hive 2 and Hive 3. We already
>>>>>>>>>>>>> agreed to remove Hive 2, but Hive 3 is no longer compatible with
>>>>>>>>>>>>> the project, is already EOL, and will not see a release that
>>>>>>>>>>>>> restores compatibility. Anyone using the existing Hive 3 support
>>>>>>>>>>>>> should be able to continue using older releases.
>>>>>>>>>>>>> >>
>>>>>>>>>>>>> >> In general, I think it's a good idea to let people use
>>>>>>>>>>>>> older releases when these situations happen. It is difficult for 
>>>>>>>>>>>>> the
>>>>>>>>>>>>> project to continue to support libraries that are EOL and I don't 
>>>>>>>>>>>>> think
>>>>>>>>>>>>> there's a great justification for it, considering Iceberg support 
>>>>>>>>>>>>> in Hive 4
>>>>>>>>>>>>> is native and much better!
>>>>>>>>>>>>> >>
>>>>>>>>>>>>> >> On Wed, Nov 27, 2024 at 7:12 AM Cheng Pan <
>>>>>>>>>>>>> pan3...@gmail.com> wrote:
>>>>>>>>>>>>> >>>
>>>>>>>>>>>>> >>> That said, it would be helpful if they continue running
>>>>>>>>>>>>> >>> tests against the latest stable Hive releases to ensure
>>>>>>>>>>>>> that any
>>>>>>>>>>>>> >>> changes don’t unintentionally break something for Hive,
>>>>>>>>>>>>> which would be
>>>>>>>>>>>>> >>> beyond our control.
>>>>>>>>>>>>> >>>
>>>>>>>>>>>>> >>>
>>>>>>>>>>>>> >>> I believe we should continue maintaining a Hive Iceberg
>>>>>>>>>>>>> runtime test suite with the latest version of Hive in the Iceberg
>>>>>>>>>>>>> repository.
>>>>>>>>>>>>> >>>
>>>>>>>>>>>>> >>>
>>>>>>>>>>>>> >>> i think we can keep some basic Hive4 tests in iceberg repo
>>>>>>>>>>>>> >>>
>>>>>>>>>>>>> >>>
>>>>>>>>>>>>> >>> Instead of running basic tests on the Iceberg repo, maybe
>>>>>>>>>>>>> let Iceberg publish daily snapshot jars to Nexus, and have a 
>>>>>>>>>>>>> daily CI in
>>>>>>>>>>>>> Hive to consume those jars and run full Iceberg tests makes more 
>>>>>>>>>>>>> sense?
>>>>>>>>>>>>> >>>
>>>>>>>>>>>>> >>> Thanks,
>>>>>>>>>>>>> >>> Cheng Pan
>>>>>>>>>>>>> >>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>