Hi Everyone,

Thank you, Peter, for the discussion!
I’m also leaning toward option one. However, given that Apache Iceberg is designed to be engine-agnostic, I believe we should continue maintaining a Hive Iceberg runtime test suite with the latest version of Hive in the Iceberg repository. This will help identify any changes that could break Hive compatibility early on. So I agree with Ayush, Denys, and Butao on option 1. I think options 2 and 3 would be difficult, as they would require a significant amount of time and effort from the community to maintain.

Thanks,
Simhadri G

On Tue, Nov 26, 2024, 7:50 AM Butao Zhang <butaozha...@163.com> wrote:

> Hi folks,
>
> Firstly, thanks Peter for bringing it up! I also think option 1 is the more reasonable solution right now, as we have developed lots of advanced Iceberg features in the Hive repo, such as MoR, CoW, and compaction, and these features are coupled with the Hive core code base. The Hive runtime/connector in the Iceberg repo cannot easily support these advanced features. So in the long term, dropping the Hive runtime from the Iceberg repo and maintaining it in the Hive repo is more sensible.
>
> BTW, I have done some work on upgrading Iceberg in the Hive repo, like HIVE-28495. We often backport Hive-Iceberg related commits from the Iceberg repo to the Hive repo. What I have noticed is that the iceberg-catalog module in the Hive repo (equivalent to hive-metastore in the Iceberg repo) rarely changes. As Denys said above, we could potentially drop it from the Hive repo and maybe rename it to `hive-catalog` in Iceberg. I think it makes more sense to keep the Hive catalog in one place. But I am not sure whether the Hive catalog will become coupled with Hive core code when developing some upcoming advanced features. If it is coupled with Hive core code, it is better for it to stay in the Hive repo. Folks who know more about catalogs can give more context.
>
> About the Hive (Hive 4) test integration in the Iceberg repo: in general, I think we can keep some basic Hive 4 tests in the Iceberg repo, as this not only makes Iceberg core more stable, but also ensures that Hive 4's Iceberg runtime will not be broken over time. I have seen that the Trino repo does some Spark integration testing (https://github.com/trinodb/trino/blob/master/testing/trino-product-tests/src/main/java/io/trino/tests/product/iceberg/TestIcebergSparkCompatibility.java). Maybe we can consider this approach.
>
> Thanks,
> Butao Zhang
>
> ---- Replied Message ----
> From: Wing Yew Poon <wyp...@cloudera.com.INVALID>
> Date: 11/26/2024 05:50
> To: <dev@iceberg.apache.org>
> Cc: <d...@hive.apache.org>
> Subject: Re: [DISCUSS] Hive Support
>
> For the Hive runtime, would it be feasible for the Hive community to contribute a suite of tests to the Iceberg repo that can be run with dependencies from the latest Hive release (Hive 4.x), and then update them from time to time as appropriate? The purpose of this suite would be to test the integration of Iceberg core with the Hive runtime. Perhaps the existing tests in the mr and hive3 modules could be a starting point, or you might decide on different tests altogether.
> The development of the Hive runtime would then continue as now in the Hive repo, but you gain better assurance of compatibility with ongoing Iceberg development, with a relatively small maintenance burden in Iceberg.
>
> On Mon, Nov 25, 2024 at 11:56 AM Ayush Saxena <ayush...@gmail.com> wrote:
>
>> Hi Peter,
>>
>> Thanks for bringing this to our attention.
>>
>> From my side, I have a say only on the code that resides in the Hive repository. I am okay with the first approach, as we are already following it for the most part. Whether Iceberg keeps or drops the code shouldn't have much impact on us (I don't think I have a say on that either). That said, it would be helpful if they continue running tests against the latest stable Hive releases to ensure that any changes don't unintentionally break something for Hive, which would be beyond our control.
>>
>> Regarding having a separate code repository for the connectors, I believe the challenges would outweigh the benefits. As mentioned, the initial workload would be significant, but more importantly, maintaining a regular cadence of releases would be even more difficult. I don't see a large pool of contributors specifically focused on this area who could take ownership and drive releases for a single repository. Additionally, the ASF doesn't officially allow repo-level committers or PMC members who could be recruited solely to manage one repository. Given these constraints, I suggest dropping this idea for now.
>>
>> Best,
>> Ayush
>>
>> On Tue, 26 Nov 2024 at 01:05, Denys Kuzmenko <dkuzme...@apache.org> wrote:
>> >
>> > Hi Peter,
>> >
>> > Thanks for bringing it up!
>> >
>> > I think that option 1 is the only viable solution here (remove the hive-runtime from the Iceberg repo). Main reason: lack of reviewers for things other than Spark.
>> >
>> > Note: I need to double-check, but I am pretty sure there is no difference between Hive's `iceberg-catalog` and Iceberg's `hive-metastore`, so we could potentially drop it from the Hive repo and maybe rename it to `hive-catalog` in Iceberg?
>> >
>> > Supporting one more connector repo seems like an overhead: we would need to set up infra and CI, and have active contributors and release managers. The latter is probably the reason why we still haven't moved HMS into a separate repo.
>> >
>> > Having the Iceberg connector in Hive gives us more flexibility and ownership of that component, and doesn't block active development.
>> > We try to stay up to date with the latest Iceberg, but it usually takes some time.
>> >
>> > I'd be glad to hear other opinions.
>> >
>> > Thanks,
>> > Denys
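
To make the test-suite idea discussed above concrete, the following is a minimal sketch of the kind of Hive 4 smoke test that could live in the Iceberg repo, in the spirit of Trino's TestIcebergSparkCompatibility. It assumes the build can provision or reach a Hive 4.x HiveServer2 with Iceberg support; the JDBC URL, class name, and table name are illustrative placeholders, not existing fixtures in either repo.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

public class TestHive4IcebergCompatibility {

  // Hypothetical endpoint of a Hive 4.x HiveServer2 provisioned by the test harness.
  private static final String HIVE_JDBC_URL = "jdbc:hive2://localhost:10000/default";

  @Test
  public void testCreateInsertAndReadIcebergTable() throws Exception {
    // The Hive JDBC driver normally self-registers; loading it explicitly keeps the sketch self-contained.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    try (Connection conn = DriverManager.getConnection(HIVE_JDBC_URL);
         Statement stmt = conn.createStatement()) {
      // Hive 4 DDL: STORED BY ICEBERG creates an Iceberg-backed table through the Hive runtime.
      stmt.execute("DROP TABLE IF EXISTS iceberg_compat_smoke");
      stmt.execute("CREATE TABLE iceberg_compat_smoke (id INT, data STRING) STORED BY ICEBERG");

      // Write through Hive's Iceberg integration and read the row back to verify round-tripping.
      stmt.execute("INSERT INTO iceberg_compat_smoke VALUES (1, 'a')");
      try (ResultSet rs = stmt.executeQuery(
          "SELECT id, data FROM iceberg_compat_smoke ORDER BY id")) {
        assertTrue(rs.next());
        assertEquals(1, rs.getInt("id"));
        assertEquals("a", rs.getString("data"));
      }
    }
  }
}

A suite along these lines would exercise only the public Hive SQL surface, so it could be updated against each new Hive 4.x release without depending on Hive-internal test utilities.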