Hi All,

As I see it, there is a general opinion against keeping the Hive code in the
Iceberg repo, in favor of maintaining a set of tests that verifies the current
Iceberg code against the latest Hive release. To me it would seem a bit odd
to maintain a test suite for verifying code that is not maintained within
this repo. By the same logic we could maintain a test suite for any other
query engine, not just Hive. I think it's either code+tests or neither.

I'd rather challenge the general consensus around Option 1 (remove the Hive
code from Iceberg), and I'd like to understand the motivation for why this
whole duplication of code happened between Iceberg and Hive. If I read
correctly between the lines, Hive developers found it difficult to get their
PRs merged or even reviewed by Iceberg committers, and hence decided to
replicate the code and do their own implementation. However, I haven't seen
any communication raising awareness of these difficulties.
I think that people have put serious effort into the Hive code within the
Iceberg repo, and before dropping it we should take a step back and see
whether the current situation can be fixed somehow:
 - Can we raise awareness that the code review bandwidth for Hive PRs is
insufficient? (If this was indeed the motivation of the Hive devs.)
 - Would it be feasible to collect a list of the PRs that would be required
to bring the missing Hive-related code into Iceberg?
 - Would it be possible to get some commitment from the people reviewing
Iceberg-Hive code that they will try to find some time to take a look at
these?

Let me know if the above doesn't make any sense, though!
Regards,
Gabor

On Tue, Nov 26, 2024 at 9:31 PM Simhadri G <simhad...@apache.org> wrote:

>
>
> Hi Everyone,
>
> Thank you, Peter, for the discussion!
>
> I’m also leaning toward option one. However, given that Apache Iceberg is
> designed to be engine-agnostic, I believe we should continue maintaining a
> Hive Iceberg runtime test suite with the latest version of Hive in the
> Iceberg repository. This will help identify any changes that could break
> Hive compatibility early on.
>
> So I agree with Ayush, Denys, and Butao on option 1. I think options 2 and 3
> would be difficult, as they would require a significant amount of time and
> effort from the community to maintain.
>
>
> Thanks,
> Simhadri G
>
>
>
> On Tue, Nov 26, 2024, 7:50 AM Butao Zhang <butaozha...@163.com> wrote:
>
>> Hi folks,
>>
>> Firstly, thanks Peter for bringing it up! I also think option 1 is the
>> more reasonable solution right now, as we have developed lots of advanced
>> Iceberg features in the Hive repo, such as merge-on-read, copy-on-write,
>> and compaction, and these features are coupled with the Hive core code
>> base. The Hive runtime/connector in the Iceberg repo cannot easily support
>> these advanced features. So in the long term, dropping the Hive runtime
>> from the Iceberg repo and maintaining it in the Hive repo is more sensible.
>>
>> BTW, I have done some work on upgrading Iceberg in the Hive repo, such as
>> HIVE-28495. We often backport Hive-Iceberg related commits from the
>> Iceberg repo to the Hive repo. What I noticed is that the iceberg-catalog
>> in the Hive repo (equivalent to hive-metastore in the Iceberg repo) rarely
>> changes. As Denys said above, *we could potentially drop it from the Hive
>> repo and maybe rename it to `hive-catalog` in Iceberg.* I think it makes
>> more sense to keep the Hive catalog in one place. But I am not sure
>> whether the Hive catalog will become coupled with Hive core code when
>> developing upcoming advanced features. If it is coupled with Hive core
>> code, it's better for it to stay in the Hive repo. Some folks who know
>> more about catalogs can give more context.
>>
>> Regarding the Hive (Hive 4) test integration in the Iceberg repo: in
>> general, I think we can keep some basic Hive 4 tests in the Iceberg repo,
>> as this not only makes Iceberg core more stable, but also ensures that
>> Hive 4's Iceberg runtime will not be broken over time. I have seen that
>> the Trino repo does some Spark integration testing (
>> https://github.com/trinodb/trino/blob/master/testing/trino-product-tests/src/main/java/io/trino/tests/product/iceberg/TestIcebergSparkCompatibility.java)
>> . Maybe we can consider a similar approach.
>>
>>
>>
>> Thanks,
>> Butao Zhang
>> ---- Replied Message ----
>> From Wing Yew Poon <wyp...@cloudera.com.INVALID>
>> Date 11/26/2024 05:50
>> To <dev@iceberg.apache.org>
>> Cc <d...@hive.apache.org>
>> Subject Re: [DISCUSS] Hive Support
>> For the Hive runtime, would it be feasible for the Hive community to
>> contribute a suite of tests to the Iceberg repo that can be run with
>> dependencies from the latest Hive release (Hive 4.x), and then update them
>> from time to time as appropriate? The purpose of this suite would be to
>> test integration of Iceberg core with the Hive runtime. Perhaps the
>> existing tests in the mr and hive3 modules could be a starting point, or
>> you might decide on different tests altogether.
>> The development of the Hive runtime would then continue as now in the
>> Hive repo, but you gain better assurance of compatibility with ongoing
>> Iceberg development, with a relatively small maintenance burden in Iceberg.
>>
>>
>>
>> On Mon, Nov 25, 2024 at 11:56 AM Ayush Saxena <ayush...@gmail.com> wrote:
>>
>>> Hi Peter,
>>>
>>> Thanks for bringing this to our attention.
>>>
>>> From my side, I have a say only on the code that resides in the Hive
>>> repository. I am okay with the first approach, as we are already
>>> following it for the most part. Whether Iceberg keeps or drops the
>>> code shouldn't have much impact on us (I don't think I have a say on
>>> that either). That said, it would be helpful if they continue running
>>> tests against the latest stable Hive releases to ensure that any
>>> changes don't unintentionally break something for Hive, which would be
>>> beyond our control.
>>>
>>> Regarding having a separate code repository for the connectors, I
>>> believe the challenges would outweigh the benefits. As mentioned, the
>>> initial workload would be significant, but more importantly,
>>> maintaining a regular cadence of releases would be even more
>>> difficult. I don’t see a large pool of contributors specifically
>>> focused on this area who could take ownership and drive releases for a
>>> single repository. Additionally, the ASF doesn’t officially allow
>>> repo-level committers or PMC members who could be recruited solely to
>>> manage one repository. Given these constraints, I suggest dropping
>>> this idea for now.
>>>
>>> Best,
>>> Ayush
>>>
>>> On Tue, 26 Nov 2024 at 01:05, Denys Kuzmenko <dkuzme...@apache.org>
>>> wrote:
>>> >
>>> > Hi Peter,
>>> >
>>> > Thanks for bringing it up!
>>> >
>>> > I think that option 1 is the only viable solution here (remove the
>>> hive-runtime from the iceberg repo). Main reason: lack of reviewers for
>>> things other than Spark.
>>> >
>>> > Note: I need to double-check, but I am pretty sure there is no
>>> difference between Hive's `iceberg-catalog` and Iceberg's
>>> `hive-metastore`, so we could potentially drop it from the Hive repo and
>>> maybe rename it to `hive-catalog` in Iceberg?
>>> >
>>> > Supporting one more connector repo seems like overhead: we'd need to
>>> set up infra and CI, and have active contributors/release managers. The
>>> latter is probably the reason why we still haven't moved HMS into a
>>> separate repo.
>>> >
>>> > Having the Iceberg connector in Hive gives us more flexibility and
>>> ownership of that component, and doesn't block active development.
>>> > We try to stay up to date with the latest Iceberg, but it usually takes
>>> some time.
>>> >
>>> > I'd be glad to hear other opinions.
>>> >
>>> > Thanks,
>>> > Denys
>>>
>>