Re: [Announce] Singapore Apache Iceberg Community Meetup

2024-12-16 Thread Denny Lee
Hey Kevin,

Do you have a bigger image for this event so we can help promote the event
via social media?

Thanks!
Denny


On Fri, Dec 13, 2024 at 3:06 PM Kevin Liu  wrote:

> Hey everyone,
>
> The Iceberg community meetup has expanded to Singapore! The next Singapore
> meetup will be on Wednesday, December 18 from 5:00 PM to 8:00 PM at 16
> Collyer Quay, level 12
> Singapore.
>
> Here's the luma page to sign up for the event, https://lu.ma/79xk5w5t
>
> There will be 3 presentations from community members.
> * Rayner Chen (Tech VP at VeloDB) - Building Fast Data Lake Analysis on
> top of Apache Iceberg & Apache Doris
> * Xinyu Zhou (Co-Founder & CTO at AutoMQ) - AutoMQ Table Topic: Bridging
> Streaming and Analytics with Iceberg
> * Jay Chia (Co-Founder at Eventual (Daft)) - Iceberg in Python
>
> Presentations will be recorded and uploaded to the Iceberg meetup YouTube
> channel (https://www.youtube.com/@IcebergMeetup)
>
> Best,
> Kevin Liu
>
>


Re: [DISCUSS] Hive Support

2024-12-16 Thread rdb...@gmail.com
> I'm not sure there's an upgrade path before Spark 4.0. Any ideas?

We can at least separate the concerns. We can remove the runtime modules
that are the main issue. If we compile against an older version of the Hive
metastore module (leaving it unchanged) that at least has a dramatically
reduced surface area for Java version issues. As long as the API is
compatible (and we haven't heard complaints that it is not) then I think
users can override the version in their environments.

Ryan

On Sun, Dec 15, 2024 at 5:55 PM Manu Zhang  wrote:

> Hi Daniel,
> I'll start a vote once I get the PR ready.
>
> Hi Ryan,
> Sorry, I wasn't clear in the last email that the consensus is to upgrade
> Hive metastore support.
>
> Well, I was too optimistic about the upgrade. Spark has only added hive
> 4.0 metastore support recently for Spark 4.0[1] and there will be conflicts
> between Spark's hive 2.3.9 and our hive 4.0 dependencies.
> I'm not sure there's an upgrade path before Spark 4.0. Any ideas?
>
> 1. https://issues.apache.org/jira/browse/SPARK-45265
>
> Thanks,
> Manu
>
>
> On Sat, Dec 14, 2024 at 4:31 AM rdb...@gmail.com  wrote:
>
>> Oh, I think I see. The upgrade to Hive 4 is just for the Hive metastore
>> support? When I read the thread, I thought that we weren't going to change
>> the metastore. That seems reasonable to me. Sorry for the confusion.
>>
>> On Fri, Dec 13, 2024 at 10:24 AM rdb...@gmail.com 
>> wrote:
>>
>>> Sorry, I must have missed something. I don't think that we should
>>> upgrade anything in Iceberg to Hive 4. Why not simply remove the Hive
>>> support entirely? Why would anyone need Hive 4 support from Iceberg when it
>>> is built into Hive 4?
>>>
>>> On Thu, Dec 12, 2024 at 11:03 AM Daniel Weeks  wrote:
>>>
 Hey Manu,

 I agree with the direction here, but we should probably hold a quick
 procedural vote just to confirm since this is a significant change in
 support for Hive.

 -Dan

 On Wed, Dec 11, 2024 at 5:19 PM Manu Zhang 
 wrote:

> Thanks all for sharing your thoughts. It looks like there's a consensus on
> upgrading to Hive 4 and dropping hive-runtime.
> I've submitted a PR[1] as the first step. Please help review.
>
> 1. https://github.com/apache/iceberg/pull/11750
>
> Thanks,
> Manu
>
> On Thu, Nov 28, 2024 at 11:26 PM Shohei Okumiya 
> wrote:
>
>> Hi all,
>>
>> I also prefer option 1. I have some initiatives[1] to improve
>> integrations between Hive and Iceberg. The current style allows us to
>> develop both Hive's core and HiveIcebergStorageHandler simultaneously.
>> That would help us enhance integrations.
>>
>> - [1] https://issues.apache.org/jira/browse/HIVE-28410
>>
>> Regards,
>> Okumin
>>
>> On Thu, Nov 28, 2024 at 4:17 AM Fokko Driesprong 
>> wrote:
>> >
>> > Hey Cheng,
>> >
>> > Thanks for the suggestion. The nightly snapshots are available:
>> https://repository.apache.org/content/groups/snapshots/org/apache/iceberg/iceberg-core/,
>> which might help when working on features that are not released yet (eg
>> Nanosecond timestamps). Besides that, we should run RCs against Hive to
>> check if everything works as expected.
>> >
>> > I'm leaning toward removing Hive 2 and 3 as well.
>> >
>> > Kind regards,
>> > Fokko
>> >
>> > Op wo 27 nov 2024 om 20:05 schreef rdb...@gmail.com <
>> rdb...@gmail.com>:
>> >>
>> >> I think that we should remove Hive 2 and Hive 3. We already agreed
>> to remove Hive 2, but Hive 3 is not compatible with the project anymore,
>> is already EOL, and will not see a release that makes it compatible.
>> Anyone using the existing Hive 3 support should be able to continue
>> using older releases.
>> >>
>> >> In general, I think it's a good idea to let people use older
>> releases when these situations happen. It is difficult for the project to
>> continue to support libraries that are EOL and I don't think there's a
>> great justification for it, considering Iceberg support in Hive 4 is
>> native and much better!
>> >>
>> >> On Wed, Nov 27, 2024 at 7:12 AM Cheng Pan 
>> wrote:
>> >>>
>> >>> That said, it would be helpful if they continue running
>> >>> tests against the latest stable Hive releases to ensure that any
>> >>> changes don’t unintentionally break something for Hive, which
>> would be
>> >>> beyond our control.
>> >>>
>> >>>
>> >>> I believe we should continue maintaining a Hive Iceberg runtime
>> test suite with the latest version of Hive in the Iceberg repository.
>> >>>
>> >>>
>> >>> i think we can keep some basic Hive4 tests in iceberg repo
>> >>>
>> >>>
>> >>> Instead of running basic tests on the Iceberg repo, maybe let
>> Iceberg publish daily snapshot jars to Nexus, and ha

Re: [Discuss] Document Snapshot Summary Optional Fields for Standardization

2024-12-16 Thread Fokko Driesprong
I'm in favor of this as well. While working on PyIceberg I had to deduce
this from the Java code; having a more condensed version in the appendix of
the spec would be great.

Kind regards,
Fokko

Op ma 16 dec 2024 om 14:21 schreef Jean-Baptiste Onofré :

> Hi,
>
> Yes, I agree. I don't think we have to couple this to a spec version.
>
> Regards
> JB
>
> On Wed, Dec 11, 2024 at 11:17 PM Russell Spitzer
>  wrote:
> >
> > I want to float this back up, I think this is a really good idea for
> cross engine support. I don't think we have to tie this to any specific
> Spec version since they are just recommendations so I think we can do this
> at any time
> >
> > On Wed, Nov 27, 2024 at 1:31 PM Szehon Ho 
> wrote:
> >>
> >> This makes sense to me generally, I've tried a few times to search in
> the spec to find a list of possible snapshot summary properties, and was a
> bit surprised to not find them there.  So I think this would be a nice
> addition.
> >>
> >> I'm curious if there's any historical reason it's not been included in
> the spec.
> >>
> >> Thanks
> >> Szehon
> >>
> >> On Wed, Nov 27, 2024 at 10:55 AM Kevin Liu 
> wrote:
> >>>
> >>> Thanks for driving this Honah!
> >>>
> >>> It's important to have a consistent naming scheme so that we don't
> need to worry about edge cases when using multiple engines, and possibly
> have to deal with migrations.
> >>>
> >>> Also, since users can store arbitrary key/value pairs in the summary
> property, it's good to document the currently used properties to avoid
> collision.
> >>>
> >>> I like the proposal to document all properties in a "snapshot summary"
> table, this will ensure a centralized place to view all possible key/value
> pairs, similar to how FileIO configuration is handled in iceberg-python.
> Other implementations can use this table as a reference.
> >>>
> >>>  > This approach offers flexibility, as new fields can be added
> through documentation updates without requiring specification changes.
> >>> This will save a lot of effort since specification changes require
> greater scrutiny.
> >>>
> >>> > summary details would not be located near the Snapshot section,
> which explains the summary field.
> >>> We can link the table to the Snapshot section.
> >>>
> >>>
> >>> Would love to hear others' thoughts on this.
> >>>
> >>> Best,
> >>> Kevin Liu
> >>>
> >>> On Tue, Nov 26, 2024 at 2:50 PM Honah J.  wrote:
> 
> >>>> Hi everyone,
> >>>>
> >>>> I’d like to propose an addition to the table specification to
> document optional fields in the snapshot summary.
> >>>>
> >>>> Currently, the snapshot summary includes a required operation field
> and various optional fields. While these optional fields, such as metrics
> and partition-level summaries, are supported by Java and Python
> implementations, they are not officially documented. This creates risks of
> inconsistency as other implementations and engines adopt and interact with
> these fields.
> >>>>
> >>>> I propose adding a new section to the table specification to document
> these optional fields, ensuring consistent naming conventions and reducing
> ambiguity across implementations. While this is the primary proposal, it
> may also be worth discussing whether documenting these fields separately in
> Docs/Table would provide additional flexibility for future updates.
> >>>>
> >>>> I’d love to hear your thoughts, suggestions, or concerns about this
> proposal.
> >>>>
> >>>> Looking forward to the discussion!
> >>>>
> >>>> Links
> >>>>
> >>>> GitHub tracking issue: https://github.com/apache/iceberg/issues/11659
> >>>> Proposal:
> https://docs.google.com/document/d/1Gt1ZOXVXK60IGdlmt4QlyRzaZ1iCVyYUBfMJCsiz14I/edit?usp=sharing
> >>>> PR: https://github.com/apache/iceberg/pull/11660
> >>>>
> >>>>
> >>>> Best regards,
> >>>> Honah
>


Re: [DISCUSS] Remove snapshot-id from IRC SetStatisticsUpdate

2024-12-16 Thread Fokko Driesprong
Hey Christian,

Great catch, I would also be in favor of removing the outer one. I don't
see any value in having them both.

Kind regards,
Fokko

Op ma 16 dec 2024 om 14:26 schreef Jean-Baptiste Onofré :

> Hi,
>
> I saw the discussion on Slack. Yeah, it's redundant.
> I know some catalogs only consider the snapshot id in SetStatisticsUpdate.
>
> Regards
> JB
>
> On Fri, Dec 13, 2024 at 8:03 PM Christian 
> wrote:
> >
> > Dear all,
> >
> > I believe we currently have a redundancy in the IRC SetStatisticsUpdate
> [1].
> > SetStatisticsUpdate has a required field `snapshot-id` but also a
> `StatisticsFile` which in turn contains the `snapshot-id` as required. The
> redundant information is used in Java only for an assertion to check if the
> ids are identical [2].
> >
> > Are there any good reasons to keep both `snapshot-id`s? If not I would
> propose to deprecate the outer `snapshot-id`.
> > To remove redundancy in libraries, I am using a custom serializer /
> deserializer to handle this in Rust [3].
> >
> > Let me know what you think!
> >
> > Thanks,
> > Christian
> >
> > [1]:
> https://github.com/apache/iceberg/blob/540d6a6251e31b232fe6ed2413680621454d107a/open-api/rest-catalog-open-api.yaml#L2902
> > [2]:
> https://github.com/apache/iceberg/blob/540d6a6251e31b232fe6ed2413680621454d107a/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1314
> > [3]: https://github.com/apache/iceberg-rust/pull/799
> >
> >
>


Re: [DISCUSS] Remove snapshot-id from IRC SetStatisticsUpdate

2024-12-16 Thread Jean-Baptiste Onofré
Hi,

I saw the discussion on Slack. Yeah, it's redundant.
I know some catalogs only consider the snapshot id in SetStatisticsUpdate.

Regards
JB

On Fri, Dec 13, 2024 at 8:03 PM Christian  wrote:
>
> Dear all,
>
> I believe we currently have a redundancy in the IRC SetStatisticsUpdate [1].
> SetStatisticsUpdate has a required field `snapshot-id` but also a 
> `StatisticsFile` which in turn contains the `snapshot-id` as required. The 
> redundant information is used in Java only for an assertion to check if the 
> ids are identical [2].
>
> Are there any good reasons to keep both `snapshot-id`s? If not I would 
> propose to deprecate the outer `snapshot-id`.
> To remove redundancy in libraries, I am using a custom serializer / 
> deserializer to handle this in Rust [3].
>
> Let me know what you think!
>
> Thanks,
> Christian
>
> [1]: 
> https://github.com/apache/iceberg/blob/540d6a6251e31b232fe6ed2413680621454d107a/open-api/rest-catalog-open-api.yaml#L2902
> [2]: 
> https://github.com/apache/iceberg/blob/540d6a6251e31b232fe6ed2413680621454d107a/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1314
> [3]: https://github.com/apache/iceberg-rust/pull/799
>
>
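
For readers following the thread, the redundancy can be illustrated with a small Python sketch. The payload below follows my reading of the linked OpenAPI definition (field names `action`, `snapshot-id`, `statistics`, and the `StatisticsFile` fields); the ids, the path, and the helper `statistics_snapshot_id` are made up for illustration and are not part of any Iceberg library. The helper treats the nested id as authoritative and tolerates a missing outer id, which is roughly what deprecating the outer field would mean for a reader of the payload.

```python
import json

# Illustrative payload; the ids and the path are invented for this example.
update_json = """
{
  "action": "set-statistics",
  "snapshot-id": 3055729675574597004,
  "statistics": {
    "snapshot-id": 3055729675574597004,
    "statistics-path": "s3://bucket/warehouse/tbl/metadata/stats.puffin",
    "file-size-in-bytes": 413,
    "file-footer-size-in-bytes": 42,
    "blob-metadata": []
  }
}
"""


def statistics_snapshot_id(update):
    """Hypothetical helper: trust the nested id, tolerate a missing outer one."""
    inner = update["statistics"]["snapshot-id"]
    outer = update.get("snapshot-id")
    # Mirrors the assertion described in the thread: if both are present,
    # they must agree.
    if outer is not None and outer != inner:
        raise ValueError(f"snapshot-id mismatch: outer={outer}, statistics={inner}")
    return inner


update = json.loads(update_json)
snapshot_id = statistics_snapshot_id(update)
```

With this shape, a client that stops sending the outer `snapshot-id` would still be readable, while a payload whose two ids disagree is rejected.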


Re: [DISCUSS] Spark Catalog - Drop vs Drop with Purge

2024-12-16 Thread Jean-Baptiste Onofré
It sounds good to me.

Thanks !
Regards
JB

On Wed, Dec 11, 2024 at 7:20 PM Russell Spitzer
 wrote:
>
> Hi Y'all!
>
> Today we had a little discussion on the Apache Iceberg Catalog Community Sync
> about DROP and DROP WITH PURGE. Currently the SparkCatalog implementation
> inside of the reference library has a unique method of DROP WITH PURGE vs
> other implementations. The pseudo code is essentially
>
>
> ```
> use Spark to list files to be removed and delete them
> send a drop table request to the Catalog
> ```
>
> As opposed to other systems
>
> ```
> send a drop table request to the Catalog with the purge flag enabled
> ```
>
> This has led us to a situation where it becomes difficult for REST Catalogs
> with custom purge implementations (or those with ignore purge) to
> work properly with Spark.
>
> Bringing this behavior in line with non-Spark implementations
> would have possibly dramatic impacts on users of the
> iceberg library but our consensus in the Catalog Sync today was that we should
> eventually have that be the default behavior. To this end I propose the
> following:
>
> * We support a flag to allow current Spark users to delegate to the REST
>   Catalog (all other catalog behaviors remain the same). A PR is available
>   (credit to Tobias, who wrote the PR and brought up this topic).
> * We deprecate the client side delete for Spark.
> * In the next major release (Iceberg 2.0?) we change the behavior officially
>   to only send through the Drop Purge flag with no client side file removal.
> * For all non-REST catalog implementations we keep the code the same for
>   legacy compatibility.
>
> A user of 1.8 will then be able to choose whether their Spark DROP PURGEs
> are performed locally or remotely for REST catalogs.
>
> A user of 2.0 will only be able to do a remote purge
>
> Users of non-REST Catalogs will have no change in behavior.
>
>
> Thanks for your consideration,
> Russ
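
The two flows in the pseudo code above can be sketched in Python to make the difference concrete. This is only an illustration under stated assumptions: `RecordingCatalog`, `drop_with_purge`, and the `delegate_to_catalog` flag are hypothetical names invented here, not the SparkCatalog API or the flag from the PR, and the file names are made up.

```python
class RecordingCatalog:
    """Stand-in for a REST catalog client; it only records the requests it gets."""

    def __init__(self):
        self.requests = []

    def drop_table(self, name, purge=False):
        self.requests.append(("drop_table", name, purge))


def drop_with_purge(catalog, name, table_files, delete_file, delegate_to_catalog):
    """Hypothetical sketch of the two DROP ... PURGE flows discussed above."""
    if delegate_to_catalog:
        # Proposed behavior: the catalog decides how (or whether) to purge.
        catalog.drop_table(name, purge=True)
        return
    # Current Spark behavior: delete files on the client, then send a plain drop.
    for path in table_files:
        delete_file(path)
    catalog.drop_table(name, purge=False)


deleted = []
catalog = RecordingCatalog()
# Client-side purge: files are removed locally before the drop request.
drop_with_purge(catalog, "db.tbl", ["f1.parquet", "f2.parquet"], deleted.append, False)
# Delegated purge: nothing is deleted locally, the purge flag goes to the catalog.
drop_with_purge(catalog, "db.tbl2", ["f3.parquet"], deleted.append, True)
```

In the delegated case the client never touches the files, which is what lets a catalog with a custom purge implementation (or one that ignores purge) behave as intended.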


Re: [Discuss] Document Snapshot Summary Optional Fields for Standardization

2024-12-16 Thread Jean-Baptiste Onofré
Hi,

Yes, I agree. I don't think we have to couple this to a spec version.

Regards
JB

On Wed, Dec 11, 2024 at 11:17 PM Russell Spitzer
 wrote:
>
> I want to float this back up, I think this is a really good idea for cross 
> engine support. I don't think we have to tie this to any specific Spec 
> version since they are just recommendations so I think we can do this at any 
> time
>
> On Wed, Nov 27, 2024 at 1:31 PM Szehon Ho  wrote:
>>
>> This makes sense to me generally, I've tried a few times to search in the 
>> spec to find a list of possible snapshot summary properties, and was a bit 
>> surprised to not find them there.  So I think this would be a nice addition.
>>
>> I'm curious if there's any historical reason it's not been included in the 
>> spec.
>>
>> Thanks
>> Szehon
>>
>> On Wed, Nov 27, 2024 at 10:55 AM Kevin Liu  wrote:
>>>
>>> Thanks for driving this Honah!
>>>
>>> It's important to have a consistent naming scheme so that we don't need to 
>>> worry about edge cases when using multiple engines, and possibly have to 
>>> deal with migrations.
>>>
>>> Also, since users can store arbitrary key/value pairs in the summary 
>>> property, it's good to document the currently used properties to avoid 
>>> collision.
>>>
>>> I like the proposal to document all properties in a "snapshot summary" 
>>> table, this will ensure a centralized place to view all possible key/value 
>>> pairs, similar to how FileIO configuration is handled in iceberg-python. 
>>> Other implementations can use this table as a reference.
>>>
>>>  > This approach offers flexibility, as new fields can be added through 
>>> documentation updates without requiring specification changes.
>>> This will save a lot of effort since specification changes require greater 
>>> scrutiny.
>>>
>>> > summary details would not be located near the Snapshot section, which 
>>> > explains the summary field.
>>> We can link the table to the Snapshot section.
>>>
>>>
>>> Would love to hear others' thoughts on this.
>>>
>>> Best,
>>> Kevin Liu
>>>
>>> On Tue, Nov 26, 2024 at 2:50 PM Honah J.  wrote:

>>>> Hi everyone,
>>>>
>>>> I’d like to propose an addition to the table specification to document
>>>> optional fields in the snapshot summary.
>>>>
>>>> Currently, the snapshot summary includes a required operation field and
>>>> various optional fields. While these optional fields, such as metrics and
>>>> partition-level summaries, are supported by Java and Python
>>>> implementations, they are not officially documented. This creates risks of
>>>> inconsistency as other implementations and engines adopt and interact with
>>>> these fields.
>>>>
>>>> I propose adding a new section to the table specification to document
>>>> these optional fields, ensuring consistent naming conventions and reducing
>>>> ambiguity across implementations. While this is the primary proposal, it
>>>> may also be worth discussing whether documenting these fields separately
>>>> in Docs/Table would provide additional flexibility for future updates.
>>>>
>>>> I’d love to hear your thoughts, suggestions, or concerns about this
>>>> proposal.
>>>>
>>>> Looking forward to the discussion!
>>>>
>>>> Links
>>>>
>>>> GitHub tracking issue: https://github.com/apache/iceberg/issues/11659
>>>> Proposal:
>>>> https://docs.google.com/document/d/1Gt1ZOXVXK60IGdlmt4QlyRzaZ1iCVyYUBfMJCsiz14I/edit?usp=sharing
>>>> PR: https://github.com/apache/iceberg/pull/11660
>>>>
>>>>
>>>> Best regards,
>>>> Honah
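
To make the proposal concrete, here is a minimal Python sketch of a snapshot summary with the required `operation` field plus a few optional keys. The optional keys listed are ones I believe the Java implementation writes today, and the list is deliberately not exhaustive; the helper `split_summary`, the `KNOWN_OPTIONAL` set, and the custom key `my-engine.job-id` are assumptions made up for illustration, not part of the spec or of any Iceberg library.

```python
# Values the table spec defines for the required "operation" field.
VALID_OPERATIONS = {"append", "replace", "overwrite", "delete"}

# A few optional keys written by the Java implementation (not exhaustive).
KNOWN_OPTIONAL = {
    "added-data-files",
    "added-records",
    "added-files-size",
    "total-data-files",
    "total-records",
    "total-files-size",
    "changed-partition-count",
}


def split_summary(summary):
    """Hypothetical helper: check "operation", then separate known keys from custom ones."""
    operation = summary.get("operation")
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"missing or unknown operation: {operation!r}")
    known = {k: v for k, v in summary.items() if k in KNOWN_OPTIONAL}
    custom = {
        k: v
        for k, v in summary.items()
        if k != "operation" and k not in KNOWN_OPTIONAL
    }
    return operation, known, custom


example = {
    "operation": "append",
    "added-data-files": "4",
    "added-records": "1000",
    "total-records": "5000",
    "my-engine.job-id": "abc123",  # arbitrary user key, invented for this example
}
operation, known, custom = split_summary(example)
```

A documented table of optional keys would play the role of `KNOWN_OPTIONAL` here: implementations could agree on the names, and users adding their own keys could avoid collisions.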


Re: [Announce] Singapore Apache Iceberg Community Meetup

2024-12-16 Thread Kevin Liu
Hey Denny,

Here's one with a better resolution,
https://drive.google.com/file/d/1H2scgq70fJU8AMLXzOadOdKfVPjHLDH9/view?usp=sharing

Best,
Kevin Liu

On Mon, Dec 16, 2024 at 10:35 AM Denny Lee  wrote:

> Hey Kevin,
>
> Do you have a bigger image for this event so we can help promote the
> event via social media?
>
> Thanks!
> Denny
>
>
> On Fri, Dec 13, 2024 at 3:06 PM Kevin Liu  wrote:
>
>> Hey everyone,
>>
>> The Iceberg community meetup has expanded to Singapore! The next
>> Singapore meetup will be on Wednesday, December 18 from 5:00 PM to 8:00
>> PM at 16 Collyer Quay, level 12
>> Singapore.
>>
>> Here's the luma page to sign up for the event, https://lu.ma/79xk5w5t
>>
>> There will be 3 presentations from community members.
>> * Rayner Chen (Tech VP at VeloDB) - Building Fast Data Lake Analysis on
>> top of Apache Iceberg & Apache Doris
>> * Xinyu Zhou (Co-Founder & CTO at AutoMQ) - AutoMQ Table Topic: Bridging
>> Streaming and Analytics with Iceberg
>> * Jay Chia (Co-Founder at Eventual (Daft)) - Iceberg in Python
>>
>> Presentations will be recorded and uploaded to the Iceberg meetup YouTube
>> channel (https://www.youtube.com/@IcebergMeetup)
>>
>> Best,
>> Kevin Liu
>>
>>


Re: [Announce] Singapore Apache Iceberg Community Meetup

2024-12-16 Thread Denny Lee
merci beaucoup! :)

On Mon, Dec 16, 2024 at 10:50 AM Kevin Liu  wrote:

> Hey Denny,
>
> Here's one with a better resolution,
> https://drive.google.com/file/d/1H2scgq70fJU8AMLXzOadOdKfVPjHLDH9/view?usp=sharing
>
> Best,
> Kevin Liu
>
> On Mon, Dec 16, 2024 at 10:35 AM Denny Lee  wrote:
>
>> Hey Kevin,
>>
>> Do you have a bigger image for this event so we can help promote the
>> event via social media?
>>
>> Thanks!
>> Denny
>>
>>
>> On Fri, Dec 13, 2024 at 3:06 PM Kevin Liu  wrote:
>>
>>> Hey everyone,
>>>
>>> The Iceberg community meetup has expanded to Singapore! The next
>>> Singapore meetup will be on Wednesday, December 18 from 5:00 PM to 8:00
>>> PM at 16 Collyer Quay, level 12
>>> Singapore.
>>>
>>> Here's the luma page to sign up for the event, https://lu.ma/79xk5w5t
>>>
>>> There will be 3 presentations from community members.
>>> * Rayner Chen (Tech VP at VeloDB) - Building Fast Data Lake Analysis on
>>> top of Apache Iceberg & Apache Doris
>>> * Xinyu Zhou (Co-Founder & CTO at AutoMQ) - AutoMQ Table Topic:
>>> Bridging Streaming and Analytics with Iceberg
>>> * Jay Chia (Co-Founder at Eventual (Daft)) - Iceberg in Python
>>>
>>> Presentations will be recorded and uploaded to the Iceberg meetup
>>> YouTube channel (https://www.youtube.com/@IcebergMeetup)
>>>
>>> Best,
>>> Kevin Liu
>>>
>>>


Re: 1.7.1 breaking change related to ADLS support

2024-12-16 Thread Jean-Baptiste Onofré
Hi Alex,

It was exactly my concern (and question) when I did the review on the PR.

I agree it's a breaking change and definitely not good in a micro/patch release.

As 1.7.0 is still working, I would propose to wait for 1.8.0 if possible:
I'm actually preparing a new thread with a 1.8.0 proposal (in terms of
timing and content like improved security/auth flow in REST).

Would it work for you or are you calling for 1.7.2?

Regards
JB

On Tue, Dec 17, 2024 at 1:06 AM Alex Reid  wrote:
>
> Just a heads up, but I believe this PR
> (https://github.com/apache/iceberg/pull/11504) that was added to 1.7.1
> introduced a breaking change for anyone previously supporting ADLSFileIO /
> credentials with sasTokens. I added some details in the PR comments, but will
> also provide those details here as well since the PR is closed.
>
> On initialization of ADLSFileIO, AzureProperties is created and builds a map
> of account -> sasToken when you create ADLSFileIO using adls.sas-token.
> as the credential mechanism.
>
> Prior to this change, the account passed in (which is used to look up the
> sasToken from the map mentioned above) was parsed to include
> .dfs.core.windows.net, so when generating the adls.sas-token. property to pass
> into ADLSFileIO, you needed to include .dfs.core.windows.net as part of the
> adls.sas-token. property name. After this change, we are parsing the
> ADLSLocation account to NOT include .dfs.core.windows.net, so now the sasToken
> lookup from the map doesn't find the sasToken. When someone updates to 1.7.1,
> they will also need to update how they are configuring / building the
> ADLSFileIO properties when using sasTokens.
>
>
> -alex


1.7.1 breaking change related to ADLS support

2024-12-16 Thread Alex Reid
Just a heads up, but I believe this PR
(https://github.com/apache/iceberg/pull/11504) that was added to 1.7.1
introduced a breaking change for anyone previously supporting ADLSFileIO /
credentials with sasTokens. I added some details in the PR comments, but
will also provide those details here as well since the PR is closed.

On initialization of ADLSFileIO, AzureProperties is created and builds a
map of account -> sasToken when you create ADLSFileIO using
adls.sas-token. as the credential mechanism.

Prior to this change, the account passed in (which is used to look up the
sasToken from the map mentioned above) was parsed to include
.dfs.core.windows.net, so when generating the adls.sas-token. property
to pass into ADLSFileIO, you needed to include .dfs.core.windows.net as
part of the adls.sas-token. property name. After this change, we are
parsing the ADLSLocation account to NOT include .dfs.core.windows.net, so
now the sasToken lookup from the map doesn't find the sasToken. When someone
updates to 1.7.1, they will also need to update how they are
configuring / building the ADLSFileIO properties when using sasTokens.


-alex
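
The failure mode described above can be modeled in a few lines of Python. This is a sketch of the behavior as the message describes it, not Iceberg's actual AzureProperties or ADLSLocation code: the helper `sas_tokens_by_account`, the account name `myaccount`, and the token value are all invented for illustration.

```python
SAS_TOKEN_PREFIX = "adls.sas-token."


def sas_tokens_by_account(properties):
    """Hypothetical model: build account -> SAS token from prefixed property keys."""
    return {
        key[len(SAS_TOKEN_PREFIX):]: value
        for key, value in properties.items()
        if key.startswith(SAS_TOKEN_PREFIX)
    }


# Configuration written for 1.7.0: the property name carries the full host name.
properties = {"adls.sas-token.myaccount.dfs.core.windows.net": "example-token"}
tokens = sas_tokens_by_account(properties)

# Account as parsed from the location before the PR (host suffix included).
account_before = "myaccount.dfs.core.windows.net"
# Account as parsed after the PR (host suffix dropped).
account_after = "myaccount"

token_before = tokens.get(account_before)  # the configured token is found
token_after = tokens.get(account_after)  # None: the same configuration now misses
```

Under this model the properties did not change, only the key used for the lookup did, which is why an unchanged 1.7.0 configuration stops resolving its token after the upgrade.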