Re: [DISCUSS] Apache Iceberg (java) 1.8.0 release

2025-01-16 Thread Jean-Baptiste Onofré
Hi folks,

Following the Community Meeting yesterday, I would like to propose the
following plan regarding releases:

0. As a prerequisite to any release (1.7.2, 1.8.0, 1.9.0), as said by
Ryan, we have to double check the NOTICE/LICENSE. Interestingly, I
discussed this point with Fokko at the beginning of this week, because
I have some doubts about LICENSE/NOTICE content in the "uber" jar
artifacts where we shade dependencies. I'm doing a complete pass on
all artifacts in 1.7.2-SNAPSHOT and 1.8.0-SNAPSHOT. I should have a
complete analysis by tomorrow. This is potentially a blocker for
release votes.

1. As soon as (0) is done, 1.7.2 can be submitted to vote. I will work
with Fokko on this one.

2. We plan to do 1.8.0 in a couple of weeks (Amogh is the release
manager). Due to still some WIP, we "revisited" the 1.8.0 release
content: for instance, as best effort, we wanted to include REST Auth
Manager improvement (OAuth2 Manager) but we preferred to postpone to
1.9.0. That's totally fine to me, however, I would propose to strongly
focus on pending PRs for 1.9.0. Imho, we should "target" (again as
clear best effort) on variant, partition stats and Auth Manager.

3. Assuming 1.8.0 will be released at the end of Jan/beginning of Feb,
according to our "release cadence", what do you think about planning
1.9.0 in April ? Again with the main targets listed in (2).

I tried to sum up what we discussed yesterday :)
Thoughts ?

Regards
JB

On Thu, Jan 9, 2025 at 7:51 AM Jean-Baptiste Onofré  wrote:
>
> Hi folks,
>
> We did Apache Iceberg 1.7.0 release on Nov 8, 2024. If we want to keep
> our release "pace", 1.8.0 should be released around mid February.
>
> I think we already have a good "train" of merged PRs (or should be
> merged soon): default values, REST auth improvements, dependencies
> updates, etc.
>
> WDYT about 1.8.0 mid Feb ? If so, I propose we update GitHub Issues
> and PRs we would like to "target" to 1.8.0.
>
> Thoughts ?
>
> Regards
> JB


Re: [DISCUSS] Apache Iceberg (java) 1.8.0 release

2025-01-16 Thread Robert Stupp

Hey,

IMHO 1.8 should definitely include the Auth-Manager work, which tackles 
actual bugs in the Iceberg code base wrt OAuth implementation. That work 
was originally intended to go into 1.7 and now it shall be delayed again 
for 1.9. The PR was originally opened in July 2024, about half a year 
ago and is still getting reviewed. In the meantime there were more than 
600 other PRs that got reviewed and merged.


The overall agreement around spring 2024, please correct me if I am 
wrong, was the whole REST/OAuth area needs to be improved, and the oauth 
endpoint removed entirely.


Generally speaking, and I know I'm not alone, getting reviews from 
Iceberg committers is extremely hard. A lot of issues and PRs just get 
closed (by that stale bot) without having gotten _any_ attention from an 
Iceberg committer. This is not a new situation but going on for a long 
time. I have been talking to two Iceberg PMC members in person many 
months ago that this is a very disappointing situation that needs to be 
fixed. The reply was always "we are already working on it" - but at 
least from my personal POV the situation did not improve.


Robert

On 16.01.25 10:56, Jean-Baptiste Onofré wrote:

Hi folks,

Following the Community Meeting yesterday, I would like to propose the
following plan regarding releases:

0. As a prerequisite to any release (1.7.2, 1.8.0, 1.9.0), as said by
Ryan, we have to double check the NOTICE/LICENSE. Interestingly, I
discussed this point with Fokko at the beginning of this week, because
I have some doubts about LICENSE/NOTICE content in the "uber" jar
artifacts where we shade dependencies. I'm doing a complete pass on
all artifacts in 1.7.2-SNAPSHOT and 1.8.0-SNAPSHOT. I should have a
complete analysis by tomorrow. This is potentially a blocker for
release votes.

1. As soon as (0) is done, 1.7.2 can be submitted to vote. I will work
with Fokko on this one.

2. We plan to do 1.8.0 in a couple of weeks (Amogh is the release
manager). Due to still some WIP, we "revisited" the 1.8.0 release
content: for instance, as best effort, we wanted to include REST Auth
Manager improvement (OAuth2 Manager) but we preferred to postpone to
1.9.0. That's totally fine to me, however, I would propose to strongly
focus on pending PRs for 1.9.0. Imho, we should "target" (again as
clear best effort) on variant, partition stats and Auth Manager.

3. Assuming 1.8.0 will be released at the end of Jan/beginning of Feb,
according to our "release cadence", what do you think about planning
1.9.0 in April ? Again with the main targets listed in (2).

I tried to sum up what we discussed yesterday :)
Thoughts ?

Regards
JB

On Thu, Jan 9, 2025 at 7:51 AM Jean-Baptiste Onofré  wrote:

Hi folks,

We did Apache Iceberg 1.7.0 release on Nov 8, 2024. If we want to keep
our release "pace", 1.8.0 should be released around mid February.

I think we already have a good "train" of merged PRs (or should be
merged soon): default values, REST auth improvements, dependencies
updates, etc.

WDYT about 1.8.0 mid Feb ? If so, I propose we update GitHub Issues
and PRs we would like to "target" to 1.8.0.

Thoughts ?

Regards
JB


--
Robert Stupp
@snazy



Re: [Discuss][Vote] Spec Change - Add optional field added-rows to Snapshot for Row Lineage

2025-01-16 Thread Eduard Tudenhöfner
+1

On Thu, Jan 16, 2025 at 7:53 AM huaxin gao  wrote:

> +1 (non-binding)
>
> On Wed, Jan 15, 2025 at 10:51 PM Gang Wu  wrote:
>
>> +1 (non-binding)
>>
>> On Thu, Jan 16, 2025 at 2:30 PM Péter Váry 
>> wrote:
>>
>>> +1
>>>
>>> Steven Wu  ezt írta (időpont: 2025. jan. 16., Cs,
>>> 0:46):
>>>
 +1

 On Wed, Jan 15, 2025 at 9:00 AM Russell Spitzer <
 russell.spit...@gmail.com> wrote:

> Hi Everyone!
>
> PR: https://github.com/apache/iceberg/pull/11976/files
>
> Split out from #11948 
>
> Working on the row-lineage implementation made it clear that we needed
> a way to get information from the Snapshot object propagated into the
> Metadata layer. Specifically we need to know the count of all newly added
> rows in that Snapshot to change the last-row-id of the TableMetadata.
> While we can potentially read this from Snapshot Summary, it would be a 
> bit
> odd to have a requirement on key value pair within Snapshot summary. We
> could also potentially re-read and calculate the number of added rows, but
> this would require re-opening the manifest list itself.
>
> I believe it makes more sense to formally have added-rows as an
> optional field within the Snapshot itself so we can make it clear in the
> spec this value is expected to be stored within the Snapshot metadata if
> row-lineage is enabled.
>
>
> Please take a look at the PR and signal vote or not you approve of
> adding this additional field to Snapshot
>



Re: [DISCUSS] Apache Iceberg (java) 1.8.0 release

2025-01-16 Thread Daniel Weeks
Robert,

I hear your frustration with the progress on the Auth Manager work, but I
believe everyone recognizes that this was a large refactor further
complicated by the need to preserve backward compatibility and handling
deprecations appropriately.  This work has gone through many iterations as
we explored how to make the changes cleanly.  Eventually the scale of the
change led to breaking up the PR for closer review, which I believe was the
right decision because we identified multiple issues after taking that
step.  That may have slowed down progress, but a lot of hours have gone
into discussing, reviewing, and validating the work in this area.

As a project, we have leaned away from gating releases on specific features
because it leads to slower release cycles and prevents other features that
are ready from going out.  We also don't want to rush features just to hit
a release target, but rather release more frequently to make changes
available as they are ready.

At this point, I believe the plan is to follow up soon with a 1.9 release.

-Dan

On Thu, Jan 16, 2025 at 2:36 AM Robert Stupp  wrote:

> Hey,
>
> IMHO 1.8 should definitely include the Auth-Manager work, which tackles
> actual bugs in the Iceberg code base wrt OAuth implementation. That work
> was originally intended to go into 1.7 and now it shall be delayed again
> for 1.9. The PR was originally opened in July 2024, about half a year
> ago and is still getting reviewed. In the meantime there were more than
> 600 other PRs that got reviewed and merged.
>
> The overall agreement around spring 2024, please correct me if I am
> wrong, was the whole REST/OAuth area needs to be improved, and the oauth
> endpoint removed entirely.
>
> Generally speaking, and I know I'm not alone, getting reviews from
> Iceberg committers is extremely hard. A lot of issues and PRs just get
> closed (by that stale bot) without having gotten _any_ attention from an
> Iceberg committer. This is not a new situation but going on for a long
> time. I have been talking to two Iceberg PMC members in person many
> months ago that this is a very disappointing situation that needs to be
> fixed. The reply was always "we are already working on it" - but at
> least from my personal POV the situation did not improve.
>
> Robert
>
> On 16.01.25 10:56, Jean-Baptiste Onofré wrote:
> > Hi folks,
> >
> > Following the Community Meeting yesterday, I would like to propose the
> > following plan regarding releases:
> >
> > 0. As a prerequisite to any release (1.7.2, 1.8.0, 1.9.0), as said by
> > Ryan, we have to double check the NOTICE/LICENSE. Interestingly, I
> > discussed this point with Fokko at the beginning of this week, because
> > I have some doubts about LICENSE/NOTICE content in the "uber" jar
> > artifacts where we shade dependencies. I'm doing a complete pass on
> > all artifacts in 1.7.2-SNAPSHOT and 1.8.0-SNAPSHOT. I should have a
> > complete analysis by tomorrow. This is potentially a blocker for
> > release votes.
> >
> > 1. As soon as (0) is done, 1.7.2 can be submitted to vote. I will work
> > with Fokko on this one.
> >
> > 2. We plan to do 1.8.0 in a couple of weeks (Amogh is the release
> > manager). Due to still some WIP, we "revisited" the 1.8.0 release
> > content: for instance, as best effort, we wanted to include REST Auth
> > Manager improvement (OAuth2 Manager) but we preferred to postpone to
> > 1.9.0. That's totally fine to me, however, I would propose to strongly
> > focus on pending PRs for 1.9.0. Imho, we should "target" (again as
> > clear best effort) on variant, partition stats and Auth Manager.
> >
> > 3. Assuming 1.8.0 will be released at the end of Jan/beginning of Feb,
> > according to our "release cadence", what do you think about planning
> > 1.9.0 in April ? Again with the main targets listed in (2).
> >
> > I tried to sum up what we discussed yesterday :)
> > Thoughts ?
> >
> > Regards
> > JB
> >
> > On Thu, Jan 9, 2025 at 7:51 AM Jean-Baptiste Onofré 
> wrote:
> >> Hi folks,
> >>
> >> We did Apache Iceberg 1.7.0 release on Nov 8, 2024. If we want to keep
> >> our release "pace", 1.8.0 should be released around mid February.
> >>
> >> I think we already have a good "train" of merged PRs (or should be
> >> merged soon): default values, REST auth improvements, dependencies
> >> updates, etc.
> >>
> >> WDYT about 1.8.0 mid Feb ? If so, I propose we update GitHub Issues
> >> and PRs we would like to "target" to 1.8.0.
> >>
> >> Thoughts ?
> >>
> >> Regards
> >> JB
>
> --
> Robert Stupp
> @snazy
>
>


Re: [DISCUSS] Apache Iceberg 1.7.2 release

2025-01-16 Thread Fokko Driesprong
Hey JB,

Yufei approved the PR, and Amogh just merged it. I took the liberty of
creating a backport: #11983 .

Kind regards,
Fokko

Op wo 15 jan 2025 om 06:05 schreef Jean-Baptiste Onofré :

> Hi Yuffei
>
> That makes sense to me. Do we have an ETA for this issue ?
> Are you working on a fix ? Do you need my help on this ?
>
> Thanks !
> Regards
> JB
>
> On Tue, Jan 14, 2025 at 6:16 PM Yufei Gu  wrote:
> >
> > Hi folks,
> >
> > We are working on a bug fix,
> https://github.com/apache/iceberg/issues/11922. It'd be nice to include
> it in 1.7.2.
> >
> > Yufei
> >
> >
> > On Tue, Jan 14, 2025 at 2:00 AM Jean-Baptiste Onofré 
> wrote:
> >>
> >> Hi Fokko
> >>
> >> Thanks for the update. I will do a quick pass on GH issues and I will
> >> run the release (I will ping you on Slack).
> >>
> >> Regards
> >> JB
> >>
> >> On Tue, Jan 14, 2025 at 9:07 AM Fokko Driesprong 
> wrote:
> >> >
> >> > Good morning everyone,
> >> >
> >> > Thanks for the replies so far. As I'm catching up from the holiday, I
> was looking at #11895 to possibly include in 1.7.2, but Russell suggested a
> more fundamental rework which is better in the long run. Let's defer that
> to a later version. JB, if you can run the release, that would be great.
> I'm happy to help where needed.
> >> >
> >> > Kind regards,
> >> > Fokko
> >> >
> >> >
> >> >
> >> >
> >> > Op ma 13 jan 2025 om 08:32 schreef Jean-Baptiste Onofré <
> j...@nanthrax.net>:
> >> >>
> >> >> Hi Fokko
> >> >>
> >> >> It sounds good ! Thanks for the "reminder" :)
> >> >>
> >> >> I'm also fine either way for the release manager, I can tackle it
> >> >> (with you) or you do, everything is fine for me :)
> >> >>
> >> >> Thanks !
> >> >> Regards
> >> >> JB
> >> >>
> >> >> On Mon, Jan 13, 2025 at 7:24 AM Fokko Driesprong 
> wrote:
> >> >> >
> >> >> > Hi everyone,
> >> >> >
> >> >> > Over the weekend the last item from the 1.7.2 milestone has been
> resolved. I took the liberty of cherry-picking the commits to the 1.7.x
> branch. JB volunteered to run the release here, but I'm also happy to run
> this one if he's busy with 1.8.0. I'm fine either way :) The question to
> the community is, is anything missing, or can we start cutting the RC0 for
> 1.7.2?
> >> >> >
> >> >> > Kind regards,
> >> >> > Fokko
>


Re: [Discuss][Vote] Spec Change - Add optional field added-rows to Snapshot for Row Lineage

2025-01-16 Thread Steve Zhang
Thank you Russell! +1 (non-binding)

Thanks,
Steve Zhang



> On Jan 15, 2025, at 10:53 PM, huaxin gao  wrote:
> 
> +1 (non-binding)



Re: [DISCUSS] Support keeping at most N snapshots

2025-01-16 Thread Manu Zhang
Hi all,

Do you have more comments on this feature? Do you have concerns about
adding a new field to SnapshotRef?

Thanks,
Manu

On Tue, Jan 7, 2025 at 2:37 PM Manu Zhang  wrote:

> Hi Ajantha,
>
> `history.expire.min-snapshots-to-keep` is the *minimum number of
> snapshots* we can keep. I'm proposing to decide the *maximum number of
> snapshots* to keep by count rather than by age.
>
> Thanks,
> Manu
>
> On Tue, Jan 7, 2025 at 2:18 PM Ajantha Bhat  wrote:
>
>> Hi Manu,
>>
>> We already have `retain_last` and `history.expire.min-snapshots-to-keep`
>> to retain the snapshots based on count. Can you please elaborate on why
>> can't we use the same?
>>
>> - Ajantha
>>
>> On Tue, Jan 7, 2025 at 11:33 AM Walaa Eldin Moustafa <
>> wa.moust...@gmail.com> wrote:
>>
>>> Thanks Manu for starting this discussion. That is definitely a valid
>>> feature. I have always found maintaining snapshots by day makes it harder
>>> to provide different types of guarantees/contracts especially when tables
>>> change rates are diverse or irregular. Maintaining by snapshot count makes
>>> a lot of sense and prevents table sizes from growing excessively when
>>> change rate is frequent.
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>> On Mon, Jan 6, 2025 at 8:38 PM Manu Zhang 
>>> wrote:
>>>
 Hi all,

 While maintaining Iceberg tables for our customers, I find it's
 difficult to set a default snapshot expiration time
 (`history.expire.max-snapshot-age-ms`) for different workloads. The default
 value of 5 days looks good for daily batch jobs but is too long for
 frequently-updated jobs.

 I'm thinking about adding another option like
 `history.expire.max-snapshots-to-keep` to keep at most N snapshots. A
 snapshot will be removed when either its age is larger than
 `history.expire.max-snapshot-age-ms` or it's the oldest in
 `history.expire.max-snapshots-to-keep + 1` snapshots. I've created a draft
 PR to demo the idea[1].

 If you agree this is a valid feature request, we also need to update
 SnapshotRef[2] adding a new field `max-snapshots-to-keep`. Will there be a
 compatibility issue or too much cost to maintain compatibility? My
 experiment shows many parsers need to be updated.

 I'd like to hear your thoughts on this.

 1. https://github.com/apache/iceberg/pull/11879
 2. https://iceberg.apache.org/spec/#snapshot-references

 Happy New Year!
 Manu

>>>


Re: [DISCUSS] Apache Iceberg 1.7.2 release

2025-01-16 Thread Jean-Baptiste Onofré
Awesome ! Thanks Fokko !

Le jeu. 16 janv. 2025 à 19:48, Fokko Driesprong  a écrit :

> Hey JB,
>
> Yufei approved the PR, and Amogh just merged it. I took the liberty of
> creating a backport: #11983 
> .
>
> Kind regards,
> Fokko
>
> Op wo 15 jan 2025 om 06:05 schreef Jean-Baptiste Onofré :
>
>> Hi Yuffei
>>
>> That makes sense to me. Do we have an ETA for this issue ?
>> Are you working on a fix ? Do you need my help on this ?
>>
>> Thanks !
>> Regards
>> JB
>>
>> On Tue, Jan 14, 2025 at 6:16 PM Yufei Gu  wrote:
>> >
>> > Hi folks,
>> >
>> > We are working on a bug fix,
>> https://github.com/apache/iceberg/issues/11922. It'd be nice to include
>> it in 1.7.2.
>> >
>> > Yufei
>> >
>> >
>> > On Tue, Jan 14, 2025 at 2:00 AM Jean-Baptiste Onofré 
>> wrote:
>> >>
>> >> Hi Fokko
>> >>
>> >> Thanks for the update. I will do a quick pass on GH issues and I will
>> >> run the release (I will ping you on Slack).
>> >>
>> >> Regards
>> >> JB
>> >>
>> >> On Tue, Jan 14, 2025 at 9:07 AM Fokko Driesprong 
>> wrote:
>> >> >
>> >> > Good morning everyone,
>> >> >
>> >> > Thanks for the replies so far. As I'm catching up from the holiday,
>> I was looking at #11895 to possibly include in 1.7.2, but Russell suggested
>> a more fundamental rework which is better in the long run. Let's defer that
>> to a later version. JB, if you can run the release, that would be great.
>> I'm happy to help where needed.
>> >> >
>> >> > Kind regards,
>> >> > Fokko
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > Op ma 13 jan 2025 om 08:32 schreef Jean-Baptiste Onofré <
>> j...@nanthrax.net>:
>> >> >>
>> >> >> Hi Fokko
>> >> >>
>> >> >> It sounds good ! Thanks for the "reminder" :)
>> >> >>
>> >> >> I'm also fine either way for the release manager, I can tackle it
>> >> >> (with you) or you do, everything is fine for me :)
>> >> >>
>> >> >> Thanks !
>> >> >> Regards
>> >> >> JB
>> >> >>
>> >> >> On Mon, Jan 13, 2025 at 7:24 AM Fokko Driesprong 
>> wrote:
>> >> >> >
>> >> >> > Hi everyone,
>> >> >> >
>> >> >> > Over the weekend the last item from the 1.7.2 milestone has been
>> resolved. I took the liberty of cherry-picking the commits to the 1.7.x
>> branch. JB volunteered to run the release here, but I'm also happy to run
>> this one if he's busy with 1.8.0. I'm fine either way :) The question to
>> the community is, is anything missing, or can we start cutting the RC0 for
>> 1.7.2?
>> >> >> >
>> >> >> > Kind regards,
>> >> >> > Fokko
>>
>


Re: [Discuss][Vote] Spec Change - Add optional field added-rows to Snapshot for Row Lineage

2025-01-16 Thread Manish Malhotra
+1, thanks Russel!
this will help other engines as well.


Thanks,
Manish

On Thu, Jan 16, 2025 at 3:15 PM Amogh Jahagirdar <2am...@gmail.com> wrote:

> I'm +0. I definitely agree with the premise that we need a spec change to
> ensure added rows exist at the snapshot level for row lineage, but I feel
> like there is an advantage to just formalizing the added-records snapshot
> summary property, and make it required for writers in case row lineage is
> enabled on the table. The advantage is that in the ecosystem more
> implementations are likely to populate the summary already (beyond the Java
> implementation, I see Python does as well) so for those implementations,
> the lift to support row lineage is a little bit reduced since the field
> will probably already be populated. It also avoids any awkwardness around
> having 2 of essentially the same field in metadata.
>
> In the end, I think that is a minor advantage so I'm not very opinionated
> on this. We're talking about one field, and the additional lift for that is
> some slightly additional parsing handling in implementations which in the
> grand scheme of things is a smaller portion of the work involved. I also
> understand the argument that it's awkward to have required fields be in the
> summary in the first place (thinking back to our discussions around
> operation handling).
>
> Thanks,
>
> Amogh Jahagirdar
>
> On Thu, Jan 16, 2025 at 2:52 PM Prashant Singh 
> wrote:
>
>> +1 (non-binding) !
>>
>> Best,
>> Prashant Singh
>>
>> On Thu, Jan 16, 2025 at 1:14 PM Daniel Weeks  wrote:
>>
>>> +1
>>>
>>> On Thu, Jan 16, 2025 at 10:39 AM Steve Zhang
>>>  wrote:
>>>
 Thank you Russell! +1 (non-binding)

 Thanks,
 Steve Zhang



 On Jan 15, 2025, at 10:53 PM, huaxin gao 
 wrote:

 +1 (non-binding)





Re: [Discuss][Vote] Spec Change - Add optional field added-rows to Snapshot for Row Lineage

2025-01-16 Thread Honah J.
+1

Best,
Honah

On Thu, Jan 16, 2025 at 22:54 Manish Malhotra <
manish.malhotra.w...@gmail.com> wrote:

> +1, thanks Russel!
> this will help other engines as well.
>
>
> Thanks,
> Manish
>
> On Thu, Jan 16, 2025 at 3:15 PM Amogh Jahagirdar <2am...@gmail.com> wrote:
>
>> I'm +0. I definitely agree with the premise that we need a spec change to
>> ensure added rows exist at the snapshot level for row lineage, but I feel
>> like there is an advantage to just formalizing the added-records snapshot
>> summary property, and make it required for writers in case row lineage is
>> enabled on the table. The advantage is that in the ecosystem more
>> implementations are likely to populate the summary already (beyond the Java
>> implementation, I see Python does as well) so for those implementations,
>> the lift to support row lineage is a little bit reduced since the field
>> will probably already be populated. It also avoids any awkwardness around
>> having 2 of essentially the same field in metadata.
>>
>> In the end, I think that is a minor advantage so I'm not very opinionated
>> on this. We're talking about one field, and the additional lift for that is
>> some slightly additional parsing handling in implementations which in the
>> grand scheme of things is a smaller portion of the work involved. I also
>> understand the argument that it's awkward to have required fields be in the
>> summary in the first place (thinking back to our discussions around
>> operation handling).
>>
>> Thanks,
>>
>> Amogh Jahagirdar
>>
>> On Thu, Jan 16, 2025 at 2:52 PM Prashant Singh 
>> wrote:
>>
>>> +1 (non-binding) !
>>>
>>> Best,
>>> Prashant Singh
>>>
>>> On Thu, Jan 16, 2025 at 1:14 PM Daniel Weeks  wrote:
>>>
 +1

 On Thu, Jan 16, 2025 at 10:39 AM Steve Zhang
  wrote:

> Thank you Russell! +1 (non-binding)
>
> Thanks,
> Steve Zhang
>
>
>
> On Jan 15, 2025, at 10:53 PM, huaxin gao 
> wrote:
>
> +1 (non-binding)
>
>
>


Re: [VOTE] Document Snapshot Summary Optional Fields as Appendix in Spec

2025-01-16 Thread rdb...@gmail.com
+1

Thanks, Honah!

On Thu, Jan 16, 2025 at 3:38 PM Honah J.  wrote:

> Hi everyone,
>
> Thank you for your votes and valuable suggestions. I have updated the PR
> to remove the statement, "Metrics must be accurate if written," and have
> relocated the relevant documentation to Appendix F - Implementation Notes.
>
> Updated PR: https://github.com/apache/iceberg/pull/11660
>
> Given the recent restructuring and additional reviews/modifications in the
> PR, I would like to cancel the current vote and initiate a new one later.
> This will ensure that all votes are based on the latest version of the spec
> change.
>
> Best regards,
> Honah
>
> On Wed, Jan 15, 2025 at 10:16 AM Daniel Weeks  wrote:
>
>> I don't think can include the statement: "Metrics must be accurate if
>> written"
>>
>> Equality deletes make this requirement very difficult to satisfy for some
>> of the fields.
>>
>> The reason I suggested appendix was that we shouldn't be adding new
>> requirements, just documenting field names for consistency across
>> implementations.
>>
>> -Dan
>>
>> On Wed, Jan 15, 2025 at 8:07 AM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> @Daniel Weeks what do you think?
>>>
>>> I know both you and I had the opposite feeling here.
>>>
>>> On Tue, Jan 14, 2025 at 6:21 PM rdb...@gmail.com 
>>> wrote:
>>>
 The content looks correct to me, but because this states a requirement
 ("Metrics must be accurate if written") I would rather move this content
 into the section on the snapshot summary instead of an appendix.

 On Tue, Jan 14, 2025 at 1:30 PM huaxin gao 
 wrote:

> +1 non-binding
>
> On Tue, Jan 14, 2025 at 1:21 PM Steve Zhang
>  wrote:
>
>> +1 non-binding
>>
>> Thanks,
>> Steve Zhang
>>
>>
>>
>> On Jan 14, 2025, at 1:14 PM, Kevin Liu  wrote:
>>
>> +1 non-binding.
>>
>>
>>


Re: [DISCUSS] Support keeping at most N snapshots

2025-01-16 Thread Yufei Gu
It makes sense to have an option to control the max number of snapshots.
Thanks Manu for the proposal.

Yufei


On Thu, Jan 16, 2025 at 7:46 PM Manu Zhang  wrote:

> Hi all,
>
> Do you have more comments on this feature? Do you have concerns about
> adding a new field to SnapshotRef?
>
> Thanks,
> Manu
>
> On Tue, Jan 7, 2025 at 2:37 PM Manu Zhang  wrote:
>
>> Hi Ajantha,
>>
>> `history.expire.min-snapshots-to-keep` is the *minimum number of
>> snapshots* we can keep. I'm proposing to decide the *maximum number of
>> snapshots* to keep by count rather than by age.
>>
>> Thanks,
>> Manu
>>
>> On Tue, Jan 7, 2025 at 2:18 PM Ajantha Bhat 
>> wrote:
>>
>>> Hi Manu,
>>>
>>> We already have `retain_last` and `history.expire.min-snapshots-to-keep`
>>> to retain the snapshots based on count. Can you please elaborate on why
>>> can't we use the same?
>>>
>>> - Ajantha
>>>
>>> On Tue, Jan 7, 2025 at 11:33 AM Walaa Eldin Moustafa <
>>> wa.moust...@gmail.com> wrote:
>>>
 Thanks Manu for starting this discussion. That is definitely a valid
 feature. I have always found maintaining snapshots by day makes it harder
 to provide different types of guarantees/contracts especially when tables
 change rates are diverse or irregular. Maintaining by snapshot count makes
 a lot of sense and prevents table sizes from growing excessively when
 change rate is frequent.

 Thanks,
 Walaa.


 On Mon, Jan 6, 2025 at 8:38 PM Manu Zhang 
 wrote:

> Hi all,
>
> While maintaining Iceberg tables for our customers, I find it's
> difficult to set a default snapshot expiration time
> (`history.expire.max-snapshot-age-ms`) for different workloads. The 
> default
> value of 5 days looks good for daily batch jobs but is too long for
> frequently-updated jobs.
>
> I'm thinking about adding another option like
> `history.expire.max-snapshots-to-keep` to keep at most N snapshots. A
> snapshot will be removed when either its age is larger than
> `history.expire.max-snapshot-age-ms` or it's the oldest in
> `history.expire.max-snapshots-to-keep + 1` snapshots. I've created a draft
> PR to demo the idea[1].
>
> If you agree this is a valid feature request, we also need to update
> SnapshotRef[2] adding a new field `max-snapshots-to-keep`. Will there be a
> compatibility issue or too much cost to maintain compatibility? My
> experiment shows many parsers need to be updated.
>
> I'd like to hear your thoughts on this.
>
> 1. https://github.com/apache/iceberg/pull/11879
> 2. https://iceberg.apache.org/spec/#snapshot-references
>
> Happy New Year!
> Manu
>



Re: [Discuss][Vote] Spec Change - Add optional field added-rows to Snapshot for Row Lineage

2025-01-16 Thread Prashant Singh
+1 (non-binding) !

Best,
Prashant Singh

On Thu, Jan 16, 2025 at 1:14 PM Daniel Weeks  wrote:

> +1
>
> On Thu, Jan 16, 2025 at 10:39 AM Steve Zhang
>  wrote:
>
>> Thank you Russell! +1 (non-binding)
>>
>> Thanks,
>> Steve Zhang
>>
>>
>>
>> On Jan 15, 2025, at 10:53 PM, huaxin gao  wrote:
>>
>> +1 (non-binding)
>>
>>
>>


Re: [Discuss][Vote] Spec Change - Add optional field added-rows to Snapshot for Row Lineage

2025-01-16 Thread Daniel Weeks
+1

On Thu, Jan 16, 2025 at 10:39 AM Steve Zhang
 wrote:

> Thank you Russell! +1 (non-binding)
>
> Thanks,
> Steve Zhang
>
>
>
> On Jan 15, 2025, at 10:53 PM, huaxin gao  wrote:
>
> +1 (non-binding)
>
>
>


Re: [Discuss][Vote] Spec Change - Add optional field added-rows to Snapshot for Row Lineage

2025-01-16 Thread Amogh Jahagirdar
I'm +0. I definitely agree with the premise that we need a spec change to
ensure added rows exist at the snapshot level for row lineage, but I feel
like there is an advantage to just formalizing the added-records snapshot
summary property, and make it required for writers in case row lineage is
enabled on the table. The advantage is that in the ecosystem more
implementations are likely to populate the summary already (beyond the Java
implementation, I see Python does as well) so for those implementations,
the lift to support row lineage is a little bit reduced since the field
will probably already be populated. It also avoids any awkwardness around
having 2 of essentially the same field in metadata.

In the end, I think that is a minor advantage so I'm not very opinionated
on this. We're talking about one field, and the additional lift for that is
some slightly additional parsing handling in implementations which in the
grand scheme of things is a smaller portion of the work involved. I also
understand the argument that it's awkward to have required fields be in the
summary in the first place (thinking back to our discussions around
operation handling).

Thanks,

Amogh Jahagirdar

On Thu, Jan 16, 2025 at 2:52 PM Prashant Singh 
wrote:

> +1 (non-binding) !
>
> Best,
> Prashant Singh
>
> On Thu, Jan 16, 2025 at 1:14 PM Daniel Weeks  wrote:
>
>> +1
>>
>> On Thu, Jan 16, 2025 at 10:39 AM Steve Zhang
>>  wrote:
>>
>>> Thank you Russell! +1 (non-binding)
>>>
>>> Thanks,
>>> Steve Zhang
>>>
>>>
>>>
>>> On Jan 15, 2025, at 10:53 PM, huaxin gao  wrote:
>>>
>>> +1 (non-binding)
>>>
>>>
>>>


Re: [VOTE] Document Snapshot Summary Optional Fields as Appendix in Spec

2025-01-16 Thread Honah J.
Hi everyone,

Thank you for your votes and valuable suggestions. I have updated the PR to
remove the statement, "Metrics must be accurate if written," and have
relocated the relevant documentation to Appendix F - Implementation Notes.

Updated PR: https://github.com/apache/iceberg/pull/11660

Given the recent restructuring and additional reviews/modifications in the
PR, I would like to cancel the current vote and initiate a new one later.
This will ensure that all votes are based on the latest version of the spec
change.

Best regards,
Honah

On Wed, Jan 15, 2025 at 10:16 AM Daniel Weeks  wrote:

> I don't think can include the statement: "Metrics must be accurate if
> written"
>
> Equality deletes make this requirement very difficult to satisfy for some
> of the fields.
>
> The reason I suggested appendix was that we shouldn't be adding new
> requirements, just documenting field names for consistency across
> implementations.
>
> -Dan
>
> On Wed, Jan 15, 2025 at 8:07 AM Russell Spitzer 
> wrote:
>
>> @Daniel Weeks what do you think?
>>
>> I know both you and I had the opposite feeling here.
>>
>> On Tue, Jan 14, 2025 at 6:21 PM rdb...@gmail.com 
>> wrote:
>>
>>> The content looks correct to me, but because this states a requirement
>>> ("Metrics must be accurate if written") I would rather move this content
>>> into the section on the snapshot summary instead of an appendix.
>>>
>>> On Tue, Jan 14, 2025 at 1:30 PM huaxin gao 
>>> wrote:
>>>
 +1 non-binding

 On Tue, Jan 14, 2025 at 1:21 PM Steve Zhang
  wrote:

> +1 non-binding
>
> Thanks,
> Steve Zhang
>
>
>
> On Jan 14, 2025, at 1:14 PM, Kevin Liu  wrote:
>
> +1 non-binding.
>
>
>