Re: [DISCUSS] Support keeping at most N snapshots

2025-01-17 Thread Manu Zhang
The intention is to set an upper limit for table size while keeping as much
snapshots as possible. Setting max-snapshot-age-ms to a small value will
lose history for some tables while setting min-snapshots-to-keep to a
medium value will keep too much history for others.

Lewis, William 于2025年1月18日 周六03:50写道:

> To clarify, the intention is that if max-snapshots-to-keep is set, then
> snapshots will be expired even if they are younger than
> max-snapshot-age-ms? Does this solve a problem that isn’t solved by setting
> max-snapshot-age-ms to a small value and setting min-snapshots-to-keep to
> your desired value?
>
>
>


Re: [DISCUSS] Support keeping at most N snapshots

2025-01-17 Thread Lewis, William
To clarify, the intention is that if max-snapshots-to-keep is set, then 
snapshots will be expired even if they are younger than max-snapshot-age-ms? 
Does this solve a problem that isn’t solved by setting max-snapshot-age-ms to a 
small value and setting min-snapshots-to-keep to your desired value?




Re: [Discuss][Vote] Spec Change - Add optional field added-rows to Snapshot for Row Lineage

2025-01-17 Thread Fokko Driesprong
+0, as I agree with Amogh, I think it would fit nicely with Honah's work of
formalizing the properties.

Kind regards,
Fokko



Op vr 17 jan 2025 om 08:55 schreef Honah J. :

> +1
>
> Best,
> Honah
>
> On Thu, Jan 16, 2025 at 22:54 Manish Malhotra <
> manish.malhotra.w...@gmail.com> wrote:
>
>> +1, thanks Russel!
>> this will help other engines as well.
>>
>>
>> Thanks,
>> Manish
>>
>> On Thu, Jan 16, 2025 at 3:15 PM Amogh Jahagirdar <2am...@gmail.com>
>> wrote:
>>
>>> I'm +0. I definitely agree with the premise that we need a spec change
>>> to ensure added rows exist at the snapshot level for row lineage, but I
>>> feel like there is an advantage to just formalizing the added-records
>>> snapshot summary property, and make it required for writers in case row
>>> lineage is enabled on the table. The advantage is that in the ecosystem
>>> more implementations are likely to populate the summary already (beyond the
>>> Java implementation, I see Python does as well) so for those
>>> implementations, the lift to support row lineage is a little bit reduced
>>> since the field will probably already be populated. It also avoids any
>>> awkwardness around having 2 of essentially the same field in metadata.
>>>
>>> In the end, I think that is a minor advantage so I'm not very
>>> opinionated on this. We're talking about one field, and the additional lift
>>> for that is some slightly additional parsing handling in implementations
>>> which in the grand scheme of things is a smaller portion of the work
>>> involved. I also understand the argument that it's awkward to have required
>>> fields be in the summary in the first place (thinking back to our
>>> discussions around operation handling).
>>>
>>> Thanks,
>>>
>>> Amogh Jahagirdar
>>>
>>> On Thu, Jan 16, 2025 at 2:52 PM Prashant Singh 
>>> wrote:
>>>
 +1 (non-binding) !

 Best,
 Prashant Singh

 On Thu, Jan 16, 2025 at 1:14 PM Daniel Weeks  wrote:

> +1
>
> On Thu, Jan 16, 2025 at 10:39 AM Steve Zhang
>  wrote:
>
>> Thank you Russell! +1 (non-binding)
>>
>> Thanks,
>> Steve Zhang
>>
>>
>>
>> On Jan 15, 2025, at 10:53 PM, huaxin gao 
>> wrote:
>>
>> +1 (non-binding)
>>
>>
>>


Re: [Discuss][Vote] Spec Change - Add optional field added-rows to Snapshot for Row Lineage

2025-01-17 Thread Russell Spitzer
Thanks everyone, especially Amogh and Fokko for providing other views on
this change. I think your viewpoints are definitely valid although I think
currently we should be working on moving things away from Snapshot Summary
if at all possible. I think it's better if the information there is never
critical for the correct functioning of an Iceberg implementation and I
think in the future it would probably be better if we got operation out of
there as well.

That said I think we have consensus and I'll merge this change after the
last review comments.

On Fri, Jan 17, 2025 at 4:22 AM Jean-Baptiste Onofré 
wrote:

> +1 (non binding)
>
> Regards
> JB
>
> On Wed, Jan 15, 2025 at 5:59 PM Russell Spitzer
>  wrote:
> >
> > Hi Everyone!
> >
> > PR: https://github.com/apache/iceberg/pull/11976/files
> >
> > Split out from #11948
> >
> > Working on the row-lineage implementation made it clear that we needed a
> way to get information from the Snapshot object propagated into the
> Metadata layer. Specifically we need to know the count of all newly added
> rows in that Snapshot to change the last-row-id of the TableMetadata. While
> we can potentially read this from Snapshot Summary, it would be a bit odd
> to have a requirement on key value pair within Snapshot summary. We could
> also potentially re-read and calculate the number of added rows, but this
> would require re-opening the manifest list itself.
> >
> > I believe it makes more sense to formally have added-rows as an optional
> field within the Snapshot itself so we can make it clear in the spec this
> value is expected to be stored within the Snapshot metadata if row-lineage
> is enabled.
> >
> >
> > Please take a look at the PR and signal vote or not you approve of
> adding this additional field to Snapshot
>


Re: [Discuss][Vote] Spec Change - Add optional field added-rows to Snapshot for Row Lineage

2025-01-17 Thread Jean-Baptiste Onofré
+1 (non binding)

Regards
JB

On Wed, Jan 15, 2025 at 5:59 PM Russell Spitzer
 wrote:
>
> Hi Everyone!
>
> PR: https://github.com/apache/iceberg/pull/11976/files
>
> Split out from #11948
>
> Working on the row-lineage implementation made it clear that we needed a way 
> to get information from the Snapshot object propagated into the Metadata 
> layer. Specifically we need to know the count of all newly added rows in that 
> Snapshot to change the last-row-id of the TableMetadata. While we can 
> potentially read this from Snapshot Summary, it would be a bit odd to have a 
> requirement on key value pair within Snapshot summary. We could also 
> potentially re-read and calculate the number of added rows, but this would 
> require re-opening the manifest list itself.
>
> I believe it makes more sense to formally have added-rows as an optional 
> field within the Snapshot itself so we can make it clear in the spec this 
> value is expected to be stored within the Snapshot metadata if row-lineage is 
> enabled.
>
>
> Please take a look at the PR and signal vote or not you approve of adding 
> this additional field to Snapshot