Thanks, Manu, for starting this discussion. This is definitely a valid
feature. I have always found that expiring snapshots by age makes it harder
to provide different types of guarantees/contracts, especially when tables'
change rates are diverse or irregular. Retaining by snapshot count makes
a lot of sense and prevents table sizes from growing excessively when
changes are frequent.

Thanks,
Walaa.


On Mon, Jan 6, 2025 at 8:38 PM Manu Zhang <owenzhang1...@gmail.com> wrote:

> Hi all,
>
> While maintaining Iceberg tables for our customers, I find it's difficult
> to set a default snapshot expiration time
> (`history.expire.max-snapshot-age-ms`) for different workloads. The default
> value of 5 days looks good for daily batch jobs but is too long for
> frequently-updated jobs.
>
> I'm thinking about adding another option like
> `history.expire.max-snapshots-to-keep` to keep at most N snapshots. A
> snapshot will be removed when either its age is larger than
> `history.expire.max-snapshot-age-ms` or it's the oldest of
> `history.expire.max-snapshots-to-keep` + 1 snapshots. I've created a draft
> PR to demo the idea[1].
>
> If you agree this is a valid feature request, we also need to update
> SnapshotRef[2] adding a new field `max-snapshots-to-keep`. Will there be a
> compatibility issue or too much cost to maintain compatibility? My
> experiment shows many parsers need to be updated.
>
> I'd like to hear your thoughts on this.
>
> 1. https://github.com/apache/iceberg/pull/11879
> 2. https://iceberg.apache.org/spec/#snapshot-references
>
> Happy New Year!
> Manu
>
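
For readers of the thread, the rule proposed in the quoted message can be sketched roughly as follows. This is only an illustration of the "expire if too old OR beyond the newest N" logic, not the code in the draft PR and not Iceberg's API: the `Snapshot` class, the `snapshots_to_expire` function, and its parameter names are hypothetical, and the sketch ignores branches, tags, and the existing `history.expire.min-snapshots-to-keep` setting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical stand-in for a table snapshot, not an Iceberg class.
    snapshot_id: int
    timestamp_ms: int  # commit time in epoch milliseconds


def snapshots_to_expire(snapshots, now_ms, max_age_ms, max_to_keep):
    """Return ids of snapshots the proposed rule would expire, newest first."""
    # Sort newest first so that rank 0..max_to_keep-1 are the ones kept by count.
    newest_first = sorted(snapshots, key=lambda s: s.timestamp_ms, reverse=True)
    expired = []
    for rank, snap in enumerate(newest_first):
        too_old = now_ms - snap.timestamp_ms > max_age_ms  # age-based rule
        over_count = rank >= max_to_keep                   # count-based rule
        if too_old or over_count:
            expired.append(snap.snapshot_id)
    return expired


HOUR_MS = 60 * 60 * 1000
DAY_MS = 24 * HOUR_MS
now = 10 * DAY_MS

# Frequently updated table: six hourly commits, id 6 is the newest.
hourly = [Snapshot(i, now - (7 - i) * HOUR_MS) for i in range(1, 7)]
result = snapshots_to_expire(hourly, now, max_age_ms=5 * DAY_MS, max_to_keep=3)
```

With a 5-day age limit alone, none of the hourly snapshots would expire; adding a count limit of 3 expires the three oldest, which is the behaviour the proposal is after for frequently updated tables.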
