Hi Manu, We already have `retain_last` and `history.expire.min-snapshots-to-keep` to retain the snapshots based on count. Can you please elaborate on why can't we use the same?
- Ajantha On Tue, Jan 7, 2025 at 11:33 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote: > Thanks Manu for starting this discussion. That is definitely a valid > feature. I have always found maintaining snapshots by day makes it harder > to provide different types of guarantees/contracts especially when tables > change rates are diverse or irregular. Maintaining by snapshot count makes > a lot of sense and prevents table sizes from growing excessively when > change rate is frequent. > > Thanks, > Walaa. > > > On Mon, Jan 6, 2025 at 8:38 PM Manu Zhang <owenzhang1...@gmail.com> wrote: > >> Hi all, >> >> While maintaining Iceberg tables for our customers, I find it's difficult >> to set a default snapshot expiration time >> (`history.expire.max-snapshot-age-ms`) for different workloads. The default >> value of 5 days looks good for daily batch jobs but is too long for >> frequently-updated jobs. >> >> I'm thinking about adding another option like >> `history.expire.max-snapshots-to-keep` to keep at most N snapshots. A >> snapshot will be removed when either its age is larger than >> `history.expire.max-snapshot-age-ms` or it's the oldest in >> `history.expire.max-snapshots-to-keep + 1` snapshots. I've created a draft >> PR to demo the idea[1]. >> >> If you agree this is a valid feature request, we also need to update >> SnapshotRef[2] adding a new field `max-snapshots-to-keep`. Will there be a >> compatibility issue or too much cost to maintain compatibility? My >> experiment shows many parsers need to be updated. >> >> I'd like to hear your thoughts on this. >> >> 1. https://github.com/apache/iceberg/pull/11879 >> 2. https://iceberg.apache.org/spec/#snapshot-references >> >> Happy New Year! >> Manu >> >