Hi Szehon,

Thanks for sharing this proposal. We have thought along the same lines and implemented an external system (LakeChime [1]) that retains snapshot and partition metadata for longer (our internal implementation keeps the data for 13 months, but that can be tuned). For efficient analysis, we keep this data in an RDBMS.

My opinion is that this may be a better fit for an external system (similar to LakeChime), since building it into Iceberg could complicate the Iceberg spec, APIs, or their implementations. Also, the type of metadata tracked can differ depending on the use case. For example, while LakeChime retains partition and operation-type metadata, it does not track file-level metadata, as we had no specific use case for that.

[1] https://www.linkedin.com/blog/engineering/data-management/lakechime-a-data-trigger-service-for-modern-data-lakes

Thanks,
Walaa.

On Fri, Jul 5, 2024 at 11:49 PM Szehon Ho <szehon.apa...@gmail.com> wrote:

> Hi folks,
>
> I would like to discuss an idea for an optional extension of Iceberg's
> Snapshot metadata lifecycle. Thanks Piotr for replying on the other thread
> that this should be a fuller Iceberg format change.
>
> *Proposal Summary*
>
> Currently, ExpireSnapshots(long olderThan) purges metadata and deleted
> data of a Snapshot together. Purging deleted data often requires a smaller
> timeline, due to strict requirements to claw back unused disk space,
> fulfill data lifecycle compliance, etc. In many deployments, this means
> 'olderThan' timestamp is set to just a few days before the current time
> (the default is 5 days).
>
> On the other hand, purging metadata could be ideally done on a more
> relaxed timeline, such as months or more, to allow for meaningful
> historical table analysis.
>
> We should have an optional way to purge Snapshot metadata separately from
> purging deleted data. This would allow us to get history of the table, and
> answer questions like:
>
> - When was a file/partition added
> - When was a file/partition deleted
> - How much data was added or removed in time X
>
> that are currently only possible for data operations within a few days.
>
> *Github Proposal*: https://github.com/apache/iceberg/issues/10646
> *Google Design Doc*:
> https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit
>
> Curious if anyone has thought along these lines and/or sees obvious
> issues. Would appreciate any feedback on the proposal.
>
> Thanks
> Szehon