Hi Szehon, Walaa,

Thanks Szehon for bringing this up, and thank you Walaa for providing more context from a similar existing solution to the problem.

The choices that LakeChime seems to have made (keeping the information in a separate RDBMS, and which particular metadata to retain) do indeed look use-case specific, at least until we observe repeating patterns.

The idea to bake lifecycle changes into the table format spec was proposed as an alternative to baking them into the REST catalog spec. It was brought into the discussion based on the intuition that the REST catalog is a first-class citizen in the Iceberg world, just like other catalogs, and so solutions to table-centric problems do not need to be limited to the REST catalog. What information we retain, and how or whether this is configurable, are open questions that apply to both avenues.

As a third alternative, we could focus on REST catalog *extensions*, without naming snapshot metadata lifecycle, and leave the problem up to the REST catalog's implementors. That would mean the Iceberg project doesn't address the topic of snapshot metadata lifecycle changes directly, but instead gives users tools to build solutions around it. At this point I am not trying to judge whether it's a good idea or not. It probably depends on how important it is to solve the problem and to have a common solution.

Best,
Piotr

On Sat, 6 Jul 2024 at 09:46, Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:

> Hi Szehon,
>
> Thanks for sharing this proposal. We have thought along the same lines and
> implemented an external system (LakeChime [1]) that retains snapshot +
> partition metadata for longer (actual internal implementation keeps data
> for 13 months, but that can be tuned). For efficient analysis, we have kept
> this data in an RDBMS. My opinion is this may be a better fit to an
> external system (similar to LakeChime) since it could potentially
> complicate the Iceberg spec, APIs, or their implementations. Also, the type
> of metadata tracked can differ depending on the use case. For example,
> while LakeChime retains partition and operation type metadata, it does not
> track file-level metadata as there was no specific use case for that.
>
> [1]
> https://www.linkedin.com/blog/engineering/data-management/lakechime-a-data-trigger-service-for-modern-data-lakes
>
> Thanks,
> Walaa.
>
> On Fri, Jul 5, 2024 at 11:49 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>
>> Hi folks,
>>
>> I would like to discuss an idea for an optional extension of Iceberg's
>> Snapshot metadata lifecycle. Thanks Piotr for replying on the other thread
>> that this should be a fuller Iceberg format change.
>>
>> *Proposal Summary*
>>
>> Currently, ExpireSnapshots(long olderThan) purges metadata and deleted
>> data of a Snapshot together. Purging deleted data often requires a smaller
>> timeline, due to strict requirements to claw back unused disk space,
>> fulfill data lifecycle compliance, etc. In many deployments, this means
>> 'olderThan' timestamp is set to just a few days before the current time
>> (the default is 5 days).
>>
>> On the other hand, purging metadata could be ideally done on a more
>> relaxed timeline, such as months or more, to allow for meaningful
>> historical table analysis.
>>
>> We should have an optional way to purge Snapshot metadata separately from
>> purging deleted data. This would allow us to get history of the table, and
>> answer questions like:
>>
>> - When was a file/partition added
>> - When was a file/partition deleted
>> - How much data was added or removed in time X
>>
>> that are currently only possible for data operations within a few days.
>>
>> *Github Proposal*: https://github.com/apache/iceberg/issues/10646
>> *Google Design Doc*:
>> https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit
>>
>> Curious if anyone has thought along these lines and/or sees obvious
>> issues. Would appreciate any feedback on the proposal.
>>
>> Thanks
>> Szehon
>>
>