I agree with the need this proposal addresses: decoupling snapshot history from data deletion. I do wonder, though, whether keeping expired snapshots as-is will slow down manifest/scan planning (REST catalog approaches could probably mitigate this).
On Mon, Jul 8, 2024, 5:34 AM Piotr Findeisen <piotr.findei...@gmail.com> wrote:

> Hi Szehon, Walaa
>
> Thank you, Szehon, for bringing this up. And thank you, Walaa, for providing
> more context from a similar existing solution to the problem.
> The choices that LakeChime seems to have made -- to keep information in a
> separate RDBMS and which particular metadata information to retain -- do
> indeed look use-case specific, until we observe repeating patterns.
> The idea to bake lifecycle changes into the table format spec was proposed
> as an alternative to the idea to bake lifecycle changes into the REST
> catalog spec. It was brought into discussion based on the intuition that the
> REST catalog is a first-class citizen in the Iceberg world, just like other
> catalogs, and so solutions to table-centric problems do not need to be
> limited to the REST catalog. What information we retain, and how/whether
> this is configurable, are open questions applicable to both avenues.
>
> As a 3rd/another alternative, we could focus on REST catalog *extensions*,
> without naming snapshot metadata lifecycle, and leave the problem up to
> REST's implementors. That would mean the Iceberg project doesn't address
> the snapshot metadata lifecycle changes topic directly, but instead gives
> users tools to build solutions around it. At this point I am not trying to
> judge whether it's a good idea or not. It probably depends on how important
> it is to solve the problem and have a common solution.
>
> Best,
> Piotr
>
> On Sat, 6 Jul 2024 at 09:46, Walaa Eldin Moustafa <wa.moust...@gmail.com>
> wrote:
>
>> Hi Szehon,
>>
>> Thanks for sharing this proposal. We have thought along the same lines
>> and implemented an external system (LakeChime [1]) that retains snapshot +
>> partition metadata for longer (actual internal implementation keeps data
>> for 13 months, but that can be tuned). For efficient analysis, we have kept
>> this data in an RDBMS.
>> My opinion is this may be a better fit for an
>> external system (similar to LakeChime) since it could potentially
>> complicate the Iceberg spec, APIs, or their implementations. Also, the type
>> of metadata tracked can differ depending on the use case. For example,
>> while LakeChime retains partition and operation type metadata, it does not
>> track file-level metadata as there was no specific use case for that.
>>
>> [1]
>> https://www.linkedin.com/blog/engineering/data-management/lakechime-a-data-trigger-service-for-modern-data-lakes
>>
>> Thanks,
>> Walaa.
>>
>> On Fri, Jul 5, 2024 at 11:49 PM Szehon Ho <szehon.apa...@gmail.com>
>> wrote:
>>
>>> Hi folks,
>>>
>>> I would like to discuss an idea for an optional extension of Iceberg's
>>> Snapshot metadata lifecycle. Thanks Piotr for replying on the other thread
>>> that this should be a fuller Iceberg format change.
>>>
>>> *Proposal Summary*
>>>
>>> Currently, ExpireSnapshots(long olderThan) purges metadata and deleted
>>> data of a Snapshot together. Purging deleted data often requires a shorter
>>> timeline, due to strict requirements to claw back unused disk space,
>>> fulfill data lifecycle compliance, etc. In many deployments, this means
>>> the 'olderThan' timestamp is set to just a few days before the current
>>> time (the default is 5 days).
>>>
>>> On the other hand, purging metadata could ideally be done on a more
>>> relaxed timeline, such as months or more, to allow for meaningful
>>> historical table analysis.
>>>
>>> We should have an optional way to purge Snapshot metadata separately
>>> from purging deleted data. This would allow us to get the history of the
>>> table, and answer questions like:
>>>
>>> - When was a file/partition added
>>> - When was a file/partition deleted
>>> - How much data was added or removed in time X
>>>
>>> that are currently only possible for data operations within a few days.
>>>
>>> *Github Proposal*: https://github.com/apache/iceberg/issues/10646
>>> *Google Design Doc*:
>>> https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit
>>>
>>> Curious if anyone has thought along these lines and/or sees obvious
>>> issues. Would appreciate any feedback on the proposal.
>>>
>>> Thanks
>>> Szehon