> Yes, it won't impact reads/writes but it may become a bottleneck for other operations that need that information.
We can set a limit to allow a certain number of snapshots, and purge old items in each commit just like what we did for metadata logs. I admit that it doesn't seem like an elegant solution, but it may solve most of the problems.

> I'd be curious to hear more from people who have experience implementing the REST catalog API.

A REST catalog can definitely preserve snapshot entries for a long period, but we still need an interface/spec to allow metadata table queries (on the client side) to reference these expired entries.

Yufei

On Tue, Aug 6, 2024 at 11:30 AM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:

> I agree it is unfortunate to not be able to find the snapshot information from a manifest entry when the original snapshot is expired, even though we still know the snapshot ID that added the file. I am not sure about a separate JSON file, though. It is still JSON and I bet people will store the snapshot history forever, so the size of that file will gradually increase. Yes, it won't impact reads/writes but it may become a bottleneck for other operations that need that information. Using Parquet may help but I am not sure that's the right approach overall.
>
> I'd be curious to hear more from people who have experience implementing the REST catalog API. It seems like most implementations have addressed that or at least have a way to do that.
>
> - Anton
>
> On Mon, Aug 5, 2024 at 18:12 Yufei Gu <flyrain...@gmail.com> wrote:
>
>> Thanks Szehon for the new proposal. I think it is a useful feature with the least spec change. A candidate for the v3 spec?
>>
>> Yufei
>>
>> On Tue, Jul 16, 2024 at 3:02 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Thanks for reading through the proposal and the good feedback.
>>> I was thinking about the mentioned concerns:
>>>
>>> - The motivation for the change
>>> - Too much additional metadata (storage overhead, namenode pressure on HDFS)
>>> - Performance impact for reading/writing TableMetadata
>>> - Some impact to existing Table APIs and maintenance procedures, which would have to check for these snapshots
>>>
>>> I chatted a bit offline with Yufei to brainstorm, and I wrote a V2 of the proposal at the same link: https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit. I also tried to clarify the motivation in the doc with actual metadata table queries that would become possible.
>>>
>>> This version now simply adds an optional 'expired-snapshots-path' that contains the metadata of expired Snapshots. I think this should address the above concerns:
>>>
>>> - Minimal storage overhead for just snapshot references (capped). I no longer propose to keep old snapshot manifest-list/manifest files; the snapshot reference to the expired snapshot should be a good start.
>>> - Minimal perf overhead for reading/writing TableMetadata. The additional file is only written by ExpireSnapshots if the feature is enabled, and only read on demand (via a metadata table query, for example).
>>> - No impact to other Table APIs or maintenance procedures (as these don't show up in the regular table.snapshots() list anymore).
>>> - Only an additive, optional spec change (backwards compatible).
>>>
>>> Of course, again, this feature is possible outside Iceberg, but the advantage of doing it in Iceberg is that it could be integrated into the ExpireSnapshots and Metadata Table frameworks.
>>>
>>> Curious what people think?
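[Editor's note] As a rough illustration of the V2 idea above, the file behind 'expired-snapshots-path' could hold capped, lightweight snapshot references that ExpireSnapshots appends to and trims, much as metadata-log entries are trimmed today. This sketch is purely hypothetical: the field names, the cap, and the function are not from the proposal or the Iceberg spec.

```python
import json

# Illustrative cap, in the spirit of write.metadata.previous-versions-max
# (the actual property name and default, if any, would be decided by the spec).
MAX_EXPIRED_REFS = 100

def record_expired_snapshots(existing_json, expired_snapshots):
    """Append references for newly expired snapshots and trim the oldest
    entries beyond the cap, like metadata-log purging on each commit."""
    doc = json.loads(existing_json) if existing_json else {"expired-snapshots": []}
    for snap in expired_snapshots:
        # Keep only a compact reference, not manifest-list/manifest files.
        doc["expired-snapshots"].append({
            "snapshot-id": snap["snapshot-id"],
            "timestamp-ms": snap["timestamp-ms"],
        })
    # Retain only the newest MAX_EXPIRED_REFS references.
    doc["expired-snapshots"] = doc["expired-snapshots"][-MAX_EXPIRED_REFS:]
    return json.dumps(doc)
```

The point of the cap is Anton's bottleneck concern: the file stays bounded no matter how long the table lives, while still letting a metadata table query resolve recently expired snapshot IDs on demand.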
>>>
>>> Thanks
>>> Szehon
>>>
>>> On Wed, Jul 10, 2024 at 1:44 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>
>>>> > I believe DeleteOrphanFiles may be ok as is, because currently the logic walks down the reachable graph and marks those metadata files as 'not-orphan', so it should naturally walk these 'expired' snapshots as well.
>>>>
>>>> We need to keep the metadata files, but remove the data files if they are not removed for whatever reason. Doable, but a logic change.
>>>>
>>>> > You mean purging expired snapshots in the middle of the history, right? I think the current mechanism for this is 'tagging' and 'branching'.
>>>>
>>>> I think for most users the compaction commits are technical details which they would like to avoid / don't want to see. The real table history is only the changes initiated by the user, and it would be good to hide the technical/compaction commits.
>>>>
>>>> On Wed, Jul 10, 2024, 08:52 himadri pal <meh...@gmail.com> wrote:
>>>>
>>>>> Hi Szehon,
>>>>>
>>>>> This is a good idea considering the use case it intends to solve. I added a few questions and comments in the design doc.
>>>>>
>>>>> IMO, the alternate options considered in the design doc look cleaner to me.
>>>>>
>>>>> I think it might add to the maintenance burden, now that we need to remember to remove these metadata-only snapshots.
>>>>>
>>>>> Also, I wonder whether some of the use cases it intends to address are solvable by metadata alone, e.g. how much data was added in a given time range? Maybe to answer these kinds of questions a user would prefer to create KPIs using columns in the dataset.
>>>>>
>>>>> Regards,
>>>>> Himadri Pal
>>>>>
>>>>> On Tue, Jul 9, 2024 at 11:10 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>
>>>>>> I am not totally convinced of the motivation yet.
>>>>>>
>>>>>> I thought the snapshot retention window is primarily meant for time travel and for troubleshooting table changes that happened recently (like a few days or weeks).
>>>>>>
>>>>>> Is it valuable enough to keep expired snapshots for as long as months or years? While metadata files are typically smaller than data files in total size, they can still be significant considering the default amount of column stats written today (especially for wide tables with many columns).
>>>>>>
>>>>>> How long are we going to keep the expired snapshot references by default? If it is months/years, it can have major implications on the query performance of metadata tables (like snapshots, all_*).
>>>>>>
>>>>>> I assume it will also have some performance impact on table loading, as a lot more expired snapshots are still referenced.
>>>>>>
>>>>>> On Tue, Jul 9, 2024 at 6:36 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Peter and Yufei.
>>>>>>>
>>>>>>> Yes, in terms of implementation, I noted in the doc that we need to add error checks to prevent time-travel / rollback / cherry-pick operations on 'expired' snapshots. I'll make it clearer in the doc which operations we need to check against.
>>>>>>>
>>>>>>> I believe DeleteOrphanFiles may be ok as is, because currently the logic walks down the reachable graph and marks those metadata files as 'not-orphan', so it should naturally walk these 'expired' snapshots as well.
>>>>>>>
>>>>>>> So, I think the main changes in terms of implementation are going to be adding error checks in those Table APIs, and updating the ExpireSnapshots API.
>>>>>>>
>>>>>>>> Do we want to consider expiring snapshots in the middle of the history of the table?
>>>>>>>
>>>>>>> You mean purging expired snapshots in the middle of the history, right? I think the current mechanism for this is 'tagging' and 'branching'. So interestingly, I was thinking it's related to your other question: if we don't add an error check for 'tagging' and 'branching' on an 'expired' snapshot, it could be handled just as they are handled today for other snapshots. It's one option. We could also support it subsequently, after the first version and if there's some usage of this.
>>>>>>>
>>>>>>> One thing that comes up in this thread and the Google doc is the question of the size of preserved metadata. I had put in the Alternatives section that we could potentially make the ExpireSnapshots purge boolean argument more nuanced, like PURGE, PRESERVE_REFS (snapshot refs are preserved), PRESERVE_METADATA (snapshot refs and all metadata files are preserved), though I am still debating if it's worth it, as users could choose not to use this feature.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Szehon
>>>>>>>
>>>>>>> On Tue, Jul 9, 2024 at 6:02 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thank you for the interesting proposal. With a minor specification change, it could indeed enable different retention periods for data files and metadata files. This differentiation is useful for two reasons:
>>>>>>>>
>>>>>>>> 1. More metadata helps us better understand the table history, providing valuable insights.
>>>>>>>> 2. Users often prioritize data file deletion, as it frees up significant storage space and removes potentially sensitive data.
>>>>>>>>
>>>>>>>> However, adding a boolean property to the specification isn't necessarily a lightweight solution. As Peter mentioned, implementing this change requires modifications in several places.
>>>>>>>> In this context, external systems like LakeChime or a REST catalog implementation could offer effective solutions to manage extended metadata retention periods, without spec changes.
>>>>>>>>
>>>>>>>> I am neutral on this proposal (+0) and look forward to seeing more input from people.
>>>>>>>>
>>>>>>>> Yufei
>>>>>>>>
>>>>>>>> On Mon, Jul 8, 2024 at 10:32 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> We need to handle expired snapshots differently in several places in Iceberg core as well:
>>>>>>>>> - We need to add checks to prevent scans from reading these snapshots, and throw a meaningful error.
>>>>>>>>> - We need to add checks to prevent tagging/branching these snapshots.
>>>>>>>>> - We need to update DeleteOrphanFiles in Spark/Flink to not consider files only referenced by the expired snapshots.
>>>>>>>>>
>>>>>>>>> Some Flink jobs do frequent commits, and in these cases the size of the metadata file becomes a constraining factor too. In this case we could just tell users not to use this feature, and expire the metadata as we do now, but I thought it was worth mentioning.
>>>>>>>>>
>>>>>>>>> Do we want to consider expiring snapshots in the middle of the history of the table? When we compact the table, the compaction commits litter the real history of the table. Consider the following:
>>>>>>>>> - S1 writes some data
>>>>>>>>> - S2 writes some more data
>>>>>>>>> - S3 compacts the previous 2 commits
>>>>>>>>> - S4 writes even more data
>>>>>>>>> From the query engine user's perspective, S3 is a commit which does nothing, was not initiated by the user, and is most probably one they don't even want to know about. If one could expire a snapshot from the middle of the history, that would be nice, so users would see only S1/S2/S4.
>>>>>>>>> The only downside is that reading S2 is less performant than reading S3, but IMHO this could be acceptable for having only user-driven changes in the table history.
>>>>>>>>>
>>>>>>>>> On Mon, Jul 8, 2024, 20:15 Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the comments so far. I also thought previously that this functionality would be in an external system, like LakeChime, or a custom catalog extension. But after doing an initial analysis (please double check), I thought it's a small enough change that it would be worth putting in the Iceberg spec/APIs for all users:
>>>>>>>>>>
>>>>>>>>>> - Table Spec: only one optional boolean field (on Snapshot, only set if the functionality is used).
>>>>>>>>>> - API: only one boolean parameter (on ExpireSnapshots).
>>>>>>>>>>
>>>>>>>>>>> I do wonder, will keeping expired snapshots as is slow down manifest/scan planning though (REST catalog approaches could probably mitigate this)?
>>>>>>>>>>
>>>>>>>>>> I think it should not slow down manifest/scan planning, because we plan using the current snapshot (or the one we specify via time travel), and we wouldn't read expired snapshots in this case.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Szehon
>>>>>>>>>>
>>>>>>>>>> On Mon, Jul 8, 2024 at 10:54 AM John Greene <jgreene1...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I do agree with the need that this proposal solves, to decouple the snapshot history from the data deletion. I do wonder, will keeping expired snapshots as is slow down manifest/scan planning though (REST catalog approaches could probably mitigate this)?
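[Editor's note] Péter's S1/S2/S3/S4 compaction example earlier in the thread can be sketched concretely. Iceberg summarizes rewrite/compaction commits with the 'replace' operation, which is what a middle-of-history expiry (or a display filter) could key on; the dict shape and function below are illustrative only, not proposed API.

```python
# Hypothetical snapshot history mirroring Péter's example.
history = [
    {"id": "S1", "operation": "append"},   # user writes some data
    {"id": "S2", "operation": "append"},   # user writes some more data
    {"id": "S3", "operation": "replace"},  # compaction of S1+S2; no logical change
    {"id": "S4", "operation": "append"},   # user writes even more data
]

def user_visible_history(snapshots):
    """Hide commits that only rewrite existing data ('replace'),
    keeping only the user-initiated changes in the visible history."""
    return [s["id"] for s in snapshots if s["operation"] != "replace"]
```

Here `user_visible_history(history)` yields `['S1', 'S2', 'S4']`, matching the history Péter says users actually care about; the trade-off he notes is that reading at S2 is slower than reading the compacted S3.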
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jul 8, 2024, 5:34 AM Piotr Findeisen <piotr.findei...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Szehon, Walaa,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks Szehon for bringing this up. And thank you Walaa for providing more context from a similar existing solution to the problem. The choices that LakeChime seems to have made -- to keep information in a separate RDBMS, and which particular metadata information to retain -- do indeed look use-case specific, until we observe repeating patterns.
>>>>>>>>>>>>
>>>>>>>>>>>> The idea to bake lifecycle changes into the table format spec was proposed as an alternative to the idea of baking lifecycle changes into the REST catalog spec. It was brought into the discussion based on the intuition that the REST catalog is a first-class citizen in the Iceberg world, just like other catalogs, and so solutions to table-centric problems do not need to be limited to the REST catalog. What information we retain, and how/whether this is configurable, are open questions applicable to both avenues.
>>>>>>>>>>>>
>>>>>>>>>>>> As a third alternative, we could focus on REST catalog *extensions*, without naming snapshot metadata lifecycle, and leave the problem up to REST implementors. That would mean the Iceberg project doesn't address the snapshot metadata lifecycle topic directly, but instead gives users tools to build solutions around it. At this point I am not trying to judge whether it's a good idea or not. It probably depends on how important it is to solve the problem and have a common solution.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Piotr
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, 6 Jul 2024 at 09:46, Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Szehon,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for sharing this proposal. We have thought along the same lines and implemented an external system (LakeChime [1]) that retains snapshot + partition metadata for longer (the actual internal implementation keeps data for 13 months, but that can be tuned). For efficient analysis, we have kept this data in an RDBMS. My opinion is this may be a better fit for an external system (similar to LakeChime), since it could potentially complicate the Iceberg spec, APIs, or their implementations. Also, the type of metadata tracked can differ depending on the use case. For example, while LakeChime retains partition and operation type metadata, it does not track file-level metadata, as there was no specific use case for that.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] https://www.linkedin.com/blog/engineering/data-management/lakechime-a-data-trigger-service-for-modern-data-lakes
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Jul 5, 2024 at 11:49 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi folks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would like to discuss an idea for an optional extension of Iceberg's Snapshot metadata lifecycle. Thanks Piotr for replying on the other thread that this should be a fuller Iceberg format change.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Proposal Summary*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Currently, ExpireSnapshots(long olderThan) purges the metadata and the deleted data of a Snapshot together. Purging deleted data often requires a shorter timeline, due to strict requirements to claw back unused disk space, fulfill data lifecycle compliance, etc. In many deployments, this means the 'olderThan' timestamp is set to just a few days before the current time (the default is 5 days).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On the other hand, purging metadata could ideally be done on a more relaxed timeline, such as months or more, to allow for meaningful historical table analysis.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We should have an optional way to purge Snapshot metadata separately from purging deleted data. This would allow us to keep the history of the table, and answer questions like:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - When was a file/partition added?
>>>>>>>>>>>>>> - When was a file/partition deleted?
>>>>>>>>>>>>>> - How much data was added or removed in time X?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> These are currently only answerable for data operations within the last few days.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Github Proposal*: https://github.com/apache/iceberg/issues/10646
>>>>>>>>>>>>>> *Google Design Doc*: https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Curious if anyone has thought along these lines and/or sees obvious issues. Would appreciate any feedback on the proposal.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Szehon
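[Editor's note] The historical questions in the proposal summary can be made concrete with a small sketch. Assuming retained snapshot metadata exposes Iceberg-style summary keys such as 'added-records' (summary values are strings in Iceberg metadata), a "how much data was added in time X" query over the retained history might look like this; the list-of-dicts shape is illustrative, not an actual API.

```python
def added_records_between(snapshots, start_ms, end_ms):
    """Total records added by snapshots committed in [start_ms, end_ms).

    'timestamp-ms' and the 'added-records' summary key follow Iceberg's
    snapshot metadata conventions; the input structure is hypothetical.
    """
    return sum(
        int(s["summary"].get("added-records", 0))
        for s in snapshots
        if start_ms <= s["timestamp-ms"] < end_ms
    )

# Example: two appends inside the window, one snapshot with no summary entry.
snaps = [
    {"timestamp-ms": 100, "summary": {"added-records": "10"}},
    {"timestamp-ms": 200, "summary": {"added-records": "5"}},
    {"timestamp-ms": 300, "summary": {}},
]
```

With today's default retention, such a query only works for the last few days of snapshots; retaining expired snapshot metadata would extend the answerable window to months.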