Thanks Szehon for the new proposal. I think it is a useful feature with the least spec change. A candidate for v3 spec?
Yufei

On Tue, Jul 16, 2024 at 3:02 PM Szehon Ho <szehon.apa...@gmail.com> wrote:

> Hi,
>
> Thanks for reading through the proposal and the good feedback. I was thinking about the mentioned concerns:
>
> - The motivation for the change
> - Too much additional metadata (storage overhead, namenode pressure on HDFS)
> - Performance impact for reading/writing TableMetadata
> - Some impact on existing Table APIs and maintenance procedures, which would have to check for these snapshots
>
> I chatted a bit offline with Yufei to brainstorm, and I wrote a V2 of the proposal at the same link:
> https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit
> I also tried to clarify the motivation in the doc with actual metadata table queries that would be possible.
>
> This version now simply adds an optional 'expired-snapshots-path' that contains the metadata of expired Snapshots. I think this should address the above concerns:
>
> - Minimal storage overhead for just snapshot references (capped). I no longer propose to keep old snapshot manifest-list/manifest files; the snapshot reference to the expired snapshot should be a good start.
> - Minimal perf overhead for reading/writing TableMetadata. The additional file is only written by ExpireSnapshots if the feature is enabled, and only read on demand (via a metadata table query, for example).
> - No impact on other Table APIs or maintenance procedures (as these don't show up in the regular table.snapshots() list anymore).
> - Only an additive, optional spec change (backwards compatible).
>
> Of course, again, this feature is possible outside Iceberg, but the advantage of doing it in Iceberg is that it could be integrated into the ExpireSnapshots and Metadata Table frameworks.
>
> Curious what people think?
>
> Thanks
> Szehon
>
> On Wed, Jul 10, 2024 at 1:44 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>
>> > I believe DeleteOrphanFiles may be ok as is, because currently the logic walks down the reachable graph and marks those metadata files as 'not-orphan', so it should naturally walk these 'expired' snapshots as well.
>>
>> We need to keep the metadata files, but remove data files if they are not removed for whatever reason. Doable, but a logic change.
>>
>> > You mean purging expired snapshots in the middle of the history, right? I think the current mechanism for this is 'tagging' and 'branching'.
>>
>> I think for most users the compaction commits are technical details which they would like to avoid / don't want to see. The real table history is only the changes initiated by the user, and it would be good to hide the technical/compaction commits.
>>
>> On Wed, Jul 10, 2024, 08:52 himadri pal <meh...@gmail.com> wrote:
>>
>>> Hi Szehon,
>>>
>>> This is a good idea considering the use case it intends to solve. Added a few questions and comments in the design doc.
>>>
>>> IMO, the alternative options considered in the design doc look cleaner to me.
>>>
>>> I think it might add to the maintenance burden, now that we need to remember to remove these metadata-only snapshots.
>>>
>>> Also, I wonder whether some of the use cases it intends to address are solvable by metadata alone, e.g. how much data was added in a given time range? Maybe to answer these kinds of questions users would prefer to create a KPI using columns in the dataset.
>>>
>>> Regards,
>>> Himadri Pal
>>>
>>> On Tue, Jul 9, 2024 at 11:10 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>> I am not totally convinced of the motivation yet.
>>>>
>>>> I thought the snapshot retention window is primarily meant for time travel and troubleshooting table changes that happened recently (like a few days or weeks).
>>>>
>>>> Is it valuable enough to keep expired snapshots for as long as months or years? While metadata files are typically smaller than data files in total size, it can still be significant considering the default amount of column stats written today (especially for wide tables with many columns).
>>>>
>>>> How long are we going to keep the expired snapshot references by default? If it is months/years, it can have major implications on the query performance of metadata tables (like snapshots, all_*).
>>>>
>>>> I assume it will also have some performance impact on table loading as a lot more expired snapshots are still referenced.
>>>>
>>>> On Tue, Jul 9, 2024 at 6:36 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>
>>>>> Thanks Peter and Yufei.
>>>>>
>>>>> Yes, in terms of implementation, I noted in the doc we need to add error checks to prevent time-travel / rollback / cherry-pick operations to 'expired' snapshots. I'll make it clearer in the doc which operations we need to check against.
>>>>>
>>>>> I believe DeleteOrphanFiles may be ok as is, because currently the logic walks down the reachable graph and marks those metadata files as 'not-orphan', so it should naturally walk these 'expired' snapshots as well.
>>>>>
>>>>> So, I think the main changes in terms of implementation are going to be adding error checks in those Table APIs and updating the ExpireSnapshots API.
>>>>>
>>>>>> Do we want to consider expiring snapshots in the middle of the history of the table?
>>>>>
>>>>> You mean purging expired snapshots in the middle of the history, right? I think the current mechanism for this is 'tagging' and 'branching'.
>>>>> So interestingly, I was thinking it's related to your other question, and if we don't add error checks to 'tagging' and 'branching' on 'expired' snapshots, it could be handled just as they are handled today for other snapshots. It's one option. We could support it subsequently as well, after the first version and if there's some usage of this.
>>>>>
>>>>> One thing that comes up in this thread and the Google doc is some question about the size of preserved metadata. I had put in the Alternatives section that we could potentially make the ExpireSnapshots purge boolean argument more nuanced, like PURGE, PRESERVE_REFS (snapshot refs are preserved), PRESERVE_METADATA (snapshot refs and all metadata files are preserved), though I am still debating if it's worth it, as users could choose not to use this feature.
>>>>>
>>>>> Thanks
>>>>> Szehon
>>>>>
>>>>> On Tue, Jul 9, 2024 at 6:02 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>>>>
>>>>>> Thank you for the interesting proposal. With a minor specification change, it could indeed enable different retention periods for data files and metadata files. This differentiation is useful for two reasons:
>>>>>>
>>>>>> 1. More metadata helps us better understand the table history, providing valuable insights.
>>>>>> 2. Users often prioritize data file deletion as it frees up significant storage space and removes potentially sensitive data.
>>>>>>
>>>>>> However, adding a boolean property to the specification isn't necessarily a lightweight solution. As Peter mentioned, implementing this change requires modifications in several places. In this context, external systems like LakeChime or a REST catalog implementation could offer effective solutions to manage extended metadata retention periods, without spec changes.
>>>>>>
>>>>>> I am neutral on this proposal (+0) and look forward to seeing more input from people.
>>>>>> Yufei
>>>>>>
>>>>>> On Mon, Jul 8, 2024 at 10:32 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>
>>>>>>> We need to handle expired snapshots in several places differently in Iceberg core as well.
>>>>>>> - We need to add checks to prevent scans from reading these snapshots and throw a meaningful error.
>>>>>>> - We need to add checks to prevent tagging/branching these snapshots.
>>>>>>> - We need to update DeleteOrphanFiles in Spark/Flink to not consider files only referenced by the expired snapshots.
>>>>>>>
>>>>>>> Some Flink jobs do frequent commits, and in these cases, the size of the metadata file becomes a constraining factor too. In this case, we could just tell users not to use this feature, and expire the metadata as we do now, but I thought it's worth mentioning.
>>>>>>>
>>>>>>> Do we want to consider expiring snapshots in the middle of the history of the table?
>>>>>>> When we compact the table, the compaction commits litter the real history of the table. Consider the following:
>>>>>>> - S1 writes some data
>>>>>>> - S2 writes some more data
>>>>>>> - S3 compacts the previous 2 commits
>>>>>>> - S4 writes even more data
>>>>>>> From the query engine user's perspective, S3 is a commit which does nothing, was not initiated by the user, and most probably they don't even want to know about it. If one can expire a snapshot from the middle of the history, that would be nice, so users would see only S1/S2/S4. The only downside is that reading S2 is less performant than reading S3, but IMHO this could be acceptable for having only user-driven changes in the table history.
>>>>>>>
>>>>>>> On Mon, Jul 8, 2024, 20:15 Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks for the comments so far. I also thought previously that this functionality would be in an external system, like LakeChime, or a custom catalog extension. But after doing an initial analysis (please double check), I thought it's a small enough change that it would be worth putting in the Iceberg spec/APIs for all users:
>>>>>>>>
>>>>>>>> - Table Spec, only one optional boolean field (on Snapshot, only set if the functionality is used).
>>>>>>>> - API, only one boolean parameter (on ExpireSnapshots).
>>>>>>>>
>>>>>>>>> I do wonder, will keeping expired snapshots as is slow down manifest/scan planning though (REST catalog approaches could probably mitigate this)?
>>>>>>>>
>>>>>>>> I think it should not slow down manifest/scan planning, because we plan using the current snapshot (or the one we specify via time travel), and we wouldn't read expired snapshots in this case.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Szehon
>>>>>>>>
>>>>>>>> On Mon, Jul 8, 2024 at 10:54 AM John Greene <jgreene1...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I do agree with the need that this proposal solves, to decouple the snapshot history from the data deletion. I do wonder, will keeping expired snapshots as is slow down manifest/scan planning though (REST catalog approaches could probably mitigate this)?
>>>>>>>>>
>>>>>>>>> On Mon, Jul 8, 2024, 5:34 AM Piotr Findeisen <piotr.findei...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Szehon, Walaa
>>>>>>>>>>
>>>>>>>>>> Thanks Szehon for bringing this up. And thank you Walaa for providing more context from a similar existing solution to the problem.
>>>>>>>>>> The choices that LakeChime seems to have made -- to keep information in a separate RDBMS and which particular metadata information to retain -- indeed look use-case specific, until we observe repeating patterns.
>>>>>>>>>> The idea to bake lifecycle changes into the table format spec was proposed as an alternative to the idea to bake lifecycle changes into the REST catalog spec. It was brought into discussion based on the intuition that the REST catalog is a first-class citizen in the Iceberg world, just like other catalogs, and so solutions to table-centric problems do not need to be limited to the REST catalog. What information we retain and how/whether this is configurable are open questions applicable to both avenues.
>>>>>>>>>>
>>>>>>>>>> As a 3rd/another alternative, we could focus on REST catalog *extensions*, without naming snapshot metadata lifecycle, and leave the problem up to REST's implementors. That would mean the Iceberg project doesn't address the snapshot metadata lifecycle topic directly, but instead gives users tools to build solutions around it. At this point I am not trying to judge whether it's a good idea or not. It probably depends on how important it is to solve the problem and have a common solution.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Piotr
>>>>>>>>>>
>>>>>>>>>> On Sat, 6 Jul 2024 at 09:46, Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Szehon,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for sharing this proposal.
>>>>>>>>>>> We have thought along the same lines and implemented an external system (LakeChime [1]) that retains snapshot + partition metadata for longer (actual internal implementation keeps data for 13 months, but that can be tuned). For efficient analysis, we have kept this data in an RDBMS. My opinion is this may be a better fit to an external system (similar to LakeChime) since it could potentially complicate the Iceberg spec, APIs, or their implementations. Also, the type of metadata tracked can differ depending on the use case. For example, while LakeChime retains partition and operation type metadata, it does not track file-level metadata as there was no specific use case for that.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://www.linkedin.com/blog/engineering/data-management/lakechime-a-data-trigger-service-for-modern-data-lakes
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Walaa.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jul 5, 2024 at 11:49 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi folks,
>>>>>>>>>>>>
>>>>>>>>>>>> I would like to discuss an idea for an optional extension of Iceberg's Snapshot metadata lifecycle. Thanks Piotr for replying on the other thread that this should be a fuller Iceberg format change.
>>>>>>>>>>>>
>>>>>>>>>>>> *Proposal Summary*
>>>>>>>>>>>>
>>>>>>>>>>>> Currently, ExpireSnapshots(long olderThan) purges metadata and deleted data of a Snapshot together. Purging deleted data often requires a smaller timeline, due to strict requirements to claw back unused disk space, fulfill data lifecycle compliance, etc.
>>>>>>>>>>>> In many deployments, this means the 'olderThan' timestamp is set to just a few days before the current time (the default is 5 days).
>>>>>>>>>>>>
>>>>>>>>>>>> On the other hand, purging metadata could ideally be done on a more relaxed timeline, such as months or more, to allow for meaningful historical table analysis.
>>>>>>>>>>>>
>>>>>>>>>>>> We should have an optional way to purge Snapshot metadata separately from purging deleted data. This would allow us to get the history of the table, and answer questions like:
>>>>>>>>>>>>
>>>>>>>>>>>> - When was a file/partition added
>>>>>>>>>>>> - When was a file/partition deleted
>>>>>>>>>>>> - How much data was added or removed in time X
>>>>>>>>>>>>
>>>>>>>>>>>> that are currently only possible for data operations within a few days.
>>>>>>>>>>>>
>>>>>>>>>>>> *GitHub Proposal*: https://github.com/apache/iceberg/issues/10646
>>>>>>>>>>>> *Google Design Doc*: https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit
>>>>>>>>>>>>
>>>>>>>>>>>> Curious if anyone has thought along these lines and/or sees obvious issues. Would appreciate any feedback on the proposal.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Szehon