Thanks Szehon for the new proposal. I think it is a useful feature with
minimal spec change. A candidate for the v3 spec?

Yufei


On Tue, Jul 16, 2024 at 3:02 PM Szehon Ho <szehon.apa...@gmail.com> wrote:

> Hi,
>
> Thanks for reading through the proposal and for the good feedback. I have
> been thinking about the concerns raised:
>
>    - The motivation for the change
>    - Too much additional metadata (storage overhead, namenode pressure on
>    HDFS)
>    - Performance impact of reading/writing TableMetadata
>    - Some impact to existing Table APIs and maintenance procedures, which
>    would have to check for these snapshots
>
> I chatted a bit offline with Yufei to brainstorm, and I wrote a V2 of the
> proposal at the same link:
> https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit.
> I also tried to clarify the motivation in the doc with actual metadata
> table queries that would be possible.
>
> This version now simply adds an optional 'expired-snapshots-path' that
> contains the metadata of expired Snapshots.  I think this should address
> the above concerns:
>
>    - Minimal storage overhead, as only snapshot references are kept (capped).
>    I no longer propose keeping old snapshot manifest-list/manifest files;
>    the reference to the expired snapshot should be a good start.
>    - Minimal performance overhead for reading/writing TableMetadata.  The
>    additional file is only written by ExpireSnapshots when the feature is
>    enabled, and only read on demand (via a metadata table query, for
>    example; see the sketch after this list)
>    - No impact on other Table APIs or maintenance procedures (as these
>    snapshots no longer show up in the regular table.snapshots() list).
>    - Only an additive, optional spec change (backwards compatible)
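>
> For illustration, here is a rough sketch of the kind of on-demand query I
> have in mind. The 'expired_snapshots' metadata table name and its columns
> (mirroring the existing 'snapshots' table) are placeholders for this sketch,
> not part of the spec change itself:
>
>   import org.apache.spark.sql.Dataset;
>   import org.apache.spark.sql.Row;
>   import org.apache.spark.sql.SparkSession;
>
>   public class ExpiredSnapshotsQuery {
>     public static void main(String[] args) {
>       SparkSession spark = SparkSession.builder()
>           .appName("expired-snapshots-example")
>           .getOrCreate();
>
>       // Hypothetical metadata table backed by the file at 'expired-snapshots-path'.
>       // It is only scanned when a query like this touches it, so regular reads
>       // and writes of TableMetadata are unaffected.
>       Dataset<Row> expired = spark.sql(
>           "SELECT snapshot_id, committed_at, operation, summary "
>               + "FROM db.tbl.expired_snapshots "
>               + "WHERE committed_at > TIMESTAMP '2024-01-01 00:00:00'");
>       expired.show(false);
>     }
>   }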
>
> Of course, again, this feature is possible outside Iceberg, but the
> advantage of doing it in Iceberg is that it could be integrated into
> ExpireSnapshots and Metadata Table frameworks.
>
> Curious what people think?
>
> Thanks
> Szehon
>
> On Wed, Jul 10, 2024 at 1:44 AM Péter Váry <peter.vary.apa...@gmail.com>
> wrote:
>
>> > I believe DeleteOrphanFiles may be ok as is, because currently the
>> logic walks down the reachable graph and marks those metadata files as
>> 'not-orphan', so it should naturally walk these 'expired' snapshots as well.
>>
>> We would need to keep the metadata files but remove the data files if they
>> are not removed for whatever reason. Doable, but it is a logic change.
>>
>> > You mean purging expired snapshots in the middle of the history,
>> right?  I think the current mechanism for this is 'tagging' and 'branching'.
>>
>> I think for most users the compaction commits are technical details that
>> they would like to avoid / don't want to see. The real table history consists
>> only of the changes initiated by the user, and it would be good to hide the
>> technical/compaction commits.
>>
>>
>> On Wed, Jul 10, 2024, 08:52 himadri pal <meh...@gmail.com> wrote:
>>
>>> Hi Szehon,
>>>
>>> This is a good idea considering the use case it intends to solve. I added a
>>> few questions and comments in the design doc.
>>>
>>> IMO, the alternate options considered in the design doc look
>>> cleaner to me.
>>>
>>> I think it might add to the maintenance burden, now that we need to
>>> remember to remove these metadata-only snapshots.
>>>
>>> Also, I wonder whether some of the use cases it intends to address are
>>> solvable by metadata alone, e.g. how much data was added in a given time
>>> range? Maybe to answer these kinds of questions users would prefer to create
>>> KPIs using columns in the dataset.
>>>
>>>
>>> Regards,
>>> Himadri Pal
>>>
>>>
>>> On Tue, Jul 9, 2024 at 11:10 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>> I am not totally convinced of the motivation yet.
>>>>
>>>> I thought the snapshot retention window is primarily meant for time
>>>> travel and troubleshooting table changes that happened recently (like a few
>>>> days or weeks).
>>>>
>>>> Is it valuable enough to keep expired snapshots for as long as months
>>>> or years? While metadata files are typically smaller than data files in
>>>> total size, their size can still be significant considering the default
>>>> amount of column stats written today (especially for wide tables with many
>>>> columns).
>>>>
>>>> How long are we going to keep the expired snapshot references by
>>>> default? If it is months/years, it can have major implications on the query
>>>> performance of metadata tables (like snapshots, all_*).
>>>>
>>>> I assume it will also have some performance impact on table loading as
>>>> a lot more expired snapshots are still referenced.
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Jul 9, 2024 at 6:36 PM Szehon Ho <szehon.apa...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks Peter and Yufei.
>>>>>
>>>>> Yes, in terms of implementation, I noted in the doc that we need to add
>>>>> error checks to prevent time-travel / rollback / cherry-pick operations on
>>>>> 'expired' snapshots.  I'll make it clearer in the doc which operations
>>>>> we need to check against.
>>>>>
>>>>> I believe DeleteOrphanFiles may be ok as is, because currently the
>>>>> logic walks down the reachable graph and marks those metadata files as
>>>>> 'not-orphan', so it should naturally walk these 'expired' snapshots as well.
>>>>>
>>>>> So I think the main changes in terms of implementation are going to
>>>>> be adding error checks in those Table APIs and updating the ExpireSnapshots
>>>>> API.
>>>>>
>>>>>> Do we want to consider expiring snapshots in the middle of the history
>>>>>> of the table?
>>>>>>
>>>>> You mean purging expired snapshots in the middle of the history,
>>>>> right?  I think the current mechanism for this is 'tagging' and
>>>>> 'branching'.  Interestingly, I was thinking it's related to your other
>>>>> question: if we don't add error checks to 'tagging' and 'branching' on
>>>>> 'expired' snapshots, it could be handled just as it is handled today for
>>>>> other snapshots.  It's one option.  We could also support it subsequently,
>>>>> after the first version and if there's some usage of this.
>>>>>
>>>>> One thing that comes up in this thread and the Google doc is the question
>>>>> of the size of the preserved metadata.  I had put in the Alternatives
>>>>> section that we could potentially make the ExpireSnapshots purge boolean
>>>>> argument more nuanced, e.g. PURGE, PRESERVE_REFS (snapshot refs are
>>>>> preserved), PRESERVE_METADATA (snapshot refs and all metadata files are
>>>>> preserved), though I am still debating whether it's worth it, as users
>>>>> could simply choose not to use this feature.
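>>>>>
>>>>> To make that alternative concrete, here is a rough sketch of what it could
>>>>> look like. The enum and the 'cleanupMode' method are illustrative names
>>>>> only, not the actual proposed API; only expireOlderThan()/commit() exist today:
>>>>>
>>>>>   // Illustrative sketch only -- not an existing Iceberg API.
>>>>>   enum SnapshotCleanupMode {
>>>>>     PURGE,             // remove refs, metadata files, and unreferenced data files
>>>>>     PRESERVE_REFS,     // keep only the references to expired snapshots
>>>>>     PRESERVE_METADATA  // keep refs plus manifest-list/manifest files
>>>>>   }
>>>>>
>>>>>   // Hypothetical usage, reusing the existing expireOlderThan()/commit() calls:
>>>>>   // table.expireSnapshots()
>>>>>   //     .expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(5))
>>>>>   //     .cleanupMode(SnapshotCleanupMode.PRESERVE_REFS)  // placeholder method
>>>>>   //     .commit();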
>>>>>
>>>>> Thanks
>>>>> Szehon
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jul 9, 2024 at 6:02 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>>>>
>>>>>> Thank you for the interesting proposal. With a minor specification
>>>>>> change, it could indeed enable different retention periods for data files
>>>>>> and metadata files. This differentiation is useful for two reasons:
>>>>>>
>>>>>>    1. More metadata helps us better understand the table history,
>>>>>>    providing valuable insights.
>>>>>>    2. Users often prioritize data file deletion as it frees up
>>>>>>    significant storage space and removes potentially sensitive data.
>>>>>>
>>>>>> However, adding a boolean property to the specification isn't
>>>>>> necessarily a lightweight solution. As Peter mentioned, implementing this
>>>>>> change requires modifications in several places. In this context,
>>>>>> external systems like LakeChime or a REST catalog implementation could
>>>>>> offer effective solutions to manage extended metadata retention periods,
>>>>>> without spec changes.
>>>>>>
>>>>>> I am neutral on this proposal (+0) and look forward to seeing more
>>>>>> input from people.
>>>>>> Yufei
>>>>>>
>>>>>>
>>>>>> On Mon, Jul 8, 2024 at 10:32 PM Péter Váry <
>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>
>>>>>>> We need to handle expired snapshots in several places differently in
>>>>>>> Iceberg core as well.
>>>>>>> - We need to add checks to prevent scans from reading these snapshots and
>>>>>>> to throw a meaningful error.
>>>>>>> - We need to add checks to prevent tagging/branching these snapshots
>>>>>>> - We need to update DeleteOrphanFiles in Spark/Flink to not consider
>>>>>>> files only referenced by the expired snapshots
>>>>>>>
>>>>>>> Some Flink jobs do frequent commits, and in these cases the size of
>>>>>>> the metadata file becomes a constraining factor too. In that case, we
>>>>>>> could just tell users not to use this feature and expire the metadata as
>>>>>>> we do now, but I thought it was worth mentioning.
>>>>>>>
>>>>>>> Do we want to consider expiring snapshots in the middle of the
>>>>>>> history of the table?
>>>>>>> When we compact the table, the compaction commits litter the
>>>>>>> real history of the table. Consider the following:
>>>>>>> - S1 writes some data
>>>>>>> - S2 writes some more data
>>>>>>> - S3 compacts the previous 2 commits
>>>>>>> - S4 writes even more data
>>>>>>> From the query engine user's perspective, S3 is a commit which does
>>>>>>> nothing, was not initiated by the user, and which they most probably don't
>>>>>>> even want to know about. If one could expire a snapshot from the middle of
>>>>>>> the history, that would be nice, so users would see only S1/S2/S4. The only
>>>>>>> downside is that reading S2 is less performant than reading S3, but IMHO
>>>>>>> this could be acceptable for having only user-driven changes in the table
>>>>>>> history.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jul 8, 2024, 20:15 Szehon Ho <szehon.apa...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for the comments so far.  I also thought previously that
>>>>>>>> this functionality would be in an external system, like LakeChime, or a
>>>>>>>> custom catalog extension.  But after doing an initial analysis (please
>>>>>>>> double check), I thought it's a small enough change that it would be
>>>>>>>> worth putting in the Iceberg spec/APIs for all users:
>>>>>>>>
>>>>>>>>    - Table Spec, only one optional boolean field (on Snapshot,
>>>>>>>>    only set if the functionality is used).
>>>>>>>>    - API, only one boolean parameter (on ExpireSnapshots); a rough
>>>>>>>>    sketch is below.
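>>>>>>>>
>>>>>>>> Roughly, the API side could look like this. This is a sketch only; the
>>>>>>>> 'purgeExpiredMetadata' name is a placeholder and not an existing method,
>>>>>>>> and it assumes an org.apache.iceberg.Table named 'table' plus
>>>>>>>> java.util.concurrent.TimeUnit imported:
>>>>>>>>
>>>>>>>>   long fiveDaysAgoMillis =
>>>>>>>>       System.currentTimeMillis() - TimeUnit.DAYS.toMillis(5);
>>>>>>>>   table.expireSnapshots()
>>>>>>>>       .expireOlderThan(fiveDaysAgoMillis)  // deleted data files are still purged
>>>>>>>>       .purgeExpiredMetadata(false)         // placeholder: keep the snapshots, flagged 'expired'
>>>>>>>>       .commit();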
>>>>>>>>
>>>>>>>>> I do wonder, will keeping expired snapshots as is slow down
>>>>>>>>> manifest/scan planning though (REST catalog approaches could probably
>>>>>>>>> mitigate this)?
>>>>>>>>>
>>>>>>>>
>>>>>>>> I think it should not slow down manifest/scan planning, because we
>>>>>>>> plan using the current snapshot (or the one we specify via time travel),
>>>>>>>> and we wouldn't read expired snapshots in this case.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Szehon
>>>>>>>>
>>>>>>>> On Mon, Jul 8, 2024 at 10:54 AM John Greene <jgreene1...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I do agree with the need that this proposal addresses: decoupling
>>>>>>>>> the snapshot history from the data deletion. I do wonder, will keeping
>>>>>>>>> expired snapshots as is slow down manifest/scan planning though (REST
>>>>>>>>> catalog approaches could probably mitigate this)?
>>>>>>>>>
>>>>>>>>> On Mon, Jul 8, 2024, 5:34 AM Piotr Findeisen <
>>>>>>>>> piotr.findei...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Szehon, Walaa
>>>>>>>>>>
>>>>>>>>>> Thank you Szehon for bringing this up. And thank you Walaa for
>>>>>>>>>> providing more context from a similar existing solution to the problem.
>>>>>>>>>> The choices that LakeChime seems to have made -- keeping the
>>>>>>>>>> information in a separate RDBMS, and which particular metadata to
>>>>>>>>>> retain -- do indeed look use-case specific, until we observe repeating
>>>>>>>>>> patterns.
>>>>>>>>>> The idea of baking lifecycle changes into the table format spec was
>>>>>>>>>> proposed as an alternative to the idea of baking lifecycle changes into
>>>>>>>>>> the REST catalog spec. It was brought into the discussion based on the
>>>>>>>>>> intuition that the REST catalog is a first-class citizen in the Iceberg
>>>>>>>>>> world, just like other catalogs, and so solutions to table-centric
>>>>>>>>>> problems do not need to be limited to the REST catalog. What information
>>>>>>>>>> we retain, and how/whether this is configurable, are open questions
>>>>>>>>>> applicable to both avenues.
>>>>>>>>>>
>>>>>>>>>> As a third alternative, we could focus on REST catalog
>>>>>>>>>> *extensions*, without naming the snapshot metadata lifecycle
>>>>>>>>>> explicitly, and leave the problem up to REST implementors. That would
>>>>>>>>>> mean the Iceberg project doesn't address the snapshot metadata lifecycle
>>>>>>>>>> topic directly, but instead gives users tools to build solutions around
>>>>>>>>>> it. At this point I am not trying to judge whether it's a good idea or
>>>>>>>>>> not; it probably depends on how important it is to solve the problem and
>>>>>>>>>> have a common solution.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Piotr
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, 6 Jul 2024 at 09:46, Walaa Eldin Moustafa <
>>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Szehon,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for sharing this proposal. We have thought along the same
>>>>>>>>>>> lines and implemented an external system (LakeChime [1]) that retains
>>>>>>>>>>> snapshot + partition metadata for longer (the actual internal
>>>>>>>>>>> implementation keeps data for 13 months, but that can be tuned). For
>>>>>>>>>>> efficient analysis, we have kept this data in an RDBMS. My opinion is
>>>>>>>>>>> that this may be a better fit for an external system (similar to
>>>>>>>>>>> LakeChime), since it could potentially complicate the Iceberg spec,
>>>>>>>>>>> APIs, or their implementations. Also, the type of metadata tracked can
>>>>>>>>>>> differ depending on the use case. For example, while LakeChime retains
>>>>>>>>>>> partition and operation type metadata, it does not track file-level
>>>>>>>>>>> metadata, as there was no specific use case for that.
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://www.linkedin.com/blog/engineering/data-management/lakechime-a-data-trigger-service-for-modern-data-lakes
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Walaa.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jul 5, 2024 at 11:49 PM Szehon Ho <
>>>>>>>>>>> szehon.apa...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi folks,
>>>>>>>>>>>>
>>>>>>>>>>>> I would like to discuss an idea for an optional extension of
>>>>>>>>>>>> Iceberg's Snapshot metadata lifecycle.  Thanks Piotr for replying on
>>>>>>>>>>>> the other thread that this should be a fuller Iceberg format change.
>>>>>>>>>>>>
>>>>>>>>>>>> *Proposal Summary*
>>>>>>>>>>>>
>>>>>>>>>>>> Currently, ExpireSnapshots(long olderThan) purges the metadata and
>>>>>>>>>>>> deleted data of a Snapshot together.  Purging deleted data often
>>>>>>>>>>>> requires a shorter timeline, due to strict requirements to claw back
>>>>>>>>>>>> unused disk space, fulfill data lifecycle compliance, etc.  In many
>>>>>>>>>>>> deployments, this means the 'olderThan' timestamp is set to just a few
>>>>>>>>>>>> days before the current time (the default is 5 days).
>>>>>>>>>>>>
>>>>>>>>>>>> On the other hand, purging metadata could ideally be done on a more
>>>>>>>>>>>> relaxed timeline, such as months or more, to allow for meaningful
>>>>>>>>>>>> historical table analysis.
>>>>>>>>>>>>
>>>>>>>>>>>> We should have an optional way to purge Snapshot metadata separately
>>>>>>>>>>>> from purging deleted data.  This would allow us to retain the history
>>>>>>>>>>>> of the table and answer questions like:
>>>>>>>>>>>>
>>>>>>>>>>>>    - When was a file/partition added
>>>>>>>>>>>>    - When was a file/partition deleted
>>>>>>>>>>>>    - How much data was added or removed in time X
>>>>>>>>>>>>
>>>>>>>>>>>> questions that are currently only answerable for data operations
>>>>>>>>>>>> within the last few days (see the example below).
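>>>>>>>>>>>>
>>>>>>>>>>>> For example, "how much data was added in time X" can be asked of the
>>>>>>>>>>>> snapshots metadata table today, but only within the retention window.
>>>>>>>>>>>> A sketch (assuming a Spark session 'spark' and a table 'db.tbl'); the
>>>>>>>>>>>> goal is to make this kind of query work over a much longer history:
>>>>>>>>>>>>
>>>>>>>>>>>>   spark.sql(
>>>>>>>>>>>>       "SELECT sum(CAST(summary['added-files-size'] AS BIGINT)) AS bytes_added "
>>>>>>>>>>>>           + "FROM db.tbl.snapshots "
>>>>>>>>>>>>           + "WHERE committed_at BETWEEN TIMESTAMP '2024-06-01' AND TIMESTAMP '2024-07-01'")
>>>>>>>>>>>>     .show();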
>>>>>>>>>>>>
>>>>>>>>>>>> *Github Proposal*:
>>>>>>>>>>>> https://github.com/apache/iceberg/issues/10646
>>>>>>>>>>>> *Google Design Doc*:
>>>>>>>>>>>> https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit
>>>>>>>>>>>>
>>>>>>>>>>>> Curious if anyone has thought along these lines and/or sees
>>>>>>>>>>>> obvious issues.  Would appreciate any feedback on the proposal.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Szehon
>>>>>>>>>>>>
>>>>>>>>>>>
