I am not totally convinced of the motivation yet.

I thought the snapshot retention window was primarily meant for time travel
and for troubleshooting recent table changes (within the last few days or
weeks).

Is it valuable enough to keep expired snapshots around for months or years?
While metadata files are typically smaller than data files in total size,
they can still be significant given the amount of column stats written by
default today (especially for wide tables with many columns).

How long are we going to keep the expired snapshot references by default?
If it is months or years, it can have major implications for the query
performance of metadata tables (like snapshots and the all_* tables).
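
For example, even listing the snapshots metadata table materializes one row
per retained snapshot, so a months- or years-long window grows that scan
proportionally (a Spark sketch; db.tbl is a placeholder table name):

    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder().getOrCreate();
    // One row comes back per snapshot still referenced in the table metadata.
    spark.sql("SELECT committed_at, snapshot_id, operation FROM db.tbl.snapshots").show();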

I assume it will also have some performance impact on table loading, since
many more expired snapshots would still be referenced.




On Tue, Jul 9, 2024 at 6:36 PM Szehon Ho <szehon.apa...@gmail.com> wrote:

> Thanks Peter and Yufei.
>
> Yes, in terms of implementation, I noted in the doc that we need to add error
> checks to prevent time-travel / rollback / cherry-pick operations on
> 'expired' snapshots.  I'll make it clearer in the doc which operations we
> need to check against.
>
> I believe DeleteOrphanFiles may be OK as is, because the current logic walks
> the reachability graph and marks those metadata files as 'not orphan', so it
> should naturally walk these 'expired' snapshots as well.
>
> So I think the main implementation changes are going to be adding error
> checks in those Table APIs, and updating the ExpireSnapshots API.
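>
> For illustration, a rough sketch of what that updated API could look like
> (cleanExpiredMetadata is a hypothetical name just to show the shape, and
> 'table' is an existing Table handle):
>
>     import java.util.concurrent.TimeUnit;
>
>     // Purge data files of snapshots older than 5 days, but keep the expired
>     // snapshots' metadata around for history.
>     table.expireSnapshots()
>         .expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(5))
>         .cleanExpiredMetadata(false)  // hypothetical new flag
>         .commit();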
>
>> Do we want to consider expiring snapshots in the middle of the history of
>> the table?
>>
> You mean purging expired snapshots in the middle of the history, right?  I
> think the current mechanism for this is 'tagging' and 'branching'.  So,
> interestingly, I was thinking it's related to your other question: if we
> don't add error checks for 'tagging' and 'branching' on 'expired' snapshots,
> they could be handled just as other snapshots are handled today.  That's one
> option.  We could also support it subsequently, after the first version, if
> there's some usage of this.
>
> One thing that comes up in this thread and the Google doc is the question of
> the size of preserved metadata.  I had put in the Alternatives section that
> we could potentially make the ExpireSnapshots purge boolean argument more
> nuanced, like PURGE, PRESERVE_REFS (snapshot refs are preserved), and
> PRESERVE_METADATA (snapshot refs and all metadata files are preserved),
> though I am still debating whether it's worth it, as users could simply
> choose not to use this feature.
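>
> To make that alternative concrete, a sketch of the nuanced argument (names
> taken from the Alternatives section; nothing here is a committed design):
>
>     // Hypothetical replacement for today's purge boolean.
>     public enum ExpireMode {
>       PURGE,             // current behavior: metadata and data are purged
>       PRESERVE_REFS,     // snapshot refs are preserved
>       PRESERVE_METADATA  // snapshot refs and all metadata files are preserved
>     }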
>
> Thanks
> Szehon
>
>
>
> On Tue, Jul 9, 2024 at 6:02 PM Yufei Gu <flyrain...@gmail.com> wrote:
>
>> Thank you for the interesting proposal. With a minor specification
>> change, it could indeed enable different retention periods for data files
>> and metadata files. This differentiation is useful for two reasons:
>>
>>    1. More metadata helps us better understand the table history,
>>    providing valuable insights.
>>    2. Users often prioritize data file deletion as it frees up
>>    significant storage space and removes potentially sensitive data.
>>
>> However, adding a boolean property to the specification isn't necessarily
>> a lightweight solution. As Peter mentioned, implementing this change
>> requires modifications in several places. In this context, external systems
>> like LakeChime or a REST catalog implementation could offer effective ways
>> to manage extended metadata retention periods without spec changes.
>>
>> I am neutral on this proposal (+0) and look forward to seeing more input
>> from people.
>> Yufei
>>
>>
>> On Mon, Jul 8, 2024 at 10:32 PM Péter Váry <peter.vary.apa...@gmail.com>
>> wrote:
>>
>>> We need to handle expired snapshots differently in several places in
>>> Iceberg core as well:
>>> - We need to add checks to prevent scans from reading these snapshots and
>>> to throw a meaningful error (see the sketch below)
>>> - We need to add checks to prevent tagging/branching these snapshots
>>> - We need to update DeleteOrphanFiles in Spark/Flink to not consider files
>>> that are referenced only by the expired snapshots
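>>>
>>> For the first check, something along these lines could work
>>> (validateNotExpired and Snapshot.isExpired() are hypothetical, just to
>>> show the intent):
>>>
>>>     import org.apache.iceberg.Snapshot;
>>>     import org.apache.iceberg.exceptions.ValidationException;
>>>
>>>     // Called from time-travel / scan entry points before a snapshot is used.
>>>     static void validateNotExpired(Snapshot snapshot) {
>>>       if (snapshot.isExpired()) {  // hypothetical flag from the proposal
>>>         throw new ValidationException(
>>>             "Cannot read expired snapshot %s", snapshot.snapshotId());
>>>       }
>>>     }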
>>>
>>> Some Flink jobs commit frequently, and in those cases the size of the
>>> metadata file becomes a constraining factor too. For such jobs we could
>>> simply advise not to use this feature and expire the metadata as we do now,
>>> but I thought it was worth mentioning.
>>>
>>> Do we want to consider expiring snapshots in the middle of the history
>>> of the table?
>>> When we compact the table, the compaction commits litter the real history
>>> of the table. Consider the following:
>>> - S1 writes some data
>>> - S2 writes some more data
>>> - S3 compacts the previous 2 commits
>>> - S4 writes even more data
>>> From the query engine user's perspective, S3 is a commit that does nothing:
>>> it was not initiated by the user, and they most probably don't even want to
>>> know about it. If one could expire a snapshot from the middle of the
>>> history, users would see only S1/S2/S4. The only downside is that reading
>>> S2 is less performant than reading S3, but IMHO this could be acceptable in
>>> exchange for having only user-driven changes in the table history.
>>>
>>>
>>> On Mon, Jul 8, 2024, 20:15 Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>
>>>> Thanks for the comments so far.  I also previously thought that this
>>>> functionality would live in an external system, like LakeChime, or a custom
>>>> catalog extension.  But after doing an initial analysis (please double
>>>> check), I think it's a small enough change that it would be worth putting
>>>> in the Iceberg spec/APIs for all users:
>>>>
>>>>    - Table Spec, only one optional boolean field (on Snapshot, only
>>>>    set if the functionality is used).
>>>>    - API, only one boolean parameter (on ExpireSnapshots).
>>>>
>>>>> I do wonder, though, whether keeping expired snapshots around will slow
>>>>> down manifest/scan planning (REST catalog approaches could probably
>>>>> mitigate this)?
>>>>>
>>>>
>>>> I think it should not slow down manifest/scan planning, because we plan
>>>> using the current snapshot (or the one we specify via time travel), and we
>>>> wouldn't read expired snapshots in this case.
>>>>
>>>> Thanks
>>>> Szehon
>>>>
>>>> On Mon, Jul 8, 2024 at 10:54 AM John Greene <jgreene1...@gmail.com>
>>>> wrote:
>>>>
>>>>> I do agree with the need that this proposal addresses: decoupling the
>>>>> snapshot history from data deletion. I do wonder, though, whether keeping
>>>>> expired snapshots around will slow down manifest/scan planning (REST
>>>>> catalog approaches could probably mitigate this)?
>>>>>
>>>>> On Mon, Jul 8, 2024, 5:34 AM Piotr Findeisen <
>>>>> piotr.findei...@gmail.com> wrote:
>>>>>
>>>>>> Hi Szehon, Walaa,
>>>>>>
>>>>>> Thanks Szehon for bringing this up. And thank you Walaa for providing
>>>>>> more context from a similar existing solution to the problem.
>>>>>> The choices that LakeChime seems to have made -- keeping the information
>>>>>> in a separate RDBMS, and which particular metadata to retain -- do indeed
>>>>>> look use-case specific, until we observe repeating patterns.
>>>>>> The idea to bake lifecycle changes into the table format spec was
>>>>>> proposed as an alternative to the idea of baking them into the REST
>>>>>> catalog spec. It was brought into the discussion based on the intuition
>>>>>> that the REST catalog is a first-class citizen in the Iceberg world, just
>>>>>> like other catalogs, and so solutions to table-centric problems do not
>>>>>> need to be limited to the REST catalog. What information we retain, and
>>>>>> how/whether this is configurable, are open questions applicable to both
>>>>>> avenues.
>>>>>>
>>>>>> As a third alternative, we could focus on REST catalog *extensions*,
>>>>>> without naming the snapshot metadata lifecycle, and leave the problem to
>>>>>> REST implementors. That would mean the Iceberg project doesn't address
>>>>>> the snapshot metadata lifecycle topic directly, but instead gives users
>>>>>> the tools to build solutions around it. At this point I am not trying to
>>>>>> judge whether that's a good idea or not. It probably depends on how
>>>>>> important it is to solve the problem and to have a common solution.
>>>>>>
>>>>>> Best,
>>>>>> Piotr
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, 6 Jul 2024 at 09:46, Walaa Eldin Moustafa <
>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Szehon,
>>>>>>>
>>>>>>> Thanks for sharing this proposal. We have thought along the same
>>>>>>> lines and implemented an external system (LakeChime [1]) that retains
>>>>>>> snapshot + partition metadata for longer (the actual internal
>>>>>>> implementation keeps data for 13 months, but that can be tuned). For
>>>>>>> efficient analysis, we have kept this data in an RDBMS. My opinion is
>>>>>>> that this may be a better fit for an external system (similar to
>>>>>>> LakeChime), since it could potentially complicate the Iceberg spec,
>>>>>>> APIs, or their implementations. Also, the type of metadata tracked can
>>>>>>> differ depending on the use case. For example, while LakeChime retains
>>>>>>> partition and operation type metadata, it does not track file-level
>>>>>>> metadata, as there was no specific use case for that.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://www.linkedin.com/blog/engineering/data-management/lakechime-a-data-trigger-service-for-modern-data-lakes
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Walaa.
>>>>>>>
>>>>>>> On Fri, Jul 5, 2024 at 11:49 PM Szehon Ho <szehon.apa...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi folks,
>>>>>>>>
>>>>>>>> I would like to discuss an idea for an optional extension of
>>>>>>>> Iceberg's Snapshot metadata lifecycle.  Thanks Piotr for suggesting on
>>>>>>>> the other thread that this should be a fuller Iceberg format change.
>>>>>>>>
>>>>>>>> *Proposal Summary*
>>>>>>>>
>>>>>>>> Currently, ExpireSnapshots(long olderThan) purges a Snapshot's metadata
>>>>>>>> and deleted data together.  Purging deleted data often requires a
>>>>>>>> shorter timeline, due to strict requirements to claw back unused disk
>>>>>>>> space, fulfill data lifecycle compliance, etc.  In many deployments,
>>>>>>>> this means the 'olderThan' timestamp is set to just a few days before
>>>>>>>> the current time (the default is 5 days).
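>>>>>>>>
>>>>>>>> For reference, roughly what that coupled expiration looks like today
>>>>>>>> ('table' is an existing Table handle):
>>>>>>>>
>>>>>>>>     import java.util.concurrent.TimeUnit;
>>>>>>>>
>>>>>>>>     // Removes the expired snapshots' metadata AND the data files that
>>>>>>>>     // only they reference; a single timeline controls both.
>>>>>>>>     table.expireSnapshots()
>>>>>>>>         .expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(5))
>>>>>>>>         .commit();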
>>>>>>>>
>>>>>>>> On the other hand, purging metadata could ideally be done on a more
>>>>>>>> relaxed timeline, such as months or more, to allow for meaningful
>>>>>>>> historical table analysis.
>>>>>>>>
>>>>>>>> We should have an optional way to purge Snapshot metadata separately
>>>>>>>> from purging deleted data.  This would let us retain the history of the
>>>>>>>> table and answer questions like:
>>>>>>>>
>>>>>>>>    - When was a file/partition added
>>>>>>>>    - When was a file/partition deleted
>>>>>>>>    - How much data was added or removed in time X
>>>>>>>>
>>>>>>>> questions that are currently answerable only for data operations within
>>>>>>>> the last few days.
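>>>>>>>>
>>>>>>>> For example, "when was a file added" maps to a metadata-table query
>>>>>>>> like the following (a Spark SQL sketch; db.tbl is a placeholder), which
>>>>>>>> today can only see the short retention window:
>>>>>>>>
>>>>>>>>     import org.apache.spark.sql.SparkSession;
>>>>>>>>
>>>>>>>>     SparkSession spark = SparkSession.builder().getOrCreate();
>>>>>>>>     // status = 1 marks ADDED manifest entries; joining to the snapshots
>>>>>>>>     // table attaches a commit timestamp to each added file.
>>>>>>>>     spark.sql("SELECT s.committed_at, e.data_file.file_path"
>>>>>>>>         + " FROM db.tbl.all_entries e"
>>>>>>>>         + " JOIN db.tbl.snapshots s ON e.snapshot_id = s.snapshot_id"
>>>>>>>>         + " WHERE e.status = 1").show();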
>>>>>>>>
>>>>>>>> *Github Proposal*:  https://github.com/apache/iceberg/issues/10646
>>>>>>>> *Google Design Doc*:
>>>>>>>> https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit
>>>>>>>>
>>>>>>>> Curious if anyone has thought along these lines and/or sees obvious
>>>>>>>> issues.  Would appreciate any feedback on the proposal.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Szehon
>>>>>>>>
>>>>>>>
