I agree with the need this proposal addresses: decoupling snapshot
history from data deletion. I do wonder, though, whether keeping expired
snapshots as-is will slow down manifest/scan planning (REST catalog
approaches could probably mitigate this).

On Mon, Jul 8, 2024, 5:34 AM Piotr Findeisen <piotr.findei...@gmail.com>
wrote:

> Hi Szehon, Walaa
>
> Thank you, Szehon, for bringing this up. And thank you, Walaa, for
> providing more context from a similar existing solution to the problem.
> The choices that LakeChime seems to have made -- to keep information in a
> separate RDBMS and which particular metadata information to retain -- do
> indeed look use-case specific, at least until we observe repeating patterns.
> The idea to bake lifecycle changes into the table format spec was proposed
> as an alternative to the idea to bake lifecycle changes into the REST
> catalog spec. It was brought into the discussion based on the intuition that
> the REST catalog is a first-class citizen in the Iceberg world, just like
> other catalogs, and so solutions to table-centric problems do not need to be
> limited to the REST catalog. What information we retain, and how/whether
> this is configurable, are open questions applicable to both avenues.
>
> As a third alternative, we could focus on REST catalog *extensions*,
> without naming snapshot metadata lifecycle, and leave the problem up to
> REST catalog implementors. That would mean the Iceberg project doesn't
> address the snapshot metadata lifecycle topic directly, but instead gives
> users tools to build solutions around it. At this point I am not trying to
> judge whether it's a good idea or not. It probably depends on how important
> it is to solve the problem and have a common solution.
>
> Best,
> Piotr
>
>
>
>
> On Sat, 6 Jul 2024 at 09:46, Walaa Eldin Moustafa <wa.moust...@gmail.com>
> wrote:
>
>> Hi Szehon,
>>
>> Thanks for sharing this proposal. We have thought along the same lines
>> and implemented an external system (LakeChime [1]) that retains snapshot
>> and partition metadata for longer (our internal deployment keeps the data
>> for 13 months, but that can be tuned). For efficient analysis, we keep
>> this data in an RDBMS. In my opinion, this may be a better fit for an
>> external system (similar to LakeChime), since building it into Iceberg
>> could complicate the Iceberg spec, APIs, or their implementations. Also,
>> the type of metadata tracked can differ depending on the use case. For
>> example, while LakeChime retains partition and operation type metadata, it
>> does not track file-level metadata, as there was no specific use case for
>> that.
>>
>> [1]
>> https://www.linkedin.com/blog/engineering/data-management/lakechime-a-data-trigger-service-for-modern-data-lakes
>>
>> Thanks,
>> Walaa.
>>
>> On Fri, Jul 5, 2024 at 11:49 PM Szehon Ho <szehon.apa...@gmail.com>
>> wrote:
>>
>>> Hi folks,
>>>
>>> I would like to discuss an idea for an optional extension of Iceberg's
>>> Snapshot metadata lifecycle.  Thanks Piotr for replying on the other thread
>>> that this should be a fuller Iceberg format change.
>>>
>>> *Proposal Summary*
>>>
>>> Currently, ExpireSnapshots(long olderThan) purges the metadata and the
>>> deleted data of a Snapshot together.  Purging deleted data often requires
>>> a shorter timeline, due to strict requirements to claw back unused disk
>>> space, fulfill data lifecycle compliance, etc.  In many deployments, this
>>> means the 'olderThan' timestamp is set to just a few days before the
>>> current time (the default is 5 days).
>>>
>>> On the other hand, purging metadata could ideally be done on a more
>>> relaxed timeline, such as months or more, to allow for meaningful
>>> historical table analysis.
>>>
>>> We should have an optional way to purge Snapshot metadata separately
>>> from purging deleted data.  This would allow us to get the history of the
>>> table, and answer questions like:
>>>
>>>    - When was a file/partition added
>>>    - When was a file/partition deleted
>>>    - How much data was added or removed in time X
>>>
>>> that are currently only possible for data operations within a few days.
>>>
>>> *Github Proposal*:  https://github.com/apache/iceberg/issues/10646
>>> *Google Design Doc*:
>>> https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit
>>>
>>> Curious if anyone has thought along these lines and/or sees obvious
>>> issues.  Would appreciate any feedback on the proposal.
>>>
>>> Thanks
>>> Szehon
>>>
>>

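As an illustration of the decoupling discussed in this thread, here is a
minimal Python sketch that applies two independent cutoffs: a short one for
purging deleted data and a longer one for dropping snapshot metadata. It is
not Iceberg's API. The Snapshot class, the plan_expiration function, and the
retention parameters are hypothetical names chosen for this example; the
5-day and roughly 13-month values simply echo the numbers mentioned above.

```python
from dataclasses import dataclass

DAY_MS = 24 * 60 * 60 * 1000


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for a table snapshot (not the Iceberg class)."""
    snapshot_id: int
    timestamp_ms: int


def plan_expiration(snapshots, now_ms, data_retention_ms, metadata_retention_ms):
    """Return (ids whose deleted data may be purged, ids whose metadata may be dropped).

    Assumption for this sketch: metadata is kept at least as long as data,
    so every snapshot whose metadata is dropped has already had its data purged.
    """
    if metadata_retention_ms < data_retention_ms:
        raise ValueError("metadata retention must not be shorter than data retention")

    data_cutoff = now_ms - data_retention_ms
    metadata_cutoff = now_ms - metadata_retention_ms

    purge_data = [s.snapshot_id for s in snapshots if s.timestamp_ms < data_cutoff]
    drop_metadata = [s.snapshot_id for s in snapshots if s.timestamp_ms < metadata_cutoff]
    return purge_data, drop_metadata


# Example: three snapshots aged 390 days, 30 days, and 1 day.
now = 400 * DAY_MS
snapshots = [
    Snapshot(1, now - 390 * DAY_MS),
    Snapshot(2, now - 30 * DAY_MS),
    Snapshot(3, now - 1 * DAY_MS),
]
purge_data, drop_metadata = plan_expiration(
    snapshots,
    now_ms=now,
    data_retention_ms=5 * DAY_MS,        # short timeline for deleted data
    metadata_retention_ms=365 * DAY_MS,  # relaxed timeline for metadata
)
print(purge_data, drop_metadata)
```

In this example, snapshot 2 falls past the data cutoff but not the metadata
cutoff, so its deleted data could be reclaimed while its metadata stays
available for questions such as when a file or partition was added or
removed.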