I agree it is unfortunate that we cannot find the snapshot information from a manifest entry once the original snapshot is expired, even though we still know the snapshot ID that added the file. I am not sure about a separate JSON file, though. It is still JSON, and I bet people will store the snapshot history forever, so the size of that file will gradually increase. Yes, it won't impact reads/writes, but it may become a bottleneck for other operations that need that information. Using Parquet may help, but I am not sure that's the right approach overall.
I'd be curious to hear more from people who have experience implementing the REST catalog API. It seems like most implementations have addressed that, or at least have a way to do that.

- Anton

On Mon, Aug 5, 2024 at 18:12, Yufei Gu <flyrain...@gmail.com> wrote:

> Thanks Szehon for the new proposal. I think it is a useful feature with the least spec change. A candidate for the v3 spec?
>
> Yufei
>
> On Tue, Jul 16, 2024 at 3:02 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>
>> Hi,
>>
>> Thanks for reading through the proposal and the good feedback. I was thinking about the mentioned concerns:
>>
>> - The motivation for the change
>> - Too much additional metadata (storage overhead, namenode pressure on HDFS)
>> - Performance impact of reading/writing TableMetadata
>> - Some impact to existing Table APIs and maintenance procedures, which would have to check for these snapshots
>>
>> I chatted a bit offline with Yufei to brainstorm, and I wrote a V2 of the proposal at the same link: https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit. I also tried to clarify the motivation in the doc with actual metadata table queries that would be possible.
>>
>> This version now simply adds an optional 'expired-snapshots-path' that contains the metadata of expired Snapshots. I think this should address the above concerns:
>>
>> - Minimal storage overhead, for just snapshot references (capped). I no longer propose keeping old snapshot manifest-list/manifest files; the snapshot reference to the expired snapshot should be a good start.
>> - Minimal perf overhead for reading/writing TableMetadata. The additional file is only written by ExpireSnapshots if the feature is enabled, and only read on demand (via a metadata table query, for example).
>> - No impact to other Table APIs or maintenance procedures (as these don't show up in the regular table.snapshots() list anymore).
>> - Only an additive, optional spec change (backwards compatible).
>>
>> Of course, again, this feature is possible outside Iceberg, but the advantage of doing it in Iceberg is that it could be integrated into the ExpireSnapshots and Metadata Table frameworks.
>>
>> Curious what people think?
>>
>> Thanks
>> Szehon
>>
>> On Wed, Jul 10, 2024 at 1:44 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>
>>> > I believe DeleteOrphanFiles may be ok as is, because currently the logic walks down the reachable graph and marks those metadata files as 'not-orphan', so it should naturally walk these 'expired' snapshots as well.
>>>
>>> We need to keep the metadata files, but remove the data files if they are not removed for whatever reason. Doable, but a logic change.
>>>
>>> > You mean purging expired snapshots in the middle of the history, right? I think the current mechanism for this is 'tagging' and 'branching'.
>>>
>>> I think for most users the compaction commits are technical details which they would like to avoid / don't want to see. The real table history is only the changes initiated by the user, and it would be good to hide the technical/compaction commits.
>>>
>>> On Wed, Jul 10, 2024, 08:52 himadri pal <meh...@gmail.com> wrote:
>>>
>>>> Hi Szehon,
>>>>
>>>> This is a good idea considering the use case it intends to solve. I added a few questions and comments in the design doc.
>>>>
>>>> IMO, the alternate options considered in the design doc look cleaner to me.
>>>>
>>>> I think it might add to the maintenance burden, now that we need to remember to remove these metadata-only snapshots.
>>>>
>>>> Also, I wonder whether some of the use cases it intends to address are solvable by metadata alone, e.g. how much data was added in a given time range? Maybe to answer these kinds of questions users would prefer to create KPIs using columns in the dataset.
>>>>
>>>> Regards,
>>>> Himadri Pal
>>>>
>>>> On Tue, Jul 9, 2024 at 11:10 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>>
>>>>> I am not totally convinced of the motivation yet.
>>>>>
>>>>> I thought the snapshot retention window was primarily meant for time travel and troubleshooting table changes that happened recently (like a few days or weeks).
>>>>>
>>>>> Is it valuable enough to keep expired snapshots for as long as months or years? While metadata files are typically smaller than data files in total size, they can still be significant considering the default amount of column stats written today (especially for wide tables with many columns).
>>>>>
>>>>> How long are we going to keep the expired snapshot references by default? If it is months/years, it can have major implications on the query performance of metadata tables (like snapshots, all_*).
>>>>>
>>>>> I assume it will also have some performance impact on table loading, as a lot more expired snapshots are still referenced.
>>>>>
>>>>> On Tue, Jul 9, 2024 at 6:36 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Peter and Yufei.
>>>>>>
>>>>>> Yes, in terms of implementation, I noted in the doc that we need to add error checks to prevent time-travel / rollback / cherry-pick operations on 'expired' snapshots. I'll make it more clear in the doc which operations we need to check against.
>>>>>>
>>>>>> I believe DeleteOrphanFiles may be ok as is, because currently the logic walks down the reachable graph and marks those metadata files as 'not-orphan', so it should naturally walk these 'expired' snapshots as well.
>>>>>>
>>>>>> So I think the main implementation changes are going to be adding error checks in those Table APIs and updating the ExpireSnapshots API.
>>>>>>
>>>>>>> Do we want to consider expiring snapshots in the middle of the history of the table?
>>>>>>
>>>>>> You mean purging expired snapshots in the middle of the history, right? I think the current mechanism for this is 'tagging' and 'branching'. Interestingly, I was thinking it's related to your other question: if we don't add an error check for 'tagging' and 'branching' on 'expired' snapshots, they could be handled just as other snapshots are handled today. That's one option. We could also support it subsequently, after the first version and if there's some usage of this.
>>>>>>
>>>>>> One thing that comes up in this thread and the Google doc is the question of the size of preserved metadata. I put in the Alternatives section that we could potentially make the ExpireSnapshots purge boolean argument more nuanced, like PURGE, PRESERVE_REFS (snapshot refs are preserved), and PRESERVE_METADATA (snapshot refs and all metadata files are preserved), though I am still debating if it's worth it, as users could choose not to use this feature.
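>>>>>>
>>>>>> For illustration, a rough sketch of what that more nuanced option could look like (the ExpireMode enum and the mode() method are hypothetical; today the Java API only exposes a boolean via ExpireSnapshots.cleanExpiredFiles):
>>>>>>
>>>>>>   // Hypothetical modes replacing the current boolean purge flag
>>>>>>   public enum ExpireMode {
>>>>>>     PURGE,             // today's behavior: drop snapshot metadata and delete unreferenced files
>>>>>>     PRESERVE_REFS,     // keep only the references of expired snapshots
>>>>>>     PRESERVE_METADATA  // keep refs plus all manifest-list/manifest files
>>>>>>   }
>>>>>>
>>>>>>   // Hypothetical usage on the existing builder (table is an org.apache.iceberg.Table)
>>>>>>   long cutoff = System.currentTimeMillis() - 5 * 24 * 60 * 60 * 1000L;  // 5 days
>>>>>>   table.expireSnapshots()
>>>>>>       .expireOlderThan(cutoff)
>>>>>>       .mode(ExpireMode.PRESERVE_REFS)  // hypothetical, not in the current API
>>>>>>       .commit();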
>>>>>>
>>>>>> Thanks
>>>>>> Szehon
>>>>>>
>>>>>> On Tue, Jul 9, 2024 at 6:02 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>>>>>
>>>>>>> Thank you for the interesting proposal. With a minor specification change, it could indeed enable different retention periods for data files and metadata files. This differentiation is useful for two reasons:
>>>>>>>
>>>>>>> 1. More metadata helps us better understand the table history, providing valuable insights.
>>>>>>> 2. Users often prioritize data file deletion, as it frees up significant storage space and removes potentially sensitive data.
>>>>>>>
>>>>>>> However, adding a boolean property to the specification isn't necessarily a lightweight solution. As Peter mentioned, implementing this change requires modifications in several places. In this context, external systems like LakeChime or a REST catalog implementation could offer effective solutions to manage extended metadata retention periods, without spec changes.
>>>>>>>
>>>>>>> I am neutral on this proposal (+0) and look forward to seeing more input from people.
>>>>>>>
>>>>>>> Yufei
>>>>>>>
>>>>>>> On Mon, Jul 8, 2024 at 10:32 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>
>>>>>>>> We need to handle expired snapshots differently in several places in Iceberg core as well:
>>>>>>>> - We need to add checks to prevent scans from reading these snapshots, and throw a meaningful error.
>>>>>>>> - We need to add checks to prevent tagging/branching these snapshots.
>>>>>>>> - We need to update DeleteOrphanFiles in Spark/Flink to not consider files only referenced by the expired snapshots.
>>>>>>>>
>>>>>>>> Some Flink jobs do frequent commits, and in these cases the size of the metadata file becomes a constraining factor too. Here we could just say not to use this feature and expire the metadata as we do now, but I thought it was worth mentioning.
>>>>>>>>
>>>>>>>> Do we want to consider expiring snapshots in the middle of the history of the table? When we compact the table, the compaction commits litter the real history of the table. Consider the following:
>>>>>>>> - S1 writes some data
>>>>>>>> - S2 writes some more data
>>>>>>>> - S3 compacts the previous 2 commits
>>>>>>>> - S4 writes even more data
>>>>>>>>
>>>>>>>> From the query engine user's perspective, S3 is a commit which does nothing, was not initiated by the user, and is most probably one they don't even want to know of. If one could expire a snapshot from the middle of the history, that would be nice, so users would see only S1/S2/S4. The only downside is that reading S2 is less performant than reading S3, but IMHO this could be acceptable for having only user-driven changes in the table history.
>>>>>>>>
>>>>>>>> On Mon, Jul 8, 2024, 20:15 Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the comments so far. I also thought previously that this functionality would be in an external system, like LakeChime, or a custom catalog extension. But after doing an initial analysis (please double check), I thought it's a small enough change that it would be worth putting in the Iceberg spec/APIs for all users:
>>>>>>>>>
>>>>>>>>> - Table Spec: only one optional boolean field (on Snapshot, only set if the functionality is used).
>>>>>>>>> - API: only one boolean parameter (on ExpireSnapshots).
>>>>>>>>>
>>>>>>>>>> I do wonder, will keeping expired snapshots as is slow down manifest/scan planning though (REST catalog approaches could probably mitigate this)?
>>>>>>>>>
>>>>>>>>> I think it should not slow down manifest/scan planning, because we plan using the current snapshot (or the one we specify via time travel), and we wouldn't read expired snapshots in this case.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Szehon
>>>>>>>>>
>>>>>>>>> On Mon, Jul 8, 2024 at 10:54 AM John Greene <jgreene1...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I do agree with the need that this proposal solves, to decouple the snapshot history from the data deletion. I do wonder, will keeping expired snapshots as is slow down manifest/scan planning though (REST catalog approaches could probably mitigate this)?
>>>>>>>>>>
>>>>>>>>>> On Mon, Jul 8, 2024, 5:34 AM Piotr Findeisen <piotr.findei...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Szehon, Walaa,
>>>>>>>>>>>
>>>>>>>>>>> Thanks Szehon for bringing this up, and thank you Walaa for providing more context from a similar existing solution to the problem. The choices that LakeChime seems to have made -- to keep information in a separate RDBMS, and which particular metadata information to retain -- do indeed look use-case specific, until we observe repeating patterns.
>>>>>>>>>>>
>>>>>>>>>>> The idea to bake lifecycle changes into the table format spec was proposed as an alternative to the idea of baking lifecycle changes into the REST catalog spec. It was brought into the discussion based on the intuition that the REST catalog is a first-class citizen in the Iceberg world, just like other catalogs, and so solutions to table-centric problems do not need to be limited to the REST catalog. What information we retain, and how/whether this is configurable, are open questions applicable to both avenues.
>>>>>>>>>>>
>>>>>>>>>>> As a third alternative, we could focus on REST catalog *extensions*, without naming the snapshot metadata lifecycle, and leave the problem up to REST's implementors. That would mean the Iceberg project doesn't address the snapshot metadata lifecycle topic directly, but instead gives users tools to build solutions around it. At this point I am not trying to judge whether it's a good idea or not. It probably depends on how important it is to solve the problem and have a common solution.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Piotr
>>>>>>>>>>>
>>>>>>>>>>> On Sat, 6 Jul 2024 at 09:46, Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Szehon,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for sharing this proposal.
>>>>>>>>>>>> We have thought along the same lines and implemented an external system (LakeChime [1]) that retains snapshot + partition metadata for longer (the actual internal implementation keeps data for 13 months, but that can be tuned). For efficient analysis, we have kept this data in an RDBMS. My opinion is this may be a better fit for an external system (similar to LakeChime), since it could potentially complicate the Iceberg spec, APIs, or their implementations. Also, the type of metadata tracked can differ depending on the use case. For example, while LakeChime retains partition and operation type metadata, it does not track file-level metadata, as there was no specific use case for that.
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://www.linkedin.com/blog/engineering/data-management/lakechime-a-data-trigger-service-for-modern-data-lakes
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jul 5, 2024 at 11:49 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi folks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would like to discuss an idea for an optional extension of Iceberg's Snapshot metadata lifecycle. Thanks Piotr for replying on the other thread that this should be a fuller Iceberg format change.
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Proposal Summary*
>>>>>>>>>>>>>
>>>>>>>>>>>>> Currently, ExpireSnapshots(long olderThan) purges the metadata and deleted data of a Snapshot together. Purging deleted data often requires a shorter timeline, due to strict requirements to claw back unused disk space, fulfill data lifecycle compliance, etc. In many deployments, this means the 'olderThan' timestamp is set to just a few days before the current time (the default is 5 days).
>>>>>>>>>>>>>
>>>>>>>>>>>>> On the other hand, purging metadata could ideally be done on a more relaxed timeline, such as months or more, to allow for meaningful historical table analysis.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We should have an optional way to purge Snapshot metadata separately from purging deleted data. This would allow us to get the history of the table and answer questions like:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - When was a file/partition added?
>>>>>>>>>>>>> - When was a file/partition deleted?
>>>>>>>>>>>>> - How much data was added or removed in time X?
>>>>>>>>>>>>>
>>>>>>>>>>>>> questions that are currently only answerable for data operations within the last few days.
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Github Proposal*: https://github.com/apache/iceberg/issues/10646
>>>>>>>>>>>>> *Google Design Doc*: https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit
>>>>>>>>>>>>>
>>>>>>>>>>>>> Curious if anyone has thought along these lines and/or sees obvious issues. Would appreciate any feedback on the proposal.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Szehon
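>>>>>>>>>>>>>
>>>>>>>>>>>>> P.S. To make the coupling concrete, a minimal sketch of how today's behavior looks through the Java API, and where the proposed option could slot in (preserveExpiredRefs is hypothetical, not an existing method):
>>>>>>>>>>>>>
>>>>>>>>>>>>>   // Today: one operation both expires the snapshot metadata and deletes the
>>>>>>>>>>>>>   // files no longer referenced, so both share a single retention window.
>>>>>>>>>>>>>   long cutoff = System.currentTimeMillis() - 5 * 24 * 60 * 60 * 1000L;  // ~5 days
>>>>>>>>>>>>>   table.expireSnapshots()        // table is an org.apache.iceberg.Table
>>>>>>>>>>>>>       .expireOlderThan(cutoff)
>>>>>>>>>>>>>       .commit();
>>>>>>>>>>>>>
>>>>>>>>>>>>>   // Proposed: keep clawing back data files on the short window, but retain
>>>>>>>>>>>>>   // the expired snapshots' metadata for longer-term history queries.
>>>>>>>>>>>>>   table.expireSnapshots()
>>>>>>>>>>>>>       .expireOlderThan(cutoff)
>>>>>>>>>>>>>       .preserveExpiredRefs(true)  // hypothetical flag for the proposed behavior
>>>>>>>>>>>>>       .commit();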