> Yes, it won't impact reads/writes but it may become a bottleneck for other operations that need that information.
We can set a limit to allow a certain number of snapshots, and purge old items in each commit just like what we did for metadata logs. I admit that it doesn't seem like an elegant solution, but it may solve most of the problems.

> I'd be curious to hear more from people who have experience implementing the REST catalog API.

A REST catalog can definitely preserve snapshot entries for a long period, but we still need an interface/spec to allow metadata table queries (on the client side) to reference these expired entries.

Yufei

On Tue, Aug 6, 2024 at 11:30 AM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:

> I agree it is unfortunate to not be able to find the snapshot information from a manifest entry when the original snapshot is expired, even though we still know the snapshot ID that added the file. I am not sure about a separate JSON file, though. It is still JSON and I bet people will store the snapshot history forever, so the size of that file will gradually increase. Yes, it won't impact reads/writes but it may become a bottleneck for other operations that need that information. Using Parquet may help but I am not sure that's the right approach overall.
>
> I'd be curious to hear more from people who have experience implementing the REST catalog API. It seems like most implementations have addressed that or at least have a way to do that.
>
> - Anton
>
> On Mon, Aug 5, 2024 at 18:12 Yufei Gu <flyrain...@gmail.com> wrote:
>
>> Thanks Szehon for the new proposal. I think it is a useful feature with the least spec change. A candidate for the v3 spec?
>>
>> Yufei
>>
>> On Tue, Jul 16, 2024 at 3:02 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Thanks for reading through the proposal and the good feedback.
>>> I was thinking about the mentioned concerns:
>>>
>>> - The motivation for the change
>>> - Too much additional metadata (storage overhead, namenode pressure on HDFS)
>>> - Performance impact for reading/writing TableMetadata
>>> - Some impact to existing Table APIs and maintenance procedures, which would have to check for these snapshots
>>>
>>> I chatted a bit offline with Yufei to brainstorm, and I wrote a V2 of the proposal at the same link: https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit. I also tried to clarify the motivation in the doc with actual metadata table queries that would become possible.
>>>
>>> This version now simply adds an optional 'expired-snapshots-path' that contains the metadata of expired Snapshots. I think this should address the above concerns:
>>>
>>> - Minimal storage overhead for just snapshot references (capped). I no longer propose to keep old snapshot manifest-list/manifest files; the snapshot reference to the expired snapshot should be a good start.
>>> - Minimal perf overhead for reading/writing TableMetadata. The additional file is only written by ExpireSnapshots if the feature is enabled, and only read on demand (via a metadata table query, for example).
>>> - No impact to other Table APIs or maintenance procedures (as these don't show up in the regular table.snapshots() list anymore).
>>> - Only an additive, optional spec change (backwards compatible).
>>>
>>> Of course, again, this feature is possible outside Iceberg, but the advantage of doing it in Iceberg is that it could be integrated into the ExpireSnapshots and Metadata Table frameworks.
>>>
>>> Curious what people think?
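[Editor's note] As a rough illustration of the V2 idea above, the file behind 'expired-snapshots-path' could hold capped, lightweight snapshot references that ExpireSnapshots appends to and trims, much as metadata-log entries are trimmed today. This sketch is purely hypothetical: the field names, the cap, and the function are not from the proposal or the Iceberg spec.

```python
import json

# Illustrative cap, in the spirit of write.metadata.previous-versions-max
# (the actual property name and default, if any, would be decided by the spec).
MAX_EXPIRED_REFS = 100

def record_expired_snapshots(existing_json, expired_snapshots):
    """Append references for newly expired snapshots and trim the oldest
    entries beyond the cap, like metadata-log purging on each commit."""
    doc = json.loads(existing_json) if existing_json else {"expired-snapshots": []}
    for snap in expired_snapshots:
        # Keep only a compact reference, not manifest-list/manifest files.
        doc["expired-snapshots"].append({
            "snapshot-id": snap["snapshot-id"],
            "timestamp-ms": snap["timestamp-ms"],
        })
    # Retain only the newest MAX_EXPIRED_REFS references.
    doc["expired-snapshots"] = doc["expired-snapshots"][-MAX_EXPIRED_REFS:]
    return json.dumps(doc)
```

The point of the cap is Anton's bottleneck concern: the file stays bounded no matter how long the table lives, while still letting a metadata table query resolve recently expired snapshot IDs on demand.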
>>>
>>> Thanks
>>> Szehon
>>>
>>> On Wed, Jul 10, 2024 at 1:44 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>
>>>> > I believe DeleteOrphanFiles may be ok as is, because currently the logic walks down the reachable graph and marks those metadata files as 'not-orphan', so it should naturally walk these 'expired' snapshots as well.
>>>>
>>>> We need to keep the metadata files, but remove the data files if they are not removed for whatever reason. Doable, but a logic change.
>>>>
>>>> > You mean purging expired snapshots in the middle of the history, right? I think the current mechanism for this is 'tagging' and 'branching'.
>>>>
>>>> I think for most users the compaction commits are technical details which they would like to avoid / don't want to see. The real table history is only the changes initiated by the user, and it would be good to hide the technical/compaction commits.
>>>>
>>>> On Wed, Jul 10, 2024, 08:52 himadri pal <meh...@gmail.com> wrote:
>>>>
>>>>> Hi Szehon,
>>>>>
>>>>> This is a good idea considering the use case it intends to solve. I added a few questions and comments in the design doc.
>>>>>
>>>>> IMO, the alternate options considered in the design doc look cleaner to me.
>>>>>
>>>>> I think it might add to the maintenance burden, now that we need to remember to remove these metadata-only snapshots.
>>>>>
>>>>> Also, I wonder whether some of the use cases it intends to address are solvable by metadata alone, e.g. how much data was added in a given time range? Maybe to answer these kinds of questions a user would prefer to create KPIs using columns in the dataset.
>>>>>
>>>>> Regards,
>>>>> Himadri Pal
>>>>>
>>>>> On Tue, Jul 9, 2024 at 11:10 PM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>
>>>>>> I am not totally convinced of the motivation yet.
>>>>>>
>>>>>> I thought the snapshot retention window is primarily meant for time travel and for troubleshooting table changes that happened recently (like a few days or weeks).
>>>>>>
>>>>>> Is it valuable enough to keep expired snapshots for as long as months or years? While metadata files are typically smaller than data files in total size, they can still be significant considering the default amount of column stats written today (especially for wide tables with many columns).
>>>>>>
>>>>>> How long are we going to keep the expired snapshot references by default? If it is months/years, it can have major implications on the query performance of metadata tables (like snapshots, all_*).
>>>>>>
>>>>>> I assume it will also have some performance impact on table loading, as a lot more expired snapshots are still referenced.
>>>>>>
>>>>>> On Tue, Jul 9, 2024 at 6:36 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Peter and Yufei.
>>>>>>>
>>>>>>> Yes, in terms of implementation, I noted in the doc that we need to add error checks to prevent time-travel / rollback / cherry-pick operations on 'expired' snapshots. I'll make it clearer in the doc which operations we need to check against.
>>>>>>>
>>>>>>> I believe DeleteOrphanFiles may be ok as is, because currently the logic walks down the reachable graph and marks those metadata files as 'not-orphan', so it should naturally walk these 'expired' snapshots as well.
>>>>>>>
>>>>>>> So, I think the main changes in terms of implementation are going to be adding error checks in those Table APIs, and updating the ExpireSnapshots API.
>>>>>>>
>>>>>>>> Do we want to consider expiring snapshots in the middle of the history of the table?
>>>>>>>
>>>>>>> You mean purging expired snapshots in the middle of the history, right? I think the current mechanism for this is 'tagging' and 'branching'. So interestingly, I was thinking it's related to your other question: if we don't add an error check for 'tagging' and 'branching' on an 'expired' snapshot, it could be handled just as they are handled today for other snapshots. It's one option. We could also support it subsequently, after the first version and if there's some usage of this.
>>>>>>>
>>>>>>> One thing that comes up in this thread and the Google doc is the question of the size of preserved metadata. I had put in the Alternatives section that we could potentially make the ExpireSnapshots purge boolean argument more nuanced, like PURGE, PRESERVE_REFS (snapshot refs are preserved), PRESERVE_METADATA (snapshot refs and all metadata files are preserved), though I am still debating if it's worth it, as users could choose not to use this feature.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Szehon
>>>>>>>
>>>>>>> On Tue, Jul 9, 2024 at 6:02 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thank you for the interesting proposal. With a minor specification change, it could indeed enable different retention periods for data files and metadata files. This differentiation is useful for two reasons:
>>>>>>>>
>>>>>>>> 1. More metadata helps us better understand the table history, providing valuable insights.
>>>>>>>> 2. Users often prioritize data file deletion, as it frees up significant storage space and removes potentially sensitive data.
>>>>>>>>
>>>>>>>> However, adding a boolean property to the specification isn't necessarily a lightweight solution. As Peter mentioned, implementing this change requires modifications in several places.
>>>>>>>> In this context, external systems like LakeChime or a REST catalog implementation could offer effective solutions to manage extended metadata retention periods, without spec changes.
>>>>>>>>
>>>>>>>> I am neutral on this proposal (+0) and look forward to seeing more input from people.
>>>>>>>>
>>>>>>>> Yufei
>>>>>>>>
>>>>>>>> On Mon, Jul 8, 2024 at 10:32 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> We need to handle expired snapshots differently in several places in Iceberg core as well:
>>>>>>>>> - We need to add checks to prevent scans from reading these snapshots, and throw a meaningful error.
>>>>>>>>> - We need to add checks to prevent tagging/branching these snapshots.
>>>>>>>>> - We need to update DeleteOrphanFiles in Spark/Flink to not consider files only referenced by the expired snapshots.
>>>>>>>>>
>>>>>>>>> Some Flink jobs do frequent commits, and in these cases the size of the metadata file becomes a constraining factor too. In this case we could just tell users not to use this feature, and expire the metadata as we do now, but I thought it was worth mentioning.
>>>>>>>>>
>>>>>>>>> Do we want to consider expiring snapshots in the middle of the history of the table? When we compact the table, the compaction commits litter the real history of the table. Consider the following:
>>>>>>>>> - S1 writes some data
>>>>>>>>> - S2 writes some more data
>>>>>>>>> - S3 compacts the previous 2 commits
>>>>>>>>> - S4 writes even more data
>>>>>>>>> From the query engine user's perspective, S3 is a commit which does nothing, was not initiated by the user, and is most probably one they don't even want to know about. If one could expire a snapshot from the middle of the history, that would be nice, so users would see only S1/S2/S4.
>>>>>>>>> The only downside is that reading S2 is less performant than reading S3, but IMHO this could be acceptable for having only user-driven changes in the table history.
>>>>>>>>>
>>>>>>>>> On Mon, Jul 8, 2024, 20:15 Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the comments so far. I also thought previously that this functionality would be in an external system, like LakeChime, or a custom catalog extension. But after doing an initial analysis (please double check), I thought it's a small enough change that it would be worth putting in the Iceberg spec/APIs for all users:
>>>>>>>>>>
>>>>>>>>>> - Table Spec: only one optional boolean field (on Snapshot, only set if the functionality is used).
>>>>>>>>>> - API: only one boolean parameter (on ExpireSnapshots).
>>>>>>>>>>
>>>>>>>>>>> I do wonder, will keeping expired snapshots as is slow down manifest/scan planning though (REST catalog approaches could probably mitigate this)?
>>>>>>>>>>
>>>>>>>>>> I think it should not slow down manifest/scan planning, because we plan using the current snapshot (or the one we specify via time travel), and we wouldn't read expired snapshots in this case.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Szehon
>>>>>>>>>>
>>>>>>>>>> On Mon, Jul 8, 2024 at 10:54 AM John Greene <jgreene1...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I do agree with the need that this proposal solves, to decouple the snapshot history from the data deletion. I do wonder, will keeping expired snapshots as is slow down manifest/scan planning though (REST catalog approaches could probably mitigate this)?
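[Editor's note] Péter's S1/S2/S3/S4 compaction example earlier in the thread can be sketched concretely. Iceberg summarizes rewrite/compaction commits with the 'replace' operation, which is what a middle-of-history expiry (or a display filter) could key on; the dict shape and function below are illustrative only, not proposed API.

```python
# Hypothetical snapshot history mirroring Péter's example.
history = [
    {"id": "S1", "operation": "append"},   # user writes some data
    {"id": "S2", "operation": "append"},   # user writes some more data
    {"id": "S3", "operation": "replace"},  # compaction of S1+S2; no logical change
    {"id": "S4", "operation": "append"},   # user writes even more data
]

def user_visible_history(snapshots):
    """Hide commits that only rewrite existing data ('replace'),
    keeping only the user-initiated changes in the visible history."""
    return [s["id"] for s in snapshots if s["operation"] != "replace"]
```

Here `user_visible_history(history)` yields `['S1', 'S2', 'S4']`, matching the history Péter says users actually care about; the trade-off he notes is that reading at S2 is slower than reading the compacted S3.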
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jul 8, 2024, 5:34 AM Piotr Findeisen <piotr.findei...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Szehon, Walaa,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks Szehon for bringing this up. And thank you Walaa for providing more context from a similar existing solution to the problem. The choices that LakeChime seems to have made -- to keep information in a separate RDBMS, and which particular metadata information to retain -- do indeed look use-case specific, until we observe repeating patterns.
>>>>>>>>>>>>
>>>>>>>>>>>> The idea to bake lifecycle changes into the table format spec was proposed as an alternative to the idea of baking lifecycle changes into the REST catalog spec. It was brought into the discussion based on the intuition that the REST catalog is a first-class citizen in the Iceberg world, just like other catalogs, and so solutions to table-centric problems do not need to be limited to the REST catalog. What information we retain, and how/whether this is configurable, are open questions applicable to both avenues.
>>>>>>>>>>>>
>>>>>>>>>>>> As a third alternative, we could focus on REST catalog *extensions*, without naming snapshot metadata lifecycle, and leave the problem up to REST implementors. That would mean the Iceberg project doesn't address the snapshot metadata lifecycle topic directly, but instead gives users tools to build solutions around it. At this point I am not trying to judge whether it's a good idea or not. It probably depends on how important it is to solve the problem and have a common solution.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Piotr
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, 6 Jul 2024 at 09:46, Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Szehon,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for sharing this proposal. We have thought along the same lines and implemented an external system (LakeChime [1]) that retains snapshot + partition metadata for longer (the actual internal implementation keeps data for 13 months, but that can be tuned). For efficient analysis, we have kept this data in an RDBMS. My opinion is this may be a better fit for an external system (similar to LakeChime), since it could potentially complicate the Iceberg spec, APIs, or their implementations. Also, the type of metadata tracked can differ depending on the use case. For example, while LakeChime retains partition and operation type metadata, it does not track file-level metadata, as there was no specific use case for that.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] https://www.linkedin.com/blog/engineering/data-management/lakechime-a-data-trigger-service-for-modern-data-lakes
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Walaa.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Jul 5, 2024 at 11:49 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi folks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would like to discuss an idea for an optional extension of Iceberg's Snapshot metadata lifecycle. Thanks Piotr for replying on the other thread that this should be a fuller Iceberg format change.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Proposal Summary*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Currently, ExpireSnapshots(long olderThan) purges the metadata and the deleted data of a Snapshot together. Purging deleted data often requires a shorter timeline, due to strict requirements to claw back unused disk space, fulfill data lifecycle compliance, etc. In many deployments, this means the 'olderThan' timestamp is set to just a few days before the current time (the default is 5 days).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On the other hand, purging metadata could ideally be done on a more relaxed timeline, such as months or more, to allow for meaningful historical table analysis.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We should have an optional way to purge Snapshot metadata separately from purging deleted data. This would allow us to keep the history of the table, and answer questions like:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - When was a file/partition added?
>>>>>>>>>>>>>> - When was a file/partition deleted?
>>>>>>>>>>>>>> - How much data was added or removed in time X?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> These are currently only answerable for data operations within the last few days.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Github Proposal*: https://github.com/apache/iceberg/issues/10646
>>>>>>>>>>>>>> *Google Design Doc*: https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Curious if anyone has thought along these lines and/or sees obvious issues. Would appreciate any feedback on the proposal.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Szehon
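[Editor's note] The historical questions in the proposal summary can be made concrete with a small sketch. Assuming retained snapshot metadata exposes Iceberg-style summary keys such as 'added-records' (summary values are strings in Iceberg metadata), a "how much data was added in time X" query over the retained history might look like this; the list-of-dicts shape is illustrative, not an actual API.

```python
def added_records_between(snapshots, start_ms, end_ms):
    """Total records added by snapshots committed in [start_ms, end_ms).

    'timestamp-ms' and the 'added-records' summary key follow Iceberg's
    snapshot metadata conventions; the input structure is hypothetical.
    """
    return sum(
        int(s["summary"].get("added-records", 0))
        for s in snapshots
        if start_ms <= s["timestamp-ms"] < end_ms
    )

# Example: two appends inside the window, one snapshot with no summary entry.
snaps = [
    {"timestamp-ms": 100, "summary": {"added-records": "10"}},
    {"timestamp-ms": 200, "summary": {"added-records": "5"}},
    {"timestamp-ms": 300, "summary": {}},
]
```

With today's default retention, such a query only works for the last few days of snapshots; retaining expired snapshot metadata would extend the answerable window to months.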