Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-08-06 Thread Yufei Gu
> Yes, it won't impact reads/writes but it may become a bottleneck for other operations that need that information. We can set a limit to allow a certain number of snapshots, and purge old items in each commit just like what we did for metadata logs. I admit that it doesn't seem like an elegant so

Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-08-06 Thread Anton Okolnychyi
I agree it is unfortunate to not be able to find the snapshot information from a manifest entry when the original snapshot is expired even though we still know the snapshot ID that added the file. I am not sure about a separate JSON file, though. It is still JSON and I bet people will store the sna

Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-08-05 Thread Yufei Gu
Thanks Szehon for the new proposal. I think it is a useful feature with the least spec change. A candidate for v3 spec? Yufei On Tue, Jul 16, 2024 at 3:02 PM Szehon Ho wrote: > Hi, > > Thanks for reading through the proposal and the good feedback. I was > thinking about the mentioned concerns

Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-16 Thread Szehon Ho
Hi, Thanks for reading through the proposal and the good feedback. I was thinking about the mentioned concerns: - The motivation for the change - Too much additional metadata (storage overhead, namenode pressure on HDFS) - Performance impact for read/writing TableMetadata - Some im

Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-10 Thread Péter Váry
> I believe DeleteOrphanFiles may be ok as is, because currently the logic walks down the reachable graph and marks those metadata files as 'not-orphan', so it should naturally walk these 'expired' snapshots as well. We need to keep the metadata files, but remove data files if they are not removed

Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-09 Thread himadri pal
Hi Szehon, This is a good idea considering the use case it intends to solve. I added a few questions and comments in the design doc. IMO, the alternate options considered in the design doc look cleaner to me. I think it might add to the maintenance burden, now that we need to remember to remove

Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-09 Thread Steven Wu
I am not totally convinced of the motivation yet. I thought the snapshot retention window is primarily meant for time travel and troubleshooting table changes that happened recently (like a few days or weeks). Is it valuable enough to keep expired snapshots for as long as months or years? While m

Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-09 Thread Szehon Ho
Thanks Peter and Yufei. Yes, in terms of implementation, I noted in the doc we need to add error checks to prevent time-travel / rollback / cherry-pick operations to 'expired' snapshots. I'll make it more clear in the doc, which operations we need to check against. I believe DeleteOrphanFiles ma

Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-09 Thread Yufei Gu
Thank you for the interesting proposal. With a minor specification change, it could indeed enable different retention periods for data files and metadata files. This differentiation is useful for two reasons: 1. More metadata helps us better understand the table history, providing valuable i

Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-08 Thread Péter Váry
We need to handle expired snapshots differently in several places in Iceberg core as well. - We need to add checks that prevent scans from reading these snapshots and throw a meaningful error. - We need to add checks to prevent tagging/branching these snapshots. - We need to update DeleteOrphanFiles in Spark/

Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-08 Thread Szehon Ho
Thanks for the comments so far. I also thought previously that this functionality would be in an external system, like LakeChime, or a custom catalog extension. But after doing an initial analysis (please double check), I thought it's a small enough change that it would be worth putting in the Ic

Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-08 Thread John Greene
I do agree with the need that this proposal solves, to decouple the snapshot history from the data deletion. I do wonder, will keeping expired snapshots as is slow down manifest/scan planning though (REST catalog approaches could probably mitigate this)? On Mon, Jul 8, 2024, 5:34 AM Piotr Findeise

Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-08 Thread Piotr Findeisen
Hi Szehon, Walaa, Thank you Szehon for bringing this up. And thank you Walaa for providing more context from a similar existing solution to the problem. The choices that LakeChime seems to have made -- to keep information in a separate RDBMS and which particular metadata information to retain -- they indee

Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-06 Thread Walaa Eldin Moustafa
Hi Szehon, Thanks for sharing this proposal. We have thought along the same lines and implemented an external system (LakeChime [1]) that retains snapshot + partition metadata for longer (actual internal implementation keeps data for 13 months, but that can be tuned). For efficient analysis, we ha