Yeah, for the original use case in this thread, I agree it's delete (soft) + expire (physical, permanent).
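
To make the two steps concrete, here is a toy Python model of "delete (soft) + expire (physical)". It is only a sketch under stated assumptions: `ToyTable` and its methods are made up for illustration and are not Iceberg's API, and a snapshot is reduced to a frozen set of file paths.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table: immutable snapshots over a file store."""
    storage: set = field(default_factory=set)      # files physically present
    snapshots: list = field(default_factory=list)  # each snapshot = frozenset of files

    def commit(self, files):
        # Every change appends a new snapshot; older ones are never modified.
        self.snapshots.append(frozenset(files))

    def append(self, files):
        self.storage |= set(files)
        current = self.snapshots[-1] if self.snapshots else frozenset()
        self.commit(current | set(files))

    def delete_partition(self, dt):
        # "Soft" step: the new snapshot drops the files, storage is untouched.
        prefix = f"dt={dt}/"
        self.commit(f for f in self.snapshots[-1] if not f.startswith(prefix))

    def expire(self, retain_last=1):
        # "Physical" step: drop old snapshots, then remove unreferenced files.
        if retain_last < 1:
            raise ValueError("retain_last must be at least 1")
        self.snapshots = self.snapshots[-retain_last:]
        live = set().union(*self.snapshots)
        removed = self.storage - live
        self.storage &= live
        return removed


table = ToyTable()
table.append(["dt=2023-05-01/a.parquet"])
table.append(["dt=2023-05-02/b.parquet"])
table.delete_partition("2023-05-01")

# After the delete, the file is still on disk and the previous snapshot still lists it.
still_recoverable = "dt=2023-05-01/a.parquet" in table.storage

# Expiring down to the latest snapshot is what physically removes the data.
removed = table.expire(retain_last=1)
print(still_recoverable, sorted(removed), len(table.snapshots))
```

In this model the partition stays recoverable until `expire` runs, and retaining only the latest snapshot is exactly what discards the rest of the history.
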
I guess I should have phrased my thought better. I was replying to Ryan's question above:

> We don't often have people ask to keep snapshots that can't be read

and had thought it'd be nice to have an ExpireSnapshot mode where we keep older metadata for longer periods of time, beyond physical expiration. But the main use case I had was table historical analysis (for example, the last update time for each partition, or how many snapshots this table ever had). It's more of a nice-to-have, and I'm definitely not sure it is a very compelling use case.

Another option, I guess, is that a custom catalog can keep this historical information around.

Thanks
Szehon

On Fri, Jun 2, 2023 at 10:28 PM Russell Spitzer <russell.spit...@gmail.com> wrote:

> I think "soft-mode" is really just doing the delete. You can then recover
> the snapshot if you happen to have accidentally TTL'd a partition.
>
> On Fri, Jun 2, 2023 at 8:51 AM Szehon Ho <szehon.apa...@gmail.com> wrote:
>
>> I think this violates Iceberg’s assumption of immutable snapshots. That
>> would require modifying the old snapshot to no longer point to those gc’ed
>> data files, else I'm not sure how you can time-travel to read from that
>> snapshot if some of its files are deleted.
>>
>> That being said, I also had this thought at some point, to keep snapshot
>> info around longer. I expect most organizations operate in a mode where
>> they expire snapshots after a few days, and reasonably expect any
>> time-travel or snapshot-related operation (like CDC) to happen within this
>> timeframe. And of course, they use tags to keep a snapshot from expiration.
>>
>> But there are some use cases where keeping more snapshot metadata for a
>> period longer than when it could be read could be interesting. For
>> example, if I want to know info about the snapshot that added each data
>> file, we have probably lost most of that snapshot metadata, as the files
>> were added long ago. One example is the frequent ask to find each
>> partition's last modified time (in an earlier email thread).
>>
>> I haven't thought it completely through, but it crossed my mind that a
>> ‘Soft’-mode of ExpireSnapshot may be useful, where we can delete data files
>> but just mark the snapshot’s metadata files as expired without physically
>> deleting them, and so retain the ability to answer these questions. It
>> could be done by adding an ‘expired-snapshots’ list to metadata.json. That
>> being said, it's a singular use case and I'm not sure if anyone else has
>> interest or another use case. It would add a bit of complexity.
>>
>> Thanks
>> Szehon
>>
>> On Fri, Jun 2, 2023 at 7:12 AM Pucheng Yang <py...@pinterest.com.invalid>
>> wrote:
>>
>>> Ryan,
>>>
>>> One use case is that the user might need to time travel to a certain
>>> snapshot. However, that snapshot is expired by a snapshot expiration
>>> that retains only the latest snapshot, and the only intent of that
>>> operation is to remove the GC'd partition. It seems a little
>>> overkill to me.
>>>
>>> I hope my explanation makes sense to you.
>>>
>>> On Thu, Jun 1, 2023 at 3:39 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> Pucheng,
>>>>
>>>> What is the use case around keeping the snapshot longer? We don't often
>>>> have people ask to keep snapshots that can't be read, so it sounds like you
>>>> might have something specific in mind?
>>>>
>>>> Ryan
>>>>
>>>> On Wed, May 31, 2023 at 8:19 PM Pucheng Yang
>>>> <py...@pinterest.com.invalid> wrote:
>>>>
>>>>> Hi community,
>>>>>
>>>>> In my organization, a big portion of the datasets are partitioned by
>>>>> date, and normally we keep the latest X dates of partitions for a given
>>>>> dataset.
>>>>>
>>>>> One issue that always bothers me is that if I want to delete a partition
>>>>> that should be GC'd, I will run the SQL query "delete from tbl where dt = ..."
>>>>> and do a snapshot expiration that keeps only the latest snapshot, to make
>>>>> sure that partition data is physically removed. However, the downside of
>>>>> this approach is that the table snapshot history will be completely lost.
>>>>>
>>>>> I wonder if anyone else in the community has the same pain point? How
>>>>> do you solve this? I would love to understand if there is a solution to
>>>>> this; otherwise we can brainstorm if there is a way to solve this.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Pucheng
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
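
The "soft" ExpireSnapshot mode floated in this thread can be sketched with a toy Python model. Everything here is an assumption for illustration: `soft_expire` is a hypothetical helper, the flat `files` and `partitions` fields are simplifications (real snapshots do not list files this way), and `expired-snapshots` is only the field name proposed in the thread, not something Iceberg is assumed to support.

```python
def soft_expire(metadata, storage, retain_last=1):
    """Hypothetical 'soft' expire: delete data files, keep snapshot summaries."""
    if retain_last < 1:
        raise ValueError("retain_last must be at least 1")
    expired = metadata["snapshots"][:-retain_last]
    retained = metadata["snapshots"][-retain_last:]
    live = {f for snap in retained for f in snap["files"]}

    # Physically remove files that no retained snapshot references.
    storage.intersection_update(live)

    # Keep descriptive fields only; drop the file list so that no
    # expired entry still points at deleted data.
    archive = metadata.setdefault("expired-snapshots", [])
    for snap in expired:
        archive.append({k: v for k, v in snap.items() if k != "files"})
    metadata["snapshots"] = retained
    return metadata


metadata = {
    "snapshots": [
        {"snapshot-id": 1, "timestamp-ms": 1000, "partitions": ["dt=2023-05-01"],
         "files": ["dt=2023-05-01/a.parquet"]},
        {"snapshot-id": 2, "timestamp-ms": 2000, "partitions": ["dt=2023-05-02"],
         "files": ["dt=2023-05-01/a.parquet", "dt=2023-05-02/b.parquet"]},
        {"snapshot-id": 3, "timestamp-ms": 3000, "partitions": ["dt=2023-05-01"],
         "files": ["dt=2023-05-02/b.parquet"]},  # the partition delete
    ]
}
storage = {"dt=2023-05-01/a.parquet", "dt=2023-05-02/b.parquet"}
soft_expire(metadata, storage, retain_last=1)

# Historical questions remain answerable from the retained summaries.
history = metadata["expired-snapshots"] + metadata["snapshots"]
last_modified = {}
for snap in history:
    for part in snap["partitions"]:
        last_modified[part] = max(last_modified.get(part, 0), snap["timestamp-ms"])
print(len(history), last_modified)
```

The expired entries are summaries rather than readable snapshots, so time travel to them is still not possible in this sketch; only questions such as the total snapshot count or each partition's last modified time survive.
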