I think "soft-mode" is really just doing the delete. You can then recover the snapshot if you happen to have accidentally TTL'd a partition.
On Fri, Jun 2, 2023 at 8:51 AM Szehon Ho <szehon.apa...@gmail.com> wrote:

> I think this violates Iceberg’s assumption of immutable snapshots. That
> would require modifying the old snapshot to no longer point to those gc’ed
> data files, else not sure how you can time-travel to read from that
> snapshot, if some of its files are deleted?
>
> That being said, I also had this thought at some point, to keep snapshot
> info around longer. I expect most organizations operate in a mode where
> they expire snapshots after a few days, and reasonably expect any
> time-travel or snapshot-related operation (like CDC) to happen within this
> timeframe. And of course, use tags to keep the snapshot from expiration.
>
> But there are some use-cases where keeping more snapshot metadata for a
> period longer than when it could be read could be interesting. For
> example, if I want to know info about the snapshot that added each data
> file, we probably have lost most of those snapshot metadata as they were
> added long ago. Example, the frequent ask to find each partition's last
> modified time, (in an earlier email thread).
>
> I haven't thought it completely through, but it crossed my mind that a
> ‘Soft’-mode of ExpireSnapshot may be useful, where we can delete data files
> but just mark snapshot’s metadata files as expired without physically
> deleting them, and so retain the ability to answer these questions. It
> could be done by adding ‘expired-snapshots’ list to metadata.json. That
> being said, its a singular use case and not sure if anyone also has
> interest or other use-case? It would add a bit of complexity.
>
> Thanks
> Szehon
>
> On Fri, Jun 2, 2023 at 7:12 AM Pucheng Yang <py...@pinterest.com.invalid>
> wrote:
>
>> Ryan,
>>
>> One use case is the user might need to time travel to a certain snapshot.
>> However, such a snapshot is expired due to the snapshot expiration
>> that only retains the latest snapshot operation, and this operation's only
>> intent is to remove the gc partition. It seems a little overkill to me.
>>
>> I hope my explanation makes sense to you.
>>
>> On Thu, Jun 1, 2023 at 3:39 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Pucheng,
>>>
>>> What is the use case around keeping the snapshot longer? We don't often
>>> have people ask to keep snapshots that can't be read, so it sounds like you
>>> might have something specific in mind?
>>>
>>> Ryan
>>>
>>> On Wed, May 31, 2023 at 8:19 PM Pucheng Yang <py...@pinterest.com.invalid>
>>> wrote:
>>>
>>>> Hi community,
>>>>
>>>> In my organization, a big portion of the datasets are partitioned by
>>>> date, normally we keep the latest X dates of partition for a given dataset.
>>>>
>>>> One issue that always bothers me is if I want to delete a partition
>>>> that should be GC, I will run SQL query "delete from tbl where dt = ..."
>>>> and do snapshot expiration to keep the latest snapshot to make sure that
>>>> partition data is physically removed. However, the downside of this
>>>> approach is the table snapshot history will be completely lost..
>>>>
>>>> I wonder if anyone else in the community has the same pain point? How
>>>> do you solve this? I would love to understand if there is a solution to
>>>> this otherwise we can brainstorm if there is a way to solve this.
>>>>
>>>> Thanks!
>>>>
>>>> Pucheng
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>