I think "soft-mode" is really just doing the delete and not expiring the
snapshots. You can then recover the snapshot if you happen to have
accidentally TTL'd a partition.
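
To make that concrete, here is a minimal toy sketch in Python. It is not the
Iceberg API: the ToyTable class, its methods, and the file names are all made
up for illustration. It assumes a snapshot is simply a set of data file paths,
and shows that a delete on its own leaves the old snapshot readable, while
expiring that snapshot is what makes the files removable.

```python
class ToyTable:
    """Toy model of a table (hypothetical, not the Iceberg API)."""

    def __init__(self):
        self.snapshots = {}   # snapshot id -> frozenset of data file paths
        self.current = None   # id of the current snapshot
        self._next_id = 1

    def commit(self, files):
        """Record a new snapshot that references exactly `files`."""
        new_id = self._next_id
        self._next_id += 1
        self.snapshots[new_id] = frozenset(files)
        self.current = new_id
        return new_id

    def delete_partition(self, prefix):
        """Logical delete: a new snapshot without the partition's files."""
        kept = {f for f in self.snapshots[self.current]
                if not f.startswith(prefix)}
        return self.commit(kept)

    def rollback(self, snapshot_id):
        """Make an older snapshot current again, if it still exists."""
        if snapshot_id not in self.snapshots:
            raise KeyError(f"snapshot {snapshot_id} has been expired")
        self.current = snapshot_id

    def expire(self, retain):
        """Drop all snapshots except `retain` (and the current one).

        Returns the files that no retained snapshot references any more,
        i.e. the files that could now be physically deleted.
        """
        keep_ids = set(retain) | {self.current}
        before = set().union(*self.snapshots.values())
        self.snapshots = {i: f for i, f in self.snapshots.items()
                          if i in keep_ids}
        after = set().union(*self.snapshots.values())
        return before - after


table = ToyTable()
s1 = table.commit({"dt=2023-05-30/a.parquet", "dt=2023-05-31/b.parquet"})
s2 = table.delete_partition("dt=2023-05-30/")

table.rollback(s1)   # still possible: s1 and its files have not been expired
table.rollback(s2)   # go back to the state after the delete

removed = table.expire(retain={s2})  # only now do s1's extra files go away
```

In this model, once expire() has run, rolling back to s1 fails, which is the
point at which an accidental delete can no longer be recovered.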

On Fri, Jun 2, 2023 at 8:51 AM Szehon Ho <szehon.apa...@gmail.com> wrote:

> I think this violates Iceberg’s assumption of immutable snapshots.  That
> would require modifying the old snapshot to no longer point to those gc’ed
> data files, else not sure how you can time-travel to read from that
> snapshot, if some of its files are deleted?
>
> That being said, I also had this thought at some point, to keep snapshot
> info around longer.  I expect most organizations operate in a mode where
> they expire snapshots after a few days, and reasonably expect any
> time-travel or snapshot-related operation (like CDC) to happen within this
> timeframe.   And of course, use tags to keep the snapshot from expiration.
>
> But there are some use-cases where keeping more snapshot metadata for a
> period longer than when it could be read could be interesting.  For
> example, if I want to know info about the snapshot that added each data
> file, we have probably lost most of that snapshot metadata, since the files
> were added long ago.  A case in point is the frequent ask to find each
> partition's last modified time (in an earlier email thread).
>
> I haven't thought it completely through, but it crossed my mind that a
> ‘Soft’-mode of ExpireSnapshot may be useful, where we can delete data files
> but just mark snapshot’s metadata files as expired without physically
> deleting them, and so retain the ability to answer these questions.  It
> could be done by adding an ‘expired-snapshots’ list to metadata.json.  That
> being said, it's a single use case, and I'm not sure whether anyone else is
> interested or has other use cases.  It would add a bit of complexity.
>
> Thanks
> Szehon
>
> On Fri, Jun 2, 2023 at 7:12 AM Pucheng Yang <py...@pinterest.com.invalid>
> wrote:
>
>> Ryan,
>>
>> One use case is that a user might need to time travel to a certain snapshot.
>> However, that snapshot has already been expired by a snapshot expiration
>> that retains only the latest snapshot, and the only intent of that
>> expiration was to remove the GC'd partition. It seems a little overkill to me.
>>
>> I hope my explanation makes sense to you.
>>
>> On Thu, Jun 1, 2023 at 3:39 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Pucheng,
>>>
>>> What is the use case around keeping the snapshot longer? We don't often
>>> have people ask to keep snapshots that can't be read, so it sounds like you
>>> might have something specific in mind?
>>>
>>> Ryan
>>>
>>> On Wed, May 31, 2023 at 8:19 PM Pucheng Yang <py...@pinterest.com.invalid>
>>> wrote:
>>>
>>>> Hi community,
>>>>
>>>> In my organization, a big portion of the datasets are partitioned by
>>>> date, and we normally keep the latest X dates of partitions for a given dataset.
>>>>
>>>> One issue that always bothers me is that if I want to delete a partition
>>>> that should be GC'd, I run the SQL query "delete from tbl where dt = ..."
>>>> and then expire snapshots, keeping only the latest snapshot, to make sure
>>>> that partition's data is physically removed. However, the downside of this
>>>> approach is that the table's snapshot history is completely lost.
>>>>
>>>> I wonder if anyone else in the community has the same pain point? How
>>>> do you solve this? I would love to understand if there is a solution to
>>>> this; otherwise, we can brainstorm a way to solve it.
>>>>
>>>> Thanks!
>>>>
>>>> Pucheng
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
