Yes, for the original use case in this thread, I agree it's delete (soft) +
expire (physical, permanent).
I should have phrased my thought better: I was replying to Ryan's
question above

>  We don't often have people ask to keep snapshots that can't be read


and had thought it'd be nice to have an ExpireSnapshot mode where we
keep older metadata for longer periods of time, beyond physical expiration.

But the main use case I had was table historical analysis (the last update
time for each partition, or how many snapshots this table has ever had, for
example). It's more of a nice-to-have, and I'm definitely not sure it is a
very compelling use case.  Another option, I guess, is that a custom catalog
can keep this historical information around.

Thanks
Szehon

On Fri, Jun 2, 2023 at 10:28 PM Russell Spitzer <russell.spit...@gmail.com>
wrote:

> I think "soft-mode" is really just doing the delete. You can then recover
> the snapshot if you happen to have accidentally TTL'd a partition.
>
> On Fri, Jun 2, 2023 at 8:51 AM Szehon Ho <szehon.apa...@gmail.com> wrote:
>
>> I think this violates Iceberg’s assumption of immutable snapshots.  That
>> would require modifying the old snapshot to no longer point to those gc’ed
>> data files, else not sure how you can time-travel to read from that
>> snapshot, if some of its files are deleted?
>>
>> That being said, I also had this thought at some point, to keep snapshot
>> info around longer.  I expect most organizations operate in a mode where
>> they expire snapshots after a few days, and reasonably expect any
>> time-travel or snapshot-related operation (like CDC) to happen within this
>> timeframe.   And of course, use tags to keep the snapshot from expiration.
>>
>> But there are some use-cases where keeping more snapshot metadata for a
>> period longer than when it could be read could be interesting.  For
>> example, if I want to know info about the snapshot that added each data
>> file, we have probably lost most of that snapshot metadata, as the files
>> were added long ago.  One example is the frequent ask to find each
>> partition's last modified time (in an earlier email thread).
>>
>> I haven't thought it completely through, but it crossed my mind that a
>> ‘Soft’-mode of ExpireSnapshot may be useful, where we can delete data files
>> but just mark snapshot’s metadata files as expired without physically
>> deleting them, and so retain the ability to answer these questions.  It
>> could be done by adding ‘expired-snapshots’ list to metadata.json.  That
>> being said, it's a singular use case and I'm not sure if anyone else has
>> interest or another use case.  It would add a bit of complexity.
>>
>> Thanks
>> Szehon
>>
>> On Fri, Jun 2, 2023 at 7:12 AM Pucheng Yang <py...@pinterest.com.invalid>
>> wrote:
>>
>>> Ryan,
>>>
>>> One use case is that the user might need to time travel to a certain
>>> snapshot. However, that snapshot has been expired by a snapshot
>>> expiration that retains only the latest snapshot, and that
>>> operation's only intent is to remove the GC'd partition. It seems a little
>>> overkill to me.
>>>
>>> I hope my explanation makes sense to you.
>>>
>>> On Thu, Jun 1, 2023 at 3:39 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> Pucheng,
>>>>
>>>> What is the use case around keeping the snapshot longer? We don't often
>>>> have people ask to keep snapshots that can't be read, so it sounds like you
>>>> might have something specific in mind?
>>>>
>>>> Ryan
>>>>
>>>> On Wed, May 31, 2023 at 8:19 PM Pucheng Yang
>>>> <py...@pinterest.com.invalid> wrote:
>>>>
>>>>> Hi community,
>>>>>
>>>>> In my organization, a big portion of the datasets are partitioned by
>>>>> date; normally we keep the latest X dates of partitions for a given
>>>>> dataset.
>>>>>
>>>>> One issue that always bothers me is that if I want to delete a partition
>>>>> that should be GC'd, I run the SQL query "delete from tbl where dt = ..."
>>>>> and then do snapshot expiration, keeping only the latest snapshot, to
>>>>> make sure that partition data is physically removed. However, the
>>>>> downside of this approach is that the table snapshot history is
>>>>> completely lost.
>>>>>
>>>>> I wonder if anyone else in the community has the same pain point? How
>>>>> do you solve this? I would love to understand if there is a solution to
>>>>> this; otherwise, we can brainstorm whether there is a way to solve it.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Pucheng
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
