I think this violates Iceberg’s assumption of immutable snapshots.  It
would require modifying the old snapshot so that it no longer points to the
gc’ed data files; otherwise I'm not sure how you could time-travel to read
from that snapshot once some of its files are deleted.

That being said, I also had this thought at some point, to keep snapshot
info around longer.  I expect most organizations operate in a mode where
they expire snapshots after a few days, and reasonably expect any
time-travel or snapshot-related operation (like CDC) to happen within this
timeframe.  And of course, they can use tags to keep a snapshot from
being expired.

But there are some use cases where keeping snapshot metadata around for
longer than the snapshot itself can be read could be interesting.  For
example, if I want to know about the snapshot that added each data file,
we have probably lost most of that snapshot metadata, since the files were
added long ago.  One example is the frequent ask to find each partition's
last modified time (raised in an earlier email thread).

I haven't thought it completely through, but it crossed my mind that a
‘soft’ mode of ExpireSnapshots may be useful, where we delete the data
files but only mark the snapshot’s metadata files as expired, without
physically deleting them, and so retain the ability to answer these
questions.  It could be done by adding an ‘expired-snapshots’ list to
metadata.json.  That being said, it’s a singular use case, and I'm not sure
whether anyone else is interested or has other use cases for it.  It would
add a bit of complexity.
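
To make the idea a bit more concrete, here is a rough sketch in Python of
what a ‘soft’ expiration could do to the table metadata.  It is only an
illustration, not an existing Iceberg API: the metadata is modeled as a
plain dict, the ‘expired-snapshots’ key is the hypothetical list suggested
above, and the function names are made up.  The other keys (‘snapshots’,
‘snapshot-id’, ‘timestamp-ms’, ‘current-snapshot-id’, ‘refs’) follow the
table metadata layout.  Deleting the data files themselves is not modeled.

```python
import copy


def soft_expire_snapshots(metadata, older_than_ms):
    """Return a copy of `metadata` with old snapshots moved to the
    hypothetical 'expired-snapshots' list instead of being dropped.

    Snapshots referenced by a branch or tag in 'refs', and the current
    snapshot, are always kept readable.
    """
    result = copy.deepcopy(metadata)

    # Snapshot ids that must stay readable (current snapshot + all refs).
    protected = {ref["snapshot-id"] for ref in result.get("refs", {}).values()}
    protected.add(result.get("current-snapshot-id"))

    kept = []
    expired = list(result.get("expired-snapshots", []))
    for snap in result.get("snapshots", []):
        is_old = snap["timestamp-ms"] < older_than_ms
        if is_old and snap["snapshot-id"] not in protected:
            # Keep the snapshot entry for bookkeeping; it is no longer readable.
            expired.append(snap)
        else:
            kept.append(snap)

    result["snapshots"] = kept
    result["expired-snapshots"] = expired
    return result


def find_snapshot(metadata, snapshot_id):
    """Look up snapshot info among live and soft-expired snapshots."""
    for key in ("snapshots", "expired-snapshots"):
        for snap in metadata.get(key, []):
            if snap["snapshot-id"] == snapshot_id:
                return snap
    return None
```

With something like this, a question such as "when was the snapshot that
added this data file committed?" could still be answered through
find_snapshot() after the snapshot is no longer readable, while time travel
would continue to use only the ‘snapshots’ list.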

Thanks
Szehon

On Fri, Jun 2, 2023 at 7:12 AM Pucheng Yang <py...@pinterest.com.invalid>
wrote:

> Ryan,
>
> One use case is that a user might need to time travel to a certain
> snapshot. However, that snapshot has been expired by a snapshot expiration
> that retains only the latest snapshot, even though the expiration's only
> intent was to remove the gc'ed partition. That seems a little overkill to me.
>
> I hope my explanation makes sense to you.
>
> On Thu, Jun 1, 2023 at 3:39 PM Ryan Blue <b...@tabular.io> wrote:
>
>> Pucheng,
>>
>> What is the use case around keeping the snapshot longer? We don't often
>> have people ask to keep snapshots that can't be read, so it sounds like you
>> might have something specific in mind?
>>
>> Ryan
>>
>> On Wed, May 31, 2023 at 8:19 PM Pucheng Yang <py...@pinterest.com.invalid>
>> wrote:
>>
>>> Hi community,
>>>
>>> In my organization, a big portion of the datasets are partitioned by
>>> date, and we normally keep the latest X date partitions for a given
>>> dataset.
>>>
>>> One issue that always bothers me is that if I want to delete a partition
>>> that should be GC'ed, I run the SQL query "delete from tbl where dt = ..."
>>> and then expire snapshots, keeping only the latest one, to make sure the
>>> partition data is physically removed. However, the downside of this
>>> approach is that the table snapshot history is completely lost.
>>>
>>> I wonder if anyone else in the community has the same pain point? How do
>>> you solve this? I would love to understand if there is a solution to this;
>>> otherwise we can brainstorm whether there is a way to solve it.
>>>
>>> Thanks!
>>>
>>> Pucheng
>>>
>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
