Re: Iceberg old partition gc

2023-06-04 Thread Ryan Blue
Let me paraphrase the use case to make sure I'm getting it right: The idea is to be able to remove expired data and delete the data files associated with it, but without losing the history of other changes to the table. Because new data and old data are modified in the same linear history, physical

Re: Iceberg old partition gc

2023-06-03 Thread Szehon Ho
> > @Szehon, I am wondering if we can create materialized views for metadata tables to support infinite history on metadata tables (like snapshots or partitions). Obviously, materialized views can't be used for time travel or rollback. They are only meant for maintaining long/infinite histori

Re: Iceberg old partition gc

2023-06-02 Thread Steven Wu
> the main use case I had was table historical analysis (last update time for each partition, how many snapshots did this table ever have, for example),
Partition-level stats can probably help with questions like "last update time for each partition". @Szehon, I am wondering if we can create mat

Re: Iceberg old partition gc

2023-06-02 Thread Szehon Ho
Yes, for the original use case in this thread, I agree it's delete (soft) + expire (physical, permanent). I guess I should have phrased my thought better; I was replying to Ryan's question above:
> We don't often have people ask to keep snapshots that can't be read
and had thought it'd be nice to

Re: Iceberg old partition gc

2023-06-02 Thread Russell Spitzer
I think "soft-mode" is really just doing the delete. You can then recover the snapshot if you happen to have accidentally TTL'd a partition.
On Fri, Jun 2, 2023 at 8:51 AM Szehon Ho wrote:
> I think this violates Iceberg’s assumption of immutable snapshots. That would require modifying the ol

Re: Iceberg old partition gc

2023-06-02 Thread Szehon Ho
I think this violates Iceberg’s assumption of immutable snapshots. That would require modifying the old snapshot to no longer point to those gc’ed data files, else not sure how you can time-travel to read from that snapshot, if some of its files are deleted? That being said, I also had this thoug

Re: Iceberg old partition gc

2023-06-01 Thread Pucheng Yang
Ryan, One use case is the user might need to time travel to a certain snapshot. However, such a snapshot is expired due to the snapshot expiration that only retains the latest snapshot operation, and this operation's only intent is to remove the gc partition. It seems a little overkill to me. I h

Re: Iceberg old partition gc

2023-06-01 Thread Ryan Blue
Pucheng,
What is the use case around keeping the snapshot longer? We don't often have people ask to keep snapshots that can't be read, so it sounds like you might have something specific in mind?
Ryan
On Wed, May 31, 2023 at 8:19 PM Pucheng Yang wrote:
> Hi community,
>
> In my organization, a

Iceberg old partition gc

2023-05-31 Thread Pucheng Yang
Hi community, In my organization, a big portion of the datasets are partitioned by date, and we normally keep the latest X date partitions for a given dataset. One issue that always bothers me is that if I want to delete a partition that should be GC'd, I will run the SQL query "delete from tbl where dt = ..
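
The "delete (soft) + expire (physical, permanent)" behavior discussed in this thread can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: `ToyTable`, `Snapshot`, and every method name below are hypothetical and are not the Iceberg API. It shows that a partition delete writes a new immutable snapshot while older snapshots stay readable, and that files are physically removed only when the snapshots referencing them are expired.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """Immutable view of the table: the set of data files it references."""
    snapshot_id: int
    files: frozenset


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table with a linear snapshot history."""
    storage: set = field(default_factory=set)      # files physically present
    snapshots: list = field(default_factory=list)  # oldest first
    next_id: int = 1

    def _commit(self, files):
        # Every change produces a new snapshot; old ones are never modified.
        snap = Snapshot(self.next_id, frozenset(files))
        self.next_id += 1
        self.snapshots.append(snap)
        return snap

    def append(self, new_files):
        self.storage.update(new_files)
        current = self.snapshots[-1].files if self.snapshots else frozenset()
        return self._commit(current | set(new_files))

    def delete_partition(self, dt):
        # "Soft" delete: the new snapshot drops the partition's files,
        # but nothing is removed from storage yet.
        if not self.snapshots:
            raise ValueError("table has no snapshots")
        prefix = f"dt={dt}/"
        current = self.snapshots[-1].files
        return self._commit(f for f in current if not f.startswith(prefix))

    def read(self, snapshot_id):
        # Time travel: only snapshots still in the history can be read.
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return sorted(snap.files)
        raise KeyError(f"snapshot {snapshot_id} has been expired")

    def expire_snapshots(self, retain_last=1):
        # Physical, permanent cleanup: drop old snapshots, then remove
        # files that no retained snapshot references.
        if retain_last < 1:
            raise ValueError("retain_last must be at least 1")
        self.snapshots = self.snapshots[-retain_last:]
        referenced = set().union(*(s.files for s in self.snapshots))
        removed = self.storage - referenced
        self.storage -= removed
        return sorted(removed)
```

In this model, reading the pre-delete snapshot works until `expire_snapshots` runs. Afterwards both that snapshot and the deleted partition's files are gone, which is the trade-off between physical cleanup and retained history that the thread discusses.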