Re: Data retention and expire the last snapshot

Steve Zhang Fri, 01 Jul 2022 10:07:45 -0700

Thanks Russell. The TRUNCATE you suggested
<https://spark.apache.org/docs/3.3.0/sql-ref-syntax-ddl-truncate-table.html#content>
generates a new empty snapshot, which allows the "previously last" snapshot to
be expired. I think that works for me. I appreciate your pointers.
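
A minimal sketch of the resulting sequence (truncate, expire snapshots, remove
orphan files), assuming a Spark session with the Iceberg SQL extensions
enabled. The catalog name `my_catalog`, the table `db.events`, and the helper
names are hypothetical, and the cutoff is assumed to be expressed in the Spark
session time zone. It only illustrates the order of operations and is not a
tested recipe.

```python
from datetime import datetime
from typing import List


def retention_statements(catalog: str, table: str, older_than: datetime) -> List[str]:
    """Build the SQL for: truncate, expire old snapshots, remove orphan files."""
    # Format the cutoff as a Spark SQL TIMESTAMP literal (assumed session time zone).
    cutoff = older_than.strftime("%Y-%m-%d %H:%M:%S")
    return [
        # 1. Table-wide delete; this commits a new, empty snapshot.
        f"TRUNCATE TABLE {catalog}.{table}",
        # 2. Expire snapshots older than the cutoff. The current (empty) snapshot
        #    is always retained, so the earlier ones become eligible for expiry.
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{cutoff}', retain_last => 1)",
        # 3. Optional safety net for files left behind if step 2 failed part-way.
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}')",
    ]


def run_retention(spark, catalog: str, table: str, older_than: datetime) -> None:
    """Run the statements in order on an existing SparkSession (`spark`)."""
    for stmt in retention_statements(catalog, table, older_than):
        spark.sql(stmt)
```

With a live session this would be called as
`run_retention(spark, "my_catalog", "db.events", datetime.now())`. Building the
statements separately from running them makes it easy to log or review them
before anything is deleted.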


Thanks,
Steve Zhang



> On Jun 29, 2022, at 4:32 PM, Russell Spitzer <russell.spit...@gmail.com> 
> wrote:
> 
> Is "truncate" not an option? This would do a table-wide delete, which would 
> create a new snapshot that you can keep. No data files would be valid after 
> this.
> 
> On Wed, Jun 29, 2022 at 6:29 PM Steve Zhang <hongyue_zh...@apple.com.invalid> 
> wrote:
> Hey Iceberg Community:
> 
> I am wondering whether there is a best practice for handling the residual 
> data files that were deleted in the last snapshot of an Iceberg table. 
> 
> Let me explain the use case. Consider a data retention policy under which 
> some sensitive data may be stored on disk for only one month. To get the 
> data off disk in Iceberg, we generally need three steps:
> 1. delete data from the table, or drop a partition (logical deletion)
> 2. expire old snapshots (physical deletion to get data off the disk)
> 3. remove orphaned files (not strictly needed, but at scale this may be 
> necessary to account for any failure in step 2)
> 
> However, from what I can tell, the Iceberg expire_snapshots stored procedure 
> will not delete the last snapshot of the given table, as stated in 
> <https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/RemoveSnapshots.java#L141-L146>.
> 
> So if the last snapshot happens to be the delete from step 1, and no further 
> transactions are committed to the table, that snapshot will not be expired 
> and the data files will be left behind. I am not sure what the right way is 
> to clean up the data files from disk to comply with our retention policy. 
> Can anyone share some ideas? 
> 
> I guess dropping the table is one workaround, but I am looking for a less 
> intrusive way that leaves the table as is, in its original state right after 
> creation, before any data was written.
> 
> Thanks,
> Steve Zhang
> 
> 
>
