Thanks Russell,

The TRUNCATE you suggested (https://spark.apache.org/docs/3.3.0/sql-ref-syntax-ddl-truncate-table.html#content) generates a new empty snapshot, which allows the "previously last" snapshot to be expired. I think it works for me. I appreciate your pointers.
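For anyone following the thread, the sequence discussed here (truncate, then expire snapshots, then remove orphan files) could be sketched roughly as below. This is only a minimal sketch that builds the Spark SQL statements in Python. The catalog name `my_catalog`, the table `db.events`, and the helpers `build_cleanup_statements` and `run_cleanup` are hypothetical. It assumes the Iceberg Spark procedures `expire_snapshots` and `remove_orphan_files` are available in the catalog and that TRUNCATE TABLE is supported for the table.

```python
from datetime import datetime
from typing import List


def build_cleanup_statements(catalog: str, table: str, older_than: datetime) -> List[str]:
    """Build the Spark SQL statements for truncate -> expire -> orphan cleanup.

    `catalog` and `table` (for example "db.events") are hypothetical names.
    `older_than` should be later than the snapshots that must be expired;
    otherwise expire_snapshots will leave them in place.
    """
    cutoff = older_than.strftime("%Y-%m-%d %H:%M:%S")
    return [
        # 1. Logical deletion: commits a new snapshot with no live data files.
        f"TRUNCATE TABLE {catalog}.{table}",
        # 2. Physical deletion: expire snapshots older than the cutoff,
        #    keeping only the most recent one (the truncate snapshot).
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{cutoff}', retain_last => 1)",
        # 3. Safety net: remove files no longer referenced by table metadata.
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}')",
    ]


def run_cleanup(spark, catalog: str, table: str, older_than: datetime) -> None:
    """Run the statements in order on an existing SparkSession (assumed)."""
    for statement in build_cleanup_statements(catalog, table, older_than):
        spark.sql(statement)
```

With a live session this would be called as run_cleanup(spark, "my_catalog", "db.events", datetime.now()). Note that remove_orphan_files applies its own default age threshold, so very recent files may survive the first run.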
Thanks,
Steve Zhang

> On Jun 29, 2022, at 4:32 PM, Russell Spitzer <russell.spit...@gmail.com> wrote:
>
> Is "truncate" not an option? This would do a table-wide delete, which would
> create a new snapshot that you can keep. No data files would be valid after
> this?
>
> On Wed, Jun 29, 2022 at 6:29 PM Steve Zhang <hongyue_zh...@apple.com.invalid> wrote:
> Hey Iceberg Community:
>
> I am wondering if there’s any best practice for handling residual data files
> that were deleted in the last snapshot of an Iceberg table.
>
> Let me explain the use case. Consider a data retention policy under which
> some sensitive data can only be stored on disk for a month. To get the data
> off the disk in Iceberg, we generally need to complete 3 steps:
> 1. delete data from the table, or drop partition (logical deletion)
> 2. expire old snapshots (physical deletion to get data off the disk)
> 3. remove orphaned files (not strictly needed, but at scale this might be
>    needed to account for any failure in step 2)
>
> However, from what I can tell, the Iceberg expire-snapshot stored procedure
> will not delete the last snapshot of the given table, as stated in
> https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/RemoveSnapshots.java#L141-L146.
>
> So if the last snapshot happens to be the delete in step 1, and no further
> transactions happen on the table, then that snapshot will not be expired
> properly and the data files are left behind. I am not sure what the right way
> is to clean up the data files from the disk to comply with our retention
> policy. Can anyone share some ideas?
>
> I guess dropping the table is one workaround, but I am looking for a less
> intrusive way that leaves the table as is, in its original state right after
> table creation, before any data is written.
>
> Thanks,
> Steve Zhang