Are these people using S3 versioned buckets? If so, until the files are actually purged, they are just hiding under tombstone markers (S3 calls them delete markers).
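
To make that concrete, here is a rough Python sketch of how such hidden objects could be found and un-hidden. It is not cloudstore's code (cloudstore is a separate Java tool), and the function names `hidden_objects` and `undelete` are made up for illustration. It assumes boto3 is installed and uses its `list_object_versions` paginator and `delete_object` call; removing a delete marker by its version ID is what exposes the previous version again.

```python
def hidden_objects(versions, delete_markers):
    """Map each hidden key to (delete_marker_version_id, newest_data_version_id).

    Inputs are the "Versions" and "DeleteMarkers" entries from S3
    ListObjectVersions responses. A key counts as hidden when its latest
    entry is a delete marker and at least one real version still exists.
    """
    newest = {}
    for v in versions:
        key = v["Key"]
        if key not in newest or v["LastModified"] > newest[key]["LastModified"]:
            newest[key] = v
    hidden = {}
    for m in delete_markers:
        if m.get("IsLatest") and m["Key"] in newest:
            hidden[m["Key"]] = (m["VersionId"], newest[m["Key"]]["VersionId"])
    return hidden


def undelete(bucket, prefix):
    """Hypothetical helper: remove the latest delete marker of each hidden key."""
    import boto3  # third-party; imported here so hidden_objects() needs no AWS setup

    s3 = boto3.client("s3")
    versions, markers = [], []
    paginator = s3.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        versions.extend(page.get("Versions", []))
        markers.extend(page.get("DeleteMarkers", []))
    for key, (marker_id, _) in hidden_objects(versions, markers).items():
        # Deleting the marker itself (by VersionId) un-hides the object.
        s3.delete_object(Bucket=bucket, Key=key, VersionId=marker_id)
```

If a key was deleted more than once there will be several stacked markers, so a single pass only removes the top one; rerun until nothing is reported. Something like `undelete("my-bucket", "warehouse/db/table/")` would be the call, with both arguments being placeholders.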
Our little cloud-storage support-call library, cloudstore, has something to list and recover these:

https://github.com/steveloughran/cloudstore
https://github.com/steveloughran/cloudstore/blob/main/src/main/site/versioned-objects.md

If they are backed up to S3 Glacier, that is a different problem; what to do there is an ongoing issue. Now that there's a fast-but-expensive Glacier storage class, it may actually be possible to do the retrieval on demand.

https://issues.apache.org/jira/browse/HADOOP-14837

Otherwise, something to at least scan a table and initiate slow recovery could be useful.

steve

On Tue, 28 Jan 2025 at 15:16, Zach Dischner <zach.disch...@gmail.com> wrote:

> Hi Wing,
>
> Thank you for bringing this up. We run into this all the time,
> particularly when the underlying storage has data management settings
> outside of Iceberg's ownership (i.e. S3 retention policies). It is probably
> a weekly occurrence, and one of the biggest pain points for new builders.
> Thanks for kicking this off!
>
> Zach
>
> On Tue, Jan 28, 2025 at 5:36 AM Gabor Kaszab <gaborkas...@apache.org>
> wrote:
>
>> Hi,
>>
>> I can also confirm that there are a number of users who find themselves
>> unintentionally deleting some files and not being able to use their Iceberg
>> tables anymore. The number of these incidents is surprisingly high for some
>> reason. There was also a question on Iceberg Slack around this problem the
>> other day. So I think it's reasonable to provide some recovery mechanisms
>> in the Iceberg lib in some form to the users.
>>
>> I went through the PR for my own education and left some comments, mostly
>> around the introduced table API for this. Please let me know if any of this
>> makes sense.
>>
>> Cheers,
>> Gabor
>>
>> On Mon, Jan 27, 2025 at 6:10 PM Wing Yew Poon <wyp...@cloudera.com.invalid>
>> wrote:
>>
>>> Hi,
>>> A surprising number of our customers have inadvertently deleted files
>>> that are part of their Iceberg tables (from storage), both data and
>>> metadata. This has caused their Iceberg tables to be unreadable (or
>>> unloadable in the case of missing metadata).
>>> In the case of missing data files, we have provided code to the customer
>>> to "repair" the table to make it readable again without the missing files
>>> (where they are not able to recover the files at all). I have put up a PR,
>>> https://github.com/apache/iceberg/pull/12106, for a Spark action to
>>> remove missing data and delete files from table metadata. Perhaps this
>>> would be useful to others.
>>> I have kept the action simple. Removing a data file may result in
>>> dangling deletes but the action does not do anything about that. However,
>>> running rewrite_position_delete_files or rewrite_data_files subsequently
>>> would clean them up.
>>> Repairing a table with missing metadata is more difficult and depends on
>>> what metadata files are missing.
>>> - Wing Yew
>>>
>>>
>
> --
> Zach Dischner
> 303-919-1364 | zach.disch...@gmail.com
> Senior Software Development Engineer | Amazon Advertising
> zachdischner.com <http://www.zachdischner.com/> | Flickr
> <http://www.flickr.com/photos/zachd1_618/> | Smugmug
> <http://zachdischner.smugmug.com/> | 2manventure
> <http://2manventure.wordpress.com/>
>