Hey Wing Yew,

I agree that this is a common problem, and we need a way to get tables back into a good state when something unexpected happens. Amogh and Matt have a PR (API: Define RepairManifests action interface, #10784 <https://github.com/apache/iceberg/pull/10784#top>) that was originally intended to address this. It was part of a broader set of changes (here <https://github.com/apache/iceberg/pull/10711> and here <https://github.com/apache/iceberg/pull/10721>) to provide mechanisms for recovering files where possible (e.g. versioned buckets or HDFS trash).
I think this lost a little momentum over the holidays, but it would be great if you could work with them to finalize this work.

-Dan

On Tue, Jan 28, 2025 at 7:16 AM Zach Dischner <zach.disch...@gmail.com> wrote:

> Hi Wing,
>
> Thank you for bringing this up. We run into this all the time,
> particularly when the underlying storage has data management settings
> outside of Iceberg's ownership (e.g. S3 retention policies). It is probably
> a weekly occurrence, and one of the biggest pain points for new builders.
> Thanks for kicking this off!
>
> Zach
>
> On Tue, Jan 28, 2025 at 5:36 AM Gabor Kaszab <gaborkas...@apache.org>
> wrote:
>
>> Hi,
>>
>> I can also confirm that there are a number of users who find themselves
>> unintentionally deleting some files and not being able to use their Iceberg
>> tables anymore. The number of these incidents is surprisingly high for some
>> reason. There was also a question on Iceberg Slack around this problem the
>> other day. So I think it's reasonable to provide some recovery mechanisms
>> in the Iceberg lib in some form to the users.
>>
>> I went through the PR for my own education and left some comments, mostly
>> around the introduced table API for this. Please let me know if any of this
>> makes sense.
>>
>> Cheers,
>> Gabor
>>
>> On Mon, Jan 27, 2025 at 6:10 PM Wing Yew Poon <wyp...@cloudera.com.invalid>
>> wrote:
>>
>>> Hi,
>>> A surprising number of our customers have inadvertently deleted files
>>> that are part of their Iceberg tables (from storage), both data and
>>> metadata. This has caused their Iceberg tables to be unreadable (or
>>> unloadable in the case of missing metadata).
>>> In the case of missing data files, we have provided code to the customer
>>> to "repair" the table to make it readable again without the missing files
>>> (where they are not able to recover the files at all). I have put up a PR,
>>> https://github.com/apache/iceberg/pull/12106, for a Spark action to
>>> remove missing data and delete files from table metadata. Perhaps this
>>> would be useful to others.
>>> I have kept the action simple. Removing a data file may result in
>>> dangling deletes, but the action does not do anything about that. However,
>>> running rewrite_position_delete_files or rewrite_data_files subsequently
>>> would clean them up.
>>> Repairing a table with missing metadata is more difficult and depends on
>>> what metadata files are missing.
>>> - Wing Yew
>>>
>
> --
> Zach Dischner
> Senior Software Development Engineer | Amazon Advertising
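
For reference, the follow-up cleanup Wing Yew describes (running rewrite_position_delete_files or rewrite_data_files after the repair to clear dangling deletes) could be scripted roughly as below. This is only a sketch, not part of either PR: the helper name `cleanup_statements`, the catalog name `my_catalog`, and the table `db.events` are hypothetical, and it assumes the two procedures are invoked through Spark SQL `CALL` with a named `table` argument.

```python
from typing import List

# Iceberg Spark procedures mentioned in the thread as ways to clean up
# dangling deletes after missing data files are removed from metadata.
CLEANUP_PROCEDURES = ("rewrite_position_delete_files", "rewrite_data_files")


def cleanup_statements(catalog: str, table: str) -> List[str]:
    """Build one Spark SQL CALL statement per cleanup procedure.

    `catalog` and `table` are placeholders supplied by the caller; this
    helper only builds strings and does not talk to Spark itself.
    """
    return [
        f"CALL {catalog}.system.{procedure}(table => '{table}')"
        for procedure in CLEANUP_PROCEDURES
    ]


if __name__ == "__main__":
    # Hypothetical catalog and table names, for illustration only.
    for statement in cleanup_statements("my_catalog", "db.events"):
        print(statement)
```

With an active Spark session, each statement could then be passed to `spark.sql(statement)`. Per the thread, running either one of the two procedures should be enough, so in practice you would likely pick one rather than run both.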