Dan,

Thanks for the pointers. Let me look into that work.

- Wing Yew

On Tue, Jan 28, 2025 at 8:49 AM Daniel Weeks <dwe...@apache.org> wrote:

> Hey Wing Yew,
>
> I would agree that this is a common problem and we need a way to get
> tables back into a good state when something unexpected happens. Amogh and
> Matt have a PR (API: Define RepairManifests action interface
> <https://github.com/apache/iceberg/pull/10784#top>
> #10784) that was originally intended to address this and was part of some
> other changes (here <https://github.com/apache/iceberg/pull/10711> and
> here <https://github.com/apache/iceberg/pull/10721>), to provide
> mechanisms to recover files where possible (e.g. versioned buckets or HDFS
> trash).
>
> I think this lost a little momentum over the holidays, but it would be
> great if you could work with them to finalize this work.
>
> -Dan
>
> On Tue, Jan 28, 2025 at 7:16 AM Zach Dischner <zach.disch...@gmail.com>
> wrote:
>
>> Hi Wing,
>>
>> Thank you for bringing this up. We run into this all the time,
>> particularly when the underlying storage has data management settings
>> outside of Iceberg's ownership (i.e. S3 retention policies). It is probably
>> a weekly occurrence, and one of the biggest pain points for new builders.
>> Thanks for kicking this off!
>>
>> Zach
>>
>> On Tue, Jan 28, 2025 at 5:36 AM Gabor Kaszab <gaborkas...@apache.org>
>> wrote:
>>
>>> Hi,
>>>
>>> I can also confirm that there are a number of users who find themselves
>>> unintentionally deleting some files and not being able to use their Iceberg
>>> tables anymore. The number of these incidents is surprisingly high for some
>>> reason. There was also a question on Iceberg Slack around this problem the
>>> other day. So I think it's reasonable to provide some recovery mechanisms
>>> in the Iceberg lib in some form to the users.
>>>
>>> I went through the PR for my own education and left some comments,
>>> mostly around the introduced table API for this. Please let me know if any
>>> of this makes sense.
>>>
>>> Cheers,
>>> Gabor
>>>
>>> On Mon, Jan 27, 2025 at 6:10 PM Wing Yew Poon
>>> <wyp...@cloudera.com.invalid> wrote:
>>>
>>>> Hi,
>>>> A surprising number of our customers have inadvertently deleted files
>>>> that are part of their Iceberg tables (from storage), both data and
>>>> metadata. This has caused their Iceberg tables to be unreadable (or
>>>> unloadable in the case of missing metadata).
>>>> In the case of missing data files, we have provided code to the
>>>> customer to "repair" the table to make it readable again without the
>>>> missing files (where they are not able to recover the files at all). I have
>>>> put up a PR, https://github.com/apache/iceberg/pull/12106, for a Spark
>>>> action to remove missing data and delete files from table metadata. Perhaps
>>>> this would be useful to others.
>>>> I have kept the action simple. Removing a data file may result in
>>>> dangling deletes but the action does not do anything about that. However,
>>>> running rewrite_position_delete_files or rewrite_data_files subsequently
>>>> would clean them up.
>>>> Repairing a table with missing metadata is more difficult and depends
>>>> on what metadata files are missing.
>>>> - Wing Yew
>>>>
>>>>
>>
>> --
>> Zach Dischner
>> 303-919-1364 | zach.disch...@gmail.com
>> Senior Software Development Engineer | Amazon Advertising
>> zachdischner.com <http://www.zachdischner.com/> | Flickr
>> <http://www.flickr.com/photos/zachd1_618/> | Smugmug
>> <http://zachdischner.smugmug.com/> | 2manventure
>> <http://2manventure.wordpress.com/>
>>
>