Hey Wing Yew,

I would agree that this is a common problem and we need a way to get tables
back into a good state when something unexpected happens.  Amogh and Matt
have a PR (API: Define RepairManifests action interface, #10784
<https://github.com/apache/iceberg/pull/10784#top>) that was originally
intended to address this and was part of some
other changes (here <https://github.com/apache/iceberg/pull/10711> and here
<https://github.com/apache/iceberg/pull/10721>), to provide mechanisms to
recover files where possible (e.g. versioned buckets or HDFS trash).

I think this lost a little momentum over the holidays, but it would be
great if you could work with them to finalize this work.

-Dan

On Tue, Jan 28, 2025 at 7:16 AM Zach Dischner <zach.disch...@gmail.com>
wrote:

> Hi Wing,
>
> Thank you for bringing this up. We run into this all the time,
> particularly when the underlying storage has data management settings
> outside of Iceberg's ownership (e.g. S3 retention policies). It is probably
> a weekly occurrence, and one of the biggest pain points for new builders.
> Thanks for kicking this off!
>
> Zach
>
> On Tue, Jan 28, 2025 at 5:36 AM Gabor Kaszab <gaborkas...@apache.org>
> wrote:
>
>> Hi,
>>
>> I can also confirm that there are a number of users who find themselves
>> unintentionally deleting some files and not being able to use their Iceberg
>> tables anymore. The number of these incidents is surprisingly high for some
>> reason. There was also a question on Iceberg Slack around this problem the
>> other day. So I think it's reasonable to provide some recovery mechanisms
>> in the Iceberg lib in some form to the users.
>>
>> I went through the PR for my own education and left some comments, mostly
>> around the introduced table API for this. Please let me know if any of this
>> makes sense.
>>
>> Cheers,
>> Gabor
>>
>> On Mon, Jan 27, 2025 at 6:10 PM Wing Yew Poon <wyp...@cloudera.com.invalid>
>> wrote:
>>
>>> Hi,
>>> A surprising number of our customers have inadvertently deleted files
>>> that are part of their Iceberg tables (from storage), both data and
>>> metadata. This has caused their Iceberg tables to be unreadable (or
>>> unloadable in the case of missing metadata).
>>> In the case of missing data files, we have provided code to the customer
>>> to "repair" the table to make it readable again without the missing files
>>> (where they are not able to recover the files at all). I have put up a PR,
>>> https://github.com/apache/iceberg/pull/12106, for a Spark action to
>>> remove missing data and delete files from table metadata. Perhaps this
>>> would be useful to others.
>>> I have kept the action simple. Removing a data file may result in
>>> dangling deletes but the action does not do anything about that. However,
>>> running rewrite_position_delete_files or rewrite_data_files subsequently
>>> would clean them up.
>>> Repairing a table with missing metadata is more difficult and depends on
>>> what metadata files are missing.
>>> - Wing Yew
>>>
>>>
>
> --
> Zach Dischner
> 303-919-1364 | zach.disch...@gmail.com
> Senior Software Development Engineer | Amazon Advertising
> zachdischner.com <http://www.zachdischner.com/> | Flickr
> <http://www.flickr.com/photos/zachd1_618/> | Smugmug
> <http://zachdischner.smugmug.com/> | 2manventure
> <http://2manventure.wordpress.com/>
>
