Dan,
Thanks for the pointers. Let me look into that work.
- Wing Yew

On Tue, Jan 28, 2025 at 8:49 AM Daniel Weeks <dwe...@apache.org> wrote:

> Hey Wing Yew,
>
> I would agree that this is a common problem and we need a way to get
> tables back into a good state when something unexpected happens.  Amogh and
> Matt have a PR (API: Define RepairManifests action interface
> <https://github.com/apache/iceberg/pull/10784#top>
> #10784) that was originally intended to address this and was part of some
> other changes (here <https://github.com/apache/iceberg/pull/10711> and
> here <https://github.com/apache/iceberg/pull/10721>), to provide
> mechanisms to recover files where possible (e.g. versioned buckets or HDFS
> trash).
>
> I think this lost a little momentum over the holidays, but it would be
> great if you could work with them to finalize this work.
>
> -Dan
>
> On Tue, Jan 28, 2025 at 7:16 AM Zach Dischner <zach.disch...@gmail.com>
> wrote:
>
>> Hi Wing,
>>
>> Thank you for bringing this up. We run into this all the time,
>> particularly when the underlying storage has data management settings
>> outside of Iceberg's ownership (e.g. S3 retention policies). It is probably
>> a weekly occurrence, and one of the biggest pain points for new builders.
>> Thanks for kicking this off!
>>
>> Zach
>>
>> On Tue, Jan 28, 2025 at 5:36 AM Gabor Kaszab <gaborkas...@apache.org>
>> wrote:
>>
>>> Hi,
>>>
>>> I can also confirm that there are a number of users who find themselves
>>> unintentionally deleting some files and not being able to use their Iceberg
>>> tables anymore. The number of these incidents is surprisingly high for some
>>> reason. There was also a question on Iceberg Slack around this problem the
>>> other day. So I think it's reasonable to provide some recovery mechanisms
>>> in the Iceberg lib in some form to the users.
>>>
>>> I went through the PR for my own education and left some comments,
>>> mostly around the introduced table API for this. Please let me know if any
>>> of this makes sense.
>>>
>>> Cheers,
>>> Gabor
>>>
>>> On Mon, Jan 27, 2025 at 6:10 PM Wing Yew Poon
>>> <wyp...@cloudera.com.invalid> wrote:
>>>
>>>> Hi,
>>>> A surprising number of our customers have inadvertently deleted files
>>>> that are part of their Iceberg tables (from storage), both data and
>>>> metadata. This has caused their Iceberg tables to be unreadable (or
>>>> unloadable in the case of missing metadata).
>>>> In the case of missing data files, we have provided code to the
>>>> customer to "repair" the table to make it readable again without the
>>>> missing files (where they are not able to recover the files at all). I have
>>>> put up a PR, https://github.com/apache/iceberg/pull/12106, for a Spark
>>>> action to remove missing data and delete files from table metadata. Perhaps
>>>> this would be useful to others.
>>>> I have kept the action simple. Removing a data file may result in
>>>> dangling deletes but the action does not do anything about that. However,
>>>> running rewrite_position_delete_files or rewrite_data_files subsequently
>>>> would clean them up.
>>>> Repairing a table with missing metadata is more difficult and depends
>>>> on what metadata files are missing.
>>>> - Wing Yew
>>>>
>>>>
>>
>> --
>> Zach Dischner
>> 303-919-1364 | zach.disch...@gmail.com
>> Senior Software Development Engineer | Amazon Advertising
>> zachdischner.com <http://www.zachdischner.com/> | Flickr
>> <http://www.flickr.com/photos/zachd1_618/> | Smugmug
>> <http://zachdischner.smugmug.com/> | 2manventure
>> <http://2manventure.wordpress.com/>
>>
>
