Re: missing files in an Iceberg table

Steve Loughran Thu, 30 Jan 2025 04:57:21 -0800

Are these people using S3 versioned buckets?

If so, the deleted files are not gone: until the old versions are actually
purged, they are just hidden under delete markers (tombstones).
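
To illustrate: in a versioned bucket, a delete only adds a delete marker as
the newest version of the key, and removing that marker makes the previous
version visible again. A minimal Python sketch of finding such keys, assuming
boto3 and a response shaped like the output of S3's ListObjectVersions call.
The names hidden_keys and undelete are made up for illustration, and this is
not necessarily how cloudstore does it.

```python
def hidden_keys(response):
    """Return (key, version_id) for each delete marker that is the latest version of its key."""
    return [
        (marker["Key"], marker["VersionId"])
        for marker in response.get("DeleteMarkers", [])
        if marker.get("IsLatest")
    ]


def undelete(bucket, prefix):
    """Remove the latest delete markers under prefix so the prior versions reappear."""
    import boto3  # assumption: boto3 is installed and credentials are configured

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for key, version_id in hidden_keys(page):
            # Deleting a delete marker by its version id un-hides the object.
            s3.delete_object(Bucket=bucket, Key=key, VersionId=version_id)
```

This only helps while the older versions still exist; once a lifecycle rule
has purged them, there is nothing left under the marker to recover.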


Our little cloud-storage support-call library, cloudstore, has something to
list and recover these

https://github.com/steveloughran/cloudstore
https://github.com/steveloughran/cloudstore/blob/main/src/main/site/versioned-objects.md

If they are backed up to S3 Glacier, that is a different problem; what to do
there is an ongoing issue. Now that there's a fast-but-expensive Glacier
storage class, it may actually be possible to do the retrieval on demand.

https://issues.apache.org/jira/browse/HADOOP-14837

Otherwise, something to at least scan a table and initiate slow recovery
could be useful.
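
A rough Python sketch of that scan-and-restore idea, assuming boto3 and that
the table's file locations have already been collected as (bucket, key) pairs
by some other means. The names needs_restore, restore_request and
start_restores are hypothetical, and the seven-day, Bulk-tier settings are
arbitrary defaults.

```python
ARCHIVED = {"GLACIER", "DEEP_ARCHIVE"}


def needs_restore(head):
    """True if a HeadObject response describes an archived object with no restore under way or completed."""
    return head.get("StorageClass") in ARCHIVED and "Restore" not in head


def restore_request(days=7, tier="Bulk"):
    """Build the RestoreRequest payload; Bulk is the slowest and cheapest tier."""
    return {"Days": days, "GlacierJobParameters": {"Tier": tier}}


def start_restores(files):
    """Initiate a restore for every archived object in files, an iterable of (bucket, key)."""
    import boto3  # assumption: boto3 is installed and credentials are configured

    s3 = boto3.client("s3")
    started = []
    for bucket, key in files:
        head = s3.head_object(Bucket=bucket, Key=key)
        if needs_restore(head):
            s3.restore_object(Bucket=bucket, Key=key, RestoreRequest=restore_request())
            started.append(key)
    return started
```

The restores complete asynchronously, so the table stays unreadable until
they finish; the scan only gets the recovery started.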

steve

On Tue, 28 Jan 2025 at 15:16, Zach Dischner <zach.disch...@gmail.com> wrote:

> Hi Wing,
>
> Thank you for bringing this up. We run into this all the time,
> particularly when the underlying storage has data management settings
> outside of Iceberg's ownership (e.g. S3 retention policies). It is probably
> a weekly occurrence, and one of the biggest pain points for new builders.
> Thanks for kicking this off!
>
> Zach
>
> On Tue, Jan 28, 2025 at 5:36 AM Gabor Kaszab <gaborkas...@apache.org>
> wrote:
>
>> Hi,
>>
>> I can also confirm that there are a number of users who find themselves
>> unintentionally deleting some files and not being able to use their Iceberg
>> tables anymore. The number of these incidents is surprisingly high for some
>> reason. There was also a question on Iceberg Slack around this problem the
>> other day. So I think it's reasonable to provide some recovery mechanisms
>> in the Iceberg lib in some form to the users.
>>
>> I went through the PR for my own education and left some comments, mostly
>> around the introduced table API for this. Please let me know if any of this
>> makes sense.
>>
>> Cheers,
>> Gabor
>>
>> On Mon, Jan 27, 2025 at 6:10 PM Wing Yew Poon <wyp...@cloudera.com.invalid>
>> wrote:
>>
>>> Hi,
>>> A surprising number of our customers have inadvertently deleted files
>>> that are part of their Iceberg tables (from storage), both data and
>>> metadata. This has caused their Iceberg tables to be unreadable (or
>>> unloadable in the case of missing metadata).
>>> In the case of missing data files, we have provided code to the customer
>>> to "repair" the table to make it readable again without the missing files
>>> (where they are not able to recover the files at all). I have put up a PR,
>>> https://github.com/apache/iceberg/pull/12106, for a Spark action to
>>> remove missing data and delete files from table metadata. Perhaps this
>>> would be useful to others.
>>> I have kept the action simple. Removing a data file may result in
>>> dangling deletes but the action does not do anything about that. However,
>>> running rewrite_position_delete_files or rewrite_data_files subsequently
>>> would clean them up.
>>> Repairing a table with missing metadata is more difficult and depends on
>>> what metadata files are missing.
>>> - Wing Yew
>>>
>>>
>
> --
> Zach Dischner
> 303-919-1364 | zach.disch...@gmail.com
> Senior Software Development Engineer | Amazon Advertising
> zachdischner.com <http://www.zachdischner.com/> | Flickr
> <http://www.flickr.com/photos/zachd1_618/> | Smugmug
> <http://zachdischner.smugmug.com/> | 2manventure
> <http://2manventure.wordpress.com/>
>
