Hi everyone,

I’m reopening the RepairManifests work originally started by Mathew Fournier and continued by Amogh Jahagirdar. This feature addresses cases where Iceberg tables end up in an invalid state due to duplicate or missing files in the manifest list. The PR has been inactive for a while, and Amogh currently doesn’t have the bandwidth to continue it. I’d like to help move it forward, since we’ve encountered this issue at AWS as well.
I’ve seen this occur in a few ways. In one, a service’s commit is falsely reported as failed and the service issues the commit again, so the same data file is committed twice. In another, large portions of table data were deleted by a retention policy on an S3 bucket. There are also scenarios, as mentioned in the original issue, where Kafka Connect appends the same file to a table multiple times. In each case, one fix has been to fork the original PR and execute the repairManifest action.

Here is a link to the PR: https://github.com/apache/iceberg/pull/14341

Thanks,
Drew
