Hi everyone,

I’m reopening the RepairManifests work originally started by Mathew
Fournier and continued by Amogh Jahagirdar. This feature addresses cases
where Iceberg tables end up in an invalid state due to duplicate or missing
files in the manifest list. The PR has been inactive for a while, and Amogh
currently doesn’t have the bandwidth to continue it. I’d like to help move
it forward, since we’ve encountered this issue at AWS as well.

I’ve seen this occur in a few ways. In one, a service’s commit is falsely
reported as failed, the service issues the commit again, and the same data
file ends up committed twice. In other cases, I’ve seen large portions of
table data deleted due to a retention policy on an S3 bucket. There are
also scenarios, as mentioned in the original issue, where Kafka Connect
appends the same file multiple times to a table. In each case, one fix has
been to fork the original PR and execute the repairManifest action.

Here is a link to the PR: https://github.com/apache/iceberg/pull/14341

Thanks,
Drew
