Hi all,

Recently, I came across an issue where some Puffin statistics files remain in storage after calling *dropTableData()*.
The issue occurs because calling *updateStatistics()* or *updatePartitionStatistics()* for a snapshot ID that already has a statistics file commits new metadata pointing at the new statistics file path, without deleting the old file. Since the current implementation of *dropTableData()* only removes statistics files referenced by the latest metadata, older statistics files that were referenced only by previous metadata versions remain in storage.

I drafted a PR that iterates through the historical metadata files to collect all referenced statistics files for deletion in *dropTableData()*, but this adds significant I/O overhead. Some alternative ideas raised in the PR:

- Introduce a flag that triggers iterating over historical metadata files to clean up all statistics files.
- Delete old statistics files when updating or removing statistics (risks missing files on rollback).
- Track unreferenced statistics files in the latest metadata (increases metadata size).

Each approach comes with trade-offs in performance and complexity, so I would like to get insights and opinions from the community: has this been discussed before, or is this the expected behavior? Are there alternative approaches that handle this more efficiently?

References:
- PR: https://github.com/apache/iceberg/pull/12132
- Issues: https://github.com/apache/iceberg/issues/11876, https://github.com/trinodb/trino/issues/16583
- Code: https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/CatalogUtil.java#L94-L139

Thanks,
Leon
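P.S. To make the failure mode concrete, here is a toy Python sketch — not the Iceberg API, just a minimal model under the assumption that each metadata version references one statistics file path. It shows how a latest-metadata-only cleanup misses paths that only older metadata versions reference, while walking the full metadata history removes everything:

```python
# Toy model of the orphaned-stats-files problem (NOT the Iceberg API).
# Each committed metadata version references one stats file path; a
# cleanup that reads only the newest metadata misses older paths.

def drop_table_data_latest_only(metadata_log, storage):
    """Mimics a cleanup that only looks at the latest metadata."""
    latest = metadata_log[-1]
    storage.discard(latest["stats_file"])

def drop_table_data_full_history(metadata_log, storage):
    """Mimics iterating historical metadata to collect all stats paths."""
    for metadata in metadata_log:
        storage.discard(metadata["stats_file"])

# Simulate updateStatistics() on the same snapshot: a second commit
# points at a new Puffin file without deleting the old one.
storage = {"stats-v1.puffin", "stats-v2.puffin"}
metadata_log = [
    {"version": 1, "stats_file": "stats-v1.puffin"},
    {"version": 2, "stats_file": "stats-v2.puffin"},  # new stats, same snapshot
]

leftover = set(storage)
drop_table_data_latest_only(metadata_log, leftover)
print(sorted(leftover))  # ['stats-v1.puffin'] — orphaned

drop_table_data_full_history(metadata_log, storage)
print(sorted(storage))   # [] — fully cleaned up
```

The trade-off in the PR is exactly the difference between these two loops: the second variant has to read every historical metadata file, which is where the extra I/O comes from.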