I believe statistics can be replaced for the same snapshot ID so that optional stats can be recomputed for that snapshot. This does leave the old stats files orphaned, but the `remove_orphan_files` action or procedure will clean them up.
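To make "orphaned" concrete, here is a minimal sketch. It is not Iceberg code: the class `OrphanStatsSketch`, the method `orphanedStatsFiles`, and the modelling of each metadata version as a plain list of stats file paths are all assumptions for illustration. It shows that a stats file referenced only by an older metadata version is no longer reachable from the latest metadata.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model, not part of Iceberg: each inner list holds the stats
// file paths referenced by one metadata version, ordered oldest to newest.
final class OrphanStatsSketch {

  // Returns stats files referenced by some earlier metadata version but not
  // by the latest one, i.e. files that cleanup based only on the latest
  // metadata would not see.
  static Set<String> orphanedStatsFiles(List<List<String>> metadataVersions) {
    Set<String> orphaned = new LinkedHashSet<>();
    if (metadataVersions.isEmpty()) {
      return orphaned;
    }
    // Collect every stats file ever referenced.
    for (List<String> version : metadataVersions) {
      orphaned.addAll(version);
    }
    // Drop the ones the latest metadata still references.
    orphaned.removeAll(metadataVersions.get(metadataVersions.size() - 1));
    return orphaned;
  }
}
```

For example, if the first metadata version references one stats file for a snapshot and a later version replaces it with a recomputed file for the same snapshot, the sketch reports only the first file as orphaned.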
As stated in the Javadoc of `dropTableData`, the responsibility of this function is solely to clean up referenced files, not orphaned ones. Therefore, handling orphaned stats files within this method does not seem appropriate.

- Ajantha

On Sat, Feb 15, 2025 at 11:29 PM Leon Lin <lianglin....@gmail.com> wrote:

> Hi all,
>
> Recently, I came across an issue where some Puffin statistics files remain
> in storage after calling *dropTableData()*.
>
> The issue occurs because calling *updateStatistics()* or
> *updatePartitionStatistics()* on a snapshot ID that already has existing
> stat files commits a new metadata with the new statistics file path,
> without deleting the old stat files. Since the current implementation of
> *dropTableData()* only removes stat files referenced in the latest
> metadata, older stat files that were referenced in previous metadata
> versions remain in storage.
>
> I drafted a PR that iterates through historical metadata to collect all
> referenced stats files for deletion in dropTableData, but this adds
> significant I/O overhead. Some alternative ideas raised in the PR:
>
> - Introduce a flag to trigger iterating over historical metadata files
>   to cleanup all stat files
> - Delete old stats files when updating/removing stats (risks missing
>   files on rollback).
> - Track unreferenced stats files in the latest metadata (increases
>   metadata size).
>
> Each approach comes with trade-offs in terms of performance and
> complexity. I hope to get some insights and opinions from the community.
> Has this been discussed before, or is this the expected behavior? Are there
> any alternative approaches to handle this more efficiently?
>
> References:
>
> - PR: https://github.com/apache/iceberg/pull/12132
> - Issues: https://github.com/apache/iceberg/issues/11876,
>   https://github.com/trinodb/trino/issues/16583
> - Code:
>   https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/CatalogUtil.java#L94-L139
>
> Thanks,
> Leon