Hi All,

`remove_orphan_files` does drop the previous stat files, but if you drop the table they will remain on disk forever. I don't have an answer here (I reviewed the above-mentioned PR and raised these concerns there), but I think we should figure out a way to avoid accumulating unreferenced stat files. With this we have introduced another easy way of creating orphan files, and we can't always rely on users running `remove_orphan_files` (maybe multiple tables share the same location, or users don't have an automated process for removing orphaned files?). The introduction of partition stats could make the situation even worse.
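For reference, a minimal sketch of that cleanup via the Spark actions API, assuming the iceberg-spark runtime is on the classpath (the table name and retention window below are placeholders):

import java.util.concurrent.TimeUnit;
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.DeleteOrphanFiles;
import org.apache.iceberg.spark.Spark3Util;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class OrphanStatsCleanup {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().getOrCreate();
    Table table = Spark3Util.loadIcebergTable(spark, "db.tbl"); // placeholder table name

    // Deletes files under the table location that the table metadata no
    // longer references, which includes replaced Puffin stats files.
    DeleteOrphanFiles.Result result =
        SparkActions.get(spark)
            .deleteOrphanFiles(table)
            // keep recent files so in-flight writes are not deleted
            .olderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(3))
            .execute();

    result.orphanFileLocations().forEach(path -> System.out.println("Deleted: " + path));
  }
}

The same cleanup is exposed as the `remove_orphan_files` Spark procedure for SQL-only environments, but note that it is exactly the action that is unsafe when multiple tables share one location.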
I'm thinking of an improvement where we drop the previous stat file, if there is one, when updating the table with a new one, but in theory that could cause issues if some reader is currently reading that stat file. I'm not sure there is a sophisticated solution here. I wonder what others think.

Gabor

On Tue, Feb 18, 2025 at 9:07 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:

> I believe the reason stats files allow replacing statistics with the same
> snapshot ID is to enable the recomputation of optional stats for the same
> snapshot. This process does leave the old stats files orphaned, but they
> will be properly cleaned up by the `remove_orphan_files` action or
> procedure.
>
> As stated in the Javadoc of `dropTableData`, the responsibility of this
> function is solely to clean up referenced files, not orphaned ones.
> Therefore, handling orphaned stats files within this method does not seem
> appropriate.
>
> - Ajantha
>
> On Sat, Feb 15, 2025 at 11:29 PM Leon Lin <lianglin....@gmail.com> wrote:
>
>> Hi all,
>>
>> Recently, I came across an issue where some Puffin statistics files
>> remain in storage after calling `dropTableData()`.
>>
>> The issue occurs because calling `updateStatistics()` or
>> `updatePartitionStatistics()` on a snapshot ID that already has existing
>> stats files commits new metadata with the new statistics file path,
>> without deleting the old stats files. Since the current implementation
>> of `dropTableData()` only removes stats files referenced in the latest
>> metadata, older stats files that were referenced in previous metadata
>> versions remain in storage.
>>
>> I drafted a PR that iterates through historical metadata to collect all
>> referenced stats files for deletion in `dropTableData()`, but this adds
>> significant I/O overhead. Some alternative ideas raised in the PR:
>>
>> - Introduce a flag that triggers iterating over historical metadata
>>   files to clean up all stats files.
>> - Delete old stats files when updating/removing stats (risks missing
>>   files on rollback).
>> - Track unreferenced stats files in the latest metadata (increases
>>   metadata size).
>>
>> Each approach comes with trade-offs in terms of performance and
>> complexity. I hope to get some insights and opinions from the community.
>> Has this been discussed before, or is this the expected behavior? Are
>> there any alternative approaches to handle this more efficiently?
>>
>> References:
>>
>> - PR: https://github.com/apache/iceberg/pull/12132
>> - Issues: https://github.com/apache/iceberg/issues/11876,
>>   https://github.com/trinodb/trino/issues/16583
>> - Code: https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/CatalogUtil.java#L94-L139
>>
>> Thanks,
>> Leon
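To make the replacement behavior Leon describes concrete, here is a minimal sketch against the core Java API's `UpdateStatistics` interface (the table, file paths, and sizes are placeholders, and newer releases may prefer a `setStatistics` variant that takes the snapshot ID from the statistics file itself):

import java.util.Collections;
import org.apache.iceberg.GenericStatisticsFile;
import org.apache.iceberg.StatisticsFile;
import org.apache.iceberg.Table;

class StatsReplacementSketch {
  // Commits statistics twice for the same snapshot. The first Puffin file
  // ends up unreferenced by the latest metadata, so dropTableData(), which
  // reads only the latest metadata, never deletes it.
  static void replaceStats(Table table) {
    long snapshotId = table.currentSnapshot().snapshotId();

    StatisticsFile v1 =
        new GenericStatisticsFile(
            snapshotId,
            "s3://bucket/tbl/metadata/stats-v1.puffin", // placeholder path
            100L, // placeholder file size in bytes
            10L, // placeholder footer size in bytes
            Collections.emptyList());
    table.updateStatistics().setStatistics(snapshotId, v1).commit();

    // Recomputing stats for the same snapshot commits new metadata that
    // references only the new file; stats-v1.puffin stays on disk, orphaned.
    StatisticsFile v2 =
        new GenericStatisticsFile(
            snapshotId,
            "s3://bucket/tbl/metadata/stats-v2.puffin", // placeholder path
            120L,
            10L,
            Collections.emptyList());
    table.updateStatistics().setStatistics(snapshotId, v2).commit();
  }
}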