Hi all,

Recently, I came across an issue where some Puffin statistics files remain
in storage after calling *dropTableData()*.

The issue occurs because calling *updateStatistics()* or
*updatePartitionStatistics()* on a snapshot ID that already has existing
stats files commits new metadata with the new statistics file path,
without deleting the old stats files. Since the current implementation of
*dropTableData()* only removes stats files referenced in the latest
metadata, older stats files that were referenced by previous metadata
versions remain in storage.
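To make the mechanism concrete, here is a small self-contained toy model of the behavior (plain Java collections, not the actual Iceberg classes or API; each metadata version is modeled as a snapshot-ID-to-stats-path map):

```java
import java.util.*;

public class OrphanedStatsDemo {

    // Simulates: write stats, overwrite them for the same snapshot via a
    // new metadata commit, then run a drop that only reads latest metadata.
    static Set<String> orphanedAfterDrop() {
        List<Map<Long, String>> metadataHistory = new ArrayList<>();
        Set<String> storage = new HashSet<>();

        // metadata v1: snapshot 1 references stats-a.puffin
        storage.add("stats-a.puffin");
        metadataHistory.add(Map.of(1L, "stats-a.puffin"));

        // updateStatistics() on the same snapshot commits new metadata
        // pointing at stats-b.puffin; the old file is not deleted.
        storage.add("stats-b.puffin");
        metadataHistory.add(Map.of(1L, "stats-b.puffin"));

        // dropTableData() deletes only files referenced by latest metadata.
        Map<Long, String> latest = metadataHistory.get(metadataHistory.size() - 1);
        storage.removeAll(latest.values());

        return storage; // whatever survives the drop is orphaned
    }

    public static void main(String[] args) {
        System.out.println(orphanedAfterDrop()); // [stats-a.puffin]
    }
}
```

Running this leaves stats-a.puffin in "storage", which is exactly the leak observed with real Puffin files.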

I drafted a PR that iterates through historical metadata to collect all
referenced stats files for deletion in *dropTableData()*, but this adds
significant I/O overhead. Some alternative ideas raised in the PR:

   - Introduce a flag that triggers iterating over historical metadata
   files to clean up all stats files.
   - Delete old stats files when updating/removing stats (risks missing
   files on rollback).
   - Track unreferenced stats files in the latest metadata (increases
   metadata size).
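For reference, the first option (and what my PR does) boils down to taking the union of stats paths across every historical metadata version before deleting. A minimal sketch, again using a toy snapshot-ID-to-stats-path model rather than Iceberg's real metadata classes:

```java
import java.util.*;

public class StatsFileCollector {

    // Union of stats-file paths referenced by every historical metadata
    // version, so deletion misses nothing. In the real implementation each
    // history entry is a separate metadata-file read, which is where the
    // extra I/O overhead comes from.
    static Set<String> allReferencedStatsFiles(List<Map<Long, String>> history) {
        Set<String> referenced = new HashSet<>();
        for (Map<Long, String> metadata : history) {
            referenced.addAll(metadata.values());
        }
        return referenced;
    }

    public static void main(String[] args) {
        List<Map<Long, String>> history = List.of(
            Map.of(1L, "stats-a.puffin"),   // older metadata version
            Map.of(1L, "stats-b.puffin"));  // latest metadata version
        System.out.println(allReferencedStatsFiles(history));
    }
}
```

The cost is one extra metadata read per historical version, which is why a flag (or some bookkeeping in the latest metadata) seems necessary to avoid paying it on every drop.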

Each approach comes with trade-offs in terms of performance and complexity.
I hope to get some insights and opinions from the community—has this been
discussed before, or is this the expected behavior? Are there any
alternative approaches to handle this more efficiently?

References:

   - PR: https://github.com/apache/iceberg/pull/12132
   - Issues: https://github.com/apache/iceberg/issues/11876,
   https://github.com/trinodb/trino/issues/16583
   - Code:
   https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/CatalogUtil.java#L94-L139

Thanks,
Leon
