Gabor kindly pointed out to me in direct communication that I was mistaken to assert that "any files that already appear as `orphan` in current metadata.json are safe to remove." At the time a new metadata.json adding a file to an `orphan` list is committed, a reader could be performing a read using the previous metadata.json and planning to read the now-orphan statistics file. If a process that reads the new current metadata.json soon afterwards deletes the orphan file, it could delete the file out from under that reader. I therefore do not think that removing the previous orphan statistics file for a snapshot, as part of updating existing statistics for that snapshot, is a good idea. However, I think removing the orphan files as part of snapshot expiration is fine. Snapshot expiration always carries the potential to remove files from under a reader, but this is generic to all files associated with a snapshot and we live with this "unsafeness".
On Wed, Mar 5, 2025 at 7:01 PM Wing Yew Poon <wyp...@cloudera.com> wrote:

> Hi Gabor,
>
> I agree that with the use of table and partition statistics (and possibly other auxiliary files in the future), this problem of orphan files due to recomputation of existing statistics (replacing existing files without deleting them) will grow. I agree that while remove_orphan_files would delete such files, it would be good to have some other mechanism for cleaning them up.
>
> Your proposal is interesting and I think it is feasible. It would require a spec change. Can we introduce a change for this in v3? If so, for handling the existing cases of table statistics and partition statistics, I'd suggest introducing two fields in table metadata, `orphan-statistics` and `orphan-partition-statistics`, which would be a list of table statistics and a list of partition statistics respectively. If we want to be more general, maybe we can have `orphan-files` instead, which would also be a list. The (table) `statistics` and `partition-statistics` structs already contain `snapshot-id` fields, so I don't think we need a map of snapshot-id to file. For future use cases where a map keyed by snapshot-id could be useful, you are already assuming the files used correspond to snapshots, so it would also make sense for the struct representing them to contain a snapshot-id.
>
> When table statistics or partition statistics for a snapshot are updated, if there are existing statistics for that snapshot, the existing file needs to be written into this `orphan-*` list. I don't think we need to use the mechanism of 'write.statistics.delete-after-commit.enabled' and 'write.statistics.previous-versions-max'. Requiring the orphan files to be cleaned up (the list trimmed in metadata and the files deleted) during snapshot expiration might be enough, if snapshot expiration is run frequently enough. If we want an additional/optional way to clean up these orphan files: when table statistics or partition statistics for a snapshot are updated, in addition to writing the existing file into the `orphan-*` list, any file already in the `orphan-*` list for the same snapshot could be deleted and removed from the list as well. Note that any files that already appear as `orphan` in the current metadata.json are safe to remove. (We still need snapshot expiration to remove all referenced orphan files for old snapshots, but this would potentially keep the lists shorter.) However, I think this is extra.
>
> What do folks think?
>
> - Wing Yew
>
> ps. I also found that the `statistics` and `partition-statistics` fields in table metadata are lists, with the unwritten expectation (that is to say, not written into the spec) that for each snapshot there is at most one file in the list. I also considered the idea that we could just add updated statistics to the list without removing the existing statistics (this would be allowed by the spec), ensuring that the first one (or last one) for a snapshot is the latest one and thus the one to use. That way we would not need a spec change, but much existing implementation would need to change, and I think it is too complicated anyway.
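For concreteness, a sketch of how the suggested fields might look in a metadata.json. The `orphan-statistics` and `orphan-partition-statistics` names are the ones proposed above; the entries reuse the existing statistics struct fields, required fields such as `file-footer-size-in-bytes` and `blob-metadata` are elided, and the paths and snapshot IDs are illustrative. None of this is part of the current spec:

    {
      "statistics": [
        {"snapshot-id": 42, "statistics-path": "s3://bucket/tbl/metadata/stats-new.puffin", "file-size-in-bytes": 100}
      ],
      "orphan-statistics": [
        {"snapshot-id": 42, "statistics-path": "s3://bucket/tbl/metadata/stats-old.puffin", "file-size-in-bytes": 90}
      ],
      "orphan-partition-statistics": [
        {"snapshot-id": 42, "statistics-path": "s3://bucket/tbl/metadata/partition-stats-old.parquet", "file-size-in-bytes": 80}
      ]
    }

Under this proposal, snapshot expiration would delete the files listed in the `orphan-*` lists for expired snapshots and trim the corresponding entries from the metadata.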
> On Thu, Feb 27, 2025 at 5:31 AM Gabor Kaszab <gaborkas...@apache.org> wrote:

>> Thanks for the discussion on this topic during the community sync! Let me sum up what we discussed and also follow up with some additional thoughts.
>>
>> *Summary:*
>> As long as the table is there, users can run orphan file cleanup to remove the orphaned stat files. If you drop the table, the orphaned stat files unfortunately remain on disk. This, however, is a catalog matter, tied to location ownership for the table.
>>
>> *My follow-up thoughts:*
>> - Orphan file cleanup is not always feasible, e.g. when tables share locations.
>> - Orphan files are expected when something goes wrong. With stat files, now even successful queries can create orphan files.
>> - Over time there seem to be more and more ways of creating orphan files even in the successful case: table stats, (soon coming) partition stats, and who knows what else (col stats? indexes?). The situation might not be that severe now but could get worse over time.
>> - Users seem to complain even about the /data and /metadata folders remaining on storage after a drop table. Remaining stat files could be another source of recurring complaints.
>>
>> Even though orphan file removal (where feasible) could be a solution for the symptom here, I think the table format should also offer a way to take care of the root cause (files becoming unreferenced when they are updated).
>>
>> *Proposal:*
>> What I have in mind to tackle the root cause is to keep track of not only the current stat files but also the historical ones in the table metadata. This would increase the size of the metadata for sure, but it could be kept at a manageable size with a mechanism similar to what we do with the historical metadata.jsons. In practice we could have flags like:
>> 'write.statistics.delete-after-commit.enabled'
>> 'write.statistics.previous-versions-max'
>>
>> Or, if we want to make this even more generic, these new flags could control not just stat files but everything else we come up with (e.g. col stats, indexes, etc.). The naming could probably be better, but something like:
>> 'write.historical-files.delete-after-commit.enabled'
>> 'write.historical-files.previous-versions-max'
>>
>> With this we could:
>> - Keep a mapping of 'SnapshotID to historical files' in the table metadata.
>> - Delete the historical (stat) file(s) when updating the current one, if we exceed the threshold history length.
>> - Delete the historical (stat) files when expiring snapshots.
>> - Delete the historical (stat) files when dropping the table.
>> - Take care of all kinds of files that are assigned 'offline' to a snapshot (like stats now), if we choose a generic enough representation, like:
>> Map<SnapshotID, Map<FileType, List<Path>>>
>> (where FileType could be 'TABLE_STATS', 'PARTITION_STATS', etc.)
>> - Take care of historical metadata.json paths too; however, that would change the existing layout of the metadata.json files, so we most probably don't want to do that.
>>
>> Let me know what you think!
>> Regards,
>> Gabor
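The flags Gabor proposes would mirror the retention properties that already exist for historical metadata.json files. A sketch of both, for comparison; only the `write.metadata.*` properties exist today, the other names are the suggestions from the proposal above, and all values are illustrative:

    # Existing retention mechanism for historical metadata.json files:
    write.metadata.delete-after-commit.enabled=true
    write.metadata.previous-versions-max=100

    # Proposed analogue for stat files (not implemented):
    write.statistics.delete-after-commit.enabled=true
    write.statistics.previous-versions-max=5

    # Or the generic variant covering any files attached 'offline' to a snapshot:
    write.historical-files.delete-after-commit.enabled=true
    write.historical-files.previous-versions-max=5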
>> On Mon, Feb 24, 2025 at 10:30 AM Gabor Kaszab <gaborkas...@apache.org> wrote:

>>> Hi All,
>>>
>>> `remove_orphan_files` certainly drops the previous stat files, but if you drop the table they will remain on disk forever. I don't have an answer here (I reviewed the above mentioned PR and raised these concerns there), but I think we should figure out a way to avoid accumulating unreferenced stat files. With this we have introduced another easy way of creating orphan files, and we can't always rely on users running `remove_orphan_files` (multiple tables may share the same location, or users may not have an automated process for removing orphaned files). The introduction of partition stats could make the situation even worse.
>>>
>>> I'm thinking of an improvement where we drop the previous stat file, if there is one, when updating the table with a new one, but in theory that could cause issues when some reader is currently reading that stat file. Not sure if there is a sophisticated solution here.
>>>
>>> I wonder what others think.
>>>
>>> Gabor
>>>
>>> On Tue, Feb 18, 2025 at 9:07 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>
>>>> I believe the reason stats files allow replacing statistics with the same snapshot ID is to enable the recomputation of optional stats for the same snapshot. This process does leave the old stats files orphaned, but they will be properly cleaned up by the `remove_orphan_files` action or procedure.
>>>>
>>>> As stated in the Javadoc of `dropTableData`, the responsibility of this function is solely to clean up referenced files, not orphaned ones. Therefore, handling orphaned stats files within this method does not seem appropriate.
>>>>
>>>> - Ajantha
>>>>
>>>> On Sat, Feb 15, 2025 at 11:29 PM Leon Lin <lianglin....@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Recently, I came across an issue where some Puffin statistics files remain in storage after calling *dropTableData()*.
>>>>>
>>>>> The issue occurs because calling *updateStatistics()* or *updatePartitionStatistics()* on a snapshot ID that already has existing stat files commits new metadata with the new statistics file path, without deleting the old stat files. Since the current implementation of *dropTableData()* only removes stat files referenced in the latest metadata, older stat files that were referenced in previous metadata versions remain in storage.
>>>>>
>>>>> I drafted a PR that iterates through historical metadata to collect all referenced stats files for deletion in dropTableData, but this adds significant I/O overhead. Some alternative ideas raised in the PR:
>>>>>
>>>>> - Introduce a flag to trigger iterating over historical metadata files to clean up all stat files.
>>>>> - Delete old stats files when updating/removing stats (risks missing files on rollback).
>>>>> - Track unreferenced stats files in the latest metadata (increases metadata size).
>>>>>
>>>>> Each approach comes with trade-offs in terms of performance and complexity. I hope to get some insights and opinions from the community: has this been discussed before, or is this the expected behavior? Are there any alternative approaches to handle this more efficiently?
>>>>>
>>>>> References:
>>>>>
>>>>> - PR: https://github.com/apache/iceberg/pull/12132
>>>>> - Issues: https://github.com/apache/iceberg/issues/11876, https://github.com/trinodb/trino/issues/16583
>>>>> - Code: https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/CatalogUtil.java#L94-L139
>>>>>
>>>>> Thanks,
>>>>> Leon
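For reference, a minimal sketch in Java of the sequence Leon describes, against the Iceberg API. The table handle, paths, and file sizes are illustrative, and blob metadata is elided:

    import java.util.List;
    import org.apache.iceberg.CatalogUtil;
    import org.apache.iceberg.GenericStatisticsFile;
    import org.apache.iceberg.HasTableOperations;
    import org.apache.iceberg.StatisticsFile;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.TableOperations;

    class StatsFileOrphanSketch {
      static void reproduce(Table table) {
        long snapshotId = table.currentSnapshot().snapshotId();

        // Commit stats file A; the current metadata.json now references it.
        StatisticsFile statsA = new GenericStatisticsFile(
            snapshotId, "s3://bucket/tbl/metadata/stats-A.puffin",
            100L /* fileSizeInBytes */, 20L /* fileFooterSizeInBytes */,
            List.of() /* blob metadata elided */);
        table.updateStatistics().setStatistics(snapshotId, statsA).commit();

        // Recompute and commit stats file B for the SAME snapshot. The new
        // metadata.json replaces the entry for snapshotId, so A is no longer
        // referenced anywhere, but nothing deletes it from storage.
        StatisticsFile statsB = new GenericStatisticsFile(
            snapshotId, "s3://bucket/tbl/metadata/stats-B.puffin",
            120L /* fileSizeInBytes */, 20L /* fileFooterSizeInBytes */,
            List.of());
        table.updateStatistics().setStatistics(snapshotId, statsB).commit();

        // dropTableData() deletes only the files referenced by the metadata
        // it is handed (here the latest), so B is removed and A is left behind.
        TableOperations ops = ((HasTableOperations) table).operations();
        CatalogUtil.dropTableData(ops.io(), ops.current());
      }
    }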