Re: [DISCUSS] Cleanup unreferenced statistics files through DropTableData

2025-03-14 Thread Gabor Kaszab
Thanks for taking a look and answering, Wing Yew! I agree that, with the proposed design, snapshot expiration should remove the historical statistics files. Additionally, I think it also makes sense to keep at most x historical stat files in metadata, and when we exceed this threshold, we could

Re: [DISCUSS] Cleanup unreferenced statistics files through DropTableData

2025-03-07 Thread Wing Yew Poon
Gabor kindly pointed out to me in direct communication that I was mistaken to assert that "any files that already appear as `orphan` in current metadata.json are safe to remove." At the time a new metadata.json is committed adding a file to an `orphan` list, a reader could be performing a read usin

Re: [DISCUSS] Cleanup unreferenced statistics files through DropTableData

2025-03-05 Thread Wing Yew Poon
Hi Gabor, I agree that with the use of table and partition statistics (and possibly other auxiliary files in the future), this problem of orphan files due to recomputation of existing statistics (replacing existing files without deleting them) will grow. I agree that while remove_orphan_files woul

Re: [DISCUSS] Cleanup unreferenced statistics files through DropTableData

2025-02-27 Thread Gabor Kaszab
Thanks for the discussion on this topic during the community sync! Let me sum up what we discussed and also follow up with some additional thoughts. *Summary:* As long as the table is there, users can run orphan file cleanup to remove the orphaned stat files. If you drop the table the orphaned stat

Re: [DISCUSS] Cleanup unreferenced statistics files through DropTableData

2025-02-24 Thread Gabor Kaszab
Hi All, `remove_orphan_files` does drop the previous stat files, but if you drop the table, they will remain on disk forever. I don't have an answer here (I reviewed the above-mentioned PR and raised these concerns there) but I think we should figure out a way to avoid accumulating unrefe

Re: [DISCUSS] Cleanup unreferenced statistics files through DropTableData

2025-02-18 Thread Ajantha Bhat
I believe the reason stats files allow replacing statistics with the same snapshot ID is to enable the recomputation of optional stats for the same snapshot. This process does leave the old stats files orphaned, but they will be properly cleaned up by the `remove_orphan_files` action or procedure.
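
To make the scenario in this thread concrete, here is a minimal toy model in Python. It is only a sketch and does not use Iceberg's API: `ToyTable`, its fields, and the file names are hypothetical, and the methods only borrow the names used in the discussion. It assumes metadata keeps a single statistics file per snapshot ID, that orphan cleanup deletes whatever storage holds that metadata does not reference, and that dropping table data deletes only the files the current metadata references. It ignores everything else a real cleanup would take into account.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy stand-in for a table; hypothetical, not Iceberg's real classes."""
    # snapshot ID -> path of the statistics file currently referenced by metadata
    stats_by_snapshot: dict = field(default_factory=dict)
    # every file physically present in storage
    storage: set = field(default_factory=set)

    def update_statistics(self, snapshot_id, path):
        # Write the new file and point metadata at it. Any previous file for
        # the same snapshot ID stays in storage but is no longer referenced.
        self.storage.add(path)
        self.stats_by_snapshot[snapshot_id] = path

    def referenced_files(self):
        return set(self.stats_by_snapshot.values())

    def remove_orphan_files(self):
        # Assumption: delete everything in storage that metadata does not reference.
        orphans = self.storage - self.referenced_files()
        self.storage -= orphans
        return orphans

    def drop_table_data(self):
        # Assumption: delete only what the current metadata references.
        self.storage -= self.referenced_files()
        self.stats_by_snapshot.clear()


def demo():
    table = ToyTable()
    table.update_statistics(1, "stats-a.puffin")
    table.update_statistics(1, "stats-b.puffin")  # recomputed for the same snapshot

    # Copy the state so both cleanup paths start from the same situation.
    dropped = ToyTable(dict(table.stats_by_snapshot), set(table.storage))

    cleaned = table.remove_orphan_files()  # table still exists: orphan is found
    dropped.drop_table_data()              # table dropped: orphan is never visited
    return cleaned, dropped.storage
```

In this model, `demo()` returns the replaced file as the one removed by orphan cleanup, and the same file as the one left behind in storage after the drop, which is the gap the thread is about.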

[DISCUSS] Cleanup unreferenced statistics files through DropTableData

2025-02-15 Thread Leon Lin
Hi all, Recently, I came across an issue where some Puffin statistics files remain in storage after calling *dropTableData()*. The issue occurs because calling *updateStatistics()* or *updatePartitionStatistics()* on a snapshot ID that already has existing stat files commits a new metadata with t