Thanks for taking a look and answering, Wing Yew!
I agree that, with the proposed design, snapshot expiration should remove the
historical statistics files. Additionally, I think it makes sense to
keep at most x historical stat files in metadata, and when we exceed
this threshold, we could
Gabor kindly pointed out to me in direct communication that I was mistaken
to assert that "any files that already appear as `orphan` in current
metadata.json are safe to remove." At the time a new metadata.json is
committed adding a file to an `orphan` list, a reader could be performing a
read usin
Hi Gabor,
I agree that with the use of table and partition statistics (and possibly
other auxiliary files in the future), this problem of orphan files due to
recomputation of existing statistics (replacing existing files without
deleting them) will grow. I agree that while remove_orphan_files woul
Thanks for the discussion on this topic during the community sync!
Let me sum up what we discussed and also follow up with some additional
thoughts.
*Summary:*
As long as the table is there, users can run orphan file cleanup to remove
the orphaned stat files.
If you drop the table the orphaned stat
Hi All,
`remove_orphan_files` does remove the previous stat files, but if
you drop the table they will remain on disk forever. I don't have an answer
here (I reviewed the above mentioned PR and raised these concerns there)
but I think we should figure out a way to avoid accumulating unrefe
I believe the reason stats files allow replacing statistics with the same
snapshot ID is to enable the recomputation of optional stats for the same
snapshot. This process does leave the old stats files orphaned, but they
will be properly cleaned up by the `remove_orphan_files` action or
procedure.
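To make the replace-then-orphan behavior described above concrete, here is a minimal toy model in Python. It is only a sketch: `ToyTable`, its methods, and the file names are hypothetical stand-ins, not the Iceberg API, and the real `remove_orphan_files` additionally applies an age threshold that is omitted here.

```python
# Toy model only: the class, method names, and paths are hypothetical,
# not the Iceberg API.
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    # snapshot ID -> path of the stats file currently referenced by metadata
    stats_by_snapshot: dict = field(default_factory=dict)
    # every file physically present in storage
    storage: set = field(default_factory=set)

    def update_statistics(self, snapshot_id, path):
        """Write a new stats file and point metadata at it.

        A previously referenced file for the same snapshot stays in storage.
        """
        self.storage.add(path)
        self.stats_by_snapshot[snapshot_id] = path

    def referenced_files(self):
        """Return the set of paths that current metadata points at."""
        return set(self.stats_by_snapshot.values())

    def remove_orphan_files(self):
        """Delete every stored file that current metadata does not reference."""
        orphans = self.storage - self.referenced_files()
        self.storage -= orphans
        return orphans


table = ToyTable()
table.update_statistics(1, "stats-1-a.puffin")
table.update_statistics(1, "stats-1-b.puffin")  # recompute for the same snapshot

# The first file is still on disk but no longer referenced.
orphans_before_cleanup = table.storage - table.referenced_files()
removed = table.remove_orphan_files()
```

In this model the first file becomes unreferenced as soon as the second commit lands, and only a scan of storage against current metadata finds it again.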
Hi all,
Recently, I came across an issue where some Puffin statistics files remain
in storage after calling *dropTableData()*.
The issue occurs because calling *updateStatistics()* or
*updatePartitionStatistics()* on a snapshot ID that already has existing
stat files commits a new metadata with t