Gabor kindly pointed out to me in direct communication that I was mistaken to assert that "any files that already appear as `orphan` in current metadata.json are safe to remove." At the time a new metadata.json adding a file to an `orphan` list is committed, a reader could be performing a read using the previous metadata.json and planning to read the now-orphan statistics file. If a process that reads the new current metadata.json soon afterwards deletes the orphan file, it could delete the file out from under that reader. I therefore do not think that removing the previous orphan statistics file for a snapshot, as part of updating existing statistics for that snapshot, is a good idea. However, I think removing the orphan files as part of snapshot expiration is fine. Snapshot expiration always carries the potential to remove files from under a reader, but this is generic to all files associated with a snapshot and we live with this "unsafeness".
On Wed, Mar 5, 2025 at 7:01 PM Wing Yew Poon <wyp...@cloudera.com> wrote:

> Hi Gabor,
>
> I agree that with the use of table and partition statistics (and possibly other auxiliary files in the future), this problem of orphan files due to recomputation of existing statistics (replacing existing files without deleting them) will grow. I agree that while remove_orphan_files would delete such files, it would be good to have some other mechanism for cleaning them up.
>
> Your proposal is interesting and I think it is feasible. It would require a spec change. Can we introduce a change for this in v3? If so, for handling the existing cases of table statistics and partition statistics, I'd suggest introducing two fields in table metadata, `orphan-statistics` and `orphan-partition-statistics`, which would be a list of table statistics and a list of partition statistics respectively. If we want to be more general, maybe we can have `orphan-files` instead, which would also be a list. The (table) `statistics` and `partition-statistics` structs already contain `snapshot-id` fields, so I don't think we need a map of snapshot-id to file. For future use cases where a map keyed by snapshot-id could be useful, you are already assuming the files used correspond to snapshots, so it would also make sense for the struct representing them to contain a snapshot-id.
>
> When table statistics or partition statistics for a snapshot are updated, if there are existing statistics for that snapshot, the existing file needs to be written into this `orphan-*` list. I don't think we need to use the mechanism of 'write.statistics.delete-after-commit.enabled' and 'write.statistics.previous-versions-max'. Requiring the orphan files to be cleaned up (the list trimmed in metadata and the files deleted) during snapshot expiration might be enough, if snapshot expiration is run frequently enough. If we want an additional/optional way to clean up these orphan files: when table statistics or partition statistics for a snapshot are updated, in addition to writing the existing file into the `orphan-*` list, any file already in the `orphan-*` list for the same snapshot could be deleted and removed from the list as well. Note that any files that already appear as `orphan` in the current metadata.json are safe to remove. (We still need snapshot expiration to remove all referenced orphan files for old snapshots, but this would potentially keep the lists shorter.) However, I think this is extra.
>
> What do folks think?
>
> - Wing Yew
>
> ps. I also found that the `statistics` and `partition-statistics` fields in table metadata are lists, with the unwritten expectation (that is to say, not written into the spec) that for each snapshot there is at most one file in the list. I also considered the idea that we could just add updated statistics to the list without removing the existing statistics (this would be allowed by the spec), ensuring that the first one (or last one) for a snapshot is the latest one and thus the one to use. That way we would not need a spec change, but much existing implementation would need to change, and I think it is too complicated anyway.
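For concreteness, a sketch of how the suggested fields might look in a metadata.json. The `orphan-statistics` and `orphan-partition-statistics` names are the ones proposed above; the entries reuse the existing statistics struct fields, required fields such as `file-footer-size-in-bytes` and `blob-metadata` are elided, and the paths and snapshot IDs are illustrative. None of this is part of the current spec:

    {
      "statistics": [
        {"snapshot-id": 42, "statistics-path": "s3://bucket/tbl/metadata/stats-new.puffin", "file-size-in-bytes": 100}
      ],
      "orphan-statistics": [
        {"snapshot-id": 42, "statistics-path": "s3://bucket/tbl/metadata/stats-old.puffin", "file-size-in-bytes": 90}
      ],
      "orphan-partition-statistics": [
        {"snapshot-id": 42, "statistics-path": "s3://bucket/tbl/metadata/partition-stats-old.parquet", "file-size-in-bytes": 80}
      ]
    }

Under this proposal, snapshot expiration would delete the files listed in the `orphan-*` lists for expired snapshots and trim the corresponding entries from the metadata.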
> On Thu, Feb 27, 2025 at 5:31 AM Gabor Kaszab <gaborkas...@apache.org> wrote:

>> Thanks for the discussion on this topic during the community sync! Let me sum up what we discussed and also follow up with some additional thoughts.
>>
>> *Summary:*
>> As long as the table is there, users can run orphan file cleanup to remove the orphaned stat files. If you drop the table, the orphaned stat files unfortunately remain on disk. This, however, is a catalog matter, tied to location ownership for the table.
>>
>> *My follow-up thoughts:*
>> - Orphan file cleanup is not always feasible, e.g. when tables share locations.
>> - Orphan files are expected when something goes wrong. With stat files, now even successful queries can create orphan files.
>> - Over time there seem to be more and more ways of creating orphan files even in the successful case: table stats, (soon coming) partition stats, and who knows what else (col stats? indexes?). The situation might not be that severe now but could get worse over time.
>> - Users seem to complain even about the /data and /metadata folders remaining on storage after a drop table. Remaining stat files could be another source of recurring complaints.
>>
>> Even though orphan file removal (where feasible) could be a solution for the symptom here, I think the table format should also offer a way to take care of the root cause (files becoming unreferenced when they are updated).
>>
>> *Proposal:*
>> What I have in mind to tackle the root cause is to keep track of not only the current stat files but also the historical ones in the table metadata. This would increase the size of the metadata for sure, but it could be kept at a manageable size with a mechanism similar to what we do with the historical metadata.jsons. In practice we could have flags like:
>> 'write.statistics.delete-after-commit.enabled'
>> 'write.statistics.previous-versions-max'
>>
>> Or, if we want to make this even more generic, these new flags could control not just stat files but everything else we come up with (e.g. col stats, indexes, etc.). The naming could probably be better, but something like:
>> 'write.historical-files.delete-after-commit.enabled'
>> 'write.historical-files.previous-versions-max'
>>
>> With this we could:
>> - Keep a mapping of 'SnapshotID to historical files' in the table metadata.
>> - Delete the historical (stat) file(s) when updating the current one, if we exceed the threshold history length.
>> - Delete the historical (stat) files when expiring snapshots.
>> - Delete the historical (stat) files when dropping the table.
>> - Take care of all kinds of files that are assigned 'offline' to a snapshot (like stats now), if we choose a generic enough representation, like:
>> Map<SnapshotID, Map<FileType, List<Path>>>
>> (where FileType could be 'TABLE_STATS', 'PARTITION_STATS', etc.)
>> - Take care of historical metadata.json paths too; however, that would change the existing layout of the metadata.json files, so we most probably don't want to do that.
>>
>> Let me know what you think!
>> Regards,
>> Gabor
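The flags Gabor proposes would mirror the retention properties that already exist for historical metadata.json files. A sketch of both, for comparison; only the `write.metadata.*` properties exist today, the other names are the suggestions from the proposal above, and all values are illustrative:

    # Existing retention mechanism for historical metadata.json files:
    write.metadata.delete-after-commit.enabled=true
    write.metadata.previous-versions-max=100

    # Proposed analogue for stat files (not implemented):
    write.statistics.delete-after-commit.enabled=true
    write.statistics.previous-versions-max=5

    # Or the generic variant covering any files attached 'offline' to a snapshot:
    write.historical-files.delete-after-commit.enabled=true
    write.historical-files.previous-versions-max=5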
>> On Mon, Feb 24, 2025 at 10:30 AM Gabor Kaszab <gaborkas...@apache.org> wrote:

>>> Hi All,
>>>
>>> `remove_orphan_files` certainly drops the previous stat files, but if you drop the table they will remain on disk forever. I don't have an answer here (I reviewed the above mentioned PR and raised these concerns there), but I think we should figure out a way to avoid accumulating unreferenced stat files. With this we have introduced another easy way of creating orphan files, and we can't always rely on users running `remove_orphan_files` (multiple tables may share the same location, or users may not have an automated process for removing orphaned files). The introduction of partition stats could make the situation even worse.
>>>
>>> I'm thinking of an improvement where we drop the previous stat file, if there is one, when updating the table with a new one, but in theory that could cause issues when some reader is currently reading that stat file. Not sure if there is a sophisticated solution here.
>>>
>>> I wonder what others think.
>>>
>>> Gabor
>>>
>>> On Tue, Feb 18, 2025 at 9:07 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>
>>>> I believe the reason stats files allow replacing statistics with the same snapshot ID is to enable the recomputation of optional stats for the same snapshot. This process does leave the old stats files orphaned, but they will be properly cleaned up by the `remove_orphan_files` action or procedure.
>>>>
>>>> As stated in the Javadoc of `dropTableData`, the responsibility of this function is solely to clean up referenced files, not orphaned ones. Therefore, handling orphaned stats files within this method does not seem appropriate.
>>>>
>>>> - Ajantha
>>>>
>>>> On Sat, Feb 15, 2025 at 11:29 PM Leon Lin <lianglin....@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Recently, I came across an issue where some Puffin statistics files remain in storage after calling *dropTableData()*.
>>>>>
>>>>> The issue occurs because calling *updateStatistics()* or *updatePartitionStatistics()* on a snapshot ID that already has existing stat files commits new metadata with the new statistics file path, without deleting the old stat files. Since the current implementation of *dropTableData()* only removes stat files referenced in the latest metadata, older stat files that were referenced in previous metadata versions remain in storage.
>>>>>
>>>>> I drafted a PR that iterates through historical metadata to collect all referenced stats files for deletion in dropTableData, but this adds significant I/O overhead. Some alternative ideas raised in the PR:
>>>>>
>>>>> - Introduce a flag to trigger iterating over historical metadata files to clean up all stat files.
>>>>> - Delete old stats files when updating/removing stats (risks missing files on rollback).
>>>>> - Track unreferenced stats files in the latest metadata (increases metadata size).
>>>>>
>>>>> Each approach comes with trade-offs in terms of performance and complexity. I hope to get some insights and opinions from the community: has this been discussed before, or is this the expected behavior? Are there any alternative approaches to handle this more efficiently?
>>>>>
>>>>> References:
>>>>>
>>>>> - PR: https://github.com/apache/iceberg/pull/12132
>>>>> - Issues: https://github.com/apache/iceberg/issues/11876, https://github.com/trinodb/trino/issues/16583
>>>>> - Code: https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/CatalogUtil.java#L94-L139
>>>>>
>>>>> Thanks,
>>>>> Leon
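For reference, a minimal sketch in Java of the sequence Leon describes, against the Iceberg API. The table handle, paths, and file sizes are illustrative, and blob metadata is elided:

    import java.util.List;
    import org.apache.iceberg.CatalogUtil;
    import org.apache.iceberg.GenericStatisticsFile;
    import org.apache.iceberg.HasTableOperations;
    import org.apache.iceberg.StatisticsFile;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.TableOperations;

    class StatsFileOrphanSketch {
      static void reproduce(Table table) {
        long snapshotId = table.currentSnapshot().snapshotId();

        // Commit stats file A; the current metadata.json now references it.
        StatisticsFile statsA = new GenericStatisticsFile(
            snapshotId, "s3://bucket/tbl/metadata/stats-A.puffin",
            100L /* fileSizeInBytes */, 20L /* fileFooterSizeInBytes */,
            List.of() /* blob metadata elided */);
        table.updateStatistics().setStatistics(snapshotId, statsA).commit();

        // Recompute and commit stats file B for the SAME snapshot. The new
        // metadata.json replaces the entry for snapshotId, so A is no longer
        // referenced anywhere, but nothing deletes it from storage.
        StatisticsFile statsB = new GenericStatisticsFile(
            snapshotId, "s3://bucket/tbl/metadata/stats-B.puffin",
            120L /* fileSizeInBytes */, 20L /* fileFooterSizeInBytes */,
            List.of());
        table.updateStatistics().setStatistics(snapshotId, statsB).commit();

        // dropTableData() deletes only the files referenced by the metadata
        // it is handed (here the latest), so B is removed and A is left behind.
        TableOperations ops = ((HasTableOperations) table).operations();
        CatalogUtil.dropTableData(ops.io(), ops.current());
      }
    }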