Thanks for the discussion on this topic during the community sync! Let me sum up what we discussed and also follow up with some additional thoughts.
*Summary:*
As long as the table is there, users can run orphan file cleanup to remove the orphaned stat files. If you drop the table, the orphaned stat files unfortunately remain on disk. This is, however, a catalog matter of location ownership for the table.

*My follow-up thoughts:*
- Orphan file cleanup is not always feasible, e.g. when tables share locations.
- Orphan files are expected when something goes wrong. With stat files, now even successful queries can create orphan files.
- Over time there seem to be more and more new ways of creating orphan files even in a successful use case: table stats, (soon coming) partition stats, and who knows what else (col stats? indexes?). The situation might not be that severe now but could get worse over time.
- Users seem to complain even about the /data and /metadata folders remaining on storage after a drop table. Remaining stat files could be another source of recurring complaints.

Even though orphan file removal (where feasible) could address the symptom here, I think the table format should also offer a way to take care of the root cause: unreferencing files when updating them.

*Proposal:*
What I have in mind to tackle the root cause is to keep track of not only the current stat files but also the historical ones in the table metadata. This would increase the size of the metadata for sure, but it could be kept at a manageable size with a mechanism similar to what we do with the historical metadata.jsons. In practice we could have flags like:
'write.statistics.delete-after-commit.enabled'
'write.statistics.previous-versions-max'

Or, if we want to make this even more generic, these new flags could control not just stat files but everything else we come up with (e.g. col stats, indexes, etc.). The naming could probably be better, but something like:
'write.historical-files.delete-after-commit.enabled'
'write.historical-files.previous-versions-max'

With this we could:
- Keep a mapping of 'SnapshotID to historical files' in the table metadata.
- Delete the historical (stat) file(s) when updating the current one if we exceed the threshold history length.
- Delete the historical (stat) files when expiring snapshots.
- Delete the historical (stat) files when dropping the table.
- Take care of all kinds of files that are assigned 'offline' to a snapshot (like stats now) if we choose a generic enough representation, like:
Map<SnapshotID, Map<FileType, List<Path>>>
(where FileType could be 'TABLE_STATS', 'PARTITION_STATS', etc.); a rough sketch of this follows below.
- Take care of historical metadata.json paths too; however, that would change the existing layout of the metadata.json files, so we most probably don't want to do that.
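To make the bookkeeping a bit more concrete, here is a rough Java sketch of the generic representation. Everything in it is hypothetical: the property names, the FileType enum, and the helper class are just the proposal spelled out in code, not existing Iceberg APIs.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; none of these types or properties exist in Iceberg today.
public class HistoricalFilesSketch {

  // Proposed (not existing) table properties from above.
  static final String DELETE_AFTER_COMMIT = "write.historical-files.delete-after-commit.enabled";
  static final String PREVIOUS_VERSIONS_MAX = "write.historical-files.previous-versions-max";

  // Kinds of snapshot-scoped files that can be re-attached after the snapshot was committed.
  enum FileType { TABLE_STATS, PARTITION_STATS }

  // Map<SnapshotID, Map<FileType, List<Path>>> from the proposal, tracked in table metadata.
  private final Map<Long, Map<FileType, List<String>>> historicalFiles = new HashMap<>();

  // Called when a stats file for a snapshot is replaced: the old path moves into history
  // instead of being silently unreferenced.
  void recordReplaced(long snapshotId, FileType type, String oldPath) {
    historicalFiles
        .computeIfAbsent(snapshotId, id -> new HashMap<>())
        .computeIfAbsent(type, t -> new ArrayList<>())
        .add(oldPath);
  }

  // Oldest historical paths beyond the configured maximum would be deleted after commit,
  // similar to how historical metadata.json files are pruned today.
  List<String> pathsToDelete(long snapshotId, FileType type, int previousVersionsMax) {
    List<String> history =
        historicalFiles.getOrDefault(snapshotId, Map.of()).getOrDefault(type, List.of());
    int excess = history.size() - previousVersionsMax;
    return excess > 0 ? new ArrayList<>(history.subList(0, excess)) : List.of();
  }
}

The same history map would also be consulted when expiring snapshots and when dropping the table, so every path we ever referenced stays reachable from the metadata until it is deliberately deleted.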
Let me know what you think!

Regards,
Gabor

On Mon, Feb 24, 2025 at 10:30 AM Gabor Kaszab <gaborkas...@apache.org> wrote:

> Hi All,
>
> `remove_orphan_files` for sure drops the previous stat files, but in case
> you drop the table they will remain on disk forever. I don't have an answer
> here (I reviewed the above-mentioned PR and raised these concerns there),
> but I think we should figure out a way to avoid accumulating unreferenced
> stat files. With this we introduced another easy way of creating orphan
> files, and we can't always rely on users running `remove_orphan_files`
> (maybe multiple tables share the same location, or users don't have an
> automated process for removing orphaned files?). And the introduction of
> partition stats could make the situation even worse.
>
> I'm thinking of an improvement where we drop the previous stat file (if
> there is any) when updating the table with a new one, but in theory that
> could cause issues when some reader is currently reading that stat file.
> Not sure if there is a sophisticated solution here.
>
> I wonder what others think.
>
> Gabor
>
> On Tue, Feb 18, 2025 at 9:07 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>
>> I believe the reason stats files allow replacing statistics with the same
>> snapshot ID is to enable the recomputation of optional stats for the same
>> snapshot. This process does leave the old stats files orphaned, but they
>> will be properly cleaned up by the `remove_orphan_files` action or
>> procedure.
>>
>> As stated in the Javadoc of `dropTableData`, the responsibility of this
>> function is solely to clean up referenced files, not orphaned ones.
>> Therefore, handling orphaned stats files within this method does not seem
>> appropriate.
>>
>> - Ajantha
>>
>> On Sat, Feb 15, 2025 at 11:29 PM Leon Lin <lianglin....@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> Recently, I came across an issue where some Puffin statistics files
>>> remain in storage after calling *dropTableData()*.
>>>
>>> The issue occurs because calling *updateStatistics()* or
>>> *updatePartitionStatistics()* on a snapshot ID that already has existing
>>> stat files commits new metadata with the new statistics file path, without
>>> deleting the old stat files. Since the current implementation of
>>> *dropTableData()* only removes stat files referenced in the latest
>>> metadata, older stat files that were referenced in previous metadata
>>> versions remain in storage.
>>>
>>> I drafted a PR that iterates through historical metadata to collect all
>>> referenced stats files for deletion in dropTableData, but this adds
>>> significant I/O overhead. Some alternative ideas raised in the PR:
>>>
>>> - Introduce a flag to trigger iterating over historical metadata
>>> files to clean up all stat files.
>>> - Delete old stats files when updating/removing stats (risks missing
>>> files on rollback).
>>> - Track unreferenced stats files in the latest metadata (increases
>>> metadata size).
>>>
>>> Each approach comes with trade-offs in terms of performance and
>>> complexity. I hope to get some insights and opinions from the community:
>>> has this been discussed before, or is this the expected behavior? Are there
>>> any alternative approaches to handle this more efficiently?
>>>
>>> References:
>>>
>>> - PR: https://github.com/apache/iceberg/pull/12132
>>> - Issues: https://github.com/apache/iceberg/issues/11876,
>>> https://github.com/trinodb/trino/issues/16583
>>> - Code:
>>> https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/CatalogUtil.java#L94-L139
>>>
>>> Thanks,
>>> Leon
>>>
>>