Re: [DISCUSS] Reduce memory pressure due to column stats in position delete files

Ryan Blue Wed, 04 Jun 2025 14:05:33 -0700

I think we can discard column stats for position deletes, as long as the
data file path is preserved (as it is in #13161). For equality deletes, we
need to preserve the stats for any equality ID columns. That reduces false
positives by ensuring that the IDs being deleted might be in the data file
the equality deletes are applied to.
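
To make the rule concrete, here is a rough sketch of the pruning I have in
mind. It is not the code in #13161 and does not use the Iceberg API: the
DeleteFileStats class and the prune_stats function are hypothetical names
made up for illustration, and only lower/upper bounds are modeled. The one
value taken from the table spec is the reserved field ID of the file_path
column in position delete files.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Dict, FrozenSet

# Reserved field ID of file_path in position delete files (Iceberg table spec).
DELETE_FILE_PATH_ID = 2147483546


class Content(Enum):
    POSITION_DELETES = 1
    EQUALITY_DELETES = 2


@dataclass
class DeleteFileStats:
    """Hypothetical stand-in for the per-file stats held by a delete index."""
    content: Content
    lower_bounds: Dict[int, bytes]   # field ID -> serialized lower bound
    upper_bounds: Dict[int, bytes]   # field ID -> serialized upper bound
    equality_ids: FrozenSet[int] = frozenset()


def prune_stats(stats: DeleteFileStats) -> DeleteFileStats:
    """Return a copy that keeps only the bounds needed to match data files."""
    if stats.content is Content.POSITION_DELETES:
        # Only the referenced data file path matters for position deletes.
        keep = {DELETE_FILE_PATH_ID}
    else:
        # Equality deletes are matched on their equality ID columns.
        keep = set(stats.equality_ids)
    return replace(
        stats,
        lower_bounds={k: v for k, v in stats.lower_bounds.items() if k in keep},
        upper_bounds={k: v for k, v in stats.upper_bounds.items() if k in keep},
    )
```

Everything outside the kept set is dropped when the copy is made, so the
memory held per delete file no longer grows with the number of columns that
have stats.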


We should also take a look at how these files are written and possibly
prevent the stats from being written at all. I think Anton updated position
deletes to discard most column stats already.

On Wed, Jun 4, 2025 at 10:09 AM Steven Wu <stevenz...@gmail.com> wrote:

> It seems like a reasonable approach for DeleteFileIndex. I saw that equality
> delete file matching uses column stats. But it seems that column stats
> (like lower/upper bounds) aren't used for associating position delete files
> with a data file. Plus, with file-scoped position delete files (V2),
> matching won't need column stats either. With deletion vectors (DVs) in V3,
> there won't be column stats written for position deletes.
>
> On Tue, Jun 3, 2025 at 10:01 PM Yuya Ebihara <
> yuya.ebih...@starburstdata.com> wrote:
>
>> Hi,
>>
>> I've been investigating an OOM issue during planning in the Trino
>> coordinator, and I've found that the main cause is the column stats
>> handling in the DeleteFileIndex class - it loads all delete files into
>> memory.
>> While rewriting delete files is one option, I'd like to explore reducing
>> memory usage within the Iceberg library itself.
>>
>> I've opened a PR (#13161 <https://github.com/apache/iceberg/pull/13161>)
>> that reduces memory usage on the Trino coordinator from 12.8 GB to 2.5 GB
>> in my benchmark.
>> The change copies only the file_path stats in DeleteFileIndex when the
>> file is a positional delete.
>>
>> I'd appreciate your feedback on whether this is an acceptable approach,
>> or if you have other suggestions.
>> I understand that v4 will improve stats handling as part of #13153
>> <https://github.com/apache/iceberg/issues/13153>, but in the Trino
>> community, we're also interested in reducing memory usage for tables using
>> formats earlier than v4.
>>
>> Thanks,
>> Yuya
>>
>
