Hi,

I've been investigating an OOM issue during planning in the Trino
coordinator, and I've found that the main cause is the column stats
handling in the DeleteFileIndex class, which loads all delete files into
memory.
While rewriting delete files is one option, I'd like to explore reducing
memory usage within the Iceberg library itself.

I've opened a PR (#13161 <https://github.com/apache/iceberg/pull/13161>)
that reduces memory usage on the Trino coordinator from 12.8 GB to 2.5 GB
in my benchmark.
With the change, DeleteFileIndex copies only the file_path stats when the
file is a position delete file.
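
To illustrate the idea, here is a simplified, self-contained sketch. It is
not the code from the PR: the class name, the Content enum, the
statsToRetain method, and the use of a Map<Integer, String> for bounds are
all hypothetical stand-ins. The field IDs are the reserved IDs of the
file_path and pos columns of position delete files in the Iceberg spec.

```java
import java.util.HashMap;
import java.util.Map;

public class DeleteStatsSketch {
  // Reserved field IDs for position delete file columns (Iceberg spec).
  static final int DELETE_FILE_PATH_ID = Integer.MAX_VALUE - 101;
  static final int DELETE_FILE_POS_ID = Integer.MAX_VALUE - 102;

  // Hypothetical stand-in for the delete file content type.
  enum Content { POSITION_DELETES, EQUALITY_DELETES }

  // Returns the per-column bounds to keep in memory, keyed by field ID.
  // Position delete files keep only the file_path entry; other delete
  // files keep everything (assumption made for this sketch).
  static Map<Integer, String> statsToRetain(Content content, Map<Integer, String> bounds) {
    if (content != Content.POSITION_DELETES) {
      return new HashMap<>(bounds);
    }
    Map<Integer, String> kept = new HashMap<>();
    if (bounds.containsKey(DELETE_FILE_PATH_ID)) {
      kept.put(DELETE_FILE_PATH_ID, bounds.get(DELETE_FILE_PATH_ID));
    }
    return kept;
  }

  public static void main(String[] args) {
    Map<Integer, String> bounds = new HashMap<>();
    bounds.put(DELETE_FILE_PATH_ID, "s3://bucket/data/a.parquet");
    bounds.put(DELETE_FILE_POS_ID, "0");
    // Only the file_path entry survives for a position delete file.
    System.out.println(statsToRetain(Content.POSITION_DELETES, bounds).keySet());
  }
}
```

The saving in this model comes from dropping every other column's bounds
for position delete files before they are retained in the index.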

I'd appreciate your feedback on whether this is an acceptable approach, or
if you have other suggestions.
I understand that v4 will improve stats handling as part of #13153
<https://github.com/apache/iceberg/issues/13153>, but in the Trino
community, we're also interested in reducing memory usage for tables using
format versions earlier than v4.

Thanks,
Yuya
