Re: [DISCUSS] Reduce memory pressure due to column stats in position delete files

2025-06-04 Thread Anton Okolnychyi
What kind of stats do we produce for position delete files beyond the file path and row positions? Are we dealing with a writer that persists the entire row in the position delete file? So far we modified the writer in Iceberg core to discard all bounds if a position delete file references more tha

Re: [DISCUSS] Reduce memory pressure due to column stats in position delete files

2025-06-04 Thread Ryan Blue
I think we can discard column stats for position deletes, as long as the data file path is preserved (as it is in #13161). For equality deletes, we need to preserve the stats for any equality ID columns. That reduces false positives by ensuring that the IDs being deleted might be in the data file t
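
The pruning rule described in this message can be sketched as follows. This is a minimal illustration, not Iceberg's implementation: `DeleteFileStats`, `fields_to_keep`, and `prune_bounds` are hypothetical names, the bounds are modeled as plain dictionaries keyed by field ID, and the sketch reads the second rule as applying to equality delete files, which is an assumption. The reserved field ID used for `file_path` is taken from the Iceberg table spec.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet

# Reserved field ID of the file_path column in position delete files
# (per the Iceberg table spec's reserved field IDs).
FILE_PATH_FIELD_ID = 2147483546


@dataclass(frozen=True)
class DeleteFileStats:
    """Hypothetical, simplified view of a delete file's column bounds."""
    content: str  # "position" or "equality"
    lower_bounds: Dict[int, bytes] = field(default_factory=dict)
    upper_bounds: Dict[int, bytes] = field(default_factory=dict)
    equality_ids: FrozenSet[int] = frozenset()


def fields_to_keep(stats: DeleteFileStats) -> FrozenSet[int]:
    """Field IDs whose bounds are still needed to match deletes to data files."""
    if stats.content == "position":
        # Only the referenced data file path matters for position deletes.
        return frozenset({FILE_PATH_FIELD_ID})
    # Assumption: equality deletes keep bounds for their equality ID columns.
    return stats.equality_ids


def prune_bounds(stats: DeleteFileStats) -> DeleteFileStats:
    """Return a copy that retains bounds only for the fields worth keeping."""
    keep = fields_to_keep(stats)
    return DeleteFileStats(
        content=stats.content,
        lower_bounds={k: v for k, v in stats.lower_bounds.items() if k in keep},
        upper_bounds={k: v for k, v in stats.upper_bounds.items() if k in keep},
        equality_ids=stats.equality_ids,
    )
```

Applied when delete files are loaded into an index, a filter like this would keep the memory held per delete file proportional to the few retained columns rather than to the table's full column count.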

Re: [DISCUSS] Reduce memory pressure due to column stats in position delete files

2025-06-04 Thread Steven Wu
It seems like a reasonable approach for DeleteFileIndex. I saw equality delete file matching uses column stats. But it seems that column stats (like lower/upper bounds) aren't used for associating position delete files with a data file. Plus with file-scoped position delete files (V2), matching wo

[DISCUSS] Reduce memory pressure due to column stats in position delete files

2025-06-03 Thread Yuya Ebihara
Hi, I've been investigating an OOM issue during planning in the Trino coordinator, and I've found that the main cause is the column stats handling in the DeleteFileIndex class - it loads all delete files into memory. While rewriting delete files is one option, I'd like to explore reducing memory u