Re: [DISCUSS] v4 - Improved column statistics

2025-06-03 Thread Péter Váry
I would love to see more flexibility in file stats. Together with the change that allows storing metadata in columnar formats, this will open up many new possibilities: Bloom filters in metadata that could be used to filter out files, HLL sketches, etc. +1 for the change. On Tue, Jun 3, 2025, 0

Re: [DISCUSS] Pre-Proposal: Improving Merge-On-Read Query Performance With Indexing

2025-06-03 Thread Péter Váry
Hi Xiaoxuan, > 2. File-Level Indexing > [..] > To make this efficient, the table should be partitioned and sorted by the PK. If the table is partitioned and sorted by the PK, we don't really need to have any index. We can find the data file containing the record based on the Content File statisti

Re: [VOTE] Release Apache Iceberg Rust 0.5.1 RC1

2025-06-03 Thread Kevin Liu
A quick update on the release. We're seeing an issue publishing to crates.io through GitHub Actions: the required secret token seems to be empty. I opened https://issues.apache.org/jira/browse/INFRA-26882 to coordinate with Apache Infra and set the secret. Best, Kevin Liu On Sat, May 31, 2025 at 1

Re: [DISCUSS] Pre-Proposal: Improving Merge-On-Read Query Performance With Indexing

2025-06-03 Thread Xiaoxuan Li
Hi Peter, > If the table is partitioned and sorted by the PK, we don't really need to have any index. We can find the data file containing the record based on the Content File statistics, and the RowGroup containing the record based on the Parquet metadata. Our primary strategy for accelerating l

[DISCUSS] Reduce memory pressure due to column stats in position delete files

2025-06-03 Thread Yuya Ebihara
Hi, I've been investigating an OOM issue during planning in the Trino coordinator, and I've found that the main cause is the column stats handling in the DeleteFileIndex class - it loads all delete files into memory. While rewriting delete files is one option, I'd like to explore reducing memory u