Re: [DISCUSS] v4 - Improved column statistics

2025-06-04 Thread Gang Wu
> Together with the change which allows storing metadata in columnar formats +1 on this. I think it does not make sense to stick manifest files to Avro if we break column stats into sub fields. On Tue, Jun 3, 2025 at 7:19 PM Péter Váry wrote: > I would love to see more flexibility in file stats

[ANNOUNCE] Release Apache Iceberg Rust 0.5.1

2025-06-04 Thread Kevin Liu
Hi all, The Apache Iceberg Rust community is pleased to announce that Apache Iceberg Rust 0.5.1 has been released! This also includes the release of pyiceberg-core 0.5.1, which provides the Rust bindings for Python. Apache Iceberg is an open table format for huge analytic datasets. Iceberg delivers hig

Re: [DISCUSS] June board report

2025-06-04 Thread Russell Spitzer
Looking good! On Wed, Jun 4, 2025 at 4:21 PM Ryan Blue wrote: > Hi everyone, > > Here’s my draft of our board report for June. I went through the old syncs > for highlights, but please reply if you want me to add any more! > > Ryan > Description: > > Apache Iceberg is a table format for huge ana

Re: [DISCUSS] Pre-Proposal: Improving Merge-On-Read Query Performance With Indexing

2025-06-04 Thread Steven Wu
Haizhou, 1. It is probably inaccurate to call Parquet a table format provider. Parquet is just a file format. Delete vectors (position deletes) are outside the scope of Parquet files. The nature of equality deletes just makes it impossible to read in constant time O(1) 2. The inverted index idea i

Re: Unofficial iceberg-rust crates on crates.io

2025-06-04 Thread Kevin Liu
Thanks for the context, Chengxu. I was curious about the historical context and did some digging. I found that others have had similar questions and have asked on the repo via issues. See - New home of the repository? https://github.com/JanKaul/iceberg-rust/issues/10 - Project status https://githu

Re: [DISCUSS] Restructuring Docs side navigation

2025-06-04 Thread Manu Zhang
Hi all, I know you've been busy finalizing v3 spec and discussing new features in v4 spec. When you find time, could you take a look at this as well? I think well-organized docs are also important to further grow the project and the community. Thanks, Manu On Mon, May 26, 2025 at 11:01 AM Manu Z

Re: [DISCUSS] Pre-Proposal: Improving Merge-On-Read Query Performance With Indexing

2025-06-04 Thread Haizhou Zhao
Hey folks, Thanks for discussing this interesting topic. I have a couple of relevant thoughts while reading through this thread: 1. Is this an Iceberg issue, or a Parquet (table format provider) issue? For example, if Parquet (or other table format provider) provides a mechanism where both query by po

Re: Unofficial iceberg-rust crates on crates.io

2025-06-04 Thread Chengxu Bian
Just to add some context: as far as I know, the owner of the iceberg-rust crate was involved in the early development of the project. I believe some code was also ported from that repo in the early stages. I recall there were some discussions on Slack back in 2023 around naming — both the repo a

[DISCUSS] June board report

2025-06-04 Thread Ryan Blue
Hi everyone, Here’s my draft of our board report for June. I went through the old syncs for highlights, but please reply if you want me to add any more! Ryan Description: Apache Iceberg is a table format for huge analytic datasets that is designed for high performance and ease of use. Project St

Re: Unofficial iceberg-rust crates on crates.io

2025-06-04 Thread Denny Lee
Absolutely - glad to help, eh?! On Wed, Jun 4, 2025 at 3:20 PM Kevin Liu wrote: > Great idea Denny! Personally, I'd like to see a unified effort on the rust > implementation for iceberg. > > I'll reach out to the author on Slack. Can I include you in the > conversation? Happy to include anyon

Re: [DISCUSS] Pre-Proposal: Improving Merge-On-Read Query Performance With Indexing

2025-06-04 Thread Xiaoxuan Li
Totally agree, supporting mutability on top of immutable storage at scale is a non-trivial problem. I think the number of index files is ok, we can preload them in parallel or cache them on disk. Not sure yet about caching deserialized data, that might need some more thought. Xiaoxuan On Wed, Jun

Re: Unofficial iceberg-rust crates on crates.io

2025-06-04 Thread Kevin Liu
Great idea Denny! Personally, I'd like to see a unified effort on the Rust implementation for Iceberg. I'll reach out to the author on Slack. Can I include you in the conversation? Happy to include anyone else interested too. I'll update this thread again once we get some updates. Best, Kevin

Re: Unofficial iceberg-rust crates on crates.io

2025-06-04 Thread Denny Lee
Hey Kevin, Prior to PMC private mailing / ASF trademark, perhaps we can start a conversation with the iceberg-rust crate folks? It looks like the project is active and perhaps we can discuss with them the goals to figure out the best way to leverage each

Unofficial iceberg-rust crates on crates.io

2025-06-04 Thread Kevin Liu
Hi everyone, I was working on the iceberg-rust 0.5.1 release and noticed a few crates on crates.io that seem to be unofficial implementations or releases of iceberg-rust. For example, the official Apache iceberg-rust repo is published under the `iceberg`

Re: [DISCUSS] Reduce memory pressure due to column stats in position delete files

2025-06-04 Thread Anton Okolnychyi
What kind of stats do we produce for position delete files beyond the file path and row positions? Are we dealing with a writer that persists the entire row in the position delete file? So far we modified the writer in Iceberg core to discard all bounds if a position delete file references more tha

Re: [DISCUSS] Reduce memory pressure due to column stats in position delete files

2025-06-04 Thread Ryan Blue
I think we can discard column stats for position deletes, as long as the data file path is preserved (as it is in #13161). For equality deletes, we need to preserve the stats for any equality ID columns. That reduces false positives by ensuring that the IDs being deleted might be in the data file t

Re: [DISCUSS] Reduce memory pressure due to column stats in position delete files

2025-06-04 Thread Steven Wu
It seems like a reasonable approach for DeleteFileIndex. I saw equality delete file matching uses column stats. But it seems that column stats (like lower/upper bounds) aren't used for associating position delete files with a data file. Plus with file-scoped position delete files (V2), matching wo

Re: [Discussion] Versioned SQL UDFs (Catalog routines) in Iceberg

2025-06-04 Thread Ajantha Bhat
Thanks to everyone who joined the sync. Here is the meeting recording: https://drive.google.com/file/d/1WItItsNs3m3-no7_qWPHftGqVNOdpw5C/view?usp=sharing Summary: - We discussed including Python support; the majority agreed *not to* (see recording for details). - No strong opposi

Re: [DISCUSS] Pre-Proposal: Improving Merge-On-Read Query Performance With Indexing

2025-06-04 Thread Péter Váry
> Our primary strategy for accelerating lookups focuses on optimizing the index file itself, leveraging sorted keys, smaller row group sizes, and Bloom filters for Parquet files. We’re also exploring custom formats that support more fine-grained skipping. The techniques you mentioned are important

Re: [DISCUSS] Apache Iceberg 1.10.0 release

2025-06-04 Thread ConradJam
Okay, I will try it out. If there are any issues, I will open PRs to correct them. On Fri, May 30, 2025 at 04:02, Steven Wu wrote: > > whether Spark 3.5 can perform some basic queries or provide file merging > capabilities in the current or next version of V3? > > ConradJam, that should already work for a while n