Re: [DISCUSS] Finalizing the v3 spec

2025-05-13 Thread Anton Okolnychyi
I went ahead and created https://github.com/apache/iceberg/pull/13042 to include the discussed requirement for DVs. On Wed, May 7, 2025 at 20:31 Anton Okolnychyi wrote: > Steven, that may be a good point to add to ensure the metadata is properly > maintained. If I remember correctly, the Spark imp

Re: [DISCUSS] Finalizing the v3 spec

2025-05-07 Thread Anton Okolnychyi
Steven, that may be a good point to add to ensure the metadata is properly maintained. If I remember correctly, the Spark implementation already drops old DVs in DELETE/UPDATE/MERGE but the data compaction wasn't doing it originally. I wonder if we fixed it. Eduard may know more. - Anton On Wed, May 7

Re: [DISCUSS] Finalizing the v3 spec

2025-05-07 Thread Steven Wu
For the delete vector change, should we add the following constraint/requirement for the write path in the spec? I don't know if this is already the behavior of the Spark implementation. "if a data file is removed from the table, the corresponding DV reference must also be removed from delete man

Re: [DISCUSS] Finalizing the v3 spec

2025-05-06 Thread Manu Zhang
Thanks for the clarification, Ryan. I'm aware of the major changes, but I find it hard to go through all the related descriptions, which are scattered all over the place. Manu On Tue, May 6, 2025 at 11:24 PM Ryan Blue wrote: > Manu, > > We aren't currently voting. We are discussing any outstanding i

Re: [DISCUSS] Finalizing the v3 spec

2025-05-06 Thread Ryan Blue
Manu, We aren't currently voting. We are discussing any outstanding items to address before we close v3 to further changes and adopt the existing v3 changes. Right now, the open item is to clarify NaN behavior in geometry and geography, PR #12956 . Th

Re: [DISCUSS] Finalizing the v3 spec

2025-05-06 Thread Manu Zhang
I'm wondering what changes we are voting for here. Is it everything related to https://iceberg.apache.org/spec/#version-3-extended-types-and-capabilities from the table spec? How about changes to other specs? Do we summarize all the changes in https://iceberg.apache.org/spec/#appendix-e-format-ver

Re: [DISCUSS] Finalizing the v3 spec

2025-05-05 Thread Anton Okolnychyi
DVs in Spark seem to behave reasonably, serving as a reference implementation of the V3 spec. There are areas for optimization/refinement but nothing was observed that requires changing the spec. I would also like to add the notion of content overhead/metadata (for Puffin/Parquet footers) to manife

Re: [DISCUSS] Finalizing the v3 spec

2025-05-02 Thread Jean-Baptiste Onofré
Hi Ryan All good for the spec. The idea for the release is just to help "double check" that the spec is good (we already saw some slight changes to the spec while working on the release). I think we can be "confident" that we won't have unexpected changes. Thanks ! Regards JB On Thu, May 1, 2025 at 7:04 P

Re: [DISCUSS] Finalizing the v3 spec

2025-05-02 Thread Russell Spitzer
Sounds good to me, I think we can move ahead with this, for all intents and purposes I think we are past any breaking changes for Spec V3 and should consider it "stable" for implementation purposes. I want to work on some official descriptions of our spec versioning / library process to better expl

Re: [DISCUSS] Finalizing the v3 spec

2025-05-01 Thread Ryan Blue
Thanks, everyone! Looks like there are a few points to discuss. [JB] Maybe a release with the core updated before announcing spec v3 officially would be a good idea ? [Manu] Agree with Russell and JB that we make a “RC” release for V3 spec to test implementations, compatibility, etc before finaliz

Re: [DISCUSS] Finalizing the v3 spec

2025-04-30 Thread Szehon Ho
Hi Jia I feel it would be nice to get that Parquet spec clarification https://github.com/apache/parquet-format/pull/494 into Iceberg V3 spec as well, once we finalize that. Thanks Szehon On Tue, Apr 29, 2025 at 10:55 PM Jia Yu wrote: > Hi Szehon, > > Thanks for clarifying it. > > We’re curren

Re: [DISCUSS] Finalizing the v3 spec

2025-04-30 Thread Gang Wu
Thanks JB and Fokko! I agree that we are good with multi-arg transform for v3. Best, Gang On Wed, Apr 30, 2025 at 2:12 PM Xuanwo wrote: > Hi Ryan. > > Thanks for starting this. > > I share the same concern as Russell regarding the recent discussion about > `metadata.json.gz`. I think it's a good

Re: [DISCUSS] Finalizing the v3 spec

2025-04-29 Thread Xuanwo
Hi Ryan. Thanks for starting this. I share the same concern as Russell regarding the recent discussion about `metadata.json.gz`. I think it's a good time to clarify the behavior and perhaps allow for additional compression algorithms here. We can start a separate discussion thread if needed. > A

Re: [DISCUSS] Finalizing the v3 spec

2025-04-29 Thread Jia Yu
Hi Szehon, Thanks for clarifying it. We’re currently addressing the handling of null/NaN values for X, Y, Z, and M coordinates in the Parquet format repository. We’ve already concluded that the spec of Parquet (same on the Iceberg side I believe) only needs additional clarification to guide expec

Re: [DISCUSS] Finalizing the v3 spec

2025-04-29 Thread Fokko Driesprong
Hey Ryan, Thanks for raising this, and I'm very excited to see V3 being finalized! The v3 spec for multi-arg transform only advises to use `source-ids` > instead of `source-id`. Although it is implicit and obvious that only > bucket transform can apply to multi-arg transform, it is still unclear

Re: [DISCUSS] Finalizing the v3 spec

2025-04-29 Thread Szehon Ho
Hi Jia I think it's about the spec, and not the implementation (which is definitely good to reduce the risk of needing to change the spec). We actually wanted to get our Parquet reader/writer out for this effort, but as we see, it seems it depends on next Parquet-java release for the new Geo types on Par

Re: [DISCUSS] Finalizing the v3 spec

2025-04-29 Thread Jia Yu
Hi folks, For Iceberg Geo, we are still waiting for the PR of geospatial bounds and geospatial predicate to be merged: https://github.com/apache/iceberg/pull/12667 Should a release with core updates include this PR? Thanks, Jia On Tue, Apr 29, 2025 at 10:21 PM Manu Zhang wrote: > Agree with R

Re: [DISCUSS] Finalizing the v3 spec

2025-04-29 Thread Manu Zhang
Agree with Russell and JB that we make a "RC" release for V3 spec to test implementations, compatibility, etc before finalizing it. Thanks, Manu On Wed, Apr 30, 2025 at 12:24 PM Jean-Baptiste Onofré wrote: > Hi Ryan > > It sounds good. > > About multi-args transforms, with the clarification we

Re: [DISCUSS] Finalizing the v3 spec

2025-04-29 Thread Jean-Baptiste Onofré
Hi Ryan It sounds good. About multi-args transforms, with the clarification we did a couple of weeks ago, I think we are good. Maybe a release with the core updated before announcing spec v3 officially would be a good idea ? Regards JB On Wed, Apr 30, 2025 at 00:35, Ryan Blue wrote: > Hi ev

Re: [DISCUSS] Finalizing the v3 spec

2025-04-29 Thread Jean-Baptiste Onofré
Hi Gang I’m working on the multi args transforms support: https://github.com/apache/iceberg/pull/12897 You can find details about impl in core. Regards JB On Wed, Apr 30, 2025 at 03:47, Gang Wu wrote: > Please correct me if I'm wrong. > > The v3 spec for multi-arg transform only advises to

Re: [DISCUSS] Finalizing the v3 spec

2025-04-29 Thread Gang Wu
Please correct me if I'm wrong. The v3 spec for multi-arg transform only advises to use `source-ids` instead of `source-id`. Although it is implicit and obvious that only bucket transform can apply to multi-arg transform, it is still unclear the order of source columns and algorithm to use to calc

Re: [DISCUSS] Finalizing the v3 spec

2025-04-29 Thread Russell Spitzer
We should probably come to a resolution on the compressed metadata.json name as well, although that's mostly retroactive. V3 would be the place where we could officially change the naming convention. I'm also interested in getting a release with the full implementation of V3 as it currently stands

[DISCUSS] Finalizing the v3 spec

2025-04-29 Thread Ryan Blue
Hi everyone, I think we’ve reached the point where it’s time to finalize and adopt the changes for Iceberg v3. We’ve been working toward this for the last few months and have now implemented the v3 features in the Java library to reduce the risk of needing changes or hitting problems (row lineage