Thanks, everyone! Looks like there are a few points to discuss.

[JB] Maybe a release with the core updated before announcing spec v3 officially would be a good idea?

[Manu] Agree with Russell and JB that we make a “RC” release for V3 spec to test implementations, compatibility, etc before finalizing it.

As Fokko noted, we are currently concerned with the spec and not implementations. The purpose of implementation work before the spec is finalized is to reduce risk and build confidence that the spec is complete and correct. Once that’s done, it is important to finalize the changes. If we don’t finalize the changes, then implementations don’t know how or what to build and cannot plan when they will fully support v3, because it could change. Most of the work in other implementations will take place after the spec is adopted.

Our process for building confidence in new spec versions is to update the spec with pending changes, implement them to validate (and clarify or adjust as needed), and vote to adopt the new version as a confirmation that we agree that the spec changes are reasonable and correct. We’ve already voted to accept the pending v3 changes into the spec, so the changes have been in a candidate state, available for implementation work, for quite some time. Now we’re at the point where we’ve implemented the features and, in my opinion, have demonstrated the spec changes are correct and complete.

To that end, the question I’m raising in this thread is *“what areas and features need further validation?”* I appreciate the ideas here (releasing will assist other implementations), but I don’t think that changes the question for this thread. The aim is to identify specific risks and blockers that we need to tackle before adopting the changes.

[Russell] We should probably come to a resolution on the compressed metadata.json name as well, although that’s mostly retroactive. V3 would be the place where we could officially change the naming convention.

I don’t think that this affects v3, but we should agree before moving on. The only part of the spec that would depend on this is the paths used by file system tables, and that strategy is deprecated. We should only document it for clarity (we can’t change it), and I think we can do that any time.

For the conventions used in catalog tables, I don’t think that we want to have requirements in the spec for file naming. We’ve avoided that in the past and it isn’t needed. It’s nice to have a convention in implementation notes, but there are other ways to handle this, like magic bytes and catalog tracking.

[Gang] it is implicit and obvious that only bucket transform can apply to multi-arg transform, it is still unclear the order of source columns and algorithm to use to calculate the bucket value

I think there is some confusion here, but Fokko may have already cleared it up. Right now, there are no multi-argument transforms in the spec. We have discussed adding a multi-argument bucket function, but there is not currently one in the spec. In order to minimize changes required for v3, we opted to update the spec to allow adding new transforms in a forward-compatible way between major spec versions (implementations must ignore unknown transforms).

[Jia] We’re currently addressing the handling of null/NaN values for X, Y, Z, and M coordinates in the Parquet format repository

I agree that this is a good thing to clarify. We currently state that the ranges are [-180, 180] and [-90, 90] for geography, but we should state how points with NaN values are handled.

On Wed, Apr 30, 2025 at 12:27 PM Szehon Ho <szehon.apa...@gmail.com> wrote:

> Hi Jia
>
> I feel it would be nice to get that Parquet spec clarification
> https://github.com/apache/parquet-format/pull/494 into Iceberg V3 spec as
> well, once we finalize that.
>
> Thanks
> Szehon
>
> On Tue, Apr 29, 2025 at 10:55 PM Jia Yu <ji...@apache.org> wrote:
>
>> Hi Szehon,
>>
>> Thanks for clarifying it.
>>
>> We’re currently addressing the handling of null/NaN values for X, Y, Z,
>> and M coordinates in the Parquet format repository.
>> We’ve already concluded that the spec of Parquet (same on the Iceberg side
>> I believe) only needs additional clarification to guide expected behavior:
>> https://github.com/apache/parquet-format/pull/494
>>
>> BTW the Parquet Geo C++ PR has been merged today:
>> https://github.com/apache/arrow/pull/45459 I believe the Parquet Geo
>> Java PR is also very close.
>>
>> Thanks,
>> Jia
>>
>> On Tue, Apr 29, 2025 at 10:48 PM Fokko Driesprong <fo...@apache.org>
>> wrote:
>>
>>> Hey Ryan,
>>>
>>> Thanks for raising this, and I'm very excited to see V3 being finalized!
>>>
>>>> The v3 spec for multi-arg transform only advises to use `source-ids`
>>>> instead of `source-id`. Although it is implicit and obvious that only
>>>> bucket transform can apply to multi-arg transform, it is still unclear the
>>>> order of source columns and algorithm to use to calculate the bucket value.
>>>
>>> V3 now uses `source-ids` when there are multiple arguments and `source-id`
>>> when there is just one. PR can be found here
>>> <https://github.com/apache/iceberg/pull/12644>. This makes the
>>> serialization deterministic without knowing the format-version, simplifying
>>> the readers/writers. After some discussion on the PR, we've decided to
>>> leave out the multi-arg bucket transform so the V3 spec can be finalized.
>>> So V3 only contains the scaffolding for multi-arg transforms.
>>>
>>>> For Iceberg Geo, we are still waiting for the PR of geospatial bounds
>>>> and geospatial predicate to be merged:
>>>> https://github.com/apache/iceberg/pull/12667
>>>
>>> I think it is a good idea to distinguish between the spec and the actual
>>> code. If we all feel comfortable with the spec, I think we could finalize
>>> it. Being comfortable also means that we know that we have a working
>>> implementation, but I don't think we have to wrap up all the loose ends
>>> before voting on the spec.
>>>
>>> At the PyIceberg side, we're also working to catch up on the V3
>>> capabilities <https://github.com/apache/iceberg-python/issues/1818>.
>>> Having a Java release that exposes these capabilities helps, so we can do
>>> round-trip validation.
>>>
>>> Kind regards,
>>> Fokko
>>>
>>> On Wed, Apr 30, 2025 at 07:26, Jia Yu <ji...@apache.org> wrote:
>>>
>>>> Hi folks,
>>>>
>>>> For Iceberg Geo, we are still waiting for the PR of geospatial bounds
>>>> and geospatial predicate to be merged:
>>>> https://github.com/apache/iceberg/pull/12667
>>>>
>>>> Should a release with core updates include this PR?
>>>>
>>>> Thanks,
>>>> Jia
>>>>
>>>> On Tue, Apr 29, 2025 at 10:21 PM Manu Zhang <owenzhang1...@gmail.com>
>>>> wrote:
>>>>
>>>>> Agree with Russell and JB that we make a "RC" release for V3 spec to
>>>>> test implementations, compatibility, etc before finalizing it.
>>>>>
>>>>> Thanks,
>>>>> Manu
>>>>>
>>>>> On Wed, Apr 30, 2025 at 12:24 PM Jean-Baptiste Onofré <j...@nanthrax.net>
>>>>> wrote:
>>>>>
>>>>>> Hi Ryan
>>>>>>
>>>>>> It sounds good.
>>>>>>
>>>>>> About multi-args transforms, with the clarification we did a couple
>>>>>> of weeks ago, I think we are good.
>>>>>> Maybe a release with the core updated before announcing spec v3
>>>>>> officially would be a good idea ?
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>> On Wed, Apr 30, 2025 at 00:35, Ryan Blue <rdb...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I think we’ve reached the point where it’s time to finalize and
>>>>>>> adopt the changes for Iceberg v3. We’ve been working toward this for the
>>>>>>> last few months and have now implemented the v3 features in the Java
>>>>>>> library to reduce the risk of needing changes or hitting problems (row
>>>>>>> lineage support in Spark 3.5 just went in!). We’ve also incorporated some
>>>>>>> clarifications and minor changes back into the spec from what we’ve
>>>>>>> learned.
>>>>>>>
>>>>>>> At this point, I’m confident that the spec is reasonable and
>>>>>>> correct. Thank you to everyone working on these reference
>>>>>>> implementations!
>>>>>>>
>>>>>>> The next step is to discuss any outstanding items or concerns about
>>>>>>> moving forward, and then to have a vote thread to adopt the spec. I’ll
>>>>>>> start off with a couple of items:
>>>>>>>
>>>>>>> One potential concern is that the upstream Variant spec hasn’t yet
>>>>>>> been finalized by the Parquet community, but we’ve built a full,
>>>>>>> independent implementation in Iceberg to validate the spec. I think the
>>>>>>> Parquet community is primarily waiting on getting the PRs in to have a Java
>>>>>>> reference implementation, so the risk of changes to the Variant spec is
>>>>>>> small.
>>>>>>>
>>>>>>> There’s also an on-going vote to add encryption keys in support of
>>>>>>> full table encryption that I think we want to get in.
>>>>>>>
>>>>>>> Any other items we may want to clear up?
>>>>>>>
>>>>>>> Ryan