DVs in Spark seem to behave reasonably, serving as a reference implementation of the V3 spec. There are areas for optimization/refinement but nothing was observed that requires changing the spec. I would also like to add the notion of content overhead/metadata (for Puffin/Parquet footers) to manifests to optimize DV maintenance. That said, it is optional information and can be added after finalizing V3.
- Anton

On Fri, May 2, 2025 at 23:23, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Hi Ryan
>
> All good for the spec. The idea for release is just a help to "double check" the spec is good (we already saw some slight changes on the spec while working on release). I think we can be "confident" that we won't have unexpected changes.
>
> Thanks !
> Regards
> JB
>
> On Thu, May 1, 2025 at 7:04 PM Ryan Blue <rdb...@gmail.com> wrote:
> >
> > Thanks, everyone! Looks like there are a few points to discuss.
> >
> > [JB] Maybe a release with the core updated before announcing spec v3 officially would be a good idea ?
> > [Manu] Agree with Russell and JB that we make a “RC” release for V3 spec to test implementations, compatibility, etc before finalizing it.
> >
> > As Fokko noted, we are currently concerned about the spec and not implementations. The reason is that implementation work before the spec is finalized serves to reduce risk and build confidence that the spec is complete and correct. Once that’s done, it is important to finalize the changes. If we don’t finalize the changes, then implementations don’t know how/what to build and cannot plan when they will fully support v3, because it could change. Most of the work in other implementations will take place after the spec is adopted.
> >
> > Our process for building confidence in new spec versions is to update the spec with pending changes, implement them to validate (and clarify or adjust as needed), and vote to adopt the new version as a confirmation that we agree that the spec changes are reasonable and correct.
> >
> > We’ve already voted to accept the pending v3 changes into the spec, so the changes have already been in a candidate state for quite some time to work on implementations. Now we’re at the point where we’ve implemented the features and, in my opinion, have demonstrated the spec changes are correct and complete.
> >
> > To that end, the question I’m raising in this thread is “what areas and features need further validation?”
> >
> > I appreciate the ideas here (releasing will assist other implementations), but I don’t think that changes the question for this thread. The aim is to identify specific risks and blockers that we need to tackle before adopting the changes.
> >
> > [Russell] We should probably come to a resolution on the compressed metadata.json name as well, although that’s mostly retroactive. V3 would be the place where we could officially change the naming convention.
> >
> > I don’t think that this affects v3, but we should agree before moving on. The only part of the spec that would depend on this is the paths used by file system tables and that strategy is deprecated. We should only document it for clarity (we can’t change it) and I think we can do that any time.
> >
> > For the conventions used in catalog tables, I don’t think that we want to have requirements in the spec for file naming. We’ve avoided that in the past and it isn’t needed. It’s nice to have a convention in implementation notes, but there are other ways to handle this like magic bytes and catalog tracking.
> >
> > [Gang] it is implicit and obvious that only bucket transform can apply to multi-arg transform, it is still unclear the order of source columns and algorithm to use to calculate the bucket value
> >
> > I think there is some confusion here, but Fokko may have already cleared it up.
> >
> > Right now, there are no multi-argument transforms in the spec. We have discussed adding a multi-argument bucket function, but there is not currently one in the spec. In order to minimize changes required for v3, we opted to update the spec to allow adding new transforms in a forward-compatible way between major spec versions (implementations must ignore unknown transforms).
> >
> > [Jia] We’re currently addressing the handling of null/NaN values for X, Y, Z, and M coordinates in the Parquet format repository
> >
> > I agree that this is a good thing to clarify. We currently state that the ranges are [-180, 180] and [-90, 90] for geography, but we should state how points with NaN values are handled.
> >
> >
> > On Wed, Apr 30, 2025 at 12:27 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
> >>
> >> Hi Jia
> >>
> >> I feel it would be nice to get that Parquet spec clarification https://github.com/apache/parquet-format/pull/494 into Iceberg V3 spec as well, once we finalize that.
> >>
> >> Thanks
> >> Szehon
> >>
> >> On Tue, Apr 29, 2025 at 10:55 PM Jia Yu <ji...@apache.org> wrote:
> >>>
> >>> Hi Szehon,
> >>>
> >>> Thanks for clarifying it.
> >>>
> >>> We’re currently addressing the handling of null/NaN values for X, Y, Z, and M coordinates in the Parquet format repository. We’ve already concluded that the spec of Parquet (same on the Iceberg side I believe) only needs additional clarification to guide expected behavior: https://github.com/apache/parquet-format/pull/494
> >>>
> >>> BTW the Parquet Geo C++ PR has been merged today: https://github.com/apache/arrow/pull/45459
> >>> I believe the Parquet Geo Java PR is also very close.
> >>>
> >>> Thanks,
> >>> Jia
> >>>
> >>> On Tue, Apr 29, 2025 at 10:48 PM Fokko Driesprong <fo...@apache.org> wrote:
> >>>>
> >>>> Hey Ryan,
> >>>>
> >>>> Thanks for raising this, and I'm very excited to see V3 being finalized!
> >>>>
> >>>>> The v3 spec for multi-arg transform only advises to use `source-ids` instead of `source-id`. Although it is implicit and obvious that only bucket transform can apply to multi-arg transform, it is still unclear the order of source columns and algorithm to use to calculate the bucket value.
> >>>>
> >>>>
> >>>> V3 now uses `source-ids` when there are multiple arguments and `source-id` when there is just one. PR can be found here. This makes the serialization deterministic without knowing the format-version, simplifying the readers/writers. After some discussion on the PR, we've decided to leave out the multi-arg bucket transform so the V3 spec can be finalized. So V3 only contains the scaffolding for multi-arg transforms.
> >>>>
> >>>>> For Iceberg Geo, we are still waiting for the PR of geospatial bounds and geospatial predicate to be merged: https://github.com/apache/iceberg/pull/12667
> >>>>
> >>>>
> >>>> I think it is a good idea to distinguish between the spec and the actual code. If we all feel comfortable with the spec, I think we could finalize it. Being comfortable also means that we know that we have a working implementation, but I don't think we have to wrap up all the loose ends before voting on the spec.
> >>>>
> >>>> At the PyIceberg side, we're also working to catch up on the V3 capabilities. Having a Java release that exposes these capabilities helps, so we can do round-trip validation.
> >>>>
> >>>> Kind regards,
> >>>> Fokko
> >>>>
> >>>>
> >>>> On Wed, Apr 30, 2025 at 07:26, Jia Yu <ji...@apache.org> wrote:
> >>>>>
> >>>>> Hi folks,
> >>>>>
> >>>>> For Iceberg Geo, we are still waiting for the PR of geospatial bounds and geospatial predicate to be merged: https://github.com/apache/iceberg/pull/12667
> >>>>>
> >>>>> Should a release with core updates include this PR?
> >>>>>
> >>>>> Thanks,
> >>>>> Jia
> >>>>>
> >>>>> On Tue, Apr 29, 2025 at 10:21 PM Manu Zhang <owenzhang1...@gmail.com> wrote:
> >>>>>>
> >>>>>> Agree with Russell and JB that we make a "RC" release for V3 spec to test implementations, compatibility, etc before finalizing it.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Manu
> >>>>>>
> >>>>>> On Wed, Apr 30, 2025 at 12:24 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >>>>>>>
> >>>>>>> Hi Ryan
> >>>>>>>
> >>>>>>> It sounds good.
> >>>>>>>
> >>>>>>> About multi-args transforms, with the clarification we did a couple of weeks ago, I think we are good.
> >>>>>>> Maybe a release with the core updated before announcing spec v3 officially would be a good idea ?
> >>>>>>>
> >>>>>>> Regards
> >>>>>>> JB
> >>>>>>>
> >>>>>>> On Wed, Apr 30, 2025 at 00:35, Ryan Blue <rdb...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi everyone,
> >>>>>>>>
> >>>>>>>> I think we’ve reached the point where it’s time to finalize and adopt the changes for Iceberg v3. We’ve been working toward this for the last few months and have now implemented the v3 features in the Java library to reduce the risk of needing changes or hitting problems (row lineage support in Spark 3.5 just went in!). We’ve also incorporated some clarifications and minor changes back into the spec from what we’ve learned.
> >>>>>>>>
> >>>>>>>> At this point, I’m confident that the spec is reasonable and correct. Thank you to everyone working on these reference implementations!
> >>>>>>>>
> >>>>>>>> The next step is to discuss any outstanding items or concerns about moving forward, and then to have a vote thread to adopt the spec. I’ll start off with a couple of items:
> >>>>>>>>
> >>>>>>>> One potential concern is that the upstream Variant spec hasn’t yet been finalized by the Parquet community, but we’ve built a full, independent implementation in Iceberg to validate the spec. I think the Parquet community is primarily waiting on getting the PRs in to have a Java reference implementation, so the risk of changes to the Variant spec is small.
> >>>>>>>>
> >>>>>>>> There’s also an ongoing vote to add encryption keys in support of full table encryption that I think we want to get in.
> >>>>>>>>
> >>>>>>>> Any other items we may want to clear up?
> >>>>>>>>
> >>>>>>>> Ryan