Re: [DISCUSS] Finalizing the v3 spec

Anton Okolnychyi Mon, 05 May 2025 21:09:57 -0700

DVs in Spark seem to behave reasonably, serving as a reference
implementation of the V3 spec. There are areas for optimization/refinement
but nothing was observed that requires changing the spec. I would also like
to add the notion of content overhead/metadata (for Puffin/Parquet footers)
to manifests to optimize DV maintenance. That said, it is optional
information and can be added after finalizing V3.


- Anton

On Fri, May 2, 2025 at 11:23 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Hi Ryan
>
> All good for the spec. The idea of a release is just to help "double
> check" that the spec is good (we already saw some slight changes to the
> spec while working on the release). I think we can be "confident" that we
> won't have unexpected changes.
>
> Thanks !
> Regards
> JB
>
> On Thu, May 1, 2025 at 7:04 PM Ryan Blue <rdb...@gmail.com> wrote:
> >
> > Thanks, everyone! Looks like there are a few points to discuss.
> >
> > [JB] Maybe a release with the core updated before announcing spec v3
> officially would be a good idea ?
> > [Manu] Agree with Russell and JB that we make a “RC” release for V3 spec
> to test implementations, compatibility, etc before finalizing it.
> >
> > As Fokko noted, we are currently concerned about the spec and not
> implementations. The reason is that implementation work before the spec is
> finalized is to reduce risk and build confidence that the spec is complete
> and correct. Once that’s done, it is important to finalize the changes. If
> we don’t finalize the changes, then implementations don’t know how/what to
> build and cannot plan when they will fully support v3 — because it could
> change. Most of the work in other implementations will take place after the
> spec is adopted.
> >
> > Our process for building confidence in new spec versions is to update
> the spec with pending changes, implement them to validate (and clarify or
> adjust as needed), and vote to adopt the new version as a confirmation that
> we agree that the spec changes are reasonable and correct.
> >
> > We’ve already voted to accept the pending v3 changes into the spec, so
> the changes have already been in a candidate state for quite some time to
> work on implementations. Now we’re at the point where we’ve implemented the
> features and, in my opinion, have demonstrated the spec changes are correct
> and complete.
> >
> > To that end, the question I’m raising in this thread is “what areas and
> features need further validation?”
> >
> > I appreciate the ideas here — releasing will assist other
> implementations — but I don’t think that changes the question for this
> thread. The aim is to identify specific risks and blockers that we need to
> tackle before adopting the changes.
> >
> > [Russell] We should probably come to a resolution on the compressed
> metadata.json name as well, although that’s mostly retroactive. V3 would be
> the place where we could officially change the naming convention.
> >
> > I don’t think that this affects v3, but we should agree before moving
> on. The only part of the spec that would depend on this is the paths used
> by file system tables and that strategy is deprecated. We should only
> document for clarity (we can’t change it) and I think we can do that any
> time.
> >
> > For the conventions used in catalog tables, I don’t think that we want
> to have requirements in the spec for file naming. We’ve avoided that in the
> past and it isn’t needed. It’s nice to have a convention in implementation
> notes, but there are other ways to handle this like magic bytes and catalog
> tracking.
> >
> > [Gang] it is implicit and obvious that only bucket transform can apply
> to multi-arg transform, it is still unclear the order of source columns and
> algorithm to use to calculate the bucket value
> >
> > I think there is some confusion here, but Fokko may have already cleared
> it up.
> >
> > Right now, there are no multi-argument transforms in the spec. We have
> discussed adding a multi-argument bucket function, but there is not
> currently one in the spec. In order to minimize changes required for v3, we
> opted to update the spec to allow adding new transforms in a
> forward-compatible way between major spec versions (implementations must
> ignore unknown transforms).
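A minimal sketch of the forward-compatibility rule described above, assuming a reader that keeps only the partition fields whose transform it recognizes. The set of known names, the helper names, and the `bucket_multi[8]` transform are hypothetical and for illustration only; this is not the Java library's API.

```python
import re

# Transform names this hypothetical reader understands (an assumption).
KNOWN_TRANSFORMS = {"identity", "year", "month", "day", "hour", "void"}
# Parameterized transforms such as bucket[16] or truncate[4].
PARAMETERIZED = re.compile(r"^(bucket|truncate)\[(\d+)\]$")


def is_known_transform(name: str) -> bool:
    """True if the transform name is one this reader can evaluate."""
    return name in KNOWN_TRANSFORMS or PARAMETERIZED.match(name) is not None


def usable_partition_fields(fields):
    """Keep fields with known transforms; skip unknown ones instead of failing."""
    return [f for f in fields if is_known_transform(f["transform"])]


fields = [
    {"name": "id_bucket", "transform": "bucket[16]"},
    {"name": "ts_day", "transform": "day"},
    # Hypothetical transform added by a later spec version.
    {"name": "multi", "transform": "bucket_multi[8]"},
]
usable = usable_partition_fields(fields)
```

Here the reader would still plan with `id_bucket` and `ts_day` and simply not use `multi` for pruning.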
> >
> > [Jia] We’re currently addressing the handling of null/NaN values for X,
> Y, Z, and M coordinates in the Parquet format repository
> >
> > I agree that this is a good thing to clarify. We currently state that
> the ranges are [-180, 180] and [-90, 90] for geography, but we should state
> how points with NaN values are handled.
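To make the open question concrete, here is a small sketch that checks a point against the ranges quoted above. It assumes X is longitude in [-180, 180] and Y is latitude in [-90, 90], and it only reports NaN coordinates as a separate case; it does not decide how they should be handled, since that is what still needs clarifying. The function name and return labels are hypothetical.

```python
import math


def classify_point(x: float, y: float) -> str:
    """Classify a geography point against the quoted coordinate ranges."""
    # NaN never compares as in-range, so it is reported separately.
    if math.isnan(x) or math.isnan(y):
        return "nan"
    if -180.0 <= x <= 180.0 and -90.0 <= y <= 90.0:
        return "in_range"
    return "out_of_range"
```

A writer computing bounds would need the spec to say what to do with the "nan" case, for example whether such points are skipped when computing bounds.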
> >
> >
> > On Wed, Apr 30, 2025 at 12:27 PM Szehon Ho <szehon.apa...@gmail.com>
> wrote:
> >>
> >> Hi Jia
> >>
> >> I feel it would be nice to get that Parquet spec clarification
> https://github.com/apache/parquet-format/pull/494 into Iceberg V3 spec as
> well, once we finalize that.
> >>
> >> Thanks
> >> Szehon
> >>
> >> On Tue, Apr 29, 2025 at 10:55 PM Jia Yu <ji...@apache.org> wrote:
> >>>
> >>> Hi Szehon,
> >>>
> >>> Thanks for clarifying it.
> >>>
> >>> We’re currently addressing the handling of null/NaN values for X, Y,
> Z, and M coordinates in the Parquet format repository. We’ve already
> concluded that the spec of Parquet (same on the Iceberg side I believe)
> only needs additional clarification to guide expected behavior:
> https://github.com/apache/parquet-format/pull/494
> >>>
> >>> BTW the Parquet Geo C++ PR has been merged today:
> https://github.com/apache/arrow/pull/45459  I believe the Parquet Geo
> Java PR is also very close.
> >>>
> >>> Thanks,
> >>> Jia
> >>>
> >>> On Tue, Apr 29, 2025 at 10:48 PM Fokko Driesprong <fo...@apache.org>
> wrote:
> >>>>
> >>>> Hey Ryan,
> >>>>
> >>>> Thanks for raising this, and I'm very excited to see V3 being
> finalized!
> >>>>
> >>>>> The v3 spec for multi-arg transform only advises to use `source-ids`
> instead of `source-id`. Although it is implicit and obvious that only
> bucket transform can apply to multi-arg transform, it is still unclear the
> order of source columns and algorithm to use to calculate the bucket value.
> >>>>
> >>>>
> >>>> V3 now uses `source-ids` when there are multiple arguments and
> `source-id` when there is just one. PR can be found here. This makes the
> serialization deterministic without knowing the format-version, simplifying
> the readers/writers. After some discussion on the PR, we've decided to
> leave out the multi-arg bucket transform so the V3 spec can be finalized.
> So V3 only contains the scaffolding for multi-arg transforms.
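A rough sketch of that serialization rule, assuming it means writing `source-id` for a single source column and `source-ids` for several. The function names and the `bucket_multi[8]` transform are hypothetical; the JSON keys follow the partition field layout in the spec.

```python
def partition_field_to_json(name, field_id, transform, source_ids):
    """Serialize a partition field, choosing the key by argument count."""
    if not source_ids:
        raise ValueError("a partition field needs at least one source column")
    out = {"name": name, "field-id": field_id, "transform": transform}
    if len(source_ids) == 1:
        out["source-id"] = source_ids[0]
    else:
        out["source-ids"] = list(source_ids)
    return out


def source_ids_from_json(field):
    """Read source column IDs without needing to know the format-version."""
    if "source-ids" in field:
        return list(field["source-ids"])
    return [field["source-id"]]


single = partition_field_to_json("ts_day", 1000, "day", [4])
# Hypothetical multi-arg transform; none is defined in the v3 spec.
multi = partition_field_to_json("ab_bucket", 1001, "bucket_multi[8]", [1, 2])
```

Because the key depends only on the number of source columns, a reader can parse either form the same way for any format version.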
> >>>>
> >>>>> For Iceberg Geo, we are still waiting for the PR of geospatial
> bounds and geospatial predicate to be merged:
> https://github.com/apache/iceberg/pull/12667
> >>>>
> >>>>
> >>>> I think it is a good idea to distinguish between the spec and the
> actual code. If we all feel comfortable with the spec, I think we could
> finalize it. Being comfortable also means that we know that we have a
> working implementation, but I don't think we have to wrap up all the loose
> ends before voting on the spec.
> >>>>
> >>>> At the PyIceberg side, we're also working to catch up on the V3
> capabilities. Having a Java release that exposes these capabilities helps,
> so we can do round-trip validation.
> >>>>
> >>>> Kind regards,
> >>>> Fokko
> >>>>
> >>>>
> >>>> On Wed, Apr 30, 2025 at 7:26 AM Jia Yu <ji...@apache.org> wrote:
> >>>>>
> >>>>> Hi folks,
> >>>>>
> >>>>> For Iceberg Geo, we are still waiting for the PR of geospatial
> bounds and geospatial predicate to be merged:
> https://github.com/apache/iceberg/pull/12667
> >>>>>
> >>>>> Should a release with core updates include this PR?
> >>>>>
> >>>>> Thanks,
> >>>>> Jia
> >>>>>
> >>>>> On Tue, Apr 29, 2025 at 10:21 PM Manu Zhang <owenzhang1...@gmail.com>
> wrote:
> >>>>>>
> >>>>>> Agree with Russell and JB that we make a "RC" release for V3 spec
> to test implementations, compatibility, etc before finalizing it.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Manu
> >>>>>>
> >>>>>> On Wed, Apr 30, 2025 at 12:24 PM Jean-Baptiste Onofré <
> j...@nanthrax.net> wrote:
> >>>>>>>
> >>>>>>> Hi Ryan
> >>>>>>>
> >>>>>>> It sounds good.
> >>>>>>>
> >>>>>>> About multi-arg transforms, with the clarification we made a
> couple of weeks ago, I think we are good.
> >>>>>>> Maybe a release with the core updated before announcing spec v3
> officially would be a good idea ?
> >>>>>>>
> >>>>>>> Regards
> >>>>>>> JB
> >>>>>>>
> >>>>>>> On Wed, Apr 30, 2025 at 12:35 AM Ryan Blue <rdb...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi everyone,
> >>>>>>>>
> >>>>>>>> I think we’ve reached the point where it’s time to finalize and
> adopt the changes for Iceberg v3. We’ve been working toward this for the
> last few months and have now implemented the v3 features in the Java
> library to reduce the risk of needing changes or hitting problems (row
> lineage support in Spark 3.5 just went in!). We’ve also incorporated some
> clarifications and minor changes back into the spec from what we’ve learned.
> >>>>>>>>
> >>>>>>>> At this point, I’m confident that the spec is reasonable and
> correct. Thank you to everyone working on these reference implementations!
> >>>>>>>>
> >>>>>>>> The next step is to discuss any outstanding items or concerns
> about moving forward, and then to have a vote thread to adopt the spec.
> I’ll start off with a couple of items:
> >>>>>>>>
> >>>>>>>> One potential concern is that the upstream Variant spec hasn’t
> yet been finalized by the Parquet community, but we’ve built a full,
> independent implementation in Iceberg to validate the spec. I think the
> Parquet community is primarily waiting on getting the PRs in to have a Java
> reference implementation, so the risk of changes to the Variant spec is
> small.
> >>>>>>>>
> >>>>>>>> There’s also an on-going vote to add encryption keys in support
> of full table encryption that I think we want to get in.
> >>>>>>>>
> >>>>>>>> Any other items we may want to clear up?
> >>>>>>>>
> >>>>>>>> Ryan
>
