Re: [DISCUSS] Finalizing the v3 spec

Anton Okolnychyi Tue, 13 May 2025 15:25:02 -0700

I went ahead and created https://github.com/apache/iceberg/pull/13042 to
include the discussed requirement for DVs.
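
For readers skimming the thread, here is a rough sketch of why the requirement matters for row counts. It assumes a heavily simplified model: the class and function names below (`DataFileEntry`, `DVEntry`, `row_count`, `remove_data_file`, `orphaned_dvs`) are hypothetical and are not the Iceberg API. It only illustrates the add/subtract calculation over data and delete manifest entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFileEntry:
    """Hypothetical, simplified data manifest entry."""
    path: str
    record_count: int


@dataclass(frozen=True)
class DVEntry:
    """Hypothetical, simplified delete manifest entry for one DV."""
    referenced_data_file: str
    deleted_rows: int  # number of positions marked deleted by the DV


def row_count(data_files, dvs):
    """Rows added by data files minus rows removed by DVs.

    Only accurate when no DV references a data file that has left the table.
    """
    added = sum(f.record_count for f in data_files)
    removed = sum(d.deleted_rows for d in dvs)
    return added - removed


def orphaned_dvs(data_files, dvs):
    """The more expensive check: DVs whose data file is no longer present."""
    live = {f.path for f in data_files}
    return [d for d in dvs if d.referenced_data_file not in live]


def remove_data_file(path, data_files, dvs):
    """Drop a data file together with any DV entry that references it."""
    kept_files = [f for f in data_files if f.path != path]
    kept_dvs = [d for d in dvs if d.referenced_data_file != path]
    return kept_files, kept_dvs
```

If a writer drops a data file but leaves its DV entry behind, `row_count` subtracts deletes for rows that are no longer in the table. Avoiding that would require something like `orphaned_dvs` on every count.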


On Wed, May 7, 2025 at 8:31 PM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:

> Steven, that may be a good point to add to ensure the metadata is properly
> maintained. If I remember correctly, the Spark implementation already drops
> old DVs in DELETE/UPDATE/MERGE but the data compaction wasn't doing it
> originally. I wonder if we fixed it. Eduard may know more.
>
> - Anton
>
> On Wed, May 7, 2025 at 4:29 PM Steven Wu <stevenz...@gmail.com> wrote:
>
>> For the deletion vector change, should we add the following
>> constraint/requirement for the write path in the spec? I don't know if this
>> is already the behavior of the Spark implementation.
>>
>> "if a data file is removed from the table, the corresponding DV reference
>> must also be removed from the delete manifest file"
>>
>> This constraint guarantees that there are no orphaned DVs in the table state.
>> It also makes it cheaper to calculate an *accurate* table row count: just
>> iterate through the manifest files (data and delete), adding and subtracting
>> record counts. There is no need to check whether the data files referenced by
>> DVs are still part of the table, which can be a little more expensive.
>>
>>
>>
>> On Tue, May 6, 2025 at 9:18 AM Manu Zhang <owenzhang1...@gmail.com>
>> wrote:
>>
>>> Thanks for the clarification, Ryan.
>>>
>>> I'm aware of the major changes, but I find it hard to go through all the
>>> related descriptions which are scattered all over the place.
>>>
>>> Manu
>>>
>>> On Tue, May 6, 2025 at 11:24 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>
>>>> Manu,
>>>>
>>>> We aren't currently voting. We are discussing any outstanding items to
>>>> address before we close v3 to further changes and adopt the existing v3
>>>> changes. Right now, the open item is to clarify NaN behavior in geometry
>>>> and geography, PR #12956 <https://github.com/apache/iceberg/pull/12956>.
>>>>
>>>> Thanks for noting that the row lineage changes should be added to the
>>>> appendix. I'll open a PR to add them. That appendix is an area to highlight
>>>> things that have changed across versions, but an omission does not alter
>>>> the requirements elsewhere in the spec. The changes we are discussing are the
>>>> things that are noted as part of v3 in the spec. The major additions are
>>>> new types, DVs, and row lineage.
>>>>
>>>> Ryan
>>>>
>>>> On Tue, May 6, 2025 at 3:32 AM Manu Zhang <owenzhang1...@gmail.com>
>>>> wrote:
>>>>
>>>>> I'm wondering what changes we are voting for here. Is it everything
>>>>> related to
>>>>> https://iceberg.apache.org/spec/#version-3-extended-types-and-capabilities
>>>>>  from
>>>>> the table spec?
>>>>> How about changes to other specs?
>>>>>
>>>>> Do we summarize all the changes in
>>>>> https://iceberg.apache.org/spec/#appendix-e-format-version-changes?
>>>>> It looks like row lineage is missing there.
>>>>>
>>>>> Thanks,
>>>>> Manu
>>>>>
>>>>> On Tue, May 6, 2025 at 12:09 PM Anton Okolnychyi <
>>>>> aokolnyc...@gmail.com> wrote:
>>>>>
>>>>>> DVs in Spark seem to behave reasonably, serving as a reference
>>>>>> implementation of the V3 spec. There are areas for optimization/refinement,
>>>>>> but nothing was observed that requires changing the spec. I would also like
>>>>>> to add the notion of content overhead/metadata (for Puffin/Parquet footers)
>>>>>> to manifests to optimize DV maintenance. That said, it is optional
>>>>>> information and can be added after finalizing V3.
>>>>>>
>>>>>> - Anton
>>>>>>
>>>>>> On Fri, May 2, 2025 at 11:23 PM Jean-Baptiste Onofré <j...@nanthrax.net>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ryan
>>>>>>>
>>>>>>> All good for the spec. The idea of a release is just to help "double
>>>>>>> check" that the spec is good (we already saw some slight changes to the
>>>>>>> spec while working on the release). I think we can be "confident" that we
>>>>>>> won't have unexpected changes.
>>>>>>>
>>>>>>> Thanks !
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On Thu, May 1, 2025 at 7:04 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>>>>> >
>>>>>>> > Thanks, everyone! Looks like there are a few points to discuss.
>>>>>>> >
>>>>>>> > [JB] Maybe a release with the core updated before announcing spec
>>>>>>> v3 officially would be a good idea ?
>>>>>>> > [Manu] Agree with Russell and JB that we make a “RC” release for
>>>>>>> V3 spec to test implementations, compatibility, etc before finalizing 
>>>>>>> it.
>>>>>>> >
>>>>>>> > As Fokko noted, we are currently concerned about the spec and not
>>>>>>> > implementations. The purpose of implementation work before the spec is
>>>>>>> > finalized is to reduce risk and build confidence that the spec is complete
>>>>>>> > and correct. Once that’s done, it is important to finalize the changes. If
>>>>>>> > we don’t finalize the changes, then implementations don’t know how/what to
>>>>>>> > build and cannot plan when they will fully support v3, because it could
>>>>>>> > change. Most of the work in other implementations will take place after the
>>>>>>> > spec is adopted.
>>>>>>> >
>>>>>>> > Our process for building confidence in new spec versions is to
>>>>>>> update the spec with pending changes, implement them to validate (and
>>>>>>> clarify or adjust as needed), and vote to adopt the new version as a
>>>>>>> confirmation that we agree that the spec changes are reasonable and 
>>>>>>> correct.
>>>>>>> >
>>>>>>> > We’ve already voted to accept the pending v3 changes into the
>>>>>>> spec, so the changes have already been in a candidate state for quite 
>>>>>>> some
>>>>>>> time to work on implementations. Now we’re at the point where we’ve
>>>>>>> implemented the features and, in my opinion, have demonstrated the spec
>>>>>>> changes are correct and complete.
>>>>>>> >
>>>>>>> > To that end, the question I’m raising in this thread is “what
>>>>>>> areas and features need further validation?”
>>>>>>> >
>>>>>>> > I appreciate the ideas here — releasing will assist other
>>>>>>> implementations — but I don’t think that changes the question for this
>>>>>>> thread. The aim is to identify specific risks and blockers that we need 
>>>>>>> to
>>>>>>> tackle before adopting the changes.
>>>>>>> >
>>>>>>> > [Russell] We should probably come to a resolution on the
>>>>>>> compressed metadata.json name as well, although that’s mostly 
>>>>>>> retroactive.
>>>>>>> V3 would be the place where we could officially change the naming
>>>>>>> convention.
>>>>>>> >
>>>>>>> > I don’t think that this affects v3, but we should agree before
>>>>>>> > moving on. The only part of the spec that would depend on this is the paths
>>>>>>> > used by file system tables, and that strategy is deprecated. We should only
>>>>>>> > document it for clarity (we can’t change it), and I think we can do that any
>>>>>>> > time.
>>>>>>> >
>>>>>>> > For the conventions used in catalog tables, I don’t think that we
>>>>>>> want to have requirements in the spec for file naming. We’ve avoided 
>>>>>>> that
>>>>>>> in the past and it isn’t needed. It’s nice to have a convention in
>>>>>>> implementation notes, but there are other ways to handle this like magic
>>>>>>> bytes and catalog tracking.
>>>>>>> >
>>>>>>> > [Gang] it is implicit and obvious that only bucket transform can
>>>>>>> apply to multi-arg transform, it is still unclear the order of source
>>>>>>> columns and algorithm to use to calculate the bucket value
>>>>>>> >
>>>>>>> > I think there is some confusion here, but Fokko may have already
>>>>>>> cleared it up.
>>>>>>> >
>>>>>>> > Right now, there are no multi-argument transforms in the spec. We
>>>>>>> have discussed adding a multi-argument bucket function, but there is not
>>>>>>> currently one in the spec. In order to minimize changes required for 
>>>>>>> v3, we
>>>>>>> opted to update the spec to allow adding new transforms in a
>>>>>>> forward-compatible way between major spec versions (implementations must
>>>>>>> ignore unknown transforms).
>>>>>>> >
>>>>>>> > [Jia] We’re currently addressing the handling of null/NaN values
>>>>>>> for X, Y, Z, and M coordinates in the Parquet format repository
>>>>>>> >
>>>>>>> > I agree that this is a good thing to clarify. We currently state
>>>>>>> that the ranges are [-180, 180] and [-90, 90] for geography, but we 
>>>>>>> should
>>>>>>> state how points with NaN values are handled.
>>>>>>> >
>>>>>>> >
>>>>>>> > On Wed, Apr 30, 2025 at 12:27 PM Szehon Ho <
>>>>>>> szehon.apa...@gmail.com> wrote:
>>>>>>> >>
>>>>>>> >> Hi Jia
>>>>>>> >>
>>>>>>> >> I feel it would be nice to get that Parquet spec clarification
>>>>>>> >> https://github.com/apache/parquet-format/pull/494 into the Iceberg V3
>>>>>>> >> spec as well, once we finalize that.
>>>>>>> >>
>>>>>>> >> Thanks
>>>>>>> >> Szehon
>>>>>>> >>
>>>>>>> >> On Tue, Apr 29, 2025 at 10:55 PM Jia Yu <ji...@apache.org> wrote:
>>>>>>> >>>
>>>>>>> >>> Hi Szehon,
>>>>>>> >>>
>>>>>>> >>> Thanks for clarifying it.
>>>>>>> >>>
>>>>>>> >>> We’re currently addressing the handling of null/NaN values for
>>>>>>> X, Y, Z, and M coordinates in the Parquet format repository. We’ve 
>>>>>>> already
>>>>>>> concluded that the spec of Parquet (same on the Iceberg side I believe)
>>>>>>> only needs additional clarification to guide expected behavior:
>>>>>>> https://github.com/apache/parquet-format/pull/494
>>>>>>> >>>
>>>>>>> >>> BTW the Parquet Geo C++ PR has been merged today:
>>>>>>> https://github.com/apache/arrow/pull/45459  I believe the Parquet
>>>>>>> Geo Java PR is also very close.
>>>>>>> >>>
>>>>>>> >>> Thanks,
>>>>>>> >>> Jia
>>>>>>> >>>
>>>>>>> >>> On Tue, Apr 29, 2025 at 10:48 PM Fokko Driesprong <
>>>>>>> fo...@apache.org> wrote:
>>>>>>> >>>>
>>>>>>> >>>> Hey Ryan,
>>>>>>> >>>>
>>>>>>> >>>> Thanks for raising this, and I'm very excited to see V3 being
>>>>>>> finalized!
>>>>>>> >>>>
>>>>>>> >>>>> The v3 spec for multi-arg transform only advises to use
>>>>>>> `source-ids` instead of `source-id`. Although it is implicit and obvious
>>>>>>> that only bucket transform can apply to multi-arg transform, it is still
>>>>>>> unclear the order of source columns and algorithm to use to calculate 
>>>>>>> the
>>>>>>> bucket value.
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>> V3 now uses `source-ids` when there are multiple arguments and
>>>>>>> >>>> `source-id` when there is just one. PR can be found here. This makes the
>>>>>>> >>>> serialization deterministic without knowing the format-version, simplifying
>>>>>>> >>>> the readers/writers. After some discussion on the PR, we've decided to
>>>>>>> >>>> leave out the multi-arg bucket transform so the V3 spec can be finalized.
>>>>>>> >>>> So V3 only contains the scaffolding for multi-arg transforms.
>>>>>>> >>>>
>>>>>>> >>>>> For Iceberg Geo, we are still waiting for the PR of geospatial
>>>>>>> bounds and geospatial predicate to be merged:
>>>>>>> https://github.com/apache/iceberg/pull/12667
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>> I think it is a good idea to distinguish between the spec and
>>>>>>> the actual code. If we all feel comfortable with the spec, I think we 
>>>>>>> could
>>>>>>> finalize it. Being comfortable also means that we know that we have a
>>>>>>> working implementation, but I don't think we have to wrap up all the 
>>>>>>> loose
>>>>>>> ends before voting on the spec.
>>>>>>> >>>>
>>>>>>> >>>> On the PyIceberg side, we're also working to catch up on the V3
>>>>>>> >>>> capabilities. Having a Java release that exposes these capabilities helps,
>>>>>>> >>>> so we can do round-trip validation.
>>>>>>> >>>>
>>>>>>> >>>> Kind regards,
>>>>>>> >>>> Fokko
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>> On Wed, Apr 30, 2025 at 7:26 AM Jia Yu <ji...@apache.org> wrote:
>>>>>>> >>>>>
>>>>>>> >>>>> Hi folks,
>>>>>>> >>>>>
>>>>>>> >>>>> For Iceberg Geo, we are still waiting for the PR of geospatial
>>>>>>> bounds and geospatial predicate to be merged:
>>>>>>> https://github.com/apache/iceberg/pull/12667
>>>>>>> >>>>>
>>>>>>> >>>>> Should a release with core updates include this PR?
>>>>>>> >>>>>
>>>>>>> >>>>> Thanks,
>>>>>>> >>>>> Jia
>>>>>>> >>>>>
>>>>>>> >>>>> On Tue, Apr 29, 2025 at 10:21 PM Manu Zhang <
>>>>>>> owenzhang1...@gmail.com> wrote:
>>>>>>> >>>>>>
>>>>>>> >>>>>> Agree with Russell and JB that we make a "RC" release for V3
>>>>>>> spec to test implementations, compatibility, etc before finalizing it.
>>>>>>> >>>>>>
>>>>>>> >>>>>> Thanks,
>>>>>>> >>>>>> Manu
>>>>>>> >>>>>>
>>>>>>> >>>>>> On Wed, Apr 30, 2025 at 12:24 PM Jean-Baptiste Onofré <
>>>>>>> j...@nanthrax.net> wrote:
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Hi Ryan
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> It sounds good.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> About multi-arg transforms, with the clarification we made a
>>>>>>> >>>>>>> couple of weeks ago, I think we are good.
>>>>>>> >>>>>>> Maybe a release with the core updated before announcing spec
>>>>>>> >>>>>>> v3 officially would be a good idea?
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Regards
>>>>>>> >>>>>>> JB
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> On Wed, Apr 30, 2025 at 12:35 AM Ryan Blue <rdb...@gmail.com>
>>>>>>> >>>>>>> wrote:
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> Hi everyone,
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> I think we’ve reached the point where it’s time to finalize
>>>>>>> and adopt the changes for Iceberg v3. We’ve been working toward this for
>>>>>>> the last few months and have now implemented the v3 features in the Java
>>>>>>> library to reduce the risk of needing changes or hitting problems (row
>>>>>>> lineage support in Spark 3.5 just went in!). We’ve also incorporated 
>>>>>>> some
>>>>>>> clarifications and minor changes back into the spec from what we’ve 
>>>>>>> learned.
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> At this point, I’m confident that the spec is reasonable
>>>>>>> and correct. Thank you to everyone working on these reference
>>>>>>> implementations!
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> The next step is to discuss any outstanding items or
>>>>>>> concerns about moving forward, and then to have a vote thread to adopt 
>>>>>>> the
>>>>>>> spec. I’ll start off with a couple of items:
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> One potential concern is that the upstream Variant spec
>>>>>>> hasn’t yet been finalized by the Parquet community, but we’ve built a 
>>>>>>> full,
>>>>>>> independent implementation in Iceberg to validate the spec. I think the
>>>>>>> Parquet community is primarily waiting on getting the PRs in to have a 
>>>>>>> Java
>>>>>>> reference implementation, so the risk of changes to the Variant spec is
>>>>>>> small.
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> There’s also an on-going vote to add encryption keys in
>>>>>>> support of full table encryption that I think we want to get in.
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> Any other items we may want to clear up?
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> Ryan
>>>>>>>
>>>>>>
