Re: [DISCUSS] Finalizing the v3 spec

Ryan Blue Thu, 01 May 2025 10:04:57 -0700

Thanks, everyone! Looks like there are a few points to discuss.

[JB] Maybe a release with the core updated before announcing spec v3
officially would be a good idea ?
[Manu] Agree with Russell and JB that we make a “RC” release for V3 spec to
test implementations, compatibility, etc before finalizing it.

As Fokko noted, we are currently concerned about the spec and not
implementations. The purpose of implementation work before the spec is
finalized is to reduce risk and build confidence that the spec is complete
and correct. Once that’s done, it is important to finalize the changes. If
we don’t finalize the changes, then implementations don’t know how or what
to build and cannot plan when they will fully support v3, because it could
change. Most of the work in other implementations will take place after the
spec is adopted.

Our process for building confidence in new spec versions is to update the
spec with pending changes, implement them to validate (and clarify or
adjust as needed), and vote to adopt the new version as a confirmation that
we agree that the spec changes are reasonable and correct.

We’ve already voted to accept the pending v3 changes into the spec, so the
changes have already been in a candidate state for quite some time to work
on implementations. Now we’re at the point where we’ve implemented the
features and, in my opinion, have demonstrated the spec changes are correct
and complete.

To that end, the question I’m raising in this thread is *“what areas and
features need further validation?”*

I appreciate the ideas here — releasing will assist other implementations —
but I don’t think that changes the question for this thread. The aim is to
identify specific risks and blockers that we need to tackle before adopting
the changes.

[Russell] We should probably come to a resolution on the compressed
metadata.json name as well, although that’s mostly retroactive. V3 would be
the place where we could officially change the naming convention.

I don’t think that this affects v3, but we should agree before moving on.
The only part of the spec that would depend on this is the paths used by
file system tables and that strategy is deprecated. We should only document
it for clarity (we can’t change it), and I think we can do that any time.

For the conventions used in catalog tables, I don’t think that we want to
have requirements in the spec for file naming. We’ve avoided that in the
past and it isn’t needed. It’s nice to have a convention in implementation
notes, but there are other ways to handle this like magic bytes and catalog
tracking.
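
As a minimal sketch of the magic-bytes approach (not taken from any Iceberg
implementation; the function name load_metadata is hypothetical), a reader
can decide whether a metadata file is gzip-compressed from its first two
bytes instead of from its file name. This assumes gzip is the only
compression in play:

```python
import gzip
import json

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def load_metadata(raw: bytes) -> dict:
    """Parse metadata JSON, decompressing first if the bytes look gzipped.

    Hypothetical helper for illustration only. Valid JSON text cannot begin
    with byte 0x1f, so the check is unambiguous for these two cases.
    """
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))


plain = json.dumps({"format-version": 3}).encode("utf-8")
compressed = gzip.compress(plain)

# Both inputs parse to the same metadata, whatever the file was called.
print(load_metadata(plain) == load_metadata(compressed))
```

With a check like this, the file name is only a hint for humans and tooling.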

[Gang] it is implicit and obvious that only bucket transform can apply to
multi-arg transform, it is still unclear the order of source columns and
algorithm to use to calculate the bucket value

I think there is some confusion here, but Fokko may have already cleared it
up.

Right now, there are no multi-argument transforms in the spec. We have
discussed adding a multi-argument bucket function, but there is not
currently one in the spec. In order to minimize changes required for v3, we
opted to update the spec to allow adding new transforms in a
forward-compatible way between major spec versions (implementations must
ignore unknown transforms).
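
To illustrate what ignoring unknown transforms could look like in a reader,
here is a rough sketch. It assumes partition fields arrive as parsed JSON
objects with a "transform" string; the helper split_fields and the transform
name "multi_bucket[8]" are made up for the example and are not in the spec.
What a reader then does with an ignored field (for instance, skipping
partition pruning on it) is left open here.

```python
import re

# Transform names this sketch's reader understands. bucket and truncate
# carry a numeric parameter, e.g. bucket[16].
KNOWN_TRANSFORM = re.compile(
    r"^(identity|void|year|month|day|hour|bucket\[\d+\]|truncate\[\d+\])$"
)


def split_fields(fields):
    """Split partition fields into (understood, ignored) by transform name."""
    understood, ignored = [], []
    for field in fields:
        known = KNOWN_TRANSFORM.match(field["transform"])
        (understood if known else ignored).append(field)
    return understood, ignored


fields = [
    {"name": "ts_day", "transform": "day", "source-id": 1},
    {"name": "id_bucket", "transform": "bucket[16]", "source-id": 2},
    # Hypothetical future transform, invented for this example.
    {"name": "combo", "transform": "multi_bucket[8]", "source-ids": [2, 3]},
]

understood, ignored = split_fields(fields)
print([f["name"] for f in understood])  # fields the reader can use
print([f["name"] for f in ignored])     # fields the reader skips
```

The point is that an unrecognized transform is set aside rather than treated
as an error, so a newer writer does not break an older reader.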

[Jia] We’re currently addressing the handling of null/NaN values for X, Y,
Z, and M coordinates in the Parquet format repository

I agree that this is a good thing to clarify. We currently state that the
ranges are [-180, 180] and [-90, 90] for geography, but we should state how
points with NaN values are handled.

On Wed, Apr 30, 2025 at 12:27 PM Szehon Ho <szehon.apa...@gmail.com> wrote:

> Hi Jia
>
> I feel it would be nice to get that Parquet spec clarification
> https://github.com/apache/parquet-format/pull/494 into Iceberg V3 spec as
> well, once we finalize that.
>
> Thanks
> Szehon
>
> On Tue, Apr 29, 2025 at 10:55 PM Jia Yu <ji...@apache.org> wrote:
>
>> Hi Szehon,
>>
>> Thanks for clarifying it.
>>
>> We’re currently addressing the handling of null/NaN values for X, Y, Z,
>> and M coordinates in the Parquet format repository. We’ve already concluded
>> that the spec of Parquet (same on the Iceberg side I believe) only needs
>> additional clarification to guide expected behavior:
>> https://github.com/apache/parquet-format/pull/494
>>
>> BTW the Parquet Geo C++ PR has been merged today:
>> https://github.com/apache/arrow/pull/45459  I believe the Parquet Geo
>> Java PR is also very close.
>>
>> Thanks,
>> Jia
>>
>> On Tue, Apr 29, 2025 at 10:48 PM Fokko Driesprong <fo...@apache.org>
>> wrote:
>>
>>> Hey Ryan,
>>>
>>> Thanks for raising this, and I'm very excited to see V3 being finalized!
>>>
>>> The v3 spec for multi-arg transform only advises to use `source-ids`
>>>> instead of `source-id`. Although it is implicit and obvious that only
>>>> bucket transform can apply to multi-arg transform, it is still unclear the
>>>> order of source columns and algorithm to use to calculate the bucket value.
>>>>
>>>
>>> V3 now uses `source-ids` when there are multiple arguments and `source-id`
>>> when there is just one. PR can be found here
>>> <https://github.com/apache/iceberg/pull/12644>. This makes the
>>> serialization deterministic without knowing the format-version, simplifying
>>> the readers/writers. After some discussion on the PR, we've decided to
>>> leave out the multi-arg bucket transform so the V3 spec can be finalized.
>>> So V3 only contains the scaffolding for multi-arg transforms.
>>>
>>> For Iceberg Geo, we are still waiting for the PR of geospatial bounds
>>>> and geospatial predicate to be merged:
>>>> https://github.com/apache/iceberg/pull/12667
>>>
>>>
>>> I think it is a good idea to distinguish between the spec and the actual
>>> code. If we all feel comfortable with the spec, I think we could finalize
>>> it. Being comfortable also means that we know that we have a working
>>> implementation, but I don't think we have to wrap up all the loose ends
>>> before voting on the spec.
>>>
>>> At the PyIceberg side, we're also working to catch up on the V3
>>> capabilities <https://github.com/apache/iceberg-python/issues/1818>.
>>> Having a Java release that exposes these capabilities helps, so we can do
>>> round-trip validation.
>>>
>>> Kind regards,
>>> Fokko
>>>
>>>
>>> On Wed, Apr 30, 2025 at 07:26, Jia Yu <ji...@apache.org> wrote:
>>>
>>>> Hi folks,
>>>>
>>>> For Iceberg Geo, we are still waiting for the PR of geospatial bounds
>>>> and geospatial predicate to be merged:
>>>> https://github.com/apache/iceberg/pull/12667
>>>>
>>>> Should a release with core updates include this PR?
>>>>
>>>> Thanks,
>>>> Jia
>>>>
>>>> On Tue, Apr 29, 2025 at 10:21 PM Manu Zhang <owenzhang1...@gmail.com>
>>>> wrote:
>>>>
>>>>> Agree with Russell and JB that we make a "RC" release for V3 spec to
>>>>> test implementations, compatibility, etc before finalizing it.
>>>>>
>>>>> Thanks,
>>>>> Manu
>>>>>
>>>>> On Wed, Apr 30, 2025 at 12:24 PM Jean-Baptiste Onofré <j...@nanthrax.net>
>>>>> wrote:
>>>>>
>>>>>> Hi Ryan
>>>>>>
>>>>>> It sounds good.
>>>>>>
>>>>>> About multi-args transforms, with the clarification we did a couple
>>>>>> of weeks ago, I think we are good.
>>>>>> Maybe a release with the core updated before announcing spec v3
>>>>>> officially would be a good idea ?
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>> On Wed, Apr 30, 2025 at 00:35, Ryan Blue <rdb...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I think we’ve reached the point where it’s time to finalize and
>>>>>>> adopt the changes for Iceberg v3. We’ve been working toward this for the
>>>>>>> last few months and have now implemented the v3 features in the Java
>>>>>>> library to reduce the risk of needing changes or hitting problems (row
>>>>>>> lineage support in Spark 3.5 just went in!). We’ve also incorporated some
>>>>>>> clarifications and minor changes back into the spec from what we’ve learned.
>>>>>>>
>>>>>>> At this point, I’m confident that the spec is reasonable and
>>>>>>> correct. Thank you to everyone working on these reference implementations!
>>>>>>>
>>>>>>> The next step is to discuss any outstanding items or concerns about
>>>>>>> moving forward, and then to have a vote thread to adopt the spec. I’ll
>>>>>>> start off with a couple of items:
>>>>>>>
>>>>>>> One potential concern is that the upstream Variant spec hasn’t yet
>>>>>>> been finalized by the Parquet community, but we’ve built a full,
>>>>>>> independent implementation in Iceberg to validate the spec. I think the
>>>>>>> Parquet community is primarily waiting on getting the PRs in to have a Java
>>>>>>> reference implementation, so the risk of changes to the Variant spec is
>>>>>>> small.
>>>>>>>
>>>>>>> There’s also an on-going vote to add encryption keys in support of
>>>>>>> full table encryption that I think we want to get in.
>>>>>>>
>>>>>>> Any other items we may want to clear up?
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>
