I went ahead and created https://github.com/apache/iceberg/pull/13042 to include the discussed requirement for DVs.
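For anyone skimming the thread, here is a minimal sketch of the row-count calculation that this requirement makes possible. It assumes the table has no orphaned DVs and no other kinds of delete files, and that a DV's record count equals the number of positions it deletes. The `ManifestEntry` class and `live_row_count` function are hypothetical names for illustration only, not part of any Iceberg API.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ManifestEntry:
    """Hypothetical, simplified view of one live manifest entry."""
    content: str       # "data" for a data file, "dv" for a deletion vector
    record_count: int  # rows in a data file, or positions deleted by a DV


def live_row_count(entries: Iterable[ManifestEntry]) -> int:
    """Add data file rows and subtract DV deletes.

    Only accurate under the assumption that every DV in the delete
    manifests still references a data file that is part of the table.
    """
    total = 0
    for entry in entries:
        if entry.content == "data":
            total += entry.record_count
        elif entry.content == "dv":
            total -= entry.record_count
        else:
            raise ValueError(f"unexpected content type: {entry.content}")
    return total


entries = [
    ManifestEntry("data", 100),
    ManifestEntry("data", 50),
    ManifestEntry("dv", 10),  # deletes 10 rows from one of the data files
]
print(live_row_count(entries))
```

If an orphaned DV were left behind after its data file was removed, the subtraction would undercount, which is why the constraint matters for this shortcut.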
Wed, May 7, 2025 at 20:31 Anton Okolnychyi <aokolnyc...@gmail.com> wrote:
> Steven, that may be a good point to add to ensure the metadata is properly maintained. If I remember correctly, the Spark implementation already drops old DVs in DELETE/UPDATE/MERGE but the data compaction wasn't doing it originally. I wonder if we fixed it. Eduard may know more.
>
> - Anton
>
> Wed, May 7, 2025 at 16:29 Steven Wu <stevenz...@gmail.com> wrote:
>
>> For the delete vector change, should we add the following constraint/requirement for the write path in the spec? I don't know if this is already the behavior of the Spark implementation.
>>
>> "if a data file is removed from the table, the corresponding DV reference must also be removed from delete manifest file"
>>
>> This constraint is to guarantee no orphaned DVs in the table state. It also makes it cheaper to calculate an *accurate* table row count: just iterate through the manifest files (data and delete), adding and subtracting record counts. There is no need to check whether the data files referenced by DVs are still part of the table, which can be a little more expensive.
>>
>>
>>
>> On Tue, May 6, 2025 at 9:18 AM Manu Zhang <owenzhang1...@gmail.com> wrote:
>>
>>> Thanks for clarification Ryan.
>>>
>>> I'm aware of the major changes, but I find it hard to go through all the related descriptions which are scattered all over the place.
>>>
>>> Manu
>>>
>>> On Tue, May 6, 2025 at 11:24 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>
>>>> Manu,
>>>>
>>>> We aren't currently voting. We are discussing any outstanding items to address before we close v3 to further changes and adopt the existing v3 changes. Right now, the open item is to clarify NaN behavior in geometry and geography, PR #12956 <https://github.com/apache/iceberg/pull/12956>.
>>>>
>>>> Thanks for noting that the row lineage changes should be added to the appendix, I'll open a PR to add it.
>>>> That appendix is an area to highlight things that have changed across versions, but an omission does not alter the requirements elsewhere in the spec. The changes we are discussing are the things that are noted as part of v3 in the spec. The major additions are new types, DVs, and row lineage.
>>>>
>>>> Ryan
>>>>
>>>> On Tue, May 6, 2025 at 3:32 AM Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>>
>>>>> I'm wondering what changes we are voting for here. Is it everything related to https://iceberg.apache.org/spec/#version-3-extended-types-and-capabilities from the table spec?
>>>>> How about changes to other specs?
>>>>>
>>>>> Do we summarize all the changes in https://iceberg.apache.org/spec/#appendix-e-format-version-changes?
>>>>> It looks like row lineage is missing here.
>>>>>
>>>>> Thanks,
>>>>> Manu
>>>>>
>>>>> On Tue, May 6, 2025 at 12:09 PM Anton Okolnychyi <aokolnyc...@gmail.com> wrote:
>>>>>
>>>>>> DVs in Spark seem to behave reasonably, serving as a reference implementation of the V3 spec. There are areas for optimization/refinement but nothing was observed that requires changing the spec. I would also like to add the notion of content overhead/metadata (for Puffin/Parquet footers) to manifests to optimize DV maintenance. That said, it is optional information and can be added after finalizing V3.
>>>>>>
>>>>>> - Anton
>>>>>>
>>>>>> Fri, May 2, 2025 at 23:23 Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>
>>>>>>> Hi Ryan
>>>>>>>
>>>>>>> All good for the spec. The idea for release is just a help to "double check" the spec is good (we already saw some slight changes on the spec while working on release). I think we can be "confident" that we won't have unexpected change.
>>>>>>>
>>>>>>> Thanks !
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On Thu, May 1, 2025 at 7:04 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>>>>> >
>>>>>>> > Thanks, everyone! Looks like there are a few points to discuss.
>>>>>>> >
>>>>>>> > [JB] Maybe a release with the core updated before announcing spec v3 officially would be a good idea ?
>>>>>>> > [Manu] Agree with Russell and JB that we make a “RC” release for V3 spec to test implementations, compatibility, etc before finalizing it.
>>>>>>> >
>>>>>>> > As Fokko noted, we are currently concerned about the spec and not implementations. The reason is that implementation work done before the spec is finalized serves to reduce risk and build confidence that the spec is complete and correct. Once that’s done, it is important to finalize the changes. If we don’t finalize the changes, then implementations don’t know how/what to build and cannot plan when they will fully support v3, because it could change. Most of the work in other implementations will take place after the spec is adopted.
>>>>>>> >
>>>>>>> > Our process for building confidence in new spec versions is to update the spec with pending changes, implement them to validate (and clarify or adjust as needed), and vote to adopt the new version as a confirmation that we agree that the spec changes are reasonable and correct.
>>>>>>> >
>>>>>>> > We’ve already voted to accept the pending v3 changes into the spec, so the changes have already been in a candidate state for quite some time to work on implementations. Now we’re at the point where we’ve implemented the features and, in my opinion, have demonstrated the spec changes are correct and complete.
>>>>>>> >
>>>>>>> > To that end, the question I’m raising in this thread is “what areas and features need further validation?”
>>>>>>> >
>>>>>>> > I appreciate the ideas here (releasing will assist other implementations) but I don’t think that changes the question for this thread. The aim is to identify specific risks and blockers that we need to tackle before adopting the changes.
>>>>>>> >
>>>>>>> > [Russell] We should probably come to a resolution on the compressed metadata.json name as well, although that’s mostly retroactive. V3 would be the place where we could officially change the naming convention.
>>>>>>> >
>>>>>>> > I don’t think that this affects v3, but we should agree before moving on. The only part of the spec that would depend on this is the paths used by file system tables and that strategy is deprecated. We should only document for clarity (we can’t change it) and I think we can do that any time.
>>>>>>> >
>>>>>>> > For the conventions used in catalog tables, I don’t think that we want to have requirements in the spec for file naming. We’ve avoided that in the past and it isn’t needed. It’s nice to have a convention in implementation notes, but there are other ways to handle this like magic bytes and catalog tracking.
>>>>>>> >
>>>>>>> > [Gang] it is implicit and obvious that only bucket transform can apply to multi-arg transform, it is still unclear the order of source columns and algorithm to use to calculate the bucket value
>>>>>>> >
>>>>>>> > I think there is some confusion here, but Fokko may have already cleared it up.
>>>>>>> >
>>>>>>> > Right now, there are no multi-argument transforms in the spec. We have discussed adding a multi-argument bucket function, but there is not currently one in the spec.
>>>>>>> > In order to minimize changes required for v3, we opted to update the spec to allow adding new transforms in a forward-compatible way between major spec versions (implementations must ignore unknown transforms).
>>>>>>> >
>>>>>>> > [Jia] We’re currently addressing the handling of null/NaN values for X, Y, Z, and M coordinates in the Parquet format repository
>>>>>>> >
>>>>>>> > I agree that this is a good thing to clarify. We currently state that the ranges are [-180, 180] and [-90, 90] for geography, but we should state how points with NaN values are handled.
>>>>>>> >
>>>>>>> >
>>>>>>> > On Wed, Apr 30, 2025 at 12:27 PM Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>>> >>
>>>>>>> >> Hi Jia
>>>>>>> >>
>>>>>>> >> I feel it would be nice to get that Parquet spec clarification https://github.com/apache/parquet-format/pull/494 into Iceberg V3 spec as well, once we finalize that.
>>>>>>> >>
>>>>>>> >> Thanks
>>>>>>> >> Szehon
>>>>>>> >>
>>>>>>> >> On Tue, Apr 29, 2025 at 10:55 PM Jia Yu <ji...@apache.org> wrote:
>>>>>>> >>>
>>>>>>> >>> Hi Szehon,
>>>>>>> >>>
>>>>>>> >>> Thanks for clarifying it.
>>>>>>> >>>
>>>>>>> >>> We’re currently addressing the handling of null/NaN values for X, Y, Z, and M coordinates in the Parquet format repository. We’ve already concluded that the spec of Parquet (same on the Iceberg side I believe) only needs additional clarification to guide expected behavior: https://github.com/apache/parquet-format/pull/494
>>>>>>> >>>
>>>>>>> >>> BTW the Parquet Geo C++ PR has been merged today: https://github.com/apache/arrow/pull/45459 I believe the Parquet Geo Java PR is also very close.
>>>>>>> >>>
>>>>>>> >>> Thanks,
>>>>>>> >>> Jia
>>>>>>> >>>
>>>>>>> >>> On Tue, Apr 29, 2025 at 10:48 PM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>>> >>>>
>>>>>>> >>>> Hey Ryan,
>>>>>>> >>>>
>>>>>>> >>>> Thanks for raising this, and I'm very excited to see V3 being finalized!
>>>>>>> >>>>
>>>>>>> >>>>> The v3 spec for multi-arg transform only advises to use `source-ids` instead of `source-id`. Although it is implicit and obvious that only bucket transform can apply to multi-arg transform, it is still unclear the order of source columns and algorithm to use to calculate the bucket value.
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>> V3 now uses `source-ids` when there are multiple arguments and `source-id` when there is just one. PR can be found here. This makes the serialization deterministic without knowing the format-version, simplifying the readers/writers. After some discussion on the PR, we've decided to leave out the multi-arg bucket transform so the V3 spec can be finalized. So V3 only contains the scaffolding for multi-arg transforms.
>>>>>>> >>>>
>>>>>>> >>>>> For Iceberg Geo, we are still waiting for the PR of geospatial bounds and geospatial predicate to be merged: https://github.com/apache/iceberg/pull/12667
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>> I think it is a good idea to distinguish between the spec and the actual code. If we all feel comfortable with the spec, I think we could finalize it. Being comfortable also means that we know that we have a working implementation, but I don't think we have to wrap up all the loose ends before voting on the spec.
>>>>>>> >>>>
>>>>>>> >>>> On the PyIceberg side, we're also working to catch up on the V3 capabilities. Having a Java release that exposes these capabilities helps, so we can do round-trip validation.
>>>>>>> >>>>
>>>>>>> >>>> Kind regards,
>>>>>>> >>>> Fokko
>>>>>>> >>>>
>>>>>>> >>>>
>>>>>>> >>>> On Wed, Apr 30, 2025 at 07:26 Jia Yu <ji...@apache.org> wrote:
>>>>>>> >>>>>
>>>>>>> >>>>> Hi folks,
>>>>>>> >>>>>
>>>>>>> >>>>> For Iceberg Geo, we are still waiting for the PR of geospatial bounds and geospatial predicate to be merged: https://github.com/apache/iceberg/pull/12667
>>>>>>> >>>>>
>>>>>>> >>>>> Should a release with core updates include this PR?
>>>>>>> >>>>>
>>>>>>> >>>>> Thanks,
>>>>>>> >>>>> Jia
>>>>>>> >>>>>
>>>>>>> >>>>> On Tue, Apr 29, 2025 at 10:21 PM Manu Zhang <owenzhang1...@gmail.com> wrote:
>>>>>>> >>>>>>
>>>>>>> >>>>>> Agree with Russell and JB that we make a "RC" release for V3 spec to test implementations, compatibility, etc before finalizing it.
>>>>>>> >>>>>>
>>>>>>> >>>>>> Thanks,
>>>>>>> >>>>>> Manu
>>>>>>> >>>>>>
>>>>>>> >>>>>> On Wed, Apr 30, 2025 at 12:24 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Hi Ryan
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> It sounds good.
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> About multi-args transforms, with the clarification we did a couple of weeks ago, I think we are good.
>>>>>>> >>>>>>> Maybe a release with the core updated before announcing spec v3 officially would be a good idea ?
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> Regards
>>>>>>> >>>>>>> JB
>>>>>>> >>>>>>>
>>>>>>> >>>>>>> On Wed, Apr 30, 2025 at 00:35, Ryan Blue <rdb...@gmail.com> wrote:
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> Hi everyone,
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> I think we’ve reached the point where it’s time to finalize and adopt the changes for Iceberg v3. We’ve been working toward this for the last few months and have now implemented the v3 features in the Java library to reduce the risk of needing changes or hitting problems (row lineage support in Spark 3.5 just went in!).
>>>>>>> >>>>>>>> We’ve also incorporated some clarifications and minor changes back into the spec from what we’ve learned.
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> At this point, I’m confident that the spec is reasonable and correct. Thank you to everyone working on these reference implementations!
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> The next step is to discuss any outstanding items or concerns about moving forward, and then to have a vote thread to adopt the spec. I’ll start off with a couple of items:
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> One potential concern is that the upstream Variant spec hasn’t yet been finalized by the Parquet community, but we’ve built a full, independent implementation in Iceberg to validate the spec. I think the Parquet community is primarily waiting on getting the PRs in to have a Java reference implementation, so the risk of changes to the Variant spec is small.
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> There’s also an ongoing vote to add encryption keys in support of full table encryption that I think we want to get in.
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> Any other items we may want to clear up?
>>>>>>> >>>>>>>>
>>>>>>> >>>>>>>> Ryan