Re: [DISCUSS] adoption of format version 3

Szehon Ho Tue, 06 Aug 2024 16:50:05 -0700

It makes sense to me. Thanks for summarizing it; it's an exciting list of
new features.


For Geo, I will let the Wherobots engineers working on it (Jia Yu and
others) comment, but the geo type could take more time if we wait for the
Parquet-Format change, followed by a Parquet implementation release.

+1 on multi-value transforms. I think it would be great, and doable, to get
those in; the spec allows for them, but it's just waiting on
implementation/review.

Thanks
Szehon

On Tue, Aug 6, 2024 at 4:42 PM Ryan Blue <[email protected]>
wrote:

> I’ve been going through the list I’ve accumulated for v3 changes and I
> think we do have a fairly clear set of things that people are working on.
> There are two main areas. The first is centered around types and extending
> existing metadata:
>
>    - Add new types: timestamp(ns), variant, blob, and null
>    - Add new type promotion: long to timestamp,
>    boolean/int/long/date/time/timestamp/uuid to string, null to anything, most
>    types to variant
>    - Add default value support via initial-default, write-default
>    - Add multi-arg transforms (multi-column bucket, zorder)
>
> Then there are a few bigger items that have people actively working:
>
>    - Row-level tracking metadata
>    - Improvements for position delete performance
>    - Encryption metadata
>    - Geo support: geometry type, xz transform, and geo predicates
>
> I propose that we target the first set of things since that’s a group of
> similar changes. It makes sense (at least to me) to add new types in a
> group, and it also makes sense to extend type capabilities (defaults and
> promotions) at the same time. (Also, we can choose to exclude blob if it is
> a large amount of work)
>
> I’d include multi-arg transforms in that group since the design is well
> written and nearly done. And we can make sure that there is a
> backward-compatible way to add new transforms between major releases. The
> Java library can currently handle new transforms and if we get those
> details into the spec then we don’t need to get the specifics of multi-arg
> bucketing as part of the v3 release.
>
> For the second group of projects, I suggest that we continue to actively
> work on them and try to get at least 2 of them in. Encryption metadata is
> quite close and just needs a few table-level additions to the metadata
> file. The changes for row-level tracking and position delete performance
> should be reasonably sized.
>
> I’d also love to see the geo support in v3, but that’s also a well-scoped
> feature that could be a v4 if it isn’t going to make it in time. My main
> concern here is the size of the changes where I don’t have much context.
>
> In summary, I’d say we should aim to include the new types, promotion,
> default values, and multi-arg transforms. Then include any of the larger
> items that are ready in time. Does that sound reasonable?
>
> Ryan
>
> On Mon, Aug 5, 2024 at 3:20 PM Micah Kornfield <[email protected]>
> wrote:
>
>> I suggest keeping those things separate — Micah, would you mind starting
>>> a separate thread so this one can focus on v3?
>>
>>
>> Yes, I'll start another thread on this post V3, to allow for focus on
>> closing off V3 with the current process (and see if there is interest in
>> trying something new for v4).
>>
>> Thanks,
>> Micah
>>
>> On Mon, Aug 5, 2024 at 12:17 PM Ryan Blue <[email protected]>
>> wrote:
>>
>>> At least for discussion purposes, I think the REST spec (and any spec
>>> that involves code that will ultimately be consumed) is probably a harder
>>> conversation.
>>>
>>> I agree that it’s a very different conversation and probably out of
>>> scope for the table v3 spec.
>>>
>>> I’m undecided if minor releases are necessary for non-code specs, this
>>> seems like it might be too much overhead and might not provide a ton of
>>> value (maybe you could elaborate on the value you see in it?).
>>>
>>> Thanks for bringing up the point about minor versions. It’s critical to
>>> keep in mind that we’re talking about two different types of changes. For
>>> the v3 discussion, I think the question is what changes we want to add in
>>> v3, which is an opportunity to group together forward-incompatible changes
>>> that require new behavior to read tables correctly.
>>>
>>> It’s great to discuss whether we want to change how we version the spec
>>> and see if we want to release breaking changes more often. I *think*
>>> that was Micah’s original intent for bringing up a regular release cadence
>>> for the spec. But we should also be aware that this is a separate
>>> discussion. Most of the points that Micah raised are covered by our
>>> existing process for new *major* versions:
>>>
>>>    1. Add changes to the spec such that they are clearly attached to a
>>>    future version
>>>    2. Implement the changes in at least one implementation, probably
>>>    the reference implementation
>>>    3. When we have accumulated enough breaking changes, vote to adopt
>>>    the new version
>>>
>>> There are differences that we may choose to change, like adding the
>>> changes to the spec rather than keeping them in PRs. And we may want to
>>> introduce a regular cadence to make the last step more predictable. Those
>>> are great discussions to have, but right now we know that we have changes
>>> we want to get into a v3 in the next few months. I suggest keeping those
>>> things separate — Micah, would you mind starting a separate thread so this
>>> one can focus on v3?
>>>
>>> I also see that if we were to go with Micah’s suggestion, it has an
>>> impact on the decisions that we need to make for the v3 release. But I
>>> think that even if we were to have a regular release cadence, it would
>>> still make sense to group features like new types together because it makes
>>> the versions easier to understand and limits the overall impact in the
>>> implementations.
>>>
>>> Ryan
>>>
>>> On Fri, Aug 2, 2024 at 11:39 AM Micah Kornfield <[email protected]>
>>> wrote:
>>>
>>>> I have been a big advocate for releasing all the Iceberg specs
>>>>> regularly, and just follow a normal product release cycle with major and
>>>>> minor releases. I touched a bit of the reasoning in the thread for fixing
>>>>> stats fields in REST spec [1]. This helps a lot with engines that do not
>>>>> use any Iceberg open source library and just look at a spec and implement
>>>>> it. With a regular release, they can have a stable version to look into,
>>>>> rather than a spec that is changing all the time within the same version.
>>>>
>>>>
>>>> At least for discussion purposes, I think the REST spec (and any spec
>>>> that involves code that will ultimately be consumed) is probably a harder
>>>> conversation.  I'm undecided if minor releases are necessary for non-code
>>>> specs, this seems like it might be too much overhead and might not provide
>>>> a ton of value (maybe you could elaborate on the value you see in it?).
>>>>
>>>>
>>>>> I think Fokko brought up a point that "this will introduce a process
>>>>> that will slow the evolution down", which is true because you need to
>>>>> spend additional effort and release it. And without a reference
>>>>> implementation, it is hard to say if the spec is mature enough to be
>>>>> released, which again makes it potentially tied to the release cycle of
>>>>> at least the Java library.
>>>>
>>>>
>>>> Sorry I think I missed Fokko's argument on the linked thread.  In my
>>>> mind, the order of operations on non-code spec changes would be:
>>>>
>>>> 1.  Spec change is proposed/reviewed and agreed upon but not merged.
>>>> 2.  Reference implementation happens (possibly with revisions if
>>>> implementation challenges arise).
>>>> 3.  Reference implementation is merged
>>>> 4.  Spec change is merged.
>>>> 5.  Spec is officially  "released" at some normal cadence (or in theory
>>>> it could be done immediately).
>>>>
>>>> Steps 3 and 4 could happen simultaneously, or 4 could potentially have
>>>> some lag to it to allow for further feedback (i.e. letting reference
>>>> implementation be released) and revision.
>>>>
>>>> If step 5 is done immediately after step 4, I don't think this would
>>>> slow down evolution (but comes at the cost of more versions).  Part of step
>>>> five would necessitate changing code for any incomplete implementations to
>>>> only be turned on in the next revision (or larger features could be worked
>>>> on in a separate branch to avoid this complication).
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Aug 2, 2024 at 9:10 AM Jack Ye <[email protected]> wrote:
>>>>
>>>>> > An alternative view: Would it make sense to start releasing the
>>>>> table specification on a regular cadence (e.g. quarterly, every 6 months
>>>>> or yearly)?
>>>>>
>>>>> I have been a big advocate for releasing all the Iceberg specs
>>>>> regularly, and just follow a normal product release cycle with major and
>>>>> minor releases. I touched a bit of the reasoning in the thread for fixing
>>>>> stats fields in REST spec [1]. This helps a lot with engines that do not
>>>>> use any Iceberg open source library and just look at a spec and implement
>>>>> it. With a regular release, they can have a stable version to look into,
>>>>> rather than a spec that is changing all the time within the same version.
>>>>>
>>>>> It is important to note that minor spec versions will not be leveraged
>>>>> in implementations the way we currently have logic for switching
>>>>> behaviors depending on major versions. It is purely for the purpose of
>>>>> making more incremental progress on the spec, and providing stable spec
>>>>> versions for other reference implementations. Otherwise, the branches in
>>>>> the codebase to handle different versions easily get out of control.
>>>>>
>>>>> I think Fokko brought up a point that "this will introduce a process
>>>>> that will slow the evolution down", which is true because you need to
>>>>> spend additional effort and release it. And without a reference
>>>>> implementation, it is hard to say if the spec is mature enough to be
>>>>> released, which again makes it potentially tied to the release cycle of
>>>>> at least the Java library.
>>>>>
>>>>> Curious what people think.
>>>>>
>>>>> Best,
>>>>> Jack Ye
>>>>>
>>>>> [1] https://lists.apache.org/thread/v6x772v9sgo0xhpwmh4br756zhbgomtf
>>>>>
>>>>> On Wed, Jul 31, 2024 at 10:19 PM Micah Kornfield <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> It sounds like most of the opinions so far are waiting for the scope
>>>>>> of work to finish before finalizing the specification.
>>>>>>
>>>>>> An alternative view: Would it make sense to start releasing the table
>>>>>> specification on a regular cadence (e.g. quarterly, every 6 months or
>>>>>> yearly)?  I think the problem with waiting for features to get in is that
>>>>>> priorities change and things take longer than expected, thus leaving the
>>>>>> actual finalization of the specification in limbo and probably adding to
>>>>>> project management overhead. If the specification is released regularly
>>>>>> then it means features can always be included in the next release without
>>>>>> too much delay hopefully.  The main downside I can think of in this
>>>>>> approach is having to have more branches in code to handle different
>>>>>> versions.
>>>>>>
>>>>>> One corollary to this approach is spec changes shouldn't be merged
>>>>>> before their implementations are ready.
>>>>>>
>>>>>>   - At least one complete reference implementation should exist.
>>>>>>
>>>>>>
>>>>>> For more complicated features I think at some point soon it might be
>>>>>> worth considering two implementations (or at least 1 full implementation
>>>>>> and 1 read only implementation) to make sure there aren't compatibility
>>>>>> issues/misunderstandings in the specification (e.g. I think Variant and
>>>>>> Geography fall into this category).
>>>>>>
>>>>>> Cheers,
>>>>>> Micah
>>>>>>
>>>>>> On Wed, Jul 31, 2024 at 12:47 PM Russell Spitzer <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> I think this all sounds good, the real question is whether or not we
>>>>>>> have someone to actively work on the proposals. I think for things like
>>>>>>> Default Values and Geo Types we have folks actively working on them so
>>>>>>> it's not a big deal.
>>>>>>>
>>>>>>> On Wed, Jul 31, 2024 at 2:09 PM Szehon Ho <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Sorry I missed the sync this morning (sick), I'd like to push for
>>>>>>>> geo too.
>>>>>>>>
>>>>>>>> I think on this front as per the last sync, Ryan recommended to
>>>>>>>> wait for Parquet support to land, to avoid having two versions on
>>>>>>>> Iceberg side (Iceberg-native vs Parquet-native).  Parquet support is
>>>>>>>> being actively worked on iiuc:
>>>>>>>> https://github.com/apache/parquet-format/pull/240
>>>>>>>> But it would bind V3 to the parquet-format release timeline, unless we
>>>>>>>> start with iceberg-native support first and move later (as we
>>>>>>>> originally proposed).
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Szehon
>>>>>>>>
>>>>>>>> On Wed, Jul 31, 2024 at 10:58 AM Walaa Eldin Moustafa <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Another feature that was planned for V3 is support for default
>>>>>>>>> values.
>>>>>>>>> Spec doc update was already merged a while ago [1]. Implementation
>>>>>>>>> is
>>>>>>>>> ongoing in this PR [2].
>>>>>>>>>
>>>>>>>>> [1] https://iceberg.apache.org/spec/#default-values
>>>>>>>>> [2] https://github.com/apache/iceberg/pull/9502
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Walaa.
>>>>>>>>>
>>>>>>>>> On Wed, Jul 31, 2024 at 10:52 AM Russell Spitzer
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>> >
>>>>>>>>> > Thanks for bringing this up, I would say that from my
>>>>>>>>> perspective I have time to really push through hopefully two things
>>>>>>>>> >
>>>>>>>>> > Variant Type and
>>>>>>>>> > Row Lineage (which I will have a proposal for on the mailing
>>>>>>>>> list next week)
>>>>>>>>> >
>>>>>>>>> > I'm using the Project to try to track logistics and minutia
>>>>>>>>> required for the new spec version but I would like to bring other
>>>>>>>>> work in there as well so we can get a clear picture of what is
>>>>>>>>> actually being actively worked on.
>>>>>>>>> >
>>>>>>>>> > On Wed, Jul 31, 2024 at 12:27 PM Jacob Marble <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>> >>
>>>>>>>>> >> Good morning,
>>>>>>>>> >>
>>>>>>>>> >> To continue the discussion from today's community sync, where
>>>>>>>>> format version 3 was discussed:
>>>>>>>>> >>
>>>>>>>>> >> Questions answered by consensus:
>>>>>>>>> >> - Format version releases should _not_ be tied to Iceberg
>>>>>>>>> version releases.
>>>>>>>>> >> - Several planned features will require format version
>>>>>>>>> releases; the process shouldn't be onerous.
>>>>>>>>> >>
>>>>>>>>> >> Unanswered questions:
>>>>>>>>> >> - What will be included in format version 3?
>>>>>>>>> >>   - What is a reasonable target date?
>>>>>>>>> >>   - How to track progress? Today, there are two public lists:
>>>>>>>>> >>     - GH milestone:
>>>>>>>>> https://github.com/apache/iceberg/milestone/42
>>>>>>>>> >>     - GH project: https://github.com/orgs/apache/projects/377
>>>>>>>>> >> - What is required of a feature in order to be included in any
>>>>>>>>> adopted format version?
>>>>>>>>> >>   - At least one complete reference implementation should exist.
>>>>>>>>> >>     - Java is the reference implementation by convention;
>>>>>>>>> that's OK, but not perfect. Should Java be the reference
>>>>>>>>> implementation by mandate?
>>>>>>>>> >>
>>>>>>>>> >> Have I missed anything?
>>>>>>>>> >>
>>>>>>>>> >> --
>>>>>>>>> >> Jacob Marble
>>>>>>>>>
>>>>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Databricks
>>>
>>
>
> --
> Ryan Blue
> Databricks
>
