It makes sense to me. Thanks for summarizing it; it's an exciting list of new features.
For Geo, I will let the Wherobots engineers working on it (Jia Yu and others) comment, but the geo type could take more time if we wait for the Parquet-Format change, followed by a Parquet implementation release.

+1 on multi-value transforms. I think it will be great and doable to get those in; the spec allows their existence, but it's just waiting on implementation/review.

Thanks
Szehon

On Tue, Aug 6, 2024 at 4:42 PM Ryan Blue <b...@databricks.com.invalid> wrote:

> I’ve been going through the list I’ve accumulated for v3 changes and I
> think we do have a fairly clear set of things that people are working on.
> There are two main areas. The first is centered around types and extending
> existing metadata:
>
> - Add new types: timestamp(ns), variant, blob, and null
> - Add new type promotion: long to timestamp,
>   boolean/int/long/date/time/timestamp/uuid to string, null to anything,
>   most types to variant
> - Add default value support via initial-default, write-default
> - Add multi-arg transforms (multi-column bucket, zorder)
>
> Then there are a few bigger items that have people actively working:
>
> - Row-level tracking metadata
> - Improvements for position delete performance
> - Encryption metadata
> - Geo support: geometry type, xz transform, and geo predicates
>
> I propose that we target the first set of things since that’s a group of
> similar changes. It makes sense (at least to me) to add new types in a
> group, and it also makes sense to extend type capabilities (defaults and
> promotions) at the same time. (Also, we can choose to exclude blob if it is
> a large amount of work)
>
> I’d include multi-arg transforms in that group since the design is well
> written and nearly done. And we can make sure that there is a
> backward-compatible way to add new transforms between major releases.
> The Java library can currently handle new transforms and if we get those
> details into the spec then we don’t need to get the specifics of multi-arg
> bucketing as part of the v3 release.
>
> For the second group of projects, I suggest that we continue to actively
> work on them and try to get at least 2 of them in. Encryption metadata is
> quite close and just needs a few table-level additions to the metadata
> file. The changes for row-level tracking and position delete performance
> should be reasonably sized.
>
> I’d also love to see the geo support in v3, but that’s also a well-scoped
> feature that could be a v4 if it isn’t going to make it in time. My main
> concern here is the size of the changes where I don’t have much context.
>
> In summary, I’d say we should aim to include the new types, promotion,
> default values, and multi-arg transforms. Then include any of the larger
> items that are ready in time. Does that sound reasonable?
>
> Ryan
>
> On Mon, Aug 5, 2024 at 3:20 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>>> I suggest keeping those things separate. Micah, would you mind starting
>>> a separate thread so this one can focus on v3?
>>
>> Yes I'll start another thread on this post V3, to allow for focus on
>> closing off V3 with the current process (and see if there is interest in
>> trying something new for v4).
>>
>> Thanks,
>> Micah
>>
>> On Mon, Aug 5, 2024 at 12:17 PM Ryan Blue <b...@databricks.com.invalid>
>> wrote:
>>
>>> At least for discussion purposes, I think the REST spec (and any spec
>>> that involves code that will ultimately be consumed) is probably a harder
>>> conversation.
>>>
>>> I agree that it’s a very different conversation and probably out of
>>> scope for the table v3 spec.
>>>
>>> I’m undecided if minor releases are necessary for non-code specs, this
>>> seems like it might be too much overhead and might not provide a ton of
>>> value (maybe you could elaborate on the value you see in it?).
>>>
>>> Thanks for bringing up the point about minor versions. It’s critical to
>>> keep in mind that we’re talking about two different types of changes. For
>>> the v3 discussion, I think the question is what changes we want to add in
>>> v3, which is an opportunity to group together forward-incompatible changes
>>> that require new behavior to read tables correctly.
>>>
>>> It’s great to discuss whether we want to change how we version the spec
>>> and see if we want to release breaking changes more often. I *think*
>>> that was Micah’s original intent for bringing up a regular release cadence
>>> for the spec. But we should also be aware that this is a separate
>>> discussion. Most of the points that Micah raised are covered by our
>>> existing process for new *major* versions:
>>>
>>> 1. Add changes to the spec such that they are clearly attached to a
>>>    future version
>>> 2. Implement the changes in at least one implementation, probably
>>>    the reference implementation
>>> 3. When we have accumulated enough breaking changes, vote to adopt
>>>    the new version
>>>
>>> There are differences that we may choose to change, like adding the
>>> changes to the spec rather than keeping them in PRs. And we may want to
>>> introduce a regular cadence to make the last step more predictable. Those
>>> are great discussions to have, but right now we know that we have changes
>>> we want to get into a v3 in the next few months. I suggest keeping those
>>> things separate. Micah, would you mind starting a separate thread so this
>>> one can focus on v3?
>>>
>>> I also see that if we were to go with Micah’s suggestion, it has an
>>> impact on the decisions that we need to make for the v3 release. But I
>>> think that even if we were to have a regular release cadence, it would
>>> still make sense to group features like new types together because it makes
>>> the versions easier to understand and limits the overall impact in the
>>> implementations.
>>>
>>> Ryan
>>>
>>> On Fri, Aug 2, 2024 at 11:39 AM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>>>> I have been a big advocate for releasing all the Iceberg specs
>>>>> regularly, and just follow a normal product release cycle with major and
>>>>> minor releases. I touched a bit of the reasoning in the thread for fixing
>>>>> stats fields in REST spec [1]. This helps a lot with engines that do not
>>>>> use any Iceberg open source library and just look at a spec and implement
>>>>> it. With a regular release, they can have a stable version to look into,
>>>>> rather than a spec that is changing all the time within the same version.
>>>>
>>>> At least for discussion purposes, I think the REST spec (and any spec
>>>> that involves code that will ultimately be consumed) is probably a harder
>>>> conversation. I'm undecided if minor releases are necessary for non-code
>>>> specs, this seems like it might be too much overhead and might not provide
>>>> a ton of value (maybe you could elaborate on the value you see in it?).
>>>>
>>>>> I think Fokko brought up a point that "this will introduce a process
>>>>> that will slow the evolution down", which is true because you need to
>>>>> spend additional effort and release it. And without a reference
>>>>> implementation, it is hard to say if the spec is mature enough to be
>>>>> released, which again makes it potentially tied to the release cycle of
>>>>> at least the Java library.
>>>>
>>>> Sorry I think I missed Fokko's argument on the linked thread. In my
>>>> mind, the order of operations on non-code spec changes would be:
>>>>
>>>> 1. Spec change is proposed/reviewed and agreed upon but not merged.
>>>> 2. Reference implementation happens (possibly with revisions if
>>>>    implementation challenges arise).
>>>> 3. Reference implementation is merged
>>>> 4. Spec change is merged.
>>>> 5. Spec is officially "released" at some normal cadence (or in theory
>>>>    it could be done immediately).
>>>>
>>>> Steps 3 and 4 could happen simultaneously, or 4 could potentially have
>>>> some lag to it to allow for further feedback (i.e. letting reference
>>>> implementation be released) and revision.
>>>>
>>>> If step 5 is done immediately after step 4, I don't think this would
>>>> slow down evolution (but comes at the cost of more versions). Part of step
>>>> five would necessitate changing code for any incomplete implementations to
>>>> only be turned on in the next revision (or larger features could be worked
>>>> on in a separate branch to avoid this complication).
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> On Fri, Aug 2, 2024 at 9:10 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>
>>>>> > An alternative view: Would it make sense to start releasing the
>>>>> > table specification on a regular cadence (e.g. quarterly, every 6
>>>>> > months or yearly)?
>>>>>
>>>>> I have been a big advocate for releasing all the Iceberg specs
>>>>> regularly, and just follow a normal product release cycle with major and
>>>>> minor releases. I touched a bit of the reasoning in the thread for fixing
>>>>> stats fields in REST spec [1]. This helps a lot with engines that do not
>>>>> use any Iceberg open source library and just look at a spec and implement
>>>>> it. With a regular release, they can have a stable version to look into,
>>>>> rather than a spec that is changing all the time within the same version.
>>>>>
>>>>> It is important to note that minor spec versions will not be leveraged
>>>>> in implementations like how we have logics right now for switching
>>>>> behaviors depending on major versions. It is purely for the purpose of
>>>>> making more incremental progress on the spec, and providing stable spec
>>>>> versions for other reference implementations. Otherwise, the branches in
>>>>> the codebase to handle different versions easily get out of control.
>>>>>
>>>>> I think Fokko brought up a point that "this will introduce a process
>>>>> that will slow the evolution down", which is true because you need to
>>>>> spend additional effort and release it. And without a reference
>>>>> implementation, it is hard to say if the spec is mature enough to be
>>>>> released, which again makes it potentially tied to the release cycle of
>>>>> at least the Java library.
>>>>>
>>>>> Curious what people think.
>>>>>
>>>>> Best,
>>>>> Jack Ye
>>>>>
>>>>> [1] https://lists.apache.org/thread/v6x772v9sgo0xhpwmh4br756zhbgomtf
>>>>>
>>>>> On Wed, Jul 31, 2024 at 10:19 PM Micah Kornfield
>>>>> <emkornfi...@gmail.com> wrote:
>>>>>
>>>>>> It sounds like most of the opinions so far are waiting for the scope
>>>>>> of work to finish before finalizing the specification.
>>>>>>
>>>>>> An alternative view: Would it make sense to start releasing the table
>>>>>> specification on a regular cadence (e.g. quarterly, every 6 months or
>>>>>> yearly)? I think the problem with waiting for features to get in is that
>>>>>> priorities change and things take longer than expected, thus leaving the
>>>>>> actual finalization of the specification in limbo and probably adds to
>>>>>> project management overhead. If the specification is released regularly
>>>>>> then it means features can always be included in the next release without
>>>>>> too much delay hopefully. The main downside I can think of in this
>>>>>> approach is having to have more branches in code to handle different
>>>>>> versions.
>>>>>>
>>>>>> One corollary to this approach is spec changes shouldn't be merged
>>>>>> before their implementations are ready.
>>>>>>
>>>>>> - At least one complete reference implementation should exist.
>>>>>>
>>>>>> For more complicated features I think at some point soon it might be
>>>>>> worth considering two implementations (or at least 1 full implementation
>>>>>> and 1 read only implementation) to make sure there aren't compatibility
>>>>>> issues/misunderstandings in the specification (e.g. I think Variant and
>>>>>> Geography fall into this category).
>>>>>>
>>>>>> Cheers,
>>>>>> Micah
>>>>>>
>>>>>> On Wed, Jul 31, 2024 at 12:47 PM Russell Spitzer
>>>>>> <russell.spit...@gmail.com> wrote:
>>>>>>
>>>>>>> I think this all sounds good, the real question is whether or not we
>>>>>>> have someone to actively work on the proposals. I think for things like
>>>>>>> Default Values and Geo Types we have folks actively working on them so
>>>>>>> it's not a big deal.
>>>>>>>
>>>>>>> On Wed, Jul 31, 2024 at 2:09 PM Szehon Ho <szehon.apa...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Sorry I missed the sync this morning (sick), I'd like to push for
>>>>>>>> geo too.
>>>>>>>>
>>>>>>>> I think on this front as per the last sync, Ryan recommended to
>>>>>>>> wait for Parquet support to land, to avoid having two versions on
>>>>>>>> Iceberg side (Iceberg-native vs Parquet-native). Parquet support is
>>>>>>>> being actively worked on iiuc:
>>>>>>>> https://github.com/apache/parquet-format/pull/240 . But it would bind
>>>>>>>> V3 to the parquet-format release timeline, unless we start with
>>>>>>>> iceberg-native support first and move later (as we originally
>>>>>>>> proposed).
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Szehon
>>>>>>>>
>>>>>>>> On Wed, Jul 31, 2024 at 10:58 AM Walaa Eldin Moustafa
>>>>>>>> <wa.moust...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Another feature that was planned for V3 is support for default
>>>>>>>>> values. Spec doc update was already merged a while ago [1].
>>>>>>>>> Implementation is ongoing in this PR [2].
>>>>>>>>>
>>>>>>>>> [1] https://iceberg.apache.org/spec/#default-values
>>>>>>>>> [2] https://github.com/apache/iceberg/pull/9502
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Walaa.
>>>>>>>>>
>>>>>>>>> On Wed, Jul 31, 2024 at 10:52 AM Russell Spitzer
>>>>>>>>> <russell.spit...@gmail.com> wrote:
>>>>>>>>> >
>>>>>>>>> > Thanks for bringing this up, I would say that from my
>>>>>>>>> > perspective I have time to really push through hopefully two things
>>>>>>>>> >
>>>>>>>>> > Variant Type and
>>>>>>>>> > Row Lineage (which I will have a proposal for on the mailing
>>>>>>>>> > list next week)
>>>>>>>>> >
>>>>>>>>> > I'm using the Project to try to track logistics and minutia
>>>>>>>>> > required for the new spec version but I would like to bring other
>>>>>>>>> > work in there as well so we can get a clear picture of what is
>>>>>>>>> > actually being actively worked on.
>>>>>>>>> >
>>>>>>>>> > On Wed, Jul 31, 2024 at 12:27 PM Jacob Marble
>>>>>>>>> > <jacobmar...@influxdata.com> wrote:
>>>>>>>>> >>
>>>>>>>>> >> Good morning,
>>>>>>>>> >>
>>>>>>>>> >> To continue the community sync today when format version 3 was
>>>>>>>>> >> discussed.
>>>>>>>>> >>
>>>>>>>>> >> Questions answered by consensus:
>>>>>>>>> >> - Format version releases should _not_ be tied to Iceberg
>>>>>>>>> >>   version releases.
>>>>>>>>> >> - Several planned features will require format version
>>>>>>>>> >>   releases; the process shouldn't be onerous.
>>>>>>>>> >>
>>>>>>>>> >> Unanswered questions:
>>>>>>>>> >> - What will be included in format version 3?
>>>>>>>>> >> - What is a reasonable target date?
>>>>>>>>> >> - How to track progress? Today, there are two public lists:
>>>>>>>>> >>   - GH milestone: https://github.com/apache/iceberg/milestone/42
>>>>>>>>> >>   - GH project: https://github.com/orgs/apache/projects/377
>>>>>>>>> >> - What is required of a feature in order to be included in any
>>>>>>>>> >>   adopted format version?
>>>>>>>>> >>   - At least one complete reference implementation should exist.
>>>>>>>>> >>   - Java is the reference implementation by convention;
>>>>>>>>> >>     that's OK, but not perfect. Should Java be the reference
>>>>>>>>> >>     implementation by mandate?
>>>>>>>>> >>
>>>>>>>>> >> Have I missed anything?
>>>>>>>>> >>
>>>>>>>>> >> --
>>>>>>>>> >> Jacob Marble
>>>>>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Databricks
>>>
>
> --
> Ryan Blue
> Databricks