Sorry I may have misunderstood the statement and maybe this is specific to multi-arg transform, in any case let's get a spec pr earlier in to discuss/specify behavior for V1-2 vs 3.
Thanks Szehon On Tue, Jan 30, 2024 at 9:23 AM Szehon Ho <szehon.apa...@gmail.com> wrote: > Thanks all for the discussion. > > For the specific point about any new transform being able to be read in > current versions but only written in V3 (which I missed as well): > > While this is a v3 feature and must be supported for v3 compatibility, the >> community usually also has guidelines for using features like this with >> older spec versions. For example, before releasing v2 we allowed snapshot >> ID inheritance in v1 by enabling a table property. That allowed people that >> could ensure their versions supported it or who were okay with errors to >> use the feature before v2 was released. I think we'd want to do the same >> thing here. The reference implementation can read but not write tables that >> have unknown partition transforms. We need to be clear about the details, >> but I think this is generally a good idea -- I'm curious what you think >> about it, Micah. > > > I feel this implies adding versions to partition transforms (ie, this > partition transform is in V1, V2, or V3). And any new partition transform > (even single-arg transform, not just multi-arg transform) will be in > VNext. Should we clarify that in the spec as a general point, even outside > multi-arg transforms? Let me know if I mis-understood. > > Ryan's right. I will send a new PR to clarify that this change is >> targeting V3 spec and how V1 and V2 should handle them. I am planning to >> send it after I finish the `bucketV2` spec change, WDYT, Ryan? > > > Advancedxy, I feel we can prioritize this pr in parallel for review, as it > seems important to get in soon as well? Although, currently the PR was > just to say that partition transforms may be multi-arg, but doesnt add any > new ones, so I feel it makes sense that we can clarify it at the same time > we add bucketV2 to the spec. > > Thanks > Szehon > > On Mon, Jan 29, 2024 at 10:41 AM Micah Kornfield <emkornfi...@gmail.com> > wrote: > >> Thanks YE, Ryan and Szehon for your thoughts. >> >> As was already touched on, my primary concern is it seems like features >> were being added to V2 that were not forward compatible. It seems there is >> consensus that these will be V3 and possibly backported to V2, this >> makes much more sense. I've added some more detailed responses below on >> the process. My answers are a little verbose, so as a summary: >> >> IMO Specification additions have a broad enough impact that I think they >> warrant a higher degree of formality. This includes: >> 1. Documenting what is required for new spec changes. >> 2. Ideally each spec change would have a first class DISCUSS [4] and >> VOTE thread on the mailing list before it is committed (the exact process >> is likely something for the community to decide but I think this most >> closely reflects the intent of the "Apache Way" [2]). I think this becomes >> more important as different github repos are spun out for different >> implementations, as there will likely be some community members who do not >> keep track for the main repo. >> >> For the change, there were discussions about this from: >>> 1. Github: https://github.com/apache/iceberg/issues/8258 (author made >>> an end-to-end reference implementation there for a sample transform) >>> 2. Google Doc Dicussion: >>> https://docs.google.com/document/d/1aDoZqRgvDOOUVAGhvKZbp5vFstjsAMY4EFCyjlxpaaw/edit#heading=h.si1nr6ftu79b >>> 3. August 2023 meetup : >>> https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit#heading=h.5jsm89ozy58 >>> (there was a general consensus there) >> >> >> Part of "The Apache Way" is "if it didn't happen on the mailing list it >> didn't happen" [2]. Different Apache projects have different standards in >> this regard, and with a lot of development becoming Github focused it seems >> some aspects of "The Apache Way" are interpreted differently. >> >> I think the exact process details are less important than having an >> agreed upon and documented process for new specification features (it >> sounds like there might be an agreed upon process but I don't think I've >> seen it documented, apologies if I missed it). There is a separate active >> thread "Process for creating new Proposals", that is probably more >> appropriate to further this part of the discussion, so I'll reply there as >> well. >> >> 1. Discuss the feature, 2. Build 2 reference implementations, and 3. hold >>> a vote. That's very similar to what we typically do. The only difference is >>> that for step 2, we typically just build one reference implementation in >>> the Java library. We do vote on the large spec updates, but in this case >>> you haven't seen one since we haven't built the reference implementation >>> yet. >> >> >> This is a good process. I think there are potentially a couple areas of >> discussion (as noted above it might be better for the parallel "Process for >> creating new Proposals" thread but I'll enumerate them here and also post >> there). >> >> 1. At some point discussing when or if PyIceberg (or any of the other >> implementations under the project umbrella) should be part of the reference >> implementation requirement. There are pros and cons to requiring at least >> 2 implementations for spec changes and it is probably worth tabling at >> least for now (or discussing in a separate thread). >> >> 2. I think it is important to distinguish between three types of spec >> changes: >> a. Clarifying existing intent of the specification. >> b. Adding new features to VNext of a Spec (i.e. for Iceberg V3). >> c. Finalizing the scope of VNext (e.g. V3 currently). >> >> IMO, I don't think (a) needs anything more than a mailing list >> discussion + PR approved by a committer (maybe PMC member?). >> Votes/discussion on (c) seem to be occurring on the mailing list. For (b) >> discussion sometimes occurs on the mailing list, but votes do not appear to >> happen (I might have been searching the archives incorrectly). In >> particular I think this introduces two problems: >> 1. It makes spec changes less visible in general. >> 2. By not voting on incremental changes to the specification it couples >> two aspects into finalizing the VNext. First, "do all the new features >> make sense", and secondly "is now the time to do a version revision". For >> the first aspect, it seems very costly to backout a feature once the spec >> has been changed, a reference implementation has been released and possibly >> allowed to be backported to prior table versions. For the second aspect, I >> think similar to library releases, it is important to release new spec >> versions on a regular cadence. >> >> While this is a v3 feature and must be supported for v3 compatibility, >>> the community usually also has guidelines for using features like this with >>> older spec versions. For example, before releasing v2 we allowed snapshot >>> ID inheritance in v1 by enabling a table property. That allowed people that >>> could ensure their versions supported it or who were okay with errors to >>> use the feature before v2 was released. I think we'd want to do the same >>> thing here. The reference implementation can read but not write tables that >>> have unknown partition transforms. We need to be clear about the details, >>> but I think this is generally a good idea -- I'm curious what you think >>> about it, Micah. >> >> >> I'm generally ok with the concept as described. As touched on above I >> think the other two important points are: >> 1. A general documented policy when features will or won't be expected to >> be optionally in scope for V1/V2 implementations (this can also be >> considered on a case by case basis, but I think it should be part of a >> proposal that is voted on). >> 2. It is clearly spelled out in the specification how this is intended to >> work with previous versions and a warning similar to nanosecond temporal >> types [1] about compatibility guarantees. >> >> Thanks, >> Micah >> >> [1] >> https://github.com/apache/iceberg/blob/main/format/spec.md#primitive-types >> [2] https://theapacheway.com/on-list/ >> [3] https://github.com/apache/iceberg/blob/main/CONTRIBUTING.md >> [4] I think a discuss thread pointing to either a google doc, or github >> issue is sufficient here, as Google docs seem generally better for >> reviewing proposals (github issues I think suffer somewhat the same >> disadvantages from mailing lists, as threads can diverge and seeing the >> current state is harder). >> >> >> >> >> >> On Sun, Jan 28, 2024 at 7:38 PM YE <advance...@gmail.com> wrote: >> >>> Sorry for the previous email, it was sent by accident without complete >>> reply. Please discard the previous email, and see the whole reply in this >>> email. >>> >>> Thanks for Micah and Ryan's reply. >>> >>> As Szehon already pointed out, this change is to allow creation of *new* >>> multi-arg >>> transforms. I remember there's a discussion in the google doc >>> whether targeting this as a `V3` spec change, it turns out that we may >>> support this as long as we make sure old writers cannot >>> write to a multi-arg transformed table. So we didn't explicitly state >>> it's a `V3` spec change. >>> >>> > I think the PR that was merged is missing clarity around the version >>> of the spec that requires these changes and how to handle them in v1 and v2 >>> tables. >>> >>> Ryan's right. I will send a new PR to clarify that this change is >>> targeting V3 spec and how V1 and V2 should handle them. I am planning to >>> send it after I finish the `bucketV2` spec change, WDYT, Ryan? >>> >>> And BTW, it might introduce additional overhead to support it in the V1 >>> spec, I am aiming to support this in V2 by enabling a specific table >>> property. >>> >>> YE <advance...@gmail.com> 于2024年1月29日周一 11:27写道: >>> >>>> Thanks for Micah and Ryan's reply. >>>> >>>> As Szehon already pointed out, this change is to allow creation of >>>> *new* multi-arg transforms. I remember there's a discussion in the >>>> google doc whether targeting this as a `V3` spec change, it turns out that >>>> we may support this as long as we make sure old writers cannot >>>> write to a multi-arg transformed table. So we didn't explicitly state >>>> it's a `V3 >>>> >>>> > I think the PR that was merged is missing clarity around the version >>>> of the spec that requires these changes and how to handle them in v1 and v2 >>>> tables. >>>> >>>> >>>> >>>> >>>> >>>> Ryan Blue <b...@tabular.io> 于2024年1月29日周一 02:36写道: >>>> >>>>> Thanks for working on this, Szehon and AdvanceXY! I'm glad to see this >>>>> picking up for the v3 work. >>>>> >>>>> I also want to address Micah's comments and suggest how we can do >>>>> better next time. From Micah's suggestion, there are 3 steps: 1. Discuss >>>>> the feature, 2. Build 2 reference implementations, and 3. hold a vote. >>>>> That's very similar to what we typically do. The only difference is that >>>>> for step 2, we typically just build one reference implementation in the >>>>> Java library. We do vote on the large spec updates, but in this case you >>>>> haven't seen one since we haven't built the reference implementation yet. >>>>> >>>>> I think the confusion here comes from updating the spec markdown doc >>>>> prematurely. I think the PR that was merged is missing clarity around the >>>>> version of the spec that requires these changes and how to handle them in >>>>> v1 and v2 tables. It should be clear that this is a v3 feature and that v3 >>>>> has not been formally adopted by a vote. We'll clean that up. >>>>> >>>>> While this is a v3 feature and must be supported for v3 compatibility, >>>>> the community usually also has guidelines for using features like this >>>>> with >>>>> older spec versions. For example, before releasing v2 we allowed snapshot >>>>> ID inheritance in v1 by enabling a table property. That allowed people >>>>> that >>>>> could ensure their versions supported it or who were okay with errors to >>>>> use the feature before v2 was released. I think we'd want to do the same >>>>> thing here. The reference implementation can read but not write tables >>>>> that >>>>> have unknown partition transforms. We need to be clear about the details, >>>>> but I think this is generally a good idea -- I'm curious what you think >>>>> about it, Micah. >>>>> >>>>> Ryan >>>>> >>>>> On Sun, Jan 28, 2024 at 8:01 AM Szehon Ho <szehon.apa...@gmail.com> >>>>> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> This would not be retrofitting existing partition transforms, but >>>>>> just allowing for the creation of new multi-arg transforms. Is the >>>>>> concern >>>>>> that some implementations are never expecting new transforms to be added? >>>>>> Old implementations would indeed not be able to read Iceberg tables >>>>>> created >>>>>> with the new transforms (this is the case even today without allowing >>>>>> multi-arg transforms). >>>>>> >>>>>> For the change, there were discussions about this from: >>>>>> 1. Github: https://github.com/apache/iceberg/issues/8258 (author >>>>>> made an end-to-end reference implementation there for a sample transform) >>>>>> 2. Google Doc Dicussion: >>>>>> https://docs.google.com/document/d/1aDoZqRgvDOOUVAGhvKZbp5vFstjsAMY4EFCyjlxpaaw/edit#heading=h.si1nr6ftu79b >>>>>> 3. August 2023 meetup : >>>>>> https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit#heading=h.5jsm89ozy58 >>>>>> (there was a general consensus there) >>>>>> >>>>>> As there seemed to be enough consensus at the design discussion and >>>>>> meetup, we did not think to go for a vote. Though I realize this will >>>>>> have >>>>>> missed some in the community who did not attend. I am still not entirely >>>>>> sure this is a large enough spec change for a vote (given my >>>>>> understanding >>>>>> of the impact), but we definitely should have sent an email to dev list >>>>>> to >>>>>> get more eyes on the above discussions to collect further concerns. >>>>>> >>>>>> Thanks, >>>>>> Szehon >>>>>> >>>>>> >>>>>> On Sat, Jan 27, 2024 at 10:06 AM Micah Kornfield < >>>>>> emkornfi...@gmail.com> wrote: >>>>>> >>>>>>> I think this is a good idea but I have concerns about compatibility. >>>>>>> IMO, I think changing the cardinality of input columns is a large enough >>>>>>> change that trying to retrofit it into V1 or V2 of the specification >>>>>>> will >>>>>>> cause pain for implementations not relying on reference implementation. >>>>>>> I >>>>>>> >>>>>>> As a secondary concern, I think it would be worthwhile for PMC to >>>>>>> formalize the process around specification changes as these have broader >>>>>>> implications for Iceberg adoption. A model that I've seen work >>>>>>> reasonably >>>>>>> well in other communities is the following: >>>>>>> >>>>>>> 1. Discussion of overall features on the mailing list (this can >>>>>>> also be a pointer to the GitHub issue). >>>>>>> 2. 2 reference implementations demonstrating the change is viable >>>>>>> (it seems like PyIceberg is close to being fully functional enough that >>>>>>> this will be viable in the near term). >>>>>>> 3. A formal vote adopting the change. >>>>>>> >>>>>>> But really any statement of policy around how specification changes >>>>>>> occur (and what changes will be considered for backporting to finalized >>>>>>> specifications) would be useful. >>>>>>> >>>>>>> Thanks, >>>>>>> Micah >>>>>>> >>>>>>> On Sat, Jan 27, 2024 at 2:55 AM 叶先进 <advance...@gmail.com> wrote: >>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> This is just a heads up. Szehon and I just make a spec change to >>>>>>>> include multi-arg transform: >>>>>>>> https://github.com/apache/iceberg/pull/8579 recently. I am sending >>>>>>>> this to get input from others who did not review the pr before Iceberg >>>>>>>> 1.5 >>>>>>>> release. Any concerns/suggestions are appreciated. >>>>>>>> >>>>>>>> After this change, we are working to get the API/Core and engine >>>>>>>> changes into the iceberg and more importantly the concrete multi-arg >>>>>>>> transforms, such as bucketV2 or zorder, etc. >>>>>>>> >>>>>>> >>>>> >>>>> -- >>>>> Ryan Blue >>>>> Tabular >>>>> >>>>