Re: Spec change for multi-arg transform

Szehon Ho Tue, 30 Jan 2024 09:24:12 -0800

Thanks all for the discussion.

For the specific point about any new transform being able to be read in
current versions but only written in V3 (which I missed as well):


While this is a v3 feature and must be supported for v3 compatibility, the
> community usually also has guidelines for using features like this with
> older spec versions. For example, before releasing v2 we allowed snapshot
> ID inheritance in v1 by enabling a table property. That allowed people that
> could ensure their versions supported it or who were okay with errors to
> use the feature before v2 was released. I think we'd want to do the same
> thing here. The reference implementation can read but not write tables that
> have unknown partition transforms. We need to be clear about the details,
> but I think this is generally a good idea -- I'm curious what you think
> about it, Micah.


I feel this implies adding versions to partition transforms (ie, this
partition transform is in V1, V2, or V3).  And any new partition transform
(even single-arg transform, not just multi-arg transform) will be in
VNext.  Should we clarify that in the spec as a general point, even outside
multi-arg transforms?  Let me know if I mis-understood.

Ryan's right. I will send a new PR to clarify that this change is targeting
> V3 spec and how V1 and V2 should handle them. I am planning to send it
> after I finish the `bucketV2` spec change, WDYT, Ryan?


Advancedxy, I feel we can prioritize this pr in parallel for review, as it
seems important to get in soon as well?  Although, currently the PR was
just to say that partition transforms may be multi-arg, but doesnt add any
new ones, so I feel it makes sense that we can clarify it at the same time
we add bucketV2 to the spec.

Thanks
Szehon

On Mon, Jan 29, 2024 at 10:41 AM Micah Kornfield <[email protected]>
wrote:

> Thanks YE, Ryan and Szehon for your thoughts.
>
> As was already touched on, my primary concern is it seems like features
> were being added to V2 that were not forward compatible.  It seems there is
> consensus that these will be V3 and possibly backported to V2, this
> makes much more sense.  I've added some more detailed responses below on
> the process.  My answers are a little verbose, so as a summary:
>
> IMO Specification additions have a broad enough impact that I think they
> warrant a higher degree of formality.  This includes:
> 1.  Documenting what is required for new spec changes.
> 2.  Ideally each spec change would have a first class DISCUSS [4] and VOTE
> thread on the mailing list before it is committed (the exact process is
> likely something for the community to decide but I think this most closely
> reflects the intent of the "Apache Way" [2]).  I think this becomes more
> important as different github repos are spun out for different
> implementations, as there will likely be some community members who do not
> keep track for the main repo.
>
> For the change, there were discussions about this from:
>> 1. Github:  https://github.com/apache/iceberg/issues/8258 (author made
>> an end-to-end reference implementation there for a sample transform)
>> 2. Google Doc Dicussion:
>> https://docs.google.com/document/d/1aDoZqRgvDOOUVAGhvKZbp5vFstjsAMY4EFCyjlxpaaw/edit#heading=h.si1nr6ftu79b
>> 3. August 2023 meetup :
>> https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit#heading=h.5jsm89ozy58
>> (there was a general consensus there)
>
>
> Part of "The Apache Way" is "if it didn't happen on the mailing list it
> didn't happen" [2]. Different Apache projects have different standards in
> this regard, and with a lot of development becoming Github focused it seems
> some aspects of "The Apache Way" are interpreted differently.
>
> I think the exact process details are less important than having an
> agreed upon and documented process for new specification features (it
> sounds like there might be an agreed upon process but I don't think I've
> seen it documented, apologies if I missed it). There is a separate active
> thread "Process for creating new Proposals", that is probably more
> appropriate to further this part of the discussion, so I'll reply there as
> well.
>
> 1. Discuss the feature, 2. Build 2 reference implementations, and 3. hold
>> a vote. That's very similar to what we typically do. The only difference is
>> that for step 2, we typically just build one reference implementation in
>> the Java library. We do vote on the large spec updates, but in this case
>> you haven't seen one since we haven't built the reference implementation
>> yet.
>
>
> This is a good process.  I think there are potentially a couple areas of
> discussion (as noted above it might be better for the parallel "Process for
> creating new Proposals" thread but I'll enumerate them here and also post
> there).
>
> 1.  At some point discussing when or if PyIceberg (or any of the other
> implementations under the project umbrella) should be part of the reference
> implementation requirement.  There are pros and cons to requiring at least
> 2 implementations for spec changes and it is probably worth tabling at
> least for now (or discussing in a separate thread).
>
> 2.  I think it is important to distinguish between three types of spec
> changes:
>     a.  Clarifying existing intent of the specification.
>     b.  Adding new features to VNext of a Spec (i.e. for Iceberg V3).
>     c.  Finalizing the scope of VNext (e.g. V3 currently).
>
> IMO, I don't think (a) needs anything more than a mailing list
> discussion + PR approved by a committer (maybe PMC member?).
> Votes/discussion on (c) seem to be occurring on the mailing list.  For (b)
> discussion sometimes occurs on the mailing list, but votes do not appear to
> happen (I might have been searching the archives incorrectly).   In
> particular I think this introduces two problems:
> 1.  It makes spec changes less visible in general.
> 2.  By not voting on incremental changes to the specification it couples
> two aspects into finalizing the VNext.  First, "do all the new features
> make sense", and secondly "is now the time to do a version revision".  For
> the first aspect, it seems very costly to backout a feature once the spec
> has been changed, a reference implementation has been released and possibly
> allowed to be backported to prior table versions.  For the second aspect, I
> think similar to library releases, it is important to release new spec
> versions  on a regular cadence.
>
> While this is a v3 feature and must be supported for v3 compatibility, the
>> community usually also has guidelines for using features like this with
>> older spec versions. For example, before releasing v2 we allowed snapshot
>> ID inheritance in v1 by enabling a table property. That allowed people that
>> could ensure their versions supported it or who were okay with errors to
>> use the feature before v2 was released. I think we'd want to do the same
>> thing here. The reference implementation can read but not write tables that
>> have unknown partition transforms. We need to be clear about the details,
>> but I think this is generally a good idea -- I'm curious what you think
>> about it, Micah.
>
>
> I'm generally ok with the concept as described. As touched on above I
> think the other two important points are:
> 1. A general documented policy when features will or won't be expected to
> be optionally in scope for V1/V2 implementations (this can also be
> considered on a case by case basis, but I think it should be part of a
> proposal that is voted on).
> 2. It is clearly spelled out in the specification how this is intended to
> work with previous versions and a warning similar to nanosecond temporal
> types [1] about compatibility guarantees.
>
> Thanks,
> Micah
>
> [1]
> https://github.com/apache/iceberg/blob/main/format/spec.md#primitive-types
> [2] https://theapacheway.com/on-list/
> [3] https://github.com/apache/iceberg/blob/main/CONTRIBUTING.md
> [4] I think a discuss thread pointing to either a google doc, or github
> issue is sufficient here, as Google docs seem generally better for
> reviewing proposals (github issues I think suffer somewhat the same
> disadvantages from mailing lists, as threads can diverge and seeing the
> current state is harder).
>
>
>
>
>
> On Sun, Jan 28, 2024 at 7:38 PM YE <[email protected]> wrote:
>
>> Sorry for the previous email, it was sent by accident without complete
>> reply. Please discard the previous email, and see the whole reply in this
>> email.
>>
>> Thanks for Micah and Ryan's reply.
>>
>> As Szehon already pointed out, this change is to allow creation of *new* 
>> multi-arg
>> transforms. I remember there's a discussion in the google doc
>> whether targeting this as a `V3` spec change, it turns out that we may
>> support this as long as we make sure old writers cannot
>> write to a multi-arg transformed table. So we didn't explicitly state
>> it's a `V3` spec change.
>>
>> > I think the PR that was merged is missing clarity around the version of
>> the spec that requires these changes and how to handle them in v1 and v2
>> tables.
>>
>> Ryan's right. I will send a new PR to clarify that this change is
>> targeting V3 spec and how V1 and V2 should handle them. I am planning to
>> send it after I finish the `bucketV2` spec change, WDYT, Ryan?
>>
>> And BTW, it might introduce additional overhead to support it in the V1
>> spec, I am aiming to support this in V2 by enabling a specific table
>> property.
>>
>> YE <[email protected]> 于2024年1月29日周一 11:27写道：
>>
>>> Thanks for Micah and Ryan's reply.
>>>
>>> As Szehon already pointed out, this change is to allow creation of *new*
>>> multi-arg transforms. I remember there's a discussion in the google doc
>>> whether targeting this as a `V3` spec change, it turns out that we may
>>> support this as long as we make sure old writers cannot
>>> write to a multi-arg transformed table. So we didn't explicitly state
>>> it's a `V3
>>>
>>> > I think the PR that was merged is missing clarity around the version
>>> of the spec that requires these changes and how to handle them in v1 and v2
>>> tables.
>>>
>>>
>>>
>>>
>>>
>>> Ryan Blue <[email protected]> 于2024年1月29日周一 02:36写道：
>>>
>>>> Thanks for working on this, Szehon and AdvanceXY! I'm glad to see this
>>>> picking up for the v3 work.
>>>>
>>>> I also want to address Micah's comments and suggest how we can do
>>>> better next time. From Micah's suggestion, there are 3 steps: 1. Discuss
>>>> the feature, 2. Build 2 reference implementations, and 3. hold a vote.
>>>> That's very similar to what we typically do. The only difference is that
>>>> for step 2, we typically just build one reference implementation in the
>>>> Java library. We do vote on the large spec updates, but in this case you
>>>> haven't seen one since we haven't built the reference implementation yet.
>>>>
>>>> I think the confusion here comes from updating the spec markdown doc
>>>> prematurely. I think the PR that was merged is missing clarity around the
>>>> version of the spec that requires these changes and how to handle them in
>>>> v1 and v2 tables. It should be clear that this is a v3 feature and that v3
>>>> has not been formally adopted by a vote. We'll clean that up.
>>>>
>>>> While this is a v3 feature and must be supported for v3 compatibility,
>>>> the community usually also has guidelines for using features like this with
>>>> older spec versions. For example, before releasing v2 we allowed snapshot
>>>> ID inheritance in v1 by enabling a table property. That allowed people that
>>>> could ensure their versions supported it or who were okay with errors to
>>>> use the feature before v2 was released. I think we'd want to do the same
>>>> thing here. The reference implementation can read but not write tables that
>>>> have unknown partition transforms. We need to be clear about the details,
>>>> but I think this is generally a good idea -- I'm curious what you think
>>>> about it, Micah.
>>>>
>>>> Ryan
>>>>
>>>> On Sun, Jan 28, 2024 at 8:01 AM Szehon Ho <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> This would not be retrofitting existing partition transforms, but just
>>>>> allowing for the creation of new multi-arg transforms.  Is the concern 
>>>>> that
>>>>> some implementations are never expecting new transforms to be added?  Old
>>>>> implementations would indeed not be able to read Iceberg tables created
>>>>> with the new transforms (this is the case even today without allowing
>>>>> multi-arg transforms).
>>>>>
>>>>> For the change, there were discussions about this from:
>>>>> 1. Github:  https://github.com/apache/iceberg/issues/8258 (author
>>>>> made an end-to-end reference implementation there for a sample transform)
>>>>> 2. Google Doc Dicussion:
>>>>> https://docs.google.com/document/d/1aDoZqRgvDOOUVAGhvKZbp5vFstjsAMY4EFCyjlxpaaw/edit#heading=h.si1nr6ftu79b
>>>>> 3. August 2023 meetup :
>>>>> https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit#heading=h.5jsm89ozy58
>>>>> (there was a general consensus there)
>>>>>
>>>>> As there seemed to be enough consensus at the design discussion and
>>>>> meetup, we did not think to go for a vote.  Though I realize this will 
>>>>> have
>>>>> missed some in the community who did not attend.  I am still not entirely
>>>>> sure this is a large enough spec change for a vote (given my understanding
>>>>> of the impact), but we definitely should have sent an email to dev list to
>>>>> get more eyes on the above discussions to collect further concerns.
>>>>>
>>>>> Thanks,
>>>>> Szehon
>>>>>
>>>>>
>>>>> On Sat, Jan 27, 2024 at 10:06 AM Micah Kornfield <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> I think this is a good idea but I have concerns about compatibility.
>>>>>> IMO, I think changing the cardinality of input columns is a large enough
>>>>>> change that trying to retrofit it into V1 or V2 of the specification will
>>>>>> cause pain for implementations not relying on reference implementation.  
>>>>>> I
>>>>>>
>>>>>> As a secondary concern, I think it would be worthwhile for PMC to
>>>>>> formalize the process around specification changes as these have broader
>>>>>> implications for Iceberg adoption.  A model that I've seen work 
>>>>>> reasonably
>>>>>> well in other communities is the following:
>>>>>>
>>>>>> 1.  Discussion of overall features on the mailing list (this can also
>>>>>> be a pointer to the GitHub issue).
>>>>>> 2.  2 reference implementations demonstrating the change is viable
>>>>>> (it seems like PyIceberg is close to being fully functional enough that
>>>>>> this will be viable in the near term).
>>>>>> 3.  A formal vote adopting the change.
>>>>>>
>>>>>> But really any statement of policy around how specification changes
>>>>>> occur (and what changes will be considered for backporting to finalized
>>>>>> specifications) would be useful.
>>>>>>
>>>>>> Thanks,
>>>>>> Micah
>>>>>>
>>>>>> On Sat, Jan 27, 2024 at 2:55 AM 叶先进 <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> This is just a heads up. Szehon and I just make a spec change to
>>>>>>> include multi-arg transform:
>>>>>>> https://github.com/apache/iceberg/pull/8579 recently. I am sending
>>>>>>> this to get input from others who did not review the pr before Iceberg 
>>>>>>> 1.5
>>>>>>> release. Any concerns/suggestions are appreciated.
>>>>>>>
>>>>>>> After this change, we are working to get the API/Core and engine
>>>>>>> changes into the iceberg and more importantly the concrete multi-arg
>>>>>>> transforms, such as bucketV2 or zorder, etc.
>>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>

Re: Spec change for multi-arg transform

Reply via email to