Re: Spec change for multi-arg transform

Szehon Ho Tue, 30 Jan 2024 09:36:01 -0800

Sorry I may have misunderstood the statement and maybe this is specific to
multi-arg transform, in any case let's get a spec pr earlier in to
discuss/specify behavior for V1-2 vs 3.


Thanks
Szehon

On Tue, Jan 30, 2024 at 9:23 AM Szehon Ho <szehon.apa...@gmail.com> wrote:

> Thanks all for the discussion.
>
> For the specific point about any new transform being able to be read in
> current versions but only written in V3 (which I missed as well):
>
> While this is a v3 feature and must be supported for v3 compatibility, the
>> community usually also has guidelines for using features like this with
>> older spec versions. For example, before releasing v2 we allowed snapshot
>> ID inheritance in v1 by enabling a table property. That allowed people that
>> could ensure their versions supported it or who were okay with errors to
>> use the feature before v2 was released. I think we'd want to do the same
>> thing here. The reference implementation can read but not write tables that
>> have unknown partition transforms. We need to be clear about the details,
>> but I think this is generally a good idea -- I'm curious what you think
>> about it, Micah.
>
>
> I feel this implies adding versions to partition transforms (ie, this
> partition transform is in V1, V2, or V3).  And any new partition transform
> (even single-arg transform, not just multi-arg transform) will be in
> VNext.  Should we clarify that in the spec as a general point, even outside
> multi-arg transforms?  Let me know if I mis-understood.
>
> Ryan's right. I will send a new PR to clarify that this change is
>> targeting V3 spec and how V1 and V2 should handle them. I am planning to
>> send it after I finish the `bucketV2` spec change, WDYT, Ryan?
>
>
> Advancedxy, I feel we can prioritize this pr in parallel for review, as it
> seems important to get in soon as well?  Although, currently the PR was
> just to say that partition transforms may be multi-arg, but doesnt add any
> new ones, so I feel it makes sense that we can clarify it at the same time
> we add bucketV2 to the spec.
>
> Thanks
> Szehon
>
> On Mon, Jan 29, 2024 at 10:41 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Thanks YE, Ryan and Szehon for your thoughts.
>>
>> As was already touched on, my primary concern is it seems like features
>> were being added to V2 that were not forward compatible.  It seems there is
>> consensus that these will be V3 and possibly backported to V2, this
>> makes much more sense.  I've added some more detailed responses below on
>> the process.  My answers are a little verbose, so as a summary:
>>
>> IMO Specification additions have a broad enough impact that I think they
>> warrant a higher degree of formality.  This includes:
>> 1.  Documenting what is required for new spec changes.
>> 2.  Ideally each spec change would have a first class DISCUSS [4] and
>> VOTE thread on the mailing list before it is committed (the exact process
>> is likely something for the community to decide but I think this most
>> closely reflects the intent of the "Apache Way" [2]).  I think this becomes
>> more important as different github repos are spun out for different
>> implementations, as there will likely be some community members who do not
>> keep track for the main repo.
>>
>> For the change, there were discussions about this from:
>>> 1. Github:  https://github.com/apache/iceberg/issues/8258 (author made
>>> an end-to-end reference implementation there for a sample transform)
>>> 2. Google Doc Dicussion:
>>> https://docs.google.com/document/d/1aDoZqRgvDOOUVAGhvKZbp5vFstjsAMY4EFCyjlxpaaw/edit#heading=h.si1nr6ftu79b
>>> 3. August 2023 meetup :
>>> https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit#heading=h.5jsm89ozy58
>>> (there was a general consensus there)
>>
>>
>> Part of "The Apache Way" is "if it didn't happen on the mailing list it
>> didn't happen" [2]. Different Apache projects have different standards in
>> this regard, and with a lot of development becoming Github focused it seems
>> some aspects of "The Apache Way" are interpreted differently.
>>
>> I think the exact process details are less important than having an
>> agreed upon and documented process for new specification features (it
>> sounds like there might be an agreed upon process but I don't think I've
>> seen it documented, apologies if I missed it). There is a separate active
>> thread "Process for creating new Proposals", that is probably more
>> appropriate to further this part of the discussion, so I'll reply there as
>> well.
>>
>> 1. Discuss the feature, 2. Build 2 reference implementations, and 3. hold
>>> a vote. That's very similar to what we typically do. The only difference is
>>> that for step 2, we typically just build one reference implementation in
>>> the Java library. We do vote on the large spec updates, but in this case
>>> you haven't seen one since we haven't built the reference implementation
>>> yet.
>>
>>
>> This is a good process.  I think there are potentially a couple areas of
>> discussion (as noted above it might be better for the parallel "Process for
>> creating new Proposals" thread but I'll enumerate them here and also post
>> there).
>>
>> 1.  At some point discussing when or if PyIceberg (or any of the other
>> implementations under the project umbrella) should be part of the reference
>> implementation requirement.  There are pros and cons to requiring at least
>> 2 implementations for spec changes and it is probably worth tabling at
>> least for now (or discussing in a separate thread).
>>
>> 2.  I think it is important to distinguish between three types of spec
>> changes:
>>     a.  Clarifying existing intent of the specification.
>>     b.  Adding new features to VNext of a Spec (i.e. for Iceberg V3).
>>     c.  Finalizing the scope of VNext (e.g. V3 currently).
>>
>> IMO, I don't think (a) needs anything more than a mailing list
>> discussion + PR approved by a committer (maybe PMC member?).
>> Votes/discussion on (c) seem to be occurring on the mailing list.  For (b)
>> discussion sometimes occurs on the mailing list, but votes do not appear to
>> happen (I might have been searching the archives incorrectly).   In
>> particular I think this introduces two problems:
>> 1.  It makes spec changes less visible in general.
>> 2.  By not voting on incremental changes to the specification it couples
>> two aspects into finalizing the VNext.  First, "do all the new features
>> make sense", and secondly "is now the time to do a version revision".  For
>> the first aspect, it seems very costly to backout a feature once the spec
>> has been changed, a reference implementation has been released and possibly
>> allowed to be backported to prior table versions.  For the second aspect, I
>> think similar to library releases, it is important to release new spec
>> versions  on a regular cadence.
>>
>> While this is a v3 feature and must be supported for v3 compatibility,
>>> the community usually also has guidelines for using features like this with
>>> older spec versions. For example, before releasing v2 we allowed snapshot
>>> ID inheritance in v1 by enabling a table property. That allowed people that
>>> could ensure their versions supported it or who were okay with errors to
>>> use the feature before v2 was released. I think we'd want to do the same
>>> thing here. The reference implementation can read but not write tables that
>>> have unknown partition transforms. We need to be clear about the details,
>>> but I think this is generally a good idea -- I'm curious what you think
>>> about it, Micah.
>>
>>
>> I'm generally ok with the concept as described. As touched on above I
>> think the other two important points are:
>> 1. A general documented policy when features will or won't be expected to
>> be optionally in scope for V1/V2 implementations (this can also be
>> considered on a case by case basis, but I think it should be part of a
>> proposal that is voted on).
>> 2. It is clearly spelled out in the specification how this is intended to
>> work with previous versions and a warning similar to nanosecond temporal
>> types [1] about compatibility guarantees.
>>
>> Thanks,
>> Micah
>>
>> [1]
>> https://github.com/apache/iceberg/blob/main/format/spec.md#primitive-types
>> [2] https://theapacheway.com/on-list/
>> [3] https://github.com/apache/iceberg/blob/main/CONTRIBUTING.md
>> [4] I think a discuss thread pointing to either a google doc, or github
>> issue is sufficient here, as Google docs seem generally better for
>> reviewing proposals (github issues I think suffer somewhat the same
>> disadvantages from mailing lists, as threads can diverge and seeing the
>> current state is harder).
>>
>>
>>
>>
>>
>> On Sun, Jan 28, 2024 at 7:38 PM YE <advance...@gmail.com> wrote:
>>
>>> Sorry for the previous email, it was sent by accident without complete
>>> reply. Please discard the previous email, and see the whole reply in this
>>> email.
>>>
>>> Thanks for Micah and Ryan's reply.
>>>
>>> As Szehon already pointed out, this change is to allow creation of *new* 
>>> multi-arg
>>> transforms. I remember there's a discussion in the google doc
>>> whether targeting this as a `V3` spec change, it turns out that we may
>>> support this as long as we make sure old writers cannot
>>> write to a multi-arg transformed table. So we didn't explicitly state
>>> it's a `V3` spec change.
>>>
>>> > I think the PR that was merged is missing clarity around the version
>>> of the spec that requires these changes and how to handle them in v1 and v2
>>> tables.
>>>
>>> Ryan's right. I will send a new PR to clarify that this change is
>>> targeting V3 spec and how V1 and V2 should handle them. I am planning to
>>> send it after I finish the `bucketV2` spec change, WDYT, Ryan?
>>>
>>> And BTW, it might introduce additional overhead to support it in the V1
>>> spec, I am aiming to support this in V2 by enabling a specific table
>>> property.
>>>
>>> YE <advance...@gmail.com> 于2024年1月29日周一 11:27写道：
>>>
>>>> Thanks for Micah and Ryan's reply.
>>>>
>>>> As Szehon already pointed out, this change is to allow creation of
>>>> *new* multi-arg transforms. I remember there's a discussion in the
>>>> google doc whether targeting this as a `V3` spec change, it turns out that
>>>> we may support this as long as we make sure old writers cannot
>>>> write to a multi-arg transformed table. So we didn't explicitly state
>>>> it's a `V3
>>>>
>>>> > I think the PR that was merged is missing clarity around the version
>>>> of the spec that requires these changes and how to handle them in v1 and v2
>>>> tables.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Ryan Blue <b...@tabular.io> 于2024年1月29日周一 02:36写道：
>>>>
>>>>> Thanks for working on this, Szehon and AdvanceXY! I'm glad to see this
>>>>> picking up for the v3 work.
>>>>>
>>>>> I also want to address Micah's comments and suggest how we can do
>>>>> better next time. From Micah's suggestion, there are 3 steps: 1. Discuss
>>>>> the feature, 2. Build 2 reference implementations, and 3. hold a vote.
>>>>> That's very similar to what we typically do. The only difference is that
>>>>> for step 2, we typically just build one reference implementation in the
>>>>> Java library. We do vote on the large spec updates, but in this case you
>>>>> haven't seen one since we haven't built the reference implementation yet.
>>>>>
>>>>> I think the confusion here comes from updating the spec markdown doc
>>>>> prematurely. I think the PR that was merged is missing clarity around the
>>>>> version of the spec that requires these changes and how to handle them in
>>>>> v1 and v2 tables. It should be clear that this is a v3 feature and that v3
>>>>> has not been formally adopted by a vote. We'll clean that up.
>>>>>
>>>>> While this is a v3 feature and must be supported for v3 compatibility,
>>>>> the community usually also has guidelines for using features like this 
>>>>> with
>>>>> older spec versions. For example, before releasing v2 we allowed snapshot
>>>>> ID inheritance in v1 by enabling a table property. That allowed people 
>>>>> that
>>>>> could ensure their versions supported it or who were okay with errors to
>>>>> use the feature before v2 was released. I think we'd want to do the same
>>>>> thing here. The reference implementation can read but not write tables 
>>>>> that
>>>>> have unknown partition transforms. We need to be clear about the details,
>>>>> but I think this is generally a good idea -- I'm curious what you think
>>>>> about it, Micah.
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Sun, Jan 28, 2024 at 8:01 AM Szehon Ho <szehon.apa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> This would not be retrofitting existing partition transforms, but
>>>>>> just allowing for the creation of new multi-arg transforms.  Is the 
>>>>>> concern
>>>>>> that some implementations are never expecting new transforms to be added?
>>>>>> Old implementations would indeed not be able to read Iceberg tables 
>>>>>> created
>>>>>> with the new transforms (this is the case even today without allowing
>>>>>> multi-arg transforms).
>>>>>>
>>>>>> For the change, there were discussions about this from:
>>>>>> 1. Github:  https://github.com/apache/iceberg/issues/8258 (author
>>>>>> made an end-to-end reference implementation there for a sample transform)
>>>>>> 2. Google Doc Dicussion:
>>>>>> https://docs.google.com/document/d/1aDoZqRgvDOOUVAGhvKZbp5vFstjsAMY4EFCyjlxpaaw/edit#heading=h.si1nr6ftu79b
>>>>>> 3. August 2023 meetup :
>>>>>> https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit#heading=h.5jsm89ozy58
>>>>>> (there was a general consensus there)
>>>>>>
>>>>>> As there seemed to be enough consensus at the design discussion and
>>>>>> meetup, we did not think to go for a vote.  Though I realize this will 
>>>>>> have
>>>>>> missed some in the community who did not attend.  I am still not entirely
>>>>>> sure this is a large enough spec change for a vote (given my 
>>>>>> understanding
>>>>>> of the impact), but we definitely should have sent an email to dev list 
>>>>>> to
>>>>>> get more eyes on the above discussions to collect further concerns.
>>>>>>
>>>>>> Thanks,
>>>>>> Szehon
>>>>>>
>>>>>>
>>>>>> On Sat, Jan 27, 2024 at 10:06 AM Micah Kornfield <
>>>>>> emkornfi...@gmail.com> wrote:
>>>>>>
>>>>>>> I think this is a good idea but I have concerns about compatibility.
>>>>>>> IMO, I think changing the cardinality of input columns is a large enough
>>>>>>> change that trying to retrofit it into V1 or V2 of the specification 
>>>>>>> will
>>>>>>> cause pain for implementations not relying on reference implementation. 
>>>>>>>  I
>>>>>>>
>>>>>>> As a secondary concern, I think it would be worthwhile for PMC to
>>>>>>> formalize the process around specification changes as these have broader
>>>>>>> implications for Iceberg adoption.  A model that I've seen work 
>>>>>>> reasonably
>>>>>>> well in other communities is the following:
>>>>>>>
>>>>>>> 1.  Discussion of overall features on the mailing list (this can
>>>>>>> also be a pointer to the GitHub issue).
>>>>>>> 2.  2 reference implementations demonstrating the change is viable
>>>>>>> (it seems like PyIceberg is close to being fully functional enough that
>>>>>>> this will be viable in the near term).
>>>>>>> 3.  A formal vote adopting the change.
>>>>>>>
>>>>>>> But really any statement of policy around how specification changes
>>>>>>> occur (and what changes will be considered for backporting to finalized
>>>>>>> specifications) would be useful.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Micah
>>>>>>>
>>>>>>> On Sat, Jan 27, 2024 at 2:55 AM 叶先进 <advance...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> This is just a heads up. Szehon and I just make a spec change to
>>>>>>>> include multi-arg transform:
>>>>>>>> https://github.com/apache/iceberg/pull/8579 recently. I am sending
>>>>>>>> this to get input from others who did not review the pr before Iceberg 
>>>>>>>> 1.5
>>>>>>>> release. Any concerns/suggestions are appreciated.
>>>>>>>>
>>>>>>>> After this change, we are working to get the API/Core and engine
>>>>>>>> changes into the iceberg and more importantly the concrete multi-arg
>>>>>>>> transforms, such as bucketV2 or zorder, etc.
>>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>

Re: Spec change for multi-arg transform

Reply via email to