Re: [DISCUSS] Multi-arg transforms

Ryan Blue Thu, 03 Apr 2025 09:52:51 -0700

Sorry I didn't see the discussion about adding a new bucket transform
earlier. I think it's great to start talking about a new bucket transform,
but we made sure that we could add new transforms without breaking
forward-compatibility so that we didn't need to rush getting one in. I
think that we're pretty confident in how to represent multi-arg transforms
(updating to use `source-ids`), so I don't think adding it now decreases
risk very much. I'd favor getting the other parts of v3 done and adopted
since we don't want that work to linger too long.
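
To make the `source-id` / `source-ids` representation concrete, here is a rough sketch of the version-independent rule proposed in the PR quoted below, together with a tolerant reader. It is only an illustration under stated assumptions: the function names `write_partition_field` and `read_source_ids` are hypothetical and are not the Java or PyIceberg implementation. Only the JSON keys (`name`, `transform`, `field-id`, `source-id`, `source-ids`) come from the spec.

```python
from typing import Any, Dict, List


def write_partition_field(
    name: str, transform: str, field_id: int, source_ids: List[int]
) -> Dict[str, Any]:
    """Serialize one partition field without knowing the table format version.

    Hypothetical helper: a single argument is written as `source-id`,
    multiple arguments are written as `source-ids`.
    """
    if not source_ids:
        raise ValueError("a transform needs at least one source column")
    field: Dict[str, Any] = {
        "name": name,
        "transform": transform,
        "field-id": field_id,
    }
    if len(source_ids) == 1:
        # Single argument: the form that pre-V3 readers already understand.
        field["source-id"] = source_ids[0]
    else:
        # Multiple arguments: only expressible with the list form.
        field["source-ids"] = list(source_ids)
    return field


def read_source_ids(field: Dict[str, Any]) -> List[int]:
    """Accept either key and normalize to a list of source column IDs."""
    if "source-id" in field:
        return [field["source-id"]]
    if "source-ids" in field:
        return list(field["source-ids"])
    raise ValueError("partition field has neither source-id nor source-ids")
```

With this rule the writer never needs the format version, and the reader normalizes both shapes to a list.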


On Thu, Apr 3, 2025 at 9:28 AM Fokko Driesprong <fo...@apache.org> wrote:

> Hey JB,
>
> Thanks for jumping in here.
>
> My point in the PR <https://github.com/apache/iceberg/pull/12644> is that
> under the current version of the spec, we must write source-ids for ≥V3
> tables and source-id for ≤V2 tables, which requires passing the
> format-version down to the serializer. Instead, what I propose in the PR is
> to write source-id when there is a single argument (compatible with all
> versions at read time), and to write source-ids only when there are multiple
> arguments. This way, we don't need to know the table version when
> serializing the partition-spec/sort-order.
>
> As suggested by Szehon, I've simplified the PR to leave out the bucketing
> transform for now, which I think is a great idea.
>
> Kind regards,
> Fokko
>
> On Thu, Apr 3, 2025 at 4:17 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>
>> Hi Fokko
>>
>> Sorry for the late reply :)
>>
>> 1. It sounds good to me.
>> 2. I started to work on the core to use only source-ids. The Writer
>> writes only source-ids, whereas the Reader detects whether source-id
>> exists and uses it (for backward compatibility). Using only source-ids
>> is clearly simpler and more consistent.
>>
>> Regards
>> JB
>>
>> On Tue, Mar 25, 2025 at 8:03 PM Fokko Driesprong <fo...@apache.org>
>> wrote:
>> >
>> > Hi everyone,
>> >
>> > I wanted to draw your attention to some small changes to the multi-arg
>> transforms that I've bumped into while working on the V3 spec for PyIceberg.
>> >
>> > 1. Up for debate: the spec does not specify an actual implementation of
>> a transform that accepts multiple arguments. Among the existing transforms,
>> the only contender is the bucket transform. Should we include this in the
>> V3 spec? It will only allow you to prune metadata if you use an equality
>> expression on all the fields that are part of the transform.
>> > 2. Along the way, we've removed something that we did not intend to
>> remove. Initially, we allowed writing source-id or source-ids depending on
>> the number of arguments. This has been changed to only allow source-ids for
>> V3 in a PR that introduces backward compatibility. I think this makes the
>> JSON parsers/producers more complex than needed (specifically PyIceberg).
>> Also, in Java, we would need to plumb the table version down to
>> PartitionSpecParser.java. I think it would be great to simplify this.
>> >
>> > Please let me know what you think so we can tie up the loose ends for
>> V3.
>> >
>> > Kind regards,
>> > Fokko
>> >
>> >
>> >
>>
>
