Re: [DISCUSS] Multi-arg transforms

Jean-Baptiste Onofré Fri, 04 Apr 2025 03:03:13 -0700

Hi Fokko

OK, I was focusing only on source-ids for V3 Writer, and the
compatibility on the reader.
Your proposal works as well (Writer V3 writes source-id or source-ids
depending on the number of args). The Reader V3 also has to be updated
to read source-id or source-ids (Reader V2 doesn't change).
The only "drawback" is that we maintain two metadata fields for the
same thing (source-id and source-ids; I would have preferred only
source-ids at some point, as it can contain a single arg too :)), but
that's not a big deal.
So, with this, we don't need the format-version anymore. We have to:
1. on the writer, write source-id or source-ids depending on the number
of args (source-id for a single arg, source-ids for multiple args)
2. on the reader, "test" whether source-id or source-ids exists. I guess
if source-ids contains a single-arg column, that's OK :)
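
To make the two rules above concrete, here is a minimal sketch in Python.
The helper names (write_field, read_source_ids) are hypothetical and are
not the PyIceberg or Java API; the extra keys (name, transform, field-id)
are only there to make the example look like a partition field. Preferring
source-ids when both keys are present is my assumption, since the thread
does not cover that case.

```python
from typing import Any, Dict, List


def write_field(name: str, transform: str, source_ids: List[int], field_id: int) -> Dict[str, Any]:
    """Hypothetical writer: pick the key from the number of args, not the table version."""
    if not source_ids:
        raise ValueError("a transform needs at least one source column")
    field: Dict[str, Any] = {"name": name, "transform": transform, "field-id": field_id}
    if len(source_ids) == 1:
        # Single arg: the key that readers of every format version understand.
        field["source-id"] = source_ids[0]
    else:
        # Multiple args: only expressible with the list form.
        field["source-ids"] = list(source_ids)
    return field


def read_source_ids(field: Dict[str, Any]) -> List[int]:
    """Hypothetical reader: accept either key and always return a list."""
    if "source-ids" in field:
        # Assumption: source-ids wins if both keys are present.
        # A single-element list is accepted too.
        return list(field["source-ids"])
    if "source-id" in field:
        return [field["source-id"]]
    raise ValueError("field has neither source-id nor source-ids")
```

With this shape, neither function takes a format-version argument, which
is the simplification being discussed.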


So, it sounds good to me, I can update my PR accordingly.

Regards
JB

On Thu, Apr 3, 2025 at 6:28 PM Fokko Driesprong <[email protected]> wrote:
>
> Hey JB,
>
> Thanks for jumping in here.
>
> My point in the PR is that in the current version of the spec, we must write 
> source-ids for ≥V3 tables, and write source-id for ≤V2—this requires carrying 
> the format-version to the serializer. Instead, what I propose in the PR is to 
> write source-id in the case of a single argument (compatible with all
> versions at read time), and write source-ids only when there are multiple
> arguments. This way, we don't need to know about the table version when 
> serializing the partition-spec/sort-order.
>
> I've simplified the PR, as suggested by Szehon, to leave out the bucketing
> transform for now, which I think is a great idea.
>
> Kind regards,
> Fokko
>
> On Thu, Apr 3, 2025 at 4:17 PM Jean-Baptiste Onofré <[email protected]> wrote:
>>
>> Hi Fokko
>>
>> Sorry for the late reply :)
>>
>> 1. It sounds good to me.
>> 2. I started to work on the core to use only source-ids. The Writer
>> writes only source-ids, whereas the Reader detects whether source-id
>> exists and uses it (for backward compatibility). Using only
>> source-ids is clearly simpler and more consistent.
>>
>> Regards
>> JB
>>
>> On Tue, Mar 25, 2025 at 8:03 PM Fokko Driesprong <[email protected]> wrote:
>> >
>> > Hi everyone,
>> >
>> > I wanted to draw your attention to some small changes to the multi-arg
>> > transforms that I've bumped into while working on the V3 spec for
>> > PyIceberg.
>> >
>> > 1. Up for debate. The spec does not name a concrete transform that
>> > accepts multiple arguments. Of the existing transforms, the only
>> > contender is the bucket transform. Should we include this in the
>> > V3 spec? It will only allow you to prune metadata if you use an
>> > equality expression on all the fields that are part of the transform.
>> > 2. Along the way, we've removed something that we did not intend to.
>> > At first we allowed writing source-id or source-ids based on the number
>> > of arguments. This has been changed to only allow source-ids for V3 in
>> > a PR that introduces backward compatibility. I think this makes the JSON
>> > parsers/producers more complex than needed (specifically PyIceberg). Also,
>> > in Java, we would need to plumb the table version down to
>> > PartitionSpecParser.java. I think it would be great to simplify this.
>> >
>> > Please let me know what you think so we can tie up the loose ends for V3.
>> >
>> > Kind regards,
>> > Fokko
>> >
>> >
>> >
