Re: [DISCUSS] Multi-arg transforms

Fokko Driesprong Thu, 03 Apr 2025 09:28:34 -0700

Hey JB,

Thanks for jumping in here.

My point in the PR <https://github.com/apache/iceberg/pull/12644> is that
the current version of the spec requires us to write source-ids for ≥V3
tables and source-id for ≤V2 tables, which means carrying the
format-version down to the serializer. What I propose in the PR instead is
to write source-id when there is a single argument (readable by all
versions), and to write source-ids only when there are multiple
arguments. This way, we don't need to know the table version when
serializing the partition-spec/sort-order.
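
To make the rule concrete, here is a minimal Python sketch. The helper
names field_to_dict and source_ids_from_dict are hypothetical and are not
PyIceberg's actual API; only the JSON keys (source-id, source-ids,
field-id, name, transform) come from the spec discussion. The writer picks
the key from the number of arguments alone, and the reader accepts either
key, so neither side needs the format-version.

```python
from typing import Any, Dict, List


def field_to_dict(name: str, transform: str, source_ids: List[int], field_id: int) -> Dict[str, Any]:
    """Hypothetical writer: choose the key from the argument count only."""
    if not source_ids:
        raise ValueError("a partition field needs at least one source column")
    out: Dict[str, Any] = {"name": name, "transform": transform, "field-id": field_id}
    if len(source_ids) == 1:
        # Single argument: the key that every table version already reads.
        out["source-id"] = source_ids[0]
    else:
        # Multiple arguments: only here is the list form required.
        out["source-ids"] = list(source_ids)
    return out


def source_ids_from_dict(field: Dict[str, Any]) -> List[int]:
    """Hypothetical reader: accept either key and normalize to a list."""
    if "source-ids" in field:
        return list(field["source-ids"])
    if "source-id" in field:
        return [field["source-id"]]
    raise ValueError("partition field has neither source-id nor source-ids")


single = field_to_dict("id_bucket", "bucket[16]", [1], 1000)
multi = field_to_dict("id_ts_bucket", "bucket[16]", [1, 2], 1001)
print(single)
print(multi)
```

Note that neither function takes a table version as input, which is the
simplification the PR is after.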

I've simplified the PR as suggested by Szehon, leaving out the bucket
transform for now, which I think is a great idea.

Kind regards,
Fokko

On Thu, Apr 3, 2025 at 16:17, Jean-Baptiste Onofré <[email protected]> wrote:

> Hi Fokko
>
> Sorry for the late reply :)
>
> 1. It sounds good to me.
> 2. I started to work on the core to use only source-ids. The Writer
> writes only source-ids, whereas the Reader detects whether source-id
> exists and uses it (for backward compatibility). Using only source-ids
> is clearly simpler and more consistent.
>
> Regards
> JB
>
> On Tue, Mar 25, 2025 at 8:03 PM Fokko Driesprong <[email protected]> wrote:
> >
> > Hi everyone,
> >
> > I wanted to draw your attention to some small changes to the multi-arg
> > transforms that I've bumped into while working on the V3 spec for PyIceberg.
> >
> > 1. Up for debate: the spec does not point to an actual implementation of
> > transforms that accept multiple arguments. Of the existing transforms,
> > the only candidate is the bucket transform. Should we include this in the
> > V3 spec? It would only allow you to prune metadata if the query has an
> > equality predicate on all the fields that are part of the transform.
> > 2. Along the way, we've removed something that we did not intend to
> > remove. At first, we allowed writing source-id or source-ids depending
> > on the number of arguments. This has been changed to only allow
> > source-ids for V3 in a PR that introduces backward compatibility. I
> > think this makes the JSON parsers/producers more complex than needed
> > (specifically in PyIceberg). Also, in Java, we would need to plumb the
> > table version down to PartitionSpecParser.java. I think it would be
> > great to simplify this.
> >
> > Please let me know what you think so we can tie up the loose ends for V3.
> >
> > Kind regards,
> > Fokko
> >
> >
> >
>
