Re: [DISCUSS] Multi-arg transforms

Fokko Driesprong Thu, 03 Apr 2025 23:52:05 -0700

Good morning,

> I'd favor getting the other parts of v3 done and adopted since we don't
> want that work to linger too long.



That was exactly my goal of bringing this back up :)

Thanks for working on this JB, looking forward to the PR!

Kind regards,
Fokko

On Fri, Apr 4, 2025 at 07:17, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Hi Fokko
>
> OK, I was focusing only on source-ids for V3 Writer, and the
> compatibility on the reader.
> Your proposal works as well (Writer V3 writes source-id or source-ids
> depending on the number of args); the Reader V3 also has to be updated
> to read source-id or source-ids (Reader V2 doesn't change).
> The only "drawback" is that we maintain two metadata fields for the same
> thing (source-id and source-ids; I would have preferred only source-ids
> at some point, as it can contain a single arg too :)), but that's not a
> big deal.
> So, with this, we don't need the format-version anymore. We have to:
> 1. on the writer, write source-id or source-ids depending on the number
> of args (source-id for a single arg, source-ids for multiple args)
> 2. on the reader, "test" whether source-id or source-ids exists. I guess
> if source-ids contains a single arg column, that's OK :)
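
The writer and reader rules above can be sketched as follows. This is only an illustrative sketch, not the code from the PR: the function names `write_source` and `read_source` are hypothetical, and preferring source-ids when both keys are present is an assumption the thread does not settle.

```python
from typing import Any, Dict, List


def write_source(source_ids: List[int]) -> Dict[str, Any]:
    """Serialize the source column reference of a partition or sort field.

    Hypothetical helper: no format-version argument is needed.
    """
    if not source_ids:
        raise ValueError("a transform needs at least one source column")
    if len(source_ids) == 1:
        # Single argument: the legacy key, readable by every format version.
        return {"source-id": source_ids[0]}
    # Multiple arguments: the list-valued key.
    return {"source-ids": list(source_ids)}


def read_source(field: Dict[str, Any]) -> List[int]:
    """Return the source column IDs, whichever key is present."""
    if "source-ids" in field:
        # Assumption: source-ids wins if both keys exist.
        # A single-element list is accepted as well.
        return list(field["source-ids"])
    if "source-id" in field:
        return [field["source-id"]]
    raise ValueError("field has neither source-id nor source-ids")
```

With this shape, `read_source(write_source(ids))` returns the original list for both single-argument and multi-argument transforms, and neither function needs to know the table version.
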
>
> So, it sounds good to me, I can update my PR accordingly.
>
> Regards
> JB
>
> On Thu, Apr 3, 2025 at 6:28 PM Fokko Driesprong <fo...@apache.org> wrote:
> >
> > Hey JB,
> >
> > Thanks for jumping in here.
> >
> > My point in the PR is that in the current version of the spec, we must
> > write source-ids for ≥V3 tables, and write source-id for ≤V2; this
> > requires carrying the format-version to the serializer. Instead, what I
> > propose in the PR is to write source-id in the case of a single argument
> > (compatible with all versions at read time), and write source-ids only
> > when there are multiple arguments. This way, we don't need to know about
> > the table version when serializing the partition-spec/sort-order.
> >
> > I've simplified the PR as suggested by Szehon to leave the bucketing
> > transform aside for now, which I think is a great idea.
> >
> > Kind regards,
> > Fokko
> >
> > On Thu, Apr 3, 2025 at 16:17, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >>
> >> Hi Fokko
> >>
> >> Sorry for the late reply :)
> >>
> >> 1. It sounds good to me.
> >> 2. I started to work on the core to use only source-ids. The Writer
> >> writes only source-ids, whereas the Reader detects whether source-id
> >> exists and uses it (for backward compatibility). Using only source-ids
> >> is clearly simpler and more consistent.
> >>
> >> Regards
> >> JB
> >>
> >> On Tue, Mar 25, 2025 at 8:03 PM Fokko Driesprong <fo...@apache.org> wrote:
> >> >
> >> > Hi everyone,
> >> >
> >> > I wanted to draw your attention to some small changes to the multi-arg
> >> > transforms that I've bumped into while working on the V3 spec for
> >> > PyIceberg.
> >> >
> >> > 1. Up for debate: the spec does not point out an actual implementation
> >> > of transforms that accept multiple arguments. Of the existing
> >> > transforms, the only contender is the bucket transform. Should we
> >> > include this in the V3 spec? It will only allow you to prune metadata
> >> > if you do an equality expression on all the fields that are part of
> >> > the transform.
> >> > 2. Along the way, we've removed something that we did not intend to.
> >> > At first, we allowed writing source-id or source-ids based on the
> >> > number of arguments. This has been changed to only allow source-ids
> >> > for V3 in a PR that introduces backward compatibility. I think this
> >> > makes the JSON parsers/producers more complex than needed
> >> > (specifically in PyIceberg). Also, in Java, we would need to plumb
> >> > the table version down to PartitionSpecParser.java. I think it would
> >> > be great to simplify this.
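
The pruning limitation mentioned above for a multi-argument bucket transform can be illustrated with a small sketch. Everything here is an assumption made for illustration: the spec does not define a multi-argument bucket, `hypothetical_bucket` uses CRC32 as a stand-in rather than the hash Iceberg uses for its single-argument bucket transform, and `bucket_for_filter` is an invented name.

```python
import zlib
from typing import Any, Dict, Optional, Sequence


def hypothetical_bucket(values: Sequence[Any], num_buckets: int) -> int:
    """Map a tuple of values to a bucket.

    Stand-in hash for illustration only; not defined by the Iceberg spec.
    """
    digest = zlib.crc32(repr(tuple(values)).encode("utf-8"))
    return digest % num_buckets


def bucket_for_filter(
    equalities: Dict[str, Any],
    transform_fields: Sequence[str],
    num_buckets: int,
) -> Optional[int]:
    """Return the single bucket a filter can match, or None if unknown.

    `equalities` maps column name to the value of an equality predicate.
    """
    if any(name not in equalities for name in transform_fields):
        # At least one input of the hash is unconstrained, so any bucket
        # may contain matching rows and nothing can be pruned.
        return None
    values = [equalities[name] for name in transform_fields]
    return hypothetical_bucket(values, num_buckets)
```

A filter that fixes only some of the transform's fields gets `None` back, meaning every bucket has to be scanned; only a filter with an equality on each field narrows the scan to one bucket.
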
> >> >
> >> > Please let me know what you think so we can tie up the loose ends
> >> > for V3.
> >> >
> >> > Kind regards,
> >> > Fokko
> >> >
> >> >
> >> >
>
