Good morning, I'd favor getting the other parts of v3 done and adopted since we don't want that work to linger too long.
That was exactly my goal of bringing this back up :)

Thanks for working on this JB, looking forward to the PR!

Kind regards,
Fokko

On Fri, Apr 4, 2025 at 07:17, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> Hi Fokko
>
> OK, I was focusing only on source-ids for the V3 writer, and on
> compatibility in the reader.
> Your proposal works as well (the V3 writer writes source-id or
> source-ids depending on the number of args); the V3 reader also has to
> be updated to read source-id or source-ids (the V2 reader doesn't
> change).
> The only "drawback" is that we maintain two metadata fields for the
> same thing (source-id and source-ids; I would have preferred only
> source-ids at some point, as it can contain a single arg too :)), but
> that's not a big deal.
> So, with this, we don't need the format-version anymore. We have to:
> 1. on the writer, write source-id or source-ids depending on the number
>    of args (source-id for a single arg, source-ids for multiple args)
> 2. on the reader, "test" whether source-id or source-ids exists. I guess
>    if source-ids contains a single arg column that's OK :)
>
> So, it sounds good to me, I can update my PR accordingly.
>
> Regards
> JB
>
> On Thu, Apr 3, 2025 at 6:28 PM Fokko Driesprong <fo...@apache.org> wrote:
> >
> > Hey JB,
> >
> > Thanks for jumping in here.
> >
> > My point in the PR is that in the current version of the spec, we
> > must write source-ids for ≥V3 tables and source-id for ≤V2 tables,
> > which requires carrying the format-version to the serializer.
> > Instead, what I propose in the PR is to write source-id in the case
> > of a single argument (compatible with all versions at read time), and
> > to write source-ids only when there are multiple arguments. This way,
> > we don't need to know about the table version when serializing the
> > partition-spec/sort-order.
> >
> > I've simplified the PR as suggested by Szehon to set the bucketing
> > transform aside for now, which I think is a great idea.
> >
> > Kind regards,
> > Fokko
> >
> > On Thu, Apr 3, 2025 at 16:17, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >>
> >> Hi Fokko
> >>
> >> Sorry for the late reply :)
> >>
> >> 1. It sounds good to me.
> >> 2. I started to work on the core to use only source-ids. The writer
> >> writes only source-ids, whereas the reader detects whether source-id
> >> exists and uses it (for backward compatibility). Using source-ids
> >> is clearly simpler and more consistent.
> >>
> >> Regards
> >> JB
> >>
> >> On Tue, Mar 25, 2025 at 8:03 PM Fokko Driesprong <fo...@apache.org> wrote:
> >> >
> >> > Hi everyone,
> >> >
> >> > I wanted to draw your attention to some small changes to the
> >> > multi-arg transforms that I've bumped into while working on the V3
> >> > spec for PyIceberg.
> >> >
> >> > Up for debate. The spec does not point out an actual
> >> > implementation of transforms that accept multiple arguments. Of
> >> > the existing transforms, the only contender is the bucket
> >> > transform. Should we include this in the V3 spec? It will only
> >> > allow you to prune metadata if you do an equality expression on
> >> > all the fields that are part of the transform.
> >> >
> >> > Along the way, we've removed something that we did not intend to.
> >> > At first we allowed writing source-id or source-ids based on the
> >> > number of arguments. This has been changed to only allow
> >> > source-ids for V3 in a PR that introduces backward compatibility.
> >> > I think this makes the JSON parsers/producers more complex than
> >> > needed (specifically in PyIceberg). Also, in Java, we would need
> >> > to plumb the table version down to PartitionSpecParser.java. I
> >> > think it would be great to simplify this.
> >> >
> >> > Please let me know what you think so we can tie up the loose ends
> >> > for V3.
> >> >
> >> > Kind regards,
> >> > Fokko
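
For readers following the thread, the rule the two of them converge on (write source-id for a single argument, write source-ids only for multiple arguments, and let the reader accept either key) could be sketched roughly as below. This is only an illustration in Python: `write_source_fields` and `read_source_fields` are hypothetical helper names, not functions from PyIceberg or the Java implementation, and preferring source-ids when both keys are present is an assumption that the thread does not settle.

```python
from typing import Any, Dict, List


def write_source_fields(source_ids: List[int]) -> Dict[str, Any]:
    """Hypothetical writer helper: pick the JSON key from the argument count."""
    if not source_ids:
        raise ValueError("a transform needs at least one source column")
    if len(source_ids) == 1:
        # Single argument: use the key that readers of every version understand.
        return {"source-id": source_ids[0]}
    # Multiple arguments: only the list form can express this.
    return {"source-ids": list(source_ids)}


def read_source_fields(field: Dict[str, Any]) -> List[int]:
    """Hypothetical reader helper: accept either key, no format-version needed."""
    if "source-ids" in field:
        # Assumption: if both keys are present, the list form wins.
        # A single-element list is accepted as well.
        return list(field["source-ids"])
    if "source-id" in field:
        return [field["source-id"]]
    raise ValueError("field has neither source-id nor source-ids")


if __name__ == "__main__":
    for ids in ([3], [3, 4]):
        serialized = write_source_fields(ids)
        print(serialized, "->", read_source_fields(serialized))
```

Neither helper takes a format-version argument, which is the simplification the thread is after: the serializer decides from the number of source columns alone.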