Hi Fokko

OK, I was focusing only on source-ids for the V3 Writer, and on compatibility in the reader.

Your proposal works as well: the V3 Writer writes source-id or source-ids depending on the number of args, and the V3 Reader also has to be updated to read either source-id or source-ids (the V2 Reader doesn't change).

The only "drawback" is that we maintain two metadata fields for the same thing (source-id and source-ids; I would have preferred only source-ids at some point, as it can contain a single arg too :)), but that's not a big deal.

So, with this, we don't need the format-version anymore. We have to:
1. on the writer, write source-id or source-ids depending on the number of args (source-id for a single arg, source-ids for multi-args)
2. on the reader, "test" whether source-id or source-ids exists. I guess if source-ids contains a single arg column, that's OK :)
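To make the two steps above concrete, here is a minimal Python sketch. The function names write_source_field and read_source_field are hypothetical (they are not the PyIceberg or Java API), and the sketch only covers the source-id / source-ids keys of a single partition or sort field. Preferring source-ids when both keys are present is my assumption, not something the spec discussion settled.

```python
from typing import Any, Dict, List


def write_source_field(source_ids: List[int]) -> Dict[str, Any]:
    """Writer side (hypothetical helper): choose the key from the number
    of arguments only, so no format-version is needed here."""
    if not source_ids:
        raise ValueError("a transform needs at least one source column")
    if len(source_ids) == 1:
        # Single argument: readable by V1/V2/V3 readers.
        return {"source-id": source_ids[0]}
    # Multiple arguments: only meaningful for multi-arg transforms.
    return {"source-ids": list(source_ids)}


def read_source_field(field: Dict[str, Any]) -> List[int]:
    """Reader side (hypothetical helper): test which key exists and
    normalize to a list of source column ids."""
    if "source-ids" in field:
        # Assumption: source-ids wins if both keys are present.
        # A single-element list is accepted as well.
        return list(field["source-ids"])
    if "source-id" in field:
        return [field["source-id"]]
    raise ValueError("field has neither source-id nor source-ids")
```

With this shape, a round trip through the writer and the reader gives back the same list of ids for both the single-arg and the multi-arg case, and neither function needs to know the table version.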
So, it sounds good to me, I can update my PR accordingly.

Regards
JB

On Thu, Apr 3, 2025 at 6:28 PM Fokko Driesprong <fo...@apache.org> wrote:
>
> Hey JB,
>
> Thanks for jumping in here.
>
> My point in the PR is that in the current version of the spec, we must write
> source-ids for ≥V3 tables, and write source-id for ≤V2; this requires carrying
> the format-version to the serializer. Instead, what I propose in the PR is to
> write source-id in the case of a single argument (compatible with all
> versions on read time), and write source-ids only when there are multiple
> arguments. This way, we don't need to know about the table version when
> serializing the partition-spec/sort-order.
>
> I've simplified the PR as suggested by Szehon to first leave the bucketing
> transform for now, which I think is a great idea.
>
> Kind regards,
> Fokko
>
> On Thu, Apr 3, 2025 at 16:17, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>
>> Hi Fokko
>>
>> Sorry for the late reply :)
>>
>> 1. It sounds good to me.
>> 2. I started to work on the core to use only source-ids. The Writer is
>> writing only source-ids, whereas the Reader detects if source-id
>> exists and uses it (for backward compatibility). By using source-ids,
>> it's clearly simpler and consistent.
>>
>> Regards
>> JB
>>
>> On Tue, Mar 25, 2025 at 8:03 PM Fokko Driesprong <fo...@apache.org> wrote:
>> >
>> > Hi everyone,
>> >
>> > I wanted to get your attention to some small changes to the multi-arg
>> > transforms that I've bumped into while working on the V3 spec for
>> > PyIceberg.
>> >
>> > Up for debate. The spec does not point out an actual implementation of
>> > transforms that accept multiple arguments. From the existing transforms,
>> > the only contender is the bucket transform. Should we include this in the
>> > V3 spec? It will only allow you to prune metadata if you do an equality
>> > expression on all the fields that are part of the transform.
>> >
>> > Along the way, we've removed something that we did not intend. First we
>> > allowed to write source-id and source-ids based on the number of
>> > arguments. This has been changed to only allow source-ids for V3 in a PR
>> > that introduces backward compatibility. I think this makes the JSON
>> > parsers/producers more complex than needed (specifically PyIceberg). Also,
>> > in Java, we would need to plumb down the table version to the
>> > PartitionSpecParser.java. I think it would be great to simplify this.
>> >
>> > Please let me know what you think so we can tie up the loose ends for V3.
>> >
>> > Kind regards,
>> > Fokko