Thanks for the historical context here. I wasn't aware of the equivalent between -1 and null.
It looks like we're all aligned on not retroactively changing the spec. Let's wrap this up with some actions: - To revert the changes, I've raised the PRs (1.8.x <https://github.com/apache/iceberg/pull/12313>, main <https://github.com/apache/iceberg/pull/12312>) and they are already in (thanks Eduard!). - Created a PR to write <https://github.com/apache/iceberg/pull/12335/files>null <https://github.com/apache/iceberg/pull/12335/files> for ≥V3 <https://github.com/apache/iceberg/pull/12335/files>. I like the solution of tying it to the metadata version, since folks have to update their libraries/catalogs anyway to use V3. - Distilled the words said here around -1 into an implementation note <https://github.com/apache/iceberg/pull/12334> to avoid confusion in the future, let me know what y'all think. Thanks again everyone for the input here. Kind regards, Fokko Op di 18 feb 2025 om 21:55 schreef Daniel Weeks <dwe...@apache.org>: > I agree that we shouldn't retroactively change the spec to align with the > implementation. > > We're trying to strike a balance here and I think the notes are an > effective way to convey that while omitting or producing null with v1/v2 is > technically to spec, it's going to be incompatible with most > implementations and should be discouraged for compatibility reasons. > > So, no on spec change, yes on notes. > > On Tue, Feb 18, 2025 at 12:37 PM Russell Spitzer < > russell.spit...@gmail.com> wrote: > >> `For compatibility with existing libraries, we should maintain that `-1` >> is equivalent to no snapshot and it should be written for v1/v2.` >> >> The only issue I have with this is are we saying that for v1 and v2 we >> are changing the spec to say that current-snapshot-id is required? Or are >> we adding an implementation note, the java implementation will always >> return -1 for these tables but this is not required for other clients? >> >> Unless we say that it is required we can have other implementations which >> can return null or missing. I'm not a big fan of changing the spec to match >> the implementation especially when we are narrowing what is allowed. >> >> >> On Tue, Feb 18, 2025 at 2:15 PM Daniel Weeks <dwe...@apache.org> wrote: >> >>> I would agree the best path forward is to note the current behavior for >>> v1/v2 since that's well established and address the behavior in v3. >>> >>> For compatibility with existing libraries, we should maintain that `-1` >>> is equivalent to no snapshot and it should be written for v1/v2. >>> >>> With V3 we should support `-1` for backwards compatibility (though >>> there's low likelihood that someone will promote an empty table), but >>> always write `null`. >>> >>> I believe this is in part due to TableMetadata tracking the >>> current-snapshot-id as a `long` value (as opposed to a Long) reference, >>> which resulted in this oversight. >>> >>> -Dan >>> >>> On Tue, Feb 18, 2025 at 11:57 AM rdb...@gmail.com <rdb...@gmail.com> >>> wrote: >>> >>>> +1 to reverting PT 11560 in main and 1.8.1. That avoids unnecessary >>>> incompatibility with older readers. >>>> >>>> I also agree that we should update the spec to say what Russell >>>> suggests: >>>> > that -1 has meant "no current snapshot" in the past and is >>>> equivalent to missing/null. >>>> >>>> That's a correct description of the behavior. >>>> >>>> I don't think that we should create new requirements for this that >>>> didn't previously exist -- that is, I don't see much value in going back to >>>> mandate this behavior when writing v1 or v2 tables. If an implementation >>>> were writing null for the current snapshot ID up until now, I don't think >>>> that we can say that behavior was or is incorrect. It was an >>>> incompatibility with the Java implementation and we should note the >>>> behavior in the "Implementation Notes" section. >>>> >>>> > How about reverting #11560 for 1.8.1, and then reinstating this for >>>> 2.0.0? >>>> >>>> I think we need to fix this at a format version boundary, not a library >>>> version boundary. I'd be up for reinstating the write change for v3. >>>> >>>> On Tue, Feb 18, 2025 at 7:56 AM Russell Spitzer < >>>> russell.spit...@gmail.com> wrote: >>>> >>>>> The only thing I think I agree with is defining that -1 has meant "no >>>>> current snapshot" in the past and >>>>> is equivalent to a missing/nuil (if we have to specify that) . >>>>> >>>>> I don't think there is any reason to change the behavior of writing >>>>> null / missing unless >>>>> that's really a point of confusion for folks. Is there a JSON library >>>>> folks are using >>>>> which distinguishes between missing/null? >>>>> >>>>> Would having field:null be the same as missing have avoided the issue >>>>> because it seems like libraries >>>>> that only handled non-null would also not handle "missing" >>>>> current-snapshot-id as well? >>>>> >>>>> I'd really like to hear from other implementers here since changing >>>>> all this again would be a lot of >>>>> work and I thought we had understanding of Optional == Nullable as a >>>>> valid thing in the spec. >>>>> >>>>> On Tue, Feb 18, 2025 at 7:35 AM Robert Stupp <sn...@snazy.de> wrote: >>>>> >>>>>> Correcting myself: schema/spec/sort seem to be always present - >>>>>> please ignore that part in my previous email. The valid values for those >>>>>> fields however should be defined. >>>>>> On 18.02.25 14:29, Robert Stupp wrote: >>>>>> >>>>>> Reality is that Iceberg did write '-1' into current-snapshot-id (and >>>>>> other "non-exist" marker values for schema/spec/sort) instead of omitting >>>>>> the field. >>>>>> >>>>>> Side note: the table-spec says that these fields are optional, but >>>>>> nothing about whether it is nullable. >>>>>> >>>>>> The spec should at least be amended to explicitly define the valid >>>>>> values (-1 for current-snapshot-id for no snapshot is just a fact now >>>>>> that >>>>>> it's there). IMO that field being 'null' isn't defined in the spec, but >>>>>> the >>>>>> absence of the field is. >>>>>> >>>>>> Proposal for the implementation: >>>>>> >>>>>> * revert https://github.com/apache/iceberg/pull/11560 in 1.8.1 and >>>>>> on main >>>>>> >>>>>> Proposal for the spec (for current-snapshot-id): >>>>>> >>>>>> * Define the valid value ranges for the field >>>>>> * Define that the absence of the field means "no current snapshot" >>>>>> * Define that the value -1 of the field means "no current snapshot" >>>>>> * Define that the value 'null' of the field means "no current >>>>>> snapshot" >>>>>> * Define that new implementations must not write the field if there's >>>>>> no current snapshot >>>>>> Proposal for the spec (for schema/spec/sort): >>>>>> >>>>>> * Define the valid value ranges for the fields >>>>>> * Define the "schema/spec/sort not present" values (the fields are >>>>>> optional for v1 but required for v2+v3). >>>>>> * OR Define that "schema/spec/sort must be absent" if there is no >>>>>> current schema/spec/sort. >>>>>> >>>>>> WDYT? >>>>>> >>>>>> On 17.02.25 21:07, Russell Spitzer wrote: >>>>>> >>>>>> It sounds like the argument here is that we should change the Spec >>>>>> for V1, V2, and V3 to mark current-snapshot-id >>>>>> as required. Then we should change all other implementations to >>>>>> follow this new standard. I'm not sure that >>>>>> is a good solution going forwards but I'm not sure of how we can >>>>>> support catalogs/engines that cannot handle a null >>>>>> correctly in this situation otherwise. Perhaps we should source out >>>>>> to see if any other implementers worked off the >>>>>> assumption of a non-optional "current-snapshot-id" and if we get a >>>>>> critical mass we can try to make that change? >>>>>> Because of how wide that change would be, I think we would need >>>>>> pretty broad consensus to do so. >>>>>> >>>>>> We could possibly also have a flag to allow the old behavior but that >>>>>> also feels wrong to me, we have often gone with a motto of >>>>>> read "wrong" write "correct" for things like this in the past and >>>>>> continuing to write "wrong" is a disservice to any >>>>>> new implementers . When we do have a contradiction between our >>>>>> implementation and the spec I think we have >>>>>> to trust that implementers followed the spec and fix the core library. >>>>>> >>>>>> Are there any other solutions here? >>>>>> >>>>>> On Mon, Feb 17, 2025 at 11:45 AM Fokko Driesprong <fo...@apache.org> >>>>>> wrote: >>>>>> >>>>>>> Hey Robert, >>>>>>> >>>>>>> The thing is, that -1 cannot "go away". >>>>>>> >>>>>>> >>>>>>> Yes, I agree, but that's also the case for null, as the field is >>>>>>> optional >>>>>>> in the spec >>>>>>> <https://iceberg.apache.org/spec/?column-projection#table-metadata-fields>. >>>>>>> Therefore we support both in PyIceberg >>>>>>> <https://github.com/apache/iceberg-python/blob/300b8405a0fe7d0111321e5644d704026af9266b/pyiceberg/table/metadata.py#L71-L77>, >>>>>>> Iceberg-Rust >>>>>>> <https://github.com/apache/iceberg-rust/blob/752d69041e0461989c48dd1ca79bcff577776f5d/crates/iceberg/src/spec/table_metadata.rs#L500>, >>>>>>> Iceberg-Java >>>>>>> <https://github.com/apache/iceberg/blob/bcbbd0344623ffea5b092e2de5debb0bc12892a1/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L458-L462>, >>>>>>> and Iceberg-Go >>>>>>> <https://github.com/apache/iceberg-go/blob/ada5480954d9b41d2f8eb4c765523614fad65e1a/table/metadata.go#L837-L841>. >>>>>>> On the write side, they all produce null instead of -1. Therefore, >>>>>>> I was surprised that it comes up now, and not earlier. >>>>>>> >>>>>>> I'd prefer to keep the previous behavior - otherwise implementations >>>>>>>> may fall back to 0, which is definitely wrong. >>>>>>> >>>>>>> >>>>>>> I'm not seeing why it would fall back to 0, and I agree, that's >>>>>>> wrong. >>>>>>> >>>>>>> Would be better IMHO not to break existing implementations / render >>>>>>>> existing setups incompatible with Iceberg 1.8. >>>>>>> >>>>>>> >>>>>>> In my opinion, if this had been caught in an RC, it would be open >>>>>>> for discussion, but that ship has sailed. Let's hear what others think. >>>>>>> >>>>>>> Kind regards, >>>>>>> Fokko >>>>>>> >>>>>>> Op ma 17 feb 2025 om 18:16 schreef Robert Stupp <sn...@snazy.de>: >>>>>>> >>>>>>>> Hi Fokko, >>>>>>>> >>>>>>>> sure, in general "absent" or "null" would be cleaner. But now we >>>>>>>> have two representations for the same case - I suspect most went with >>>>>>>> the >>>>>>>> "reference behavior". >>>>>>>> >>>>>>>> The thing is, that -1 cannot "go away". >>>>>>>> >>>>>>>> I'd prefer to keep the previous behavior - otherwise >>>>>>>> implementations may fall back to 0, which is definitely wrong. Would be >>>>>>>> better IMHO not to break existing implementations / render existing >>>>>>>> setups >>>>>>>> incompatible with Iceberg 1.8. >>>>>>>> >>>>>>>> >>>>>>>> On 17.02.25 15:49, Fokko Driesprong wrote: >>>>>>>> >>>>>>>> Hey Robert, >>>>>>>> >>>>>>>> Thanks for raising this. >>>>>>>> >>>>>>>> snapshot-ID -1 isn't per-se invalid, because the valid values are >>>>>>>>> not defined in the spec. >>>>>>>> >>>>>>>> >>>>>>>> For me, this is invalid, since there is no snapshot with -1 in the >>>>>>>> snapshots property. In the tests with the PR, you can see that >>>>>>>> there are no snapshots >>>>>>>> <https://github.com/apache/iceberg/pull/11560/files#diff-41bdfb6698d2aa7b47ff7d5fabc558a5a64f8b7496fe1bcd8f8ecb69b2afc128R112>. >>>>>>>> A year ago we had a similar discussion on PyIceberg >>>>>>>> <https://py.iceberg.apache.org/configuration/#backward-compatibility> >>>>>>>> around this and this ended up in adding a flag to fall back to the old >>>>>>>> behavior. I do agree that we should have communicated this more clearly >>>>>>>> with the release. >>>>>>>> >>>>>>>> Kind regards, >>>>>>>> Fokko >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Op ma 17 feb 2025 om 12:25 schreef Robert Stupp <sn...@snazy.de>: >>>>>>>> >>>>>>>>> Feels like https://github.com/apache/iceberg/pull/11560 >>>>>>>>> introduced a behavior change. >>>>>>>>> >>>>>>>>> snapshot-ID -1 isn't per-se invalid, because the valid values are >>>>>>>>> not defined in the spec. >>>>>>>>> >>>>>>>>> Previous Iceberg-Java versions always produced -1 if there's no >>>>>>>>> current snapshot - 1.8 produces `null` in that case. So there are now >>>>>>>>> two >>>>>>>>> _different_ values for "no current snapshot". >>>>>>>>> >>>>>>>>> Implementations that rely on the behavior of the "reference >>>>>>>>> implementation" (Iceberg-Java) do now fail in case there's no current >>>>>>>>> snapshot. >>>>>>>>> On 13.02.25 10:09, Amogh Jahagirdar wrote: >>>>>>>>> >>>>>>>>> I'm pleased to announce the release of Apache Iceberg 1.8.0! >>>>>>>>> >>>>>>>>> Apache Iceberg is an open table format for huge analytic datasets. >>>>>>>>> Iceberg >>>>>>>>> delivers high query performance for tables with tens of petabytes >>>>>>>>> of data, >>>>>>>>> along with atomic commits, concurrent writes, and SQL-compatible >>>>>>>>> table >>>>>>>>> evolution. >>>>>>>>> >>>>>>>>> This release can be downloaded from: >>>>>>>>> https://www.apache.org/dyn/closer.cgi/iceberg/apache-iceberg-1.8.0/apache-iceberg-1.8.0.tar.gz >>>>>>>>> >>>>>>>>> >>>>>>>>> Release notes: https://iceberg.apache.org/releases/#180-release >>>>>>>>> <https://iceberg.apache.org/releases/180-release> >>>>>>>>> >>>>>>>>> Java artifacts are available from Maven Central. >>>>>>>>> >>>>>>>>> Thanks to everyone for contributing! >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Robert Stupp >>>>>>>>> @snazy >>>>>>>>> >>>>>>>>> -- >>>>>>>> Robert Stupp >>>>>>>> @snazy >>>>>>>> >>>>>>>> -- >>>>>> Robert Stupp >>>>>> @snazy >>>>>> >>>>>> -- >>>>>> Robert Stupp >>>>>> @snazy >>>>>> >>>>>>