Thanks for the historical context here. I wasn't aware of the equivalent
between -1 and null.

It looks like we're all aligned on not retroactively changing the spec.
Let's wrap this up with some actions:

   - To revert the changes, I've raised the PRs (1.8.x
   <https://github.com/apache/iceberg/pull/12313>, main
   <https://github.com/apache/iceberg/pull/12312>) and they are already in
   (thanks Eduard!).
   - Created a PR to write
   <https://github.com/apache/iceberg/pull/12335/files>null
   <https://github.com/apache/iceberg/pull/12335/files> for ≥V3
   <https://github.com/apache/iceberg/pull/12335/files>. I like the
   solution of tying it to the metadata version, since folks have to update
   their libraries/catalogs anyway to use V3.
   - Distilled the words said here around -1 into an implementation note
   <https://github.com/apache/iceberg/pull/12334> to avoid confusion in the
   future, let me know what y'all think.

Thanks again everyone for the input here.

Kind regards,
Fokko






Op di 18 feb 2025 om 21:55 schreef Daniel Weeks <dwe...@apache.org>:

> I agree that we shouldn't retroactively change the spec to align with the
> implementation.
>
> We're trying to strike a balance here and I think the notes are an
> effective way to convey that while omitting or producing null with v1/v2 is
> technically to spec, it's going to be incompatible with most
> implementations and should be discouraged for compatibility reasons.
>
> So, no on spec change, yes on notes.
>
> On Tue, Feb 18, 2025 at 12:37 PM Russell Spitzer <
> russell.spit...@gmail.com> wrote:
>
>> `For compatibility with existing libraries, we should maintain that `-1`
>> is equivalent to no snapshot and it should be written for v1/v2.`
>>
>> The only issue I have with this is are we saying that for v1 and v2 we
>> are changing the spec to say that current-snapshot-id is required? Or are
>> we adding an implementation note, the java implementation will always
>> return -1 for these tables but this is not required for other clients?
>>
>> Unless we say that it is required we can have other implementations which
>> can return null or missing. I'm not a big fan of changing the spec to match
>> the implementation especially when we are narrowing what is allowed.
>>
>>
>> On Tue, Feb 18, 2025 at 2:15 PM Daniel Weeks <dwe...@apache.org> wrote:
>>
>>> I would agree the best path forward is to note the current behavior for
>>> v1/v2 since that's well established and address the behavior in v3.
>>>
>>> For compatibility with existing libraries, we should maintain that `-1`
>>> is equivalent to no snapshot and it should be written for v1/v2.
>>>
>>> With V3 we should support `-1` for backwards compatibility (though
>>> there's low likelihood that someone will promote an empty table), but
>>> always write `null`.
>>>
>>> I believe this is in part due to TableMetadata tracking the
>>> current-snapshot-id as a `long` value (as opposed to a Long) reference,
>>> which resulted in this oversight.
>>>
>>> -Dan
>>>
>>> On Tue, Feb 18, 2025 at 11:57 AM rdb...@gmail.com <rdb...@gmail.com>
>>> wrote:
>>>
>>>> +1 to reverting PT 11560 in main and 1.8.1. That avoids unnecessary
>>>> incompatibility with older readers.
>>>>
>>>> I also agree that we should update the spec to say what Russell
>>>> suggests:
>>>> > that -1 has meant "no current snapshot" in the past and is
>>>> equivalent to missing/null.
>>>>
>>>> That's a correct description of the behavior.
>>>>
>>>> I don't think that we should create new requirements for this that
>>>> didn't previously exist -- that is, I don't see much value in going back to
>>>> mandate this behavior when writing v1 or v2 tables. If an implementation
>>>> were writing null for the current snapshot ID up until now, I don't think
>>>> that we can say that behavior was or is incorrect. It was an
>>>> incompatibility with the Java implementation and we should note the
>>>> behavior in the "Implementation Notes" section.
>>>>
>>>> > How about reverting #11560 for 1.8.1, and then reinstating this for
>>>> 2.0.0?
>>>>
>>>> I think we need to fix this at a format version boundary, not a library
>>>> version boundary. I'd be up for reinstating the write change for v3.
>>>>
>>>> On Tue, Feb 18, 2025 at 7:56 AM Russell Spitzer <
>>>> russell.spit...@gmail.com> wrote:
>>>>
>>>>> The only thing I think I agree with is defining that -1 has meant "no
>>>>> current snapshot" in the past and
>>>>> is equivalent to a missing/nuil (if we have to specify that) .
>>>>>
>>>>> I don't think there is any reason to change the behavior of writing
>>>>> null / missing unless
>>>>> that's really a point of confusion for folks. Is there a JSON library
>>>>> folks are using
>>>>>  which distinguishes between missing/null?
>>>>>
>>>>> Would having field:null be the same as missing have avoided the issue
>>>>> because it seems like libraries
>>>>> that only handled non-null would also not handle "missing"
>>>>> current-snapshot-id as well?
>>>>>
>>>>> I'd really like to hear from other implementers here since changing
>>>>> all this again would be a lot of
>>>>> work and I thought we had understanding of Optional == Nullable as a
>>>>> valid thing in the spec.
>>>>>
>>>>> On Tue, Feb 18, 2025 at 7:35 AM Robert Stupp <sn...@snazy.de> wrote:
>>>>>
>>>>>> Correcting myself: schema/spec/sort seem to be always present -
>>>>>> please ignore that part in my previous email. The valid values for those
>>>>>> fields however should be defined.
>>>>>> On 18.02.25 14:29, Robert Stupp wrote:
>>>>>>
>>>>>> Reality is that Iceberg did write '-1' into current-snapshot-id (and
>>>>>> other "non-exist" marker values for schema/spec/sort) instead of omitting
>>>>>> the field.
>>>>>>
>>>>>> Side note: the table-spec says that these fields are optional, but
>>>>>> nothing about whether it is nullable.
>>>>>>
>>>>>> The spec should at least be amended to explicitly define the valid
>>>>>> values (-1 for current-snapshot-id for no snapshot is just a fact now 
>>>>>> that
>>>>>> it's there). IMO that field being 'null' isn't defined in the spec, but 
>>>>>> the
>>>>>> absence of the field is.
>>>>>>
>>>>>> Proposal for the implementation:
>>>>>>
>>>>>> * revert https://github.com/apache/iceberg/pull/11560 in 1.8.1 and
>>>>>> on main
>>>>>>
>>>>>> Proposal for the spec (for current-snapshot-id):
>>>>>>
>>>>>> * Define the valid value ranges for the field
>>>>>> * Define that the absence of the field means "no current snapshot"
>>>>>> * Define that the value -1 of the field means "no current snapshot"
>>>>>> * Define that the value 'null' of the field means "no current
>>>>>> snapshot"
>>>>>> * Define that new implementations must not write the field if there's
>>>>>> no current snapshot
>>>>>> Proposal for the spec (for schema/spec/sort):
>>>>>>
>>>>>> * Define the valid value ranges for the fields
>>>>>> * Define the "schema/spec/sort not present" values (the fields are
>>>>>> optional for v1 but required for v2+v3).
>>>>>> * OR Define that "schema/spec/sort must be absent" if there is no
>>>>>> current schema/spec/sort.
>>>>>>
>>>>>> WDYT?
>>>>>>
>>>>>> On 17.02.25 21:07, Russell Spitzer wrote:
>>>>>>
>>>>>> It sounds like the argument here is that we should change the Spec
>>>>>> for V1, V2, and V3 to mark current-snapshot-id
>>>>>> as required. Then we should change all other implementations to
>>>>>> follow this new standard. I'm not sure that
>>>>>> is a good solution going forwards but I'm not sure of how we can
>>>>>> support catalogs/engines that cannot handle a null
>>>>>> correctly in this situation otherwise. Perhaps we should source out
>>>>>> to see if any other implementers worked off the
>>>>>> assumption of a non-optional "current-snapshot-id" and if we get a
>>>>>> critical mass we can try to make that change?
>>>>>> Because of how wide that change would be, I think we would need
>>>>>> pretty broad consensus to do so.
>>>>>>
>>>>>> We could possibly also have a flag to allow the old behavior but that
>>>>>> also feels wrong to me, we have often gone with a motto of
>>>>>> read "wrong" write "correct" for things like this in the past and
>>>>>> continuing to write "wrong" is a disservice to any
>>>>>> new implementers . When we do have a contradiction between our
>>>>>> implementation and the spec I think we have
>>>>>> to trust that implementers followed the spec and fix the core library.
>>>>>>
>>>>>> Are there any other solutions here?
>>>>>>
>>>>>> On Mon, Feb 17, 2025 at 11:45 AM Fokko Driesprong <fo...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey Robert,
>>>>>>>
>>>>>>> The thing is, that -1 cannot "go away".
>>>>>>>
>>>>>>>
>>>>>>> Yes, I agree, but that's also the case for null, as the field is 
>>>>>>> optional
>>>>>>> in the spec
>>>>>>> <https://iceberg.apache.org/spec/?column-projection#table-metadata-fields>.
>>>>>>> Therefore we support both in PyIceberg
>>>>>>> <https://github.com/apache/iceberg-python/blob/300b8405a0fe7d0111321e5644d704026af9266b/pyiceberg/table/metadata.py#L71-L77>,
>>>>>>> Iceberg-Rust
>>>>>>> <https://github.com/apache/iceberg-rust/blob/752d69041e0461989c48dd1ca79bcff577776f5d/crates/iceberg/src/spec/table_metadata.rs#L500>,
>>>>>>> Iceberg-Java
>>>>>>> <https://github.com/apache/iceberg/blob/bcbbd0344623ffea5b092e2de5debb0bc12892a1/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L458-L462>,
>>>>>>> and Iceberg-Go
>>>>>>> <https://github.com/apache/iceberg-go/blob/ada5480954d9b41d2f8eb4c765523614fad65e1a/table/metadata.go#L837-L841>.
>>>>>>> On the write side, they all produce null instead of -1. Therefore,
>>>>>>> I was surprised that it comes up now, and not earlier.
>>>>>>>
>>>>>>> I'd prefer to keep the previous behavior - otherwise implementations
>>>>>>>> may fall back to 0, which is definitely wrong.
>>>>>>>
>>>>>>>
>>>>>>> I'm not seeing why it would fall back to 0, and I agree, that's
>>>>>>> wrong.
>>>>>>>
>>>>>>> Would be better IMHO not to break existing implementations / render
>>>>>>>> existing setups incompatible with Iceberg 1.8.
>>>>>>>
>>>>>>>
>>>>>>> In my opinion, if this had been caught in an RC, it would be open
>>>>>>> for discussion, but that ship has sailed. Let's hear what others think.
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> Fokko
>>>>>>>
>>>>>>> Op ma 17 feb 2025 om 18:16 schreef Robert Stupp <sn...@snazy.de>:
>>>>>>>
>>>>>>>> Hi Fokko,
>>>>>>>>
>>>>>>>> sure, in general "absent" or "null" would be cleaner. But now we
>>>>>>>> have two representations for the same case - I suspect most went with 
>>>>>>>> the
>>>>>>>> "reference behavior".
>>>>>>>>
>>>>>>>> The thing is, that -1 cannot "go away".
>>>>>>>>
>>>>>>>> I'd prefer to keep the previous behavior - otherwise
>>>>>>>> implementations may fall back to 0, which is definitely wrong. Would be
>>>>>>>> better IMHO not to break existing implementations / render existing 
>>>>>>>> setups
>>>>>>>> incompatible with Iceberg 1.8.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 17.02.25 15:49, Fokko Driesprong wrote:
>>>>>>>>
>>>>>>>> Hey Robert,
>>>>>>>>
>>>>>>>> Thanks for raising this.
>>>>>>>>
>>>>>>>> snapshot-ID -1 isn't per-se invalid, because the valid values are
>>>>>>>>> not defined in the spec.
>>>>>>>>
>>>>>>>>
>>>>>>>> For me, this is invalid, since there is no snapshot with -1 in the
>>>>>>>> snapshots property. In the tests with the PR, you can see that
>>>>>>>> there are no snapshots
>>>>>>>> <https://github.com/apache/iceberg/pull/11560/files#diff-41bdfb6698d2aa7b47ff7d5fabc558a5a64f8b7496fe1bcd8f8ecb69b2afc128R112>.
>>>>>>>> A year ago we had a similar discussion on PyIceberg
>>>>>>>> <https://py.iceberg.apache.org/configuration/#backward-compatibility>
>>>>>>>> around this and this ended up in adding a flag to fall back to the old
>>>>>>>> behavior. I do agree that we should have communicated this more clearly
>>>>>>>> with the release.
>>>>>>>>
>>>>>>>> Kind regards,
>>>>>>>> Fokko
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Op ma 17 feb 2025 om 12:25 schreef Robert Stupp <sn...@snazy.de>:
>>>>>>>>
>>>>>>>>> Feels like https://github.com/apache/iceberg/pull/11560
>>>>>>>>> introduced a behavior change.
>>>>>>>>>
>>>>>>>>> snapshot-ID -1 isn't per-se invalid, because the valid values are
>>>>>>>>> not defined in the spec.
>>>>>>>>>
>>>>>>>>> Previous Iceberg-Java versions always produced -1 if there's no
>>>>>>>>> current snapshot - 1.8 produces `null` in that case. So there are now 
>>>>>>>>> two
>>>>>>>>> _different_ values for "no current snapshot".
>>>>>>>>>
>>>>>>>>> Implementations that rely on the behavior of the "reference
>>>>>>>>> implementation" (Iceberg-Java) do now fail in case there's no current
>>>>>>>>> snapshot.
>>>>>>>>> On 13.02.25 10:09, Amogh Jahagirdar wrote:
>>>>>>>>>
>>>>>>>>> I'm pleased to announce the release of Apache Iceberg 1.8.0!
>>>>>>>>>
>>>>>>>>> Apache Iceberg is an open table format for huge analytic datasets.
>>>>>>>>> Iceberg
>>>>>>>>> delivers high query performance for tables with tens of petabytes
>>>>>>>>> of data,
>>>>>>>>> along with atomic commits, concurrent writes, and SQL-compatible
>>>>>>>>> table
>>>>>>>>> evolution.
>>>>>>>>>
>>>>>>>>> This release can be downloaded from:
>>>>>>>>> https://www.apache.org/dyn/closer.cgi/iceberg/apache-iceberg-1.8.0/apache-iceberg-1.8.0.tar.gz
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Release notes: https://iceberg.apache.org/releases/#180-release
>>>>>>>>> <https://iceberg.apache.org/releases/180-release>
>>>>>>>>>
>>>>>>>>> Java artifacts are available from Maven Central.
>>>>>>>>>
>>>>>>>>> Thanks to everyone for contributing!
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Robert Stupp
>>>>>>>>> @snazy
>>>>>>>>>
>>>>>>>>> --
>>>>>>>> Robert Stupp
>>>>>>>> @snazy
>>>>>>>>
>>>>>>>> --
>>>>>> Robert Stupp
>>>>>> @snazy
>>>>>>
>>>>>> --
>>>>>> Robert Stupp
>>>>>> @snazy
>>>>>>
>>>>>>

Reply via email to