Re: [DISCUSS] Fix property names in REST spec for statistics / partition statistics

Fokko Driesprong Wed, 10 Jul 2024 03:33:10 -0700

Hey everyone,

I'm fine with a vote, it is a change to the spec indeed, but it is because
of a discrepancy between the reference implementation and the spec, so
therefore you can also see it as fixing a bug.


Let me give some context around how this is done for PyIceberg and
Iceberg-Rust.

Clients that have been implemented relying on the REST spec (I think
> pyiceberg generates code from it) are impacted


Initially, we generated the models from the spec generator (same as the
Python code that you're seeing in the PR
<https://github.com/apache/iceberg/pull/10662>), but we took it from there
and checked them into source control. The main problem with the code
generated by the spec is that there are limitations
<https://github.com/apache/iceberg/pull/6672>, so we have to amend them by
hand to make it work. Same if you look at Java (de)coders of the JSON
<https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/TableMetadataParser.java>,
there is a lot of complexity in there that's hard to capture in the
open-api spec.

The change in the PR does not affect PyIceberg since it just doesn't
support the statistics or partition-statistics. When implementing this we
would directly notice this PyIceberg with the integration tests
<https://github.com/apache/iceberg-python/tree/main/tests/integration> that
we have with a REST catalog that's backed by the reference Java
implementation <https://github.com/tabular-io/iceberg-rest-image>. I
believe it is much easier to test against the reference implementation,
instead of using the generated models, because from my experience with
PyIceberg and Iceberg-Rust, I don't think it is that easy to have a
meaningful way of validating the spec.

Looking at the current ecosystem that supports the Iceberg REST catalog, we
should ensure that the changes are not breaking older clients. meaning we
can extend the models, support table v3 format, and add new capabilities. I
think it might look tempting to also version the REST catalog spec, but my
main concern is that this will introduce a process that will slow the
evolution down. We can already see that with the table v3 spec that's
pending for quite a while. Introducing versioning on the REST spec will
cause even more delays in all the proposals, while there are other ways of
versioning out there (version in the URL, capabilities, etc).

Kind regards,
Fokko

Op wo 10 jul 2024 om 01:07 schreef Daniel Weeks <dwe...@apache.org>:

> I just want to point out that the Iceberg Improvement Proposal
> <https://iceberg.apache.org/contribute/#apache-iceberg-improvement-proposals>
> does use the language ". . . or significant additions/changes to any of the
> existing Iceberg implementations", which is consistent with other similar
> Apache improvement proposals, so that every minor
> clarification/fix/adjustment doesn't require a heavy weight process.
>
> I'm sure we can argue what "significant" means in that context, but I'm
> fine if we want to hold a vote for something like this just to be
> consistent.  We've made small clarifications to the Iceberg table spec
> without a vote, but I'm open to either handling these on a case by case
> basis or just holding a quick vote.
>
> -Dan
>
>
>
> On Tue, Jul 9, 2024 at 1:36 PM Jack Ye <yezhao...@gmail.com> wrote:
>
>> I propose two things:
>>
>> 1. We officially release the specs with major and minor versions. With
>> some experience of implementing against both the table and the catalog
>> spec, I would say that such error is not as obvious as we think when there
>> are hundreds of models and property fields you need to implement, and you
>> just work with the latest text spec at a point of time to implement it.
>> Having released spec versions marks that the spec text you are looking at
>> is a version that can be consumed, and if there are any changes or fixes in
>> existing parts of the spec, it will be in a compatible and predictable way.
>>
>> 2. We have an official reference implementation of the REST server. There
>> have been a lot of debates about it already, but I am feeling increasingly
>> strong about it to ensure the quality of the spec. This is another case
>> where the error would have been easily caught if there was validation of
>> end-to-end client-server interaction using the generated models. We had a
>> similar issue previously with error models that the definition was not
>> matching the implementation for a long time:
>> https://github.com/apache/iceberg/pull/8914. As the spec evolves to be
>> increasingly complex, it will be more and more challenging to spot errors
>> just by looking at the spec text and the generated model shape without
>> actually using it.
>>
>> -Jack
>>
>>
>> On Tue, Jul 9, 2024 at 1:10 PM Ryan Blue <b...@databricks.com.invalid>
>> wrote:
>>
>>> What I'm asking is whether we want to have a vote for a substantive
>>> change and not for bug fixes. I'm not suggesting anything beyond that, like
>>> there is some ability for committers and PMC members to not follow the same
>>> rules.
>>>
>>> While this is a change to a property name, the name that is being fixed
>>> conflicts with both the implementation
>>> <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L111-L112>
>>>  and
>>> the table spec
>>> <https://github.com/apache/iceberg/blob/main/format/spec.md#table-metadata-fields>.
>>> I think that's fairly clearly a bug or typo in the spec that is being
>>> updated. (The generated Python code is an example)
>>>
>>> Raising awareness is definitely something we should do. It's also clear
>>> that we want to make this change -- I don't think there are arguments
>>> against it -- and I'm saying we may not want or need a vote to confirm.
>>> Those can be separate things, like sending a [ANNOUNCE] message on this
>>> list rather than having a [VOTE] thread open for 3 days before fixing it.
>>>
>>> Again, I'm asking a question about how we want to handle situations like
>>> this moving forward. As I said originally, I think it's fine to have a
>>> vote when people think it's needed.
>>>
>>> On Tue, Jul 9, 2024 at 12:44 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> I agree with Robert here that we need to get into the habit of doing
>>>> votes on the spec changes.
>>>>
>>>> There are typos that could be in sections like description, that does
>>>> not affect the overall spec usage at all, maybe those changes do not need
>>>> an official vote. However, this is a change of an existing field name in a
>>>> data model, and we do not know if there is a dependency of the model
>>>> somewhere. We are assuming that the feature is likely not used, which might
>>>> be true in this case, but might not be true going forward.
>>>>
>>>> Related to this, we should think about officially releasing the spec
>>>> and properly updating global spec versions for this purpose. One important
>>>> use case of a spec is that there are engines that implement against the
>>>> spec directly, instead of using any provided SDKs in Java/Python/Rust. For
>>>> example, we implemented all the format v2 support directly in Amazon
>>>> Redshift against the table v2 spec without using any existing Iceberg
>>>> library. I cannot imagine how such engines could implement against the
>>>> catalog spec if somehow a field name can just change silently, and we don't
>>>> know when we can actually take a dependency on it. A spec release seems
>>>> like the proper way to inform that external parties can start to take a
>>>> dependency on it, and future changes will be backwards compatible in minor
>>>> version updates, or will require a deprecation cycle until the next major
>>>> version update.
>>>>
>>>> -Jack
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Jul 9, 2024 at 11:25 AM Robert Stupp <sn...@snazy.de> wrote:
>>>>
>>>>> I think it is needed, because of the reasons emphasized either by
>>>>> Daniel or you yesterday in the call: people have to be aware of changes in
>>>>> specifications.
>>>>>
>>>>> Maybe I'm alone and maybe it's perceived "pedantic", but I think you
>>>>> missed the point: the rule mentioned in yesterday's call (not sure where
>>>>> it's written tho) is to communicate every spec change. If Iceberg
>>>>> committers and PMCs don't follow this rule, you cannot expect others to do
>>>>> so.
>>>>>
>>>>> This PR changes the schema in the REST spec. Clients that have been
>>>>> implemented relying on the REST spec (I think pyiceberg generates code 
>>>>> from
>>>>> it) are impacted. Other implementers might have just relied on the
>>>>> _specification_.
>>>>>
>>>>>
>>>>> On 09.07.24 17:28, Ryan Blue wrote:
>>>>>
>>>>> I think it's fine to have a vote for this if anyone thinks that it is
>>>>> needed. But since this is just fixing the part of the REST spec that 
>>>>> duplicates
>>>>> the table spec and correcting a typo
>>>>> <https://github.com/apache/iceberg/blob/main/format/spec.md#table-metadata-fields>,
>>>>> it seems like more of a correction than a substantive change.
>>>>>
>>>>> On Tue, Jul 9, 2024 at 3:14 AM Robert Stupp <sn...@snazy.de> wrote:
>>>>>
>>>>>> Hi Eduard,
>>>>>>
>>>>>> this needs to be a formal code-change vote, because it's a change to
>>>>>> a spec (this was emphasized during yesterday's call). Can you add some
>>>>>> background about the change?
>>>>>>
>>>>>> Robert
>>>>>>
>>>>>>
>>>>>> On 09.07.24 11:26, Eduard Tudenhöfner wrote:
>>>>>>
>>>>>> Hey everyone,
>>>>>>
>>>>>> I've opened #10662 <https://github.com/apache/iceberg/pull/10662> to
>>>>>> fix property names for statistics / partition statistics in the REST 
>>>>>> spec.
>>>>>> I can start a separate VOTE thread if there is agreement around the
>>>>>> proposed Spec change.
>>>>>>
>>>>>> Thanks
>>>>>> Eduard
>>>>>>
>>>>>> --
>>>>>> Robert Stupp
>>>>>> @snazy
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Databricks
>>>>>
>>>>> --
>>>>> Robert Stupp
>>>>> @snazy
>>>>>
>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Databricks
>>>
>>

Re: [DISCUSS] Fix property names in REST spec for statistics / partition statistics

Reply via email to