Hi Ryan,

I've created a revert PR [1]. I agree that we should take a more permissive
approach when reading a table, allowing for reading non-compliant table
metadata, especially for an opportunity to "fix" the metadata. However, I
think we still need a way to enforce the table specification to ensure that
other operations interact with a compliant table.

Perhaps we could permit `SnapshotParser` to read the metadata but enforce
the `operation` field at a different location to guarantee the table’s
compliance with the specification.

Best,
Kevin Liu

[1] https://github.com/apache/iceberg/pull/11409

On Sat, Oct 26, 2024 at 11:05 AM rdb...@gmail.com <rdb...@gmail.com> wrote:

> I see it's been merged, but I don't think it is a good idea to enforce
> this. The spec can and should require the `operation` but we want to be
> careful about creating situations where bad metadata can needlessly break a
> table. I would be much more permissive here, which is why this probably
> wasn't enforced in the first place.
>
> On Fri, Oct 25, 2024 at 2:36 PM Kevin Liu <kevin.jq....@gmail.com> wrote:
>
>> Thanks, everyone! The PR[1] has been merged
>>
>> Best,
>> Kevin Liu
>>
>> [1] https://github.com/apache/iceberg/pull/11354
>>
>>
>> On Fri, Oct 25, 2024 at 1:02 PM Kevin Liu <kevin.jq....@gmail.com> wrote:
>>
>>> Thanks, Ryan! That makes sense.
>>>
>>> I want to follow up on the original issue. I've made a PR [1] to enforce
>>> that the Snapshot `summary` map must have an `operation` key. Please take a
>>> look. Thank you @nastra for the comments and reviews.
>>>
>>> Best,
>>> Kevin Liu
>>>
>>> [1] https://github.com/apache/iceberg/pull/11354
>>>
>>>
>>>
>>> On Tue, Oct 22, 2024 at 4:06 PM rdb...@gmail.com <rdb...@gmail.com>
>>> wrote:
>>>
>>>> > For example, the `Snapshot` `summary` field is optional in V1 but
>>>> required in V2. Therefore, the REST spec definition should mark the
>>>> `summary` field as optional to support both versions.
>>>>
>>>> Yeah, this is technically true. But as I said in my first email, unless
>>>> you have tables that are 5 years old, it's unlikely that this is going to
>>>> be a problem. A failure here is more likely with newer implementations that
>>>> have a bug. So I'd argue there's value in leaving it as required.
>>>>
>>>> On Mon, Oct 21, 2024 at 9:41 AM Kevin Liu <kevin.jq....@gmail.com>
>>>> wrote:
>>>>
>>>>> > No. They were introduced at the same time.
>>>>> Great! Since the `summary` field and the `operation` key were
>>>>> introduced together, we should enforce the rule that the `summary`
>>>>> field must always have an accompanying `operation` key. This has been
>>>>> addressed in PR 11354 [1].
>>>>>
>>>>> > I am strongly against this. The REST spec should be independent of
>>>>> the table versions.
>>>>> That makes sense. For the REST spec to support both V1 and V2 tables,
>>>>> it should "accept" the least common denominator between the two versions.
>>>>> For example, the `Snapshot` `summary` field is optional in V1 but required
>>>>> in V2. Therefore, the REST spec definition should mark the `summary` field
>>>>> as optional to support both versions. However, the current REST spec leans
>>>>> towards the V2 table spec; fields that are optional in V1 and required in
>>>>> V2 are marked as required in the spec, such as `TableMetadata.table-uuid`
>>>>> [2][3] and `Snapshot.summary` [4][5].
>>>>>
>>>>> Would love to get other people's thoughts on this.
>>>>>
>>>>> Best,
>>>>> Kevin Liu
>>>>>
>>>>> [1] https://github.com/apache/iceberg/pull/11354
>>>>> [2]
>>>>> https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L2414
>>>>> [3] https://iceberg.apache.org/spec/#table-metadata-fields
>>>>> [4]
>>>>> https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L2325
>>>>> [5] https://iceberg.apache.org/spec/#snapshots
>>>>>
>>>>> On Sun, Oct 20, 2024 at 11:24 AM rdb...@gmail.com <rdb...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Was it ever valid to have a summary field without the operation key?
>>>>>>
>>>>>> No. They were introduced at the same time.
>>>>>>
>>>>>> Would it be helpful to create alternative versions of the REST spec
>>>>>> specifically for referencing V1 and V2 tables?
>>>>>>
>>>>>> I am strongly against this. The REST spec should be independent of
>>>>>> the table versions. Any table format version can be passed and the table
>>>>>> format should be the canonical reference for what is allowed. We want to
>>>>>> avoid cases where there are discrepancies. The table spec is canonical 
>>>>>> for
>>>>>> table metadata, and the REST spec allows passing it.
>>>>>>
>>>>>> On Sun, Oct 20, 2024 at 11:18 AM Kevin Liu <kevin.jq....@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey folks,
>>>>>>>
>>>>>>> Thanks, everyone for the discussion, and thanks Ryan for providing
>>>>>>> the historical context.
>>>>>>> Enforce the `operation` key in Snapshot’s `summary` field
>>>>>>>
>>>>>>> When serializing the `Snapshot` object from JSON, the Java
>>>>>>> implementation does not enforce that the `summary` field must contain an
>>>>>>> `operation` key. In the V1 spec, the `summary` field is optional, while 
>>>>>>> in
>>>>>>> the V2 spec, it is required. However, in both versions, if a `summary`
>>>>>>> field is present, it must include an `operation` key. Any `summary` 
>>>>>>> field
>>>>>>> lacking an `operation` key should be considered invalid.
>>>>>>>
>>>>>>> I’ve addressed this issue in PR 11354 [1] by adding this constraint
>>>>>>> when parsing JSON.
>>>>>>>
>>>>>>> > We initially did not have the snapshot summary or operation. When
>>>>>>> I added the summary, the operation was intended to be required in cases
>>>>>>> where the summary is present. It should always be there if the summary 
>>>>>>> is
>>>>>>> and the summary should always be there unless you wrote the
>>>>>>> metadata.json file way back in 2017 or 2018.
>>>>>>>
>>>>>>> @Ryan, does this constraint also apply to `metadata.json` files from
>>>>>>> 2017/2018? Was it ever valid to have a `summary` field without the
>>>>>>> `operation` key?
>>>>>>>
>>>>>>> > Well, the spec says nothing about a top-level `operation` field in
>>>>>>> JSON [1]. Yet the Java implementation produces it [2] and removes the
>>>>>>> operation from the summary map. This seems inconsistent?
>>>>>>>
>>>>>>> @Anton, the Java `Snapshot` object includes both the `summary` and
>>>>>>> `operation` fields. When serializing to JSON, the `operation` field is
>>>>>>> included in the `summary` map [2], rather than as a top-level field. 
>>>>>>> During
>>>>>>> deserialization from JSON, the `operation` field is extracted from the
>>>>>>> `summary` map [3].
>>>>>>>
>>>>>>> I believe this is consistent with the table spec, which defines the
>>>>>>> JSON output, not how the `Snapshot` object is implemented in Java.
>>>>>>> On REST spec and Table spec
>>>>>>>
>>>>>>> Thanks, Yufei, for highlighting the difference between the REST spec
>>>>>>> and the table spec. I mistakenly used the REST spec
>>>>>>> (`rest-catalog-open-api.yaml` [4]) as the source of truth for V2 tables.
>>>>>>>
>>>>>>> Looking at the REST spec file, it can be challenging to determine
>>>>>>> how a REST server should handle V1 versus V2 tables. Even for V2 tables,
>>>>>>> the current version of the file combines features from V2, along with
>>>>>>> additional changes made in preparation for the upcoming V3 spec.
>>>>>>>
>>>>>>> Would it be helpful to create alternative versions of the REST spec
>>>>>>> specifically for referencing V1 and V2 tables? The goal would be to 
>>>>>>> have a
>>>>>>> "frozen" version of the REST spec dedicated to V1 tables and another 
>>>>>>> for V2
>>>>>>> tables while allowing the current REST spec file to evolve as needed.
>>>>>>>
>>>>>>> Taking a step back, I think we need more documentation on the REST
>>>>>>> spec, including support for different table versions and guidance on
>>>>>>> upgrading from one version to another. I’d love to hear everyone’s 
>>>>>>> thoughts
>>>>>>> on this.
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Kevin Liu
>>>>>>>
>>>>>>>
>>>>>>> [1] https://github.com/apache/iceberg/pull/11354
>>>>>>>
>>>>>>> [2]
>>>>>>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L63-L66
>>>>>>>
>>>>>>> [3]
>>>>>>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124-L137
>>>>>>>
>>>>>>> [4]
>>>>>>> https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Oct 19, 2024 at 7:48 PM Sung Yun <sun...@apache.org> wrote:
>>>>>>>
>>>>>>>> Hi Ryan, thank you for your response!
>>>>>>>>
>>>>>>>> That detailed context is very helpful in allowing me to
>>>>>>>> understanding why the REST catalog spec has evolved the way it has, 
>>>>>>>> and how
>>>>>>>> the Table Spec and the REST Catalog Spec should each be referenced in 
>>>>>>>> the
>>>>>>>> sub-communities (like in PyIceberg). I'll keep those motivations in 
>>>>>>>> mind as
>>>>>>>> we discuss those Specs in the future.
>>>>>>>>
>>>>>>>> Also, here's a small PR to specify more explicitly that the
>>>>>>>> operation field should be a required field in the summary field:
>>>>>>>> https://github.com/apache/iceberg/pull/11355
>>>>>>>>
>>>>>>>> Sung
>>>>>>>>
>>>>>>>> On 2024/10/19 22:14:59 "rdb...@gmail.com" wrote:
>>>>>>>> > I can provide some historical context here about how the table
>>>>>>>> spec evolved
>>>>>>>> > and how the REST spec works with respect to table versions.
>>>>>>>> >
>>>>>>>> > We initially did not have the snapshot summary or operation. When
>>>>>>>> I added
>>>>>>>> > the summary, the operation was intended to be required in cases
>>>>>>>> where the
>>>>>>>> > summary is present. It should always be there if the summary is
>>>>>>>> and the
>>>>>>>> > summary should always be there unless you wrote the metadata.json
>>>>>>>> file way
>>>>>>>> > back in 2017 or 2018. It looks like the spec could be more clear
>>>>>>>> that the
>>>>>>>> > operation is required when summary is present. Anyone want to
>>>>>>>> open a PR?
>>>>>>>> >
>>>>>>>> > Anton, I don't think there is a top-level operation field. The
>>>>>>>> Java
>>>>>>>> > Snapshot class tracks the operation as top-level, but it is
>>>>>>>> always stored
>>>>>>>> > in the summary. I think this is consistent with the spec.
>>>>>>>> >
>>>>>>>> > For the REST spec, I think that it should be strictly optional to
>>>>>>>> support
>>>>>>>> > v1 tables written with no summary, but it should always be
>>>>>>>> present. I'd
>>>>>>>> > probably leave it required since it already is and is catching a
>>>>>>>> valuable
>>>>>>>> > error case here.
>>>>>>>> >
>>>>>>>> > When building the REST spec, I took the same approach as the Java
>>>>>>>> parser
>>>>>>>> > (which is also to parse table metadata coming from REST servers).
>>>>>>>> That is,
>>>>>>>> > it is compatible with v1 metadata; fields that were not required
>>>>>>>> in v1 are
>>>>>>>> > optional. This fits with the principle of "be liberal in what you
>>>>>>>> accept
>>>>>>>> > and strict in what you produce". The REST spec allows passing
>>>>>>>> metadata for
>>>>>>>> > any table version so that the specs are not tightly coupled. The
>>>>>>>> table
>>>>>>>> > version is passed when loading and clients should reject table
>>>>>>>> versions
>>>>>>>> > that are newer than what they can support. The REST protocol just
>>>>>>>> needs to
>>>>>>>> > facilitate passing the table metadata.
>>>>>>>> >
>>>>>>>> > Most v2 structures, like the `schemas` list, are introduced as
>>>>>>>> optional in
>>>>>>>> > v1 and made required in v2. That way, it's possible to add
>>>>>>>> metadata to
>>>>>>>> > existing format versions and make the upgrade path easier.
>>>>>>>> Maintaining the
>>>>>>>> > newer structures even though there are different writer versions
>>>>>>>> deployed
>>>>>>>> > is one of the reasons why the REST spec changes to a change-based
>>>>>>>> model.
>>>>>>>> > New metadata only needs to be supported by the service
>>>>>>>> maintaining the
>>>>>>>> > metadata.json files and any writers that want to update it.
>>>>>>>> >
>>>>>>>> > I see some points about being able to remove old table versions.
>>>>>>>> I don't
>>>>>>>> > think that the REST protocol itself is the place to do this. The
>>>>>>>> protocol
>>>>>>>> > is format-agnostic. Implementations are free to reject requests
>>>>>>>> to create
>>>>>>>> > tables with older versions, or to update the table to a new
>>>>>>>> version.
>>>>>>>> >
>>>>>>>> > Ryan
>>>>>>>> >
>>>>>>>> > On Fri, Oct 18, 2024 at 6:42 AM Sung Yun <sun...@apache.org>
>>>>>>>> wrote:
>>>>>>>> >
>>>>>>>> > > Folks, thank you for your responses! These area really helpful
>>>>>>>> insights.
>>>>>>>> > >
>>>>>>>> > > > I agree that the REST spec should aim to support v1, v2, and
>>>>>>>> potentially
>>>>>>>> > > the upcoming v3. In practice, it seems like the choice of table
>>>>>>>> spec might
>>>>>>>> > > ultimately be dictated by the REST catalog implementation.
>>>>>>>> > >
>>>>>>>> > > > A best practice would be for the server to strive to support
>>>>>>>> all Iceberg
>>>>>>>> > > versions, but the REST spec itself should remain flexible
>>>>>>>> enough to
>>>>>>>> > > accommodate less strict table specs.
>>>>>>>> > >
>>>>>>>> > > Yufei, yes that makes sense, and I agree that the server should
>>>>>>>> strive to
>>>>>>>> > > support all format versions, because otherwise the an older
>>>>>>>> client
>>>>>>>> > > application, may just not be compatible with a REST Catalog
>>>>>>>> running on a
>>>>>>>> > > higher version of table spec.  I think we have two choices here
>>>>>>>> in ensuring
>>>>>>>> > > that the REST Catalog server is able to support multiple
>>>>>>>> versions of the
>>>>>>>> > > Table Spec:
>>>>>>>> > >
>>>>>>>> > > 1. We could create single components that are common
>>>>>>>> denominators of all
>>>>>>>> > > existing table specs to accommodate the less table specs. The
>>>>>>>> REST Catalog
>>>>>>>> > > Spec currently falls short in this approach, and I've put up
>>>>>>>> this PR to
>>>>>>>> > > show what this change would look like just for the Snapshot
>>>>>>>> component:
>>>>>>>> > > https://github.com/apache/iceberg/pull/11353 - My take on this
>>>>>>>> is that,
>>>>>>>> > > this approach will make the REST catalog spec more confusing
>>>>>>>> and difficult
>>>>>>>> > > to manage as we add more Table Spec versions moving forward.
>>>>>>>> The discussion
>>>>>>>> > > on this mail list thread is I think a great demonstration of
>>>>>>>> that confusion
>>>>>>>> > > :)
>>>>>>>> > > 2. We could instead create separate Table Spec version specific
>>>>>>>> components
>>>>>>>> > > on the REST Catalog Open API Spec. For example, a Snapshot
>>>>>>>> component could
>>>>>>>> > > be anyOf SnapshotV1 and SnapshotV2, which match the Table Spec
>>>>>>>> V1 and V2
>>>>>>>> > > definitions. I think creating explicit components that match
>>>>>>>> the spec
>>>>>>>> > > definitions will work in our favor when we continue to
>>>>>>>> introduce more Spec
>>>>>>>> > > changes and manage their lifecycles. And perhaps, maybe we
>>>>>>>> could also
>>>>>>>> > > indicate what format-versions the REST Catalog Server supports
>>>>>>>> through an
>>>>>>>> > > endpoint, and communicate it to a client application.
>>>>>>>> > >
>>>>>>>> > > I'd love to hear the community's opinion on suggestion (2)! I'm
>>>>>>>> very
>>>>>>>> > > curious to hear if we've considered it before.
>>>>>>>> > >
>>>>>>>> > > Sung
>>>>>>>> > >
>>>>>>>> > > On 2024/10/18 05:13:15 Péter Váry wrote:
>>>>>>>> > > > Hi Team,
>>>>>>>> > > > Apart from fixing this current issue by relaxing the current
>>>>>>>> spec
>>>>>>>> > > > constraints, to support both v1 and v2 specifications, we
>>>>>>>> should think
>>>>>>>> > > > about how to handle table spec evolution for the long term.
>>>>>>>> > > >
>>>>>>>> > > > What are the base factors we can start from (please add your
>>>>>>>> own ideas
>>>>>>>> > > if I
>>>>>>>> > > > have missed something):
>>>>>>>> > > > - We evolve the specifications in a way that is backwards
>>>>>>>> compatible (v1
>>>>>>>> > > > table could be read by v2 reader) but not forwards compatible
>>>>>>>> (v2 table
>>>>>>>> > > > could not be read by an old reader)
>>>>>>>> > > > - The rest spec ideally should conform to the currently used
>>>>>>>> table spec
>>>>>>>> > > > schema/constraints
>>>>>>>> > > > - REST catalogs sooner-or-later would like to drop support
>>>>>>>> for older
>>>>>>>> > > table
>>>>>>>> > > > spec. We need to avoid the situation of Hive Metastore, where
>>>>>>>> the
>>>>>>>> > > decisions
>>>>>>>> > > > made 10 years ago prevented enhancing the APIs as the old
>>>>>>>> specifications
>>>>>>>> > > > were supported forever.
>>>>>>>> > > >
>>>>>>>> > > > Probably (when the spec difference becomes big enough) a
>>>>>>>> composit request
>>>>>>>> > > > (version + different content spec) or a different endpoint
>>>>>>>> will be
>>>>>>>> > > required.
>>>>>>>> > > >
>>>>>>>> > > > Thanks, Peter
>>>>>>>> > > >
>>>>>>>> > > > On Thu, Oct 17, 2024, 23:11 Yufei Gu <flyrain...@gmail.com>
>>>>>>>> wrote:
>>>>>>>> > > >
>>>>>>>> > > > > Hi Sung,
>>>>>>>> > > > >
>>>>>>>> > > > > It seems we are running to issues related to a mismatch
>>>>>>>> between the
>>>>>>>> > > REST
>>>>>>>> > > > > spec and table specifications. Currently, there's no clear
>>>>>>>> definition
>>>>>>>> > > of
>>>>>>>> > > > > how the REST spec is meant to support different table
>>>>>>>> specs. The
>>>>>>>> > > closest
>>>>>>>> > > > > reference I found is this statement
>>>>>>>> > > > > <
>>>>>>>> > >
>>>>>>>> https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L30-L30
>>>>>>>> > > >
>>>>>>>> > > > > in the REST spec.
>>>>>>>> > > > >
>>>>>>>> > > > > Implementations should ideally support both Iceberg table
>>>>>>>> specs v1 and
>>>>>>>> > > v2,
>>>>>>>> > > > >> with priority given to v2.
>>>>>>>> > > > >
>>>>>>>> > > > >
>>>>>>>> > > > > I agree that the REST spec should aim to support v1, v2, and
>>>>>>>> > > potentially
>>>>>>>> > > > > the upcoming v3. In practice, it seems like the choice of
>>>>>>>> table spec
>>>>>>>> > > might
>>>>>>>> > > > > ultimately be dictated by the REST catalog implementation.
>>>>>>>> > > > >
>>>>>>>> > > > >
>>>>>>>> > > > > A best practice would be for the server to strive to
>>>>>>>> support all
>>>>>>>> > > Iceberg
>>>>>>>> > > > > versions, but the REST spec itself should remain flexible
>>>>>>>> enough to
>>>>>>>> > > > > accommodate less strict table specs. For the case you
>>>>>>>> mentioned, it
>>>>>>>> > > should
>>>>>>>> > > > > be fine to make sequence number optional since the spec has
>>>>>>>> to support
>>>>>>>> > > v1
>>>>>>>> > > > > table spec. It does feel confusing though.
>>>>>>>> > > > >
>>>>>>>> > > > >
>>>>>>>> > > > > WDYT?
>>>>>>>> > > > >
>>>>>>>> > > > >
>>>>>>>> > > > > Yufei
>>>>>>>> > > > >
>>>>>>>> > > > >
>>>>>>>> > > > > On Thu, Oct 17, 2024 at 1:56 PM Anton Okolnychyi <
>>>>>>>> > > aokolnyc...@gmail.com>
>>>>>>>> > > > > wrote:
>>>>>>>> > > > >
>>>>>>>> > > > >> Well, the spec says nothing about a top-level `operation`
>>>>>>>> field in
>>>>>>>> > > JSON
>>>>>>>> > > > >> [1]. Yet the Java implementation produces it [2] and
>>>>>>>> removes the
>>>>>>>> > > operation
>>>>>>>> > > > >> from the summary map. This seems inconsistent?
>>>>>>>> > > > >>
>>>>>>>> > > > >> - Anton
>>>>>>>> > > > >>
>>>>>>>> > > > >> [1] - https://iceberg.apache.org/spec/#snapshots
>>>>>>>> > > > >> [2] -
>>>>>>>> > > > >>
>>>>>>>> > >
>>>>>>>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L63
>>>>>>>> > > > >>
>>>>>>>> > > > >>
>>>>>>>> > > > >> чт, 17 жовт. 2024 р. о 10:06 Sung Yun <sun...@apache.org>
>>>>>>>> пише:
>>>>>>>> > > > >>
>>>>>>>> > > > >>> > As a side note, the `rest-catalog-open-api.yaml` file
>>>>>>>> [2] in the
>>>>>>>> > > > >>> Iceberg repo contains the latest version of the spec.
>>>>>>>> > > > >>>
>>>>>>>> > > > >>> I think more clarity on this would be helpful. Is it
>>>>>>>> really the case
>>>>>>>> > > > >>> that the Open API spec contains the latest version of the
>>>>>>>> spec? For
>>>>>>>> > > > >>> example, I'm noticing a discrepancy between
>>>>>>>> sequence-number in the
>>>>>>>> > > Table
>>>>>>>> > > > >>> Spec and in the Open API Spec...
>>>>>>>> > > > >>>
>>>>>>>> > > > >>> In the table spec, it's required for V2, but it's
>>>>>>>> optional in the
>>>>>>>> > > REST
>>>>>>>> > > > >>> API Spec:
>>>>>>>> > > > >>>
>>>>>>>> > > > >>>
>>>>>>>> > >
>>>>>>>> https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L2319-L2335
>>>>>>>> > > > >>>
>>>>>>>> > > > >>> On 2024/10/17 16:58:17 Kevin Liu wrote:
>>>>>>>> > > > >>> > > Based on the example metadata, that looks like it is
>>>>>>>> not to
>>>>>>>> > > spec, so
>>>>>>>> > > > >>> it's
>>>>>>>> > > > >>> > reasonable that python would reject it.  If the java
>>>>>>>> > > implementation is
>>>>>>>> > > > >>> > allowing for that, it's likely that we're being too
>>>>>>>> relaxed
>>>>>>>> > > (possibly a
>>>>>>>> > > > >>> > holdover from v1 parsing).
>>>>>>>> > > > >>> > I believe the Java implementation is relaxing the
>>>>>>>> constraint. I'll
>>>>>>>> > > > >>> create a
>>>>>>>> > > > >>> > PR with test cases and the necessary changes.
>>>>>>>> > > > >>> >
>>>>>>>> > > > >>> > > Do you know what produced the metadata?
>>>>>>>> > > > >>> > It was created by Snowflake [1]. After verifying this,
>>>>>>>> I'll look
>>>>>>>> > > into
>>>>>>>> > > > >>> > raising the issue with them.
>>>>>>>> > > > >>> >
>>>>>>>> > > > >>> > As a side note, the `rest-catalog-open-api.yaml` file
>>>>>>>> [2] in the
>>>>>>>> > > > >>> Iceberg
>>>>>>>> > > > >>> > repo contains the latest version of the spec. As we're
>>>>>>>> continuing
>>>>>>>> > > to
>>>>>>>> > > > >>> evolve
>>>>>>>> > > > >>> > to spec for V3, would it be helpful to create a frozen
>>>>>>>> version
>>>>>>>> > > > >>> representing
>>>>>>>> > > > >>> > both the V1 and V2 specs for reference, possibly as a
>>>>>>>> separate
>>>>>>>> > > file?
>>>>>>>> > > > >>> >
>>>>>>>> > > > >>> > Best,
>>>>>>>> > > > >>> > Kevin Liu
>>>>>>>> > > > >>> >
>>>>>>>> > > > >>> > [1]
>>>>>>>> > > > >>> >
>>>>>>>> > > > >>>
>>>>>>>> > >
>>>>>>>> https://github.com/apache/iceberg-python/issues/1106#issuecomment-2312108455
>>>>>>>> > > > >>> > [2]
>>>>>>>> > > > >>> >
>>>>>>>> > > > >>>
>>>>>>>> > >
>>>>>>>> https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml
>>>>>>>> > > > >>> >
>>>>>>>> > > > >>> > On Thu, Oct 17, 2024 at 9:20 AM Daniel Weeks <
>>>>>>>> dwe...@apache.org>
>>>>>>>> > > > >>> wrote:
>>>>>>>> > > > >>> >
>>>>>>>> > > > >>> > > Sung,
>>>>>>>> > > > >>> > >
>>>>>>>> > > > >>> > > I was thinking of v1, so you're right that
>>>>>>>> manifest-list and
>>>>>>>> > > summary
>>>>>>>> > > > >>> are
>>>>>>>> > > > >>> > > required as of v2.  The REST Spec seems to follow the
>>>>>>>> v2
>>>>>>>> > > definition,
>>>>>>>> > > > >>> so I
>>>>>>>> > > > >>> > > think we're somewhat implicitly requiring those
>>>>>>>> fields via REST.
>>>>>>>> > > > >>> > >
>>>>>>>> > > > >>> > > Kevin,
>>>>>>>> > > > >>> > >
>>>>>>>> > > > >>> > > Based on the example metadata, that looks like it is
>>>>>>>> not to
>>>>>>>> > > spec, so
>>>>>>>> > > > >>> it's
>>>>>>>> > > > >>> > > reasonable that python would reject it.  If the java
>>>>>>>> > > implementation
>>>>>>>> > > > >>> is
>>>>>>>> > > > >>> > > allowing for that, it's likely that we're being too
>>>>>>>> relaxed
>>>>>>>> > > > >>> (possibly a
>>>>>>>> > > > >>> > > holdover from v1 parsing).
>>>>>>>> > > > >>> > >
>>>>>>>> > > > >>> > > Do you know what produced the metadata?
>>>>>>>> > > > >>> > >
>>>>>>>> > > > >>> > > -Dan
>>>>>>>> > > > >>> > >
>>>>>>>> > > > >>> > > On Thu, Oct 17, 2024 at 9:02 AM Kevin Liu <
>>>>>>>> > > kevin.jq....@gmail.com>
>>>>>>>> > > > >>> wrote:
>>>>>>>> > > > >>> > >
>>>>>>>> > > > >>> > >> Thanks for the additional context.
>>>>>>>> > > > >>> > >>
>>>>>>>> > > > >>> > >> My understanding is that if a Snapshot has a
>>>>>>>> `summary` field, it
>>>>>>>> > > > >>> must
>>>>>>>> > > > >>> > >> also have a corresponding `operation` key in the
>>>>>>>> summary map. Is
>>>>>>>> > > > >>> that
>>>>>>>> > > > >>> > >> correct? Based on the `SnapshotParser`, this is not
>>>>>>>> enforced
>>>>>>>> > > [1].
>>>>>>>> > > > >>> > >>
>>>>>>>> > > > >>> > >> The underlying issue in #1106 [2] is the missing
>>>>>>>> `operation`
>>>>>>>> > > field
>>>>>>>> > > > >>> when
>>>>>>>> > > > >>> > >> the `summary` field is present.
>>>>>>>> > > > >>> > >> For example,
>>>>>>>> > > > >>> > >> ```
>>>>>>>> > > > >>> > >>     "summary" : {
>>>>>>>> > > > >>> > >>       "manifests-created" : "8",
>>>>>>>> > > > >>> > >>       "total-records" : "26508666891",
>>>>>>>> > > > >>> > >>       "added-files-size" : "3927895626752",
>>>>>>>> > > > >>> > >>       "manifests-kept" : "0",
>>>>>>>> > > > >>> > >>       "total-files-size" : "3927895626752",
>>>>>>>> > > > >>> > >>       "added-records" : "26508666891",
>>>>>>>> > > > >>> > >>       "added-data-files" : "231513",
>>>>>>>> > > > >>> > >>       "manifests-replaced" : "0",
>>>>>>>> > > > >>> > >>       "total-data-files" : "231513"
>>>>>>>> > > > >>> > >>     }
>>>>>>>> > > > >>> > >> ```
>>>>>>>> > > > >>> > >>
>>>>>>>> > > > >>> > >> It could be the case that this particular
>>>>>>>> `metadata.json` was
>>>>>>>> > > > >>> generated
>>>>>>>> > > > >>> > >> not according to the spec.
>>>>>>>> > > > >>> > >>
>>>>>>>> > > > >>> > >> Best,
>>>>>>>> > > > >>> > >> Kevin Liu
>>>>>>>> > > > >>> > >>
>>>>>>>> > > > >>> > >>
>>>>>>>> > > > >>> > >> [1]
>>>>>>>> > > > >>> > >>
>>>>>>>> > > > >>>
>>>>>>>> > >
>>>>>>>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124-L142
>>>>>>>> > > > >>> > >> [2]
>>>>>>>> https://github.com/apache/iceberg-python/issues/1106
>>>>>>>> > > > >>> > >>
>>>>>>>> > > > >>> > >>
>>>>>>>> > > > >>> > >> On Thu, Oct 17, 2024 at 8:47 AM Sung Yun <
>>>>>>>> sun...@apache.org>
>>>>>>>> > > wrote:
>>>>>>>> > > > >>> > >>
>>>>>>>> > > > >>> > >>> Thank you for the clarification Daniel, and thank
>>>>>>>> you Kevin for
>>>>>>>> > > > >>> raising
>>>>>>>> > > > >>> > >>> this issue!
>>>>>>>> > > > >>> > >>>
>>>>>>>> > > > >>> > >>> Does that mean that we are creating component
>>>>>>>> schemas that are
>>>>>>>> > > the
>>>>>>>> > > > >>> > >>> superset of the V1 and V2 schemas? And if so,
>>>>>>>> should we remove
>>>>>>>> > > > >>> summary and
>>>>>>>> > > > >>> > >>> manifest-list from the required properties, and add
>>>>>>>> manifests
>>>>>>>> > > > >>> optional
>>>>>>>> > > > >>> > >>> property to the Snapshot schema to support both V1
>>>>>>>> and V2
>>>>>>>> > > Summary
>>>>>>>> > > > >>> specs?
>>>>>>>> > > > >>> > >>> https://iceberg.apache.org/spec/#snapshots
>>>>>>>> > > > >>> > >>>
>>>>>>>> > > > >>> > >>> Or would creating separate component schemas for
>>>>>>>> V1/V2 be a
>>>>>>>> > > > >>> cleaner way
>>>>>>>> > > > >>> > >>> to align the REST spec with the table spec?
>>>>>>>> > > > >>> > >>>
>>>>>>>> > > > >>> > >>> Sung
>>>>>>>> > > > >>> > >>>
>>>>>>>> > > > >>> > >>> On 2024/10/17 15:19:23 Daniel Weeks wrote:
>>>>>>>> > > > >>> > >>> > I'm not convinced this is incorrect behavior
>>>>>>>> (table spec or
>>>>>>>> > > > >>> > >>> > implementation), but it does lend to some
>>>>>>>> confusion.  The
>>>>>>>> > > > >>> 'summary'
>>>>>>>> > > > >>> > >>> field
>>>>>>>> > > > >>> > >>> > is optional, which means that if a summary is not
>>>>>>>> provided,
>>>>>>>> > > you
>>>>>>>> > > > >>> do not
>>>>>>>> > > > >>> > >>> have
>>>>>>>> > > > >>> > >>> > an associated 'operation' field.  The 'operation'
>>>>>>>> field is
>>>>>>>> > > only
>>>>>>>> > > > >>> > >>> required in
>>>>>>>> > > > >>> > >>> > the context of the summary, so it's actually
>>>>>>>> possible for the
>>>>>>>> > > > >>> > >>> > implementation (i.e. the tests you reference) to
>>>>>>>> not have an
>>>>>>>> > > > >>> operation.
>>>>>>>> > > > >>> > >>> >
>>>>>>>> > > > >>> > >>> > I think what is wrong here is that the REST spec
>>>>>>>> marked the
>>>>>>>> > > > >>> summary as
>>>>>>>> > > > >>> > >>> > required
>>>>>>>> > > > >>> > >>> > <
>>>>>>>> > > > >>> > >>>
>>>>>>>> > > > >>>
>>>>>>>> > >
>>>>>>>> https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml#L2040
>>>>>>>> > > > >>> > >>> >,
>>>>>>>> > > > >>> > >>> > which is inconsistent with the table spec.
>>>>>>>> > > > >>> > >>> >
>>>>>>>> > > > >>> > >>> > On Wed, Oct 16, 2024 at 3:52 PM Anton Okolnychyi <
>>>>>>>> > > > >>> > >>> aokolnyc...@gmail.com>
>>>>>>>> > > > >>> > >>> > wrote:
>>>>>>>> > > > >>> > >>> >
>>>>>>>> > > > >>> > >>> > > Based on [1], we never persisted the operation
>>>>>>>> in the
>>>>>>>> > > summary
>>>>>>>> > > > >>> map.
>>>>>>>> > > > >>> > >>> > > Instead, we persisted it as a top-level field
>>>>>>>> in Java,
>>>>>>>> > > which is
>>>>>>>> > > > >>> > >>> actually
>>>>>>>> > > > >>> > >>> > > NOT what the spec says. Does anyone remember
>>>>>>>> cases when the
>>>>>>>> > > > >>> > >>> operation was
>>>>>>>> > > > >>> > >>> > > unknown? I personally don't.
>>>>>>>> > > > >>> > >>> > >
>>>>>>>> > > > >>> > >>> > > [1] -
>>>>>>>> > > > >>> > >>> > >
>>>>>>>> > > > >>> > >>>
>>>>>>>> > > > >>>
>>>>>>>> > >
>>>>>>>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L63
>>>>>>>> > > > >>> > >>> > >
>>>>>>>> > > > >>> > >>> > >
>>>>>>>> > > > >>> > >>> > > ср, 16 жовт. 2024 р. о 12:42 Kevin Liu <
>>>>>>>> > > kevin.jq....@gmail.com
>>>>>>>> > > > >>> >
>>>>>>>> > > > >>> > >>> пише:
>>>>>>>> > > > >>> > >>> > >
>>>>>>>> > > > >>> > >>> > >> Hey folks,
>>>>>>>> > > > >>> > >>> > >>
>>>>>>>> > > > >>> > >>> > >> I’ve noticed a discrepancy between the Iceberg
>>>>>>>> > > specification
>>>>>>>> > > > >>> and
>>>>>>>> > > > >>> > >>> the Java
>>>>>>>> > > > >>> > >>> > >> implementation regarding the `operation` key
>>>>>>>> in the
>>>>>>>> > > `Snapshot`
>>>>>>>> > > > >>> > >>> `summary`
>>>>>>>> > > > >>> > >>> > >> field.
>>>>>>>> > > > >>> > >>> > >>
>>>>>>>> > > > >>> > >>> > >> The `Snapshot` object's `summary` dictionary
>>>>>>>> includes a
>>>>>>>> > > > >>> *required*
>>>>>>>> > > > >>> > >>> key
>>>>>>>> > > > >>> > >>> > >> named `operation`, as outlined in the spec
>>>>>>>> describing
>>>>>>>> > > Table
>>>>>>>> > > > >>> > >>> Metadata and
>>>>>>>> > > > >>> > >>> > >> Snapshots [1] and the generated OpenAPI YAML
>>>>>>>> [2].
>>>>>>>> > > However, in
>>>>>>>> > > > >>> the
>>>>>>>> > > > >>> > >>> Java
>>>>>>>> > > > >>> > >>> > >> implementation [3], `operation` is treated as
>>>>>>>> optional. In
>>>>>>>> > > > >>> > >>> contrast, it
>>>>>>>> > > > >>> > >>> > >> remains a required field in the Python
>>>>>>>> implementation [4].
>>>>>>>> > > > >>> > >>> > >> I also found that Java tests for
>>>>>>>> `SnapshotParser` assert
>>>>>>>> > > that
>>>>>>>> > > > >>> the
>>>>>>>> > > > >>> > >>> > >> `operation` field is null. [5]
>>>>>>>> > > > >>> > >>> > >>
>>>>>>>> > > > >>> > >>> > >> Due to this discrepancy, a user reported [6]
>>>>>>>> that the
>>>>>>>> > > > >>> > >>> `metadata.json`
>>>>>>>> > > > >>> > >>> > >> file generated for an Iceberg table could not
>>>>>>>> be read by
>>>>>>>> > > > >>> PyIceberg,
>>>>>>>> > > > >>> > >>> though
>>>>>>>> > > > >>> > >>> > >> it is readable using the Iceberg Java library.
>>>>>>>> > > > >>> > >>> > >>
>>>>>>>> > > > >>> > >>> > >> How should we proceed from here? Should the
>>>>>>>> Java library
>>>>>>>> > > > >>> enforce
>>>>>>>> > > > >>> > >>> this
>>>>>>>> > > > >>> > >>> > >> requirement? Additionally, how should we
>>>>>>>> handle existing
>>>>>>>> > > > >>> > >>> `metadata.json`
>>>>>>>> > > > >>> > >>> > >> files that were generated without this field?
>>>>>>>> > > > >>> > >>> > >>
>>>>>>>> > > > >>> > >>> > >> Best,
>>>>>>>> > > > >>> > >>> > >> Kevin Liu
>>>>>>>> > > > >>> > >>> > >>
>>>>>>>> > > > >>> > >>> > >> [1]
>>>>>>>> > > > >>>
>>>>>>>> https://iceberg.apache.org/spec/#table-metadata-and-snapshots
>>>>>>>> > > > >>> > >>> > >> [2]
>>>>>>>> > > > >>> > >>> > >>
>>>>>>>> > > > >>> > >>>
>>>>>>>> > > > >>>
>>>>>>>> > >
>>>>>>>> https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml#L2057-L2060
>>>>>>>> > > > >>> > >>> > >> [3]
>>>>>>>> > > > >>> > >>> > >>
>>>>>>>> > > > >>> > >>>
>>>>>>>> > > > >>>
>>>>>>>> > >
>>>>>>>> https://github.com/apache/iceberg/blob/64b36999d7ff716ae2534fb0972fcc10d22a64c2/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124
>>>>>>>> > > > >>> > >>> > >> [4]
>>>>>>>> > > > >>> > >>> > >>
>>>>>>>> > > > >>> > >>>
>>>>>>>> > > > >>>
>>>>>>>> > >
>>>>>>>> https://github.com/apache/iceberg-python/blob/7cf0c225c3cdb32ac5e390de06b7b0e4fe7de92e/pyiceberg/table/snapshots.py#L182
>>>>>>>> > > > >>> > >>> > >> [5]
>>>>>>>> > > > >>> > >>> > >>
>>>>>>>> > > > >>> > >>>
>>>>>>>> > > > >>>
>>>>>>>> > >
>>>>>>>> https://github.com/apache/iceberg/blob/22a6b19c2e226eacc0aa78c1f2ffbdbb168b13be/core/src/test/java/org/apache/iceberg/TestSnapshotJson.java#L52
>>>>>>>> > > > >>> > >>> > >> [6]
>>>>>>>> https://github.com/apache/iceberg-python/issues/1106
>>>>>>>> > > > >>> > >>> > >>
>>>>>>>> > > > >>> > >>> > >>
>>>>>>>> > > > >>> > >>> >
>>>>>>>> > > > >>> > >>>
>>>>>>>> > > > >>> > >>
>>>>>>>> > > > >>> >
>>>>>>>> > > > >>>
>>>>>>>> > > > >>
>>>>>>>> > > >
>>>>>>>> > >
>>>>>>>> >
>>>>>>>>
>>>>>>>

Reply via email to