Hi Ryan, I've created a revert PR [1]. I agree that we should take a more permissive approach when reading a table, allowing for reading non-compliant table metadata, especially for an opportunity to "fix" the metadata. However, I think we still need a way to enforce the table specification to ensure that other operations interact with a compliant table.
Perhaps we could permit `SnapshotParser` to read the metadata but enforce the `operation` field at a different location to guarantee the table’s compliance with the specification. Best, Kevin Liu [1] https://github.com/apache/iceberg/pull/11409 On Sat, Oct 26, 2024 at 11:05 AM rdb...@gmail.com <rdb...@gmail.com> wrote: > I see it's been merged, but I don't think it is a good idea to enforce > this. The spec can and should require the `operation` but we want to be > careful about creating situations where bad metadata can needlessly break a > table. I would be much more permissive here, which is why this probably > wasn't enforced in the first place. > > On Fri, Oct 25, 2024 at 2:36 PM Kevin Liu <kevin.jq....@gmail.com> wrote: > >> Thanks, everyone! The PR[1] has been merged >> >> Best, >> Kevin Liu >> >> [1] https://github.com/apache/iceberg/pull/11354 >> >> >> On Fri, Oct 25, 2024 at 1:02 PM Kevin Liu <kevin.jq....@gmail.com> wrote: >> >>> Thanks, Ryan! That makes sense. >>> >>> I want to follow up on the original issue. I've made a PR [1] to enforce >>> that the Snapshot `summary` map must have an `operation` key. Please take a >>> look. Thank you @nastra for the comments and reviews. >>> >>> Best, >>> Kevin Liu >>> >>> [1] https://github.com/apache/iceberg/pull/11354 >>> >>> >>> >>> On Tue, Oct 22, 2024 at 4:06 PM rdb...@gmail.com <rdb...@gmail.com> >>> wrote: >>> >>>> > For example, the `Snapshot` `summary` field is optional in V1 but >>>> required in V2. Therefore, the REST spec definition should mark the >>>> `summary` field as optional to support both versions. >>>> >>>> Yeah, this is technically true. But as I said in my first email, unless >>>> you have tables that are 5 years old, it's unlikely that this is going to >>>> be a problem. A failure here is more likely with newer implementations that >>>> have a bug. So I'd argue there's value in leaving it as required. >>>> >>>> On Mon, Oct 21, 2024 at 9:41 AM Kevin Liu <kevin.jq....@gmail.com> >>>> wrote: >>>> >>>>> > No. They were introduced at the same time. >>>>> Great! Since the `summary` field and the `operation` key were >>>>> introduced together, we should enforce the rule that the `summary` >>>>> field must always have an accompanying `operation` key. This has been >>>>> addressed in PR 11354 [1]. >>>>> >>>>> > I am strongly against this. The REST spec should be independent of >>>>> the table versions. >>>>> That makes sense. For the REST spec to support both V1 and V2 tables, >>>>> it should "accept" the least common denominator between the two versions. >>>>> For example, the `Snapshot` `summary` field is optional in V1 but required >>>>> in V2. Therefore, the REST spec definition should mark the `summary` field >>>>> as optional to support both versions. However, the current REST spec leans >>>>> towards the V2 table spec; fields that are optional in V1 and required in >>>>> V2 are marked as required in the spec, such as `TableMetadata.table-uuid` >>>>> [2][3] and `Snapshot.summary` [4][5]. >>>>> >>>>> Would love to get other people's thoughts on this. >>>>> >>>>> Best, >>>>> Kevin Liu >>>>> >>>>> [1] https://github.com/apache/iceberg/pull/11354 >>>>> [2] >>>>> https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L2414 >>>>> [3] https://iceberg.apache.org/spec/#table-metadata-fields >>>>> [4] >>>>> https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L2325 >>>>> [5] https://iceberg.apache.org/spec/#snapshots >>>>> >>>>> On Sun, Oct 20, 2024 at 11:24 AM rdb...@gmail.com <rdb...@gmail.com> >>>>> wrote: >>>>> >>>>>> Was it ever valid to have a summary field without the operation key? >>>>>> >>>>>> No. They were introduced at the same time. >>>>>> >>>>>> Would it be helpful to create alternative versions of the REST spec >>>>>> specifically for referencing V1 and V2 tables? >>>>>> >>>>>> I am strongly against this. The REST spec should be independent of >>>>>> the table versions. Any table format version can be passed and the table >>>>>> format should be the canonical reference for what is allowed. We want to >>>>>> avoid cases where there are discrepancies. The table spec is canonical >>>>>> for >>>>>> table metadata, and the REST spec allows passing it. >>>>>> >>>>>> On Sun, Oct 20, 2024 at 11:18 AM Kevin Liu <kevin.jq....@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> Hey folks, >>>>>>> >>>>>>> Thanks, everyone for the discussion, and thanks Ryan for providing >>>>>>> the historical context. >>>>>>> Enforce the `operation` key in Snapshot’s `summary` field >>>>>>> >>>>>>> When serializing the `Snapshot` object from JSON, the Java >>>>>>> implementation does not enforce that the `summary` field must contain an >>>>>>> `operation` key. In the V1 spec, the `summary` field is optional, while >>>>>>> in >>>>>>> the V2 spec, it is required. However, in both versions, if a `summary` >>>>>>> field is present, it must include an `operation` key. Any `summary` >>>>>>> field >>>>>>> lacking an `operation` key should be considered invalid. >>>>>>> >>>>>>> I’ve addressed this issue in PR 11354 [1] by adding this constraint >>>>>>> when parsing JSON. >>>>>>> >>>>>>> > We initially did not have the snapshot summary or operation. When >>>>>>> I added the summary, the operation was intended to be required in cases >>>>>>> where the summary is present. It should always be there if the summary >>>>>>> is >>>>>>> and the summary should always be there unless you wrote the >>>>>>> metadata.json file way back in 2017 or 2018. >>>>>>> >>>>>>> @Ryan, does this constraint also apply to `metadata.json` files from >>>>>>> 2017/2018? Was it ever valid to have a `summary` field without the >>>>>>> `operation` key? >>>>>>> >>>>>>> > Well, the spec says nothing about a top-level `operation` field in >>>>>>> JSON [1]. Yet the Java implementation produces it [2] and removes the >>>>>>> operation from the summary map. This seems inconsistent? >>>>>>> >>>>>>> @Anton, the Java `Snapshot` object includes both the `summary` and >>>>>>> `operation` fields. When serializing to JSON, the `operation` field is >>>>>>> included in the `summary` map [2], rather than as a top-level field. >>>>>>> During >>>>>>> deserialization from JSON, the `operation` field is extracted from the >>>>>>> `summary` map [3]. >>>>>>> >>>>>>> I believe this is consistent with the table spec, which defines the >>>>>>> JSON output, not how the `Snapshot` object is implemented in Java. >>>>>>> On REST spec and Table spec >>>>>>> >>>>>>> Thanks, Yufei, for highlighting the difference between the REST spec >>>>>>> and the table spec. I mistakenly used the REST spec >>>>>>> (`rest-catalog-open-api.yaml` [4]) as the source of truth for V2 tables. >>>>>>> >>>>>>> Looking at the REST spec file, it can be challenging to determine >>>>>>> how a REST server should handle V1 versus V2 tables. Even for V2 tables, >>>>>>> the current version of the file combines features from V2, along with >>>>>>> additional changes made in preparation for the upcoming V3 spec. >>>>>>> >>>>>>> Would it be helpful to create alternative versions of the REST spec >>>>>>> specifically for referencing V1 and V2 tables? The goal would be to >>>>>>> have a >>>>>>> "frozen" version of the REST spec dedicated to V1 tables and another >>>>>>> for V2 >>>>>>> tables while allowing the current REST spec file to evolve as needed. >>>>>>> >>>>>>> Taking a step back, I think we need more documentation on the REST >>>>>>> spec, including support for different table versions and guidance on >>>>>>> upgrading from one version to another. I’d love to hear everyone’s >>>>>>> thoughts >>>>>>> on this. >>>>>>> >>>>>>> >>>>>>> Best, >>>>>>> >>>>>>> Kevin Liu >>>>>>> >>>>>>> >>>>>>> [1] https://github.com/apache/iceberg/pull/11354 >>>>>>> >>>>>>> [2] >>>>>>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L63-L66 >>>>>>> >>>>>>> [3] >>>>>>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124-L137 >>>>>>> >>>>>>> [4] >>>>>>> https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml >>>>>>> >>>>>>> >>>>>>> On Sat, Oct 19, 2024 at 7:48 PM Sung Yun <sun...@apache.org> wrote: >>>>>>> >>>>>>>> Hi Ryan, thank you for your response! >>>>>>>> >>>>>>>> That detailed context is very helpful in allowing me to >>>>>>>> understanding why the REST catalog spec has evolved the way it has, >>>>>>>> and how >>>>>>>> the Table Spec and the REST Catalog Spec should each be referenced in >>>>>>>> the >>>>>>>> sub-communities (like in PyIceberg). I'll keep those motivations in >>>>>>>> mind as >>>>>>>> we discuss those Specs in the future. >>>>>>>> >>>>>>>> Also, here's a small PR to specify more explicitly that the >>>>>>>> operation field should be a required field in the summary field: >>>>>>>> https://github.com/apache/iceberg/pull/11355 >>>>>>>> >>>>>>>> Sung >>>>>>>> >>>>>>>> On 2024/10/19 22:14:59 "rdb...@gmail.com" wrote: >>>>>>>> > I can provide some historical context here about how the table >>>>>>>> spec evolved >>>>>>>> > and how the REST spec works with respect to table versions. >>>>>>>> > >>>>>>>> > We initially did not have the snapshot summary or operation. When >>>>>>>> I added >>>>>>>> > the summary, the operation was intended to be required in cases >>>>>>>> where the >>>>>>>> > summary is present. It should always be there if the summary is >>>>>>>> and the >>>>>>>> > summary should always be there unless you wrote the metadata.json >>>>>>>> file way >>>>>>>> > back in 2017 or 2018. It looks like the spec could be more clear >>>>>>>> that the >>>>>>>> > operation is required when summary is present. Anyone want to >>>>>>>> open a PR? >>>>>>>> > >>>>>>>> > Anton, I don't think there is a top-level operation field. The >>>>>>>> Java >>>>>>>> > Snapshot class tracks the operation as top-level, but it is >>>>>>>> always stored >>>>>>>> > in the summary. I think this is consistent with the spec. >>>>>>>> > >>>>>>>> > For the REST spec, I think that it should be strictly optional to >>>>>>>> support >>>>>>>> > v1 tables written with no summary, but it should always be >>>>>>>> present. I'd >>>>>>>> > probably leave it required since it already is and is catching a >>>>>>>> valuable >>>>>>>> > error case here. >>>>>>>> > >>>>>>>> > When building the REST spec, I took the same approach as the Java >>>>>>>> parser >>>>>>>> > (which is also to parse table metadata coming from REST servers). >>>>>>>> That is, >>>>>>>> > it is compatible with v1 metadata; fields that were not required >>>>>>>> in v1 are >>>>>>>> > optional. This fits with the principle of "be liberal in what you >>>>>>>> accept >>>>>>>> > and strict in what you produce". The REST spec allows passing >>>>>>>> metadata for >>>>>>>> > any table version so that the specs are not tightly coupled. The >>>>>>>> table >>>>>>>> > version is passed when loading and clients should reject table >>>>>>>> versions >>>>>>>> > that are newer than what they can support. The REST protocol just >>>>>>>> needs to >>>>>>>> > facilitate passing the table metadata. >>>>>>>> > >>>>>>>> > Most v2 structures, like the `schemas` list, are introduced as >>>>>>>> optional in >>>>>>>> > v1 and made required in v2. That way, it's possible to add >>>>>>>> metadata to >>>>>>>> > existing format versions and make the upgrade path easier. >>>>>>>> Maintaining the >>>>>>>> > newer structures even though there are different writer versions >>>>>>>> deployed >>>>>>>> > is one of the reasons why the REST spec changes to a change-based >>>>>>>> model. >>>>>>>> > New metadata only needs to be supported by the service >>>>>>>> maintaining the >>>>>>>> > metadata.json files and any writers that want to update it. >>>>>>>> > >>>>>>>> > I see some points about being able to remove old table versions. >>>>>>>> I don't >>>>>>>> > think that the REST protocol itself is the place to do this. The >>>>>>>> protocol >>>>>>>> > is format-agnostic. Implementations are free to reject requests >>>>>>>> to create >>>>>>>> > tables with older versions, or to update the table to a new >>>>>>>> version. >>>>>>>> > >>>>>>>> > Ryan >>>>>>>> > >>>>>>>> > On Fri, Oct 18, 2024 at 6:42 AM Sung Yun <sun...@apache.org> >>>>>>>> wrote: >>>>>>>> > >>>>>>>> > > Folks, thank you for your responses! These area really helpful >>>>>>>> insights. >>>>>>>> > > >>>>>>>> > > > I agree that the REST spec should aim to support v1, v2, and >>>>>>>> potentially >>>>>>>> > > the upcoming v3. In practice, it seems like the choice of table >>>>>>>> spec might >>>>>>>> > > ultimately be dictated by the REST catalog implementation. >>>>>>>> > > >>>>>>>> > > > A best practice would be for the server to strive to support >>>>>>>> all Iceberg >>>>>>>> > > versions, but the REST spec itself should remain flexible >>>>>>>> enough to >>>>>>>> > > accommodate less strict table specs. >>>>>>>> > > >>>>>>>> > > Yufei, yes that makes sense, and I agree that the server should >>>>>>>> strive to >>>>>>>> > > support all format versions, because otherwise the an older >>>>>>>> client >>>>>>>> > > application, may just not be compatible with a REST Catalog >>>>>>>> running on a >>>>>>>> > > higher version of table spec. I think we have two choices here >>>>>>>> in ensuring >>>>>>>> > > that the REST Catalog server is able to support multiple >>>>>>>> versions of the >>>>>>>> > > Table Spec: >>>>>>>> > > >>>>>>>> > > 1. We could create single components that are common >>>>>>>> denominators of all >>>>>>>> > > existing table specs to accommodate the less table specs. The >>>>>>>> REST Catalog >>>>>>>> > > Spec currently falls short in this approach, and I've put up >>>>>>>> this PR to >>>>>>>> > > show what this change would look like just for the Snapshot >>>>>>>> component: >>>>>>>> > > https://github.com/apache/iceberg/pull/11353 - My take on this >>>>>>>> is that, >>>>>>>> > > this approach will make the REST catalog spec more confusing >>>>>>>> and difficult >>>>>>>> > > to manage as we add more Table Spec versions moving forward. >>>>>>>> The discussion >>>>>>>> > > on this mail list thread is I think a great demonstration of >>>>>>>> that confusion >>>>>>>> > > :) >>>>>>>> > > 2. We could instead create separate Table Spec version specific >>>>>>>> components >>>>>>>> > > on the REST Catalog Open API Spec. For example, a Snapshot >>>>>>>> component could >>>>>>>> > > be anyOf SnapshotV1 and SnapshotV2, which match the Table Spec >>>>>>>> V1 and V2 >>>>>>>> > > definitions. I think creating explicit components that match >>>>>>>> the spec >>>>>>>> > > definitions will work in our favor when we continue to >>>>>>>> introduce more Spec >>>>>>>> > > changes and manage their lifecycles. And perhaps, maybe we >>>>>>>> could also >>>>>>>> > > indicate what format-versions the REST Catalog Server supports >>>>>>>> through an >>>>>>>> > > endpoint, and communicate it to a client application. >>>>>>>> > > >>>>>>>> > > I'd love to hear the community's opinion on suggestion (2)! I'm >>>>>>>> very >>>>>>>> > > curious to hear if we've considered it before. >>>>>>>> > > >>>>>>>> > > Sung >>>>>>>> > > >>>>>>>> > > On 2024/10/18 05:13:15 Péter Váry wrote: >>>>>>>> > > > Hi Team, >>>>>>>> > > > Apart from fixing this current issue by relaxing the current >>>>>>>> spec >>>>>>>> > > > constraints, to support both v1 and v2 specifications, we >>>>>>>> should think >>>>>>>> > > > about how to handle table spec evolution for the long term. >>>>>>>> > > > >>>>>>>> > > > What are the base factors we can start from (please add your >>>>>>>> own ideas >>>>>>>> > > if I >>>>>>>> > > > have missed something): >>>>>>>> > > > - We evolve the specifications in a way that is backwards >>>>>>>> compatible (v1 >>>>>>>> > > > table could be read by v2 reader) but not forwards compatible >>>>>>>> (v2 table >>>>>>>> > > > could not be read by an old reader) >>>>>>>> > > > - The rest spec ideally should conform to the currently used >>>>>>>> table spec >>>>>>>> > > > schema/constraints >>>>>>>> > > > - REST catalogs sooner-or-later would like to drop support >>>>>>>> for older >>>>>>>> > > table >>>>>>>> > > > spec. We need to avoid the situation of Hive Metastore, where >>>>>>>> the >>>>>>>> > > decisions >>>>>>>> > > > made 10 years ago prevented enhancing the APIs as the old >>>>>>>> specifications >>>>>>>> > > > were supported forever. >>>>>>>> > > > >>>>>>>> > > > Probably (when the spec difference becomes big enough) a >>>>>>>> composit request >>>>>>>> > > > (version + different content spec) or a different endpoint >>>>>>>> will be >>>>>>>> > > required. >>>>>>>> > > > >>>>>>>> > > > Thanks, Peter >>>>>>>> > > > >>>>>>>> > > > On Thu, Oct 17, 2024, 23:11 Yufei Gu <flyrain...@gmail.com> >>>>>>>> wrote: >>>>>>>> > > > >>>>>>>> > > > > Hi Sung, >>>>>>>> > > > > >>>>>>>> > > > > It seems we are running to issues related to a mismatch >>>>>>>> between the >>>>>>>> > > REST >>>>>>>> > > > > spec and table specifications. Currently, there's no clear >>>>>>>> definition >>>>>>>> > > of >>>>>>>> > > > > how the REST spec is meant to support different table >>>>>>>> specs. The >>>>>>>> > > closest >>>>>>>> > > > > reference I found is this statement >>>>>>>> > > > > < >>>>>>>> > > >>>>>>>> https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L30-L30 >>>>>>>> > > > >>>>>>>> > > > > in the REST spec. >>>>>>>> > > > > >>>>>>>> > > > > Implementations should ideally support both Iceberg table >>>>>>>> specs v1 and >>>>>>>> > > v2, >>>>>>>> > > > >> with priority given to v2. >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > I agree that the REST spec should aim to support v1, v2, and >>>>>>>> > > potentially >>>>>>>> > > > > the upcoming v3. In practice, it seems like the choice of >>>>>>>> table spec >>>>>>>> > > might >>>>>>>> > > > > ultimately be dictated by the REST catalog implementation. >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > A best practice would be for the server to strive to >>>>>>>> support all >>>>>>>> > > Iceberg >>>>>>>> > > > > versions, but the REST spec itself should remain flexible >>>>>>>> enough to >>>>>>>> > > > > accommodate less strict table specs. For the case you >>>>>>>> mentioned, it >>>>>>>> > > should >>>>>>>> > > > > be fine to make sequence number optional since the spec has >>>>>>>> to support >>>>>>>> > > v1 >>>>>>>> > > > > table spec. It does feel confusing though. >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > WDYT? >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > Yufei >>>>>>>> > > > > >>>>>>>> > > > > >>>>>>>> > > > > On Thu, Oct 17, 2024 at 1:56 PM Anton Okolnychyi < >>>>>>>> > > aokolnyc...@gmail.com> >>>>>>>> > > > > wrote: >>>>>>>> > > > > >>>>>>>> > > > >> Well, the spec says nothing about a top-level `operation` >>>>>>>> field in >>>>>>>> > > JSON >>>>>>>> > > > >> [1]. Yet the Java implementation produces it [2] and >>>>>>>> removes the >>>>>>>> > > operation >>>>>>>> > > > >> from the summary map. This seems inconsistent? >>>>>>>> > > > >> >>>>>>>> > > > >> - Anton >>>>>>>> > > > >> >>>>>>>> > > > >> [1] - https://iceberg.apache.org/spec/#snapshots >>>>>>>> > > > >> [2] - >>>>>>>> > > > >> >>>>>>>> > > >>>>>>>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L63 >>>>>>>> > > > >> >>>>>>>> > > > >> >>>>>>>> > > > >> чт, 17 жовт. 2024 р. о 10:06 Sung Yun <sun...@apache.org> >>>>>>>> пише: >>>>>>>> > > > >> >>>>>>>> > > > >>> > As a side note, the `rest-catalog-open-api.yaml` file >>>>>>>> [2] in the >>>>>>>> > > > >>> Iceberg repo contains the latest version of the spec. >>>>>>>> > > > >>> >>>>>>>> > > > >>> I think more clarity on this would be helpful. Is it >>>>>>>> really the case >>>>>>>> > > > >>> that the Open API spec contains the latest version of the >>>>>>>> spec? For >>>>>>>> > > > >>> example, I'm noticing a discrepancy between >>>>>>>> sequence-number in the >>>>>>>> > > Table >>>>>>>> > > > >>> Spec and in the Open API Spec... >>>>>>>> > > > >>> >>>>>>>> > > > >>> In the table spec, it's required for V2, but it's >>>>>>>> optional in the >>>>>>>> > > REST >>>>>>>> > > > >>> API Spec: >>>>>>>> > > > >>> >>>>>>>> > > > >>> >>>>>>>> > > >>>>>>>> https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L2319-L2335 >>>>>>>> > > > >>> >>>>>>>> > > > >>> On 2024/10/17 16:58:17 Kevin Liu wrote: >>>>>>>> > > > >>> > > Based on the example metadata, that looks like it is >>>>>>>> not to >>>>>>>> > > spec, so >>>>>>>> > > > >>> it's >>>>>>>> > > > >>> > reasonable that python would reject it. If the java >>>>>>>> > > implementation is >>>>>>>> > > > >>> > allowing for that, it's likely that we're being too >>>>>>>> relaxed >>>>>>>> > > (possibly a >>>>>>>> > > > >>> > holdover from v1 parsing). >>>>>>>> > > > >>> > I believe the Java implementation is relaxing the >>>>>>>> constraint. I'll >>>>>>>> > > > >>> create a >>>>>>>> > > > >>> > PR with test cases and the necessary changes. >>>>>>>> > > > >>> > >>>>>>>> > > > >>> > > Do you know what produced the metadata? >>>>>>>> > > > >>> > It was created by Snowflake [1]. After verifying this, >>>>>>>> I'll look >>>>>>>> > > into >>>>>>>> > > > >>> > raising the issue with them. >>>>>>>> > > > >>> > >>>>>>>> > > > >>> > As a side note, the `rest-catalog-open-api.yaml` file >>>>>>>> [2] in the >>>>>>>> > > > >>> Iceberg >>>>>>>> > > > >>> > repo contains the latest version of the spec. As we're >>>>>>>> continuing >>>>>>>> > > to >>>>>>>> > > > >>> evolve >>>>>>>> > > > >>> > to spec for V3, would it be helpful to create a frozen >>>>>>>> version >>>>>>>> > > > >>> representing >>>>>>>> > > > >>> > both the V1 and V2 specs for reference, possibly as a >>>>>>>> separate >>>>>>>> > > file? >>>>>>>> > > > >>> > >>>>>>>> > > > >>> > Best, >>>>>>>> > > > >>> > Kevin Liu >>>>>>>> > > > >>> > >>>>>>>> > > > >>> > [1] >>>>>>>> > > > >>> > >>>>>>>> > > > >>> >>>>>>>> > > >>>>>>>> https://github.com/apache/iceberg-python/issues/1106#issuecomment-2312108455 >>>>>>>> > > > >>> > [2] >>>>>>>> > > > >>> > >>>>>>>> > > > >>> >>>>>>>> > > >>>>>>>> https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml >>>>>>>> > > > >>> > >>>>>>>> > > > >>> > On Thu, Oct 17, 2024 at 9:20 AM Daniel Weeks < >>>>>>>> dwe...@apache.org> >>>>>>>> > > > >>> wrote: >>>>>>>> > > > >>> > >>>>>>>> > > > >>> > > Sung, >>>>>>>> > > > >>> > > >>>>>>>> > > > >>> > > I was thinking of v1, so you're right that >>>>>>>> manifest-list and >>>>>>>> > > summary >>>>>>>> > > > >>> are >>>>>>>> > > > >>> > > required as of v2. The REST Spec seems to follow the >>>>>>>> v2 >>>>>>>> > > definition, >>>>>>>> > > > >>> so I >>>>>>>> > > > >>> > > think we're somewhat implicitly requiring those >>>>>>>> fields via REST. >>>>>>>> > > > >>> > > >>>>>>>> > > > >>> > > Kevin, >>>>>>>> > > > >>> > > >>>>>>>> > > > >>> > > Based on the example metadata, that looks like it is >>>>>>>> not to >>>>>>>> > > spec, so >>>>>>>> > > > >>> it's >>>>>>>> > > > >>> > > reasonable that python would reject it. If the java >>>>>>>> > > implementation >>>>>>>> > > > >>> is >>>>>>>> > > > >>> > > allowing for that, it's likely that we're being too >>>>>>>> relaxed >>>>>>>> > > > >>> (possibly a >>>>>>>> > > > >>> > > holdover from v1 parsing). >>>>>>>> > > > >>> > > >>>>>>>> > > > >>> > > Do you know what produced the metadata? >>>>>>>> > > > >>> > > >>>>>>>> > > > >>> > > -Dan >>>>>>>> > > > >>> > > >>>>>>>> > > > >>> > > On Thu, Oct 17, 2024 at 9:02 AM Kevin Liu < >>>>>>>> > > kevin.jq....@gmail.com> >>>>>>>> > > > >>> wrote: >>>>>>>> > > > >>> > > >>>>>>>> > > > >>> > >> Thanks for the additional context. >>>>>>>> > > > >>> > >> >>>>>>>> > > > >>> > >> My understanding is that if a Snapshot has a >>>>>>>> `summary` field, it >>>>>>>> > > > >>> must >>>>>>>> > > > >>> > >> also have a corresponding `operation` key in the >>>>>>>> summary map. Is >>>>>>>> > > > >>> that >>>>>>>> > > > >>> > >> correct? Based on the `SnapshotParser`, this is not >>>>>>>> enforced >>>>>>>> > > [1]. >>>>>>>> > > > >>> > >> >>>>>>>> > > > >>> > >> The underlying issue in #1106 [2] is the missing >>>>>>>> `operation` >>>>>>>> > > field >>>>>>>> > > > >>> when >>>>>>>> > > > >>> > >> the `summary` field is present. >>>>>>>> > > > >>> > >> For example, >>>>>>>> > > > >>> > >> ``` >>>>>>>> > > > >>> > >> "summary" : { >>>>>>>> > > > >>> > >> "manifests-created" : "8", >>>>>>>> > > > >>> > >> "total-records" : "26508666891", >>>>>>>> > > > >>> > >> "added-files-size" : "3927895626752", >>>>>>>> > > > >>> > >> "manifests-kept" : "0", >>>>>>>> > > > >>> > >> "total-files-size" : "3927895626752", >>>>>>>> > > > >>> > >> "added-records" : "26508666891", >>>>>>>> > > > >>> > >> "added-data-files" : "231513", >>>>>>>> > > > >>> > >> "manifests-replaced" : "0", >>>>>>>> > > > >>> > >> "total-data-files" : "231513" >>>>>>>> > > > >>> > >> } >>>>>>>> > > > >>> > >> ``` >>>>>>>> > > > >>> > >> >>>>>>>> > > > >>> > >> It could be the case that this particular >>>>>>>> `metadata.json` was >>>>>>>> > > > >>> generated >>>>>>>> > > > >>> > >> not according to the spec. >>>>>>>> > > > >>> > >> >>>>>>>> > > > >>> > >> Best, >>>>>>>> > > > >>> > >> Kevin Liu >>>>>>>> > > > >>> > >> >>>>>>>> > > > >>> > >> >>>>>>>> > > > >>> > >> [1] >>>>>>>> > > > >>> > >> >>>>>>>> > > > >>> >>>>>>>> > > >>>>>>>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124-L142 >>>>>>>> > > > >>> > >> [2] >>>>>>>> https://github.com/apache/iceberg-python/issues/1106 >>>>>>>> > > > >>> > >> >>>>>>>> > > > >>> > >> >>>>>>>> > > > >>> > >> On Thu, Oct 17, 2024 at 8:47 AM Sung Yun < >>>>>>>> sun...@apache.org> >>>>>>>> > > wrote: >>>>>>>> > > > >>> > >> >>>>>>>> > > > >>> > >>> Thank you for the clarification Daniel, and thank >>>>>>>> you Kevin for >>>>>>>> > > > >>> raising >>>>>>>> > > > >>> > >>> this issue! >>>>>>>> > > > >>> > >>> >>>>>>>> > > > >>> > >>> Does that mean that we are creating component >>>>>>>> schemas that are >>>>>>>> > > the >>>>>>>> > > > >>> > >>> superset of the V1 and V2 schemas? And if so, >>>>>>>> should we remove >>>>>>>> > > > >>> summary and >>>>>>>> > > > >>> > >>> manifest-list from the required properties, and add >>>>>>>> manifests >>>>>>>> > > > >>> optional >>>>>>>> > > > >>> > >>> property to the Snapshot schema to support both V1 >>>>>>>> and V2 >>>>>>>> > > Summary >>>>>>>> > > > >>> specs? >>>>>>>> > > > >>> > >>> https://iceberg.apache.org/spec/#snapshots >>>>>>>> > > > >>> > >>> >>>>>>>> > > > >>> > >>> Or would creating separate component schemas for >>>>>>>> V1/V2 be a >>>>>>>> > > > >>> cleaner way >>>>>>>> > > > >>> > >>> to align the REST spec with the table spec? >>>>>>>> > > > >>> > >>> >>>>>>>> > > > >>> > >>> Sung >>>>>>>> > > > >>> > >>> >>>>>>>> > > > >>> > >>> On 2024/10/17 15:19:23 Daniel Weeks wrote: >>>>>>>> > > > >>> > >>> > I'm not convinced this is incorrect behavior >>>>>>>> (table spec or >>>>>>>> > > > >>> > >>> > implementation), but it does lend to some >>>>>>>> confusion. The >>>>>>>> > > > >>> 'summary' >>>>>>>> > > > >>> > >>> field >>>>>>>> > > > >>> > >>> > is optional, which means that if a summary is not >>>>>>>> provided, >>>>>>>> > > you >>>>>>>> > > > >>> do not >>>>>>>> > > > >>> > >>> have >>>>>>>> > > > >>> > >>> > an associated 'operation' field. The 'operation' >>>>>>>> field is >>>>>>>> > > only >>>>>>>> > > > >>> > >>> required in >>>>>>>> > > > >>> > >>> > the context of the summary, so it's actually >>>>>>>> possible for the >>>>>>>> > > > >>> > >>> > implementation (i.e. the tests you reference) to >>>>>>>> not have an >>>>>>>> > > > >>> operation. >>>>>>>> > > > >>> > >>> > >>>>>>>> > > > >>> > >>> > I think what is wrong here is that the REST spec >>>>>>>> marked the >>>>>>>> > > > >>> summary as >>>>>>>> > > > >>> > >>> > required >>>>>>>> > > > >>> > >>> > < >>>>>>>> > > > >>> > >>> >>>>>>>> > > > >>> >>>>>>>> > > >>>>>>>> https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml#L2040 >>>>>>>> > > > >>> > >>> >, >>>>>>>> > > > >>> > >>> > which is inconsistent with the table spec. >>>>>>>> > > > >>> > >>> > >>>>>>>> > > > >>> > >>> > On Wed, Oct 16, 2024 at 3:52 PM Anton Okolnychyi < >>>>>>>> > > > >>> > >>> aokolnyc...@gmail.com> >>>>>>>> > > > >>> > >>> > wrote: >>>>>>>> > > > >>> > >>> > >>>>>>>> > > > >>> > >>> > > Based on [1], we never persisted the operation >>>>>>>> in the >>>>>>>> > > summary >>>>>>>> > > > >>> map. >>>>>>>> > > > >>> > >>> > > Instead, we persisted it as a top-level field >>>>>>>> in Java, >>>>>>>> > > which is >>>>>>>> > > > >>> > >>> actually >>>>>>>> > > > >>> > >>> > > NOT what the spec says. Does anyone remember >>>>>>>> cases when the >>>>>>>> > > > >>> > >>> operation was >>>>>>>> > > > >>> > >>> > > unknown? I personally don't. >>>>>>>> > > > >>> > >>> > > >>>>>>>> > > > >>> > >>> > > [1] - >>>>>>>> > > > >>> > >>> > > >>>>>>>> > > > >>> > >>> >>>>>>>> > > > >>> >>>>>>>> > > >>>>>>>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L63 >>>>>>>> > > > >>> > >>> > > >>>>>>>> > > > >>> > >>> > > >>>>>>>> > > > >>> > >>> > > ср, 16 жовт. 2024 р. о 12:42 Kevin Liu < >>>>>>>> > > kevin.jq....@gmail.com >>>>>>>> > > > >>> > >>>>>>>> > > > >>> > >>> пише: >>>>>>>> > > > >>> > >>> > > >>>>>>>> > > > >>> > >>> > >> Hey folks, >>>>>>>> > > > >>> > >>> > >> >>>>>>>> > > > >>> > >>> > >> I’ve noticed a discrepancy between the Iceberg >>>>>>>> > > specification >>>>>>>> > > > >>> and >>>>>>>> > > > >>> > >>> the Java >>>>>>>> > > > >>> > >>> > >> implementation regarding the `operation` key >>>>>>>> in the >>>>>>>> > > `Snapshot` >>>>>>>> > > > >>> > >>> `summary` >>>>>>>> > > > >>> > >>> > >> field. >>>>>>>> > > > >>> > >>> > >> >>>>>>>> > > > >>> > >>> > >> The `Snapshot` object's `summary` dictionary >>>>>>>> includes a >>>>>>>> > > > >>> *required* >>>>>>>> > > > >>> > >>> key >>>>>>>> > > > >>> > >>> > >> named `operation`, as outlined in the spec >>>>>>>> describing >>>>>>>> > > Table >>>>>>>> > > > >>> > >>> Metadata and >>>>>>>> > > > >>> > >>> > >> Snapshots [1] and the generated OpenAPI YAML >>>>>>>> [2]. >>>>>>>> > > However, in >>>>>>>> > > > >>> the >>>>>>>> > > > >>> > >>> Java >>>>>>>> > > > >>> > >>> > >> implementation [3], `operation` is treated as >>>>>>>> optional. In >>>>>>>> > > > >>> > >>> contrast, it >>>>>>>> > > > >>> > >>> > >> remains a required field in the Python >>>>>>>> implementation [4]. >>>>>>>> > > > >>> > >>> > >> I also found that Java tests for >>>>>>>> `SnapshotParser` assert >>>>>>>> > > that >>>>>>>> > > > >>> the >>>>>>>> > > > >>> > >>> > >> `operation` field is null. [5] >>>>>>>> > > > >>> > >>> > >> >>>>>>>> > > > >>> > >>> > >> Due to this discrepancy, a user reported [6] >>>>>>>> that the >>>>>>>> > > > >>> > >>> `metadata.json` >>>>>>>> > > > >>> > >>> > >> file generated for an Iceberg table could not >>>>>>>> be read by >>>>>>>> > > > >>> PyIceberg, >>>>>>>> > > > >>> > >>> though >>>>>>>> > > > >>> > >>> > >> it is readable using the Iceberg Java library. >>>>>>>> > > > >>> > >>> > >> >>>>>>>> > > > >>> > >>> > >> How should we proceed from here? Should the >>>>>>>> Java library >>>>>>>> > > > >>> enforce >>>>>>>> > > > >>> > >>> this >>>>>>>> > > > >>> > >>> > >> requirement? Additionally, how should we >>>>>>>> handle existing >>>>>>>> > > > >>> > >>> `metadata.json` >>>>>>>> > > > >>> > >>> > >> files that were generated without this field? >>>>>>>> > > > >>> > >>> > >> >>>>>>>> > > > >>> > >>> > >> Best, >>>>>>>> > > > >>> > >>> > >> Kevin Liu >>>>>>>> > > > >>> > >>> > >> >>>>>>>> > > > >>> > >>> > >> [1] >>>>>>>> > > > >>> >>>>>>>> https://iceberg.apache.org/spec/#table-metadata-and-snapshots >>>>>>>> > > > >>> > >>> > >> [2] >>>>>>>> > > > >>> > >>> > >> >>>>>>>> > > > >>> > >>> >>>>>>>> > > > >>> >>>>>>>> > > >>>>>>>> https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml#L2057-L2060 >>>>>>>> > > > >>> > >>> > >> [3] >>>>>>>> > > > >>> > >>> > >> >>>>>>>> > > > >>> > >>> >>>>>>>> > > > >>> >>>>>>>> > > >>>>>>>> https://github.com/apache/iceberg/blob/64b36999d7ff716ae2534fb0972fcc10d22a64c2/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124 >>>>>>>> > > > >>> > >>> > >> [4] >>>>>>>> > > > >>> > >>> > >> >>>>>>>> > > > >>> > >>> >>>>>>>> > > > >>> >>>>>>>> > > >>>>>>>> https://github.com/apache/iceberg-python/blob/7cf0c225c3cdb32ac5e390de06b7b0e4fe7de92e/pyiceberg/table/snapshots.py#L182 >>>>>>>> > > > >>> > >>> > >> [5] >>>>>>>> > > > >>> > >>> > >> >>>>>>>> > > > >>> > >>> >>>>>>>> > > > >>> >>>>>>>> > > >>>>>>>> https://github.com/apache/iceberg/blob/22a6b19c2e226eacc0aa78c1f2ffbdbb168b13be/core/src/test/java/org/apache/iceberg/TestSnapshotJson.java#L52 >>>>>>>> > > > >>> > >>> > >> [6] >>>>>>>> https://github.com/apache/iceberg-python/issues/1106 >>>>>>>> > > > >>> > >>> > >> >>>>>>>> > > > >>> > >>> > >> >>>>>>>> > > > >>> > >>> > >>>>>>>> > > > >>> > >>> >>>>>>>> > > > >>> > >> >>>>>>>> > > > >>> > >>>>>>>> > > > >>> >>>>>>>> > > > >> >>>>>>>> > > > >>>>>>>> > > >>>>>>>> > >>>>>>>> >>>>>>>