Thanks, Ryan! That makes sense. I want to follow up on the original issue. I've made a PR [1] to enforce that the Snapshot `summary` map must have an `operation` key. Please take a look. Thank you @nastra for the comments and reviews.
Best, Kevin Liu [1] https://github.com/apache/iceberg/pull/11354 On Tue, Oct 22, 2024 at 4:06 PM rdb...@gmail.com <rdb...@gmail.com> wrote: > > For example, the `Snapshot` `summary` field is optional in V1 but > required in V2. Therefore, the REST spec definition should mark the > `summary` field as optional to support both versions. > > Yeah, this is technically true. But as I said in my first email, unless > you have tables that are 5 years old, it's unlikely that this is going to > be a problem. A failure here is more likely with newer implementations that > have a bug. So I'd argue there's value in leaving it as required. > > On Mon, Oct 21, 2024 at 9:41 AM Kevin Liu <kevin.jq....@gmail.com> wrote: > >> > No. They were introduced at the same time. >> Great! Since the `summary` field and the `operation` key were introduced >> together, we should enforce the rule that the `summary` field must >> always have an accompanying `operation` key. This has been addressed in >> PR 11354 [1]. >> >> > I am strongly against this. The REST spec should be independent of the >> table versions. >> That makes sense. For the REST spec to support both V1 and V2 tables, it >> should "accept" the least common denominator between the two versions. For >> example, the `Snapshot` `summary` field is optional in V1 but required in >> V2. Therefore, the REST spec definition should mark the `summary` field as >> optional to support both versions. However, the current REST spec leans >> towards the V2 table spec; fields that are optional in V1 and required in >> V2 are marked as required in the spec, such as `TableMetadata.table-uuid` >> [2][3] and `Snapshot.summary` [4][5]. >> >> Would love to get other people's thoughts on this. >> >> Best, >> Kevin Liu >> >> [1] https://github.com/apache/iceberg/pull/11354 >> [2] >> https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L2414 >> [3] https://iceberg.apache.org/spec/#table-metadata-fields >> [4] >> https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L2325 >> [5] https://iceberg.apache.org/spec/#snapshots >> >> On Sun, Oct 20, 2024 at 11:24 AM rdb...@gmail.com <rdb...@gmail.com> >> wrote: >> >>> Was it ever valid to have a summary field without the operation key? >>> >>> No. They were introduced at the same time. >>> >>> Would it be helpful to create alternative versions of the REST spec >>> specifically for referencing V1 and V2 tables? >>> >>> I am strongly against this. The REST spec should be independent of the >>> table versions. Any table format version can be passed and the table format >>> should be the canonical reference for what is allowed. We want to avoid >>> cases where there are discrepancies. The table spec is canonical for table >>> metadata, and the REST spec allows passing it. >>> >>> On Sun, Oct 20, 2024 at 11:18 AM Kevin Liu <kevin.jq....@gmail.com> >>> wrote: >>> >>>> Hey folks, >>>> >>>> Thanks, everyone for the discussion, and thanks Ryan for providing the >>>> historical context. >>>> Enforce the `operation` key in Snapshot’s `summary` field >>>> >>>> When serializing the `Snapshot` object from JSON, the Java >>>> implementation does not enforce that the `summary` field must contain an >>>> `operation` key. In the V1 spec, the `summary` field is optional, while in >>>> the V2 spec, it is required. However, in both versions, if a `summary` >>>> field is present, it must include an `operation` key. Any `summary` field >>>> lacking an `operation` key should be considered invalid. >>>> >>>> I’ve addressed this issue in PR 11354 [1] by adding this constraint >>>> when parsing JSON. >>>> >>>> > We initially did not have the snapshot summary or operation. When I >>>> added the summary, the operation was intended to be required in cases where >>>> the summary is present. It should always be there if the summary is and the >>>> summary should always be there unless you wrote the metadata.json file >>>> way back in 2017 or 2018. >>>> >>>> @Ryan, does this constraint also apply to `metadata.json` files from >>>> 2017/2018? Was it ever valid to have a `summary` field without the >>>> `operation` key? >>>> >>>> > Well, the spec says nothing about a top-level `operation` field in >>>> JSON [1]. Yet the Java implementation produces it [2] and removes the >>>> operation from the summary map. This seems inconsistent? >>>> >>>> @Anton, the Java `Snapshot` object includes both the `summary` and >>>> `operation` fields. When serializing to JSON, the `operation` field is >>>> included in the `summary` map [2], rather than as a top-level field. During >>>> deserialization from JSON, the `operation` field is extracted from the >>>> `summary` map [3]. >>>> >>>> I believe this is consistent with the table spec, which defines the >>>> JSON output, not how the `Snapshot` object is implemented in Java. >>>> On REST spec and Table spec >>>> >>>> Thanks, Yufei, for highlighting the difference between the REST spec >>>> and the table spec. I mistakenly used the REST spec >>>> (`rest-catalog-open-api.yaml` [4]) as the source of truth for V2 tables. >>>> >>>> Looking at the REST spec file, it can be challenging to determine how a >>>> REST server should handle V1 versus V2 tables. Even for V2 tables, the >>>> current version of the file combines features from V2, along with >>>> additional changes made in preparation for the upcoming V3 spec. >>>> >>>> Would it be helpful to create alternative versions of the REST spec >>>> specifically for referencing V1 and V2 tables? The goal would be to have a >>>> "frozen" version of the REST spec dedicated to V1 tables and another for V2 >>>> tables while allowing the current REST spec file to evolve as needed. >>>> >>>> Taking a step back, I think we need more documentation on the REST >>>> spec, including support for different table versions and guidance on >>>> upgrading from one version to another. I’d love to hear everyone’s thoughts >>>> on this. >>>> >>>> >>>> Best, >>>> >>>> Kevin Liu >>>> >>>> >>>> [1] https://github.com/apache/iceberg/pull/11354 >>>> >>>> [2] >>>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L63-L66 >>>> >>>> [3] >>>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124-L137 >>>> >>>> [4] >>>> https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml >>>> >>>> >>>> On Sat, Oct 19, 2024 at 7:48 PM Sung Yun <sun...@apache.org> wrote: >>>> >>>>> Hi Ryan, thank you for your response! >>>>> >>>>> That detailed context is very helpful in allowing me to understanding >>>>> why the REST catalog spec has evolved the way it has, and how the Table >>>>> Spec and the REST Catalog Spec should each be referenced in the >>>>> sub-communities (like in PyIceberg). I'll keep those motivations in mind >>>>> as >>>>> we discuss those Specs in the future. >>>>> >>>>> Also, here's a small PR to specify more explicitly that the operation >>>>> field should be a required field in the summary field: >>>>> https://github.com/apache/iceberg/pull/11355 >>>>> >>>>> Sung >>>>> >>>>> On 2024/10/19 22:14:59 "rdb...@gmail.com" wrote: >>>>> > I can provide some historical context here about how the table spec >>>>> evolved >>>>> > and how the REST spec works with respect to table versions. >>>>> > >>>>> > We initially did not have the snapshot summary or operation. When I >>>>> added >>>>> > the summary, the operation was intended to be required in cases >>>>> where the >>>>> > summary is present. It should always be there if the summary is and >>>>> the >>>>> > summary should always be there unless you wrote the metadata.json >>>>> file way >>>>> > back in 2017 or 2018. It looks like the spec could be more clear >>>>> that the >>>>> > operation is required when summary is present. Anyone want to open a >>>>> PR? >>>>> > >>>>> > Anton, I don't think there is a top-level operation field. The Java >>>>> > Snapshot class tracks the operation as top-level, but it is always >>>>> stored >>>>> > in the summary. I think this is consistent with the spec. >>>>> > >>>>> > For the REST spec, I think that it should be strictly optional to >>>>> support >>>>> > v1 tables written with no summary, but it should always be present. >>>>> I'd >>>>> > probably leave it required since it already is and is catching a >>>>> valuable >>>>> > error case here. >>>>> > >>>>> > When building the REST spec, I took the same approach as the Java >>>>> parser >>>>> > (which is also to parse table metadata coming from REST servers). >>>>> That is, >>>>> > it is compatible with v1 metadata; fields that were not required in >>>>> v1 are >>>>> > optional. This fits with the principle of "be liberal in what you >>>>> accept >>>>> > and strict in what you produce". The REST spec allows passing >>>>> metadata for >>>>> > any table version so that the specs are not tightly coupled. The >>>>> table >>>>> > version is passed when loading and clients should reject table >>>>> versions >>>>> > that are newer than what they can support. The REST protocol just >>>>> needs to >>>>> > facilitate passing the table metadata. >>>>> > >>>>> > Most v2 structures, like the `schemas` list, are introduced as >>>>> optional in >>>>> > v1 and made required in v2. That way, it's possible to add metadata >>>>> to >>>>> > existing format versions and make the upgrade path easier. >>>>> Maintaining the >>>>> > newer structures even though there are different writer versions >>>>> deployed >>>>> > is one of the reasons why the REST spec changes to a change-based >>>>> model. >>>>> > New metadata only needs to be supported by the service maintaining >>>>> the >>>>> > metadata.json files and any writers that want to update it. >>>>> > >>>>> > I see some points about being able to remove old table versions. I >>>>> don't >>>>> > think that the REST protocol itself is the place to do this. The >>>>> protocol >>>>> > is format-agnostic. Implementations are free to reject requests to >>>>> create >>>>> > tables with older versions, or to update the table to a new version. >>>>> > >>>>> > Ryan >>>>> > >>>>> > On Fri, Oct 18, 2024 at 6:42 AM Sung Yun <sun...@apache.org> wrote: >>>>> > >>>>> > > Folks, thank you for your responses! These area really helpful >>>>> insights. >>>>> > > >>>>> > > > I agree that the REST spec should aim to support v1, v2, and >>>>> potentially >>>>> > > the upcoming v3. In practice, it seems like the choice of table >>>>> spec might >>>>> > > ultimately be dictated by the REST catalog implementation. >>>>> > > >>>>> > > > A best practice would be for the server to strive to support all >>>>> Iceberg >>>>> > > versions, but the REST spec itself should remain flexible enough to >>>>> > > accommodate less strict table specs. >>>>> > > >>>>> > > Yufei, yes that makes sense, and I agree that the server should >>>>> strive to >>>>> > > support all format versions, because otherwise the an older client >>>>> > > application, may just not be compatible with a REST Catalog >>>>> running on a >>>>> > > higher version of table spec. I think we have two choices here in >>>>> ensuring >>>>> > > that the REST Catalog server is able to support multiple versions >>>>> of the >>>>> > > Table Spec: >>>>> > > >>>>> > > 1. We could create single components that are common denominators >>>>> of all >>>>> > > existing table specs to accommodate the less table specs. The REST >>>>> Catalog >>>>> > > Spec currently falls short in this approach, and I've put up this >>>>> PR to >>>>> > > show what this change would look like just for the Snapshot >>>>> component: >>>>> > > https://github.com/apache/iceberg/pull/11353 - My take on this is >>>>> that, >>>>> > > this approach will make the REST catalog spec more confusing and >>>>> difficult >>>>> > > to manage as we add more Table Spec versions moving forward. The >>>>> discussion >>>>> > > on this mail list thread is I think a great demonstration of that >>>>> confusion >>>>> > > :) >>>>> > > 2. We could instead create separate Table Spec version specific >>>>> components >>>>> > > on the REST Catalog Open API Spec. For example, a Snapshot >>>>> component could >>>>> > > be anyOf SnapshotV1 and SnapshotV2, which match the Table Spec V1 >>>>> and V2 >>>>> > > definitions. I think creating explicit components that match the >>>>> spec >>>>> > > definitions will work in our favor when we continue to introduce >>>>> more Spec >>>>> > > changes and manage their lifecycles. And perhaps, maybe we could >>>>> also >>>>> > > indicate what format-versions the REST Catalog Server supports >>>>> through an >>>>> > > endpoint, and communicate it to a client application. >>>>> > > >>>>> > > I'd love to hear the community's opinion on suggestion (2)! I'm >>>>> very >>>>> > > curious to hear if we've considered it before. >>>>> > > >>>>> > > Sung >>>>> > > >>>>> > > On 2024/10/18 05:13:15 Péter Váry wrote: >>>>> > > > Hi Team, >>>>> > > > Apart from fixing this current issue by relaxing the current spec >>>>> > > > constraints, to support both v1 and v2 specifications, we should >>>>> think >>>>> > > > about how to handle table spec evolution for the long term. >>>>> > > > >>>>> > > > What are the base factors we can start from (please add your own >>>>> ideas >>>>> > > if I >>>>> > > > have missed something): >>>>> > > > - We evolve the specifications in a way that is backwards >>>>> compatible (v1 >>>>> > > > table could be read by v2 reader) but not forwards compatible >>>>> (v2 table >>>>> > > > could not be read by an old reader) >>>>> > > > - The rest spec ideally should conform to the currently used >>>>> table spec >>>>> > > > schema/constraints >>>>> > > > - REST catalogs sooner-or-later would like to drop support for >>>>> older >>>>> > > table >>>>> > > > spec. We need to avoid the situation of Hive Metastore, where the >>>>> > > decisions >>>>> > > > made 10 years ago prevented enhancing the APIs as the old >>>>> specifications >>>>> > > > were supported forever. >>>>> > > > >>>>> > > > Probably (when the spec difference becomes big enough) a >>>>> composit request >>>>> > > > (version + different content spec) or a different endpoint will >>>>> be >>>>> > > required. >>>>> > > > >>>>> > > > Thanks, Peter >>>>> > > > >>>>> > > > On Thu, Oct 17, 2024, 23:11 Yufei Gu <flyrain...@gmail.com> >>>>> wrote: >>>>> > > > >>>>> > > > > Hi Sung, >>>>> > > > > >>>>> > > > > It seems we are running to issues related to a mismatch >>>>> between the >>>>> > > REST >>>>> > > > > spec and table specifications. Currently, there's no clear >>>>> definition >>>>> > > of >>>>> > > > > how the REST spec is meant to support different table specs. >>>>> The >>>>> > > closest >>>>> > > > > reference I found is this statement >>>>> > > > > < >>>>> > > >>>>> https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L30-L30 >>>>> > > > >>>>> > > > > in the REST spec. >>>>> > > > > >>>>> > > > > Implementations should ideally support both Iceberg table >>>>> specs v1 and >>>>> > > v2, >>>>> > > > >> with priority given to v2. >>>>> > > > > >>>>> > > > > >>>>> > > > > I agree that the REST spec should aim to support v1, v2, and >>>>> > > potentially >>>>> > > > > the upcoming v3. In practice, it seems like the choice of >>>>> table spec >>>>> > > might >>>>> > > > > ultimately be dictated by the REST catalog implementation. >>>>> > > > > >>>>> > > > > >>>>> > > > > A best practice would be for the server to strive to support >>>>> all >>>>> > > Iceberg >>>>> > > > > versions, but the REST spec itself should remain flexible >>>>> enough to >>>>> > > > > accommodate less strict table specs. For the case you >>>>> mentioned, it >>>>> > > should >>>>> > > > > be fine to make sequence number optional since the spec has to >>>>> support >>>>> > > v1 >>>>> > > > > table spec. It does feel confusing though. >>>>> > > > > >>>>> > > > > >>>>> > > > > WDYT? >>>>> > > > > >>>>> > > > > >>>>> > > > > Yufei >>>>> > > > > >>>>> > > > > >>>>> > > > > On Thu, Oct 17, 2024 at 1:56 PM Anton Okolnychyi < >>>>> > > aokolnyc...@gmail.com> >>>>> > > > > wrote: >>>>> > > > > >>>>> > > > >> Well, the spec says nothing about a top-level `operation` >>>>> field in >>>>> > > JSON >>>>> > > > >> [1]. Yet the Java implementation produces it [2] and removes >>>>> the >>>>> > > operation >>>>> > > > >> from the summary map. This seems inconsistent? >>>>> > > > >> >>>>> > > > >> - Anton >>>>> > > > >> >>>>> > > > >> [1] - https://iceberg.apache.org/spec/#snapshots >>>>> > > > >> [2] - >>>>> > > > >> >>>>> > > >>>>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L63 >>>>> > > > >> >>>>> > > > >> >>>>> > > > >> чт, 17 жовт. 2024 р. о 10:06 Sung Yun <sun...@apache.org> >>>>> пише: >>>>> > > > >> >>>>> > > > >>> > As a side note, the `rest-catalog-open-api.yaml` file [2] >>>>> in the >>>>> > > > >>> Iceberg repo contains the latest version of the spec. >>>>> > > > >>> >>>>> > > > >>> I think more clarity on this would be helpful. Is it really >>>>> the case >>>>> > > > >>> that the Open API spec contains the latest version of the >>>>> spec? For >>>>> > > > >>> example, I'm noticing a discrepancy between sequence-number >>>>> in the >>>>> > > Table >>>>> > > > >>> Spec and in the Open API Spec... >>>>> > > > >>> >>>>> > > > >>> In the table spec, it's required for V2, but it's optional >>>>> in the >>>>> > > REST >>>>> > > > >>> API Spec: >>>>> > > > >>> >>>>> > > > >>> >>>>> > > >>>>> https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L2319-L2335 >>>>> > > > >>> >>>>> > > > >>> On 2024/10/17 16:58:17 Kevin Liu wrote: >>>>> > > > >>> > > Based on the example metadata, that looks like it is not >>>>> to >>>>> > > spec, so >>>>> > > > >>> it's >>>>> > > > >>> > reasonable that python would reject it. If the java >>>>> > > implementation is >>>>> > > > >>> > allowing for that, it's likely that we're being too relaxed >>>>> > > (possibly a >>>>> > > > >>> > holdover from v1 parsing). >>>>> > > > >>> > I believe the Java implementation is relaxing the >>>>> constraint. I'll >>>>> > > > >>> create a >>>>> > > > >>> > PR with test cases and the necessary changes. >>>>> > > > >>> > >>>>> > > > >>> > > Do you know what produced the metadata? >>>>> > > > >>> > It was created by Snowflake [1]. After verifying this, >>>>> I'll look >>>>> > > into >>>>> > > > >>> > raising the issue with them. >>>>> > > > >>> > >>>>> > > > >>> > As a side note, the `rest-catalog-open-api.yaml` file [2] >>>>> in the >>>>> > > > >>> Iceberg >>>>> > > > >>> > repo contains the latest version of the spec. As we're >>>>> continuing >>>>> > > to >>>>> > > > >>> evolve >>>>> > > > >>> > to spec for V3, would it be helpful to create a frozen >>>>> version >>>>> > > > >>> representing >>>>> > > > >>> > both the V1 and V2 specs for reference, possibly as a >>>>> separate >>>>> > > file? >>>>> > > > >>> > >>>>> > > > >>> > Best, >>>>> > > > >>> > Kevin Liu >>>>> > > > >>> > >>>>> > > > >>> > [1] >>>>> > > > >>> > >>>>> > > > >>> >>>>> > > >>>>> https://github.com/apache/iceberg-python/issues/1106#issuecomment-2312108455 >>>>> > > > >>> > [2] >>>>> > > > >>> > >>>>> > > > >>> >>>>> > > >>>>> https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml >>>>> > > > >>> > >>>>> > > > >>> > On Thu, Oct 17, 2024 at 9:20 AM Daniel Weeks < >>>>> dwe...@apache.org> >>>>> > > > >>> wrote: >>>>> > > > >>> > >>>>> > > > >>> > > Sung, >>>>> > > > >>> > > >>>>> > > > >>> > > I was thinking of v1, so you're right that manifest-list >>>>> and >>>>> > > summary >>>>> > > > >>> are >>>>> > > > >>> > > required as of v2. The REST Spec seems to follow the v2 >>>>> > > definition, >>>>> > > > >>> so I >>>>> > > > >>> > > think we're somewhat implicitly requiring those fields >>>>> via REST. >>>>> > > > >>> > > >>>>> > > > >>> > > Kevin, >>>>> > > > >>> > > >>>>> > > > >>> > > Based on the example metadata, that looks like it is not >>>>> to >>>>> > > spec, so >>>>> > > > >>> it's >>>>> > > > >>> > > reasonable that python would reject it. If the java >>>>> > > implementation >>>>> > > > >>> is >>>>> > > > >>> > > allowing for that, it's likely that we're being too >>>>> relaxed >>>>> > > > >>> (possibly a >>>>> > > > >>> > > holdover from v1 parsing). >>>>> > > > >>> > > >>>>> > > > >>> > > Do you know what produced the metadata? >>>>> > > > >>> > > >>>>> > > > >>> > > -Dan >>>>> > > > >>> > > >>>>> > > > >>> > > On Thu, Oct 17, 2024 at 9:02 AM Kevin Liu < >>>>> > > kevin.jq....@gmail.com> >>>>> > > > >>> wrote: >>>>> > > > >>> > > >>>>> > > > >>> > >> Thanks for the additional context. >>>>> > > > >>> > >> >>>>> > > > >>> > >> My understanding is that if a Snapshot has a `summary` >>>>> field, it >>>>> > > > >>> must >>>>> > > > >>> > >> also have a corresponding `operation` key in the >>>>> summary map. Is >>>>> > > > >>> that >>>>> > > > >>> > >> correct? Based on the `SnapshotParser`, this is not >>>>> enforced >>>>> > > [1]. >>>>> > > > >>> > >> >>>>> > > > >>> > >> The underlying issue in #1106 [2] is the missing >>>>> `operation` >>>>> > > field >>>>> > > > >>> when >>>>> > > > >>> > >> the `summary` field is present. >>>>> > > > >>> > >> For example, >>>>> > > > >>> > >> ``` >>>>> > > > >>> > >> "summary" : { >>>>> > > > >>> > >> "manifests-created" : "8", >>>>> > > > >>> > >> "total-records" : "26508666891", >>>>> > > > >>> > >> "added-files-size" : "3927895626752", >>>>> > > > >>> > >> "manifests-kept" : "0", >>>>> > > > >>> > >> "total-files-size" : "3927895626752", >>>>> > > > >>> > >> "added-records" : "26508666891", >>>>> > > > >>> > >> "added-data-files" : "231513", >>>>> > > > >>> > >> "manifests-replaced" : "0", >>>>> > > > >>> > >> "total-data-files" : "231513" >>>>> > > > >>> > >> } >>>>> > > > >>> > >> ``` >>>>> > > > >>> > >> >>>>> > > > >>> > >> It could be the case that this particular >>>>> `metadata.json` was >>>>> > > > >>> generated >>>>> > > > >>> > >> not according to the spec. >>>>> > > > >>> > >> >>>>> > > > >>> > >> Best, >>>>> > > > >>> > >> Kevin Liu >>>>> > > > >>> > >> >>>>> > > > >>> > >> >>>>> > > > >>> > >> [1] >>>>> > > > >>> > >> >>>>> > > > >>> >>>>> > > >>>>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124-L142 >>>>> > > > >>> > >> [2] >>>>> https://github.com/apache/iceberg-python/issues/1106 >>>>> > > > >>> > >> >>>>> > > > >>> > >> >>>>> > > > >>> > >> On Thu, Oct 17, 2024 at 8:47 AM Sung Yun < >>>>> sun...@apache.org> >>>>> > > wrote: >>>>> > > > >>> > >> >>>>> > > > >>> > >>> Thank you for the clarification Daniel, and thank you >>>>> Kevin for >>>>> > > > >>> raising >>>>> > > > >>> > >>> this issue! >>>>> > > > >>> > >>> >>>>> > > > >>> > >>> Does that mean that we are creating component schemas >>>>> that are >>>>> > > the >>>>> > > > >>> > >>> superset of the V1 and V2 schemas? And if so, should >>>>> we remove >>>>> > > > >>> summary and >>>>> > > > >>> > >>> manifest-list from the required properties, and add >>>>> manifests >>>>> > > > >>> optional >>>>> > > > >>> > >>> property to the Snapshot schema to support both V1 and >>>>> V2 >>>>> > > Summary >>>>> > > > >>> specs? >>>>> > > > >>> > >>> https://iceberg.apache.org/spec/#snapshots >>>>> > > > >>> > >>> >>>>> > > > >>> > >>> Or would creating separate component schemas for V1/V2 >>>>> be a >>>>> > > > >>> cleaner way >>>>> > > > >>> > >>> to align the REST spec with the table spec? >>>>> > > > >>> > >>> >>>>> > > > >>> > >>> Sung >>>>> > > > >>> > >>> >>>>> > > > >>> > >>> On 2024/10/17 15:19:23 Daniel Weeks wrote: >>>>> > > > >>> > >>> > I'm not convinced this is incorrect behavior (table >>>>> spec or >>>>> > > > >>> > >>> > implementation), but it does lend to some >>>>> confusion. The >>>>> > > > >>> 'summary' >>>>> > > > >>> > >>> field >>>>> > > > >>> > >>> > is optional, which means that if a summary is not >>>>> provided, >>>>> > > you >>>>> > > > >>> do not >>>>> > > > >>> > >>> have >>>>> > > > >>> > >>> > an associated 'operation' field. The 'operation' >>>>> field is >>>>> > > only >>>>> > > > >>> > >>> required in >>>>> > > > >>> > >>> > the context of the summary, so it's actually >>>>> possible for the >>>>> > > > >>> > >>> > implementation (i.e. the tests you reference) to not >>>>> have an >>>>> > > > >>> operation. >>>>> > > > >>> > >>> > >>>>> > > > >>> > >>> > I think what is wrong here is that the REST spec >>>>> marked the >>>>> > > > >>> summary as >>>>> > > > >>> > >>> > required >>>>> > > > >>> > >>> > < >>>>> > > > >>> > >>> >>>>> > > > >>> >>>>> > > >>>>> https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml#L2040 >>>>> > > > >>> > >>> >, >>>>> > > > >>> > >>> > which is inconsistent with the table spec. >>>>> > > > >>> > >>> > >>>>> > > > >>> > >>> > On Wed, Oct 16, 2024 at 3:52 PM Anton Okolnychyi < >>>>> > > > >>> > >>> aokolnyc...@gmail.com> >>>>> > > > >>> > >>> > wrote: >>>>> > > > >>> > >>> > >>>>> > > > >>> > >>> > > Based on [1], we never persisted the operation in >>>>> the >>>>> > > summary >>>>> > > > >>> map. >>>>> > > > >>> > >>> > > Instead, we persisted it as a top-level field in >>>>> Java, >>>>> > > which is >>>>> > > > >>> > >>> actually >>>>> > > > >>> > >>> > > NOT what the spec says. Does anyone remember cases >>>>> when the >>>>> > > > >>> > >>> operation was >>>>> > > > >>> > >>> > > unknown? I personally don't. >>>>> > > > >>> > >>> > > >>>>> > > > >>> > >>> > > [1] - >>>>> > > > >>> > >>> > > >>>>> > > > >>> > >>> >>>>> > > > >>> >>>>> > > >>>>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L63 >>>>> > > > >>> > >>> > > >>>>> > > > >>> > >>> > > >>>>> > > > >>> > >>> > > ср, 16 жовт. 2024 р. о 12:42 Kevin Liu < >>>>> > > kevin.jq....@gmail.com >>>>> > > > >>> > >>>>> > > > >>> > >>> пише: >>>>> > > > >>> > >>> > > >>>>> > > > >>> > >>> > >> Hey folks, >>>>> > > > >>> > >>> > >> >>>>> > > > >>> > >>> > >> I’ve noticed a discrepancy between the Iceberg >>>>> > > specification >>>>> > > > >>> and >>>>> > > > >>> > >>> the Java >>>>> > > > >>> > >>> > >> implementation regarding the `operation` key in >>>>> the >>>>> > > `Snapshot` >>>>> > > > >>> > >>> `summary` >>>>> > > > >>> > >>> > >> field. >>>>> > > > >>> > >>> > >> >>>>> > > > >>> > >>> > >> The `Snapshot` object's `summary` dictionary >>>>> includes a >>>>> > > > >>> *required* >>>>> > > > >>> > >>> key >>>>> > > > >>> > >>> > >> named `operation`, as outlined in the spec >>>>> describing >>>>> > > Table >>>>> > > > >>> > >>> Metadata and >>>>> > > > >>> > >>> > >> Snapshots [1] and the generated OpenAPI YAML [2]. >>>>> > > However, in >>>>> > > > >>> the >>>>> > > > >>> > >>> Java >>>>> > > > >>> > >>> > >> implementation [3], `operation` is treated as >>>>> optional. In >>>>> > > > >>> > >>> contrast, it >>>>> > > > >>> > >>> > >> remains a required field in the Python >>>>> implementation [4]. >>>>> > > > >>> > >>> > >> I also found that Java tests for `SnapshotParser` >>>>> assert >>>>> > > that >>>>> > > > >>> the >>>>> > > > >>> > >>> > >> `operation` field is null. [5] >>>>> > > > >>> > >>> > >> >>>>> > > > >>> > >>> > >> Due to this discrepancy, a user reported [6] that >>>>> the >>>>> > > > >>> > >>> `metadata.json` >>>>> > > > >>> > >>> > >> file generated for an Iceberg table could not be >>>>> read by >>>>> > > > >>> PyIceberg, >>>>> > > > >>> > >>> though >>>>> > > > >>> > >>> > >> it is readable using the Iceberg Java library. >>>>> > > > >>> > >>> > >> >>>>> > > > >>> > >>> > >> How should we proceed from here? Should the Java >>>>> library >>>>> > > > >>> enforce >>>>> > > > >>> > >>> this >>>>> > > > >>> > >>> > >> requirement? Additionally, how should we handle >>>>> existing >>>>> > > > >>> > >>> `metadata.json` >>>>> > > > >>> > >>> > >> files that were generated without this field? >>>>> > > > >>> > >>> > >> >>>>> > > > >>> > >>> > >> Best, >>>>> > > > >>> > >>> > >> Kevin Liu >>>>> > > > >>> > >>> > >> >>>>> > > > >>> > >>> > >> [1] >>>>> > > > >>> >>>>> https://iceberg.apache.org/spec/#table-metadata-and-snapshots >>>>> > > > >>> > >>> > >> [2] >>>>> > > > >>> > >>> > >> >>>>> > > > >>> > >>> >>>>> > > > >>> >>>>> > > >>>>> https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml#L2057-L2060 >>>>> > > > >>> > >>> > >> [3] >>>>> > > > >>> > >>> > >> >>>>> > > > >>> > >>> >>>>> > > > >>> >>>>> > > >>>>> https://github.com/apache/iceberg/blob/64b36999d7ff716ae2534fb0972fcc10d22a64c2/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124 >>>>> > > > >>> > >>> > >> [4] >>>>> > > > >>> > >>> > >> >>>>> > > > >>> > >>> >>>>> > > > >>> >>>>> > > >>>>> https://github.com/apache/iceberg-python/blob/7cf0c225c3cdb32ac5e390de06b7b0e4fe7de92e/pyiceberg/table/snapshots.py#L182 >>>>> > > > >>> > >>> > >> [5] >>>>> > > > >>> > >>> > >> >>>>> > > > >>> > >>> >>>>> > > > >>> >>>>> > > >>>>> https://github.com/apache/iceberg/blob/22a6b19c2e226eacc0aa78c1f2ffbdbb168b13be/core/src/test/java/org/apache/iceberg/TestSnapshotJson.java#L52 >>>>> > > > >>> > >>> > >> [6] >>>>> https://github.com/apache/iceberg-python/issues/1106 >>>>> > > > >>> > >>> > >> >>>>> > > > >>> > >>> > >> >>>>> > > > >>> > >>> > >>>>> > > > >>> > >>> >>>>> > > > >>> > >> >>>>> > > > >>> > >>>>> > > > >>> >>>>> > > > >> >>>>> > > > >>>>> > > >>>>> > >>>>> >>>>