Hi Ryan, thank you for your response! That detailed context is very helpful in allowing me to understanding why the REST catalog spec has evolved the way it has, and how the Table Spec and the REST Catalog Spec should each be referenced in the sub-communities (like in PyIceberg). I'll keep those motivations in mind as we discuss those Specs in the future.
Also, here's a small PR to specify more explicitly that the operation field should be a required field in the summary field: https://github.com/apache/iceberg/pull/11355 Sung On 2024/10/19 22:14:59 "rdb...@gmail.com" wrote: > I can provide some historical context here about how the table spec evolved > and how the REST spec works with respect to table versions. > > We initially did not have the snapshot summary or operation. When I added > the summary, the operation was intended to be required in cases where the > summary is present. It should always be there if the summary is and the > summary should always be there unless you wrote the metadata.json file way > back in 2017 or 2018. It looks like the spec could be more clear that the > operation is required when summary is present. Anyone want to open a PR? > > Anton, I don't think there is a top-level operation field. The Java > Snapshot class tracks the operation as top-level, but it is always stored > in the summary. I think this is consistent with the spec. > > For the REST spec, I think that it should be strictly optional to support > v1 tables written with no summary, but it should always be present. I'd > probably leave it required since it already is and is catching a valuable > error case here. > > When building the REST spec, I took the same approach as the Java parser > (which is also to parse table metadata coming from REST servers). That is, > it is compatible with v1 metadata; fields that were not required in v1 are > optional. This fits with the principle of "be liberal in what you accept > and strict in what you produce". The REST spec allows passing metadata for > any table version so that the specs are not tightly coupled. The table > version is passed when loading and clients should reject table versions > that are newer than what they can support. The REST protocol just needs to > facilitate passing the table metadata. > > Most v2 structures, like the `schemas` list, are introduced as optional in > v1 and made required in v2. That way, it's possible to add metadata to > existing format versions and make the upgrade path easier. Maintaining the > newer structures even though there are different writer versions deployed > is one of the reasons why the REST spec changes to a change-based model. > New metadata only needs to be supported by the service maintaining the > metadata.json files and any writers that want to update it. > > I see some points about being able to remove old table versions. I don't > think that the REST protocol itself is the place to do this. The protocol > is format-agnostic. Implementations are free to reject requests to create > tables with older versions, or to update the table to a new version. > > Ryan > > On Fri, Oct 18, 2024 at 6:42 AM Sung Yun <sun...@apache.org> wrote: > > > Folks, thank you for your responses! These area really helpful insights. > > > > > I agree that the REST spec should aim to support v1, v2, and potentially > > the upcoming v3. In practice, it seems like the choice of table spec might > > ultimately be dictated by the REST catalog implementation. > > > > > A best practice would be for the server to strive to support all Iceberg > > versions, but the REST spec itself should remain flexible enough to > > accommodate less strict table specs. > > > > Yufei, yes that makes sense, and I agree that the server should strive to > > support all format versions, because otherwise the an older client > > application, may just not be compatible with a REST Catalog running on a > > higher version of table spec. I think we have two choices here in ensuring > > that the REST Catalog server is able to support multiple versions of the > > Table Spec: > > > > 1. We could create single components that are common denominators of all > > existing table specs to accommodate the less table specs. The REST Catalog > > Spec currently falls short in this approach, and I've put up this PR to > > show what this change would look like just for the Snapshot component: > > https://github.com/apache/iceberg/pull/11353 - My take on this is that, > > this approach will make the REST catalog spec more confusing and difficult > > to manage as we add more Table Spec versions moving forward. The discussion > > on this mail list thread is I think a great demonstration of that confusion > > :) > > 2. We could instead create separate Table Spec version specific components > > on the REST Catalog Open API Spec. For example, a Snapshot component could > > be anyOf SnapshotV1 and SnapshotV2, which match the Table Spec V1 and V2 > > definitions. I think creating explicit components that match the spec > > definitions will work in our favor when we continue to introduce more Spec > > changes and manage their lifecycles. And perhaps, maybe we could also > > indicate what format-versions the REST Catalog Server supports through an > > endpoint, and communicate it to a client application. > > > > I'd love to hear the community's opinion on suggestion (2)! I'm very > > curious to hear if we've considered it before. > > > > Sung > > > > On 2024/10/18 05:13:15 Péter Váry wrote: > > > Hi Team, > > > Apart from fixing this current issue by relaxing the current spec > > > constraints, to support both v1 and v2 specifications, we should think > > > about how to handle table spec evolution for the long term. > > > > > > What are the base factors we can start from (please add your own ideas > > if I > > > have missed something): > > > - We evolve the specifications in a way that is backwards compatible (v1 > > > table could be read by v2 reader) but not forwards compatible (v2 table > > > could not be read by an old reader) > > > - The rest spec ideally should conform to the currently used table spec > > > schema/constraints > > > - REST catalogs sooner-or-later would like to drop support for older > > table > > > spec. We need to avoid the situation of Hive Metastore, where the > > decisions > > > made 10 years ago prevented enhancing the APIs as the old specifications > > > were supported forever. > > > > > > Probably (when the spec difference becomes big enough) a composit request > > > (version + different content spec) or a different endpoint will be > > required. > > > > > > Thanks, Peter > > > > > > On Thu, Oct 17, 2024, 23:11 Yufei Gu <flyrain...@gmail.com> wrote: > > > > > > > Hi Sung, > > > > > > > > It seems we are running to issues related to a mismatch between the > > REST > > > > spec and table specifications. Currently, there's no clear definition > > of > > > > how the REST spec is meant to support different table specs. The > > closest > > > > reference I found is this statement > > > > < > > https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L30-L30 > > > > > > > in the REST spec. > > > > > > > > Implementations should ideally support both Iceberg table specs v1 and > > v2, > > > >> with priority given to v2. > > > > > > > > > > > > I agree that the REST spec should aim to support v1, v2, and > > potentially > > > > the upcoming v3. In practice, it seems like the choice of table spec > > might > > > > ultimately be dictated by the REST catalog implementation. > > > > > > > > > > > > A best practice would be for the server to strive to support all > > Iceberg > > > > versions, but the REST spec itself should remain flexible enough to > > > > accommodate less strict table specs. For the case you mentioned, it > > should > > > > be fine to make sequence number optional since the spec has to support > > v1 > > > > table spec. It does feel confusing though. > > > > > > > > > > > > WDYT? > > > > > > > > > > > > Yufei > > > > > > > > > > > > On Thu, Oct 17, 2024 at 1:56 PM Anton Okolnychyi < > > aokolnyc...@gmail.com> > > > > wrote: > > > > > > > >> Well, the spec says nothing about a top-level `operation` field in > > JSON > > > >> [1]. Yet the Java implementation produces it [2] and removes the > > operation > > > >> from the summary map. This seems inconsistent? > > > >> > > > >> - Anton > > > >> > > > >> [1] - https://iceberg.apache.org/spec/#snapshots > > > >> [2] - > > > >> > > https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L63 > > > >> > > > >> > > > >> чт, 17 жовт. 2024 р. о 10:06 Sung Yun <sun...@apache.org> пише: > > > >> > > > >>> > As a side note, the `rest-catalog-open-api.yaml` file [2] in the > > > >>> Iceberg repo contains the latest version of the spec. > > > >>> > > > >>> I think more clarity on this would be helpful. Is it really the case > > > >>> that the Open API spec contains the latest version of the spec? For > > > >>> example, I'm noticing a discrepancy between sequence-number in the > > Table > > > >>> Spec and in the Open API Spec... > > > >>> > > > >>> In the table spec, it's required for V2, but it's optional in the > > REST > > > >>> API Spec: > > > >>> > > > >>> > > https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L2319-L2335 > > > >>> > > > >>> On 2024/10/17 16:58:17 Kevin Liu wrote: > > > >>> > > Based on the example metadata, that looks like it is not to > > spec, so > > > >>> it's > > > >>> > reasonable that python would reject it. If the java > > implementation is > > > >>> > allowing for that, it's likely that we're being too relaxed > > (possibly a > > > >>> > holdover from v1 parsing). > > > >>> > I believe the Java implementation is relaxing the constraint. I'll > > > >>> create a > > > >>> > PR with test cases and the necessary changes. > > > >>> > > > > >>> > > Do you know what produced the metadata? > > > >>> > It was created by Snowflake [1]. After verifying this, I'll look > > into > > > >>> > raising the issue with them. > > > >>> > > > > >>> > As a side note, the `rest-catalog-open-api.yaml` file [2] in the > > > >>> Iceberg > > > >>> > repo contains the latest version of the spec. As we're continuing > > to > > > >>> evolve > > > >>> > to spec for V3, would it be helpful to create a frozen version > > > >>> representing > > > >>> > both the V1 and V2 specs for reference, possibly as a separate > > file? > > > >>> > > > > >>> > Best, > > > >>> > Kevin Liu > > > >>> > > > > >>> > [1] > > > >>> > > > > >>> > > https://github.com/apache/iceberg-python/issues/1106#issuecomment-2312108455 > > > >>> > [2] > > > >>> > > > > >>> > > https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml > > > >>> > > > > >>> > On Thu, Oct 17, 2024 at 9:20 AM Daniel Weeks <dwe...@apache.org> > > > >>> wrote: > > > >>> > > > > >>> > > Sung, > > > >>> > > > > > >>> > > I was thinking of v1, so you're right that manifest-list and > > summary > > > >>> are > > > >>> > > required as of v2. The REST Spec seems to follow the v2 > > definition, > > > >>> so I > > > >>> > > think we're somewhat implicitly requiring those fields via REST. > > > >>> > > > > > >>> > > Kevin, > > > >>> > > > > > >>> > > Based on the example metadata, that looks like it is not to > > spec, so > > > >>> it's > > > >>> > > reasonable that python would reject it. If the java > > implementation > > > >>> is > > > >>> > > allowing for that, it's likely that we're being too relaxed > > > >>> (possibly a > > > >>> > > holdover from v1 parsing). > > > >>> > > > > > >>> > > Do you know what produced the metadata? > > > >>> > > > > > >>> > > -Dan > > > >>> > > > > > >>> > > On Thu, Oct 17, 2024 at 9:02 AM Kevin Liu < > > kevin.jq....@gmail.com> > > > >>> wrote: > > > >>> > > > > > >>> > >> Thanks for the additional context. > > > >>> > >> > > > >>> > >> My understanding is that if a Snapshot has a `summary` field, it > > > >>> must > > > >>> > >> also have a corresponding `operation` key in the summary map. Is > > > >>> that > > > >>> > >> correct? Based on the `SnapshotParser`, this is not enforced > > [1]. > > > >>> > >> > > > >>> > >> The underlying issue in #1106 [2] is the missing `operation` > > field > > > >>> when > > > >>> > >> the `summary` field is present. > > > >>> > >> For example, > > > >>> > >> ``` > > > >>> > >> "summary" : { > > > >>> > >> "manifests-created" : "8", > > > >>> > >> "total-records" : "26508666891", > > > >>> > >> "added-files-size" : "3927895626752", > > > >>> > >> "manifests-kept" : "0", > > > >>> > >> "total-files-size" : "3927895626752", > > > >>> > >> "added-records" : "26508666891", > > > >>> > >> "added-data-files" : "231513", > > > >>> > >> "manifests-replaced" : "0", > > > >>> > >> "total-data-files" : "231513" > > > >>> > >> } > > > >>> > >> ``` > > > >>> > >> > > > >>> > >> It could be the case that this particular `metadata.json` was > > > >>> generated > > > >>> > >> not according to the spec. > > > >>> > >> > > > >>> > >> Best, > > > >>> > >> Kevin Liu > > > >>> > >> > > > >>> > >> > > > >>> > >> [1] > > > >>> > >> > > > >>> > > https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124-L142 > > > >>> > >> [2] https://github.com/apache/iceberg-python/issues/1106 > > > >>> > >> > > > >>> > >> > > > >>> > >> On Thu, Oct 17, 2024 at 8:47 AM Sung Yun <sun...@apache.org> > > wrote: > > > >>> > >> > > > >>> > >>> Thank you for the clarification Daniel, and thank you Kevin for > > > >>> raising > > > >>> > >>> this issue! > > > >>> > >>> > > > >>> > >>> Does that mean that we are creating component schemas that are > > the > > > >>> > >>> superset of the V1 and V2 schemas? And if so, should we remove > > > >>> summary and > > > >>> > >>> manifest-list from the required properties, and add manifests > > > >>> optional > > > >>> > >>> property to the Snapshot schema to support both V1 and V2 > > Summary > > > >>> specs? > > > >>> > >>> https://iceberg.apache.org/spec/#snapshots > > > >>> > >>> > > > >>> > >>> Or would creating separate component schemas for V1/V2 be a > > > >>> cleaner way > > > >>> > >>> to align the REST spec with the table spec? > > > >>> > >>> > > > >>> > >>> Sung > > > >>> > >>> > > > >>> > >>> On 2024/10/17 15:19:23 Daniel Weeks wrote: > > > >>> > >>> > I'm not convinced this is incorrect behavior (table spec or > > > >>> > >>> > implementation), but it does lend to some confusion. The > > > >>> 'summary' > > > >>> > >>> field > > > >>> > >>> > is optional, which means that if a summary is not provided, > > you > > > >>> do not > > > >>> > >>> have > > > >>> > >>> > an associated 'operation' field. The 'operation' field is > > only > > > >>> > >>> required in > > > >>> > >>> > the context of the summary, so it's actually possible for the > > > >>> > >>> > implementation (i.e. the tests you reference) to not have an > > > >>> operation. > > > >>> > >>> > > > > >>> > >>> > I think what is wrong here is that the REST spec marked the > > > >>> summary as > > > >>> > >>> > required > > > >>> > >>> > < > > > >>> > >>> > > > >>> > > https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml#L2040 > > > >>> > >>> >, > > > >>> > >>> > which is inconsistent with the table spec. > > > >>> > >>> > > > > >>> > >>> > On Wed, Oct 16, 2024 at 3:52 PM Anton Okolnychyi < > > > >>> > >>> aokolnyc...@gmail.com> > > > >>> > >>> > wrote: > > > >>> > >>> > > > > >>> > >>> > > Based on [1], we never persisted the operation in the > > summary > > > >>> map. > > > >>> > >>> > > Instead, we persisted it as a top-level field in Java, > > which is > > > >>> > >>> actually > > > >>> > >>> > > NOT what the spec says. Does anyone remember cases when the > > > >>> > >>> operation was > > > >>> > >>> > > unknown? I personally don't. > > > >>> > >>> > > > > > >>> > >>> > > [1] - > > > >>> > >>> > > > > > >>> > >>> > > > >>> > > https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L63 > > > >>> > >>> > > > > > >>> > >>> > > > > > >>> > >>> > > ср, 16 жовт. 2024 р. о 12:42 Kevin Liu < > > kevin.jq....@gmail.com > > > >>> > > > > >>> > >>> пише: > > > >>> > >>> > > > > > >>> > >>> > >> Hey folks, > > > >>> > >>> > >> > > > >>> > >>> > >> I’ve noticed a discrepancy between the Iceberg > > specification > > > >>> and > > > >>> > >>> the Java > > > >>> > >>> > >> implementation regarding the `operation` key in the > > `Snapshot` > > > >>> > >>> `summary` > > > >>> > >>> > >> field. > > > >>> > >>> > >> > > > >>> > >>> > >> The `Snapshot` object's `summary` dictionary includes a > > > >>> *required* > > > >>> > >>> key > > > >>> > >>> > >> named `operation`, as outlined in the spec describing > > Table > > > >>> > >>> Metadata and > > > >>> > >>> > >> Snapshots [1] and the generated OpenAPI YAML [2]. > > However, in > > > >>> the > > > >>> > >>> Java > > > >>> > >>> > >> implementation [3], `operation` is treated as optional. In > > > >>> > >>> contrast, it > > > >>> > >>> > >> remains a required field in the Python implementation [4]. > > > >>> > >>> > >> I also found that Java tests for `SnapshotParser` assert > > that > > > >>> the > > > >>> > >>> > >> `operation` field is null. [5] > > > >>> > >>> > >> > > > >>> > >>> > >> Due to this discrepancy, a user reported [6] that the > > > >>> > >>> `metadata.json` > > > >>> > >>> > >> file generated for an Iceberg table could not be read by > > > >>> PyIceberg, > > > >>> > >>> though > > > >>> > >>> > >> it is readable using the Iceberg Java library. > > > >>> > >>> > >> > > > >>> > >>> > >> How should we proceed from here? Should the Java library > > > >>> enforce > > > >>> > >>> this > > > >>> > >>> > >> requirement? Additionally, how should we handle existing > > > >>> > >>> `metadata.json` > > > >>> > >>> > >> files that were generated without this field? > > > >>> > >>> > >> > > > >>> > >>> > >> Best, > > > >>> > >>> > >> Kevin Liu > > > >>> > >>> > >> > > > >>> > >>> > >> [1] > > > >>> https://iceberg.apache.org/spec/#table-metadata-and-snapshots > > > >>> > >>> > >> [2] > > > >>> > >>> > >> > > > >>> > >>> > > > >>> > > https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml#L2057-L2060 > > > >>> > >>> > >> [3] > > > >>> > >>> > >> > > > >>> > >>> > > > >>> > > https://github.com/apache/iceberg/blob/64b36999d7ff716ae2534fb0972fcc10d22a64c2/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124 > > > >>> > >>> > >> [4] > > > >>> > >>> > >> > > > >>> > >>> > > > >>> > > https://github.com/apache/iceberg-python/blob/7cf0c225c3cdb32ac5e390de06b7b0e4fe7de92e/pyiceberg/table/snapshots.py#L182 > > > >>> > >>> > >> [5] > > > >>> > >>> > >> > > > >>> > >>> > > > >>> > > https://github.com/apache/iceberg/blob/22a6b19c2e226eacc0aa78c1f2ffbdbb168b13be/core/src/test/java/org/apache/iceberg/TestSnapshotJson.java#L52 > > > >>> > >>> > >> [6] https://github.com/apache/iceberg-python/issues/1106 > > > >>> > >>> > >> > > > >>> > >>> > >> > > > >>> > >>> > > > > >>> > >>> > > > >>> > >> > > > >>> > > > > >>> > > > >> > > > > > >