Well, the spec says nothing about a top-level `operation` field in JSON [1]. Yet the Java implementation produces it [2] and removes the operation from the summary map. This seems inconsistent?
- Anton [1] - https://iceberg.apache.org/spec/#snapshots [2] - https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L63 чт, 17 жовт. 2024 р. о 10:06 Sung Yun <sun...@apache.org> пише: > > As a side note, the `rest-catalog-open-api.yaml` file [2] in the Iceberg > repo contains the latest version of the spec. > > I think more clarity on this would be helpful. Is it really the case that > the Open API spec contains the latest version of the spec? For example, I'm > noticing a discrepancy between sequence-number in the Table Spec and in the > Open API Spec... > > In the table spec, it's required for V2, but it's optional in the REST API > Spec: > > https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L2319-L2335 > > On 2024/10/17 16:58:17 Kevin Liu wrote: > > > Based on the example metadata, that looks like it is not to spec, so > it's > > reasonable that python would reject it. If the java implementation is > > allowing for that, it's likely that we're being too relaxed (possibly a > > holdover from v1 parsing). > > I believe the Java implementation is relaxing the constraint. I'll > create a > > PR with test cases and the necessary changes. > > > > > Do you know what produced the metadata? > > It was created by Snowflake [1]. After verifying this, I'll look into > > raising the issue with them. > > > > As a side note, the `rest-catalog-open-api.yaml` file [2] in the Iceberg > > repo contains the latest version of the spec. As we're continuing to > evolve > > to spec for V3, would it be helpful to create a frozen version > representing > > both the V1 and V2 specs for reference, possibly as a separate file? > > > > Best, > > Kevin Liu > > > > [1] > > > https://github.com/apache/iceberg-python/issues/1106#issuecomment-2312108455 > > [2] > > > https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml > > > > On Thu, Oct 17, 2024 at 9:20 AM Daniel Weeks <dwe...@apache.org> wrote: > > > > > Sung, > > > > > > I was thinking of v1, so you're right that manifest-list and summary > are > > > required as of v2. The REST Spec seems to follow the v2 definition, > so I > > > think we're somewhat implicitly requiring those fields via REST. > > > > > > Kevin, > > > > > > Based on the example metadata, that looks like it is not to spec, so > it's > > > reasonable that python would reject it. If the java implementation is > > > allowing for that, it's likely that we're being too relaxed (possibly a > > > holdover from v1 parsing). > > > > > > Do you know what produced the metadata? > > > > > > -Dan > > > > > > On Thu, Oct 17, 2024 at 9:02 AM Kevin Liu <kevin.jq....@gmail.com> > wrote: > > > > > >> Thanks for the additional context. > > >> > > >> My understanding is that if a Snapshot has a `summary` field, it must > > >> also have a corresponding `operation` key in the summary map. Is that > > >> correct? Based on the `SnapshotParser`, this is not enforced [1]. > > >> > > >> The underlying issue in #1106 [2] is the missing `operation` field > when > > >> the `summary` field is present. > > >> For example, > > >> ``` > > >> "summary" : { > > >> "manifests-created" : "8", > > >> "total-records" : "26508666891", > > >> "added-files-size" : "3927895626752", > > >> "manifests-kept" : "0", > > >> "total-files-size" : "3927895626752", > > >> "added-records" : "26508666891", > > >> "added-data-files" : "231513", > > >> "manifests-replaced" : "0", > > >> "total-data-files" : "231513" > > >> } > > >> ``` > > >> > > >> It could be the case that this particular `metadata.json` was > generated > > >> not according to the spec. > > >> > > >> Best, > > >> Kevin Liu > > >> > > >> > > >> [1] > > >> > https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124-L142 > > >> [2] https://github.com/apache/iceberg-python/issues/1106 > > >> > > >> > > >> On Thu, Oct 17, 2024 at 8:47 AM Sung Yun <sun...@apache.org> wrote: > > >> > > >>> Thank you for the clarification Daniel, and thank you Kevin for > raising > > >>> this issue! > > >>> > > >>> Does that mean that we are creating component schemas that are the > > >>> superset of the V1 and V2 schemas? And if so, should we remove > summary and > > >>> manifest-list from the required properties, and add manifests > optional > > >>> property to the Snapshot schema to support both V1 and V2 Summary > specs? > > >>> https://iceberg.apache.org/spec/#snapshots > > >>> > > >>> Or would creating separate component schemas for V1/V2 be a cleaner > way > > >>> to align the REST spec with the table spec? > > >>> > > >>> Sung > > >>> > > >>> On 2024/10/17 15:19:23 Daniel Weeks wrote: > > >>> > I'm not convinced this is incorrect behavior (table spec or > > >>> > implementation), but it does lend to some confusion. The 'summary' > > >>> field > > >>> > is optional, which means that if a summary is not provided, you do > not > > >>> have > > >>> > an associated 'operation' field. The 'operation' field is only > > >>> required in > > >>> > the context of the summary, so it's actually possible for the > > >>> > implementation (i.e. the tests you reference) to not have an > operation. > > >>> > > > >>> > I think what is wrong here is that the REST spec marked the > summary as > > >>> > required > > >>> > < > > >>> > https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml#L2040 > > >>> >, > > >>> > which is inconsistent with the table spec. > > >>> > > > >>> > On Wed, Oct 16, 2024 at 3:52 PM Anton Okolnychyi < > > >>> aokolnyc...@gmail.com> > > >>> > wrote: > > >>> > > > >>> > > Based on [1], we never persisted the operation in the summary > map. > > >>> > > Instead, we persisted it as a top-level field in Java, which is > > >>> actually > > >>> > > NOT what the spec says. Does anyone remember cases when the > > >>> operation was > > >>> > > unknown? I personally don't. > > >>> > > > > >>> > > [1] - > > >>> > > > > >>> > https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L63 > > >>> > > > > >>> > > > > >>> > > ср, 16 жовт. 2024 р. о 12:42 Kevin Liu <kevin.jq....@gmail.com> > > >>> пише: > > >>> > > > > >>> > >> Hey folks, > > >>> > >> > > >>> > >> I’ve noticed a discrepancy between the Iceberg specification and > > >>> the Java > > >>> > >> implementation regarding the `operation` key in the `Snapshot` > > >>> `summary` > > >>> > >> field. > > >>> > >> > > >>> > >> The `Snapshot` object's `summary` dictionary includes a > *required* > > >>> key > > >>> > >> named `operation`, as outlined in the spec describing Table > > >>> Metadata and > > >>> > >> Snapshots [1] and the generated OpenAPI YAML [2]. However, in > the > > >>> Java > > >>> > >> implementation [3], `operation` is treated as optional. In > > >>> contrast, it > > >>> > >> remains a required field in the Python implementation [4]. > > >>> > >> I also found that Java tests for `SnapshotParser` assert that > the > > >>> > >> `operation` field is null. [5] > > >>> > >> > > >>> > >> Due to this discrepancy, a user reported [6] that the > > >>> `metadata.json` > > >>> > >> file generated for an Iceberg table could not be read by > PyIceberg, > > >>> though > > >>> > >> it is readable using the Iceberg Java library. > > >>> > >> > > >>> > >> How should we proceed from here? Should the Java library enforce > > >>> this > > >>> > >> requirement? Additionally, how should we handle existing > > >>> `metadata.json` > > >>> > >> files that were generated without this field? > > >>> > >> > > >>> > >> Best, > > >>> > >> Kevin Liu > > >>> > >> > > >>> > >> [1] > https://iceberg.apache.org/spec/#table-metadata-and-snapshots > > >>> > >> [2] > > >>> > >> > > >>> > https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml#L2057-L2060 > > >>> > >> [3] > > >>> > >> > > >>> > https://github.com/apache/iceberg/blob/64b36999d7ff716ae2534fb0972fcc10d22a64c2/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124 > > >>> > >> [4] > > >>> > >> > > >>> > https://github.com/apache/iceberg-python/blob/7cf0c225c3cdb32ac5e390de06b7b0e4fe7de92e/pyiceberg/table/snapshots.py#L182 > > >>> > >> [5] > > >>> > >> > > >>> > https://github.com/apache/iceberg/blob/22a6b19c2e226eacc0aa78c1f2ffbdbb168b13be/core/src/test/java/org/apache/iceberg/TestSnapshotJson.java#L52 > > >>> > >> [6] https://github.com/apache/iceberg-python/issues/1106 > > >>> > >> > > >>> > >> > > >>> > > > >>> > > >> > > >