Thanks for the additional context.

My understanding is that if a Snapshot has a `summary` field, it must also
have a corresponding `operation` key in the summary map. Is that correct?
Based on the `SnapshotParser`, this is not enforced [1].

The underlying issue in #1106 [2] is the missing `operation` field when the
`summary` field is present.
For example,
```
    "summary" : {
      "manifests-created" : "8",
      "total-records" : "26508666891",
      "added-files-size" : "3927895626752",
      "manifests-kept" : "0",
      "total-files-size" : "3927895626752",
      "added-records" : "26508666891",
      "added-data-files" : "231513",
      "manifests-replaced" : "0",
      "total-data-files" : "231513"
    }
```

It could be the case that this particular `metadata.json` was generated not
according to the spec.

Best,
Kevin Liu


[1]
https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124-L142
[2] https://github.com/apache/iceberg-python/issues/1106


On Thu, Oct 17, 2024 at 8:47 AM Sung Yun <sun...@apache.org> wrote:

> Thank you for the clarification Daniel, and thank you Kevin for raising
> this issue!
>
> Does that mean that we are creating component schemas that are the
> superset of the V1 and V2 schemas? And if so, should we remove summary and
> manifest-list from the required properties, and add manifests optional
> property to the Snapshot schema to support both V1 and V2 Summary specs?
> https://iceberg.apache.org/spec/#snapshots
>
> Or would creating separate component schemas for V1/V2 be a cleaner way to
> align the REST spec with the table spec?
>
> Sung
>
> On 2024/10/17 15:19:23 Daniel Weeks wrote:
> > I'm not convinced this is incorrect behavior (table spec or
> > implementation), but it does lend to some confusion.  The 'summary' field
> > is optional, which means that if a summary is not provided, you do not
> have
> > an associated 'operation' field.  The 'operation' field is only required
> in
> > the context of the summary, so it's actually possible for the
> > implementation (i.e. the tests you reference) to not have an operation.
> >
> > I think what is wrong here is that the REST spec marked the summary as
> > required
> > <
> https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml#L2040
> >,
> > which is inconsistent with the table spec.
> >
> > On Wed, Oct 16, 2024 at 3:52 PM Anton Okolnychyi <aokolnyc...@gmail.com>
> > wrote:
> >
> > > Based on [1], we never persisted the operation in the summary map.
> > > Instead, we persisted it as a top-level field in Java, which is
> actually
> > > NOT what the spec says. Does anyone remember cases when the operation
> was
> > > unknown? I personally don't.
> > >
> > > [1] -
> > >
> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L63
> > >
> > >
> > > ср, 16 жовт. 2024 р. о 12:42 Kevin Liu <kevin.jq....@gmail.com> пише:
> > >
> > >> Hey folks,
> > >>
> > >> I’ve noticed a discrepancy between the Iceberg specification and the
> Java
> > >> implementation regarding the `operation` key in the `Snapshot`
> `summary`
> > >> field.
> > >>
> > >> The `Snapshot` object's `summary` dictionary includes a *required* key
> > >> named `operation`, as outlined in the spec describing Table Metadata
> and
> > >> Snapshots [1] and the generated OpenAPI YAML [2]. However, in the Java
> > >> implementation [3], `operation` is treated as optional. In contrast,
> it
> > >> remains a required field in the Python implementation [4].
> > >> I also found that Java tests for `SnapshotParser` assert that the
> > >> `operation` field is null. [5]
> > >>
> > >> Due to this discrepancy, a user reported [6] that the `metadata.json`
> > >> file generated for an Iceberg table could not be read by PyIceberg,
> though
> > >> it is readable using the Iceberg Java library.
> > >>
> > >> How should we proceed from here? Should the Java library enforce this
> > >> requirement? Additionally, how should we handle existing
> `metadata.json`
> > >> files that were generated without this field?
> > >>
> > >> Best,
> > >> Kevin Liu
> > >>
> > >> [1] https://iceberg.apache.org/spec/#table-metadata-and-snapshots
> > >> [2]
> > >>
> https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml#L2057-L2060
> > >> [3]
> > >>
> https://github.com/apache/iceberg/blob/64b36999d7ff716ae2534fb0972fcc10d22a64c2/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124
> > >> [4]
> > >>
> https://github.com/apache/iceberg-python/blob/7cf0c225c3cdb32ac5e390de06b7b0e4fe7de92e/pyiceberg/table/snapshots.py#L182
> > >> [5]
> > >>
> https://github.com/apache/iceberg/blob/22a6b19c2e226eacc0aa78c1f2ffbdbb168b13be/core/src/test/java/org/apache/iceberg/TestSnapshotJson.java#L52
> > >> [6] https://github.com/apache/iceberg-python/issues/1106
> > >>
> > >>
> >
>

Reply via email to