Well, the spec says nothing about a top-level `operation` field in JSON
[1]. Yet the Java implementation produces it [2] and removes the operation
from the summary map. This seems inconsistent?

- Anton

[1] - https://iceberg.apache.org/spec/#snapshots
[2] -
https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L63


чт, 17 жовт. 2024 р. о 10:06 Sung Yun <sun...@apache.org> пише:

> > As a side note, the `rest-catalog-open-api.yaml` file [2] in the Iceberg
> repo contains the latest version of the spec.
>
> I think more clarity on this would be helpful. Is it really the case that
> the Open API spec contains the latest version of the spec? For example, I'm
> noticing a discrepancy between sequence-number in the Table Spec and in the
> Open API Spec...
>
> In the table spec, it's required for V2, but it's optional in the REST API
> Spec:
>
> https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L2319-L2335
>
> On 2024/10/17 16:58:17 Kevin Liu wrote:
> > > Based on the example metadata, that looks like it is not to spec, so
> it's
> > reasonable that python would reject it.  If the java implementation is
> > allowing for that, it's likely that we're being too relaxed (possibly a
> > holdover from v1 parsing).
> > I believe the Java implementation is relaxing the constraint. I'll
> create a
> > PR with test cases and the necessary changes.
> >
> > > Do you know what produced the metadata?
> > It was created by Snowflake [1]. After verifying this, I'll look into
> > raising the issue with them.
> >
> > As a side note, the `rest-catalog-open-api.yaml` file [2] in the Iceberg
> > repo contains the latest version of the spec. As we're continuing to
> evolve
> > to spec for V3, would it be helpful to create a frozen version
> representing
> > both the V1 and V2 specs for reference, possibly as a separate file?
> >
> > Best,
> > Kevin Liu
> >
> > [1]
> >
> https://github.com/apache/iceberg-python/issues/1106#issuecomment-2312108455
> > [2]
> >
> https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml
> >
> > On Thu, Oct 17, 2024 at 9:20 AM Daniel Weeks <dwe...@apache.org> wrote:
> >
> > > Sung,
> > >
> > > I was thinking of v1, so you're right that manifest-list and summary
> are
> > > required as of v2.  The REST Spec seems to follow the v2 definition,
> so I
> > > think we're somewhat implicitly requiring those fields via REST.
> > >
> > > Kevin,
> > >
> > > Based on the example metadata, that looks like it is not to spec, so
> it's
> > > reasonable that python would reject it.  If the java implementation is
> > > allowing for that, it's likely that we're being too relaxed (possibly a
> > > holdover from v1 parsing).
> > >
> > > Do you know what produced the metadata?
> > >
> > > -Dan
> > >
> > > On Thu, Oct 17, 2024 at 9:02 AM Kevin Liu <kevin.jq....@gmail.com>
> wrote:
> > >
> > >> Thanks for the additional context.
> > >>
> > >> My understanding is that if a Snapshot has a `summary` field, it must
> > >> also have a corresponding `operation` key in the summary map. Is that
> > >> correct? Based on the `SnapshotParser`, this is not enforced [1].
> > >>
> > >> The underlying issue in #1106 [2] is the missing `operation` field
> when
> > >> the `summary` field is present.
> > >> For example,
> > >> ```
> > >>     "summary" : {
> > >>       "manifests-created" : "8",
> > >>       "total-records" : "26508666891",
> > >>       "added-files-size" : "3927895626752",
> > >>       "manifests-kept" : "0",
> > >>       "total-files-size" : "3927895626752",
> > >>       "added-records" : "26508666891",
> > >>       "added-data-files" : "231513",
> > >>       "manifests-replaced" : "0",
> > >>       "total-data-files" : "231513"
> > >>     }
> > >> ```
> > >>
> > >> It could be the case that this particular `metadata.json` was
> generated
> > >> not according to the spec.
> > >>
> > >> Best,
> > >> Kevin Liu
> > >>
> > >>
> > >> [1]
> > >>
> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124-L142
> > >> [2] https://github.com/apache/iceberg-python/issues/1106
> > >>
> > >>
> > >> On Thu, Oct 17, 2024 at 8:47 AM Sung Yun <sun...@apache.org> wrote:
> > >>
> > >>> Thank you for the clarification Daniel, and thank you Kevin for
> raising
> > >>> this issue!
> > >>>
> > >>> Does that mean that we are creating component schemas that are the
> > >>> superset of the V1 and V2 schemas? And if so, should we remove
> summary and
> > >>> manifest-list from the required properties, and add manifests
> optional
> > >>> property to the Snapshot schema to support both V1 and V2 Summary
> specs?
> > >>> https://iceberg.apache.org/spec/#snapshots
> > >>>
> > >>> Or would creating separate component schemas for V1/V2 be a cleaner
> way
> > >>> to align the REST spec with the table spec?
> > >>>
> > >>> Sung
> > >>>
> > >>> On 2024/10/17 15:19:23 Daniel Weeks wrote:
> > >>> > I'm not convinced this is incorrect behavior (table spec or
> > >>> > implementation), but it does lend to some confusion.  The 'summary'
> > >>> field
> > >>> > is optional, which means that if a summary is not provided, you do
> not
> > >>> have
> > >>> > an associated 'operation' field.  The 'operation' field is only
> > >>> required in
> > >>> > the context of the summary, so it's actually possible for the
> > >>> > implementation (i.e. the tests you reference) to not have an
> operation.
> > >>> >
> > >>> > I think what is wrong here is that the REST spec marked the
> summary as
> > >>> > required
> > >>> > <
> > >>>
> https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml#L2040
> > >>> >,
> > >>> > which is inconsistent with the table spec.
> > >>> >
> > >>> > On Wed, Oct 16, 2024 at 3:52 PM Anton Okolnychyi <
> > >>> aokolnyc...@gmail.com>
> > >>> > wrote:
> > >>> >
> > >>> > > Based on [1], we never persisted the operation in the summary
> map.
> > >>> > > Instead, we persisted it as a top-level field in Java, which is
> > >>> actually
> > >>> > > NOT what the spec says. Does anyone remember cases when the
> > >>> operation was
> > >>> > > unknown? I personally don't.
> > >>> > >
> > >>> > > [1] -
> > >>> > >
> > >>>
> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L63
> > >>> > >
> > >>> > >
> > >>> > > ср, 16 жовт. 2024 р. о 12:42 Kevin Liu <kevin.jq....@gmail.com>
> > >>> пише:
> > >>> > >
> > >>> > >> Hey folks,
> > >>> > >>
> > >>> > >> I’ve noticed a discrepancy between the Iceberg specification and
> > >>> the Java
> > >>> > >> implementation regarding the `operation` key in the `Snapshot`
> > >>> `summary`
> > >>> > >> field.
> > >>> > >>
> > >>> > >> The `Snapshot` object's `summary` dictionary includes a
> *required*
> > >>> key
> > >>> > >> named `operation`, as outlined in the spec describing Table
> > >>> Metadata and
> > >>> > >> Snapshots [1] and the generated OpenAPI YAML [2]. However, in
> the
> > >>> Java
> > >>> > >> implementation [3], `operation` is treated as optional. In
> > >>> contrast, it
> > >>> > >> remains a required field in the Python implementation [4].
> > >>> > >> I also found that Java tests for `SnapshotParser` assert that
> the
> > >>> > >> `operation` field is null. [5]
> > >>> > >>
> > >>> > >> Due to this discrepancy, a user reported [6] that the
> > >>> `metadata.json`
> > >>> > >> file generated for an Iceberg table could not be read by
> PyIceberg,
> > >>> though
> > >>> > >> it is readable using the Iceberg Java library.
> > >>> > >>
> > >>> > >> How should we proceed from here? Should the Java library enforce
> > >>> this
> > >>> > >> requirement? Additionally, how should we handle existing
> > >>> `metadata.json`
> > >>> > >> files that were generated without this field?
> > >>> > >>
> > >>> > >> Best,
> > >>> > >> Kevin Liu
> > >>> > >>
> > >>> > >> [1]
> https://iceberg.apache.org/spec/#table-metadata-and-snapshots
> > >>> > >> [2]
> > >>> > >>
> > >>>
> https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml#L2057-L2060
> > >>> > >> [3]
> > >>> > >>
> > >>>
> https://github.com/apache/iceberg/blob/64b36999d7ff716ae2534fb0972fcc10d22a64c2/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124
> > >>> > >> [4]
> > >>> > >>
> > >>>
> https://github.com/apache/iceberg-python/blob/7cf0c225c3cdb32ac5e390de06b7b0e4fe7de92e/pyiceberg/table/snapshots.py#L182
> > >>> > >> [5]
> > >>> > >>
> > >>>
> https://github.com/apache/iceberg/blob/22a6b19c2e226eacc0aa78c1f2ffbdbb168b13be/core/src/test/java/org/apache/iceberg/TestSnapshotJson.java#L52
> > >>> > >> [6] https://github.com/apache/iceberg-python/issues/1106
> > >>> > >>
> > >>> > >>
> > >>> >
> > >>>
> > >>
> >
>

Reply via email to