I think we more or less agree here, and as Russel points out, it is a bit of a chicken and egg. I would suggest reading the faulty metadata and emitting a warning that it defaults to overwrite:
- Java: https://github.com/apache/iceberg/pull/11421 - PyIceberg: https://github.com/apache/iceberg-python/pull/1263 This way we always produce data that's compliant with the spec, and we're more permissive when reading data. Adding an unknown operation (similar to null) adds another state that is not supported by clients that implement the spec. Kind regards, Fokko Op di 29 okt 2024 om 05:36 schreef Amogh Jahagirdar <2am...@gmail.com>: > To clarify what I meant by my last statement: > > > I was originally going to propose having a notion of an "unknown" > operation on read when the field is null but the operations are defined in > the spec so we would need to change that to produce valid metadata on write. > After some more thought, I like Fokko's idea of using "overwrite" since I > *think *that it should be able to safely encompass every combination of > data + delete file writes without breaking semantics in user applications > which rely on operations being defined. > > I particularly meant this is what I think we'd want to do if we still feel > like we want to be permissive. My preference is that we should be strict > since it's required, or revisit if it really should be required if there's > doubts. > > Thanks, > > Amogh Jahagirdar > > > On Mon, Oct 28, 2024 at 10:33 PM Amogh Jahagirdar <2am...@gmail.com> > wrote: > >> I largely agree with Russell, I feel like ultimately we should be taking >> a firm stance here on if it's required or optional. I think the issue with >> making it required but then on read in the reference implementation having >> it be more permissive is that this ends up creating a "case by case" >> ambiguity on which fields could be considered as necessary or not for read, >> when we already have the mechanism in the spec just by defining it via a >> binary required or optional. I also think that the reference implementation >> being more permissive on read opens up the door for other implementations >> to also just follow that, and then it compromises the value of having the >> field be required in the spec to begin with. >> >> I'll also add that I noticed that there are parts in the implementation >> which currently are written with the expectation that operation is defined >> (unless I'm missing something). Some examples: >> 1. IncrementalDataTableScan >> <https://github.com/apache/iceberg/blob/1e3ee1e4e80873018af716a190e541925f09c285/core/src/main/java/org/apache/iceberg/IncrementalDataTableScan.java#L135>, >> ChangelogScan >> <https://github.com/apache/iceberg/blob/1e3ee1e4e80873018af716a190e541925f09c285/core/src/main/java/org/apache/iceberg/BaseIncrementalChangelogScan.java#L107> >> 2. MergingSnapshotProdu >> <https://github.com/apache/iceberg/blob/1e3ee1e4e80873018af716a190e541925f09c285/core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java#L787>cer >> (for this one, we may be getting lucky in the case it's null because the >> set check may just return false, and everything just works but it's not >> super obvious imo) >> >> So we may need to look further at these, depending on the direction we >> want to go. >> >> I was originally going to propose having a notion of an "unknown" >> operation on read when the field is null but the operations are defined in >> the spec so we would need to change that to produce valid metadata on write. >> After some more thought, I like Fokko's idea of using "overwrite" since I >> *think >> *that it should be able to safely encompass every combination of data + >> delete file writes without breaking semantics in user applications which >> rely on operations being defined. >> >> Thanks, >> >> Amogh Jahagirdar >> >> >> >> On Mon, Oct 28, 2024 at 5:46 PM Russell Spitzer < >> russell.spit...@gmail.com> wrote: >> >>> It's fine to not cause failures, all we have to do is change the spec to >>> say that it's not required. If the spec says it's required then the >>> reference implementation should require it. If the reference implementation >>> doesn't require it, then should any implementation? >>> >>> On Mon, Oct 28, 2024 at 6:40 PM rdb...@gmail.com <rdb...@gmail.com> >>> wrote: >>> >>>> > I'd rather we just enforce the spec as written and provide >>>> separate functionality for cleaning/repairing out of spec metadata. >>>> >>>> I definitely prefer to keep metadata clean, but is it really worth >>>> causing failures? When there's a problem with the metadata for a table, >>>> it's unlikely that a user is going to be able to go and fix it themselves. >>>> So if there's a bug in a library that produces bad metadata (for example, a >>>> case statement that returns null instead of throwing an exception to >>>> generate the operation field) then the table would be wedged. We can >>>> hopefully fix a lot of this in catalog service implementations (reject bad >>>> commits) but I'm worried about the idea of just not being able to read or >>>> fix tables. >>>> >>>> On Mon, Oct 28, 2024 at 9:35 AM Russell Spitzer < >>>> russell.spit...@gmail.com> wrote: >>>> >>>>> This is one of the reasons I'm opposed to metadata we don't use/need. >>>>> We end up forking the spec and then we have some odd behaviors, a metadata >>>>> which is illegal in one implementation (PyIceberg) will be legal in >>>>> another >>>>> (Iceberg Java) (and with the suggestion above be silently modified to >>>>> bring >>>>> it to spec). I'd rather we just enforce the spec as written and provide >>>>> separate functionality for cleaning/repairing out of spec metadata. If we >>>>> do want to auto-correct metadata.json we need to do so at read time, >>>>> otherwise a user of the Iceberg library can read an "invalid" >>>>> metadata.json, try to use the operation key and fail in an unexpected way, >>>>> only to modify the table in an unrelated DML and have it start working >>>>> again >>>>> >>>>> That said I don't feel strongly enough to not revert if that's the >>>>> consensus here. At minimum we should forbid the Iceberg java >>>>> implementation >>>>> from being able to write a Snapshot without an Operation. >>>>> >>>>> On Mon, Oct 28, 2024 at 6:31 AM Fokko Driesprong <fo...@apache.org> >>>>> wrote: >>>>> >>>>>> Hey everyone, >>>>>> >>>>>> Thanks Kevin for spotting and raising this. >>>>>> >>>>>> I agree with Ryan that we still should be able to read the table. We >>>>>> had similar issues in the past (looking at you -1 for >>>>>> current-snapshot-id) >>>>>> which we can fix. Can I suggest when we encounter a missing operation, >>>>>> and >>>>>> we write the table again, that set the operation to overwrite? This seems >>>>>> to be the safest option. This way the Java library doesn't produce any >>>>>> metadata that conflicts with the spec. I checked quickly, and it looks >>>>>> like >>>>>> it allows writing a summary without an operation: >>>>>> >>>>>> @Test >>>>>> public void testJsonConversionSummaryWithoutOperationFails() { >>>>>> String json = >>>>>> String.format( >>>>>> "{\n" >>>>>> + " \"snapshot-id\" : 2,\n" >>>>>> + " \"parent-snapshot-id\" : 1,\n" >>>>>> + " \"timestamp-ms\" : %s,\n" >>>>>> + " \"summary\" : {\n" >>>>>> + " \"files-added\" : \"4\",\n" >>>>>> + " \"files-deleted\" : \"100\"\n" >>>>>> + " },\n" >>>>>> + " \"manifests\" : [ >>>>>> \"/tmp/manifest1.avro\", \"/tmp/manifest2.avro\" ],\n" >>>>>> + " \"schema-id\" : 3\n" >>>>>> + "}", >>>>>> System.currentTimeMillis()); >>>>>> >>>>>> Snapshot snap = SnapshotParser.fromJson(json); >>>>>> assertThat(snap.operation()).isNull(); >>>>>> assertThat(SnapshotParser.toJson(snap)).contains("operation"); // >>>>>> This fails >>>>>> } >>>>>> >>>>>> Kind regards, >>>>>> Fokko >>>>>> >>>>>> >>>>>> Op ma 28 okt 2024 om 00:16 schreef Kevin Liu <kevin.jq....@gmail.com >>>>>> >: >>>>>> >>>>>>> Hi Ryan, >>>>>>> >>>>>>> I've created a revert PR [1]. I agree that we should take a more >>>>>>> permissive approach when reading a table, allowing for reading >>>>>>> non-compliant table metadata, especially for an opportunity to "fix" the >>>>>>> metadata. However, I think we still need a way to enforce the table >>>>>>> specification to ensure that other operations interact with a compliant >>>>>>> table. >>>>>>> >>>>>>> Perhaps we could permit `SnapshotParser` to read the metadata but >>>>>>> enforce the `operation` field at a different location to guarantee the >>>>>>> table’s compliance with the specification. >>>>>>> >>>>>>> Best, >>>>>>> Kevin Liu >>>>>>> >>>>>>> [1] https://github.com/apache/iceberg/pull/11409 >>>>>>> >>>>>>> On Sat, Oct 26, 2024 at 11:05 AM rdb...@gmail.com <rdb...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> I see it's been merged, but I don't think it is a good idea to >>>>>>>> enforce this. The spec can and should require the `operation` but we >>>>>>>> want >>>>>>>> to be careful about creating situations where bad metadata can >>>>>>>> needlessly >>>>>>>> break a table. I would be much more permissive here, which is why this >>>>>>>> probably wasn't enforced in the first place. >>>>>>>> >>>>>>>> On Fri, Oct 25, 2024 at 2:36 PM Kevin Liu <kevin.jq....@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Thanks, everyone! The PR[1] has been merged >>>>>>>>> >>>>>>>>> Best, >>>>>>>>> Kevin Liu >>>>>>>>> >>>>>>>>> [1] https://github.com/apache/iceberg/pull/11354 >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, Oct 25, 2024 at 1:02 PM Kevin Liu <kevin.jq....@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Thanks, Ryan! That makes sense. >>>>>>>>>> >>>>>>>>>> I want to follow up on the original issue. I've made a PR [1] to >>>>>>>>>> enforce that the Snapshot `summary` map must have an `operation` key. >>>>>>>>>> Please take a look. Thank you @nastra for the comments and reviews. >>>>>>>>>> >>>>>>>>>> Best, >>>>>>>>>> Kevin Liu >>>>>>>>>> >>>>>>>>>> [1] https://github.com/apache/iceberg/pull/11354 >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Tue, Oct 22, 2024 at 4:06 PM rdb...@gmail.com < >>>>>>>>>> rdb...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> > For example, the `Snapshot` `summary` field is optional in V1 >>>>>>>>>>> but required in V2. Therefore, the REST spec definition should mark >>>>>>>>>>> the >>>>>>>>>>> `summary` field as optional to support both versions. >>>>>>>>>>> >>>>>>>>>>> Yeah, this is technically true. But as I said in my first email, >>>>>>>>>>> unless you have tables that are 5 years old, it's unlikely that >>>>>>>>>>> this is >>>>>>>>>>> going to be a problem. A failure here is more likely with newer >>>>>>>>>>> implementations that have a bug. So I'd argue there's value in >>>>>>>>>>> leaving it >>>>>>>>>>> as required. >>>>>>>>>>> >>>>>>>>>>> On Mon, Oct 21, 2024 at 9:41 AM Kevin Liu < >>>>>>>>>>> kevin.jq....@gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> > No. They were introduced at the same time. >>>>>>>>>>>> Great! Since the `summary` field and the `operation` key were >>>>>>>>>>>> introduced together, we should enforce the rule that the ` >>>>>>>>>>>> summary` field must always have an accompanying `operation` >>>>>>>>>>>> key. This has been addressed in PR 11354 [1]. >>>>>>>>>>>> >>>>>>>>>>>> > I am strongly against this. The REST spec should be >>>>>>>>>>>> independent of the table versions. >>>>>>>>>>>> That makes sense. For the REST spec to support both V1 and V2 >>>>>>>>>>>> tables, it should "accept" the least common denominator between >>>>>>>>>>>> the two >>>>>>>>>>>> versions. For example, the `Snapshot` `summary` field is optional >>>>>>>>>>>> in V1 but >>>>>>>>>>>> required in V2. Therefore, the REST spec definition should mark the >>>>>>>>>>>> `summary` field as optional to support both versions. However, the >>>>>>>>>>>> current >>>>>>>>>>>> REST spec leans towards the V2 table spec; fields that are >>>>>>>>>>>> optional in V1 >>>>>>>>>>>> and required in V2 are marked as required in the spec, such as >>>>>>>>>>>> `TableMetadata.table-uuid` [2][3] and `Snapshot.summary` [4][5]. >>>>>>>>>>>> >>>>>>>>>>>> Would love to get other people's thoughts on this. >>>>>>>>>>>> >>>>>>>>>>>> Best, >>>>>>>>>>>> Kevin Liu >>>>>>>>>>>> >>>>>>>>>>>> [1] https://github.com/apache/iceberg/pull/11354 >>>>>>>>>>>> [2] >>>>>>>>>>>> https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L2414 >>>>>>>>>>>> [3] https://iceberg.apache.org/spec/#table-metadata-fields >>>>>>>>>>>> [4] >>>>>>>>>>>> https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L2325 >>>>>>>>>>>> [5] https://iceberg.apache.org/spec/#snapshots >>>>>>>>>>>> >>>>>>>>>>>> On Sun, Oct 20, 2024 at 11:24 AM rdb...@gmail.com < >>>>>>>>>>>> rdb...@gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Was it ever valid to have a summary field without the >>>>>>>>>>>>> operation key? >>>>>>>>>>>>> >>>>>>>>>>>>> No. They were introduced at the same time. >>>>>>>>>>>>> >>>>>>>>>>>>> Would it be helpful to create alternative versions of the REST >>>>>>>>>>>>> spec specifically for referencing V1 and V2 tables? >>>>>>>>>>>>> >>>>>>>>>>>>> I am strongly against this. The REST spec should be >>>>>>>>>>>>> independent of the table versions. Any table format version can >>>>>>>>>>>>> be passed >>>>>>>>>>>>> and the table format should be the canonical reference for what >>>>>>>>>>>>> is allowed. >>>>>>>>>>>>> We want to avoid cases where there are discrepancies. The table >>>>>>>>>>>>> spec is >>>>>>>>>>>>> canonical for table metadata, and the REST spec allows passing it. >>>>>>>>>>>>> >>>>>>>>>>>>> On Sun, Oct 20, 2024 at 11:18 AM Kevin Liu < >>>>>>>>>>>>> kevin.jq....@gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hey folks, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, everyone for the discussion, and thanks Ryan for >>>>>>>>>>>>>> providing the historical context. >>>>>>>>>>>>>> Enforce the `operation` key in Snapshot’s `summary` field >>>>>>>>>>>>>> >>>>>>>>>>>>>> When serializing the `Snapshot` object from JSON, the Java >>>>>>>>>>>>>> implementation does not enforce that the `summary` field must >>>>>>>>>>>>>> contain an >>>>>>>>>>>>>> `operation` key. In the V1 spec, the `summary` field is >>>>>>>>>>>>>> optional, while in >>>>>>>>>>>>>> the V2 spec, it is required. However, in both versions, if a >>>>>>>>>>>>>> `summary` >>>>>>>>>>>>>> field is present, it must include an `operation` key. Any >>>>>>>>>>>>>> `summary` field >>>>>>>>>>>>>> lacking an `operation` key should be considered invalid. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I’ve addressed this issue in PR 11354 [1] by adding this >>>>>>>>>>>>>> constraint when parsing JSON. >>>>>>>>>>>>>> >>>>>>>>>>>>>> > We initially did not have the snapshot summary or >>>>>>>>>>>>>> operation. When I added the summary, the operation was intended >>>>>>>>>>>>>> to be >>>>>>>>>>>>>> required in cases where the summary is present. It should always >>>>>>>>>>>>>> be there >>>>>>>>>>>>>> if the summary is and the summary should always be there unless >>>>>>>>>>>>>> you wrote the metadata.json file way back in 2017 or 2018. >>>>>>>>>>>>>> >>>>>>>>>>>>>> @Ryan, does this constraint also apply to `metadata.json` >>>>>>>>>>>>>> files from 2017/2018? Was it ever valid to have a `summary` >>>>>>>>>>>>>> field without >>>>>>>>>>>>>> the `operation` key? >>>>>>>>>>>>>> >>>>>>>>>>>>>> > Well, the spec says nothing about a top-level `operation` >>>>>>>>>>>>>> field in JSON [1]. Yet the Java implementation produces it [2] >>>>>>>>>>>>>> and removes >>>>>>>>>>>>>> the operation from the summary map. This seems inconsistent? >>>>>>>>>>>>>> >>>>>>>>>>>>>> @Anton, the Java `Snapshot` object includes both the >>>>>>>>>>>>>> `summary` and `operation` fields. When serializing to JSON, the >>>>>>>>>>>>>> `operation` >>>>>>>>>>>>>> field is included in the `summary` map [2], rather than as a >>>>>>>>>>>>>> top-level >>>>>>>>>>>>>> field. During deserialization from JSON, the `operation` field >>>>>>>>>>>>>> is extracted >>>>>>>>>>>>>> from the `summary` map [3]. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I believe this is consistent with the table spec, which >>>>>>>>>>>>>> defines the JSON output, not how the `Snapshot` object is >>>>>>>>>>>>>> implemented in >>>>>>>>>>>>>> Java. >>>>>>>>>>>>>> On REST spec and Table spec >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, Yufei, for highlighting the difference between the >>>>>>>>>>>>>> REST spec and the table spec. I mistakenly used the REST spec >>>>>>>>>>>>>> (`rest-catalog-open-api.yaml` [4]) as the source of truth for V2 >>>>>>>>>>>>>> tables. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Looking at the REST spec file, it can be challenging to >>>>>>>>>>>>>> determine how a REST server should handle V1 versus V2 tables. >>>>>>>>>>>>>> Even for V2 >>>>>>>>>>>>>> tables, the current version of the file combines features from >>>>>>>>>>>>>> V2, along >>>>>>>>>>>>>> with additional changes made in preparation for the upcoming V3 >>>>>>>>>>>>>> spec. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Would it be helpful to create alternative versions of the >>>>>>>>>>>>>> REST spec specifically for referencing V1 and V2 tables? The >>>>>>>>>>>>>> goal would be >>>>>>>>>>>>>> to have a "frozen" version of the REST spec dedicated to V1 >>>>>>>>>>>>>> tables and >>>>>>>>>>>>>> another for V2 tables while allowing the current REST spec file >>>>>>>>>>>>>> to evolve >>>>>>>>>>>>>> as needed. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Taking a step back, I think we need more documentation on the >>>>>>>>>>>>>> REST spec, including support for different table versions and >>>>>>>>>>>>>> guidance on >>>>>>>>>>>>>> upgrading from one version to another. I’d love to hear >>>>>>>>>>>>>> everyone’s thoughts >>>>>>>>>>>>>> on this. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Best, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Kevin Liu >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> [1] https://github.com/apache/iceberg/pull/11354 >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2] >>>>>>>>>>>>>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L63-L66 >>>>>>>>>>>>>> >>>>>>>>>>>>>> [3] >>>>>>>>>>>>>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124-L137 >>>>>>>>>>>>>> >>>>>>>>>>>>>> [4] >>>>>>>>>>>>>> https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Sat, Oct 19, 2024 at 7:48 PM Sung Yun <sun...@apache.org> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Ryan, thank you for your response! >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> That detailed context is very helpful in allowing me to >>>>>>>>>>>>>>> understanding why the REST catalog spec has evolved the way it >>>>>>>>>>>>>>> has, and how >>>>>>>>>>>>>>> the Table Spec and the REST Catalog Spec should each be >>>>>>>>>>>>>>> referenced in the >>>>>>>>>>>>>>> sub-communities (like in PyIceberg). I'll keep those >>>>>>>>>>>>>>> motivations in mind as >>>>>>>>>>>>>>> we discuss those Specs in the future. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Also, here's a small PR to specify more explicitly that the >>>>>>>>>>>>>>> operation field should be a required field in the summary field: >>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/11355 >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Sung >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On 2024/10/19 22:14:59 "rdb...@gmail.com" wrote: >>>>>>>>>>>>>>> > I can provide some historical context here about how the >>>>>>>>>>>>>>> table spec evolved >>>>>>>>>>>>>>> > and how the REST spec works with respect to table versions. >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > We initially did not have the snapshot summary or >>>>>>>>>>>>>>> operation. When I added >>>>>>>>>>>>>>> > the summary, the operation was intended to be required in >>>>>>>>>>>>>>> cases where the >>>>>>>>>>>>>>> > summary is present. It should always be there if the >>>>>>>>>>>>>>> summary is and the >>>>>>>>>>>>>>> > summary should always be there unless you wrote the >>>>>>>>>>>>>>> metadata.json file way >>>>>>>>>>>>>>> > back in 2017 or 2018. It looks like the spec could be more >>>>>>>>>>>>>>> clear that the >>>>>>>>>>>>>>> > operation is required when summary is present. Anyone want >>>>>>>>>>>>>>> to open a PR? >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > Anton, I don't think there is a top-level operation field. >>>>>>>>>>>>>>> The Java >>>>>>>>>>>>>>> > Snapshot class tracks the operation as top-level, but it >>>>>>>>>>>>>>> is always stored >>>>>>>>>>>>>>> > in the summary. I think this is consistent with the spec. >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > For the REST spec, I think that it should be strictly >>>>>>>>>>>>>>> optional to support >>>>>>>>>>>>>>> > v1 tables written with no summary, but it should always be >>>>>>>>>>>>>>> present. I'd >>>>>>>>>>>>>>> > probably leave it required since it already is and is >>>>>>>>>>>>>>> catching a valuable >>>>>>>>>>>>>>> > error case here. >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > When building the REST spec, I took the same approach as >>>>>>>>>>>>>>> the Java parser >>>>>>>>>>>>>>> > (which is also to parse table metadata coming from REST >>>>>>>>>>>>>>> servers). That is, >>>>>>>>>>>>>>> > it is compatible with v1 metadata; fields that were not >>>>>>>>>>>>>>> required in v1 are >>>>>>>>>>>>>>> > optional. This fits with the principle of "be liberal in >>>>>>>>>>>>>>> what you accept >>>>>>>>>>>>>>> > and strict in what you produce". The REST spec allows >>>>>>>>>>>>>>> passing metadata for >>>>>>>>>>>>>>> > any table version so that the specs are not tightly >>>>>>>>>>>>>>> coupled. The table >>>>>>>>>>>>>>> > version is passed when loading and clients should reject >>>>>>>>>>>>>>> table versions >>>>>>>>>>>>>>> > that are newer than what they can support. The REST >>>>>>>>>>>>>>> protocol just needs to >>>>>>>>>>>>>>> > facilitate passing the table metadata. >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > Most v2 structures, like the `schemas` list, are >>>>>>>>>>>>>>> introduced as optional in >>>>>>>>>>>>>>> > v1 and made required in v2. That way, it's possible to add >>>>>>>>>>>>>>> metadata to >>>>>>>>>>>>>>> > existing format versions and make the upgrade path easier. >>>>>>>>>>>>>>> Maintaining the >>>>>>>>>>>>>>> > newer structures even though there are different writer >>>>>>>>>>>>>>> versions deployed >>>>>>>>>>>>>>> > is one of the reasons why the REST spec changes to a >>>>>>>>>>>>>>> change-based model. >>>>>>>>>>>>>>> > New metadata only needs to be supported by the service >>>>>>>>>>>>>>> maintaining the >>>>>>>>>>>>>>> > metadata.json files and any writers that want to update it. >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > I see some points about being able to remove old table >>>>>>>>>>>>>>> versions. I don't >>>>>>>>>>>>>>> > think that the REST protocol itself is the place to do >>>>>>>>>>>>>>> this. The protocol >>>>>>>>>>>>>>> > is format-agnostic. Implementations are free to reject >>>>>>>>>>>>>>> requests to create >>>>>>>>>>>>>>> > tables with older versions, or to update the table to a >>>>>>>>>>>>>>> new version. >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > Ryan >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > On Fri, Oct 18, 2024 at 6:42 AM Sung Yun < >>>>>>>>>>>>>>> sun...@apache.org> wrote: >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > > Folks, thank you for your responses! These area really >>>>>>>>>>>>>>> helpful insights. >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > > > I agree that the REST spec should aim to support v1, >>>>>>>>>>>>>>> v2, and potentially >>>>>>>>>>>>>>> > > the upcoming v3. In practice, it seems like the choice >>>>>>>>>>>>>>> of table spec might >>>>>>>>>>>>>>> > > ultimately be dictated by the REST catalog >>>>>>>>>>>>>>> implementation. >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > > > A best practice would be for the server to strive to >>>>>>>>>>>>>>> support all Iceberg >>>>>>>>>>>>>>> > > versions, but the REST spec itself should remain >>>>>>>>>>>>>>> flexible enough to >>>>>>>>>>>>>>> > > accommodate less strict table specs. >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > > Yufei, yes that makes sense, and I agree that the server >>>>>>>>>>>>>>> should strive to >>>>>>>>>>>>>>> > > support all format versions, because otherwise the an >>>>>>>>>>>>>>> older client >>>>>>>>>>>>>>> > > application, may just not be compatible with a REST >>>>>>>>>>>>>>> Catalog running on a >>>>>>>>>>>>>>> > > higher version of table spec. I think we have two >>>>>>>>>>>>>>> choices here in ensuring >>>>>>>>>>>>>>> > > that the REST Catalog server is able to support multiple >>>>>>>>>>>>>>> versions of the >>>>>>>>>>>>>>> > > Table Spec: >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > > 1. We could create single components that are common >>>>>>>>>>>>>>> denominators of all >>>>>>>>>>>>>>> > > existing table specs to accommodate the less table >>>>>>>>>>>>>>> specs. The REST Catalog >>>>>>>>>>>>>>> > > Spec currently falls short in this approach, and I've >>>>>>>>>>>>>>> put up this PR to >>>>>>>>>>>>>>> > > show what this change would look like just for the >>>>>>>>>>>>>>> Snapshot component: >>>>>>>>>>>>>>> > > https://github.com/apache/iceberg/pull/11353 - My take >>>>>>>>>>>>>>> on this is that, >>>>>>>>>>>>>>> > > this approach will make the REST catalog spec more >>>>>>>>>>>>>>> confusing and difficult >>>>>>>>>>>>>>> > > to manage as we add more Table Spec versions moving >>>>>>>>>>>>>>> forward. The discussion >>>>>>>>>>>>>>> > > on this mail list thread is I think a great >>>>>>>>>>>>>>> demonstration of that confusion >>>>>>>>>>>>>>> > > :) >>>>>>>>>>>>>>> > > 2. We could instead create separate Table Spec version >>>>>>>>>>>>>>> specific components >>>>>>>>>>>>>>> > > on the REST Catalog Open API Spec. For example, a >>>>>>>>>>>>>>> Snapshot component could >>>>>>>>>>>>>>> > > be anyOf SnapshotV1 and SnapshotV2, which match the >>>>>>>>>>>>>>> Table Spec V1 and V2 >>>>>>>>>>>>>>> > > definitions. I think creating explicit components that >>>>>>>>>>>>>>> match the spec >>>>>>>>>>>>>>> > > definitions will work in our favor when we continue to >>>>>>>>>>>>>>> introduce more Spec >>>>>>>>>>>>>>> > > changes and manage their lifecycles. And perhaps, maybe >>>>>>>>>>>>>>> we could also >>>>>>>>>>>>>>> > > indicate what format-versions the REST Catalog Server >>>>>>>>>>>>>>> supports through an >>>>>>>>>>>>>>> > > endpoint, and communicate it to a client application. >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > > I'd love to hear the community's opinion on suggestion >>>>>>>>>>>>>>> (2)! I'm very >>>>>>>>>>>>>>> > > curious to hear if we've considered it before. >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > > Sung >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > > On 2024/10/18 05:13:15 Péter Váry wrote: >>>>>>>>>>>>>>> > > > Hi Team, >>>>>>>>>>>>>>> > > > Apart from fixing this current issue by relaxing the >>>>>>>>>>>>>>> current spec >>>>>>>>>>>>>>> > > > constraints, to support both v1 and v2 specifications, >>>>>>>>>>>>>>> we should think >>>>>>>>>>>>>>> > > > about how to handle table spec evolution for the long >>>>>>>>>>>>>>> term. >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > > What are the base factors we can start from (please >>>>>>>>>>>>>>> add your own ideas >>>>>>>>>>>>>>> > > if I >>>>>>>>>>>>>>> > > > have missed something): >>>>>>>>>>>>>>> > > > - We evolve the specifications in a way that is >>>>>>>>>>>>>>> backwards compatible (v1 >>>>>>>>>>>>>>> > > > table could be read by v2 reader) but not forwards >>>>>>>>>>>>>>> compatible (v2 table >>>>>>>>>>>>>>> > > > could not be read by an old reader) >>>>>>>>>>>>>>> > > > - The rest spec ideally should conform to the >>>>>>>>>>>>>>> currently used table spec >>>>>>>>>>>>>>> > > > schema/constraints >>>>>>>>>>>>>>> > > > - REST catalogs sooner-or-later would like to drop >>>>>>>>>>>>>>> support for older >>>>>>>>>>>>>>> > > table >>>>>>>>>>>>>>> > > > spec. We need to avoid the situation of Hive >>>>>>>>>>>>>>> Metastore, where the >>>>>>>>>>>>>>> > > decisions >>>>>>>>>>>>>>> > > > made 10 years ago prevented enhancing the APIs as the >>>>>>>>>>>>>>> old specifications >>>>>>>>>>>>>>> > > > were supported forever. >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > > Probably (when the spec difference becomes big enough) >>>>>>>>>>>>>>> a composit request >>>>>>>>>>>>>>> > > > (version + different content spec) or a different >>>>>>>>>>>>>>> endpoint will be >>>>>>>>>>>>>>> > > required. >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > > Thanks, Peter >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > > On Thu, Oct 17, 2024, 23:11 Yufei Gu < >>>>>>>>>>>>>>> flyrain...@gmail.com> wrote: >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > > > Hi Sung, >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> > > > > It seems we are running to issues related to a >>>>>>>>>>>>>>> mismatch between the >>>>>>>>>>>>>>> > > REST >>>>>>>>>>>>>>> > > > > spec and table specifications. Currently, there's no >>>>>>>>>>>>>>> clear definition >>>>>>>>>>>>>>> > > of >>>>>>>>>>>>>>> > > > > how the REST spec is meant to support different >>>>>>>>>>>>>>> table specs. The >>>>>>>>>>>>>>> > > closest >>>>>>>>>>>>>>> > > > > reference I found is this statement >>>>>>>>>>>>>>> > > > > < >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> https://github.com/apache/iceberg/blob/8e743a5b5209569f84b6bace36e1106c67e1eab3/open-api/rest-catalog-open-api.yaml#L30-L30 >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > > > in the REST spec. >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> > > > > Implementations should ideally support both Iceberg >>>>>>>>>>>>>>> table specs v1 and >>>>>>>>>>>>>>> > > v2, >>>>>>>>>>>>>>> > > > >> with priority given to v2. >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> > > > > I agree that the REST spec should aim to support v1, >>>>>>>>>>>>>>> v2, and >>>>>>>>>>>>>>> > > potentially >>>>>>>>>>>>>>> > > > > the upcoming v3. In practice, it seems like the >>>>>>>>>>>>>>> choice of table spec >>>>>>>>>>>>>>> > > might >>>>>>>>>>>>>>> > > > > ultimately be dictated by the REST catalog >>>>>>>>>>>>>>> implementation. >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> > > > > A best practice would be for the server to strive to >>>>>>>>>>>>>>> support all >>>>>>>>>>>>>>> > > Iceberg >>>>>>>>>>>>>>> > > > > versions, but the REST spec itself should remain >>>>>>>>>>>>>>> flexible enough to >>>>>>>>>>>>>>> > > > > accommodate less strict table specs. For the case >>>>>>>>>>>>>>> you mentioned, it >>>>>>>>>>>>>>> > > should >>>>>>>>>>>>>>> > > > > be fine to make sequence number optional since the >>>>>>>>>>>>>>> spec has to support >>>>>>>>>>>>>>> > > v1 >>>>>>>>>>>>>>> > > > > table spec. It does feel confusing though. >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> > > > > WDYT? >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> > > > > Yufei >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> > > > > On Thu, Oct 17, 2024 at 1:56 PM Anton Okolnychyi < >>>>>>>>>>>>>>> > > aokolnyc...@gmail.com> >>>>>>>>>>>>>>> > > > > wrote: >>>>>>>>>>>>>>> > > > > >>>>>>>>>>>>>>> > > > >> Well, the spec says nothing about a top-level >>>>>>>>>>>>>>> `operation` field in >>>>>>>>>>>>>>> > > JSON >>>>>>>>>>>>>>> > > > >> [1]. Yet the Java implementation produces it [2] >>>>>>>>>>>>>>> and removes the >>>>>>>>>>>>>>> > > operation >>>>>>>>>>>>>>> > > > >> from the summary map. This seems inconsistent? >>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>> > > > >> - Anton >>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>> > > > >> [1] - https://iceberg.apache.org/spec/#snapshots >>>>>>>>>>>>>>> > > > >> [2] - >>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L63 >>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>> > > > >> чт, 17 жовт. 2024 р. о 10:06 Sung Yun < >>>>>>>>>>>>>>> sun...@apache.org> пише: >>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>> > > > >>> > As a side note, the `rest-catalog-open-api.yaml` >>>>>>>>>>>>>>> file [2] in the >>>>>>>>>>>>>>> > > > >>> Iceberg repo contains the latest version of the >>>>>>>>>>>>>>> spec. >>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>> > > > >>> I think more clarity on this would be helpful. Is >>>>>>>>>>>>>>> it really the case >>>>>>>>>>>>>>> > > > >>> that the Open API spec contains the latest version >>>>>>>>>>>>>>> of the spec? For >>>>>>>>>>>>>>> > > > >>> example, I'm noticing a discrepancy between >>>>>>>>>>>>>>> sequence-number in the >>>>>>>>>>>>>>> > > Table >>>>>>>>>>>>>>> > > > >>> Spec and in the Open API Spec... >>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>> > > > >>> In the table spec, it's required for V2, but it's >>>>>>>>>>>>>>> optional in the >>>>>>>>>>>>>>> > > REST >>>>>>>>>>>>>>> > > > >>> API Spec: >>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L2319-L2335 >>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>> > > > >>> On 2024/10/17 16:58:17 Kevin Liu wrote: >>>>>>>>>>>>>>> > > > >>> > > Based on the example metadata, that looks like >>>>>>>>>>>>>>> it is not to >>>>>>>>>>>>>>> > > spec, so >>>>>>>>>>>>>>> > > > >>> it's >>>>>>>>>>>>>>> > > > >>> > reasonable that python would reject it. If the >>>>>>>>>>>>>>> java >>>>>>>>>>>>>>> > > implementation is >>>>>>>>>>>>>>> > > > >>> > allowing for that, it's likely that we're being >>>>>>>>>>>>>>> too relaxed >>>>>>>>>>>>>>> > > (possibly a >>>>>>>>>>>>>>> > > > >>> > holdover from v1 parsing). >>>>>>>>>>>>>>> > > > >>> > I believe the Java implementation is relaxing >>>>>>>>>>>>>>> the constraint. I'll >>>>>>>>>>>>>>> > > > >>> create a >>>>>>>>>>>>>>> > > > >>> > PR with test cases and the necessary changes. >>>>>>>>>>>>>>> > > > >>> > >>>>>>>>>>>>>>> > > > >>> > > Do you know what produced the metadata? >>>>>>>>>>>>>>> > > > >>> > It was created by Snowflake [1]. After verifying >>>>>>>>>>>>>>> this, I'll look >>>>>>>>>>>>>>> > > into >>>>>>>>>>>>>>> > > > >>> > raising the issue with them. >>>>>>>>>>>>>>> > > > >>> > >>>>>>>>>>>>>>> > > > >>> > As a side note, the `rest-catalog-open-api.yaml` >>>>>>>>>>>>>>> file [2] in the >>>>>>>>>>>>>>> > > > >>> Iceberg >>>>>>>>>>>>>>> > > > >>> > repo contains the latest version of the spec. As >>>>>>>>>>>>>>> we're continuing >>>>>>>>>>>>>>> > > to >>>>>>>>>>>>>>> > > > >>> evolve >>>>>>>>>>>>>>> > > > >>> > to spec for V3, would it be helpful to create a >>>>>>>>>>>>>>> frozen version >>>>>>>>>>>>>>> > > > >>> representing >>>>>>>>>>>>>>> > > > >>> > both the V1 and V2 specs for reference, possibly >>>>>>>>>>>>>>> as a separate >>>>>>>>>>>>>>> > > file? >>>>>>>>>>>>>>> > > > >>> > >>>>>>>>>>>>>>> > > > >>> > Best, >>>>>>>>>>>>>>> > > > >>> > Kevin Liu >>>>>>>>>>>>>>> > > > >>> > >>>>>>>>>>>>>>> > > > >>> > [1] >>>>>>>>>>>>>>> > > > >>> > >>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> https://github.com/apache/iceberg-python/issues/1106#issuecomment-2312108455 >>>>>>>>>>>>>>> > > > >>> > [2] >>>>>>>>>>>>>>> > > > >>> > >>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml >>>>>>>>>>>>>>> > > > >>> > >>>>>>>>>>>>>>> > > > >>> > On Thu, Oct 17, 2024 at 9:20 AM Daniel Weeks < >>>>>>>>>>>>>>> dwe...@apache.org> >>>>>>>>>>>>>>> > > > >>> wrote: >>>>>>>>>>>>>>> > > > >>> > >>>>>>>>>>>>>>> > > > >>> > > Sung, >>>>>>>>>>>>>>> > > > >>> > > >>>>>>>>>>>>>>> > > > >>> > > I was thinking of v1, so you're right that >>>>>>>>>>>>>>> manifest-list and >>>>>>>>>>>>>>> > > summary >>>>>>>>>>>>>>> > > > >>> are >>>>>>>>>>>>>>> > > > >>> > > required as of v2. The REST Spec seems to >>>>>>>>>>>>>>> follow the v2 >>>>>>>>>>>>>>> > > definition, >>>>>>>>>>>>>>> > > > >>> so I >>>>>>>>>>>>>>> > > > >>> > > think we're somewhat implicitly requiring >>>>>>>>>>>>>>> those fields via REST. >>>>>>>>>>>>>>> > > > >>> > > >>>>>>>>>>>>>>> > > > >>> > > Kevin, >>>>>>>>>>>>>>> > > > >>> > > >>>>>>>>>>>>>>> > > > >>> > > Based on the example metadata, that looks like >>>>>>>>>>>>>>> it is not to >>>>>>>>>>>>>>> > > spec, so >>>>>>>>>>>>>>> > > > >>> it's >>>>>>>>>>>>>>> > > > >>> > > reasonable that python would reject it. If >>>>>>>>>>>>>>> the java >>>>>>>>>>>>>>> > > implementation >>>>>>>>>>>>>>> > > > >>> is >>>>>>>>>>>>>>> > > > >>> > > allowing for that, it's likely that we're >>>>>>>>>>>>>>> being too relaxed >>>>>>>>>>>>>>> > > > >>> (possibly a >>>>>>>>>>>>>>> > > > >>> > > holdover from v1 parsing). >>>>>>>>>>>>>>> > > > >>> > > >>>>>>>>>>>>>>> > > > >>> > > Do you know what produced the metadata? >>>>>>>>>>>>>>> > > > >>> > > >>>>>>>>>>>>>>> > > > >>> > > -Dan >>>>>>>>>>>>>>> > > > >>> > > >>>>>>>>>>>>>>> > > > >>> > > On Thu, Oct 17, 2024 at 9:02 AM Kevin Liu < >>>>>>>>>>>>>>> > > kevin.jq....@gmail.com> >>>>>>>>>>>>>>> > > > >>> wrote: >>>>>>>>>>>>>>> > > > >>> > > >>>>>>>>>>>>>>> > > > >>> > >> Thanks for the additional context. >>>>>>>>>>>>>>> > > > >>> > >> >>>>>>>>>>>>>>> > > > >>> > >> My understanding is that if a Snapshot has a >>>>>>>>>>>>>>> `summary` field, it >>>>>>>>>>>>>>> > > > >>> must >>>>>>>>>>>>>>> > > > >>> > >> also have a corresponding `operation` key in >>>>>>>>>>>>>>> the summary map. Is >>>>>>>>>>>>>>> > > > >>> that >>>>>>>>>>>>>>> > > > >>> > >> correct? Based on the `SnapshotParser`, this >>>>>>>>>>>>>>> is not enforced >>>>>>>>>>>>>>> > > [1]. >>>>>>>>>>>>>>> > > > >>> > >> >>>>>>>>>>>>>>> > > > >>> > >> The underlying issue in #1106 [2] is the >>>>>>>>>>>>>>> missing `operation` >>>>>>>>>>>>>>> > > field >>>>>>>>>>>>>>> > > > >>> when >>>>>>>>>>>>>>> > > > >>> > >> the `summary` field is present. >>>>>>>>>>>>>>> > > > >>> > >> For example, >>>>>>>>>>>>>>> > > > >>> > >> ``` >>>>>>>>>>>>>>> > > > >>> > >> "summary" : { >>>>>>>>>>>>>>> > > > >>> > >> "manifests-created" : "8", >>>>>>>>>>>>>>> > > > >>> > >> "total-records" : "26508666891", >>>>>>>>>>>>>>> > > > >>> > >> "added-files-size" : "3927895626752", >>>>>>>>>>>>>>> > > > >>> > >> "manifests-kept" : "0", >>>>>>>>>>>>>>> > > > >>> > >> "total-files-size" : "3927895626752", >>>>>>>>>>>>>>> > > > >>> > >> "added-records" : "26508666891", >>>>>>>>>>>>>>> > > > >>> > >> "added-data-files" : "231513", >>>>>>>>>>>>>>> > > > >>> > >> "manifests-replaced" : "0", >>>>>>>>>>>>>>> > > > >>> > >> "total-data-files" : "231513" >>>>>>>>>>>>>>> > > > >>> > >> } >>>>>>>>>>>>>>> > > > >>> > >> ``` >>>>>>>>>>>>>>> > > > >>> > >> >>>>>>>>>>>>>>> > > > >>> > >> It could be the case that this particular >>>>>>>>>>>>>>> `metadata.json` was >>>>>>>>>>>>>>> > > > >>> generated >>>>>>>>>>>>>>> > > > >>> > >> not according to the spec. >>>>>>>>>>>>>>> > > > >>> > >> >>>>>>>>>>>>>>> > > > >>> > >> Best, >>>>>>>>>>>>>>> > > > >>> > >> Kevin Liu >>>>>>>>>>>>>>> > > > >>> > >> >>>>>>>>>>>>>>> > > > >>> > >> >>>>>>>>>>>>>>> > > > >>> > >> [1] >>>>>>>>>>>>>>> > > > >>> > >> >>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124-L142 >>>>>>>>>>>>>>> > > > >>> > >> [2] >>>>>>>>>>>>>>> https://github.com/apache/iceberg-python/issues/1106 >>>>>>>>>>>>>>> > > > >>> > >> >>>>>>>>>>>>>>> > > > >>> > >> >>>>>>>>>>>>>>> > > > >>> > >> On Thu, Oct 17, 2024 at 8:47 AM Sung Yun < >>>>>>>>>>>>>>> sun...@apache.org> >>>>>>>>>>>>>>> > > wrote: >>>>>>>>>>>>>>> > > > >>> > >> >>>>>>>>>>>>>>> > > > >>> > >>> Thank you for the clarification Daniel, and >>>>>>>>>>>>>>> thank you Kevin for >>>>>>>>>>>>>>> > > > >>> raising >>>>>>>>>>>>>>> > > > >>> > >>> this issue! >>>>>>>>>>>>>>> > > > >>> > >>> >>>>>>>>>>>>>>> > > > >>> > >>> Does that mean that we are creating >>>>>>>>>>>>>>> component schemas that are >>>>>>>>>>>>>>> > > the >>>>>>>>>>>>>>> > > > >>> > >>> superset of the V1 and V2 schemas? And if >>>>>>>>>>>>>>> so, should we remove >>>>>>>>>>>>>>> > > > >>> summary and >>>>>>>>>>>>>>> > > > >>> > >>> manifest-list from the required properties, >>>>>>>>>>>>>>> and add manifests >>>>>>>>>>>>>>> > > > >>> optional >>>>>>>>>>>>>>> > > > >>> > >>> property to the Snapshot schema to support >>>>>>>>>>>>>>> both V1 and V2 >>>>>>>>>>>>>>> > > Summary >>>>>>>>>>>>>>> > > > >>> specs? >>>>>>>>>>>>>>> > > > >>> > >>> https://iceberg.apache.org/spec/#snapshots >>>>>>>>>>>>>>> > > > >>> > >>> >>>>>>>>>>>>>>> > > > >>> > >>> Or would creating separate component schemas >>>>>>>>>>>>>>> for V1/V2 be a >>>>>>>>>>>>>>> > > > >>> cleaner way >>>>>>>>>>>>>>> > > > >>> > >>> to align the REST spec with the table spec? >>>>>>>>>>>>>>> > > > >>> > >>> >>>>>>>>>>>>>>> > > > >>> > >>> Sung >>>>>>>>>>>>>>> > > > >>> > >>> >>>>>>>>>>>>>>> > > > >>> > >>> On 2024/10/17 15:19:23 Daniel Weeks wrote: >>>>>>>>>>>>>>> > > > >>> > >>> > I'm not convinced this is incorrect >>>>>>>>>>>>>>> behavior (table spec or >>>>>>>>>>>>>>> > > > >>> > >>> > implementation), but it does lend to some >>>>>>>>>>>>>>> confusion. The >>>>>>>>>>>>>>> > > > >>> 'summary' >>>>>>>>>>>>>>> > > > >>> > >>> field >>>>>>>>>>>>>>> > > > >>> > >>> > is optional, which means that if a summary >>>>>>>>>>>>>>> is not provided, >>>>>>>>>>>>>>> > > you >>>>>>>>>>>>>>> > > > >>> do not >>>>>>>>>>>>>>> > > > >>> > >>> have >>>>>>>>>>>>>>> > > > >>> > >>> > an associated 'operation' field. The >>>>>>>>>>>>>>> 'operation' field is >>>>>>>>>>>>>>> > > only >>>>>>>>>>>>>>> > > > >>> > >>> required in >>>>>>>>>>>>>>> > > > >>> > >>> > the context of the summary, so it's >>>>>>>>>>>>>>> actually possible for the >>>>>>>>>>>>>>> > > > >>> > >>> > implementation (i.e. the tests you >>>>>>>>>>>>>>> reference) to not have an >>>>>>>>>>>>>>> > > > >>> operation. >>>>>>>>>>>>>>> > > > >>> > >>> > >>>>>>>>>>>>>>> > > > >>> > >>> > I think what is wrong here is that the >>>>>>>>>>>>>>> REST spec marked the >>>>>>>>>>>>>>> > > > >>> summary as >>>>>>>>>>>>>>> > > > >>> > >>> > required >>>>>>>>>>>>>>> > > > >>> > >>> > < >>>>>>>>>>>>>>> > > > >>> > >>> >>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml#L2040 >>>>>>>>>>>>>>> > > > >>> > >>> >, >>>>>>>>>>>>>>> > > > >>> > >>> > which is inconsistent with the table spec. >>>>>>>>>>>>>>> > > > >>> > >>> > >>>>>>>>>>>>>>> > > > >>> > >>> > On Wed, Oct 16, 2024 at 3:52 PM Anton >>>>>>>>>>>>>>> Okolnychyi < >>>>>>>>>>>>>>> > > > >>> > >>> aokolnyc...@gmail.com> >>>>>>>>>>>>>>> > > > >>> > >>> > wrote: >>>>>>>>>>>>>>> > > > >>> > >>> > >>>>>>>>>>>>>>> > > > >>> > >>> > > Based on [1], we never persisted the >>>>>>>>>>>>>>> operation in the >>>>>>>>>>>>>>> > > summary >>>>>>>>>>>>>>> > > > >>> map. >>>>>>>>>>>>>>> > > > >>> > >>> > > Instead, we persisted it as a top-level >>>>>>>>>>>>>>> field in Java, >>>>>>>>>>>>>>> > > which is >>>>>>>>>>>>>>> > > > >>> > >>> actually >>>>>>>>>>>>>>> > > > >>> > >>> > > NOT what the spec says. Does anyone >>>>>>>>>>>>>>> remember cases when the >>>>>>>>>>>>>>> > > > >>> > >>> operation was >>>>>>>>>>>>>>> > > > >>> > >>> > > unknown? I personally don't. >>>>>>>>>>>>>>> > > > >>> > >>> > > >>>>>>>>>>>>>>> > > > >>> > >>> > > [1] - >>>>>>>>>>>>>>> > > > >>> > >>> > > >>>>>>>>>>>>>>> > > > >>> > >>> >>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> https://github.com/apache/iceberg/blob/17f1c4d2205b59c2bd877d4d31bbbef9e90979c5/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L63 >>>>>>>>>>>>>>> > > > >>> > >>> > > >>>>>>>>>>>>>>> > > > >>> > >>> > > >>>>>>>>>>>>>>> > > > >>> > >>> > > ср, 16 жовт. 2024 р. о 12:42 Kevin Liu < >>>>>>>>>>>>>>> > > kevin.jq....@gmail.com >>>>>>>>>>>>>>> > > > >>> > >>>>>>>>>>>>>>> > > > >>> > >>> пише: >>>>>>>>>>>>>>> > > > >>> > >>> > > >>>>>>>>>>>>>>> > > > >>> > >>> > >> Hey folks, >>>>>>>>>>>>>>> > > > >>> > >>> > >> >>>>>>>>>>>>>>> > > > >>> > >>> > >> I’ve noticed a discrepancy between the >>>>>>>>>>>>>>> Iceberg >>>>>>>>>>>>>>> > > specification >>>>>>>>>>>>>>> > > > >>> and >>>>>>>>>>>>>>> > > > >>> > >>> the Java >>>>>>>>>>>>>>> > > > >>> > >>> > >> implementation regarding the >>>>>>>>>>>>>>> `operation` key in the >>>>>>>>>>>>>>> > > `Snapshot` >>>>>>>>>>>>>>> > > > >>> > >>> `summary` >>>>>>>>>>>>>>> > > > >>> > >>> > >> field. >>>>>>>>>>>>>>> > > > >>> > >>> > >> >>>>>>>>>>>>>>> > > > >>> > >>> > >> The `Snapshot` object's `summary` >>>>>>>>>>>>>>> dictionary includes a >>>>>>>>>>>>>>> > > > >>> *required* >>>>>>>>>>>>>>> > > > >>> > >>> key >>>>>>>>>>>>>>> > > > >>> > >>> > >> named `operation`, as outlined in the >>>>>>>>>>>>>>> spec describing >>>>>>>>>>>>>>> > > Table >>>>>>>>>>>>>>> > > > >>> > >>> Metadata and >>>>>>>>>>>>>>> > > > >>> > >>> > >> Snapshots [1] and the generated OpenAPI >>>>>>>>>>>>>>> YAML [2]. >>>>>>>>>>>>>>> > > However, in >>>>>>>>>>>>>>> > > > >>> the >>>>>>>>>>>>>>> > > > >>> > >>> Java >>>>>>>>>>>>>>> > > > >>> > >>> > >> implementation [3], `operation` is >>>>>>>>>>>>>>> treated as optional. In >>>>>>>>>>>>>>> > > > >>> > >>> contrast, it >>>>>>>>>>>>>>> > > > >>> > >>> > >> remains a required field in the Python >>>>>>>>>>>>>>> implementation [4]. >>>>>>>>>>>>>>> > > > >>> > >>> > >> I also found that Java tests for >>>>>>>>>>>>>>> `SnapshotParser` assert >>>>>>>>>>>>>>> > > that >>>>>>>>>>>>>>> > > > >>> the >>>>>>>>>>>>>>> > > > >>> > >>> > >> `operation` field is null. [5] >>>>>>>>>>>>>>> > > > >>> > >>> > >> >>>>>>>>>>>>>>> > > > >>> > >>> > >> Due to this discrepancy, a user >>>>>>>>>>>>>>> reported [6] that the >>>>>>>>>>>>>>> > > > >>> > >>> `metadata.json` >>>>>>>>>>>>>>> > > > >>> > >>> > >> file generated for an Iceberg table >>>>>>>>>>>>>>> could not be read by >>>>>>>>>>>>>>> > > > >>> PyIceberg, >>>>>>>>>>>>>>> > > > >>> > >>> though >>>>>>>>>>>>>>> > > > >>> > >>> > >> it is readable using the Iceberg Java >>>>>>>>>>>>>>> library. >>>>>>>>>>>>>>> > > > >>> > >>> > >> >>>>>>>>>>>>>>> > > > >>> > >>> > >> How should we proceed from here? Should >>>>>>>>>>>>>>> the Java library >>>>>>>>>>>>>>> > > > >>> enforce >>>>>>>>>>>>>>> > > > >>> > >>> this >>>>>>>>>>>>>>> > > > >>> > >>> > >> requirement? Additionally, how should >>>>>>>>>>>>>>> we handle existing >>>>>>>>>>>>>>> > > > >>> > >>> `metadata.json` >>>>>>>>>>>>>>> > > > >>> > >>> > >> files that were generated without this >>>>>>>>>>>>>>> field? >>>>>>>>>>>>>>> > > > >>> > >>> > >> >>>>>>>>>>>>>>> > > > >>> > >>> > >> Best, >>>>>>>>>>>>>>> > > > >>> > >>> > >> Kevin Liu >>>>>>>>>>>>>>> > > > >>> > >>> > >> >>>>>>>>>>>>>>> > > > >>> > >>> > >> [1] >>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>> https://iceberg.apache.org/spec/#table-metadata-and-snapshots >>>>>>>>>>>>>>> > > > >>> > >>> > >> [2] >>>>>>>>>>>>>>> > > > >>> > >>> > >> >>>>>>>>>>>>>>> > > > >>> > >>> >>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml#L2057-L2060 >>>>>>>>>>>>>>> > > > >>> > >>> > >> [3] >>>>>>>>>>>>>>> > > > >>> > >>> > >> >>>>>>>>>>>>>>> > > > >>> > >>> >>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> https://github.com/apache/iceberg/blob/64b36999d7ff716ae2534fb0972fcc10d22a64c2/core/src/main/java/org/apache/iceberg/SnapshotParser.java#L124 >>>>>>>>>>>>>>> > > > >>> > >>> > >> [4] >>>>>>>>>>>>>>> > > > >>> > >>> > >> >>>>>>>>>>>>>>> > > > >>> > >>> >>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> https://github.com/apache/iceberg-python/blob/7cf0c225c3cdb32ac5e390de06b7b0e4fe7de92e/pyiceberg/table/snapshots.py#L182 >>>>>>>>>>>>>>> > > > >>> > >>> > >> [5] >>>>>>>>>>>>>>> > > > >>> > >>> > >> >>>>>>>>>>>>>>> > > > >>> > >>> >>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> https://github.com/apache/iceberg/blob/22a6b19c2e226eacc0aa78c1f2ffbdbb168b13be/core/src/test/java/org/apache/iceberg/TestSnapshotJson.java#L52 >>>>>>>>>>>>>>> > > > >>> > >>> > >> [6] >>>>>>>>>>>>>>> https://github.com/apache/iceberg-python/issues/1106 >>>>>>>>>>>>>>> > > > >>> > >>> > >> >>>>>>>>>>>>>>> > > > >>> > >>> > >> >>>>>>>>>>>>>>> > > > >>> > >>> > >>>>>>>>>>>>>>> > > > >>> > >>> >>>>>>>>>>>>>>> > > > >>> > >> >>>>>>>>>>>>>>> > > > >>> > >>>>>>>>>>>>>>> > > > >>> >>>>>>>>>>>>>>> > > > >> >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> >>>>>>>>>>>>>>