Hi Haizhou,

Thanks for bringing this up. Yes, this problem appears similar to the one I described, just with a slightly different manifestation. Funnily enough, you observed it while doing REST integration tests for Spark, and I hit it doing the same for Trino (the fully-fledged integration tests are not currently executed in Trino).

I am ready to create a PR for the "implementation notes" update regarding my problem. But your input raises another important question: should engines behave similarly for all catalogs in the first place? From the user's standpoint, the answer might be obvious - "yes". On the other hand, aligning the behavior might be difficult due to architectural differences. One example: in Trino it is possible to purge a table with corrupted metadata, because the engine can distinguish an I/O error from a parsing error - both happen on the engine side. For the REST catalog this is impossible, because the REST protocol doesn't allow you to distinguish one error from the other. The net result is a subtle difference in the behavior of the DROP TABLE command in Trino in various edge cases.
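Roughly, the engine-side distinction I have in mind looks like this (a minimal sketch, not actual Trino or Iceberg code; the file access and metadata parsing helpers are simplified stand-ins):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class MetadataProbe {

        enum State { OK, UNREADABLE, CORRUPTED }

        static State check(String metadataLocation) {
            final String json;
            try {
                // Direct read of the metadata file: an I/O failure here
                // means the file is missing or inaccessible.
                json = Files.readString(Paths.get(metadataLocation));
            } catch (IOException e) {
                return State.UNREADABLE;
            }
            // A failure here means the file exists but its content is
            // broken. Either way, the engine knows enough to let
            // DROP TABLE ... PURGE proceed, whereas a REST client only
            // sees an opaque error response from the server.
            return looksLikeTableMetadata(json) ? State.OK : State.CORRUPTED;
        }

        // Simplified stand-in for real TableMetadata JSON parsing.
        static boolean looksLikeTableMetadata(String json) {
            return json.contains("\"format-version\"");
        }
    }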
More discrepancies could be found, for sure. My gut feeling is that the more effort is dedicated to REST catalogs, the more differences in behavior will appear due to various performance considerations, bugs, missing protocol pieces, etc. So a completely transparent migration between catalogs would resemble the infamous ORM myth of "just switch the database". A sensible amount of flexibility in catalog behavior therefore seems to be a good compromise, IMO. WDYT?

Regards,
Vladimir

On Thu, Oct 24, 2024 at 8:41 PM Haizhou Zhao <zhaohaizhou940...@gmail.com> wrote:

> Hello Vladimir,
>
> I want to raise that we've been observing similar behavior differences regarding CREATE OR REPLACE between the Hive/Hadoop catalogs and the REST catalog here: https://github.com/apache/iceberg/issues/11109
>
> The context: the Iceberg Spark integration tests have traditionally covered only the Hive/Hadoop catalogs; with the recently added RCK (REST Compatibility Kit) setup for testing purposes, we are adding REST-catalog-based Spark integration tests.
> The assumption: all the Spark integration tests that used to pass on the Hive/Hadoop catalogs should also pass on the REST catalog, so that we can make sure the REST catalog client has the same behavior and is equivalent in power to the Hive/Hadoop catalog clients.
> Details: https://github.com/apache/iceberg/issues/11079
>
> Currently, there is one such test with a CREATE OR REPLACE statement that passes when using a Hive/Hadoop catalog but fails when using the REST catalog (the server being the reference implementation from the RCK). It turns out that the CREATE OR REPLACE statement does not trigger cleanup of snapshot history when using a Hive/Hadoop catalog, but it does when using the REST catalog.
>
> Based on the discussion above, we should fix some implementation details in the RCK reference implementation for our issue. Yet these are the kinds of cases where we could benefit from a general consensus on the behavior of CREATE OR REPLACE across different catalog types and query engines.
>
> Another suggestion for Trino: if you already have integration tests for the Trino Iceberg connector against a Hive/Hadoop catalog, then setting up the exact same tests against a REST catalog can help systematically detect behavior differences between catalog types.
>
> Regards,
> Haizhou
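For reference, the failing pattern described above boils down to roughly the following (a minimal sketch, assuming a SparkSession wired to an Iceberg catalog named `cat`; the catalog, namespace, and table names are illustrative):

    import org.apache.spark.sql.SparkSession;

    public class ReplaceHistoryCheck {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().getOrCreate();

            spark.sql("CREATE TABLE cat.db.t (id BIGINT) USING iceberg");
            spark.sql("INSERT INTO cat.db.t VALUES (1)");
            spark.sql("CREATE OR REPLACE TABLE cat.db.t USING iceberg AS SELECT 2 AS id");

            // A Hive/Hadoop catalog still lists the pre-replace snapshot
            // here; the RCK reference server cleans it up, so the count
            // differs and a shared test can only pass for one side.
            long snapshots = spark.sql("SELECT * FROM cat.db.t.snapshots").count();
            System.out.println("snapshots after replace: " + snapshots);
        }
    }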
> On Wed, Oct 23, 2024 at 7:33 AM Vladimir Ozerov <voze...@querifylabs.com> wrote:
>
>> Hi,
>>
>> Sure, will do.
>>
>> Wed, 23 Oct 2024 at 08:50, Jean-Baptiste Onofré <j...@nanthrax.net>:
>>
>>> I second Ryan here; it would be great to clarify this in the "implementation notes" section.
>>>
>>> Thanks!
>>>
>>> Regards,
>>> JB
>>>
>>> On Wed, Oct 23, 2024 at 1:10 AM rdb...@gmail.com <rdb...@gmail.com> wrote:
>>> >
>>> > Thanks, Vladimir! Would you like to open a PR to make that change? It sounds like another good item to put into the "Implementation notes" section.
>>> >
>>> > On Sun, Oct 20, 2024 at 11:41 PM Vladimir Ozerov <voze...@querifylabs.com> wrote:
>>> >>
>>> >> Hi Jean-Baptiste,
>>> >>
>>> >> Agreed, the REST spec looks good. I am talking about the general spec, where it might be useful to add a hint for engine developers that CREATE OR REPLACE for Iceberg tables is expected to follow slightly different semantics. This is already broken in Trino: depending on the catalog type, users get either the classical "DROP + CREATE" (for non-REST catalogs) or "CREATE AND UPDATE" (for the REST catalog). For Flink, the official docs say that CREATE OR REPLACE == DROP + CREATE, while for Iceberg tables this should not be the case. These are definitely things that should be fixed at the engine level. But at the same time, it highlights that engine developers are having a hard time defining proper semantics for CREATE OR REPLACE in their Iceberg integrations, so a paragraph or so in the main Iceberg spec may help us align expectations.
>>> >>
>>> >> Regards,
>>> >> Vladimir.
>>> >>
>>> >> On Mon, Oct 21, 2024 at 8:28 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>> >>>
>>> >>> Hi Vladimir,
>>> >>>
>>> >>> As Ryan said, it's not a bug: CREATE OR REPLACE can be seen as "CREATE AND UPDATE" from the table format perspective. Specifically for the properties, it makes sense not to delete the current properties, as they can be used in several use cases (security, table grouping, ...). I'm not sure a REST spec update is required; it's probably more on the engine side. In the REST spec, you can create a table (https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L553) and update a table (https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L975), and it's up to the query engine to implement CREATE OR REPLACE with the correct semantics.
>>> >>>
>>> >>> Regards,
>>> >>> JB
>>> >>>
>>> >>> On Sun, Oct 20, 2024 at 9:26 PM Vladimir Ozerov <voze...@querifylabs.com> wrote:
>>> >>> >
>>> >>> > Hi Ryan,
>>> >>> >
>>> >>> > Thanks for the clarification. Yes, I think my confusion was caused by the fact that many engines treat CREATE OR REPLACE as a semantic equivalent of DROP + CREATE performed atomically (e.g., Flink [1]). Table formats add history on top of that, which is expected to be retained; no questions here. Permission propagation also makes sense. For properties, things become a bit blurry: on the one hand, there are Iceberg-specific properties, which may affect table maintenance; on the other hand, there are user-defined properties in the same bag. The question appeared in the first place because I observed a discrepancy in Trino: all catalogs except REST completely override table properties on REPLACE, while the REST catalog merges them, which might be confusing to end users. Perhaps some clarification at the spec level would be useful, because without agreement between engines there could be subtle bugs in multi-engine environments, such as sudden data format changes between replaces, etc.
>>> >>> >
>>> >>> > [1] https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/create/#create-or-replace-table
>>> >>> >
>>> >>> > Regards,
>>> >>> > Vladimir.
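To make the discrepancy concrete, here is a sketch of the two behaviors, reusing the property values from the original question quoted further down (plain Java maps standing in for table properties):

    import java.util.Map;
    import java.util.TreeMap;

    public class ReplaceProperties {
        public static void main(String[] args) {
            Map<String, String> existing = Map.of("a", "1", "b", "2"); // before REPLACE
            Map<String, String> declared = Map.of("b", "3", "c", "4"); // set in the REPLACE

            // Non-REST catalogs in Trino: declared properties replace the old ones.
            Map<String, String> overridden = new TreeMap<>(declared);
            System.out.println(overridden); // {b=3, c=4}

            // REST catalog path: declared properties are merged into the existing ones.
            Map<String, String> merged = new TreeMap<>(existing);
            merged.putAll(declared);
            System.out.println(merged);     // {a=1, b=3, c=4}
        }
    }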
>>> >>> > On Sun, Oct 20, 2024 at 9:20 PM rdb...@gmail.com <rdb...@gmail.com> wrote:
>>> >>> >>
>>> >>> >> Hi Vladimir,
>>> >>> >>
>>> >>> >> This isn't a bug. The behavior of CREATE OR REPLACE is to replace the data of a table but to maintain things like other refs, snapshot history, permissions (if supported by the catalog), and table properties. Table properties are replaced if they are set in the operation, like `b` in your example. This is not the same as a drop and create, which may be what you want instead.
>>> >>> >>
>>> >>> >> The reason for this behavior is that the CREATE OR REPLACE operation is used to replace a table's data without needing to handle schema changes between versions. For example, producing a daily report table that replaces the previous day's. However, the table still exists, and it is valuable to be able to time travel to older versions or to use branches and tags. Clearly, that means table history and refs stick around, so the table is not completely new every time it is replaced.
>>> >>> >>
>>> >>> >> Adding on to that, properties control things like ref and snapshot retention, file format, compression, and other settings. These aren't settings that should need to be carried through in every replace operation. And it would make no sense if you set the snapshot retention so that older snapshots are retained, only to have it discarded the next time you replace the table data. A good way to think about this is that table properties are set infrequently, while table data changes regularly. And the person changing the data may not be the person tuning the table settings.
>>> >>> >>
>>> >>> >> Hopefully that helps,
>>> >>> >>
>>> >>> >> Ryan
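A sketch of the semantics Ryan describes, assuming a SparkSession bound to an Iceberg catalog `cat`, a report table that already has earlier snapshots, and Spark 3.3+ time-travel syntax (all names are illustrative):

    import org.apache.spark.sql.SparkSession;

    public class ReplaceTimeTravel {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().getOrCreate();

            // Daily job: replace the report's data without dropping the table.
            spark.sql("CREATE OR REPLACE TABLE cat.db.report USING iceberg "
                + "AS SELECT * FROM cat.db.staging");

            // History and refs survive the replace, so a pre-replace version
            // is still reachable: pick the oldest snapshot and time travel.
            long firstSnapshotId = spark.sql(
                    "SELECT snapshot_id FROM cat.db.report.snapshots "
                    + "ORDER BY committed_at")
                .first().getLong(0);
            spark.sql("SELECT * FROM cat.db.report VERSION AS OF " + firstSnapshotId)
                .show();
        }
    }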
>>> >>> >> On Sun, Oct 20, 2024 at 9:45 AM Vladimir Ozerov <voze...@querifylabs.com> wrote:
>>> >>> >>>
>>> >>> >>> Hi,
>>> >>> >>>
>>> >>> >>> Consider a REST catalog where a user runs a "CREATE OR REPLACE <table>" command. When processing the command, engines will usually initiate a "createOrReplace" transaction and add metadata, such as the properties of the new table.
>>> >>> >>>
>>> >>> >>> Users expect the table to be replaced with a new one if it exists, including its properties. However, I observe the following:
>>> >>> >>>
>>> >>> >>> 1. RESTSessionCatalog loads the previous table metadata, adds the new properties (MetadataUpdate.SetProperties), and invokes the backend.
>>> >>> >>> 2. The backend (e.g., Polaris) will typically invoke "CatalogHandler.updateTable". There, the previous table state, including its properties, is loaded.
>>> >>> >>> 3. Finally, the metadata updates are applied, and the old table properties are merged with the new ones. That is, if the old table has properties [a=1, b=2] and the new table has properties [b=3, c=4], the final properties will be [a=1, b=3, c=4], while the user expects [b=3, c=4].
>>> >>> >>>
>>> >>> >>> It looks like a bug, because the user expects complete property replacement instead of a merge. Shall we explicitly clear all previous properties in RESTSessionCatalog.Builder.replaceTransaction?
>>> >>> >>>
>>> >>> >>> Regards,
>>> >>> >>> Vladimir.

--
Vladimir Ozerov
Founder
querifylabs.com