+1 for using the REST catalog in the tests. Thanks Haizhou for doing this!

Yufei
On Thu, Sep 19, 2024 at 12:41 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:

> Thanks for looking into this Haizhou. I'll take a closer look at the PRs
> this/next week.
>
> Eduard
>
> On Thu, Sep 19, 2024 at 2:22 AM Haizhou Zhao <zhaohaizhou940...@gmail.com>
> wrote:
>
>> Hello dev-list,
>>
>> *What*
>> I'm looking for reviews of issues and PRs from the community to enable
>> REST Catalog based integration tests for query engines.
>>
>> Issue: https://github.com/apache/iceberg/issues/11079
>> PR: https://github.com/apache/iceberg/pull/11093
>>
>> *Background*
>> Recently, thanks to @Daniel's effort adding the RCK (REST Compatibility
>> Kit) test utilities (ref: https://github.com/apache/iceberg/pull/10908),
>> we can now spin up a simple REST Catalog within the test environment. Our
>> existing Spark integration tests run against the Hive & Hadoop Catalogs
>> only (ref:
>> https://github.com/apache/iceberg/blob/2025e79/spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/CatalogTestBase.java),
>> and I think our Spark connector release procedure will benefit from
>> running the existing Spark integration tests against the REST Catalog
>> (leveraging the RCK utilities), alongside Hadoop & Hive.
>>
>> *Why*
>> As the community gradually adopts the REST Catalog, running the Spark
>> integration tests against it will make sure we catch issues affecting
>> RESTCatalog clients early on, better serving REST Catalog adopters in the
>> community. Additionally, if we can build Spark integration tests against
>> the REST Catalog, the same idea could later extend to more query engines,
>> such as Flink.
>>
>> *Current open issues and PRs*
>> *PR:*
>> 1. https://github.com/apache/iceberg/pull/11093, the very first step is
>> to add REST-based integration tests to the Spark 3.5 tests (a minimal
>> configuration sketch appears after the quoted message). We can extend the
>> tests to Spark 3.4 & 3.3 later if the community likes the idea.
>>
>> *Issues:*
>> When enabling the Spark integration tests on the REST Catalog alongside
>> the Hadoop/Hive Catalogs, there are some test cases that Hadoop/Hive pass
>> but REST does not. They either indicate a behavior difference between the
>> catalogs (when handling the same Spark command) or a potential issue to be
>> looked into further.
>>
>> 1. https://github.com/apache/iceberg/issues/11103, the REST client would
>> incorrectly modify the "last-updated-ms" attribute of table metadata after
>> receiving responses from the server. This issue has been closed by
>> community effort (thanks to @Eduard, @Ryan, @Daniel, and @Steve for
>> discussing/fixing/reviewing).
>> 2. https://github.com/apache/iceberg/issues/11109, when issuing a Spark
>> "CREATE OR REPLACE ${table}" command, the Hive/Hadoop Catalogs do not
>> clear the snapshot log from before the table replacement, while the REST
>> Catalog does (a small check is sketched after the quoted message). I think
>> we need some clarification on whether table replacement should clear the
>> snapshot log.
>> 3. https://github.com/apache/iceberg/issues/11154, the REST Catalog
>> currently fails the Spark rename tests ("ALTER TABLE ${table} RENAME TO
>> ${table_rename}"). Spark's RenameTableExec passes the catalog name
>> together with the namespace in the "to" identifier down to the Iceberg
>> Spark connector. HiveCatalog's rename method always treats the first
>> namespace level of the "to" identifier as the catalog name and strips it
>> before renaming, while RESTCatalog does no such pre-processing; as a
>> result, HiveCatalog passes the "ALTER TABLE ... RENAME" test but
>> RESTCatalog does not (an identifier sketch appears after the quoted
>> message).
>>
>> Let me know if you have any feedback; reviews on the PRs and discussion
>> on the issues are also welcome.
>>
>> Thanks,
>> -Haizhou
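
For readers who want a concrete picture of what running the existing Spark
integration tests against a REST-backed catalog involves, here is a minimal
sketch of wiring a Spark session to an Iceberg REST catalog. The catalog name
"rest", the local server URI, and the table names are placeholders, not the
actual contents of PR 11093 or of the RCK test fixtures.

import org.apache.spark.sql.SparkSession;

public class RestCatalogSparkSketch {
  public static void main(String[] args) {
    // Register a Spark SQL catalog named "rest" backed by an Iceberg REST catalog.
    // The catalog name and the local server URI are placeholders.
    SparkSession spark = SparkSession.builder()
        .master("local[2]")
        .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.rest", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.rest.type", "rest")
        .config("spark.sql.catalog.rest.uri", "http://localhost:8181")
        .getOrCreate();

    // Existing integration test SQL can then target the REST-backed catalog.
    spark.sql("CREATE NAMESPACE IF NOT EXISTS rest.db");
    spark.sql("CREATE TABLE IF NOT EXISTS rest.db.t (id BIGINT) USING iceberg");
    spark.stop();
  }
}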
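
On issue 11109, a quick way to see the divergence is to compare the rows of
the table's history metadata table (which surfaces the snapshot log) before
and after a CREATE OR REPLACE. This is a hand-written sketch, assuming a
catalog named "demo" configured like the one above; it is not one of the
existing test cases.

import org.apache.spark.sql.SparkSession;

public class ReplaceTableHistorySketch {
  public static void main(String[] args) {
    // Assumes an active session with an Iceberg catalog named "demo" (placeholder).
    SparkSession spark = SparkSession.active();

    spark.sql("CREATE TABLE demo.db.t (id BIGINT) USING iceberg");
    spark.sql("INSERT INTO demo.db.t VALUES (1)");
    long before = spark.sql("SELECT * FROM demo.db.t.history").count();

    spark.sql("CREATE OR REPLACE TABLE demo.db.t (id BIGINT) USING iceberg");
    long after = spark.sql("SELECT * FROM demo.db.t.history").count();

    // Per the report, Hive/Hadoop catalogs keep the pre-replacement entries,
    // while the REST catalog clears them.
    System.out.printf("history rows before replace: %d, after replace: %d%n", before, after);
  }
}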
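
And on issue 11154, the mismatch is easiest to see on the identifiers
themselves. The following hand-written illustration uses placeholder names
("my_catalog", "db", "t2"); the stripping shown paraphrases what HiveCatalog's
rename path effectively does and is not code copied from either catalog.

import java.util.Arrays;

import org.apache.iceberg.catalog.Namespace;
import org.apache.iceberg.catalog.TableIdentifier;

public class RenameIdentifierSketch {
  public static void main(String[] args) {
    // For "ALTER TABLE my_catalog.db.t RENAME TO my_catalog.db.t2", Spark's
    // RenameTableExec can hand the connector a "to" identifier whose first
    // namespace level is actually the catalog name:
    TableIdentifier to = TableIdentifier.of(Namespace.of("my_catalog", "db"), "t2");

    // HiveCatalog effectively strips that leading level before renaming:
    String[] levels = to.namespace().levels();
    TableIdentifier stripped =
        TableIdentifier.of(
            Namespace.of(Arrays.copyOfRange(levels, 1, levels.length)), to.name());

    // RESTCatalog does no such pre-processing, so it tries to rename into the
    // namespace ("my_catalog", "db"), which usually does not exist on the server.
    System.out.println("raw 'to' identifier:      " + to);
    System.out.println("stripped 'to' identifier: " + stripped);
  }
}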