Thanks for looking into this, Haizhou. I'll take a closer look at the PRs this week or next.
Eduard

On Thu, Sep 19, 2024 at 2:22 AM Haizhou Zhao <zhaohaizhou940...@gmail.com> wrote:

> Hello dev-list,
>
> *What*
> I'm looking for issue and PR reviews from the community to enable REST
> Catalog based integration tests for query engines.
>
> Issue: https://github.com/apache/iceberg/issues/11079
> PR: https://github.com/apache/iceberg/pull/11093
>
> *Background*
> Recently, thanks to @Daniel's effort of adding RCK (REST Compatibility
> Kit) test utilities (ref: https://github.com/apache/iceberg/pull/10908),
> we can now spin up a simple REST Catalog within a test environment. I saw
> that our existing Spark integration tests are based on Hive & Hadoop
> Catalog only (ref:
> https://github.com/apache/iceberg/blob/2025e79/spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/CatalogTestBase.java),
> and I think our Spark connector release procedure will benefit from running
> the existing Spark integration tests against REST Catalog (leveraging the
> RCK util), alongside Hadoop & Hive.
>
> *Why*
> As the community gradually adopts REST Catalog, having Spark integration
> tests running against REST Catalog will make sure we capture any issues
> relevant to RESTCatalog clients early on, better serving REST Catalog
> adopters in the community. Additionally, if we can build Spark integration
> tests against REST Catalog, then this idea could extend to more query
> engines like Flink later.
>
> *Currently open issues and PRs*
> *PR:*
> 1. https://github.com/apache/iceberg/pull/11093, the very first step here
> is to add REST based integ tests to the Spark 3.5 tests. We can extend the
> tests to Spark 3.4 & 3.3 later if the community likes the idea.
>
> *Issues:*
> When enabling Spark integ tests on REST Catalog alongside Hadoop/Hive
> Catalog, there are some test cases that pass on Hadoop/Hive but fail on
> REST. They indicate either a behavior difference between the catalogs
> (when handling the same Spark command), or a potential issue to be
> looked into further.
>
> 1. https://github.com/apache/iceberg/issues/11103, REST Client will
> incorrectly modify the "last-updated-ms" attribute of table metadata after
> receiving responses from servers. This issue has been closed by community
> effort (thanks to @Eduard, @Ryan, @Daniel, @Steve for
> discussing/fixing/reviewing).
> 2. https://github.com/apache/iceberg/issues/11109, when issuing a Spark
> "CREATE OR REPLACE ${table}" command, Hive/Hadoop Catalog will not clear
> the snapshot logs (prior to table replacement), while REST Catalog will. I
> think we need some clarification on whether table replacement should clear
> snapshot logs.
> 3. https://github.com/apache/iceberg/issues/11154, REST Catalog at the
> moment will fail Spark rename tests ("ALTER ${table} RENAME TO
> ${table_rename}"). Spark call stacks (RenameTableExec) will pass the catalog
> name together with the namespace name in the "to" identifier to the Iceberg
> Spark connector call stacks. Meanwhile, the HiveCatalog rename method will
> always treat the first namespace layer of the "to" identifier as the catalog
> name and strip it before the actual rename, while RESTCatalog has no
> similar pre-processing. As a result, HiveCatalog passes the "ALTER TABLE
> RENAME" test but RESTCatalog does not.
>
> Let me know if you have any feedback. Reviews on the PRs and discussion on
> the issues are also welcome.
>
> Thanks,
> -Haizhou
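
For reference, the rename difference described in issue 11154 can be sketched roughly as below. This is only a minimal illustration in Python, not Iceberg's code: the identifier model, the function names, and the catalog name "testrest" are all hypothetical, and the check that the first namespace level matches the catalog name is an assumption made for the sketch.

```python
from typing import Tuple

# Hypothetical model of a table identifier: (namespace levels, table name).
# This is NOT an Iceberg class, only a stand-in for the sketch.
Identifier = Tuple[Tuple[str, ...], str]


def hive_style_target(catalog_name: str, to: Identifier) -> Identifier:
    """Strip a leading catalog name from the "to" identifier before renaming.

    Mirrors the pre-processing the email attributes to HiveCatalog.
    Assumption: the first level is only stripped when it equals the catalog name.
    """
    levels, table = to
    if len(levels) > 1 and levels[0].lower() == catalog_name.lower():
        return (levels[1:], table)
    return to


def rest_style_target(catalog_name: str, to: Identifier) -> Identifier:
    """No pre-processing: the "to" identifier is used exactly as received."""
    return to


# The "to" identifier as the email says Spark passes it: catalog name first,
# then the namespace.
to = (("testrest", "default"), "table_rename")

print(hive_style_target("testrest", to))  # namespace is ("default",)
print(rest_style_target("testrest", to))  # namespace still starts with "testrest"
```

With the catalog name left in place, the rename target points at a namespace that includes the catalog name, which is consistent with the failing "ALTER TABLE RENAME" test described above.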