Re: REST Catalog based Integration Test for Query Engines

Eduard Tudenhöfner Thu, 19 Sep 2024 00:41:29 -0700

Thanks for looking into this Haizhou. I'll take a closer look at the PRs
this/next week.


Eduard

On Thu, Sep 19, 2024 at 2:22 AM Haizhou Zhao <zhaohaizhou940...@gmail.com>
wrote:

> Hello dev-list,
>
> *What*
> I'm looking for community reviews of the issues and PRs below, which aim to
> enable REST Catalog based integration tests for query engines.
>
> Issue: https://github.com/apache/iceberg/issues/11079
> PR: https://github.com/apache/iceberg/pull/11093
>
> *Background*
> Recently, thanks to @Daniel's effort in adding the RCK (REST Compatibility
> Kit) test utilities (ref: https://github.com/apache/iceberg/pull/10908),
> we can now spin up a simple REST Catalog within a test environment. I saw
> that our existing Spark integration tests are based on the Hive & Hadoop
> Catalogs only (ref:
> https://github.com/apache/iceberg/blob/2025e79/spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/CatalogTestBase.java),
> and I think our Spark connector release procedure will benefit from running
> the existing Spark integration tests against REST Catalog (leveraging the
> RCK util), alongside Hadoop & Hive.
>
> *Why*
> As the community gradually adopts REST Catalog, running the Spark
> integration tests against REST Catalog will help us catch issues that
> affect RESTCatalog clients early on, which better serves REST Catalog
> adopters in the community. Additionally, if we can build Spark integration
> tests against REST Catalog, the same idea could later extend to more query
> engines, such as Flink.
>
> *Currently open issues and PRs*
> *PR:*
> 1. https://github.com/apache/iceberg/pull/11093, the very first step here
> is to add REST based integ tests to Spark 3.5 tests. We can extend the
> tests to Spark 3.4 & 3.3 later if the community likes the idea.
>
> *Issues:*
> When enabling Spark integ tests on REST Catalog alongside Hadoop/Hive
> Catalog, some test cases pass on Hadoop/Hive but fail on REST. These
> indicate either a behavior difference between the catalogs (when handling
> the same Spark command) or a potential issue to be looked into further.
>
> 1. https://github.com/apache/iceberg/issues/11103, the REST client
> incorrectly modified the "last-updated-ms" attribute of table metadata after
> receiving responses from the server. This issue has been closed by community
> effort (thx to @Eduard, @Ryan, @Daniel, @Steve for
> discussing/fixing/reviewing).
> 2. https://github.com/apache/iceberg/issues/11109, when issuing a Spark
> "CREATE OR REPLACE ${table}" command, Hive/Hadoop Catalog does not clear
> the snapshot logs (from before the table replacement), while REST Catalog
> does. I think we need some clarification on whether table replacement
> should clear snapshot logs.
> 3. https://github.com/apache/iceberg/issues/11154, REST Catalog currently
> fails the Spark rename tests ("ALTER TABLE ${table} RENAME TO
> ${table_rename}"). Spark (RenameTableExec) passes the catalog name together
> with the namespace name in the "to" identifier to the Iceberg Spark
> connector. The HiveCatalog rename method always treats the first namespace
> level of the "to" identifier as the catalog name and strips it before the
> actual rename, while RESTCatalog has no similar pre-processing. As a
> result, HiveCatalog passes the "ALTER TABLE RENAME" test but RESTCatalog
> does not.
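
To make the difference in item 3 concrete, here is a rough sketch in plain Java. It does not use Iceberg's or Spark's actual classes: the class name, the method name, and the example values are all hypothetical, and the guard for short identifiers is my own assumption rather than something taken from HiveCatalog.

```java
import java.util.List;

public class RenameIdentifierSketch {

    // Hypothetical stand-in for the pre-processing described in item 3:
    // treat the first part of the "to" identifier as a catalog name and drop it.
    // Assumption: identifiers with fewer than three parts are left unchanged.
    static List<String> stripLeadingCatalog(List<String> toIdentifier) {
        if (toIdentifier.size() < 3) {
            return toIdentifier;
        }
        return List.copyOf(toIdentifier.subList(1, toIdentifier.size()));
    }

    public static void main(String[] args) {
        // Parts of the "to" identifier as the engine might hand them over:
        // catalog name, namespace, table name (all example values).
        List<String> to = List.of("my_catalog", "db", "table_renamed");

        // With stripping, the rename targets db.table_renamed.
        System.out.println(String.join(".", stripLeadingCatalog(to)));

        // Without stripping, the catalog name is kept as an extra namespace level.
        System.out.println(String.join(".", to));
    }
}
```

A catalog that skips the stripping step would look for a namespace that starts with the catalog name, which is one plausible way the same rename could pass on one catalog and fail on another.
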
>
> Please let me know if you have any feedback. Reviews on the PRs and
> discussion on the issues are also welcome.
>
> Thanks,
> -Haizhou
>
>
