Hello dev-list,

*What*
I'm looking for community reviews of the issues and PRs below, which enable
REST Catalog based integration tests for query engines.

Issue: https://github.com/apache/iceberg/issues/11079
PR: https://github.com/apache/iceberg/pull/11093

*Background*
Recently, thanks to @Daniel's effort adding the RCK (REST Compatibility Kit)
test utilities (ref: https://github.com/apache/iceberg/pull/10908), we can now
spin up a simple REST Catalog within the test environment. I noticed our
existing Spark integration tests run against the Hive & Hadoop Catalogs only
(ref:
https://github.com/apache/iceberg/blob/2025e79/spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/CatalogTestBase.java),
and I think our Spark connector release procedure would benefit from running
the existing Spark integration tests against a REST Catalog (leveraging the
RCK utilities), alongside Hadoop & Hive.
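
For context, wiring a REST Catalog into a Spark session only takes a few
catalog properties. Below is a minimal sketch, not the test code in the PR;
the class name, the catalog name "rest_cat", and the URI/port are placeholders,
and in the integration tests the endpoint would come from the RCK-backed test
fixture instead:

  import org.apache.spark.sql.SparkSession;

  // Minimal sketch: register an Iceberg REST catalog with a local Spark session.
  // "rest_cat" and the URI are placeholders for whatever the test fixture provides.
  public class RestCatalogSmokeTest {
    public static void main(String[] args) {
      SparkSession spark =
          SparkSession.builder()
              .master("local[2]")
              .config("spark.sql.catalog.rest_cat", "org.apache.iceberg.spark.SparkCatalog")
              .config("spark.sql.catalog.rest_cat.type", "rest")
              .config("spark.sql.catalog.rest_cat.uri", "http://localhost:8181")
              .getOrCreate();

      spark.sql("CREATE NAMESPACE IF NOT EXISTS rest_cat.db");
      spark.sql("CREATE TABLE IF NOT EXISTS rest_cat.db.smoke (id BIGINT) USING iceberg");
      spark.stop();
    }
  }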

*Why*
As the community gradually adopts the REST Catalog, running the Spark
integration tests against a REST Catalog will help us catch issues affecting
RESTCatalog clients early on, better serving REST Catalog adopters in the
community. Additionally, if we can build Spark integration tests against the
REST Catalog, the same idea could later extend to more query engines, such as
Flink.

*Currently open issues and PRs*
*PR:*
1. https://github.com/apache/iceberg/pull/11093, the very first step here
is to add REST-based integration tests to the Spark 3.5 test suite. We can
extend these tests to Spark 3.4 & 3.3 later if the community likes the idea.

*Issues:*
When enabling the Spark integration tests on REST Catalog alongside the
Hadoop/Hive Catalogs, there are some test cases that pass with Hadoop/Hive but
fail with REST. They indicate either a behavior difference between the
catalogs (when handling the same Spark command) or a potential issue that
needs further investigation.

1. https://github.com/apache/iceberg/issues/11103, the REST client would
incorrectly modify the "last-updated-ms" attribute of table metadata after
receiving responses from the server. This issue has since been closed by
community effort (thanks to @Eduard, @Ryan, @Daniel, @Steve for
discussing/fixing/reviewing).
2. https://github.com/apache/iceberg/issues/11109, when issuing a Spark
"CREATE OR REPLACE ${table}" command, the Hive/Hadoop Catalogs do not clear
the snapshot log entries recorded prior to the replacement, while the REST
Catalog does. I think we need some clarification on whether a table
replacement should clear the snapshot log (a hypothetical repro is sketched
after this list).
3. https://github.com/apache/iceberg/issues/11154, the REST Catalog currently
fails the Spark rename tests ("ALTER ${table} RENAME TO ${table_rename}").
Spark's call stack (RenameTableExec) passes the catalog name together with the
namespace in the "to" identifier handed to the Iceberg Spark connector.
HiveCatalog's rename method always treats the first namespace level of the
"to" identifier as the catalog name and strips it before the actual rename,
while RESTCatalog does no similar pre-processing; as a result, HiveCatalog
passes the "ALTER TABLE ... RENAME" test but RESTCatalog does not (see the
identifier-handling sketch after this list).
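
To make the snapshot-log discrepancy in (2) concrete, here is a hypothetical
repro, reusing the SparkSession from the sketch in the Background section;
"cat.db.t" is a placeholder identifier, not the exact test code:

  // Hypothetical repro for issue #11109.
  spark.sql("CREATE TABLE cat.db.t (id BIGINT) USING iceberg");
  spark.sql("INSERT INTO cat.db.t VALUES (1)");
  spark.sql("INSERT INTO cat.db.t VALUES (2)");

  // Replace the table in place.
  spark.sql("CREATE OR REPLACE TABLE cat.db.t (id BIGINT) USING iceberg");

  // Inspect the snapshot log through the history metadata table. With the
  // Hive/Hadoop catalogs the pre-replacement entries are still listed; with
  // REST Catalog the log comes back empty, which is the difference in question.
  spark.sql("SELECT * FROM cat.db.t.history").show();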
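
And for the rename issue in (3), here is a rough sketch of the identifier
handling difference; it paraphrases the behavior described in the issue rather
than the actual HiveCatalog/RESTCatalog code, and all names are made up:

  import java.util.Arrays;
  import org.apache.iceberg.catalog.Namespace;
  import org.apache.iceberg.catalog.TableIdentifier;

  // The "to" identifier roughly as Spark's RenameTableExec hands it to the
  // connector: the leading namespace level is actually the Spark catalog name.
  TableIdentifier to = TableIdentifier.of(Namespace.of("my_catalog", "db"), "t_renamed");

  // HiveCatalog-style pre-processing (paraphrased): drop the first namespace
  // level, treating it as the catalog name, before performing the rename.
  String[] levels = to.namespace().levels();
  TableIdentifier stripped =
      TableIdentifier.of(Namespace.of(Arrays.copyOfRange(levels, 1, levels.length)), to.name());
  // stripped -> db.t_renamed, which the backing metastore can resolve.

  // RESTCatalog performs no such stripping, so the "to" identifier stays as
  // my_catalog.db.t_renamed and the rename (and thus the Spark test) fails.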

Let me know if you have any feedback; reviews on the PRs and discussion on the
issues are also very welcome.

Thanks,
-Haizhou
