+1 for using the REST catalog in the tests. Thanks Haizhou for doing this!

Yufei


On Thu, Sep 19, 2024 at 12:41 AM Eduard Tudenhöfner <
etudenhoef...@apache.org> wrote:

> Thanks for looking into this Haizhou. I'll take a closer look at the PRs
> this/next week.
>
> Eduard
>
> On Thu, Sep 19, 2024 at 2:22 AM Haizhou Zhao <zhaohaizhou940...@gmail.com>
> wrote:
>
>> Hello dev-list,
>>
>> *What*
>> I'm looking for community reviews of the issues and PRs below, which enable
>> REST Catalog based integration tests for query engines.
>>
>> Issue: https://github.com/apache/iceberg/issues/11079
>> PR: https://github.com/apache/iceberg/pull/11093
>>
>> *Background*
>> Recently, thanks to @Daniel's effort of adding RCK (REST Compatibility
>> Kit) test utilities (ref: https://github.com/apache/iceberg/pull/10908),
>> we can now spin up a simple REST Catalog within a test environment. Our
>> existing Spark integration tests are based on the Hive & Hadoop Catalogs only
>> (ref:
>> https://github.com/apache/iceberg/blob/2025e79/spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/CatalogTestBase.java),
>> and I think our Spark connector release procedure will benefit from running
>> the existing Spark integration tests against the REST Catalog (leveraging
>> the RCK utilities), alongside Hadoop & Hive.
>>
>> *Why*
>> As the community gradually adopts REST Catalog, having Spark integration
>> tests running against REST Catalog will make sure we capture any issues
>> relevant to RESTCatalog clients early on, better serving REST Catalog
>> adopters in the community. Additionally, if we can build Spark integration
>> tests against REST Catalog, then this idea could extend to more query
>> engines like Flink later.
>>
>> *Currently open issues and PRs*
>> *PR:*
>> 1. https://github.com/apache/iceberg/pull/11093: the first step is to
>> add REST-based integration tests to the Spark 3.5 tests. We can extend the
>> tests to Spark 3.4 & 3.3 later if the community likes the idea.
>>
>> *Issues:*
>> When enabling Spark integration tests on the REST Catalog alongside the
>> Hadoop/Hive Catalogs, some test cases pass on Hadoop/Hive but fail on REST.
>> They indicate either a behavior difference between the catalogs (when
>> handling the same Spark command) or a potential issue to be looked into
>> further.
>>
>> 1. https://github.com/apache/iceberg/issues/11103: the REST client
>> incorrectly modified the "last-updated-ms" attribute of table metadata after
>> receiving responses from servers. This issue has been closed by community
>> effort (thanks to @Eduard, @Ryan, @Daniel, @Steve for
>> discussing/fixing/reviewing).
>> 2. https://github.com/apache/iceberg/issues/11109: when issuing a Spark
>> "CREATE OR REPLACE ${table}" command, the Hive/Hadoop Catalogs do not clear
>> the snapshot logs (from before the table replacement), while the REST
>> Catalog does. I think we need some clarification on whether table
>> replacement should clear snapshot logs.
>> 3. https://github.com/apache/iceberg/issues/11154: the REST Catalog
>> currently fails the Spark rename tests ("ALTER ${table} RENAME TO
>> ${table_rename}"). Spark (RenameTableExec) passes the catalog name together
>> with the namespace name in the "to" identifier to the Iceberg Spark
>> connector. The HiveCatalog rename method always treats the first namespace
>> level of the "to" identifier as the catalog name and strips it before the
>> actual rename, while RESTCatalog has no similar pre-processing. As a
>> result, HiveCatalog passes the "ALTER TABLE RENAME" test but RESTCatalog
>> does not.
>>
>> Please let me know if you have any feedback. Reviews of the PRs and
>> discussion on the issues are also welcome.
>>
>> Thanks,
>> -Haizhou
>>
>>
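
For readers following issue 11154 in the quoted message, here is a minimal sketch of the identifier mismatch it describes. This is not Iceberg's actual code: the class, the method names, and the catalog/namespace/table names are all hypothetical, and identifiers are modeled as plain lists of strings (namespace levels followed by the table name). It only illustrates why one path ends up renaming to `default.table_rename` while the other keeps the extra leading level.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names come from Iceberg.
final class RenameIdentifierSketch {

  // Hive-like path as described in the email: assume the first level of the
  // "to" identifier is the catalog name and drop it before renaming.
  static List<String> stripLeadingCatalog(List<String> parts) {
    if (parts.size() < 2) {
      return new ArrayList<>(parts); // nothing to strip from a bare table name
    }
    return new ArrayList<>(parts.subList(1, parts.size()));
  }

  // REST-like path as described in the email: no pre-processing, so the
  // catalog name stays in the identifier as an extra namespace level.
  static List<String> passThrough(List<String> parts) {
    return new ArrayList<>(parts);
  }

  public static void main(String[] args) {
    // Assumed shape of what Spark hands over: catalog, namespace, table.
    List<String> to = Arrays.asList("testrest", "default", "table_rename");
    System.out.println("stripped:     " + stripLeadingCatalog(to));
    System.out.println("pass-through: " + passThrough(to));
  }
}
```

Under these assumptions, the pass-through result still has three levels, so a catalog would look for a namespace named after the catalog itself, which is consistent with the failing rename test reported in the issue. Whether the stripping belongs in the catalog or in the Spark connector is a question for the issue discussion, not something this sketch decides.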
