Thank you Leon for starting this.

It's very important for open formats like Iceberg to be interoperable across 
different implementations, and this is a top priority for iceberg-rust.

My only concern is about the JSON spec. I'm wondering whether it would be a 
good idea for us to adopt the sqllogictest format instead: 
https://sqlite.org/sqllogictest/doc/trunk/about.wiki and 
https://github.com/risinglightdb/sqllogictest-rs.

It was first used by SQLite and has since been adopted by many other SQL 
engines to build their test suites.

It's something like:

statement ok
INSERT INTO a VALUES (42, 84);

query II
SELECT * FROM a;
----
42 84

Basically, it gives us a way to define the SQL we run, the result we expect, 
and hints for the runner.
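To make the format concrete, here is a minimal sketch in Python of how records 
in this format could be parsed. This is purely illustrative and is not how 
sqllogictest-rs works internally:

```python
# Minimal, illustrative parser for sqllogictest-style records.
# A real runner handles many more directives (sort modes, hash
# thresholds, conditional skips); this only covers the two record
# kinds shown above.

def parse_records(text):
    """Split a test file into (kind, hint, sql, expected) records."""
    records = []
    for chunk in text.strip().split("\n\n"):
        lines = chunk.splitlines()
        header = lines[0].split()
        if header[0] == "statement":
            # e.g. "statement ok" followed by the SQL to execute;
            # no result rows are expected
            records.append(("statement", header[1], "\n".join(lines[1:]), None))
        elif header[0] == "query":
            # e.g. "query II" hints two integer result columns;
            # expected rows follow the "----" separator
            sep = lines.index("----")
            sql = "\n".join(lines[1:sep])
            expected = lines[sep + 1:]
            records.append(("query", header[1], sql, expected))
    return records

example = """statement ok
INSERT INTO a VALUES (42, 84);

query II
SELECT * FROM a;
----
42 84"""

records = parse_records(example)
```

The nice property is that the test files themselves stay engine-agnostic: each 
client only needs a small runner like this plus its own SQL execution hook.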

What do you think?


On Sat, Jun 7, 2025, at 07:46, Leon Lin wrote:
> Hi Kevin, 
> 
> Thanks for bringing up the Arrow integration tests as a reference! I’ve 
> looked into that setup as well. However, I found it difficult to apply the 
> same model to Iceberg since Arrow and Iceberg are very different. Arrow tests 
> are centered around in-memory serialization and deserialization using 
> JSON-defined schema types, whereas Iceberg operates on persisted table state 
> and requires more extensive infrastructure, like a catalog and storage, to 
> run the integration tests.
> 
> One of the alternative approaches listed in the doc uses a producer / 
> consumer strategy similar to Arrow's: producer and consumer spec files in 
> JSON describe the actions clients should perform, and each client then 
> implements a runner that parses and executes those actions. 
> However, mapping out every Iceberg capability with its inputs and expected 
> outputs becomes quite complex, and I’m concerned it won’t scale well over 
> time.
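> As a purely hypothetical illustration (none of these field or action names 
> are an agreed format), such a spec file might look like:
> 
>     {
>       "producer": "iceberg-python",
>       "actions": [
>         {"op": "create_table", "identifier": "default.t", "schema": "..."},
>         {"op": "append", "identifier": "default.t", "rows": 3}
>       ],
>       "consumers": ["iceberg-rust", "iceberg-go"],
>       "expect": {"row_count": 3, "snapshot_count": 1}
>     }
> 
> Even this toy example hints at the scaling problem: every new table 
> operation needs its own action type, parameters, and expectation fields.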
> 
> Feel free to leave comments in the doc and let me know what you think. I’m 
> happy to explore and experiment with other ideas!
> 
> Thanks,
> Leon
> 
> On Fri, Jun 6, 2025 at 12:39 PM Kevin Liu <kevinjq...@apache.org> wrote:
>> Hi Leon,
>> 
>> Thanks for starting this thread! I think this is a great idea. Happy to 
>> support this in any way I can.
>> 
>> Matt Topol and I have previously discussed cross-client testing regarding 
>> the iceberg-go and iceberg-python implementations. There is a class of bugs 
>> that can be caught this way. We somewhat do this today by copying over 
>> the integration test suite from iceberg-python to iceberg-go. I think even 
>> supporting a single verification step, through Spark, can provide us a lot 
>> of value in terms of testing for correctness. 
>> 
>> BTW, Matt mentioned that the Arrow ecosystem has similar integration tests 
>> across its clients. I haven't been able to look further, but he pointed me 
>> to https://github.com/apache/arrow/tree/main/dev/archery/archery/integration
>> 
>> Looking forward to this! 
>> 
>> Best,
>> Kevin Liu
>> 
>> On Thu, Jun 5, 2025 at 4:56 PM Leon Lin <lianglin....@gmail.com> wrote:
>>> Hello all,
>>> 
>>> I would like to start a discussion on standardizing cross-client 
>>> integration testing in the Iceberg projects. With all the active 
>>> development across the different client implementations (Python, Rust, Go, 
>>> etc.), it is important to make sure the implementations are interoperable 
>>> with one another: tables created by one client should be readable and 
>>> writable by another without any incompatibilities, and testing should help 
>>> detect divergence between implementations early. 
>>> 
>>> There is already some great work done in PyIceberg to verify compatibility 
>>> with the Iceberg Java implementation via Spark, and we could easily extend 
>>> this into a two-step verification. I’ve outlined the details in the doc 
>>> attached below, but the idea is to:
>>>  • Write tables using PySpark and verify them with client-side read tests.
>>>  • Write using the client and validate using PySpark scripts with 
>>> assertions.
>>> While full matrix testing across every combination of clients would be 
>>> ideal for verifying interoperability, I haven’t been able to find a clean 
>>> way to do this without adding too much complexity or operational burden. 
>>> I’d really appreciate any thoughts or ideas from the community, and I’m 
>>> happy to contribute to moving this forward. 
>>> 
>>> Best,
>>> Leon Lin
>>> 
>>> *References:*
>>> https://github.com/apache/iceberg-python/blob/main/tests/conftest.py#L2429
>>> Issue: https://github.com/apache/iceberg/issues/13229
>>>  Standardize Cross Client Integration Testing 
>>> <https://drive.google.com/open?id=1vZfVzGZucsDc35uoRrbn5CKHGd0muzcY7FAwmKt-KNI>
Xuanwo

https://xuanwo.io/
