Thank you, Leon

> How complex would it be to integrate sqllogictest into non-Rust clients?
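To get a feel for the effort involved, here is a minimal, illustrative sketch of a pure-Python runner that understands only the two record types shown later in this thread (statement ok and query). It is only meant to show how small the format is, not to replace a real runner; execute_sql is a placeholder for whatever engine a client wires in:

    from typing import Callable, List

    def run_script(script: str, execute_sql: Callable[[str], List[List[str]]]) -> None:
        """Run a tiny subset of sqllogictest: 'statement ok' and 'query' records.

        execute_sql is a placeholder: it runs one SQL string against whatever
        engine the client integrates (Spark, DataFusion, DuckDB, ...) and
        returns result rows as lists of string values.
        """
        # Records are separated by blank lines.
        for record in (r.strip() for r in script.split("\n\n") if r.strip()):
            lines = record.splitlines()
            header, body = lines[0], lines[1:]
            if header.startswith("statement ok"):
                execute_sql("\n".join(body))
            elif header.startswith("query"):
                # The SQL comes before the '----' separator, the expected rows after it.
                sep = body.index("----")
                sql = "\n".join(body[:sep])
                expected = body[sep + 1:]
                actual = [" ".join(row) for row in execute_sql(sql)]
                assert actual == expected, f"{sql!r}: got {actual}, want {expected}"
            else:
                raise ValueError(f"unsupported record: {header}")

In practice a client would reuse an existing runner (which handles many more directives) instead of hand-rolling one; the point is only that the format itself is simple to parse.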
I checked a bit about existing sqllogictest integration projects:

- java: https://github.com/hydromatic/sql-logic-test
- go: https://github.com/alkemir/sqllogictest
- python: https://github.com/duckdb/duckdb-sqllogictest-python

Maybe we can export a python binding from sqllogictest-rs directly?

> Should we centralize the shared Docker images and test suites, or let each
> client repo manage their own setup with flexibility to evolve as the
> development progresses?

We are still in the very early stages. I prefer to let each client implement their own approach first, and then we can decide how to evolve and collaborate together.

On Tue, Jun 10, 2025, at 03:54, Leon Lin wrote:
> Hi Xuanwo, Renjie
>
> I think sqllogictests is a good replacement for the JSON spec, and I'm definitely not trying to recommend the JSON spec, as I think it would be too complex to execute.
>
> As Renjie pointed out, sqllogictests is only suitable when a SQL engine is supported, but right now not all of the client implementations have engine integration, e.g. iceberg-go. That said, sqllogictests could still be useful for provisioning tables and validating results via Iceberg Spark, and can be extended later once engine integration is added.
>
> A few concerns I have right now are:
> • How complex would it be to integrate sqllogictest into non-Rust clients?
> • Should we centralize the shared Docker images and test suites, or let each client repo manage their own setup with flexibility to evolve as the development progresses?
>
> Will do some experiments with sqllogictests offline, and happy to discuss more!
>
> Best,
> Leon
>
> On Mon, Jun 9, 2025 at 2:34 AM Renjie Liu <liurenjie2...@gmail.com> wrote:
>> Hi, Leon:
>>
>> Thanks for raising this.
>>
>> In Rust we also have a similar plan to do integration tests against the Rust and Java implementations: https://github.com/apache/iceberg-rust/pull/581
>>
>> This approach is purely data driven and, as Xuanwo mentioned, motivated by sqllogictests. That is to say, we will define a set of SQL statements, and they can be executed by both Spark SQL and a Rust engine (DataFusion in this case). The downside of this method is that it requires integration with a SQL engine. Luckily in Rust we have DataFusion, but I'm not sure if this is the case for Python and Go.
>>
>> On Sat, Jun 7, 2025 at 9:47 AM Xuanwo <xua...@apache.org> wrote:
>>> Thank you Leon for starting this.
>>>
>>> It's very important for open formats like Iceberg to be interoperable across different implementations. And it's at the top of the list for iceberg-rust.
>>>
>>> My only concern is about the JSON spec. I'm wondering if it's a good idea for us to adopt the sqllogictests format: https://sqlite.org/sqllogictest/doc/trunk/about.wiki and https://github.com/risinglightdb/sqllogictest-rs.
>>>
>>> It was first used by SQLite and is now widely borrowed by many other SQL engines to build their test suites.
>>>
>>> It's something like:
>>>
>>> statement ok
>>> INSERT INTO a VALUES (42, 84);
>>>
>>> query II
>>> SELECT * FROM a;
>>> ----
>>> 42 84
>>>
>>> Basically, we have a way to define the SQL we are using, the result we are expecting, and a way to hint.
>>>
>>> What do you think?
>>>
>>> On Sat, Jun 7, 2025, at 07:46, Leon Lin wrote:
>>>> Hi Kevin,
>>>>
>>>> Thanks for bringing up the Arrow integration tests as a reference! I've looked into that setup as well. However, I found it difficult to apply the same model to Iceberg since Arrow and Iceberg are very different.
>>>> Arrow tests are centered around in-memory serialization and deserialization using JSON-defined schema types, whereas Iceberg operates on persisted table state and requires more extensive infrastructure, like a catalog and storage, to run the integration tests.
>>>>
>>>> One of the alternative approaches listed in the doc has a similar producer/consumer strategy to Arrow's, which is defining producer and consumer spec files in JSON that describe the actions clients should perform. Each client would then implement a runner that parses and executes those actions. However, mapping out every Iceberg capability with its inputs and expected outputs becomes quite complex, and I'm concerned it won't scale well over time.
>>>>
>>>> Feel free to leave comments in the doc and let me know what you think. I'm happy to explore and experiment with other ideas!
>>>>
>>>> Thanks,
>>>> Leon
>>>>
>>>> On Fri, Jun 6, 2025 at 12:39 PM Kevin Liu <kevinjq...@apache.org> wrote:
>>>>> Hi Leon,
>>>>>
>>>>> Thanks for starting this thread! I think this is a great idea. Happy to support this in any way I can.
>>>>>
>>>>> Matt Topol and I have previously discussed cross-client testing regarding the iceberg-go and iceberg-python implementations. There is a class of bugs that can be caught in this way. We somewhat do this today by copying over the integration test suite from iceberg-python to iceberg-go. I think even supporting a single verification step, through Spark, can provide us a lot of value in terms of testing for correctness.
>>>>>
>>>>> BTW, Matt mentioned that the Arrow ecosystem has similar integration tests across its clients. I haven't been able to look further, but he pointed me to https://github.com/apache/arrow/tree/main/dev/archery/archery/integration
>>>>>
>>>>> Looking forward to this!
>>>>>
>>>>> Best,
>>>>> Kevin Liu
>>>>>
>>>>> On Thu, Jun 5, 2025 at 4:56 PM Leon Lin <lianglin....@gmail.com> wrote:
>>>>>> Hello all,
>>>>>>
>>>>>> I would like to start a discussion on standardizing cross-client integration testing in the Iceberg projects. With all the active development across the different client implementations (Python, Rust, Go, etc.), it is important to make sure the implementations are interoperable with one another, so that tables created by one client can be read and written by another client without any incompatibilities, and so that divergence between implementations is detected early.
>>>>>>
>>>>>> There is already some great work done in PyIceberg to verify compatibility with the Iceberg Java implementation via Spark, and we could easily extend this into a two-step verification. I've outlined the details in the doc attached below, but the idea is to:
>>>>>> • Write tables using PySpark and verify them with client-side read tests.
>>>>>> • Write using the client and validate using PySpark scripts with assertions.
>>>>>>
>>>>>> While full matrix testing would be ideal to verify interoperability between any combination of clients, I haven't been able to find a clean way to do this without adding too much complexity or operational burden. I'd really appreciate any thoughts or ideas from the community, and I'm happy to contribute to moving this forward.
>>>>>>
>>>>>> Best,
>>>>>> Leon Lin
>>>>>>
>>>>>> *References:*
>>>>>> https://github.com/apache/iceberg-python/blob/main/tests/conftest.py#L2429
>>>>>> Issue: https://github.com/apache/iceberg/issues/13229
>>>>>> Standardize Cross Client Integration Testing <https://drive.google.com/open?id=1vZfVzGZucsDc35uoRrbn5CKHGd0muzcY7FAwmKt-KNI>
>>>
>>> Xuanwo
>>>
>>> https://xuanwo.io/

Xuanwo

https://xuanwo.io/
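For reference, step one of the two-step verification Leon proposes above (write with PySpark, verify with a client-side read) could be prototyped roughly as in the sketch below, using PyIceberg as the reading client. This is only an illustration: the catalog name, REST URI, warehouse path, and table identifier are placeholders, the Spark session assumes the Iceberg Spark runtime package is on the classpath, and real setups would also need object-store credentials:

    from pyspark.sql import SparkSession
    from pyiceberg.catalog import load_catalog

    CATALOG_URI = "http://localhost:8181"   # placeholder REST catalog endpoint
    WAREHOUSE = "s3://warehouse/"           # placeholder warehouse location
    TABLE = "default.interop_smoke"         # placeholder table identifier

    # Step 1a: write a small table through Spark (the Java reference implementation).
    spark = (
        SparkSession.builder
        .config("spark.sql.catalog.rest", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.rest.type", "rest")
        .config("spark.sql.catalog.rest.uri", CATALOG_URI)
        .getOrCreate()
    )
    spark.sql(f"CREATE TABLE rest.{TABLE} (id BIGINT, name STRING) USING iceberg")
    spark.sql(f"INSERT INTO rest.{TABLE} VALUES (1, 'a'), (2, 'b')")

    # Step 1b: read the same table back with the client under test (PyIceberg here)
    # and assert on the contents.
    catalog = load_catalog("rest", **{"type": "rest", "uri": CATALOG_URI, "warehouse": WAREHOUSE})
    rows = catalog.load_table(TABLE).scan().to_arrow().sort_by("id").to_pylist()
    assert rows == [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

Step two would simply invert the roles: create and write the table with the client under test, then re-read it from PySpark and assert there.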