Hi all,

Thanks for all the suggestions. I agree that a good starting point would be
to have some fixtures that can be easily reused across different
implementations. This could be a Docker image published from the main
Iceberg repository, bundling everything needed.

Then we can proceed to experiment with sqllogictests for engines that
support SQL execution, and use PySpark scripts for verifying core API
behaviors where engine support isn't available.
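To make the PySpark-verification step a bit more concrete, here is a minimal, illustrative sketch of the comparison such a script could perform once both sides have materialized rows. All names here are hypothetical; in practice the rows might come from PySpark's df.collect() on one side and a client's scan API on the other:

```python
# Illustrative sketch of the verification step: compare rows written by
# one client against rows read back by another. Rows are assumed to be
# materialized as sequences of values; row order is not significant for
# an Iceberg table scan, so both sides are canonicalized before the
# comparison. All function and variable names are hypothetical.

def normalize(rows):
    """Canonicalize rows: make each row a tuple and sort the row set."""
    return sorted(tuple(row) for row in rows)

def verify_round_trip(written_rows, read_rows):
    """Return (ok, message) describing whether the two row sets match."""
    expected, actual = normalize(written_rows), normalize(read_rows)
    if expected == actual:
        return True, f"OK: {len(actual)} rows match"
    missing = [r for r in expected if r not in actual]
    extra = [r for r in actual if r not in expected]
    return False, f"MISMATCH: missing={missing} extra={extra}"

# Example: rows produced by the writer vs. rows seen by the reader.
ok, msg = verify_round_trip([(42, 84), (1, 2)], [(1, 2), (42, 84)])
print(msg)
```

The same comparison works in either direction (PySpark writes / client reads, or client writes / PySpark validates), which is what makes the two-step setup attractive.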

Hope this aligns with everyone’s thinking!

Best,
Leon

On Tue, Jun 10, 2025 at 1:39 PM Jayce Slesar <jayces...@gmail.com> wrote:

> Hi all, also happy to support this!
>
> I think one thing I'm looking forward to from this work, in addition to
> the general improvements, is being able to use it as a building block for
> instrumenting different Iceberg implementations. From what I currently
> understand, maintainers need to know which features the different
> implementations support, and we could eventually use standardized testing
> to gather that information automatically, along with performance
> statistics for various tests further down the road.
>
> On Tue, Jun 10, 2025 at 6:35 AM Renjie Liu <liurenjie2...@gmail.com>
> wrote:
>
>> Hi, Leon:
>>
>> > How complex would it be to integrate sqllogictest into non-Rust
>> clients?
>>
>> This seems non-trivial to me. Note that it's not only about
>> parsing/executing sqllogictest; the underlying SQL engine also needs to
>> integrate with Iceberg's language client.
>>
>> > Should we centralize the shared Docker images and test suites, or let
>> each client repo manage their own setup with flexibility to evolve as the
>> development progresses?
>>
>> I agree with Xuanwo that we should wait until our solutions are mature
>> enough; before that, we should not rush to make decisions.
>>
>>
>> On Tue, Jun 10, 2025 at 12:24 PM Xuanwo <xua...@apache.org> wrote:
>>
>>> Thank you, Leon
>>>
>>> > How complex would it be to integrate sqllogictest into non-Rust
>>> clients?
>>>
>>> I checked a bit into existing sqllogictest integration projects:
>>>
>>> - java: https://github.com/hydromatic/sql-logic-test
>>> - go: https://github.com/alkemir/sqllogictest
>>> - python: https://github.com/duckdb/duckdb-sqllogictest-python
>>>   - Maybe we can export a python binding from sqllogictest-rs directly?
>>>
>>> > Should we centralize the shared Docker images and test suites, or let
>>> each client repo manage their own setup with flexibility to evolve as the
>>> development progresses?
>>>
>>> We are still in the very early stages. I prefer to let each client
>>> implement their own approach first, and then we can decide how to evolve
>>> and collaborate together.
>>>
>>> On Tue, Jun 10, 2025, at 03:54, Leon Lin wrote:
>>>
>>> Hi Xuanwo, Renjie
>>>
>>> I think sqllogictests would be a good replacement for the JSON spec, and
>>> I'm definitely not recommending the JSON spec, as I think it would be too
>>> complex to execute.
>>>
>>> As Renjie pointed out, sqllogictests are only suitable when a SQL engine
>>> is available, and right now not all client implementations have engine
>>> integration (iceberg-go, for example). That said, sqllogictests could
>>> still be useful for provisioning tables and validating results via
>>> Iceberg Spark, and could be extended later once engine integration is
>>> added.
>>>
>>> A few concerns I have right now are:
>>>
>>>    - How complex would it be to integrate sqllogictest into non-Rust
>>>    clients?
>>>    - Should we centralize the shared Docker images and test suites, or
>>>    let each client repo manage their own setup with flexibility to evolve as
>>>    the development progresses?
>>>
>>> I'll do some experiments with sqllogictests offline and am happy to
>>> discuss more!
>>>
>>> Best,
>>> Leon
>>>
>>> On Mon, Jun 9, 2025 at 2:34 AM Renjie Liu <liurenjie2...@gmail.com>
>>> wrote:
>>>
>>> Hi, Leon:
>>>
>>> Thanks for raising this.
>>>
>>> In Rust we also have a similar plan to run integration tests against the
>>> Rust and Java implementations: https://github.com/apache/iceberg-rust/pull/581
>>>
>>>
>>> This approach is purely data driven and, as Xuanwo mentioned, motivated
>>> by sqllogictests. That is to say, we define a set of SQL statements that
>>> can be executed both by Spark SQL and by a Rust engine (DataFusion in
>>> this case). The downside of this method is that it requires integration
>>> with a SQL engine; luckily, in Rust we have DataFusion, but I'm not sure
>>> the same is true for Python and Go.
>>>
>>> On Sat, Jun 7, 2025 at 9:47 AM Xuanwo <xua...@apache.org> wrote:
>>>
>>>
>>> Thank you Leon for starting this.
>>>
>>> It's very important for open formats like Iceberg to be interoperable
>>> across different implementations. And it's on the top list of iceberg-rust.
>>>
>>> My only concern is about the JSON spec. I'm wondering if it would be a
>>> good idea for us to adopt the sqllogictest format:
>>> https://sqlite.org/sqllogictest/doc/trunk/about.wiki and
>>> https://github.com/risinglightdb/sqllogictest-rs.
>>>
>>> It was first used by SQLite and has since been widely borrowed by many
>>> other SQL engines to build their test suites.
>>>
>>> It's something like:
>>>
>>> statement ok
>>> INSERT INTO a VALUES (42, 84);
>>>
>>> query II
>>> SELECT * FROM a;
>>> ----
>>> 42 84
>>>
>>> Basically, it gives us a way to define the SQL we are running, the
>>> result we expect, and type hints for the output columns.
>>>
>>> What do you think?
>>>
>>>
>>> On Sat, Jun 7, 2025, at 07:46, Leon Lin wrote:
>>>
>>> Hi Kevin,
>>>
>>> Thanks for bringing up the Arrow integration tests as a reference! I’ve
>>> looked into that setup as well. However, I found it difficult to apply the
>>> same model to Iceberg since Arrow and Iceberg are very different. Arrow
>>> tests are centered around in-memory serialization and deserialization using
>>> JSON-defined schema types, whereas Iceberg operates on persisted table
>>> state and requires more extensive infrastructure, like a catalog and
>>> storage, to run the integration tests.
>>>
>>> One of the alternative approaches listed in the doc has a similar
>>> producer / consumer strategy as Arrow, which is defining producer and
>>> consumer spec files in JSON that describe the actions clients should
>>> perform. Each client would then implement a runner that parses and executes
>>> those actions. However, mapping out every Iceberg capability with its
>>> inputs and expected outputs becomes quite complex, and I’m concerned it
>>> won’t scale well over time.
>>>
>>> Feel free to leave comments in the doc and let me know what you think.
>>> I’m happy to explore and experiment with other ideas!
>>>
>>> Thanks,
>>> Leon
>>>
>>> On Fri, Jun 6, 2025 at 12:39 PM Kevin Liu <kevinjq...@apache.org> wrote:
>>>
>>> Hi Leon,
>>>
>>> Thanks for starting this thread! I think this is a great idea. Happy to
>>> support this in any way I can.
>>>
>>> Matt Topol and I have previously discussed cross-client testing for the
>>> iceberg-go and iceberg-python implementations. There is a class of bugs
>>> that can be caught this way. We somewhat do this today by copying the
>>> integration test suite from iceberg-python over to iceberg-go. I think
>>> even supporting a single verification step, through Spark, can provide a
>>> lot of value in terms of testing for correctness.
>>>
>>> BTW, Matt mentioned that the Arrow ecosystem has similar integration
>>> tests across its clients. I haven't been able to look further, but he
>>> pointed me to
>>> https://github.com/apache/arrow/tree/main/dev/archery/archery/integration
>>>
>>> Looking forward to this!
>>>
>>> Best,
>>> Kevin Liu
>>>
>>> On Thu, Jun 5, 2025 at 4:56 PM Leon Lin <lianglin....@gmail.com> wrote:
>>>
>>> Hello all,
>>>
>>> I would like to start a discussion on standardizing cross-client
>>> integration testing in the Iceberg projects. With all the active
>>> development across the different client implementations (Python, Rust,
>>> Go, etc.), it is important to make sure the implementations are
>>> interoperable: tables created by one client should be readable and
>>> writable by another without incompatibilities, and divergence between
>>> implementations should be detected early.
>>>
>>> There is already some great work in PyIceberg that verifies
>>> compatibility with the Iceberg Java implementation via Spark, and we
>>> could easily extend this into a two-step verification. I've outlined the
>>> details in the doc attached below, but the idea is to:
>>>
>>>    - Write tables using PySpark and verify them with client-side read
>>>    tests.
>>>    - Write using the client and validate using PySpark scripts with
>>>    assertions.
>>>
>>> While full matrix testing would be ideal to verify interoperability
>>> between any combination of clients, I haven't been able to find a clean
>>> way to do this without adding too much complexity or operational burden.
>>> I'd really appreciate any thoughts or ideas from the community, and I'm
>>> happy to contribute to moving this forward.
>>>
>>> Best,
>>> Leon Lin
>>>
>>> *References:*
>>>
>>> https://github.com/apache/iceberg-python/blob/main/tests/conftest.py#L2429
>>> Issue: https://github.com/apache/iceberg/issues/13229
>>>  Standardize Cross Client Integration Testing
>>> <https://drive.google.com/open?id=1vZfVzGZucsDc35uoRrbn5CKHGd0muzcY7FAwmKt-KNI>
>>>
>>> Xuanwo
>>>
>>> https://xuanwo.io/
>>>
>>>
>>>
