Hi, Leon:

Thanks for raising this.

In Rust we also have a similar plan to do integration tests against the Rust
and Java implementations: https://github.com/apache/iceberg-rust/pull/581

This approach is purely data driven and, as Xuanwo mentioned, motivated by
sqllogictest. That is to say, we define a set of SQL statements that can be
executed by both Spark SQL and a Rust engine (DataFusion in this case). The
downside of this method is that it requires integration with a SQL engine.
Luckily, in Rust we have DataFusion, but I'm not sure the same is true for
Python and Go.
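
For illustration, here is a minimal sketch of how such a data-driven test
file could be replayed on the Spark SQL side from Python. The record parsing
and result comparison are deliberately simplified, and the helper below is
hypothetical rather than part of any existing harness:

from pyspark.sql import SparkSession

def run_slt_file(spark, path):
    """Replay a simplified sqllogictest-style file against Spark SQL."""
    with open(path) as f:
        # Records in the file are separated by blank lines.
        records = [r for r in f.read().split("\n\n") if r.strip()]
    for record in records:
        lines = record.strip().splitlines()
        header, body = lines[0], lines[1:]
        if header.startswith("statement ok"):
            spark.sql("\n".join(body))
        elif header.startswith("query"):
            sep = body.index("----")
            sql = "\n".join(body[:sep])
            expected = body[sep + 1:]
            actual = [" ".join(str(v) for v in row)
                      for row in spark.sql(sql).collect()]
            assert actual == expected, f"{sql!r}: {actual} != {expected}"

if __name__ == "__main__":
    # Assumes a Spark session already configured with an Iceberg catalog.
    spark = SparkSession.builder.getOrCreate()
    run_slt_file(spark, "tests/basic_insert.slt")

The same file could then be fed to the Rust runner, so both engines are
checked against one shared set of statements and expected results.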

On Sat, Jun 7, 2025 at 9:47 AM Xuanwo <xua...@apache.org> wrote:

> Thank you Leon for starting this.
>
> It's very important for open formats like Iceberg to be interoperable
> across different implementations, and it's at the top of the list for
> iceberg-rust.
>
> My only concern is about the JSON spec. I'm wondering if it would be a
> good idea for us to adopt the sqllogictest format:
> https://sqlite.org/sqllogictest/doc/trunk/about.wiki and
> https://github.com/risinglightdb/sqllogictest-rs.
>
> It was used by SQLite first and has since been widely adopted by many
> other SQL engines to build their test suites.
>
> It's something like:
>
> statement ok
> INSERT INTO a VALUES (42, 84);
>
> query II
> SELECT * FROM a;
> ----
> 42 84
>
> Basically, we have a way to define the SQL we are running, the result we
> are expecting, and a way to give hints (the "II" above marks two integer
> result columns).
>
> What do you think?
>
>
> On Sat, Jun 7, 2025, at 07:46, Leon Lin wrote:
>
> Hi Kevin,
>
> Thanks for bringing up the Arrow integration tests as a reference! I’ve
> looked into that setup as well. However, I found it difficult to apply the
> same model to Iceberg since Arrow and Iceberg are very different. Arrow
> tests are centered around in-memory serialization and deserialization using
> JSON-defined schema types, whereas Iceberg operates on persisted table
> state and requires more extensive infrastructure, like a catalog and
> storage, to run the integration tests.
>
> One of the alternative approaches listed in the doc uses a producer/consumer
> strategy similar to Arrow's: defining producer and consumer spec files in
> JSON that describe the actions clients should perform. Each client would
> then implement a runner that parses and executes those actions. However,
> mapping out every Iceberg capability with its inputs and expected outputs
> becomes quite complex, and I'm concerned it won't scale well over time.
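>
> To make that concrete, a spec entry might look roughly like the hypothetical
> sketch below; the action names, fields, and dispatch are made up for
> illustration and are not taken from the doc:
>
> import json
>
> # Hypothetical producer/consumer spec: each action names an Iceberg
> # capability together with its inputs and expected outputs.
> SPEC = json.loads("""
> {
>   "producer": [
>     {"action": "create_table", "table": "default.t1",
>      "schema": [{"name": "id", "type": "long"},
>                 {"name": "v", "type": "string"}]},
>     {"action": "append_rows", "table": "default.t1",
>      "rows": [[1, "a"], [2, "b"]]}
>   ],
>   "consumer": [
>     {"action": "scan_table", "table": "default.t1",
>      "expected_row_count": 2}
>   ]
> }
> """)
>
> def run_actions(actions, handlers):
>     # Each client would supply one handler per action name; every new
>     # capability means another handler in every client, which is where
>     # the maintenance cost grows.
>     for step in actions:
>         handlers[step["action"]](step)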
>
> Feel free to leave comments in the doc and let me know what you think. I’m
> happy to explore and experiment with other ideas!
>
> Thanks,
> Leon
>
> On Fri, Jun 6, 2025 at 12:39 PM Kevin Liu <kevinjq...@apache.org> wrote:
>
> Hi Leon,
>
> Thanks for starting this thread! I think this is a great idea. Happy to
> support this in any way I can.
>
> Matt Topol and I have previously discussed cross-client testing regarding
> the iceberg-go and iceberg-python implementations. There is a class of
> bugs that can be caught this way. We somewhat do this today by copying
> over the integration test suite from iceberg-python to iceberg-go. I think
> even supporting a single verification step, through Spark, can provide us a
> lot of value in terms of testing for correctness.
>
> BTW,  Matt mentioned that the Arrow ecosystem has similar integration
> tests across its clients. I haven't been able to look further, but he
> pointed me to
> https://github.com/apache/arrow/tree/main/dev/archery/archery/integration
>
> Looking forward to this!
>
> Best,
> Kevin Liu
>
> On Thu, Jun 5, 2025 at 4:56 PM Leon Lin <lianglin....@gmail.com> wrote:
>
> Hello all,
>
> I would like to start a discussion on standardizing cross-client integration
> testing in the Iceberg projects. With all the active development across the
> different client implementations (Python, Rust, Go, etc.), it is important
> to make sure the implementations are interoperable with one another: tables
> created by one client should be readable and writable by another client
> without any incompatibilities, and this also helps detect divergence between
> implementations early.
>
> There is already some great work done in PyIceberg to verify compatibility
> with the Iceberg Java implementation through Spark, and we could easily
> extend this into a two-step verification. I've outlined the details in the
> doc attached below, but the idea (a rough sketch follows the list) is to:
>
>    - Write tables using PySpark and verify them with client-side read
>    tests.
>    - Write using the client and validate using PySpark scripts with
>    assertions.
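>
> A minimal sketch of step one, assuming a REST catalog named "it" is already
> configured for both Spark and PyIceberg; the catalog name, table name, and
> settings are placeholders:
>
> from pyspark.sql import SparkSession
> from pyiceberg.catalog import load_catalog
>
> # Write with Spark (i.e. the Java implementation under the hood).
> spark = SparkSession.builder.getOrCreate()
> spark.sql("CREATE TABLE it.default.cross_client (id BIGINT, v STRING) USING iceberg")
> spark.sql("INSERT INTO it.default.cross_client VALUES (1, 'a'), (2, 'b')")
>
> # Read back with the client under test (PyIceberg here) and assert.
> catalog = load_catalog("it")
> table = catalog.load_table("default.cross_client")
> assert table.scan().to_arrow().num_rows == 2
>
> Step two simply reverses the roles: the client writes, and a small PySpark
> script re-reads the table and runs the assertions.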
>
> While full matrix testing would be ideal to verify interoperability
> between any combination of clients, I haven't been able to find a clean
> way to do this without adding too much complexity or operational burden.
> I’d really appreciate any thoughts or ideas from the community, and I’m
> happy to contribute to moving this forward.
>
> Best,
> Leon Lin
>
> *References:*
> https://github.com/apache/iceberg-python/blob/main/tests/conftest.py#L2429
> Issue: https://github.com/apache/iceberg/issues/13229
>  Standardize Cross Client Integration Testing
> <https://drive.google.com/open?id=1vZfVzGZucsDc35uoRrbn5CKHGd0muzcY7FAwmKt-KNI>
>
> Xuanwo
>
> https://xuanwo.io/
>
>
