Hi Leon, thanks for raising this.
In Rust we also have a similar plan to run integration tests against the Rust and Java implementations: https://github.com/apache/iceberg-rust/pull/581

This approach is purely data-driven and, as Xuanwo mentioned, motivated by sqllogictest. That is to say, we define a set of SQL statements, and they can be executed by both Spark SQL and a Rust engine (DataFusion in this case). The downside of this method is that it requires integration with a SQL engine. Luckily, in Rust we have DataFusion, but I'm not sure the same is true for Python and Go.
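To make the data-driven idea a bit more concrete, here is a rough sketch of what the Spark SQL side of such a runner could look like in Python. The inline case list, the table names, and the `demo` catalog are all made up for illustration; a real runner would parse shared sqllogictest-style files and reuse the existing integration test setup.

from dataclasses import dataclass
from typing import List, Optional, Tuple

from pyspark.sql import SparkSession


@dataclass
class SqlCase:
    # One data-driven case: a SQL statement plus the rows we expect back.
    # `expected` is None for statements where we only assert success.
    sql: str
    expected: Optional[List[Tuple]] = None


# Hypothetical inline cases; in practice these would come from shared,
# sqllogictest-style files checked into a common location.
CASES = [
    SqlCase("CREATE TABLE IF NOT EXISTS demo.db.a (id INT, val INT) USING iceberg"),
    SqlCase("INSERT INTO demo.db.a VALUES (42, 84)"),
    SqlCase("SELECT * FROM demo.db.a ORDER BY id", expected=[(42, 84)]),
]


def run_cases(spark: SparkSession, cases: List[SqlCase]) -> None:
    for case in cases:
        result = spark.sql(case.sql)
        if case.expected is not None:
            rows = [tuple(r) for r in result.collect()]
            assert rows == case.expected, f"{case.sql}: got {rows}, expected {case.expected}"


if __name__ == "__main__":
    # Assumes a Spark session already configured with an Iceberg catalog named `demo`.
    spark = SparkSession.builder.appName("iceberg-cross-client-sql-cases").getOrCreate()
    run_cases(spark, CASES)

The same case file would then be fed to the DataFusion-based runner on the Rust side, so any divergence between the two engines shows up as a failing case.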
On Sat, Jun 7, 2025 at 9:47 AM Xuanwo <xua...@apache.org> wrote:

> Thank you Leon for starting this.
>
> It's very important for open formats like Iceberg to be interoperable across different implementations, and it's at the top of the list for iceberg-rust.
>
> My only concern is about the JSON spec. I'm wondering whether it would be a good idea for us to adopt the sqllogictest format: https://sqlite.org/sqllogictest/doc/trunk/about.wiki and https://github.com/risinglightdb/sqllogictest-rs.
>
> It was used by SQLite first and is now widely borrowed by many other SQL engines to build their test suites.
>
> It's something like:
>
> statement ok
> INSERT INTO a VALUES (42, 84);
>
> query II
> SELECT * FROM a;
> ----
> 42 84
>
> Basically, we have a way to define the SQL we are running, the result we are expecting, and a way to add hints.
>
> What do you think?
>
> On Sat, Jun 7, 2025, at 07:46, Leon Lin wrote:
>
> Hi Kevin,
>
> Thanks for bringing up the Arrow integration tests as a reference! I've looked into that setup as well. However, I found it difficult to apply the same model to Iceberg since Arrow and Iceberg are very different. Arrow tests are centered around in-memory serialization and deserialization using JSON-defined schema types, whereas Iceberg operates on persisted table state and requires more extensive infrastructure, like a catalog and storage, to run the integration tests.
>
> One of the alternative approaches listed in the doc uses a producer/consumer strategy similar to Arrow's: defining producer and consumer spec files in JSON that describe the actions clients should perform, with each client implementing a runner that parses and executes those actions. However, mapping out every Iceberg capability with its inputs and expected outputs becomes quite complex, and I'm concerned it won't scale well over time.
>
> Feel free to leave comments in the doc and let me know what you think. I'm happy to explore and experiment with other ideas!
>
> Thanks,
> Leon
>
> On Fri, Jun 6, 2025 at 12:39 PM Kevin Liu <kevinjq...@apache.org> wrote:
>
> Hi Leon,
>
> Thanks for starting this thread! I think this is a great idea. Happy to support this in any way I can.
>
> Matt Topol and I have previously discussed cross-client testing for the iceberg-go and iceberg-python implementations. There is a class of bugs that can be caught this way. We somewhat do this today by copying the integration test suite from iceberg-python over to iceberg-go. I think even supporting a single verification step, through Spark, can provide us a lot of value in terms of testing for correctness.
>
> BTW, Matt mentioned that the Arrow ecosystem has similar integration tests across its clients. I haven't been able to look further, but he pointed me to https://github.com/apache/arrow/tree/main/dev/archery/archery/integration
>
> Looking forward to this!
>
> Best,
> Kevin Liu
>
> On Thu, Jun 5, 2025 at 4:56 PM Leon Lin <lianglin....@gmail.com> wrote:
>
> Hello all,
>
> I would like to start a discussion on standardizing cross-client integration testing across the Iceberg projects. With all the active development on the different client implementations (Python, Rust, Go, etc.), it is important to make sure the implementations are interoperable with one another: tables created by one client should be readable and writable by another client without incompatibilities, and divergence between implementations should be detected early.
>
> There is already some great work in PyIceberg that verifies compatibility with the Iceberg Java implementation through Spark; we could easily extend this to do two-step verification. I've outlined the details in the doc linked below, but the idea is to:
>
> - Write tables using PySpark and verify them with client-side read tests.
> - Write using the client and validate using PySpark scripts with assertions.
>
> While full matrix testing across every combination of clients would be ideal, I haven't been able to find a clean way to do it without adding too much complexity or operational burden. I'd really appreciate any thoughts or ideas from the community, and I'm happy to contribute to moving this forward.
>
> Best,
> Leon Lin
>
> *References:*
> https://github.com/apache/iceberg-python/blob/main/tests/conftest.py#L2429
> Issue: https://github.com/apache/iceberg/issues/13229
> Standardize Cross Client Integration Testing
> <https://drive.google.com/open?id=1vZfVzGZucsDc35uoRrbn5CKHGd0muzcY7FAwmKt-KNI>
>
> Xuanwo
>
> https://xuanwo.io/
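P.S. To make the two-step verification Leon outlined above a bit more concrete, here is a rough sketch of what it could look like with PySpark and PyIceberg. The REST catalog URI, the `demo` catalog name, and the table names are assumptions for illustration; a real test would reuse the fixtures already in iceberg-python's conftest.py.

import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyspark.sql import SparkSession

# Assumed test environment: a REST catalog reachable by both engines, and a
# Spark session configured with the same catalog under the name `demo`.
catalog = load_catalog("demo", **{"type": "rest", "uri": "http://localhost:8181"})
spark = SparkSession.builder.appName("iceberg-cross-client-verification").getOrCreate()

# Step 1: write with Spark (the Java implementation), read back with PyIceberg.
spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.db")
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.t1 (id INT, val STRING) USING iceberg")
spark.sql("INSERT INTO demo.db.t1 VALUES (1, 'a'), (2, 'b')")

tbl = catalog.load_table("db.t1")
rows = sorted(tbl.scan().to_arrow().to_pylist(), key=lambda r: r["id"])
assert rows == [{"id": 1, "val": "a"}, {"id": 2, "val": "b"}]

# Step 2: write with PyIceberg, validate with a Spark-side assertion.
# int32 is used so the Arrow type matches Iceberg's 32-bit int column.
tbl.append(pa.table({"id": pa.array([3], type=pa.int32()), "val": ["c"]}))
count = spark.sql("SELECT COUNT(*) FROM demo.db.t1").collect()[0][0]
assert count == 3

The same pair of steps could then be repeated for each client (Rust, Go) against the same catalog, which is essentially the two-step verification the doc proposes.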