Thank you, Leon, for starting this. It's very important for open formats like Iceberg to be interoperable across different implementations, and this is high on iceberg-rust's priority list.

My only concern is about the JSON spec. I'm wondering whether it would be a good idea for us to adopt the sqllogictest format:

https://sqlite.org/sqllogictest/doc/trunk/about.wiki
https://github.com/risinglightdb/sqllogictest-rs

It was first used by SQLite and has since been borrowed by many other SQL engines to build their test suites. It looks something like this:

    statement ok
    INSERT INTO a VALUES (42, 84);

    query II
    SELECT * FROM a;
    ----
    42 84

Basically, it gives us a way to define the SQL we are running, the result we expect, and a way to provide hints.

What do you think?

On Sat, Jun 7, 2025, at 07:46, Leon Lin wrote:
> Hi Kevin,
>
> Thanks for bringing up the Arrow integration tests as a reference! I’ve
> looked into that setup as well. However, I found it difficult to apply the
> same model to Iceberg since Arrow and Iceberg are very different. Arrow tests
> are centered around in-memory serialization and deserialization using
> JSON-defined schema types, whereas Iceberg operates on persisted table state
> and requires more extensive infrastructure, like a catalog and storage, to
> run the integration tests.
>
> One of the alternative approaches listed in the doc has a similar producer /
> consumer strategy as Arrow, which is defining producer and consumer spec
> files in JSON that describe the actions clients should perform. Each client
> would then implement a runner that parses and executes those actions.
> However, mapping out every Iceberg capability with its inputs and expected
> outputs becomes quite complex, and I’m concerned it won’t scale well over
> time.
>
> Feel free to leave comments in the doc and let me know what you think. I’m
> happy to explore and experiment with other ideas!
>
> Thanks,
> Leon
>
> On Fri, Jun 6, 2025 at 12:39 PM Kevin Liu <kevinjq...@apache.org> wrote:
>> Hi Leon,
>>
>> Thanks for starting this thread! I think this is a great idea. Happy to
>> support this in any way I can.
>>
>> Matt Topol and I have previously discussed cross-client testing regarding
>> the iceberg-go and iceberg-python implementations. There are a class of bugs
>> that can be caught in this way. We somewhat do this today by copying over
>> the integration test suite from iceberg-python to iceberg-go. I think even
>> supporting a single verification step, through Spark, can provide us a lot
>> of value in terms of testing for correctness.
>>
>> BTW, Matt mentioned that the Arrow ecosystem has similar integration tests
>> across its clients. I haven't been able to look further, but he pointed me
>> to https://github.com/apache/arrow/tree/main/dev/archery/archery/integration
>>
>> Looking forward to this!
>>
>> Best,
>> Kevin Liu
>>
>> On Thu, Jun 5, 2025 at 4:56 PM Leon Lin <lianglin....@gmail.com> wrote:
>>> Hello all,
>>>
>>> I would like to start a discussion on standardizing the cross client
>>> integration testing in iceberg projects. With all the active development
>>> among the different client implementations (python, rust, go, etc), it will
>>> be important to make sure the implementations are interoperable between one
>>> another, making sure tables created by one client can be read and write by
>>> another client without any incompatibilities and help detect divergence
>>> between implementations early.
>>>
>>> There is already some great work done in PyIceberg to verify compatibility
>>> with iceberg java implementation with Spark, we could easily extend this to
>>> do two steps verification. I’ve outlined the details in the doc attached
>>> below. But the idea is to:
>>> • Write tables using PySpark and verify them with client-side read tests.
>>> • Write using the client and validate using PySpark scripts with
>>> assertions.
>>> While a full matrix testing would be ideal to verify interoperability
>>> between any combination of clients, but I haven’t able to find any clean
>>> way to do this without adding too much complexity or operational burden.
>>> I’d really appreciate any thoughts or ideas from the community, and I’m
>>> happy to contribute to moving this forward.
>>>
>>> Best,
>>> Leon Lin
>>>
>>> *References:*
>>> https://github.com/apache/iceberg-python/blob/main/tests/conftest.py#L2429
>>> Issue: https://github.com/apache/iceberg/issues/13229
>>> Standardize Cross Client Integration Testing
>>> <https://drive.google.com/open?id=1vZfVzGZucsDc35uoRrbn5CKHGd0muzcY7FAwmKt-KNI>

Xuanwo

https://xuanwo.io/