Hi Kevin,

Thanks for bringing up the Arrow integration tests as a reference! I’ve looked into that setup as well, but I found it difficult to apply the same model to Iceberg because the two projects test very different things: Arrow’s tests center on in-memory serialization and deserialization of JSON-defined schemas, whereas Iceberg operates on persisted table state and needs much more infrastructure, such as a catalog and storage, to run integration tests.
One of the alternative approaches listed in the doc follows a producer/consumer strategy similar to Arrow’s: producer and consumer spec files, written in JSON, describe the actions each client should perform, and each client implements a runner that parses and executes those actions. However, mapping out every Iceberg capability with its inputs and expected outputs gets complex quickly, and I’m concerned it won’t scale well over time. To make this more concrete, I’ve appended a rough sketch of such a runner, plus a sketch of the Spark round-trip check from the doc, at the very bottom of this mail below the quoted thread.

Feel free to leave comments in the doc and let me know what you think. I’m happy to explore and experiment with other ideas!

Thanks,
Leon

On Fri, Jun 6, 2025 at 12:39 PM Kevin Liu <kevinjq...@apache.org> wrote:

> Hi Leon,
>
> Thanks for starting this thread! I think this is a great idea. Happy to support this in any way I can.
>
> Matt Topol and I have previously discussed cross-client testing for the iceberg-go and iceberg-python implementations. There is a class of bugs that can be caught this way. We somewhat do this today by copying the integration test suite from iceberg-python over to iceberg-go. I think even supporting a single verification step, through Spark, can provide a lot of value in terms of testing for correctness.
>
> BTW, Matt mentioned that the Arrow ecosystem has similar integration tests across its clients. I haven't been able to look further, but he pointed me to
> https://github.com/apache/arrow/tree/main/dev/archery/archery/integration
>
> Looking forward to this!
>
> Best,
> Kevin Liu
>
> On Thu, Jun 5, 2025 at 4:56 PM Leon Lin <lianglin....@gmail.com> wrote:
>
>> Hello all,
>>
>> I would like to start a discussion on standardizing cross-client integration testing in the Iceberg projects. With all the active development across the different client implementations (Python, Rust, Go, etc.), it is important to make sure the implementations stay interoperable, i.e. that tables created by one client can be read and written by another without incompatibilities, and to detect divergence between implementations early.
>>
>> There is already some great work in PyIceberg that verifies compatibility with the Iceberg Java implementation through Spark; we could extend this into a two-step verification. I’ve outlined the details in the doc linked below, but the idea is to:
>>
>> - Write tables using PySpark and verify them with client-side read tests.
>> - Write using the client and validate using PySpark scripts with assertions.
>>
>> While full matrix testing would be ideal to verify interoperability between any combination of clients, I haven’t been able to find a clean way to do this without adding too much complexity or operational burden. I’d really appreciate any thoughts or ideas from the community, and I’m happy to contribute to moving this forward.
>>
>> Best,
>> Leon Lin
>>
>> *References:*
>> https://github.com/apache/iceberg-python/blob/main/tests/conftest.py#L2429
>> Issue: https://github.com/apache/iceberg/issues/13229
>> Standardize Cross Client Integration Testing
>> <https://drive.google.com/open?id=1vZfVzGZucsDc35uoRrbn5CKHGd0muzcY7FAwmKt-KNI>
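--

Appendix: two rough sketches to make the ideas above less abstract; neither is a concrete proposal.

First, the spec-driven runner. The spec layout, the action names, and the "integration" catalog name are all invented for illustration; only the PyIceberg and PyArrow calls are real, and they assume a catalog that is already configured locally.

    import json

    import pyarrow as pa
    from pyiceberg.catalog import load_catalog


    def run_spec(spec_path: str) -> None:
        """Execute the actions described in a JSON spec file, in order."""
        with open(spec_path) as f:
            spec = json.load(f)

        # Placeholder catalog name; assumes it is configured via
        # .pyiceberg.yaml or environment variables.
        catalog = load_catalog("integration")

        for action in spec["actions"]:
            if action["type"] == "append_rows":
                # Producer-style action: write rows into an existing table.
                table = catalog.load_table(action["table"])
                table.append(pa.Table.from_pylist(action["rows"]))
            elif action["type"] == "assert_row_count":
                # Consumer-style action: read the table back and check a property.
                table = catalog.load_table(action["table"])
                actual = table.scan().to_arrow().num_rows
                assert actual == action["expected"], f"{actual} != {action['expected']}"
            else:
                raise ValueError(f"unsupported action type: {action['type']}")

Every capability we want to cover becomes another branch like these, each with its own input and expected-output schema, which is where my concern about complexity and scaling comes from.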
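Second, one direction of the two-step Spark verification from the doc: write with PySpark, then read back with the client under test (PyIceberg here) and assert on the contents. The "demo" catalog, the table name, and the omitted Spark/Iceberg runtime configuration are placeholders; the real setup would follow the conftest.py linked in the references.

    from pyiceberg.catalog import load_catalog
    from pyspark.sql import SparkSession

    # The Iceberg Spark runtime jar and catalog settings are omitted here;
    # they would be configured the same way the existing PyIceberg
    # integration tests do it.
    spark = SparkSession.builder.appName("cross-client-check").getOrCreate()

    # Step 1: produce a table with Spark (the Java implementation).
    spark.sql("CREATE TABLE IF NOT EXISTS demo.db.t (id BIGINT, name STRING) USING iceberg")
    spark.sql("INSERT INTO demo.db.t VALUES (1, 'a'), (2, 'b')")

    # Step 2: consume the table with the client under test and check the rows.
    catalog = load_catalog("demo")  # placeholder catalog name
    table = catalog.load_table("db.t")
    rows = table.scan().to_arrow().sort_by("id").to_pylist()
    assert rows == [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

The other direction is the mirror image: write with the client under test, then run a small PySpark script with assertions against the same table.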