Thank you, Leon

> How complex would it be to integrate sqllogictest into non-Rust clients?
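To get a feel for the effort involved, here is a minimal, illustrative sketch of a pure-Python runner that understands only the two record types shown later in this thread (statement ok and query). It is only meant to show how small the format is, not to replace a real runner; execute_sql is a placeholder for whatever engine a client wires in:

    from typing import Callable, List

    def run_script(script: str, execute_sql: Callable[[str], List[List[str]]]) -> None:
        """Run a tiny subset of sqllogictest: 'statement ok' and 'query' records.

        execute_sql is a placeholder: it runs one SQL string against whatever
        engine the client integrates (Spark, DataFusion, DuckDB, ...) and
        returns result rows as lists of string values.
        """
        # Records are separated by blank lines.
        for record in (r.strip() for r in script.split("\n\n") if r.strip()):
            lines = record.splitlines()
            header, body = lines[0], lines[1:]
            if header.startswith("statement ok"):
                execute_sql("\n".join(body))
            elif header.startswith("query"):
                # The SQL comes before the '----' separator, the expected rows after it.
                sep = body.index("----")
                sql = "\n".join(body[:sep])
                expected = body[sep + 1:]
                actual = [" ".join(row) for row in execute_sql(sql)]
                assert actual == expected, f"{sql!r}: got {actual}, want {expected}"
            else:
                raise ValueError(f"unsupported record: {header}")

In practice a client would reuse an existing runner (which handles many more directives) instead of hand-rolling one; the point is only that the format itself is simple to parse.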
I checked a bit about existing sqllogictest integration projects:

- java: https://github.com/hydromatic/sql-logic-test
- go: https://github.com/alkemir/sqllogictest
- python: https://github.com/duckdb/duckdb-sqllogictest-python

Maybe we can export a python binding from sqllogictest-rs directly?

> Should we centralize the shared Docker images and test suites, or let each
> client repo manage their own setup with flexibility to evolve as the
> development progresses?

We are still in the very early stages. I prefer to let each client implement their own approach first, and then we can decide how to evolve and collaborate together.

On Tue, Jun 10, 2025, at 03:54, Leon Lin wrote:
> Hi Xuanwo, Renjie
>
> I think sqllogictests is a good replacement for the JSON spec, and I'm definitely not trying to recommend the JSON spec, as I think it would be too complex to execute.
>
> As Renjie pointed out, sqllogictests is only suitable when a SQL engine is supported, but right now not all of the client implementations have engine integration, e.g. iceberg-go. That said, sqllogictests could still be useful for provisioning tables and validating results via Iceberg Spark, and can be extended later once engine integration is added.
>
> A few concerns I have right now are:
> • How complex would it be to integrate sqllogictest into non-Rust clients?
> • Should we centralize the shared Docker images and test suites, or let each client repo manage their own setup with flexibility to evolve as the development progresses?
>
> Will do some experiments with sqllogictests offline, and happy to discuss more!
>
> Best,
> Leon
>
> On Mon, Jun 9, 2025 at 2:34 AM Renjie Liu <liurenjie2...@gmail.com> wrote:
>> Hi, Leon:
>>
>> Thanks for raising this.
>>
>> In Rust we also have a similar plan to do integration tests against the Rust and Java implementations: https://github.com/apache/iceberg-rust/pull/581
>>
>> This approach is purely data driven and, as Xuanwo mentioned, motivated by sqllogictests. That is to say, we will define a set of SQL statements, and they can be executed by both Spark SQL and a Rust engine (DataFusion in this case). The downside of this method is that it requires integration with a SQL engine. Luckily in Rust we have DataFusion, but I'm not sure if this is the case for Python and Go.
>>
>> On Sat, Jun 7, 2025 at 9:47 AM Xuanwo <xua...@apache.org> wrote:
>>> Thank you Leon for starting this.
>>>
>>> It's very important for open formats like Iceberg to be interoperable across different implementations. And it's at the top of the list for iceberg-rust.
>>>
>>> My only concern is about the JSON spec. I'm wondering if it's a good idea for us to adopt the sqllogictests format: https://sqlite.org/sqllogictest/doc/trunk/about.wiki and https://github.com/risinglightdb/sqllogictest-rs.
>>>
>>> It was first used by SQLite and is now widely borrowed by many other SQL engines to build their test suites.
>>>
>>> It's something like:
>>>
>>> statement ok
>>> INSERT INTO a VALUES (42, 84);
>>>
>>> query II
>>> SELECT * FROM a;
>>> ----
>>> 42 84
>>>
>>> Basically, we have a way to define the SQL we are using, the result we are expecting, and a way to hint.
>>>
>>> What do you think?
>>>
>>> On Sat, Jun 7, 2025, at 07:46, Leon Lin wrote:
>>>> Hi Kevin,
>>>>
>>>> Thanks for bringing up the Arrow integration tests as a reference! I've looked into that setup as well. However, I found it difficult to apply the same model to Iceberg since Arrow and Iceberg are very different.
>>>> Arrow tests are centered around in-memory serialization and deserialization using JSON-defined schema types, whereas Iceberg operates on persisted table state and requires more extensive infrastructure, like a catalog and storage, to run the integration tests.
>>>>
>>>> One of the alternative approaches listed in the doc has a similar producer/consumer strategy to Arrow's, which is defining producer and consumer spec files in JSON that describe the actions clients should perform. Each client would then implement a runner that parses and executes those actions. However, mapping out every Iceberg capability with its inputs and expected outputs becomes quite complex, and I'm concerned it won't scale well over time.
>>>>
>>>> Feel free to leave comments in the doc and let me know what you think. I'm happy to explore and experiment with other ideas!
>>>>
>>>> Thanks,
>>>> Leon
>>>>
>>>> On Fri, Jun 6, 2025 at 12:39 PM Kevin Liu <kevinjq...@apache.org> wrote:
>>>>> Hi Leon,
>>>>>
>>>>> Thanks for starting this thread! I think this is a great idea. Happy to support this in any way I can.
>>>>>
>>>>> Matt Topol and I have previously discussed cross-client testing regarding the iceberg-go and iceberg-python implementations. There is a class of bugs that can be caught in this way. We somewhat do this today by copying over the integration test suite from iceberg-python to iceberg-go. I think even supporting a single verification step, through Spark, can provide us a lot of value in terms of testing for correctness.
>>>>>
>>>>> BTW, Matt mentioned that the Arrow ecosystem has similar integration tests across its clients. I haven't been able to look further, but he pointed me to https://github.com/apache/arrow/tree/main/dev/archery/archery/integration
>>>>>
>>>>> Looking forward to this!
>>>>>
>>>>> Best,
>>>>> Kevin Liu
>>>>>
>>>>> On Thu, Jun 5, 2025 at 4:56 PM Leon Lin <lianglin....@gmail.com> wrote:
>>>>>> Hello all,
>>>>>>
>>>>>> I would like to start a discussion on standardizing cross-client integration testing in the Iceberg projects. With all the active development across the different client implementations (Python, Rust, Go, etc.), it is important to make sure the implementations are interoperable with one another, so that tables created by one client can be read and written by another client without any incompatibilities, and so that divergence between implementations is detected early.
>>>>>>
>>>>>> There is already some great work done in PyIceberg to verify compatibility with the Iceberg Java implementation via Spark, and we could easily extend this into a two-step verification. I've outlined the details in the doc attached below, but the idea is to:
>>>>>> • Write tables using PySpark and verify them with client-side read tests.
>>>>>> • Write using the client and validate using PySpark scripts with assertions.
>>>>>>
>>>>>> While full matrix testing would be ideal to verify interoperability between any combination of clients, I haven't been able to find a clean way to do this without adding too much complexity or operational burden. I'd really appreciate any thoughts or ideas from the community, and I'm happy to contribute to moving this forward.
>>>>>>
>>>>>> Best,
>>>>>> Leon Lin
>>>>>>
>>>>>> *References:*
>>>>>> https://github.com/apache/iceberg-python/blob/main/tests/conftest.py#L2429
>>>>>> Issue: https://github.com/apache/iceberg/issues/13229
>>>>>> Standardize Cross Client Integration Testing <https://drive.google.com/open?id=1vZfVzGZucsDc35uoRrbn5CKHGd0muzcY7FAwmKt-KNI>
>>>
>>> Xuanwo
>>>
>>> https://xuanwo.io/

Xuanwo

https://xuanwo.io/
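For reference, step one of the two-step verification Leon proposes above (write with PySpark, verify with a client-side read) could be prototyped roughly as in the sketch below, using PyIceberg as the reading client. This is only an illustration: the catalog name, REST URI, warehouse path, and table identifier are placeholders, the Spark session assumes the Iceberg Spark runtime package is on the classpath, and real setups would also need object-store credentials:

    from pyspark.sql import SparkSession
    from pyiceberg.catalog import load_catalog

    CATALOG_URI = "http://localhost:8181"   # placeholder REST catalog endpoint
    WAREHOUSE = "s3://warehouse/"           # placeholder warehouse location
    TABLE = "default.interop_smoke"         # placeholder table identifier

    # Step 1a: write a small table through Spark (the Java reference implementation).
    spark = (
        SparkSession.builder
        .config("spark.sql.catalog.rest", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.rest.type", "rest")
        .config("spark.sql.catalog.rest.uri", CATALOG_URI)
        .getOrCreate()
    )
    spark.sql(f"CREATE TABLE rest.{TABLE} (id BIGINT, name STRING) USING iceberg")
    spark.sql(f"INSERT INTO rest.{TABLE} VALUES (1, 'a'), (2, 'b')")

    # Step 1b: read the same table back with the client under test (PyIceberg here)
    # and assert on the contents.
    catalog = load_catalog("rest", **{"type": "rest", "uri": CATALOG_URI, "warehouse": WAREHOUSE})
    rows = catalog.load_table(TABLE).scan().to_arrow().sort_by("id").to_pylist()
    assert rows == [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

Step two would simply invert the roles: create and write the table with the client under test, then re-read it from PySpark and assert there.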