Thanks Leon for bringing this up.

The main reason that all the implementations test against Spark is that it
is well supported and has a nice SQL API to easily set up test cases. But
most importantly, it uses the Iceberg Java SDK underneath, which we
consider the reference implementation of Iceberg. The Java SDK is the
front-runner when it comes to functionality, so it makes sense to test
against that.

I would be hesitant to set up a test matrix across all the different
language implementations, because I think it would be quite a bit of work
to set up initially and also to maintain, while the added value compared to
testing against just Java is limited. What I do think would be valuable,
maybe as a first step, is packaging all the requirements (catalog,
object storage, maybe some fixtures) into a test framework that we can
easily replicate across the different languages, but also expose to
third-party implementations (duckdb, etc.).
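
Something along these lines, just as a rough sketch (the container images,
ports, and configuration properties below are placeholders, and the exact
testcontainers API may differ slightly):

import pytest
from testcontainers.core.container import DockerContainer

@pytest.fixture(scope="session")
def object_storage():
    # MinIO standing in for S3; credentials are test-only placeholders.
    minio = (
        DockerContainer("minio/minio:latest")
        .with_env("MINIO_ROOT_USER", "admin")
        .with_env("MINIO_ROOT_PASSWORD", "password")
        .with_command("server /data")
        .with_exposed_ports(9000)
    )
    minio.start()
    yield f"http://{minio.get_container_host_ip()}:{minio.get_exposed_port(9000)}"
    minio.stop()

@pytest.fixture(scope="session")
def rest_catalog(object_storage):
    # Any REST catalog image would do; the property names are illustrative only.
    catalog = (
        DockerContainer("tabulario/iceberg-rest:latest")
        .with_env("CATALOG_S3_ENDPOINT", object_storage)
        .with_exposed_ports(8181)
    )
    catalog.start()
    yield f"http://{catalog.get_container_host_ip()}:{catalog.get_exposed_port(8181)}"
    catalog.stop()

The same container definitions could then be mirrored in the Go and Rust
repositories (both languages have testcontainers ports) and documented for
third-party implementations.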

Kind regards,
Fokko


On Tue, Jun 10, 2025 at 06:20, Xuanwo <xua...@apache.org> wrote:

> Thank you, Leon
>
> > How complex would it be to integrate sqllogictest into non-Rust clients?
>
> I took a quick look at existing sqllogictest integration projects:
>
> - java: https://github.com/hydromatic/sql-logic-test
> - go: https://github.com/alkemir/sqllogictest
> - python: https://github.com/duckdb/duckdb-sqllogictest-python
>   - Maybe we can export a python binding from sqllogictest-rs directly?
>
> > Should we centralize the shared Docker images and test suites, or let
> each client repo manage their own setup with flexibility to evolve as the
> development progresses?
>
> We are still in the very early stages. I prefer to let each client
> implement their own approach first, and then we can decide how to evolve
> and collaborate together.
>
> On Tue, Jun 10, 2025, at 03:54, Leon Lin wrote:
>
> Hi Xuanwo, Renjie
>
> I think sqllogictests is a good replacement for the JSON spec, and I'm
> definitely not trying to recommend the JSON spec, as I think it would be
> too complex to execute.
>
> As Renjie pointed out, sqllogictests is only suitable when a SQL engine is
> available, but right now not all of the client implementations have engine
> integration, iceberg-go for example. That said, sqllogictests could still
> be useful for provisioning tables and validating results via Iceberg Spark,
> and it can be extended later once engine integration is added.
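>
> For example, a very rough sketch of what a Spark-backed runner could look
> like in Python (the parsing is heavily simplified and the function is just
> a placeholder, not an existing library):
>
> from pyspark.sql import SparkSession
>
> def run_slt(path: str, spark: SparkSession) -> None:
>     # Records in sqllogictest files are separated by blank lines.
>     with open(path) as f:
>         blocks = f.read().split("\n\n")
>     for block in blocks:
>         lines = [l for l in block.splitlines() if l and not l.startswith("#")]
>         if not lines:
>             continue
>         header, body = lines[0], lines[1:]
>         if header.startswith("statement ok"):
>             # Statements (DDL/DML) only need to succeed.
>             spark.sql("\n".join(body))
>         elif header.startswith("query"):
>             # Queries carry their expected output after the "----" separator.
>             sep = body.index("----")
>             sql, expected = "\n".join(body[:sep]), body[sep + 1:]
>             rows = spark.sql(sql).collect()
>             actual = [" ".join(str(v) for v in row) for row in rows]
>             assert actual == expected, f"result mismatch for: {sql}"
>
> The same loop could later be pointed at a native engine once a client
> grows one.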
>
> Few concerns I have right now are:
>
>    - How complex would it be to integrate sqllogictest into non-Rust
>    clients?
>    - Should we centralize the shared Docker images and test suites, or
>    let each client repo manage their own setup with flexibility to evolve as
>    the development progresses?
>
> Will do some experiments with the sqllogictests offline and happy to
> discuss more!
>
> Best,
> Leon
>
> On Mon, Jun 9, 2025 at 2:34 AM Renjie Liu <liurenjie2...@gmail.com> wrote:
>
> Hi, Leon:
>
> Thanks for raising this.
>
> In Rust we also have a similar plan to do integration tests between the
> Rust and Java implementations: https://github.com/apache/iceberg-rust/pull/581
>
> This approach is purely data driven and, as Xuanwo mentioned, motivated by
> sqllogictests. That is to say, we define a set of SQL statements, and they
> can be executed both by Spark SQL and by a Rust engine (DataFusion in this
> case). The downside of this method is that it requires integration with a
> SQL engine. Luckily, in Rust we have DataFusion, but I'm not sure whether
> that is the case for Python and Go.
>
> On Sat, Jun 7, 2025 at 9:47 AM Xuanwo <xua...@apache.org> wrote:
>
>
> Thank you Leon for starting this.
>
> It's very important for open formats like Iceberg to be interoperable
> across different implementations, and it's at the top of the list for
> iceberg-rust.
>
> My only concern is about the JSON spec. I'm wondering whether it would be
> a good idea for us to adopt the sqllogictest format:
> https://sqlite.org/sqllogictest/doc/trunk/about.wiki and
> https://github.com/risinglightdb/sqllogictest-rs.
>
> It was first used by SQLite and has since been widely adopted by many
> other SQL engines to build their test suites.
>
> It's something like:
>
> statement ok
> INSERT INTO a VALUES (42, 84);
>
> query II
> SELECT * FROM a;
> ----
> 42 84
>
> Basically, we have a way to define the SQL we are running, the result we
> expect, and a hint for the output types (the "II" above means two integer
> columns).
>
> What do you think?
>
>
> On Sat, Jun 7, 2025, at 07:46, Leon Lin wrote:
>
> Hi Kevin,
>
> Thanks for bringing up the Arrow integration tests as a reference! I’ve
> looked into that setup as well. However, I found it difficult to apply the
> same model to Iceberg since Arrow and Iceberg are very different. Arrow
> tests are centered around in-memory serialization and deserialization using
> JSON-defined schema types, whereas Iceberg operates on persisted table
> state and requires more extensive infrastructure, like a catalog and
> storage, to run the integration tests.
>
> One of the alternative approaches listed in the doc follows a producer /
> consumer strategy similar to Arrow's: defining producer and consumer spec
> files in JSON that describe the actions clients should perform. Each client
> would then implement a runner that parses and executes those actions.
> However, mapping out every Iceberg capability with its inputs and expected
> outputs becomes quite complex, and I’m concerned it won’t scale well over
> time.
>
> Feel free to leave comments in the doc and let me know what you think. I’m
> happy to explore and experiment with other ideas!
>
> Thanks,
> Leon
>
> On Fri, Jun 6, 2025 at 12:39 PM Kevin Liu <kevinjq...@apache.org> wrote:
>
> Hi Leon,
>
> Thanks for starting this thread! I think this is a great idea. Happy to
> support this in any way I can.
>
> Matt Topol and I have previously discussed cross-client testing for the
> iceberg-go and iceberg-python implementations. There is a class of bugs
> that can be caught this way. We somewhat do this today by copying over the
> integration test suite from iceberg-python to iceberg-go. I think even
> supporting a single verification step, through Spark, can provide a lot of
> value in terms of testing for correctness.
>
> BTW, Matt mentioned that the Arrow ecosystem has similar integration
> tests across its clients. I haven't been able to look further, but he
> pointed me to
> https://github.com/apache/arrow/tree/main/dev/archery/archery/integration
>
> Looking forward to this!
>
> Best,
> Kevin Liu
>
> On Thu, Jun 5, 2025 at 4:56 PM Leon Lin <lianglin....@gmail.com> wrote:
>
> Hello all,
>
> I would like to start a discussion on standardizing cross-client
> integration testing across the Iceberg projects. With all the active
> development among the different client implementations (python, rust, go,
> etc.), it is important to make sure the implementations are interoperable
> with one another, so that tables created by one client can be read and
> written by another client without any incompatibilities, and to help detect
> divergence between implementations early.
>
> There is already some great work done in PyIceberg to verify compatibility
> with the Iceberg Java implementation via Spark, and we could easily extend
> this into a two-step verification (a rough sketch of the validation step
> follows the list below). I've outlined the details in the doc attached
> below, but the idea is to:
>
>    - Write tables using PySpark and verify them with client-side read
>    tests.
>    - Write using the client and validate using PySpark scripts with
>    assertions.
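>
> For the second step, the PySpark validation script could be as simple as
> the following sketch (the catalog name, table name, and expected values are
> placeholders; the session also needs the usual Iceberg runtime jar on its
> classpath):
>
> from pyspark.sql import SparkSession
>
> spark = (
>     SparkSession.builder
>     .appName("cross-client-verify")
>     # Catalog wiring is environment-specific; a REST catalog might look like:
>     .config("spark.sql.catalog.integration", "org.apache.iceberg.spark.SparkCatalog")
>     .config("spark.sql.catalog.integration.type", "rest")
>     .config("spark.sql.catalog.integration.uri", "http://localhost:8181")
>     .getOrCreate()
> )
>
> # Table previously written by a non-Java client (PyIceberg, iceberg-go, ...).
> df = spark.table("integration.default.cross_client_test")
>
> # Expected values mirror whatever the client-side writer fixture produced.
> assert df.count() == 3
> assert [f.name for f in df.schema.fields] == ["id", "symbol", "price"]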
>
> While full matrix testing would be ideal to verify interoperability between
> any combination of clients, I haven't been able to find a clean way to do
> this without adding too much complexity or operational burden. I'd really
> appreciate any thoughts or ideas from the community, and I'm happy to
> contribute to moving this forward.
>
> Best,
> Leon Lin
>
> *References:*
> https://github.com/apache/iceberg-python/blob/main/tests/conftest.py#L2429
> Issue: https://github.com/apache/iceberg/issues/13229
>  Standardize Cross Client Integration Testing
> <https://drive.google.com/open?id=1vZfVzGZucsDc35uoRrbn5CKHGd0muzcY7FAwmKt-KNI>
>
> Xuanwo
>
> https://xuanwo.io/
>
>
