This is a great idea. There is a previous discussion of a similar idea
here [1].

Specifically, I think Alkis's sketch of the "carpenter" program would have
caught this situation.

In my opinion, improving interoperability testing like this is a key step
towards being able to reliably evolve the Parquet standard itself.
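
For illustration only, here is a minimal sketch of what a narrowly scoped
"every reader tries every fixture file" check could look like. All names in
it (check_readers, magic_only_reader, the reader registry) are hypothetical
and do not come from any Parquet implementation or from the linked
discussion. The stand-in reader only checks the PAR1 magic bytes at the start
and end of the file, which is far weaker than a real read.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical names throughout: this is an illustrative harness, not an
# existing tool in Arrow, parquet-rs, or parquet-format.
Reader = Callable[[Path], object]


def check_readers(
    fixtures: Iterable[Path], readers: Dict[str, Reader]
) -> List[Tuple[str, str, str]]:
    """Run every reader on every fixture; return (reader, file, error) failures."""
    failures = []
    for path in sorted(fixtures):
        for name, read in sorted(readers.items()):
            try:
                read(path)
            except Exception as exc:  # any read error counts as an interop failure
                failures.append((name, path.name, f"{type(exc).__name__}: {exc}"))
    return failures


def magic_only_reader(path: Path) -> bytes:
    """Stand-in reader: verify only the 4-byte PAR1 magic at both ends of the file."""
    data = path.read_bytes()
    if len(data) < 8 or data[:4] != b"PAR1" or data[-4:] != b"PAR1":
        raise ValueError("missing PAR1 magic")
    return data
```

In a real CI job the registry would hold actual readers, for example
pyarrow.parquet.read_table for the C++ implementation, and the fixtures would
be files written by the other implementations (such as parquet-rs >= 53.0).
A non-empty failure list would then fail the job.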

Andrew

[1]: https://github.com/apache/parquet-format/issues/441

On Wed, Jan 29, 2025 at 3:49 PM Bryce Mecum <bryceme...@gmail.com> wrote:

> Hello Parquet community,
>
> The Arrow project recently fixed a bug [1] in its C++ Parquet
> implementation that was causing compliant Parquet files written by
> recent versions of parquet-rs [2] to be unreadable by the C++
> implementation due to differences in the implementation of Parquet’s
> SizeStatistics feature [3]. This also affected the Arrow libraries
> that bind to the C++ implementation, including PyArrow. The C++
> implementation has been patched [4] and a new Arrow release (19.0.1)
> is in the works.
>
> Given this, I wanted to start a discussion about what kind of
> cross-implementation testing facilities may already exist in any of
> the Parquet implementations and what kind of testing facilities might
> be created to help catch situations like these.
>
> I’ll start off with my thoughts and encourage people to jump in:
>
> 1. The specific integration test that could have been run to catch
> this bug would be a test that used the Arrow 19.0.0 release candidate
> to read any Parquet file written by parquet-rs >=53.0. This would have
> halted the release process. Should the Arrow project just add a CI job
> like this and move on?
> 2. Testing every combination of Parquet format versions, feature
> toggles, implementations, and implementation versions is clearly too
> large a problem to solve, so it might be best to start off with a
> narrow scope.
>
> Please note that I've cross-posted this to the Apache Arrow mailing
> list. Please reply to the Apache Parquet post. I’m looking forward to
> hearing others’ thoughts and ideas.
>
> Thanks,
> Bryce
>
> [1] https://github.com/apache/arrow/issues/45283
> [2] https://github.com/apache/arrow-rs/tree/main/parquet
> [3] https://github.com/apache/parquet-format/pull/197
> [4] https://github.com/apache/arrow/pull/45285
>
