This is a great idea. There is a previous discussion about a similar idea here [1].

Specifically, I think Alkis's sketch of the "carpenter" program would have caught this situation.

In my opinion, improving interoperability testing like this is a key step towards being able to reliably evolve the Parquet standard itself.

Andrew

[1]: https://github.com/apache/parquet-format/issues/441

On Wed, Jan 29, 2025 at 3:49 PM Bryce Mecum <bryceme...@gmail.com> wrote:

> Hello Parquet community,
>
> The Arrow project recently fixed a bug [1] in its C++ Parquet
> implementation that was causing compliant Parquet files written by
> recent versions of parquet-rs [2] to be unreadable by the C++
> implementation due to differences in the implementation of Parquet’s
> SizeStatistics feature [3]. This also affected the Arrow libraries
> that bind to the C++ implementation, including PyArrow. The C++
> implementation has been patched [4] and a new Arrow release (19.0.1)
> is in the works.
>
> Given this, I wanted to start a discussion about what kind of
> cross-implementation testing facilities may already exist in any of
> the Parquet implementations and what kind of testing facilities might
> be created to help catch situations like these.
>
> I’ll start off with my thoughts and encourage people to jump in:
>
> 1. The specific integration test that could have been run to catch
> this bug would be a test that used the Arrow 19.0.0 release candidate
> to read any Parquet file written by parquet-rs >=53.0. This would have
> halted the release process. Should the Arrow project just add a CI job
> like this and move on?
> 2. Testing every combination of Parquet format versions, feature
> toggles, implementations, and implementation versions is clearly too
> large a problem to solve so it might be best to start off with a
> narrow scope.
>
> Please note that I've cross-posted this to the Apache Arrow mailing
> list. Please reply to the Apache Parquet post. I’m looking forward to
> hearing others’ thoughts and ideas.
>
> Thanks,
> Bryce
>
> [1] https://github.com/apache/arrow/issues/45283
> [2] https://github.com/apache/arrow-rs/tree/main/parquet
> [3] https://github.com/apache/parquet-format/pull/197
> [4] https://github.com/apache/arrow/pull/45285