Hello Parquet community,

The Arrow project recently fixed a bug [1] in its C++ Parquet implementation that was causing compliant Parquet files written by recent versions of parquet-rs [2] to be unreadable by the C++ implementation, due to differences in how the two implementations handle Parquet’s SizeStatistics feature [3]. This also affected the Arrow libraries that bind to the C++ implementation, including PyArrow. The C++ implementation has been patched [4] and a new Arrow release (19.0.1) is in the works.
Given this, I wanted to start a discussion about what cross-implementation testing facilities already exist in any of the Parquet implementations, and what facilities could be created to help catch situations like this. I’ll start off with my thoughts and encourage people to jump in:

1. The specific integration test that would have caught this bug is one that used the Arrow 19.0.0 release candidate to read any Parquet file written by parquet-rs >= 53.0; a failure would have halted the release process. Should the Arrow project just add a CI job like this and move on? (A rough sketch of such a check is at the end of this message.)

2. Testing every combination of Parquet format versions, feature toggles, implementations, and implementation versions is clearly too large a problem to solve all at once, so it might be best to start off with a narrow scope.

Please note that I’ve cross-posted this to the Apache Arrow mailing list; please reply on the Apache Parquet thread.

I’m looking forward to hearing others’ thoughts and ideas.

Thanks,
Bryce

[1] https://github.com/apache/arrow/issues/45283
[2] https://github.com/apache/arrow-rs/tree/main/parquet
[3] https://github.com/apache/parquet-format/pull/197
[4] https://github.com/apache/arrow/pull/45285
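P.S. For concreteness, here is a minimal sketch of the kind of CI check described in point 1, using PyArrow since it binds the C++ reader that was affected. The corpus path is hypothetical, and the sketch assumes the parquet-rs-written files are produced in an earlier CI step or pulled from a shared test-data location; this is meant as a starting point for discussion, not a finished design:

import glob
import sys

import pyarrow.parquet as pq

# Hypothetical location of a corpus of Parquet files written by
# parquet-rs >= 53.0 (e.g. generated in an earlier CI step or fetched
# from a shared test-data repository).
CORPUS_GLOB = "parquet-rs-corpus/*.parquet"

failures = []
for path in sorted(glob.glob(CORPUS_GLOB)):
    try:
        # read_table exercises the C++ Parquet reader that PyArrow binds to.
        table = pq.read_table(path)
        # Run Arrow's own integrity checks on the decoded data.
        table.validate(full=True)
    except Exception as exc:
        failures.append((path, exc))

for path, exc in failures:
    print(f"FAILED: {path}: {exc}", file=sys.stderr)

# A non-zero exit fails the CI job, which would halt the release process.
sys.exit(1 if failures else 0)

The same script could later be pointed at files produced by other writers as the scope grows.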