Hello Parquet community,

The Arrow project recently fixed a bug [1] in its C++ Parquet
implementation that caused compliant Parquet files written by recent
versions of parquet-rs [2] to be unreadable by the C++ implementation,
due to a divergence in how the two implementations handle Parquet’s
SizeStatistics feature [3]. This also affected the Arrow libraries
that bind to the C++ implementation, including PyArrow. The C++
implementation has been patched [4] and a new Arrow release (19.0.1)
is in the works.
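
For concreteness, the failure looked roughly like the following
minimal sketch (this is not the exact reproducer from [1]; the file
name is a placeholder for any file written by parquet-rs >= 53.0 that
carries size statistics in its metadata):

    import pyarrow.parquet as pq

    # Placeholder: any Parquet file written by parquet-rs >= 53.0
    # with SizeStatistics present in its metadata.
    path = "written-by-parquet-rs-53.parquet"

    # On affected builds (e.g. Arrow 19.0.0 before the patch [4])
    # this read raises an error; on patched builds it succeeds.
    table = pq.read_table(path)
    print(table.num_rows)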

Given this, I wanted to start a discussion about what
cross-implementation testing facilities may already exist in any of
the Parquet implementations, and what new facilities might be created
to help catch situations like this one.

I’ll start off with my thoughts and encourage people to jump in:

1. The specific integration test that could have caught this bug is
one that used the Arrow 19.0.0 release candidate to read Parquet
files written by parquet-rs >= 53.0; a failure there would have
halted the release process. Should the Arrow project just add a CI
job like this and move on? (A sketch of such a test follows after
this list.)
2. Testing every combination of Parquet format versions, feature
toggles, implementations, and implementation versions is clearly too
large a problem to solve, so it might be best to start off with a
narrow scope.
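
To make the test in item 1 concrete, here is a minimal sketch of what
such a CI job might run. It assumes a hypothetical corpus directory,
interop-corpus/parquet-rs, populated by a separate step that writes
files with parquet-rs >= 53.0; the directory name and test layout are
illustrative, not an existing Arrow CI fixture:

    import glob

    import pytest
    import pyarrow.parquet as pq

    # Hypothetical corpus of files written ahead of time by
    # parquet-rs >= 53.0 (e.g. in an earlier CI step).
    CORPUS = sorted(glob.glob("interop-corpus/parquet-rs/*.parquet"))

    @pytest.mark.parametrize("path", CORPUS)
    def test_cpp_reader_handles_rust_written_file(path):
        # PyArrow binds to the C++ Parquet implementation, so a
        # successful read here exercises the C++ reader against
        # the Rust writer.
        table = pq.read_table(path)
        table.validate(full=True)  # deep-check the decoded data

Running something like this against a small, pinned corpus during
release candidate validation would have flagged the 19.0.0 read
failure before the release went out.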

Note that I've cross-posted this to the Apache Arrow mailing list,
but please reply to the Apache Parquet post. I’m looking forward to
hearing others’ thoughts and ideas.

Thanks,
Bryce

[1] https://github.com/apache/arrow/issues/45283
[2] https://github.com/apache/arrow-rs/tree/main/parquet
[3] https://github.com/apache/parquet-format/pull/197
[4] https://github.com/apache/arrow/pull/45285
