hi folks,

More than a year has passed since the Parquet Rust project joined
forces with Apache Arrow.

I have raised this issue in the past, but the project still cannot
write files originating from Arrow records. In my opinion, this creates
sustainability and scalability problems for the ongoing development
of the project. In particular, testing has to rely on binary files
that are either pre-generated or generated by another library.
This makes everything harder (testing, feature development,
benchmarking, and so forth) and increases the chance of failing to
cover edge cases.

Looking back on over 4 years of C++ Parquet development, I doubt we
could have gotten the project to where it is now without a writer
implementation moving together with the reader. For example, we've had
to deal with issues arising in very large files (e.g. BinaryArray
overflows), and in many cases it would not be practical to store a
pre-generated file exhibiting some of these problems.

Of course, in a volunteer-driven effort no one can be forced to
implement a writer. But since a good amount of time has passed, I feel
I need to raise awareness of the issue again to see if an effort might
be mobilized, as this also impacts people who might come to rely on
this code in production. Given the importance of Parquet today,
having a rock-solid Parquet library will likely become essential to
sustained adoption of the Arrow Rust project (it has certainly been
very important for C++/Python/R adoption).

best,
Wes
