Hi folks,

More than a year has passed since the Parquet Rust project joined forces with Apache Arrow.
I raised this issue in the past, but the project still cannot write files originating from Arrow records. In my opinion, this creates sustainability and development-scalability problems for the ongoing development of the project. In particular, testing has to rely on binary files that are either pre-generated or generated by another library. This makes everything harder (testing, feature development, benchmarking, and so forth) and increases the chance of failing to cover edge cases.

Looking back on over 4 years of C++ Parquet development, I doubt we could have gotten the project to where it is now without a writer implementation moving together with the reader. For example, we've had to deal with issues arising in very large files (e.g. BinaryArray overflows), and in many cases it would not be practical to store a pre-generated file exhibiting some of these problems.

Of course, this is a volunteer-driven effort and no one can be forced to implement a writer, but since a good amount of time has passed I feel I need to raise awareness of the issue again to see if an effort might be mobilized. This also impacts people who might come to rely on this code in production.

Given the importance of Parquet in current times, having a rock-solid Parquet library will likely become essential to sustained adoption of the Arrow Rust project (it has certainly been very important for C++/Python/R adoption).

best,
Wes