To get the ball rolling, here is a quick and dirty PR adding a test that writes an Arrow batch to a Parquet file.
https://github.com/apache/arrow/pull/6785

I'll keep iterating on this but will gladly accept help or hand this off to someone better qualified.

On Tue, Mar 31, 2020 at 8:15 AM Wes McKinney <wesmck...@gmail.com> wrote:

> Here was the last discussion about this 6 months ago
>
> https://github.com/apache/parquet-testing/pull/9
>
> I saw another PR come through like this so that's why I'm bringing it up
> again
>
> https://github.com/apache/parquet-testing/pull/11
>
> On Tue, Mar 31, 2020 at 9:08 AM Andy Grove <andygrov...@gmail.com> wrote:
> >
> > Hi Wes,
> >
> > I agree that this is important. I have been looking at the Parquet
> > implementation this morning and I do see code for writing files, along
> > with roundtrip tests. As you said, it isn't writing from Arrow types yet,
> > but I would hope that this would be relatively simple to add. I don't
> > know how complete the Parquet writer code is. It would be useful to get
> > some guidance from the main authors of this crate.
> >
> > I'd be happy to create some JIRAs and try and help organize an effort
> > here for the next release.
> >
> > Andy.
> >
> > On Mon, Mar 30, 2020 at 6:07 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > hi folks,
> > >
> > > More than a year has passed since the Parquet Rust project joined
> > > forces with Apache Arrow.
> > >
> > > I raised this issue in the past, but the project still cannot write
> > > files originating from Arrow records. In my opinion, this creates
> > > sustainability / development scalability problems for the ongoing
> > > development of the project. In particular, testing has to rely on
> > > binary files either pre-generated or generated by another library.
> > > This makes everything harder (testing, feature development,
> > > benchmarking, and so forth) and increases the chance of failing to
> > > cover edge cases.
> > >
> > > Looking back on over 4 years of C++ Parquet development, I doubt we
> > > could have gotten the project to where it is now without a writer
> > > implementation moving together with the reader. For example, we've had
> > > to deal with issues arising in very large files (e.g. BinaryArray
> > > overflows), and in many cases it would not be practical to store a
> > > pre-generated file exhibiting some of these problems.
> > >
> > > Of course, as a volunteer driven effort no one can be forced to
> > > implement a writer, but since a good amount of time has passed I feel
> > > I need to raise awareness of the issue again to see if an effort might
> > > be mobilized, since this also impacts people who might come to rely on
> > > this code in production. Given the importance of Parquet in current
> > > times, having a rock solid Parquet library will likely become
> > > essential to sustained adoption of the Arrow Rust project (it has
> > > certainly been very important for C++/Python/R adoption).
> > >
> > > best,
> > > Wes